# Natural language analysis

* Analysis of short texts and their classification into language families

## Data exploration

Dataset – Pater Noster prayers in various languages.

***

#### ❓ Task 1

  * read the dataset from file *paternoster.csv* into a pandas DataFrame
  * show a sample of the dataset
  * print the number of columns and rows

***

In [1]:
import pandas as pd
d = pd.read_csv('paternoster.csv', sep=';')
d

Unnamed: 0,lang,class,text
0,Czech,S,"Otče náš, jenž jsi na nebesích, Posvěť se jmén..."
1,Polish,S,"Ojcze nasz, któryś jest w niebie, święć się im..."
2,Deutsch,G,"Vater unser im Himmel, geheiligt werde dein Na..."
3,English,G,"Our Father in heaven, hallowed be your name, y..."
4,French,R,"Notre Père, toi qui es dans les cieux, que tu ..."
5,Dutch,G,"Onze Vader in de hemel, uw naam worde geheilig..."
6,Latin,R,"Pater noster, qui es in caelis, sanctificetur ..."
7,Italian,R,"Padre nostro che sei nei cieli, sia santificat..."
8,Spanish,R,"Padre nuestro que estás en el Cielo, santifica..."
9,Slovak,S,"Otče náš, ktorý si na nebesách, posväť sa meno..."


In [2]:
print(f' cols: {d.shape[1]}, rows: {d.shape[0]}')

 cols: 3, rows: 13


#### Language classes


* **S** – Slavic languages
* **R** – Romance languages
* **G** – Germanic languages
* **F** – Finnish

***

#### ❓ Task 2

  * calculate the number of languages in every class (hint: groupby or value_counts)

***

In [3]:
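# two ways to count the languages per class;
# only the last expression is displayed as the cell output
d.groupby('class')['lang'].count()
d['class'].value_counts()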

S    4
R    4
G    4
F    1
Name: class, dtype: int64

## Text preprocessing

***

#### ❓ Task 3

  * create column *proc* with the text from *text* after
    * lowercasing
    * removing the diacritics
    * replacing any punctuation with a single space
    * trimming leading and trailing spaces

***

In [4]:
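import re
import unicodedata

def remove_diac(text: str) -> str:
    # decompose accented letters (NFD) and drop the combining marks (category 'Mn')
    return ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')

def preprocess(text: str) -> str:
    text = remove_diac(text.lower())
    # replace every run of punctuation or whitespace with a single space
    text = re.sub(r"[\W_]+", ' ', text)
    # trim last, so trailing punctuation does not leave a trailing space
    return text.strip()

d['proc'] = d['text'].map(preprocess)
d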

Unnamed: 0,lang,class,text,proc
0,Czech,S,"Otče náš, jenž jsi na nebesích, Posvěť se jmén...",otce nas jenz jsi na nebesich posvet se jmeno ...
1,Polish,S,"Ojcze nasz, któryś jest w niebie, święć się im...",ojcze nasz ktorys jest w niebie swiec sie imie...
2,Deutsch,G,"Vater unser im Himmel, geheiligt werde dein Na...",vater unser im himmel geheiligt werde dein nam...
3,English,G,"Our Father in heaven, hallowed be your name, y...",our father in heaven hallowed be your name you...
4,French,R,"Notre Père, toi qui es dans les cieux, que tu ...",notre pere toi qui es dans les cieux que tu so...
5,Dutch,G,"Onze Vader in de hemel, uw naam worde geheilig...",onze vader in de hemel uw naam worde geheiligd...
6,Latin,R,"Pater noster, qui es in caelis, sanctificetur ...",pater noster qui es in caelis sanctificetur no...
7,Italian,R,"Padre nostro che sei nei cieli, sia santificat...",padre nostro che sei nei cieli sia santificato...
8,Spanish,R,"Padre nuestro que estás en el Cielo, santifica...",padre nuestro que estas en el cielo santificad...
9,Slovak,S,"Otče náš, ktorý si na nebesách, posväť sa meno...",otce nas ktory si na nebesach posvat sa meno t...
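
The diacritics step relies on Unicode decomposition, which can be sketched on a few isolated strings. The helper name `strip_marks` and the sample strings are illustrative only. Letters that have no canonical decomposition, such as the Polish `ł`, pass through unchanged, so the `proc` column is not guaranteed to be pure ASCII.

```python
import unicodedata

def strip_marks(text):
    # NFD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # combining marks belong to the Unicode category 'Mn' (Mark, nonspacing)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

czech = strip_marks("otče náš")
polish = strip_marks("któryś")
stroke = strip_marks("ł")  # no canonical decomposition, stays as it is

print(czech, polish, stroke)
```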


## Vectorization

* transforming the plain text into a Cartesian vector space
  * dimensions: symbols – words or ngrams
  * values: frequency of symbol in text
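
As a minimal sketch of this idea, the two toy sentences below (made up for illustration, not taken from the dataset) are turned into a count matrix with one row per text and one column per word. The names `toy_texts`, `toy_vec` and `toy_X` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two toy documents, not part of paternoster.csv
toy_texts = ["our father our name", "pater noster"]

# word-level vectorizer: every distinct word becomes one dimension
toy_vec = CountVectorizer(analyzer="word")
toy_X = toy_vec.fit_transform(toy_texts)

# vocabulary_ maps each word to its column index; sort words by that index
toy_vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
toy_counts = toy_X.toarray()

print(toy_vocab)   # ['father', 'name', 'noster', 'our', 'pater']
print(toy_counts)  # row 0: [1 1 0 2 0], row 1: [0 0 1 0 1]
```

Each row is the vector of one text. The word "our" occurs twice in the first sentence, so its coordinate in the first row is 2.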

***

#### ❓ Task 4

  * create object *vec* of class CountVectorizer
    * set the maximum number of features to 1500
    * fit the object with texts from column *proc*
    * print feature names
  * create matrix X with transformed values of *proc*
  * answer the questions:
    * What is the most common word in the English prayer?
    * What is the most common bigram in the Czech prayer?

***



In [5]:
from sklearn.feature_extraction.text import CountVectorizer

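# analyzer='char' with ngram_range=(2,2) counts character bigrams (spaces included)
vec = CountVectorizer(max_features=1500, analyzer='char', ngram_range=(2,2))

vec.fit(d['proc'].values)

# vec.get_feature_names()  # use get_feature_names_out() in scikit-learn >= 1.0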

CountVectorizer(analyzer='char', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=1500, min_df=1,
                ngram_range=(2, 2), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [6]:
X = vec.transform(d['proc'].values)
y = d['class'].values
# feature counts for row 3 (English) and row 0 (Czech), most frequent first
#pd.DataFrame({'ftr':vec.get_feature_names(), 'nums':X.todense()[3].A1}).sort_values('nums', ascending=False)
#pd.DataFrame({'ftr':vec.get_feature_names(), 'nums':X.todense()[0].A1}).sort_values('nums', ascending=False)
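
The two questions of Task 4 need different analyzers: a character-bigram vectorizer cannot report the most common word. One possible way to answer both is sketched below on shortened toy texts. The helper `top_features` and the variables `word_vec` and `char_vec` are assumptions made for this illustration, not part of the notebook. On the real data the same helper could be applied to the values of `d['proc']`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def top_features(vectorizer, text, n=3):
    """Return the n most frequent features of one text as (feature, count) pairs."""
    counts = vectorizer.transform([text]).toarray()[0]
    # vocabulary_ maps feature -> column index; order the names by column
    names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    order = np.argsort(-counts, kind="stable")[:n]
    return [(names[i], int(counts[i])) for i in order]

# shortened toy texts standing in for the preprocessed prayers
toy_texts = ["our father in heaven hallowed be your name your kingdom come",
             "otce nas jenz jsi na nebesich"]

word_vec = CountVectorizer(analyzer="word").fit(toy_texts)
char_vec = CountVectorizer(analyzer="char", ngram_range=(2, 2)).fit(toy_texts)

top_words = top_features(word_vec, toy_texts[0], n=1)
top_bigrams = top_features(char_vec, toy_texts[1], n=1)

print(top_words)
print(top_bigrams)
```

Character bigrams include the space, so pairs that span a word boundary compete with pairs inside words. Whether to keep them depends on how the features are used later.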