# Natural language analysis

* Analysis of short texts and their classification into language families

## Data exploration

Dataset – Pater Noster prayers in various languages.

***

#### ❓ Task 1

  * read dataset from file *paternoster.csv* into pandas data frame *d*
  * show dataset sample
  * print number of columns and rows

***

In [257]:
## ...
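One possible way to approach Task 1, shown as a minimal sketch. The real file is replaced here by a tiny in-memory stand-in: the column names `text` and `class` appear later in this notebook, while the `language` column and the sample rows are assumptions made purely for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for paternoster.csv; the real file is not reproduced here.
csv_data = io.StringIO(
    "language,class,text\n"
    "Czech,S,Otče náš\n"
    "English,G,Our Father\n"
    "German,G,Vater unser\n"
)

# With the real file this would be: d = pd.read_csv('paternoster.csv')
d = pd.read_csv(csv_data)

print(d.head())                      # dataset sample
n_rows, n_cols = d.shape             # shape is (rows, columns)
print(f"rows: {n_rows}, columns: {n_cols}")
```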

#### Language classes


* **S** – Slavic languages
* **R** – Romance languages
* **G** – Germanic languages
* **F** – Finnish

***

#### ❓ Task 2

  * calculate the number of languages in every class (hint: groupby or value_counts)

***

In [258]:
## ...
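Both approaches from the hint can be sketched as follows. The labels below are toy values, not the real dataset; in the notebook the same calls would be applied to the `class` column of `d`.

```python
import pandas as pd

# Toy labels only; the real notebook uses the `class` column of `d`.
d_toy = pd.DataFrame({"class": ["S", "S", "G", "R", "G", "S", "F"]})

counts = d_toy["class"].value_counts()        # ordered by frequency
same_counts = d_toy.groupby("class").size()   # ordered by class label

print(counts)
print(same_counts)
```

Both calls give the same numbers; they differ only in the ordering of the result.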

## Text preprocessing

***

#### ❓ Task 3

  * create column *proc* containing the text from column *text* after
    * converting to lower case
    * removing diacritics
    * replacing any punctuation with a single space
    * trimming leading and trailing spaces
  * wrap all of these steps in a numpy-vectorized function *preprocess_np*

***

In [259]:
## ...

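import unicodedata
import numpy as np


def remove_diac(text: str) -> str:
    # Decompose accented characters (NFD) and drop the combining marks ('Mn')
    return ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')


# `preprocess` is the single-string function written for Task 3
preprocess_np = np.vectorize(preprocess)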
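A minimal sketch of what a single-string `preprocess` could look like, using only the standard library. Treating every character in a Unicode `P*` category as punctuation, and collapsing runs of whitespace at the end, are choices made for this sketch rather than requirements of the task.

```python
import unicodedata


def remove_diac(text: str) -> str:
    """Strip combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def preprocess(text: str) -> str:
    """Lower-case, remove diacritics, turn punctuation into spaces, trim."""
    text = remove_diac(text.lower())
    # Replace every punctuation character (Unicode categories P*) with a space.
    text = "".join(" " if unicodedata.category(c).startswith("P") else c
                   for c in text)
    # Assumption: also collapse whitespace runs, which trims both ends as well.
    return " ".join(text.split())


print(preprocess("Otče náš, jenž jsi na nebesích!"))
```

A function of this shape works on one string at a time, which is why the notebook wraps it with `np.vectorize` before applying it to a whole column.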


## Vectorization

* transforming plain text into a Cartesian vector space
  * dimensions: symbols – words or n-grams
  * values: frequency of the symbol in the text

***

#### ❓ Task 4

  * create object *vec* of class CountVectorizer
    * set the maximum feature count to 1500
    * fit the object with texts from column *proc*
    * print feature names
  * create matrix X with transformed values of *proc*
  * answer the questions:
    * What is the most common word in the English prayer?
    * What is the most common bigram in the Czech prayer?

***



In [246]:
## ...

CountVectorizer(analyzer='char', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=1500, min_df=1,
                ngram_range=(2, 2), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

## Hierarchical clustering of languages

* Input:
  * vectorized matrix of frequencies
* Parameters:
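  * **affinity** – distance metric used to compare samples (renamed **metric** in scikit-learn 1.2)
    * euclidean, l1, l2, manhattan, cosine
  * **linkage** – criterion for measuring the distance between two clusters
    * **single** – distance between the closest members
    * **complete** – distance between the furthest members
    * **average** – average distance over all pairs of members
    * **ward** – merges the pair of clusters that gives the smallest increase in within-cluster variance (euclidean only)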
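To make the first three linkage criteria concrete, here is a small standard-library sketch that computes them for two toy clusters of 2-D points. The points are illustrative values only, and Ward linkage is left out because it is defined through variance rather than pairwise distances.

```python
from itertools import product
from math import dist

# Two toy clusters of 2-D points (illustrative values only).
cluster_a = [(0, 0), (0, 1)]
cluster_b = [(3, 0), (5, 0)]

# All pairwise Euclidean distances between members of the two clusters.
pairwise = [dist(p, q) for p, q in product(cluster_a, cluster_b)]

single = min(pairwise)                     # closest members
complete = max(pairwise)                   # furthest members
average = sum(pairwise) / len(pairwise)    # mean over all pairs

print(f"single={single:.3f} complete={complete:.3f} average={average:.3f}")
# prints: single=3.000 complete=5.099 average=4.065
```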
     
***

#### ❓ Task 5

  * read the code below
  * experiment with different parameters
    * vectorizer parameters (analyzer, ngram_range)
    * similarity measures
    * clustering algorithm
  * find the best possible clustering
    * the one that best separates the language classes

***  



In [None]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

wes_rushmore=["#E1BD6D", "#EABE94", "#0B775E", "#35274A", "#F2300F"]

%matplotlib inline

cols={'S': wes_rushmore[4], 'G': wes_rushmore[2], 'R': wes_rushmore[3], 'F': wes_rushmore[0]}

plt.style.use('fivethirtyeight')

def plot_dendrogram(model, **kwargs):

    # Children of hierarchical clustering
    children = model.children_

    # Distances between each pair of children
    # Since we don't have this information, we can use a uniform one for plotting
    distance = np.arange(children.shape[0])

    # The number of observations contained in each cluster level
    no_of_observations = np.arange(2, children.shape[0]+2)

    # Create linkage matrix and then plot the dendrogram
    linkage_matrix = np.column_stack([children, distance, no_of_observations]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

    
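# Note: scikit-learn 1.2 renamed `affinity` to `metric`; on newer versions pass metric='l2' instead
clustering = AgglomerativeClustering(n_clusters = 4, affinity='l2', linkage='average').fit(X.toarray())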

plot_dendrogram(clustering, labels=lng, distance_sort='ascending')
plt.title('Language Dendrogram');
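One way to make the parameter search in Task 5 less manual is to score each setting against the known classes. The sketch below is an assumption-laden alternative, not the notebook's method: it uses SciPy's `linkage` and `fcluster` instead of `AgglomerativeClustering`, and the adjusted Rand index as a hypothetical measure of how well clusters match classes. The data are toy values.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import adjusted_rand_score

# Toy count matrix and class labels (illustrative only); in the notebook
# these would be X.toarray() and d['class'].values.
X_toy = np.array([[5, 0, 0], [4, 1, 0],
                  [0, 5, 0], [0, 4, 1],
                  [0, 0, 5], [1, 0, 4]], dtype=float)
y_toy = np.array(["S", "S", "G", "G", "R", "R"])


def score_clustering(X, y, method, metric):
    """Cluster hierarchically and compare the flat clusters with the classes."""
    Z = linkage(X, method=method, metric=metric)
    labels = fcluster(Z, t=len(set(y)), criterion="maxclust")
    return adjusted_rand_score(y, labels)


# Ward linkage is defined for Euclidean distances only, so it is added separately.
settings = [(method, metric_name)
            for method in ("single", "complete", "average")
            for metric_name in ("euclidean", "cityblock", "cosine")]
settings.append(("ward", "euclidean"))
scores = {s: score_clustering(X_toy, y_toy, *s) for s in settings}

best = max(scores, key=scores.get)
print(best, scores[best])
```

The matrix returned by `linkage` contains real merge distances, so it can also be passed directly to `dendrogram` in place of the uniform distances used by `plot_dendrogram`.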


## Binary classification

* Model: Multinomial Naive Bayes classifier
* Classes: 
  * **S**lavic languages
  * **G**ermanic languages
* Training set: Czech and German
* Test set: all the other languages


In [260]:
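from sklearn.naive_bayes import MultinomialNB

# True class label (S, R, G, F) of every language
y = d['class'].values

# alpha is the additive smoothing parameter; a large value flattens the feature probabilities
mnb = MultinomialNB(alpha=100)

# Row indices of the two training languages
train_lng = [0,2]

mnb.fit(X[train_lng],y[train_lng])

# Class probabilities for all languages; columns follow the order of mnb.classes_
pred=mnb.predict_proba(X)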
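The effect of the smoothing parameter can be checked by hand. The sketch below recomputes the smoothed feature probabilities, (count + alpha) divided by the class total of smoothed counts, and compares them with `feature_log_prob_`. The count matrix is a toy example, and `smoothed_log_prob` is a helper introduced only here.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy counts for two training texts over four features (illustrative only).
X_train = np.array([[3, 0, 1, 0],
                    [0, 2, 0, 2]])
y_train = np.array(["S", "G"])


def smoothed_log_prob(counts, alpha):
    """log P(feature | class) with additive smoothing, one row per class."""
    smoothed = counts + alpha
    return np.log(smoothed / smoothed.sum(axis=1, keepdims=True))


for alpha in (0.01, 1, 100):
    model = MultinomialNB(alpha=alpha).fit(X_train, y_train)
    # classes_ is sorted, so reorder the rows of the counts to match it.
    order = [list(y_train).index(c) for c in model.classes_]
    manual = smoothed_log_prob(X_train[order], alpha)
    assert np.allclose(manual, model.feature_log_prob_)
    print(alpha, np.exp(manual).round(3))
```

With a large `alpha` the rows approach a uniform distribution, so the model's predictions move towards 50/50.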

## Classification results

***

#### ❓ Task 6

  * read and understand the code below
  * evaluate the model performance on the test set
    * mean absolute error
    * predictions plot
  * experiment with parameters
    * vectorizer parameters
    * smoothing alpha
  * answer the questions:
    * does the model generally work?
    * what are the most/least Slavic and the most Germanic languages in the test set?
    * what are the results on the Romance languages?
    * what about Finnish?

***  

### Prediction on training set

In [None]:
def plot_predictions(pred, idxs, title):

    fig = plt.figure(figsize=[3+len(idxs),4]);
    x=np.arange(len(idxs))
    plt.bar(x-0.2, pred[idxs,0], width=0.4, label = mnb.classes_[0], color=cols[mnb.classes_[0]])
    plt.bar(x+0.2, pred[idxs,1], width=0.4, label = mnb.classes_[1], color=cols[mnb.classes_[1]])
    plt.legend()
    plt.xticks(ticks=x, labels=[lng[i] for i in idxs])
    plt.title(f"{title}, {mnb.classes_[0]} vs {mnb.classes_[1]}")
    plt.xlabel('Language')
    plt.ylabel('Prediction')

    
plot_predictions(pred, idxs=train_lng, title='Training')

from sklearn import metrics

mae_tr = metrics.mean_absolute_error(y[train_lng]==mnb.classes_[0], pred[train_lng,0])
print(f'MAE TRAIN:  {mae_tr:.2%}')

### Prediction on test set

In [263]:
## ...
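A possible sketch of the test-set evaluation, mirroring the training-set MAE above but restricted to held-out languages whose true class the model was trained on. The labels and probabilities are toy values, and `heldout_mae` is a name introduced for this example.

```python
import numpy as np


def heldout_mae(y, pred, train_idx, classes):
    """MAE of P(classes[0]) on held-out rows whose true class the model knows."""
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    # Languages of other classes (R, F) have no correct answer here, so skip them.
    test_idx = test_idx[np.isin(y[test_idx], classes)]
    truth = (y[test_idx] == classes[0]).astype(float)
    return np.abs(truth - pred[test_idx, 0]).mean()


# Toy labels and probabilities (illustrative only); columns follow `classes`.
y_toy = np.array(["S", "S", "G", "G", "R", "F"])
classes = np.array(["G", "S"])
pred_toy = np.array([[0.1, 0.9], [0.3, 0.7], [0.8, 0.2],
                     [0.6, 0.4], [0.5, 0.5], [0.4, 0.6]])

mae = heldout_mae(y_toy, pred_toy, train_idx=[0, 2], classes=classes)
print(f"MAE TEST: {mae:.2%}")
# prints: MAE TEST: 35.00%
```

In the notebook, `classes` would be `mnb.classes_`, and the remaining test rows (R and F) can still be passed to `plot_predictions` to answer the questions about those languages.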

### Strongest features

* Which features (words or n-grams) most strongly indicate each language class?

***

#### ❓ Task 7

  * read and understand the code below  
  * modify the code to answer the question:
    * what are the strongest Slavic, Germanic, and Romance bigrams?

***  


In [None]:
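# Train on all Slavic and Germanic languages
train_lng=d[d['class'].isin(['S','G'])].index
mnb.fit(X[train_lng],y[train_lng])

# Note: scikit-learn 1.0 added get_feature_names_out(); get_feature_names() was removed in 1.2
df=pd.DataFrame({'ftr':vec.get_feature_names(), mnb.classes_[0]:mnb.feature_log_prob_[0], mnb.classes_[1]:mnb.feature_log_prob_[1]})
# Difference of log-probabilities; its exponential is the likelihood ratio between the two classes
df['dif']=df[mnb.classes_[0]]-df[mnb.classes_[1]]
df['odds']=np.exp(df['dif'])
df.sort_values('dif',  inplace=True)
df.tail(20)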

In [None]:
fig = plt.figure(figsize=[6,10]);
n=10
de=pd.concat([df.head(n), df.tail(n)])
x=np.concatenate([np.arange(0,n,dtype=int),np.arange(n+1,2*n+1,dtype=int)])
h1=plt.barh(x[:n],de['dif'].values[0:n],
        color=np.repeat(cols[mnb.classes_[1]],n))
h2=plt.barh(x[n:],de['dif'].values[n:],
        color=np.repeat(cols[mnb.classes_[0]],n))
plt.title('Strongest features')
plt.yticks(ticks=x, labels=de['ftr']);
plt.legend([h1,h2], mnb.classes_[[1,0]]);
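With three classes, a single difference of log-probabilities no longer suffices. One hedged way to extend the idea is to compare each class with the mean log-probability of the other classes. The helper `strongest_features` and the toy probabilities below are assumptions made for illustration; in the notebook the inputs would be `mnb.feature_log_prob_`, the vectorizer's feature names, and `mnb.classes_`.

```python
import numpy as np


def strongest_features(log_prob, names, classes, n=3):
    """For each class, the n features with the largest log-probability margin
    over the mean of the remaining classes."""
    log_prob = np.asarray(log_prob)
    names = np.asarray(names)
    result = {}
    for i, cls in enumerate(classes):
        others = np.delete(log_prob, i, axis=0).mean(axis=0)
        margin = log_prob[i] - others
        top = np.argsort(margin)[::-1][:n]      # indices of the largest margins
        result[cls] = names[top].tolist()
    return result


# Toy per-class feature probabilities (made-up numbers, rows sum to 1).
toy_names = ["ch", "qu", "sk", "en"]
toy_log_prob = np.log([[0.5, 0.1, 0.1, 0.3],    # G
                       [0.1, 0.5, 0.1, 0.3],    # R
                       [0.1, 0.1, 0.5, 0.3]])   # S

print(strongest_features(toy_log_prob, toy_names, ["G", "R", "S"], n=1))
```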

## Final task

* generalize the model to three classes (S, G, R)
    1. train the model on three representative languages
    2. evaluate the model results
    3. find the three languages most representative of their classes
      - lowest mean abs. error
    
* decide which class is most similar to Finnish
    

In [264]:
### ...
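The steps of the final task can be sketched as an exhaustive search over one training language per class. Everything below runs on toy data; `triplet_mae` and `best_triplet` are hypothetical helper names, and the default `alpha=1.0` is an arbitrary choice that would need tuning on the real dataset.

```python
from itertools import product

import numpy as np
from sklearn.naive_bayes import MultinomialNB


def triplet_mae(X, y, train_idx, alpha=1.0):
    """Train on the given rows and return the MAE on held-out rows of known classes."""
    model = MultinomialNB(alpha=alpha).fit(X[train_idx], y[train_idx])
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    test_idx = test_idx[np.isin(y[test_idx], model.classes_)]
    proba = model.predict_proba(X[test_idx])
    # One-hot encoding of the true classes, in the column order of classes_.
    onehot = (y[test_idx][:, None] == model.classes_[None, :]).astype(float)
    return np.abs(onehot - proba).mean()


def best_triplet(X, y, classes=("S", "G", "R"), alpha=1.0):
    """Try every combination of one language per class; keep the lowest MAE."""
    groups = [np.flatnonzero(y == c) for c in classes]
    candidates = (list(t) for t in product(*groups))
    return min(candidates, key=lambda idx: triplet_mae(X, y, idx, alpha))


# Toy count matrix and labels (illustrative only).
X_toy = np.array([[5, 1, 0], [4, 0, 1], [0, 5, 1], [1, 4, 0],
                  [0, 1, 5], [1, 0, 4], [2, 2, 2]])
y_toy = np.array(["S", "S", "G", "G", "R", "R", "F"])

best = best_triplet(X_toy, y_toy)
model = MultinomialNB().fit(X_toy[best], y_toy[best])
finnish = model.predict_proba(X_toy[y_toy == "F"])[0]
print(best, dict(zip(model.classes_, finnish.round(2))))
```

The class with the highest probability for the Finnish row is the model's answer to the last question, though the result depends on the chosen vectorizer settings and on `alpha`.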