# Theoretical question

3. Language model inference methods: properties, differences, cases of usage (without formulas, at least 2)  
Greedy search  
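Greedy search simply selects the word with the highest probability as the next word at each step. It is fast and deterministic, but it can miss a high-probability sequence hidden behind a lower-probability word.  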
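A minimal sketch of greedy decoding over a toy next-word table. The table `NEXT_WORD_PROBS`, its words and its probabilities are invented for illustration and do not come from a real model.

```python
# Toy next-word distributions (hypothetical values, not from a real model).
NEXT_WORD_PROBS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
}

def greedy_decode(start, steps):
    """Repeatedly append the single most probable next word."""
    sequence, prob = [start], 1.0
    for _ in range(steps):
        candidates = NEXT_WORD_PROBS.get(sequence[-1])
        if not candidates:  # no known continuation: stop early
            break
        word = max(candidates, key=candidates.get)
        sequence.append(word)
        prob *= candidates[word]
    return sequence, prob

greedy_sequence, greedy_prob = greedy_decode("The", steps=2)
print(greedy_sequence, greedy_prob)
```

In this toy table the greedy path goes through "nice", even though the continuation after "dog" would give a more probable two-word sequence overall.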
Beam search  
Beam search reduces the risk of missing hidden high-probability word sequences by keeping the num_beams most likely hypotheses at each time step and eventually choosing the hypothesis with the highest overall probability.  
Beam search can work very well in tasks where the length of the desired generation is more or less predictable, as in machine translation or summarization.  
Beam search suffers heavily from repetitive generation.  
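The following sketch illustrates the pruning step of beam search on a toy next-word table. The table and its probabilities are hypothetical, and the function is a simplified illustration rather than the implementation used by any particular library.

```python
# Toy next-word distributions (hypothetical values, not from a real model).
NEXT_WORD_PROBS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
    "car": {"is": 0.5, "drives": 0.3, "turns": 0.2},
}

def beam_search(start, steps, num_beams):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [([start], 1.0)]
    for _ in range(steps):
        expanded = []
        for sequence, prob in beams:
            for word, p in NEXT_WORD_PROBS.get(sequence[-1], {}).items():
                expanded.append((sequence + [word], prob * p))
        if not expanded:  # nothing left to expand
            break
        # Prune to the num_beams highest-probability hypotheses.
        beams = sorted(expanded, key=lambda item: item[1], reverse=True)[:num_beams]
    return beams

greedy_like = beam_search("The", steps=2, num_beams=1)[0]
best_beam = beam_search("The", steps=2, num_beams=2)[0]
print(greedy_like)
print(best_beam)
```

With `num_beams=1` the search behaves like greedy search and ends on "The nice woman". With `num_beams=2` the hypothesis starting with "dog" survives the first step and wins, because "has" is a very likely continuation.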
Sampling  
In its most basic form, sampling means randomly picking the next word according to its conditional probability distribution, so the output is no longer deterministic.  
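A minimal sketch of basic sampling, assuming a made-up next-word distribution. The seed value and the helper name `sample_next_word` are illustrative choices only.

```python
import random

# Hypothetical next-word distribution for some context.
next_word_probs = {"nice": 0.5, "dog": 0.4, "car": 0.1}

def sample_next_word(probs, rng):
    """Draw one word with probability proportional to its weight."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(1337)  # fixed seed so the draws are reproducible
draws = [sample_next_word(next_word_probs, rng) for _ in range(1000)]
counts = {word: draws.count(word) for word in next_word_probs}
print(counts)
```

Unlike greedy search, low-probability words such as "car" are still picked occasionally, which adds variety but also the risk of incoherent text.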
Top-K Sampling  
In Top-K sampling, only the K most likely next words are kept, and the probability mass is redistributed among those K words.  
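The filtering and renormalization step can be sketched as follows. The distribution is hypothetical, and `top_k_filter` is an illustrative helper, not a library function.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}

# Hypothetical next-word distribution.
next_word_probs = {"nice": 0.5, "dog": 0.3, "car": 0.15, "tree": 0.05}
filtered = top_k_filter(next_word_probs, k=2)
print(filtered)
```

The next word would then be sampled from `filtered` instead of from the full distribution.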
Top-p (nucleus) sampling  
Instead of sampling only from the K most likely words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words. This way, the size of the set (i.e. the number of words in it) can dynamically increase and decrease according to the next word's probability distribution.
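A small sketch of the nucleus selection described above, using two made-up distributions to show how the set size adapts. `top_p_filter` is an illustrative helper and the probabilities are assumptions.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top words whose cumulative probability exceeds p."""
    kept, cumulative = [], 0.0
    for word, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((word, prob))
        cumulative += prob
        if cumulative > p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {word: prob / cumulative for word, prob in kept}

# Hypothetical distributions: one sharply peaked, one nearly flat.
peaked = {"nice": 0.9, "dog": 0.05, "car": 0.03, "tree": 0.02}
flat = {"nice": 0.3, "dog": 0.3, "car": 0.2, "tree": 0.2}

peaked_kept = top_p_filter(peaked, p=0.75)
flat_kept = top_p_filter(flat, p=0.75)
print(peaked_kept)
print(flat_kept)
```

With the same p, the peaked distribution keeps a single word while the flat one keeps three, which is the adaptive behaviour that a fixed K cannot provide.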

# Exam

Develop a model for predicting review rating.  
**Multiclass classification into 5 classes**  
Score: **F1 with macro averaging**  
You are forbidden to use test dataset for any kind of training.  
Remember proper training pipeline.  
If you are not using default params in the models, you have to use some validation scheme to justify them. 

Use `random_state` or `seed` params - your experiment must be reproducible.
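To make the scoring rule above concrete, here is a dependency-free sketch of macro-averaged F1. The labels are invented for illustration, and `macro_f1` is a hypothetical helper. In the notebook itself the score would come from `sklearn.metrics.f1_score(..., average='macro')`.

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores with equal weight for every class."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels: a majority-class predictor on an imbalanced sample.
y_true = [5, 5, 5, 5, 4, 4, 3, 1]
y_pred = [5] * len(y_true)
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

The majority-class predictor is right on half of these examples, yet its macro F1 is low because every class it never predicts contributes an F1 of zero. This is why the metric is sensitive to the rare rating classes.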


### 1 baseline = 0.51
### 2 baseline = 0.53


In [0]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

SEED = 1337

In [0]:
from google.colab import drive

In [3]:
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [0]:
import os
os.chdir('gdrive/My Drive/Colab Notebooks/exam_data')

In [5]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df_train.shape

(48192, 3)

In [6]:
df_train = pd.read_csv('train.csv')
df_train.head()

Unnamed: 0,review,title,target
0,"The staff was very friendly, the breakfast ver...",Walker Gem,5
1,Excellent service - very approachable and prof...,Excellent Service,4
2,Really a top notch place to spend a day at the...,"Good location, warm and friendly staff",5
3,"a little noisy, there was a false fire alarm a...","nice hotel,",4
4,Place had too many animals and I'm allergic to...,Experience,3


In [7]:
df_test = pd.read_csv('test.csv')  # load the held-out test set, not train.csv
df_test.head()


In [8]:
# class distribution
df_train.target.value_counts(normalize=True)

5    0.405690
4    0.286126
3    0.153137
1    0.077648
2    0.077399
Name: target, dtype: float64

In [0]:
from nltk.tokenize import RegexpTokenizer

In [10]:
pip install pymorphy2


In [0]:
import pymorphy2
from pymorphy2 import MorphAnalyzer
morph = MorphAnalyzer()
token = RegexpTokenizer(r'\w+')  # raw string avoids an invalid-escape warning

In [0]:
import re

In [0]:
def tokenize(text):
    return token.tokenize(text)

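def normalize_pm(text):
    # pymorphy2 is a Russian morphological analyzer; on these English reviews
    # normal_form in practice mostly lowercases each token (see review_final below).
    words = [morph.parse(word)[0].normal_form for word in tokenize(text) if word]
    return words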
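As a rough, dependency-free approximation of the tokenization step above, the sketch below uses only the standard `re` module. It assumes lowercasing is the only normalization wanted and does not reproduce pymorphy2's morphological analysis. `simple_tokenize` is a hypothetical helper.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    return re.findall(r"\w+", text.lower())

tokens = simple_tokenize("Place had too many animals and I'm allergic")
print(" ".join(tokens))
```

Note that the apostrophe splits "I'm" into two tokens, because `\w+` matches only letters, digits and underscores.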

In [0]:
df_train['review_clean'] = df_train['review']
df_test['review_clean'] = df_test['review']

In [0]:
def prep(data):
    tokens = [normalize_pm(sen) for sen in data.review_clean]
    result = [' '.join(sen) for sen in tokens]
    data['review_final'] = result
    return data

In [0]:
df_train = prep(df_train)

In [72]:
df_train.head()

Unnamed: 0,review,title,target,review_clean,review_final
0,"The staff was very friendly, the breakfast ver...",Walker Gem,5,"The staff was very friendly, the breakfast ver...",the staff was very friendly the breakfast very...
1,Excellent service - very approachable and prof...,Excellent Service,4,Excellent service - very approachable and prof...,excellent service very approachable and profes...
2,Really a top notch place to spend a day at the...,"Good location, warm and friendly staff",5,Really a top notch place to spend a day at the...,really a top notch place to spend a day at the...
3,"a little noisy, there was a false fire alarm a...","nice hotel,",4,"a little noisy, there was a false fire alarm a...",a little noisy there was a false fire alarm at...
4,Place had too many animals and I'm allergic to...,Experience,3,Place had too many animals and I'm allergic to...,place had too many animals and i m allergic to...


In [0]:
df_test = prep(df_test)

In [73]:
df_test.head()



In [0]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [0]:
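# Fit the vocabulary on the training reviews only. The sparse matrix is kept in its
# own variable because it cannot be stored as a single DataFrame column.
X_train = vectorizer.fit_transform(df_train['review_final'])

In [0]:
# Reuse the training vocabulary for the test reviews (transform, not fit_transform).
X_test = vectorizer.transform(df_test['review_final'])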
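The difference between `fit_transform` and `transform` can be seen on a tiny example. The two mini-corpora below are invented stand-ins for the train and test reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpora standing in for the train and test reviews.
train_texts = ["great hotel great staff", "noisy room"]
test_texts = ["great room", "terrible breakfast"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns the vocabulary
X_test = vectorizer.transform(test_texts)        # reuses that vocabulary

print(sorted(vectorizer.vocabulary_))
print(X_train.shape, X_test.shape)
```

Both matrices share the same columns, so a model trained on `X_train` can score `X_test`. Words that appear only in the test texts are dropped, which is why the second test row is all zeros.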

In [0]:
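# encode categorical variables
# Fit the encoder on train only and reuse it for test. OrdinalEncoder replaces
# LabelEncoder here so that titles unseen in train map to -1 instead of raising
# (requires scikit-learn >= 0.24).
from sklearn.preprocessing import OrdinalEncoder

le = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
df_train['title_vec'] = le.fit_transform(df_train[['title']].fillna('')).ravel().astype(int)
df_test['title_vec'] = le.transform(df_test[['title']].fillna('')).ravel().astype(int)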

In [0]:
from sklearn.pipeline import Pipeline

In [93]:
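from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

columns = ['review_final', 'title_vec']

# Build the features inside the pipeline: bag-of-words counts for the review text
# and a one-hot encoding of the title codes (unseen codes are ignored at predict time).
features = ColumnTransformer([
    ('review', CountVectorizer(), 'review_final'),
    ('title', OneHotEncoder(handle_unknown='ignore'), ['title_vec']),
])

model = Pipeline([
    ('enc', features),
    ('classif', RandomForestClassifier(max_depth=None, random_state=0, n_estimators=100))
])

model.fit(df_train[columns], df_train['target'])

print('train', metrics.f1_score(df_train['target'], model.predict(df_train[columns]), average='macro'))
print('test', metrics.f1_score(df_test['target'], model.predict(df_test[columns]), average='macro'))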

In [0]:
from sklearn import model_selection

In [90]:
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Vectorize inside the pipeline so that each CV fold fits its vocabulary
# on its own training part only.
RFC = Pipeline([
    ('vec', CountVectorizer()),
    ('classif', RandomForestClassifier(random_state=SEED)),
])
model = model_selection.GridSearchCV(RFC, {'classif__n_estimators': [30]},
                                     cv=5, scoring='f1_macro', n_jobs=-1, verbose=1)
model.fit(df_train['review_final'], df_train['target'])

print('train', metrics.f1_score(df_train['target'], model.predict(df_train['review_final']), average='macro'))
print('test', metrics.f1_score(df_test['target'], model.predict(df_test['review_final']), average='macro'))