<div style="line-height:1.2;">

<h1 style="color:#B0EE8F; margin-bottom: 0.2em;">Common practices in Machine Learning 1</h1>

</div>

<div style="line-height:1.2;">

<h4 style="margin-top: 0.2em; margin-bottom: 0.5em;">Scikit-learn tips and tricks. Focus on Imputers, Encoders, Tokenizers, and Taggers.</h4>

</div>

<div style="margin-top: 3px;">
<span style="display: inline-block;">
    <h3 style="color: lightblue; display: inline; margin-bottom: 0;">Keywords:</h3> fancyimpute + font size in markdown + nltk.tag + DictVectorizer + CountVectorizer + TfidfVectorizer 
    + np.vstack + np.hstack
</span>
</div>


In [2]:
import nltk
from nltk.corpus import brown 
from nltk.tag import UnigramTagger 
from nltk.tag import BigramTagger 
from nltk.tag import TrigramTagger 

import numpy as np
import pandas as pd
import seaborn as sns
from pprint import pprint
import matplotlib.cm as cm

from fancyimpute import KNN, MatrixFactorization, BiScaler, IterativeImputer, SoftImpute
from fancyimpute import IterativeSVD, SimpleFill

from category_encoders import TargetEncoder

from sklearn.impute import SimpleImputer
from sklearn.datasets import load_iris, make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler, LabelBinarizer, MultiLabelBinarizer, LabelEncoder

<h2 style="color:#B0EE8F"> <b>Imputing </b></h2>

<h3 style="color:#B0EE8F"> Recap Fancyimpute: </h3>
<div style="margin-top: -20px;">
Fancyimpute is a library of matrix completion and imputation techniques. <br>
It provides a set of tools and algorithms to fill in missing values in matrices, including:

- Matrix Factorization
- K-Nearest Neighbors (KNN)
- SoftImpute
- Iterative Imputer
- BiScaler
- Nuclear Norm Minimization
- Bayesian Low Rank Matrix Completion
</div>

In [8]:
# Create a numpy array with missing values
features = np.array([[1, 2], [3, 4], [5, np.nan]])

## Standardize features
scaler = StandardScaler()
standardized_features = scaler.fit_transform(features)
# First feature value
true_value = standardized_features[0, 0]
# Introduce a missing value
standardized_features[0, 0] = np.nan
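
The standardization step can be reproduced by hand as a sanity check. The following is a minimal sketch, assuming that `StandardScaler` ignores NaN entries when fitting and leaves them as NaN in the output; the variable names are illustrative only.

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [5, np.nan]])

# Column statistics computed over the observed (non-NaN) entries only
col_mean = np.nanmean(X, axis=0)
col_std = np.nanstd(X, axis=0)  # population standard deviation (ddof=0)

# z-score each column; NaN entries simply stay NaN
X_standardized = (X - col_mean) / col_std
print(X_standardized)
```

The first column becomes roughly -1.22, 0, 1.22, which is where the `true_value` of -1.2247 comes from.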

#### => KNN

In [6]:
""" KNN: Missing values are imputed based on the values of their k-nearest neighbors in the dataset. """
imputer = KNN(k=5, verbose=0)
imputed_features = imputer.fit_transform(standardized_features)
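# NB: the missing value was introduced at [0, 0]; [-1, 0] reads the last row,
# whose first feature was never missing, so it is not comparable with true_value
imputed_value = imputed_features[-1, 0]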

print(imputed_features)
print("True Value:", true_value)
print("\n Imputed Feature:", imputed_features)
print("\n Imputed Value:", imputed_value)

[[ 0.         -1.        ]
 [ 0.          1.        ]
 [ 1.22474487  1.        ]]
True Value: -1.224744871391589

 Imputed Feature: [[ 0.         -1.        ]
 [ 0.          1.        ]
 [ 1.22474487  1.        ]]

 Imputed Value: 1.224744871391589
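
To make the idea behind KNN imputation concrete, here is a minimal NumPy sketch. It is not fancyimpute's implementation (which, among other things, weights neighbours by distance); `knn_impute` is a hypothetical helper that takes the plain mean of the `k` closest rows, measuring distance only over the columns both rows observe.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs with the mean of the k nearest rows (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        # Distance to every other row over the columns both rows observe
        dists = np.full(len(X), np.inf)
        for j, other in enumerate(X):
            shared = ~missing & ~np.isnan(other)
            if j != i and shared.any():
                dists[j] = np.sqrt(np.mean((row[shared] - other[shared]) ** 2))
        order = np.argsort(dists)
        for col in np.where(missing)[0]:
            # Keep reachable neighbours that actually observe this column
            donors = [j for j in order
                      if np.isfinite(dists[j]) and not np.isnan(X[j, col])][:k]
            if donors:
                filled[i, col] = X[donors, col].mean()
    return filled

X_demo = np.array([[1.0, 2.0, np.nan],
                   [1.1, 2.1, 3.0],
                   [0.9, 1.9, 5.0],
                   [9.0, 9.0, 40.0]])
result = knn_impute(X_demo, k=2)
print(result[0, 2])  # mean of the third column of the two closest rows
```

With only three rows, as in the toy matrix above, a row that shares no observed column with the incomplete row cannot act as a neighbour, which is why so few donors contribute there.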


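#### => SimpleFill (column-mean baseline)

In [16]:
# SimpleFill replaces each missing entry with its column mean by default;
# it is not an SVD method (fancyimpute's SVD-based imputer is IterativeSVD)
imputer = SimpleFill()
imputed_features_svd = imputer.fit_transform(standardized_features)
imputed_features_svd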

array([[ 0.61237244, -1.        ],
       [ 0.        ,  1.        ],
       [ 1.22474487,  0.        ]])

#### => SoftImpute

In [17]:
imputer = SoftImpute()
imputed_features_soft = imputer.fit_transform(standardized_features)
imputed_features_soft

[SoftImpute] Max Singular Value of X_init = 1.414214
[SoftImpute] Iter 1: observed MAE=0.017071 rank=2
[SoftImpute] Iter 2: observed MAE=0.017071 rank=2
[SoftImpute] Iter 3: observed MAE=0.017071 rank=2
[SoftImpute] Iter 4: observed MAE=0.017071 rank=2
[SoftImpute] Iter 5: observed MAE=0.017071 rank=2
[SoftImpute] Iter 6: observed MAE=0.017071 rank=2
[SoftImpute] Iter 7: observed MAE=0.017071 rank=2
[SoftImpute] Iter 8: observed MAE=0.017071 rank=2
[SoftImpute] Iter 9: observed MAE=0.017071 rank=2
[SoftImpute] Iter 10: observed MAE=0.017071 rank=2
[SoftImpute] Iter 11: observed MAE=0.017071 rank=2
[SoftImpute] Iter 12: observed MAE=0.017071 rank=2
[SoftImpute] Iter 13: observed MAE=0.017071 rank=2
[SoftImpute] Iter 14: observed MAE=0.017071 rank=2
[SoftImpute] Iter 15: observed MAE=0.017071 rank=2
[SoftImpute] Iter 16: observed MAE=0.017071 rank=2
[SoftImpute] Iter 17: observed MAE=0.017071 rank=2
[SoftImpute] Iter 18: observed MAE=0.017071 rank=2
[SoftImpute] Iter 19: observed MAE=0.0

array([[ 0.        , -1.        ],
       [ 0.        ,  1.        ],
       [ 1.22474487,  0.        ]])

#### => Iterative Imputer

In [31]:
imputer = IterativeImputer()
imputed_features_bsr = imputer.fit_transform(standardized_features)
imputed_features_bsr

array([[ 2.44947994e+00, -1.00000000e+00],
       [ 0.00000000e+00,  1.00000000e+00],
       [ 1.22474487e+00, -3.99999933e-06]])

#### => BiScaler

In [20]:
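# BiScaler is a normalizer, not an imputer: it iteratively rescales rows and columns
# to zero mean and unit variance while ignoring missing entries
imputer = BiScaler()
# The NaNs are therefore still present in the output; BiScaler is typically applied before SoftImpute
imputed_features_bi = imputer.fit_transform(standardized_features)
imputed_features_bi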

[BiScaler] Initial log residual value = 1.793231
[BiScaler] Iter 1: log residual = 0.693147, log improvement ratio=1.100084
[BiScaler] Iter 2: log residual = 0.693147, log improvement ratio=0.000000


array([[nan, -1.],
       [-1.,  1.],
       [ 1., nan]])

#### => Matrix Factorization

In [21]:
# Create a MatrixFactorization imputer (fancyimpute exposes nuclear norm
# minimization separately, as NuclearNormMinimization)
imputer = MatrixFactorization()
# Impute the missing values by fitting a low-rank factorization to the observed entries
imputed_features_mnm = imputer.fit_transform(standardized_features)
imputed_features_mnm 

[MatrixFactorization] Iter 10: observed MAE=0.756164 rank=40
[MatrixFactorization] Iter 20: observed MAE=0.706554 rank=40
[MatrixFactorization] Iter 30: observed MAE=0.662719 rank=40
[MatrixFactorization] Iter 40: observed MAE=0.623491 rank=40
[MatrixFactorization] Iter 50: observed MAE=0.587722 rank=40


array([[-0.06696763, -1.        ],
       [ 0.        ,  1.        ],
       [ 1.22474487,  0.51991123]])

<h3 style="color:#B0EE8F"> Sklearn Imputer </h3>

In [None]:
mean_imputer = SimpleImputer(strategy="mean")
# Impute the standardized matrix, where [0, 0] was set to NaN above
features_mean_imputed = mean_imputer.fit_transform(standardized_features) 
# Compare true and imputed values 
print("True Value:", true_value) 
print("Imputed Value:", features_mean_imputed[0,0])
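
As a rough illustration of what `SimpleImputer(strategy="mean")` computes, the sketch below fills each NaN with the mean of the observed values in its column, using NumPy only. The small matrix is made up for the example.

```python
import numpy as np

X_demo = np.array([[np.nan, -1.0],
                   [0.0, 1.0],
                   [1.5, np.nan]])

# Per-column mean over the observed entries
col_means = np.nanmean(X_demo, axis=0)

# Broadcast the column means into the NaN positions only
X_filled = np.where(np.isnan(X_demo), col_means, X_demo)
print(X_filled)
```

Mean imputation is cheap and scales well, but it ignores relationships between features, unlike the KNN-based approach.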

<h2 style="color:#B0EE8F"> <b>Encoding </b></h2>

In [10]:
# 1
multiclass_feature = [("Texas", "Florida"), ("California", "Alabama"), ("Texas", "Florida"),
                    ("Delaware", "Florida"), ("Texas", "Alabama")] 
# Create multiclass one-hot encoder 
one_hot_multiclass = MultiLabelBinarizer() 
# One-hot encode multiclass feature 
one_hot_multiclass.fit_transform(multiclass_feature)

array([[0, 0, 0, 1, 1],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1]])
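
The column order in that output follows the alphabetically sorted class labels. A pure-Python sketch of the same multi-label one-hot idea (not scikit-learn's implementation, and with a shortened sample list) might look like this:

```python
samples = [("Texas", "Florida"), ("California", "Alabama"), ("Texas", "Alabama")]

# One column per distinct label, in sorted order
classes = sorted({label for sample in samples for label in sample})

# 1 if the observation carries the label, 0 otherwise
encoded = [[1 if c in sample else 0 for c in classes] for sample in samples]

print(classes)
print(encoded)
```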

In [5]:
# 2 
le = LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam", "ciao", "forever"])

print(list(le.classes_))
print(le.transform(["tokyo", "tokyo", "paris", "forever"]))
print(list(le.inverse_transform([2, 2, 1])))

['amsterdam', 'ciao', 'forever', 'paris', 'tokyo']
[4 4 3 2]
['forever', 'forever', 'ciao']
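
The mapping used by `LabelEncoder` can be sketched in plain Python: each class gets the index of its position in the sorted list of unique labels. This is an illustration of the idea rather than the library's code.

```python
labels = ["paris", "paris", "tokyo", "amsterdam", "ciao", "forever"]

# Sorted unique labels play the role of le.classes_
classes = sorted(set(labels))
to_index = {c: i for i, c in enumerate(classes)}

encoded = [to_index[c] for c in ["tokyo", "tokyo", "paris", "forever"]]
decoded = [classes[i] for i in [2, 2, 1]]

print(encoded)
print(decoded)
```

Because the integers follow alphabetical order, they carry no meaningful ranking, so this encoding suits targets better than input features.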


In [7]:
# 3
enc = TargetEncoder(cols=['paris','tokyo', "amsterdam", "ciao", "ecco"])
# Target with default parameters
ce_target = TargetEncoder(cols = ['color'])
print(enc)
print(ce_target)

TargetEncoder(cols=['paris', 'tokyo', 'amsterdam', 'ciao', 'ecco'])
TargetEncoder(cols=['color'])
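
The cell above only instantiates the encoders. As a rough sketch of the underlying idea, target encoding replaces each category with the mean of the target over the rows in that category. The version below is deliberately simplified: it applies no smoothing towards the global mean, which `category_encoders.TargetEncoder` does, and the `colors` and `target` data are invented for illustration.

```python
from collections import defaultdict

colors = ["red", "blue", "red", "green", "blue", "red"]
target = [1, 0, 1, 0, 1, 0]

# Accumulate the target sum and row count per category
sums, counts = defaultdict(float), defaultdict(int)
for c, y in zip(colors, target):
    sums[c] += y
    counts[c] += 1

# Unsmoothed per-category target mean
category_means = {c: sums[c] / counts[c] for c in sums}
encoded = [category_means[c] for c in colors]

print(category_means)
print(encoded)
```

In practice the means should be learned on training data only, otherwise the target leaks into the features.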


In [13]:
# 4
dataf = pd.DataFrame({"Score":["Low","Low","Medium","Medium","High","Barely"]})
print(dataf)
print()
scale_mapper = {"Low":1,"Medium":2,"Barely":3,"High":4}
# Assign the mapped column back instead of calling replace(..., inplace=True) on a column selection
dataf["Score"] = dataf["Score"].map(scale_mapper)
print(dataf)

data_dict = [
            {"Red": 2, "Green": 2},
            {"Red": 3, "Yellow": 5},
            {"Red": 4, "Blue": 4},
            {"Red": 5, "Yellow": 2},
            {"Red": 1, "Green": 1},
            ]

## Create vectorized dictionaries
dict_vectorizer = DictVectorizer(sparse=False)
dict_vectorizer1 = DictVectorizer(sparse=True)

features = dict_vectorizer.fit_transform(data_dict)
features1 = dict_vectorizer1.fit_transform(data_dict)
print(features)
print()
print(features1)

    Score
0     Low
1     Low
2  Medium
3  Medium
4    High
5  Barely

   Score
0      1
1      1
2      2
3      2
4      4
5      3
[[0. 2. 2. 0.]
 [0. 0. 3. 5.]
 [4. 0. 4. 0.]
 [0. 0. 5. 2.]
 [0. 1. 1. 0.]]

  (0, 1)	2.0
  (0, 2)	2.0
  (1, 2)	3.0
  (1, 3)	5.0
  (2, 0)	4.0
  (2, 2)	4.0
  (3, 2)	5.0
  (3, 3)	2.0
  (4, 1)	1.0
  (4, 2)	1.0


In [15]:
# 5
data_dict = [{'color': 'red', 'size': 'large'}, {'color': 'blue', 'size': 'medium'}, {'color': 'green', 'size': 'small'}]

dict_vectorizer = DictVectorizer(sparse=False)
features = dict_vectorizer.fit_transform(data_dict)
feature_names = dict_vectorizer.get_feature_names_out()

piddi = pd.DataFrame(features, columns=feature_names)
piddi

   color=blue  color=green  color=red  size=large  size=medium  size=small
0         0.0          0.0        1.0         1.0          0.0         0.0
1         1.0          0.0        0.0         0.0          1.0         0.0
2         0.0          1.0        0.0         0.0          0.0         1.0


<font size="4">
Replacing missing values in a categorical feature with the classes predicted by a KNeighborsClassifier:</font>

In [34]:
X = np.array([[0,2.10,1.45],[1,2.10,1.43],[1,0.89,1.05],[0,-12.10,1.78]])
X_with_nan = np.array([[np.nan, 0.87, 1.31],[np.nan, 0.87, 1.31]])
print(X[:,1:]) 
print(X[:,0])
print()
print(X_with_nan[:,1:])

[[  2.1    1.45]
 [  2.1    1.43]
 [  0.89   1.05]
 [-12.1    1.78]]
[0. 1. 1. 0.]

[[0.87 1.31]
 [0.87 1.31]]


In [35]:
clf = KNeighborsClassifier(3, weights='distance')
trained_model = clf.fit(X[:,1:], X[:,0])
imputed_values = trained_model.predict(X_with_nan[:,1:])
print("imputed_values")
print(imputed_values)
X_with_imputed = np.hstack((imputed_values.reshape(-1,1), X_with_nan[:,1:]))

# Stack arrays in sequence vertically (row wise)
np.vstack((X_with_imputed, X))

imputed_values
[1. 1.]


array([[  1.  ,   0.87,   1.31],
       [  1.  ,   0.87,   1.31],
       [  0.  ,   2.1 ,   1.45],
       [  1.  ,   2.1 ,   1.43],
       [  1.  ,   0.89,   1.05],
       [  0.  , -12.1 ,   1.78]])

In [36]:
# Join the two feature matrices 
X_complete = np.vstack((X_with_nan, X)) 
imputer = SimpleImputer(strategy='most_frequent') 
imputer.fit_transform(X_complete)

array([[  0.  ,   0.87,   1.31],
       [  0.  ,   0.87,   1.31],
       [  0.  ,   2.1 ,   1.45],
       [  1.  ,   2.1 ,   1.43],
       [  1.  ,   0.89,   1.05],
       [  0.  , -12.1 ,   1.78]])

<font size="4">
Handling imbalanced classes: </font>

In [37]:
iris = load_iris()
features_iris = iris.data
target_iris = iris.target
# Drop the first 40 observations (all class 0) to make the classes imbalanced
features_iris = features_iris[40:,:]
target_iris = target_iris[40:]
# create a binary target vector indicating if class is 0
target_iris2 = np.where((target_iris == 0), 0, 1) 
print(target_iris)
print()
print(target_iris2)

[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


In [38]:
weights = {0: .9, 1: 0.1} 
clf1 = RandomForestClassifier(class_weight=weights)
clf2 = RandomForestClassifier(class_weight='balanced')

In [39]:
features_iris.shape

(110, 4)
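
For reference, the `class_weight='balanced'` option weights each class inversely to its frequency, following the heuristic `n_samples / (n_classes * np.bincount(y))` given in the scikit-learn documentation. The sketch below reproduces that arithmetic for a 10-versus-100 binary target like the one built above; the variable names are illustrative.

```python
import numpy as np

# Binary target with 10 observations of class 0 and 100 of class 1
y = np.array([0] * 10 + [1] * 100)

counts = np.bincount(y)                            # observations per class
balanced_weights = len(y) / (len(counts) * counts)

class_weight = {int(c): float(w) for c, w in enumerate(balanced_weights)}
print(class_weight)
```

The minority class ends up with ten times the weight of the majority class, mirroring the 1:10 imbalance.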

<font size="4"> Alternatively, it is possible to downsample the majority class or upsample the minority class. <br>
Downsampling means sampling without replacement from the majority class <br>
(i.e., the class with more observations) to create a new subset of observations equal in size to the minority class. <br>

For example, if the minority class has 10 observations, 10 observations are randomly selected from the majority class, <br> 
giving 20 observations in total as data. <br>
Upsampling is the reverse: sampling with replacement from the minority class until it matches the majority class in size. <br>
</font>

In [40]:
ab = np.array([1,2,3,4,5])
bc = np.array([10,12,13,14,15])

cd = np.vstack((ab, bc))        #nb! not cd = np.vstack(ab,bc)
de = np.hstack((ab, bc))        #nb! not de = np.hstack(ab, bc)

print(cd)
print(de)
print(f"ab.shape {ab.shape}")
print(f"ab.size {ab.size}")
print(f"ab.ndim {ab.ndim}")
print(f"cd.shape {cd.shape}")
print(f"cd.size {cd.size}")
print(f"cd.ndim {cd.ndim}")

[[ 1  2  3  4  5]
 [10 12 13 14 15]]
[ 1  2  3  4  5 10 12 13 14 15]
ab.shape (5,)
ab.size 5
ab.ndim 1
cd.shape (2, 5)
cd.size 10
cd.ndim 2


In [42]:
iris = load_iris()
# Select the first feature as a 1-D array of shape (150,)
features_iris = iris.data[:, 0]  
target_iris = iris.target

# Create a binary target vector indicating if class is 0
target_iris2 = np.where(target_iris == 0, 0, 1)

## Split the indices of class 0 and class 1
i_class0 = np.where(target_iris2 == 0)[0]
i_class1 = np.where(target_iris2 == 1)[0]

# Downsample class 1 without replacement
i_class1_downsampled = np.random.choice(i_class1, size=len(i_class0), replace=False)

# Join together class 0's target vector with the downsampled class 1's target vector
target_downsampled = np.hstack((target_iris2[i_class0], target_iris2[i_class1_downsampled]))
# Join together class 0's feature vector with the downsampled class 1's feature vector
features_downsampled = np.hstack((features_iris[i_class0], features_iris[i_class1_downsampled]))

print(target_iris2[i_class0] )
print("\n", target_iris2[i_class1_downsampled])
print("\n", target_downsampled)
print("\n", features_downsampled)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]

 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1]

 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

 [5.1 4.9 4.7 4.6 5.  5.4 4.6 5.  4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.  5.  5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.
 5.5 4.9 4.4 5.1 5.  4.5 4.4 5.  5.1 4.8 5.1 4.6 5.3 5.  6.4 7.7 5.5 4.9
 5.6 6.2 6.7 6.  5.9 5.8 6.4 5.7 6.7 5.8 6.9 5.1 6.4 5.6 6.3 6.9 6.5 5.6
 7.3 5.  5.9 6.3 7.2 6.3 5.2 7.7 7.2 6.  5.7 6.3 5.5 5.8 5.8 5.7 7.9 6.1
 6.5 5.  6.1 6.2 5.5 6.6 6.4 5.6 6.1 6.7]
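
Upsampling the minority class follows the same pattern, except that indices are drawn with replacement. The sketch below uses synthetic data and a seeded `np.random.default_rng` generator (both assumptions made for reproducibility, not part of the notebook above).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 10 observations of class 0, 90 of class 1
target = np.array([0] * 10 + [1] * 90)
features = np.arange(100, dtype=float)

i_class0 = np.where(target == 0)[0]
i_class1 = np.where(target == 1)[0]

# Upsample class 0 with replacement until it matches class 1 in size
i_class0_upsampled = rng.choice(i_class0, size=len(i_class1), replace=True)

target_upsampled = np.concatenate((target[i_class0_upsampled], target[i_class1]))
features_upsampled = np.concatenate((features[i_class0_upsampled], features[i_class1]))

print(target_upsampled.shape, features_upsampled.shape)
```

Because the minority rows are repeated, resampling should happen after the train/test split so that duplicates do not leak into the test set.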


<h2 style="color:#B0EE8F"> <b> Tokenizing text with NLTK </b> </h2>

In [44]:
""" The Natural Language Toolkit (NLTK) can be used for natural language processing tasks such as \
    tokenization, stemming, and part-of-speech tagging. 
    
    The punkt package contains pre-trained models for sentence segmentation in various languages. \
    ---> The try/except block avoids repeating the download when the package is already present.
    _The word_tokenize() function splits a text string into words and punctuation tokens \
        using a Treebank-style tokenizer from the nltk.tokenize module. 
    _The sent_tokenize() function in the nltk.tokenize module splits text into sentences. \
    It relies on the pre-trained Punkt model, which takes into account 
    common abbreviations, punctuation marks, and other linguistic patterns.
"""
# nltk.download('punkt') 
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt') 

stri = "The science of today is the technology of tomorrow"
stri2 = "The science of today is the technology of tomorrow. Tomorrow is today."
yea = nltk.tokenize.word_tokenize(stri) 
azz = nltk.tokenize.sent_tokenize(stri2)
print(yea)
print(azz)

['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']
['The science of today is the technology of tomorrow.', 'Tomorrow is today.']


In [46]:
""" Removing stopwords means to eliminate from a string the common linking words \
(like pronouns or articles) that contain little informational value.
The set of stopwords contains the words that we want to remove before processing.
"""
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

tokenized_words = ['i', 'am', 'going', 'to', 'go', 'to', 'the', 'store', 'and', 'park']
stop_words = nltk.corpus.stopwords.words('english')
res = [word for word in tokenized_words if word not in stop_words]

print(stop_words[:10])
print(res)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
['going', 'go', 'store', 'park']


<font size="5"> => Stemming </font>

In [47]:
""" Stemming means to reduce a word to its stem by identifying and removing affixes (e.g., the -ing ending of gerunds) \
while keeping the root meaning of the word. 
For example, both “tradition” and “traditional” have “tradit” as their stem, \
indicating that while they are different words they represent the same general concept. 

By stemming our text data, we transform it to something less readable, \
but closer to its base meaning and thus more suitable for comparison across observations.
nltk PorterStemmer implements the widely used Porter stemming algorithm to remove or \
replace common SUFFIXES to produce the word stem.
"""

tokenized_words = ['i', 'am', 'humbled', 'by', 'this', 'traditional', 'meeting']
porter = nltk.stem.porter.PorterStemmer()
stemm = [porter.stem(word) for word in tokenized_words]
stemm

['i', 'am', 'humbl', 'by', 'thi', 'tradit', 'meet']

<font size="6"> Tagging: <br>
</font>

    - NNP Proper noun, singular 
    - NN Noun, singular or mass 
    - RB Adverb 
    - VBD Verb, past tense 
    - VBG Verb, gerund or present participle
    - JJ Adjective 
    - PRP Personal pronoun

In [48]:
""" Tag Part of speech """
try:
    nltk.data.find('taggers/averaged_perceptron_tagger')
except LookupError:
    nltk.download('averaged_perceptron_tagger')

text_data = "Chris loved outdoor running"
text_tagged = nltk.pos_tag(nltk.word_tokenize(text_data))

resu = [word for word, tag in text_tagged if tag in ['NNP','NN','NNS','NNPS']]
print(text_tagged)
print(f"filtered result = {resu}")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/notto4/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]
filtered result = ['Chris']


In [50]:
tweets = ["I am eating a burrito for breakfast", "Political science is an amazing field", "San Francisco is an awesome city"]
tagged_tweets = []
""" Tag each word and each tweet """
for tweet in tweets:
    tweet_tag = nltk.pos_tag(nltk.word_tokenize(tweet))
    tagged_tweets.append([tag for word, tag in tweet_tag])

""" Use one-hot encoding to convert the tags into features """    
one_hot_tweet = MultiLabelBinarizer()
enco_res = one_hot_tweet.fit_transform(tagged_tweets)
print(enco_res)
print("\n", one_hot_tweet.__class__)
print(one_hot_tweet.classes_)

[[1 1 0 1 0 1 1 1 0]
 [1 0 1 1 0 0 0 0 1]
 [1 0 1 1 1 0 0 0 1]]

 <class 'sklearn.preprocessing._label.MultiLabelBinarizer'>
['DT' 'IN' 'JJ' 'NN' 'NNP' 'PRP' 'VBG' 'VBP' 'VBZ']


<font size="6"> 
Taggers: <br>
</font>

    - UnigramTagger 
    - BigramTagger 
    - TrigramTagger 

In [52]:
""" 
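- UnigramTagger: tags each word using that word alone (its most frequent tag in the training data),
- BigramTagger : also uses the tag of the previous word as context
- TrigramTagger : also uses the tags of the two previous words as context
Each tagger falls back to its "backoff" tagger when it has not seen the current context during training.

The list of tagged sentences is retrieved from the Brown Corpus, specifically those in the 'news' category.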
"""
#nltk.download('brown')
try:
    nltk.data.find('corpora/brown')
except LookupError:
    nltk.download('brown')

# Get some text from the Brown Corpus, broken into sentences 
sentences = brown.tagged_sents(categories='news')  
# Split into 4000 sentences for training and 623 for testing 
train, test = sentences[:4000], sentences[4000:]   

unigram = UnigramTagger(train) 
bigram = BigramTagger(train, backoff=unigram) 
trigram = TrigramTagger(train, backoff=bigram) 

# Show the accuracy 
#trigram.evaluate(test)     #deprecated !! 
trigram.accuracy(test) 

0.8174734002697437
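
The unigram idea is simple enough to sketch without NLTK: remember the most frequent tag of every training word and fall back to a default tag for unseen words. The tiny training set, the `unigram_tag` helper and the `"NN"` default are all made up for illustration; this is not how `UnigramTagger` is implemented internally.

```python
from collections import Counter, defaultdict

train_demo = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("run", "NN")],
    [("dogs", "NNS"), ("run", "VBP")],
    [("a", "DT"), ("run", "NN")],
]

# Count how often each word appears with each tag
tag_counts = defaultdict(Counter)
for sentence in train_demo:
    for word, tag in sentence:
        tag_counts[word][tag] += 1

# Keep only the most frequent tag per word
most_likely = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

def unigram_tag(words, default="NN"):
    """Tag each word with its most frequent training tag, else the default."""
    return [(w, most_likely.get(w, default)) for w in words]

tagged = unigram_tag(["the", "run", "cat"])
print(tagged)
```

The default tag plays the role of the last link in a backoff chain: bigram and trigram taggers add context on top, and defer to this simpler rule whenever the context is unseen.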

<font size="5"> Encoding text as bag of words: </font> <br>
Bag-of-words models output a feature for every unique word in the text data, with each feature containing the number of times that word occurs in each observation. <br>
A feature can also be a combination of two consecutive words (called a 2-gram) or even three consecutive words (3-gram). 

In [54]:
""" A common way to transform text into features is a bag-of-words model. 

=> CountVectorizer transforms a collection of texts into vectors of word counts, \
one vector per text.
It creates a matrix in which each unique word is represented by a column of the matrix, \
and each text sample is a row in the matrix. 

=> "ngram_range" sets the minimum and maximum size of the n-grams. 
"""
text_data = np.array(['I love Brazil. Brazil!', 'Sweden is best', 'Italy beats both']) 
count_matri = CountVectorizer() 
bag_of_words = count_matri.fit_transform(text_data)

print(count_matri.get_feature_names_out())
print(bag_of_words)
print(bag_of_words.toarray())

# Create feature matrix with arguments
count_2gram = CountVectorizer(ngram_range=(1,2), stop_words="english", vocabulary=['brazil'])
bag2 = count_2gram.fit_transform(text_data)
print(bag2.toarray())
print(count_2gram.vocabulary_)

['beats' 'best' 'both' 'brazil' 'is' 'italy' 'love' 'sweden']
  (0, 6)	1
  (0, 3)	2
  (1, 7)	1
  (1, 4)	1
  (1, 1)	1
  (2, 5)	1
  (2, 0)	1
  (2, 2)	1
[[0 0 0 2 0 0 1 0]
 [0 1 0 0 1 0 0 1]
 [1 0 1 0 0 1 0 0]]
[[2]
 [0]
 [0]]
{'brazil': 0}
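
The first count matrix can be rebuilt by hand, which also shows why "I" is missing from the vocabulary. The sketch below assumes `CountVectorizer`'s default behaviour of lowercasing and keeping only tokens of two or more word characters; the regular expression is a simplified stand-in for that default token pattern.

```python
import re
from collections import Counter

docs = ['I love Brazil. Brazil!', 'Sweden is best', 'Italy beats both']

# Lowercase, then keep tokens with at least two word characters
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# One column per unique token, in sorted order
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# Count each vocabulary word in each document
counts = [Counter(toks) for toks in tokenized]
bag = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
print(bag)
```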


<h2 style="color:#B0EE8F"> <b> Weighting word importance: </b> </h2>

**Term frequency (tf):** <br>
&emsp;&emsp;&emsp;
The more a word appears in a document, the more likely it is important to that document. <br>
**Document frequency (df):** <br>
&emsp;&emsp;&emsp;
If a word appears in many documents, it is likely less important to any individual document. <br>

By combining these two statistics, it is possible to assign a score to every word representing how important that word is in a document. <br>
Specifically, multiply tf by idf, the inverse document frequency [where t is a word, d is a document, and $n_{d}$ is the number of documents].

$$\text{tf-idf}(t,d) = tf(t,d) \times idf(t)$$

$$idf(t) = \log \frac{1 + n_{d}}{1 + df(d,t)} + 1$$

By default, TfidfVectorizer then rescales each document vector to unit Euclidean (L2) norm.

In [60]:
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)
print("Vocabulary:")
print(tfidf.vocabulary_)
print("Feature_matrix:\n", feature_matrix)
print("\n", feature_matrix.toarray())

Vocabulary:
{'love': 6, 'brazil': 3, 'sweden': 7, 'is': 4, 'best': 1, 'italy': 5, 'beats': 0, 'both': 2}
Feature_matrix:
   (0, 3)	0.8944271909999159
  (0, 6)	0.4472135954999579
  (1, 1)	0.5773502691896257
  (1, 4)	0.5773502691896257
  (1, 7)	0.5773502691896257
  (2, 2)	0.5773502691896257
  (2, 0)	0.5773502691896257
  (2, 5)	0.5773502691896257

 [[0.         0.         0.         0.89442719 0.         0.
  0.4472136  0.        ]
 [0.         0.57735027 0.         0.         0.57735027 0.
  0.         0.57735027]
 [0.57735027 0.         0.57735027 0.         0.         0.57735027
  0.         0.        ]]
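
The 0.894 and 0.447 in the first row can be checked by hand. The sketch below assumes scikit-learn's default smoothed idf, `log((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector; the counts are those of the first document, "I love Brazil. Brazil!".

```python
import math

n_docs = 3
tf = {"brazil": 2, "love": 1}   # term counts in the first document
df = {"brazil": 1, "love": 1}   # number of documents containing each term

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in tf}

# Raw tf-idf scores, then rescale the document vector to unit L2 norm
raw = {t: tf[t] * idf[t] for t in tf}
norm = math.sqrt(sum(v ** 2 for v in raw.values()))
tfidf_row = {t: v / norm for t, v in raw.items()}

print(tfidf_row)
```

Both words occur in a single document, so they share the same idf; after normalization only the 2:1 ratio of their counts remains, giving 2/√5 and 1/√5.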
