# Auditory Exercise 10: NLP

Text vectorization is the process of converting text into a numerical representation that a model can work with.

This exercise covers four methods.

**Binary Term Frequency** - captures the presence (1) or absence (0) of a term in a document.

**Bag of Words (BoW) Term Frequency** - captures how many times a term occurs in a document.

**Normalized Term Frequency (L1)** - captures BoW term frequencies scaled so that the values of each document sum to 1.

**Normalized TF-IDF (Term Frequency-Inverse Document Frequency) (L2)** - captures TF-IDF weights scaled so that each document vector has unit Euclidean length.

### Binary Term Frequency

In [2]:
import pandas as pd

In [3]:
corpus = ["This is a brown house. This house is big. The street number is 1.",
          "This is a small house. This house has 1 bedroom. The street number is 12.",
          "This dog is brown. This dog likes to play.",
          "The dog is in the bedroom."]

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer


# Binary vectorization: 1 if the word appears in the document, 0 if it does not
tv = TfidfVectorizer(binary=True, norm=None, use_idf=False, smooth_idf=False, lowercase=True, stop_words='english', token_pattern=r'(?u)\b[A-Za-z]+\b', min_df=1, max_df=1.0, max_features=None, ngram_range=(1, 1))
  

In [9]:
data = pd.DataFrame(tv.fit_transform(corpus).toarray(), columns=tv.get_feature_names_out())


In [10]:
data

# Row 0 corresponds to: This is a brown house. This house is big. The street number is 1.

Unnamed: 0,bedroom,big,brown,dog,house,likes,number,play,small,street
0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
1,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0
2,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
3,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


✔ Binary vectorization assigns 1 if the word appears in the document, otherwise 0.
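To see what the vectorizer is doing, the binary table can be rebuilt with the standard library alone. This is a minimal sketch, not scikit-learn's implementation: `tokenize` and `STOP_WORDS` are hypothetical names, and the stop list is a hand-picked stand-in for scikit-learn's English list that happens to cover this corpus.

```python
import re

corpus = [
    "This is a brown house. This house is big. The street number is 1.",
    "This is a small house. This house has 1 bedroom. The street number is 12.",
    "This dog is brown. This dog likes to play.",
    "The dog is in the bedroom.",
]

# Assumption: a small hand-picked stop list, not scikit-learn's full English list.
STOP_WORDS = {"this", "is", "a", "the", "to", "in", "has"}


def tokenize(text):
    """Lowercase the text, keep purely alphabetic tokens, drop stop words."""
    tokens = re.findall(r"\b[A-Za-z]+\b", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


docs = [tokenize(text) for text in corpus]

# Columns are the sorted set of all remaining tokens.
vocabulary = sorted({token for doc in docs for token in doc})

# 1 if the term occurs in the document at least once, otherwise 0.
binary = [[int(term in doc) for term in vocabulary] for doc in docs]

print(vocabulary)
for row in binary:
    print(row)
```

Because the token pattern only matches letters, the numbers 1 and 12 never become columns, which is why they are missing from the table.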

### Bag of Words Term Frequency

In [11]:
tv = TfidfVectorizer(binary=False, norm=None, use_idf=False, smooth_idf=False, lowercase=True, stop_words='english', token_pattern=r'(?u)\b[A-Za-z]+\b', min_df=1, max_df=1.0, max_features=None, ngram_range=(1, 1))


In [12]:
data = pd.DataFrame(tv.fit_transform(corpus).toarray(), columns=tv.get_feature_names_out())


In [13]:
data

Unnamed: 0,bedroom,big,brown,dog,house,likes,number,play,small,street
0,0.0,1.0,1.0,0.0,2.0,0.0,1.0,0.0,0.0,1.0
1,1.0,0.0,0.0,0.0,2.0,0.0,1.0,0.0,1.0,1.0
2,0.0,0.0,1.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0
3,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


✔ Counts how many times each word appears in a document.
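The same counts can be reproduced by hand with `collections.Counter`. In this sketch the token lists are written out manually, assuming lowercasing and stop-word removal have already been applied to the four documents.

```python
from collections import Counter

# Tokens per document after lowercasing and stop-word removal (typed in by hand).
docs = [
    ["brown", "house", "house", "big", "street", "number"],
    ["small", "house", "house", "bedroom", "street", "number"],
    ["dog", "brown", "dog", "likes", "play"],
    ["dog", "bedroom"],
]

vocabulary = sorted({token for doc in docs for token in doc})

counts = []
for doc in docs:
    freq = Counter(doc)  # term -> number of occurrences in this document
    counts.append([freq[term] for term in vocabulary])

for row in counts:
    print(row)
```

The only difference from the binary representation is that repeated words such as "house" and "dog" now contribute 2 instead of 1.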

### Normalized Term Frequency

In [14]:
tv = TfidfVectorizer(binary=False, norm='l1', use_idf=False, smooth_idf=False, lowercase=True, stop_words='english', token_pattern=r'(?u)\b[A-Za-z]+\b', min_df=1, max_df=1.0, max_features=None, ngram_range=(1, 1))


In [15]:
data = pd.DataFrame(tv.fit_transform(corpus).toarray(), columns=tv.get_feature_names_out())


In [16]:
data

Unnamed: 0,bedroom,big,brown,dog,house,likes,number,play,small,street
0,0.0,0.166667,0.166667,0.0,0.333333,0.0,0.166667,0.0,0.0,0.166667
1,0.166667,0.0,0.0,0.0,0.333333,0.0,0.166667,0.0,0.166667,0.166667
2,0.0,0.0,0.2,0.4,0.0,0.2,0.0,0.2,0.0,0.0
3,0.5,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0


✔ Scales the word frequency so that the sum of all word frequencies in a document equals 1.

✔ Ensures documents of different lengths have comparable word importance.
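The L1 scaling can be checked by hand on the count matrix from the Bag of Words table: each row is divided by the sum of its entries. The helper name `l1_normalize` is hypothetical.

```python
# Count matrix copied from the Bag of Words table
# (columns: bedroom, big, brown, dog, house, likes, number, play, small, street).
counts = [
    [0, 1, 1, 0, 2, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 2, 0, 1, 0, 1, 1],
    [0, 0, 1, 2, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
]


def l1_normalize(row):
    """Divide every entry by the sum of absolute values of the row."""
    total = sum(abs(value) for value in row)
    return [value / total for value in row] if total else list(row)


l1 = [l1_normalize(row) for row in counts]

for row in l1:
    print([round(value, 6) for value in row])
```

The last document has only two tokens, so each gets weight 0.5, while "house" in the first document gets 2/6 even though it occurs twice.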

### Normalized TF-IDF

In [22]:
tv = TfidfVectorizer(binary=False, norm='l2', use_idf=False, smooth_idf=False, lowercase=True, stop_words='english', token_pattern=r'(?u)\b[A-Za-z]+\b', min_df=1, max_df=1.0, max_features=None, ngram_range=(1, 1))


In [23]:
data = pd.DataFrame(tv.fit_transform(corpus).toarray(), columns=tv.get_feature_names_out())


In [24]:
data

Unnamed: 0,bedroom,big,brown,dog,house,likes,number,play,small,street
0,0.0,0.353553,0.353553,0.0,0.707107,0.0,0.353553,0.0,0.0,0.353553
1,0.353553,0.0,0.0,0.0,0.707107,0.0,0.353553,0.0,0.353553,0.353553
2,0.0,0.0,0.377964,0.755929,0.0,0.377964,0.0,0.377964,0.0,0.0
3,0.707107,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0


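✔ TF-IDF (Term Frequency-Inverse Document Frequency) assigns more weight to words that appear frequently in a document but are rare across all documents.

✔ Note that the cell above passes use_idf=False, so the table shows L2-normalized term frequencies only; setting use_idf=True applies the IDF weighting.

✔ The L2 norm scales each document vector so that the sum of squares of its values is 1.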
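To make the IDF step concrete, here is a standard-library sketch that first L2-normalizes the raw counts (which reproduces the table above) and then repeats the computation with an unsmoothed IDF weight, idf(t) = ln(n / df(t)) + 1. That is the formula scikit-learn documents for smooth_idf=False; the variable and helper names are hypothetical.

```python
import math

# Count matrix copied from the Bag of Words table
# (columns: bedroom, big, brown, dog, house, likes, number, play, small, street).
counts = [
    [0, 1, 1, 0, 2, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 2, 0, 1, 0, 1, 1],
    [0, 0, 1, 2, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])


def l2_normalize(row):
    """Divide every entry by the Euclidean length of the row."""
    norm = math.sqrt(sum(value * value for value in row))
    return [value / norm for value in row] if norm else list(row)


# Variant 1: L2-normalized raw counts, no IDF weighting.
l2_tf = [l2_normalize(row) for row in counts]

# Variant 2: weight each count by its IDF before normalizing.
# Document frequency: number of documents that contain the term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
idf = [math.log(n_docs / d) + 1 for d in df]
tfidf = [l2_normalize([tf * w for tf, w in zip(row, idf)]) for row in counts]

print([round(value, 6) for value in l2_tf[0]])
print([round(value, 6) for value in tfidf[0]])
```

In the first document "big" and "brown" both occur once, so they tie without IDF. With IDF, "big" (found in one document) outweighs "brown" (found in two).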

### Text Classification

In [17]:
!gdown 1rmX4GzVy9kKzwPjtaC0WYR34iYmb7Beu

Downloading...
From: https://drive.google.com/uc?id=1rmX4GzVy9kKzwPjtaC0WYR34iYmb7Beu
To: C:\Users\imomc\Desktop\VNP\VNP\AV\SPAM text message 20170820 - Data.csv

  0%|          | 0.00/486k [00:00<?, ?B/s]
100%|##########| 486k/486k [00:00<00:00, 1.79MB/s]


In [27]:
data = pd.read_csv('SPAM text message 20170820 - Data.csv')

In [28]:
data

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [29]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(data['Message'],  data['Category'], test_size=0.2)


In [30]:
from collections import Counter

# Check whether the dataset is imbalanced

print(f"Training class distributions summary: {Counter(Y_train)}")
print(f"Test class distributions summary: {Counter(Y_test)}")

Training class distributions summary: Counter({'ham': 3831, 'spam': 626})
Test class distributions summary: Counter({'ham': 994, 'spam': 121})
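A plain random split can leave the two parts with different class ratios: the counts above work out to about 14% spam in the training set and about 11% in the test set. One option is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. The sketch below uses toy messages and labels (both invented for illustration) and a fixed `random_state` so it can be repeated.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 placeholder messages, 15% of them labelled spam.
messages = [f"message {i}" for i in range(100)]
labels = ["spam"] * 15 + ["ham"] * 85

X_tr, X_te, y_tr, y_te = train_test_split(
    messages,
    labels,
    test_size=0.2,
    stratify=labels,   # preserve the spam/ham ratio in both parts
    random_state=0,    # fixed seed, only to make the sketch repeatable
)

print(f"Training: {Counter(y_tr)}")
print(f"Test: {Counter(y_te)}")
```

With stratification both parts keep the 15% spam share, so the metrics on the test set are measured on the same class balance the model was trained on.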


In [31]:
#Note: make_pipeline() builds a Pipeline from the given estimators and names each step automatically.
#Note: A pipeline applies its steps in sequence, so a single fit() or predict() call runs the vectorizer and then the classifier.
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, Y_train)

y_pred = model.predict(X_test)
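As a rough illustration of what the pipeline saves us from writing, the sketch below fits the same two steps once through `make_pipeline` and once by hand, then checks that both give the same predictions. The four training messages and two new messages are toy data invented for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (assumption): invented messages, not taken from the SMS dataset.
train_texts = [
    "win a free prize now",
    "free entry win cash",
    "are we meeting for lunch",
    "see you at home tonight",
]
train_labels = ["spam", "spam", "ham", "ham"]
new_texts = ["free cash prize", "lunch at home"]

# Pipeline version: one object handles vectorizing and classifying.
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(train_texts, train_labels)
pipeline_pred = pipe.predict(new_texts)

# Manual version: fit the vectorizer on training text, then reuse it with transform().
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(train_texts)
classifier = MultinomialNB().fit(X_train_vec, train_labels)
manual_pred = classifier.predict(vectorizer.transform(new_texts))

print(list(pipe.named_steps))   # step names generated by make_pipeline
print(list(pipeline_pred) == list(manual_pred))
```

The manual version makes one detail explicit: new text goes through `transform`, never `fit_transform`, so the vocabulary learned from the training set is reused.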

In [32]:

from imblearn.metrics import classification_report_imbalanced

print(classification_report_imbalanced(Y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

        ham       0.96      1.00      0.67      0.98      0.82      0.69       994
       spam       1.00      0.67      1.00      0.80      0.82      0.65       121

avg / total       0.97      0.96      0.71      0.96      0.82      0.69      1115
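The columns of this report can be recomputed from a confusion matrix. The sketch below does so for the spam class. The four counts are illustrative values chosen to roughly match the rounded report; they are not read from the actual confusion matrix. The index balanced accuracy line assumes the weighting alpha = 0.1.

```python
import math

# Illustrative counts for the "spam" class (assumption: picked to roughly match
# the rounded report above, not taken from the real confusion matrix).
tp, fn = 81, 40    # spam caught / spam missed
fp, tn = 0, 994    # ham flagged as spam / ham passed correctly

precision = tp / (tp + fp)       # pre: how many flagged messages are really spam
recall = tp / (tp + fn)          # rec: share of spam that was caught (sensitivity)
specificity = tn / (tn + fp)     # spe: share of ham that was passed correctly
f1 = 2 * precision * recall / (precision + recall)
geo = math.sqrt(recall * specificity)   # geometric mean of sensitivity and specificity

alpha = 0.1                      # assumed weighting for the dominance term
iba = (1 + alpha * (recall - specificity)) * recall * specificity

for name, value in [("pre", precision), ("rec", recall), ("spe", specificity),
                    ("f1", f1), ("geo", geo), ("iba", iba)]:
    print(f"{name}: {value:.2f}")
```

Read this way, the report says the model never mislabels ham as spam but misses roughly a third of the spam, which the high averaged precision and recall hide because ham dominates the test set.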



In [33]:

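#Note: Undersampling balances an imbalanced dataset by keeping all samples of the minority class and randomly discarding samples of the majority class.
#Note: In an imblearn pipeline the sampler runs only during fit(), so the test set keeps its original class distribution.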
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from imblearn.under_sampling import RandomUnderSampler

model = make_pipeline_imb(TfidfVectorizer(), RandomUnderSampler(), MultinomialNB())
model.fit(X_train, Y_train)

y_pred = model.predict(X_test)

In [34]:

print(classification_report_imbalanced(Y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

        ham       1.00      0.96      0.98      0.98      0.97      0.94       994
       spam       0.76      0.98      0.96      0.86      0.97      0.94       121

avg / total       0.97      0.96      0.97      0.97      0.97      0.94      1115
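The idea behind the undersampling step can be sketched in a few lines of plain Python. This is not imblearn's implementation: `random_undersample` is a hypothetical helper, and the messages and labels are toy data.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Keep every minority-class sample and a random subset of the other classes."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    # Every class is cut down to the size of the smallest one.
    target = min(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        kept = items if len(items) == target else rng.sample(items, target)
        out_samples.extend(kept)
        out_labels.extend([label] * len(kept))
    return out_samples, out_labels


# Toy data (assumption): 12 ham and 3 spam placeholder messages.
labels = ["ham"] * 12 + ["spam"] * 3
samples = [f"message {i}" for i in range(len(labels))]

X_bal, y_bal = random_undersample(samples, labels)
print(Counter(y_bal))
```

Balancing the classes this way discards most of the ham messages. That fits the second report: spam recall rises from 0.67 to 0.98 while spam precision drops from 1.00 to 0.76.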

