#Text Representation

- Text representation converts text data into a numerical format that machines can understand, so that they can analyze and process text by representing words and sentences as vectors.

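- "**Label encoding**" and "**one-hot encoding**" are usually avoided for text in NLP and machine learning because of specific drawbacks: label encoding imposes an arbitrary numeric order on words, and one-hot encoding produces one vocabulary-sized sparse vector per word. "**Bag of Words**," "**TF-IDF**," and "**Word Embedding**" are the preferred alternatives.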
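As a rough illustration of the two encodings mentioned above, the sketch below uses a made-up three-word vocabulary (the words and variable names are assumptions, not part of the dataset used later).

```python
import numpy as np

# Hypothetical three-word vocabulary, used only for illustration.
vocab = ["assistance", "help", "need"]

# Label encoding: each word becomes an integer ID.
# The IDs imply an order (need > help > assistance) that has no linguistic meaning.
label_ids = {word: idx for idx, word in enumerate(vocab)}

# One-hot encoding: each word becomes a vector as long as the vocabulary.
one_hot = np.eye(len(vocab), dtype=int)

sentence = ["need", "help"]
sentence_ids = [label_ids[w] for w in sentence]
sentence_one_hot = one_hot[sentence_ids]  # one row per word in the sentence

print(sentence_ids)
print(sentence_one_hot.shape)
```

With a realistic vocabulary, the second dimension of `sentence_one_hot` grows to the vocabulary size for every single word.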

#Bag Of Words (BOW)
**Limitations**

- It can consume a lot of memory and compute resources. For example, with a vocabulary of 100k words, every document vector has 100k entries. This is still better than one-hot encoding, where every word in a document gets its own 100k-entry vector: a document of 500 words would need 100k * 500 = 50 million entries for that one document.

- It doesn't capture the meaning of words. For example, if one document contains "i need help" and another contains "i need assistance", their vectors differ even though the sentences mean the same thing. Words with the same meaning are treated as unrelated, which also increases the size of the vocabulary and of every document vector.

In [1]:
import pandas as pd
import numpy as np

In [3]:
#Loading dataset
df = pd.read_csv("/content/spam.csv")
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [9]:
#getting unique values from a column
# df['Category'].unique() # this will only return unique values without count
# df.Category.value_counts() # or below code
df['Category'].value_counts()

Category,count
ham,4825
spam,747


This is an imbalanced dataset. Let's first build a machine learning model on it without handling the imbalance.

In [10]:
#create a new column that encodes the category as a number (spam = 1, ham = 0)
df['spam'] = df['Category'].apply(lambda x: 1 if x == 'spam' else 0)

df.head()

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [11]:
df.shape

(5572, 3)

#Creating Training and Testing Dataset

In [12]:
#let's split our dataset into train and test dataset
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.2)
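The split above is random and differs on every run. As an optional variation (not what this notebook ran), `train_test_split` accepts `random_state` for repeatability and `stratify` to keep the class ratio equal in both parts, which can help with imbalanced labels. A sketch on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 80 "ham" (0) and 20 "spam" (1) labels, for illustration only.
toy_X = np.array([f"message {i}" for i in range(100)])
toy_y = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.2,
    random_state=42,   # fixed seed, so the split is repeatable
    stratify=toy_y,    # keep the class ratio the same in both parts
)

print(y_tr.sum(), y_te.sum())
```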

In [16]:
#total training dataset
X_train.shape

(4457,)

In [15]:
# let's see how many spam and ham rows are in the training dataset
y_train.value_counts()

spam,count
0,3873
1,584


In [17]:
#total testing dataset
X_test.shape

(1115,)

In [19]:
#counts of spam and ham in testing dataset
y_test.value_counts()

spam,count
0,952
1,163


In [22]:
#check the first four training samples
X_train[:4]

Unnamed: 0,Message
692,Sorry to trouble u again. Can buy 4d for my da...
5303,"I can. But it will tell quite long, cos i have..."
1424,Lol great now im getting hungry.
409,Headin towards busetop


In [23]:
y_train[:4]

Unnamed: 0,spam
692,0
5303,0
1424,0
409,0


#Creating Bag of Words Representation Using CountVectorizer

In [24]:
type(X_train.values)

numpy.ndarray

In [25]:
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()

X_train_cv = v.fit_transform(X_train.values)
X_train_cv

<4457x7775 sparse matrix of type '<class 'numpy.int64'>'
	with 59194 stored elements in Compressed Sparse Row format>

CountVectorizer turns X_train into a bag-of-words matrix with 4457 rows (one per message) and 7775 columns (one per vocabulary word).
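The summary above reports 59194 stored elements. A rough calculation, assuming 8 bytes per int64 cell and ignoring the overhead of the sparse format itself, suggests why the matrix is kept sparse:

```python
# Figures taken from the sparse-matrix summary above.
n_rows, n_cols = 4457, 7775
stored = 59194

total_cells = n_rows * n_cols
density = stored / total_cells

# Assumes 8 bytes per cell (int64), as reported in the summary.
dense_megabytes = total_cells * 8 / 1e6

print(total_cells)
print(f"{density:.4%} of cells are non-zero")
print(f"about {dense_megabytes:.0f} MB if stored densely")
```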

In [38]:
X_train_np = X_train_cv.toarray()
X_train_np

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [27]:
X_train_cv.shape

(4457, 7775)

We have 4457 messages, and each one is represented by a vector of 7775 word counts.

In [29]:
#to see bag of words or vocabulary
v.get_feature_names_out()[1000:1050]

array(['anthony', 'anti', 'any', 'anybody', 'anyhow', 'anymore', 'anyone',
       'anyplaces', 'anythiing', 'anythin', 'anything',
       'anythingtomorrow', 'anytime', 'anyway', 'anyways', 'anywhere',
       'aom', 'apart', 'apartment', 'apes', 'apeshit', 'aphex', 'apnt',
       'apo', 'apologise', 'apologize', 'app', 'apparently', 'appeal',
       'appear', 'appendix', 'apples', 'application', 'apply', 'applyed',
       'appointment', 'appointments', 'appreciate', 'appreciated',
       'approaching', 'appropriate', 'approve', 'approved', 'approx',
       'apps', 'appt', 'appy', 'april', 'aproach', 'apt'], dtype=object)

In [30]:
#this shows the attributes and methods available on our CountVectorizer object
dir(v)

['_CountVectorizer__metadata_request__fit',
 '_CountVectorizer__metadata_request__transform',
 '__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__sklearn_clone__',
 '__sklearn_tags__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_build_request_for_signature',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_feature_names',
 '_check_n_features',
 '_check_stop_words_consistency',
 '_check_vocabulary',
 '_count_vocab',
 '_doc_link_module',
 '_doc_link_template',
 '_doc_link_url_param_generator',
 '_get_default_requests',
 '_get_doc_link',
 '_get_metadata_request',
 '_get_param_names',
 '_get_tags',
 '_limit_features',
 '_more_tags',
 '_parameter_constraints',
 '_r

In [31]:
#vocabulary_ is an attribute that maps each word to its column index
v.vocabulary_

{'sorry': 6350,
 'to': 6956,
 'trouble': 7069,
 'again': 871,
 'can': 1638,
 'buy': 1584,
 '4d': 519,
 'for': 2966,
 'my': 4673,
 'dad': 2156,
 '1405': 294,
 '1680': 313,
 '1843': 318,
 'all': 919,
 'big': 1351,
 'small': 6260,
 'sat': 5922,
 'sun': 6618,
 'thanx': 6828,
 'but': 1580,
 'it': 3777,
 'will': 7548,
 'tell': 6774,
 'quite': 5540,
 'long': 4219,
 'cos': 2030,
 'haven': 3378,
 'finish': 2880,
 'film': 2867,
 'yet': 7723,
 'lol': 4212,
 'great': 3259,
 'now': 4854,
 'im': 3633,
 'getting': 3155,
 'hungry': 3572,
 'headin': 3390,
 'towards': 7022,
 'busetop': 1576,
 'nope': 4824,
 'since': 6192,
 'ayo': 1190,
 'travelled': 7049,
 'he': 3387,
 'has': 3364,
 'forgotten': 2979,
 'his': 3463,
 'guy': 3308,
 'freemsg': 3016,
 'month': 4592,
 'unlimited': 7198,
 'free': 3010,
 'calls': 1628,
 'activate': 811,
 'smartcall': 6263,
 'txt': 7117,
 'call': 1613,
 'no': 4803,
 '68866': 597,
 'subscriptn3gbp': 6584,
 'wk': 7585,
 'help': 3421,
 '08448714184': 67,
 'stop': 6515,
 'landlineo

In [37]:
v.get_feature_names_out()[1315]

'believe'

In [39]:
#let's check which columns are not 0 in first email
np.where(X_train_np[0] != 0)

(array([ 294,  313,  318,  519,  871,  919, 1351, 1584, 1638, 2156, 2966,
        4673, 5922, 6260, 6350, 6618, 6828, 6956, 7069]),)

In [42]:
X_train[:4][692]

'Sorry to trouble u again. Can buy 4d for my dad again? 1405, 1680, 1843. All 2 big 1 small, sat n sun. Thanx.'

In [45]:
#at column index 6350 the word is "sorry"
v.get_feature_names_out()[6350]

'sorry'

Now that we have our bag of words, we can build the machine learning model.

#Naive Bayes Model

In [46]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_cv, y_train)

The model is now trained. To check its performance, we have to convert X_test into the same bag-of-words format using the already fitted CountVectorizer.
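Test data is passed through `transform`, not `fit_transform`, so that it uses the vocabulary learned from the training data. A minimal sketch on made-up sentences shows that words unseen during training are simply ignored:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free prize now", "see you at lunch"]
test_docs = ["free lunch tomorrow"]  # "tomorrow" never appears in training

toy_v = CountVectorizer()
train_bow = toy_v.fit_transform(train_docs)   # learns the vocabulary
test_bow = toy_v.transform(test_docs)         # reuses it, learns nothing new

print(train_bow.shape, test_bow.shape)
print(test_bow.toarray())
```

Both matrices have the same number of columns, which is what the trained model expects.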

In [47]:
X_test_cv = v.transform(X_test)

In [49]:
#evaluating performance of the model
from sklearn.metrics import classification_report

y_pred = model.predict(X_test_cv)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       952
           1       0.96      0.94      0.95       163

    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115



When dealing with an imbalanced dataset, precision, recall, and especially the F1-score are preferred over simple accuracy. Accuracy can look high even if a model mostly predicts the majority class, whereas recall measures how many actual spam messages are caught (the true positive rate), precision measures how many messages flagged as spam really are spam, and the F1-score combines the two.
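To make this concrete, consider a hypothetical "model" that labels every message as ham. The sketch below uses only the class counts of the test split (952 ham, 163 spam); the always-ham predictor is an assumption for illustration, not something the notebook trains.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Class counts taken from the test split above: 952 ham (0), 163 spam (1).
y_true = np.array([0] * 952 + [1] * 163)

# Hypothetical "model" that never predicts spam.
y_always_ham = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_ham)
spam_recall = recall_score(y_true, y_always_ham, zero_division=0)
spam_f1 = f1_score(y_true, y_always_ham, zero_division=0)

print(f"accuracy={acc:.3f} recall={spam_recall:.3f} f1={spam_f1:.3f}")
```

Accuracy stays around 85% although not a single spam message is detected, while recall and F1 for the spam class drop to zero.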

#Testing Model

In [52]:
emails = [
    'Hey Mike, can we catch up for coffee this afternoon?',
    'Congratulations! You have been selected for a $500 gift card. Claim it now!',
    'Reminder: Team meeting at 3 PM today. Don’t be late!',
    'Limited time offer: Buy 1 get 1 free on all electronics! Shop now!',
    'You’ve won a free vacation package! Claim your prize before it expires!',
    'Hi Sara, I’ve attached the report you asked for. Let me know if you need any changes.',
    'Urgent: Your account has been compromised. Click here to secure it immediately!',
    'Hey Tom, want to grab dinner this weekend?',

]

emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1, 0, 1, 1, 0, 1, 0])

#Model with Pipeline

To reduce the number of lines of code, we can use a pipeline.

In [56]:
from sklearn.pipeline import Pipeline

#creating pipeline with CountVectorizer and Multinomial Naive Bayes model
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In this code, we use a Pipeline from sklearn to streamline the process of training a model. When fit is called with X_train (input data) and y_train (target labels), the following steps run in order:

1. Vectorization: The input data (X_train) is first passed through the CountVectorizer, which converts the text data into numerical feature vectors.
2. Model Training: The transformed data, together with the target labels (y_train), is then passed to the MultinomialNB model, which is fitted on it.

The pipeline keeps the code compact and ensures that the same steps are executed in the same sequence for both training and prediction.
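As a sanity check of the description above, the sketch below compares the manual two-step approach with a pipeline on a tiny made-up corpus (texts and labels are assumptions for illustration). Both should give identical predictions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, for illustration only (1 = spam, 0 = ham).
texts = ["win a free prize now", "free cash offer",
         "lunch at noon?", "see you at the meeting"]
labels = [1, 1, 0, 0]
new_texts = ["free prize offer", "meeting at noon"]

# Manual version: vectorize, then fit the classifier.
vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(texts), labels)
manual_pred = nb.predict(vec.transform(new_texts))

# Pipeline version: the same two steps behind one fit/predict call.
pipe = Pipeline([("vectorizer", CountVectorizer()), ("nb", MultinomialNB())])
pipe.fit(texts, labels)
pipe_pred = pipe.predict(new_texts)

print(manual_pred, pipe_pred)
```

Individual steps remain accessible by name, for example `pipe.named_steps["vectorizer"].vocabulary_`.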

In [54]:
clf.fit(X_train, y_train)

In [55]:
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       952
           1       0.96      0.94      0.95       163

    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115



#Exercise

*  In this exercise, you are going to classify whether a given movie review is positive or negative.
*  You are going to use Bag of Words for pre-processing the text and apply different classification algorithms.
*  Sklearn's CountVectorizer provides a built-in implementation of Bag of Words.

In [57]:
#Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

About Data: IMDB Dataset

Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download

* This data consists of two columns: review and sentiment.
* The review column holds the statements given by users after watching the movie.
* The sentiment column tells whether the given review is positive or negative.

In [73]:
# 1. read the data provided in the same directory with name 'movies_sentiment_data.csv' and store it in df variable
df = pd.read_csv("/content/movies_sentiment_data.csv")


#2. print the shape of the data
print(df.shape)

#3. print top 5 datapoints
df.head()

(19000, 2)


Unnamed: 0,review,sentiment
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive
1,I enjoyed the movie and the story immensely! I...,positive
2,I had a hard time sitting through this. Every ...,negative
3,It's hard to imagine that anyone could find th...,negative
4,This is one military drama I like a lot! Tom B...,positive


In [61]:
#create a new column "Category": 1 if the sentiment is positive, 0 if it is negative
df['Category'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

In [62]:
#check the distribution of 'Category' and see whether the Target labels are balanced or not.
df['Category'].value_counts()

Category,count
1,9500
0,9500


In [63]:
#Do the 'train-test' splitting with test size of 20%
X_train, X_test, y_train, y_test = train_test_split(df.review, df.Category, test_size=0.2)

**Exercise-1**

1. Using the sklearn Pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**

* use CountVectorizer for pre-processing the text.

* use Random Forest as the classifier with estimators as 50 and criterion as entropy.

* print the classification report.

In [83]:
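#1. create a pipeline object
#   note: the exercise asks for 50 estimators; this run used n_estimators=100,
#   and the report below reflects that setting
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('rfc', RandomForestClassifier(n_estimators=100, criterion='entropy'))
])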



#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.84      0.84      1845
           1       0.85      0.85      0.85      1955

    accuracy                           0.85      3800
   macro avg       0.85      0.85      0.85      3800
weighted avg       0.85      0.85      0.85      3800



**Exercise-2**

1. Using the sklearn Pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**

* use CountVectorizer for pre-processing the text.
* use KNN as the classifier with n_neighbors of 10 and metric as 'euclidean'.
* print the classification report.

In [81]:
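#1. create a pipeline object
#   note: the exercise asks for n_neighbors of 10; this run used n_neighbors=20,
#   and the report below reflects that setting
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('knn', KNeighborsClassifier(n_neighbors=20, metric='euclidean'))
])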

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.65      0.62      0.64      1845
           1       0.66      0.69      0.67      1955

    accuracy                           0.66      3800
   macro avg       0.66      0.65      0.65      3800
weighted avg       0.66      0.66      0.66      3800



**Exercise-3**

Using the sklearn Pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**

* use CountVectorizer for pre-processing the text.
* use Multinomial Naive Bayes as the classifier.
* print the classification report.

In [67]:
#1. create a pipeline object

clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('mnb', MultinomialNB())
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.88      0.85      1845
           1       0.88      0.81      0.84      1955

    accuracy                           0.84      3800
   macro avg       0.85      0.85      0.84      3800
weighted avg       0.85      0.84      0.84      3800



**Can you write some observations on why a model like KNN fails to produce good results, unlike RandomForest and MultinomialNB?**

* As machine learning algorithms do not work on text data directly, we need to convert the text into numeric vectors and feed those into the models during training.
* In this process, Bag of Words converts each text into a very high-dimensional, mostly zero numeric vector.
* A model like K-Nearest Neighbours (KNN) doesn't work well with high-dimensional data. With a large number of dimensions, distances between points become less informative (the "curse of dimensionality"), so the nearest neighbours of a review are not reliably similar to it. Calculating distances over so many dimensions is also expensive, which slows down prediction.
* The easy calculation of word probabilities from the corpus (Bag of Words) and their storage in a contingency table is the major reason Multinomial Naive Bayes is a text-classification-friendly algorithm.
* Random Forest uses bootstrapping (row sampling) and random feature (column) sampling across many decision trees. This reduces the high variance and overfitting that individual trees show on high-dimensional data, and each split picks the words that best separate the categories.
* Machine learning is a trial-and-error process, much like the scientific method: we try the candidate algorithms and select the one that gives good results and satisfies requirements such as latency and interpretability.
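The KNN observation above can be illustrated with a small experiment. This sketch uses uniformly random points rather than real bag-of-words vectors (an assumption made for simplicity), and the helper `relative_contrast` is a hypothetical name. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=200):
    """(farthest - nearest) / nearest distance from one query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast_by_dim = {d: relative_contrast(d) for d in (2, 100, 5000)}
for d, c in contrast_by_dim.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

As the number of dimensions grows, the contrast shrinks: the nearest and farthest points end up at almost the same distance, so "nearest" carries little information.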