Importing necessary modules

In [35]:
import pandas as pd


In [36]:
df = pd.read_csv('C:/Users/Marc/Dropbox/06_ESCP/01_Uni/04_Term 2/10_NLP with Python/spam.csv')

Inspecting the dataset

In [37]:
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [38]:
# Rename columns v1 / v2 to class / sms
df.rename(columns = {'v1': 'class', 'v2': 'sms'}, inplace = True)

In [39]:
# Replace ham with 0, and spam with 1
df['class'] = df['class'].map({'ham': 0, 'spam':1})

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   class       5572 non-null   int64 
 1   sms         5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: int64(1), object(4)
memory usage: 217.8+ KB


The unnamed columns are at most about 1% non-NaN (50 of 5572 rows), so they are dropped from here on
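
The share of non-missing values per column can also be computed directly instead of being read off `df.info()`. A minimal sketch on a hypothetical toy frame (`toy`, its values, and the 50% threshold are assumptions made for illustration; only the column names mirror the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the layout of the spam data
toy = pd.DataFrame({
    'class': [0, 1, 0, 0],
    'sms': ['hi', 'win cash now', 'ok', 'see you'],
    'Unnamed: 2': [np.nan, 'extra', np.nan, np.nan],
})

# Fraction of non-missing values per column
non_null_share = toy.notna().mean()
print(non_null_share)

# Keep only columns that are at least 50% populated (threshold is an assumption)
kept = toy.loc[:, non_null_share >= 0.5]
print(list(kept.columns))
```

On the real data the same `notna().mean()` call would show how sparse the unnamed columns are before they are dropped.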

In [41]:
df = df[['class', 'sms']]

Split the dataset into training and test sets

In [42]:
from sklearn.model_selection import train_test_split

In [43]:
x_train, x_test, y_train, y_test = train_test_split(df['sms'], df['class'], random_state=42, test_size = 0.2)

# CountVectorizer
https://www.youtube.com/watch?v=lBO1L8pgR9s&t=284s&ab_channel=UnfoldDataScience
1. It looks at a first set of texts (in our case all the train sms) and builds a vocabulary of all words used in them
2. For each sms it is applied to (train or test), it counts how many times each word of the vocabulary occurs
3. A classifier then predicts a category based on these counts
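
The first two steps can be sketched on a tiny corpus. The documents below are made up for illustration and are not taken from the dataset; the point is only that words unseen during `fit` are ignored by `transform`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus standing in for the training messages
train_docs = ['free entry win', 'ok see you', 'win win cash']
new_doc = ['win free cash prize']  # 'prize' was never seen during fit

cv = CountVectorizer()
cv.fit(train_docs)                        # step 1: learn the vocabulary
vocab = sorted(cv.vocabulary_)            # alphabetical order = column order
counts = cv.transform(new_doc).toarray()  # step 2: count known words only

print(vocab)
print(counts)
```

Each column of `counts` corresponds to one entry of `vocab`, and there is no column for `'prize'` because it was not part of the fitted vocabulary.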

In [44]:
from sklearn.feature_extraction.text import CountVectorizer

In [45]:
# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit to vector
vectorizer.fit(x_train)

# See word vector
print(vectorizer.vocabulary_)



In [46]:
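# Get feature names at positions 1000-1004
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
print(vectorizer.get_feature_names()[1000:1005])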

['apartment', 'aphex', 'apnt', 'apo', 'apologetic']


Create vector out of training data

Here, we have already used all the messages of the train set to learn which words are in the dataset and to build the vocabulary. Now we count how many times each of these words occurs in each sms of the train set.

In [47]:
vectorized_train = vectorizer.transform(x_train)

print(vectorized_train.toarray())

[[0 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


Create a classifier with MultinomialNB (M_NB)

In [48]:
from sklearn.naive_bayes import MultinomialNB

In [49]:
nb = MultinomialNB()

In [50]:
nb.fit(vectorized_train, y_train)

MultinomialNB()

In [51]:
vectorized_test = vectorizer.transform(x_test)

In [52]:
y_pred = nb.predict(vectorized_test)

In [53]:
outcome = pd.DataFrame(y_pred)

In [54]:
outcome.rename(columns= {0:'y_pred'}, inplace = True)

In [55]:
outcome['y_pred'].value_counts()

0    979
1    136
Name: y_pred, dtype: int64

Scoring M_NB Classifier

In [56]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import numpy as np

In [57]:
#Accuracy
print('Accuracy of the model is {}'.format(round(accuracy_score(y_test,outcome['y_pred']),3)),'\n')

print('The classification report :','\n',classification_report(y_test,outcome['y_pred']))

Accuracy of the model is 0.984 

The classification report : 
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       0.99      0.89      0.94       150

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115



LogReg Classifier

In [58]:
from sklearn.linear_model import LogisticRegression

In [59]:
lg = LogisticRegression()

#train model
lg.fit(vectorized_train, y_train)

#predict
lg_pred = dict()
lg_pred['y_pred'] = lg.predict(vectorized_test)

#outcome
print('Accuracy of the model logreg with CountVectorizer is {}'.format(round(accuracy_score(y_test,lg_pred['y_pred']),3)),'\n')
print(classification_report(y_test,lg_pred['y_pred']))

Accuracy of the model logreg with CountVectorizer is 0.978 

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       1.00      0.84      0.91       150

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115



RandomForestClassifier

In [60]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [61]:
rf = RandomForestClassifier(random_state=42)

#Define Params
parameters = {
    'n_estimators': [5, 100],
    'max_depth' : [3,4,5,6],
}

#GridsearchCV
gcv = GridSearchCV(estimator=rf,param_grid=parameters)
gcv.fit(vectorized_train,y_train)

#Return the best parameters
print(gcv.best_params_)

{'max_depth': 6, 'n_estimators': 5}
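
As a side note, `GridSearchCV` refits the best configuration on the full training data by default (`refit=True`), so the fitted model is available as `best_estimator_` and keeps the `random_state` used during the search. A minimal sketch on synthetic stand-in data (the data from `make_classification`, the names `search` and `best_rf`, and `scoring='precision'` are assumptions; the scoring choice follows the precision argument in the conclusion of this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the SMS vectors)
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {'n_estimators': [5, 100], 'max_depth': [3, 4, 5, 6]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid=param_grid, scoring='precision', cv=5)
search.fit(X_tr, y_tr)

# With refit=True (the default) the best model is already retrained on all of X_tr
best_rf = search.best_estimator_
print(search.best_params_)
print(best_rf.score(X_te, y_te))
```

Reusing `best_estimator_` avoids rebuilding the forest by hand with a different random seed than the one the search evaluated.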


In [62]:
#Create the model with the right parameters
rf_ = RandomForestClassifier(n_estimators=5,max_depth=6)

#Fit the new model
rf_.fit(vectorized_train, y_train)

#Predict the target variable
pred_rf = dict()
pred_rf['y_pred'] = rf_.predict(vectorized_test)

#Outcome
print('Accuracy of the model is {}'.format(round(accuracy_score(y_test,pred_rf['y_pred']),3)),'\n')
print(classification_report(y_test,pred_rf['y_pred']))


Accuracy of the model is 0.9 

              precision    recall  f1-score   support

           0       0.90      1.00      0.95       965
           1       1.00      0.25      0.40       150

    accuracy                           0.90      1115
   macro avg       0.95      0.63      0.67      1115
weighted avg       0.91      0.90      0.87      1115



# Tf-Idf
Highlights words that are frequent in one document but not across all documents
- In comparison to CountVectorizer, this results in normalized outputs:
    - https://datascience.stackexchange.com/questions/25581/what-is-the-difference-between-countvectorizer-token-counts-and-tfidftransformer
- TF = Term Frequency: counts how often each word occurs in each document
- IDF = Inverse Document Frequency: suppresses the effect of words that occur in all documents, since such words do not help to tell documents apart

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

In [64]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Marc\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [65]:
tf = TfidfVectorizer(stop_words=stopwords.words('english'))

# Train model
tf.fit(x_train)

# Vectorize data with trained model
tf_vector_train = tf.transform(x_train)
tf_vector_test = tf.transform(x_test)

# Print feature names and idf values
print(tf.get_feature_names()[1000:1005])
print(tf.idf_[1000:1005])

['appear', 'applausestore', 'applebees', 'apply', 'applyed']
[8.70930833 8.70930833 8.70930833 6.10661865 8.70930833]
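
The printed idf values are consistent with scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t. A small check, assuming the training set holds 4457 messages (5572 rows minus the 1115 test rows) and that the rare terms above occur in exactly one message:

```python
import math

n_docs = 4457   # assumed training-set size: 5572 rows minus the 1115 test rows
doc_freq = 1    # assumption: the term appears in exactly one message

# Smoothed idf as used by TfidfVectorizer with smooth_idf=True (the default)
idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
print(round(idf, 8))

# A term present in every message gets the minimum weight of 1
idf_everywhere = math.log((1 + n_docs) / (1 + n_docs)) + 1
print(idf_everywhere)
```

Under these assumptions the first value comes out at roughly 8.709, which matches the repeated entry in the output above, while more frequent words such as 'apply' receive a lower weight.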


In [66]:
# Tokenize and build vocab on the full dataset (this refits tf; the train/test vectors above were created before this refit)
tf.fit(df['sms'])

TfidfVectorizer(stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...])

In [67]:
# A value of 1 would mean the word appears in basically every sms
print(tf.idf_)

[7.22779351 6.29348428 8.93254161 ... 8.93254161 8.93254161 8.93254161]


tf-idf with M_NB

In [68]:
nb_tf = MultinomialNB()

nb_tf.fit(tf_vector_train, y_train)

#predict
pred_tf = dict()
pred_tf['y_pred'] = nb_tf.predict(tf_vector_test)

# Outcome
print('Accuracy of the model tf-idf with MNB is {}'.format(round(accuracy_score(y_test,pred_tf['y_pred']),3)),'\n')
print(classification_report(y_test,pred_tf['y_pred']))

Accuracy of the model tf-idf with MNB is 0.97 

              precision    recall  f1-score   support

           0       0.97      1.00      0.98       965
           1       1.00      0.78      0.88       150

    accuracy                           0.97      1115
   macro avg       0.98      0.89      0.93      1115
weighted avg       0.97      0.97      0.97      1115



In [69]:
lg_tf = LogisticRegression()

#train model
lg_tf.fit(tf_vector_train, y_train)

#predict
lg_pred_tf = dict()
lg_pred_tf['y_pred'] = lg_tf.predict(tf_vector_test)

#outcome
print('Accuracy of the model is {}'.format(round(accuracy_score(y_test,lg_pred_tf['y_pred']),3)),'\n')
print(classification_report(y_test,lg_pred_tf['y_pred']))

Accuracy of the model is 0.954 

              precision    recall  f1-score   support

           0       0.95      0.99      0.97       965
           1       0.95      0.69      0.80       150

    accuracy                           0.95      1115
   macro avg       0.95      0.84      0.89      1115
weighted avg       0.95      0.95      0.95      1115



RandomForestClassifier

In [70]:
rf_tf = RandomForestClassifier(random_state=42)

#Define Params
parameters = {
    'n_estimators': [5, 100],
    'max_depth' : [3,4,5,6],
}

#GridsearchCV
gcv = GridSearchCV(estimator=rf_tf,param_grid=parameters)
gcv.fit(tf_vector_train,y_train)

#Return the best parameters
print(gcv.best_params_)

{'max_depth': 6, 'n_estimators': 5}


In [71]:
#Create the model with the right parameters
rf_tf_ = RandomForestClassifier(n_estimators=5,max_depth=6)

#Fit the new model
rf_tf_.fit(tf_vector_train, y_train)

#Predict the target variable
pred_rf_tf = dict()
pred_rf_tf['y_pred'] = rf_tf_.predict(tf_vector_test)

#Outcome
print('Accuracy of the model is {}'.format(round(accuracy_score(y_test,pred_rf_tf['y_pred']),3)),'\n')
print(classification_report(y_test,pred_rf_tf['y_pred']))


Accuracy of the model is 0.888 

              precision    recall  f1-score   support

           0       0.89      1.00      0.94       965
           1       1.00      0.17      0.29       150

    accuracy                           0.89      1115
   macro avg       0.94      0.58      0.61      1115
weighted avg       0.90      0.89      0.85      1115



# Conclusion
For the mailing problem, the metric to optimize is the precision of the spam class. If a legitimate mail is falsely classified as spam, the recipient will most likely overlook it, and if it happens to be a very important mail, the cost can be immense. If, on the other hand, a spam mail is falsely classified as not spam, the owner of the mailbox can simply delete it or move it to spam.
Hence, with 1 = spam, the precision for spam and ham respectively is:
- CountVectorizer + logreg: 1.00 and 0.98
- tf-idf + MultinomialNB:   1.00 and 0.97

--> Use CountVectorizer in combination with a logreg (compared with tf-idf features and with the ML algorithms RF and MultinomialNB)
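
To make the precision argument concrete, the sketch below computes spam precision and recall from a confusion matrix. The labels are hypothetical and not taken from the dataset; `y_true` and `y_hat` are illustrative names:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = spam, 0 = ham (not taken from the dataset)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

spam_precision = precision_score(y_true, y_hat, pos_label=1)  # tp / (tp + fp)
spam_recall = recall_score(y_true, y_hat, pos_label=1)        # tp / (tp + fn)

print(tn, fp, fn, tp)
print(spam_precision, spam_recall)
```

Here the single false positive is the costly case described above (a legitimate message hidden as spam), whereas the two false negatives only lower recall.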