In [None]:
!pip install scikit-learn



In [None]:
# imports
import pandas as pd

# For Question 2
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# For Question 3
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

The code block below:

1.   Reads `spam.csv` from the current working directory, decoding the file with the `latin-1` encoding.
2.   Keeps all rows and only the first two columns as the dataframe.
3.   Replaces every `ham` value in the dataframe with 0 and every `spam` value with 1.
4.   Renames column `v1` to `label` and column `v2` to `message`.
5.   Drops duplicate rows and resets the index so that it is contiguous again.


In [None]:
df = pd.read_csv('spam.csv', encoding='latin-1')
df = df.iloc[:, :2]
df = df.replace('ham', 0).replace('spam', 1)
df = df.rename(columns = {'v1': 'label', 'v2': 'message'})
df = df.drop_duplicates().reset_index(drop=True)

In [None]:
df

Unnamed: 0,label,message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5164,1,This is the 2nd time we have tried 2 contact u...
5165,0,Will Ì_ b going to esplanade fr home?
5166,0,"Pity, * was in mood for that. So...any other s..."
5167,0,The guy did some bitching but I acted like i'd...


**Tokenizing data:** Given these strings, tokenizing means splitting each input into smaller units (keywords, elements, etc.) that are referred to as tokens.

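**CountVectorizer:** With the default `token_pattern`, `CountVectorizer` treats every run of 2 or more word characters (letters, digits, underscore) as a token, so single-character words such as `u` are dropped. Text is also lowercased by default.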
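
As a rough illustration of the default tokenization, the sketch below applies the regular expression `(?u)\b\w\w+\b` (the default `token_pattern` of `CountVectorizer`) with Python's `re` module after lowercasing the text. It only mimics the tokenizing step and is not how this notebook builds its features; the helper name `default_tokens` is made up for this example.

```python
import re

# Default token_pattern of CountVectorizer: runs of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_tokens(text):
    """Lowercase the text and return its tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = default_tokens("Ok lar... Joking wif u oni...")
print(tokens)  # ['ok', 'lar', 'joking', 'wif', 'oni']
```

The single-character word `u` does not match the pattern, so it never becomes a column in the count matrix.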

In [None]:
messages = df['message']
y = df['label']
x = CountVectorizer().fit_transform(messages)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2021)
x

<5169x8672 sparse matrix of type '<class 'numpy.int64'>'
	with 68018 stored elements in Compressed Sparse Row format>

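**Unique Tokens Identified**: 4767
This was found by summing each column of the count matrix (the total number of occurrences of each token across all messages) and then counting how many of those totals equal 1, i.e. how many tokens appear exactly once in the corpus.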

In [None]:
token_occ = x.toarray().sum(axis=0)

print(token_occ.shape)

# number of tokens whose total count across the corpus is exactly 1
unique = (token_occ == 1).sum()
print(unique)

(8672,)
4767
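
To make the counting step concrete, here is a small plain-Python sketch on a made-up three-message corpus. It builds a count matrix by hand, sums each column, and counts the totals equal to 1. The corpus and variable names are assumptions for illustration only.

```python
# Hypothetical toy corpus (not taken from spam.csv)
docs = ["free entry win", "win cash now", "ok see you now"]

# Vocabulary in sorted order, similar in spirit to a fitted vectorizer
vocab = sorted({tok for doc in docs for tok in doc.split()})

# Count matrix: one row per message, one column per token
counts = [[doc.split().count(tok) for tok in vocab] for doc in docs]

# Column sums = total occurrences of each token across all messages
token_totals = [sum(col) for col in zip(*counts)]

# Tokens whose total is exactly 1
once = [tok for tok, total in zip(vocab, token_totals) if total == 1]
print(len(once), once)  # 6 ['cash', 'entry', 'free', 'ok', 'see', 'you']
```

`win` and `now` each occur twice, so they are excluded from the count.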


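Analyzing the training data shows that the training set does not contain at least one instance of every token: because the vocabulary was built from all messages before the split, some tokens occur only in the test set.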

In [None]:
train_occ = x_train.toarray().sum(axis=0)

print(train_occ.shape)

# True only if every token in the vocabulary occurs at least once in the training set
atleast_one = bool((train_occ > 0).all())

print(atleast_one)

(8672,)
False
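
The same check can be sketched without sklearn on made-up data: build the vocabulary from every message, then list the tokens that never occur in the training portion. The messages and the split below are hypothetical.

```python
# Hypothetical messages, split by hand into train and test portions
train_docs = ["free entry win", "ok see you now"]
test_docs = ["win cash now"]

# Vocabulary built from ALL messages, as happens when the vectorizer
# is fit before the train/test split
vocab = {tok for doc in train_docs + test_docs for tok in doc.split()}

# Tokens that actually occur in the training messages
train_tokens = {tok for doc in train_docs for tok in doc.split()}

# Vocabulary entries with a zero count in the training set
missing_from_train = sorted(vocab - train_tokens)
print(missing_from_train)  # ['cash']
```

Columns for such tokens are all zeros in the training matrix, which is what the `False` result above reflects.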


**Accuracy**: Every classifier in sklearn has a `score` method. After the model has been fit on the x and y training data, calling `score` with the x and y test data returns the mean accuracy of the model's predictions of y from x. Accuracy is calculated as:
`number of correct predictions / total predictions` (i.e. the fraction of samples that the model predicted correctly). This is an important metric because it gives us an immediate idea of how good our machine learning model is at doing its job. For example, if the model correctly predicts 90% of the test set, we could expect it to give the correct answer about 90% of the time on similar data. However, accuracy does not distinguish between the kinds of errors the model makes, and it can look high on an imbalanced dataset even when few positives are found, which is why precision and recall are reported as well.

**Precision**: `tp / (tp + fp)` where tp is the number of true positives and fp is the number of false positives. In other words, precision is the fraction of samples classified as positive that really are positive. Precision is important for assessing our model because it shows how often a message flagged as spam is actually spam.

**Recall**: `tp / (tp + fn)` where tp is the number of true positives and fn is the number of false negatives. Recall differs from precision because it assesses how well the model is able to find all positive samples. Recall is important for assessing our model because it shows what fraction of the actual spam messages the model catches.
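
A short worked example may help separate the three metrics. The confusion-matrix counts below are invented for illustration (they are not results from this notebook), and the function names are hypothetical.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that the model found."""
    return tp / (tp + fn)


# Invented counts for 100 messages, 12 of which are spam
tp, fp, fn, tn = 8, 2, 4, 86

acc = accuracy(tp, fp, fn, tn)   # 94 / 100
prec = precision(tp, fp)         # 8 / 10
rec = recall(tp, fn)             # 8 / 12
print(round(acc, 3), round(prec, 3), round(rec, 3))  # 0.94 0.8 0.667
```

In this made-up case accuracy is high even though a third of the spam is missed, which is why the three numbers are best read together.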

In [None]:
mnb_clf = MultinomialNB()
mnb_clf.fit(x_train, y_train)
y_hat = mnb_clf.predict(x_test)
[mnb_clf.score(x_test, y_test), precision_score(y_test, y_hat), recall_score(y_test, y_hat)]

[0.9777562862669246, 0.9154929577464789, 0.9219858156028369]

In [None]:
linear_svc = LinearSVC()
linear_svc.fit(x_train, y_train)
y_hat = linear_svc.predict(x_test)
[linear_svc.score(x_test, y_test), precision_score(y_test, y_hat), recall_score(y_test, y_hat)]

[0.9816247582205029, 0.9919354838709677, 0.8723404255319149]

In [None]:
logistic_reg = LogisticRegression()
logistic_reg.fit(x_train, y_train)
y_hat = logistic_reg.predict(x_test)
[logistic_reg.score(x_test, y_test), precision_score(y_test, y_hat), recall_score(y_test, y_hat)]

[0.9787234042553191, 0.983739837398374, 0.8581560283687943]