## Spam Filtering
In this programming assignment, we will look at spam filtering with a real data set that has a "label" for every email, i.e. spam or not spam. We will use a logistic regression classifier to solve this assignment and participate in a friendly competition on Kaggle (details below). The assignment goes from data loading to data inspection to data pre-processing to creating a train/test data set, and finally to doing machine learning, making predictions and evaluating them. This is typically one part of the "full pipeline" in ML modeling/prototyping, so you will get a taste of the "prototype pipeline" work that happens in practice. Have fun! And if you get stuck somewhere, use Discord; maybe someone has a suggestion that will unblock you.

The submission consists of two parts:
a) A submission of your complete working code with train/validation data sets, plus your write-up with insights and learnings (details provided below).
b) Evaluation of your best model on the Kaggle evaluation data set. For this you can form a team of 2 to brainstorm ideas and make your best submission. Include your team name and team members in your submission.

Kaggle Starting Point for the competition: https://www.kaggle.com/t/7d2850f5b99a41fba457f2ad7acd0fca

## Loading the data set

In [159]:
import pandas as pd
import nltk
local_file="all_emails.csv"
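# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
data_set = pd.read_csv(local_file,sep=',',index_col=0,header=0,engine='python',on_bad_lines='skip')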

## 1) Inspecting the data set

In [160]:
!pip install nltk



In [161]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('corpus')  # 'corpus' is not a downloadable package id; NLTK reports an error for this call
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Error loading corpus: Package 'corpus' not found in index
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Exploration

In [162]:
# 1. Print a few lines (i.e. each line is an email and a label) from the data_set containing spam (use a pandas functionality - e.g. getting the top lines)
print("printing few lines from dataset")
print(data_set[data_set['spam']==1][:10])
# 2. Print a few lines from data_set that are not spam
print("printing few lines that are not spam")
print(data_set[data_set['spam']==0][:10])
# 3. Print the emails between lines 5000 and 5010 in the data set

printing few lines from dataset
                                                   text  spam
id                                                           
1235  Subject: naturally irresistible your corporate...     1
1236  Subject: the stock trading gunslinger  fanny i...     1
1238  Subject: 4 color printing special  request add...     1
1239  Subject: do not have money , get software cds ...     1
1240  Subject: great nnews  hello , welcome to medzo...     1
1242  Subject: save your money buy getting this thin...     1
1243  Subject: undeliverable : home based business f...     1
1244  Subject: save your money buy getting this thin...     1
1246  Subject: save your money buy getting this thin...     1
1247  Subject: brighten those teeth  get your  teeth...     1
printing few lines that are not spam
                                                   text  spam
id                                                           
2603  Subject: hello guys ,  i ' m " bugging you " f...     0
2

## 2) Data processing step for this HW: 
Do the following for all emails in your data set: 1) tokenize into words, 2) remove stop/filler words, and 3) remove punctuation.
Below, we have it done for a sample sentence.
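
The three steps above can be sketched without any downloads, assuming whitespace tokenization (the emails in this data set already have spaces around punctuation) and a tiny hand-written stop-word set. Both are stand-ins for the NLTK tokenizer and stop-word list used in this notebook, and `STOP_WORDS` and `clean_email` are hypothetical names.

```python
# Tiny stand-in stop-word set (an assumption for this sketch; the notebook uses NLTK's English list)
STOP_WORDS = {"is", "at", "to", "our", "only", "who", "he", "while"}

def clean_email(text):
    """Tokenize on whitespace, drop stop words, then drop non-alphanumeric tokens."""
    tokens = text.lower().split()                          # 1) tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]    # 2) remove stop words
    return [t for t in tokens if t.isalnum()]              # 3) remove punctuation tokens

sample = "Subject: only our software is guaranteed 100 % legal ."
cleaned = clean_email(sample)
print(cleaned)
```

Because `subject:` contains a colon, the `isalnum` filter removes it here without a separate replacement step.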

## Replacing some words for the classifier to learn better

In [163]:
import re
def replace_words(data_set):
    """Lower-case each email and replace symbols and URLs with placeholder words."""
    for index,row in data_set.iterrows():
        symbol_replaced = row["text"].lower()
        # symbol_replaced = re.sub('[0-9]{2,20}', 'numbers', symbol_replaced)
        # symbol_replaced = re.sub('[0-9]', 'number', symbol_replaced)
        symbol_replaced=symbol_replaced.replace('subject:','')
        # Raw strings avoid invalid-escape warnings for \$, \% and \s
        symbol_replaced = re.sub(r'\$+', 'dollar', symbol_replaced)
        symbol_replaced = re.sub(r'\%+', 'percent', symbol_replaced)
        symbol_replaced = re.sub(r'\scc', 'carbon', symbol_replaced)

        # NOTE: this pattern uses JavaScript-style /.../i delimiters. Python's re treats the
        # leading '/' as a literal character before '^', so it never matches any address.
        symbol_replaced = re.sub(r'/^[A-Z0-9._%+-]+\s@\s[A-Z0-9.-]\s+\.\s[A-Z]{2,4}$/i', 'emailaddr', symbol_replaced)
        symbol_replaced = re.sub(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\s\.\s[\w/\-&?=%.]+\s', 'website', symbol_replaced)

        data_set.at[index,"text"]=symbol_replaced
    return data_set
data_set=replace_words(data_set)
data_set.head()

| id | text | spam |
|---|---|---|
| 1235 | naturally irresistible your corporate identit... | 1 |
| 1236 | the stock trading gunslinger fanny is merril... | 1 |
| 1238 | 4 color printing special request additional ... | 1 |
| 1239 | do not have money , get software cds from her... | 1 |
| 1240 | great nnews hello , welcome to medzonline sh... | 1 |
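
The email pattern in `replace_words` is written in JavaScript regex-literal style, so it may help to see a Python form of the same idea. The sketch below assumes, as the original pattern does, that addresses appear with spaces around `@` and `.` (for example `bob @ example . com`); `EMAIL_RE` and `replace_emails` are hypothetical names and the pattern is illustrative, not a full address validator.

```python
import re

# local part, " @ ", one or more domain labels separated by " . ", then a 2-4 letter TLD
EMAIL_RE = re.compile(
    r"[a-z0-9._%+-]+\s@\s[a-z0-9-]+(?:\s\.\s[a-z0-9-]+)*\s\.\s[a-z]{2,4}\b"
)

def replace_emails(text):
    """Replace space-separated email addresses with the placeholder 'emailaddr'."""
    return EMAIL_RE.sub("emailaddr", text.lower())

print(replace_emails("write to bob @ example . com now"))
print(replace_emails("john @ mail . enron . com today"))
```

Swapping this in would change the vocabulary, so the scores reported in this notebook would need to be re-measured.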


## Tokenizer
Apply a tokenizer to tokenize the sentences in your email - So your sentence gets broken down to words. We will use a tokenizer from the NLTK library (Natural Language Tool Kit) below for a single sentence. 

In [164]:
# Example Sentence
from nltk.tokenize import word_tokenize
sentence = """Subject: only our software is guaranteed 100 % legal . name - brand software at low , low , low , low prices everything comes to him who hustles while he waits . many would be cowards if they had courage enough ."""
sentence_tokenized = word_tokenize(sentence)
print(sentence_tokenized)
#nltk.download('punkt')

['Subject', ':', 'only', 'our', 'software', 'is', 'guaranteed', '100', '%', 'legal', '.', 'name', '-', 'brand', 'software', 'at', 'low', ',', 'low', ',', 'low', ',', 'low', 'prices', 'everything', 'comes', 'to', 'him', 'who', 'hustles', 'while', 'he', 'waits', '.', 'many', 'would', 'be', 'cowards', 'if', 'they', 'had', 'courage', 'enough', '.']


In [165]:
def tokenize_words(data_set):
    for index,row in data_set.iterrows():
        tokenized=word_tokenize(row["text"])
        data_set.at[index,"text"]=tokenized
    return data_set
data_set=tokenize_words(data_set)
data_set.head()

| id | text | spam |
|---|---|---|
| 1235 | [naturally, irresistible, your, corporate, ide... | 1 |
| 1236 | [the, stock, trading, gunslinger, fanny, is, m... | 1 |
| 1238 | [4, color, printing, special, request, additio... | 1 |
| 1239 | [do, not, have, money, ,, get, software, cds, ... | 1 |
| 1240 | [great, nnews, hello, ,, welcome, to, medzonli... | 1 |


## Stop Words: Remove Stop Words (or Filler words ) using stop words list

In [166]:
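from nltk.corpus import stopwords
# Build the stop-word set once; calling stopwords.words() for every token is slow
english_stop_words = set(stopwords.words('english'))
filtered_words = [word for word in sentence_tokenized if word not in english_stop_words]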

In [167]:
def remove_stopwords(data_set):
    # Load the list once per call and use a set for fast membership tests
    stop_words=set(stopwords.words('english'))
    for index,row in data_set.iterrows():
        filtered=[word for word in row["text"] if word not in stop_words]
        data_set.at[index,"text"]=filtered
    return data_set
data_set=remove_stopwords(data_set)
data_set.head()

| id | text | spam |
|---|---|---|
| 1235 | [naturally, irresistible, corporate, identity,... | 1 |
| 1236 | [stock, trading, gunslinger, fanny, merrill, m... | 1 |
| 1238 | [4, color, printing, special, request, additio... | 1 |
| 1239 | [money, ,, get, software, cds, !, software, we... | 1 |
| 1240 | [great, nnews, hello, ,, welcome, medzonline, ... | 1 |


## Applying lemmatisation

In [168]:
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatize_text(text):
    return [lemmatizer.lemmatize(word) for word in text]
def lemmatize_word(data_set):
    data_set['text'] = data_set.text.apply(lemmatize_text)
    return data_set
data_set=lemmatize_word(data_set)
data_set.head()

| id | text | spam |
|---|---|---|
| 1235 | [naturally, irresistible, corporate, identity,... | 1 |
| 1236 | [stock, trading, gunslinger, fanny, merrill, m... | 1 |
| 1238 | [4, color, printing, special, request, additio... | 1 |
| 1239 | [money, ,, get, software, cd, !, software, web... | 1 |
| 1240 | [great, nnews, hello, ,, welcome, medzonline, ... | 1 |


## Punctuation: Remove punctuation and other special characters from tokens

### 3) Exercise: 
Inspect the resulting list below for any of your emails. Does it look clean and ready to be used for the next step in spam detection? Are there any other pre-processing steps you can think of or may want to do before spam detection? How about including other NLP features like bi-grams and tri-grams?

In [169]:
new_words= [word for word in filtered_words if word.isalnum()]
new_words

['Subject',
 'software',
 'guaranteed',
 '100',
 'legal',
 'name',
 'brand',
 'software',
 'low',
 'low',
 'low',
 'low',
 'prices',
 'everything',
 'comes',
 'hustles',
 'waits',
 'many',
 'would',
 'cowards',
 'courage',
 'enough']

In [170]:
def remove_punctuations(data_set):
    for index,row in data_set.iterrows():
        new_word=[word for word in row["text"] if word.isalnum()]
        data_set.at[index,"text"]=new_word
    return data_set
data_set=remove_punctuations(data_set) 
data_set.head()

| id | text | spam |
|---|---|---|
| 1235 | [naturally, irresistible, corporate, identity,... | 1 |
| 1236 | [stock, trading, gunslinger, fanny, merrill, m... | 1 |
| 1238 | [4, color, printing, special, request, additio... | 1 |
| 1239 | [money, get, software, cd, software, websitewe... | 1 |
| 1240 | [great, nnews, hello, welcome, medzonline, sh,... | 1 |


## 4) Train/Validation Split
Now, for each email in your data set, you have boiled the email down to its essentials: a list of words that are clean and ready for some machine learning! Maybe punctuation matters for spam emails?
If you wish to keep it, you may do so out of curiosity and see how it impacts the metrics (i.e. skip step 3 above).

What we will do now is split the data set into a train set and a test set. The train set can have 80% of the data (i.e. emails along with their labels) chosen at random, but with good representation from both the spam and not-spam classes. The same goes for the test set, which has the remaining 20% of the data.
Look up Python libraries that can do this data split for you automatically.
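
One way to get "good representation from both classes" is the `stratify` argument of scikit-learn's `train_test_split`, which keeps the class proportions roughly equal in both parts. The sketch below uses made-up toy data (`toy_texts`, `toy_labels`), not the assignment data set.

```python
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 5 spam and 15 non-spam emails
toy_texts = [f"email {i}" for i in range(20)]
toy_labels = [1] * 5 + [0] * 15

# stratify=toy_labels keeps the 25% spam share in both the train and the test part
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)
print(len(toy_y_train), sum(toy_y_train))  # train size and number of spam emails
print(len(toy_y_test), sum(toy_y_test))    # test size and number of spam emails
```

Without `stratify`, a random split of an imbalanced data set can leave the test set with noticeably more or fewer spam emails than the overall share.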

In [171]:
X=data_set["text"]
y=data_set["spam"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Applying count vectorizer

In [172]:
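# pd.concat replaces Series.append, which was removed in pandas 2.0
combined_text=pd.concat([X_train, X_test])
def word_from_series(df):
    # Each email is already a list of tokens, so the analyzer returns it unchanged
    return df

from sklearn.feature_extraction.text import CountVectorizer
# ngram_range is ignored when analyzer is a callable, so only single tokens are counted here
vectorizer = CountVectorizer(analyzer=word_from_series,ngram_range=(1, 4 )) 
# NOTE: the vocabulary is fitted on train and test text together;
# fitting on X_train alone would keep the test set fully unseen
vec = vectorizer.fit_transform(combined_text)
X_train_1gram=vectorizer.transform(X_train)
X_test_1gram=vectorizer.transform(X_test)
X_train_1gram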

<3408x31120 sparse matrix of type '<class 'numpy.int64'>'
	with 316043 stored elements in Compressed Sparse Row format>
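
scikit-learn documents `ngram_range` as applying only when `analyzer` is not callable, so with a custom analyzer the n-grams have to be produced by the analyzer itself. The sketch below shows one possible way to do that for emails that are already token lists; `token_ngrams` is a hypothetical helper, not part of this notebook.

```python
def token_ngrams(tokens, max_n=2):
    """Return all n-grams of a token list for n = 1..max_n, joined with spaces."""
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of length n over the token list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

example_tokens = ["low", "low", "prices"]
example_grams = token_ngrams(example_tokens)
print(example_grams)
```

A function like this could be passed as `CountVectorizer(analyzer=token_ngrams)`; the feature matrix would then be wider than the unigram matrix shown above.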

### 5) Train your model and evaluate on Kaggle
Report your train/validation F1-score for your baseline model (starter LR model) and also your best LR model. Also report your insights on what worked and what did not on the Kaggle evaluation. How can your model be improved? Where does your model make mistakes?

In [173]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0,max_iter=1000)
clf.fit(X_train_1gram,y_train)
from sklearn.metrics import f1_score
y_pred=clf.predict(X_test_1gram)
f1=f1_score(y_test, y_pred, average='macro')
f1

0.9818819776714514
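
The value above is the macro-averaged F1 on the held-out 20%. As a reminder of what `average='macro'` computes, here is a minimal plain-Python sketch on made-up labels: F1 is calculated per class and the per-class values are averaged with equal weight. The function names are hypothetical.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Toy labels (an assumption for illustration), 1 = spam and 0 = not spam
toy_true = [1, 1, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 1]
toy_macro_f1 = macro_f1(toy_true, toy_pred)
print(toy_macro_f1)
```

Because both classes count equally, a model that does well on the large not-spam class but poorly on spam is penalised more than accuracy would suggest.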

1. 0.9864307922626494 with alnum
2. 0.9849444428737029 with alpha
3. 0.981984140969163 with L1 liblinear
4. 0.9802875708115092 numberreplace
5. 0.9879555542989624 lemmatisation
6. 0.9849016480595427 lemma with numbers as numbers

In [185]:
from sklearn.model_selection import GridSearchCV


parameters ={'C' : [0.1,.5,1,5,10],'solver':['liblinear','sag','saga','lbfgs']}
model_2 = LogisticRegression(random_state=0,max_iter=1000)
GridSearch_model = GridSearchCV(estimator=model_2, param_grid=parameters)
GridSearch_model.fit(X_train_1gram, y_train)
print("best parameters : ",GridSearch_model.best_params_)
y_pred=GridSearch_model.predict(X_test_1gram)
f1=f1_score(y_test, y_pred, average='macro')
f1



best parameters :  {'C': 0.5, 'solver': 'saga'}




0.9819333314484435
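
By default `GridSearchCV` ranks parameter settings with the estimator's own score, which is accuracy for `LogisticRegression`. Since this assignment reports macro F1, one option is to select on that metric with `scoring="f1_macro"`. The sketch below runs on synthetic data from `make_classification`, not on the emails, and the grid is only an example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced toy problem (an assumption for illustration)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, weights=[0.75, 0.25], random_state=0
)

toy_grid = {"C": [0.1, 1, 10]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=toy_grid,
    scoring="f1_macro",  # rank settings by cross-validated macro F1 instead of accuracy
    cv=3,
)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```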

The model predicts best when lemmatisation is applied and numbers are not replaced with the term "numbers". I applied tokenisation, word replacement, lemmatisation, stop-word removal and count vectorisation. <br>

My first submission, which used neither text replacement nor lemmatisation, worked well and scored best for me on Kaggle. On the evaluation set for which the labels are given, however, I found that lemmatisation worked better. <br>
I believe the model can be improved by converting all email addresses and websites into 'email' or 'website' rather than keeping the names. I could also try TF-IDF and check whether it improves the F1 score. <br>
Making a word cloud and rejecting the least important words could also have improved F1. I also did some hyperparameter tuning, but that didn't help much.<br>
Since there are fewer spam examples, the class imbalance causes problems for the model in predicting some spam labels. I realized this after plotting the confusion matrix and after comparing the mispredictions with the evaluation set that was set apart initially from the training data.<br> 
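
The confusion-matrix check mentioned above can be sketched with the standard library alone: count each (actual, predicted) pair and read off the missed spam and the false alarms. The labels below are made up for illustration and `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(y_true, y_pred))

# Toy labels (an assumption for illustration), 1 = spam and 0 = not spam
toy_actual = [1, 1, 1, 0, 0, 0, 0, 0]
toy_predicted = [1, 0, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(toy_actual, toy_predicted)
missed_spam = counts[(1, 0)]    # spam predicted as not spam
false_alarms = counts[(0, 1)]   # not spam predicted as spam
print(missed_spam, false_alarms)
```

If missed spam dominates, `LogisticRegression(class_weight="balanced")` is one scikit-learn option worth trying; whether it helps on this data set would have to be measured.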

In [186]:
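mispredicion_indexes=[]
# Loop names chosen so they do not shadow the label Series `y` defined earlier
for index,(predicted,actual) in enumerate(zip(y_pred,y_test)):
    if predicted!=actual:
        mispredicion_indexes.append(index)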


In [175]:
X_test_list=X_test.tolist()

mispredicted_label_list=[]
for item in mispredicion_indexes:
    mispredicted_label_list.append(X_test_list[item])
mispredicted_df=pd.DataFrame()
mispredicted_df['index']=mispredicion_indexes
mispredicted_df['text']=mispredicted_label_list
mispredicted_df

| | index | text |
|---|---|---|
| 0 | 5 | [gpcm, modeler, news, august, 31, 2000, new, g... |
| 1 | 47 | [kinja, account, activation, hello, iztari, th... |
| 2 | 104 | [energy, oil, drilling, survey, find, producer... |
| 3 | 199 | [long, sleeve, denim, shirt, enron, research, ... |
| 4 | 202 | [drogi, websitety, byles, na, tyle, mily, ze, ... |
| 5 | 205 | [change, plan, hello, two, sorry, catherine, w... |
| 6 | 224 | [welcome, energy, news, live, dear, vincent, k... |
| 7 | 303 | [nymex, invitation, learn, power, trading, pow... |
| 8 | 326 | [http, website, ally, sport, hello, hoping, co... |
| 9 | 499 | [21, keep, calm, 1827, sims, vietnam, warit, b... |


In [176]:
# mispredicted_df.to_csv("m.csv")


In [177]:
# for item in combined_text:
#     if "Subject" not in item:
#        print(item) 

# Evaluation

In [180]:
eval_df=pd.read_csv("eval_students_2.csv")
eval_df=replace_words(eval_df)
eval_df=tokenize_words(eval_df)
eval_df=remove_stopwords(eval_df)
eval_df=lemmatize_word(eval_df)
eval_df=remove_punctuations(eval_df) 


X_eval_1gram=vectorizer.transform(eval_df["text"])
y_pred=clf.predict(X_eval_1gram)

# Saving evaluation result

In [181]:
submission_df=pd.DataFrame()
submission_df["id"]=eval_df["id"]
submission_df["spam"]=y_pred
submission_df.to_csv("results.csv",index=False)