
# ML Based Spam Filter

CMPE 256

Cory Randolph

11/11/2021



# Prompt

Learning objective: apply TF-IDF and develop a spam filter model for the enclosed documents.


# Summary of Analysis

After applying TF-IDF to the provided documents and summing only the scores of the words in the provided spam dictionary, the results are summarized in the table below.

| document_id   | document                                                        |   total_tf_idf |
|:--------------|:----------------------------------------------------------------|---------------:|
| d1            | Free - Coupons for next movie. The above links will take you... |       0.490822 |
| d2            | Free - Coupons for next movie. The above links will take you... |       0.490822 |
| d3            | Our records indicate your Pension is under performing to see... |       0.705559 |
| d4            | Enter to win $25,000 and get a Free Hotel Night! Just click ... |       0.23803  |
| d5            | Dear recipient, Avangar Technologies announces the beginning... |       0.417345 |
| d6            | I know that's an incredible statement, but bear with me whil... |       0.163002 |

Based on the total TF-IDF score for each document, we can conclude the following about the likelihood of spam:
*   d3 is most likely to be spam, with a total TF-IDF score of 0.71.
*   d1 and d2 are the same document and have a fairly high chance of being spam, with a total TF-IDF score of 0.49.
*   d5 also has a fairly high chance of being spam, with a total TF-IDF score of 0.42.
*   d4 and d6 have a low chance of being spam, with total TF-IDF scores of 0.24 and 0.16 respectively.

# Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Data

Input the data for the documents manually.

In [2]:
data = [
        ['d1', '''Free - Coupons for next movie. The above links will take you straight to our partner's site. For more information or to see other offers available, you can also visit the Groupon on the Working Advantage website.'''],
        ['d2', '''Free - Coupons for next movie. The above links will take you straight to our partner's site. For more information or to see other offers available, you can also visit the Groupon on the Working Advantage website.'''],
        ['d3', '''Our records indicate your Pension is under performing to see higher growth and up to 25% cash release reply PENSION for a free review. To opt out reply STOP'''],
        ['d4', '''Enter to win $25,000 and get a Free Hotel Night! Just click here for a $1 trial membership in NetMarket, the Internet'spremier discount shopping site: Fast Company EZVenture gives you FREE business articles,PLUS, you could win YOUR CHOICE of a BMW Z3 convertible, $100,000, shares of Microsoft stock, or a home office computer. Go there and get your chances to win now. A crazy-funny-cool trivia book with a $10,000 prize? PLUS chocolate, nail polish, cats, barnyard animals, and more?'''],
        ['d5','''Dear recipient, Avangar Technologies announces the beginning of a new unprecendented global employment campaign. Due to company's exploding growth Avangar is expanding business to the European region. During last employment campaign over 1500 people worldwide took part in Avangar's business and more than half of them are currently employed by the company. And now we are offering you one more opportunity to earn extra money working with Avangar Technologies. We are looking for honest, responsible, hard-working people that can dedicate 2-4 hours of their time per day and earn extra £300-500 weekly. All offered positions are currently part-time and give you a chance to work mainly from home.'''],
        ['d6','''I know that's an incredible statement, but bear with me while I explain. You have already deleted mail from dozens of "Get Rich Quick" schemes, chain letter offers, and LOTS of other absurd scams that promise to make you rich overnight with no investment and no work. My offer isn't one of those. What I'm offering is a straightforward computer-based service that you can run full-or part-time like a regular business. This service runs auto-matically while you sleep, vacation, or work a "regular" job. It provides a valuable new service for businesses in your area. I'm offering a high-tech, low-maintenance, work-fromanywhere business that can bring in a nice comfortable additional income for your family. I did it for eight years. Since I started inviting others to join me, I've helped over 4000 do the same.'''],
]

columns = ['document_id', 'document']


spam_dictionary = ['Free', 'Click', 'visit', 'attachment', 'call',
                   'money', 'Out', 'extra', 'offer', 'available', 'Pension', 'Opportunity',
                   'Chance', 'Investment', 'Pension',]

# Convert to all lower case for later analysis                 
spam_dictionary = [x.lower() for x in spam_dictionary]

Convert the data into a pandas DataFrame.

In [3]:
df = pd.DataFrame(data = data, columns = columns)

# Set the index
df.set_index('document_id',inplace = True)

# Display the first few rows
df.head()

document_id,document
d1,Free - Coupons for next movie. The above links...
d2,Free - Coupons for next movie. The above links...
d3,Our records indicate your Pension is under per...
d4,"Enter to win $25,000 and get a Free Hotel Nigh..."
d5,"Dear recipient, Avangar Technologies announces..."


Normalize the text (lowercase it and replace punctuation with spaces), then build a bag-of-words representation of each document.

In [4]:
from collections import Counter

bag_of_words = (
    df['document'].
    str.lower().                               # convert all letters to lowercase
    str.replace(r"[^\w\s]", " ", regex=True).  # replace punctuation with whitespace (pandas >= 2.0 needs regex=True)
    str.split()                                # split on whitespace
).apply(Counter)

bag_of_words

document_id
d1    {'free': 1, 'coupons': 1, 'for': 2, 'next': 1,...
d2    {'free': 1, 'coupons': 1, 'for': 2, 'next': 1,...
d3    {'our': 1, 'records': 1, 'indicate': 1, 'your'...
d4    {'enter': 1, 'to': 2, 'win': 3, '25': 1, '000'...
d5    {'dear': 1, 'recipient': 1, 'avangar': 4, 'tec...
d6    {'i': 7, 'know': 1, 'that': 4, 's': 1, 'an': 1...
Name: document, dtype: object
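
For a single string, the same normalization can be sketched with only the standard library. This is an illustrative equivalent rather than the notebook's own code, and the helper name `bag_of_words_for` is hypothetical. Note that an apostrophe becomes a space, so `partner's` yields a separate `s` token, which is also visible in the output above.

```python
import re
from collections import Counter

def bag_of_words_for(text):
    """Lowercase, replace punctuation with spaces, split on whitespace, count."""
    normalized = re.sub(r"[^\w\s]", " ", text.lower())
    return Counter(normalized.split())

counts = bag_of_words_for("Free - Coupons for next movie. The partner's site.")
print(counts["free"], counts["s"])  # 1 1
```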

Convert the bag of words representation into a term-frequency matrix.

In [5]:
tf = pd.DataFrame(list(bag_of_words))

# Fill the NA's with 0's
tf = tf.fillna(0)

tf

Unnamed: 0,free,coupons,for,next,movie,the,above,links,will,take,you,straight,to,our,partner,s,site,more,information,or,see,other,offers,available,can,also,visit,groupon,on,working,advantage,website,records,indicate,your,pension,is,under,performing,higher,...,run,full,like,regular,this,runs,auto,matically,sleep,vacation,job,it,provides,valuable,businesses,area,high,tech,low,maintenance,fromanywhere,bring,nice,comfortable,additional,income,family,did,eight,years,since,started,inviting,others,join,ve,helped,4000,do,same
0,1.0,1.0,2,1.0,1.0,3.0,1.0,1.0,1.0,1.0,2.0,1.0,2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,1.0,2,1.0,1.0,3.0,1.0,1.0,1.0,1.0,2.0,1.0,2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2.0,0.0,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,0.0,2,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1,0.0,0.0,3.0,0.0,0.0,0.0,0.0,2.0,0.0,4,0.0,0.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,2,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,1.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1.0,0.0,0.0,0.0,...,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


# Apply Vector Space Model

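Use scikit-learn to build the TF-IDF features. The first step is similar to the manual term-frequency method above, but the vectorizer also applies inverse-document-frequency weighting and normalization.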

In [6]:
# Create the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Create vectors based on the input documents
vectors = vectorizer.fit_transform(df['document'])

# Create a dataframe of all the vectors
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = vectorizer.get_feature_names_out()
df_vectors = pd.DataFrame(vectors.toarray(), columns=feature_names, index=df.index)
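
To make the weighting less of a black box, here is a minimal standard-library sketch of what `TfidfVectorizer` is understood to do with its default settings: lowercase the text, keep tokens of two or more word characters, weight raw counts by a smoothed idf, and L2-normalize each row. The function name `tfidf_rows` and the three toy documents are made up for illustration. This is not the library's implementation.

```python
import math
import re
from collections import Counter

def tfidf_rows(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    # Assumed default tokenization: lowercase, tokens of 2+ word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1, multiplied by the raw count.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # L2 norm of the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Toy documents (hypothetical, not the assignment data)
rows = tfidf_rows(["free coupons for you", "free pension review", "win a free hotel night"])
print(rows[1])
```

In the toy output, `free` gets a smaller weight than `pension` because it appears in every document, which is the effect the idf term is meant to have.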

In [7]:
df_vectors

document_id,000,10,100,1500,25,300,4000,500,above,absurd,additional,advantage,all,already,also,an,and,animals,announces,are,area,articles,auto,available,avangar,barnyard,based,bear,beginning,bmw,book,bring,business,businesses,but,by,campaign,can,cash,cats,...,straight,straightforward,take,tech,technologies,than,that,the,their,them,there,this,those,time,to,took,trial,trivia,under,unprecendented,up,vacation,valuable,ve,visit,we,website,weekly,what,while,will,win,with,work,working,worldwide,years,you,your,z3
d1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.180219,0.0,0.0,0.180219,0.0,0.0,0.180219,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.180219,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.130384,0.0,0.0,...,0.180219,0.0,0.180219,0.0,0.0,0.0,0.0,0.337791,0.0,0.0,0.0,0.0,0.0,0.0,0.195116,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.180219,0.0,0.180219,0.0,0.0,0.0,0.180219,0.0,0.0,0.0,0.152153,0.0,0.0,0.225194,0.0,0.0
d2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.180219,0.0,0.0,0.180219,0.0,0.0,0.180219,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.180219,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.130384,0.0,0.0,...,0.180219,0.0,0.180219,0.0,0.0,0.0,0.0,0.337791,0.0,0.0,0.0,0.0,0.0,0.0,0.195116,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.180219,0.0,0.180219,0.0,0.0,0.0,0.180219,0.0,0.0,0.0,0.152153,0.0,0.0,0.225194,0.0,0.0
d3,0.0,0.0,0.0,0.0,0.161015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11649,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.196356,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.261487,0.0,0.0,0.0,0.196356,0.0,0.196356,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.13594,0.0
d4,0.326588,0.108863,0.108863,0.0,0.089269,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.193751,0.108863,0.0,0.0,0.0,0.108863,0.0,0.0,0.0,0.108863,0.0,0.0,0.0,0.108863,0.108863,0.0,0.075367,0.0,0.0,0.0,0.0,0.0,0.0,0.108863,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055773,0.0,0.0,0.108863,0.0,0.0,0.0,0.096648,0.0,0.108863,0.108863,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.326588,0.075367,0.0,0.0,0.0,0.0,0.111547,0.150734,0.108863
d5,0.0,0.0,0.0,0.083469,0.0,0.083469,0.0,0.083469,0.0,0.0,0.0,0.0,0.083469,0.0,0.0,0.0,0.198075,0.0,0.083469,0.333876,0.0,0.0,0.0,0.0,0.333876,0.0,0.0,0.0,0.083469,0.0,0.0,0.0,0.115573,0.0,0.0,0.083469,0.166938,0.049519,0.0,0.0,...,0.0,0.0,0.0,0.0,0.166938,0.083469,0.068446,0.12829,0.083469,0.083469,0.0,0.0,0.0,0.136892,0.148207,0.083469,0.0,0.0,0.0,0.083469,0.0,0.0,0.0,0.0,0.0,0.166938,0.0,0.083469,0.0,0.0,0.0,0.0,0.057787,0.068446,0.115573,0.083469,0.0,0.085527,0.0,0.0
d6,0.0,0.0,0.0,0.0,0.0,0.0,0.081501,0.0,0.0,0.081501,0.081501,0.0,0.0,0.081501,0.0,0.081501,0.096703,0.0,0.0,0.0,0.081501,0.0,0.081501,0.0,0.0,0.0,0.081501,0.081501,0.0,0.0,0.0,0.081501,0.112849,0.081501,0.081501,0.0,0.0,0.096703,0.0,0.0,...,0.0,0.081501,0.0,0.081501,0.0,0.0,0.267328,0.041755,0.0,0.0,0.0,0.081501,0.081501,0.066832,0.072357,0.0,0.0,0.0,0.0,0.0,0.0,0.081501,0.081501,0.081501,0.0,0.0,0.0,0.0,0.081501,0.163002,0.0,0.0,0.112849,0.200496,0.0,0.0,0.081501,0.167021,0.112849,0.0


Since we did not start with labeled spam messages, we filter the vectorized dataframe down to the spam-dictionary words and compare only those.

In [8]:
cols_to_keep = set(feature_names).intersection(spam_dictionary)
cols_to_keep

{'available',
 'chance',
 'click',
 'extra',
 'free',
 'investment',
 'money',
 'offer',
 'opportunity',
 'out',
 'pension',
 'visit'}
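
The intersection is an exact string match, so it only works because both sides are lowercase: the vectorizer lowercases tokens by default, and the dictionary was lowercased earlier. It also means that inflected forms such as `offers` or `offering` do not count as `offer`. A small sketch with a made-up vocabulary shows the effect of the lowercasing step:

```python
raw_dictionary = ["Free", "Click", "Pension"]
feature_names = ["click", "free", "movie", "pension"]  # illustrative lowercase vocabulary

# Without lowercasing, nothing matches the lowercase vocabulary.
without_lowercasing = set(feature_names).intersection(raw_dictionary)
# With lowercasing, all three dictionary words are found.
with_lowercasing = set(feature_names).intersection(w.lower() for w in raw_dictionary)

print(without_lowercasing)       # set()
print(sorted(with_lowercasing))  # ['click', 'free', 'pension']
```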

In [9]:
# pandas >= 2.0 does not accept a set as a column indexer, so pass a sorted list
df_vectors = df_vectors[sorted(cols_to_keep)]

Create a final dataframe to work with.

In [10]:
df_final = pd.concat([df, df_vectors], axis=1)

Since the dictionary we applied is a collection of all the likely spam words, we can take a simple sum across each document (row) to see which ones have the highest score.

In [11]:
df_final['total_tf_idf'] = df_final.sum(axis = 1, numeric_only = True)
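
Written out, the score is the sum of a document's TF-IDF weights over the spam-dictionary terms that occur in the vocabulary. The notation is introduced here only for illustration: $D$ is the spam dictionary, $V$ the vectorizer vocabulary, and $w_{t,d}$ the TF-IDF weight of term $t$ in document $d$.

$$
\mathrm{total\_tf\_idf}(d) = \sum_{t \in D \cap V} w_{t,d}
$$

Assuming the vectorizer's default L2 normalization, each $w_{t,d}$ is scaled by the length of the document's full weight vector, so a long document with many non-spam terms can score lower even when it contains several dictionary words.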

Keep only the document column for reference, along with the total TF-IDF score.

In [12]:
df_final = df_final[['document', 'total_tf_idf']]
df_final

document_id,document,total_tf_idf
d1,Free - Coupons for next movie. The above links...,0.490822
d2,Free - Coupons for next movie. The above links...,0.490822
d3,Our records indicate your Pension is under per...,0.705559
d4,"Enter to win $25,000 and get a Free Hotel Nigh...",0.23803
d5,"Dear recipient, Avangar Technologies announces...",0.417345
d6,"I know that's an incredible statement, but bea...",0.163002


Based on the total TF-IDF score for each document, we can conclude the following about the likelihood of spam:
*   d3 is most likely to be spam, with a total TF-IDF score of 0.71.
*   d1 and d2 are the same document and have a fairly high chance of being spam, with a total TF-IDF score of 0.49.
*   d5 also has a fairly high chance of being spam, with a total TF-IDF score of 0.42.
*   d4 and d6 have a low chance of being spam, with total TF-IDF scores of 0.24 and 0.16 respectively.
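
The grouping above can be reproduced by ranking the scores and applying cut-offs. The sketch below uses the totals from the table. The thresholds `HIGH` and `MEDIUM` are hypothetical values chosen only so that the buckets match the conclusions above. They were not derived from labeled data and should not be read as a validated decision rule.

```python
# Total TF-IDF scores copied from the results table
scores = {"d1": 0.490822, "d2": 0.490822, "d3": 0.705559,
          "d4": 0.23803, "d5": 0.417345, "d6": 0.163002}

# Hypothetical cut-offs, chosen only to reproduce the grouping above
HIGH, MEDIUM = 0.6, 0.4

def spam_likelihood(score):
    """Map a total TF-IDF score to a coarse likelihood label."""
    if score >= HIGH:
        return "most likely spam"
    if score >= MEDIUM:
        return "fairly high chance"
    return "low chance"

# Rank documents from highest to lowest score (ties keep their original order)
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for doc_id, score in ranked:
    print(f"{doc_id}: {score:.2f} -> {spam_likelihood(score)}")
```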


# Reference

Example of the Vector Space Model: [reference](https://colab.research.google.com/github/dlsun/pods/blob/master/10-Textual-Data/10.2%20The%20Vector%20Space%20Model.ipynb#scrollTo=2UOASR79b74x)

Second example of the Vector Space Model: [reference](https://towardsdatascience.com/natural-language-processing-feature-engineering-using-tf-idf-e8b9d00e7e76)

In [13]:
# Convert the dataframe into a Markdown table (to_markdown requires the tabulate package)
df_temp = df_final.copy()
df_temp['document'] = df_temp['document'].str[0:60] + '...'
df_temp.to_markdown()

"| document_id   | document                                                        |   total_tf_idf |\n|:--------------|:----------------------------------------------------------------|---------------:|\n| d1            | Free - Coupons for next movie. The above links will take you... |       0.490822 |\n| d2            | Free - Coupons for next movie. The above links will take you... |       0.490822 |\n| d3            | Our records indicate your Pension is under performing to see... |       0.705559 |\n| d4            | Enter to win $25,000 and get a Free Hotel Night! Just click ... |       0.23803  |\n| d5            | Dear recipient, Avangar Technologies announces the beginning... |       0.417345 |\n| d6            | I know that's an incredible statement, but bear with me whil... |       0.163002 |"
