## Natural Language Processing

In this exercise we will classify text messages as "SPAM" or "HAM" using TF-IDF vectorization. Once the texts are classified, we will examine the results to see which words are most important to each class of message.

Complete the functions below and answer the question(s) at the end. 

In [1]:
# import necessary libraries 
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import string
from nltk.corpus import stopwords
from nltk import word_tokenize

In [2]:
# read in data
df_messages = pd.read_csv('data/spam.csv', usecols=[0,1])

# convert string labels to 1 or 0 
le = LabelEncoder()
df_messages['target'] = le.fit_transform(df_messages['v1'])

# examine our data
df_messages.head()

|   | v1 | v2 | target |
|---|----|----|--------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


### TF-IDF

In [5]:
# separate features and labels 
X = df_messages['v2']
y = df_messages['target']

# generate a list of stopwords for TfidfVectorizer to ignore
stopwords_list = stopwords.words('english') + list(string.punctuation)

#stopwords_list[:100]

<b>1) Let's create a function that takes in our texts along with their respective labels and uses TF-IDF to vectorize the texts. Recall that TF-IDF "vectorizes" text (turns text into numbers) so we can do "math" with it. Each value is a numerical measure of how relevant a term is to a given document, relative to the rest of the corpus.</b>
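To make the idea concrete, here is a minimal sketch of the textbook TF-IDF calculation in plain Python. The three-message corpus and the helper names (`tf`, `idf`, `tf_idf`) are invented for illustration and are not part of the exercise.

```python
import math

# Hypothetical three-message corpus, invented for illustration only
corpus = [
    "free prize call now",
    "call me later",
    "see you later",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

def tf(term, doc):
    # term frequency: share of the document's tokens that equal `term`
    return doc.count(term) / len(doc)

def idf(term):
    # inverse document frequency: log(N / number of documents containing `term`)
    df = sum(term in doc for doc in docs)
    return math.log(n_docs / df)

def tf_idf(term, doc):
    # a term scores high when it is frequent in this document and rare elsewhere
    return tf(term, doc) * idf(term)

# TF-IDF score of every term in the first message
scores = {term: tf_idf(term, docs[0]) for term in docs[0]}
```

In the first message, "free" appears in only one of the three messages, so it scores higher than "call", which appears in two. Note that sklearn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its values will differ from this textbook version.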

In [40]:
# generate tf-idf vectorization (use sklearn's TfidfVectorizer) for our data
def tfidf(X, y, stopwords_list):
    '''
    Generate train and test TF-IDF vectorization for our data set
    
    Parameters
    ----------
    X: pandas.Series object
        Pandas series of text documents to classify 
    y : pandas.Series object
        Pandas series containing label for each document
    stopwords_list: list object
        List containing words and punctuation to remove. 
    Returns
    --------
    tf_idf_train :  sparse matrix, [n_train_samples, n_features]
        Vector representation of train data
    tf_idf_test :  sparse matrix, [n_test_samples, n_features]
        Vector representation of test data
    y_train : array-like object
        labels for training data
    y_test : array-like object
        labels for testing data
    vectorizer : vectorizer object
        fit TF-IDF vectorizer object

    '''
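    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # NOTE: stopwords_list is accepted but not passed to the vectorizer here, so the
    # recorded outputs reflect TfidfVectorizer's defaults; use
    # TfidfVectorizer(stop_words=stopwords_list) to actually drop those tokens
    vectorizer = TfidfVectorizer()
    # fit the vocabulary and IDF weights on the training texts only,
    # then reuse them to transform the test texts
    X_train = vectorizer.fit_transform(X_train)
    X_test = vectorizer.transform(X_test)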
    
    return X_train, X_test, y_train, y_test, vectorizer

In [41]:
tf_idf_train, tf_idf_test, y_train, y_test, vectorizer = tfidf(X, y, stopwords_list)

tf_idf_train

<4179x7398 sparse matrix of type '<class 'numpy.float64'>'
	with 55050 stored elements in Compressed Sparse Row format>

### Classification

<b>2) Now that we have vectorized training data, we can use it to train a classifier that labels a text based on its vector representation. Below we have initialized a simple Naive Bayes classifier and a random forest classifier. Complete the function below, which accepts a classifier object, a vectorized training set, a vectorized test set, and the training labels, and returns predictions for the training set and, separately, predictions for the test set.</b>

In [42]:
nb_classifier = MultinomialNB()
rf_classifier = RandomForestClassifier(n_estimators=100)

In [54]:
# create a function that takes in a classifier, trains it on our tf-idf vectors,
# and generates train and test predictions
def classify_text(classifier, tf_idf_train, tf_idf_test, y_train):
    '''
    Train a classifier to identify whether a message is spam or ham
    
    Parameters
    ----------
    classifier: sklearn classifier
       initialized sklearn classifier (MultinomialNB, RandomForestClassifier, etc.)
    tf_idf_train : sparse matrix, [n_train_samples, n_features]
        TF-IDF vectorization of train data
    tf_idf_test : sparse matrix, [n_test_samples, n_features]
        TF-IDF vectorization of test data
    y_train : pandas.Series object
        Pandas series containing label for each document in the train set
    Returns
    --------
    train_preds : array-like object
        Predictions for train data
    test_preds : array-like object
        Predictions for test data
    '''
    # your code here
    # a) fit the classifier with our training data
    classifier.fit(tf_idf_train, y_train)
    
    # b) predict the labels of our train data and store them in train_preds
    train_preds = classifier.predict(tf_idf_train)
    
    # c) predict the labels of our test data and store them in test_preds
    test_preds = classifier.predict(tf_idf_test)    
    
    # d) return train_preds and test_preds
    return train_preds, test_preds

In [55]:
# generate predictions with Naive Bayes Classifier
nb_train_preds, nb_test_preds = classify_text(nb_classifier, tf_idf_train, tf_idf_test, y_train)

# evaluate performance of Naive Bayes Classifier
print(confusion_matrix(y_test, nb_test_preds))
print(accuracy_score(y_test, nb_test_preds))

[[1202    0]
 [  56  135]]
0.9597989949748744


In [56]:
# generate predictions with Random Forest Classifier
rf_train_preds, rf_test_preds = classify_text(rf_classifier, tf_idf_train, tf_idf_test, y_train)

# evaluate performance of Random Forest Classifier
print(confusion_matrix(y_test, rf_test_preds))
print(accuracy_score(y_test, rf_test_preds))

[[1202    0]
 [  37  154]]
0.9734386216798278
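Accuracy alone hides how the errors are distributed. As a rough sketch, the two confusion matrices printed above can be unpacked by hand, assuming sklearn's layout (rows are true labels, columns are predictions, in label order ham = 0, spam = 1). The helper name `summarize` is hypothetical.

```python
def summarize(cm):
    # rows are true labels, columns are predicted labels, ordered [ham, spam]
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "spam_precision": tp / (tp + fp),  # share of predicted spam that is spam
        "spam_recall": tp / (tp + fn),     # share of actual spam that was caught
    }

# confusion matrices copied from the outputs above
nb_cm = [[1202, 0], [56, 135]]
rf_cm = [[1202, 0], [37, 154]]

nb_summary = summarize(nb_cm)
rf_summary = summarize(rf_cm)
```

Neither model flags any ham message as spam, but Naive Bayes misses 56 of the 191 spam messages in the test set (recall of about 0.71), while the random forest misses 37 (recall of about 0.81).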


Both classifiers classify texts as "SPAM" or "HAM" well, with test accuracy of about 96% for Naive Bayes and about 97% for the random forest. Let's figure out which words are the most important to each class of texts! Recall that inverse document frequency (IDF) measures how common a word is across a corpus or group of documents: the more documents contain a word, the lower its IDF.

<b>3) Create a function that calculates the inverse document frequency (IDF) of each word in our collection of texts.</b>

In [191]:
def get_idf(class_, df, stopwords_list):
    '''
    Get ten words with lowest IDF values representing 10 most important
    words for a defined class (spam or ham)
    
    Parameters
    ----------
    class_ : str object
        string defining class 'spam' or 'ham'
    df : pandas DataFrame object
        data frame containing texts and labels
    stopwords_list: list object
        List containing words and punctuation to remove. 
    Returns
    --------
    important_10 : pandas dataframe object
        Dataframe containing 10 words and respective IDF values
        representing the 10 most important words found in the texts
        associated with the defined class
    '''
    # your code here
    import math
    # a) generate series containing all texts associated with the defined class
    docs = df[df['v1'] == class_]['v2']

    # b) initialize dictionary to count document frequency
    # (number of documents that contain a certain word)
    class_dict = {}

    # c) loop over each text and split each text into a list of its unique words
    for doc in docs:
        words = set(doc.split())
        # d) loop over each word and if it is not in the stopwords_list add the word
        #    to class_dict with a value of 1. if it is already in the dictionary
        #    increment it by 1
        for word in words:
            if word not in stopwords_list:
                class_dict[word] = class_dict.get(word, 0) + 1

    # e) take our dictionary and calculate the
    #    IDF, log(number of docs / number of docs containing each word),
    #    for each word
    N = len(docs)
    idf = {word: math.log(N / count) for word, count in class_dict.items()}

    # f) return the 10 words with the lowest IDF
    important_10 = pd.DataFrame(list(idf.items()), columns=['word', 'idf'])
    important_10 = important_10.sort_values('idf').head(10).reset_index(drop=True)
    return important_10

In [202]:
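# scratch cell: count document frequency for the spam messages
docs = df_messages[df_messages['v1'] == 'spam']
# docs.items is a method, not an iterable; loop over the message texts instead
class_dict = {}
for doc in docs['v2']:
    for word in set(doc.split()):
        class_dict[word] = class_dict.get(word, 0) + 1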

In [193]:
get_idf('spam', df_messages, stopwords_list)


In [152]:
get_idf('ham', df_messages, stopwords_list)

### Explain
<b> 4) Imagine that the word "school" has the highest TF-IDF value in the second document of our test data. What does that tell us about the word school? </b>

In [None]:
# Your answer here
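
One way to explore this question is to look up which term carries the largest weight in a single document's TF-IDF row. The sketch below uses a hypothetical helper, `top_term`, with an invented vocabulary and invented weights. With the fitted objects from earlier, the row could plausibly come from `tf_idf_test[1].toarray().ravel()` and the names from `vectorizer.get_feature_names_out()` (available in recent scikit-learn versions).

```python
def top_term(row, feature_names):
    # return the (term, weight) pair with the largest TF-IDF weight in one document row
    best = max(range(len(row)), key=lambda i: row[i])
    return feature_names[best], row[best]

# Hypothetical vocabulary and weights, invented for illustration only
feature_names = ["call", "free", "later", "school"]
second_doc = [0.0, 0.12, 0.31, 0.94]

term, weight = top_term(second_doc, feature_names)
```

A term ends up with the largest weight in a row when it is frequent in that message and appears in relatively few of the messages the vectorizer was fit on.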