So, what did we do in the first two lessons?

First, we opened the iris data set to show how to build a machine learning model. We used train_test_split to split the data and then fit a k-nearest neighbors model (KNeighborsClassifier) on it to predict what kind of iris each sample was.

In [2]:
import pandas as pd
import numpy as np
from sklearn import naive_bayes 

In [5]:
from sklearn.datasets import load_iris

In [12]:
iris = load_iris()
X = iris.data
y = iris.target

Inspect the data with a Pandas DataFrame.

In [33]:
pd.DataFrame(X, columns=iris.feature_names).head()

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2


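First I will fit a Naive Bayes model (MultinomialNB) without train_test_split, so the accuracy below is measured on the same samples the model was trained on. The train_test_split version comes later, with the SMS data.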

In [16]:
nb = naive_bayes.MultinomialNB()

In [17]:
nb.fit(X, y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [22]:
y_pred = nb.predict(X)

In [23]:
from sklearn.metrics import accuracy_score

In [25]:
accuracy_score(y, y_pred)

0.9533333333333334

In [34]:
accuracy_score(y_pred, y)

0.9533333333333334
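
The score above is measured on the same samples the model was fit on, so it is probably optimistic. Here is a minimal sketch of the same model scored on held-out data instead. The random_state=0 seed and the default 75/25 split are assumptions for illustration, not values from the lesson.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 25% of the samples (the default); random_state is an assumed seed
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = MultinomialNB()
nb.fit(X_train, y_train)

# Score only on samples the model has not seen during fit
held_out_accuracy = accuracy_score(y_test, nb.predict(X_test))
print(held_out_accuracy)
```

The held-out number is the one to compare against other models, since it estimates how the model does on new flowers.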

In [44]:
from sklearn.neighbors import KNeighborsClassifier

In [50]:
knn = KNeighborsClassifier()
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

Now we get a prediction on some made-up data.

In [54]:
knn.predict([[5,3,4,0.2]])

array([1])
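
The prediction comes back as a class index. A small sketch of turning that index into a species name, using the target_names array that load_iris() provides:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier()
knn.fit(iris.data, iris.target)

# Same made-up flower as above: sepal length, sepal width, petal length, petal width
pred = knn.predict([[5, 3, 4, 0.2]])

# Index the array of species names with the predicted class index
species = iris.target_names[pred]
print(species)
```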

Next, we made up a few pretend SMS messages and built a small data set from them to demonstrate how to use CountVectorizer.

So first we make up some data. 

In [65]:
simple_train = ['call you tonight', 'Call me a cab', 'please call me...PLEASE!']
labels = [0,0,1]



Now we put the data into CountVectorizer to turn the text into numbers.

In [72]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(simple_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [73]:
simple_train_dtm = vect.transform(simple_train)

In [74]:
simple_train_dtm

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [75]:
simple_train_dtm.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]])

In [76]:
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())

   cab  call  me  please  tonight  you
0    0     1   0       0        1    1
1    1     1   1       0        0    0
2    0     1   1       2        0    0


In [79]:
print(simple_train_dtm)

  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2


Note how printing simple_train_dtm lists only the positions of the nonzero values and skips the cells that are zero. Each entry is (row, column) followed by the count, so (0, 1) means the first row, second column ('call') has the value 1. The only value of 2 is at (2, 3): the third row, fourth column, 'please'.

In [81]:
simple_test = ["please don't call me"]

In [83]:
simple_test_dtm = vect.transform(simple_test)
print(simple_test_dtm)

  (0, 1)	1
  (0, 2)	1
  (0, 3)	1


In [86]:
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())

   cab  call  me  please  tonight  you
0    0     1   1       1        0    0


Note that 'don't' was left out. The default tokenizer reduces it to 'don', which is not in the vocabulary learned from the training messages, so transform() ignores it.
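
One way to check what happens to an unseen word is to ask the vectorizer for its analyzer and compare the resulting tokens with the learned vocabulary. This is a small sketch using build_analyzer() and the vocabulary_ attribute:

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
vect.fit(['call you tonight', 'Call me a cab', 'please call me...PLEASE!'])

# The analyzer applies lowercasing and the default token pattern
analyzer = vect.build_analyzer()
test_tokens = analyzer("please don't call me")
print(test_tokens)

# Tokens that are not in the learned vocabulary are dropped by transform()
unknown = [t for t in test_tokens if t not in vect.vocabulary_]
print(unknown)
```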

Then we read some real SMS data to do some real machine learning. 

In [99]:
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv"
sms = pd.read_table(url, header=None, names=['label', 'message'])

In [100]:
sms

      label                                            message
0       ham  Go until jurong point, crazy.. Available only ...
1       ham                      Ok lar... Joking wif u oni...
2      spam  Free entry in 2 a wkly comp to win FA Cup fina...
3       ham  U dun say so early hor... U c already then say...
4       ham  Nah I don't think he goes to usf, he lives aro...
...     ...                                                ...
5567   spam  This is the 2nd time we have tried 2 contact u...
5568    ham               Will ü b going to esplanade fr home?
5569    ham  Pity, * was in mood for that. So...any other s...
5570    ham  The guy did some bitching but I acted like i'd...


In [104]:
sms.label.value_counts()/len(sms)*100

ham     86.593683
spam    13.406317
Name: label, dtype: float64

Now this is a real data set. The first thing we do is map the 'ham' and 'spam' strings to 0 and 1, so the labels are numeric.

In [105]:
sms['label'] = sms.label.map({'ham':0, 'spam':1})

In [106]:
sms

      label                                            message
0         0  Go until jurong point, crazy.. Available only ...
1         0                      Ok lar... Joking wif u oni...
2         1  Free entry in 2 a wkly comp to win FA Cup fina...
3         0  U dun say so early hor... U c already then say...
4         0  Nah I don't think he goes to usf, he lives aro...
...     ...                                                ...
5567      1  This is the 2nd time we have tried 2 contact u...
5568      0               Will ü b going to esplanade fr home?
5569      0  Pity, * was in mood for that. So...any other s...
5570      0  The guy did some bitching but I acted like i'd...


In [109]:
sms.label.value_counts()/len(sms)*100

0    86.593683
1    13.406317
Name: label, dtype: float64
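
Series.map() returns NaN for any value that is missing from the dictionary, so it can be worth checking that nothing slipped through. A tiny sketch with made-up labels, where 'Spam' is a deliberate mismatch and not something found in the real file:

```python
import pandas as pd

# Made-up labels for illustration; 'Spam' does not match the key 'spam'
labels = pd.Series(['ham', 'spam', 'ham', 'Spam'])
mapped = labels.map({'ham': 0, 'spam': 1})

# Values without a matching key become NaN
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

In the cell above the percentages are unchanged after mapping, which already suggests every label was mapped.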

In [112]:
X = sms.message.values
X

array(['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
       'Ok lar... Joking wif u oni...',
       "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
       ..., 'Pity, * was in mood for that. So...any other suggestions?',
       "The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free",
       'Rofl. Its true to its name'], dtype=object)

In [116]:
y = sms.label
y

0       0
1       0
2       1
3       0
4       0
       ..
5567    1
5568    0
5569    0
5570    0
5571    0
Name: label, Length: 5572, dtype: int64

Now we do the train_test_split. We split before vectorizing so that the fit function, which learns the vocabulary, sees only the training portion of the data set.

In [117]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
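
Because only about 13% of the messages are spam, one option is to stratify the split so both parts keep roughly that ratio, and to fix a seed so the numbers can be reproduced. The notebook does neither, so this is only a sketch on made-up stand-in data. The names X_toy and y_toy and the seed 42 are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-in for the SMS data: 87 ham (0) and 13 spam (1)
y_toy = np.array([0] * 87 + [1] * 13)
X_toy = np.array([f"message {i}" for i in range(len(y_toy))])

# stratify keeps the class ratio similar in both parts; the seed is assumed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)

# Fraction of spam in each part
print(y_tr.mean(), y_te.mean())
```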

Now we can pass it to the CountVectorizer.

In [118]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [119]:
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)

In [121]:
len(X_test)==len(y_test)

True

In [124]:
pd.DataFrame(X_train_dtm.toarray(), columns=vect.get_feature_names()).head()

   00  000  008704050406  0089  0121  01223585334  02  0207  02073162414  021  ...  zed  zeros  zhong  zindgi  zoe  zogtorius  zouk  zyada  èn  ú1
0   0    0             0     0     0            0   0     0            0    0  ...    0      0      0       0    0          0     0      0   0   0
1   0    0             0     0     0            0   0     0            0    0  ...    0      0      0       0    0          0     0      0   0   0
2   0    0             0     0     0            0   0     0            0    0  ...    0      0      0       0    0          0     0      0   0   0
3   0    0             0     0     0            0   0     0            0    0  ...    0      0      0       0    0          0     0      0   0   0
4   0    0             0     0     0            0   0     0            0    0  ...    0      0      0       0    0          0     0      0   0   0


So now I am ready to put the data into a machine learning model. Let's try Naive Bayes.

In [126]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_test_pred = nb.predict(X_test_dtm)
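
The vectorize-then-fit steps can also be chained with scikit-learn's make_pipeline, which keeps the vocabulary fitting and the model fitting together. The lesson did not use a pipeline. This is a sketch on four made-up messages, and the name spam_clf is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training messages: 1 = spam, 0 = ham
train_msgs = [
    'win a free prize now',
    'free entry to win cash',
    'are you coming home tonight',
    'see you at lunch tomorrow',
]
train_labels = [1, 1, 0, 0]

# fit() learns the vocabulary and the model from the training text only
spam_clf = make_pipeline(CountVectorizer(), MultinomialNB())
spam_clf.fit(train_msgs, train_labels)

# predict() vectorizes new text with the learned vocabulary before classifying
preds = spam_clf.predict(['free cash prize', 'see you tonight'])
print(preds)
```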

Let's look at the metrics, shall we?

In [127]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_test_pred)

0.9885139985642498

In [128]:
metrics.confusion_matrix(y_test, y_test_pred)

array([[1197,    4],
       [  12,  180]])

In [130]:
pd.DataFrame(metrics.confusion_matrix(y_test, y_test_pred), columns=['0','1'], index=['0','1'])

      0    1
0  1197    4
1    12  180
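
In scikit-learn's confusion matrix the rows are the true labels and the columns are the predicted labels. The following sketch recomputes accuracy from the four cells shown above and also derives precision and recall for the spam class:

```python
import numpy as np

# Values copied from the confusion matrix above
cm = np.array([[1197, 4],
               [12, 180]])

# ravel() flattens row by row: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the real spam, how much was caught
print(accuracy, precision, recall)
```

So 4 ham messages were flagged as spam and 12 spam messages got through as ham.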


Now logistic regression.

In [133]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

In [135]:
lr.fit(X_train_dtm, y_train)
y_test_pred = lr.predict(X_test_dtm)
metrics.accuracy_score(y_test, y_test_pred)

0.9827709978463748

Now let's look at the spamminess of individual tokens. First, we get the tokens.

In [137]:
X_train_tokens = vect.get_feature_names()

In [138]:
len(X_train_tokens)

7482

In [145]:
print(X_train_tokens[1000:1050])

['appeal', 'appendix', 'applebees', 'apples', 'application', 'apply', 'applyed', 'applying', 'appointment', 'appointments', 'appreciate', 'appreciated', 'approaches', 'approved', 'approx', 'apps', 'appt', 'appy', 'april', 'aproach', 'aptitude', 'aquarius', 'ar', 'arcade', 'archive', 'ard', 'are', 'area', 'aren', 'arent', 'arestaurant', 'aretaking', 'areyouunique', 'argh', 'argue', 'arguing', 'argument', 'arise', 'arises', 'arithmetic', 'arm', 'armand', 'armenia', 'arms', 'arng', 'arnt', 'around', 'aroundn', 'arr', 'arrange']


The Naive Bayes model counts the number of times each token occurs in each class. The first row of feature_count_ is class 0 (ham) and the second row is class 1 (spam).

In [146]:
nb.feature_count_.shape

(2, 7482)

In [147]:
nb.feature_count_

array([[ 0.,  0.,  0., ...,  1.,  1.,  0.],
       [ 9., 26.,  2., ...,  0.,  0.,  1.]])

At the end we used glob to get some files out of the folders in data. glob returns a list of file paths that you can loop through, opening each one with open() and appending its contents to a list. You can then turn that list into a data set keyed by the file names.
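
A minimal sketch of that glob pattern follows. Since the lesson's data folder is not shown here, it writes two made-up files into a temporary directory first. The file names, their contents and the '*.txt' pattern are all assumptions.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as data_dir:
    # Create two made-up files so the example is self-contained
    for name, text in [('a.txt', 'call me'), ('b.txt', 'free prize')]:
        with open(os.path.join(data_dir, name), 'w', encoding='utf-8') as f:
            f.write(text)

    # glob returns the paths that match the pattern; sort for a stable order
    paths = sorted(glob.glob(os.path.join(data_dir, '*.txt')))

    # Read each file and key its contents by the file name
    contents = {}
    for path in paths:
        with open(path, encoding='utf-8') as f:
            contents[os.path.basename(path)] = f.read()

# A Series indexed by file name, ready to pass to a vectorizer
docs = pd.Series(contents, name='text')
print(docs)
```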