In [1]:
"""
Import the pandas and NumPy libraries, which we need to read the data
and store it as arrays or DataFrames.
""" 

import pandas as pd
import numpy as np

# Reading in data

Here we will use the pandas library to read the raw text data. Check the SMSSpamCollection.csv file provided with this homework (see https://www.kaggle.com/uciml/sms-spam-collection-dataset/version/1 for more details on the data). This is a labelled dataset of text messages, each labelled as spam or ham (not spam).

We will use the read_csv function of the pandas library to read the file called 'SMSSpamCollection.csv'
and then store it as a DataFrame called 'fvt'.


In [2]:
fvt = pd.read_csv("SMSSpamCollection.csv", sep="\t", names=["class","text"])
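
To see what the sep and names arguments do without needing the homework file, here is a minimal sketch that reads a tiny in-memory sample. The two messages and the variable names raw and sample are made up for illustration; the layout (label, tab, message, no header row) is assumed to match the real file.

```python
import io
import pandas as pd

# Hypothetical two-line sample in the same label<TAB>message layout
raw = "ham\tOk lar... Joking wif u oni...\nspam\tWINNER!! As a valued network customer\n"

# There is no header row, so names= supplies the column titles
sample = pd.read_csv(io.StringIO(raw), sep="\t", names=["class", "text"])
print(sample)
```

Without names=, pandas would treat the first message as the header row, so one data point would silently be lost.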

In [3]:
# This is what the data looks like. Notice that it has 2 columns: one called 'class' and the other called 'text'
fvt

Unnamed: 0,class,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


We store the column titled 'class' in the variable y, as per machine learning convention.

In [4]:
y = fvt["class"]

# Representing the data as a vector of features

Now we notice that our dataset is made up of rows of text messages. Machine learning algorithms are run on numeric vectors. How do we represent this text data as a numeric vector?

The scikit-learn library implements a technique known as tf-idf vectorization to convert a stream of words to a vector with fixed feature length. Roughly, tf-idf scores each word in a message by how often it occurs in that message (term frequency), scaled down for words that occur in many messages across the corpus (inverse document frequency). Check out this doc for more information on tf-idf (http://scikit-learn.org/stable/modules/feature_extraction.html)
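
To make the idea concrete, here is a small hand-rolled sketch of tf-idf in plain Python on three made-up messages. It assumes the smoothed idf formula, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, which is what the scikit-learn documentation describes as the default. The real TfidfVectorizer also lowercases and tokenizes the text differently, so treat this as an illustration only.

```python
import math
from collections import Counter

# Hypothetical toy corpus
docs = [
    "free prize call now",
    "call me later",
    "free free prize",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many messages does each word appear?
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
# Smoothed inverse document frequency: rarer words get a larger weight
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized tf-idf vector for one tokenized message."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

vectors = [tfidf_row(doc) for doc in tokenized]
print(vocab)
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 2) for v in vec])
```

Every message becomes a vector of the same length (one entry per vocabulary word), which is exactly the fixed feature length mentioned above.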

First we import the TfidfVectorizer from the scikit-learn library. Next we initialize an object of the TfidfVectorizer class and name it 'vectorizer'. Finally we use the fit_transform method of the vectorizer object to transform the rows of text to numeric vectors.

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(fvt["text"])

Now we will look at the shape of the data matrix X. It has 5572 rows, one per message in the dataset, and 8713 columns, one per feature; in this case the features are the unique words in the corpus.


In [6]:
X.shape

(5572, 8713)

We convert the label vector, y to a numeric vector of 1s and 0s where 1 represents spam and 0 represents not spam

In [7]:
# 1 for spam, 0 for ham (not spam)
Y = [1 if label == "spam" else 0 for label in y]
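
The label conversion above can also be written as a single vectorized pandas expression. This is only a sketch of an alternative; the labels Series below is a made-up stand-in for the 'class' column.

```python
import pandas as pd

# Hypothetical stand-in for the 'class' column
labels = pd.Series(["ham", "spam", "ham", "spam", "ham"])

# Comparing gives a boolean Series; astype(int) turns True/False into 1/0
numeric = (labels == "spam").astype(int)
print(numeric.tolist())
```

The result is a pandas Series rather than a Python list, which scikit-learn accepts just as well.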

# The actual Machine Learning part

Now that we have the data vectorized into a numerical form, we can apply the machine learning algorithms we learnt in class to classify any message as spam or otherwise. To this end we will use a Support Vector Machine classifier (more info here: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)

But first, we have to split the data into train and test data to evaluate our model better. sklearn has that covered too. We use the train_test_split function to achieve a 75-25% train-test split. Also notice the use of the stratify parameter. We encourage you to look up more on the stratify parameter! http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X,Y, stratify = Y)
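
To get a feel for what stratify does, here is a small sketch on made-up imbalanced labels (80 ham, 20 spam). The names labels and features and the fixed random_state are assumptions for illustration. With stratify, the spam share should be the same in the training and test parts.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 "ham" (0) and 20 "spam" (1)
labels = [0] * 80 + [1] * 20
features = [[i] for i in range(100)]

# test_size defaults to 0.25 when neither size is given
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, stratify=labels, random_state=0
)
print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```

Without stratify, a random split of a small or heavily imbalanced dataset can leave the test set with too few spam messages to evaluate on.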

Now that we have split our data into 75% training dataset and 25% test data, we can train or 'fit' our model on the training set and evaluate its accuracy on the test set.

But first we must instantiate an object of the LinearSVC class and call it svc. Next we train the model with the training data 

In [10]:
from sklearn.svm import LinearSVC 

In [11]:
svc = LinearSVC()

In [12]:
svc.fit(X_train, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [13]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

In [14]:
gb = GaussianNB()

In [15]:
# Note: GaussianNB does not accept sparse input, so X_train has to be converted to a dense array first
gb.fit(X_train.toarray(),y_train)

GaussianNB(priors=None)

# Evaluating the Model

Now we put our trained model to the test. First we observe the accuracy of the model on the training data that
we fit the model on and then we evaluate it on the test set.

To evaluate the model we use the 'score' function as shown below:

In [20]:
svc.score(X_train,y_train)

0.99928212491026558

In [21]:
print("The test score for SVC is: ", svc.score(X_test, y_test))

The test score for SVC is:  0.986360373295


In [22]:
print("The test score for Naive Bayes is: ", gb.score(X_test.toarray(),y_test))

The test score for Naive Bayes is:  0.896625987078
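
For classifiers, 'score' reports mean accuracy: the fraction of predictions that equal the true label. The sketch below checks this on a made-up one-feature dataset and also prints a confusion matrix, which shows where the errors fall (useful here because spam is the minority class). The toy data and the name clf are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.svm import LinearSVC

# Hypothetical toy data: one feature, classes on either side of 0
X_toy = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LinearSVC().fit(X_toy, y_toy)
y_hat = clf.predict(X_toy)

# Three ways of computing the same number
manual_accuracy = np.mean(y_hat == y_toy)
print(clf.score(X_toy, y_toy), accuracy_score(y_toy, y_hat), manual_accuracy)

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_toy, y_hat))
```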


# Task 1
Implement a Naive Bayes (NB) classifier for this same spam dataset. Feel free to explore all the types of NB classifiers that sklearn has to offer. Tune the parameters to see where you get the best test accuracy.

http://scikit-learn.org/stable/modules/naive_bayes.html

Hint: Look at the code above for the SVC and NB classifiers. Understand the syntax and implement an NB classifier using the same approach of fit, transform and then evaluate

In [19]:
models = [GaussianNB(), BernoulliNB(), MultinomialNB()]
for model in models:
    model.fit(X_train.toarray(),y_train)
    print(model.score(X_test.toarray(),y_test))

0.895190236899
0.982770997846
0.956209619526
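
For the parameter tuning part, one possible approach (not the only one) is a cross-validated grid search over the smoothing parameter alpha of MultinomialNB. The sketch below runs on a made-up toy corpus so that it is self-contained; the messages, the alpha grid and cv=3 are all assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy messages; 1 = spam, 0 = ham
texts = [
    "win a free prize now", "free entry claim your prize", "call now to win cash",
    "urgent free cash offer", "claim your free ringtone", "win cash prize today",
    "are you coming home", "see you at lunch", "call me when you are free",
    "ok see you later", "did you finish the homework", "lunch at home today",
]
labels = [1] * 6 + [0] * 6

X_toy = TfidfVectorizer().fit_transform(texts)

# Try each alpha with 3-fold cross-validation and keep the best one
search = GridSearchCV(MultinomialNB(), {"alpha": [0.01, 0.1, 0.5, 1.0]}, cv=3)
search.fit(X_toy, labels)
print(search.best_params_, search.best_score_)
```

On the real data you would run the search on the training split only and report the accuracy of the chosen model on the test split.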


# Task 2 

You are provided with a dataset of YouTube comments for various artists. The comments are classified as spam or otherwise. Your job is to first read the relevant csv files for each artist, use the tf-idf vectorizer to convert it into numeric vectors and then train and evaluate a classification model for each artist and report the best test accuracy. For each case split the data into 75% training and 25% test. You should report the accuracy on the test set.

Bonus (Optional): Can you find the most frequent words for data that is classified as spam?

In [31]:
# Your code here
yb_psy = pd.read_csv("Youtube01-Psy.csv")
yb_perry = pd.read_csv("Youtube02-KatyPerry.csv")
yb_em = pd.read_csv("Youtube04-Eminem.csv")

datasets = [yb_psy, yb_perry, yb_em]
for data in datasets:
    X = data["CONTENT"]
    Y = data["CLASS"]
    X_train, X_test, y_train, y_test = train_test_split(X,Y, stratify = Y, test_size = 0.25, train_size = 0.75)
    vect = TfidfVectorizer()
    X_train = vect.fit_transform(X_train)
    X_test = vect.transform(X_test)
    svc = LinearSVC().fit(X_train, y_train)
    print(svc.score(X_test, y_test))
    
#     cv = CountVectorizer()
#     X_train = cv.fit_transform(X_train.toarray())
#     print(zip(cv.get_feature_names(), np.asarray(X_train.sum(axis=0)).ravel()))

0.954545454545
0.965909090909
0.982142857143
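
Notice that the Task 2 code fits the vectorizer on the training split only and merely transforms the test split, so no information from the test comments leaks into the features. One way to make that hard to get wrong is to bundle the vectorizer and the classifier with scikit-learn's make_pipeline. The sketch below uses made-up comments in place of the YouTube files; the names comments, classes and model are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy comments; 1 = spam, 0 = not spam
comments = [
    "check out my channel", "subscribe to my channel please",
    "free gift cards at this link", "visit my website for free stuff",
    "please subscribe and share my video", "click this link to win money",
    "this song is amazing", "love the beat of this song",
    "best video ever", "she has a great voice",
    "i listen to this every day", "the dance in this video is so funny",
]
classes = [1] * 6 + [0] * 6

train_text, test_text, train_y, test_y = train_test_split(
    comments, classes, stratify=classes, test_size=0.25, random_state=0
)

# fit() runs fit_transform on the vectorizer, then fits the classifier;
# score() only transforms the test text before predicting
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_text, train_y)
print(model.score(test_text, test_y))
```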


In [38]:
yb_psy[yb_psy.CLASS == 1].shape

(175, 5)

In [61]:
# Your code here
yb_psy = pd.read_csv("Youtube01-Psy.csv")
yb_perry = pd.read_csv("Youtube02-KatyPerry.csv")
yb_em = pd.read_csv("Youtube04-Eminem.csv")

datasets = [yb_psy, yb_perry, yb_em]
for data in datasets:
    spam = data[data.CLASS == 1]["CONTENT"]
    cv = CountVectorizer()
    spam_matrix = cv.fit_transform(spam)    
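    # Total count of each word over all spam comments (one entry per vocabulary word)
    totals = np.asarray(spam_matrix.sum(axis = 0)).ravel()
    index_max = totals.argmax()
    # Note: scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there
    print((cv.get_feature_names()[index_max], totals[index_max]))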

('my', 81)
('com', 92)
('out', 215)


# Task 3

In the business world you will rarely be asked to distinguish between just 2 classes. You will usually be presented with text data in the form of reports, news articles or web scraped information that can fall into multiple categories.
This is a multiclass classification problem, so the labels will be integers such as 0, 1, 2, ... instead of just 0 or 1.

For your final task you are presented with the 20 Newsgroups dataset. It contains articles on a diverse range of topics, from atheism, motorcycles and baseball to space. We encourage you to look at the description of the dataset as well as read some of the articles in the dataset.

You have been provided with code to read and then vectorize the data. Your job is to split the data, then train the model on the training data and finally evaluate the multiclass classification model on the test data. Also, pick any 5 random rows of the test data and read the articles. Do your predicted labels for those articles match what they actually should be?

Hint: You can get the vector of predictions y_pred by calling the 'predict' function of the classifier object. So for example if you used an svc classifier your code should look something like this:

svc = LinearSVC()

svc.fit(X_train, y_train)

y_pred = svc.predict(X_test)



In [64]:
from sklearn.datasets import fetch_20newsgroups

In [65]:
#Downloading the dataset
data = fetch_20newsgroups()

In [66]:
#Loading the articles into the variable X and the labels into the variable y
X = data["data"]
y = data["target"]

In [75]:
#Names of the categories, each paired with its integer label
list(enumerate(data.target_names))

[(0, 'alt.atheism'),
 (1, 'comp.graphics'),
 (2, 'comp.os.ms-windows.misc'),
 (3, 'comp.sys.ibm.pc.hardware'),
 (4, 'comp.sys.mac.hardware'),
 (5, 'comp.windows.x'),
 (6, 'misc.forsale'),
 (7, 'rec.autos'),
 (8, 'rec.motorcycles'),
 (9, 'rec.sport.baseball'),
 (10, 'rec.sport.hockey'),
 (11, 'sci.crypt'),
 (12, 'sci.electronics'),
 (13, 'sci.med'),
 (14, 'sci.space'),
 (15, 'soc.religion.christian'),
 (16, 'talk.politics.guns'),
 (17, 'talk.politics.mideast'),
 (18, 'talk.politics.misc'),
 (19, 'talk.religion.misc')]

In [71]:
# Splitting the articles into train and test sets, then converting the text to numeric tf-idf vectors
vectorizer_newsgroup = TfidfVectorizer()
X_train, X_test, y_train, y_test = train_test_split(X,y, stratify = y, test_size = 0.25, train_size = 0.75)
X_train = vectorizer_newsgroup.fit_transform(X_train)
X_test = vectorizer_newsgroup.transform(X_test)

In [73]:
#fitting the model 
svc = LinearSVC()
svc.fit(X_train, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [83]:
#printing out category predicted by the model, next to actual label 
for i in range(5):
    print(svc.predict(X_test[i]), y_test[i])

[5] 5
[18] 18
[3] 3
[12] 12
[12] 6
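
Integer labels are hard to read, so it helps to translate them back into category names. The sketch below does this for the five predictions printed above. The list of names is copied from the target_names shown earlier in this notebook, in label order; in the notebook itself you would index data.target_names directly instead of retyping the list.

```python
# Category names in label order (label 0 is the first entry)
target_names = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]

# Predicted and actual labels from the output above
predicted = [5, 18, 3, 12, 12]
actual = [5, 18, 3, 12, 6]

for p, a in zip(predicted, actual):
    status = "match" if p == a else "MISMATCH"
    print(f"{target_names[p]:<26} {target_names[a]:<26} {status}")
```

Read this way, the one miss is an article from misc.forsale that the model labelled as sci.electronics.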
