#### Assignment: Document Classification
##### Data 620
###### Team 1: Jason Givens-Doyle, Mehdi Khan, Paul Britton

Video link:


##### The assignment:
It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  http://archive.ics.uci.edu/ml/datasets/Spambase

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

##### Solution:
The Data: The project used data downloaded from Dataturks (https://dataturks.com/) that were collected to explore cyber-bullying through email, messaging, etc. The dataset has 20,001 texts or messages, each tagged '1' (offensive) or '0' (normal). The downloaded document was a text file in which each line represents one message in JSON format (which maps to a Python dictionary).

In [387]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import requests
import json
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer,  TfidfTransformer
from sklearn import metrics


Since each line is in JSON format, `json.loads` was used to read the data into the Python environment. A dataframe was then created from the data with two columns containing the messages (content) and the tags (Label):

In [339]:
# Each line of the file is one JSON object; parse them one by one
with open('Cyber-Trolls.txt', 'r') as f:
    records = [json.loads(line) for line in f if line.strip()]

df = pd.DataFrame(records)
df.drop(["extras", "metadata"], axis=1, inplace=True)
# The tag is the first entry of the 'label' list inside each annotation dict
df['Label'] = [annotation['label'][0] for annotation in df['annotation']]
df.drop(["annotation"], axis=1, inplace=True)
df.head()

,content,Label
0,Get fucking real dude.,1
1,She is as dirty as they come and that crook R...,1
2,why did you fuck it up. I could do it all day ...,1
3,Dude they dont finish enclosing the fucking sh...,1
4,WTF are you talking about Men? No men thats no...,1


#### Data Exploration

In [340]:
df.groupby('Label').describe()

,content,content,content,content
Label,count,unique,top,freq
0,12179,11853,nope,21
1,7822,2789,#NAME?,18


The table above shows 12,179 messages that were not considered offensive and 7,822 that were. The most frequent message in each category ('nope' and '#NAME?') does not seem to carry any special meaning for that category. One interesting finding is that the offensive class has far fewer unique messages (2,789 of 7,822) than the normal class (11,853 of 12,179), which indicates that the same messages were used repeatedly to offend people.
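The `unique` column produced by `describe()` counts distinct messages, so the share of repeated messages per class can be derived from it. A minimal sketch on a toy dataframe (the rows are made up for illustration; only the column names `content` and `Label` come from the notebook):

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe; the rows are illustrative only.
toy = pd.DataFrame({
    "content": ["nope", "nope", "hello there", "go away", "go away", "go away"],
    "Label":   ["0",    "0",    "0",           "1",       "1",       "1"],
})

# Total and distinct messages per class.
summary = toy.groupby("Label")["content"].agg(total="count", distinct="nunique")

# Fraction of messages in each class that repeat an earlier message.
summary["duplicate_share"] = 1 - summary["distinct"] / summary["total"]
print(summary)
```

Applied to the counts in the table above, roughly 64% of the offensive messages (1 - 2789/7822) and about 3% of the normal messages (1 - 11853/12179) are repeats.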

In [392]:
table = pd.DataFrame([['Avg. length of normal text:',
                       round(df.loc[df.Label=='0','content'].apply(len).mean(),2)],
                      ['Avg. length of offensive text:',
                       round(df.loc[df.Label=='1','content'].apply(len).mean(),2)]])
table.columns=['']*len(table.columns)
table

0,Avg. length of normal text:,39.79
1,Avg. length of offensive text:,44.0


The table above suggests that offensive messages are, on average, only slightly longer than normal messages (44.0 vs. 39.79 characters), so message length is unlikely to be a strong indicator of whether a text is offensive.

#### Text Pre-processing
All messages were tokenized, i.e. each message was split into a list of individual words. Punctuation was also removed from each text. Both steps were done with `RegexpTokenizer` from the nltk library, applied to the content column of the dataframe.

In [342]:
tokenizer = RegexpTokenizer(r'\w+')
df['content']=df.content.apply(tokenizer.tokenize)
df.head()

,content,Label
0,"[Get, fucking, real, dude]",1
1,"[She, is, as, dirty, as, they, come, and, that...",1
2,"[why, did, you, fuck, it, up, I, could, do, it...",1
3,"[Dude, they, dont, finish, enclosing, the, fuc...",1
4,"[WTF, are, you, talking, about, Men, No, men, ...",1
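For the pattern `\w+`, `RegexpTokenizer.tokenize` should return the same matches as `re.findall`, so the behaviour of this step can be checked with the standard library alone. The sketch below (the sample sentence and the helper name `tokenize` are made up) shows that punctuation is dropped and that contractions are split at the apostrophe:

```python
import re

def tokenize(text):
    # Keep maximal runs of word characters (letters, digits, underscore).
    return re.findall(r"\w+", text)

print(tokenize("Don't stop, now!"))  # ['Don', 't', 'stop', 'now']
```

The split contraction is worth keeping in mind when reading the token lists: a fragment such as `t` is a token in its own right after this step.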


Common words such as 'the', 'a', and 'are', which occur frequently but are not useful for classification, were also removed using `stopwords` from the nltk package:

In [343]:
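# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

def remove_stopwords(txt):
    # Keep tokens whose lowercase form is not an English stop word
    return [word for word in txt if word.lower() not in stop_words]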

df['content']=df.content.apply(remove_stopwords)
df.head()

,content,Label
0,"[Get, fucking, real, dude]",1
1,"[dirty, come, crook, Rengel, Dems, fucking, co...",1
2,"[fuck, could, day, Let, hour, Ping, later, sch...",1
3,"[Dude, dont, finish, enclosing, fucking, showe...",1
4,"[WTF, talking, Men, men, thats, menage, gay]",1


#### Vectorization
Vectorization is the process of converting text into numbers. So far, each message has been converted into a list of tokens (words). Vectorization is needed to represent these texts in a form that ML algorithms can use. There are multiple ways to vectorize text; the two below are combined here:

1. Term-document matrix (term frequency): counting the number of times each word appears in a document
2. TF-IDF (Term Frequency-Inverse Document Frequency): weighting the frequency to assess the relative importance of a term in the document and in the entire corpus

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

TF-IDF vectors can be generated at the word, character, or n-gram level. Only the word level is applied here.
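The two formulas can be checked by hand on a tiny corpus. The following is a minimal sketch in plain Python; the three documents and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration and are not part of the notebook:

```python
import math

# Three tiny made-up documents, already tokenized.
docs = [
    ["hate", "you"],
    ["love", "you"],
    ["you", "you", "rock"],
]

def tf(term, doc):
    # Term frequency: share of the document's tokens equal to `term`.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency with the natural log; assumes the term
    # occurs in at least one document.
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("hate", docs[0], docs))  # rare term: 0.5 * ln(3)
print(tf_idf("you", docs[0], docs))   # term in every document: 0.0
```

Note that scikit-learn's `TfidfTransformer`, used later in the notebook, by default applies a smoothed variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, to raw counts and then L2-normalizes each row, so its values will not match this sketch exactly.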


Before vectorization, the data were split into two sets to train and test the classification models. The rows were selected randomly so that both sets have reasonable numbers of offensive and non-offensive messages:

In [347]:
df['content'] = [" ".join(con) for con in df['content'].values]
df_train = df.sample(frac=.8, random_state=100).reset_index(drop=True)
df_test = df.drop(df_train.index).sample(frac=1).reset_index(drop=True)
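One caveat about the split above: `reset_index(drop=True)` renumbers the training rows from 0 to n-1 before `df.drop(df_train.index)` runs, so the drop removes the first n rows of `df` rather than the sampled rows, and the resulting test set can share rows with the training set. A minimal sketch of a disjoint split, shown on a toy dataframe (the data and variable names are illustrative), samples first and resets the indices only afterwards:

```python
import pandas as pd

toy = pd.DataFrame({"content": [f"msg {i}" for i in range(10)],
                    "Label": ["0", "1"] * 5})

# Sample the training rows but keep their original index labels for now.
train_rows = toy.sample(frac=0.8, random_state=100)

# Drop exactly those labels, so the remaining rows form the test set.
test_rows = toy.drop(train_rows.index)

# Only now renumber both frames.
toy_train = train_rows.reset_index(drop=True)
toy_test = test_rows.sample(frac=1, random_state=100).reset_index(drop=True)

# No message appears in both sets (the toy messages are all distinct).
overlap = set(toy_train["content"]) & set(toy_test["content"])
print(len(toy_train), len(toy_test), len(overlap))
```

With this ordering the test set contains only rows that were not sampled into the training set.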

In [369]:
df_train.head(8)

,content,Label
0,anyone said Arial gonna beat ass,1
1,Hahaha Well take care would hate lose homeland...,1
2,filipina foreigner,0
3,hope singing along hate think torturing poor l...,1
4,Duck tape fixes everything,0
5,says gubgivits teeth thingies stole blood ever...,1
6,sucks,1
7,N3qRO WUZZ YA NUMBER,0


In [346]:
df_test.head()

,content,Label
0,sigh oh Karrine said phrase Sluts Butts site c...,1
1,mind,0
2,hate u looooooool,1
3,thats get fat ass FAT ASS,1
4,would talk someone suicide,0


##### Train data preparation:

In [358]:
bag_of_words = CountVectorizer()
# ignore terms that appear in more than 50% of the documents
#bag_of_words.set_params(max_df=0.5)
bag_of_words_fit = bag_of_words.fit(df_train['content'])

In [366]:
tdm_train = bag_of_words_fit.transform(df_train['content'])
print(tdm_train[5])

  (0, 1792)	2
  (0, 4575)	1
  (0, 5873)	1
  (0, 8744)	1
  (0, 11513)	1
  (0, 12052)	1
  (0, 12634)	1
  (0, 12691)	1
  (0, 12848)	1
  (0, 13196)	1
  (0, 13364)	1


Term weighting:

In [370]:
tfidf_fit_train = TfidfTransformer().fit(tdm_train)
tfidf_train = tfidf_fit_train.transform(tdm_train)
tfidf_train.shape

(16001, 15182)

In [375]:
tfidf_train

<16001x15182 sparse matrix of type '<class 'numpy.float64'>'
	with 109406 stored elements in Compressed Sparse Row format>

##### Test data preparation:

In [376]:
# Reuse the vectorizer fitted on the training data so that the train and
# test matrices share the same vocabulary (columns)
tdm_test = bag_of_words.transform(df_test['content'])
tdm_test.shape


(4000, 15182)

In [378]:
print(tdm_test[5])

  (0, 8064)	1
  (0, 13036)	1


In [379]:
tfidf_fit_test = TfidfTransformer().fit(tdm_test)
tfidf_test = tfidf_fit_test.transform(tdm_test)
tfidf_test.shape


(4000, 15182)

In [380]:
tfidf_test

<4000x15182 sparse matrix of type '<class 'numpy.float64'>'
	with 18540 stored elements in Compressed Sparse Row format>
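IDF weights are a property of the corpus they are fitted on, so a common practice is to fit the transformer on the training matrix only and reuse it for the test matrix, which keeps both sets on the same weighting. A minimal sketch with made-up documents (all variable names here are illustrative, not the notebook's):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up documents for illustration.
train_docs = ["you are great", "you are awful", "great day today"]
test_docs = ["awful day", "unseen words only"]

# Fit the vocabulary and the IDF weights on the training documents only.
vectorizer = CountVectorizer().fit(train_docs)
tfidf = TfidfTransformer().fit(vectorizer.transform(train_docs))

# Transform the test documents with the already fitted objects.
test_matrix = tfidf.transform(vectorizer.transform(test_docs))

print(test_matrix.shape)   # (number of test docs, training vocabulary size)
print(test_matrix[1].nnz)  # second test doc has no known words, so 0
```

Words that never occurred in the training data are simply ignored at transform time, which is why the second test document becomes an all-zero row.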

### Model Building:

###### Naive Bayes Model:

In [382]:
from sklearn.naive_bayes import MultinomialNB
online_troll_model = MultinomialNB().fit(tfidf_train,df_train['Label'])

Predicting labels for the test dataset using the model:

In [388]:

predicted_label = online_troll_model.predict(tfidf_test)


##### Accuracy test:

In [389]:
metrics.accuracy_score(df_test['Label'], predicted_label)

0.98775

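The naive Bayes model reached almost 99% accuracy (0.98775) on the test set, which is impressive, although the figure is likely optimistic because the data contain many repeated messages and the test rows may overlap with the training rows.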
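Accuracy is the share of test messages whose predicted label matches the true label. For context, it helps to compare it with the accuracy of always predicting the majority class. A minimal sketch in plain Python (the label lists and the helper name `accuracy` are made up; the class counts are taken from the data exploration table above):

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction equals the true label.
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Made-up labels for illustration.
y_true = ["0", "0", "1", "1", "0"]
y_pred = ["0", "1", "1", "1", "0"]
print(accuracy(y_true, y_pred))  # 0.8

# Majority-class baseline from the class counts in the full dataset.
n_normal, n_offensive = 12179, 7822
baseline = max(n_normal, n_offensive) / (n_normal + n_offensive)
print(round(baseline, 3))  # 0.609
```

A classifier that always answers "normal" would therefore already be right about 61% of the time on the full dataset, which is the reference point for the accuracy reported above.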