# CSCI - 5901 - The Process of Data Science - Summer 2019 Assignment 2
## Nirav Solanki(B00808427)
## Aditya Gadhvi(B00809664) 

# Dataset 
## For this assignment, the 20 newsgroups dataset (loaded with `fetch_20newsgroups`) is used. It contains around 18,000 newsgroup posts divided into 20 topics. We use only the posts from 4 topics: alt.atheism, talk.religion.misc, comp.graphics, sci.space.

# Collocation extraction
## Fetching the data
### We load the data with `fetch_20newsgroups`, restricted to the 4 newsgroups described above. Note that the call below uses the default `subset='train'` and does not pass the `remove` argument, so only the training split is loaded and headers, footers and quotes are kept in the text.

In [1]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups = fetch_20newsgroups(categories=categories)

## To manipulate the dataset easily, we store the data in a pandas dataframe, transpose it into the proper shape, and give the columns appropriate names.

In [2]:
import pandas as pd
df = pd.DataFrame([newsgroups.data,newsgroups.target.tolist()])
df = df.T #Transposing the dataframe to create proper shape.
df.columns = ["news","categories"]
df.head()

Unnamed: 0,news,categories
0,From: rych@festival.ed.ac.uk (R Hawkes)\nSubje...,1
1,Subject: Re: Biblical Backing of Koresh's 3-02...,3
2,From: Mark.Perew@p201.f208.n103.z1.fidonet.org...,2
3,From: dpw@sei.cmu.edu (David Wood)\nSubject: R...,0
4,From: prb@access.digex.com (Pat)\nSubject: Con...,2


# Tokenizing the data

In [3]:
import nltk
#nltk.download('punkt')
from nltk.tokenize import word_tokenize

df["tokenized_news"] = df["news"].apply(word_tokenize).tolist()
df["tokenized_news"].head()

0    [From, :, rych, @, festival.ed.ac.uk, (, R, Ha...
1    [Subject, :, Re, :, Biblical, Backing, of, Kor...
2    [From, :, Mark.Perew, @, p201.f208.n103.z1.fid...
3    [From, :, dpw, @, sei.cmu.edu, (, David, Wood,...
4    [From, :, prb, @, access.digex.com, (, Pat, ),...
Name: tokenized_news, dtype: object

# Performing POS(Part-Of-Speech) Tagging

In [4]:
from nltk import pos_tag_sents
#nltk.download('averaged_perceptron_tagger')
df['POS'] = pos_tag_sents(df["tokenized_news"])
df['POS'].head()

0    [(From, IN), (:, :), (rych, NN), (@, NN), (fes...
1    [(Subject, JJ), (:, :), (Re, NN), (:, :), (Bib...
2    [(From, IN), (:, :), (Mark.Perew, NNP), (@, NN...
3    [(From, IN), (:, :), (dpw, NN), (@, NN), (sei....
4    [(From, IN), (:, :), (prb, NN), (@, NN), (acce...
Name: POS, dtype: object

# Calculating Frequency Distribution on the tokens
## `FreqDist()` did not accept the dataframe column of token lists directly, so we flattened it into a single list of tokens and then calculated the frequency distribution.

In [5]:
tokenized_news =  df['tokenized_news']
words = []
for wordList in tokenized_news:
    words += wordList


In [6]:
from nltk.probability import FreqDist
fdist = FreqDist(words)
fdist.most_common(3)

[(',', 30129), ('.', 27573), ('the', 23477)]
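The flattening loop can also be written with `itertools.chain`, and since `FreqDist` is a subclass of `collections.Counter`, a plain `Counter` produces the same counts. A minimal sketch, assuming a made-up `token_lists` value in place of `df["tokenized_news"]`:

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for df["tokenized_news"]: one token list per post.
token_lists = [
    ["the", "moon", ",", "the", "sun", "."],
    ["the", "shuttle", ",", "again", "."],
]

# Flatten the list of lists into one token stream, then count each token.
words = list(chain.from_iterable(token_lists))
counts = Counter(words)

top3 = counts.most_common(3)
print(top3)
```

As in the real output above, punctuation and function words dominate the top of the list, which is why the later steps clean and filter the tokens.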

# Applying techniques such as Frequency with filter, PMI, T-test with filter, Chi-Sq test to extract bigram collocations from the corpus

## To extract meaningful bigrams, we first cleaned the data and then applied all of the techniques. Without cleaning, irrelevant and meaningless bigrams were extracted.

##  Applying cleaning techniques:

In [8]:
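import re
# Strip every non-letter character from each token. Tokens with no letters
# (punctuation, numbers) become empty strings and stay in the list.
cleanlist = [re.sub('[^a-zA-Z]+', '', token) for token in words]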
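The substitution above removes characters but not tokens, so a token made only of punctuation or digits survives as an empty string. This is presumably why the pair of two empty strings tops the unfiltered bigram table further down. A small sketch on hypothetical tokens, with one possible extra filtering step that is not part of the original pipeline:

```python
import re

# Hypothetical mix of word, punctuation, and number tokens.
tokens = ["From", ":", "rych", "@", "festival.ed.ac.uk", "(", "3-02"]

# Same substitution as in the notebook: strip every non-letter character.
stripped = [re.sub(r"[^a-zA-Z]+", "", tok) for tok in tokens]

# Optional extra step (an assumption, not in the original): drop empty tokens.
non_empty = [tok for tok in stripped if tok]

print(stripped)
print(non_empty)
```

Dropping the empty tokens would change which words become adjacent, so the bigram counts reported in this notebook would no longer be reproduced exactly.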

### (1) Frequency with filter

In [9]:
# The following code prints the bigrams without applying any kind of filter.
import nltk
bigrams = nltk.collocations.BigramAssocMeasures()
bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(cleanlist)
#bigrams
bigram_freq = bigramFinder.ngram_fd.items()
bigramFreqTable = pd.DataFrame(list(bigram_freq), columns=['bigram','freq']).sort_values(by='freq', ascending=False)
bigramFreqTable.head(20)

Unnamed: 0,bigram,freq
22,"(, )",61931
25,"(, I)",4863
441,"(, and)",3629
482,"(of, the)",3085
170,"(, The)",2842
0,"(From, )",2137
9,"(Subject, )",2087
21,"(Lines, )",2049
169,"(Organization, )",1962
8,"(, Subject)",1896


In [10]:
# The following code performs cleaning such as removing stop words etc. From this we will get filtered bigrams.
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords 
#get english stopwords
en_stopwords = set(stopwords.words('english'))

# Function to filter for (adjective or noun, noun) bigrams.
def rightTypes(ngram):
    # Reject placeholder and leftover tokens ('' comes from the cleaning step above).
    if '-pron-' in ngram or '' in ngram or ' ' in ngram or 't' in ngram:
        return False
    # Reject bigrams that contain a stop word.
    if any(word in en_stopwords for word in ngram):
        return False
    acceptable_types = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    second_type = ('NN', 'NNS', 'NNP', 'NNPS')
    # Tag the two words in isolation and require (adjective or noun, noun).
    tags = nltk.pos_tag(ngram)
    return tags[0][1] in acceptable_types and tags[1][1] in second_type

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\adity\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [11]:
# Filter bigrams.
# We get meaningful bigrams after applying the filter.
filtered_bi = bigramFreqTable[bigramFreqTable.bigram.map(lambda x: rightTypes(x))]
filtered_bi.head(20)

Unnamed: 0,bigram,freq
1562,"(world, NNTPPostingHost)",128
1324,"(Henry, Spencer)",96
7127,"(Jon, Livesey)",93
7147,"(Keith, Allan)",89
7148,"(Allan, Schneider)",89
733,"(Computer, Science)",87
13988,"(University, Lines)",85
5878,"(Kent, Sandvik)",78
7133,"(Political, Atheists)",76
15083,"(California, Institute)",70


In [12]:
frequency_bi = filtered_bi[:20].bigram.values

### (2) PMI (Pointwise Mutual Information)

In [13]:
bigramFinder.apply_freq_filter(20)

In [14]:
bigramPMITable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.pmi)), columns=['bigram','PMI']).sort_values(by='PMI', ascending=False)

In [15]:
bigramPMITable.head(20)

Unnamed: 0,bigram,PMI
0,"(comme, aucun)",15.265952
1,"(Steinn, Sigurdsson)",15.195563
2,"(ISLAMIC, LAW)",15.128448
3,"(fait, comme)",15.064318
4,"(sank, Manhattan)",14.997204
5,"(Carnegie, Mellon)",14.832993
6,"(Beam, Jockey)",14.729899
7,"(Bake, Timmons)",14.707462
8,"(Frequently, Asked)",14.633684
9,"(Chapel, Hill)",14.619589


In [16]:
pmi_bi=bigramPMITable[:20].bigram.values
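PMI compares how often a pair occurs with how often it would occur if the two words were independent. The sketch below computes it by hand from counts; all counts are hypothetical, and the function name `pmi` is ours, not NLTK's. It illustrates why a frequency filter such as `apply_freq_filter(20)` is applied first: rare words that only ever occur together get very high scores.

```python
import math

def pmi(pair_count, first_count, second_count, total):
    """PMI in bits: log2 of observed over expected co-occurrence."""
    return math.log2(pair_count * total / (first_count * second_count))

total = 1_000_000  # hypothetical number of bigrams in the corpus

# A rare pair whose two words only ever occur together.
rare = pmi(pair_count=20, first_count=20, second_count=20, total=total)

# A frequent pair made of two very common words.
common = pmi(pair_count=3000, first_count=25000, second_count=40000, total=total)

print(round(rare, 3), round(common, 3))
```

The rare pair scores far higher than the frequent one, which matches the kind of bigrams (names, foreign phrases) at the top of the PMI table.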

### (3) T-test with filter

In [17]:
bigramTtable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.student_t)), columns=['bigram','t']).sort_values(by='t', ascending=False)
bigramTtable.head(20)

Unnamed: 0,bigram,t
0,"(, )",54.07434
1,"(of, the)",47.884251
2,"(, I)",40.172893
3,"(, The)",39.116069
4,"(in, the)",36.42529
5,"(From, )",34.338498
6,"(Subject, )",34.336979
7,"(Lines, )",34.022641
8,"(In, article)",33.641848
9,"(Organization, )",33.258201


In [18]:
filteredT_bi = bigramTtable[bigramTtable.bigram.map(lambda x: rightTypes(x))]
filteredT_bi.head(20)

Unnamed: 0,bigram,t
171,"(world, NNTPPostingHost)",11.281002
241,"(Henry, Spencer)",9.796435
246,"(Jon, Livesey)",9.642051
266,"(Allan, Schneider)",9.432706
267,"(Keith, Allan)",9.431556
276,"(Computer, Science)",9.319078
304,"(University, Lines)",8.954049
318,"(Kent, Sandvik)",8.830147
321,"(Political, Atheists)",8.716659
356,"(California, Institute)",8.362592


In [19]:
ttest_bi=filteredT_bi[:20].bigram.values
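The t-test ranking behaves differently from PMI because the score is bounded by the square root of the pair count. The sketch below uses the common textbook approximation, in which the variance is approximated by the observed pair count; we assume, without having checked the source, that NLTK's `student_t` follows the same form. The counts and the name `t_score` are hypothetical.

```python
import math

def t_score(pair_count, first_count, second_count, total):
    """Collocation t-score with the variance approximated by the pair count."""
    expected = first_count * second_count / total
    return (pair_count - expected) / math.sqrt(pair_count)

total = 1_000_000  # hypothetical number of bigrams in the corpus

# A frequent pair of common words and a rare pair that always co-occurs.
frequent = t_score(3000, 25000, 40000, total)
rare = t_score(20, 20, 20, total)

print(round(frequent, 3), round(rare, 3))
```

Because the score cannot exceed the square root of the pair count, frequent pairs rank first, which is consistent with the t-test and frequency columns returning nearly the same bigrams.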

### (4) Chi-Sq test

In [20]:
bigramChiTable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.chi_sq)), columns=['bigram','chi-sq']).sort_values(by='chi-sq', ascending=False)
bigramChiTable.head(20)

Unnamed: 0,bigram,chi-sq
0,"(Carnegie, Mellon)",788024.0
1,"(Cookamunga, Tourist)",788024.0
2,"(Steinn, Sigurdsson)",788024.0
3,"(comme, aucun)",788024.0
4,"(VAXVMS, VNEWS)",761309.389691
5,"(ISLAMIC, LAW)",752203.772702
6,"(Frequently, Asked)",737181.870899
7,"(Los, Angeles)",722352.583218
8,"(Beam, Jockey)",706501.586118
9,"(sank, Manhattan)",686792.839845


In [21]:
chi_bi=bigramChiTable[:20].bigram.values
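The four tied scores of 788024.0 at the top of the chi-square table have a simple explanation: for a 2x2 contingency table, a pair whose two words never occur apart gets a chi-square value equal to the total number of bigrams. The sketch below shows this with Pearson's formula. Reading 788024 as the corpus size is our inference from the tied scores, not something the notebook states, and the other counts are hypothetical.

```python
def chi_square(pair_count, first_count, second_count, total):
    """Pearson chi-square for the 2x2 contingency table of a bigram (w1, w2)."""
    o11 = pair_count                   # w1 followed by w2
    o12 = first_count - pair_count     # w1 followed by another word
    o21 = second_count - pair_count    # another word followed by w2
    o22 = total - o11 - o12 - o21      # neither w1 first nor w2 second
    numerator = total * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

total = 788024  # assumed total bigram count, chosen to match the tied scores

perfect = chi_square(25, 25, 25, total)   # the two words never occur apart
partial = chi_square(25, 40, 30, total)   # the words also occur separately

print(perfect, round(partial, 1))
```

Like PMI, this measure rewards exclusive co-occurrence rather than raw frequency, which fits the large overlap between the PMI and Chi-Sq columns.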

### Comparing the top 20 bigram results of all the applied techniques:

In [22]:
bigrams_compare_results = pd.DataFrame([frequency_bi, pmi_bi, ttest_bi, chi_bi]).T


In [23]:
bigrams_compare_results.columns = ['Frequency With Filter', 'PMI', 'T-test With Filter', 'Chi-Sq Test']


In [24]:
bigrams_compare_results

Unnamed: 0,Frequency With Filter,PMI,T-test With Filter,Chi-Sq Test
0,"(world, NNTPPostingHost)","(comme, aucun)","(world, NNTPPostingHost)","(Carnegie, Mellon)"
1,"(Henry, Spencer)","(Steinn, Sigurdsson)","(Henry, Spencer)","(Cookamunga, Tourist)"
2,"(Jon, Livesey)","(ISLAMIC, LAW)","(Jon, Livesey)","(Steinn, Sigurdsson)"
3,"(Keith, Allan)","(fait, comme)","(Allan, Schneider)","(comme, aucun)"
4,"(Allan, Schneider)","(sank, Manhattan)","(Keith, Allan)","(VAXVMS, VNEWS)"
5,"(Computer, Science)","(Carnegie, Mellon)","(Computer, Science)","(ISLAMIC, LAW)"
6,"(University, Lines)","(Beam, Jockey)","(University, Lines)","(Frequently, Asked)"
7,"(Kent, Sandvik)","(Bake, Timmons)","(Kent, Sandvik)","(Los, Angeles)"
8,"(Political, Atheists)","(Frequently, Asked)","(Political, Atheists)","(Beam, Jockey)"
9,"(California, Institute)","(Chapel, Hill)","(California, Institute)","(sank, Manhattan)"


## Overlap among the techniques:
### The results show clear overlap, but it falls into two groups. Frequency with filter and T-test with filter return the same bigrams in the rows shown, with only (Keith, Allan) and (Allan, Schneider) swapping ranks. PMI and Chi-Sq share seven of the ten bigrams shown, for example (Carnegie, Mellon), (Steinn, Sigurdsson) and (Frequently, Asked). The two groups have no bigram in common in the rows shown: the first favors frequent pairs, the second favors rare pairs whose words almost always occur together. Because the techniques within each group overlap so heavily, it makes sense to take the union of the results rather than keep them separate: the union removes the duplicates while keeping the bigrams that only one group finds.
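The overlap can be checked with set operations. The sketch below uses only the ten rows displayed in the comparison table (the notebook keeps the top 20 per technique, so the full figures may differ).

```python
from itertools import combinations

freq = {("world", "NNTPPostingHost"), ("Henry", "Spencer"), ("Jon", "Livesey"),
        ("Keith", "Allan"), ("Allan", "Schneider"), ("Computer", "Science"),
        ("University", "Lines"), ("Kent", "Sandvik"), ("Political", "Atheists"),
        ("California", "Institute")}
ttest = set(freq)  # same ten bigrams as the frequency column, two ranks swapped
pmi = {("comme", "aucun"), ("Steinn", "Sigurdsson"), ("ISLAMIC", "LAW"),
       ("fait", "comme"), ("sank", "Manhattan"), ("Carnegie", "Mellon"),
       ("Beam", "Jockey"), ("Bake", "Timmons"), ("Frequently", "Asked"),
       ("Chapel", "Hill")}
chi = {("Carnegie", "Mellon"), ("Cookamunga", "Tourist"), ("Steinn", "Sigurdsson"),
       ("comme", "aucun"), ("VAXVMS", "VNEWS"), ("ISLAMIC", "LAW"),
       ("Frequently", "Asked"), ("Los", "Angeles"), ("Beam", "Jockey"),
       ("sank", "Manhattan")}

methods = {"frequency": freq, "t-test": ttest, "PMI": pmi, "chi-sq": chi}

# Size of the intersection for every pair of techniques.
overlaps = {(a, b): len(methods[a] & methods[b])
            for a, b in combinations(methods, 2)}

# Union of all four result sets, with duplicates removed.
union = set().union(*methods.values())

for pair, size in overlaps.items():
    print(pair, size)
print("union size:", len(union))
```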

# SVM and NB for Text Classification

## Data Cleaning : 
### Remove stop words
### Remove numbers and other non-letter characters
### Stem the words

In [25]:
df["cleaned_news"] = df["news"].str.lower()
df["cleaned_news"] = df["cleaned_news"].str.replace(r"[^a-z]+", " ", regex=True)  # regex=True: newer pandas treats the pattern literally by default
df["cleaned_news"].head()

0    from rych festival ed ac uk r hawkes subject d...
1    subject re biblical backing of koresh s tape c...
2    from mark perew p f n z fidonet org subject re...
3    from dpw sei cmu edu david wood subject reques...
4    from prb access digex com pat subject conferen...
Name: cleaned_news, dtype: object

In [26]:
import nltk
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

stop_words = set(stopwords.words('english')) 

df["cleaned_news"] = df["cleaned_news"].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
df["cleaned_news"].head()

0    rych festival ed ac uk r hawkes subject ds tex...
1    subject biblical backing koresh tape cites enc...
2    mark perew p f n z fidonet org subject comet t...
3    dpw sei cmu edu david wood subject request sup...
4    prb access digex com pat subject conference ma...
Name: cleaned_news, dtype: object

In [27]:
df["tokenized_news"] = df["cleaned_news"].apply(word_tokenize).tolist()
df["tokenized_news"].head()

0    [rych, festival, ed, ac, uk, r, hawkes, subjec...
1    [subject, biblical, backing, koresh, tape, cit...
2    [mark, perew, p, f, n, z, fidonet, org, subjec...
3    [dpw, sei, cmu, edu, david, wood, subject, req...
4    [prb, access, digex, com, pat, subject, confer...
Name: tokenized_news, dtype: object

In [28]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
df["tokenized_news"] = df["tokenized_news"].apply(lambda x: [ps.stem(y) for y in x])
df['tokenized_news'].head()

0    [rych, festiv, ed, ac, uk, r, hawk, subject, d...
1    [subject, biblic, back, koresh, tape, cite, en...
2    [mark, perew, p, f, n, z, fidonet, org, subjec...
3    [dpw, sei, cmu, edu, david, wood, subject, req...
4    [prb, access, digex, com, pat, subject, confer...
Name: tokenized_news, dtype: object

In [29]:
df["cleaned_news"] = df['tokenized_news'].apply(' '.join)
df["cleaned_news"].head()

0    rych festiv ed ac uk r hawk subject ds textur ...
1    subject biblic back koresh tape cite enclos km...
2    mark perew p f n z fidonet org subject comet t...
3    dpw sei cmu edu david wood subject request sup...
4    prb access digex com pat subject confer man lu...
Name: cleaned_news, dtype: object

# Creating TFIDF vectors
## Machine learning algorithms only accept numeric (float or integer) values as input, so we compute the TFIDF values of the words in each document of the dataframe.


In [57]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

corpus = (df['cleaned_news'])
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(corpus)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts).toarray()
#Vocabulary size of entire data
X_tfidf.shape

(2034, 21051)
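To make the TFIDF values less of a black box, the sketch below computes them by hand for a tiny made-up corpus. It assumes scikit-learn's documented defaults for `TfidfTransformer` (raw counts as term frequency, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row); the documents and helper names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stemmed posts, one token list per document.
docs = [["moon", "orbit", "moon"], ["god", "orbit"], ["god", "faith"]]

vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Return the L2-normalised TFIDF vector of one document."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in docs]
print(vocab)
print([round(v, 3) for v in matrix[0]])
```

Each row has one column per vocabulary term, so the second number of the shape printed above is the vocabulary size.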

# Splitting the data into train and test sets.
## The TFIDF values are taken as features and the categories as targets. The train set is 70% of the data and the test set is 30%.

In [31]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df["categories"].astype(int), train_size=0.70, test_size=0.30, random_state=1)

# We will utilize two models: NuSVC and MultinomialNB for Text Classification

In [32]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
import numpy as np
M_Classifier = MultinomialNB().fit(X_train, y_train)

pred_1 = M_Classifier.predict(X_test)

print("Accuracy:",np.mean(pred_1 == y_test))
print("Confusion Matrix:\n",confusion_matrix(y_test, pred_1))

Accuracy: 0.8887070376432079
Confusion Matrix:
 [[136   0   2   1]
 [  0 161   3   0]
 [  0   6 182   0]
 [ 43   3  10  64]]
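The confusion matrix says more than the single accuracy figure. The sketch below recomputes the accuracy from the MultinomialNB matrix printed above and adds the recall per class. The label order is an assumption: we take the target indices to follow the alphabetical order of the four category names.

```python
# MultinomialNB confusion matrix from the output above (rows = true class).
nb_confusion = [
    [136, 0, 2, 1],
    [0, 161, 3, 0],
    [0, 6, 182, 0],
    [43, 3, 10, 64],
]
# Assumed label order (alphabetical category names).
labels = ["alt.atheism", "comp.graphics", "sci.space", "talk.religion.misc"]

total = sum(sum(row) for row in nb_confusion)
accuracy = sum(nb_confusion[i][i] for i in range(4)) / total

# Recall: share of each true class that was predicted correctly.
recall = {labels[i]: nb_confusion[i][i] / sum(nb_confusion[i]) for i in range(4)}

print("accuracy:", accuracy)
for name, value in recall.items():
    print(name, round(value, 3))
```

Under that label assumption, most of the errors come from the last class: 43 of its 120 test posts are predicted as the first class, so the two religion-related groups are the hardest to separate.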


In [33]:
# Here kernel is taken by default as rbf
from sklearn.svm import NuSVC
S_Classifier = NuSVC().fit(X_train, y_train)

pred_2 = S_Classifier.predict(X_test)

print("Accuracy:",np.mean(pred_2 == y_test))
print("Confusion Matrix:\n",confusion_matrix(y_test, pred_2))




Accuracy: 0.8265139116202946
Confusion Matrix:
 [[101   4   0  34]
 [  0 162   2   0]
 [  4  44 140   0]
 [  2  15   1 102]]


In [34]:
# Here kernel is taken as linear
from sklearn.svm import NuSVC
S_Classifier = NuSVC(kernel="linear").fit(X_train, y_train)

pred_2 = S_Classifier.predict(X_test)

print("Accuracy:",np.mean(pred_2 == y_test))
print("Confusion Matrix:\n",confusion_matrix(y_test, pred_2))

Accuracy: 0.9410801963993454
Confusion Matrix:
 [[134   2   1   2]
 [  0 163   1   0]
 [  0   9 179   0]
 [  5  14   2  99]]


In [35]:
# Here kernel is taken as poly
from sklearn.svm import NuSVC
S_Classifier = NuSVC(kernel="poly").fit(X_train, y_train)

pred_2 = S_Classifier.predict(X_test)

print("Accuracy:",np.mean(pred_2 == y_test))
print("Confusion Matrix:\n",confusion_matrix(y_test, pred_2))



Accuracy: 0.6579378068739771
Confusion Matrix:
 [[ 76  51   0  12]
 [  0 161   1   2]
 [  0 118  70   0]
 [  5  20   0  95]]


In [36]:
# Here kernel is taken as sigmoid
from sklearn.svm import NuSVC
S_Classifier = NuSVC(kernel="sigmoid").fit(X_train, y_train)

pred_2 = S_Classifier.predict(X_test)

print("Accuracy:",np.mean(pred_2 == y_test))
print("Confusion Matrix:\n",confusion_matrix(y_test, pred_2))



Accuracy: 0.8297872340425532
Confusion Matrix:
 [[101   4   0  34]
 [  0 162   2   0]
 [  2  44 142   0]
 [  2  15   1 102]]


## Comparing the accuracy of both algorithms: MultinomialNB and NuSVC:
### MultinomialNB gives an accuracy of 88.87%, while NuSVC with the default rbf kernel gives 82.65%.
### MultinomialNB therefore outperforms NuSVC with its default kernel. A likely reason is that MultinomialNB is designed for word-count style features and tends to work well on sparse, high-dimensional text data in which the same words occur repeatedly within a topic. Although the multinomial model is formulated for discrete counts, it also works in practice with fractional TFIDF values such as the ones used here. The comparison depends on the kernel, however: with a linear kernel NuSVC reaches 94.11%, which is higher than MultinomialNB.

## Changing the kernel of NuSVC:
### The accuracy of NuSVC with the default rbf kernel is 82.65%
### The accuracy of NuSVC with the linear kernel is 94.11%
### The accuracy of NuSVC with the poly kernel is 65.79%
### The accuracy of NuSVC with the sigmoid kernel is 82.98%

### Comparing the NuSVC accuracies shows that the choice of kernel clearly affects the result. The linear kernel gives the highest accuracy and the poly kernel the lowest, while the rbf and sigmoid kernels give almost the same accuracy. The kernel therefore plays an important role in SVM: depending on the choice, it can raise or lower the accuracy considerably.
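The kernel accuracies can be recomputed and ranked directly from the four NuSVC confusion matrices printed above. A short sketch; the helper name `accuracy` is ours.

```python
# NuSVC confusion matrices copied from the outputs above (rows = true class).
confusions = {
    "rbf": [[101, 4, 0, 34], [0, 162, 2, 0], [4, 44, 140, 0], [2, 15, 1, 102]],
    "linear": [[134, 2, 1, 2], [0, 163, 1, 0], [0, 9, 179, 0], [5, 14, 2, 99]],
    "poly": [[76, 51, 0, 12], [0, 161, 1, 2], [0, 118, 70, 0], [5, 20, 0, 95]],
    "sigmoid": [[101, 4, 0, 34], [0, 162, 2, 0], [2, 44, 142, 0], [2, 15, 1, 102]],
}

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

scores = {kernel: accuracy(m) for kernel, m in confusions.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for kernel in ranking:
    print(f"{kernel}: {scores[kernel]:.2%}")
```

The poly matrix also shows where that kernel fails: 118 of the 188 test posts of the third class are assigned to the second class.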

# Creating the model using noun tags. 
## First we perform pos-tagging and fetch the nouns present in the dataset.  

In [37]:
import nltk
#nltk.download('punkt')
from nltk.tokenize import word_tokenize

df["tokenized_news"] = df["news"].apply(word_tokenize).tolist()
df["tokenized_news"].head()

0    [From, :, rych, @, festival.ed.ac.uk, (, R, Ha...
1    [Subject, :, Re, :, Biblical, Backing, of, Kor...
2    [From, :, Mark.Perew, @, p201.f208.n103.z1.fid...
3    [From, :, dpw, @, sei.cmu.edu, (, David, Wood,...
4    [From, :, prb, @, access.digex.com, (, Pat, ),...
Name: tokenized_news, dtype: object

In [38]:
df['POS'] = pos_tag_sents(df["tokenized_news"])
df['POS'].head()

0    [(From, IN), (:, :), (rych, NN), (@, NN), (fes...
1    [(Subject, JJ), (:, :), (Re, NN), (:, :), (Bib...
2    [(From, IN), (:, :), (Mark.Perew, NNP), (@, NN...
3    [(From, IN), (:, :), (dpw, NN), (@, NN), (sei....
4    [(From, IN), (:, :), (prb, NN), (@, NN), (acce...
Name: POS, dtype: object

In [39]:
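# nltk's tagger emits Penn Treebank tags, in which proper nouns are "NNP" and
# "NNPS"; "NP" and "NPS" never match, so only common nouns are kept here.
noun_tags = ["NN","NNS","NP","NPS"]
noun_data = []
# Iterate over every row (range(0, size-1) would skip the last post).
for tagged_post in df["POS"]:
    noun = []
    for tags in tagged_post:
        if tags[1] in noun_tags:
            noun.append(tags)
    noun_data.append(noun)
# print(noun_data)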
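The effect of the tag list can be seen on a small hand-tagged example. The sketch below uses a hypothetical post in the `(token, tag)` format of the `POS` column and compares a filter on the common-noun tags with a filter on every Penn Treebank noun tag (those starting with `NN`).

```python
# Hypothetical tagged post in the (token, tag) format of the POS column.
tagged = [("Henry", "NNP"), ("launched", "VBD"), ("two", "CD"),
          ("rockets", "NNS"), ("from", "IN"), ("Florida", "NNP"),
          ("at", "IN"), ("dawn", "NN")]

# Only common nouns: what a filter on "NN" and "NNS" keeps.
common_only = [word for word, tag in tagged if tag in ("NN", "NNS")]

# All noun tags (NN, NNS, NNP, NNPS) share the prefix "NN".
all_nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(common_only)
print(all_nouns)
```

Whether proper nouns should count is a modelling choice; names of people and places can carry topic information, but they also include many header tokens in this dataset.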

In [40]:
df["noun_tags"] = pd.Series(noun_data)
df["noun_tags"]

0       [(rych, NN), (@, NN), (festival.ed.ac.uk, NN),...
1       [(Re, NN), (Tape, NN), (kmcvay, NN), (@, NN), ...
2       [(p201.f208.n103.z1.fidonet.org, NN), (Subject...
3       [(dpw, NN), (@, NN), (sei.cmu.edu, NN), (Subje...
4       [(prb, NN), (@, NN), (access.digex.com, NN), (...
5       [(andrew.cmu.edu, NN), (>, NN), (Re, NN), (acc...
6       [(ofa123.fidonet.org, NN), (Subject, NN), (Re,...
7       [(mjw19, NN), (@, NN), (cl.cam.ac.uk, NN), (Su...
8       [(henry, NN), (@, NN), (zoo.toronto.edu, NN), ...
9       [(hendrix, NN), (@, NN), (oasys.dt.navy.mil, N...
10      [(prb, NN), (@, NN), (access.digex.com, NN), (...
11      [(zemcik, NN), (@, NN), (ls, NN), (Subject, NN...
12      [(ddeciacco, NN), (@, NN), (cix.compulink.co.u...
13      [(pgf, NN), (@, NN), (srl03.cacs.usl.edu, NN),...
14      [(kph2q, NN), (@, NN), (onyx.cs.Virginia.EDU, ...
15      [(pyron, NN), (@, NN), (skndiv.dseg.ti.com, NN...
16      [(batman.bmd.trw.com, NN), (Subject, NN), (Re,...
17      [(dpag

In [41]:
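noun_data = []
# Keep only the word from each (word, tag) pair, for every row.
for tagged_nouns in df["noun_tags"]:
    # A row without a list (e.g. NaN from a length mismatch) yields no nouns.
    nouns = [tags[0] for tags in tagged_nouns] if isinstance(tagged_nouns, list) else []
    noun_data.append(nouns)
print(noun_data)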



In [42]:
df["nouns"] = pd.Series(noun_data)
df["nouns"]

0       [rych, @, festival.ed.ac.uk, Subject, texture,...
1       [Re, Tape, kmcvay, @, oneb.almanac.bc.ca, Orga...
2       [p201.f208.n103.z1.fidonet.org, Subject, Re, X...
3       [dpw, @, sei.cmu.edu, Subject, Request, reques...
4       [prb, @, access.digex.com, Subject, Conference...
5       [andrew.cmu.edu, >, Re, account, NNTP-Posting-...
6       [ofa123.fidonet.org, Subject, Re, mission, nam...
7       [mjw19, @, cl.cam.ac.uk, Subject, Re, Keywords...
8       [henry, @, zoo.toronto.edu, Subject, Re, moon,...
9       [hendrix, @, oasys.dt.navy.mil, Subject, Proce...
10      [prb, @, access.digex.com, Subject, Re, Organi...
11      [zemcik, @, ls, Subject, pixel, clock, pixel, ...
12      [ddeciacco, @, cix.compulink.co.uk, Subject, R...
13      [pgf, @, srl03.cacs.usl.edu, Subject, Re, Orga...
14      [kph2q, @, onyx.cs.Virginia.EDU, Subject, vend...
15      [pyron, @, skndiv.dseg.ti.com, Subject, Re, wo...
16      [batman.bmd.trw.com, Subject, Re, article, man...
17      [dpage

In [43]:
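# Note: this joins df['tokenized_news'] (every token of the post), not the noun
# lists stored in df["nouns"], so the column ends up holding the full text.
df["nouns"] = df['tokenized_news'].apply(' '.join)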
df["nouns"].head()

0    From : rych @ festival.ed.ac.uk ( R Hawkes ) S...
1    Subject : Re : Biblical Backing of Koresh 's 3...
2    From : Mark.Perew @ p201.f208.n103.z1.fidonet....
3    From : dpw @ sei.cmu.edu ( David Wood ) Subjec...
4    From : prb @ access.digex.com ( Pat ) Subject ...
Name: nouns, dtype: object

# We now again create TFIDF vectors, divide data into test and train, and build models.

In [58]:
corpus = (df['nouns'])
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(corpus)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts).toarray()
#Vocabulary size of just noun data
X_tfidf.shape

(2034, 34112)

In [45]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df["categories"].astype(int), train_size=0.70, test_size=0.30, random_state=1)

In [46]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
import numpy as np
M_Classifier = MultinomialNB().fit(X_train, y_train)

pred_1 = M_Classifier.predict(X_test)

print("Accuracy:",np.mean(pred_1 == y_test))
print("Confusion Matrix:\n",confusion_matrix(y_test, pred_1))

Accuracy: 0.8494271685761048
Confusion Matrix:
 [[137   0   1   1]
 [  3 156   5   0]
 [  0   6 182   0]
 [ 66   1   9  44]]


In [48]:
from sklearn.svm import NuSVC
S_Classifier = NuSVC().fit(X_train, y_train)

pred_2 = S_Classifier.predict(X_test)

print("Accuracy:",np.mean(pred_2 == y_test))
print("Confusion Matrix:\n",confusion_matrix(y_test, pred_2))



Accuracy: 0.7446808510638298
Confusion Matrix:
 [[ 79  30   0  30]
 [  0 158   1   5]
 [  5  46 129   8]
 [  2  28   1  89]]


In [49]:
from sklearn.svm import NuSVC
S_Classifier = NuSVC(kernel="linear").fit(X_train, y_train)

pred_2 = S_Classifier.predict(X_test)

print("Accuracy:",np.mean(pred_2 == y_test))
print("Confusion Matrix:\n",confusion_matrix(y_test, pred_2))

Accuracy: 0.9198036006546645
Confusion Matrix:
 [[131   3   1   4]
 [  1 161   1   1]
 [  0  13 175   0]
 [  6  16   3  95]]


In [50]:
from sklearn.svm import NuSVC
S_Classifier = NuSVC(kernel="poly").fit(X_train, y_train)

pred_2 = S_Classifier.predict(X_test)

print("Accuracy:",np.mean(pred_2 == y_test))
print("Confusion Matrix:\n",confusion_matrix(y_test, pred_2))



Accuracy: 0.6333878887070377
Confusion Matrix:
 [[ 66  61   0  12]
 [  0 161   1   2]
 [  0  92  84  12]
 [  4  40   0  76]]


In [51]:
from sklearn.svm import NuSVC
S_Classifier = NuSVC(kernel="sigmoid").fit(X_train, y_train)

pred_2 = S_Classifier.predict(X_test)

print("Accuracy:",np.mean(pred_2 == y_test))
print("Confusion Matrix:\n",confusion_matrix(y_test, pred_2))



Accuracy: 0.7414075286415712
Confusion Matrix:
 [[ 78  30   0  31]
 [  0 158   1   5]
 [  5  46 128   9]
 [  2  28   1  89]]


## Comparing the accuracy of both algorithms when performed only on Nouns: 
### MultinomialNB and NuSVC:

### MultinomialNB gives an accuracy of 84.94%, while NuSVC with the default rbf kernel gives 74.47%. MultinomialNB therefore also outperforms NuSVC with its default kernel on the noun data. A likely reason, as before, is that MultinomialNB is designed for word-count style features and tends to work well on sparse text data in which the same words occur repeatedly within a topic.

### In both cases, entire text data and just noun data, MultinomialNB gives higher accuracy than NuSVC with its default rbf kernel, while NuSVC with the linear kernel gives the highest accuracy overall (94.11% and 91.98%).

## Changing the kernel of NuSVC when performed on just Nouns:
### The accuracy of NuSVC with the default rbf kernel is 74.47%
### The accuracy of NuSVC with the linear kernel is 91.98%
### The accuracy of NuSVC with the poly kernel is 63.34%
### The accuracy of NuSVC with the sigmoid kernel is 74.14%

### The pattern is the same as on the entire data: the linear kernel gives the highest accuracy and the poly kernel the lowest, while the rbf and sigmoid kernels give almost the same accuracy. The choice of kernel can therefore raise or lower the accuracy considerably.

# Comparison of accuracies (of entire data and just nouns)

## MultinomialNB accuracy on the entire data is 88.87%, while on just the noun data it is 84.94%. The accuracy is lower when the data contains only nouns.

## NuSVC (kernel='linear') accuracy on the entire data is 94.11%, while on just the noun data it is 91.98%. The accuracy is lower when the data contains only nouns.

## For both algorithms, the accuracy is lower on the noun data than on the entire data.

# Comparison of vocabulary size of entire data and just noun data:

## The output of the shape function for the entire data (part-c) is (2034, 21051)
## The output of the shape function for the noun data (part-d) is (2034, 34112)

## The number of rows (documents) stays the same, while the number of columns (vocabulary size) increases from 21,051 to 34,112. A noun-only vocabulary would normally be expected to be smaller. The increase is explained by the code above: the cell that builds `df["nouns"]` joins `df['tokenized_news']`, which at that point holds the raw tokens without stop-word removal or stemming, rather than the extracted noun lists.

## The shape calls for both types of data can be found in the TFIDF cells above.
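A vocabulary built from a subset of the tokens cannot be larger than the vocabulary of all tokens, whereas stemming and stop-word removal shrink it. The sketch below illustrates both effects on hypothetical posts; the token lists and the helper `vocabulary` are made up for illustration.

```python
# Hypothetical raw tokens of two posts.
raw_posts = [
    ["The", "rockets", "were", "launched", "in", "1993"],
    ["A", "rocket", "launches", "today"],
]
# Hypothetical noun subsets of the same posts.
noun_posts = [["rockets"], ["rocket"]]
# Hypothetical result of stop-word removal, number removal, and stemming.
stemmed_posts = [["rocket", "launch"], ["rocket", "launch", "today"]]

def vocabulary(posts):
    """Set of distinct lower-cased tokens across all posts."""
    return {tok.lower() for post in posts for tok in post}

raw_vocab = vocabulary(raw_posts)
noun_vocab = vocabulary(noun_posts)
stemmed_vocab = vocabulary(stemmed_posts)

print(len(raw_vocab), len(noun_vocab), len(stemmed_vocab))
```

A second dimension larger than that of the cleaned data is therefore a sign that the vectorizer received uncleaned text rather than a noun subset.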

# References:

### We have used some of the code from Tutorial-6, more specifically the filtering code, and we have also used code from Stack Overflow as a reference (the cleaning code applied before the t-test etc.). All of the references are listed below:

### [1] "Google Colaboratory", Colab.research.google.com, 2019. [Online]. Available: https://colab.research.google.com/drive/17LMCbDOnny8h1KTqqoX3smy6US6z1vki#scrollTo=yva-7zBNgSJx. [Accessed: 17- Jul- 2019]. [Tutorial-6]

### [2] M. Karia, M. Pieters and V. Peer, "removing special characters from a list of items in python", Stack Overflow, 2019. [Online]. Available: https://stackoverflow.com/questions/47301795/removing-special-characters-from-a-list-of-items-in-python. [Accessed: 17- Jul- 2019].

### [3] "nicharuc/Collocations", GitHub, 2019. [Online]. Available: https://github.com/nicharuc/Collocations/blob/master/Collocations.ipynb. [Accessed: 17- Jul- 2019].

### [4] "sklearn.svm.NuSVC — scikit-learn 0.21.2 documentation", Scikit-learn.org, 2019. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html. [Accessed: 17- Jul- 2019].

### [5] "sklearn.naive_bayes.MultinomialNB — scikit-learn 0.21.2 documentation", Scikit-learn.org, 2019. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html. [Accessed: 17- Jul- 2019].

### [6] "5.6.2. The 20 newsgroups text dataset — scikit-learn 0.19.2 documentation", Scikit-learn.org, 2019. [Online]. Available: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html. [Accessed: 17- Jul- 2019].

### [7] "5.2. Feature extraction — scikit-learn 0.21.2 documentation", Scikit-learn.org, 2019. [Online]. Available: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction. [Accessed: 17- Jul- 2019].

### [8] "Natural Language Toolkit — NLTK 3.4.4 documentation", Nltk.org, 2019. [Online]. Available: http://www.nltk.org/. [Accessed: 17- Jul- 2019].

### [9] Lab_Text Processing.ipynb (CSCI-5901 Lab-6 Lab_Text Processing). [Accessed: 17- Jul- 2019].

