## Joby John
## Week 10/11 Assignment



 

For this project, we will use the Reuters Corpus data set that comes with NLTK. The data set contains 10,788 financial news articles with around 1.3 million words. The documents have been classified into 90 business topics.
For instance,
    earnings reports are flagged as 'earn';
    commodity price reports are labeled with the commodity name ('crude' or 'palm-oil');
    economic indicators are labeled by type ('cpi', 'gnp').

We will split the data set into training and test data. We will train each model on the training data and predict the class of the documents in the test data. For news articles with multiple labels, the SVM model uses only the first listed label to train and test, while the Naive Bayes feature sets pair each article with every one of its labels. We will train the models using Naive Bayes and SVM classifiers and compare their accuracy.

In [1]:
import nltk, re, pprint
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.collocations import BigramCollocationFinder
from nltk.corpus import stopwords
from collections import Counter
import codecs
import sys
import os
import string
import random
import itertools as it
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot
import matplotlib.colors as colors
plot.rcParams['figure.figsize'] = (21, 14)
from scipy.stats import rankdata
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import words
from nltk.corpus import wordnet as wn
from nltk.corpus import reuters
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix


### Import and Prepare

We will normalize text and remove non-words and stop words. We will also stem words using the Porter stemmer. 

In [2]:
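# One (raw text, category) pair per category, so an article with
# several labels appears once for each label.
docs = [(reuters.raw(fileid), category)
        for category in reuters.categories()
        for fileid in reuters.fileids(category)]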

The function below tokenizes and lowercases the text, removes stop words, and stems each remaining word down to its "root" with the Porter stemmer. It then drops any token that does not start with a letter and any token shorter than 3 characters.

In [3]:
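# With no argument, stopwords.words() returns the stop words of every
# available language; a set makes the membership tests fast.
StopWords = set(stopwords.words())

# Reuse one stemmer and one compiled pattern across all calls.
stemmer = PorterStemmer()
word_pattern = re.compile('[a-zA-Z]+')

def tokenize(text):
    min_length = 3
    # Lowercase every token, then drop stop words.
    words = [word.lower() for word in word_tokenize(text)]
    words = [word for word in words if word not in StopWords]
    # Stem each remaining word.
    tokens = [stemmer.stem(token) for token in words]
    # Keep tokens that start with a letter and have at least min_length characters.
    filtered_tokens = [token for token in tokens
                       if word_pattern.match(token) and len(token) >= min_length]
    return filtered_tokens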
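
The final filtering step can be examined on its own. The sketch below uses the same pattern and length threshold as the function above on a handful of hypothetical tokens (the token list is made up for illustration). It shows that `match` only anchors at the start of a token, so tokens with trailing digits or punctuation still pass, whereas `fullmatch` would be stricter.

```python
import re

# Same pattern and length threshold as the tokenize function above.
pattern = re.compile('[a-zA-Z]+')
min_length = 3

# Hypothetical tokens chosen to show the edge cases.
sample_tokens = ['trade', 'u.s', '1987', 'mln', 'vs', 'q1-87', '3rd']

# match() anchors only at the start of the token, so 'u.s' and 'q1-87'
# pass even though they contain non-letters; tokens that start with a
# digit fail, and 'vs' is too short.
kept = [t for t in sample_tokens
        if pattern.match(t) and len(t) >= min_length]

# fullmatch() would require the whole token to be alphabetic.
strict = [t for t in sample_tokens
          if pattern.fullmatch(t) and len(t) >= min_length]

print(kept)
print(strict)
```

Whether the looser or the stricter filter is preferable depends on whether tokens such as 'u.s' should count as words.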

In [4]:
## Creating a new document list after passing each document through the tokenizing function.
documents2 = []
for doc in docs:
    documents2.append((tokenize(doc[0]), doc[1]))

### Feature extractor

We will select the 2,000 most-common words and build a feature extractor.
For document topic identification, we can define a feature for each word, indicating whether the document contains that word.

In [5]:
t_words = [word.lower() for word in reuters.words() if word.isalpha()]
t_words = [w for w in t_words if w not in StopWords]

In [6]:
list(t_words)[:9]

['asian',
 'exporters',
 'fear',
 'damage',
 'japan',
 'rift',
 'mounting',
 'trade',
 'friction']

In [7]:
## 2,000 most common words
all_words = nltk.FreqDist(t_words)
word_features = all_words.most_common(2000)

## get just the words, no counts
word_features = [w for (w,v) in word_features]

In [8]:
 

dict(list(all_words.items())[0:10])

{'asian': 65,
 'exporters': 255,
 'fear': 51,
 'damage': 143,
 'japan': 1890,
 'rift': 3,
 'mounting': 22,
 'trade': 3098,
 'friction': 29,
 'raised': 351}

In [9]:
word_features[:9]

['said', 'mln', 'vs', 'dlrs', 'pct', 'lt', 'cts', 'year', 'net']

In [10]:
# update feature extractor with new features
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features[word] = (word in document_words)
    return features
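
To see what the feature extractor produces, here is a minimal sketch on a toy corpus. The documents, the vocabulary size of 2, and the helper name `toy_document_features` are assumptions made for illustration; the logic mirrors the function above but takes the vocabulary as a parameter.

```python
from collections import Counter

# Hypothetical tokenized documents.
toy_docs = [
    ['palm', 'oil', 'export', 'rise'],
    ['oil', 'price', 'rise'],
    ['net', 'profit', 'rise'],
]

# Pick the most common words across the corpus as the vocabulary.
counts = Counter(word for doc in toy_docs for word in doc)
toy_features = [word for word, _ in counts.most_common(2)]

def toy_document_features(document, vocabulary):
    # One boolean feature per vocabulary word: is it in the document?
    document_words = set(document)
    return {word: (word in document_words) for word in vocabulary}

first = toy_document_features(toy_docs[0], toy_features)
last = toy_document_features(toy_docs[2], toy_features)
print(toy_features, first, last)
```

Every document gets a dictionary with the same keys, which is the input shape the NLTK classifiers expect.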

### Training and Test Sets

In [11]:
## Create the feature sets, then the training and test sets.
featuresets = [(document_features(d), c) for (d, c) in documents2]
# Hold out the first 1,000 feature sets for testing; train on the rest.
train_set, test_set = featuresets[1000:], featuresets[:1000]
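
Because the document list is built category by category, a plain slice takes the test set from whichever categories come first. The sketch below is a possible variation, not what this notebook does: it shuffles with a fixed seed before slicing. The toy feature sets, the seed value, and the split size are all assumptions for illustration.

```python
import random

# Hypothetical feature sets, ordered by label like the real list.
toy_featuresets = ([({'w': True}, 'acq')] * 6 +
                   [({'w': False}, 'earn')] * 4)

# Plain slicing: the held-out items all carry the first label.
test_plain, train_plain = toy_featuresets[:3], toy_featuresets[3:]

# Variation: shuffle a copy with a seeded generator, then slice.
rng = random.Random(42)  # fixed seed so the split is repeatable
shuffled = toy_featuresets[:]
rng.shuffle(shuffled)
test_shuffled, train_shuffled = shuffled[:3], shuffled[3:]

print([label for _, label in test_plain])
print([label for _, label in test_shuffled])
```

Shuffling changes which documents are held out, so accuracy measured this way would not be directly comparable with the figure reported below.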

In [12]:
## Building the classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

### Accuracy

The accuracy on the test set is about 80%.

In [13]:
print("Accuracy is {}%.".format(nltk.classify.accuracy(classifier, test_set)*100))

Accuracy is 80.7%.
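
For reference, accuracy here is simply the fraction of test documents whose predicted label equals the true label. A minimal sketch, with hypothetical labels and a hypothetical helper name `accuracy`:

```python
def accuracy(predicted, gold):
    # Fraction of positions where the predicted label equals the true label.
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Hypothetical true and predicted labels for five documents.
gold_labels = ['earn', 'acq', 'earn', 'crude', 'earn']
predicted_labels = ['earn', 'acq', 'acq', 'crude', 'earn']

toy_accuracy = accuracy(predicted_labels, gold_labels)
print("Accuracy is {:.1f}%.".format(toy_accuracy * 100))
```

A single accuracy figure does not show which categories are confused with each other; a per-category breakdown would be needed for that.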


The table below displays the most informative features. As expected, the occurrence of a specific commodity name in a document is a strong predictor of the commodity category. The two most informative word features are 'palm' and 'sorghum'. 'indonesia' also ranks highly (Indonesia is the world's largest producer of palm oil), although in this output it is tied to the label shown truncated as 'copra-'. In addition, 'saudi' and 'arabia' are among the most informative features for the 'propane' category (shown truncated as 'propan').

In [14]:
classifier.show_most_informative_features(10)

Most Informative Features
                    palm = True           palm-o : earn   =   2611.1 : 1.0
                 sorghum = True           sorghu : earn   =   2454.5 : 1.0
               economist = True             rand : earn   =   2312.9 : 1.0
               indonesia = True           copra- : earn   =   2312.9 : 1.0
              indonesian = True           copra- : earn   =   1652.1 : 1.0
                    crop = True           copra- : earn   =   1652.1 : 1.0
                  tariff = True              dfl : earn   =   1652.1 : 1.0
                  arabia = True           propan : earn   =   1321.7 : 1.0
                   cargo = True           propan : earn   =   1321.7 : 1.0
                   saudi = True           propan : earn   =   1321.7 : 1.0
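
The ratios in this table compare how likely a feature value is under one label versus another. The sketch below reproduces that idea on made-up counts, assuming the add-0.5 (expected likelihood) smoothing that, to my knowledge, NLTK's Naive Bayes trainer uses by default. The counts and the helper name `ele_prob` are hypothetical.

```python
def ele_prob(count, total, bins=2, gamma=0.5):
    # Smoothed estimate of P(feature value | label): add gamma to the
    # observed count and gamma * bins to the total. bins=2 because a
    # boolean feature has two possible values.
    return (count + gamma) / (total + bins * gamma)

# Hypothetical counts: the word appears in 9 of 10 'palm-oil' documents
# and in 1 of 1,000 'earn' documents.
p_given_palm_oil = ele_prob(9, 10)
p_given_earn = ele_prob(1, 1000)

# Likelihood ratio in the style of the table above.
ratio = p_given_palm_oil / p_given_earn
print("palm-oil : earn = {:.1f} : 1.0".format(ratio))
```

Very large ratios typically arise when a word is common in a small category and almost absent from a large one, which is why rare commodity labels dominate the table.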


## Support Vector Machine (SVM) Classifier

We will use the SVM classifier from the scikit-learn machine learning package.
We will use TF-IDF to weight the word features and then classify the documents with the SVM classifier.
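
As a rough illustration of what TF-IDF weighting does, the sketch below computes it by hand on a toy corpus. It assumes the smoothed IDF formula and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the toy documents and helper names are hypothetical, and the real vectorizer also handles tokenization.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized documents.
toy_corpus = [
    ['oil', 'price', 'rise'],
    ['oil', 'export', 'rise'],
    ['net', 'profit', 'rise'],
]

n_docs = len(toy_corpus)
# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for doc in toy_corpus for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (assumed default formula).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(doc):
    # Term frequency times IDF, then scale the vector to unit length.
    tf = Counter(doc)
    weights = {term: tf[term] * idf(term) for term in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(doc) for doc in toy_corpus]
print(vectors[0])
```

In the first document, 'price' (found in one document) receives more weight than 'oil' (two documents), which in turn outweighs 'rise' (all three).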

Reference 
https://www.quantstart.com/articles/Supervised-Learning-for-Document-Classification-with-Scikit-Learn

In [15]:
# Read the raw text of every document, in fileid order.
fileids = reuters.fileids()
corpus = [reuters.raw(fileid) for fileid in fileids]

In [16]:
# Collect the list of categories of every document, in the same fileid order.
topics = [reuters.categories(fileid) for fileid in reuters.fileids()]

### Extract topic

In [17]:

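def filter_topics(topics):
    # Keep the first listed category of each document as its single label.
    valid_categories = set(nltk.corpus.reuters.categories())
    categories = []
    for doc_topics in topics:
        if doc_topics[0] in valid_categories:
            categories.append(doc_topics[0])
    return categories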

In [18]:
y = filter_topics(topics)

### Test and Training sets

We will use 80% of the documents for training the classifier and 20% for testing.

In [19]:
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### SVM

In [20]:
svm = SVC(C=1000000.0, gamma='auto', kernel='rbf')
svm.fit(X_train, y_train) 

SVC(C=1000000.0, gamma='auto')
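
The `kernel='rbf'` and `gamma='auto'` settings can be pictured with a small sketch. It assumes the usual RBF formula, exp(-gamma * squared distance), and that `gamma='auto'` means 1 / number of features, as scikit-learn documents. The vectors and the helper name `rbf_kernel` are hypothetical.

```python
import math

def rbf_kernel(x, y, gamma):
    # Similarity that decays with the squared Euclidean distance.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Two hypothetical unit-length feature vectors with 3 features.
x = [0.0, 0.6, 0.8]
y = [0.6, 0.0, 0.8]

# gamma='auto' is assumed to mean 1 / n_features.
gamma_auto = 1.0 / len(x)

k_same = rbf_kernel(x, x, gamma_auto)
k_diff = rbf_kernel(x, y, gamma_auto)
print(k_same, k_diff)
```

With a TF-IDF vocabulary of many thousands of terms, a gamma computed this way would be very small, so kernel values between documents would all sit close to 1.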

In [21]:
pred = svm.predict(X_test)

score = svm.score(X_test, y_test)
score
 

0.8934198331788693

### Result Discussion
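We trained SVM and Naive Bayes classifiers on a training set and measured the accuracy of each model on a test set.
Based on the accuracies calculated above, the SVM classifier has a higher overall classification accuracy (89.3%) than Naive Bayes (80.7%).
The two figures come from different test sets (the first 1,000 feature sets for Naive Bayes, a random 20% split for SVM), so the comparison is indicative rather than exact.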

