## ToDo9 ##
### Ben Naismith
### Due 10/17/2017 ###

**Following adapted Scikit-Learn Tutorial from http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html   
(additional documentation on dataset loading: http://scikit-learn.org/stable/datasets/)**

### Load movie_reviews corpus data through sklearn ###

In [10]:
import sklearn
from sklearn.datasets import load_files

#Turn off pretty print
%pprint

Pretty printing has been turned OFF


In [2]:
moviedir = "/Users/Benjamin's/nltk_data/corpora/movie_reviews"

In [3]:
#loading all files as training data. 
movie_train = load_files(moviedir, shuffle=True)

In [4]:
#check length
len(movie_train.data)

2000

In [5]:
#target names ("classes") are automatically generated from subfolder names
movie_train.target_names

['neg', 'pos']

In [6]:
#First file seems to be about a Schwarzenegger movie. 
#First 500 characters of first file
movie_train.data[0][:500]

b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so cal"

In [7]:
#use .filenames to check the path of a file
#first file is in "neg" folder
movie_train.filenames[0]

"/Users/Benjamin's/nltk_data/corpora/movie_reviews/neg/cv405_21868.txt"

In [8]:
#first file is a negative review, so its target is 0, the index of 'neg' in target_names
movie_train.target[0]

0

### A detour: try out CountVectorizer & TF-IDF ###

In [9]:
# import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
import nltk

In [12]:
#Create small test list
sents = ['A rose is a rose is a rose is a rose.',
         'Oh, what a fine day it is.',
        "It ain't over till it's over, I tell you!!"]

In [13]:
#Initialize a CountVectorizer to use NLTK's tokenizer instead of its 
#default one (which drops punctuation and single-character tokens; 
#stop words are only removed if stop_words is set). 
#Minimum document frequency set to 1. 
foovec = CountVectorizer(min_df=1, tokenizer=nltk.word_tokenize)

#ignoring punctuation and stop words could also be useful!
#What about just ignoring punctuation?

##### From here to end of this section, I don't understand what is going on. #####

In [16]:
# sents turned into a sparse matrix of word frequency counts (one row per sentence)
sents_counts = foovec.fit_transform(sents)
# foovec now contains vocab dictionary which maps unique words to indexes
foovec.vocabulary_

{'a': 4, 'rose': 14, 'is': 9, '.': 3, 'oh': 12, ',': 2, 'what': 17, 'fine': 7, 'day': 6, 'it': 10, 'ai': 5, "n't": 11, 'over': 13, 'till': 16, "'s": 1, 'i': 8, 'tell': 15, 'you': 18, '!': 0}

In [17]:
#sents_counts has a dimension of 3 (document count) by 19 (# of unique words)
sents_counts.shape

(3, 19)

In [18]:
#this matrix is small enough to view in full! 
sents_counts.toarray()

array([[0, 0, 0, 1, 4, 0, 0, 0, 0, 3, 0, 0, 0, 0, 4, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
       [2, 1, 1, 0, 0, 1, 0, 0, 1, 0, 2, 1, 0, 2, 0, 1, 1, 0, 1]], dtype=int64)
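To make the two outputs above less mysterious, here is a minimal pure-Python sketch of what the vectorizer appears to be doing: build an alphabetically sorted vocabulary, then count tokens per sentence. The token lists are an assumption, hard-coded to match what the `vocabulary_` output implies NLTK produced after lowercasing; this is an illustration, not scikit-learn's actual implementation.

```python
from collections import Counter

# Assumption: these are the lowercased tokens that nltk.word_tokenize yields
# for the three sentences, hard-coded so the sketch needs no NLTK install.
tokenized = [
    "a rose is a rose is a rose is a rose .".split(),
    "oh , what a fine day it is .".split(),
    "it ai n't over till it 's over , i tell you ! !".split(),
]

# The vocabulary maps each unique token to a column index (sorted order).
unique_tokens = sorted({tok for doc in tokenized for tok in doc})
vocab = {tok: i for i, tok in enumerate(unique_tokens)}

# Each document becomes one row; each cell counts how often that token occurs.
counts = []
for doc in tokenized:
    freq = Counter(doc)
    counts.append([freq[tok] for tok in vocab])  # dict iterates in index order

print(vocab["rose"])  # column index assigned to 'rose'
print(counts[0])      # count row for the first sentence
```

Reading a row then comes down to looking up a word's index in the vocabulary and checking that column: 'rose' is column 14, and the first row has a 4 there.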

In [19]:
#Convert raw frequency counts into TF-IDF (Term Frequency -- Inverse Document Frequency) values
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
sents_tfidf = tfidf_transformer.fit_transform(sents_counts)

In [20]:
#TF-IDF values
#each row is scaled to unit (L2) length, so document length no longer dominates, 
#and terms that are found across many docs are weighted down
sents_tfidf.toarray()

array([[ 0.        ,  0.        ,  0.        ,  0.13650997,  0.54603988,
         0.        ,  0.        ,  0.        ,  0.        ,  0.40952991,
         0.        ,  0.        ,  0.        ,  0.        ,  0.71797683,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.28969526,  0.28969526,  0.28969526,
         0.        ,  0.38091445,  0.38091445,  0.        ,  0.28969526,
         0.28969526,  0.        ,  0.38091445,  0.        ,  0.        ,
         0.        ,  0.        ,  0.38091445,  0.        ],
       [ 0.47282517,  0.23641258,  0.17979786,  0.        ,  0.        ,
         0.23641258,  0.        ,  0.        ,  0.23641258,  0.        ,
         0.35959573,  0.23641258,  0.        ,  0.47282517,  0.        ,
         0.23641258,  0.23641258,  0.        ,  0.23641258]])
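The numbers above can be reproduced by hand. The sketch below assumes `TfidfTransformer` is running with its default settings, which I believe are a smoothed IDF of `ln((1 + n_docs) / (1 + df)) + 1` followed by L2 normalization of each row; it uses only the count matrix printed earlier.

```python
import math

# Count matrix copied from the sents_counts.toarray() output.
counts = [
    [0, 0, 0, 1, 4, 0, 0, 0, 0, 3, 0, 0, 0, 0, 4, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
    [2, 1, 1, 0, 0, 1, 0, 0, 1, 0, 2, 1, 0, 2, 0, 1, 1, 0, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each term appear?
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed IDF (assumed default): terms found in fewer documents weigh more.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]     # term frequency * IDF
    norm = math.sqrt(sum(v * v for v in weighted))   # L2 length of the row
    tfidf.append([v / norm for v in weighted])       # scale row to length 1

print(round(tfidf[0][3], 8))   # '.' in the first sentence
print(round(tfidf[0][14], 8))  # 'rose' in the first sentence
```

In the first sentence 'a' and 'rose' both occur four times, but 'rose' ends up with the higher weight because it appears in only one of the three sentences.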

### Back to real data: transforming movie reviews ###

In [21]:
#initialize movie_vec object, and then turn movie train data into a count matrix 
movie_vec = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize)
#keeps all ~25K terms that occur in at least 2 documents. 82.2% acc.

#movie_vec = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, max_features = 3000) 
#use top 3000 words only. 78.5% acc.
movie_counts = movie_vec.fit_transform(movie_train.data)

**Using .vocabulary_.get to find index of specific words**

In [22]:
#'screen' is found in the corpus, mapped to index 19637
movie_vec.vocabulary_.get('screen')

19637

In [23]:
#Likewise, Mr. Steven Seagal is present...
movie_vec.vocabulary_.get('seagal')

19690

**Using .shape to find number of docs and tokens**

In [24]:
#huge dimensions! 2,000 documents, 25K unique terms. 
movie_counts.shape

(2000, 25313)

**Again lost with the following cell**

In [25]:
#Convert raw frequency counts into TF-IDF values
tfidf_transformer = TfidfTransformer()
movie_tfidf = tfidf_transformer.fit_transform(movie_counts)

In [26]:
#Same dimensions, now with tf-idf values instead of raw frequency counts
movie_tfidf.shape

(2000, 25313)

### Training and testing a Naive Bayes classifier ###

In [27]:
# Now ready to build a classifier. 
# We will use Multinomial Naive Bayes as our model
from sklearn.naive_bayes import MultinomialNB

**Definition of Multinomial Naive Bayes:**  
Multinomial Naive Bayes is a specialized version of Naive Bayes designed for text documents. Whereas simple (Bernoulli) Naive Bayes models a document as the presence or absence of particular words, Multinomial Naive Bayes explicitly models the word counts and adjusts the underlying calculations to deal with them.
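As a rough illustration of "explicitly models the word counts", here is a toy multinomial Naive Bayes in plain Python. The tiny corpus, the function names, and the add-one smoothing value `alpha=1.0` are all made up for this sketch; it is not how scikit-learn implements `MultinomialNB`, only the same idea at small scale.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log P(class) and smoothed log P(word | class) from token lists."""
    classes = sorted(set(labels))
    vocab = sorted({tok for doc in docs for tok in doc})
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        # Word counts pooled over every document of this class.
        counts = Counter(tok for doc in class_docs for tok in doc)
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((counts[tok] + alpha) / total) for tok in vocab
        }
    return classes, log_prior, log_likelihood

def predict(model, doc):
    """Pick the class with the highest log prior plus summed word log-likelihoods."""
    classes, log_prior, log_likelihood = model
    scores = {}
    for c in classes:
        # A repeated word is added once per occurrence; unknown words are skipped.
        word_score = sum(log_likelihood[c][t] for t in doc if t in log_likelihood[c])
        scores[c] = log_prior[c] + word_score
    return max(scores, key=scores.get)

# Hypothetical toy corpus (already tokenized).
docs = [["great", "fun", "great"], ["boring", "slow"],
        ["fun", "great"], ["slow", "boring", "boring"]]
labels = ["pos", "neg", "pos", "neg"]

model = train_multinomial_nb(docs, labels)
print(predict(model, ["great", "fun"]))
print(predict(model, ["boring"]))
```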

In [28]:
#Split data into training and test sets
#from sklearn.cross_validation import train_test_split
#deprecated in 0.18

from sklearn.model_selection import train_test_split
docs_train, docs_test, y_train, y_test = train_test_split(
    movie_tfidf, movie_train.target, test_size = 0.20, random_state = 12)

In [29]:
#Train a Multinomial Naive Bayes classifier
clf = MultinomialNB().fit(docs_train, y_train)

In [30]:
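#Predicting the Test set results, find accuracy
#import accuracy_score explicitly rather than relying on sklearn.metrics being loaded as a side effect
from sklearn.metrics import accuracy_score
y_pred = clf.predict(docs_test)
accuracy_score(y_test, y_pred)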

0.82250000000000001

In [31]:
#Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[176,  30],
       [ 41, 153]])
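A short sketch of how to read this matrix, assuming rows are the true classes and columns the predicted classes, both in the order `['neg', 'pos']`. The variable names are my own; the arithmetic recovers the accuracy reported earlier.

```python
# Confusion matrix copied from the output: rows = true class,
# columns = predicted class, both ordered ['neg', 'pos'].
cm = [[176, 30],
      [41, 153]]

tn, fp = cm[0]  # truly 'neg': predicted neg (correct) / predicted pos (wrong)
fn, tp = cm[1]  # truly 'pos': predicted neg (wrong) / predicted pos (correct)

# Accuracy: share of all test reviews on the diagonal (correct predictions).
accuracy = (tn + tp) / (tn + fp + fn + tp)
# Precision for 'pos': of the reviews predicted pos, how many really were pos?
precision_pos = tp / (tp + fp)
# Recall for 'pos': of the truly pos reviews, how many did the model find?
recall_pos = tp / (tp + fn)

print(accuracy)
print(round(precision_pos, 3), round(recall_pos, 3))
```

So 176 negative and 153 positive reviews were classified correctly, 30 negative reviews were mistaken for positive, and 41 positive reviews were mistaken for negative.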

In [32]:
#This is not at all helpful and the name is far too apt
help(confusion_matrix)

Help on function confusion_matrix in module sklearn.metrics.classification:

confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)
    Compute confusion matrix to evaluate the accuracy of a classification
    
    By definition a confusion matrix :math:`C` is such that :math:`C_{i, j}`
    is equal to the number of observations known to be in group :math:`i` but
    predicted to be in group :math:`j`.
    
    Thus in binary classification, the count of true negatives is
    :math:`C_{0,0}`, false negatives is :math:`C_{1,0}`, true positives is
    :math:`C_{1,1}` and false positives is :math:`C_{0,1}`.
    
    Read more in the :ref:`User Guide <confusion_matrix>`.
    
    Parameters
    ----------
    y_true : array, shape = [n_samples]
        Ground truth (correct) target values.
    
    y_pred : array, shape = [n_samples]
        Estimated targets as returned by a classifier.
    
    labels : array, shape = [n_classes], optional
        List of labels to index the m

### Trying classifier on fake movie reviews ###

In [33]:
#very short and fake movie reviews
reviews_new = ['This movie was excellent', 'Absolute joy ride', 
            'Steven Seagal was terrible', 'Steven Seagal shined through.', 
              'This was certainly a movie', 'Two thumbs up', 'I fell asleep halfway through', 
              "We can't wait for the sequel!!", '!', '?', 'I cannot recommend this highly enough', 
              'instant classic.', 'Steven Seagal was amazing. His performance was Oscar-worthy.']
reviews_new_counts = movie_vec.transform(reviews_new)
reviews_new_tfidf = tfidf_transformer.transform(reviews_new_counts)
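One detail worth keeping in mind here: `transform` (unlike `fit_transform`) reuses the vocabulary learned from the training data, and tokens that are not in it are simply not counted. The sketch below imitates that behaviour in plain Python; the three-word vocabulary and the `transform` helper are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical vocabulary learned during fitting (token -> column index).
vocab = {"movie": 0, "terrible": 1, "was": 2}

def transform(tokens, vocab):
    """Count only tokens present in the fitted vocabulary; ignore the rest."""
    freq = Counter(tok for tok in tokens if tok in vocab)
    row = [0] * len(vocab)
    for tok, n in freq.items():
        row[vocab[tok]] = n
    return row

print(transform(["this", "movie", "was", "excellent"], vocab))
print(transform(["oscar-worthy"], vocab))
```

A very short review made up mostly of unseen words therefore gives the classifier very little evidence to work with.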

In [34]:
# have classifier make a prediction
pred = clf.predict(reviews_new_tfidf)

In [35]:
# print out results
for review, category in zip(reviews_new, pred):
    print('%r => %s' % (review, movie_train.target_names[category]))

'This movie was excellent' => pos
'Absolute joy ride' => pos
'Steven Seagal was terrible' => neg
'Steven Seagal shined through.' => neg
'This was certainly a movie' => neg
'Two thumbs up' => neg
'I fell asleep halfway through' => neg
"We can't wait for the sequel!!" => neg
'!' => neg
'?' => neg
'I cannot recommend this highly enough' => pos
'instant classic.' => pos
'Steven Seagal was amazing. His performance was Oscar-worthy.' => neg


In [None]:
#Not great accuracy with this small sample size

#### Conclusions ####
- I can see the utility of being able to do this and the big picture machine learning goals
- I don't understand a lot of the code in this 'tutorial' and could not replicate it for my own purposes (yet)
- I will complete the DataCamp tutorials and revisit these examples rather than pretending to be able to explore this data with current skills
- Will visit during office hours to discuss any remaining doubts