### Now let's try out tf-idf weighting! This involves term frequency and inverse document frequency.
#### Term frequency increases the weight of the words that occur more frequently in the document i.e. tf( t, d ) indicates the number of occurrences of term 't' in doucment 'd'. However fifty occurences of a word in a document does not mean that the word is really fifty times more significant that any other word that occured just one, so we scale the values in logarithmic way.
#### Inverse document frequency increases the weight of terms that occur rarely i.e. in few documents. Similarly it decreases the weight of terms that occur in all the documents. We define idf(t,D) as log( ( total number of documents in corpus D /  total documents with terms t in corpus D ) )

In [1]:
#Get the movie sentiment corpus data
import os
import numpy as np
from collections import Counter

corpus_path = './corpus/' #this path needs to be changed depending on where your files lie
sub_directories = [ 'pos', 'neg' ]

def get_data():
    all_docs = []
    positive_ex = 0;
    negative_ex = 0;
    for subdir in sub_directories:
        sentiment = corpus_path + subdir;
        files = [ os.path.join(sentiment,f) for f in os.listdir(sentiment) ]
        if( subdir == 'pos' ):
            positive_ex = positive_ex + len( files )
        else:
            negative_ex = negative_ex + len( files )
        for file in files:
            doc = "";
            for line in open( file, 'r' ):
                doc = doc + line
            all_docs.append( doc )
    return [ positive_ex, negative_ex, all_docs ]

In [2]:
[ positive_ex, negative_ex, all_docs ] = get_data()

In [3]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer

total_examples = positive_ex + negative_ex
total_splits = 10

labels = np.zeros(total_examples);
labels[0:positive_ex] = 0;
labels[positive_ex:total_examples] = 1;

folds = StratifiedKFold( n_splits=total_splits )

model = MultinomialNB()
test_accuracy_h = [ 0.0, 0.0 ]
train_accuracy_h = [ 0.0, 0.0 ]

for min_df_val in [ 0, 0.001, 0.01, 0.05 ]:
    for max_df_val in [ 0.1, 0.3, 0.5, 0.7, 0.9, 0.999 ]:
        Vectorizer = TfidfVectorizer(
            sublinear_tf=True, #change to log scale i.e change tf value to 1 + log(tf)
            use_idf=True,      #use idf as well
            stop_words='english', #filter out most common english words
            min_df=min_df_val,  #ignore words that occurs in less than min_df proportion of documents
            max_df=max_df_val, #ignore words that occurs a lot! i.e. in max_df proportion of documents
            )
        for train_indices, test_indices in folds.split(all_docs, labels):
            docs_train = [ all_docs[ index ] for index in train_indices ]
            docs_test  = [ all_docs[ index ] for index in test_indices ]
            Y_train = labels[ train_indices ]
            Y_test  = labels[ test_indices ]
            X_train = Vectorizer.fit_transform(docs_train) 
            X_test = Vectorizer.transform(docs_test) 

            model.fit( X_train, Y_train )
            train_result = model.predict( X_train )
            test_result = model.predict( X_test )

            train_accuracy_h[0] = train_accuracy_h[0] + sum( train_result==Y_train )
            test_accuracy_h[0] = test_accuracy_h[0] + sum( test_result==Y_test )
            train_accuracy_h[1] = train_accuracy_h[1] + len( train_result )
            test_accuracy_h[1] = test_accuracy_h[1] + len( test_result )

        train_acc = (train_accuracy_h[0]*100)/train_accuracy_h[1]    
        test_acc = (test_accuracy_h[0]*100)/test_accuracy_h[1]
        print('For min_df value {} and max_df value {}, train accuracy is {}% and test accuracy is {}%'.format(
                min_df_val, max_df_val, round(train_acc, 2 ), round(test_acc, 2 )))

For min_df value 0 and max_df value 0.1, train accuracy is 98.11% and test accuracy is 81.8%
For min_df value 0 and max_df value 0.3, train accuracy is 97.95% and test accuracy is 82.4%
For min_df value 0 and max_df value 0.5, train accuracy is 97.85% and test accuracy is 82.7%
For min_df value 0 and max_df value 0.7, train accuracy is 97.8% and test accuracy is 82.86%
For min_df value 0 and max_df value 0.9, train accuracy is 97.76% and test accuracy is 82.94%
For min_df value 0 and max_df value 0.999, train accuracy is 97.73% and test accuracy is 82.99%
For min_df value 0.001 and max_df value 0.1, train accuracy is 97.65% and test accuracy is 82.9%
For min_df value 0.001 and max_df value 0.3, train accuracy is 97.55% and test accuracy is 82.96%
For min_df value 0.001 and max_df value 0.5, train accuracy is 97.45% and test accuracy is 83.0%
For min_df value 0.001 and max_df value 0.7, train accuracy is 97.37% and test accuracy is 83.01%
For min_df value 0.001 and max_df value 0.9, tra

### As we increase min_df value we see reduction in train_accuracy ( in our case ) but mixed trend for test accuracy. As we increase max_df value, we see reduction in train_accuracy, but in general increase in test accuracy.