# Mandatory assignment 1B
### by Per Halvorsen, pmhalvor

1.   [a](#1a)
     [b](#1b)
2.   [a](#2a)
     [b](#2b)
3.   [a](#3a)


In this part of the assignment we will look at:
- setting up and running experiments
- splitting your data into development and test data
- *n*-fold cross-validation
- models for text classification
- Naive Bayes vs Logistic Regression
- the scikit-learn toolkit
- vectorization of categorical data

# 1a
## First classifier and vectorization

We start by importing our libraries.

In [1]:
import nltk
import random
import numpy as np
import scipy as sp
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

We'll be looking at the Movie Review corpus in NLTK.

In [2]:
from nltk.corpus import movie_reviews

We can import the documents by following the recipe from the scikit-learn “Working with text data” page, using the raw
documents, which we can get from NLTK by:
- `movie_reviews.raw(fileid)`


In [3]:
raw_movie_docs = [(movie_reviews.raw(fileid), category) for
category in movie_reviews.categories() for fileid in
movie_reviews.fileids(category)]

We can then shuffle the data and split it into 200 documents for final testing (which we will not use for
a while) and 1800 documents for development. Make sure to set a random seed, so that the results can be regenerated if needed.

In [4]:
random.seed(1234)
random.shuffle(raw_movie_docs)
movie_test = raw_movie_docs[:200]
movie_dev = raw_movie_docs[200:]
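The seeded shuffle-and-split step can also be wrapped in a small helper. The following is only a sketch: `split_test_dev` is a hypothetical name, and the toy `docs` list stands in for the 2000 `(text, label)` pairs. It uses a local `random.Random(seed)` generator, so the split is reproducible without touching the global random state, and it shuffles a copy rather than the original list.

```python
import random

def split_test_dev(items, n_test, seed=1234):
    """Shuffle a copy of `items` and return (test, dev) without mutating the input."""
    rng = random.Random(seed)   # local generator: leaves the global random state alone
    shuffled = list(items)      # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    return shuffled[:n_test], shuffled[n_test:]

# Toy stand-in for the 2000 (text, label) pairs from the corpus
docs = [(f"review {i}", "pos" if i % 2 else "neg") for i in range(2000)]
test, dev = split_test_dev(docs, n_test=200)
print(len(test), len(dev))
```

Calling the helper twice with the same seed returns the same split, which is the property we rely on when we want to regenerate results.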

We then split the development data into 1600 documents for training and 200 for a development test set,
and call them `train_data` and `dev_test_data`.

`train_data` is now a list of 1600 items, each a pair of a text (a string) and a label. We split it into two lists of 1600 elements each: `train_text`, containing the texts, and `train_target`, containing the corresponding labels.

The same is done for `dev_test_data`.

In [5]:
random.shuffle(movie_dev)
train_data    = movie_dev[200:]
dev_test_data = movie_dev[:200]

In [6]:
unzipped = list(zip(*train_data))
train_text   = list(unzipped[0])
train_target = list(unzipped[1])
'text', type(train_text), len(train_text),'///', ' target', type(train_target), len(train_target)

('text', list, 1600, '///', ' target', list, 1600)

In [7]:
unzipped = list(zip(*dev_test_data))
dev_test_text   = list(unzipped[0])
dev_test_target = list(unzipped[1])
'text', type(dev_test_text), len(dev_test_text),'///', ' target', type(dev_test_target), len(dev_test_target)

('text', list, 200, '///', ' target', list, 200)

Now, extract features from the text by using `CountVectorizer`, which was imported above. This first considers the whole set of training data, to determine which features to extract:

In [8]:
v = CountVectorizer()
v.fit(train_text)

CountVectorizer()

In [9]:
train_vector    = v.transform(train_text)
dev_test_vector = v.transform(dev_test_text)

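`fit` builds a vocabulary (`v.vocabulary_`) that maps each token seen in the training texts to a column index. `transform` then turns each document into a row of a sparse document-term matrix, where each entry counts how often that token occurs in that document. Tokens in the development test set that were not seen during fitting are ignored.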

In [10]:
train_vector.shape, dev_test_vector.shape

((1600, 36331), (200, 36331))
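To make the fit/transform idea concrete, here is a pure-Python sketch of a bag-of-words vectorizer. It is not scikit-learn's implementation: the helper names (`tokenize`, `fit_vocabulary`, `transform`) are made up for illustration, the rows are dense lists instead of a sparse matrix, and the tokenization only approximates the default behaviour (lower-casing, tokens of two or more word characters).

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_vocabulary(texts):
    """Map each distinct training token to a column index (alphabetical order)."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(vocab)}

def transform(texts, vocabulary):
    """One row of counts per document; tokens outside the vocabulary are ignored."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train = ["A good film, a really good film", "a bad film"]
vocab = fit_vocabulary(train)
train_rows = transform(train, vocab)
unseen_rows = transform(["good unseen words"], vocab)
print(vocab)
print(train_rows, unseen_rows)
```

Every row has one column per vocabulary entry, which is why the training and development matrices above share the same number of columns.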

In [11]:
v.vocabulary_

{'for': 12538,
 'original': 22624,
 'sin': 29285,
 'the': 32333,
 'road': 27268,
 'to': 32740,
 'screen': 28283,
 'has': 14627,
 'been': 3204,
 'rocky': 27345,
 'initially': 16489,
 'slated': 29519,
 'release': 26455,
 'last': 18313,
 'november': 22062,
 'film': 12066,
 'was': 35192,
 'bumped': 4618,
 'twice': 33506,
 'finally': 12097,
 'landing': 18226,
 'in': 16118,
 'dog': 9504,
 'days': 8119,
 'of': 22317,
 'summer': 31361,
 '2001': 207,
 'advance': 1039,
 'screenings': 28287,
 'were': 35397,
 'denied': 8529,
 'all': 1373,
 'but': 4742,
 'few': 11977,
 'critics': 7614,
 'generally': 13286,
 'sign': 29191,
 'that': 32330,
 'studio': 31102,
 'realizes': 25993,
 'it': 17066,
 'dud': 9936,
 'on': 22427,
 'its': 17081,
 'hands': 14474,
 'so': 29874,
 'is': 17024,
 'really': 25995,
 'bad': 2724,
 'yes': 36117,
 'melodrama': 20225,
 'does': 9497,
 'offer': 22332,
 'some': 29980,
 'rewards': 27043,
 'location': 18982,
 'settings': 28678,
 'are': 2030,
 'gorgeous': 13786,
 'and': 1654,
 'th

We can now train our classification model. We'll start with a multinomial Naive Bayes classifier.

In [12]:
clf = MultinomialNB()
clf.fit(train_vector, train_target)

MultinomialNB()

We can get the accuracy of the model on the development test set directly by using the model's `score()` method.

In [13]:
clf.score(dev_test_vector, dev_test_target)

0.79

# 1b
## Parameters of the vectorizer
We have so far considered the standard parameters for the procedures from scikit-learn. These procedures have, however, many parameters. To get optimal results, we should adjust the parameters. We can use `train_data` for training various models and `dev_test_data` for testing and comparing them.

We observe that `CountVectorizer` case-folds by default. For a different corpus, it could be interesting to check the effect of this feature, i.e. how often capital letters occur. Here, this has no effect, since the `movie_reviews.raw()` is already all lower case. We could also have explored the effect of exchanging the default tokenizer included in CountVectorizer with other tokenizers.

Another interesting feature is `binary`. Setting this to `True` implies only counting whether a word
occurs in a document and not how many times it occurs. It could be interesting to see the effect
of this feature.

(Observe, by the way, that <mark>this is not the same as the Bernoulli model</mark> for text classification. The
<mark>Bernoulli model</mark> takes into consideration <mark>both the probability of being present</mark> for the present words,
_as well as_ the probability of <mark>not being present</mark> for the absent words. The binary multinomial model
only considers the present words.)
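The difference between the two models can be sketched with a toy log-likelihood computation. The probabilities below are invented for illustration and the function names are hypothetical; the point is only that the Bernoulli score includes a term for every vocabulary word, while the binary multinomial score sums over the present words only.

```python
import math

def binary_multinomial_loglik(present, word_prob):
    """Sum log P(w | class) over the distinct words that occur in the document."""
    return sum(math.log(word_prob[w]) for w in present)

def bernoulli_loglik(present, presence_prob):
    """Sum log P(w present | class) for present words and log(1 - P) for absent ones."""
    total = 0.0
    for w, p in presence_prob.items():
        total += math.log(p) if w in present else math.log(1.0 - p)
    return total

doc = {"good", "film"}  # the distinct words present in one document

# Made-up class-conditional probabilities over a three-word vocabulary
word_prob = {"good": 0.5, "bad": 0.25, "film": 0.25}      # sums to 1 over the vocabulary
presence_prob = {"good": 0.5, "bad": 0.25, "film": 0.5}   # one independent value per word

mult = binary_multinomial_loglik(doc, word_prob)
bern = bernoulli_loglik(doc, presence_prob)
print(mult, bern)
```

In the Bernoulli score, the absent word `bad` contributes `log(1 - 0.25)`; in the binary multinomial score it contributes nothing.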

The feature `ngram_range=[1,1]` means we use tokens (=unigrams) only, [2,2] means using bigrams
only, while [1,2] means both unigrams and bigrams, and so on. An $n$-gram is a sequence of $n$ consecutive tokens in a text.
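As a quick illustration of what an n-gram range selects, here is a small sketch. The function `ngrams` is a hypothetical helper, not part of scikit-learn, and it works on an already tokenized list.

```python
def ngrams(tokens, ngram_range=(1, 1)):
    """Return all n-grams with lo <= n <= hi, each joined by a single space."""
    lo, hi = ngram_range
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

tokens = "not a good film".split()
unigrams = ngrams(tokens, (1, 1))
bigrams = ngrams(tokens, (2, 2))
both = ngrams(tokens, (1, 2))
print(unigrams)
print(bigrams)
print(both)
```

Bigrams such as `not a` or `a good` carry some word-order information that unigrams lose, which is one reason wider ranges might help a sentiment classifier.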

Let's run experiments where we let `binary` vary over `[False, True]` and `ngram_range` vary over `[[1,1],
[1,2], [1,3]]` and compare the accuracies with the 6 different settings in a 2x3 table.

To do this, we'll use scikit-learn's `GridSearchCV` class, which runs a cross-validation for each combination of parameters and records the mean accuracy for each.

We'll also need to build a pipeline, which will funnel our data from the `CountVectorizer()` function (with the different parameters), through the transformer, and into the multinomial model. 

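Inside a pipeline we do not call `v.transform()` ourselves; the pipeline calls each step's `fit` and `transform` methods for us. In addition to `CountVectorizer()`, the pipeline below includes a `TfidfTransformer()` step, which reweights the counts by tf-idf before they reach the classifier. Because of this extra step, the scores are not directly comparable to the raw-count model in [1a](#1a).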

In [14]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

In [15]:
MNB_clf = Pipeline([
    ('v', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])

In [16]:
parameters = {
    'v__binary': [True, False],
    'v__ngram_range': [(1,1), (1,2), (1,3)]
}

In [17]:
gs_clf = GridSearchCV(MNB_clf, parameters, cv=5, n_jobs=-1)

In [18]:
gs_clf = gs_clf.fit(train_text, train_target)

In [19]:
import pandas as pd
pd.concat([pd.DataFrame(gs_clf.cv_results_['params']), 
           pd.DataFrame(gs_clf.cv_results_['mean_test_score'], 
                        columns=['score'])], 
          axis=1)

Unnamed: 0,v__binary,v__ngram_range,score
0,True,"(1, 1)",0.81125
1,True,"(1, 2)",0.825
2,True,"(1, 3)",0.82375
3,False,"(1, 1)",0.76375
4,False,"(1, 2)",0.815
5,False,"(1, 3)",0.826875


In [31]:
{type(e): e for e in gs_clf.cv_results_}
gs_clf.best_estimator_.named_steps['clf']

MultinomialNB()

The results above show that, in this particular cross-validation, `binary=True` gave better accuracy than `binary=False` for `ngram_range` (1,1) and (1,2), while `binary=False` was marginally ahead for (1,3). Including higher-order n-grams gave better scores than unigrams alone, for the most part.

# 2a
## $n$-fold cross-validation

Our `dev_test_data` contains only 200 items. That is a small number for a test set for a binary classifier, so the numbers we report may depend to a large degree on the split between training and test data. To get more reliable numbers, we can use n-fold cross-validation on the whole `movie_dev` set of 1800 items. To get round numbers, we use 9-fold
cross-validation, which puts 200 items in each test set.

We can use the best settings from exercise 1 to run a 9-fold cross-validation and report the accuracy for
each run, together with the mean and standard deviation of the 9 runs.
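The fold boundaries themselves are simple to compute. The sketch below uses hypothetical helpers (`kfold_slices`, `kfold_split`) and, as one possible design choice, also handles item counts that do not divide evenly by giving the first few folds one extra item.

```python
def kfold_slices(n_items, n_folds):
    """Yield (start, stop) test slices; the first n_items % n_folds folds get one extra item."""
    base, extra = divmod(n_items, n_folds)
    start = 0
    for i in range(n_folds):
        stop = start + base + (1 if i < extra else 0)
        yield start, stop
        start = stop

def kfold_split(items, n_folds):
    """Yield (train, test) lists, using each fold as the test set exactly once."""
    for start, stop in kfold_slices(len(items), n_folds):
        yield items[:start] + items[stop:], items[start:stop]

even = list(kfold_slices(1800, 9))   # nine folds of 200 items each
uneven = list(kfold_slices(10, 3))   # fold sizes 4, 3 and 3
print(even[0], even[-1], uneven)
```

With 1800 items and 9 folds, every item ends up in a test set exactly once and each model is trained on the remaining 1600 items.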

In [119]:
best_params = gs_clf.best_params_
best_params

{'v__binary': True, 'v__ngram_range': (1, 3)}

In [213]:
def mnb_cv(data, params=best_params, n=9):
    '''
    Steps:
        - shuffle data (in place)
        - separate text from target
        - split into n folds
        - for each fold:
            - CountVectorizer() with the given parameters
            - reweight the counts with TfidfTransformer()
            - fit the MultinomialNB() model using the other n-1 folds
            - evaluate the model on this fold
            - store and report the accuracy score
        - report the mean and std. dev. of the accuracy scores
    '''
    random.shuffle(data)
    
    unzipped = list(zip(*data))
    X = list(unzipped[0])
    y = list(unzipped[1])
    
    
    binary      = params['v__binary']
    ngram_range = params['v__ngram_range']
    
    clf = Pipeline([
        ('v', CountVectorizer(binary=binary, ngram_range=ngram_range)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB())
    ])
    
    scores = []
    
    fold_size = len(X) // n
    cur = 0
    
    # only run when the data divides evenly into n folds
    if len(X) % n == 0:
        for i in range(1, n+1):
            fold = int(fold_size*i)
            X_test, y_test = X[cur:fold], y[cur:fold]
            X_train = X[:cur] + X[fold:]
            y_train = y[:cur] + y[fold:]
            
            # fit model using pipeline
            clf.fit(X_train, y_train)
            
            # evaluate model
            predicted = clf.predict(X_test)
            score = np.mean(predicted == y_test)
            print(f'Fold {i} accuracy {score}')
            scores.append(score)
            
            # move cursor to start at end of last fold
            cur = fold
    else:
        pass # do nothing if splits not evenly sized
    
    avg_score = np.mean(scores)
    sd_scores = np.std(scores)
    print(f'Average score {avg_score}')
    print(f'Standard dev. {sd_scores}')
    return scores

In [143]:
mnb_cv(movie_dev)

Fold 1 accuracy 0.82
Fold 2 accuracy 0.85
Fold 3 accuracy 0.83
Fold 4 accuracy 0.85
Fold 5 accuracy 0.855
Fold 6 accuracy 0.85
Fold 7 accuracy 0.845
Fold 8 accuracy 0.885
Fold 9 accuracy 0.835
Average score 0.8466666666666667
Standard dev. 0.01732050807568879


[0.82, 0.85, 0.83, 0.85, 0.855, 0.85, 0.845, 0.885, 0.835]
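One detail worth knowing when reading the standard deviation above: `np.std` computes the population standard deviation by default (`ddof=0`). The sketch below recomputes the summary from the nine fold scores with the standard library and also shows the sample standard deviation, which divides by `n - 1` and is therefore slightly larger.

```python
import statistics

# Fold accuracies copied from the run above
scores = [0.82, 0.85, 0.83, 0.85, 0.855, 0.85, 0.845, 0.885, 0.835]

mean = statistics.mean(scores)
pop_sd = statistics.pstdev(scores)    # divides by n, like np.std with its default ddof=0
sample_sd = statistics.stdev(scores)  # divides by n - 1

print(f"mean {mean:.4f}  population sd {pop_sd:.4f}  sample sd {sample_sd:.4f}")
```

With only nine folds the two estimates differ visibly, so it helps to state which one is being reported.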

# 2b
Let's now combine the 9-fold cross-validation with the various settings for CountVectorizer, as we did earlier in [1b](#1b). 
For each of the 6 settings, run 9-fold cross-validation and calculate the mean accuracy. Report the
results in a 2x3 table. Answer: Do you see the same as when you only used one test set?


In [214]:
def mnb_grid_search(params):
    cv_scores = []
    
    # recursively expand the first list-valued parameter, one value at a time
    for k in params.keys():
        if isinstance(params[k], list):
            for v in params[k]:
                p = params.copy()
                p[k] = v
                gs = mnb_grid_search(p)
                if len(gs)>0:
                    cv_scores += gs
            return cv_scores
        else:
            pass
        
    print(params)    

    cv_scores+=[[params, mnb_cv(movie_dev, params)]] # should have made movie_dev a default here..
    return cv_scores
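The recursive expansion above only expands values that are lists (because of the `isinstance(..., list)` check), so a tuple such as `(True, False)` would be passed through unexpanded. As an alternative sketch, the same combinations can be enumerated with `itertools.product`, which works for any iterable of values. `expand_grid` is a hypothetical helper name.

```python
from itertools import product

def expand_grid(grid):
    """Yield one {parameter: value} dict per combination in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

grid = {
    'v__binary': [True, False],
    'v__ngram_range': [(1, 1), (1, 2), (1, 3)],
}
settings = list(expand_grid(grid))
for s in settings:
    print(s)
```

Each resulting dict could then be passed to a cross-validation function such as `mnb_cv` in place of the recursive calls.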

In [212]:
gs_scores = mnb_grid_search(parameters)
gs_scores

{'v__binary': True, 'v__ngram_range': (1, 1)}
Fold 1 accuracy 0.845
Fold 2 accuracy 0.865
Fold 3 accuracy 0.88
Fold 4 accuracy 0.865
Fold 5 accuracy 0.81
Fold 6 accuracy 0.81
Fold 7 accuracy 0.81
Fold 8 accuracy 0.875
Fold 9 accuracy 0.815
Average score 0.8416666666666666
Standard dev. 0.02867441755680874
{'v__binary': True, 'v__ngram_range': (1, 2)}
Fold 1 accuracy 0.825
Fold 2 accuracy 0.89
Fold 3 accuracy 0.905
Fold 4 accuracy 0.84
Fold 5 accuracy 0.865
Fold 6 accuracy 0.88
Fold 7 accuracy 0.85
Fold 8 accuracy 0.85
Fold 9 accuracy 0.815
Average score 0.8577777777777779
Standard dev. 0.02819683897877674
{'v__binary': True, 'v__ngram_range': (1, 3)}
Fold 1 accuracy 0.91
Fold 2 accuracy 0.845
Fold 3 accuracy 0.845
Fold 4 accuracy 0.88
Fold 5 accuracy 0.85
Fold 6 accuracy 0.82
Fold 7 accuracy 0.89
Fold 8 accuracy 0.83
Fold 9 accuracy 0.855
Average score 0.8583333333333333
Standard dev. 0.027588242262078105
{'v__binary': False, 'v__ngram_range': (1, 1)}
Fold 1 accuracy 0.76
Fold 2 accura

[[{'v__binary': True, 'v__ngram_range': (1, 1)},
  [0.845, 0.865, 0.88, 0.865, 0.81, 0.81, 0.81, 0.875, 0.815]],
 [{'v__binary': True, 'v__ngram_range': (1, 2)},
  [0.825, 0.89, 0.905, 0.84, 0.865, 0.88, 0.85, 0.85, 0.815]],
 [{'v__binary': True, 'v__ngram_range': (1, 3)},
  [0.91, 0.845, 0.845, 0.88, 0.85, 0.82, 0.89, 0.83, 0.855]],
 [{'v__binary': False, 'v__ngram_range': (1, 1)},
  [0.76, 0.74, 0.85, 0.83, 0.77, 0.81, 0.8, 0.79, 0.785]],
 [{'v__binary': False, 'v__ngram_range': (1, 2)},
  [0.795, 0.85, 0.795, 0.865, 0.85, 0.82, 0.835, 0.765, 0.795]],
 [{'v__binary': False, 'v__ngram_range': (1, 3)},
  [0.85, 0.825, 0.795, 0.85, 0.8, 0.84, 0.805, 0.805, 0.795]]]

In [218]:
gs_score_means = [np.mean(s[1]) for s in gs_scores]
gs_score_means

[0.8416666666666666,
 0.8577777777777779,
 0.8583333333333333,
 0.7927777777777778,
 0.8188888888888889,
 0.8183333333333334]

In [219]:
gs_binary = [s[0]['v__binary'] for s in gs_scores]
gs_binary

[True, True, True, False, False, False]

In [221]:
gs_ngram  = [s[0]['v__ngram_range'] for s in gs_scores]
gs_ngram

[(1, 1), (1, 2), (1, 3), (1, 1), (1, 2), (1, 3)]

In [245]:
mnb_score = pd.DataFrame(zip(gs_score_means, gs_binary, gs_ngram), columns=['score', 'binary', 'ngram_range'])
mnb_score

Unnamed: 0,score,binary,ngram_range
0,0.841667,True,"(1, 1)"
1,0.857778,True,"(1, 2)"
2,0.858333,True,"(1, 3)"
3,0.792778,False,"(1, 1)"
4,0.818889,False,"(1, 2)"
5,0.818333,False,"(1, 3)"


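As we can see, these results look broadly similar to the earlier table: `binary=True` gives better accuracy than `binary=False`, and adding higher-order n-grams helps. The best parameters here are `binary=True` and `ngram_range=(1,3)`, only marginally ahead of `ngram_range=(1,2)`.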

# 3a
## Logistic Regression
We know that Logistic Regression may produce better results than Naive Bayes, and want to see what happens if we use it instead of MNB. We start with the same multinomial model for text classification as in exercises (1) and (2) above (i.e. we process the data the same way and use the same vectorizer), but exchange the learner with scikit-learn’s `LogisticRegression` in our pipeline.

Since logistic regression is slow to train, we need to restrict ourselves somewhat with respect to which experiments to run. We consider two settings for the CountVectorizer, the default setting and the setting which gave the best result with naive Bayes (though, this does not have to be the best setting for the logistic regression). 

For each of the two settings,  we will run a 9-fold cross-validation and calculate the mean accuracy. The results will be compared in a 2x2 table where one axis is Naive Bayes vs. Logistic Regression and the other axis is default settings vs. earlier best settings for CountVectorizer. 

To do this, I will reuse the cross-validation function I created earlier and change only the last step in the pipeline.

In [215]:
def log_cv(data, params=best_params, n=9):
    random.shuffle(data)
    
    unzipped = list(zip(*data))
    X = list(unzipped[0])
    y = list(unzipped[1])
    
    
    binary      = params['v__binary']
    ngram_range = params['v__ngram_range']
    
    clf = Pipeline([
        ('v', CountVectorizer(binary=binary, ngram_range=ngram_range)),
        ('tfidf', TfidfTransformer()),
        ('clf', LogisticRegression())
    ])
    
    scores = []
    
    fold_size = len(X) // n
    cur = 0
    
    # only run when the data divides evenly into n folds
    if len(X) % n == 0:
        for i in range(1, n+1):
            fold = int(fold_size*i)
            X_test, y_test = X[cur:fold], y[cur:fold]
            X_train = X[:cur] + X[fold:]
            y_train = y[:cur] + y[fold:]
            
            # fit model using pipeline
            clf.fit(X_train, y_train)
            
            # evaluate model
            predicted = clf.predict(X_test)
            score = np.mean(predicted == y_test)
            print(f'Fold {i} accuracy {score}')
            scores.append(score)
            
            # move cursor to start at end of last fold
            cur = fold
    else:
        pass # do nothing if splits not evenly sized
    
    avg_score = np.mean(scores)
    sd_scores = np.std(scores)
    print(f'Average score {avg_score}')
    print(f'Standard dev. {sd_scores}')
    return scores
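Since the two cross-validation functions differ only in the final pipeline step, one way to avoid the duplication is to pass the model in as an argument. The sketch below is an assumption about how that could look, not the code used for the results here: `cross_validate` and `MajorityClassifier` are hypothetical names, and the toy classifier merely stands in for a real pipeline so the example runs without scikit-learn.

```python
from collections import Counter

class MajorityClassifier:
    """Toy stand-in for a real pipeline: always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def cross_validate(X, y, make_model, n_folds):
    """Return one accuracy per fold; `make_model` builds a fresh, unfitted model."""
    if len(X) % n_folds != 0:
        raise ValueError("number of items must be divisible by n_folds")
    fold_size = len(X) // n_folds
    scores = []
    for i in range(n_folds):
        start, stop = i * fold_size, (i + 1) * fold_size
        model = make_model()  # fresh model per fold, so nothing leaks between folds
        model.fit(X[:start] + X[stop:], y[:start] + y[stop:])
        predicted = model.predict(X[start:stop])
        correct = sum(p == t for p, t in zip(predicted, y[start:stop]))
        scores.append(correct / fold_size)
    return scores

X = [f"doc {i}" for i in range(12)]
y = ["pos"] * 9 + ["neg"] * 3
scores = cross_validate(X, y, MajorityClassifier, n_folds=3)
print(scores)
```

With scikit-learn, `make_model` could be a function that returns a new `Pipeline` ending in either `MultinomialNB()` or `LogisticRegression()`.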

I'll also reuse the grid search function, but change the call to the new cross-validation function.

In [232]:
def log_grid_search(params):
    cv_scores = []
    
    # recursively expand the first list-valued parameter, one value at a time
    for k in params.keys():
        if isinstance(params[k], list):
            for v in params[k]:
                p = params.copy()
                p[k] = v
                gs = log_grid_search(p)
                if len(gs)>0:
                    cv_scores += gs
            return cv_scores
        else:
            pass
        
    print(params)    

    cv_scores+=[[params, log_cv(movie_dev, params)]] # should have made movie_dev a default here..
    return cv_scores

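I'll also have to create a new dictionary of parameters. Crossing the two values of `binary` with the two n-gram ranges gives 4 combinations, which include the default setting (`binary=False`, `(1,1)`) and the earlier best setting (`binary=True`, `(1,3)`).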

In [216]:
best_params

{'v__binary': True, 'v__ngram_range': (1, 3)}

In [231]:
log_params = {
    'v__binary': [True, False],
    'v__ngram_range': [(1,1), (1, 3)]
}

We are now ready to run the grid search with the logistic regression cross-validation function.

In [234]:
log_gs_scores = log_grid_search(log_params)

{'v__binary': True, 'v__ngram_range': (1, 1)}
Fold 1 accuracy 0.885
Fold 2 accuracy 0.885
Fold 3 accuracy 0.9
Fold 4 accuracy 0.885
Fold 5 accuracy 0.835
Fold 6 accuracy 0.875
Fold 7 accuracy 0.87
Fold 8 accuracy 0.885
Fold 9 accuracy 0.86
Average score 0.8755555555555555
Standard dev. 0.017864372434237268
{'v__binary': True, 'v__ngram_range': (1, 3)}
Fold 1 accuracy 0.86
Fold 2 accuracy 0.84
Fold 3 accuracy 0.875
Fold 4 accuracy 0.845
Fold 5 accuracy 0.865
Fold 6 accuracy 0.88
Fold 7 accuracy 0.825
Fold 8 accuracy 0.84
Fold 9 accuracy 0.885
Average score 0.8572222222222222
Standard dev. 0.019594657876164902
{'v__binary': False, 'v__ngram_range': (1, 1)}
Fold 1 accuracy 0.87
Fold 2 accuracy 0.835
Fold 3 accuracy 0.815
Fold 4 accuracy 0.815
Fold 5 accuracy 0.805
Fold 6 accuracy 0.815
Fold 7 accuracy 0.775
Fold 8 accuracy 0.78
Fold 9 accuracy 0.86
Average score 0.8188888888888889
Standard dev. 0.03025610845375577
{'v__binary': False, 'v__ngram_range': (1, 3)}
Fold 1 accuracy 0.77
Fold 2 

In [235]:
log_gs_scores

[[{'v__binary': True, 'v__ngram_range': (1, 1)},
  [0.885, 0.885, 0.9, 0.885, 0.835, 0.875, 0.87, 0.885, 0.86]],
 [{'v__binary': True, 'v__ngram_range': (1, 3)},
  [0.86, 0.84, 0.875, 0.845, 0.865, 0.88, 0.825, 0.84, 0.885]],
 [{'v__binary': False, 'v__ngram_range': (1, 1)},
  [0.87, 0.835, 0.815, 0.815, 0.805, 0.815, 0.775, 0.78, 0.86]],
 [{'v__binary': False, 'v__ngram_range': (1, 3)},
  [0.77, 0.755, 0.665, 0.695, 0.74, 0.76, 0.73, 0.79, 0.695]]]

In [242]:
log_gs_score_means = [np.mean(s[1]) for s in log_gs_scores]
log_binary = [s[0]['v__binary'] for s in log_gs_scores]
log_ngram  = [s[0]['v__ngram_range'] for s in log_gs_scores]

In [253]:
log_score = pd.DataFrame(zip(log_gs_score_means, log_binary, log_ngram), columns=['score', 'binary', 'ngram_range'])
log_score

Unnamed: 0,score,binary,ngram_range
0,0.875556,True,"(1, 1)"
1,0.857222,True,"(1, 3)"
2,0.818889,False,"(1, 1)"
3,0.733333,False,"(1, 3)"


In [247]:
mnb_score

Unnamed: 0,score,binary,ngram_range
0,0.841667,True,"(1, 1)"
1,0.857778,True,"(1, 2)"
2,0.858333,True,"(1, 3)"
3,0.792778,False,"(1, 1)"
4,0.818889,False,"(1, 2)"
5,0.818333,False,"(1, 3)"


In [250]:
pd.DataFrame([[mnb_score['score'][3], log_score['score'][2]],
              [max(mnb_score['score']), log_score['score'][1]]], 
             columns=['Naive Bayes', 'Log. Reg.'],
             index=['default', 'best'])

Unnamed: 0,Naive Bayes,Log. Reg.
default,0.792778,0.818889
best,0.858333,0.857222


For this particular execution, using the earlier best parameters with Naive Bayes gave a marginally better result than Logistic Regression with the same parameters (0.858 vs. 0.857). However, this was not the best score for Logistic Regression, which reached 0.876 with `binary=True` and `ngram_range=(1,1)`. This tells us that we could have found different best parameters had we executed step [1b](#1b) using Logistic Regression in our pipeline instead of multinomial Naive Bayes.

Another thing we can read from the results is that tuning the parameters specifically for our data set can be quite beneficial. For both models, their respective best parameters gave an increase in accuracy of around 6 percentage points over the default parameters.
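The size of these gains can be checked with a few lines of arithmetic. The numbers are copied from the score tables above; note that for Logistic Regression the "best" value used here is its own best setting (`binary=True`, `ngram_range=(1,1)`), not the Naive Bayes best setting shown in the 2x2 table.

```python
# Mean accuracies copied from the mnb_score and log_score tables
results = {
    "Naive Bayes": {"default": 0.792778, "best": 0.858333},
    "Log. Reg.":   {"default": 0.818889, "best": 0.875556},  # own best: binary=True, (1,1)
}

# Gain in percentage points = 100 * (best accuracy - default accuracy)
gains = {model: 100 * (r["best"] - r["default"]) for model, r in results.items()}

for model, gain in gains.items():
    print(f"{model}: +{gain:.1f} percentage points")
```

Both gains are absolute differences in accuracy (percentage points), not relative improvements.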

