# Applying Multinomial Logistic Regression
**Names:** Eva, Barbara & Joyce

Here we started implementing multinomial logistic regression by building the matrices of vocabulary counts from scratch. In the end we did not use this code to compare our multinomial logistic regression models, because it was easier to implement the same approach with scikit-learn's CountVectorizer.
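For comparison, the CountVectorizer route mentioned above could look roughly like the sketch below. The toy texts and labels are made up for illustration and stand in for the statements and labels read from train.tsv and test.tsv; the default tokenizer of CountVectorizer is also not identical to the punctuation stripping and lemmatization used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical), standing in for the real statements and labels.
train_texts = [
    "taxes went up last year",
    "taxes went down last year",
    "the wall will take years",
    "the wall is already built",
]
train_labels = [0, 1, 0, 1]
test_texts = ["taxes went sideways"]   # "sideways" does not occur in the train texts

vectorizer = CountVectorizer()                    # lowercases and tokenizes by default
X_train = vectorizer.fit_transform(train_texts)   # learns the vocabulary from the train texts only
X_test = vectorizer.transform(test_texts)         # words outside that vocabulary are dropped

model = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
predictions = model.predict(X_test)
print(X_train.shape, X_test.shape, predictions)
```

The point of the sketch is the split between `fit_transform` on the train texts and `transform` on the test texts, which mirrors what the cells below do by hand with a fixed train vocabulary.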

First we create a list "allwords" containing every word that occurs in the statements of the train dataset. Each word is stripped of punctuation, set to lowercase (so that matching is case insensitive) and lemmatized. From this list we build a vocabulary: a list of all the distinct words that occur in the train dataset.

In [2]:
import nltk
from nltk.corpus import wordnet as wn
import csv
import string

countlines = 0 

allwords = []
with open("train.tsv", encoding="utf8") as tsvfile:    #open training set
    lines = csv.reader(tsvfile, delimiter="\t")        #convert file to lines
    for line in lines:
        statement = line[2]                     #get statement from each line 
        lostrings = statement.split(" ")        #convert string to list of strings
        new_lostrings = []
        countlines += 1                         #count number of lines so we know the number of statements
        for word in lostrings:
            word = nltk.WordNetLemmatizer().lemmatize(
                word.translate(str.maketrans('', '', string.punctuation)).lower()) # remove punctuation & lemmatize
            new_lostrings.append(word)
        allwords.extend(new_lostrings)

vocab = []                          #initialize a list for all the distinct words in the trainset
for word in allwords:
    if word not in vocab:           #only add the word if it is not yet in the vocabulary
        vocab.append(word)

print("Number of words in trainset:", len(allwords))
print("Number of distinct words in trainset:", len(vocab))
print("Number of statements in train dataset",countlines)
print(allwords[:25])
print(vocab[:25])    #here we can see that the vocabulary only contains distinct words 


Number of words in trainset: 184014
Number of distinct words in trainset: 11916
Number of statements in train dataset 10240
['say', 'the', 'annies', 'list', 'political', 'group', 'support', 'thirdtrimester', 'abortion', 'on', 'demand', 'when', 'did', 'the', 'decline', 'of', 'coal', 'start', 'it', 'started', 'when', 'natural', 'gas', 'took', 'off']
['say', 'the', 'annies', 'list', 'political', 'group', 'support', 'thirdtrimester', 'abortion', 'on', 'demand', 'when', 'did', 'decline', 'of', 'coal', 'start', 'it', 'started', 'natural', 'gas', 'took', 'off', 'that', 'to']
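Checking `word in vocab` against a list gets slow as the vocabulary grows, because every check scans the whole list. A minimal alternative sketch, assuming only that first-occurrence order should be preserved as in the output above, uses `dict.fromkeys`. The short token list here is a made-up stand-in for "allwords", and `word_to_index` is a hypothetical helper name.

```python
# Toy token list standing in for `allwords`.
allwords = ["say", "the", "list", "the", "when", "say", "coal"]

# dict.fromkeys keeps one entry per word, in order of first occurrence.
vocab = list(dict.fromkeys(allwords))

# Lookup from word to column index, handy when filling a count matrix.
word_to_index = {word: i for i, word in enumerate(vocab)}

print(vocab)
print(word_to_index["when"])
```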


Now we make a list of all the statements in the train dataset. We again strip the punctuation, set the words to lowercase and lemmatize them. We also make a list of the corresponding validity labels.

In [21]:
statements = []
labels = []

with open("train.tsv", encoding="utf8") as tsvfile:
    tsvreader = csv.reader(tsvfile, delimiter="\t")
    for line in tsvreader:
        label = line[1]
        aline = line[2]
        bline = aline.split(" ")
        cline = []
        for word in bline:
            word = nltk.WordNetLemmatizer().lemmatize(
                word.translate(str.maketrans('', '', string.punctuation)).lower()) # remove punctuation & lemmatize
            cline.append(word)
        labels.append(label)
        statements.append(cline)

print("First 3 statements:", statements[:3])
print("First 3 labels:", labels[:3])

First 3 statements: [['say', 'the', 'annies', 'list', 'political', 'group', 'support', 'thirdtrimester', 'abortion', 'on', 'demand'], ['when', 'did', 'the', 'decline', 'of', 'coal', 'start', 'it', 'started', 'when', 'natural', 'gas', 'took', 'off', 'that', 'started', 'to', 'begin', 'in', 'president', 'george', 'w', 'bush', 'administration'], ['hillary', 'clinton', 'agrees', 'with', 'john', 'mccain', 'by', 'voting', 'to', 'give', 'george', 'bush', 'the', 'benefit', 'of', 'the', 'doubt', 'on', 'iran']]
First 3 labels: ['false', 'half-true', 'mostly-true']
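The same preprocessing steps are applied in several cells, so they could be collected in one helper. The sketch below is one possible shape for it. The name `tokenize` and its `lemmatize` argument are hypothetical; the lemmatizer is passed in as an optional callable so that the sketch runs without the NLTK WordNet data.

```python
import string

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def tokenize(statement, lemmatize=None):
    """Split on spaces, strip punctuation, lowercase and optionally lemmatize.

    `lemmatize` is a hypothetical hook: passing
    nltk.WordNetLemmatizer().lemmatize should reproduce the cells above.
    """
    tokens = [word.translate(STRIP_PUNCT).lower() for word in statement.split(" ")]
    if lemmatize is not None:
        tokens = [lemmatize(token) for token in tokens]
    return tokens

print(tokenize("Hillary Clinton agrees with John McCain, by voting."))
```

With such a helper, the train and test cells would only differ in the file they read.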


Now we create an empty matrix in which we will represent all the statements. Each row of the matrix corresponds to a statement and each column to a word of the vocabulary, so each entry records how often that word occurs in that statement.

In [22]:
import numpy as np 
rows = countlines          #corresponds to the number of statements of the train set
columns = len(vocab)       #corresponds to the number of distinct words occurring in the train set 
matrix = np.zeros((rows, columns))

In [23]:
#Create matrix representation of the train statements, X_matrix 

for row, statement in enumerate(statements):        #row index = statement
    for col, word in enumerate(vocab):              #column index = vocabulary word
        if word in statement:
            matrix[row, col] = statement.count(word)    #number of occurrences of the word in the statement

print(matrix[0])

[1. 1. 1. ... 0. 0. 0.]
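The nested loop above visits every vocabulary word for every statement. A possible alternative, sketched here with made-up toy data, counts the words of each statement once with `collections.Counter` and looks up their columns in a dictionary. The function name `count_matrix` is hypothetical.

```python
from collections import Counter
import numpy as np

def count_matrix(tokenized_statements, vocab):
    """Return a (statements x vocabulary) matrix of word counts."""
    word_to_index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(tokenized_statements), len(vocab)))
    for row, tokens in enumerate(tokenized_statements):
        for word, count in Counter(tokens).items():
            col = word_to_index.get(word)
            if col is not None:          # words outside the vocabulary are ignored
                matrix[row, col] = count
    return matrix

# Toy vocabulary and statements (hypothetical).
vocab = ["say", "the", "coal", "when"]
toy_statements = [["say", "the", "the"], ["when", "coal", "when", "gas"]]
print(count_matrix(toy_statements, vocab))
```

Because words that are not in the vocabulary are skipped, the same function could be reused for statements that contain unseen words.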


In [24]:
#Create y vector which contains the validity labels of the statements in the train dataset 
labelsdic ={"false":0, "barely-true":1,"half-true":2,"mostly-true":3,"true":4, "pants-fire":5}

y_vector = [labelsdic[label] for label in labels]  #convert each label to the corresponding integer

Now we move on to applying multinomial logistic regression to the matrix representation of the statements and the vector representation of the validity labels. 

In [25]:
from sklearn.linear_model import LogisticRegression

In [26]:
#Fit the model and look at its predictions for the first statements of the train dataset

lr = LogisticRegression(solver='lbfgs',multi_class='multinomial').fit(matrix, y_vector)
yhat = lr.predict(matrix)
print(yhat[:7])

[0 2 3 0 3 4 1]


In [27]:
lr = LogisticRegression(solver='lbfgs',multi_class='multinomial')
lr.fit(matrix, y_vector) #fit logistic regression to the statements matrix and labels vector of the train dataset

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
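For intuition, a fitted multinomial logistic regression turns a row of word counts into class probabilities by applying a softmax to one linear score per class. The sketch below spells that out with NumPy on made-up numbers; `coef` and `intercept` play the role of `lr.coef_` and `lr.intercept_`, and the function name `softmax_predict` is hypothetical.

```python
import numpy as np

def softmax_predict(X, coef, intercept):
    """Return class probabilities and the index of the most likely class per row."""
    scores = X @ coef.T + intercept                  # shape (n_samples, n_classes)
    scores = scores - scores.max(axis=1, keepdims=True)   # shift for numerical stability
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)      # each row now sums to 1
    return probs, probs.argmax(axis=1)

# Toy example: 2 statements, 3 vocabulary words, 3 classes (all numbers made up).
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])
coef = np.array([[0.5, -0.2, 0.1],
                 [-0.5, 0.3, 0.0],
                 [0.0, -0.1, -0.1]])
intercept = np.array([0.0, 0.1, -0.1])

probs, predicted = softmax_predict(X, coef, intercept)
print(probs)
print(predicted)
```

A large positive coefficient for a word therefore raises the score of that class whenever the word occurs in a statement.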

To see how well our logistic regression model performs, we need to evaluate it on the test dataset. First we have to represent the test data in the same way as the train data. Because the features of the model are the words in the vocabulary of the train dataset, only the test words that occur in that vocabulary are counted; all other words are ignored.

In [28]:
#make list of statements for test dataset 
import string 
import nltk
from nltk.corpus import wordnet as wn
import csv

teststatements = []
testlabels = []
test_countlines = 0

with open("test.tsv", encoding="utf8") as tsvfile:
    tsvreader = csv.reader(tsvfile, delimiter="\t")
    for line in tsvreader:
        label = line[1]
        aline = line[2]
        bline = aline.split(" ")
        cline = []
        test_countlines += 1 
        for word in bline:
            word = nltk.WordNetLemmatizer().lemmatize(
                word.translate(str.maketrans('', '', string.punctuation)).lower()) # remove punctuation & lemmatize
            cline.append(word)
        testlabels.append(label)
        teststatements.append(cline)

print("First 3 statements:", teststatements[:3])
print("First 3 labels:", testlabels[:3])
print(test_countlines)

First 3 statements: [['building', 'a', 'wall', 'on', 'the', 'usmexico', 'border', 'will', 'take', 'literally', 'year'], ['wisconsin', 'is', 'on', 'pace', 'to', 'double', 'the', 'number', 'of', 'layoff', 'this', 'year'], ['say', 'john', 'mccain', 'ha', 'done', 'nothing', 'to', 'help', 'the', 'vet']]
First 3 labels: ['true', 'false', 'false']
1267


In [29]:
import numpy as np 
rows = test_countlines                 #number of statements in the test dataset 
columns = len(vocab)                   #we will use the same vocabulary used in the train matrix
testmatrix = np.zeros((rows, columns))

In [30]:
#Create a matrix representation for test dataset, X_matrix_test 

for row, statement in enumerate(teststatements):
    for col, word in enumerate(vocab):      #we represent the test statements with respect to the words occurring in the train dataset
        if word in statement:
            testmatrix[row, col] = statement.count(word)

print(testmatrix[0])

[0. 1. 0. ... 0. 0. 0.]


In [31]:
#Create test y vector
labelsdic ={"false":0, "barely-true":1,"half-true":2,"mostly-true":3,"true":4, "pants-fire":5}

test_y_vector = [labelsdic[label] for label in testlabels]  #convert each label to the corresponding integer

In [32]:
test_x = testmatrix 
test_y = test_y_vector

We will now test the accuracy of the model on the statements of the test dataset. 

In [33]:
lr.score(test_x, test_y)

0.23835832675611682

Unfortunately the accuracy is pretty low: about 24% on six classes.
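For a classifier, `score` reports the fraction of statements whose predicted label equals the true label. To put such a number in context it helps to compare it with simple baselines: guessing uniformly among six labels would be right about 1/6 ≈ 0.167 of the time, and always predicting the most frequent train label gives another reference point. The sketch below computes accuracy and that majority baseline on made-up labels; it does not report the baseline for the actual dataset, and both function names are hypothetical.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def majority_baseline(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent train label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return accuracy([most_common_label] * len(test_labels), test_labels)

# Toy labels (hypothetical), encoded as integers like y_vector.
toy_train_labels = [2, 2, 0, 3, 2, 1]
toy_test_labels = [2, 0, 2, 3]
toy_predictions = [2, 0, 2, 1]

print(accuracy(toy_predictions, toy_test_labels))
print(majority_baseline(toy_train_labels, toy_test_labels))
```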
We will now have a look at the coefficients that are assigned to the words of the vocabulary for each of the six validity labels.

In [38]:
#Inspect the coefficient matrix of the fitted model
print(lr.coef_)
print(lr.coef_.shape) #is of size (n_classes, n_features)
print("#Columns is length of vocab: ", len(vocab))

[[ 0.07262584  0.0415358   0.21254774 ... -0.30697811 -0.00765843
  -0.05128795]
 [ 0.05065343 -0.00933911 -0.08832242 ... -0.05403901 -0.02912948
  -0.07423348]
 [ 0.043321   -0.00971194 -0.05323927 ...  0.52651426 -0.15128071
   0.25039853]
 [-0.09123454 -0.0497585  -0.01421984 ... -0.10734406 -0.03400038
  -0.02533595]
 [-0.23978211  0.05981195 -0.04045164 ... -0.04471728 -0.01407327
  -0.06120506]
 [ 0.16441637 -0.0325382  -0.01631457 ... -0.01343579  0.23614228
  -0.03833609]]
(6, 11916)
#Columns is length of vocab:  11916


In [50]:
#Map each vocabulary word to its coefficient, one dictionary per validity label
#(row i of lr.coef_ belongs to the label that labelsdic encodes as i)
coef_dict_false = dict(zip(vocab, lr.coef_[0]))
coef_dict_barely = dict(zip(vocab, lr.coef_[1]))
coef_dict_half = dict(zip(vocab, lr.coef_[2]))
coef_dict_mostly = dict(zip(vocab, lr.coef_[3]))
coef_dict_true = dict(zip(vocab, lr.coef_[4]))
coef_dict_pantsfire = dict(zip(vocab, lr.coef_[5]))

In [51]:
# Order the words by coefficient, highest first

ordered_false_coefs = [(k, coef_dict_false[k]) for k in sorted(coef_dict_false, key=coef_dict_false.get, reverse=True)]

ordered_false_coefs[0:10]

[('destroyed', 1.6392454228024622),
 ('taxed', 1.510072617819274),
 ('motor', 1.3773441327012856),
 ('responsibility', 1.348842040951605),
 ('scientist', 1.3059008288876481),
 ('125', 1.2903847356544436),
 ('supporter', 1.2579666985193203),
 ('cicilline', 1.2511152993137429),
 ('worked', 1.2383459057429123),
 ('scheme', 1.2347847219217412)]

In [52]:
ordered_barely_coefs = [(k, coef_dict_barely[k]) for k in sorted(coef_dict_barely, key=coef_dict_barely.get, reverse=True)]
ordered_barely_coefs[0:10]

[('deciding', 1.3721003524964706),
 ('clear', 1.2633377552530909),
 ('benghazi', 1.2295733902314818),
 ('along', 1.2219629665519056),
 ('patient', 1.1995008134592766),
 ('3000', 1.1670839678753282),
 ('list', 1.143761250567385),
 ('illegals', 1.1426038552894828),
 ('key', 1.1386040499673395),
 ('easier', 1.1140354494265294)]

In [53]:
ordered_half_coefs = [(k, coef_dict_half[k]) for k in sorted(coef_dict_half, key=coef_dict_half.get, reverse=True)]
ordered_half_coefs[0:10]

[('indiana', 1.7776271388910705),
 ('mammogram', 1.6280120819852053),
 ('deported', 1.4555018188393871),
 ('santorum', 1.429021410172006),
 ('ranking', 1.417563645580998),
 ('anytime', 1.3785708846276097),
 ('minnesota', 1.2654421919172874),
 ('confirmation', 1.250038968059138),
 ('direction', 1.2485952173938781),
 ('drink', 1.1789535137469767)]

In [54]:
ordered_mostly_coefs = [(k, coef_dict_mostly[k]) for k in sorted(coef_dict_mostly, key=coef_dict_mostly.get, reverse=True)]
ordered_mostly_coefs[0:10]

[('december', 1.5132457599814482),
 ('turn', 1.4807436902619637),
 ('earns', 1.3210329497753446),
 ('listed', 1.2870213918192295),
 ('mental', 1.274100476575609),
 ('doyles', 1.2578224068194608),
 ('79', 1.238493621558084),
 ('detail', 1.2064812845826935),
 ('graduating', 1.2061185647146462),
 ('fifty', 1.1966880256783268)]

In [55]:
ordered_true_coefs = [(k, coef_dict_true[k]) for k in sorted(coef_dict_true, key=coef_dict_true.get, reverse=True)]
ordered_true_coefs[0:10]

[('blow', 1.551719663305779),
 ('heavily', 1.4751874594109746),
 ('compensation', 1.451694790330877),
 ('youve', 1.4207869222145146),
 ('block', 1.3861172774080845),
 ('affect', 1.3655152560188122),
 ('restrictive', 1.290167424392575),
 ('understand', 1.278986194651979),
 ('expense', 1.247373106301906),
 ('previously', 1.23030905327161)]

In [56]:
ordered_pantsfire_coefs = [(k, coef_dict_pantsfire[k]) for k in sorted(coef_dict_pantsfire, key=coef_dict_pantsfire.get, reverse=True)]
ordered_pantsfire_coefs[0:10]

[('face', 1.7550063199553634),
 ('socialist', 1.4979286829632277),
 ('takeover', 1.4660871623263878),
 ('either', 1.3953236816987387),
 ('rep', 1.389536641392727),
 ('navy', 1.3887019350030623),
 ('sic', 1.3410005192707595),
 ('cabinet', 1.3299437775151992),
 ('250000', 1.304883240300638),
 ('2016', 1.2965506308996206)]
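The six cells above repeat the same sort for each label. A more compact sketch, assuming a coefficient matrix of shape (n_classes, n_features) like `lr.coef_` and class names listed in the order of their integer codes, uses `np.argsort` on each row. The toy vocabulary and coefficients below are made up, and the function name `top_words` is hypothetical.

```python
import numpy as np

def top_words(coef, vocab, class_names, k=10):
    """Return the k words with the highest coefficient for every class."""
    top = {}
    for row, name in enumerate(class_names):
        best = np.argsort(coef[row])[::-1][:k]      # column indices, largest coefficient first
        top[name] = [(vocab[i], float(coef[row, i])) for i in best]
    return top

# Toy example (hypothetical numbers): 2 classes, 4 vocabulary words.
toy_vocab = ["tax", "wall", "coal", "navy"]
toy_class_names = ["false", "true"]
toy_coef = np.array([[0.2, 1.5, -0.3, 0.9],
                     [1.1, -0.4, 0.6, 0.0]])

result = top_words(toy_coef, toy_vocab, toy_class_names, k=2)
print(result)
```

For the model in this notebook, the class names would be passed in the order given by labelsdic: false, barely-true, half-true, mostly-true, true, pants-fire.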