# Multinomial Logistic Regression

** Names: ** Barbara, Eva & Joyce
<br><br>
In this notebook we will implement multinomial logistic regression using countvectorizer to represent our data (rather than creating matrices from scratch). Then we will look at the feature importance of this model.


### Index
1. ** Logistic Regression using `Count Vectorizer` **
    
    
2. ** Feature importance **

### 1. Using CountVectorizer to represent data and apply LR

In [2]:
import pandas as pd
import numpy as np
import sklearn

from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from collections import OrderedDict

In [3]:
# create dataframe for the train data
df_liar = pd.read_csv("train.tsv", encoding="utf8", sep="\t", names=["id", "truth-value", 
                                                                     "text", "topic", "name", "job", 
                                                                     "state", "politics", "count1", "count2", 
                                                                     "count3", "count4", "count5", "context"])

df_liar.head(3)

Unnamed: 0,id,truth-value,text,topic,name,job,state,politics,count1,count2,count3,count4,count5,context
0,2635.json,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0.0,1.0,0.0,0.0,0.0,a mailer
1,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0.0,0.0,1.0,1.0,0.0,a floor speech.
2,324.json,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,Denver


In [4]:
# classification formula to take return the corresponding number to the validity label
validity_labels = {"false":0, "barely-true":1,"half-true":2,"mostly-true":3,"true":4, "pants-fire":5}

nolabel = 0 
def classify_validity(text):
    if text not in validity_labels.keys():          #in the case that the label is different
        return -1
        nolabel += 1
    else:
        return validity_labels[text]
    
df_liar["truth-score"] = df_liar["truth-value"].apply(classify_validity)  #add column with truth-score which contains a validity number
print("Number of statements without labels:", nolabel)                    #check how many statement do not one of the 6 validity labels

Number of statements without labels: 0


*Since all staments have a label, we do not have to filter out specific labels.* 

We now apply Count Vectorizer in order to represent the data of the train dataset which we can later use for multinomial logistic regression.

In [6]:
count_vect = CountVectorizer()
X_train = count_vect.fit(df_liar.text)          #our X matrix is the text from the statements 
X_train = count_vect.transform(df_liar.text)    

y_train = df_liar["truth-score"].values         #our y vector is the list of all the truth labels     

In [7]:
lr = LogisticRegression(solver='lbfgs', multi_class='multinomial')
lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

Now we will create a dataframe of the test dataset in order to test our logistic regression model. 


In [8]:
# create dataframe for the test data
df_liar_test = pd.read_csv("test.tsv", encoding="utf8", sep="\t", names=["id", "truth-value", 
                                                                     "text", "topic", "name", "job", 
                                                                     "state", "politics", "count1", "count2", 
                                                                     "count3", "count4", "count5", "context"])

df_liar_test.head(3)

Unnamed: 0,id,truth-value,text,topic,name,job,state,politics,count1,count2,count3,count4,count5,context
0,11972.json,True,Building a wall on the U.S.-Mexico border will...,immigration,rick-perry,Governor,Texas,republican,30,30,42,23,18,Radio interview
1,11685.json,False,Wisconsin is on pace to double the number of l...,jobs,katrina-shankland,State representative,Wisconsin,democrat,2,1,0,0,0,a news conference
2,11096.json,False,Says John McCain has done nothing to help the ...,"military,veterans,voting-record",donald-trump,President-Elect,New York,republican,63,114,51,37,61,comments on ABC's This Week.


In [9]:
df_liar_test["truth-score"] = df_liar_test["truth-value"].apply(classify_validity)

In [10]:
X_test = count_vect.transform(df_liar_test.text)        # X matrix is again text from statements
y_test = df_liar_test["truth-score"].values             # y vector is list of all the truth values

In [11]:
 y_hat_test = lr.predict(X_test)

# evaluate using accuracy: proportion of correctly predicted over total
print(accuracy_score(y_test, y_hat_test))
print(accuracy_score(y_test, y_hat_test, normalize=False))

0.244672454617206
310


#### Comments
> The accuracy of this multinomial model is unfortunately very low. However, we expected the accuracy to be low, as the accuracy of the binomial model was only around 60%. As there are 6 different labels for validity in this case with some labels between false and true the difference between the statements is smaller than in the case of only true and false and thus probably more difficult to distinguish. We saw from the binomial model that it was already difficult to predict the label of the false and true statements.  

### 2. Feature Importance 
Here we create dictionaries for all the features, which are all the different words of the train dataset, with their corresponding weight in this logistic regression model. All the words have a different weight for the six different labels. We therefor have six dictionaries with all the words and their importance weight.

In [15]:
print(lr.coef_)
print(lr.coef_.shape)
# This coefficient matrix has 6 rows as there are 6 different labels. 
#For every label each word has a different weight. 

[[-0.19671048  0.03022712 -0.04513979 ... -0.04800402 -0.02337375
  -0.07402812]
 [-0.13880998 -0.06395065  0.27090762 ... -0.01712352  0.30305471
  -0.21690808]
 [-0.18554702  0.2231832  -0.05050444 ... -0.13676568 -0.05250957
  -0.04384467]
 [ 0.52984746  0.2695472  -0.08310035 ...  0.33815024 -0.17182523
  -0.12282034]
 [-0.27025362 -0.09350222 -0.05699162 ... -0.11081243 -0.04979296
  -0.21307652]
 [ 0.26147363 -0.36550465 -0.03517143 ... -0.02544459 -0.0055532
   0.67067773]]
(6, 12196)


In [19]:
#Coefficient dictionaries for the different labels

#label false 
coef_dict_false = dict()                                    
for n, key in enumerate(count_vect.vocabulary_.keys()):
    coef_dict_false[key] = lr.coef_[0][n] 
    
#label barely-true
coef_dict_barelytrue = dict()
for n, key in enumerate(count_vect.vocabulary_.keys()):
    coef_dict_barelytrue[key] = lr.coef_[1][n] 

#label half-true
coef_dict_halftrue = dict()
for n, key in enumerate(count_vect.vocabulary_.keys()):
    coef_dict_halftrue[key] = lr.coef_[2][n] 
    
#label mostly-true
coef_dict_mostlytrue = dict()
for n, key in enumerate(count_vect.vocabulary_.keys()):
    coef_dict_mostlytrue[key] = lr.coef_[3][n] 
    
#label true
coef_dict_true = dict()
for n, key in enumerate(count_vect.vocabulary_.keys()):
    coef_dict_true[key] = lr.coef_[4][n] 
    
#label pants-fire 
coef_dict_pantsfire = dict()
for n, key in enumerate(count_vect.vocabulary_.keys()):
    coef_dict_pantsfire[key] = lr.coef_[5][n] 

In [25]:
#Ordering the different coefficient dictionaries

ordered_coefs_false = [(k, coef_dict_false[k]) for k in sorted(coef_dict_false, key=coef_dict_false.get, reverse=True)]

ordered_coefs_barelytrue = [(k, coef_dict_barelytrue[k]) for k in sorted(coef_dict_barelytrue, key=coef_dict_barelytrue.get, reverse=True)]

ordered_coefs_halftrue = [(k, coef_dict_halftrue[k]) for k in sorted(coef_dict_halftrue, key=coef_dict_halftrue.get, reverse=True)]

ordered_coefs_mostlytrue = [(k, coef_dict_mostlytrue[k]) for k in sorted(coef_dict_mostlytrue, key=coef_dict_mostlytrue.get, reverse=True)]

ordered_coefs_true = [(k, coef_dict_true[k]) for k in sorted(coef_dict_true, key=coef_dict_true.get, reverse=True)]

ordered_coefs_pantsfire = [(k, coef_dict_pantsfire[k]) for k in sorted(coef_dict_pantsfire, key=coef_dict_pantsfire.get, reverse=True)]

In [27]:
ordered_coefs_false[0:10]

[('colleges', 1.571115551590921),
 ('crystal', 1.5501536971495709),
 ('ranches', 1.3072655297420652),
 ('pray', 1.289434031495409),
 ('turkeys', 1.2784723202137627),
 ('er', 1.2682674094090063),
 ('psychiatric', 1.2576381806723511),
 ('directing', 1.2305376319688748),
 ('to1', 1.2288779585651446),
 ('sam', 1.221000814601286)]

In [28]:
ordered_coefs_barelytrue[0:10]

[('recount', 1.4426736703720562),
 ('dold', 1.2766246724560313),
 ('pastor', 1.2701726512090712),
 ('doc', 1.2375502252426396),
 ('initiated', 1.234166400553806),
 ('crucifying', 1.2327905840487676),
 ('declined', 1.2185644260883728),
 ('fivefold', 1.183293722693532),
 ('basra', 1.1813164923510588),
 ('violations', 1.1751700467412065)]

In [29]:
ordered_coefs_halftrue[0:10]

[('megachurch', 1.7285743052031848),
 ('push', 1.5753411395139432),
 ('begging', 1.531972208134943),
 ('experiment', 1.487670453158909),
 ('280', 1.34768206982658),
 ('carved', 1.2822690735233273),
 ('524', 1.2430842418501653),
 ('abu', 1.2391937734386518),
 ('palestine', 1.2224019220778954),
 ('morris', 1.1783782962220712)]

In [30]:
ordered_coefs_mostlytrue[0:10]

[('swastika', 1.8035766417854084),
 ('kittens', 1.7975403734980244),
 ('usual', 1.59153021155955),
 ('treasury', 1.4795967576923121),
 ('windfall', 1.2986497430970094),
 ('indianas', 1.268607903507159),
 ('assaulted', 1.2567707974363818),
 ('independent', 1.2565552123023043),
 ('wouldnt', 1.245281637956506),
 ('1928', 1.229870338883784)]

In [31]:
ordered_coefs_true[0:10]

[('science', 1.6067734871329367),
 ('611', 1.4853807170205966),
 ('1968', 1.4386097244832259),
 ('minutemen', 1.389352344114847),
 ('toddlers', 1.3195292753787937),
 ('istheone', 1.2913603536043332),
 ('eggs', 1.2667774710282504),
 ('ticketing', 1.221552578683125),
 ('internment', 1.1896678838816612),
 ('peaceful', 1.1562809128213787)]

In [32]:
ordered_coefs_pantsfire[0:10]

[('spoke', 1.9713198139783084),
 ('husted', 1.5713416375593625),
 ('handful', 1.5511399379652162),
 ('jurors', 1.4219543689550154),
 ('skips', 1.401469745657929),
 ('battery', 1.3530141100001047),
 ('mode', 1.351160825791204),
 ('frated', 1.3398404942254272),
 ('civil', 1.3290797741668097),
 ('children', 1.2737802434821182)]

### Comments
>Here we can see which words, for each label, were important in determining its label. Interestingly, the words with the highest weigt for "true", are different from the words with the highest weight in binomial logistic regression. Furthermore, the words with the lowest weight in the binomial logistic regression, which were the words contributed most to labeling the statement as false, did not correspond with the highest weights of the "false" label. As there are multiple labels in this case, the whole model is ofcourse different, however we would have expected some simmilarities. Furthermore some of the words with the highest weights seem very strange to actually be of importance in determining the label of the statement such as "611" and "to1". However, as our model has very low accuracy, we cannot really extract useful information from these coefficient matrices. 