<a href="https://colab.research.google.com/github/oferweintraub/finance_sent/blob/main/using_a_classifier_coef__method_to_select_most_influential_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Using a classifier's coef_ attribute to select the most influential features

In [68]:
# get the needed libraries

import pandas as pd
import numpy as np
import plotly.express as px

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, classification_report, accuracy_score


In [69]:
# Create a list of sentences and convert them to word counts using CountVectorizer

sentences = ['Arabica coffee is the best cofee on earth', 'I do not like Robusta coffee', 'if you ask me for coffee I say Arabica Arabica Arabica', 'this coffee is so bad']
sentiment = [1, 0, 1, 0]  # 1 - positive, 0 - negative

vect = CountVectorizer(
    stop_words='english',
    ngram_range = (1,1),
    lowercase = True,
    max_features=100
)

X = vect.fit_transform(sentences)

In [70]:
# now, let's look at both the vocabulary (term -> column index) and the full count matrix

print(f' vocabulary {vect.vocabulary_} \n')
print(f' full  matrix \n {X.toarray()} \n')
print(f' the size of our vocabulary is - {len(vect.vocabulary_)} \n')

 vocabulary {'arabica': 0, 'coffee': 5, 'best': 3, 'cofee': 4, 'earth': 6, 'like': 7, 'robusta': 8, 'ask': 1, 'say': 9, 'bad': 2} 

 full  matrix 
 [[1 0 0 1 1 1 1 0 0 0]
 [0 0 0 0 0 1 0 1 1 0]
 [3 1 0 0 0 1 0 0 0 1]
 [0 0 1 0 0 1 0 0 0 0]] 

 the size of our vocabulary is - 10 



From the output above one can see that sentence 3, 'if you ask me for coffee I say Arabica Arabica Arabica', is represented by the third row of the full matrix. Specifically, the word 'arabica' has index 0 in the vocabulary (the term-to-column mapping printed first), so its count of 3 appears in column 0 of that row.

We now treat these unigrams as features for a classifier and try to work out which feature has the most impact on the classifier's decision. In this example let's choose the LogisticRegression classifier.

Finally, note that the size of our vocabulary is 10 in this case and that we will have exactly 10 coefficients in the set of parameters for the classifier.
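The link between the vocabulary and the columns of the count matrix can be checked directly. The following is a minimal sketch that rebuilds the same vectorizer as above; the variable names `col`, `count` and `terms_in_column_order` are illustrative and not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['Arabica coffee is the best cofee on earth',
             'I do not like Robusta coffee',
             'if you ask me for coffee I say Arabica Arabica Arabica',
             'this coffee is so bad']

vect = CountVectorizer(stop_words='english', ngram_range=(1, 1),
                       lowercase=True, max_features=100)
X = vect.fit_transform(sentences)

# vocabulary_ maps each term to its column index in the count matrix
col = vect.vocabulary_['arabica']

# row index 2 is the third sentence; the entry is the count of 'arabica' in it
count = X[2, col]
print(f"'arabica' is column {col} and occurs {count} times in sentence 3")

# list the terms in column order, so that position i names column i
terms_in_column_order = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(terms_in_column_order)
```

Sorting the vocabulary by its index values gives one term per column, which is the same order the classifier's coefficients will follow.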

In [71]:
# fit the data using Logistic Regression

clf = LogisticRegression (
    random_state=42,
    max_iter=500
)

clf.fit(X, sentiment)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [72]:
# let's look at the resulting coefficients
importance = clf.coef_.flatten()

# the size of the coefficients matrix and the matrix itself....
print(f' the number of coefficients we have is -  {importance.shape[0]} \n')
print(f' the full vector of coefficients and their importance is \n -  {importance} \n')

 the number of coefficients we have is -  10 

 the full vector of coefficients and their importance is 
 -  [ 7.63508924e-01  1.51957448e-01 -2.45864223e-01  3.07636579e-01
  3.07636579e-01  8.24045001e-07  3.07636579e-01 -2.13728981e-01
 -2.13728981e-01  1.51957448e-01] 



Looking at the coefficient vector above, we can see that the entry at position 0 has the largest positive value. Recall that position 0 is associated with 'arabica', which appears 4 times in the frequency matrix (once in sentence 1 and three times in sentence 3, both positive), so it makes sense that it has a large positive value.

Which word has the largest negative value? It is the word at index 2, with a value of about -0.2459, and at index 2 we find the word 'bad', which is indeed quite negative.
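Why the sign and size of a coefficient matter can be made concrete: for a binary logistic regression, the log-odds of the positive class is the sum of count times coefficient over all terms, plus the intercept. The sketch below refits the same model and checks that identity against scikit-learn's own `decision_function` and `predict_proba`; the names `log_odds`, `prob_positive` and `contributions` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

sentences = ['Arabica coffee is the best cofee on earth',
             'I do not like Robusta coffee',
             'if you ask me for coffee I say Arabica Arabica Arabica',
             'this coffee is so bad']
sentiment = [1, 0, 1, 0]  # 1 - positive, 0 - negative

vect = CountVectorizer(stop_words='english', ngram_range=(1, 1),
                       lowercase=True, max_features=100)
X = vect.fit_transform(sentences)
clf = LogisticRegression(random_state=42, max_iter=500).fit(X, sentiment)

counts = X.toarray()
weights = clf.coef_.flatten()

# log-odds of the positive class: sum(count * coefficient) + intercept
log_odds = counts @ weights + clf.intercept_[0]

# the sigmoid turns log-odds into the probability of the positive class
prob_positive = 1.0 / (1.0 + np.exp(-log_odds))

print(np.allclose(log_odds, clf.decision_function(X)))
print(np.allclose(prob_positive, clf.predict_proba(X)[:, 1]))

# per-term contribution to the log-odds of the third sentence
contributions = counts[2] * weights
print(dict(zip(sorted(vect.vocabulary_, key=vect.vocabulary_.get),
               contributions.round(3))))
```

A term with a positive coefficient pushes the log-odds up each time it occurs, and a term with a negative coefficient pushes it down, which is why the coefficients can be read as a rough measure of influence when the features share a comparable scale.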



How can we double-check this? Let's make index 2, i.e. the word 'bad', even more influential and see whether the magnitude of its value in coef_ increases. For this we create another set of sentences that simply has a few more occurrences of 'bad' in the negative sentences.

In [73]:
sentences_with_more_bad = ['Arabica coffee is the best cofee on earth', 'I do not like Robusta bad coffee ', 'if you ask me for coffee I say Arabica Arabica Arabica', 'this coffee is so bad, bad, bad']

# transform it to the same feature space (the vocabulary fitted above is reused)
X_bad = vect.transform(sentences_with_more_bad)

# and look at X_bad , especially at index 2 where 'bad' is represented
print (f' words frequency matrix after adding few bads \n {X_bad.toarray()}')

 words frequency matrix after adding few bads 
 [[1 0 0 1 1 1 1 0 0 0]
 [0 0 1 0 0 1 0 1 1 0]
 [3 1 0 0 0 1 0 0 0 1]
 [0 0 3 0 0 1 0 0 0 0]]
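Because `transform` reuses the vocabulary learned during `fit`, the number of columns stays fixed and words that were never seen are simply dropped. A small sketch of that behavior follows; the sentence containing 'espresso' is a hypothetical example and does not appear in the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['Arabica coffee is the best cofee on earth',
             'I do not like Robusta coffee',
             'if you ask me for coffee I say Arabica Arabica Arabica',
             'this coffee is so bad']

vect = CountVectorizer(stop_words='english', ngram_range=(1, 1),
                       lowercase=True, max_features=100)
vect.fit(sentences)

# 'espresso' is not in the fitted vocabulary, so it gets no column;
# 'this' and 'is' are English stop words and are removed as well
X_new = vect.transform(['this espresso is bad'])

print(X_new.shape)       # still one column per fitted vocabulary term
print(X_new.toarray())   # only the 'bad' column is non-zero
```

This is why the refit below still produces exactly 10 coefficients, in the same column order as before.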


In [74]:
# let's fit the classifier again with this data
clf.fit(X_bad, sentiment)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [75]:
# and now let's look at the coefficients... we expect position 2 to be large and negative
importance = clf.coef_.flatten()

print(f' the number of coefficients we have is -  {importance.shape[0]} \n')
print(f' the full vector of coefficients and their importance is \n   {importance} \n')

 the number of coefficients we have is -  10 

 the full vector of coefficients and their importance is 
   [ 6.06899893e-01  1.24463714e-01 -6.08389854e-01  2.33508750e-01
  2.33508750e-01  2.04423891e-06  2.33508750e-01 -2.32760703e-01
 -2.32760703e-01  1.24463714e-01] 



Indeed, the coefficient at index 2 is now large and negative (about -0.61, up in magnitude from about -0.25), as expected. All that remains is to map those coefficients back to their words, which we do below.

In [76]:
# find the most influential terms for positive and negative sentiment

TOP_N = 10

# map each column index back to its word (the inverse of vect.vocabulary_)
index_to_word = {index: word for word, index in vect.vocabulary_.items()}

features_pos = {}
features_neg = {}

# column indices ordered by absolute coefficient value, largest first
ranked = np.argsort(np.abs(importance))[::-1]

# note: this keeps the TOP_N - 1 strongest terms, so with TOP_N = 10
# the weakest of the 10 words ('coffee') is left out
for param in ranked[:TOP_N - 1]:
    key = index_to_word[int(param)]
    value = float(importance[param])
    if value >= 0:  # positive coefficients push towards the positive class
        features_pos[key] = value
    else:           # negative coefficients push towards the negative class
        features_neg[key] = value

print(f' top terms contributing to positives:\n {features_pos} \n')
print(f' top terms contributing to negatives: \n {features_neg} \n')


 top terms contributing to positives:
 {'arabica': 0.6068998925439615, 'earth': 0.2335087499111397, 'cofee': 0.2335087499111397, 'best': 0.2335087499111397, 'say': 0.12446371421094059, 'ask': 0.12446371421094059} 

 top terms contributing to negatives: 
 {'bad': -0.6083898539890195, 'robusta': -0.23276070283024977, 'like': -0.23276070283024977} 
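The same ranking can also be produced more compactly by pairing each term with its coefficient and sorting the pairs. This is a minimal alternative sketch, not the notebook's own method; it refits the model on the sentences with the extra 'bad' occurrences, and the names `terms`, `ranked`, `top_positive` and `top_negative` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

sentences = ['Arabica coffee is the best cofee on earth',
             'I do not like Robusta coffee',
             'if you ask me for coffee I say Arabica Arabica Arabica',
             'this coffee is so bad']
sentences_with_more_bad = ['Arabica coffee is the best cofee on earth',
                           'I do not like Robusta bad coffee ',
                           'if you ask me for coffee I say Arabica Arabica Arabica',
                           'this coffee is so bad, bad, bad']
sentiment = [1, 0, 1, 0]  # 1 - positive, 0 - negative

vect = CountVectorizer(stop_words='english', ngram_range=(1, 1),
                       lowercase=True, max_features=100)
vect.fit(sentences)
X_bad = vect.transform(sentences_with_more_bad)
clf = LogisticRegression(random_state=42, max_iter=500).fit(X_bad, sentiment)

# terms in column order, so terms[i] belongs to coefficient i
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

# (term, coefficient) pairs sorted from most negative to most positive
ranked = sorted(zip(terms, (float(c) for c in clf.coef_.flatten())),
                key=lambda pair: pair[1])

top_negative = ranked[:3]        # strongest push towards class 0
top_positive = ranked[::-1][:3]  # strongest push towards class 1

print(f'top positive terms: {top_positive}')
print(f'top negative terms: {top_negative}')
```

Sorting by the signed value rather than the absolute value keeps the positive and negative ends separate, so no sign check is needed inside a loop.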

