*Copyright (c) Microsoft Corporation. All rights reserved.*

*Licensed under the MIT License.*

# Sentiment Analysis using Classical Explainer 


_**This notebook showcases how to use the interpret-text repo to implement an interpretable module using feature importances and bag of words representation.**_


## Contents
1. [Introduction](#Introduction)
2. [Setup](#Setup)
3. [Training](#Training)
4. [Results](#Results)

In [None]:
import sys
sys.path.append("../..")
import os

import pandas as pd
import nlp

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

from interpret_text.experimental.classical import ClassicalTextExplainer

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

## 1. Introduction
This notebook illustrates how to use interpret-text locally to help interpret text classification with a logistic regression baseline and bag-of-words encoding. It demonstrates the API calls needed to obtain feature importances, along with a visualization dashboard.

###### Note:
* *Although we use logistic regression, any model that follows sklearn's classifier API should be supported natively or with minimal tweaking.*
* *The interpreter supports interpretations using either coefficients associated with linear models or feature importances associated with ensemble models.*
* *The classifier relies on scipy's sparse representations to keep the dataset in memory.*
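
To make the second note concrete, here is a minimal sketch of how a linear model's coefficients can be read as per-word importances in a bag-of-words setting: each word's contribution to a class score is its count multiplied by its coefficient. The helper `word_contributions` and the coefficient values are hypothetical and chosen for illustration; this is not interpret-text's internal implementation.

```python
from collections import Counter


def word_contributions(tokens, coefficients):
    """Return (word, count * coefficient) pairs, most influential first."""
    counts = Counter(tokens)
    contributions = {
        word: count * coefficients[word]
        for word, count in counts.items()
        if word in coefficients  # words without a learned weight contribute nothing
    }
    # Rank by absolute contribution so strong negative words are not hidden
    return sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical coefficients for a "happy" class
happy_coefficients = {"happy": 1.5, "not": -0.8, "day": 0.1}
ranked = word_contributions("not happy not today".split(), happy_coefficients)
print(ranked)
```

Ensemble models expose a single non-negative importance per feature instead of signed per-class coefficients, so the same ranking idea applies but without a direction.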

## 2. Setup

The notebook is built on features made available by [scikit-learn](https://scikit-learn.org/stable/) and [spaCy](https://spacy.io/) for easier compatibility with popular toolkits.

## Configuration parameters


In [None]:
EMOTION_COL = "emotion"
LABEL_COL = "label"
TEXT_COL = "text"

### Load data

In [None]:
train, test = nlp.load_dataset("emo", split=["train", "test"])

In [None]:
id2label = {0: 'others', 1: 'happy', 2: 'sad', 3: 'angry'}
labels = list(id2label.values())
# Reverse mapping from label name to id
label2id = {label: i for i, label in enumerate(labels)}

In [None]:
train_data={TEXT_COL:[],
     EMOTION_COL:[]}
for val in train:
    if id2label[val[LABEL_COL]]!='others':
        train_data[TEXT_COL].append(val[TEXT_COL])
        train_data[EMOTION_COL].append(id2label[val[LABEL_COL]])
        
train_data = pd.DataFrame(train_data)

In [None]:
test_data={TEXT_COL:[],
     EMOTION_COL:[]}
for val in test:
    if id2label[val[LABEL_COL]]!='others':
        test_data[TEXT_COL].append(val[TEXT_COL])
        test_data[EMOTION_COL].append(id2label[val[LABEL_COL]])
        
test_data = pd.DataFrame(test_data)

In [None]:
X_str = train_data[TEXT_COL]
ylabels = train_data[EMOTION_COL]

X_str_test = test_data[TEXT_COL]
ylabels_test = test_data[EMOTION_COL]

## Create Explainer

In [None]:
# Create explainer object that contains default glassbox classifier and explanation methods
explainer = ClassicalTextExplainer()
label_encoder = LabelEncoder()

## 3. Training

###### Note: Vocabulary

* *The vocabulary is compiled from the training set. Any word that does not appear in the training split will not appear in the vocabulary.*
* *A word must appear at least once in the training data to be part of the vocabulary.*
* *However, sklearn's `CountVectorizer` accepts a custom vocabulary as an input parameter.*
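
These notes can be checked with scikit-learn's `CountVectorizer` directly. The sketch below uses made-up sentences and the vectorizer's default tokenizer settings; whether interpret-text's preprocessor is configured the same way is an assumption not confirmed here.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["so happy today", "this makes me sad"]
unseen_doc = ["happy but furious"]

# Vocabulary learned from the training documents only
learned = CountVectorizer()
learned.fit(train_docs)
# "but" and "furious" never appeared in training, so only "happy" is counted
learned_counts = learned.transform(unseen_doc).toarray()

# Supplying a custom vocabulary fixes the columns up front
custom = CountVectorizer(vocabulary=["happy", "furious"])
custom_counts = custom.transform(unseen_doc).toarray()

print(sorted(learned.vocabulary_))
print(learned_counts.sum(), custom_counts.tolist())
```

With the learned vocabulary, out-of-vocabulary words are silently dropped at transform time; with the custom vocabulary, "furious" gets its own column even though it never occurred in the training sentences.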

### Configure training setup
This step casts the training data and labels into the correct format:

1. The data is already split into train and test sets. If yours is not, split it using a random shuffle.
2. Load the desired classifier. Logistic regression is the default.
3. Set up grid search for hyperparameter optimization and train the model. Edit the hyperparameter range to suit your model.
4. Fit the model to the training set.
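
The grid search step can be pictured as expanding a dictionary of hyperparameter ranges into every candidate combination, each of which is then scored with cross-validation. The sketch below shows only the expansion; `expand_grid` is a hypothetical helper and the range values are illustrative, not the library's defaults.

```python
from itertools import product


def expand_grid(hyperparam_range):
    """Expand {name: [values]} into a list of {name: value} candidates."""
    names = sorted(hyperparam_range)
    value_lists = [hyperparam_range[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Illustrative ranges for a logistic regression model
hyperparam_range = {"C": [0.1, 1.0, 10.0], "solver": ["lbfgs", "saga"]}
candidates = expand_grid(hyperparam_range)
print(len(candidates))
print(candidates[0])
```

The number of candidates is the product of the range sizes, and each candidate is fitted once per cross-validation fold, so widening a range increases training time multiplicatively.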

In [None]:
X_train, X_test, y_train, y_test = X_str, X_str_test, ylabels,  ylabels_test
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)
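
`LabelEncoder` maps each distinct label to an integer id based on the sorted order of the labels, which is why `inverse_transform` is needed later to turn predictions back into emotion names. The following pure-Python sketch mimics that behavior on a made-up label list; the helper names are hypothetical.

```python
def fit_label_mapping(labels):
    """Return the sorted distinct classes; a label's id is its index in this list."""
    return sorted(set(labels))


def encode(labels, classes):
    """Map label names to integer ids."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]


def decode(ids, classes):
    """Map integer ids back to label names."""
    return [classes[i] for i in ids]


raw_labels = ["happy", "sad", "angry", "sad"]
classes = fit_label_mapping(raw_labels)
encoded = encode(raw_labels, classes)
print(classes, encoded, decode(encoded, classes))
```

Because the mapping is fitted on the training labels, a test label that never occurs in training has no id, which is why the notebook calls `fit_transform` on the training labels and only `transform` on the test labels.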

In [None]:
print(f"X_train shape = {X_train.shape}")
print(f"y_train shape = {y_train.shape}")
print(f"X_train data structure = {type(X_train)}")

#### Model Overview

The 1-gram [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) allows a 1:1 mapping from individual words to their respective frequencies in the [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix). 
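
As a small illustration of that mapping, the sketch below builds a document-term matrix by hand for two made-up sentences. It uses whitespace tokenization for simplicity, which is an assumption and not necessarily how the notebook's preprocessor tokenizes text.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # One column per vocabulary term, in a fixed order shared by all rows
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


vocabulary, matrix = document_term_matrix(["good good day", "bad day"])
print(vocabulary)
print(matrix)
```

Each column corresponds to exactly one word, so a weight learned for a column can be attributed directly to that word.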

In [None]:
classifier, best_params = explainer.fit(X_train, y_train)

## 4. Results

###### Notes for default Logistic Regression classifier:
* *The parameters are set using cross-validation.*
* *The hyperparameters listed below were selected by searching over a larger space.*
* *These apply specifically to this instance of the logistic regression model and the emo dataset.*
* *The 'multinomial' setup was found to be better than 'one-vs-all' across the board.*
* *The default 'liblinear' solver is not supported for the 'multinomial' model setup.*
* *For a different model or dataset, set the range as appropriate using the hyperparam_range argument in the train method.*

In [None]:
# Print the best hyperparameters found during the search
print("best hyperparameters: " + str(best_params))

## Performance Metrics

In [None]:
mean_accuracy = classifier.score(X_test, y_test, sample_weight=None)
print("accuracy = " + str(mean_accuracy * 100) + "%")
y_pred = classifier.predict(X_test)
[precision, recall, fscore, support] = precision_recall_fscore_support(y_test, y_pred,average='macro')
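
Macro averaging computes precision, recall, and F1 for each class separately and then takes their unweighted mean, so every emotion counts equally regardless of how many examples it has. The sketch below works this through by hand on made-up labels; `macro_scores` is a hypothetical helper, not part of scikit-learn, and it treats undefined ratios as 0.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1) over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        actual = sum(1 for t in y_true if t == c)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Made-up labels for three classes
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
macro_p, macro_r, macro_f = macro_scores(toy_true, toy_pred)
print(macro_p, macro_r, macro_f)
```

Here the per-class precisions are 1/2, 2/3, and 1, and the per-class recalls are 1/2, 1, and 1/2, so the macro values are their plain averages.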

In [None]:
# Enter any document, or a document and label pair, that needs to be interpreted
document = "There is no limit to what we, as women, can accomplish"

In [None]:
# Attach the fitted label encoder to the explainer's preprocessor so label ids can be mapped back to emotion names
explainer.preprocessor.labelEncoder = label_encoder

## Explain Model

In [None]:
# Generate a local explanation for the document
local_explanation = explainer.explain_local(document)

In [None]:
# Alternatively, predict the label explicitly and pass it to explain_local
y = classifier.predict(document)
predicted_label = label_encoder.inverse_transform(y)
local_explanation = explainer.explain_local(document, predicted_label)

## Visualize Explanations

In [None]:
from interpret_text.experimental.widget import ExplanationDashboard

In [None]:
ExplanationDashboard(local_explanation)