# Linear Optimization for [LIME](https://arxiv.org/pdf/1602.04938.pdf)

LIME stands for Local Interpretable Model-agnostic Explanations.

This study aims to formulate LIME as a linear optimization problem and to test that formulation.

TODO:
- Pre-process text
- Run SVM more

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1">Setup</a></span></li><li><span><a href="#Model" data-toc-modified-id="Model-2">Model</a></span><ul class="toc-item"><li><span><a href="#Data" data-toc-modified-id="Data-2.1">Data</a></span></li><li><span><a href="#Classifier" data-toc-modified-id="Classifier-2.2">Classifier</a></span></li></ul></li><li><span><a href="#Optimization" data-toc-modified-id="Optimization-3">Optimization</a></span><ul class="toc-item"><li><span><a href="#LIME" data-toc-modified-id="LIME-3.1">LIME</a></span></li><li><span><a href="#Calculation-of-parameters" data-toc-modified-id="Calculation-of-parameters-3.2">Calculation of parameters</a></span></li><li><span><a href="#Linear-optimization" data-toc-modified-id="Linear-optimization-3.3">Linear optimization</a></span></li></ul></li><li><span><a href="#References" data-toc-modified-id="References-4">References</a></span></li></ul></div>

## Setup

In [1]:
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import numpy as np
import pandas as pd

In [2]:
%config Completer.use_jedi = False

## Model

### Data

In [3]:
df = pd.read_csv('../data/IMDB Dataset.csv')
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [4]:
X = df.review.to_list()
y = df.sentiment.to_list()

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

### Classifier

In [6]:
tf_idf = TfidfVectorizer(
    strip_accents=None,
    lowercase=True,
    smooth_idf=True,
)
X_train = tf_idf.fit_transform(X_train)
X_train.shape

(40000, 92647)

In [7]:
t_svd = TruncatedSVD(n_components=50, random_state=42)
X_train = t_svd.fit_transform(X_train)
X_train.shape

(40000, 50)

In [8]:
scaler = StandardScaler(with_mean=True)
X_train = scaler.fit_transform(X_train)

In [9]:
# lr = LogisticRegressionCV(
#     Cs=10,
#     fit_intercept=True,
#     cv=5,
#     penalty='l2',
#     solver='sag',
#     tol=0.0001,
#     max_iter=1000,
#     n_jobs=-1,
#     verbose=1,
#     random_state=42,
# )

In [10]:
# %%time
# lr.fit(X_train, y_train)

In [11]:
# model = Pipeline([
#     ('tf_idf', tf_idf),
#     ('t_svd', t_svd),
#     ('scaler', scaler),
#     ('lr', lr)
# ])

In [12]:
# model.score(X_test, y_test)

In [13]:
svc = SVC(
    # kernel is set by the grid search below
    shrinking=True,
    probability=True,
    tol=0.001,
    cache_size=200,
    verbose=True,
    max_iter=-1,
    decision_function_shape='ovr',
    random_state=42,
)

svm = GridSearchCV(
    svc,
    param_grid={'C': [1, 10], 'kernel': ['linear']},
    n_jobs=-1,
    cv=5,
    verbose=3
)

In [14]:
%%time
m = 600  # number of training samples used to fit the grid search
svm.fit(X_train[:m], y_train[:m])

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[LibSVM]CPU times: user 4.1 s, sys: 184 ms, total: 4.28 s
Wall time: 11.4 s


GridSearchCV(cv=5,
             estimator=SVC(probability=True, random_state=42, verbose=True),
             n_jobs=-1, param_grid={'C': [1, 10], 'kernel': ['linear']},
             verbose=3)

In [15]:
# Chain the already-fitted steps so the model accepts raw review text
model = Pipeline([
    ('tf_idf', tf_idf),
    ('t_svd', t_svd),
    ('scaler', scaler),
    ('svm', svm)
])

In [16]:
model.score(X_test, y_test)

0.8182

## Optimization

### LIME
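
For reference, the objective from the LIME paper (the first entry under References) can be written as below. The symbols follow the paper: $f$ is the classifier, $g$ an interpretable surrogate from a family $G$, $z'$ the binary representation of a perturbed sample $z$, $\pi_x$ the proximity kernel, and $\Omega$ a complexity penalty. Note that the paper uses cosine distance for $D$ on text, whereas `pi` in this notebook uses a normalized angular distance.

$$
\begin{aligned}
\xi(x) &= \operatorname*{argmin}_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g) \\
\mathcal{L}(f, g, \pi_x) &= \sum_{z, z' \in \mathcal{Z}} \pi_x(z)\,\bigl(f(z) - g(z')\bigr)^2 \\
\pi_x(z) &= \exp\!\left(-\frac{D(x, z)^2}{\sigma^2}\right)
\end{aligned}
$$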

### Calculation of parameters

In [44]:
example = X_test[5]
example

'To call this film a complete waste of celluloid would be an understatement.<br /><br />The acting was unconvincing to say the least, especially from actor Craig Fong, who couldn\'t have acted stiffer. As far as story goes...well...what story?! The "film" is nominally about Harry Lee, a Malaysian of Chinese descent who comes back to his home country after flunking out of every course he took and tries to start a band.<br /><br />The film has ever cliche you can think of -- sex, tension among band members and a little bit of racial tension thrown in.<br /><br />The problem is that even with a subject that\'s been covered adequately by even the most amateurish directors, this movie is all over the place and the whole thing just feels contrived with parts that would make even the most hardened reviewers\' hairs stand on end.<br /><br />'

In [27]:
y_test[5]

'negative'

In [28]:
model.predict([example])

array(['negative'], dtype='<U8')

In [29]:
model.predict_proba([example])

array([[0.90328654, 0.09671346]])

In [30]:
N = 100
K = 10

In [31]:
def f(x):
    """Return the class probabilities [negative, positive] for one raw text."""
    return model.predict_proba([x])[0]

In [32]:
# Text embedding (TF-IDF followed by SVD, without the scaler) used to measure proximity
vector = Pipeline([
    ('tf_idf', tf_idf),
    ('t_svd', t_svd)
])

In [33]:
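def pi(x, z, sigma=0.5):
    """Exponential kernel on the normalized angular distance between x and z."""
    denom = np.linalg.norm(x) * np.linalg.norm(z)
    if denom == 0:
        return 0.0  # a zero vector has no direction; give it zero weight (assumption)
    # Clip to [-1, 1] so rounding error cannot push arccos outside its domain
    cos = np.clip(np.dot(x, z) / denom, -1.0, 1.0)
    D = np.arccos(cos) * 2 / np.pi
    return np.exp(-D**2 / sigma**2)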
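
As a quick sanity check of how this kernel behaves, the sketch below evaluates it on three hand-picked vector pairs. `angular_kernel` is a hypothetical standalone copy of `pi` (with the cosine clipped to [-1, 1]) so that the snippet runs on its own; the vectors are made up for illustration.

```python
import numpy as np

def angular_kernel(x, z, sigma=0.5):
    # Hypothetical standalone copy of pi(), for illustration only
    cos = np.dot(x, z) / (np.linalg.norm(x) * np.linalg.norm(z))
    cos = np.clip(cos, -1.0, 1.0)
    D = np.arccos(cos) * 2 / np.pi  # 0 for parallel, 1 for orthogonal, 2 for opposite
    return np.exp(-D**2 / sigma**2)

same = angular_kernel(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
orthogonal = angular_kernel(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
opposite = angular_kernel(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))

print(same)        # 1.0
print(orthogonal)  # exp(-4), about 0.018
print(opposite)    # exp(-16), about 1.1e-7
```

With `sigma=0.5` the weight falls off quickly: a perturbed text whose embedding is orthogonal to the original already contributes less than 2% of the weight of an identical one.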

In [34]:
split = example.split()
split

['To',
 'call',
 'this',
 'film',
 'a',
 'complete',
 'waste',
 'of',
 'celluloid',
 'would',
 'be',
 'an',
 'understatement.<br',
 '/><br',
 '/>The',
 'acting',
 'was',
 'unconvincing',
 'to',
 'say',
 'the',
 'least,',
 'especially',
 'from',
 'actor',
 'Craig',
 'Fong,',
 'who',
 "couldn't",
 'have',
 'acted',
 'stiffer.',
 'As',
 'far',
 'as',
 'story',
 'goes...well...what',
 'story?!',
 'The',
 '"film"',
 'is',
 'nominally',
 'about',
 'Harry',
 'Lee,',
 'a',
 'Malaysian',
 'of',
 'Chinese',
 'descent',
 'who',
 'comes',
 'back',
 'to',
 'his',
 'home',
 'country',
 'after',
 'flunking',
 'out',
 'of',
 'every',
 'course',
 'he',
 'took',
 'and',
 'tries',
 'to',
 'start',
 'a',
 'band.<br',
 '/><br',
 '/>The',
 'film',
 'has',
 'ever',
 'cliche',
 'you',
 'can',
 'think',
 'of',
 '--',
 'sex,',
 'tension',
 'among',
 'band',
 'members',
 'and',
 'a',
 'little',
 'bit',
 'of',
 'racial',
 'tension',
 'thrown',
 'in.<br',
 '/><br',
 '/>The',
 'problem',
 'is',
 'that',
 'even',
 ...]

In [60]:
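z_line = []  # binary masks: 1 keeps the word, 0 removes it
f_z = []     # classifier probabilities for each perturbed text
pi_x = []    # proximity of each perturbed text to the original
x_vec = vector.transform([example])[0]  # embed the original once, outside the loop
for i in range(N):
    # Remove between 1 and len(split) - 1 words, chosen uniformly at random
    n = np.random.choice(range(1, len(split)))
    indices = np.random.choice(range(len(split)), size=n, replace=False)
    perturbation = np.ones(len(split))
    perturbation[indices] = 0
    z_line.append(perturbation)
    text = ' '.join([word for (j, word) in enumerate(split) if perturbation[j]])
    f_z.append(f(text))
    pi_x.append(pi(x_vec, vector.transform([text])[0]))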

### Linear optimization
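
One possible way to turn the surrogate fit into a linear program is sketched below. It is an assumption of this sketch, not something the notebook has settled on: LIME's squared loss is quadratic, so the sketch swaps in weighted absolute deviations with an L1 penalty, which can be linearized with auxiliary variables `t` (residual bounds) and `u` (bounds on the absolute weights). The function name `fit_l1_surrogate`, the penalty `lam`, and the toy data are all hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def fit_l1_surrogate(Z, y, weights, lam=0.01):
    """Minimize sum_i weights_i * |y_i - Z_i.w - b| + lam * sum_j |w_j| as an LP.

    Hypothetical helper: the L1 loss replaces LIME's squared loss (assumption).
    """
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n, d = Z.shape
    # Variable layout: [w (d), b (1), t (n), u (d)]
    c = np.concatenate([np.zeros(d + 1), weights, lam * np.ones(d)])
    ones = np.ones((n, 1))
    I_n, I_d = np.eye(n), np.eye(d)
    A_ub = np.block([
        [-Z, -ones, -I_n, np.zeros((n, d))],               #  y - Zw - b <= t
        [Z, ones, -I_n, np.zeros((n, d))],                 # -(y - Zw - b) <= t
        [I_d, np.zeros((d, 1)), np.zeros((d, n)), -I_d],   #  w <= u
        [-I_d, np.zeros((d, 1)), np.zeros((d, n)), -I_d],  # -w <= u
    ])
    b_ub = np.concatenate([-y, y, np.zeros(2 * d)])
    # w and b are free; t and u are non-negative
    bounds = [(None, None)] * (d + 1) + [(0, None)] * (n + d)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:d], res.x[d]

# Toy check on noise-free synthetic data (not the IMDB samples)
rng = np.random.default_rng(0)
Z_toy = rng.integers(0, 2, size=(40, 5)).astype(float)
true_w = np.array([0.3, 0.0, -0.2, 0.0, 0.0])
y_toy = 0.5 + Z_toy @ true_w
w_hat, b_hat = fit_l1_surrogate(Z_toy, y_toy, np.ones(40), lam=1e-6)
print(np.round(w_hat, 4), round(float(b_hat), 4))
```

In the notebook's terms, `Z` would be `np.array(z_line)`, `y` one column of `np.array(f_z)` (for example the positive-class probability), and `weights` would be `np.array(pi_x)`. The sketch does not use `K`; limiting the explanation to `K` words would need an additional selection step.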

## References

- https://arxiv.org/pdf/1602.04938.pdf
- https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
- https://vanderbei.princeton.edu/tex/talks/MOPTA14/L1_reg.pdf