# Introduction
* This program classifies legal issues by predicting a binary value for each National Subject Matter Index (NSMI) code (https://nsmi.lsntap.org/browse-v2).
* "Category" refers to one of the 20 top-level indexes.
* "Class" refers to a subcategory under a category.

### Data
* The data contains 2777 labeled articles. For each legal class, an article has a binary value (0 or 1) that indicates whether the article is related to that class. Unlabeled entries are ignored when building a model.
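
As a rough illustration of the labeling scheme, the sketch below drops unlabeled rows for a single class before modeling. The rows and the `drop_unlabeled` helper are hypothetical and are not taken from the dataset; `None` stands in for a missing label.

```python
# Hypothetical mini-dataset: each row is (text, label) for one class.
# A label is 1 (related), 0 (not related), or None (unlabeled).
rows = [
    ("My landlord kept my deposit", 0),
    ("My brother was arrested last night", 1),
    ("How do I contest a will?", None),
    ("Police searched my car without a warrant", 1),
]


def drop_unlabeled(rows):
    """Keep only rows whose label is 0 or 1."""
    return [(text, label) for text, label in rows if label is not None]


labeled = drop_unlabeled(rows)
texts = [text for text, _ in labeled]
labels = [label for _, label in labeled]

print(len(rows), "rows before,", len(labeled), "rows after")
print(labels)
```

Because the set of unlabeled rows differs from class to class, each class ends up with its own training set size.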

### Implementation
* The program converts each article into TF-IDF features, or into a normalized sum of GloVe word vectors, and fits a logistic regression model from scikit-learn.

* After preprocessing the data, we evaluate the model with 10-fold cross-validation.
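
To make the evaluation protocol concrete, here is a minimal pure-Python sketch of how an unshuffled 10-fold split partitions the sample indices. `kfold_indices` is a hypothetical helper written for illustration; the notebook itself relies on scikit-learn's `KFold`, and the fold sizes below follow what I understand to be its documented behavior (the first `n_samples % n_splits` folds get one extra sample).

```python
def kfold_indices(n_samples, n_splits=10):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds.

    Fold sizes differ by at most one: the first n_samples % n_splits
    folds receive one extra sample.
    """
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(23, n_splits=10))
for train_idx, test_idx in folds[:3]:
    print(len(train_idx), test_idx)
```

Every sample lands in exactly one test fold, so the per-fold predictions can be stitched back into one prediction per sample and scored once.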

### Output
* We calculate accuracy for both categories (20) and classes (100+).

See the overall results at the bottom of this notebook.

# Data Preparation (DONE)

In [1]:
!pip3 install PrettyTable
!pip3 install pandas
!pip3 install scikit-learn
!pip3 install matplotlib
!pip3 install seaborn
!pip3 install tqdm
!pip3 install nltk
!python3 -m nltk.downloader stopwords punkt

import os
import sys
import pandas as pd
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
pd.options.display.max_rows = 100
pd.set_option('display.max_columns', None) 
print("DONE")

[nltk_data] Downloading package stopwords to /Users/heeh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/heeh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!




DONE


In [2]:
url = 'https://raw.githubusercontent.com/heeh/legal_issue_classification/master/2019-12-06_95p-confidence_binary.csv'
df = pd.read_csv(url)
df.info()
df.iloc[:,:4]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2777 entries, 0 to 2776
Columns: 109 entries, _id to WO-09-00-00-00
dtypes: float64(107), object(2)
memory usage: 2.3+ MB


Unnamed: 0,_id,full_text,BE-00-00-00-00,BE-01-00-00-00
0,5b60e59cda52255c20cff794,Will he serve time?. Long story short my broth...,0.0,0.0
1,5b60e59cda52255c20cff79a,Groundwater leaking out of street 24/7. Ground...,0.0,0.0
2,5b60e59cda52255c20cff7a0,How do I get my mom's license taken away. My m...,0.0,0.0
3,5b60e59cda52255c20cff7bf,My boss hasn't paid me. What do i do?. I work ...,,
4,5b60e59cda52255c20cff7b8,"[Texas] I signed a non-compete contract, but t...",0.0,0.0
...,...,...,...,...
2772,5b60e66dda52255c20df433f,Do you and your parents get deported because o...,0.0,0.0
2773,5b60e66dda52255c20df43ae,Wondering the legality of a minor (me) being a...,0.0,0.0
2774,5b60e66dda52255c20df4462,Can I sue a billion dollar company in small cl...,0.0,0.0
2775,5b60e66dda52255c20df4448,Sued by creditor and currently in settlement n...,0.0,0.0


## Null and Rowsum Check

In [3]:

#Null Check
#df.isnull().sum()

# Class Check - Remove every column that has zero sum. 
df = df.loc[:, df.sum(axis=0, skipna=True) != 0]
temp = df.sum(axis = 0, skipna = True)


#df.info()


# Remove columns that have fewer than 10 positive labels

In [4]:
oldCols = list(df.columns)
print(len(oldCols))

newCols = []
for i,x in temp[2:].items():
    if x >= 10:
        newCols.append(i)
        
print(len(newCols))

cols = oldCols[:2] + newCols

print(cols)
print(len(cols))

df = df[cols]

df.sum(axis=0, skipna=True)

88
36
['_id', 'full_text', 'BE-00-00-00-00', 'BU-00-00-00-00', 'CO-00-00-00-00', 'CR-00-00-00-00', 'CR-01-00-00-00', 'CR-04-00-00-00', 'CR-06-00-00-00', 'CR-10-00-00-00', 'ED-00-00-00-00', 'ES-00-00-00-00', 'ES-01-00-00-00', 'ES-03-00-00-00', 'FA-00-00-00-00', 'FA-05-00-00-00', 'FA-06-00-00-00', 'FA-07-00-00-00', 'GO-00-00-00-00', 'HE-00-00-00-00', 'HO-00-00-00-00', 'HO-06-00-00-00', 'HO-09-00-00-00', 'IM-00-00-00-00', 'MO-00-00-00-00', 'MO-02-00-00-00', 'MO-07-00-00-00', 'MO-10-00-00-00', 'RI-00-00-00-00', 'TO-00-00-00-00', 'TR-00-00-00-00', 'TR-01-00-00-00', 'TR-02-00-00-00', 'TR-03-00-00-00', 'TR-04-00-00-00', 'TR-05-00-00-00', 'WO-00-00-00-00', 'WO-03-00-00-00']
38


_id               5b60e59cda52255c20cff7945b60e59cda52255c20cff7...
full_text         Will he serve time?. Long story short my broth...
BE-00-00-00-00                                                   27
BU-00-00-00-00                                                   93
CO-00-00-00-00                                                  106
CR-00-00-00-00                                                  302
CR-01-00-00-00                                                   12
CR-04-00-00-00                                                   13
CR-06-00-00-00                                                   11
CR-10-00-00-00                                                   11
ED-00-00-00-00                                                   24
ES-00-00-00-00                                                   78
ES-01-00-00-00                                                   10
ES-03-00-00-00                                                   13
FA-00-00-00-00                                  

# Tiny Example: Crime and Prison (CR-00-00-00-00)


## Preprocessing (DONE)

In [5]:
from collections import defaultdict
verbose = True
def preprocessing(dfset: defaultdict, cls: str):
    dfset[cls] = df.loc[:, ['_id', 'full_text', cls]]
    labels = dfset[cls].iloc[:,2]
    if verbose:
        print("------------Before dropping nan----------------------------------------")
        print(dfset[cls].iloc[:,1:])
        print(labels.value_counts(dropna=False))
    
    dfset[cls] = dfset[cls].dropna()
    labels = dfset[cls].iloc[:,2]
    if verbose:
        print("\n------------After dropping nan---------------------------------------")
        print(dfset[cls].iloc[:,1:])
        print(labels.value_counts(dropna=False))



## Data Preparation

In [6]:

cls = 'CR-00-00-00-00'
dfset = defaultdict() 
preprocessing(dfset, cls)    

#    model[cls] = make_pipeline(TfidfVectorizer(), MultinomialNB())
tinydf = dfset[cls]
X = tinydf['full_text'].values
Y = tinydf[cls].values


------------Before dropping nan----------------------------------------
                                              full_text  CR-00-00-00-00
0     Will he serve time?. Long story short my broth...             1.0
1     Groundwater leaking out of street 24/7. Ground...             0.0
2     How do I get my mom's license taken away. My m...             NaN
3     My boss hasn't paid me. What do i do?. I work ...             0.0
4     [Texas] I signed a non-compete contract, but t...             0.0
...                                                 ...             ...
2772  Do you and your parents get deported because o...             1.0
2773  Wondering the legality of a minor (me) being a...             0.0
2774  Can I sue a billion dollar company in small cl...             0.0
2775  Sued by creditor and currently in settlement n...             0.0
2776  (CA) Sales job. Income based on performance. A...             0.0

[2777 rows x 2 columns]
0.0    1377
NaN    1098
1.0     302
Nam

## TF-IDF using stopwords, ngram, and C value

In [7]:
import sys
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support as score
p = len(X) // 10 * 9
#tfidf_vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=300 )
tfidf_vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.3)
tfidf_vect.fit(X[0:p])
X_train_tfidf_vect = tfidf_vect.transform(X[0:p])
X_test_tfidf_vect = tfidf_vect.transform(X[p:])

model = LogisticRegression(penalty='l1', solver='liblinear', class_weight='balanced')
model.fit(X_train_tfidf_vect, Y[0:p])
preds = model.predict(X_test_tfidf_vect)
precision, recall, fscore, support = score(Y[p:], preds, average='binary')
print('accuracy : {0:.4f}'.format(accuracy_score(Y[p:], preds)))
print('precision: {0:.4f}'.format(precision))
print('recall   : {0:.4f}'.format(recall))
print('fscore   : {0:.4f}'.format(fscore))


from sklearn.model_selection import GridSearchCV

params = {'C':[0.01, 0.1, 1, 5, 10]}
grid_cv_lr = GridSearchCV(model,param_grid=params, cv=3, scoring='accuracy', verbose=1)
grid_cv_lr.fit(X_train_tfidf_vect, Y[:p])
print(grid_cv_lr.best_params_)

print("TF-IDF Dimension: ", len(tfidf_vect.vocabulary_))


accuracy : 0.9148
precision: 0.6333
recall   : 0.8261
fscore   : 0.7170
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


{'C': 10}
TF-IDF Dimension:  127258


[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    1.0s finished
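
For readers unfamiliar with TF-IDF, the sketch below computes the weights by hand on three made-up sentences. It assumes raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which is my understanding of the `TfidfVectorizer` defaults. It deliberately omits the stop-word removal, bigrams, and `max_df` cutoff used in the cell above, and it tokenizes on whitespace only. The `tfidf` function and the documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1,
    and L2 normalization of each document vector.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {
            term: count * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


docs = [
    "police arrested my brother",
    "landlord kept my deposit",
    "police searched my car",
]
vecs = tfidf(docs)
for term, weight in sorted(vecs[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.4f}")
```

Terms that occur in fewer documents receive larger weights, so "arrested" outweighs "police", which in turn outweighs "my".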


## TF-IDF Train & Predict

In [8]:
from sklearn.pipeline import Pipeline

model = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.3)),
    ('lr_clf', LogisticRegression(penalty='l1', solver='liblinear',class_weight='balanced'))
])
p = len(X) // 10 * 9

model.fit(X[0:p], Y[0:p])
preds = model.predict(X[p:])
print(preds)
accuracy = accuracy_score(Y[p:], preds)
precision, recall, fscore, support = score(Y[p:], preds)
print('accuracy: {}'.format(accuracy))
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))

[0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 1. 0. 1. 0. 0. 0. 0.]
accuracy: 0.9147727272727273
precision: [0.97260274 0.63333333]
recall: [0.92810458 0.82608696]
fscore: [0.94983278 0.71698113]
support: [153  23]
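
The classifier above is created with `class_weight='balanced'`. To my understanding of the scikit-learn documentation, that setting weights each class by `n_samples / (n_classes * count)`. The sketch below reproduces that arithmetic using the full CR-00-00-00-00 label counts printed earlier (1377 negatives, 302 positives). The cell above fits on only the first 90% of rows, so its actual weights would differ slightly. `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Label counts for CR-00-00-00-00 after dropping unlabeled rows
labels = [0] * 1377 + [1] * 302
weights = balanced_class_weights(labels)
print({cls: round(w, 4) for cls, w in weights.items()})
```

With these weights, the total weight of the positive examples equals the total weight of the negative examples, which pushes the model toward higher recall on the rare class at some cost in precision.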


## TF-IDF + Logistic Regression on CR-00-00-00-00

In [9]:
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.linear_model import LogisticRegression

verbose = False
    
numdoc = defaultdict()

model = LogisticRegression(penalty='l1', solver='liblinear', class_weight='balanced')


def predict_by_class_tfidf(dfset: defaultdict, cls: str) -> tuple:
    preprocessing(dfset, cls)
    tinydf = dfset[cls]
    X = tinydf['full_text'].values
    Y = tinydf[cls].values
    

    print('------------------------------------\n')
    labels = dfset[cls].iloc[:,2]
    print(labels.value_counts(dropna=False))

    # 10-fold split into train and test sets (unshuffled)
    kfold = KFold(n_splits=10)
    print('data set size', len(X))
    numdoc[cls] = len(X)
    n_iter = 0
    acc_list = []
    pre_list = []
    rec_list = []
    fsc_list = []
    sup_list = []

    
    preds = [0] * len(Y)

    for train_index, test_index in kfold.split(X, Y):
        X_train, X_test = X[train_index], X[test_index] 
        Y_train, Y_test = Y[train_index], Y[test_index] 
        
        tfidf_vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.3)
        tfidf_vect.fit(X_train)
        X_train_tfidf = tfidf_vect.transform(X_train)
        X_test_tfidf = tfidf_vect.transform(X_test)
        
        model.fit(X_train_tfidf, Y_train)
        out = model.predict(X_test_tfidf)

        #print(len(out))
        i = 0
        for x in test_index:
            preds[x] = out[i]
            i += 1
            
        #print(preds)

        n_iter += 1
    accuracy = accuracy_score(Y, preds)
    precision, recall, fscore, support = score(Y, preds)
    # accuracy: (tp + tn) / (p + n)
    # precision tp / (tp + fp)
    # recall: tp / (tp + fn)
    # f1: 2 tp / (2 tp + fp + fn)
    accuracy = np.round(accuracy, 4)
    precision[1] = np.round(precision[1], 4)
    recall[1] = np.round(recall[1], 4)
    fscore[1] = np.round(fscore[1], 4)
    support[1] = np.round(support[1], 4)
    
    
    return (accuracy, precision[1], recall[1], fscore[1], support[1])



cls = 'CR-00-00-00-00'


predict_by_class_tfidf(dfset, cls)

------------------------------------

0.0    1377
1.0     302
Name: CR-00-00-00-00, dtype: int64
data set size 1679


(0.8886, 0.6667, 0.7616, 0.711, 302)
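
The tuple above is (accuracy, precision, recall, F1, support) for the positive class. As a worked check of the formulas in the code comments, the sketch below recomputes the four scores from confusion counts. The counts are not printed by the notebook; they were back-calculated from the rounded metrics and the class counts (1377 negatives, 302 positives), so treat them as an assumption. `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, recall, f1


# Assumed counts, reconstructed from the reported figures:
# tp + fn = 302 positives, fp + tn = 1377 negatives
tp, fp, fn, tn = 230, 115, 72, 1262
metrics = tuple(round(m, 4) for m in binary_metrics(tp, fp, fn, tn))
print(metrics)
```

Under these counts the model finds 230 of the 302 positive articles while raising 115 false alarms, which is why recall exceeds precision.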

## Download and Load GloVe

In [10]:
#!wget http://nlp.stanford.edu/data/glove.6B.zip
#!unzip glove.6B.zip
from tqdm import tqdm

embeddings_index = {}
# Use a context manager so the file is closed even if parsing fails
with open('glove.6B.300d.txt', encoding="utf8") as f:
    for line in tqdm(f):
        values = line.split()
        word = values[0]
        try:
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
        except ValueError:
            # Skip lines whose values cannot be parsed as floats
            pass
print('Found %s word vectors.' % len(embeddings_index))

400000it [00:24, 16185.07it/s]

Found 400000 word vectors.
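
Each line of the GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses one such line. The sample line is made up and has only 4 dimensions, whereas `glove.6B.300d.txt` carries 300 values per word; `parse_glove_line` is a hypothetical helper.

```python
def parse_glove_line(line):
    """Split one line of a GloVe text file into (word, vector)."""
    parts = line.rstrip().split(" ")
    word = parts[0]
    vector = [float(x) for x in parts[1:]]
    return word, vector


# Hypothetical 4-dimensional entry for illustration only
sample = "court 0.12 -0.48 0.33 0.05\n"
word, vector = parse_glove_line(sample)
print(word, len(vector), vector)
```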





# GloVe

## GloVe Train & Predict

In [11]:
cls = 'CR-00-00-00-00'

import numpy as np
import pandas as pd


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack



from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk import punkt
stop_words = stopwords.words('english')

# Train and Test Split
p = len(X) // 10 * 9
train_text = X[:p]
test_text = X[p:]


print("Checkpoint1 - Data Read Complete")


hit = 0
all_words = 0
# this function creates a normalized vector for the whole sentence
def sent2vec(s):
    global hit, all_words
    words = str(s).lower()
    words = word_tokenize(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(embeddings_index[w])
        except KeyError:
            # Word has no GloVe vector; skip it
            continue
    hit += len(M)
    all_words += len(words)
    
    # No known words: fall back to a zero vector
    if not M:
        return np.zeros(300)
    v = np.array(M).sum(axis=0)
    return v / np.sqrt((v ** 2).sum())

# create sentence vectors using the above function for training and validation set
xtrain_glove = [sent2vec(x) for x in tqdm(train_text)]

print('Mean Train Word Hit Rate(\%)', hit / all_words * 100)
hit = 0
all_words = 0

xtest_glove = [sent2vec(x) for x in tqdm(test_text)]
print('Mean Test Word Hit Rate(\%)', hit / all_words * 100)

print('Checkpoint2 -Normalized Vector for Sentences are created')

xtrain_glove = np.array(xtrain_glove)
xtest_glove = np.array(xtest_glove)

model = LogisticRegression(penalty='l1', solver='liblinear', class_weight='balanced')

train_target = Y[:p]
test_target = Y[p:]

model.fit(xtrain_glove, train_target)
preds = model.predict(xtest_glove)

print(preds)
accuracy = accuracy_score(test_target, preds)
precision, recall, fscore, support = score(test_target, preds)
print('accuracy: {}'.format(accuracy))
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))



params = {'C': [1,10,20,30,40]}
grid_cv_lr = GridSearchCV(model,param_grid=params, cv=3, scoring='accuracy', verbose=1)
grid_cv_lr.fit(xtrain_glove, train_target)
print(grid_cv_lr.best_params_)

  0%|          | 0/1503 [00:00<?, ?it/s]

Checkpoint1 - Data Read Complete


100%|██████████| 1503/1503 [00:03<00:00, 488.96it/s]
 26%|██▌       | 46/176 [00:00<00:00, 459.44it/s]

Mean Train Word Hit Rate(\%) 99.58181372991224


100%|██████████| 176/176 [00:00<00:00, 529.58it/s]


Mean Test Word Hit Rate(\%) 99.61077662227088
Checkpoint2 -Normalized Vector for Sentences are created
[1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1.
 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 1. 0. 0. 0. 0.]
accuracy: 0.9034090909090909
precision: [0.99275362 0.57894737]
recall: [0.89542484 0.95652174]
fscore: [0.94158076 0.72131148]
support: [153  23]
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:   42.4s finished


{'C': 20}
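
The sentence representation used above is the sum of the word vectors of all known tokens, scaled to unit length. The pure-Python sketch below shows the same idea on toy 3-dimensional embeddings. The vocabulary, the vectors, and the `sentence_vector` helper are hypothetical. Unlike `sent2vec`, the sketch also guards against a zero-length sum.

```python
import math

# Hypothetical 3-dimensional embeddings standing in for the 300-d GloVe vectors
toy_embeddings = {
    "police": [1.0, 0.0, 2.0],
    "arrest": [0.0, 2.0, 2.0],
}


def sentence_vector(tokens, embeddings, dim=3):
    """Sum the vectors of known tokens and scale the result to unit length."""
    total = [0.0] * dim
    hits = 0
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            continue  # out-of-vocabulary tokens are skipped
        hits += 1
        total = [a + b for a, b in zip(total, vec)]
    norm = math.sqrt(sum(x * x for x in total))
    if hits == 0 or norm == 0.0:
        return [0.0] * dim
    return [x / norm for x in total]


vec = sentence_vector(["police", "made", "arrest"], toy_embeddings)
print([round(x, 4) for x in vec])
```

Because of the normalization, a long post and a short post about the same topic map to vectors of the same length, so only the direction of the vector carries information.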


## GloVe + Custom Logistic Regression on CR-00-00-00-00

In [12]:
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer

verbose = False
    
numdoc = defaultdict()

classifier = defaultdict()

    
def predict_by_class_glove(dfset: defaultdict, cls: str) -> tuple:
    preprocessing(dfset, cls)
    tinydf = dfset[cls]
    X = tinydf['full_text'].values
    Y = tinydf[cls].values
    X_glove = [sent2vec(x) for x in tqdm(X)]
    X_glove = np.array(X_glove)

    print('------------------------------------\n')
    labels = dfset[cls].iloc[:,2]
    print(labels.value_counts(dropna=False))

    # 10-fold split into train and test sets (unshuffled)
    kfold = KFold(n_splits=10)
    print('data set size', len(X))
    numdoc[cls] = len(X)
    n_iter = 0
    acc_list = []
    pre_list = []
    rec_list = []
    fsc_list = []
    sup_list = []

    
    preds = [0] * len(Y)

    print('Checkpoint2 -Normalized Vector for Sentences are created')
    


    for train_index, test_index in kfold.split(X_glove, Y):

        X_train, X_test = X_glove[train_index], X_glove[test_index] 
        Y_train, Y_test = Y[train_index], Y[test_index]

        # Scikit-Learn
        classifier[cls] = LogisticRegression(penalty='l1', solver='liblinear', class_weight='balanced')
        classifier[cls].fit(X_train, Y_train)
        out = classifier[cls].predict(X_test)

        #print(len(out))
        i = 0
        for x in test_index:
            preds[x] = out[i]
            i += 1
            
        #print(preds)

        n_iter += 1
    accuracy = accuracy_score(Y, preds)
    precision, recall, fscore, support = score(Y, preds)
    # accuracy: (tp + tn) / (p + n)
    # precision tp / (tp + fp)
    # recall:   tp / (tp + fn)
    # f1: 2 tp / (2 tp + fp + fn)
    accuracy = np.round(accuracy, 4)
    precision[1] = np.round(precision[1], 4)
    recall[1] = np.round(recall[1], 4)
    fscore[1] = np.round(fscore[1], 4)
    support[1] = np.round(support[1], 4)
    
    
    return (accuracy, precision[1], recall[1], fscore[1], support[1])



#cls = 'BE-00-00-00-00'
cls = 'CR-00-00-00-00'
predict_by_class_glove(dfset, cls)

100%|██████████| 1679/1679 [00:03<00:00, 525.00it/s]


------------------------------------

0.0    1377
1.0     302
Name: CR-00-00-00-00, dtype: int64
data set size 1679
Checkpoint2 -Normalized Vector for Sentences are created


(0.8785, 0.6219, 0.8278, 0.7102, 302)

# Entire Data 

## Build Models and Calculating Accuracies

In [None]:
import warnings
import sklearn.exceptions
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)

class_list = df.columns[2:]
print(class_list)
verbose = False
stat_dict = defaultdict() 
for cls in class_list:
    preprocessing(dfset, cls)
    ret = predict_by_class_glove(dfset, cls)
    stat_dict[cls] = ret 
    print('statistics' , ret)

  3%|▎         | 56/1848 [00:00<00:03, 531.48it/s]

Index(['BE-00-00-00-00', 'BU-00-00-00-00', 'CO-00-00-00-00', 'CR-00-00-00-00',
       'CR-01-00-00-00', 'CR-04-00-00-00', 'CR-06-00-00-00', 'CR-10-00-00-00',
       'ED-00-00-00-00', 'ES-00-00-00-00', 'ES-01-00-00-00', 'ES-03-00-00-00',
       'FA-00-00-00-00', 'FA-05-00-00-00', 'FA-06-00-00-00', 'FA-07-00-00-00',
       'GO-00-00-00-00', 'HE-00-00-00-00', 'HO-00-00-00-00', 'HO-06-00-00-00',
       'HO-09-00-00-00', 'IM-00-00-00-00', 'MO-00-00-00-00', 'MO-02-00-00-00',
       'MO-07-00-00-00', 'MO-10-00-00-00', 'RI-00-00-00-00', 'TO-00-00-00-00',
       'TR-00-00-00-00', 'TR-01-00-00-00', 'TR-02-00-00-00', 'TR-03-00-00-00',
       'TR-04-00-00-00', 'TR-05-00-00-00', 'WO-00-00-00-00', 'WO-03-00-00-00'],
      dtype='object')


100%|██████████| 1848/1848 [00:03<00:00, 517.03it/s]


------------------------------------

0.0    1821
1.0      27
Name: BE-00-00-00-00, dtype: int64
data set size 1848
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 45/1590 [00:00<00:03, 447.04it/s]

statistics (0.9226, 0.1329, 0.7778, 0.227, 27)


100%|██████████| 1590/1590 [00:03<00:00, 515.95it/s]


------------------------------------

0.0    1497
1.0      93
Name: BU-00-00-00-00, dtype: int64
data set size 1590
Checkpoint2 -Normalized Vector for Sentences are created


  4%|▍         | 48/1164 [00:00<00:02, 478.29it/s]

statistics (0.9377, 0.483, 0.914, 0.632, 93)


100%|██████████| 1164/1164 [00:02<00:00, 509.88it/s]


------------------------------------

0.0    1058
1.0     106
Name: CO-00-00-00-00, dtype: int64
data set size 1164
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 49/1679 [00:00<00:03, 466.49it/s]

statistics (0.896, 0.4615, 0.8491, 0.598, 106)


100%|██████████| 1679/1679 [00:03<00:00, 498.13it/s]


------------------------------------

0.0    1377
1.0     302
Name: CR-00-00-00-00, dtype: int64
data set size 1679
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 44/1393 [00:00<00:03, 376.77it/s]

statistics (0.8785, 0.6219, 0.8278, 0.7102, 302)


100%|██████████| 1393/1393 [00:02<00:00, 518.08it/s]


------------------------------------

0.0    1381
1.0      12
Name: CR-01-00-00-00, dtype: int64
data set size 1393
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 44/1402 [00:00<00:03, 401.86it/s]

statistics (0.9088, 0.0248, 0.25, 0.0451, 12)


100%|██████████| 1402/1402 [00:02<00:00, 524.46it/s]


------------------------------------

0.0    1389
1.0      13
Name: CR-04-00-00-00, dtype: int64
data set size 1402
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 44/1404 [00:00<00:03, 385.41it/s]

statistics (0.9058, 0.0458, 0.4615, 0.0833, 13)


100%|██████████| 1404/1404 [00:02<00:00, 516.29it/s]


------------------------------------

0.0    1393
1.0      11
Name: CR-06-00-00-00, dtype: int64
data set size 1404
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 53/1969 [00:00<00:03, 516.24it/s]

statistics (0.9366, 0.0568, 0.4545, 0.101, 11)


100%|██████████| 1969/1969 [00:03<00:00, 503.69it/s]


------------------------------------

0.0    1958
1.0      11
Name: CR-10-00-00-00, dtype: int64
data set size 1969
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 49/1813 [00:00<00:03, 443.14it/s]

statistics (0.9335, 0.0082, 0.0909, 0.015, 11)


100%|██████████| 1813/1813 [00:03<00:00, 516.87it/s]


------------------------------------

0.0    1789
1.0      24
Name: ED-00-00-00-00, dtype: int64
data set size 1813
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 57/1944 [00:00<00:03, 565.71it/s]

statistics (0.968, 0.2571, 0.75, 0.383, 24)


100%|██████████| 1944/1944 [00:03<00:00, 529.52it/s]


------------------------------------

0.0    1866
1.0      78
Name: ES-00-00-00-00, dtype: int64
data set size 1944
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 56/1876 [00:00<00:03, 555.73it/s]

statistics (0.9208, 0.3155, 0.8333, 0.4577, 78)


100%|██████████| 1876/1876 [00:03<00:00, 526.83it/s]


------------------------------------

0.0    1866
1.0      10
Name: ES-01-00-00-00, dtype: int64
data set size 1876
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 54/1992 [00:00<00:03, 539.52it/s]

statistics (0.9403, 0.0278, 0.3, 0.0508, 10)


100%|██████████| 1992/1992 [00:03<00:00, 524.75it/s]


------------------------------------

0.0    1979
1.0      13
Name: ES-03-00-00-00, dtype: int64
data set size 1992
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 53/2042 [00:00<00:03, 524.44it/s]

statistics (0.9342, 0.053, 0.5385, 0.0966, 13)


100%|██████████| 2042/2042 [00:03<00:00, 520.50it/s]


------------------------------------

0.0    1685
1.0     357
Name: FA-00-00-00-00, dtype: int64
data set size 2042
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 56/2011 [00:00<00:03, 551.62it/s]

statistics (0.9119, 0.6928, 0.8908, 0.7794, 357)


100%|██████████| 2011/2011 [00:03<00:00, 529.17it/s]


------------------------------------

0.0    2001
1.0      10
Name: FA-05-00-00-00, dtype: int64
data set size 2011
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 56/1791 [00:00<00:03, 539.99it/s]

statistics (0.9423, 0.0508, 0.6, 0.0938, 10)


100%|██████████| 1791/1791 [00:03<00:00, 529.22it/s]


------------------------------------

0.0    1781
1.0      10
Name: FA-06-00-00-00, dtype: int64
data set size 1791
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 53/1968 [00:00<00:03, 514.27it/s]

statistics (0.9324, 0.0336, 0.4, 0.062, 10)


100%|██████████| 1968/1968 [00:03<00:00, 533.35it/s]


------------------------------------

0.0    1927
1.0      41
Name: FA-07-00-00-00, dtype: int64
data set size 1968
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 42/1517 [00:00<00:03, 417.28it/s]

statistics (0.8786, 0.1133, 0.7073, 0.1953, 41)


100%|██████████| 1517/1517 [00:03<00:00, 496.33it/s]


------------------------------------

0.0    1504
1.0      13
Name: GO-00-00-00-00, dtype: int64
data set size 1517
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 55/1900 [00:00<00:03, 546.34it/s]

statistics (0.9242, 0.0364, 0.3077, 0.065, 13)


100%|██████████| 1900/1900 [00:03<00:00, 521.18it/s]


------------------------------------

0.0    1778
1.0     122
Name: HE-00-00-00-00, dtype: int64
data set size 1900
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 57/2132 [00:00<00:03, 566.89it/s]

statistics (0.9605, 0.6343, 0.9098, 0.7475, 122)


100%|██████████| 2132/2132 [00:04<00:00, 498.94it/s]


------------------------------------

0.0    1582
1.0     550
Name: HO-00-00-00-00, dtype: int64
data set size 2132
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 57/1662 [00:00<00:02, 563.28it/s]

statistics (0.9536, 0.902, 0.92, 0.9109, 550)


100%|██████████| 1662/1662 [00:03<00:00, 546.87it/s]


------------------------------------

0.0    1628
1.0      34
Name: HO-06-00-00-00, dtype: int64
data set size 1662
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 56/1653 [00:00<00:02, 559.18it/s]

statistics (0.9296, 0.1732, 0.6471, 0.2733, 34)


100%|██████████| 1653/1653 [00:02<00:00, 554.09it/s]


------------------------------------

0.0    1626
1.0      27
Name: HO-09-00-00-00, dtype: int64
data set size 1653
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 56/1964 [00:00<00:03, 556.38it/s]

statistics (0.9546, 0.2333, 0.7778, 0.359, 27)


100%|██████████| 1964/1964 [00:03<00:00, 520.18it/s]


------------------------------------

0.0    1928
1.0      36
Name: IM-00-00-00-00, dtype: int64
data set size 1964
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 46/1429 [00:00<00:03, 448.22it/s]

statistics (0.9684, 0.3523, 0.8611, 0.5, 36)


100%|██████████| 1429/1429 [00:02<00:00, 539.74it/s]


------------------------------------

0.0    1063
1.0     366
Name: MO-00-00-00-00, dtype: int64
data set size 1429
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 56/1949 [00:00<00:03, 552.32it/s]

statistics (0.8509, 0.6727, 0.8142, 0.7367, 366)


100%|██████████| 1949/1949 [00:03<00:00, 536.37it/s]


------------------------------------

0.0    1937
1.0      12
Name: MO-02-00-00-00, dtype: int64
data set size 1949
Checkpoint2 -Normalized Vector for Sentences are created


  4%|▍         | 50/1129 [00:00<00:02, 495.16it/s]

statistics (0.9477, 0.05, 0.4167, 0.0893, 12)


100%|██████████| 1129/1129 [00:02<00:00, 548.26it/s]


------------------------------------

0.0    1116
1.0      13
Name: MO-07-00-00-00, dtype: int64
data set size 1129
Checkpoint2 -Normalized Vector for Sentences are created


  4%|▍         | 49/1106 [00:00<00:02, 488.83it/s]

statistics (0.9336, 0.0921, 0.5385, 0.1573, 13)


100%|██████████| 1106/1106 [00:01<00:00, 560.83it/s]


------------------------------------

0.0    1095
1.0      11
Name: MO-10-00-00-00, dtype: int64
data set size 1106
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 45/1396 [00:00<00:03, 447.66it/s]

statistics (0.9168, 0.0449, 0.3636, 0.08, 11)


100%|██████████| 1396/1396 [00:02<00:00, 537.36it/s]


------------------------------------

0.0    1374
1.0      22
Name: RI-00-00-00-00, dtype: int64
data set size 1396
Checkpoint2 -Normalized Vector for Sentences are created


  4%|▎         | 47/1257 [00:00<00:02, 468.04it/s]

statistics (0.861, 0.0275, 0.2273, 0.049, 22)


100%|██████████| 1257/1257 [00:02<00:00, 575.53it/s]


------------------------------------

0.0    1027
1.0     230
Name: TO-00-00-00-00, dtype: int64
data set size 1257
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 57/2006 [00:00<00:03, 552.25it/s]

statistics (0.8536, 0.565, 0.8696, 0.6849, 230)


100%|██████████| 2006/2006 [00:03<00:00, 523.53it/s]


------------------------------------

0.0    1746
1.0     260
Name: TR-00-00-00-00, dtype: int64
data set size 2006
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 54/1827 [00:00<00:03, 534.89it/s]

statistics (0.9482, 0.7407, 0.9231, 0.8219, 260)


100%|██████████| 1827/1827 [00:03<00:00, 514.36it/s]


------------------------------------

0.0    1805
1.0      22
Name: TR-01-00-00-00, dtype: int64
data set size 1827
Checkpoint2 -Normalized Vector for Sentences are created


  3%|▎         | 51/1819 [00:00<00:03, 495.29it/s]

statistics (0.9507, 0.1667, 0.7727, 0.2742, 22)


100%|██████████| 1819/1819 [00:03<00:00, 515.10it/s]


------------------------------------

0.0    1790
1.0      29
Name: TR-02-00-00-00, dtype: int64
data set size 1819
Checkpoint2 -Normalized Vector for Sentences are created


## Distribution

In [None]:
!python3 -m pip install prettytable
from prettytable import PrettyTable
t = PrettyTable(["class", "accuracy", "precision", "recall", "F1 score", "support", "|documents|"])
#t.align["class"] = "r"
t.align["accuracy"] = "r"
t.align["precision"] = "r"
t.align["recall"] = "r"
t.align["F1 score"] = "r"
t.align["support"] = "r"
for k,v in stat_dict.items():
    t.add_row([k, v[0], v[1], v[2], v[3], v[4], numdoc[k]])
    
print(t)

## Plotting for Top10 classes

In [None]:
#for k,v in accuracy_dict.items():
#    print(k,v)
#sys.exit()
import collections
import matplotlib.pyplot as plt
import numpy as np

# Fixing random state for reproducibility
np.random.seed(19680801)


plt.rcdefaults()
fig, ax = plt.subplots()

# Example data
y_pos = np.arange(10)
error = 0 

recall_dict = defaultdict()
for k,v in stat_dict.items():
    recall_dict[k] = v[2]

sorted_x = sorted(recall_dict.items(), key=lambda kv: kv[1], reverse=True)
topcat_dict = collections.OrderedDict(sorted_x)
#print(topcat_dict)

keyList = []
valList = []
for kv in topcat_dict.items():
    keyList.append(kv[0])
    valList.append(kv[1])

ax.barh(y_pos[:10], valList[:10], xerr=error, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(keyList[:10])
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Recall')
ax.set_title('Top10 Recall Scores')

plt.show()

## Bottom 10

In [None]:
#for k,v in accuracy_dict.items():
#    print(k,v)
#sys.exit()
import collections
import matplotlib.pyplot as plt
import numpy as np

# Fixing random state for reproducibility
np.random.seed(19680801)


plt.rcdefaults()
fig, ax = plt.subplots()

# Example data
y_pos = np.arange(10)
error = 0 

recall_dict = defaultdict()
for k,v in stat_dict.items():
    recall_dict[k] = v[2]
    
sorted_x = sorted(recall_dict.items(), key=lambda kv: kv[1])
topcat_dict = collections.OrderedDict(sorted_x)
#print(topcat_dict)

keyList = []
valList = []
for kv in topcat_dict.items():
    keyList.append(kv[0])
    valList.append(kv[1])




ax.barh(y_pos[:10], valList[:10], xerr=error, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(keyList[:10])
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Recall')
ax.set_title('Bottom 10 Recall Scores')

plt.show()

## Recall Distribution

In [None]:
import pylab as pl

recall_dict = defaultdict()
for k,v in stat_dict.items():
    recall_dict[k] = v[2]


recall_list = []
for k,v in recall_dict.items():
    recall_list.append(v*100)
   
d = {'Recall': recall_list}
tinydf = pd.DataFrame(data=d)



hist = tinydf.hist(edgecolor='black', bins = [0,10,20,30,40,50,60,70,80,90,100])
pl.title("Recall Distribution")
pl.xlabel("Recall Score(%)")
pl.ylabel("Number of Classes")
print(tinydf)

# Custom Input Prediction

In [None]:
text = "How do I get my mom's license taken away. My mom is 66, on disability for multiple sclerosis. She's been unable to work for about a decade. She has cataracts. She has neuropathy. She has 0 reaction time. She has had a fender bender on every single corner of her last car, which my brother then totaled. She also has no night vision. She also falls asleep all the time. ALMOST like like narcolepsy. It's mostly her overextending herself, but she will nod off driving or sleep in parking lots til she feels ok. She also has lymphedema in her legs which are swollen enough to impede driving. The last year she was driving she received 19 red light tickets. She agreed not to drive. And the insurance paid for her car. Now she's bought a new one, about 6 months later. Our relationship is terrible. I hate her. But I want her licence taken away before she kills or cripples someone(s). I'm no contact with her, but my brother still tries and he cares about this a lot. I've spoken with the DMV IN MY state, not very helpful. Can i contact her insurance? Do I contact the police? Has this happened to anyone"

print(text)
# Avoid shadowing the built-in input(); reshape into a single-row batch
input_vec = np.array(sent2vec(text)).reshape(1, -1)

predictions = defaultdict()
prob = defaultdict()
for cls in class_list:
    predictions[cls] = classifier[cls].predict(input_vec)
    prob[cls] = classifier[cls].predict_proba(input_vec)
for k, v in predictions.items():
    # v is a one-element array holding the predicted label
    if v[0] > 0:
        print(k, v, end=' ')
        print(np.round(prob[k][0][1], 4))
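
The cell above prints every class whose predicted label is positive, together with its positive-class probability. For a binary logistic regression, a positive prediction corresponds to a positive-class probability above 0.5, so the same report can be produced by thresholding and sorting the probabilities. The sketch below shows that step in isolation. The class names and probabilities are placeholders, not model output, and `positive_classes` is a hypothetical helper.

```python
# Placeholder probabilities; real values would come from predict_proba
probabilities = {"CLASS-A": 0.91, "CLASS-B": 0.12, "CLASS-C": 0.64}


def positive_classes(probabilities, threshold=0.5):
    """Return (class, probability) pairs above the threshold, highest first."""
    hits = [(cls, p) for cls, p in probabilities.items() if p > threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


for cls, p in positive_classes(probabilities):
    print(cls, round(p, 4))
```

Raising the threshold trades recall for precision, which may be worth exploring for the rare classes whose precision is low in the results above.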
