### Introduction
* This program classifies legal issues, producing a binary value for each National Subject Matter Index (NSMI) code. (https://nsmi.lsntap.org/browse-v2) \\
"Category" means one of the 20 top-level indexes. \\
"Class" means 

### Data
* The data contains 2777 articles. For each legal class, an article carries a binary value (0 or 1) that indicates whether it is related to that class. Entries that are unlabeled for a class are ignored when constructing the model for that class.
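
As a minimal sketch of how unlabeled entries can be ignored per class, the snippet below filters a toy table. The rows and the helper name `labeled_subset` are hypothetical; only the column naming (`full_text` plus one column per class code) follows this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: NaN marks an article that is unlabeled for that class.
toy = pd.DataFrame({
    "full_text": ["text a", "text b", "text c", "text d"],
    "CR-00-00-00-00": [1.0, 0.0, np.nan, 0.0],
    "BE-00-00-00-00": [np.nan, 0.0, 1.0, np.nan],
})

def labeled_subset(frame: pd.DataFrame, cls: str) -> pd.DataFrame:
    """Return only the rows that carry a 0/1 label for the given class."""
    return frame.loc[frame[cls].notna(), ["full_text", cls]]

cr = labeled_subset(toy, "CR-00-00-00-00")
be = labeled_subset(toy, "BE-00-00-00-00")
print(len(cr), len(be))
```

Each class therefore gets its own training set, and the sets can differ in size from class to class.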

### Implementation
* The program converts each article into a TF-IDF vector and applies the multinomial Naive Bayes model provided by scikit-learn.

* After preprocessing the data, we evaluate the model with 10-fold cross-validation.
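
The two steps above can be sketched on a toy corpus as follows. The texts, labels, and the 2-fold split are made up for illustration (the notebook uses 10 folds on the real data), and `cross_val_score` is a convenience swapped in here for the manual `KFold` loop used later in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = related to the class, 0 = not related.
texts = [
    "arrested for theft and charged by police",
    "police charged my brother with assault",
    "court sentenced him to prison for robbery",
    "charged with a felony and waiting for trial",
    "landlord refuses to return my security deposit",
    "my employer has not paid my overtime wages",
    "neighbor built a fence on my property line",
    "credit card company is suing me over debt",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF features feed directly into multinomial Naive Bayes.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())

# For a classifier, an integer cv uses stratified folds, so every fold
# contains both labels.
scores = cross_val_score(clf, texts, labels, cv=2, scoring="accuracy")
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is refit on the training folds only, so no information from the held-out fold leaks into training.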

### Output
* We calculate accuracy for both categories (20) and classes (100+). \\

The overall result is at the bottom of this notebook.

# Data Preparation

In [0]:
import os
import sys
import pandas as pd
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
pd.options.display.max_rows = 100
pd.set_option('display.max_columns', None) 

In [3]:
url = 'https://raw.githubusercontent.com/heeh/legal_issue_classification/master/2019-12-06_95p-confidence_binary.csv'
df = pd.read_csv(url)
df.info()
df.iloc[:,:4]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2777 entries, 0 to 2776
Columns: 109 entries, _id to WO-09-00-00-00
dtypes: float64(107), object(2)
memory usage: 2.3+ MB


Unnamed: 0,_id,full_text,BE-00-00-00-00,BE-01-00-00-00
0,5b60e59cda52255c20cff794,Will he serve time?. Long story short my broth...,0.0,0.0
1,5b60e59cda52255c20cff79a,Groundwater leaking out of street 24/7. Ground...,0.0,0.0
2,5b60e59cda52255c20cff7a0,How do I get my mom's license taken away. My m...,0.0,0.0
3,5b60e59cda52255c20cff7bf,My boss hasn't paid me. What do i do?. I work ...,,
4,5b60e59cda52255c20cff7b8,"[Texas] I signed a non-compete contract, but t...",0.0,0.0
...,...,...,...,...
2772,5b60e66dda52255c20df433f,Do you and your parents get deported because o...,0.0,0.0
2773,5b60e66dda52255c20df43ae,Wondering the legality of a minor (me) being a...,0.0,0.0
2774,5b60e66dda52255c20df4462,Can I sue a billion dollar company in small cl...,0.0,0.0
2775,5b60e66dda52255c20df4448,Sued by creditor and currently in settlement n...,0.0,0.0


In [4]:
#Null Check
#df.isnull().sum()

# Class Check
df = df.loc[:, df.sum(axis=0, skipna=True) != 0]
df.sum(axis = 0, skipna = True)
#df.info()

_id               5b60e59cda52255c20cff7945b60e59cda52255c20cff7...
full_text         Will he serve time?. Long story short my broth...
BE-00-00-00-00                                                   27
BE-04-00-00-00                                                    7
BU-00-00-00-00                                                   93
CO-00-00-00-00                                                  106
CR-00-00-00-00                                                  302
CR-01-00-00-00                                                   12
CR-04-00-00-00                                                   13
CR-06-00-00-00                                                   11
CR-07-00-00-00                                                    5
CR-10-00-00-00                                                   11
CR-14-00-00-00                                                    3
CR-15-00-00-00                                                    6
ED-00-00-00-00                                  

# Tiny Example: Crime and Prison (CR-00-00-00-00)


## Preprocessing

In [5]:
from collections import defaultdict

verbose = True

cls = 'CR-00-00-00-00'
def preprocessing(dfset: defaultdict, cls: str):
    # Keep only the id, the text, and the label column for this class
    dfset[cls] = df.loc[:, ['_id', 'full_text', cls]]
    labels = dfset[cls].iloc[:,2]
    if verbose:
        print("------------Before dropping nan----------------------------------------")
        print(dfset[cls].iloc[:,1:])
        print(labels.value_counts(dropna=False))
    
    # Drop entries that have no label (NaN) for this class
    dfset[cls] = dfset[cls].dropna()
    labels = dfset[cls].iloc[:,2]
    if verbose:
        print("\n------------After dropping nan---------------------------------------")
        print(dfset[cls].iloc[:,1:])
        print(labels.value_counts(dropna=False))

# Maps each class code to its labeled subset of the data
dfset = defaultdict() 
preprocessing(dfset, cls)    


------------Before dropping nan----------------------------------------
                                              full_text  CR-00-00-00-00
0     Will he serve time?. Long story short my broth...             1.0
1     Groundwater leaking out of street 24/7. Ground...             0.0
2     How do I get my mom's license taken away. My m...             NaN
3     My boss hasn't paid me. What do i do?. I work ...             0.0
4     [Texas] I signed a non-compete contract, but t...             0.0
...                                                 ...             ...
2772  Do you and your parents get deported because o...             1.0
2773  Wondering the legality of a minor (me) being a...             0.0
2774  Can I sue a billion dollar company in small cl...             0.0
2775  Sued by creditor and currently in settlement n...             0.0
2776  (CA) Sales job. Income based on performance. A...             0.0

[2777 rows x 2 columns]
0.0    1377
NaN    1098
1.0     302
Nam

## Calculating Accuracy with 10-fold Cross-Validation

In [7]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support as score

verbose = True 

model = defaultdict() 

def predict_by_class(dfset: defaultdict,cls: str) -> float:
    global model
    tiny_df = dfset[cls]
    print('------------------------------------\nclass name: ', cls)
    X = tiny_df['full_text'].values 
    Y = tiny_df[cls].values
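    if verbose:
        print(list(X))
        print(list(Y))

    # Sanity check: fit on all labeled data and score on the same data (in-sample).
    # This full-data model is also the one stored in model[cls].
    model[cls] = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model[cls].fit(X, Y)
    preds = model[cls].predict(X)
    # Without `average`, each metric is an array with one value per label (0 and 1)
    precision, recall, fscore, support = score(Y, preds)
    if verbose:
        print('precision: {}'.format(precision))
        print('recall: {}'.format(recall))
        print('fscore: {}'.format(fscore))
        print('support: {}'.format(support))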

    # 10-fold cross-validation: train on 9 folds, test on the held-out fold
    kfold = KFold(n_splits=10)
    cv_accuracy = []
    cv_recall = []
    print('data set size', len(X))
    n_iter = 0
    
    for train_index, test_index in kfold.split(X):
        X_train, X_test = X[train_index], X[test_index] 
        Y_train, Y_test = Y[train_index], Y[test_index]
    
        # Fit a fresh pipeline on the training folds only, so the
        # full-data model stored in model[cls] is left untouched
        fold_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
        fold_model.fit(X_train, Y_train)
        preds = fold_model.predict(X_test)
        n_iter += 1

        # accuracy: (tp + tn) / (p + n)
        # precision: tp / (tp + fp)
        # recall: tp / (tp + fn)
        # f1: 2 tp / (2 tp + fp + fn)
        accuracy = accuracy_score(Y_test, preds)
        precision, recall, fscore, support = score(Y_test, preds, average='binary')

        if verbose:
            print('fold {}: train size {}, test size {}'.format(n_iter, X_train.shape[0], X_test.shape[0]))
            print(Y_test)
            print(preds)
            print('accuracy: {}'.format(accuracy))
            print('precision: {}'.format(precision))
            print('recall: {}'.format(recall))
            print('fscore: {}'.format(fscore))
        cv_accuracy.append(accuracy)
        cv_recall.append(recall)

    # Report the mean accuracy over the 10 folds; recall is printed for reference
    ret = np.round(np.mean(cv_accuracy), 4)
    if verbose:
        print('\n## Average validation accuracy:', ret)
        print('## Average validation recall:', np.round(np.mean(cv_recall), 4))
    return ret

predict_by_class(dfset, cls)
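
The metric formulas listed in the comments above can be checked on a small worked example. The label vectors below are made up for illustration; the scikit-learn functions are the same ones imported in this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical ground truth and predictions for one fold.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)  # (tp + tn) / (p + n)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)

# Recompute each metric by hand from the confusion counts.
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)
manual_f1 = 2 * tp / (2 * tp + fp + fn)

print(accuracy, precision, recall, f1)
```

Here tp = 2, fp = 1, fn = 1, and tn = 4, so accuracy is 6/8 while precision, recall, and F1 are each 2/3. On imbalanced classes, accuracy can look high even when recall on the positive label is low, which is why it helps to inspect both.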


# Entire Data 

## Building Models and Calculating Accuracies

In [0]:
# The first two columns are '_id' and 'full_text'; the rest are class labels
class_list = df.columns[2:]
print(class_list)
verbose = False
accuracy_dict = defaultdict() 
acc_list = []
for cls in class_list:
    preprocessing(dfset, cls)
    ret = predict_by_class(dfset, cls)
    acc_list.append(ret)
    accuracy_dict[cls] = ret 
    print('accuracy:' , ret)

## Plotting for 30 Classes

In [0]:
#for k,v in accuracy_dict.items():
#    print(k,v)
#sys.exit()

import matplotlib.pyplot as plt
import numpy as np

plt.rcdefaults()
fig, ax = plt.subplots()

# Plot the first 30 classes only; labels must match the number of bars
n_classes = min(30, len(acc_list))
y_pos = np.arange(n_classes)

ax.barh(y_pos, acc_list[:n_classes], align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(class_list[:n_classes])
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('accuracy')
ax.set_title('Accuracy Scores for Each Class')

plt.show()

## Plotting by Top Categories

In [0]:
from statistics import mean
topcat_dict = defaultdict(list)

# Group class scores by top category (the first two letters of the class code)
for k,v in accuracy_dict.items():
    topcat_dict[k[:2]].append(v)

cat_list = []
avg_list = []
for k,v in topcat_dict.items():
    cat_list.append(k)
    avg_list.append(mean(v))
print(cat_list)
print(avg_list)


import matplotlib.pyplot as plt
import numpy as np

plt.rcdefaults()
fig, ax = plt.subplots()

# One bar per top category, showing the mean score of its classes
y_pos = np.arange(len(cat_list))

ax.barh(y_pos, avg_list, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(cat_list)
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('accuracy')
ax.set_title('Accuracy Scores for Each Category')

plt.show()
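
The grouping step in the cell above can be illustrated in isolation. The scores below are hypothetical values, not results from this notebook; only the convention that the first two letters of a class code identify its top category comes from the code above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-class scores keyed by class code.
toy_scores = {
    "CR-00-00-00-00": 0.75,
    "CR-01-00-00-00": 0.25,
    "BE-00-00-00-00": 1.0,
    "BE-04-00-00-00": 0.5,
    "ED-00-00-00-00": 0.5,
}

# Collect the scores of all classes that share a two-letter prefix.
by_category = defaultdict(list)
for code, value in toy_scores.items():
    by_category[code[:2]].append(value)

# Average within each category.
category_mean = {cat: mean(values) for cat, values in by_category.items()}
print(category_mean)
```

A plain mean weights every class equally, regardless of how many labeled articles each class has, so a small class influences its category average as much as a large one.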

# Prediction

In [0]:
input = "How do I get my mom's license taken away. My mom is 66, on disability for multiple sclerosis. She's been unable to work for about a decade. She has cataracts. She has neuropathy. She has 0 reaction time. She has had a fender bender on every single corner of her last car, which my brother then totaled. She also has no night vision. She also falls asleep all the time. ALMOST like like narcolepsy. It's mostly her overextending herself, but she will nod off driving or sleep in parking lots til she feels ok. She also has lymphedema in her legs which are swollen enough to impede driving. The last year she was driving she received 19 red light tickets. She agreed not to drive. And the insurance paid for her car. Now she's bought a new one, about 6 months later. Our relationship is terrible. I hate her. But I want her licence taken away before she kills or cripples someone(s). I'm no contact with her, but my brother still tries and he cares about this a lot. I've spoken with the DMV IN MY state, not very helpful. Can i contact her insurance? Do I contact the police? Has this happened to anyone"

print(input)
predictions = defaultdict() 

for cls in class_list:
    predictions[cls] = model[cls].predict([input])
for k,v in predictions.items():
    print(k, v)
