# Classifying Text Values
Michaela Webster - mawebster9

Before we begin, we need to import all of the libraries and functions that we will use throughout the notebook. Pandas is a crucial part of this process since it holds our data. The other key piece is the scikit-learn library, which is our one-stop shop for machine learning.

In [129]:
#our data structure
import pandas as pd

#bag of words vectorizer - weights each word by its term frequency times its inverse document frequency
from sklearn.feature_extraction.text import TfidfVectorizer

#split data into training/test data, validate our models, and specify number of folds for training/test data
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold 

# our 5 classification models
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

## 1. Import Data and Set X & y


For this example, we will focus solely on the judgment field. The judgment field acts like a brief story summary field that contains a lot of useful data surrounding a case. Our goal is to see if the judgment field can be used to determine whether a case should be denied or granted with high accuracy. For this, our X is the judgment field and its associated label, or y, is the denied field.

Also note that we limit the data to the first 500 entries because processing text data is memory-intensive. If done with all of the data, you would get a memory error. 500 records are enough to build a reasonably accurate model without running out of memory.

In [130]:
#connect to CSV file that contains our data
path_to_file = "https://raw.githubusercontent.com/mawebster9/ThesisCode/master/appeals_query.csv"

#open, read, and store our data into a pandas dataframe
appeals_data = pd.read_csv(path_to_file, encoding='latin-1')

In [131]:
#assign our attributes to X and y
X = appeals_data['Judgment'].head(500)
y = appeals_data['Denied']

In [132]:
#print the first record in X to verify the previous step
X.iloc[0]

"Applicant's drug abuse was not mitigated where marijuana use was recent, and had continued after Applicant stated an intent to refrain from drug use in the future. He falsified his drug abuse history on security questionnaires in March and October 1995 an"

In [133]:
#print the first record in y to verify the previous step
y.iloc[0]

True

## 2. Set-up X: Bag of Words

Now that we have the data assigned to our X and y variables, it is time to prepare the data for the machine learning algorithms. For text data, we need to break up the words in a way that lets a machine pick up on the characteristics of the language. One way to do this is with a bag of words. A bag of words takes a large amount of text, splits it into individual words, and counts the number of occurrences of each word.

For this example, we will use a vectorizer to split the text into words and assign a weight to each one. The vectorizer we use, TfidfVectorizer, multiplies how often a word appears in a record (term frequency) by the inverse of how many records contain that word (inverse document frequency). This ensures that words found in nearly every record are weighted less than words that are distinctive for this dataset, such as "foreign", "alcohol", "drugs", etc. In addition, passing stop_words='english' removes common English words, also known as "stop words", like "the", "a", "an", etc., from the vocabulary entirely.
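To make the weighting concrete, here is a minimal sketch on a made-up three-sentence corpus (the sentences and the names `toy_docs`, `toy_vectorizer`, `filtered_vectorizer`, and `idf` are illustrative assumptions, not part of the appeals data). It shows that a word appearing in every record gets a lower inverse document frequency than a word appearing in only one, and that the English stop word list drops "the" altogether.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus, not taken from the appeals data
toy_docs = [
    "the applicant used drugs",
    "the applicant has foreign contacts",
    "the applicant abused alcohol",
]

# no stop word list here, so we can see how "the" is weighted on its own
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_docs)

# vocabulary_ maps each word to its column index; idf_ holds one weight per column
idf = {word: toy_vectorizer.idf_[col] for word, col in toy_vectorizer.vocabulary_.items()}
print("idf of 'the':    ", idf["the"])
print("idf of 'foreign':", idf["foreign"])

# with the English stop word list, "the" never enters the vocabulary
filtered_vectorizer = TfidfVectorizer(stop_words='english')
filtered_vectorizer.fit(toy_docs)
print(sorted(filtered_vectorizer.vocabulary_))
```

On real data the same pattern holds: the more records a word appears in, the smaller its weight.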


In [134]:
#set up bag of words for judgment field, use english stop words
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(X.tolist())

In [135]:
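#print bag of words to ensure it is set up correctly
#(in scikit-learn 1.0 and later use vectorizer.get_feature_names_out(); get_feature_names() was removed in 1.2)
vectorizer.get_feature_names()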

['00',
 '000',
 '10',
 '1001',
 '11',
 '12',
 '13',
 '14',
 '15',
 '154',
 '16',
 '17',
 '18',
 '19',
 '1959',
 '1965',
 '1966',
 '1967',
 '1969',
 '1970',
 '1970s',
 '1971',
 '1972',
 '1973',
 '1974',
 '1975',
 '1976',
 '1977',
 '1978',
 '1979',
 '1980',
 '1980s',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '199',
 '1990',
 '1990s',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '20',
 '2001',
 '2003',
 '2005',
 '2006',
 '2007',
 '203',
 '21',
 '22',
 '23',
 '24',
 '25',
 '26',
 '27',
 '28',
 '29',
 '30',
 '31',
 '33',
 '35',
 '36',
 '38',
 '40',
 '401k',
 '45',
 '50',
 '548',
 '59',
 '60',
 '700',
 '80',
 '81',
 '83',
 '86',
 '88',
 '91',
 '94',
 '96',
 'a10',
 'aa',
 'ab',
 'abandoning',
 'abandonment',
 'abilities',
 'ability',
 'able',
 'absence',
 'absent',
 'absolutely',
 'absolve',
 'absolving',
 'absorb',
 'abstain',
 'abstained',
 'abstention',
 'abstin',
 'abstinence',
 'abstinent',
 'abuse',
 'abused',

Now that we have our list created and number of occurrences counted and weighted appropriately, we need to ensure that the data is all accounted for.

In [136]:
#ensure that data is in the right format - TfidfVectorizer returns sparse matrix of type <class numpy.float64>
X[:1]

<1x2117 sparse matrix of type '<class 'numpy.float64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [137]:
#make sure we have all records that were passed into the vectorizer
X.shape

(500, 2117)

## 3. Set-up y: Fix Boolean Values

We have our X values set up in a way that a machine learning algorithm can understand, but now we need to fix our y values. To do this, we convert the true/false values into a numeric format: all true values become 1.0 and all false values become 0.0.

Note how we are only taking the first 500 records for the y set because we only took the first 500 records for our X set.
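As a small illustration of the conversion, the sketch below uses a made-up boolean Series (the names `toy_labels` and `toy_numeric` are hypothetical). It assumes the column holds only True/False values; a column with missing or mixed values would need extra handling first.

```python
import pandas as pd

# hypothetical labels standing in for the Denied column
toy_labels = pd.Series([True, False, True, False, False], name="Denied")

# astype(float) maps True -> 1.0 and False -> 0.0 in one step
toy_numeric = toy_labels.astype(float)

# take the same leading rows from y as from X so the two stay aligned
toy_numeric_head = toy_numeric.head(3)
print(toy_numeric_head.tolist())
```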

In [138]:
#replace: True = 1.0, False = 0.0
y = y.replace(True,1).head(500)

In [139]:
#print out counts for all y records - ensure that our replace statement worked
y.value_counts()

0.0    260
1.0    240
Name: Denied, dtype: int64

### Final Data Check

Now that our X and y are in the right format, we need to check one last time that the dimensions of each are correct. For our X, a sparse matrix, we see that there are 500 rows and 2,117 columns (the different words in the bag of words). For our y, a one-dimensional pandas Series, we see that there are 500 rows and no column dimension.

Our data has passed the check and is ready to be used.

In [140]:
X.shape

(500, 2117)

In [141]:
y.shape

(500,)

## 4. Machine Learning Step-by-Step

#### A. Run train_test_split() on X & y

This step is not needed for the classifier comparison in section 5, but it shows you how the train_test_split function works. Our X and y are randomly split into training and testing groups. In this case, our test group will contain 33% of the records (test_size), and the split will be the same each time we run this line (random_state).

In [142]:
#break 33% of X and y into X_test and y_test, break other remaining 67% into X_train and y_train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [143]:
#print size of training data (335/500 = .67)
X_train.shape

(335, 2117)

In [144]:
#print size of test data (165/500 = .33)
X_test.shape

(165, 2117)
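To see what test_size and random_state do in isolation, here is a minimal sketch on a tiny made-up array (all `toy_` and `a_`/`b_` names are illustrative assumptions). Splitting twice with the same random_state should give identical test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical toy data: 12 records with 2 features each
toy_X = np.arange(24).reshape(12, 2)
toy_y = np.array([0, 1] * 6)

# two splits with the same settings
a_train, a_test, a_y_train, a_y_test = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42)
b_train, b_test, b_y_train, b_y_test = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42)

# 25% of 12 records go to the test group, the rest to the training group
print(a_train.shape, a_test.shape)

# the same random_state reproduces the same split
same_split = np.array_equal(a_test, b_test)
print("identical test rows:", same_split)
```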

#### B. Run fit() on X_train & y_train

The next step in our machine learning workflow is taking our training data and feeding it into an algorithm to build a model. This is essentially the step where the algorithm learns how the X values of each record map to its y label. There are a number of classification models to choose from. For this example we will focus on the LogisticRegression classifier.

In [145]:
#specify which classifier to use and set parameters
clf = LogisticRegression(random_state=0, solver='liblinear')

In [146]:
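#send X and y into our classifier to build a model
#caution: this fits on all of X and y, which includes the rows held out as X_test and y_test;
#to keep the test data unseen, fit on the training data only: clf.fit(X_train, y_train)
logreg_model = clf.fit(X, y)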

In [147]:
#print out all information about our model
print(logreg_model)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
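For comparison, here is a minimal sketch of fitting on the training rows only, so that the test rows stay unseen until evaluation. It runs on synthetic data from make_classification rather than the appeals data, and every `toy_` name is a hypothetical stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in data (hypothetical, not the appeals dataset)
toy_X, toy_y = make_classification(n_samples=200, n_features=20, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.33, random_state=42
)

# fit only on the training rows
toy_clf = LogisticRegression(random_state=0, solver='liblinear')
toy_clf.fit(toy_X_train, toy_y_train)

# score() returns the mean accuracy on whichever rows it is given
train_acc = toy_clf.score(toy_X_train, toy_y_train)
test_acc = toy_clf.score(toy_X_test, toy_y_test)
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

Comparing the two numbers gives a rough sense of how much a model's score on data it has already seen can differ from its score on new data.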


#### C. Run predict() on X_test

The next step is to take the model we just built and feed the test data into it. This will output an array of values that the model predicts for the denied status of each test record.

In [148]:
#send our test data into the model we just created
y_pred = logreg_model.predict(X_test)

In [149]:
#print our results for the predictions
print("Here are the model's predictions: ")
print(y_pred)

Here are the model's predictions: 
[0. 1. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0.
 1. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1.
 0. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1.
 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 1. 0. 0. 1. 0. 1. 0. 1.
 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0. 1. 1.
 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 1. 0. 1.
 1. 1. 1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0.]


#### D. Verify Accuracy of Model

Now that we have split our data into training and testing groups, created a model using a machine learning algorithm, and used the model to predict outcomes for our test data, it is time to verify how well our model did compared to the actual outcomes. To do this, there are a number of accuracy metrics. For this example we will focus on the accuracy score.

In [150]:
#import our score functions
from sklearn.metrics import accuracy_score

In [154]:
#compare y_test values with the predicted y values
score = accuracy_score(y_test, y_pred)
print("Accuracy score for LogisticRegression classifier:  ", score)

Accuracy score for LogisticRegression classifier:   0.9515151515151515


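***Here we can see that our LogisticRegression model was correct 95.15% of the time when predicting the denied status.*** Keep in mind that the model in step B was fit on all of X and y, which includes the test records, so this score is likely higher than it would be on truly unseen data.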
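The accuracy score is simply the share of predictions that match the true labels. The sketch below works this out by hand on eight made-up labels (the arrays `toy_true` and `toy_pred` are hypothetical) and compares it with accuracy_score. It also prints a confusion matrix, which breaks the same predictions down into correct and incorrect calls per class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# hypothetical true labels and predictions (1 = denied, 0 = granted)
toy_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
toy_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# accuracy by hand: fraction of positions where prediction equals truth
manual_accuracy = (toy_true == toy_pred).mean()
library_accuracy = accuracy_score(toy_true, toy_pred)
print(manual_accuracy, library_accuracy)  # 0.75 0.75

# rows are true classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred)
print(cm)
```

Here 6 of the 8 predictions are correct, and the confusion matrix shows one mistake in each direction.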

## 5. Test for Best Classifier to Use

Now that we understand how machine learning is done, we can determine which model is the best choice for our data. In this example we will use 5 different classifiers and evaluate each against 5 scoring metrics.

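Note: the cross_val_score() function handles the whole workflow from section 4. It splits the data into training and test groups (here 3 stratified folds instead of a single train_test_split(X, y, test_size=0.33, random_state=42) call), runs fit(X_train, y_train) and predict(X_test) for each fold, and computes whichever scoring metric we request. This function is essentially an all-in-one function.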
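To show what happens inside that one call, here is a rough sketch that repeats the split, fit, predict, and score steps by hand for each fold and compares the result with cross_val_score. It uses synthetic data and hypothetical `toy_` names, and it only reproduces the accuracy metric; it is an approximation of the idea, not scikit-learn's internal code.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic stand-in data (hypothetical, not the appeals dataset)
toy_X, toy_y = make_classification(n_samples=150, n_features=10, random_state=0)
toy_clf = LogisticRegression(solver='liblinear', random_state=0)
toy_cv = StratifiedKFold(n_splits=3)

# manual version: one fit/predict/score round per fold
manual_scores = []
for train_idx, test_idx in toy_cv.split(toy_X, toy_y):
    fold_clf = clone(toy_clf)  # fresh, unfitted copy for each fold
    fold_clf.fit(toy_X[train_idx], toy_y[train_idx])
    fold_pred = fold_clf.predict(toy_X[test_idx])
    manual_scores.append(accuracy_score(toy_y[test_idx], fold_pred))

# all-in-one version
library_scores = cross_val_score(toy_clf, toy_X, toy_y, scoring='accuracy', cv=toy_cv)

print("manual: ", np.round(manual_scores, 4))
print("library:", np.round(library_scores, 4))
```

Because every record lands in a test fold exactly once, the mean of these fold scores uses all of the data for evaluation without ever scoring a model on rows it was trained on.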

In [152]:
#the 5 classifiers to compare, with a display name for each
classifiers = [RandomForestClassifier(n_estimators=5), GaussianNB(), LogisticRegression(solver='liblinear'), DecisionTreeClassifier(criterion='gini'), KNeighborsClassifier(n_neighbors=6)]
clf_names = ['RandomForest','GaussianNB','LogisticRegression','DecisionTree','KNeighbors']
#the 5 scoring metrics used to evaluate each classifier
metric_names = ['roc_auc','f1','accuracy','precision','recall']

#3 stratified folds - each fold keeps roughly the same class ratio as the full data
scv = StratifiedKFold(n_splits=3)

#one column per classifier, one row per metric
scores_df = pd.DataFrame(index=metric_names,columns=clf_names)
clf_scores = []
for clf, name in zip(classifiers, clf_names):
    print('-----------------------------------------------------------------------------------------------------------')
    print('Classifier: ',clf)
    print('')
    print("Scoring Metrics: ")
    for metric in metric_names:
        score = cross_val_score(clf,X.toarray(),y,scoring=metric, cv=scv).mean()
        clf_scores.append(score)
        print('\t*',metric,'score: ', score)
    scores_df[name] = clf_scores
    clf_scores = []


-----------------------------------------------------------------------------------------------------------
Classifier:  RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=5,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Scoring Metrics: 
	* roc_auc score:  0.867741189967032
	* f1 score:  0.7859146970783552
	* accuracy score:  0.768126397806796
	* precision score:  0.7757183327028354
	* recall score:  0.8208333333333333
-----------------------------------------------------------------------------------------------------------
Classifier:  GaussianNB(priors=None, var_smoothing=1e-09)

Scoring Metrics: 
	

## Final Results

For boolean classification, the top performing machine learning algorithm is LogisticRegression. This classifier models the probability of a binary outcome (only 2 outcomes possible) as a function of one or more independent variables, then predicts whichever outcome is more likely.
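As a closing illustration of how LogisticRegression turns its inputs into a binary outcome, the sketch below rebuilds the predicted probabilities by hand: a weighted sum of the features is passed through the sigmoid function, and anything above 0.5 is labeled 1. It uses synthetic data and hypothetical `toy_` names, and assumes the liblinear solver used throughout this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data (hypothetical, not the appeals dataset)
toy_X, toy_y = make_classification(n_samples=100, n_features=5, random_state=0)
toy_clf = LogisticRegression(solver='liblinear', random_state=0).fit(toy_X, toy_y)

# weighted sum of the features plus the intercept
linear_score = toy_X @ toy_clf.coef_.ravel() + toy_clf.intercept_[0]

# the sigmoid squeezes the score into a probability between 0 and 1
manual_proba = 1.0 / (1.0 + np.exp(-linear_score))
library_proba = toy_clf.predict_proba(toy_X)[:, 1]

# probabilities above 0.5 become class 1, the rest class 0
manual_labels = (manual_proba > 0.5).astype(int)
print("probabilities match:", np.allclose(manual_proba, library_proba))
print("labels match:       ", np.array_equal(manual_labels, toy_clf.predict(toy_X)))
```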