# Peter Kim, OCPython Meetup - August 1, 2017
PeopleSpace, Irvine, CA

## Text Classification with Scikit-Learn
Lending Club Dataset
>Original Source: https://www.lendingclub.com/info/download-data.action
><br>Kaggle Discussion: https://www.kaggle.com/wendykan/lending-club-loan-data

## Inspiration Reference:
When Words Sweat: Identifying Signals for Loan Default in the Text of Loan Applications
>Netzer, Oded and Lemaire, Alain and Herzenstein, Michal, When Words Sweat: Identifying Signals for Loan Default in the Text of Loan Applications (November 6, 2016). Columbia Business School Research Paper No. 16-83. Available at SSRN: https://ssrn.com/abstract=2865327.  

## Background References on Text Classification and Python/Scikit-Learn.
>PyCon 2016 Tutorial from Data School, “Machine Learning with Text in scikit-learn (PyCon 2016),” by Kevin Markham on May 28, 2016. 
* YouTube Lecture Available at: https://www.youtube.com/watch?v=ZiKMIuYidY0  
* Github Available at: https://github.com/justmarkham/pycon-2016-tutorial
>
><br> scikit-learn, “Working with Text Data”.  Available at: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
<br>
><br> Kaggle tutorial, “Bag of Words Meets Bags of Popcorn.”  December 9 2014 – June 30, 2015.  Available at: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words
<br>
><br> scikit-learn, “Feature Extraction (Customizing the Vectorizer Class)”.  Available at: http://scikit-learn.org/stable/modules/feature_extraction.html
>
><br> Andreas Müller and Sarah Guido, "Introduction to Machine Learning with Python," O'Reilly Media, October 2016.  Available at: http://shop.oreilly.com/product/0636920030515.do
>
><br> Sebastian Raschka, "Python Machine Learning," Packt Publishing; 1 edition (September 23, 2015).  Available at: https://www.packtpub.com/big-data-and-business-intelligence/python-machine-learning


In [None]:
# Python libraries
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, HashingVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn import metrics
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

In [None]:
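import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')      # tokenizer models for word_tokenize (newer NLTK releases may ask for 'punkt_tab')
nltk.download('wordnet')    # corpus needed by WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import PorterStemmer
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 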

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

import matplotlib
matplotlib.style.use('ggplot')

In [None]:
# Create Pandas Dataframe with Lending Club data from 2007-2011.
# The "2007-2011.csv" file has been cleaned-up from original "LoanStats3a.csv".  
df_lendingclub = pd.read_csv("2007-2011.csv")

# How many rows and columns?
df_lendingclub.shape    

In [None]:
# Make sure it loaded correctly.
df_lendingclub.head()

In [None]:
# What are the columns?  
print(list(df_lendingclub.columns))

In [None]:
# Examine first record, first 20 columns.
df_lendingclub.iloc[0, 0:20]

In [None]:
# Examine first record, 'desc'.
df_lendingclub['desc'][0]

In [None]:
# Examine first record, 'loan_status'.
df_lendingclub['loan_status'][0]

## Column Descriptions From Lending Club Data Dictionary:
>'desc': Loan description provided by the borrower
<br>
<br>'loan_status': Current status of the loan

In [None]:
# Create new dataframe with 2 columns: 'desc' and 'loan_status'
df_text = df_lendingclub[['desc', 'loan_status']].copy()

# How many rows and columns?
print(df_text.shape)

# See if it loaded correctly.
df_text.head()

## Loan Status of Total Loans

In [None]:
# Total number of loans
print("Total number of loans: ", df_text.shape[0])

# How many are Fully Paid vs. Charged off?
loan_status = df_text['loan_status'].value_counts()
print("\nTotal loans by loan status: ")
print(loan_status)

# What % of loans are Fully Paid?
print("\n% of total loans by loan status: ")
print(loan_status/df_text.shape[0])

## Loans with Blank Descriptions by Loan Status

In [None]:
# How many 'desc' fields are blank?
blank_desc = df_text['desc'].isnull().sum()
print("Number of loans with blank descriptions: ", blank_desc)
print("% of loans with blank descriptions: ", blank_desc/df_text.shape[0])

# Of the loans with blank descriptions, what is loan status?
blank_loan_status = df_text[df_text['desc'].isnull()]['loan_status'].value_counts()
print("\nLoans with blank descriptions by loan status: ")
print(blank_loan_status)

# What % of loans with blank descriptions are Fully Paid?
print("\n% of loans with blank descriptions by loan status: ")
print(blank_loan_status/blank_desc)
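
The same counts and percentages can be computed more compactly. This is a minimal sketch on a made-up toy dataframe (the `toy` frame and its values are assumptions for illustration, not Lending Club data). It relies on the mean of a boolean mask being the fraction of `True` values, and on `value_counts(normalize=True)` returning proportions directly.

```python
import pandas as pd

# Toy stand-in for df_text; the values are made up for illustration.
toy = pd.DataFrame({
    "desc": ["pay off credit card", None, "home repair", None, "wedding"],
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off",
                    "Charged Off", "Fully Paid"],
})

# Boolean mask of rows whose description is missing.
is_blank = toy["desc"].isnull()

# Fraction of loans with a blank description.
blank_share = is_blank.mean()

# Class proportions among the blank-description loans, in one call.
blank_status_share = toy.loc[is_blank, "loan_status"].value_counts(normalize=True)

print("Share of blank descriptions:", blank_share)
print(blank_status_share)
```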

## Only Loans with Descriptions

In [None]:
# Only include loans with descriptions, use dropna() method.
df_text_desc = df_text.dropna()
df_text_desc.head()

In [None]:
# How many loans have a description?
print("Number of loans with a description: ", df_text_desc.shape[0])

# Loans with description by loan status?
print("\nLoans with descriptions by loan status: ")
loan_status_desc = df_text_desc['loan_status'].value_counts()
print(loan_status_desc)

# What % of loans with descriptions are Fully Paid vs. Charged Off?
print("\n% of loans with descriptions by loan status: ")
print(loan_status_desc/df_text_desc.shape[0])

## Very Important to Balance the Classes
Particularly for high-dimensional, highly-sparse datasets
> Blagus, Rok and Lusa, Lara, “SMOTE for high-dimensional class-imbalanced data,” BMC Bioinformatics, March 22, 2013.  Available at: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-106
>
><br>Blagus, Rok and Lusa, Lara, “Class prediction for high-dimensional class-imbalanced data,” BMC Bioinformatics, October 20, 2010.  Available at: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-523
>
><br>Alexander Yun-chung Liu, “The Effect of Oversampling and Undersampling on Classifying Imbalanced Text Datasets,” Thesis for M.S.E., The University of Texas at Austin, August 2004.  Available at: https://pdfs.semanticscholar.org/cade/435c88610820f073a0fb61b73dff8f006760.pdf
>
><br>Nick Becker Github page, “The Right Way to Oversample in Predictive Modeling,” December 23, 2016.  Available at: https://beckernick.github.io/oversampling-modeling/


In [None]:
# Bar chart showing different size of "Fully Paid" vs. "Charged Off"
df_text_desc['loan_status'].value_counts().plot(kind='bar')

In [None]:
'''
In order to balance the classes, under-sample the majority class.
Over-sampling the minority class can be done with SMOTE (Synthetic Minority Over-sampling Technique), 
but it does not yield good results for high-dimensional, highly sparse datasets.  
'''

# Save Fully Paid (majority class) as separate dataframe.
fully_paid = df_text_desc['loan_status'] == 'Fully Paid'
df_fully_paid = df_text_desc[fully_paid]

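# Run pandas.sample method to use same number of samples as minority class.  
# Look the count up by label rather than by position, so it does not depend on sort order.
num_samples = loan_status_desc['Charged Off']    # number of loans "Charged Off" (e.g. 3851)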
df_fp_undersample = df_fully_paid.sample(num_samples, random_state=1)  
df_fp_undersample.shape

In [None]:
# Save minority class as separate dataframe.
charged_off = df_text_desc['loan_status'] == 'Charged Off'
df_charged_off = df_text_desc[charged_off]
df_charged_off.shape

In [None]:
# Concatenate the "Fully Paid" (under-sampled) and "Charged Off" into one dataframe
bal_frames = [df_fp_undersample, df_charged_off]
df_balanced = pd.concat(bal_frames)
df_balanced.shape

In [None]:
# Bar chart showing equal size of "Fully Paid" vs. "Charged Off" after balancing
df_balanced['loan_status'].value_counts().plot(kind='bar')
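
The under-sampling steps above can also be written as one loop over the classes. This is a sketch on a made-up toy dataframe (the `toy` frame is an assumption for illustration); it samples every class down to the size of the smallest one, which gives the same result as the cells above when there are two classes.

```python
import pandas as pd

# Toy imbalanced data: 7 "Fully Paid" vs. 3 "Charged Off" (made up).
toy = pd.DataFrame({
    "desc": [f"loan {i}" for i in range(10)],
    "loan_status": ["Fully Paid"] * 7 + ["Charged Off"] * 3,
})

# Size of the smallest class.
minority_size = int(toy["loan_status"].value_counts().min())

# Randomly sample each class down to the minority size, then recombine.
balanced = pd.concat([
    group.sample(n=minority_size, random_state=1)
    for _, group in toy.groupby("loan_status")
])

print(balanced["loan_status"].value_counts())
```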

## Train/Test Split

In [None]:
# Split the descriptions and classes into train (80%) and test (20%) sets.
# To keep the same class ratio in both sets, use the "stratify" parameter.  
X_train, X_test, y_train, y_test = train_test_split(df_balanced['desc'], df_balanced['loan_status'], 
                                                    random_state=1, test_size=0.2, stratify=df_balanced['loan_status'])

# Print the size of each train and test datasets
print('X training size: ', X_train.shape)
print('y training size: ', y_train.shape)
print('X test size: ', X_test.shape)
print('y test size: ', y_test.shape)
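
To see what `stratify` does, here is a small sketch on made-up data (the `texts` and `labels` Series are assumptions for illustration). With 10 examples per class and `test_size=0.2`, each class contributes the same number of rows to the test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 examples of each class (made up).
texts = pd.Series([f"loan description {i}" for i in range(20)])
labels = pd.Series(["Fully Paid"] * 10 + ["Charged Off"] * 10)

# stratify=labels keeps the 50/50 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=1, stratify=labels)

print(y_tr.value_counts())
print(y_te.value_counts())
```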

In [None]:
# X_train is the loan descriptions from 'desc' field (80% of total)
X_train[0:5]

In [None]:
# y_train is the class data, a.k.a. "label data" or "target classes" (80% of total)
y_train[0:5]

In [None]:
# The train_test_split output retains the data type as a pandas Series.  
# Later we will run a list comprehension on this data, changing the data type to a list.
# Either data type should work fine.
type(X_train)

In [None]:
# The test data represents 20% of the total.  X_test is loan descriptions, y_test is class data.

In [None]:
# Check the y_test value_counts to verify confusion matrix
y_test.value_counts()

## Pre-processing text  
Using techniques from the Kaggle tutorial and the scikit-learn tutorial (see references at the top).
1. Create process_chars function
1. List comprehension applying function to data.
1. Create a LemmaTokenizer class
1. Instantiate CountVectorizer object with LemmaTokenizer

In [None]:
# Step 1: Create process_chars function.

# Use stopwords from NLTK, but also add individual letters.
stop_nltk = stopwords.words("english")
stop_nltk_plus = stop_nltk + [u'a',u'b',u'c',u'd',u'e',u'f',u'g',u'h',u'i',u'j',
                         u'k',u'l',u'm',u'n',u'o',u'p',u'q',u'r',u's',u't',
                         u'u',u'v',u'w',u'x',u'y',u'z']
# In Python, searching a set is much faster than searching
# a list, so convert the stop words to a set.  
# Use this "stops" set in the function below.  
stops = set(stop_nltk_plus)
    
# function to process documents
def process_chars(input_text):
    # Replace every non-letter character with a space
    letters_only = re.sub("[^a-zA-Z]", " ", input_text)
        
    # Convert to lower case, split into individual words
    words = letters_only.lower().split()    
    
    # Remove stop words
    meaningful_words = [w for w in words if w not in stops]   
    
    # Join the words back into one string separated by space, 
    # and return the result.
    return " ".join(meaningful_words)

In [None]:
# Step 2: List comprehension applying function to data.

# Process the text of X_train.  Keep same name for simplicity.
X_train = [process_chars(text_file) for text_file in X_train]

# Process the text of X_test.  Keep same name for simplicity.
X_test = [process_chars(text_file) for text_file in X_test]
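
As a quick check of what this cleaning does to a single description, here is a self-contained sketch. The `clean_text` function mirrors `process_chars`, but it uses a tiny hand-picked stop-word set (`toy_stops`, an assumption for illustration) instead of the NLTK list, so it runs without any downloads.

```python
import re

# Tiny hand-picked stop-word set (assumption; the notebook uses NLTK's list).
toy_stops = {"i", "to", "my", "the", "a", "and", "off"}

def clean_text(input_text, stops=toy_stops):
    # Replace every non-letter character with a space.
    letters_only = re.sub("[^a-zA-Z]", " ", input_text)
    # Lowercase and split into words.
    words = letters_only.lower().split()
    # Drop stop words and join back into one string.
    return " ".join(w for w in words if w not in stops)

sample = "I need $5,000 to pay off my 2 credit cards!"
print(clean_text(sample))  # need pay credit cards
```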

In [None]:
# Step 3: Create a LemmaTokenizer class.  
# Based on Scikit-Learn "Feature Extraction" page.

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

In [None]:
# Step 4: Instantiate CountVectorizer object with LemmaTokenizer.
# Based on Scikit-Learn "Feature Extraction" page.

# instantiate CountVectorizer object, with LemmaTokenizer()
count_vect_lemma = CountVectorizer(tokenizer=LemmaTokenizer(), ngram_range=(1, 2), max_features=400, 
                                   max_df=0.90, stop_words='english')   

## Option 1a: Vectorize Text into Document-Term-Matrix (DTM)

In [None]:
%%time

# Output recorded from an earlier run (with max_features=400 the DTM has at most 400 columns):
# Size of training dtm:  (6161, 1000)
# Wall time: 20.2 s


# Fit the vectorizer object to the X_train text data
X_train_vect = count_vect_lemma.fit(X_train)

# Transform the training text into a document-term-matrix
X_train_dtm = X_train_vect.transform(X_train)

print("Size of training dtm: ", X_train_dtm.shape)

In [None]:
%%time

# Output recorded from an earlier run (with max_features=400 the DTM has at most 400 columns):
# Size of test dtm:  (1541, 1000)
# Wall time: 2.12 s


# Transform the test text into a document-term-matrix (input fed into models)
X_test_dtm = X_train_vect.transform(X_test)

print("Size of test dtm: ", X_test_dtm.shape)
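
The fit-on-train, transform-on-test pattern is easier to see on a tiny corpus. This sketch uses made-up documents and a default `CountVectorizer` (no lemmatizer, an assumption to keep it self-contained). Words that appear only in the test text are not in the learned vocabulary, so they are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration.
train_docs = ["pay credit card debt", "consolidate credit card",
              "home improvement loan"]
test_docs = ["credit card loan", "wedding expenses"]

vect = CountVectorizer()

# Learn the vocabulary from the training text only, then build the DTM.
train_dtm = vect.fit_transform(train_docs)

# Re-use the same vocabulary for the test text.
test_dtm = vect.transform(test_docs)

print(sorted(vect.vocabulary_))
print(train_dtm.shape, test_dtm.shape)
print(test_dtm.toarray())   # second row is all zeros: no known words
```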

## Option 1b: In addition to CountVectorizer, use TfidfTransformer
Let's try 3 different options:
* TfidfTransformer(use_idf=True)
* TfidfTransformer(use_idf=False)
* Don't use TfidfTransformer at all

In [None]:
# Instantiate a TfidfTransformer object and fit it to the training DTM only.
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_dtm)

# Transform X_train_dtm with the fitted TfidfTransformer.  
# Could name it something else, but keeping the same name for simplicity.
X_train_dtm = tf_transformer.transform(X_train_dtm)

# Apply the same fitted transformer to the test DTM, so that
# train and test features are weighted the same way.
X_test_dtm = tf_transformer.transform(X_test_dtm)

# What is the shape of the document-term-matrix?  (Should be same.)
X_train_dtm.shape
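
To compare `use_idf=True` with `use_idf=False`, here is a sketch on a made-up three-document corpus. In both cases each row is L2-normalized (the default `norm='l2'`). With IDF switched on, a term found in fewer documents ("debt") gets more weight than a term found in more documents ("credit").

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up documents for illustration.
docs = ["credit card debt", "credit card card", "home loan"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)

# Term frequencies only, rows scaled to unit length.
tf_only = TfidfTransformer(use_idf=False).fit_transform(counts).toarray()

# Term frequencies re-weighted by inverse document frequency.
tf_idf = TfidfTransformer(use_idf=True).fit_transform(counts).toarray()

# Column positions of two words in the first document.
debt, credit = vect.vocabulary_["debt"], vect.vocabulary_["credit"]

print(np.round(tf_only, 2))
print(np.round(tf_idf, 2))
print("tf only:", tf_only[0, debt], tf_only[0, credit])
print("tf-idf: ", tf_idf[0, debt], tf_idf[0, credit])
```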

## Option 1c: In addition to CountVectorizer, use TruncatedSVD
TruncatedSVD is used for dimensionality reduction, particularly with sparse matrices (e.g. text matrices).

When TruncatedSVD is used in conjunction with CountVectorizer and Tfidf, it is known as Latent Semantic Analysis (LSA).

Scikit-Learn Documentation on TruncatedSVD
> http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
<br>
><br>http://scikit-learn.org/stable/modules/decomposition.html
<br>
><br>http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

In [None]:
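# TruncatedSVD defaults to n_components=2; the scikit-learn documentation
# recommends a value around 100 for LSA.
svd = TruncatedSVD()

# Rescale each reduced row to unit length.
normalizer = Normalizer(copy=True)

lsa = make_pipeline(svd, normalizer)

# Fit on the training data only, then apply the same projection to the test data.
X_train_dtm = lsa.fit_transform(X_train_dtm)

X_test_dtm = lsa.transform(X_test_dtm)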
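
Here is a self-contained LSA sketch on made-up documents. It uses `TfidfVectorizer`, which combines `CountVectorizer` and `TfidfTransformer` in one step, and an explicit `n_components=3` (both are choices made for this illustration, not the notebook's settings). After the `Normalizer` step every row has unit length.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Made-up documents for illustration.
train_docs = [
    "credit card debt consolidation loan",
    "pay off credit card loan",
    "home improvement loan kitchen",
    "home repair loan roof",
    "small business loan inventory",
    "business expansion loan credit",
]
test_docs = ["credit card loan for home repair"]

# TF-IDF weighted document-term matrices.
tfidf = TfidfVectorizer()
train_tfidf = tfidf.fit_transform(train_docs)
test_tfidf = tfidf.transform(test_docs)

# LSA: project onto 3 components, then rescale rows to unit length.
lsa = make_pipeline(TruncatedSVD(n_components=3, random_state=1),
                    Normalizer(copy=False))
train_lsa = lsa.fit_transform(train_tfidf)
test_lsa = lsa.transform(test_tfidf)

print(train_tfidf.shape, "->", train_lsa.shape)
print(np.linalg.norm(train_lsa, axis=1))
```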

## Option 2: Vectorize Text Using Hashing Vectorizer
Usually HashingVectorizer is used to vectorize text documents that do not fit in memory.  But maybe it can be used as a dimensionality reduction technique, and improve prediction accuracy.
<br>
<br>Scikit-Learn Documentation on HashingVectorizer:
>http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer
<br>
><br>http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

In [None]:
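# Hash tokens into a fixed number of columns.  With use_idf=False below,
# TfidfTransformer only normalizes term frequencies (no IDF weighting).
# Note: non_negative=True was removed in scikit-learn 0.21;
# alternate_sign=False is its replacement.
hasher = HashingVectorizer(n_features=400,tokenizer=LemmaTokenizer(),
                           stop_words='english', alternate_sign=False,
                           norm=None, binary=False)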

# Vectorizer uses pipeline to combine hasher and TfidfTransformer
vectorizer = make_pipeline(hasher, TfidfTransformer(use_idf=False))

# Create X_train_dtm
X_train_dtm = vectorizer.fit_transform(X_train)

# Create X_test_dtm
X_test_dtm = vectorizer.transform(X_test)
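
A small sketch of how `HashingVectorizer` behaves, on made-up documents with a deliberately tiny `n_features=16` (an assumption so the output fits on screen). There is no vocabulary to learn, so `transform` can be called without `fit`. The number of columns is fixed up front, and different words may collide in the same column.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Made-up documents for illustration.
docs = ["credit card debt", "home improvement loan"]

# alternate_sign=False keeps all counts non-negative;
# norm=None leaves raw counts instead of normalized rows.
hasher = HashingVectorizer(n_features=16, alternate_sign=False, norm=None)

# Stateless: no fit step is needed.
hashed = hasher.transform(docs)

print(hashed.shape)
print(hashed.toarray())
```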

## Option 3: Text Classification with Word2Vec
http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/

In [1]:
# An idea to explore in the future.
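
One simple way to turn word vectors into document features, along the lines of the linked blog post, is to average the vectors of the words in each document. The sketch below uses hypothetical 3-dimensional vectors in a plain dictionary (`toy_vectors`, made up for illustration); a real model would be trained with a library such as gensim.

```python
import numpy as np

# Hypothetical word vectors, made up for illustration only.
toy_vectors = {
    "credit": np.array([1.0, 0.0, 0.0]),
    "card": np.array([0.0, 1.0, 0.0]),
    "loan": np.array([0.0, 0.0, 1.0]),
}

def mean_embedding(doc, vectors, dim=3):
    # Keep only the words that have a vector.
    words = [w for w in doc.split() if w in vectors]
    # A document with no known words becomes a zero vector.
    if not words:
        return np.zeros(dim)
    # Average the word vectors into one document vector.
    return np.mean([vectors[w] for w in words], axis=0)

docs = ["credit card", "loan wedding", "vacation"]
X = np.vstack([mean_embedding(d, toy_vectors) for d in docs])
print(X)
```

The resulting dense matrix `X` has one row per document and could be fed to a classifier in place of the document-term-matrix.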

## Text Classification Example Without GridSearchCV
* In practice, the GridSearchCV class accomplishes the same result, while tuning combinations of model parameters.
* But to make it simpler to follow the workflow, here is an example of text classification step-by-step.

In [None]:
# Step 1: Instantiate a classifier object.
rf_clf = RandomForestClassifier()

# Step 2: "Fit" training data onto model, both data and labels.
# The machine is "learning" how the training words match the label data (or "classes").  
rf_clf.fit(X_train_dtm, y_train)

# Step 3: Predict test data using model, only data (not labels)
# Store results as predictions on test data ... next we will compare with real labels.  
rf_test_predictions = rf_clf.predict(X_test_dtm)

In [None]:
print("Random Forest Classifier: ")

# print accuracy of class predictions
print(metrics.accuracy_score(y_test, rf_test_predictions))

# print the confusion matrix
print("\nConfusion Matrix: ")
print("(rows are actual, columns are predictions)")
print(metrics.confusion_matrix(y_test, rf_test_predictions, labels=["Charged Off", "Fully Paid"]))

# print the Classification Report
print("\nClassification Report: ")
print(metrics.classification_report(y_test, rf_test_predictions,target_names=["Charged Off", "Fully Paid"]))
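
To make the confusion matrix layout concrete, here is a sketch with six made-up predictions. Passing `labels` fixes the row and column order, so row 0 is actual "Charged Off" and column 0 is predicted "Charged Off".

```python
from sklearn import metrics

# Made-up true labels and predictions for illustration.
y_true = ["Charged Off", "Charged Off", "Charged Off",
          "Fully Paid", "Fully Paid", "Fully Paid"]
y_pred = ["Charged Off", "Charged Off", "Fully Paid",
          "Fully Paid", "Charged Off", "Fully Paid"]

labels = ["Charged Off", "Fully Paid"]

# Rows are actual classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)
acc = metrics.accuracy_score(y_true, y_pred)

print(cm)
# [[2 1]
#  [1 2]]
print(acc)   # 4 correct out of 6

# Recall for "Charged Off": correct "Charged Off" / all actual "Charged Off".
recall_charged_off = cm[0, 0] / cm[0].sum()
print(recall_charged_off)
```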

## Classification Technique #1: Random Forest Classifier

In [None]:
%%time

# 0.546664502516
# bootstrap: True
# class_weight: 'balanced'
# n_estimators: 50
# Wall time: 57.5 s

# Use GridSearchCV to tune model parameters

# parameters 
parameters_rf = {'n_estimators': (10, 50, 100),                 # default 10 (100 since scikit-learn 0.22)
                 'bootstrap': (True, False),                    # default True
                 'class_weight': ('balanced', None)}            # default None

# instantiate a classifier object
rf = RandomForestClassifier(random_state=42)

# instantiate a GridSearchCV object
gs_rf = GridSearchCV(rf, parameters_rf, n_jobs=-1)

# fit the GridSearchCV object to the training data
gs_rf = gs_rf.fit(X_train_dtm, y_train)

print(gs_rf.best_score_)

for param_name in sorted(parameters_rf.keys()):
    print("%s: %r" % (param_name, gs_rf.best_params_[param_name]))
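
The grid search above tunes only the classifier, on a document-term-matrix that was built beforehand. A possible variation, sketched here on made-up documents and labels, is to put the vectorizer and the classifier into one `Pipeline`. Vectorizer settings can then be tuned in the same search, and each cross-validation fold fits the vectorizer on its own training part. The step names `vect` and `clf` and the parameter values are assumptions chosen for this illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up documents and labels for illustration only.
docs = [
    "pay off credit card debt", "consolidate credit card balances",
    "refinance high interest credit card", "credit card payoff plan",
    "lower my credit card payments", "combine credit card bills",
    "new roof for the house", "kitchen remodel for our home",
    "home repair after storm", "bathroom renovation project",
    "replace windows in the house", "finish the basement at home",
]
labels = ["Charged Off"] * 6 + ["Fully Paid"] * 6

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__n_estimators": [10, 50],
}

gs = GridSearchCV(pipe, param_grid, cv=3)
gs.fit(docs, labels)

print(gs.best_score_)
print(gs.best_params_)
```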

In [None]:
print("Random Forest Classifier: ")

# predict classification
gs_rf_test_predictions = gs_rf.predict(X_test_dtm)

# print accuracy of class predictions
print(metrics.accuracy_score(y_test, gs_rf_test_predictions))

# print the confusion matrix
print(metrics.confusion_matrix(y_test, gs_rf_test_predictions))

print(metrics.classification_report(y_test, gs_rf_test_predictions,target_names=["Charged Off", "Fully Paid"]))