# Introduction
__Problem:__   Companies today create petabytes of new data each year, most of it electronic. This has become a cost and logistics problem for many corporations because most do not have an enforceable document management policy for electronic data. The cause is not negligence but a lack of effective tools for governing and regulating the knowledge that is shared electronically today.

One idea to solve this problem is to create a document classification engine that will assign “categories” or “tags” to a document regardless of document location or filename. A tool like this would give flexibility to the end user, but also governing power to the records manager, who could then apply best practices from physical records management.

__Goal:__   In this notebook we will narrow our focus to building a model that will classify documents into one of three categories: Operations, Legal and Accounting. The notebook will test different combinations of text transformers and classifiers and compare the results.

# Adding Required Resources

In [1]:
%matplotlib inline
import pandas as pd
import os
import numpy as np
import sklearn


In [2]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()


In [3]:
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# PreProcessing work
Loop through each text file acquired from OCR and preprocess the data by removing stop words and extra spaces and by reducing each word to its root, or stem. We will then collect the contents into a Pandas DataFrame.



In [4]:
NEWLINE = '\n'
SKIP_FILES = {'cmds'}


def read_files(path):
    # os.walk already descends into every subdirectory, so no manual recursion is needed
    for root, dir_names, file_names in os.walk(path):
        for file_name in file_names:
            if file_name in SKIP_FILES:
                continue
            file_path = os.path.join(root, file_name)
            if os.path.isfile(file_path):
                past_header, lines = False, []
                # "ANSI" is a Windows-only codec alias; name an explicit codec (e.g. "cp1252") elsewhere
                with open(file_path, encoding="ANSI") as f:
                    for line in f:
                        if past_header:
                            lines.append(line)
                        elif line == NEWLINE:
                            # the first blank line marks the end of the header
                            past_header = True
                content = NEWLINE.join(lines)
                yield content
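
The header-skipping rule inside `read_files` can be looked at in isolation. The sketch below is only an illustration: `strip_header` is a hypothetical helper and the sample lines are made up. It shows that everything up to the first blank line is discarded, and that a file without any blank line yields no content at all.

```python
def strip_header(lines):
    """Return the lines that follow the first blank line (the header separator)."""
    past_header, body = False, []
    for line in lines:
        if past_header:
            body.append(line)
        elif line == "\n":
            past_header = True
    return body


# Made-up sample: two header lines, a blank separator, then the body.
sample = ["Scanned: batch 12\n", "Pages: 3\n", "\n", "Storage agreement\n", "\n", "Section 1\n"]
body = strip_header(sample)

# A file with no blank line never leaves the "header" state.
no_header = strip_header(["only one line\n"])

print(body)
print(no_header)
```

Blank lines that appear after the separator are kept as part of the body.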
                    


In [5]:
#Training set is pre-classified according to subfolder
path =r'C:\Users\osutr_000\Documents\Data\Ops'
list_ = []

for text in read_files(path):
    # tokenize the text
    tokenizer = RegexpTokenizer(r'\w+')
    intermediate = tokenizer.tokenize(text)
    # Remove stop words known in nltk package
    stop = stopwords.words('english')
    intermediate = [i for i in intermediate if i not in stop]
    # Use word stemming to reduce each word to its root
    lanste = LancasterStemmer()
    intermediate = [lanste.stem(i) for i in intermediate]
    # Concatenate words into one string and then add to list_
    final = " ".join(intermediate)
    list_.append(final)
ops_df = pd.DataFrame(data = list_)
ops_df['class']="ops"

In [6]:
path =r'C:\Users\osutr_000\Documents\Data\Legal'
list_ = []

for text in read_files(path):
    tokenizer = RegexpTokenizer(r'\w+')
    intermediate = tokenizer.tokenize(text)
    stop = stopwords.words('english')
    intermediate = [i for i in intermediate if i not in stop]
    lanste = LancasterStemmer()
    intermediate = [lanste.stem(i) for i in intermediate]
    final = " ".join(intermediate)
    list_.append(final)
legal_df = pd.DataFrame(data = list_)
legal_df['class']="legal"

In [7]:
path =r'C:\Users\osutr_000\Documents\Data\Accounting'
list_ = []

for text in read_files(path):
    tokenizer = RegexpTokenizer(r'\w+')
    intermediate = tokenizer.tokenize(text)
    stop = stopwords.words('english')
    intermediate = [i for i in intermediate if i not in stop]
    lanste = LancasterStemmer()
    intermediate = [lanste.stem(i) for i in intermediate]
    final = " ".join(intermediate)
    list_.append(final)
accounting_df = pd.DataFrame(data = list_)
accounting_df['class']="accounting"
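
The three cells above repeat the same tokenize, filter and stem steps. As a minimal sketch, those steps could live in one helper; `preprocess`, `toy_stop_words` and `toy_stem` are hypothetical names, and the toy stand-ins are used only so the example runs without the NLTK data. In the notebook one would presumably pass `set(stopwords.words('english'))` and `LancasterStemmer().stem` instead.

```python
import re


def preprocess(text, stop_words, stem):
    """Tokenize on word characters, drop stop words, stem, and re-join."""
    tokens = re.findall(r"\w+", text)          # same pattern as RegexpTokenizer(r'\w+')
    kept = [t for t in tokens if t not in stop_words]
    return " ".join(stem(t) for t in kept)


# Toy stand-ins (assumptions for illustration only).
toy_stop_words = {"the", "of", "and"}
toy_stem = str.lower                            # NOT a real stemmer, just a placeholder

cleaned = preprocess("Storage of the Records and Files", toy_stop_words, toy_stem)
print(cleaned)
```

As in the cells above, the stop-word comparison is case-sensitive, so a capitalized "The" would not be removed unless the tokens are lower-cased first.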

## Assemble training and test data
We will use a 50/50 split.

In [8]:
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
merged_df = pd.concat([ops_df, legal_df, accounting_df])
merged_df.columns = ['text', 'cat']
merged_df = merged_df.sort_index().reset_index()

There are 1113 labeled examples in total (before the train/test split):
* 580 Ops
* 209 Legal
* 324 Accounting
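
The classes are not balanced, so it helps to know what a trivial classifier would score. The short calculation below uses only the counts listed above; the variable names are illustrative.

```python
counts = {"ops": 580, "legal": 209, "accounting": 324}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the largest class sets the accuracy floor.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label:<10} {share:.1%}")
print(f"always predicting '{majority_label}' scores {baseline_accuracy:.1%}")
```

Any model evaluated later should be judged against this majority-class baseline of roughly 52% rather than against 0%.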

In [9]:
# Now we will flatten the data into (sample, feature) matrices
X = merged_df.text
y = merged_df.cat

# and then split the dataset into two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Model fitting
To streamline the code, I will use both Pipeline and GridSearchCV in the model fitting steps. A pipeline chains a linear sequence of data transforms and a final estimator so that they can be fitted and evaluated as a single model. The grid search evaluates the model under every combination of the listed hyper-parameters, using cross-validation on the training half, and returns a single 'best estimator'.
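
To make "every combination" concrete, the sketch below expands the hyper-parameter grid that this notebook uses for Model 1 with plain `itertools.product`. It only illustrates the enumeration; it is not how scikit-learn implements the search internally.

```python
from itertools import product

# Same grid as the Model 1 cell; "step__param" routes a value to a pipeline step.
parameters = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "tfidf__norm": ("l1", "l2"),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
}

names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

print(len(candidates))   # 3 * 2 * 2 * 2 combinations
print(candidates[0])
```

Each candidate is fitted once per cross-validation fold, so the total number of fits is the number of candidates times the number of folds (set by the `cv` argument or the library default).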

There will be 3 models evaluated:
* Model 1: CountVectorizer, TfidfTransformer, and SGDClassifier
* Model 2: HashingVectorizer and SGDClassifier
* Model 3: TfidfVectorizer and LinearSVC

## Model 1
Here are some of the model features seen in this section:
* CountVectorizer implements both tokenization and occurrence counting in a single class
* TfidfTransformer transforms a count matrix to a normalized tf or tf-idf representation. 
* SGDClassifier fits a linear classifier using Stochastic Gradient Descent (SGD) training
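
The counting and tf-idf steps can be reproduced by hand on a toy corpus. The sketch below assumes the smoothed formula that scikit-learn documents for its defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by l2 normalization of each row. The three documents are made up for illustration.

```python
import math
from collections import Counter

# Made-up toy corpus (already lower-cased and whitespace-tokenizable).
docs = [
    "invoice payment due",
    "storage agreement payment",
    "storage vault storage",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}


def tfidf_row(tokens):
    """Raw count times idf for every vocabulary term, scaled to unit l2 norm."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


matrix = [tfidf_row(toks) for toks in tokenized]
for row in matrix:
    print([round(v, 3) for v in row])
```

Terms that occur in fewer documents (such as "vault") get a larger idf than terms shared across documents (such as "payment"), which is what lets rare, class-specific tokens dominate the representation.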


In [10]:
# Define the pipeline steps:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(max_iter=1000, tol=.0001)),
])

# hyper-parameters to be tuned:
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
}


In [11]:
print("# Tuning hyper-parameters...")
print()

clf = GridSearchCV(pipeline, parameters)
clf.fit(X = X_train, y = y_train)
print("Default hyper-parameters of classifier:")
print(pipeline.steps[-1][1])

# Tuning hyper-parameters...

Default hyper-parameters of classifier:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=1000, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=0.0001, verbose=0, warm_start=False)


In [12]:
print("Best tuned-parameters set found on development set:")
print(clf.best_params_)
print()
print("Best score on development set:")
print(clf.best_score_)

Best tuned-parameters set found on development set:
{'clf__alpha': 1e-06, 'clf__penalty': 'elasticnet', 'tfidf__norm': 'l2', 'vect__max_df': 1.0}

Best score on development set:
1.0


In [13]:
print("Detailed classification report on test set:")
print()
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
print()

print("Detailed classification report on training set")
print()
y_true, y_pred = y_train, clf.predict(X_train)
print(classification_report(y_true, y_pred))
print()

Detailed classification report on test set:

             precision    recall  f1-score   support

 accounting       1.00      1.00      1.00       170
      legal       1.00      1.00      1.00       100
        ops       1.00      1.00      1.00       287

avg / total       1.00      1.00      1.00       557


Detailed classification report on training set

             precision    recall  f1-score   support

 accounting       1.00      1.00      1.00       154
      legal       1.00      1.00      1.00       109
        ops       1.00      1.00      1.00       293

avg / total       1.00      1.00      1.00       556




## Analysis of Model 1
As seen in the detailed classification report, both the precision and recall of the model round to 1.00,
and therefore their harmonic mean (F1 score) is also 1.00. This indicates a near-perfect model fit to the dataset.

Precision is the number of true positives over the number of true positives plus false positives.
Recall is the number of true positives over the number of true positives plus false negatives.
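
These definitions translate directly into code. The counts below are hypothetical and chosen only to show the arithmetic; they are not taken from the reports above.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and their harmonic mean from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one class: 97 hits, 3 false alarms, 1 miss.
precision, recall, f1 = precision_recall_f1(tp=97, fp=3, fn=1)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the report rounds to two decimals, a printed 1.00 can still hide a small number of errors.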

## Model 2
In Model 2 a different vectorizer, HashingVectorizer, is used in the pipeline in order to evaluate the effect CountVectorizer had on Model 1. The HashingVectorizer will convert the text documents to a matrix of token occurrences by using a hashing trick to find the token string name to feature integer index mapping.
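
A minimal sketch of the hashing trick follows. It swaps in `hashlib.md5` from the standard library so that the result is deterministic; scikit-learn itself uses MurmurHash3 and a much larger default feature space, so the indices here will not match the library's. The function names are illustrative.

```python
import hashlib


def hashed_index(token, n_features):
    """Map a token to a column index without storing any vocabulary."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_features


def hash_vectorize(tokens, n_features=16):
    """Count token occurrences in a fixed-width vector."""
    vector = [0] * n_features
    for token in tokens:
        vector[hashed_index(token, n_features)] += 1
    return vector


vec = hash_vectorize(["storage", "agreement", "storage"])
print(vec)
```

Since the index is computed from the token alone, nothing has to be fitted or kept in memory. The price is that two different tokens can collide in the same column and that a column index cannot be mapped back to a token.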

In [14]:
# Define the pipeline steps:

pipeline = Pipeline([
    ('vect', HashingVectorizer()),
    ('clf', SGDClassifier(max_iter=1000, tol=.0001)),
])

# hyper-parameters to be tuned:
parameters = {
    'vect__norm': ('l1','l2',None),
    'vect__alternate_sign': [0,1],
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
}


In [15]:
print("# Tuning hyper-parameters...")
print()

clf = GridSearchCV(pipeline, parameters)
clf.fit(X = X_train, y = y_train)
print("Default hyper-parameters of classifier:")
print(pipeline.steps[-1][1])

# Tuning hyper-parameters...

Default hyper-parameters of classifier:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=1000, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=0.0001, verbose=0, warm_start=False)


In [16]:
print("Best tuned parameters set found on development set:")
print(clf.best_params_)
print()
print("Best score on development set:")
print(clf.best_score_)

Best tuned parameters set found on development set:
{'clf__alpha': 1e-05, 'clf__penalty': 'elasticnet', 'vect__alternate_sign': 0, 'vect__norm': 'l2'}

Best score on development set:
1.0


In [17]:
print("Detailed classification report on test set:")
print()
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
print()

print("Detailed classification report on training set")
print()
y_true, y_pred = y_train, clf.predict(X_train)
print(classification_report(y_true, y_pred))
print()

Detailed classification report on test set:

             precision    recall  f1-score   support

 accounting       1.00      0.99      1.00       170
      legal       0.99      1.00      1.00       100
        ops       1.00      1.00      1.00       287

avg / total       1.00      1.00      1.00       557


Detailed classification report on training set

             precision    recall  f1-score   support

 accounting       1.00      1.00      1.00       154
      legal       1.00      1.00      1.00       109
        ops       1.00      1.00      1.00       293

avg / total       1.00      1.00      1.00       556




## Results from Model 2
Model 2 score results are nearly identical to Model 1. There will need to be further analysis before we can confidently declare which vectorizer is a better fit to the model. Fortunately, both have shown good results. 

One benefit of using the HashingVectorizer is that it scales very well because it is stateless: it does not keep a vocabulary in memory. However, "there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model." (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html)


## Model 3
In Model 3 I will apply a different classifier, a linear Support Vector Machine (LinearSVC), in order to compare it against the SGD classifier used in Model 1. A linear SVM looks for the separating hyperplane with the widest margin between the classes; the boundary is determined by the hardest-to-separate training points (the support vectors) rather than by the typical points of each class. We will tune the regularization parameter C, which sets how heavily margin violations (the slack variables) are penalized.
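
The role of C is easiest to see in the textbook soft-margin objective, 0.5·||w||² + C·Σ loss, where the loss is the hinge max(0, 1 − y·f(x)) or its square. The sketch below evaluates that objective for a fixed, made-up weight vector. It is the textbook form under the assumption of binary labels in {−1, +1}, not a reproduction of the solver's internals.

```python
def linear_svm_objective(w, b, X, y, C, squared=False):
    """Textbook soft-margin objective for labels in {-1, +1}."""
    total_loss = 0.0
    for x_i, y_i in zip(X, y):
        score = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        slack = max(0.0, 1.0 - y_i * score)      # zero when the point clears the margin
        total_loss += slack ** 2 if squared else slack
    regularizer = 0.5 * sum(w_j * w_j for w_j in w)
    return regularizer + C * total_loss


# Made-up data: the second point sits inside the margin (y * score = 0.5).
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
y = [1, 1, -1]
w, b = [1.0, 0.0], 0.0

obj_c1 = linear_svm_objective(w, b, X, y, C=1)
obj_c10 = linear_svm_objective(w, b, X, y, C=10)
obj_c1_squared = linear_svm_objective(w, b, X, y, C=1, squared=True)
print(obj_c1, obj_c10, obj_c1_squared)
```

A larger C makes the same margin violation more expensive, pushing the fit toward fewer violations at the cost of a narrower margin. For more than two classes, the one-vs-rest scheme fits one such binary problem per class.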

In addition, I will resume using the same transformers from Model 1, CountVectorizer and TfidfTransformer, which are combined into a TfidfVectorizer. 


In [20]:
# Define the pipeline steps
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])

# hyper-parameters to be tuned:
parameters = dict(tfidf__sublinear_tf=[0,1],
                  tfidf__smooth_idf=[0,1],
                  tfidf__norm =['l1','l2',None],
                  clf__C=[1, 10, 100]
                 )

In [21]:
print("# Tuning hyper-parameters...")
print()

clf = GridSearchCV(pipeline, parameters)
clf.fit(X = X_train, y = y_train)
print("Default hyper-parameters of classifier:")
print(pipeline.steps[-1][1])

# Tuning hyper-parameters...

Default hyper-parameters of classifier:
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)


In [22]:
print("Best tuned parameters set found on development set:")
print(clf.best_params_)
print()
print("Best score on development set:")
print(clf.best_score_)

Best tuned parameters set found on development set:
{'clf__C': 10, 'tfidf__norm': 'l1', 'tfidf__smooth_idf': 0, 'tfidf__sublinear_tf': 0}

Best score on development set:
0.996402877698


In [23]:
print("Detailed classification report on test set:")
print()
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
print()

print("Detailed classification report on training set")
print()
y_true, y_pred = y_train, clf.predict(X_train)
print(classification_report(y_true, y_pred))
print()

Detailed classification report on test set:

             precision    recall  f1-score   support

 accounting       1.00      0.99      0.99       170
      legal       0.97      1.00      0.99       100
        ops       1.00      1.00      1.00       287

avg / total       0.99      0.99      0.99       557


Detailed classification report on training set

             precision    recall  f1-score   support

 accounting       1.00      0.99      1.00       154
      legal       0.99      1.00      1.00       109
        ops       1.00      1.00      1.00       293

avg / total       1.00      1.00      1.00       556




## Analysis of Model 3
The SVM classifier performs well, but its best score is slightly lower than that of the SGD classifier. The averaged precision, recall and combined F1 score on the test set are all about 99%.

# Further Analysis of Results
It is very curious that all three models show such good statistical results. To analyze the results further, I will refit Model 1 to the data and then look for insights into the words used.

In [24]:
# Define the pipeline steps:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(max_iter=1000, tol=.0001)),
])

# hyper-parameters to be tuned:
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
}

gsearch = GridSearchCV(pipeline, parameters)
gsearch.fit(X = X_train, y = y_train)

print("Best tuned parameters set found on development set:")
print(gsearch.best_params_)
print()
print("Best score on development set:")
print(gsearch.best_score_)


Best tuned parameters set found on development set:
{'clf__alpha': 1e-05, 'clf__penalty': 'l2', 'tfidf__norm': 'l2', 'vect__max_df': 0.5}

Best score on development set:
1.0


### Number of features in training set

In [25]:
print("There are", len(gsearch.best_estimator_.steps[0][1].vocabulary_), "features in the dataset.")

There are 17498 features in the dataset.


### Feature weighting
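Which features receive the largest weights, and in which direction? `coef_[0]` holds the weights for the first class in `classes_` (here `accounting`, since class labels are sorted alphabetically). Because the (weight, feature) pairs are sorted in ascending order, the first list in the output shows the most negative weights (the strongest evidence against that class) and the second list shows the most positive weights (the strongest evidence for it), regardless of the "highest" and "lowest significance" labels in the printout. The features with the least effect on the classifier are those whose weights are closest to zero, and neither list shows them.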
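
The difference between sorting by signed weight and sorting by magnitude can be seen on a toy example. The feature names and weights below are hypothetical and serve only to illustrate the slicing.

```python
# Hypothetical weights for one class.
names = ["alpha", "beta", "gamma", "delta", "epsilon"]
coefs = [-3.0, 2.5, 0.1, -0.05, 1.2]

# Ascending sort by signed weight.
pairs = sorted(zip(coefs, names))
most_negative = pairs[:2]        # strongest evidence against the class
most_positive = pairs[:-3:-1]    # strongest evidence for the class

# Sort by absolute value to rank overall influence.
by_magnitude = sorted(zip(coefs, names), key=lambda p: abs(p[0]))
least_influential = by_magnitude[:2]
most_influential = by_magnitude[:-3:-1]

print(most_negative, most_positive)
print(least_influential, most_influential)
```

Only the magnitude-based ranking identifies features that barely affect the decision.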

In [27]:
feature_names = gsearch.best_estimator_.steps[0][1].get_feature_names()
coefs_with_fns = sorted(zip(gsearch.best_estimator_.steps[-1][1].coef_[0], feature_names))
bestFeat = coefs_with_fns[:20]
worstFeat = coefs_with_fns[:-(20 + 1):-1]
print("20 features with highest significance: ")
for feat in bestFeat:
    print(feat)
print('\n','\n')
print("20 features with lowest significance: ")
for feat in worstFeat:
    print(feat)

20 features with highest significance: 
(-8.5020729274696833, 'iiii')
(-7.1207111824127214, 'pul')
(-6.6695182824853596, 'scan')
(-6.3876111606147257, '0620693')
(-6.1254828232685758, '45')
(-5.8944011924991182, '2407')
(-5.6945778517060219, '2017')
(-5.6623854031324008, '0621483')
(-5.6260594416621617, 'sit')
(-5.5672657359389479, 'boat')
(-5.5198551298360377, '0331534_20120430')
(-5.5198551298360377, '25am')
(-5.3769527315334908, 'flat')
(-5.3248471864722902, 'agr')
(-5.1428324665087182, '154')
(-5.1379049049677494, '0621441')
(-5.1169338892601059, '02rcvgtm0121')
(-5.0317786957614903, '02rcvgbx0189')
(-4.9277476596731677, 'vault')
(-4.6085086552407937, 'tap')

 

20 features with lowest significance: 
(7.4577434283106587, '2016')
(7.2613598223493812, '99')
(7.1799707325021975, '390')
(7.1436530909851719, 'serv')
(6.6706837680264739, '250')
(6.616028843130354, 'invo')
(6.4355836706132088, '000')
(6.3746651635444316, '16')
(6.2062799714373522, '36')
(6.0827398743368999, '37')
(6.00609

### Most repeated features in the training set

In [28]:
matrix = gsearch.best_estimator_.steps[0][1].fit_transform(X_train, y_train)
freqs = [(word, matrix.getcol(idx).sum()) for word, idx in gsearch.best_estimator_.steps[0][1].vocabulary_.items()]
#sort from largest to smallest
sort_freq = sorted (freqs, key = lambda x: -x[1])
#print top 20
print("The 20 most repeated word stems in descending frequency:")
print(sort_freq[0:20])

The 20 most repeated word stems in descending frequency:
[('stor', 2385), ('agr', 2308), ('shal', 1918), ('midcon', 1895), ('company', 1817), ('serv', 1766), ('mat', 1174), ('16', 1154), ('charg', 1120), ('provid', 997), ('11', 977), ('work', 959), ('2016', 947), ('fil', 896), ('pay', 888), ('us', 887), ('invo', 856), ('10', 847), ('delivery', 799), ('405', 792)]


### Test/Train split
As a final test, I will check whether the test/train split affects the model score by varying the test/train split to:
* 60/40
* 40/60
* 30/70
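
For reference, the sketch below works out how many of the 1113 documents land in each half for every split used in this notebook. It assumes the test count is rounded up and the remainder goes to training. `split_sizes` is a hypothetical helper, not a library function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the test share up (assumed rule)."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


sizes = {}
for test_size in (0.5, 0.6, 0.4, 0.3):
    sizes[test_size] = split_sizes(1113, test_size)
    n_train, n_test = sizes[test_size]
    print(f"test_size={test_size}: train={n_train}, test={n_test}")
```

The resulting test sizes (557, 668, 446 and 334) agree with the support totals printed in the classification reports.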

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6, random_state=0)

# Define the pipeline steps:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(max_iter=1000, tol=.0001)),
])

# hyper-parameters to be tuned:
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
}

gsearch = GridSearchCV(pipeline, parameters)
gsearch.fit(X = X_train, y = y_train)

print("Best tuned parameters set found on development set:")
print(gsearch.best_params_)
print()
print("Best score on development set:")
print(gsearch.best_score_)
print()
print("Detailed classification report on test set:")
print()
y_true, y_pred = y_test, gsearch.predict(X_test)
print(classification_report(y_true, y_pred))


Best tuned parameters set found on development set:
{'clf__alpha': 1e-05, 'clf__penalty': 'l2', 'tfidf__norm': 'l2', 'vect__max_df': 0.5}

Best score on development set:
0.997752808989

Detailed classification report on test set:

             precision    recall  f1-score   support

 accounting       1.00      1.00      1.00       202
      legal       0.98      1.00      0.99       118
        ops       1.00      1.00      1.00       348

avg / total       1.00      1.00      1.00       668



In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Define the pipeline steps:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(max_iter=1000, tol=.0001)),
])

# hyper-parameters to be tuned:
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
}

gsearch = GridSearchCV(pipeline, parameters)
gsearch.fit(X = X_train, y = y_train)

print("Best tuned parameters set found on development set:")
print(gsearch.best_params_)
print()
print("Best score on development set:")
print(gsearch.best_score_)
print()
print("Detailed classification report on test set:")
print()
y_true, y_pred = y_test, gsearch.predict(X_test)
print(classification_report(y_true, y_pred))


Best tuned parameters set found on development set:
{'clf__alpha': 1e-05, 'clf__penalty': 'l2', 'tfidf__norm': 'l2', 'vect__max_df': 0.75}

Best score on development set:
1.0

Detailed classification report on test set:

             precision    recall  f1-score   support

 accounting       1.00      1.00      1.00       145
      legal       1.00      1.00      1.00        70
        ops       1.00      1.00      1.00       231

avg / total       1.00      1.00      1.00       446



In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Define the pipeline steps:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(max_iter=1000, tol=.0001)),
])

# hyper-parameters to be tuned:
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
}

gsearch = GridSearchCV(pipeline, parameters)
gsearch.fit(X = X_train, y = y_train)

print("Best tuned parameters set found on development set:")
print(gsearch.best_params_)
print()
print("Best score on development set:")
print(gsearch.best_score_)
print()
print("Detailed classification report on test set:")
print()
y_true, y_pred = y_test, gsearch.predict(X_test)
print(classification_report(y_true, y_pred))

Best tuned parameters set found on development set:
{'clf__alpha': 1e-05, 'clf__penalty': 'l2', 'tfidf__norm': 'l2', 'vect__max_df': 0.75}

Best score on development set:
1.0

Detailed classification report on test set:

             precision    recall  f1-score   support

 accounting       1.00      1.00      1.00       106
      legal       1.00      1.00      1.00        51
        ops       1.00      1.00      1.00       177

avg / total       1.00      1.00      1.00       334

