<div style="text-align: right"><strong>Capstone #3:</strong> <span style="color:darkred">Supervised Learning</span> </div>

<a id="top"></a>

#### <span style="color:darkred">__Part 1: Data Exploration__ https://github.com/kimrharper/thinkful/blob/master/unit3/unit3-capstone-exploration.ipynb </span><br><br><span style="color:darkred">__Part 2: Models__ https://github.com/kimrharper/thinkful/blob/master/unit3/unit3-capstone-models.ipynb </span>

----

# <span style="color:darkred">Part 2: </span><span style="color:darkblue">L1 Prediction from ELL Writing Samples</span>

__Author:__ Ryan Harper 

----

<a href='#ov'>Overview</a><br>
<a href='#exp'>Experiment</a><br>
<a href='#sec1'>1. Models:</a><br>
><a href='#seca'>A. LR - Ordinary Least Squares</a><br>
<a href='#secb'>B. LR - Logistic Regression</a> <a href='#secb1'> (Lasso)</a> <a href='#secb2'> (Ridge)</a><br>
<a href='#secc'>C. NN - K Nearest Neighbors</a><br>
<a href='#secd'>D. NN - Naive Bayes</a><br>
<a href='#sece'>E. NN - Decision Tree</a><br>
<a href='#secf'>F. Ensemble - Random Forest</a><br>

<a href='#sec2'>2. Model Comparison</a><br>

<a id="ov"></a>

# <span style="color:darkblue">Overview</span>  <a href='#top'>(top)</a>

__Data Source:__
> http://lang-8.com/ [scraped with Beautiful Soup]

![alt text](../data/language/lang8.png "Title")

__Summary:__
> In my previous profession, I have been teaching English to a diverse range of students of all ages, language background, and country origin. During my professional development, I started to observe that different students with different L1s (1st Language) tended to display different patterns of communication that appeared to have some connection to either education in their country of origin or a connection to the linguistic structure of their first language. Different ELL (English Language Learners) needed to focus on different aspects of the English language depending on their background. The purpose of this project is to use a large number of blog posts from a language practicing website and explore whether or not the L1 has any significant impact on the blog writing style of the English learner.<br><br>Part 1: Explore the data to find any noteworthy trends in linguistic structure: <ol><li> vocabulary (word freq, collocations, and cognates) <li>syntax (sentence structure)<li>grammar (i.e. grammar complexity of sentences) <li>errors (types of errors) <li> parts of speech (NLTK Abbreviations: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/)</ol><br>Part 2: Use linguistic trends to determine whether or not a learner's first language can be predicted.

__Variables:__
>__id:__ _User ID_<br>
__time:__ _Time the blog post was scraped (in order of user posted time)_ <br>
__title:__ _Title of the blog post_<br>
__content:__ _The blog post_<br>
__language:__ _User's self-reported first language_

<a id="exp"></a>

# <span style="color:darkblue">Experiment</span> <a href='#top'>(top)</a>

__Hypothesis:__ 
> L1 (first language) experience and academic environment influences ELLs' (English Language Learners') writing style. The L1 of ELLs can be predicted by looking at English blog posts and identifying patterns unique to their L1.

__Observations:__
><li> --<li>--<li>--

__Method:__
> Using multiple different models. The aim of this project is to explore how different models can handle the data (target and features) and to see what information can be gained from using multiple different models. Ultimately, the goal is to determine which models are appropriate for a binary (discrete) target with features that are both qualitative (discrete) and quantitative (ranked/continuous).

<a id="sec1"></a>

# <span style="color:darkblue">1. Models:</span>  <a href='#top'>(top)</a>

In [1]:
# iPython/Jupyter Notebook
import time
from pprint import pprint
import warnings
from IPython.display import Image

import time

# Data processing
import pandas as pd
import plotly as plo
import seaborn as sns
from scipy import stats
from collections import Counter
import numpy as np
import itertools

# NLP
from nltk.corpus import stopwords as sw
from nltk.util import ngrams
from nltk.corpus import brown
import nltk
import re
from nltk.tokenize import RegexpTokenizer
import difflib

# Stats
from sklearn.metrics import classification_report, roc_curve,roc_auc_score,accuracy_score
from sklearn import metrics

# Preparing Models
from sklearn.model_selection import train_test_split

# Models
from sklearn import linear_model
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.naive_bayes import BernoulliNB,MultinomialNB,GaussianNB

# Ensemble
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

#Visualization
from IPython.display import Image
import pydotplus
import graphviz

# import altair as alt

In [2]:
features = pd.read_csv('blogfeatures.csv').sample(frac=1.0)
del features['Unnamed: 0']
del features['id']
lang = list(features.language.unique())

In [3]:
y = features['language'].values.reshape(-1, 1).ravel()
X = features[features.columns[~features.columns.str.contains('language')]]
X.head()

print(np.shape(y))
print(np.shape(X))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=30)  

(14148,)
(14148, 481)


<a id="seca"></a>

__Create Function for Comparing Models__

In [4]:
cols = ['name','time','total','precision','recall','f1']

model_set = pd.DataFrame(columns=cols)
models_stored = []
pattern = "%.2f"

In [5]:
def run_model(model,name):
    global model_set
    m = model
    m.fit(X_train, y_train)
    start = time.time()

    total_score = m.score(X_test,y_test)
    pscore = [pattern % i for i in list(metrics.precision_score(y_test, m.predict(X_test),labels=lang,average=None))]
    rscore = [pattern % i for i in list(metrics.recall_score(y_test, m.predict(X_test),labels=lang,average=None))]
    fscore = [pattern % i for i in list(metrics.f1_score(y_test, m.predict(X_test),labels=lang,average=None))]
    end = time.time()
    t= pattern % (end - start)

    r = dict(zip(cols,[name,t,total_score,pscore,rscore,fscore]))
    print('Total Score is: {}\n'.format(total_score))
    print(classification_report(y_test, m.predict(X_test)))
    
    model_set = model_set.append(r,ignore_index=True)
    return r,m

### <span style="color:gray">A. LR - Ordinary Least Squares _(not used)_</span>  <a href='#top'>(top)</a>

> Target is discrete so this model may not be appropriate <br>Many features are binary so model may not be appropriate

<a id="secb"></a>

### <span style="color:darkred">B. LR - Logistic Regression</span>  <a href='#top'>(top)</a>

> Target is binary so logistic regression will operate on probabilities

In [6]:
%%time
lreg_data,lreg = run_model(linear_model.LogisticRegression(),'Logistic Regression')

Total Score is: 0.7569477154969383

                     precision    recall  f1-score   support

            English       0.50      0.12      0.20         8
           Japanese       0.78      0.92      0.84      1436
             Korean       0.24      0.04      0.06       163
Traditional Chinese       0.70      0.54      0.61       516

        avg / total       0.72      0.76      0.72      2123

CPU times: user 5 s, sys: 88.6 ms, total: 5.09 s
Wall time: 5.13 s


In [7]:
pd.crosstab(y_test,lreg.predict(X_test))

col_0,English,Japanese,Korean,Traditional Chinese
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
English,1,6,0,1
Japanese,1,1323,17,95
Korean,0,133,6,24
Traditional Chinese,0,237,2,277


<a id="secb1"></a>

### <span style="color:gray">C. Lasso _(not used)_</span>  <a href='#top'>(top)</a>

<a id="secb2"></a>

### <span style="color:gray">D. Ridge _(not used)_</span>  <a href='#top'>(top)</a>

_Lasso and Ridge are not good predictors so should I just be using them for parameter manipulation?_

<a id="sece"></a>

### <span style="color:darkred">E. K Nearest Neighbors</span>  <a href='#top'>(top)</a>

> Can handle discrete values for target <br>Quantitative values are limited (not continuous) and might be problematic for nearest neighbors

In [8]:
%%time
neighbors_data,neighbors = run_model(KNeighborsClassifier(n_neighbors=5),'K Nearest Neighbor')

Total Score is: 0.700423928403203

                     precision    recall  f1-score   support

            English       0.50      0.12      0.20         8
           Japanese       0.72      0.93      0.81      1436
             Korean       0.28      0.04      0.07       163
Traditional Chinese       0.58      0.29      0.39       516

        avg / total       0.65      0.70      0.65      2123

CPU times: user 14.1 s, sys: 46.3 ms, total: 14.1 s
Wall time: 14.1 s


<a id="secf"></a>

### <span style="color:darkred">F. Naive Bayes - Bernoulli</span>  <a href='#top'>(top)</a>

> Should be best for boolean classification but not for multiple discrete values (multinomial should be better)

In [9]:
%%time
bnb_data,bnb = run_model(BernoulliNB(),'Naive Bayes - Bernoulli')

Total Score is: 0.6203485633537447

                     precision    recall  f1-score   support

            English       0.03      0.25      0.06         8
           Japanese       0.74      0.75      0.75      1436
             Korean       0.17      0.17      0.17       163
Traditional Chinese       0.47      0.40      0.43       516

        avg / total       0.63      0.62      0.62      2123

CPU times: user 236 ms, sys: 80.8 ms, total: 316 ms
Wall time: 333 ms


<a id="secg"></a>

### <span style="color:darkred">G. Decision Tree</span>  <a href='#top'>(top)</a>

> Visualizes most important features by hierarchy <br>Longer processing time

In [10]:
%%time
dt_data,dt = run_model(tree.DecisionTreeClassifier(criterion='entropy',max_depth=7),'Decision Tree')

Total Score is: 0.6994818652849741

                     precision    recall  f1-score   support

            English       0.00      0.00      0.00         8
           Japanese       0.72      0.93      0.81      1436
             Korean       0.11      0.01      0.02       163
Traditional Chinese       0.58      0.28      0.38       516

        avg / total       0.64      0.70      0.65      2123

CPU times: user 384 ms, sys: 32.4 ms, total: 416 ms
Wall time: 432 ms


In [11]:
# Render tree.
dot_data = tree.export_graphviz(
    dt, out_file=None,
    feature_names=X.columns,
    class_names=lang,
    filled=True
)

graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())

graph.write_png('decision_tree.png')

True

In [12]:
dimportance = list(zip(X.columns,dt.feature_importances_))
dimportance = dict(dimportance)
a1_sorted_keys = sorted(dimportance, key=dimportance.get, reverse=True)
p = []
for r in a1_sorted_keys:
    if dimportance[r] != 0:
        p.append(r)
#         print(r, dimportance[r])
        
print(p)

['sc', 'pos2_NN-MD', 'wc', 'pos2_DT-JJ', 'freq_score', 'pos2_PRP-VBD', 'pos2_NN-PRP', 'sent_subj', 'sent_pol', 'pos2_VB-PRP', 'pos2_PRP-RB', 'pos2_MD-VB', 'pos2_NN-TO', 'pos2_VBP-DT', 'pos2_NN-RB', 'pos2_WRB-NN', 'pos2_RB-JJ', 'pos2_NN-NN', 'pos2_VB-VBN', 'pos2_RB-DT', 'pos2_PRP-DT', 'pos2_CC-PRP', 'pos2_JJ-VBP', 'pos2_VBZ-DT', 'pos2_VBD-VB', 'pos2_DT-RBS', 'pos2_VBP-RB', 'pos2_VBZ-VBN', 'pos2_VBP-NN', 'pos2_VBG-NNP', 'pos2_PDT-DT', 'pos2_PRP-VBP', 'pos2_NN-VBP', 'pos2_VB-NN', 'pos2_IN-WRB', 'pos2_IN-NN', 'pos2_TO-NNP', 'pos2_NNP-IN', 'pos2_RB-RB', 'pos2_VBP-JJ', 'pos2_TO-VB', 'pos2_EX-VBP', 'pos2_NN-VBZ', 'pos2_NNS-DT', 'pos2_DT-NN', 'pos2_VBD-DT', 'pos2_VBD-PRP$', 'pos2_VBD-TO', 'pos2_NNP-RB', 'pos2_NN-IN', 'pos2_VBG-IN', 'pos2_JJ-PRP', 'pos2_IN-PRP$', 'pos2_JJ-JJ', 'pos2_VBN-DT', 'pos2_CC-WRB', 'pos2_PRP$-NN', 'pos2_NN-VBD', 'pos2_VBP-NNS', 'pos2_WP-VBZ', 'pos2_DT-VBP', 'pos2_VBN-IN', 'pos2_TO-JJ']


_Good visualization of important features and presentation of entropy weighting_

<a id="sech"></a>

### <span style="color:darkred">H. Random Forest</span>  <a href='#top'>(top)</a>

> Runs decision tree multiple times for best output <br>Longest processing time

In [13]:
%%time
rf_data,rf = run_model(ensemble.RandomForestClassifier(n_estimators=20),'Random Forest')

Total Score is: 0.7183231276495525

                     precision    recall  f1-score   support

            English       0.00      0.00      0.00         8
           Japanese       0.72      0.96      0.83      1436
             Korean       1.00      0.01      0.01       163
Traditional Chinese       0.67      0.27      0.39       516

        avg / total       0.73      0.72      0.65      2123

CPU times: user 1.07 s, sys: 62.6 ms, total: 1.14 s
Wall time: 1.16 s



Precision is ill-defined and being set to 0.0 in labels with no predicted samples.


F-score is ill-defined and being set to 0.0 in labels with no predicted samples.


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.



In [14]:
cvs = cross_val_score(rf, X_test, y_test, cv=5)
print(cvs)
scoreH = cvs.sum()/len(cvs)

[0.69086651 0.70117647 0.71058824 0.69503546 0.70449173]


In [15]:
rf.score(X_train,y_train)
print(scoreH)

0.7004316806364448


In [16]:
rf.feature_importances_
importance = list(zip(X.columns,rf.feature_importances_))

dimportance = dict(importance)

a1_sorted_keys = sorted(dimportance, key=dimportance.get, reverse=True)
for r in a1_sorted_keys:
    print(r, dimportance[r])

sc 0.0443354122937334
freq_score 0.02307865695073919
wc 0.019034921938334483
sent_subj 0.018102030396380822
sent_pol 0.01796207123057346
pos2_PRP-VBD 0.013221184764913935
pos2_DT-JJ 0.011932843017226049
pos2_PRP-VBP 0.011167899183883176
pos2_NN-MD 0.010261900978360102
pos2_DT-NN 0.009732288771361881
pos2_NN-IN 0.00939807323674675
pos2_MD-VB 0.009326911838550608
pos2_NN-PRP 0.009061431067542257
pos2_JJ-NN 0.008757156840537416
pos2_IN-NNP 0.008306137281656367
pos2_IN-DT 0.007891927436763228
pos2_NN-NN 0.007651707188445304
pos2_VB-DT 0.007579234392374168
pos2_JJ-VBP 0.007525118325629831
pos2_VBP-DT 0.007186537138337357
pos2_RB-JJ 0.007179126281652454
pos2_TO-VB 0.007061124770323159
pos2_PRP$-NN 0.006757835913391043
pos2_IN-NN 0.006626059953879236
pos2_IN-PRP 0.0063409263492180444
pos2_VBP-RB 0.006326589697681908
pos2_NN-VBZ 0.006264554771320196
pos2_NN-VBP 0.006099401857813937
pos2_PRP-MD 0.005761450109001553
pos2_NN-TO 0.005717913523297203
pos2_PRP$-JJ 0.005534133346326295
pos2_RB-VB 0.0

pos2_VBD-RBR 0.0003656192047793482
pos2_VBP-EX 0.0003626878708446051
pos2_CC-JJS 0.0003589781416905529
pos2_PRP$-DT 0.0003559668926072794
pos2_RBR-RB 0.00035577615427079176
pos2_DT-RBR 0.0003542367315026185
pos2_PRP$-CC 0.0003528569508872418
pos2_VBG-JJR 0.00034270524446897084
pos2_RP-JJ 0.0003411908791998226
pos2_RBR-NN 0.0003410633120163021
pos2_JJ-JJR 0.0003380032886705494
pos2_UH-PRP 0.0003239256819780248
pos2_VBP-RBR 0.00031960699057271056
pos2_VBG-WP 0.0003174888320902591
pos2_VBN-WRB 0.0003131458289408748
pos2_PRP-RBR 0.0003087395166026353
pos2_CD-TO 0.00030840614849352575
pos2_CD-POS 0.00030524888480638434
pos2_VBP-MD 0.0003034234159790867
pos2_VBD-JJR 0.0003032916744162812
pos2_NNP-NNPS 0.00030266567217129516
pos2_VBP-WRB 0.00029817312259938235
pos2_NNP-VBN 0.0002974348458778667
pos2_JJR-TO 0.00029727077069981276
pos2_POS-VB 0.0002899285829804962
pos2_VBZ-RP 0.00028500996436615276
pos2_TO-IN 0.00027498032754951396
pos2_VBD-WRB 0.0002728041415771333
pos2_NNS-RBR 0.0002672977339

<a id="sec2"></a>

# <span style="color:darkblue">2. Model Comparison</span>  <a href='#top'>(top)</a>

In [17]:
model_set.columns = ['name','time','total','prec: | JA | CH | KO | EN |','rec: | JA | CH | KO | EN |','f1: | JA | CH | KO | EN |']
model_set

Unnamed: 0,name,time,total,prec: | JA | CH | KO | EN |,rec: | JA | CH | KO | EN |,f1: | JA | CH | KO | EN |
0,Logistic Regression,0.04,0.756948,"[0.78, 0.70, 0.50, 0.24]","[0.92, 0.54, 0.12, 0.04]","[0.84, 0.61, 0.20, 0.06]"
1,K Nearest Neighbor,11.16,0.700424,"[0.72, 0.58, 0.50, 0.28]","[0.93, 0.29, 0.12, 0.04]","[0.81, 0.39, 0.20, 0.07]"
2,Naive Bayes - Bernoulli,0.09,0.620349,"[0.74, 0.47, 0.03, 0.17]","[0.75, 0.40, 0.25, 0.17]","[0.75, 0.43, 0.06, 0.17]"
3,Decision Tree,0.05,0.699482,"[0.72, 0.58, 0.00, 0.11]","[0.93, 0.28, 0.00, 0.01]","[0.81, 0.38, 0.00, 0.02]"
4,Random Forest,0.1,0.718323,"[0.72, 0.67, 0.00, 1.00]","[0.96, 0.27, 0.00, 0.01]","[0.83, 0.39, 0.00, 0.01]"


-----