<div style="text-align: right"><strong>Capstone #3:</strong> <span style="color:darkred">Supervised Learning</span> </div>

<a id="top"></a>

#### <span style="color:darkred">__Part 1: Data Exploration__ https://github.com/kimrharper/thinkful/blob/master/unit3/unit3-capstone-exploration.ipynb </span><br><br><span style="color:darkred">__Part 2: Models__ https://github.com/kimrharper/thinkful/blob/master/unit3/unit3-capstone-models.ipynb </span>

----

# <span style="color:darkred">Part 2: </span><span style="color:darkblue">L1 Prediction from ELL Writing Samples</span>

__Author:__ Ryan Harper 

----

<a href='#ov'>Overview</a><br>
<a href='#exp'>Experiment</a><br>
<a href='#sec1'>1. Models:</a><br>
><a href='#seca'>A. LR - Ordinary Least Squares</a><br>
<a href='#secb'>B. LR - Logistic Regression</a> <a href='#secb1'> (Lasso)</a> <a href='#secb2'> (Ridge)</a><br>
<a href='#secc'>C. NN - K Nearest Neighbors</a><br>
<a href='#secd'>D. NN - Naive Bayes</a><br>
<a href='#sece'>E. NN - Decision Tree</a><br>
<a href='#secf'>F. Ensemble - Random Forest</a><br>

<a href='#sec2'>2. Model Comparison</a><br>

<a id="ov"></a>

# <span style="color:darkblue">Overview</span>  <a href='#top'>(top)</a>

__Data Source:__
> http://lang-8.com/ [scraped with Beautiful Soup]

![alt text](../data/language/lang8.png "Title")

__Summary:__
> In my previous profession, I have been teaching English to a diverse range of students of all ages, language background, and country origin. During my professional development, I started to observe that different students with different L1s (1st Language) tended to display different patterns of communication that appeared to have some connection to either education in their country of origin or a connection to the linguistic structure of their first language. Different ELL (English Language Learners) needed to focus on different aspects of the English language depending on their background. The purpose of this project is to use a large number of blog posts from a language practicing website and explore whether or not the L1 has any significant impact on the blog writing style of the English learner.<br><br>Part 1: Explore the data to find any noteworthy trends in linguistic structure: <ol><li> vocabulary (word freq, collocations, and cognates) <li>syntax (sentence structure)<li>grammar (i.e. grammar complexity of sentences) <li>errors (types of errors) <li> parts of speech (NLTK Abbreviations: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/)</ol><br>Part 2: Use linguistic trends to determine whether or not a learner's first language can be predicted.

__Variables:__
>__id:__ _User ID_<br>
__time:__ _Time the blog post was scraped (in order of user posted time)_ <br>
__title:__ _Title of the blog post_<br>
__content:__ _The blog post_<br>
__language:__ _User's self-reported first language_

<a id="exp"></a>

# <span style="color:darkblue">Experiment</span> <a href='#top'>(top)</a>

__Hypothesis:__ 
> L1 (first language) experience and academic environment influences ELLs' (English Language Learners') writing style. The L1 of ELLs can be predicted by looking at English blog posts and identifying patterns unique to their L1.

__Observations:__
><li> --<li>--<li>--

__Method:__
> Using multiple different models. The aim of this project is to explore how different models can handle the data (target and features) and to see what information can be gained from using multiple different models. Ultimately, the goal is to determine which models are appropriate for a binary (discrete) target with features that are both qualitative (discrete) and quantitative (ranked/continuous).

<a id="sec1"></a>

# <span style="color:darkblue">1. Models:</span>  <a href='#top'>(top)</a>

In [1]:
# iPython/Jupyter Notebook
import time
from pprint import pprint
import warnings
from IPython.display import Image

import time

# Data processing
import pandas as pd
import plotly as plo
import seaborn as sns
from scipy import stats
from collections import Counter
import numpy as np
import itertools

# NLP
from nltk.corpus import stopwords as sw
from nltk.util import ngrams
from nltk.corpus import brown
import nltk
import re
from nltk.tokenize import RegexpTokenizer
import difflib

# Stats
from sklearn.metrics import classification_report, roc_curve,roc_auc_score,accuracy_score
from sklearn import metrics

# Preparing Models
from sklearn.model_selection import train_test_split

# Models
from sklearn import linear_model
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.naive_bayes import BernoulliNB,MultinomialNB,GaussianNB

# Ensemble
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

#Visualization
from IPython.display import Image
import pydotplus
import graphviz

# import altair as alt

In [2]:
features = pd.read_csv('blogfeatures.csv').sample(frac=1.0)
del features['Unnamed: 0']
del features['id']
lang = list(features.language.unique())

In [3]:
y = features['language'].values.reshape(-1, 1).ravel()
X = features[features.columns[~features.columns.str.contains('language')]]
X.head()

print(np.shape(y))
print(np.shape(X))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=30)  

(14148,)
(14148, 1056)


<a id="seca"></a>

__Create Function for Comparing Models__

In [4]:
cols = ['name','time','total','precision','recall','f1']

model_set = pd.DataFrame(columns=cols)
models_stored = []
pattern = "%.2f"

In [5]:
def run_model(model,name):
    global model_set
    m = model
    m.fit(X_train, y_train)
    start = time.time()

    total_score = m.score(X_test,y_test)
    pscore = [pattern % i for i in list(metrics.precision_score(y_test, m.predict(X_test),labels=lang,average=None))]
    rscore = [pattern % i for i in list(metrics.recall_score(y_test, m.predict(X_test),labels=lang,average=None))]
    fscore = [pattern % i for i in list(metrics.f1_score(y_test, m.predict(X_test),labels=lang,average=None))]
    end = time.time()
    t= pattern % (end - start)

    r = dict(zip(cols,[name,t,total_score,pscore,rscore,fscore]))
    print('Total Score is: {}\n'.format(total_score))
    print(classification_report(y_test, m.predict(X_test)))
    
    model_set = model_set.append(r,ignore_index=True)
    return r,m

### <span style="color:gray">A. LR - Ordinary Least Squares _(not used)_</span>  <a href='#top'>(top)</a>

> Target is discrete so this model may not be appropriate <br>Many features are binary so model may not be appropriate

<a id="secb"></a>

### <span style="color:darkred">B. LR - Logistic Regression</span>  <a href='#top'>(top)</a>

> Target is binary so logistic regression will operate on probabilities

In [6]:
%%time
lreg_data,lreg = run_model(linear_model.LogisticRegression(),'Logistic Regression')

Total Score is: 0.7597739048516251

                     precision    recall  f1-score   support

            English       0.00      0.00      0.00         6
           Japanese       0.79      0.93      0.85      1462
             Korean       0.59      0.06      0.10       178
Traditional Chinese       0.66      0.51      0.57       477

        avg / total       0.74      0.76      0.72      2123

CPU times: user 8.7 s, sys: 204 ms, total: 8.9 s
Wall time: 9.1 s


In [7]:
pd.crosstab(y_test,lreg.predict(X_test))

col_0,English,Japanese,Korean,Traditional Chinese
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
English,0,1,0,5
Japanese,2,1360,4,96
Korean,1,141,10,26
Traditional Chinese,1,230,3,243


<a id="secb1"></a>

### <span style="color:gray">C. Lasso _(not used)_</span>  <a href='#top'>(top)</a>

<a id="secb2"></a>

### <span style="color:gray">D. Ridge _(not used)_</span>  <a href='#top'>(top)</a>

_Lasso and Ridge are not good predictors so should I just be using them for parameter manipulation?_

<a id="sece"></a>

### <span style="color:darkred">E. K Nearest Neighbors</span>  <a href='#top'>(top)</a>

> Can handle discrete values for target <br>Quantitative values are limited (not continuous) and might be problematic for nearest neighbors

In [8]:
%%time
neighbors_data,neighbors = run_model(KNeighborsClassifier(n_neighbors=5),'K Nearest Neighbor')


Precision is ill-defined and being set to 0.0 in labels with no predicted samples.


F-score is ill-defined and being set to 0.0 in labels with no predicted samples.



Total Score is: 0.6928874234573716

                     precision    recall  f1-score   support

            English       0.00      0.00      0.00         6
           Japanese       0.72      0.93      0.81      1462
             Korean       0.20      0.02      0.04       178
Traditional Chinese       0.49      0.22      0.30       477

        avg / total       0.62      0.69      0.63      2123

CPU times: user 37.7 s, sys: 266 ms, total: 37.9 s
Wall time: 38.2 s



Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.



<a id="secf"></a>

### <span style="color:darkred">F. Naive Bayes - Bernoulli</span>  <a href='#top'>(top)</a>

> Should be best for boolean classification but not for multiple discrete values (multinomial should be better)

In [9]:
%%time
bnb_data,bnb = run_model(BernoulliNB(),'Naive Bayes - Bernoulli')

Total Score is: 0.6264719736222327

                     precision    recall  f1-score   support

            English       0.00      0.00      0.00         6
           Japanese       0.76      0.75      0.76      1462
             Korean       0.17      0.17      0.17       178
Traditional Chinese       0.42      0.43      0.43       477

        avg / total       0.63      0.63      0.63      2123

CPU times: user 552 ms, sys: 195 ms, total: 747 ms
Wall time: 865 ms


<a id="secg"></a>

### <span style="color:darkred">G. Decision Tree</span>  <a href='#top'>(top)</a>

> Visualizes most important features by hierarchy <br>Longer processing time

In [10]:
%%time
dt_data,dt = run_model(tree.DecisionTreeClassifier(criterion='entropy',max_depth=7),'Decision Tree')

Total Score is: 0.7145548751766369

                     precision    recall  f1-score   support

            English       0.00      0.00      0.00         6
           Japanese       0.73      0.94      0.82      1462
             Korean       0.00      0.00      0.00       178
Traditional Chinese       0.59      0.30      0.40       477

        avg / total       0.64      0.71      0.66      2123

CPU times: user 588 ms, sys: 67.5 ms, total: 656 ms
Wall time: 660 ms


In [11]:
# Render tree.
dot_data = tree.export_graphviz(
    dt, out_file=None,
    feature_names=X.columns,
    class_names=lang,
    filled=True
)

graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())

graph.write_png('decision_tree.png')

True

In [12]:
dimportance = list(zip(X.columns,dt.feature_importances_))
dimportance = dict(dimportance)
a1_sorted_keys = sorted(dimportance, key=dimportance.get, reverse=True)
p = []
for r in a1_sorted_keys:
    if dimportance[r] != 0:
        p.append(r)
#         print(r, dimportance[r])
        
print(p)

['sc', 'pos2_NN-MD', 'wc', 'pos2_DT-JJ', 'pos_RB', 'freq_score', 'pos_DT', 'pos2_PRP-DT', 'pos_VBP', 'pos2_IN-NNP', 'pos2_NN-NN', 'pos2_PRP-VBD', 'pos2_MD-VB', 'pos2_VB-PRP', 'pos2_PRP$-JJ', 'pos2_NN-PRP', 'pos2_NNP-IN', 'sent_subj', 'pos2_RB-JJ', 'pos2_TO-PRP', 'sent_pol', 'pos2_RB-MD', 'pos_PRP$', 'pos_VBZ', 'pos_NNP', 'pos2_VBG-NNP', 'pos2_NN-TO', 'pos2_JJ-NNS', 'pos2_VBZ-DT', 'pos2_VBP-PRP', 'pos2_NNP-NNP', 'pos2_NNS-WP', 'pos2_CC-NNP', 'pos2_NN-PRP$', 'pos2_IN-TO', 'pos2_NN-VBP', 'pos2_MD-NN', 'pos_VBD', 'pos_VB', 'pos_CD', 'pos_EX', 'pos2_VBD-PRP$', 'pos2_DT-NN', 'pos2_NNP-CD', 'pos2_NNS-NNP', 'pos2_VB-VBG', 'pos2_NNS-JJ', 'pos2_VBP-DT', 'pos_RP', 'pos_WP', 'pos2_NN-DT', 'pos2_IN-NN', 'pos2_RB-VBP', 'pos2_RP-DT', 'pos2_RB-PRP', 'pos2_VBN-RB', 'pos2_VBZ-VBN', 'pos2_CD-NN', 'pos2_NNS-VBZ', 'pos2_VBG-NNS', 'pos2_VB-TO', 'pos_CC', 'pos2_VB-WRB', 'pos2_JJ-NNP', 'pos2_CC-VBZ', 'pos2_NN-IN', 'pos2_WRB-VB', 'pos2_RP-RB', 'pos_PRP', 'pos_MD', 'pos2_JJ-TO', 'pos2_VBD-DT']


_Good visualization of important features and presentation of entropy weighting_

<a id="sech"></a>

### <span style="color:darkred">H. Random Forest</span>  <a href='#top'>(top)</a>

> Runs decision tree multiple times for best output <br>Longest processing time

In [13]:
%%time
rf_data,rf = run_model(ensemble.RandomForestClassifier(n_estimators=20),'Random Forest')

Total Score is: 0.7263306641544983

                     precision    recall  f1-score   support

            English       0.00      0.00      0.00         6
           Japanese       0.73      0.98      0.83      1462
             Korean       0.67      0.02      0.04       178
Traditional Chinese       0.72      0.22      0.34       477

        avg / total       0.72      0.73      0.65      2123

CPU times: user 1.24 s, sys: 122 ms, total: 1.37 s
Wall time: 1.38 s



Precision is ill-defined and being set to 0.0 in labels with no predicted samples.


F-score is ill-defined and being set to 0.0 in labels with no predicted samples.


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.



In [14]:
cvs = cross_val_score(rf, X_test, y_test, cv=5)
print(cvs)
scoreH = cvs.sum()/len(cvs)

[0.70023419 0.6971831  0.70518868 0.71158392 0.70212766]


In [15]:
rf.score(X_train,y_train)
print(scoreH)

0.7032635107597306


In [16]:
rf.feature_importances_
importance = list(zip(X.columns,rf.feature_importances_))

dimportance = dict(importance)

a1_sorted_keys = sorted(dimportance, key=dimportance.get, reverse=True)
for r in a1_sorted_keys:
    print(r, dimportance[r])

sc 0.03606740214441025
freq_score 0.019202242897463924
wc 0.01379507750411924
sent_pol 0.013535244234113276
sent_subj 0.013481485223795017
pos_PRP 0.012637773024977348
pos_NN 0.011072688421472996
pos_JJ 0.00987103331982811
pos_IN 0.009823337519891173
pos_DT 0.009806532856006531
pos_RB 0.00950442311663418
pos2_PRP-VBD 0.009236300628400026
pos_VBD 0.009159356873091206
pos_VB 0.009117950000039909
pos_NNP 0.009029660287195696
pos2_DT-JJ 0.008806156846474938
pos_MD 0.008352720692458855
pos2_NN-PRP 0.00803933598364678
pos_VBP 0.008010158170020964
pos2_DT-NN 0.007915378550335373
pos2_PRP-VBP 0.007567219196645188
pos_VBZ 0.0073541384603109565
pos2_NN-IN 0.006903032490827127
pos2_IN-NNP 0.0068908444845073875
pos_CC 0.006687260820474163
pos2_NN-MD 0.006572396656395657
pos_TO 0.006495071589868869
pos_PRP$ 0.0064725370394995594
pos2_IN-DT 0.006440360133442823
pos_NNS 0.006437623615677732
pos2_MD-VB 0.0062775643080684495
pos_VBG 0.006178967679943645
pos2_TO-VB 0.00608459307189922
pos2_JJ-NN 0.00592

pos2_VBG-VB 0.0003676104220019796
pos2_IN-JJR 0.0003584336022289327
pos2_VBN-RP 0.0003557771032878522
pos2_NN-RP 0.00035500450350719196
pos2_VBG-VBZ 0.0003527933961009947
pos2_POS-NNS 0.0003518624294359209
pos2_DT-VBN 0.0003513565585184877
pos2_NNS-CD 0.00034971114168732116
pos2_JJ-WP 0.0003496555203526958
pos2_WDT-IN 0.00034784122815552593
pos2_WRB-NNP 0.0003460765901255126
pos2_VBZ-WP 0.00034606323462434764
pos2_RP-PRP 0.0003429435688507951
pos2_JJR-JJ 0.0003428092513262173
pos2_VB-POS 0.0003390924874132117
pos2_WRB-VBD 0.0003355460793418787
pos2_VBN-VBZ 0.00033458410446649694
pos2_MD-IN 0.00033408568840481376
pos2_VBG-VBN 0.00033043746757840286
pos2_RP-NNS 0.00032804724716862306
pos2_POS-VB 0.0003232328873121156
pos2_NNS-PRP$ 0.00032304568638388505
pos2_VBZ-VBD 0.00032227875346308105
pos_FW 0.00032148974222547823
pos2_EX-MD 0.0003196049228762456
pos2_VBG-CD 0.0003194350903202748
pos2_PRP-VBN 0.00031735676345946104
pos2_RP-TO 0.00031501811510597496
pos2_UH-DT 0.0003135685400150135
po

<a id="sec2"></a>

# <span style="color:darkblue">2. Model Comparison</span>  <a href='#top'>(top)</a>

In [17]:
model_set.columns = ['name','time','total','prec: | JA | CH | KO | EN |','rec: | JA | CH | KO | EN |','f1: | JA | CH | KO | EN |']
model_set

Unnamed: 0,name,time,total,prec: | JA | CH | KO | EN |,rec: | JA | CH | KO | EN |,f1: | JA | CH | KO | EN |
0,Logistic Regression,0.09,0.759774,"[0.79, 0.66, 0.59, 0.00]","[0.93, 0.51, 0.06, 0.00]","[0.85, 0.57, 0.10, 0.00]"
1,K Nearest Neighbor,30.0,0.692887,"[0.72, 0.49, 0.20, 0.00]","[0.93, 0.22, 0.02, 0.00]","[0.81, 0.30, 0.04, 0.00]"
2,Naive Bayes - Bernoulli,0.18,0.626472,"[0.76, 0.42, 0.17, 0.00]","[0.75, 0.43, 0.17, 0.00]","[0.76, 0.43, 0.17, 0.00]"
3,Decision Tree,0.06,0.714555,"[0.73, 0.59, 0.00, 0.00]","[0.94, 0.30, 0.00, 0.00]","[0.82, 0.40, 0.00, 0.00]"
4,Random Forest,0.13,0.726331,"[0.73, 0.72, 0.67, 0.00]","[0.98, 0.22, 0.02, 0.00]","[0.83, 0.34, 0.04, 0.00]"


-----