# Modeling
> - The first performance indicator is accuracy: we want the model to make correct predictions most of the time, and at minimum more often than the baseline.  
> - The second performance indicator is precision. The most frequent language is our positive case, so we want to make sure the model does not mislabel the other languages as the positive case.

In [1]:
# Ignoring warning messages from python
import warnings
warnings.filterwarnings('ignore')

# General use imports
import pandas as pd
import numpy as np

# Specific Modules
import prepare as prep
import json
import os
import unicodedata
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import export_graphviz



# Visualization imports
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly
import plotly.express as px

## 1. Getting the data

In [2]:
df = prep.prep_data()

In [3]:
df.shape

(353, 3)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 353 entries, 1 to 475
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   repo      353 non-null    object
 1   language  353 non-null    object
 2   content   353 non-null    object
dtypes: object(3)
memory usage: 11.0+ KB


In [5]:
df.head()

Unnamed: 0,repo,language,content
1,CharlesPikachu/Games,Python,div aligncenter img srcdocslogopng width600 di...
2,channingbreeze/games,JavaScript,### phaser phaserphaserphaserhttpwwwphaserchin...
3,arcxingye/EatKano,JavaScript,p aligncenter hrefhttpsxingyemegameeatkanoimg ...
4,coding-horror/basic-computer-games,C#,### updating first million selling computer bo...
5,rwv/chinese-dos-games,Python,# do do 1898 python 3 python python downloadda...


## 2. Splitting and vectorizing

In [6]:
# Calling the split function and displaying the shape of the datasets

train, validate, test = prep.split_data(df)

In [7]:
train.shape

(197, 3)

In [8]:
validate.shape

(85, 3)

In [9]:
test.shape

(71, 3)
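
`prep.split_data` is not shown in this notebook. The sketch below is one split that would reproduce the 197/85/71 shapes above, assuming a 20% test split followed by a 30% validate split, both stratified on `language`. The function name `split_data_sketch`, the ratios, and the seed are assumptions for illustration, not the contents of `prepare.py`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df, target="language", seed=123):
    """Hypothetical stand-in for prep.split_data: 20% test, then 30% of the rest as validate."""
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    train, validate = train_test_split(
        train_validate, test_size=0.3, random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


# Toy frame with the same number of rows as the real data (353)
toy = pd.DataFrame({
    "content": [f"readme text {i}" for i in range(353)],
    "language": ["Python", "JavaScript", "C++", "Java"] * 88 + ["Python"],
})
train_demo, validate_demo, test_demo = split_data_sketch(toy)
print(train_demo.shape, validate_demo.shape, test_demo.shape)
```

Stratifying keeps the language proportions similar across the three sets, which matters when some classes have only a handful of rows.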

In [10]:
# Vectorizing and 'learning' on train dataset

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train.content)
y_train = train.language

In [11]:
X_train

<197x13439 sparse matrix of type '<class 'numpy.float64'>'
	with 33640 stored elements in Compressed Sparse Row format>

In [12]:
y_train

115       C++
395      HTML
166        C#
26        C++
162       C++
        ...  
217    Python
156       C++
116    Python
158      Java
153      Java
Name: language, Length: 197, dtype: object

In [13]:
# Applying the fitted vectorizer to validate (transform only, so nothing is learned from validate)
X_validate = tfidf.transform(validate.content)
y_validate = validate.language

In [14]:
# Applying the fitted vectorizer to test (transform only, so nothing is learned from test)
X_test = tfidf.transform(test.content)
y_test = test.language

In [15]:
X_validate.shape, X_train.shape

((85, 13439), (197, 13439))
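
To illustrate why the vectorizer is fit on train only, here is a small sketch with a made-up corpus. Words that appear only in validate are not in the learned vocabulary and are dropped at `transform` time, so both matrices keep the same columns. The documents and variable names are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only
train_docs = ["python game engine", "javascript browser game", "python script"]
validate_docs = ["rust game", "zig compiler"]  # 'rust', 'zig', 'compiler' are unseen

tfidf_demo = TfidfVectorizer()
X_train_demo = tfidf_demo.fit_transform(train_docs)     # learns vocabulary and IDF weights
X_validate_demo = tfidf_demo.transform(validate_docs)   # reuses them, learns nothing new

print(sorted(tfidf_demo.vocabulary_))
print(X_train_demo.shape, X_validate_demo.shape)

# The second validate document shares no words with train, so its row is all zeros
print(X_validate_demo[1].nnz)
```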

## Logistic Regression Models

>### Building a baseline, then fitting a model on the train dataset

In [16]:
# Creating a dataframe of the target variable
y_train_df = pd.DataFrame(dict(actual=y_train))

In [17]:
y_train_df.head()

Unnamed: 0,actual
115,C++
395,HTML
166,C#
26,C++
162,C++


In [18]:
# Checking the unique content of the actuals
y_train_df.actual.nunique(), y_train_df.actual.unique()

(8,
 array(['C++', 'HTML', 'C#', 'C', 'JavaScript', 'Java', 'Python',
        'TypeScript'], dtype=object))

In [19]:
# checking the value count to decide the baseline target
y_train_df.actual.value_counts()

JavaScript    50
C++           36
Python        33
C#            24
C             18
Java          16
HTML          10
TypeScript    10
Name: actual, dtype: int64

In [20]:
# JavaScript is the most frequent language, so it serves as the baseline prediction
y_train_df['baseline'] = y_train_df['actual'] == 'JavaScript'
y_train_df.head()

Unnamed: 0,actual,baseline
115,C++,False
395,HTML,False
166,C#,False
26,C++,False
162,C++,False


In [21]:
# Calculating the baseline model
print(f'The baseline model has an average capture rate for Javascript of: {y_train_df.baseline.mean():.2%}')

The baseline model has an average capture rate for Javascript of: 25.38%
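
The same baseline can be cross-checked with scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The sketch below rebuilds the label counts printed above as a toy series (the ordering of the labels is arbitrary) instead of using the real `y_train`.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Class counts copied from the value_counts() output above
counts = {"JavaScript": 50, "C++": 36, "Python": 33, "C#": 24,
          "C": 18, "Java": 16, "HTML": 10, "TypeScript": 10}
y_demo = pd.Series([lang for lang, n in counts.items() for _ in range(n)])

# Features are ignored by the 'most_frequent' strategy, so a dummy column is enough
X_dummy = np.zeros((len(y_demo), 1))
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_demo)

baseline_accuracy = baseline.score(X_dummy, y_demo)
print(baseline.predict(X_dummy[:1])[0], f"{baseline_accuracy:.2%}")
```

Any model worth keeping should score above this number on out-of-sample data.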


In [22]:
# Applying the LogisticRegression (default settings) on the train dataset

lm = LogisticRegression().fit(X_train, y_train)

y_train_df['predicted'] = lm.predict(X_train)

In [23]:
# Displaying actual, baseline, and predicted
y_train_df.head()

Unnamed: 0,actual,baseline,predicted
115,C++,False,C++
395,HTML,False,HTML
166,C#,False,C#
26,C++,False,C++
162,C++,False,C++


In [24]:
# Displaying the accuracy and classification report of the model

print('Accuracy: {:.2%}'.format(accuracy_score(y_train_df.actual, y_train_df.predicted)))
print('------------------')
print('Confusion Matrix')
print(pd.crosstab(y_train_df.predicted, y_train_df.actual))
print('------------------')
print('Classification Report')
print('\n')
print(classification_report(y_train_df.actual, y_train_df.predicted))

Accuracy: 85.79%
------------------
Confusion Matrix
actual       C  C#  C++  HTML  Java  JavaScript  Python  TypeScript
predicted                                                          
C           11   0    0     0     0           0       0           0
C#           0  24    0     0     0           0       0           0
C++          2   0   35     0     2           0       1           0
HTML         0   0    0     6     0           0       0           0
Java         0   0    0     0     9           0       0           0
JavaScript   5   0    1     4     5          50       0           8
Python       0   0    0     0     0           0      32           0
TypeScript   0   0    0     0     0           0       0           2
------------------
Classification Report


              precision    recall  f1-score   support

           C       1.00      0.61      0.76        18
          C#       1.00      1.00      1.00        24
         C++       0.88      0.97      0.92        36
       

>### Visualizing the model

In [None]:
y_pred_proba = lm.predict_proba(X_train)
y_pred_proba

In [None]:
# y_pred is the same as y_train_df.predicted
# I am just using this label for personal clarity

y_pred = lm.predict(X_train)

In [None]:
y_pred_proba.shape, y_pred.shape

In [None]:
type(y_pred_proba)

In [None]:
type(y_pred)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(y_pred_proba, y_pred)

>### No luck with plotting the Logistic Regression either. The size is definitely an issue but I'll finish up and come back to it
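
One likely reason the scatter call fails is a shape mismatch: `predict_proba` returns one column per class, `(n_samples, n_classes)`, while `predict` returns a single label per row. A minimal sketch of one workaround, on synthetic data instead of the README features, is to keep only the probability of the predicted class for each sample and plot that against the predicted label. The synthetic dataset and the variable names are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem standing in for the TF-IDF features
X_demo, y_demo = make_classification(
    n_samples=120, n_features=10, n_informative=5, n_classes=3, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)   # shape (120, 3): one column per class
pred = clf.predict(X_demo)          # shape (120,): one label per sample
confidence = proba.max(axis=1)      # probability assigned to the predicted class

# Both arrays now have one value per sample, so scatter accepts them
fig, ax = plt.subplots()
ax.scatter(confidence, pred, alpha=0.5)
ax.set_xlabel("probability of predicted class")
ax.set_ylabel("predicted class")
```

In a notebook the figure renders inline once the `matplotlib.use("Agg")` line is removed.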

>### Running the model on the validate dataset

In [36]:
# Predicting on validate
lm = LogisticRegression().fit(X_train, y_train)
y_validate_pred = lm.predict(X_validate)

In [39]:
print('Accuracy: {:.2%}'.format(accuracy_score(y_validate, y_validate_pred)))
print('------------------')
print('Confusion Matrix')
print(pd.crosstab(y_validate_pred, y_validate))
print('------------------')
print('Classification Report')
print('\n')
print(classification_report(y_validate, y_validate_pred))

Accuracy: 45.88%
------------------
Confusion Matrix
language    C  C#  C++  HTML  Java  JavaScript  Python  TypeScript
row_0                                                             
C           1   0    0     0     0           0       0           0
C#          0   2    0     0     0           0       0           0
C++         1   0    6     0     3           0       1           0
HTML        0   0    0     2     0           0       0           0
JavaScript  5   9    9     2     4          22       7           4
Python      0   0    1     0     0           0       6           0
------------------
Classification Report


              precision    recall  f1-score   support

           C       1.00      0.14      0.25         7
          C#       1.00      0.18      0.31        11
         C++       0.55      0.38      0.44        16
        HTML       1.00      0.50      0.67         4
        Java       0.00      0.00      0.00         7
  JavaScript       0.35      1.00      0.52

>### Summary for the Logistic Regression model  
> - On train, accuracy is 85.79%.
> - On validate, accuracy drops to 45.88%, which still beats the 25.38% baseline. The gap between train and validate suggests the model is overfitting.
> - For JavaScript, recall is 100% and precision is 35%: the model catches every JavaScript repository, but it also labels many repositories written in other languages as JavaScript.
> - I will run a Random Forest model to get a different perspective.
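
The JavaScript recall and precision quoted above can be recomputed by hand from the validate confusion matrix, which makes the trade-off concrete. The counts below are copied from the JavaScript row and the diagonal of that matrix.

```python
# Counts read off the validate confusion matrix (rows = predicted, columns = actual)
predicted_js_by_actual = {"C": 5, "C#": 9, "C++": 9, "HTML": 2,
                          "Java": 4, "JavaScript": 22, "Python": 7, "TypeScript": 4}
actual_js_total = 22                      # every actual JavaScript repo sits in the JavaScript row
correct_per_class = [1, 2, 6, 2, 22, 6]   # diagonal: C, C#, C++, HTML, JavaScript, Python
n_validate = 85

true_positives = predicted_js_by_actual["JavaScript"]
predicted_js_total = sum(predicted_js_by_actual.values())

precision_js = true_positives / predicted_js_total   # share of JavaScript predictions that are right
recall_js = true_positives / actual_js_total         # share of JavaScript repos that are found
accuracy = sum(correct_per_class) / n_validate

print(f"precision={precision_js:.2%} recall={recall_js:.2%} accuracy={accuracy:.2%}")
```

Of the 62 repositories labeled JavaScript, only 22 actually are, which is why precision is low even though recall is perfect.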

## Random Forest Model

In [40]:
# Looping over a range of max_depth values for the model to consider
# Random Forest evaluated on the train dataset
for i in range(3, 13):
    multi_depth = RandomForestClassifier(max_depth=i, random_state=175)

    all_rf = multi_depth.fit(X_train, y_train)

    y_pred_rfc = all_rf.predict(X_train)

    report = classification_report(y_train, y_pred_rfc, output_dict=True)
    print(f'RandomForest depth {i}\n')
    print(pd.DataFrame(report))
    print('\n=======================\n')

RandomForest depth 3

              C     C#        C++   HTML       Java  JavaScript     Python  \
precision   0.0   1.00   1.000000   1.00   1.000000    0.316456   1.000000   
recall      0.0   0.25   0.361111   0.60   0.062500    1.000000   0.393939   
f1-score    0.0   0.40   0.530612   0.75   0.117647    0.480769   0.565217   
support    18.0  24.00  36.000000  10.00  16.000000   50.000000  33.000000   

           TypeScript  accuracy   macro avg  weighted avg  
precision         0.0  0.451777    0.664557      0.684380  
recall            0.0  0.451777    0.333444      0.451777  
f1-score          0.0  0.451777    0.355531      0.410026  
support          10.0  0.451777  197.000000    197.000000  


RandomForest depth 4

              C         C#        C++   HTML       Java  JavaScript  \
precision   0.0   1.000000   1.000000   1.00   1.000000    0.340136   
recall      0.0   0.375000   0.500000   0.60   0.062500    1.000000   
f1-score    0.0   0.545455   0.666667   0.75   0.1

>### The best depth is 12, with 82.74% accuracy, 100% recall, and a low precision of 59.52%. The precision of this model is higher than that of the Logistic Regression model.  
We limited the depth to 12 so as not to overfit the model.

>### On out-of-sample data, Validate

In [41]:
# Random Forest Default values on validate dataset
for i in range(3, 13):
    multi_depth = RandomForestClassifier(max_depth=i, random_state=175)

    all_rf = multi_depth.fit(X_train, y_train)

    y_pred_validate_rfc = all_rf.predict(X_validate)

    report = classification_report(y_validate, y_pred_validate_rfc, output_dict=True)
    print(f'RandomForest depth {i}\n')
    print(pd.DataFrame(report))
    print('\n=======================\n')

RandomForest depth 3

             C    C#        C++      HTML  Java  JavaScript     Python  \
precision  0.0   0.0   0.666667  1.000000   0.0    0.282051   1.000000   
recall     0.0   0.0   0.125000  0.500000   0.0    1.000000   0.142857   
f1-score   0.0   0.0   0.210526  0.666667   0.0    0.440000   0.250000   
support    7.0  11.0  16.000000  4.000000   7.0   22.000000  14.000000   

           TypeScript  accuracy  macro avg  weighted avg  
precision         0.0  0.329412   0.368590      0.410256  
recall            0.0  0.329412   0.220982      0.329412  
f1-score          0.0  0.329412   0.195899      0.226060  
support           4.0  0.329412  85.000000     85.000000  


RandomForest depth 4

             C         C#        C++      HTML  Java  JavaScript     Python  \
precision  0.0   1.000000   0.666667  1.000000   0.0    0.297297   1.000000   
recall     0.0   0.090909   0.250000  0.500000   0.0    1.000000   0.142857   
f1-score   0.0   0.166667   0.363636  0.666667   0.

>### Summary for Random Forest Classifier  
> - On train, the best depth is 12, with 82.74% accuracy, 100% recall, and a low precision of 56.81%.
> - On validate, depths 11 and 12 tie for the best accuracy at 44.7%.
> - Recall is 95.45% at both depths, but precision differs: 36.84% at depth 11 versus 35% at depth 12. We will pick depth 11.
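
Scanning ten printed reports by eye is error-prone. A sketch of an alternative that collects the train and validate scores for each depth into a single table is shown below. It runs on synthetic data, and the helper name `depth_sweep` and the use of label `0` as the positive class are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def depth_sweep(X_tr, y_tr, X_val, y_val, positive, depths=range(3, 13)):
    """Fit one forest per depth and return one row of scores per depth."""
    rows = []
    for depth in depths:
        rf = RandomForestClassifier(max_depth=depth, random_state=175).fit(X_tr, y_tr)
        val_pred = rf.predict(X_val)
        rows.append({
            "max_depth": depth,
            "train_accuracy": accuracy_score(y_tr, rf.predict(X_tr)),
            "validate_accuracy": accuracy_score(y_val, val_pred),
            # Per-class scores for the positive class only
            "validate_precision": precision_score(
                y_val, val_pred, labels=[positive], average=None, zero_division=0)[0],
            "validate_recall": recall_score(
                y_val, val_pred, labels=[positive], average=None, zero_division=0)[0],
        })
    return pd.DataFrame(rows).set_index("max_depth")


# Synthetic 3-class data standing in for the TF-IDF matrices
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

scores = depth_sweep(X_tr, y_tr, X_val, y_val, positive=0)
best_depth = scores["validate_accuracy"].idxmax()
print(scores.round(3))
print("best depth on validate:", best_depth)
```

A large gap between `train_accuracy` and `validate_accuracy`, like the 82.74% versus 44.7% reported above, is the usual sign of overfitting.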

>### Plotting the models

In [None]:
y_train.values

In [None]:
# I doubt the sparse matrix will work with the code below but will try
X_train

In [None]:
# Displaying model for best depth which is 12
model = RandomForestClassifier(max_depth=12, random_state=175)

# Train
model.fit(X_train, y_train)
# Extract single tree
estimator = model.estimators_[5]

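# Export as dot file
# feature_names needs one vocabulary term per column of X_train, and
# class_names the distinct labels in the order the model stores them
export_graphviz(estimator, out_file='tree.dot', 
                feature_names = list(tfidf.get_feature_names_out()),  # requires scikit-learn >= 1.0
                class_names = list(model.classes_),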
                rounded = True, proportion = False, 
                precision = 2, filled = True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')

>### Maybe resorting to the conversion performed in the curriculum would work but that is not an MVP. Pursue once all is done for an MVP.
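
If Graphviz is not installed, a text rendering of one tree is a lighter-weight option. Below is a minimal sketch using `sklearn.tree.export_text` on a made-up corpus. The documents and labels are invented, and the feature names are taken from the vectorizer's `vocabulary_` mapping so that they line up with the matrix columns.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import export_text

# Invented mini-corpus, for illustration only
docs = ["pip install pygame", "npm install phaser", "python pygame loop",
        "node npm canvas", "pip python script", "javascript canvas npm"]
labels = ["Python", "JavaScript", "Python", "JavaScript", "Python", "JavaScript"]

vec = TfidfVectorizer()
X_docs = vec.fit_transform(docs)  # a sparse matrix is fine for tree models

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=175)
forest.fit(X_docs, labels)

# Column i of X_docs corresponds to the word whose vocabulary index is i
feature_names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Trees inside a forest store encoded class indices, so leaves print as 0.0 / 1.0
tree_text = export_text(forest.estimators_[0], feature_names=feature_names, max_depth=3)
print(tree_text)
```

The class indices in the printed leaves map to `forest.classes_` by position.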

## K Nearest Neighbor Model (KNN) 

In [42]:
# KNN Default values on train dataset
for i in range(3, 13):
    knn_depth = KNeighborsClassifier(n_neighbors=i)

    all_knn = knn_depth.fit(X_train, y_train)

    y_pred_knn = all_knn.predict(X_train)

    report = classification_report(y_train, y_pred_knn, output_dict=True)
    print(f'KNN depth {i}\n')
    print(pd.DataFrame(report))
    print('\n=======================\n')

KNN depth 3

                   C     C#        C++   HTML       Java  JavaScript  \
precision   0.173077   1.00   0.952381   1.00   1.000000    1.000000   
recall      1.000000   0.25   0.555556   0.60   0.437500    0.560000   
f1-score    0.295082   0.40   0.701754   0.75   0.608696    0.717949   
support    18.000000  24.00  36.000000  10.00  16.000000   50.000000   

              Python  TypeScript  accuracy   macro avg  weighted avg  
precision   1.000000    1.000000  0.558376    0.890682      0.915742  
recall      0.666667    0.300000  0.558376    0.546215      0.558376  
f1-score    0.800000    0.461538  0.558376    0.591877      0.631099  
support    33.000000   10.000000  0.558376  197.000000    197.000000  


KNN depth 4

               C     C#        C++   HTML       Java  JavaScript     Python  \
precision   0.25   0.75   0.913043   1.00   0.909091    0.820513   0.923077   
recall      1.00   0.50   0.583333   0.60   0.625000    0.640000   0.727273   
f1-score    0.40   

>### The best setting on the train dataset is n_neighbors=4 (labeled "depth" in the output), with 63.95% accuracy, 92.30% precision, and 72.73% recall

>### On out-of-sample data

In [43]:
# KNN Default values on validate dataset
for i in range(3, 13):
    knn_depth = KNeighborsClassifier(n_neighbors=i)

    all_knn = knn_depth.fit(X_train, y_train)

    y_val_pred_knn = all_knn.predict(X_validate)

    report = classification_report(y_validate, y_val_pred_knn, output_dict=True)
    print(f'KNN depth {i}\n')
    print(pd.DataFrame(report))
    print('\n=======================\n')

KNN depth 3

                  C         C#        C++      HTML  Java  JavaScript  \
precision  0.113208   0.400000   0.571429  1.000000   0.0    0.692308   
recall     0.857143   0.181818   0.250000  0.500000   0.0    0.409091   
f1-score   0.200000   0.250000   0.347826  0.666667   0.0    0.514286   
support    7.000000  11.000000  16.000000  4.000000   7.0   22.000000   

              Python  TypeScript  accuracy  macro avg  weighted avg  
precision   0.200000         0.0  0.282353   0.372118      0.427836  
recall      0.071429         0.0  0.282353   0.283685      0.282353  
f1-score    0.105263         0.0  0.282353   0.260505      0.296116  
support    14.000000         4.0  0.282353  85.000000     85.000000  


KNN depth 4

                  C         C#        C++      HTML  Java  JavaScript  \
precision  0.162162   0.272727   0.333333  1.000000   0.0    0.647059   
recall     0.857143   0.272727   0.125000  0.500000   0.0    0.500000   
f1-score   0.272727   0.272727   0.18

>### Summary for K Nearest Neighbor  
> - On train, the best value of n_neighbors is 4, with 63.95% accuracy and 92.30% precision. This is the highest precision so far.
> - On validate, the best value of n_neighbors is 6, with 47.06% accuracy and 72.22% precision.
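
The number of neighbors can also be picked programmatically. Below is a short sketch on synthetic data, assuming validation accuracy is the selection criterion. The dataset and variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the TF-IDF matrices
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Validation accuracy for each candidate number of neighbors
val_accuracy = {}
for k in range(3, 13):
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    val_accuracy[k] = accuracy_score(y_val, knn_k.predict(X_val))

# max() returns the first key with the highest score, so ties go to the smaller k
best_k = max(val_accuracy, key=val_accuracy.get)
print(best_k, round(val_accuracy[best_k], 3))
```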

>### General Summary  
> - The model that performs best on the validate dataset is the K-Nearest Neighbors classifier with n_neighbors=6, at 47.06% accuracy and 72.22% precision, ahead of Logistic Regression (45.88% accuracy) and Random Forest at depth 11 (44.7% accuracy).
> - Now, I will run this model on the test dataset with the same hyperparameters.

>### K-Nearest Neighbors classifier with n_neighbors=6 on the test dataset

In [44]:
# Creating and fitting the model, then predicting on the test dataset

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
y_pred_test_knn = knn.predict(X_test)
report = classification_report(y_test, y_pred_test_knn, output_dict=True)
print('Classification Report on test dataset')
print('\n=======================\n')
print(pd.DataFrame(report))

Classification Report on test dataset


                  C        C#        C++      HTML      Java  JavaScript  \
precision  0.200000  0.250000   0.363636  1.000000  0.500000    0.421053   
recall     0.333333  0.444444   0.307692  0.333333  0.166667    0.444444   
f1-score   0.250000  0.320000   0.333333  0.500000  0.250000    0.432432   
support    6.000000  9.000000  13.000000  3.000000  6.000000   18.000000   

              Python  TypeScript  accuracy  macro avg  weighted avg  
precision   0.700000         0.0  0.380282   0.429336      0.424735  
recall      0.583333         0.0  0.380282   0.326656      0.380282  
f1-score    0.636364         0.0  0.380282   0.340266      0.382162  
support    12.000000         4.0  0.380282  71.000000     71.000000  


>### The K-Nearest Neighbors model with n_neighbors=6 has 38.03% accuracy and 42.1% precision for JavaScript on the test dataset.
> Of the models tried here, it is the best at predicting the language used to build a game from the README content.
> It beats the 25.38% baseline by 12.65 percentage points.
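
The margin over the baseline is a difference of two proportions, so it is expressed in percentage points. The check below uses counts inferred from the outputs above: 27 correct out of 71 test rows, which matches the reported accuracy of 0.380282, and 50 JavaScript rows out of 197 in train.

```python
test_correct, test_total = 27, 71        # 27 / 71 = 0.380282, the reported test accuracy
baseline_hits, train_total = 50, 197     # JavaScript share of the train labels

test_accuracy = test_correct / test_total
baseline_accuracy = baseline_hits / train_total
lift_points = (test_accuracy - baseline_accuracy) * 100

print(f"{test_accuracy:.2%} - {baseline_accuracy:.2%} = {lift_points:.2f} percentage points")
```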