# Cell Classification - End Model (LSTM)

In this notebook we use the labeled data generated with Snorkel in the previous notebook ([here](Classification.ipynb)) to train a supervised LSTM model that classifies a cell's source code into the relevant data-science workflow stage (multi-class text classification).

In [1]:
# First let's import relevant libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, Dropout
from keras.models import Sequential
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical  # `keras.utils.np_utils` is not available in newer Keras releases
from keras.callbacks import EarlyStopping
from keras.metrics import mean_squared_error
from keras.models import load_model

# Input data files are available in the "input/" directory.
import os
print(os.listdir("input/"))

# Any results you write to the current directory are saved as output.

ModuleNotFoundError: No module named 'keras'

In [2]:
# load our tagged Data
data = pd.read_csv('input/input.tsv', delimiter='\t', usecols=['Cell ID', 'Source', 'Label'])

## Pre-Processing

In [3]:
# first we'll remove cells that Snorkel didn't tag (missing or 'Unknown' label)
data.dropna(subset=['Label'], how='all', inplace = True)
data = data[data.Label != 'Unknown']

In [4]:
#now let's take a look at some random cells
data.sample(5)

Unnamed: 0,Cell ID,Source,Label
37965,kabure_#_predicting-house-prices-xgb-rf-baggin...,"df_usa = pd.read_csv(""../input/kc_house_data.c...",Load
69036,akashravichandran_#_pandas-tutorial-6_#_6,"gaming_products = pd.read_csv(""../input/things...",Load
74003,gpehls_#_indexing-selecting-assigning_#_11,check_q8(reviews.loc[reviews.country=='Italy']),Prep
49329,llabhishekll_#_fraud-transaction-detection_#_24,fig = plt.figure()\r\r\nax = fig.add_subplot(1...,Explore
68283,katerynad_#_data-exploring-part-1-indicators_#_14,['#indicators common for as many companies as ...,Prep


### Class Imbalance

Let's take a look at the value counts of the tagged data.

In [5]:
#now let's see
data.Label.value_counts()

Explore    31813
Prep       15344
Eval        7897
Load        7176
Import      4980
Train       3052
Name: Label, dtype: int64

We can see the classes are imbalanced: the data exploration class has many more cells than the others. We want balanced classes for training, so we take at most a fixed number of cells from each class (undersampling the large classes).

In [6]:
# first we shuffle the data by randomly re-indexing
shuffled = data.reindex(np.random.permutation(data.index))
shuffled.head(5) # check the data is indeed shuffled


Unnamed: 0,Cell ID,Source,Label
886,shelars1985_#_bitcoin-vs-ethereum-candlestick-...,"['f,ax=plt.subplots(figsize=(15,11))\n', 'ax.x...",Explore
50910,asindico_#_porto-seguro-the-essential-kickstar...,"[""tmp = pd.concat([df['target'],df[ind_con]],a...",Explore
42821,alaeddineayadi_#_neural-net-solution-with-kera...,"['#KERAS MODEL DEFINITION\n', 'from keras.laye...",Train
26280,kerneler_#_starter-govdata360-998ff914-e_#_4,# Correlation matrix\r\r\r\ndef plotCorrelatio...,Explore
24555,xavierbourretsicotte_#_localizing-utc-time-eda...,from pandas.io.json import json_normalize\r\r\...,Prep


In [7]:
fixed_class_size = 5000 #the fixed size was selected by trial and error
l  = shuffled[shuffled['Label'] == 'Load'][:fixed_class_size]
p  = shuffled[shuffled['Label'] == 'Prep'][:fixed_class_size]
t  = shuffled[shuffled['Label'] == 'Train'][:fixed_class_size]
ev = shuffled[shuffled['Label'] == 'Eval'][:fixed_class_size]
ex = shuffled[shuffled['Label'] == 'Explore'][:fixed_class_size]
i  = shuffled[shuffled['Label'] == 'Import'][:fixed_class_size]

concated = pd.concat([l, p, t, ev, ex, i], ignore_index=True) #our new data with balanced classes
concated.head(5)

Unnamed: 0,Cell ID,Source,Label
0,barneythedinosaur_#_homework3_#_2,data=pd.read_csv('../input/kc_house_data.csv')...,Load
1,rahulin05_#_predicting-house-sales-by-linear-r...,"['import pandas as pd\n', 'housing_data = pd.r...",Load
2,shinto_#_notebook2_#_21,"['submission = pd.read_csv(""../input/sample_su...",Load
3,piscab_#_classifying-narratives-by-product-w-c...,"['# Read the input dataset \n', 'd = pd.read_c...",Load
4,shivammittal99_#_renaming-and-combining-workbo...,"powerlifting_meets = pd.read_csv(""../input/pow...",Load
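
The undersampling step above can be sketched without pandas as follows. This is only an illustration, not the notebook's code: the helper name `undersample` and the toy rows are hypothetical. It also makes one detail explicit: a class with fewer rows than the cap (here `Import` and `Train`, going by the value counts above) is kept whole, so the result is capped rather than perfectly balanced.

```python
import random
from collections import defaultdict

def undersample(rows, cap, seed=0):
    """Keep at most `cap` randomly chosen rows per label.

    `rows` is a list of (source, label) pairs (hypothetical helper).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for source, label in rows:
        by_label[label].append((source, label))

    kept = []
    for group in by_label.values():
        rng.shuffle(group)        # random order within the class
        kept.extend(group[:cap])  # classes smaller than `cap` are kept whole
    rng.shuffle(kept)             # mix the classes again
    return kept

# Toy data: 5 'Explore' cells and 2 'Train' cells
toy = [("plt.plot(x)", "Explore")] * 5 + [("model.fit(X, y)", "Train")] * 2
balanced = undersample(toy, cap=3)

counts = defaultdict(int)
for _, label in balanced:
    counts[label] += 1
print(dict(counts))
```

With `cap=3`, the larger toy class is cut down to three rows while the smaller one keeps both of its rows.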


In [8]:
#Shuffle the dataset again by re-indexing
concated = concated.reindex(np.random.permutation(concated.index))
concated.head(5)

Unnamed: 0,Cell ID,Source,Label
11660,shenba_#_home-credit-v6-12jun2018_#_80,"num_round = 1000\r\r\nbst = xgb.train(params, ...",Train
14052,finlay_#_quick-algos-start_#_5,['from sklearn.linear_model import LinearRegre...,Eval
7378,bill10_#_health-insurance-marketplace_#_5,"[""df.loc[df.IndividualRate==0, 'IndividualRate...",Prep
6335,optidatascience_#_data-preparation-for-sberban...,"[""dt = 'object'\n"", 'sel_col = train.columns[t...",Prep
9245,rveldur_#_mercari-price-suggestion-data-mining...,"full_df[""category_name""] = full_df[""category_n...",Prep


### Tokenization and Vector representation of label and code

We'll represent the label as a one-hot vector

In [9]:
#add int representation of the label
concated['INT'] = 0
concated.loc[concated['Label'] == 'Load', 'INT']  = 0
concated.loc[concated['Label'] == 'Prep', 'INT']  = 1
concated.loc[concated['Label'] == 'Train', 'INT']  = 2
concated.loc[concated['Label'] == 'Eval', 'INT'] = 3
concated.loc[concated['Label'] == 'Explore', 'INT'] = 4
concated.loc[concated['Label'] == 'Import', 'INT']  = 5

#one-hot encode the label
labels = to_categorical(concated['INT'], num_classes=6)
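# NOTE: DataFrame.drop returns a new frame; without assigning the result this call
# leaves `concated` unchanged, so the 'Label' column is still present afterwards.
if 'Label' in concated.keys():
    concated.drop(['Label'], axis=1)

# one-hot encoding of each class:
#  [1. 0. 0. 0. 0. 0.] load data
#  [0. 1. 0. 0. 0. 0.] data preparation and cleaning
#  [0. 0. 1. 0. 0. 0.] model training and parameter tuning
#  [0. 0. 0. 1. 0. 0.] model evaluation
#  [0. 0. 0. 0. 1. 0.] data exploration
#  [0. 0. 0. 0. 0. 1.] imports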

#let's print some of the labels to see the encoding
labels.view()

NameError: name 'to_categorical' is not defined
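
For reference, one-hot encoding maps class index `k` to a vector of length 6 with a single 1 at position `k`. The sketch below reproduces that mapping in plain Python. `LABEL_TO_INT` mirrors the `INT` column above, while `one_hot` is a hypothetical stand-in for `to_categorical`, not the Keras implementation.

```python
LABEL_TO_INT = {'Load': 0, 'Prep': 1, 'Train': 2,
                'Eval': 3, 'Explore': 4, 'Import': 5}

def one_hot(index, num_classes):
    """Return a list of floats with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

encoded = [one_hot(LABEL_TO_INT[name], len(LABEL_TO_INT))
           for name in ['Load', 'Eval', 'Import']]
for row in encoded:
    print(row)
```

With pandas, the same dictionary could replace the six `.loc` assignments via `concated['Label'].map(LABEL_TO_INT)`.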

We remove all comments, because a comment may describe actions that were never actually carried out, or something done before the current cell. Such text only interferes with our task of classifying the current cell correctly.

In [11]:
from utils.utils import findAndRemoveComments
concated['Source'] = concated['Source'].apply(findAndRemoveComments)
concated.head(5) # just to check comments were indeed removed

Unnamed: 0,Cell ID,Source,Label,INT
11660,shenba_#_home-credit-v6-12jun2018_#_80,"num_round = 1000\r\r\nbst = xgb.train(params, ...",Train,2
14052,finlay_#_quick-algos-start_#_5,['from sklearn.linear_model import LinearRegre...,Eval,3
7378,bill10_#_health-insurance-marketplace_#_5,"[""df.loc[df.IndividualRate==0, 'IndividualRate...",Prep,1
6335,optidatascience_#_data-preparation-for-sberban...,"[""dt = 'object'\n"", 'sel_col = train.columns[t...",Prep,1
9245,rveldur_#_mercari-price-suggestion-data-mining...,"full_df[""category_name""] = full_df[""category_n...",Prep,1
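
The implementation of `findAndRemoveComments` is not shown in this notebook. As a rough idea of what such a step might look like, here is a sketch built on the standard-library `tokenize` module. The function name `strip_hash_comments` is hypothetical, and the sketch only handles `#` comments in code that tokenizes as Python (docstrings and notebook magics are left alone); it is not the project's helper.

```python
import io
import tokenize

def strip_hash_comments(code):
    """Remove `#` comments from Python source.

    Falls back to returning the input unchanged if the source cannot be
    tokenized (for example, an unfinished statement).
    """
    try:
        comments = [tok for tok in tokenize.generate_tokens(io.StringIO(code).readline)
                    if tok.type == tokenize.COMMENT]
    except (tokenize.TokenError, SyntaxError):
        return code

    lines = code.split("\n")
    for tok in comments:
        row, col = tok.start  # 1-based row, 0-based column
        # A comment always runs to the end of its line, so cut the line there.
        lines[row - 1] = lines[row - 1][:col].rstrip()
    return "\n".join(lines)

src = "x = 1  # set x\ns = '# not a comment'\n# whole-line comment\ny = 2\n"
print(strip_hash_comments(src))
```

Using the tokenizer rather than a regular expression keeps `#` characters inside string literals intact.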


Now we lowercase the code, filter out special characters (including dots), and split each cell's code into tokens.
Then we map the most common words to integers, so each cell is represented as a vector of ints according to the words it contains. The vectors are then padded (or truncated) to a fixed length of 120 (`max_len`).

In [None]:
n_most_common_words = 8000
max_len = 120
tokenizer = Tokenizer(num_words=n_most_common_words, filters='!"#$%&()*+,.-/:;<=>?@[\]^`{|}~\n\r\t \'', lower=True)
tokenizer.fit_on_texts(concated['Source'].values)
sequences = tokenizer.texts_to_sequences(concated['Source'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
print("-------")
print(word_index) #our "words" dictionary

X = pad_sequences(sequences, maxlen=max_len)
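
To make the tokenization and padding steps concrete, here is a dependency-free sketch of what the cell above does. The helper names are hypothetical and the code is a simplified imitation of `Tokenizer` and `pad_sequences` with their default settings (most frequent word gets index 1, words with index `>= num_words` are dropped, padding and truncation both happen at the start of the sequence), not the Keras implementation.

```python
from collections import Counter

# Same characters as the `filters` argument used above
FILTERS = '!"#$%&()*+,.-/:;<=>?@[\\]^`{|}~\n\r\t \''

def split_tokens(text, filters=FILTERS):
    """Lowercase, turn every filtered character into a space, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Most frequent token gets index 1; index 0 is reserved for padding."""
    counts = Counter(tok for text in texts for tok in split_tokens(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequence(text, word_index, num_words, max_len):
    """Map tokens to ints, drop out-of-vocabulary ones, then left-truncate and left-pad."""
    seq = [word_index[t] for t in split_tokens(text)
           if t in word_index and word_index[t] < num_words]
    seq = seq[-max_len:]                     # keep the last `max_len` tokens
    return [0] * (max_len - len(seq)) + seq  # pad with zeros on the left

cells = ["df = pd.read_csv('a.csv')", "df.head()", "df = df.dropna()"]
word_index = build_word_index(cells)
print(word_index)
print(to_padded_sequence("df.head()", word_index, num_words=8000, max_len=6))
# → [0, 0, 0, 0, 1, 6]
```

Note that the underscore is not in the filter list, so an identifier such as `read_csv` stays a single token.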

Now we split the data, represented as vectors, into train and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X , labels, test_size=0.25, random_state=42)

## LSTM Model

Now we setup and train an LSTM model using the vector representation of the code and the labels.

#### Parameter Definitions

In [None]:
epochs = 15
# We use EarlyStopping (monitoring val_acc in the training cell), so training stops
# once the validation accuracy stops improving; this also helps limit overfitting.
emb_dim = 512
batch_size = 256
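
As an illustration of how patience-based early stopping behaves, the sketch below replays a list of validation accuracies and reports the epoch at which training would stop. It is a simplified model of the idea, with a hypothetical function name and made-up accuracy values, and it is not the Keras `EarlyStopping` code.

```python
def early_stop_epoch(val_acc_per_epoch, patience=7, min_delta=0.0001):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if the metric beats the best
    value so far by more than `min_delta`.
    """
    best = float('-inf')
    wait = 0  # epochs since the last improvement
    for epoch, acc in enumerate(val_acc_per_epoch, start=1):
        if acc - min_delta > best:
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation accuracies: the best value is reached at epoch 3
history_acc = [0.70, 0.75, 0.78, 0.78, 0.77, 0.78]
stopped_at = early_stop_epoch(history_acc, patience=3)
print(stopped_at)  # → 6
```

With `epochs = 15` and a patience of 7, stopping can only trigger if the monitored metric fails to improve for seven consecutive epochs.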

#### Model Setup and Training
***Note: model training could take up to 100 minutes, you can skip and load the trained model in the next cell***

In [None]:
print("(X_train.shape, y_train.shape, X_test.shape, y_test.shape)")
print((X_train.shape, y_train.shape, X_test.shape, y_test.shape))
model = Sequential()
model.add(Embedding(n_most_common_words, emb_dim, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.8))
model.add(LSTM(64, dropout=0.8, recurrent_dropout=0.8))
model.add(Dense(6, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc', mean_squared_error])
print(model.summary())
history = model.fit(
    X_train, y_train,
    epochs=epochs,
    batch_size=batch_size,
    validation_split=0.2,
    callbacks=[EarlyStopping(monitor='val_acc', patience=7, min_delta=0.0001)],
)
# Loss: 0.614,  Accuracy: 0.815, bad conv. on 20 epochs
# Loss: 0.593,  Accuracy: 0.816, bad conv. on 15
# Loss: 0.577,  Accuracy: 0.814, bad conv. on 12
# Loss: 0.584,  Accuracy: 0.800, better conv. still bad on 8
# Loss: 0.571,  Accuracy: 0.802, better conv. still bad on 7
# Loss: 0.599,  Accuracy: 0.793, pretty good conv. on 5 (best on 4)
# Loss: 0.621,  Accuracy: 0.783,
# all above with 0.6 dropouts
# with 0.8 dropouts:
# Loss: 0.684, Accuracy: 0.765, pretty good conv. but not yet on 8
# Loss: 0.609, Accuracy: 0.793, good conv. on 15
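
The model above ends in a softmax layer and is trained with categorical cross-entropy. The following plain-Python sketch shows both computations for a single example. The function names and the example logits are illustrative assumptions, and Keras computes the same quantities in batched, vectorized form.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(y_true, y_prob, eps=1e-7):
    """Loss for one example: minus the log-probability of the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))

# Hypothetical raw scores for the six classes; the true class is 'Train' (index 2)
probs = softmax([0.5, 0.2, 2.0, 0.1, 0.3, -1.0])
y_true = [0, 0, 1, 0, 0, 0]
print(categorical_cross_entropy(y_true, probs))

# A uniform guess over six classes gives a loss of ln(6), about 1.79
uniform_loss = categorical_cross_entropy(y_true, [1 / 6] * 6)
print(uniform_loss)
```

The uniform-guess value is a handy reference point when reading the loss figures in the experiment notes above.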

In [None]:
model.save('LSTM.h5') #save the trained model

#### Load the trained model (continue here if you skipped training)

In [None]:
#load the trained model (not needed if you train again)
model = load_model('LSTM.h5')

## Model Evaluation

In [None]:
accr = model.evaluate(X_test,y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

We get a categorical accuracy of 79.3%, meaning the model predicts the correct class for 79.3% of the test cells; not bad.
During training, the MSE on the training set was similar to that on the validation set, so we conclude the model isn't badly overfitted.
We optimised the model with the categorical cross-entropy loss, as we are classifying with a softmax output layer.
We can see the model convergence:

In [None]:
import matplotlib.pyplot as plt
%matplotlib notebook

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epoch_range = range(1, len(acc) + 1)  # separate name, so the `epochs` hyperparameter is not overwritten

plt.plot(epoch_range, acc, 'bo', label='Training acc')
plt.plot(epoch_range, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epoch_range, loss, 'bo', label='Training loss')
plt.plot(epoch_range, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

We can see that the loss and accuracy of the training set converge to those of the validation set, as they should.

Now we'll look at the model's performance for each class, using the precision, recall and F1 score that `classification_report` prints:

In [None]:
from sklearn.metrics import classification_report

# convert probability vectors and one-hot labels back to class indices
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)

print(classification_report(y_true, y_pred))

We can see that the model performs pretty well for all labels.
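
For readers who want to see how the per-class numbers are derived, here is a small sketch that computes precision, recall and F1 from lists of true and predicted class indices. The function name and the toy labels are hypothetical; in practice `classification_report` does this for you.

```python
def per_class_scores(y_true, y_pred, num_classes):
    """Return {class_index: (precision, recall, f1)}."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for c in range(num_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)  # correctly predicted as c
        fp = sum(1 for t, p in pairs if t != c and p == c)  # wrongly predicted as c
        fn = sum(1 for t, p in pairs if t == c and p != c)  # class c that was missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1)
    return scores

# Toy example with three classes
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_scores = per_class_scores(toy_true, toy_pred, num_classes=3)
for c, (p, r, f) in toy_scores.items():
    print(c, round(p, 3), round(r, 3), round(f, 3))
```

Precision answers "of the cells predicted as this class, how many were right?", while recall answers "of the cells that truly belong to this class, how many were found?".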

## Examples

Here are some examples; we can see that the model classifies them correctly.

In [None]:
code = ["model = KNeighborsClassifier(n_neighbors=3)\n model.fit(x, y)"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = model.predict(padded)
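# class names in the order of the INT encoding; note this rebinds `labels` (previously the one-hot array)
labels = ['Load', 'Prep', 'Train', 'Eval', 'Explore', 'Import']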
print(pred, labels[np.argmax(pred)])

In [None]:
code = ["import library\nimport otherlibrary"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = model.predict(padded)
print(pred, labels[np.argmax(pred)])

In [None]:
code = ["df = pd.read_csv('file')"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = model.predict(padded)
print(pred, labels[np.argmax(pred)])

In [None]:
code = ["df.shape\ndf.head()"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = model.predict(padded)
print(pred, labels[np.argmax(pred)])

In [None]:
code = ["accr = model.evaluate(X_test,y_test)\nprint('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = model.predict(padded)
print(pred, labels[np.argmax(pred)])

You can try it yourself by entering a cell's code:

In [None]:
code = ["INSERT CODE HERE"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = model.predict(padded)
print(pred, labels[np.argmax(pred)])
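
Since the cells above repeat the same four steps, they could be wrapped in a small helper. The sketch below is one possible shape for it. `classify_cell` and `CLASS_NAMES` are hypothetical names, and the tokenizer, model and padding function are passed in as arguments so the helper itself does not depend on Keras.

```python
CLASS_NAMES = ['Load', 'Prep', 'Train', 'Eval', 'Explore', 'Import']

def classify_cell(code, tokenizer, model, pad_fn, max_len=120):
    """Return (predicted class name, class probabilities) for one cell's source.

    `tokenizer` needs a `texts_to_sequences` method, `model` a `predict`
    method, and `pad_fn` the signature of `pad_sequences(seqs, maxlen=...)`.
    """
    seq = tokenizer.texts_to_sequences([code])
    padded = pad_fn(seq, maxlen=max_len)
    probs = list(model.predict(padded)[0])  # probabilities for the single input
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASS_NAMES[best], probs
```

In this notebook it would be called as `classify_cell("df.head()", tokenizer, model, pad_sequences, max_len)`. The examples above feed raw code to the tokenizer, so input that contains comments may benefit from the same comment removal that was applied to the training data.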