# Cell Classification - End Model (LSTM)

In this notebook we use the labeled data generated with Snorkel in the previous notebook ([here](Exploration_and_WeakSupervision.ipynb)) to train a supervised LSTM model that classifies a cell's source code into the relevant stage of the data-science workflow (multi-class text classification).

In [1]:
# install necessary packages
! pip install -U --user pip six numpy wheel mock pandas
! pip install -U --user keras_applications==1.0.6 --no-deps
! pip install -U --user keras_preprocessing==1.0.5 --no-deps
! pip install keras tensorflow scikit-learn

Requirement already up-to-date: pip in c:\users\tamirhuber\appdata\roaming\python\python37\site-packages (19.1.1)
Requirement already up-to-date: six in c:\users\tamirhuber\anaconda3\lib\site-packages (1.12.0)
Collecting numpy
  Downloading https://files.pythonhosted.org/packages/ce/61/be72eee50f042db3acf0b1fb86650ad36d6c0d9be9fc29f8505d3b9d6baa/numpy-1.16.4-cp37-cp37m-win_amd64.whl (11.9MB)
Requirement already up-to-date: wheel in c:\users\tamirhuber\appdata\roaming\python\python37\site-packages (0.33.4)
Requirement already up-to-date: mock in c:\users\tamirhuber\appdata\roaming\python\python37\site-packages (3.0.5)
Requirement already up-to-date: pandas in c:\users\tamirhuber\appdata\roaming\python\python37\site-packages (0.24.2)
Installing collected packages: numpy
  Found existing installation: numpy 1.16.3
    Uninstalling numpy-1.16.3:
      Successfully uninstalled numpy-1.16.3
Successfully installed numpy-1.16.4




Requirement already up-to-date: keras_applications==1.0.6 in c:\users\tamirhuber\appdata\roaming\python\python37\site-packages (1.0.6)
Requirement already up-to-date: keras_preprocessing==1.0.5 in c:\users\tamirhuber\appdata\roaming\python\python37\site-packages (1.0.5)


This should work, but if any problems occur see [https://www.tensorflow.org/install](https://www.tensorflow.org/install) and [https://keras.io/#installation](https://keras.io/#installation).

In [2]:
# First let's import relevant libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, Dropout
from keras.models import Sequential
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
from keras.callbacks import EarlyStopping
from keras.metrics import mean_squared_error
from keras.models import load_model
import keras

# Input data files are available in the "../input/" directory.
import os
print(os.listdir("input/"))

# Any results you write to the current directory are saved as output.

Using TensorFlow backend.


['cells.tsv', 'gold_labels.tsv', 'input.tsv', 'notebooks.csv', 'tagged_to_gold.tsv', '__pycache__']


*If the keras and tensorflow installation was successful and there is still a problem with the imports, try restarting the kernel and clearing outputs, and then run the imports cell again.

In [6]:
# load our tagged Data
data = pd.read_csv('input/input.tsv', delimiter='\t', usecols=['Cell ID', 'Source', 'Label'])

## Pre-Processing

In [7]:
#first we'll remove cells that snorkel didn't tag
data.dropna(subset=['Label'], how='all', inplace = True)
data = data[data.Label != 'Unknown']

In [8]:
#now let's take a look at some random cells
data.sample(5)

Unnamed: 0,Cell ID,Source,Label
47376,shilpa11_#_mercari-price-predictions-eda_#_28,"['plt.figure(figsize = (18,6))\n', ""sns.barplo...",Explore
79462,ostaski_#_grouping-and-sorting_#_1,import pandas as pd\r\r\nfrom learntools.advan...,Load
85848,vrush77_#_creating-reading-writing-data_#_10,# Your Code Here\r\r\nimport sqlite3\r\r\ncon ...,Import
51008,bertcarremans_#_data-preparation-exploration_#_33,"[""X_train = train.drop(['id', 'target'], axis=...",Train
75875,jokermario_#_indexing-selecting-assigning-4af0...,"# Your code here\r\r\na = reviews.loc[:,'point...",Prep


### Class Imbalance

Let's take a look at the value counts of the tagged data.

In [9]:
#now let's see
data.Label.value_counts()

Explore    31813
Prep       15344
Eval        7897
Load        7176
Import      4980
Train       3052
Name: Label, dtype: int64

We can see that the classes are imbalanced: the data exploration class has many more cells than the others. We want roughly balanced classes for training, so we take at most a fixed number of cells from each class (undersampling the large classes). Classes smaller than that limit are kept whole, so the result is only approximately balanced.

In [10]:
# first we shuffle the data by randomly re-indexing
shuffled = data.reindex(np.random.permutation(data.index))
shuffled.head(5) #check that the data is indeed shuffled


Unnamed: 0,Cell ID,Source,Label
1507,aschlapsi_#_the-first-kaggle-submission_#_5,"['print(""Number of observations: %i"" % len(tra...",Explore
57364,frankherfert_#_workflow-template-and-detailed-...,"[""sub_area_columns = ['sub_area', 'area_m', 'r...",Prep
37267,dkasprick_#_king-county-sales-regression_#_13,df['bathrooms'].value_counts(),Explore
86157,zbi441_#_winery-data-exploratory-analysis-phas...,"['import plotly.plotly as py\n', 'import plotl...",Explore
88517,kerneler_#_starter-world-bank-projects-c00904b...,"plotHistogram(df1, 10, 5)",Explore


In [11]:
fixed_class_size = 5000 #the fixed size was selected by trial and error
l  = shuffled[shuffled['Label'] == 'Load'][:fixed_class_size]
p  = shuffled[shuffled['Label'] == 'Prep'][:fixed_class_size]
t  = shuffled[shuffled['Label'] == 'Train'][:fixed_class_size]
ev = shuffled[shuffled['Label'] == 'Eval'][:fixed_class_size]
ex = shuffled[shuffled['Label'] == 'Explore'][:fixed_class_size]
i  = shuffled[shuffled['Label'] == 'Import'][:fixed_class_size]

concated = pd.concat([l, p, t, ev, ex, i], ignore_index=True) #our new data with balanced classes
concated.head(5)

Unnamed: 0,Cell ID,Source,Label
0,apapiu_#_notebook35c9e8bdca_#_2,"['train = pd.read_csv(""../input/train.csv"")']",Load
1,akashravichandran_#_pandas-tutorial-6_#_6,"gaming_products = pd.read_csv(""../input/things...",Load
2,nareshsrikakulapu_#_sub-mean_#_1,['# This Python 3 environment comes with many ...,Load
3,ivanbeg_#_08-lstm-stock-prediction-solution_#_3,"['df2 = pd.read_csv(""../input/fundamentals.csv...",Load
4,param87_#_house-price-prediction-based_#_1,# This Python 3 environment comes with many he...,Load
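
The six per-class slices above can also be written as a single grouped operation. The following is only a sketch of that idea: the helper name `undersample`, its `seed` parameter and the toy DataFrame are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def undersample(df, label_col, max_per_class, seed=0):
    """Shuffle the rows, then keep at most `max_per_class` rows of each label.

    Hypothetical helper: classes with fewer rows than the cap are kept whole.
    """
    shuffled = df.sample(frac=1, random_state=seed)
    capped = shuffled.groupby(label_col, sort=False).head(max_per_class)
    return capped.reset_index(drop=True)

# Toy data standing in for the tagged cells (made up for this example)
toy = pd.DataFrame({
    "Source": [f"cell {i}" for i in range(13)],
    "Label": ["A"] * 10 + ["B"] * 3,
})
balanced = undersample(toy, "Label", max_per_class=5)
print(balanced["Label"].value_counts())
```

With the notebook's data, the equivalent call would presumably be `undersample(data, 'Label', fixed_class_size)`.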


In [12]:
#Shuffle the dataset again by re-indexing
concated = concated.reindex(np.random.permutation(concated.index))
concated.head(5)

Unnamed: 0,Cell ID,Source,Label
21975,gauravgulati9c_#_lending-insights-for-beginner...,df.describe(),Explore
20139,sshreshth_#_manual-feature-engineering_#_32,# Dataframe grouped by the loan\r\r\nbureau_by...,Explore
6974,saroshfarhan_#_indexing-selecting-assigning_#_11,# Your code here\r\r\ncheck_q8(reviews.loc[rev...,Prep
25335,nagaraga_#_pandas-chap1-naga_#_1,import pandas as pd\r\r\npd.set_option('max_ro...,Import
2619,rajsrujan77_#_pytorch-credit-card-fraud-99-9-a...,"data = pd.read_csv(""../input/processed-creditd...",Load


### Tokenization and Vector representation of label and code

We'll represent the label as a one-hot vector

In [13]:
#add int representation of the label
concated['INT'] = 0
concated.loc[concated['Label'] == 'Load', 'INT']  = 0
concated.loc[concated['Label'] == 'Prep', 'INT']  = 1
concated.loc[concated['Label'] == 'Train', 'INT']  = 2
concated.loc[concated['Label'] == 'Eval', 'INT'] = 3
concated.loc[concated['Label'] == 'Explore', 'INT'] = 4
concated.loc[concated['Label'] == 'Import', 'INT']  = 5

#one-hot encode the label
labels = to_categorical(concated['INT'], num_classes=6)
if 'Label' in concated.keys():
    # drop() returns a new DataFrame, so the result has to be assigned back
    concated = concated.drop(['Label'], axis=1)
# '''
#  [1. 0. 0. 0. 0. 0.] load data
#  [0. 1. 0. 0. 0. 0.] data preparation and cleaning
#  [0. 0. 1. 0. 0. 0.] model training and parameter tuning
#  [0. 0. 0. 1. 0. 0.] model evaluation
#  [0. 0. 0. 0. 1. 0.] data exploration
#  [0. 0. 0. 0. 0. 1.] imports
# '''

#let's print some of the labels to see the encoding
labels.view()

array([[0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0.]], dtype=float32)
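
As an illustration of the encoding, the same label-to-vector mapping can be sketched with plain NumPy. The names `STAGES`, `stage_to_int` and `one_hot` are hypothetical; the class order follows the integer codes assigned in the cell above.

```python
import numpy as np

# Class order matches the INT codes used above (Load=0 ... Import=5)
STAGES = ['Load', 'Prep', 'Train', 'Eval', 'Explore', 'Import']
stage_to_int = {name: i for i, name in enumerate(STAGES)}

def one_hot(label_names):
    """Map label names to one-hot rows by indexing into an identity matrix."""
    ints = np.array([stage_to_int[name] for name in label_names])
    return np.eye(len(STAGES), dtype=np.float32)[ints]

encoded = one_hot(['Explore', 'Prep'])
print(encoded)
```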

We remove all comments, because a comment may describe an action that was never actually performed, or something done in an earlier cell, and so it only interferes with classifying the current cell correctly.

In [14]:
from utils.utils import findAndRemoveComments
concated['Source'] = concated['Source'].apply(lambda x: findAndRemoveComments(x))
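
The implementation of `findAndRemoveComments` lives in `utils/utils.py` and is not shown here. As a rough idea of what such a step can look like, here is a sketch built on the standard-library `tokenize` module; the function `strip_comments` is a hypothetical stand-in and may behave differently from the real helper.

```python
import io
import tokenize

def strip_comments(source):
    """Remove '#' comments from Python source, leaving string literals intact.

    Hypothetical helper: if the text cannot be tokenized as Python,
    it is returned unchanged.
    """
    lines = source.split("\n")
    try:
        tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    except (tokenize.TokenError, IndentationError, SyntaxError):
        return source
    for tok in tokens:
        if tok.type == tokenize.COMMENT:
            row, col = tok.start
            # A comment always runs to the end of its line, so cut from its start
            lines[row - 1] = lines[row - 1][:col].rstrip()
    return "\n".join(lines)

cleaned = strip_comments("x = 1  # set x\ns = '# not a comment'\n")
print(cleaned)
```

Using the tokenizer rather than a regular expression avoids cutting a `#` that appears inside a string literal.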

Now we lowercase the code, filter out special characters and dots, and split each cell's code into tokens.
Then we map the most common words to integers, so that each cell is represented as a vector of integers according to the words it contains. The vectors are then padded to a fixed maximum length of 120.

In [15]:
n_most_common_words = 8000
max_len = 120
tokenizer = Tokenizer(num_words=n_most_common_words, filters='!"#$%&()*+,.-/:;<=>?@[\]^`{|}~\n\r\t \'', lower=True)
tokenizer.fit_on_texts(concated['Source'].values)
sequences = tokenizer.texts_to_sequences(concated['Source'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
print("-------")
print(word_index) #our "words" dictionary

X = pad_sequences(sequences, maxlen=max_len)

Found 46827 unique tokens.
-------
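
To make the encoding concrete, here is a minimal pure-Python sketch of what the `Tokenizer` and `pad_sequences` calls do. It assumes Keras's default behaviour: index 1 goes to the most frequent token, 0 is reserved for padding, and both padding and truncation happen at the front of the sequence. The helper names are hypothetical and the two example cells are made up.

```python
from collections import Counter

FILTERS = '!"#$%&()*+,.-/:;<=>?@[\\]^`{|}~\n\r\t \''
_TABLE = str.maketrans({ch: " " for ch in FILTERS})

def to_tokens(text):
    """Lowercase, replace filtered characters by spaces, split on whitespace."""
    return text.lower().translate(_TABLE).split()

def build_word_index(texts):
    """Rank tokens by frequency; the most common token gets index 1."""
    counts = Counter(tok for text in texts for tok in to_tokens(text))
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(), start=1)}

def encode(text, word_index, num_words, max_len):
    """Keep tokens whose index is below num_words, then pad or cut at the front."""
    seq = [word_index[t] for t in to_tokens(text)
           if t in word_index and word_index[t] < num_words]
    seq = seq[-max_len:]                       # keep the last max_len tokens
    return [0] * (max_len - len(seq)) + seq    # left-pad with zeros

cells = ["df = pd.read_csv('a.csv')", "df.head()"]
word_index = build_word_index(cells)
encoded = [encode(c, word_index, num_words=8000, max_len=4) for c in cells]
print(word_index)
print(encoded)
```

Note that `_` is not in the filter list, so identifiers such as `read_csv` stay single tokens.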


Now we split the data, represented as vectors, into train and test sets.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X , labels, test_size=0.25, random_state=42)

## LSTM Model

Now we setup and train an LSTM model using the vector representation of the code and the labels.

#### Parameter Definitions

In [21]:
epochs = 5
# we train with an EarlyStopping callback (see the fit call below), so training stops once validation accuracy stops improving;
# we also keep the number of epochs small because we don't want to overfit
emb_dim = 512
batch_size = 256
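
For intuition, patience-based early stopping can be sketched in a few lines. This is a simplified model of the rule, assuming a metric that should increase (such as validation accuracy); `early_stop_epoch` is a hypothetical function and the accuracy values are made up.

```python
def early_stop_epoch(val_acc, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if it beats the best value so far
    by more than min_delta; training stops after `patience` epochs in a row
    without improvement.
    """
    best = float("-inf")
    wait = 0
    for epoch, current in enumerate(val_acc, start=1):
        if current - min_delta > best:
            best = current
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

history_acc = [0.5, 0.6, 0.6, 0.6, 0.6]  # made-up validation accuracies
print(early_stop_epoch(history_acc, patience=2, min_delta=0.0001))
print(early_stop_epoch(history_acc, patience=7, min_delta=0.0001))
```

One consequence: with `epochs = 5` and a patience of 7, as used in this notebook, the stopping rule cannot trigger before training ends on its own.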

#### Model Setup and Training
<u>Note: model training could take up to 1 hour, you can skip and load the trained model in the next cell</u>

In [22]:
print("(X_train.shape, y_train.shape, X_test.shape, y_test.shape)")
print((X_train.shape, y_train.shape, X_test.shape, y_test.shape))
model = Sequential()
model.add(Embedding(n_most_common_words, emb_dim, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.8))
model.add(LSTM(64, dropout=0.8, recurrent_dropout=0.8))
model.add(Dense(6, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc', mean_squared_error])
print(model.summary())

checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create checkpoint callback
# standalone keras callback, matching the keras imports above
cp_callback = keras.callbacks.ModelCheckpoint(checkpoint_path, save_weights_only=True, verbose=1)

history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.2,
                    callbacks=[EarlyStopping(monitor='val_acc', patience=7, min_delta=0.0001), cp_callback])

(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
((21024, 120), (21024, 6), (7008, 120), (7008, 6))
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 120, 512)          4096000   
_________________________________________________________________
spatial_dropout1d_2 (Spatial (None, 120, 512)          0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                147712    
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 390       
Total params: 4,244,102
Trainable params: 4,244,102
Non-trainable params: 0
_________________________________________________________________
None
Train on 16819 samples, validate on 4205 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
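
The sample counts and parameter counts in the output above can be reproduced by hand, which is a useful sanity check. The sketch below assumes that scikit-learn rounds the test-set size up and that Keras truncates the training share of `validation_split`; the LSTM formula is the standard one with four gates, each having input weights, recurrent weights and a bias.

```python
import math

# Class counts from data.Label.value_counts() and the cap used for undersampling
class_counts = {"Explore": 31813, "Prep": 15344, "Eval": 7897,
                "Load": 7176, "Import": 4980, "Train": 3052}
fixed_class_size = 5000

n_total = sum(min(c, fixed_class_size) for c in class_counts.values())
n_test = math.ceil(0.25 * n_total)      # test_size=0.25
n_train = n_total - n_test
n_fit = int(n_train * (1 - 0.2))        # validation_split=0.2
n_val = n_train - n_fit

vocab, emb_dim, units, n_classes = 8000, 512, 64, 6
embedding_params = vocab * emb_dim
lstm_params = 4 * ((emb_dim + units) * units + units)
dense_params = units * n_classes + n_classes
total_params = embedding_params + lstm_params + dense_params

print(n_total, n_train, n_test, n_fit, n_val)
print(embedding_params, lstm_params, dense_params, total_params)
```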


In [23]:
# model.model.save('model_model.h5')
model.save('model.h5')

model.save_weights('weights.h5')



In [None]:
model.save('LSTM.h5') #save the trained model

#### Load the Trained Model (continue here if you don't train the model)

In [18]:
#load the trained model (not needed if you train again)
model = load_model('LSTM.h5')

In [25]:
import keras

In [26]:
model_weight = Sequential()
model_weight.add(Embedding(n_most_common_words, emb_dim, input_length=X.shape[1]))
model_weight.add(SpatialDropout1D(0.8))
model_weight.add(LSTM(64, dropout=0.8, recurrent_dropout=0.8))
model_weight.add(Dense(6, activation='softmax'))
model_weight.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc', mean_squared_error])

In [27]:
model_weight.load_weights('weights.h5')

## Model Evaluation

In [19]:
accr = model.evaluate(X_test,y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

Test set
  Loss: 3.216
  Accuracy: 0.208
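
The two numbers reported by `model.evaluate` are the categorical cross-entropy loss and the categorical accuracy. As a reference for how they are defined, here is a small NumPy sketch; the function names are hypothetical, and the probability vector is the one printed for the `KNeighborsClassifier` example in the Examples section.

```python
import math
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

def categorical_accuracy(y_true, y_prob):
    """Share of samples whose most probable class is the true class."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1)))

y_true = np.array([[0., 0., 1., 0., 0., 0.]])  # 'Train'
y_prob = np.array([[0.00093987, 0.00442266, 0.9347053,
                    0.05745431, 0.00116859, 0.00130913]])

loss = categorical_crossentropy(y_true, y_prob)
acc = categorical_accuracy(y_true, y_prob)
uniform_loss = categorical_crossentropy(y_true, np.full((1, 6), 1 / 6))
print(loss, acc, uniform_loss)
```

For scale, a uniform guess over six classes gives a loss of ln(6), roughly 1.79, and an expected accuracy of about 0.167 on balanced data.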


We get a categorical accuracy of 86.7%, meaning the model predicts the correct class for 86.7% of the cells. Note that the saved output of the cell above shows a different figure (accuracy 0.208), so re-run the evaluation to confirm the number for the model you are using.
During training, the MSE on the training set was similar to that on the validation set, so we conclude the model is not badly overfitted.
We optimised the model with categorical cross-entropy loss, since we classify with a softmax output layer.
We can see the model convergence:

In [None]:
import matplotlib.pyplot as plt
%matplotlib notebook

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

# use a separate name so the `epochs` hyperparameter is not overwritten
epoch_range = range(1, len(acc) + 1)

plt.plot(epoch_range, acc, 'bo', label='Training acc')
plt.plot(epoch_range, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epoch_range, loss, 'bo', label='Training loss')
plt.plot(epoch_range, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

We can see that the loss and accuracy of the validation set converge to those of the training set, as they should.

Now we'll look at the model performance for each class (and also at the f1 scores):

In [None]:
from sklearn.metrics import classification_report

test_pred = model.predict(X_test)

# convert predicted probabilities and one-hot labels to class indices
y_pred = np.argmax(test_pred, axis=1)
y_true = np.argmax(y_test, axis=1)

print(classification_report(y_true, y_pred))

We can see that the model performs well for all labels 
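
For readers who want to see what the report computes, here is a minimal NumPy sketch of per-class precision, recall and F1 derived from a confusion matrix. The function `per_class_scores` is hypothetical and the four-sample input is made up.

```python
import numpy as np

def per_class_scores(y_true, y_pred, n_classes):
    """Return (precision, recall, f1) arrays with one entry per class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                       # rows: true class, columns: predicted
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)              # how often each class was predicted
    actual = cm.sum(axis=1)                 # how often each class really occurs
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

precision, recall, f1 = per_class_scores([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
print(precision, recall, f1)
```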

## Examples

Here are some examples of the model classifying individual cells:

In [28]:
code = ["model = KNeighborsClassifier(n_neighbors=3)\n model.fit(x, y)"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = model.predict(padded)
labels = ['Load', 'Prep', 'Train', 'Eval', 'Explore', 'Import']
print(pred, labels[np.argmax(pred)])

[[0.00093987 0.00442266 0.9347053  0.05745431 0.00116859 0.00130913]] Train


In [29]:
code = ["model = KNeighborsClassifier(n_neighbors=3)\n model.fit(x, y)"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = model_weight.predict(padded)
labels = ['Load', 'Prep', 'Train', 'Eval', 'Explore', 'Import']
print(pred, labels[np.argmax(pred)])

[[0.00093987 0.00442266 0.9347053  0.05745431 0.00116859 0.00130913]] Train


In [30]:
model_model = load_model('model_model.h5')
code = ["model = KNeighborsClassifier(n_neighbors=3)\n model.fit(x, y)"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = model_model.predict(padded)
labels = ['Load', 'Prep', 'Train', 'Eval', 'Explore', 'Import']
print(pred, labels[np.argmax(pred)])

[[0.00093987 0.00442266 0.9347053  0.05745431 0.00116859 0.00130913]] Train


In [31]:
model_load = load_model('model.h5')
code = ["model = KNeighborsClassifier(n_neighbors=3)\n model.fit(x, y)"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = model_load.predict(padded)
labels = ['Load', 'Prep', 'Train', 'Eval', 'Explore', 'Import']
print(pred, labels[np.argmax(pred)])

[[0.00093987 0.00442266 0.9347053  0.05745431 0.00116859 0.00130913]] Train


In [32]:
lstm = load_model('LSTM.h5')
code = ["model = KNeighborsClassifier(n_neighbors=3)\n model.fit(x, y)"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = lstm.predict(padded)
labels = ['Load', 'Prep', 'Train', 'Eval', 'Explore', 'Import']
print(pred, labels[np.argmax(pred)])

[[5.7840010e-04 6.4588943e-03 1.4000367e-02 9.7178245e-01 1.1021891e-03
  6.0777483e-03]] Eval


In [None]:
code = ["import library\nimport otherlibrary"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = model.predict(padded)
print(pred, labels[np.argmax(pred)])

In [None]:
code = ["df = pd.read_csv('file')"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = model.predict(padded)
print(pred, labels[np.argmax(pred)])

In [None]:
code = ["df.shape\ndf.head()"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = model.predict(padded)
print(pred, labels[np.argmax(pred)])

In [None]:
code = ["accr = model.evaluate(X_test,y_test)\nprint('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = model.predict(padded)
print(pred, labels[np.argmax(pred)])

You can try it yourself by entering a cell's code:

In [None]:
code = ["INSERT CODE HERE"]
seq = tokenizer.texts_to_sequences(code)
padded = pad_sequences(seq, maxlen=max_len)
pred = model.predict(padded)
print(pred, labels[np.argmax(pred)])
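
The example cells above all repeat the same three steps, so they could be wrapped in one function. The sketch below takes the encoding and prediction steps as arguments so that it runs without Keras; `classify_cell`, `STAGES` and the stub functions are hypothetical names, and the stub's probabilities are copied from the `KNeighborsClassifier` output above.

```python
import numpy as np

STAGES = ['Load', 'Prep', 'Train', 'Eval', 'Explore', 'Import']

def classify_cell(code, encode, predict, stages=STAGES):
    """Return (stage name, probability) for a single cell's source code.

    encode:  maps a list of strings to a padded integer array
    predict: maps that array to one row of class probabilities per input
    """
    probs = np.asarray(predict(encode([code])))[0]
    best = int(np.argmax(probs))
    return stages[best], float(probs[best])

# In the notebook, the two callables would presumably be:
#   encode  = lambda texts: pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=max_len)
#   predict = model.predict
# Stubs are used here so the example is self-contained.
def stub_encode(texts):
    return np.zeros((len(texts), 120), dtype=int)

def stub_predict(batch):
    return np.array([[0.00093987, 0.00442266, 0.9347053,
                      0.05745431, 0.00116859, 0.00130913]])

stage, prob = classify_cell("model.fit(x, y)", stub_encode, stub_predict)
print(stage, prob)
```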

In [33]:
# Use the same standalone keras package the model was built with. Saving through
# tensorflow.contrib.keras uses a different backend session, which is a likely cause of
# "FailedPreconditionError: Attempting to use uninitialized value embedding_2_2/embeddings".
import keras
from keras import backend as K
# m = train_keras_cnn_model() # Fill in the gaps with your model
model_fn = "model-serialization.hdf5"
m_weights = model.get_weights()
keras.models.save_model(model, model_fn)
K.clear_session()