# Classifier Example Notebook

After collecting datasets and notebooks and creating ```cells.tsv``` with the Dataset Builder in [data_gathering](https://github.com/TAU-DB/guided-ds/tree/master/data_gathering), we now tag these cells with the relevant data science workflow stage using Snorkel weak supervision, and then train an LSTM classifier. <br>

This notebook's purpose is to explain the process and let you recreate it with different data.<br>
To see how we originally trained our classifier, and our results, see [Exploration_and_WeakSupervision.ipynb](https://github.com/TAU-DB/guided-ds/blob/master/Classification/Exploration_and_WeakSupervision.ipynb) and [Classification.ipynb](https://github.com/TAU-DB/guided-ds/blob/master/Classification/Classification.ipynb).

**Prerequisite**: You must have Snorkel installed.

In [None]:
#install snorkel for weak supervision
cd snorkel
! pip install .
cd ..
! pip install future

This should work, but if any problems occur, see [https://github.com/HazyResearch/snorkel#quick-start](https://github.com/HazyResearch/snorkel#quick-start).

## Weak Supervision

We use Snorkel's weak-supervision paradigm to tag our unlabeled data with a relatively small amount of noise, without having to hand-tag a large amount of data.
See https://github.com/HazyResearch/snorkel

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import numpy as np
import pandas as pd
%config IPCompleter.greedy=True

#start a snorkel session
from snorkel import SnorkelSession
session = SnorkelSession()

### Step 1: Loading & Preprocessing the data

We'll use the ```cells.tsv``` file we created as input

In [2]:
import sys
# insert at 1, 0 is the script path (or '' in REPL)
sys.path.insert(1, '../data_gathering')
import consts
input_path = consts.CELLS_TSV

print("Reading From: "+input_path)

df = pd.read_csv(input_path, delimiter='\t',encoding='utf-8')
# we get rid of empty cells
clean_df = df[df["Source"].notnull()]
#take a first look
clean_df.head(5)

Reading From: C:\Workspace\guided-ds\Example_Data\cells.tsv


Unnamed: 0,Cell ID,User Name,Notebook name,Source,Output,Execution count,Masked,AST,Label
0,oriormeir_#_xgboost-2-market-news.ipynb_#_1,oriormeir,xgboost-2-market-news.ipynb,import numpy as np import pandas as pd from sk...,[],1,import_numpy import_pandas import_datetime imp...,"Module(body=[Import(names=[alias(name='numpy',...",
1,oriormeir_#_xgboost-2-market-news.ipynb_#_2,oriormeir,xgboost-2-market-news.ipynb,def prepare_market_data(market_df): market...,[],2,market_df.drop,Module(body=[FunctionDef(name='prepare_market_...,
2,oriormeir_#_xgboost-2-market-news.ipynb_#_3,oriormeir,xgboost-2-market-news.ipynb,def prepare_news_data(news_df): news_df['p...,[],3,news_df.drop var5=var4.merge var5.drop var2.ex...,Module(body=[FunctionDef(name='prepare_news_da...,
3,oriormeir_#_xgboost-2-market-news.ipynb_#_4,oriormeir,xgboost-2-market-news.ipynb,"def prepare_data(market_df, news_df, start=Non...",[],4,dt dt,"Module(body=[FunctionDef(name='prepare_data', ...",
4,oriormeir_#_xgboost-2-market-news.ipynb_#_5,oriormeir,xgboost-2-market-news.ipynb,"(market_df, news_df) = env.get_training_data()...",[],5,,Module(body=[Assign(targets=[Tuple(elts=[Name(...,


#### Making sure there are no empty cells

In [3]:
empty_src_df = clean_df[clean_df["Source"] == ""]
print("Empty cells:", len(empty_src_df))
clean_df = clean_df[clean_df["Source"] != ""]
print("Before drop", len(clean_df))
# dropna() is not in-place, and a bare dropna() would also drop rows with an empty "Label",
# so we only require "Source" and "Cell ID" and assign the result back
clean_df = clean_df.dropna(subset=["Source", "Cell ID"])
clean_df = clean_df.drop_duplicates()
print("After drop", len(clean_df))
print("Unique ID's", len(clean_df["Cell ID"].unique()))

Empty cells: 0
Before drop 41
After drop 41
Unique ID's 41


### Step 2: Defining candidates

In [4]:
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import NgramMatcher, Matcher
from snorkel.models import Context, Document, Sentence, Span, Candidate, StableLabel,candidate_subclass
from snorkel.contrib.models.text import RawText

# our candidates are cells (what we classify), our values are the classes (Data-Science workflow stages)
cellCand  = candidate_subclass('Cell', ['cell'], values=['Load Data', 'Prep & Clean', 'Train & Param', 'Eval', 'Explore', 'Import', False])

# Our categorical classes: Load Data, Data Preparation & Cleaning, Model Training & Parameter Tuning, Model Evaluation, Data Exploration, Import
# Checking the candidate cardinality to make sure the object was created successfully
cellCand.cardinality

7

#### Now we extract candidates from the dataset and add them to the current session:

<u>Note: This process takes a long time to execute and can be skipped if the database already exists.</u>
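The extraction cell below assigns each cell to a split by its DataFrame index: rows with `i % 3 == 1` go to the dev split and all others go to train. As a quick sanity check of the resulting proportions, here is a minimal sketch on a toy index range. `assign_split` is a hypothetical helper written for this illustration; it is not part of Snorkel or of this repository.

```python
from collections import Counter

def assign_split(i):
    """Mirror the notebook's rule: index % 3 == 1 -> dev (1), otherwise train (0)."""
    return 1 if i % 3 == 1 else 0

# Toy check on 300 consecutive indices (an assumption; real indices may have gaps)
split_counts = Counter(assign_split(i) for i in range(300))
train_share = split_counts[0] / 300
print(split_counts, round(train_share, 3))
```

Under this rule about two thirds of the cells end up in the train split, which is in line with the 28/13 split reported further down.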

In [5]:
# Clearing any remaining data on the session
session.rollback()
session.query(Context).delete()
session.query(Candidate).delete()
session.query(Document).delete()
session.query(StableLabel).delete()

for i, row in clean_df.iterrows():
    c_stable_id = row["Cell ID"]
    c_name = 'cell_no_' + str(i)
    c_text = row["Source"]
    if c_stable_id is None or c_name is None or c_text is None:
        continue
    if c_stable_id == "" or c_name == "" or c_text == "":
        continue
    raw_text = RawText(stable_id=c_stable_id, name=c_name , text=c_text)
    if i % 3 != 1:
        # Split 0 is for training - roughly two thirds of the cells
        candidate = cellCand(cell=raw_text, split=0)
    else:
        # Split 1 is for evaluating - roughly one third of the cells
        candidate = cellCand(cell=raw_text, split=1)
    session.add(candidate)
session.commit()


#### Querying from the stored database:

**Continue here** - If you didn't run the previous cell it will query from the previously saved database, and if you did it will query from the most recent session.

In [6]:
train_cands = session.query(cellCand).filter(cellCand.split == 0).all()
dev_cands = session.query(cellCand).filter(cellCand.split == 1).all()

print("Number of Train candidates:", len(train_cands))
print("Number of Dev candidates:", len(dev_cands))

Number of Train candidates: 28
Number of Dev candidates: 13


### Step 3: Writing Labeling Functions

**The _categorical_ labeling functions (LFs) we now write can output the following values:**

* Abstain: `None` OR 0.
* Categorical values: One of the six categories we specify above (data science workflow stages classes).
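To make the contract above concrete, here is a minimal sketch of a categorical labeling function. It is a hypothetical example written for this explanation, not one of the functions in `LF_utils.py`: it assumes, as `LF_BeforeCell` below does, that an LF returns the class name as a string, and it uses a small stand-in object instead of a real Snorkel candidate.

```python
import re
from types import SimpleNamespace

ABSTAIN = None  # a labeling function abstains by returning None

def LF_import_only_sketch(c):
    """Vote 'Import' when the cell has import statements and no function calls."""
    text = c.cell.text
    has_import = re.search(r"\b(import|from)\s+\w+", text) is not None
    has_call = re.search(r"\w\s*\(", text) is not None
    return "Import" if has_import and not has_call else ABSTAIN

def fake_candidate(text):
    """Stand-in for a Snorkel candidate: only `.cell.text` is mimicked."""
    return SimpleNamespace(cell=SimpleNamespace(text=text))

import_vote = LF_import_only_sketch(fake_candidate("import numpy as np import pandas as pd"))
other_vote = LF_import_only_sketch(fake_candidate("df = pd.read_csv('cells.tsv')"))
print(import_vote, other_vote)
```

Keeping each LF narrow and letting it abstain otherwise is what lets many small heuristics be combined later by the generative model.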

In [7]:
# Getting an example candidate
c0 = train_cands[0]

print(c0.cell.text) # the code
print(c0.cell.stable_id) # unique id

import numpy as np import pandas as pd from sklearn.model_selection import train_test_split from kaggle.competitions import twosigmanews import datetime import time  env = twosigmanews.make_env()
oriormeir_#_xgboost-2-market-news.ipynb_#_1


#### Checking LF arrays
##### Each class has its own array of labeling functions

In [8]:
from utils import LF_utils as lf
print(len(lf.LFs_Load), "LF for class 'Load Data'")
print(len(lf.LFs_Prep), "LF for class 'Prep & Clean'")
print(len(lf.LFs_Train), "LF for class 'Train & Param'")
print(len(lf.LFs_Eval), "LF for class 'Eval'")
print(len(lf.LFs_Explore), "LF for class 'Explore'")
print(len(lf.LFs_Import), "LF for class 'Import'")

1 LF for class 'Load Data'
15 LF for class 'Prep & Clean'
13 LF for class 'Train & Param'
6 LF for class 'Eval'
4 LF for class 'Explore'
1 LF for class 'Import'


You can review or change the labeling functions [here](https://github.com/tamirhuber/Jupyter-Notebook-Cells-Classification/blob/master/utils/LF_utils.py)

#### Merging all LFs into one array

In [9]:
# Concatenate the per-class LF lists into one flat array
LF_helpers = np.concatenate(
    (lf.LFs_Load, lf.LFs_Prep, lf.LFs_Train, lf.LFs_Eval, lf.LFs_Explore, lf.LFs_Import),
    axis=None,
)

print("Total of", len(LF_helpers), "LF are loaded")


Total of 40 LF are loaded


#### Adding an LF that uses information from the previous cell
Note: This function is not in the utils file because it needs variables that are defined in this scope.

In [10]:
import random
def LF_BeforeCell(c):
    is_plotting = False
    my_labels = []
    for func in LF_helpers:
        res = func(c)
        if res is not None:
            if res not in my_labels:
                my_labels.append(res)
    if len(my_labels) > 0:
        if len(my_labels) == 1:
            if lf.LF_Plotting(c) is None:
                return None
            else:
                is_plotting = True
        else:
            return None
    pre = ""
    pre = lf.getPreCellSource(c, 1)
    raw_text = RawText(stable_id="temp", name='temp_cell', text=pre)
    candidate = cellCand(cell=raw_text, split=2)
    labels = []
    for func in LF_helpers:
        res = func(candidate)
        if res is not None:
            if res not in labels:
                labels.append(res)
    if len(labels) == 0:
        # if the previous cell gives no labels either, fall back to the cell before it
        pre = lf.getPreCellSource(c,2)
        raw_text = RawText(stable_id="temp", name='temp_cell', text=pre)
        candidate2 = cellCand(cell=raw_text, split=2)
        for func in LF_helpers:
            res = func(candidate2)
            if res is not None:
                if res not in labels:
                    labels.append(res)
        if len(labels) == 0:
            if is_plotting:
                return lf.LF_Plotting(c)
            return None
    if 'print' in pre or 'print' in c.cell.text or is_plotting:
        if 'Load Data' in labels or 'Prep & Clean' in labels:
            return 'Explore'
        if 'Eval' in labels or 'Train & Param' in labels:
            return 'Eval'
    return random.choice(labels)


In [11]:
# Adding LF that uses the info from the previous cell
LF_arr = LF_helpers.tolist()
LF_arr.append(LF_BeforeCell)
print("Total of", len(LF_arr), "LF are set for applying")

Total of 41 LF are set for applying


#### Now we load our gold label data (hand-labeled)
#### <u> If you skip candidate extraction you may also skip this cell </u>

**For a small example we probably didn't tag any of the cells manually...** <br>
You can tag cells for your dataset and change the path to ```gold_labels.tsv``` in ```consts.GOLD_LABELS```


In [12]:
from utils.Label_util import load_external_labels
from snorkel.models import StableLabel

# with session.no_autoflush:
session.rollback()
%time missed = load_external_labels(session, cellCand, annotator_name='gold')

consts imported
1024
AnnotatorLabels created: 0
AnnotatorLabels created: 0
Wall time: 3.32 s


In [13]:
from snorkel.annotations import load_gold_labels
L_gold_dev  = load_gold_labels(session, annotator_name='gold', split=1)
L_gold_dev

<13x1 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>

#### Unit testing the LF functions of each class on the gold label set

Note: skip if there are no relevant gold labels found

In [None]:
from utils import Eval_utils as eu

print('LF_Import:')
tp, fp, tn, fn = eu.test_LF(session, lf.LFs_Import, 6, split=1, annotator_name='gold')
print('LF_LoadData:')
tp, fp, tn, fn = eu.test_LF(session, lf.LFs_Load, 1, split=1, annotator_name='gold')
print('LF_PrepAndClean:')
tp, fp, tn, fn = eu.test_LF(session, lf.LFs_Prep, 2, split=1, annotator_name='gold')
print('LF_TrainAndParam:')
tp, fp, tn, fn = eu.test_LF(session, lf.LFs_Train, 3, split=1, annotator_name='gold')
print('LF_Eval:')
tp, fp, tn, fn = eu.test_LF(session, lf.LFs_Eval, 4, split=1, annotator_name='gold')
print('LF_Explore:')
tp, fp, tn, fn = eu.test_LF(session, lf.LFs_Explore, 5, split=1, annotator_name='gold')

Some functions are better than others, and there are also many collisions between them (as we are about to see), so the labels are noisy.

##### Creating a "Labeler"

In [16]:
from snorkel.annotations import LabelAnnotator
labeler = LabelAnnotator(lfs=LF_arr)

#### Now we apply the LFs to the candidates of the train set:


In [17]:
%time L_train = labeler.apply(split=0)

Clearing existing...
Running UDF...

Wall time: 623 ms


#### Loading Labeled Train Set

In [18]:
%time L_train = labeler.load_matrix(session, split=0)
L_train

Wall time: 4.01 ms


<28x41 sparse matrix of type '<class 'numpy.int32'>'
	with 59 stored elements in Compressed Sparse Row format>

#### Showing some of the labeling results (using the labeling functions with no weights)

In [19]:
# Print the LF labels and the source code of the first three train candidates
for i in range(3):
    c = L_train.get_candidate(session, i)
    print(c.labels) #labels
    print(c.cell.text) #code
    print("##########\n\n")


[Label (LF_Import = 6)]
import numpy as np import pandas as pd from sklearn.model_selection import train_test_split from kaggle.competitions import twosigmanews import datetime import time  env = twosigmanews.make_env()
##########


[Label (LF_Def = 2), Label (LF_Concat = 2), Label (LF_Drop = 2), Label (LF_sklearn_impute = 2)]
def prepare_news_data(news_df):     news_df['position'] = news_df['firstMentionSentence'] / news_df['sentenceCount']     news_df['coverage'] = news_df['sentimentWordCount'] / news_df['wordCount']      droplist = ['sourceTimestamp','firstCreated','sourceId','headline','takeSequence','provider','firstMentionSentence',                 'sentenceCount','bodySize','headlineTag','marketCommentary','subjects','audiences','sentimentClass',                 'assetName', 'urgency','wordCount','sentimentWordCount']     news_df.drop(droplist, axis=1, inplace=True)      # create a mapping between 'assetCode' to 'news_index'     assets = []     indices = []     for i, values in 

As mentioned before, we can see that there are overlaps and that different LFs don't always agree.

#### LF functions Mutual Statistics

In [20]:
L_train.lf_stats(session)

Unnamed: 0,j,Coverage,Overlaps,Conflicts
LF_Read,0,0.0,0.0,0.0
LF_Def,1,0.142857,0.142857,0.035714
LF_Concat,2,0.178571,0.178571,0.107143
LF_Split,3,0.107143,0.107143,0.071429
LF_Drop,4,0.142857,0.142857,0.107143
LF_Fill,5,0.142857,0.142857,0.107143
LF_Nulls,6,0.142857,0.142857,0.107143
LF_Loc,7,0.142857,0.142857,0.107143
LF_Transformions,8,0.035714,0.035714,0.035714
LF_TransformOps,9,0.071429,0.071429,0.071429


Labeling using only the labeling functions is very noisy because of these overlaps and conflicts.
Snorkel uses a generative model to decide which function is "better" for each case.
This effectively adds "weights" to the labeling functions and gives more decisive, less noisy results.
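To clarify what the three columns in the statistics table measure, here is a small plain-Python sketch on a toy label matrix. It assumes the usual definitions (coverage: the LF voted; overlap: the LF voted and at least one other LF voted too; conflict: the LF voted and another LF voted for a different class). It is an illustration only, not Snorkel's implementation of `lf_stats`.

```python
def lf_summary(L):
    """L: one row per candidate, one column per LF; 0 means the LF abstained."""
    n_cands, n_lfs = len(L), len(L[0])
    stats = []
    for j in range(n_lfs):
        covered = overlaps = conflicts = 0
        for row in L:
            if row[j] == 0:
                continue  # LF j abstained on this candidate
            covered += 1
            others = [v for k, v in enumerate(row) if k != j and v != 0]
            if others:
                overlaps += 1
            if any(v != row[j] for v in others):
                conflicts += 1
        stats.append((covered / n_cands, overlaps / n_cands, conflicts / n_cands))
    return stats

# Toy label matrix: 4 candidates x 3 LFs, values are class ids (made-up data)
toy_L = [
    [6, 0, 0],
    [0, 2, 2],
    [0, 2, 5],
    [0, 0, 0],
]
toy_stats = lf_summary(toy_L)
for j, (cov, ovl, con) in enumerate(toy_stats):
    print(j, cov, ovl, con)
```

An LF whose conflict rate is close to its coverage disagrees with the other LFs almost every time it votes, which makes it a good candidate for review.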

### Step 4: Training the Generative Model

In [21]:
from snorkel.learning import GenerativeModel
gen_model = GenerativeModel()

#### <u>Note - the following cell may take a long time to execute. You can skip straight to "Loading the trained Generative Model".</u>

In [22]:
# Note: We pass cardinality explicitly here to be safe
gen_model.train(L_train, cardinality=6)
gen_model.save()

[GenerativeModel] Model saved as <GenerativeModel>.


#### Loading the trained Generative Model

In [23]:
gen_model.load()

[GenerativeModel] Model <GenerativeModel> loaded.


#### LFs accuracy with generative model weights

In [24]:
gen_model.weights.lf_accuracy

array([0.96982637, 1.00464003, 0.99640414, 0.97904441, 0.9891655 ,
       1.00640758, 0.99413879, 1.00700049, 0.97643607, 0.99029241,
       0.97501085, 0.97559757, 0.97518811, 0.97905146, 0.98652983,
       0.99025138, 0.9652113 , 0.98439672, 0.97529347, 0.973527  ,
       0.96696737, 0.98181254, 0.97485741, 0.970484  , 0.98434841,
       0.96763197, 0.97923347, 0.97223339, 0.9792072 , 0.96103026,
       0.97617537, 0.97680178, 0.97707546, 0.97980155, 0.97531532,
       0.96162941, 0.97682946, 0.97412664, 0.96668829, 0.97200587,
       0.98059102])

#### Generate Train Marginals
#### <u> Note - the following cell may take a few minutes</u>

In [25]:
train_marginals = gen_model.marginals(L_train)

### Now we apply the LFs to the candidates of the test set:
#### <u>Note - the following cell may take a long time (~30 minutes). You can skip straight to "Loading Labeled Test Set".</u>

In [26]:
L_test = labeler.apply_existing(split=1)

Clearing existing...
Running UDF...



#### Loading Labeled Test Set

In [27]:
%time L_test = labeler.load_matrix(session, split=1)
L_test

Wall time: 3.03 ms


<13x41 sparse matrix of type '<class 'numpy.int32'>'
	with 22 stored elements in Compressed Sparse Row format>

#### Generate Test Marginals
#### <u> Note - the following cell may take a few minutes </u>

In [28]:
from snorkel.annotations import save_marginals
test_marginals = gen_model.marginals(L_test)

#### Loading Gold Labels to test Snorkel results

In [29]:
gold_path = consts.GOLD_LABELS
labeled = pd.read_csv(gold_path,delimiter='\t',encoding='utf-8')
print("Total labeled data:", len(labeled))
labeled.head()

Total labeled data: 1025


Unnamed: 0,cell,label
0,kerneler_#_starter-advance-u-s-international-0...,6
1,kerneler_#_starter-advance-u-s-international-0...,1
2,kerneler_#_starter-advance-u-s-international-1...,1
3,kerneler_#_starter-advance-u-s-international-1...,1
4,kerneler_#_starter-advance-u-s-international-3...,1


In [30]:
# Clearing any session errors and checking the test set size
session.rollback()
L_test.shape[0]

13

#### Querying the gold-labeled test set candidates from the session
#### <u>We do this once and save only the indexes of those candidates to save time later. This might take a few minutes.</u>

In [31]:
tag_cand_index = []
for i in range(0, L_test.shape[0]):
    cand = L_test.get_candidate(session, i)
    cell_id = cand.cell.stable_id
    is_tagged = len(labeled[labeled["cell"] == cell_id])
    if is_tagged > 0:
        tag_cand_index.append(i)
print("Total labeled data in the test-set:", len(tag_cand_index))

Total labeled data in the test-set: 0


**IMPORTANT NOTE**: If there are no tagged cells in the test set, skip the next evaluation cells (they will raise an error because there is nothing to compare against) and continue with Step 5.

#### Reload labeled data (if any updates were made; mostly used to check errors in gold labels)

In [None]:
# This was used in order to fix hand tagged errors and re-test the model result.
labeled = pd.read_csv(gold_path,delimiter='\t',encoding='utf-8')
print("Total labeled data:", len(labeled))

#### Now let's check our categorical accuracy with the gold labels test set:

In [None]:
from utils import Eval_utils as eu
y_pred_arr, y_true_arr = eu.calc_lf_acc(session, tag_cand_index, L_test, test_marginals, labeled)

#### Classification Report

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_true_arr, y_pred_arr))

The report shows that we label some data preparation cells as data exploration: data preparation recall is lower and data exploration precision is lower. Separating the two is a hard task, and overall the results are pretty good.
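The link between the mislabeling and the two metrics can be seen in a small plain-Python sketch. The labels below are made up for illustration (2 = Prep, 5 = Explore, following the class ids used earlier) and are not results from this notebook; `per_class_precision_recall` is a hypothetical helper, not part of `Eval_utils`.

```python
def per_class_precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, treating it as positive vs. the rest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up example: two Prep cells are predicted as Explore
toy_true = [2, 2, 2, 2, 5, 5]
toy_pred = [2, 2, 5, 5, 5, 5]
prep_pr = per_class_precision_recall(toy_true, toy_pred, 2)
explore_pr = per_class_precision_recall(toy_true, toy_pred, 5)
print(prep_pr, explore_pr)
```

Every Prep cell that is predicted as Explore lowers Prep recall and Explore precision at the same time, which is the pattern described above.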

### Step 5: Labeling the data

In [32]:
Labels_str = ['Unknown', 'Load', 'Prep', 'Train', 'Eval', 'Explore', 'Import']
tagged_df = clean_df.copy()
#add a new column for the label
tagged_df["Label"] = ""
tagged_df.head(1) #just to check the column was added

Unnamed: 0,Cell ID,User Name,Notebook name,Source,Output,Execution count,Masked,AST,Label
0,oriormeir_#_xgboost-2-market-news.ipynb_#_1,oriormeir,xgboost-2-market-news.ipynb,import numpy as np import pandas as pd from sk...,[],1,import_numpy import_pandas import_datetime imp...,"Module(body=[Import(names=[alias(name='numpy',...",


We tag each cell with the class that has the highest probability in the generative model output. Cells whose highest probability is shared by several classes are tagged `Unknown`.
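The rule applied by the labeling cell can be sketched on a single row of marginals as follows. This is a simplified plain-Python illustration of that logic (the notebook itself works on NumPy arrays); `label_from_marginals` and `LABELS` are names chosen for this sketch.

```python
LABELS = ['Unknown', 'Load', 'Prep', 'Train', 'Eval', 'Explore', 'Import']

def label_from_marginals(marg):
    """Return the label with the highest probability, or 'Unknown' on a tie."""
    best = max(marg)
    winners = [k for k, p in enumerate(marg) if p == best]
    if len(winners) > 1:
        return LABELS[0]  # several classes share the maximum: no decision
    return LABELS[winners[0] + 1]  # +1 because index 0 is reserved for 'Unknown'

clear_case = label_from_marginals([0.05, 0.7, 0.05, 0.05, 0.1, 0.05])
tie_case = label_from_marginals([0.25, 0.25, 0.2, 0.1, 0.1, 0.1])
print(clear_case, tie_case)
```

Rows tagged `Unknown` are filtered out again before the end model is trained.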

 <u> Note - the following cell may take a long time to execute for a large number of cells.</u> 

In [33]:
#Set label according to the highest probability
for i,cand in enumerate(train_cands):
    marg = train_marginals[i]
    id = cand.cell.stable_id
    boolcol = (tagged_df["Cell ID"] == id)
    idx = tagged_df.index[boolcol][0]
    max =  marg.max()
    labels = np.where(marg == max)[0]
    if len(labels) > 1:
        label = 0
    else:
        label = labels[0] + 1
    
    tagged_df.at[idx, "Label"] = Labels_str[label]
        
for i in range(0, L_test.shape[0]):
    marg = test_marginals[i]
    cand = L_test.get_candidate(session, i)
    id = cand.cell.stable_id
    boolcol = (tagged_df["Cell ID"] == id)
    idx = tagged_df.index[boolcol][0]
    max =  marg.max()
    labels = np.where(marg == max)[0]
    if len(labels) > 1:
        label = 0
    else:
        label = labels[0] + 1
    
    tagged_df.at[idx, "Label"] = Labels_str[label]
tagged_df.head(5) #just to see some tags

Unnamed: 0,Cell ID,User Name,Notebook name,Source,Output,Execution count,Masked,AST,Label
0,oriormeir_#_xgboost-2-market-news.ipynb_#_1,oriormeir,xgboost-2-market-news.ipynb,import numpy as np import pandas as pd from sk...,[],1,import_numpy import_pandas import_datetime imp...,"Module(body=[Import(names=[alias(name='numpy',...",Import
1,oriormeir_#_xgboost-2-market-news.ipynb_#_2,oriormeir,xgboost-2-market-news.ipynb,def prepare_market_data(market_df): market...,[],2,market_df.drop,Module(body=[FunctionDef(name='prepare_market_...,Prep
2,oriormeir_#_xgboost-2-market-news.ipynb_#_3,oriormeir,xgboost-2-market-news.ipynb,def prepare_news_data(news_df): news_df['p...,[],3,news_df.drop var5=var4.merge var5.drop var2.ex...,Module(body=[FunctionDef(name='prepare_news_da...,Prep
3,oriormeir_#_xgboost-2-market-news.ipynb_#_4,oriormeir,xgboost-2-market-news.ipynb,"def prepare_data(market_df, news_df, start=Non...",[],4,dt dt,"Module(body=[FunctionDef(name='prepare_data', ...",Prep
4,oriormeir_#_xgboost-2-market-news.ipynb_#_5,oriormeir,xgboost-2-market-news.ipynb,"(market_df, news_df) = env.get_training_data()...",[],5,,Module(body=[Assign(targets=[Tuple(elts=[Name(...,Explore


#### Export tagged dataset

We export the snorkel tagged dataset to the Data Folder defined by ```consts.DATA_FOLDER``` as ```input.tsv```.<br>
This file will be used as input to train a supervised end-classification-model.

In [34]:
import csv
import os

save_path= os.path.join(consts.DATA_FOLDER,"input.tsv")

tagged_df.to_csv(save_path, sep='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)


## End Model (LSTM Classifier)

Now that we have our Snorkel-tagged cells, we use them to train a supervised LSTM model that classifies a given cell's source code into the relevant data science workflow stage (multi-class text classification).

**Install prerequisites**:

In [None]:
# install necessary packages
! pip install -U --user pip six numpy wheel mock pandas
! pip install -U --user keras_applications==1.0.6 --no-deps
! pip install -U --user keras_preprocessing==1.0.5 --no-deps
! pip install keras tensorflow scikit-learn

This should work, but if any problems occur, see [https://www.tensorflow.org/install](https://www.tensorflow.org/install) and [https://keras.io/#installation](https://keras.io/#installation).

In [35]:
# First let's import relevant libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, Dropout
from keras.models import Sequential
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
from keras.callbacks import EarlyStopping
from keras.metrics import mean_squared_error
from keras.models import load_model
from keras.models import model_from_json
import pickle

# Input data files are available in the "../input/" directory.
import os

Using TensorFlow backend.


Note: if the keras and tensorflow installation was successful and the imports still fail, try restarting the kernel and clearing outputs, then run the imports cell again.

In [36]:
# load our tagged Data

import sys
# insert at 1, 0 is the script path (or '' in REPL)
sys.path.insert(1, '../data_gathering')
import consts

load_path = os.path.join(consts.DATA_FOLDER,"input.tsv")
data = pd.read_csv(load_path, delimiter='\t', usecols=['Cell ID', 'Source', 'Label'])

### Pre-Processing

In [37]:
#first we'll remove cells that snorkel didn't tag
data.dropna(subset=['Label'], how='all', inplace = True)
data = data[data.Label != 'Unknown']

In [38]:
#now let's take a look at some random cells
data.sample(5)

Unnamed: 0,Cell ID,Source,Label
3,oriormeir_#_xgboost-2-market-news.ipynb_#_4,"def prepare_data(market_df, news_df, start=Non...",Prep
7,oriormeir_#_xgboost-2-market-news.ipynb_#_8,"print(""generating predictions..."") days = env....",Train
16,alluxia_#_lb-0-6326-tuned-xgboost-baseline.ipy...,"X_train, X_test, up_train, up_test, r_train, r...",Train
5,oriormeir_#_xgboost-2-market-news.ipynb_#_6,train_columns = [x for x in merged_df.columns ...,Prep
17,alluxia_#_lb-0-6326-tuned-xgboost-baseline.ipy...,from xgboost import XGBClassifier import time,Import


For each Jupyter notebook code cell, our tsv contains: <br>
a unique cell id, the cell's source code, and the label that was generated by Snorkel.

#### Class Imbalance

Let's take a look at the tagged data value counts.

In [39]:
#now let's see
data.Label.value_counts()

Prep       9
Import     6
Train      6
Explore    4
Eval       3
Name: Label, dtype: int64

We can see the classes are imbalanced (in this small example, `Prep` is the largest class). We want balanced classes for training, so we take a fixed number of cells from each class, i.e. we undersample the large classes.
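The undersampling idea can be sketched without pandas as follows. This is a minimal plain-Python illustration on made-up rows, assuming each row is a dict with a `Label` key; `undersample` is a hypothetical helper, and the cells below do the same thing on the real DataFrame by shuffling and slicing each class.

```python
import random
from collections import defaultdict

def undersample(rows, label_key, per_class, seed=0):
    """Keep at most `per_class` randomly chosen rows for each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    balanced = []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        balanced.extend(label_rows[:per_class])
    rng.shuffle(balanced)  # mix the classes again after concatenating
    return balanced

# Toy rows standing in for the tagged DataFrame (made-up data)
toy_rows = [{"Source": "cell %d" % k, "Label": lab}
            for k, lab in enumerate(["Prep"] * 9 + ["Import"] * 6 + ["Eval"] * 3)]
balanced_rows = undersample(toy_rows, "Label", per_class=2)
balanced_counts = {}
for row in balanced_rows:
    balanced_counts[row["Label"]] = balanced_counts.get(row["Label"], 0) + 1
print(balanced_counts)
```

A class with fewer than `per_class` rows simply contributes all of its rows, so very small classes stay underrepresented.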

In [40]:
# first we shuffle the data by randomly re-indexing
shuffled = data.reindex(np.random.permutation(data.index))
shuffled.head(5) #check that the data is indeed shuffled


Unnamed: 0,Cell ID,Source,Label
3,oriormeir_#_xgboost-2-market-news.ipynb_#_4,"def prepare_data(market_df, news_df, start=Non...",Prep
35,charleslandau_#_iterative-approach.ipynb_#_9,from sklearn.metrics import accuracy_score imp...,Eval
9,alluxia_#_lb-0-6326-tuned-xgboost-baseline.ipy...,"def data_prep(market_train,news_train): ma...",Prep
2,oriormeir_#_xgboost-2-market-news.ipynb_#_3,def prepare_news_data(news_df): news_df['p...,Prep
6,oriormeir_#_xgboost-2-market-news.ipynb_#_7,from xgboost import XGBClassifier import time ...,Train


In [41]:
fixed_class_size = 2 #originally we took 5,000 cells from each class 
l  = shuffled[shuffled['Label'] == 'Load'][:fixed_class_size]
p  = shuffled[shuffled['Label'] == 'Prep'][:fixed_class_size]
t  = shuffled[shuffled['Label'] == 'Train'][:fixed_class_size]
ev = shuffled[shuffled['Label'] == 'Eval'][:fixed_class_size]
ex = shuffled[shuffled['Label'] == 'Explore'][:fixed_class_size]
i  = shuffled[shuffled['Label'] == 'Import'][:fixed_class_size]

concated = pd.concat([l, p, t, ev, ex, i], ignore_index=True) #our new data with balanced classes
concated.head(5)

Unnamed: 0,Cell ID,Source,Label
0,oriormeir_#_xgboost-2-market-news.ipynb_#_4,"def prepare_data(market_df, news_df, start=Non...",Prep
1,alluxia_#_lb-0-6326-tuned-xgboost-baseline.ipy...,"def data_prep(market_train,news_train): ma...",Prep
2,oriormeir_#_xgboost-2-market-news.ipynb_#_7,from xgboost import XGBClassifier import time ...,Train
3,alluxia_#_lb-0-6326-tuned-xgboost-baseline.ipy...,"X_train, X_test, up_train, up_test, r_train, r...",Train
4,charleslandau_#_iterative-approach.ipynb_#_9,from sklearn.metrics import accuracy_score imp...,Eval


In [42]:
#Shuffle the dataset again by re-indexing
concated = concated.reindex(np.random.permutation(concated.index))
concated.head(5)

Unnamed: 0,Cell ID,Source,Label
2,oriormeir_#_xgboost-2-market-news.ipynb_#_7,from xgboost import XGBClassifier import time ...,Train
9,charleslandau_#_iterative-approach.ipynb_#_1,from kaggle.competitions import twosigmanews i...,Import
5,charleslandau_#_iterative-approach.ipynb_#_12,#env.predict(predictions_template_df),Eval
4,charleslandau_#_iterative-approach.ipynb_#_9,from sklearn.metrics import accuracy_score imp...,Eval
3,alluxia_#_lb-0-6326-tuned-xgboost-baseline.ipy...,"X_train, X_test, up_train, up_test, r_train, r...",Train


#### Tokenization and Vector representation of label and code

We'll represent the label as a one-hot vector
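As a reminder of what a one-hot vector is, here is a minimal plain-Python sketch that uses the same label-to-int mapping as the cell below. `one_hot` and `LABEL_TO_INT` are names chosen for this illustration; the notebook itself relies on Keras' `to_categorical`.

```python
LABEL_TO_INT = {'Load': 0, 'Prep': 1, 'Train': 2, 'Eval': 3, 'Explore': 4, 'Import': 5}

def one_hot(label, num_classes=6):
    """Return a list with 1.0 at the label's index and 0.0 everywhere else."""
    vec = [0.0] * num_classes
    vec[LABEL_TO_INT[label]] = 1.0
    return vec

train_vec = one_hot('Train')
import_vec = one_hot('Import')
print(train_vec, import_vec)
```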

In [43]:
#add int representation of the label
concated['INT'] = 0
concated.loc[concated['Label'] == 'Load', 'INT']  = 0
concated.loc[concated['Label'] == 'Prep', 'INT']  = 1
concated.loc[concated['Label'] == 'Train', 'INT']  = 2
concated.loc[concated['Label'] == 'Eval', 'INT'] = 3
concated.loc[concated['Label'] == 'Explore', 'INT'] = 4
concated.loc[concated['Label'] == 'Import', 'INT']  = 5

#one-hot encode the label
labels = to_categorical(concated['INT'], num_classes=6)
if 'Label' in concated.keys():
    concated.drop(['Label'], axis=1)
# '''
#  [1. 0. 0. 0. 0. 0.] load data
#  [0. 1. 0. 0. 0. 0.] data preparation and cleaning
#  [0. 0. 1. 0. 0. 0.] model training and parameter tuning
#  [0. 0. 0. 1. 0. 0.] model evaluation
#  [0. 0. 0. 0. 1. 0.] data exploration
#  [0. 0. 0. 0. 0. 1.] imports
# '''

#let's print some of the labels to see the encoding
labels.view()

array([[0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0.]], dtype=float32)

We remove all comments, since a comment may refer to an action that wasn't actually performed, or to something done before the current cell, and so it only interferes with classifying the current cell correctly.

In [44]:
from utils.utils import findAndRemoveComments
concated['Source'] = concated['Source'].apply(findAndRemoveComments)

consts imported


Now we lowercase the code, filter out special characters and dots, and split each cell's code into tokens.
Then we map the most common words to ints, so each cell is represented as a vector of ints according to the words it contains. The vectors are then padded to a fixed length of `max_len` (120 in the cell below).
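The three steps (tokenize, map tokens to ints, pad) can be sketched in plain Python as follows. This is a simplified illustration, not the Keras `Tokenizer` itself: the splitting rule, the helper names and the toy cells are assumptions made for this sketch. Like the default behavior of Keras' `pad_sequences`, it pads and truncates at the front of the sequence.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on anything that is not a letter, digit or underscore."""
    return [tok for tok in re.split(r"[^a-z0-9_]+", text.lower()) if tok]

def fit_vocab(texts, num_words):
    """Map the most frequent tokens to ints starting at 1 (0 is kept for padding)."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    most_common = [tok for tok, _ in counts.most_common(num_words - 1)]
    return {tok: idx for idx, tok in enumerate(most_common, start=1)}

def encode_and_pad(text, vocab, max_len):
    """Encode known tokens, keep the last `max_len` ids and left-pad with zeros."""
    ids = [vocab[tok] for tok in tokenize(text) if tok in vocab][-max_len:]
    return [0] * (max_len - len(ids)) + ids

# Made-up cells for illustration
toy_cells = ["import pandas as pd", "df = pd.read_csv('cells.tsv')", "df.head()"]
toy_vocab = fit_vocab(toy_cells, num_words=8000)
padded = encode_and_pad("df.head()", toy_vocab, max_len=5)
print(padded)
```

Because every vector has the same length, the cells can be stacked into one matrix and fed to the embedding layer.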

In [45]:
n_most_common_words = 8000
max_len = 120
tokenizer = Tokenizer(num_words=n_most_common_words, filters='!"#$%&()*+,.-/:;<=>?@[\]^`{|}~\n\r\t \'', lower=True)
tokenizer.fit_on_texts(concated['Source'].values)
sequences = tokenizer.texts_to_sequences(concated['Source'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
print("-------")

# to print our "words" dictionary uncomment the following line (long print) 
# print(word_index)

X = pad_sequences(sequences, maxlen=max_len)

Found 156 unique tokens.
-------


Now we split the data, represented as vectors, into train and test sets.

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X , labels, test_size=0.25, random_state=42)

## LSTM Model

Now we set up and train an LSTM model using the vector representation of the code and the labels (supervised learning, where the labels were generated by Snorkel).

#### Parameter Definitions

In [47]:
epochs = 3 # originally we trained with 15 epochs
# we use an EarlyStopping callback that monitors val_acc: training stops once validation accuracy stops improving,
# which also helps to avoid overfitting
emb_dim = 512
batch_size = 256
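The patience rule behind early stopping can be sketched in a few lines. This is a simplified illustration, not the Keras implementation: it assumes a monitored score that should increase (such as validation accuracy), and the function name is hypothetical.

```python
def early_stopping_epoch(val_scores, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best = float("-inf")
    wait = 0  # epochs since the last real improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score - min_delta > best:
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# improvement stalls after epoch 2, so with patience=2 training stops at epoch 4
print(early_stopping_epoch([0.5, 0.6, 0.6, 0.6], patience=2))
# with only 3 epochs and patience=7 the rule can never trigger
print(early_stopping_epoch([0.5, 0.6, 0.6], patience=7))
```

With `epochs = 3` and the `patience=7` used in the training cell, early stopping cannot fire in this example run; it only matters for longer runs such as the original 15 epochs.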

#### Model Setup and Training
<u>Note: model training can take up to 1 hour; you can skip it and load the trained model in the loading section below</u>

In [48]:
print("(X_train.shape, y_train.shape, X_test.shape, y_test.shape)")
print((X_train.shape, y_train.shape, X_test.shape, y_test.shape))
model = Sequential()
model.add(Embedding(n_most_common_words, emb_dim, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.8))
model.add(LSTM(64, dropout=0.8, recurrent_dropout=0.8))
model.add(Dense(6, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc', mean_squared_error])
print(model.summary())
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,validation_split=0.2,callbacks=[EarlyStopping(monitor='val_acc',patience=7, min_delta=0.0001)])

(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
((7, 120), (7, 6), (3, 120), (3, 6))


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 120, 512)          4096000   
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 120, 512)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                147712    
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 390       
Total params: 4,244,102
Trainable params: 4,244,102
Non-trainable params: 0
_________________________________________________________________
None
Train on 5 samples, validate on 2 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
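The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers with the sizes from this notebook; the function name is hypothetical.

```python
def count_params(vocab_size, emb_dim, lstm_units, n_classes):
    """Parameter counts for an Embedding -> LSTM -> Dense stack."""
    # one emb_dim-sized vector per word in the vocabulary
    embedding = vocab_size * emb_dim
    # 4 gates, each with input weights, recurrent weights and a bias
    lstm = 4 * ((emb_dim + lstm_units) * lstm_units + lstm_units)
    # weight matrix plus one bias per output class
    dense = lstm_units * n_classes + n_classes
    return embedding, lstm, dense


embedding, lstm, dense = count_params(vocab_size=8000, emb_dim=512, lstm_units=64, n_classes=6)
print(embedding, lstm, dense, embedding + lstm + dense)
```

Almost all of the parameters sit in the embedding layer, so `n_most_common_words` and `emb_dim` are the settings that drive model size.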


#### Save trained classifier

The trained LSTM model will be saved (in 3 different files) to the folder defined by ```consts.CLASSIFIER```.

In [62]:
save_folder = consts.CLASSIFIER

if not os.path.isdir(save_folder):
    os.mkdir(save_folder)

picke_file_path = os.path.join(save_folder, "tokenizer.pickle")
json_file_path = os.path.join(save_folder, "model.json")
h5_file_path = os.path.join(save_folder, "model.h5")

In [63]:
# save the trained model (multiple relevant files)
with open(picke_file_path, 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

model_json = model.to_json()
with open(json_file_path, "w") as json_file:
    json_file.write(model_json)

model.save_weights(h5_file_path)
print("Saved model to disk")


Saved model to disk


#### Load the trained model (continue here if you skipped training)

In [64]:
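# load the trained model (not needed if you just trained it above)
with open(picke_file_path, 'rb') as handle:
    load_tokenizer = pickle.load(handle)

with open(json_file_path, 'r') as json_file:
    loaded_model_json = json_file.read()
load_model = model_from_json(loaded_model_json)

load_model.load_weights(h5_file_path)
# a model rebuilt from JSON is not compiled, so compile it (same settings as in training) before evaluating
load_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc', mean_squared_error])

# the cells below refer to `model` and `tokenizer`; if you skipped training, rebind them first:
# model, tokenizer = load_model, load_tokenizer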



## Model Evaluation

In [65]:
accr = model.evaluate(X_test,y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

Test set
  Loss: 1.785
  Accuracy: 0.333


With only a handful of example cells (here 7 for training and 3 for testing), the results will probably be poor.
You can see our original results in [Classification.ipynb](https://github.com/TAU-DB/guided-ds/blob/master/Classification/Classification.ipynb).
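For a sense of scale, the sketch below computes two reference points for the numbers printed above: the cross-entropy loss of a model that assigns equal probability to all 6 classes, and the accuracy values that are possible at all with 3 test samples. The function names are hypothetical.

```python
import math


def uniform_cross_entropy(n_classes):
    """Cross-entropy loss of a prediction that gives every class the same probability."""
    return -math.log(1.0 / n_classes)


def possible_accuracies(n_samples):
    """All accuracy values that can occur when evaluating on n_samples examples."""
    return [correct / n_samples for correct in range(n_samples + 1)]


print(round(uniform_cross_entropy(6), 3))
print([round(a, 3) for a in possible_accuracies(3)])
```

The uniform baseline is ln 6, about 1.792, so a test loss of 1.785 is close to an uninformed guess, and with 3 test cells the accuracy can only move in steps of one third.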

### Tag Cells

To tag all of our cells using the trained model:

In [68]:
def tag_existing_cells(cells_file_path):
    # read the cells file and predict a workflow-stage label for every cell
    df = pd.read_csv(cells_file_path, delimiter='\t')
    row_count = df.shape[0]
    labels = ['Load', 'Prep', 'Train', 'Eval', 'Explore', 'Import']
    outputs = []  # predicted label for each cell, in row order

    print("Generating labels...")
    for index, source in enumerate(df["Source"]):
#         if index % (row_count//100) == 0:
#             print(str(1+(100*index//row_count)) + "%")
        try:
            # same preprocessing as in training: tokenize, pad, then predict
            seq = tokenizer.texts_to_sequences([source])
            padded = pad_sequences(seq, maxlen=max_len)
            pred = model.predict(padded)
            lbl = labels[np.argmax(pred)]
        except Exception:
            # cells that cannot be processed (for example, a non-string Source) stay unlabeled
            lbl = ""
        outputs.append(lbl)

    print("Adding labels to dataframe...")
    df['Label'] = outputs
    print("Exporting to file...")
    df.to_csv(cells_file_path, sep = '\t')
    print("DONE!")
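The step that turns a prediction into a label is just an argmax over the 6 class probabilities. The pure-Python sketch below shows it in isolation; the label order is copied from the function above, while the helper name and the example probabilities are made up for illustration.

```python
LABELS = ['Load', 'Prep', 'Train', 'Eval', 'Explore', 'Import']


def label_from_probs(probs):
    """Return the label whose class probability is highest (hypothetical helper)."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best]


# made-up softmax output for a single cell
print(label_from_probs([0.05, 0.10, 0.60, 0.10, 0.10, 0.05]))
```

This only gives meaningful labels if the order of `LABELS` matches the column order of the one-hot label vectors used in training.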

**Notice**: we started with ```cells.tsv``` and tagged it using snorkel into a separate file, ```input.tsv``` (not in place). Now we use the LSTM to tag the cells in ```cells.tsv``` itself (in place).

In [69]:
tag_existing_cells(consts.CELLS_TSV)

Generating labels...
Adding labels to dataframe...
Exporting to file...
DONE!
