# Data Processing Challenge
In machine learning classification, it is important to consider the size of your dataset and how many labels you want to predict. In this example we demonstrate how to use Snorkel to read through a series of papers and mark which rules match. For example, take the sentence "Birds fly with wings." We want rules that search the sentence for "fly" and, if the word exists, apply the label "flight".
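The rule described above can be sketched in plain Python before bringing in Snorkel. This is a minimal illustration only, not the notebook's labeling function; the name `label_sentence` and the `rules` dictionary are hypothetical.

```python
def label_sentence(sentence, rules):
    """Return every label whose keyword appears in the sentence.

    `rules` maps a keyword to a label name (hypothetical structure,
    chosen only for this illustration).
    """
    lowered = sentence.lower()
    # Plain substring match, case-insensitive
    return [label for keyword, label in rules.items() if keyword in lowered]


rules = {"fly": "flight"}  # assumed example rule taken from the text above
labels = label_sentence("Birds fly with wings.", rules)
print(labels)
```

Note that this is a substring match, so a word such as "butterfly" would also trigger the "fly" rule. The `keyword_lookup` function used later in this notebook matches the same way.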

# Downloading the Dataset






In [9]:
# run this to reset the folders 
!rm -r david_work/ interview_questions/ petal_snorkel/ sample_data/ bio* utils.py snorkel_challenge

rm: cannot remove 'sample_data/': No such file or directory
rm: cannot remove 'snorkel_challenge': No such file or directory


In [10]:
!git clone https://github.com/nasa-petal/interview_questions.git

Cloning into 'interview_questions'...
remote: Enumerating objects: 253, done.
remote: Counting objects: 100% (253/253), done.
remote: Compressing objects: 100% (237/237), done.
remote: Total 253 (delta 16), reused 241 (delta 11), pack-reused 0
Receiving objects: 100% (253/253), 1.55 MiB | 9.98 MiB/s, done.
Resolving deltas: 100% (16/16), done.


In [11]:
# re-arrange some files 
!mv interview_questions/snorkel_challenge/* .

In [12]:
!rm -r interview_questions

In [13]:
!mv snorkel/ paht
!mv paht/snorkel .
!rm -r paht  

## Data walk through

In the files section there are two csv files: *biomimicry_function_rules.csv* and *biomimicry_functions_enumerated.csv*. 

biomimicry_functions_enumerated.csv
- First column contains the labels
- Second column contains the label id

biomimicry_function_rules.csv 
- Header: these are the labels from biomimicry_functions_enumerated.csv column 1
- Each row below the header contains words. If any word matches the text then we mark it with a label.

There's another folder called `david_work`. Inside it there are two csv files: *formatted_enums.csv* and *labeled_data.csv*. labeled_data.csv is the important one. It contains the paper title and abstract, which are combined into a single text for prediction.


The code below loads the dataset and displays the columns. Not all the columns are used for prediction. Only the column called 'text' is used. 

In [4]:
from petal_utils import load_dataset
df_train, df_test = load_dataset()
print(df_train.columns) 

Index(['doi', 'url', 'full_doc_link', 'is_open_access', 'label_level_1',
       'label_level_2', 'label_level_3', 'journal', 'literature_site',
       'unnamed: 11', 'label', 'text'],
      dtype='object')


# Challenge setup
This part of the code illustrates how to set up the environment for this challenge. This example uses Snorkel, which takes a list of texts and applies rules that predict labels.

Some texts may match rules from several labels, such as "attach_permanently" and "send_sound_signals". A text about bats could do this. Snorkel uses weak supervision: it takes the set of matching rules and learns how to combine them so that each paper is classified with a particular label.
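To make the idea of combining rule matches concrete, here is a minimal sketch of a majority vote over the outputs of several rules for one text, where -1 means a rule did not match. It is a simplified stand-in written for illustration, not Snorkel's implementation; the function name `majority_vote` and the example label ids are hypothetical.

```python
from collections import Counter

ABSTAIN = -1  # same "no match" value that keyword_lookup returns


def majority_vote(rule_outputs):
    """Pick the label id that most rules voted for, or abstain on no votes or a tie."""
    votes = [v for v in rule_outputs if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    # If the two most frequent labels have the same count, there is no clear winner
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


# One text, four rules: two rules vote for label 3, one for label 7, one abstains
print(majority_vote([-1, 3, 3, 7]))
```

A learned label model goes further than this by weighting rules differently, but the input has the same shape: one row of rule outputs per text.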




## Installing prerequisites
Run the code below to install the dependencies for Snorkel. You may need to restart your runtime afterwards.

In [5]:
!pip install -U networkx munkres numpy scipy pandas scikit-learn



In [14]:
import pandas as pd 
import itertools, os, pickle
from snorkel.labeling import LabelingFunction
from snorkel.labeling.model import LabelModel
from snorkel.labeling import PandasLFApplier
from snorkel.labeling import LFAnalysis
from snorkel.labeling.model import MajorityLabelVoter
from petal_utils import load_dataset

In [15]:
def keyword_lookup(x,phrase_to_match:str, label_id:int):
    """Snorkel Labeling function. Purpose is to return the label id based on the phrase in sentence

    Args:
        x: a row of the dataset; its `text` attribute is searched
        phrase_to_match (str): some phrase that we need to match
        label_id (int): id of label to use for this match
    Returns:
        (int): label id if match or -1 if no match 
    """
    if phrase_to_match.lower() in x.text.lower():     
        return label_id
    else:
        return -1

In [16]:
def create_labeling_functions(bio_file:str, bio_rules:str):
    """create a list of labeling functions
    
    Args:
        bio_file (str): path to the csv file listing all the biomimicry functions and their ids
        bio_rules (str): path to the csv file listing all the 'rules' for each biomimicry function
    Returns:
        labeling_function_list: a list of all the labeling function 'rules' corresponding to each biomimicry function
    """
    bio_file = pd.read_csv(bio_file)
    bio_rules = pd.read_csv(bio_rules)

    names_used = list()
    labeling_function_list = list()
    
    #get a list of all the rules
    for i in range(len(bio_file)):

        label_name = bio_file.iloc[i]['function']
        label_id = bio_file.iloc[i]['function_enumerated']
        label_rule_name = label_name + "_rules"

        if label_rule_name in list(bio_rules.columns):
            underscore_list = []
            phrases_lst = bio_rules[label_rule_name].to_list()
            
            #remove blank cells and keep unique values 
            rules_no_na = list(set([x for x in phrases_lst if not pd.isnull(x)]))
            
            #add underscore to rules
            for item in rules_no_na:
                item = item.replace(" ", "_")
                underscore_list.append(item)
            #create labeling function for each rule
            for phrase in underscore_list:
                function_name = f"keyword_{label_id}_{phrase}"
                if (function_name not in names_used):
                    labeling_function = LabelingFunction(name=function_name, f=keyword_lookup,
                                    resources={"phrase_to_match":phrase, "label_id":label_id})
                    labeling_function_list.append(labeling_function)
                    names_used.append(function_name)
    
    return labeling_function_list

There are a total of 674 rules. **Remember this for later on**

In [17]:
labeling_function_list = create_labeling_functions(r'./biomimicry_functions_enumerated.csv', r'./biomimicry_function_rules.csv')
len(labeling_function_list)

674

## Training Problem
The code below shows how to train using Snorkel. Note that the training probably won't work because it consumes an enormous amount of memory.




In [18]:
df_train, df_test = load_dataset()

labeling_function_list = create_labeling_functions(r'./biomimicry_functions_enumerated.csv', r'./biomimicry_function_rules.csv')

len(labeling_function_list)

applier = PandasLFApplier(lfs=labeling_function_list)
# define train and test sets
L_train = applier.apply(df=df_train)
L_test = applier.apply(df=df_test)


100%|██████████| 107/107 [00:01<00:00, 81.09it/s]
100%|██████████| 14/14 [00:00<00:00, 85.26it/s]


## Crashing the runtime
Running the following block of code consumes far too much memory and causes the runtime to crash. Run it only if you want to see the crash for yourself. Otherwise, skip this section and go to the one below.


In [None]:
from snorkel.labeling import filter_unlabeled_dataframe

majority_model = MajorityLabelVoter(cardinality=98)
preds_train = majority_model.predict(L=L_train)

label_model = LabelModel(cardinality=98, verbose=True, device = 'cpu')
label_model.fit(L_train=L_train, n_epochs=1000, log_freq=100, seed=123)

LFAnalysis(L=L_train, lfs=labeling_function_list).lf_summary()

df_train_filtered, preds_train_filtered = filter_unlabeled_dataframe(
    X=df_train, y=preds_train, L=L_train)

df_train["label"] = label_model.predict(L=L_train, tie_break_policy="abstain")

label_model.save("snorkel_model.pkl")

df_train.to_csv("results.csv")

### Why Training doesn't work
Looking at the L_train variable, you can see that the labels go up to 97 but there are a lot of gaps. The skipped labels are never used, yet Snorkel is told that there are 98 labels in total (cardinality=98, ids 0 through 97), so it allocates an enormous amount of memory.

In [19]:
import numpy as np
np.unique(L_train)

array([-1,  0,  1,  6,  8, 19, 20, 21, 24, 26, 30, 35, 39, 40, 96, 97])
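The gap between declared and used labels can be counted directly. The sketch below copies the unique values from the output above and the cardinality of 98 from the training cell; the variable names are chosen for this illustration.

```python
# Unique values found in L_train, copied from the output above (-1 means "no match")
unique_values = [-1, 0, 1, 6, 8, 19, 20, 21, 24, 26, 30, 35, 39, 40, 96, 97]

# Cardinality passed to MajorityLabelVoter and LabelModel in the training cell
cardinality = 98

# Label ids that at least one rule actually produced
used = sorted(v for v in unique_values if v != -1)

# Label ids the model is asked to account for but that never appear
unused = sorted(set(range(cardinality)) - set(used))

print(f"{len(used)} labels used, {len(unused)} declared but never seen")
```

Most of the label space the model is asked to handle never shows up in the label matrix.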

The other problem is the total number of rules. Because there are so many rules, you get a matrix of size #papers x #rules, where the total number of rules is 674. The other number, 107, is the number of texts in the training set, and 14 is the number of texts in the test set.


In [22]:
print(L_train.shape)
print(L_test.shape)

(107, 674)
(14, 674)


# The Challenge






## First Challenge - Reduction of the dataset
The number of labels as well as the number of rules influences the size of the model trained by Snorkel. You need to find a way to programmatically reduce the number of rules and labels. Instead of 98 possible labels, we should see only the 16 unique values that actually occur in L_train (15 labels plus the no-match value -1).

In [23]:
len(np.unique(L_train))

16

## Second Challenge - further reduction and overlap of labels in each model
Overlapping is important for verifying whether a label is correctly matched.

For example:

*   Model 1 - Predicts: Labels 1-4 + (Not labels 1-4)
*   Model 2 - Predicts: Labels 3-7 + (Not labels 3-7)

If both Model 1 and Model 2 predict that this text matches label 4 then most likely it is label 4.
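The agreement check in this example can be sketched as follows. This only illustrates the idea of cross-checking two overlapping models; the function `agreed_label` and the way predictions are represented (a label id per model) are assumptions made for this sketch, not part of the challenge code.

```python
def agreed_label(pred_1, pred_2, shared_labels):
    """Return the label if both models predict it and it is in their overlap, else None."""
    if pred_1 == pred_2 and pred_1 in shared_labels:
        return pred_1
    return None


model_1_labels = set(range(1, 5))  # Model 1 covers labels 1-4
model_2_labels = set(range(3, 8))  # Model 2 covers labels 3-7
shared = model_1_labels & model_2_labels  # labels both models can predict

print(sorted(shared))
print(agreed_label(4, 4, shared))  # both models say label 4
print(agreed_label(4, 5, shared))  # the models disagree
```

Only labels in the overlap can be cross-checked this way, which is why the choice of overlapping groups matters.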

Let's split the dataset into groups that can be used to train something like this.