# Data Processing Challenge
In machine learning classification, the size of the dataset and the number of labels you want to predict both matter. This example demonstrates how to use Snorkel to read through a series of papers and mark which rules match each one. Take the sentence "Birds fly with wings." A rule searches the sentence for the word "fly"; if the word is present, the rule applies the label "flight".
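The "fly" rule can be sketched in plain Python without Snorkel. The names `ABSTAIN`, `FLIGHT`, and `label_flight` and the id values are illustrative assumptions, not part of the challenge code. This sketch matches whole words only, whereas the notebook's own lookup further down uses plain substring matching.

```python
import re

ABSTAIN = -1  # hypothetical id meaning "no rule matched"
FLIGHT = 0    # hypothetical id for the label "flight"


def label_flight(sentence: str) -> int:
    """Return FLIGHT if the word "fly" occurs in the sentence, else ABSTAIN."""
    # \b restricts the match to the whole word, so "butterfly" does not count.
    if re.search(r"\bfly\b", sentence.lower()):
        return FLIGHT
    return ABSTAIN


print(label_flight("Birds fly with wings."))  # 0
print(label_flight("Fish swim with fins."))   # -1
```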

# Downloading the Dataset






In [1]:
!git clone https://github.com/nasa-petal/interview_questions.git
mv interview_questions/snorkel_challenge/* .

Cloning into 'interview_questions'...
remote: Enumerating objects: 245, done.
remote: Counting objects: 100% (245/245), done.
remote: Compressing objects: 100% (232/232), done.
remote: Total 245 (delta 10), reused 234 (delta 8), pack-reused 0
Receiving objects: 100% (245/245), 1.54 MiB | 10.53 MiB/s, done.
Resolving deltas: 100% (10/10), done.


In [None]:
# re-arrange some files 
mv snorkel/ paht
mv paht/snorkel .
rm -r paht  

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Data walk through

In the files section there are two csv files: *biomimicry_function_rules.csv* and *biomimicry_functions_enumerated.csv*. 

biomimicry_functions_enumerated.csv
- First column contains the labels
- Second column contains the label id

biomimicry_function_rules.csv 
- Header: these are the labels from biomimicry_functions_enumerated.csv column 1
- Each row below the header contains words. If any word matches the text then we mark it with a label.
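To make the layout of the rules file concrete, here is a small sketch that parses a made-up stand-in for *biomimicry_function_rules.csv* using only the standard library. The phrases are invented for illustration, and the `_rules` suffix on the column names is an assumption taken from the column lookup in the notebook's code further down.

```python
import csv
import io

# Made-up stand-in for the rules file: one column per label, one phrase per cell.
SAMPLE_RULES = """attach_permanently_rules,send_sound_signals_rules
adhesive,echolocation
glue,ultrasonic call
,chirp
"""


def rules_by_label(csv_text: str) -> dict:
    """Map each header column to the list of non-empty phrases beneath it."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rules = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, phrase in row.items():
            if phrase:  # skip the blank cells that pad shorter columns
                rules[name].append(phrase.strip().lower())
    return rules


rules = rules_by_label(SAMPLE_RULES)
print(rules["send_sound_signals_rules"])
```

Columns have different lengths, so shorter ones are padded with blank cells that have to be skipped before matching.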

There is another folder called `david_work`, which contains two CSV files: *formatted_enums.csv* and *labeled_data.csv*. Of the two, *labeled_data.csv* is the important one. It contains each paper's title and abstract, which are combined into a single text used for prediction.


The code below loads the dataset and displays the columns. Not all the columns are used for prediction. Only the column called 'text' is used. 

In [4]:
from utils import load_dataset
df_train, df_test = load_dataset() 
print(df_train.columns) 

Index(['doi', 'url', 'full_doc_link', 'is_open_access', 'label_level_1',
       'label_level_2', 'label_level_3', 'journal', 'literature_site',
       'unnamed: 11', 'label', 'text'],
      dtype='object')


# Challenge setup
This part of the notebook shows how to set up the environment for the challenge. The example uses Snorkel, which takes a list of texts and applies rules that predict labels.

Some texts may match rules from several labels, such as "attach_permanently" and "send_sound_signals"; a text about bats is one example. Snorkel uses weak supervision: it takes the set of matching rules for each paper and learns to assign the paper a single label.
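One simple way to turn several matching rules into a single label is a majority vote. The sketch below is a simplified stand-in written in plain Python, not Snorkel's implementation; the function name `majority_vote` and the label ids are hypothetical.

```python
from collections import Counter

ABSTAIN = -1  # a rule returns this when it does not match


def majority_vote(votes):
    """Pick the most common non-abstain label; abstain on no votes or a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the two leading labels are tied
    return ranked[0][0]


# Four rules fire on one paper: one votes for label 3, two for label 7, one abstains.
print(majority_vote([3, 7, 7, ABSTAIN]))  # 7
```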




## Installing prerequisites
Run the code below to install the dependencies for Snorkel.

In [5]:
!pip install -U networkx munkres numpy scipy pandas scikit-learn

Collecting munkres
  Downloading munkres-1.1.4-py2.py3-none-any.whl (7.0 kB)
Collecting numpy
  Downloading numpy-1.21.4-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Collecting scipy
  Downloading scipy-1.7.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (38.2 MB)
Collecting pandas
  Downloading pandas-1.3.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
Collecting scikit-learn
  Downloading scikit_learn-1.0.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (23.2 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.0.0-py3-none-any.whl (14 kB)
Installing collected packages: numpy, threadpoolctl, scipy, scikit-learn, pandas, munkres
  Attempting uninstall: numpy
    Found

In [25]:
import pandas as pd 
from snorkel.labeling import LabelingFunction
import itertools, os
from snorkel.labeling import PandasLFApplier
from snorkel.labeling import LFAnalysis
from utils import load_dataset

ImportError: ignored

In [16]:
def keyword_lookup(x, bio_functions: pd.DataFrame, bio_function_rules: pd.DataFrame):
    """Snorkel labeling function. Returns the id of the first label whose rule phrases appear in the text.

    Args:
        x: a data point (DataFrame row) whose `text` attribute holds the title and abstract
        bio_functions (pd.DataFrame): label names ('function') and ids ('function_enumerated')
        bio_function_rules (pd.DataFrame): one '<label>_rules' column of phrases per label

    Returns:
        int: the label id, or -1 to abstain when no phrase matches
    """
    text = x.text.lower()
    for i in range(len(bio_functions)):
        label_name = bio_functions.iloc[i]['function']
        label_id = bio_functions.iloc[i]['function_enumerated']

        label_rule_name = label_name + "_rules"
        if label_rule_name in list(bio_function_rules.columns):
            phrases_to_look_for = bio_function_rules[label_rule_name].to_list()
            # drop the blank (NaN) cells that pad the shorter columns
            phrases_to_look_for = [p for p in phrases_to_look_for if pd.notnull(p)]
            for phrase in phrases_to_look_for:
                # now you could make a counter and see the percentage match so if 10/20 phrases are in the text/abstract then you return the
                if phrase in text:
                    return label_id
    return -1
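The comment inside the loop suggests counting how many phrases match instead of returning on the first hit. A minimal sketch of that idea follows; the helper names `match_fraction` and `label_if_enough_matches`, and the default threshold of 0.5, are assumptions chosen for illustration.

```python
ABSTAIN = -1


def match_fraction(text, phrases):
    """Fraction of phrases that occur as substrings of text (case-insensitive)."""
    if not phrases:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for phrase in phrases if phrase.lower() in lowered)
    return hits / len(phrases)


def label_if_enough_matches(text, phrases, label_id, threshold=0.5):
    """Return label_id when at least `threshold` of the phrases match, else ABSTAIN."""
    if match_fraction(text, phrases) >= threshold:
        return label_id
    return ABSTAIN


# One of the two phrases is present, so the fraction is 0.5.
print(label_if_enough_matches("Bats use echolocation to hunt.", ["echolocation", "chirp"], 7))
```

A higher threshold makes each rule stricter, which trades coverage for fewer spurious labels.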

In [17]:
def create_labeling_functions(bio_file: str, bio_rules: str):
    """Reads the two CSV files and creates a list of labeling functions

    Args:
        bio_file (str): path to the CSV listing all the biomimicry functions
        bio_rules (str): path to the CSV listing the 'rules' for each biomimicry function

    Returns:
        labeling_function_list: a list of labeling functions, one per unique rule phrase
    """
    bio_file = pd.read_csv(bio_file)
    bio_rules = pd.read_csv(bio_rules)

    lst = []
    underscore_list = []
    rules_no_na = []
    labeling_function_list = []
    
    #get a list of all the rules
    for i in range(len(bio_file)):
        label_name = bio_file.iloc[i]['function'] 
        label_rule_name = label_name + "_rules"
        if label_rule_name in list(bio_rules.columns):
            phrases_lst = bio_rules[label_rule_name].to_list()
            lst.append(phrases_lst)
    chained_lst = (list(itertools.chain.from_iterable(lst)))
    #remove blank cells
    remove_na = [x for x in chained_lst if pd.isnull(x) == False]
    #remove duplicates
    for rule in remove_na:
        if rule not in rules_no_na:
            rules_no_na.append(rule)
    #add underscore to rules
    for item in rules_no_na:
        item = item.replace(" ", "_")
        underscore_list.append(item)
    #create labeling function for each rule
    for phrase in underscore_list:
        labeling_function = LabelingFunction(name=f"keyword_{phrase}", f=keyword_lookup,
                        resources={"bio_functions":bio_file,"bio_function_rules":bio_rules})
        labeling_function_list.append(labeling_function)

    # print(len(labeling_function_list))
    return labeling_function_list

In [20]:
labeling_function_list = create_labeling_functions(r'./biomimicry_functions_enumerated.csv', r'./biomimicry_function_rules.csv')
len(labeling_function_list)

665

## Training Problem
The code below shows how to train using Snorkel. Note that the training probably won't work because it consumes an enormous amount of memory.




In [24]:
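from snorkel_paht import *

# explicit imports for every name used in this cell
import os
import pickle
from snorkel.labeling import LFAnalysis, PandasLFApplier
from snorkel.labeling.model import LabelModel, MajorityLabelVoter
from utils import load_dataset

df_train, df_test = load_dataset()  # load the CSV file and split it into train and test sets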
if not os.path.exists('lf_analysis.pickle'):
    applier = PandasLFApplier(lfs=labeling_function_list)
    # define train and test sets
    L_train = applier.apply(df=df_train)
    L_test = applier.apply(df=df_test)

    df = LFAnalysis(L=L_train, lfs=labeling_function_list).lf_summary()
    with open('lf_analysis.pickle','wb') as f:
        pickle.dump({"lf_analysis":df, 'L_train':L_train,'L_test':L_test},f)

if os.path.exists('lf_analysis.pickle'):
    with open('lf_analysis.pickle','rb') as f:
        data = pickle.load(f)
        lf_analysis = data['lf_analysis']
        L_train = data['L_train']
        L_test = data['L_test']

majority_model = MajorityLabelVoter()
preds_train = majority_model.predict(L=L_train)

label_model = LabelModel(cardinality=19, verbose=True, device='cpu')
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)


NameError: ignored

### Why Training doesn't work
Looking at the `L_train` variable, you can see that the labels go up to 97 with a few gaps. The skipped labels are never used, but Snorkel assumes there are 97 labels in total and allocates memory for all of them. The other problem is the total number of rules: with the 665 labeling functions created above, the label matrix has size #papers (training set) x #rules.

In [None]:
# add code to show this
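One way to inspect the problem is to summarize the label matrix: its shape, which label ids actually occur, and which ids are skipped. The sketch below runs on a toy list of rows standing in for `L_train`; the function name `summarize_label_matrix` and the toy values are assumptions for illustration.

```python
def summarize_label_matrix(L, abstain=-1):
    """Report the shape, the label ids in use, and the skipped ids of a label matrix."""
    n_rows = len(L)
    n_cols = len(L[0]) if n_rows else 0
    # collect every label id that some rule emitted, ignoring abstains
    used = sorted({v for row in L for v in row if v != abstain})
    skipped = [i for i in range(max(used) + 1) if i not in used] if used else []
    return {"shape": (n_rows, n_cols), "used": used, "skipped": skipped}


# Toy matrix: 3 papers x 3 rules; -1 means the rule abstained.
toy = [[0, -1, 5], [-1, 2, 5], [0, 2, -1]]
print(summarize_label_matrix(toy))
# {'shape': (3, 3), 'used': [0, 2, 5], 'skipped': [1, 3, 4]}
```

The `skipped` list shows the ids that take up space in the model without ever being predicted.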

# The Challenge
The number of labels and the number of rules both influence the size of the model that Snorkel trains. You need to find a way to programmatically reduce the number of rules and labels.
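As one possible starting point, and not a complete solution, the sketch below remaps the label ids in use onto a contiguous range and drops rules that never fire. The function names and the toy matrix are hypothetical, and the code works on plain lists, not on Snorkel objects.

```python
def compact_label_ids(label_ids):
    """Map the label ids actually in use onto 0..k-1, preserving their order."""
    used = sorted(set(label_ids))
    return {old: new for new, old in enumerate(used)}


def drop_silent_rules(L, abstain=-1):
    """Remove the columns (rules) that abstain on every row; also return the kept indices."""
    if not L:
        return [], []
    keep = [j for j in range(len(L[0])) if any(row[j] != abstain for row in L)]
    return [[row[j] for j in keep] for row in L], keep


# Ids 0, 2, 5 and 97 shrink to 0..3, so only 4 classes are needed.
print(compact_label_ids([0, 2, 5, 97]))  # {0: 0, 2: 1, 5: 2, 97: 3}

# The middle rule never fires, so its column is removed.
reduced, kept = drop_silent_rules([[0, -1, -1], [97, -1, 5]])
print(reduced, kept)  # [[0, -1], [97, 5]] [0, 2]
```

After remapping, the number of classes equals the number of labels in use, and the mapping can be inverted to recover the original ids.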