# Explanations of .py files for project

This notebook is part of my capstone project on hierarchical classification:
https://github.com/luka5132/NLPToS
In this notebook you can find some additional information on what the classes in the respective .py files do.
The files discussed are:
1. pytorch_classifier.py
2. data_processing.py
3. hierarchical_data.py

In [1]:
import pandas as pd
import numpy as np

# import module we'll need to import our custom module (needed for kaggle)
from shutil import copyfile

# copy our file into the working directory (make sure it has .py suffix)
copyfile(src = "../input/privbert-data/data_processing.py", dst = "../working/data_processing.py")
copyfile(src = "../input/privbert-data/pytorch_classifier.py", dst = "../working/pytorch_classifier.py")
copyfile(src = "../input/privbert-data/hierarchical_data.py", dst = "../working/hierarchical_data.py")

'../working/hierarchical_data.py'

# Pytorch_classifier

This file was initially planned to contain all code related to training a BERT model. However, because of memory constraints on Kaggle I was forced to take out large parts, mainly everything that used a GPU 'device', i.e. all of training and testing. The remains of the class are still used as *shortcuts* for loading data.

In [2]:
from pytorch_classifier import BertClassification

example_class = BertClassification()
attributes = vars(example_class)
attnames = [item[0] for item in attributes.items()]
print("All variables: \n")
print(''.join("%s \n" % attname for attname in attnames))
all_dir = dir(example_class)
all_functions = [method for method in all_dir if not method.startswith('__') and method not in attnames]
print("All functions: \n")
print(''.join("%s \n" % funcname for funcname in all_functions))

All variables: 

labels 
segments 
model_name 
max_length 
tokenizer 
encodings 
num_labels 
optimizer 

All functions: 

encode_texts 
init_data 
init_optimizer 
init_tokenizer 
input_labels 
input_texts 
load_test_data 
save_optimizer_state 
turn_to_tensor 
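
The listing cell above is repeated for every class in this notebook. As a minimal sketch, the same idea can be wrapped in a small helper. `describe_instance` and the toy class `Demo` are hypothetical names that are not part of the project files. Unlike the cell above, the helper only counts callable members as functions.

```python
def describe_instance(obj):
    """Return (attribute_names, method_names) for an object.

    describe_instance is a hypothetical helper, not part of the project files.
    """
    # Instance attributes (set in __init__ or later) live in vars(obj).
    attribute_names = list(vars(obj))
    # Every other non-dunder member that is callable is treated as a method.
    method_names = [
        name for name in dir(obj)
        if not name.startswith('__')
        and name not in attribute_names
        and callable(getattr(obj, name))
    ]
    return attribute_names, method_names


class Demo:
    """Toy class used only to show the helper."""

    def __init__(self):
        self.labels = None
        self.max_length = 128

    def encode_texts(self):
        return []


attribute_names, method_names = describe_instance(Demo())
print("All variables:", attribute_names)
print("All functions:", method_names)
```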





Above one can see the functions and variables of this class.
The functions *init_data, init_optimizer and init_tokenizer* mostly speak for themselves: they initialize the respective data, optimizer and tokenizer.
The only function really worth explaining here is *encode_texts*.

In [3]:
encode_texts = example_class.encode_texts
# parameters: max_length = 128, trunc = True, ptml = True, stratify = None, batchsize = 32, rs = 2021, valsize = 0.1, with_labels = True
# max_length is the token length of the data
# trunc stands for truncation and defaults to True
# ptml = pad to max length: the token array is filled up to max_length
# batchsize speaks for itself
# with_labels decides whether the dataloader contains the labels for the segments
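
To make *trunc* and *ptml* concrete, here is a minimal sketch of what truncating and padding to `max_length` do to one list of token ids. This is not the code of `encode_texts`: `pad_or_truncate` is a hypothetical helper, the padding id of 0 is an assumption, and a real tokenizer would normally also keep the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Hypothetical helper for illustration; pad_id=0 is an assumption.
    """
    # Truncation: drop everything past max_length.
    input_ids = list(token_ids[:max_length])
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [1] * len(input_ids)
    # Padding: fill up to max_length with the padding id.
    n_pad = max_length - len(input_ids)
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence ends up with the same length, which is what allows the segments to be stacked into fixed-size batches.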

# Data_processing

This class is used for creating the one-hot vectors for each segment. Because we are working with a lot of classification models, this requires some work.

In [4]:
# let's first have a look at the data:

all_data = pd.read_csv('../input/privbert-data/op115_processed.csv')
all_data.head()

Unnamed: 0,annotation_id,batch_id,annotator_id,policy_id,segment_id,category_name,attribute_value_pairs,date,policy_url,policy_uid,segment_text
0,20137,test_category_labeling_highlight_fordham_aaaaa,121,3905,0,Other,"{""Other Type"": {""selectedText"": ""Sci-News.com ...",Not specified,http://www.sci-news.com/privacy-policy.html,1017,Privacy Policy <br> <br> Sci-News.com is commi...
1,20324,test_category_labeling_highlight_fordham_aaaaa,121,3905,1,First Party Collection/Use,"{""Collection Mode"": {""selectedText"": ""nformati...",Not specified,http://www.sci-news.com/privacy-policy.html,1017,Information that Sci-News.com May Collect Onli...
2,20325,test_category_labeling_highlight_fordham_aaaaa,121,3905,1,First Party Collection/Use,"{""Collection Mode"": {""selectedText"": ""nformati...",Not specified,http://www.sci-news.com/privacy-policy.html,1017,Information that Sci-News.com May Collect Onli...
3,20326,test_category_labeling_highlight_fordham_aaaaa,121,3905,2,Data Retention,"{""Personal Information Type"": {""selectedText"":...",Not specified,http://www.sci-news.com/privacy-policy.html,1017,"- if you contact us, we may keep a record of t..."
4,20327,test_category_labeling_highlight_fordham_aaaaa,121,3905,3,First Party Collection/Use,"{""Collection Mode"": {""selectedText"": ""Not sele...",Not specified,http://www.sci-news.com/privacy-policy.html,1017,- details of your visits to our site including...


The columns in this dataset that are most relevant to us are *'category_name', 'attribute_value_pairs' and 'segment_text'*.
The explanation of the class below shows how this data is used to create one-hot vectors.
To see how the data was stratified, please consult the following notebook: 'Stratify_data.ipynb'.

In [5]:
from data_processing import Op115OneHots

example_class = Op115OneHots(all_data) # the class is initialized with a dataframe
attributes = vars(example_class)
attnames = [item[0] for item in attributes.items()]
print("All variables: \n")
print(''.join("%s \n" % attname for attname in attnames))
all_dir = dir(example_class)
all_functions = [method for method in all_dir if not method.startswith('__') and method not in attnames]
print("All functions: \n")
print(''.join("%s \n" % funcname for funcname in all_functions))

All variables: 

df 
categories 
dicts 
tuple_list 
unique_tups 
unique_atts 
unique_cats 
cat_oh 
subcat_oh 
subsubcat_oh 
dictvalues 
segments 
segments_vals 

All functions: 

classtree 
getcats 
getdicts 
getsub 
go2 
indexes 
len_onehots 
majority_vote 
new_onehots 
pol_seg 
processtuples 
return_oh_names 
return_ohs 
return_unique_texts 
returntexts 
set_oh_names 
set_onehots 
sort_df_polseg 
tuplist 
tuplist_per_segment 



The functions of this class are mostly summed up in *'go2'*. This function calls the following functions in order:
- majority_vote() # if majority vote is true
- sort_df_polseg()  
- getcats()
- getdicts()
- processtuples()
- getsub()
- set_oh_names(class_tup) # if the function was called with a list of class names
- tuplist_per_segment()

In [6]:
# Majority vote filters out annotations that were only annotated once (out of 3 times)
print('Number of annotations before majority vote :',len(example_class.df))
print()
example_class.majority_vote()
print('Number of annotations after majority vote :',len(example_class.df))

Number of annotations before majority vote : 23194

Number of annotations after majority vote : 20662
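
The idea behind the majority vote can be sketched without pandas. This is only an illustration under assumptions: I assume here that agreement is counted per (policy, segment, category) and that two out of three annotators are enough. The class itself may compare annotations at a different level of detail. The function below is a standalone sketch and the rows are toy data.

```python
from collections import defaultdict


def majority_vote(annotations, min_votes=2):
    """Keep annotations whose (segment, category) was chosen by at least min_votes annotators.

    Standalone sketch; the grouping key and min_votes=2 are assumptions.
    """
    def key_of(row):
        return (row['policy_uid'], row['segment_id'], row['category_name'])

    # Collect the distinct annotators that picked each (segment, category).
    annotators = defaultdict(set)
    for row in annotations:
        annotators[key_of(row)].add(row['annotator_id'])

    # Drop the rows that only a single annotator supports.
    return [row for row in annotations if len(annotators[key_of(row)]) >= min_votes]


# Toy rows using the column names of the dataset.
annotations = [
    {'policy_uid': 20, 'segment_id': 0, 'annotator_id': 1, 'category_name': 'Other'},
    {'policy_uid': 20, 'segment_id': 0, 'annotator_id': 2, 'category_name': 'Other'},
    {'policy_uid': 20, 'segment_id': 0, 'annotator_id': 3, 'category_name': 'Data Retention'},
]
kept = majority_vote(annotations)
print(len(annotations), '->', len(kept))
```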


In [7]:
example_class.df[:5].columns

Index(['annotation_id', 'batch_id', 'annotator_id', 'policy_id', 'segment_id',
       'category_name', 'attribute_value_pairs', 'date', 'policy_url',
       'policy_uid', 'segment_text'],
      dtype='object')

In [8]:
# It then sorts the dataframe based on the policy-segment number
f5 = example_class.df[:5]
f5_id = f5[['policy_uid', 'segment_id']].values
print("First 5 id's before sorting: ")
print(''.join("%s \n" % f for f in f5_id))
example_class.sort_df_polseg()
f5 = example_class.df[:5]
f5_id = f5[['policy_uid', 'segment_id']].values
print("First 5 id's after sorting: ")
print(''.join("%s \n" % f for f in f5_id))

First 5 id's before sorting: 
[1017    0] 
[1017    1] 
[1017    1] 
[1017    2] 
[1017    3] 

First 5 id's after sorting: 
[20  0] 
[20  0] 
[20  0] 
[20  1] 
[20  1] 



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.df['str_polsegs'] = str_polseg
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.df.sort_values(by=['str_polsegs'], inplace=True)
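
The warnings above mention a column called `str_polsegs`, which suggests that the class sorts on a string key. How that key is built is not shown here. As a hedged alternative sketch, sorting on the numeric pair (policy_uid, segment_id) avoids the pitfalls of string ordering. `sort_by_policy_segment` is a hypothetical helper and the rows are toy data.

```python
def sort_by_policy_segment(rows):
    """Sort rows by (policy_uid, segment_id), comparing both as integers.

    Hypothetical helper, not the implementation of sort_df_polseg().
    """
    return sorted(rows, key=lambda row: (int(row['policy_uid']), int(row['segment_id'])))


rows = [
    {'policy_uid': 1017, 'segment_id': 1},
    {'policy_uid': 1017, 'segment_id': 0},
    {'policy_uid': 20, 'segment_id': 1},
    {'policy_uid': 20, 'segment_id': 0},
]
ordered = [(row['policy_uid'], row['segment_id']) for row in sort_by_policy_segment(rows)]
print(ordered)

# Unpadded string keys sort differently, because '1017' comes before '20' as text.
string_order = sorted(['20-0', '1017-0'])
print(string_order)
```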


In [9]:
# getcats() stores the category value of every row in self.categories
print("self.categories before getcats() : ",example_class.categories)
example_class.getcats()
cats = example_class.categories[:5]
print("after the function we now have a list with the corresponding category values")
print(''.join("%s \n" % cat for cat in cats))


self.categories before getcats() :  None
after the function we now have a list with the corresponding category values
Other 
Other 
Other 
Other 
Other 



In [10]:
# getdicts turns the string 'dictionaries' into actual dictionaries
print("attribute value pairs: \n ", example_class.df.attribute_value_pairs.values[0])
print("\n of type: ", type(example_class.df.attribute_value_pairs.values[0]))
example_class.getdicts()
print()
print("after calling getdicts(): \n", example_class.dicts[0])
print("\n of type: ", type(example_class.dicts[0]))

attribute value pairs: 
  {"Other Type": {"endIndexInSegment": 762, "startIndexInSegment": 100, "selectedText": "At the Atlantic Monthly Group, Inc. (\"The Atlantic\"), we want you to enjoy and benefit from our websites and online services secure in the knowledge that we have implemented fair information practices designed to protect your privacy. Our privacy policy is applicable to The Atlantic, and The Atlantics affiliates and subsidiaries whose websites, mobile applications and other online services are directly linked (the Sites). The privacy policy describes the kinds of information we may gather during your visit to these Sites, how we use your information, when we might disclose your personally identifiable information, and how you can manage your information.", "value": "Introductory/Generic"}}

 of type:  <class 'str'>

after calling getdicts(): 
 {'Other Type': {'endIndexInSegment': 762, 'startIndexInSegment': 100, 'selectedText': 'At the Atlantic Monthly Group, Inc. ("The At
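
The string in 'attribute_value_pairs' looks like JSON, so the conversion can be sketched with the standard library. Whether `getdicts()` uses `json.loads` or some other parser is not shown in this notebook, so treat this as an assumption. The example string is the one printed above with the selected text shortened.

```python
import json

# Shortened version of the value printed above (selectedText is abbreviated).
raw = r'{"Other Type": {"endIndexInSegment": 762, "startIndexInSegment": 100, "selectedText": "At the Atlantic Monthly Group, Inc. (\"The Atlantic\")", "value": "Introductory/Generic"}}'

# json.loads turns the JSON text into nested Python dictionaries.
parsed = json.loads(raw)

print(type(raw).__name__, '->', type(parsed).__name__)
print(parsed['Other Type']['value'])
```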

In [11]:
# processtuples creates the list of tuples; these tuples are used to create the one-hot vectors
example_class.processtuples()

In [12]:
print("after calling this function we now have a list of all the tuples that contain information on the respective segment")
print("one can see this as: Category, Subcategory, Value \n")
f5 = example_class.tuple_list[:5]
print(''.join("%s \n" % f for f in f5))

print("it also produces a list that sees if the values are 'enacted' or true")
f5 = example_class.dictvalues[:5]
print(''.join("%s \n" % f for f in f5))

print("and finally it produces a list with all unique tuples found, these are the values")
print("in total {} unique tuples / values are in the data".format(len(example_class.unique_tups)))

after calling this function we now have a list of all the tuples that contain information on the respective segment
one can see this as: Category, Subcategory, Value 

[('Other', 'Other Type', 'Introductory/Generic')] 
[('Other', 'Other Type', 'Introductory/Generic')] 
[('Other', 'Other Type', 'Introductory/Generic')] 
[('Other', 'Other Type', 'Introductory/Generic')] 
[('Other', 'Other Type', 'Practice not covered')] 

it also produces a list that sees if the values are 'enacted' or true
[True] 
[True] 
[True] 
[True] 
[True] 

and finally it produces a list with all unique tuples found, these are the values
in total 253 unique tuples / values are in the data
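
A rough sketch of how such (Category, Subcategory, Value) tuples could be built from a category name and its parsed dictionary. `tuples_for_annotation` is a hypothetical helper and not the code of `processtuples()`. The 'enacted' list (`dictvalues`) is left out, because how it is derived is not shown here.

```python
def tuples_for_annotation(category, attribute_dict):
    """Build (category, attribute, value) tuples for one annotation.

    Hypothetical helper for illustration only.
    """
    return [
        (category, attribute, details['value'])
        for attribute, details in attribute_dict.items()
    ]


# Two annotations in the shape shown earlier (the other keys are omitted).
categories = ['Other', 'Other']
dicts = [
    {'Other Type': {'value': 'Introductory/Generic'}},
    {'Other Type': {'value': 'Practice not covered'}},
]

# One list of tuples per annotation.
tuple_list = [tuples_for_annotation(cat, d) for cat, d in zip(categories, dicts)]

# All distinct tuples, sorted so that the order is reproducible.
unique_tups = sorted({tup for tups in tuple_list for tup in tups})

print(tuple_list)
print(len(unique_tups))
```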


In [13]:
# we want to do something similar to the unique tuples for the category and subcategory classes
print("To get the unique categories and subcategories we call 'getsub()' \n")
example_class.getsub()
print("we then see we have a total number of {} categories".format(len(example_class.unique_cats)))
print("and a total number of {} subcategories \n".format(len(example_class.unique_atts)))

print("an example of a category: \n {} \n".format(example_class.unique_cats[0]))
print("an example of a subcategory: \n {}".format(example_class.unique_atts[0]))

To get the unique categories and subcategories we call 'getsub()' 

we then see we have a total number of 10 categories
and a total number of 36 subcategories 

an example of a category: 
 Data Retention 

an example of a subcategory: 
 ('User Access, Edit and Deletion', 'User Type')


In [14]:
print("then if 'set_oh_names' is given an input we overwrite these unique tuples. This is done since some values are rare \nand thus might be missing from either the training or the testing set")
print("one can create such a list by calling 'return_oh_names()' ")

then if 'set_oh_names' is given an input we overwrite these unique tuples. This is done since some values are rare 
and thus might be missing from either the training or the testing set
one can create such a list by calling 'return_oh_names()' 
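
Why a fixed list of names matters can be shown with a small sketch. If every split is encoded against the same ordered label list, the vectors keep the same length and the same position per label, even when a rare value is absent from one split. `multi_hot` is a hypothetical helper and not the code of the class.

```python
def multi_hot(segment_tuples, label_names):
    """Encode the labels of one segment as a 0/1 vector aligned with label_names.

    Hypothetical helper for illustration only.
    """
    present = set(segment_tuples)
    return [1 if name in present else 0 for name in label_names]


# Label list shared by every split (in the project such a list would come
# from return_oh_names()).
label_names = [
    ('Other', 'Other Type', 'Introductory/Generic'),
    ('Other', 'Other Type', 'Practice not covered'),
]

# The second value only occurs in the "test" segment of this toy example.
train_segment = [('Other', 'Other Type', 'Introductory/Generic')]
test_segment = [('Other', 'Other Type', 'Practice not covered')]

train_vec = multi_hot(train_segment, label_names)
test_vec = multi_hot(test_segment, label_names)
print(train_vec, test_vec)
```

If the label list were derived from the training segment alone, the test value would have no position in the vector at all.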


In [15]:
print("finally instead of having the data per annotation the data is grouped into segments")
print("one can see that we had {} rows, i.e. tuples in the tuplist created earlier".format(len(example_class.tuple_list)))
example_class.tuplist_per_segment()
print("\nafter calling 'tuplist_per_segment()' we now have a list of information per segment")
print("this list has length {}".format(len(example_class.segments_vals)))
print("an example of a segment is: \n{}".format(example_class.segments_vals[0]))
print("\none can see it contains the annotations per annotator")

finally instead of having the data per annotation the data is grouped into segments
one can see that we had 20662 rows, i.e. tuples in the tuplist created earlier

after calling 'tuplist_per_segment()' we now have a list of information per segment
this list has length 3729
an example of a segment is: 
[('Other', 'Other Type', 'Introductory/Generic'), ('Other', 'Other Type', 'Introductory/Generic'), ('Other', 'Other Type', 'Introductory/Generic')]

one can see it contains the annotations per annotator
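
The grouping step can be sketched as follows, assuming each annotation row carries a (policy_uid, segment_id) key. `group_by_segment` is a hypothetical helper, and the pairing of keys and tuples below is toy data, not taken from the dataset.

```python
from collections import defaultdict


def group_by_segment(segment_keys, tuple_list):
    """Merge the tuples of all annotations that belong to the same segment.

    Hypothetical helper, not the code of tuplist_per_segment().
    """
    grouped = defaultdict(list)
    for key, tuples in zip(segment_keys, tuple_list):
        grouped[key].extend(tuples)
    return dict(grouped)


# One (policy_uid, segment_id) key per annotation row.
segment_keys = [(20, 0), (20, 0), (20, 0), (20, 1)]
tuple_list = [
    [('Other', 'Other Type', 'Introductory/Generic')],
    [('Other', 'Other Type', 'Introductory/Generic')],
    [('Other', 'Other Type', 'Introductory/Generic')],
    [('Other', 'Other Type', 'Practice not covered')],
]

segments_vals = group_by_segment(segment_keys, tuple_list)
print(len(tuple_list), 'annotations ->', len(segments_vals), 'segments')
print(segments_vals[(20, 0)])
```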


After all this preprocessing the *main* function is called, namely new_onehots(). This function returns the text segments and one-hot vectors for all classification models.

In [16]:
cat_to_sub, cat_to_val, sub_to_val, all_cats, all_subcats, all_vals, all_texts = example_class.new_onehots()

In [17]:
print("calling new_onehots returns a tuple with 7 elements. Namely: \n")
print("1) A dictionary that contains the texts and labels for a category to a subcategory ")
print("2) A dictionary that contains the texts and labels for a category to the values ")
print("3) A dictionary that contains the texts and labels for a subcategory to the values ")
print("4) A list of onehot vectors for the categories per segment ")
print("5) A list of onehot vectors for the subcategories per segment ")
print("6) A list of onehot vectors for the values per segment ")
print("7) A list of texts / segments ")

calling new_onehots returns a tuple with 7 elements. Namely: 

1) A dictionary that contains the texts and labels for a category to a subcategory 
2) A dictionary that contains the texts and labels for a category to the values 
3) A dictionary that contains the texts and labels for a subcategory to the values 
4) A list of onehot vectors for the categories per segment 
5) A list of onehot vectors for the subcategories per segment 
6) A list of onehot vectors for the values per segment 
7) A list of texts / segments 


In [18]:
type(cat_to_sub['Data Retention'])

list

# Hierarchical_data

This class is used to store the advice for the advice system. It is also used for running a grid search.

In [19]:
from hierarchical_data import HierarchicalData

example_class = HierarchicalData() # here the class is initialized without arguments
attributes = vars(example_class)
attnames = [item[0] for item in attributes.items()]
print("All variables: \n")
print(''.join("%s \n" % attname for attname in attnames))
all_dir = dir(example_class)
all_functions = [method for method in all_dir if not method.startswith('__') and method not in attnames]
print("All functions: \n")
print(''.join("%s \n" % funcname for funcname in all_functions))

All variables: 

cat_layer 
cat_candidates 
sub_all_layer 
sub_all_advice 
sub_layer 
sub_advice 
val_layer 
val_advice 
cat_names 
cat_dictionary 
sub_to_cat_advice 
val_to_cat_advice 
candidate_treshold 
parameter_dictionary 

All functions: 

create_gridsearch_advices 
define_candidates 
give_all_sub_advice 
give_sub_advice 
give_val_advice 
read_cat_predictions 
read_sub_predictions 
return_advice 
return_candidate 
return_gridsearch_advice 
return_predictions 
return_predictions_layers 
return_suball_advice 
save_advice 
save_gridsearch_advice 
set_parameters 
set_variables 
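
The class itself is not explained yet. Judging only by names such as `candidate_treshold` and `parameter_dictionary`, one plausible reading is that categories whose score passes a threshold become candidates, and that a grid search tries several parameter values. The sketch below illustrates only that general idea. `parameter_grid` and `candidates_above_threshold` are hypothetical functions, the scores are made up, and none of this is the actual implementation of `HierarchicalData`.

```python
from itertools import product


def parameter_grid(parameter_dictionary):
    """Yield one dict per combination of the listed parameter values."""
    names = list(parameter_dictionary)
    for values in product(*(parameter_dictionary[name] for name in names)):
        yield dict(zip(names, values))


def candidates_above_threshold(category_scores, threshold):
    """Return the categories whose score reaches the threshold."""
    return [name for name, score in category_scores.items() if score >= threshold]


# Made-up scores for one segment, for illustration only.
category_scores = {
    'Other': 0.91,
    'Data Retention': 0.40,
    'First Party Collection/Use': 0.08,
}

results = {}
for params in parameter_grid({'candidate_threshold': [0.1, 0.5]}):
    threshold = params['candidate_threshold']
    results[threshold] = candidates_above_threshold(category_scores, threshold)
    print(params, results[threshold])
```

A lower threshold passes more candidate categories on to the lower layers, a higher one passes fewer.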



In [20]:
# TODO