<a href="https://colab.research.google.com/github/iVibudh/BAIT-509---Machine-Learning-/blob/master/notebooks/CER_01_labeling_VCs_with_zero_shot_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Labeling VCs with Zero-Shot Learning using Hugging Face Transformers

The goal of this notebook is to take the text from ESA tables and classify the VCs for each table. We use an approach called Zero-Shot Learning to do this. I won't go over the nitty-gritty details of Zero-Shot Learning here, but if you are interested, you can go to the following link: https://towardsdatascience.com/zero-shot-text-classification-evaluation-c7ba0f56688e.

Since we have to classify *a lot* of tables, it is better to use a GPU to speed up inference.

# Installations

In [None]:
!pip install transformers --quiet
!pip install tqdm --quiet


# Imports

In [None]:
import os
import requests
import pickle
import pandas as pd
from transformers import pipeline
from bs4 import BeautifulSoup
import nltk
from nltk.stem.porter import *
from tqdm import tqdm

nltk.download('stopwords', quiet=True)

True

In [None]:
pickled_dataset = 'zero_shot_vcs_train'

# Loading the Data

In [None]:
csv_url = 'https://raw.githubusercontent.com/JayThibs/huggingface-course-cer-workshop/main/data/zero_shot_esa_index_train_128_max_tokens.csv'
df_joined = pd.read_csv(csv_url, index_col=0)

In [None]:
len(df_joined)

28891

In [None]:
df_joined.head()

Unnamed: 0,Index,Content Type,Application Name,Application Short Name,Application Filing Date,Company Name,Commodity,File Name,ESA Folder URL,Document Number,Data ID,PDF Download URL,Application Type (NEB Act),Pipeline Location,Hearing order,Consultant Name,Pipeline Status,Regulatory Instrument(s),Application URL,Decision URL,ESA Section(s),ESA Section(s) Index,ESA Section(s) Topics,CSV Download URL,PDF Page Number,PDF Page Count,PDF Size,PDF Outline,Download folder name,Zipped Project Link,Missing CSV,CSV Filename,csvFileName,text,label
0,9134,Table,Application for North Montney Project,North Montney,11/8/2013,NOVA Gas Transmission Ltd.,Gas,B2-16 ESA_Appendix_G_Part1of4 (A3Q6H2),https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A3Q6H2,1059614,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"British Columbia, All",GH-001-2014,"Stantec Consulting Ltd., TERA Environmental Co...",Operating,GC-125,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/1...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/3...,Appendix G: TERA Aquatics Summary Report,15.0,"Water, All",http://www.cer-rec.gc.ca/esa-ees/nrthmntn/nrth...,14,48.0,5.87,No,nrthmntn,http://www.cer-rec.gc.ca/esa-ees/nrthmntn.zip,False,nrthmntn_table-3-summary-of-aquatics-field-wor...,1059614_14_lattice-v_1.csv,TABLE 3 SUMMARY OF AQUATICS FIELD WORK AND ABO...,-1
1,9135,Table,Application for North Montney Project,North Montney,11/8/2013,NOVA Gas Transmission Ltd.,Gas,B2-16 ESA_Appendix_G_Part1of4 (A3Q6H2),https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A3Q6H2,1059614,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"British Columbia, All",GH-001-2014,"Stantec Consulting Ltd., TERA Environmental Co...",Operating,GC-125,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/1...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/3...,Appendix G: TERA Aquatics Summary Report,15.0,"Water, All",http://www.cer-rec.gc.ca/esa-ees/nrthmntn/nrth...,17,48.0,5.87,No,nrthmntn,http://www.cer-rec.gc.ca/esa-ees/nrthmntn.zip,False,nrthmntn_table-4-summary-of-watercourse-crossi...,1059614_17_lattice-v_1.csv,TABLE 4 SUMMARY OF WATERCOURSE CROSSINGS ALONG...,-1
2,9136,Table,Application for North Montney Project,North Montney,11/8/2013,NOVA Gas Transmission Ltd.,Gas,B2-16 ESA_Appendix_G_Part1of4 (A3Q6H2),https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A3Q6H2,1059614,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"British Columbia, All",GH-001-2014,"Stantec Consulting Ltd., TERA Environmental Co...",Operating,GC-125,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/1...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/3...,Appendix G: TERA Aquatics Summary Report,15.0,"Water, All",http://www.cer-rec.gc.ca/esa-ees/nrthmntn/nrth...,18,48.0,5.87,No,nrthmntn,http://www.cer-rec.gc.ca/esa-ees/nrthmntn.zip,False,nrthmntn_table-4-summary-of-watercourse-crossi...,1059614_18_lattice-v_1.csv,TABLE 4 SUMMARY OF WATERCOURSE CROSSINGS ALONG...,-1
3,9137,Table,Application for North Montney Project,North Montney,11/8/2013,NOVA Gas Transmission Ltd.,Gas,B2-16 ESA_Appendix_G_Part1of4 (A3Q6H2),https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A3Q6H2,1059614,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"British Columbia, All",GH-001-2014,"Stantec Consulting Ltd., TERA Environmental Co...",Operating,GC-125,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/1...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/3...,Appendix G: TERA Aquatics Summary Report,15.0,"Water, All",http://www.cer-rec.gc.ca/esa-ees/nrthmntn/nrth...,19,48.0,5.87,No,nrthmntn,http://www.cer-rec.gc.ca/esa-ees/nrthmntn.zip,False,nrthmntn_table-4-summary-of-watercourse-crossi...,1059614_19_lattice-v_1.csv,TABLE 4 SUMMARY OF WATERCOURSE CROSSINGS ALONG...,-1
4,9138,Table,Application for North Montney Project,North Montney,11/8/2013,NOVA Gas Transmission Ltd.,Gas,B2-16 ESA_Appendix_G_Part1of4 (A3Q6H2),https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A3Q6H2,1059614,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"British Columbia, All",GH-001-2014,"Stantec Consulting Ltd., TERA Environmental Co...",Operating,GC-125,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/1...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/3...,Appendix G: TERA Aquatics Summary Report,15.0,"Water, All",http://www.cer-rec.gc.ca/esa-ees/nrthmntn/nrth...,20,48.0,5.87,No,nrthmntn,http://www.cer-rec.gc.ca/esa-ees/nrthmntn.zip,False,nrthmntn_table-4-summary-of-watercourse-crossi...,1059614_20_lattice-v_1.csv,TABLE 4 SUMMARY OF WATERCOURSE CROSSINGS ALONG...,-1


In [None]:
df_joined.iloc[100]['text']

'Table 425 Major Fire Departments in the Infrastructure and Services in the RAA FulltimeEmployees Equipment ServicesProvided AreasCommunitiesServed HudsonsHope FireDepartment 1 fire chief 25 firefighters Not specified fireprevention andinspectionservices City of Hudsons Hopeand surrounding area MoberlyLake FireDepartment 1 fire chief Severalfirefighters notspecified Not specified Not specified Moberly Lake areaWest Moberly FirstNation and SaulteauFirst Nation SOURCE Modified from PRRD 2013a '

In [None]:
# Keep only the first 250 characters of each table's text to limit memory use.
sequences = [text[:250] for text in df_joined['text']]

In [None]:
len(sequences)

28891

In [None]:
sequences[0]

'TABLE 3 SUMMARY OF AQUATICS FIELD WORK AND ABORIGINAL FIELD STUDY PARTICIPATION FOR THE PROJECT Survey Date Aboriginal Communities Detail July 9 to 12 2011 Blueberry River First Nation Halfway River First Nation McLeod Lake Indian Band North East Mti'

# Loading the Model with `pipeline`

In [None]:
# Explicitly ask for tensor allocation on CUDA device :0
classifier = pipeline("zero-shot-classification", device=0)

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)
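
Passing `device=0` assumes a CUDA GPU is attached to the runtime. As a minimal sketch (the helper name `pick_device` is hypothetical, not part of the original notebook), the device index could be chosen at runtime, falling back to `-1`, which the `pipeline` function treats as CPU:

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) if one is usable, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch installed we cannot query CUDA, so use the CPU.
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("pipeline device index:", device)

# The result could then be passed on, for example:
# classifier = pipeline("zero-shot-classification", device=device)
```

On CPU the full run over all tables would likely be far slower, so this fallback is mainly useful for testing the notebook on a handful of rows.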



In [None]:
candidate_labels = """Physical and meteorological environment,Soil and soil productivity,Vegetation,Water quality and quantity,Fish and fish habitat,Wetlands,Wildlife and wildlife habitat,Species at Risk or Species of Special Status and related habitat,Greenhouse gas (GHG) emissions and climate change,GHG Emissions and Climate Change – Assessment of Upstream GHG Emissions,Air emissions,Acoustic environment,Electromagnetism and Corona Discharge,Human occupancy and resource use,Heritage resources,Navigation and navigation safety,Traditional land and resource use,Social and cultural well-being,Human health and aesthetics,Infrastructure and services,Employment and economy,Environmental Obligations,Rights of Indigenous Peoples""".split(',')
hypothesis_template = "The data from this table is about {}."
candidate_labels

['Physical and meteorological environment',
 'Soil and soil productivity',
 'Vegetation',
 'Water quality and quantity',
 'Fish and fish habitat',
 'Wetlands',
 'Wildlife and wildlife habitat',
 'Species at Risk or Species of Special Status and related habitat',
 'Greenhouse gas (GHG) emissions and climate change',
 'GHG Emissions and Climate Change – Assessment of Upstream GHG Emissions',
 'Air emissions',
 'Acoustic environment',
 'Electromagnetism and Corona Discharge',
 'Human occupancy and resource use',
 'Heritage resources',
 'Navigation and navigation safety',
 'Traditional land and resource use',
 'Social and cultural well-being',
 'Human health and aesthetics',
 'Infrastructure and services',
 'Employment and economy',
 'Environmental Obligations',
 'Rights of Indigenous Peoples']
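
To make the role of `hypothesis_template` concrete: the zero-shot pipeline fills each candidate label into the template, and the underlying NLI model then scores how strongly the table text entails each resulting sentence. The sketch below only reproduces the string substitution for two of the labels above (the variable `example_labels` is introduced here for illustration); it does not call the model.

```python
hypothesis_template = "The data from this table is about {}."
example_labels = ["Wetlands", "Air emissions"]  # two labels taken from the list above

# Each candidate label is substituted into the template to form one hypothesis.
hypotheses = [hypothesis_template.format(label) for label in example_labels]

for hypothesis in hypotheses:
    print(hypothesis)
```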

The following code cell sometimes helps when the Colab GPU's memory is full. That said, I don't find it helps very much, even though people mention it quite often in StackOverflow threads.

Generally, if you get a CUDA error during inference or training, it is likely because you've loaded too much data onto the GPU and it has run out of memory. To resolve this, you will need to do a Factory Reset of the notebook instance, which you can find in the Runtime tab. After this, you will need to rerun all of your code.

In [None]:
import gc
import torch

gc.collect()

torch.cuda.empty_cache()

In [None]:
# Print the report so that it renders as a readable multi-line table.
print(torch.cuda.memory_summary(device=None, abbreviated=False))



Below is where we run inference on all of the table texts. Since tables can contain a lot of text, we had to truncate the text when creating the dataset; otherwise the GPU runs out of memory and inference crashes.

However, this is not the only thing you need to worry about. When you are doing inference, you can process the text in "batches". This means that you are loading multiple examples of text onto the GPU at the same time. The more examples you load onto the GPU at once, the faster your code will run. However, you can also run out of memory if you load too much data at once! So, if you get a "CUDA out of memory" error, your first instinct should be to reduce the batch size. As you can see in the following line:

```
for sequence_batch in tqdm(grouper(sequences, 2, "")):
```

I am only loading two examples at once! This is because more than two was too much for the Colab GPUs. Each GPU has a different amount of memory available, so they won't all have the same limit. In practice, this is just something you will need to play around with until you find a "batch size" that doesn't overload the GPU but still runs fast.
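
The batching idea can be sketched on toy data. The helper below, `iter_batches`, is a hypothetical illustration rather than the function used in this notebook: it yields lists of at most `batch_size` items and leaves the final batch short instead of padding it.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield consecutive lists holding at most `batch_size` items each."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Toy stand-ins for the table texts.
toy_sequences = ["table a", "table b", "table c", "table d", "table e"]
batches = list(iter_batches(toy_sequences, 2))
print(batches)
```

Raising `batch_size` means fewer, larger batches; lowering it trades speed for a smaller memory footprint per step.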

In [None]:
from itertools import zip_longest
 
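def grouper(iterable_obj, count, fillvalue=None):
    """Collect items into fixed-size tuples, padding the last one with fillvalue."""
    # Repeating the same iterator `count` times makes zip_longest pull
    # `count` consecutive items into each tuple.
    args = [iter(iterable_obj)] * count
    return zip_longest(*args, fillvalue=fillvalue)

output = []

# Batch size of 2; with an odd number of sequences, the final batch is padded with "".
for sequence_batch in tqdm(grouper(sequences, 2, "")):
    # multi_label=True scores every candidate label independently.
    output.append(classifier(list(sequence_batch), candidate_labels, hypothesis_template=hypothesis_template, multi_label=True))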

14446it [3:01:43,  1.32it/s]


Phew! That took just a little over 3 hours.

Now, let's check the output:

In [None]:
output[0]

[{'labels': ['Water quality and quantity',
   'Traditional land and resource use',
   'Environmental Obligations',
   'Rights of Indigenous Peoples',
   'Heritage resources',
   'Social and cultural well-being',
   'Infrastructure and services',
   'Species at Risk or Species of Special Status and related habitat',
   'Human occupancy and resource use',
   'GHG Emissions and Climate Change – Assessment of Upstream GHG Emissions',
   'Soil and soil productivity',
   'Physical and meteorological environment',
   'Vegetation',
   'Fish and fish habitat',
   'Wetlands',
   'Employment and economy',
   'Human health and aesthetics',
   'Wildlife and wildlife habitat',
   'Greenhouse gas (GHG) emissions and climate change',
   'Acoustic environment',
   'Air emissions',
   'Navigation and navigation safety',
   'Electromagnetism and Corona Discharge'],
  'scores': [0.7399547696113586,
   0.05953020229935646,
   0.01749674417078495,
   0.01284861657768488,
   0.005223275627940893,
   0.003144

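As you can see, we have a list of the VC labels along with a corresponding list of confidence scores, sorted from highest to lowest score. Because we set `multi_label=True`, each label is scored independently, so the scores do not need to sum to 1. The closer a score is to 1, the more likely the table should be labeled with that VC; a score close to 0 means it shouldn't be.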
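
As a small sketch of how one of these result dictionaries can be read, the snippet below pairs labels with scores and picks the highest-scoring one. `example_result` is an abbreviated copy of the output above, keeping only its first three labels and scores.

```python
# Abbreviated copy of the first prediction shown above (first three entries only).
example_result = {
    "labels": [
        "Water quality and quantity",
        "Traditional land and resource use",
        "Environmental Obligations",
    ],
    "scores": [0.7399547696113586, 0.05953020229935646, 0.01749674417078495],
}

# Pair each label with its score, then pick the label with the highest score.
label_scores = dict(zip(example_result["labels"], example_result["scores"]))
top_label = max(label_scores, key=label_scores.get)

print(top_label, round(label_scores[top_label], 3))
```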

In [None]:
# Since we have a list of lists for the outputs (because inference was done in batches), we need to flatten the lists together into one.
flatOutput = [item for elem in output for item in elem]

# Saving the Predictions in a Pickle File

In [None]:
with open(f'{pickled_dataset}.pkl', 'wb') as f:
  pickle.dump(flatOutput, f)

We don't want to have to rerun inference every time if we don't have to, so let's make sure to save the predictions for future use and experimentation.

# Loading the Predictions from the Pickle File

In [None]:
# A context manager closes the file even if unpickling fails.
with open(f'{pickled_dataset}.pkl', 'rb') as infile:
  zs_vcs_dict_list = pickle.load(infile)

In [None]:
zs_vcs_dict_list[0]

{'labels': ['Water quality and quantity',
  'Traditional land and resource use',
  'Environmental Obligations',
  'Rights of Indigenous Peoples',
  'Heritage resources',
  'Social and cultural well-being',
  'Infrastructure and services',
  'Species at Risk or Species of Special Status and related habitat',
  'Human occupancy and resource use',
  'GHG Emissions and Climate Change – Assessment of Upstream GHG Emissions',
  'Soil and soil productivity',
  'Physical and meteorological environment',
  'Vegetation',
  'Fish and fish habitat',
  'Wetlands',
  'Employment and economy',
  'Human health and aesthetics',
  'Wildlife and wildlife habitat',
  'Greenhouse gas (GHG) emissions and climate change',
  'Acoustic environment',
  'Air emissions',
  'Navigation and navigation safety',
  'Electromagnetism and Corona Discharge'],
 'scores': [0.7399547696113586,
  0.05953020229935646,
  0.01749674417078495,
  0.01284861657768488,
  0.005223275627940893,
  0.0031445538625121117,
  0.0027075377

In [None]:
len(zs_vcs_dict_list)

28892

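Now, let's prepare our data so that it can be placed in a Pandas DataFrame.

Note: I excluded the last element from the loop since it's an empty string. We have an odd number of tables (28891), so the final batch of two was padded with `""` during inference, which is why there are 28892 predictions.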
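
The extra prediction follows directly from the batch arithmetic. A quick check, using the counts reported in this notebook (the variable names are introduced here for illustration):

```python
import math

num_tables = 28891  # len(sequences) reported earlier in this notebook
batch_size = 2      # the count passed to grouper

num_batches = math.ceil(num_tables / batch_size)  # iterations shown by tqdm
padded_total = num_batches * batch_size           # predictions after flattening
num_padding = padded_total - num_tables           # empty strings added as padding

print(num_batches, padded_total, num_padding)
```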

In [None]:
vcs = []
scores = []
texts = []
# Skip the final prediction, which belongs to the empty padding string.
for result in zs_vcs_dict_list[:-1]:
  vcs.append(result['labels'])
  scores.append(result['scores'])
  texts.append(result['sequence'])

In [None]:
d = {'vcs': vcs, 'scores': scores, 'texts': texts}
df_output = pd.DataFrame(data=d)
df_output

Unnamed: 0,vcs,scores,texts
0,"[Water quality and quantity, Traditional land ...","[0.7399547696113586, 0.05953020229935646, 0.01...",TABLE 3 SUMMARY OF AQUATICS FIELD WORK AND ABO...
1,"[Fish and fish habitat, Wildlife and wildlife ...","[0.8952215909957886, 0.7558441162109375, 0.348...",TABLE 4 SUMMARY OF WATERCOURSE CROSSINGS ALONG...
2,"[Fish and fish habitat, Wildlife and wildlife ...","[0.8952215909957886, 0.7558441162109375, 0.348...",TABLE 4 SUMMARY OF WATERCOURSE CROSSINGS ALONG...
3,"[Fish and fish habitat, Wildlife and wildlife ...","[0.8952215909957886, 0.7558441162109375, 0.348...",TABLE 4 SUMMARY OF WATERCOURSE CROSSINGS ALONG...
4,"[Fish and fish habitat, Wildlife and wildlife ...","[0.8952215909957886, 0.7558441162109375, 0.348...",TABLE 4 SUMMARY OF WATERCOURSE CROSSINGS ALONG...
...,...,...,...
28886,"[Traditional land and resource use, Environmen...","[0.2083447426557541, 0.17277342081069946, 0.10...",TABLE 1 GENERAL LAND USE AND ENVIRONMENTAL SET...
28887,[Species at Risk or Species of Special Status ...,"[0.9934717416763306, 0.5959182977676392, 0.550...",TABLE 2 OCCURRENCES OF SPECIES WITH SPECIAL CO...
28888,[Species at Risk or Species of Special Status ...,"[0.990873396396637, 0.6361707448959351, 0.4615...",TABLE 2 OCCURRENCES OF SPECIES WITH SPECIAL CO...
28889,[Species at Risk or Species of Special Status ...,"[0.9845494031906128, 0.3557242751121521, 0.196...",TABLE 2 OCCURRENCES OF SPECIES WITH SPECIAL CO...


Now, we could set a score threshold to decide whether to label a table with a given VC. For example, we could say that a VC is only assigned to a table when its score is at least 0.6.
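
A minimal sketch of such a threshold, assuming the 0.6 cutoff mentioned above. The function name `labels_above_threshold` is hypothetical. The first two label and score pairs come from the second row of the DataFrame above; the third pair is made up for illustration because the real values are elided in the displayed output.

```python
def labels_above_threshold(labels, scores, threshold=0.6):
    """Return the labels whose score is at least `threshold`."""
    return [label for label, score in zip(labels, scores) if score >= threshold]


example_labels = [
    "Fish and fish habitat",
    "Wildlife and wildlife habitat",
    "Wetlands",  # hypothetical third label, for illustration only
]
example_scores = [
    0.8952215909957886,
    0.7558441162109375,
    0.30,  # hypothetical score, for illustration only
]

kept = labels_above_threshold(example_labels, example_scores)
print(kept)
```

Applied to the whole DataFrame, this could look like `df_output.apply(lambda row: labels_above_threshold(row['vcs'], row['scores']), axis=1)`. The right threshold is a judgment call and would need to be checked against manually labeled tables.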