# M1 Extracting Paragraphs from the EU Taxonomy Document


In [76]:
import re

import textract
import pandas as pd

## Objective

Process the EU sustainable finance taxonomy PDF file, then extract and clean all the paragraphs in the document.

## Download the EU sustainable finance taxonomy PDF from Taxonomy Report: Technical Annex.

## Load the EU sustainable finance taxonomy PDF file using the textract library and decode it. 

Look through the text to ensure that you have got all the text and that the decoding did not produce any bad characters.

In [77]:
text = textract.process('EUtaxonomy.pdf')

In [78]:
text = text.decode()

In [79]:
# text = textract.process('EUtaxonomy.pdf', method='pdfminer').decode()

## Use regular expressions to split the paragraphs and clean the text. 

The loaded text will be in raw format and will need to be segmented into paragraphs. These paragraphs will also need to be cleaned by removing newline characters and other characters that do not bring any semantic value to the paragraph (such as tabs or bullet points).

In [80]:
len(text)

1320996

In [81]:
text[0:1000]

'Updated methodology & Updated Technical Screening Criteria\n- 1-\n\nMarch 2020\n\n\x0cAbout this report\nThis document includes an updated Part B: Methodology from the June 2019 report and an updated Part\nF: Full list of technical screening criteria. The other original sections from the June 2019 report can be\nfound as labelled in the June 2019 report.\nPART A\n\nExplanation of the Taxonomy approach. This section sets out the role and importance of\nsustainable finance in Europe from a policy and investment perspective, the rationale for\nthe development of an EU Taxonomy, the daft regulation and the mandate of the TEG.\n\nPART B\n\nMethodology. This explains the methodologies for developing technical screening\ncriteria for climate change mitigation objectives, adaptation objectives and ‘do no\nsignificant harm’ to other environmental objectives in the legislative proposal.\nThis has been updated since 2019.\n\nPART C\n\nTaxonomy user and use case analysis. This section provides pr

In [82]:
paragraphs = re.split(r"\s*?\n\s*?\n\s*?", text)

In [83]:
min_length = 200
paragraphs = [para for para in paragraphs if len(para) > min_length]

In [84]:
len(paragraphs)

1627

In [85]:
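def clean_paragraph(text):
    # Drop punctuation and bullet symbols, then collapse newlines, tabs and repeated spaces.
    # Note: this also removes decimal points, so "0.5" becomes "05".
    text = re.sub(r'[^\w\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()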

## Store the paragraphs in a DataFrame with the column “paragraph” using the pandas library and save the DataFrame.

In [86]:
df = pd.DataFrame({'paragraph': paragraphs})

In [87]:
df.head()

Unnamed: 0,paragraph
0,About this report\nThis document includes an ...
1,Explanation of the Taxonomy approach. This sec...
2,Methodology. This explains the methodologies f...
3,Full list of technical screening criteria. Thi...
4,Disclaimer\nThis report represents the overall...


In [88]:
df['paragraph'] = df['paragraph'].apply(clean_paragraph)

In [89]:
df.head()

Unnamed: 0,paragraph
0,About this report This document includes an u...
1,Explanation of the Taxonomy approach This sect...
2,Methodology This explains the methodologies fo...
3,Full list of technical screening criteria This...
4,Disclaimer This report represents the overall ...


In [90]:
df.to_csv("paragraphs.csv")

# M2 Question Paragraph Matching

In [91]:
from sklearn.feature_extraction.text import TfidfVectorizer

## Objective

Build a text vectorizer that finds the best matching paragraph for each question in the provided set, and evaluate the results qualitatively.

In [92]:
df = pd.read_csv("paragraphs.csv")

In [93]:
df.head()

Unnamed: 0.1,Unnamed: 0,paragraph
0,0,About this report This document includes an u...
1,1,Explanation of the Taxonomy approach This sect...
2,2,Methodology This explains the methodologies fo...
3,3,Full list of technical screening criteria This...
4,4,Disclaimer This report represents the overall ...


## Initiate a TF-IDF model trained on the paragraphs from the previous milestone by using the TfidfVectorizer class from the scikit-learn library. 

This model will provide a representation for each paragraph or each question.

In [94]:
vectorizer = TfidfVectorizer()

In [95]:
vectorized_paragraphs = vectorizer.fit_transform(df['paragraph'])

In [96]:
vectorized_paragraphs.shape

(1627, 6496)

## Transform all the paragraphs into representations and calculate a similarity in the representation space between each question and all the paragraphs.

The similarity can be calculated using the linear_kernel function from the scikit-learn library, which returns the dot product of two vectors. Because TfidfVectorizer L2-normalizes its output by default, this dot product equals the cosine similarity. Sort the scores and match each question to the paragraph with the highest score.

In [97]:
questions = [
    ["What fuel is used for manufacturing of chlorine?"],
    ["What metric is used for evaluating emission?"],
    ["How can carbon emission of the processes of cement clinker be reduced?"],
    ["How is the Weighted Cogeneration Threshold calculated?"],
    ["What is carbon capture and sequestration?"],
    ["What stages does CCS consist of?"],
    ["What should be the average energy consumption of a water supply system?"],
    ["What are examples of sludge treatments?"],
    ["How is the process of anaerobic digestion?"],
    ["How is reforestation defined?"],
    ["What is the threshold of emission for inland passenger water transport?"],
    ["What are the requirements of reporting for electricity generation from natural gas where there might be fugitive emissions?"]
]

In [98]:
from sklearn.metrics.pairwise import linear_kernel

# Iterate through the questions and transform each of them to its vector representation.
# Then use linear_kernel to score every paragraph and keep the one with the highest similarity.
vector_representations = []

for question in questions:
    vec_rep = vectorizer.transform(question)
    lk_rank = linear_kernel(vec_rep, vectorized_paragraphs).flatten()
    vector_representations.append((question, df["paragraph"][lk_rank.argsort()[-1]]))    

## Bonus: Train a Doc2vec model with the paragraphs using the Doc2vec model provided by the gensim library. 

Similar to the TF-IDF model, Doc2vec provides a representation for the paragraphs.

In [99]:
import gensim

def read_corpus(text, tokens_only=False):
    for i, line in enumerate(text):
        tokens = gensim.utils.simple_preprocess(line)
        if tokens_only:
            yield tokens
        else:
            # For training data, add tags
            yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

corpus = list(read_corpus(df["paragraph"].values))
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

## Bonus: Given the representation of the paragraphs, use the most_similar method in the gensim library, which uses cosine similarity to get the paragraphs that best match the questions.

In [100]:
doc2vec_similarities = []
for question in questions:
    q1 = list(read_corpus(question, tokens_only=True))
    inferred_vector = model.infer_vector(q1[0])
    # gensim 4.x renames `docvecs` to `dv`; `docvecs` remains as a deprecated alias.
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    doc2vec_similarities.append((question, df["paragraph"][sims[0][0]]))

## Bonus: Evaluate the two different methods for matching questions to paragraphs and pick the better performing one to use in the next milestone.

In [101]:
for ic, (question, context) in enumerate(vector_representations):
    print(question[0])
    print(f"tfidf: {context}\n\ndoc2vec: {doc2vec_similarities[ic][1]}")
    print("-"*20)

What fuel is used for manufacturing of chlorine?
tfidf: Rationale The manufacturing process of carbon black accounts for approximately 34 of the GHG emissions from the chemical sector while the manufacturing of soda ash accounts for 15 of the emissions 212 The manufacturing process of chlorine is extremely energyintensive with chloralkali process accounting for 17 of total electrical consumption of the European chemical and petrochemical industry213 Reducing the manufacturing emissions for carbon black and soda ash and improving energy efficiency in the manufacturing of chlorine can positively contribute to the mitigation objective Moreover it is recognised that soda ash used in double glazing can enhance building efficiency gains The absolute performance approach has been proposed in order to identify the maximum acceptable carbon intensities of the manufacturing processes of carbon black and soda ash that the activities should comply with in order to be able to substantially contribu

# M3 Set Up Transformers for Question-Answering

## Objective

Get familiar with using the Hugging Face library for applied purposes. The main goal is to extract the answer given a question-paragraph tuple.

## Notes

- Both of the methods described below require pointing to an existing pretrained or, in our case, fine-tuned model. You can find a library of pretrained and fine-tuned models at Hugging Face Models. Note that some of the models are quite large and may either not run at all or slow your computer down. A smaller model that could be a good starting point is distilbert-base-uncased-distilled-squad.

- There are other libraries you could use to develop a Question-Answering model. However, for this project we want to focus on the Hugging Face transformers since they are already pretrained and fine-tuned. They also provide a very simple interface to set up and use the model.

- There are two different methods to use the transformers library. There are pros and cons for both, but for the purposes of this project it does not make any difference which one you choose.

- We will only use exact matches as an evaluation metric, as mentioned above. This is the number of data points that the model predicts correctly divided by the total number of data points. A regular string match should be sufficient for the purpose of this project. Again, you do not need to use the entire dataset since it can take a lot of computational power, but rather sample some data from it (perhaps 1000 data points or so).

## Resources
- Real-World Natural Language Processing by Masato Hagiwara Chapter 9, section 3, “Case study 1: Sentiment analysis with BERT,” provides an example of using the Hugging Face transformers library.


- Taming Text by Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris Chapter 8, “Building an example question answering system,” is helpful if you want to understand the theory behind building a question answering system.


- Natural Language Processing in Action by Hobson Lane, Cole Howard, and Hannes Hapke Chapter 10, “Sequence-to-sequence models and attention,” contains theoretical background on how transformers work.


## Additional Resources

- HuggingFace Transformers library documentation will be used for the question answering.

- Question answering tutorial https://huggingface.co/transformers/usage.html#extractive-question-answering – Note that we will use PyTorch for this project.

- Information about the Stanford question-answer dataset https://rajpurkar.github.io/SQuAD-explorer/, as well as a list of how different algorithms perform on the dataset.

## Help

- When setting up the pipeline, you can point to a given model and tokenizer by passing them as parameters (`device=-1` runs on the CPU), for example:

    qamodel = pipeline("question-answering", model=MODEL, tokenizer=MODEL, device=-1)

- The structure of SQuAD is a set of paragraphs—each paragraph has a set of questions and a set of answers. It might be good to extract tuples of these to make the evaluation easier. For the evaluation, you can just see how often there is an exact match.

In [102]:
MODEL = "distilbert-base-uncased-distilled-squad"
TEST_SAMPLE_SIZE = 1000

In [155]:
import random
import json

with open("data/dev-v2.0.json") as f:
    data = json.load(f)

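def get_question_answers_context(data):
    # Flatten the SQuAD structure into (question, answers, context) tuples.
    tuples = []
    for article in data['data']:
        for paragraph in article['paragraphs']:
            context = paragraph['context']
            for qa in paragraph['qas']:
                # SQuAD 2.0 includes unanswerable questions; their answer list is empty.
                answers = [answer['text'] for answer in qa['answers']]
                tuples.append((qa['question'], answers, context))
    return tuples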

qac = random.sample(get_question_answers_context(data), TEST_SAMPLE_SIZE)

## Import the pipeline class.

This will be more straightforward and you will get less exposure to the components of the transformer than in the previous milestone.


In [157]:
from transformers import pipeline

## Set up the model and point to an existing pretrained and fine-tuned model. 

(See the notes for more detail.)

## Use the SQuAD dev dataset to test how the different models are performing. 

The metric you will need to set up here is an exact match metric, which means you just need to see whether the predicted text is exactly the same as the answer provided in the dataset. You do not need to use the entire dataset, but make sure to evaluate a sample to ensure that the model performs well.

## Explore a few different models and evaluate which performs the best. 

For this project, we will only use exact matches as an evaluation metric.
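
The exact match metric described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `exact_match_score` and its `predict` argument are hypothetical names, and the samples are assumed to be `(question, answers, context)` tuples in which `answers` is a list of reference strings. Any callable that returns an answer string can be plugged in, for example a thin wrapper around a Hugging Face question-answering pipeline.

```python
def exact_match_score(predict, samples):
    """Return the fraction of samples whose prediction equals a reference answer.

    predict: callable taking (question, context) and returning an answer string.
    samples: iterable of (question, answers, context) tuples (assumed layout).
    """
    samples = list(samples)
    if not samples:
        return 0.0
    hits = 0
    for question, answers, context in samples:
        prediction = predict(question, context)
        # Plain string comparison against every reference answer.
        if prediction in answers:
            hits += 1
    return hits / len(samples)


# Possible usage with a pipeline (assumes transformers is installed and the
# model can be downloaded):
# from transformers import pipeline
# qamodel = pipeline("question-answering", model=MODEL, tokenizer=MODEL)
# score = exact_match_score(
#     lambda q, c: qamodel(question=q, context=c)["answer"], qac)
```

Samples with an empty answer list always count as misses in this sketch, so a model that always returns a span cannot score on unanswerable questions. Filtering those samples out first is one possible choice when comparing models.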

## Bonus: Use the question and paragraph pairs from the previous milestone as input to allow the model to predict the location of the answer.

## Bonus: If you want to get a better understanding of setting up the model, you can import the AutoTokenizer and AutoModelForQuestionAnswering classes from the transformers library.

- Here you will get more insight into the structure of the pipeline—the two classes are the two main parts of the transformers architecture.


- One is the tokenization model which tokenizes both the question and the context (paragraph), and the second is the Question-Answering model which predicts where in the sequenced tokens of the context the answer starts and ends, given the question tokens as inputs.
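
To make the second step more concrete, here is a small sketch of how a start position and an end position could be chosen from the model's scores. It assumes the model returns one start score and one end score per token (Hugging Face question-answering models expose these as `start_logits` and `end_logits`). The function name `best_span` and the default of 15 tokens are illustrative assumptions, and the pipeline's own decoding is more involved than this.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Pick the (start, end) token indices with the highest combined score.

    Only spans with start <= end and at most max_answer_len tokens are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        # The end index may not precede the start or exceed the length limit.
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best
```

The returned indices refer to token positions, so the answer text is obtained by converting the tokens between `start` and `end` (inclusive) back to a string with the tokenizer.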

# M4 Build an End-to-End Question-Answering Model

## Objective

Integrate the previous milestones into an end-to-end application for question answering on the EU taxonomy for sustainable finance corpus

## Notes

- The key here is to be able to clean up the code and test out the ensemble of the models to get the answers from the entire corpus.


- The goal of this milestone is to wrap up all your code and plug the functions in to a prebuilt template in order to visualize the application.

Here is the list of questions and potential answers for this project:

- What fuel is used for manufacturing of chlorine?
    A: Electricity


- What metric is used for evaluating emission?
    A: gCO2e


- How can carbon emission of the processes of cement clinker be reduced?
    A: The use of biomass and waste materials as fuels in cement kilns


- How is the Weighted Cogeneration Threshold calculated?
    A: The relative production of heat and power


- What is carbon capture and sequestration?
    A: A key technology for the decarbonisation of Europe


- What stages does CCS consist of?
    A: Capture, transport and storage


- What should be the average energy consumption of a water supply system?
    A: 0.5 kWh per cubic meter


- What are sludge treatments?
    A: Methane Anaerobic Digestion and in some cases aerobic digestion


- What is the process of anaerobic digestion?
    A: Microorganisms decompose the organic matter of the sludge in the absence of oxygen


- How is reforestation defined?
    A: Re-establishment of forests


- What is the threshold of emission for inland passenger water transport?
    A: 50gCO2e/pkm


- What are the requirements of reporting for electricity generation from natural gas where there might be fugitive emissions?
    A: full life cycle assessment of fugitive emissions



- Note that the questions might not have exact answers and these are just a few examples. Thus we cannot do a fair quantitative evaluation. However, you can get a feel for the model and do a somewhat qualitative evaluation. This project provides a good understanding of how you can implement the first version of a Question-Answering model without any domain-specific labeled data, which is a quite common problem faced in the industry.

## Additional resources

- In this template app, https://github.com/MatteusT/QAtemplate, you can plug in functions and use the GUI for Question-Answering model testing. It contains instructions for use.


- General information about the Django platform https://docs.djangoproject.com/en/3.0/ used in the application

## Help

You will need to have just two main functions ready. You will integrate them into the code from QAtemplate. The two functions are:

- A function to find the paragraph most relevant to the question

    def get_context(question):
        ...
        return paragraph


- A function that returns the answer from input of a paragraph and the given question

    def get_answer_pipeline(question, context):
        ...
        return answer


You will also need a function to process the document and return the paragraphs.

## Create a function from the findings in Milestone 1—a function that inputs a file path and returns a list of paragraphs.

In [159]:
## See above
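
One way the Milestone 1 steps could be wrapped into functions is sketched below. The names `split_paragraphs` and `extract_paragraphs` are hypothetical. The splitting and cleaning follow the notebook cells above, except that whitespace is collapsed with a regular expression. `textract` is imported inside the function so the text-only helper can be used without it.

```python
import re


def split_paragraphs(text, min_length=200):
    """Split raw text on blank lines, drop short chunks, and clean the rest."""
    chunks = re.split(r"\s*\n\s*\n\s*", text)
    paragraphs = []
    for chunk in chunks:
        if len(chunk) <= min_length:
            continue  # headings, page numbers and other short fragments
        chunk = re.sub(r"[^\w\s]", "", chunk)          # remove punctuation and bullets
        chunk = re.sub(r"\s+", " ", chunk).strip()     # collapse newlines, tabs, spaces
        paragraphs.append(chunk)
    return paragraphs


def extract_paragraphs(file_path, min_length=200):
    """Read a document with textract and return its cleaned paragraphs."""
    import textract  # third-party; only needed when reading a file

    text = textract.process(file_path).decode()
    return split_paragraphs(text, min_length)
```

For example, `extract_paragraphs('EUtaxonomy.pdf')` would return the list of paragraphs that the earlier cells stored in the DataFrame, up to the whitespace handling noted above.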

## Create a function from the findings in Milestone 2—a function that inputs a list of paragraphs and a question and returns a ranked list with the top n paragraphs that are most similar to the question.


In [None]:
## See above
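
A possible shape for the Milestone 2 function is sketched below, using the same TfidfVectorizer and linear_kernel calls as the notebook. The names `top_n_indices` and `get_top_paragraphs` are hypothetical, and scikit-learn is imported inside the function so the ranking helper stays usable on its own.

```python
def top_n_indices(scores, n):
    """Return the indices of the n highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]


def get_top_paragraphs(question, paragraphs, n=3):
    """Rank paragraphs by TF-IDF similarity to the question and return the top n."""
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import linear_kernel

    vectorizer = TfidfVectorizer()
    paragraph_vectors = vectorizer.fit_transform(paragraphs)
    question_vector = vectorizer.transform([question])
    # Dot product of L2-normalized TF-IDF vectors, one score per paragraph.
    scores = linear_kernel(question_vector, paragraph_vectors).flatten()
    return [paragraphs[i] for i in top_n_indices(scores, n)]
```

This sketch refits the vectorizer on every call, which keeps the function self-contained but is wasteful when many questions are asked. In an application it may be preferable to fit once and reuse the fitted vectorizer and paragraph matrix.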

## Create a function of the findings in Milestone 3—a function that inputs a string consisting of a question and a list of paragraphs and returns a list of strings containing the extractive answers from those paragraphs.


In [None]:
## See above
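
The Milestone 3 function could look roughly like the sketch below. The name `get_answers` is hypothetical, and the question-answering pipeline is passed in as an argument, on the assumption that it was created once elsewhere. The pipeline is expected to accept `question` and `context` keyword arguments and return a dictionary with an `"answer"` key, as the Hugging Face question-answering pipeline does.

```python
def get_answers(question, paragraphs, qa_pipeline):
    """Return one extractive answer string per paragraph for the given question."""
    answers = []
    for paragraph in paragraphs:
        result = qa_pipeline(question=question, context=paragraph)
        answers.append(result["answer"])
    return answers


# Possible usage (assumes transformers is installed and the model can be downloaded):
# from transformers import pipeline
# qamodel = pipeline("question-answering", model=MODEL, tokenizer=MODEL)
# answers = get_answers("How is reforestation defined?", top_paragraphs, qamodel)
```

Passing the pipeline in avoids reloading the model for every question and makes it easy to swap in a different model when comparing results.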

## Integrate the functions into the repository and follow the instructions.