# M1 Extracting Paragraphs from the EU Taxonomy Document


In [76]:
import re

import textract
import pandas as pd

## Objective

Process the EU sustainable finance taxonomy PDF file, then extract and clean all the paragraphs in the document.

## Notes

- The most important part of this milestone is to be able to segment the document into many paragraphs. The data cleaning is secondary.


- There might be a need to change the cleaning method as we progress in the project.


- Note that there are methods you can use that will segment the document better and extract all of the parts in a clean manner. However, for the purposes of this project, we can just extract all the paragraphs in a somewhat rough manner. The segmentation and extraction have great importance in the overall result of any NLP processing solution you want to run on the corpus.

## Additional resources

- Chapter 2, Regular Expressions, Text Normalization, Edit Distance, in Speech and Language Processing by Daniel Jurafsky and James H. Martin covers all the basics of understanding regular expressions, up to an intermediate level.

- A few specific examples from the Python package can be used as a simple tutorial to extract text from PDF files. The example here is similar to what we will need for this project.

- “Regular Expressions — An excellent tool for text analysis or NLP” by Niwratti Kasture is a very good overview of regular expressions.

- [NLP] Basics: Understanding Regular Expressions by Céline Van den Rul is another good overview of regular expressions.

- regular expressions 101 is a great resource to test out regular expressions.

- re — Regular expression operations is documentation on the package used in this milestone.

- Read the textract documentation to see how to extract the text from the PDF file.

## Help

- In this case, this is the function you want to use to process and extract the text from the PDF file:


text = textract.process("path/to/file.pdf")

- For regular expressions, you want to take advantage of the spacing and newline characters to split the paragraphs. To do this, you will need to set up a pattern that includes \s. The library re has the function split(), which you should use to split the text by a given specific pattern.
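
As a minimal sketch of the splitting idea above, the pattern below treats a blank line (possibly containing spaces or tabs) as a paragraph boundary. The sample string and variable names are made up for illustration and are not taken from the taxonomy document.

```python
import re

# Hypothetical sample text: paragraphs separated by blank lines,
# one of which contains a stray space.
sample = "First paragraph,\nstill first.\n \nSecond paragraph.\n\n\nThird paragraph."

# \s* absorbs spaces, tabs and extra newlines around the two required newlines,
# so a single newline inside a paragraph does not trigger a split.
paragraphs = re.split(r"\s*\n\s*\n\s*", sample)
print(paragraphs)
```

A single newline inside a paragraph is left alone, which is why a separate cleaning step is still needed afterwards.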

## Download the EU sustainable finance taxonomy PDF from Taxonomy Report: Technical Annex.

## Load the EU sustainable finance taxonomy PDF file using the textract library and decode it. 

Look through the text to ensure that you have got all the text and that the decoding did not produce any bad characters.
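
One possible way to check for bad characters, rather than reading the whole text by eye, is to count control characters and Unicode replacement characters. This is only a sketch: the helper name `suspicious_characters` and the sample string are assumptions for illustration.

```python
import unicodedata
from collections import Counter

def suspicious_characters(text):
    """Count characters that often signal extraction or decoding problems."""
    counts = Counter()
    for ch in text:
        # U+FFFD is the Unicode replacement character produced by lossy decoding
        if ch == "\ufffd":
            counts[ch] += 1
        # Category "Cc" covers control characters; newline and tab are expected
        elif unicodedata.category(ch) == "Cc" and ch not in "\n\t":
            counts[ch] += 1
    return counts

# Hypothetical snippet containing a form feed (page break) and a replacement character
sample = "March 2020\n\n\x0cAbout this report \ufffd"
print(suspicious_characters(sample))
```

Form feeds (`\x0c`) typically mark page breaks in extracted PDF text, so a large count of them is expected; replacement characters are the ones worth investigating.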

In [77]:
text = textract.process('EUtaxonomy.pdf')

In [78]:
text = text.decode()

In [79]:
# text = textract.process('EUtaxonomy.pdf', method='pdfminer').decode()

## Use regular expressions to split the paragraphs and clean the text. 

The loaded text will be in raw format and will need to be segmented into paragraphs. These paragraphs will also need to be cleaned by removing newline characters and other characters that do not bring any semantic value to the paragraph (such as tabs or bullet points).

In [80]:
len(text)

1320996

In [81]:
text[0:1000]

'Updated methodology & Updated Technical Screening Criteria\n- 1-\n\nMarch 2020\n\n\x0cAbout this report\nThis document includes an updated Part B: Methodology from the June 2019 report and an updated Part\nF: Full list of technical screening criteria. The other original sections from the June 2019 report can be\nfound as labelled in the June 2019 report.\nPART A\n\nExplanation of the Taxonomy approach. This section sets out the role and importance of\nsustainable finance in Europe from a policy and investment perspective, the rationale for\nthe development of an EU Taxonomy, the daft regulation and the mandate of the TEG.\n\nPART B\n\nMethodology. This explains the methodologies for developing technical screening\ncriteria for climate change mitigation objectives, adaptation objectives and ‘do no\nsignificant harm’ to other environmental objectives in the legislative proposal.\nThis has been updated since 2019.\n\nPART C\n\nTaxonomy user and use case analysis. This section provides pr

In [82]:
paragraphs = re.split(r"\s*?\n\s*?\n\s*?", text)

In [83]:
min_length = 200
paragraphs = [para for para in paragraphs if len(para) > min_length]

In [84]:
len(paragraphs)

1627

In [85]:
def clean_paragraph(text):
    text = text.replace("\n", " ").replace("  ", " ").strip(" ")
    return re.sub(r'[^\w\s]', '', text).strip(" ")

## Store the paragraphs in a DataFrame with the column “paragraph” using the pandas library and save the DataFrame.

In [86]:
df = pd.DataFrame({'paragraph': paragraphs})

In [87]:
df.head()

Unnamed: 0,paragraph
0,About this report\nThis document includes an ...
1,Explanation of the Taxonomy approach. This sec...
2,Methodology. This explains the methodologies f...
3,Full list of technical screening criteria. Thi...
4,Disclaimer\nThis report represents the overall...


In [88]:
df['paragraph'] = df['paragraph'].apply(clean_paragraph)

In [89]:
df.head()

Unnamed: 0,paragraph
0,About this report This document includes an u...
1,Explanation of the Taxonomy approach This sect...
2,Methodology This explains the methodologies fo...
3,Full list of technical screening criteria This...
4,Disclaimer This report represents the overall ...


In [90]:
df.to_csv("paragraphs.csv")
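
Note that `DataFrame.to_csv` writes the row index as an unnamed first column by default, and `read_csv` then loads it back as `Unnamed: 0`. The sketch below shows the difference `index=False` makes, using an in-memory buffer and made-up rows instead of the real file.

```python
import io
import pandas as pd

demo = pd.DataFrame({"paragraph": ["first paragraph", "second paragraph"]})

# Default behaviour: the index is written and comes back as an extra column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False keeps only the data columns
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
round_trip = pd.read_csv(without_index)
print(round_trip.columns.tolist())
```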

# M2 Question Paragraph Matching

In [91]:
from sklearn.feature_extraction.text import TfidfVectorizer

## Objective

Build a text vectorizer that finds the best matching paragraph for each of the provided questions, and qualitatively evaluate the results.

## Notes

- This is the first information retrieval step. For this step we will use a traditional method for searching documents.


- TF-IDF is, despite being an older method, usually preferred since it performs quite well and is usually faster than other representation methods.


- Doc2vec can provide better representations if trained on a larger corpus such as Wikipedia. For the purpose of the project, TF-IDF might perform better if Doc2vec is only trained on the EU taxonomy document.


The questions for this project are the following:


- What fuel is used for the manufacturing of chlorine?


- What metric is used for evaluating emission?


- How can carbon emission of the processes of cement clinker be reduced?


- How is the Weighted Cogeneration Threshold calculated?


- What are carbon capture and sequestration?


- What stages does CCS consist of?


- What should be the average energy consumption of a water supply system?


- What are sludge treatments?


- What is the process of anaerobic digestion?


- How is reforestation defined?


- What is the threshold of emission for inland passenger water transport?


- What are the requirements of reporting for electricity generation from natural gas where there might be fugitive emissions?

## Resources

- Natural Language Processing in Action by Hobson Lane, Cole Howard, and Hannes Hapke
  Chapter 3, “Math with words (TF-IDF vectors),” explains TF-IDF vectors. The first 3 sections are relevant. 
  If you want to understand TF-IDF, section 4, “Topic modeling,” and subsequent sections are more practical and relevant.


- Real-World Natural Language Processing by Masato Hagiwara
    Chapter 3, “Word and Document Embeddings,” discusses word and document embeddings, the theory behind Doc2Vec. 
    This chapter will allow you to understand the general theory behind text vectorization.


- Deep Learning for Natural Language Processing by Stephan Raaijmakers
    Chapter 3, “Text Embeddings,” includes more on document embeddings, for those who want to get a deeper insight in the theory.


## Additional resources

- Gensim library documentation, which should be used to get Doc2vec embeddings https://radimrehurek.com/gensim/


- An example of how to find the most similar paragraph and sentence   https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html#sphx-glr-auto-examples-tutorials-run-doc2vec-lee-py


- TF-IDF library documentation https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

- Documentation on the vector distance measure library 
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.linear_kernel.html

## Help

- To build a vectorizer, use the scikit-learn module TfidfVectorizer, which you will need to fit to the corpus.

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(corpus)

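- The second step is to use linear_kernel() to compute the similarity between the question vector and every paragraph vector, and then sort the scores to find the paragraphs with the highest similarity. Because TfidfVectorizer L2-normalizes its vectors by default, this dot product equals the cosine similarity, so a larger value means a closer match.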
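
The two help steps can be sketched on a toy corpus as follows. The three sentences and the query are invented stand-ins for the cleaned paragraphs and questions; only the scikit-learn calls mirror what the milestone uses.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini-corpus standing in for the cleaned paragraphs
corpus = [
    "reforestation is the reestablishment of forest through planting",
    "the manufacturing of chlorine is energy intensive",
    "carbon capture and storage consists of capture transport and storage",
]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(corpus)

query = vectorizer.transform(["how is reforestation defined"])

# Rows are L2-normalised by default, so the dot product equals the cosine similarity
scores = linear_kernel(query, vectors).flatten()
same_as_cosine = np.allclose(scores, cosine_similarity(query, vectors).flatten())

# argsort is ascending, so reverse it to put the best match first
ranking = scores.argsort()[::-1]
print(corpus[ranking[0]])
```

Words in the query that never occur in the corpus (here "how" and "defined") are simply ignored by `transform`, which is why misspelled question words contribute nothing to the score.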


In [161]:
df = pd.read_csv("paragraphs.csv")

In [162]:
df.head()

Unnamed: 0.1,Unnamed: 0,paragraph
0,0,About this report This document includes an u...
1,1,Explanation of the Taxonomy approach This sect...
2,2,Methodology This explains the methodologies fo...
3,3,Full list of technical screening criteria This...
4,4,Disclaimer This report represents the overall ...


## Initiate a TF-IDF model trained on the paragraphs from the previous milestone by using the TfidfVectorizer class from the scikit-learn library. 

This model will provide a representation for each paragraph or each question.

In [163]:
vectorizer = TfidfVectorizer()

In [164]:
vectorized_paragraphs = vectorizer.fit_transform(df['paragraph'])

In [165]:
vectorized_paragraphs.shape

(1627, 6496)

## Transform all the paragraphs into representations and calculate a distance in the representation space between each question and all the paragraphs. 

The linear_kernel function from the scikit-learn library returns a similarity score (a dot product) rather than a distance, so a higher value means a closer match. Sort the scores and match each question to the paragraph with the highest score.

In [166]:
questions = [
    ["What fuel is used for manufacturing of chlorine?"],
    ["What metric is used for evaluating emission?"],
    ["How can carbon emission of the processes of cement clinker be reduced?"],
    ["How is the Weighted Cogeneration Threshold calculated?"],
    ["What is carbon capture and sequestration?"],
    ["What stages does CCS consist of?"],
    ["What should be the average energy consumption of a water supply system?"],
    ["What are examples of sludge treatments?"],
    ["How is the process of anaerobic digestion?"],
    ["How is reforestation defined?"],
    ["What is the threshold of emission for inland passenger water transport?"], 
    ["What are the requirements of reporting for electricity generation from natural gas where there might be fugitive emissions?"]
]

In [167]:
from sklearn.metrics.pairwise import linear_kernel

# Iterate through the questions and transform each of them to their vector representation. 
# Then use linear_kernel to get the similarity scores and keep the paragraph with the largest one.
vector_representations = []

for question in questions:
    vec_rep = vectorizer.transform(question)
    lk_rank = linear_kernel(vec_rep, vectorized_paragraphs).flatten()
    vector_representations.append((question, df["paragraph"][lk_rank.argsort()[-1]]))    

## Bonus: Train a Doc2vec model with the paragraphs using the Doc2vec model provided by the gensim library. 

Similar to the TF-IDF model, Doc2vec provides a representation for the paragraphs.

In [168]:
import gensim

def read_corpus(text, tokens_only=False):
    for i, line in enumerate(text):
        tokens = gensim.utils.simple_preprocess(line)
        if tokens_only:
            yield tokens
        else:
            # For training data, add tags
            yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

corpus = list(read_corpus(df["paragraph"].values))
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

## Bonus: Given the representation of the paragraphs, use the most_similar method in the gensim library, which uses cosine similarity to get the paragraphs that best match the questions.

In [169]:
doc2vec_similarities = []
for question in questions:
    q1 = list(read_corpus(question, tokens_only=True))
    inferred_vector = model.infer_vector(q1[0])
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    doc2vec_similarities.append((question, df["paragraph"][sims[0][0]]))

## Bonus: Evaluate the two different methods for matching questions to paragraphs and pick the better performing one to use in the next milestone.

In [170]:
for ic,(question, context)  in enumerate(vector_representations):
    print(question[0])
    print(f"tfidf: {context}\n\ndoc2vec: {doc2vec_similarities[ic][1]}")
    print("-"*20)

What fuel is used for manufacturing of chlorine?
tfidf: Rationale The manufacturing process of carbon black accounts for approximately 34 of the GHG emissions from the chemical sector while the manufacturing of soda ash accounts for 15 of the emissions 212 The manufacturing process of chlorine is extremely energyintensive with chloralkali process accounting for 17 of total electrical consumption of the European chemical and petrochemical industry213 Reducing the manufacturing emissions for carbon black and soda ash and improving energy efficiency in the manufacturing of chlorine can positively contribute to the mitigation objective Moreover it is recognised that soda ash used in double glazing can enhance building efficiency gains The absolute performance approach has been proposed in order to identify the maximum acceptable carbon intensities of the manufacturing processes of carbon black and soda ash that the activities should comply with in order to be able to substantially contribu

# M3 Set Up Transformers for Question-Answering

## Objective

Get familiar with using the Hugging Face library for applied purposes.
The main goal is to extract the answer given a question-paragraph tuple.

## Notes

- Either of the cases will require pointing to an existing pretrained or, in our case, fine-tuned model. You can find a library of pretrained and fine-tuned models at Hugging Face Models. Notice that some of the models are quite large and perhaps will either not work or slow your computer down. A smaller model that could be a good starting point is distilbert-base-uncased-distilled-squad.

- There are other libraries you could use to develop a Question-Answering model. However, for this project we want to focus on the Hugging Face transformers since they are already pretrained and fine tuned. They also provide a very simple interface to set up and use the model.

- There are two different methods to use the transformers library. There are pros and cons for both, but for the purposes of this project it does not make any difference which one you choose.

- We will only use exact matches as an evaluation metric, as mentioned above. This is basically a count of the number of data points that the model predicts correctly over the total number of data points. A regular string match should be sufficient for the purpose of this project. Again, you do not need to use the entire dataset since it can take a lot of computational power, but rather sample some data from it (perhaps 1000 data points or so).
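
A plain string comparison penalizes harmless differences such as casing or a trailing period. As an optional refinement, the sketch below normalizes both sides before comparing. The function names and the choice of normalization steps (lowercasing, dropping punctuation and English articles) are assumptions for illustration, not part of the project template.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """Return True if the prediction equals any gold answer after normalization.

    An empty gold list is treated as an unanswerable question,
    for which only an empty prediction counts as correct.
    """
    if not gold_answers:
        return normalize_answer(prediction) == ""
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(gold) for gold in gold_answers)

# Hypothetical predictions and gold answers
print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))
print(exact_match("London", ["Paris", "paris, France"]))
```

The exact-match score over a sample is then the mean of these booleans, exactly as in the count described above.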

## Resources
- Real-World Natural Language Processing by Masato Hagiwara Chapter 9, section 3, “Case study 1: Sentiment analysis with BERT,” provides an example of using the Hugging Face transformers library.


- Taming Text by Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris Chapter 8, “Building an example question answering system,” is helpful if you want to understand the theory behind building a question answering system.


- Natural Language Processing in Action by Hobson Lane, Cole Howard, and Hannes Hapke Chapter 10, “Sequence-to-sequence models and attention,” contains theoretical background on how transformers work.


## Additional Resources

- HuggingFace Transformers library documentation will be used for the question answering.

- Question answering tutorial https://huggingface.co/transformers/usage.html#extractive-question-answering – Note that we will use PyTorch for this project.

- Information about the Stanford question-answer dataset https://rajpurkar.github.io/SQuAD-explorer/, as well as a list of how different algorithms perform on the dataset.

## Help

- When setting up pipeline, you can point to a given model and tokenizer by including those as parameters, for example:

qamodel = pipeline("question-answering", model=MODEL, tokenizer=MODEL, device=-1)

- The structure of SQuAD is a set of paragraphs—each paragraph has a set of questions and a set of answers. It might be good to extract tuples of these to make the evaluation easier. For the evaluation, you can just see how often there is an exact match.

In [172]:
import random
import json

with open("data/dev-v2.0.json") as f:
    data = json.load(f)

def get_question_answers_context(data):
    # Collect (question, answers, context) tuples from the SQuAD JSON structure
    tuples = []
    for article in data['data']:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                answers = [answer['text'] for answer in qa['answers']]
                tuples.append((qa['question'], answers, paragraph['context']))
    return tuples

# Sampling with random.sample() happens further down, once TEST_SAMPLE_SIZE is defined.

## Import the pipeline class.

This will be more straightforward and you will get less exposure to the components of the transformer than in the previous milestone.


In [173]:
from transformers import pipeline

## Set up the model and point to an existing pretrained and fine-tuned model. 

(See the notes for more detail.)

In [177]:
MODEL = "distilbert-base-uncased-distilled-squad"
TEST_SAMPLE_SIZE = 1000

In [180]:
import random
import json

with open("data/dev-v2.0.json") as f:
    data = json.load(f)

def get_question_answers_context(data):
    # this function should provide tuples of question, answer and context from the data
    tuples = []
    for data_index in data['data']:
        for para_index in data_index['paragraphs']:
            for q_index in para_index['qas']:
                answers = [answer["text"] for answer in q_index["answers"]]
                tuples.append((q_index['question'], answers, para_index['context']))
    return tuples

qac = random.sample(get_question_answers_context(data), TEST_SAMPLE_SIZE)

## Use the SQuAD dev dataset to test how the different models are performing. 

The metric you will need to set up here is an exact match metric, which means you just need to see whether the predicted text is exactly the same as the answer provided in the dataset. You do not need to use the entire dataset, but make sure to evaluate a sample to ensure that the model performs well.

In [181]:
def get_em_scores(qac, qa_model):
    score = []
    for question, answers, context in qac:
        answer = qa_model(question=question, context=context)
        if not answer and not answers:
            score.append(True)
        else:
            score.append(any([answer.lower()==ans.lower() for ans in answers]))
    return score

In [182]:
from transformers import pipeline

qamodel = pipeline("question-answering", model=MODEL, tokenizer=MODEL, device=-1)

def get_answer_pipeline(question, context):
    answer = qamodel(question=question, context=context)
    if answer["score"] < 0.6:
        return ""
    else:
        return answer["answer"].rstrip(".").rstrip(",").lstrip("(").rstrip(")").rstrip(".").strip("'").strip(":")

In [183]:
scores = get_em_scores(qac, get_answer_pipeline)
print(sum(scores)/len(scores))

0.534


## Explore a few different models and evaluate which performs the best. 

For this project, we will only use exact matches as an evaluation metric.

In [198]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
import numpy as np

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL)


def get_answer(question, context):
    inputs = tokenizer.encode_plus(question, 
                                   context, 
                                   add_special_tokens=True, 
                                   return_tensors="pt", 
                                   max_length=tokenizer.max_len_sentences_pair, truncation=True)
    input_ids = inputs["input_ids"].tolist()[0]

    with torch.no_grad():
        # Index the output so this works whether the model returns a tuple or a ModelOutput
        outputs = model(**inputs)
        answer_start_scores, answer_end_scores = outputs[0], outputs[1]
        answer_start_scores, answer_end_scores = answer_start_scores.cpu().numpy(), answer_end_scores.cpu().numpy()
        
    answer_start = np.argmax(
        answer_start_scores
    )  # Get the most likely beginning of answer with the argmax of the score
    answer_end = np.argmax(
        answer_end_scores
    ) + 1  # Get the most likely end of answer with the argmax of the score
    
    # Normalize logits and spans to retrieve the answer
    start_ = np.exp(answer_start_scores - np.log(np.sum(np.exp(answer_start_scores), axis=-1, keepdims=True)))
    end_ = np.exp(answer_end_scores - np.log(np.sum(np.exp(answer_end_scores), axis=-1, keepdims=True)))
    score = np.mean([start_[0][answer_start], end_[0][answer_end-1]])
    
    if score > 0.9:
        answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
        return answer
    else:
        return ""

## Bonus: Use the question and paragraph pairs from the previous milestone as input to allow the model to predict the location of the answer.

In [200]:
# scores = get_em_scores(qac, get_answer)
# print(sum(scores)/len(scores))

## Bonus: If you want to get a better understanding of setting up the model, you can import the AutoTokenizer and AutoModelForQuestionAnswering classes from the transformers library.

- Here you will get more insight into the structure of the pipeline—the two classes are the two main parts of the transformers architecture.


- One is the tokenization model which tokenizes both the question and the context (paragraph), and the second is the Question-Answering model which predicts where in the sequenced tokens of the context the answer starts and ends, given the question tokens as inputs.
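
To make the start/end prediction concrete, the sketch below picks an answer span from two arrays of logits. Taking the argmax of each array independently can yield an end position before the start, so this version only considers spans with start <= end. The logits are invented, and `best_span` and `max_answer_len` are hypothetical names, not part of the transformers API.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) for the most probable span with start <= end.

    The end index is inclusive, and the score is the product of the
    start and end probabilities.
    """
    start_probs = softmax(np.asarray(start_logits, dtype=float))
    end_probs = softmax(np.asarray(end_logits, dtype=float))
    best = (0, 0, 0.0)
    for start, p_start in enumerate(start_probs):
        last = min(start + max_answer_len, len(end_probs))
        for end in range(start, last):
            score = p_start * end_probs[end]
            if score > best[2]:
                best = (start, end, float(score))
    return best

# Made-up logits for a six-token sequence: the highest end logit (index 1)
# lies before the highest start logit (index 2)
start_logits = [0.1, 0.2, 4.0, 0.3, 0.1, 0.0]
end_logits = [0.0, 3.5, 0.2, 0.1, 3.0, 0.1]
start, end, score = best_span(start_logits, end_logits)
print(start, end, round(score, 3))
```

The token IDs between `start` and `end` (inclusive) would then be converted back to text with the tokenizer.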

# M4 Build an End-to-End Question-Answering Model

## Objective

Integrate the previous milestones into an end-to-end application for question answering on the EU taxonomy for sustainable finance corpus

## Notes

- The key here is to be able to clean up the code and test out the ensemble of the models to get the answers from the entire corpus.


- The goal of this milestone is to wrap up all your code and plug the functions into a prebuilt template in order to visualize the application.

Here is the list of questions and potential answers for this project:

- What fuel is used for manufacturing of chlorine?
    A: Electricity


- What metric is used for evaluating emission?
    A: gCO2e


- How can carbon emission of the processes of cement clinker be reduced?
    A: The use of biomass and waste materials as fuels in cement kilns


- How is the Weighted Cogeneration Threshold calculated?
    A: The relative production of heat and power


- What is carbon capture and sequestration?
    A: A key technology for the decarbonisation of Europe


- What stages does CCS consist of?
    A: Capture, transport and storage


- What should be the average energy consumption of a water supply system?
    A: 0.5 kWh per cubic meter


- What are sludge treatments?
    A: Methane Anaerobic Digestion and in some cases aerobic digestion


- What is the process of anaerobic digestion?
    A: Microorganisms decompose the organic matter of the sludge in the absence of oxygen


- How is reforestation defined?
    A: Re-establishment of forests


- What is the threshold of emission for inland passenger water transport?
    A: 50gCO2e/pkm


- What are the requirements of reporting for electricity generation from natural gas where there might be fugitive emissions?
    A: full life cycle assessment of fugitive emissions



- Note that the questions might not have exact answers and these are just a few examples. Thus we cannot do a fair quantitative evaluation. However, you can get a feel for the model and do a somewhat qualitative evaluation. This project provides a good understanding of how you can implement the first version of a Question-Answering model without any domain-specific labeled data, which is a quite common problem faced in the industry.

## Additional resources

- In this template app  https://github.com/MatteusT/QAtemplate you can plug in functions and use the GUI for Question-Answering model testing. It contains instructions for use.


- General information about the Django platform https://docs.djangoproject.com/en/3.0/ used in the application

## Help

You will need to have just two main functions ready. You will integrate them into the code from QAtemplate. The two functions are:

- A function to find the paragraph most relevant to the question

    def get_context(question):
        ...
        return paragraph


- A function that returns the answer from input of a paragraph and the given question

    def get_answer_pipeline(question, context):
        ...
        return answer


You will also need a function to process the document and return the paragraphs.
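
One way such a document-processing function could look is sketched below. The helper names (`normalize_whitespace`, `split_paragraphs`, `get_paragraphs`) are assumptions, and unlike the Milestone 1 cleaning step this variant keeps punctuation so that values such as "0.5" survive; adjust it if your downstream steps expect punctuation-free text.

```python
import re

def normalize_whitespace(text):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def split_paragraphs(text, min_length=200):
    """Split raw text on blank lines and keep paragraphs longer than min_length."""
    parts = re.split(r"\s*\n\s*\n\s*", text)
    return [normalize_whitespace(part) for part in parts if len(part) > min_length]

def get_paragraphs(file_path, min_length=200):
    """Extract the text of a PDF with textract and return a list of paragraphs."""
    import textract  # imported here so the helpers above work without textract installed
    text = textract.process(file_path).decode()
    return split_paragraphs(text, min_length)

# Tiny made-up input with min_length lowered so the short pieces are kept
print(split_paragraphs("a\nb\n\nc d\n\n\ne", min_length=0))
```

Keeping the PDF extraction separate from the splitting makes the text-handling part easy to test without the source file.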

## Create a function from the findings in Milestone 1—a function that inputs a file path and returns a list of paragraphs.

In [174]:
vectorizer = TfidfVectorizer()
vector_corpus = vectorizer.fit_transform(df["paragraph"])


def get_context(question):
    q_v = vectorizer.transform(question)
    lk_rank = linear_kernel(q_v, vector_corpus).flatten()
    return df["paragraph"][lk_rank.argsort()[-1]]
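
If you want a ranked list of the top n paragraphs rather than only the single best match, a variant along the following lines may help. It is a sketch: `get_top_contexts`, its parameters and the demo corpus are hypothetical, and it takes the question as a plain string.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def get_top_contexts(question, paragraphs, vectorizer, vector_corpus, n=3):
    """Return the n (paragraph, score) pairs most similar to the question."""
    q_v = vectorizer.transform([question])
    scores = linear_kernel(q_v, vector_corpus).flatten()
    # argsort is ascending, so reverse it and keep the first n indices
    top = scores.argsort()[::-1][:n]
    return [(paragraphs[i], float(scores[i])) for i in top]

# Hypothetical stand-in corpus
demo_paragraphs = [
    "Reforestation is defined as the reestablishment of forest",
    "CCS consists of three stages capture transport and storage",
    "The manufacturing of chlorine is energy intensive",
]
demo_vectorizer = TfidfVectorizer()
demo_vectors = demo_vectorizer.fit_transform(demo_paragraphs)

results = get_top_contexts(
    "What stages does CCS consist of?", demo_paragraphs, demo_vectorizer, demo_vectors, n=2
)
for paragraph, score in results:
    print(round(score, 3), paragraph)
```

Passing several candidate paragraphs to the question-answering model gives it a chance to recover when the top-ranked paragraph does not contain the answer.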

## Create a function from the findings in Milestone 2—a function that inputs a list of paragraphs and a question and returns a ranked list with the top n paragraphs that are most similar to the question.


In [175]:
from transformers import pipeline


MODEL = "distilbert-base-uncased-distilled-squad"
qamodel = pipeline("question-answering", model=MODEL, tokenizer=MODEL, device=-1)

def get_answer_pipeline(question, context):
    answer = qamodel(question=question, context=context)
    return answer["answer"].rstrip(".").rstrip(",").lstrip("(").rstrip(")").rstrip(".").strip("'").strip(":")

## Create a function of the findings in Milestone 3—a function that inputs a string consisting of a question and a list of paragraphs and returns a list of strings containing the extractive answers from those paragraphs.


In [176]:
for question in questions:
    context = get_context(question)
    answer = get_answer_pipeline(question, context)
    print(f"{question[0]}\n\n{answer}\n\n{context}")
    print("-"*100)

What fuel is used for manufacturing of chlorine?

soda ash214

Rationale The manufacturing process of carbon black accounts for approximately 34 of the GHG emissions from the chemical sector while the manufacturing of soda ash accounts for 15 of the emissions 212 The manufacturing process of chlorine is extremely energyintensive with chloralkali process accounting for 17 of total electrical consumption of the European chemical and petrochemical industry213 Reducing the manufacturing emissions for carbon black and soda ash and improving energy efficiency in the manufacturing of chlorine can positively contribute to the mitigation objective Moreover it is recognised that soda ash used in double glazing can enhance building efficiency gains The absolute performance approach has been proposed in order to identify the maximum acceptable carbon intensities of the manufacturing processes of carbon black and soda ash that the activities should comply with in order to be able to substantially c

How is reforestation defined?

the reestablishment of forest

Reforestation Reforestation is defined as the reestablishment of forest through planting andor deliberate seeding on land classified as forest It implies no change of land use includes plantingseeding of temporarily unstocked forest areas as well as plantingseeding of areas with forest cover It includes coppice from trees that were originally planted or seeded69 The FAO FRA definition of reforestation excludes natural regeneration However the Taxonomy recognises the importance of natural regeneration to the increased carbon sink and stock potential provided by forests in general It is therefore included explicitly within this context in line with the FAO FRA definition of naturally regenerating forest70 In the context of the Taxonomy the category reforestation applies in cases following extreme events wind throws fires etc and not as part of normal legally binding obligation to reforest after harvesting
---------------------

## Integrate the functions into the repository https://github.com/MatteusT/QAtemplate and follow the instructions.

## Bonus: Get the django-app working with your functions.