# M1 Extracting Paragraphs from the EU Taxonomy Document


In [169]:
import re

import textract
import pandas as pd

## Objective

Process the EU sustainable finance taxonomy PDF file, then extract and clean all the paragraphs in the document.

## Download the EU sustainable finance taxonomy PDF from Taxonomy Report: Technical Annex.

## Load the EU sustainable finance taxonomy PDF file using the textract library and decode it. 

Look through the text to ensure that you have got all the text and that the decoding did not produce any bad characters.

In [170]:
text = textract.process('EUtaxonomy.pdf')

In [171]:
text = text.decode()

In [138]:
# text = textract.process('EUtaxonomy.pdf', method='pdfminer').decode()
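
Reading through 1.3 million characters by eye is slow, so a quick count of suspicious characters can complement the manual check. The sketch below is only an illustration: `suspect_characters` and `sample` are hypothetical names, and the choice of which Unicode categories count as "bad" is an assumption.

```python
import unicodedata
from collections import Counter


def suspect_characters(text):
    """Count characters that often signal extraction or decoding problems."""
    counts = Counter()
    for char in text:
        category = unicodedata.category(char)
        # "Cc" = control characters (form feed, etc.), "Co" = private use,
        # "Cn" = unassigned; U+FFFD is the Unicode replacement character.
        is_control = category in {"Cc", "Co", "Cn"} and char not in "\n\t"
        if char == "\ufffd" or is_control:
            counts[char] += 1
    return counts


# Hypothetical sample standing in for the decoded PDF text.
sample = "March 2020\n\n\x0cAbout this report\ufffd"
for char, count in suspect_characters(sample).most_common():
    print(repr(char), unicodedata.name(char, "<unnamed>"), count)
```

Running `suspect_characters(text)` on the real document would show, for example, how many form feeds (`\x0c`, the page separators) survive the extraction.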

## Use regular expressions to split the paragraphs and clean the text. 

The loaded text will be in raw format and will need to be segmented into paragraphs. These paragraphs will also need to be cleaned by removing newline characters and other characters that do not bring any semantic value to the paragraph (such as tabs or bullet points).

In [172]:
len(text)

1320996

In [173]:
text[0:1000]

'Updated methodology & Updated Technical Screening Criteria\n- 1-\n\nMarch 2020\n\n\x0cAbout this report\nThis document includes an updated Part B: Methodology from the June 2019 report and an updated Part\nF: Full list of technical screening criteria. The other original sections from the June 2019 report can be\nfound as labelled in the June 2019 report.\nPART A\n\nExplanation of the Taxonomy approach. This section sets out the role and importance of\nsustainable finance in Europe from a policy and investment perspective, the rationale for\nthe development of an EU Taxonomy, the daft regulation and the mandate of the TEG.\n\nPART B\n\nMethodology. This explains the methodologies for developing technical screening\ncriteria for climate change mitigation objectives, adaptation objectives and ‘do no\nsignificant harm’ to other environmental objectives in the legislative proposal.\nThis has been updated since 2019.\n\nPART C\n\nTaxonomy user and use case analysis. This section provides pr

In [174]:
# Split on blank lines: two newlines, optionally with whitespace between them.
paragraphs = re.split(r"\s*?\n\s*?\n\s*?", text)

In [175]:
len(paragraphs)

8984

In [176]:
paragraphs[2]

'\x0cAbout this report\nThis document includes an updated Part B: Methodology from the June 2019 report and an updated Part\nF: Full list of technical screening criteria. The other original sections from the June 2019 report can be\nfound as labelled in the June 2019 report.\nPART A'

In [177]:
paragraphs[3]

'Explanation of the Taxonomy approach. This section sets out the role and importance of\nsustainable finance in Europe from a policy and investment perspective, the rationale for\nthe development of an EU Taxonomy, the daft regulation and the mandate of the TEG.'
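
To see how the split pattern behaves on page markers, form feeds, and runs of blank lines, it can help to try it on a tiny string. This is a sketch on a made-up `sample`; the greedy pattern is an assumed alternative for comparison, not the one used in this notebook.

```python
import re

# Hypothetical miniature of the extracted text: a page marker, a run of
# blank lines, a form feed, and trailing spaces before a paragraph break.
sample = (
    "Title line\n- 1-\n\nMarch 2020\n\n\n\n"
    "\x0cAbout this report\nThis document ...  \n\nPART A"
)

# The notebook's pattern: lazy quantifiers match as little as possible, so a
# run of four newlines is split twice and leaves an empty string behind.
lazy_parts = re.split(r"\s*?\n\s*?\n\s*?", sample)

# Assumed alternative: greedy quantifiers consume the whole whitespace run
# (including the form feed, which \s matches) as one separator.
greedy_parts = re.split(r"\s*\n\s*\n\s*", sample)

# Either way, a final filter drops fragments that contain no text.
non_empty = [part for part in lazy_parts if part.strip()]

print(len(lazy_parts), len(greedy_parts), len(non_empty))
```

The lazy pattern keeps the leading `\x0c` on the third paragraph, which matches what `paragraphs[2]` shows above.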

In [178]:
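def clean_paragraph(text):
    # Flatten line breaks, collapse double spaces once, and trim surrounding spaces.
    text = text.replace("\n", " ").replace("  ", " ").strip(" ")
    # Delete punctuation and symbols (anything that is neither a word character nor whitespace).
    return re.sub(r'[^\w\s]', '', text).strip(" ")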
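
A stricter variant is sketched below for comparison. `clean_paragraph_strict` is a hypothetical name and not the function applied in this notebook. It assumes that punctuation should become a space rather than disappear, and that every whitespace run (tabs, form feeds, repeated spaces) should collapse to a single space.

```python
import re


def clean_paragraph_strict(text):
    """Hypothetical stricter variant of clean_paragraph."""
    # Replace punctuation and symbols (bullets, ampersands, quotes) with a
    # space so that words separated only by punctuation are not glued together.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse every whitespace run (newlines, tabs, form feeds, repeated
    # spaces) into a single space, then trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(clean_paragraph_strict("\x0cAbout this report\nThis document includes an updated Part B: Methodology"))
```

The trade-off is that tokens such as decimal numbers or units written with a slash are split into two words instead of being fused into one; which behaviour suits the TF-IDF step better is a judgment call.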

## Store the paragraphs in a DataFrame with the column “paragraph” using the pandas library and save the DataFrame.

In [179]:
df = pd.DataFrame(data=paragraphs)
df.columns=['paragraph']

In [180]:
df.head()

| | paragraph |
|---|---|
| 0 | Updated methodology & Updated Technical Screen... |
| 1 | March 2020 |
| 2 | About this report\nThis document includes an ... |
| 3 | Explanation of the Taxonomy approach. This sec... |
| 4 | PART B |


In [181]:
df['paragraph'] = df['paragraph'].apply(clean_paragraph)

In [182]:
df.head()

| | paragraph |
|---|---|
| 0 | Updated methodology Updated Technical Screeni... |
| 1 | March 2020 |
| 2 | About this report This document includes an u... |
| 3 | Explanation of the Taxonomy approach This sect... |
| 4 | PART B |


In [150]:
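# index=False keeps the row index out of the file, so reloading does not add an "Unnamed: 0" column.
df.to_csv("paragraphs.csv", index=False)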
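
Two details of the CSV round trip are easy to miss: a saved index comes back as an extra column, and empty paragraphs come back as `NaN`. The sketch below illustrates both on a made-up three-row frame, using an in-memory buffer instead of a file; `df_demo`, `buffer`, and `reloaded` are hypothetical names.

```python
import io

import pandas as pd

# Hypothetical miniature of the cleaned DataFrame, including an empty paragraph.
df_demo = pd.DataFrame({"paragraph": ["March 2020", "", "PART B"]})

buffer = io.StringIO()
# index=False writes only the "paragraph" column.
df_demo.to_csv(buffer, index=False)
buffer.seek(0)

# keep_default_na=False reads empty fields back as "" instead of NaN.
reloaded = pd.read_csv(buffer, keep_default_na=False)
print(reloaded.columns.tolist())
print(reloaded["paragraph"].tolist())
```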

In [183]:
df.paragraph.values

array(['Updated methodology  Updated Technical Screening Criteria  1',
       'March 2020',
       '\x0cAbout this report This document includes an updated Part B Methodology from the June 2019 report and an updated Part F Full list of technical screening criteria The other original sections from the June 2019 report can be found as labelled in the June 2019 report PART A',
       ..., 'Wildfire', '5', '\x0c'], dtype=object)

# M2 Question Paragraph Matching

In [152]:
from sklearn.feature_extraction.text import TfidfVectorizer

## Objective

Build a text vectorizer that finds the best-matching paragraph for each of the provided questions, then qualitatively evaluate the results.

In [151]:
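# keep_default_na=False keeps empty paragraphs as "" instead of NaN, which TfidfVectorizer cannot process.
df = pd.read_csv("paragraphs.csv", keep_default_na=False)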

## Initiate a TF-IDF model trained on the paragraphs from the previous milestone by using the TfidfVectorizer class from the scikit-learn library. 

This model will provide a representation for each paragraph or each question.

In [154]:
vectorizer = TfidfVectorizer()

In [184]:
X = vectorizer.fit_transform(df['paragraph'].values)

In [185]:
vectorizer.get_feature_names_out()

array(['00', '00295', '0045', ..., 'zurich', 'zwickel', 'μgnm3'],
      dtype=object)

In [188]:
X.shape

(8984, 7424)

## Transform all the paragraphs into representations and calculate a distance in the representation space between each question and all the paragraphs. 

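The comparison can be done with the linear_kernel function from the scikit-learn library. Because TfidfVectorizer L2-normalizes its vectors by default, linear_kernel (a plain dot product) returns the cosine similarity, so a higher score means a closer match rather than a larger distance. Sort the scores in descending order and match each question to the paragraph with the highest score.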

In [190]:
questions = [
    ["What fuel is used for manufacturing of chlorine?"],
    ["What metric is used for evaluating emission?"],
    ["How can carbon emission of the processes of cement clinker be reduced?"],
    ["How is the Weighted Cogeneration Threshold calculated?"],
    ["What is carbon capture and sequestration?"],
    ["What stages does CCS consist of?"],
    ["What should be the average energy consumption of a water supply system?"],
    ["What are examples of sludge treatments?"],
    ["How is the process of anaerobic digestion?"],
    ["How is reforestation defined?"],
    ["What is the threshold of emission for inland passenger water transport?"],
    ["What are the requirements of reporting for electricity generation from natural gas where there might be fugitive emissions?"]
]

In [207]:
from sklearn.metrics.pairwise import linear_kernel

# Transform the questions with the vectorizer already fitted on the paragraphs,
# so that questions and paragraphs share one vocabulary and one IDF weighting.
# Fitting a new TfidfVectorizer per question would give vectors that cannot be
# compared with the rows of X.
question_texts = [question[0] for question in questions]
Q = vectorizer.transform(question_texts)

# Rows are L2-normalized, so linear_kernel returns cosine similarity (higher is better).
# similarities has shape (n_questions, n_paragraphs).
similarities = linear_kernel(Q, X)
best_match = similarities.argmax(axis=1)


In [208]:
# Paragraph-to-paragraph similarities, shape (8984, 8984); no question vectors are involved here.
linear_kernel(X)
# linear_kernel(vector_representations)

array([[1.        , 0.        , 0.32988537, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.32988537, 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
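
The matching step described in this milestone can be sketched end to end on a toy corpus. This is a minimal illustration, not the notebook's final implementation: `paragraphs_demo`, `demo_vectorizer`, `P`, and `top_matches` are hypothetical names, and returning the top `k` candidates instead of a single best match is an assumption that makes qualitative evaluation easier.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical paragraphs standing in for df['paragraph'].
paragraphs_demo = [
    "Carbon capture and sequestration consists of capture transport and storage",
    "Anaerobic digestion of sewage sludge produces biogas",
    "Reforestation is the re establishment of forest through planting",
]

# Fit once on the paragraphs; questions are only transformed, never fitted.
demo_vectorizer = TfidfVectorizer()
P = demo_vectorizer.fit_transform(paragraphs_demo)


def top_matches(question, k=2):
    """Return (paragraph index, cosine similarity) pairs, best match first."""
    q_vec = demo_vectorizer.transform([question])
    scores = linear_kernel(q_vec, P).ravel()
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]


for question in ["What is carbon capture and sequestration?", "How is reforestation defined?"]:
    best_index, best_score = top_matches(question)[0]
    print(question, "->", paragraphs_demo[best_index], round(best_score, 3))
```

TF-IDF matching is purely lexical: a question that uses only the abbreviation "CCS" would score zero against the first paragraph, because that token never appears in it.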

## Bonus: Train a Doc2vec model with the paragraphs using the Doc2vec model provided by the gensim library. 

Similar to the TF-IDF model, Doc2vec provides a representation for the paragraphs.

## Bonus: Given the representation of the paragraphs, use the most_similar method in the gensim library, which uses cosine similarity to get the paragraphs that best match the questions.

## Bonus: Evaluate the two different methods for matching questions to paragraphs and pick the better performing one to use in the next milestone.