# Extracting Paragraphs from the EU Taxonomy Document


In [1]:
import re

import textract
import pandas as pd

## Objective

Process the EU sustainable finance taxonomy PDF, then extract and clean every paragraph in the document.

## Download the EU sustainable finance taxonomy PDF from Taxonomy Report: Technical Annex.

## Load the EU sustainable finance taxonomy PDF file using the textract library and decode it. 

Look through the text to confirm that all of it was extracted and that the decoding did not produce any bad characters.

In [2]:
text = textract.process('EUtaxonomy.pdf')

In [3]:
text = text.decode()

In [3]:
# Alternative: extract with the pdfminer backend and decode in one step.
# text = textract.process('EUtaxonomy.pdf', method='pdfminer').decode()
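Reading 1.3 million characters by eye is impractical, so the check for bad characters can be partly automated. The sketch below is one possible approach, not part of the original notebook: the helper name `suspicious_characters` is hypothetical, and treating the Unicode replacement character (U+FFFD) and stray control characters as "bad" is an assumption. Newlines, tabs, carriage returns, and form feeds are left out because they are expected layout characters in extracted PDF text.

```python
import unicodedata
from collections import Counter


def suspicious_characters(raw):
    """Count characters that often signal decoding or extraction problems."""
    counts = Counter()
    for ch in raw:
        if ch == '\ufffd':
            # U+FFFD is the Unicode replacement character.
            counts[ch] += 1
        elif unicodedata.category(ch) == 'Cc' and ch not in '\n\t\r\x0c':
            # Control characters other than the usual layout ones.
            counts[ch] += 1
    return counts


# Small made-up sample; in the notebook this would be called on `text`.
sample = 'Good text\n\x0cNext page \ufffd broken \x00 byte'
print(suspicious_characters(sample))
```

An empty result does not prove the extraction is complete; it only rules out these particular symptoms.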

## Use regular expressions to split the paragraphs and clean the text. 

The loaded text is raw and needs to be segmented into paragraphs. These paragraphs must then be cleaned by removing newline characters and other characters that add no semantic value (such as tabs or bullet points).

In [4]:
len(text)

1320996

In [108]:
text[0:10000]

'Updated methodology & Updated Technical Screening Criteria\n- 1-\n\nMarch 2020\n\n\x0cAbout this report\nThis document includes an updated Part B: Methodology from the June 2019 report and an updated Part\nF: Full list of technical screening criteria. The other original sections from the June 2019 report can be\nfound as labelled in the June 2019 report.\nPART A\n\nExplanation of the Taxonomy approach. This section sets out the role and importance of\nsustainable finance in Europe from a policy and investment perspective, the rationale for\nthe development of an EU Taxonomy, the daft regulation and the mandate of the TEG.\n\nPART B\n\nMethodology. This explains the methodologies for developing technical screening\ncriteria for climate change mitigation objectives, adaptation objectives and ‘do no\nsignificant harm’ to other environmental objectives in the legislative proposal.\nThis has been updated since 2019.\n\nPART C\n\nTaxonomy user and use case analysis. This section provides pr

In [116]:
paragraphs = re.split(r'\D\n+\D+\.\n+', text)

In [117]:
len(paragraphs)

1564
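One thing to keep in mind is that `re.split` discards whatever the pattern matches. Because the pattern above includes `\D+\.`, a match can swallow a whole digit-free paragraph rather than only the gap between paragraphs. The sketch below illustrates this on a made-up sample and compares it with a simpler alternative that splits on blank lines only. The helper `split_on_blank_lines` is hypothetical, and whether blank lines are a reliable paragraph boundary in this PDF is an assumption that would need checking against the real text.

```python
import re


def split_on_blank_lines(raw):
    """Split raw text into paragraphs at blank lines, keeping all the text."""
    # A newline, optional spaces or tabs, then one or more newlines.
    parts = re.split(r'\n[ \t]*\n+', raw)
    # Drop pieces that contain only whitespace.
    return [p for p in parts if p.strip()]


sample = (
    'First paragraph, page 1.\n\n'
    'Second paragraph\nwithout digits.\n\n'
    'Third paragraph, page 3.'
)

# Pattern from the notebook: the middle paragraph is consumed by the match.
print(re.split(r'\D\n+\D+\.\n+', sample))

# Blank-line split: all three paragraphs survive.
print(split_on_blank_lines(sample))
```

Comparing `len(paragraphs)` and a few random entries under both approaches is a quick way to see which one suits the document better.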

In [103]:
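def clean_paragraph(strg, ch='\n'):
    """Replace each run of `ch` in `strg` with one space and trim the ends."""
    newStr = ''
    previous = 0  # 1 while we are inside a run of `ch`
    for i in strg:
        if i != ch:
            newStr += i
            previous = 0
        else:
            # Emit one space for the first `ch` of a run; skip the rest.
            if not previous:
                newStr += ' '
            previous = 1
    return newStr.strip()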
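`clean_paragraph` handles one character at a time, so tabs, form feeds, and bullet points would each need a separate pass. A regular-expression variant can cover them in one go. The following is a minimal sketch rather than a drop-in replacement: the name `clean_paragraph_re` is hypothetical, and the set of bullet glyphs is an assumption that should be adjusted to whatever actually appears in the extracted text.

```python
import re


def clean_paragraph_re(paragraph):
    """Collapse whitespace runs and drop leading bullet glyphs (a sketch)."""
    # Remove bullet glyphs at the start of any line; the glyph set is assumed.
    cleaned = re.sub(r'(?m)^[ \t]*[•▪◦·][ \t]*', '', paragraph)
    # \s covers newlines, tabs, and form feeds; each run becomes one space.
    return re.sub(r'\s+', ' ', cleaned).strip()


# Made-up sample containing newlines, a form feed, a tab, and a bullet.
sample = 'PART F\n\n\x0cFull list of\ttechnical\n• screening criteria.'
print(clean_paragraph_re(sample))
```

Unlike `clean_paragraph`, this version also turns the `\x0c` page-break character into ordinary spacing, which may or may not be wanted if page boundaries matter later.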

In [104]:
paragraphs[2]

'PART F\n\nFull list of technical screening criteria. This annex sets out the sector- and\neconomic activity-specific technical screening criteria and rationale for the TEG’s\nanalysis. These have been updated since 2019.'

In [118]:
paragraphs[3]

'- 2-\n\n\x0cContents\nMethodology statements ....................................................................10\n1.\n\nSubstantial contribution to Climate change mitigation .......................................................... 10\n1.1\n\nWork process – conceptual approach ............................................................................ 10\n\n1.2\n\nDefining substantial contribution to climate change mitigation ....................................... 14\n\n1.3\n\nEligibility of finance for activities contributing substantially to mitigation ........................ 16\n\n1.4\n\nFurther development ....................................................................................................... 16\n\n2.\n\nSubstantial contribution to Climate change adaptation ......................................................... 18\n2.1\n\nWork process – conceptual approach ............................................................................ 18\n\n2.2\n\nDefining s

In [106]:
clean_paragraph(paragraphs[3])

'- 2- \x0cContents Methodology statements ....................................................................10 1. Substantial contribution to Climate change mitigation .......................................................... 10 1.1 Work process – conceptual approach ............................................................................ 10 1.2 Defining substantial contribution to climate change mitigation ....................................... 14 1.3 Eligibility of finance for activities contributing substantially to mitigation ........................ 16 1.4 Further development ....................................................................................................... 16 2. Substantial contribution to Climate change adaptation ......................................................... 18 2.1 Work process – conceptual approach ............................................................................ 18 2.2 Defining substantial contribution to climate change adap

## Store the paragraphs in a DataFrame with the column “paragraph” using the pandas library and save the DataFrame.

In [86]:
df = pd.DataFrame({'paragraph': paragraphs})

In [87]:
df.head()

Unnamed: 0,paragraph
0,Updated methodology & Updated Technical Screen...
1,This has been updated since 2019.
2,\nPART F\n\nFull list of technical screening c...
3,\n- 2-\n\nContents\nMethodology statements .....
4,\nFigure 1 Work process for technical screenin...


In [119]:
df['paragraph'] = df['paragraph'].apply(clean_paragraph)
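The heading for this step also asks for the DataFrame to be saved, which the cells above do not show. One way to do it is sketched below with a tiny stand-in DataFrame. The file name `paragraphs.csv`, the temporary directory, and the choice of CSV are assumptions. Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra `Unnamed: 0` column.

```python
import os
import tempfile

import pandas as pd

# Stand-in data; in the notebook this would be the cleaned `df`.
paragraphs_demo = ['First cleaned paragraph.', 'Second cleaned paragraph.']
df_demo = pd.DataFrame({'paragraph': paragraphs_demo})

# Hypothetical output location inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'paragraphs.csv')

# Write without the index so only the "paragraph" column is stored.
df_demo.to_csv(out_path, index=False)

# Read it back to confirm the round trip.
restored = pd.read_csv(out_path)
print(restored.columns.tolist())
```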