It turns out that the 2021 speeches, or at least the China one, come as PDFs of such poor quality that ordinary PDF-to-text converters can't handle them.

So we're lovingly transforming everything to images, and then extracting the text from these images.

## Transforming the bad-quality PDF speech into images

Installation steps for `pdf2image`:
- install `pdf2image` via Anaconda
- install the underlying `poppler` package via Anaconda as well; that way you don't need to specify a path to `poppler` in your code
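`pdf2image` only wraps poppler's command-line tools, so a quick check that they are reachable can save some head-scratching. A minimal sketch, assuming the tools are called `pdftoppm` and `pdfinfo` as in a standard poppler install; `poppler_on_path` is a made-up helper name, not part of `pdf2image`:

```python
import shutil


def poppler_on_path() -> bool:
    """Return True if poppler's pdftoppm and pdfinfo executables are found on PATH."""
    # assumption: a standard poppler install provides both of these tools
    return all(shutil.which(tool) is not None for tool in ("pdftoppm", "pdfinfo"))


print("poppler found on PATH:", poppler_on_path())
```

If this prints `False`, the conversion step will most likely fail until poppler is installed or its location is passed to `pdf2image` explicitly.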

In [14]:
from pdf2image import convert_from_path

# render every page of the PDF as an image at 500 dpi
pages = convert_from_path('Statement_China_2021.pdf', dpi=500)

img_name_prefix = 'Statement_China_2021'
counter = 1

paths_to_images = []

# saving each page of the faux-pdf speech as a separate image
for page in pages:
    current_path = f'{img_name_prefix}_{counter}.png'
    # save image
    page.save(current_path, 'PNG')
    # store path reference in list for later
    paths_to_images.append(current_path)
    counter += 1

In [15]:
paths_to_images

['Statement_China_2021_1.png',
 'Statement_China_2021_2.png',
 'Statement_China_2021_3.png',
 'Statement_China_2021_4.png',
 'Statement_China_2021_5.png',
 'Statement_China_2021_6.png']
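If more speeches need the same treatment, the conversion could be wrapped in a function. This is only a sketch: `image_path_for` and `pdf_to_pngs` are hypothetical names, and the images are still written to the current working directory.

```python
from pathlib import Path


def image_path_for(pdf_path, page_number):
    """Build the PNG file name for one page, e.g. 'Statement_China_2021_1.png'."""
    return f'{Path(pdf_path).stem}_{page_number}.png'


def pdf_to_pngs(pdf_path, dpi=500):
    """Save every page of the PDF as a PNG and return the list of image paths."""
    # imported here so that image_path_for works even without pdf2image installed
    from pdf2image import convert_from_path

    paths = []
    for page_number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        current_path = image_path_for(pdf_path, page_number)
        page.save(current_path, 'PNG')
        paths.append(current_path)
    return paths
```

Calling `pdf_to_pngs('Statement_China_2021.pdf')` should then produce the same six file names as listed above.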

## Extracting the text from the images

Next, we use `tesseract` to extract the text.

Installation steps for this code to work (I have a Mac):
- install `tesseract` via Homebrew
- tell `pytesseract` where your local `tesseract` installation is by setting `pytesseract.pytesseract.tesseract_cmd = r'/usr/local/bin/tesseract'`
- optionally keep that path in a variable (`local_tesseract_path` in the code below) and assign the variable instead
- I hope I didn't miss anything
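The hard-coded path only works if `tesseract` really lives there. A small sketch for looking it up instead; `find_tesseract` and `CANDIDATE_PATHS` are made-up names, and the second candidate path assumes Homebrew's default prefix on Apple Silicon Macs:

```python
import shutil
from pathlib import Path

# assumption: the usual Homebrew locations (Intel prefix first, then Apple Silicon)
CANDIDATE_PATHS = ["/usr/local/bin/tesseract", "/opt/homebrew/bin/tesseract"]


def find_tesseract(candidates=CANDIDATE_PATHS):
    """Return the path to the tesseract binary, or None if it cannot be found."""
    # first ask the shell's PATH
    found = shutil.which("tesseract")
    if found is not None:
        return found
    # then fall back to the known candidate locations
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


print("tesseract binary:", find_tesseract())
```

The returned path (if any) can be assigned to `pytesseract.pytesseract.tesseract_cmd` in place of the literal string.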

In [16]:
from PIL import Image
import pytesseract

#---------------- TESSERACT SETUP ----------------
# storing the path to the local tesseract installation in a variable
local_tesseract_path = r'/usr/local/bin/tesseract'

# letting the python tesseract wrapper library know where your tesseract installation is locally,
# so that it knows which command to execute
# (the setting lives on the inner pytesseract.pytesseract module;
#  assigning pytesseract.tesseract_cmd on the package has no effect)
pytesseract.pytesseract.tesseract_cmd = local_tesseract_path

In [20]:
#-------------- EXTRACTING IMAGE TEXT -------------
full_speech_text = ''

for path in paths_to_images:
    # open the page image and run OCR on it
    current_image = Image.open(path)
    current_text = pytesseract.image_to_string(current_image)
    # append the page text, separated from the previous page by a newline
    full_speech_text += '\n' + current_text


In [21]:
full_speech_text

'\n(Translation)\n\nBolstering Confidence and Jointly Overcoming Difficulties\nTo Build a Better World\n\nStatement by H.E. Xi Jinping\nPresident of the People’s Republic of China\nAt the General Debate of the 76th Session of\nThe United Nations General Assembly\n\n21 September 2021\n\nMr. President,\n\nThe year 2021 is a truly remarkable one for the Chinese people. This\nyear marks the centenary of the Communist Party of China. It is also the\n50th anniversary of the restoration of the lawful seat of the. People’s\nRepublic-of China in the United Nations, a historic event which will be\nsolemnly commemorated by China. We will continue our active efforts to\ntake China’s cooperation with the United Nations to a new level and\nmake new and greater contributions to advancing-the noble cause of the\nUN.\n\nMr. President,\n\nA year ago, global leaders attended the high-level meetings marking\nthe 75th anniversary of the UN and issued a declaration pledging to fight\nCOVID-19 in solidarity,

## Cleaning extracted speech text

In [35]:
import re

def clean_text(speech, header_end_string):
    # drop the header: keep everything after the first occurrence of header_end_string
    # (uses the `speech` argument rather than the global `full_speech_text`)
    speech_no_header = speech.split(header_end_string, 1)[1]

    # remove page numbers left over at page breaks, e.g. '\n\n1' where the second page starts
    speech_no_page_break = re.sub(r'\n\n[0-9]', '\n', speech_no_header)

    # runs of two or more newlines (paragraph breaks) are kept as they are;
    # single newlines, which only wrap a line within a paragraph, become a space
    speech_no_line_break = re.sub(r'([^\S\n]*\n(?:[^\S\n]*\n)+[^\S\n]*)|[^\S\n]*\n[^\S\n]*',
                                  lambda x: x.group(1) or ' ', speech_no_page_break)

    return speech_no_line_break
    

In [36]:
cleaned_speech = clean_text(full_speech_text, '21 September 2021')
cleaned_speech

'\n\nMr. President,\n\nThe year 2021 is a truly remarkable one for the Chinese people. This year marks the centenary of the Communist Party of China. It is also the 50th anniversary of the restoration of the lawful seat of the. People’s Republic-of China in the United Nations, a historic event which will be solemnly commemorated by China. We will continue our active efforts to take China’s cooperation with the United Nations to a new level and make new and greater contributions to advancing-the noble cause of the UN.\n\nMr. President,\n\nA year ago, global leaders attended the high-level meetings marking the 75th anniversary of the UN and issued a declaration pledging to fight COVID-19 in solidarity, tackle challenges together, uphold multilateralism, strengthen the role of the UN, and work for the common future of present and coming generations.\n\n\nOne year on, our world is facing the combined impacts of changes unseen in a century and the COVID-19 pandemic. In all countries, people
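To see what the two substitutions do in isolation, here is a toy run on a made-up string (not real OCR output). `demo_clean` is a hypothetical helper that only repeats the two regex steps, without the header removal:

```python
import re


def demo_clean(text):
    """Apply the page-number and line-wrap substitutions to a string."""
    # a blank line followed by a digit is treated as a page number and dropped
    no_page_break = re.sub(r'\n\n[0-9]', '\n', text)
    # paragraph breaks (two or more newlines) are kept, single newlines become a space
    return re.sub(r'([^\S\n]*\n(?:[^\S\n]*\n)+[^\S\n]*)|[^\S\n]*\n[^\S\n]*',
                  lambda x: x.group(1) or ' ', no_page_break)


sample = 'A line that\nwraps.\n\n2\nNext page text.\n\nNew paragraph'
print(repr(demo_clean(sample)))
```

One caveat of the first pattern: it removes only a single digit, and it cannot tell a page number from a paragraph that happens to start with a number.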

## Writing the cleaned speech to .txt

In [37]:
# explicit encoding, since the text contains non-ASCII characters such as ’
with open("CHN_75_2021.txt", "w", encoding="utf-8") as text_file:
    text_file.write(cleaned_speech)