# Environment Setup

This workflow depends on the Tesseract OCR engine and the Python packages listed below. If any of them are missing, install them from the command line using a Bash shell, Terminal, or the Anaconda Prompt.

- `tesseract`
- `pytesseract`
- `opencv`
- `pillow`
- `pdf2image`
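
Before installing anything, it can help to check what is already available. The following is a minimal sketch rather than part of the original workflow: `check_modules` is a hypothetical helper name, and the import names (`cv2` for `opencv-python`, `PIL` for `pillow`) are the ones used in the import cells of this notebook. Because `tesseract` is a program rather than a Python module, it is looked up on the system `PATH` instead.

```python
import importlib.util
import shutil


def check_modules(names):
    """Return {import_name: True/False} without importing the modules."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# import names used later in this notebook
# (opencv-python is imported as cv2, pillow as PIL)
status = check_modules(["pytesseract", "cv2", "PIL", "pdf2image"])

# tesseract is an executable, so search the PATH instead
status["tesseract"] = shutil.which("tesseract") is not None

for name, found in status.items():
    print(name, "found" if found else "MISSING")
```

Anything reported as `MISSING` can then be installed using the links in the sections below.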

## tesseract

"Tesseract is an open source text recognition (OCR) Engine, available under the Apache 2.0 license. It can be used directly, or (for programmers) using an API to extract printed text from images. It supports a wide variety of languages."

You will need to install Tesseract through the command line or by downloading the executable file.
- INSTALL: ["Introduction," Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/Installation.html)

NOTE: Windows users may need to go through some additional steps with security permissions and environment variables to be able to use `tesseract`.
- Bharath Sivakumar, "[Installing and Using Tesseract 4 on Windows 10](https://medium.com/quantrium-tech/installing-and-using-tesseract-4-on-windows-10-4f7930313f82)" *Quantrium* (8 July 2020)

## pytesseract

"Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and “read” the text embedded in images.

`python-tesseract` is a wrapper for [Google’s Tesseract-OCR Engine](https://github.com/tesseract-ocr/tesseract). It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file."

You can install `pytesseract` using `pip` or `conda` install methods.
- INSTALL: [`pytesseract` documentation, PyPI](https://pypi.org/project/pytesseract/)

## opencv

"OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in the commercial products."

You can install the Python wrapper for OpenCV (`opencv-python`) using `pip` or `conda` install methods.
- INSTALL: [`opencv-python` documentation, PyPI](https://pypi.org/project/opencv-python/)

## Pillow

"Pillow is the friendly PIL fork by Alex Clark and Contributors. PIL is the Python Imaging Library by Fredrik Lundh and Contributors.

The Python Imaging Library adds image processing capabilities to your Python interpreter.

This library provides extensive file format support, an efficient internal representation, and fairly powerful image processing capabilities."

You can install the Pillow PIL fork using the `pip` install method.
- INSTALL: ["Installation," Pillow documentation](https://pillow.readthedocs.io/en/latest/installation.html)

## pdf2image

"A python (3.6+) module that wraps pdftoppm and pdftocairo to convert PDF to a PIL Image object"

NOTE: Windows users will have to download and install `poppler` separately from `pdf2image`.

You can install `pdf2image` using the `pip` install method.

Installation documentation:
- [`pdf2image`, PyPI](https://pypi.org/project/pdf2image/)
- [`pdf2image`, GitHub](https://github.com/Belval/pdf2image)

### poppler troubleshooting for Windows users

[`python-poppler`, PyPI](https://pypi.org/project/python-poppler/)

Matthew Earl Miller, "[Poppler On Windows](https://towardsdatascience.com/poppler-on-windows-179af0e50150)" *Towards Data Science* (9 January 2020)

## Putting It All Together

To recap:
1. Install tesseract
2. Install Python packages
3. IF NEEDED: Survive Windows troubleshooting *adventures*...

# Load Modules

In [None]:
# general-purpose modules (not used in the OCR cells below)
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd

# PDF-to-image conversion
import pdf2image

In [None]:
# import modules
import pytesseract
from PIL import Image
import sys
from pdf2image import convert_from_path
import cv2 as opencv
import os
import io

In [None]:
from pdf2image import convert_from_path, convert_from_bytes

# Things Windows Users Might Have to Do...

In [None]:
# assign tesseract to path
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
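
The hard-coded path above only works if Tesseract was installed to that exact folder. The following is a minimal sketch of a more forgiving lookup, assuming Tesseract is either on the system `PATH` or in one of a few candidate locations; `find_tesseract` is a hypothetical helper name, not part of `pytesseract`.

```python
import os
import shutil


def find_tesseract(candidates=()):
    """Return the path to the tesseract executable, or None if not found."""
    # first choice: tesseract is already on the PATH
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # otherwise try each candidate location in turn
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


tesseract_path = find_tesseract([r"C:\Program Files\Tesseract-OCR\tesseract.exe"])
print(tesseract_path)

# if a path was found, it could then be assigned as in the cell above:
# pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

If the function prints `None`, Tesseract is probably not installed or was installed to a location that is not in the candidate list.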

In [None]:
# assign poppler path
poppler_test = r'C:\Users\katie\Downloads\poppler-0.68.0_x86\poppler-0.68.0\bin'

# Load Single PDF and Convert to Images

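The following code converts a single PDF into a list of page images (one PIL `Image` object per page) to prepare for OCR. The `poppler_path` argument is only needed when `poppler` is not on the system path, as in the Windows setup above.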

In [None]:
images = convert_from_path('DIR_1922_1923.pdf', poppler_path = poppler_test)

images
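
`convert_from_path` returns every page as an image in one list, which can use a lot of memory for a long PDF. The following is a minimal sketch of one way to work in batches, assuming `pdf2image`'s `first_page` and `last_page` parameters (1-indexed, inclusive) and its `pdfinfo_from_path` helper. The name `page_ranges` and the batch size of 10 are illustrative choices, not part of the original workflow.

```python
def page_ranges(total_pages, batch_size):
    """Yield (first_page, last_page) pairs, 1-indexed and inclusive."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


# a 25-page PDF processed 10 pages at a time
print(list(page_ranges(25, 10)))

# hypothetical use with pdf2image (not run here):
# from pdf2image import convert_from_path, pdfinfo_from_path
# total = pdfinfo_from_path('DIR_1922_1923.pdf', poppler_path=poppler_test)["Pages"]
# for first, last in page_ranges(total, 10):
#     batch = convert_from_path('DIR_1922_1923.pdf', first_page=first,
#                               last_page=last, poppler_path=poppler_test)
```

Each batch can be passed to OCR and discarded before the next one is loaded.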

# Run OCR on Single Image

The following code shows a workflow for running OCR on a single page image with Tesseract, called through `pytesseract`.

In [None]:
test = images[2]

text = pytesseract.image_to_string(test)

print(text)
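
The cell above passes the page image straight to Tesseract. OpenCV (imported earlier as `opencv`) can be used to clean up an image first. The following is a minimal sketch, assuming grayscale conversion followed by Otsu thresholding is a suitable cleanup for these scans; whether it improves the OCR output depends on the source images. `preprocess_for_ocr` is a hypothetical helper name.

```python
def preprocess_for_ocr(pil_image):
    """Convert a PIL image to a black-and-white NumPy array for OCR."""
    # imported here so the function can be defined even if OpenCV is missing
    import cv2
    import numpy as np

    # PIL image -> NumPy array in RGB channel order
    rgb = np.array(pil_image.convert("RGB"))
    # collapse to a single grayscale channel
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    # Otsu's method picks the black/white threshold automatically
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary


# hypothetical use with the page selected above (not run here):
# cleaned = preprocess_for_ocr(images[2])
# text = pytesseract.image_to_string(cleaned)
# print(text)
```

`pytesseract.image_to_string` accepts NumPy arrays as well as PIL images, so the cleaned array can be passed in directly.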

# Run OCR on List of Images

In [None]:
for i in images:
    text = pytesseract.image_to_string(i)
    
    print(text)

# Looping Through Multiple Images

The following code shows a workflow for appending the OCR output from multiple images to a single `txt` file. This is useful when a multi-page PDF has been split into one image per page but later text analysis needs the full text in a single `txt` file.

In [None]:
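output_file = "sample_directory_output.txt"

# 'a' appends, so re-running this cell adds the text again;
# the with block closes the file even if OCR raises an error
with open(output_file, 'a', encoding='utf-8') as f:
    for i in images:
        text = pytesseract.image_to_string(i)
        f.write(text)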
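
Because the file is opened in append mode (`'a'`), running the cell twice writes the same pages twice. The following is a minimal sketch of an alternative, assuming the page texts are collected first and then written in one pass in overwrite mode (`'w'`). The name `write_pages` and the blank-line separator are illustrative choices.

```python
import os
import tempfile


def write_pages(page_texts, output_path, separator="\n\n"):
    """Write all page texts to one file, replacing any earlier contents."""
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(separator.join(page_texts))


# demonstration with placeholder strings instead of real OCR output
demo_path = os.path.join(tempfile.gettempdir(), "write_pages_demo.txt")
write_pages(["page one", "page two"], demo_path)
write_pages(["page one", "page two"], demo_path)  # second run overwrites, no duplicates

with open(demo_path, encoding="utf-8") as f:
    print(repr(f.read()))

# hypothetical use with the images list from above (not run here):
# write_pages([pytesseract.image_to_string(i) for i in images], output_file)
```

Append mode is still the better choice when output is added to the same file across several separate runs.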

# Looping Through Multiple PDFs

The following code shows a workflow for processing multiple PDFs. A for loop creates one subdirectory per PDF, each PDF is converted to image files saved in its subdirectory, OCR is run on the images, and the OCR output for all of a PDF's images is appended to a single `txt` file for that PDF.

As with the single-PDF workflow, this is useful when later text analysis needs each multi-page PDF as one `txt` file.

## For loops that create subdirectories and convert PDFs to images in each folder

In [None]:
# get list of PDF files for creating subdirectories

# import modules
import os
from os import walk

# set path
path = r"C:\Users\katie\jupyter-notebooks\archives\directories"

# create empty list
file_names = []

# for loop that gets file name and appends to list
for x in os.listdir(path):
    if x.endswith(".pdf"):
        file_names.append(os.path.splitext(x)[0])

# show list of file names
file_names
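
The same list can be built with `pathlib`. The following is a minimal sketch, not the original method: `pdf_stems` is a hypothetical helper name, and unlike the `endswith(".pdf")` check above it also accepts upper-case extensions such as `.PDF` and returns the names in sorted order. The demonstration uses a temporary folder with made-up file names.

```python
import tempfile
from pathlib import Path


def pdf_stems(folder):
    """Return the sorted names (without extension) of the PDFs in folder."""
    return sorted(
        p.stem for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# demonstration in a throwaway folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["DIR_1923_1924.pdf", "DIR_1922_1923.PDF", "notes.txt"]:
        (Path(tmp) / name).touch()
    demo_names = pdf_stems(tmp)

print(demo_names)
```

Calling `pdf_stems(path)` with the `path` variable from the cell above would give a list that can be used in place of `file_names`.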

In [None]:
# create empty list for subdirectory paths
subdirectory_list = []

# parent directory that holds the PDFs
parent_dir = r"C:\Users\katie\jupyter-notebooks\archives\directories"

# create one subdirectory per PDF, named after the file
for file in file_names:
    path = os.path.join(parent_dir, file)
    subdirectory_list.append(path)
    # exist_ok=True skips folders that already exist instead of raising an error
    os.makedirs(path, exist_ok=True)

# show list of subdirectories
print(subdirectory_list)

In [None]:
# for loop that converts each PDF to image files and saves them to its subdirectory
for folder in subdirectory_list:
    pdf = folder + ".pdf"
    # second argument is the resolution in DPI
    pages = convert_from_path(pdf, 500, poppler_path = poppler_test)
    for i, image in enumerate(pages):
        # os.path.join avoids hand-written backslashes in the file path
        fpath = os.path.join(folder, "image" + str(i) + ".png")
        image.save(fpath, "PNG")

## Iteration: Run OCR on contents of each subdirectory and append to txt file in parent directory

We can start this workflow by creating a dictionary from the `file_names` and `subdirectory_list` lists.

The key-value pairs in this dictionary connect each PDF's file name without its extension (e.g. `Football-1901s`) with the path of the matching subdirectory (e.g. `C:\Users\katie\jupyter-notebooks\archives\scholastic_football_review\Football-1901s`).

This lets us create `.txt` files with the same file name as the PDF and also access the images in each subdirectory.

In [None]:
# create dictionary from file names and directories

directory_dict = dict(zip(file_names, subdirectory_list))

directory_dict

In [None]:
# parent directory that holds the PDFs
path = r"C:\Users\katie\jupyter-notebooks\archives\directories"

# convert each PDF and append the OCR text for all of its pages to one txt file
for name in file_names:
    pdf = os.path.join(path, name + ".pdf")
    images = convert_from_path(pdf, poppler_path = poppler_test)
    with open(name + ".txt", 'a', encoding='utf-8') as f:
        for i in images:
            f.write(pytesseract.image_to_string(i))

In [None]:
# using pytesseract image method
# every entry is processed the same way: OCR each image in the PDF's
# subdirectory and append the text to a txt file named after the PDF
for key, value in directory_dict.items():
    football = key + ".txt"
    with open(football, 'a', encoding='utf-8') as f:
        for file in os.listdir(value):
            image_path = os.path.normpath(os.path.join(value, file))
            f.write(pytesseract.image_to_string(Image.open(image_path)))
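
The image files are named `image0.png`, `image1.png`, and so on, and `os.listdir` returns names in arbitrary order. Plain alphabetical sorting does not help, because it places `image10.png` before `image2.png`. The following is a minimal sketch of a sort key that keeps the pages in numeric order; `natural_key` is a hypothetical helper name.

```python
import re


def natural_key(filename):
    """Split a name into text and integer parts so numbers sort by value."""
    parts = re.split(r"(\d+)", filename)
    # odd positions hold the digit runs captured by the pattern
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


names = ["image10.png", "image2.png", "image0.png", "image1.png"]
print(sorted(names))                   # alphabetical order
print(sorted(names, key=natural_key))  # page order
```

In the OCR loop, this would mean iterating over `sorted(os.listdir(value), key=natural_key)` so that the text is written to the `txt` file in page order.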

# Test Cleaning TXT File in OpenRefine

Separate documentation here