## Exploring Kleister-charity dataset

### Imports

In [1]:
import numpy as np                                                        # linear algebra lib
import pandas as pd                                                       # dataframes lib
import re                                                                 # regex lib
import matplotlib.pyplot as plt                                           # plotting lib 
import nltk                                                               # Nat Lang Processing package
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer   # NLTK tokenization modules
from nltk.corpus import stopwords                                         # NLTK stop words  
from nltk.stem.wordnet import WordNetLemmatizer                           # NLTK lemmatization lib
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))                              # NLTK stopwords set

[nltk_data] Downloading package stopwords to /home/becode/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/becode/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/becode/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Read in test-A TSV and explore

I will work on a subset of the full Kleister-charity dataset, which contains more than 1700 documents.

More specifically, I will look at the 609 test-A documents. The test-A TSV file contains the filename of
each document and its OCR'ed text. The text has been OCR'ed in 4 different ways,
each stored in a separate column.

In [2]:
# read Kleister-charity test-A tsv file into dataframe
kl = pd.read_csv('/home/becode/AI/Data/Faktion/kleister-charity/test-A/in.tsv', sep='\t',
            names=['filename', 'keys', 'text_djvu', 'text_tesseract', 'text_textract', 'text_best'])
print(kl.head(10),"\n")
print(kl.shape)
kl.info()  # info() prints its own summary and returns None, so no print() is needed
print(kl.isna().sum())

                               filename  \
0  abbf98ed31e28068150dce58296302ee.pdf   
1  f3e363848aea2fa645814f2de0221a5a.pdf   
2  62acdd1bbd0dfeea27da2720eb795449.pdf   
3  e734bc7dfc9b37c5dd2c3a37693062e8.pdf   
4  cb6b0949a2f9294750e436f7ea2f10ce.pdf   
5  87c977ccb9bdf111b1397e9c4ada2470.pdf   
6  39df988309a04c631445b04ebd6a4a53.pdf   
7  e38bd1524e145b49edf991ab8f3e153d.pdf   
8  bb80583ada5875ccb4690ffa22f97bab.pdf   
9  7ae3665305caf119acabb0863ea1e46d.pdf   

                                                keys  \
0  address__post_town address__postcode address__...   
1  address__post_town address__postcode address__...   
2  address__post_town address__postcode address__...   
3  address__post_town address__postcode address__...   
4  address__post_town address__postcode address__...   
5  address__post_town address__postcode address__...   
6  address__post_town address__postcode address__...   
7  address__post_town address__postcode address__...   
8  address__post_town 

#### In total we have 609 filenames with OCR'ed text in 4 different OCR text columns

The 'text_djvu' column is missing the text of 17 documents, so we will probably not use it. We can already drop the 'keys' column, as it is not needed for our purpose.
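
The missing-value check behind this observation can be reproduced on a toy frame. This is a minimal sketch: the `demo` frame and its values are made up for illustration, and only the column names are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the notebook itself uses the 609-row `kl` dataframe.
demo = pd.DataFrame({
    "filename": ["a.pdf", "b.pdf", "c.pdf"],
    "text_djvu": ["some text", np.nan, np.nan],
    "text_tesseract": ["some text", "more text", "other text"],
})

# Count missing values per column, then list the columns without any gaps.
missing = demo.isna().sum()
complete_cols = missing[missing == 0].index.tolist()

print(missing)
print(complete_cols)
```

Columns that do not appear in `complete_cols` are the ones that would need imputation or, as here, can simply be left out.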

In [3]:
# drop the unneeded 'keys' column
kl = kl.drop(columns='keys')

### Compare different OCR columns; 'text_djvu','text_textract','text_best', 'text_tesseract'

In [None]:
print(f"{kl.loc[0,'text_djvu']}\n")
print(f"{kl.loc[0,'text_tesseract']}\n")
print(f"{kl.loc[0,'text_textract']}\n")
print(kl.loc[0,'text_best'])

In [None]:
print(f"{kl.loc[100,'text_djvu']}\n")
print(f"{kl.loc[100,'text_tesseract']}\n")
print(f"{kl.loc[100,'text_textract']}\n")
print(kl.loc[100,'text_best'])

In [None]:
print(f"{kl.loc[300,'text_djvu']}\n")
print(f"{kl.loc[300,'text_tesseract']}\n")
print(f"{kl.loc[300,'text_textract']}\n")
print(kl.loc[300,'text_best'])
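
Reading the printouts side by side can be complemented with a few crude numbers per OCR column. The sketch below is an assumption on my part, not something the original analysis does: the helper `ocr_stats`, the sample strings and the "share of alphabetic characters" heuristic are all hypothetical.

```python
import pandas as pd

# Hypothetical sample: the same document as read by two OCR engines.
demo = pd.DataFrame({
    "text_tesseract": ["Annual report 2016 Trustees"],
    "text_textract": ["Annua1 rep0rt 2016 Tru$tees"],
})

def ocr_stats(text: str) -> dict:
    """Return crude quality indicators for one OCR string."""
    n_chars = len(text)
    n_alpha = sum(ch.isalpha() for ch in text)
    return {
        "n_chars": n_chars,
        "n_words": len(text.split()),
        "alpha_ratio": n_alpha / n_chars if n_chars else 0.0,
    }

# One row of statistics per OCR column for the first document.
stats = pd.DataFrame({col: ocr_stats(demo.loc[0, col]) for col in demo.columns}).T
print(stats)
```

A lower `alpha_ratio` can hint at more misread characters, but it is only a rough signal and does not replace reading the text.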

### Select 'text_tesseract' OCR column and preprocess text

All 4 OCR columns appear to have their pros and cons.
I will make the arbitrary choice to keep 'text_tesseract' and drop the other OCR columns.
Looking at the OCR content of the documents, we can already clean it up a bit by removing line breaks.

In [4]:
# keep only 'text_tesseract' OCR text column; the OCR columns all have pros and cons
kl = kl.drop(columns=['text_djvu','text_textract','text_best'])

# some easy preprocessing after looking at the text content: remove real newlines ('\n') and literal '\\n' sequences
# note: astype(str) turns any missing value into the string 'nan'
kl['text_tesseract'] = kl['text_tesseract'].astype(str)
kl['text_tesseract'] = kl['text_tesseract'].str.replace("\n", " ", regex=False)
kl['text_tesseract'] = kl['text_tesseract'].str.replace("\\n", " ", regex=False)
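
The two replacements target different things: `"\n"` is a real newline character, while `"\\n"` is the two-character sequence backslash plus `n`. A small sketch on a made-up string (not taken from the dataset) shows why both passes are needed:

```python
# Hypothetical OCR string containing one real newline and one escaped "\n".
raw = "Hampton School\nAnnual Report\\nYear ended 31 August"

step1 = raw.replace("\n", " ")     # removes the real newline only
step2 = step1.replace("\\n", " ")  # removes the literal backslash + "n"

print(step1)
print(step2)
```

After the first pass the literal `\n` is still present in `step1`; only the second pass yields a single clean line.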

In [None]:
# check result of first preprocessing
print(f"{kl.loc[0,'text_tesseract']}\n")
print(f"{kl.loc[100,'text_tesseract']}\n")
print(kl.loc[300,'text_tesseract'])

### Tokenization with NLTK

To cluster the documents, we need to tokenize and process their text content and then vectorise the tokens using TF-IDF.

In [5]:
# Tokenize and, more specifically, lemmatize the text by sentence and by word
def tokenize_lem(text):
    lem = WordNetLemmatizer()
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # keep only tokens that start with a letter, are not stop words and are longer than 3 characters
    # (this drops numeric tokens and raw punctuation)
    for token in tokens:
        if re.match('[a-zA-Z]', token) and token not in stop_words and len(token) > 3:  # changed from re.search('[a-zA-Z]', token)
            lemmatized_word = lem.lemmatize(token)
            filtered_tokens.append(lemmatized_word)
    return filtered_tokens
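
The switch from `re.search` to `re.match` in the filter changes which tokens survive. The following sketch illustrates the difference on a few made-up tokens; the token list is hypothetical and the length and stop-word checks are left out to isolate the regex behaviour.

```python
import re

# Hypothetical tokens for illustration.
tokens = ["school", "2016", "1st", "£500", "year", "..."]

# re.match anchors at the start: the token must begin with a letter.
starts_with_letter = [t for t in tokens if re.match("[a-zA-Z]", t)]

# re.search scans the whole token: one letter anywhere is enough.
contains_letter = [t for t in tokens if re.search("[a-zA-Z]", t)]

print(starts_with_letter)  # ['school', 'year']
print(contains_letter)     # ['school', '1st', 'year']
```

With `re.match`, mixed tokens such as `1st` are dropped, which is stricter than only removing tokens without any letters.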

In [6]:
# apply function to 'text_tesseract' column
kl['text_tesseract'] = kl['text_tesseract'].apply(tokenize_lem)

In [11]:
# check first 100 tokenized words in 'text_tesseract' column
print(kl['text_tesseract'][0][0:100])

['praestat', 'opes', 'sapientia', 'hampton', 'school', 'charitable', 'company', 'limited', 'guarantee', 'report', 'financial', 'statement', 'year', 'ended', 'august', 'registered', 'company', 'registered', 'charity', 'hampton', 'school', 'content', 'page', 'chairman', 'report', 'legal', 'administrative', 'information', 'governor', 'report', 'independent', 'auditor', 'report', 'statement', 'financial', 'activity', 'year', 'ended', 'august', 'statement', 'financial', 'activity', 'year', 'ended', 'august', 'balance', 'sheet', 'cashflow', 'statement', 'note', 'financial', 'statement', 'hampton', 'school', 'chairman', 'report', 'year', 'ended', 'august', 'delighted', 'another', 'successful', 'year', 'trust', 'school', 'success', 'publicly', 'recognised', 'team', 'independent', 'school', 'inspectorate', 'inspected', 'school', 'march', 'report', 'highlighted', 'many', 'excellent', 'area', 'school', 'activity', 'particularly', 'concluded', 'hampton', 'school', 'quality', 'pupil', 'achievement'

In [None]:
# check result of tokenization for different documents
print(kl['text_tesseract'][0])
print(kl['text_tesseract'][100])
print(kl['text_tesseract'][300])

In [7]:
# check length of token lists per document
print(kl['text_tesseract'].apply(len).sort_values(ascending=False))

47     63100
429    42752
375    40201
56     22874
21     14683
       ...  
576      322
235      322
317      291
260      150
390      119
Name: text_tesseract, Length: 609, dtype: int64


In [8]:
# check number of unique tokens per document
print(kl['text_tesseract'].apply(lambda x : len(set(x))).sort_values(ascending=False))

429    7225
375    6872
47     6720
21     3305
528    3298
       ... 
576     167
235     166
317     151
260     118
390      95
Name: text_tesseract, Length: 609, dtype: int64
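
The two counts above can be combined into one table. This is a minimal sketch on hypothetical token lists; the `unique_ratio` column is an extra I am adding for illustration and is not computed in the notebook.

```python
import pandas as pd

# Hypothetical token lists standing in for kl['text_tesseract'].
tokens = pd.Series([
    ["school", "report", "school", "year"],
    ["charity", "trustee", "report"],
])

summary = pd.DataFrame({
    "n_tokens": tokens.apply(len),
    "n_unique": tokens.apply(lambda t: len(set(t))),
})
# Share of distinct tokens; low values indicate heavy repetition.
summary["unique_ratio"] = summary["n_unique"] / summary["n_tokens"]
print(summary)
```

For document 47 in the output above, 6720 unique tokens out of 63100 gives a ratio of roughly 0.11, so long documents repeat their vocabulary heavily.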


### Next, we will vectorise the tokenized text with scikit-learn's TF-IDF (see the next notebook)
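
As a preview, the sketch below shows one possible way to feed token lists to scikit-learn's `TfidfVectorizer`. It is an assumption about how the next step could look, not the code of the next notebook; the `docs` list is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical token lists standing in for kl['text_tesseract'].
docs = [
    ["school", "report", "school", "year"],
    ["charity", "trustee", "report", "year"],
]

# The documents are already tokenized, so a callable analyzer that returns
# its input unchanged bypasses scikit-learn's own preprocessing and tokenizing.
vectorizer = TfidfVectorizer(analyzer=lambda doc_tokens: doc_tokens)
tfidf = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
print(tfidf.shape)
```

The result is a sparse matrix with one row per document and one column per distinct token, which is the usual input for a clustering algorithm.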