# Creating a Clean Corpus for German Political Speeches

## Overview

__Goal:__ Cleaning the messy input data and creating a corpus of tokenized documents.

__Input:__ CSV files with political speeches. One speech per input line.

__Output:__ A _dictionary_ and a _corpus_ of documents with indexed tokens.


## Data Preparation Steps

### Cleaning

First, we clean up the raw input text. It contains problematic punctuation and other issues that can hamper tokenization and degrade the quality of the resulting document models. The cleaning includes:

  * removal of quotes
  * handling of abbreviations
  * handling of messy punctuation
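
As a small illustration of the abbreviation handling, the sketch below applies the same pattern that this notebook later defines as `regex_abbrev` to two made-up sentences. The sample strings are assumptions chosen for the demo; only the pattern is taken from the notebook.

```python
import re

# Pattern copied from the notebook's regex_abbrev definition.
regex_abbrev = re.compile(r'\s[a-z]\.[a-zA-Z]\.(?=\s)')

# Made-up sample containing the two-letter abbreviation 'd.h.'.
with_abbrev = 'aus meiner Sicht, d.h. aus Sicht der Politik'
without_abbrev = regex_abbrev.sub('', with_abbrev)
print(without_abbrev)

# A title such as 'Dr.' does not have the letter-dot-letter-dot shape
# and is therefore left untouched by this pattern.
title = 'Frau Dr. Merkel sagte'
title_cleaned = regex_abbrev.sub('', title)
print(title_cleaned)
```

Note that the pattern only covers two-letter abbreviations that are preceded and followed by whitespace; other abbreviation forms would need additional patterns.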

### Tokenization

A common representation of documents is a vector space model based on the words contained in the documents. Tokenization is the process of extracting 'useful' words from the documents. It comprises the following steps (steps in parentheses are optional and not performed in this notebook):

  * lower case conversion
  * (sentence splitting)
  * word splitting
  * (building n-grams)
  * stop word removal
  * (lemmatization)
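
The steps can be sketched on a single toy sentence. This is a minimal illustration, not the notebook's pipeline: the stop word set, the helper names `tokenize_sketch` and `bigrams`, and the sample sentence are all hypothetical, and the bigram step is shown only to make the optional n-gram step concrete.

```python
# Hypothetical stop word set, for illustration only.
STOPWORDS = {"die", "der", "und", "ist", "für"}

def tokenize_sketch(text, stopwords=STOPWORDS):
    """Lower-case the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stopwords]

def bigrams(tokens):
    """Join each pair of adjacent tokens with an underscore."""
    return ["_".join(pair) for pair in zip(tokens, tokens[1:])]

tokens = tokenize_sketch("Die Digitalisierung ist wichtig für die Wirtschaft")
pairs = bigrams(tokens)
print(tokens)
print(pairs)
```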

## References

* [Tutorial on Corpora and Vector Spaces](https://radimrehurek.com/gensim/tut1.html) from [Gensim](https://radimrehurek.com/gensim/index.html).
* [Tutorial on Topic Modeling with Gensim](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/) from [Machine Learning Plus](https://www.machinelearningplus.com/).

## Prerequisites

### Libraries and Constants

In [None]:
import pandas as pd
import string
import os
import re
import time

# input files
data_dir        = '../data/'
filename        = data_dir + 'Bundesregierung.csv'

# output files
corpus_dir      = '../corpus/'
dict_filename   = corpus_dir + 'gps.dict'
corpus_filename = corpus_dir + 'gps_bow.mm'

# ensure output directory exists
os.makedirs(corpus_dir, exist_ok=True)

### Reading the input files

In [None]:
start_time = time.time()
df = pd.read_csv(filename)
print("--- took %d:%.2d minutes ---" % divmod(time.time() - start_time, 60))

print(len(df), 'speeches imported')
df.head()

## Cleaning Messy Input Text Data

### Wrong Punctuation

Example: `betreffen,Herkunftsland` (the space after the comma is missing)
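
The following sketch shows why this matters for whitespace-based word splitting. The pattern used here is a deliberately simplified stand-in, not the notebook's `regex_comma`: it inserts a space after any comma that is directly followed by a character that is neither whitespace nor a digit, so decimal numbers such as `4,5` stay intact.

```python
import re

example = "betreffen,Herkunftsland"

# Whitespace splitting keeps the two words glued together as one token.
glued = example.split()

# Simplified, illustrative pattern: add a space after a comma that is
# directly followed by a non-space, non-digit character.
repaired = re.sub(r",(?=[^\s\d])", ", ", example)
separated = repaired.split()

# Decimal commas are not touched because a digit follows the comma.
decimal = re.sub(r",(?=[^\s\d])", ", ", "4,5%")

print(glued)
print(separated)
print(decimal)
```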

### Define RegEx Patterns

In [None]:
# split words joined by misplaced punctuation using non-greedy matching with a look-ahead: '?(?=\W)'

# TODO: remove URLs

# remove abbreviations and ellipses
regex_ellipsis = re.compile(r'\.\.\.')
regex_abbrev   = re.compile(r'\s[a-z]\.[a-zA-Z]\.(?=\s)')

# insert missing spaces after punctuation (needed for splitting words)
regex_comma    = re.compile(r',([^\s\d]{2,}?)(?=\W)')
regex_period   = re.compile(r'\.([A-Z][a-z])')
regex_sentence = re.compile(r'([\?!):;])([^\s])')

In [None]:
# define punctuation to be removed
punct_trans = str.maketrans({key:None for key in string.punctuation})
print('removing punctuation:', string.punctuation)

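def clean(text):
    """Normalize quotes, abbreviations and punctuation in a single speech."""
    text = re.sub(r'["–]', ' ', text)              # replace double quotes and dashes with spaces
    text = re.sub(regex_ellipsis, ' ', text)        # replace ellipses with spaces
    text = re.sub(regex_abbrev, '', text)           # drop abbreviations such as 'd.h.'
    text = re.sub(regex_comma, r', \1', text)       # insert missing space after commas
    text = re.sub(regex_period, r'. \1', text)      # insert missing space after periods
    text = re.sub(regex_sentence, r'\1 \2', text)   # insert missing space after ? ! ) : ;
    text = text.translate(punct_trans)              # strip remaining ASCII punctuation
    return text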

In [None]:
test = 'aus meiner Sicht, d.h. aus Sicht der operativen:Politik.Heißt,So dass Sie,zumindest hier 4,5%'
print(test)
print(re.findall(regex_comma, test))
clean(test)

In [None]:
print(df['text'].iloc[0][0:500])

In [None]:
def check_pattern(pattern, texts):
    """Print the number of matches of `pattern` in `texts` and a few examples."""
    items = []
    for doc in texts:
        items.extend(re.findall(pattern, doc))
    print(len(items), 'matches found, e.g.', items[0:5])

check_pattern(regex_comma, df['text'])

## Perform Cleaning

In [None]:
start_time = time.time()
df['text'] = df['text'].apply(clean)
print("--- took %d:%.2d minutes ---" % divmod(time.time() - start_time, 60))

In [None]:
print(df['text'].iloc[0][0:500], ' [...]')

## Load Stopwords

In [None]:
stopwords_filename = '../data/stopwords-de.txt'

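# assumes the stop word file is UTF-8 encoded; a set makes the membership test in tokenize() fast
with open(stopwords_filename, encoding='utf-8') as f:
    stopwords = set(f.read().splitlines())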

In [None]:
def tokenize(text):
    """Tokenize a text and return a list of cleaned tokens."""
    return [word for word in text.lower().split() if word not in stopwords]

In [None]:
# show intermediate result
print(tokenize(df['text'].iloc[0])[0:50], '...')

## Calculate Word Frequency

Remove Infrequent Tokens (Single Occurrence)
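
The idea can be sketched on toy data before running it on the full corpus. The documents below are made up, and `collections.Counter` is used here only as a compact alternative for the demo. One side effect worth noticing is that a document can end up empty once its rare tokens are dropped.

```python
from collections import Counter

# Made-up, already tokenized documents.
docs = [["haushalt", "steuer", "europa"], ["europa", "haushalt"], ["klima"]]

# Count how often each token occurs across all documents.
counts = Counter(token for doc in docs for token in doc)

# Keep only tokens that occur more than once in the whole collection.
filtered = [[token for token in doc if counts[token] > 1] for doc in docs]

print(counts["europa"], counts["klima"])
print(filtered)
```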

In [None]:
from collections import defaultdict

# store token frequency counts in dictionary
frequency = defaultdict(int)

start_time = time.time()
for doc in df['text']:
    for token in tokenize(doc):
        frequency[token] += 1
print("--- took %d:%.2d minutes ---" % divmod(time.time() - start_time, 60))

once = len([v for v in frequency.values() if v == 1])

print(len(frequency), "words in dictionary")
print(once, "words with one occurrence")
print(len(frequency)-once, "words with multiple occurrences")

#### Most frequent and least frequent tokens

[Sort dictionary](https://docs.python.org/3/library/collections.html#ordereddict-examples-and-recipes) by token frequency in descending order.

In [None]:
freq_desc = sorted(frequency.items(), key=lambda t: t[1], reverse=True)

print('--- Most Frequent Tokens ---')
for (k,v) in freq_desc[0:10]: print('{freq}: {token}'.format(token=k, freq=v))
print('--- Least Frequent Tokens ---')
for (k,v) in freq_desc[-10:]: print('{freq}: {token}'.format(token=k, freq=v))

## Tokenize Text

In [None]:
start_time = time.time()
texts = [[token for token in tokenize(doc) if token != '' and frequency[token] > 1 ] for doc in df['text']]
print("--- took %d:%.2d minutes ---" % divmod(time.time() - start_time, 60))

In [None]:
# show final results
print(texts[0][0:50])

## Create Dictionary

In [None]:
from gensim import corpora

print('Creating Dictionary')
start_time = time.time()
dictionary = corpora.Dictionary(texts)
print("--- took %d:%.2d minutes ---" % divmod(time.time() - start_time, 60))

In [None]:
print('Saving Dictionary')
start_time = time.time()
dictionary.save(dict_filename)
print("--- took %d:%.2d minutes ---" % divmod(time.time() - start_time, 60))

print(dictionary)

In [None]:
print(list(dictionary.items())[0:20])

In [None]:
dfs_desc = sorted(dictionary.dfs.items(), key=lambda t: t[1], reverse=True)

# look up tokens via dictionary[k]; id2token is only filled lazily and may still be empty
print('--- Most Frequent Token Occurrences in', dictionary.num_docs, 'Documents ---')
for (k,v) in dfs_desc[0:25]: print('{freq}: {token}'.format(token=dictionary[k], freq=v))

print('--- Least Frequent Token Occurrences in', dictionary.num_docs, 'Documents ---')
for (k,v) in dfs_desc[-10:]: print('{freq}: {token}'.format(token=dictionary[k], freq=v))

## Create Corpus

In [None]:
print('Creating Corpus')
start_time = time.time()
corpus_bow = [dictionary.doc2bow(doc) for doc in texts]
print("--- took %d:%.2d minutes ---" % divmod(time.time() - start_time, 60))

In [None]:
print('Saving Corpus')
start_time = time.time()
corpora.MmCorpus.serialize(corpus_filename, corpus_bow)
print("--- took %d:%.2d minutes ---" % divmod(time.time() - start_time, 60))

print(corpus_bow[0:5])
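
Each printed corpus entry is a list of `(token id, count)` pairs for one document. The pure-Python sketch below mimics that bag-of-words format on toy data so the structure is easier to read. It is an illustration only: the names `texts_demo`, `token2id` and `to_bow` are hypothetical, and the ids are assigned in order of first appearance, which need not match the ids Gensim assigns.

```python
from collections import Counter

# Made-up, already tokenized documents.
texts_demo = [["europa", "haushalt", "europa"], ["klima", "haushalt"]]

# Assign an integer id to each distinct token in order of first appearance
# (an assumption for this demo; Gensim's id assignment may differ).
token2id = {}
for doc in texts_demo:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return the document as a sorted list of (token id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_demo = [to_bow(doc) for doc in texts_demo]
print(token2id)
print(bow_demo)
```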