## File Preprocessing

**80:20 rule of data science** (quoted from "Machine Learning for Text Analysis", slide 48, provided by Prof. Christian Bauckhage)

- 80% of the work is spent on pre-processing and data cleansing
- 20% of the work is spent on data analysis and visualization

**Pre-processing is therefore very important.**

In the following, we preprocess the data with both NLTK and Spacy and discuss the benefits and drawbacks of each library.

In [1]:
import numpy as np
import re
import string
import warnings
from collections import Counter
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer

import spacy

### Preprocessing using NLTK packages
[NLTK](https://www.nltk.org) is a leading platform for building Python programs to work with human language data [1].

We first try NLTK for preprocessing because it offers:
- strong community support
- shorter run time than Spacy (see the comparison further down)

However, it has some drawbacks, especially for German text:
- lemmatization is not supported
- stemming is poor, leaving too many redundant tokens
- part-of-speech tagging is not well integrated

Our NLTK preprocessing consists of the following steps:
1. remove digits (e.g. '0123456789')
2. remove punctuation (e.g. ',.„“|')
3. convert all text to lower case
4. tokenize sentences (break sentences into words)
5. remove stop words (e.g. 'einer', 'eine', 'eines', 'einen', 'oder', 'aber'...)

**Reference:**

[1] Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python.  O'Reilly Media Inc.

In [2]:
def nltkPreprocessing(text):
    """Clean a German text with NLTK and return lower-cased tokens without stop words."""

    # remove digits and some special symbols
    dig_translator = str.maketrans('', '', '0123456789-/€®–„“|')
    text = text.translate(dig_translator)

    # remove ASCII punctuation, then lower-case and trim the text
    str_translator = str.maketrans('', '', string.punctuation)
    text = text.translate(str_translator).lower()
    text = text.strip()

    # tokenize sentences
    word_tokens = word_tokenize(text)

    # remove stop words (a set gives constant-time membership tests;
    # the text is already lower-cased at this point)
    stop_words = set(stopwords.words('german'))
    filtered_tokens = [w for w in word_tokens if w not in stop_words]

    # Stemming with SnowballStemmer was also tried, but the results were poor,
    # so it is left disabled:
    #     ps = SnowballStemmer('german')
    #     stem_tokens = [ps.stem(w) for w in filtered_tokens]
    #     return stem_tokens

    return filtered_tokens
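
To see what the first cleaning steps do in isolation, here is a minimal sketch that uses only the standard library. The helper name `strip_digits_and_punctuation` and the sample sentence are made up for illustration; the character set is the one used in `nltkPreprocessing`. Note that `string.punctuation` only contains ASCII characters, so typographic symbols such as „ and “ have to be listed explicitly.

```python
import string

# Characters removed in the first pass (same set as in nltkPreprocessing).
EXTRA_CHARS = '0123456789-/€®–„“|'


def strip_digits_and_punctuation(text):
    """Remove digits, extra symbols and ASCII punctuation, then lower-case and trim."""
    text = text.translate(str.maketrans('', '', EXTRA_CHARS))
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.lower().strip()


# Hypothetical sample sentence, not taken from the report data.
sample = 'Der Umsatz stieg um 14% auf 5,4 Mrd. € („Rekord“).'
cleaned = strip_digits_and_punctuation(sample)
print(cleaned.split())
```

Removing characters leaves runs of spaces behind; these do not matter here because the subsequent tokenization step splits on whitespace anyway.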

### Preprocessing using Spacy packages
[Spacy](https://spacy.io/usage/) is a free, open-source library for Natural Language Processing in Python. It features named entity recognition (NER), part-of-speech (POS) tagging, dependency parsing, word vectors and more.

Thanks to these features, Spacy performs better in:
- tokenization quality
- lemmatization

However, it has some drawbacks, especially for German text:
- much slower tokenization
- an incomplete list of German stop words

Our Spacy preprocessing consists of the following steps:
1. convert all text to lower case
2. tokenize sentences (break sentences into words)
3. keep only purely alphabetic words
4. remove words with fewer than 3 letters
5. remove stop words (e.g. 'einer', 'eine', 'eines', 'einen', 'oder', 'aber'...)
6. remove punctuation (e.g. ',.„“|')
7. remove currency signs
8. remove number-like words (e.g. 'one', 'two'...)
9. remove whitespace tokens
10. lemmatize words

After comparing it with NLTK, we find that Spacy indeed provides more accurate preprocessing of German text.

**Reference:**

[1] Honnibal, Matthew and Montani, Ines (2017). "Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing"

In [3]:
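# load the German language model in spacy
# ('de' is a spaCy 2.x shortcut link; spaCy 3.x needs the full package name, e.g. 'de_core_news_sm')
warnings.filterwarnings("ignore")
nlp = spacy.load('de', disable=['parser', 'ner'])
# raise the character limit (default: 1,000,000) so that long documents can be processed
nlp.max_length = 2000000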


def spacyPreprocessing(text):
    """Tokenize, filter and lemmatize a German text with spacy; return the lemmas."""
    # extend NLTK's German stop words with domain-specific ones
    # and flag all of them as stop words in the spacy vocabulary
    my_stop_words = ['einer', 'eine', 'eines', 'einen', 'oder', 'aber', 'dass',
                     'teur', 'euro', 'eur', 'jahr', 'million', 'tausend', 'mio', 'mrd']
    stop_words = stopwords.words('german')
    stop_words.extend(my_stop_words)
    for w in stop_words:
        nlp.vocab[w].is_stop = True

    # tokenize the lower-cased text
    word_tokens = nlp(text.lower())

    # keep alphabetic tokens longer than 2 characters that are not stop words, punctuation,
    # currency symbols, number-like words or whitespace, then lemmatize them
    final_word_tokens = [w.lemma_ for w in word_tokens
                         if w.text.isalpha() and len(w) > 2
                         and not (w.is_stop or w.is_punct or w.is_currency or w.like_num or w.is_space)]

    return final_word_tokens

#### Comparison between NLTK Preprocessing and Spacy Preprocessing

In [4]:
%%time
# randomly choose one file for comparison
test_file = '../txt_data/Adidas-QuarterlyReport-2016-Q3'
with open(test_file) as f:
    text = f.read()
    nltk_tokens = nltkPreprocessing(text)
nltk_count = Counter(nltk_tokens)

CPU times: user 77.4 ms, sys: 4.75 ms, total: 82.1 ms
Wall time: 82.6 ms


In [5]:
%%time
with open(test_file) as f:
    text = f.read()
    spacy_tokens = spacyPreprocessing(text)
spacy_count = Counter(spacy_tokens)

CPU times: user 1.2 s, sys: 88.7 ms, total: 1.29 s
Wall time: 600 ms
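
The `%%time` magic measures a single run and only works inside Jupyter. As a rough sketch of how the same measurement could be repeated in a plain Python script, the hypothetical helper `time_call` below takes the best of several runs with `time.perf_counter`. The workload shown is only a stand-in; in the notebook it would be `nltkPreprocessing` or `spacyPreprocessing` applied to the file text.

```python
import time


def time_call(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls of func(*args)."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload (assumption): splitting a repeated dummy sentence.
elapsed = time_call(str.split, 'ein kurzer beispielsatz ' * 1000)
print(f'best of 5 runs: {elapsed * 1000:.3f} ms')
```

Taking the minimum over a few repetitions reduces the influence of one-off effects such as caching or background load, which a single `%%time` run cannot separate out.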


We compare the two preprocessing methods based on the following criteria:
1. Run time
2. Number of total tokens after preprocessing
3. Number of unique tokens after preprocessing
4. Occurrences of selected words (e.g. geben), indicating the effectiveness of lemmatization (gab|gibt -> geben)
5. **Most common tokens** comparison

<table style="width:50%">
  <tr>
    <th></th>
    <th>NLTK</th>
    <th>Spacy</th> 
  </tr>
  <tr>
    <td>Run Time (ms)</td>
    <td>~82</td>
    <td>~600</td> 
  </tr>
  <tr>
    <td># of total tokens</td>
    <td>4619</td>
    <td>3804</td> 
  </tr>
  <tr>
    <td># of unique tokens</td>
    <td>1636</td>
    <td>1289</td> 
  </tr>
  <tr>
    <td># of "geben"</td>
    <td>1</td>
    <td>3</td> 
  </tr>
  <tr>
    <td># of "gab"</td>
    <td>2</td>
    <td>0</td> 
  </tr>
</table>
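
The token-level rows of such a table can be derived directly from a `Counter`. The following is a minimal sketch, not necessarily how the numbers above were produced; the helper name `summarize_tokens` and the toy token list are assumptions made for illustration.

```python
from collections import Counter


def summarize_tokens(tokens, probe_words=('geben', 'gab')):
    """Compute the token-level comparison criteria for one token list."""
    counts = Counter(tokens)
    return {
        'total_tokens': sum(counts.values()),  # all tokens, repeats included
        'unique_tokens': len(counts),          # distinct tokens
        # a Counter returns 0 for words that never occur
        'probe_counts': {w: counts[w] for w in probe_words},
    }


# Toy data (assumption); in the notebook this would be nltk_tokens or spacy_tokens.
toy_tokens = ['umsatz', 'geben', 'umsatz', 'gab', 'gab', 'adidas']
summary = summarize_tokens(toy_tokens)
print(summary)
```

Applied to both `nltk_tokens` and `spacy_tokens`, the two resulting dictionaries could be compared row by row.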

In [6]:
# dataframe consisting of nltk word tokens and word frequency
nltk_df = pd.DataFrame(nltk_count, index=['NLTK_Frequency'])
nltk_df = nltk_df.transpose().sort_values('NLTK_Frequency', ascending=False)[:10]
nltk_df['NLTK_Tokens'] = nltk_df.index
nltk_df = nltk_df[['NLTK_Tokens', 'NLTK_Frequency']]
nltk_df.index = range(len(nltk_df))

# dataframe consisting of spacy word tokens and word frequency
spacy_df = pd.DataFrame(spacy_count, index=['Spacy_Frequency'])
spacy_df = spacy_df.transpose().sort_values('Spacy_Frequency', ascending=False)[:10]
spacy_df['Spacy_Tokens'] = spacy_df.index
spacy_df = spacy_df[['Spacy_Tokens', 'Spacy_Frequency']]
spacy_df.index = range(len(spacy_df))

pd.concat([nltk_df, spacy_df], axis=1)

|   | NLTK_Tokens | NLTK_Frequency | Spacy_Tokens | Spacy_Frequency |
|---|---|---|---|---|
| 0 | adidas | 82 | adidas | 82 |
| 1 | umsatz | 76 | umsatz | 74 |
| 2 | sowie | 64 | aufwendungen | 41 |
| 3 | mio | 55 | konzerns | 37 |
| 4 | aufwendungen | 41 | betrieblich | 37 |
| 5 | konzerns | 37 | hoch | 34 |
| 6 | aufgrund | 35 | steigen | 33 |
| 7 | mrd | 35 | entwicklung | 33 |
| 8 | betrieblichen | 34 | prozentpunkte | 33 |
| 9 | prozentpunkte | 33 | erhöhen | 30 |
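
If pandas is not available, a similar rank-by-rank view can be sketched with the standard library alone. The helper `side_by_side` and the toy counters below are hypothetical; `Counter.most_common` and `itertools.zip_longest` are standard-library calls.

```python
from collections import Counter
from itertools import zip_longest


def side_by_side(left, right, n=10):
    """Pair the n most common (token, count) entries of two Counters, rank by rank."""
    # zip_longest pads the shorter side with None if it has fewer than n entries
    return list(zip_longest(left.most_common(n), right.most_common(n)))


# Toy counters (assumption); in the notebook these would be nltk_count and spacy_count.
left = Counter({'adidas': 3, 'umsatz': 2, 'sowie': 1})
right = Counter({'adidas': 3, 'umsatz': 2})

rows = side_by_side(left, right, n=3)
for rank, (l, r) in enumerate(rows):
    print(rank, l, r)
```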


Taking this one file as an example, NLTK preprocessing is faster (about 82 ms versus about 600 ms). However, Spacy preprocessing generates more reliable results, especially for token lemmatization.

After preprocessing the files, the next step is to generate the corpus: [Corpus Generation](03-Corpus_Generation.ipynb)