## File Preprocessing

**80:20 rule of data science**

- 80% of the work is spent on pre-processing and data cleansing
- 20% of the work is spent on data analysis and visualization

**Pre-processing is therefore very important.**

Below, we preprocess the text with both NLTK and spaCy and discuss the strengths and weaknesses of each library.


In [12]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
import spacy
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import re
import warnings
import string
warnings.filterwarnings("ignore") 

### Preprocessing using NLTK
[NLTK](https://www.nltk.org) is a leading platform for building Python programs to work with human language data.

We first pre-process the data with NLTK because of its:
- strong community support
- shorter run time than spaCy in our comparison

However, it has some drawbacks, especially for German text:
- no built-in lemmatizer for German
- poor stemming, which produces too many redundant tokens
- part-of-speech tagging is not well integrated
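
The stemming drawback can be inspected directly. The following is a minimal sketch, assuming NLTK is installed; the word list is an illustrative sample of inflected forms of "suchen" and is not taken from the data set.

```python
from nltk.stem import SnowballStemmer

# Illustrative inflected forms of the verb "suchen" (hypothetical sample)
words = ["suchen", "sucht", "suchst", "suchte", "gesucht"]

stemmer = SnowballStemmer("german")
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")

# With ideal stemming, all forms would collapse into a single stem
print("distinct stems:", len(set(stems.values())))
```

The more distinct stems remain for one verb, the more redundant tokens end up in the vocabulary.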

Our NLTK preprocessing consists of the following steps:
1. remove digits (e.g. '0123456789')
2. remove punctuation (e.g. ',.„“|')
3. convert all text to lower case
4. tokenize the text (split it into words)
5. remove stop words (e.g. 'einer', 'eine', 'eines', 'einen', 'oder', 'aber'...)
6. (optional, currently disabled) stem words, i.e. reduce inflected forms such as 'sucht' and 'suchst' to a common stem

In [9]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer

def nltkPreprocessing(text):
    """Return the lower-cased, stop-word-free tokens of a German text."""

    # remove digits and extra symbols (string.punctuation only covers ASCII)
    dig_translator = str.maketrans('', '', '0123456789-/€®–„“|')
    text = text.translate(dig_translator)

    # remove ASCII punctuation, then lower-case and trim whitespace
    str_translator = str.maketrans('', '', string.punctuation)
    text = text.translate(str_translator).lower().strip()

    # tokenize the text into words
    word_tokens = word_tokenize(text)

    # remove stop words (a set makes the membership test fast)
    stop_words = set(stopwords.words('german'))
    filtered_tokens = [w for w in word_tokens if w not in stop_words]

    return filtered_tokens

    # optional: stem the tokens with SnowballStemmer (results were poor for German)
#     ps = SnowballStemmer('german')
#     stem_tokens = [ps.stem(w) for w in filtered_tokens]
#     return stem_tokens
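
The character-removal steps of `nltkPreprocessing` can be tried in isolation using only the standard library. The sketch below merges the two translation passes into a single table for brevity; the helper name `strip_chars` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import string

# Digits plus symbols that string.punctuation (ASCII only) does not cover,
# such as German quotation marks, the en dash and the euro sign.
EXTRA_CHARS = '0123456789-/€®–„“|'

def strip_chars(text):
    """Delete digits, extra symbols and ASCII punctuation, then lower-case."""
    table = str.maketrans('', '', EXTRA_CHARS + string.punctuation)
    return text.translate(table).lower().strip()

# Hypothetical sample sentence
sample = 'Der Umsatz stieg 2018 auf „12,5 Mio. €“ – ein Plus!'
cleaned = strip_chars(sample)
print(cleaned.split())
```

Because `string.punctuation` contains only ASCII characters, symbols like '„' or '€' have to be listed explicitly, which is why the function uses a separate first translator.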



### Preprocessing using spaCy
[spaCy](https://spacy.io/usage/) is a free, open-source library for Natural Language Processing in Python. It features named entity recognition (NER), part-of-speech (POS) tagging, dependency parsing, word vectors and more.

Thanks to these features, spaCy performs better in:
- tokenization quality
- lemmatization

However, it has some drawbacks, especially for German text:
- much slower tokenization
- an incomplete list of German stop words

Our spaCy preprocessing consists of the following steps:
1. convert all text to lower case
2. tokenize the text (split it into words)
3. keep only purely alphabetic tokens
4. remove words with fewer than 3 letters
5. remove stop words (e.g. 'einer', 'eine', 'eines', 'einen', 'oder', 'aber'...)
6. remove punctuation (e.g. ',.„“|')
7. remove currency symbols
8. remove number-like words (e.g. 'one', 'two'..)
9. remove whitespace tokens
10. lemmatize words

Compared with NLTK, we find that spaCy indeed preprocesses German text more accurately.

In [13]:
# load the German language model in spaCy
# ('de' is a spaCy v2 shortcut; spaCy v3+ needs a full model name such as 'de_core_news_sm')
warnings.filterwarnings("ignore")
nlp = spacy.load('de', disable=['parser', 'ner'])
nlp.max_length = 2000000

# extend spaCy's German stop words once with the NLTK list and domain-specific words
my_stop_words = ['einer', 'eine', 'eines', 'einen', 'oder', 'aber', 'dass', 'teur', 'euro', 'eur', 'jahr', 'million', 'tausend', 'mio', 'mrd']
stop_words = stopwords.words('german') + my_stop_words
for w in stop_words:
    nlp.vocab[w].is_stop = True

def spacyPreprocessing(text):
    """Return the lemmas of the content words of a German text."""
    # tokenize the lower-cased text
    word_tokens = nlp(text.lower())

    # keep alphabetic tokens longer than 2 characters that are not stop words,
    # punctuation, currency symbols, number-like words or whitespace; then lemmatize
    final_word_tokens = [
        w.lemma_ for w in word_tokens
        if w.text.isalpha() and len(w) > 2
        and not (w.is_stop or w.is_punct or w.is_currency or w.like_num or w.is_space)
    ]

    return final_word_tokens
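
The run-time and output comparison between the two pipelines can be reproduced with a small harness. The following is a minimal sketch using only the standard library; `compare_preprocessors` and the two stand-in functions are hypothetical names introduced here so the example runs on its own. In the notebook they would be replaced by `nltkPreprocessing` and `spacyPreprocessing`.

```python
import time

def compare_preprocessors(text, preprocessors):
    """Run each preprocessing function on `text`; return tokens and run time per name."""
    results = {}
    for name, func in preprocessors.items():
        start = time.perf_counter()
        tokens = func(text)
        elapsed = time.perf_counter() - start
        results[name] = {"tokens": tokens, "seconds": elapsed}
    return results

# Stand-in functions (assumptions, not the notebook's pipelines)
def whitespace_tokens(text):
    return text.lower().split()

def alphabetic_tokens(text):
    return [w for w in text.lower().split() if w.isalpha() and len(w) > 2]

sample = "Die Firma sucht 3 neue Mitarbeiter in Berlin"
report = compare_preprocessors(sample, {
    "whitespace": whitespace_tokens,
    "alphabetic": alphabetic_tokens,
})
for name, result in report.items():
    print(f"{name}: {len(result['tokens'])} tokens in {result['seconds']:.6f} s")
```

A single call gives only a rough timing; for a fair comparison the functions should be run on the full corpus or repeated several times.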