## File Preprocessing

**80:20 rule of data science**

- 80% of the work is spent on preprocessing and data cleansing
- 20% of the work is spent on data analysis and visualization

**PREPROCESSING IS VERY IMPORTANT**

In the following, we preprocess the text with both NLTK and spaCy and then discuss the strengths and weaknesses of each library.

<font color="blue"/>

### dsp:
  * Is this rule true? Is this a quote?

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
import spacy
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import re
import warnings
import string
warnings.filterwarnings("ignore") 

<font color="blue"/>

### dsp:
  * I'd prefer to list the general packages first and add some structure by inserting empty lines.
  * Are you sure that all warnings that you silence will be unimportant? You might want to add a comment describing which warning you want to silence and for what reason.
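A more targeted filter could look like the following sketch. The choice of `DeprecationWarning` as the category to silence and the `noisy` helper are assumptions made for illustration only; which warnings actually occur depends on the installed library versions.

```python
import warnings

def noisy():
    # hypothetical function that emits two different kinds of warnings
    warnings.warn("old API", DeprecationWarning)
    warnings.warn("check your data", UserWarning)
    return "done"

with warnings.catch_warnings(record=True) as caught:
    # record every warning that is not explicitly ignored below
    warnings.simplefilter("always")
    # silence only deprecation notices (assumed to be harmless here);
    # all other categories stay visible
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    noisy()

remaining = [str(w.message) for w in caught]
print(remaining)
```

Only the `UserWarning` survives the filter, so an unexpected problem would still be noticed.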

### Preprocessing using nltk packages 
[NLTK](https://www.nltk.org) is a leading platform for building Python programs to work with human language data.

We first try NLTK for preprocessing, for two reasons:
- strong community support
- faster than spaCy (shorter run time in our comparison)

However, there are some drawbacks, especially for German processing:
- lemmatization not supported
- poor stemming, which causes too many redundant tokens
- part-of-speech tagging not well integrated

Our NLTK preprocessing consists of the following steps:
1. remove digits (e.g. '0123456789')
2. remove punctuation (e.g. ',.„“|')
3. change all text case to lower case
4. tokenize sentences (breaking sentences into words or phrases)
5. remove stop words (e.g. 'einer', 'eine', 'eines', 'einen', 'oder', 'aber'...)
6. (stemming words (e.g. 'sucht | suchst' -> 'suchen'))

<font color="blue"/>

### dsp:
  * &#x1f642; It is great that you make an attempt to compare the two toolkits and give your criteria.
  * You could support your claim with a few short experiments. Take one or two sentences and process them with the two toolkits and compare the results. Although these special examples do not "prove" anything about how the toolkits compare in general, it helps to understand the differences. (If you wanted to provide a strong comparison, you would need to find or create a comprehensive benchmark. A bit too much for a student project.)
  * How do you know that "NLTK is a leading platform for building Python programs to work with human language data."? &#x1f609; Is it true?
  * Does the stemmer map  'sucht ' and 'suchst' to 'suchen'? (I did not check.)
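
A possible shape for such a short experiment is sketched below. The helper `compare_preprocessors` and the two stand-in functions are hypothetical and exist only so that the sketch runs on its own; in the notebook one would pass the actual NLTK and spaCy preprocessing functions instead.

```python
import time

def compare_preprocessors(samples, preprocessors):
    """Run every preprocessing function on every sample and collect tokens and run time."""
    report = []
    for text in samples:
        for name, func in preprocessors.items():
            start = time.perf_counter()
            tokens = func(text)
            elapsed = time.perf_counter() - start
            report.append({"text": text, "toolkit": name,
                           "tokens": tokens, "seconds": elapsed})
    return report

# stand-ins for the two toolkits (assumptions, not the real pipelines)
def split_only(text):
    return text.lower().split()

def split_long_words(text):
    return [w for w in text.lower().split() if len(w) > 2]

report = compare_preprocessors(
    ["Er sucht eine neue Wohnung"],
    {"split_only": split_only, "split_long_words": split_long_words},
)
for row in report:
    print(row["toolkit"], row["tokens"], f"{row['seconds']:.6f}s")
```

Timing a single short sentence is noisy, so the run times are only meaningful on larger inputs or when averaged over many repetitions.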
  

In [9]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer

def nltkPreprocessing(text):
    
    #remove digits and some special symbols
    dig_translator = str.maketrans('', '', '0123456789-/€®–„“|')
    text = text.translate(dig_translator)
    
    #remove punctuation
    str_translator = str.maketrans('', '', string.punctuation)
    text = text.translate(str_translator).lower()
    text = text.strip()
    
    #tokenize the text into words
    word_tokens = word_tokenize(text)
    stop_words = set(stopwords.words('german'))

    #remove stop words (the text is already lower case at this point)
    filtered_tokens = [w for w in word_tokens if w not in stop_words]
    
    return filtered_tokens

    #optional stemming with SnowballStemmer (disabled because the results were not good)
#     ps = SnowballStemmer('german')
#     stem_tokens = [ps.stem(w) for w in filtered_tokens]   
#     return stem_tokens



### Preprocessing using Spacy packages
[Spacy](https://spacy.io/usage/) is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more. 

Thanks to these features, spaCy performs better in:
- tokenization quality
- lemmatization

However, there are some drawbacks, especially for German processing:
- very slow tokenization
- incomplete German stop word list

Our Spacy preprocessing consists of the following steps:
1. change all text case to lower case
2. tokenize sentences (breaking sentences into words or phrases)
3. keep only words that consist of alphabetic characters
4. remove words with fewer than 3 letters
5. remove stop words (e.g. 'einer', 'eine', 'eines', 'einen', 'oder', 'aber'...)
6. remove punctuation (e.g. ',.„“|')
7. remove currency signs
8. remove number-like words (e.g. 'one', 'two'..)
9. remove spaces
10. lemmatize words

After comparing the results with NLTK, we find that spaCy indeed preprocesses German text more accurately.

In [None]:
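#load the German language model in spacy
#note: 'de' is a spaCy v2 shortcut link; spaCy v3 requires the full package name, e.g. 'de_core_news_sm'
warnings.filterwarnings("ignore")
nlp = spacy.load('de', disable=['parser', 'ner'])
#raise the limit on the input length (in characters) so that long documents can be processed
nlp.max_length = 2000000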

def spacyPreprocessing(text): 
    # define stop words
    my_stop_words  = ['einer', 'eine', 'eines', 'einen', 'oder', 'aber', 'dass',  'teur', 'euro', 'eur', 'jahr', 'million', 'tausend', 'mio', 'mrd']
    stop_words = stopwords.words('german')
    stop_words.extend(my_stop_words)
    for w in stop_words:
        nlp.vocab[w].is_stop = True
    
    #tokenize texts
    word_tokens = nlp(text.lower())
    
    #remove words containing special letters, short words, stop words, punctuations, currency, numbers and spaces, then lemmatize words
    final_word_tokens = [w.lemma_ for w in word_tokens if w.text.isalpha() and len(w)>2 and not w.is_stop and not w.is_punct and not w.is_currency and not w.like_num and not w.is_space]
    
    return final_word_tokens

<font color="blue"/>

### dsp:
  * I didn't know that you can convince spacy to process even larger texts. Nice! &#x1f642;
  * I also did not know that you can tell spacy to consider certain words to be stop words. &#x1f642;
  * I would not be surprised if the results for `nlp(text)` were better than for `nlp(text.lower())`, but I have not tried it yet.
  * Please distribute the list comprehension over more than one line so that we can read it without scrolling. Linebreaks are fine:
```Python
    final_word_tokens = [w.lemma_ for w in word_tokens if w.text.isalpha() and len(w)>2 
                             and not w.is_stop and not w.is_punct and not w.is_currency 
                             and not w.like_num and not w.is_space]
```
or
```Python
    final_word_tokens = [w.lemma_ for w in word_tokens if w.text.isalpha() and len(w)>2 
                             and not (w.is_stop or w.is_punct or w.is_currency or w.like_num or w.is_space)]
```
  * Did you check whether `my_stop_words` is already subsumed by `stopwords.words('german')`? The list looks similar to my early list, into which I had not invested much thought.
  * It would be nice to demonstrate these two functions in this notebook with small examples. Maybe even to the degree to convince the reader of your comparison results.
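
The stop word question could be checked with a few lines like the sketch below. Here `sample_base` is a hypothetical stand-in for `stopwords.words('german')`, so the split shown is not a statement about the real NLTK list; `split_custom_stop_words` is likewise a helper name invented for this illustration.

```python
def split_custom_stop_words(custom, base):
    """Separate custom stop words into those already in the base list and the new ones."""
    base_set = set(base)
    already_covered = [w for w in custom if w in base_set]
    genuinely_new = [w for w in custom if w not in base_set]
    return already_covered, genuinely_new

my_stop_words = ['einer', 'eine', 'eines', 'einen', 'oder', 'aber', 'dass',
                 'teur', 'euro', 'eur', 'jahr', 'million', 'tausend', 'mio', 'mrd']

# hypothetical stand-in; in the notebook use stopwords.words('german') instead
sample_base = ['aber', 'oder', 'eine', 'und']

covered, new = split_custom_stop_words(my_stop_words, sample_base)
print("already covered:", covered)
print("genuinely new:  ", new)
```

If `new` came out empty for the real list, the custom list would be redundant and could be dropped.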