# Tutorial 1: Basic NLP Pipeline
The goal of this first tutorial is to build a simple NLP pipeline and, along the way, to become familiar with the modules nltk, numpy and scikit-learn.

## Task 1: Importing the modules
### 1.1: Importing the most important NLP modules
Make sure that the required modules "pandas", "numpy", "nltk", "sklearn" and "re" are imported, and print the version number of each module.

In [1]:
import pandas as pd
print ("pandas", pd.__version__)

import numpy as np
print ("numpy", np.__version__)

import nltk
print ("nltk", nltk.__version__)

import re
print ("re", re.__version__)

import sklearn
print ("sklearn", sklearn.__version__)


pandas 1.0.5
numpy 1.18.5
nltk 3.5
re 2.2.1
sklearn 0.21.3


### 1.2 Importing "quality-of-life" modules
Install "pandarallel". Import the object "pandarallel" from the pandarallel package (for parallelization in pandas) and initialize it with pandarallel.initialize().

In [2]:
# quality of life improvements
from pandarallel import pandarallel  # parallelization
pandarallel.initialize()

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


## 2: Importing the data
In this tutorial you will work with a dataset of jokes. The jokes were collected with crawlers from the platforms "stupidstuff.org", "wocka.com" and "reddit.com".

Read the accompanying .json files and combine them into a single pandas DataFrame. Make sure that the source of each record is preserved as a key.

In [3]:
dfs = [pd.read_json("stupidstuff.json"), pd.read_json("reddit.json"), pd.read_json("wocka.json")]
data = pd.concat(dfs, keys=["stupidstuff", "reddit", "wocka"])
data.loc["reddit"]

Unnamed: 0,body,category,id,rating,score,title
0,"Now I have to say ""Leroy can you please paint ...",,5tz52q,,1.0,I hate how you cant even say black paint anymore
1,Pizza doesn't scream when you put it in the ov...,,5tz4dd,,0.0,What's the difference between a Jew in Nazi Ge...
2,...and being there really helped me learn abou...,,5tz319,,0.0,I recently went to America....
3,A Sunday school teacher is concerned that his ...,,5tz2wj,,1.0,"Brian raises his hand and says, “He’s in Heaven.”"
4,He got caught trying to sell the two books to ...,,5tz1pc,,0.0,You hear about the University book store worke...
...,...,...,...,...,...,...
40804,Mediocre meaty ochre.,,4n0hy6,,1.0,What do you call bad black paint made out of m...
40805,IHOP,,4n0f4q,,2.0,Where does a person with one leg work?
40806,Droid Sans,,4n0ci8,,0.0,What is Obi Wan's favorite font?
40807,It's the only place I can buy 400 cantaloupes ...,,4n09ud,,1.0,"Even though I hate it, math is special."
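
To illustrate what the `keys` argument does, here is a minimal sketch with two tiny, made-up frames standing in for the JSON files (all rows and values are hypothetical). The keys become the outer level of a MultiIndex, so each source can be selected again with `.loc`.

```python
import pandas as pd

# Hypothetical stand-ins for the frames returned by pd.read_json
reddit = pd.DataFrame({"title": ["Q1", "Q2"], "body": ["A1", "A2"]})
wocka = pd.DataFrame({"title": ["T1"], "body": ["B1"]})

# keys adds an outer index level that records where each row came from
toy = pd.concat([reddit, wocka], keys=["reddit", "wocka"])

print(toy.index.nlevels)               # number of index levels
print(toy.loc["reddit"])               # only the rows that came from reddit
print(toy.loc[("wocka", 0), "body"])   # one cell, addressed by (source, row)
```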


## 3: Data Preprocessing
As is very often the case with crawled data, the data from the different sources is structured and tagged quite differently.
### 3.1: Cleaning the text
Prepare the data so that the "body" column contains the full text of each joke and no formatting tokens (e.g. "\n") remain.

In [4]:
# Prepend the title to each reddit body; a single masked .loc call avoids chained assignment
is_reddit = data.index.get_level_values(0) == "reddit"
data.loc[is_reddit, "body"] = data.loc[is_reddit, "title"] + " " + data.loc[is_reddit, "body"]
data.loc["reddit"]


Unnamed: 0,body,category,id,rating,score,title
0,I hate how you cant even say black paint anymo...,,5tz52q,,1.0,I hate how you cant even say black paint anymore
1,What's the difference between a Jew in Nazi Ge...,,5tz4dd,,0.0,What's the difference between a Jew in Nazi Ge...
2,I recently went to America.... ...and being th...,,5tz319,,0.0,I recently went to America....
3,"Brian raises his hand and says, “He’s in Heave...",,5tz2wj,,1.0,"Brian raises his hand and says, “He’s in Heaven.”"
4,You hear about the University book store worke...,,5tz1pc,,0.0,You hear about the University book store worke...
...,...,...,...,...,...,...
40804,What do you call bad black paint made out of m...,,4n0hy6,,1.0,What do you call bad black paint made out of m...
40805,Where does a person with one leg work? IHOP,,4n0f4q,,2.0,Where does a person with one leg work?
40806,What is Obi Wan's favorite font? Droid Sans,,4n0ci8,,0.0,What is Obi Wan's favorite font?
40807,"Even though I hate it, math is special. It's t...",,4n09ud,,1.0,"Even though I hate it, math is special."


In [5]:
# Replace each line-break character in the wocka bodies with a space, again via one masked .loc call
is_wocka = data.index.get_level_values(0) == "wocka"
data.loc[is_wocka, "body"] = data.loc[is_wocka, "body"].str.replace(r"[\n\r]", " ", regex=True)
data.loc["wocka"]


Unnamed: 0,body,category,id,rating,score,title
0,What do you call a cow with no legs? Ground...,Animal,1,,,Cow With No Legs
1,What do you call a cow jumping over a barbed w...,Animal,2,,,Jumping Cow
2,What's black and white and red all over? A ...,Other / Misc,4,,,"Black, White and Red"
3,"So, this guy walks into a bar. And says, ""o...",Bar,5,,,Guy in a Bar
4,"If the opposite of pro is con, isn't the oppos...",One Liners,6,,,Progress
...,...,...,...,...,...,...
10014,(A man comes to my register with a mint chocol...,Men / Women,18196,,,Hell Hath No Fury Like A Pregnant Woman Scorned
10015,(I am shelving DVDs in a library when a man co...,Children,18197,,,"No Pranks, Just Thanks"
10016,"Me: ""That will be 17.50, please."" Customer:...",Religious,18198,,,Hell In A Handbag
10017,"Me: ""Sir, would you like to use any coupons to...",At Work,18199,,,A Good Ol' Fashioned A** Whoopin'
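
The cells above only touch the reddit and wocka rows, while the stupidstuff preview in section 3.2 still shows "\n" inside a body. A minimal sketch of one way to clean all sources in a single pass, assuming that every run of whitespace may be collapsed into one space; the helper name `clean_text` is made up for this illustration:

```python
import re

def clean_text(text):
    """Collapse every run of whitespace, line breaks included, into one space."""
    return re.sub(r"\s+", " ", str(text)).strip()

print(clean_text("Letter to Xerox and the Reply\n\nDear Kings of"))
```

On the full frame this could be applied with `data["body"] = data["body"].apply(clean_text)`. Unlike the wocka cell above, which replaces each line-break character with its own space, this variant leaves exactly one space between words.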


### 3.2: Tokenization
Use the nltk module to tokenize the jokes. Use nltk's RegexpTokenizer with a suitable regular expression so that only tokens consisting of words and numbers are kept (i.e. no punctuation). The tokens should contain lowercase letters only. Store the results in a column "tokens".

In [6]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

data["tokens"] = data.apply(lambda row: tokenizer.tokenize(str(row["body"]).lower()), axis=1)

In [7]:
data.loc["stupidstuff"]

Unnamed: 0,body,category,id,rating,score,title,tokens
0,A blackjack dealer and a player with a thirtee...,Children,1,2.63,,,"[a, blackjack, dealer, and, a, player, with, a..."
1,"At a dinner party, several of the guests were ...",Blonde Jokes,2,2.57,,,"[at, a, dinner, party, several, of, the, guest..."
2,One day this cop pulls over a blonde for speed...,Blonde Jokes,3,3.09,,,"[one, day, this, cop, pulls, over, a, blonde, ..."
3,Three women are about to be executed for crime...,Blonde Jokes,4,4.10,,,"[three, women, are, about, to, be, executed, f..."
4,A girl came skipping home FROM school one day....,Blonde Jokes,5,4.30,,,"[a, girl, came, skipping, home, from, school, ..."
...,...,...,...,...,...,...,...
3768,,Miscellaneous,3769,5.00,,,[]
3769,The Pope and the Queen of England are on the s...,Miscellaneous,3770,4.00,,,"[the, pope, and, the, queen, of, england, are,..."
3770,,Miscellaneous,3771,1.00,,,[]
3771,Letter to Xerox and the Reply\n\nDear Kings of...,Miscellaneous,3772,4.00,,,"[letter, to, xerox, and, the, reply, dear, kin..."
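
A quick way to see what the pattern `\w+` keeps and drops is to try it on a single sentence. The sketch below uses `re.findall`, which for this pattern should yield the same tokens as the RegexpTokenizer above; the sample sentence is made up.

```python
import re

sample = "That will be 17.50, please. Don't forget!"

# \w+ matches runs of letters, digits and underscores; everything else separates tokens
tokens = re.findall(r"\w+", sample.lower())
print(tokens)
# ['that', 'will', 'be', '17', '50', 'please', 'don', 't', 'forget']
```

The pattern splits decimal numbers and contractions, so "17.50" ends up as the two tokens "17" and "50".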


### 3.3 Stopword removal
Remove all English stopwords from the generated tokens.

In [8]:
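def remove_stopwords_from_list(list_in, language):
    # Load the stopwords once per call as a set instead of re-reading the list for every token
    stopwords = set(nltk.corpus.stopwords.words(language))
    return [a for a in list_in if a not in stopwords]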

In [9]:
data["tokens"] = data.parallel_apply(lambda row: remove_stopwords_from_list(row["tokens"], "english"), axis=1)
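
The filtering step itself is a plain membership test per token. A minimal sketch with a tiny, hand-picked stopword set (an assumption for illustration; the real list comes from nltk.corpus.stopwords and needs the "stopwords" corpus to be available locally). The helper name `drop_stopwords` is made up:

```python
# Tiny stand-in for the NLTK stopword list (assumption for this sketch)
STOPWORDS = {"a", "and", "with", "the", "of"}

def drop_stopwords(tokens, stopwords=STOPWORDS):
    # Keep only the tokens that are not in the stopword set
    return [tok for tok in tokens if tok not in stopwords]

print(drop_stopwords(["a", "blackjack", "dealer", "and", "a", "player"]))
# ['blackjack', 'dealer', 'player']
```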

### 3.4 POS Tagging
Determine POS tags for the tokenized texts and store them in a column "pos".


In [10]:
data["pos"] = data.parallel_apply(lambda row: nltk.pos_tag(row["tokens"]), axis=1)

In [11]:
data.loc["reddit"]["pos"][1][1][1]

'NN'

### 3.5 Lemmatization
Lemmatize the tokens of the texts and store the resulting lemmas in a column "lemmata".

BONUS: Take the part of speech of each token into account during lemmatization.

In [12]:
# Simple solution:
from nltk.stem import WordNetLemmatizer  
lemmatizer = WordNetLemmatizer()
data["lemmata"] = data.parallel_apply(lambda row: [lemmatizer.lemmatize(word) for word in row["tokens"]], axis=1)

In [13]:
data

Unnamed: 0,Unnamed: 1,body,category,id,rating,score,title,tokens,pos,lemmata
stupidstuff,0,A blackjack dealer and a player with a thirtee...,Children,1,2.63,,,"[blackjack, dealer, player, thirteen, count, h...","[(blackjack, NN), (dealer, NN), (player, NN), ...","[blackjack, dealer, player, thirteen, count, h..."
stupidstuff,1,"At a dinner party, several of the guests were ...",Blonde Jokes,2,2.57,,,"[dinner, party, several, guests, arguing, whet...","[(dinner, NN), (party, NN), (several, JJ), (gu...","[dinner, party, several, guest, arguing, wheth..."
stupidstuff,2,One day this cop pulls over a blonde for speed...,Blonde Jokes,3,3.09,,,"[one, day, cop, pulls, blonde, speeding, cop, ...","[(one, CD), (day, NN), (cop, VB), (pulls, NNS)...","[one, day, cop, pull, blonde, speeding, cop, g..."
stupidstuff,3,Three women are about to be executed for crime...,Blonde Jokes,4,4.10,,,"[three, women, executed, crimes, one, brunette...","[(three, CD), (women, NNS), (executed, VBD), (...","[three, woman, executed, crime, one, brunette,..."
stupidstuff,4,A girl came skipping home FROM school one day....,Blonde Jokes,5,4.30,,,"[girl, came, skipping, home, school, one, day,...","[(girl, NN), (came, VBD), (skipping, VBG), (ho...","[girl, came, skipping, home, school, one, day,..."
...,...,...,...,...,...,...,...,...,...,...
wocka,10014,(A man comes to my register with a mint chocol...,Men / Women,18196,,,Hell Hath No Fury Like A Pregnant Woman Scorned,"[man, comes, register, mint, chocolate, candy,...","[(man, NN), (comes, VBZ), (register, JJ), (min...","[man, come, register, mint, chocolate, candy, ..."
wocka,10015,(I am shelving DVDs in a library when a man co...,Children,18197,,,"No Pranks, Just Thanks","[shelving, dvds, library, man, comes, boy, app...","[(shelving, VBG), (dvds, JJ), (library, JJ), (...","[shelving, dvd, library, man, come, boy, appea..."
wocka,10016,"Me: ""That will be 17.50, please."" Customer:...",Religious,18198,,,Hell In A Handbag,"[17, 50, please, customer, christian, dear, as...","[(17, CD), (50, CD), (please, NN), (customer, ...","[17, 50, please, customer, christian, dear, as..."
wocka,10017,"Me: ""Sir, would you like to use any coupons to...",At Work,18199,,,A Good Ol' Fashioned A** Whoopin',"[sir, would, like, use, coupons, today, custom...","[(sir, NN), (would, MD), (like, VB), (use, NN)...","[sir, would, like, use, coupon, today, custome..."


In [None]:
# Lemmatization with word type from https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
# Lemmatize with POS Tag
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)


data["lemmata_word_type"] = data.parallel_apply(lambda row: [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in row["tokens"]], axis=1)
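
The helper above tags each word in isolation, which discards the surrounding tokens. Since the "pos" column from section 3.4 already holds tags computed per text, an alternative is to map those Penn Treebank tags to WordNet categories. A sketch under that assumption; `penn_to_wordnet` is a made-up helper, and the single-letter values are the ones behind `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV`:

```python
# WordNet part-of-speech codes, equal to wordnet.ADJ / NOUN / VERB / ADV in NLTK
TAG_MAP = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag such as 'VBD' to a WordNet POS code, defaulting to noun."""
    return TAG_MAP.get(tag[:1].upper(), "n")

tagged = [("women", "NNS"), ("executed", "VBD"), ("several", "JJ"), ("17", "CD")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('women', 'n'), ('executed', 'v'), ('several', 'a'), ('17', 'n')]
```

With the lemmatizer from above, a row could then be processed as `[lemmatizer.lemmatize(w, penn_to_wordnet(t)) for w, t in row["pos"]]`.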

### 3.6 Frequencies
In a new column "frequencies", add the frequency distribution of the lemmatized tokens for each text.

In [None]:
from nltk.probability import FreqDist
data["frequencies"] = data.parallel_apply(lambda row: FreqDist(row["lemmata_word_type"]), axis=1)
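
FreqDist is a subclass of Python's collections.Counter, so the values stored in "frequencies" behave like counters. A small sketch that uses Counter as a stand-in so that it runs without NLTK; the lemma list is made up:

```python
from collections import Counter

lemmata = ["cop", "pull", "blonde", "speeding", "cop", "blonde", "cop"]

freq = Counter(lemmata)        # FreqDist(lemmata) would count the same way
print(freq["cop"])             # 3
print(freq.most_common(2))     # [('cop', 3), ('blonde', 2)]
print(freq["unseen"])          # 0, missing keys count as zero
```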

## 4 Data Analysis
### 4.1 Overview of topics
Print the categories of the jokes crawled from Stupidstuff and Wocka as lists.

In [None]:
data.loc["stupidstuff"].category.unique().tolist()

In [None]:
data.loc["wocka"].category.unique().tolist()

### 4.2 Overview of numerical values

Print the average rating for each category of the Stupidstuff jokes, sorted in ascending order. Then use pandas to print an overview of descriptive statistics for the Stupidstuff jokes, again grouped by category.

In [None]:
data.loc["stupidstuff"].groupby("category")["rating"].mean().sort_values()

In [None]:
data.loc["stupidstuff"].groupby(["category"]).describe()
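
How grouping and sorting interact is easiest to see on a handful of rows. A sketch with made-up ratings: selecting the "rating" column before calling mean() keeps non-numeric columns out of the aggregation, and sort_values() on the resulting Series orders the categories by their mean rather than by name.

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["Bar", "Animal", "Bar", "Animal", "Misc"],
    "rating": [1.0, 4.0, 2.0, 3.0, 2.5],
})

# Mean rating per category, lowest first
mean_rating = toy.groupby("category")["rating"].mean().sort_values()
print(mean_rating)
```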

## 5 Bonus

Analyze or process an aspect of the dataset of your choice. For example, for each platform you could compute the share of jokes that contain a particular word or words from a particular word group, compute overall word frequencies for the different sources or genres, or work on any other question that can be answered without importing further modules.
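
One possible starting point for the first suggestion, sketched on a toy frame with the same two-level index as `data`: the share of jokes per source whose token list contains a given word. The frame contents and the helper name `share_containing` are made up.

```python
import pandas as pd

toy = pd.concat(
    [
        pd.DataFrame({"tokens": [["cow", "legs"], ["bar", "guy"]]}),
        pd.DataFrame({"tokens": [["cow"], ["cow", "fence"], ["font"]]}),
    ],
    keys=["wocka", "reddit"],
)

def share_containing(frame, word):
    # True/False per joke, then the mean of the booleans per source (outer index level)
    contains = frame["tokens"].apply(lambda tokens: word in tokens)
    return contains.groupby(level=0).mean()

shares = share_containing(toy, "cow")
print(shares)
```

The same function should work on the real frame as `share_containing(data, "cow")`, provided the "tokens" column from section 3.2 exists.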