# Analysis of the Coalition Agreement
## 1. Word Frequencies
In this notebook we use `NLTK` to process the parties' election programs and the coalition agreement. Each document is read in and tokenized with a regular expression that also drops punctuation. The tokens are then checked against a stop-word list to remove very common words such as *ich, und, wir, ...*. Finally, we build the word-frequency lists with `nltk.FreqDist` and store them in a `pandas.DataFrame` for further analysis.
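
The steps described above can be tried on a toy sentence before running them on the real documents. The following is a minimal, dependency-free sketch: `re.findall` stands in for `RegexpTokenizer(r'\w+')` and `collections.Counter` stands in for `nltk.FreqDist` (which is a `Counter` subclass). The sentence and the three-word stop list are made up for illustration and are not taken from the corpus.

```python
import re
from collections import Counter

# Hypothetical toy inputs, not taken from the party programs.
text = "Wir stärken die Schulen, und wir stärken die Kommunen."
stop_words = {"wir", "die", "und"}

# Same pattern as in the notebook: keep runs of word characters, drop punctuation.
tokens = re.findall(r"\w+", text)

# Case-insensitive stop-word filter, then lowercase the surviving tokens.
filtered = [t.lower() for t in tokens if t.lower() not in stop_words]

# Count how often each remaining token occurs.
freq = Counter(filtered)
print(freq.most_common())  # [('stärken', 2), ('schulen', 1), ('kommunen', 1)]
```

Lowercasing after filtering merges capitalized and lowercase spellings of the same word into one entry.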

In [88]:
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import requests
import pandas as pd

In [89]:
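# Retrieve the German stop-word list from GitHub
r = requests.get('https://github.com/stopwords-iso/stopwords-de/raw/master/stopwords-de.json')
r.raise_for_status()  # fail early if the download did not succeed
stop_words = set(r.json())  # a set makes the membership tests below fast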

In [90]:
def word_freq(filename, outputname):
    # Read the whole document (assumes the text file is UTF-8 encoded)
    with open(filename, 'r', encoding='utf-8') as f:
        corpus = f.read()

    # Tokenize on runs of word characters, which also drops punctuation
    tokens = RegexpTokenizer(r'\w+').tokenize(corpus)

    # Remove stop words (case-insensitive); tokens keep their original case
    filtered_sentence = [w for w in tokens if w.lower() not in stop_words]

    # Build the frequency table as a data frame
    df = pd.DataFrame.from_dict(nltk.FreqDist(filtered_sentence), orient='index')
    df.columns = ['Frequency']
    df.index.name = 'Term'

    # Export the data frame to CSV
    df.to_csv(outputname)

In [91]:
# First we define all document names, then we loop over them and create one CSV file per document.
files = [("data/fdp_b.txt", "fdp_freq.csv"), ("data/gruen_b.txt", "gruen_freq.csv"), ("data/spd_b.txt", "spd_freq.csv"), ("data/koav_b.txt", "koav_freq.csv")]
for filename, outputname in files:
    word_freq(filename, outputname)

In [92]:
# Combining all files into one large df in long format (long vs. wide)
# Long format is what tidyverse packages such as tidyr expect later on
def word_freq_long(filename, name):
    # Read the whole document (assumes the text file is UTF-8 encoded)
    with open(filename, 'r', encoding='utf-8') as f:
        corpus = f.read()

    # Tokenize on runs of word characters, which also drops punctuation
    tokens = RegexpTokenizer(r'\w+').tokenize(corpus)

    # Remove stop words and lowercase the remaining tokens
    filtered_sentence = [w.lower() for w in tokens if w.lower() not in stop_words]

    # Build the frequency table and tag every row with its source document
    df = pd.DataFrame.from_dict(nltk.FreqDist(filtered_sentence), orient='index')
    df.columns = ['Frequency']
    df.index.name = 'Term'
    df['Source'] = name

    return df

In [93]:
# Reading all text files, creating one df for all frequency lists
files = [("data/fdp_b.txt", "FDP"), ("data/gruen_b.txt", "GRUENE"), ("data/spd_b.txt", "SPD"), ("data/koav_b.txt", "KOALITION")]

# DataFrame.append was removed in pandas 2.0, so the frames are combined with pd.concat
df = pd.concat([word_freq_long(filename, name) for filename, name in files])

# Export to CSV
df.to_csv("all_freq.csv")

In [94]:
# Printing the top 10 terms as a LaTeX Table for use in the presentation
print(df.sort_values("Frequency", ascending=False)[0:10].to_latex())

\begin{tabular}{lrl}
\toprule
{} &  Frequency &     Source \\
Term         &            &            \\
\midrule
innen        &        451 &     GRUENE \\
freie        &        386 &        FDP \\
demokraten   &        375 &        FDP \\
stärken      &        238 &  KOALITION \\
eu           &        181 &     GRUENE \\
stärken      &        181 &     GRUENE \\
unterstützen &        166 &  KOALITION \\
setzen       &        156 &  KOALITION \\
eu           &        148 &  KOALITION \\
innen        &        136 &        SPD \\
\bottomrule
\end{tabular}
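
In the table above, `innen` is the most frequent term for GRUENE and also appears for SPD. One plausible explanation, which this notebook does not verify, is that gender-inclusive spellings such as `Bürger*innen` or `Bürger:innen` are split by the `\w+` pattern, because `*` and `:` are not word characters. The sketch below shows the effect on a made-up sentence and one possible alternative pattern; both the sentence and the alternative pattern are assumptions for illustration.

```python
import re

# Hypothetical sample using two common gender-inclusive spellings.
sample = "Bürger*innen und Schüler:innen stärken"

# The notebook's pattern splits at '*' and ':', so "innen" becomes its own token.
plain = re.findall(r"\w+", sample)
print(plain)  # ['Bürger', 'innen', 'und', 'Schüler', 'innen', 'stärken']

# Assumed alternative: allow a single '*' or ':' between two runs of word characters.
joined = re.findall(r"\w+(?:[*:]\w+)*", sample)
print(joined)  # ['Bürger*innen', 'und', 'Schüler:innen', 'stärken']
```

The alternative pattern would also join other colon-separated strings such as times (`10:30`), so whether it is appropriate depends on the documents.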



### 1.2 Relative Word Frequencies
The results above are not yet very informative, because they only report absolute counts. The election programs differ in length, so it is more meaningful to relate each term's frequency to the total number of tokens in its document (counted after stop-word removal). We therefore compute: $\text{Relative Frequency (\%)} = \frac{\text{Frequency}}{\text{Total Words}} \times 100$
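
To illustrate why the normalization matters, here is a small sketch with two made-up token lists of different lengths. The term `eu` occurs twice in both, but its relative frequency differs. The document names and tokens are hypothetical, and `collections.Counter` again stands in for `nltk.FreqDist`.

```python
from collections import Counter

# Hypothetical token lists of different lengths (already stop-word filtered).
docs = {
    "SHORT": ["eu", "klima", "eu", "bildung"],
    "LONG": ["eu", "klima", "eu", "bildung", "steuern", "rente", "wohnen", "digital"],
}

relative = {}
for name, tokens in docs.items():
    freq = Counter(tokens)
    total = sum(freq.values())  # total number of tokens in this document
    # Relative frequency in percent for every term of this document.
    relative[name] = {term: count / total * 100 for term, count in freq.items()}

print(relative["SHORT"]["eu"])  # 50.0
print(relative["LONG"]["eu"])   # 25.0
```

Within each document the relative frequencies add up to 100, which is a quick sanity check for the real data as well.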

In [104]:
def word_relative_freq_long(filename, name):
    # Read the whole document (assumes the text file is UTF-8 encoded)
    with open(filename, 'r', encoding='utf-8') as f:
        corpus = f.read()

    # Tokenize on runs of word characters, which also drops punctuation
    tokens = RegexpTokenizer(r'\w+').tokenize(corpus)

    # Remove stop words and lowercase the remaining tokens
    filtered_sentence = [w.lower() for w in tokens if w.lower() not in stop_words]

    # Build the frequency table
    freq = nltk.FreqDist(filtered_sentence)
    total_tokens = sum(freq.values())  # total number of tokens left after stop-word removal
    df = pd.DataFrame.from_dict(freq, orient='index')
    df.columns = ['Frequency']
    df.index.name = 'Term'
    df['Source'] = name
    df['Relative'] = (df['Frequency'] / total_tokens) * 100  # new column `Relative` (*100 for percentage)

    return df

In [105]:
# Reading all text files, creating one df for all frequency lists
files = [("data/fdp_b.txt", "FDP"), ("data/gruen_b.txt", "GRUENE"), ("data/spd_b.txt", "SPD"), ("data/koav_b.txt", "KOALITION")]

# DataFrame.append was removed in pandas 2.0, so the frames are combined with pd.concat
df = pd.concat([word_relative_freq_long(filename, name) for filename, name in files])

# Export to CSV (same file name as before, now including the `Relative` column)
df.to_csv("all_freq.csv")

In [106]:
# Printing the top 10 terms as a LaTeX Table for use in the presentation
print(df.sort_values("Relative", ascending=False)[0:10].to_latex())

\begin{tabular}{lrlr}
\toprule
{} &  Frequency &     Source &  Relative \\
Term         &            &            &           \\
\midrule
freie        &        386 &        FDP &  2.070593 \\
demokraten   &        375 &        FDP &  2.011587 \\
innen        &        451 &     GRUENE &  1.315981 \\
innen        &        136 &        SPD &  1.169088 \\
stärken      &        238 &  KOALITION &  0.879657 \\
eu           &        128 &        FDP &  0.686622 \\
unterstützen &        166 &  KOALITION &  0.613542 \\
fordern      &        110 &        FDP &  0.590065 \\
setzen       &        156 &  KOALITION &  0.576582 \\
deutschland  &        106 &        FDP &  0.568609 \\
\bottomrule
\end{tabular}

