# Assignment 6 - Text Classification Pipeline

In this task you will apply the concepts learned in the last sessions:

- Feature Extraction for Text data
- scikit-learn pipelines
- ML Model Evaluation
- Cross-Validation

Implementing all of this from scratch is beyond the scope of an exercise, so we will use existing libraries, in particular ``scikit-learn``.

We will also use ``pandas``, a library for tabular data; our data will be stored as a pandas DataFrame. You don't need to understand much about this format, except how to retrieve a column from it: just as with a dictionary, ``df[colname]`` returns the values of that column.
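
As a minimal sketch of column access, the toy DataFrame below reuses the column names ``text`` and ``partei`` from the real data; the rows themselves are made up for illustration.

```python
import pandas as pd

# Toy data: the column names mirror the real dataset, the values are invented.
toy_df = pd.DataFrame({
    "text": ["first speech", "second speech", "third speech"],
    "partei": ["spd", "gruene", "spd"],
})

labels = toy_df["partei"]               # a pandas Series, indexed like the DataFrame
label_array = toy_df["partei"].values   # the underlying NumPy array

print(list(labels))
print(label_array.shape)
```

The ``.values`` attribute is what the solution below uses to hand plain arrays to scikit-learn.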

The first cell downloads the text data, a collection of speeches held in the German parliament, the *Bundestag*, and draws a random sample of 10,000 of them.

Your task will be to predict party affiliation from the speech text. 

You will do that by building a scikit-learn pipeline and using grid search to optimize the n-gram range of the text featurizer.
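
To make "n-gram range" concrete, here is a small sketch that shows which word n-grams ``TfidfVectorizer`` keeps for a few ranges. The sentence and the helper ``ngram_vocabulary`` are made up for this illustration; a range ``(a, b)`` means "all n-grams with between ``a`` and ``b`` words".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["the cat sat"]  # invented sentence, not taken from the dataset


def ngram_vocabulary(ngram_range):
    """Return the sorted n-grams that TfidfVectorizer learns from toy_corpus."""
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    vectorizer.fit(toy_corpus)
    return sorted(vectorizer.vocabulary_)


print(ngram_vocabulary((1, 1)))  # single words only
print(ngram_vocabulary((1, 2)))  # single words and word pairs
print(ngram_vocabulary((2, 2)))  # word pairs only
```

Longer n-grams capture more context but make the feature space much larger and sparser, which is why the range is worth tuning.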

To check your code, compare the classification reports it produces with those of the solution.

In [1]:
# !pip install pandas

import os, gzip
import pandas as pd
import numpy as np
import urllib.request

np.random.seed(0)

import warnings
warnings.filterwarnings('ignore')

DATADIR = "data"

if not os.path.exists(DATADIR): 
    os.mkdir(DATADIR)

file_name = os.path.join(DATADIR, 'bundestags_parlamentsprotokolle.csv.gzip')
if not os.path.exists(file_name):
    url_data = 'https://www.dropbox.com/s/1nlbfehnrwwa2zj/bundestags_parlamentsprotokolle.csv.gzip?dl=1'
    urllib.request.urlretrieve(url_data, file_name)

df = pd.read_csv(gzip.open(file_name), index_col=0).sample(n=10000)

We can inspect the first four rows of the DataFrame like this:

In [2]:
df[:4]

| | sitzung | wahlperiode | sprecher | text | partei |
|---|---|---|---|---|---|
| 17908 | 201 | 17 | Ulrich Lange | Es ist vielmehr – das erlaube ich mir hier sch... | cducsu |
| 42100 | 226 | 18 | Doris Wagner | Ein weiteres Finanzierungsinstrument, um diese... | gruene |
| 18777 | 209 | 17 | Heidrun Bluhm | In den letzten zwei Jahrzehnten sind immer meh... | linke |
| 14589 | 168 | 17 | Dr. Egon Jüttner | – Ja. – Von rund 80 000 bis 100 000 Herero im ... | cducsu |


Your solution will have to implement the following parts:

- Split the data into a training set (80%) and a test set (20%). You can use the sklearn function ``train_test_split`` for this.

- Build a pipeline with a [``TfidfVectorizer``](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and a [``NearestCentroid``](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestCentroid.html) classifier

- Train the pipeline inside a [``GridSearchCV``](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) object on the training data. Try to find the best ``ngram_range`` for the vectorizer.

- Evaluate F1, precision, recall on the test data. You can use the sklearn function ``classification_report`` for that.
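
For intuition about the classifier in the second step: with its default Euclidean metric, a nearest-centroid classifier represents each class by the mean of its training vectors and assigns a new sample to the class with the closest mean. The NumPy sketch below illustrates that idea on made-up 2-D points; the helper names are hypothetical and this is not scikit-learn's implementation.

```python
import numpy as np


def fit_centroids(X, y):
    """Compute one centroid (mean vector) per class."""
    classes = np.unique(y)
    centroids = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict_nearest_centroid(X, classes, centroids):
    """Assign each row of X to the class whose centroid is closest."""
    # distances[i, j] is the Euclidean distance from sample i to centroid j
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[distances.argmin(axis=1)]


# Invented toy data: two well-separated classes in two dimensions.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 4.0]])
y_toy = np.array(["a", "a", "b", "b"])

classes, centroids = fit_centroids(X_toy, y_toy)
pred = predict_nearest_centroid(np.array([[0.5, 0.5], [4.0, 5.0]]), classes, centroids)
print(centroids)
print(pred)
```

In the assignment the vectors are TF-IDF features of the speeches instead of 2-D points, but the decision rule is the same.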

In [3]:
# uncomment these lines to install the required dependencies.
# !pip install numpy
# !pip install matplotlib
# !pip install scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.metrics import confusion_matrix, classification_report



In [4]:
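parties = df["partei"].values
statements = df["text"].values

# Hold out 20% of the speeches as a test set.
train_data, test_data, train_labels, test_lables = train_test_split(statements, parties, test_size=0.2, shuffle=True)

# Candidate n-gram ranges; the key has the form "<pipeline step name>__<parameter name>".
ngram_ranges = {"tfidfvectorizer__ngram_range": [(1,1),(1,2),(1,3),(2,2),(2,3),(3,3)]}

# Cross-validated grid search over the pipeline, using all available cores.
clf = GridSearchCV(make_pipeline(TfidfVectorizer(), NearestCentroid()), ngram_ranges, n_jobs=-1)
clf.fit(train_data, train_labels)

# Keep only the pipeline that was refitted with the best n-gram range.
clf = clf.best_estimator_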
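
To get a feeling for how much work the grid search does: every candidate n-gram range is scored on every cross-validation fold, and the best candidate is then refitted on the full training set. The sketch below counts these fits with ``ParameterGrid`` and ``StratifiedKFold``. It assumes 5 stratified folds, which is what recent scikit-learn versions use for classifiers when ``cv`` is not given; the labels and documents are made up.

```python
from sklearn.model_selection import ParameterGrid, StratifiedKFold

ngram_ranges = {"tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]}
candidates = list(ParameterGrid(ngram_ranges))

# Invented labels and documents, only used to show how the folds are formed.
toy_labels = ["spd"] * 10 + ["gruene"] * 5
toy_docs = [f"doc {i}" for i in range(len(toy_labels))]

n_splits = 5  # assumption: the default number of folds when cv is not specified
folds = list(StratifiedKFold(n_splits=n_splits).split(toy_docs, toy_labels))

print(len(candidates), "candidates x", len(folds), "folds =", len(candidates) * len(folds), "fits")
for train_idx, val_idx in folds:
    # Each fold validates on a different slice and trains on the rest.
    print(len(train_idx), "training documents,", len(val_idx), "validation documents")
```

With six candidate ranges this amounts to 30 fits plus one final refit, which is why ``n_jobs=-1`` is worth setting on the real data.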

### Comparison with the training dataset

In [5]:
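ncc_predictions = clf.predict(train_data)
# Note: classification_report expects (y_true, y_pred). The predictions are passed first here,
# so in the report below precision and recall are swapped and support counts predicted labels.
print(classification_report(ncc_predictions, train_labels))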

              precision    recall  f1-score   support

      cducsu       0.97      0.97      0.97      2911
         fdp       1.00      1.00      1.00       610
      gruene       0.98      0.96      0.97      1256
       linke       0.99      0.94      0.96      1119
         spd       0.96      0.99      0.97      2104

    accuracy                           0.97      8000
   macro avg       0.98      0.97      0.97      8000
weighted avg       0.97      0.97      0.97      8000



### Comparison with the test dataset

In [6]:
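ncc_predictions_test = clf.predict(test_data)
# Note: classification_report expects (y_true, y_pred). The predictions are passed first here,
# so in the report below precision and recall are swapped and support counts predicted labels.
print(classification_report(ncc_predictions_test, test_lables))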

              precision    recall  f1-score   support

      cducsu       0.74      0.55      0.63       971
         fdp       0.08      0.42      0.13        31
      gruene       0.28      0.32      0.30       240
       linke       0.42      0.57      0.48       202
         spd       0.42      0.42      0.42       556

    accuracy                           0.48      2000
   macro avg       0.38      0.46      0.39      2000
weighted avg       0.55      0.48      0.51      2000
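
When reading such a report, keep in mind that precision, recall, and F1 are computed per class, and that ``classification_report`` takes the true labels first and the predictions second. The pure-Python sketch below uses a hypothetical helper and made-up labels to show how the three numbers are defined, and that exchanging the two arguments exchanges precision and recall while F1 stays the same.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall, and F1 for a single class label."""
    true_positives = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # how often the label was predicted
    actual = sum(t == label for t in y_true)     # how often the label is correct
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented labels: "fdp" is rare in the truth but predicted often.
y_true = ["spd", "spd", "spd", "spd", "fdp"]
y_pred = ["spd", "spd", "fdp", "fdp", "fdp"]

correct_order = precision_recall_f1(y_true, y_pred, "fdp")
swapped_order = precision_recall_f1(y_pred, y_true, "fdp")
print("true labels first: ", correct_order)
print("predictions first: ", swapped_order)
```

In the toy data the rare class gets low precision and perfect recall in the correct order, and the reverse when the arguments are exchanged.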



In [7]:
print(clf)

Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(ngram_range=(2, 2))),
                ('nearestcentroid', NearestCentroid())])
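
The step names in this printout explain the key used in the parameter grid: ``make_pipeline`` names each step after its lowercased class name, and a parameter of a step is addressed as ``<step name>__<parameter>``. A short sketch that lists these names for an unfitted pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(TfidfVectorizer(), NearestCentroid())

# Step names are generated automatically from the class names.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# get_params() lists every tunable parameter under its "<step>__<parameter>" key.
ngram_keys = [key for key in pipe.get_params() if key.endswith("ngram_range")]
print(ngram_keys)
```

If you build the pipeline with ``Pipeline([...])`` and your own step names instead, the grid keys have to use those names.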
