## Text classification for German texts 1
Our news dataset from the text-similarity exercises unfortunately contains no categories, so supervised classification is not possible with it.

I have therefore made another dataset available: 10kGNAD (Ten Thousand German News Articles Dataset).

Use this dataset to transfer text classification to German. Adapt Sarkar's code accordingly. The following restrictions simplify the task somewhat:

- Start with the (smaller) test dataset. This should make the computations run a bit faster.
- If everything works, you can additionally use the training dataset: either as already split (training, test), or merged first and then split in the program.
- You do not have to implement all methods; choose two.
- Start without hyperparameter tuning.
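
The second option in the list (merge first, then split in the program) could look roughly like the sketch below. The helper name `merge_and_split` and the toy frames are hypothetical; in practice the two frames would come from the training and test files of the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def merge_and_split(train_df, test_df, test_size=0.33, seed=42):
    """Concatenate two labelled frames and draw a fresh random split."""
    full = pd.concat([train_df, test_df], ignore_index=True)
    return train_test_split(full, test_size=test_size, random_state=seed)


# Toy stand-ins for the two files (made-up content, hypothetical names).
toy_train = pd.DataFrame({"Target Name": ["Sport", "Web", "Sport", "Web"],
                          "Article": ["a", "b", "c", "d"]})
toy_test = pd.DataFrame({"Target Name": ["Sport", "Web"],
                         "Article": ["e", "f"]})

new_train, new_test = merge_and_split(toy_train, toy_test)
print(len(new_train), len(new_test))
```

Using the provided split as is keeps results comparable with other work on the same dataset; re-splitting is mainly convenient when the rest of the notebook already expects a single DataFrame.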

In [None]:
import pandas as pd
import csv
from Normalisieren_DE import normalize_corpus
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [13]:
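# each line has the form <category>;<article text>, so split only at the first ';'
with open('test.csv', 'r', encoding='utf-8') as f:
    data = f.readlines()

split_data = []
for line in data:
    # drop the trailing newline so it does not end up in the article text
    split_data.append(line.rstrip('\n').split(';', 1))
df = pd.DataFrame(split_data, columns=["Target Name", "Article"])
df.head()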

Unnamed: 0,Target Name,Article
0,Wirtschaft,"'Die Gewerkschaft GPA-djp lanciert den ""All-in..."
1,Sport,Franzosen verteidigen 2:1-Führung – Kritische ...
2,Web,'Neues Video von Designern macht im Netz die R...
3,Sport,23-jähriger Brasilianer muss vier Spiele pausi...
4,International,Aufständische verwendeten Chemikalie bei Gefec...
5,Web,Bewährungs- und Geldstrafe für 26-Jährigen weg...
6,Sport,ÖFB-Teamspieler nur sechs Minuten nach seinem ...
7,Panorama,Ein 31-jähriger Polizist soll einer 42-Jährige...
8,International,18 Menschen verschleppt. Kabul – Nach einem Hu...
9,Web,Deutschland und Frankreich am stärksten von Lo...
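
Some articles in the preview start with a stray `'`, which suggests that the file quotes texts with single quotes. As a hedged alternative to splitting lines by hand, the sketch below uses the standard `csv` module. The delimiter `;` and the quote character `'` are assumptions about the file format, and `read_articles` is a hypothetical helper name.

```python
import csv
import io

import pandas as pd


def read_articles(handle):
    """Parse '<label>;<text>' rows; assumes ';' as delimiter and ' as quote character."""
    reader = csv.reader(handle, delimiter=';', quotechar="'")
    rows = [row for row in reader if len(row) == 2]  # skip malformed lines
    return pd.DataFrame(rows, columns=["Target Name", "Article"])


# Two made-up lines in the assumed format (not taken from the dataset).
sample = io.StringIO("Sport;Ein kurzer Text\nWeb;'Text mit; Semikolon'\n")
toy_df = read_articles(sample)
print(toy_df)
```

With a real file this would be called as `read_articles(open('test.csv', encoding='utf-8', newline=''))`, ideally inside a `with` block.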


In [70]:
# note: set() has no guaranteed order, so this numbering can differ between runs
data_labels_map = dict(enumerate(set(df['Target Name'].values)))
data_labels_map

{0: 'Wirtschaft',
 1: 'Wissenschaft',
 2: 'Etat',
 3: 'Web',
 4: 'Sport',
 5: 'International',
 6: 'Kultur',
 7: 'Inland',
 8: 'Panorama'}

In [14]:
norm_corpus = normalize_corpus(df['Article'], contraction_expansion=False)
df['Clean Article'] = norm_corpus
df.head()

Unnamed: 0,Target Name,Article,Clean Article
0,Wirtschaft,"'Die Gewerkschaft GPA-djp lanciert den ""All-in...",Gewerkschaft gpadjp lancieren allinrechner fin...
1,Sport,Franzosen verteidigen 2:1-Führung – Kritische ...,Franzosen verteidigen fuehrung kritisch Stimme...
2,Web,'Neues Video von Designern macht im Netz die R...,neu Video Designer Netz Runde schlagen etwa bu...
3,Sport,23-jähriger Brasilianer muss vier Spiele pausi...,jaehriger Brasilianer vier Spiele pausieren En...
4,International,Aufständische verwendeten Chemikalie bei Gefec...,Aufstaendisch verwenden Chemikalie Gefecht Aug...


In [15]:
# did the cleaning leave any empty documents? replace whitespace-only strings with NaN
df = df.replace(r'^\s*$', np.nan, regex=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028 entries, 0 to 1027
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Target Name    1028 non-null   object
 1   Article        1028 non-null   object
 2   Clean Article  1028 non-null   object
dtypes: object(3)
memory usage: 24.2+ KB


In [16]:
# save the cleaned data
df.to_csv('clean_articles.csv', index=False)


In [20]:
# split the documents: 33% test data
from sklearn.model_selection import train_test_split

train_corpus, test_corpus, train_label_names, test_label_names = train_test_split(
    np.array(df['Clean Article']), np.array(df['Target Name']),
    test_size=0.33, random_state=42)

train_corpus.shape, test_corpus.shape

((688,), (340,))

In [21]:
# how are the documents distributed across the news categories?
# count them: are the labels well balanced between training and test data?
from collections import Counter

trd = dict(Counter(train_label_names))
tsd = dict(Counter(test_label_names))

(pd.DataFrame([[key, trd[key], tsd[key]] for key in trd],
             columns=['Target Label', 'Train Count', 'Test Count'])
.sort_values(by=['Train Count', 'Test Count'],
             ascending=False))

Unnamed: 0,Target Label,Train Count,Test Count
3,Panorama,122,46
7,Web,104,64
1,International,101,50
6,Wirtschaft,97,44
2,Sport,75,45
0,Inland,71,31
8,Wissenschaft,45,12
5,Etat,43,24
4,Kultur,30,24
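
The counts above are not proportional across the two parts (for example Kultur 30/24 versus Wissenschaft 45/12). One possible remedy, not used in this notebook, is a stratified split via the `stratify` argument of `train_test_split`. The sketch below uses made-up labels to show the effect.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 4:1 class imbalance (made up for illustration).
labels = np.array(["Web"] * 32 + ["Kultur"] * 8)
texts = np.array([f"doc {i}" for i in range(len(labels))])

tr_x, te_x, tr_y, te_y = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# Both parts keep the 4:1 ratio of the full data.
print(Counter(tr_y), Counter(te_y))
```

Applying this to the real data would change the split and therefore all scores reported below, so the numbers in this notebook refer to the unstratified split.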


### CountVectorizer

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# build BOW features on train articles
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0)
cv_train_features = cv.fit_transform(train_corpus)

In [23]:
# transform test articles into features
# HR: the vocabulary fitted on the training data is reused
cv_test_features = cv.transform(test_corpus)

In [24]:
print('BOW model:> Train features shape:', cv_train_features.shape, ' Test features shape:', cv_test_features.shape)

BOW model:> Train features shape: (688, 32075)  Test features shape: (340, 32075)


# Bag of Words
without weighting of the individual terms

### Naive Bayes

In [25]:
# HR first round of algorithms: based on BoW (here: cv)

from sklearn.naive_bayes import MultinomialNB

# hyperparameter alpha: additive smoothing, so that words that are rare or absent in a class are not penalized too harshly
mnb = MultinomialNB(alpha=1)
mnb.fit(cv_train_features, train_label_names)
mnb_bow_cv_scores = cross_val_score(mnb, cv_train_features, train_label_names, cv=5)  # 5-fold cross-validation
mnb_bow_cv_mean_score = np.mean(mnb_bow_cv_scores)
print('CV Accuracy (5-fold):', mnb_bow_cv_scores)
print('Mean CV Accuracy:', mnb_bow_cv_mean_score)
mnb_bow_test_score = mnb.score(cv_test_features, test_label_names)
print('Test Accuracy:', mnb_bow_test_score)  # on the test data

# roughly 2/3 accuracy (about 69%) with a very fast algorithm

CV Accuracy (5-fold): [0.74637681 0.63768116 0.65217391 0.72992701 0.68613139]
Mean CV Accuracy: 0.6904580556437109
Test Accuracy: 0.6882352941176471


### Logistic Regression

In [26]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2', max_iter=100, C=1, random_state=42)
lr.fit(cv_train_features, train_label_names)
lr_bow_cv_scores = cross_val_score(lr, cv_train_features, train_label_names, cv=5)
lr_bow_cv_mean_score = np.mean(lr_bow_cv_scores)
print('CV Accuracy (5-fold):', lr_bow_cv_scores)
print('Mean CV Accuracy:', lr_bow_cv_mean_score)
lr_bow_test_score = lr.score(cv_test_features, test_label_names)
print('Test Accuracy:', lr_bow_test_score)

# comparable result, but much longer computation time

CV Accuracy (5-fold): [0.65217391 0.68115942 0.64492754 0.74452555 0.67883212]
Mean CV Accuracy: 0.6803237067597588
Test Accuracy: 0.7176470588235294


### Support Vector

In [27]:
from sklearn.svm import LinearSVC

svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(cv_train_features, train_label_names)
svm_bow_cv_scores = cross_val_score(svm, cv_train_features, train_label_names, cv=5)
svm_bow_cv_mean_score = np.mean(svm_bow_cv_scores)
print('CV Accuracy (5-fold):', svm_bow_cv_scores)
print('Mean CV Accuracy:', svm_bow_cv_mean_score)
svm_bow_test_score = svm.score(cv_test_features, test_label_names)
print('Test Accuracy:', svm_bow_test_score)

# again similar accuracy, and faster than LR

CV Accuracy (5-fold): [0.68115942 0.66666667 0.63043478 0.66423358 0.67883212]
Mean CV Accuracy: 0.6642653125991748
Test Accuracy: 0.7058823529411765


In [28]:
# SVM with Stochastic Gradient Descent
from sklearn.linear_model import SGDClassifier

svm_sgd = SGDClassifier(loss='hinge', penalty='l2', max_iter=5, random_state=42)
svm_sgd.fit(cv_train_features, train_label_names)
svmsgd_bow_cv_scores = cross_val_score(svm_sgd, cv_train_features, train_label_names, cv=5)
svmsgd_bow_cv_mean_score = np.mean(svmsgd_bow_cv_scores)
print('CV Accuracy (5-fold):', svmsgd_bow_cv_scores)
print('Mean CV Accuracy:', svmsgd_bow_cv_mean_score)
svmsgd_bow_test_score = svm_sgd.score(cv_test_features, test_label_names)
print('Test Accuracy:', svmsgd_bow_test_score)

CV Accuracy (5-fold): [0.70289855 0.60144928 0.65217391 0.65693431 0.64963504]
Mean CV Accuracy: 0.6526182164392257
Test Accuracy: 0.711764705882353


### Random Forest

In [29]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=10, random_state=42)  # number of trees in the forest
rfc.fit(cv_train_features, train_label_names)
rfc_bow_cv_scores = cross_val_score(rfc, cv_train_features, train_label_names, cv=5)
rfc_bow_cv_mean_score = np.mean(rfc_bow_cv_scores)
print('CV Accuracy (5-fold):', rfc_bow_cv_scores)
print('Mean CV Accuracy:', rfc_bow_cv_mean_score)
rfc_bow_test_score = rfc.score(cv_test_features, test_label_names)
print('Test Accuracy:', rfc_bow_test_score)

# clearly worse results (about 50% accuracy)

CV Accuracy (5-fold): [0.54347826 0.5        0.52173913 0.45255474 0.52554745]
Mean CV Accuracy: 0.5086639162170739
Test Accuracy: 0.4970588235294118


In [30]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=10, random_state=42)
gbc.fit(cv_train_features, train_label_names)
gbc_bow_cv_scores = cross_val_score(gbc, cv_train_features, train_label_names, cv=5)
gbc_bow_cv_mean_score = np.mean(gbc_bow_cv_scores)
print('CV Accuracy (5-fold):', gbc_bow_cv_scores)
print('Mean CV Accuracy:', gbc_bow_cv_mean_score)
gbc_bow_test_score = gbc.score(cv_test_features, test_label_names)
print('Test Accuracy:', gbc_bow_test_score)

CV Accuracy (5-fold): [0.65217391 0.54347826 0.55072464 0.57664234 0.58394161]
Mean CV Accuracy: 0.5813921506400085
Test Accuracy: 0.5970588235294118
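
Every model cell above repeats the same three steps: cross-validate on the training features, fit, and score on the test features. A small helper can remove that repetition. The function name `evaluate` and the toy count matrix are hypothetical and only sketch the idea.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB


def evaluate(model, train_X, train_y, test_X, test_y, folds=5):
    """Return (mean CV accuracy on the training data, accuracy on the test data)."""
    cv_scores = cross_val_score(model, train_X, train_y, cv=folds)
    model.fit(train_X, train_y)
    return float(np.mean(cv_scores)), float(model.score(test_X, test_y))


# Tiny made-up count matrix with two well-separated classes.
X = np.array([[3, 0], [2, 0], [4, 1], [3, 1], [0, 3], [0, 2], [1, 4], [1, 3]])
y = np.array(["Sport"] * 4 + ["Web"] * 4)

# For brevity the toy example scores on the same data it was fitted on.
cv_mean, test_acc = evaluate(MultinomialNB(), X, y, X, y, folds=2)
print(cv_mean, test_acc)
```

In the notebook the call would take the existing feature matrices, for example `evaluate(MultinomialNB(alpha=1), cv_train_features, train_label_names, cv_test_features, test_label_names)`.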


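# Now with TF-IDF

TF-IDF weights terms by how distinctive they are across documents instead of using raw counts only. With default settings this clearly helps the linear SVMs here, while Naive Bayes and logistic regression score lower than with plain counts (see the results overview).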

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer


tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
# for inspection only: fit on the test corpus to look at the TF-IDF matrix;
# TfidfVectorizer combines term counting (TF) and IDF weighting in one step
tv_matrix = tv.fit_transform(test_corpus)
tv_matrix = tv_matrix.toarray()

vocab = tv.get_feature_names_out()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)

Unnamed: 0,ab,abaaoud,abbau,abbauen,abbilden,abbildung,abblitzen,abbrechen,abc,abd,...,zwischenstation,zwischenstopp,zwischenzeit,zwischenzeitlich,zwoelf,zwoelft,zwoelfte,zyklus,zynisch,zynismus
0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0
1,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0
2,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0
3,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0
4,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
335,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0
336,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0
337,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.06,0.0,0.0,0.0,0.0,0.0
338,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.03,0.0,0.00,0.0,0.0,0.0,0.0,0.0


In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer

# build TF-IDF features on train articles
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0)
tv_train_features = tv.fit_transform(train_corpus)

In [37]:
# transform test articles into features
tv_test_features = tv.transform(test_corpus)

In [44]:
print('TFIDF model:> Train features shape:', tv_train_features.shape, ' Test features shape:', tv_test_features.shape)

TFIDF model:> Train features shape: (688, 32075)  Test features shape: (340, 32075)


### Naive Bayes

In [39]:
mnb = MultinomialNB(alpha=1)
mnb.fit(tv_train_features, train_label_names)
mnb_tfidf_cv_scores = cross_val_score(mnb, tv_train_features, train_label_names, cv=5)
mnb_tfidf_cv_mean_score = np.mean(mnb_tfidf_cv_scores)
print('CV Accuracy (5-fold):', mnb_tfidf_cv_scores)
print('Mean CV Accuracy:', mnb_tfidf_cv_mean_score)
mnb_tfidf_test_score = mnb.score(tv_test_features, test_label_names)
print('Test Accuracy:', mnb_tfidf_test_score)

CV Accuracy (5-fold): [0.50724638 0.44927536 0.47101449 0.46715328 0.51824818]
Mean CV Accuracy: 0.4825875383476145
Test Accuracy: 0.5088235294117647


### Logistic Regression

In [40]:
lr = LogisticRegression(penalty='l2', max_iter=100, C=1, random_state=42)
lr.fit(tv_train_features, train_label_names)
lr_tfidf_cv_scores = cross_val_score(lr, tv_train_features, train_label_names, cv=5)
lr_tfidf_cv_mean_score = np.mean(lr_tfidf_cv_scores)
print('CV Accuracy (5-fold):', lr_tfidf_cv_scores)
print('Mean CV Accuracy:', lr_tfidf_cv_mean_score)
lr_tfidf_test_score = lr.score(tv_test_features, test_label_names)
print('Test Accuracy:', lr_tfidf_test_score)

CV Accuracy (5-fold): [0.64492754 0.5942029  0.5942029  0.6350365  0.58394161]
Mean CV Accuracy: 0.6104622871046228
Test Accuracy: 0.6558823529411765


### Support Vector

In [42]:
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(tv_train_features, train_label_names)
svm_tfidf_cv_scores = cross_val_score(svm, tv_train_features, train_label_names, cv=5)
svm_tfidf_cv_mean_score = np.mean(svm_tfidf_cv_scores)
print('CV Accuracy (5-fold):', svm_tfidf_cv_scores)
print('Mean CV Accuracy:', svm_tfidf_cv_mean_score)
svm_tfidf_test_score = svm.score(tv_test_features, test_label_names)
print('Test Accuracy:', svm_tfidf_test_score)

CV Accuracy (5-fold): [0.77536232 0.73913043 0.73913043 0.75182482 0.70072993]
Mean CV Accuracy: 0.7412355865862689
Test Accuracy: 0.8147058823529412


In [43]:
svm_sgd = SGDClassifier(loss='hinge', penalty='l2', max_iter=5, random_state=42)
svm_sgd.fit(tv_train_features, train_label_names)
svmsgd_tfidf_cv_scores = cross_val_score(svm_sgd, tv_train_features, train_label_names, cv=5)
svmsgd_tfidf_cv_mean_score = np.mean(svmsgd_tfidf_cv_scores)
print('CV Accuracy (5-fold):', svmsgd_tfidf_cv_scores)
print('Mean CV Accuracy:', svmsgd_tfidf_cv_mean_score)
svmsgd_tfidf_test_score = svm_sgd.score(tv_test_features, test_label_names)
print('Test Accuracy:', svmsgd_tfidf_test_score)

CV Accuracy (5-fold): [0.75362319 0.69565217 0.69565217 0.7080292  0.72262774]
Mean CV Accuracy: 0.7151168941076906
Test Accuracy: 0.788235294117647


### Random Forest

In [46]:
rfc = RandomForestClassifier(n_estimators=10, random_state=42)
rfc.fit(tv_train_features, train_label_names)
rfc_tfidf_cv_scores = cross_val_score(rfc, tv_train_features, train_label_names, cv=5)
rfc_tfidf_cv_mean_score = np.mean(rfc_tfidf_cv_scores)
print('CV Accuracy (5-fold):', rfc_tfidf_cv_scores)
print('Mean CV Accuracy:', rfc_tfidf_cv_mean_score)
rfc_tfidf_test_score = rfc.score(tv_test_features, test_label_names)
print('Test Accuracy:', rfc_tfidf_test_score)

CV Accuracy (5-fold): [0.49275362 0.47101449 0.51449275 0.50364964 0.48175182]
Mean CV Accuracy: 0.49273246588384645
Test Accuracy: 0.4823529411764706


In [47]:
gbc = GradientBoostingClassifier(n_estimators=10, random_state=42)
gbc.fit(tv_train_features, train_label_names)
gbc_tfidf_cv_scores = cross_val_score(gbc, tv_train_features, train_label_names, cv=5)
gbc_tfidf_cv_mean_score = np.mean(gbc_tfidf_cv_scores)
print('CV Accuracy (5-fold):', gbc_tfidf_cv_scores)
print('Mean CV Accuracy:', gbc_tfidf_cv_mean_score)
gbc_tfidf_test_score = gbc.score(tv_test_features, test_label_names)
print('Test Accuracy:', gbc_tfidf_test_score)

CV Accuracy (5-fold): [0.5942029  0.55072464 0.56521739 0.59124088 0.54744526]
Mean CV Accuracy: 0.5697662117846186
Test Accuracy: 0.5441176470588235


## Results overview

In [48]:
pd.DataFrame([['Naive Bayes', mnb_bow_cv_mean_score, mnb_bow_test_score,
               mnb_tfidf_cv_mean_score, mnb_tfidf_test_score],
              ['Logistic Regression', lr_bow_cv_mean_score, lr_bow_test_score,
               lr_tfidf_cv_mean_score, lr_tfidf_test_score],
              ['Linear SVM', svm_bow_cv_mean_score, svm_bow_test_score,
               svm_tfidf_cv_mean_score, svm_tfidf_test_score],
              ['Linear SVM (SGD)', svmsgd_bow_cv_mean_score, svmsgd_bow_test_score,
               svmsgd_tfidf_cv_mean_score, svmsgd_tfidf_test_score],
              ['Random Forest', rfc_bow_cv_mean_score, rfc_bow_test_score,
               rfc_tfidf_cv_mean_score, rfc_tfidf_test_score],
              ['Gradient Boosted Machines', gbc_bow_cv_mean_score, gbc_bow_test_score,
               gbc_tfidf_cv_mean_score, gbc_tfidf_test_score]],
             columns=['Model', 'CV Score (TF)', 'Test Score (TF)', 'CV Score (TF-IDF)', 'Test Score (TF-IDF)'],
             ).T

Unnamed: 0,0,1,2,3,4,5
Model,Naive Bayes,Logistic Regression,Linear SVM,Linear SVM (SGD),Random Forest,Gradient Boosted Machines
CV Score (TF),0.690458,0.680324,0.664265,0.652618,0.508664,0.581392
Test Score (TF),0.688235,0.717647,0.705882,0.711765,0.497059,0.597059
CV Score (TF-IDF),0.482588,0.610462,0.741236,0.715117,0.492732,0.569766
Test Score (TF-IDF),0.508824,0.655882,0.814706,0.788235,0.482353,0.544118




# Hyperparameter tuning
### Naive Bayes

In [49]:
# hyperparameter tuning starts here

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # tries all combinations of the parameters
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer  # from here on only TF-IDF, which gave the best results above (linear SVM)

mnb_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                        ('mnb', MultinomialNB())
                       ])

# parameters to try: unigrams vs. unigrams plus bigrams, and alpha smoothing
param_grid = {'tfidf__ngram_range': [(1, 1), (1, 2)],       # 2 options
              'mnb__alpha': [1e-5, 1e-4, 1e-2, 1e-1, 1]     # 5 options, from one hundred-thousandth up to 1
}

# 2*5 combinations, each fitted 5 times (cv) = 50 fits
gs_mnb = GridSearchCV(mnb_pipeline, param_grid, cv=5, verbose=2)  # verbose: report what is happening
gs_mnb = gs_mnb.fit(train_corpus, train_label_names)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 2); total time=   0.2s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 2); total time=   0.4s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 2); total time=   0.5s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 2); total time=   0.4s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 2); total time=   0.2s
[CV] END .......mnb__alpha=0.0001, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END .......mnb__alpha=0.0001, tfidf__ngram_
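
The "10 candidates, totalling 50 fits" in the log follow directly from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. As a small sketch, scikit-learn's `ParameterGrid` can enumerate the combinations without training anything:

```python
from sklearn.model_selection import ParameterGrid

grid = {'tfidf__ngram_range': [(1, 1), (1, 2)],
        'mnb__alpha': [1e-5, 1e-4, 1e-2, 1e-1, 1]}

candidates = list(ParameterGrid(grid))
n_folds = 5
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")

# Show the first few combinations.
for params in candidates[:3]:
    print(params)
```

This is a cheap way to check the cost of a grid before starting a long search.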

In [50]:
# best parameters?
gs_mnb.best_estimator_.get_params()

{'memory': None,
 'steps': [('tfidf', TfidfVectorizer()), ('mnb', MultinomialNB(alpha=0.01))],
 'verbose': False,
 'tfidf': TfidfVectorizer(),
 'mnb': MultinomialNB(alpha=0.01),
 'tfidf__analyzer': 'word',
 'tfidf__binary': False,
 'tfidf__decode_error': 'strict',
 'tfidf__dtype': numpy.float64,
 'tfidf__encoding': 'utf-8',
 'tfidf__input': 'content',
 'tfidf__lowercase': True,
 'tfidf__max_df': 1.0,
 'tfidf__max_features': None,
 'tfidf__min_df': 1,
 'tfidf__ngram_range': (1, 1),
 'tfidf__norm': 'l2',
 'tfidf__preprocessor': None,
 'tfidf__smooth_idf': True,
 'tfidf__stop_words': None,
 'tfidf__strip_accents': None,
 'tfidf__sublinear_tf': False,
 'tfidf__token_pattern': '(?u)\\b\\w\\w+\\b',
 'tfidf__tokenizer': None,
 'tfidf__use_idf': True,
 'tfidf__vocabulary': None,
 'mnb__alpha': 0.01,
 'mnb__class_prior': None,
 'mnb__fit_prior': True,
 'mnb__force_alpha': 'warn'}

In [51]:
# ranking of all parameter combinations

cv_results = gs_mnb.cv_results_
results_df = pd.DataFrame({'rank': cv_results['rank_test_score'],
                           'params': cv_results['params'],
                           'cv score (mean)': cv_results['mean_test_score'],
                           'cv score (std)': cv_results['std_test_score']}
                          )
results_df = results_df.sort_values(by=['rank'], ascending=True)
pd.set_option('display.max_colwidth', 100)
results_df


Unnamed: 0,rank,params,cv score (mean),cv score (std)
4,1,"{'mnb__alpha': 0.01, 'tfidf__ngram_range': (1, 1)}",0.752914,0.022834
3,2,"{'mnb__alpha': 0.0001, 'tfidf__ngram_range': (1, 2)}",0.731133,0.021149
2,3,"{'mnb__alpha': 0.0001, 'tfidf__ngram_range': (1, 1)}",0.725346,0.02826
1,4,"{'mnb__alpha': 1e-05, 'tfidf__ngram_range': (1, 2)}",0.719496,0.025944
5,5,"{'mnb__alpha': 0.01, 'tfidf__ngram_range': (1, 2)}",0.709341,0.037791
0,6,"{'mnb__alpha': 1e-05, 'tfidf__ngram_range': (1, 1)}",0.706464,0.025444
6,7,"{'mnb__alpha': 0.1, 'tfidf__ngram_range': (1, 1)}",0.662827,0.030921
7,8,"{'mnb__alpha': 0.1, 'tfidf__ngram_range': (1, 2)}",0.6265,0.035527
8,9,"{'mnb__alpha': 1, 'tfidf__ngram_range': (1, 1)}",0.507278,0.028619
9,10,"{'mnb__alpha': 1, 'tfidf__ngram_range': (1, 2)}",0.46947,0.030585


In [52]:
# and how well does it do on the test data?
best_mnb_test_score = gs_mnb.score(test_corpus, test_label_names)
print('Test Accuracy :', best_mnb_test_score)

Test Accuracy : 0.8


### Logistic Regression

In [53]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC

In [54]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

lr_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                        ('lr', LogisticRegression(penalty='l2', max_iter=100, random_state=42))
                       ])

# C controls regularization (the penalty against overfitting): it is the inverse
# regularization strength, so a smaller C means stronger regularization and a simpler model
param_grid = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'lr__C': [1, 5, 10]                   # it might be worth going beyond 10
}

gs_lr = GridSearchCV(lr_pipeline, param_grid, cv=5, verbose=2)
gs_lr = gs_lr.fit(train_corpus, train_label_names)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 1); total time=   2.0s
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 1); total time=   1.1s
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 1); total time=   1.8s
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 1); total time=   1.5s
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 1); total time=   1.4s
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 2); total time=   6.7s
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 2); total time=   5.7s
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 2); total time=   5.9s
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 2); total time=   5.8s
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 2); total time=   5.8s
[CV] END .................lr__C=5, tfidf__ngram_range=(1, 1); total time=   1.7s
[CV] END .................lr__C=5, tfidf__ngram_r
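
To illustrate what `C` does: in scikit-learn's `LogisticRegression` it is the inverse of the regularization strength, so a small `C` shrinks the weights and a large `C` lets them grow. The sketch below uses synthetic data from `make_classification` as a stand-in, not the news articles.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; unrelated to the news articles.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

norms = {}
for C in (0.01, 1, 100):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(clf.coef_))  # size of the weight vector
    print(f"C={C:<6} weight norm = {norms[C]:.3f}")
```

A larger weight norm means the model fits the training data more closely, which is why the grid above searches over several values of `C`.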

In [55]:
gs_lr.best_estimator_

In [56]:
best_lr_test_score = gs_lr.score(test_corpus, test_label_names)
print('Test Accuracy :', best_lr_test_score)

Test Accuracy : 0.7676470588235295


### Support Vector

In [57]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

svm_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                        ('svm', LinearSVC(random_state=42))
                       ])

# C value: see above
param_grid = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'svm__C': [0.01, 0.1, 1, 5]
}

gs_svm = GridSearchCV(svm_pipeline, param_grid, cv=5, verbose=2)
gs_svm = gs_svm.fit(train_corpus, train_label_names)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 2); total time=   0.3s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 2); total time=   0.2s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 2); total time=   0.3s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 2); total time=   0.3s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 2); total time=   0.3s
[CV] END ..............svm__C=0.1, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END ..............svm__C=0.1, tfidf__ngram_r

In [58]:
gs_svm.best_estimator_.get_params()


{'memory': None,
 'steps': [('tfidf', TfidfVectorizer()),
  ('svm', LinearSVC(C=1, random_state=42))],
 'verbose': False,
 'tfidf': TfidfVectorizer(),
 'svm': LinearSVC(C=1, random_state=42),
 'tfidf__analyzer': 'word',
 'tfidf__binary': False,
 'tfidf__decode_error': 'strict',
 'tfidf__dtype': numpy.float64,
 'tfidf__encoding': 'utf-8',
 'tfidf__input': 'content',
 'tfidf__lowercase': True,
 'tfidf__max_df': 1.0,
 'tfidf__max_features': None,
 'tfidf__min_df': 1,
 'tfidf__ngram_range': (1, 1),
 'tfidf__norm': 'l2',
 'tfidf__preprocessor': None,
 'tfidf__smooth_idf': True,
 'tfidf__stop_words': None,
 'tfidf__strip_accents': None,
 'tfidf__sublinear_tf': False,
 'tfidf__token_pattern': '(?u)\\b\\w\\w+\\b',
 'tfidf__tokenizer': None,
 'tfidf__use_idf': True,
 'tfidf__vocabulary': None,
 'svm__C': 1,
 'svm__class_weight': None,
 'svm__dual': 'warn',
 'svm__fit_intercept': True,
 'svm__intercept_scaling': 1,
 'svm__loss': 'squared_hinge',
 'svm__max_iter': 1000,
 'svm__multi_class': 'ovr'

In [59]:
best_svm_test_score = gs_svm.score(test_corpus, test_label_names)
print('Test Accuracy :', best_svm_test_score)

Test Accuracy : 0.8147058823529412


In [60]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

sgd_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                        ('sgd', SGDClassifier(random_state=42))
                       ])

# alpha: regularization strength
param_grid = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'sgd__alpha': [1e-7, 1e-6, 1e-5, 1e-4]
}

gs_sgd = GridSearchCV(sgd_pipeline, param_grid, cv=5, verbose=2)
gs_sgd = gs_sgd.fit(train_corpus, train_label_names)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 2); total time=   0.3s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 2); total time=   0.4s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 2); total time=   0.4s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 2); total time=   0.5s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 2); total time=   0.3s
[CV] END ........sgd__alpha=1e-06, tfidf__ngram_range=(1, 1); total time=   0.0s
[CV] END ........sgd__alpha=1e-06, tfidf__ngram_r

In [61]:
gs_sgd.best_estimator_.get_params()

{'memory': None,
 'steps': [('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
  ('sgd', SGDClassifier(alpha=1e-05, random_state=42))],
 'verbose': False,
 'tfidf': TfidfVectorizer(ngram_range=(1, 2)),
 'sgd': SGDClassifier(alpha=1e-05, random_state=42),
 'tfidf__analyzer': 'word',
 'tfidf__binary': False,
 'tfidf__decode_error': 'strict',
 'tfidf__dtype': numpy.float64,
 'tfidf__encoding': 'utf-8',
 'tfidf__input': 'content',
 'tfidf__lowercase': True,
 'tfidf__max_df': 1.0,
 'tfidf__max_features': None,
 'tfidf__min_df': 1,
 'tfidf__ngram_range': (1, 2),
 'tfidf__norm': 'l2',
 'tfidf__preprocessor': None,
 'tfidf__smooth_idf': True,
 'tfidf__stop_words': None,
 'tfidf__strip_accents': None,
 'tfidf__sublinear_tf': False,
 'tfidf__token_pattern': '(?u)\\b\\w\\w+\\b',
 'tfidf__tokenizer': None,
 'tfidf__use_idf': True,
 'tfidf__vocabulary': None,
 'sgd__alpha': 1e-05,
 'sgd__average': False,
 'sgd__class_weight': None,
 'sgd__early_stopping': False,
 'sgd__epsilon': 0.1,
 'sgd__eta0': 0.0

In [62]:
best_sgd_test_score = gs_sgd.score(test_corpus, test_label_names)
print('Test Accuracy :', best_sgd_test_score)

Test Accuracy : 0.7794117647058824


In [63]:
import model_evaluation_utils_hr as meu
import importlib
importlib.reload(meu)

<module 'model_evaluation_utils_hr' from 'C:\\Users\\sarak\\Documents\\FHGR\\NLP\\notebooks\\model_evaluation_utils_hr.py'>

# Analysis of the results for MNB

In [64]:
# chosen because of its good results and easy handling

mnb_predictions = gs_mnb.predict(test_corpus)
unique_classes = list(set(test_label_names))
meu.get_metrics(true_labels=test_label_names, predicted_labels=mnb_predictions)

Accuracy: 0.8
Precision: 0.8201
Recall: 0.8
F1 Score: 0.8009
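
The helper module `model_evaluation_utils_hr` is not shown here. Because the reported recall equals the accuracy, the helper presumably uses support-weighted averages; that is an assumption, not something this notebook confirms. Under that assumption the four numbers could be reproduced with scikit-learn as sketched below, using made-up labels.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up labels for illustration.
y_true = ["Sport", "Sport", "Web", "Web", "Web", "Etat"]
y_pred = ["Sport", "Web", "Web", "Web", "Etat", "Etat"]

accuracy = accuracy_score(y_true, y_pred)
# average="weighted": per-class scores weighted by the class support
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
```

With weighted averaging, recall always equals accuracy, so precision and F1 carry the additional information.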


In [65]:
# results per news category

meu.display_classification_report(true_labels=test_label_names,
                                  predicted_labels=mnb_predictions, classes=unique_classes)

               precision    recall  f1-score   support

   Wirtschaft       0.76      0.80      0.78        44
 Wissenschaft       0.83      0.83      0.83        12
         Etat       1.00      0.50      0.67        24
          Web       0.93      0.88      0.90        64
        Sport       0.98      0.89      0.93        45
International       0.80      0.82      0.81        50
       Kultur       0.75      0.62      0.68        24
       Inland       0.69      0.87      0.77        31
     Panorama       0.61      0.78      0.69        46

     accuracy                           0.80       340
    macro avg       0.82      0.78      0.78       340
 weighted avg       0.82      0.80      0.80       340



In [71]:
# label overview

label_data_map = {v:k for k, v in data_labels_map.items()}
label_map_df = pd.DataFrame(list(label_data_map.items()), columns=['Label Name', 'Label Number'])
label_map_df

Unnamed: 0,Label Name,Label Number
0,Wirtschaft,0
1,Wissenschaft,1
2,Etat,2
3,Web,3
4,Sport,4
5,International,5
6,Kultur,6
7,Inland,7
8,Panorama,8


In [73]:
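# confusion matrix with labels
unique_classes = label_map_df['Label Name'].values
meu.display_confusion_matrix_pretty(true_labels=test_label_names,
                                    predicted_labels=mnb_predictions, classes=unique_classes)

# Caution: the row sums below (24, 31, 50, 24, 46, 45, 64, 44, 12) equal the class supports
# in alphabetical order (Etat, Inland, International, ...), so the printed row/column names
# appear to be mismatched. Read alphabetically, the 8 errors in the third row are
# International articles predicted as Panorama, not Etat articles predicted as Sport.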

Unnamed: 0_level_0,Unnamed: 1_level_0,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:
Unnamed: 0_level_1,Unnamed: 1_level_1,Wirtschaft,Wissenschaft,Etat,Web,Sport,International,Kultur,Inland,Panorama
Actual:,Wirtschaft,12,4,1,4,1,0,1,0,1
Actual:,Wissenschaft,0,27,0,0,1,0,0,3,0
Actual:,Etat,0,0,41,0,8,0,0,1,0
Actual:,Web,0,2,1,15,4,0,1,0,1
Actual:,Sport,0,3,3,1,36,0,0,3,0
Actual:,International,0,0,2,0,3,40,0,0,0
Actual:,Kultur,0,1,1,0,2,0,56,4,0
Actual:,Inland,0,1,1,0,4,1,2,35,0
Actual:,Panorama,0,1,1,0,0,0,0,0,10
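
A confusion matrix is only readable if the row and column names follow the same class order that was used to compute it. One way to guarantee this, independent of the helper module used above, is to pass one explicit label list to scikit-learn's `confusion_matrix` and reuse it for the table headers. The labels below are made up.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = ["Web", "Sport", "Etat", "Web", "Sport", "Web"]
y_pred = ["Web", "Sport", "Web", "Web", "Etat", "Web"]

labels = sorted(set(y_true) | set(y_pred))            # one explicit, fixed order
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows = actual, columns = predicted
cm_df = pd.DataFrame(cm,
                     index=[f"Actual: {name}" for name in labels],
                     columns=[f"Predicted: {name}" for name in labels])
print(cm_df)
```

A quick sanity check is that each row sum equals the support of that class in the classification report.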


In [75]:
train_idx, test_idx = train_test_split(np.array(range(len(df['Article']))), test_size=0.33, random_state=42)
test_idx

array([ 428,  533,  388,  107,  423,  420,  158,  237,  662,  543,  923,
         31,  394,  582,  914, 1025,  309,  688,  588,  244,  801,  750,
        755,  506,  350,  872,  745,   96,  286,  382,  449,  993,  531,
        696,  101,  261,  832,  109,   70,   59,  657,  902,  136,  910,
        580,  168,  652,   86,  903,  758,  974,  359,  706,  851,  538,
        405,  296,  323,  633,   10,  549,  718,  790,  139,  842,  208,
        846,  835,  290,   76,  595,  700,   30,  615,  760,   54,   23,
        526,  468,  518,    3,  798, 1022,  616,  409,  451,  894,  649,
        411,  535,  996,  596,  199,  527,  998,  318,  430,  825,   39,
        764,  770,  677,  673,  898,   66,  717,   67,  260,  209,  613,
        361,  665,  806,  493,  830,  362,  210,  675,  723,  956,  762,
        100,  298,   88,   63,  906,  678,  924,  256,  626,  332,  314,
          2,  803,  275,  398,  982,  432,  861,  499,  536,  823,  901,
        572,  184,  731,  966,  479,  841,  847,  6
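
Splitting the row positions instead of the texts makes it possible to look up the original articles for every prediction later on. The sketch below illustrates the idea on a hypothetical miniature DataFrame (`toy_df` and its contents are made up for illustration); only `train_test_split` and `DataFrame.iloc` are taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the article DataFrame.
toy_df = pd.DataFrame({
    "Target Name": ["Sport", "Web", "Etat", "Sport", "Web",
                    "Etat", "Sport", "Web", "Etat", "Sport"],
    "Article": [f"Artikel {i}" for i in range(10)],
})

# Split positions 0..n-1; the same random_state reproduces the same split.
positions = np.arange(len(toy_df))
train_pos, test_pos = train_test_split(positions, test_size=0.33, random_state=42)

# Recover the corresponding rows by position.
train_part = toy_df.iloc[train_pos]
test_part = toy_df.iloc[test_pos]
print(len(train_part), len(test_part))
```

Because the split is over positions, `toy_df.iloc[test_pos]` returns the test rows in exactly the order in which the test documents were passed to the classifier, as long as the corpus was split with the same `test_size` and `random_state`.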

In [90]:
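# A few examples

# Confidence = highest class probability per test document
predict_probas = gs_mnb.predict_proba(test_corpus).max(axis=1)
# .copy() avoids pandas' SettingWithCopyWarning when adding columns to the slice
test_df = df.iloc[test_idx].copy()
test_df['Predicted Name'] = mnb_predictions
test_df['Predicted Confidence'] = predict_probas
test_df.head()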

Unnamed: 0,Target Name,Article,Clean Article,Predicted Name,Predicted Confidence
428,Wirtschaft,"'Finanzminister Osborne sieht ""Jahre der Rezession"" – Österreichs Wirtschaft geht nicht von Brexit aus. Wien/London – Knapp fünf Wochen vor der Volksabstimmung in Großbritannien über Verbleib in o...",Finanzminister Osborne sehen Jahr Rezession oesterreichs Wirtschaft gehen Brexit wien London knapp Fuenf Wochen Volksabstimmung Grobritannien Ueber verbleib ausscheiden EU nehmen warnend Stimme so...,Wirtschaft,0.563171
533,Wirtschaft,Österreichs Automobilindustrie sieht sich gut aufgestellt und macht gegen die Verteufelung des Autos mobil. Wien/Graz/Aurora – Die heimische Automobilindustrie zeigt sich selbstbewusst: Während d...,oesterreichs Automobilindustrie sehen gut aufstellen Verteufelung Auto mobil Wien Graz Aurora heimisch Automobilindustrie zeigen selbstbewusst waehrend Wirtschaftsstandort Oesterreich Ranking perm...,Wirtschaft,0.779896
388,Etat,"Ein halbes Jahr nach dem DVD-Start zeigt der ORF ab Montag Schalkos bildgewaltige Miniserie. Wien – Diese Frage musste ja kommen, sagt Produzent John Lueftner. Mit den Vorstadtweibern, nein, mit d...",halb Jahr dvdstart zeigen Orf ab Montag Schalkos bildgewaltig Miniserie wien Frage ja kommen sagen produzent John Lueftner Vorstadtweiber nein Moecht messen Traumquote geben kantig Orf zeigen ab M...,Etat,0.7321
107,Wissenschaft,Techniker ringen mit einem Problem diesseits von Higgs-Bosonen und Dunkler Materie. Genf – Mit dem Nachweis des Higgs-Bosons schrieb das Genfer Teilchenforschungszentrum Cern 2012 Wissenschaftsges...,Techniker ringen Problem diesseits higgsbosonen dunkel Materie Genf nachweis higgsbosons schreiben Genfer Teilchenforschungszentrum Cern Wissenschaftsgeschichte Plaene Fuer Zukunft ehrgeizig hoffe...,Wissenschaft,0.665455
423,Sport,"5:2 erster Erfolg bei einer A-WM seit 1939 – Russland, Kanada, Finnland im Viertelfinale – Kasachstan steht als Absteiger fest. St. Petersburg/Moskau – Gastgeber Russland, Titelverteidiger Kanada ...",erster Erfolg awm seit russland kanada Finnland Viertelfinale Kasachstan stehen Absteiger fest st Petersburg Moskau Gastgeber russland Titelverteidiger Kanada Mitfavorit Finnland eishockeywm tsche...,Sport,0.999968
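
The "Predicted Confidence" column is the row-wise maximum of the class probabilities. The following sketch shows the relationship between that maximum and the predicted class on a hypothetical probability array shaped like the output of `predict_proba` (the class names and numbers are invented for illustration).

```python
import numpy as np

# Hypothetical class order and probabilities for three documents;
# each row sums to 1, as rows returned by predict_proba do.
classes = np.array(["Etat", "Sport", "Web"])
proba = np.array([
    [0.10, 0.85, 0.05],
    [0.40, 0.35, 0.25],
    [0.02, 0.01, 0.97],
])

confidence = proba.max(axis=1)             # highest probability per row
predicted = classes[proba.argmax(axis=1)]  # class that attains it

for name, conf in zip(predicted, confidence):
    print(f"{name:>5}  {conf:.2f}")
```

A maximum far below 1, as in the second row, means the model chose the class by a narrow margin. This is why sorting misclassified documents by confidence is informative: confidently wrong predictions point to systematic confusions rather than borderline cases.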


In [88]:
# Misclassified documents: actual class Sport, predicted as International

pd.set_option('display.max_colwidth', 200)
res_df = (test_df[(test_df['Target Name'] == 'Sport') & (test_df['Predicted Name'] == 'International')]
       .sort_values(by=['Predicted Confidence'], ascending=False).head(5))
res_df

Unnamed: 0,Target Name,Article,Clean Article,Predicted Name,Predicted Confidence
923,Sport,Treffen in Zürich soll am 16. Dezember stattfinden. Wien – Der Weltfußballverband Fifa dürfte am 16. Dezember über die Nachfolge seines zurückgetretenen Präsidenten Sepp Blatter entscheiden. Das b...,Treffen Zuerich sollen Dezember stattfinden wien Weltfuballverband Fifa Duerfte Dezember Ueber Nachfolge zurueckgetretenen Praesident Sepp Blatter entscheiden berichten bbc Mittwochfrueh Mitglieds...,International,0.643908
760,Sport,"Am 28. November verteidigt der Ukrainer in der Düsseldorf-Arena alle vier WM-Titel. Die Vorbereitung findet traditionell beim Stanglwirt in Going statt. Going – Es hat schon Tradition, dass sich W...",November verteidigen Ukrainer duesseldorfarena vier wmtitel Vorbereitung finden traditionell Stanglwirt Going statt Going schon Tradition Wladimir Klitschko Tiroler bergen Stanglwirt Going wmkaemp...,International,0.429052


In [89]:
# Further misclassifications (a different class pair): Sport predicted as Panorama

pd.set_option('display.max_colwidth', 200)
res_df = (test_df[(test_df['Target Name'] == 'Sport') & (test_df['Predicted Name'] == 'Panorama')]
       .sort_values(by=['Predicted Confidence'], ascending=False).head(5))
res_df

Unnamed: 0,Target Name,Article,Clean Article,Predicted Name,Predicted Confidence
705,Sport,Michael Diamond betrunken und bewaffnet am Steuer gestoppt. Sydney/Rio de Janeiro – Gut zehn Wochen vor den Olympischen Spielen hat ein australischer Goldmedaillengewinner im Schießen nach einem F...,Michael Diamond betrunken bewaffnet Steuer stoppen Sydney Rio de Janeiro gut zehn Woche olympisch Spiel australisch Goldmedaillengewinner Schieen Familienstreit Waffenlizenz verlieren Polizei stop...,Panorama,0.839159
798,Sport,"Neuschnee erzwang Absage, Kombination ersatzlos gestrichen. Crans Montana – Wegen zu viel Neuschnee ist die für den (heutigen) Samstag geplante Weltcup-Abfahrt der Damen in Crans Montana auf Sonnt...",Neuschnee Erzwang absagen Kombination ersatzlos streichen Crans Montana wegen Neuschnee Fuer heutig Samstag geplant weltcupabfahrt Dame Crans Montana Sonntag Uhr verschieben eigentlich Fuer Sonnta...,Panorama,0.500947
801,Sport,"Neuer Generalsekretär rechnet nicht mit Freispruch für Russin. Rom – Der designierte Generaldirektor der Welt-Antidoping-Agentur (WADA), Olivier Niggli, rechnet nicht mit einen Freispruch für die ...",neu Generalsekretaer rechnen Freispruch Fuer russin rom designiert Generaldirektor weltantidopingagentur Wada Olivier Niggli rechnen Freispruch Fuer derzeit wegen Doping suspendiert Tennisspieleri...,Panorama,0.352073
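
The two filter cells differ only in the predicted class, so the lookup can be wrapped in a small helper. This is a sketch under assumptions: the function name `misclassified` and the toy data are hypothetical, and only the column names follow the notebook.

```python
import pandas as pd

def misclassified(frame, actual, predicted, top=5):
    """Rows whose true class is `actual` but were predicted as `predicted`,
    most confident errors first."""
    mask = (frame["Target Name"] == actual) & (frame["Predicted Name"] == predicted)
    return frame[mask].sort_values(by="Predicted Confidence", ascending=False).head(top)

# Hypothetical miniature version of test_df.
toy_test_df = pd.DataFrame({
    "Target Name": ["Sport", "Sport", "Sport", "Web"],
    "Predicted Name": ["Panorama", "Sport", "Panorama", "Web"],
    "Predicted Confidence": [0.35, 0.99, 0.84, 0.90],
})

errors = misclassified(toy_test_df, "Sport", "Panorama")
print(errors)

# Count every (actual, predicted) error pair to decide which ones to inspect.
wrong = toy_test_df[toy_test_df["Target Name"] != toy_test_df["Predicted Name"]]
pair_counts = wrong.groupby(["Target Name", "Predicted Name"]).size()
print(pair_counts)
```

With the real `test_df`, a call such as `misclassified(test_df, 'Sport', 'International')` would correspond to the first filter cell.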
