# GermEval -- TF.IDF


The original dataset is in XML format and has been parsed to CSV (`parsed.csv`) with BeautifulSoup.

This notebook contains the code for extracting TF.IDF features for the GermEval dataset. 

- Title and Description fields are combined before applying TF.IDF
- TF.IDF is also applied to the Authors field separately. 

Several classifiers are tested with their default settings on the TF.IDF features alone.

TODO:
- Stopwords removal (German)
- Root words extraction (maybe)

In [243]:
import pandas as pd
import pickle
from ast import literal_eval
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, cohen_kappa_score, f1_score
from nltk.corpus import stopwords
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer , LabelEncoder , LabelBinarizer
from sklearn.svm import LinearSVC , NuSVC
from utils import subtask_A_evaluation

In [51]:
book_df = pd.read_csv("/home/evenuma/germeval/data/parsed_train_plus_validation.csv")

The categories are represented as lists in which each element is a tag. Each tag is a hierarchy of levels separated by the '>' token.

In [52]:
book_df["categories"] = book_df["categories"].apply(lambda categories: literal_eval(categories))

For the time being, we focus on classifying the top level only. Note that the code below keeps every distinct top-level label of a sample rather than only the first; samples with several top-level labels are later expanded into one row per label.

In [53]:
book_df["top_level"] = book_df["categories"].apply(lambda  categories: np.unique([i.split(">")[0].strip() for i in categories]))
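
As a rough illustration of the step above, the sketch below parses one stringified category list and keeps its distinct top-level labels. The sample string is made up for the example, and `sorted(set(...))` is used here as a stand-in for `np.unique`, which also returns sorted unique values.

```python
from ast import literal_eval

# Hypothetical raw cell value, shaped like the `categories` column in the CSV
raw = (
    "['Literatur & Unterhaltung > Fantasy', "
    "'Literatur & Unterhaltung > Romane & Erzählungen', "
    "'Ratgeber > Lebenshilfe & Psychologie']"
)

tags = literal_eval(raw)  # string -> list of tag strings

# Keep the part before the first '>' and drop duplicates
top_level = sorted({tag.split(">")[0].strip() for tag in tags})
print(top_level)
```

Two of the three tags share the same top level, so only two labels remain for this sample.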

In [54]:
# Cap the label count at 3 (a handful of samples have 4 top-level labels)
book_df["count_of_categories"] = book_df["top_level"].apply(lambda top_levels: min(len(top_levels), 3))

In [55]:
book_df.count_of_categories.value_counts()

1    15549
2     1004
3       74
Name: count_of_categories, dtype: int64

In [56]:
book_df.head(3)

Unnamed: 0,title,description,categories,author,published_date,isbn,top_level,count_of_categories
0,Die Klinik,Ein Blick hinter die Kulissen eines Krankenhau...,[Literatur & Unterhaltung > Romane & Erzählungen],Noah Gordon,2013-12-02,9783641136291,[Literatur & Unterhaltung],1
1,Die Erben von Midkemia 4,Die Bedrohungen für Midkemia und Kelewan wolle...,[Literatur & Unterhaltung > Fantasy > Heroisch...,Raymond Feist,2016-06-20,9783641185787,[Literatur & Unterhaltung],1
2,Völlig losgelöst,In der Dreizimmerwohnung stapeln sich Flohmark...,[Ratgeber > Lebenshilfe & Psychologie > Besser...,Susanne Weingarten,2019-01-14,9783328103646,[Ratgeber],1


In [57]:
def repeat_categories(df):
    lens = [len(item) for item in df['top_level']]
    return pd.DataFrame({"category" : np.concatenate(df['top_level'].values), 
                         "categories" : np.repeat(df['top_level'].values,lens), 
                          "title" : np.repeat(df['title'].values,lens),
                          "description" : np.repeat(df['description'].values,lens),
                          "author" : np.repeat(df['author'].values,lens),
                          "published_date" : np.repeat(df['published_date'].values,lens),
                          "isbn":np.repeat(df['isbn'].values,lens),
                          "count_of_categories":np.repeat(df['count_of_categories'].values,lens)
                        })
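
A toy run may make the flattening easier to follow. The sketch below uses a made-up two-row frame and `DataFrame.explode` (available in pandas 0.25 and later). This is an alternative to the `np.repeat` approach above, not what the notebook itself uses.

```python
import pandas as pd

# Made-up frame: the second book carries two top-level labels
toy = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "top_level": [["Ratgeber"], ["Ratgeber", "Sachbuch"]],
})

# One output row per (book, top-level label) pair
flat = toy.explode("top_level").rename(columns={"top_level": "category"})
flat = flat.reset_index(drop=True)
print(flat)
```

A book with two labels ends up as two rows that differ only in `category`, which is why the flattened frame has more rows than the original.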

In [58]:
# len(np.repeat(book_df['categories'].values,lens))

In [59]:
flat_book_df = repeat_categories(book_df)

In [60]:
flat_book_df.head(3)

Unnamed: 0,category,categories,title,description,author,published_date,isbn,count_of_categories
0,Literatur & Unterhaltung,[Literatur & Unterhaltung],Die Klinik,Ein Blick hinter die Kulissen eines Krankenhau...,Noah Gordon,2013-12-02,9783641136291,1
1,Literatur & Unterhaltung,[Literatur & Unterhaltung],Die Erben von Midkemia 4,Die Bedrohungen für Midkemia und Kelewan wolle...,Raymond Feist,2016-06-20,9783641185787,1
2,Ratgeber,[Ratgeber],Völlig losgelöst,In der Dreizimmerwohnung stapeln sich Flohmark...,Susanne Weingarten,2019-01-14,9783328103646,1


In [61]:
flat_book_df.category.value_counts()

Literatur & Unterhaltung      8929
Sachbuch                      2540
Kinderbuch & Jugendbuch       2275
Ratgeber                      2124
Ganzheitliches Bewusstsein     916
Glaube & Ethik                 689
Künste                         165
Architektur & Garten           145
Name: category, dtype: int64

In [62]:
print(len(flat_book_df))
print(sum(book_df["count_of_categories"].values))

17783
17779


## ^ The difference of 4 above is due to 4 records that each have 4 top-level categories; their count has been capped at 3

In [63]:
final_df = flat_book_df

In [64]:
final_df["authors"] = final_df["author"].apply(lambda x:[ i.strip() for i in str(x).split(",")])

In [65]:
final_df["published_date_parsed"] = pd.to_datetime(final_df["published_date"],infer_datetime_format=True)

In [66]:
final_df["year"] = final_df["published_date_parsed"].apply(lambda x: x.year)

In [67]:
final_df.head(3)

Unnamed: 0,category,categories,title,description,author,published_date,isbn,count_of_categories,authors,published_date_parsed,year
0,Literatur & Unterhaltung,[Literatur & Unterhaltung],Die Klinik,Ein Blick hinter die Kulissen eines Krankenhau...,Noah Gordon,2013-12-02,9783641136291,1,[Noah Gordon],2013-12-02,2013
1,Literatur & Unterhaltung,[Literatur & Unterhaltung],Die Erben von Midkemia 4,Die Bedrohungen für Midkemia und Kelewan wolle...,Raymond Feist,2016-06-20,9783641185787,1,[Raymond Feist],2016-06-20,2016
2,Ratgeber,[Ratgeber],Völlig losgelöst,In der Dreizimmerwohnung stapeln sich Flohmark...,Susanne Weingarten,2019-01-14,9783328103646,1,[Susanne Weingarten],2019-01-14,2019


In [68]:
validation_authors =  pickle.load(open("validation_authors",'rb'))
test_authors =  pickle.load(open("test_authors",'rb'))

In [69]:
all_authors = np.concatenate((validation_authors.values,final_df["authors"].values))

In [70]:
mlb = MultiLabelBinarizer()
mlb.fit(all_authors)

year_lb = LabelBinarizer()
year_lb.fit(final_df["year"])

LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
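
To show what the author encoding produces, here is a minimal sketch on made-up author lists. The variable names are illustrative only and do not appear in the notebook.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up author lists; the second book has two authors
authors = [["Noah Gordon"], ["Raymond Feist", "Noah Gordon"]]

toy_mlb = MultiLabelBinarizer()
encoded = toy_mlb.fit_transform(authors)

print(toy_mlb.classes_)  # column order (sorted author names)
print(encoded)           # one row per book, one 0/1 column per author
```

Authors that were not seen during `fit` are ignored at `transform` time (scikit-learn emits a warning), which is presumably why the validation authors are included when fitting above.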

In [71]:
final_df["authors"].values[:2]

array([list(['Noah Gordon']), list(['Raymond Feist'])], dtype=object)

In [72]:
def combine_title_desc(row):
    """Combines the Title and Description fields in the given row
    and returns the combined result"""
    return str(row["title"]) + " " + str(row["description"])
    #return str(row["title"]) + " " + str(row["description"]) + " " + str(row["author"])


### Prepare the data

In [75]:
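import gc

# Drop the intermediate frames; only final_df is used from here on.
# final_df is the same object as flat_book_df, so only book_df's memory is actually released.
# Re-running this cell raises the NameError shown below because the names no longer exist.
del book_df
del flat_book_df
gc.collect()

NameError: name 'book_df' is not defined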

In [196]:
# X_train_categories = ["_".join([str(le.transform([a])[0]) for a in i]) for i in  X_train_df["categories"]]



In [104]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
final_df["target"]  = le.fit_transform(final_df["category"].values)

data = train_test_split(final_df[["title","description","authors","category","year","count_of_categories","isbn","categories"]],
                final_df["target"], test_size=0.20, random_state=42, stratify=final_df["target"], shuffle=True)

X_train_df, X_test_df, y_train, y_test = data
vectorizer = TfidfVectorizer(ngram_range=(1,2),stop_words=stopwords.words('german'),token_pattern=r'\b[^\d\W]+\b')

X_train = vectorizer.fit_transform(X_train_df.apply(lambda row: combine_title_desc(row), axis=1))
X_test = vectorizer.transform(X_test_df.apply(lambda row: combine_title_desc(row),axis=1))

X_train_author = mlb.transform(X_train_df["authors"].values)
X_test_author = mlb.transform(X_test_df["authors"].values)

X_train_year = year_lb.transform(X_train_df["year"].values)
X_test_year = year_lb.transform(X_test_df["year"].values)

# X_train_categories = ["_".join([str(le.transform([a])[0]) for a in i]) for i in  X_train_df["categories"]]
# X_test_categories = ["_".join([str(le.transform([a])[0]) for a in i]) for i in  X_test_df["categories"]]

from scipy import sparse
X_train_all = sparse.hstack((
     X_train,sparse.csr_matrix(X_train_author),
     sparse.csr_matrix(X_train_year),
     sparse.csr_matrix(X_train_df["isbn"].values.reshape(len(X_train_df),1)),
     sparse.csr_matrix(X_train_df["count_of_categories"].values.reshape(len(X_train_df),1))    
    ))

X_test_all = sparse.hstack((
    X_test,sparse.csr_matrix(X_test_author),
    sparse.csr_matrix(X_test_year),
    sparse.csr_matrix(X_test_df["isbn"].values.reshape(len(X_test_df),1)),
    sparse.csr_matrix(X_test_df["count_of_categories"].values.reshape(len(X_test_df),1))
    ))
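
The stacked matrices above carry two bookkeeping columns at the end: the ISBN and the label count. The sketch below, on made-up numbers, shows that layout and the `[:, :-2]` slice that later code relies on to keep those columns away from the classifier.

```python
import numpy as np
from scipy import sparse

# Made-up stand-ins: 2 samples, 3 text features
text_features = sparse.csr_matrix(np.array([[0.5, 0.0, 0.1], [0.0, 0.7, 0.0]]))
isbn = np.array([9783641136291, 9783641185787]).reshape(2, 1)
count = np.array([1, 2]).reshape(2, 1)

stacked = sparse.hstack(
    (text_features, sparse.csr_matrix(isbn), sparse.csr_matrix(count))
).tocsr()

features = stacked[:, :-2]  # what a classifier should see
isbn_col = stacked[:, -2]   # identifier, not a feature
count_col = stacked[:, -1]  # number of top-level labels
print(features.shape, isbn_col.toarray().ravel(), count_col.toarray().ravel())
```

The ISBNs are stored as floats after stacking; 13-digit values are still represented exactly in float64, so converting back with `int(...)` recovers them.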



In [105]:
pickle.dump(mlb, open("mlb_ete_top_level.pickle", "wb"))
pickle.dump(year_lb, open("lb_ete_top_level.pickle", "wb"))
pickle.dump(vectorizer, open("vectorizer_ete_top_level.pickle", "wb"))
pickle.dump(X_train_all, open("X_train_ete_top_level.pickle", "wb"))
pickle.dump(X_test_all, open("X_test_ete_top_level.pickle", "wb"))
pickle.dump(y_train, open("y_train_ete_top_level.pickle", "wb"))
pickle.dump(y_test, open("y_test_ete_top_level.pickle", "wb"))
pickle.dump(dict(zip(le.classes_,le.transform(le.classes_))), open("le_author_year_mapping.pickle", "wb"))

In [106]:
%%time
from scipy import sparse

X_full = sparse.vstack((X_train_all,X_test_all))
Y_full = np.concatenate((y_train,y_test))

CPU times: user 16 ms, sys: 2 ms, total: 18 ms
Wall time: 16.3 ms


In [107]:
Y_full

array([5, 5, 5, ..., 1, 5, 5])

In [108]:
X_full_array = X_full.tocsr()


In [118]:
class_weights = {'Kinderbuch & Jugendbuch': 1.8, 'Ratgeber': 3, 'Sachbuch': 2,'Glaube & Ethik' : 2 ,'Künste' : 6,'Architektur & Garten' : 6 }

In [125]:
class_weights_encoded = []
for key, weight in class_weights.items():
    print(weight, key)
    # Map the category name to its LabelEncoder index
    class_weights_encoded.append((le.transform([key])[0], weight))

1.8 Kinderbuch & Jugendbuch
3 Ratgeber
2 Sachbuch
2 Glaube & Ethik
6 Künste
6 Architektur & Garten


In [225]:
dict(class_weights_encoded)

{0: 6, 2: 2, 3: 1.8, 4: 6, 6: 3, 7: 2}
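
The mapping above can be reproduced without the fitted encoder. The sketch below assumes that `LabelEncoder` numbers the classes in sorted order, and uses plain `sorted` as a stand-in for it; the category names are taken from the value counts earlier in the notebook.

```python
# Category names as they appear in the notebook's value counts
categories = [
    "Literatur & Unterhaltung", "Sachbuch", "Kinderbuch & Jugendbuch", "Ratgeber",
    "Ganzheitliches Bewusstsein", "Glaube & Ethik", "Künste", "Architektur & Garten",
]

# Assumption: sorted order mirrors the index LabelEncoder assigns to each name
name_to_index = {name: idx for idx, name in enumerate(sorted(categories))}

class_weights = {"Kinderbuch & Jugendbuch": 1.8, "Ratgeber": 3, "Sachbuch": 2,
                 "Glaube & Ethik": 2, "Künste": 6, "Architektur & Garten": 6}

# Re-key the weights by class index instead of class name
encoded = {name_to_index[name]: weight for name, weight in class_weights.items()}
print(encoded)
```

Classes 1 and 5 (Ganzheitliches Bewusstsein and Literatur & Unterhaltung) are absent from the dictionary, so scikit-learn treats them with the default weight of 1.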

In [230]:
pd.DataFrame(global_isbn_category_dict,columns=["isbn","actual"]).dtypes

isbn       int64
actual    object
dtype: object

In [245]:
global_isbn_category_dict = [(i[0],[le.transform([k])[0] for k in i[1].tolist()]) for i in final_df[["isbn","categories"]].head(100).values]

global_isbn_category_df = pd.DataFrame(global_isbn_category_dict,columns=["isbn","actual"])

global_isbn_category_df["isbn"] = global_isbn_category_df["isbn"].apply(lambda x:str(x))

def get_one_instance(index,csr_matrix):
    """Keeps one row index per ISBN (the second-to-last column holds the ISBN).
    Returns the unique ISBNs as strings and, for each, the last matching row index"""
    isbns = [ str(int(i[0])) for i in csr_matrix[index,-2].todense()]
    isbns_dict = dict(zip(isbns,index))
    return (list(isbns_dict.keys()),list(isbns_dict.values()))

def getClasses(class_probs,class_count):
    arr = np.array(class_probs)
    return arr.argsort()[-1 * class_count:][::-1]
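
A small worked example of the top-k selection in `getClasses`, using made-up probabilities. The function name `top_k_classes` is illustrative; the logic is the same argsort-and-reverse idiom.

```python
import numpy as np

def top_k_classes(class_probs, k):
    """Indices of the k largest probabilities, most likely first."""
    order = np.argsort(np.asarray(class_probs))  # ascending order
    return order[-k:][::-1]                      # last k entries, reversed

probs = [0.1, 0.6, 0.3]
print(top_k_classes(probs, 2))
```

One caveat with this idiom: for `k = 0` the slice `[-0:]` selects everything, so all classes would be returned rather than none. The predicted counts in this notebook range from 1 to 3, so the case should not arise here.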

def get_multi_prediction(row):
    """Returns the indices of the row's `class_count` most probable classes,
    ordered from most to least likely"""
    return getClasses(row["predict_proba"],row["class_count"])

def compute_f1_score(prediction,class_count_prediction):
    prediction_df = pd.DataFrame(prediction,columns=["isbn","predict_proba"])
    prediction_df["isbn"] = prediction_df["isbn"].apply(lambda x:str(int(x[0])))
#     print(prediction_df)
    class_count_df = pd.DataFrame(class_count_prediction,columns=["isbn","class_count"])
    
    joined_ = prediction_df.merge(class_count_df,on="isbn",how="inner")
    
    joined_["predictions"] = joined_.apply(get_multi_prediction,axis=1)
    joined_1 = joined_.merge(global_isbn_category_df,on="isbn")
    predicted_ = joined_1["predictions"]
    actuals_ = joined_1["actual"]
    
    predicted_labels = [ [le.inverse_transform([k])[0] for k in i] for i in predicted_.values]
    actual_labels = [ [le.inverse_transform([k])[0] for k in i] for i in actuals_.values]
    
    return subtask_A_evaluation(actual_labels,predicted_labels)
    
# compute_f1_score(a,b)

# Build the end-to-end classifier

In [248]:
a = None
b = None
from sklearn.model_selection import StratifiedKFold
fold_classes = []
probas = []
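# random_state only has an effect with shuffle=True (newer scikit-learn versions reject it otherwise), so it is omitted
cv = StratifiedKFold(n_splits=7)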
for train_index, test_index in cv.split(X_full_array,Y_full):      
#     print("Train Index: ", train_index, "\n")
#     print("Test Index: ", len(test_index))
    
    lsvcclf = LinearSVC(random_state=42,max_iter=3000,verbose=1,class_weight= dict(class_weights_encoded))
    X_train, X_test, y_train, y_test = X_full_array[train_index,:-2], X_full_array[test_index,:-2], Y_full[train_index], Y_full[test_index]

    lsvcclf.fit(X_train,y_train)
    
    print("printing prediction")
    prediction = list(zip(X_full_array[test_index,-2].todense().tolist(),lsvcclf._predict_proba_lr(X_test)))
    a = prediction
#     print(prediction)

    train_isbn , train_index_for_cc = get_one_instance(train_index,X_full_array)
    test_isbn , test_index_for_cc = get_one_instance(test_index,X_full_array)
    
    X_train_cc, X_test_cc = X_full_array[train_index_for_cc,:-2], X_full_array[test_index_for_cc,:-2]
    
    y_train_cc, y_test_cc = [int(i[0]) for i in X_full_array[train_index_for_cc,-1].todense().tolist()], [int(i[0]) for i in  X_full_array[test_index_for_cc,-1].todense().tolist()]

    lsvcclf_cc = LinearSVC(random_state=42,max_iter=3000,verbose=1)
    
    lsvcclf_cc.fit(X_train_cc,y_train_cc)
    
    prediction_cc = list(zip(test_isbn,lsvcclf_cc.predict(X_test_cc)))
    b = prediction_cc    
    
    print(compute_f1_score(prediction,prediction_cc))

    

[LibLinear]printing prediction
[LibLinear][0.8275862068965517, 0.96, 0.888888888888889, 0.7368421052631579]
[LibLinear]printing prediction
[LibLinear][0.4444444444444444, 0.5714285714285714, 0.5, 0.5714285714285714]
[LibLinear]printing prediction
[LibLinear][0.9523809523809523, 0.9523809523809523, 0.9523809523809523, 0.9411764705882353]
[LibLinear]printing prediction
[LibLinear][1.0, 1.0, 1.0, 1.0]
[LibLinear]printing prediction
[LibLinear][0.8235294117647058, 0.8235294117647058, 0.8235294117647058, 0.8235294117647058]
[LibLinear]printing prediction
[LibLinear][0.7027027027027027, 0.8387096774193549, 0.7647058823529411, 0.42857142857142855]
[LibLinear]printing prediction
[LibLinear][0.875, 0.875, 0.875, 0.8]
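
The loop above ranks classes with `_predict_proba_lr`, a private scikit-learn helper whose behaviour is not a documented guarantee. As far as we can tell, it squashes the one-vs-rest decision scores with a sigmoid and normalises each row to sum to 1. The sketch below reproduces that idea in plain NumPy on made-up scores; the function name is hypothetical.

```python
import numpy as np

def ovr_scores_to_probs(scores):
    """Squash one-vs-rest decision scores with a sigmoid, then normalise each row."""
    scores = np.asarray(scores, dtype=float)
    probs = 1.0 / (1.0 + np.exp(-scores))
    return probs / probs.sum(axis=1, keepdims=True)

# Made-up decision scores for 2 samples and 3 classes
scores = [[2.0, -1.0, 0.0], [-0.5, 0.5, 1.5]]
probs = ovr_scores_to_probs(scores)
print(probs.argmax(axis=1))
```

These values are not calibrated probabilities; they only preserve the ranking of the decision scores, which is all the top-k selection needs.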


In [204]:
b

[('9783641052850', 1),
 ('9783453422506', 1),
 ('9783442479887', 1),
 ('9783809433545', 1),
 ('9783579086491', 3),
 ('9783570403211', 2),
 ('9783442488568', 1),
 ('9783641114503', 1),
 ('9783424201956', 1),
 ('9783641172022', 1),
 ('9783641029036', 1),
 ('9783641120856', 1),
 ('9783453580732', 1),
 ('9783837140132', 1),
 ('9783328103172', 1),
 ('9783641065867', 2),
 ('9783453603349', 1),
 ('9783453702950', 1),
 ('9783791355795', 1),
 ('9783570552582', 1),
 ('9783442745609', 1),
 ('9783837174687', 1),
 ('9783844522679', 1),
 ('9783442715091', 1),
 ('9783453318410', 1),
 ('9783453201446', 1),
 ('9783641055806', 1),
 ('9783809026907', 1),
 ('9783466371686', 2),
 ('9783867172905', 2),
 ('9783641074524', 1),
 ('9783641146054', 2),
 ('9783717520023', 1),
 ('9783453436497', 1),
 ('9783442486120', 1),
 ('9783641107758', 1),
 ('9783453401372', 1),
 ('9783641075200', 1),
 ('9783424201222', 2),
 ('9783570158746', 1),
 ('9783328103646', 1),
 ('9783442391240', 1),
 ('9783570226087', 1),
 ('97834427

In [47]:
X_full_array.shape

(17783, 696974)

In [234]:
le.transform(le.classes_)

array([0, 1, 2, 3, 4, 5, 6, 7])

In [241]:
le.inverse_transform(1)

ValueError: bad input shape ()
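
The error above comes from passing a bare scalar: `inverse_transform` expects an array-like of encoded labels. A minimal sketch of the working call, using a toy encoder fitted on three of the category names:

```python
from sklearn.preprocessing import LabelEncoder

toy_le = LabelEncoder()
toy_le.fit(["Ratgeber", "Sachbuch", "Künste"])

# Wrap the single index in a list; a bare scalar is what triggers the error above
label = toy_le.inverse_transform([1])[0]
print(label)
```

The same list-wrapping pattern is already used elsewhere in this notebook, for example in `le.transform([key])[0]`.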