# GermEval -- TF.IDF


The original dataset is an XML format and it has been parsed to CSV, `parsed.csv` (through beautifulsoup).

This notebook contains the code for extracting TF.IDF features for the GermEval dataset. 

- Title and Description fields are combined before applying TF.IDF
- TF.IDF is also applied to the Authors field separately. 

Different classifiers are tested in their vanilla form on the TF.IDF alone

TODO:
- Stopwords removal (German)
- Root words extraction (maybe)

In [1]:
import pandas as pd
import pickle
from ast import literal_eval
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, cohen_kappa_score, f1_score
from nltk.corpus import stopwords
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer , LabelEncoder , LabelBinarizer
from sklearn.svm import LinearSVC , NuSVC
from utils import subtask_A_evaluation , subtask_A_confusion_matrix , subtask_A_classification_report
from bpemb import BPEmb


In [2]:
bpemb_de = BPEmb(lang="de", vs=25000, dim=300)

In [3]:
bpemb_de.encode("Ein Blick hinter die ")

['▁ein', '▁blick', '▁hinter', '▁die']

In [4]:
book_df = pd.read_csv("/home/evenuma/germeval/data/parsed_train_plus_validation.csv")

The categories can be represented as lists wherein each element is a tag. Each tag is separated in to levels split by '>' token. 

In [5]:
book_df["categories"] = book_df["categories"].apply(lambda categories: literal_eval(categories))

For the time being, we are focusing on a single tag classification and only the top level is considered. If a sample has more than one label assigned to it, we take only the first.

In [6]:
book_df["top_level"] = book_df["categories"].apply(lambda  categories: np.unique([i.split(">")[0].strip() for i in categories]))

In [7]:
book_df["count_of_categories"] = book_df["top_level"].apply(lambda top_levels: 3 if len(top_levels) > 3 else len(top_levels))

In [8]:
book_df.count_of_categories.value_counts()

1    15549
2     1004
3       74
Name: count_of_categories, dtype: int64

In [9]:
book_df.head(3)

Unnamed: 0,title,description,categories,author,published_date,isbn,top_level,count_of_categories
0,Die Klinik,Ein Blick hinter die Kulissen eines Krankenhau...,[Literatur & Unterhaltung > Romane & Erzählungen],Noah Gordon,2013-12-02,9783641136291,[Literatur & Unterhaltung],1
1,Die Erben von Midkemia 4,Die Bedrohungen für Midkemia und Kelewan wolle...,[Literatur & Unterhaltung > Fantasy > Heroisch...,Raymond Feist,2016-06-20,9783641185787,[Literatur & Unterhaltung],1
2,Völlig losgelöst,In der Dreizimmerwohnung stapeln sich Flohmark...,[Ratgeber > Lebenshilfe & Psychologie > Besser...,Susanne Weingarten,2019-01-14,9783328103646,[Ratgeber],1


In [10]:
# def repeat_categories(df):
#     lens = [len(item) for item in df['top_level']]
#     return pd.DataFrame({"category" : np.concatenate(df['top_level'].values), 
#                          "categories" : np.repeat(df['top_level'].values,lens), 
#                           "title" : np.repeat(df['title'].values,lens),
#                           "description" : np.repeat(df['description'].values,lens),
#                           "author" : np.repeat(df['author'].values,lens),
#                           "published_date" : np.repeat(df['published_date'].values,lens),
#                           "isbn":np.repeat(df['isbn'].values,lens),
#                           "count_of_categories":np.repeat(df['count_of_categories'].values,lens)
#                         })

In [11]:
# len(np.repeat(book_df['categories'].values,lens))

In [12]:
# flat_book_df = repeat_categories(book_df)

In [13]:
flat_book_df = book_df

In [14]:
flat_book_df.head(3)

Unnamed: 0,title,description,categories,author,published_date,isbn,top_level,count_of_categories
0,Die Klinik,Ein Blick hinter die Kulissen eines Krankenhau...,[Literatur & Unterhaltung > Romane & Erzählungen],Noah Gordon,2013-12-02,9783641136291,[Literatur & Unterhaltung],1
1,Die Erben von Midkemia 4,Die Bedrohungen für Midkemia und Kelewan wolle...,[Literatur & Unterhaltung > Fantasy > Heroisch...,Raymond Feist,2016-06-20,9783641185787,[Literatur & Unterhaltung],1
2,Völlig losgelöst,In der Dreizimmerwohnung stapeln sich Flohmark...,[Ratgeber > Lebenshilfe & Psychologie > Besser...,Susanne Weingarten,2019-01-14,9783328103646,[Ratgeber],1


In [16]:
# flat_book_df.category.value_counts()

In [17]:
print(len(flat_book_df))
print(sum(book_df["count_of_categories"].values))

16627
17779


## ^ The above diff is due to 4 records having 4 toplevel categories and they have been limited to 3

In [18]:
final_df = flat_book_df

In [19]:
final_df["authors"] = final_df["author"].apply(lambda x:[ i.strip() for i in str(x).split(",")])

In [20]:
final_df["published_date_parsed"] = pd.to_datetime(final_df["published_date"],infer_datetime_format=True)

In [21]:
final_df["year"] = final_df["published_date_parsed"].apply(lambda x: x.year)

In [24]:
final_df['ISBN_GRP'] = final_df["isbn"].apply(lambda x:str(x)[4:6])
final_df['ISBN_PUBLISHER'] = final_df["isbn"].apply(lambda x:str(x)[6:10]) 

In [25]:
final_df.head(3)

Unnamed: 0,title,description,categories,author,published_date,isbn,top_level,count_of_categories,authors,published_date_parsed,year,ISBN_GRP,ISBN_PUBLISHER
0,Die Klinik,Ein Blick hinter die Kulissen eines Krankenhau...,[Literatur & Unterhaltung > Romane & Erzählungen],Noah Gordon,2013-12-02,9783641136291,[Literatur & Unterhaltung],1,[Noah Gordon],2013-12-02,2013,64,1136
1,Die Erben von Midkemia 4,Die Bedrohungen für Midkemia und Kelewan wolle...,[Literatur & Unterhaltung > Fantasy > Heroisch...,Raymond Feist,2016-06-20,9783641185787,[Literatur & Unterhaltung],1,[Raymond Feist],2016-06-20,2016,64,1185
2,Völlig losgelöst,In der Dreizimmerwohnung stapeln sich Flohmark...,[Ratgeber > Lebenshilfe & Psychologie > Besser...,Susanne Weingarten,2019-01-14,9783328103646,[Ratgeber],1,[Susanne Weingarten],2019-01-14,2019,32,8103


In [26]:
validation_authors =  pickle.load(open("validation_authors",'rb'))
test_authors =  pickle.load(open("test_authors",'rb'))

In [27]:
all_authors = np.concatenate((validation_authors.values,final_df["authors"].values))

In [28]:
mlb = MultiLabelBinarizer()
mlb.fit(all_authors)

year_lb = LabelBinarizer()
year_lb.fit(final_df["year"])

isbn_group_lb = LabelBinarizer()
isbn_group_lb.fit(final_df["ISBN_GRP"])

isbn_publisher_lb = LabelBinarizer()
isbn_publisher_lb.fit(final_df["ISBN_PUBLISHER"])

LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)

In [29]:
final_df["authors"].values[:2]

array([list(['Noah Gordon']), list(['Raymond Feist'])], dtype=object)

In [30]:
def combine_title_desc(row):
    """Combines the Title and Description fields in the given row
    and returns the combined result"""
    return str(row["title"]) + " " + str(row["description"])
    #return str(row["title"]) + " " + str(row["description"]) + " " + str(row["author"])

In [31]:
final_df.head(3)

Unnamed: 0,title,description,categories,author,published_date,isbn,top_level,count_of_categories,authors,published_date_parsed,year,ISBN_GRP,ISBN_PUBLISHER
0,Die Klinik,Ein Blick hinter die Kulissen eines Krankenhau...,[Literatur & Unterhaltung > Romane & Erzählungen],Noah Gordon,2013-12-02,9783641136291,[Literatur & Unterhaltung],1,[Noah Gordon],2013-12-02,2013,64,1136
1,Die Erben von Midkemia 4,Die Bedrohungen für Midkemia und Kelewan wolle...,[Literatur & Unterhaltung > Fantasy > Heroisch...,Raymond Feist,2016-06-20,9783641185787,[Literatur & Unterhaltung],1,[Raymond Feist],2016-06-20,2016,64,1185
2,Völlig losgelöst,In der Dreizimmerwohnung stapeln sich Flohmark...,[Ratgeber > Lebenshilfe & Psychologie > Besser...,Susanne Weingarten,2019-01-14,9783328103646,[Ratgeber],1,[Susanne Weingarten],2019-01-14,2019,32,8103


### Prepare the data

In [32]:
del book_df
del flat_book_df

import gc
gc.collect()

1546

In [37]:
# X_train_categories = ["_".join([str(le.transform([a])[0]) for a in i]) for i in  X_train_df["categories"]]
def bpemb_tokenize(sentence):
    return [i.replace("▁","") for i in bpemb_de.encode(sentence)]


In [41]:
[a.tolist() for a in final_df["top_level"].values]

[['Literatur & Unterhaltung'],
 ['Literatur & Unterhaltung'],
 ['Ratgeber'],
 ['Literatur & Unterhaltung'],
 ['Literatur & Unterhaltung'],
 ['Künste'],
 ['Sachbuch'],
 ['Ratgeber'],
 ['Literatur & Unterhaltung'],
 ['Literatur & Unterhaltung'],
 ['Literatur & Unterhaltung'],
 ['Literatur & Unterhaltung'],
 ['Künste'],
 ['Literatur & Unterhaltung'],
 ['Literatur & Unterhaltung'],
 ['Literatur & Unterhaltung'],
 ['Sachbuch'],
 ['Sachbuch'],
 ['Literatur & Unterhaltung'],
 ['Ratgeber'],
 ['Literatur & Unterhaltung'],
 ['Literatur & Unterhaltung'],
 ['Literatur & Unterhaltung'],
 ['Glaube & Ethik'],
 ['Literatur & Unterhaltung'],
 ['Literatur & Unterhaltung'],
 ['Literatur & Unterhaltung'],
 ['Literatur & Unterhaltung'],
 ['Literatur & Unterhaltung'],
 ['Kinderbuch & Jugendbuch'],
 ['Literatur & Unterhaltung'],
 ['Ganzheitliches Bewusstsein'],
 ['Sachbuch'],
 ['Literatur & Unterhaltung'],
 ['Glaube & Ethik'],
 ['Literatur & Unterhaltung'],
 ['Sachbuch'],
 ['Literatur & Unterhaltung'],
 ['Gl

In [78]:
from sklearn import preprocessing
le = preprocessing.MultiLabelBinarizer()
final_df["target"]  =  le.fit_transform([a.tolist() for a in final_df["top_level"].values]).tolist()

data = train_test_split(final_df[["title","description","authors","year","count_of_categories","isbn","categories","ISBN_GRP","ISBN_PUBLISHER"]],
                final_df["target"], test_size=0.20, random_state=42, shuffle=True)

X_train_df, X_test_df, y_train, y_test = data
vectorizer = TfidfVectorizer(ngram_range=(1,2),stop_words=stopwords.words('german'),tokenizer=bpemb_tokenize)

X_train = vectorizer.fit_transform(
    X_train_df.apply(lambda row: combine_title_desc(row), axis=1))
X_test = vectorizer.transform(X_test_df.apply(lambda row: combine_title_desc(row),axis=1))

X_train_author = mlb.transform(X_train_df["authors"].values)
X_test_author = mlb.transform(X_test_df["authors"].values)

X_train_year = year_lb.transform(X_train_df["year"].values)
X_test_year = year_lb.transform(X_test_df["year"].values)

X_train_isbn_grp = isbn_group_lb.transform(X_train_df["ISBN_GRP"].values)
X_test_isbn_grp = isbn_group_lb.transform(X_test_df["ISBN_GRP"].values)

X_train_isbn_publisher = isbn_publisher_lb.transform(X_train_df["ISBN_PUBLISHER"].values)
X_test_isbn_publisher = isbn_publisher_lb.transform(X_test_df["ISBN_PUBLISHER"].values)


# X_train_categories = ["_".join([str(le.transform([a])[0]) for a in i]) for i in  X_train_df["categories"]]
# X_test_categories = ["_".join([str(le.transform([a])[0]) for a in i]) for i in  X_test_df["categories"]]

from scipy import sparse
X_train_all = sparse.hstack((
     X_train,sparse.csr_matrix(X_train_author),
     sparse.csr_matrix(X_train_year),
     sparse.csr_matrix(X_train_isbn_grp),
     sparse.csr_matrix(X_train_isbn_publisher),    
     sparse.csr_matrix(X_train_df["isbn"].values.reshape(len(X_train_df),1)),
     sparse.csr_matrix(X_train_df["count_of_categories"].values.reshape(len(X_train_df),1))    
    ))

X_test_all = sparse.hstack((
    X_test,sparse.csr_matrix(X_test_author),
    sparse.csr_matrix(X_test_year),
    sparse.csr_matrix(X_test_isbn_grp),
    sparse.csr_matrix(X_test_isbn_publisher),    
    sparse.csr_matrix(X_test_df["isbn"].values.reshape(len(X_test_df),1)),
    sparse.csr_matrix(X_test_df["count_of_categories"].values.reshape(len(X_test_df),1))
    ))



  'stop_words.' % sorted(inconsistent))


In [79]:
%%time
from scipy import sparse

X_full = sparse.vstack((X_train_all,X_test_all))
Y_full = np.concatenate((y_train,y_test))

CPU times: user 27.3 ms, sys: 9.03 ms, total: 36.3 ms
Wall time: 34.8 ms


In [80]:
# pickle.dump(mlb, open("mlb_ete_top_level.pickle", "wb"))
# pickle.dump(year_lb, open("lb_ete_top_level.pickle", "wb"))
# pickle.dump(vectorizer, open("vectorizer_ete_top_level.pickle", "wb"))
# pickle.dump(X_train_all, open("X_train_ete_top_level.pickle", "wb"))
# pickle.dump(X_test_all, open("X_test_ete_top_level.pickle", "wb"))
# pickle.dump(y_train, open("y_train_ete_top_level.pickle", "wb"))
# pickle.dump(y_test, open("y_test_ete_top_level.pickle", "wb"))
# pickle.dump(dict(zip(le.classes_,le.transform(le.classes_))), open("le_author_year_mapping.pickle", "wb"))

In [81]:
len(y_train)

13301

In [82]:
X_test_all.tocsr().shape

(3326, 588937)

In [83]:
len(y_test)

3326

In [84]:
X_train_author.shape

(13301, 8635)

In [85]:
X_train_year.shape

(13301, 55)

In [86]:
X_full_array = X_full.tocsr()


In [61]:
class_weights = {'Kinderbuch & Jugendbuch': 1.8, 'Ratgeber': 3, 'Sachbuch': 2,'Glaube & Ethik' : 2 ,'Künste' : 6,'Architektur & Garten' : 6 }

In [62]:
class_weights_encoded = []
for value,key in enumerate(class_weights):
    print(class_weights[key],key)
    class_weights_encoded.append((le.transform([[key]])[0],class_weights[key]))

1.8 Kinderbuch & Jugendbuch
3 Ratgeber
2 Sachbuch
2 Glaube & Ethik
6 Künste
6 Architektur & Garten


In [87]:
# dict(class_weights_encoded)
class_weights_encoded = {}

In [88]:
# y_train

In [89]:
# global_isbn_category_dict = [(i[0],[le.transform([k])[0] for k in i[1].tolist()]) for i in final_df[["isbn","categories"]].head(100).values]

# global_isbn_category_df = pd.DataFrame(global_isbn_category_dict,columns=["isbn","actual"])

# global_isbn_category_df["isbn"] = global_isbn_category_df["isbn"].apply(lambda x:str(x))

def get_one_instance(index,csr_matrix):
    unique_isbns = set([ str(int(i[0])) for i in csr_matrix[index,-2].todense()])
    isbns_dict = dict(list(zip([ str(int(i[0])) for i in csr_matrix[index,-2].todense()],index)))
    return (list(isbns_dict.keys()),list(isbns_dict.values()))

def getClasses(class_probs,class_count):
    arr = np.array(class_probs)
    return arr.argsort()[-1 * class_count:][::-1]

def get_multi_prediction(row):
    """Combines the Title and Description fields in the given row
    and returns the combined result"""
    return getClasses(row["predict_proba"],row["class_count"])

def compute_f1_score(prediction,class_count_prediction):
    prediction_df = pd.DataFrame(prediction,columns=["isbn","predict_proba"])
    prediction_df["isbn"] = prediction_df["isbn"].apply(lambda x:str(int(x[0])))
#     print(prediction_df)
    class_count_df = pd.DataFrame(class_count_prediction,columns=["isbn","class_count"])
    
    joined_ = prediction_df.merge(class_count_df,on="isbn",how="inner")
    
    joined_["predictions"] = joined_.apply(get_multi_prediction,axis=1)
    joined_1 = joined_.merge(global_isbn_category_df,on="isbn")
    predicted_ = joined_1["predictions"]
    actuals_ = joined_1["actual"]
    
    predicted_labels = [ [le.inverse_transform([k])[0] for k in i] for i in predicted_.values]
    actual_labels = [ [le.inverse_transform([k])[0] for k in i] for i in actuals_.values]
    
#     print(subtask_A_confusion_matrix(actual_labels,predicted_labels))
#     print(subtask_A_classification_report(actual_labels,predicted_labels))
    
    import numpy as np
    
    x = np.concatenate(actual_labels)
    
    unique, counts = np.unique(x, return_counts=True)

    print(np.asarray((unique, counts)).T)
    
    return subtask_A_evaluation(actual_labels,predicted_labels)
    
# compute_f1_score(a,b)

In [90]:
# X_full_array.shape

In [91]:
le.classes_

array(['Architektur & Garten', 'Ganzheitliches Bewusstsein',
       'Glaube & Ethik', 'Kinderbuch & Jugendbuch', 'Künste',
       'Literatur & Unterhaltung', 'Ratgeber', 'Sachbuch'], dtype=object)

In [93]:
# Literatur & Unterhaltung      8929
# Sachbuch                      2540
# Kinderbuch & Jugendbuch       2275
# Ratgeber                      2124
# Ganzheitliches Bewusstsein     916
# Glaube & Ethik                 689
# Künste                         165
# Architektur & Garten           145

In [105]:
[0,4,2,1,6,3,7,5]

[0, 4, 2, 1, 6, 3, 7, 5]

In [106]:
# sparse.csr_matrix(Y_full)

# Build End to end classifer

In [114]:
# a = None
# b = None
# from sklearn.model_selection import StratifiedKFold

# fold_classes = []
# probas = []
# cv = StratifiedKFold(n_splits=4, random_state=42)
# for train_index, test_index in cv.split(X_full_array,Y_full):      
# #     print("Train Index: ", train_index, "\n")
# #     print("Test Index: ", len(test_index))
    
#     lsvcclf = LinearSVC(random_state=42,max_iter=3000,verbose=1)
#     classifier = sklearn.multioutput.ClassifierChain(lsvcclf,order=[0,4,2,1,6,3,7,5])
#     X_train, X_test, y_train, y_test = X_full_array[train_index,:-2], X_full_array[test_index,:-2], Y_full[train_index], Y_full[test_index]
    
#     classifier.fit(X_train,y_train)
#     predictions(classifier.predict(X_test))



In [116]:
X_full_array[:,:-2]

<16627x588935 sparse matrix of type '<class 'numpy.float64'>'
	with 2764569 stored elements in Compressed Sparse Row format>

In [117]:
Y_full.shape

(16627,)

In [113]:
from sklearn.multioutput import ClassifierChain

lsvcclf = LinearSVC(random_state=42,max_iter=3000,verbose=1)
classifier = ClassifierChain(lsvcclf,order=[0,4,2,1,6,3,7,5])
# X_train, X_test, y_train, y_test = X_full_array[train_index,:-2], X_full_array[test_index,:-2], Y_full[train_index], Y_full[test_index]
    
classifier.fit(X_full_array[:,:-2],Y_full)

IndexError: tuple index out of range

In [45]:
lsvcclf = LinearSVC(random_state=42,max_iter=3000,verbose=1,class_weight= dict(class_weights_encoded))

lsvcclf.fit(X_full_array[:,:-2],Y_full)

[LibLinear]

LinearSVC(C=1.0, class_weight={0: 6, 2: 2, 3: 1.8, 4: 6, 6: 3, 7: 2}, dual=True,
          fit_intercept=True, intercept_scaling=1, loss='squared_hinge',
          max_iter=3000, multi_class='ovr', penalty='l2', random_state=42,
          tol=0.0001, verbose=1)

In [46]:
import eli5
eli5.explain_weights(lsvcclf,target_names=le.inverse_transform([0, 1, 2, 3, 4, 5, 6, 7]).tolist())

Weight?,Feature,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0
Weight?,Feature,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Weight?,Feature,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Weight?,Feature,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3
Weight?,Feature,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4
Weight?,Feature,Unnamed: 2_level_5,Unnamed: 3_level_5,Unnamed: 4_level_5,Unnamed: 5_level_5,Unnamed: 6_level_5,Unnamed: 7_level_5
Weight?,Feature,Unnamed: 2_level_6,Unnamed: 3_level_6,Unnamed: 4_level_6,Unnamed: 5_level_6,Unnamed: 6_level_6,Unnamed: 7_level_6
Weight?,Feature,Unnamed: 2_level_7,Unnamed: 3_level_7,Unnamed: 4_level_7,Unnamed: 5_level_7,Unnamed: 6_level_7,Unnamed: 7_level_7
+1.193,x587881,,,,,,
+0.998,x201499,,,,,,
+0.910,x587104,,,,,,
+0.901,x593923,,,,,,
+0.862,x593922,,,,,,
+0.851,x590171,,,,,,
+0.836,x585876,,,,,,
+0.818,x593924,,,,,,
+0.807,x588541,,,,,,
+0.798,x589924,,,,,,

Weight?,Feature
+1.193,x587881
+0.998,x201499
+0.910,x587104
+0.901,x593923
+0.862,x593922
+0.851,x590171
+0.836,x585876
+0.818,x593924
+0.807,x588541
+0.798,x589924

Weight?,Feature
+1.224,x591156
+1.223,x588420
+1.195,x594589
+1.150,x238339
+1.065,x594591
+1.040,x594222
+1.025,x585852
+1.018,x238233
+1.015,x594590
+0.983,x448894

Weight?,Feature
+1.460,x225096
+1.392,x307260
+1.029,x594702
+1.026,x97199
+0.997,x594860
+0.966,x589507
+0.950,x594884
+0.943,x590059
+0.924,x586954
+0.899,x588506

Weight?,Feature
+1.554,x37611
+1.383,x64
+1.250,x585331
+1.238,x593431
+1.222,x306042
+1.197,x444780
+1.179,x589086
+1.176,x591775
+1.163,x586299
+1.156,x587496

Weight?,Feature
+1.086,x593779
+0.879,x586764
+0.867,x589405
+0.780,x593606
+0.748,x586345
+0.744,x592609
+0.662,x594140
+0.588,x594141
+0.581,x594911
… 32770 more positive …,… 32770 more positive …

Weight?,Feature
+2.643,x24397
+2.468,x426619
+1.507,x190508
+1.317,x593884
+1.188,x340645
+1.149,x593195
+1.057,x297157
+1.045,x171301
… 227932 more positive …,… 227932 more positive …
… 226195 more negative …,… 226195 more negative …

Weight?,Feature
+1.696,x106527
+1.342,x150998
+1.252,x555018
+1.246,x557673
+1.220,x278979
+1.216,x416640
+1.181,x310466
+1.173,x594207
+1.089,x594588
+1.084,x594576

Weight?,Feature
+1.834,x130737
+1.740,x406562
+1.428,x130306
+1.325,x33288
+1.268,x593880
+1.259,x545009
+1.215,x593025
+1.181,x217779
+1.145,x593881
+1.132,x593879


In [47]:
dir(eli5)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_decision_path',
 '_feature_importances',
 '_feature_names',
 '_feature_weights',
 '_graphviz',
 'absolute_import',
 'base',
 'base_utils',
 'explain',
 'explain_prediction',
 'explain_prediction_df',
 'explain_prediction_dfs',
 'explain_prediction_lightgbm',
 'explain_prediction_sklearn',
 'explain_weights',
 'explain_weights_df',
 'explain_weights_dfs',
 'explain_weights_lightgbm',
 'explain_weights_sklearn',
 'format_as_dataframe',
 'format_as_dataframes',
 'format_as_dict',
 'format_as_html',
 'format_as_text',
 'format_html_styles',
 'formatters',
 'ipython',
 'lightgbm',
 'permutation_importance',
 'show_prediction',
 'show_weights',
 'sklearn',
 'transform',
 'transform_feature_names',
 'utils']

In [48]:
feature_names = vectorizer.get_feature_names()

In [49]:
len(feature_names)

585071

In [50]:
# for i in feature_names:
#     if("bergen" in i):
#         print(i)

In [51]:
# feature_names

In [52]:
le.inverse_transform([0, 1, 2, 3, 4, 5, 6, 7]).tolist()

['Architektur & Garten',
 'Ganzheitliches Bewusstsein',
 'Glaube & Ethik',
 'Kinderbuch & Jugendbuch',
 'Künste',
 'Literatur & Unterhaltung',
 'Ratgeber',
 'Sachbuch']

## Test Data Creation

In [53]:
test_book_df = pd.read_csv("/home/evenuma/germeval/data/test_set.csv")

In [54]:
test_book_df.head(3)

Unnamed: 0,title,description,author,published_date,isbn
0,Malbuch für 365 Tage,Ausmalen bringt Freude und entspannt. Dieses m...,,2016-10-03,9783809436690
1,Ansteckende Gesundheit,Die Beliebtheit der Geistheilung als Alternati...,Horst Krohne,2014-11-10,9783453702653
2,Karibu heißt willkommen,Die englische Farmerstochter Stella und das Ki...,Stefanie Zweig,2010-04-06,9783453407343


In [55]:
test_book_df["authors"] = test_book_df["author"].apply(lambda x:[ i.strip() for i in str(x).split(",")])

test_book_df["published_date_parsed"] = pd.to_datetime(final_df["published_date"],infer_datetime_format=True)

test_book_df["year"] = final_df["published_date_parsed"].apply(lambda x: x.year)

In [56]:
test_book_df['ISBN_GRP'] = test_book_df["isbn"].apply(lambda x:str(x)[4:6])
test_book_df['ISBN_PUBLISHER'] = test_book_df["isbn"].apply(lambda x:str(x)[6:10]) 

In [57]:
test_book_df.head(3)

Unnamed: 0,title,description,author,published_date,isbn,authors,published_date_parsed,year,ISBN_GRP,ISBN_PUBLISHER
0,Malbuch für 365 Tage,Ausmalen bringt Freude und entspannt. Dieses m...,,2016-10-03,9783809436690,[nan],2013-12-02,2013,80,9436
1,Ansteckende Gesundheit,Die Beliebtheit der Geistheilung als Alternati...,Horst Krohne,2014-11-10,9783453702653,[Horst Krohne],2016-06-20,2016,45,3702
2,Karibu heißt willkommen,Die englische Farmerstochter Stella und das Ki...,Stefanie Zweig,2010-04-06,9783453407343,[Stefanie Zweig],2019-01-14,2019,45,3407


In [58]:
from scipy import sparse

from sklearn import preprocessing
# le = preprocessing.LabelEncoder()
# final_df["target"]  = le.fit_transform(final_df["category"].values)

# data = train_test_split(final_df[["title","description","authors","category","year","count_of_categories","isbn","categories"]],
#                 final_df["target"], test_size=0.20, random_state=42, stratify=final_df["target"], shuffle=True)

# X_train_df, X_test_df, y_train, y_test = data
# vectorizer = TfidfVectorizer(ngram_range=(1,2),stop_words=stopwords.words('german'),tokenizer=bpemb_tokenize)

X_test_set = vectorizer.transform(test_book_df.apply(lambda row: combine_title_desc(row), axis=1))

X_test_set_author = mlb.transform(test_book_df["authors"].values)

X_test_set_year = year_lb.transform(test_book_df["year"].values)

X_test_ISBN_GRP = isbn_group_lb.transform(test_book_df["ISBN_GRP"].values)

X_test_ISBN_PUBLISHER = isbn_publisher_lb.transform(test_book_df["ISBN_PUBLISHER"].values)

# X_train_categories = ["_".join([str(le.transform([a])[0]) for a in i]) for i in  X_train_df["categories"]]
# X_test_categories = ["_".join([str(le.transform([a])[0]) for a in i]) for i in  X_test_df["categories"]]

from scipy import sparse
X_test_set_all = sparse.hstack((
     X_test_set,sparse.csr_matrix(X_test_set_author),
     sparse.csr_matrix(X_test_set_year),
     sparse.csr_matrix(X_test_ISBN_GRP),
     sparse.csr_matrix(X_test_ISBN_PUBLISHER)    
    ))

# X_test_all = sparse.hstack((
#     X_test,sparse.csr_matrix(X_test_author),
#     sparse.csr_matrix(X_test_year),
#     sparse.csr_matrix(X_test_df["isbn"].values.reshape(len(X_test_df),1)),
#     sparse.csr_matrix(X_test_df["count_of_categories"].values.reshape(len(X_test_df),1))
#     ))



  .format(sorted(unknown, key=str)))


In [59]:
X_full_array_arr = X_full_array[:,:-2].toarray()

## Create Final Model 

In [60]:
%%time
lsvcclf = LinearSVC(random_state=42,max_iter=3000,verbose=1,class_weight= dict(class_weights_encoded))
lsvcclf.fit(X_full_array_arr,Y_full)

[LibLinear]CPU times: user 35.6 s, sys: 83 ms, total: 35.6 s
Wall time: 35.7 s


In [61]:
%%time
lsvcclf_cc = LinearSVC(random_state=42,max_iter=3000,verbose=1)
lsvcclf_cc.fit(X_full_array_arr,[int(i[0]) for i in X_full_array[:,-1].todense().tolist()])

[LibLinear]CPU times: user 23.1 s, sys: 216 ms, total: 23.3 s
Wall time: 23.3 s


In [62]:
def getClasses(class_probs,class_count):
    arr = np.array(class_probs)
    return arr.argsort()[-1 * class_count:][::-1]

## Predict for the test set

In [63]:
test_book_df["class_count"] = lsvcclf_cc.predict(X_test_set_all)

In [64]:
y_prob_list = [i.tolist() for i in lsvcclf._predict_proba_lr(X_test_set_all)]

In [65]:
test_book_df["class_probs"] = y_prob_list

In [66]:
test_book_df.head(2)

Unnamed: 0,title,description,author,published_date,isbn,authors,published_date_parsed,year,ISBN_GRP,ISBN_PUBLISHER,class_count,class_probs
0,Malbuch für 365 Tage,Ausmalen bringt Freude und entspannt. Dieses m...,,2016-10-03,9783809436690,[nan],2013-12-02,2013,80,9436,1,"[0.04047110957544817, 0.17070709288981567, 0.1..."
1,Ansteckende Gesundheit,Die Beliebtheit der Geistheilung als Alternati...,Horst Krohne,2014-11-10,9783453702653,[Horst Krohne],2016-06-20,2016,45,3702,1,"[0.1011158337779388, 0.29829498522037784, 0.09..."


In [67]:
le_mapping= dict(zip(le.classes_,le.transform(le.classes_)))
le_mapping_inv = {v: k for k, v in le_mapping.items()}

def reverseMap(number):
    return le_mapping_inv[number]


In [68]:
test_book_df["class_predictions_e"] = test_book_df.apply(lambda x:getClasses(x["class_probs"],x["class_count"]),axis=1)


In [69]:
test_book_df["class_predictions"] = test_book_df["class_predictions_e"].apply(lambda x: [reverseMap(i) for i in x])

In [70]:
test_book_df.head(3)

Unnamed: 0,title,description,author,published_date,isbn,authors,published_date_parsed,year,ISBN_GRP,ISBN_PUBLISHER,class_count,class_probs,class_predictions_e,class_predictions
0,Malbuch für 365 Tage,Ausmalen bringt Freude und entspannt. Dieses m...,,2016-10-03,9783809436690,[nan],2013-12-02,2013,80,9436,1,"[0.04047110957544817, 0.17070709288981567, 0.1...",[6],[Ratgeber]
1,Ansteckende Gesundheit,Die Beliebtheit der Geistheilung als Alternati...,Horst Krohne,2014-11-10,9783453702653,[Horst Krohne],2016-06-20,2016,45,3702,1,"[0.1011158337779388, 0.29829498522037784, 0.09...",[1],[Ganzheitliches Bewusstsein]
2,Karibu heißt willkommen,Die englische Farmerstochter Stella und das Ki...,Stefanie Zweig,2010-04-06,9783453407343,[Stefanie Zweig],2019-01-14,2019,45,3407,1,"[0.09833484234270963, 0.10037577551466435, 0.0...",[5],[Literatur & Unterhaltung]


In [71]:
def formatOutput(isbn,class_predictions):
    classes = ["\t"+ i for i in class_predictions]
    return str(isbn) + "".join(classes) + "\n"

In [72]:
test_book_df[["isbn","class_predictions"]].to_csv("../../data/svm_test_set_submission_sub_task_a.csv",encoding="utf-8",index=False)

In [125]:
test_book_df = pd.read_csv("../../data/svm_test_set_submission_sub_task_a.csv",encoding="utf-8")

In [74]:
test_book_df.head(3)

Unnamed: 0,title,description,author,published_date,isbn,authors,published_date_parsed,year,ISBN_GRP,ISBN_PUBLISHER,class_count,class_probs,class_predictions_e,class_predictions
0,Malbuch für 365 Tage,Ausmalen bringt Freude und entspannt. Dieses m...,,2016-10-03,9783809436690,[nan],2013-12-02,2013,80,9436,1,"[0.04047110957544817, 0.17070709288981567, 0.1...",[6],[Ratgeber]
1,Ansteckende Gesundheit,Die Beliebtheit der Geistheilung als Alternati...,Horst Krohne,2014-11-10,9783453702653,[Horst Krohne],2016-06-20,2016,45,3702,1,"[0.1011158337779388, 0.29829498522037784, 0.09...",[1],[Ganzheitliches Bewusstsein]
2,Karibu heißt willkommen,Die englische Farmerstochter Stella und das Ki...,Stefanie Zweig,2010-04-06,9783453407343,[Stefanie Zweig],2019-01-14,2019,45,3407,1,"[0.09833484234270963, 0.10037577551466435, 0.0...",[5],[Literatur & Unterhaltung]


In [75]:
test_book_df["class_predictions"] = test_book_df["class_predictions"].apply(lambda x: literal_eval(x))

ValueError: malformed node or string: ['Ratgeber']

In [76]:
test_book_df.head(3)

Unnamed: 0,title,description,author,published_date,isbn,authors,published_date_parsed,year,ISBN_GRP,ISBN_PUBLISHER,class_count,class_probs,class_predictions_e,class_predictions
0,Malbuch für 365 Tage,Ausmalen bringt Freude und entspannt. Dieses m...,,2016-10-03,9783809436690,[nan],2013-12-02,2013,80,9436,1,"[0.04047110957544817, 0.17070709288981567, 0.1...",[6],[Ratgeber]
1,Ansteckende Gesundheit,Die Beliebtheit der Geistheilung als Alternati...,Horst Krohne,2014-11-10,9783453702653,[Horst Krohne],2016-06-20,2016,45,3702,1,"[0.1011158337779388, 0.29829498522037784, 0.09...",[1],[Ganzheitliches Bewusstsein]
2,Karibu heißt willkommen,Die englische Farmerstochter Stella und das Ki...,Stefanie Zweig,2010-04-06,9783453407343,[Stefanie Zweig],2019-01-14,2019,45,3407,1,"[0.09833484234270963, 0.10037577551466435, 0.0...",[5],[Literatur & Unterhaltung]


In [77]:
submissions = test_book_df.apply(lambda x:formatOutput(x["isbn"],x["class_predictions"]),axis=1).values

In [78]:
submissions

array(['9783809436690\tRatgeber\n',
       '9783453702653\tGanzheitliches Bewusstsein\n',
       '9783453407343\tLiteratur & Unterhaltung\n', ...,
       '9783442715206\tLiteratur & Unterhaltung\n',
       '9783809027003\tLiteratur & Unterhaltung\n',
       '9783641074029\tSachbuch\n'], dtype=object)

In [79]:
submissions = np.insert(submissions,0,"subtask_a\n")

In [80]:
submissions

array(['subtask_a\n', '9783809436690\tRatgeber\n',
       '9783453702653\tGanzheitliches Bewusstsein\n', ...,
       '9783442715206\tLiteratur & Unterhaltung\n',
       '9783809027003\tLiteratur & Unterhaltung\n',
       '9783641074029\tSachbuch\n'], dtype=object)

In [81]:
len(submissions)

4158

In [82]:
submission_format = "".join(submissions)

In [83]:
# submission_format

In [84]:
submission_file = open("../../data/svm_test_set_submission_sub_task_a.txt",'w',encoding="utf-8")

In [85]:
submission_file.writelines(submission_format)

In [86]:
submission_file.close()

In [94]:
len(np.unique(test_book_df.isbn))

4157

In [95]:
submissions

array(['subtask_a\n', '9783809436690\tRatgeber\n',
       '9783453702653\tGanzheitliches Bewusstsein\n', ...,
       '9783442715206\tLiteratur & Unterhaltung\n',
       '9783809027003\tLiteratur & Unterhaltung\n',
       '9783641074029\tLiteratur & Unterhaltung\n'], dtype=object)

In [133]:
bpemb_tokenize("Ein Blick hinter die Kulissen eines Krankenhauses vom Autor der Bestseller Der Medicus und Der Medicus von Saragossa. Der Wissenschaftler Adam Silverstone, der kubanische Aristokrat Rafael Meomartino und der Farbige Spurgeon Robinson - sie sind drei grundverschiedene Klinik-Ärzte, die unter der unerbittlichen Aufsicht von Dr. Longwood praktizieren. Eines Tages stirbt eine Patientin, und Dr. Longwood wittert einen Behandlungsfehler. Sofort macht er sich auf die Suche nach einem Schuldigen, dem er die Verantwortung in die Schuhe schieben könnte")

['ein',
 'blick',
 'hinter',
 'die',
 'kul',
 'issen',
 'eines',
 'kranken',
 'hauses',
 'vom',
 'autor',
 'der',
 'best',
 'seller',
 'der',
 'med',
 'icus',
 'und',
 'der',
 'med',
 'icus',
 'von',
 'sar',
 'ag',
 'ossa',
 '.',
 'der',
 'wissenschaftler',
 'adam',
 'silver',
 'stone',
 ',',
 'der',
 'kub',
 'anische',
 'arist',
 'okrat',
 'rafael',
 'me',
 'om',
 'art',
 'ino',
 'und',
 'der',
 'farb',
 'ige',
 'spur',
 'ge',
 'on',
 'robinson',
 '-',
 'sie',
 'sind',
 'drei',
 'grund',
 'ver',
 'schieden',
 'e',
 'klinik',
 '-',
 'ärzte',
 ',',
 'die',
 'unter',
 'der',
 'uner',
 'b',
 'itt',
 'lichen',
 'aufsicht',
 'von',
 'dr',
 '.',
 'long',
 'wood',
 'prakt',
 'izieren',
 '.',
 'eines',
 'tages',
 'stirbt',
 'eine',
 'patient',
 'in',
 ',',
 'und',
 'dr',
 '.',
 'long',
 'wood',
 'w',
 'itter',
 't',
 'einen',
 'behandlungs',
 'fehler',
 '.',
 'sofort',
 'macht',
 'er',
 'sich',
 'auf',
 'die',
 'suche',
 'nach',
 'einem',
 'schuld',
 'igen',
 ',',
 'dem',
 'er',
 'die',
 'vera

In [130]:
book_df["description"][0]

'Ein Blick hinter die Kulissen eines Krankenhauses vom Autor der Bestseller "Der Medicus" und "Der Medicus von Saragossa". Der Wissenschaftler Adam Silverstone, der kubanische Aristokrat Rafael Meomartino und der Farbige Spurgeon Robinson - sie sind drei grundverschiedene Klinik-Ärzte, die unter der unerbittlichen Aufsicht von Dr. Longwood praktizieren. Eines Tages stirbt eine Patientin, und Dr. Longwood wittert einen Behandlungsfehler. Sofort macht er sich auf die Suche nach einem Schuldigen, dem er die Verantwortung in die Schuhe schieben könnte ...'