In this notebook, we train a text classification model using tf-idf features and a support vector machine (SVM).

First, load the preprocessed data.

In [49]:
import pickle

In [50]:
def load_file(file_dir):
    with open(file_dir, 'rb') as file:
        return pickle.load(file)

title_train = load_file('data/title_train.pickle')
content_train = load_file('data/content_train.pickle')
author_train = load_file('data/author_train.pickle')

title_val = load_file('data/title_val.pickle')
content_val = load_file('data/content_val.pickle')
author_val = load_file('data/author_val.pickle')

title_test = load_file('data/title_test.pickle')
content_test = load_file('data/content_test.pickle')
author_test = load_file('data/author_test.pickle')

To use tf-idf, we first concatenate all the book titles and contents written by the same author into a single string per author. These per-author strings are used to fit the tf-idf vectorizer.

In [51]:
author_list_train = list(set(author_train))
concatenated_string_train = []
for author_of_author_list in author_list_train:
    concatenated_string = ""
    for idx, author_of_author_train in enumerate(author_train):
        if author_of_author_list == author_of_author_train:
            concatenated_string += title_train[idx]
            concatenated_string += content_train[idx]
    concatenated_string_train.append(concatenated_string)

# sanity check: the number of authors should equal the number of all-in-one strings
print(len(author_list_train))
print(len(concatenated_string_train))

75
75
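
The nested loop above scans every book once per author. Below is a minimal single-pass sketch of the same grouping, assuming the title, content and author lists are aligned by index as in the cell above. The helper name `concat_by_author` and the toy data are illustrative only and are not part of the notebook.

```python
from collections import defaultdict

def concat_by_author(titles, contents, authors):
    """Group title + content strings by author in one pass (hypothetical helper)."""
    pieces = defaultdict(list)
    for title, content, author in zip(titles, contents, authors):
        pieces[author].append(title)
        pieces[author].append(content)
    # Sorting makes the author order reproducible across runs,
    # unlike iterating over a set.
    author_list = sorted(pieces)
    strings = ["".join(pieces[author]) for author in author_list]
    return author_list, strings

# Toy data standing in for title_train, content_train and author_train.
toy_titles = ["t1", "t2", "t3"]
toy_contents = ["aaa", "bbb", "ccc"]
toy_authors = ["X", "Y", "X"]

toy_author_list, toy_strings = concat_by_author(toy_titles, toy_contents, toy_authors)
print(toy_author_list)
print(toy_strings)
```

Each author's string keeps the books in their original order, so the result matches the nested loop except for the order of the authors themselves.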


For the validation and test sets, there is no need to concatenate all books by the same author, because we make a prediction for each book individually. So we only concatenate each book's title and content.

In [52]:
author_list_val = author_val
concatenated_string_val = []
for title, content in zip(title_val, content_val):
    concatenated_string_val.append(title + content)

author_list_test = author_test
concatenated_string_test = []
for title, content in zip(title_test, content_test):
    concatenated_string_test.append(title + content)

Now we can convert the corpus into tf-idf vectors. sklearn provides a vectorizer class that makes this easy.

But before that, it helps to understand what tf-idf actually is. The term tf-idf combines two components: term frequency (tf) and inverse document frequency (idf). Before discussing what these two components mean, we need to define a few terms:

(word) token: usually refers to a unique word. For example, the sentence *"How do you do"* has four words but only three unique word tokens. The exact tokens depend on the tokenizer used and on parameter settings such as the n-gram range.<br>
document: all the text belonging to the same class. In this project, it refers to all the book titles and contents written by the same author.<br>

The general idea is:<br>
**Term frequency**: how many times a word token appears in a document.<br>
**Inverse document frequency**: the inverse of the number of documents in which a word token appears.<br>
To be precise, the formulas are:<br>
<span style="font-size:16px;text-align:center;">$tf = \frac{\text{number of occurrences of a term in the document}}{\text{number of words in the document}}$</span> and <span style="font-size:16px">$idf = \log_e\left(\frac{\text{total number of documents}}{\text{number of documents in which the term appears}}\right)$</span><br>
These two values are computed for each token in each document and multiplied to give the token's tf-idf value. The tf-idf values of all tokens in a document are then collected into a tf-idf vector representing that document.

Term frequency is straightforward: we consider a word token more important if it appears more often. Its drawback is that the tokens with the highest term frequencies are very likely to be stopwords and words that are common throughout every class. Giving these words high values does not help classification much.<br>Inverse document frequency is introduced to counter this. If a word appears very frequently in a document but also appears in most documents, its small idf value makes its tf-idf small as well. Conversely, if a word appears frequently in one document and only in that document, it is very likely important and iconic for that class, and its tf-idf is correspondingly high. This is why tf-idf is an elegant and common way to vectorize text in NLP tasks.
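
The two formulas can be checked by hand on a tiny corpus. The sketch below implements them exactly as written above, in plain Python, on made-up English documents. The function names and the toy corpus are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a slightly different variant (raw counts, a smoothed idf and L2 normalization), so its numbers will not match these.

```python
import math

def tf(term, doc_tokens):
    """Occurrences of the term divided by the number of tokens in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, docs):
    """Natural log of (number of documents / number of documents containing the term)."""
    n_containing = sum(1 for doc in docs if term in doc)
    # Note: this raises ZeroDivisionError for a term that appears in no document.
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc_tokens, docs):
    return tf(term, doc_tokens) * idf(term, docs)

# Toy corpus: each inner list is one already-tokenized document.
toy_docs = [
    ["the", "cat", "sat", "the"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]

for term in ["the", "cat", "sat"]:
    print(term, round(tf_idf(term, toy_docs[0], toy_docs), 4))
```

In the first document, "the" is the most frequent token but appears in every document, so its idf is ln(1) = 0 and its tf-idf is 0. "sat" appears only in this document and receives the highest score of the three.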

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [54]:
# notes on parameters
# analyzer: 'char' because we want each character to be a token, since Japanese, unlike English, does not separate words with spaces
# ngram_range: tokens that appear together in the same order are treated as one unique token; the range (1,3) means we accept unigrams (1), bigrams (2) and trigrams (3)
# max_df: tokens that appear in more than this proportion of documents (here 80%) are dropped from the vocabulary; they are usually stopwords
# smooth_idf (left at its default, True): adds 1 to both the document count and each document frequency when computing idf, which prevents division by zero
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(1,3), max_df=0.8)
vectorizer.fit(concatenated_string_train)

TfidfVectorizer(analyzer='char', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=0.8, max_features=None,
                min_df=1, ngram_range=(1, 3), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)
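
To make `analyzer='char'` with `ngram_range=(1, 3)` concrete, here is a rough plain-Python sketch of the character n-grams such a setting would produce for a short string. The function `char_ngrams` and the sample text are assumptions for illustration. The sketch ignores the vectorizer's own preprocessing (for example lowercasing), so it only approximates what `TfidfVectorizer` does internally.

```python
def char_ngrams(text, ngram_range=(1, 3)):
    """Return all character n-grams of the text for n in the given inclusive range."""
    lo, hi = ngram_range
    return [
        text[i:i + n]
        for n in range(lo, hi + 1)
        for i in range(len(text) - n + 1)
    ]

sample = "吾輩は猫"
grams = char_ngrams(sample)
print(grams)
print(len(grams))
```

A four-character string yields 4 unigrams, 3 bigrams and 2 trigrams. Over a whole corpus, the number of distinct n-grams grows quickly, which is consistent with a vocabulary in the millions.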

In [55]:
X_train = vectorizer.transform(concatenated_string_train)

In [56]:
print(X_train.shape)

(75, 2313293)


And let's do the same for validation and test sets.

In [57]:
X_val = vectorizer.transform(concatenated_string_val)
X_test = vectorizer.transform(concatenated_string_test)

Now that X is ready, let's prepare y as well. The author lists are currently lists of strings, but we need lists of class numbers. We did something similar in the preprocessing notebook.

In [58]:
from sklearn.preprocessing import LabelEncoder

In [59]:
le = LabelEncoder()
le = le.fit(author_list_train)

In [60]:
y_train = le.transform(author_list_train)
y_val = le.transform(author_list_val)
y_test = le.transform(author_list_test)
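
`LabelEncoder` assigns class numbers according to the sorted order of the labels seen in `fit`, and `transform` raises a `ValueError` for any label it has not seen. The cell above ran without error, so every validation and test author was present in training. The sketch below mirrors that mapping in plain Python and shows a check that could be run beforehand. The helper names and toy labels are assumptions for illustration.

```python
def build_label_map(train_labels):
    """Map each distinct label to an integer, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(train_labels)))}

def unseen_labels(train_labels, other_labels):
    """Labels that occur in other_labels but never in train_labels."""
    return sorted(set(other_labels) - set(train_labels))

# Toy author names standing in for the real author lists.
toy_train = ["C", "A", "B", "A"]
toy_val = ["A", "C", "D"]

label_map = build_label_map(toy_train)
print(label_map)
print(unseen_labels(toy_train, toy_val))
```

An empty list from `unseen_labels` means every label in the other split can be encoded.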

Then we can feed the data to our machine learning model, a Support Vector Machine (SVM).

In [62]:
from sklearn import svm

In [63]:
clf = svm.SVC(decision_function_shape='ovo')
clf.fit(X_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
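
With `decision_function_shape='ovo'`, the decision function of `SVC` has one column per pair of classes. `SVC` trains one-vs-one classifiers internally in either case, and this parameter only changes the shape of the decision function output. The small sketch below counts the pairs for the 75 authors in the training set. The variable names are illustrative.

```python
from itertools import combinations

n_classes = 75  # number of distinct training authors printed earlier

# One binary classifier per unordered pair of classes.
class_pairs = list(combinations(range(n_classes), 2))
print(len(class_pairs))

# Same count from the closed-form expression n * (n - 1) / 2.
print(n_classes * (n_classes - 1) // 2)
```

Both lines print 2775, the number of pairwise classifiers involved.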

In [64]:
y_pred = clf.predict(X_val)

In [65]:
from sklearn.metrics import classification_report

In [66]:
print(classification_report(y_val, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         8
           1       0.00      0.00      0.00         2
           2       0.94      0.89      0.91        54
           3       1.00      1.00      1.00         1
           4       1.00      1.00      1.00         2
           5       1.00      1.00      1.00         2
           6       1.00      0.06      0.12        16
           7       0.91      1.00      0.95        41
           8       1.00      0.50      0.67         2
           9       0.62      1.00      0.76         8
          10       0.00      0.00      0.00         1
          11       0.78      0.50      0.61        14
          12       0.00      0.00      0.00        10
          13       0.00      0.00      0.00         1
          14       0.87      0.97      0.92        34
          15       0.75      0.60      0.67         5
          16       0.67      1.00      0.80        14
          17       0.75    

  _warn_prf(average, modifier, msg_start, len(result))
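
Several classes in the report show a precision of 0.00. The trailing `_warn_prf` line is most likely the tail of scikit-learn's warning that precision is ill-defined for labels that were never predicted. This is an assumption, since the warning text itself is cut off. The plain-Python sketch below shows how per-class precision and recall are computed and why a class that is never predicted ends up with precision 0. The function name and toy labels are illustrative.

```python
def precision_recall(y_true, y_pred, label):
    """Per-class precision and recall, falling back to 0.0 when a denominator is zero."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    n_predicted = sum(1 for p in y_pred if p == label)
    n_actual = sum(1 for t in y_true if t == label)
    # With no predictions for this label, precision is 0/0; report 0.0 instead.
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_actual if n_actual else 0.0
    return precision, recall

# Toy labels: class 2 exists in y_true but is never predicted.
y_true_toy = [0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 0, 1, 0]

for label in [0, 1, 2]:
    print(label, precision_recall(y_true_toy, y_pred_toy, label))
```

Class 2 gets precision 0.0 and recall 0.0 because the classifier never outputs it. Classes with very few validation books are the ones most likely to show this pattern.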
