# Introduction

The goal of text classification is to automatically assign text documents to one or more predefined categories. Some examples of text classification are:
- Understanding audience sentiment from social media,
- Detection of spam and non-spam emails,
- Auto tagging of customer queries, and
- Categorization of news articles into defined topics. <br> <br>

Text classification is a supervised machine learning task, since a labelled dataset containing text documents and their labels is used to train a classifier. We follow four steps:
- Dataset Preparation (Preprocessing Data)
- Feature Engineering (Preprocessing Data)
- Model Training
- Improve Performance 


In this tutorial, we will implement a text classification model for Vietnamese newspaper articles. <br>
The dataset contains 10 classes.

# Preprocessing Data

The dataset was downloaded from https://github.com/duyvuleo/VNTC

In [None]:
!wget -c https://github.com/duyvuleo/VNTC/raw/master/Data/10Topics/Ver1.1/Train_Full.rar
!unrar x -r Train_Full.rar

In [2]:
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

import pandas, xgboost, numpy, textblob, string
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers
from keras.layers import *

## Dataset preparation

In [5]:
!pip install pyvi
from pyvi import ViTokenizer, ViPosTagger
from tqdm import tqdm
import numpy as np
import gensim

Collecting pyvi
  Downloading pyvi-0.1.1-py2.py3-none-any.whl (8.5 MB)
Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.3.6-py2.py3-none-any.whl (12 kB)
Collecting python-crfsuite>=0.8.3
  Downloading python_crfsuite-0.9.7-cp37-cp37m-manylinux1_x86_64.whl (743 kB)
Installing collected packages: python-crfsuite, sklearn-crfsuite, pyvi
Successfully installed python-crfsuite-0.9.7 pyvi-0.1.1 sklearn-crfsuite-0.3.6


In [9]:
import os 

def get_data(folder_path):
    """Read every file under folder_path/<label>/ and return (documents, labels)."""
    X = []
    y = []
    # each sub-folder name is the class label of the files it contains
    dirs = os.listdir(folder_path)
    for path in dirs:
        file_paths = os.listdir(os.path.join(folder_path, path))
        for file_path in tqdm(file_paths):
            with open(os.path.join(folder_path, path, file_path), 'r', encoding="utf-16") as f:
                lines = f.readlines()
                lines = ' '.join(lines)
                # lowercase the text, strip punctuation and drop very short or very long tokens
                lines = gensim.utils.simple_preprocess(lines)
                lines = ' '.join(lines)
                # Vietnamese word segmentation: syllables of one word are joined with '_'
                lines = ViTokenizer.tokenize(lines)
                X.append(lines)
                y.append(path)
    return X, y

#train_path = os.path.join(dir_path, 'Train_Full')
X_data, y_data = get_data('Train_Full')




100%|██████████| 1820/1820 [00:18<00:00, 98.36it/s]
100%|██████████| 3159/3159 [00:43<00:00, 72.65it/s]
100%|██████████| 5219/5219 [00:56<00:00, 91.96it/s] 
100%|██████████| 3080/3080 [00:36<00:00, 84.31it/s]
100%|██████████| 2552/2552 [00:26<00:00, 97.75it/s]
100%|██████████| 2481/2481 [00:21<00:00, 118.11it/s]
100%|██████████| 2898/2898 [00:28<00:00, 102.62it/s]
100%|██████████| 5298/5298 [01:07<00:00, 79.04it/s]
100%|██████████| 3868/3868 [00:34<00:00, 112.76it/s]
100%|██████████| 3384/3384 [00:33<00:00, 99.85it/s]


In [10]:
import pickle

with open('X_data.pkl', 'wb') as f:
    pickle.dump(X_data, f)
with open('y_data.pkl', 'wb') as f:
    pickle.dump(y_data, f)

In [None]:
!wget -c https://github.com/ltdaovn/VNTC/raw/master/Data/10Topics/Ver1.1/Test_Full.rar
!unrar x -r Test_Full.rar

In [12]:
#test_path = os.path.join(dir_path, 'VNTC-master/Data/10Topics/Ver1.1/Test_Full')
X_test, y_test = get_data('Test_Full')

100%|██████████| 2096/2096 [00:25<00:00, 83.72it/s]
100%|██████████| 2036/2036 [00:34<00:00, 58.19it/s]
100%|██████████| 7567/7567 [01:20<00:00, 94.13it/s]
100%|██████████| 6250/6250 [01:19<00:00, 78.52it/s]
100%|██████████| 5276/5276 [01:00<00:00, 87.54it/s]
100%|██████████| 4560/4560 [00:44<00:00, 101.35it/s]
100%|██████████| 6716/6716 [01:01<00:00, 109.01it/s]
100%|██████████| 6667/6667 [01:32<00:00, 72.25it/s]
100%|██████████| 3788/3788 [00:36<00:00, 103.18it/s]
100%|██████████| 5417/5417 [01:01<00:00, 88.09it/s]


In [13]:
with open('X_test.pkl', 'wb') as f:
    pickle.dump(X_test, f)
with open('y_test.pkl', 'wb') as f:
    pickle.dump(y_test, f)

## Feature Engineering

In this step, raw text data is transformed into feature vectors, and new features are created from the existing dataset. We will look at the following ideas:
1. Count Vectors as features
2. TF-IDF Vectors as features<br>
    2.1. Word level<br>
    2.2. N-Gram level<br>
    2.3. Character level
3. Word Embeddings as features
4. Text / NLP based features
5. Topic Models as features

In [14]:
import pickle

with open('X_data.pkl', 'rb') as f:
    X_data = pickle.load(f)
with open('y_data.pkl', 'rb') as f:
    y_data = pickle.load(f)

with open('X_test.pkl', 'rb') as f:
    X_test = pickle.load(f)
with open('y_test.pkl', 'rb') as f:
    y_test = pickle.load(f)

### Count Vectors as features
Count Vector is a matrix notation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell represents the frequency count of a particular term in a particular document.
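
As a small illustration of that layout, the sketch below builds a count matrix by hand for a toy corpus. The two documents and the helper name `build_count_matrix` are made up for this example and are not part of the dataset.

```python
from collections import Counter

def build_count_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    # vocabulary: every distinct whitespace-separated token, sorted for a stable column order
    vocabulary = sorted({token for doc in docs for token in doc.split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        # a Counter returns 0 for terms that do not occur in the document
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# hypothetical, already word-segmented documents
toy_docs = ["bóng_đá việt_nam", "kinh_tế việt_nam việt_nam"]
vocabulary, matrix = build_count_matrix(toy_docs)
print(vocabulary)
print(matrix)
```

Each row is one document and each column is one term, which is the same shape convention that `CountVectorizer` produces (as a sparse matrix) on the real data.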

In [15]:
# create a count vectorizer object 
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(X_data)

# transform the training and test data using the count vectorizer object
X_data_count = count_vect.transform(X_data)
X_test_count = count_vect.transform(X_test)

### TF-IDF Vectors

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)<br>
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)<br>
TF-IDF Vectors can be generated at different levels of input tokens (words, characters, n-grams)

a. Word Level TF-IDF : Matrix representing tf-idf scores of every term in different documents

b. N-gram Level TF-IDF : N-grams are combinations of N consecutive terms. This matrix represents the tf-idf scores of N-grams

c. Character Level TF-IDF : Matrix representing tf-idf scores of character level n-grams in the corpus
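
To make the two formulas concrete, here is a minimal sketch that evaluates them by hand on a toy corpus. The corpus and the helper names `tf` and `idf` are assumptions made for illustration. Note that `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from this plain version.

```python
import math

def tf(term, doc_tokens):
    # (number of times the term appears in the document) / (total number of terms in it)
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # log_e(total number of documents / number of documents containing the term)
    # assumes the term occurs in at least one document
    n_docs_with_term = sum(1 for tokens in corpus_tokens if term in tokens)
    return math.log(len(corpus_tokens) / n_docs_with_term)

# hypothetical, already word-segmented documents
toy_corpus = [doc.split() for doc in [
    "bóng_đá việt_nam",
    "kinh_tế việt_nam",
    "bóng_đá anh bóng_đá",
]]

# tf-idf of "bóng_đá" in the third document: tf = 2/3, idf = log(3/2)
score = tf("bóng_đá", toy_corpus[2]) * idf("bóng_đá", toy_corpus)
print(score)
```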



In [16]:
# word level - limit the vocabulary to the 30000 most frequent words instead of all 100k+ words
tfidf_vect = TfidfVectorizer(analyzer='word', max_features=30000)
tfidf_vect.fit(X_data) # learn vocabulary and idf from training set
X_data_tfidf =  tfidf_vect.transform(X_data)
# the test set is treated as unseen: it is only transformed with the vocabulary and idf learned above
X_test_tfidf =  tfidf_vect.transform(X_test)

In [17]:
tfidf_vect.get_feature_names()

['aa',
 'aaa',
 'aac',
 'aachen',
 'aaron',
 'aas',
 'ab',
 'aba',
 'abashidze',
 'abba',
 'abbas',
 'abbey',
 'abbiati',
 'abbondanzieri',
 'abbott',
 'abc',
 'abd',
 'abdel',
 'abdelrahim',
 'abdoulaye',
 'abdul',
 'abdulaziz',
 'abdullah',
 'abe',
 'abel',
 'aberdeen',
 'abeyie',
 'abf',
 'abidjan',
 'abkhazia',
 'able',
 'abn',
 'about',
 'abqaiq',
 'abraham',
 'abramoff',
 'abramovich',
 'abs',
 'abtc',
 'abu',
 'ac',
 'academy',
 'acasiete',
 'acb',
 'acbs',
 'accc',
 'accept',
 'access',
 'account',
 'accumbens',
 'ace',
 'aceh',
 'acer',
 'acetaminophen',
 'achilefu',
 'achilles',
 'acid',
 'acid_amin',
 'acid_béo',
 'acl',
 'acm',
 'acoo',
 'acpe',
 'acrobat',
 'acronis',
 'acropolis',
 'acrylic',
 'act',
 'action',
 'active',
 'activex',
 'acuff',
 'acyclovir',
 'ad',
 'adam',
 'adams',
 'adan',
 'adani',
 'adapter',
 'adb',
 'add',
 'address',
 'addvote',
 'adebayor',
 'adelaide',
 'adelman',
 'aden',
 'adeno',
 'adeportivo',
 'adidas',
 'adler',
 'admin',
 'adn',
 'adnan',


In [18]:
# ngram level - keep only the 30000 most frequent word bigrams and trigrams
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', max_features=30000, ngram_range=(2, 3))
tfidf_vect_ngram.fit(X_data)
X_data_tfidf_ngram =  tfidf_vect_ngram.transform(X_data)
# the test set is treated as unseen: it is only transformed with the vocabulary and idf learned above
X_test_tfidf_ngram =  tfidf_vect_ngram.transform(X_test)

In [19]:
tfidf_vect_ngram.get_feature_names()

['abu ghraib',
 'ac milan',
 'ac milan và',
 'agribank cup',
 'ai biết',
 'ai có',
 'ai có_thể',
 'ai cũng',
 'ai cũng biết',
 'ai cũng có',
 'ai cũng có_thể',
 'ai cả',
 'ai cập',
 'ai dám',
 'ai hết',
 'ai khác',
 'ai không',
 'ai là',
 'ai là người',
 'ai làm',
 'ai muốn',
 'ai mà',
 'ai nghĩ',
 'ai nói',
 'ai sẽ',
 'ai trong',
 'ai và',
 'ai đã',
 'ai đó',
 'ai được',
 'ajax amsterdam',
 'al jazeera',
 'al qaeda',
 'al zarqawi',
 'album của',
 'album mới',
 'album này',
 'alex ferguson',
 'alfred riedl',
 'am subject',
 'am subject gui',
 'an bình',
 'an cho',
 'an giang',
 'an phú',
 'an và',
 'an đã',
 'an_ninh cho',
 'an_ninh của',
 'an_ninh mạng',
 'an_ninh quốc_gia',
 'an_ninh trật_tự',
 'an_ninh và',
 'an_ninh điều_tra',
 'an_ninh điều_tra bộ',
 'an_toàn cho',
 'an_toàn của',
 'an_toàn giao_thông',
 'an_toàn hơn',
 'an_toàn thực_phẩm',
 'an_toàn trong',
 'an_toàn và',
 'andre agassi',
 'andrew carnegie',
 'andy roddick',
 'anh anh',
 'anh biết',
 'anh bình',
 'anh bạn',
 'anh

In [20]:
# ngram-char level - keep only the 30000 most frequent character bigrams and trigrams
tfidf_vect_ngram_char = TfidfVectorizer(analyzer='char', max_features=30000, ngram_range=(2, 3))
tfidf_vect_ngram_char.fit(X_data)
X_data_tfidf_ngram_char =  tfidf_vect_ngram_char.transform(X_data)
# the test set is treated as unseen: it is only transformed with the vocabulary and idf learned above
X_test_tfidf_ngram_char =  tfidf_vect_ngram_char.transform(X_test)

#### Transform by SVD to decrease number of dimensions

##### Word Level

In [21]:
from sklearn.decomposition import TruncatedSVD

In [22]:
svd = TruncatedSVD(n_components=300, random_state=42)
svd.fit(X_data_tfidf)

TruncatedSVD(algorithm='randomized', n_components=300, n_iter=5,
             random_state=42, tol=0.0)

In [23]:
X_data_tfidf_svd = svd.transform(X_data_tfidf)
X_test_tfidf_svd = svd.transform(X_test_tfidf)

##### ngram Level

In [24]:
svd_ngram = TruncatedSVD(n_components=300, random_state=42)
svd_ngram.fit(X_data_tfidf_ngram)

TruncatedSVD(algorithm='randomized', n_components=300, n_iter=5,
             random_state=42, tol=0.0)

In [25]:
X_data_tfidf_ngram_svd = svd_ngram.transform(X_data_tfidf_ngram)
X_test_tfidf_ngram_svd = svd_ngram.transform(X_test_tfidf_ngram)

##### ngram Char Level

In [26]:
svd_ngram_char = TruncatedSVD(n_components=300, random_state=42)
svd_ngram_char.fit(X_data_tfidf_ngram_char)

TruncatedSVD(algorithm='randomized', n_components=300, n_iter=5,
             random_state=42, tol=0.0)

In [27]:
X_data_tfidf_ngram_char_svd = svd_ngram_char.transform(X_data_tfidf_ngram_char)
X_test_tfidf_ngram_char_svd = svd_ngram_char.transform(X_test_tfidf_ngram_char)

### Word Embeddings

We will convert each word in a document to an embedding vector, using a pretrained model for Vietnamese. The model can be downloaded from https://github.com/Kyubyong/wordvectors

Assume that a document has $n$ words and each word is represented by a 300-dimensional vector. The document is then a 2-dimensional matrix of size $ n \times 300 $, which can be used as input to DNN, RNN, or CNN models.
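
Because $n$ differs from document to document, neural models usually need the matrices brought to a common length first. The sketch below shows one possible way to do that with padding and truncation. The helper `document_matrix`, the 3-dimensional toy embeddings, and the `max_len` value are assumptions for illustration only; the real vectors would come from the pretrained model.

```python
def document_matrix(tokens, embeddings, max_len, dim):
    """Stack word vectors into a max_len x dim matrix (list of rows)."""
    # keep only words that have a pretrained vector
    rows = [embeddings[token] for token in tokens if token in embeddings]
    # truncate long documents
    rows = rows[:max_len]
    # pad short documents with zero vectors
    rows += [[0.0] * dim for _ in range(max_len - len(rows))]
    return rows

# hypothetical 3-dimensional embeddings instead of the real 300-dimensional ones
toy_embeddings = {
    "bóng_đá": [0.1, 0.2, 0.3],
    "việt_nam": [0.4, 0.5, 0.6],
}

# "hôm_nay" has no vector in the toy vocabulary, so it is skipped
matrix = document_matrix("bóng_đá việt_nam hôm_nay".split(), toy_embeddings, max_len=4, dim=3)
print(len(matrix), len(matrix[0]))
```

Zero-padding is only one option; averaging the word vectors into a single vector per document is a common alternative when a fixed-size input for a dense network is enough.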

In [None]:
from gensim.models import KeyedVectors 
dir_path = os.path.dirname(os.path.realpath(os.getcwd()))

!wget -c https://github.com/ltdaovn/Natual-Language-Processing/raw/master/vi.vec

#word2vec_model_path = os.path.join(dir_path, "Data/vi/vi.vec")
word2vec_model_path = "vi.vec"

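# load_word2vec_format returns a KeyedVectors object directly, so no `.wv` lookup is needed
w2v = KeyedVectors.load_word2vec_format(word2vec_model_path)
wv = w2v
# gensim 4+ stores the vocabulary in key_to_index; older versions use vocab
vocab = wv.key_to_index if hasattr(wv, 'key_to_index') else wv.vocab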

In [30]:
def get_word2vec_data(X):
    word2vec_data = []
    for x in X:
        sentence = []
        for word in x.split(" "):
            # skip words that are missing from the pretrained vocabulary
            if word in vocab:
                sentence.append(wv[word])

        word2vec_data.append(sentence)
    return word2vec_data

X_data_w2v = get_word2vec_data(X_data)
X_test_w2v = get_word2vec_data(X_test)



### Text / NLP based features
Idea from https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/

A number of extra text based features can also be created which sometimes are helpful for improving text classification models. Some examples are:

1. Word Count of the documents – total number of words in the documents
2. Character Count of the documents – total number of characters in the documents
3. Average Word Density of the documents – average length of the words used in the documents
4. Punctuation Count of the documents – total number of punctuation marks in the documents
5. Upper Case Count of the documents – total number of upper case words in the documents
6. Title Word Count of the documents – total number of proper case (title) words in the documents
7. Frequency distribution of Part of Speech Tags:
    - Noun Count
    - Verb Count
    - Adjective Count
    - Adverb Count
    - Pronoun Count
    
These features are highly experimental and should only be used when they suit the problem at hand.
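
A minimal sketch of how the first six features could be computed with plain Python is shown below. The function name `text_features` and the sample sentence are made up for illustration. These features would have to be computed on the raw text, because `simple_preprocess` in `get_data` lowercases the documents and strips punctuation.

```python
import string

def text_features(doc):
    """Compute simple surface features for one raw (unprocessed) document."""
    words = doc.split()
    return {
        "word_count": len(words),
        "char_count": len(doc),
        # average word length; max(..., 1) avoids dividing by zero on empty documents
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        # string.punctuation only covers ASCII punctuation marks
        "punctuation_count": sum(ch in string.punctuation for ch in doc),
        "upper_case_word_count": sum(w.isupper() for w in words),
        "title_word_count": sum(w.istitle() for w in words),
    }

# hypothetical raw sentence, not taken from the dataset
features = text_features("Hà Nội thắng TP HCM, tỉ số 2-1!")
print(features)
```

The part-of-speech counts in item 7 would need a tagger; `ViPosTagger` from `pyvi`, imported earlier, is one candidate for Vietnamese.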

### Topic Models as features

Topic modelling is a technique for identifying groups of words (called topics) that carry the most information in a collection of documents. I have used Latent Dirichlet Allocation (LDA) to generate topic modelling features. LDA is an iterative model that starts from a fixed number of topics. Each topic is represented as a distribution over words, and each document is then represented as a distribution over topics. Although the tokens themselves are meaningless, the probability distributions over words provided by the topics give a sense of the different ideas contained in the documents.
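
As a rough sketch of how such features could be produced, the example below fits scikit-learn's `LatentDirichletAllocation` on count vectors of a toy corpus. The four documents, the choice of 2 topics, and the variable names are assumptions for illustration; on the real data the input would be a count matrix such as `X_data_count`, with a number of topics chosen by experiment.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical, already word-segmented documents
toy_docs = [
    "bóng_đá cầu_thủ trận_đấu",
    "cầu_thủ bóng_đá huấn_luyện_viên",
    "kinh_tế ngân_hàng lãi_suất",
    "ngân_hàng kinh_tế đầu_tư",
]

# LDA works on raw term counts, not on tf-idf weights
counts = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}').fit_transform(toy_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
# each row is one document's distribution over the 2 topics and sums to 1
doc_topics = lda.fit_transform(counts)
print(doc_topics.shape)
```

The rows of `doc_topics` can be passed to a classifier in the same way as the other feature matrices.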

### Convert y to categorical

In [None]:
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()
y_data_n = encoder.fit_transform(y_data)
# reuse the mapping learned on the training labels so class ids match in both sets
y_test_n = encoder.transform(y_test)

In [None]:
encoder.classes_

# Model

In this tutorial, we will implement several models and compare them to find the most effective one for this text classification problem. We will implement these models:
1. Naive Bayes Classifier
2. Linear Classifier
3. Support Vector Machine
4. Bagging Models
5. Boosting Models
6. Shallow Neural Networks
7. Deep Neural Networks
    - Convolutional Neural Network (CNN)
    - Long Short-Term Memory (LSTM)
    - Gated Recurrent Unit (GRU)
    - Bidirectional RNN
    - Recurrent Convolutional Neural Network (RCNN)
    - Other Variants of Deep Neural Networks
8. Doc2Vec model

We use the following helper function to train and evaluate each classifier: <br>
(Because of my machine's limited memory, I only test on word-level TF-IDF features, with and without SVD.)

In [32]:
from sklearn.model_selection import train_test_split

In [33]:
def train_model(classifier, X_data, y_data, X_test, y_test, is_neuralnet=False, n_epochs=3):
    # hold out 10% of the training data for validation
    X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.1, random_state=42)
    
    if is_neuralnet:
        classifier.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=n_epochs, batch_size=512)
        
        # the network outputs class probabilities; keep the most likely class
        val_predictions = classifier.predict(X_val).argmax(axis=-1)
        test_predictions = classifier.predict(X_test).argmax(axis=-1)
    else:
        classifier.fit(X_train, y_train)
    
        val_predictions = classifier.predict(X_val)
        test_predictions = classifier.predict(X_test)
        
    # accuracy_score expects (y_true, y_pred)
    print("Validation accuracy: ", metrics.accuracy_score(y_val, val_predictions))
    print("Test accuracy: ", metrics.accuracy_score(y_test, test_predictions))

## Naive Bayes

In [34]:
train_model(naive_bayes.MultinomialNB(), X_data_tfidf, y_data, X_test_tfidf, y_test, is_neuralnet=False)

Validation accuracy:  0.8640402843601895
Test accuracy:  0.862942449328013


In [None]:
# MultinomialNB rejects negative inputs such as SVD output, so the raw TF-IDF matrices are used here
train_model(naive_bayes.MultinomialNB(), X_data_tfidf_ngram, y_data, X_test_tfidf_ngram, y_test, is_neuralnet=False)

In [None]:
train_model(naive_bayes.MultinomialNB(), X_data_tfidf_ngram_char, y_data, X_test_tfidf_ngram_char, y_test, is_neuralnet=False)

### Other type Naive Bayes

In [None]:
# use too much memory
# train_model(naive_bayes.GaussianNB(), X_data_tfidf.todense(), y_data, X_test_tfidf.todense(), y_test, is_neuralnet=False)

In [None]:
train_model(naive_bayes.BernoulliNB(), X_data_tfidf, y_data, X_test_tfidf, y_test, is_neuralnet=False)

In [None]:
train_model(naive_bayes.BernoulliNB(), X_data_tfidf_svd, y_data, X_test_tfidf_svd, y_test, is_neuralnet=False)

## Linear Classifier

In [None]:
train_model(linear_model.LogisticRegression(), X_data_tfidf, y_data, X_test_tfidf, y_test, is_neuralnet=False)

In [None]:
train_model(linear_model.LogisticRegression(), X_data_tfidf_svd, y_data, X_test_tfidf_svd, y_test, is_neuralnet=False)

## SVM Model

In [None]:
train_model(svm.SVC(), X_data_tfidf_svd, y_data, X_test_tfidf_svd, y_test, is_neuralnet=False)

## Bagging Model

In [None]:
train_model(ensemble.RandomForestClassifier(), X_data_tfidf_svd, y_data, X_test_tfidf_svd, y_test, is_neuralnet=False)

## Boosting Model

In [None]:
train_model(xgboost.XGBClassifier(), X_data_tfidf_svd, y_data, X_test_tfidf_svd, y_test, is_neuralnet=False)

## Deep Neural Network

In [None]:
from keras.layers import *

In [None]:
def create_dnn_model():
    input_layer = Input(shape=(300,))
    layer = Dense(1024, activation='relu')(input_layer)
    layer = Dense(1024, activation='relu')(layer)
    layer = Dense(512, activation='relu')(layer)
    output_layer = Dense(10, activation='softmax')(layer)
    
    classifier = models.Model(input_layer, output_layer)
    classifier.compile(optimizer=optimizers.Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return classifier

In [None]:
classifier = create_dnn_model()
train_model(classifier=classifier, X_data=X_data_tfidf_svd, y_data=y_data_n, X_test=X_test_tfidf_svd, y_test=y_test_n, is_neuralnet=True)

## Convolutional Neural Network 

In [None]:
def create_cnn_model():
    pass

## Recurrent Neural Network  

### LSTM 

In [None]:
def create_lstm_model():
    input_layer = Input(shape=(300,))
    
    layer = Reshape((10, 30))(input_layer)
    layer = LSTM(128, activation='relu')(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(128, activation='relu')(layer)
    
    output_layer = Dense(10, activation='softmax')(layer)
    
    classifier = models.Model(input_layer, output_layer)
    
    classifier.compile(optimizer=optimizers.Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return classifier

In [None]:
classifier = create_lstm_model()
train_model(classifier=classifier, X_data=X_data_tfidf_svd, y_data=y_data_n, X_test=X_test_tfidf_svd, y_test=y_test_n, is_neuralnet=True)

### GRU 

In [None]:
def create_gru_model():
    input_layer = Input(shape=(300,))
    
    layer = Reshape((10, 30))(input_layer)
    layer = GRU(128, activation='relu')(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(128, activation='relu')(layer)
    
    output_layer = Dense(10, activation='softmax')(layer)
    
    classifier = models.Model(input_layer, output_layer)
    
    classifier.compile(optimizer=optimizers.Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return classifier

In [None]:
classifier = create_gru_model()
train_model(classifier=classifier, X_data=X_data_tfidf_svd, y_data=y_data_n, X_test=X_test_tfidf_svd, y_test=y_test_n, is_neuralnet=True, n_epochs=10)

### Bidirectional RNN 

In [None]:
def create_brnn_model():
    input_layer = Input(shape=(300,))
    
    layer = Reshape((10, 30))(input_layer)
    layer = Bidirectional(GRU(128, activation='relu'))(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(128, activation='relu')(layer)
    
    output_layer = Dense(10, activation='softmax')(layer)
    
    classifier = models.Model(input_layer, output_layer)
    
    classifier.compile(optimizer=optimizers.Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return classifier

In [None]:
classifier = create_brnn_model()
train_model(classifier=classifier, X_data=X_data_tfidf_svd, y_data=y_data_n, X_test=X_test_tfidf_svd, y_test=y_test_n, is_neuralnet=True, n_epochs=20)

## Recurrent Convolutional Neural Network 

In [None]:
def create_rcnn_model():
    input_layer = Input(shape=(300,))
    
    layer = Reshape((10, 30))(input_layer)
    layer = Bidirectional(GRU(128, activation='relu', return_sequences=True))(layer)    
    layer = Convolution1D(100, 3, activation="relu")(layer)
    layer = Flatten()(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(128, activation='relu')(layer)
    
    output_layer = Dense(10, activation='softmax')(layer)
    
    classifier = models.Model(input_layer, output_layer)
    classifier.summary()
    classifier.compile(optimizer=optimizers.Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return classifier

In [None]:
classifier = create_rcnn_model()
train_model(classifier=classifier, X_data=X_data_tfidf_svd, y_data=y_data_n, X_test=X_test_tfidf_svd, y_test=y_test_n, is_neuralnet=True, n_epochs=20)

## Doc2Vec Model 

In [None]:
def get_corpus(documents):
    corpus = []
    
    for i in tqdm(range(len(documents))):
        doc = documents[i]
        
        words = doc.split(' ')
        tagged_document = gensim.models.doc2vec.TaggedDocument(words, [i])
        
        corpus.append(tagged_document)
        
    return corpus

In [None]:
train_corpus = get_corpus(X_data)


In [None]:
test_corpus = get_corpus(X_test)

#### Build Doc2Vec model 

In [None]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=300, min_count=2, epochs=40)
model.build_vocab(train_corpus)

In [None]:
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

#### Get vector 

In [None]:
X_data_vectors = []
for x in train_corpus:
    vector = model.infer_vector(x.words)
    X_data_vectors.append(vector)

In [None]:
X_test_vectors = []
for x in test_corpus:
    vector = model.infer_vector(x.words)
    X_test_vectors.append(vector)

In [None]:
classifier = create_dnn_model()
train_model(classifier=classifier, X_data=np.array(X_data_vectors), y_data=y_data_n, X_test=np.array(X_test_vectors), y_test=y_test_n, is_neuralnet=True, n_epochs=5)