# Introduction
In this notebook, we will use Doc2Vec to vectorize the text and see whether it improves the results.

# Train Doc2Vec

## Prepare data

In [1]:
import pandas as pd 

#Load data
train_set = pd.read_csv('corpus/clean_training_set.csv')
test_set = pd.read_csv('corpus/clean_test_set.csv')

#Concatenate all news from both data sets into one big list
full_news = list(train_set.Text) + list(test_set.Text)

print("full corpus size: ", len(full_news))

print(full_news[-1])



full corpus size:  2225
citizenship event for 18s touted citizenship ceremonies could be introduced for people celebrating their 18th birthday  charles clarke has said.  the idea will be tried as part of an overhaul of the way government approaches  inclusive citizenship  particularly for ethnic minorities. a pilot scheme based on ceremonies in australia will start in october. mr clarke said it would be a way of recognising young people reaching their voting age when they also gain greater independence from parents. britain s young black and asian people are to be encouraged to learn about the nation s heritage as part of the government s new race strategy which will also target specific issues within different ethnic minority groups. officials say the home secretary wants young people to feel they belong and to understand their  other cultural identities  alongside being british. the launch follows a row about the role of faith schools in britain. on monday school inspection chief dav

### Preprocessing Data

Unlike TF-IDF, Word2Vec-style models such as Doc2Vec are trained on sentences as ordered word sequences, so we do not remove stop words from the text. We only apply a few transformations:
- Remove URLs
- Remove punctuation
- Replace currency symbols and NER-tagged numeric tokens (time, date, quantity, ordinal) with placeholder tokens
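
As a rough illustration of the first two steps, here is a minimal stand-alone sketch that uses regular expressions instead of spaCy. The helper `simple_clean` and both patterns are hypothetical and are not part of this notebook's pipeline, which relies on spaCy token attributes in the next cell.

```python
import re

# Assumed patterns for illustration only; the notebook itself uses spaCy.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_PATTERN = re.compile(r"[^a-z\s]+")

def simple_clean(text):
    """Hypothetical helper: lowercase, drop URLs, drop punctuation and digits."""
    text = URL_PATTERN.sub(" ", text.lower())
    text = NON_ALPHA_PATTERN.sub(" ", text)
    # Stop words are kept on purpose, because word order matters for Doc2Vec.
    return text.split()

tokens = simple_clean("Read more at https://example.com/news, it's the 18th event!")
print(tokens)
```

A plain regex leaves fragments such as `th` from `18th`, which is one reason to map such tokens to placeholders like `_ordinal_` with an NER tagger instead.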

In [2]:
import spacy
import numpy as np
print(f"Spacy version: {spacy.__version__}")
nlp_spacy = spacy.load('en_core_web_sm')

def text_preprocessing(text):
    #text = correct_spelling(text)
    
    doc = nlp_spacy(text)
    
    clean_bag_words = []
    for token in doc:
        if token.is_currency:
            clean_bag_words.append('_currency_')
        elif token.ent_type_ == 'ORDINAL':
            clean_bag_words.append('_ordinal_')
        elif token.ent_type_ == 'TIME':
            clean_bag_words.append('_time_')
        elif token.ent_type_ == 'QUANTITY':
            clean_bag_words.append('_quantity_')
        elif token.ent_type_ == 'DATE':
            clean_bag_words.append('_date_')
        elif token.is_alpha:
            clean_bag_words.append(token.lemma_)
    
    return clean_bag_words

Spacy version: 2.1.4


In [3]:
from tqdm.contrib.concurrent import thread_map
full_news_clean = thread_map(text_preprocessing, full_news)

100%|██████████| 2225/2225 [01:03<00:00, 34.93it/s]


## Training phase

In [4]:
import gensim
import multiprocessing
#Leave two cores free, but always keep at least one worker
MAX_WORKERS = max(1, multiprocessing.cpu_count() - 2)

tag_documents = []
#Tagged document for doc2vec_model
for i, doc in enumerate(full_news_clean):
    tag_documents.append(gensim.models.doc2vec.TaggedDocument(doc, [i]))


In [5]:
tag_documents[0]
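
For readers without gensim at hand, the shape of a tagged document can be sketched with a plain namedtuple. `TaggedDoc` and `toy_corpus` below are stand-ins made up for illustration; the only assumption carried over from gensim is that a tagged document has a `words` field and a `tags` field.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (illustration only).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Hypothetical toy corpus: each document is already a list of tokens.
toy_corpus = [
    ["citizenship", "event", "for"],
    ["german", "business", "confidence"],
]

# Each document gets its position in the corpus as its single tag.
tagged = [TaggedDoc(words, [i]) for i, words in enumerate(toy_corpus)]
print(tagged[0])
```

Using the list index as the tag is what later lets a document's trained vector be looked up by its position in the corpus.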



In [6]:

#Init model
doc2vec_model = gensim.models.Doc2Vec(window=3, vector_size=200, negative=10, min_count=1, 
                                        sample=1e-4, workers = MAX_WORKERS)

In [7]:
%%time
import logging
logging.basicConfig(level=logging.INFO)

#Build vocab
doc2vec_model.build_vocab(tag_documents)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 21963 word types and 2225 unique tags from a corpus of 2225 examples and 865336 words
INFO:gensim.models.word2vec:Creating a fresh vocabulary
INFO:gensim.utils:Doc2Vec lifecycle event {'msg': 'effective_min_count=1 retains 21963 unique words (100.0%% of original 21963, drops 0)', 'datetime': '2022-04-29T10:57:49.526160', 'gensim': '4.1.2', 'python': '3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:35:01) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22598-SP0', 'event': 'prepare_vocab'}
INFO:gensim.utils:Doc2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 865336 word corpus (100.0%% of original 865336, drops 0)', 'datetime': '2022-04-29T10:57:49.528125', 'gensim': '4.1.2', 'python': '3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:35:01) [MS

Wall time: 641 ms


In [8]:
%%time
doc2vec_model.train(tag_documents, total_examples=len(tag_documents), epochs=30, report_delay=10)

INFO:gensim.utils:Doc2Vec lifecycle event {'msg': 'training model with 6 workers on 21963 vocabulary and 200 features, using sg=0 hs=0 sample=0.0001 negative=10 window=3 shrink_windows=True', 'datetime': '2022-04-29T10:59:03.053339', 'gensim': '4.1.2', 'python': '3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:35:01) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22598-SP0', 'event': 'train'}
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 5 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 4 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 3 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:EPOCH - 1 : training on 865336 raw 

Wall time: 17.1 s


### Testing Doc2Vec

In [10]:
import collections

ranks = []
for doc_id in range(len(tag_documents)):
    inferred_vector = doc2vec_model.infer_vector(tag_documents[doc_id].words)
    
    #Get top 5 most similar documents
    sims = doc2vec_model.dv.most_similar([inferred_vector], topn=5)
    
    #Rank of the document itself among its top 5 (-1 if it is not in the top 5)
    sim_ids = [docid for docid, sim in sims]
    rank = sim_ids.index(doc_id) if doc_id in sim_ids else -1
    ranks.append(rank)

counter = collections.Counter(ranks)
print(counter)

Counter({0: 2076, 1: 149})


About 93% of the inferred documents (2076 of 2225) are most similar to themselves. The remaining 149 (about 7%) are mistakenly most similar to another document and rank themselves second.
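
The shares can be checked directly from the counter printed above. This is a small arithmetic sketch that only reuses the two counts from that output.

```python
from collections import Counter

# Counts copied from the self-similarity check output above.
counter = Counter({0: 2076, 1: 149})

total = sum(counter.values())
# Percentage of documents at each rank, rounded to one decimal place.
shares = {rank: round(100 * count / total, 1) for rank, count in sorted(counter.items())}
print(total, shares)
```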

In [14]:
#Save doc2vec model
doc2vec_model.save('model/doc2vec_bbc_200')

INFO:gensim.utils:Doc2Vec lifecycle event {'fname_or_handle': 'model/doc2vec_bbc_200', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-04-29T11:11:15.571144', 'gensim': '4.1.2', 'python': '3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:35:01) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22598-SP0', 'event': 'saving'}
INFO:gensim.utils:not storing attribute cum_table
INFO:gensim.utils:saved model/doc2vec_bbc_200


# Modeling

## Feature engineering

In [15]:
train_set.head()

Unnamed: 0,ArticleId,Text,Category,CleanText
0,1833,worldcom ex-boss launches defence lawyers defe...,business,"['worldcom', 'boss', 'launch', 'defence', 'law..."
1,154,german business confidence slides german busin...,business,"['german', 'business', 'confidence', 'slide', ..."
2,1101,bbc poll indicates economic gloom citizens in ...,business,"['bbc', 'poll', 'indicate', 'economic', 'gloom..."
3,1976,lifestyle governs mobile choice faster bett...,tech,"['lifestyle', 'govern', 'mobile', 'choice', 'f..."
4,917,enron bosses in $168m payout eighteen former e...,business,"['enron', 'boss', '_currency_', 'payout', 'eig..."


In [16]:
%%time 
train_set['BagWords'] = thread_map(text_preprocessing, train_set['Text']) 
train_set['BagWords'].head()

100%|██████████| 1490/1490 [00:51<00:00, 29.07it/s]

Wall time: 51.4 s





0    [worldcom, ex, boss, launch, defence, lawyer, ...
1    [german, business, confidence, slide, german, ...
2    [bbc, poll, indicate, economic, gloom, citizen...
3    [lifestyle, govern, mobile, choice, faster, be...
4    [enron, boss, in, _currency_, m, payout, eight...
Name: BagWords, dtype: object

### Vectorizing text with Doc2Vec


In [18]:
def doc2vec_sentence(bag_words):
    return doc2vec_model.infer_vector(bag_words)

print(doc2vec_sentence(train_set['BagWords'][0]).shape)

(200,)


In [19]:
#Vectorize the training set
train_vectors = np.array(thread_map(doc2vec_sentence, train_set['BagWords'].values))
train_vectors.shape

100%|██████████| 1490/1490 [00:10<00:00, 138.63it/s]


(1490, 200)

In [20]:
#Init label encoder for news category feature
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

#Encode target, save to variable y
y = encoder.fit_transform(train_set['Category'].values)
print(y.shape)

(1490,)


## Train model

In [21]:
#Model
from sklearn.svm import SVC
from sklearn.naive_bayes import  GaussianNB
from sklearn.ensemble import RandomForestClassifier

#Model evaluation
from sklearn.model_selection import cross_validate, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, ShuffleSplit

#CV split: run model 5x with a 70/20 train/test split, intentionally leaving out 10%
cv_split = ShuffleSplit(n_splits = 5, test_size = .2,
                        train_size = .7, random_state = 0)

param_grids = [
    #Random Forest
    {'n_estimators': [100, 150, 250],
     'criterion': ['gini', 'entropy']},

    #GaussianNB
    {},

    #SVC
    {'C': [1, 10, 100],
     'gamma': [0.01, 0.1, 0.001]},

]

MLA = [
    RandomForestClassifier(),
    GaussianNB(),
    SVC(),
]
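
To get a feel for how much work the grid search does, the split sizes and the number of fits can be estimated by hand. This is a sketch under stated assumptions: the sizes use integer arithmetic on the 1490 training rows, so they only approximate what scikit-learn computes, and the dictionary keys `random_forest`, `gaussian_nb`, and `svc` are labels invented for this example.

```python
from itertools import product

n_samples = 1490  # rows in the training set
n_splits = 5      # matches n_splits in cv_split

# Approximate 70/20 split; the remaining 10% is unused in each split.
n_train = n_samples * 70 // 100
n_test = n_samples * 20 // 100
n_left_out = n_samples - n_train - n_test

# Same grids as in the cell above, keyed by a label made up for this sketch.
grids = {
    "random_forest": {"n_estimators": [100, 150, 250], "criterion": ["gini", "entropy"]},
    "gaussian_nb": {},
    "svc": {"C": [1, 10, 100], "gamma": [0.01, 0.1, 0.001]},
}

# An empty grid still yields one candidate (the default parameters).
candidates = {name: len(list(product(*grid.values()))) for name, grid in grids.items()}
total_fits = sum(candidates.values()) * n_splits

print(n_train, n_test, n_left_out)
print(candidates, total_fits)
```

The total counts cross-validation fits only; refitting the best estimator adds one more fit per algorithm.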

In [25]:
%%time
import multiprocessing
#Leave two cores free, but always keep at least one worker
MAX_WORKER = max(1, multiprocessing.cpu_count() - 2)

report = pd.read_csv('report_training.csv')
scoring = {'f1': 'f1_macro', 'precision': 'precision_macro', 'recall':'recall_macro'}

row = len(report)
for mla, param in zip(MLA, param_grids):
    gscv = GridSearchCV(mla, param, cv = cv_split, return_train_score=True, n_jobs=MAX_WORKER, 
                        scoring=scoring, refit='f1', error_score='raise')
    gscv.fit(train_vectors,y)
    
    best_index = gscv.best_index_
    
    report.loc[row, 'algorithm'] = gscv.best_estimator_.__class__.__name__ + "_Doc2Vec"
    report.loc[row, 'best_params'] = str(gscv.best_params_)
    report.loc[row, 'f1_train'] = gscv.cv_results_['mean_train_f1'][best_index]
    report.loc[row, 'f1_test'] = gscv.cv_results_['mean_test_f1'][best_index]
    report.loc[row, 'recall_train'] = gscv.cv_results_['mean_train_recall'][best_index]
    report.loc[row, 'recall_test'] = gscv.cv_results_['mean_test_recall'][best_index]
    report.loc[row, 'precision_train'] = gscv.cv_results_['mean_train_precision'][best_index]
    report.loc[row, 'precision_test'] = gscv.cv_results_['mean_test_precision'][best_index]
    report.loc[row, 'fit_time'] = gscv.cv_results_['mean_fit_time'][best_index]
    
    
    
    row+=1

Wall time: 28.7 s
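
The way the loop above reads scores out of `cv_results_` can be sketched with a plain dictionary. The numbers below are made up purely for illustration and are not results from this notebook; the sketch assumes that, with `refit='f1'`, the best candidate is the one ranked first on the mean test F1 score.

```python
# Toy stand-in for GridSearchCV.cv_results_ (values are invented for illustration).
cv_results = {
    "params": [{"C": 1}, {"C": 10}, {"C": 100}],
    "mean_test_f1": [0.90, 0.95, 0.93],
    "mean_train_f1": [0.92, 0.99, 1.00],
    "rank_test_f1": [3, 1, 2],
}

# Pick the candidate ranked first on test F1 (assumed equivalent of best_index_).
ranks = cv_results["rank_test_f1"]
best_index = ranks.index(min(ranks))

# Collect the same kind of fields the report row stores.
summary = {
    "best_params": str(cv_results["params"][best_index]),
    "f1_train": cv_results["mean_train_f1"][best_index],
    "f1_test": cv_results["mean_test_f1"][best_index],
}
print(summary)
```

Reporting the train and test means side by side, as the report does, makes overfitting easy to spot: a large gap between the two suggests the model memorized the training folds.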


## Evaluation

In [26]:
report

Unnamed: 0,algorithm,best_params,f1_train,f1_test,recall_train,recall_test,precision_train,precision_test,fit_time
0,RandomForestClassifier_TF_IDF_ONE_GRAM,"{'criterion': 'gini', 'n_estimators': 250}",1.0,0.911843,1.0,0.910633,1.0,0.914503,1.66159
1,GaussianNB_TF_IDF_ONE_GRAM,{},0.931458,0.858537,0.931203,0.859601,0.933053,0.861169,0.013115
2,SVC_TF_IDF_ONE_GRAM,"{'C': 10, 'gamma': 0.1}",0.986628,0.900158,0.986545,0.900327,0.98674,0.902152,0.262039
3,RandomForestClassifier_TF_IDF_ONE_TWO_GRAM,"{'criterion': 'entropy', 'n_estimators': 150}",1.0,0.947797,1.0,0.946281,1.0,0.950623,2.689887
4,GaussianNB_TF_IDF_ONE_TWO_GRAM,{},1.0,0.901938,1.0,0.905434,1.0,0.903421,0.193932
5,SVC_TF_IDF_ONE_TWO_GRAM,"{'C': 10, 'gamma': 0.1}",1.0,0.970375,1.0,0.970559,1.0,0.9706,6.497143
6,RandomForestClassifier_Word2Vec,"{'criterion': 'entropy', 'n_estimators': 150}",1.0,0.969028,1.0,0.967868,1.0,0.970771,3.264943
7,GaussianNB_Word2Vec,{},0.958047,0.947563,0.958102,0.94749,0.958212,0.948213,0.012621
8,SVC_Word2Vec,"{'C': 10, 'gamma': 0.1}",0.985137,0.97132,0.985105,0.971288,0.985193,0.971927,0.098401
9,RandomForestClassifier_Word2Vec,"{'criterion': 'gini', 'n_estimators': 250}",1.0,0.963234,1.0,0.962467,1.0,0.965187,2.682367


- Scores and fit time are overall about the same as with averaged Word2Vec (AvgWord2Vec)
- Memory cost goes up because the Doc2Vec model consumes more memory
- Doc2Vec is more scalable than Word2Vec

In [27]:
report.to_csv('report_training.csv', index=False)

# Conclusion

On this data set Doc2Vec performs about the same as Word2Vec, but on a larger data set there will surely be a difference in performance between these two models.
To keep things as simple and as scalable as possible, I prefer Doc2Vec.