# Introduction
In this notebook, we use Word2Vec instead of TF-IDF to vectorize the text and see whether the results improve.

# Train Word2Vec

## Prepare data

In [2]:
import pandas as pd 

#Load data
train_set = pd.read_csv('corpus/clean_training_set.csv')
test_set = pd.read_csv('corpus/clean_test_set.csv')

#Concatenate all news from both data sets into one big list
full_news = list(train_set.Text) + list(test_set.Text)

print("full corpus size: ", len(full_news))

print(full_news[-1])



full corpus size:  2225
citizenship event for 18s touted citizenship ceremonies could be introduced for people celebrating their 18th birthday  charles clarke has said.  the idea will be tried as part of an overhaul of the way government approaches  inclusive citizenship  particularly for ethnic minorities. a pilot scheme based on ceremonies in australia will start in october. mr clarke said it would be a way of recognising young people reaching their voting age when they also gain greater independence from parents. britain s young black and asian people are to be encouraged to learn about the nation s heritage as part of the government s new race strategy which will also target specific issues within different ethnic minority groups. officials say the home secretary wants young people to feel they belong and to understand their  other cultural identities  alongside being british. the launch follows a row about the role of faith schools in britain. on monday school inspection chief dav

### Preprocessing Data

Unlike TF-IDF, Word2Vec is trained on sentences as ordered token sequences, so we do not remove stop words from the text. We only apply a few transformations:
- Remove URLs
- Remove punctuation
- Replace number-like tokens with placeholder tags based on their NER labels (time, date, quantity, ordinal)
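The clean-up steps listed above can be approximated with the standard library alone. The following is a minimal sketch that swaps spaCy for regular expressions, so it is not the pipeline used in this notebook: `simple_preprocess`, `URL_RE` and `ORDINAL_RE` are hypothetical names, and only the ordinal tag is illustrated.

```python
import re

# Hypothetical helpers: a regex approximation, not the spaCy pipeline of this notebook
URL_RE = re.compile(r"https?://\S+|www\.\S+")
ORDINAL_RE = re.compile(r"\b\d+(?:st|nd|rd|th)\b")

def simple_preprocess(text):
    text = URL_RE.sub(" ", text.lower())        # remove URLs
    text = ORDINAL_RE.sub(" _ordinal_ ", text)  # tag ordinals such as "18th"
    # Keep alphabetic words and placeholder tags; punctuation and bare numbers drop out
    return re.findall(r"_[a-z]+_|[a-z]+", text)

tokens = simple_preprocess("Celebrating their 18th birthday, see https://example.com/news!")
print(tokens)
```

Stop words such as "their" are kept on purpose, because Word2Vec learns from the context window around each word.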

In [48]:
import spacy
import numpy as np
print(f"Spacy version: {spacy.__version__}")
nlp_spacy = spacy.load('en_core_web_sm')

def text_preprocessing(text):
    #text = correct_spelling(text)
    
    doc = nlp_spacy(text)
    
    clean_bag_words = []
    for token in doc:
        if token.is_currency:
            clean_bag_words.append('_currency_')
        elif token.ent_type_ == 'ORDINAL':
            clean_bag_words.append('_ordinal_')
        elif token.ent_type_ == 'TIME':
            clean_bag_words.append('_time_')
        elif token.ent_type_ == 'QUANTITY':
            clean_bag_words.append('_quantity_')
        elif token.ent_type_ == 'DATE':
            clean_bag_words.append('_date_')
        elif token.is_alpha:
            clean_bag_words.append(token.lemma_)
    
    return clean_bag_words

Spacy version: 2.1.4


In [49]:
print(text_preprocessing(full_news[-1]))

['citizenship', 'event', 'for', '_date_', 'tout', 'citizenship', 'ceremony', 'could', 'be', 'introduce', 'for', 'people', 'celebrate', '-PRON-', '_ordinal_', 'birthday', 'charles', 'clarke', 'have', 'say', 'the', 'idea', 'will', 'be', 'try', 'as', 'part', 'of', 'an', 'overhaul', 'of', 'the', 'way', 'government', 'approach', 'inclusive', 'citizenship', 'particularly', 'for', 'ethnic', 'minority', 'a', 'pilot', 'scheme', 'base', 'on', 'ceremony', 'in', 'australia', 'will', 'start', 'in', 'october', 'mr', 'clarke', 'say', '-PRON-', 'would', 'be', 'a', 'way', 'of', 'recognise', 'young', 'people', 'reach', '-PRON-', 'voting', 'age', 'when', '-PRON-', 'also', 'gain', 'great', 'independence', 'from', 'parent', 'britain', 's', 'young', 'black', 'and', 'asian', 'people', 'be', 'to', 'be', 'encourage', 'to', 'learn', 'about', 'the', 'nation', 's', 'heritage', 'as', 'part', 'of', 'the', 'government', 's', 'new', 'race', 'strategy', 'which', 'will', 'also', 'target', 'specific', 'issue', 'within',

In [50]:
from tqdm.contrib.concurrent import thread_map
full_news_clean = thread_map(text_preprocessing, full_news)

100%|██████████| 2225/2225 [01:18<00:00, 28.34it/s]


## Training phase

In [51]:
import gensim
import multiprocessing
#Leave two cores free, but always keep at least one worker
MAX_WORKERS = max(1, multiprocessing.cpu_count() - 2)

#Init model
word2vec_model = gensim.models.Word2Vec(window=3, vector_size=200, negative=10, min_count=1, 
                                        sample=1e-4, workers = MAX_WORKERS, sg=1)


INFO:gensim.utils:Word2Vec lifecycle event {'params': 'Word2Vec(vocab=0, vector_size=200, alpha=0.025)', 'datetime': '2022-04-28T21:12:13.045406', 'gensim': '4.1.2', 'python': '3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:35:01) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22598-SP0', 'event': 'created'}


In [52]:
%%time
import logging
logging.basicConfig(level=logging.INFO)

#Build vocab
word2vec_model.build_vocab(full_news_clean)

INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 21963 word types from a corpus of 865336 raw words and 2225 sentences
INFO:gensim.models.word2vec:Creating a fresh vocabulary
INFO:gensim.utils:Word2Vec lifecycle event {'msg': 'effective_min_count=1 retains 21963 unique words (100.0%% of original 21963, drops 0)', 'datetime': '2022-04-28T21:12:13.352754', 'gensim': '4.1.2', 'python': '3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:35:01) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22598-SP0', 'event': 'prepare_vocab'}
INFO:gensim.utils:Word2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 865336 word corpus (100.0%% of original 865336, drops 0)', 'datetime': '2022-04-28T21:12:13.353721', 'gensim': '4.1.2', 'python': '3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:35:01) [MSC v.1916 64 bit 

Wall time: 615 ms


In [53]:
%%time
word2vec_model.train(full_news_clean, total_examples=len(full_news_clean), epochs=25, report_delay=10)

INFO:gensim.utils:Word2Vec lifecycle event {'msg': 'training model with 6 workers on 21963 vocabulary and 200 features, using sg=1 hs=0 sample=0.0001 negative=10 window=3 shrink_windows=True', 'datetime': '2022-04-28T21:12:13.841222', 'gensim': '4.1.2', 'python': '3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:35:01) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22598-SP0', 'event': 'train'}
INFO:gensim.models.word2vec:EPOCH 1 - PROGRESS: at 92.22% examples, 388568 words/s, in_qsize 8, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 5 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 4 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 3 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker threa

Wall time: 30 s


(10575247, 21633400)

In [54]:
print(f"Total vocab, and dimension features: {word2vec_model.wv.vectors.shape}")
word2vec_model.wv.save('model/word2vec_bbc_200')

INFO:gensim.utils:KeyedVectors lifecycle event {'fname_or_handle': 'model/word2vec_bbc_200', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-04-28T21:12:43.918592', 'gensim': '4.1.2', 'python': '3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:35:01) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22598-SP0', 'event': 'saving'}
INFO:gensim.utils:saved model/word2vec_bbc_200


Total vocab, and dimension features: (21963, 200)


In [55]:
word2vec_model.wv.most_similar('human')

[('strasbourg', 0.5911527276039124),
 ('contravention', 0.567435622215271),
 ('genome', 0.5610306859016418),
 ('shami', 0.5599691271781921),
 ('connectedness', 0.5560138821601868),
 ('companion', 0.553825855255127),
 ('damming', 0.5493731498718262),
 ('frailty', 0.5487865805625916),
 ('convention', 0.5443421602249146),
 ('chakrabarti', 0.5389552712440491)]

In [56]:
word2vec_model.wv.most_similar('morning')

[('treadmill', 0.7206295728683472),
 ('timeslot', 0.7108034491539001),
 ('napalm', 0.6681059002876282),
 ('definately', 0.6680393218994141),
 ('truncated', 0.657975971698761),
 ('edgy', 0.6570670008659363),
 ('footfall', 0.6546801924705505),
 ('glide', 0.6490354537963867),
 ('afternoon', 0.6483133435249329),
 ('onimusha', 0.6481848359107971)]

In [57]:
word2vec_model.wv.most_similar('love')

[('fleshiness', 0.5509140491485596),
 ('luminosity', 0.5503056049346924),
 ('speakerboxxx', 0.5440655946731567),
 ('ample', 0.5337980389595032),
 ('stabber', 0.5236081480979919),
 ('tug', 0.5185402631759644),
 ('courtney', 0.5174943208694458),
 ('forbidden', 0.5116926431655884),
 ('intuition', 0.511223554611206),
 ('downsize', 0.5042084455490112)]

In [58]:
word2vec_model.wv.most_similar('business')

[('namely', 0.5362796187400818),
 ('irancell', 0.5216964483261108),
 ('company', 0.5190095901489258),
 ('tankan', 0.5056881904602051),
 ('tetsuro', 0.5030200481414795),
 ('nanjing', 0.5013078451156616),
 ('broking', 0.49983251094818115),
 ('radar', 0.4983746111392975),
 ('outsourcing', 0.49788033962249756),
 ('pessimism', 0.4975112974643707)]
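
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that ranking with the standard library is shown here; the 3-dimensional vectors are invented for illustration, and `cosine_similarity` and `most_similar_toy` are hypothetical helpers, not gensim functions.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only (not taken from the trained model)
toy_vectors = {
    "business": [1.0, 0.0, 1.0],
    "company": [1.0, 0.2, 0.8],
    "morning": [0.0, 1.0, 0.0],
}

def most_similar_toy(word, vectors, topn=2):
    # Score every other word against the query and sort by similarity, highest first
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar_toy("business", toy_vectors)
print(ranking)
```

With only 2225 articles, many neighbours in the real output above are rare words, which is worth keeping in mind when reading these similarity lists.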

In [59]:
del word2vec_model

# Modeling

In [60]:
import gensim
word2vec_model = gensim.models.KeyedVectors.load('model/word2vec_bbc_200')
word2vec_model.most_similar('home')

INFO:gensim.utils:loading KeyedVectors object from model/word2vec_bbc_200
INFO:gensim.utils:KeyedVectors lifecycle event {'fname': 'model/word2vec_bbc_200', 'datetime': '2022-04-28T21:12:44.491596', 'gensim': '4.1.2', 'python': '3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:35:01) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22598-SP0', 'event': 'loaded'}


[('broom', 0.4992559552192688),
 ('arbitrarily', 0.4974686801433563),
 ('hyped', 0.47862720489501953),
 ('prosperous', 0.4762115180492401),
 ('agnostic', 0.47368767857551575),
 ('gainful', 0.4725506007671356),
 ('mactaggart', 0.47231656312942505),
 ('speeding', 0.469606876373291),
 ('moldova', 0.46904489398002625),
 ('certificate', 0.46093687415122986)]

## Feature engineering

In [61]:
train_set.head()

Unnamed: 0,ArticleId,Text,Category,CleanText,BagWords
0,1833,worldcom ex-boss launches defence lawyers defe...,business,"['worldcom', 'boss', 'launch', 'defence', 'law...","['worldcom', 'boss', 'launch', 'defence', 'law..."
1,154,german business confidence slides german busin...,business,"['german', 'business', 'confidence', 'slide', ...","['german', 'business', 'confidence', 'slide', ..."
2,1101,bbc poll indicates economic gloom citizens in ...,business,"['bbc', 'poll', 'indicate', 'economic', 'gloom...","['bbc', 'poll', 'indicate', 'economic', 'gloom..."
3,1976,lifestyle governs mobile choice faster bett...,tech,"['lifestyle', 'govern', 'mobile', 'choice', 'f...","['lifestyle', 'govern', 'mobile', 'choice', 'f..."
4,917,enron bosses in $168m payout eighteen former e...,business,"['enron', 'boss', '_currency_', 'payout', 'eig...","['enron', 'boss', '_currency_', 'payout', 'eig..."


In [62]:
%%time 
train_set['BagWords'] = thread_map(text_preprocessing, train_set['Text']) 
train_set['BagWords'].head()

100%|██████████| 1490/1490 [00:49<00:00, 30.05it/s]

Wall time: 49.8 s





0    [worldcom, ex, boss, launch, defence, lawyer, ...
1    [german, business, confidence, slide, german, ...
2    [bbc, poll, indicate, economic, gloom, citizen...
3    [lifestyle, govern, mobile, choice, faster, be...
4    [enron, boss, in, _currency_, m, payout, eight...
Name: BagWords, dtype: object

### Vectoring sentence with Word2Vec

An approach known as AvgWord2Vec vectorizes a sentence by averaging the word vectors of all tokens in it.
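
As a small worked example of the averaging idea, the sketch below uses invented 2-dimensional embeddings instead of the trained model. `average_vector` and `toy_embeddings` are hypothetical names, and the sketch assumes that tokens without a vector are skipped and do not count towards the average.

```python
# Toy 2-dimensional embeddings, invented for illustration only
toy_embeddings = {
    "bbc": [1.0, 0.0],
    "poll": [0.0, 1.0],
    "gloom": [1.0, 1.0],
}

def average_vector(tokens, embeddings, dim=2):
    # Collect the vectors of the tokens that are in the vocabulary
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known token: fall back to a zero vector
        return [0.0] * dim
    # Component-wise mean over the known tokens
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

doc_vector = average_vector(["bbc", "poll", "gloom", "unseen"], toy_embeddings)
print(doc_vector)
```

Each component is the sum (2.0) divided by the three known tokens, so the document vector has the same dimension as a single word vector regardless of document length.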

In [63]:
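def word2vec_sentence(bag_words):
    #Sum the vectors of all tokens that exist in the Word2Vec vocabulary
    vector = np.zeros(word2vec_model.vectors.shape[1])
    found = 0
    
    for token in bag_words:
        try:
            vector = np.add(vector, word2vec_model[token])
            found += 1
        except KeyError:
            #Skip out-of-vocabulary tokens
            pass
    
    #Average over the tokens found; return zeros if none matched (avoids division by zero)
    return np.divide(vector, found) if found else vector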

print(word2vec_sentence(train_set['BagWords'][0]).shape)

(200,)


In [64]:
from tqdm.contrib.concurrent import thread_map

In [65]:
#Vectorize the training set
train_vectors = np.array(thread_map(word2vec_sentence, train_set['BagWords'].values))
train_vectors.shape

100%|██████████| 1490/1490 [00:05<00:00, 289.25it/s]


(1490, 200)

In [66]:
#Init label encoder for news category feature
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

#Encode target, save to variable y
y = encoder.fit_transform(train_set['Category'].values)
print(y.shape)

(1490,)


## Train model

In [67]:
#Model
from sklearn.svm import SVC
from sklearn.naive_bayes import  GaussianNB
from sklearn.ensemble import RandomForestClassifier

#Model evaluation
from sklearn.model_selection import cross_validate, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, ShuffleSplit

#CV split: run the model 5x with a 70/20 train/test split, intentionally leaving out 10%
cv_split = ShuffleSplit(n_splits = 5, test_size = .2,
                        train_size = .7, random_state = 0)

param_grids = [
    #Random Forest
    {'n_estimators': [100, 150, 250],
     'criterion': ['gini', 'entropy']},

    #GaussianNB
    {},

    #SVC
    {'C': [1, 10, 100],
     'gamma': [0.01, 0.1, 0.001]},

]

MLA = [
    RandomForestClassifier(),
    GaussianNB(),
    SVC(),
]
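
The `ShuffleSplit` configuration above draws 70% of the rows for training and 20% for validation in each of its 5 iterations, so 10% of the rows stay unused per iteration. The sketch below mimics that behaviour with the standard library only. `shuffle_split_indices` is a hypothetical helper and a simplification, not scikit-learn's implementation (its rounding and random number generator differ).

```python
import random

def shuffle_split_indices(n_samples, n_splits, train_size, test_size, seed=0):
    """Yield (train, test) index lists, each drawn from a fresh random permutation."""
    rng = random.Random(seed)
    n_train = round(train_size * n_samples)
    n_test = round(test_size * n_samples)
    for _ in range(n_splits):
        permutation = list(range(n_samples))
        rng.shuffle(permutation)
        test = permutation[:n_test]
        train = permutation[n_test:n_test + n_train]
        # The remaining indices are used by neither set in this iteration
        yield train, test

# 1490 is the number of rows in the training set of this notebook
splits = list(shuffle_split_indices(1490, n_splits=5, train_size=0.7, test_size=0.2))
for train, test in splits:
    unused = 1490 - len(train) - len(test)
    print(len(train), len(test), unused)
```

Because every iteration reshuffles independently, the same article can land in the validation set of several iterations, unlike k-fold cross-validation.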

In [68]:
%%time
import multiprocessing
#Leave two cores free, but always keep at least one worker
MAX_WORKER = max(1, multiprocessing.cpu_count() - 2)

report = pd.read_csv('report_training.csv')
scoring = {'f1': 'f1_macro', 'precision': 'precision_macro', 'recall':'recall_macro'}

row = len(report)
for mla, param in zip(MLA, param_grids):
    gscv = GridSearchCV(mla, param, cv = cv_split, return_train_score=True, n_jobs=MAX_WORKER, 
                        scoring=scoring, refit='f1', error_score='raise')
    gscv.fit(train_vectors,y)
    
    best_index = gscv.best_index_
    
    report.loc[row, 'algorithm'] = gscv.best_estimator_.__class__.__name__ + "_Word2Vec"
    report.loc[row, 'best_params'] = str(gscv.best_params_)
    report.loc[row, 'f1_train'] = gscv.cv_results_['mean_train_f1'][best_index]
    report.loc[row, 'f1_test'] = gscv.cv_results_['mean_test_f1'][best_index]
    report.loc[row, 'recall_train'] = gscv.cv_results_['mean_train_recall'][best_index]
    report.loc[row, 'recall_test'] = gscv.cv_results_['mean_test_recall'][best_index]
    report.loc[row, 'precision_train'] = gscv.cv_results_['mean_train_precision'][best_index]
    report.loc[row, 'precision_test'] = gscv.cv_results_['mean_test_precision'][best_index]
    report.loc[row, 'fit_time'] = gscv.cv_results_['mean_fit_time'][best_index]
    
    
    
    row+=1

Wall time: 37.8 s


## Evaluation

In [69]:
report

Unnamed: 0,algorithm,best_params,f1_train,f1_test,recall_train,recall_test,precision_train,precision_test,fit_time
0,RandomForestClassifier_TF_IDF_ONE_GRAM,"{'criterion': 'gini', 'n_estimators': 250}",1.0,0.911843,1.0,0.910633,1.0,0.914503,1.66159
1,GaussianNB_TF_IDF_ONE_GRAM,{},0.931458,0.858537,0.931203,0.859601,0.933053,0.861169,0.013115
2,SVC_TF_IDF_ONE_GRAM,"{'C': 10, 'gamma': 0.1}",0.986628,0.900158,0.986545,0.900327,0.98674,0.902152,0.262039
3,RandomForestClassifier_TF_IDF_ONE_TWO_GRAM,"{'criterion': 'entropy', 'n_estimators': 150}",1.0,0.947797,1.0,0.946281,1.0,0.950623,2.689887
4,GaussianNB_TF_IDF_ONE_TWO_GRAM,{},1.0,0.901938,1.0,0.905434,1.0,0.903421,0.193932
5,SVC_TF_IDF_ONE_TWO_GRAM,"{'C': 10, 'gamma': 0.1}",1.0,0.970375,1.0,0.970559,1.0,0.9706,6.497143
6,RandomForestClassifier_Word2Vec,"{'criterion': 'entropy', 'n_estimators': 150}",1.0,0.969028,1.0,0.967868,1.0,0.970771,3.264943
7,GaussianNB_Word2Vec,{},0.958047,0.947563,0.958102,0.94749,0.958212,0.948213,0.012621
8,SVC_Word2Vec,"{'C': 10, 'gamma': 0.1}",0.985137,0.97132,0.985105,0.971288,0.985193,0.971927,0.098401


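- Test F1 improves for every classifier: by about 2.3 points on average over TF-IDF with one- and two-grams, and by about 7.2 points over TF-IDF with one-grams
- Overfitting is reduced: the train/test F1 gap narrows for GaussianNB (0.958 vs 0.948) and SVC (0.985 vs 0.971), although RandomForest still scores 1.0 on the training data
- Fit time drops sharply for SVC (6.50 s to 0.10 s) and GaussianNB (0.19 s to 0.013 s) compared with one- and two-gram TF-IDF; RandomForest is slightly slower (2.69 s to 3.26 s)
- Memory cost should be lower, since each document is a 200-dimensional vector (not measured here)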
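
The size of the improvement can be checked directly from the `f1_test` column of the report table. This sketch copies those values by hand; the dictionary keys and `mean_gain` are hypothetical names introduced for illustration.

```python
# f1_test values copied from the report table above
f1_test = {
    "RandomForestClassifier": {"tfidf_1gram": 0.911843, "tfidf_1_2gram": 0.947797, "word2vec": 0.969028},
    "GaussianNB": {"tfidf_1gram": 0.858537, "tfidf_1_2gram": 0.901938, "word2vec": 0.947563},
    "SVC": {"tfidf_1gram": 0.900158, "tfidf_1_2gram": 0.970375, "word2vec": 0.971320},
}

def mean_gain(baseline):
    # Average difference in test F1 between Word2Vec and the given TF-IDF baseline
    gains = [scores["word2vec"] - scores[baseline] for scores in f1_test.values()]
    return sum(gains) / len(gains)

for baseline in ("tfidf_1gram", "tfidf_1_2gram"):
    print(f"mean F1 gain over {baseline}: {mean_gain(baseline):.4f}")
```

The gain is largest for GaussianNB and smallest for SVC, whose one- and two-gram TF-IDF score was already close to the Word2Vec result.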

In [70]:
report.to_csv('report_training.csv', index=False)

# Conclusion

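Word2Vec features produced better-performing models than TF-IDF: test F1 improved for every classifier, although only marginally for SVC compared with one- and two-gram TF-IDF.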