## CBOW modelling

This notebook contains the CBOW modelling script for the collected mainstream and disinformation articles.

Table of Contents:
* **1. Import required packages**
* **2. Import files**
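* 2.1 Separate mainstream and disinformation articles
* **3. CBOW modelling on disinformation articles**
* 3.1 CBOW on article titles
* 3.2 CBOW on article text
* **4. CBOW modelling on mainstream articles**
* 4.1 CBOW on article titles
* 4.2 CBOW on article text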

### 1. Import required packages

In [1]:
import pickle
import pandas as pd
import numpy as np

from gensim.models import Word2Vec

### 2. Import files

In [2]:
abt_covid = pd.read_pickle('C:/Users/molna/Desktop/Szakdolgozat/adatok/abt_covid_featured.pkl')

print("ABT table has {} rows and {} columns".format(len(abt_covid), len(abt_covid.columns)))

ABT table has 63633 rows and 50 columns


In [3]:
abt_covid.columns

Index(['title', 'date', 'text', 'source', 'dezinf', 'title_word_cnt',
       'title_avg_word', 'title_exclam_num', 'title_ques_num',
       'title_stop_cnt', 'title_cnt_upper', 'text_word_cnt', 'text_avg_word',
       'text_ques_num', 'text_exclam_num', 'text_stop_cnt', 'text_cnt_upper',
       'text_cnt_num', 'title_cnt_num', 'title_cleaned', 'text_cleaned',
       'title_tokens', 'text_tokens', 'title_lemmas', 'text_lemmas',
       'title_pos', 'text_pos', 'text_unique_lemma_ratio',
       'title_cnt_unique_lemmas', 'title_stop_word_ratio',
       'text_cnt_unique_lemmas', 'text_stop_word_ratio', 'title_noun_ratio',
       'title_verb_ratio', 'title_propn_ratio', 'title_adj_ratio',
       'text_noun_ratio', 'text_verb_ratio', 'text_propn_ratio',
       'text_adj_ratio', 'title_ner_pers', 'title_ner_orgs', 'title_ner_locs',
       'text_ner_pers', 'text_ner_orgs', 'text_ner_locs', 'title_senti_list',
       'title_polarity', 'text_senti_list', 'text_polarity'],
      dtype='object')

### 2.1 Separate mainstream and disinformation articles

In [4]:
abt_covid_mainstream = abt_covid[abt_covid["dezinf"] == 0]

In [5]:
abt_covid_dezinf = abt_covid[abt_covid["dezinf"] == 1]

### 3. CBOW modelling on disinformation articles

In [7]:
CBOW_model_dezinf_title = Word2Vec(abt_covid_dezinf["title_lemmas"], min_count=5, workers=3, window=9, sg=0)
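With `sg=0` the model is trained as CBOW: each word is predicted from the words around it, up to `window` positions on either side. The sketch below illustrates that pairing in plain Python. The helper `cbow_pairs` and the example title are hypothetical and are not part of gensim or of the dataset; gensim's implementation additionally samples a smaller effective window for each target, which this sketch ignores.

```python
def cbow_pairs(tokens, window):
    """Yield (context_words, target_word) pairs the way CBOW frames them."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        context = left + right
        if context:                              # a one-word sentence has no context
            yield context, target


# Hypothetical lemmatised title, not taken from the dataset.
title = ["koronavírus", "járvány", "új", "adat"]
pairs = list(cbow_pairs(title, window=2))
for context, target in pairs:
    print(context, "->", target)
```

Because titles are short, `window=9` will usually cover the whole title, so every other lemma in a title serves as context for each target.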

In [10]:
with open('C:/Users/molna/Desktop/Szakdolgozat/CBOW_model_dezinf_title.pk', 'wb') as f:
    pickle.dump(CBOW_model_dezinf_title, f)  # the context manager closes the file handle

In [8]:
CBOW_model_dezinf_article = Word2Vec(abt_covid_dezinf["text_lemmas"], min_count=5, workers=3, window=9, sg=0)
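The `min_count=5` argument drops every lemma that occurs fewer than five times in the training corpus, so such lemmas get no vector and cannot be queried later. A minimal sketch of that filtering step, assuming a simple frequency count; `filter_vocab` and the toy corpus are hypothetical illustrations, not gensim internals.

```python
from collections import Counter


def filter_vocab(sentences, min_count):
    """Return the set of tokens that occur at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, n in counts.items() if n >= min_count}


# Hypothetical toy corpus of lemmatised sentences.
toy_corpus = [
    ["vírus", "terjed"],
    ["vírus", "adat"],
    ["vírus", "terjed", "gyors"],
]
vocab = filter_vocab(toy_corpus, min_count=2)
print(sorted(vocab))
```

In this toy corpus only "vírus" and "terjed" survive a threshold of 2; the rarer lemmas are discarded before training.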

In [11]:
with open('C:/Users/molna/Desktop/Szakdolgozat/CBOW_model_dezinf_article.pk', 'wb') as f:
    pickle.dump(CBOW_model_dezinf_article, f)  # the context manager closes the file handle

#### 3.1 CBOW on article titles

In [20]:
CBOW_model_dezinf_title.wv.most_similar(positive=["koronavírus"], topn=20)

[('koronavírusos', 0.998615026473999),
 ('elhunyt', 0.996547520160675),
 ('hal', 0.9959557056427002),
 ('áldozat', 0.9955768585205078),
 ('érkezettsajnos', 0.9942648410797119),
 ('borzasztóan', 0.9934672117233276),
 ('dolog', 0.9933350086212158),
 ('szőrnyű', 0.9930605888366699),
 ('azonosít', 0.9927099347114563),
 ('rekord', 0.9924144744873047),
 ('gyógyult', 0.9921922087669373),
 ('krónikus', 0.9921479225158691),
 ('fő', 0.991748571395874),
 ('terhes', 0.9915972948074341),
 ('éjjel', 0.9912574291229248),
 ('itthon', 0.9907810688018799),
 ('ébred', 0.9907292127609253),
 ('meghal', 0.9904597401618958),
 ('nagyot', 0.9904000163078308),
 ('igazolt', 0.9903033971786499)]
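The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows how such a score is computed, using hypothetical 3-dimensional vectors rather than vectors from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, chosen only to show the two extremes.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a
c = [3.0, 0.0, -1.0]   # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: vectors point the same way
print(cosine_similarity(a, c))  # 0: vectors are unrelated in direction
```

All twenty scores in the list above exceed 0.99, meaning the title vectors point in almost the same direction. This may indicate that the title corpus is too small for the model to separate words well, so the ranking should be read with caution.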

#### 3.2 CBOW on article text

In [18]:
CBOW_model_dezinf_article.wv.most_similar(positive=["koronavírus"], topn=20)

[('vírus', 0.5537782311439514),
 ('kezdet', 0.5316112041473389),
 ('covi', 0.5142301321029663),
 ('megugrana', 0.4509262442588806),
 ('újrafertőződés', 0.44093799591064453),
 ('keletkezhet', 0.44069868326187134),
 ('szerotípus', 0.42996180057525635),
 ('stockholm', 0.4154028594493866),
 ('kanyaró', 0.41127607226371765),
 ('kitörés', 0.40510863065719604),
 ('megelőzhető', 0.402570903301239),
 ('tünetmentes', 0.39998534321784973),
 ('megjelenés', 0.3993462920188904),
 ('átesett', 0.3908366560935974),
 ('koronaívrus', 0.3870522379875183),
 ('brit', 0.38629311323165894),
 ('gyors', 0.3857772946357727),
 ('sars', 0.3842359185218811),
 ('potenciális', 0.3836808204650879),
 ('azonosít', 0.3825083374977112)]

### 4. CBOW modelling on mainstream articles


In [21]:
CBOW_model_main_title = Word2Vec(abt_covid_mainstream["title_lemmas"], min_count=5, workers=3, window=9, sg=0)

In [22]:
with open('C:/Users/molna/Desktop/Szakdolgozat/CBOW_model_main_title.pk', 'wb') as f:
    pickle.dump(CBOW_model_main_title, f)  # the context manager closes the file handle

In [23]:
CBOW_model_main_article = Word2Vec(abt_covid_mainstream["text_lemmas"], min_count=5, workers=3, window=9, sg=0)

In [24]:
with open('C:/Users/molna/Desktop/Szakdolgozat/CBOW_model_main_article.pk', 'wb') as f:
    pickle.dump(CBOW_model_main_article, f)  # the context manager closes the file handle

#### 4.1 CBOW on article titles

In [25]:
CBOW_model_main_title.wv.most_similar(positive=["koronavírus"], topn=20)

[('vírus', 0.9704287052154541),
 ('szegénység', 0.9609699249267578),
 ('terjed', 0.9534137845039368),
 ('ütem', 0.9528130888938904),
 ('továbbra', 0.9489189386367798),
 ('gyors', 0.9471156001091003),
 ('mutáns', 0.9439229965209961),
 ('adat', 0.9431066513061523),
 ('lassul', 0.9425371289253235),
 ('szennyvíz', 0.9409483075141907),
 ('járványba', 0.9409302473068237),
 ('gyorsuló', 0.9408190846443176),
 ('elmúlt', 0.940660297870636),
 ('gyorsul', 0.940253496170044),
 ('áramfogyasztás', 0.9385579824447632),
 ('ukrajna', 0.9363289475440979),
 ('lelassul', 0.9362465739250183),
 ('azonosít', 0.9354507327079773),
 ('kitörés', 0.934977650642395),
 ('ő', 0.934356689453125)]

#### 4.2 CBOW on article text

In [26]:
CBOW_model_main_article.wv.most_similar(positive=["koronavírus"], topn=20)

[('covi', 0.6085655689239502),
 ('vírus', 0.5492939352989197),
 ('koronavírusos', 0.4433193504810333),
 ('kór', 0.42949536442756653),
 ('kezdet', 0.42572569847106934),
 ('tetőzésé', 0.41760849952697754),
 ('vírusvariáns', 0.38163506984710693),
 ('fellángolás', 0.37640005350112915),
 ('vírusváltozattal', 0.3734596371650696),
 ('kitörés', 0.3684489130973816),
 ('vírusfertőzés', 0.3671513795852661),
 ('regisztráltaka', 0.3649141788482666),
 ('fékezhető', 0.35908639430999756),
 ('vírustörzz', 0.3536154627799988),
 ('vírusváltozat', 0.35089975595474243),
 ('felkészülhess', 0.3466934561729431),
 ('elkerülte', 0.3461502194404602),
 ('esetleges', 0.3447743058204651),
 ('csúcspontján', 0.3434714078903198),
 ('tárki', 0.3424675166606903)]
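One possible way to compare the two corpora is to measure how many nearest neighbours of "koronavírus" they share. The sketch below computes the Jaccard overlap of the top-10 neighbours copied from the two article-text outputs above. The helper `jaccard` is a hypothetical illustration, and this comparison is an assumption about how the lists might be used, not a step from the original analysis.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Top-10 neighbours of "koronavírus", copied from the article-text outputs above.
dezinf_top10 = ["vírus", "kezdet", "covi", "megugrana", "újrafertőződés",
                "keletkezhet", "szerotípus", "stockholm", "kanyaró", "kitörés"]
main_top10 = ["covi", "vírus", "koronavírusos", "kór", "kezdet",
              "tetőzésé", "vírusvariáns", "fellángolás", "vírusváltozattal", "kitörés"]

shared = sorted(set(dezinf_top10) & set(main_top10))
overlap = jaccard(dezinf_top10, main_top10)
print(shared)
print(overlap)
```

The same helper could be applied to the title models, although the near-identical scores there make the neighbour ranking less reliable.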