### In this notebook, we'll try some topic modeling on the data set.
We'll be working with the abstracts data frame from the TF-IDF notebook. The plan:
1. Load the data from the previous notebook.
2. Create Doc2Vec vectors.
3. Build an abstract similarity network.
4. Find the documents that are closest to each other.
5. Profit???

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# pd.options.display.max_colwidth = 500
from tqdm import tqdm

from multiprocessing import cpu_count, Pool #for multiprocessing data
cores = cpu_count()
import re

#imports for network analysis
import networkx as nx #for creating the network
import community #pip install python-louvain; for determining communities

from gensim.models.doc2vec import Doc2Vec, TaggedDocument #for doc2vec modelling
from IPython.display import clear_output

In [2]:
abstracts_df = pd.read_pickle('abstracts_dataframe') #load data

In [3]:
abstracts_df.head() #check out if everything looks ok 

| | indice | content | title | content_clean | content_clean_nsw | title_clean | title_clean_nsw | vector |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | The geographic spread of 2019 novel coronaviru... | Incubation Period and Other Epidemiological Ch... | [the, geographic, spread, of, novel, coronavir... | [geographic, spread, novel, coronavirus, covid... | [incubation, period, and, other, epidemiologic... | [incubation, period, epidemiological, characte... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 1 | 3 | In December 2019, cases of unidentified pneumo... | Characteristics of and Public Health Responses... | [in, december, cases, of, unidentified, pneumo... | [december, cases, unidentified, pneumonia, his... | [characteristics, of, and, public, health, res... | [characteristics, public, health, responses, c... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 2 | 5 | The basic reproduction number of an infectious... | An updated estimation of the risk of transmiss... | [the, basic, reproduction, number, of, an, inf... | [basic, reproduction, number, infectious, agen... | [an, updated, estimation, of, the, risk, of, t... | [updated, estimation, risk, transmission, nove... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 3 | 6 | The initial cluster of severe pneumonia cases ... | Real-time forecasts of the 2019-nCoV epidemic ... | [the, initial, cluster, of, severe, pneumonia,... | [initial, cluster, severe, pneumonia, cases, t... | [real, time, forecasts, of, the, ncov, epidemi... | [real, time, forecasts, ncov, epidemic, china,... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 4 | 8 | Cruise ships carry a large number of people in... | COVID-19 outbreak on the Diamond Princess crui... | [cruise, ships, carry, a, large, number, of, p... | [cruise, ships, carry, large, number, people, ... | [covid, outbreak, on, the, diamond, princess, ... | [covid, outbreak, diamond, princess, cruise, s... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |


### Doc2Vec Modeling

In [4]:
content = abstracts_df['content_clean'].to_list()
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(content)]

In [73]:
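tqdm.pandas(desc="progress-bar")

epochs = 40
vec_size = 50
alpha = 0.025

model = Doc2Vec(vector_size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=2,
                workers=cores,
                dm=1,  # distributed memory (PV-DM)
                epochs=epochs)

# build_vocab only scans the corpus and initialises random vectors;
# the model has to be trained before the similarities mean anything
model.build_vocab(documents, progress_per=1)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)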



### Now we can compute similarity values between documents

In [74]:
doc_i = 1
print(abstracts_df.iloc[doc_i]['title'])
print('-'*10)
# docvecs is the gensim 3.x name; gensim 4 renamed it to model.dv
for doc_id, score in model.docvecs.most_similar(doc_i, topn=10):
    print(score, abstracts_df.iloc[doc_id]['title'])
    print('-'*10)

Characteristics of and Public Health Responses to the Coronavirus Disease 2019 Outbreak in China
----------
0.5229278802871704 Architects of Assembly: roles of Flaviviridae nonstructural proteins in virion morphogenesis
----------
0.5095396041870117 The NF-κB-dependent and -independent transcriptome and chromatin landscapes of human coronavirus 229E-infected cells
----------
0.4931204617023468 Evaluation of the cytotoxic potential of extracts from the genus Passiflora cultived in Brazil against cancer cells
----------
0.48933345079421997 Influenza A Virus Assembly Intermediates Fuse in the Cytoplasm
----------
0.4824468195438385 Molecular detection of enteric viruses and the genetic characterization of porcine astroviruses and sapoviruses in domestic pigs from Slovakian farms
----------
0.48157939314842224 The National Ebola Training and Education Center: Preparing the United States for Ebola and Other Special Pathogens
----------
0.4812838137149811 Human Betacoronavirus 2c EMC/2012–re

In [13]:
high_threshold = 0.4
# low_threshold = 0.35
maxedges = 50
abstract_network = nx.Graph()
size = len(documents)

In [14]:
for i in tqdm(range(size)):
    sims = model.docvecs.most_similar(i,topn=maxedges)
    for sim in sims:
        if sim[1] >= high_threshold:
            abstract_network.add_edge(i,sim[0],weight=sim[1])

100%|██████████| 26552/26552 [00:32<00:00, 823.50it/s]
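The loop above keeps, for each document, at most `maxedges` neighbours whose similarity reaches `high_threshold`. The same idea can be sketched on plain NumPy arrays, which makes the thresholding easy to check by hand. The helper `top_n_edges` and the toy vectors are hypothetical and not part of the notebook, and the dense similarity matrix only suits small corpora (for roughly 26,000 documents it would take several gigabytes).

```python
import numpy as np

def top_n_edges(vectors, topn=50, threshold=0.4):
    """Return {(i, j): similarity} for each row's topn nearest neighbours
    whose cosine similarity is at least `threshold` (hypothetical helper)."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows at zero instead of dividing by 0
    unit = vectors / norms
    sims = unit @ unit.T  # pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)  # a document is never its own neighbour
    edges = {}
    for i, row in enumerate(sims):
        for j in np.argsort(row)[::-1][:topn]:
            if row[j] >= threshold:
                key = (min(i, int(j)), max(i, int(j)))  # undirected edge
                edges[key] = float(row[j])
    return edges

# toy vectors: documents 0 and 1 point almost the same way, document 2 does not
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
edges = top_n_edges(toy, topn=2, threshold=0.4)
print(edges)
```

Each `(i, j)` key and its value could then be passed to `abstract_network.add_edge(i, j, weight=...)` exactly as in the loop above.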


In [15]:
#community detection
partition = community.best_partition(abstract_network) #default louvain community detection
v = {} #create dict to group nodes by community
for key, value in tqdm(sorted(partition.items())):
    v.setdefault(value, []).append(key)
print('{} communities found'.format(len(v)))

100%|██████████| 26552/26552 [00:00<00:00, 1341837.68it/s]

7 communities found
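`community.best_partition` returns a dictionary that maps each node to a community id, and the cell above inverts it into community id → list of nodes. A minimal sketch of that inversion on a made-up partition, which also makes it easy to look at community sizes; `group_by_community` and `toy_partition` are illustrative names, not part of the notebook.

```python
from collections import defaultdict

def group_by_community(partition):
    """Invert {node: community_id} into {community_id: [nodes]} (hypothetical helper)."""
    groups = defaultdict(list)
    for node, community_id in sorted(partition.items()):
        groups[community_id].append(node)
    return dict(groups)

# made-up partition with the same shape as the Louvain output: node -> community id
toy_partition = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}
groups = group_by_community(toy_partition)
sizes = {c: len(nodes) for c, nodes in groups.items()}
print(groups)
print(sizes)
```

Checking the sizes before sampling is a cheap way to see whether one giant community dominates the network.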





In [18]:
df_samples = {}
for c, n in v.items():
    # sample without replacement so a community never yields the same paper twice
    s = np.random.choice(n, size=min(10, len(n)), replace=False)
    print('samples from community{}'.format(c),s)
    s_df = abstracts_df[abstracts_df.index.isin(s)]
    df_samples[c] = s_df

samples from community0 [ 4545  6619  2349 21587  5926  8731 17291  9016 11451  3083]
samples from community1 [18901   160 20286 12087 15339  6666 25686  6451 20024 20683]
samples from community3 [24529 23830   627 25966 25437  1763 17466  9265  1341 10194]
samples from community4 [19401  7704 26445 16445  5401 10277 12338 25605 13222 23755]
samples from community5 [ 3115 17279  4470 11234   328  2840  2505   660 19641 18442]
samples from community6 [ 3629 26098  7916  9650 12774 22581 16752 15978  3497 16431]
samples from community2 [12677  6351 22005 14883 25170  3909 21433 21899 13097  4347]
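`np.random.choice` draws with replacement unless told otherwise, and unseeded draws change on every run. A small sketch of a reproducible, duplicate-free alternative using NumPy's `Generator` API; the helper name `sample_members`, the seed, and the toy member lists are assumptions for illustration only.

```python
import numpy as np

def sample_members(members, k=10, seed=0):
    """Draw up to k distinct members, reproducibly (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    k = min(k, len(members))  # small communities return all their members
    return rng.choice(members, size=k, replace=False)

community_members = list(range(100, 125))  # toy node ids
picked = sample_members(community_members, k=10)
small = sample_members([7, 8, 9], k=10)
print(picked, small)
```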


In [28]:
pd.options.display.max_colwidth = 500
df_samples[1][['title']]

| | title |
|---|---|
| 160 | Understanding of COVID-19 based on current evidence |
| 6451 | SARS Coronavirus Papain-Like Protease Inhibits the TLR7 Signaling Pathway through Removing Lys63-Linked Polyubiquitination of TRAF3 and TRAF6 |
| 6666 | Germinal Center B Cell and T Follicular Helper Cell Responses to Viral Vector and Protein-in-Adjuvant Vaccines |
| 12087 | Are we prepared? The development of performance indicators for public health emergency preparedness using a modified Delphi approach |
| 15339 | Mucosal Immunization Induces a Higher Level of Lasting Neutralizing Antibody Response in Mice by a Replication-Competent Smallpox Vaccine: Vaccinia Tiantan Strain |
| 18901 | Expression, crystallization and preliminary crystallographic study of the functional mutant (N60K) of nonstructural protein 9 from Human coronavirus HKU1 |
| 20024 | SOCS proteins in development and disease |
| 20286 | Hepatitis C Virus-Induced Autophagy Is Independent of the Unfolded Protein Response |
| 20683 | Characterization of the stop codon readthrough signal of Colorado tick fever virus segment 9 RNA |
| 25686 | Three intergenic regions of coronavirus mouse hepatitis virus strain A59 genome RNA contain a common nucleotide sequence that is homologous to the 3' end of the viral mRNA leader sequence. |


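I'm not sure what this tells me: the titles sampled from community 1 range from coronavirus protease signaling to public health emergency preparedness, with no obvious shared theme. Maybe a similar analysis using the TF-IDF vectors will give better results?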

### TF-IDF Similarity
The previous cells give us a network based on the similarity scores of the abstracts, according to the Doc2Vec model. We'll do the same now, but with the TF-IDF vectors.

In [29]:
# returns the cosine similarity between two vectors, i.e. their dot product divided by the product of their norms
def cosim(vector1, vector2):
    denom = np.linalg.norm(vector1) * np.linalg.norm(vector2)
    if denom == 0:
        return 0.0  # an all-zero vector has no direction; treat it as dissimilar
    return np.dot(vector1, vector2) / denom

In [84]:
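class TFIDFM:
    """Brute-force nearest-neighbour lookup over the TF-IDF vectors in data['vector']."""
    def __init__(self, data):
        self.data = data

    def most_similar(self, doc, topn=10):
        # doc is a row position; returns (position, similarity) pairs, best first
        v1 = self.data.iloc[doc]['vector']
        sims = []
        for i in range(len(self.data)):
            if i != doc:
                v2 = self.data.iloc[i]['vector']
                sims.append((i, cosim(v1, v2)))
        return sorted(sims, key=lambda x: x[1], reverse=True)[:topn]

    def most_similar_text(self, doc, topn=10):
        # assumes a default RangeIndex, so row positions equal index labels;
        # rows come back in index order, not ranked by similarity
        sim = [i[0] for i in self.most_similar(doc, topn=topn)]
        df = self.data[self.data.index.isin(sim)]
        return df[['title']]

    def print_most_sim(self, doc, topn=10):
        # read from self.data rather than the global abstracts_df
        sim = self.most_similar(doc, topn=topn)
        for i in sim:
            print('Paper number: {},\nTitle: {}'.format(self.data.iloc[i[0]]['indice'], self.data.iloc[i[0]]['title']))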
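The `most_similar` method compares one document against every other row in a Python loop, which gets slow on a large corpus. A possible vectorized variant is sketched below, assuming the TF-IDF vectors can be stacked into one 2-D array (for example with `np.vstack(abstracts_df['vector'].to_list())`, which in turn assumes every entry is a 1-D array of the same length). The function name `most_similar_vectorized` and the toy matrix are hypothetical.

```python
import numpy as np

def most_similar_vectorized(matrix, doc, topn=10):
    """Return (row, cosine similarity) pairs for the topn rows closest to row `doc`."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 instead of NaN
    sims = (matrix @ matrix[doc]) / (norms * norms[doc])  # one matrix-vector product
    sims[doc] = -np.inf  # exclude the query document itself
    topn = min(topn, len(sims) - 1)
    top = np.argsort(sims)[::-1][:topn]
    return [(int(i), float(sims[i])) for i in top]

# toy "TF-IDF" matrix: one row per document, one column per term
toy = np.array([[1.0, 1.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 1.0, 1.0],
                [0.0, 0.0, 1.0]])
result = most_similar_vectorized(toy, doc=0, topn=2)
print(result)
```

The return value has the same shape as the list produced by `TFIDFM.most_similar`, so it could be swapped in without touching the other methods.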

In [85]:
tfidfmodel = TFIDFM(abstracts_df)

In [87]:
tfidfmodel.most_similar_text(122, 10)

| | title |
|---|---|
| 41 | Clinical characteristics of novel coronavirus cases in tertiary hospitals in Hubei Province |
| 486 | Clinical analysis of 23 cases of 2019 novel coronavirus infection in Xinyang City, Henan Province |
| 714 | Health management of breast cancer patients outside the hospital during the outbreak of 2019 novel coronavirus disease |
| 716 | [Surgical treatment for esophageal cancer during the outbreak of COVID-19] |
| 717 | [Discussion on diagnosis and treatment of hepatobiliary malignancies during the outbreak of novel coronavirus pneumonia] |
| 755 | Pulmonary rehabilitation guidelines in the principle of 4S for patients infected with 2019 novel coronavirus (2019-nCoV) |
| 760 | Pharmacotherapeutics for the New Coronavirus Pneumonia |
| 845 | Experts proposal and frequently asked questions of rapid screening and prevention of novel coronavirus pneumonia in children |
| 13167 | Pathogenic Influenza Viruses and Coronaviruses Utilize Similar and Contrasting Approaches To Control Interferon-Stimulated Gene Responses |
| 13972 | The Relationship between Airway Inflammation and Exacerbation in Chronic Obstructive Pulmonary Disease |
