<a href="https://colab.research.google.com/github/jsprecher/COVID19/blob/master/Covid_19_Research_by_Method.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Setting up Kaggle API in Google Colaboratory

In [0]:
!pip install -q kaggle

In [0]:
f = open("kaggle.json", "w")
f.write('{"username":"jakesprecher","key":"525fd5c1ad51bf478a6bc6f02e623451"}')
f.close()

In [0]:
!mkdir ~/.kaggle

In [0]:
!cp kaggle.json ~/.kaggle/

In [0]:
!chmod 600 ~/.kaggle/kaggle.json

In [0]:
!kaggle datasets download -d skylord/coronawhy

Downloading coronawhy.zip to /content
100% 12.8G/12.8G [04:24<00:00, 31.0MB/s]
100% 12.8G/12.8G [04:24<00:00, 51.7MB/s]


## Grouping Papers by Methodology

**Task at hand** :  https://trello.com/c/JQtlAk06/65-split-studies-by-type-different-types-of-studies-have-different-credibility

In [0]:
!unzip coronawhy.zip v6_text/*
!unzip coronawhy.zip clean_metadata.csv

Archive:  coronawhy.zip
  inflating: v6_text/v6_text/v6_text_0.pkl  
  inflating: v6_text/v6_text/v6_text_1.pkl  
  inflating: v6_text/v6_text/v6_text_10.pkl  
  inflating: v6_text/v6_text/v6_text_11.pkl  
  inflating: v6_text/v6_text/v6_text_12.pkl  
  inflating: v6_text/v6_text/v6_text_13.pkl  
  inflating: v6_text/v6_text/v6_text_14.pkl  
  inflating: v6_text/v6_text/v6_text_15.pkl  
  inflating: v6_text/v6_text/v6_text_16.pkl  
  inflating: v6_text/v6_text/v6_text_17.pkl  
  inflating: v6_text/v6_text/v6_text_18.pkl  
  inflating: v6_text/v6_text/v6_text_19.pkl  
  inflating: v6_text/v6_text/v6_text_2.pkl  
  inflating: v6_text/v6_text/v6_text_3.pkl  
  inflating: v6_text/v6_text/v6_text_4.pkl  
  inflating: v6_text/v6_text/v6_text_5.pkl  
  inflating: v6_text/v6_text/v6_text_6.pkl  
  inflating: v6_text/v6_text/v6_text_7.pkl  
  inflating: v6_text/v6_text/v6_text_8.pkl  
  inflating: v6_text/v6_text/v6_text_9.pkl  
Archive:  coronawhy.zip
  inflating: clean_metadata.csv      


In [0]:
# Importing required libraries

import os
import pandas as pd
import numpy as np
import nltk 
from nltk.corpus import wordnet 


PATH = "v6_text/v6_text/"

Here are the keywords (currently, we have trigrams) to search for in the papers.

The output is supposed to be in this format.

**Searching for the keywords only in method and result section of the papers, as the most relevant papers can be found there only.**

(This cell may take some time)

In [0]:
synonyms = [] 
antonyms = [] 
nltk.download('wordnet')
  
for syn in wordnet.synsets("design"): 
    for l in syn.lemmas(): 
        synonyms.append(l.name()) 
        if l.antonyms(): 
            antonyms.append(l.antonyms()[0].name()) 
  
print(set(synonyms)) 

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
{'figure', 'pattern', 'excogitation', 'conception', 'designing', 'invention', 'aim', 'intention', 'innovation', 'project', 'blueprint', 'purpose', 'contrive', 'plan', 'design', 'intent'}


In [0]:
method_keywords = ['method', 'experiment', 'design', 'process', 'study', 'specimen','setup','task', 'intro']

In [0]:
firstpass = True
sections = []

for pkl in os.listdir(PATH):
  df = pd.read_pickle(PATH+pkl, compression='gzip')
  df_sub = df.loc[:,['paper_id', 'language','section', 'sentence', 'lemma', 'UMLS']]
  
  if(firstpass):
    df_method = df_sub[df_sub['section'].astype(str).str.contains('|'.join(method_keywords))]
    firstpass = False
  else:
    df_method = pd.concat([df_method, df_sub[df_sub['section'].astype(str).str.contains('|'.join(method_keywords))]])

In [0]:
unique_papers = df_method['paper_id'].nunique()

print('There are {} papers accounted for.'.format(unique_papers))

There are 26908 papers accounted for.


In [0]:
df_method['lemma'] = df_method['lemma'].apply(lambda x: ' '.join(x))
df_method.head()

Unnamed: 0,paper_id,language,section,sentence,lemma,UMLS
158,53cee6ab40cabc83c6a1d71faf668e95d823c33f,en,introduction,"In the last few decades, several methods to cl...","in the last few decade , several method to cla...","[Decade, Methods, Genes, Proteins]"
159,53cee6ab40cabc83c6a1d71faf668e95d823c33f,en,introduction,Most of these methods are alignmentbased in wh...,Most of these method be alignmentbase in which...,"[Methods, Alignment, Monitoring Systems]"
160,53cee6ab40cabc83c6a1d71faf668e95d823c33f,en,introduction,These methods provide accurate classification ...,these method provide accurate classification o...,"[Methods, Accurate (qualifier), Classification..."
161,53cee6ab40cabc83c6a1d71faf668e95d823c33f,en,introduction,"Nevertheless, their major drawback is due to s...","nevertheless , -PRON- major drawback be due to...","[Consumption of goods, statistical cluster]"
162,53cee6ab40cabc83c6a1d71faf668e95d823c33f,en,introduction,"Henceforth, an alignment-free technique is a t...","henceforth , an alignment-free technique be a ...","[Test Method, Classification, Data Set]"


In [0]:
df_method['full_lemma'] = df_method.groupby(['paper_id','section'])['lemma'].transform(lambda x: ','.join(x.astype(str)))
df_method['full_section'] = df_method.groupby(['paper_id', 'section'])['sentence'].transform(lambda x: ','.join(x.astype(str)))

# Identify different experiment types

In [0]:
silico = ['computation', 'simulation']
vitro = ['cell culture', 'cell line', 'immortalized', 'media', 'well plate', 'FBS',
         'fetal bovine serum', 'incubator', 'CO2', 'carbon dioxide', 'air-liquid interface']
vivo = ['mouse', 'ferret', 'dog', 'rat', 'bat', 'cat', 'mice', 'fish', 'rabbit',
        'guinea pig', 'hamster', 'pig','hog','cow','animal','dog','monkey',
        'wildtype', 'intraperitoneal', 'oral gavage', 'tail vein', 'subcutaneous'] # add other drug administation methods

In [0]:
articles = df_method.loc[:,['paper_id', 'section','full_section','full_section', 'full_lemma']].drop_duplicates()

In [0]:
articles['silico'] = articles['full_lemma'].str.contains('|'.join(silico), case=False)
articles['vitro'] = articles['full_lemma'].str.contains('|'.join(vitro), case=False)
articles['vivo'] = articles['full_lemma'].str.contains('|'.join(vivo), case=False)

In [0]:
print("There are {}% articles containing silico terms.".format(round(articles['silico'].sum()/len(articles),4)*100))
print("There are {}% articles containing vitro terms.".format(round(articles['vitro'].sum()/len(articles),4)*100))
print("There are {}% articles containing vivo terms.".format(round(articles['vivo'].sum()/len(articles),4)*100))
print("There are {} articles total".format(articles['paper_id'].nunique()))

There are 4.26% articles containing silico terms.
There are 24.81% articles containing vitro terms.
There are 91.42% articles containing vivo terms.
There are 26908 articles total


In [0]:
def keywordcounter(sentence, keywords):
  '''
  Input : List of sentences
  Returns : Keywords present in sentences, Total count of all keywords present in Input
  '''
  total = 0
  for word in keywords:
    counter = sentence.lower().count(word)
    total = total + counter
  return total

In [0]:
articles['silico'] = articles['full_lemma'].apply(lambda x: keywordcounter(x, silico))
articles['vitro'] = articles['full_lemma'].apply(lambda x: keywordcounter(x, vitro))
articles['vivo'] = articles['full_lemma'].apply(lambda x: keywordcounter(x, vivo))

In [0]:
articles.loc[(articles['silico'] > articles['vitro'])&(articles['silico'] > articles['vivo']),'methodology'] = 'silico'
articles.loc[(articles['vitro'] > articles['silico'])&(articles['vitro'] > articles['vivo']),'methodology'] = 'vitro'
articles.loc[(articles['vivo'] > articles['silico'])&(articles['vivo'] > articles['vitro']),'methodology'] = 'vivo'

In [0]:
metadata = pd.read_csv('clean_metadata.csv')
metadata.rename(columns={'sha':'paper_id'}, inplace = True)
metadata['paper_id'] = metadata['paper_id'].astype("string")
metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31745 entries, 0 to 31744
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unnamed: 0                   31745 non-null  int64  
 1   cord_uid                     31745 non-null  object 
 2   paper_id                     31744 non-null  string 
 3   source_x                     31745 non-null  object 
 4   title                        31710 non-null  object 
 5   doi                          31745 non-null  object 
 6   pmcid                        16636 non-null  object 
 7   pubmed_id                    24739 non-null  float64
 8   license                      31745 non-null  object 
 9   abstract                     27917 non-null  object 
 10  publish_time                 31745 non-null  object 
 11  authors                      31191 non-null  object 
 12  journal                      30623 non-null  object 
 13  Microsoft Academ

Merging the given papers with their metadata, which contains relevant data.

In [0]:
df_real = articles.merge(metadata[['paper_id', 'title', 'abstract', 'publish_time', 'authors', 'url']], on='paper_id', how='left')

Keeping only the fields which are relevant to us.

In [0]:
df_real = df_real[['paper_id', 'section', 'full_section', 'title', 'abstract', 'publish_time', 'authors', 'url', 'methodology']]
df_real.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41502 entries, 0 to 41501
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   paper_id      41502 non-null  string
 1   section       41502 non-null  object
 2   full_section  41502 non-null  object
 3   full_section  41502 non-null  object
 4   title         35205 non-null  object
 5   abstract      33674 non-null  object
 6   publish_time  35205 non-null  object
 7   authors       34821 non-null  object
 8   url           34976 non-null  object
 9   methodology   37733 non-null  object
dtypes: object(9), string(1)
memory usage: 3.5+ MB


***Grouping the sentences by their Paper IDs.***

In [0]:
grouped = df_real.groupby('paper_id')

Function to format all the code in specified output format.

Work left : 

-> No of citations for each paper

-> Correlation (Unclear)

-> Design Methodology (Unclear)

In [0]:
def aggregation(item):
  '''
  Input : Dataframe of sentences of a paper
  Return : Datframe in Standard Output format
  '''
  dfo = {}

  dfo['Risk Factor'] = 'Pollution'
  dfo['Keyword/Ngram'], dfo['No of keyword occurence in Paper'] = keywordcounter(item['sentence'].tolist())
  dfo['Paper ID'] = item['paper_id'].iloc[0]
  dfo['URL'] = item['url'].iloc[0]
  dfo['Sentences from Method'] = item[item['section']=='methods']['sentence'].tolist()
  dfo['Sentences from Result'] = item[item['section']=='results']['sentence'].tolist()
  dfo['Authors'] = item['authors'].iloc[0]
  dfo['No of Citations'] = 0
  dfo['Correlation'] = 'Not yet'
  dfo['Design Methodology'] = 'None'

  return dfo

Applying function to get the final output dataframe.

In [0]:
for key, item in grouped:
  df_output = pd.concat([df_output, pd.DataFrame([aggregation(item)])])

In [0]:
df_output = df_output.reset_index()
del df_output['index']
df_output

Unnamed: 0,Risk Factor,Keyword/Ngram,Paper ID,URL,Sentences from Result,Sentences from Method,Design Methodology,Correlation,Authors,No of Citations,No of keyword occurence in Paper
0,Pollution,[indoor air pollution],045b111f0f2584890e9271399aa93c917a496662,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,[],[Some patients participated in linked case-con...,,Not yet,"Aston, Stephen J.; Ho, Antonia; Jary, Hannah; ...",0,1
1,Pollution,[air pollution and],0ce2f11a7c4da992b9a8686bd3b2ed53064a2ac8,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,[],[Hourly data for air pollution and meteorologi...,,Not yet,"Chen, Pei-Shih; Tsai, Feng Ta; Lin, Chien Kun;...",0,1
2,Pollution,"[indoor air pollution, of air pollution]",13a1c758877b216e51c51e9ee532ab5c3e0c12fb,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,[],[Although ecologic fallacy and uncontrolled co...,,Not yet,"Xia, Tian; Zhu, Yifang; Mu, Lina; Zhang, Zuo-F...",0,2
3,Pollution,[indoor air pollution],142a615ffb970d12beaa9597bff2b9c49da4bb96,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,"[Of these, 7 risk factors were significantly a...",[We searched a variety of databases-Medline (O...,,Not yet,"Jackson, Stewart; Mathews, Kyle H.; Pulanić, D...",0,3
4,Pollution,[indoor air pollutants],142ecef30a98e08395505ea44e0a7a6f8239ed6b,https://doi.org/10.1016/j.cej.2010.11.061,[Indoor air pollutants consist of particulates...,[],,Not yet,"Yao, Nan; Lun Yeung, King",0,1
5,Pollution,[of air pollution],21d98e3c9427e96a21a4dc7f37d420c42cd6e864,https://doi.org/10.1016/j.envres.2010.05.005,[],[The analytic method used in this study follow...,,Not yet,"Thach, Thuan-Quoc; Wong, Chit-Ming; Chan, King...",0,1
6,Pollution,"[indoor air pollutants, indoor air pollution]",283e32b6190b8c6183e0f5cbfae20c01d7df035c,https://doi.org/10.1016/j.rppneu.2014.02.006,[],[Since the first studies on indoor air quality...,,Not yet,"Araújo-Martins, J.; Carreiro Martins, P.; Vieg...",0,4
7,Pollution,"[air pollution and, between air pollution]",32412ffe52eb41da8ecfed314b703c49fd041f5c,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,[To study the possible relationship between ai...,[],,Not yet,"Almagro, Pere; Hernandez, Carme; Martinez-Camb...",0,2
8,Pollution,[household air pollution],5ec597fac768d34f7c66553b0d568a5c6f3e2aa0,https://doi.org/10.1016/s1473-3099(19)30410-4,[],[Although decreased exposure to household air ...,,Not yet,"Troeger, Christopher E; Khalil, Ibrahim A; Bla...",0,3
9,Pollution,"[air pollution and, between air pollution]",75ebad1e8641e5fe2a51214f4d97742ab59a8dae,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,[Ecologic analysis was conducted among 5 regio...,[Ecologic analysis was conducted to explore th...,,Not yet,"Cui, Yan; Zhang, Zuo-Feng; Froines, John; Zhao...",0,3


Saving the extracted papers.

In [0]:
df_output.to_csv('pollution_papers.csv')