In [3]:
import glob
import json
import os
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.style.use('ggplot')
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

**Goal**
- COVID-19 is spreading fast and the literature about it is growing just as quickly, so it is hard for health care professionals to find the relevant research.
- In this post we try to identify which topics the research discusses. This can also reduce the number of articles a scientist has to go through.
- Topic modelling is an unsupervised machine learning method that lets us learn the topics of the articles in a corpus.

*OK, let's go*
- Kaggle provides the papers as a large number of JSON files, so we load all of them into a dataframe and drop rows with duplicate abstracts to keep the articles unique.

In [6]:
#path = '/kaggle/input/CORD-19-research-challenge/comm_use_subset/'
path = '/kaggle/input/'
all_json = glob.glob(f'{path}/**/*.json', recursive=True)

class FileReader:
    """Read one paper's JSON file and join its abstract and body paragraphs."""
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
        self.paper_id = content['paper_id']
        # Each entry is one paragraph; join the paragraphs with newlines.
        self.abstract = '\n'.join(entry['text'] for entry in content['abstract'])
        self.body_text = '\n'.join(entry['text'] for entry in content['body_text'])

dict_ = {'paper_id': [], 'abstract': [], 'body_text': []}
# Report progress roughly every 10%; max(1, ...) avoids a modulo by zero
# when there are fewer than 10 files.
step = max(1, len(all_json) // 10)
for idx, entry in enumerate(all_json):
    if idx % step == 0:
        print(f'Processing index: {idx} of {len(all_json)}')
    content = FileReader(entry)
    dict_['paper_id'].append(content.paper_id)
    dict_['abstract'].append(content.abstract)
    dict_['body_text'].append(content.body_text)
covid_df = pd.DataFrame(dict_, columns=['paper_id', 'abstract', 'body_text'])
# Note: rows with an empty abstract also count as duplicates of each other.
covid_df.drop_duplicates(['abstract'], inplace=True)
covid_df.head()

Processing index: 0 of 29315
Processing index: 2931 of 29315
Processing index: 5862 of 29315
Processing index: 8793 of 29315
Processing index: 11724 of 29315
Processing index: 14655 of 29315
Processing index: 17586 of 29315
Processing index: 20517 of 29315
Processing index: 23448 of 29315
Processing index: 26379 of 29315
Processing index: 29310 of 29315


Unnamed: 0,paper_id,abstract,body_text
0,25621281691205eb015383cbac839182b838514f,The human interferon (IFN)-induced MxA protein...,Influenza A viruses (IAV) are severe human pat...
1,7db22f7f81977109d493a0edf8ed75562648e839,"Scorpine, a small cationic peptide from the ve...",The oldest known scorpions lived around 430 mi...
2,a137eb51461b4a4ed3980aa5b9cb2f2c1cf0292a,Background: The complex interplay between vira...,The emergence of Severe Acute Respiratory Synd...
3,6c3e1a43f0e199876d4bd9ff787e1911fd5cfaa6,,Sjögren's syndrome (SS) is a connective tissue...
4,2ce201c2ba233a562ee605a9aa12d2719cfa2beb,Background: Human adenovirus type 55 is a re-e...,Human adenovirus (HAdV) is a common pathogen a...
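
The reader above relies on only a few fields of each JSON file. Here is a minimal sketch of that structure using a made-up record. The id and text are invented for illustration, and `join_sections` is a hypothetical helper, not part of the notebook:

```python
import json

# Made-up record for illustration; real files carry many more fields.
record_json = '''
{"paper_id": "example-id",
 "abstract": [{"text": "First abstract paragraph."}],
 "body_text": [{"text": "Intro paragraph."}, {"text": "Methods paragraph."}]}
'''
record = json.loads(record_json)

def join_sections(record, key):
    # Join the paragraph texts under `key` with newlines.
    # .get() returns an empty list when the key is missing, so the result is ''.
    return "\n".join(entry["text"] for entry in record.get(key, []))

abstract = join_sections(record, "abstract")
body_text = join_sections(record, "body_text")
print(repr(body_text))
```

Using `.get()` here is a design choice for the sketch: it lets a record without an abstract produce an empty string instead of raising a `KeyError`.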


We clean up the text by:
- Removing punctuation
- Converting the text to lower case

In [7]:
# Keep only ASCII letters, digits, and whitespace.
covid_df['body_text'] = covid_df['body_text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))
covid_df['abstract'] = covid_df['abstract'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))

def lower_case(input_str):
    return input_str.lower()

covid_df['body_text'] = covid_df['body_text'].apply(lower_case)
covid_df['abstract'] = covid_df['abstract'].apply(lower_case)
covid_df.head()

Unnamed: 0,paper_id,abstract,body_text
0,25621281691205eb015383cbac839182b838514f,the human interferon ifninduced mxa protein is...,influenza a viruses iav are severe human patho...
1,7db22f7f81977109d493a0edf8ed75562648e839,scorpine a small cationic peptide from the ven...,the oldest known scorpions lived around 430 mi...
2,a137eb51461b4a4ed3980aa5b9cb2f2c1cf0292a,background the complex interplay between viral...,the emergence of severe acute respiratory synd...
3,6c3e1a43f0e199876d4bd9ff787e1911fd5cfaa6,,sjgrens syndrome ss is a connective tissue dis...
4,2ce201c2ba233a562ee605a9aa12d2719cfa2beb,background human adenovirus type 55 is a reeme...,human adenovirus hadv is a common pathogen amo...
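
A note on the character class used for punctuation removal: a range written `A-z` covers ASCII codes 65 to 122, which includes the characters `[ \ ] ^ _` and the backtick that sit between `Z` and `a`, whereas `A-Z` covers letters only. A small check on a made-up sample string:

```python
import re

sample = "IFN-induced MxA_protein [1] costs ^2"

# `A-z` also matches [ \ ] ^ _ ` so those characters survive the cleanup.
loose = re.sub(r'[^a-zA-z0-9\s]', '', sample)
# `A-Z` keeps only letters, digits, and whitespace.
strict = re.sub(r'[^a-zA-Z0-9\s]', '', sample)

print(loose)   # IFNinduced MxA_protein [1] costs ^2
print(strict)  # IFNinduced MxAprotein 1 costs 2
```

Either way, removing a hyphen joins the two halves of a compound word, which is why the cleaned table shows tokens such as "ifninduced".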


- We only need the body_text of each article, so we drop paper_id and abstract and save the cleaned text to a file for later use.

In [8]:
text = covid_df.drop(["paper_id", "abstract"], axis=1)
text.to_csv('./clean_text.csv')
# Last expression in the cell, so the notebook displays the preview.
text.head()

- Next we import spaCy. If you have never installed spaCy before, install it first.
- If you are using Anaconda, run:
    - *conda install -c conda-forge spacy*
- If you are not using Anaconda and want to install via pip, run:
    - *pip install -U spacy*
- If you want to install from source, run:
    - *git clone https://github.com/explosion/spaCy*
    - *cd spaCy*
    - *pip install -r requirements.txt*
    - *python setup.py build_ext --inplace*
- See this page for more options: https://spacy.io/usage
- **So what is spaCy?**
    - spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
    - If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?
    - spaCy is designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning ([source](https://spacy.io/usage/spacy-101))
- OK, let's import spaCy

In [9]:
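import spacy
from spacy.lang.en import English

# English() builds a blank pipeline with only the tokenizer, which is all we use here,
# so no spacy.load(...) call is needed. (The 'en' shortcut was removed in spaCy 3;
# trained pipelines are loaded by full name, e.g. 'en_core_web_sm'.)
parser = English()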

- We will use the following function to clean our text and return a list of tokens:

In [10]:
def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens

- We use NLTK's WordNet to look up words (meanings, synonyms, antonyms, and more) and WordNetLemmatizer to get the root word.
- Before that, install nltk and download the WordNet and stop word data (the two nltk.download calls are run from Python after *import nltk*):
    - *pip install --user -U nltk*
    - *nltk.download('wordnet')*
    - *nltk.download('stopwords')*

In [11]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer

def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

- Filter out stop words:

In [12]:
en_stop = set(nltk.corpus.stopwords.words('english'))

- We can define a function to prepare the text for topic modelling

In [None]:
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens
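
To see what the length and stop-word filters do, here is a sketch with stand-ins: a plain whitespace split instead of the spaCy tokenizer, a tiny made-up stop-word set instead of NLTK's list, and no lemmatization. Only the filtering logic mirrors the function above; `STOP` and `prepare_sketch` are hypothetical names.

```python
# Hypothetical stand-in for NLTK's English stop-word list.
STOP = {"these", "which", "their", "about"}

def prepare_sketch(text):
    # Whitespace split stands in for the spaCy tokenizer.
    tokens = text.lower().split()
    # Same length filter as above: keep tokens longer than 4 characters.
    tokens = [t for t in tokens if len(t) > 4]
    # Same stop-word filter, but against the tiny made-up set.
    tokens = [t for t in tokens if t not in STOP]
    return tokens

result = prepare_sketch("SARS RNA binds these cell receptors which trigger infection")
print(result)  # ['binds', 'receptors', 'trigger', 'infection']
```

Note that the `len(token) > 4` filter also drops short domain terms such as "sars", "rna", and "cell", which may or may not be what you want for this corpus.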

- Open our data and read it line by line. For each line, prepare the text for LDA and add the resulting tokens to a list.


In [None]:
text_data = []
with open('./clean_text.csv') as f:
    for line in f:
        tokens = prepare_text_for_lda(line)
        text_data.append(tokens)
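
Because body_text joins paragraphs with newline characters, one article can span several physical lines in the CSV, so reading line by line yields roughly one document per paragraph (plus the header line). If one document per article is preferred, a possible alternative is to parse the file with the `csv` module. The sketch below uses a small in-memory CSV with made-up text, shaped like clean_text.csv, to show the difference:

```python
import csv
import io

# Build a tiny CSV in memory: an index column and a body_text column
# whose first value contains a newline, as joined paragraphs would.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["", "body_text"])
writer.writerow([0, "first paragraph\nsecond paragraph"])
writer.writerow([1, "single paragraph"])

# Reading by physical line splits the first article in two.
physical_lines = buffer.getvalue().splitlines()

# csv.DictReader reassembles quoted multi-line fields into one record.
buffer.seek(0)
records = list(csv.DictReader(buffer))

print(len(physical_lines))  # 4 (header + 3 text lines)
print(len(records))         # 2 (one per article)
```

With a real file, the same idea would be `csv.DictReader(open('./clean_text.csv', newline=''))` and passing each row's `body_text` to `prepare_text_for_lda`. Real articles can be very long, so `csv.field_size_limit` may need raising.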

**Latent Dirichlet Allocation (LDA) with Gensim**
- What is Gensim ?
    - Gensim = "Generate Similar". 
    - Gensim started off as a collection of various Python scripts for the Czech Digital Mathematics Library dml.cz in 2008, where it served to generate a short list of the most similar articles to a given article (source)
- Install Gensim via Anaconda
    - conda install -c anaconda gensim
- Install Gensim via pip
    - pip install --upgrade gensim
    
**So what is LDA?**
- In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. LDA is an example of a topic model and belongs to the machine learning toolbox and, in a wider sense, to the artificial intelligence toolbox (source)
- OK, we create a dictionary from the data, convert it to a bag-of-words corpus, and save both the dictionary and the corpus for future use
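
To make "dictionary" and "bag-of-words" concrete, here is a plain-Python stand-in that does not use gensim. It assigns an integer id to each distinct token and turns each document into (token_id, count) pairs, which is the shape that `doc2bow` returns. The documents are made up, `token2id` and `to_bow` are hypothetical names, and gensim's own id assignment may differ from the first-appearance order used here.

```python
from collections import Counter

# Two made-up tokenized documents.
docs = [["virus", "protein", "virus"], ["protein", "cell"]]

# Assign ids in order of first appearance (an assumption of this sketch).
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(doc):
    # Count how often each token id occurs in the document.
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(d) for d in docs]
print(token2id)    # {'virus': 0, 'protein': 1, 'cell': 2}
print(bow_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation; only the counts per document remain, which is all LDA uses.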

In [None]:
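import pickle
from gensim import corpora

# Map each distinct token to an integer id.
dictionary = corpora.Dictionary(text_data)
# Represent each document as a list of (token_id, count) pairs.
corpus = [dictionary.doc2bow(text) for text in text_data]
# Save both so that later cells can reload them without re-tokenizing.
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')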

- We ask LDA to find 20 topics in the data:

In [None]:
import gensim
NUM_TOPICS = 20
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model20.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

All topics relate to virus mechanisms, but they approach the research in different ways.
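
Each topic the model reports is a weighted list of words. To make the generative story behind LDA concrete, the sketch below draws a topic mixture for one document from a Dirichlet distribution and then picks each word by first choosing a topic and then a word from that topic. The two topics, their words, and their weights are invented for illustration, and `sample_dirichlet` and `generate_document` are hypothetical helpers. This shows only the forward (generating) direction; fitting LDA means inferring the topics from the documents.

```python
import random

random.seed(0)

# Two made-up topics, each a probability distribution over words.
topics = {
    "transmission": {"outbreak": 0.5, "contact": 0.3, "airborne": 0.2},
    "immunology": {"antibody": 0.6, "cytokine": 0.4},
}

def sample_dirichlet(alpha):
    # Normalized gamma draws give a sample from a Dirichlet distribution.
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=(0.5, 0.5)):
    names = list(topics)
    # Per-document topic mixture.
    mixture = sample_dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # Choose a topic for this word, then a word from that topic.
        topic = random.choices(names, weights=mixture)[0]
        vocab = topics[topic]
        word = random.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return mixture, words

mixture, words = generate_document(8)
print(mixture)
print(words)
```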

# pyLDAvis
- pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.
- Visualizing 20 topics:

In [None]:
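dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
lda = gensim.models.ldamodel.LdaModel.load('model20.gensim')
# pyLDAvis 3.x renamed this module to pyLDAvis.gensim_models; on those versions
# use `import pyLDAvis.gensim_models` and `pyLDAvis.gensim_models.prepare(...)`.
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)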

- Saliency: a measure of how much the term tells you about the topics.
- Relevance: a weighted average of the log probability of the word given the topic and the log of that probability divided by the word's overall probability in the corpus (its lift).
- The size of a bubble reflects how prevalent the topic is in the data.
- First we get the most salient terms, meaning the terms that tell us the most about what is going on across the topics. We can also look at an individual topic.
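
The relevance score can be written out directly. In the LDAvis paper by Sievert and Shirley it is λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)), where λ is the value of the slider in the visualization. A minimal sketch follows; the function name and the probabilities are made up for illustration:

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

# A word that is frequent everywhere vs. a rarer word concentrated in one topic.
common = {"p_word_given_topic": 0.05, "p_word": 0.05}
specific = {"p_word_given_topic": 0.02, "p_word": 0.001}

for lam in (1.0, 0.0):
    print(lam, relevance(lam=lam, **common), relevance(lam=lam, **specific))
```

With λ = 1 the ranking follows the raw in-topic probability, so the common word scores higher. With λ = 0 the ranking follows lift, so the topic-specific word scores higher. Values in between trade the two off.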

With 20 or more topics, we can see that certain topics cluster together, which indicates that those topics are similar.