# Summary

The preprocessing pipeline prepared the abstracts and body texts for topic modeling in several steps. Empty documents were discarded first. The remaining texts were cleaned by removing special characters, numbers, and punctuation and by converting everything to lowercase. Each text was then split into individual words, common stop words and domain-specific stop words related to COVID-19 research were removed, and the remaining words were stemmed so that inflected forms of a word map to the same token.

The cells below start from these cleaned columns. Words that occurred fewer than 10 times across the corpus were removed so that the vocabulary concentrates on terms with enough support to model. The filtered abstracts and body texts were then converted into a Document-Term Matrix (DTM) with CountVectorizer, limited to the 10,000 most frequent terms. Each row of the matrix is a document, each column is a term, and each entry is the number of times that term occurs in that document. Finally, the matrices were exported as gensim dictionaries and Matrix Market corpora for the modeling step.
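To make the DTM concrete, here is a minimal sketch that builds one by hand for three toy documents. The documents and variable names are hypothetical and not taken from the dataset, and the code only approximates what CountVectorizer produces (for instance, CountVectorizer's default tokenizer also drops single-character tokens).

```python
from collections import Counter

# Hypothetical, already-cleaned toy documents (not taken from the dataset)
docs = [
    "vaccin hesit vaccin",
    "diabet patient vaccin",
    "patient outcom diabet diabet",
]

# Columns of the matrix: the vocabulary in alphabetical order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary term, entries are raw counts
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```

Most entries of a real DTM are zero, which is why CountVectorizer returns a sparse matrix rather than nested lists like these.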

In [25]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import gensim
from gensim.matutils import Sparse2Corpus
from collections import Counter

In [26]:
# Read clean data
pandemic_publications_nomissing = pd.read_csv('pandemic_publications_nomissing.csv')

In [27]:
pandemic_publications_nomissing.head()

Unnamed: 0,paper_id,doi,abstract,body_text,authors,title,journal,abstract_summary,publish_time,language,publish_year,cleaned_abstract,cleaned_body_text
0,4ac2fe75e88e16b9d97c10f3a89e38f0b2c658ae,10.1530/ec-20-0567,Objective: COVID-19 in people with diabetes is...,People with diabetes are disproportionately af...,". Kempegowda, Punith. Melson, Eka<br> Johnson...",E f f e c t o f C O V I D - 1 9 o n t...,Endocr Connect,Objective: COVID-19 in people with diabetes i...,2021-03-05,en,2021,object covid peopl diabet associ disproportion...,peopl diabet disproportion affect covid like a...
1,226e654c8633175ebbec1e348db4b155169ce111,10.1016/j.jns.2020.117084,"Please cite this article as: I.L. Calandri, M....",The COVID-19 outbreak has rapidly spread aroun...,". Calandri, I.L.. Hawkes, M.A.. Marrodan, M....",T h e i m p a c t o f a n e a r l y ...,J Neurol Sci,"Please cite this article as: I.L. Calandri,<b...",2020-08-12,en,2020,pleas cite articl calandri hawk marrodan impac...,covid outbreak rapidli spread around world cau...
2,c5c5cf2dd676811521dee00439629fdd46255743,10.1186/s12966-021-01186-9,Background: Media use may influence metabolic ...,Background Non-communicable diseases have reac...,". Sina, Elida. Buck, Christoph<br> Veidebaum,...",M e d i a u s e t r a j e c t o r i e s ...,Int J Behav Nutr Phys Act,Background: Media use may influence metabolic...,2021-10-18,en,2021,background media use may influenc metabol synd...,background non communic diseas reach alarm pro...
3,0abdbf563b563b5c706ecab2f01c00a68501ec33,10.1111/irv.12847,Background: Vaccine hesitancy is a global thre...,"2020. 1 Toward the end of November 2020, the p...",". Alabdulla, Majid. Reagu, Shuja Mohd<br> Al‐...",C O V I D ‐ 1 9 v a c c i n e h e s i t a...,Influenza Other Respir Viruses,Background: Vaccine hesitancy is a global<br>...,2021-02-19,en,2021,background vaccin hesit global threat undermin...,2020 toward end novemb 2020 pandem spread 215 ...
4,1f72a188bf206edc2483fe70271e802634193e1c,10.3390/ph15020165,"Citation: Kolarič, A.; Jukič, M.; Bren, U. Nov...",Severe acute respiratory syndrome coronavirus ...,". Kolarič, Anja. Jukič, Marko. Bren, Urban...",N o v e l S m a l l - M o l e c u l e I n...,Pharmaceuticals (Basel),"C i t a t i o n : K o l a r i č , A . ; ...",2022-01-28,en,2022,citat kolarič jukič bren novel small molecul i...,sever acut respiratori syndrom coronaviru sar ...


In [30]:
# Remove rare words from the abstracts and the body texts
min_frequency = 10  # Adjust this threshold as needed

def remove_rare_words(column, min_frequency):
    """Drop words that occur fewer than min_frequency times across the whole column."""
    word_counts = Counter()
    for text in column:
        if isinstance(text, str):  # Check for string type
            word_counts.update(text.split())
        else:
            print(f"Skipping non-string value: {text}")  # Log non-string values

    # A set gives constant-time membership tests; a list would be scanned for every word
    frequent_words = {word for word, count in word_counts.items() if count >= min_frequency}

    def filter_rare_words(text):
        if isinstance(text, str):  # Check again before filtering
            return ' '.join(word for word in text.split() if word in frequent_words)
        return text  # Return unchanged if not a string

    return column.apply(filter_rare_words)

pandemic_publications_nomissing['filtered_abstract'] = remove_rare_words(
    pandemic_publications_nomissing['cleaned_abstract'], min_frequency)
pandemic_publications_nomissing['filtered_body_text'] = remove_rare_words(
    pandemic_publications_nomissing['cleaned_body_text'], min_frequency)


Skipping non-string value: nan
Skipping non-string value: nan
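The effect of the frequency threshold is easier to see on a small input. The sketch below applies the same idea to three toy strings with a threshold of 2; the function name `drop_rare_words` and the sample texts are hypothetical.

```python
from collections import Counter

def drop_rare_words(texts, min_frequency):
    """Keep only words whose total count across all texts reaches min_frequency."""
    counts = Counter(word for text in texts for word in text.split())
    keep = {word for word, count in counts.items() if count >= min_frequency}
    return [' '.join(word for word in text.split() if word in keep) for text in texts]

# Hypothetical toy input: 'trial' and 'outcom' each occur only once
toy_texts = ["covid vaccin trial", "covid vaccin", "covid outcom"]
filtered = drop_rare_words(toy_texts, min_frequency=2)
print(filtered)
```

Note that the threshold counts total occurrences across the corpus, so a word repeated many times in a single paper still survives. If the intent were to require a word to appear in a minimum number of documents, CountVectorizer's `min_df` parameter would be one alternative.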


In [31]:
# Build and save the DTM, dictionary, and corpus for filtered_abstract
non_empty_abstracts = pandemic_publications_nomissing['filtered_abstract'].dropna().tolist()
vectorizer = CountVectorizer(max_features=10000)
dtm_abstract = vectorizer.fit_transform(non_empty_abstracts)
# Token ids must match the DTM column indices. This assumes that CountVectorizer returns
# its feature names in sorted order and that Dictionary assigns ids in sorted token order.
gensim_dict_abstract = gensim.corpora.Dictionary([vectorizer.get_feature_names_out()])
# Rows of the DTM are documents, hence documents_columns=False
gensim_corpus_abstract = list(Sparse2Corpus(dtm_abstract, documents_columns=False))
gensim_dict_abstract.save("gensim_dictionary_abstract.dict")
gensim.corpora.MmCorpus.serialize("gensim_corpus_abstract.mm", gensim_corpus_abstract)

# Build and save the DTM, dictionary, and corpus for filtered_body_text
non_empty_body_texts = pandemic_publications_nomissing['filtered_body_text'].dropna().tolist()
vectorizer = CountVectorizer(max_features=10000)  # Re-initialize for body text
dtm_body = vectorizer.fit_transform(non_empty_body_texts)
# Same id alignment assumption as for the abstracts
gensim_dict_body = gensim.corpora.Dictionary([vectorizer.get_feature_names_out()])
gensim_corpus_body = list(Sparse2Corpus(dtm_body, documents_columns=False))
gensim_dict_body.save("gensim_dictionary_body.dict")
gensim.corpora.MmCorpus.serialize("gensim_corpus_body.mm", gensim_corpus_body)
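The exported corpus stores each document as a list of (token id, count) pairs, and those ids are only meaningful together with a dictionary whose ids match the DTM columns. The following plain-Python sketch illustrates that format and the decoding step; it is not gensim's implementation, and the vocabulary, rows, and helper name `to_bow` are hypothetical.

```python
# Hypothetical toy vocabulary (column order of the DTM) and dense count rows
vocabulary = ["covid", "diabet", "vaccin"]
dense_rows = [
    [2, 0, 1],
    [0, 3, 0],
]

def to_bow(row):
    """Convert one dense count row into sparse (token_id, count) pairs."""
    return [(token_id, count) for token_id, count in enumerate(row) if count > 0]

# Bag-of-words corpus: one list of (token_id, count) pairs per document
bow_corpus = [to_bow(row) for row in dense_rows]

# The id-to-token mapping plays the role of the dictionary
id2token = dict(enumerate(vocabulary))
decoded = [[(id2token[token_id], count) for token_id, count in doc] for doc in bow_corpus]

print(bow_corpus)
print(decoded)
```

Decoding a few documents this way after loading the saved files is a quick sanity check that the dictionary ids and the corpus ids still agree.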
