This repo provides a basic analysis of the CORD-19 research dataset and a simple Streamlit application to display the findings.

olanak/PLP-python-frameworks-Assignment

📊 CORD-19 Metadata Analysis & Interactive Explorer

Streamlit Python Pandas Matplotlib Seaborn

This project explores the CORD-19 metadata dataset (COVID-19 research papers) and demonstrates a complete data science workflow:

  1. Data loading & exploration
  2. Data cleaning & preparation
  3. Descriptive analysis & visualization
  4. Interactive Streamlit web app for exploration
  5. Documentation & reflection

📂 Project Structure

Frameworks_Assignment/
│
├── notebooks/                  # Jupyter notebooks for step-by-step analysis
│   ├── Data Cleaning & Preparation.ipynb
│   └── exploration.ipynb
│
├── app.py                      # Streamlit application
├── data/
│   └── cord19_cleaned.csv      # Cleaned dataset (sample from Kaggle CORD-19)
├── requirements.txt            # Dependencies
└── README.md                   # Project documentation

🚀 Getting Started

1️⃣ Clone Repository

git clone https://github.com/olanak/PLP-python-frameworks-Assignment.git
cd PLP-python-frameworks-Assignment

2️⃣ Install Dependencies

pip install -r requirements.txt

3️⃣ Run Streamlit App

streamlit run app.py

📊 Key Findings

  • Dataset Overview

    • Final cleaned dataset: ~29,491 records × 10 columns
    • publish_time is missing for ~96% of records
    • All records retain titles, abstracts, authors, and journals after cleaning
  • Temporal Trends

    • Publications span 2006–2020
    • Sharp spike in 2020, with 1,100+ papers among the records that have a publish_time
  • Top Journals

    • PLoS One, Emerg Infect Dis, Sci Rep, PLoS Pathog lead publication counts
  • Sources

    • Data aggregated from multiple repositories (PubMed, PMC, WHO, etc.)
  • Text Insights

    • Titles average 8–15 words
    • Abstracts range from ~50 to ~300 words
    • Frequent terms: COVID-19, coronavirus, infection, respiratory
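The overview and trend figures above can be reproduced with a few lines of pandas. The snippet below is a minimal sketch on a tiny made-up DataFrame, not the notebooks' actual code; it assumes the cleaned file keeps the `title` and `publish_time` columns from the CORD-19 metadata.

```python
import pandas as pd

# Tiny made-up stand-in for data/cord19_cleaned.csv (the real file has ~29k rows).
df = pd.DataFrame(
    {
        "title": [
            "Coronavirus infection in bats",
            "Respiratory viruses",
            "COVID-19 outbreak",
            "SARS review",
        ],
        "publish_time": ["2006-03-01", None, "2020-02-15", "2020-04-01"],
    }
)

# Share of rows with no publication date, as a percentage.
missing_pct = df["publish_time"].isna().mean() * 100

# Parse dates (unparseable values become NaT), then count papers per year.
years = pd.to_datetime(df["publish_time"], errors="coerce").dt.year
papers_per_year = years.dropna().astype(int).value_counts().sort_index()

# Words per title, split on whitespace.
title_words = df["title"].str.split().str.len()

print(f"publish_time missing: {missing_pct:.0f}%")
print(papers_per_year)
print(f"mean title length: {title_words.mean():.1f} words")
```

On the real dataset, replace the inline DataFrame with `pd.read_csv("data/cord19_cleaned.csv")`. Rows without a date drop out of the per-year counts, so the yearly totals only describe the dated subset.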

🌐 Streamlit App Features

✅ Interactive year range slider
✅ Dropdown filter by source
✅ 📊 Charts: publications trend, top journals, top sources
✅ ☁ Word cloud of research titles
✅ 🔎 Raw data preview
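The year slider and source dropdown boil down to one filtering step. The sketch below shows one way to structure it; it is an illustration, not the contents of `app.py`. The function names `filter_papers` and `render_app`, the `year` column, and the `source_x` column (the source field name in the raw CORD-19 metadata) are all assumptions about the cleaned file.

```python
import pandas as pd


def filter_papers(df, year_range, source="All"):
    """Return rows whose year falls in year_range and whose source matches."""
    start, end = year_range
    mask = df["year"].between(start, end)  # inclusive on both ends
    if source != "All":
        mask &= df["source_x"] == source
    return df[mask]


def render_app(csv_path="data/cord19_cleaned.csv"):
    """Wire the filter to Streamlit widgets; call this from a script run with `streamlit run`."""
    import streamlit as st  # imported lazily so filter_papers works without Streamlit

    df = pd.read_csv(csv_path)
    lo, hi = int(df["year"].min()), int(df["year"].max())

    # A tuple default turns st.slider into a range slider.
    year_range = st.slider("Year range", lo, hi, (lo, hi))
    sources = ["All"] + sorted(df["source_x"].dropna().unique())
    source = st.selectbox("Source", sources)

    filtered = filter_papers(df, year_range, source)
    st.line_chart(filtered["year"].value_counts().sort_index())
    st.dataframe(filtered.head(50))  # raw data preview
```

Keeping the filter in a plain function means it can be checked in a notebook or a unit test without starting the app.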


🔍 Reflection

Challenges

  • High missingness in publish_time limited full temporal analysis
  • Authors field contained unnormalized strings, complicating per-author stats
  • Dataset size required sampling for efficient analysis
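One way to make per-author statistics tractable is to split and tidy the authors strings before counting. The sketch below uses only the standard library and assumes that names within a record are separated by semicolons; that separator and the helper names `split_authors` and `author_counts` are assumptions, not something taken from the notebooks.

```python
from collections import Counter


def split_authors(raw):
    """Split a raw authors string into a list of cleaned names."""
    if not isinstance(raw, str):  # covers None and NaN from missing values
        return []
    # Collapse repeated whitespace and drop empty pieces left by stray separators.
    return [" ".join(name.split()) for name in raw.split(";") if name.strip()]


def author_counts(author_strings):
    """Count how many papers each author appears on."""
    counts = Counter()
    for raw in author_strings:
        # set() so an author listed twice on one paper is counted once.
        counts.update(set(split_authors(raw)))
    return counts


papers = ["Smith, J.;  Doe,  Jane ;", "Doe, Jane", None]
print(author_counts(papers).most_common(1))
```

This only fixes spacing and separators. Variants of the same person's name, such as initials versus full first names, would still count as different authors.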

Learnings

  • Practical experience with data cleaning strategies (dropping vs imputing)
  • Text analysis basics with word frequency & word clouds
  • Building interactive dashboards using Streamlit
  • End-to-end workflow: load → clean → analyze → visualize → deploy
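As an illustration of the word-frequency step, the sketch below counts terms across a few made-up titles with the standard library. The stopword list, the tokenizing pattern, and the function name `top_terms` are illustrative choices, not the project's actual settings.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def top_terms(titles, n=3):
    """Return the n most frequent non-stopword terms across the titles."""
    counts = Counter()
    for title in titles:
        # Lowercase, then keep alphanumeric tokens with internal hyphens ("covid-19").
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", title.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


titles = [
    "Coronavirus infection in children",
    "COVID-19 and respiratory infection",
    "Respiratory infection of the elderly",
]
print(top_terms(titles, n=2))  # [('infection', 3), ('respiratory', 2)]
```

The same term counts can be passed to a word-cloud library to draw the title cloud.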

📜 License

This project is for educational purposes. Data source: CORD-19 Dataset (Allen Institute for AI).


🙌 Acknowledgements


✨ Developed as part of Frameworks Assignment — demonstrating data science fundamentals with a real-world dataset.

