This project explores the CORD-19 metadata dataset (COVID-19 research papers) and demonstrates a complete data science workflow:
- Data loading & exploration
- Data cleaning & preparation
- Descriptive analysis & visualization
- Interactive Streamlit web app for exploration
- Documentation & reflection
Project structure:

```
Frameworks_Assignment/
│
├── notebooks/               # Jupyter notebooks for step-by-step analysis
│   ├── Data Cleaning & Preparation.ipynb
│   └── exploration.ipynb
│
├── app.py                   # Streamlit application
├── data/
│   └── cord19_cleaned.csv   # Cleaned dataset (sample from Kaggle CORD-19)
├── requirements.txt         # Dependencies
└── README.md                # Project documentation
```
To run locally:

```bash
git clone https://github.com/olanak/PLP-python-frameworks-Assignment.git
cd PLP-python-frameworks-Assignment
pip install -r requirements.txt
streamlit run app.py
```
**Dataset Overview**
- Final cleaned dataset: ~29,491 records × 10 columns
- Major missing values in `publish_time` (~96% missing)
- All records retain titles, abstracts, authors, and journals after cleaning
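To make the cleaning step concrete, here is a minimal sketch of how such a subset could be produced with pandas. The file paths and exact filters are assumptions, not the notebook's actual code; the column names (`title`, `abstract`, `authors`, `journal`, `publish_time`) follow the CORD-19 metadata schema.

```python
import pandas as pd

# Load the raw CORD-19 metadata (path is illustrative).
df = pd.read_csv("data/metadata.csv", low_memory=False)

# Keep only records that have a title, abstract, authors and journal,
# mirroring the "all records retain ..." note above.
df = df.dropna(subset=["title", "abstract", "authors", "journal"])

# publish_time is largely missing, so parse it leniently instead of dropping rows.
df["publish_time"] = pd.to_datetime(df["publish_time"], errors="coerce")
df["year"] = df["publish_time"].dt.year

# Persist the cleaned sample used by the Streamlit app.
df.to_csv("data/cord19_cleaned.csv", index=False)
```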
**Temporal Trends**
- Publications span 2006–2020
- Massive spike in 2020 with 1,100+ papers
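A minimal sketch of the per-year aggregation behind these numbers, assuming the cleaned CSV described above with a parseable `publish_time` column:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/cord19_cleaned.csv", parse_dates=["publish_time"])

# Count publications per year and plot the trend.
counts = df["publish_time"].dt.year.value_counts().sort_index()
counts.plot(kind="bar", figsize=(10, 4), title="Publications per year")
plt.xlabel("Year")
plt.ylabel("Number of papers")
plt.tight_layout()
plt.show()
```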
**Top Journals**
- PLoS One, Emerg Infect Dis, Sci Rep, PLoS Pathog lead publication counts
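The journal and source rankings boil down to `value_counts()` on the respective columns; a short sketch, assuming the CORD-19 column names `journal` and `source_x`:

```python
import pandas as pd

df = pd.read_csv("data/cord19_cleaned.csv")

# Top 10 journals by number of papers.
print(df["journal"].value_counts().head(10))

# Same pattern for the aggregation source (PubMed, PMC, WHO, ...),
# as summarized in the next subsection.
print(df["source_x"].value_counts().head(10))
```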
**Sources**
- Data aggregated from multiple repositories (PubMed, PMC, WHO, etc.)
**Text Insights**
- Titles average 8–15 words
- Abstracts range from roughly 50 to 300 words
- Frequent terms: COVID-19, coronavirus, infection, respiratory
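A rough sketch of the word-frequency and word-cloud step, assuming a `title` column and the `wordcloud` package among the dependencies:

```python
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud, STOPWORDS

df = pd.read_csv("data/cord19_cleaned.csv")
titles = df["title"].dropna().str.lower()

# Most frequent title terms, ignoring common stopwords.
words = titles.str.findall(r"[a-z0-9\-]+").explode().dropna()
words = words[~words.isin(STOPWORDS)]
print(Counter(words).most_common(20))

# Word cloud built from all titles combined.
cloud = WordCloud(width=800, height=400, stopwords=STOPWORDS).generate(" ".join(titles))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```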
Streamlit app features:

- ✅ Interactive year range slider
- ✅ Dropdown filter by source
- ✅ 📊 Charts: publications trend, top journals, top sources
- ✅ ☁ Word cloud of research titles
- ✅ 🔎 Raw data preview
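The repo's `app.py` contains the full implementation; the condensed sketch below only illustrates how such widgets are typically wired together in Streamlit (column names and defaults are assumptions):

```python
import pandas as pd
import streamlit as st

@st.cache_data
def load_data():
    # Cached load of the cleaned sample shipped with the repo.
    return pd.read_csv("data/cord19_cleaned.csv", parse_dates=["publish_time"])

df = load_data()
df["year"] = df["publish_time"].dt.year

st.title("CORD-19 Research Explorer")

# Year range slider and source dropdown in the sidebar.
year_min, year_max = int(df["year"].min()), int(df["year"].max())
years = st.sidebar.slider("Publication year", year_min, year_max, (year_min, year_max))
source = st.sidebar.selectbox("Source", ["All"] + sorted(df["source_x"].dropna().unique()))

# Apply the filters before charting.
filtered = df[df["year"].between(*years)]
if source != "All":
    filtered = filtered[filtered["source_x"] == source]

# Publications-per-year chart and a raw-data preview.
st.bar_chart(filtered["year"].value_counts().sort_index())
st.dataframe(filtered.head(50))
```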
Limitations:

- High missingness in `publish_time` limited full temporal analysis
- Authors field contained unnormalized strings, complicating per-author statistics (see the sketch after this list)
- Dataset size required sampling for efficient analysis
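For example, per-author counts first require splitting the author strings; a minimal sketch, assuming the CORD-19 convention of semicolon-separated names in the `authors` column:

```python
import pandas as pd

df = pd.read_csv("data/cord19_cleaned.csv")

# Authors are stored as one "Last, First; Last, First; ..." string per paper,
# so explode them into one row per author before counting.
authors = (
    df["authors"]
    .dropna()
    .str.split(";")
    .explode()
    .str.strip()
)
print(authors.value_counts().head(10))
```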
Key learnings:

- Practical experience with data cleaning strategies (dropping vs. imputing)
- Text analysis basics with word frequency & word clouds
- Building interactive dashboards using Streamlit
- End-to-end workflow: load → clean → analyze → visualize → deploy
This project is for educational purposes. Data source: CORD-19 Dataset (Allen Institute for AI).
Acknowledgments:

- Allen Institute for AI for the CORD-19 dataset
- Streamlit for rapid app development
- Kaggle community for open datasets
✨ Developed as part of Frameworks Assignment — demonstrating data science fundamentals with a real-world dataset.