This project explores the CORD-19 metadata dataset (COVID-19 research papers) and demonstrates a complete data science workflow:
- Data loading & exploration
- Data cleaning & preparation
- Descriptive analysis & visualization
- Interactive Streamlit web app for exploration
- Documentation & reflection
Project structure:

```
Frameworks_Assignment/
│
├── notebooks/               # Jupyter notebooks for step-by-step analysis
│   ├── Data Cleaning & Preparation.ipynb
│   └── exploration.ipynb
│
├── app.py                   # Streamlit application
├── data/
│   └── cord19_cleaned.csv   # Cleaned dataset (sample from Kaggle CORD-19)
├── requirements.txt         # Dependencies
└── README.md                # Project documentation
```
To run locally:

```bash
git clone https://github.com/olanak/PLP-python-frameworks-Assignment.git
cd PLP-python-frameworks-Assignment
pip install -r requirements.txt
streamlit run app.py
```
**Dataset Overview**
- Final cleaned dataset: ~29,491 records × 10 columns
- Major missing values in `publish_time` (~96% missing)
- All records retain titles, abstracts, authors, and journals after cleaning
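To make the cleaning step concrete, here is a minimal sketch of how such a subset could be produced with pandas. The file paths and exact filters are assumptions, not the notebook's actual code; the column names (`title`, `abstract`, `authors`, `journal`, `publish_time`) follow the CORD-19 metadata schema.

```python
import pandas as pd

# Load the raw CORD-19 metadata (path is illustrative).
df = pd.read_csv("data/metadata.csv", low_memory=False)

# Keep only records that have a title, abstract, authors and journal,
# mirroring the "all records retain ..." note above.
df = df.dropna(subset=["title", "abstract", "authors", "journal"])

# publish_time is largely missing, so parse it leniently instead of dropping rows.
df["publish_time"] = pd.to_datetime(df["publish_time"], errors="coerce")
df["year"] = df["publish_time"].dt.year

# Persist the cleaned sample used by the Streamlit app.
df.to_csv("data/cord19_cleaned.csv", index=False)
```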
**Temporal Trends**
- Publications span 2006–2020
- Massive spike in 2020 with 1,100+ papers
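A minimal sketch of the per-year aggregation behind these numbers, assuming the cleaned CSV described above with a parseable `publish_time` column:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/cord19_cleaned.csv", parse_dates=["publish_time"])

# Count publications per year and plot the trend.
counts = df["publish_time"].dt.year.value_counts().sort_index()
counts.plot(kind="bar", figsize=(10, 4), title="Publications per year")
plt.xlabel("Year")
plt.ylabel("Number of papers")
plt.tight_layout()
plt.show()
```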
**Top Journals**
- PLoS One, Emerg Infect Dis, Sci Rep, PLoS Pathog lead publication counts
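The journal and source rankings boil down to `value_counts()` on the respective columns; a short sketch, assuming the CORD-19 column names `journal` and `source_x`:

```python
import pandas as pd

df = pd.read_csv("data/cord19_cleaned.csv")

# Top 10 journals by number of papers.
print(df["journal"].value_counts().head(10))

# Same pattern for the aggregation source (PubMed, PMC, WHO, ...),
# as summarized in the next subsection.
print(df["source_x"].value_counts().head(10))
```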
**Sources**
- Data aggregated from multiple repositories (PubMed, PMC, WHO, etc.)
**Text Insights**
- Titles average 8–15 words
- Abstracts range from roughly 50 to 300 words
- Frequent terms: COVID-19, coronavirus, infection, respiratory
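A rough sketch of the word-frequency and word-cloud step, assuming a `title` column and the `wordcloud` package among the dependencies:

```python
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud, STOPWORDS

df = pd.read_csv("data/cord19_cleaned.csv")
titles = df["title"].dropna().str.lower()

# Most frequent title terms, ignoring common stopwords.
words = titles.str.findall(r"[a-z0-9\-]+").explode().dropna()
words = words[~words.isin(STOPWORDS)]
print(Counter(words).most_common(20))

# Word cloud built from all titles combined.
cloud = WordCloud(width=800, height=400, stopwords=STOPWORDS).generate(" ".join(titles))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```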
Streamlit app features:

- ✅ Interactive year range slider
- ✅ Dropdown filter by source
- ✅ 📊 Charts: publications trend, top journals, top sources
- ✅ ☁ Word cloud of research titles
- ✅ 🔎 Raw data preview
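The repo's `app.py` contains the full implementation; the condensed sketch below only illustrates how such widgets are typically wired together in Streamlit (column names and defaults are assumptions):

```python
import pandas as pd
import streamlit as st

@st.cache_data
def load_data():
    # Cached load of the cleaned sample shipped with the repo.
    return pd.read_csv("data/cord19_cleaned.csv", parse_dates=["publish_time"])

df = load_data()
df["year"] = df["publish_time"].dt.year

st.title("CORD-19 Research Explorer")

# Year range slider and source dropdown in the sidebar.
year_min, year_max = int(df["year"].min()), int(df["year"].max())
years = st.sidebar.slider("Publication year", year_min, year_max, (year_min, year_max))
source = st.sidebar.selectbox("Source", ["All"] + sorted(df["source_x"].dropna().unique()))

# Apply the filters before charting.
filtered = df[df["year"].between(*years)]
if source != "All":
    filtered = filtered[filtered["source_x"] == source]

# Publications-per-year chart and a raw-data preview.
st.bar_chart(filtered["year"].value_counts().sort_index())
st.dataframe(filtered.head(50))
```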
Limitations:

- High missingness in `publish_time` limited full temporal analysis
- Authors field contained unnormalized strings, complicating per-author statistics (see the sketch after this list)
- Dataset size required sampling for efficient analysis
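For example, per-author counts first require splitting the author strings; a minimal sketch, assuming the CORD-19 convention of semicolon-separated names in the `authors` column:

```python
import pandas as pd

df = pd.read_csv("data/cord19_cleaned.csv")

# Authors are stored as one "Last, First; Last, First; ..." string per paper,
# so explode them into one row per author before counting.
authors = (
    df["authors"]
    .dropna()
    .str.split(";")
    .explode()
    .str.strip()
)
print(authors.value_counts().head(10))
```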
Key learnings:

- Practical experience with data cleaning strategies (dropping vs. imputing)
- Text analysis basics with word frequency & word clouds
- Building interactive dashboards using Streamlit
- End-to-end workflow: load → clean → analyze → visualize → deploy
This project is for educational purposes. Data source: CORD-19 Dataset (Allen Institute for AI).
Acknowledgments:

- Allen Institute for AI for the CORD-19 dataset
- Streamlit for rapid app development
- Kaggle community for open datasets
✨ Developed as part of Frameworks Assignment — demonstrating data science fundamentals with a real-world dataset.