This project is part of the Frameworks Assignment. It uses the CORD-19 dataset to explore COVID-19 research papers through data analysis and a simple Streamlit web app.
- Load and explore the CORD-19 metadata.csv file
- Data cleaning and sampling (to avoid memory errors)
- Interactive filters (sketched in code after this list):
  - Select year range
  - Filter by journal
- Visualizations:
  - Publications per year (bar chart)
  - Heatmap of publications per journal per year
  - Word cloud of paper titles
- Download filtered data as CSV
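As a rough illustration, here is a minimal Streamlit sketch of the filters and the CSV download. It is not the exact code in app.py; the metadata.csv path, the nrows value, and the publish_time/journal column names are assumptions based on the CORD-19 metadata schema.

```python
import pandas as pd
import streamlit as st

# Load a small sample of the metadata (path and row count are assumptions)
df = pd.read_csv("metadata.csv", nrows=5000)
df["year"] = pd.to_datetime(df["publish_time"], errors="coerce").dt.year
df = df.dropna(subset=["year"])

# Sidebar filters: year range slider and journal selector
min_year, max_year = int(df["year"].min()), int(df["year"].max())
year_range = st.sidebar.slider("Year range", min_year, max_year, (min_year, max_year))
journals = st.sidebar.multiselect("Journal", sorted(df["journal"].dropna().unique()))

# Apply the filters
filtered = df[df["year"].between(*year_range)]
if journals:
    filtered = filtered[filtered["journal"].isin(journals)]

# Let the user export the filtered rows as CSV
st.download_button(
    "Download filtered data (CSV)",
    filtered.to_csv(index=False).encode("utf-8"),
    file_name="filtered_metadata.csv",
    mime="text/csv",
)
```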
- Python 3.7+
- pandas (data manipulation)
- matplotlib & seaborn (visualizations)
- wordcloud (word cloud generation)
- streamlit (web application)
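If you ever need to recreate requirements.txt, a minimal version based on the list above (unpinned versions, for illustration only) would be:

```text
pandas
matplotlib
seaborn
wordcloud
streamlit
```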
Clone the repository:
git clone https://github.com/iampunit123/week-8-python-assignment-frameworks-.git
cd Frameworks_Assignment
Install dependencies:
pip install -r requirements.txt
Run the Streamlit app:
streamlit run app.py
or (if streamlit is not in PATH):
python -m streamlit run app.py
The app will open in your browser at:
http://localhost:8501
- app.py → Streamlit app
- metadata.csv → dataset file (or metadata_sample.csv if the dataset is too big)
- requirements.txt → dependencies list
- README.md → this file
- Publications by Year (bar chart)
- Heatmap of publications per journal vs year
- Word Cloud of paper titles (these charts are sketched in code after this list)
- Download Button to export filtered results as CSV
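Continuing the sketch above, and assuming the same `filtered` DataFrame, the three charts could be produced roughly like this (the exact styling in app.py may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns
import streamlit as st
from wordcloud import WordCloud

# Publications per year (bar chart)
year_counts = filtered["year"].value_counts().sort_index()
fig, ax = plt.subplots()
ax.bar(year_counts.index, year_counts.values)
ax.set_xlabel("Year")
ax.set_ylabel("Publications")
st.pyplot(fig)

# Heatmap of publications per journal per year (top 10 journals)
top_journals = filtered["journal"].value_counts().head(10).index
pivot = (
    filtered[filtered["journal"].isin(top_journals)]
    .pivot_table(index="journal", columns="year", values="title", aggfunc="count")
    .fillna(0)
)
fig, ax = plt.subplots()
sns.heatmap(pivot, ax=ax)
st.pyplot(fig)

# Word cloud built from paper titles
text = " ".join(filtered["title"].dropna())
if text:
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    fig, ax = plt.subplots()
    ax.imshow(wc, interpolation="bilinear")
    ax.axis("off")
    st.pyplot(fig)
```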
During this project, I learned how to:
- Load and clean real-world datasets (handling missing data, sampling large files)
- Perform basic exploratory data analysis with pandas
- Create visualizations with matplotlib, seaborn, and wordcloud
- Build an interactive dashboard with Streamlit
- Document and share my work using GitHub
Challenges included dealing with the very large dataset (20+ GB). To solve this, I used only the metadata.csv file and sampled rows (nrows=5000) to make the app lightweight and fast.
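A minimal sketch of that loading-and-sampling step (the column list is an assumption based on the CORD-19 metadata schema):

```python
import pandas as pd

# Read only the first 5,000 rows and a few useful columns to keep memory low
cols = ["title", "abstract", "publish_time", "journal"]
df = pd.read_csv("metadata.csv", usecols=cols, nrows=5000)

# Basic cleaning: drop rows without a title and derive a publication year
df = df.dropna(subset=["title"])
df["year"] = pd.to_datetime(df["publish_time"], errors="coerce").dt.year
```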