# CORD-19 Metadata Analysis & Streamlit App

This notebook guides you through analyzing the CORD-19 metadata and building a simple Streamlit application to present your findings. Follow each section to complete the assignment step-by-step.

## 1. Install and Import Required Libraries

Install the necessary Python packages and import all libraries needed for data analysis and visualization.

In [None]:
# If running in a notebook, uncomment and run the following lines to install packages:
# !pip install pandas matplotlib seaborn streamlit wordcloud

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
import streamlit as st
import numpy as np

## 2. Download and Load the Metadata Dataset

Download the `metadata.csv` file from the CORD-19 dataset (Kaggle) and load it into a pandas DataFrame.

In [None]:
# Load the metadata.csv file (ensure it's in your working directory)
df = pd.read_csv('metadata.csv')

# Display the shape of the DataFrame
print(f"DataFrame shape: {df.shape}")
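
The full `metadata.csv` is large, so it can help to load only the columns the later sections use. The sketch below shows the idea with `usecols`; the in-memory CSV and its rows are hypothetical stand-ins so the example runs without the real file, and the column list is an assumption based on the columns referenced in this notebook.

```python
import io
import pandas as pd

# Tiny stand-in for metadata.csv (hypothetical rows, for illustration only).
sample_csv = io.StringIO(
    "cord_uid,title,abstract,publish_time,journal,source_x,url\n"
    "a1,Paper one,Some abstract,2020-03-01,Journal A,PMC,http://example.org/1\n"
    "b2,Paper two,,2019,Journal B,Elsevier,http://example.org/2\n"
)

# Columns used by the analysis sections of this notebook.
needed_columns = ["title", "abstract", "publish_time", "journal", "source_x"]

df_small = pd.read_csv(
    sample_csv,
    usecols=needed_columns,  # skip columns that are never used
    dtype=str,               # keep everything as text; dates are parsed later
)
print(df_small.shape)
```

With the real file, replace `sample_csv` with `'metadata.csv'`.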

## 3. Explore Data Structure and Basic Statistics

Display the first few rows, inspect the column data types and non-null counts, and generate summary statistics for the numerical columns.

In [None]:
# Show the first 5 rows
df.head()

In [None]:
# Check column data types and info
df.info()

In [None]:
# Generate summary statistics for numerical columns
df.describe()
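
`df.describe()` only summarizes numerical columns by default, and most of the columns used in this notebook are text. A minimal sketch of a complementary summary for the non-numeric columns follows; the toy DataFrame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two text columns and one numeric column.
toy = pd.DataFrame({
    "journal": ["BMJ", "BMJ", "The Lancet", None],
    "source_x": ["PMC", "PMC", "PMC", "Elsevier"],
    "pubmed_id": [1.0, 2.0, None, 4.0],
})

# Keep only the non-numeric columns, then count non-null and distinct values.
text_columns = toy.select_dtypes(exclude="number")
summary = pd.DataFrame({
    "non_null": text_columns.count(),
    "unique": text_columns.nunique(),
})
print(summary)
```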

## 4. Check and Handle Missing Data

Identify columns with missing values, decide on removal or filling strategies, and create a cleaned version of the dataset.

In [None]:
# Check for missing values in each column
missing_values = df.isnull().sum()
print("Missing values per column:\n", missing_values)

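# Example: Drop rows where 'title' or 'publish_time' is missing.
# .copy() makes df_clean an independent DataFrame, so the column assignments
# below do not trigger pandas' SettingWithCopyWarning.
df_clean = df.dropna(subset=['title', 'publish_time']).copy()

# Optionally, fill missing abstracts with an empty string
df_clean['abstract'] = df_clean['abstract'].fillna('')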
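
Raw counts of missing values are hard to compare across columns, so a percentage view can make the drop-or-fill decision easier. The following is a small sketch on hypothetical toy data, not output from the real dataset.

```python
import pandas as pd

# Hypothetical toy data with gaps in every column.
toy = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "abstract": [None, None, "text", "text"],
    "journal": ["J1", "J2", None, "J4"],
})

# isnull() gives booleans; their mean is the fraction of missing rows.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))

# Drop rows without a title, and copy so later assignments are safe.
toy_clean = toy.dropna(subset=["title"]).copy()
toy_clean["abstract"] = toy_clean["abstract"].fillna("")
```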

## 5. Data Cleaning and Feature Engineering

Convert `publish_time` to datetime, extract publication year, and create new columns such as abstract word count.

In [None]:
# Convert publish_time to datetime
df_clean['publish_time'] = pd.to_datetime(df_clean['publish_time'], errors='coerce')

# Extract year from publish_time
df_clean['year'] = df_clean['publish_time'].dt.year

# Create abstract word count column
df_clean['abstract_word_count'] = df_clean['abstract'].apply(lambda x: len(str(x).split()))
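
If `publish_time` mixes full dates with year-only strings (an assumption worth checking on your copy of the data), some rows may fail to parse and become `NaT`, depending on the pandas version. One possible fallback, sketched here on hypothetical values, is to pull the four-digit year straight out of the text.

```python
import pandas as pd

# Hypothetical publish_time values in mixed formats.
dates = pd.Series(["2020-03-15", "2019", "2021-07", None, "not a date"])

# Take the first run of four digits; rows without one stay missing.
year_text = dates.str.extract(r"(\d{4})", expand=False)

# Convert to a nullable integer column so missing years do not force floats.
year = pd.to_numeric(year_text, errors="coerce").astype("Int64")
print(year)
```

This trades precision for coverage: it recovers the year but ignores month and day, so it suits the per-year counts in this notebook rather than finer-grained time series.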

## 6. Analysis: Publications by Year

Count the number of papers published each year and prepare data for visualization.

In [None]:
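# Count papers by publication year.
# Dates that failed to parse have no year, so drop them and cast to int
# (otherwise the years appear as floats such as 2020.0).
year_counts = df_clean['year'].dropna().astype(int).value_counts().sort_index()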
print("Publications by year:\n", year_counts)

## 7. Analysis: Top Journals

Identify and list the top journals publishing COVID-19 research papers.

In [None]:
# Top 10 journals by publication count
top_journals = df_clean['journal'].value_counts().head(10)
print("Top journals:\n", top_journals)
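
Journal names entered with different capitalization or stray whitespace would be counted as separate journals. Whether this affects your copy of the data is an assumption to verify; the sketch below, on hypothetical names, shows one way to normalize before counting.

```python
import pandas as pd

# Hypothetical journal names with inconsistent case and spacing.
journals = pd.Series(["PLoS One", "PLOS ONE", " plos one ", "The Lancet", None, "BMJ"])

# Trim whitespace and lower-case so variants collapse into one label.
normalized = journals.str.strip().str.lower()

# value_counts() ignores missing values by default.
counts = normalized.value_counts()
print(counts.head(3))
```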

## 8. Analysis: Frequent Words in Titles

Compute word frequency in paper titles using basic text processing.

In [None]:
from collections import Counter
import re

# Combine all titles into one string
titles = ' '.join(df_clean['title'].dropna().astype(str))

# Remove punctuation and split into words
words = re.findall(r'\b\w+\b', titles.lower())

# Remove common stopwords
stopwords = set(STOPWORDS)
filtered_words = [word for word in words if word not in stopwords and len(word) > 2]

# Count word frequencies
word_freq = Counter(filtered_words)
print("Most common words in titles:", word_freq.most_common(20))
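
The same steps can be wrapped in a small reusable function, which makes them easy to check on a handful of titles before running them on the full dataset. The function name, the sample titles, and the short stopword set are all hypothetical; in the notebook you would pass `STOPWORDS` from `wordcloud` instead.

```python
import re
from collections import Counter


def title_word_counts(titles, stopwords, min_length=3):
    """Count words across titles, skipping stopwords and very short words."""
    counts = Counter()
    for title in titles:
        # Lower-case, then keep runs of word characters (drops punctuation).
        for word in re.findall(r"\b\w+\b", title.lower()):
            if len(word) >= min_length and word not in stopwords:
                counts[word] += 1
    return counts


sample_titles = [
    "Coronavirus infection in children",
    "A review of coronavirus vaccines",
    "Vaccines and the immune response",
]
small_stopwords = {"the", "and", "of", "in", "a"}

counts = title_word_counts(sample_titles, small_stopwords)
print(counts.most_common(3))
```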

## 9. Visualizations: Publications Over Time

Plot the number of publications per year using matplotlib or seaborn.

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(x=year_counts.index, y=year_counts.values, palette="Blues_d")
plt.title('Number of Publications by Year')
plt.xlabel('Year')
plt.ylabel('Number of Publications')
plt.tight_layout()
plt.show()

## 10. Visualizations: Top Journals Bar Chart

Create a bar chart showing the top publishing journals.

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x=top_journals.values, y=top_journals.index, palette="Greens_d")
plt.title('Top Journals Publishing COVID-19 Research')
plt.xlabel('Number of Publications')
plt.ylabel('Journal')
plt.tight_layout()
plt.show()

## 11. Visualizations: Word Cloud of Titles

Generate and display a word cloud of the most frequent words in paper titles.

In [None]:
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=STOPWORDS).generate(' '.join(filtered_words))

plt.figure(figsize=(12,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Paper Titles')
plt.show()

## 12. Visualizations: Paper Counts by Source

Plot the distribution of paper counts by source using a suitable chart.

In [None]:
source_counts = df_clean['source_x'].value_counts().head(10)

plt.figure(figsize=(10,6))
sns.barplot(x=source_counts.values, y=source_counts.index, palette="Purples_d")
plt.title('Paper Counts by Source')
plt.xlabel('Number of Papers')
plt.ylabel('Source')
plt.tight_layout()
plt.show()

## 13. Build a Simple Streamlit Application

Create a Streamlit app with a title, a description, and an interactive year-range slider that displays the visualizations and a sample of the filtered data.

In [None]:
# Save this code in a separate file (e.g., streamlit_app.py) and start it from a
# terminal with: streamlit run streamlit_app.py

import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS

st.title("CORD-19 Data Explorer")
st.write("Simple exploration of COVID-19 research papers")

# Load data
df = pd.read_csv('metadata.csv')
df['publish_time'] = pd.to_datetime(df['publish_time'], errors='coerce')
df['year'] = df['publish_time'].dt.year
df['abstract'] = df['abstract'].fillna('')
df['abstract_word_count'] = df['abstract'].apply(lambda x: len(str(x).split()))

# Interactive year range slider
min_year = int(df['year'].min())
max_year = int(df['year'].max())
year_range = st.slider("Select year range", min_year, max_year, (2020, 2021))

filtered_df = df[(df['year'] >= year_range[0]) & (df['year'] <= year_range[1])]

st.write(f"Number of papers from {year_range[0]} to {year_range[1]}: {filtered_df.shape[0]}")

# Publications by year (cast to int so the axis shows 2020 rather than 2020.0)
year_counts = filtered_df['year'].astype(int).value_counts().sort_index()
fig, ax = plt.subplots()
sns.barplot(x=year_counts.index, y=year_counts.values, ax=ax, palette="Blues_d")
ax.set_title('Number of Publications by Year')
ax.set_xlabel('Year')
ax.set_ylabel('Number of Publications')
st.pyplot(fig)

# Top journals
top_journals = filtered_df['journal'].value_counts().head(10)
fig2, ax2 = plt.subplots()
sns.barplot(x=top_journals.values, y=top_journals.index, ax=ax2, palette="Greens_d")
ax2.set_title('Top Journals Publishing COVID-19 Research')
ax2.set_xlabel('Number of Publications')
ax2.set_ylabel('Journal')
st.pyplot(fig2)

# Word cloud of titles
titles = ' '.join(filtered_df['title'].dropna().astype(str))
words = [word for word in titles.lower().split() if word not in STOPWORDS and len(word) > 2]
if words:
    wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=STOPWORDS).generate(' '.join(words))
    fig3, ax3 = plt.subplots(figsize=(12,6))
    ax3.imshow(wordcloud, interpolation='bilinear')
    ax3.axis('off')
    st.pyplot(fig3)
else:
    # WordCloud.generate raises ValueError when it receives no words
    st.write("No titles available for the selected year range.")

# Paper counts by source
source_counts = filtered_df['source_x'].value_counts().head(10)
fig4, ax4 = plt.subplots()
sns.barplot(x=source_counts.values, y=source_counts.index, ax=ax4, palette="Purples_d")
ax4.set_title('Paper Counts by Source')
ax4.set_xlabel('Number of Papers')
ax4.set_ylabel('Source')
st.pyplot(fig4)

# Show sample data
st.write("Sample of filtered data:")
st.dataframe(filtered_df[['title', 'journal', 'year', 'source_x', 'abstract_word_count']].head(10))
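
Streamlit reruns the whole script each time a widget changes, so the app above re-reads and re-parses the CSV on every slider move. One way to keep that manageable is to isolate the preparation and filtering steps in plain functions, which can also be tested without Streamlit. The sketch below uses hypothetical helper names (`prepare_metadata`, `filter_by_year`) and toy data; in the app, the loading function could additionally be decorated with `st.cache_data`, assuming a Streamlit version that provides it.

```python
import pandas as pd


def prepare_metadata(df):
    """Parse dates, drop rows whose date cannot be parsed, and add an integer year."""
    out = df.copy()
    out["publish_time"] = pd.to_datetime(out["publish_time"], errors="coerce")
    out = out.dropna(subset=["publish_time"])
    out["year"] = out["publish_time"].dt.year.astype(int)
    return out


def filter_by_year(df, start_year, end_year):
    """Keep rows whose year lies in [start_year, end_year], bounds included."""
    return df[df["year"].between(start_year, end_year)]


# Hypothetical toy data standing in for metadata.csv.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "publish_time": ["2019-05-01", "2020-03-15", "bad value"],
})

prepared = prepare_metadata(toy)
selected = filter_by_year(prepared, 2020, 2021)
print(selected[["title", "year"]])
```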