# CORD-19 Dataset Analysis and Visualization

This notebook analyzes the metadata of research papers in CORD-19 (the COVID-19 Open Research Dataset). It covers data exploration, cleaning, visualization, and an interactive Streamlit application.

## Table of Contents
1. Setup Development Environment
2. Download and Load Dataset
3. Basic Data Exploration
4. Data Cleaning and Preparation
5. Exploratory Data Analysis
6. Visualizations
7. Streamlit Application
8. Notes, Reflection and How to Run

# 1. Setup Development Environment

First, we'll set up our Colab environment and install required packages. We'll need:
- pandas: for data manipulation
- matplotlib and seaborn: for visualization
- nltk: for text analysis
- wordcloud: for generating word clouds
- streamlit: for the interactive web application
- kaggle: for downloading the dataset

Note: In Colab, some of these packages may already be installed.
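
Because some packages may already be present, it can help to check before installing. The following is a minimal sketch; `missing_packages` is a hypothetical helper name, not part of this notebook, and it only tests whether a module can be found, not which version is installed.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    # find_spec() looks a top-level module up without importing it
    return [name for name in names if importlib.util.find_spec(name) is None]

required = ["pandas", "matplotlib", "seaborn", "nltk", "wordcloud", "streamlit", "kaggle"]
print("Missing packages:", missing_packages(required))
```

Only the names reported as missing would then need a `pip install`.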

In [None]:
# Install required packages
!pip install pandas matplotlib seaborn nltk wordcloud streamlit kaggle --quiet

# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from wordcloud import WordCloud
import streamlit as st
from datetime import datetime
import numpy as np
import warnings
import os

# Configure display settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
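# The bare 'seaborn' style name is deprecated in matplotlib 3.6+ (renamed 'seaborn-v0_8'),
# so let seaborn apply its own default theme instead
sns.set_theme()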

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

print("Setup completed successfully!")

# 2. Download and Load Dataset

We'll download the CORD-19 dataset metadata file from Kaggle. To use the Kaggle API, make sure you have:
1. A Kaggle account
2. API credentials (kaggle.json) file

The dataset we're using is the CORD-19 research challenge dataset from Kaggle.

In [None]:
# Download the dataset
# Note: For quick testing, we'll use a small sample. For the full dataset, uncomment the kaggle command
import os

# Try to use existing sample first
if os.path.exists('metadata_sample.csv'):
    print("Using existing metadata_sample.csv")
else:
    print("Downloading sample metadata...")
    !wget -q -O metadata_sample.csv https://raw.githubusercontent.com/luanwachira-bit/week8-py-codes/main/metadata_sample.csv

# To download full dataset from Kaggle (uncomment these lines):
# !kaggle datasets download -d allen-institute-for-ai/CORD-19-research-challenge --file metadata.csv
# !unzip -o metadata.csv.zip metadata.csv

# Function to load and validate the dataset
def load_dataset(file_path: str) -> pd.DataFrame:
    """
    Load the CORD-19 metadata CSV file into a pandas DataFrame
    
    Args:
        file_path (str): Path to the metadata CSV file
        
    Returns:
        pd.DataFrame: Loaded dataset
    """
    try:
        df = pd.read_csv(file_path)
        print(f"Dataset loaded successfully with {df.shape[0]} rows and {df.shape[1]} columns")
        return df
    except Exception as e:
        print(f"Error loading dataset: {e}")
        return None

# Load the dataset (try full dataset first, fall back to sample)
if os.path.exists('metadata.csv'):
    df = load_dataset('metadata.csv')
else:
    print("Full metadata.csv not found, using sample...")
    df = load_dataset('metadata_sample.csv')

# Display the first few rows
print("\nFirst few rows of the dataset:")
display(df.head())

In [None]:
# Set up Kaggle credentials
# This cell shows how to do it safely - you'll need to provide your own credentials
import os
import json
from getpass import getpass

def setup_kaggle():
    """Configure Kaggle credentials safely."""
    kaggle_dir = os.path.expanduser('~/.kaggle')
    os.makedirs(kaggle_dir, exist_ok=True)
    
    kaggle_path = os.path.join(kaggle_dir, 'kaggle.json')
    if not os.path.exists(kaggle_path):
        print("Please enter your Kaggle credentials:")
        username = input("Username: ")
        # getpass hides the key while typing so it never appears in the cell output
        key = getpass("API Key: ")
        
        credentials = {
            "username": username,
            "key": key
        }
        
        with open(kaggle_path, 'w') as f:
            json.dump(credentials, f)
        os.chmod(kaggle_path, 0o600)
        print("Credentials saved successfully!")
    else:
        print("Kaggle credentials file already exists.")

# Run setup
setup_kaggle()

# Kaggle API Setup (for the Full Dataset)

Downloading the full `metadata.csv` goes through the Kaggle API, which requires credentials:
1. Go to kaggle.com → Account → 'Create New API Token'
2. Download kaggle.json
3. Run the `setup_kaggle()` cell above; if `~/.kaggle/kaggle.json` does not exist yet, it prompts for the username and key found in the downloaded file
4. Uncomment the `kaggle datasets download` lines in the download cell and re-run it

**Important: Never share your Kaggle API key in notebook cells or public repositories!**
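
As an alternative to typing credentials in, the downloaded file can be read directly. This is a sketch under assumptions: `export_kaggle_credentials` is a hypothetical helper, and it relies on the Kaggle client reading the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables.

```python
import json
import os

def export_kaggle_credentials(path):
    """Read a kaggle.json file and expose its values as environment variables."""
    with open(path) as f:
        creds = json.load(f)
    # Fail early with a clear message if the file is not a valid token file
    missing = {"username", "key"} - creds.keys()
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]
```

The function returns only the username, so calling it at the end of a cell does not echo the key into the notebook output.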

# 3. Basic Data Exploration

Let's explore the basic characteristics of our dataset:
1. Check data types of each column
2. Identify missing values
3. Generate basic statistics
4. Understand the structure of our data

In [None]:
# Function to analyze dataset structure
def analyze_dataset_structure(df: pd.DataFrame) -> None:
    """
    Analyze and display the basic structure of the dataset
    
    Args:
        df (pd.DataFrame): Input dataset
    """
    print("Data Types of Columns:")
    print(df.dtypes)
    
    print("\nMissing Values:")
    missing_values = df.isnull().sum()
    missing_percentages = (missing_values / len(df)) * 100
    missing_info = pd.DataFrame({
        'Missing Values': missing_values,
        'Percentage': missing_percentages
    })
    print(missing_info[missing_info['Missing Values'] > 0])
    
    print("\nBasic Statistics:")
    print(df.describe())

# Analyze dataset structure
analyze_dataset_structure(df)

# Display unique values in categorical columns
print("\nUnique values in selected columns:")
categorical_columns = ['source_x', 'journal']
for col in categorical_columns:
    if col in df.columns:
        print(f"\n{col}:")
        print(df[col].value_counts().head())

In [None]:
# Optional: fetch the sample metadata published in the allenai/cord19 repository.
# The download is skipped when metadata.csv already exists, so a full Kaggle copy is never overwritten.
if not os.path.exists('metadata.csv'):
    !wget -q -O metadata.csv https://github.com/allenai/cord19/raw/master/sample-metadata.csv
    # Reuse load_dataset() from Section 2 instead of redefining it
    df = load_dataset('metadata.csv')

# Display the first few rows
print("\nFirst few rows of the dataset:")
display(df.head())

# 4. Data Cleaning and Preparation

Now we'll clean the dataset and prepare it for analysis:
1. Handle missing values
2. Convert date columns to proper datetime format
3. Create derived features
4. Clean text data in titles and abstracts
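
Step 4 can be as light as normalizing whitespace and removing stray markup. The sketch below shows one possible approach; `normalize_text` is a hypothetical helper, and the assumption that titles or abstracts may contain simple HTML tags has not been verified against the dataset.

```python
import pandas as pd

def normalize_text(series: pd.Series) -> pd.Series:
    """Fill missing values, strip simple HTML tags, and collapse whitespace."""
    cleaned = series.fillna("").astype(str)
    cleaned = cleaned.str.replace(r"<[^>]+>", " ", regex=True)  # drop tags such as <i>...</i>
    cleaned = cleaned.str.replace(r"\s+", " ", regex=True)      # collapse runs of whitespace
    return cleaned.str.strip()

sample = pd.Series(["  SARS-CoV-2 <i>in vitro</i>\n study ", None])
print(normalize_text(sample).tolist())
```

Running this before computing word counts keeps leftover markup from inflating them.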

In [None]:
# Basic cleaning and feature engineering

def clean_and_prepare(df: pd.DataFrame) -> pd.DataFrame:
    """
    Perform basic cleaning and add useful derived columns.
    
    Args:
        df: Input DataFrame with CORD-19 metadata
        
    Returns:
        Cleaned DataFrame with additional features
    """
    print("Starting data cleaning...")
    df = df.copy()
    
    # Standardize column names
    df.columns = [c.strip() for c in df.columns]
    print("Standardized column names")
    
    # Convert publish_time to datetime and extract year
    if 'publish_time' in df.columns:
        df['publish_time'] = pd.to_datetime(df['publish_time'], errors='coerce')
        # Nullable Int64 keeps years as integers even when some dates failed to parse
        df['year'] = df['publish_time'].dt.year.astype('Int64')
        print(f"Converted dates, found years from {df['year'].min()} to {df['year'].max()}")
    else:
        df['year'] = pd.NA
        print("No publish_time column found")
    
    # Process text columns: fill missing values and add word counts
    text_columns = ['abstract', 'title']
    for col in text_columns:
        if col in df.columns:
            print(f"\nProcessing {col}...")
            df[col] = df[col].fillna('').astype(str)
            df[f'{col}_word_count'] = df[col].str.split().apply(len)
            mean_words = df[f'{col}_word_count'].mean()
            print(f"Average {col} length: {mean_words:.1f} words")
    
    # Handle source information
    if 'source_x' in df.columns:
        df['source_x'] = df['source_x'].fillna('unknown')
        sources = df['source_x'].value_counts()
        print(f"\nFound {len(sources)} different sources")
        print("Top 3 sources:", sources.head(3).to_dict())
    
    # Drop completely empty columns
    empty_cols = [c for c in df.columns if df[c].isna().all()]
    if empty_cols:
        df = df.drop(columns=empty_cols)
        print(f"\nDropped {len(empty_cols)} empty columns: {empty_cols}")
    
    # Summary statistics
    print("\nFinal dataset shape:", df.shape)
    print("Columns:", sorted(df.columns))
    
    return df

# Apply cleaning with timing
import time
start = time.time()
df = clean_and_prepare(df)
print(f"\nCleaning completed in {time.time() - start:.1f} seconds")

# Display sample of cleaned data
print("\nSample of cleaned data:")
df.head()

# 5. Exploratory Data Analysis

We'll perform basic analyses: publications by year, top journals, title word frequencies, and source distribution.
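
Raw title word frequencies tend to be dominated by function words such as "the" and "and". Below is a minimal sketch of stop-word filtering. `STOP_WORDS` is a tiny hand-picked set used only for illustration (the NLTK stop-word list downloaded during setup would be a fuller choice), `count_title_words` is a hypothetical helper, and its regex tokenizer differs slightly from the one used in the cell below.

```python
import re
from collections import Counter

# Assumption: a small illustrative stop-word set, not a complete list
STOP_WORDS = {"the", "and", "for", "with", "from", "are", "was", "that", "this"}

def count_title_words(titles, stop_words=STOP_WORDS):
    """Count lowercase alphabetic tokens longer than two characters, skipping stop words."""
    counts = Counter()
    for title in titles:
        tokens = re.findall(r"[a-z]+", str(title).lower())
        counts.update(t for t in tokens if len(t) > 2 and t not in stop_words)
    return counts

titles = ["The spread of the virus", "Virus transmission and the host"]
print(count_title_words(titles).most_common(3))
```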

In [None]:
# Publications by year
year_counts = df['year'].value_counts(dropna=True).sort_index()
print('Publications by year:')
display(year_counts)

# Top journals
if 'journal' in df.columns:
    top_journals = df['journal'].fillna('unknown').value_counts().head(15)
    print('\nTop journals:')
    display(top_journals)
else:
    top_journals = pd.Series(dtype=int)

# Source distribution
if 'source_x' in df.columns:
    source_counts = df['source_x'].value_counts()
    print('\nSource distribution:')
    display(source_counts)
else:
    source_counts = pd.Series(dtype=int)

# Simple title word frequency
from collections import Counter
import re

def tokenize_text(text: str) -> list:
    # Lowercase, remove non-alpha chars, and split
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    tokens = [t for t in text.split() if len(t) > 2]
    return tokens

title_tokens = Counter()
for t in df['title'].fillna(''):
    title_tokens.update(tokenize_text(t))

top_title_words = pd.Series(dict(title_tokens.most_common(50)))
print('\nTop title words:')
display(top_title_words.head(20))

# 6. Visualizations

Create plots for publications over time, top journals, title word cloud, and source distribution.
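
A bar chart built from `value_counts()` silently omits years that have no papers, which can make the time axis misleading. One way to handle this, sketched here with a hypothetical `fill_missing_years` helper, is to reindex the counts onto a continuous range of years before plotting.

```python
import pandas as pd

def fill_missing_years(year_counts: pd.Series) -> pd.Series:
    """Reindex yearly counts onto a continuous range, using 0 for absent years."""
    if year_counts.empty:
        return year_counts
    years = year_counts.index.astype(int)
    full_range = range(years.min(), years.max() + 1)
    return pd.Series(year_counts.values, index=years).reindex(full_range, fill_value=0)

counts = pd.Series({2018: 3, 2020: 10})
print(fill_missing_years(counts))
```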

In [None]:
# Publications over time plot
plt.figure(figsize=(10, 4))
year_counts.plot(kind='bar')
plt.title('Publications by Year')
plt.xlabel('Year')
plt.ylabel('Number of publications')
plt.tight_layout()
plt.show()

# Top journals bar chart
if not top_journals.empty:
    plt.figure(figsize=(10, 6))
    top_journals.sort_values().plot(kind='barh', color='C1')
    plt.title('Top Journals (by number of papers)')
    plt.xlabel('Number of papers')
    plt.tight_layout()
    plt.show()

# Word cloud for titles
if not top_title_words.empty:
    wc = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(title_tokens)
    plt.figure(figsize=(12, 6))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title('Title Word Cloud')
    plt.show()

# Source distribution pie chart
if not source_counts.empty:
    plt.figure(figsize=(6, 6))
    source_counts.head(10).plot(kind='pie', autopct='%1.1f%%')
    plt.ylabel('')
    plt.title('Top Sources')
    plt.tight_layout()
    plt.show()

# 7. Streamlit Application

We'll provide a lightweight Streamlit app (`streamlit_app.py`) that loads the metadata and shows a year-range slider, charts of papers by year and top sources, and a sample data table. In Colab, Streamlit needs a tunnel such as `ngrok` or `localtunnel` to be reachable from a browser; this notebook uses `ngrok` through `pyngrok`.
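
The year-range filtering behind the slider can be expressed as a plain function, which makes it easy to test outside Streamlit. This is a sketch only; `filter_by_year` is a hypothetical helper and is not part of `streamlit_app.py`.

```python
import pandas as pd

def filter_by_year(df, start_year, end_year, date_col="publish_time"):
    """Keep rows whose publication year lies in [start_year, end_year]."""
    # Unparseable dates become NaT, whose year is NaN and never passes the range test
    years = pd.to_datetime(df[date_col], errors="coerce").dt.year
    return df[years.between(start_year, end_year)]

papers = pd.DataFrame({
    "title": ["A", "B", "C"],
    "publish_time": ["2019-05-01", "2020-03-15", "not a date"],
})
print(filter_by_year(papers, 2020, 2021)["title"].tolist())
```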

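In [None]:
# Install pyngrok, used later to tunnel the Streamlit port
!pip install pyngrok --quiet
from pyngrok import ngrok

In [None]:
%%writefile streamlit_app.py
# Written by the notebook; %%writefile must be the first line of its cell
# and overwrites any existing streamlit_app.py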
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from collections import Counter
import re

st.set_page_config(layout="wide")
st.title('CORD-19 Metadata Explorer')

@st.cache_data
def load_data():
    try:
        df = pd.read_csv('metadata.csv')
    except FileNotFoundError:
        df = pd.read_csv('metadata_sample.csv')
        st.warning('Using sample dataset. Full metadata.csv not found.')
    return df

df = load_data()

# Show basic stats
st.write(f"Dataset contains {len(df):,} papers")

# Year range filter (unparseable dates become NaT and are excluded by the filter)
years = pd.to_datetime(df['publish_time'], errors='coerce').dt.year
year_range = st.slider(
    'Select year range',
    int(years.min()),
    int(years.max()),
    (int(years.min()), int(years.max()))
)

# Filter data
mask = (years >= year_range[0]) & (years <= year_range[1])
filtered = df[mask]

# Show visualizations
col1, col2 = st.columns(2)

with col1:
    st.write('Papers by Year')
    year_counts = years[mask].value_counts().sort_index()
    st.bar_chart(year_counts)

with col2:
    st.write('Top Sources')
    if 'source_x' in df.columns:
        source_counts = filtered['source_x'].value_counts().head(10)
        st.bar_chart(source_counts)

# Show sample of papers
st.write('Sample of Papers')
st.dataframe(filtered[['title', 'publish_time', 'journal']].head())

# Running Streamlit in Colab

To run the Streamlit app in Colab, we'll:
1. Install pyngrok for tunneling
2. Start Streamlit with a public URL
3. Access the app through the provided link

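Note: The ngrok tunnel closes when the notebook disconnects. Current ngrok releases also require an account authtoken, which can be registered with `ngrok.set_auth_token(...)` before connecting.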
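
Streamlit takes a few seconds to start, so the public URL may briefly show an error. A small polling helper can confirm that the local port is accepting connections before the link is shared. This is a sketch; `wait_for_port` is a hypothetical helper built only on the standard library.

```python
import socket
import time

def wait_for_port(port, host="127.0.0.1", timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait briefly and retry
            time.sleep(interval)
    return False
```

After launching Streamlit in the background, `wait_for_port(8501)` could be called before printing the URL.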

In [None]:
# Run Streamlit with ngrok tunnel
# Note: This cell will display the public URL when run
def run_streamlit():
    from pyngrok import ngrok
    
    # Start ngrok tunnel
    ngrok.kill()  # Kill any existing tunnels
    streamlit_port = 8501
    public_url = ngrok.connect(streamlit_port).public_url
    print(f' * ngrok tunnel "{public_url}" -> "http://127.0.0.1:{streamlit_port}"')
    
    # Run Streamlit
    !streamlit run streamlit_app.py --server.port {streamlit_port} &>/dev/null &
    print('\nStreamlit is starting in the background; after a few seconds, access it at:', public_url)

# Run the app
run_streamlit()

# 8. Notes, Reflection and How to Run

## How to run locally

1. Install dependencies: `pip install -r requirements.txt`
2. Ensure `metadata.csv` is in the repo root or use the bundled `metadata_sample.csv`
3. Run Streamlit: `streamlit run streamlit_app.py`

## Running in Colab

Open this notebook in Colab, install packages, and download `metadata.csv`. To run Streamlit in Colab, use a tunnelling service such as ngrok, as in the cells above.

## Reflection

- We cleaned the dataset by converting dates, extracting years, and computing word counts for abstracts and titles.
- Visualizations include publications by year, top journals, a title word cloud, and source distribution.
- Challenges: the full CORD-19 dataset is large; use sampling or cloud resources for heavy analysis.
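
For the size challenge, one option besides sampling is to stream the CSV in chunks so that only one slice is in memory at a time. The sketch below counts papers per year this way. `count_by_year_chunked` is a hypothetical helper, and the chunk size is an arbitrary assumption to tune for the available memory.

```python
import io
from collections import Counter

import pandas as pd

def count_by_year_chunked(csv_source, chunksize=50_000, date_col="publish_time"):
    """Count rows per publication year while reading the CSV one chunk at a time."""
    totals = Counter()
    # usecols limits parsing to the single column this summary needs
    for chunk in pd.read_csv(csv_source, usecols=[date_col], chunksize=chunksize):
        years = pd.to_datetime(chunk[date_col], errors="coerce").dt.year.dropna().astype(int)
        totals.update(years.tolist())
    return pd.Series(totals, dtype="int64").sort_index()

demo_csv = io.StringIO("title,publish_time\nA,2019-01-01\nB,2020-02-02\nC,2020-03-03\nD,\n")
print(count_by_year_chunked(demo_csv, chunksize=2))
```

The same function accepts a file path such as `'metadata.csv'` in place of the in-memory demo.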

