# Exploratory Data Analysis (EDA) on CORD-19 Dataset

This notebook contains the exploratory data analysis steps for the CORD-19 research dataset. We will load the dataset, perform basic exploration, clean the data, and create initial visualizations to document our findings and insights.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style (set_theme is the current name; sns.set is an alias for it)
sns.set_theme(style='whitegrid')

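# Load the dataset
data_path = '../data/cord19.csv'  # Update with the correct path to your dataset
# low_memory=False reads each column in one pass, which avoids a DtypeWarning
# on columns that mix types
df = pd.read_csv(data_path, low_memory=False)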

# Display the first few rows of the dataset
df.head()

In [None]:
# Basic exploration of the dataset

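# Check the shape of the dataset
# (print it explicitly: a notebook cell only displays its last expression)
print(f"Rows, columns: {df.shape}")

# Check for missing values and show only the columns that have any
missing_values = df.isnull().sum()
missing_values[missing_values > 0]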
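
Raw missing counts are easier to judge next to the share of rows they represent. The following is a minimal sketch, not part of the original analysis: `missing_summary` is a hypothetical helper, and the toy frame only stands in for the real data.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame for illustration; column names are assumptions
toy = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "abstract": [None, None, "x", "y"],
    "journal": ["J1", "J2", "J3", "J4"],
})
report = missing_summary(toy)
print(report)
```

On the real dataset, `missing_summary(df)` would give the same table for every column with gaps.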

In [None]:
# Data cleaning steps

# Example: Dropping rows with missing titles
# Chaining reset_index returns a fresh DataFrame, so later column assignments
# do not trigger a SettingWithCopyWarning
df_cleaned = df.dropna(subset=['title']).reset_index(drop=True)

# Display the shape of the cleaned dataset
df_cleaned.shape
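
When cleaning, it helps to record how many rows each step removes. A minimal sketch under stated assumptions: `drop_missing` is a hypothetical helper, and the toy frame is illustrative only.

```python
import pandas as pd

def drop_missing(frame: pd.DataFrame, required: list) -> tuple:
    """Drop rows missing any required column; return the result and the drop count."""
    cleaned = frame.dropna(subset=required).reset_index(drop=True)
    dropped = len(frame) - len(cleaned)
    return cleaned, dropped

# Toy frame for illustration; column names are assumptions
toy = pd.DataFrame({
    "title": ["A", None, "C"],
    "abstract": ["x", "y", None],
})
cleaned, dropped = drop_missing(toy, ["title"])
print(f"Dropped {dropped} of {len(toy)} rows")
```

Returning the count alongside the frame keeps the cleaning step auditable without changing its behavior.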

In [None]:
# Initial visualizations

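# Example: Count of publications by year
# errors='coerce' turns unparseable dates into NaT instead of raising
publish_time = pd.to_datetime(df_cleaned['publish_time'], errors='coerce')
df_cleaned = df_cleaned.assign(publish_time=publish_time, year=publish_time.dt.year)

# Rows without a valid date are excluded; cast to int so labels read 2020, not 2020.0
publications_by_year = df_cleaned['year'].dropna().astype(int).value_counts().sort_index()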

# Plotting
plt.figure(figsize=(12, 6))
sns.barplot(x=publications_by_year.index, y=publications_by_year.values)
plt.title('Number of Publications by Year')
plt.xlabel('Year')
plt.ylabel('Number of Publications')
plt.xticks(rotation=45)
plt.show()
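
Date columns in bibliographic metadata are often inconsistent, so it is worth checking how parsing treats bad values before trusting yearly counts. A minimal sketch on made-up values (not taken from the dataset), assuming unparseable entries should be set aside rather than raise an error:

```python
import pandas as pd

# Made-up values for illustration, including one bad string and one missing entry
raw = pd.Series(["2020-03-15", "2019-12-01", "not a date", None, "2020-07-04"])

# errors="coerce" maps anything unparseable to NaT
parsed = pd.to_datetime(raw, errors="coerce")
print(f"Unparseable or missing dates: {parsed.isna().sum()}")

# Drop NaT-derived NaN years, then cast to int for clean axis labels
years = parsed.dt.year.dropna().astype(int)
counts = years.value_counts().sort_index()
print(counts)
```

Reporting the number of coerced values next to the chart makes it clear how many records the plot leaves out.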

## Conclusion

In this notebook, we performed exploratory data analysis on the CORD-19 dataset. We loaded the data, explored its structure, cleaned it, and created initial visualizations. Further analysis and visualizations can be added in subsequent sections.