# Step 1: Data Collection

In this step, we download the COVID-19 dataset from [Our World in Data](https://ourworldindata.org/covid-deaths).  
We’ll use the cleaned CSV version for easier loading and analysis.

Make sure the file is saved in the `data/` subfolder of this project.


In [1]:
# Let's import pandas and confirm that the data file exists
import pandas as pd
import os

# Check if the data file is in the expected location
file_path = "data/owid-covid-data.csv"
if os.path.exists(file_path):
    print("✅ Data file found! Ready to load.")
else:
    print("❌ Data file not found. Please download it and place it in the 'data/' folder.")


❌ Data file not found. Please download it and place it in the 'data/' folder.


## Step 2: Data Loading & Exploration

In this step, we load the COVID-19 dataset and explore its structure.
We'll check for:
- Available columns
- Preview some rows
- Missing values


In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv("data/owid-covid-data.csv")

# View first few rows
df.head()


FileNotFoundError: [Errno 2] No such file or directory: 'data/owid-covid-data.csv'

In [2]:
# Check all column names
df.columns


NameError: name 'df' is not defined

In [3]:
# Check how many missing values are in each column
df.isnull().sum()


NameError: name 'df' is not defined

## Step 3: Data Cleaning

In this step, we will:
- Focus on selected countries (Kenya, USA, India)
- Convert the `date` column to datetime
- Drop rows with missing critical values
- Handle missing numeric data


In [4]:
# Keep only rows for selected countries
selected_countries = ['Kenya', 'United States', 'India']
df = df[df['location'].isin(selected_countries)]


NameError: name 'df' is not defined

In [5]:
# Convert date column to datetime format
df['date'] = pd.to_datetime(df['date'])


NameError: name 'df' is not defined

In [6]:
# Drop rows with missing dates or total_cases
df = df.dropna(subset=['date', 'total_cases'])


NameError: name 'df' is not defined

In [7]:
# Fill missing numeric values with forward fill
df.fillna(method='ffill', inplace=True)


NameError: name 'df' is not defined

## Step 4: Exploratory Data Analysis (EDA)

We will:
- Plot total COVID-19 cases over time
- Plot total deaths over time
- Compare daily new cases between countries
- Calculate and visualize death rate (total_deaths / total_cases)


In [8]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
plt.style.use('seaborn-darkgrid')

# Plot total cases over time
plt.figure(figsize=(10,6))
for country in selected_countries:
    country_data = df[df['location'] == country]
    plt.plot(country_data['date'], country_data['total_cases'], label=country)

plt.title('Total COVID-19 Cases Over Time')
plt.xlabel('Date')
plt.ylabel('Total Cases')
plt.legend()
plt.tight_layout()
plt.show()


OSError: 'seaborn-darkgrid' is not a valid package style, path of style file, URL of style file, or library style name (library styles are listed in `style.available`)

In [None]:
# Plot total deaths over time
plt.figure(figsize=(10,6))
for country in selected_countries:
    country_data = df[df['location'] == country]
    plt.plot(country_data['date'], country_data['total_deaths'], label=country)

plt.title('Total COVID-19 Deaths Over Time')
plt.xlabel('Date')
plt.ylabel('Total Deaths')
plt.legend()
plt.tight_layout()
plt.show()


In [9]:
# Plot daily new cases
plt.figure(figsize=(10,6))
for country in selected_countries:
    country_data = df[df['location'] == country]
    plt.plot(country_data['date'], country_data['new_cases'], label=country)

plt.title('Daily New COVID-19 Cases')
plt.xlabel('Date')
plt.ylabel('New Cases')
plt.legend()
plt.tight_layout()
plt.show()


NameError: name 'df' is not defined

<Figure size 1000x600 with 0 Axes>

In [10]:
# Add a new column for death rate
df['death_rate'] = df['total_deaths'] / df['total_cases']

# Plot death rate for each country
plt.figure(figsize=(10,6))
for country in selected_countries:
    country_data = df[df['location'] == country]
    plt.plot(country_data['date'], country_data['death_rate'], label=country)

plt.title('COVID-19 Death Rate Over Time')
plt.xlabel('Date')
plt.ylabel('Death Rate')
plt.legend()
plt.tight_layout()
plt.show()


NameError: name 'df' is not defined

## Step 5: Visualizing Vaccination Progress

In this step, we will:
- Plot cumulative total vaccinations over time
- Compare the percentage of the population vaccinated (if available)


In [11]:
# Plot total vaccinations over time
plt.figure(figsize=(10,6))
for country in selected_countries:
    country_data = df[df['location'] == country]
    plt.plot(country_data['date'], country_data['total_vaccinations'], label=country)

plt.title('Total COVID-19 Vaccinations Over Time')
plt.xlabel('Date')
plt.ylabel('Total Vaccinations')
plt.legend()
plt.tight_layout()
plt.show()


NameError: name 'df' is not defined

<Figure size 1000x600 with 0 Axes>

In [12]:
# Check if 'people_vaccinated_per_hundred' exists
if 'people_vaccinated_per_hundred' in df.columns:
    plt.figure(figsize=(10,6))
    for country in selected_countries:
        country_data = df[df['location'] == country]
        plt.plot(country_data['date'], country_data['people_vaccinated_per_hundred'], label=country)

    plt.title('People Vaccinated per Hundred')
    plt.xlabel('Date')
    plt.ylabel('Percent Vaccinated (%)')
    plt.legend()
    plt.tight_layout()
    plt.show()
else:
    print("Column 'people_vaccinated_per_hundred' not available in this dataset.")


NameError: name 'df' is not defined

## Step 6: Choropleth Map (Optional)

This world map shows the total COVID-19 cases per country as of the most recent date in the dataset.
We'll use Plotly Express for visualization.


In [13]:
import plotly.express as px

# Get the most recent date in the dataset
latest_date = df['date'].max()

# Filter data for the latest date and drop missing values
latest_data = df[df['date'] == latest_date]
latest_data = latest_data[['iso_code', 'location', 'total_cases']].dropna()

# Plot the choropleth
fig = px.choropleth(
    latest_data,
    locations="iso_code",
    color="total_cases",
    hover_name="location",
    color_continuous_scale="Reds",
    title=f'Total COVID-19 Cases by Country as of {latest_date}'
)

fig.show()


NameError: name 'df' is not defined

In [14]:
!pip install plotly


Defaulting to user installation because normal site-packages is not writeable


## Step 7: Insights & Reporting

### 🔍 Key Insights:

1. **[Insert Insight]**: For example, *India had a rapid increase in cases between March and May 2021.*
2. **[Insert Insight]**: *The USA recorded the highest number of cumulative cases.*
3. **[Insert Insight]**: *Vaccination rates in the UK grew faster than in Kenya or India.*
4. **[Insert Insight]**: *Global death rate has decreased slightly as vaccination increased.*
5. **[Insert Insight]**: *Some countries show data inconsistencies or missing values.*

### 📌 Observations:
- Countries with higher vaccination rates tend to have slower case growth later in the timeline.
- Daily new cases vary greatly by region and period.
- Some developing countries have incomplete data reporting.

### 💡 Conclusion:
This analysis provides an overview of how different countries handled the pandemic in terms of infections, deaths, and vaccinations. With clear trends and visualizations, decision-makers can evaluate pandemic response effectiveness.

