{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# COVID-19 Global Data Tracker\n",
    "\n",
    "This notebook covers steps 1 to 7 of the COVID-19 data analysis project."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 1: Import necessary libraries\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import plotly.express as px\n",
    "\n",
    "print(\"📊 COVID-19 Global Data Tracker Started\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Load the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    df = pd.read_csv(\"owid-covid-data.csv\")\n",
    "    print(\"✅ Dataset loaded successfully.\")\n",
    "except FileNotFoundError:\n",
    "    print(\"❌ ERROR: 'owid-covid-data.csv' not found. Make sure the file is in the same folder as this notebook.\")\n",
    "    exit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: Explore the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 3.1 View column names\n",
    "print(\"\\n🔎 Columns in the dataset:\")\n",
    "print(df.columns.tolist())\n",
    "\n",
    "# 3.2 Preview the first 5 rows\n",
    "print(\"\\n📌 First 5 rows of the dataset:\")\n",
    "print(df.head())\n",
    "\n",
    "# 3.3 Check for missing values\n",
    "print(\"\\n🚨 Missing values per column:\")\n",
    "print(df.isnull().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Data Cleaning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 4.1 Convert the 'date' column to datetime format\n",
    "df['date'] = pd.to_datetime(df['date'])\n",
    "\n",
    "# 4.2 Filter countries of interest (you can change these)\n",
    "countries = ['South Africa', 'United States', 'India', 'Kenya']\n",
    "df = df[df['location'].isin(countries)]\n",
    "\n",
    "# 4.3 Drop rows where critical values are missing\n",
    "df = df.dropna(subset=['date', 'total_cases', 'total_deaths'])\n",
    "\n",
    "# 4.4 Fill or interpolate remaining missing values\n",
    "df[['new_cases', 'new_deaths', 'total_vaccinations']] = df[['new_cases', 'new_deaths', 'total_vaccinations']].fillna(0)\n",
    "\n",
    "print(\"\\n🧹 Data cleaning completed.\")\n",
    "print(f\"📊 Remaining rows: {len(df)}\")\n",
    "print(f\"📍 Countries in dataset: {df['location'].unique()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Exploratory Data Analysis & Visualizations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.style.use('seaborn-darkgrid')\n",
    "plt.rcParams['figure.figsize'] = (12, 6)\n",
    "\n",
    "# Total cases over time\n",
    "for country in countries:\n",
    "    country_data = df[df['location'] == country]\n",
    "    plt.plot(country_data['date'], country_data['total_cases'], label=country)\n",
    "\n",
    "plt.title(\"Total COVID-19 Cases Over Time\")\n",
    "plt.xlabel(\"Date\")\n",
    "plt.ylabel(\"Total Cases\")\n",
    "plt.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Total deaths over time\n",
    "for country in countries:\n",
    "    country_data = df[df['location'] == country]\n",
    "    plt.plot(country_data['date'], country_data['total_deaths'], label=country)\n",
    "\n",
    "plt.title(\"Total COVID-19 Deaths Over Time\")\n",
    "plt.xlabel(\"Date\")\n",
    "plt.ylabel(\"Total Deaths\")\n",
    "plt.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Daily new cases comparison\n",
    "for country in countries:\n",
    "    country_data = df[df['location'] == country]\n",
    "    plt.plot(country_data['date'], country_data['new_cases'], label=country)\n",
    "\n",
    "plt.title(\"Daily New COVID-19 Cases\")\n",
    "plt.xlabel(\"Date\")\n",
    "plt.ylabel(\"New Cases\")\n",
    "plt.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Death rate = total_deaths / total_cases\n",
    "df['death_rate'] = df['total_deaths'] / df['total_cases']\n",
    "for country in countries:\n",
    "    country_data = df[df['location'] == country]\n",
    "    plt.plot(country_data['date'], country_data['death_rate'], label=country)\n",
    "\n",
    "plt.title(\"COVID-19 Death Rate Over Time\")\n",
    "plt.xlabel(\"Date\")\n",
    "plt.ylabel(\"Death Rate\")\n",
    "plt.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Optional: Bar chart - total cases per country (latest date)\n",
    "latest_df = df[df['date'] == df['date'].max()]\n",
    "latest_df = latest_df.groupby('location')['total_cases'].max().sort_values(ascending=False).reset_index()\n",
    "\n",
    "sns.barplot(x='total_cases', y='location', data=latest_df, palette=\"Reds_r\")\n",
    "plt.title(\"Total COVID-19 Cases by Country (Latest Date)\")\n",
    "plt.xlabel(\"Total Cases\")\n",
    "plt.ylabel(\"Country\")\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6: Visualizing Vaccination Progress"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Vaccination plots\n",
    "selected_countries = countries  # Using same countries list\n",
    "\n",
    "# 1. Plot cumulative vaccinations over time for selected countries\n",
    "plt.figure(figsize=(12, 6))\n",
    "for country in selected_countries:\n",
    "    country_data = df[df['location'] == country]\n",
    "    plt.plot(country_data['date'], country_data['total_vaccinations'], label=country)\n",
    "\n",
    "plt.title('Cumulative Vaccinations Over Time')\n",
    "plt.xlabel('Date')\n",
    "plt.ylabel('Total Vaccinations')\n",
    "plt.legend()\n",
    "plt.xticks(rotation=45)\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# 2. Compare vaccination percentages (latest available for each country)\n",
    "latest_data = df[df['date'] == df['date'].max()]\n",
    "vacc_percent = latest_data[latest_data['location'].isin(selected_countries)][\n",
    "    ['location', 'people_vaccinated_per_hundred']\n",
    "]\n",
    "\n",
    "# Bar chart\n",
    "plt.figure(figsize=(8, 5))\n",
    "sns.barplot(data=vacc_percent, x='location', y='people_vaccinated_per_hundred')\n",
    "plt.title('% of Population Vaccinated (Latest)')\n",
    "plt.ylabel('% Vaccinated')\n",
    "plt.xlabel('Country')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7: Optional Choropleth Map"
   ]
  },
  {
   "cell_type": "code


# Insights & Reporting

## Key Insights

1. **Vaccine Rollout Speed:**  
   South Africa's vaccine rollout was slower than those of the United States and India, whose total vaccinations rose more steeply early on.

2. **Case Trends:**  
   The United States consistently reported the highest total COVID-19 cases and deaths over the timeline, while Kenya maintained relatively lower case counts.

3. **Death Rate Patterns:**  
   Although India had a large number of cases, its death rate remained lower than those of South Africa and the United States, possibly reflecting differences in healthcare responses or demographics.

4. **Anomalies:**  
   There were sudden spikes in reported new cases in some countries that may correspond with changes in testing policies or data reporting delays.

5. **Vaccination vs Cases:**  
   Countries with faster vaccination rates generally saw new cases grow more slowly over time, which is consistent with vaccination helping to control the pandemic, although these plots alone do not establish causation.
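
The spikes described in insight 4 can also be checked programmatically rather than by eye. The following is a minimal sketch, not the notebook's own method: it assumes a DataFrame with `location`, `date`, and `new_cases` columns (as in the OWID dataset used here). The helper name `flag_spikes`, the 7-day window, and the 3x threshold are illustrative assumptions, and the toy data is synthetic.

```python
import pandas as pd


def flag_spikes(frame, window=7, factor=3.0):
    """Return rows whose new_cases exceed `factor` times the mean of the
    previous `window` days for the same country (hypothetical helper)."""
    frame = frame.sort_values(["location", "date"]).copy()
    # Baseline: mean of the previous `window` days, excluding the current day
    frame["baseline"] = frame.groupby("location")["new_cases"].transform(
        lambda s: s.shift(1).rolling(window, min_periods=window).mean()
    )
    # Rows without a full baseline window (NaN) are never flagged
    mask = frame["baseline"].gt(0) & (frame["new_cases"] > factor * frame["baseline"])
    return frame.loc[mask, ["location", "date", "new_cases", "baseline"]]


# Synthetic example: one country with a single reporting spike on day 8
toy = pd.DataFrame(
    {
        "location": ["Kenya"] * 10,
        "date": pd.date_range("2021-01-01", periods=10, freq="D"),
        "new_cases": [100, 110, 90, 105, 95, 100, 100, 1000, 120, 110],
    }
)
spikes = flag_spikes(toy)
print(spikes)
```

A flagged row only indicates an unusual jump relative to the preceding week. Whether it reflects a testing-policy change, a reporting backlog, or a real surge still has to be checked against the data source's notes.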

## Additional Observations

- The death rate fluctuated over time, sometimes increasing during waves of infections and then dropping as vaccination coverage improved.
- Kenya's data had more missing values, which might affect the accuracy of the trends observed.
- Visualizations of daily new cases showed clear waves corresponding to global pandemic waves.
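
The observation about Kenya's missing values can be quantified by computing the share of missing entries per country. This is a hedged sketch: it assumes it is run on the raw data, before any rows are dropped or gaps are filled, and the helper name `missing_share_by_country` and the toy values are made up for illustration.

```python
import pandas as pd


def missing_share_by_country(frame, columns):
    """Fraction of missing values per country for the given columns
    (hypothetical helper; expects a 'location' column)."""
    return frame[columns].isna().groupby(frame["location"]).mean()


nan = float("nan")
# Synthetic example, not real OWID figures
toy = pd.DataFrame(
    {
        "location": ["Kenya"] * 4 + ["India"] * 4,
        "total_cases": [1, nan, 3, 4, 1, 2, 3, 4],
        "total_vaccinations": [nan, nan, 5, 6, nan, 2, 3, 4],
    }
)
share = missing_share_by_country(toy, ["total_cases", "total_vaccinations"])
print(share)
```

Comparing these fractions across countries shows how much of each trend rests on filled or dropped values.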
