# COVID-19 Data Analysis for Kenya, India, and USA

This notebook analyzes COVID-19 data for Kenya, India, and the United States, focusing on total cases, daily new cases, vaccination progress, and a crude death rate. It closes with a global choropleth map of total cases.

**Data Source:** OWID (Our World in Data) COVID-19 dataset.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

## 1. Load and Inspect Data

Load the dataset and perform an initial inspection to understand its structure and contents.

In [None]:
# Load the data
df = pd.read_csv("owid-covid-data.csv")

# Preview the data columns
print("Columns in dataset:", df.columns.tolist()) # .tolist() for better readability

# Display the first few rows
print("\nFirst few rows of the dataset:")
df.head() # In Jupyter, df.head() will display a nice HTML table

## 2. Data Preprocessing

Filter the data for the countries of interest, convert data types, and handle missing values.
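
The drop-then-fill strategy can be tried on a toy frame first. The sketch below is illustrative only: the values are made up, and it fills only numeric columns, which is a narrower choice than filling every column with 0.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the OWID dataset.
toy = pd.DataFrame({
    "location": ["Kenya", "Kenya", "Kenya", "India"],
    "date": ["2021-03-01", "2021-03-02", "2021-03-03", "2021-03-01"],
    "total_cases": [100.0, 110.0, np.nan, 500.0],
    "new_cases": [np.nan, 10.0, 5.0, np.nan],
})
toy["date"] = pd.to_datetime(toy["date"])

# Drop rows only where a critical column is missing.
kept = toy.dropna(subset=["total_cases"]).copy()

# Fill the remaining numeric gaps with 0, leaving non-numeric columns untouched.
numeric_cols = kept.select_dtypes(include="number").columns
kept[numeric_cols] = kept[numeric_cols].fillna(0)

rows_dropped = len(toy) - len(kept)
print(kept)
print("Rows dropped:", rows_dropped)
```

Comparing `rows_dropped` against the original row count is a quick way to see how much data a given choice of critical columns discards.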

In [None]:
# Filter for specific countries
countries_of_interest = ['Kenya', 'India', 'United States']
df_filtered = df[df['location'].isin(countries_of_interest)].copy() # Use .copy() to avoid SettingWithCopyWarning

# Convert date column to datetime
df_filtered['date'] = pd.to_datetime(df_filtered['date'])

# Display data types after conversion
print("Data types after date conversion:")
df_filtered.info()

# Drop rows missing any of the cumulative columns plotted below.
# Note: total_vaccinations is typically empty before vaccine rollouts began,
# so this also removes early-pandemic rows from the case plots.
critical_columns = ['total_cases', 'total_deaths', 'total_vaccinations']
df_processed = df_filtered.dropna(subset=critical_columns).copy()

# Fill the remaining missing values with 0 (e.g., new_cases can still contain NaNs).
# This is a blunt default: 0 reads as "none reported" for counts but is not meaningful for every column.
df_processed.fillna(0, inplace=True)

print(f"\nShape of processed data: {df_processed.shape}")
print("\nMissing values after processing:")
print(df_processed.isnull().sum())

print("\nFirst few rows of processed data:")
df_processed.head()

## 3. Visualizations

Create visualizations to explore trends in COVID-19 cases and vaccinations.

### 3.1. Total Cases Over Time

In [None]:
plt.figure(figsize=(12, 6))
for country in countries_of_interest:
    country_data = df_processed[df_processed['location'] == country]
    plt.plot(country_data['date'], country_data['total_cases'], label=country)
plt.title('Total COVID-19 Cases Over Time')
plt.xlabel('Date')
plt.ylabel('Total Cases (Log Scale)')
plt.yscale('log')  # Totals differ by orders of magnitude across countries; a log axis keeps all three readable
plt.legend()
plt.grid(True, which="both", ls="-")  # Draw both major and minor grid lines on the log axis
plt.tight_layout()
# plt.savefig('total_cases_over_time.png') # Keep if you also want to save the file
plt.show() # Ensures the plot is displayed in the notebook

### 3.2. Daily New Cases
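
A 7-day rolling average is used to smooth the daily counts. As a minimal sketch on made-up data, the same smoothing can be computed for all countries at once with `groupby(...).transform(...)`, which is an alternative to looping over countries. The frame and its values here are hypothetical.

```python
import pandas as pd

# Hypothetical daily counts for two locations, ten consecutive days each.
dates = pd.date_range("2021-01-01", periods=10, freq="D")
toy = pd.DataFrame({
    "location": ["A"] * 10 + ["B"] * 10,
    "date": list(dates) * 2,
    "new_cases": list(range(10)) + [7] * 10,
})

# Sort first so each window covers consecutive days, then smooth within each location.
toy = toy.sort_values(["location", "date"])
toy["new_cases_smoothed"] = (
    toy.groupby("location")["new_cases"]
       .transform(lambda s: s.rolling(window=7).mean())
)

smoothed_a = toy.loc[toy["location"] == "A", "new_cases_smoothed"]
smoothed_b = toy.loc[toy["location"] == "B", "new_cases_smoothed"]
print(smoothed_a.tolist())
```

The first six values per location are NaN because a full 7-row window is not yet available. Note that `rolling(window=7)` counts rows, not calendar days, so gaps in the dates widen the effective window.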

In [None]:
plt.figure(figsize=(12, 6))
for country in countries_of_interest:
    # Sort by date and take a copy so the new column is added to an independent frame
    # (assigning to a filtered slice raises SettingWithCopyWarning)
    country_data = df_processed[df_processed['location'] == country].sort_values('date').copy()
    # 7-row rolling mean to smooth day-to-day reporting noise
    country_data['new_cases_smoothed'] = country_data['new_cases'].rolling(window=7).mean()
    plt.plot(country_data['date'], country_data['new_cases_smoothed'], label=f'{country} (7-day avg)')
    # plt.plot(country_data['date'], country_data['new_cases'], label=country, alpha=0.5) # Optional: plot raw data too
plt.title('Daily New COVID-19 Cases (7-Day Rolling Average)')
plt.xlabel('Date')
plt.ylabel('New Cases')
plt.legend()
plt.grid(True)
plt.tight_layout()
# plt.savefig('daily_new_cases.png')
plt.show()

### 3.3. Vaccination Progress

In [None]:
plt.figure(figsize=(12, 6))
for country in countries_of_interest:
    country_data = df_processed[df_processed['location'] == country]
    plt.plot(country_data['date'], country_data['total_vaccinations'], label=country)
plt.title('Vaccination Progress Over Time')
plt.xlabel('Date')
plt.ylabel('Total Vaccinations (Log Scale)')
plt.yscale('log')  # Log axis so smaller totals remain visible next to larger ones
plt.legend()
plt.grid(True, which="both", ls="-")
plt.tight_layout()
# plt.savefig('vaccination_progress.png')
plt.show()

## 4. Further Analysis: Death Rate and Choropleth Map

### 4.1. Calculate Death Rate

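Calculate the death rate as cumulative deaths divided by cumulative confirmed cases. This is a crude case fatality rate (CFR): it depends on how widely each country tests and on the lag between diagnosis and death, so comparisons between countries are only indicative.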
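
As a worked example of the ratio, the sketch below uses made-up cumulative counts. It treats a zero-case row as undefined (NaN) rather than 0, which is one possible convention and an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Made-up cumulative counts; the first row has zero cases.
toy = pd.DataFrame({
    "total_cases": [0.0, 200.0, 1000.0],
    "total_deaths": [0.0, 4.0, 15.0],
})

# Replace zero denominators with NaN so the division yields NaN instead of inf or 0/0.
safe_cases = toy["total_cases"].replace(0, np.nan)
toy["death_rate"] = toy["total_deaths"] / safe_cases
toy["death_rate_pct"] = toy["death_rate"] * 100

print(toy)
```

With these numbers, 4 deaths out of 200 cases gives a rate of 0.02 (2%), and 15 out of 1000 gives 0.015 (1.5%).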

In [None]:
# Leave death_rate at 0.0 where total_cases is 0 to avoid division by zero
df_processed['death_rate'] = 0.0
has_cases = df_processed['total_cases'] > 0
df_processed.loc[has_cases, 'death_rate'] = (
    df_processed.loc[has_cases, 'total_deaths'] / df_processed.loc[has_cases, 'total_cases']
)

# Display death rate for the latest date for each country
latest_death_rates = df_processed.sort_values('date').groupby('location').last()[['date', 'death_rate']]
print("Latest calculated death rates:")
latest_death_rates

### 4.2. Choropleth Map of Total Cases

Visualize the latest total cases by country using a choropleth map.
The map uses the original, unfiltered `df` to show global context. `df_processed` is not suitable here: it is restricted to three countries, and its missing values have been dropped or zero-filled.
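
Because national totals span several orders of magnitude, a linear color scale tends to be dominated by a few very large values. One option, sketched here under assumptions, is to color by a log10-transformed column. The `log10_total_cases` column and the `make_log_map` helper are hypothetical names introduced for illustration, and the totals are made up.

```python
import numpy as np
import pandas as pd

# Made-up totals for illustration; iso_code values are ISO 3166-1 alpha-3 codes.
latest = pd.DataFrame({
    "iso_code": ["KEN", "IND", "USA"],
    "location": ["Kenya", "India", "United States"],
    "total_cases": [0.0, 1_000.0, 1_000_000.0],
})

# log10 is undefined at zero, so mask non-positive totals before transforming.
positive = latest["total_cases"].where(latest["total_cases"] > 0)
latest["log10_total_cases"] = np.log10(positive)


def make_log_map(frame):
    """Build a choropleth colored by the log10 column (hypothetical helper; needs plotly)."""
    import plotly.express as px  # imported lazily so the transform runs without plotly

    return px.choropleth(
        frame,
        locations="iso_code",
        color="log10_total_cases",
        hover_name="location",
        hover_data={"total_cases": ":,"},
        color_continuous_scale="Reds",
        title="Latest Total COVID-19 Cases by Country (log10 color)",
    )


print(latest)
```

Calling `make_log_map(latest).show()` would render the map. Rows with a NaN color value are simply left uncolored, and the hover text still shows the untransformed total.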

In [None]:
# Use the original dataframe 'df' for a global map perspective
# Convert date to datetime if not already done on original df for sorting
df_map_data = df.copy() # Work on a copy
df_map_data['date'] = pd.to_datetime(df_map_data['date'])

# Get the latest reported value of each column per location (groupby(...).last() skips NaNs column by column)
latest_df_global = df_map_data.sort_values("date").groupby("location").last().reset_index()

# Drop aggregate rows (world, continents, income groups). In the OWID dataset these carry
# iso_code values prefixed with "OWID_"; rows with no iso_code at all are dropped too.
iso = latest_df_global['iso_code']
latest_df_global = latest_df_global[iso.notna() & ~iso.astype(str).str.startswith('OWID_')].copy()
# Ensure total_cases is numeric and fill NaNs with 0 for plotting
latest_df_global['total_cases'] = pd.to_numeric(latest_df_global['total_cases'], errors='coerce').fillna(0)


fig = px.choropleth(
    latest_df_global,
    locations="iso_code",
    color="total_cases",
    hover_name="location",
    hover_data={'iso_code': False, 'total_cases': ':,', 'total_deaths': ':,'}, # Custom hover data
    color_continuous_scale=px.colors.sequential.Reds,
    title="Latest Total COVID-19 Cases by Country",
    # The color scale here is linear, so a few very large totals dominate the map.
    # For a log-style scale, color by a log10-transformed column instead.
)
fig.update_layout(coloraxis_colorbar=dict(title="Total Cases"))
# fig.write_html("choropleth_map_global.html") # Keep if you want to save the file
fig.show()

## 5. Conclusion

In [None]:
print("Analysis completed. Visuals are displayed inline above.")
print("If `savefig` or `write_html` lines were uncommented, files are also saved.")