<a href="https://colab.research.google.com/github/olumideadekunle/Data-Sharing-among-Business/blob/main/Copy_of_Question_Mini_Project_Module_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COVID-19 Data Analysis: Exploring Trends, Vaccination Impact, and Insights Through Visualizations

The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has significantly impacted global health, economies, and daily life. Since its emergence in late 2019, vast amounts of data have been collected on infection rates, mortality, vaccination efforts, and testing strategies. Analyzing this data is crucial for understanding the spread of the virus, identifying trends, and making informed policy decisions. This project involves basic data cleaning, exploratory data analysis (EDA), and visualization to uncover insights into COVID-19 cases, deaths, and vaccinations over time.

## About the Dataset

The **[dataset](https://drive.google.com/file/d/1syeD6Ni_ZlfXHHH0Jp6ALpvg9iCAEk_-/view?usp=sharing)** used for this project contains essential COVID-19 metrics, tracking the progression of the pandemic across different countries. The dataset consists of the following columns:

- DATE: The recorded date of COVID-19 data entry.
- country: The country or region where the cases, deaths, and vaccinations were reported.
- NEW Cases: The number of newly confirmed COVID-19 cases reported on a given date.
- NEW_DEATHS: The number of new deaths attributed to COVID-19 on that specific date.
- vaccinated: The number of people who have received at least one dose of the COVID-19 vaccine.

This dataset may require cleaning and preprocessing to handle missing values, incorrect formats, and inconsistencies before conducting meaningful analysis.

## Objective

The primary goal of this mini-project is to clean, analyze, and visualize COVID-19 data to identify trends, patterns, and key insights. The specific objectives include:

- Perform Data Cleaning: Handle missing values, standardize date formats, and filter out inconsistencies.
- Explore Trends in COVID-19 Cases and Deaths:
  - Analyze daily and cumulative trends in infections and fatalities.
  - Compare case and death rates across different countries and regions.
- Create Data Visualizations:
  - Line Plots: Show the trend of cases, deaths, and vaccinations over time.
  - Bar Charts: Compare cases, deaths, and vaccinations by country.
  - Scatter Plots: Explore relationships between infection rates and testing or vaccination rates.

In [25]:

!pip install pandas matplotlib seaborn numpy


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns




In [26]:

from google.colab import files
uploaded = files.upload()


filename = list(uploaded.keys())[0]


df = pd.read_csv(filename)


df.head()

Saving Dataset module 4 covid.csv to Dataset module 4 covid (4).csv


Unnamed: 0,DATE,country,NEW Cases,NEW_DEATHS,vaccinated
0,11/1/2024,Argentina,,,unknown
1,11/1/2024,Australia,0.0,0.0,327
2,11/1/2024,Australia,0.0,0.0,327
3,11/1/2024,Brazil,971.0,48.0,430
4,11/1/2024,Canada,176.0,8.0,unknown


In [27]:

print("Missing Values:\n", df.isnull().sum())

df.fillna(0, inplace=True)

df["2020"] = pd.to_datetime(df["2020"], errors="coerce")


df.drop_duplicates(inplace=True)
n
df["vaccinated"] = pd.to_numeric(df["vaccinated"], errors="coerce")
df["vaccinated"].fillna(0, inplace=True)


df.info()


Missing Values:
 DATE            0
country         0
 NEW Cases     11
NEW_DEATHS     26
vaccinated      0
dtype: int64


KeyError: '2020'

In [29]:
# Assuming 'DATE' is the column with date information
df["####"] = pd.to_datetime(df["####"], errors="coerce")

KeyError: '####'

In [21]:

print("Missing Values:\n", df.isnull().sum())


df.fillna(0, inplace=True)



df["Date"] = pd.to_datetime(df["Date"], errors="coerce")


df.drop_duplicates(inplace=True)


df["vaccinated"] = pd.to_numeric(df["vaccinated"], errors="coerce")
df["vaccinated"].fillna(0, inplace=True)


df.info()

Missing Values:
 DATE            0
country         0
 NEW Cases     11
NEW_DEATHS     26
vaccinated      0
dtype: int64


KeyError: 'Date'

In [11]:

df = pd.read_csv("C:/Users/SAIL/OneDrive/Desktop/3MTT/Dataset module 4 covid.csv")


print(df.columns)


df['DATE'] = pd.to_datetime(df['DATE'], errors="coerce")


FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/SAIL/OneDrive/Desktop/3MTT/Dataset module 4 covid.csv'

In [6]:
import pandas as pd
from google.colab import files


uploaded = files.upload()


filename = list(uploaded.keys())[0]


df = pd.read_csv(filename)


Saving Dataset module 4 covid.csv to Dataset module 4 covid (1).csv


In [22]:

print("Missing Values:\n", df.isnull().sum())


df.fillna(0, inplace=True)


df["DATE"] = pd.to_datetime(df["DATE"], errors="coerce")


df.drop_duplicates(inplace=True)


df["vaccinated"] = pd.to_numeric(df["vaccinated"], errors="coerce")
df["vaccinated"].fillna(0, inplace=True)

df.info()


Missing Values:
 DATE           0
country        0
 NEW Cases     0
NEW_DEATHS     0
vaccinated     0
dtype: int64


KeyError: 'DATE'

In [31]:

print("Missing Values:\n", df.isnull().sum())


df.fillna(0, inplace=True)
df["2025-04-16"] = pd.to_datetime(df["2025-04-16"], errors="coerce")


df.drop_duplicates(inplace=True)


df["vaccinated"] = pd.to_numeric(df["vaccinated"], errors="coerce")
df["vaccinated"].fillna(0, inplace=True)


df.info()

Missing Values:
 DATE           0
country        0
 NEW Cases     0
NEW_DEATHS     0
vaccinated     0
dtype: int64


KeyError: '2025-04-16'

In [16]:

print("\nSummary Statistics:\n", df.describe())


top_countries = df.groupby("country")["NEW Cases"].sum().sort_values(ascending=False).head(5)
print("\nTop 5 Countries by Cases:\n", top_countries)



Summary Statistics:
          NEW Cases  NEW_DEATHS
count   165.000000  165.000000
mean    540.430303   23.987879
std     418.124708   21.605100
min       0.000000    0.000000
25%     152.000000    1.000000
50%     560.000000   21.000000
75%     793.000000   37.000000
max    1730.000000   86.000000


KeyError: 'country'

In [18]:
print("\nKey Insights:")
print("- Countries with high cases should prioritize vaccination and testing.")
print("- Analyzing spikes in deaths can help improve healthcare strategies.")
print("- Visualization trends help policymakers make informed decisions.")



Key Insights:
- Countries with high cases should prioritize vaccination and testing.
- Analyzing spikes in deaths can help improve healthcare strategies.
- Visualization trends help policymakers make informed decisions.


In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load Dataset
df = pd.read_csv("Dataset module 4 covid.csv")

# Standardize Date Format
df["DATE"] = pd.to_datetime(df["DATE"], errors="coerce")

# Handle missing values
df.replace("unknown", np.nan, inplace=True)  # Convert "unknown" to NaN
df.fillna(0, inplace=True)  # Fill missing values with zero (adjust as needed)

# Remove duplicates
df.drop_duplicates(inplace=True)

# Convert columns to proper numeric types
df["NEW Cases"] = pd.to_numeric(df["NEW Cases"], errors="coerce")
df["NEW_DEATHS"] = pd.to_numeric(df["NEW_DEATHS"], errors="coerce")
df["vaccinated"] = pd.to_numeric(df["vaccinated"], errors="coerce")

# Summary Statistics
print("Data Overview:\n", df.describe())

# Visualizations
plt.figure(figsize=(12,6))
sns.lineplot(data=df, x="DATE", y="NEW Cases", hue="country", marker="o")
plt.title("COVID-19 Cases Trend Over Time")
plt.xlabel("Date")
plt.ylabel("New Cases")
plt.legend(title="Country")
plt.show()


KeyError: 'DATE'

In [36]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load Dataset
df = pd.read_csv("Dataset module 4 covid.csv")

# Standardize Date Format
df["DATE"] = pd.to_datetime(df["DATE"], errors="coerce")

# Handle missing values
df.replace("unknown", np.nan, inplace=True)  # Convert "unknown" to NaN
df.fillna(0, inplace=True)  # Fill missing values with zero (adjust as needed)

# Remove duplicates
df.drop_duplicates(inplace=True)

# Convert columns to proper numeric types
df["NEW Cases"] = pd.to_numeric(df["NEW Cases"], errors="coerce")
df["NEW_DEATHS"] = pd.to_numeric(df["NEW_DEATHS"], errors="coerce")
df["vaccinated"] = pd.to_numeric(df["vaccinated"], errors="coerce")

# Summary Statistics
print("Data Overview:\n", df.describe())

# Visualizations
plt.figure(figsize=(12,6))
sns.lineplot(data=df, x="DATE", y="NEW Cases", hue="country", marker="o")
plt.title("COVID-19 Cases Trend Over Time")
plt.xlabel("Date")
plt.ylabel("New Cases")
plt.legend(title="Country")
plt.show()


KeyError: 'DATE'