# 01. Data Cleaning and Inspection

This notebook loads the six COVID-19 CSV datasets from the archive directory, inspects them for quality issues (missing values, duplicates), and standardizes their date columns so they are ready for analysis.

In [1]:
import pandas as pd
import os

data_dir = "../archive"

## 1. Load Datasets

In [5]:
files = {
    "country_wise": "country_wise_latest.csv",
    "covid_clean": "covid_19_clean_complete.csv",
    "day_wise": "day_wise.csv",
    "full_grouped": "full_grouped.csv",
    "usa_county": "usa_county_wise.csv",
    "worldometer": "worldometer_data.csv"
}

dfs = {}
for name, filename in files.items():
    print(f"Loading {filename}...")
    dfs[name] = pd.read_csv(os.path.join(data_dir, filename))
    print(f"  Shape: {dfs[name].shape}")

Loading country_wise_latest.csv...
  Shape: (187, 15)
Loading covid_19_clean_complete.csv...
  Shape: (49068, 10)
Loading day_wise.csv...
  Shape: (188, 12)
Loading full_grouped.csv...
  Shape: (35156, 10)
Loading usa_county_wise.csv...
  Shape: (627920, 14)
Loading worldometer_data.csv...
  Shape: (209, 16)
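
The introduction lists duplicates among the quality issues to inspect. A minimal sketch of such a check follows, assuming a dictionary of DataFrames shaped like `dfs`. The helper name `count_duplicate_rows` and the toy frames are hypothetical and stand in for the loaded data; only fully identical rows are counted.

```python
import pandas as pd

def count_duplicate_rows(frames):
    """Return {name: number of fully duplicated rows} for a dict of DataFrames."""
    # DataFrame.duplicated() marks every repeat of an earlier identical row.
    return {name: int(df.duplicated().sum()) for name, df in frames.items()}

# Toy frames standing in for the loaded `dfs` dictionary (illustrative data only).
toy_frames = {
    "a": pd.DataFrame({"Country": ["X", "X", "Y"], "Confirmed": [1, 1, 2]}),
    "b": pd.DataFrame({"Country": ["X", "Y"], "Confirmed": [1, 2]}),
}

duplicate_counts = count_duplicate_rows(toy_frames)
print(duplicate_counts)
```

In the notebook itself the call would be `count_duplicate_rows(dfs)`. Passing `subset=[...]` to `duplicated()` would restrict the comparison to key columns, but which columns form a key in each file is not established here.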


## 2. Inspect for Missing Values

In [3]:
for name, df in dfs.items():
    print(f"\n--- {name} Missing Values ---")
    missing = df.isnull().sum()
    print(missing[missing > 0])


--- country_wise Missing Values ---
Series([], dtype: int64)

--- covid_clean Missing Values ---
Province/State    34404
dtype: int64

--- day_wise Missing Values ---
Series([], dtype: int64)

--- full_grouped Missing Values ---
Series([], dtype: int64)

--- usa_county Missing Values ---
FIPS      1880
Admin2    1128
dtype: int64

--- worldometer Missing Values ---
Continent             1
Population            1
NewCases            205
TotalDeaths          21
NewDeaths           206
TotalRecovered        4
NewRecovered        206
ActiveCases           4
Serious,Critical     87
Tot Cases/1M pop      1
Deaths/1M pop        22
TotalTests           18
Tests/1M pop         18
WHO Region           25
dtype: int64
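
Raw counts are hard to compare across datasets of very different sizes: the 34404 missing `Province/State` values are about 70% of the 49068 rows in `covid_clean`, while the 205 missing `NewCases` values are about 98% of the 209 rows in `worldometer`. The sketch below reports the share alongside the count. The helper name `missing_summary` and the toy frame are hypothetical, used only for illustration.

```python
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only affected columns, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for one of the loaded datasets (illustrative data only).
toy = pd.DataFrame({
    "Country": ["X", "Y", "Z", "W"],
    "NewCases": [None, None, None, 5.0],
    "TotalDeaths": [1.0, None, 3.0, 4.0],
})

summary = missing_summary(toy)
print(summary)
```

Columns that are almost entirely empty are usually candidates for dropping rather than imputing, although that decision depends on the analysis that follows.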


## 3. Date Standardization
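Convert the `Date` column to datetime in every dataset that has one (`covid_clean`, `day_wise`, `full_grouped`, and `usa_county`), then print the resulting date range as a sanity check.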

In [4]:
for name, df in dfs.items():
    if 'Date' in df.columns:
        print(f"Converting Date column in {name}...")
        df['Date'] = pd.to_datetime(df['Date'])
        print(f"  Date range: {df['Date'].min()} to {df['Date'].max()}")

Converting Date column in covid_clean...
  Date range: 2020-01-22 00:00:00 to 2020-07-27 00:00:00
Converting Date column in day_wise...
  Date range: 2020-01-22 00:00:00 to 2020-07-27 00:00:00
Converting Date column in full_grouped...
  Date range: 2020-01-22 00:00:00 to 2020-07-27 00:00:00
Converting Date column in usa_county...
  Date range: 2020-01-22 00:00:00 to 2020-07-27 00:00:00
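
When a `Date` column is stored as text in a non-ISO layout, `pd.to_datetime` has to infer the format. Passing the format explicitly makes the conversion predictable, and `errors="coerce"` turns unparseable entries into `NaT` so they can be counted. The sketch below assumes a `%m/%d/%y` layout purely for illustration; the actual layout in each file has not been verified here, and the helper name `parse_dates` is hypothetical.

```python
import pandas as pd

def parse_dates(series, fmt):
    """Parse strings with an explicit format; unparseable entries become NaT."""
    parsed = pd.to_datetime(series, format=fmt, errors="coerce")
    # Count only values that were present but failed to parse.
    n_failed = int(parsed.isna().sum() - series.isna().sum())
    return parsed, n_failed

# Illustrative strings; the "%m/%d/%y" format is an assumption, not taken from the files.
raw = pd.Series(["1/22/20", "7/27/20", "not a date"])
parsed, n_failed = parse_dates(raw, "%m/%d/%y")

print(parsed.min(), parsed.max(), n_failed)
```

A non-zero failure count is a signal to inspect the offending rows before relying on the reported date range.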


