**Data Cleaning for a Real-world Dataset**

In [24]:
import pandas as pd
df = pd.read_csv('cleaned_worldometer_data.csv', quotechar='"', engine='python')

# Preview data
print(df.head())
print(df.columns)


  Country/Region      Continent   Population  TotalCases  NewCases  \
0         Mexico  North America  129066160.0      462690    6590.0   
1        Bolivia  South America   11688459.0       86423    1282.0   
2       S. Korea           Asia   51273732.0       14519      20.0   

   TotalDeaths  NewDeaths  TotalRecovered  NewRecovered  ActiveCases  \
0      50517.0      819.0        308848.0        4140.0     103325.0   
1       3465.0       80.0         27373.0         936.0      55585.0   
2        303.0        1.0         13543.0          42.0        673.0   

   Serious,Critical  Tot Cases/1M pop  Deaths/1M pop  TotalTests  \
0            3987.0            3585.0          391.0   1056915.0   
1              71.0            7394.0          296.0    183583.0   
2              18.0             283.0            6.0   1613652.0   

   Tests/1M pop      WHO Region  
0        8189.0        Americas  
1       15706.0        Americas  
2       31471.0  WesternPacific  
Index(['Country/Regio

**Explore and Understand the Data**

In [14]:
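df.info()               # dtypes and non-null counts (prints directly, returns None)
print(df.describe())    # summary statistics for the numeric columns
print(df.columns)       # column labels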

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Country/Region    209 non-null    object 
 1   Continent         208 non-null    object 
 2   Population        208 non-null    float64
 3   TotalCases        209 non-null    int64  
 4   NewCases          4 non-null      float64
 5   TotalDeaths       188 non-null    float64
 6   NewDeaths         3 non-null      float64
 7   TotalRecovered    205 non-null    float64
 8   NewRecovered      3 non-null      float64
 9   ActiveCases       205 non-null    float64
 10  Serious,Critical  122 non-null    float64
 11  Tot Cases/1M pop  208 non-null    float64
 12  Deaths/1M pop     187 non-null    float64
 13  TotalTests        191 non-null    float64
 14  Tests/1M pop      191 non-null    float64
 15  WHO Region        184 non-null    object 
dtypes: float64(12), int64(1), object(3)
memory u

**Handling the Missing Values**

In [15]:
print(df.isnull().sum())

Country/Region        0
Continent             1
Population            1
TotalCases            0
NewCases            205
TotalDeaths          21
NewDeaths           206
TotalRecovered        4
NewRecovered        206
ActiveCases           4
Serious,Critical     87
Tot Cases/1M pop      1
Deaths/1M pop        22
TotalTests           18
Tests/1M pop         18
WHO Region           25
dtype: int64
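Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative assumptions, not the real data) that reports both the count and the percentage of missing values per column, most incomplete first.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Country/Region": ["A", "B", "C", "D"],
    "NewCases": [10.0, np.nan, np.nan, np.nan],
    "TotalDeaths": [5.0, 7.0, np.nan, 2.0],
})

# Count and percentage of missing values per column, most incomplete first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)

print(missing)
```

Applied to `df`, the same two expressions would show at a glance which columns are almost entirely empty.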


**Dropping Rows with Missing Values**

In [18]:
# Drop every row that has at least one missing value in any column
df = df.dropna()
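`dropna()` with no arguments removes every row that has at least one missing value. With a column such as `NewCases` missing in 205 of 209 rows, very few rows can survive. One possible alternative, sketched here on a toy frame, is to drop the mostly empty columns first and then fill the remaining gaps. The threshold `MAX_MISSING`, the `"Unknown"` placeholder, and the choice of the median are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Country/Region": ["A", "B", "C", "D"],
    "WHO Region": ["Europe", None, "Europe", "Africa"],
    "NewCases": [10.0, np.nan, np.nan, np.nan],
    "TotalDeaths": [5.0, 7.0, np.nan, 3.0],
})

MAX_MISSING = 0.5  # hypothetical cut-off: drop columns missing in more than 50% of rows

# 1. Keep only the columns whose share of missing values is within the cut-off.
keep = toy.columns[toy.isnull().mean() <= MAX_MISSING]
imputed = toy[keep].copy()

# 2. Fill numeric gaps with each column's median.
num_cols = imputed.select_dtypes(include="number").columns
imputed[num_cols] = imputed[num_cols].fillna(imputed[num_cols].median())

# 3. Fill gaps in the remaining (text) columns with a placeholder label.
text_cols = imputed.select_dtypes(exclude="number").columns
imputed[text_cols] = imputed[text_cols].fillna("Unknown")

print(imputed)
```

All four toy rows are kept, whereas `toy.dropna()` would keep only the first one.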

**Removing the Duplicates**

In [19]:
df = df.drop_duplicates()
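`drop_duplicates()` with no arguments only removes rows that are identical in every column. A small sketch, using made-up rows, of counting duplicates before dropping them; the check on `Country/Region` alone assumes the data should hold one row per country, which the notebook does not state.

```python
import pandas as pd

# Made-up rows: one exact duplicate (Mexico) and one repeated country
# with different numbers (Bolivia).
toy = pd.DataFrame({
    "Country/Region": ["Mexico", "Mexico", "Bolivia", "Bolivia"],
    "TotalCases": [100, 100, 50, 60],
})

# Rows that repeat an earlier row in every column.
n_exact = int(toy.duplicated().sum())

# Rows that repeat an earlier country name, whatever the other values are.
n_by_country = int(toy.duplicated(subset=["Country/Region"]).sum())

deduped = toy.drop_duplicates()

print(n_exact, n_by_country, len(deduped))
```

The second Bolivia row survives `drop_duplicates()` because its `TotalCases` differs, so a key-based check can reveal conflicts that the default call leaves in place.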


**Fix Inconsistencies in Categorical Data**

In [21]:
df['Country/Region'] = df['Country/Region'].str.strip().str.title()
print(df['Country/Region'].unique())


['Mexico' 'Bolivia' 'S. Korea']
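`str.title()` capitalizes the first letter of each word and lowercases the rest, which is worth checking against names that are abbreviations. A short sketch with hypothetical values (not taken from the dataset) showing the effect:

```python
import pandas as pd

# Hypothetical names, including stray whitespace and all-caps abbreviations.
names = pd.Series(["  mexico ", "S. Korea", "USA", "UK"])

# Same chain as in the cell above: trim whitespace, then title-case.
titled = names.str.strip().str.title()

# Trimming only, which leaves the original capitalization untouched.
stripped = names.str.strip()

print(titled.tolist())
print(stripped.tolist())
```

If the column contains abbreviations, stripping whitespace alone may be the safer normalization.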


**Convert Data Types**

In [22]:
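# Count columns that may contain thousands separators stored as text
cols_to_clean = ['TotalCases', 'TotalDeaths', 'TotalRecovered', 'ActiveCases', 'Population']
for col in cols_to_clean:
    # Remove commas, turn the literal string 'nan' into '0', then convert to a numeric dtype
    df[col] = df[col].astype(str).str.replace(',', '', regex=False).replace('nan', '0')
    df[col] = pd.to_numeric(df[col], errors='coerce')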
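Replacing `'nan'` with `'0'` makes a missing count indistinguishable from a true zero. A hedged variant, in which the helper name `to_number` and the sample strings are hypothetical, strips the separators and lets unparseable entries stay `NaN`:

```python
import pandas as pd

def to_number(series: pd.Series) -> pd.Series:
    """Strip thousands separators and convert to a numeric dtype.

    Entries that cannot be parsed become NaN instead of 0.
    """
    cleaned = series.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up inputs: a formatted number, a plain number, a missing value, a text marker.
raw = pd.Series(["1,056,915", "183583", None, "n/a"])
converted = to_number(raw)

print(converted)
```

Whether missing counts should later be filled with zero is then an explicit, separate decision.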



**Save the Cleaned Data**

In [23]:
df.to_csv('cleaned_worldometer_data.csv', index=False)
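One column label, `Serious,Critical`, contains the CSV delimiter. The sketch below writes a toy frame to an in-memory buffer instead of a file (an assumption made to keep the example self-contained) to show that `to_csv` quotes such a label and `read_csv` restores it unchanged.

```python
import io

import pandas as pd

# Toy frame with a column label that contains a comma.
toy = pd.DataFrame({"Country/Region": ["Mexico"], "Serious,Critical": [3987.0]})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# The label containing the delimiter is wrapped in double quotes.
header_line = buffer.getvalue().splitlines()[0]
print(header_line)

# Reading the text back yields the original labels.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))
```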


**Rename Problematic Column**

In [26]:
df.rename(columns={'"Serious,Critical"': 'Serious_Critical'}, inplace=True)
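`rename` silently ignores keys that are not existing column labels. The null-count output further down still lists the column as `Serious,Critical`, which suggests the key with embedded double quotes did not match. A small sketch on a toy frame of the difference, plus the `errors="raise"` option that makes a non-matching key fail loudly:

```python
import pandas as pd

toy = pd.DataFrame({"Serious,Critical": [3987.0, 71.0]})

# Key with embedded double quotes: no such label exists, so nothing changes.
unchanged = toy.rename(columns={'"Serious,Critical"': "Serious_Critical"})

# Key that matches the label exactly: the column is renamed.
renamed = toy.rename(columns={"Serious,Critical": "Serious_Critical"})

# With errors="raise", a key that matches no column raises KeyError.
raised = False
try:
    toy.rename(columns={'"Serious,Critical"': "Serious_Critical"}, errors="raise")
except KeyError:
    raised = True

print(list(unchanged.columns), list(renamed.columns), raised)
```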


**Continue Cleaning**

In [28]:
print(df.isnull().sum())


Country/Region      0
Continent           0
Population          0
TotalCases          0
NewCases            0
TotalDeaths         0
NewDeaths           0
TotalRecovered      0
NewRecovered        0
ActiveCases         0
Serious,Critical    0
Tot Cases/1M pop    0
Deaths/1M pop       0
TotalTests          0
Tests/1M pop        0
WHO Region          0
dtype: int64


In [29]:
df.fillna(0, inplace=True)  


In [30]:
df['TotalCases'] = pd.to_numeric(df['TotalCases'], errors='coerce')


In [32]:
df.to_csv('final_cleaned_worldometer_data.csv', index=False)
