# Project 3: Data Cleaning Utility

**Dataset:** World Happiness Dataset

This project focuses on cleaning a real-world dataset by detecting and
handling missing values, fixing data types, removing duplicate records,
standardizing column names, and exporting a cleaned dataset along with
a cleaning log.


## Import Required Libraries

Pandas is used for data manipulation and cleaning operations.


In [1]:
import pandas as pd

## Load the Dataset

The World Happiness dataset is loaded into a Pandas DataFrame for further
data cleaning and analysis.


In [2]:
# Load the World Happiness dataset
df = pd.read_csv("happiness(project3).csv")

## Check for Missing Values

This step checks whether any column in the dataset contains
missing (null) values.


In [3]:
# Check if any column contains missing values
df.isnull().any()

Unnamed: 0,0
Country,False
Region,False
Happiness Rank,False
Happiness Score,False
Standard Error,False
Economy (GDP per Capita),False
Family,False
Health (Life Expectancy),False
Freedom,False
Trust (Government Corruption),False
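
`df.isnull().any()` only says whether a column has at least one gap. As a minimal sketch, assuming a small hypothetical sample rather than the real dataset, the count and percentage of missing values per column can be obtained as follows:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps, for illustration only
sample = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Happiness Score": [7.5, np.nan, 6.1],
    "Freedom": [0.6, 0.5, np.nan],
})

# Number of missing values per column
missing_counts = sample.isna().sum()

# Share of missing values per column, as a percentage
missing_pct = sample.isna().mean() * 100

print(missing_counts)
print(missing_pct.round(1))
```

Counts make it easier to decide whether a column should be filled, dropped, or left alone.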


## Display Rows with Missing Values in a Specific Column

This step filters and displays rows where the selected column
contains missing (NaN) values.


In [6]:
# Display rows where GDP per capita value is missing
df[df["Economy (GDP per Capita)"].isna()]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual


In [7]:
df.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Standard Error', 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual'],
      dtype='object')


In [37]:
# Fill missing values in GDP per capita with 0
df["Economy (GDP per Capita)"] = df["Economy (GDP per Capita)"].fillna(0)

# Display the column to verify the result
df["Economy (GDP per Capita)"].head()

Unnamed: 0,Economy (GDP per Capita)
0,1.39651
1,1.30232
2,1.32548
3,1.459
4,1.32629


In [38]:
df[["Country", "Economy (GDP per Capita)"]].head()

Unnamed: 0,Country,Economy (GDP per Capita)
0,Switzerland,1.39651
1,Iceland,1.30232
2,Denmark,1.32548
3,Norway,1.459
4,Canada,1.32629
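
Filling with 0 treats a missing measurement as a real zero, which lowers the column average. The sketch below compares it with a median fill on hypothetical values; the median fill is an alternative technique, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical values; one GDP entry is missing
gdp = pd.Series([1.4, 1.3, np.nan, 0.6], name="Economy (GDP per Capita)")

filled_zero = gdp.fillna(0)               # the approach used in the notebook
filled_median = gdp.fillna(gdp.median())  # alternative: fill with the column median

print(filled_zero.mean())
print(filled_median.mean())
```

Which fill is appropriate depends on what a missing value means in the source data.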


## Fill Missing Values in Happiness Score

Missing values in the Happiness Score column are replaced with 0.
The updated values are displayed to verify the change.


In [39]:
# Fill missing values in happiness score with 0
df["Happiness Score"] = df["Happiness Score"].fillna(0)

# Display the output to verify
df["Happiness Score"].head()

Unnamed: 0,Happiness Score
0,7.587
1,7.561
2,7.527
3,7.522
4,7.427


In [40]:
df[["Country", "Happiness Score"]].head()

Unnamed: 0,Country,Happiness Score
0,Switzerland,7.587
1,Iceland,7.561
2,Denmark,7.527
3,Norway,7.522
4,Canada,7.427


## Filter Rows for a Specific Country

This step filters and displays all records related to a specific country
from the World Happiness dataset.


In [15]:
# Display records for a specific country (example: India)
df[df["Country"] == "India"]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
116,India,Southern Asia,117,4.565,0.02043,0.64499,0.38174,0.51529,0.39786,0.08492,0.26475,2.27513


In [16]:
df.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Standard Error', 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual'],
      dtype='object')

## Check for Duplicate Rows

This step identifies duplicate rows in the dataset.
`df.duplicated()` returns True for each row that repeats an earlier row and
False otherwise, so the first occurrence of a repeated row is not flagged.


In [17]:
# Check for duplicate rows in the dataset
df.duplicated()

Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
...,...
153,False
154,False
155,False
156,False


## Display Duplicate Records Based on Country Name

This step displays rows whose country name has already appeared earlier
in the dataset, helping to identify duplicate entries. An empty result
means no country name is repeated.


In [21]:
# Display duplicate records based on country name
df[df["Country"].duplicated()]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
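
With a long dataset, a count is easier to read than a column of True/False values. The following sketch, built on a hypothetical sample in which one country is repeated, counts duplicate rows and uses `keep=False` to show every copy of a repeated country, including the first one.

```python
import pandas as pd

# Hypothetical sample in which "India" appears twice
sample = pd.DataFrame({
    "Country": ["India", "Norway", "India"],
    "Happiness Score": [4.565, 7.522, 4.565],
})

# Count fully duplicated rows (the first occurrence is not counted)
n_duplicate_rows = int(sample.duplicated().sum())

# Show every row whose Country occurs more than once, including the first
repeated = sample[sample.duplicated(subset="Country", keep=False)]

print(n_duplicate_rows)
print(repeated)
```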


## Filter Records for Selected Countries

This step filters and displays records only for specific countries
using the `isin()` method.


In [23]:
# Display records for selected countries
df[df["Country"].isin(["India", "United States", "United Kingdom", "Australia"])]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
9,Australia,Australia and New Zealand,10,7.284,0.04083,1.33358,1.30923,0.93156,0.65124,0.35637,0.43562,2.26646
14,United States,North America,15,7.119,0.03839,1.39451,1.24711,0.86179,0.54604,0.1589,0.40105,2.51011
20,United Kingdom,Western Europe,21,6.867,0.01866,1.26637,1.28548,0.90943,0.59625,0.32067,0.51912,1.96994
116,India,Southern Asia,117,4.565,0.02043,0.64499,0.38174,0.51529,0.39786,0.08492,0.26475,2.27513


In [25]:
df.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Standard Error', 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual'],
      dtype='object')

In [41]:
# Remove duplicate rows
df = df.drop_duplicates()

# Display dataset shape after removing duplicates
df.shape

(158, 13)
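
The project description mentions a cleaning log. One possible way to build it, sketched here with a hypothetical sample and a hypothetical log structure (a list of dictionaries), is to record the row count before and after each step:

```python
import pandas as pd

# Hypothetical sample with one repeated row
sample = pd.DataFrame({
    "Country": ["India", "Norway", "India"],
    "Happiness Score": [4.565, 7.522, 4.565],
})

cleaning_log = []  # hypothetical structure: one dict per cleaning step

rows_before = len(sample)
sample = sample.drop_duplicates().reset_index(drop=True)
cleaning_log.append({
    "step": "drop_duplicates",
    "rows_before": rows_before,
    "rows_after": len(sample),
})

print(pd.DataFrame(cleaning_log))
```

The same pattern can be repeated after every cleaning operation so the log shows which steps changed the data.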

In [42]:
# Display the first few rows after removing duplicates
# (only the last expression in a cell is displayed)
df.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,country_first_word
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,Switzerland
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,Iceland
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,Denmark
3,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,Norway
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,Canada


## Filter Records Using a List of Countries

This step filters the dataset to display records for a selected
list of countries using the `isin()` function.


In [28]:
# Filter records for selected countries
df[df["Country"].isin([
    "India",
    "United States",
    "United Kingdom",
    "Australia"
])]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
9,Australia,Australia and New Zealand,10,7.284,0.04083,1.33358,1.30923,0.93156,0.65124,0.35637,0.43562,2.26646
14,United States,North America,15,7.119,0.03839,1.39451,1.24711,0.86179,0.54604,0.1589,0.40105,2.51011
20,United Kingdom,Western Europe,21,6.867,0.01866,1.26637,1.28548,0.90943,0.59625,0.32067,0.51912,1.96994
116,India,Southern Asia,117,4.565,0.02043,0.64499,0.38174,0.51529,0.39786,0.08492,0.26475,2.27513


## Split Text Values in a Column

This step splits the values of a text column into multiple parts
using a specified delimiter.


In [30]:
# Split the region column into words (if it contains spaces)
df["Region"].str.split(pat=" ")

Unnamed: 0,Region
0,"[Western, Europe]"
1,"[Western, Europe]"
2,"[Western, Europe]"
3,"[Western, Europe]"
4,"[North, America]"
...,...
153,"[Sub-Saharan, Africa]"
154,"[Sub-Saharan, Africa]"
155,"[Middle, East, and, Northern, Africa]"
156,"[Sub-Saharan, Africa]"


## Extract a Specific Part from a Text Column

This step splits a text column using a delimiter and extracts
a specific part of the split values.


In [32]:
# Split country names by space and extract the second word (if present)
df["Country"].str.split(pat=" ").str[1]

Unnamed: 0,Country
0,
1,
2,
3,
4,
...,...
153,
154,
155,
156,
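
Single-word names have no second part, so `.str[1]` returns a missing value for them. As a sketch using country names from the dataset, `n=1` with `expand=True` splits at the first space only and returns the parts as separate columns:

```python
import pandas as pd

# Country names that appear in the dataset
countries = pd.Series(["Switzerland", "New Zealand", "United Arab Emirates"])

# Split at the first space only; expand=True returns one column per part
parts = countries.str.split(pat=" ", n=1, expand=True)
parts.columns = ["first_word", "rest"]  # hypothetical column names

print(parts)
```

Rows without a space get a missing value in the `rest` column.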


## Create a New Column from an Existing Column

This step creates a new column by extracting the first word
from the Country column.


In [36]:
# Create a new column by extracting the first word from Country names
df["country_first_word"] = df["Country"].str.split(pat=" ").str[0]

# Display the output to verify
df[["Country", "country_first_word"]].head()

Unnamed: 0,Country,country_first_word
0,Switzerland,Switzerland
1,Iceland,Iceland
2,Denmark,Denmark
3,Norway,Norway
4,Canada,Canada


## Create a New Column by Extracting the Second Word

This step creates a new column by splitting the Country name
and extracting the second word from it.


In [43]:
# Create a new column by extracting the second word from Country names
df["country_second_word"] = df["Country"].str.split(pat=" ").str[1]

# Display the output to verify
df[["Country", "country_second_word"]].head()

Unnamed: 0,Country,country_second_word
0,Switzerland,
1,Iceland,
2,Denmark,
3,Norway,
4,Canada,


## Remove an Unnecessary Column

This step removes a column that is not required for further analysis
to keep the dataset clean and concise.


In [44]:
# Drop the 'Standard Error' column
df = df.drop(["Standard Error"], axis=1)

# Display column names to verify removal
df.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
       'Freedom', 'Trust (Government Corruption)', 'Generosity',
       'Dystopia Residual', 'country_first_word', 'country_second_word'],
      dtype='object')
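
The column names above mix spaces, parentheses, and different capitalization styles. The project description lists standardizing column names as a goal; a minimal sketch of one common convention (lowercase snake_case, chosen here as an assumption) is:

```python
import pandas as pd

# Empty frame carrying a few of the dataset's column names
sample = pd.DataFrame(columns=[
    "Happiness Rank",
    "Economy (GDP per Capita)",
    "Trust (Government Corruption)",
])

# Lowercase, replace runs of non-alphanumeric characters with "_", trim "_"
sample.columns = (
    sample.columns
    .str.lower()
    .str.replace(r"[^a-z0-9]+", "_", regex=True)
    .str.strip("_")
)

print(list(sample.columns))
```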

## Preview the Cleaned Dataset

This step displays the first few rows of the dataset to verify
that all cleaning operations have been applied successfully.


In [45]:
# Display the first 5 rows of the dataset
df.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,country_first_word,country_second_word
0,Switzerland,Western Europe,1,7.587,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,Switzerland,
1,Iceland,Western Europe,2,7.561,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,Iceland,
2,Denmark,Western Europe,3,7.527,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,Denmark,
3,Norway,Western Europe,4,7.522,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,Norway,
4,Canada,North America,5,7.427,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,Canada,


## Split Text Values in a Column

This step splits the values of a text column into multiple parts
using a specified delimiter.


In [46]:
# Split Country names into separate words
df["Country"].str.split(pat=" ")

Unnamed: 0,Country
0,[Switzerland]
1,[Iceland]
2,[Denmark]
3,[Norway]
4,[Canada]
...,...
153,[Rwanda]
154,[Benin]
155,[Syria]
156,[Burundi]


## Create a New Column from Country Names

This step creates a new column by splitting the Country name
and extracting a specific part of the text.


In [47]:
# Create a new column by extracting the first word from Country names
df["Country_First_Word"] = df["Country"].str.split(pat=" ").str[0]

# Display the result to verify
df[["Country", "Country_First_Word"]].head()

Unnamed: 0,Country,Country_First_Word
0,Switzerland,Switzerland
1,Iceland,Iceland
2,Denmark,Denmark
3,Norway,Norway
4,Canada,Canada


## Clean Country Names

This step attempts to clean the Country column by splitting text
using a delimiter. The result is displayed to verify whether
any changes were applied.


In [48]:
# Attempt to clean Country names by splitting on ')'
df["Country"] = df["Country"].str.split(pat=")").str[0]

# Display to verify the result
df["Country"].head()

Unnamed: 0,Country
0,Switzerland
1,Iceland
2,Denmark
3,Norway
4,Canada


In [49]:
# Remove extra spaces from Country names
df["Country"] = df["Country"].str.strip()

df["Country"].head()

Unnamed: 0,Country
0,Switzerland
1,Iceland
2,Denmark
3,Norway
4,Canada


## View the Country Column

This step displays all the values present in the Country column
to inspect and verify the cleaned country names.


In [50]:
# Display the Country column
df["Country"]

Unnamed: 0,Country
0,Switzerland
1,Iceland
2,Denmark
3,Norway
4,Canada
...,...
153,Rwanda
154,Benin
155,Syria
156,Burundi


In [51]:
df["Country"].unique()

array(['Switzerland', 'Iceland', 'Denmark', 'Norway', 'Canada', 'Finland',
       'Netherlands', 'Sweden', 'New Zealand', 'Australia', 'Israel',
       'Costa Rica', 'Austria', 'Mexico', 'United States', 'Brazil',
       'Luxembourg', 'Ireland', 'Belgium', 'United Arab Emirates',
       'United Kingdom', 'Oman', 'Venezuela', 'Singapore', 'Panama',
       'Germany', 'Chile', 'Qatar', 'France', 'Argentina',
       'Czech Republic', 'Uruguay', 'Colombia', 'Thailand',
       'Saudi Arabia', 'Spain', 'Malta', 'Taiwan', 'Kuwait', 'Suriname',
       'Trinidad and Tobago', 'El Salvador', 'Guatemala', 'Uzbekistan',
       'Slovakia', 'Japan', 'South Korea', 'Ecuador', 'Bahrain', 'Italy',
       'Bolivia', 'Moldova', 'Paraguay', 'Kazakhstan', 'Slovenia',
       'Lithuania', 'Nicaragua', 'Peru', 'Belarus', 'Poland', 'Malaysia',
       'Croatia', 'Libya', 'Russia', 'Jamaica', 'North Cyprus', 'Cyprus',
       'Algeria', 'Kosovo', 'Turkmenistan', 'Mauritius', 'Hong Kong',
       'Estonia', 'Indonesi

## Clean Country Names Using String Operations

This step previews the Country column and then splits each name on `(`,
keeping only the text before it. Names without a `(` are left unchanged.


In [54]:
df.columns
df["Country"].head()

Unnamed: 0,Country
0,Switzerland
1,Iceland
2,Denmark
3,Norway
4,Canada


In [55]:
# Clean Country names by ensuring only the main text is retained
# (If a name contains no '(', the split leaves it unchanged)
df["Country"] = df["Country"].str.split(pat="(").str[0]

# Display output to verify
df["Country"].head()

Unnamed: 0,Country
0,Switzerland
1,Iceland
2,Denmark
3,Norway
4,Canada
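
The `head()` preview only confirms that the first five rows are unchanged. As a sketch on hypothetical names, a single regular-expression replacement removes a parenthetical suffix and the space before it, and `str.strip()` trims the remaining whitespace:

```python
import pandas as pd

# Hypothetical names for illustration
names = pd.Series(["Switzerland", "Congo (Kinshasa)", "  Norway "])

cleaned = (
    names
    .str.replace(r"\s*\(.*\)", "", regex=True)  # drop "(...)" and the space before it
    .str.strip()                                 # trim leading/trailing whitespace
)

print(cleaned.tolist())
```

Removing a parenthetical can merge names that differ only in that suffix, so the number of unique values is worth checking afterwards.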


## Check Data Types of Columns

This step displays the data type of each column in the dataset.
It helps verify whether columns have the correct data types
after cleaning operations.


In [56]:
# Display data types of all columns
df.dtypes

Unnamed: 0,0
Country,object
Region,object
Happiness Rank,int64
Happiness Score,float64
Economy (GDP per Capita),float64
Family,float64
Health (Life Expectancy),float64
Freedom,float64
Trust (Government Corruption),float64
Generosity,float64


## Extract a Part of Text from a Column

This step demonstrates how to split a text column using a delimiter
and extract the first part of the split values.


In [57]:
# Split the Region column by space and extract the first word
df["Region"].str.split(pat=" ").str[0]

Unnamed: 0,Region
0,Western
1,Western
2,Western
3,Western
4,North
...,...
153,Sub-Saharan
154,Sub-Saharan
155,Middle
156,Sub-Saharan


In [58]:
df["region_first_word"] = df["Region"].str.split(pat=" ").str[0]
df[["Region", "region_first_word"]].head()

Unnamed: 0,Region,region_first_word
0,Western Europe,Western
1,Western Europe,Western
2,Western Europe,Western
3,Western Europe,Western
4,North America,North


## Clean Text Values in a Column

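This step overwrites the Country column with the first word of each name.
The cleaned values are reassigned back to the same column.
Note that this truncates multi-word names: "New Zealand" becomes "New",
as the full DataFrame output later in the notebook shows.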


In [59]:
# Clean Country names by splitting on space and keeping the first part
df["Country"] = df["Country"].str.split(pat=" ").str[0]

# Display to verify the result
df["Country"].head()

Unnamed: 0,Country
0,Switzerland
1,Iceland
2,Denmark
3,Norway
4,Canada
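
Keeping only the first word can merge different countries into one label. The sketch below, using multi-word names from the dataset, compares the number of unique values before and after the split and lists the names that changed:

```python
import pandas as pd

# Multi-word names that appear in the dataset, plus one single-word name
countries = pd.Series(["United States", "United Kingdom", "New Zealand", "Norway"])

first_word = countries.str.split(pat=" ").str[0]

# Fewer distinct values after the split means some names were merged
print(countries.nunique(), first_word.nunique())

# Names that changed under the transformation
changed = countries[countries != first_word]
print(changed.tolist())
```

A check like this before reassigning the column helps avoid losing information.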


## Convert Data Type of a Column

This step converts the data type of the Happiness Rank column
to integer to ensure correct numerical representation.


In [60]:
# Convert Happiness Rank column to integer
df["Happiness Rank"] = df["Happiness Rank"].astype(int)

# Display data type to verify
df["Happiness Rank"].dtype

dtype('int64')

In [62]:
df["Happiness Rank"] = pd.to_numeric(df["Happiness Rank"], errors="coerce")

In [63]:
df.dtypes

Unnamed: 0,0
Country,object
Region,object
Happiness Rank,int64
Happiness Score,float64
Economy (GDP per Capita),float64
Family,float64
Health (Life Expectancy),float64
Freedom,float64
Trust (Government Corruption),float64
Generosity,float64
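
The two conversions above behave differently on bad input. In this sketch, with hypothetical raw text values, `astype(int)` would raise a `ValueError`, while `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN:

```python
import pandas as pd

# Hypothetical raw values as they might arrive from a messy CSV
raw = pd.Series(["1", "2", "n/a"])

# astype(int) would raise ValueError on "n/a"; to_numeric with
# errors="coerce" converts unparseable entries to NaN instead
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)          # float64, because NaN forces a float column
print(converted.isna().sum())   # number of values that failed to parse
```

Because coercion can introduce new missing values, it is worth re-running the missing-value check after it.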


In [64]:
df.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,country_first_word,country_second_word,Country_First_Word,region_first_word
0,Switzerland,Western Europe,1,7.587,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,Switzerland,,Switzerland,Western
1,Iceland,Western Europe,2,7.561,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,Iceland,,Iceland,Western
2,Denmark,Western Europe,3,7.527,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,Denmark,,Denmark,Western
3,Norway,Western Europe,4,7.522,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,Norway,,Norway,Western
4,Canada,North America,5,7.427,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,Canada,,Canada,North


## Clean Text Values Using String Split

This step overwrites the Region column with the first word of each value.
As a result, region labels are shortened: "Western Europe" becomes "Western"
and "North America" becomes "North".


In [65]:
# Clean Region values by splitting on space and keeping the first part
df["Region"] = df["Region"].str.split(pat=" ").str[0]

# Display the result to verify
df["Region"].head()

Unnamed: 0,Region
0,Western
1,Western
2,Western
3,Western
4,North


In [66]:
df.dtypes

Unnamed: 0,0
Country,object
Region,object
Happiness Rank,int64
Happiness Score,float64
Economy (GDP per Capita),float64
Family,float64
Health (Life Expectancy),float64
Freedom,float64
Trust (Government Corruption),float64
Generosity,float64


## Clean Text Column Using String Split

This step cleans a categorical text column by applying a string split
operation and retaining the required part of the text.


In [67]:
# Split Region values on '+' and keep the first part
# (values without '+' are left unchanged)
df["Region"] = df["Region"].str.split(pat="+").str[0]

# Display the output to verify
df["Region"].head()

Unnamed: 0,Region
0,Western
1,Western
2,Western
3,Western
4,North


## Recheck Missing Values After Cleaning

This step checks whether any column in the dataset still contains
missing (null) values after applying the data cleaning operations.


In [68]:
# Check if any column still has missing values
df.isnull().any()

Unnamed: 0,0
Country,False
Region,False
Happiness Rank,False
Happiness Score,False
Economy (GDP per Capita),False
Family,False
Health (Life Expectancy),False
Freedom,False
Trust (Government Corruption),False
Generosity,False


## Display Rows with Missing Values in a Specific Column

This step filters and displays all rows where the selected column
contains missing (NaN) values.


In [69]:
# Display rows where GDP per Capita has missing values
df[df["Economy (GDP per Capita)"].isna()]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,country_first_word,country_second_word,Country_First_Word,region_first_word


In [70]:
df[df["Trust (Government Corruption)"].isna()]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,country_first_word,country_second_word,Country_First_Word,region_first_word


## Fill Missing Values in a Numeric Column

This step replaces missing (NaN) values in a numeric column
with 0 to ensure the dataset contains no null values.


In [71]:
# Fill missing values in Economy (GDP per Capita) with 0
df["Economy (GDP per Capita)"] = df["Economy (GDP per Capita)"].fillna(0)

# Display to verify the result
df["Economy (GDP per Capita)"].head()

Unnamed: 0,Economy (GDP per Capita)
0,1.39651
1,1.30232
2,1.32548
3,1.459
4,1.32629


## Identify Rows with Missing Values (Re-checking)

This step filters and displays rows where a specific column
contains missing (NaN) values.


In [72]:
# Display rows where Trust (Government Corruption) has missing values
df[df["Trust (Government Corruption)"].isna()]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,country_first_word,country_second_word,Country_First_Word,region_first_word


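## Filter Records for Specific Countries

This step filters the dataset to display records
for specific countries. Because the Country column was earlier reduced
to its first word, "United States" no longer matches any row.


In [76]:
# Display records for specific countries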
df[df["Country"].isin(["India", "Denmark", "United States", "Finland"])]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,country_first_word,country_second_word,Country_First_Word,region_first_word
2,Denmark,Western,3,7.527,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,Denmark,,Denmark,Western
5,Finland,Western,6,7.406,1.29025,1.31826,0.88911,0.64169,0.41372,0.23351,2.61955,Finland,,Finland,Western
116,India,Southern,117,4.565,0.64499,0.38174,0.51529,0.39786,0.08492,0.26475,2.27513,India,,India,Southern


In [77]:
df.isnull().any()

Unnamed: 0,0
Country,False
Region,False
Happiness Rank,False
Happiness Score,False
Economy (GDP per Capita),False
Family,False
Health (Life Expectancy),False
Freedom,False
Trust (Government Corruption),False
Generosity,False


## Display All Rows of the DataFrame

By default, Pandas limits the number of rows displayed.
This step sets the display option to show all rows of the DataFrame
when it is printed.


In [78]:
# Set pandas option to display all rows
pd.set_option('display.max_rows', None)
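
`pd.set_option` changes the setting for the rest of the session. As a sketch on a hypothetical 100-row frame, `pd.option_context` applies the setting only inside a `with` block, and `pd.reset_option` restores the default:

```python
import pandas as pd

sample = pd.DataFrame({"value": range(100)})  # hypothetical 100-row frame

# Inside the block, all rows are rendered; the global setting is untouched
with pd.option_context("display.max_rows", None):
    full_text = repr(sample)

# Undo an earlier pd.set_option call if needed
pd.reset_option("display.max_rows")
```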

## Display the Complete DataFrame

This step displays the entire DataFrame after applying all
data cleaning and preprocessing steps.


In [79]:
# Display the entire DataFrame
df

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,country_first_word,country_second_word,Country_First_Word,region_first_word
0,Switzerland,Western,1,7.587,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,Switzerland,,Switzerland,Western
1,Iceland,Western,2,7.561,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,Iceland,,Iceland,Western
2,Denmark,Western,3,7.527,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,Denmark,,Denmark,Western
3,Norway,Western,4,7.522,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,Norway,,Norway,Western
4,Canada,North,5,7.427,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,Canada,,Canada,North
5,Finland,Western,6,7.406,1.29025,1.31826,0.88911,0.64169,0.41372,0.23351,2.61955,Finland,,Finland,Western
6,Netherlands,Western,7,7.378,1.32944,1.28017,0.89284,0.61576,0.31814,0.4761,2.4657,Netherlands,,Netherlands,Western
7,Sweden,Western,8,7.364,1.33171,1.28907,0.91087,0.6598,0.43844,0.36262,2.37119,Sweden,,Sweden,Western
8,New,Australia,9,7.286,1.25018,1.31967,0.90837,0.63938,0.42922,0.47501,2.26425,New,Zealand,New,Australia
9,Australia,Australia,10,7.284,1.33358,1.30923,0.93156,0.65124,0.35637,0.43562,2.26646,Australia,,Australia,Australia


## Remove a Specific Row from the Dataset

This step removes a specific row from the DataFrame using its index value.
This is useful when a row contains incorrect or unwanted data.


In [80]:
# Drop the row with index 56
df = df.drop(56, axis=0)

# Display the DataFrame to verify the row removal
df.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,country_first_word,country_second_word,Country_First_Word,region_first_word
0,Switzerland,Western,1,7.587,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,Switzerland,,Switzerland,Western
1,Iceland,Western,2,7.561,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,Iceland,,Iceland,Western
2,Denmark,Western,3,7.527,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,Denmark,,Denmark,Western
3,Norway,Western,4,7.522,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,Norway,,Norway,Western
4,Canada,North,5,7.427,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,Canada,,Canada,North
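
Re-running `df.drop(56, axis=0)` after the row is gone raises a `KeyError`. The sketch below, on a hypothetical three-row sample, shows two alternatives: dropping by label with `errors="ignore"`, and filtering by a condition that does not depend on the index.

```python
import pandas as pd

# Hypothetical sample; index labels are the default 0..2
sample = pd.DataFrame({
    "Country": ["Switzerland", "Iceland", "Denmark"],
    "Happiness Score": [7.587, 7.561, 7.527],
})

# Drop by index label; errors="ignore" avoids a KeyError on a re-run
by_label = sample.drop(index=1, errors="ignore")
by_label = by_label.drop(index=1, errors="ignore")  # second call is a no-op

# Drop by condition, which does not depend on the row's index label
by_condition = sample[sample["Country"] != "Iceland"]

print(by_label.equals(by_condition))
```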


In [81]:
df

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,country_first_word,country_second_word,Country_First_Word,region_first_word
0,Switzerland,Western,1,7.587,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,Switzerland,,Switzerland,Western
1,Iceland,Western,2,7.561,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,Iceland,,Iceland,Western
2,Denmark,Western,3,7.527,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,Denmark,,Denmark,Western
3,Norway,Western,4,7.522,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,Norway,,Norway,Western
4,Canada,North,5,7.427,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,Canada,,Canada,North
5,Finland,Western,6,7.406,1.29025,1.31826,0.88911,0.64169,0.41372,0.23351,2.61955,Finland,,Finland,Western
6,Netherlands,Western,7,7.378,1.32944,1.28017,0.89284,0.61576,0.31814,0.4761,2.4657,Netherlands,,Netherlands,Western
7,Sweden,Western,8,7.364,1.33171,1.28907,0.91087,0.6598,0.43844,0.36262,2.37119,Sweden,,Sweden,Western
8,New,Australia,9,7.286,1.25018,1.31967,0.90837,0.63938,0.42922,0.47501,2.26425,New,Zealand,New,Australia
9,Australia,Australia,10,7.284,1.33358,1.30923,0.93156,0.65124,0.35637,0.43562,2.26646,Australia,,Australia,Australia


In [82]:
df.isnull().any()

Unnamed: 0,0
Country,False
Region,False
Happiness Rank,False
Happiness Score,False
Economy (GDP per Capita),False
Family,False
Health (Life Expectancy),False
Freedom,False
Trust (Government Corruption),False
Generosity,False


In [83]:
df.dtypes

Unnamed: 0,0
Country,object
Region,object
Happiness Rank,int64
Happiness Score,float64
Economy (GDP per Capita),float64
Family,float64
Health (Life Expectancy),float64
Freedom,float64
Trust (Government Corruption),float64
Generosity,float64


In [84]:
# Convert Happiness Rank to integer data type
df["Happiness Rank"] = df["Happiness Rank"].astype(int)

# Verify the data type
df["Happiness Rank"].dtype

dtype('int64')

## Convert Column Data Type to Float

This step converts the Happiness Score column to float data type
to ensure accurate numerical calculations and analysis.


In [85]:
# Convert Happiness Score to float data type
df["Happiness Score"] = df["Happiness Score"].astype(float)

# Verify the data type
df["Happiness Score"].dtype

dtype('float64')

In [86]:
df["Happiness Score"] = pd.to_numeric(df["Happiness Score"], errors="coerce")

In [87]:
df.isnull().any()

Unnamed: 0,0
Country,False
Region,False
Happiness Rank,False
Happiness Score,False
Economy (GDP per Capita),False
Family,False
Health (Life Expectancy),False
Freedom,False
Trust (Government Corruption),False
Generosity,False


In [88]:
df.dtypes

Unnamed: 0,0
Country,object
Region,object
Happiness Rank,int64
Happiness Score,float64
Economy (GDP per Capita),float64
Family,float64
Health (Life Expectancy),float64
Freedom,float64
Trust (Government Corruption),float64
Generosity,float64


## Create a Derived Column Using Numerical Calculation

This step creates a new column by calculating the difference
between Happiness Score and Dystopia Residual.


In [89]:
# Create a new column by subtracting Dystopia Residual from Happiness Score
df["score_without_dystopia"] = df["Happiness Score"] - df["Dystopia Residual"]

# Display to verify the new column
df[["Country", "Happiness Score", "Dystopia Residual", "score_without_dystopia"]].head()

Unnamed: 0,Country,Happiness Score,Dystopia Residual,score_without_dystopia
0,Switzerland,7.587,2.51738,5.06962
1,Iceland,7.561,2.70201,4.85899
2,Denmark,7.527,2.49204,5.03496
3,Norway,7.522,2.46531,5.05669
4,Canada,7.427,2.45176,4.97524
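
One way to sanity-check the derived column is to compare it with the sum of the six factor columns. The sketch below does this for the first two rows, with values copied from the outputs above; whether the relationship holds for every row is an assumption that would need checking on the full dataset.

```python
import pandas as pd

# First two rows of the dataset, copied from the outputs above
rows = pd.DataFrame({
    "Happiness Score": [7.587, 7.561],
    "Economy (GDP per Capita)": [1.39651, 1.30232],
    "Family": [1.34951, 1.40223],
    "Health (Life Expectancy)": [0.94143, 0.94784],
    "Freedom": [0.66557, 0.62877],
    "Trust (Government Corruption)": [0.41978, 0.14145],
    "Generosity": [0.29678, 0.4363],
    "Dystopia Residual": [2.51738, 2.70201],
})

factor_cols = [
    "Economy (GDP per Capita)", "Family", "Health (Life Expectancy)",
    "Freedom", "Trust (Government Corruption)", "Generosity",
]

score_without_dystopia = rows["Happiness Score"] - rows["Dystopia Residual"]
factor_sum = rows[factor_cols].sum(axis=1)

# Largest absolute gap between the two quantities
max_gap = (score_without_dystopia - factor_sum).abs().max()
print(max_gap)
```

For these two rows the gap is small enough to be explained by rounding in the published values.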


In [90]:
df

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,country_first_word,country_second_word,Country_First_Word,region_first_word,score_without_dystopia
0,Switzerland,Western,1,7.587,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,Switzerland,,Switzerland,Western,5.06962
1,Iceland,Western,2,7.561,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,Iceland,,Iceland,Western,4.85899
2,Denmark,Western,3,7.527,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,Denmark,,Denmark,Western,5.03496
3,Norway,Western,4,7.522,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,Norway,,Norway,Western,5.05669
4,Canada,North,5,7.427,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,Canada,,Canada,North,4.97524
5,Finland,Western,6,7.406,1.29025,1.31826,0.88911,0.64169,0.41372,0.23351,2.61955,Finland,,Finland,Western,4.78645
6,Netherlands,Western,7,7.378,1.32944,1.28017,0.89284,0.61576,0.31814,0.4761,2.4657,Netherlands,,Netherlands,Western,4.9123
7,Sweden,Western,8,7.364,1.33171,1.28907,0.91087,0.6598,0.43844,0.36262,2.37119,Sweden,,Sweden,Western,4.99281
8,New,Australia,9,7.286,1.25018,1.31967,0.90837,0.63938,0.42922,0.47501,2.26425,New,Zealand,New,Australia,5.02175
9,Australia,Australia,10,7.284,1.33358,1.30923,0.93156,0.65124,0.35637,0.43562,2.26646,Australia,,Australia,Australia,5.01754


## Calculate the Mean of a Derived Column

This step calculates the average value of the derived column
across all countries in the dataset.


In [91]:
# Calculate the mean of the derived score_without_dystopia column
df["score_without_dystopia"].mean()

np.float64(3.2753104458598727)

## Conditional Filtering and Mean Calculation

This step filters the dataset based on a condition applied to a
derived column and calculates the mean of another numeric column.


In [92]:
# Calculate the mean Happiness Score for countries
# where score_without_dystopia is greater than 4
df[df["score_without_dystopia"] > 4]["Happiness Score"].mean()

np.float64(6.828617647058823)

## Count Records Based on a Condition

This step filters the dataset based on a condition applied to a
numeric column and counts the number of matching records.


In [93]:
# Count the number of countries with Happiness Score less than 5
df[df["Happiness Score"] < 5]["Country"].count()

np.int64(65)
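
The two filters above can also be written with a named boolean mask and `.loc`, which avoids chained indexing. This is a sketch on a small sample: the first three rows use values from the outputs above, and the fourth row is hypothetical.

```python
import pandas as pd

# First three rows copied from the outputs above; fourth row is hypothetical
sample = pd.DataFrame({
    "Country": ["Switzerland", "Iceland", "Denmark", "Example"],
    "Happiness Score": [7.587, 7.561, 7.527, 4.0],
    "score_without_dystopia": [5.06962, 4.85899, 5.03496, 2.0],
})

# Mean Happiness Score where the derived score exceeds 4
mask = sample["score_without_dystopia"] > 4
mean_score = sample.loc[mask, "Happiness Score"].mean()

# Number of rows with Happiness Score below 5 (True counts as 1)
n_low = int((sample["Happiness Score"] < 5).sum())

print(mean_score, n_low)
```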

## Group By and Maximum Aggregation

This step groups the dataset by Region, calculates the maximum
Happiness Score for each region, and sorts the results in
descending order.


In [94]:
# Find the maximum Happiness Score for each Region
(
    df.groupby("Region")["Happiness Score"]
    .max()
    .to_frame("Max_Happiness_Score")
    .reset_index()
    .sort_values("Max_Happiness_Score", ascending=False)
)

Unnamed: 0,Region,Max_Happiness_Score
9,Western,7.587
5,North,7.427
0,Australia,7.286
4,Middle,7.278
3,Latin,7.226
6,Southeastern,6.798
1,Central,6.505
2,Eastern,6.298
8,Sub-Saharan,5.477
7,Southern,5.253
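
Several statistics per region can be computed in one pass with named aggregation. The sketch below uses four scores from the outputs above with the shortened region labels; the output column names other than `Max_Happiness_Score` are hypothetical.

```python
import pandas as pd

# Scores for Switzerland, Iceland, Canada, and the United States
sample = pd.DataFrame({
    "Region": ["Western", "Western", "North", "North"],
    "Happiness Score": [7.587, 7.561, 7.427, 7.119],
})

summary = (
    sample.groupby("Region", as_index=False)
    .agg(
        Max_Happiness_Score=("Happiness Score", "max"),
        Mean_Happiness_Score=("Happiness Score", "mean"),
        Countries=("Happiness Score", "count"),
    )
    .sort_values("Max_Happiness_Score", ascending=False)
)

print(summary)
```

Including the number of countries per region makes it clear how many rows each maximum or mean is based on.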


## Group By and Mean Aggregation for Multiple Columns

This step groups the dataset by Region and calculates the mean
of multiple numerical columns to compare average happiness
and related factors across regions.


In [95]:
# Calculate mean values of selected columns for each region
df.groupby("Region")[[
    "Happiness Score",
    "Economy (GDP per Capita)",
    "Freedom",
    "Generosity"
]].mean()

Unnamed: 0_level_0,Happiness Score,Economy (GDP per Capita),Freedom,Generosity
Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Australia,7.285,1.29188,0.64531,0.455315
Central,5.332931,0.942438,0.358269,0.152264
Eastern,5.626167,1.15178,0.46249,0.225885
Latin,6.159762,0.890319,0.499215,0.214913
Middle,5.4069,1.066974,0.361751,0.190376
North,7.273,1.3604,0.589505,0.42958
Southeastern,5.317444,0.789054,0.557104,0.419261
Southern,4.580857,0.560486,0.373337,0.341429
Sub-Saharan,4.2028,0.380473,0.365944,0.221137
Western,6.689619,1.298596,0.549926,0.302109


In [96]:
df

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,country_first_word,country_second_word,Country_First_Word,region_first_word,score_without_dystopia
0,Switzerland,Western,1,7.587,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,Switzerland,,Switzerland,Western,5.06962
1,Iceland,Western,2,7.561,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,Iceland,,Iceland,Western,4.85899
2,Denmark,Western,3,7.527,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,Denmark,,Denmark,Western,5.03496
3,Norway,Western,4,7.522,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,Norway,,Norway,Western,5.05669
4,Canada,North,5,7.427,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,Canada,,Canada,North,4.97524
5,Finland,Western,6,7.406,1.29025,1.31826,0.88911,0.64169,0.41372,0.23351,2.61955,Finland,,Finland,Western,4.78645
6,Netherlands,Western,7,7.378,1.32944,1.28017,0.89284,0.61576,0.31814,0.4761,2.4657,Netherlands,,Netherlands,Western,4.9123
7,Sweden,Western,8,7.364,1.33171,1.28907,0.91087,0.6598,0.43844,0.36262,2.37119,Sweden,,Sweden,Western,4.99281
8,New,Australia,9,7.286,1.25018,1.31967,0.90837,0.63938,0.42922,0.47501,2.26425,New,Zealand,New,Australia,5.02175
9,Australia,Australia,10,7.284,1.33358,1.30923,0.93156,0.65124,0.35637,0.43562,2.26646,Australia,,Australia,Australia,5.01754
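
The project description lists exporting the cleaned dataset and a cleaning log, but the notebook does not show that step. A minimal sketch, assuming a hypothetical cleaned frame, a hypothetical log, and placeholder file names written to a temporary folder:

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame and log; file names are placeholders
cleaned = pd.DataFrame({
    "Country": ["Switzerland", "Iceland"],
    "Happiness Score": [7.587, 7.561],
})
cleaning_log = pd.DataFrame([
    {"step": "drop_duplicates", "rows_before": 3, "rows_after": 2},
])

out_dir = tempfile.mkdtemp()
data_path = os.path.join(out_dir, "happiness_cleaned.csv")
log_path = os.path.join(out_dir, "cleaning_log.csv")

# index=False keeps the row index out of the file, so reloading
# does not produce an extra unnamed index column
cleaned.to_csv(data_path, index=False)
cleaning_log.to_csv(log_path, index=False)

reloaded = pd.read_csv(data_path)
print(list(reloaded.columns))
```

Reloading the exported file is a quick way to confirm that the columns and row count match the cleaned DataFrame.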


## Conclusion

In this project, a comprehensive data cleaning process was performed on the World Happiness dataset using Pandas. The dataset was inspected for missing values, duplicates, and incorrect data types, and appropriate cleaning techniques were applied to improve data quality.

Key steps included handling missing values, removing duplicate and unnecessary records, performing string cleaning, converting data types, creating derived columns, and validating the cleaned data. Additionally, grouping, aggregation, and conditional analysis were used to extract meaningful insights from the dataset.

This project highlights the importance of data cleaning as a foundational step in data analysis. Clean and well-structured data ensures accurate analysis, reliable insights, and better decision-making in real-world data science applications.
