## 1. Loading and Inspecting the Data

In [67]:
# type: ignore
import pandas as pd  # Importing pandas for data manipulation

# Load the dataset into a DataFrame
df_happiness = pd.read_csv("World-happiness-report-2024.csv")

# Display the first 5 observations
print(df_happiness.head())


  Country name            Regional indicator  Ladder score  upperwhisker  \
0      Finland                Western Europe         7.741         7.815   
1      Denmark                Western Europe         7.583         7.665   
2      Iceland                Western Europe         7.525         7.618   
3       Sweden                Western Europe         7.344         7.422   
4       Israel  Middle East and North Africa         7.341         7.405   

   lowerwhisker  Log GDP per capita  Social support  Healthy life expectancy  \
0         7.667               1.844           1.572                    0.695   
1         7.500               1.908           1.520                    0.699   
2         7.433               1.881           1.617                    0.718   
3         7.267               1.878           1.501                    0.724   
4         7.277               1.803           1.513                    0.740   

   Freedom to make life choices  Generosity  Perceptions of co

In [68]:
print(df_happiness.info())  # Concise summary; info() prints directly and returns None, hence the trailing "None"

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143 entries, 0 to 142
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Country name                  143 non-null    object 
 1   Regional indicator            143 non-null    object 
 2   Ladder score                  143 non-null    float64
 3   upperwhisker                  143 non-null    float64
 4   lowerwhisker                  143 non-null    float64
 5   Log GDP per capita            140 non-null    float64
 6   Social support                140 non-null    float64
 7   Healthy life expectancy       140 non-null    float64
 8   Freedom to make life choices  140 non-null    float64
 9   Generosity                    140 non-null    float64
 10  Perceptions of corruption     140 non-null    float64
 11  Dystopia + residual           140 non-null    float64
dtypes: float64(10), object(2)
memory usage: 13.5+ KB
None


## 2. Cleaning the Data

### 2.1 Identifying and Handling Missing Values

In [69]:
# Count of missing values in each column
df_happiness.isnull().sum()

Country name                    0
Regional indicator              0
Ladder score                    0
upperwhisker                    0
lowerwhisker                    0
Log GDP per capita              3
Social support                  3
Healthy life expectancy         3
Freedom to make life choices    3
Generosity                      3
Perceptions of corruption       3
Dystopia + residual             3
dtype: int64

In [70]:
# Display rows that contain any null values
rows_with_null = df_happiness[df_happiness.isnull().any(axis=1)]
print(rows_with_null)   

           Country name                  Regional indicator  Ladder score  \
61              Bahrain        Middle East and North Africa         5.959   
87           Tajikistan  Commonwealth of Independent States         5.281   
102  State of Palestine        Middle East and North Africa         4.879   

     upperwhisker  lowerwhisker  Log GDP per capita  Social support  \
61          6.153         5.766                 NaN             NaN   
87          5.361         5.201                 NaN             NaN   
102         5.006         4.753                 NaN             NaN   

     Healthy life expectancy  Freedom to make life choices  Generosity  \
61                       NaN                           NaN         NaN   
87                       NaN                           NaN         NaN   
102                      NaN                           NaN         NaN   

     Perceptions of corruption  Dystopia + residual  
61                         NaN                  NaN  
8

### Dropping Certain Rows

In the World Happiness Report 2024 dataset, three rows (indices 61, 87, and 102: Bahrain, Tajikistan, and State of Palestine) are each missing values in 7 of the 12 columns. These rows were dropped for the following reasons:

   - High percentage of missing values: each row lacks 7 of its 12 values (about 58%), everything except the country, region, ladder score, and whiskers.

   - The dataset is large enough without them: 140 of the 143 rows remain.

   - Consistency: every remaining row has a value in every column, so all later steps operate on the same set of countries.

In [71]:
# Drop rows by index
df_happiness_cleaned = df_happiness.drop([61, 87, 102])
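
Hard-coded index labels work for this snapshot of the file, but they would silently drop the wrong rows if the CSV were reordered or updated. A minimal sketch of a threshold-based alternative follows. It uses a small made-up frame (`toy`) instead of the real dataset, and the 50% cut-off (`max_missing`) is an assumption chosen for illustration, not a value taken from the analysis above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (all values are made up for illustration)
toy = pd.DataFrame({
    "Country name": ["A", "B", "C", "D"],
    "Ladder score": [7.1, 6.2, 5.3, 4.4],
    "Log GDP per capita": [1.8, np.nan, 1.2, np.nan],
    "Social support": [1.5, np.nan, 1.1, 0.9],
    "Generosity": [0.2, np.nan, 0.1, 0.3],
})

# Fraction of missing values in each row (0.0 = complete, 1.0 = empty)
missing_fraction = toy.isnull().mean(axis=1)

# Keep only rows whose missing fraction does not exceed the assumed threshold
max_missing = 0.5
toy_cleaned = toy[missing_fraction <= max_missing]

print(missing_fraction)
print(toy_cleaned.index.tolist())
```

In the real data, each of the three affected rows is missing 7 of 12 values (about 58%), so a 50% threshold would select the same rows as the explicit index list. Note that a row below the threshold can still contain some NaN values, as row `D` does in the toy frame.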

### 2.2 Identifying and Handling Duplicates

- The df_happiness DataFrame contains no duplicate rows.

In [72]:
# Check for duplicate rows
df_happiness.duplicated().sum()

0
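
Called without arguments, `duplicated()` only flags rows that match in every column. A country listed twice with slightly different values would pass that check, so a key-based check can be a useful complement. A minimal sketch on made-up data (the frame `toy` and its values are hypothetical):

```python
import pandas as pd

# Toy data: country "B" appears twice with different scores
toy = pd.DataFrame({
    "Country name": ["A", "B", "B", "C"],
    "Ladder score": [7.1, 6.2, 6.3, 5.3],
})

# Rows identical in every column
full_row_dupes = toy.duplicated().sum()

# Rows that repeat an earlier value of the key column only
key_dupes = toy.duplicated(subset=["Country name"]).sum()

print(full_row_dupes, key_dupes)
```

Here the full-row check finds nothing, while the check on `Country name` flags the second `B` row.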

### 2.3 Checking and Handling Data Type Inconsistencies

In [73]:
# Check data types of each column
data_types = df_happiness_cleaned.dtypes
print(data_types)

Country name                     object
Regional indicator               object
Ladder score                    float64
upperwhisker                    float64
lowerwhisker                    float64
Log GDP per capita              float64
Social support                  float64
Healthy life expectancy         float64
Freedom to make life choices    float64
Generosity                      float64
Perceptions of corruption       float64
Dystopia + residual             float64
dtype: object


In [74]:
# Check for non-numeric entries in numeric columns
numeric_cols = df_happiness_cleaned.select_dtypes(include=['int64', 'float64']).columns
non_numeric_entries = df_happiness_cleaned[numeric_cols].apply(pd.to_numeric, errors='coerce').isnull().sum()

print("Non-numeric entries in numeric columns:")
print(non_numeric_entries[non_numeric_entries > 0])

Non-numeric entries in numeric columns:
Series([], dtype: int64)
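
Because the selected columns are already `float64`, `pd.to_numeric(..., errors='coerce')` has nothing left to coerce, so an empty result here is expected. The coercion pattern is more informative on a column that was read as `object` because of stray text. A minimal sketch on a made-up Series (the values, including the `"n/a"` entry, are hypothetical):

```python
import pandas as pd

# A numeric column that was read as text because of one stray entry
raw = pd.Series(["7.741", "7.583", "n/a", "7.344"], name="Ladder score")

# Entries that cannot be parsed as numbers become NaN
coerced = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but failed to parse
bad_entries = raw[coerced.isna() & raw.notna()]

print(bad_entries)
```

Comparing against `raw.notna()` separates values that failed to parse from values that were missing to begin with.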


In [75]:
# Select non-numeric columns
non_numeric_cols = df_happiness_cleaned.select_dtypes(include=['object']).columns

# Check for numeric entries in non-numeric columns
numeric_in_non_numeric = {}
for col in non_numeric_cols:
    # Try to convert entries to numeric, capturing any that convert successfully
    numeric_values = pd.to_numeric(df_happiness_cleaned[col], errors='coerce')
    # Count how many entries were converted to numeric (not NaN)
    count_numeric = numeric_values.notna().sum()
    if count_numeric > 0:
        numeric_in_non_numeric[col] = count_numeric

# Output any non-numeric columns that contain numeric values
if numeric_in_non_numeric:
    print("Non-numeric columns containing numeric values:")
    for col, count in numeric_in_non_numeric.items():
        print(f"Column '{col}' contains {count} numeric values.")
else:
    print("No numeric values found in non-numeric columns.")

No numeric values found in non-numeric columns.


### 2.4 Final Check

The final DataFrame, df_happiness_cleaned, has consistent data types across all columns, no null values, and no duplicate rows, which provides a sound basis for the analysis that follows.

In [76]:
# Concise summary of the data after dropping rows
print(df_happiness_cleaned.info())


<class 'pandas.core.frame.DataFrame'>
Index: 140 entries, 0 to 142
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Country name                  140 non-null    object 
 1   Regional indicator            140 non-null    object 
 2   Ladder score                  140 non-null    float64
 3   upperwhisker                  140 non-null    float64
 4   lowerwhisker                  140 non-null    float64
 5   Log GDP per capita            140 non-null    float64
 6   Social support                140 non-null    float64
 7   Healthy life expectancy       140 non-null    float64
 8   Freedom to make life choices  140 non-null    float64
 9   Generosity                    140 non-null    float64
 10  Perceptions of corruption     140 non-null    float64
 11  Dystopia + residual           140 non-null    float64
dtypes: float64(10), object(2)
memory usage: 14.2+ KB
None
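
The three properties stated in the final check (no nulls, no duplicate rows, numeric dtypes outside the text columns) can also be asserted programmatically, so the notebook fails loudly if a later change breaks one of them. A minimal sketch, shown on a made-up frame; `check_clean` is a hypothetical helper and not part of the original notebook:

```python
import pandas as pd


def check_clean(df: pd.DataFrame, text_cols: list) -> None:
    """Raise AssertionError if df has nulls, duplicate rows, or
    non-numeric dtypes in any column not listed in text_cols."""
    assert not df.isnull().any().any(), "null values remain"
    assert not df.duplicated().any(), "duplicate rows remain"
    numeric_cols = df.columns.difference(text_cols)
    non_numeric = [
        col for col in numeric_cols
        if not pd.api.types.is_numeric_dtype(df[col])
    ]
    assert not non_numeric, f"non-numeric columns: {non_numeric}"


# Toy data standing in for df_happiness_cleaned (values are made up)
toy = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Ladder score": [7.1, 6.2, 5.3],
})

check_clean(toy, text_cols=["Country name"])
print("All checks passed.")
```

On the real data, the call would presumably be `check_clean(df_happiness_cleaned, text_cols=["Country name", "Regional indicator"])`.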


## 3. Exploratory Data Analysis