3.1 Problem - 1A - Single Year HDI Exploration (Latest Year: 2022)
Objective:
Explore the HDI dataset for the latest available year (2022) to practice basic EDA techniques.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


df = pd.read_csv('Human_Development_Index_Dataset.csv', encoding='latin1')


1. Extract Latest Year:
• Identify unique years in the dataset.


In [None]:
unique_years = df['year'].unique()
print(unique_years)


• Filter the dataset to include only observations from the year 2022.
• Save the filtered dataframe as hdi_2022_df (used for all subsequent tasks in Problem 1A).

In [None]:
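# .copy() gives an independent frame, so later column assignments do not raise SettingWithCopyWarning
hdi_2022_df = df[df['year'] == 2022].copy()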


2. Data Exploration:
• Display the first 10 rows of the 2022 dataset.

In [None]:
hdi_2022_df.head(10)


• Count the number of rows and columns.

In [None]:
rows, columns = hdi_2022_df.shape
print("Rows:", rows)
print("Columns:", columns)


• List all column names and their data types.

In [None]:
hdi_2022_df.dtypes

3. Missing Values & Data Cleaning:
• Check for missing values in each column and report total counts.

In [None]:
null_values_df = hdi_2022_df.isna().sum()

null_values_df
     

• Inspect dataset for:
 – numeric columns stored as text,
 – inconsistent or misspelled country names,
 – duplicate rows,
 – special characters (e.g., “–”) representing missing data.

 – numeric columns stored as text,

In [None]:
text_columns = hdi_2022_df.select_dtypes(include='object').columns
text_columns
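
Listing the non-numeric columns shows which columns pandas read as text, but not whether their contents are really numbers. The following is a minimal sketch of one way to check, using a small made-up frame. The values and the helper name `numeric_share` are illustrative assumptions, not part of the dataset or the assignment.

```python
import pandas as pd

# Toy data (hypothetical values): 'hdi' was read as text because of a "-" placeholder.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "hdi": ["0.71", "0.64", "-", "0.90"],
})

def numeric_share(frame):
    """Hypothetical helper: for each non-numeric column, return the
    fraction of non-null values that can be parsed as numbers."""
    shares = {}
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            continue
        values = frame[col].dropna()
        parsed = pd.to_numeric(values, errors="coerce")
        shares[col] = float(parsed.notna().mean()) if len(values) else 0.0
    return pd.Series(shares)

shares = numeric_share(toy)
print(shares)
```

A share close to 1 suggests a numeric column stored as text; a share of 0 suggests a genuine text column such as country names.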


 – inconsistent or misspelled country names,

In [None]:
unique_countries = hdi_2022_df['country'].unique()
print(unique_countries)
print(f"Total unique countries: {len(unique_countries)}")



In [None]:
# Normalize whitespace and capitalization so the names match the list below
hdi_2022_df['country'] = hdi_2022_df['country'].str.strip().str.title()
non_countries = [
    'Very High Human Development',
    'High Human Development',
    'Medium Human Development',
    'Low Human Development',
    'Arab States',
    'East Asia And The Pacific',
    'Europe And Central Asia',
    'Latin America And The Caribbean',
    'South Asia',
    'Sub-Saharan Africa',
    'World'
]
hdi_2022_df = hdi_2022_df[~hdi_2022_df['country'].isin(non_countries)]
hdi_2022_df['country'].unique()

The dataset now contains only country-level rows: the human development groups and the regional aggregates (for example, "World" and "Arab States") have been removed.

In [None]:
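# Print the count explicitly; only the last expression in a cell is displayed
print("Duplicate rows:", hdi_2022_df.duplicated().sum())
hdi_2022_df[hdi_2022_df.duplicated()]

There are no duplicate rows in the dataset.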

In [None]:
# Check for hyphen, en dash, and em dash placeholders (the task mentions "–")
hdi_2022_df.isin(['-', '–', '—']).sum()


In [None]:
hdi_2022_df.isnull().sum()

In [None]:
numeric_columns = hdi_2022_df.select_dtypes(include='number').columns

for column in numeric_columns:
    median_value = hdi_2022_df[column].median()
    hdi_2022_df[column] = hdi_2022_df[column].fillna(median_value)
hdi_2022_df.isnull().sum()

Missing entries in numerical features were replaced using the median of each respective column. This approach retains all records while limiting the influence of outliers, making it well-suited for socio-economic data.
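As a small illustration of why the median is less sensitive to outliers than the mean, the sketch below fills one gap in a made-up income column both ways. The numbers are hypothetical and chosen only to make the contrast visible.

```python
import pandas as pd

# Toy income column (hypothetical values) with one extreme value and one gap.
income = pd.Series([1000.0, 1200.0, 1500.0, None, 90000.0])

mean_fill = income.fillna(income.mean())
median_fill = income.fillna(income.median())

print("value imputed with the mean:  ", mean_fill[3])
print("value imputed with the median:", median_fill[3])
```

The single extreme value pulls the mean far above the typical observation, while the median stays near the middle of the other values.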

4. Basic Statistics:
• Compute the mean, median, and standard deviation of HDI for the year 2022.
• Identify the country with the highest HDI in 2022.
• Identify the country with the lowest HDI in 2022.

In [None]:
hdi_2022_df['hdi'].agg(['mean', 'median', 'std'])

In [None]:
hdi_2022_df.sort_values(by='hdi', ascending=False).head(1)[['country', 'hdi']]


In [None]:
hdi_2022_df.sort_values(by='hdi', ascending=True).head(1)[['country', 'hdi']]

5. Filtering and Sorting:
• Filter countries with HDI {"hdi"} greater than 0.800.
• Sort this filtered dataset by Gross National Income (GNI) per Capita {"gross_inc_percap"} in
descending order.
• Display the top 10 countries.

In [None]:
high_hdi_df = hdi_2022_df[hdi_2022_df['hdi'] > 0.800]
high_hdi_sorted = high_hdi_df.sort_values(by='gross_inc_percap', ascending=False)

high_hdi_sorted[['country', 'hdi', 'gross_inc_percap']].head(10)

6. Adding HDI Category Column:
• Create a new column HDI Category that classifies each country into one of the four official
Human Development Index groups. The classification should be based on the HDI value for the
year 2022. Use the following categories and thresholds defined by the United Nations Development
Programme (UNDP):

HDI Category    HDI Range (hdi)
Low             < 0.550
Medium          0.550 – 0.699
High            0.700 – 0.799
Very High       ≥ 0.800

After creating this new column:
• verify that all countries are classified correctly,
• ensure the updated dataframe includes the new category column.
• Save the final dataframe as HDI_category_added.csv and include this file in your final
submission.

In [None]:

def classify_hdi(hdi):
    if hdi < 0.550:
        return 'Low'
    elif hdi < 0.700:
        return 'Medium'
    elif hdi < 0.800:
        return 'High'
    else:
        return 'Very High'

hdi_2022_df['HDI Category'] = hdi_2022_df['hdi'].apply(classify_hdi)
hdi_2022_df['HDI Category'].value_counts()

hdi_2022_df.to_csv('HDI_category_added.csv', index=False)
hdi_2022_df[['country', 'hdi', 'HDI Category']].head(15)
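
One way to verify the classification at the threshold values is to compare it against an independent binning. The sketch below uses `pd.cut` on hand-picked boundary values; it is an alternative to the `classify_hdi` function above, not a replacement for it. Note one difference: with the if/elif function a missing HDI would fall through to the final `else` branch and be labelled 'Very High', whereas `pd.cut` leaves it missing.

```python
import numpy as np
import pandas as pd

# Boundary values plus one missing entry (illustrative, not taken from the dataset).
hdi_values = pd.Series([0.549, 0.550, 0.699, 0.700, 0.799, 0.800, np.nan])

# right=False makes each bin closed on the left: [0.550, 0.700), [0.700, 0.800), ...
categories = pd.cut(
    hdi_values,
    bins=[-np.inf, 0.550, 0.700, 0.800, np.inf],
    labels=["Low", "Medium", "High", "Very High"],
    right=False,
)
print(categories.tolist())
```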

3.2 Problem - 1B - HDI Visualization and Trend Analysis (2020 – 2022)
Objective:
Analyze multi-year HDI patterns (2020, 2021, and 2022) to explore temporal changes, regional differences,
and trends across countries.
Tasks:
Complete all the following tasks:

1. Data Extraction and Saving:
• Filter the dataset to include only the years 2020, 2021, and 2022.
• Save the filtered dataset as HDI_problem1B.csv.
• Use this cleaned dataset for all subsequent tasks in Problem 1B.

In [None]:
years = [2020, 2021, 2022]
# .copy() avoids SettingWithCopyWarning when columns are modified later
hdi_1B_df = df[df['year'].isin(years)].copy()
hdi_1B_df.to_csv("HDI_problem1B.csv", index=False)
hdi_1B_df.shape

2. Data Cleaning:
• Check for missing values in the following essential columns:
– hdi
– country
– year

In [None]:
hdi_1B_df[['hdi', 'country', 'year']].isnull().sum()

In [None]:
hdi_1B_df.dtypes

• Apply and justify cleaning steps, including:
– handling missing values (dropping or imputing),
– converting data types appropriately,
– removing duplicate entries,
– ensuring consistent naming conventions for countries and years.

In [None]:
hdi_1B_df = hdi_1B_df.dropna(subset=['hdi'])

Rows with a missing country or year cannot be used and would normally be removed. HDI is a composite index rather than a raw measurement, so imputing it would invent the very quantity being analysed; rows with a missing HDI were therefore dropped rather than imputed.

In [None]:
non_countries = [
    'World',
    'Arab States',
    'South Asia',
    'Sub-Saharan Africa',
    'Europe and Central Asia',
    'East Asia and the Pacific',
    'Latin America and the Caribbean',
    'Very high human development',
    'High human development',
    'Medium human development',
    'Low human development'
]

hdi_1B_df = hdi_1B_df[~hdi_1B_df['country'].isin(non_countries)]

hdi_1B_df['country'].unique()

The dataset now contains only country-level rows: the human development groups and the regional aggregates (for example, "World" and "Arab States") have been removed.

In [None]:
hdi_1B_df.duplicated().sum()


There are no duplicate rows in the dataset (rows are compared across all columns).

In [None]:

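# Treat placeholder strings as missing, then convert HDI to a numeric dtype
hdi_1B_df['hdi'] = hdi_1B_df['hdi'].replace(['-', '–', '_', ''], pd.NA)
hdi_1B_df['hdi'] = pd.to_numeric(hdi_1B_df['hdi'], errors='coerce')
# Drop rows whose HDI only became missing after the conversion
hdi_1B_df = hdi_1B_df.dropna(subset=['hdi'])
hdi_1B_df['hdi'].isnull().sum()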

In [None]:
hdi_1B_df[['hdi']].dtypes


This ensures HDI is numeric and suitable for analysis.
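The order of the cleaning steps matters when placeholders are stored as text. The sketch below, on made-up data, shows that dropping missing values before the numeric conversion leaves a "-" placeholder in place, while converting first lets a single `dropna` remove every unusable row.

```python
import pandas as pd

# Toy data (hypothetical): one true null and one "-" placeholder.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "hdi": ["0.81", "-", None, "0.62"],
})

# Dropping first only removes the true null; the "-" string survives.
dropped_first = toy.dropna(subset=["hdi"])

# Coercing first turns "-" into NaN, so one dropna removes both bad rows.
coerced = toy.assign(hdi=pd.to_numeric(toy["hdi"], errors="coerce"))
clean = coerced.dropna(subset=["hdi"])

print(len(dropped_first), len(clean))
```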

3. Visualization Tasks:
• A. Line Chart — HDI Trend (Country-Level):
– Select any five countries (or five countries from a region of your choice).
– Plot HDI values for each country across the years 2020, 2021, and 2022.
– Ensure the chart includes appropriate axis labels, a legend, and an informative caption.

In [None]:
countries = ['Nepal', 'Ireland', 'Brazil', 'United States', 'Thailand']
plt.figure(figsize=(8,5))
plot_df = hdi_1B_df[hdi_1B_df['country'].isin(countries)]
for country in countries:
    data = plot_df[plot_df['country'] == country]
    plt.plot(data['year'], data['hdi'], marker='o', label=country)

plt.xlabel("Year")
plt.ylabel("HDI Value")
plt.title("HDI Trend of Selected Countries (2020–2022)")
plt.xticks([2020, 2021, 2022])
plt.legend()
plt.grid(linestyle ='--')
plt.show()

• B. Generate Visualizations:
– Bar Chart: Average HDI by Region (2020–2022)
∗ Group the dataset by Region and Year.
∗ Compute the mean HDI for each region-year pair.
∗ Plot a bar chart comparing average HDI across regions for each year.
∗ Label axes clearly and include a descriptive title.

In [None]:
region_map = {
    #South-Asia
    'Afghanistan': 'South Asia',
    'Bangladesh': 'South Asia',
    'Bhutan': 'South Asia',
    'India': 'South Asia',
    'Maldives': 'South Asia',
    'Nepal': 'South Asia',
    'Pakistan': 'South Asia',
    'Sri Lanka': 'South Asia',

    #East-Asia and Pacific
    'China': 'East Asia & Pacific',
    'Japan': 'East Asia & Pacific',
    'Australia': 'East Asia & Pacific',
    'New Zealand': 'East Asia & Pacific',
    'Indonesia': 'East Asia & Pacific',
    'Philippines': 'East Asia & Pacific',
    'South Korea': 'East Asia & Pacific',
    'Thailand': 'East Asia & Pacific',

    #Europe and Central Asia
    'Germany': 'Europe & Central Asia',
    'France': 'Europe & Central Asia',
    'United Kingdom': 'Europe & Central Asia',
    'Russia': 'Europe & Central Asia',
    'Italy': 'Europe & Central Asia',
    'Spain': 'Europe & Central Asia',

    #Latin America and Caribbean
    'Brazil': 'Latin America & Caribbean',
    'Mexico': 'Latin America & Caribbean',
    'Argentina': 'Latin America & Caribbean',
    'Chile': 'Latin America & Caribbean',
    'Colombia': 'Latin America & Caribbean',

    #Sub-Saharan Africa
    'Nigeria': 'Sub-Saharan Africa',
    'South Africa': 'Sub-Saharan Africa',
    'Kenya': 'Sub-Saharan Africa',
    'Ghana': 'Sub-Saharan Africa',
    'Ethiopia': 'Sub-Saharan Africa',

    #North America
    'United States': 'North America',
    'Canada': 'North America',

    #Arab States
    'Egypt': 'Arab States',
    'Saudi Arabia': 'Arab States',
    'United Arab Emirates': 'Arab States',
    'Jordan': 'Arab States',
}
hdi_1B_df['region'] = hdi_1B_df['country'].map(region_map)
hdi_region = hdi_1B_df.dropna(subset=['region'])
hdi_region[['country', 'region']].drop_duplicates().head(20)
     

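The dataset does not include a region column, so one is created from a manual country-to-region mapping. The mapping covers only a selection of countries, so the regional averages below are based on those countries alone.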

In [None]:
region_hdi = hdi_region.groupby(['region', 'year'])['hdi'].mean().unstack()

# Let pandas create the figure; a separate plt.figure() call would leave an empty figure behind
region_hdi.plot(kind='bar', figsize=(12, 6))

plt.xlabel('Region')
plt.ylabel('Average HDI')
plt.title('Average HDI by Region (2020-2022)')
plt.xticks(rotation=45, ha='right')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', title='Year')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()

plt.show()

– Box Plot: HDI Distribution for 2020, 2021, and 2022
∗ Filter the dataset for the years 2020, 2021, and 2022.
∗ Create a box plot showing HDI spread for each of the three years.
∗ Include titles and axis labels.
∗ Comment briefly on distribution differences.

In [None]:
box_df = hdi_1B_df[hdi_1B_df['year'].isin([2020, 2021, 2022])]

# boxplot(..., by=...) creates its own figure, so no separate plt.figure() call is needed
box_df.boxplot(column='hdi', by='year', figsize=(8, 5))
plt.xlabel("Year")
plt.ylabel("HDI Value")
plt.title("HDI Distribution (2020–2022)")
plt.suptitle("")
plt.show()



The box plot shows that the median HDI increases slightly over time.
The spread of HDI values remains similar across the three years, which indicates that inequality between countries stayed broadly constant.

In [None]:
scatter_df = hdi_1B_df[hdi_1B_df['year'].isin([2020, 2021, 2022])]

plt.figure(figsize=(10, 6))
sns.scatterplot(data=scatter_df, x='gross_inc_percap', y='hdi', hue='year', palette='Set2')
plt.xlabel("GNI per Capita")
plt.ylabel("HDI")
plt.title("Scatter Plot: HDI vs GNI per Capita (2020–2022)")
plt.grid(linestyle = '--')
plt.legend(title='Year')
plt.show()

The scatter plot shows a positive relationship between income and human development: countries with higher GNI per capita typically have higher HDI values. Most countries cluster in the mid-to-high HDI range, while the lowest-income countries have the lowest HDI values. The points for 2020, 2021, and 2022 largely overlap, indicating that HDI was comparatively stable over the period, with only small gains for some countries.

4. Short Analysis Questions:
• Which countries show the greatest improvement in HDI from 2020 to 2022?
• Did any countries experience a decline in HDI? Provide possible reasons.
• Which region has the highest and lowest average HDI across these three years?


In [None]:
df_hdi_change = hdi_1B_df[hdi_1B_df['year'].isin([2020, 2022])]
hdi_pivot = df_hdi_change.pivot(index='country', columns='year', values='hdi')
hdi_pivot.columns = ['HDI_2020', 'HDI_2022']
hdi_pivot = hdi_pivot.dropna()
hdi_pivot['HDI_change'] = hdi_pivot['HDI_2022'] - hdi_pivot['HDI_2020']
top_improved = hdi_pivot.sort_values(by='HDI_change', ascending=False).head(10)
top_improved
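
The reshaping step above can be illustrated on a tiny made-up table. This sketch renames the pivoted columns by key rather than by position, which keeps the labels correct even if a year were missing from the data; the country names and values are hypothetical.

```python
import pandas as pd

# Toy long-format data; country "C" has no 2022 value.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "year": [2020, 2022, 2020, 2022, 2020],
    "hdi": [0.700, 0.712, 0.650, 0.641, 0.500],
})

# One row per country, one column per year.
wide = toy.pivot(index="country", columns="year", values="hdi")
wide = wide.rename(columns={2020: "HDI_2020", 2022: "HDI_2022"}).dropna()
wide["HDI_change"] = wide["HDI_2022"] - wide["HDI_2020"]

print(wide.sort_values("HDI_change", ascending=False))
```

Country "C" is excluded because a change cannot be computed without both years.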
     

In [None]:
decline_hdi = hdi_pivot[hdi_pivot['HDI_change'] < 0]
decline_hdi_sorted = decline_hdi.sort_values(by='HDI_change')
decline_hdi_sorted

A small number of countries saw a decline in HDI between 2020 and 2022, likely reflecting the effects of COVID-19, such as reduced life expectancy, economic contraction, and disruptions to education that adversely affected key HDI components.

In [None]:
avg_hdi_region = hdi_region.groupby('region')['hdi'].mean().sort_values(ascending=False)
avg_hdi_region

Among the regions defined in the manual mapping, North America has the highest average HDI (0.928), whereas Sub-Saharan Africa has the lowest (0.591).

• Discuss how global events (e.g., the COVID-19 pandemic) may have affected HDI trends during
this period.

The COVID-19 pandemic disrupted global health, education, and economic conditions, leading to a stagnation in HDI from 2020 to 2022, with effects differing markedly across regions.

4 Problem 2
Advanced HDI Exploration
Objective:

Perform advanced analysis of HDI data, focusing on South Asian countries, composite metrics, outlier detection,
metric relationships, and gap analysis.

Tasks:
Complete all the following tasks:
1. Create South Asia Subset:
• Define the list of South Asian countries: ["Afghanistan", "Bangladesh", "Bhutan", "India",
"Maldives", "Nepal", "Pakistan", "Sri Lanka"].
• Filter the HDI dataset to include only these countries.
• Save the filtered dataset as HDI_SouthAsia.csv and include this file in the final submission.

In [None]:
south_asia = ["Afghanistan", "Bangladesh", "Bhutan", "India", "Maldives", "Nepal", "Pakistan", "Sri Lanka"]

south_asia_df = hdi_1B_df[hdi_1B_df['country'].isin(south_asia)].copy()
south_asia_df.to_csv("HDI_SouthAsia.csv", index=False)
# Keep shape as the last expression so the notebook displays it
south_asia_df.shape

2. Composite Development Score:
• Create a new metric called Composite Score using the formula:

Composite Score = 0.30 × Life Expectancy Index + 0.30 × GNI per Capita Index
Here: Life Expectancy Index → "life_expectancy" and GNI per Capita Index → "gross_inc_percap"
• Rank South Asian countries based on Composite Score.
• Plot the top 5 countries in a horizontal bar chart.
• Compare the ranking of countries by Composite Score with their HDI ranking and discuss any
differences.
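A worked example of the formula on made-up numbers may help. It assumes, as the cells below do, that the two indices are obtained by min-max scaling the raw columns to the 0-1 range; the helper name `min_max` and all values are illustrative. Because the two weights sum to 0.60, the score can be at most 0.60.

```python
import pandas as pd

# Toy inputs (hypothetical values).
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "life_expectancy": [60.0, 70.0, 80.0],
    "gross_inc_percap": [2000.0, 12000.0, 4500.0],
})

def min_max(series):
    """Rescale a series linearly so its minimum is 0 and its maximum is 1."""
    return (series - series.min()) / (series.max() - series.min())

toy["composite_score"] = (
    0.30 * min_max(toy["life_expectancy"])
    + 0.30 * min_max(toy["gross_inc_percap"])
)
print(toy.sort_values("composite_score", ascending=False))
```

Country B scores 0.30 × 0.5 + 0.30 × 1.0 = 0.45, country C scores 0.30 × 1.0 + 0.30 × 0.25 = 0.375, and country A scores 0.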

In [None]:
# Min-max scale both inputs to the 0-1 range within the South Asia subset
le_min = south_asia_df['life_expectancy'].min()
le_max = south_asia_df['life_expectancy'].max()
south_asia_df['life_exp_index'] = (south_asia_df['life_expectancy'] - le_min) / (le_max - le_min)

gni_min = south_asia_df['gross_inc_percap'].min()
gni_max = south_asia_df['gross_inc_percap'].max()
south_asia_df['gni_index'] = (south_asia_df['gross_inc_percap'] - gni_min) / (gni_max - gni_min)

In [None]:
south_asia_df.loc[:, 'composite_score'] = (0.30 * south_asia_df['life_exp_index'] + 0.30 * south_asia_df['gni_index'])
south_asia_2022 = south_asia_df[south_asia_df['year'] == 2022]
composite_ranking = south_asia_2022.sort_values(by='composite_score', ascending=False)
composite_ranking[['country', 'composite_score']]

In [None]:
top5 = composite_ranking.head(5)

plt.figure(figsize=(8, 5))
plt.barh(top5['country'], top5['composite_score'])
plt.xlabel("Composite Development Score")
plt.title("Top 5 South Asian Countries by Composite Score (2022)")
plt.gca().invert_yaxis()
plt.show()

In [None]:
comparison = south_asia_2022.sort_values(by='hdi', ascending=False)[['country', 'hdi', 'composite_score']]
print(comparison)

When comparing HDI and Composite Score rankings, we can see that most countries are ranked in a similar way. However, some differences appear because the Composite Score does not include education as a factor. For example, Maldives ranks highest in Composite Score mainly due to its high income per capita, while Sri Lanka ranks highest in HDI because it performs better in education. This shows that country rankings can change depending on which development indicators are considered.

3. Outlier Detection:
• Detect outliers in HDI and GNI per Capita using the 1.5 × IQR rule.
• Create a scatter plot of GNI per Capita vs HDI, highlighting the outliers in a different color.
• Discuss why the identified countries stand out as outliers.

In [None]:
def detect_outliers(series):
    """Return a boolean mask that is True where a value lies outside the 1.5 × IQR fences."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    return (series < lower_bound) | (series > upper_bound)

In [None]:
south_asia_2022 = south_asia_df[south_asia_df['year'] == 2022].copy()
south_asia_2022['hdi_outlier'] = detect_outliers(south_asia_2022['hdi'])
south_asia_2022['gni_outlier'] = detect_outliers(south_asia_2022['gross_inc_percap'])

south_asia_2022['outlier'] = south_asia_2022['hdi_outlier'] | south_asia_2022['gni_outlier']

south_asia_2022[['country', 'hdi', 'gross_inc_percap', 'outlier']]

In [None]:
plt.figure(figsize=(8, 6))

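# Plot regular points and outliers separately so each gets its own color and legend entry
is_outlier = south_asia_2022['outlier']
plt.scatter(south_asia_2022.loc[~is_outlier, 'gross_inc_percap'],
            south_asia_2022.loc[~is_outlier, 'hdi'], label='Normal')
plt.scatter(south_asia_2022.loc[is_outlier, 'gross_inc_percap'],
            south_asia_2022.loc[is_outlier, 'hdi'], color='red', label='Outlier')

plt.xlabel("GNI per Capita")
plt.ylabel("HDI")
plt.title("HDI vs GNI per Capita (Outliers Highlighted)")
plt.legend()
plt.grid(linestyle='--')
plt.show()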


Applying the 1.5 × IQR rule to South Asian countries in 2022 did not reveal any outliers. This is expected given the small sample size and the relatively homogeneous HDI and GNI per capita values within the region.
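For reference, here is the 1.5 × IQR rule worked through on a small made-up series in which one value clearly stands apart. The numbers are illustrative only.

```python
import pandas as pd

# Toy values (hypothetical); 60 is far from the rest.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 60])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1
lower = q1 - 1.5 * iqr       # lower fence
upper = q3 + 1.5 * iqr       # upper fence

outliers = values[(values < lower) | (values > upper)]
print(q1, q3, lower, upper, outliers.tolist())
```

With only eight observations the quartiles are interpolated between neighbouring values, so the fences are wide and a value has to be far from the bulk of the data to be flagged.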

4. Exploring Metric Relationships:
• Select two HDI components (e.g., Gender Development Index {"gender development"} and Life
Expectancy Index {"life expectancy"}).
• Compute Pearson correlation of each metric with HDI.
• Create scatter plots with trendlines to visualize the relationships.
• Discuss:
– Which metric is most strongly related to HDI and shows the weakest relationship with HDI.

In [None]:
south_asia_rel = south_asia_df.copy()

In [None]:
corr_life = south_asia_rel['life_expectancy'].corr(south_asia_rel['hdi'])
corr_gender = south_asia_rel['gender_development'].corr(south_asia_rel['hdi'])

print(f"Life Expectancy vs HDI: {corr_life:.3f}")
print(f"Gender Development vs HDI: {corr_gender:.3f}")


In [None]:
data = south_asia_rel[['life_expectancy', 'hdi']].dropna()

plt.figure(figsize=(9,6))
plt.scatter(data['life_expectancy'], data['hdi'], c='green', s=60, edgecolor='black', alpha=0.6)

plt.xlabel('Life Expectancy')
plt.ylabel('HDI')
plt.title('Life Expectancy vs HDI')

coeffs = np.polyfit(data['life_expectancy'], data['hdi'], 1)
trend = np.poly1d(coeffs)
plt.plot(data['life_expectancy'], trend(data['life_expectancy']), color='blue', linewidth=2)

plt.grid(alpha=0.4)
plt.show()

Of the two metrics compared, life expectancy shows the stronger association with HDI.

In [None]:
subset = south_asia_rel[['gender_development', 'hdi']].dropna()
plt.figure(figsize=(8,6))
plt.scatter(subset['gender_development'], subset['hdi'], alpha=0.7)
plt.xlabel('Gender Development')
plt.ylabel('Human Development Index (HDI)')
plt.title('Gender Development vs HDI')

z = np.polyfit(subset['gender_development'], subset['hdi'], 1)
p = np.poly1d(z)
plt.plot(subset['gender_development'], p(subset['gender_development']), "r")

plt.grid(True, linestyle='--', alpha=0.5)
plt.show()


Gender Development Index shows a comparatively weaker relationship with HDI.
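To make the comparison of correlation strengths concrete, the sketch below computes Pearson's r from its definition and checks it against `Series.corr`, which uses the Pearson method by default. The two series are made-up values, not figures from the dataset.

```python
import numpy as np
import pandas as pd

# Toy paired observations (hypothetical values).
x = pd.Series([66.0, 68.0, 70.0, 72.0, 75.0])
y = pd.Series([0.55, 0.60, 0.62, 0.68, 0.74])

# Pearson r: sum of products of deviations, divided by the
# square root of the product of the summed squared deviations.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = x.corr(y)
print(round(r_manual, 4), round(r_pandas, 4))
```

Values of r near +1 or -1 indicate a strong linear relationship, and values near 0 a weak one; the sign gives the direction.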

5. Gap Analysis:
• Create a new metric:

GNI HDI Gap = "gross_inc_percap" − "hdi"
• Rank South Asian countries by GNI HDI Gap in descending and ascending order.
• Plot the top 3 positive gaps and top 3 negative gaps.
• Discuss the implications of the gap, e.g., cases where GNI is high but HDI is lower than expected.

In [None]:
south_asia_2022 = south_asia_df[south_asia_df['year'] == 2022].copy()
south_asia_2022['gni_hdi_gap'] = south_asia_2022['gross_inc_percap'] - south_asia_2022['hdi']
south_asia_2022[['country', 'gross_inc_percap', 'hdi', 'gni_hdi_gap']]

gap_asc = south_asia_2022.sort_values(by='gni_hdi_gap', ascending=True)
gap_asc[['country', 'gross_inc_percap', 'hdi', 'gni_hdi_gap']]

In [None]:
gap_desc = south_asia_2022.sort_values(by='gni_hdi_gap', ascending=False)
gap_desc[['country', 'gross_inc_percap', 'hdi', 'gni_hdi_gap']]

In [None]:
top_positive = gap_desc.head(3)
top_negative = gap_asc.head(3)

plot_data = pd.concat([top_positive, top_negative])

colors = ['green']*len(top_positive) + ['red']*len(top_negative)
plt.figure(figsize=(8,5))
plt.barh(plot_data['country'], plot_data['gni_hdi_gap'], color=colors)
plt.axvline(0, color='black', linewidth=0.8)

plt.xlabel('GNI-HDI Gap')
plt.ylabel('Country')
plt.title('Top 3 Positive and Negative GNI-HDI Gaps in South Asia')
plt.grid(axis='x', linestyle='--', alpha=0.6)
plt.show()


The GNI–HDI gap underscores the disparity between economic conditions and human development outcomes, illustrating that economic prosperity does not always translate directly into overall human well-being.
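If `gross_inc_percap` is expressed in currency units while HDI lies between 0 and 1 (an assumption about the dataset), the raw difference is dominated by income. A scale-free alternative, sketched below on made-up data, compares ranks instead; UNDP's own statistical tables report a similar "GNI per capita rank minus HDI rank" column. This is a different technique from the formula the task specifies and is shown only for comparison.

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gross_inc_percap": [20000.0, 6000.0, 12000.0, 3000.0],
    "hdi": [0.70, 0.72, 0.78, 0.55],
})

# Rank 1 = best. A positive gap means the country ranks better on HDI than on income.
toy["gni_rank"] = toy["gross_inc_percap"].rank(ascending=False).astype(int)
toy["hdi_rank"] = toy["hdi"].rank(ascending=False).astype(int)
toy["rank_gap"] = toy["gni_rank"] - toy["hdi_rank"]

print(toy[["country", "gni_rank", "hdi_rank", "rank_gap"]])
```

In this toy example country A is the richest but only third on HDI, giving a rank gap of -2: income that has not translated into a correspondingly high level of human development.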

5 Problem 3
Comparative Regional Analysis: South Asia vs Middle East
Objective:
Perform a comparative analysis of HDI and related metrics between South Asia and the Middle East using
the 2020–2022 dataset from Problem 1B.
Tasks:
Complete all the following tasks:

In [None]:
hdi = pd.read_csv("HDI_problem1B.csv")

1. Create Middle East Subset:
• Define the list of Middle East countries: ["Bahrain", "Iran", "Iraq", "Israel", "Jordan",
"Kuwait", "Lebanon", "Oman", "Palestine", "Qatar", "Saudi Arabia", "Syria",
"United Arab Emirates", "Yemen"].
• Filter the dataset from Problem 1B (HDI_problem1B.csv) to create subsets for South Asia and
Middle East.
• Save these subsets as HDI_SouthAsia_2020_2022.csv and HDI_MiddleEast_2020_2022.csv for
use in subsequent tasks.

In [None]:
south_asia_countries = ["Afghanistan", "Bangladesh", "Bhutan", "India", "Maldives", "Nepal", "Pakistan", "Sri Lanka"]
middle_east_countries = ["Bahrain", "Iran", "Iraq", "Israel", "Jordan",
               "Kuwait", "Lebanon", "Oman", "Palestine", "Qatar",
               "Saudi Arabia", "Syria", "United Arab Emirates", "Yemen"]

hdi_2020_2022 = hdi[hdi['year'].isin([2020, 2021, 2022])]

hdi_south_asia = hdi_2020_2022[hdi_2020_2022['country'].isin(south_asia_countries)]
hdi_middle_east = hdi_2020_2022[hdi_2020_2022['country'].isin(middle_east_countries)]


hdi_south_asia.to_csv("HDI_SouthAsia_2020_2022.csv", index=False)
hdi_middle_east.to_csv("HDI_MiddleEast_2020_2022.csv", index=False)
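
Filtering with `isin` silently skips any requested name that is not spelled exactly as in the dataset, so it can be worth checking which names found no match. A minimal sketch follows; `missing_names` is a hypothetical helper, and the spellings in the toy frame are illustrative rather than taken from the actual file.

```python
import pandas as pd

def missing_names(frame, wanted, column="country"):
    """Hypothetical helper: return the requested names that never appear in frame[column]."""
    present = set(frame[column].dropna().str.strip())
    return sorted(set(wanted) - present)

# Toy frame standing in for the real dataset (illustrative spellings).
toy = pd.DataFrame({"country": ["Jordan", "Iran (Islamic Republic of)", "Qatar "]})

unmatched = missing_names(toy, ["Jordan", "Iran", "Qatar"])
print(unmatched)
```

Any name reported here would need to be mapped to the dataset's spelling before the regional subset can be considered complete.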

2. Descriptive Statistics:
• Compute the mean and standard deviation of HDI for each region (South Asia vs Middle East)
across 2020–2022.
• Identify which region performs better on average.

In [None]:
south_asia = pd.read_csv("HDI_SouthAsia_2020_2022.csv")
middle_east = pd.read_csv("HDI_MiddleEast_2020_2022.csv")

# Compute the statistics from the frames that were just loaded
mean_sa = south_asia['hdi'].mean()
std_sa = south_asia['hdi'].std()

mean_me = middle_east['hdi'].mean()
std_me = middle_east['hdi'].std()

stats = pd.DataFrame({
    'Region': ['South Asia', 'Middle East'],
    'Mean HDI (2020-2022)': [mean_sa, mean_me],
    'HDI Standard Deviation (2020-2022)': [std_sa, std_me]
})
stats

As the data shows, the Middle East has a higher average HDI than South Asia, indicating better overall human development. Also, the Middle East has greater variation between countries.

3. Top and Bottom Performers:
• Identify the top 3 and bottom 3 countries in each region based on HDI.
• Create a bar chart comparing these top and bottom performers across the two regions.

In [None]:
sa_avg_hdi = south_asia.groupby('country')['hdi'].mean()
me_avg_hdi = middle_east.groupby('country')['hdi'].mean()

# Identify top 3 and bottom 3 countries in each region
sa_top = sa_avg_hdi.sort_values(ascending=False).head(3)
sa_bottom = sa_avg_hdi.sort_values().head(3)

me_top = me_avg_hdi.sort_values(ascending=False).head(3)
me_bottom = me_avg_hdi.sort_values().head(3)

# Combine into a single DataFrame for easier plotting
plot_df = pd.concat([
    sa_top.rename("HDI").to_frame().assign(Region="South Asia", Group="Top 3"),
    sa_bottom.rename("HDI").to_frame().assign(Region="South Asia", Group="Bottom 3"),
    me_top.rename("HDI").to_frame().assign(Region="Middle East", Group="Top 3"),
    me_bottom.rename("HDI").to_frame().assign(Region="Middle East", Group="Bottom 3")
]).reset_index()  # the groupby index is already named 'country'

In [None]:
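plot_df['Label'] = plot_df['country'] + " (" + plot_df['Region'] + ")"
# Color bars by group so top and bottom performers can be told apart
bar_colors = plot_df['Group'].map({'Top 3': 'seagreen', 'Bottom 3': 'indianred'}).tolist()

plt.figure(figsize=(8,6))
plt.barh(plot_df['Label'], plot_df['HDI'], color=bar_colors)
plt.xlabel("Average HDI")
plt.title("Top and Bottom HDI Performers (2020–2022)")
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.gca().invert_yaxis()
plt.show()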

4. Metric Comparisons:
• Compare the following metrics across regions using grouped bar charts:
– Gender Development Index {"gender_development"}
– Life Expectancy Index {"life_expectancy"}
– GNI per Capita Index {"gross_inc_percap"}
• Identify which metric shows the greatest disparity between regions.

In [None]:
metrics = ['gender_development', 'life_expectancy', 'gross_inc_percap']

sa_means = south_asia[metrics].mean()
me_means = middle_east[metrics].mean()


metric_df = pd.DataFrame({
    'Metric': metrics,
    'South Asia': sa_means.values,
    'Middle East': me_means.values
})

metric_df


In [None]:
x = np.arange(len(metrics))
width = 0.35

plt.figure(figsize=(9,7))
plt.bar(x - width/2, metric_df['South Asia'], width=width, label='South Asia', color='navy')
plt.bar(x + width/2, metric_df['Middle East'], width=width, label='Middle East', color='crimson')

plt.xticks(x, metric_df['Metric'], rotation=15)
plt.ylabel("Average Value")
plt.title("Comparison of Metrics Between South Asia and Middle East (2020–2022)")
plt.legend()
plt.show()


The metric with the greatest disparity between South Asia and the Middle East is GNI per capita (gross_inc_percap): average income per capita differs more between the two regions than either gender development or life expectancy.
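Because the three metrics are measured on very different scales, absolute differences are not directly comparable. One option is a relative difference, sketched below with hypothetical regional means chosen only to illustrate the calculation; they are not results from the dataset.

```python
import pandas as pd

# Hypothetical regional means for three metrics on different scales.
means = pd.DataFrame(
    {"South Asia": [0.90, 70.0, 6000.0], "Middle East": [0.88, 75.0, 30000.0]},
    index=["gender_development", "life_expectancy", "gross_inc_percap"],
)

means["abs_diff"] = (means["Middle East"] - means["South Asia"]).abs()
# Dividing by the average of the two means puts every metric on a unitless scale.
means["rel_diff"] = means["abs_diff"] / means[["South Asia", "Middle East"]].mean(axis=1)

print(means.round(3))
```

Ranking metrics by `rel_diff` rather than `abs_diff` prevents a metric from appearing most unequal simply because its units are large.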

5. HDI Disparity:
• Compute the range (max – min) of HDI for each region.
• Compute the coefficient of variation (CV = std/mean) for HDI.
• Identify which region exhibits more variation in HDI.

In [None]:
range_sa = south_asia['hdi'].max() - south_asia['hdi'].min()
range_me = middle_east['hdi'].max() - middle_east['hdi'].min()


cv_sa = south_asia['hdi'].std() / south_asia['hdi'].mean()
cv_me = middle_east['hdi'].std() / middle_east['hdi'].mean()

hdi_disparity = pd.DataFrame({
    'Region': ['South Asia', 'Middle East'],
    'HDI Range': [range_sa, range_me],
    'HDI Coefficient of Variation (CV)': [cv_sa, cv_me]
})

hdi_disparity


The Middle East exhibits more variation in HDI than South Asia, as indicated by its higher coefficient of variation.
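The coefficient of variation expresses spread relative to the average level, which is why it is preferred over the raw standard deviation when two groups have different means. A short sketch on made-up values (the helper name `coefficient_of_variation` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical HDI values for two groups.
region_a = pd.Series([0.50, 0.60, 0.70])
region_b = pd.Series([0.80, 0.85, 0.90])

def coefficient_of_variation(series):
    """Sample standard deviation divided by the mean."""
    return series.std() / series.mean()

cv_a = coefficient_of_variation(region_a)
cv_b = coefficient_of_variation(region_b)
print(round(cv_a, 3), round(cv_b, 3))
```

Here group A has both the larger standard deviation and the lower mean, so its CV is higher on both counts.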

6. Correlation Analysis:
• For each region, compute correlations of HDI with:
– Gender Development Index
– Life Expectancy Index
• Create scatter plots with trendlines for each correlation.
• Interpret the strength and direction of these relationships.

In [None]:
sa_corr_gender = south_asia['hdi'].corr(south_asia['gender_development'])
sa_corr_life = south_asia['hdi'].corr(south_asia['life_expectancy'])

me_corr_gender = middle_east['hdi'].corr(middle_east['gender_development'])
me_corr_life = middle_east['hdi'].corr(middle_east['life_expectancy'])

corr_df = pd.DataFrame({
    'Region': ['South Asia', 'South Asia', 'Middle East', 'Middle East'],
    'Metric': ['Gender Development', 'Life Expectancy', 'Gender Development', 'Life Expectancy'],
    'Correlation with HDI': [sa_corr_gender, sa_corr_life, me_corr_gender, me_corr_life]
})

corr_df

In [None]:
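# Drop incomplete pairs first: np.polyfit cannot fit a line through NaN values
pair = south_asia[['gender_development', 'hdi']].dropna()
x = pair['gender_development']
y = pair['hdi']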
plt.scatter(x, y)
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x + b, color='red', linewidth=2)
plt.xlabel("Gender Development")
plt.ylabel("HDI")
plt.title("South Asia: HDI vs Gender Development")
plt.grid(alpha=0.5)
plt.show()

In [None]:
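# Drop incomplete pairs first: np.polyfit cannot fit a line through NaN values
pair = south_asia[['life_expectancy', 'hdi']].dropna()
x = pair['life_expectancy']
y = pair['hdi']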
plt.scatter(x, y)
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x + b, color='red', linewidth=2)
plt.xlabel("Life Expectancy")
plt.ylabel("HDI")
plt.title("South Asia: HDI vs Life Expectancy")
plt.grid(alpha=0.5)
plt.show()


In [None]:
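# Drop incomplete pairs first: np.polyfit cannot fit a line through NaN values
pair = middle_east[['gender_development', 'hdi']].dropna()
x = pair['gender_development']
y = pair['hdi']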
plt.scatter(x, y)
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x + b, color='red', linewidth=2)
plt.xlabel("Gender Development")
plt.ylabel("HDI")
plt.title("Middle East: HDI vs Gender Development")
plt.grid(alpha=0.5)
plt.show()

In [None]:
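# Drop incomplete pairs first: np.polyfit cannot fit a line through NaN values
pair = middle_east[['life_expectancy', 'hdi']].dropna()
x = pair['life_expectancy']
y = pair['hdi']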
plt.scatter(x, y)
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x + b, color='red', linewidth=2)
plt.xlabel("Life Expectancy")
plt.ylabel("HDI")
plt.title("Middle East: HDI vs Life Expectancy")
plt.grid(alpha=0.5)
plt.show()


In both South Asia and the Middle East, HDI tends to be higher in countries with higher Gender Development and longer Life Expectancy. Of the two, Life Expectancy shows the stronger relationship with HDI, which is expected because life expectancy is one of the components from which HDI is calculated.

In [None]:
south_asia['hdi_outlier'] = detect_outliers(south_asia['hdi'])
south_asia['gni_outlier'] = detect_outliers(south_asia['gross_inc_percap'])
south_asia['outlier'] = (south_asia['hdi_outlier'] | south_asia['gni_outlier'])

middle_east['hdi_outlier'] = detect_outliers(middle_east['hdi'])
middle_east['gni_outlier'] = detect_outliers(middle_east['gross_inc_percap'])
middle_east['outlier'] = (middle_east['hdi_outlier'] | middle_east['gni_outlier'])

In [None]:
plt.figure(figsize=(8,5))

plt.scatter(
    south_asia.loc[~south_asia['outlier'], 'gross_inc_percap'],
    south_asia.loc[~south_asia['outlier'], 'hdi'],
    label='Normal',
    alpha=0.7
)

plt.scatter(
    south_asia.loc[south_asia['outlier'], 'gross_inc_percap'],
    south_asia.loc[south_asia['outlier'], 'hdi'],
    label='Outlier',
    marker='x'
)

plt.xlabel('GNI per Capita')
plt.ylabel('HDI')
plt.title('South Asia: HDI vs GNI per Capita (Outliers Highlighted)')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()

In [None]:
plt.figure(figsize=(7,5))

plt.scatter(
    middle_east.loc[~middle_east['outlier'], 'gross_inc_percap'],
    middle_east.loc[~middle_east['outlier'], 'hdi'],
    label='Normal',
    alpha=0.7
)

plt.scatter(
    middle_east.loc[middle_east['outlier'], 'gross_inc_percap'],
    middle_east.loc[middle_east['outlier'], 'hdi'],
    label='Outlier',
    marker='x'
)

plt.xlabel('GNI per Capita')
plt.ylabel('HDI')
plt.title('Middle East: HDI vs GNI per Capita (Outliers Highlighted)')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()


Outliers are countries whose HDI or GNI per capita is unusually high or low relative to the rest of their region under the 1.5 × IQR rule. They represent exceptional cases and can pull regional averages up or down.