
2517283 Prayash Shrestha
Assignment 1

**Statistical Interpretation and Exploratory Data Analysis of the Human Development Index (HDI)**

**Course:** Concepts and Technologies of AI (5CS037)  
**Dataset:** Human Development Index Dataset (1990–2022)




In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:

df = pd.read_csv('/content/drive/MyDrive/ConceptAndTechnologiesOfAI/Copy of Human_Development_Index_Dataset.csv',encoding='latin-1')
df.head()



## Question 1A: Single-Year HDI Exploration (2022)

This section focuses on understanding the structure, quality, and distribution of HDI values
for the most recent year available in the dataset.



### Task 1: Extract Latest Year

The dataset spans multiple years. To perform a focused single-year analysis, observations
corresponding to the year **2022** are isolated.


In [None]:

df['year'].unique()


In [None]:
hdi_2022_df = df[df['year'] == 2022].copy()


### Task 2: Data Exploration

Basic exploratory checks are conducted to understand the size of the dataset,
its variables, and their data types.


In [None]:
hdi_2022_df.head(10)


In [None]:

hdi_2022_df.shape


In [None]:

hdi_2022_df.info()



### Task 3: Missing Values and Data Cleaning

The dataset is inspected for missing values, non-numeric symbols, and duplicate records.

Cleaning steps applied:
- Special characters representing missing values are replaced with NaN
- Numeric variables are converted to appropriate numeric formats
- Duplicate rows are removed
- Rows with missing HDI values are dropped, as HDI is the primary variable of interest


In [None]:

hdi_2022_df.isna().sum()


In [None]:

hdi_2022_df = hdi_2022_df.replace("–", np.nan)

numeric_cols = ['hdi','gross_inc_percap','life_expectancy','gender_development']
for c in numeric_cols:
    if c in hdi_2022_df.columns:
        hdi_2022_df[c] = pd.to_numeric(hdi_2022_df[c], errors='coerce')

hdi_2022_df.drop_duplicates(inplace=True)
hdi_2022_df.dropna(subset=['hdi'], inplace=True)


Missing HDI values were removed because HDI is the primary variable of analysis and cannot be reliably imputed.
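
The four cleaning steps listed above can also be bundled into one reusable function. The following is a minimal sketch on a small made-up frame; the helper name `clean_hdi_frame` and all toy values are illustrative assumptions and do not come from the dataset.

```python
import numpy as np
import pandas as pd

def clean_hdi_frame(frame, numeric_cols, missing_token="–"):
    """Hypothetical helper: apply the four cleaning steps to a copy of `frame`."""
    out = frame.replace(missing_token, np.nan)      # 1. placeholder symbol -> NaN
    for col in numeric_cols:                        # 2. coerce to a numeric dtype
        if col in out.columns:
            out[col] = pd.to_numeric(out[col], errors="coerce")
    out = out.drop_duplicates()                     # 3. remove exact duplicate rows
    return out.dropna(subset=["hdi"])               # 4. drop rows without an HDI value

# Toy data (invented values, for illustration only)
toy = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "hdi": ["0.700", "0.700", "–", "0.550"],
    "life_expectancy": ["71.2", "71.2", "65.0", "–"],
})
cleaned = clean_hdi_frame(toy, ["hdi", "life_expectancy"])
print(cleaned)
```

In this toy case the duplicate row for country A and the row for country B (no HDI) are removed, while country C is kept even though its life expectancy is missing.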


### Task 4: Descriptive Statistics

Summary statistics provide an overview of the central tendency and dispersion
of HDI values in 2022. Countries with the highest and lowest HDI are also identified.


In [None]:

hdi_2022_df['hdi'].agg(['mean','median','std'])


In [None]:

hdi_2022_df.loc[hdi_2022_df['hdi'].idxmax(), ['country','hdi']]


In [None]:

hdi_2022_df.loc[hdi_2022_df['hdi'].idxmin(), ['country','hdi']]



### Task 5: Filtering and Sorting

Countries classified as having **very high human development** (HDI ≥ 0.800)
are filtered and ranked in descending order of Gross National Income (GNI) per capita.


In [None]:

top_hdi = hdi_2022_df[hdi_2022_df['hdi'] >= 0.8].sort_values('gross_inc_percap', ascending=False)
top_hdi.head(10)



### Task 6: HDI Category Classification

Countries are classified into four official UNDP HDI categories using standard threshold values.
The updated dataset is saved for later use.
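
The same thresholds can also be applied in vectorized form with `pd.cut`. This is a sketch of an alternative, not the approach the notebook uses; the sample values are invented to sit on and around the category boundaries.

```python
import numpy as np
import pandas as pd

# Bin edges follow the category cut-offs; right=False makes each bin [lower, upper)
bins = [-np.inf, 0.55, 0.70, 0.80, np.inf]
labels = ["Low", "Medium", "High", "Very High"]

sample_hdi = pd.Series([0.549, 0.550, 0.699, 0.700, 0.800, 0.962])  # invented values
categories = pd.cut(sample_hdi, bins=bins, labels=labels, right=False)
print(categories.tolist())
```

With `right=False`, a value exactly on a threshold (for example 0.800) falls into the higher category, which matches the `<` comparisons in `hdi_category`.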


In [None]:

def hdi_category(h):
    if h < 0.55: return 'Low'
    elif h < 0.70: return 'Medium'
    elif h < 0.80: return 'High'
    else: return 'Very High'

hdi_2022_df['HDI Category'] = hdi_2022_df['hdi'].apply(hdi_category)
hdi_2022_df['HDI Category'].value_counts()


In [None]:

hdi_2022_df.to_csv('HDI_category_added.csv', index=False)



## Question 1B: HDI Visualization and Trend Analysis (2020–2022)

This section examines short-term HDI trends, regional differences,
and relationships with economic indicators.



### Task 1: Data Extraction and Saving


In [None]:

hdi_1b = df[df['year'].isin([2020,2021,2022])].copy()
hdi_1b.to_csv('HDI_problem1B.csv', index=False)



### Task 2: Data Cleaning

Data cleaning ensures consistency across years and countries before visualization
and comparative analysis.


In [None]:

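# Replace non-numeric symbols
hdi_1b.replace("–", np.nan, inplace=True)

# Convert data types (same numeric columns as in Question 1A)
for c in ['hdi','gross_inc_percap','life_expectancy','gender_development']:
    if c in hdi_1b.columns:
        hdi_1b[c] = pd.to_numeric(hdi_1b[c], errors='coerce')

# Standardize country names before de-duplicating, so that names
# differing only in surrounding whitespace are treated as the same
hdi_1b['country'] = hdi_1b['country'].str.strip()

# Drop rows with missing essential values
hdi_1b.dropna(subset=['hdi', 'country', 'year'], inplace=True)

# Remove duplicate rows
hdi_1b.drop_duplicates(inplace=True)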



### Task 3A: Line Chart – Country-Level HDI Trends

HDI trajectories for five selected countries are visualized to highlight
year-to-year changes.


In [None]:

countries = ['Nepal','India','Bangladesh','Sri Lanka','Pakistan']
subset = hdi_1b[hdi_1b['country'].isin(countries)]

plt.figure()
sns.lineplot(data=subset, x='year', y='hdi', hue='country', marker='o')
plt.title('HDI Trends for Selected Countries (2020–2022)')
plt.xlabel('Year')
plt.ylabel('HDI')
plt.show()



### Task 3B-1: Bar Chart – Average HDI by Region

Regional averages are compared to assess disparities in human development
across different parts of the world.
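
When regions are assigned through a manual dictionary, any country missing from the dictionary receives `NaN` and is silently excluded from the grouped averages. The sketch below illustrates this on invented data; `region_lookup` and the country labels are hypothetical.

```python
import pandas as pd

# Hypothetical mapping that covers only some of the countries
region_lookup = {"A": "Region 1", "B": "Region 1", "C": "Region 2"}

toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],      # "D" is missing from the mapping
    "hdi":     [0.60, 0.70, 0.80, 0.90],  # invented values
})
toy["region"] = toy["country"].map(region_lookup)

# Countries that did not receive a region
unmapped = toy.loc[toy["region"].isna(), "country"].tolist()

# groupby drops rows whose group key is NaN by default
region_mean = toy.groupby("region")["hdi"].mean()
print(unmapped)
print(region_mean)
```

Listing the unmapped countries before plotting makes it clear which part of the data the regional averages actually cover.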


In [None]:
region_map = {
    # South Asia
    'Afghanistan': 'South Asia',
    'Bangladesh': 'South Asia',
    'Bhutan': 'South Asia',
    'India': 'South Asia',
    'Maldives': 'South Asia',
    'Nepal': 'South Asia',
    'Pakistan': 'South Asia',
    'Sri Lanka': 'South Asia',

    # Middle East
    'Bahrain': 'Middle East',
    'Iran': 'Middle East',
    'Iraq': 'Middle East',
    'Israel': 'Middle East',
    'Jordan': 'Middle East',
    'Kuwait': 'Middle East',
    'Lebanon': 'Middle East',
    'Oman': 'Middle East',
    'Palestine': 'Middle East',
    'Qatar': 'Middle East',
    'Saudi Arabia': 'Middle East',
    'Syria': 'Middle East',
    'United Arab Emirates': 'Middle East',
    'Yemen': 'Middle East',

    # Others
    'United States': 'North America',
    'Canada': 'North America',
    'Germany': 'Europe',
    'France': 'Europe',
    'United Kingdom': 'Europe',
    'China': 'East Asia',
    'Japan': 'East Asia'
}


In [None]:
hdi_1b['region'] = hdi_1b['country'].map(region_map)


In [None]:

region_avg = (
    hdi_1b
    .groupby(['region','year'])['hdi']
    .mean()
    .reset_index()
)

plt.figure(figsize=(10,5))
sns.barplot(data=region_avg, x='region', y='hdi', hue='year')
plt.xticks(rotation=45)
plt.title('Average HDI by Region (2020–2022)')
plt.show()




### Task 3B-2: Box Plot – HDI Distribution by Year

Box plots illustrate the spread, median, and variability of HDI values
for each year.
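
The quantities a box plot draws can be computed directly. The sketch below, on invented values, derives the quartiles and interquartile range per year; the column names mirror the notebook, but the numbers are made up.

```python
import pandas as pd

# Invented HDI values for two years (illustration only)
toy = pd.DataFrame({
    "year": [2020] * 5 + [2021] * 5,
    "hdi":  [0.40, 0.50, 0.60, 0.70, 0.80,
             0.45, 0.55, 0.65, 0.75, 0.85],
})

# Quartiles per year: the box spans Q1..Q3 and the line inside it is the median
summary = toy.groupby("year")["hdi"].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ["Q1", "median", "Q3"]
summary["IQR"] = summary["Q3"] - summary["Q1"]
print(summary)
```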


In [None]:

plt.figure()
sns.boxplot(data=hdi_1b, x='year', y='hdi')
plt.title('HDI Distribution (2020–2022)')
plt.show()



### Task 3B-3: Scatter Plot – HDI vs GNI per Capita

The relationship between economic prosperity and human development
is examined using a scatter plot with a regression line.


In [None]:

if 'gross_inc_percap' in hdi_1b.columns:
    sns.regplot(data=hdi_1b, x='gross_inc_percap', y='hdi')
    plt.title('HDI vs GNI per Capita')
    plt.show()
else:
    print("GNI per Capita variable not available in the dataset.")



### Task 4: Short Analysis Metrics

Changes in HDI between 2020 and 2022 are computed to identify
countries with the most significant improvement or decline.
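
As a minimal sketch of this computation, the example below reshapes invented long-format data to one row per country and reads off both the largest improvement and the largest decline. The country labels and values are hypothetical.

```python
import pandas as pd

# Invented long-format data: one row per country and year
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "year":    [2020, 2022, 2020, 2022, 2020, 2022],
    "hdi":     [0.600, 0.620, 0.700, 0.690, 0.500, 0.505],
})

# Wide format: one row per country, one column per year
wide = toy.pivot(index="country", columns="year", values="hdi")
wide["change"] = wide[2022] - wide[2020]

most_improved = wide["change"].idxmax()    # largest positive change
largest_decline = wide["change"].idxmin()  # most negative change
print(most_improved, largest_decline)
```

Note that `pivot` raises an error if the same country and year pair occurs more than once, so duplicates need to be removed beforehand.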


**Countries with greatest improvement**

Countries such as Nepal and Bangladesh show noticeable HDI improvement between 2020 and 2022.

**Countries with decline**

Some countries exhibit stagnation or slight decline, potentially due to economic disruption and healthcare strain.

**Highest and lowest regions**

The Middle East has the highest average HDI, while South Asia has the lowest among the analyzed regions.

**COVID-19 impact**

The COVID-19 pandemic likely slowed HDI progress by affecting life expectancy, education access, and income levels globally.

In [None]:

change = hdi_1b.pivot(index='country', columns='year', values='hdi')
change['HDI Change (2020–2022)'] = change[2022] - change[2020]
change.sort_values('HDI Change (2020–2022)', ascending=False).head()



## Question 2: Advanced HDI Exploration

This section focuses on South Asian countries and explores composite indicators,
outliers, correlations, and development gaps.



### Task 1: South Asia Subset


In [None]:

south_asia = ['Afghanistan','Bangladesh','Bhutan','India','Maldives','Nepal','Pakistan','Sri Lanka']
sa_df = df[df['country'].isin(south_asia)].copy()
sa_df.to_csv('HDI_SouthAsia.csv', index=False)



### Task 2: Composite Development Score

A composite score is constructed using life_expectancy and income indicators
to provide an alternative development ranking.
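
Life expectancy is measured in years and GNI per capita in currency units, so a weighted sum of the raw columns tends to be dominated by income. One common remedy, shown here as an assumption rather than as the assignment's prescribed formula, is to min-max scale each indicator to the range 0 to 1 before weighting. The helper `min_max` and all values are illustrative.

```python
import pandas as pd

# Invented values for three countries (illustration only)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "life_expectancy": [60.0, 70.0, 80.0],
    "gross_inc_percap": [2000.0, 12000.0, 5000.0],
})

def min_max(series):
    """Rescale a numeric Series linearly so its minimum maps to 0 and its maximum to 1."""
    return (series - series.min()) / (series.max() - series.min())

# Same 0.3 / 0.3 weights, applied to the rescaled columns
toy["scaled_score"] = (0.3 * min_max(toy["life_expectancy"])
                       + 0.3 * min_max(toy["gross_inc_percap"]))
ranking = toy.sort_values("scaled_score", ascending=False)["country"].tolist()
print(ranking)
```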


In [None]:
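# Composite score = 0.3 * life expectancy + 0.3 * GNI per capita
for c in ['life_expectancy', 'gross_inc_percap']:
    sa_df[c] = pd.to_numeric(sa_df[c], errors='coerce')
sa_df['Composite Score'] = 0.3*sa_df['life_expectancy'] + 0.3*sa_df['gross_inc_percap']

# Average across years so that each country appears once in the ranking
sa_ranked = (sa_df.groupby('country', as_index=False)['Composite Score']
             .mean().sort_values('Composite Score', ascending=False))

plt.figure()
sns.barplot(data=sa_ranked.head(5), x='Composite Score', y='country')
plt.title('Top 5 South Asian Countries by Composite Score')
plt.show()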



### Task 3: Outlier Detection

The interquartile range (IQR) method is applied to detect unusually high
or low HDI values.
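
The IQR rule flags any value below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. A minimal sketch follows; the function name `iqr_outliers` and the sample values are illustrative assumptions.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the entries of `values` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Invented HDI-like values; 0.95 sits far above the rest
sample = pd.Series([0.50, 0.52, 0.54, 0.56, 0.58, 0.95])
flagged = iqr_outliers(sample)
print(flagged.tolist())
```

Here Q1 = 0.525 and Q3 = 0.575, so the fences are 0.45 and 0.65, and only 0.95 is flagged.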


In [None]:

Q1, Q3 = sa_df['hdi'].quantile([0.25,0.75])
IQR = Q3 - Q1
outliers = sa_df[(sa_df['hdi'] < Q1-1.5*IQR)|(sa_df['hdi'] > Q3+1.5*IQR)]

sns.scatterplot(data=sa_df, x='gross_inc_percap', y='hdi', label='Normal')
sns.scatterplot(data=outliers, x='gross_inc_percap', y='hdi', color='red', label='Outliers')
plt.title('Outlier Detection: HDI vs GNI per Capita')
plt.show()



### Task 4: Metric Relationships and Correlation

Pearson correlation coefficients and scatter plots are used to examine
relationships between HDI and selected component indicators.
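
For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on invented data and compares the result with `Series.corr`; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Invented paired observations (illustration only)
toy = pd.DataFrame({
    "life_expectancy": [60.0, 65.0, 70.0, 75.0, 80.0],
    "hdi":             [0.50, 0.58, 0.63, 0.72, 0.80],
})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the mean
x = toy["life_expectancy"] - toy["life_expectancy"].mean()
y = toy["hdi"] - toy["hdi"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# pandas computes the same quantity with Series.corr (Pearson by default)
r_pandas = toy["hdi"].corr(toy["life_expectancy"])
print(round(r_manual, 4), round(r_pandas, 4))
```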


In [None]:

for m in ['gender_development','life_expectancy']:
    if m in sa_df.columns:
        print(f'Correlation between HDI and {m}:',
              sa_df[['hdi',m]].corr().iloc[0,1])
        sns.regplot(data=sa_df, x=m, y='hdi')
        plt.title(f'HDI vs {m}')
        plt.show()



### Task 5: GNI–HDI Gap Analysis

The difference between income and HDI values highlights cases where
economic performance does not align with human development outcomes.


In [None]:

sa_df['GNI_HDI_Gap'] = sa_df['gross_inc_percap'] - sa_df['hdi']
# Average the gap per country so that each country appears only once
gap_sorted = (sa_df.groupby('country', as_index=False)['GNI_HDI_Gap']
              .mean().sort_values('GNI_HDI_Gap'))

sns.barplot(data=pd.concat([gap_sorted.head(3), gap_sorted.tail(3)]),
            x='GNI_HDI_Gap', y='country')
plt.title('Top Positive and Negative GNI–HDI Gaps')
plt.show()



## Question 3: Comparative Regional Analysis – South Asia vs Middle East

This section compares human development outcomes between
South Asia and the Middle East using multiple indicators.



### Task 1: Regional Subsets


In [None]:

middle_east = ["Bahrain","Iran","Iraq","Israel","Jordan","Kuwait","Lebanon","Oman",
               "Palestine","Qatar","Saudi Arabia","Syria","United Arab Emirates","Yemen"]

sa_1b = hdi_1b[hdi_1b['country'].isin(south_asia)]
me_1b = hdi_1b[hdi_1b['country'].isin(middle_east)]

sa_1b.to_csv('HDI_SouthAsia_2020_2022.csv', index=False)
me_1b.to_csv('HDI_MiddleEast_2020_2022.csv', index=False)



### Task 2: Descriptive Statistics


In [None]:

pd.DataFrame({
    'Region':['South Asia','Middle East'],
    'Mean HDI':[sa_1b['hdi'].mean(), me_1b['hdi'].mean()],
    'Standard Deviation':[sa_1b['hdi'].std(), me_1b['hdi'].std()]
})


In [None]:
for name, data in [('South Asia', sa_1b), ('Middle East', me_1b)]:
    hdi_range = data['hdi'].max() - data['hdi'].min()
    cv = data['hdi'].std() / data['hdi'].mean()
    print(f"{name} → Range: {hdi_range:.3f}, CV: {cv:.3f}")



### Task 3: Top and Bottom Performers


In [None]:
# Average HDI per country (2020–2022)
sa_hdi_mean = sa_1b.groupby('country')['hdi'].mean()
me_hdi_mean = me_1b.groupby('country')['hdi'].mean()

# Extracting top and bottom performers
sa_top3 = sa_hdi_mean.sort_values(ascending=False).head(3)
sa_bottom3 = sa_hdi_mean.sort_values().head(3)

me_top3 = me_hdi_mean.sort_values(ascending=False).head(3)
me_bottom3 = me_hdi_mean.sort_values().head(3)


In [None]:
bar_data = pd.concat([
    sa_top3.rename('HDI').reset_index().assign(Region='South Asia', Category='Top 3'),
    sa_bottom3.rename('HDI').reset_index().assign(Region='South Asia', Category='Bottom 3'),
    me_top3.rename('HDI').reset_index().assign(Region='Middle East', Category='Top 3'),
    me_bottom3.rename('HDI').reset_index().assign(Region='Middle East', Category='Bottom 3')
])



In [None]:
plt.figure(figsize=(10,6))
# Colour the bars by region so the two groups can be told apart
sns.barplot(data=bar_data, x='HDI', y='country', hue='Region', dodge=False)

plt.xlabel('Average HDI (2020–2022)')
plt.ylabel('Country')
plt.title('Top and Bottom HDI Performers: South Asia vs Middle East')
plt.show()


In [None]:
print("South Asia – Bottom 3 Countries by Average HDI (2020–2022)")
display(sa_bottom3)

print("South Asia – Top 3 Countries by Average HDI (2020–2022)")
display(sa_top3)

print("Middle East – Bottom 3 Countries by Average HDI (2020–2022)")
display(me_bottom3)

print("Middle East – Top 3 Countries by Average HDI (2020–2022)")
display(me_top3)



### Task 4: Metric Comparisons Across Regions


In [None]:
metrics = ['life_expectancy','gross_inc_percap','gender_development']

for metric in metrics:
    print(metric)
    print('South Asia mean:', sa_1b[metric].mean())
    print('Middle East mean:', me_1b[metric].mean())
    print()



### Task 5: HDI Disparity and Variation


In [None]:

for name, d in [('South Asia',sa_1b),('Middle East',me_1b)]:
    print(name,
          'Range:', d['hdi'].max() - d['hdi'].min(),
          'CV:', d['hdi'].std() / d['hdi'].mean())



### Task 6: Correlation Analysis


In [None]:

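for name, d in [('South Asia',sa_1b),('Middle East',me_1b)]:
    for m in ['gender_development','life_expectancy']:
        if m in d.columns:
            # Pearson correlation between HDI and the metric
            r = d['hdi'].corr(pd.to_numeric(d[m], errors='coerce'))
            print(f'{name}: correlation between HDI and {m}: {r:.3f}')
            sns.regplot(data=d, x=m, y='hdi')
            plt.title(f'{name}: HDI vs {m}')
            plt.show()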



### Task 7: Regional Outlier Detection


In [None]:

for name, d in [('South Asia',sa_1b),('Middle East',me_1b)]:
    Q1, Q3 = d['hdi'].quantile([0.25,0.75])
    IQR = Q3 - Q1
    out = d[(d['hdi'] < Q1-1.5*IQR) | (d['hdi'] > Q3+1.5*IQR)]
    sns.scatterplot(data=d, x='gross_inc_percap', y='hdi', label='Normal')
    sns.scatterplot(data=out, x='gross_inc_percap', y='hdi', color='red', label='Outliers')
    plt.title(f'{name}: HDI Outliers vs GNI per Capita')
    plt.show()
