
I imported pandas for data wrangling and the standard-library re module for regex-based text cleaning.

In [None]:
import pandas as pd
import re

I then read the raw CSV, which has no header row and two columns: a combined indicator-year string and a value.

In [None]:
input_path = "CC9_India_HDI_Data.csv"
df = pd.read_csv(input_path, header=None)
df.columns = ["Indicator_Year_String", "Value"]

df.head(10)

Unnamed: 0,Indicator_Year_String,Value
0,Human Development Index (value) (1990),0.446
1,Human Development Index (value) (1991),0.448
2,Human Development Index (value) (1992),0.453
3,Human Development Index (value) (1993),0.458
4,Human Development Index (value) (1994),0.463
5,Human Development Index (value) (1995),0.469
6,Human Development Index (value) (1996),0.476
7,Human Development Index (value) (1997),0.482
8,Human Development Index (value) (1998),0.488
9,Human Development Index (value) (1999),0.496
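
As a minimal sketch of the same read step, the column names can also be assigned at read time with the names= argument of pd.read_csv. The in-memory buffer and the variable names sample_csv and sample are assumptions made so the example runs without the real file; the two rows are copied from the output above.

```python
import io

import pandas as pd

# Stand-in for the real file: two rows copied from the output above.
sample_csv = io.StringIO(
    "Human Development Index (value) (1990),0.446\n"
    "Human Development Index (value) (1991),0.448\n"
)

# header=None keeps the first line as data; names= labels the columns directly.
sample = pd.read_csv(
    sample_csv,
    header=None,
    names=["Indicator_Year_String", "Value"],
)

print(sample.shape)
print(sample.dtypes)
```

Passing names= avoids the separate df.columns assignment, but either approach should give the same frame.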


I extracted the four-digit year (YYYY) from the combined string, then cleaned the remaining text to get a consistent indicator name.

In [None]:
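# Pull the four-digit year out of its own parentheses, e.g. "(1990)" -> 1990.
df["year"] = df["Indicator_Year_String"].str.extract(r"\((\d{4})\)").astype("Int64")

# Strip the trailing "(value) (YYYY)" or "(YYYY)" suffix.
df["indicator_raw"] = df["Indicator_Year_String"].str.replace(r"\s\(value\)\s\(\d{4}\)", "", regex=True)
df["indicator_raw"] = df["indicator_raw"].str.replace(r"\s\(\d{4}\)", "", regex=True)

# Drop any remaining parenthetical, such as a unit label.
df["indicator_raw"] = df["indicator_raw"].str.replace(r"\s\(.*?\)", "", regex=True)

# Trim stray leading/trailing whitespace.
df["indicator_raw"] = df["indicator_raw"].str.strip()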

df[["Indicator_Year_String", "indicator_raw", "year", "Value"]].head(10)

Unnamed: 0,Indicator_Year_String,indicator_raw,year,Value
0,Human Development Index (value) (1990),Human Development Index,1990,0.446
1,Human Development Index (value) (1991),Human Development Index,1991,0.448
2,Human Development Index (value) (1992),Human Development Index,1992,0.453
3,Human Development Index (value) (1993),Human Development Index,1993,0.458
4,Human Development Index (value) (1994),Human Development Index,1994,0.463
5,Human Development Index (value) (1995),Human Development Index,1995,0.469
6,Human Development Index (value) (1996),Human Development Index,1996,0.476
7,Human Development Index (value) (1997),Human Development Index,1997,0.482
8,Human Development Index (value) (1998),Human Development Index,1998,0.488
9,Human Development Index (value) (1999),Human Development Index,1999,0.496
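
To see what the two regular expressions do on a single string, here is a small sketch using the re module directly. The helper name split_label is hypothetical, and only the first label is taken from the data shown above; the other two label formats are assumptions about how the remaining indicators might be written.

```python
import re


def split_label(label):
    """Return (indicator_name, year) from a combined label string."""
    # The year is the only group that is exactly four digits inside parentheses.
    year_match = re.search(r"\((\d{4})\)", label)
    year = int(year_match.group(1)) if year_match else None
    # Remove every parenthetical group together with the whitespace before it.
    name = re.sub(r"\s\(.*?\)", "", label).strip()
    return name, year


# Label format seen in the output above.
hdi = split_label("Human Development Index (value) (1990)")
# Hypothetical label formats, used only for illustration.
life = split_label("Life Expectancy at Birth (years) (2005)")
gni = split_label("Gross National Income Per Capita (2017 PPP$) (1990)")

print(hdi, life, gni)
```

Because the year pattern requires a closing parenthesis right after the four digits, a unit such as "(2017 PPP$)" in the hypothetical third label is not mistaken for the year.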


I created a mapping so raw labels become standardised column names (this prevents messy column headers later).

In [None]:
indicator_mapping = {
    "Human Development Index": "HDI",
    "Life Expectancy at Birth": "Life_expectancy",
    "Mean Years of Schooling": "Mean_years_schooling",
    "Expected Years of Schooling": "Expected_years_schooling",
    "Gross National Income Per Capita": "GNI_per_capita"
}

df["indicator"] = df["indicator_raw"].map(indicator_mapping)


unmapped = df.loc[df["indicator"].isna(), "indicator_raw"].unique()
if len(unmapped) > 0:
    print("⚠️ Unmapped indicators found (won’t appear in final dataset):")
    for u in unmapped:
        print(" -", u)
else:
    print("All indicators mapped successfully.")

All indicators mapped successfully.
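
The check above passes on this dataset, so the failure path is never shown. As a minimal sketch, assuming a shortened mapping and made-up raw labels, this is how Series.map leaves NaN for labels that are missing from the dictionary, including ones that differ only in capitalisation.

```python
import pandas as pd

# Shortened mapping, for illustration only.
mapping = {"Human Development Index": "HDI"}

# Hypothetical raw labels: one exact match, one case mismatch, one missing key.
raw = pd.Series(
    [
        "Human Development Index",
        "Human development index",
        "Life Expectancy at Birth",
    ]
)

mapped = raw.map(mapping)

# Labels without a dictionary entry come back as NaN.
unmapped = raw[mapped.isna()].unique().tolist()
print(unmapped)
```

Dictionary lookups are exact, so a label with different casing or extra whitespace would be reported here rather than silently renamed.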


I reshaped the dataset so each year is one row and each indicator is a separate column.

In [None]:
df_wide = (
    df.pivot(index="year", columns="indicator", values="Value")
      .reset_index()
      .rename(columns={"year": "Year"})
)

df_wide.columns.name = None

df_wide.head()

Unnamed: 0,Year,Expected_years_schooling,GNI_per_capita,HDI,Life_expectancy,Mean_years_schooling
0,1990,8.204444,2167.222109,0.446,58.618,2.780574
1,1991,8.216524,2143.711946,0.448,59.032,2.883907
2,1992,8.228621,2215.913758,0.453,59.445,2.987239
3,1993,8.240736,2277.498432,0.458,59.823,3.090571
4,1994,8.252869,2381.704988,0.463,60.237,3.193903
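
The reshape relies on each (year, indicator) pair appearing only once. The sketch below uses a four-row frame with values copied from the outputs above to show the long-to-wide step, and then shows that DataFrame.pivot raises a ValueError when a pair is duplicated. The names long_df, wide, dup and pivot_failed are assumptions for this example.

```python
import pandas as pd

# Tiny long-format frame; values are copied from the outputs above.
long_df = pd.DataFrame(
    {
        "year": [1990, 1990, 1991, 1991],
        "indicator": ["HDI", "Life_expectancy", "HDI", "Life_expectancy"],
        "Value": [0.446, 58.618, 0.448, 59.032],
    }
)

# One row per year, one column per indicator.
wide = long_df.pivot(index="year", columns="indicator", values="Value").reset_index()
wide.columns.name = None
print(wide)

# Append a copy of the first row to create a duplicate (year, indicator) pair.
dup = pd.concat([long_df, long_df.iloc[[0]]], ignore_index=True)
try:
    dup.pivot(index="year", columns="indicator", values="Value")
    pivot_failed = False
except ValueError:
    # pivot refuses to reshape rather than silently picking one value.
    pivot_failed = True

print("pivot failed on duplicates:", pivot_failed)
```

If duplicates were expected, DataFrame.pivot_table with an explicit aggfunc would be one alternative, at the cost of aggregating values that may disagree.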


I then ensured the indicator columns were numeric and sorted the rows chronologically.

In [None]:
df = df_wide.copy()

num_cols = [
    "HDI",
    "Life_expectancy",
    "Mean_years_schooling",
    "Expected_years_schooling",
    "GNI_per_capita"
]

for col in num_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")

df = df.sort_values("Year").reset_index(drop=True)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Year                      34 non-null     Int64  
 1   Expected_years_schooling  34 non-null     float64
 2   GNI_per_capita            34 non-null     float64
 3   HDI                       34 non-null     float64
 4   Life_expectancy           34 non-null     float64
 5   Mean_years_schooling      34 non-null     float64
dtypes: Int64(1), float64(5)
memory usage: 1.8 KB


Unnamed: 0,Year,Expected_years_schooling,GNI_per_capita,HDI,Life_expectancy,Mean_years_schooling
0,1990,8.204444,2167.222109,0.446,58.618,2.780574
1,1991,8.216524,2143.711946,0.448,59.032,2.883907
2,1992,8.228621,2215.913758,0.453,59.445,2.987239
3,1993,8.240736,2277.498432,0.458,59.823,3.090571
4,1994,8.252869,2381.704988,0.463,60.237,3.193903
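
In this dataset every value was already numeric, so errors="coerce" changed nothing. As a sketch of what it does otherwise, assuming hypothetical placeholder strings such as "n/a" and "..", the conversion turns unparseable entries into NaN, and counting the new NaN values shows how many were coerced.

```python
import pandas as pd

# Two values from the outputs above plus two hypothetical placeholders.
values = pd.Series(["0.446", "58.618", "n/a", ".."])

# Unparseable strings become NaN instead of raising an error.
numeric = pd.to_numeric(values, errors="coerce")

# NaN values that were not missing before the conversion were coerced.
n_coerced = int(numeric.isna().sum() - values.isna().sum())
print(numeric.tolist(), n_coerced)
```

Comparing the NaN count before and after the conversion is a cheap way to notice values that were dropped silently.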


I ran quick checks to confirm years and missing values look reasonable before saving.

In [None]:
print("Year range:", df["Year"].min(), "to", df["Year"].max())
print("\nMissing values by column:")
print(df[num_cols].isna().sum())

print("\nFirst 5 years:")
display(df.head())

print("\nLast 5 years:")
display(df.tail())

Year range: 1990 to 2023

Missing values by column:
HDI                         0
Life_expectancy             0
Mean_years_schooling        0
Expected_years_schooling    0
GNI_per_capita              0
dtype: int64

First 5 years:


Unnamed: 0,Year,Expected_years_schooling,GNI_per_capita,HDI,Life_expectancy,Mean_years_schooling
0,1990,8.204444,2167.222109,0.446,58.618,2.780574
1,1991,8.216524,2143.711946,0.448,59.032,2.883907
2,1992,8.228621,2215.913758,0.453,59.445,2.987239
3,1993,8.240736,2277.498432,0.458,59.823,3.090571
4,1994,8.252869,2381.704988,0.463,60.237,3.193903



Last 5 years:


Unnamed: 0,Year,Expected_years_schooling,GNI_per_capita,HDI,Life_expectancy,Mean_years_schooling
29,2019,11.75398,7895.441397,0.651,70.746,6.28138
30,2020,12.12914,7331.951385,0.652,70.156,6.49
31,2021,12.40237,7992.775136,0.647,67.282,6.53
32,2022,12.95646,8475.67988,0.676,71.698,6.57
33,2023,12.95454,9046.756336,0.685,72.003,6.88
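
The printed range shows the first and last year but not whether any year in between is missing or repeated. A minimal sketch of a stricter check follows; the helper name check_years and the sample year list are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def check_years(years, start, end):
    """Return (missing_years, duplicated_years) for an expected inclusive range."""
    years = pd.Series(years)
    expected = set(range(start, end + 1))
    missing = sorted(expected - set(years.tolist()))
    duplicated = sorted(years[years.duplicated()].unique().tolist())
    return missing, duplicated


# Hypothetical input with one gap (1992), one repeat (1993) and no 1994.
missing, duplicated = check_years([1990, 1991, 1993, 1993], 1990, 1994)
print("missing:", missing)
print("duplicated:", duplicated)
```

On the real data this could be called as check_years(df["Year"], 1990, 2023), where two empty lists would indicate a complete, non-repeating series.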


I saved the cleaned wide-format dataset to a CSV file for use in Vega-Lite and further analysis.

In [None]:
output_path = "CC9_India_HDI_tidy.csv"
df.to_csv(output_path, index=False)

output_path

'CC9_India_HDI_tidy.csv'
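
One way to confirm the export is to read the file back and compare it with the frame in memory. The sketch below writes to an in-memory buffer instead of the real output path and uses a two-row frame with values from the outputs above; the names tidy, buffer, reloaded and header are assumptions. It uses a plain integer Year column, since a CSV file stores no dtype information and a nullable Int64 column would be read back as int64.

```python
import io

import pandas as pd

# Two-row stand-in for the cleaned frame.
tidy = pd.DataFrame({"Year": [1990, 1991], "HDI": [0.446, 0.448]})

# Write without the index so the file starts with the Year column.
buffer = io.StringIO()
tidy.to_csv(buffer, index=False)

# Read the text back and compare it with the original frame.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
pd.testing.assert_frame_equal(tidy, reloaded)

header = buffer.getvalue().splitlines()[0]
print(header)
```

With index=False the header row contains only the data columns, which keeps an extra unnamed index column out of whatever consumes the file.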