I begin by loading the extended state-level dataset that includes social composition variables alongside residual HDI.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm

df = pd.read_csv("/content/CC10_state_residuals_2023.csv")
df.head()

,State,HDI,log_GNI_pc,HDI_pred,residual_HDI,WPR,Urban_MPCE,Rural_MPCE,SC_share,ST_share
0,Andaman and Nicobar Islands,0.706,9.334,0.710049,-0.004049,63.9,10268,7332,0.0,0.0
1,Andhra Pradesh,0.642,9.027,0.674105,-0.032105,64.3,6877,4996,16.41,5.3
2,Arunachal Pradesh,0.683,9.064,0.678437,0.004563,66.4,8649,5300,0.0,68.8
3,Assam,0.615,8.387,0.599173,0.015827,55.8,6210,3546,7.15,12.4
4,Bihar,0.577,8.253,0.583484,-0.006484,48.7,4819,3454,15.91,1.3


I examine summary statistics to understand the variation in Scheduled Caste and Scheduled Tribe population shares across states.

In [None]:
df[['SC_share','ST_share']].describe()

,SC_share,ST_share
count,36.0,36.0
mean,11.9325,22.466667
std,8.607115,29.104236
min,0.0,0.0
25%,3.465,1.45
50%,14.22,9.8
75%,17.83,30.9
max,31.94,94.8


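Because SC and ST shares differ widely in spread (standard deviations of about 8.6 and 29.1 respectively), I standardise both to z-scores. Each regression coefficient then gives the change in residual HDI associated with a one-standard-deviation increase in the share, which makes the SC and ST coefficients directly comparable.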

In [None]:
scaler = StandardScaler()

X_std = scaler.fit_transform(
    df[['SC_share','ST_share']]
)

X_std = pd.DataFrame(
    X_std,
    columns=['SC_share_z','ST_share_z'],
    index=df.index  # keep df's index so the concat below aligns row by row
)

df_std = pd.concat([df, X_std], axis=1)
df_std.head()

,State,HDI,log_GNI_pc,HDI_pred,residual_HDI,WPR,Urban_MPCE,Rural_MPCE,SC_share,ST_share,SC_share_z,ST_share_z
0,Andaman and Nicobar Islands,0.706,9.334,0.710049,-0.004049,63.9,10268,7332,0.0,0.0,-1.406019,-0.782888
1,Andhra Pradesh,0.642,9.027,0.674105,-0.032105,64.3,6877,4996,16.41,5.3,0.527588,-0.598201
2,Arunachal Pradesh,0.683,9.064,0.678437,0.004563,66.4,8649,5300,0.0,68.8,-1.406019,1.614561
3,Assam,0.615,8.387,0.599173,0.015827,55.8,6210,3546,7.15,12.4,-0.563527,-0.35079
4,Bihar,0.577,8.253,0.583484,-0.006484,48.7,4819,3454,15.91,1.3,0.468673,-0.737587
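
As a sanity check on the z-scores, the sketch below recomputes two of the displayed values by hand from the summary statistics. It relies on one detail worth keeping in mind: `describe()` reports the sample standard deviation (ddof=1), whereas `StandardScaler` divides by the population standard deviation (ddof=0). The helper name `z_score` is mine, and the inputs are the rounded figures printed above, so agreement is only expected to about four decimal places.

```python
import math

# Rounded summary statistics for SC_share, copied from the describe() output.
n_obs = 36
mean_sc = 11.9325
sample_std_sc = 8.607115  # pandas describe() uses ddof=1

# StandardScaler uses the population standard deviation (ddof=0),
# so rescale the sample figure by sqrt((n - 1) / n).
pop_std_sc = sample_std_sc * math.sqrt((n_obs - 1) / n_obs)


def z_score(value, mean, std):
    """Standardise a single value given a mean and a standard deviation."""
    return (value - mean) / std


# SC_share values taken from the first rows of the table.
z_andhra = z_score(16.41, mean_sc, pop_std_sc)   # Andhra Pradesh
z_andaman = z_score(0.0, mean_sc, pop_std_sc)    # Andaman and Nicobar Islands

print(round(z_andhra, 4), round(z_andaman, 4))
```

Both values agree with the `SC_share_z` column to the precision shown, which would not be the case if the ddof=1 figure were used as the divisor.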


I first estimate bivariate regressions to examine whether states with higher SC or ST population shares systematically underperform or outperform income-adjusted HDI benchmarks.

In [None]:
X = sm.add_constant(df_std['SC_share_z'])
model_sc = sm.OLS(df_std['residual_HDI'], X).fit()
model_sc.summary()

0,1,2,3
Dep. Variable:,residual_HDI,R-squared:,0.131
Model:,OLS,Adj. R-squared:,0.106
Method:,Least Squares,F-statistic:,5.134
Date:,"Sun, 28 Dec 2025",Prob (F-statistic):,0.0299
Time:,20:51:58,Log-Likelihood:,95.391
No. Observations:,36,AIC:,-186.8
Df Residuals:,34,BIC:,-183.6
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-7.467e-12,0.003,-2.55e-09,1.000,-0.006,0.006
SC_share_z,-0.0066,0.003,-2.266,0.030,-0.013,-0.001

0,1,2,3
Omnibus:,1.362,Durbin-Watson:,1.91
Prob(Omnibus):,0.506,Jarque-Bera (JB):,0.576
Skew:,0.263,Prob(JB):,0.75
Kurtosis:,3.327,Cond. No.,1.0
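
To give the standardised SC coefficient a more tangible reading, the following sketch converts it back to raw units. It assumes that `SC_share` is recorded in percentage points (the values in the table, such as 16.41, suggest this but the notebook does not state it) and it uses the rounded coefficient from the summary, so the result is approximate. The variable names are illustrative.

```python
import math

n_obs = 36
sample_std_sc = 8.607115   # std of SC_share from describe() (ddof=1)
coef_per_sd = -0.0066      # SC_share_z coefficient from the bivariate summary

# StandardScaler divided by the population standard deviation (ddof=0).
pop_std_sc = sample_std_sc * math.sqrt((n_obs - 1) / n_obs)

# Change in residual HDI per one-unit increase in SC_share
# (one percentage point, under the assumption stated above).
coef_per_point = coef_per_sd / pop_std_sc

# Implied difference for a ten-unit gap in SC_share.
effect_10_points = 10 * coef_per_point

print(f"{coef_per_point:.6f} per point, {effect_10_points:.4f} per 10 points")
```

Under these assumptions, a ten-point higher SC share goes with a residual HDI that is lower by a little under 0.008, an association rather than a causal estimate.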


In [None]:
X = sm.add_constant(df_std['ST_share_z'])
model_st = sm.OLS(df_std['residual_HDI'], X).fit()
model_st.summary()

0,1,2,3
Dep. Variable:,residual_HDI,R-squared:,0.043
Model:,OLS,Adj. R-squared:,0.014
Method:,Least Squares,F-statistic:,1.509
Date:,"Sun, 28 Dec 2025",Prob (F-statistic):,0.228
Time:,20:52:26,Log-Likelihood:,93.641
No. Observations:,36,AIC:,-183.3
Df Residuals:,34,BIC:,-180.1
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-7.467e-12,0.003,-2.43e-09,1.000,-0.006,0.006
ST_share_z,0.0038,0.003,1.229,0.228,-0.002,0.010

0,1,2,3
Omnibus:,2.852,Durbin-Watson:,2.136
Prob(Omnibus):,0.24,Jarque-Bera (JB):,1.739
Skew:,0.495,Prob(JB):,0.419
Kurtosis:,3.422,Cond. No.,1.0
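
The headline statistics in the two bivariate summaries are tied together by simple identities, and checking them is a quick way to confirm the tables were read correctly. With a single regressor, the F-statistic is the square of the slope's t-statistic, and R-squared can be recovered from F and the residual degrees of freedom. The sketch below applies these identities to the rounded figures above; the function names are mine.

```python
def r2_from_f(f_stat, df_model, df_resid):
    """R-squared implied by an OLS F-statistic and its degrees of freedom."""
    return (f_stat * df_model) / (f_stat * df_model + df_resid)


def adjusted_r2(r2, n_obs, df_model):
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - df_model - 1)


n_obs, df_model, df_resid = 36, 1, 34

# SC model: t = -2.266 and F = 5.134 in the summary.
f_sc_from_t = (-2.266) ** 2
r2_sc = r2_from_f(5.134, df_model, df_resid)
adj_r2_sc = adjusted_r2(r2_sc, n_obs, df_model)

# ST model: t = 1.229 and F = 1.509 in the summary.
f_st_from_t = 1.229 ** 2
r2_st = r2_from_f(1.509, df_model, df_resid)
adj_r2_st = adjusted_r2(r2_st, n_obs, df_model)

print(round(f_sc_from_t, 3), round(r2_sc, 3), round(adj_r2_sc, 3))
print(round(f_st_from_t, 3), round(r2_st, 3), round(adj_r2_st, 3))
```

The recomputed values match the reported ones up to rounding in the last digit: SC share accounts for roughly 13% of the variance in residual HDI, ST share for roughly 4%.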


I then estimate a combined specification capturing the joint association between social composition and income-adjusted human development outcomes.

In [None]:
X = sm.add_constant(
    df_std[['SC_share_z','ST_share_z']]
)

model_social = sm.OLS(df_std['residual_HDI'], X).fit()
model_social.summary()

0,1,2,3
Dep. Variable:,residual_HDI,R-squared:,0.133
Model:,OLS,Adj. R-squared:,0.08
Method:,Least Squares,F-statistic:,2.53
Date:,"Sun, 28 Dec 2025",Prob (F-statistic):,0.095
Time:,20:52:37,Log-Likelihood:,95.427
No. Observations:,36,AIC:,-184.9
Df Residuals:,33,BIC:,-180.1
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-7.467e-12,0.003,-2.51e-09,1.000,-0.006,0.006
SC_share_z,-0.0073,0.004,-1.855,0.073,-0.015,0.001
ST_share_z,-0.0010,0.004,-0.257,0.798,-0.009,0.007

0,1,2,3
Omnibus:,1.227,Durbin-Watson:,1.862
Prob(Omnibus):,0.541,Jarque-Bera (JB):,0.457
Skew:,0.215,Prob(JB):,0.796
Kurtosis:,3.347,Cond. No.,2.19
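
The ST coefficient changes sign between the bivariate and the combined model (0.0038 versus -0.0010), which usually points to correlation between the regressors. The notebook does not compute that correlation, so the sketch below is only a back-of-envelope reconstruction. It assumes that, with two z-scored regressors and a constant, the reported condition number k satisfies k² = (1 + |r|) / (1 - |r|), and that the joint coefficients follow the standard two-regressor formula. All inputs are rounded summary values, and the negative sign for r is my assumption, chosen because only that sign reproduces the joint coefficients.

```python
cond_no = 2.19                 # Cond. No. from the combined model
c_sc, c_st = -0.0066, 0.0038   # bivariate slopes on the z-scored shares

# Implied absolute correlation between SC_share_z and ST_share_z.
k2 = cond_no ** 2
abs_r = (k2 - 1) / (k2 + 1)

# Assumption: the two shares are negatively correlated across states.
r = -abs_r

# Joint coefficients for two standardised regressors, given the
# bivariate slopes and the correlation between the regressors.
denom = 1 - r ** 2
b_sc = (c_sc - r * c_st) / denom
b_st = (c_st - r * c_sc) / denom

print(round(abs_r, 3), round(b_sc, 4), round(b_st, 4))
```

Under these assumptions the implied correlation is about -0.65, and the reconstructed coefficients land close to the reported -0.0073 and -0.0010. This reading suggests the positive bivariate ST association largely reflects the fact that high-ST states tend to have low SC shares. Computing `df_std[['SC_share_z','ST_share_z']].corr()` directly would confirm or refute it.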


In [None]:
df_std.to_csv(
    "PC8_SC_ST_ResidualHDI_StateLevel.csv",
    index=False
)