
# SDG 4 – Literacy Rate Prediction

## I. Project Objective

The objective of this project is to predict the **Literacy Rate** of Indian States and Union Territories using:

* Government **Expenditure on Primary Education**
* **Schools per 1000 children**

The dataset was constructed by integrating multiple official government data sources; missing values were filled in synthetically using the anchor-based model described in Section II.

---

## II. Anchor-Based Synthetic Data Generation Model

To model the relationship between literacy rate and education infrastructure, an anchor-based linear model was used. The formulation is grounded in three assumptions:

* National literacy average ≈ 74% (Census reference baseline)
* School access positively influences literacy
* Education spending shows diminishing returns (log transformation applied)

---

### Model Structure

$$
\text{Literacy}_i = \beta_0 + \beta_1 S_i + \beta_2 \log(E_i) + \epsilon_i
$$

Where:

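- $\beta_0$ = Intercept (baseline literacy level)  
- $\beta_1$, $\beta_2$ = Coefficients for school access and log-expenditure  
- $S_i$ = Schools per 1000 children  
- $E_i$ = Government expenditure on primary education  
- $\epsilon_i \sim N(0, \sigma^2)$ = Controlled random noise  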

---

### Concrete Model Used in This Project

$$
\text{Literacy}_i = 50 + 1.8S_i + 4\left(\frac{\log(E_i)}{\max(\log(E))}\right) + \epsilon_i
$$

**Parameter Explanation:**

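- **50** → Intercept: the baseline literacy level before the school and expenditure terms are added  
- **1.8** → Literacy points added per additional school per 1000 children  
- **4** → Maximum contribution of expenditure, since the normalized log term is at most 1  
- **ε** → Random noise between −2 and +2  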
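
The concrete formula can be sketched in NumPy as shown here. This is an illustrative sketch, not the project's generation script: the function name `synthetic_literacy`, the uniform noise draw on [−2, 2], and the sample inputs are assumptions made for the example.

```python
import numpy as np

def synthetic_literacy(schools_per_1000, expenditure, rng=None, noise_half_width=2.0):
    """Evaluate 50 + 1.8*S + 4*log(E)/max(log(E)) + noise for arrays S and E."""
    s = np.asarray(schools_per_1000, dtype=float)
    log_e = np.log(np.asarray(expenditure, dtype=float))
    # Dividing by the largest log-expenditure caps this term at 4
    # (for expenditures greater than 1, where the logs are positive).
    expenditure_term = 4.0 * log_e / log_e.max()
    # Assumption: noise is uniform on [-2, 2]; rng=None switches the noise off.
    if rng is None:
        noise = np.zeros_like(s)
    else:
        noise = rng.uniform(-noise_half_width, noise_half_width, size=s.shape)
    return 50.0 + 1.8 * s + expenditure_term + noise

# Hypothetical inputs chosen so that log(E) is 1 and 2.
S = np.array([0.0, 10.0])
E = np.exp(np.array([1.0, 2.0]))

noiseless = synthetic_literacy(S, E)
noisy = synthetic_literacy(S, E, rng=np.random.default_rng(0))
print(noiseless, noisy)
```

Passing a seeded generator keeps the synthetic values reproducible, and leaving `rng` unset makes it easy to inspect the deterministic part of the formula on its own.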

---

### Rationale for Log Transformation

Education expenditure data is typically skewed.  
Applying $\log(E)$:

- Reduces skewness  
- Stabilizes variance  
- Reflects diminishing marginal returns of spending  
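
The effect on skewness can be checked on a small simulated sample. The sketch below uses hypothetical spending figures drawn from a log-normal distribution (they are not project data) and compares `Series.skew()` before and after the transformation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical, strongly right-skewed spending figures (not project data).
spending = pd.Series(rng.lognormal(mean=15.0, sigma=1.0, size=5000))

raw_skew = spending.skew()          # skewness of the raw values
log_skew = np.log(spending).skew()  # skewness after the log transform

print(f"skew before log: {raw_skew:.2f}, after log: {log_skew:.2f}")
```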

---

## III. Feature Engineering

### Calculation of Schools per 1000 Children

The metric **Schools per 1000 children** was calculated using the formula:

$$
\text{Schools per 1000 children} =
\left(
\frac{\text{Total number of schools in the region}}
{\text{Population of children aged 6–18 in that region}}
\right)
\times 1000
$$

This measure standardizes school availability relative to the eligible student population and enables fair comparison across states.
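
As a minimal sketch, the formula translates directly into a helper function. The function name and the example counts below are hypothetical and serve only to illustrate the arithmetic.

```python
def schools_per_1000(total_schools, children_6_to_18):
    """Return the number of schools per 1000 children aged 6-18."""
    if children_6_to_18 <= 0:
        raise ValueError("child population must be positive")
    return 1000 * total_schools / children_6_to_18

# Hypothetical region: 1,200 schools serving 240,000 children.
example = schools_per_1000(total_schools=1_200, children_6_to_18=240_000)
print(example)
```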

---

## IV. Final Dataset Structure

The final dataset consists of:

| Column           | Description                                                   |
| ---------------- | ------------------------------------------------------------- |
| State/UT         | Name of state or union territory                              |
| Expenditure      | Government spending on primary education (Central allocation) |
| Schools_per_1000 | School density indicator                                      |
| Literacy_Rate    | Target variable                                               |

Each row represents one state or union territory; the loaded file has 34 rows.
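
A lightweight check of this structure might look like the sketch below. The helper `check_schema` and the constant `EXPECTED_COLUMNS` are hypothetical names; the two sample rows are copied from the `data.head()` output shown in Step 2.

```python
import pandas as pd

EXPECTED_COLUMNS = ["State/UT", "Expenditure", "Schools_per_1000", "Literacy_Rate"]

def check_schema(df):
    """Raise ValueError if the frame does not match the four-column layout."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Every column except the state name should be numeric.
    for col in EXPECTED_COLUMNS[1:]:
        if not pd.api.types.is_numeric_dtype(df[col]):
            raise ValueError(f"{col} must be numeric")
    # One row per state/UT means no repeated names.
    if df["State/UT"].duplicated().any():
        raise ValueError("each state/UT should appear only once")
    return True

sample = pd.DataFrame({
    "State/UT": ["Andhra Pradesh", "Arunachal Pradesh"],
    "Expenditure": [13421630.0, 3329964.0],
    "Schools_per_1000": [3.464459, 7.942123],
    "Literacy_Rate": [63.2, 81.9],
})
schema_ok = check_schema(sample)
```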

---

## V. Data Preprocessing

### Step 1 – Import Libraries

Import `pandas` and `numpy` to handle data manipulation and numerical operations.

In [12]:
import pandas as pd
import numpy as np

### Step 2 – Load Dataset

Load the SDG 4 dataset and inspect the first few rows to understand structure and column names.

In [13]:
data = pd.read_csv("literacy_rate_data.csv")
data.head()

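|     | State/UT          | Expenditure | Schools_per_1000 | Literacy_Rate |
| --- | ----------------- | ----------- | ---------------- | ------------- |
| 0   | Andhra Pradesh    | 13421630.0  | 3.464459         | 63.2          |
| 1   | Arunachal Pradesh | 3329964.0   | 7.942123         | 81.9          |
| 2   | Assam             | 25753790.0  | 2.412662         | 87.0          |
| 3   | Bihar             | 21153720.0  | 12.911845        | 71.0          |
| 4   | Chhattisgarh      | 5882643.0   | 5.10536          | 76.1          |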


### Step 3 – Dataset Information

Use `.info()` to check column names, data types, and non-null counts.

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   State/UT          34 non-null     object 
 1   Expenditure       34 non-null     float64
 2   Schools_per_1000  34 non-null     float64
 3   Literacy_Rate     34 non-null     float64
dtypes: float64(3), object(1)
memory usage: 1.2+ KB



### Step 4 – Check for Null Values

Check for missing values in each column to ensure completeness.


In [15]:
data.isnull().sum()

State/UT            0
Expenditure         0
Schools_per_1000    0
Literacy_Rate       0
dtype: int64

### Step 5 – Check Skewness

Compute skewness to examine distribution shape, especially for `Expenditure`.


In [16]:
data[["Expenditure", "Schools_per_1000"]].skew()


Expenditure         0.294173
Schools_per_1000    0.163415
dtype: float64

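### Step 6 – Apply Log Transformation

Replace `Expenditure` with its natural logarithm. The skewness measured in Step 5 is already mild (about 0.29), so here the transformation mainly compresses the scale of the values and matches the $\log(E_i)$ term of the model in Section II.

In [17]:
# Overwrites the column in place: run this cell only once per load
data["Expenditure"] = np.log(data["Expenditure"])

data.head()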

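|     | State/UT          | Expenditure | Schools_per_1000 | Literacy_Rate |
| --- | ----------------- | ----------- | ---------------- | ------------- |
| 0   | Andhra Pradesh    | 16.412378   | 3.464459         | 63.2          |
| 1   | Arunachal Pradesh | 15.018472   | 7.942123         | 81.9          |
| 2   | Assam             | 17.064092   | 2.412662         | 87.0          |
| 3   | Bihar             | 16.867326   | 12.911845        | 71.0          |
| 4   | Chhattisgarh      | 15.587517   | 5.10536          | 76.1          |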
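
The cell above overwrites `Expenditure` in place, so running it a second time would take the logarithm of values that are already logged. One possible alternative, sketched here under the assumption that an extra column is acceptable, writes the result to a new column instead. The column name `Log_Expenditure` is hypothetical; the two rows are copied from the Step 2 output.

```python
import numpy as np
import pandas as pd

# Two rows copied from the data.head() output before the transformation.
df = pd.DataFrame({
    "State/UT": ["Andhra Pradesh", "Arunachal Pradesh"],
    "Expenditure": [13421630.0, 3329964.0],
})

# Hypothetical column name: the raw values stay available,
# and re-running this line always gives the same result.
df["Log_Expenditure"] = np.log(df["Expenditure"])
first_run = df["Log_Expenditure"].copy()

df["Log_Expenditure"] = np.log(df["Expenditure"])  # second run, same values
print(df)
```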



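### Step 7 – Define Features and Target

Separate the independent and dependent variables using positional indexing. Note that `iloc[:, :-1]` keeps every column except the last, so `X` still contains the non-numeric `State/UT` labels; the modeling cell below reselects only `Expenditure` and `Schools_per_1000` by name.

In [18]:
X = data.iloc[:, :-1].values  # all columns except the target (includes State/UT)
y = data.iloc[:, -1].values   # Literacy_Rate
X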

array([['Andhra Pradesh', 16.412378448115795, 3.464458818],
       ['Arunachal Pradesh', 15.018472158295738, 7.942122921],
       ['Assam', 17.06409239715041, 2.412662253],
       ['Bihar', 16.867326225787235, 12.91184482],
       ['Chhattisgarh', 15.587516724829655, 5.105359779],
       ['Delhi', 15.587375257491168, 9.950267412],
       ['Goa', 14.733353931865729, 5.740532913],
       ['Gujarat', 17.229396334729742, 8.240816254],
       ['Haryana', 16.871324619438802, 8.560523352],
       ['Himachal Pradesh', 17.031522442868695, 4.218253466],
       ['Jharkhand', 14.006267315708293, 13.63501553],
       ['Karnataka', 17.34074956862458, 11.30159388],
       ['Kerala', 17.190339019074944, 13.2739873],
       ['Madhya Pradesh', 15.872923660109716, 12.73792821],
       ['Maharashtra', 17.321057251289783, 9.174799746],
       ['Manipur', 15.736462627731765, 13.06249082],
       ['Mizoram', 16.213074628648553, 3.061910025],
       ['Nagaland', 16.738892377332345, 4.351794349],
       ['Odis

In [19]:
y

array([63.2, 81.9, 87. , 71. , 76.1, 93.1, 87.9, 80.3, 75.5, 85.8, 69.3,
       74.4, 92.9, 72.3, 78.5, 87.6, 99.8, 95.6, 73.6, 76.4, 67. , 83.5,
       79. , 62.3, 91.1, 79.3, 71.2, 78.1, 84. , 93. , 76.9, 67.7, 99.4,
       85. ])

### Step 8 – Polynomial Regression

Reselect the two numeric features by name, hold out 20% of the rows for testing, fit a degree-2 polynomial regression, and report the mean squared error and R² on the test split.

In [27]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Select the numeric features by name (Expenditure is already log-transformed)
X = data[['Expenditure', 'Schools_per_1000']]
y = data['Literacy_Rate']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Polynomial regression: expand the features to degree 2, then fit a linear model
poly = PolynomialFeatures(degree=2, include_bias=False)

X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

model = LinearRegression()
model.fit(X_train_poly, y_train)

# Predictions on the held-out rows
y_pred = model.predict(X_test_poly)

# Model evaluation on the test split
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R2 Score:", r2)

Mean Squared Error: 47.98402796694558
R2 Score: -0.09828913040934961
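
An R² below zero means that, on the test split, the model's squared error is larger than that of always predicting the test-set mean. With 34 rows and `test_size=0.2`, the test split holds only about seven states, so the score is also sensitive to which rows happen to be held out. The sketch below reproduces the definition of R² with NumPy on illustrative values (not the project's test split) to show where the zero point lies.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Illustrative literacy values only.
y_true = np.array([60.0, 70.0, 80.0, 90.0])

mean_baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
reversed_guess = y_true[::-1]                         # deliberately poor predictions

r2_baseline = r2(y_true, mean_baseline)
r2_worse = r2(y_true, reversed_guess)
print(r2_baseline, r2_worse)
```

Predicting the mean gives an R² of exactly zero, so any negative score indicates predictions that do worse than that constant baseline.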


In [28]:
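from sklearn.linear_model import LinearRegression

# Multiple linear regression on the same two features (no polynomial terms)
multi_reg = LinearRegression()
multi_reg.fit(X_train, y_train)

# Note: these are in-sample predictions on the training split;
# predict on X_test instead for a held-out comparison
y_multi_pred = multi_reg.predict(X_train)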