# Project 3
- **Datasets to be used:** 
    - **Languages in Indian States:** Translators without Borders (2020). Language and literacy data from the Office of the Registrar General & Census Commissioner, India (2011 Census). Accessed on November 22, 2024. Available here: https://data.humdata.org/dataset/61a9df6a-c1fa-4885-9c3f-e00666ec859c/resource/a6a31c2f-20a9-4f25-81f6-12ae60000bcd/download/in_lang_admin1_v01.csv.

    - **Literacy in India:** Satyam Prasad Tiwari (2020). India Literacy Data - District Wise. Accessed on November 22, 2024. Available here: https://www.kaggle.com/api/v1/datasets/download/satyampd/india-literacy-data-district-wise

- **Analysis question:** Do states with larger populations, speaking a specific dominant language, have higher literacy rates?

- **Columns that will be used:** State, Population Size, Language Spoken, Literacy Rate

- **Column to be used to merge datasets:** States

- **Hypothesis:** States with larger populations speaking a specific dominant language have higher literacy rates because more populous states may have better access to resources, infrastructure, and educational materials tailored to their dominant language.

- **Assistance:** I ran my code through ChatGPT and amended it accordingly.

- **Site URL:** https://project-3-js6400.readthedocs.io/en/latest/

In [None]:
! pip install pandas matplotlib seaborn


In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


## Uploading and Cleaning the Datasets

#### **Dataset 1: Languages in Indian States**
This dataset details the linguistic diversity in India. For this project, I chose to work with the columns State, State Language, and State Population. The data is based on the last conducted census in India, that is, 2011.

In [43]:
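# Load the language dataset (local path as used by the author).
language_df = pd.read_csv("C:/Users/joosa/OneDrive/Desktop/in_lang_admin1_v01.csv")

# Keep only the state name, main language, and total population columns.
columns_keep = ["admin1_name", "main_language", "pop_total"]
language_df = language_df[columns_keep].copy()

# Give the columns readable names.
new_columns = {"admin1_name": "State", "main_language": "State Language", "pop_total": "State Population"}
language_df = language_df.rename(columns=new_columns)

# Drop the row at index 0 so that the table starts at the first state.
language_df = language_df.drop(index=0)

# Normalise state names: trim whitespace and lower-case.
language_df["State"] = language_df["State"].str.strip().str.lower()

# Align spellings with the names used in the literacy dataset.
language_df["State"] = language_df["State"].replace({
    "nct of delhi": "delhi",
    "odisha": "orissa",
})

# Spell out "&" so that names such as "jammu & kashmir" match "jammu and kashmir".
language_df["State"] = language_df["State"].str.replace("&", "and", regex=False)

# Convert population to a numeric type; unparseable values become NaN.
language_df["State Population"] = pd.to_numeric(
    language_df["State Population"], errors="coerce"
)

print(language_df)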

                          State State Language  State Population
1   andaman and nicobar islands        Bengali            380581
2                andhra pradesh         Telugu          84580777
3             arunachal pradesh          Nissi           1383727
4                         assam       Assamese          31205576
5                         bihar          Hindi         104099452
6                    chandigarh          Hindi           1055450
7                  chhattisgarh          Hindi          25545198
8        dadra and nagar haveli          Bhili            343709
9                 daman and diu       Gujarati            243247
10                          goa        Konkani           1458545
11                      gujarat       Gujarati          60439692
12                      haryana          Hindi          25351462
13             himachal pradesh          Hindi           6864602
14            jammu and kashmir       Kashmiri          12541302
15                    jha
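
The cleaning steps above (trim, lower-case, spell out `&`, map alternate spellings) only help the merge if both datasets go through the same treatment. A minimal sketch of one way to bundle them into a reusable helper follows; the function name `normalize_state` and the `ALIASES` table are hypothetical names introduced for this example, and the alias table holds only the two renames used above.

```python
import pandas as pd

# Hypothetical alias table: only the two renames used in the cleaning cell.
ALIASES = {"nct of delhi": "delhi", "odisha": "orissa"}


def normalize_state(names: pd.Series) -> pd.Series:
    """Trim, lower-case, spell out '&', then map known alternate names."""
    cleaned = (
        names.str.strip()
        .str.lower()
        .str.replace("&", "and", regex=False)
    )
    return cleaned.replace(ALIASES)


# Illustrative input values, not read from either dataset.
sample = pd.Series(["  NCT of Delhi", "Odisha ", "Jammu & Kashmir", "Kerala"])
normalized = normalize_state(sample)
print(normalized.tolist())
```

Applying the same helper to the `State` column of both tables would keep the two cleaning paths from drifting apart.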

#### **Dataset 2: Literacy in India**
This dataset gives district-wise literacy rates for Indian states. For this project, I worked with the columns State and Literacy, averaging the district rates within each state. The data is based on the last conducted census in India, that is, 2011.

In [7]:
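# Load the district-wise literacy dataset (local path as used by the author).
literacy_df = pd.read_csv("C:/Users/joosa/OneDrive/Desktop/Literacy Data 2011.csv")

# Keep only the state name and literacy rate columns.
columns_to_keep = ["State", "Literacy"]
literacy_df = literacy_df[columns_to_keep].copy()

# Normalise state names before grouping so that spelling variants of the same state fall into one group.
literacy_df["State"] = literacy_df["State"].str.strip().str.lower()

# Average the district rates within each state (an unweighted mean).
literacy_df = literacy_df.groupby("State", as_index=False).mean()

print(literacy_df)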

                          State   Literacy
0   andaman and nicobar islands  83.700000
1                andhra pradesh  66.293913
2             arunachal pradesh  63.861875
3                         assam  72.247407
4                         bihar  61.756316
5                    chandigarh  86.050000
6                  chhattisgarh  65.841667
7        dadra and nagar haveli  76.240000
8                 daman and diu  85.765000
9                         delhi  86.556667
10                          goa  88.580000
11                      gujarat  76.389231
12                      haryana  75.358571
13             himachal pradesh  81.747500
14            jammu and kashmir  65.377273
15                    jharkhand  64.744583
16                    karnataka  73.655667
17                       kerala  93.695000
18                  lakshadweep  91.850000
19               madhya pradesh  67.683400
20                  maharashtra  80.967143
21                      manipur  76.360000
22         
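
Because `groupby(...).mean()` gives every district equal weight, the state figure is an unweighted average of district rates, which can differ from a population-weighted rate. The sketch below uses made-up numbers to show the gap; the `Population` column is an assumption, since the cell above keeps only `State` and `Literacy`.

```python
import pandas as pd

# Made-up district rows for illustration only; not taken from the dataset.
districts = pd.DataFrame({
    "State": ["x", "x", "y", "y"],
    "Literacy": [90.0, 60.0, 80.0, 70.0],
    "Population": [100_000, 900_000, 500_000, 500_000],  # assumed column
})

# Unweighted mean: what groupby(...).mean() computes.
unweighted = districts.groupby("State")["Literacy"].mean()

# Population-weighted mean: sum(rate * population) / sum(population).
weighted_sum = (districts["Literacy"] * districts["Population"]).groupby(districts["State"]).sum()
weighted = weighted_sum / districts.groupby("State")["Population"].sum()

print(pd.DataFrame({"unweighted": unweighted, "weighted": weighted}))
```

In the made-up state `x`, one large low-literacy district pulls the weighted rate well below the unweighted one, while state `y`, with equal-sized districts, is unaffected.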

#### **Merging Datasets**
I merged the datasets on their common column, 'State', after cleaning the state names in both datasets so that they match.

In [8]:
literacy_and_language_df = pd.merge(language_df, literacy_df, on='State', how='inner')

print(literacy_and_language_df)

                          State State Language State Population   Literacy
0   andaman and nicobar islands        Bengali           380581  83.700000
1                andhra pradesh         Telugu         84580777  66.293913
2             arunachal pradesh          Nissi          1383727  63.861875
3                         assam       Assamese         31205576  72.247407
4                         bihar          Hindi        104099452  61.756316
5                    chandigarh          Hindi          1055450  86.050000
6                  chhattisgarh          Hindi         25545198  65.841667
7        dadra and nagar haveli          Bhili           343709  76.240000
8                 daman and diu       Gujarati           243247  85.765000
9                           goa        Konkani          1458545  88.580000
10                      gujarat       Gujarati         60439692  76.389231
11                      haryana          Hindi         25351462  75.358571
12             himachal p
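
An inner join silently drops any state whose name appears in only one dataset. One way to check for this, sketched here on toy frames that stand in for `language_df` and `literacy_df`, is an outer merge with the `indicator=True` option of `pd.merge`; the frame contents are illustrative only.

```python
import pandas as pd

# Toy stand-ins for language_df and literacy_df (illustrative values only).
left = pd.DataFrame({"State": ["kerala", "goa", "odisha"]})
right = pd.DataFrame({"State": ["kerala", "goa", "orissa"]})

# An outer merge keeps every row; the _merge column records where each row came from.
check = pd.merge(left, right, on="State", how="outer", indicator=True)

# Rows found in only one frame would be lost by an inner join.
unmatched = check[check["_merge"] != "both"]
print(unmatched)
```

An empty `unmatched` table would suggest that the name cleaning covered every state.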

## Data Analysis & Visualization
To address my hypothesis and analysis question, I relied on four types of data visualization.
- **Scatterplot:** to identify potential trends across the languages.
- **Regression plot:** to quantify the relationship between state population and literacy rate within each language, while also indicating the variability.
- **Bar Chart:** to compare average literacy rates across different State Languages.
- **Bubble Chart:** to analyze the combined effect of population size and literacy rates, while grouping them visually by the language spoken.

In [52]:
pip install plotly


#### **Scatterplot**
- This scatterplot explores the association between a State's population size and its literacy rate.
- No clear linear relationship between population size and literacy rate is visible; literacy varies widely across the range of population sizes.
- Bihar and Uttar Pradesh, both Hindi-speaking states, cluster at the lower end of literacy rates despite large populations.
- Kerala, a Malayalam-speaking state, stands out with a high literacy rate despite having a relatively smaller population.
- Variability among Hindi-speaking states further suggests that other factors, such as infrastructure and state-level governance, play a significant role.
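
Whether a linear trend exists can also be checked numerically with a correlation coefficient. The sketch below uses only the twelve rows visible in the printed output of the merge cell, so the result describes that subset and may differ for the full table. Spearman's coefficient is computed here as the Pearson correlation of the ranks.

```python
import pandas as pd

# The twelve merged rows visible in the merge output; the full table has more states.
subset = pd.DataFrame({
    "State Population": [380581, 84580777, 1383727, 31205576, 104099452, 1055450,
                         25545198, 343709, 243247, 1458545, 60439692, 25351462],
    "Literacy": [83.700000, 66.293913, 63.861875, 72.247407, 61.756316, 86.050000,
                 65.841667, 76.240000, 85.765000, 88.580000, 76.389231, 75.358571],
})

# Pearson measures linear association between the raw values.
pearson = subset["State Population"].corr(subset["Literacy"])

# Spearman measures monotonic association: Pearson correlation of the ranks.
spearman = subset["State Population"].rank().corr(subset["Literacy"].rank())

print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```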

In [57]:
import plotly.express as px

fig = px.scatter(
    literacy_and_language_df,
    x="State Population",
    y="Literacy",
    size="State Population",
    color="State Language",
    hover_name="State",
    title="Interactive Scatter Plot: Population vs Literacy Rate",
    labels={"State Population": "Population", "Literacy": "Literacy (%)"}
)

fig.show()

#### **Regression Plot**
- Languages such as Konkani and Malayalam, associated with smaller states (e.g., Goa, Kerala), show nearly flat regression lines, suggesting little or no correlation between population size and literacy rate in these regions.
- On the other hand, very populous States (such as Uttar Pradesh and Maharashtra) significantly influence the regression lines, leading to exaggerated slopes.
- Hindi-speaking states exhibit wide variability: some States with smaller populations (such as Haryana) have relatively high literacy rates in comparison to other States with bigger populations.
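
How much a per-language line can be trusted depends on how many states the group contains. A small sketch, limited to the Hindi and Gujarati rows visible in the printed merge output, fits a line per language with `numpy.polyfit` (swapped in here for scikit-learn's `LinearRegression` to keep the example light). Expressing population in millions is a rescaling chosen for readability.

```python
import numpy as np
import pandas as pd

# Hindi and Gujarati rows visible in the merge output; not the full table.
rows = pd.DataFrame({
    "State": ["bihar", "chandigarh", "chhattisgarh", "haryana", "daman and diu", "gujarat"],
    "State Language": ["Hindi", "Hindi", "Hindi", "Hindi", "Gujarati", "Gujarati"],
    "State Population": [104099452, 1055450, 25545198, 25351462, 243247, 60439692],
    "Literacy": [61.756316, 86.050000, 65.841667, 75.358571, 85.765000, 76.389231],
})

fits = {}
for language, group in rows.groupby("State Language"):
    if len(group) < 2:
        continue  # a single state cannot define a line
    x = group["State Population"].to_numpy() / 1e6  # population in millions
    y = group["Literacy"].to_numpy()
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares straight line
    r_squared = np.corrcoef(x, y)[0, 1] ** 2    # share of variance explained
    fits[language] = {"n": len(group), "slope": slope, "r_squared": r_squared}

print(pd.DataFrame(fits).T)
```

A two-state group, such as Gujarati in this subset, always yields an R² of 1 because any two points lie exactly on a line, so such fits say little about how well population predicts literacy.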

In [63]:
pip install scikit-learn


In [71]:
import plotly.graph_objects as go
import numpy as np
from sklearn.linear_model import LinearRegression

fig = go.Figure()

languages = literacy_and_language_df["State Language"].unique()

# One scatter trace per language, plus a fitted line where there are enough points.
for language in languages:
    lang_data = literacy_and_language_df[literacy_and_language_df["State Language"] == language]

    # Raw data points for this language; hovering shows the state name.
    fig.add_trace(go.Scatter(
        x=lang_data["State Population"],
        y=lang_data["Literacy"],
        mode='markers',
        name=f"{language} - Data",
        marker=dict(size=8),
        text=lang_data["State"]
    ))

    # Fit literacy against population; a line needs at least two states.
    X = lang_data["State Population"].values.reshape(-1, 1)
    y = lang_data["Literacy"].values
    if len(X) > 1:
        model = LinearRegression()
        model.fit(X, y)
        y_pred = model.predict(X)

        # Dashed line showing the fitted values for this language.
        fig.add_trace(go.Scatter(
            x=lang_data["State Population"],
            y=y_pred,
            mode='lines',
            name=f"{language} - Regression",
            line=dict(dash='dash')
        ))

fig.update_layout(
    title="Interactive Regression Plot: Population vs Literacy Rate",
    xaxis_title="State Population",
    yaxis_title="Literacy Rate (%)",
    legend_title="Legend",
    template="plotly_white"
)

fig.show()

#### **Bar Charts**
- This bar chart shows the literacy rate of each State, coloured by State language.
- While some dominant languages, such as Malayalam and Tamil, are linked with high literacy, others, such as Hindi and Nissi, exhibit variability.
- It shows that literacy is influenced by state-specific factors rather than the language alone.
- For example, the high literacy rates in Kerala (Malayalam) and Tamil Nadu (Tamil) reflect strong regional education systems.
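
Variability within a language group can be summarised directly rather than read off the chart. The sketch below aggregates literacy per language on the twelve rows visible in the printed merge output; the column names `n_states`, `mean`, `min`, `max`, and `range` are chosen for this example.

```python
import pandas as pd

# The twelve merged rows visible in the merge output; the full table has more states.
subset = pd.DataFrame({
    "State Language": ["Bengali", "Telugu", "Nissi", "Assamese", "Hindi", "Hindi",
                       "Hindi", "Bhili", "Gujarati", "Konkani", "Gujarati", "Hindi"],
    "Literacy": [83.700000, 66.293913, 63.861875, 72.247407, 61.756316, 86.050000,
                 65.841667, 76.240000, 85.765000, 88.580000, 76.389231, 75.358571],
})

# Count, mean, minimum, and maximum literacy per language, highest mean first.
summary = (
    subset.groupby("State Language")["Literacy"]
    .agg(n_states="count", mean="mean", min="min", max="max")
    .sort_values("mean", ascending=False)
)

# Spread between the most and least literate state of each language.
summary["range"] = summary["max"] - summary["min"]
print(summary.round(2))
```

Languages represented by a single state have a range of zero by construction, so within-language variability can only be judged for groups with several states.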

In [70]:
# Each state appears once in the merged table, so this keeps one literacy value per state
# while ordering the rows by language.
average_literacy = literacy_and_language_df.groupby(["State Language", "State"]).agg({"Literacy": "mean"}).reset_index()

fig = px.bar(
    average_literacy,
    x="State",
    y="Literacy",
    color="State Language",  # one discrete colour per language
    title="Interactive Bar Chart: Average Literacy Rate by State",
    labels={"Literacy": "Literacy Rate (%)", "State": "State Name"}
)

fig.update_layout(
    xaxis_title="State Name",
    yaxis_title="Average Literacy Rate (%)",
    template="plotly_white",
    xaxis=dict(tickangle=45)
)

fig.show()

#### **Bubble Chart**
- The bubble chart combines population and literacy rate for each Indian state.
- Uttar Pradesh, with the largest population, has a relatively low literacy rate, whereas states like Kerala and Maharashtra, with smaller populations than Uttar Pradesh, exhibit significantly higher literacy rates.
- This challenges the hypothesis and indicates that larger populations may not inherently lead to higher literacy.
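
One direct check on the hypothesis is to split states at the median population and compare mean literacy in each half. A minimal sketch on the twelve rows visible in the printed merge output follows; the `Size Group` column and the median split are choices made for this example, and the result applies to that subset only.

```python
import pandas as pd

# The twelve merged rows visible in the merge output; the full table has more states.
subset = pd.DataFrame({
    "State Population": [380581, 84580777, 1383727, 31205576, 104099452, 1055450,
                         25545198, 343709, 243247, 1458545, 60439692, 25351462],
    "Literacy": [83.700000, 66.293913, 63.861875, 72.247407, 61.756316, 86.050000,
                 65.841667, 76.240000, 85.765000, 88.580000, 76.389231, 75.358571],
})

# Label each state as above or below the median population.
median_pop = subset["State Population"].median()
subset["Size Group"] = (subset["State Population"] > median_pop).map(
    {True: "larger", False: "smaller"}
)

# Mean literacy in each half.
group_means = subset.groupby("Size Group")["Literacy"].mean()
print(group_means.round(2))
```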


In [72]:
fig = px.scatter(
    literacy_and_language_df,
    x="State Population",
    y="Literacy",
    size="State Population",
    color="State Language",
    hover_name="State", 
    title="Interactive Bubble Chart: Population vs Literacy Rate",
    labels={"State Population": "Population", "Literacy": "Literacy Rate (%)"},
    size_max=60
)

fig.show()

## Overall Takeaways
- On the whole, the data suggests that while the language spoken in a State and its population size may have some impact on literacy rates, other determinants of literacy exist and perhaps have a bigger impact.
- These other determinants could include State laws and policies, socio-economic conditions, and gender disparities.
- The hypothesis is only partially supported, with states such as Kerala, Lakshadweep, and Uttar Pradesh being exceptions that further highlight the complexity of the relationship.