Importing the Pandas library

In [None]:
import pandas as pd

The data below is from the 2011 census of India, which, as of November 2024, is the latest census conducted by the government of India. Let's begin analysing this dataset by loading the data.

In [None]:
df = pd.read_csv('india_census_2011.csv')
df.head()

In [None]:
df.info()

In [None]:
df.columns

In [None]:
df.describe()

Let's start by computing the total population of each state/union territory and sorting the totals in descending order.

In [None]:
# df.groupby('State name')

In [None]:
df.groupby('State name')['Population'].sum()

In [None]:
df.groupby('State name')['Population'].sum().sort_values(ascending = False)

Thus, we can see that Uttar Pradesh has the highest population while Lakshadweep has the lowest.
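
To make the group, sum, and sort steps easier to follow, here is a minimal sketch on a small made-up table. The state labels and population values in `toy` are hypothetical and are not census figures; `nlargest` is shown only as an optional shortcut when the top few entries are all that is needed.

```python
import pandas as pd

# Hypothetical toy data (not census figures): one row per district
toy = pd.DataFrame({
    'State name': ['A', 'A', 'B', 'B', 'C'],
    'Population': [100, 250, 400, 50, 300],
})

# Step 1: group the district rows by state and add up their populations
state_totals = toy.groupby('State name')['Population'].sum()

# Step 2: order the state totals from largest to smallest
ranked = state_totals.sort_values(ascending=False)

# Optional shortcut: keep only the two largest totals
top_two = state_totals.nlargest(2)

print(ranked)
print(top_two)
```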

The female-to-male gender ratio is the number of females divided by the number of males. Let's calculate this for each state. We first need the total female and male populations per state, which we can compute with the same grouping method as above.

In [None]:
# Make an interim DataFrame which groups and aggregates population by gender
df_state_population_gender = df.groupby('State name')[['Female', 'Male']].sum()

# Calculate the gender ratio based on the aggregated populations
df_state_population_gender['Gender_Ratio'] = df_state_population_gender['Female'] / df_state_population_gender['Male']

# Display the gender ratio by state, in descending order
df_state_population_gender['Gender_Ratio'].sort_values(ascending = False)

Thus, we can see that Kerala and Puducherry are the only states/UTs with a ratio greater than 1 (i.e., they have more females than males), while the UTs of Dadra and Nagar Haveli and of Daman and Diu have the lowest ratios.
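
The calculation above sums the female and male counts before dividing. The sketch below, on made-up numbers, suggests why that order matters: averaging district-level ratios gives every district equal weight regardless of its size, which can produce a different answer. The `toy` table and the `District_Ratio` column are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy data: one large and one small district in the same state
toy = pd.DataFrame({
    'State name': ['A', 'A'],
    'Female': [900, 50],
    'Male': [1000, 40],
})

# Approach used above: aggregate counts first, then divide
sums = toy.groupby('State name')[['Female', 'Male']].sum()
ratio_of_sums = sums['Female'] / sums['Male']

# Alternative: divide per district first, then average the ratios
toy['District_Ratio'] = toy['Female'] / toy['Male']
mean_of_ratios = toy.groupby('State name')['District_Ratio'].mean()

print(ratio_of_sums)
print(mean_of_ratios)
```

In this toy case the small district pulls the averaged ratio above 1 even though the state as a whole has fewer females than males.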

Let's get some insights on literacy rates, including how they vary by gender. We can calculate the literacy rate for each state as follows:

In [None]:
# df[['Male', 'Female', 'Male_Literate', 'Female_Literate', 'Literate', 'Population']].head(1)

In [None]:
# Aggregate the relevant columns by state
df_state_literacy = df.groupby('State name')[['Male', 'Female', 'Male_Literate', 'Female_Literate', 'Literate', 'Population']].sum()

# Calculate the overall literacy rate for each state
df_state_literacy['Literacy_Rate'] = df_state_literacy['Literate'] / df_state_literacy['Population']

# Calculate male and female literacy rates separately
df_state_literacy['Male_Literacy_Rate'] = df_state_literacy['Male_Literate'] / df_state_literacy['Male']
df_state_literacy['Female_Literacy_Rate'] = df_state_literacy['Female_Literate'] / df_state_literacy['Female']

# Calculate the literacy gap between male and female
df_state_literacy['Literacy_Gap'] = df_state_literacy['Male_Literacy_Rate'] - df_state_literacy['Female_Literacy_Rate']

# Sort the states by overall literacy rate, in descending order
df_literacy_rate_by_state = df_state_literacy[[
    'Literacy_Rate',
    'Male_Literacy_Rate',
    'Female_Literacy_Rate',
    'Literacy_Gap'
]].sort_values(by = 'Literacy_Rate', ascending = False)

df_literacy_rate_by_state

Thus, we can see that the literacy rates are highest in Kerala and Lakshadweep, and lowest in Arunachal Pradesh and Bihar. We can further query our new DataFrame to find the states with the highest and lowest literacy gap.

In [None]:
print('State with the highest literacy gap:', df_literacy_rate_by_state['Literacy_Gap'].idxmax())
print('State with the lowest literacy gap:', df_literacy_rate_by_state['Literacy_Gap'].idxmin())

To find the size of each gap, we can pass the state name returned by `idxmax()` or `idxmin()` to `.loc`, since the DataFrame is indexed by state name.

In [None]:
gap = df_literacy_rate_by_state['Literacy_Gap']
max_state, min_state = gap.idxmax(), gap.idxmin()
print(f'{max_state} has the highest literacy gap with a value of {gap.loc[max_state]:.4f}')
print(f'{min_state} has the lowest literacy gap with a value of {gap.loc[min_state]:.4f}')
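
As a small illustration of label-based lookup, the sketch below builds a made-up table indexed by state name and reads values back using the labels that `idxmax()` and `idxmin()` return. The `gaps` table and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical toy data indexed by state name (values are made up)
gaps = pd.DataFrame(
    {'Literacy_Gap': [0.12, 0.03, 0.20]},
    index=pd.Index(['A', 'B', 'C'], name='State name'),
)

# idxmax/idxmin return index labels, not integer positions
widest = gaps['Literacy_Gap'].idxmax()
narrowest = gaps['Literacy_Gap'].idxmin()

# The labels can be passed straight to .loc (or .at for a single cell)
widest_value = gaps.at[widest, 'Literacy_Gap']
narrowest_value = gaps.loc[narrowest, 'Literacy_Gap']

print(f'{widest}: {widest_value:.4f}')
print(f'{narrowest}: {narrowest_value:.4f}')
```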

Let's compute some summary statistics on Scheduled Caste (SC) populations by state, and sort the states by their mean SC population per district.

In [None]:
df_state_sc = df.groupby(['State name'])['SC'].agg([
    'mean',
    'median',
    'sum',
    'min',
    'max',
    'std',
    lambda x: x.quantile(0.33),
    lambda x: x.quantile(0.66),
]).sort_values('mean', ascending = False)

# Renaming the columns for better readability
df_state_sc.columns = ['mean', 'median', 'sum', 'min', 'max', 'std', '33rd_percentile', '66th_percentile']
df_state_sc
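
Renaming the columns afterwards is needed because both lambdas would otherwise get auto-generated names. One possible alternative, sketched below on made-up data, is pandas' named aggregation, where each keyword becomes the output column name. The `toy` table and the column names `p33` and `p66` are hypothetical choices for this illustration.

```python
import pandas as pd

# Hypothetical toy data: SC population per district (values are made up)
toy = pd.DataFrame({
    'State name': ['A', 'A', 'A', 'B', 'B'],
    'SC': [10, 20, 60, 5, 15],
})

# Named aggregation: keyword = output column name, value = aggregation function
sc_stats = toy.groupby('State name')['SC'].agg(
    mean='mean',
    median='median',
    p33=lambda x: x.quantile(0.33),
    p66=lambda x: x.quantile(0.66),
).sort_values('mean', ascending=False)

print(sc_stats)
```

With this form the column labels sit next to the functions that produce them, so the two cannot drift out of order.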

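Let's bin our original data into low, medium, and high Scheduled Caste populations **by district,** based on:
*   low: at or below the 33rd percentile
*   medium: above the 33rd percentile, up to and including the 66th percentile
*   high: above the 66th percentile

**Note:** We can use the `pd.cut()` function to bin data into different groups. By default each bin includes its right edge but not its left edge, which is why the lowest bin edge in the code is set to -1: a district with an SC population of 0 would still fall into the 'Low' bin.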

In [None]:
# Calculate the 33rd and 66th percentiles to define low and high thresholds
sc_quantiles = df['SC'].quantile([0.33, 0.66])

# Define thresholds for low, medium, and high SC population categories
low_threshold = sc_quantiles[0.33]  # 33rd percentile
high_threshold = sc_quantiles[0.66]  # 66th percentile

# Bin the SC population into 'Low', 'Medium', and 'High' categories based on these thresholds
df['SC_Population_Category'] = pd.cut(
    df['SC'],
    bins = [-1, low_threshold, high_threshold, df['SC'].max()],
    labels = ['Low', 'Medium', 'High']
)

In [None]:
# Display a sample of the binned data with SC population categories
df[['State name', 'District name', 'SC', 'SC_Population_Category']].sample(5)  # takes 5 random samples from the data
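
As a cross-check on the binning, the sketch below applies the same percentile thresholds to a made-up series with `pd.cut`, and compares the result with `pd.qcut`, which computes the quantile edges itself. The values in `sc` are hypothetical; on the real data, `value_counts()` on the new column is a quick way to confirm that each category holds roughly a third of the districts.

```python
import pandas as pd

# Hypothetical district-level SC populations (values are made up)
sc = pd.Series([0, 5, 10, 20, 40, 80, 160, 320, 640], name='SC')
labels = ['Low', 'Medium', 'High']

# Same approach as above: explicit thresholds passed to pd.cut
low, high = sc.quantile([0.33, 0.66])
via_cut = pd.cut(sc, bins=[-1, low, high, sc.max()], labels=labels)

# Alternative: let pd.qcut derive the edges from the quantiles directly
via_qcut = pd.qcut(sc, q=[0, 0.33, 0.66, 1], labels=labels)

# Count how many districts landed in each category
print(via_cut.value_counts())
print(list(via_cut) == list(via_qcut))
```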