Importing libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Let's begin analysing this dataset by loading the data

In [None]:
df = pd.read_csv('india_census_2011.csv')
df.head()

### Scatter Plot

Let's begin by making a scatter plot of the literate population against the total population, coloured by state.

In [None]:
# Scatter plot of literate population versus total population, coloured by state
plt.figure(figsize = (10, 6))
sns.scatterplot(data = df, x = 'Population', y = 'Literate', hue = 'State name', palette = 'tab20')
plt.legend(bbox_to_anchor = (1, 1), title = 'State')
plt.title('Literate population versus total population by state')
plt.ylabel('Literate population')
plt.show()

In general, the literate population grows roughly in proportion to the total population. There are too many states to derive any meaningful insight from the colouring of the points. Moreover, the `tab20` palette has only 20 distinct colours, so with this many states the colours repeat, which adds to the confusion. Let's group the states into regions to better understand our data.
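
Before regrouping, the proportionality seen in the scatter plot can be given a number. The sketch below computes the Pearson correlation between the two columns; the helper name `population_literacy_correlation` is hypothetical, and the toy values are made up purely to show the call, not taken from the census.

```python
import pandas as pd

def population_literacy_correlation(frame: pd.DataFrame) -> float:
    """Pearson correlation between total and literate population."""
    return frame['Population'].corr(frame['Literate'])

# Made-up toy values, used only to demonstrate the call (not census figures)
toy = pd.DataFrame({
    'Population': [100, 200, 300, 400],
    'Literate': [60, 130, 180, 260],
})
r = population_literacy_correlation(toy)
print(round(r, 3))
```

In this notebook the same helper could be called on `df`. A value close to 1 would support the reading that the literate population scales with the total population.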

In [None]:
# Define state-to-region mappings
region_mapping = {
    'JAMMU AND KASHMIR': 'Northern India', 'HIMACHAL PRADESH': 'Northern India', 'PUNJAB': 'Northern India',
    'HARYANA': 'Northern India', 'UTTAR PRADESH': 'Northern India', 'UTTARAKHAND': 'Northern India',
    'DELHI': 'Northern India', 'CHANDIGARH': 'Northern India', 'LADAKH': 'Northern India',
    'RAJASTHAN': 'Western India', 'GUJARAT': 'Western India', 'MAHARASHTRA': 'Western India',
    'GOA': 'Western India', 'DADRA AND NAGAR HAVELI': 'Western India', 'DAMAN AND DIU': 'Western India',
    'ANDHRA PRADESH': 'Southern India', 'KARNATAKA': 'Southern India', 'KERALA': 'Southern India',
    'TAMIL NADU': 'Southern India', 'TELANGANA': 'Southern India', 'PUDUCHERRY': 'Southern India',
    'PONDICHERRY': 'Southern India', 'NCT OF DELHI': 'Northern India', 'ORISSA': 'Eastern India',
    'LAKSHADWEEP': 'Southern India', 'BIHAR': 'Eastern India', 'JHARKHAND': 'Eastern India',
    'WEST BENGAL': 'Eastern India', 'ARUNACHAL PRADESH': 'North-Eastern India',
    'ASSAM': 'North-Eastern India', 'MANIPUR': 'North-Eastern India', 'MEGHALAYA': 'North-Eastern India',
    'MIZORAM': 'North-Eastern India', 'NAGALAND': 'North-Eastern India', 'TRIPURA': 'North-Eastern India',
    'SIKKIM': 'North-Eastern India', 'MADHYA PRADESH': 'Central India', 'CHHATTISGARH': 'Central India',
    'ANDAMAN AND NICOBAR ISLANDS': 'Eastern India'
}

# Create a new column mapping states to their appropriate regions
df['Region'] = df['State name'].map(region_mapping)
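
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a state name spelled differently from the mapping would silently end up without a region. The following is a minimal sketch of a check for that, assuming a hypothetical helper `unmapped_states`; the toy names and mapping are illustrative only.

```python
import pandas as pd

def unmapped_states(states: pd.Series, mapping: dict) -> list:
    """Return the sorted state names that have no key in the mapping."""
    missing = states[~states.isin(list(mapping))].dropna().unique()
    return sorted(missing)

# Toy inputs for illustration; 'ODISHA' is deliberately absent from the mapping
toy_states = pd.Series(['KERALA', 'GOA', 'ODISHA', 'GOA'])
toy_mapping = {'KERALA': 'Southern India', 'GOA': 'Western India'}
missing = unmapped_states(toy_states, toy_mapping)
print(missing)
```

Calling it with `df['State name']` and `region_mapping` should ideally return an empty list; any names it does return would need to be added to the mapping.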

Let's use the region mappings instead of the state for the `hue` parameter

In [None]:
# Scatter plot of literate population versus total population, coloured by geographical region
plt.figure(figsize = (10, 6))
sns.scatterplot(data = df, x = 'Population', y = 'Literate', hue = 'Region', palette = 'tab10')
plt.title('Literate population versus total population by geographical region')
plt.legend(bbox_to_anchor = (1, 1), title = 'Geographical region')
plt.xlabel('Population')
plt.ylabel('Literate population')
plt.show()

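With regional colouring, Western India appears to have the highest literacy rate (literate population as a share of total population), while Northern India appears to have the lowest. A scatter plot of raw counts can only suggest this, so the reading is tentative.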
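
One way to check this reading is to compute the literacy rate per region directly from the regional totals. This is a sketch under the assumption that `Literate` and `Population` are counts of people; the helper name `literacy_rate_by_region` and the toy frame are hypothetical.

```python
import pandas as pd

def literacy_rate_by_region(frame: pd.DataFrame) -> pd.Series:
    """Literate population as a percentage of total population, per region."""
    totals = frame.groupby('Region')[['Literate', 'Population']].sum()
    rate = totals['Literate'] / totals['Population'] * 100
    return rate.sort_values(ascending=False)

# Made-up toy values to demonstrate the calculation (not census figures)
toy = pd.DataFrame({
    'Region': ['A', 'A', 'B'],
    'Population': [100, 300, 200],
    'Literate': [80, 220, 100],
})
rates = literacy_rate_by_region(toy)
print(rates)
```

Running `literacy_rate_by_region(df)` after the `Region` column has been created would rank the regions and confirm or correct the impression given by the scatter plot.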

### Histogram

Let's move on to something simpler: a histogram showing district populations.

In [None]:
# Histogram showing the distribution of population across districts
plt.figure(figsize = (10, 6))
sns.histplot(data = df, x = 'Population')
plt.title('Distribution of population across districts')
plt.show()

Let's change the number of bins

In [None]:
plt.figure(figsize = (10, 6))
sns.histplot(data = df, x = 'Population', bins = 3)
plt.title('Distribution of population across districts')
plt.show()

The distribution is clearly right-skewed: the bulk of districts have a population of less than 2 million, and a long tail shows the number of districts dropping off as the population grows. The district with the greatest population can be found with:

In [None]:
df.loc[df['Population'].idxmax(), ['District name', 'State name', 'Population']]

The district with the lowest population can be found with:

In [None]:
df.loc[df['Population'].idxmin(), ['District name', 'State name', 'Population']]
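
The shape of the histogram can also be summarised numerically. The sketch below measures the share of districts under a population threshold (2 million, as in the text above) and the sample skewness, where a positive value indicates a long right tail. The helper `share_below` is hypothetical and the toy populations are made up.

```python
import pandas as pd

def share_below(values: pd.Series, threshold: float) -> float:
    """Percentage of values strictly below the threshold."""
    return (values < threshold).mean() * 100

# Made-up toy populations to demonstrate the calls (not census figures)
toy_population = pd.Series([500_000, 1_200_000, 1_800_000, 9_000_000])
pct_small = share_below(toy_population, 2_000_000)
skewness = toy_population.skew()
print(pct_small, skewness > 0)
```

On the real data, the equivalent calls would be `share_below(df['Population'], 2_000_000)` and `df['Population'].skew()`.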

### Count Plot

Let's create a count plot to show the number of districts per state

In [None]:
# Count plot of number of districts per state
plt.figure(figsize = (12, 8))
sns.countplot(data = df, y = 'State name')
plt.title('Number of districts per state')
plt.show()

Let's reorder this in descending order of count

In [None]:
df['State name'].value_counts()

In [None]:
plt.figure(figsize = (12, 8))
sns.countplot(data = df, y = 'State name', order = df['State name'].value_counts().index)
plt.title('Number of districts per state')
plt.show()

This plot shows that Uttar Pradesh has by far the greatest number of districts, while several Union Territories and Goa have the fewest.


### Bar Plot

Let's make a plot which shows the Scheduled Caste and Scheduled Tribe percentage in each region

In [None]:
region_sc_st = df.groupby('Region')[['SC', 'ST', 'Population']].sum().reset_index()
region_sc_st

In [None]:
# Aggregate SC, ST, and Population sums by region
region_sc_st = df.groupby('Region')[['SC', 'ST', 'Population']].sum().reset_index()

# Calculate SC and ST percentages based on aggregated regional totals
region_sc_st['SC_Percentage'] = (region_sc_st['SC'] / region_sc_st['Population']) * 100
region_sc_st['ST_Percentage'] = (region_sc_st['ST'] / region_sc_st['Population']) * 100

region_sc_st[['Region', 'SC_Percentage', 'ST_Percentage']]
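
Computing the percentage from aggregated totals is not the same as averaging district-level percentages: the first weights every person equally, the second weights every district equally. A small sketch with made-up numbers shows how far the two can diverge when district sizes differ.

```python
import pandas as pd

# Made-up toy values: one large district and one small district in region 'A'
toy = pd.DataFrame({
    'Region': ['A', 'A'],
    'Population': [1000, 100],
    'SC': [100, 50],
})

# Percentage from aggregated totals: each person counts equally
totals = toy.groupby('Region')[['SC', 'Population']].sum()
pooled_pct = (totals['SC'] / totals['Population'] * 100)['A']

# Mean of per-district percentages: each district counts equally
district_pct = toy['SC'] / toy['Population'] * 100
mean_pct = district_pct.groupby(toy['Region']).mean()['A']

print(round(pooled_pct, 2), round(mean_pct, 2))
```

The cell above uses the pooled form, which describes the share of the region's whole population.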

In [None]:
df_melted = region_sc_st[['Region', 'SC_Percentage', 'ST_Percentage']].melt(id_vars = 'Region', var_name = 'Category', value_name = 'Percentage')

In [None]:
df_melted

In [None]:
plt.figure(figsize = (12, 8))
sns.barplot(data = df_melted,
            x = 'Region',
            y = 'Percentage',
            hue = 'Category',
            palette = ['skyblue', 'salmon'])

plt.title('Scheduled Castes and Scheduled Tribes as a percentage of population, by region')
plt.xlabel('Region')
plt.ylabel('Population %')
plt.xticks(rotation = 45)
plt.legend(title = 'Category')

plt.tight_layout()
plt.show()

We can see above that North-Eastern India has the highest share of Scheduled Tribes in its population, while Northern India has the highest share of Scheduled Castes. Regions with a higher Scheduled Tribes share appear to have a lower Scheduled Castes share, and vice versa. This may be coincidental; we would have to study the data at the state or district level to understand it.
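
A possible first step at the state level is to compute both shares per state and look at their correlation. This is only a sketch: the helper `state_sc_st_shares` is hypothetical, the toy values are made up, and a correlation on its own would not explain why the two shares move together or apart.

```python
import pandas as pd

def state_sc_st_shares(frame: pd.DataFrame) -> pd.DataFrame:
    """SC and ST as a percentage of population, from state-level totals."""
    totals = frame.groupby('State name')[['SC', 'ST', 'Population']].sum()
    return pd.DataFrame({
        'SC_Percentage': totals['SC'] / totals['Population'] * 100,
        'ST_Percentage': totals['ST'] / totals['Population'] * 100,
    })

# Made-up toy values to demonstrate the calculation (not census figures)
toy = pd.DataFrame({
    'State name': ['X', 'X', 'Y', 'Z'],
    'Population': [100, 100, 200, 400],
    'SC': [30, 10, 20, 20],
    'ST': [10, 10, 60, 200],
})
shares = state_sc_st_shares(toy)
corr = shares['SC_Percentage'].corr(shares['ST_Percentage'])
print(shares)
print(corr < 0)
```

Applied to `df`, a clearly negative correlation would suggest the regional pattern also holds between states, while a value near zero would suggest it is an artefact of the regional grouping.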