
# **India Population Data Analysis**

## Step 1: Importing necessary libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Step 2: Reading the datasets

In [None]:
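# Note: judging by the file names, these two variable names do not match their contents.
# File 1.1 is the population density file and file 1.3 is the population growth rate file.
# The names are kept as-is because later cells refer to them.
population_data = pd.read_csv('/kaggle/input/india-population-data/1.1 - India_Historical_Population_Density_Data.csv')
population_density_data = pd.read_csv('/kaggle/input/india-population-data/1.3 - India_Historical_Population_Growth_Rate_Data.csv')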
rural_population_data = pd.read_csv('/kaggle/input/india-population-data/1.4 - India_Rural_Population_Historical_Data.csv')
urban_population_data = pd.read_csv('/kaggle/input/india-population-data/1.5 - India_Urban_Population_Historical_Data.csv')
birth_rate_data = pd.read_csv('/kaggle/input/india-population-data/1.6 - India_Historical_Birth_Rate_Data.csv')
death_rate_data = pd.read_csv('/kaggle/input/india-population-data/1.7 -India_Historical_Death_Rate_Data.csv')
fertility_rate_data = pd.read_csv('/kaggle/input/india-population-data/1.8 - India_Historical_Fertility_Rate_Data.csv')
infant_mortality_rate_data = pd.read_csv('/kaggle/input/india-population-data/1.9  - India_Historical_Infant_Mortality_Rate_Data.csv')
life_expectancy_data = pd.read_csv('/kaggle/input/india-population-data/2 - India_Historical_Life_Expectancy_Data.csv')
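
The repeated `read_csv` calls above could also be driven from a single mapping of labels to file names. The following is a minimal sketch under that assumption; `load_datasets` is a hypothetical helper that is not part of this notebook, and the demo writes two tiny throwaway CSV files to a temporary directory instead of reading the Kaggle inputs.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_datasets(base_dir, files):
    """Read each CSV in `files` (label -> file name) from `base_dir`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    base = Path(base_dir)
    return {label: pd.read_csv(base / name) for label, name in files.items()}


# Demo on throwaway files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.csv").write_text("year,value\n2000,1\n2001,2\n")
    (Path(tmp) / "b.csv").write_text("year,value\n2000,3\n")
    demo = load_datasets(tmp, {"A": "a.csv", "B": "b.csv"})

print({label: df.shape for label, df in demo.items()})
```

With a mapping like this, the dictionary of DataFrames is built in the same step as the reads, which keeps labels and files from drifting apart.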

In [None]:
datasets = {
    'Population Data': population_data,
    'Population Density Data': population_density_data,
    'Rural Population Data': rural_population_data,
    'Urban Population Data': urban_population_data,
    'Birth Rate Data': birth_rate_data,
    'Death Rate Data': death_rate_data,
    'Fertility Rate Data': fertility_rate_data,
    'Infant Mortality Rate Data': infant_mortality_rate_data,
    'Life Expectancy Data': life_expectancy_data
}

## Step 3: Displaying the first few rows of each dataset

In [None]:
for name, data in datasets.items():
    print(f"{name}:")
    display(data.head())

## Step 4: Data Cleaning

### Renaming columns


In [None]:
for name, data in datasets.items():
    data.rename(columns={'Unnamed: 0': 'id'}, inplace=True)

In [None]:
population_data.rename(columns={'popu1ation_growth_rate': 'population_growth_rate'}, inplace=True)

population_density_data.rename(columns={'popu1ation_growth_rate': 'population_growth_rate'}, inplace=True)

### Checking for Missing Values and Data Types

In [None]:
for name, data in datasets.items():
    print(f"Missing Values in {name}:")
    print(data.isnull().sum())
    print("\n")

In [None]:
for name, data in datasets.items():
    print(f"Data Types for {name}:")
    print(data.dtypes)
    print("\n")
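
The missing-value counts and data types printed above can also be viewed side by side, one row per column. This is a small sketch on made-up data; `summarize` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd


def summarize(df):
    """Return one row per column with its dtype and missing-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
    })


# Made-up rows for illustration only.
demo = pd.DataFrame({"year": [2000, 2001], "growth_rate": ["1.0%", None]})
summary = summarize(demo)
print(summary)
```

Seeing both in one table makes it easier to spot columns that are stored as strings and therefore need conversion.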

### Converting data types

#### Removing '%' sign and converting to float

In [None]:
population_data['growth_rate'] = population_data['growth_rate'].str.replace('%', '').astype(float)
population_density_data['growth_rate'] = population_density_data['growth_rate'].str.replace('%', '').astype(float)
rural_population_data['Change'] = rural_population_data['Change'].str.replace('%', '').astype(float)
urban_population_data['Change'] = urban_population_data['Change'].str.replace('%', '').astype(float)
birth_rate_data['Growth_Rate'] = birth_rate_data['Growth_Rate'].str.replace('%', '').astype(float)
death_rate_data['Growth_Rate'] = death_rate_data['Growth_Rate'].str.replace('%', '').astype(float)
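# Overwrite 'Growth_Rate' in place, as for the other datasets, instead of creating a new lowercase column.
fertility_rate_data['Growth_Rate'] = fertility_rate_data['Growth_Rate'].str.replace('%', '').astype(float)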
infant_mortality_rate_data['Growth_Rate'] = infant_mortality_rate_data['Growth_Rate'].str.replace('%', '').astype(float)
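
The same strip-and-convert step is repeated for every dataset, and `.astype(float)` raises if any value is not a clean number. One possible alternative is a small helper that tolerates stray whitespace and turns unparseable entries into `NaN`. This is a sketch only: `percent_to_float` is a hypothetical name and the sample values are made up.

```python
import pandas as pd


def percent_to_float(series):
    """Strip '%' signs and convert to float; unparseable values become NaN."""
    cleaned = series.astype(str).str.strip().str.replace("%", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


# Made-up values for illustration only.
sample = pd.Series(["1.04%", " 0.97% ", "n/a"])
converted = percent_to_float(sample)
print(converted.tolist())
```

Whether silently coercing bad values to `NaN` is preferable to failing loudly depends on the data; for a first pass over unfamiliar files, the strict `.astype(float)` used above can be the safer choice.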

#### Removing commas and converting to integers

In [None]:
population_data['population_growth_rate'] = population_data['population_growth_rate'].str.replace(',', '').astype(int)
population_density_data['population_growth_rate'] = population_density_data['population_growth_rate'].str.replace(',', '').astype(int)
rural_population_data['Population'] = pd.to_numeric(rural_population_data['Population'].str.replace(',', ''))
urban_population_data['Population'] = pd.to_numeric(urban_population_data['Population'].str.replace(',', ''))
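
Comma-separated thousands can also be handled while reading the file, through the `thousands` parameter of `pd.read_csv`. The sketch below uses made-up rows held in memory; whether this fits the actual files depends on how their columns are quoted, which is an assumption here.

```python
import io

import pandas as pd

# Made-up rows for illustration; the quoted fields mimic "1,000,000"-style counts.
csv_text = 'year,Population\n2000,"1,000,000"\n2001,"1,250,000"\n'

demo = pd.read_csv(io.StringIO(csv_text), thousands=",")
print(demo.dtypes)
print(demo["Population"].tolist())
```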

### Ensuring consistency

#### Checking for duplicate values

In [None]:
for name, data in datasets.items():
    print(f"Duplicate Values in {name}:", end = " ")
    print(data.duplicated().sum())

#### Descriptive statistics for each dataset

In [None]:
for name, data in datasets.items():
    print(f"Descriptive Statistics for {name}:")
    display(data.describe(include='all'))
    print("\n")

## Step 5: Data Visualization

#### Plotting Growth Rate Over the Years

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=population_data, x='year', y='growth_rate')
plt.title('Growth Rate Over the Years')
plt.xlabel('Year')
plt.ylabel('Growth Rate')
plt.grid(True)
plt.xlim(population_data['year'].min(), population_data['year'].max())
plt.ylim(population_data['growth_rate'].min(), population_data['growth_rate'].max() * 1.1)
plt.show()

#### Plotting Population Growth Rate Over the Years

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=population_data, x='year', y='population_growth_rate')
plt.title('Population Growth Rate Over the Years')
plt.xlabel('Year')
plt.ylabel('Population Growth Rate')
plt.grid(True)
plt.xlim(population_data['year'].min(), population_data['year'].max())
plt.ylim(population_data['population_growth_rate'].min(), population_data['population_growth_rate'].max() * 1.1)
plt.show()

#### Plotting Population Density Over the Years

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=population_data, x='year', y='Population_Density')
plt.title('Population Density Over the Years')
plt.xlabel('Year')
plt.ylabel('Population Density')
plt.grid(True)
plt.show()

#### Plotting Rural vs Urban Population Over the Years

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=rural_population_data, x='year', y='Population', label='Rural Population')
sns.lineplot(data=urban_population_data, x='year', y='Population', label='Urban Population')
plt.title('Rural vs Urban Population Over the Years')
plt.xlabel('Year')
plt.ylabel('Population')
plt.legend()
plt.grid(True)
plt.show()

#### Plotting Change in Rural vs Urban Population Over the Years

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=rural_population_data, x='year', y='Change', label='Rural Population Change')
sns.lineplot(data=urban_population_data, x='year', y='Change', label='Urban Population Change')
plt.title('Change in Rural vs Urban Population Over the Years')
plt.xlabel('Year')
plt.ylabel('Change in Population')
plt.legend()
plt.grid(True)
plt.show()

#### Plotting Birth and Death Rates Over the Years

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=birth_rate_data, x='year', y='Birth_Rate', label='Birth Rate')
sns.lineplot(data=death_rate_data, x='year', y='Death_Rate', label='Death Rate')
plt.title('Birth and Death Rates Over the Years')
plt.xlabel('Year')
plt.ylabel('Rate')
plt.legend()
plt.grid(True)
plt.show()

#### Plotting Fertility Rate Over the Years

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=fertility_rate_data, x='year', y='Fertility_Rate')
plt.title('Fertility Rate Over the Years')
plt.xlabel('Year')
plt.ylabel('Fertility Rate')
plt.grid(True)
plt.show()

#### Plotting Infant Mortality Rate Over the Years

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=infant_mortality_rate_data, x='year', y='Infant_Mortality_Rate')
plt.title('Infant Mortality Rate Over the Years')
plt.xlabel('Year')
plt.ylabel('Infant Mortality Rate')
plt.grid(True)
plt.show()

#### Plotting Life Expectancy Over the Years

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=life_expectancy_data, x='year', y='Life_Expectancy')
plt.title('Life Expectancy Over the Years')
plt.xlabel('Year')
plt.ylabel('Life Expectancy')
plt.grid(True)
plt.show()

#### Comparing Population Density with Growth Rate

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=population_data, x='Population_Density', y='growth_rate')
plt.title('Population Density vs. Growth Rate')
plt.xlabel('Population Density')
plt.ylabel('Growth Rate')
plt.grid(True)
plt.show()

#### Birth Rate vs. Fertility Rate

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=birth_rate_data, x='year', y='Birth_Rate', label='Birth Rate')
sns.scatterplot(data=fertility_rate_data, x='year', y='Fertility_Rate', label='Fertility Rate')
plt.title('Birth Rate vs. Fertility Rate')
plt.xlabel('Year')
plt.ylabel('Rate')
plt.legend()
plt.grid(True)
plt.show()

#### Infant Mortality Rate vs. Life Expectancy

In [None]:
plt.figure(figsize=(10, 6))
# Put the year on the x-axis, consistent with the other time-series plots.
sns.scatterplot(data=infant_mortality_rate_data, x='year', y='Infant_Mortality_Rate', label='Infant Mortality Rate')
sns.scatterplot(data=life_expectancy_data, x='year', y='Life_Expectancy', label='Life Expectancy')
plt.title('Infant Mortality Rate vs. Life Expectancy')
plt.xlabel('Year')
plt.ylabel('Rate')
plt.legend()
plt.grid(True)
plt.show()
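
Because infant mortality and life expectancy are measured in different units, a shared axis only shows their direction of travel. To quantify how the two move together, one option is to merge the tables on `year` and compute a Pearson correlation. The sketch below uses made-up values; the column names follow the plotting code above, and the perfectly linear toy data is chosen only to make the result easy to check.

```python
import pandas as pd

# Made-up values for illustration; column names follow the plotting code above.
imr = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003],
    "Infant_Mortality_Rate": [66.0, 64.0, 62.0, 60.0],
})
le = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003],
    "Life_Expectancy": [62.0, 62.5, 63.0, 63.5],
})

# Keep only the years present in both tables, then correlate the two columns.
merged = imr.merge(le, on="year", how="inner")
r = merged["Infant_Mortality_Rate"].corr(merged["Life_Expectancy"])
print(round(r, 3))
```

A value of `r` close to -1 would indicate that the two series move in opposite directions over the years covered.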

#### Bar Plot of Average Population Growth Rate by Decade

In [None]:
population_data['decade'] = (population_data['year'] // 10) * 10
avg_growth_rate_by_decade = population_data.groupby('decade')['growth_rate'].mean().reset_index()
plt.figure(figsize=(10, 6))
sns.barplot(data=avg_growth_rate_by_decade, x='decade', y='growth_rate')
plt.title('Average Population Growth Rate by Decade')
plt.xlabel('Decade')
plt.ylabel('Average Growth Rate')
plt.grid(True)
plt.show()
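
The decade label comes from integer division: `year // 10` drops the last digit, and multiplying by 10 restores the scale, so 1995 becomes 1990. A small sketch on made-up data shows the binning and the per-decade mean:

```python
import pandas as pd

# Made-up values for illustration only.
demo = pd.DataFrame({
    "year": [1989, 1990, 1995, 1999, 2000],
    "growth_rate": [2.0, 2.0, 1.8, 1.6, 1.5],
})

# Floor each year to the start of its decade, then average within each decade.
demo["decade"] = (demo["year"] // 10) * 10
by_decade = demo.groupby("decade")["growth_rate"].mean()
print(by_decade)
```

Note that the first and last decades in a real dataset may contain fewer than ten years, so their averages rest on less data.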

#### Histogram of Birth Rates

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(birth_rate_data['Birth_Rate'], bins=20, kde=True)
plt.title('Distribution of Birth Rates')
plt.xlabel('Birth Rate')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

#### Box Plot of Life Expectancy by Decade

In [None]:
life_expectancy_data['decade'] = (life_expectancy_data['year'] // 10) * 10
plt.figure(figsize=(10, 6))
sns.boxplot(data=life_expectancy_data, x='decade', y='Life_Expectancy')
plt.title('Life Expectancy by Decade')
plt.xlabel('Decade')
plt.ylabel('Life Expectancy')
plt.grid(True)
plt.show()

#### Heatmap of Correlation Matrix for Life Expectancy Data

In [None]:
plt.figure(figsize=(10, 6))
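# numeric_only=True skips non-numeric columns, which would otherwise raise an error in pandas 2.0+.
correlation_matrix = life_expectancy_data.corr(numeric_only=True)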
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix of Life Expectancy Data')
plt.show()

## Conclusion

### Summary of Findings
In this notebook, we explored several datasets on India's population: historical population density, rural and urban populations, birth and death rates, fertility rates, infant mortality rates, and life expectancy. After cleaning, converting, and visualizing the data, we observed the following trends:
- The population growth rate has not been constant; both the yearly line plot and the decade averages show it changing over time.
- Rural and urban populations follow noticeably different trends.
- Birth and death rates have both varied considerably, and together they shape overall population dynamics.
- Fertility rates have declined over the years, in line with the trend in birth rates.
- Infant mortality rates have decreased while life expectancy has improved.


### Conclusions

- India's population growth rate has passed through distinct phases, with periods of rapid growth and periods of relative stability.
- Urbanization is evident: the urban population is growing faster than the rural population.
- Improvements in healthcare and living conditions are reflected in the declining infant mortality rate and the rising life expectancy.
- Fertility rates show a downward trend, which contributes to the changes in birth rates.

