# Project: Analysis of the Demographics of India

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#Gathering Data">Gathering Data</a></li>
<li><a href="#Cleaning Data">Cleaning Data</a></li>
<li><a href="#Exploring Data">Exploration</a></li>
<li><a href="#Conclusion">Conclusion</a></li>
</ul>

<a id='intro'></a>
## Introduction

India is the second most populous country in the world, with nearly a fifth of the world's population. According to the 2017 revision of the World Population Prospects, the population stood at 1,324,171,354. Between 1975 and 2010 the population doubled to 1.2 billion, having reached the one-billion mark in 1998. India is projected to become the world's most populous country by 2024, surpassing China.

Let's look at how the population has grown and which factors affect it.

In [80]:
# import necessary libraries
import pandas as pd
import numpy as np
#% matplotlib inline
#from matplotlib import pyplot as plt
#import seaborn as sns
import requests
from bs4 import BeautifulSoup  
import plotly.plotly as py
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected = True)
import plotly.graph_objs as go
from plotly import tools

<a id='Gathering Data'></a>
## Data Gathering

In [81]:
#get the url 
url = 'https://en.wikipedia.org/wiki/Demographics_of_India'
response = requests.get(url)
response

<Response [200]>

In [82]:
#create soup
soup = BeautifulSoup(response.content,'lxml')

In [83]:
# get all tables data 
all_tables = soup.find_all('table', class_ = 'wikitable sortable')
print(f"Number of tables captured: {len(all_tables)}")

Number of tables captured: 18


In [84]:
#get_data function returns a list with only text 

def get_data(all_data):
    df_list = []
    for data in all_data:
        df_list.append(data.text.strip())
    return df_list

In [85]:
'''get_dataframe takes the soup of a single table and the number of columns in it.
It gathers the cells under the 'td' tag and builds one list per column.
Creating a dataframe needs a dictionary structure, so the columns are collected
in 'df_dict', keyed by column position. The column titles are returned alongside it.
'''
def get_dataframe(sample,num):
    all_data = sample.find_all('td')
    df_list = []
    for i in range(num):
        df_list.append(all_data[i::num])
        
    df_dict = {i:[] for i in range(num)}
    for i in range(len(df_dict)):
        data = get_data(df_list[i])
        df_dict[i] = data
    title = collect_col_names(sample)
    return df_dict,title

In [86]:
# collect_col_names function collects the column names of each table
def collect_col_names(sample):
    all_data = sample.find_all('th')
    col_title=  [] 
    for header in all_data:
        col_title.append(header.text.strip())
    return col_title
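
The stride slicing in `get_dataframe` may be easier to follow on a toy example. The sketch below uses a plain Python list in place of the `td` tags, with cell values taken from the first rows of the per-decade table shown later. The helper name `cells_to_columns` is hypothetical and not part of this notebook. It assumes every row has exactly `num` cells, which is the same assumption `get_dataframe` makes.

```python
def cells_to_columns(cells, num):
    """Regroup a flat, row-major list of cells into `num` column lists."""
    # cells[i::num] takes every num-th cell starting at offset i,
    # i.e. the i-th cell of every row.
    return {i: cells[i::num] for i in range(num)}


# Flat list in the order find_all('td') would return it: row 1, then row 2.
flat_cells = ["1951", "361088000", "–",
              "1961", "439235000", "21.6"]

columns = cells_to_columns(flat_cells, 3)
print(columns)
```

If a row has more or fewer cells than `num`, the column lists end up with unequal lengths or misaligned values, so it is worth checking the shape of each dataframe after building it.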

In [87]:
pop_under_british_dict,pop_under_british_cols = get_dataframe(all_tables[2],3)
pop_per_decade_dict,pop_per_decade_cols = get_dataframe(all_tables[3],3)
pop_dist_by_states_dict,pop_dist_by_states_cols = get_dataframe(all_tables[4],12)
pop_bet_age_0_6_dict,pop_bet_age_0_6_cols = get_dataframe(all_tables[7],6)
pop_abv_7_dict,pop_abv_7_cols = get_dataframe(all_tables[8],5)
literacy_rate_dict,literacy_rate_cols = get_dataframe(all_tables[9],5)
native_speakers_dict,native_speakers_cols = get_dataframe(all_tables[10],4)
un_stat_dict,un_stat_cols = get_dataframe(all_tables[11],9)
census_dict,census_cols = get_dataframe(all_tables[12],9)
pop_struc_dict,pop_struc_cols = get_dataframe(all_tables[13],5)
pop_struc_2016_dict,pop_struc_2016_cols = get_dataframe(all_tables[14],4)
fertility_rate_dict,fertility_rate_cols = get_dataframe(all_tables[15],7)
crude_birth_rate_dict,crude_birth_rate_cols = get_dataframe(all_tables[16],7)
regional_stats_dict,regional_stats_cols = get_dataframe(all_tables[17],13)

In [88]:
lc = soup.find_all('table', class_ = 'navbox')[0].find_all('td')[1:]
largest_cities = []
for i in lc:
    largest_cities.append(i.text.strip())
    
index = largest_cities.index('BangaloreHyderabad')

del largest_cities[index]

df_dict = {}
for i in range(1,4):
    df_dict[i] = largest_cities[i::4]

    
largest_cities_df = pd.DataFrame(df_dict)
largest_cities_df.columns = ['city','state','population']
largest_cities_df.head()

Unnamed: 0,city,state,population
0,Mumbai,Maharashtra,12478447
1,Kanpur,Uttar Pradesh,2920067
2,Delhi,Delhi,11007835
3,Lucknow,Uttar Pradesh,2901474
4,Bangalore,Karnataka,8425970


In [89]:
life_expec = soup.find_all('table',class_ = 'wikitable')[18]
life_cells = life_expec.find_all('td')

# cells alternate between period and life expectancy, so slice with a stride of 2
life_expectancy = [life_cells[i::2] for i in range(2)]
life_dict = {i: get_data(life_expectancy[i]) for i in range(2)}

life_expe_df = pd.DataFrame(life_dict)
life_expe_df.columns = ['period','life_expec_in_years']
life_expe_df.head()

Unnamed: 0,period,life_expec_in_years
0,1950–1955,36.6
1,1985–1990,56.7
2,1955–1960,39.7
3,1990–1995,59.1
4,1960–1965,42.7


- Collected all the necessary tables

Let's move ahead to clean the tables

<a id='Cleaning Data'></a>
## Data Cleaning

- First, assign names to the columns of each dataframe.

In [90]:
# create_dataframe function returns a dataframe, taking a dictionary and column names as input.
def create_dataframe(df,cols):
    df=pd.DataFrame(df)
    df.columns = cols
    return df

### Dataframe has population and growth (%) under the British Raj

In [91]:
pop_under_british_df = create_dataframe(pop_under_british_dict,pop_under_british_cols )
print(f"Number of observations:{pop_under_british_df.shape}")
pop_under_british_df.head()

Number of observations:(8, 3)


Unnamed: 0,Census year,Population,Growth (%)
0,1871[35],238830958,–
1,1881[36],253896330,6.3
2,1891[35],287223431,13.1
3,1901[35],293550310,2.2
4,1911[37],315156396,7.4


### Dataframe has information about population growth per decade

In [92]:
pop_per_decade_df= create_dataframe(pop_per_decade_dict,pop_per_decade_cols)
print(f"Number of observations:{pop_per_decade_df.shape}")
pop_per_decade_df.head()

Number of observations:(7, 3)


Unnamed: 0,Census year,Population,Change (%)
0,1951,361088000,–
1,1961,439235000,21.6
2,1971,548160000,24.8
3,1981,683329000,24.7
4,1991,846387888,23.9


### Dataframe has information about population distribution by states

In [93]:
print(pop_dist_by_states_cols[:-12])

['Rank', 'State/UT', 'Population[51]', 'Percent (%)', 'Male', 'Female', 'Difference between male and female', 'Sex Ratio', 'Rural[52]', 'Urban[52]', 'Area[53] (km2)', 'Density (per km2)']


In [94]:
pop_dist_by_states_df = create_dataframe(pop_dist_by_states_dict,pop_dist_by_states_cols[:-12])
print(f"Number of observations:{pop_dist_by_states_df.shape}")
pop_dist_by_states_df.head()

Number of observations:(36, 12)


Unnamed: 0,Rank,State/UT,Population[51],Percent (%),Male,Female,Difference between male and female,Sex Ratio,Rural[52],Urban[52],Area[53] (km2),Density (per km2)
0,1,Uttar Pradesh,199812341,16.5,104480510,95331831,9148679,930,155111022,44470455,240928,828
1,2,Maharashtra,112374333,9.28,58243056,54131277,4111779,929,61545441,50827531,307713,365
2,3,Bihar,104099452,8.6,54278157,49821295,4456862,918,92075028,11729609,94163,1102
3,4,West Bengal,91276115,7.54,46809027,44467088,2341939,950,62213676,29134060,88752,1030
4,5,Madhya Pradesh,72626809,6.0,37612306,35014503,2597803,931,52537899,20059666,308245,236


### Dataframe has information about population distribution between age 0 and 6 by states.

In [95]:
pop_bet_age_0_6_cols[:-6]

['State or UT code', 'State or UT', 'Total', 'Male', 'Female', 'Difference']

In [96]:
pop_bet_age_0_6_df = create_dataframe(pop_bet_age_0_6_dict,pop_bet_age_0_6_cols[:-6])
print(f"Number of observations:{pop_bet_age_0_6_df.shape}")
pop_bet_age_0_6_df.head()

Number of observations:(35, 6)


Unnamed: 0,State or UT code,State or UT,Total,Male,Female,Difference
0,1,Jammu and Kashmir,2008670,1080662,927982,152680
1,2,Himachal Pradesh,763864,400681,363183,37498
2,3,Punjab,2941570,1593262,1348308,244954
3,4,Chandigarh,117953,63187,54766,8421
4,5,Uttarakhand,1328844,704769,624075,80694


### Dataframe has information about population distribution above age 7 by states.

In [97]:
pop_abv_7_cols[:-5]

['State or UT code', 'State or UT', 'Total', 'Male', 'Female']

In [98]:
pop_abv_7_df = create_dataframe(pop_abv_7_dict,pop_abv_7_cols[:-5])
print(f"Number of observations:{pop_abv_7_df.shape}")
pop_abv_7_df.head()

Number of observations:(35, 5)


Unnamed: 0,State or UT code,State or UT,Total,Male,Female
0,1,Jammu and Kashmir,–,–,–
1,2,Himachal Pradesh,–,–,–
2,3,Punjab,–,–,–
3,4,Chandigarh,–,–,–
4,5,Uttarakhand,–,–,–


### Dataframe has information about literacy rate by states.

In [99]:
literacy_rate_cols[:-5]

['State or UT code', 'State or UT', 'Overall (%)', 'Male (%)', 'Female (%)']

In [100]:
literacy_rate_df = create_dataframe(literacy_rate_dict,literacy_rate_cols[:-5])
print(f"Number of observations:{literacy_rate_df.shape}")
literacy_rate_df.head()

Number of observations:(35, 5)


Unnamed: 0,State or UT code,State or UT,Overall (%),Male (%),Female (%)
0,1,Jammu and Kashmir,86.61,87.26,86.23
1,2,Himachal Pradesh,83.78,90.83,76.6
2,3,Punjab,76.6,81.48,71.34
3,4,Chandigarh,86.43,90.54,81.38
4,5,Uttarakhand,79.63,88.33,70.7


### Dataframe has information about languages of India by number of native speakers at the 2001 census.

In [101]:
native_speakers_df  = create_dataframe(native_speakers_dict,native_speakers_cols)
print(f"Number of observations:{native_speakers_df.shape}")
native_speakers_df.head()

Number of observations:(29, 4)


Unnamed: 0,Rank,Language,Speakers,Percentage (%)
0,1,Hindi[74],422048642,41.03
1,2,Bengali,83369769,8.11
2,3,Telugu,74002856,7.19
3,4,Marathi,71936894,6.99
4,5,Tamil,60793814,5.91


### Dataframe has information about United Nations, World Population Prospects: The 2015 revision – India

In [102]:
del un_stat_dict[0][-1]
un_stat_df = create_dataframe(un_stat_dict,un_stat_cols)
print(f"Number of observations:{un_stat_df.shape}")
un_stat_df.head()

Number of observations:(13, 9)


Unnamed: 0,Period,Births per year,Deaths per year,Natural change per year,CBR1,CDR1,NC1,TFR1,IMR1
0,1950–1955,16832000,9928000,6904000,43.3,25.5,17.7,5.9,165.0
1,1955–1960,17981000,9686000,8295000,42.1,22.7,19.4,5.9,153.1
2,1960–1965,19086000,9358000,9728000,40.4,19.8,20.6,5.82,140.1
3,1965–1970,20611000,9057000,11554000,39.2,17.2,22.0,5.69,128.5
4,1970–1975,22022000,8821000,13201000,37.5,15.0,22.5,5.26,118.0


### Dataframe has information about Census of India
- The numbers of births and deaths were calculated from the birth and death rates and the average population.

In [103]:
del census_dict[0][-1]
census_df = create_dataframe(census_dict,census_cols)
print(f"Number of observations:{census_df.shape}")
census_df.head()

Number of observations:(36, 9)


Unnamed: 0,Year,Average population(x 1000),Live births1,Deaths1,Natural change,Crude birth rate(per 1000),Crude death rate(per 1000),Natural change(per 1000),Total fertility rate
0,1981,716493,24289000,8956000,15333000,33.9,12.5,21.4,–
1,1982,733152,24781000,8725000,16056000,33.8,11.9,21.9,–
2,1983,750034,25276000,8925000,16351000,33.7,11.9,21.8,–
3,1984,767147,26006000,9666000,16340000,33.9,12.6,21.3,–
4,1985,784491,25810000,9257000,16553000,32.9,11.8,21.1,–


### Dataframe has structure of the population (09.02.2011) (Census) (Includes data for the Indian-administered part of Jammu and Kashmir)

In [104]:
pop_struc_df = create_dataframe(pop_struc_dict,pop_struc_cols[:-5])
print(f"Number of observations:{pop_struc_df.shape}")
pop_struc_df.head()

Number of observations:(22, 5)


Unnamed: 0,Age group,Male,Female,Total,Percentage (%)
0,0–4,58632074,54174704,112806778,9.32
1,5–9,66300466,60627660,126928126,10.48
2,10–14,69418835,63290377,132709212,10.96
3,15–19,63982396,56544053,120526449,9.95
4,20–24,57584693,53839529,111424222,9.2


### Dataframe has population pyramid 2016 (estimates)

In [105]:
pop_struc_2016_df = create_dataframe(pop_struc_2016_dict,pop_struc_2016_cols )
print(f"Number of observations:{pop_struc_2016_df.shape}")
pop_struc_2016_df.head()

Number of observations:(21, 4)


Unnamed: 0,Age group,Male,Female,Total
0,0–4,8.7,8.2,8.5
1,5–9,9.1,8.8,8.9
2,10–14,9.8,9.4,9.6
3,15–19,10.4,9.9,10.1
4,20–24,10.2,10.7,10.4


CBR = crude birth rate (per 1000); TFR = total fertility rate (number of children per woman). The trailing `1` in the TFR column names is a footnote marker: the number in parentheses is the wanted fertility rate.
- Collected from the Demographic Health Survey:

In [106]:
del crude_birth_rate_dict[0][-1]
del crude_birth_rate_dict[1][-1]
crude_birth_rate_df = create_dataframe(crude_birth_rate_dict,crude_birth_rate_cols)
print(f"Number of observations:{crude_birth_rate_df.shape}")
crude_birth_rate_df.head()

Number of observations:(29, 7)


Unnamed: 0,State (Population 2011),CBR – Total,TFR – Total1,CBR – Urban,TFR – Urban1,CBR – Rural,TFR – Rural1
0,Uttar Pradesh (199 812 341),22.6,2.74 (2.06),18.6,2.08 (1.62),24.0,2.99 (2.22)
1,Maharashtra (112 374 333),16.6,1.87 (1.57),15.5,1.68 (1.41),17.5,2.06 (1.73)
2,Bihar (104 099 452),27.1,3.41 (2.48),20.4,2.42 (1.83),28.0,3.56 (2.58)
3,West Bengal (91 276 115),16.6,1.77 (1.53),14.0,1.57 (1.38),18.0,1.85 (1.58)
4,Madhya Pradesh (72 626 809),20.2,2.32 (1.82),17.7,1.95 (1.61),21.3,2.48 (1.91)


In [107]:
del fertility_rate_dict[0][-1]
fertility_rate_df = create_dataframe(fertility_rate_dict,fertility_rate_cols)
print(f"Number of observations:{fertility_rate_df.shape}")
fertility_rate_df.head()

Number of observations:(4, 7)


Unnamed: 0,Year,CBR – Total,TFR – Total1,CBR – Urban,TFR – Urban1,CBR – Rural,TFR – Rural1
0,1992–1993,28.7,3.39 (2.64),24.1,2.70 (2.09),30.4,3.67 (2.86)
1,1998–1999,24.8,2.85 (2.13),20.9,2.27 (1.73),26.2,3.07 (2.28)
2,2005–2006,23.1,2.68 (1.90),18.8,2.06 (1.60),25.0,2.98 (2.10)
3,2015–2016,19.0,2.18 (1.8),15.8,1.75 (1.5),20.7,2.41 (1.9)


### Birth rate, death rate, natural growth rate, and infant mortality rate, by state or UT (2010)

In [108]:
regional_stats_df = pd.DataFrame(regional_stats_dict)
regional_stats_df.columns = ['state','birth_rate_total','birth_rate_rural','birth_rate_urban','death_rate_total','death_rate_rural','death_rate_urban','natural_growth_rate_total','natural_growth_rate_rural','natural_growth_rate_urban','infant_moratatily_rate_total','infant_moratatily_rate_rural','infant_moratatily_rate_urban']
print(f"Number of observations:{regional_stats_df.shape}")
regional_stats_df.head()

Number of observations:(35, 13)


Unnamed: 0,state,birth_rate_total,birth_rate_rural,birth_rate_urban,death_rate_total,death_rate_rural,death_rate_urban,natural_growth_rate_total,natural_growth_rate_rural,natural_growth_rate_urban,infant_moratatily_rate_total,infant_moratatily_rate_rural,infant_moratatily_rate_urban
0,Andaman and Nicobar Islands,15.6,15.5,15.8,4.3,4.8,3.3,11.3,10.7,12.6,25,29,18
1,Andhra Pradesh,17.9,18.3,16.7,7.6,8.6,5.4,10.2,9.7,11.3,46,51,33
2,Arunachal Pradesh,20.5,22.1,14.6,5.9,6.9,2.3,14.6,15.2,12.3,31,34,12
3,Assam,23.2,24.4,15.8,8.2,8.6,5.8,14.9,15.8,10.1,58,60,36
4,Bihar,28.1,28.8,22.0,6.8,7.0,5.6,21.3,21.8,16.4,48,49,38


### Quality

- Numeric values contain comma separators, e.g. the `Population` column in `pop_under_british_df`.
- Missing data is represented by an en dash ('–').
- Some columns hold more than one piece of information, e.g. `State (Population 2011)` and the TFR columns in `crude_birth_rate_df`.
- The `Total fertility rate` column in `census_df` has a '~' symbol before some numbers.
- The dtype of some columns should be `int` or `float` instead of `object`.

Copy each dataframe before starting the cleaning process.

In [109]:
pop_under_british_df_clean = pop_under_british_df.copy()
regional_stats_df_clean = regional_stats_df.copy()
fertility_rate_df_clean = fertility_rate_df.copy()
crude_birth_rate_df_clean = crude_birth_rate_df.copy()
pop_struc_2016_df_clean = pop_struc_2016_df.copy()
pop_struc_df_clean =  pop_struc_df.copy()
census_df_clean =  census_df.copy()
un_stat_df_clean =  un_stat_df.copy()
native_speakers_df_clean =  native_speakers_df.copy()
literacy_rate_df_clean =  literacy_rate_df.copy()
pop_abv_7_df_clean =  pop_abv_7_df.copy()
pop_bet_age_0_6_df_clean =  pop_bet_age_0_6_df.copy()
pop_dist_by_states_df_clean =  pop_dist_by_states_df.copy()
pop_per_decade_df_clean = pop_per_decade_df.copy()
life_expe_df_clean = life_expe_df.copy()
largest_cities_df_clean = largest_cities_df.copy()

### `Define`
- Find the rows of `Total fertility rate` that contain '~' and remove the symbol.

### `Code`

In [110]:
maskk = census_df_clean['Total fertility rate'].str.contains('~')
census_df_clean.loc[maskk,'Total fertility rate'] = census_df_clean.loc[maskk,'Total fertility rate'].str.replace('~ ','')

### `Test`

In [111]:
census_df_clean['Total fertility rate'].tail()

31    2.4
32    2.3
33    2.3
34    2.2
35    2.2
Name: Total fertility rate, dtype: object

### `Define`

- Find the cells containing '–' (en dash) and replace them with NaN.

### `Code`

In [112]:
df_missing_list = [pop_abv_7_df_clean,census_df_clean,pop_under_british_df_clean,
                   pop_per_decade_df_clean,un_stat_df_clean] 
#un_stat_df_clean.iloc[:,1:]
errors_missing = {}
for df in df_missing_list:
    cols = df.columns
    for col in cols:
        try:
            mask = df.loc[:,col] == '–'
            df.loc[mask,col] = np.nan
        except Exception as e:
            errors_missing[col] = str(e)     


In [113]:
errors_missing

{}

### `Test`

In [114]:
pop_abv_7_df_clean.head()

Unnamed: 0,State or UT code,State or UT,Total,Male,Female
0,1,Jammu and Kashmir,,,
1,2,Himachal Pradesh,,,
2,3,Punjab,,,
3,4,Chandigarh,,,
4,5,Uttarakhand,,,


### `Define`

#### Some columns hold integers but are stored as `object`, and their values contain comma separators.
- To handle this, I put all the dataframes in a dictionary, each under an integer key.
- The next step is to loop over the dictionary and, for each dataframe, loop over its columns to check whether any value contains a comma. If so, remove the commas and convert the column to `int`.
- If the conversion fails, record the dataframe key and column name in another dictionary and handle them separately.

### `Code`

In [115]:
df_dict = {0:pop_under_british_df_clean,1:regional_stats_df_clean,2:fertility_rate_df_clean,3:crude_birth_rate_df_clean,
           4:pop_struc_2016_df_clean,5:pop_struc_df_clean,6:census_df_clean,7:un_stat_df_clean,8:native_speakers_df_clean,
           9:literacy_rate_df_clean,10:pop_abv_7_df_clean,11:pop_bet_age_0_6_df_clean,12:pop_dist_by_states_df_clean,
           13:pop_per_decade_df_clean,14:life_expe_df_clean,15:largest_cities_df_clean}

errors = {}
for key ,df in df_dict.items():
    cols = df.columns
    for col in cols:
        if (df[col].str.contains(',')).any():
            try:
                df[col] = df[col].str.replace(',','').astype(int)
            except Exception as e:
                errors[key] = col

In [116]:
print(errors)
largest_cities_df_clean['population'] = largest_cities_df_clean['population'].str.replace(',','').str.replace(' ','').astype(int)

{7: 'Natural change per year', 10: 'Female', 15: 'population'}


### `Test`

In [117]:
print(f"Before Cleaning\n{pop_under_british_df.dtypes}")
print(f"After Cleaning\n{pop_under_british_df_clean.dtypes}")

Before Cleaning
Census year    object
Population     object
Growth (%)     object
dtype: object
After Cleaning
Census year    object
Population      int32
Growth (%)     object
dtype: object


- As we can see, the `Population` column is now an integer type. The same conversion has been applied to all the relevant columns in the other dataframes.
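
The cleaning rules used so far (comma separators, '~' prefixes, footnote markers and the en dash for missing data) can be sketched for a single cell as below. This is an illustration only, not the code the notebook runs: `clean_number` and `MISSING` are hypothetical names, and returning `None` for missing cells is an assumption made for the example.

```python
import re

# Markers treated as "no data": en dash (used by the scraped tables),
# plain hyphen, and the empty string.
MISSING = {"–", "-", ""}


def clean_number(text):
    """Convert a scraped cell such as '238,830,958' or '~ 2.3' to a number.

    Returns None for cells that mark missing data.
    """
    text = re.sub(r"\[\d+\]", "", text)   # drop footnote markers like [35]
    text = re.sub(r"[,\s~]", "", text)    # drop commas, whitespace and '~'
    if text in MISSING:
        return None
    return float(text) if "." in text else int(text)


print(clean_number("238,830,958"))
print(clean_number("~ 2.3"))
print(clean_number("1871[35]"))
print(clean_number("–"))
```

On a dataframe, the same function could be applied to one column with `Series.map`, which keeps the failures easy to inspect one column at a time.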

<a id='Exploring Data'></a>
## Exploratory Analysis

In [118]:
def att_line_plot(X,Y,T,XA,YA):
    trace = go.Scatter(
     x = X,
     y =Y,
     mode = 'lines+markers',
     marker = dict(size = (Y/ Y.mean())*50,
                    color = np.random.randn(len(Y)),
                    opacity = 0.6,
                    line = dict( color = 'rgb(0,0,0)'))
    
    )
    
    layout = dict(title = T, 
                  xaxis = dict(title = XA),
                  yaxis = dict(title = YA))
    
    data = [trace]
    fig = {'data':data,'layout':layout}
    return fig
    

In [119]:
# Census years carry footnote markers such as [35]; strip them.
pop_under_british_df_clean['Census year'] = pop_under_british_df_clean['Census year'].apply(lambda x:x.split('[')[0])

FIG = att_line_plot(pop_under_british_df_clean['Census year'],pop_under_british_df_clean['Population'],'Population Growth under British Raj','Year','Population in Millions')
iplot(FIG)

- Population increased by 63% in 70 years. Let's find the growth rate after independence.

In [120]:
FIG = att_line_plot(pop_per_decade_df_clean['Census year'],pop_per_decade_df_clean['Population'],'Population Growth after Independence','Year','Population in Billions')
iplot(FIG)

- In the next 60 years the population increased by 235%.
- From 1970 to 2010 the population almost doubled to 1.2 billion.
- Between 2001 and 2011 the population grew by about 17.7%, or roughly 1.77% per year on a simple average.
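
The growth figures quoted here are plain percentage changes between two census counts. A minimal sketch of the arithmetic, using the 1951 and 1961 populations from the table above (the function names are hypothetical, and the per-year figure assumes a constant compound rate over the decade):

```python
def percent_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


def annualized_rate(old, new, years):
    """Constant yearly growth rate (in %) that turns old into new."""
    return ((new / old) ** (1 / years) - 1) * 100


pop_1951, pop_1961 = 361_088_000, 439_235_000

decade = percent_change(pop_1951, pop_1961)
yearly = annualized_rate(pop_1951, pop_1961, 10)
print(f"{decade:.1f}% over the decade, {yearly:.2f}% per year")
```

The decade figure agrees with the `Change (%)` value of 21.6 listed for 1961 in the table.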

In [121]:
def bar_plot(X,Y,color_list,T,TP):
    trace = go.Bar(
        x = X,
        y = Y,
        text = T,
        textposition = TP,
        marker = dict(color = color_list,
                             line=dict(color='rgb(0,0,0)',width=1.5)))
    
    return trace

In [122]:
#assign color to each bar in bar plot
def assign_color(col,condn):
    colors = []
    for i in col:
        if i > condn:
            colors.append('rgba(222,45,38,0.8)')
        else:
            colors.append('rgba(204,204,204,1)')
    return colors

In [123]:
pop_dist_by_states_df_clean.sort_values(by = 'Population[51]',ascending = False,inplace = True)

colors_states = assign_color(pop_dist_by_states_df_clean['Population[51]'],100000000)
trace1 = bar_plot(pop_dist_by_states_df_clean['State/UT'],pop_dist_by_states_df_clean['Population[51]'],colors_states,T = None,TP = None)


largest_cities_df_clean.sort_values(by = 'population',ascending = False,inplace = True)
colors_bar = assign_color(largest_cities_df_clean['population'],largest_cities_df_clean['population'].mean())
trace2 = bar_plot(largest_cities_df_clean['city'],largest_cities_df_clean['population'],colors_bar,T = largest_cities_df_clean['state'],TP = None)


pop_dist_by_states_df_clean['Sex Ratio']=pop_dist_by_states_df_clean['Sex Ratio'].astype(int)
df = pop_dist_by_states_df_clean.sort_values(by = 'Sex Ratio',ascending = False)
colors_sexratio = assign_color(df['Sex Ratio'],1000)
trace3 = bar_plot(df['State/UT'],df['Sex Ratio'],colors_sexratio,T = None, TP = None)


data = [trace1,trace2,trace3]

fig = tools.make_subplots(rows=3, cols=1, print_grid = False,subplot_titles=('Population distribution by States(2011)', 
                                                                             'Cities with largest Population(2011)',
                                                                              'Sex Ratio by States(2011)'))

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 2, 1)
fig.append_trace(trace3, 3, 1)


fig['layout']['yaxis1'].update(title='Population')
fig['layout']['yaxis2'].update(title='Population')
fig['layout']['yaxis3'].update(title='Sex Ratio')


fig['layout'].update(height = 1000, width = 1000, showlegend = False)

iplot(fig)

Population distribution by states:
- Uttar Pradesh, Maharashtra and Bihar are the states with a population above 100 million at the 2011 census.

Cities with largest population:
- Mumbai, Delhi, Bangalore, Hyderabad, Ahmedabad, Chennai, Kolkata and Surat are the cities with a population above the mean.

Sex ratio is the number of females per 1000 males. The major cause of the decline in the female birth ratio in India is considered to be the violent treatment meted out to girl children at the time of birth.

- Puducherry and Kerala are the places where women outnumber men. In 2011 Kerala and Puducherry had sex ratios of 1084 and 1037 respectively.

In [124]:
pop_dist_by_states_df_clean['Rural_pop_per'] = pop_dist_by_states_df_clean['Rural[52]'] / pop_dist_by_states_df_clean['Population[51]']

pop_dist_by_states_df_clean['Urban_pop_per'] = pop_dist_by_states_df_clean['Urban[52]'] / pop_dist_by_states_df_clean['Population[51]']

In [125]:
pop_dist_by_states_df_clean.sort_values(by  = 'Rural_pop_per',ascending = False,inplace = True )
print(f"Percentage of Population staying in Rural Areas in 2011:{pop_dist_by_states_df_clean['Rural_pop_per'].mean()*100:.2f}%")
print(f"Percentage of Population staying in Urban Areas in 2011:{pop_dist_by_states_df_clean['Urban_pop_per'].mean()*100:.2f}%")

traces = []
col_list = ['Rural_pop_per','Urban_pop_per']
color_list = ['rgba(55, 128, 191, 0.7)','rgba(219, 64, 82, 0.7)']
for i in range(2):
    trace = go.Bar(
                   x  = pop_dist_by_states_df_clean['State/UT'],
                   y  =  pop_dist_by_states_df_clean[col_list[i]],
                   name = col_list[i],
                   marker = dict(color = color_list[i],
                              line=dict(color='rgb(0,0,0)',width=1.5)))
    traces.append(trace)

data = traces
layout = dict(title = 'Rural and Urban Population by States in 2011',
             yaxis = dict(title = 'Proportion'))
fig = {'data':data,'layout':layout}
iplot(fig)

Percentage of Population staying in Rural Areas in 2011:61.28%
Percentage of Population staying in Urban Areas in 2011:38.03%
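
Note that `.mean()` on the per-state shares gives every state or UT the same weight regardless of its size, so the two figures printed above are averages of state-level shares rather than shares of India's total population. A minimal sketch of the difference, using only the five rows shown in the `head()` output earlier (so the results are illustrative, not national figures; the variable names are made up for this example):

```python
# (state, total population, rural population) from the 2011 table above
rows = [
    ("Uttar Pradesh", 199_812_341, 155_111_022),
    ("Maharashtra", 112_374_333, 61_545_441),
    ("Bihar", 104_099_452, 92_075_028),
    ("West Bengal", 91_276_115, 62_213_676),
    ("Madhya Pradesh", 72_626_809, 52_537_899),
]

# Unweighted: average the rural share of each state.
shares = [rural / total for _, total, rural in rows]
mean_of_shares = sum(shares) / len(shares)

# Weighted: pool the people first, then take one share.
pooled_share = sum(r for _, _, r in rows) / sum(t for _, t, _ in rows)

print(f"mean of state shares: {mean_of_shares:.4f}")
print(f"share of pooled population: {pooled_share:.4f}")
```

The two numbers differ whenever large and small states have different rural shares, so the pooled version is the one to use when the question is about the population as a whole.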


In [126]:
def change_dtype(df,col_list):
    df[col_list] = df[col_list].astype(float)
    
    return df[col_list]
    

In [127]:
col_list = ['Overall (%)', 'Male (%)','Female (%)']
literacy_rate_df_clean[col_list] = change_dtype(literacy_rate_df_clean,col_list)

In [128]:
literacy_rate_df_clean.sort_values(by = 'Overall (%)',ascending = False,inplace = True)
print(f"Average Literacy rate in India is:{literacy_rate_df_clean['Overall (%)'].mean(): .4f}%")

traces = []
col_list = literacy_rate_df_clean.columns
color_list = [0,1,'rgba(55, 128, 191, 0.7)','rgba(219, 64, 82, 0.7)','rgba(50, 171, 96, 0.7)']
for i in range(2,5):
    trace = go.Bar(
                   x  = literacy_rate_df_clean.iloc[:15,1],
                   y = literacy_rate_df_clean.iloc[:15,i],
                   name = col_list[i],
                   marker = dict(
                               color = color_list[i],
                              line=dict(color='rgb(0,0,0)',width=1.5)))
    traces.append(trace)


data = traces
layout = dict(title = 'Top 15 States in Literacy Rate',
                 yaxis = dict(title = 'Literacy Rate'))

fig = {'data':data,'layout' :layout}
iplot(fig)



Average Literacy rate in India is: 79.1497%


Kerala has the highest literacy rate at the 2011 Census.

Overall - 93.91%

Male - 96.02%

Female - 91.98%

In [129]:
print("State with Lowest Literacy Rate:")
literacy_rate_df_clean.loc[:,['State or UT','Overall (%)']].tail(1)

State with Lowest Literacy Rate:


Unnamed: 0,State or UT,Overall (%)
9,Bihar,63.82


In [130]:
col_list = ['Crude birth rate(per 1000)', 'Crude death rate(per 1000)','Natural change(per 1000)','Total fertility rate']
census_df_clean[col_list] = change_dtype(census_df_clean,col_list)

In [131]:
census_df_clean.head()

Unnamed: 0,Year,Average population(x 1000),Live births1,Deaths1,Natural change,Crude birth rate(per 1000),Crude death rate(per 1000),Natural change(per 1000),Total fertility rate
0,1981,716493,24289000,8956000,15333000,33.9,12.5,21.4,
1,1982,733152,24781000,8725000,16056000,33.8,11.9,21.9,
2,1983,750034,25276000,8925000,16351000,33.7,11.9,21.8,
3,1984,767147,26006000,9666000,16340000,33.9,12.6,21.3,
4,1985,784491,25810000,9257000,16553000,32.9,11.8,21.1,


In [132]:
print(f"{census_df_clean['Crude birth rate(per 1000)'].mean():.2f} Births per 1000 Population")
print(f"{census_df_clean['Crude death rate(per 1000)'].mean():.2f} Deaths per 1000 Population")

traces = []
col_list = ['Crude birth rate(per 1000)','Crude death rate(per 1000)']
for i in range(2):
    trace = go.Scatter(
                   x  = census_df_clean['Year'],
                   y  =  census_df_clean[col_list[i]],
                   name = col_list[i],
                   text = census_df_clean['Year'],
                   mode = 'lines',
                   marker = dict(
                        opacity = 0.6,
                        line = dict( color = 'rgb(0,0,0)'))
                   )
    traces.append(trace)
data = traces
layout = dict(title = 'Birth and Death Rate Per 1000 Population',
                 xaxis = dict(title = 'Year'))
fig = {'data':data,'layout':layout}
iplot(fig)

26.84 Births per 1000 Population
8.95 Deaths per 1000 Population


**Crude Birth Rate** is the number of births recorded in a year per 1000 population. It is usually the dominant factor in determining the rate of population growth.
Once seemingly unstoppable, India’s population juggernaut is finally slowing down: the crude birth rate has decreased by 43.68% since 1981.

**Crude Death Rate** is the number of deaths recorded in a year per 1000 population.
The death rate in India has been declining since 1981 and has decreased by 48.8% over that period.

The following factors may explain the decline in the death rate in India:

1. Control of epidemics
2. Urbanisation of the population
3. More medical facilities
4. Spread of education
5. Late marriage
6. Control over famine
7. Balanced diet

In [133]:
life_expe_df_clean = life_expe_df_clean.iloc[:-1,:].sort_values(by = 'period')
life_expe_df_clean['life_expec_in_years'] = change_dtype(life_expe_df_clean,'life_expec_in_years')

FIG = att_line_plot(life_expe_df_clean['period'],life_expe_df_clean['life_expec_in_years'],'Life Expectancy in Years','Year','Age')
iplot(FIG)

What is life expectancy? 
- Life expectancy is the expected number of years of life remaining at a particular age.

In the past two decades life expectancy in India has increased by more than 8 years.

Between 2010 and 2015, life expectancy at birth in India was 67.6 years.

In [134]:
# Split cells of the form "value (extra)" into two new columns.
# New column names are derived from the original ones, so
# 'State (Population 2011)' yields 'State ' and 'Population 2011)'.
for col in crude_birth_rate_df_clean.iloc[:,[0,2,4,6]].columns:
    parts = crude_birth_rate_df_clean[col].str.split('(')
    before = parts.str[0].str.strip()
    inside = parts.str[-1].str.rstrip(')').str.strip()
    if col == 'State (Population 2011)':
        crude_birth_rate_df_clean[col.split('(')[0]] = before
        crude_birth_rate_df_clean[col.split('(')[-1]] = inside
    else:
        crude_birth_rate_df_clean[col+'actual'] = before
        crude_birth_rate_df_clean[col+'wanted'] = inside
    

In [135]:
crude_birth_rate_df_clean.drop(['State (Population 2011)','TFR – Total1','TFR – Urban1','TFR – Rural1'],axis = 1,inplace = True)
crude_birth_rate_df_clean.sort_values(by  = 'CBR – Total',ascending = False,inplace  =True)

In [136]:
col_list = ['TFR – Total1actual', 'TFR – Urban1actual','TFR – Rural1actual']
crude_birth_rate_df_clean[col_list] = change_dtype(crude_birth_rate_df_clean,col_list)

In [137]:
print(f"Average Fertility Rate: {crude_birth_rate_df_clean['TFR – Total1actual'].mean()}")
print(f"Average Fertility rate in urban areas: {crude_birth_rate_df_clean['TFR – Urban1actual'].mean():.1f}")
print(f"Average fertility rate in rural areas: {crude_birth_rate_df_clean['TFR – Rural1actual'].mean():.1f}")

traces = []
col_list = ['CBR – Total','TFR – Total1actual']
color_list = ['rgba(55, 128, 191, 0.7)','rgba(219, 64, 82, 0.7)']
for i in range(2):
    trace = go.Bar(
                   x  = crude_birth_rate_df_clean['State '],
                   y  = crude_birth_rate_df_clean[col_list[i]],
                   name = col_list[i],
                   marker = dict(
                                 color = color_list[i],
                                 line=dict(color='rgb(0,0,0)',width=1.5)))
    traces.append(trace)
data = traces
layout = dict(title = ('Birth and Fertility Rate by States'),
             barmode = 'relative')

fig = {'data':data,'layout':layout}
iplot(fig)

Average Fertility Rate: 2.11
Average Fertility rate in urban areas: 1.7
Average fertility rate in rural areas: 2.3


Bihar has the highest crude birth rate and fertility rate (3.41), whereas Sikkim has the lowest fertility rate (1.17).

**Total Fertility Rate (TFR)** is the average number of children that would be born to a woman over her lifetime. It is a more direct measure of the level of fertility than the crude birth rate, since it refers to births per woman rather than births per head of population. This indicator shows the potential for population change in a country. A TFR of about 2.1 children per woman is called replacement-level fertility: at this level a population is stable, neither rising nor falling.
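As a rough illustration of the replacement-level idea, the sketch below compares the averages printed above with the 2.1 threshold. The function name `fertility_status` and the `tolerance` band are assumptions made for this example only, not a demographic standard.

```python
# Replacement-level fertility as cited in the text (about 2.1 children per woman).
REPLACEMENT_TFR = 2.1


def fertility_status(tfr, tolerance=0.05):
    """Classify a TFR relative to replacement level.

    `tolerance` is an illustrative assumption: values within this band
    of 2.1 are treated as "near replacement".
    """
    if tfr > REPLACEMENT_TFR + tolerance:
        return "above replacement"
    if tfr < REPLACEMENT_TFR - tolerance:
        return "below replacement"
    return "near replacement"


# Average fertility rates printed by the cell above.
averages = {"Total": 2.11, "Urban": 1.7, "Rural": 2.3}
status = {area: fertility_status(tfr) for area, tfr in averages.items()}

for area, label in status.items():
    print(f"{area}: {label}")
```

Under these assumptions the overall average sits near replacement level, urban areas fall below it and rural areas above it.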

In [138]:
pop_struc_df_clean['Sex Ratio']= round((pop_struc_df_clean['Female'] / pop_struc_df_clean['Male'])*1000,2)

In [139]:
colors_sexratio_age = []
for sr in pop_struc_df_clean['Sex Ratio']:    
    if sr > 1000:
        colors_sexratio_age.append('rgba(49,130,189,0.8)')
    else:
        colors_sexratio_age.append('rgba(204,204,204,1)')

In [140]:
trace = bar_plot(pop_struc_df_clean['Age group'],pop_struc_df_clean['Sex Ratio'],colors_sexratio_age,T = pop_struc_df_clean['Sex Ratio'],TP='auto')
data = [trace]
layout = dict(title = 'Sex Ratio by Age group(2011)',
              yaxis = dict(title = 'Sex Ratio'))

fig = {'data':data,'layout':layout}
iplot(fig)

- Bars coloured in blue mark age groups where females outnumber males (sex ratio above 1000).
- 59% of the age groups have a sex ratio below 1000.
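The sex ratio here is the number of females per 1,000 males. The plain-Python sketch below repeats that calculation and the "share of age groups below 1000" figure on a small set of made-up counts. The numbers and the helper name `sex_ratio` are hypothetical and are not taken from the census table.

```python
# Hypothetical counts (not census figures), used only to show the calculation.
toy_counts = {
    "0-4": {"Male": 1000, "Female": 920},
    "5-9": {"Male": 1200, "Female": 1080},
    "60-64": {"Male": 500, "Female": 510},
    "80+": {"Male": 200, "Female": 230},
}


def sex_ratio(female, male):
    """Females per 1,000 males, rounded to two decimals."""
    return round(female / male * 1000, 2)


ratios = {age: sex_ratio(c["Female"], c["Male"]) for age, c in toy_counts.items()}

# Age groups where males outnumber females (ratio below 1000).
below = [age for age, ratio in ratios.items() if ratio < 1000]
share_below = 100 * len(below) / len(ratios)

for age, ratio in ratios.items():
    print(f"{age}: {ratio}")
print(f"Age groups with sex ratio below 1000: {share_below:.0f}%")
```

The same threshold of 1000 decides the bar colour in the plot: a ratio above 1000 means more females than males in that age group.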

In [141]:
a = pop_struc_df_clean.iloc[:,1:4]
tot_pop = a['Total'].sum()


age_group_0_14 = pd.DataFrame((a.iloc[:3,:].sum(axis = 0) / tot_pop)*100).reset_index()
trace_1 = go.Bar(
                 x = age_group_0_14['index'],
                 y = age_group_0_14[0],  
                 name = 'Age group 0-14',
                 marker = dict(color = 'rgba(255, 255, 128, 0.5)',
                              line=dict(color='rgb(0,0,0)',width=1.5))
                )

age_grp_15_64 = pd.DataFrame((a.iloc[3:13,:].sum(axis = 0) / tot_pop)*100).reset_index()
trace_2 = go.Bar(  
                 x = age_grp_15_64['index'],
                 y = age_grp_15_64[0],
                 name = 'Age group 15-64',
                 marker=dict(color='rgba(120, 255, 100, 0.5)',
                             line= dict(color = 'rgb(0,0,0)',width = 1.5))
                 )
 

age_grp_64_abv = pd.DataFrame((a.iloc[13:-1,:].sum(axis = 0) / tot_pop)*100).reset_index()
trace_3 = go.Bar( 
                x = age_grp_64_abv['index'],
                y = age_grp_64_abv[0],
                name = 'Age group 65 & above',
                marker = dict(color = 'rgba(255, 174, 128, 0.5)',
                              line=dict(color='rgb(0,0,0)',width=1.5)))

layout = dict(title='Percentage of Population by Age Groups',
              yaxis = dict(title = 'Percentage'))

data = [trace_1,trace_2,trace_3]
fig = {'data':data,'layout':layout}
iplot(fig)


- 63% of the total population is in the 15-64 age group. Within that group, males account for 32.55% and females for 30.85% of the total population.
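The grouping behind these percentages can be sketched without relying on row positions. The example below maps each five-year age label to a broad band and computes each band's share of the total. The totals are made up for illustration, and `broad_band` is a hypothetical helper, not part of the notebook.

```python
# Hypothetical five-year age groups with made-up totals (not census data).
toy_totals = {
    "0-4": 100, "5-9": 100, "10-14": 100,
    "15-19": 350, "60-64": 290,
    "65-69": 40, "80+": 20,
}


def broad_band(age_group):
    """Map a label such as '15-19' or '80+' to a broad age band."""
    lower = int(age_group.rstrip("+").split("-")[0])
    if lower < 15:
        return "0-14"
    if lower < 65:
        return "15-64"
    return "65 & above"


# Sum the totals per broad band.
band_totals = {}
for age_group, count in toy_totals.items():
    band = broad_band(age_group)
    band_totals[band] = band_totals.get(band, 0) + count

# Express each band as a percentage of the overall total.
grand_total = sum(toy_totals.values())
shares = {band: 100 * count / grand_total for band, count in band_totals.items()}

for band, share in shares.items():
    print(f"{band}: {share:.2f}%")
```

Deriving the band from the label avoids depending on the exact row order of the table, which the positional `iloc` slices assume.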
       

In [142]:
print(f"Number of Languages in India: {native_speakers_df_clean.shape[0]}")

trace = bar_plot(native_speakers_df_clean['Language'],native_speakers_df_clean['Percentage (%)'],None,T = None,TP=None)

data = [trace]
layout = dict(title = 'Languages in India by percentage of native speakers at the 2001 census',
             yaxis = dict(title = 'Percentage'))
fig = {'data':data, 'layout':layout}
iplot(fig)

Number of Languages in India: 29


<a id='Conclusion'></a>
## Conclusion

Before independence the population increased by 63% in 70 years, whereas in the 60 years after independence it increased by 235%. From 1970 to 2010 the population almost doubled to 1.2 billion.

States with a population above 100 million in 2011
* Uttar Pradesh: 199 Million
* Maharashtra: 112 Million
* Bihar: 104 Million

Most populous cities in 2011
* Mumbai : 13.48 Million
* Delhi : 11 Million
* Bangalore : 8.42 Million
* Hyderabad : 6.8 Million
* Ahmedabad : 5.57 Million

Sex Ratio
* Women outnumber men in Kerala and Puducherry, which had sex ratios of 1084 and 1037 respectively in 2011.

As per 2011:
- Percentage of population in rural areas: 61.28%
- Percentage of population in urban areas: 38.03%

States with a large rural population

- Himachal Pradesh
- Bihar
- Assam
- Odisha
- Meghalaya
- Uttar Pradesh

Literacy Rate as per 2011
- Average literacy rate in India: 79.14%

Kerala has the highest literacy rate at 93.91% (male: 96.02%, female: 91.98%).

Average fertility rate: 2.11 (urban areas: 1.7, rural areas: 2.3)

Share of total population by age group
- Age 15-64: 63.40%
- Age 0-14: 30.73%
- Age 65 and above: 5.47%

The languages table lists 29 languages. Native Hindi speakers accounted for 41% as per the 2001 census.