# **Analysis of India District Health Survey Data**

This notebook uses SQL queries to analyse data published by the [Ministry of Health and Family Welfare](https://data.gov.in/resource/india-districts-factsheets-national-family-health-survey-nfhs-5-2019-2021). The data has been filtered down to the important columns and ingested into Postgres tables using Python libraries and SQL queries.


## Setup

In [None]:
# Install postgresql server
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql
!sudo service postgresql start

# Setup a password `postgres` for username `postgres`
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"

# Setup a database with name `testdb` to be used
!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS testdb;'
!sudo -u postgres psql -U postgres -c 'CREATE DATABASE testdb;'
!pip install psycopg2-binary

In [5]:
# Importing Python Libraries
import pandas as pd
import psycopg2
from sqlalchemy import create_engine

In [6]:
data=pd.read_csv("../input/survey/Data.csv")

In [7]:
# Defining Postgres Connection
connection_string = {'host':'localhost',
                     'dbname':'testdb',
                     'user':'postgres',
                     'password':'postgres',
                     'port':5432}
connection = psycopg2.connect(**connection_string)

In [19]:
## Create Schema of district_health Table which contains health related attributes of districts of India
create_schema_query = """ Create table district_health 
(
District_Names varchar(250),
State_UT varchar(250),
-- numeric(10,10) can only hold absolute values below 1, so counts use integer
-- and percentages (0-100) use numeric(5,2)
Households_survey integer,
Women_interview integer,
Men_interview integer,
Female_attend_school_per numeric(5,2),
have_electricity_per numeric(5,2),
clean_fuel_for_cooking_per numeric(5,2),
health_insurance_per numeric(5,2),
Women_literacy_per numeric(5,2),
Institutional_births_per numeric(5,2),
Children_having_diarrhoea_ORS_per numeric(5,2),
Children_having_diarrhoea_zinc_per numeric(5,2),
Women_high_Blood_sugar_per numeric(5,2),
Women_veryhigh_Blood_sugar_per numeric(5,2),
Men_high_Blood_sugar_per numeric(5,2),
Men_veryhigh_Blood_sugar_per numeric(5,2),
Women_mildelyElevated_Blood_sugar_per numeric(5,2),
Women_undergone_screeningTest_cervicalCancer_per numeric(5,2),
Examination_breastCancer_per numeric(5,2),
Men_use_tobacco_per numeric(5,2),
Women_consume_alcohol_per numeric(5,2),
Men_consume_alcohol_per numeric(5,2)
);
 """
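# Clear any aborted transaction left over from an earlier failed statement
connection.rollback()
with connection.cursor() as cur:
  cur.execute("drop table if exists district_health;")
  cur.execute(create_schema_query)
connection.commit()  # make the DDL visible to other connections

In [22]:
## Load data into the district_health table from the pandas DataFrame
engine = create_engine('postgresql://postgres:postgres@localhost:5432/testdb')
# if_exists='replace' rebuilds the table from the DataFrame, so the columns keep the
# exact CSV header names (case and stray spaces included) that the quoted identifiers
# in the queries below rely on, and the raw '*' placeholders load as text
data.to_sql('district_health', engine, index=False, if_exists='replace')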

In [23]:
# Show 10 Records of district_health Table
pd.read_sql_query("""select * from district_health limit 10;""",connection)

## Analysis 

**Analysis 1: What are the top 5 states on the basis of number of households surveyed?**

In [24]:
analysis1_query = """
select "State_UT"
from district_health
group by "State_UT"
order by sum("Households_survey") desc
limit 5
"""
pd.read_sql_query(analysis1_query, connection)

**Analysis 2: What are the top 5 states on the basis of percentage of women in households surveyed?**

In [25]:
analysis2_query = """
-- w/m (women per man interviewed) orders states the same way as the share of women, w/(w+m)
select "State_UT" from (
    select "State_UT", w/m as a from (
        select "State_UT", sum(" Women_interview") as w, sum("Men_interview") as m
        from district_health
        group by "State_UT"
    ) temp
    order by a desc
    limit 5
) temp2
"""
pd.read_sql_query(analysis2_query, connection)
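
The query above ranks states by the ratio of women to men interviewed rather than by a percentage. If an actual percentage is wanted in the output, a sketch along the following lines could be used. It runs against an in-memory SQLite table as a stand-in for Postgres so that it is self-contained; the states `A`, `B`, `C` and their counts are made up for illustration, and the column name is written without the leading space found in the loaded table.

```python
import sqlite3

# In-memory SQLite stand-in for the Postgres table (illustrative data only)
conn = sqlite3.connect(":memory:")
conn.execute(
    'create table district_health '
    '("State_UT" text, "Women_interview" integer, "Men_interview" integer)'
)
conn.executemany(
    "insert into district_health values (?, ?, ?)",
    [("A", 900, 100), ("A", 800, 200), ("B", 600, 400), ("C", 700, 100)],
)

# Share of women among all interviewees, per state, as a percentage
query = """
select "State_UT",
       round(100.0 * sum("Women_interview")
             / (sum("Women_interview") + sum("Men_interview")), 1) as women_pct
from district_health
group by "State_UT"
order by women_pct desc
limit 5
"""
women_share = conn.execute(query).fetchall()
print(women_share)
```

Because the share w/(w+m) grows whenever the ratio w/m grows, both versions return the states in the same order; the percentage is simply easier to read.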

**Analysis 3: What is the top district in each state on the basis of Men Age 15 Years And Above With High (141-160 mg/dl) Blood Sugar Level (%)?**

In [26]:
analysis3_query = """ 
select "District_Names","State_UT" from district_health
where ("State_UT","Men_high_Blood_sugar_per") in(
select "State_UT",max("Men_high_Blood_sugar_per") from  district_health
group by "State_UT")

"""
pd.read_sql_query(analysis3_query, connection)
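
A window function is another way to pick the top row per group, and it makes it easy to show the winning value or to extend the query to the top N districts per state. The sketch below is an illustration only: it uses an in-memory SQLite table as a stand-in for Postgres, and the districts `D1` to `D4` and their values are made up.

```python
import sqlite3

# In-memory SQLite stand-in for the Postgres table (illustrative data only)
conn = sqlite3.connect(":memory:")
conn.execute(
    'create table district_health '
    '("District_Names" text, "State_UT" text, "Men_high_Blood_sugar_per" real)'
)
conn.executemany(
    "insert into district_health values (?, ?, ?)",
    [("D1", "A", 5.1), ("D2", "A", 7.4), ("D3", "B", 6.0), ("D4", "B", 3.2)],
)

# Rank districts within each state, highest percentage first, and keep rank 1
query = """
select "State_UT", "District_Names", "Men_high_Blood_sugar_per"
from (
    select *, rank() over (partition by "State_UT"
                           order by "Men_high_Blood_sugar_per" desc) as rnk
    from district_health
) ranked
where rnk = 1
order by "State_UT"
"""
top_districts = conn.execute(query).fetchall()
print(top_districts)
```

Like the `in` subquery, `rank()` returns every district tied for first place. In Postgres a descending sort places NULLs first by default, so `nulls last` may be worth adding to the `order by` if the column can be empty.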


**Analysis 4: Rank districts in ascending order of Men Age 15 Years And Above With Very High (>160 mg/dl) Blood Sugar Level (%).**


In [27]:
analysis4_query = """ 

select "District_Names", rank() over(order by "Men_veryhigh_Blood_sugar_per") as rank from  district_health

"""
pd.read_sql_query(analysis4_query, connection)
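
Districts with equal percentages share a rank, and `rank()` then skips the following positions. Where gap-free ranks are preferred, `dense_rank()` is the usual alternative. The sketch below contrasts the two on an in-memory SQLite table used as a stand-in for Postgres; the three districts and their values are made up for illustration.

```python
import sqlite3

# In-memory SQLite stand-in for the Postgres table (illustrative data only)
conn = sqlite3.connect(":memory:")
conn.execute(
    'create table district_health '
    '("District_Names" text, "Men_veryhigh_Blood_sugar_per" real)'
)
conn.executemany(
    "insert into district_health values (?, ?)",
    [("D1", 2.0), ("D2", 2.0), ("D3", 3.5)],
)

# rank() leaves a gap after ties; dense_rank() does not
query = """
select "District_Names",
       rank() over (order by "Men_veryhigh_Blood_sugar_per") as rnk,
       dense_rank() over (order by "Men_veryhigh_Blood_sugar_per") as dense_rnk
from district_health
order by "Men_veryhigh_Blood_sugar_per", "District_Names"
"""
ranks = conn.execute(query).fetchall()
print(ranks)
```

Here `D1` and `D2` tie for first place, so `D3` gets rank 3 under `rank()` and rank 2 under `dense_rank()`.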



**Analysis 5: Update and clean the column "Children with diarrhoea in the 2 weeks preceding the survey who received oral rehydration salts (ORS) (Children under age 5 years) (%)".**


In [None]:
analysis5_query = """ 
update district_health
set "Children_having_diarrhoea_ORS_per "=0
where "Children_having_diarrhoea_ORS_per "='*'

"""
# An UPDATE returns no rows, so run it through a cursor rather than pd.read_sql_query
with connection.cursor() as cur:
  cur.execute(analysis5_query)
connection.commit()  # persist the update

pd.read_sql_query("""select "Children_having_diarrhoea_ORS_per " from district_health limit 10;""",connection)
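
Replacing `*` with 0 makes the column usable, but it also makes a missing value indistinguishable from a genuine 0%, which pulls averages down. Assuming the `*` placeholder marks a value that was not reported, treating it as NULL is one alternative. The sketch below compares the two on an in-memory SQLite table used as a stand-in for Postgres; the shortened column name `ors_per` and the three rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stand-in; ors_per is a hypothetical short name for the ORS column
conn = sqlite3.connect(":memory:")
conn.execute('create table district_health ("District_Names" text, ors_per text)')
conn.executemany(
    "insert into district_health values (?, ?)",
    [("D1", "60.0"), ("D2", "*"), ("D3", "40.0")],
)

# Treating '*' as 0 counts the district as a real 0% in the average
avg_zero = conn.execute(
    "select avg(cast(case when ors_per = '*' then '0' else ors_per end as real)) "
    "from district_health"
).fetchone()[0]

# Treating '*' as NULL leaves the district out, since avg() ignores NULLs
avg_null = conn.execute(
    "select avg(cast(nullif(ors_per, '*') as real)) from district_health"
).fetchone()[0]

print(avg_zero, avg_null)
```

With these made-up rows the zero-filled average is about 33.3 while the NULL-based average is 50.0, which shows how much the choice can move a summary statistic.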