# Introduction 
![](https://miro.medium.com/max/757/1*eYEoP5hF-IyQ4SCLEs1cbQ.png)
### In this kernel notebook I will first cover data visualisation with the new Pandas-Bokeh library, followed by an exploratory data analysis: a case study about Karnataka education using Bokeh.

**Pandas-Bokeh** provides a Bokeh plotting backend for Pandas, GeoPandas and Pyspark DataFrames, similar to the already existing Visualization feature of Pandas. Importing the library adds a complementary plotting method plot_bokeh() on DataFrames and Series.


With **Pandas-Bokeh**, creating stunning, interactive, HTML-based visualization is as easy as calling:

df.plot_bokeh()


**Pandas-Bokeh** also provides native support as a Pandas plotting backend for Pandas >= 0.25. When **Pandas-Bokeh** is installed, switching the default Pandas plotting backend to Bokeh can be done via:

pd.set_option('plotting.backend', 'pandas_bokeh')

![](https://miro.medium.com/max/1962/0*lfsR26JXj4o_QMWI.gif)


Now it's time to install Pandas-Bokeh using the pip command.

In [1]:
!pip install pandas-bokeh

Collecting pandas-bokeh
  Downloading pandas_bokeh-0.5-py2.py3-none-any.whl (29 kB)
Installing collected packages: pandas-bokeh
Successfully installed pandas-bokeh-0.5


### Import Libraries


In [2]:
import numpy as np
import pandas as pd
import pandas_bokeh
pandas_bokeh.output_notebook()
pd.set_option('plotting.backend', 'pandas_bokeh')
# Create Bokeh-Table with DataFrame:
from bokeh.models.widgets import DataTable, TableColumn
from bokeh.models import ColumnDataSource

# Import Data



## Plot types


#### Pandas & Pyspark DataFrames
* Line plot
* Point plot
* Step plot
* Scatter plot
* Bar plot
* Histogram
* Area plot
* Pie plot
* Map plot

#### Geoplots (Point, Line, Polygon) with GeoPandas


### Lineplot

This simple lineplot in Pandas-Bokeh already contains various interactive elements:

* a pannable and zoomable (zoom in plotarea and zoom on axis) plot
* by clicking on the legend elements, one can hide and show the individual lines
* a Hovertool for the plotted lines

Consider the following simple example:

We will import time series data about power usage in various states of India. All values are measured in **MU (millions of units)**. **The dates range from 28/10/2019 to 23/05/2020.**

In [3]:
df = pd.read_csv('../input/state-wise-power-consumption-in-india/dataset_tk.csv')
df_long = pd.read_csv('../input/state-wise-power-consumption-in-india/long_data_.csv')

First, we create a Date column from the unnamed first column, parse it as a day-first date, and drop the now redundant original column.

In [4]:
df["Date"]=df["Unnamed: 0"]
df['Date'] = pd.to_datetime(df.Date, dayfirst=True)
df = df.drop(["Unnamed: 0"], axis = 1) 
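
The effect of `dayfirst=True` is easiest to see on an ambiguous date. The snippet below is a minimal sketch on made-up date strings, assuming the dataset stores dates in a day-first DD/MM/YYYY layout; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample of day-first date strings (DD/MM/YYYY).
raw_dates = pd.Series(["28/10/2019", "02/01/2020", "23/05/2020"])

# dayfirst=True reads "02/01/2020" as 2 January 2020, not 1 February 2020.
parsed = pd.to_datetime(raw_dates, dayfirst=True)

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Without `dayfirst=True`, an ambiguous string such as "02/01/2020" may be interpreted month-first, which would silently shift some of the dates.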

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 34 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Punjab             503 non-null    float64       
 1   Haryana            503 non-null    float64       
 2   Rajasthan          503 non-null    float64       
 3   Delhi              503 non-null    float64       
 4   UP                 503 non-null    float64       
 5   Uttarakhand        503 non-null    float64       
 6   HP                 503 non-null    float64       
 7   J&K                503 non-null    float64       
 8   Chandigarh         503 non-null    float64       
 9   Chhattisgarh       503 non-null    float64       
 10  Gujarat            503 non-null    float64       
 11  MP                 503 non-null    float64       
 12  Maharashtra        503 non-null    float64       
 13  Goa                503 non-null    float64       
 14  DNH       

Now let us group the states into 5 regions, namely:

1. Northern Region

2. Southern Region

3. Eastern Region

4. Western Region

5. North Eastern Region


In [6]:
df['NR'] = df['Punjab']+ df['Haryana']+ df['Rajasthan']+ df['Delhi']+df['UP']+df['Uttarakhand']+df['HP']+df['J&K']+df['Chandigarh']
df['WR'] = df['Chhattisgarh']+df['Gujarat']+df['MP']+df['Maharashtra']+df['Goa']+df['DNH']
df['SR'] = df['Andhra Pradesh']+df['Telangana']+df['Karnataka']+df['Kerala']+df['Tamil Nadu']+df['Pondy']
df['ER'] = df['Bihar']+df['Jharkhand']+ df['Odisha']+df['West Bengal']+df['Sikkim']
df['NER'] =df['Arunachal Pradesh']+df['Assam']+df['Manipur']+df['Meghalaya']+df['Mizoram']+df['Nagaland']+df['Tripura']
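
The same regional totals could also be built from a mapping of region to state columns, which keeps the grouping in one place. This is only a sketch on a toy frame with made-up values; the `regions` dictionary is a shortened, hypothetical version of the full state lists used above.

```python
import pandas as pd

# Toy frame with a few state columns (values are made up for illustration).
toy = pd.DataFrame({
    "Punjab": [100.0, 110.0],
    "Haryana": [80.0, 85.0],
    "Goa": [10.0, 12.0],
    "Gujarat": [300.0, 310.0],
})

# Hypothetical, shortened mapping; the real one would list every state per region.
regions = {
    "NR": ["Punjab", "Haryana"],
    "WR": ["Goa", "Gujarat"],
}

for region, states in regions.items():
    # skipna=False mirrors chained `+`: a NaN in any state makes the total NaN.
    toy[region] = toy[states].sum(axis=1, skipna=False)

print(toy[["NR", "WR"]])
```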

In [7]:
df_line = pd.DataFrame({"Northern Region": df["NR"].values,
                        "Southern Region": df["SR"].values,
                        "Eastern Region": df["ER"].values,
                        "Western Region": df["WR"].values,
                        "North Eastern Region": df["NER"].values},index=df.Date)

df_line.plot_bokeh(kind="line",title ="India - Power Consumption Regionwise",
                   figsize =(1000,800),
                   xlabel = "Date",
                   ylabel="MU(millions of units)"
                   )

### The above data visualisation is completely interactive. You can click on any region in the legend to hide or show its line, and hover over a line to check the data. Isn't it an interesting data visualisation?

##### Let us look at some other plot types for the same data

#### Bar Type:


In [8]:
df_line.plot_bokeh(kind="bar",title ="India - Power Consumption Regionwise",figsize =(1000,800),xlabel = "Date",ylabel="MU(millions of units)")

#### Point Type:

In [9]:
df_line.plot_bokeh(kind="point",title ="India - Power Consumption Regionwise",figsize =(1000,800),xlabel = "Date",ylabel="MU(millions of units)")

#### Histogram Type:

In [10]:
df_line.plot_bokeh(kind="hist",title ="India - Power Consumption Regionwise",
                   figsize =(1000,800),
                   xlabel = "MU(millions of units)",  # a histogram bins the values on the x-axis
                   ylabel="Frequency"
                )

#### Lineplot with rangetool

In [11]:
df_line = pd.DataFrame({"Northern Region": df["NR"].values,
                        "Southern Region": df["SR"].values,
                        "Eastern Region": df["ER"].values,
                        "Western Region": df["WR"].values,
                        "North Eastern Region": df["NER"].values},index=df.Date)

df_line.plot_bokeh(kind="line",title ="India - Power Consumption Regionwise",
                   figsize =(1000,800),
                   xlabel = "Date",
                   ylabel="MU(millions of units)",rangetool=True
                   )

### Pointplot

If you just wish to draw the data points of the curves, the pointplot option is the right choice. It also accepts the kwargs of bokeh.plotting.figure.scatter, like marker or size:

In [12]:
df_line.plot_bokeh.point(
    x=df.Date,
    xticks=range(0,1),
    size=5,
    colormap=["#009933", "#ff3399","#ae0399","#220111","#890300"],
    title=" Point Plot - India Power Consumption",
    fontsize_title=20,
    marker="x",figsize =(1000,800))

### Stepplot

With a similar API to the line and point plots, one can generate a stepplot. Additional keyword arguments for this plot type are passed to bokeh.plotting.figure.step, e.g. mode (before, after, center), as in the following example:


In [13]:
df_line.plot_bokeh.step(
    x=df.Date,
    xticks=range(-1, 1),
    colormap=["#009933", "#ff3399","#ae0399","#220111","#890300"],
    title="Step Plot - India Power Consumption",
    figsize=(1000,800),
    fontsize_title=20,
    fontsize_label=20,
    fontsize_ticks=20,
    fontsize_legend=8,
    )

### Scatterplot

A basic scatterplot can be created using the kind="scatter" option. For scatterplots, the x and y parameters have to be specified and the following optional keyword argument is allowed:

category: Determines the category column to use for coloring the scatter points

`**kwargs`: Optional keyword arguments of bokeh.plotting.figure.scatter

Note that the pandas.DataFrame.plot_bokeh() method returns a Bokeh figure by default, which can be embedded in dashboard layouts with other figures and Bokeh objects (the Pandas-Bokeh documentation has more details about (sub)plot layouts and embedding the resulting Bokeh plots as HTML).

In the example below, we use the built-in grid layout support of Pandas-Bokeh to display both the DataFrame (using a Bokeh DataTable) and the resulting scatterplot:

In [14]:
df = pd.read_csv("../input/iris/Iris.csv")
df = df.sample(frac=1)

In [15]:
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
27,28,5.2,3.5,1.5,0.2,Iris-setosa
100,101,6.3,3.3,6.0,2.5,Iris-virginica
130,131,7.4,2.8,6.1,1.9,Iris-virginica
16,17,5.4,3.9,1.3,0.4,Iris-setosa
0,1,5.1,3.5,1.4,0.2,Iris-setosa


In [16]:
data_table = DataTable(
    columns=[TableColumn(field=Ci, title=Ci) for Ci in df.columns],
    source=ColumnDataSource(df),
    height=300,
)

# Create Scatterplot:
p_scatter = df.plot_bokeh.scatter(
    x="PetalLengthCm",
    y="SepalWidthCm",
    category="Species",
    title="Iris DataSet Visualization",
    show_figure=False
)

# Combine Table and Scatterplot via grid layout:
pandas_bokeh.plot_grid([[data_table, p_scatter]], plot_width=400, plot_height=350)

### Barplot

The barplot API has no special keyword arguments, but accepts optional kwargs of bokeh.plotting.figure.vbar like alpha. By default it uses the index for the bar categories (however, a column can also be used as the x-axis category via the x argument).

Let us look at an example

In [17]:
data = {
    'Cars':
    ['Maruti Suzuki', 'Honda', 'Toyota', 'Hyundai', 'Benz', 'BMW'],
    '2018': [20000, 15722, 4340, 38000, 2890, 412],
    '2019': [19000, 13700, 340, 31200, 290, 234],
    '2020': [23456, 15891, 440, 36700, 890, 417]
}
df = pd.DataFrame(data).set_index("Cars")

p_bar = df.plot_bokeh.bar(
    ylabel="Units sold", 
    title="Car Units sold per Year", 
    alpha=0.6)

Using the stacked keyword argument, you can also make stacked barplots as shown below:

In [18]:
stacked_bar = df.plot_bokeh.bar(
    ylabel="Units sold", 
    title="Car Units sold per Year", 
    stacked=True,
    alpha=0.6)
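
In a stacked barplot, the height of each bar is the sum of that row across the year columns. A small sketch of that arithmetic for the car data above (the names `cars`, `totals` and `shares` are only illustrative):

```python
import pandas as pd

# Same car data as in the barplot example above.
data = {
    "Cars": ["Maruti Suzuki", "Honda", "Toyota", "Hyundai", "Benz", "BMW"],
    "2018": [20000, 15722, 4340, 38000, 2890, 412],
    "2019": [19000, 13700, 340, 31200, 290, 234],
    "2020": [23456, 15891, 440, 36700, 890, 417],
}
cars = pd.DataFrame(data).set_index("Cars")

# Total height of each stacked bar: the row sum over all years.
totals = cars.sum(axis=1)

# Share of each year inside its bar (each row sums to 1).
shares = cars.div(totals, axis=0)

print(totals)
```

Looking at `shares` alongside the stacked plot can help when the absolute totals differ a lot between manufacturers.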

Also horizontal versions of the above barplot are supported with the keyword kind="barh" or the accessor plot_bokeh.barh. You can still specify a column of the DataFrame as the bar category via the x argument if you do not wish to use the index.

In [19]:
#Reset index, such that "Cars" is now a column of the DataFrame:
df.reset_index(inplace=True)

#Create horizontal bar (via kind keyword):
p_hbar = df.plot_bokeh(
    kind="barh",
    x="Cars",
    ylabel="Units sold", 
    title="Car Units sold per Year", 
    alpha=0.6,
    legend = "bottom_right",
    show_figure=False)

#Create stacked horizontal bar (via barh accessor):
stacked_hbar = df.plot_bokeh.barh(
    x="Cars",
    stacked=True,
    ylabel="Units sold", 
    title="Car Units sold per Year", 
    alpha=0.6,
    legend = "bottom_right",
    show_figure=False)

#Plot all barplot examples in a grid:
pandas_bokeh.plot_grid([[p_bar, stacked_bar],
                        [p_hbar, stacked_hbar]], 
                       plot_width=450)

Now let us look at a more practical example, the housing prices dataset, to understand barplots better.

In [20]:
df = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv",index_col='SalePrice')
numeric_features = df.select_dtypes(include=[np.number])
p_bar = numeric_features.plot_bokeh.bar(
    ylabel="Sale Price", 
    figsize=(1000,800),
    title="Housing Prices", 
    alpha=0.6)
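
Plotting every numeric column for every house produces a very dense chart. One possible alternative, shown here only as a sketch on a made-up frame, is to aggregate first and plot the summary. The columns `OverallQual` and `SalePrice` are assumed to mirror columns of the housing dataset, and the values are invented.

```python
import pandas as pd

# Toy stand-in for the housing data (values are invented for illustration).
toy_houses = pd.DataFrame({
    "OverallQual": [5, 5, 7, 7, 9],
    "SalePrice": [120000, 140000, 200000, 220000, 400000],
})

# One bar per quality level instead of one bar per house.
median_price = toy_houses.groupby("OverallQual")["SalePrice"].median()
print(median_price)

# With pandas_bokeh imported, the summary could then be plotted, for example:
# median_price.to_frame().plot_bokeh.bar(ylabel="Median Sale Price", title="Housing Prices")
```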

# Bokeh Introduction

Bokeh is an interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of versatile graphics, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.

For this kernel we will use the Karnataka State (India) education dataset as the example for our exploratory data analysis.

# EDA - Bokeh - Karnataka Education

An NGO takes initiatives to improve primary education in India and wants to carry out this program in Karnataka. It wants to target districts that fall behind in areas such as

- Education Infrastructure

- Education Awareness

- Demographic features

Identify such districts that could be targeted in its first phase.

The source data for this exercise is obtained from data.gov.in

## Goal :

The goal of this notebook was primarily to:

1.       Explain the data, define your target and come up with features that can be used for modelling.

2.       Create a model based on your features and come up with the list of target districts

3.       Detailed analysis to include all components such as 

        - Data fetch
        - Data cleansing
        - Exploratory data analysis
        - Summary and Data Visualization
        


# Import Libraries

In [23]:
!pip install pandas-bokeh



In [24]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
# Import Bokeh Library for output
from bokeh.io import output_notebook
output_notebook()
from bokeh.models import ColumnDataSource
from bokeh.models import HoverTool
from bokeh.models import LinearInterpolator,CategoricalColorMapper
from bokeh.io import show
from bokeh.plotting import figure
from bokeh.palettes import Spectral8

## Data Fetching

In [25]:
data = pd.read_csv('../input/karnataka-state-education/Town-wise-education - Karnataka.csv')

## Exploratory Data Analysis

In [26]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 812 entries, 0 to 811
Data columns (total 46 columns):
 #   Column                                                                                     Non-Null Count  Dtype 
---  ------                                                                                     --------------  ----- 
 0   Table Name                                                                                 812 non-null    object
 1   State Code                                                                                 812 non-null    int64 
 2   District Code                                                                              812 non-null    int64 
 3   Town Code                                                                                  812 non-null    int64 
 4   Total/ Rural/ Urban                                                                        812 non-null    object
 5   Area Name                                                

Let us have a quick glance of what the data looks like by observing the first and last few rows

In [27]:
data.head()

Unnamed: 0,Table Name,State Code,District Code,Town Code,Total/ Rural/ Urban,Area Name,Age-Group,Total - Persons,Total - Males,Total - Females,...,Educational Level - Non-technical Diploma or Certificate Not Equal to Degree Females,Educational Level - Technical Diploma or Certificate Not Equal to Degree Persons,Educational Level - Technical Diploma or Certificate Not Equal to Degree Males,Educational Level - Technical Diploma or Certificate Not Equal to Degree Females,Educational Level - Graduate & Above Persons,Educational Level - Graduate & Above Males,Educational Level - Graduate & Above Females,Unclassified - Persons,Unclassified - Males,Unclassified - Females
0,C2308,29,1,40117000,Urban,Belgaum (M Corp.),All ages,399653,204598,195055,...,362,7143,5210,1933,41152,26488,14664,3,2,1
1,C2308,29,1,40117000,Urban,Belgaum (M Corp.),0-6,47642,24768,22874,...,0,0,0,0,0,0,0,0,0,0
2,C2308,29,1,40117000,Urban,Belgaum (M Corp.),7,6759,3495,3264,...,0,0,0,0,0,0,0,0,0,0
3,C2308,29,1,40117000,Urban,Belgaum (M Corp.),8,8067,4152,3915,...,0,0,0,0,0,0,0,0,0,0
4,C2308,29,1,40117000,Urban,Belgaum (M Corp.),9,6948,3559,3389,...,0,0,0,0,0,0,0,0,0,0


In [28]:
data.tail()

Unnamed: 0,Table Name,State Code,District Code,Town Code,Total/ Rural/ Urban,Area Name,Age-Group,Total - Persons,Total - Males,Total - Females,...,Educational Level - Non-technical Diploma or Certificate Not Equal to Degree Females,Educational Level - Technical Diploma or Certificate Not Equal to Degree Persons,Educational Level - Technical Diploma or Certificate Not Equal to Degree Males,Educational Level - Technical Diploma or Certificate Not Equal to Degree Females,Educational Level - Graduate & Above Persons,Educational Level - Graduate & Above Males,Educational Level - Graduate & Above Females,Unclassified - Persons,Unclassified - Males,Unclassified - Females
807,C2308,29,26,42604000,Urban,Mysore (M Corp.),65-69,13904,6773,7131,...,9,289,228,61,1898,1585,313,0,0,0
808,C2308,29,26,42604000,Urban,Mysore (M Corp.),70-74,10754,5501,5253,...,3,149,118,31,1177,1012,165,0,0,0
809,C2308,29,26,42604000,Urban,Mysore (M Corp.),75-79,5359,2728,2631,...,2,73,62,11,574,503,71,0,0,0
810,C2308,29,26,42604000,Urban,Mysore (M Corp.),80+,6236,2848,3388,...,2,57,42,15,410,344,66,0,0,0
811,C2308,29,26,42604000,Urban,Mysore (M Corp.),Age not stated,539,267,272,...,1,12,4,8,44,28,16,0,0,0


Let us examine the shape of this dataset

In [29]:
data.shape

(812, 46)

This means that we have 812 rows (observations) and 46 columns (features) in this dataset.
Now let us explore the data types of the dataset

In [30]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 812 entries, 0 to 811
Data columns (total 46 columns):
 #   Column                                                                                     Non-Null Count  Dtype 
---  ------                                                                                     --------------  ----- 
 0   Table Name                                                                                 812 non-null    object
 1   State Code                                                                                 812 non-null    int64 
 2   District Code                                                                              812 non-null    int64 
 3   Town Code                                                                                  812 non-null    int64 
 4   Total/ Rural/ Urban                                                                        812 non-null    object
 5   Area Name                                                

The above observation shows that there are categorical and numerical features in the dataset. Let us explore further...

Before counting the unique categories of the categorical features, let us examine whether there are any nulls in the dataset.

In [31]:
data.isnull().sum()

Table Name                                                                                   0
State Code                                                                                   0
District Code                                                                                0
Town Code                                                                                    0
Total/ Rural/ Urban                                                                          0
Area Name                                                                                    0
Age-Group                                                                                    0
Total - Persons                                                                              0
Total - Males                                                                                0
Total - Females                                                                              0
Illiterate - Persons                              
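
With 46 columns, the per-column null counts are long to read. A condensed check could look like the sketch below, which uses a made-up frame with a few nulls (the column names are borrowed from the dataset, the values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame with two missing values (invented data).
toy = pd.DataFrame({
    "Area Name": ["A", "B", None],
    "Illiterate - Persons": [10, np.nan, 30],
    "Total - Persons": [100, 200, 300],
})

null_counts = toy.isnull().sum()

# Keep only the columns that actually contain nulls.
columns_with_nulls = null_counts[null_counts > 0]

# Single number for the whole frame; 0 would mean no imputation is needed.
total_nulls = int(null_counts.sum())

print(columns_with_nulls)
print("Total nulls:", total_nulls)
```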

Let us also look at the summary statistics for all the features, including the quartiles, mean and standard deviation


In [32]:
data.describe(include = 'all')

Unnamed: 0,Table Name,State Code,District Code,Town Code,Total/ Rural/ Urban,Area Name,Age-Group,Total - Persons,Total - Males,Total - Females,...,Educational Level - Non-technical Diploma or Certificate Not Equal to Degree Females,Educational Level - Technical Diploma or Certificate Not Equal to Degree Persons,Educational Level - Technical Diploma or Certificate Not Equal to Degree Males,Educational Level - Technical Diploma or Certificate Not Equal to Degree Females,Educational Level - Graduate & Above Persons,Educational Level - Graduate & Above Males,Educational Level - Graduate & Above Females,Unclassified - Persons,Unclassified - Males,Unclassified - Females
count,812,812.0,812.0,812.0,812,812,812,812.0,812.0,812.0,...,812.0,812.0,812.0,812.0,812.0,812.0,812.0,812.0,812.0,812.0
unique,1,,,,1,28,29,,,,...,,,,,,,,,,
top,C2308,,,,Urban,Chikmagalur (CMC),Age not stated,,,,...,,,,,,,,,,
freq,812,,,,812,29,28,,,,...,,,,,,,,,,
mean,,29.0,15.035714,41509820.0,,,,27507.64,14222.53,13285.11,...,26.519704,482.08867,371.315271,110.773399,3043.32266,1905.359606,1137.963054,0.054187,0.03202,0.022167
std,,0.0,6.714906,671824.5,,,,164098.3,85432.87,78673.9,...,239.305779,3034.946141,2344.507666,693.711738,22771.068866,13770.026347,9031.48477,0.429515,0.279085,0.220975
min,,29.0,1.0,40117000.0,,,,50.0,23.0,27.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,,29.0,11.25,41127000.0,,,,2916.0,1466.75,1425.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,,29.0,16.5,41655000.0,,,,5657.5,2880.5,2710.0,...,0.0,12.0,9.0,2.0,7.5,4.5,2.0,0.0,0.0,0.0
75%,,29.0,20.0,42009250.0,,,,13917.5,7124.5,6696.0,...,4.0,199.75,153.5,48.25,1205.0,844.0,252.75,0.0,0.0,0.0


Now let us look at the categorical features in detail. To do this, we extract all the categorical features into a separate DataFrame object

In [33]:
# np.object is deprecated and removed in recent NumPy versions; use the 'object' dtype name instead
categorical_features = data.select_dtypes(include=['object'])
categorical_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 812 entries, 0 to 811
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Table Name           812 non-null    object
 1   Total/ Rural/ Urban  812 non-null    object
 2   Area Name            812 non-null    object
 3   Age-Group            812 non-null    object
dtypes: object(4)
memory usage: 25.5+ KB


#### Let us observe the unique categories for all the object variables

In [34]:
for column_name in data.columns:
    if data[column_name].dtypes == 'object':
        data[column_name] = data[column_name].fillna(data[column_name].mode().iloc[0])
        unique_category = len(data[column_name].unique())
        print("Feature '{column_name}' has '{unique_category}' unique categories".format(column_name = column_name,
                                                                                         unique_category=unique_category))

Feature 'Table Name' has '1' unique categories
Feature 'Total/ Rural/ Urban' has '1' unique categories
Feature 'Area Name' has '28' unique categories
Feature 'Age-Group' has '29' unique categories


Based on the above results, the 'Table Name' and 'Total/ Rural/ Urban' categorical features each hold a single value, so they are redundant and can be eliminated.

We can also eliminate 'State Code' as we are dealing with only Karnataka.

From above observations we can conclude that we do not need to do any imputation as there are no missing values.

We have already observed the general metrics using the describe function above. Now let us explore the data in more detail and come up with some basic observations.

## Data Cleansing

Now let us drop the columns discussed above, which have no importance in our EDA:
- 'Table Name'
- 'State Code'
- 'Total/ Rural/ Urban'

In [35]:
# Drop the three redundant columns in a single call
data.drop(columns=['Table Name', 'State Code', 'Total/ Rural/ Urban'], inplace=True)


Let us get more insight into the data by observing the first few records for one district and town

In [36]:
data.head()

Unnamed: 0,District Code,Town Code,Area Name,Age-Group,Total - Persons,Total - Males,Total - Females,Illiterate - Persons,Illiterate - Males,Illiterate - Females,...,Educational Level - Non-technical Diploma or Certificate Not Equal to Degree Females,Educational Level - Technical Diploma or Certificate Not Equal to Degree Persons,Educational Level - Technical Diploma or Certificate Not Equal to Degree Males,Educational Level - Technical Diploma or Certificate Not Equal to Degree Females,Educational Level - Graduate & Above Persons,Educational Level - Graduate & Above Males,Educational Level - Graduate & Above Females,Unclassified - Persons,Unclassified - Males,Unclassified - Females
0,1,40117000,Belgaum (M Corp.),All ages,399653,204598,195055,91358,36857,54501,...,362,7143,5210,1933,41152,26488,14664,3,2,1
1,1,40117000,Belgaum (M Corp.),0-6,47642,24768,22874,47642,24768,22874,...,0,0,0,0,0,0,0,0,0,0
2,1,40117000,Belgaum (M Corp.),7,6759,3495,3264,1375,662,713,...,0,0,0,0,0,0,0,0,0,0
3,1,40117000,Belgaum (M Corp.),8,8067,4152,3915,568,292,276,...,0,0,0,0,0,0,0,0,0,0
4,1,40117000,Belgaum (M Corp.),9,6948,3559,3389,275,137,138,...,0,0,0,0,0,0,0,0,0,0


#### A Few Data Observations:

The general observations for each district are as follows:
- There are 29 unique age-group records for each town code
    - The 'All ages' category is the summation of all age groups
- Each of the 12 categories below is reported as persons, males and females, where persons is the summation of males and females
    - Illiterate
    - Literate
    - Educational Level - Literate without Educational Level
    - Educational Level - Below Primary 
    - Educational Level - Primary
    - Educational Level - Middle
    - Educational Level - Matric/Secondary
    - Educational Level - Higher Secondary/Intermediate Pre-University/Senior Secondary
    - Educational Level - Non-technical Diploma or Certificate Not Equal to Degree
    - Educational Level - Technical Diploma or Certificate Not Equal to Degree
    - Educational Level - Graduate & Above
    - Unclassified

Note: We also have 'Total - Persons', 'Total - Males' and 'Total - Females', which have no significance for our analysis because they are person/male/female-wise totals across the categories above.
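
The "persons is the summation of males and females" observation can be checked directly. The sketch below copies the illiteracy counts of the first five Belgaum (M Corp.) rows from the `data.head()` output above into a small frame; on the full dataset the same comparison could be run on `data` itself.

```python
import pandas as pd

# Illiteracy counts for Belgaum (M Corp.), copied from the data.head() output.
sample = pd.DataFrame({
    "Age-Group": ["All ages", "0-6", "7", "8", "9"],
    "Illiterate - Persons": [91358, 47642, 1375, 568, 275],
    "Illiterate - Males": [36857, 24768, 662, 292, 137],
    "Illiterate - Females": [54501, 22874, 713, 276, 138],
})

# Row-wise check: males + females should equal persons.
is_consistent = (
    sample["Illiterate - Males"] + sample["Illiterate - Females"]
    == sample["Illiterate - Persons"]
)

print(is_consistent.all())
```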

#### Current Focus :

 - The main focus of this exercise is to improve primary education. Going forward, the analysis assumes that only the following categories are in need of primary education:
    - Illiterate 
    - Educational Level - Literate without Educational Level
    - Educational Level - Below Primary 
    - Educational Level - Primary
    - Unclassified
    
   This means that the following features can be eliminated from the dataset:
   
    Total - Persons                                                                              
    Total - Males                                                                                
    Total - Females                                                                              
    Literate - Persons                                                                           
    Literate - Males                                                                             
    Literate - Females                                                                           
    Educational Level - Middle Persons                                                           
    Educational Level - Middle Males                                                             
    Educational Level - Middle Females                                                           
    Educational Level - Matric/Secondary Persons                                                 
    Educational Level - Matric/Secondary Males                                                   
    Educational Level - Matric/Secondary Females                                                 
    Educational Level - Higher Secondary/Intermediate Pre-University/Senior Secondary Persons    
    Educational Level - Higher Secondary/Intermediate Pre-University/Senior Secondary Males      
    Educational Level - Higher Secondary/Intermediate Pre-University/Senior Secondary Females    
    Educational Level - Non-technical Diploma or Certificate Not Equal to Degree Persons         
    Educational Level - Non-technical Diploma or Certificate Not Equal to Degree Males           
    Educational Level - Non-technical Diploma or Certificate Not Equal to Degree Females         
    Educational Level - Technical Diploma or Certificate Not Equal to Degree Persons             
    Educational Level - Technical Diploma or Certificate Not Equal to Degree Males               
    Educational Level - Technical Diploma or Certificate Not Equal to Degree Females             
    Educational Level - Graduate & Above Persons                                                 
    Educational Level - Graduate & Above Males                                                   
    Educational Level - Graduate & Above Females 
    
 
#### Key Note:

For each district code and town code, the 'All ages' category of the 'Age-Group' feature holds the total over all age groups. In my opinion it is irrelevant, as we need to focus on the age groups that need primary education. So let us go ahead and remove these rows from the dataset

In [37]:
data = data[data['Age-Group'] != 'All ages']
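
Before discarding the 'All ages' rows, it can be reassuring to confirm that they really are the per-town totals of the remaining age groups. The snippet below is a sketch of such a check on invented numbers; the column names follow the dataset, but the frame itself is hypothetical.

```python
import pandas as pd

# Toy data: two towns, each with a total row and two age-group rows (invented).
toy = pd.DataFrame({
    "Town Code": [1, 1, 1, 2, 2, 2],
    "Age-Group": ["All ages", "0-6", "7", "All ages", "0-6", "7"],
    "Illiterate - Persons": [30, 20, 10, 12, 7, 5],
})

is_total = toy["Age-Group"] == "All ages"

# Reported totals versus totals recomputed from the individual age groups.
reported = toy[is_total].set_index("Town Code")["Illiterate - Persons"]
recomputed = toy[~is_total].groupby("Town Code")["Illiterate - Persons"].sum()
totals_match = reported.eq(recomputed).all()

# Same filter as in the notebook cell: keep everything except the total rows.
filtered = toy[~is_total]

print("Totals match:", totals_match)
```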

Now let us list the numerical features that need to be dropped because they are not relevant to primary education.


In [38]:
columns = [ 'Total - Persons',
           'Total - Males',
           'Total - Females',
           'Literate - Persons',
           'Literate - Males',
           'Literate - Females',
           'Educational Level - Middle Persons',
           'Educational Level - Middle Males',
           'Educational Level - Middle Females',
           'Educational Level - Matric/Secondary Persons',
           'Educational Level - Matric/Secondary Males',
           'Educational Level - Matric/Secondary Females',
           'Educational Level - Higher Secondary/Intermediate Pre-University/Senior Secondary Persons',
           'Educational Level - Higher Secondary/Intermediate Pre-University/Senior Secondary Males',
           'Educational Level - Higher Secondary/Intermediate Pre-University/Senior Secondary Females',
           'Educational Level - Non-technical Diploma or Certificate Not Equal to Degree Persons',
           'Educational Level - Non-technical Diploma or Certificate Not Equal to Degree Males',
           'Educational Level - Non-technical Diploma or Certificate Not Equal to Degree Females',
           'Educational Level - Technical Diploma or Certificate Not Equal to Degree Persons',
           'Educational Level - Technical Diploma or Certificate Not Equal to Degree Males',
           'Educational Level - Technical Diploma or Certificate Not Equal to Degree Females',
           'Educational Level - Graduate & Above Persons',
           'Educational Level - Graduate & Above Males',
           'Educational Level - Graduate & Above Females']                                                                            

data.drop(columns,axis =1,inplace = True)
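
Instead of listing the 24 columns to drop, one could select the columns to keep by their prefix. This is only a sketch: the toy frame and the exact prefixes are assumptions based on the category names listed earlier, and they would need to match the real column names.

```python
import pandas as pd

# Toy frame with a handful of columns (invented values).
toy = pd.DataFrame({
    "District Code": [1],
    "Age-Group": ["0-6"],
    "Total - Persons": [100],
    "Literate - Persons": [60],
    "Illiterate - Persons": [40],
    "Educational Level - Primary Persons": [15],
    "Educational Level - Middle Persons": [10],
    "Unclassified - Persons": [0],
})

id_columns = ["District Code", "Age-Group"]

# Assumed prefixes of the categories that relate to primary education.
keep_prefixes = (
    "Illiterate",
    "Educational Level - Literate without Educational Level",
    "Educational Level - Below Primary",
    "Educational Level - Primary",
    "Unclassified",
)

# str.startswith accepts a tuple, so one test covers all prefixes.
kept = [c for c in toy.columns if c in id_columns or c.startswith(keep_prefixes)]
reduced = toy[kept]

print(list(reduced.columns))
```

Note that "Literate - Persons" is not kept, because the match is on the start of the name and "Literate" does not begin with "Illiterate".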

## Exploratory Data Analysis:

In this section I am going to visualize data with respect to categories focused on improving 'Primary Education'.

- Illiterate 
- Educational Level - Literate without Educational Level
- Educational Level - Below Primary 
- Educational Level - Primary
- Unclassified

I have used the Bokeh interactive data visualisation library to visualise the data. Please note that mouse hover is enabled, so you can see the values for each district for the categories mentioned above. Also note that the size of each circle depicts the magnitude of the feature.

The visualisations depict district and age wise data representations for each of the above mentioned groups including total count as well as male and female counts.

#### Illiterate :

In this section you can observe the data visualisation of the following features with respect to district and age group 

- Illiterate - Persons
- Illiterate - Males
- Illiterate - Females

In [39]:
source = ColumnDataSource(dict(
    x = data['District Code'],
    y = data['Illiterate - Persons'],
    area = data['Area Name'],
    illerate = data['Illiterate - Persons'],
    illerate_male = data['Illiterate - Males'],
    illerate_female = data['Illiterate - Females'],
    age = data['Age-Group']
)       
)

size_mapper = LinearInterpolator(
    x = [data['Illiterate - Persons'].min(),data['Illiterate - Persons'].max()],
    y = [2,100]
)

color_mapper = CategoricalColorMapper(
    factors = list(data['Area Name'].unique()),
    palette = Spectral8
)

PLOT_OPTS = dict(height = 800,width = 800,x_range = (1,30),y_range=(10,100000))

p = figure(title = 'Illiteracy District/Area Wise',
           toolbar_location = 'above',
           tools = [HoverTool(
               tooltips = [('Area ','@area'),
                           ('Illiterate - Total Persons ','@illerate'),
                           ('Illiterate - Total Males ','@illerate_male'),
                           ('Illiterate - Total Females ','@illerate_female'),
                           ('Age Group ','@age'),
                        ],show_arrow = False)],
           x_axis_label = 'District Code',
           y_axis_label = 'No of Illiterates',
           **PLOT_OPTS)

p.circle(x='x',
         y='y', 
         size = {'field': 'illerate','transform':size_mapper},
         color = {'field': 'area','transform':color_mapper},
         alpha = 0.7,
         legend_field = 'area',  # `legend=` is deprecated since Bokeh 1.4 and removed in Bokeh 3
         source = source)
p.legend.location = (0,-50)
p.right.append(p.legend[0])
p.legend.border_line_color = None
show(p,notebook_handle=True)
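
The remaining plots in this section repeat the same pattern with different columns. As a sketch only, the pattern could be wrapped in a helper. The name `bubble_plot`, its parameters and the toy data are my own and are not used by the other cells. It uses `scatter`, `legend_field` and `add_layout` in place of `circle`, `legend` and `p.right.append`, assuming a reasonably recent Bokeh version.

```python
import pandas as pd
from bokeh.models import (CategoricalColorMapper, ColumnDataSource,
                          HoverTool, LinearInterpolator)
from bokeh.palettes import Spectral8
from bokeh.plotting import figure


def bubble_plot(df, persons, males, females, title, y_label,
                y_range, size_range=(2, 100)):
    """Hypothetical helper: one bubble per row, sized by the `persons` column."""
    source = ColumnDataSource(dict(
        x=df['District Code'],
        y=df[persons],
        area=df['Area Name'],
        persons=df[persons],
        males=df[males],
        females=df[females],
        age=df['Age-Group'],
    ))
    # Smallest count -> smallest circle, largest count -> largest circle.
    size_mapper = LinearInterpolator(
        x=[float(df[persons].min()), float(df[persons].max())],
        y=list(size_range),
    )
    # One colour per area; Spectral8 provides eight colours.
    color_mapper = CategoricalColorMapper(
        factors=list(df['Area Name'].unique()),
        palette=Spectral8,
    )
    hover = HoverTool(tooltips=[
        ('Area', '@area'),
        ('Total Persons', '@persons'),
        ('Total Males', '@males'),
        ('Total Females', '@females'),
        ('Age Group', '@age'),
    ])
    p = figure(title=title, toolbar_location='above', tools=[hover],
               x_axis_label='District Code', y_axis_label=y_label,
               height=800, width=800, x_range=(1, 30), y_range=y_range)
    p.scatter(x='x', y='y',
              size={'field': 'persons', 'transform': size_mapper},
              color={'field': 'area', 'transform': color_mapper},
              alpha=0.7, legend_field='area', source=source)
    # Move the legend outside the plot area, to the right.
    p.add_layout(p.legend[0], 'right')
    p.legend.border_line_color = None
    return p


# Toy data with made-up numbers, only to show the call.
toy = pd.DataFrame({
    'District Code': [1, 2],
    'Area Name': ['A', 'B'],
    'Age-Group': ['0-6', '7'],
    'P': [10, 20], 'M': [6, 11], 'F': [4, 9],
})
p = bubble_plot(toy, 'P', 'M', 'F', title='Demo', y_label='Count',
                y_range=(0, 30))
```

With the real `data` frame, passing the three `Unclassified` columns and `y_range=(10, 400)` and then calling `show(...)` on the result should give a chart comparable to the Unclassified one in this section.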

#### Educational Level - Literate without Educational Level:

In this section you can observe the data visualisation of the following features with respect to district and age group:
- Educational Level - Literate without Educational Level - Persons
- Educational Level - Literate without Educational Level - Males
- Educational Level - Literate without Educational Level - Females

In [40]:
source = ColumnDataSource(dict(
    x = data['District Code'],
    y = data['Educational Level - Literate without Educational Level Persons'],
    area = data['Area Name'],
    illerate = data['Educational Level - Literate without Educational Level Persons'],
    illerate_male = data['Educational Level - Literate without Educational Level Males'],
    illerate_female = data['Educational Level - Literate without Educational Level Females'],
    age = data['Age-Group']
)       
)

size_mapper = LinearInterpolator(
    x = [data['Educational Level - Literate without Educational Level Persons'].min(),
         data['Educational Level - Literate without Educational Level Persons'].max()],
    y = [2,100]
)

color_mapper = CategoricalColorMapper(
    factors = list(data['Area Name'].unique()),
    palette = Spectral8
)

PLOT_OPTS = dict(height = 800,width = 800,x_range = (1,30),y_range=(10,6000))

p = figure(title = 'Educational Level - Literate without Educational Level (District vs. Age Wise)',
           toolbar_location = 'above',
           tools = [HoverTool(
               tooltips = [('Area ','@area'),
                           ('Educational Level - Literate without Educational Level - Total Persons ','@illerate'),
                           ('Educational Level - Literate without Educational Level - Total Males ','@illerate_male'),
                           ('Educational Level - Literate without Educational Level - Total Females ','@illerate_female'),
                           ('Age Group ','@age'),
                        ],show_arrow = False)],
           x_axis_label = 'District Code',
           y_axis_label = 'No of Educational Level - Literate without Educational Level',
           **PLOT_OPTS)

p.circle(x='x',
         y='y', 
         size = {'field': 'illerate','transform':size_mapper},
         color = {'field': 'area','transform':color_mapper},
         alpha = 0.7,
         legend_field = 'area',
         source = source)
p.legend.location = (0,-50)
p.right.append(p.legend[0])
p.legend.border_line_color = None
show(p,notebook_handle=True)

#### Educational Level - Below Primary:

In this section you can observe the data visualisation of the following features with respect to district and age group:
- Educational Level - Below Primary - Persons
- Educational Level - Below Primary - Males
- Educational Level - Below Primary - Females

In [41]:
source = ColumnDataSource(dict(
    x = data['District Code'],
    y = data['Educational Level - Below Primary Persons'],
    area = data['Area Name'],
    illerate = data['Educational Level - Below Primary Persons'],
    illerate_male = data['Educational Level - Below Primary Males'],
    illerate_female = data['Educational Level - Below Primary Females'],
    age = data['Age-Group']
)       
)

size_mapper = LinearInterpolator(
    x = [data['Educational Level - Below Primary Persons'].min(),
         data['Educational Level - Below Primary Persons'].max()],
    y = [2,50]
)

color_mapper = CategoricalColorMapper(
    factors = list(data['Area Name'].unique()),
    palette = Spectral8
)

PLOT_OPTS = dict(height = 800,width = 800,x_range = (1,30),y_range=(10,100000))

p = figure(title = 'Educational Level - Below Primary (District vs. Age Wise)',
           toolbar_location = 'above',
           tools = [HoverTool(
               tooltips = [('Area ','@area'),
                           ('Educational Level - Below Primary Total Persons ','@illerate'),
                           ('Educational Level - Below Primary Total Males ','@illerate_male'),
                           ('Educational Level - Below Primary Total Females ','@illerate_female'),
                           ('Age Group ','@age'),
                        ],show_arrow = False)],
           x_axis_label = 'District Code',
           y_axis_label = 'No of Educational Level - Below Primary Persons',
           **PLOT_OPTS)

p.circle(x='x',
         y='y', 
         size = {'field': 'illerate','transform':size_mapper},
         color = {'field': 'area','transform':color_mapper},
         alpha = 0.7,
         legend_field = 'area',
         source = source)
p.legend.location = (0,-50)
p.right.append(p.legend[0])
p.legend.border_line_color = None
show(p,notebook_handle=True)

#### Educational Level - Primary:

In this section you can observe the data visualisation of the following features with respect to district and age group:
- Educational Level - Primary - Persons
- Educational Level - Primary - Males
- Educational Level - Primary - Females

In [42]:
source = ColumnDataSource(dict(
    x = data['District Code'],
    y = data['Educational Level - Primary Persons'],
    area = data['Area Name'],
    illerate = data['Educational Level - Primary Persons'],
    illerate_male = data['Educational Level - Primary Males'],
    illerate_female = data['Educational Level - Primary Females'],
    age = data['Age-Group']
)       
)

size_mapper = LinearInterpolator(
    x = [data['Educational Level - Primary Persons'].min(),
         data['Educational Level - Primary Persons'].max()],
    y = [2,50]
)

color_mapper = CategoricalColorMapper(
    factors = list(data['Area Name'].unique()),
    palette = Spectral8
)

PLOT_OPTS = dict(height = 800,width = 800,x_range = (1,30),y_range=(10,100000))

p = figure(title = 'Educational Level - Primary (District vs. Age Wise)',
           toolbar_location = 'above',
           tools = [HoverTool(
               tooltips = [('Area ','@area'),
                           ('Educational Level - Primary Total Persons ','@illerate'),
                           ('Educational Level - Primary Total Males ','@illerate_male'),
                           ('Educational Level - Primary Total Females ','@illerate_female'),
                           ('Age Group ','@age'),
                        ],show_arrow = False)],
           x_axis_label = 'District Code',
           y_axis_label = 'No of Educational Level - Primary Persons',
           **PLOT_OPTS)

p.circle(x='x',
         y='y', 
         size = {'field': 'illerate','transform':size_mapper},
         color = {'field': 'area','transform':color_mapper},
         alpha = 0.7,
         legend_field = 'area',
         source = source)
p.legend.location = (0,-50)
p.right.append(p.legend[0])
p.legend.border_line_color = None
show(p,notebook_handle=True)

#### Unclassified:

In this section you can observe the data visualisation of the following features with respect to district and age group:
- Unclassified - Persons
- Unclassified - Males
- Unclassified - Females

In [43]:
source = ColumnDataSource(dict(
    x = data['District Code'],
    y = data['Unclassified - Persons'],
    area = data['Area Name'],
    illerate = data['Unclassified - Persons'],
    illerate_male = data['Unclassified - Males'],
    illerate_female = data['Unclassified - Females'],
    age = data['Age-Group']
)       
)

size_mapper = LinearInterpolator(
    x = [data['Unclassified - Persons'].min(),
         data['Unclassified - Persons'].max()],
    y = [1,100]
)

color_mapper = CategoricalColorMapper(
    factors = list(data['Area Name'].unique()),
    palette = Spectral8
)

PLOT_OPTS = dict(height = 800,width = 800,x_range = (1,30),y_range=(10,400))

p = figure(title = 'Unclassified (District vs. Age Wise)',
           toolbar_location = 'above',
           tools = [HoverTool(
               tooltips = [('Area ','@area'),
                           ('Unclassified - Total Persons ','@illerate'),
                           ('Unclassified - Total Males ','@illerate_male'),
                           ('Unclassified - Total Females ','@illerate_female'),
                           ('Age Group ','@age'),
                        ],show_arrow = False)],
           x_axis_label = 'District Code',
           y_axis_label = 'Unclassified - Persons',
           **PLOT_OPTS)

p.circle(x='x',
         y='y', 
         size = {'field': 'illerate','transform':size_mapper},
         color = {'field': 'area','transform':color_mapper},
         alpha = 0.7,
         legend_field = 'area',
         source = source)
p.legend.location = (0,-50)
p.right.append(p.legend[0])
p.legend.border_line_color = None

show(p,notebook_handle=True)

Now let us look at the combined count across all five categories above, for total persons, males and females. To do this we create three new features:
- Total
- Total_Males
- Total_Females

In [44]:
data['Total']=data['Illiterate - Persons']+data['Educational Level - Below Primary Persons']+data['Educational Level - Literate without Educational Level Persons']+data['Educational Level - Primary Persons']+data['Unclassified - Persons']
data['Total_Males']=data['Illiterate - Males']+data['Educational Level - Below Primary Males']+data['Educational Level - Literate without Educational Level Males']+data['Educational Level - Primary Males']+data['Unclassified - Males']
data['Total_Females']=data['Illiterate - Females']+data['Educational Level - Below Primary Females']+data['Educational Level - Literate without Educational Level Females']+data['Educational Level - Primary Females']+data['Unclassified - Females']
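
The three sums above follow one pattern: the same five categories, with a different suffix each time. A minimal sketch of that pattern is shown here. The names `CATEGORY_STEMS` and `add_totals` and the toy frame are illustrative and not part of this notebook; the column names are the ones used in the cell above.

```python
import pandas as pd

# Column-name stems of the five categories analysed in this section.
CATEGORY_STEMS = [
    'Illiterate - ',
    'Educational Level - Literate without Educational Level ',
    'Educational Level - Below Primary ',
    'Educational Level - Primary ',
    'Unclassified - ',
]


def add_totals(df):
    """Return a copy of df with Total, Total_Males and Total_Females added."""
    out = df.copy()
    targets = [('Persons', 'Total'),
               ('Males', 'Total_Males'),
               ('Females', 'Total_Females')]
    for suffix, target in targets:
        cols = [stem + suffix for stem in CATEGORY_STEMS]
        # skipna=False mirrors chained `+`: a NaN in any column gives NaN.
        out[target] = out[cols].sum(axis=1, skipna=False)
    return out


# Toy frame with made-up values: every category column holds [1, 2].
toy = pd.DataFrame({stem + suffix: [1, 2]
                    for stem in CATEGORY_STEMS
                    for suffix in ('Persons', 'Males', 'Females')})
print(add_totals(toy)[['Total', 'Total_Males', 'Total_Females']])
```

Keeping the category list in one place makes it harder for the three totals to drift apart if a category is added or removed later.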


Now let us plot these new features to get a summary view of the whole analysis.

#### Summary (Total):

In this section you can observe the data visualisation of the following features with respect to district and age group:
- Total
- Total_Males
- Total_Females

In [45]:
source = ColumnDataSource(dict(
    x = data['District Code'],
    y = data['Total'],
    area = data['Area Name'],
    illerate = data['Total'],
    illerate_male = data['Total_Males'],
    illerate_female = data['Total_Females'],
    age = data['Age-Group']
)       
)

size_mapper = LinearInterpolator(
    x = [data['Total'].min(),
         data['Total'].max()],
    y = [5,100]
)

color_mapper = CategoricalColorMapper(
    factors = list(data['Area Name'].unique()),
    palette = Spectral8
)

PLOT_OPTS = dict(height = 800,width = 800,x_range = (1,30),y_range=(10,120000))

p = figure(title = 'Summary (District vs. Age Wise)',
           toolbar_location = 'above',
           tools = [HoverTool(
               tooltips = [('Area ','@area'),
                           ('Total Persons ','@illerate'),
                           ('Total Males ','@illerate_male'),
                           ('Total Females ','@illerate_female'),
                           ('Age Group ','@age'),
                        ],show_arrow = False)],
           x_axis_label = 'District Code',
           y_axis_label = 'Total Population needing Primary Education',
           **PLOT_OPTS)

p.circle(x='x',
         y='y', 
         size = {'field': 'illerate','transform':size_mapper},
         color = {'field': 'area','transform':color_mapper},
         alpha = 0.7,
         legend_field = 'area',
         source = source)
p.legend.location = (0,-50)
p.right.append(p.legend[0])
p.legend.border_line_color = None
show(p,notebook_handle=True)

## Conclusion:
    
- The summary visualisation above suggests that the following districts should be the focus for primary education as part of the Phase 1 NGO initiative:
    - Hubli-Dharwad
    - Mysore
    - Bangalore
    - Belgaum
    - Gulbarga
    - Bellary
    - Davanagere
    - Mangalore
- The age groups 0-6 years and 30-45 years need more attention in most cases.
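
One way to cross-check these observations numerically is to sum the `Total` column per area and per age group and sort the result. This is a sketch only: `rank_by_total` and the toy rows are hypothetical, and it assumes that the rows are non-overlapping age groups. If the data contains an aggregate row such as an "all ages" group, it should be filtered out first to avoid double counting.

```python
import pandas as pd


def rank_by_total(df, key, total_col='Total'):
    """Sum total_col within each value of key, largest first."""
    return df.groupby(key)[total_col].sum().sort_values(ascending=False)


# Toy rows with made-up numbers, only to show the call pattern.
toy = pd.DataFrame({
    'Area Name': ['A', 'A', 'B', 'B'],
    'Age-Group': ['0-6', '7', '0-6', '7'],
    'Total': [50, 10, 30, 5],
})
print(rank_by_total(toy, 'Area Name'))
print(rank_by_total(toy, 'Age-Group'))
```

On the real data the calls would be `rank_by_total(data, 'Area Name')` and `rank_by_total(data, 'Age-Group')`.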

#### Scope for improvement:

A more detailed analysis is needed to understand the underlying reasons and to build a predictive model. Due to time constraints, this is left as a future exercise.

## If you like this kernel, an <font color="red">UPVOTE</font> would be greatly appreciated. Thank you!

