# Welcome to Data Analytics
- Data analytics is the science of analyzing raw data to draw conclusions from it.

## Five Steps of Data Analytics

![da.jpg](https://raw.githubusercontent.com/locus-ioe/sf23-content/master/Day_05%20-%20Data%20Analytics/da.jpg)

# Importing the required libraries

In [114]:
import pandas as pd
import json
import plotly.express as px
from urllib.request import urlopen

# Download the dataset from [sf23-content Day_05 - Data Analytics](https://raw.githubusercontent.com/locus-ioe/sf23-content/master/Day_05%20-%20Data%20Analytics/admission.csv)

In [115]:
!wget https://raw.githubusercontent.com/locus-ioe/sf23-content/master/Day_05%20-%20Data%20Analytics/admission.csv

--2022-07-09 07:47:49--  https://raw.githubusercontent.com/locus-ioe/sf23-content/master/Day_05%20-%20Data%20Analytics/admission.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 240347 (235K) [text/plain]
Saving to: ‘admission.csv.2’


2022-07-09 07:47:49 (11.5 MB/s) - ‘admission.csv.2’ saved [240347/240347]



# Read the downloaded admission dataset into a pandas DataFrame

In [116]:
admission_df = pd.read_csv("admission.csv")
admission_df.head()

|   | Name | Rank | College | Program | EntranceScore | District | Gender |
|---|---|---|---|---|---|---|---|
| 0 | Suman Tamang | 1 | Pulchowk Campus | BCE | 131.2 | DADELDHURA | Male |
| 1 | Prasun Sitaula | 2 | Pulchowk Campus | BCT | 131.2 | RUKUM | Male |
| 2 | Saroj Basnet | 3 | Pulchowk Campus | BME | 131.2 | TANAHU | Male |
| 3 | Utsav Manandhar | 4 | Pulchowk Campus | BCT | 129.0 | JHAPA | Male |
| 4 | Kalpesh Manandhar | 5 | Pulchowk Campus | BCT | 129.0 | JHAPA | Male |


In [117]:
admission_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3543 entries, 0 to 3542
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Name           3543 non-null   object 
 1   Rank           3543 non-null   int64  
 2   College        3543 non-null   object 
 3   Program        3543 non-null   object 
 4   EntranceScore  3543 non-null   float64
 5   District       3543 non-null   object 
 6   Gender         3543 non-null   object 
dtypes: float64(1), int64(1), object(5)
memory usage: 193.9+ KB


The dataset has already been preprocessed and has no null values. 
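The claim above can be checked directly rather than read off the `info()` output. The following is a minimal sketch on a small stand-in frame; `toy_df`, its rows, `null_counts` and `has_no_nulls` are hypothetical and only illustrate the calls, which would be applied to `admission_df` in the same way.

```python
import pandas as pd

# Hypothetical three-row stand-in for admission_df; the last row has a missing district.
toy_df = pd.DataFrame({
    "Name": ["A One", "B Two", "C Three"],
    "EntranceScore": [131.2, 129.0, 120.5],
    "District": ["JHAPA", "RUKUM", None],
})

# Number of missing values in each column.
null_counts = toy_df.isna().sum()
print(null_counts)

# True only when no column contains a missing value.
has_no_nulls = not toy_df.isna().any().any()
print(has_no_nulls)
```

On a frame without missing values every entry of `null_counts` is zero and `has_no_nulls` is `True`.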

# Exploratory Data Analysis

## Task 1
- Distribution of male and female students in the engineering field

In [118]:
sample = admission_df["Gender"].value_counts()
sample

Male      2815
Female     728
Name: Gender, dtype: int64

In [119]:
sample = sample.reset_index()
sample

|   | index | Gender |
|---|---|---|
| 0 | Male | 2815 |
| 1 | Female | 728 |


In [120]:
sample.rename(columns={"index":"Gender", "Gender":"Count"}, inplace=True)
sample

|   | Gender | Count |
|---|---|---|
| 0 | Male | 2815 |
| 1 | Female | 728 |
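The column names produced by `value_counts().reset_index()` depend on the pandas version: the outputs above (`index`, `Gender`) match pandas 1.x, while pandas 2.0 and later name the columns `Gender` and `count`, so the rename mapping above would no longer give the intended result. A sketch of a version-independent alternative, using a hypothetical `gender` Series in place of `admission_df["Gender"]`:

```python
import pandas as pd

# Hypothetical stand-in for admission_df["Gender"].
gender = pd.Series(["Male", "Female", "Male", "Male"], name="Gender")

# rename_axis names the index explicitly and reset_index(name=...) names the
# value column, so the result does not rely on the defaults of value_counts.
gender_counts = gender.value_counts().rename_axis("Gender").reset_index(name="Count")
print(gender_counts)
```

This yields the columns `Gender` and `Count` in a single step, without a separate `rename` call.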


### Pie chart of the gender distribution of students in the engineering field

In [121]:
fig = px.pie(sample, values='Count', names='Gender', title='Gender distribution in engineering field')
fig.show()

## Task 2
- Top 5 most common first names of students

The dataset contains only each student's full name. To obtain the first name, split the full name on " " and take the first element, i.e. index 0. For uniformity, convert the names to uppercase.

In [122]:
admission_df["FirstName"] = admission_df["Name"].str.split(" ").str[0].str.upper()
admission_df["FirstName"].head()

0      SUMAN
1     PRASUN
2      SAROJ
3      UTSAV
4    KALPESH
Name: FirstName, dtype: object
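Splitting on a literal space assumes every name is cleanly formatted. The sketch below uses hypothetical names (not taken from the dataset) to show what happens with a leading space, and one possible way to guard against it.

```python
import pandas as pd

# Hypothetical names with a leading space and a double space; not taken from the dataset.
names = pd.Series(["Suman Tamang", " Prasun Sitaula", "Saroj  Basnet"])

# Splitting on a literal " " leaves an empty first token when a name starts with a space.
naive = names.str.split(" ").str[0].str.upper()

# strip() removes surrounding whitespace; split() with no argument splits on runs of whitespace.
robust = names.str.strip().str.split().str[0].str.upper()

print(naive.tolist())
print(robust.tolist())
```

If the names in `admission.csv` are already clean, both variants give the same result.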

In [123]:
# rename_axis/reset_index(name=...) yield the same column names on pandas 1.x and 2.x
sample = admission_df["FirstName"].value_counts().head(5) \
            .rename_axis("FirstName").reset_index(name="Count")

fig = px.bar(sample, x='FirstName', y='Count', title='Top 5 most common first names')
fig.show()

## Assignment 1
- Find the top 5 unique names and visualize them

In [124]:
### Your Code Goes Here

### End Code

## Task 3
- Visualize the distribution of students by district

### Obtain the number of students based on districts

In [125]:
sample = admission_df.groupby("District").count()["Program"]
sample

District
ACHHAM          74
ARGHAKHANCHI    12
BAGLUNG         49
BAITADI          5
BAJHANG         77
                ..
SURKHET         51
SYANGJA         12
TANAHU          11
TEHRATHUM       43
UDAYAPUR        21
Name: Program, Length: 71, dtype: int64

In [126]:
sample = pd.DataFrame({"District":sample.index.values, "count":sample.values})
# Use distinct names so the built-in max() and min() are not shadowed
max_count, min_count = sample["count"].agg(["max", "min"])

### Load the GeoJSON file of Nepal's districts for drawing the map in Plotly

In [127]:
with urlopen('https://raw.githubusercontent.com/mesaugat/geoJSON-Nepal/master/nepal-districts.geojson') as response:
    districts = json.load(response)

A choropleth map is composed of colored polygons and is used to represent the spatial variation of a quantity.

In [128]:
fig = px.choropleth(sample, geojson=districts, locations='District', color='count',
                           color_continuous_scale="Viridis",
                           range_color=(0, max_count),
                           scope="asia",
                           hover_name="District",
                           featureidkey="properties.DISTRICT",
                           labels={'count':'Total no of students'}
                          )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
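The map can only color a district when the value in the `locations` column matches the property named by `featureidkey` in some GeoJSON feature. A minimal sketch of checking for mismatches before plotting is shown below; `toy_geojson` and `toy_counts` are hypothetical stand-ins for `districts` and `sample`, and only the `DISTRICT` property name is taken from the call above.

```python
import pandas as pd

# Hypothetical two-feature GeoJSON; only the "DISTRICT" property mirrors the real file's layout.
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"DISTRICT": "JHAPA"}, "geometry": None},
        {"type": "Feature", "properties": {"DISTRICT": "RUKUM"}, "geometry": None},
    ],
}

# Hypothetical per-district counts; "TANAHU" has no matching feature above.
toy_counts = pd.DataFrame({"District": ["JHAPA", "RUKUM", "TANAHU"], "count": [3, 1, 2]})

# Collect the value that featureidkey="properties.DISTRICT" would read from each feature.
geo_names = {feature["properties"]["DISTRICT"] for feature in toy_geojson["features"]}

# Districts present in the data but absent from the GeoJSON.
unmatched = sorted(set(toy_counts["District"]) - geo_names)
print(unmatched)
```

An empty `unmatched` list suggests every district in the data has a polygon to color. Differences in spelling or letter case are the usual cause of entries in it.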

## Task 4
- Rank distribution in each constituent college

In [129]:
fig = px.box(admission_df, x="College" ,y="Rank")
fig.show()
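A box plot summarizes each group by its minimum, quartiles and maximum. The same figures can be tabulated with pandas, as in the sketch below. The frame `toy_df` and its values are hypothetical, and Plotly's quartile method may differ slightly from the interpolation pandas uses.

```python
import pandas as pd

# Hypothetical ranks for two colleges; not taken from the dataset.
toy_df = pd.DataFrame({
    "College": ["A", "A", "A", "A", "A", "B", "B", "B"],
    "Rank": [1, 2, 3, 4, 5, 10, 20, 30],
})

# Minimum, quartiles and maximum of Rank per college.
summary = toy_df.groupby("College")["Rank"].describe()[["min", "25%", "50%", "75%", "max"]]
print(summary)
```

Applied to `admission_df`, the same call would give one row per college to read alongside the boxes.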

## Task 5
- Rank distribution and comparison based on Gender in 'Pulchowk Campus', 'Thapathali Campus', 'Paschimanchal Campus'

In [130]:
colleges = ['Pulchowk Campus', 'Thapathali Campus', 'Paschimanchal Campus']

sample = admission_df[admission_df["College"].isin(colleges)]

fig = px.box(sample, x="College" ,y="Rank", color="Gender")
fig.show()

## Task 6
- Program-wise gender distribution


In [131]:
sample = admission_df.groupby(["Program", "Gender"]).count()["Name"].reset_index().rename(columns={"Name":"Count"})
sample.head()

|   | Program | Gender | Count |
|---|---|---|---|
| 0 | BAG | Female | 16 |
| 1 | BAG | Male | 32 |
| 2 | BAM | Female | 10 |
| 3 | BAM | Male | 84 |
| 4 | BAR | Female | 164 |


In [132]:
# Select only Count so the text column Gender is not aggregated
total = sample.groupby("Program")[["Count"]].sum().rename(columns={"Count":"Total"})
total.head()

| Program | Total |
|---|---|
| BAG | 48 |
| BAM | 94 |
| BAR | 262 |
| BAS | 47 |
| BCE | 1332 |


In [133]:
sample = sample.merge(total, on="Program", how="left")
sample.head()

|   | Program | Gender | Count | Total |
|---|---|---|---|---|
| 0 | BAG | Female | 16 | 48 |
| 1 | BAG | Male | 32 | 48 |
| 2 | BAM | Female | 10 | 94 |
| 3 | BAM | Male | 84 | 94 |
| 4 | BAR | Female | 164 | 262 |


In [134]:
sample["Percentage"] = sample["Count"]*100/sample["Total"]
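The group total and the percentage can also be computed without a separate `merge`. The sketch below uses `groupby(...).transform("sum")` on a hypothetical frame `toy` that copies the first four rows shown above. It is an alternative to the approach in this notebook, not a replacement for it.

```python
import pandas as pd

# Hypothetical counts in the same shape as the grouped sample above.
toy = pd.DataFrame({
    "Program": ["BAG", "BAG", "BAM", "BAM"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Count": [16, 32, 10, 84],
})

# transform("sum") returns one total per row, aligned with the original index.
toy["Total"] = toy.groupby("Program")["Count"].transform("sum")
toy["Percentage"] = toy["Count"] * 100 / toy["Total"]
print(toy)
```

Because `transform` keeps the original row order and index, the result can be assigned straight back as a new column.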

In [135]:
sample.head()

|   | Program | Gender | Count | Total | Percentage |
|---|---|---|---|---|---|
| 0 | BAG | Female | 16 | 48 | 33.333333 |
| 1 | BAG | Male | 32 | 48 | 66.666667 |
| 2 | BAM | Female | 10 | 94 | 10.638298 |
| 3 | BAM | Male | 84 | 94 | 89.361702 |
| 4 | BAR | Female | 164 | 262 | 62.59542 |


In [136]:
fig = px.bar(sample, x="Program", y="Percentage", color="Gender", title="Program wise Gender distribution")
fig.show()

## Assignment 2
- Visualize the male/female population by geographical location

In [137]:
### Your Code Goes Here

### End Code

# End of content