## **Exploratory Data Analysis Using Plotly Express**

![alt text](https://miro.medium.com/max/2560/1*Ptv1_9wX9O2Rm2IBklyufw.png)

### Background

**EDA** (exploratory data analysis) is essential for understanding what a dataset has to tell us. The best way to present an insight is usually a visualisation, and the **plotly-express** library can help you explore your data. Plotly Express is a terse, consistent, high-level API for rapid data exploration and figure generation.

**Advantages of Plotly Express**
- Helps you visualize your data quickly and easily.
- Interactive plots with single function calls.
- Plotly Express is fully compatible with the rest of the Plotly ecosystem: use it in your Dash apps, export your figures to almost any file format using Orca, or edit them in a GUI with the JupyterLab Chart Editor!

Link - **https://plot.ly/python/plotly-express/**

To understand more about EDA using Plotly Express, this notebook covers a few insights and looks at the library from a business point of view.


### The Dataset
**Brief introduction to the dataset we are using for EDA**

The dataset consists of the marks secured by students in various subjects, and is accessible on Kaggle as Student Performance in Exams.

The inspiration is to understand the influence of parents' background, test preparation, etc. on students' performance. It comprises 1,000 rows and 8 columns:

- gender
- race / ethnicity
- parental level of education - some high school, high school, some college, associate's degree, bachelor's degree, or master's degree
- lunch - standard or free/reduced
- test preparation course - none or completed
- math score
- reading score
- writing score

### Libraries

The libraries we will use:
- pandas. You can install it with `pip install pandas`
- plotly.express. It ships with the plotly package, which you can install with `pip install plotly`

In [2]:
import pandas as pd
import plotly.express as px

### Reading the data

*   We will read the data into a pandas DataFrame.
*   We rename the columns for easier understanding.


In [None]:
#!unzip /content/students-performance-in-exams.zip

Archive:  /content/students-performance-in-exams.zip
  inflating: StudentsPerformance.csv  


In [5]:
df = pd.read_csv('https://raw.githubusercontent.com/johnpl765/EDA-Students-Performance-in-Exams/main/StudentsPerformance.csv')
df.columns = ['gender', 'ethnicity', 'parental_level_of_education','lunch','test_preparation_course','math','reading','writing']
df.head()

| | gender | ethnicity | parental_level_of_education | lunch | test_preparation_course | math | reading | writing |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


Using `dtypes`, we will identify the data type of each variable.

In [6]:
df.dtypes

gender                         object
ethnicity                      object
parental_level_of_education    object
lunch                          object
test_preparation_course        object
math                            int64
reading                         int64
writing                         int64
dtype: object
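
As a small aside, the split between text columns and `int64` columns seen above can be used to pick out the categorical and numeric columns programmatically. The sketch below is only an illustration: `sample` is a hypothetical stand-in frame built from three of the rows shown earlier, not part of the original notebook.

```python
import pandas as pd

# Stand-in for df: three rows copied from the df.head() output above.
sample = pd.DataFrame({
    "gender": ["female", "female", "male"],
    "lunch": ["standard", "standard", "free/reduced"],
    "math": [72, 69, 47],
    "reading": [72, 90, 57],
})

# Numeric columns hold the scores; everything else is treated as categorical.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(categorical_cols)
print(numeric_cols)
```

The same two calls should work on the full `df`, which makes it easy to loop over the categorical columns in the next step.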

### First Exploration

*   Checking distribution of categorical data.



In [8]:
print(df.gender.value_counts(),"\n\n",
      df.lunch.value_counts(),"\n\n",
      df.ethnicity.value_counts(),"\n\n",
      df.parental_level_of_education.value_counts(),"\n\n",
      df.test_preparation_course.value_counts(),
     sep='')

gender
female    518
male      482
Name: count, dtype: int64

lunch
standard        645
free/reduced    355
Name: count, dtype: int64

ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

parental_level_of_education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: count, dtype: int64

test_preparation_course
none         642
completed    358
Name: count, dtype: int64


From the distribution of our categorical columns, here are some insights we can take:

- Gender is distributed fairly equally (518 female, 482 male).
- Most students (645 of 1,000) get a standard lunch rather than a free/reduced one.
- Most parents don't have a very high education level: only 177 hold a bachelor's or master's degree.
- Most students (642 of 1,000) did not take the test preparation course.
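
Raw counts like the ones above are often easier to compare as shares. A minimal sketch, assuming we rebuild the `lunch` column from the counts printed earlier (the name `lunch_share` is made up for this example):

```python
import pandas as pd

# Rebuild the lunch column from the counts printed above (645 vs 355).
lunch = pd.Series(["standard"] * 645 + ["free/reduced"] * 355, name="lunch")

# normalize=True returns each category's share instead of its raw count.
lunch_share = lunch.value_counts(normalize=True)
print(lunch_share)
```

On the real data the equivalent call would be `df.lunch.value_counts(normalize=True)`.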

Let's look at the distribution of our numeric columns.

In [9]:
df.describe()

| | math | reading | writing |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |


From the distribution of our numeric columns, here are some insights we can take:

- Math has a lower average score (66.09) than reading (69.17) and writing (68.05).
- The typical score is in the high 60s: the medians are 66, 70 and 69.
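
To rank the subjects by average without reading the table by eye, the means can be sorted directly. This is a sketch on the first five rows only (copied from the `df.head()` output), so the numbers differ from the full-dataset means quoted above; `scores`, `lowest` and `highest` are names introduced for the example.

```python
import pandas as pd

# First five rows of the three score columns, copied from df.head() above.
scores = pd.DataFrame({
    "math": [72, 69, 90, 47, 76],
    "reading": [72, 90, 95, 57, 78],
    "writing": [74, 88, 93, 44, 75],
})

# Column-wise mean, sorted from lowest to highest.
subject_means = scores.mean().sort_values()
lowest = subject_means.index[0]
highest = subject_means.index[-1]

print(subject_means)
print("lowest:", lowest, "| highest:", highest)
```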

### Question - Hypothesis

After the first exploration, we have a couple of questions we can answer with this data. For the demo, let's answer these two:

- Does a certain gender excel in a certain subject?
- Is there a specific ethnicity that is better at math?

### Answering the question

Let's check the distribution of our subjects (gender and ethnicity).
The first plot we will use is a bar plot.

In Plotly Express we can use `px.bar(dataframe, x, y)`.
First we create a crosstab to count the number of female and male students.

In [10]:
gender = pd.crosstab(index= df['gender'],columns='count').reset_index()

In [11]:
fig = px.bar(gender,x = 'gender',y = 'count')
fig.show()
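
The counting step does not have to be a crosstab. As one possible alternative, `groupby(...).size()` produces the same two-column frame that `px.bar` expects. The sketch assumes a stand-in frame rebuilt from the gender counts printed earlier; `genders` and `gender_counts` are names made up for the example.

```python
import pandas as pd

# Stand-in for df['gender'], rebuilt from the counts printed earlier.
genders = pd.DataFrame({"gender": ["female"] * 518 + ["male"] * 482})

# One row per gender, with the number of students in a 'count' column.
gender_counts = genders.groupby("gender").size().reset_index(name="count")
print(gender_counts)
```

The result can be passed to `px.bar(gender_counts, x='gender', y='count')` in the same way as the crosstab output.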

Analysis

*   The number of female students (518) is higher than the number of male students (482).
*   The gap is small, so both genders are well represented in the data.



In [12]:
# Ethnicity as color for differentiation
fig = px.bar(df,
             x= 'gender',
             color='ethnicity')
fig.show()

The plot above is not clear because the bars are stacked. We can change that with the `barmode` parameter. Its default is 'relative' (stacked); to place the bars side by side we can use 'group'.

In [13]:
fig = px.bar(df,
             x= 'gender',
             color='ethnicity', 
             barmode='group')
fig.show()

Distribution of ethnicity.
We can also take care of beautification. A theme can be applied with the `template` parameter, using one of the built-in templates such as `plotly_dark` or `plotly_white`. The available themes are listed in the Plotly Express documentation. Here I'll use my favorite, `plotly_white`.

You may have noticed that the groups are not in order. We can reorder them with the `category_orders` parameter, which accepts a dictionary mapping a column name to the list of its values in the desired order.


In [14]:
fig = px.bar(df,
             x= 'gender',
             color='ethnicity',
             template='plotly_white', 
             barmode='group',
             category_orders={'ethnicity':["group A","group B","group C","group D","group E"]},
             title= "Ethnicity Distribution on Gender")
fig.show()
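
The category list above is typed out by hand. When the labels sort naturally, as "group A" to "group E" do, the dictionary could also be derived from the data. The sketch below uses a toy column with illustrative values rather than the real dataset, and `ethnicity_order` is a name introduced here.

```python
import pandas as pd

# Toy column with the five group labels in scrambled order (illustrative values).
toy = pd.DataFrame({
    "ethnicity": ["group C", "group A", "group E", "group B", "group D", "group C"]
})

# Unique labels, sorted alphabetically, in the shape category_orders expects.
ethnicity_order = {"ethnicity": sorted(toy["ethnicity"].unique())}
print(ethnicity_order)
```

The resulting dictionary can be passed as `category_orders=ethnicity_order`.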

The distribution is quite similar between female and male students, and ethnicity group **C** dominates in our dataset.

With this distribution, it's reasonable to assume we can analyze both genders in our dataset on an equal footing.

#### Question 1

So, let's answer our first question: does a certain gender excel in a certain subject?

To answer this question we will take the math and reading subjects. Why? Because math and reading are the subjects with the lowest and the highest average.
We can use a scatter plot, which is made with the `scatter` function.
We will put math and reading on x and y and color the points by gender so we can see whether there is a difference, and as usual we will use the `plotly_white` template.

In [15]:
fig = px.scatter(df,
                 x='math',
                 y='reading', 
                 color ='gender',
                 template='plotly_white',
                 title="Does a certain gender excel in a certain subject?")
fig.show()

The scatter plot alone makes the score distributions hard to read, so we will add marginal plots.
Setting `marginal_x` and `marginal_y` to `'histogram'` draws a histogram of each axis alongside the scatter.

In [17]:
fig = px.scatter(df,
                 x='math',
                 y='reading', 
                 color ='gender',
                 marginal_x='histogram',
                 marginal_y='histogram',
                 template='plotly_white',
                 title="Does a certain gender excel in a certain subject?")
fig.show()

With the default colors, male is shown in red and female in blue. Based on the scatter plot, male students excel in math while female students excel in reading.

From the marginal plots we can see that most female students have an average score in math, whereas male students tend to score below average in reading.

So the answer to our question is yes: in this dataset, a certain gender excels in a certain subject.
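
A visual impression like this is worth backing up with numbers, for example the mean score per gender. The sketch below shows only the mechanics, on the five rows from the `df.head()` output. Five rows are far too few to draw conclusions from, so its printed values should not be read as findings. On the real data, replace `sample` (a hypothetical name) with `df`.

```python
import pandas as pd

# Five rows copied from the df.head() output; too few to draw conclusions from.
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "math": [72, 69, 90, 47, 76],
    "reading": [72, 90, 95, 57, 78],
})

# Mean math and reading score for each gender.
gender_means = sample.groupby("gender")[["math", "reading"]].mean()
print(gender_means)
```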

#### Question 2

Is there a specific ethnicity that is better at math?

We will use another type of plot, popularly known as the **box plot**; the function we use is `box`.

First, let's revisit the gender finding from Question 1, which we can check with a box plot.

We will run the analysis on the math subject.

In [18]:
fig = px.box(df,
             x='gender', 
             y='math',
             template='plotly_white')
fig.show()

In this box plot, male students have a higher median than female students.
Next, let's try to answer our question. In the box plot below we add one more parameter, `notched`, which helps us see where the median lies.

In [19]:
fig = px.box(df,
             x='ethnicity', 
             y='math',
             template='plotly_white', 
             category_orders={'ethnicity':["group A","group B","group C","group D","group E"]},
             title="Is there a specific ethnicity that is better at math?",
             notched=True)
fig.show()

With a box plot:
- We can see whether there are any outliers in the data.
- We can see the spread of the data.

Points outside the whiskers are outliers,
while the size of the box shows how spread out the data is.

As we can see, a certain ethnicity group has a much higher median, so it's reasonable to say that a specific ethnicity is better at math in this dataset.
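
To make the box plot vocabulary concrete, the quantities it draws can be computed by hand. This is a rough sketch on made-up scores, using the common 1.5 × IQR rule for the whisker limits. Plotly's own quartile method may give slightly different numbers, so treat it as an illustration rather than a reproduction of the figure.

```python
import pandas as pd

# Illustrative toy scores, not taken from the dataset.
math_scores = pd.Series([0, 47, 57, 66, 69, 72, 76, 77, 90, 100])

median = math_scores.median()
q1 = math_scores.quantile(0.25)   # lower edge of the box
q3 = math_scores.quantile(0.75)   # upper edge of the box
iqr = q3 - q1                     # height of the box, i.e. the spread

# Points beyond 1.5 * IQR from the box are conventionally flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = math_scores[(math_scores < lower_fence) | (math_scores > upper_fence)].tolist()

print("median:", median, "| IQR:", iqr, "| outliers:", outliers)
```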



In [0]:
fig = px.box(df,
             x='ethnicity', 
             y='math',
             color = 'gender',
             template='plotly_white', 
             notched=True,
             category_orders={'ethnicity':["group A","group B","group C","group D","group E"]},
             facet_col = 'gender',
             title="Is there a specific ethnicity and gender that is better at math?")
fig.show()

Based on the analysis:
- Males of ethnicity group E excel in math.
- In the other groups, female and male students have similar medians, but the group E median is clearly higher than the other groups'.
- Females in group A have a much lower median.
- Looking at the maximum score (100): besides group E, only males in groups A and D reach 100, so it's hard to say group A is the worst at this subject.
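
The medians and maxima read off the plot can also be tabulated. The sketch below shows the call on the five rows from the `df.head()` output, so it covers only a few groups and its numbers are not the ones discussed above. `sample` and `math_summary` are names introduced for the example.

```python
import pandas as pd

# Five rows copied from the df.head() output; only a few groups are present.
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "ethnicity": ["group B", "group C", "group B", "group A", "group C"],
    "math": [72, 69, 90, 47, 76],
})

# Median and maximum math score for every ethnicity/gender combination.
math_summary = sample.groupby(["ethnicity", "gender"])["math"].agg(["median", "max"])
print(math_summary)
```

Running the same `groupby` on the full `df` should give one row per box in the faceted plot.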



### Reference 

- [plotly express](https://plot.ly/python/plotly-express/)
- [plotly express reference documentation](https://www.plotly.express/plotly_express/)