# Student Performance - Visualizations

### Import Packages

In [1]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.tools as tls
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import plot
!pip install -U kaleido
import kaleido



### Create Directory Structure

In [3]:
# Define Directories and Dataset
parent_dir = 'Assignment_1-2_Student-Performance-Visualizations'
data_raw_dir = 'data_raw'
data_clean_dir = 'data_clean'
results_dir = 'results'
source_dir = 'src'
dataset = 'StudentsPerformance'

# Create Directories
os.mkdir(f'./{parent_dir}')
os.mkdir(f'./{parent_dir}/{data_raw_dir}')
os.mkdir(f'./{parent_dir}/{data_clean_dir}')
os.mkdir(f'./{parent_dir}/{results_dir}')
os.mkdir(f'./{parent_dir}/{source_dir}')

## Data Collection

The dataset was provided in a CSV:

StudentsPerformance.csv

At this point, upload or copy StudentsPerformance.csv into:

In [4]:
print(f'./{parent_dir}/{data_raw_dir}')

./Assignment_1-2_Student-Performance-Visualizations/data_raw


Then, create a README

In [5]:
with open(f'./{parent_dir}/{data_raw_dir}/README.md', "w") as file: # Create a (mostly) empty README
    file.write("Raw Data Metadata")

**The README will need to be manually updated with the appropriate field data.**

### Import the Dataset

In [6]:
# Create the Dataframe
data_raw = pd.read_csv(f'./{parent_dir}/{data_raw_dir}/{dataset}.csv')
data_raw.head(15)

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
5,female,group B,associate's degree,standard,none,71,83,78
6,female,group B,some college,standard,completed,88,95,92
7,male,group B,some college,free/reduced,none,40,43,39
8,male,group D,high school,free/reduced,completed,64,64,67
9,female,group B,high school,free/reduced,none,38,60,50


## Data Processing/Cleaning

In [7]:
# Check for NaNs
data_raw.isna().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

Since there are no NaNs and the data doesn't appear to need to be scaled, let's instead just add an "average score" column

In [8]:
# Create a new dataframe
data_clean = data_raw

# Calculate row-wise average of the math, reading, and writing scores
data_clean['average score'] = data_clean[['math score', 'reading score', 'writing score']].mean(axis=1)

# Confirm there are no NaNs in the cleaned dataframe
data_clean.head(15)

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,average score
0,female,group B,bachelor's degree,standard,none,72,72,74,72.666667
1,female,group C,some college,standard,completed,69,90,88,82.333333
2,female,group B,master's degree,standard,none,90,95,93,92.666667
3,male,group A,associate's degree,free/reduced,none,47,57,44,49.333333
4,male,group C,some college,standard,none,76,78,75,76.333333
5,female,group B,associate's degree,standard,none,71,83,78,77.333333
6,female,group B,some college,standard,completed,88,95,92,91.666667
7,male,group B,some college,free/reduced,none,40,43,39,40.666667
8,male,group D,high school,free/reduced,completed,64,64,67,65.0
9,female,group B,high school,free/reduced,none,38,60,50,49.333333


In [9]:
# Save the new dataframe as a csv
data_clean.to_csv(f'./{parent_dir}/{data_clean_dir}/{dataset}_clean.csv')

## Data Analysis

Let's start by comparing how many of the students are in the various groups with some pie charts

In [26]:
# Define an array to be used while saving all of these
fig_list = {}

# Define the overall subplot figure
pie_fig = make_subplots(rows=1, cols=4, specs=[[{"type": "pie"}, {"type": "pie"}, {"type": "pie"}, {"type": "pie"}]], subplot_titles=('Percentage of Male/Female Samples','Percentage of Ethnicity Samples','Percentage of Parental Education Samples','Percentage of Lunch Samples'))

#Define the subplots

pie_figures = [px.pie(data_clean,values=data_clean.value_counts().values, names='gender',title='Percentage of Male/Female Samples'),
            px.pie(data_clean,values=data_clean.value_counts().values, names='race/ethnicity',title='Percentage of Ethnicity Samples'),
            px.pie(data_clean,values=data_clean.value_counts().values, names='parental level of education',title='Percentage of Parental Education Samples'),
            px.pie(data_clean,values=data_clean.value_counts().values, names='lunch',title='Percentage of Lunch Samples')]

# Convert the subplots to traces (because make_sublots works with go and not px, dang) and append the traces to the overall figure

for i, figure in enumerate(pie_figures):
    for trace in range(len(figure["data"])):
        pie_fig.append_trace(figure["data"][trace], row=1, col=1+i)

# So that we don't forget this later
fig_list.update({'pie_fig':pie_fig})

# Show the figure
pie_fig.show()

Regardless of the messy legend (an unfortunate drawback of Plotly's make_subplots), this lets us better see the class distributions, which can help us see if any of the classes are potentially imbalanced.

Now let's try vizualizing the data with the labels. Let's do a scatter plot, showing male/female in the ethnicity groups against the average score.

In [27]:
# Define the scatter figure
scatter_fig = px.scatter(data_clean,x='race/ethnicity', y='average score', color='gender',title='Male vs Female Average Scores of Ethnicity Groups')

# So that we don't forget this later
fig_list.update({'scatter_fig':scatter_fig})

# Show the scatter figure
scatter_fig.show()

You'll notice that this isn't *super* useful, as there are not that many groups but there are lots and lots of results, all very close together.  It does show us the overal distribution of scores, and that groups D and E maybe *very slightly* have higher scores, but it's honestly pretty inconclusize.

Next let's try a histogram, comparing the same values but averaging the average scores.

In [28]:
# Define the overall subplot figure
hist_fig = make_subplots(rows=1, cols=2, subplot_titles=('Average Score of Race/Ethnicity Samples','Average Score of Male/Female Samples'))

#Define the subplots

hist_figures = [px.histogram(data_clean,x=data_clean['race/ethnicity'].unique(), y=data_clean.groupby('race/ethnicity')['average score'].mean()),
            px.histogram(data_clean,x=data_clean['gender'].unique(), y=data_clean.groupby('gender')['average score'].mean())]

# Convert the subplots to traces (because make_sublots works with go and not px, dang) and append the traces to the overall figure

for i, figure in enumerate(hist_figures):
    for trace in range(len(figure["data"])):
        hist_fig.append_trace(figure["data"][trace], row=1, col=1+i)

# For some reason beyond my understanding, make_subplots doesn't work with plotly express with faceted, string labels, unless you do this:
hist_fig.update_traces(bingroup=None)

# So that we don't forget this later
fig_list.update({'hist_fig':hist_fig})

# Show the figure
hist_fig.show()

This shows us, much more concretely, that both samples in group E and female samples do have a very slightly higher average score.  But since this score is an average of an average, and is likely not significantly higher, let's plot some of the raw scores.

In [29]:
# Define the figures
math_hist_fig = px.histogram(data_clean, x="gender", y="math score", color='race/ethnicity', barmode='group', title="Average Math Scores",
             histfunc='avg') # I learned about histfunc after the above but I'm leaving the above alone just because

read_hist_fig = px.histogram(data_clean, x="gender", y="reading score", color='race/ethnicity', barmode='group', title="Average Reading Scores",
             histfunc='avg') # I learned about histfunc after the above but I'm leaving the above alone just because

write_hist_fig = px.histogram(data_clean, x="gender", y="writing score", color='race/ethnicity', barmode='group', title="Average Writing Scores",
             histfunc='avg') # I learned about histfunc after the above but I'm leaving the above alone just because

# So that we don't forget these later
fig_list.update({'math_hist_fig':math_hist_fig})
fig_list.update({'read_hist_fig':read_hist_fig})
fig_list.update({'write_hist_fig':write_hist_fig})

#Show the figures
math_hist_fig.show() # I'm sure I could figure out how to make this with make_subplots, but it's so funky
read_hist_fig.show()
write_hist_fig.show()

This shows us a much more detailed breakdown of the scores (again, averages, but only one level of average).  It shows that, slightly, generally female samples have higher reading and writing scores, but male samples have higher math scores.  It also shows us that the average scores of the ethnicity groups aren't quite as clear-cut as the average scores histogram may imply.

Now let's do one more!  It's probably time to involve the other categories, and parental level of education is much more interesting (to me).  Also we've done a lot of histograms, so let's do something different.

In [14]:
# First, let's get all the different parental levels
data_clean['parental level of education'].unique()

array(["bachelor's degree", 'some college', "master's degree",
       "associate's degree", 'high school', 'some high school'],
      dtype=object)

In [30]:
# Then, let's manually make an array ordering these in ascending order, because pandas/plotly aren't smart enough to know that "master's degree" is > "some high school"
parental_degrees = ['some high school', 'high school', 'some college', "associate's degree", "bachelor's degree", "master's degree"]

# Now that we have these in some order, we can do a line plot!
#Define the line figure
line_fig = px.line(data_clean, x=parental_degrees, y=data_clean.groupby('parental level of education')['average score'].mean(), title='Average Scores vs Parental Level of Education')

# So that we don't forget these later
fig_list.update({'line_fig':line_fig})

# Show the figure
line_fig.show()

While again keeping in mind that "average score" is being averaged (so an average of an average), this shows us something unexpected, at least to me.  I (silently) hypothesized that students whose parents had higher levels of education would have higher scores, but that's not necessarily the case!  Now, where this dataset comes from is not given in the assignment nor the dataset itself, so I do not know if this dataset shows college scores, high school scores, etc. This is important for reaching any conclustion.  For example, it's always possible that a parent that has had too high of an education is not able help with simpler concepts presented in high school, or something similar.

Unrelated: I lied.  One more!  Boxplot!

In [43]:
# Define the boxplot
box_fig = px.box(data_clean, x="test preparation course", y="average score")

# So that we don't forget these later
fig_list.update({'box_fig':box_fig})

# Show the boxplot
box_fig.show()

I did not need to do this, I just wanted to.  Boxplots are a significantly better way of obtaining an overview of a dataset, as they show the median/min/max/outliers/etc.  This one shows that students that have completed a test preparation course generally have higher scores, and higher minimums and outliers.

Finally, let's save all of these.

In [44]:
for figure in fig_list:
  fig_list[figure].write_image(f'./{parent_dir}/{results_dir}/{figure}.png')

### Cleanup

In [38]:
!zip -r './{parent_dir}.zip' './{parent_dir}' # Zip up the directory to be downloaded

  adding: Assignment_1-2_Student-Performance-Visualizations/ (stored 0%)
  adding: Assignment_1-2_Student-Performance-Visualizations/src/ (stored 0%)
  adding: Assignment_1-2_Student-Performance-Visualizations/data_raw/ (stored 0%)
  adding: Assignment_1-2_Student-Performance-Visualizations/data_raw/README.md (stored 0%)
  adding: Assignment_1-2_Student-Performance-Visualizations/data_raw/StudentsPerformance.csv (deflated 89%)
  adding: Assignment_1-2_Student-Performance-Visualizations/data_clean/ (stored 0%)
  adding: Assignment_1-2_Student-Performance-Visualizations/data_clean/StudentsPerformance_clean.csv (deflated 84%)
  adding: Assignment_1-2_Student-Performance-Visualizations/results/ (stored 0%)
  adding: Assignment_1-2_Student-Performance-Visualizations/results/hist_fig.png (deflated 32%)
  adding: Assignment_1-2_Student-Performance-Visualizations/results/line_fig.png (deflated 15%)
  adding: Assignment_1-2_Student-Performance-Visualizations/results/write_hist_fig.png (deflated

Finally, don't forget to save this notebook in:

In [37]:
print(f'./{parent_dir}/{source_dir}')

./Assignment_1-2_Student-Performance-Visualizations/src
