<a href="https://colab.research.google.com/github/melanieshimano/plotly-data-viz/blob/main/2020_11_20_2020_data_viz_plotlyexpress_bcpss_melanieshimano.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Visualizations with Plotly Express and BCPSS Enrollment Data

Here, we'll review different kinds of data visualizations that we can quickly create with Plotly Express, using Maryland public school enrollment data from [here](https://reportcard.msde.maryland.gov/Graphs/#/DataDownloads/datadownload/3/17/6/99/XXXX/2019). Since this data is only published one year at a time, I merged all of the available years of enrollment data (2012-2020), which you can view in [this notebook](https://github.com/jhu-business-analytics/plotly-data-viz/blob/main/11-20-2020-md-education-all-enrollment-melanieshimano.ipynb).
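
The merge of yearly files can be pictured with a minimal sketch like the one below. It is not the code from the linked notebook: the two small tables and their values are hypothetical stand-ins for the yearly downloads, and it only assumes each year arrives as a separate table that needs a `school_year` column before stacking.

```python
import pandas as pd

# Hypothetical stand-ins for two yearly downloads (the real files have more columns).
enroll_2019 = pd.DataFrame({"school_name": ["School A", "School B"],
                            "enrolled_count": [310, 455]})
enroll_2020 = pd.DataFrame({"school_name": ["School A", "School B"],
                            "enrolled_count": [325, 440]})

# Tag each table with its year, then stack everything into one long table.
yearly_tables = {2019: enroll_2019, 2020: enroll_2020}
combined = pd.concat(
    [table.assign(school_year=year) for year, table in yearly_tables.items()],
    ignore_index=True,
)
print(combined)
```

Keeping the year as a column (long format) is what lets Plotly Express use it directly as an x axis or a color category later on.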

## import packages

In [1]:
# data analysis
import pandas as pd

# data viz
import plotly.express as px
from plotly.subplots import make_subplots # to make subplots
import plotly.figure_factory as ff # to make density plots

# exporting files
from google.colab import files

## import data

In [2]:
# import bcpss enrollment data (2012-2020)

df = pd.read_csv("https://raw.githubusercontent.com/jhu-business-analytics/plotly-data-viz/main/md_enrollment_2012_2020.csv")

In [3]:
# preview data
df.head()

Unnamed: 0,school_year,lss_num,lss_name,school_num,school_name,grade,enrolled_count,grade_level
0,2020,1,Allegany,301,Flintstone Elementary,Prekindergarten,19.0,Elementary
1,2020,1,Allegany,301,Flintstone Elementary,Kindergarten,28.0,Elementary
2,2020,1,Allegany,301,Flintstone Elementary,Grade 1,31.0,Elementary
3,2020,1,Allegany,301,Flintstone Elementary,Grade 2,40.0,Elementary
4,2020,1,Allegany,301,Flintstone Elementary,Grade 3,30.0,Elementary


In [4]:
# check what's in our dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88687 entries, 0 to 88686
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   school_year     88687 non-null  int64  
 1   lss_num         88687 non-null  object 
 2   lss_name        88687 non-null  object 
 3   school_num      88687 non-null  object 
 4   school_name     88687 non-null  object 
 5   grade           88687 non-null  object 
 6   enrolled_count  85397 non-null  float64
 7   grade_level     88403 non-null  object 
dtypes: float64(1), int64(1), object(6)
memory usage: 5.4+ MB


## General Plotly Express Formatting

Plotly Express allows us to create really nice, interactive visualizations without much code, and each kind of visual follows a general convention. We won't necessarily use all of the parameters available for a plotly express visualization in this course, but you can read more about how you can customize your visualizations [here](https://plotly.com/python-api-reference/generated/plotly.express.html#module-plotly.express).

The general convention and parameters that we'll focus on are: 

```
chart_name = px.bar(df, # dataframe of the data we want to plot
                    x = "column on the x axis",
                    y = "column on the y axis",
                    color = "column used to categorize data with different colors",
                    hover_name = "column whose values appear in bold in the hover tooltip",
                    title = "chart title",
                    labels = {"column_name_1": "new label", "column_name_2": "new label"}, # renaming labels
                    orientation = "h", # swap the x, y values to make a horizontal bar chart
                    facet_col = "column that identifies how you want to separate columns of subplots",
                    facet_row = "column that identifies how you want to separate rows of subplots",
                    animation_frame = "column name that identifies base data for each animation frame",
                    animation_group = "column name that identifies which data in the graph is changing with each animation frame",
                    range_x = [x_min, x_max], # lowest and highest x values to display
                    range_y = [y_min, y_max], # lowest and highest y values to display
                    log_x = False # default; set to True to put the x axis on a log scale--useful to compare exponential growth
                    )
```
We won't use all of these for all of our graphs, but these are general parameters that might be useful for your work.

# Line Graphs

We'll start by making a line graph to visualize the total enrollment in all MD schools from 2012-2020.

## Edit Data

In [5]:
# create new df with only data from all schools and the grade is "total enrollment"
df_all = df[(df["lss_num"] == "A") & (df["grade"] == "Total Enrollment")]

In [6]:
# sort values by the school year
df_all = df_all.sort_values(by = "school_year")

In [7]:
# look at the data
df_all

Unnamed: 0,school_year,lss_num,lss_name,school_num,school_name,grade,enrolled_count,grade_level
69101,2012,A,All Public Schools,A,All Maryland Schools,Total Enrollment,854086.0,
59216,2013,A,All Public Schools,A,All Maryland Schools,Total Enrollment,859638.0,
49337,2014,A,All Public Schools,A,All Maryland Schools,Total Enrollment,866169.0,
39461,2015,A,All Public Schools,A,All Maryland Schools,Total Enrollment,874514.0,
19660,2016,A,All Public Schools,A,All Maryland Schools,Total Enrollment,879601.0,
29487,2017,A,All Public Schools,A,All Maryland Schools,Total Enrollment,886221.0,
88686,2018,A,All Public Schools,A,All Maryland Schools,Total Enrollment,893689.0,
78882,2019,A,All Public Schools,A,All Maryland Schools,Total Enrollment,896837.0,
9767,2020,A,All Public Schools,A,All Maryland Schools,Total Enrollment,909414.0,


## make a line graph with all school data

In [8]:
# make a line graph
line_all_md = px.line(df_all,
                      x = "school_year",
                      y = "enrolled_count",
                      title = "Total Enrollment in Maryland Public Schools, 2012-2020",
                      labels = {"school_year": "School Year", "enrolled_count": "Number of Students Enrolled"})

In [9]:
# view viz
line_all_md

One thing to keep in mind is that some schools didn't report a total enrollment, which might affect our results. We can find these schools by filtering for rows where the "Total Enrollment" count is null.

In [10]:
# create a df to filter for schools who don't report total enrollment 
df_all_null = df[(df["enrolled_count"].isnull()) & (df["grade"] == "Total Enrollment")]

In [11]:
# look at this data
df_all_null

Unnamed: 0,school_year,lss_num,lss_name,school_num,school_name,grade,enrolled_count,grade_level
990,2020,3,Baltimore County,51,Northwest EDLP at Milford Mill Academy,Total Enrollment,,High
1004,2020,3,Baltimore County,56,Central EDLP at Loch Raven High School,Total Enrollment,,High
1008,2020,3,Baltimore County,57,Home Assignments-Elementary,Total Enrollment,,Middle
2323,2020,4,Calvert,500,Calvert County Alternative School,Total Enrollment,,High
2594,2020,6,Carroll,718,PRIDE School,Total Enrollment,,Elementary
2597,2020,6,Carroll,726,Crossroads Middle School,Total Enrollment,,Middle
7773,2020,19,Somerset,1003,Ewell School,Total Enrollment,,Middle
12056,2016,4,Calvert,500,Calvert County Alternative School,Total Enrollment,,High
15067,2016,15,Montgomery,524,Gateway to College Program,Total Enrollment,,High
17526,2016,19,Somerset,1003,Ewell School,Total Enrollment,,Middle
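
To gauge how much the missing totals matter, it can help to count them per year rather than only list them. The sketch below does this on a hypothetical toy table (the values are invented); the same pattern should carry over to `df` under the assumption that its columns are named as shown above.

```python
import pandas as pd

# Hypothetical toy data with some missing totals.
toy = pd.DataFrame({
    "school_year": [2016, 2016, 2020, 2020, 2020],
    "grade": ["Total Enrollment"] * 5,
    "enrolled_count": [410.0, None, 388.0, None, None],
})

totals = toy[toy["grade"] == "Total Enrollment"]

# For each year: how many totals are missing, and what share of rows that is.
missing_by_year = (
    totals["enrolled_count"].isna()
    .groupby(totals["school_year"])
    .agg(["sum", "mean"])
    .rename(columns={"sum": "missing_rows", "mean": "missing_share"})
)
print(missing_by_year)
```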


## Make a line graph with different categories

In [12]:
# make a df with total enrollment for each district
# all "school_name" that start with "All"
df_all_district = df[(df["school_name"].str.startswith("All ")) & (df["lss_num"] != "A") & (df["grade"] == "Total Enrollment")]

In [13]:
# sort values by the school year
df_all_district = df_all_district.sort_values(by = "school_year")

In [14]:
# use the district name (lss_name) as the color category to draw one line per district
# make a line graph
line_all_dist = px.line(df_all_district,
                      x = "school_year",
                      y = "enrolled_count",
                      title = "Total Enrollment in Maryland Public Schools, 2012-2020",
                      labels = {"school_year": "School Year", "enrolled_count": "Number of Students Enrolled", "lss_name": "School District"},
                      color = "lss_name")

In [15]:
# view viz
line_all_dist

## Make line graph with two y-axes

In [16]:
# create a new df of only Baltimore City Schools
# (.copy() lets us add columns later without a SettingWithCopyWarning)
df_bmore = df[(df["lss_name"] == "Baltimore City") & (df["grade"] == "Total Enrollment") & (df["school_num"] != "A")].copy()

In [17]:
# preview
df_bmore.head()

Unnamed: 0,school_year,lss_num,lss_name,school_num,school_name,grade,enrolled_count,grade_level
8379,2020,30,Baltimore City,4,Steuart Hill Academic Academy,Total Enrollment,250.0,Elementary
8387,2020,30,Baltimore City,7,Cecil Elementary,Total Enrollment,391.0,Elementary
8398,2020,30,Baltimore City,8,City Springs Elementary/Middle,Total Enrollment,703.0,Middle
8409,2020,30,Baltimore City,10,James McHenry Elementary/Middle,Total Enrollment,625.0,Middle
8417,2020,30,Baltimore City,11,Eutaw-Marshburn Elementary,Total Enrollment,271.0,Elementary


In [18]:
# add column for average enrollment for each year
df_bmore["avg_enrolled_count"] = df_bmore.groupby("school_year")["enrolled_count"].transform("mean")

In [19]:
# make aggregated table for averages of enrolled counts per year for each grade level
# (.mean() keeps single-level column names, which plotly express expects)
df_bmore_grade = df_bmore.groupby(["school_year","grade_level"])[["enrolled_count", "avg_enrolled_count"]].mean().reset_index()

In [20]:
# preview new df
df_bmore_grade.head(10)

Unnamed: 0,school_year,grade_level,enrolled_count,avg_enrolled_count
0,2012,Elementary,350.636364,431.85641
1,2012,High,534.1,431.85641
2,2012,Middle,424.688889,431.85641
3,2013,Elementary,369.473684,436.840206
4,2013,High,523.666667,436.840206
5,2013,Middle,430.0,436.840206
6,2014,Elementary,357.722222,448.306878
7,2014,High,534.18,448.306878
8,2014,Middle,455.341176,448.306878
9,2015,Elementary,344.277778,466.901099
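
The difference between `transform("mean")` and a regular grouped mean is easy to miss, so here is a small sketch on hypothetical toy data. A grouped mean collapses each year to one row, while `transform` repeats the group's mean on every original row, which is why `avg_enrolled_count` has the same value for every grade level within a year.

```python
import pandas as pd

# Hypothetical toy data: two schools per year.
toy = pd.DataFrame({
    "school_year": [2019, 2019, 2020, 2020],
    "enrolled_count": [200.0, 400.0, 300.0, 500.0],
})

# Grouped mean: one value per year.
per_year = toy.groupby("school_year")["enrolled_count"].mean()

# transform: same length as the original table, mean repeated within each year.
toy["avg_enrolled_count"] = (
    toy.groupby("school_year")["enrolled_count"].transform("mean")
)
print(per_year)
print(toy)
```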


In [21]:
# make line graphs of the elementary, middle, high school enrolled counts from 2012-2020
line_bmore = px.line(df_bmore_grade,
                      x = "school_year",
                      y = "enrolled_count",
                      title = "Average School Enrollment in Baltimore City Public Schools by Grade Level, 2012-2020",
                      labels = {"school_year": "School Year", "enrolled_count": "Average Number of Students Enrolled per School"},
                      color = "grade_level")

In [22]:
# show graph
line_bmore

In [23]:
# aggregate average enrolled count for each school year
df_bmore_total_agg = df_bmore.groupby("school_year")["enrolled_count"].agg(["mean"]).reset_index()

In [24]:
df_bmore_total_agg

Unnamed: 0,school_year,mean
0,2012,431.85641
1,2013,436.840206
2,2014,448.306878
3,2015,466.901099
4,2016,464.811111
5,2017,473.247126
6,2018,471.292398
7,2019,477.692771
8,2020,488.808642


In [25]:
# make list of x values for bar chart
total_enrolled_years = df_bmore_total_agg["school_year"].tolist()

In [26]:
# make list of y values for bar chart
total_enrolled_list = df_bmore_total_agg["mean"].tolist()

In [27]:
# add in bar graph of total average enrollment
line_bmore.add_bar(x = total_enrolled_years, y = total_enrolled_list, name = "Total Average Enrollment")

### Exporting Interactive Graphs (as HTML files)

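`write_html` itself normally works without any extra tools. If the html export doesn't work for you, or if you also want to export static images, you might need to download orca to do this (you'll only need to do this once!): 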

```
# install orca to download plotly html/images
!wget https://github.com/plotly/orca/releases/download/v1.2.1/orca-1.2.1-x86_64.AppImage -O /usr/local/bin/orca
!chmod +x /usr/local/bin/orca
!apt-get install xvfb libgtk2.0-0 libgconf-2-4
```

In [28]:
# export graph as an html file
line_bmore.write_html("bmore_enrollment.html")

In [29]:
# download files from google colab
files.download("bmore_enrollment.html")


## Adding two different y-axes on the same chart

This kind of chart is a little bit more complicated, but essentially what we'll do here is create two different charts and then overlay them on each other.

In [30]:
# line chart of average enrollment by grade level (kept separate from line_bmore above)
only_lines_bmore = px.line(df_bmore_grade,
                      x = "school_year",
                      y = "enrolled_count",
                      title = "Total Enrollment in Baltimore City Public Schools, 2012-2020",
                      labels = {"school_year": "School Year", "enrolled_count": "Number of Students Enrolled"},
                      color = "grade_level")

In [31]:
only_bar_bmore = px.bar(df_bmore_total_agg,
                      x = "school_year",
                      y = "mean",
                      title = "Total Enrollment in Baltimore City Public Schools, 2012-2020",
                      labels = {"school_year": "School Year", "mean": "Average Total Enrollment"},
                      color_discrete_sequence = ["pink"])

In [32]:
only_lines_bmore.update_traces(yaxis="y2")

In [33]:
combo_fig = make_subplots(specs=[[{"secondary_y": True}]])

In [34]:
combo_fig.add_traces(only_bar_bmore.data + only_lines_bmore.data)

In [35]:
# add title
combo_fig.layout.title="Average Enrollment in Baltimore City Public Schools 2012-2020"

In [36]:
# add x-axis title
combo_fig.layout.xaxis.title = "School Year"

In [37]:
# add left y-axis title (the bars use the primary axis)
combo_fig.layout.yaxis.title = "Total Average Enrollment"

In [38]:
# add right y-axis title (the lines were moved to the secondary axis, y2)
combo_fig.layout.yaxis2.title = "Average Enrollment by Grade Level"

In [39]:
combo_fig

# Histogram

In [40]:
# bring back df of only total enrollments
df_all_district.head(10)

Unnamed: 0,school_year,lss_num,lss_name,school_num,school_name,grade,enrolled_count,grade_level
67508,2012,23,Worcester,A,All Worcester Schools,Total Enrollment,6643.0,
69087,2012,30,Baltimore City,A,All Baltimore City Schools,Total Enrollment,84212.0,
69095,2012,32,SEED,A,All SEED Schools,Total Enrollment,308.0,
59393,2012,1,Allegany,A,All Allegany Schools,Total Enrollment,8913.0,
60196,2012,2,Anne Arundel,A,All Anne Arundel Schools,Total Enrollment,76303.0,
61351,2012,3,Baltimore County,A,All Baltimore County Schools,Total Enrollment,105153.0,
61516,2012,4,Calvert,A,All Calvert Schools,Total Enrollment,16553.0,
61580,2012,5,Caroline,A,All Caroline Schools,Total Enrollment,5545.0,
61867,2012,6,Carroll,A,All Carroll Schools,Total Enrollment,27082.0,
62054,2012,7,Cecil,A,All Cecil Schools,Total Enrollment,15827.0,


In [41]:
# make histogram of enrollment counts
hist_dist_yr = px.histogram(df_all_district, 
                            x = "enrolled_count", 
                            color = "school_year", 
                            barmode = "overlay", 
                            labels = {"enrolled_count": "Number of Enrolled Students"})

In [42]:
hist_dist_yr

We can also use Plotly's figure factory to create a density plot, with or without histograms.

In [43]:
# "unmelt" the data so that we have the years in different columns using "pivot"
df_unmelt_yr = df_all_district.pivot(index = "school_name", columns = "school_year", values = "enrolled_count").reset_index()


In [44]:
df_unmelt_yr

school_year,school_name,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,All Allegany Schools,8913.0,8929.0,8872.0,8865.0,8812.0,8702.0,8629.0,8539.0,8437.0
1,All Anne Arundel Schools,76303.0,77770.0,78489.0,79518.0,80387.0,81379.0,82777.0,83300.0,84984.0
2,All Baltimore City Schools,84212.0,84747.0,84730.0,84976.0,83666.0,82354.0,80591.0,79297.0,79187.0
3,All Baltimore County Schools,105153.0,106927.0,108191.0,109830.0,111138.0,112139.0,113282.0,113814.0,115038.0
4,All Calvert Schools,16553.0,16323.0,16221.0,16031.0,16017.0,15950.0,15908.0,15936.0,16022.0
5,All Caroline Schools,5545.0,5585.0,5545.0,5592.0,5602.0,5705.0,5787.0,5829.0,5874.0
6,All Carroll Schools,27082.0,26687.0,26331.0,25879.0,25551.0,25255.0,25290.0,25179.0,25345.0
7,All Cecil Schools,15827.0,15634.0,15824.0,15681.0,15859.0,15633.0,15364.0,15307.0,15256.0
8,All Charles Schools,26778.0,26644.0,26455.0,26258.0,26307.0,26390.0,26891.0,27108.0,27521.0
9,All Dorchester Schools,4647.0,4718.0,4766.0,4796.0,4739.0,4816.0,4767.0,4785.0,4710.0
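
For readers new to `pivot`, the reshaping step is sketched below on hypothetical toy data. Each unique `school_year` becomes its own column, and each `school_name` becomes one row. Note that `pivot` raises an error if the same name and year pair appears more than once.

```python
import pandas as pd

# Hypothetical toy data in long format: one row per district per year.
long_df = pd.DataFrame({
    "school_name": ["All X Schools", "All X Schools",
                    "All Y Schools", "All Y Schools"],
    "school_year": [2019, 2020, 2019, 2020],
    "enrolled_count": [100.0, 110.0, 200.0, 190.0],
})

# Wide format: one row per district, one column per year.
wide_df = long_df.pivot(index="school_name",
                        columns="school_year",
                        values="enrolled_count").reset_index()

# The year columns keep their integer names (2019, 2020), not strings.
year_columns = [col for col in wide_df.columns if col != "school_name"]
print(wide_df)
print(year_columns)
```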


In [45]:
# create a list of the values you want as the density curve lines in the graph (one list per school year)
hist_data = [df_unmelt_yr[year].tolist() for year in range(2012, 2021)]
# create a list of the values you want to label the data "lines" (and histograms) in the graph
group_labels = df_all_district["school_year"].unique().tolist()


In [46]:
# define the density plot
distplot = ff.create_distplot(hist_data, # use the histogram data we defined above
                              group_labels, # use the group labels we defined above
                              show_rug= False, # this shows the distribution of the data in a "rug" below, if you change to True, you can see what this looks like
                              show_hist= False) # this means that we are hiding the histogram used to make the graph, if you change to True, you can see the histograms with this also


In [47]:
distplot

In [48]:
distplot.update_layout(title_text="Distribution of Enrolled Count in Maryland Public Schools") # update the figure to add a title
distplot.update_xaxes(title = "Enrolled Count") # update the figure to add an x-axis label
distplot.update_yaxes(title = "Density") # update the figure to add a y axis label

# Pie Chart

In [49]:
# filter data to include all grade levels but not "all schools"

df_grades = df[(df["grade"] == "Total Enrollment") & (df["grade_level"].notna())]


In [50]:
# preview data
df_grades.head()

Unnamed: 0,school_year,lss_num,lss_name,school_num,school_name,grade,enrolled_count,grade_level
7,2020,1,Allegany,301,Flintstone Elementary,Total Enrollment,221.0,Elementary
15,2020,1,Allegany,401,South Penn Elementary,Total Enrollment,571.0,Elementary
23,2020,1,Allegany,402,John Humbird Elementary,Total Enrollment,279.0,Elementary
28,2020,1,Allegany,405,Fort Hill High,Total Enrollment,672.0,High
32,2020,1,Allegany,406,Washington Middle,Total Enrollment,612.0,Middle


In [51]:
# look at only 2020 data
df_grades_2020 = df_grades[df_grades["school_year"] == 2020]

In [52]:
# make pie chart to show distribution of grade type in 2020
pie_2020 = px.pie(df_grades_2020, 
                  names = "grade_level", 
                  values = "enrolled_count",
                  title = "Distribution of Students in Maryland Public Schools in 2020")

In [53]:
pie_2020
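
When several rows share the same `names` value, the pie chart sums their `values` into one slice. The percentages it shows can be double-checked with a quick grouped sum, sketched here on hypothetical toy data rather than the real 2020 table.

```python
import pandas as pd

# Hypothetical toy data: several schools per grade level.
toy_2020 = pd.DataFrame({
    "grade_level": ["Elementary", "Elementary", "Middle", "High"],
    "enrolled_count": [300.0, 200.0, 250.0, 250.0],
})

# Total students per grade level, then each level's share of all students.
totals = toy_2020.groupby("grade_level")["enrolled_count"].sum()
shares = (totals / totals.sum() * 100).round(1)
print(shares)
```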