# Notebook #3: Visualizations Part 1 (Scatter, Box, Violin, & Bar Plots/Charts)

In this notebook, we create scatter, box, violin, and bar plots/charts using the Plotly visualization library. We expect these visualizations to help us explore the relationships between the variables in our dataset and to inform our project goals and research questions (i.e., demographics, student status, funding).

In [1]:
# !pip install plotly
# !pip install openpyxl
# !pip install statsmodels
# !pip install --upgrade pip

## 1.0. Import Libraries

In [2]:
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
import statsmodels.api as sm
from plotly.graph_objs import Figure
from plotly.subplots import make_subplots
from typing import Dict 

<hr>

## 2.0. Import the Cleaned 'district_and_expenses' CSV Data

In [3]:
# Import and store the cleaned DataFrame from the 'district_and_expenses.csv' file
district_and_expenses = pd.read_csv('district_and_expenses.csv')

# Display the imported 'district_and_expenses' DataFrame
display(district_and_expenses)

Unnamed: 0,Fed ID,District Code,CDS Code,County Name,District Type,Grade Low,Grade High,Grade Low Census,Grade High Census,Assistance Status,...,Students with Disabilities (%),Socioeconomically Disadvantaged,Socioeconomically Disadvantaged (%),District Label,District Name,EDP 365,Expense ADA,Expense per ADA,LEA Type,Decimal Difference
0,601770.0,61119,1.611190e+12,Alameda,Unified,KG,12,KG,12,General Assistance,...,12.200000,4035.0,38.200000,Alameda Unified (Alameda),Alameda Unified,1.550948e+08,8567.86,18101.93,Unified,0.232163
1,601860.0,61127,1.611270e+12,Alameda,Unified,KG,12,KG,12,General Assistance,...,9.000000,1122.0,31.400000,Albany City Unified (Alameda),Albany City Unified,6.149090e+07,3435.41,17899.14,Unified,0.040342
2,604740.0,61143,1.611430e+12,Alameda,Unified,KG,12,KG,12,General Assistance,...,12.000000,2508.0,27.600000,Berkeley Unified (Alameda),Berkeley Unified,2.205508e+08,8572.17,25728.70,Unified,0.058892
3,607800.0,61150,1.611500e+12,Alameda,Unified,KG,12,KG,12,General Assistance,...,11.000000,3686.0,38.800000,Castro Valley Unified (Alameda),Castro Valley Unified,1.424913e+08,8991.52,15847.30,Unified,0.055328
4,612630.0,61168,1.611680e+12,Alameda,Unified,KG,12,KG,12,General Assistance,...,12.500000,327.0,54.500000,Emery Unified (Alameda),Emery Unified,1.586300e+07,554.70,28597.44,Unified,0.081666
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
927,,76349,,Mendocino,Elementary,KG,12,KG,8,General Assistance,...,14.883721,243.0,56.511628,Arena Union Elementary/Point Arena Joint Union...,Arena Union Elementary/Point Arena Joint Union...,1.016266e+07,325.53,31218.80,Comm Admin,0.320923
928,,40261,,Santa Cruz,Elementary,KG,5,KG,5,General Assistance,...,14.636480,2304.0,36.734694,Santa Cruz City Elementary/High (Santa Cruz),Santa Cruz City Elementary/High,1.152800e+08,5688.18,20266.58,Comm Admin,0.102637
929,,40246,,Sonoma,Elementary,KG,12,KG,6,,...,17.717921,3326.0,45.018950,Petaluma City Elementary/Joint Union High (Son...,Petaluma City Elementary/Joint Union High,1.252075e+08,6651.17,18824.88,Comm Admin,0.110782
930,,40253,,Sonoma,Elementary,KG,8,KG,6,,...,17.340181,7541.0,50.959589,Santa Rosa City Schools (Sonoma),Santa Rosa City Schools,2.486762e+08,11701.14,21252.30,Comm Admin,0.264663


<hr>

## 3.0. Preparing the Data for the Visualizations

### 3.1. Storing the Demographic Percentage Columns and the Demographic Names as Global Variables

In [4]:
# Create a global variable list to store the DataFrame's demographic percent column names
DEMOGRAPHICS = [
    'African American (%)', 
    'American Indian (%)', 
    'Asian (%)',
    'Filipino (%)',
    'Hispanic (%)', 
    'Pacific Islander (%)', 
    'White (%)', 
    'Two or More Races (%)',
    'English Learner (%)', 
    'Foster (%)', 
    'Homeless (%)', 
    'Migrant (%)', 
    'Students with Disabilities (%)', 
    'Socioeconomically Disadvantaged (%)'
]

# Create a global variable list to store the demographic names without ' (%)' at the end using the 'DEMOGRAPHICS' 
# values 
DEMOS = [demo.removesuffix(' (%)') for demo in DEMOGRAPHICS]
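As a small aside on the suffix handling above, the sketch below contrasts `str.removesuffix` (available from Python 3.9) with fixed-width slicing. The `labels` list is an illustrative stand-in rather than the notebook's data; `'Locale'` is included only to show what happens when a name does not carry the suffix.

```python
# Illustrative labels (assumption); 'Locale' deliberately has no ' (%)' suffix
labels = ['Asian (%)', 'Foster (%)', 'Locale']

# removesuffix strips the suffix only when it is actually present
stripped = [label.removesuffix(' (%)') for label in labels]

# Fixed-width slicing always drops the last four characters
sliced = [label[:-4] for label in labels]

print(stripped)
print(sliced)
```

Both approaches agree whenever every name ends in `' (%)'`, as in `DEMOGRAPHICS`; they differ only for names without the suffix.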

### 3.2. Adding a 'Funding' Column to the district_and_expenses DataFrame

By adding a `'Funding'` category based on the median value of `'Expense per ADA'` (per-pupil spending) in the `district_and_expenses` DataFrame, we can label each district as `'Well-funded'` (above the median) or `'Underfunded'` (at or below the median). These labels are relative to the other districts in the dataset rather than a measure of funding adequacy. This will allow us to explore the relationship between `'Funding'` and other variables (such as pupil demographic percentages) in the DataFrame for our scatter, box, and bar plots/charts.

In [5]:
# Create a variable to hold the median value of 'Expense per ADA' to be used to determine the funding 
# category of a district
threshold = district_and_expenses['Expense per ADA'].median()

# Create a column in the DataFrame that will keep track of a district's funding category using the
# median value of 'Expense per ADA' as the threshold
district_and_expenses['Funding'] = district_and_expenses['Expense per ADA'].apply(
    lambda x: 'Well-funded' if x > threshold else 'Underfunded'
)

# Display school names and new 'Funding' column of the updated DataFrame
display(district_and_expenses[['District Label', 'Funding']].head())

# Display the value counts of the 'Funding' column
print(district_and_expenses['Funding'].value_counts())

Unnamed: 0,District Label,Funding
0,Alameda Unified (Alameda),Underfunded
1,Albany City Unified (Alameda),Underfunded
2,Berkeley Unified (Alameda),Well-funded
3,Castro Valley Unified (Alameda),Underfunded
4,Emery Unified (Alameda),Well-funded


Funding
Underfunded    466
Well-funded    466
Name: count, dtype: int64
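The median split above uses a strict `>` comparison, so it may be worth seeing how ties and missing values behave. The sketch below uses a made-up `spending` Series (an assumption, not the notebook's data). The even 466/466 split in the output above suggests that neither edge case affects this dataset, but the behavior matters if the input changes.

```python
import numpy as np
import pandas as pd

# Toy spending values (assumption): two values tie at the median, one is missing
spending = pd.Series([10.0, 20.0, 20.0, 30.0, np.nan], name='Expense per ADA')

# median() skips NaN by default
threshold = spending.median()

# Same rule as the notebook: strictly above the median is 'Well-funded'
funding = spending.apply(lambda x: 'Well-funded' if x > threshold else 'Underfunded')

# Variant that leaves missing spending unlabeled instead of 'Underfunded'
funding_keep_nan = pd.Series(
    np.where(spending > threshold, 'Well-funded', 'Underfunded'),
    index=spending.index,
).where(spending.notna())

print(funding.tolist())
print(funding_keep_nan.tolist())
```

With the notebook's rule, values equal to the median and missing values both fall into `'Underfunded'`, because `x > threshold` is `False` in each case.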


### 3.3. Adding 'Most Represented Race/Ethnicity Demographic' and 'Least Represented Race/Ethnicity Demographic' Columns to a Copied district_and_expenses DataFrame

By adding these two columns, we can later create violin plots to visualize the distribution of per-pupil spending (`'Expense per ADA'` values) for schools where a specific pupil race/ethnicity demographic is the most or least represented.


#### 3.3.1. Adding the `'Most Represented Race/Ethnicity Demographic'` and `'Least Represented Race/Ethnicity Demographic'` Columns

We recognize that a school can have more than one most or least represented pupil racial/ethnic demographic, because several racial/ethnic demographic percent columns can share the same max or min value for a school (e.g., a school with 0% for both `'White'` and `'Hispanic'` students). As such, we need to account for ties when adding the `'Most Represented Race/Ethnicity Demographic'` and `'Least Represented Race/Ethnicity Demographic'` columns to the `district_and_expenses` DataFrame.

We chose to create the two new columns with a list of every demographic that is the most or least represented pupil racial/ethnic demographic for every school.
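The tie-handling idea described above can be sketched on a toy frame. The districts and percentages below are invented for illustration, and the boolean-mask approach is one possible way to collect ties, not necessarily how the notebook implements it.

```python
import pandas as pd

# Toy percentages (assumption): District A has a two-way tie for the maximum
toy = pd.DataFrame(
    {
        'Hispanic (%)': [40.0, 50.0],
        'White (%)': [40.0, 30.0],
        'Asian (%)': [20.0, 20.0],
    },
    index=['District A', 'District B'],
)

# Row-wise maximum, then a boolean mask marking every column that equals it
row_max = toy.max(axis=1)
is_max = toy.eq(row_max, axis=0)

# Collect the tied column names per row, without the ' (%)' suffix
most_represented = pd.Series(
    [
        [column.removesuffix(' (%)') for column in toy.columns[flags]]
        for flags in is_max.to_numpy()
    ],
    index=toy.index,
)

print(most_represented)
```

Exact equality is used to detect ties, so percentages that differ only by floating-point rounding would not be treated as tied.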

In [6]:
# Store the max and min values of the racial/ethnic demographic percent columns for each school in two variables
demo_maxs = district_and_expenses[DEMOGRAPHICS[:8]].max(axis=1)
demo_mins = district_and_expenses[DEMOGRAPHICS[:8]].min(axis=1)

# Create a function that will assign the most/least represented racial/ethnic demographics to a school
# based on the max/min values of the racial/ethnic demographic percent columns for a school
def assign_demo_representation(row, demo_values):
    row = row[DEMOGRAPHICS[:8]]
    return [
        column.removesuffix(' (%)')
        for column in list(row[row == demo_values[row.name]].index)
    ]

# Add two columns to the DataFrame to store the most and least represented racial/ethnic demographic 
# for each school using the 'assign_demo_representation' function
district_and_expenses['Most Represented Race/Ethnicity Demographic'] = district_and_expenses.apply(
    lambda row: assign_demo_representation(row, demo_maxs), axis=1
)
district_and_expenses['Least Represented Race/Ethnicity Demographic'] = district_and_expenses.apply(
    lambda row: assign_demo_representation(row, demo_mins), axis=1
)

# Display school names, demographic percent columns and the two new columns of the updated DataFrame to verify 
# that the function worked as intended
display(
    district_and_expenses[
        ['District Label'] +
        DEMOGRAPHICS[:8] +
        ['Most Represented Race/Ethnicity Demographic', 'Least Represented Race/Ethnicity Demographic']
    ]
)

# Display the value counts of the 'Most Represented Race/Ethnicity Demographic' and 'Least 
# Represented Race/Ethnicity Demographic' columns
print(district_and_expenses['Most Represented Race/Ethnicity Demographic'].value_counts())
print(district_and_expenses['Least Represented Race/Ethnicity Demographic'].value_counts())

Unnamed: 0,District Label,African American (%),American Indian (%),Asian (%),Filipino (%),Hispanic (%),Pacific Islander (%),White (%),Two or More Races (%),Most Represented Race/Ethnicity Demographic,Least Represented Race/Ethnicity Demographic
0,Alameda Unified (Alameda),7.100000,0.200000,24.700000,4.600000,17.800000,0.500000,26.900000,15.100000,[White],[American Indian]
1,Albany City Unified (Alameda),4.300000,0.300000,31.600000,1.400000,17.200000,0.100000,25.800000,13.500000,[Asian],[Pacific Islander]
2,Berkeley Unified (Alameda),11.600000,0.200000,8.100000,0.800000,22.500000,0.200000,41.100000,15.300000,[White],"[American Indian, Pacific Islander]"
3,Castro Valley Unified (Alameda),4.300000,0.100000,32.900000,4.500000,24.600000,0.400000,19.600000,10.300000,[Asian],[American Indian]
4,Emery Unified (Alameda),44.200000,0.000000,9.300000,0.800000,21.000000,0.700000,12.700000,9.300000,[African American],[American Indian]
...,...,...,...,...,...,...,...,...,...,...,...
927,Arena Union Elementary/Point Arena Joint Union...,0.000000,6.511628,0.232558,0.000000,56.976744,0.232558,31.860465,2.790698,[Hispanic],"[African American, Filipino]"
928,Santa Cruz City Elementary/High (Santa Cruz),1.147959,0.175383,2.248087,0.462372,40.449617,0.207270,47.241709,6.521046,[White],[American Indian]
929,Petaluma City Elementary/Joint Union High (Son...,1.109908,0.649702,2.395777,0.609096,35.097455,0.378993,53.099621,6.348132,[White],[Pacific Islander]
930,Santa Rosa City Schools (Sonoma),1.790783,0.601433,3.959995,0.979862,58.852548,0.783890,27.794297,5.230437,[Hispanic],[American Indian]


Most Represented Race/Ethnicity Demographic
[Hispanic]            515
[White]               371
[Asian]                33
[American Indian]       8
[Hispanic, White]       4
[African American]      1
Name: count, dtype: int64
Least Represented Race/Ethnicity Demographic
[Pacific Islander]                                                                                     246
[American Indian]                                                                                      213
[American Indian, Pacific Islander]                                                                     80
[Filipino]                                                                                              62
[Filipino, Pacific Islander]                                                                            51
[African American, American Indian, Asian, Filipino, Pacific Islander]                                  26
[African American, Filipino, Pacific Islander]                                         

#### 3.3.2. Exploding the `'Most Represented Race/Ethnicity Demographic'` and `'Least Represented Race/Ethnicity Demographic'` Columns

To create the violin plots, we need to explode the `'Most Represented Race/Ethnicity Demographic'` and `'Least Represented Race/Ethnicity Demographic'` columns so that each demographic that is the most or least represented pupil racial/ethnic demographic for a school gets its own row. We store the exploded DataFrames in new variables (`violin_most_data` and `violin_least_data`) for use in the violin plots, which leaves the original DataFrame unchanged.

In [7]:
# Explode the DataFrame to deal with the issue of multiple demographics being the most represented 
# for a school, store it as a new DataFrame to be used for a violin plot, and display the new DataFrame shape
violin_most_data = district_and_expenses.explode('Most Represented Race/Ethnicity Demographic')
print(violin_most_data.shape)

# Explode the DataFrame to deal with the issue of multiple demographics being the least represented 
# for a school, store it as a new DataFrame to be used for a violin plot, and display the new DataFrame shape
violin_least_data = district_and_expenses.explode('Least Represented Race/Ethnicity Demographic')
print(violin_least_data.shape)

# Display the new value counts of the 'Most Represented Race/Ethnicity Demographic' and 'Least 
# Represented Race/Ethnicity Demographic' columns after the DataFrame has been exploded 
print(violin_most_data['Most Represented Race/Ethnicity Demographic'].value_counts())
print(violin_least_data['Least Represented Race/Ethnicity Demographic'].value_counts())

(936, 54)
(1796, 54)
Most Represented Race/Ethnicity Demographic
Hispanic            519
White               375
Asian                33
American Indian       8
African American      1
Name: count, dtype: int64
Least Represented Race/Ethnicity Demographic
Pacific Islander     613
American Indian      436
Filipino             323
African American     171
Asian                160
Two or More Races     79
Hispanic               9
White                  5
Name: count, dtype: int64
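To make the effect of exploding concrete, here is a minimal sketch on invented data (the column name `'Most Represented'` and the values are assumptions). It shows that a row with a two-item list becomes two rows that share the same index label and the same spending value, which is consistent with the row count growing from 932 to 936 above for the four `[Hispanic, White]` ties.

```python
import pandas as pd

# Toy data (assumption): district B has a two-way tie
toy = pd.DataFrame({
    'District Label': ['A', 'B', 'C'],
    'Expense per ADA': [100.0, 200.0, 300.0],
    'Most Represented': [['Hispanic'], ['Hispanic', 'White'], ['Asian']],
})

# Each list element gets its own row; the other columns are repeated
exploded = toy.explode('Most Represented')

# ignore_index=True renumbers the rows instead of repeating index labels
exploded_renumbered = toy.explode('Most Represented', ignore_index=True)

print(exploded)
print(exploded_renumbered.index.tolist())
```

Because tied districts appear once per tied demographic, their spending values contribute to more than one violin.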


### 3.3. Filtering and Melting the district_and_expenses DataFrame from Wide to Long Format

By melting the DataFrame, we create a new DataFrame (`dist_and_exp_demos_melt`) that is easier to use for some of the scatter, box, and bar plots/charts we create later with the Plotly visualization library. We also filter the original DataFrame before melting it so that it only includes the columns we are interested in exploring later on.

The number of rows in the melted DataFrame should be equal to the number of schools multiplied by the number of demographics (932 schools x 14 demographics = 13048 rows).

In [8]:
# Create a new DataFrame with only the desired columns including demographic percentages
dist_and_exp_demos = district_and_expenses[
    ['District Label', 'District Type', 'Locale', 'Funding', 'Expense per ADA'] + DEMOGRAPHICS
]

# Melt the DataFrame for some of the visualizations using the 'DEMOGRAPHICS' list
dist_and_exp_demos_melt = dist_and_exp_demos.melt(
    id_vars=['District Label', 'District Type', 'Locale', 'Funding', 'Expense per ADA'],
    value_vars=DEMOGRAPHICS,
    var_name='Demographic',
    value_name='Demographic (%)'
)

# Remove ' (%)' from the end of each demographic name in the 'Demographic' column
dist_and_exp_demos_melt['Demographic'] = dist_and_exp_demos_melt['Demographic'].str.removesuffix(' (%)')

# Sort the melted DataFrame by the 'District Label' column and reset the index
dist_and_exp_demos_melt = dist_and_exp_demos_melt.sort_values('District Label', ascending=False).reset_index(drop=True)

# Display the melted DataFrame to be used for some of the visualizations
display(dist_and_exp_demos_melt)

# Verify that there are 932 rows for each demographic in the melted DataFrame
print(dist_and_exp_demos_melt['Demographic'].value_counts())

Unnamed: 0,District Label,District Type,Locale,Funding,Expense per ADA,Demographic,Demographic (%)
0,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,Two or More Races,2.4
1,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,Foster,0.7
2,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,American Indian,0.4
3,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,Hispanic,50.6
4,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,Socioeconomically Disadvantaged,61.3
...,...,...,...,...,...,...,...
13043,ABC Unified (Los Angeles),Unified,Suburban,Underfunded,18827.59,Pacific Islander,0.5
13044,ABC Unified (Los Angeles),Unified,Suburban,Underfunded,18827.59,Asian,23.7
13045,ABC Unified (Los Angeles),Unified,Suburban,Underfunded,18827.59,Homeless,0.7
13046,ABC Unified (Los Angeles),Unified,Suburban,Underfunded,18827.59,Hispanic,45.5


Demographic
Two or More Races                  932
Foster                             932
American Indian                    932
Hispanic                           932
Socioeconomically Disadvantaged    932
Pacific Islander                   932
Homeless                           932
African American                   932
Filipino                           932
Asian                              932
Students with Disabilities         932
White                              932
English Learner                    932
Migrant                            932
Name: count, dtype: int64
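The row-count check above (rows x demographics) can be illustrated with a minimal wide-to-long sketch. The three districts and two demographic columns are invented for illustration.

```python
import pandas as pd

# Toy wide-format data (assumption): 3 districts x 2 demographic columns
wide = pd.DataFrame({
    'District Label': ['A', 'B', 'C'],
    'Hispanic (%)': [50.0, 20.0, 70.0],
    'White (%)': [30.0, 60.0, 10.0],
})
value_columns = ['Hispanic (%)', 'White (%)']

# One output row per (district, demographic) pair
long_df = wide.melt(
    id_vars=['District Label'],
    value_vars=value_columns,
    var_name='Demographic',
    value_name='Demographic (%)',
)

# The long frame should have len(wide) * len(value_columns) rows
expected_rows = len(wide) * len(value_columns)
print(len(long_df), expected_rows)
```

The same arithmetic gives 932 x 14 = 13048 rows for the notebook's melted DataFrame.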


### 3.4. Adding a 'Demographic (%) - Quartile Category' Column to the Melted dist_and_exp_demos_melt DataFrame

By adding this column to the melted DataFrame, we can later create bar charts to visualize the average `'Expense per ADA'` (per-pupil spending) values for the top and bottom quartiles of specific demographic percentage columns. The quartiles are determined using the 75th percentile (top 25%) and 25th percentile (bottom 25%) thresholds of each demographic's percentage values.

In [9]:
# Determine the 75th and 25th percentile thresholds (top 25% and bottom 25% cutoffs) for each demographic
top_25_thresholds = district_and_expenses[DEMOGRAPHICS].quantile(0.75)
bottom_25_thresholds = district_and_expenses[DEMOGRAPHICS].quantile(0.25)

# Create a function that will assign a quartile category to each row in the melted DataFrame 
def assign_quartile_category(row: pd.Series) -> str:
    """
    Assigns a quartile category ('Top 25%', 'Bottom 25%', or 'Middle 50%') 
    to a DataFrame row based on the demographic percentage value and
    the precomputed top 25% and bottom 25% thresholds for each demographic.

    Args:
        row (pd.Series): Row from a DataFrame with keys 'Demographic' and 'Demographic (%)'.
    
    Returns:
        str: The assigned quartile category.
    """
    demographic = row['Demographic']
    demographic_percent = row['Demographic (%)']
    if demographic_percent >= top_25_thresholds.loc[demographic + ' (%)']:
        return 'Top 25%'
    elif demographic_percent <= bottom_25_thresholds.loc[demographic + ' (%)']:
        return 'Bottom 25%'
    else:
        return 'Middle 50%'

# Assign and store the quartile category for each row in the melted DataFrame in a new column using
# the 'assign_quartile_category' function
dist_and_exp_demos_melt['Demographic (%) - Quartile Category'] = dist_and_exp_demos_melt.apply(
    assign_quartile_category, axis=1
)

# Display the top 25% and bottom 25% percentile thresholds and the updated melted DataFrame
print(top_25_thresholds)
print(bottom_25_thresholds)
display(dist_and_exp_demos_melt)

African American (%)                    2.400
American Indian (%)                     0.800
Asian (%)                               5.700
Filipino (%)                            1.400
Hispanic (%)                           70.825
Pacific Islander (%)                    0.300
White (%)                              55.325
Two or More Races (%)                   7.500
English Learner (%)                    23.925
Foster (%)                              0.600
Homeless (%)                            4.700
Migrant (%)                             0.900
Students with Disabilities (%)         15.000
Socioeconomically Disadvantaged (%)    78.925
Name: 0.75, dtype: float64
African American (%)                    0.300
American Indian (%)                     0.100
Asian (%)                               0.500
Filipino (%)                            0.000
Hispanic (%)                           21.900
Pacific Islander (%)                    0.000
White (%)                              10.975
Two or 

Unnamed: 0,District Label,District Type,Locale,Funding,Expense per ADA,Demographic,Demographic (%),Demographic (%) - Quartile Category
0,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,Two or More Races,2.4,Middle 50%
1,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,Foster,0.7,Top 25%
2,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,American Indian,0.4,Middle 50%
3,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,Hispanic,50.6,Middle 50%
4,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,Socioeconomically Disadvantaged,61.3,Middle 50%
...,...,...,...,...,...,...,...,...
13043,ABC Unified (Los Angeles),Unified,Suburban,Underfunded,18827.59,Pacific Islander,0.5,Top 25%
13044,ABC Unified (Los Angeles),Unified,Suburban,Underfunded,18827.59,Asian,23.7,Top 25%
13045,ABC Unified (Los Angeles),Unified,Suburban,Underfunded,18827.59,Homeless,0.7,Middle 50%
13046,ABC Unified (Los Angeles),Unified,Suburban,Underfunded,18827.59,Hispanic,45.5,Middle 50%
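As a possible alternative to the row-wise `apply`, the same categorization can be sketched with per-group thresholds computed by `groupby(...).transform`. This is an illustrative variant on invented data, not the notebook's implementation; the demographic names `'A'` and `'B'` and the column `'Quartile Category'` are placeholders.

```python
import pandas as pd

# Toy long-format data (assumption): demographic 'B' has many tied low values
long_df = pd.DataFrame({
    'Demographic': ['A'] * 8 + ['B'] * 8,
    'Demographic (%)': [0, 1, 2, 3, 4, 5, 6, 7] + [10, 10, 10, 10, 20, 30, 40, 50],
})

# Broadcast each demographic's 25th/75th percentile to its own rows
grouped = long_df.groupby('Demographic')['Demographic (%)']
q25 = grouped.transform(lambda values: values.quantile(0.25))
q75 = grouped.transform(lambda values: values.quantile(0.75))

# Assign 'Bottom 25%' first so 'Top 25%' takes precedence, matching the
# if/elif order of the row-wise function
long_df['Quartile Category'] = 'Middle 50%'
long_df.loc[long_df['Demographic (%)'] <= q25, 'Quartile Category'] = 'Bottom 25%'
long_df.loc[long_df['Demographic (%)'] >= q75, 'Quartile Category'] = 'Top 25%'

counts = long_df.groupby('Demographic')['Quartile Category'].value_counts()
print(counts)
```

Because the comparisons are inclusive, tied values at a threshold all land in the same category. In the toy data, half of demographic `'B'` ends up in `'Bottom 25%'`; the same can happen in the notebook for demographics whose 25th percentile is 0.0.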


### 3.5. Filtering the Melted dist_and_exp_demos_melt DataFrame for the Pupil Race/Ethnicity or Status Demographics

As many of our visualizations relate to pupil demographics, we want to filter the melted DataFrame and store it twice (`demographic_data_1` and `demographic_data_2`) to only include the pupil race/ethnicity OR status demographics. This will allow us to create visualizations that are focused on one or the other category of demographics.

In [10]:
# Filter the melted dataframe to only include the racial/ethnic demographics and store it as a 
# new DataFrame
demographic_data_1 = dist_and_exp_demos_melt[dist_and_exp_demos_melt['Demographic'].isin(DEMOS[:8])]

# Filter the melted dataframe to only include the status demographics and store it as a new DataFrame
demographic_data_2 = dist_and_exp_demos_melt[dist_and_exp_demos_melt['Demographic'].isin(DEMOS[8:])]

# Display the filtered and melted dataframes for the racial/ethnic and status demographics
display(demographic_data_1.head())
display(demographic_data_2.head())

# Verify that there are 932 rows for each demographic in the filtered and melted dataframes
print(demographic_data_1['Demographic'].value_counts())
print(demographic_data_2['Demographic'].value_counts())

Unnamed: 0,District Label,District Type,Locale,Funding,Expense per ADA,Demographic,Demographic (%),Demographic (%) - Quartile Category
0,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,Two or More Races,2.4,Middle 50%
2,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,American Indian,0.4,Middle 50%
3,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,Hispanic,50.6,Middle 50%
5,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,Pacific Islander,0.1,Middle 50%
7,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,African American,1.4,Middle 50%


Unnamed: 0,District Label,District Type,Locale,Funding,Expense per ADA,Demographic,Demographic (%),Demographic (%) - Quartile Category
1,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,Foster,0.7,Top 25%
4,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,Socioeconomically Disadvantaged,61.3,Middle 50%
6,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,Homeless,1.4,Middle 50%
10,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,Students with Disabilities,15.2,Top 25%
12,Yucaipa-Calimesa Joint Unified (San Bernardino),Unified,Suburban,Underfunded,18294.56,English Learner,7.6,Middle 50%


Demographic
Two or More Races    932
American Indian      932
Hispanic             932
Pacific Islander     932
African American     932
Filipino             932
Asian                932
White                932
Name: count, dtype: int64
Demographic
Foster                             932
Socioeconomically Disadvantaged    932
Homeless                           932
Students with Disabilities         932
English Learner                    932
Migrant                            932
Name: count, dtype: int64


<hr>

## 4.0. Creating a Visualization Template and Color Maps

Plotly References:
- Theming/Templates - Documentation: https://plotly.com/python/templates/
- Discrete Colors - Documentation: https://plotly.com/python/discrete-color/

### 4.1. Defining a Custom Visualization Template

We create a custom visualization template using `plotly.graph_objects` and `plotly.io` so that the same visual style can be applied to all of our Plotly visualizations, making them more consistent.

In [11]:
# Define a custom template to use for the visualizations
visual_template = go.layout.Template(
    layout=dict(
        height=600,
        plot_bgcolor='#EAEAEA',      # Plot background: lightest gray
        paper_bgcolor='#EAEAEA',     # Figure background: lightest gray
        font=dict(color='#032C3C'),  # All text: dark blue
        margin=dict(t=100, b=100),   # Top/bottom figure margins: 100px
        title=dict(xanchor='center'),     # Center the figure title
        legend=dict(
            x=1.05,                      # Place legend further right of plot
            bordercolor='#032C3C',       # Legend border: dark blue
            borderwidth=1,               # Make border have 1px width
            itemsizing='constant'        # Keep constant legend item size
        ),
        xaxis=dict(
            title=dict(standoff=20),     # Place x axis title 20px down
            gridcolor='#E1DCD8',         # Grid lines (x-axis): light gray
            linecolor='#9E9E9E',         # x axis line: gray
            tickfont=dict(color='#032C3C')  # Ticks (x-axis): dark blue
        ),
        yaxis=dict(
            title=dict(standoff=20),     # Place y axis title 20px left
            gridcolor='#E1DCD8',         # Grid lines (y-axis): light gray
            linecolor='#9E9E9E',         # y axis line: gray
            tickfont=dict(color='#032C3C')  # Ticks (y-axis): dark blue
        )
    )
)

# Register the custom template to a chosen name within 'pio.templates' to be used for the visualizations
pio.templates['SIADS_593_Visuals'] = visual_template

### 4.2. Defining Discrete Color Maps for the Demographic-Related Scatter and Violin Plots

We plan to use discrete color maps for the demographic-related scatter and violin plots to make it easier to distinguish between the different demographics in the plots. These color maps were based on a color scheme that we crafted.

In [12]:
# Create two discrete color sequence lists to be used to encode racial/ethnic and status demographics
# in the scatter and violin plots
color_seq_1 = ['#A6B8E7', '#5A6184', '#032C3C', '#1A9FBA', '#008179', '#BA351A', '#6B2A2D', '#EE982C']
color_seq_2 = ['#A6B8E7', '#5A6184', '#1A9FBA', '#008179', '#BA351A', '#EE982C']

# Create a color map for both the racial/ethnic and status demographics using the two color sequences
color_map_1 = dict(zip(DEMOS[:8], color_seq_1))
color_map_2 = dict(zip(DEMOS[8:], color_seq_2))
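One detail worth keeping in mind with `dict(zip(...))` is that `zip` stops at the shorter input, so a missing color would silently drop a demographic from the map. The helper below, `build_color_map`, is a hypothetical name introduced only for this sketch; it adds a length check before building the mapping.

```python
from typing import Dict, List


def build_color_map(names: List[str], colors: List[str]) -> Dict[str, str]:
    """Pair each name with a color, refusing mismatched lengths."""
    if len(names) != len(colors):
        raise ValueError(
            f'{len(names)} names but {len(colors)} colors; every name needs a color'
        )
    return dict(zip(names, colors))


# Illustrative inputs (assumption): three names but only two colors
names = ['English Learner', 'Foster', 'Homeless']
colors = ['#A6B8E7', '#5A6184']

# Plain zip silently drops the last name
silent_map = dict(zip(names, colors))
print(silent_map)

# The checked helper reports the mismatch instead
try:
    build_color_map(names, colors)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```

In the notebook, the sequence lengths (8 and 6) match the two slices of `DEMOS`, so no names are dropped.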

<hr>

## 5.0. Scatter Plots

Plotly References:
- Scatter Plot - API: https://plotly.com/python-api-reference/generated/plotly.graph_objects.Scatter.html
- Scatter Plot - Documentation: https://plotly.com/python/line-and-scatter/
- Subplots - Documentation: https://plotly.com/python/subplots/

### 5.1. Per-Pupil Spending vs. Pupil Race/Ethnicity OR Status Demographic Percentages

The function below uses `plotly.express` to create a scatter plot of demographic percentages vs. `'Expense per ADA'` (per-pupil spending) values for a given set of demographics.

In [13]:
def create_demo_exp_scatter_all(
    demographic_data: pd.DataFrame, 
    colors: Dict[str, str], 
    title: str,
    subtitle: str
) -> Figure:
    """
    Creates a Plotly scatter plot to visualize the relationship between demographic percentages and per-pupil 
    spending ('Expense per ADA') across all demographics using the melted and filtered DataFrame, with separate 
    colors for each demographic and OLS trendlines.

    Args:
        demographic_data (pd.DataFrame): Melted and filtered DataFrame containing columns: 
            'Demographic', 'Demographic (%)', 'Expense per ADA', 'Funding', 'Locale', 
            'District Type', 'District Label'.
        colors (Dict[str, str]): The dictionary mapping demographic names to color codes (hex).
        title (str): The title for the scatter plot.
        subtitle (str): The subtitle for the scatter plot.

    Returns:
        Figure: A Plotly Figure object representing the completed scatter plot.
    """
    demo_exp_scatter_all = px.scatter(
        demographic_data, 
        x='Demographic (%)', 
        y='Expense per ADA', 
        color='Demographic',
        color_discrete_map=colors,  # Use the specified color map for the demographics
        labels={
            'Demographic (%)': 'Demographic Percentage (%)', 
            'Expense per ADA': 'Per-Pupil Spending ($)'
        },
        category_orders={'Demographic': DEMOS},  # Order the legend items to match the DEMOS list
        opacity=0.5,      # Make the data points somewhat transparent
        trendline='ols',  # Add an OLS regression trendline for each demographic
        hover_data=demographic_data.columns,
        title=title,
        subtitle=subtitle,
        template='SIADS_593_Visuals'
    )

    # Alter the color of the subtitle
    demo_exp_scatter_all.update_layout(title_subtitle_font_color='#5A6184')

    return demo_exp_scatter_all

#### 5.1.1. Per-Pupil Spending vs. Pupil Race/Ethnicity Demographic Percentages

The scatter plot below shows the relationship between California pupil race/ethnicity demographic percentages and `'Expense per ADA'` (per-pupil spending) values. The pupil race/ethnicity demographics that are represented in the plot are `'African American'`, `'American Indian'`, `'Asian'`, `'Filipino'`, `'Hispanic'`, `'Pacific Islander'`, `'White'`, and `'Two or More Races'`.

OLS regression trendlines are added for each demographic to show the linear relationship between the demographic percentages and per-pupil spending, and data points have been made slightly transparent to help distinguish overlapping data points.

##### 5.1.1.1. Create the Scatter Plot Using the `'create_demo_exp_scatter_all'` Function

In [14]:
# Create a scatter plot to show the relationship between racial/ethnic demographic percentages and 
# 'Expense per ADA' values using the filtered and melted DataFrame
demo_exp_scatter_all_1 = create_demo_exp_scatter_all(
    demographic_data_1, 
    color_map_1, 
    'Per-Pupil Spending vs. Pupil Race/Ethnicity Demographic Percentages',
    'Weak Positive and Negative Correlations'
)

# Update the scatter plot's left and right margins
demo_exp_scatter_all_1.update_layout(margin=dict(l=110, r=225)) 

# Display the scatter plot of racial/ethnic demographic %s vs. 'Expense per ADA'
demo_exp_scatter_all_1.show()

##### 5.1.1.2. Observations from the Scatter Plot

- For every race/ethnicity demographic, the R-squared value is very low (at most 0.0576), which means that the linear relationship between the demographic percentages and per-pupil spending is weak. The slope of each trendline gives the direction of the relationship: some demographics have a weak negative linear correlation while others have a weak positive linear correlation.
  - Weak negative linear correlations: `'Asian'` (R-squared = 0.0143), `'Filipino'` (R-squared = 0.0117), `'Hispanic'` (R-squared = 0.0029), `'Pacific Islander'` (R-squared = 0.0028), and `'African American'` (R-squared = 0.0003)
  - Weak positive linear correlations: `'American Indian'` (R-squared = 0.0576), `'White'` (R-squared = 0.0047), and `'Two or More Races'` (R-squared = 0.0013)
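The observations above pair each R-squared value with a direction. For a one-predictor OLS fit with an intercept, R-squared equals the square of the Pearson correlation, and the sign of the correlation matches the sign of the slope. For example, an R-squared of 0.0576 corresponds to a correlation of magnitude 0.24. The sketch below checks this relationship with NumPy on invented values; it does not reproduce the notebook's statsmodels fit.

```python
import numpy as np

# Toy data (assumption): spending tends to fall as the percentage rises
x = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
y = np.array([20.0, 18.0, 19.0, 15.0, 16.0])

# Least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)

# R-squared from the residual and total sums of squares
predicted = slope * x + intercept
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation: its square matches R-squared, its sign matches the slope
r = np.corrcoef(x, y)[0, 1]

print(round(r_squared, 4), round(r ** 2, 4), np.sign(slope) == np.sign(r))
```

R-squared alone cannot tell a positive relationship from a negative one, so the direction has to come from the slope or the correlation.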

#### 5.1.2. Per-Pupil Spending vs. Pupil Status Demographic Percentages

The scatter plot below shows the relationship between California pupil status demographic percentages and `'Expense per ADA'` (per-pupil spending) values. The pupil status demographics that are represented in the plot are `'English Learner'`, `'Foster'`, `'Homeless'`, `'Migrant'`, `'Students with Disabilities'`, and `'Socioeconomically Disadvantaged'`.

OLS regression trendlines are added for each demographic to show the linear relationship between the demographic percentages and per-pupil spending, and data points have been made slightly transparent to help with distinguishing overlapping data points.

##### 5.1.2.1. Create the Scatter Plot Using the `'create_demo_exp_scatter_all'` Function

In [15]:
# Create a scatter plot to show the relationship between status demographic percentages and 
# 'Expense per ADA' values using the filtered and melted DataFrame
demo_exp_scatter_all_1 = create_demo_exp_scatter_all(
    demographic_data_2, 
    color_map_2, 
    'Per-Pupil Spending vs. Pupil Status Demographic Percentages',
    'Weak Positive Correlations'
)

# Update the scatter plot's left and right margins
demo_exp_scatter_all_1.update_layout(margin=dict(l=110, r=310))

# Display the scatter plot of status demographic %s vs. 'Expense per ADA'
demo_exp_scatter_all_1.show()

##### 5.1.2.2. Observations from the Scatter Plot

- For most of the status demographics, the R-squared values are very low, which means that the linear relationship between the demographic percentages and per-pupil spending is weak. All demographics have a weak positive linear correlation.
  - Weak positive linear correlations: `'Students with Disabilities'` (R-squared = 0.0356), `'Foster'` (R-squared = 0.0267), `'Socioeconomically Disadvantaged'` (R-squared = 0.0207), `'Homeless'` (R-squared = 0.0114), `'Migrant'` (R-squared = 0.0015), and `'English Learner'` (R-squared = 0.0012)

### 5.2. Per-Pupil Spending vs. Pupil Race/Ethnicity AND Status Demographic Percentages

The scatter subplot grid below shows individual scatter plots for each demographic's percentage vs. `'Expense per ADA'` (per-pupil spending) values. Giving each demographic its own subplot makes its individual data points easier to see than in the previous two combined scatter plots.

OLS regression trendlines are added for each scatter subplot to show the linear relationship between the demographic percentages and per-pupil spending.

#### 5.2.1. Create a Function to Create a Scatter Subplot for Each Demographic

The function below will add a scatter subplot for each demographic's percentage vs. `'Expense per ADA'` values with an OLS regression trendline to a provided Plotly Figure. The function utilizes `plotly.graph_objects` to create the individual subplots.

In [16]:
def create_demo_exp_scatter(
    scatter_plot: Figure,
    melted_data: pd.DataFrame,
    demographic: str,
    color: str,
    row: int,
    col: int
) -> None:
    """
    Adds a scatter plot and OLS regression line for a given demographic's percentage vs. per-pupil 
    spending ('Expense per ADA') to a subplot in the provided Plotly Figure using the melted DataFrame.

    Args:
        scatter_plot (Figure): The Plotly subplot Figure to which traces will be added.
        melted_data (pd.DataFrame): Melted DataFrame containing columns: 
            'Demographic', 'Demographic (%)', 'Expense per ADA', 'Funding', 'Locale', 
            'District Type', 'District Label'.
        demographic (str): The demographic to plot.
        color (str): The color for the scatter plot data points.
        row (int): The subplot row index.
        col (int): The subplot column index.

    Returns:
        None. The function modifies `scatter_plot` in place by adding two traces.
    """
    # Filter the melted DataFrame for the specific demographic and store it in a new DataFrame
    demo_data = melted_data[melted_data['Demographic'] == demographic]

    # Plot the data points for the specific demographic percentages vs. 'Expense per ADA' on an individual subplot
    # (use the 'scatter_plot' argument rather than a global Figure so the function works with any Figure)
    scatter_plot.add_trace(
        go.Scatter(
            x=demo_data['Demographic (%)'],
            y=demo_data['Expense per ADA'],
            mode='markers',
            marker=dict(color=color),
            customdata=demo_data[['Funding', 'Locale', 'District Type']].values,
            hovertemplate=(
                '<b>%{text}</b><br>'
                'Demographic Percentage (%) = %{x}<br>'
                'Per-Pupil Spending ($) = %{y}<br>'
                'Funding = %{customdata[0]}<br>'
                'Locale = %{customdata[1]}<br>'
                'District Type = %{customdata[2]}<br>'
                '<extra></extra>'
            ),
            text=demo_data['District Label'],
            name=demographic + ' Percents'
        ),
        row=row,
        col=col
    )

    # Add a constant term for the intercept to the independent variable (x = demo_data['Demographic (%)']) 
    # for the OLS regression model
    x_constant = sm.add_constant(demo_data['Demographic (%)'])

    # Build and fit the OLS regression model to the filtered DataFrame
    OLS_model = sm.OLS(demo_data['Expense per ADA'], x_constant)
    OLS_results = OLS_model.fit()

    # Create a copy of the filtered DataFrame and add a new column in the filtered DataFrame that will contain 
    # the predicted values of 'Expense per ADA' using the OLS regression model
    demo_data_new = demo_data.copy()
    demo_data_new['OLS Predicted Expense per ADA'] = OLS_results.predict(x_constant)

    # Add the OLS regression line to the individual subplot using the predicted values of 
    # 'Expense per ADA' and the 'Demographic (%)' values
    scatter_plot.add_trace(
        go.Scatter(
            x=demo_data_new['Demographic (%)'],
            y=demo_data_new['OLS Predicted Expense per ADA'],
            mode='lines',
            line=dict(color='black', width=2),
            hovertemplate=(
                '<b>OLS Regression Line</b><br>'
                'Formula: Per-Pupil Spending = %.4f * Demographic Percentage + %.4f<br>'
                'R-squared = %.6f'
                '<extra></extra>'
            ) % (
                OLS_results.params.iloc[1],
                OLS_results.params.iloc[0],
                OLS_results.rsquared
            ),
            name='OLS Regression Line'
        ),
        row=row,
        col=col
    )
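
The hover label for the regression line in the function above is built with Python's `%` string formatting, with the slope, intercept, and R-squared substituted in that order. The sketch below isolates that step in a hypothetical helper, `format_ols_hover`, so the resulting text can be inspected without fitting a model. It also illustrates why this approach should not be combined with Plotly's own `%{x}`-style placeholders in the same string: Python tries to interpret them as format specifiers.

```python
def format_ols_hover(slope: float, intercept: float, r_squared: float) -> str:
    """Build the hover text for an OLS regression line (hypothetical helper)."""
    template = (
        '<b>OLS Regression Line</b><br>'
        'Formula: Per-Pupil Spending = %.4f * Demographic Percentage + %.4f<br>'
        'R-squared = %.6f'
        '<extra></extra>'
    )
    # Order matters: slope first, then intercept, then R-squared
    return template % (slope, intercept, r_squared)


hover_text = format_ols_hover(2.0, 1.0, 0.5)
print(hover_text)

# Mixing Plotly placeholders such as %{y} with Python %-formatting raises an error,
# because '{' is not a valid Python format character
try:
    'Spending = %{y}, slope = %.4f' % (2.0,)
    mixing_failed = False
except ValueError:
    mixing_failed = True

print('Mixing placeholders failed:', mixing_failed)
```

An f-string or `str.format` with doubled braces would be an alternative if both kinds of placeholder were ever needed in one template.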

#### 5.2.2. Create the Grid of Scatter Subplots Using the `'create_demo_exp_scatter'` Function

In [17]:
# Combine the two defined color maps into one color map to be used for all the demographics in the individual 
# scatter plots
scatter_color_map = color_map_1 | color_map_2

# Create a list of subplot titles using the 'DEMOS' list to be used for the subplot titles
subplot_titles = ['Demographic = ' + demo for demo in DEMOS]

# Create a list to hold the subplot grid location (row, column) for each demographic's scatter plot
subplot_locations = [
    [1, 1], [1, 2], [2, 1], [2, 2], [3, 1], [3, 2], [4, 1],
    [4, 2], [5, 1], [5, 2], [6, 1], [6, 2], [7, 1], [7, 2]
]

# Initialize a subplot grid to hold the individual scatter plots for each demographic with 7 rows x 2 columns 
demo_exp_scatter_indiv = make_subplots(
    rows=7,
    cols=2,
    vertical_spacing=0.06,         # Adjust the vertical spacing between subplots
    horizontal_spacing=0.15,       # Adjust the horizontal spacing between subplots
    subplot_titles=subplot_titles
)

# Plot each demographic's %s vs. 'Expense per ADA' on an individual scatter subplot using the 
# 'create_demo_exp_scatter' function
for index, demo in enumerate(DEMOS):
    create_demo_exp_scatter(
        demo_exp_scatter_indiv,
        dist_and_exp_demos_melt,
        demo,
        scatter_color_map[demo],
        subplot_locations[index][0],
        subplot_locations[index][1]
    )

# Update the scatter subplot grid's height, width, title, top margin, background color, and legend visibility 
# and apply the visualization template
demo_exp_scatter_indiv.update_layout(
    height=2000,
    width=1000,
    title_text='Per-Pupil Spending vs. Pupil Demographic Percentages for Each Demographic',
    showlegend=False,
    template='SIADS_593_Visuals',
    paper_bgcolor='white',
    margin=dict(t=150)
)

# Update the font size for the scatter subplot titles
demo_exp_scatter_indiv.update_annotations(font_size=14)

# Update the x and y axis title text and title/tick size for the scatter subplots
demo_exp_scatter_indiv.update_xaxes(
    title_text='Demographic Percentage (%)',
    title_font=dict(size=12),
    tickfont=dict(size=12)
)
demo_exp_scatter_indiv.update_yaxes(
    title_text='Per-Pupil Spending ($)',
    title_font=dict(size=12),
    tickfont=dict(size=12)
)

# Display the subplot grid containing the individual scatter plots for each demographic
demo_exp_scatter_indiv.show()
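
The hard-coded `subplot_locations` list above could also be generated from the number of demographics, which would keep the grid in sync if the list of demographics changed. This is a minimal sketch of that idea; `grid_locations` is a hypothetical helper and is not used elsewhere in the notebook.

```python
import math
from typing import List, Tuple


def grid_locations(n_plots: int, n_cols: int = 2) -> List[Tuple[int, int]]:
    """Return 1-based (row, col) positions that fill a grid row by row."""
    return [(index // n_cols + 1, index % n_cols + 1) for index in range(n_plots)]


n_demographics = 14  # eight race/ethnicity demographics plus six status demographics
locations = grid_locations(n_demographics, n_cols=2)
n_rows = math.ceil(n_demographics / 2)

print(n_rows)
print(locations)
```

Plotly subplot rows and columns are 1-based, which is why the helper adds 1 to both the quotient and the remainder.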

### 5.3. Per-Pupil Spending vs. Percentage of Socioeconomically Disadvantaged Students by District Type (for the Top Five Districts by Total Pupil Enrollment)

The graph below shows the relationship between the percentage of `'Socioeconomically Disadvantaged'` students and per-pupil spending (`'Expense per ADA'` values) for the top five California districts by total pupil enrollment (`'Enroll Total'` values) within each `'District Type'`.

#### 5.3.1. Create the Scatter Plot Using `plotly.express`

In [18]:
# Annotate the type of the main DataFrame with district and expense data (it was loaded in Section 2.0)
district_and_expenses: pd.DataFrame

# Select the top 5 districts with the largest total enrollment within each district type
top5_per_type: pd.DataFrame = (
    district_and_expenses
    .sort_values(['District Type', 'Enroll Total'], ascending=[True, False])
    .groupby('District Type', as_index=False)
    .head(5)
)

# Map district types to specific colors for consistent plotting
district_color_map: dict[str, str] = {
    'Elementary': '#1A9FBA',  # Blue for Elementary
    'High': '#BA351A',        # Red for High
    'Unified': '#008179'      # Teal for Unified
}

# Create a scatter plot visualizing per-pupil spending vs. the percentage of socioeconomically disadvantaged students,
# colored by district type and sized by enrollment
soc_dis_exp_scatter: go.Figure = px.scatter(
    top5_per_type,
    x='Socioeconomically Disadvantaged (%)',     # X-axis: Percentage of disadvantaged students
    y='Expense per ADA',                         # Y-axis: Per-pupil spending
    color='District Type',                       # Color markers by district type
    size='Enroll Total',                         # Marker size by district enrollment
    trendline='ols',                             # Add OLS trendline
    color_discrete_map=district_color_map,       # Use our custom color map
    labels={
        'Socioeconomically Disadvantaged (%)': 'Percentage of Socioeconomically Disadvantaged Students (%)',
        'Expense per ADA': 'Per-Pupil Spending ($)',
        'Enroll Total': 'Total Pupil Enrollment'
    },
    template='SIADS_593_Visuals'
)

# Update the layout: reverse legend order, adjust margins, set a formatted title with HTML subtitle
soc_dis_exp_scatter.update_layout(
    legend_traceorder='reversed',
    margin=dict(t=130, l=110, r=185),
    title=dict(
        text=(
            'Per-Pupil Spending vs. Percentage of Socioeconomically Disadvantaged Students by District Type'
            '<br> for the Top 5 Largest Districts'
            '<br><span style="font-size:12px; color:#5A6184;">'
            'Size of Markers Represents Total Pupil Enrollment Size'
            '</span>'
        )
    )
)

# Add an outline to each marker for visual clarity
soc_dis_exp_scatter.update_traces(
    marker=dict(line=dict(width=1, color='#5A6184'))
)

# Display the final plot
soc_dis_exp_scatter.show()
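
The top-five selection in the cell above relies on sorting within each district type and then keeping the first rows of each group. The sketch below shows the same pattern on a small synthetic DataFrame, keeping the top two rows per group so the result is easy to verify by eye. The district labels and enrollment figures are made up for illustration.

```python
import pandas as pd

# Synthetic rows for illustration only
toy = pd.DataFrame({
    'District Type': ['Unified', 'Unified', 'Unified', 'High', 'High', 'Elementary'],
    'District Label': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Enroll Total': [500, 900, 100, 300, 700, 50],
})

# Sort by district type (ascending) and enrollment (descending), then keep the
# first two rows of each district type
top2_per_type = (
    toy
    .sort_values(['District Type', 'Enroll Total'], ascending=[True, False])
    .groupby('District Type')
    .head(2)
)

print(top2_per_type)
```

Groups with fewer rows than requested (here, `'Elementary'`) simply contribute all of their rows.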

#### 5.3.2. Observations From the Scatter Plot

- The top five California districts by total pupil enrollment (`'Enroll Total'`) within each district type tend to have a higher percentage of `'Socioeconomically Disadvantaged'` students
- There is a positive correlation between the percentage of `'Socioeconomically Disadvantaged'` students and per-pupil spending for all three district types (`'Unified'`, `'High'`, and `'Elementary'`)

### 5.4. Per-Pupil Spending vs. Percentage of Socioeconomically Disadvantaged Pupils by District Type

The scatter plot below provides a more holistic view of the California district data and it shows the relationship between the percentage of `'Socioeconomically Disadvantaged'` pupils and per-pupil spending (`'Expense per ADA'` values). The size of the markers represents the total pupil enrollment size (`'Enroll Total'`) values, and the colors represent the `'District Type'`. There is also an OLS trendline included.
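
To give a sense of how marker size can encode enrollment, the sketch below applies the area-based scaling described in Plotly's bubble chart documentation, where `sizeref = 2 * max(size) / size_max ** 2`. The helper `marker_diameters` is hypothetical, and the `size_max` value of 20 is assumed to match the Plotly Express default; the exact pixel sizes in the rendered figure may differ.

```python
import math
from typing import List, Sequence


def marker_diameters(values: Sequence[float], size_max: float = 20.0) -> List[float]:
    """Approximate marker diameters (in px) under area-based bubble scaling."""
    largest = max(values)
    # Scaling factor from the bubble chart documentation (assumed to apply here)
    sizeref = 2.0 * largest / size_max ** 2
    # With area scaling, the diameter grows with the square root of the value
    return [2.0 * math.sqrt(value / (2.0 * sizeref)) for value in values]


# Synthetic enrollment totals for illustration only
diameters = marker_diameters([100, 400])
print(diameters)
```

Under this scaling, a district with four times the enrollment gets a marker with twice the diameter, so visual area (rather than width) tracks enrollment.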

#### 5.4.1. Create the Scatter Plot Using a Function and `plotly.express`

In [19]:
def plot_soc_dis_exp_by_type(
    df: pd.DataFrame,
    color_map: Dict[str, str],
    x_col: str = 'Socioeconomically Disadvantaged (%)',
    y_col: str = 'Expense per ADA',
    size_col: str = 'Enroll Total',
    color_col: str = 'District Type',
    template: str = 'SIADS_593_Visuals'
) -> None:
    """
    Scatter plot: Per-Pupil Spending vs. Percentage Socioeconomically Disadvantaged Pupils by District Type.

    Args:
        df (pd.DataFrame): Input DataFrame.
        color_map (dict): Mapping for district type colors.
        x_col (str): Column for x-axis.
        y_col (str): Column for y-axis.
        size_col (str): Column for marker size.
        color_col (str): Column for color grouping.
        template (str): Plotly theme template.
    """
    # Create the scatter plot using Plotly Express
    fig = px.scatter(
        df,
        x=x_col,                       # set x-axis to show the percentage of disadvantaged pupils
        y=y_col,                       # set y-axis to show per-pupil spending
        color=color_col,               # color points by district type
        size=size_col,                 # size points by enrollment
        trendline='ols',               # add an ordinary least squares (OLS) trendline
        color_discrete_map=color_map,  # use the provided district color mapping
        labels={
            x_col: 'Percentage of Socioeconomically Disadvantaged Pupils (%)',
            y_col: 'Per-Pupil Spending ($)',
            size_col: 'Total Pupil Enrollment'
        },
        title=(
            # HTML allows for subtitle styling directly in the title argument
            'Per-Pupil Spending vs. Percentage of Socioeconomically Disadvantaged Pupils by District Type'
            '<br><span style="font-size:12px; color:#5A6184;">'
            'Size of Markers Represents Total Pupil Enrollment Size'
            '</span>'
        ),
        template=template               # use custom Plotly template
    )
    # Update the layout: legend order and margins
    fig.update_layout(
        legend_traceorder='reversed',   # reverse legend so largest/most important is last (on top)
        margin=dict(l=110, r=185)       # adjust left and right margins for readability
    )
    # Add an outline to points so they stand out against the plot background
    fig.update_traces(
        marker=dict(line=dict(width=1, color='#5A6184'))
    )
    # Display the figure
    fig.show()


# Define the color mapping for district types
district_color_map: Dict[str, str] = {
    'Elementary': '#1A9FBA',  # Blue for elementary schools
    'High': '#BA351A',        # Red for high schools
    'Unified': '#008179'      # Teal for unified (K-12) districts
}

# Generate the plot using the district_and_expenses DataFrame and our color map
plot_soc_dis_exp_by_type(district_and_expenses, district_color_map)

#### 5.4.2. Observations From the Scatter Plot

- There is a positive correlation between the percentage of `'Socioeconomically Disadvantaged'` pupils and per-pupil spending (`'Expense per ADA'` values)

### 5.5. Per-Pupil Spending vs. District Funding by District Type

The scatterplot below shows the relationship between total district funding (`'Expense ADA'` values) and per-pupil spending (`'Expense per ADA'` values). The size of the markers represents the total pupil enrollment size (`'Enroll Total'` values), and the colors represent the `'District Type'`.
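
As a hedged aside on how the plotted columns relate: in the DataFrame preview in Section 2.0, `'Expense per ADA'` appears to be approximately `'EDP 365'` divided by `'Expense ADA'`. The sketch below checks that arithmetic on the three previewed districts. The `'EDP 365'` figures are copied as displayed (rounded), so the match is approximate, and whether the relationship holds for every row is an assumption.

```python
# Values copied from the three rows shown in the DataFrame preview (Section 2.0):
# (district name, 'EDP 365', 'Expense ADA', 'Expense per ADA')
preview_rows = [
    ('Alameda Unified', 1.550948e+08, 8567.86, 18101.93),
    ('Albany City Unified', 6.149090e+07, 3435.41, 17899.14),
    ('Berkeley Unified', 2.205508e+08, 8572.17, 25728.70),
]

# Map each district to (computed ratio, reported 'Expense per ADA')
ratio_checks = {}
for name, edp_365, expense_ada, expense_per_ada in preview_rows:
    ratio_checks[name] = (edp_365 / expense_ada, expense_per_ada)

for name, (computed, reported) in ratio_checks.items():
    print(f'{name}: computed = {computed:.2f}, reported = {reported:.2f}')
```

If this relationship holds across the dataset, the two axes of the plot are not independent quantities, which is worth keeping in mind when reading the trendline.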

#### 5.5.1. Create the Scatter Plot Using a Function and `plotly.express`

In [20]:
def plot_exp_ada_by_district_type(
    df: pd.DataFrame,
    color_map: Dict[str, str],
    symbol_map: Dict[str, str],
    x_col: str = 'Expense ADA',
    y_col: str = 'Expense per ADA',
    size_col: str = 'Enroll Total',
    color_col: str = 'District Type',
    opacity: float = 0.5,
    template: str = 'SIADS_593_Visuals'
) -> None:
    """
    Scatter plot: Per-Pupil Spending vs. District Funding by District Type.
    
    Args:
        df (pd.DataFrame): Data to plot.
        color_map (dict): Color mapping for district types.
        symbol_map (dict): Symbol mapping for district types.
        x_col (str): Column for x-axis.
        y_col (str): Column for y-axis.
        size_col (str): Column for marker size.
        color_col (str): Column for color (legend groups).
        opacity (float): Opacity of plot markers.
        template (str): Plotly template.
    """
    # Create the scatter plot; assign color and marker symbol by district type
    fig = px.scatter(
        df,
        x=x_col,                          # Map x-axis to total district funding
        y=y_col,                          # Map y-axis to per-pupil spending
        color=color_col,                  # Color points by district type for the legend
        symbol=color_col,                 # Vary marker symbols by district type for clarity
        size=size_col,                    # Size points by total enrollment in the district
        opacity=opacity,                  # Set the transparency of points for readability
        color_discrete_map=color_map,     # Use user-defined color palette for district types
        symbol_map=symbol_map,            # Use user-defined symbol map for district types
        trendline='ols',                  # Add OLS regression trendline for visual guidance
        labels={
            x_col: 'District Funding ($)',        # x-axis label
            y_col: 'Per-Pupil Spending ($)',      # y-axis label
            size_col: 'Total Pupil Enrollment',   # legend label for marker size
        },
        title=(
            # Main and subtitle for context and clarity - HTML for subtitle styling
            'Per-Pupil Spending vs. District Funding by District Type'
            '<br><span style="font-size:12px; color:#5A6184;">'
            'Size of Markers Represents Total Pupil Enrollment Size | '
            'Larger Districts Tend to Spend Less Per Student'
            '</span>'
        ),
        template=template,               # Apply consistent theme for branding
    )
    # Refine layout: reverse legend order and adjust plot margins for clarity
    fig.update_layout(
        legend_traceorder='reversed',
        margin=dict(t=125, l=110, r=185)
    )
    # Add an outline to each marker for improved contrast and visibility
    fig.update_traces(marker=dict(line=dict(width=1, color='#5A6184')))
    # Display the plot
    fig.show()

# Define the symbol mapping for each district type for unique marker shapes
district_symbol_map: Dict[str, str] = {
    'Elementary': 'circle',
    'High': 'square',
    'Unified': 'diamond'
}

# Define the color mapping for each district type for consistency in visuals
district_color_map: Dict[str, str] = {
    'Elementary': '#1A9FBA',
    'High': '#BA351A',
    'Unified': '#008179'
}

# Generate the scatter plot for the enrollment and funding dataframe using custom mappings
plot_exp_ada_by_district_type(
    district_and_expenses,
    color_map=district_color_map,
    symbol_map=district_symbol_map
)

#### 5.5.2. Observations From the Scatter Plot

- There is a negative correlation between the total district funding (`'Expense ADA'` values) and per-pupil spending (`'Expense per ADA'` values).
- The largest California district, Los Angeles Unified School District, is an outlier; together with the negative trend, this suggests that larger districts tend to spend less per student.

<hr>

## 6.0. Box Plots - Distribution of Pupil Demographic Percentages by Funding Category

Plotly References:
- Box Plot - API: https://plotly.com/python-api-reference/generated/plotly.express.box.html
- Box Plot - Documentation: https://plotly.com/python/box-plots/

The function below can be used to create a box plot of the distribution of specific demographic column percentage values by `'Funding'` category using `plotly.express`.

In [21]:
def create_demo_funding_box(
    demographic_data: pd.DataFrame,
    title: str,
    left_margin: int,
    right_margin: int
) -> Figure:
    """
    Creates a Plotly box plot to show the distribution of demographic percentages by funding category 
    for each demographic using the melted and filtered DataFrame.

    Args:
        demographic_data (pd.DataFrame): Melted and filtered DataFrame containing columns: 
            'Demographic', 'Demographic (%)', 'Expense per ADA', 'Funding', 'Locale', 
            'District Type', 'District Label'.
        title (str): The title for the box plot.
        left_margin (int): The left margin (in px) for the figure layout.
        right_margin (int): The right margin (in px) for the figure layout.

    Returns:
        Figure: A Plotly Figure object representing the completed box plot.
    """
    demo_funding_box = px.box(
        demographic_data, 
        x='Demographic (%)', 
        y='Demographic',  
        color='Funding', 
        color_discrete_sequence=['#BA351A', '#008179'],
        category_orders={'Demographic': DEMOS},
        labels={
            'Demographic (%)': 'Demographic Percentage (%)', 
            'Funding': 'Funding Category'
        },
        hover_data=demographic_data.columns,
        title=title,
        template='SIADS_593_Visuals'
    )

    # Reverse the ordering of the items in the box plot's legend to match the box plot's color sequence and update 
    # the left/right margins
    demo_funding_box.update_layout(
        legend_traceorder='reversed',
        margin=dict(l=left_margin, r=right_margin)
    )

    # Update how the hover data is displayed for the box plot
    demo_funding_box.update_traces(
        hovertemplate=(
            '<b>%{customdata[0]}</b><br>'
            'Demographic = %{y}<br>'
            'Demographic Percentage (%) = %{x}<br>'
            'Per-Pupil Spending ($) = %{customdata[4]}<br>'
            'Funding = %{customdata[3]}<br>'
            'Locale = %{customdata[1]}<br>'
            'District Type = %{customdata[2]}<br>'
            '<extra></extra>'
        )
    )

    return demo_funding_box

### 6.1. Distribution of Pupil Race/Ethnicity Demographic Percentages by Funding Category

The box plot below shows the distribution of California pupil race/ethnicity demographic percentages by `'Funding'` category. The race/ethnicity demographics are `'African American'`, `'American Indian'`, `'Asian'`, `'Filipino'`, `'Hispanic'`, `'Pacific Islander'`, `'White'`, and `'Two or More Races'`.

#### 6.1.1. Create the Box Plot Using the `'create_demo_funding_box'` Function

In [22]:
# Create a box plot to show the distribution of racial/ethnic demographic percentages by funding category by 
# passing the filtered and melted DataFrame to the 'create_demo_funding_box' function
demo_funding_box_1 = create_demo_funding_box(
    demographic_data_1, 
    'Distribution of Pupil Race/Ethnicity Demographic Percentages by Funding Category',
    185, 190
)

# Update the box plot's height
demo_funding_box_1.update_layout(height=650)

# Display the box plots of racial/ethnic demographic %s by funding category
demo_funding_box_1.show()

#### 6.1.2. Observations from the Box Plot

- Many California schools have extremely high percentages of `'Hispanic'` and `'White'` students, and these demographics have a wide spread of percentage values for both `'Well-funded'` and `'Underfunded'` schools
- There are many outliers for `'Well-funded'` and `'Underfunded'` schools for the `'African American'`, `'American Indian'`, `'Asian'`, `'Filipino'`, and `'Pacific Islander'` demographics
    - The `'African American'` and `'American Indian'` demographics have the most outliers for `'Well-funded'` schools, which means that schools that have abnormally high percentages of these demographics tend to be `'Well-funded'`
    - The `'Asian'` demographic has the most outliers for `'Underfunded'` schools, which means that schools that have abnormally high percentages of this demographic tend to be `'Underfunded'`
- The only pupil race/ethnicity demographic that has a median percentage that is higher for `'Well-funded'` schools than `'Underfunded'` schools is `'Hispanic'`, which means that most pupil race/ethnicity demographics are associated with `'Underfunded'` schools when their percentages are higher
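
The outliers mentioned above are the points drawn beyond the box plot whiskers. A common convention places the whisker limits at 1.5 times the interquartile range (IQR) beyond the first and third quartiles. The sketch below counts outliers per demographic and funding category under that convention on synthetic data; `count_iqr_outliers` is a hypothetical helper, and Plotly's own quartile computation may differ slightly from the pandas default used here.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series) -> int:
    """Count values lying more than 1.5 * IQR outside the first and third quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return int(((values < lower_fence) | (values > upper_fence)).sum())


# Synthetic percentages for illustration only
toy = pd.DataFrame({
    'Demographic': ['Demo A'] * 14,
    'Funding': ['Well-funded'] * 7 + ['Underfunded'] * 7,
    'Demographic (%)': [1, 2, 2, 3, 3, 4, 30, 1, 2, 3, 4, 5, 6, 7],
})

outlier_counts = (
    toy
    .groupby(['Demographic', 'Funding'])['Demographic (%)']
    .apply(count_iqr_outliers)
)

print(outlier_counts)
```

Counting outliers this way gives a number to compare across funding categories instead of judging the density of points by eye.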

### 6.2. Distribution of Pupil Status Demographic Percentages by Funding Category

The box plot below shows the distribution of California pupil status demographic percentages by `'Funding'` category. The status demographics are `'English Learner'`, `'Foster'`, `'Homeless'`, `'Migrant'`, `'Students with Disabilities'`, and `'Socioeconomically Disadvantaged'`.

#### 6.2.1. Create the Box Plot Using the `'create_demo_funding_box'` Function

In [23]:
# Create a box plot to show the distribution of status demographic percentages by funding category by 
# passing the filtered and melted DataFrame to the 'create_demo_funding_box' function
demo_funding_box_2 = create_demo_funding_box(
    demographic_data_2, 
    'Distribution of Pupil Status Demographic Percentages by Funding Category',
    275, 185
)

# Display the box plots of status demographic %s by funding category
demo_funding_box_2.show()

#### 6.2.2. Observations from the Box Plots

- Many California schools have extremely high percentages of `'Socioeconomically Disadvantaged'` students and significant percentages of `'English Learner'` students as well as `'Students with Disabilities'`
- For all pupil status demographics, both the median percentage and the outliers appear to be higher for `'Well-funded'` schools than for `'Underfunded'` schools, which means that higher percentages of students in these demographics are more associated with `'Well-funded'` schools
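
The median comparison described above can also be checked numerically rather than read off the plot. The sketch below pivots synthetic data into one row per demographic and one column per funding category, then records which category has the higher median. The values are made up, and the `'Higher Median'` column name is hypothetical.

```python
import pandas as pd

# Synthetic percentages for illustration only
toy = pd.DataFrame({
    'Demographic': ['Foster'] * 6 + ['Homeless'] * 6,
    'Funding': (['Underfunded'] * 3 + ['Well-funded'] * 3) * 2,
    'Demographic (%)': [1, 2, 3, 2, 4, 6, 5, 6, 7, 1, 2, 3],
})

# Median percentage for each demographic within each funding category
medians = toy.pivot_table(
    index='Demographic',
    columns='Funding',
    values='Demographic (%)',
    aggfunc='median'
)

# Record which funding category has the higher median for each demographic
medians['Higher Median'] = medians[['Underfunded', 'Well-funded']].idxmax(axis=1)

print(medians)
```

Applied to the melted status data, a table like this would show directly whether the `'Well-funded'` median is higher for every demographic.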

<hr>

## 7.0. Violin Plots - Distribution of Per-Pupil Spending by Schools' Most/Least Represented Pupil Race/Ethnicity Demographics

Plotly References:
- Violin Plot - API: https://plotly.com/python-api-reference/generated/plotly.express.violin.html
- Violin Plot - Documentation: https://plotly.com/python/violin/

The function below can be used to create a violin plot of the distribution of per-pupil spending (`'Expense per ADA'` values) by schools' most/least represented pupil race/ethnicity demographics (`'Most Represented Race/Ethnicity Demographic'`/`'Least Represented Race/Ethnicity Demographic'` category values). The function utilizes `plotly.express` to create the violin plot.
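
For readers unfamiliar with the grouping columns, the sketch below shows one way a "most represented" and "least represented" demographic could be derived from percentage columns using `idxmax` and `idxmin`. This is an assumption for illustration: the percentage column names and values are hypothetical, and the notebook's actual derivation of these columns may differ.

```python
import pandas as pd

# Hypothetical percentage columns and synthetic values for illustration only
race_pct_columns = ['Asian (%)', 'Hispanic (%)', 'White (%)']
toy = pd.DataFrame({
    'District Label': ['District 1', 'District 2', 'District 3'],
    'Asian (%)': [50.0, 5.0, 10.0],
    'Hispanic (%)': [30.0, 70.0, 60.0],
    'White (%)': [20.0, 25.0, 30.0],
})

# Name of the column holding the largest percentage in each row, without the ' (%)' suffix
toy['Most Represented Race/Ethnicity Demographic'] = (
    toy[race_pct_columns].idxmax(axis=1).str.replace(' (%)', '', regex=False)
)

# Name of the column holding the smallest percentage in each row, without the ' (%)' suffix
toy['Least Represented Race/Ethnicity Demographic'] = (
    toy[race_pct_columns].idxmin(axis=1).str.replace(' (%)', '', regex=False)
)

# Count how often each demographic is the most represented one
most_counts = toy['Most Represented Race/Ethnicity Demographic'].value_counts()

print(toy[['District Label',
           'Most Represented Race/Ethnicity Demographic',
           'Least Represented Race/Ethnicity Demographic']])
print(most_counts)
```

Counts like `most_counts` indicate how many data points each violin is based on, which matters when a category has only a handful of rows.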

In [24]:
def create_demo_exp_violin(
    representation_data: pd.DataFrame,
    y_column: str,
    colors: Dict[str, str],
    title: str,
    height: int
) -> Figure:
    """
    Creates a Plotly violin plot showing the distribution of per-pupil spending ('Expense per ADA') by the 
    specified (most or least represented) demographic group.

    Args:
        representation_data (pd.DataFrame): 
            DataFrame containing at least 'Expense per ADA', 'District Label', 'Funding', 'Locale',
            'District Type', and the grouping demographic columns.
        y_column (str): 
            The grouping column to use on the y-axis (e.g., 'Most Represented Race/Ethnicity Demographic').
        colors (Dict[str, str]): The dictionary mapping demographic names to color codes (hex).
        title (str): The title for the violin plot.
        height (int): The height (in px) for the figure layout.

    Returns:
        Figure: A Plotly Figure object representing the completed violin plot.
    """
    demo_exp_violin = px.violin(
        representation_data, 
        x='Expense per ADA',
        y=y_column, 
        color=y_column,
        color_discrete_map=colors, # Use the designated color map for the demographics
        box=True,      # Draw a mini box plot within the violin plot
        points='all',  # Displays all data points on the violin plot
        labels={'Expense per ADA': 'Per-Pupil Spending ($)'},
        category_orders={
            'Most Represented Race/Ethnicity Demographic': [
                'African American', 'American Indian', 'Asian', 'Hispanic', 'White'
            ],
            'Least Represented Race/Ethnicity Demographic': DEMOS[:8]
        },
        hover_data=['District Label', 'Funding', 'Locale', 'District Type'], 
        title=title,
        template='SIADS_593_Visuals'
    )

    # Alter the violin plot to hide the legend, show the mean line, and update how the hover data is displayed
    demo_exp_violin.update_traces(
        meanline_visible=True, 
        showlegend=False,
        hovertemplate=(
            '<b>%{customdata[0]}</b><br>'
            'Per-Pupil Spending ($) = %{x}<br>'
            'Funding = %{customdata[1]}<br>'
            'Locale = %{customdata[2]}<br>'
            'District Type = %{customdata[3]}<br>'
            '<extra></extra>'
        )
    )

    # Update the height and left/right margins of the violin plot
    demo_exp_violin.update_layout(height=height, margin=dict(l=200, r=50))

    return demo_exp_violin

### 7.1. Distribution of Per-Pupil Spending by Schools' Most Represented Pupil Race/Ethnicity Demographics

The violin plot below shows the distribution of per-pupil spending (`'Expense per ADA'` values) by California schools' most represented pupil race/ethnicity demographics (`'Most Represented Race/Ethnicity Demographic'` values). The pupil demographics that are represented in the plot are `'African American'`, `'American Indian'`, `'Asian'`, `'Hispanic'`, and `'White'`.

#### 7.1.1. Create the Violin Plot Using the `'create_demo_exp_violin'` Function

In [25]:
# Create a violin plot to show the distribution of 'Expense per ADA' by most represented racial/ethnic demographic 
# by passing the 'violin_most_data' DataFrame to the 'create_demo_exp_violin' function
most_demo_exp_violin_1 = create_demo_exp_violin(
    violin_most_data,
    'Most Represented Race/Ethnicity Demographic',
    color_map_1,
    "Distribution of Per-Pupil Spending by Schools' Most Represented Pupil Race/Ethnicity Demographic",
    650
)

# Display the violin plots of 'Expense per ADA' by most represented racial/ethnic demographic
most_demo_exp_violin_1.show()

#### 7.1.2. Observations from the Violin Plot

- There were zero occurrences in which `'Filipino'` or `'Pacific Islander'` students, or students of two or more races, were the most represented pupil race/ethnicity demographic for a California school, and very few occurrences in which a California school's most represented pupil race/ethnicity demographic was `'African American'` (x1) or `'American Indian'` (x8)
- `'Hispanic'` or `'White'` was the most represented pupil race/ethnicity demographic for a majority of the schools
- The distribution of per-pupil spending seems to be somewhat similar (median: ~$19-20K) for schools where `'Hispanic'` or `'White'` students were the most represented pupil race/ethnicity demographic
- The largest middle 50% spread occurred in schools where `'American Indian'` or `'White'` students were the most represented pupil race/ethnicity demographic
- The smallest middle 50% spread occurred in schools where `'Asian'` students were the most represented pupil race/ethnicity demographic


### 7.2. Distribution of Per-Pupil Spending by Schools' Least Represented Pupil Race/Ethnicity Demographics

The violin plot below shows the distribution of per-pupil spending (`'Expense per ADA'` values) by California schools' least represented pupil race/ethnicity demographics (`'Least Represented Race/Ethnicity Demographic'` values). The pupil demographics that are represented in the plot are `'African American'`, `'American Indian'`, `'Asian'`, `'Filipino'`, `'Pacific Islander'`, and `'Two or More Races'`.

#### 7.2.1. Create the Violin Plot Using the `'create_demo_exp_violin'` Function

In [38]:
# Create a violin plot to show the distribution of 'Expense per ADA' by least represented racial/ethnic demographic
# from the original DataFrame using the 'create_demo_exp_violin' function
least_demo_exp_violin_1 = create_demo_exp_violin(
    violin_least_data,
    'Least Represented Race/Ethnicity Demographic',
    color_map_1,
    "Distribution of Per-Pupil Spending by Schools' Least Represented Pupil Race/Ethnicity Demographic",
    765
)

# Display the violin plots of 'Expense per ADA' by least represented racial/ethnic demographic
least_demo_exp_violin_1.show()

#### 7.2.2. Observations from the Violin Plot

- There were very few occurrences in which the least represented pupil race/ethnicity demographic for a California school was `'White'` (x5) or `'Hispanic'` (x9), and the per-pupil spending is fairly spread out in these instances
- The distribution of per-pupil spending seems to be somewhat similar (median: ~$20-24K) for schools where `'African American'`, `'American Indian'`, `'Asian'`, `'Filipino'`, or `'Pacific Islander'` students, or students of two or more races, were the least represented pupil race/ethnicity demographic for a school
- Without considering the `'White'` or `'Hispanic'` demographics, the largest middle 50% per-pupil spending spreads occurred in schools where `'African American'` or `'Asian'` students were the least represented pupil race/ethnicity demographic
- The smallest middle 50% per-pupil spending spreads occurred in schools where `'Pacific Islander'` or `'American Indian'` students were the least represented pupil race/ethnicity demographic
- Without considering the `'White'` or `'Hispanic'` demographics, the distribution of per-pupil spending for all demographics when they were the least represented pupil race/ethnicity demographic for a school seems to be right skewed
- The distribution of per-pupil spending for the `'American Indian'`, `'Pacific Islander'`, and `'Filipino'` demographics when they were the least represented pupil race/ethnicity demographic for a school seems to be denser around their median values
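The right-skew observation above is a visual judgment, and it can be checked with two simple statistics: the gap between the mean and the median, and the sample skewness. The sketch below assumes a DataFrame with a grouping column and an `'Expense per ADA'` column; `skew_by_group` is a hypothetical helper and the toy values are illustrative only.

```python
import pandas as pd


def skew_by_group(df, group_col, value_col='Expense per ADA'):
    """Return two right-skew indicators per group: mean minus median, and sample skewness."""
    grouped = df.groupby(group_col)[value_col]
    return pd.DataFrame({
        # A positive value means the mean is pulled above the median by a long right tail
        'mean_minus_median': grouped.mean() - grouped.median(),
        # Positive skewness also indicates a longer right tail (NaN for groups with < 3 values)
        'skewness': grouped.skew(),
    })


# Toy data (illustrative only): group 'A' has one large value, group 'B' is symmetric
toy = pd.DataFrame({
    'Least Represented Race/Ethnicity Demographic': ['A'] * 5 + ['B'] * 5,
    'Expense per ADA': [1, 2, 3, 4, 20, 1, 2, 3, 4, 5],
})
toy_skew = skew_by_group(toy, 'Least Represented Race/Ethnicity Demographic')
print(toy_skew)
```

Both indicators are sensitive to a few extreme districts, so they are best read together with the violin plot rather than as a replacement for it.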

<hr>

## 8.0. Bar Charts

Plotly References:
- Bar Chart - API: https://plotly.com/python-api-reference/generated/plotly.express.bar
- Bar Chart - Documentation: https://plotly.com/python/bar-charts/

The function below creates a bar chart of the average per-pupil spending (`'Expense per ADA'` values) for the top/bottom quartiles of specific demographic percentage column values OR for each `'Locale'`/`'District Type'` (location/district type) combination. The quartiles are determined using the top 25% and bottom 25% percentile thresholds for each demographic's percentage values and are represented in the `'Demographic (%) - Quartile Category'` column. The function uses `plotly.express` to create the bar chart.
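For readers who want to see how a quartile category column of this kind can be derived, here is a minimal sketch. It assumes inclusive thresholds at the 25th and 75th percentiles; the notebook's actual labeling code is not shown in this section, so `label_quartile_category` is a hypothetical helper and the threshold handling is an assumption.

```python
import pandas as pd


def label_quartile_category(percentages: pd.Series) -> pd.Series:
    """Label each value as 'Bottom 25%', 'Middle 50%', or 'Top 25%' (assumed inclusive thresholds)."""
    lower = percentages.quantile(0.25)  # 25th percentile threshold
    upper = percentages.quantile(0.75)  # 75th percentile threshold
    labels = pd.Series('Middle 50%', index=percentages.index)
    labels[percentages <= lower] = 'Bottom 25%'
    labels[percentages >= upper] = 'Top 25%'
    return labels


# Toy percentages (illustrative only, not taken from the dataset)
toy = pd.Series([0.0, 5.0, 10.0, 20.0, 40.0, 60.0, 80.0, 100.0])
toy_labels = label_quartile_category(toy)
print(toy_labels.value_counts())
```

With inclusive thresholds, many tied values at a threshold (for example, many districts at 0%) all land in the same category, so the three groups need not split exactly 25/50/25.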

In [27]:
def create_average_exp_bars(
    average_exp_data: pd.DataFrame,
    y_column: str,
    color_column: str,
    title: str,
    left_margin: int,
    right_margin: int
) -> Figure:
    """
    Creates a grouped horizontal bar chart showing the average per-pupil spending ('Expense per ADA') for 
    groups within the supplied data. 

    Args:
        average_exp_data (pd.DataFrame): 
            DataFrame containing at least 'Average Expense per ADA', y_column, and color_column columns.
        y_column (str): 
            The name of the column to use for the y-axis grouping (e.g., demographic or locale names).
        color_column (str): 
            The name of the column to use for color grouping (e.g., quartile category or location type).
        title (str): The title for the bar chart.
        left_margin (int): The left margin (in px) for the figure layout.
        right_margin (int): The right margin (in px) for the figure layout.

    Returns:
        Figure: A Plotly Figure object representing the completed bar chart.
    """
    average_exp_bars = px.bar(
        average_exp_data, 
        x='Average Expense per ADA', 
        y=y_column,  
        color=color_column,
        color_discrete_sequence=['#BA351A', '#008179', '#5A6184'], # Red, Teal, Lavender
        category_orders={
            'Demographic': DEMOS,
            'Locale': ['City', 'Rural', 'Suburban', 'Town']
        }, 
        labels={
            'Average Expense per ADA': 'Average Per-Pupil Spending ($)',
            'Demographic (%) - Quartile Category': 'Quartile Category',
            'Locale': 'Location Type'
        },
        orientation='h', # Make the bars horizontal
        barmode='group', # Group the bars by quartile category for each demographic
        text_auto=True,  # Add text with the bar values to the bars
        title=title,
        template='SIADS_593_Visuals'
    )

    # Reverse the ordering of the items in the bar chart's legend to match the bar chart's color sequence,
    # place the legend closer to the bar chart than the default template location, and update the left/right
    # margins
    average_exp_bars.update_layout(
        legend_traceorder='reversed', 
        legend_x=1.02, 
        margin=dict(l=left_margin, r=right_margin)
    )
                            
    return average_exp_bars

### 8.1. Average Per-Pupil Spending for Top/Bottom Percentage Quartiles for Each Demographic

#### 8.1.1. Average Per-Pupil Spending for Top/Bottom Quartiles of Pupil Race/Ethnicity Demographic Percentages

The bar graph below shows the average per-pupil spending (`'Expense per ADA'` values) for each California pupil race/ethnicity-related `'Demographic'`/`'Demographic (%) - Quartile Category'` (demographic/quartile category) combination. The quartiles represent the top/bottom 25% of percentages. The pupil race/ethnicity demographics are `'African American'`, `'American Indian'`, `'Asian'`, `'Filipino'`, `'Hispanic'`, `'Pacific Islander'`, `'White'`, and `'Two or More Races'`.

##### 8.1.1.1. Check the Counts for Each Pupil Race/Ethnicity Demographic Quartile Category

We did this to verify that there were no major imbalances in the number of California schools in each demographic quartile category and no demographic/quartile category combinations with very few schools.

In [28]:
demographic_data_1.groupby(
    ['Demographic', 'Demographic (%) - Quartile Category']
)['District Label'].count().reset_index()

Unnamed: 0,Demographic,Demographic (%) - Quartile Category,District Label
0,African American,Bottom 25%,256
1,African American,Middle 50%,442
2,African American,Top 25%,234
3,American Indian,Bottom 25%,303
4,American Indian,Middle 50%,380
5,American Indian,Top 25%,249
6,Asian,Bottom 25%,260
7,Asian,Middle 50%,436
8,Asian,Top 25%,236
9,Filipino,Bottom 25%,260


There seem to be many California schools associated with each demographic/quartile category combination. This suggests that the average per-pupil spending values we calculate for these combinations should be reasonably representative of the actual average per-pupil spending for these combinations.

##### 8.1.1.2. Create the Bar Chart Using the `'create_average_exp_bars'` Function

We chose to exclude the `'Middle 50%'` quartile category because we wanted to focus on the average per-pupil spending for different demographics when their percentages are higher or lower.

In [29]:
# Determine the average 'Expense per ADA' for the top/bottom quartiles of racial/ethnic demographic %s using the 
# filtered and melted DataFrame and store it in a new DataFrame
demo_1_exp_bars_data = demographic_data_1.groupby(
    ['Demographic', 'Demographic (%) - Quartile Category']
)['Expense per ADA'].mean().reset_index()

# Rename the new DataFrame's 'Expense per ADA' column to 'Average Expense per ADA'
demo_1_exp_bars_data.rename(columns={'Expense per ADA': 'Average Expense per ADA'}, inplace=True)

# Filter the new DataFrame to exclude the 'Middle 50%' quartile category data
demo_1_exp_bars_data = demo_1_exp_bars_data[
    demo_1_exp_bars_data['Demographic (%) - Quartile Category'] != 'Middle 50%'
]

# Create a bar chart to show the average 'Expense per ADA' for the top/bottom quartiles of racial/ethnic 
# demographic %s using the new filtered DataFrame and the 'create_average_exp_bars' function
demo_1_exp_bars = create_average_exp_bars(
    demo_1_exp_bars_data,
    'Demographic',
    'Demographic (%) - Quartile Category',
    'Average Per-Pupil Spending for Top/Bottom Quartiles of Pupil Race/Ethnicity Demographic Percentages',
    185, 175
)

# Display the bar chart of average 'Expense per ADA' for the top/bottom quartiles of racial/ethnic demographic %s
demo_1_exp_bars.show()

##### 8.1.1.3. Observations from the Bar Chart

- The only occurrence in which average per-pupil spending was higher AND percentages were higher rather than lower is for the `'White'` pupil race/ethnicity demographic
- Highest average per-pupil spending occurred with lower percentages of `'Asian'`, `'Filipino'`, `'African American'`, and `'Pacific Islander'` students and higher percentages of `'White'` and `'Hispanic'` students as well as students of two or more races
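Comparisons like the ones above are easier to verify when the top and bottom quartile averages sit side by side. The sketch below pivots a long-format table of averages and computes the top-minus-bottom difference per demographic. It assumes the column names used for `demo_1_exp_bars_data`; `quartile_gap` is a hypothetical helper, and the toy numbers are invented for illustration.

```python
import pandas as pd


def quartile_gap(
    avg_df,
    demo_col='Demographic',
    cat_col='Demographic (%) - Quartile Category',
    value_col='Average Expense per ADA',
):
    """Pivot averages to one row per demographic and compute 'Top 25%' minus 'Bottom 25%'."""
    wide = avg_df.pivot(index=demo_col, columns=cat_col, values=value_col)
    # Positive values: spending is higher where the demographic's percentage is higher
    wide['Top minus Bottom'] = wide['Top 25%'] - wide['Bottom 25%']
    return wide.sort_values('Top minus Bottom', ascending=False)


# Toy averages (illustrative only, not taken from the dataset)
toy = pd.DataFrame({
    'Demographic': ['Group A', 'Group A', 'Group B', 'Group B'],
    'Demographic (%) - Quartile Category': ['Bottom 25%', 'Top 25%', 'Bottom 25%', 'Top 25%'],
    'Average Expense per ADA': [25000.0, 20000.0, 18000.0, 21000.0],
})
toy_gaps = quartile_gap(toy)
print(toy_gaps)
```

A positive `'Top minus Bottom'` value corresponds to the "higher percentages, higher spending" pattern, and a negative value corresponds to the opposite.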

#### 8.1.2. Average Per-Pupil Spending for Top/Bottom Quartiles of Pupil Status Demographic Percentages

The bar graph below shows the average per-pupil spending (`'Expense per ADA'` values) for each pupil status-related `'Demographic'`/`'Demographic (%) - Quartile Category'` (demographic/quartile category) combination. The quartiles represent the top/bottom 25% of percentages. The pupil status demographics are `'English Learner'`, `'Foster'`, `'Homeless'`, `'Migrant'`, `'Students with Disabilities'`, and `'Socioeconomically Disadvantaged'`.

##### 8.1.2.1. Check the Counts for Each Pupil Status Demographic Quartile Category

We did this to verify that there were no major imbalances in the number of California schools in each demographic quartile category and no demographic/quartile category combinations with very few schools.

In [30]:
demographic_data_2.groupby(
    ['Demographic', 'Demographic (%) - Quartile Category']
)['District Label'].count().reset_index()

Unnamed: 0,Demographic,Demographic (%) - Quartile Category,District Label
0,English Learner,Bottom 25%,237
1,English Learner,Middle 50%,462
2,English Learner,Top 25%,233
3,Foster,Bottom 25%,320
4,Foster,Middle 50%,331
5,Foster,Top 25%,281
6,Homeless,Bottom 25%,243
7,Homeless,Middle 50%,451
8,Homeless,Top 25%,238
9,Migrant,Bottom 25%,569


There seem to be many California schools associated with each demographic/quartile category combination. This suggests that the average per-pupil spending values we calculate for these combinations should be reasonably representative of the actual average per-pupil spending for these combinations.

##### 8.1.2.2. Create the Bar Chart Using the `'create_average_exp_bars'` Function

We chose to exclude the `'Middle 50%'` quartile category because we wanted to focus on the average per-pupil spending for different demographics when their percentages are higher or lower.

In [31]:
# Determine the average 'Expense per ADA' for the top/bottom quartiles of status demographic %s using the 
# filtered and melted DataFrame and store it in a new DataFrame
demo_2_exp_bars_data = demographic_data_2.groupby(
    ['Demographic', 'Demographic (%) - Quartile Category']
)['Expense per ADA'].mean().reset_index()

# Rename the new DataFrame's 'Expense per ADA' column to 'Average Expense per ADA'
demo_2_exp_bars_data.rename(columns={'Expense per ADA': 'Average Expense per ADA'}, inplace=True)

# Filter the new DataFrame to exclude the 'Middle 50%' quartile category data
demo_2_exp_bars_data = demo_2_exp_bars_data[
    demo_2_exp_bars_data['Demographic (%) - Quartile Category'] != 'Middle 50%'
]

# Create a bar chart to show the average 'Expense per ADA' for the top/bottom quartiles of status demographic 
# %s using the new filtered DataFrame and the 'create_average_exp_bars' function
demo_2_exp_bars = create_average_exp_bars(
    demo_2_exp_bars_data,
    'Demographic',
    'Demographic (%) - Quartile Category',
    'Average Per-Pupil Spending for Top/Bottom Quartiles of Pupil Status Demographic Percentages',
    275, 175
)

# Display the bar chart of average 'Expense per ADA' for the top/bottom quartiles of status demographic %s
demo_2_exp_bars.show()

##### 8.1.2.3. Observations from the Bar Chart

- The only occurrence in which average per-pupil spending was higher AND percentages were higher rather than lower is for the `'Students with Disabilities'` demographic
- Highest average per-pupil spending occurred with lower percentages of `'Foster'` and `'Homeless'` students and with higher percentages of `'Students with Disabilities'` and `'Socioeconomically Disadvantaged'` students

### 8.2. Average Per-Pupil Spending for Each District Type by Location Type

This bar graph shows the average per-pupil spending (`'Expense per ADA'` values) for each `'Locale'`/`'District Type'` (location/district type) combination. The district types are `'Elementary'`, `'High'`, and `'Unified'` while the location types are `'City'`, `'Rural'`, `'Suburban'` and `'Town'`.

#### 8.2.1. Check the Counts for Each Location/District Type Combination

We did this to verify that there were no major imbalances in the number of schools in each location/district type category.

In [32]:
# print(district_and_expenses['Locale'].value_counts())
district_and_expenses.groupby(['Locale', 'District Type'])['District Label'].count().reset_index()

Unnamed: 0,Locale,District Type,District Label
0,City,Elementary,62
1,City,High,18
2,City,Unified,53
3,Not Reported,Elementary,5
4,Rural,Elementary,250
5,Rural,High,12
6,Rural,Unified,79
7,Suburban,Elementary,133
8,Suburban,High,23
9,Suburban,Unified,140


There seem to be some location/district type category combinations related to very few California schools. This implies that the average per-pupil spending values we calculate for these combinations may not be the most accurate. We identify this as a limitation regarding any insights we take from the data presented in this visualization.
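One way to make this limitation explicit is to flag the combinations whose counts fall below a chosen cutoff before averaging. The sketch below does this with pandas. `flag_small_groups` is a hypothetical helper, and the default cutoff of 30 is an arbitrary assumption rather than a value used in this notebook.

```python
import pandas as pd


def flag_small_groups(df, group_cols, min_count=30, id_col='District Label'):
    """Count rows per group and flag groups with fewer than min_count rows."""
    counts = df.groupby(group_cols)[id_col].count().reset_index(name='Count')
    # min_count is an assumed, adjustable cutoff for "too few to average reliably"
    counts['Small Group'] = counts['Count'] < min_count
    return counts


# Toy data (illustrative only, not taken from the dataset)
toy = pd.DataFrame({
    'Locale': ['City', 'City', 'City', 'City'],
    'District Type': ['Elementary', 'Elementary', 'Elementary', 'High'],
    'District Label': ['D1', 'D2', 'D3', 'D4'],
})
toy_counts = flag_small_groups(toy, ['Locale', 'District Type'], min_count=2)
print(toy_counts)
```

Applied to `district_and_expenses` with `['Locale', 'District Type']`, the flagged rows would mark the bars whose averages deserve extra caution.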

#### 8.2.2. Create the Bar Chart Using the `'create_average_exp_bars'` Function

We chose to exclude the `'Not Reported'` location type because only five schools have this location type and we wanted to focus on the location types that are related to more schools.

In [33]:
# Determine the average 'Expense per ADA' for each location/district type combination using the 
# original DataFrame and store it in a new DataFrame
loc_dist_exp_data = district_and_expenses.groupby(
    ['Locale', 'District Type']
)['Expense per ADA'].mean().reset_index()

# Rename the new DataFrame's 'Expense per ADA' column to 'Average Expense per ADA'
loc_dist_exp_data = loc_dist_exp_data.rename(columns={'Expense per ADA': 'Average Expense per ADA'})

# Filter the new DataFrame to exclude the 'Not Reported' location type data
loc_dist_exp_data = loc_dist_exp_data[loc_dist_exp_data['Locale'] != 'Not Reported']

# Create a bar chart to show the average 'Expense per ADA' for each location/district type combination using the 
# new filtered DataFrame and the 'create_average_exp_bars' function
loc_dist_exp_bars = create_average_exp_bars(
    loc_dist_exp_data,
    'Locale',
    'District Type',
    'Average Per-Pupil Spending for Each District Type by Location Type',
    130, 160
)

# Display the bar chart of average 'Expense per ADA' for each location/district type combination
loc_dist_exp_bars.show()

#### 8.2.3. Observations from the Bar Chart

- Average per-pupil spending was the highest across ALL district types for `'Rural'` schools, with `'Rural'`/`'Unified'` schools having the highest spending OVERALL
- Significant spending gaps exist between `'Rural'` spending and the next highest average for each district type
- `'Suburban'` and `'Town'` districts had the lowest average per-pupil spending of ~$20K across all district types
- Differences between average per-pupil spending across different district types seem to be minimal when `'Location Type'` = `'Town'`, `'City'`, or `'Suburban'` BUT `'Rural'` districts had more variation
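The size of the gap between the highest-spending location type and the runner-up can be computed directly from the averaged table. The sketch below assumes the column names of `loc_dist_exp_data`; `gap_to_runner_up` is a hypothetical helper, and the toy values are invented for illustration.

```python
import pandas as pd


def gap_to_runner_up(
    avg_df,
    group_col='District Type',
    label_col='Locale',
    value_col='Average Expense per ADA',
):
    """For each group, report the highest and second-highest labels and the gap between them."""
    rows = []
    for group_name, group in avg_df.groupby(group_col):
        top_two = group.nlargest(2, value_col)
        if len(top_two) < 2:
            continue  # A gap needs at least two values to compare
        first, second = top_two.iloc[0], top_two.iloc[1]
        rows.append({
            group_col: group_name,
            'Highest': first[label_col],
            'Runner-Up': second[label_col],
            'Gap': first[value_col] - second[value_col],
        })
    return pd.DataFrame(rows)


# Toy averages (illustrative only, not taken from the dataset)
toy = pd.DataFrame({
    'District Type': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'Locale': ['P', 'Q', 'R', 'P', 'Q', 'R'],
    'Average Expense per ADA': [30, 20, 10, 15, 25, 24],
})
toy_gaps = gap_to_runner_up(toy)
print(toy_gaps)
```

Because several of the underlying groups contain few districts, a large gap in this table is a prompt for closer inspection rather than evidence of a statistically significant difference.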

### 8.3. Total Student Enrollment by Race/Ethnicity Demographic and Funding Categories

The bar graph below shows the total student enrollment by race/ethnicity demographic and funding categories.

#### 8.3.1. Create the Bar Chart Using a Function and `plotly.express`

In [34]:
def plot_demo_funding_enrollment(
    df, 
    demo_cols, 
    funding_col='Funding', 
    enrollment_label='Enrollment', 
    color_sequence=None, 
    template='SIADS_593_Visuals'
):
    """
    Plots enrollment by demographic and funding status as a grouped bar chart.

    Args:
        df (pd.DataFrame): Source DataFrame.
        demo_cols (list): List of demographic columns to include.
        funding_col (str): Name of funding status column.
        enrollment_label (str): Column name for enrollment values in melted dataframe.
        color_sequence (list): List of color hex codes for bar colors.
        template (str): Plotly template to use.
    """
    # Reshape the DataFrame from wide to long format:
    # - Each row will represent a (Funding Category, Demographic, Enrollment) combination
    melted = df.melt(
        id_vars=[funding_col],            # Keep the funding column as an identifier
        value_vars=demo_cols,             # Demographic columns to turn into 'variable' rows
        var_name='Demographic',           # Name for new column containing demographic group
        value_name=enrollment_label       # Name for new column containing enrollment figures
    )

    # Aggregate enrollment by funding and demographic for total counts to plot 
    grouped = (
        melted
        .groupby([funding_col, 'Demographic'], as_index=False)[enrollment_label]
        .sum()                           # Sum enrollments for each (Funding, Demographic) pair
    )

    # Create a grouped bar chart using Plotly Express
    fig = px.bar(
        grouped,
        x='Demographic',                 # Demographic groups on the X-axis
        y=enrollment_label,              # Enrollment counts on the Y-axis
        color=funding_col,               # Group bars by funding status (color)
        barmode='group',                 # Place bars for different funding groups side by side
        color_discrete_sequence=color_sequence if color_sequence else px.colors.qualitative.Safe,  # Color palette
        title='Total Student Enrollment by Race/Ethnicity Demographic and Funding Categories',      # Chart title
        labels={enrollment_label: "Total Enrollment"},                                              # Y-axis label
        template=template                # Apply custom Plotly template
    )

    # Update figure layout: reverse legend, adjust margins for readability
    fig.update_layout(
        legend_traceorder='reversed',    # Show legend items in the reverse of the order traces were added
        margin=dict(t=115, b=135, l=100, r=185)
    )
    # Display the final chart
    fig.show()


# List of demographic columns to include in the plot
demographic_columns = [
    'African American', 'American Indian', 'Asian', 'Filipino', 'Hispanic', 
    'Pacific Islander', 'White', 'Two or More Races', 'Not Reported'
]

# Custom color sequence for funding categories
color_sequence = ['#BA351A', '#008179']

# Run the plotting function with chosen data and options
plot_demo_funding_enrollment(
    df=district_and_expenses, 
    demo_cols=demographic_columns, 
    color_sequence=color_sequence
)

#### 8.3.2. Observations from the Bar Chart

- The `'Hispanic'` demographic has the highest student enrollment in the State of California, followed by `'White'`, `'Asian'`, and `'African American'`.
- Students identifying as `'Hispanic'` have a high concentration attending `'Well-funded'` districts, as do students identifying as `'African American'`.
- Students identifying as `'Asian'`, `'Filipino'`, `'White'`, or `'Two or More Races'` appear significantly more concentrated in `'Underfunded'` districts.
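Because total enrollment differs so much between demographics, "concentration" is easier to compare as a percentage of each demographic's own enrollment than as a raw count. The sketch below computes that share per funding category. `funding_share_by_demo` is a hypothetical helper, and the toy columns `'Group A'` and `'Group B'` stand in for the real demographic columns.

```python
import pandas as pd


def funding_share_by_demo(df, demo_cols, funding_col='Funding'):
    """Return the percentage of each demographic's total enrollment in each funding category."""
    totals = df.groupby(funding_col)[demo_cols].sum()  # rows: funding, columns: demographics
    # Divide each column by that demographic's overall enrollment
    shares = totals.div(totals.sum(axis=0), axis=1) * 100
    return shares.T                                    # rows: demographics, columns: funding


# Toy data (illustrative only, not taken from the dataset)
toy = pd.DataFrame({
    'Funding': ['Well-funded', 'Underfunded', 'Well-funded'],
    'Group A': [10, 30, 10],
    'Group B': [50, 25, 25],
})
toy_shares = funding_share_by_demo(toy, ['Group A', 'Group B'])
print(toy_shares)
```

Each row of the result sums to 100, so demographics with very different total enrollments can be compared on the same scale.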

### 8.4. Charter vs. Non-Charter Enrollment: Top 5 Well-funded & Underfunded Districts

This graph shows the top five `'Well-funded'` and `'Underfunded'` districts by per-pupil spending (`'Expense per ADA'` values) and their charter vs. non-charter enrollment size (`'Enroll Charter'` and `'Enroll Non Charter'` values).

#### 8.4.1. Create the Bar Chart Using a Function and `plotly.express`

In [35]:
def plot_charter_vs_noncharter_enrollment(
    df,
    funding_col='Funding',
    charter_col='Enroll Charter',
    noncharter_col='Enroll Non Charter',
    expense_col='Expense per ADA',
    district_label_col='District Label',
    n=5,
    charter_fund_scheme=None,
    template='SIADS_593_Visuals'
):
    """
    Plots Charter vs Non-Charter Enrollment for top/bottom n districts by per-pupil expense.

    Args:
        df (pd.DataFrame): Input data.
        funding_col (str): Column indicating funding status.
        charter_col (str): Column for Charter enrollment.
        noncharter_col (str): Column for Non-Charter enrollment.
        expense_col (str): Column for sorting districts.
        district_label_col (str): Column for x-axis labeling.
        n (int): Number from top and bottom to display.
        charter_fund_scheme (dict): Nested dict of color mappings.
        template (str): Plotly theme template.
    """
    # If not provided, set default color scheme for both funding and charter status
    if charter_fund_scheme is None:
        charter_fund_scheme = {
            'Well-funded': {'Charter': '#EE982C',  'Non-Charter': '#1A9FBA'},   # Orange for Charter, Blue for Non-Charter (Well-funded)
            'Underfunded': {'Charter': '#BA351A',  'Non-Charter': '#6B2A2D'}    # Red for Charter, Maroon for Non-Charter (Underfunded)
        }

    # Sort districts by per-pupil expense, descending, to find top and bottom n districts
    df_sorted = df.sort_values(expense_col, ascending=False)
    # Concatenate top n and bottom n records for plotting
    combined = pd.concat([df_sorted.head(n), df_sorted.tail(n)])

    # Create an empty figure to which we will add bar traces
    fig = go.Figure()

    # Loop through both funding categories to add bars for each
    for funding in ['Well-funded', 'Underfunded']:
        # Select only districts in the current funding group
        subset = combined[combined[funding_col] == funding]
        # Add Charter bars for current funding group
        fig.add_bar(
            x=subset[district_label_col],                  # x-axis: District label/name
            y=subset[charter_col],                         # y-axis: Charter enrollment count
            name=f"{funding} - Charter",                   # Trace legend label
            marker_color=charter_fund_scheme[funding]['Charter']  # Color from scheme
        )
        # Add Non-Charter bars for current funding group
        fig.add_bar(
            x=subset[district_label_col],                  # x-axis: District label/name
            y=subset[noncharter_col],                      # y-axis: Non-Charter enrollment count
            name=f"{funding} - Non-Charter",               # Trace legend label
            marker_color=charter_fund_scheme[funding]['Non-Charter']  # Color from scheme
        )

    # Set up layout: side-by-side bars, HTML-formatted subtitle, axis labels, and aesthetics
    fig.update_layout(
        barmode='group',                                   # Grouped bars by district
        title=(
            f'Charter vs Non-Charter Enrollment: Top {n} Well-funded & Underfunded Districts'
            '<br><span style="font-size:12px; color:#5A6184;">'
            'Comparison of Enrollment Levels by District Type and Funding Category'
            '</span>'
        ),
        xaxis_title='District',
        yaxis_title='Enrollment',
        legend_title='Category',
        xaxis_tickangle=-45,                               # Angle x labels for readability
        template=template,                                 # Plotly visual theme
        margin=dict(b=250, l=150, r=275)                  # Expand margins to fit long district names
    )
    # Display the final grouped bar chart
    fig.show()

# Produces the figure with default colors and data
plot_charter_vs_noncharter_enrollment(district_and_expenses)

#### 8.4.2. Observations from the Bar Chart

- A high volume of California students attend charter schools in extremely `'Well-funded'` districts, while students continue to attend non-charter public schools in extremely `'Underfunded'` districts.
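The observation above can be quantified by computing the charter share of total enrollment among the highest- and lowest-spending districts. The sketch below assumes the same column names the plotting function uses by default; `charter_share_extremes` is a hypothetical helper, and the toy values are invented for illustration.

```python
import pandas as pd


def charter_share_extremes(
    df,
    n=5,
    charter_col='Enroll Charter',
    noncharter_col='Enroll Non Charter',
    expense_col='Expense per ADA',
):
    """Return the charter share (%) of enrollment for the top n and bottom n districts by spending."""
    ranked = df.sort_values(expense_col, ascending=False)
    result = {}
    for label, subset in [('Top', ranked.head(n)), ('Bottom', ranked.tail(n))]:
        charter = subset[charter_col].sum()
        total = charter + subset[noncharter_col].sum()
        # Guard against division by zero when a subset has no enrollment at all
        result[label] = charter / total * 100 if total else float('nan')
    return pd.Series(result, name='Charter Share (%)')


# Toy data (illustrative only, not taken from the dataset)
toy = pd.DataFrame({
    'Expense per ADA': [30, 25, 15, 10],
    'Enroll Charter': [100, 300, 0, 50],
    'Enroll Non Charter': [100, 100, 200, 150],
})
toy_shares = charter_share_extremes(toy, n=2)
print(toy_shares)
```

With only `n` districts on each side, a single large district can dominate the share, so the result describes these specific districts rather than charter enrollment statewide.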
