# Clustering

*Justin R. Garrard*

### *Executive Summary*

This section covers the **Data Preparation** and **Modeling** phases of the CRISP-DM process.


### *Objectives*


1. **[Feature Selection]** To prototype the selection of features and data processing required before clustering.


2. **[Clustering Experimentation]** To prototype the clustering process, experimenting until a satisfactory result is produced.



### Setup

In [1]:
# Import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # explicit import instead of a wildcard
from sklearn.cluster import KMeans

In [2]:
# Declare global variables
DATA_DIR = os.path.join('..', 'data', 'processed')
DATA_FILE = os.path.join(DATA_DIR, 'processed.csv')
plt.style.use('ggplot')

CLUSTERING_COLS = ['leaid', 'year', 'lea_name', 'fips', 'number_of_schools', 'cbsa_type',
                   'teachers_total_fte', 'spec_ed_students',
                   'enrollment_x',
                   'read_test_num_valid', 'read_test_pct_prof_midpt', 'math_test_num_valid',
                   'math_test_pct_prof_midpt', 'rev_total', 'exp_total']

## Set a target year for early analysis
TGT_YEAR = 2016

In [6]:
# Useful functions
def null_counter(df):
    record_nulls = []
    for col in df.columns:
        nulls = df[col].isnull().sum()
        percent_null = round((nulls / df.shape[0]) * 100, 2)
        record_nulls.append([col, nulls, percent_null])
    output = pd.DataFrame(record_nulls, columns=['Attribute', 'Null Count', '% Null'])
    return output

def get_year_range(df):
    year_range = list(df['year'].unique())
    year_range.sort()
    return year_range

def subset_by_states_only(df):
    ## FIPS codes up to 56 cover the 50 states and DC; territories use higher codes
    df = df[df['fips'] <= 56]
    return df

def sound_off(df):
    """Print the shape, year range, and null counts of the given dataframe."""
    ## Use the dataframe passed in, not the global cluster_df
    n_rows, n_cols = df.shape
    print(f'There are {n_rows} rows and {n_cols} columns.')
    print('')

    year_range = get_year_range(df)
    print(f'Data spans the years {year_range[0]} to {year_range[-1]}.')
    print('')

    print('Available columns include:')
    display(null_counter(df))
    
def filter_out_factor(df, column_name):
    ## Identify records with null values in column
    bad_records = df[df[column_name].isnull()]
    bad_records.to_csv(f'missing_{column_name}.csv')

    ## Drop records with null values in column
    df = df[df[column_name].notnull()]
    return df

### Data Preparation

In this section we load the processed data and filter it down to records complete enough to cluster.

***High-Level Overview***

We tried to choose a subset of columns in which the data was mostly complete. That meant disqualifying rows that:

* ... were not states (i.e. territories).

* ... did not have reported scores for standardized tests.


We were especially disappointed to have to remove "english_language_learners" from the clustering data. The literature frequently cites this factor as significant, but more than 6,000 records in our limited set have no reported value for it.
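
The decision to drop a sparse column can be checked with a quick null-share calculation. The sketch below is a minimal illustration on a made-up frame: `toy_df`, its values, the helper `null_share`, and the `MAX_PCT_NULL` cut-off are all assumptions for demonstration, not values taken from the processed data.

```python
import numpy as np
import pandas as pd

def null_share(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of null values per column, largest first."""
    return (df.isnull().mean() * 100).round(2).sort_values(ascending=False)

# Toy frame standing in for the processed data (values are made up)
toy_df = pd.DataFrame({
    'enrollment_x': [100, 250, np.nan, 400],
    'english_language_learners': [np.nan, np.nan, np.nan, 12],
})

shares = null_share(toy_df)

# Hypothetical cut-off: flag columns where more than half the values are missing
MAX_PCT_NULL = 50.0
too_sparse = shares[shares > MAX_PCT_NULL].index.tolist()

print(shares)
print(too_sparse)
```

Columns flagged this way are candidates for removal; whether to drop the column or the affected rows remains a judgment call that depends on how many records would be lost.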

In [4]:
# Load and preview data
## Isolate by specific columns
cluster_df = pd.read_csv(DATA_FILE)[CLUSTERING_COLS]
## Filter out non-state records
cluster_df = subset_by_states_only(cluster_df)
## Filter by year
cluster_df = cluster_df[cluster_df['year'] == TGT_YEAR]

sound_off(cluster_df)



There are 18654 rows and 15 columns.

Data spans the years 2016 to 2016.

Available columns include:


Unnamed: 0,Attribute,Null Count,% Null
0,leaid,0,0.0
1,year,0,0.0
2,lea_name,0,0.0
3,fips,0,0.0
4,number_of_schools,0,0.0
5,cbsa_type,1,0.01
6,teachers_total_fte,885,4.74
7,spec_ed_students,2238,12.0
8,enrollment_x,1383,7.41
9,read_test_num_valid,2374,12.73


In [7]:
# Remove records with missing test scores
cluster_df = filter_out_factor(cluster_df, 'read_test_num_valid')
cluster_df = filter_out_factor(cluster_df, 'math_test_num_valid')
sound_off(cluster_df)

There are 16272 rows and 15 columns.

Data spans the years 2016 to 2016.

Available columns include:


Unnamed: 0,Attribute,Null Count,% Null
0,leaid,0,0.0
1,year,0,0.0
2,lea_name,0,0.0
3,fips,0,0.0
4,number_of_schools,0,0.0
5,cbsa_type,0,0.0
6,teachers_total_fte,31,0.19
7,spec_ed_students,467,2.87
8,enrollment_x,37,0.23
9,read_test_num_valid,0,0.0


In [8]:
# Remove records with missing spec_ed_students data
cluster_df = filter_out_factor(cluster_df, 'spec_ed_students')
sound_off(cluster_df)

There are 15805 rows and 15 columns.

Data spans the years 2016 to 2016.

Available columns include:


Unnamed: 0,Attribute,Null Count,% Null
0,leaid,0,0.0
1,year,0,0.0
2,lea_name,0,0.0
3,fips,0,0.0
4,number_of_schools,0,0.0
5,cbsa_type,0,0.0
6,teachers_total_fte,28,0.18
7,spec_ed_students,0,0.0
8,enrollment_x,6,0.04
9,read_test_num_valid,0,0.0


In [9]:
# Remove records with missing teachers_total_fte and enrollment_x
cluster_df = filter_out_factor(cluster_df, 'teachers_total_fte')
cluster_df = filter_out_factor(cluster_df, 'enrollment_x')
sound_off(cluster_df)

There are 15772 rows and 15 columns.

Data spans the years 2016 to 2016.

Available columns include:


Unnamed: 0,Attribute,Null Count,% Null
0,leaid,0,0.0
1,year,0,0.0
2,lea_name,0,0.0
3,fips,0,0.0
4,number_of_schools,0,0.0
5,cbsa_type,0,0.0
6,teachers_total_fte,0,0.0
7,spec_ed_students,0,0.0
8,enrollment_x,0,0.0
9,read_test_num_valid,0,0.0


In [10]:
# Remove the columns that won't be used as features
cluster_prepared_df = cluster_df.drop(['leaid', 'year', 'lea_name', 'fips'], axis=1)

### Clustering

The purpose of this tool is specifically *descriptive* analytics. In short, we are looking to understand our underlying data, rather than build predictions.
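
Since the goal is descriptive, the number of clusters is itself something to experiment with. One common heuristic is to compare the within-cluster sum of squares (scikit-learn's `inertia_`) across several values of `k` and look for the point where improvement levels off. The sketch below runs on synthetic data; `toy_X`, the blob centers, and the range of `k` are assumptions for illustration, not the settings used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(777)

# Synthetic stand-in: three well-separated 2-D blobs of 50 points each
toy_X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 2))
    for center in ([0, 0], [10, 10], [-10, 10])
])

# Fit one learner per candidate k and record its inertia
inertia_by_k = {}
for k in range(1, 7):
    learner = KMeans(n_clusters=k, n_init=10, random_state=777)
    learner.fit(toy_X)
    inertia_by_k[k] = learner.inertia_

for k, inertia in inertia_by_k.items():
    print(f'k={k}: inertia={inertia:.1f}')
```

On data like this the inertia drops sharply up to the true number of blobs and only marginally afterwards. Real district data is unlikely to show such a clean elbow, so the curve is a guide rather than a rule.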


In [23]:
# Set up a KMeans learner
## Parameters
clusters = 8
random_seed = 777

## Constructor
kmeans_learner = KMeans(n_clusters=clusters, random_state=random_seed)

In [24]:
# Fit the KMeans learner with the available data
results = kmeans_learner.fit_predict(cluster_prepared_df)

In [25]:
# Display the results, looking for patterns
print(results.shape)
print()
print(results[0])

(15772,)

0
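
A quick way to look for patterns in the labels is to count how many records fall into each cluster. The sketch below uses a small made-up label array (`toy_labels`) rather than the notebook's `results`; the same call should work on any one-dimensional array of integer labels.

```python
import numpy as np
import pandas as pd

# Made-up labels standing in for the output of fit_predict
toy_labels = np.array([0, 0, 2, 0, 1, 0, 2])

# Count members per cluster, ordered by cluster label
label_counts = pd.Series(toy_labels).value_counts().sort_index()
print(label_counts)
```

Very small clusters are worth inspecting by hand, since they often correspond to outliers rather than a meaningful group.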


In [26]:
# Attach the labels to the original dataframe
cluster_df['labels'] = results

In [28]:
# View the characteristics of each labeled cluster
for i in range(clusters):
    subset = cluster_df[cluster_df['labels'] == i]
    print(i)
    display(subset.describe())
    print()

0


Unnamed: 0,leaid,year,fips,number_of_schools,cbsa_type,teachers_total_fte,spec_ed_students,enrollment_x,read_test_num_valid,read_test_pct_prof_midpt,math_test_num_valid,math_test_pct_prof_midpt,rev_total,exp_total,labels
count,13085.0,13085.0,13085.0,13085.0,13085.0,13085.0,13085.0,13085.0,13085.0,13085.0,13085.0,13085.0,13085.0,13085.0,13085.0
mean,3010011.0,2016.0,29.983645,3.014597,0.411769,71.577153,159.549026,1088.980741,567.93206,50.009744,566.77906,44.463279,13717310.0,13518980.0,0.0
std,1437162.0,0.0,14.367338,3.292272,1.489271,125.473898,424.919548,1966.599519,969.655554,19.606918,980.509037,21.710719,13110250.0,12977470.0,0.0
min,100005.0,2016.0,1.0,0.0,-2.0,0.0,3.0,0.0,1.0,-3.0,1.0,-3.0,-2.0,-2.0,0.0
25%,1918030.0,2016.0,19.0,1.0,-2.0,19.59,36.0,278.0,135.0,36.0,135.0,28.0,3757000.0,3624000.0,0.0
50%,3013560.0,2016.0,30.0,2.0,1.0,45.0,87.0,647.0,337.0,51.0,337.0,42.0,8908000.0,8700000.0,0.0
75%,4023550.0,2016.0,40.0,4.0,1.0,94.84,199.0,1416.0,742.0,65.0,739.0,60.0,19896000.0,19543000.0,0.0
max,5606240.0,2016.0,56.0,117.0,2.0,4044.66,17373.0,62815.0,30703.0,99.5,31107.0,99.5,64802000.0,65219000.0,0.0



1


Unnamed: 0,leaid,year,fips,number_of_schools,cbsa_type,teachers_total_fte,spec_ed_students,enrollment_x,read_test_num_valid,read_test_pct_prof_midpt,math_test_num_valid,math_test_pct_prof_midpt,rev_total,exp_total,labels
count,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0
mean,2470779.0,2016.0,24.666667,292.75,1.0,13374.658333,25767.416667,214392.666667,118422.0,51.916667,120495.916667,49.916667,2760856000.0,2848292000.0,1.0
std,1509334.0,0.0,15.041357,91.051559,0.0,3827.102007,8120.370832,70948.782071,43897.358262,13.701084,43705.233958,16.962302,503640500.0,581064000.0,0.0
min,1200180.0,2016.0,12.0,206.0,1.0,7951.87,14887.0,130814.0,64640.0,28.0,65782.0,20.0,2121511000.0,2057207000.0,1.0
25%,1201298.0,2016.0,12.0,220.75,1.0,11336.6825,19222.75,175915.0,92319.5,48.5,92858.75,41.25,2314538000.0,2337748000.0,1.0
50%,1950255.0,2016.0,19.5,282.0,1.0,12483.81,24666.0,196697.5,114972.0,52.0,117560.5,52.5,2777372000.0,2840278000.0,1.0
75%,3454792.0,2016.0,34.5,316.25,1.0,16299.575,31556.5,230042.5,133671.0,54.25,132664.75,58.0,3121526000.0,3136256000.0,1.0
max,5101260.0,2016.0,51.0,528.0,1.0,20884.0,38604.0,357249.0,212368.0,84.0,211532.0,77.0,3679802000.0,3785166000.0,1.0



2


Unnamed: 0,leaid,year,fips,number_of_schools,cbsa_type,teachers_total_fte,spec_ed_students,enrollment_x,read_test_num_valid,read_test_pct_prof_midpt,math_test_num_valid,math_test_pct_prof_midpt,rev_total,exp_total,labels
count,133.0,133.0,133.0,133.0,133.0,133.0,133.0,133.0,133.0,133.0,133.0,133.0,133.0,133.0,133.0
mean,2849099.0,2016.0,28.323308,63.398496,1.0,2556.291278,5530.353383,43738.308271,23098.308271,52.428571,22999.210526,49.75188,570889300.0,580339700.0,2.0
std,1777744.0,0.0,17.800046,23.504977,0.0,1105.739745,2111.903881,15254.260511,8809.259726,18.459737,8823.419516,21.97878,113046400.0,127686500.0,0.0
min,102370.0,2016.0,1.0,7.0,1.0,0.0,9.0,3137.0,450.0,16.0,453.0,10.0,402834000.0,410991000.0,2.0
25%,1200150.0,2016.0,12.0,48.0,1.0,1833.15,4181.0,34656.0,17513.0,39.0,17248.0,31.0,470552000.0,482863000.0,2.0
50%,2733840.0,2016.0,27.0,61.0,1.0,2557.81,5148.0,42746.0,22695.0,51.0,22440.0,47.0,553093000.0,547540000.0,2.0
75%,4814280.0,2016.0,48.0,79.0,1.0,3225.99,6711.0,53157.0,28568.0,67.0,28466.0,71.0,652022000.0,648751000.0,2.0
max,5508520.0,2016.0,55.0,127.0,1.0,6159.37,12977.0,78957.0,42795.0,91.0,44022.0,93.0,903136000.0,982566000.0,2.0



3


Unnamed: 0,leaid,year,fips,number_of_schools,cbsa_type,teachers_total_fte,spec_ed_students,enrollment_x,read_test_num_valid,read_test_pct_prof_midpt,math_test_num_valid,math_test_pct_prof_midpt,rev_total,exp_total,labels
count,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
mean,622710.0,2016.0,6.0,1012.0,1.0,28088.4,85751.0,633621.0,319838.0,40.0,322262.0,30.0,10578720000.0,9824700000.0,3.0
std,,,,,,,,,,,,,,,
min,622710.0,2016.0,6.0,1012.0,1.0,28088.4,85751.0,633621.0,319838.0,40.0,322262.0,30.0,10578720000.0,9824700000.0,3.0
25%,622710.0,2016.0,6.0,1012.0,1.0,28088.4,85751.0,633621.0,319838.0,40.0,322262.0,30.0,10578720000.0,9824700000.0,3.0
50%,622710.0,2016.0,6.0,1012.0,1.0,28088.4,85751.0,633621.0,319838.0,40.0,322262.0,30.0,10578720000.0,9824700000.0,3.0
75%,622710.0,2016.0,6.0,1012.0,1.0,28088.4,85751.0,633621.0,319838.0,40.0,322262.0,30.0,10578720000.0,9824700000.0,3.0
max,622710.0,2016.0,6.0,1012.0,1.0,28088.4,85751.0,633621.0,319838.0,40.0,322262.0,30.0,10578720000.0,9824700000.0,3.0



4


Unnamed: 0,leaid,year,fips,number_of_schools,cbsa_type,teachers_total_fte,spec_ed_students,enrollment_x,read_test_num_valid,read_test_pct_prof_midpt,math_test_num_valid,math_test_pct_prof_midpt,rev_total,exp_total,labels
count,496.0,496.0,496.0,496.0,496.0,496.0,496.0,496.0,496.0,496.0,496.0,496.0,496.0,496.0,496.0
mean,2687259.0,2016.0,26.725806,28.100806,1.004032,1096.137056,2397.647177,18953.137097,9783.772177,53.419355,9780.800403,48.699597,261489800.0,261497600.0,4.0
std,1690988.0,0.0,16.935213,13.508019,0.063436,492.89701,1056.33336,8011.756563,4696.599655,17.543526,4679.194225,20.172901,65506750.0,64997090.0,0.0
min,100007.0,2016.0,1.0,3.0,1.0,0.0,16.0,2381.0,578.0,12.0,578.0,7.0,155260000.0,165480000.0,4.0
25%,691046.5,2016.0,6.0,19.0,1.0,750.315,1692.75,13313.0,6654.0,40.0,6652.25,33.0,204329000.0,207801800.0,4.0
50%,2693335.0,2016.0,26.5,27.0,1.0,1009.2,2272.0,18021.0,9267.0,52.0,9106.0,48.0,253303000.0,248566500.0,4.0
75%,4205790.0,2016.0,42.0,36.0,1.0,1364.2375,2959.5,23910.25,12467.0,67.0,12512.0,63.25,302422800.0,303688200.0,4.0
max,5604510.0,2016.0,56.0,78.0,2.0,2744.25,7389.0,44352.0,25117.0,97.0,25457.0,98.0,434969000.0,459706000.0,4.0



5


Unnamed: 0,leaid,year,fips,number_of_schools,cbsa_type,teachers_total_fte,spec_ed_students,enrollment_x,read_test_num_valid,read_test_pct_prof_midpt,math_test_num_valid,math_test_pct_prof_midpt,rev_total,exp_total,labels
count,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0
mean,2902620.0,2016.0,28.947368,133.710526,1.0,5733.942895,11177.394737,91664.263158,48394.684211,48.657895,48899.868421,47.078947,1248235000.0,1279461000.0,5.0
std,1679499.0,0.0,16.787942,44.827542,0.0,2109.286438,4027.488598,33947.642568,19318.664196,18.711575,20130.750422,20.421745,283395600.0,291568700.0,0.0
min,614550.0,2016.0,6.0,59.0,1.0,2715.0,5334.0,34293.0,14319.0,17.0,14392.0,15.0,857988000.0,933454000.0,5.0
25%,1300412.0,2016.0,13.0,106.25,1.0,4190.24,8440.75,73874.0,36247.25,33.25,36247.0,28.25,1032429000.0,1027427000.0,5.0
50%,2451605.0,2016.0,24.5,126.5,1.0,5648.675,10379.0,88386.5,47237.5,48.0,50540.0,46.5,1199210000.0,1230440000.0,5.0
75%,4782500.0,2016.0,47.75,167.25,1.0,6864.125,13412.5,110088.5,58812.0,56.5,58609.0,58.5,1396518000.0,1423654000.0,5.0
max,5509600.0,2016.0,55.0,240.0,1.0,11125.2,21822.0,178214.0,95602.0,87.0,109581.0,90.0,1972752000.0,1937798000.0,5.0



6


Unnamed: 0,leaid,year,fips,number_of_schools,cbsa_type,teachers_total_fte,spec_ed_students,enrollment_x,read_test_num_valid,read_test_pct_prof_midpt,math_test_num_valid,math_test_pct_prof_midpt,rev_total,exp_total,labels
count,2006.0,2006.0,2006.0,2006.0,2006.0,2006.0,2006.0,2006.0,2006.0,2006.0,2006.0,2006.0,2006.0,2006.0,2006.0
mean,2918350.0,2016.0,29.051346,10.999501,1.044865,394.077856,875.807577,6379.858923,3324.984546,55.764955,3327.952642,50.954885,93448320.0,93982400.0,6.0
std,1511047.0,0.0,15.124714,6.106086,0.531296,185.865776,434.634266,3167.335882,1812.951963,17.366784,1820.292745,19.097925,32840320.0,33277370.0,0.0
min,100006.0,2016.0,1.0,1.0,-2.0,0.0,3.0,254.0,32.0,3.0,26.0,2.0,38547000.0,48444000.0,6.0
25%,1721855.0,2016.0,17.0,7.0,1.0,262.875,572.0,4061.5,2045.0,43.0,2044.25,37.25,65905250.0,65749750.0,6.0
50%,3401485.0,2016.0,34.0,10.0,1.0,357.18,791.0,5645.5,2952.5,56.0,2967.0,51.0,84543500.0,85502000.0,6.0
75%,4200046.0,2016.0,42.0,14.0,1.0,483.3375,1099.0,8186.75,4347.25,69.0,4361.75,66.0,115272500.0,115960200.0,6.0
max,5605830.0,2016.0,56.0,78.0,2.0,1405.86,3078.0,19475.0,12382.0,97.0,12424.0,98.0,185164000.0,208300000.0,6.0



7


Unnamed: 0,leaid,year,fips,number_of_schools,cbsa_type,teachers_total_fte,spec_ed_students,enrollment_x,read_test_num_valid,read_test_pct_prof_midpt,math_test_num_valid,math_test_pct_prof_midpt,rev_total,exp_total,labels
count,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
mean,1709930.0,2016.0,17.0,585.0,1.0,19016.08,51826.0,378199.0,184139.0,28.0,185465.0,24.0,5840203000.0,5827667000.0,7.0
std,,,,,,,,,,,,,,,
min,1709930.0,2016.0,17.0,585.0,1.0,19016.08,51826.0,378199.0,184139.0,28.0,185465.0,24.0,5840203000.0,5827667000.0,7.0
25%,1709930.0,2016.0,17.0,585.0,1.0,19016.08,51826.0,378199.0,184139.0,28.0,185465.0,24.0,5840203000.0,5827667000.0,7.0
50%,1709930.0,2016.0,17.0,585.0,1.0,19016.08,51826.0,378199.0,184139.0,28.0,185465.0,24.0,5840203000.0,5827667000.0,7.0
75%,1709930.0,2016.0,17.0,585.0,1.0,19016.08,51826.0,378199.0,184139.0,28.0,185465.0,24.0,5840203000.0,5827667000.0,7.0
max,1709930.0,2016.0,17.0,585.0,1.0,19016.08,51826.0,378199.0,184139.0,28.0,185465.0,24.0,5840203000.0,5827667000.0,7.0
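
The summaries above differ mostly in size-related columns, and two clusters contain a single district each. KMeans relies on Euclidean distance, so features with large magnitudes such as `rev_total` and `exp_total` can dominate the assignment. One possible follow-up experiment, not part of the run above, is to standardize the features before fitting. The sketch below is a minimal illustration on a made-up frame; `toy_df` and its values are assumptions, and `StandardScaler` is only one of several scaling options.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up districts: rows 0-1 score low, rows 2-3 score high;
# revenue splits the rows differently (0 and 2 small, 1 and 3 large)
toy_df = pd.DataFrame({
    'read_test_pct_prof_midpt': [20.0, 22.0, 80.0, 82.0],
    'math_test_pct_prof_midpt': [18.0, 21.0, 78.0, 81.0],
    'rev_total': [1.0e6, 2.0e6, 1.2e6, 2.2e6],
})

# Without scaling, the revenue column drives the distance calculation
unscaled = KMeans(n_clusters=2, n_init=10, random_state=777).fit_predict(toy_df)

# With scaling, every column has zero mean and unit variance
scaled_features = StandardScaler().fit_transform(toy_df)
scaled = KMeans(n_clusters=2, n_init=10, random_state=777).fit_predict(scaled_features)

print('unscaled labels:', unscaled)
print('scaled labels:  ', scaled)
```

In this toy case the unscaled run groups districts by revenue, while the scaled run groups them by test proficiency. Whether scaling gives more useful clusters for the real data is an open question for further experimentation.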





In [29]:
# Output labeled data
cluster_df.to_csv('processed_labeled.csv')