Federal and state governments publish a huge amount of data. You can find a large collection of it on Data.gov -- everything from land surveys to pollution to census data.
As programmers, we can use those datasets to ask and answer questions. We'll build upon a dataset centered around schools in Colorado provided by the Annie E. Casey Foundation. What can we learn about education across the state?
Starting with the CSV data we will:
- build a "Data Access Layer" which allows us to query/search the underlying data
- build a "Relationships Layer" which creates connections between related data
- build an "Analysis Layer" which uses the data and relationships to draw conclusions
- Use tests to drive both the design and implementation of code
- Decompose a large application into components such as parsers, repositories, and analysis tools
- Use test fixtures instead of actual data when testing
- Connect related objects together through references
- Learn an agile approach to building software
- One team member forks the repository at https://github.com/turingschool-examples/headcount and adds the other(s) as collaborators.
- Everyone on the team clones the repository
- Set up SimpleCov to monitor test coverage along the way
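As a minimal sketch of the SimpleCov step, coverage is usually started at the very top of a shared test helper, before any application code is loaded. The file path and the commented-out `require_relative` target below are assumptions for illustration, not names the project prescribes. The `rescue` branch is only there so the snippet still runs when the gem is missing.

```ruby
# test/test_helper.rb (hypothetical location; adjust to your project layout)
# SimpleCov must start before any application code is required,
# otherwise those files are not tracked.
begin
  require 'simplecov'
  SimpleCov.start
rescue LoadError
  # The gem is not installed; install it or add it to your Gemfile.
  warn 'simplecov is not installed; running tests without coverage'
end

# Require application files only after SimpleCov.start, for example:
# require_relative '../lib/district_repository' # hypothetical file name
```

Each test file would then require this helper first, so that coverage is recorded no matter which test is run.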
During this project, we'll be working with a large body of data that covers various information about Colorado school districts.
The data is divided into multiple CSV files, with the concept of a District being the unifying piece of information across the various data files.
Districts are identified by simple names (strings) and are listed under the `Location` column in each file.
So, for example, the file `Kindergartners in full-day program.csv` contains data about Kindergarten enrollment rates over time. Let's look at the file headers along with a sample row:
Location,TimeFrame,DataFormat,Data
AGUILAR REORGANIZED 6,2007,Percent,1
The `Location` column indicates the District (`AGUILAR REORGANIZED 6`), which will reappear as a District in other data files as well. The other columns indicate various information about the statistic being reported. Note that percentages appear as decimal values out of `1`, with `1` meaning 100% enrollment.
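To make the row format concrete, here is a small sketch that reads the header and sample row with Ruby's standard `csv` library and groups the values by district. The constant names (`SAMPLE_CSV`, `ROWS`, `RATES_BY_DISTRICT`) are invented for this example, and the data is inlined rather than read from the real file.

```ruby
require 'csv'

# The header and sample row from the text, inlined so the sketch is self-contained.
SAMPLE_CSV = <<~CSV
  Location,TimeFrame,DataFormat,Data
  AGUILAR REORGANIZED 6,2007,Percent,1
CSV

# Parse with headers so each row can be read by column name.
# The :symbol converter downcases headers, e.g. "TimeFrame" becomes :timeframe.
ROWS = CSV.parse(SAMPLE_CSV, headers: true, header_converters: :symbol)

# Group rows by district name, then map each year to its decimal rate.
RATES_BY_DISTRICT = ROWS.group_by { |row| row[:location] }
                        .transform_values do |district_rows|
  district_rows.to_h { |row| [row[:timeframe].to_i, row[:data].to_f] }
end

# Look up the 2007 rate for the sample district.
puts RATES_BY_DISTRICT['AGUILAR REORGANIZED 6'][2007]
```

Keep in mind that `String#to_f` turns any non-numeric string into `0.0` without raising, so a real parser may want to validate the `Data` column before converting it.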
With the idea of a District sitting at the top of our overall data hierarchy (it's the thing around which all the other information is organized), we can now look at the secondary layers.
We will ultimately be performing analysis across numerous data files within the project, but it turns out that there are generally multiple files dealing with related concepts. The overarching data themes we'll be working with include:
- Enrollment - Information about enrollment rates across various grade levels in each district
- Statewide Testing - Information about test results in each district broken down by grade level, race, and ethnicity
- Economic Profile - Information about the socioeconomic profiles of students within each district
The files relevant to each data "category" are listed below. You'll find them in the `data` folder of the cloned repository.
Dropout rates by race and ethnicity.csv
High school graduation rates.csv
Kindergartners in full-day program.csv
Online pupil enrollment.csv
Pupil enrollment by race_ethnicity.csv
Pupil enrollment.csv
Special education.csv
3rd grade students scoring proficient or above on the CSAP_TCAP.csv
8th grade students scoring proficient or above on the CSAP_TCAP.csv
Average proficiency on the CSAP_TCAP by race_ethnicity_ Math.csv
Average proficiency on the CSAP_TCAP by race_ethnicity_ Reading.csv
Average proficiency on the CSAP_TCAP by race_ethnicity_ Writing.csv
Remediation in higher education.csv
Median household income.csv
School-aged children in poverty.csv
Students qualifying for free or reduced price lunch.csv
Title I students.csv
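One possible way to keep track of which file belongs to which theme is a frozen lookup table. The sketch below is only an illustration: `DATA_DIR`, `DATA_FILES`, and `data_paths` are hypothetical names, and the grouping is inferred from the three themes described in this document rather than taken from any required interface.

```ruby
# Hypothetical lookup table; the names here are illustrative assumptions.
DATA_DIR = 'data'

DATA_FILES = {
  enrollment: [
    'Dropout rates by race and ethnicity.csv',
    'High school graduation rates.csv',
    'Kindergartners in full-day program.csv',
    'Online pupil enrollment.csv',
    'Pupil enrollment by race_ethnicity.csv',
    'Pupil enrollment.csv',
    'Special education.csv'
  ],
  statewide_testing: [
    '3rd grade students scoring proficient or above on the CSAP_TCAP.csv',
    '8th grade students scoring proficient or above on the CSAP_TCAP.csv',
    'Average proficiency on the CSAP_TCAP by race_ethnicity_ Math.csv',
    'Average proficiency on the CSAP_TCAP by race_ethnicity_ Reading.csv',
    'Average proficiency on the CSAP_TCAP by race_ethnicity_ Writing.csv',
    'Remediation in higher education.csv'
  ],
  economic_profile: [
    'Median household income.csv',
    'School-aged children in poverty.csv',
    'Students qualifying for free or reduced price lunch.csv',
    'Title I students.csv'
  ]
}.freeze

# Returns the relative paths for one category.
# Raises KeyError for an unknown category instead of silently returning nil.
def data_paths(category)
  DATA_FILES.fetch(category).map { |name| File.join(DATA_DIR, name) }
end

puts data_paths(:economic_profile)
```

Using `fetch` rather than `[]` means a mistyped category name fails loudly at the call site.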
Ultimately, a crude visualization of the structure might look like this:
- District: Gives access to all the data relating to a single, named school district
|-- Enrollment: Gives access to enrollment data within that district, including:
| | -- Dropout rate information
| | -- Kindergarten enrollment rates
| | -- Online enrollment rates
| | -- Overall enrollment rates
| | -- Enrollment rates by race and ethnicity
| | -- High school graduation rates by race and ethnicity
| | -- Special education enrollment rates
|-- Statewide Testing: Gives access to testing data within the district, including:
| | -- 3rd grade standardized test results
| | -- 8th grade standardized test results
| | -- Subject-specific test results by race and ethnicity
| | -- Higher education remediation rates
|-- Economic Profile: Gives access to economic information within the district, including:
| | -- Median household income
| | -- Rates of school-aged children living below the poverty line
| | -- Rates of students qualifying for free or reduced price programs
| | -- Rates of students qualifying for Title I assistance
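The hierarchy above could be modeled with plain Ruby objects that hold references to one another. The following is a rough sketch under assumed names: the class names, the `kindergarten_participation` attribute, and the method signatures are all hypothetical and are not meant to restate the interface required by the iterations.

```ruby
# Hypothetical classes sketching the hierarchy; names and methods are
# illustrative assumptions, not the required interface.
class Enrollment
  attr_reader :kindergarten_participation

  # kindergarten_participation maps a year (Integer) to a decimal rate (Float).
  def initialize(kindergarten_participation: {})
    @kindergarten_participation = kindergarten_participation
  end

  # Returns the rate for a year, or nil when no data exists for that year.
  def kindergarten_participation_in_year(year)
    kindergarten_participation[year]
  end
end

class District
  attr_reader :name
  # Each accessor holds a reference to a secondary-layer object (or nil).
  attr_accessor :enrollment, :statewide_testing, :economic_profile

  def initialize(name)
    @name = name
  end
end

# Usage: build a district, then attach its enrollment data by reference.
district = District.new('AGUILAR REORGANIZED 6')
district.enrollment = Enrollment.new(kindergarten_participation: { 2007 => 1.0 })
puts district.enrollment.kindergarten_participation_in_year(2007)
```

Because `District` only holds references, the enrollment, testing, and economic objects can each be built and tested in isolation before they are connected.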
Because the requirements for this project are lengthy and complex, we've broken them into Iterations in their own files:
- Iteration 0 - District Kindergarten Data Access
- Iteration 1 - District Kindergarten Relationships & Analysis
- Iteration 2 - Remaining Enrollment Access & Analysis: High School Graduation
- Iteration 3 - Data Access & Relationships: Statewide Testing
- Iteration 4 - Data Access & Relationships: Economic Profile
- Iteration 5 - Analysis: Statewide Testing
- Iteration 6 - Analysis: Economic Profile
- Iteration 7 - Total Enrollment (coming soon)
- Iteration 8 - Special Education, Remediation, and Dropout Rates (coming soon)
The test harness for Headcount is here.
The project will be assessed with the following guidelines:
- 4: Application fulfills all expectations of Iterations 0 - 6 as well as one additional, comparable Iteration of your own design.
- 3: Application fulfills expectations of Iterations 0 - 4 as well as one of Iterations 5 or 6
- 2: Application has some missing functionality but no crashes
- 1: Application crashes during normal usage
- 4: Application is broken into components which are well tested in both isolation and integration using appropriate data
- 3: Application is well tested but does not balance isolation and integration tests, using only the data necessary to test the functionality
- 2: Application makes some use of tests, but the coverage is insufficient
- 1: Application does not demonstrate strong use of TDD
- 4: Application is expertly divided into logical components each with a clear, single responsibility
- 3: Application effectively breaks logical components apart but violates the Single Responsibility Principle (SRP)
- 2: Application shows some effort to break logic into components, but the divisions are inconsistent or unclear
- 1: Application logic shows poor decomposition with too much logic mashed together
- 4: Application demonstrates excellent knowledge of Ruby syntax, style, and refactoring
- 3: Application shows strong effort towards organization, content, and refactoring
- 2: Application runs but the code has long methods, unnecessary or poorly named variables, and needs significant refactoring
- 1: Application generates syntax errors or crashes during execution
- 4: Application consistently makes use of the best-choice Enumerable methods
- 3: Application demonstrates comfortable use of appropriate Enumerable methods
- 2: Application demonstrates functional knowledge of Enumerable but only uses the most basic techniques
- 1: Application demonstrates deficiencies with Enumerable and struggles with collections
The output from `rake sanitation:all` shows...
- 4: Zero complaints
- 3: Five or fewer complaints
- 2: Six to ten complaints
- 1: More than ten complaints
The original data files and more information about the data can be found here:
- Search Index
- Median Household Income
- School Aged Children in Poverty
- Pupil Enrollment
- Special Education
- Title 1 Students
- Students Qualifying for Free and Reduced Price Lunch
- Kindergarteners in Full-Day Program
- High School Graduation Rates
- Dropout Rates by Race and Ethnicity
- Online Pupil Enrollment
- Remediation in Higher Education
- 3rd Grade Students Scoring Proficient or Above on the CSAP/TCAP
- 8th Grade Students Scoring Proficient or Above on the CSAP/TCAP
- Pupil Enrollment by Race & Ethnicity
- Average Proficiency on the CSAP/TCAP by Race/Ethnicity: Reading
- Average Proficiency on the CSAP/TCAP by Race/Ethnicity: Math
- Average Proficiency on the CSAP/TCAP by Race/Ethnicity: Writing