Data-Mini-1

Welcome to Data Mini #1! You and your teammates will have 2 hours to explore and prepare a presentation about your data analysis. The hour after will be spent presenting your findings.

This Git repository contains this README file and the dataset in both CSV and RDS formats. This README will include the dataset's dictionary, along with the rubric and policies. The CSV and the RDS have the same rows and columns, but the RDS will preserve R data types exactly (factors, dates, lists, etc.). We recommend you use Duke's Container Manager to access your preferred IDE.

How to Participate

Fork the repository emailed to you by trisha.iyer@duke.edu
Load the data from data/texasdata.csv or data/texasdata.rds
Come up with a question to answer with data analysis. Analyze the dataset and prepare a presentation of your findings.
AFTER ANALYZING TIMEFRAME: Submit your presentation slides that include the forked repository link to the Google Drive so we can screenshare it. Please do not attempt to push changes to this repository - it is read-only for participants.

Data Dictionary

Student-Teacher Race Match Dataset

About

Same-race teachers improve academic outcomes for minoritized students, yet most never encounter a teacher who shares their racial background. Existing research highlights the benefits of race-matched instruction but often focuses on aggregate workforce diversity, obscuring students’ structural opportunities for contact. QSIDE and students at Williams College introduce race-match sufficiency—the probability that a student has at least one same-race teacher—and estimate it using administrative data from 8,691 Texas public schools serving 5.4 million students in 2022-23.

texasdata.rds draws on three datasets, all accessed through TEA’s Texas Academic Performance Reports (TAPR) data portal (Texas Education Agency, 2023). The Campus Reference File (CREF) provides unique campus identifiers, structural characteristics, and campus type designations; its campus–district nesting reflects the governance structure that motivates our multilevel modeling approach. The Campus Profile File (CAMPPROF) reports student and teacher demographic counts by race and ethnicity, along with total enrollment and economic disadvantage measures. Finally, district classifications come from TEA’s District type, 2022–23 release, which applies National Center for Education Statistics (NCES) geographic categories (Texas Education Agency, 2024c). Merging these files yields a raw dataset of 9,044 campuses serving nearly 5.5 million students across 1,209 districts—a census of Texas public K–12 schools.

Data Variables

texasdata.rds: Cleaned campus-level dataset derived from Texas Education Agency (TEA) TAPR data and NCES district classifications. Includes:

Student and teacher race counts
School-level demographics and grade spans
Charter status and district type
Economic disadvantage proportions

variable	mode (in R)	description
campusName	character	Name of the school campus
campusNumber	character	Unique identifier for the campus assigned by TEA
countyName	character	Name of the county where the campus is located
countyNumber	character	Numeric code identifying the county
districtName	character	Name of the school district to which the campus belongs
districtNumber	character	Unique identifier for the district assigned by TEA
region	factor	TEA-designated education service region
schoolLevel	factor	Level of schooling offered (e.g., elementary, middle, high school)
grade_lo	numeric	Lowest grade level served by the campus
grade_hi	numeric	Highest grade level served by the campus
charterSchool	logical	Indicator for whether the campus is a charter school (TRUE/FALSE)
locationType	factor	Geographic classification of the district based on NCES categories (e.g., urban, suburban, rural)
studentsTotal	numeric	Total student enrollment at the campus
teachersTotal	numeric	Total number of teachers at the campus
s_econ_disadv_prop	numeric	Proportion of students classified as economically disadvantaged
race	factor	Racial/ethnic category for demographic breakdowns
student_frac	numeric	Proportion of total students in a given racial/ethnic category
teacher_frac	numeric	Proportion of total teachers in a given racial/ethnic category
student_count	numeric	Number of students in a given racial/ethnic category

Competition Policies

Code of Conduct

AI and Online Sources

AI tools and online resources are permitted with proper citations
Include citations for forums (Stack Overflow), textbooks, and other references
Provide a link to any AI chat conversations used
You must list all citations (including the original study’s citation found at the bottom of this README) at the end of your code notebook

Code Reuse

You may reuse code from your own research or class projects
You may NOT copy from the original student-teacher race match study - doing so will result in a major point deduction

Analysis

Must answer a question that your team chooses
All types of data modeling are allowed
All coding languages are allowed

Submission Requirements

We will give teams ~5 minutes after the 2-hour analysis period to upload their presentation media (slides, PDF, etc.) to the Google Drive folder

This presentation media must include a link to your forked repository and the citation of the dataset.
You and your teammates will have received an email from susan.li@duke.edu near the beginning of the competition start
Your team’s Google Drive folder contents will not be accessible to view, comment on, or edit by any other team

Presentation Guidelines

Present using any format: Quarto doc, slides, visualizations, etc: Your chosen presentation media will be projected by an SSMU member through your Google Drive
Presentations will be recorded during the event.
Recordings will be sent to you and your teammates a few days following the event
SSMU would love to share your presentation recording with QSIDE and the original dataset creators! The dataset and study were created by previous undergraduate students at Williams College in 2023. If you would like to opt into this opportunity, please let us know!
Presentation order will be randomized, but let us know beforehand if your team has a preference for going first, last, or in the middle

Judging

Winners will be announced on Sunday, February 15
As stated earlier, you must come up with a question that your team will use to guide your statistical analysis.
A fair evaluation will be justified by multiple people deciding on competition winners.
The judging criteria is as follows:
1. Communication: Clarity of presentation, logical structure, and ability to explain findings to a broad audience. - 20%
1. Creativity & Insight: Depth of analysis, creativity in approach, and ability to identify meaningful patterns. - 40%
1. Reproducibility: Well-organized code, clear documentation, and transparent workflows. - 5%
1. Visualization & Design: Effective use of charts and visuals to support your conclusions. - 25%
1. Technical Soundness: Appropriate use of statistical methods, reasonable assumptions, and valid interpretations. - 10%
This will be prioritized more than the experience level of the statistical methods your team members used. This will account for differing experience levels across competitors.

Other Competitions from SSMU

Duke’s StatSci Majors Union will be hosting Data Mini #2 from Feb. 28 - Mar. 1. This competition will be a little different: There will be a one-hour information session on Feb. 27th from 5:30-6:30 pm in Old Chem 001. From there, you and your team will have until Mar. 1 at 11:59 pm to submit materials and a video recording of findings to a Google Drive folder. Please contact Chelsea Nguyen (chelsea.nguyen@duke.edu) with any questions!
In addition, ASA DataFest 2026 at Duke, organized by the Department of Statistical Science, will take place in Perkins Library on Duke University West Campus, Friday, March 20, through Sunday, March 22. Similar to Data Minis, DataFest is a data analysis competition where teams of up to five students attack a large, complex, and surprise data set over a weekend. Please reach out to Dr. Alexander Fisher (alexander.fisher@duke.edu) with any questions.

Dataset Acknowledgment

The Student-Teacher Race-Match dataset and study were created by then-undergraduate students Anchal Bhaskar, Kayla Chang, Iman Najib, Olivia Thornton, Jules Tucher, and Régan Schwartz under the guidance of Professor Chad M. Topaz at Williams College.

This data is freely accessible online through the Institute for the Quantitative Study of Inclusion, Diversity, and Equity (QSIDE).

The citation for the replication package is: Bhaskar, A., Chang, K., Najib, I., Thornton, O., Tucher, J., Schwartz, R., & Topaz, C. M. (2025). Sufficiency in Teacher–Student Race Matching. Manuscript

This citation must be included at the end of your code notebook.

About This Event

Data Mini #1 was organized by Duke's Statistical Science Majors Union - Competitions and Opportunities Committee. Contributing members of the committee and the SSMU executive team are listed below:

Hanna Chee, ‘29
Trisha Iyer, ‘28
Shivansh Gupta, ‘28
Hyunjin Lee, ‘27
Susan Li, ‘29
Phillip Lin, ‘27
Liane Ma, ‘27
Chelsea Nguyen, ‘28
Cooper Ruffing, ‘27
Derek Wang, ‘28
Amy Xu, ‘27
Allison Yang, ‘27
And an additional thank-you goes out to the professors who helped advertise the event.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
Data_mini_files		Data_mini_files
.gitignore		.gitignore
Data-Mini-1.Rproj		Data-Mini-1.Rproj
Data_mini.html		Data_mini.html
Data_mini.pptx		Data_mini.pptx
Data_mini.qmd		Data_mini.qmd
README.md		README.md
texasdata.csv		texasdata.csv
texasdata.rds		texasdata.rds

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data-Mini-1

How to Participate

Data Dictionary

Student-Teacher Race Match Dataset

About

Data Variables

Competition Policies

Code of Conduct

AI and Online Sources

Code Reuse

Analysis

Submission Requirements

Presentation Guidelines

Judging

Other Competitions from SSMU

Dataset Acknowledgment

About This Event

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Data-Mini-1

How to Participate

Data Dictionary

Student-Teacher Race Match Dataset

About

Data Variables

Competition Policies

Code of Conduct

AI and Online Sources

Code Reuse

Analysis

Submission Requirements

Presentation Guidelines

Judging

Other Competitions from SSMU

Dataset Acknowledgment

About This Event

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages