# **Give a summary of what you think the following project is doing. Limit your answer to one paragraph.**

I believe this project analyzes how household structure changed across U.S. counties from 2009 to 2023, using data from the American Community Survey. It focuses on comparing trends in married-couple households with trends in single-person households, and on how the balance between the two has shifted over time. To do this, the project fits polynomial curves to each county's household counts, extracts features such as slope and acceleration, and evaluates how consistent the direction of change is. These features are then used to cluster counties and to build a model that predicts cluster membership from the Social Vulnerability Index (SVI), which includes a number of socioeconomic and demographic variables. I believe the general idea is to learn which social or economic factors go hand in hand with changes in household composition.


# **Please suggest at least one major non-technical improvement/correction they should do (e.g. writing, graphs, etc)**

I think the most pressing non-technical issue in this project is the inconsistent and unclear labeling of the graphs, which makes the key findings difficult to interpret. While the San Francisco plot (sf_eg.png) includes proper labels and a legend, most of the other figures lack clear titles, axis labels, or descriptive legends. For example, the elbow plot (kmeans_btwss_by_k_with_LA.png) uses raw variable names like km_bag[, 1], which offer no context for what the axes represent. Likewise, the cluster visualizations label the axes with code expressions like sdf$year and y1 instead of something more helpful, such as “Year” and perhaps “Normalized Household Count,” and they do not explain what makes each cluster distinct. These labeling issues significantly reduce readability, and I suspect an employer would be dissatisfied with them. Simple changes, such as giving every figure a descriptive title, clear axis labels, and a legend where appropriate, along with a brief explanation in the text, would significantly improve the overall quality of the project.


# **Please suggest at least one major technical improvement/correction they should do.**

On the technical side, the original code does seem to work, but I noticed some maintainability issues, partly because I have made similar mistakes in the past. The CSV file was loaded without any error handling, so the script could fail without a useful explanation if the file were missing or malformed. To fix this, I added error handling that reports clearly what went wrong.
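A minimal sketch of the kind of guarded loading described above. It uses only the standard library `csv` module so that it runs anywhere; the project itself may load the file differently, and the function name `load_csv_rows` and the exact checks are assumptions made for illustration, not the code from the project or from my revision.

```python
import csv
from pathlib import Path


def load_csv_rows(path):
    """Read a CSV file into a list of dicts, failing with a clear message.

    Hypothetical helper: the name and the specific checks are illustrative.
    """
    csv_path = Path(path)

    # Fail early with an explicit message if the file is missing.
    if not csv_path.is_file():
        raise FileNotFoundError(f"Input file not found: {csv_path}")

    try:
        with csv_path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
    except (csv.Error, UnicodeDecodeError) as exc:
        # Re-raise parse and encoding problems with the file name attached.
        raise ValueError(f"Could not parse {csv_path}: {exc}") from exc

    # A header-only or empty file is treated as an error as well.
    if not rows:
        raise ValueError(f"{csv_path} contains no data rows")
    return rows
```

Raising exceptions with the file path in the message keeps the failure visible to the caller while still saying exactly which input caused it.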

I also noticed that the repeated, hardcoded use of raw census codes (like B11002_003E) made the code harder to read and adapt. As with the graph labels, the codes are not intuitive, so it is hard to remember what they mean without looking them up. To make the code easier to read and to modify later, I introduced a dictionary called CENSUS_VARS that maps the raw codes to more understandable names like 'married' and 'unmarried'.
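A small sketch of how such a lookup might be used. Only the code B11002_003E and the names 'married' and 'unmarried' come from the discussion above; pairing B11002_003E with 'married' is an assumption, the code for 'unmarried' is left out rather than guessed, and `rename_census_columns` is a hypothetical helper.

```python
# Assumed pairing for illustration only; the remaining codes (for example
# the one behind "unmarried") would be added here once confirmed.
CENSUS_VARS = {
    "B11002_003E": "married",
}


def rename_census_columns(row, mapping=CENSUS_VARS):
    """Return a copy of a record with known census codes replaced by names.

    Keys that are not in the mapping are passed through unchanged.
    """
    return {mapping.get(key, key): value for key, value in row.items()}


record = {"GEOID": "06075", "B11002_003E": 100}
readable = rename_census_columns(record)
```

Keeping the codes in one dictionary means a change of variable or table only has to be made in a single place.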

The final thing I changed was how the polynomial features were built. The original code constructed them manually, computing terms such as x^2 and x^3 in a Python loop with NumPy. While this appeared to work for the project, it is more error-prone and harder to maintain than it needs to be, since a dedicated tool already exists. I replaced that section with PolynomialFeatures from scikit-learn, which generates the polynomial terms automatically. This should make the code cleaner and more reliable, and bring it closer to common data science practice.
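A short sketch of the replacement described above, compared against a manual loop of the kind it replaces. The degree of 3, the year range used as input, and the shift of years to start at zero are assumptions for illustration; the project's actual inputs and degree may differ.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Assumed input: one value per survey year, shifted so the first year is 0.
years = np.arange(2009, 2024)
x = (years - years.min()).reshape(-1, 1)  # column vector, one row per year

# include_bias=False drops the constant column, leaving x, x^2, x^3.
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(x)

# The manual equivalent: one column per power, built in a loop.
manual = np.column_stack([x[:, 0] ** power for power in range(1, 4)])
matches_manual = np.allclose(X_poly, manual)
```

For a single input column the two approaches produce the same matrix, so the benefit is mainly readability and fewer places for an off-by-one error in the exponents.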
