Urban Data Science & Smart Cities <br>
URSP688Y <br>
Instructor: Chester Harvey <br>
Urban Studies & Planning <br>
National Center for Smart Growth <br>
University of Maryland

[<img src="https://colab.research.google.com/assets/colab-badge.svg">](https://colab.research.google.com/github/ncsg/ursp688y_sp2024/blob/main/exercises/exercise04/exercise04.ipynb)

# Exercise 4 (in Two Parts)

# The Data Viz Part

Next week is Data Visualization week. This is one of my favorite topics, in part because we get to look at lots of pictures, and in part because it provides an excuse for some very lighthearted competition.

In prep for next week, part of your exercise is to find an example of either an _excellent_ or _terrible_ data visualization. We will vote on the best (and worst) in each category, and the winner gets a small (tasty) prize.

Please find an example of a data visualization that is either _very effective_ or _terribly ineffective_ in communicating an interesting finding from data. Here are a few ground rules:
- One figure only: We should be able to see the whole thing at once on a projector screen.
- Static images only: If you find something dynamic or interactive that you _must_ submit, please take a screenshot.
- Do the reading first: Tufte will give you some ideas for what makes visualizations good or bad.
- No examples from Tufte. Gotta work a little bit.

Please either paste a link to your image in the text cell below (can you figure out how to get markdown to display the image?) or add an image file to your PR.
- Please label it clearly as "good" or "bad" so we know which race you're in.
- Please write a couple of bullets about why it's good or bad. This is your pitch (we can haggle about it in class, too).

## Good/Bad (please choose one and delete the other)
- Why
- Some more why
- Any more?

***** Put image link or insert image here *****

# The Programming Part

## Problem

In [Exercise 3](https://github.com/ncsg/ursp688y_sp2024/blob/main/exercises/exercise03/exercise03.ipynb), you examined how many affordable housing units available to households up to 60% AMI were planned within each ward in Washington, D.C.

The bonus problem was to calculate which wards were producing a _disproportionately_ large and small number of housing units given their populations.

This week, please reproduce this analysis, <ins>including</ins> the bonus part, using some of your new data loading, joining, and module-building skills.

Please write a program that:

- Loads the affordable housing project data from `affordable_housing.csv`
- Loads the ward populations from `wards_from_2022.csv`
- Joins the population data to the affordable housing data
- Calculates which wards are producing disproportionately large and small numbers of housing units given their populations
- Completes all of this data loading and processing within a function (or a series of functions called by a single main function)
- Stores that function (and any related functions) in a module
- Calls the main function in the exercise notebook to return a table or other summary of results
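
One way the steps above could be organized in a module is sketched below. This is only an illustration, not a reference solution: the helper names (`sum_units_by_ward`, `add_population`, `add_disproportion`), the `share_ratio` metric (a ward's share of units divided by its share of population), and the default file paths are all assumptions. The column names are the ones used in the notebook cells further down, so adjust them if your CSVs differ.

```python
import pandas as pd

# Assumed column names, copied from the notebook cells in this exercise
UNIT_COLS = [
    "AFFORDABLE_UNITS_AT_0_30_AMI",
    "AFFORDABLE_UNITS_AT_31_50_AMI",
    "AFFORDABLE_UNITS_AT_51_60_AMI",
]


def sum_units_by_ward(housing):
    """Total units at or below 60% AMI for each ward."""
    by_ward = housing.groupby("MAR_WARD")[UNIT_COLS].sum()
    return by_ward.sum(axis=1).rename("units").reset_index()


def add_population(units, wards):
    """Attach each ward's population to its unit total."""
    pop = wards[["NAME", "POP100"]].rename(
        columns={"NAME": "MAR_WARD", "POP100": "population"}
    )
    return units.merge(pop, on="MAR_WARD", how="left")


def add_disproportion(table):
    """Compare each ward's share of units with its share of population."""
    out = table.copy()
    out["unit_share"] = out["units"] / out["units"].sum()
    out["pop_share"] = out["population"] / out["population"].sum()
    # Hypothetical metric: above 1 means more units than population alone suggests
    out["share_ratio"] = out["unit_share"] / out["pop_share"]
    return out.sort_values("share_ratio", ascending=False).reset_index(drop=True)


def main(housing_path="affordable_housing.csv", wards_path="wards_from_2022.csv"):
    """Load both CSVs and return one summary row per ward."""
    housing = pd.read_csv(housing_path)
    wards = pd.read_csv(wards_path)
    return add_disproportion(add_population(sum_units_by_ward(housing), wards))
```

If this were saved as `affordable_housing_calcs.py` next to the notebook, the notebook would only need `from affordable_housing_calcs import main` followed by `main()`.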

## Data

CSVs for both required data tables are included on GitHub at `exercises/exercise04`.

Please consult the city's database of [affordable housing](https://opendata.dc.gov/datasets/DCGIS::affordable-housing/about) projects and [ward demographic](https://opendata.dc.gov/datasets/DCGIS::wards-from-2022/about) data.

Bonus: find, download, and use more recent ward population data. (Remember to include it in your PR.) My cursory search found data as late as 2022.

## New instructions for submitting a PR with multiple files

Because you'll be working with multiple files, PRs become _slightly_ more complicated, so we're graduating to a new 'mini-repository' pattern:
- Make a new folder in `exercises/exercise04` with your last name (just like the suffix for your notebook file)
- Upload your notebook file, also appropriately named, into that folder
- Upload any other files you make/use, including `.py` and `.csv` files, into that folder, so everything is together in the same place

Ultimately, this will look a bit like this:
```
── exercises
    ├── exercise04
        ├── harvey
            ├── exercise04_harvey.ipynb
            ├── affordable_housing_calcs.py
            ├── affordable_housing.csv
            └── wards_from_2022.csv
```

**NOTE:** Yes, I realize this is a bit redundant because everyone will have copies of the same CSV files. This would never be a good idea for production coding--we would have one `data` directory, and everyone would draw from the same data. However, there are two reasons for all these copies in this case:
1. It's good practice to build a repository with all the parts your code needs to run.
    - In later weeks, when you _don't_ all have the same data, it won't seem as redundant.
2. Having everything in one folder will make it easy for me to run your code on my computer.

## Hints
- You may want to join the population data _after_ summarizing the affordable housing data (i.e., join populations to sums of units). However, I could also see an approach where you join at the beginning, then aggregate the population column with a method called `first`.
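
The two join orders described in the hint can be compared on a toy example. The numbers below are made up for illustration and are not taken from the D.C. datasets; `units` is a hypothetical stand-in for the summed affordable-unit columns.

```python
import pandas as pd

# Made-up numbers for illustration only
projects = pd.DataFrame({
    "MAR_WARD": ["Ward 1", "Ward 1", "Ward 2"],
    "units": [30, 10, 20],
})
wards = pd.DataFrame({
    "MAR_WARD": ["Ward 1", "Ward 2"],
    "POP100": [1000, 500],
})

# Approach A: summarize first, then join one population value per ward
summed = projects.groupby("MAR_WARD", as_index=False)["units"].sum()
approach_a = summed.merge(wards, on="MAR_WARD", how="left")

# Approach B: join first (population repeats on every project row),
# then sum the units and keep the first population value per ward
joined = projects.merge(wards, on="MAR_WARD", how="left")
approach_b = joined.groupby("MAR_WARD", as_index=False).agg(
    units=("units", "sum"),
    POP100=("POP100", "first"),
)

print(approach_a)
print(approach_b)
```

In approach B, aggregating the population with `sum` instead of `first` would multiply each ward's population by its number of projects, which is why `first` is the safer choice there.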



In [None]:
# Import your module

# Call your main function


In [98]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [99]:
import os

os.getcwd()

'/content/drive/My Drive/ursp688y_shared_data_Saiful'

In [100]:
# Change working directory to the shared data folder
wd_path = '/content/drive/MyDrive/ursp688y_shared_data_Saiful'
os.chdir(wd_path)

print(f'cwd: {os.getcwd()}')

os.path.isfile('affordable_housing.csv')
os.path.isfile('Wards_from_2022.csv')

cwd: /content/drive/MyDrive/ursp688y_shared_data_Saiful


True

In [101]:
import pandas as pd


# Load affordable housing data
affordable_housing = pd.read_csv("affordable_housing.csv")
# Filter the data for units under construction or in the development pipeline
affordable_housing_filtered = affordable_housing[affordable_housing['STATUS_PUBLIC'].isin(['Under Construction', 'Pipeline'])]
# Exclude rows where MAR_WARD is the bare string '1' (valid values look like 'Ward 1')
affordable_housing_cln = affordable_housing_filtered.loc[affordable_housing_filtered['MAR_WARD'] != '1']
affordable_housing_cln['MAR_WARD'].unique()


array(['Ward 6', 'Ward 8', 'Ward 1', 'Ward 5', 'Ward 4', 'Ward 3',
       'Ward 7', 'Ward 2'], dtype=object)

In [102]:
# Group by ward and sum the affordable units
# (select the columns with a list; indexing a groupby with a bare tuple of keys is deprecated)
ward_affordable_units = affordable_housing_cln.groupby('MAR_WARD')[['AFFORDABLE_UNITS_AT_0_30_AMI', 'AFFORDABLE_UNITS_AT_31_50_AMI', 'AFFORDABLE_UNITS_AT_51_60_AMI']].sum().reset_index()
ward_affordable_units['total_affordable_units'] = ward_affordable_units[['AFFORDABLE_UNITS_AT_0_30_AMI', 'AFFORDABLE_UNITS_AT_31_50_AMI', 'AFFORDABLE_UNITS_AT_51_60_AMI']].sum(axis=1)
print(ward_affordable_units)

  MAR_WARD  AFFORDABLE_UNITS_AT_0_30_AMI  AFFORDABLE_UNITS_AT_31_50_AMI  \
0   Ward 1                           258                            411   
1   Ward 2                            23                             17   
2   Ward 3                            77                             87   
3   Ward 4                           126                            229   
4   Ward 5                           600                            683   
5   Ward 6                           659                            575   
6   Ward 7                           528                           1140   
7   Ward 8                          1189                           1760   

   AFFORDABLE_UNITS_AT_51_60_AMI  total_affordable_units  
0                            785                    1454  
1                            220                     260  
2                            280                     444  
3                            277                     632  
4                            


In [103]:
# Find the ward with the most and fewest total affordable units
most_affordable_ward = ward_affordable_units.loc[ward_affordable_units['total_affordable_units'].idxmax()]
fewest_affordable_ward = ward_affordable_units.loc[ward_affordable_units['total_affordable_units'].idxmin()]

In [104]:
# Print the ward with the most and fewest total affordable units
print("Ward with the most total affordable units:")
print(most_affordable_ward)
print("\nWard with the fewest total affordable units:")
print(fewest_affordable_ward)

Ward with the most total affordable units:
MAR_WARD                         Ward 8
AFFORDABLE_UNITS_AT_0_30_AMI       1189
AFFORDABLE_UNITS_AT_31_50_AMI      1760
AFFORDABLE_UNITS_AT_51_60_AMI      1343
total_affordable_units             4292
Name: 7, dtype: object

Ward with the fewest total affordable units:
MAR_WARD                         Ward 2
AFFORDABLE_UNITS_AT_0_30_AMI         23
AFFORDABLE_UNITS_AT_31_50_AMI        17
AFFORDABLE_UNITS_AT_51_60_AMI       220
total_affordable_units              260
Name: 1, dtype: object


In [105]:
import pandas as pd

def analyze_housing_data():
    # Load affordable housing project data
    affordable_housing = pd.read_csv("affordable_housing.csv")

    # Load ward populations
    ward_populations = pd.read_csv("Wards_from_2022.csv")

    # Join ward population and housing-unit counts onto each project row
    housing_with_population = pd.merge(
        affordable_housing,
        ward_populations[['NAME', 'POP100', 'HU100']],
        left_on='MAR_WARD',
        right_on='NAME',
    )

    # Ratio of HU100 to POP100 from the ward table, repeated on every project row
    # (note: this does not use the affordable-unit columns)
    housing_with_population['prop_housing_units'] = housing_with_population['HU100'] / housing_with_population['POP100']
    # Mean of that ratio across all project rows, so wards with more projects weigh more
    mean_ratio = housing_with_population['prop_housing_units'].mean()

    # Split rows by whether their ratio is above or below the mean (no significance threshold is applied)
    disproportionately_large = housing_with_population[housing_with_population['prop_housing_units'] > mean_ratio]
    disproportionately_small = housing_with_population[housing_with_population['prop_housing_units'] < mean_ratio]

    return mean_ratio, disproportionately_large, disproportionately_small



In [106]:
# Call the function and store the returned values
mean_ratio, disproportionately_large, disproportionately_small = analyze_housing_data()

# Print or further process the results as needed
print("Mean Ratio:", mean_ratio)
print("\nWards with disproportionately large housing units:", disproportionately_large)
print("\nWards with disproportionately small housing units:", disproportionately_small)

Mean Ratio: 0.4986376195955401

Wards with disproportionately large housing units:              X          Y  OBJECTID MAR_WARD  \
0   -77.009383  38.910255     89281   Ward 6   
1   -77.009436  38.906403     89282   Ward 6   
2   -77.002499  38.877245     89296   Ward 6   
3   -77.015444  38.902768     89300   Ward 6   
4   -76.981976  38.879213     89308   Ward 6   
..         ...        ...       ...      ...   
872 -77.083052  38.956662     90023   Ward 3   
873 -77.077740  38.939937     90035   Ward 3   
874 -77.072940  38.932576     90059   Ward 3   
875 -77.071928  38.959224     90086   Ward 3   
876 -77.095042  38.945749     90138   Ward 3   

                                               ADDRESS  \
0    1520 North Capitol Street Northwest, Washingto...   
1    1200 North Capitol Street Northwest, Washingto...   
2    1100 2nd Place Southeast, Washington, District...   
3    307 K Street Northwest, Washington, District o...   
4    1600 Pennsylvania Avenue Southeast, Washingto