Urban Data Science & Smart Cities <br>
URSP688Y <br>
Instructor: Chester Harvey <br>
Urban Studies & Planning <br>
National Center for Smart Growth <br>
University of Maryland

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


[<img src="https://colab.research.google.com/assets/colab-badge.svg">](https://colab.research.google.com/github/ncsg/ursp688y_sp2024/blob/main/exercises/exercise04/exercise04.ipynb)

# Exercise 4 (in Two Parts)

# The Data Viz Part

Next week is Data Visualization week. This is one of my favorite topics, in part because we get to look at lots of pictures, and in part because it provides an excuse for some very lighthearted competition.

In prep for next week, part of your exercise is to find an example of either an _excellent_ or _terrible_ data visualization. We will vote on the best (and worst) in each category, and the winner gets a small (tasty) prize.

Please find an example of a data visualization that is either _very effective_ or _terribly ineffective_ in communicating an interesting finding from data. Here are a few ground rules:
- One figure only: We should be able to see the whole thing at once on a projector screen.
- Static images only: If you find something dynamic or interactive that you _must_ submit, please take a screenshot.
- Do the reading first: Tufte will give you some ideas for what makes visualizations good or bad.
- No examples from Tufte. Gotta work a little bit.

Please either paste a link to your image in the text cell below (can you figure out how to get markdown to display the image?) or add an image file to your PR.
- Please label it clearly as "good" or "bad" so we know which race you're in.
- Please write a couple bullets about why it's good or bad. This is your pitch (we can haggle about it in class, too.)

## Good/Bad (please choose one and delete the other)
- Why
- Some more why
- Any more?

***** Put image link or insert image here *****

# The Programming Part

## Problem

In [Exercise 3](https://github.com/ncsg/ursp688y_sp2024/blob/main/exercises/exercise03/exercise03.ipynb), you examined how many affordable housing units available to households up to 60% AMI were planned within each ward in Washington, D.C.

The bonus problem was to calculate which wards were producing a _disproportionately_ large and small number of housing units given their populations.

This week, please reproduce this analysis, <ins>including</ins> the bonus part, using some of your new data loading, joining, and module-building skills.

Please write a program that:

- Loads the affordable housing project data from `affordable_housing.csv`
- Loads the ward populations from `wards_from_2022.csv`
- Joins the population data to the affordable housing data
- Calculates which wards are producing disproportionately large and small numbers of housing units given their populations
- Completes all of this data loading and processing within a function (or a series of functions called by a single main function)
- Stores that function (and any related functions) in a module
- Calls the main function in the exercise notebook to return a table or other summary of results
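
The steps in this list could be organized roughly as follows. This is only a minimal sketch, not a reference solution: the function names (`summarize_units`, `join_populations`, `main`) and the output column names are hypothetical, the input column names (`MAR_WARD`, `NAME`, `POP100`, and the three `AFFORDABLE_UNITS_AT_*` columns) are assumed from the two CSVs, and any filtering by project status from Exercise 3 is left out.

```python
import pandas as pd

# Unit columns for households up to 60% AMI (column names assumed from the housing CSV).
UNIT_COLS = [
    "AFFORDABLE_UNITS_AT_0_30_AMI",
    "AFFORDABLE_UNITS_AT_31_50_AMI",
    "AFFORDABLE_UNITS_AT_51_60_AMI",
]


def summarize_units(housing):
    """Sum units up to 60% AMI for each ward and return a two-column DataFrame."""
    totals = housing.groupby("MAR_WARD")[UNIT_COLS].sum().sum(axis=1)
    return totals.rename("units_up_to_60_ami").reset_index()


def join_populations(units, wards):
    """Attach each ward's population (POP100) to its unit total."""
    return units.merge(wards[["NAME", "POP100"]], left_on="MAR_WARD", right_on="NAME")


def main(housing_csv="affordable_housing.csv", wards_csv="wards_from_2022.csv"):
    """Load both tables, join them, and return units per resident by ward."""
    housing = pd.read_csv(housing_csv)
    wards = pd.read_csv(wards_csv)
    result = join_populations(summarize_units(housing), wards)
    result["units_per_resident"] = result["units_up_to_60_ami"] / result["POP100"]
    return result.sort_values("units_per_resident", ascending=False)
```

If these functions were saved in a module file next to the notebook, the notebook would only need to import that module and call `main()`.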

## Data

CSVs for both required data tables are included on GitHub at `exercises/exercise04`.

Please consult the city's database of [affordable housing](https://opendata.dc.gov/datasets/DCGIS::affordable-housing/about) projects and [ward demographic](https://opendata.dc.gov/datasets/DCGIS::wards-from-2022/about) data.

Bonus: find, download, and use more recent ward population data. (Remember to include it in your PR.) My cursory search found data as late as 2022.

## New instructions for submitting a PR with multiple files

Because you'll be working with multiple files, PRs become _slightly_ more complicated, so we're graduating to a new 'mini-repository' pattern:
- Make a new folder in `exercises/exercise04` with your last name (just like the suffix for your notebook file)
- Upload your notebook file, also appropriately named, into that folder
- Upload any other files you make/use, including `.py` and `.csv` files, into that folder, so everything is together in the same place

Ultimately, this will look a bit like this:
```
── exercises
    ├── exercise04
        ├── harvey
            ├── exercise04_harvey.ipynb
            ├── affordable_housing_calcs.py
            ├── affordable_housing.csv
            └── wards_from_2022.csv
```

**NOTE:** Yes, I realize this is a bit redundant because everyone will have copies of the same CSV files. This would never be a good idea for production coding--we would have one `data` directory, and everyone would draw from the same data. However, there are two reasons for all these copies in this case:
1. It's good practice to build a repository with all the parts your code needs to run.
    - In later weeks, when you _don't_ all have the same data, it won't seem as redundant.
2. Having everything in one folder will make it easy for me to run your code on my computer.
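
One way to keep a folder like this runnable on someone else's computer is to refer to data files by name, relative to the folder that holds the notebook, rather than by a machine-specific absolute path. A minimal sketch, assuming the CSVs sit next to the notebook; the helper `data_path` is hypothetical:

```python
from pathlib import Path


def data_path(filename, base_dir="."):
    """Build a path to a data file relative to the folder holding the notebook."""
    return Path(base_dir) / filename


# Refer to the CSVs by name rather than by a path that exists on only one machine.
housing_csv = data_path("affordable_housing.csv")
wards_csv = data_path("wards_from_2022.csv")
print(housing_csv, wards_csv)
```

Either path can then be passed directly to `pd.read_csv`.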

## Hints
- You may want to join the population data _after_ summarizing the affordable housing data (i.e., join populations to sums of units). However, I could also see an approach where you join at the beginning, then aggregate the population column with a method called `first`.
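
The second approach in this hint (join first, then aggregate with `first`) might look something like the sketch below. The toy tables and their values are made up for illustration; only the `MAR_WARD`, `NAME`, and `POP100` column names are assumed from the exercise data.

```python
import pandas as pd

# Toy project and population tables; the values are made up.
projects = pd.DataFrame({
    "MAR_WARD": ["Ward 1", "Ward 1", "Ward 2"],
    "units": [10, 5, 7],
})
pops = pd.DataFrame({"NAME": ["Ward 1", "Ward 2"], "POP100": [100, 200]})

# Join first: every project row now carries its ward's population.
joined = projects.merge(pops, left_on="MAR_WARD", right_on="NAME")

# Aggregate: sum the units, but keep only the first (repeated) population per ward.
by_ward = joined.groupby("MAR_WARD").agg(
    units=("units", "sum"),
    population=("POP100", "first"),
)
by_ward["units_per_resident"] = by_ward["units"] / by_ward["population"]
print(by_ward)
```

Summing the population column instead would count each ward's residents once per project, which is why `first` is the appropriate aggregation here.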



In [None]:
# Create the main function
import pandas as pd

def john():
    # Load the affordable housing projects and the ward populations
    df = pd.read_csv('/content/drive/MyDrive/ursp688y_shared_data/ursp688y_shared_data/affordable_housing.csv')
    wd2 = pd.read_csv('/content/drive/MyDrive/ursp688y_shared_data/ursp688y_shared_data/Wards_from_2022.csv')

    # Keep only projects that are under construction or in the pipeline
    filtered_df = df[(df["STATUS_PUBLIC"].str.contains("Under Construction")) | (df["STATUS_PUBLIC"].str.contains("Pipeline"))]

    # Select the three unit columns up to 60% AMI, plus the grouping column
    # (.copy() avoids pandas' SettingWithCopyWarning when a column is added below)
    ward_units_filtered = filtered_df[["MAR_WARD", "AFFORDABLE_UNITS_AT_0_30_AMI", "AFFORDABLE_UNITS_AT_31_50_AMI", "AFFORDABLE_UNITS_AT_51_60_AMI"]].copy()

    # Add up the units at or below 60% AMI for each project, then sum by ward
    ward_units_filtered["TOTAL_AFFORDABLE_UNITS_UP_TO_60%_AMI"] = ward_units_filtered["AFFORDABLE_UNITS_AT_0_30_AMI"] + ward_units_filtered["AFFORDABLE_UNITS_AT_31_50_AMI"] + ward_units_filtered["AFFORDABLE_UNITS_AT_51_60_AMI"]
    ward_units_filtered_sum = ward_units_filtered.groupby(["MAR_WARD"])["TOTAL_AFFORDABLE_UNITS_UP_TO_60%_AMI"].agg('sum')

    # Join ward populations (POP100) and housing units (HU100) to the unit totals.
    # This is an inner join, so a MAR_WARD label with no matching NAME is dropped.
    housing_projects_with_pops = pd.merge(
        ward_units_filtered_sum,
        wd2[['NAME', 'POP100', 'HU100']],
        left_on='MAR_WARD',
        right_on='NAME')

    # Calculate affordable units per resident
    housing_projects_with_pops['housing_projects_per_pop'] = housing_projects_with_pops['TOTAL_AFFORDABLE_UNITS_UP_TO_60%_AMI'] / housing_projects_with_pops['POP100']

    return housing_projects_with_pops


In [None]:
john()

Unnamed: 0,TOTAL_AFFORDABLE_UNITS_UP_TO_60%_AMI,NAME,POP100,HU100,housing_projects_per_pop
0,1454,Ward 1,85285,45694,0.017049
1,260,Ward 2,89485,53217,0.002906
2,444,Ward 3,85301,44109,0.005205
3,632,Ward 4,84660,34650,0.007465
4,2252,Ward 5,89617,41794,0.025129
5,2434,Ward 6,84266,52768,0.028885
6,2310,Ward 7,85685,38968,0.026959
7,4292,Ward 8,85246,39164,0.050348


In [None]:
# Import dependencies
import pandas as pd
import os
os.getcwd()
#from settings import PROJECT_ROOT

# Call your main function

'/content'

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Check that the affordable housing CSV exists at the expected path
abs_path ='/content/drive/MyDrive/ursp688y_shared_data/ursp688y_shared_data/affordable_housing.csv'
os.path.isfile(abs_path)


True

In [1]:
import pandas as pd  # imported here so this cell runs on its own

wd = pd.read_csv('/content/drive/MyDrive/ursp688y_shared_data/ursp688y_shared_data/affordable_housing.csv')

In [None]:

wd.tail()

Unnamed: 0,X,Y,OBJECTID,MAR_WARD,ADDRESS,PROJECT_NAME,STATUS_PUBLIC,AGENCY_CALCULATED,TOTAL_AFFORDABLE_UNITS,LATITUDE,...,AFFORDABLE_UNITS_AT_31_50_AMI,AFFORDABLE_UNITS_AT_51_60_AMI,AFFORDABLE_UNITS_AT_61_80_AMI,AFFORDABLE_UNITS_AT_81_AMI,CASE_ID,MAR_ID,XCOORD,YCOORD,FULLADDRESS,GIS_LAST_MOD_DTTM
873,-77.022849,38.955067,90154,Ward 4,"710 Jefferson Street Northwest, Washington, Di...",710 Jefferson St NW,Completed 2015 to Date,DHCD,14,38.955059,...,7,0,2,0,,250236,398019.73,143017.57,710 JEFFERSON STREET NW,2024/02/05 05:00:27+00
874,-76.93219,38.88498,90155,Ward 7,"4922 Call Place Southeast, Washington, Distric...",Amber Overlook,Completed 2015 to Date,DHCD,32,38.884973,...,6,0,26,0,,145732,405883.52,135239.33,4922 CALL PLACE SE,2024/02/05 05:00:27+00
875,-76.989295,38.882209,90156,Ward 6,"1220 Pennsylvania Avenue Southeast, Washington...",Rushmore,Completed 2015 to Date,DHCD,11,38.882208,...,1,10,0,0,,314729,400929.03,134929.53,1220 PENNSYLVANIA AVENUE SE,2024/02/05 05:00:27+00
876,-77.022682,38.917214,90157,Ward 1,"2009 8th Street Northwest, Washington, Distric...",2009 8th St NW,Completed 2015 to Date,DHCD,10,38.917206,...,1,9,0,0,,242769,398033.11,138815.59,2009 8TH STREET NW,2024/02/05 05:00:27+00
877,-77.020728,38.913367,90158,Ward 6,"660 Glick Court NW, Washington, District of Co...",Glick Court,Completed 2015 to Date,DHCD,1,38.913398,...,0,1,0,0,,316497,398202.54,138388.45,660 GLICK COURT NW,2024/02/05 05:00:27+00


In [None]:

df = pd.read_csv('/content/drive/MyDrive/ursp688y_shared_data/ursp688y_shared_data/affordable_housing.csv')
value_counts = df['STATUS_PUBLIC'].value_counts()
value_counts

Completed 2015 to Date    526
Under Construction        187
Pipeline                  165
Name: STATUS_PUBLIC, dtype: int64

In [None]:
import pandas as pd
abs_path = '/content/drive/MyDrive/ursp688y_shared_data/ursp688y_shared_data/wards_from_2022(1).csv'
os.path.isfile(abs_path)

True

In [None]:
wd2 = pd.read_csv('/content/drive/MyDrive/ursp688y_shared_data/ursp688y_shared_data/Wards_from_2022.csv')
wd2.head()

Unnamed: 0,WARD,NAME,REP_NAME,WEB_URL,REP_PHONE,REP_EMAIL,REP_OFFICE,WARD_ID,LABEL,STUSAB,...,P0050009,P0050010,OBJECTID,GLOBALID,CREATED_USER,CREATED_DATE,LAST_EDITED_USER,LAST_EDITED_DATE,SHAPEAREA,SHAPELEN
0,8,Ward 8,"Trayon White, Sr.",https://www.dccouncil.us/council/councilmember...,(202) 724-8045,twhite@dccouncil.us,"1350 Pennsylvania Ave, Suite 400, NW 20004",8,Ward 8,DC,...,563,1745,1,{E31550AE-6FAE-4B74-909F-52B283BFAF68},,,,,0,0
1,6,Ward 6,Charles Allen,https://www.dccouncil.us/council/councilmember...,(202) 724-8072,callen@dccouncil.us,"1350 Pennsylvania Ave, Suite 110, NW 20004",6,Ward 6,DC,...,255,887,2,{765C4F49-9292-4BDB-AA24-39F4EE43359F},,,JLAY,2023/12/07 20:08:04+00,0,0
2,7,Ward 7,Vincent Gray,https://dccouncil.us/council/vincent-gray,(202) 724-8068,vgray@dccouncil.us,"1350 Pennsylvania Ave, Suite 406, NW 20004",7,Ward 7,DC,...,0,1971,3,{73F07042-7D7F-452B-9BB3-0F87B0EC5418},,,,,0,0
3,2,Ward 2,Brooke Pinto,https://www.dccouncil.us/council/ward-2-counci...,(202) 724-8058,bpinto@dccouncil.us,"1350 Pennsylvania Ave, Suite 106, NW 20004",2,Ward 2,DC,...,0,1543,4,{7F8C2A51-427C-45FC-91EB-9693656AED9C},,,,,0,0
4,1,Ward 1,Brianne Nadeau,https://dccouncil.us/council/brianne-nadeau,(202) 724-8181,bnadeau@dccouncil.us,"1350 Pennsylvania Ave, Suite 108, NW 20004",1,Ward 1,DC,...,0,752,5,{C3C6E2E7-E68D-49B2-970C-D60675EA7B4B},,,JLAY,2023/12/07 20:08:04+00,0,0


In [None]:
#ward_units

In [None]:
filtered_df = df[(df["STATUS_PUBLIC"].str.contains("Under Construction")) | (df["STATUS_PUBLIC"].str.contains("Pipeline"))]
#filtered_df

In [None]:
#filtered_df["MAR_WARD"].value_counts()

In [None]:
# Select these three numerical columns, and the grouping columns
ward_units_filtered=filtered_df[["MAR_WARD","AFFORDABLE_UNITS_AT_0_30_AMI", "AFFORDABLE_UNITS_AT_31_50_AMI", "AFFORDABLE_UNITS_AT_51_60_AMI"]]
#ward_units_filtered.head(10)

In [None]:
# Sum each unit column by ward (multiple columns must be selected with a list)
ward_units = filtered_df.groupby("MAR_WARD")[["AFFORDABLE_UNITS_AT_0_30_AMI", "AFFORDABLE_UNITS_AT_31_50_AMI", "AFFORDABLE_UNITS_AT_51_60_AMI"]].sum()
ward_units


MAR_WARD,AFFORDABLE_UNITS_AT_0_30_AMI,AFFORDABLE_UNITS_AT_31_50_AMI,AFFORDABLE_UNITS_AT_51_60_AMI
1,0,0,3
Ward 1,258,411,785
Ward 2,23,17,220
Ward 3,77,87,280
Ward 4,126,229,277
Ward 5,600,683,969
Ward 6,659,575,1200
Ward 7,528,1140,642
Ward 8,1189,1760,1343


In [None]:
# Add the three columns to get total units up to 60% AMI per project, then sum by ward
# (working on an explicit copy avoids pandas' SettingWithCopyWarning)
ward_units_filtered = ward_units_filtered.copy()
ward_units_filtered["TOTAL_AFFORDABLE_UNITS_UP_TO_60%_AMI"] = ward_units_filtered["AFFORDABLE_UNITS_AT_0_30_AMI"] + ward_units_filtered["AFFORDABLE_UNITS_AT_31_50_AMI"] + ward_units_filtered["AFFORDABLE_UNITS_AT_51_60_AMI"]
ward_units_filtered_sum = ward_units_filtered.groupby(["MAR_WARD"])["TOTAL_AFFORDABLE_UNITS_UP_TO_60%_AMI"].agg('sum')
ward_units_filtered_sum


MAR_WARD
1            3
Ward 1    1454
Ward 2     260
Ward 3     444
Ward 4     632
Ward 5    2252
Ward 6    2434
Ward 7    2310
Ward 8    4292
Name: TOTAL_AFFORDABLE_UNITS_UP_TO_60%_AMI, dtype: int64
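
The output above includes a stray `1` label alongside `Ward 1`. If that label is meant to refer to Ward 1 (an assumption; the source data would need to be checked), the labels could be normalized before summing. A minimal sketch with a hypothetical helper, `normalize_ward`, and three values copied from the output above:

```python
import pandas as pd


def normalize_ward(label):
    """Return 'Ward N' for labels written as either 'N' or 'Ward N'."""
    text = str(label).strip()
    return text if text.startswith("Ward ") else f"Ward {text}"


# Three totals copied from the output above, including the stray "1" label.
totals = pd.Series({"1": 3, "Ward 1": 1454, "Ward 2": 260}, name="units")

# Relabel the index, then combine any rows that now share a label.
cleaned = totals.rename(index=normalize_ward).groupby(level=0).sum()
print(cleaned)
```

Applying the same function to the `MAR_WARD` column before grouping would have the same effect on the full table.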

In [None]:
#arrange in ascending order
ward_units_filtered_sum_sorted = ward_units_filtered_sum.sort_values(ascending=True)
ward_units_filtered_sum_sorted

MAR_WARD
1            3
Ward 2     260
Ward 3     444
Ward 4     632
Ward 1    1454
Ward 5    2252
Ward 7    2310
Ward 6    2434
Ward 8    4292
Name: TOTAL_AFFORDABLE_UNITS_UP_TO_60%_AMI, dtype: int64

In [None]:
#arrange in descending order
ward_units_filtered_sum_sorted = ward_units_filtered_sum.sort_values(ascending=False)
ward_units_filtered_sum_sorted

MAR_WARD
Ward 8    4292
Ward 6    2434
Ward 7    2310
Ward 5    2252
Ward 1    1454
Ward 4     632
Ward 3     444
Ward 2     260
1            3
Name: TOTAL_AFFORDABLE_UNITS_UP_TO_60%_AMI, dtype: int64

In [None]:
# Find the ward with the minimum
ward_with_minimum = ward_units_filtered_sum_sorted.idxmin()
ward_with_minimum

'1'

In [None]:
# Find the ward with the maximum
ward_with_maximum = ward_units_filtered_sum_sorted.idxmax()
ward_with_maximum

'Ward 8'

In [None]:
#join the data set together
housing_projects_with_pops = pd.merge(ward_units_filtered_sum, wd2, left_on='MAR_WARD', right_on='NAME')
#housing_projects_with_pops

In [None]:
#add population to the data
housing_projects_with_pops = pd.merge(
    ward_units_filtered_sum,
    wd2[['NAME','POP100','HU100']],
    left_on='MAR_WARD',
    right_on='NAME')
print(housing_projects_with_pops)

   TOTAL_AFFORDABLE_UNITS_UP_TO_60%_AMI    NAME  POP100  HU100
0                                  1454  Ward 1   85285  45694
1                                   260  Ward 2   89485  53217
2                                   444  Ward 3   85301  44109
3                                   632  Ward 4   84660  34650
4                                  2252  Ward 5   89617  41794
5                                  2434  Ward 6   84266  52768
6                                  2310  Ward 7   85685  38968
7                                  4292  Ward 8   85246  39164
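
An inner merge silently drops keys that appear in only one table, so it can be worth checking which ward labels failed to match. The sketch below uses the `how` and `indicator` parameters of `merge` on small stand-in tables; the numbers are copied from the outputs above, and the variable names are illustrative only.

```python
import pandas as pd

# Small stand-ins for the two tables; numbers copied from the outputs above.
units = pd.DataFrame({"MAR_WARD": ["1", "Ward 1", "Ward 2"], "units": [3, 1454, 260]})
wards = pd.DataFrame({"NAME": ["Ward 1", "Ward 2"], "POP100": [85285, 89485]})

# A left merge keeps every row of `units`; the _merge column records
# whether each row found a partner in `wards`.
checked = units.merge(
    wards, left_on="MAR_WARD", right_on="NAME", how="left", indicator=True
)
unmatched = checked.loc[checked["_merge"] == "left_only", "MAR_WARD"].tolist()
print(unmatched)
```

Any label listed in `unmatched` would be absent from the result of an inner merge.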


In [None]:
# Calculate affordable units per resident in each ward
housing_projects_with_pops['housing_projects_per_pop'] = housing_projects_with_pops['TOTAL_AFFORDABLE_UNITS_UP_TO_60%_AMI'] / housing_projects_with_pops['POP100']
housing_projects_with_pops

Unnamed: 0,TOTAL_AFFORDABLE_UNITS_UP_TO_60%_AMI,NAME,POP100,HU100,housing_projects_per_pop
0,1454,Ward 1,85285,45694,0.017049
1,260,Ward 2,89485,53217,0.002906
2,444,Ward 3,85301,44109,0.005205
3,632,Ward 4,84660,34650,0.007465
4,2252,Ward 5,89617,41794,0.025129
5,2434,Ward 6,84266,52768,0.028885
6,2310,Ward 7,85685,38968,0.026959
7,4292,Ward 8,85246,39164,0.050348
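
Per-resident rates like those in the table above can also be read as a ratio of shares: each ward's share of all planned units divided by its share of all residents. This is one possible way to express "disproportionate," not the only one. In the sketch below the unit totals and populations are copied from the table above; the `share_ratio` column name is hypothetical.

```python
import pandas as pd

# Ward totals and populations copied from the table above.
wards = pd.DataFrame({
    "NAME": [f"Ward {n}" for n in range(1, 9)],
    "units": [1454, 260, 444, 632, 2252, 2434, 2310, 4292],
    "POP100": [85285, 89485, 85301, 84660, 89617, 84266, 85685, 85246],
})

# Share of all planned units divided by share of all residents.
unit_share = wards["units"] / wards["units"].sum()
pop_share = wards["POP100"] / wards["POP100"].sum()
wards["share_ratio"] = unit_share / pop_share

print(wards.sort_values("share_ratio", ascending=False)[["NAME", "share_ratio"]])
```

A ratio above 1 means a ward is planned to receive more units than its population share alone would suggest, and a ratio below 1 means fewer.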



In [None]:
#function