Urban Data Science & Smart Cities <br>
URSP688Y <br>
Instructor: Chester Harvey <br>
Urban Studies & Planning <br>
National Center for Smart Growth <br>
University of Maryland

[<img src="https://colab.research.google.com/assets/colab-badge.svg"> Clean version](https://colab.research.google.com/github/ncsg/ursp688y_sp2024/blob/main/demos/demo04/demo04.ipynb)

[<img src="https://colab.research.google.com/assets/colab-badge.svg"> Modified in class](https://colab.research.google.com/drive/11bvlfaXuamFZ__Bb97a2IGYom05Mtj5z?usp=sharing)

# Demo 4 - Debugging and Working with Files

- Using a debugger (and installing packages with `pip`)
- Connecting to Google Drive in Colab
- Loading data from a file
- Tabular joining
- Saving data to a file
- Loading code from a file (modules)
- Repository structure
- Introducing the final project

## Using a debugger (and installing packages with `pip`)

Sometimes things just really don't work and it's hard to figure out why. This can be especially true when there is a lot of nesting with names accessible only inside functions or loops.

There are special tools for debugging that can help step through code one line at a time, stop in specific places, and understand the values stored in variables at specific points in the program.

A good way to [implement this in a Jupyter notebook](https://zohaib.me/debugging-in-google-collab-notebook/), including Colab, is with a package called `ipdb`. Unfortunately, ipdb does not come pre-installed with Colab, so we'll need to install it before we can import it.

This is as easy as using a special character (`!`) to ask Colab to run a command as if it were typed on the computer's command line, rather than run by the Python interpreter.

We're using a program called `pip`, which goes to its internet repositories, downloads ipdb, and installs it.

With Colab, you need to do this at the start of every session because Colab wipes your virtual computer clean when a session times out.

In [1]:
# This version will run 'quietly'
!pip install -Uqq ipdb

# This version will show log outputs
# !pip install ipdb
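
If reinstalling every session feels wasteful, one option is to check whether a package can already be imported before calling `pip`. This is only a sketch: `ensure_installed` is a hypothetical helper name, and it assumes the import name and the `pip` name are the same unless you say otherwise.

```python
import importlib.util
import subprocess
import sys


def ensure_installed(import_name, pip_name=None):
    """Install a package with pip only if it can't already be imported.

    Returns True if an install was attempted, False if it was already there.
    """
    if importlib.util.find_spec(import_name) is not None:
        return False
    # Same idea as `!pip install -q ...`, but run from Python
    subprocess.check_call(
        [sys.executable, '-m', 'pip', 'install', '-q', pip_name or import_name]
    )
    return True


# 'json' ships with Python, so nothing gets installed here
print(ensure_installed('json'))
```

In Colab, `ensure_installed('ipdb')` would install ipdb only in a fresh session.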

Now that we have ipdb, we can import it and start debugging.

In [2]:
import ipdb

Here's a loop with a logic error (this may look familiar from two weeks ago):

In [3]:
people = {'Daniela': 5, 'Rowen': 65, 'Zoe': 10, 'Jude': 81, 'Austin': 45}

for name, age in people.items():
    if age < 18:
        age_desc = 'a child'
    else:
        age_des = 'an adult'
    # ipdb.set_trace() # Here is the breakpoint where we'll inspect
    print(f'{name} is {age_desc}')

Daniela is a child
Rowen is a child
Zoe is a child
Jude is a child
Austin is a child


Let's use ipdb to dig into why every person is being listed as a child rather than an adult.

First, we turn on automatic debugging, so the notebook drops into the debugger whenever an error is raised. We use a shortcut called a 'magic command,' which is only valid within a notebook. It isn't technically Python.

In [4]:
%pdb on

# %pdb off

Automatic pdb calling has been turned ON


### Breakpoints

Add `ipdb.set_trace()` anywhere in a program to stop it for inspection.

Then use commands to continue flow as needed.

|Command|Description|
|--- |--- |
|h(elp)|Show various commands supported by ipdb|
|h(elp) COMMAND|Show a description of the COMMAND specified|
|c(ontinue)|Continue executing until it hits another breakpoint|
|n(ext)|Execute until the next line in the same code frame. If there is a function call, it won't step into that function, but will execute it.|
|s(tep)|Step to the next line of code. If it's a function call, step into the function.|
|r(eturn)|Execute until the current function returns or another breakpoint is hit.|
|l(ist)|Show more of the source code surrounding the current line.|
|w(here)|Show the stack trace, i.e., the chain of function calls that led to the current function|
|a(rgs)|List the arguments passed to the current function and their values|
|q(uit)|Immediately stop execution and quit the debugger|

### Other debugging strategies
- `assert` (the most basic form of testing)
- `print`
- `break` (terminate loop)
- early returns from functions
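
Here is a rough sketch of how a few of these strategies could be applied to the age loop from earlier. The `describe_age` function and the `first_adult` name are made up for this illustration, and the typo from the original loop is fixed inside the function.

```python
people = {'Daniela': 5, 'Rowen': 65, 'Zoe': 10, 'Jude': 81, 'Austin': 45}


def describe_age(age):
    """Return a description for an age."""
    if age < 18:
        age_desc = 'a child'
    else:
        age_desc = 'an adult'
    return age_desc


# assert: stop right away if an assumption turns out to be false
assert describe_age(5) == 'a child'
assert describe_age(65) == 'an adult'

first_adult = None
for name, age in people.items():
    age_desc = describe_age(age)
    # print: look at the values at this point in the loop
    print(f'DEBUG name={name!r} age={age} age_desc={age_desc!r}')
    if age_desc == 'an adult':
        # break: stop the loop at the first case we want to inspect
        first_adult = name
        break
```

If the typo were still there, the second `assert` would fail immediately and point to the `else` branch.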

## Connecting to Google Drive in Colab

If/when you run Python on your own computer, you can easily access other files on the computer. This is important if you want to load data from a spreadsheet, for example.

The biggest downside to Colab is that this is a bit harder. Because Colab is running on a virtual machine that you don't directly control (free computing!) you need to tell it where to access files in a place you _do_ control. The easiest place to point it to is Google Drive, and they make it fairly straightforward because ... _ahem_ ... it's another Google product.

To connect to Google Drive, you need to import the `drive` module from the `colab` subpackage of Google's own `google` package, then call its `mount` function.

Then you run the function with a string argument showing the directory path within the virtual machine where you want your Google Drive to be mounted.

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Directory structure, paths, and the working directory

Now we get to talk a bit about directory structures and the `paths` used to point us to them.

You have probably noticed that files on computers tend to be stored in folders, and folders get stored inside of other folders, forming a file tree. 'Directory' just means a system of files and folders. It's a way of storing things and giving the computer directions to find them.

'Directory' is often used synonymously with 'folder.'

'Paths' are computer-speak for directions to a location in the file system. They are strings, where nesting is denoted with slashes (`/`). There are two types of paths:

- **Absolute paths** always start with the 'root' directory:
    - `/content/drive/MyDrive/ursp688y_shared_data`
    - `/content/drive/MyDrive/ursp688y_shared_data/affordable_housing.csv`

- **Relative paths** give directions from the 'place' in the file structure where you (or a program) already are. Let's say we are already in `MyDrive`:
    - `./ursp688y_shared_data`
    - `./ursp688y_shared_data/affordable_housing.csv`

Python operates in a 'working directory'. This is where we 'are' for the purpose of relative paths.
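
To make the difference concrete, here is a small sketch using functions from `os.path`. It assumes Linux-style paths, as in Colab, and the paths are only strings here, so nothing needs to exist on disk for it to run.

```python
import os

# Example locations, matching the Colab layout used in this demo
my_drive = '/content/drive/MyDrive'
rel_path = './ursp688y_shared_data/affordable_housing.csv'

print(os.path.isabs(my_drive))   # does it start at the root?
print(os.path.isabs(rel_path))   # or does it depend on where we 'are'?

# Join a starting point and a relative path, then tidy up the './'
abs_path = os.path.normpath(os.path.join(my_drive, rel_path))
print(abs_path)

# os.path.abspath does the same thing, starting from the working directory
print(os.path.abspath(rel_path))
```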

Let's see where the working directory is located by default in Colab. We need to import a module called `os`, which has tools for interfacing with the computer's operating system, then use its `getcwd` function. (This stands for "get current working directory".)

In [6]:
import os

os.getcwd()

'/content'

If we want to access the `affordable_housing.csv` spreadsheet, we need to do one of three things:

1. Provide an absolute path to it
2. Provide a relative path from our current working directory
3. Change our current working directory

In [7]:
# 1) Absolute path
abs_path = '/content/drive/MyDrive/ursp688y_shared_data/affordable_housing.csv'
os.path.isfile(abs_path)

True

In [8]:
# 2) Relative path
rel_path = 'drive/MyDrive/ursp688y_shared_data/affordable_housing.csv'
os.path.isfile(rel_path)

True

In [9]:
# 3) Change working directory
wd_path = '/content/drive/MyDrive/ursp688y_shared_data'
os.chdir(wd_path)

print(f'cwd: {os.getcwd()}')

os.path.isfile('affordable_housing.csv')

cwd: /content/drive/MyDrive/Teaching/URSP688Y Spring 2024/ursp688y_shared_data


True

For simplicity, let's set our working directories to the `ursp688y_shared_data` folder I have shared with everyone.

**First, let's all make a shortcut to it in the root folder of our Google Drives.** That way, the path will be the same for everyone.

Then we can change the working directory in our notebook.



In [10]:
wd_path = '/content/drive/MyDrive/ursp688y_shared_data'
os.chdir(wd_path)

print(f'cwd: {os.getcwd()}')

os.path.isfile('affordable_housing.csv')

cwd: /content/drive/MyDrive/Teaching/URSP688Y Spring 2024/ursp688y_shared_data


True
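
If the shortcut is missing or Drive isn't mounted, `os.chdir` fails with a fairly terse error. A small wrapper can give a friendlier hint. This is a sketch, and `change_working_directory` is a hypothetical helper, not something built into `os`.

```python
import os


def change_working_directory(path):
    """Change the working directory, with a clearer message if the folder is missing."""
    if not os.path.isdir(path):
        raise FileNotFoundError(
            f'{path!r} not found. Is Drive mounted, and is the shortcut in MyDrive?'
        )
    os.chdir(path)
    return os.getcwd()


try:
    print(change_working_directory('/content/drive/MyDrive/ursp688y_shared_data'))
except FileNotFoundError as error:
    print(error)
```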

Now let's check what's in our working directory.

In [11]:
os.listdir()

['affordable_housing.csv',
 'demo04.py',
 '.ipynb_checkpoints',
 '__pycache__',
 'wards_from_2022.csv']

## Loading data from files

Now that we know about directories and paths, we can load data from files. This really expands our horizons for analyzing larger and more real-world data.

We're going to load a table, so we need Pandas.

In [12]:
import pandas as pd

Then we'll use the `read_csv` function, and give it the path to our file.

Notice that our path is very short? It's just pointing to the file name. This is because the file is within our current working directory, so it's a relative path.

We could also enter an absolute path, but it would be a pain to enter and it wouldn't be as flexible. What if the directory structure changed? A relative path is very easy to update: just change the current working directory, and everything works.

In [17]:
# relative path example
housing_projects = pd.read_csv('affordable_housing.csv')

# absolute path example
# housing_projects = pd.read_csv('/content/drive/MyDrive/ursp688y_shared_data/affordable_housing.csv')

# both yield the same thing
housing_projects.head()

Unnamed: 0,X,Y,OBJECTID,MAR_WARD,ADDRESS,PROJECT_NAME,STATUS_PUBLIC,AGENCY_CALCULATED,TOTAL_AFFORDABLE_UNITS,LATITUDE,...,AFFORDABLE_UNITS_AT_31_50_AMI,AFFORDABLE_UNITS_AT_51_60_AMI,AFFORDABLE_UNITS_AT_61_80_AMI,AFFORDABLE_UNITS_AT_81_AMI,CASE_ID,MAR_ID,XCOORD,YCOORD,FULLADDRESS,GIS_LAST_MOD_DTTM
0,-77.009383,38.910255,89281,Ward 6,"1520 North Capitol Street Northwest, Washingto...",Cycle House,Under Construction,DMPED DHCD,18,38.910248,...,4,12,0,0,,331764,399186.36,138042.91,1520 NORTH CAPITOL STREET NW,2024/02/05 05:00:27+00
1,-77.009436,38.906403,89282,Ward 6,"1200 North Capitol Street Northwest, Washingto...",Tyler House Apartments,Completed 2015 to Date,DCHFA,284,38.906396,...,0,284,0,0,,237128,399181.75,137615.28,1200 NORTH CAPITOL STREET NW,2024/02/05 05:00:27+00
2,-77.030061,38.962519,89283,Ward 4,"5922 13th Street Northwest, Washington, Distri...",Valencia Apartments,Completed 2015 to Date,DHCD,29,38.962511,...,0,29,0,0,,243483,397394.87,143845.04,5922 13TH STREET NW,2024/02/05 05:00:27+00
3,-76.950868,38.922332,89284,Ward 5,"3814 Fort Lincoln Drive Northeast, Washington,...",Villages at Dakota Crossing Phase III,Completed 2015 to Date,DMPED,24,38.922333,...,0,0,24,0,,310077,404260.75,139384.6,3814 FORT LINCOLN DRIVE NE,2024/02/05 05:00:27+00
4,-77.033056,38.967357,89285,Ward 4,"1388 Tuckerman Street Northwest, Washington, D...",Vizcaya Apartments,Completed 2015 to Date,DHCD,17,38.967349,...,0,17,0,0,,257527,397135.52,144382.12,1388 TUCKERMAN STREET NW,2024/02/05 05:00:27+00


## Tabular Joins

Tabular data become much more powerful when we combine them with other data.

Adding columns from one table to another by relating them through a consistent 'key' column (or multiple key columns) is called a 'tabular join.'

The concept of a join comes from [relational databases](https://www.codecademy.com/article/what-is-rdbms-sql): a data structure in which many tables, each storing a particular type of data, are dynamically related to one another through a series of keys.

In Pandas, we can approximate this by merging two DataFrames together to make a combined DataFrame. It's not quite a relational database because the result is a whole new dataframe stored in memory as a separate object (the merged version won't update automatically if you update the original DataFrames). But it's still very powerful.
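
A tiny example may help show what 'a separate object' means. The two tables below are made up for illustration (they are not the D.C. data).

```python
import pandas as pd

# Made-up tables for illustration
projects = pd.DataFrame({
    'project': ['A', 'B', 'C'],
    'ward': ['Ward 1', 'Ward 2', 'Ward 1'],
})
wards = pd.DataFrame({
    'ward': ['Ward 1', 'Ward 2'],
    'population': [100, 200],
})

# Each project picks up the population of its ward
merged = pd.merge(projects, wards, on='ward')

# Changing the original table afterwards...
wards.loc[wards['ward'] == 'Ward 1', 'population'] = 999

# ...does not change the merged copy
print(merged)
```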

Let's try out joining by loading another CSV with data about the population of each ward. This CSV also comes from the [Washington, D.C. Open Data Portal](https://opendata.dc.gov/datasets/DCGIS::wards-from-2022/about).

In [24]:
ward_demogs = pd.read_csv('wards_from_2022.csv')
ward_demogs.head(2)

Unnamed: 0,WARD,NAME,REP_NAME,WEB_URL,REP_PHONE,REP_EMAIL,REP_OFFICE,WARD_ID,LABEL,STUSAB,...,P0050009,P0050010,OBJECTID,GLOBALID,CREATED_USER,CREATED_DATE,LAST_EDITED_USER,LAST_EDITED_DATE,SHAPEAREA,SHAPELEN
0,8,Ward 8,"Trayon White, Sr.",https://www.dccouncil.us/council/councilmember...,(202) 724-8045,twhite@dccouncil.us,"1350 Pennsylvania Ave, Suite 400, NW 20004",8,Ward 8,DC,...,563,1745,1,{E31550AE-6FAE-4B74-909F-52B283BFAF68},,,,,0,0
1,6,Ward 6,Charles Allen,https://www.dccouncil.us/council/councilmember...,(202) 724-8072,callen@dccouncil.us,"1350 Pennsylvania Ave, Suite 110, NW 20004",6,Ward 6,DC,...,255,887,2,{765C4F49-9292-4BDB-AA24-39F4EE43359F},,,JLAY,2023/12/07 20:08:04+00,0,0


Let's compare it to our housing projects DataFrame. Are there two columns with the same information that we can use as a key?

In [25]:
housing_projects.head(2)

Unnamed: 0,X,Y,OBJECTID,MAR_WARD,ADDRESS,PROJECT_NAME,STATUS_PUBLIC,AGENCY_CALCULATED,TOTAL_AFFORDABLE_UNITS,LATITUDE,...,AFFORDABLE_UNITS_AT_31_50_AMI,AFFORDABLE_UNITS_AT_51_60_AMI,AFFORDABLE_UNITS_AT_61_80_AMI,AFFORDABLE_UNITS_AT_81_AMI,CASE_ID,MAR_ID,XCOORD,YCOORD,FULLADDRESS,GIS_LAST_MOD_DTTM
0,-77.009383,38.910255,89281,Ward 6,"1520 North Capitol Street Northwest, Washingto...",Cycle House,Under Construction,DMPED DHCD,18,38.910248,...,4,12,0,0,,331764,399186.36,138042.91,1520 NORTH CAPITOL STREET NW,2024/02/05 05:00:27+00
1,-77.009436,38.906403,89282,Ward 6,"1200 North Capitol Street Northwest, Washingto...",Tyler House Apartments,Completed 2015 to Date,DCHFA,284,38.906396,...,0,284,0,0,,237128,399181.75,137615.28,1200 NORTH CAPITOL STREET NW,2024/02/05 05:00:27+00


The `NAME` column in the `ward_demogs` DataFrame seems to have the same information as the `MAR_WARD` column in `housing_projects`. Let's make sure they're exactly the same:

In [33]:
for name in ward_demogs['NAME'].sort_values().unique():
    if name in housing_projects['MAR_WARD'].unique():
        print(f'{name} in both')
    else:
        print(f'{name} not in housing projects')

Ward 1 in both
Ward 2 in both
Ward 3 in both
Ward 4 in both
Ward 5 in both
Ward 6 in both
Ward 7 in both
Ward 8 in both


Looks like a perfect match. That's actually a little unusual, so we'll just be thankful. Otherwise, we might have had to do some pre-processing on those columns to make sure they have exactly the same data and types. For example, if one was a string (`'Ward 1'`) and the other was just an integer (`1`), you might need to parse the integer from the string version before joining.
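
As a sketch of that kind of pre-processing, suppose one table stored wards as text and the other as integers. The tables and population values below are made up, and `ward_number` is a hypothetical column name.

```python
import pandas as pd

# Made-up tables: one key is text, the other is an integer
projects = pd.DataFrame({'MAR_WARD': ['Ward 6', 'Ward 4', 'Ward 6']})
wards = pd.DataFrame({'WARD': [4, 6], 'POP100': [1000, 2000]})

# Pull the digits out of 'Ward 6' and convert them to integers
projects['ward_number'] = (
    projects['MAR_WARD'].str.extract(r'(\d+)', expand=False).astype(int)
)

# Now both key columns hold integers and can be matched
merged = pd.merge(projects, wards, left_on='ward_number', right_on='WARD')
print(merged)
```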

But we can just move on to the join. Let's add the ward population to each record in the `housing_projects` DataFrame:

In [37]:
housing_projects_with_pops = pd.merge(housing_projects, ward_demogs, left_on='MAR_WARD', right_on='NAME')

# housing_projects_with_pops.columns.tolist()

There are **a lot** of columns in that `ward_demogs` DataFrame. Maybe we can keep things tidier by only joining a couple of them.

In [44]:
housing_projects_with_pops = pd.merge(
    housing_projects,
    ward_demogs[['NAME','POP100','HU100']],
    left_on='MAR_WARD',
    right_on='NAME')

housing_projects_with_pops.head(2)

Unnamed: 0,X,Y,OBJECTID,MAR_WARD,ADDRESS,PROJECT_NAME,STATUS_PUBLIC,AGENCY_CALCULATED,TOTAL_AFFORDABLE_UNITS,LATITUDE,...,AFFORDABLE_UNITS_AT_81_AMI,CASE_ID,MAR_ID,XCOORD,YCOORD,FULLADDRESS,GIS_LAST_MOD_DTTM,NAME,POP100,HU100
0,-77.009383,38.910255,89281,Ward 6,"1520 North Capitol Street Northwest, Washingto...",Cycle House,Under Construction,DMPED DHCD,18,38.910248,...,0,,331764,399186.36,138042.91,1520 NORTH CAPITOL STREET NW,2024/02/05 05:00:27+00,Ward 6,84266,52768
1,-77.009436,38.906403,89282,Ward 6,"1200 North Capitol Street Northwest, Washingto...",Tyler House Apartments,Completed 2015 to Date,DCHFA,284,38.906396,...,0,,237128,399181.75,137615.28,1200 NORTH CAPITOL STREET NW,2024/02/05 05:00:27+00,Ward 6,84266,52768
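
When keys don't match as cleanly as they did here, a few optional `pd.merge` arguments can help catch problems. The sketch below uses made-up tables in which one project has a ward name with no match.

```python
import pandas as pd

# Made-up tables for illustration
projects = pd.DataFrame({
    'PROJECT_NAME': ['A', 'B', 'C'],
    'MAR_WARD': ['Ward 1', 'Ward 2', 'Ward 9'],  # 'Ward 9' has no match
})
wards = pd.DataFrame({'NAME': ['Ward 1', 'Ward 2'], 'POP100': [100, 200]})

merged = pd.merge(
    projects,
    wards,
    left_on='MAR_WARD',
    right_on='NAME',
    how='left',              # keep every project, matched or not
    validate='many_to_one',  # raise an error if a ward name repeats in `wards`
    indicator=True,          # add a `_merge` column describing each match
)

# A left join should never gain or lose projects
assert len(merged) == len(projects)
print(merged[['PROJECT_NAME', 'MAR_WARD', 'POP100', '_merge']])
```

With the default `how='inner'`, the unmatched project would be dropped without any warning.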


## Saving data to file

We can also save a DataFrame back to a file that you can open in Excel or just save for posterity.

In [None]:
housing_projects_with_pops.to_csv('affordable_housing_with_ward_pops.csv')
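
By default, `to_csv` also writes the row index as an unnamed first column, which may be where an `Unnamed: 0` column comes from when a file is loaded again. Passing `index=False` leaves it out. The sketch below uses a made-up table and an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# Made-up table for illustration
df = pd.DataFrame({'NAME': ['Ward 1', 'Ward 2'], 'POP100': [100, 200]})

# Default: the row index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```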

## Loading code from files (modules)

We just loaded data from a file. Guess what? You can also load code from a file!

So far, we have been loading packages and their subcomponents, modules, that other people have written.

A module is just a `.py` file with code that defines variables, functions, classes, etc. that we can import into our own namespace.

When you `import` a module, Python looks for it in a few places by default (the exact places and order of search can change depending on the system type and configuration, but you don't have to worry about that too much as long as there aren't naming conflicts).

1.  The current working directory
2.  The directory containing the executed script file (`.py` or `.ipynb`)
3.  The `site-packages` directory where `pip` installs things

This is very helpful, because it means you can write your own module, put it in your current working directory OR the same directory as your notebook (i.e., the executed script file), and import it.
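
You can see the search locations for your own session by printing `sys.path`. The sketch below also shows one way to add a folder to that list. `add_to_import_path` is a hypothetical helper name, and the exact contents of `sys.path` will differ between Colab and your own computer.

```python
import os
import sys


def add_to_import_path(directory):
    """Add a folder to the places Python searches when importing."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.append(directory)
    return directory


# See where Python currently looks, in order
for location in sys.path:
    print(repr(location))
```

For example, `add_to_import_path('/content/drive/MyDrive/ursp688y_shared_data')` would let you import a module stored there without changing the working directory.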

Writing a module is a great way to automate setup tasks, store functions you're going to use again elsewhere, and keep your notebook tidy.

I made a new module called `demo04.py` in the shared data folder. Then I wrote a function in it. Let's load it and use it.

In [45]:
import demo04

# These autoreload magic functions are only for notebooks.
# They tell the notebook to automatically reload the module
# when changes are made to it, which is great for development.

# If you define an entirely new name (e.g., a function), you may
# still need to restart the kernel/runtime/session for Python to
# add it to the namespace.

%load_ext autoreload
%autoreload 2

In [46]:
df = demo04.load_affordable_housing_with_pops()
df.head(2)

Unnamed: 0,X,Y,OBJECTID,MAR_WARD,ADDRESS,PROJECT_NAME,STATUS_PUBLIC,AGENCY_CALCULATED,TOTAL_AFFORDABLE_UNITS,LATITUDE,...,AFFORDABLE_UNITS_AT_81_AMI,CASE_ID,MAR_ID,XCOORD,YCOORD,FULLADDRESS,GIS_LAST_MOD_DTTM,NAME,POP100,HU100
0,-77.009383,38.910255,89281,Ward 6,"1520 North Capitol Street Northwest, Washingto...",Cycle House,Under Construction,DMPED DHCD,18,38.910248,...,0,,331764,399186.36,138042.91,1520 NORTH CAPITOL STREET NW,2024/02/05 05:00:27+00,Ward 6,84266,52768
1,-77.009436,38.906403,89282,Ward 6,"1200 North Capitol Street Northwest, Washingto...",Tyler House Apartments,Completed 2015 to Date,DCHFA,284,38.906396,...,0,,237128,399181.75,137615.28,1200 NORTH CAPITOL STREET NW,2024/02/05 05:00:27+00,Ward 6,84266,52768
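
The contents of `demo04.py` aren't shown in this notebook. As a guess at what such a module could look like, here is a sketch that wraps the loading and joining steps from earlier in one function. The default file names and the path parameters are assumptions, and the real module may be written differently.

```python
import pandas as pd


def load_affordable_housing_with_pops(
        housing_path='affordable_housing.csv',
        wards_path='wards_from_2022.csv'):
    """Load housing projects and join ward population and housing-unit counts."""
    housing_projects = pd.read_csv(housing_path)
    ward_demogs = pd.read_csv(wards_path)
    # Keep only the key column and the two columns we want to add
    return pd.merge(
        housing_projects,
        ward_demogs[['NAME', 'POP100', 'HU100']],
        left_on='MAR_WARD',
        right_on='NAME',
    )
```

Saved in a `.py` file in the working directory, a function like this can be imported and called from any notebook.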


## Repository structure

Once we start having lots of separate files and modules for our project, we start needing some structure to organize them.

There is no one right structure for a data science project, but there are some good examples to draw on.

[Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/) is a popular one.

Here's the repository structure they suggest:
```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
```

For now, let's stay basic and just use a single folder to add some structure for our exercises. Rather than making a pull request with a single file, you'll add a folder with your name, and in that folder put your notebook, also named after you, along with any data or custom modules it depends on. It will look like this:
```
└── harvey
    ├── exercise04_harvey.ipynb
    ├── affordable_housing.csv
    ├── affordable_housing_calcs.py
    ├── ...
    └── ...
```

## [Introducing the final project](https://github.com/ncsg/ursp688y_sp2024/tree/main?tab=readme-ov-file#final-project-50-of-grade)