# Assemble

This notebook loads and assembles all your basic data sources, including tabular and geospatial data.

The final output is two dataframes:

- UNIVERSE
- SALES

The SALES dataframe represents transactions of parcels; these are ownership transfers with prices, dates, and metadata.  
The UNIVERSE dataframe represents the parcels themselves (land and buildings, and their associated characteristics).

These will be packaged together in a data structure called a `SalesUniversePair`, or `sup` for short. `openavmkit` provides many functions that operate on `sup`s while keeping the fields of the two dataframes from getting mixed up.

The key thing to understand is that the **Assemble** notebook outputs a `sup` that represents *factual assertions* about the world. In later notebooks, we will have to add assumptions, opinions, and educated guesses, but we first will establish the firmest facts we can in this notebook.

You can think of the two dataframes in the `sup` as answering the following questions:

- UNIVERSE:
  - Where is each parcel located in space, and what is its shape?
  - What are the *current* characteristics of each parcel?
    - Which parcels have buildings and which are vacant lots?
    - How big is each parcel?
    - What is the age/size/quality/condition/etc of each building?
- SALES:
  - Which parcels have sold?
  - What prices did they sell for?
  - What dates did they sell on?
  - Which sales were valid?
  - What characteristics were different *at the time of sale* from how the parcel is now?
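
To make the split concrete, here is a minimal sketch that uses plain Python dictionaries instead of dataframes. The parcel keys, field values, and the `characteristics_at_sale` helper are all hypothetical and are not part of `openavmkit`. The sketch only assumes that both tables share a `key` field and that a sale row carries a value only for fields that differed at the time of sale.

```python
# Hypothetical miniature UNIVERSE: one entry per parcel, current characteristics.
universe = {
    "A-1": {"land_area_sqft": 8000, "bldg_area_sqft": 1600, "is_vacant": False},
    "B-2": {"land_area_sqft": 12000, "bldg_area_sqft": 0, "is_vacant": True},
}

# Hypothetical miniature SALES: one entry per transaction. A sale stores only
# the fields that differed at the time of sale; None means "same as today".
sales = [
    {"key": "A-1", "sale_price": 95000, "sale_date": "2019-06-01",
     "valid_sale": True, "bldg_area_sqft": 0, "is_vacant": True},
    {"key": "A-1", "sale_price": 310000, "sale_date": "2023-03-15",
     "valid_sale": True, "bldg_area_sqft": None, "is_vacant": None},
]


def characteristics_at_sale(sale, universe):
    """Start from the parcel's current facts, then overlay the sale's own values."""
    merged = dict(universe[sale["key"]])  # copy, so UNIVERSE is never modified
    for field, value in sale.items():
        if value is not None:
            merged[field] = value
    return merged


first = characteristics_at_sale(sales[0], universe)
latest = characteristics_at_sale(sales[1], universe)
print(first["is_vacant"], latest["is_vacant"])
```

In this toy data the 2019 transaction is treated as a vacant-land sale even though parcel `A-1` has a building today, which is the kind of distinction the SALES questions above are meant to capture.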



In [7]:
# Change these as desired

# The slug of the locality you are currently working on
locality = "us-nc-guilford"

# Whether to print out a lot of stuff (can help with debugging) or stay mostly quiet
verbose = True

# Clear previous state for this notebook and start fresh
clear_checkpoints = True

# 1. Basic setup

In [8]:
import init_notebooks
init_notebooks.setup_environment()

Environment setup completed.


In [9]:
# import OpenAVMkit:
from openavmkit.pipeline import ( 
    init_notebook,
    from_checkpoint,
    delete_checkpoints,
    examine_df,
    examine_df_in_ridiculous_detail,
    examine_sup,
    examine_sup_in_ridiculous_detail,
    cloud_sync,
    load_settings,
    load_dataframes,
    process_dataframes,
    process_sales,
    enrich_sup_streets,
    tag_model_groups_sup,
    write_notebook_output_sup
)

In [10]:
init_notebook(locality)

locality = us-nc-guilford
base path = C:\Users\jacks\Documents\Non-Game Stuff\Programming\openavmkit\notebooks\pipeline
current path = C:\Users\jacks\Documents\Non-Game Stuff\Programming\openavmkit\notebooks\pipeline\data\us-nc-guilford


In [11]:
if clear_checkpoints:
    delete_checkpoints("1-assemble")
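
The checkpoint names used in this notebook suggest a compute-once-then-reuse pattern. The following is a rough sketch of that general idea using only the standard library. It is not `openavmkit`'s implementation: `cached_call`, `clear_cached`, the pickle format, and the temporary folder are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical checkpoint folder; a real pipeline would pick its own location.
CHECKPOINT_DIR = Path(tempfile.mkdtemp())


def cached_call(name, func, params):
    """Return the saved result for `name` if present, else compute and save it."""
    path = CHECKPOINT_DIR / f"{name}.pickle"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = func(**params)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


def clear_cached(prefix):
    """Delete every saved result whose name starts with `prefix`."""
    for path in CHECKPOINT_DIR.glob(f"{prefix}*.pickle"):
        path.unlink()


calls = []


def slow_load(rows):
    calls.append(rows)  # record that the expensive work really ran
    return list(range(rows))


a = cached_call("1-assemble-01-load", slow_load, {"rows": 3})  # computes
b = cached_call("1-assemble-01-load", slow_load, {"rows": 3})  # loads from disk
clear_cached("1-assemble")
c = cached_call("1-assemble-01-load", slow_load, {"rows": 3})  # computes again
print(len(calls))
```

A cache like this one does not notice changed inputs or settings, which is one reason to clear saved state before a fresh run.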

# 2. Sync with Cloud
- If you have configured cloud storage, this step syncs your local input files with your remote storage
- It reconciles your local input files with the versions on the remote server
- It pulls down whatever is newer on the remote server
- It uploads whatever is newer on your local machine
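
The "newer side wins" rule in the list above can be sketched as a small planning function. This is an illustration only, not how `cloud_sync` is implemented: `plan_sync`, the timestamps, and the `"read_write"` access value are assumptions, while `"read_only"` is the access level that appears in the log output of this run.

```python
def plan_sync(local_mtimes, remote_mtimes, access="read_only"):
    """Decide what to do with each file by comparing modification times."""
    actions = {}
    for name in sorted(set(local_mtimes) | set(remote_mtimes)):
        local = local_mtimes.get(name)
        remote = remote_mtimes.get(name)
        if local is None or (remote is not None and remote > local):
            # Missing locally, or the remote copy is newer.
            actions[name] = "download"
        elif remote is None or local > remote:
            # Uploading needs write permission; otherwise leave the file alone.
            actions[name] = "upload" if access == "read_write" else "skip"
        else:
            actions[name] = "skip"  # same timestamp on both sides
    return actions


# Hypothetical modification times (larger number = more recent).
local = {"building.csv": 100, "sale.csv": 300, "notes.csv": 50}
remote = {"building.csv": 200, "sale.csv": 300, "caprates.csv": 10}

plan = plan_sync(local, remote)
print(plan)
```

With read-only access, as in the log output of this run, a plan like this one would only ever download.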

In [12]:
cloud_sync(locality, verbose=True)

Cloud access was None, defaulting to 'read_only'...
Initializing cloud service of type 'azure' with access 'read_only'...
_init_service('azure', 'read_only')
ignore_paths = ['cloud.json']
Syncing files from local="in" to remote="us/nc/guilford/"...
Local file 'in/building.csv' missing for remote file 'us/nc/guilford/building.csv'. Downloading...
Downloading 'in/building.csv' <-- 'us/nc/guilford/building.csv'...
Local file 'in/caprates.csv' missing for remote file 'us/nc/guilford/caprates.csv'. Downloading...
Downloading 'in/caprates.csv' <-- 'us/nc/guilford/caprates.csv'...
Local file 'in/elevation.csv' missing for remote file 'us/nc/guilford/elevation.csv'. Downloading...
Downloading 'in/elevation.csv' <-- 'us/nc/guilford/elevation.csv'...
Local file 'in/geo/airport.parquet' missing for remote file 'us/nc/guilford/geo/airport.parquet'. Downloading...
Downloading 'in/geo/airport.parquet' <-- 'us/nc/guilford/geo/airport.parquet'...
Local file 'in/geo/census_block_groups.parquet' missing

In [13]:
settings = load_settings()

# 3. Load & process data

In [14]:
# load all of our initial dataframes, but don't do anything with them just yet
dataframes = from_checkpoint("1-assemble-01-load_dataframes", load_dataframes,
    {
        "settings": load_settings(),
        "verbose": verbose
    }
)

Loading "in/parcels.csv"...
--> rows = 219384
Loading "in/sale.csv"...
Valid sales: 114200 out of 220125 total




--> rows = 219957
Loading "in/building.csv"...
--> rows = 174179
Loading "in/elevation.csv"...
--> rows = 219373
Loading "in/slope.csv"...
--> rows = 219373
Loading "in/noise.csv"...
--> rows = 219373
Loading "in/ref_zoning.csv"...
--> rows = 204
Loading "in/geo/parcels.parquet"...
--> rows = 219382
Loading "in/geo/census_tracts.parquet"...
--> rows = 126
Loading "in/geo/census_block_groups.parquet"...
--> rows = 338
Loading "in/geo/city.parquet"...
--> rows = 20
Loading "in/geo/zoning.parquet"...
--> rows = 211
Loading "in/geo/school_district.parquet"...
--> rows = 8
Loading "in/geo/golf_courses.parquet"...
ensure_geometries, default crs = OGC:CRS84
--> rows = 12
Loading "in/geo/lakes.parquet"...
--> rows = 11
Loading "in/geo/central_business_district.parquet"...
ensure_geometries, default crs = OGC:CRS84
--> rows = 1
Loading "in/geo/airport.parquet"...
--> rows = 1
Loading "in/geo/universities.parquet"...
ensure_geometries, default crs = OGC:CRS84
--> rows = 7
Loading "in/geo/college

In [15]:
# load all of our initial dataframes and assemble our data
sales_univ_pair = from_checkpoint("1-assemble-02-process_data", process_dataframes,
    {
        "dataframes": dataframes,
        "settings": load_settings(), 
        "verbose": verbose
    }
)



Valid sales: 114196 (51.9% of 219957 total)
Enriching data...
Performing spatial joins...
--> census_tracts
--> census_block_groups
--> city
--> zoning
--> school_district
Using "key" to merge shapefiles onto df
Enriching with Overture building data...
--> Current settings: {'enabled': True, 'cache': True, 'footprint': {'units': 'sqft'}}
--> Bounding box: [-80.04696875  35.89955906 -79.53229945  36.25871445]
--> Fetching data from Overture...
--> Dataset columns: ['id', 'geometry', 'bbox', 'version', 'sources', 'level', 'subtype', 'class', 'height', 'names', 'has_parts', 'is_underground', 'num_floors', 'num_floors_underground', 'min_height', 'min_floor', 'facade_color', 'facade_material', 'roof_material', 'roof_shape', 'roof_direction', 'roof_orientation', 'roof_color', 'roof_height']
--> Skipping unavailable columns: ['est_height']
--> Counting batches...
--> Found 241 batches


Processing batches: 100%|████████████████████████████████████████████████████████| 241/241 [00:02<00:00, 80.48it/s]


--> Found 217459 buildings
--> Available columns: ['id', 'geometry', 'bbox', 'height', 'num_floors', 'num_floors_underground', 'subtype', 'class', 'sources', 'height_m_best', 'height_ft_best', 'floors_best', 'height_confidence', 'num_floors_confidence']
--> Calculating building footprint areas...
--> Using UTM CRS: EPSG:32617
--> Saving buildings to cache: cache/overture\buildings_-80.04696875251265_35.89955906359216_-79.53229945143045_36.25871445429295.parquet
--> Building columns = ['id' 'geometry' 'bbox' 'height' 'num_floors' 'num_floors_underground'
 'subtype' 'class' 'sources' 'height_m_best' 'height_ft_best'
 'floors_best' 'height_confidence' 'num_floors_confidence'
 'bldg_area_footprint_sqm' 'bldg_area_footprint_sqft']




Calculating building footprint areas!
--> Projected to equal area CRS...(5.09s)
--> Calculated building footprint intersections with parcels...(1.44s)
--> Found 278922 potential building-parcel intersections
--> Calculated precise intersection areas...(21.34s)
--> Aggregated building footprint areas...(0.09s)
--> Finished up...(0.38s)
--> Added building footprint areas to 219373 parcels
--> Total building footprint area: 625,733,227 sqft
--> Average building footprint area: 2,852 sqft
--> Number of parcels with buildings: 184,712
--> Saving intersection areas to cache: cache/overture\intersections_area_1690008.2952693105_782774.103934899_1843018.0420707166_914621.7529687285.parquet
Calculating building heights & floors!
--> CRS aligned...(0.75s)
--> Spatial join done...(1.52s), rows=278,922
--> Aggregated heights/floors...(0.18s)
--> Finished...(0.14s)
--> Saving to cache: cache/overture\intersections_height_1690008.2952693105_782774.103934899_1843018.0420707166_914621.7529687285.parqu

  gdf["land_area_sqft"].combine_first(gdf["land_area_gis_sqft"])
   120.02168562 71687.99805612]' has dtype incompatible with int32, please explicitly cast to a compatible dtype first.
  gdf.loc[
  result = getattr(ufunc, method)(*inputs, **kwargs)


--> calculated GIS area of each parcel...(3.50s)
--> calculated parcel rectangularity...(6.97s)
--> calculated parcel aspect ratios...(1.56s)
--> identifying irregular parcels...
----> simplified geometry...(2.38s)
----> identified triangular parcels...(79.68s)
----> identified complex geometry...(4.49s)
----> identified elongated parcels...(0.00s)
----> finished up...(0.21s)
--> identified irregular parcels (total)...(86.77s)
--> calculated polar coordinates...(4.11s)
Enriching with Census data...
Getting Census Data...
Performing spatial join with Census Data...
Census block group matching: 219115 of 219373 records have valid census geoid (99.88%)
Enriching with OpenStreetMap data...
--> Getting cbd...
Getting cbd from source file...
--> Found 1 cbd

Distance settings for cbd:
max_distance: 26400
unit: ft

Calculating distance, id=cbd, max_distance=26400, unit=ft, max_distance (in meters)=8046.719742504968
--> Getting airport...
Getting airport from source file...
--> Found 1 airport

In [16]:
# calculate street frontages
sales_univ_pair = from_checkpoint("1-assemble-03-enrich_streets", enrich_sup_streets,
    {
        "sup": sales_univ_pair,
        "settings": load_settings(),
        "verbose": verbose
    }
)

Street enrichment disabled. To enable it, add `data.process.enrich.streets.enabled = true` to your settings file.
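
The message above names the setting as a dotted path. Assuming the settings file is JSON and that each dot corresponds to one level of nesting (both are assumptions, so check them against your own settings file), the sketch below shows what the resulting structure would look like. `set_by_path` and `settings_sketch` are hypothetical names used only here.

```python
import json


def set_by_path(settings, dotted_path, value):
    """Create any missing nested objects along `dotted_path`, then set the value."""
    node = settings
    *parents, leaf = dotted_path.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return settings


# A stand-in for the relevant part of a settings file.
settings_sketch = {"data": {"process": {"enrich": {}}}}
set_by_path(settings_sketch, "data.process.enrich.streets.enabled", True)

# Python's True is written as JSON's true.
print(json.dumps(settings_sketch, indent=2))
```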


# 4. Inspect results

## 4.1 Examine

- Run the next cell and look at the printed results.
- Pay particular attention to the "Non-zero" and "Non-null" columns and make sure they match what you expect
- This view is a quick overview of what your data contains
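
As a guide to reading those two columns, here is a minimal sketch of how such counts could be computed for one field. It is not the library's code: `summarize_column` is a hypothetical helper, `None` stands in for a null value, and the percentages are taken over all rows, which matches the readout in this run (for example, 8,195 non-null values out of 219,373 rows is shown as 4%).

```python
def summarize_column(values):
    """Count non-null and non-zero entries, with percentages of all rows."""
    total = len(values)

    def pct(n):
        return round(100 * n / total) if total else 0

    non_null = sum(1 for v in values if v is not None)
    # Nulls are not counted as non-zero, so non_zero can never exceed non_null.
    non_zero = sum(1 for v in values if v is not None and v != 0)
    return {
        "non_zero": non_zero,
        "non_zero_pct": pct(non_zero),
        "non_null": non_null,
        "non_null_pct": pct(non_null),
    }


# Hypothetical distance field: most parcels are out of range, so mostly null.
dist_to_airport = [1200.0, None, None, 0.0, None, None, None, None, None, None]
summary = summarize_column(dist_to_airport)
print(summary)
```

A field that is mostly null is not necessarily broken. A distance field with a maximum search radius, for instance, would be expected to be empty for parcels outside that radius.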

In [17]:
examine_sup(sales_univ_pair, load_settings())


EXAMINING UNIVERSE...

            FIELD                 TYPE     NON-ZERO    %    NON-NULL    %                    UNIQUE                 
             LAND             
------------------------------ ---------- ---------- ----- ---------- ----- ----------------------------------------
           NUMERIC            
------------------------------ ---------- ---------- ----- ---------- ----- ----------------------------------------
assr_land_value                 Float64      213,051   97%    219,373  100%
dist_to_airport                 float64        6,604    3%      8,195    4%
dist_to_cbd                     float64       76,785   35%     78,077   36%
dist_to_colleges                                                             

## 4.2 Examine in ridiculous detail

- You've looked, now LOOK AGAIN. This cell will run `describe()` for each numeric field and `value_counts()` for each categorical field.
- Use this info to decide which variables are useful/useless
- Consult this readout when you build your modeling group filters
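
A readout this long can exceed Jupyter's output rate limit, which is what the `IOPub data rate exceeded` message in the output of this cell indicates. One possible workaround, sketched here with the standard library, is to capture printed output and write it to a text file. The sketch assumes the function being wrapped reports through ordinary `print` calls. `noisy_report` is a hypothetical stand-in, and the output file name is made up.

```python
import contextlib
import io
import tempfile
from pathlib import Path


def noisy_report(n_fields):
    """Stand-in for a function that prints a very long readout."""
    for i in range(n_fields):
        print(f"field_{i}: DESCRIBE --> count {i}")


# Capture everything printed inside the `with` block instead of sending it
# to the notebook's output area.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    noisy_report(1000)

# Hypothetical destination; any writable path would do.
report_path = Path(tempfile.mkdtemp()) / "examine_detail.txt"
report_path.write_text(buffer.getvalue(), encoding="utf-8")

lines = report_path.read_text(encoding="utf-8").splitlines()
print(f"captured {len(lines)} lines; first: {lines[0]}")
```

In the notebook, the `with` block would wrap the `examine_sup_in_ridiculous_detail(...)` call in place of `noisy_report`, and the saved file can then be opened in a text editor.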

In [18]:
examine_sup_in_ridiculous_detail(sales_univ_pair, load_settings())


EXAMINING UNIVERSE...

            FIELD                 TYPE     NON-ZERO    %    NON-NULL    %                    UNIQUE                 
             LAND             
------------------------------ ---------- ---------- ----- ---------- ----- ----------------------------------------
           NUMERIC            
------------------------------ ---------- ---------- ----- ---------- ----- ----------------------------------------
assr_land_value                 Float64      213,051   97%    219,373  100%
DESCRIBE --> count         219373.0
mean      81052.698281
std      350385.522831
min                0.0
25%            25000.0
50%            44000.0
75%            67200.0
max         44808700.0
Name: assr_land_value, dtype: Float64


dist_to_airport                 float64        6,604    3%      8,195    4%
DE

IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)



geometry                        geometry     219,373  100%    219,373  100%                                  219,373
total_population                float64      219,115  100%    219,115  100%                                      313

EXAMINING SALES...

            FIELD                 TYPE     NON-ZERO    %    NON-NULL    %                    UNIQUE                 
             LAND             
------------------------------ ---------- ---------- ----- ---------- ----- ----------------------------------------
           BOOLEAN            
------------------------------ ---------- ---------- ----- ---------- ----- ----------------------------------------
is_vacant                         bool         8,828    8%    114,196  100%                            [True, False]

         IMPROVEMENT          
------------------------------ ---------- ------




## 4.3 Look at it on a map

- Go to your `out/look/` folder
- It should contain Parquet files
- Load them into ArcGIS, QGIS, or Felt
- Check that your location fields make sense

# 5. Tag modeling groups
- Separates rows into groups like "single family", "townhomes" and "commercial" as specified by the user
- These groups will guide all further processing
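
The grouping idea can be sketched as an ordered list of named rules, where each row is given the first group whose rule matches. Everything in this sketch is hypothetical (`tag_groups`, the `model_group` field, the rule format, and the sample values) and it does not reflect how `tag_model_groups_sup` reads its settings. It also checks up front that every field a rule depends on actually exists. A rule that references a missing field is one possible cause of a `KeyError` at this stage, although the cause of the error in the run shown here is not known.

```python
def tag_groups(rows, group_rules):
    """Assign each row the first group whose rule matches; None if no rule does."""
    # Fail early, with a readable message, if a rule needs a missing field.
    available = set().union(*(row.keys() for row in rows))
    for name, fields, _ in group_rules:
        missing = [f for f in fields if f not in available]
        if missing:
            raise KeyError(f"group '{name}' needs missing field(s): {missing}")
    for row in rows:
        row["model_group"] = next(
            (name for name, _, rule in group_rules if rule(row)), None
        )
    return rows


# Each rule: (group name, fields it depends on, predicate over one row).
group_rules = [
    ("single_family", ["bldg_type"], lambda r: r["bldg_type"] == "SFR"),
    ("townhomes", ["bldg_type"], lambda r: r["bldg_type"] == "TOWNHOME"),
    ("commercial", ["zoning_category"],
     lambda r: r["zoning_category"] == "COMMERCIAL"),
]

rows = [
    {"key": "A-1", "bldg_type": "SFR", "zoning_category": "RESIDENTIAL"},
    {"key": "B-2", "bldg_type": "TOWNHOME", "zoning_category": "RESIDENTIAL"},
    {"key": "C-3", "bldg_type": "RETAIL", "zoning_category": "COMMERCIAL"},
]

tagged = tag_groups(rows, group_rules)
print([r["model_group"] for r in tagged])
```

Because the first matching rule wins, the order of the rules matters whenever two groups could both claim the same row.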

In [19]:
sales_univ_pair = from_checkpoint("1-assemble-04-tag_modeling_groups", tag_model_groups_sup,
    {
        "sup": sales_univ_pair, 
        "settings": load_settings(), 
        "verbose": verbose
    }
)

Len univ before = 219373
Len sales before = 114196 after = 114196
Overall
--> 219,373 parcels
--> 114,196 sales


KeyError: 'zoning_category'

# 6. Write out results

In [None]:
write_notebook_output_sup(
    sales_univ_pair, 
    "1-assemble", 
    parquet=True, 
    gpkg=False, 
    shp=False
)

# 7. Look at it on a map!
- Take the files output in the previous step and put them in a map viewer like QGIS, ArcGIS, or Felt
- Look at them with your eyeballs
- Make sure the data looks correct
- If not, go back and fix it!
- Don't proceed to the next step until everything looks right