# Assemble

This notebook loads and assembles all your basic data sources, including tabular and geospatial data.

The final output is two dataframes:

- UNIVERSE
- SALES

The SALES dataframe represents transactions of parcels; these are ownership transfers with prices, dates, and metadata.  
The UNIVERSE dataframe represents the parcels themselves (land and buildings, and their associated characteristics).

These will be packaged together in a data structure called a `SalesUniversePair`, or `sup` for short. `openavmkit` provides many functions that carefully perform operations on `sup`s without mixing up their fields.

The key thing to understand is that the **Assemble** notebook outputs a `sup` that represents *factual assertions* about the world. In later notebooks, we will have to add assumptions, opinions, and educated guesses, but first we establish the firmest facts we can in this notebook.

You can think of the two dataframes in the `sup` as answering the following questions:

- UNIVERSE:
  - Where is each parcel located in space, and what is its shape?
  - What are the *current* characteristics of each parcel?
    - Which parcels have buildings and which are vacant lots?
    - How big is each parcel?
    - What is the age/size/quality/condition/etc of each building?
- SALES:
  - Which parcels have sold?
  - What prices did they sell for?
  - What dates did they sell on?
  - Which sales were valid?
  - What characteristics were different *at the time of sale* from how the parcel is now?
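
To make the split between the two dataframes concrete, here is a minimal sketch using plain Python dicts instead of dataframes. `ToySup` and `sales_as_of_sale_date` are hypothetical names invented for this illustration; they are not part of `openavmkit`, and the real `SalesUniversePair` may behave differently. The idea shown is only that UNIVERSE holds current characteristics, while SALES can carry values that were true at the time of sale.

```python
from dataclasses import dataclass


# Hypothetical stand-in for illustration only; NOT openavmkit's SalesUniversePair.
@dataclass
class ToySup:
    universe: list  # one dict per parcel, holding its *current* characteristics
    sales: list     # one dict per transaction, plus any sale-time characteristics


def sales_as_of_sale_date(sup):
    """Join each sale to its parcel; fields recorded on the sale override current ones."""
    parcels = {row["key"]: row for row in sup.universe}
    merged = []
    for sale in sup.sales:
        current = parcels[sale["key"]]
        # In a dict merge the later dict wins, so sale-time values replace current values.
        merged.append({**current, **sale})
    return merged


sup = ToySup(
    universe=[{"key": "A1", "land_area_sqft": 5000, "is_vacant": False}],
    sales=[{"key": "A1", "sale_price": 40000.0, "sale_date": "2015-03-01", "is_vacant": True}],
)
rows = sales_as_of_sale_date(sup)
```

In this toy case the parcel has a building today, but the merged sale row still records that it was vacant when it sold, and the UNIVERSE row itself is left untouched.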



In [183]:
# Change these as desired

# The slug of the locality you are currently working on
locality = "us-ky-louisville"

# Which cloud service to look for data in (only used for a new locality, ignored otherwise)
bootstrap_cloud = "azure"

# Whether to print out a lot of stuff (can help with debugging) or stay mostly quiet
verbose = True

# Clear previous state for this notebook and start fresh
clear_checkpoints = True

# 1. Basic setup

In [184]:
import init_notebooks
init_notebooks.setup_environment()

Environment setup completed.


In [185]:
# import OpenAVMkit:
from openavmkit.pipeline import ( 
    init_notebook,
    from_checkpoint,
    delete_checkpoints,
    examine_df,
    examine_df_in_ridiculous_detail,
    examine_sup,
    examine_sup_in_ridiculous_detail,
    cloud_sync,
    load_settings,
    load_dataframes,
    process_data,
    process_sales,
    enrich_sup_streets,
    tag_model_groups_sup,
    write_notebook_output_sup
)

In [186]:
init_notebook(locality)

locality = us-ky-louisville
base path = C:\Users\jacks\Documents\Non-Game Stuff\Programming\openavmkit\notebooks\pipeline
current path = C:\Users\jacks\Documents\Non-Game Stuff\Programming\openavmkit\notebooks\pipeline\data\us-ky-louisville


In [187]:
if clear_checkpoints:
    delete_checkpoints("1-assemble")
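
Later cells wrap each step in `from_checkpoint(name, function, params)`. Judging only from how it is called in this notebook, it appears to cache a step's result under a name so that a rerun can skip the work, and `delete_checkpoints` clears those cached results by prefix. The sketch below illustrates that general pattern with `pickle`; the function names, the `folder` argument, and the file layout are assumptions made for this example and are not `openavmkit`'s actual implementation.

```python
import pickle
import tempfile
from pathlib import Path


# Hypothetical helpers; the real openavmkit functions may store results differently.
def toy_from_checkpoint(name, func, params, folder):
    """Return the cached result for `name` if present, else run func(**params) and cache it."""
    path = Path(folder) / f"{name}.pickle"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = func(**params)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


def toy_delete_checkpoints(prefix, folder):
    """Remove every cached result whose name starts with `prefix`."""
    for path in Path(folder).glob(f"{prefix}*.pickle"):
        path.unlink()


calls = []


def expensive_step(n):
    calls.append(n)  # record each real execution
    return n * 2


with tempfile.TemporaryDirectory() as tmp:
    first = toy_from_checkpoint("1-assemble-01-demo", expensive_step, {"n": 21}, tmp)
    second = toy_from_checkpoint("1-assemble-01-demo", expensive_step, {"n": 21}, tmp)
    runs_before_clear = len(calls)  # the second call was served from the cache
    toy_delete_checkpoints("1-assemble", tmp)
    third = toy_from_checkpoint("1-assemble-01-demo", expensive_step, {"n": 21}, tmp)
    runs_after_clear = len(calls)  # clearing forced the step to run again
```

This is why `clear_checkpoints = True` gives you a fresh start: with the cached files gone, every step has to run again.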

# 2. Sync with Cloud
- Syncs with your remote storage, if you have configured cloud storage
- Reconciles your local input files with the versions on the remote server
- Pulls down whatever is newer from the remote server
- Uploads whatever is newer from your local machine
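
The "newer wins in both directions" rule described above can be sketched as a small planning function. This is an illustration only: `plan_sync` is a hypothetical helper, timestamps are plain numbers, and the actual `cloud_sync` logic may compare files differently.

```python
# Hypothetical helper for illustration; not part of openavmkit.
def plan_sync(local, remote):
    """Given {filename: modified_time} maps, return (to_download, to_upload)."""
    to_download, to_upload = [], []
    for name in sorted(set(local) | set(remote)):
        local_time = local.get(name)
        remote_time = remote.get(name)
        if local_time is None or (remote_time is not None and remote_time > local_time):
            # Missing locally, or the remote copy is newer: pull it down.
            to_download.append(name)
        elif remote_time is None or local_time > remote_time:
            # Missing remotely, or the local copy is newer: push it up.
            to_upload.append(name)
        # Equal timestamps: the two copies are treated as already in sync.
    return to_download, to_upload


local = {"in/sales.parquet": 200, "in/universe.parquet": 100, "in/notes.txt": 50}
remote = {"in/sales.parquet": 150, "in/universe.parquet": 300, "in/geoparcels.parquet": 120}
downloads, uploads = plan_sync(local, remote)
```

Here the remote-only and remote-newer files end up in `downloads`, while the local-only and local-newer files end up in `uploads`.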

In [188]:
cloud_sync(locality, verbose=verbose, bootstrap=bootstrap_cloud)

In [189]:
settings = load_settings()

# 3. Load & process data

In [190]:
# load all of our initial dataframes, but don't do anything with them just yet
dataframes = from_checkpoint("1-assemble-01-load_dataframes", load_dataframes,
    {
        "settings":load_settings(),
        "verbose":verbose
    }
)

Loading "in/geoparcels.parquet"...
Loading "in/universe.parquet"...


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  super().__setitem__(key, value)


Loading "in/sales.parquet"...
Valid sales: 238422 out of 238422 total
Loading "in/universe.parquet"...


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  super().__setitem__(key, value)


In [191]:
dataframes["sales"]

Unnamed: 0,key,sale_price,sale_type,sale_date,key_sale,valid_sale,sale_year,sale_month,sale_day,sale_quarter,sale_year_month,sale_year_quarter,sale_age_days
0,000100040000,0.0,,2006-07-13,000100040000---2006-07-13,True,2006,7,13,3,2006-07,2006Q3,6747
1,000100070000,5000000.0,,2000-02-01,000100070000---2000-02-01,True,2000,2,1,1,2000-02,2000Q1,9101
2,000100090000,5000000.0,,2000-02-01,000100090000---2000-02-01,True,2000,2,1,1,2000-02,2000Q1,9101
3,000100140000,5000000.0,,2006-01-31,000100140000---2006-01-31,True,2006,1,31,1,2006-01,2006Q1,6910
4,000100150000,5000000.0,,2006-01-31,000100150000---2006-01-31,True,2006,1,31,1,2006-01,2006Q1,6910
...,...,...,...,...,...,...,...,...,...,...,...,...,...
237829,W00701810000,729500.0,Homeowner to Investor,2020-11-07,W00701810000---2020-11-07,True,2020,11,7,4,2020-11,2020Q4,1516
237830,W00701830000,300000.0,,2014-12-01,W00701830000---2014-12-01,True,2014,12,1,4,2014-12,2014Q4,3684
237831,W00701840000,1800000.0,,2017-06-11,W00701840000---2017-06-11,True,2017,6,11,2,2017-06,2017Q2,2761
237832,W00701850000,2500000.0,,2017-06-11,W00701850000---2017-06-11,True,2017,6,11,2,2017-06,2017Q2,2761
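
The derived columns in the table above follow a visible pattern: `key_sale` joins the parcel key and sale date with `---`, and the remaining fields are calendar parts of `sale_date`. The sketch below reproduces the first row using only the standard library. `derive_sale_fields` is a hypothetical function, not the library's code, and the reference date of 2025-01-01 is an inference: it is the date that reproduces the `sale_age_days` values shown, not something the notebook states.

```python
from datetime import date


# Hypothetical helper for illustration; not part of openavmkit.
def derive_sale_fields(key, sale_date, reference_date):
    """Build the derived sale columns from a parcel key and a sale date."""
    quarter = (sale_date.month - 1) // 3 + 1
    return {
        "key_sale": f"{key}---{sale_date.isoformat()}",
        "sale_year": sale_date.year,
        "sale_month": sale_date.month,
        "sale_day": sale_date.day,
        "sale_quarter": quarter,
        "sale_year_month": f"{sale_date.year}-{sale_date.month:02d}",
        "sale_year_quarter": f"{sale_date.year}Q{quarter}",
        # Age of the sale in days, measured back from the (assumed) reference date.
        "sale_age_days": (reference_date - sale_date).days,
    }


fields = derive_sale_fields("000100040000", date(2006, 7, 13), date(2025, 1, 1))
```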


In [192]:
dataframes

{'geo_parcels':                  key                                           geometry
 0       000100020000  POLYGON ((1248200.995 319417.121, 1248397.047 ...
 1       000100040000  POLYGON ((1250455.369 319064.562, 1250477.392 ...
 2       000100050000  POLYGON ((1244210.031 320000, 1244209.031 3199...
 3       000100060000  POLYGON ((1250444.955 317331.51, 1250673.332 3...
 4       000100070000  POLYGON ((1245713.096 320044.013, 1245732.005 ...
 ...              ...                                                ...
 284424  W00701830000  POLYGON ((1243826.468 304645.882, 1243846.558 ...
 284425  W00701840000  POLYGON ((1242310.565 302176.502, 1241848.507 ...
 284426  W00701850000  POLYGON ((1242328.093 302148.3, 1242306.132 30...
 284427  W00701860000  POLYGON ((1242039.511 300353.589, 1241849.53 3...
 284428          <NA>  POLYGON ((1222214.656 238748.78, 1222184.905 2...
 
 [284429 rows x 2 columns],
 'universe':         bldg_ac                             property_owner_address

In [193]:
dataframes["vacant_sales"]

Unnamed: 0,key,property_type,geometry,vacant_sale
0,000100020000,Structure,"POLYGON ((-85.61807 38.36913, -85.61739 38.369...",False
1,000100040000,Land,"POLYGON ((-85.61019 38.36825, -85.61011 38.367...",True
2,000100050000,Structure,"POLYGON ((-85.63202 38.37057, -85.63202 38.370...",False
3,000100060000,Structure,"POLYGON ((-85.61014 38.36349, -85.60934 38.363...",False
4,000100070000,Structure,"POLYGON ((-85.62678 38.37075, -85.62672 38.370...",False
...,...,...,...,...
290068,W00701810000,Land,"POLYGON ((-85.63098 38.32632, -85.63105 38.326...",True
290069,W00701830000,Structure,"POLYGON ((-85.63256 38.32839, -85.63248 38.327...",False
290070,W00701840000,Land,"POLYGON ((-85.63771 38.32155, -85.63929 38.320...",True
290071,W00701850000,Structure,"POLYGON ((-85.63765 38.32147, -85.63772 38.321...",False


In [194]:
# assemble our data
sales_univ_pair = from_checkpoint("1-assemble-02-process_data", process_data,
    {
        "dataframes":dataframes, 
        "settings":load_settings(), 
        "verbose":verbose
    }
)

Valid sales: 237834 (100.0% of 237834 total)
Enriching data...
Performing spatial joins...
Using "key" to merge shapefiles onto df
Performing basic geometric enrichment...
--> added latitude/longitude...(12.08s)
--> calculated GIS area of each parcel...(0.08s)


  gdf["land_area_sqft"] = gdf["land_area_sqft"].combine_first(
  gdf["land_area_sqft"].combine_first(gdf["land_area_gis_sqft"])
 1.24246859e+04 1.75888366e+04 1.45379423e+04 3.22748165e+03
 1.34470404e+04 8.05322608e+03 9.01368305e+03 3.92209829e+03
 6.06836251e+03 8.59676850e+03 4.49311436e+03 7.94569813e+04
 2.06818085e+04]' has dtype incompatible with int32, please explicitly cast to a compatible dtype first.
  gdf.loc[


--> calculated parcel rectangularity...(10.82s)
--> calculated parcel aspect ratios...(2.42s)
--> identifying irregular parcels...
----> simplified geometry...(4.33s)
----> identified triangular parcels...(126.74s)
----> identified complex geometry...(6.63s)
----> identified elongated parcels...(0.00s)
----> finished up...(0.19s)
--> identified irregular parcels (total)...(137.89s)
--> calculated polar coordinates...(6.15s)
Enriching with Census data...
Enriching with OpenStreetMap data...
Performing reference table joins...


In [195]:
# calculate street frontages
sales_univ_pair = from_checkpoint("1-assemble-03-enrich_streets", enrich_sup_streets,
    {
        "sup": sales_univ_pair,
        "settings":load_settings(), 
        "verbose":verbose
    }
)

--> found streets in in/osm/streets.parquet, loading from disk!


# 4. Inspect results

## 4.1 Examine

- Run the next cell and look at the printed results
- Pay particular attention to the "Non-zero" and "Non-null" columns and make sure they match what you expect
- This view is a quick overview of what your data contains
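
As a rough guide to reading the "Non-zero" and "Non-null" columns, the sketch below computes both counts for one field over plain Python dicts. `field_summary` is a hypothetical helper, and it assumes that both percentages are taken against the total row count; `examine_sup` may compute them differently.

```python
# Hypothetical helper for illustration; rows are plain dicts and None stands in for a null.
def field_summary(rows, field):
    """Count non-null and non-zero values of `field`, with percentages of all rows."""
    total = len(rows)
    values = [row.get(field) for row in rows]
    non_null = sum(v is not None for v in values)
    non_zero = sum(v is not None and v != 0 for v in values)
    return {
        "non_zero": non_zero,
        "non_zero_pct": round(100 * non_zero / total),
        "non_null": non_null,
        "non_null_pct": round(100 * non_null / total),
    }


parcels = [
    {"key": "A1", "assr_land_value": 19080.0},
    {"key": "A2", "assr_land_value": 0.0},
    {"key": "A3", "assr_land_value": None},
    {"key": "A4", "assr_land_value": 63600.0},
]
summary = field_summary(parcels, "assr_land_value")
```

A field that is mostly non-null but rarely non-zero often means zeros are being used as a placeholder for missing data, which is worth checking before modeling.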

In [196]:
examine_sup(sales_univ_pair, load_settings())

## 4.2 Examine in ridiculous detail

- You've looked, now LOOK AGAIN. This cell will run `describe()` for each numeric field and `value_counts()` for each categorical field.
- Use this info to decide which variables are useful/useless
- Consult this readout when you build your modeling group filters
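
If you want a feel for what the detailed readout contains before running it on a full dataset, the sketch below imitates the two kinds of summary with the standard library: `statistics` stands in for a reduced version of pandas' `describe()`, and `collections.Counter` stands in for `value_counts()`. The helper names and sample values are made up for the example. Nulls are skipped, which matches the default behavior of both pandas methods.

```python
import statistics
from collections import Counter


# Hypothetical helpers for illustration; None stands in for a null value.
def describe_numeric(values):
    """A reduced describe(): count, mean, min, median, and max of the non-null values."""
    clean = [v for v in values if v is not None]
    return {
        "count": len(clean),
        "mean": statistics.mean(clean),
        "min": min(clean),
        "median": statistics.median(clean),
        "max": max(clean),
    }


def count_categories(values):
    """A value_counts() stand-in: (value, count) pairs, most frequent first."""
    return Counter(v for v in values if v is not None).most_common()


depths = [100.0, 120.0, None, 150.0, 0.0]
types = ["Land", "Structure", "Structure", None, "Structure"]
numeric = describe_numeric(depths)
categories = count_categories(types)
```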

In [197]:
examine_sup_in_ridiculous_detail(sales_univ_pair, load_settings())


EXAMINING UNIVERSE...

            FIELD                 TYPE     NON-ZERO    %    NON-NULL    %                    UNIQUE                 
             LAND             
------------------------------ ---------- ---------- ----- ---------- ----- ----------------------------------------
           NUMERIC            
------------------------------ ---------- ---------- ----- ---------- ----- ----------------------------------------
assr_land_value                 Float64      277,198   98%    282,400  100%                                         
DESCRIBE --> count         282400.0
mean      78705.466873
std      420247.428356
min                0.0
25%            19080.0
50%            39480.0
75%            63600.0
max        109291184.0
Name: assr_land_value, dtype: Float64


depth_ft_1                      float64      268,482   95%    282,400  100%                                         
DESCRIBE --> count    282400.000000
mean        143.107921
std        1482.934022
min       

## 4.3 Look at it on a map

- Go to your `out/look/` folder
- You should find Parquet files there
- Drop them into ArcGIS, QGIS, or Felt
- Look at your location fields and make sure they make sense

# 5. Tag modeling groups
- Separates rows into groups like "single family", "townhomes" and "commercial" as specified by the user
- These groups will guide all further processing
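
The tagging idea can be sketched as a set of named filters applied to each row. Everything here is hypothetical: the group names echo the examples above, the `bldg_desc` values are invented, and real group definitions come from your locality's settings rather than Python lambdas. The sketch also checks up front that every field the filters rely on exists, since a filter that references a missing column is one plausible way to end up with a `KeyError`.

```python
# Hypothetical helper for illustration; not openavmkit's tag_model_groups_sup.
def tag_model_groups(rows, group_filters, required_fields):
    """Assign each row the first group whose filter matches, or None if none match."""
    missing = sorted({f for f in required_fields for row in rows if f not in row})
    if missing:
        raise KeyError(f"fields missing from data: {missing}")
    for row in rows:
        row["model_group"] = next(
            (name for name, matches in group_filters.items() if matches(row)),
            None,
        )
    return rows


# Invented group definitions and category values, for the example only.
group_filters = {
    "single_family": lambda r: r["bldg_desc"] == "Single Family",
    "townhomes": lambda r: r["bldg_desc"] == "Townhome",
    "commercial": lambda r: r["bldg_desc"] in {"Retail", "Office"},
}
parcels = [
    {"key": "A1", "bldg_desc": "Single Family"},
    {"key": "A2", "bldg_desc": "Office"},
    {"key": "A3", "bldg_desc": "Barn"},
]
tagged = tag_model_groups(parcels, group_filters, required_fields=["bldg_desc"])
groups = [row["model_group"] for row in tagged]

try:
    tag_model_groups([{"key": "B1"}], group_filters, required_fields=["bldg_desc"])
    missing_field_detected = False
except KeyError:
    missing_field_detected = True
```

Rows that match no filter are left untagged (`None`) rather than forced into a group, which makes gaps in the group definitions easy to spot.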

In [198]:
sales_univ_pair = from_checkpoint("1-assemble-04-tag_modeling_groups", tag_model_groups_sup,
    {
        "sup": sales_univ_pair, 
        "settings": load_settings(), 
        "verbose": verbose
    }
)

Len univ before = 282400
Len sales before = 237834 after = 237834
Overall
--> 282,400 parcels
--> 237,834 sales


KeyError: 'bldg_desc'

# 6. Write out results

In [None]:
write_notebook_output_sup(
    sales_univ_pair, 
    "1-assemble", 
    parquet=True, 
    gpkg=False, 
    shp=False
)

# 7. Look at it on a map!
- Take the files output in the previous step and put them in a map viewer like QGIS, ArcGIS, or Felt
- Look at them with your eyeballs
- Make sure the data looks correct
- If not, go back and fix it!
- Don't proceed to the next step until everything looks right