# Assemble

This notebook loads and assembles all your basic data sources, including tabular and geospatial data.

The final output is two dataframes:

- UNIVERSE
- SALES

The SALES dataframe represents transactions of parcels; these are ownership transfers with prices, dates, and metadata.  
The UNIVERSE dataframe represents the parcels themselves (land and buildings, and their associated characteristics).

These are packaged together in a data structure called a `SalesUniversePair`, or `sup` for short. `openavmkit` provides many functions that operate on `sup`s carefully, without mixing up the fields of the two dataframes.

The key thing to understand is that the **Assemble** notebook outputs a `sup` that represents *factual assertions* about the world. In later notebooks we will have to add assumptions, opinions, and educated guesses, but in this notebook we first establish the firmest facts we can.

You can think of the two dataframes in the `sup` as answering the following questions:

- UNIVERSE:
  - Where is each parcel located in space, and what is its shape?
  - What are the *current* characteristics of each parcel?
    - Which parcels have buildings and which are vacant lots?
    - How big is each parcel?
    - What is the age, size, quality, condition, etc. of each building?
- SALES:
  - Which parcels have sold?
  - What prices did they sell for?
  - What dates did they sell on?
  - Which sales were valid?
  - What characteristics were different *at the time of sale* from how the parcel is now?
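
To make the last SALES question concrete, here is a minimal sketch in plain Python. It is not openavmkit's data model: the field names (`land_area_sqft`, `bldg_area_sqft`, `is_vacant`, `at_sale`), the values, and the helper `characteristics_at_sale` are all hypothetical. The only idea it illustrates is that a sale record can carry the characteristics that differed when the parcel sold, layered on top of the current facts in the universe.

```python
# Hypothetical, made-up records; real data lives in the two dataframes of a sup.
universe = {
    "0001001": {"land_area_sqft": 1200, "bldg_area_sqft": 1800, "is_vacant": False},
}

sales = [
    {
        "key": "0001001",
        "sale_price": 95000,
        "sale_date": "2019-06-01",
        "valid_sale": True,
        # Sold as a vacant lot; the building was added afterwards.
        "at_sale": {"bldg_area_sqft": 0, "is_vacant": True},
    },
]


def characteristics_at_sale(sale, universe):
    """Return the parcel's characteristics as they stood when the sale happened."""
    current = universe[sale["key"]]
    # Start from the current facts, then apply whatever differed at the time of sale.
    return {**current, **sale.get("at_sale", {})}


then = characteristics_at_sale(sales[0], universe)
print(then)
```

The universe record is left untouched, so "how the parcel is now" and "how it was when it sold" stay separate.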



In [1]:
# Change these as desired

# The slug of the locality you are currently working on
locality = "us-md-baltimorecity"

# Whether to print out a lot of stuff (can help with debugging) or stay mostly quiet
verbose = True

# Clear previous state for this notebook and start fresh
clear_checkpoints = False

# 1. Basic setup

In [2]:
import init_notebooks
init_notebooks.setup_environment()

Environment setup completed.


In [3]:
# import OpenAVMkit:
from openavmkit.pipeline import ( 
    init_notebook,
    from_checkpoint,
    delete_checkpoints,
    examine_df,
    examine_df_in_ridiculous_detail,
    examine_sup,
    examine_sup_in_ridiculous_detail,
    cloud_sync,
    load_settings,
    load_dataframes,
    process_data,
    process_sales,
    enrich_sup_streets,
    tag_model_groups_sup,
    write_notebook_output_sup
)

In [4]:
init_notebook(locality)

locality = us-md-baltimorecity
base path = L:\git\openavmkit\notebooks\pipeline
current path = L:\git\openavmkit\notebooks\pipeline\data\us-md-baltimorecity


In [5]:
if clear_checkpoints:
    delete_checkpoints("1-assemble")

In [6]:
settings = load_settings()

# 2. Sync with Cloud
- If you have configured cloud storage, syncs with your remote storage
- Reconciles your local input files with the versions on the remote server
- Pulls down whatever is newer from the remote server
- Uploads whatever is newer from your local machine
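
The reconciliation described above can be sketched as a comparison of modification times. This is an illustration of the idea only, not how `cloud_sync` is implemented: the function `plan_sync`, the file names, and the choice to treat a file missing on one side as older are all assumptions.

```python
def plan_sync(local_mtimes, remote_mtimes):
    """Decide which files to download and which to upload, by modification time."""
    downloads, uploads = [], []
    for name in sorted(set(local_mtimes) | set(remote_mtimes)):
        local = local_mtimes.get(name)
        remote = remote_mtimes.get(name)
        if local is None or (remote is not None and remote > local):
            downloads.append(name)  # missing locally, or the remote copy is newer
        elif remote is None or local > remote:
            uploads.append(name)  # missing remotely, or the local copy is newer
        # Equal timestamps: already in sync, nothing to do.
    return downloads, uploads


# Hypothetical file names mapped to modification timestamps (seconds).
local = {"parcels.parquet": 100.0, "sales.csv": 300.0, "notes.txt": 50.0}
remote = {"parcels.parquet": 200.0, "sales.csv": 250.0, "zoning.csv": 10.0}

downloads, uploads = plan_sync(local, remote)
print("download:", downloads)
print("upload:", uploads)
```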

In [7]:
#cloud_sync(locality, verbose=True, env_path="../../../.env", settings=settings)

# 3. Load & process data

In [8]:
# load all of our initial dataframes, but don't do anything with them just yet
dataframes = from_checkpoint("1-assemble-01-load_dataframes", load_dataframes,
    {
        "settings":settings,
        "verbose":verbose
    }
)
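
The cells in this section wrap each step in `from_checkpoint`, passing a checkpoint name, a function, and a dict of parameters. Below is a minimal sketch of the general caching pattern, assuming the function is called with the dict as keyword arguments and the result is pickled to disk. `from_checkpoint_sketch` and `slow_load` are hypothetical stand-ins; openavmkit's actual storage format and invalidation rules may differ.

```python
import os
import pickle
import tempfile


def from_checkpoint_sketch(path, func, params):
    """Return the cached result at `path` if present, else compute and cache it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = func(**params)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def slow_load(n):
    calls.append(n)  # record each time the function really runs
    return list(range(n))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "1-assemble-01-load_dataframes.pickle")
    first = from_checkpoint_sketch(path, slow_load, {"n": 3})   # computes and saves
    second = from_checkpoint_sketch(path, slow_load, {"n": 3})  # reads from disk

print(first, second, calls)
```

In a pattern like this, a cached result is returned even if the inputs have changed since it was written, which is presumably the situation the `clear_checkpoints` flag at the top of the notebook is meant for.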

In [9]:
# assemble our data
sales_univ_pair = from_checkpoint("1-assemble-02-process_data", process_data,
    {
        "dataframes":dataframes, 
        "settings":settings, 
        "verbose":verbose
    }
)

In [10]:
df = sales_univ_pair.universe
print("▶️ INPUT df_in CRS:", df.crs)
print("▶️ INPUT df_in bounds:", df.geometry.total_bounds)

▶️ INPUT df_in CRS: {"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json", "type": "GeographicCRS", "name": "WGS 84", "datum_ensemble": {"name": "World Geodetic System 1984 ensemble", "members": [{"name": "World Geodetic System 1984 (Transit)"}, {"name": "World Geodetic System 1984 (G730)"}, {"name": "World Geodetic System 1984 (G873)"}, {"name": "World Geodetic System 1984 (G1150)"}, {"name": "World Geodetic System 1984 (G1674)"}, {"name": "World Geodetic System 1984 (G1762)"}, {"name": "World Geodetic System 1984 (G2139)"}, {"name": "World Geodetic System 1984 (G2296)"}], "ellipsoid": {"name": "WGS 84", "semi_major_axis": 6378137, "inverse_flattening": 298.257223563}, "accuracy": "2.0", "id": {"authority": "EPSG", "code": 6326}}, "coordinate_system": {"subtype": "ellipsoidal", "axis": [{"name": "Geodetic latitude", "abbreviation": "Lat", "direction": "north", "unit": "degree"}, {"name": "Geodetic longitude", "abbreviation": "Lon", "direction": "east", "unit": "degree"}]}, 

In [None]:
# calculate street frontages
sales_univ_pair = from_checkpoint("1-assemble-03-enrich_streets", enrich_sup_streets,
    {
        "sup": sales_univ_pair,
        "settings":settings, 
        "verbose":verbose
    }
)

▶ df after to_crs: +proj=utm +zone=18 +datum=WGS84 +units=m +type=crs
  df.bounds (m): [ 352406.58385038 4339813.28087385  368236.29548508 4359449.84113314]
T setup = 3s
T prepare = 0s
Loading network within (39.195269305450395,-76.7171119681943) -> (39.37649723361619,-76.52388897055145)
T load street = 64s
T edges = 7s
Generating rays for 152991 edges with 8 jobs...


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 tasks      | elapsed:    0.2s
[Parallel(n_jobs=8)]: Done   9 tasks      | elapsed:    0.3s
[Parallel(n_jobs=8)]: Done  16 tasks      | elapsed:    0.3s
[Parallel(n_jobs=8)]: Batch computation too fast (0.17461048648676988s.) Setting batch_size=2.
[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.4s
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.4s
[Parallel(n_jobs=8)]: Batch computation too fast (0.029026508331298828s.) Setting batch_size=4.
[Parallel(n_jobs=8)]: Done  52 tasks      | elapsed:    0.4s
[Parallel(n_jobs=8)]: Done  75 tasks      | elapsed:    0.5s
[Parallel(n_jobs=8)]: Batch computation too fast (0.050551652908325195s.) Setting batch_size=8.
[Parallel(n_jobs=8)]: Done 127 tasks      | elapsed:    0.5s
[Parallel(n_jobs=8)]: Batch computation too fast (0.056055545806884766s.) Setting batch_size=16.
[Parallel(n_jobs=8)]: Done 216 tasks      | elapsed: 

▶ rays_gdf: +proj=utm +zone=18 +datum=WGS84 +units=m +type=crs
  rays.bounds (m): [ 351741.04143925 4339536.03091597  368759.12648424 4359970.92087713]
RAYS GDF


|  | road_idx | road_name | road_type | geometry | angle |
|---|---|---|---|---|---|
| 0 | 0 | Baltimore-Washington Parkway | motorway | LINESTRING (354127.513 4340888.256, 354108.233... | 2.451510 |
| 1 | 0 | Baltimore-Washington Parkway | motorway | LINESTRING (354127.513 4340888.256, 354146.792... | -0.690083 |
| 2 | 0 | Baltimore-Washington Parkway | motorway | LINESTRING (354128.144 4340889.02, 354108.864 ... | 2.451510 |
| 3 | 0 | Baltimore-Washington Parkway | motorway | LINESTRING (354128.144 4340889.02, 354147.423 ... | -0.690083 |
| 4 | 0 | Baltimore-Washington Parkway | motorway | LINESTRING (354128.774 4340889.784, 354109.495... | 2.451510 |
| ... | ... | ... | ... | ... | ... |
| 20990433 | 152990 | 1381993393 | service | LINESTRING (367063.928 4349834.974, 367038.964... | -3.088254 |
| 20990434 | 152990 | 1381993393 | service | LINESTRING (367063.974 4349834.116, 367088.939... | 0.053339 |
| 20990435 | 152990 | 1381993393 | service | LINESTRING (367063.974 4349834.116, 367039.01 ... | -3.088254 |
| 20990436 | 152990 | 1381993393 | service | LINESTRING (367064.02 4349833.259, 367088.984 ... | 0.053339 |


--> T rays_parallel = 230s
▶ parcels gdf: +proj=utm +zone=18 +datum=WGS84 +units=m +type=crs
  parcels.bounds (m): [ 352406.58385038 4339813.28087385  368236.29548508 4359449.84113314]
GDF


|  | key | parcel_geom |
|---|---|---|
| 0 | 0001001 | POLYGON ((357648.544 4352400.592, 357647.7 435... |
| 1 | 0001002 | POLYGON ((357651.971 4352425.288, 357651.945 4... |
| 2 | 0001003 | POLYGON ((357652.789 4352401.625, 357652.815 4... |
| 3 | 0001004 | POLYGON ((357656.216 4352426.32, 357657.173 43... |
| 4 | 0001005 | POLYGON ((357660.617 4352426.462, 357661.53 43... |
| ... | ... | ... |
| 236500 | PSC0020 | POLYGON ((358018.025 4356182.746, 358009.276 4... |
| 236501 | PSC0065 | POLYGON ((366866.257 4347429.908, 366866.226 4... |
| 236502 | PSC0070 | POLYGON ((365158.935 4351297.208, 365154.751 4... |
| 236503 | PSC0075 | POLYGON ((354993.102 4347941.94, 355018.028 43... |


RAY PAR


|  | road_idx | road_name | road_type | geometry | angle | index_right | key |
|---|---|---|---|---|---|---|---|
| 15335 | 14 | 549573117 | motorway | LINESTRING (360200.583 4347676.569, 360200.764... | -1.563547 | 35017 | 1040002A |
| 15337 | 14 | 549573117 | motorway | LINESTRING (360201.574 4347676.576, 360201.756... | -1.563547 | 35017 | 1040002A |
| 15339 | 14 | 549573117 | motorway | LINESTRING (360202.566 4347676.584, 360202.747... | -1.563547 | 35017 | 1040002A |
| 15341 | 14 | 549573117 | motorway | LINESTRING (360203.558 4347676.591, 360203.739... | -1.563547 | 35017 | 1040002A |
| 15343 | 14 | 549573117 | motorway | LINESTRING (360204.549 4347676.598, 360204.73 ... | -1.563547 | 35017 | 1040002A |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 20990420 | 152990 | 1381993393 | service | LINESTRING (367063.62 4349840.514, 367088.579 ... | 0.057560 | 207454 | 6344004 |
| 20990421 | 152990 | 1381993393 | service | LINESTRING (367063.62 4349840.514, 367038.661 ... | -3.084033 | 207451 | 6344001 |
| 20990421 | 152990 | 1381993393 | service | LINESTRING (367063.62 4349840.514, 367038.661 ... | -3.084033 | 207452 | 6344002 |
| 20990422 | 152990 | 1381993393 | service | LINESTRING (367063.677 4349839.525, 367088.636... | 0.057560 | 207453 | 6344003 |


T block = 292s
T ray_par = 8s
T dist_0 = 0s
T origins setup = 8s
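
The log above shows the parcels moving from WGS 84 (degrees) to UTM zone 18 (metres) before rays and frontages are computed; lengths can only be measured meaningfully in a projected CRS. As a sanity check on that zone, here is a small sketch using the standard 6-degree UTM zone layout. The helpers `utm_zone` and `utm_proj_string` are hypothetical, they ignore the special-case zones around Norway and Svalbard, and this is not necessarily how openavmkit picks its projection.

```python
import math


def utm_zone(lon_deg):
    """Standard 6-degree UTM zone number (1-60) for a longitude in degrees."""
    return int(math.floor((lon_deg + 180.0) / 6.0)) % 60 + 1


def utm_proj_string(lon_deg, lat_deg):
    """PROJ string for the UTM zone containing the point (WGS 84 datum)."""
    south = " +south" if lat_deg < 0 else ""
    return f"+proj=utm +zone={utm_zone(lon_deg)}{south} +datum=WGS84 +units=m +type=crs"


# Centre of the (lat, lon) bounding box printed in the "Loading network within" line.
lon = (-76.7171119681943 + -76.52388897055145) / 2
lat = (39.195269305450395 + 39.37649723361619) / 2
print(utm_proj_string(lon, lat))
```

For the Baltimore bounding box this yields zone 18, which agrees with the CRS reported in the log.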


# 4. Inspect results

## 4.1 Examine

- Run the next cell and look at the printed results.
- Note the "Non-zero" and "Non-null" columns in particular and make sure they're what you expect.
- This view is a quick overview of what your data contains.
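
For a sense of what "Non-null" and "Non-zero" measure, here is a dependency-free sketch that computes both shares per column over a list of row dicts. The function `column_coverage`, the field names, and the values are hypothetical, and treating nulls as not non-zero is an assumption; `examine_sup` may define and format these columns differently.

```python
def column_coverage(rows):
    """Per-column share of rows that are non-null and that are non-zero."""
    if not rows:
        return {}
    columns = sorted({name for row in rows for name in row})
    total = len(rows)
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        non_zero = [v for v in non_null if v != 0]  # nulls never count as non-zero
        report[col] = {
            "non_null": len(non_null) / total,
            "non_zero": len(non_zero) / total,
        }
    return report


# Hypothetical rows standing in for a slice of the universe dataframe.
rows = [
    {"key": "A", "bldg_area_sqft": 1500, "bldg_year_built": 1920},
    {"key": "B", "bldg_area_sqft": 0, "bldg_year_built": None},
    {"key": "C", "bldg_area_sqft": 2200, "bldg_year_built": None},
    {"key": "D", "bldg_area_sqft": 0, "bldg_year_built": 1955},
]

report = column_coverage(rows)
for col, stats in report.items():
    print(col, stats)
```

A field that is fully non-null but half zero, like `bldg_area_sqft` here, is worth a second look: the zeros may be genuine vacant lots or missing data recorded as zero.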

In [None]:
settings = load_settings()
examine_sup(sales_univ_pair, settings)

## 4.2 Examine in ridiculous detail

- You've looked, now LOOK AGAIN. This cell will run `describe()` for each numeric field and `value_counts()` for each categorical field.
- Use this info to decide which variables are useful/useless
- Consult this readout when you build your modeling group filters
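
As a rough picture of the two kinds of summary mentioned above, here is a dependency-free analogue written with the standard library. It is a reduced stand-in for the pandas `describe()` and `value_counts()` methods (no standard deviation or quartiles), and the helper names and sample values are hypothetical.

```python
from collections import Counter
import statistics


def describe_numeric(values):
    """Count, mean, min, median and max of the non-null values.

    Assumes at least one non-null value is present.
    """
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "min": min(present),
        "median": statistics.median(present),
        "max": max(present),
    }


def value_counts(values):
    """Occurrences of each category, most common first."""
    return Counter(values).most_common()


# Hypothetical numeric and categorical fields.
areas = [1000, 2000, None, 3000]
styles = ["townhome", "single_family", "townhome"]

summary = describe_numeric(areas)
counts = value_counts(styles)
print(summary)
print(counts)
```

A numeric field whose minimum equals its maximum, or a categorical field with a single value, carries no information for modeling and is a candidate to drop.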

In [None]:
examine_sup_in_ridiculous_detail(sales_univ_pair, settings)

## 4.3 Look at it on a map

- Go to your `out/look/` folder
- It should contain Parquet files
- Drop them into ArcGIS, QGIS, or Felt
- Look at your location fields and make sure they make sense

# 5. Tag modeling groups
- Separates rows into groups like "single family", "townhomes" and "commercial" as specified by the user
- These groups will guide all further processing
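
The grouping idea can be sketched as an ordered set of named rules applied to each row. Everything here is hypothetical: the rule format, the field names (`land_use`, `units`, `attached`), and the helper `tag_model_group` are illustrations only, and `tag_model_groups_sup` reads its group definitions from your settings in whatever format openavmkit specifies.

```python
# Hypothetical rules: group name -> predicate on a parcel row.
# Order matters: the first matching rule wins.
GROUP_RULES = {
    "single_family": lambda row: (
        row["land_use"] == "residential" and row["units"] == 1 and not row["attached"]
    ),
    "townhomes": lambda row: row["land_use"] == "residential" and row["attached"],
    "commercial": lambda row: row["land_use"] == "commercial",
}


def tag_model_group(row, rules=GROUP_RULES):
    """Return the name of the first group whose rule matches, or None."""
    for name, matches in rules.items():
        if matches(row):
            return name
    return None


# Hypothetical parcel rows.
parcels = [
    {"key": "A", "land_use": "residential", "units": 1, "attached": False},
    {"key": "B", "land_use": "residential", "units": 1, "attached": True},
    {"key": "C", "land_use": "commercial", "units": 0, "attached": False},
    {"key": "D", "land_use": "industrial", "units": 0, "attached": False},
]

tags = {p["key"]: tag_model_group(p) for p in parcels}
print(tags)
```

Rows that match no rule come back as `None`, which makes unassigned parcels easy to spot before they silently drop out of later steps.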

In [None]:
settings = load_settings()
sales_univ_pair = from_checkpoint("1-assemble-04-tag_modeling_groups", tag_model_groups_sup,
    {
        "sup": sales_univ_pair, 
        "settings": settings, 
        "verbose": verbose
    }
)

# 6. Write out results

In [None]:
write_notebook_output_sup(sales_univ_pair, "1-assemble")

# 7. Look at it on a map!
- Take the files output in the previous step and put them in a map viewer like QGIS, ArcGIS, or Felt
- Look at them with your eyeballs
- Make sure the data looks correct
- If not, go back and fix it!
- Don't proceed to the next step until everything looks right