# Lab 2 - Finding common parcel columns


## Task 0 - Inspecting parcel columns

In the previous lab, you created a parcel file column summary table. In this lab, we will (A) decide on which years to include based on common columns and (B) gather the names of the common columns for these years.

1. Inspect the parcel file column summary table.
2. Explain why 2003 is problematic.
3. Explain why focusing on 2004-2015 is reasonable.

> <font color="orange">We are going to drop 2003 because it is missing many of the columns found in the other years. Since we want an unbroken sequence of years, cutting 2003 also means cutting 2002. I also think we should drop the columns `OWN_NAME`, `PIN_1`, `STREET`, `STRUC_TYPE`, `TAX_ADD_LI`, and several others at the end of the column summary table. The 2004-2015 data gives us a good continuous sequence of years.</font>

## Aside -- Python `set`s

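The `set` is a core Python data structure that represents an unordered collection of unique, hashable elements (here, column labels) and provides set operations like `union`, `intersection`, etc. Sets can be constructed using either the `set` type constructor or `{}` as delimiters. Note that an empty `{}` creates a `dict`, so an empty set must be written as `set()`.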

In [1]:
L = ['a', 'a', 'b', 'd']
s1 = set(L)
s1

{'a', 'b', 'd'}

In [2]:
s2 = {'a', 'c', 'd'}
s2

{'a', 'c', 'd'}

In [3]:
empty_set = set()
empty_set

set()

#### Set operations

In [4]:
s1.union(s2)

{'a', 'b', 'c', 'd'}

In [5]:
s1.intersection(s2)

{'a', 'd'}

In [6]:
s1.symmetric_difference(s2) # In one but not both

{'b', 'c'}

In [7]:
s1 - s2 # in s1 but not s2

{'b'}

In [8]:
s2 - s1 # in s2 but not s1

{'c'}
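
The method calls above also have operator forms. The following is a small sketch that reuses `s1` and `s2` from the cells above and checks that each operator agrees with its method; the variable `shared` is introduced only for illustration.

```python
s1 = {'a', 'b', 'd'}
s2 = {'a', 'c', 'd'}

# Operator forms of the set methods shown above
print(s1 | s2 == s1.union(s2))                 # True
print(s1 & s2 == s1.intersection(s2))          # True
print(s1 ^ s2 == s1.symmetric_difference(s2))  # True

# The intersection is always a subset of each operand
shared = s1 & s2
print(shared <= s1 and shared <= s2)           # True
```

The operators require both operands to be sets, while the methods accept any iterable, so `s1.intersection(['a', 'd'])` works but `s1 & ['a', 'd']` raises a `TypeError`.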

## Task 1 - Finding all Common Columns
1.  Use `composable.glob.glob`, `composable.strict.map`, and `composable.strict.filter` to create a list of `pyspark` data frames, one for each parcel file from 2004-2015.  You might want to reuse some of your code from the last lab.  Do this by (A) packaging helper functions in a file called `utility.py` and (B) importing these functions here.
2. Use `map` to extract the header from each data frame.
3. `map` the `set` constructor onto the list of headers.
4. Reduce the list of sets of column labels to the intersection of column labels.
5. Convert the intersection to a list and sort the labels using `composable.strict.sorted`.

Do this with two pipes, one for steps 1-4 and another for step 5.
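
Steps 2 through 5 can be previewed without Spark. The sketch below uses the built-in `map` and `sorted` and the standard library's `functools.reduce` in place of the `composable` versions. The three headers are hypothetical stand-ins, not the real parcel file headers.

```python
from functools import reduce

# Hypothetical headers, standing in for `df.columns` from each year's data frame
headers = [
    ['PIN', 'CITY', 'ZIP', 'ACRES_POLY'],
    ['PIN', 'CITY', 'ZIP', 'OWNER_NAME'],
    ['PIN', 'ZIP', 'CITY', 'YEAR_BUILT'],
]

# Step 3: turn each header into a set of column labels
header_sets = list(map(set, headers))

# Step 4: reduce the sets to the labels present in every header
common = reduce(lambda acc, s: acc.intersection(s), header_sets)

# Step 5: convert the intersection to a sorted list
sorted_common = sorted(common)
print(sorted_common)  # ['CITY', 'PIN', 'ZIP']
```

Without an initial value, `reduce` uses the first set as the starting accumulator, so it raises a `TypeError` on an empty list. In the lab this would only happen if the file filter matched no files.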

In [9]:
# imports 
from composable.strict import map, filter, sorted
from composable.sequence import reduce, to_list
from composable.glob import glob
from utility import get_year, make_data_frame

22/11/29 22:04:52 WARN Utils: Your hostname, jt7372wd222 resolves to a loopback address: 127.0.1.1; using 172.19.154.159 instead (on interface eth0)
22/11/29 22:04:52 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/11/29 22:04:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/11/29 22:04:54 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [10]:
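# Pipe 1 (steps 1-4): the set of columns shared by every parcel file after 2003

common_cols_set = ('./data/MinneMUDAC_raw_files/*parcels.txt'
                   >> glob
                   # Keep only the files for 2004 onward
                   >> filter(lambda parcel_file: int(get_year(parcel_file)) > 2003)
                   # One data frame per file, then one set of column labels per data frame
                   >> map(make_data_frame)
                   >> map(lambda parcel_data_frame: set(parcel_data_frame.columns))
                   # Intersect the sets pairwise to get the labels present in every year
                   >> reduce(lambda acc, s: acc.intersection(s))
                   )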
common_cols_set

{'ACRES_DEED',
 'ACRES_POLY',
 'AGPRE_ENRD',
 'AGPRE_EXPD',
 'AG_PRESERV',
 'BASEMENT',
 'BLDG_NUM',
 'BLOCK',
 'CITY',
 'CITY_USPS',
 'COOLING',
 'COUNTY_ID',
 'DWELL_TYPE',
 'EMV_BLDG',
 'EMV_LAND',
 'EMV_TOTAL',
 'FIN_SQ_FT',
 'GARAGESQFT',
 'GREEN_ACRE',
 'HEATING',
 'HOME_STYLE',
 'LANDMARK',
 'LOT',
 'MULTI_USES',
 'NUM_UNITS',
 'OPEN_SPACE',
 'OWNER_MORE',
 'OWNER_NAME',
 'OWN_ADD_L1',
 'OWN_ADD_L2',
 'OWN_ADD_L3',
 'PARC_CODE',
 'PIN',
 'PLAT_NAME',
 'PREFIXTYPE',
 'PREFIX_DIR',
 'SALE_DATE',
 'SALE_VALUE',
 'SCHOOL_DST',
 'SPEC_ASSES',
 'STREETNAME',
 'STREETTYPE',
 'SUFFIX_DIR',
 'Shape_Area',
 'Shape_Leng',
 'TAX_ADD_L1',
 'TAX_ADD_L2',
 'TAX_ADD_L3',
 'TAX_CAPAC',
 'TAX_EXEMPT',
 'TAX_NAME',
 'TOTAL_TAX',
 'UNIT_INFO',
 'USE1_DESC',
 'USE2_DESC',
 'USE3_DESC',
 'USE4_DESC',
 'WSHD_DIST',
 'XUSE1_DESC',
 'XUSE2_DESC',
 'XUSE3_DESC',
 'XUSE4_DESC',
 'YEAR_BUILT',
 'Year',
 'ZIP',
 'ZIP4',
 'centroid_lat',
 'centroid_long'}

In [11]:
sorted_common_cols_list = (common_cols_set
                           >> to_list()
                           >> sorted)
sorted_common_cols_list

['ACRES_DEED',
 'ACRES_POLY',
 'AGPRE_ENRD',
 'AGPRE_EXPD',
 'AG_PRESERV',
 'BASEMENT',
 'BLDG_NUM',
 'BLOCK',
 'CITY',
 'CITY_USPS',
 'COOLING',
 'COUNTY_ID',
 'DWELL_TYPE',
 'EMV_BLDG',
 'EMV_LAND',
 'EMV_TOTAL',
 'FIN_SQ_FT',
 'GARAGESQFT',
 'GREEN_ACRE',
 'HEATING',
 'HOME_STYLE',
 'LANDMARK',
 'LOT',
 'MULTI_USES',
 'NUM_UNITS',
 'OPEN_SPACE',
 'OWNER_MORE',
 'OWNER_NAME',
 'OWN_ADD_L1',
 'OWN_ADD_L2',
 'OWN_ADD_L3',
 'PARC_CODE',
 'PIN',
 'PLAT_NAME',
 'PREFIXTYPE',
 'PREFIX_DIR',
 'SALE_DATE',
 'SALE_VALUE',
 'SCHOOL_DST',
 'SPEC_ASSES',
 'STREETNAME',
 'STREETTYPE',
 'SUFFIX_DIR',
 'Shape_Area',
 'Shape_Leng',
 'TAX_ADD_L1',
 'TAX_ADD_L2',
 'TAX_ADD_L3',
 'TAX_CAPAC',
 'TAX_EXEMPT',
 'TAX_NAME',
 'TOTAL_TAX',
 'UNIT_INFO',
 'USE1_DESC',
 'USE2_DESC',
 'USE3_DESC',
 'USE4_DESC',
 'WSHD_DIST',
 'XUSE1_DESC',
 'XUSE2_DESC',
 'XUSE3_DESC',
 'XUSE4_DESC',
 'YEAR_BUILT',
 'Year',
 'ZIP',
 'ZIP4',
 'centroid_lat',
 'centroid_long']

## Task 2 - One Big Happy File

Since we will want to reuse the information generated in this lab, it will be useful to save it to a Python file.

1. Create a Python file named `parcel.py`.
2. Save the set representing the common columns to a variable named `common_columns_2004_to_2015`.
3. Save the sorted list of common columns to a variable named `sorted_common_columns_2004_to_2015`.
4. Restart the kernel and verify that you can import both of these data structures.
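
The write-then-import round trip can be tried in isolation first. The sketch below is a minimal example, not the required solution: the column set, the module name `parcel_demo`, and the variable names inside it are hypothetical, and the file is written to a temporary directory so that nothing in the lab folder is touched.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the real set of common columns
common_cols = {'PIN', 'CITY', 'ZIP'}

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / 'parcel_demo.py'  # hypothetical module name
    # Writing sorted labels keeps the generated file identical from run to run
    labels = sorted(common_cols)
    module_path.write_text(
        f'common_columns = set({labels!r})\n'
        f'sorted_common_columns = {labels!r}\n'
    )

    # Import the generated module from the temporary directory
    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        parcel_demo = importlib.import_module('parcel_demo')
    finally:
        sys.path.remove(tmp)

print(parcel_demo.common_columns == common_cols)  # True
print(parcel_demo.sorted_common_columns)          # ['CITY', 'PIN', 'ZIP']
```

The `!r` conversion writes the `repr` of the list, which for lists of strings is a valid Python literal. This is why the generated file can be imported like any hand-written module.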

In [12]:
# Write both collections to parcel.py as Python literals so they can be imported later
with open('parcel.py', 'w') as f:
    f.write(f'common_columns_2004_to_2015 = {common_cols_set!r}\n')
    f.write(f'sorted_common_columns_2004_to_2015 = {sorted_common_cols_list!r}\n')


In [13]:
# After restarting the kernel, check that both names import from parcel.py
from parcel import common_columns_2004_to_2015, sorted_common_columns_2004_to_2015

common_columns_2004_to_2015

{'ACRES_DEED',
 'ACRES_POLY',
 'AGPRE_ENRD',
 'AGPRE_EXPD',
 'AG_PRESERV',
 'BASEMENT',
 'BLDG_NUM',
 'BLOCK',
 'CITY',
 'CITY_USPS',
 'COOLING',
 'COUNTY_ID',
 'DWELL_TYPE',
 'EMV_BLDG',
 'EMV_LAND',
 'EMV_TOTAL',
 'FIN_SQ_FT',
 'GARAGESQFT',
 'GREEN_ACRE',
 'HEATING',
 'HOME_STYLE',
 'LANDMARK',
 'LOT',
 'MULTI_USES',
 'NUM_UNITS',
 'OPEN_SPACE',
 'OWNER_MORE',
 'OWNER_NAME',
 'OWN_ADD_L1',
 'OWN_ADD_L2',
 'OWN_ADD_L3',
 'PARC_CODE',
 'PIN',
 'PLAT_NAME',
 'PREFIXTYPE',
 'PREFIX_DIR',
 'SALE_DATE',
 'SALE_VALUE',
 'SCHOOL_DST',
 'SPEC_ASSES',
 'STREETNAME',
 'STREETTYPE',
 'SUFFIX_DIR',
 'Shape_Area',
 'Shape_Leng',
 'TAX_ADD_L1',
 'TAX_ADD_L2',
 'TAX_ADD_L3',
 'TAX_CAPAC',
 'TAX_EXEMPT',
 'TAX_NAME',
 'TOTAL_TAX',
 'UNIT_INFO',
 'USE1_DESC',
 'USE2_DESC',
 'USE3_DESC',
 'USE4_DESC',
 'WSHD_DIST',
 'XUSE1_DESC',
 'XUSE2_DESC',
 'XUSE3_DESC',
 'XUSE4_DESC',
 'YEAR_BUILT',
 'Year',
 'ZIP',
 'ZIP4',
 'centroid_lat',
 'centroid_long'}

In [14]:
sorted_common_columns_2004_to_2015

['ACRES_DEED',
 'ACRES_POLY',
 'AGPRE_ENRD',
 'AGPRE_EXPD',
 'AG_PRESERV',
 'BASEMENT',
 'BLDG_NUM',
 'BLOCK',
 'CITY',
 'CITY_USPS',
 'COOLING',
 'COUNTY_ID',
 'DWELL_TYPE',
 'EMV_BLDG',
 'EMV_LAND',
 'EMV_TOTAL',
 'FIN_SQ_FT',
 'GARAGESQFT',
 'GREEN_ACRE',
 'HEATING',
 'HOME_STYLE',
 'LANDMARK',
 'LOT',
 'MULTI_USES',
 'NUM_UNITS',
 'OPEN_SPACE',
 'OWNER_MORE',
 'OWNER_NAME',
 'OWN_ADD_L1',
 'OWN_ADD_L2',
 'OWN_ADD_L3',
 'PARC_CODE',
 'PIN',
 'PLAT_NAME',
 'PREFIXTYPE',
 'PREFIX_DIR',
 'SALE_DATE',
 'SALE_VALUE',
 'SCHOOL_DST',
 'SPEC_ASSES',
 'STREETNAME',
 'STREETTYPE',
 'SUFFIX_DIR',
 'Shape_Area',
 'Shape_Leng',
 'TAX_ADD_L1',
 'TAX_ADD_L2',
 'TAX_ADD_L3',
 'TAX_CAPAC',
 'TAX_EXEMPT',
 'TAX_NAME',
 'TOTAL_TAX',
 'UNIT_INFO',
 'USE1_DESC',
 'USE2_DESC',
 'USE3_DESC',
 'USE4_DESC',
 'WSHD_DIST',
 'XUSE1_DESC',
 'XUSE2_DESC',
 'XUSE3_DESC',
 'XUSE4_DESC',
 'YEAR_BUILT',
 'Year',
 'ZIP',
 'ZIP4',
 'centroid_lat',
 'centroid_long']