# Lab 2 - Finding common parcel columns


## Task 0 - Inspecting parcel columns

In the previous lab, you created a parcel file column summary table. In this lab, we will (A) decide on which years to include based on common columns and (B) gather the names of the common columns for these years.

1. Inspect the parcel file column summary table.
2. Explain why 2003 is problematic.
3. Explain why focusing on 2004-2015 is reasonable.

> <font color="orange"> The 2003 file has few columns, so including it would shrink the set of common columns, while the 2004-2015 files are more consistent with one another. </font>

## Aside -- Python `set`s

The `set` is a core Python data structure that represents an unordered collection of unique, hashable elements and provides set operations such as `union` and `intersection`. Sets can be constructed with the `set` type constructor or with `{}` delimiters around at least one element. Empty braces create a `dict`, so an empty set must be written `set()`.

In [1]:
L = ['a', 'a', 'b', 'd']
s1 = set(L)
s1

{'a', 'b', 'd'}

In [12]:
s2 = {'a', 'c', 'd'}
s2

{'a', 'c', 'd'}

In [14]:
empty_set = set()
empty_set

set()
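
One pitfall is worth checking directly: empty braces do not produce a set. The short sketch below (the variable names are illustrative only) confirms the types and shows that duplicates and ordering do not affect set equality.

```python
# Empty braces create a dict, not a set.
empty_braces = {}
empty_set = set()

print(type(empty_braces).__name__)  # dict
print(type(empty_set).__name__)     # set

# Sets drop duplicates and have no guaranteed order, so compare them
# with == rather than relying on how they happen to print.
from_list = set(['d', 'a', 'b', 'a'])
literal = {'a', 'b', 'd'}
print(from_list == literal)  # True
```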

#### Set operations

In [3]:
s1.union(s2)

{'a', 'b', 'c', 'd'}

In [5]:
s1.intersection(s2)

{'a', 'd'}

In [7]:
s1.symmetric_difference(s2) # In one but not both

{'b', 'c'}

In [9]:
s1 - s2 # in s1 but not s2

{'b'}

In [15]:
s2 - s1 # in s2 but not s1

{'c'}
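
Each of the methods above also has an operator form, and `set.intersection` accepts more than two sets at once. The sketch below reuses `s1` and `s2` from this aside; `s3` is a made-up third set added only for illustration.

```python
s1 = {'a', 'b', 'd'}
s2 = {'a', 'c', 'd'}
s3 = {'a', 'd', 'e'}  # hypothetical third set

# Operator equivalents of the method calls.
union_op = s1 | s2      # same as s1.union(s2)
inter_op = s1 & s2      # same as s1.intersection(s2)
symdiff_op = s1 ^ s2    # same as s1.symmetric_difference(s2)

print(union_op == s1.union(s2))                   # True
print(inter_op == s1.intersection(s2))            # True
print(symdiff_op == s1.symmetric_difference(s2))  # True

# Intersect several sets in a single call.
common = set.intersection(s1, s2, s3)
print(sorted(common))  # ['a', 'd']
```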

## Task 1 - Finding all Common Columns
1. Use `composable.glob.glob`, `composable.strict.map`, and `composable.strict.filter` to create a list of `pyspark` data frames, one for each parcel file from 2004-2015.  You might want to reuse some of your code from the last lab.  Do this by (A) packaging helper functions in a file called `utility.py` and (B) importing these functions here.
2. Use `map` to extract the header from each data frame.
3. `map` the `set` constructor onto the list of headers.
4. Reduce the list of sets of column labels to the intersection of column labels.
5. Convert the intersection to a list and sort the labels using `composable.strict.sorted`.

Do this with two pipes, one for steps 1-4 and another for step 5.
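
Steps 2-5 can be rehearsed without Spark. The sketch below is not the `composable` pipe the task asks for: it swaps in the built-in `map` and `sorted` plus `functools.reduce`, and the `headers` list is made-up data standing in for the `columns` attribute of each data frame.

```python
from functools import reduce

# Hypothetical headers, one list per year's parcel file.
headers = [
    ['PIN', 'CITY', 'ZIP', 'GARAGE'],
    ['PIN', 'CITY', 'ZIP', 'ZIP4'],
    ['CITY', 'PIN', 'ZIP', 'ZIP4'],
]

# Step 3: turn each header into a set of column labels.
column_sets = list(map(set, headers))

# Step 4: fold the sets together, keeping only labels present in every one.
common = reduce(lambda acc, cols: acc & cols, column_sets)

# Step 5: sort the surviving labels into a list.
sorted_common = sorted(common)
print(sorted_common)  # ['CITY', 'PIN', 'ZIP']
```

A label such as `GARAGE` or `ZIP4` drops out as soon as one year lacks it, which is why the choice of years in Task 0 matters.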

In [9]:
from composable.strict import map, filter

In [12]:
import re
from glob import glob as original_glob

from composable import pipeable
from composable.sequence import reduce
from more_pyspark import to_pandas
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName('Ops').getOrCreate()

# Wrap the standard-library glob with composable's pipeable.
glob = pipeable(original_glob)

# Match the 2010s files and the 2004-2009 files separately, then join them in year order.
parcel1_files = sorted(glob('./data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/201*parcel*.txt'))
parcel2_files = sorted(glob('./data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/200[456789]*parcel*.txt'))
parcel_files = parcel2_files + parcel1_files
parcel_files

['./data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/2004_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/2005_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/2006_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/2007_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/2008_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/2009_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/2010_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/2011_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/2012_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/2013_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/2014_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/2015_metro_tax_parcels.txt']

In [19]:
# Your code here

# Read each file into a data frame, take its header, and convert the header to a set.
name4this = (parcel_files
             >> map(lambda path: spark.read.csv(path, header=True, sep='|'))
             >> map(lambda df: df.columns)
             >> map(set)
            )
name4this

[{'ACRES_DEED',
  'ACRES_POLY',
  'AGPRE_ENRD',
  'AGPRE_EXPD',
  'AG_PRESERV',
  'BASEMENT',
  'BLDG_NUM',
  'BLOCK',
  'CITY',
  'CITY_USPS',
  'COOLING',
  'COUNTY_ID',
  'DWELL_TYPE',
  'EMV_BLDG',
  'EMV_LAND',
  'EMV_TOTAL',
  'FIN_SQ_FT',
  'GARAGE',
  'GARAGESQFT',
  'GREEN_ACRE',
  'HEATING',
  'HOMESTEAD',
  'HOME_STYLE',
  'ID',
  'LANDMARK',
  'LOT',
  'MULTI_USES',
  'NUM_UNITS',
  'OPEN_SPACE',
  'OWNER_MORE',
  'OWNER_NAME',
  'OWN_ADD_L1',
  'OWN_ADD_L2',
  'OWN_ADD_L3',
  'PARC_CODE',
  'PIN',
  'PLAT_NAME',
  'PREFIXTYPE',
  'PREFIX_DIR',
  'SALE_DATE',
  'SALE_VALUE',
  'SCHOOL_DST',
  'SPEC_ASSES',
  'STREETNAME',
  'STREETTYPE',
  'SUFFIX_DIR',
  'Shape_Area',
  'Shape_Leng',
  'TAX_ADD_L1',
  'TAX_ADD_L2',
  'TAX_ADD_L3',
  'TAX_CAPAC',
  'TAX_EXEMPT',
  'TAX_NAME',
  'TOTAL_TAX',
  'UNIT_INFO',
  'USE1_DESC',
  'USE2_DESC',
  'USE3_DESC',
  'USE4_DESC',
  'WSHD_DIST',
  'XUSE1_DESC',
  'XUSE2_DESC',
  'XUSE3_DESC',
  'XUSE4_DESC',
  'YEAR_BUILT',
  'Year',
  'ZIP

In [20]:
# Use a raw string so \d reaches the regex engine unchanged, and escape the literal dot before "txt".
compile_year = re.compile(r'./data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/(\d{4})_metro_tax_parcels\.txt')
get_year = lambda path: compile_year.search(path).group(1)
years = [get_year(path) for path in parcel_files]
the_new_year = sorted(years)
the_new_year

['2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015']

In [25]:
from composable.sequence import reduce

# Keep only the labels that also appear in the next year's columns.
update_set = lambda final, cols: final.intersection(cols)

lab2_p1 = (name4this
           >> reduce(update_set)
          )
lab2_p1

{'ACRES_DEED',
 'ACRES_POLY',
 'AGPRE_ENRD',
 'AGPRE_EXPD',
 'AG_PRESERV',
 'BASEMENT',
 'BLDG_NUM',
 'BLOCK',
 'CITY',
 'CITY_USPS',
 'COOLING',
 'COUNTY_ID',
 'DWELL_TYPE',
 'EMV_BLDG',
 'EMV_LAND',
 'EMV_TOTAL',
 'FIN_SQ_FT',
 'GARAGESQFT',
 'GREEN_ACRE',
 'HEATING',
 'HOME_STYLE',
 'LANDMARK',
 'LOT',
 'MULTI_USES',
 'NUM_UNITS',
 'OPEN_SPACE',
 'OWNER_MORE',
 'OWNER_NAME',
 'OWN_ADD_L1',
 'OWN_ADD_L2',
 'OWN_ADD_L3',
 'PARC_CODE',
 'PIN',
 'PLAT_NAME',
 'PREFIXTYPE',
 'PREFIX_DIR',
 'SALE_DATE',
 'SALE_VALUE',
 'SCHOOL_DST',
 'SPEC_ASSES',
 'STREETNAME',
 'STREETTYPE',
 'SUFFIX_DIR',
 'Shape_Area',
 'Shape_Leng',
 'TAX_ADD_L1',
 'TAX_ADD_L2',
 'TAX_ADD_L3',
 'TAX_CAPAC',
 'TAX_EXEMPT',
 'TAX_NAME',
 'TOTAL_TAX',
 'UNIT_INFO',
 'USE1_DESC',
 'USE2_DESC',
 'USE3_DESC',
 'USE4_DESC',
 'WSHD_DIST',
 'XUSE1_DESC',
 'XUSE2_DESC',
 'XUSE3_DESC',
 'XUSE4_DESC',
 'YEAR_BUILT',
 'Year',
 'ZIP',
 'ZIP4',
 'centroid_lat',
 'centroid_long'}

## Task 2 - One Big Happy File

Since we will want to reuse the information generated in this lab, it will be useful to save it to a Python file.

1. Create a Python file named `parcel.py`.
2. Save the set representing the common columns to a variable named `common_columns_2004_to_2015`.
3. Save the sorted list of common columns to a variable named `sorted_common_columns_2004_to_2015`.
4. Restart the kernel and verify that you can import both of these data structures.
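
The write-then-import round trip in these steps can be sketched as follows. This is only an illustration under stated assumptions: the `common` set is made-up data, and the module is written to a temporary directory under the hypothetical name `parcel_demo.py` and loaded with `importlib` so that the example does not touch the real `parcel.py`.

```python
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical stand-in for the real set of common columns.
common = {'PIN', 'CITY', 'ZIP'}

# One assignment per line; repr() of a set or list of strings is valid Python source.
module_text = (
    f'common_columns_2004_to_2015 = {common!r}\n'
    f'sorted_common_columns_2004_to_2015 = {sorted(common)!r}\n'
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'parcel_demo.py'
    path.write_text(module_text)

    # Load the file as a module, much as `import parcel` would after a kernel restart.
    spec = importlib.util.spec_from_file_location('parcel_demo', path)
    parcel_demo = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(parcel_demo)

print(parcel_demo.common_columns_2004_to_2015 == common)  # True
print(parcel_demo.sorted_common_columns_2004_to_2015)     # ['CITY', 'PIN', 'ZIP']
```

In the notebook itself, step 4 amounts to restarting the kernel and running `from parcel import common_columns_2004_to_2015, sorted_common_columns_2004_to_2015`.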

In [28]:
# Your code here

common_columns_2004_to_2015 = set(lab2_p1)
sort_common_cols_2004_to_2015 = sorted(lab2_p1)
sort_common_cols_2004_to_2015

['ACRES_DEED',
 'ACRES_POLY',
 'AGPRE_ENRD',
 'AGPRE_EXPD',
 'AG_PRESERV',
 'BASEMENT',
 'BLDG_NUM',
 'BLOCK',
 'CITY',
 'CITY_USPS',
 'COOLING',
 'COUNTY_ID',
 'DWELL_TYPE',
 'EMV_BLDG',
 'EMV_LAND',
 'EMV_TOTAL',
 'FIN_SQ_FT',
 'GARAGESQFT',
 'GREEN_ACRE',
 'HEATING',
 'HOME_STYLE',
 'LANDMARK',
 'LOT',
 'MULTI_USES',
 'NUM_UNITS',
 'OPEN_SPACE',
 'OWNER_MORE',
 'OWNER_NAME',
 'OWN_ADD_L1',
 'OWN_ADD_L2',
 'OWN_ADD_L3',
 'PARC_CODE',
 'PIN',
 'PLAT_NAME',
 'PREFIXTYPE',
 'PREFIX_DIR',
 'SALE_DATE',
 'SALE_VALUE',
 'SCHOOL_DST',
 'SPEC_ASSES',
 'STREETNAME',
 'STREETTYPE',
 'SUFFIX_DIR',
 'Shape_Area',
 'Shape_Leng',
 'TAX_ADD_L1',
 'TAX_ADD_L2',
 'TAX_ADD_L3',
 'TAX_CAPAC',
 'TAX_EXEMPT',
 'TAX_NAME',
 'TOTAL_TAX',
 'UNIT_INFO',
 'USE1_DESC',
 'USE2_DESC',
 'USE3_DESC',
 'USE4_DESC',
 'WSHD_DIST',
 'XUSE1_DESC',
 'XUSE2_DESC',
 'XUSE3_DESC',
 'XUSE4_DESC',
 'YEAR_BUILT',
 'Year',
 'ZIP',
 'ZIP4',
 'centroid_lat',
 'centroid_long']

In [31]:
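with open('parcel.py', 'w') as pywrite:
    # End each assignment with a newline so that parcel.py is valid Python,
    # and use the variable names that Task 2 asks for inside the file.
    pywrite.write(f'common_columns_2004_to_2015 = {common_columns_2004_to_2015}\n')
    pywrite.write(f'sorted_common_columns_2004_to_2015 = {sort_common_cols_2004_to_2015}\n')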