# Lab 1 - Getting Started

## Project Overview

In this project, you will use `pyspark` to process the data from the MinneMUDAC 2016 competition, Dive into Water Data. While the MinneMUDAC 2016 site is no longer live, a copy was obtained using the [Wayback Machine](https://web.archive.org) and has been provided in [the overview notebook](./MinneMUDAC_2016_Overview.ipynb). You should document your work in a Jupyter notebook, which will be used to submit your solution.

## Lab 1 Tasks

In this lab, you will perform the following tasks:

1. Download and unzip the data.
2. Investigate the columns in the various property data files.

### Task 1 - Data download and unzip

While the download links on the original site no longer work, you can access the data using [this link](https://mnscu-my.sharepoint.com/:u:/g/personal/bn8210wy_minnstate_edu/EdUePet8JsdKv5aUt9gvjoMBxQhXrOx73WpQyVNwLVDfkA?e=rR8qrc).
**Note.** You should have already downloaded the zip file as part of the previous activity.

1. Move the zip file into your repository.
2. Unzip and move the files into your data folder.

**Hint.** Take a look at the Colab section of any module 5 lecture for an example.
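
If you would rather unzip from Python than from the shell, the sketch below shows one possible approach using the standard-library `zipfile` module. The function name `extract_without_macos_metadata` is hypothetical, and skipping the `__MACOSX/` entries is an assumption about what you want: those entries are macOS Finder metadata rather than data files.

```python
import zipfile
from pathlib import Path


def extract_without_macos_metadata(zip_path, dest_dir):
    """Extract zip_path into dest_dir, skipping macOS metadata entries.

    Hypothetical helper: returns the list of archive member names extracted.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Entries under __MACOSX/ are Finder metadata, not data files.
            if name.startswith("__MACOSX/"):
                continue
            archive.extract(name, dest)
            extracted.append(name)
    return extracted
```

For this lab, a call might look like `extract_without_macos_metadata("MinneMUDAC_raw_files.zip", "./data")`, assuming the zip file sits in the repository root.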

In [1]:
# use ls to inspect the data file.
!ls

MinneMUDAC_2016_Overview.ipynb
MinneMUDAC_raw_files.zip
Project_2_Lab_1_Download_and_investigate_property_columns.ipynb
README.md
__pycache__
data
img
more_pyspark.py


In [8]:
# use unzip to unzip the lakes data zip
!unzip MinneMUDAC_raw_files.zip -d ./data

Archive:  MinneMUDAC_raw_files.zip
   creating: ./data/MinneMUDAC_raw_files/
  inflating: ./data/MinneMUDAC_raw_files/2002_metro_tax_parcels.txt  
   creating: ./data/__MACOSX/
   creating: ./data/__MACOSX/MinneMUDAC_raw_files/
  inflating: ./data/__MACOSX/MinneMUDAC_raw_files/._2002_metro_tax_parcels.txt  
  inflating: ./data/MinneMUDAC_raw_files/2003_metro_tax_parcels.txt  
  inflating: ./data/__MACOSX/MinneMUDAC_raw_files/._2003_metro_tax_parcels.txt  
  inflating: ./data/MinneMUDAC_raw_files/2004_metro_tax_parcels.txt  
  inflating: ./data/MinneMUDAC_raw_files/2005_metro_tax_parcels.txt  
  inflating: ./data/__MACOSX/MinneMUDAC_raw_files/._2005_metro_tax_parcels.txt  
  inflating: ./data/MinneMUDAC_raw_files/2006_metro_tax_parcels.txt  
  inflating: ./data/__MACOSX/MinneMUDAC_raw_files/._2006_metro_tax_parcels.txt  
  inflating: ./data/MinneMUDAC_raw_files/2007_metro_tax_parcels.txt  
  inflating: ./data/MinneMUDAC_raw_files/2008_metro_tax_parcels.txt  
  inflating: ./data/__MACOSX

In [10]:
# use ls to inspect the lake folder found in the data folder.
!ls data/MinneMUDAC_raw_files

2002_metro_tax_parcels.txt  2010_metro_tax_parcels.txt
2003_metro_tax_parcels.txt  2011_metro_tax_parcels.txt
2004_metro_tax_parcels.txt  2012_metro_tax_parcels.txt
2005_metro_tax_parcels.txt  2013_metro_tax_parcels.txt
2006_metro_tax_parcels.txt  2014_metro_tax_parcels.txt
2007_metro_tax_parcels.txt  2015_metro_tax_parcels.txt
2008_metro_tax_parcels.txt  Parcel_Lake_Monitoring_Site_Xref.txt
2009_metro_tax_parcels.txt  mces_lakes_1999_2014.txt


#### Questions

1. Notice that we have multiple property files, one per year.  What verb(s) will be used to combine these files?
2. Why is it important to compare the columns of these files?
3. Use `!head path` to inspect the first few lines of one of the files.  How are the columns separated?

> <font color="red"> 
1. We can use `union` to combine these files if their headers are the same. If the headers differ, we can use a join; an outer join would preserve the information from all the files.
2. It's important to compare the columns of these files because the headers can differ from year to year, and `union` matches columns by position, so mismatched headers would misalign the data.
3. The columns are separated by the pipe symbol, `|`.
</font>
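
A quick way to check questions 2 and 3 without starting Spark is to read only the first line of two files and compare the labels. The sketch below is one possible approach in plain Python; `read_header` and `compare_headers` are hypothetical helper names, and the default `sep="|"` assumes the pipe delimiter seen in the `!head` output.

```python
def read_header(path, sep="|"):
    """Return the column labels from the first line of a delimited text file."""
    # errors="replace" keeps the read from failing on unexpected bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.readline().rstrip("\r\n").split(sep)


def compare_headers(left_path, right_path, sep="|"):
    """Return (shared, only_left, only_right) sets of column labels."""
    left = set(read_header(left_path, sep))
    right = set(read_header(right_path, sep))
    return left & right, left - right, right - left
```

Calling `compare_headers` on the 2002 and 2003 parcel files, for example, would list which labels appear in only one of the two years.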

In [21]:
# Use `!head path` to inspect the first few lines of one of the files.
!head data/MinneMUDAC_raw_files/2003_metro_tax_parcels.txt

BLDG_NUM|CITY|COUNTY_ID|EMV_BLDG|EMV_LAND|EMV_TOTAL|HOMESTEAD|NUM_UNITS|OWN_ADD_L1|OWN_ADD_L2|OWN_ADD_L3|OWN_NAME|PARC_CODE|PIN|SALE_DATE|SALE_VALUE|SCHOOL_DST|STREET|STRUC_TYPE|Shape_Area|Shape_Leng|TAX_ADD_L1|TAX_ADD_L2|TAX_ADD_L3|TAX_CAPAC|TAX_EXEMPT|TAX_NAME|TOTAL_TAX|WSHD_DIST|YEAR_BUILT|Year|ZIP|centroid_long|centroid_lat
|ST FRANCIS|003|0.0|17750.0|23398.0|N||24457 DOGWOOD ST NW||BETHEL, MN 55005||0.0|003-253424110001||0.0|15||OVERRIDE STRUCTURE|32468.8805894|1340.1976685|24457 DOGWOOD ST NW||BETHEL, MN 55005|351.0|N||614.0|UPPER RUM RIVER WMO|1980.0|2003||-93.26744|45.41336
24457|ST FRANCIS|003|101672.0|36700.0|147468.0|Y||24457 DOGWOOD ST NW||BETHEL, MN 55005||0.0|003-253424110002||0.0|15|DOGWOOD ST NW|SPLIT FOYER|3744.3683136|252.61429213|24457 DOGWOOD ST NW||BETHEL, MN 55005|1321.0|N||1319.0|UPPER RUM RIVER WMO|1974.0|2003|55005|-93.27015|45.41357
24442|ST FRANCIS|003|94087.0|57576.0|165053.0|Y||24442 DOGWOOD ST NW||ST FRANCIS, MN 55005||0.0|003-253424120001|2001-04-26|21500

### Task 2 - Create a table summarizing the columns from each table

<img src="./img/column_master_file.png" width="800">

**Hints.**

1. Use `glob` to get a list of all the parcel files.
2. Write a function that takes a parcel file path as input and returns just the year (as a string).
3. Use a list comprehension to build a list of pairs of the form `(year, df)`, where `df` is a `pyspark` data frame for each file. Although the files are large, remember that `pyspark` is lazy and will do minimal work on this step.
4. We need to create an initial data frame for each year that contains the column labels in one column and an indicator for the respective year in another. I did that with a list comprehension that named both elements of each pair from the previous list using `[ ... for year, df in list_name]`. I found it easiest to use `pandas` to create the data frame. This was tricky, so I have provided my helper function below.
5. Next, we need to create a master data frame that contains all possible column labels in a `"columns"` column. Do this using `reduce` to union all data frames together after selecting just the `"columns"` column of each. Use `distinct` to remove repeated column labels.
6. Finally, we want to join each of the yearly data frames onto the master column data frame. Do this with `reduce`, using
    * the master column data frame as the initial value
    * a left join on the `"columns"` column.
7. Write the resulting table out to a CSV file and inspect the results. Use this file to answer the questions in Part 2.
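
To make the logic of steps 5 and 6 concrete, here is a small sketch that mimics it with plain Python dictionaries instead of `pyspark` data frames. It is a stand-in for illustration, not the Spark solution: the function name `column_presence_table` and the input shape `{year: [labels]}` are assumptions, and `None` plays the role of the null that a left join produces.

```python
from functools import reduce


def column_presence_table(columns_by_year):
    """Build rows of {'columns': label, year: 1 or None} from {year: [labels]}."""
    # Step 5: union every year's labels into one master set (no repeats).
    master = reduce(lambda acc, cols: acc | set(cols),
                    columns_by_year.values(), set())
    # One row per label, mirroring the master "columns" data frame.
    initial = {label: {"columns": label} for label in sorted(master)}

    def left_join(rows, year_and_cols):
        year, cols = year_and_cols
        present = set(cols)
        # Like a left join: labels missing from this year get None.
        return {label: {**row, year: 1 if label in present else None}
                for label, row in rows.items()}

    # Step 6: fold each year into the master rows, starting from `initial`.
    joined = reduce(left_join, columns_by_year.items(), initial)
    return list(joined.values())
```

The key design point carried over to the Spark version is the third argument to `reduce`: the master table is the starting value, and each yearly table is joined onto the running result.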

In [1]:
# all the imports
from glob import glob
import re
import pandas as pd
from functools import reduce
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from more_pyspark import *

In [2]:
spark = SparkSession.builder.appName('Ops').getOrCreate()

22/11/15 20:38:56 WARN Utils: Your hostname, jt7372wd222 resolves to a loopback address: 127.0.1.1; using 172.26.42.136 instead (on interface eth0)
22/11/15 20:38:56 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/11/15 20:39:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
# Function for creating yearly column dataframes in step **4.**
make_column_df = lambda year, df: (spark.createDataFrame(pd.DataFrame({'columns':df.columns}))
                                        .select('columns', lit(1).alias(year)))

In [4]:
# Your code here
data_file_paths = sorted(glob('./data/MinneMUDAC_raw_files/*parcels.txt'))

In [5]:
# Capture the four-digit year from a parcel file path; dots are escaped so they match literally.
extract_year = re.compile(r'\./data/MinneMUDAC_raw_files/(\d{4})_metro_tax_parcels\.txt')
year = lambda path: extract_year.search(path).group(1)

In [6]:
year(data_file_paths[1])

'2003'

In [7]:
parcels = [(year(file_path), spark.read.csv(file_path, sep='|', header=True, inferSchema=True)) for file_path in data_file_paths]

                                                                                

In [8]:
parcels

[('2002',
  DataFrame[ACRES_DEED: string, ACRES_POLY: string, AGPRE_ENRD: string, AGPRE_EXPD: string, AG_PRESERV: string, BASEMENT: string, BLDG_NUM: string, BLOCK: string, CITY: string, CITY_USPS: string, COOLING: string, COUNTY_ID: int, DWELL_TYPE: string, EMV_BLDG: double, EMV_LAND: double, EMV_TOTAL: double, FIN_SQ_FT: string, GARAGE: string, GARAGESQFT: string, GREEN_ACRE: string, HEATING: string, HOMESTEAD: string, HOME_STYLE: string, LANDMARK: string, LOT: string, MULTI_USES: string, NUM_UNITS: string, OPEN_SPACE: string, OWNER_MORE: string, OWNER_NAME: string, OWN_ADD_L1: string, OWN_ADD_L2: string, OWN_ADD_L3: string, OWN_NAME: string, PARC_CODE: double, PIN: string, PIN_1: string, PLAT_NAME: string, PREFIXTYPE: string, PREFIX_DIR: string, SALE_DATE: timestamp, SALE_VALUE: double, SCHOOL_DST: string, SPEC_ASSES: string, STREET: string, STREETNAME: string, STREETTYPE: string, STRUC_TYPE: string, SUFFIX_DIR: string, Shape_Area: double, Shape_Leng: double, TAX_ADD_L1: string, TAX

In [10]:
initial_parcel_df_columns = [make_column_df(year, parcel_data) for year, parcel_data in parcels]

In [11]:
initial_parcel_df_columns

[DataFrame[columns: string, 2002: int],
 DataFrame[columns: string, 2003: int],
 DataFrame[columns: string, 2004: int],
 DataFrame[columns: string, 2005: int],
 DataFrame[columns: string, 2006: int],
 DataFrame[columns: string, 2007: int],
 DataFrame[columns: string, 2008: int],
 DataFrame[columns: string, 2009: int],
 DataFrame[columns: string, 2010: int],
 DataFrame[columns: string, 2011: int],
 DataFrame[columns: string, 2012: int],
 DataFrame[columns: string, 2013: int],
 DataFrame[columns: string, 2014: int],
 DataFrame[columns: string, 2015: int]]

In [12]:
parcel_data_unions = reduce(lambda years, data_frame: years.union(data_frame).distinct(), [data_frame.select("columns") for data_frame in initial_parcel_df_columns])

In [13]:
parcel_data_unions

DataFrame[columns: string]

In [11]:
parcel_data_unions.collect() >> to_pandas

                                                                                

Unnamed: 0,columns
0,BLDG_NUM
1,CITY
2,BASEMENT
3,AG_PRESERV
4,ACRES_DEED
...,...
79,Garage20
80,Shape_STLe
81,Shape_STAr
82,Shape_Le_1


In [14]:
parcel_data_joins = reduce(lambda columns, df: columns.join(df, on='columns', how='left'), initial_parcel_df_columns, parcel_data_unions)

In [21]:
parcel_header_years = parcel_data_joins.collect() >> to_pandas

In [23]:
parcel_header_years.to_csv('./data/parcel_header_years.csv')

### Part 2 -- Inspecting and comparing the columns

**Goal.** Find an interval of years that

1. Covers a large amount of time.
2. Contains as many common columns as possible.

**Task.** Inspect the column summary table and discuss what you find.  Suggest a time frame that satisfies our competing goals.
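
One way to explore this trade-off is to count the shared columns for every contiguous run of years. The sketch below is a rough aid under stated assumptions, not a prescribed method: `rank_intervals` is a hypothetical name, the input is assumed to be a dictionary mapping each year to its list of column labels, and the score (years covered times shared columns) is an arbitrary choice that you may want to replace.

```python
def rank_intervals(columns_by_year):
    """Return (start, end, n_years, n_shared) tuples, best score first.

    Assumes the keys of columns_by_year sort into consecutive years.
    """
    years = sorted(columns_by_year)
    rows = []
    for i, start in enumerate(years):
        shared = set(columns_by_year[start])
        for j in range(i, len(years)):
            # Keep only the labels present in every year of the run so far.
            shared &= set(columns_by_year[years[j]])
            rows.append((start, years[j], j - i + 1, len(shared)))
    # Hypothetical score: years covered multiplied by shared columns.
    return sorted(rows, key=lambda r: r[2] * r[3], reverse=True)
```

The ranking is only a starting point for the discussion; inspecting which columns are lost in each candidate interval still matters, since not all columns are equally useful.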

In [24]:
pd.set_option('display.max_rows', None)

In [25]:
parcel_header_years

Unnamed: 0,columns,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,BLDG_NUM,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,CITY,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,BASEMENT,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,AG_PRESERV,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,ACRES_DEED,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5,ACRES_POLY,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
6,BLOCK,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
7,AGPRE_EXPD,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
8,AGPRE_ENRD,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9,COOLING,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


#### Your conclusions

We are going to use the data from 2004 to 2015. We will drop 2003 because it is missing many columns and, since this is a time series analysis that needs a contiguous range of years, dropping 2003 means we also have to drop 2002. I also think we should drop the columns `OWN_NAME`, `PIN_1`, `STREET`, `STRUC_TYPE`, `TAX_ADD_LI`, and a number of others near the end of the table above.
