# Project 2 Lab 6 - Parcel Feature Extraction

Next, we will illustrate the construction of features related to our main task: finding the relationship between property development and water quality over time.  In a previous lab, you identified lakes for which we have complete information for the years from 2004 to 2015.  In this lab, we will

[Original Data and variable information](https://gisdata.mn.gov/organization/us-mn-state-metrogis?q=Metro+Regional+Parcel+Dataset&sort=score+desc%2C+metadata_modified+desc)

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col, pandas_udf
from more_pyspark import to_pandas, recode
spark = (SparkSession.builder.appName('Ops')
         .getOrCreate())
from composable.glob import glob
from composable.strict import map, star_map, filter, sorted
from composable.sequence import reduce
from composable import pipeable
from pyspark.sql.functions import lit
import pandas as pd
from composable.tuple import split_by
from composable import pipeable
from pyspark.sql.types import IntegerType

your 131072x1 screen size is bogus. expect trouble


22/12/08 08:44:54 WARN Utils: Your hostname, lu4543hm221 resolves to a loopback address: 127.0.1.1; using 192.168.26.203 instead (on interface eth0)
22/12/08 08:44:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/12/08 08:44:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Problem 1 - Read/filter/union/write the combined parcel data

Next, we will use a pipe to read, filter, union and write the parcel data, with the resulting file being partitioned by the year and lake ID.

#### Tasks 1

Create a pipe that

1. Finds a list of paths for the parquet "files" created in a previous lab.
    * We only needs the years 2004-2014.
2. Reads/filters each of the parcel parquet files by mapping the first helper function from the last step to each path. 
    * You should use the imported list of lakes with complete information to filter on lakes.
    * Use the distance category to only include parcels within 1600 m of the respective lake. 
    * You can drop the centroid lat & long, only with the distance information, once the filters are applied.
3. Union the resulting data frames into one data frame.

#### Task 2.

Write the combined parcel file to a parquet file that is partitioned by the lake ID and year (in that order).  This is our silver table for the parcel data.



In [2]:
parcels = sorted('./parcel_20*' >> glob)

parcels_minus = parcels[:-1]
parcels_minus

['./parcel_2004.parquet',
 './parcel_2005.parquet',
 './parcel_2006.parquet',
 './parcel_2007.parquet',
 './parcel_2008.parquet',
 './parcel_2009.parquet',
 './parcel_2010.parquet',
 './parcel_2011.parquet',
 './parcel_2012.parquet',
 './parcel_2013.parquet',
 './parcel_2014.parquet']

In [3]:
# Your code here
from lake import lake_complete

read_parcels = lambda path: (spark.read.parquet(path, header=True, sep='|')
                            .where((col('Monit_MAP_CODE1').isin(lake_complete))&(col('distance')!= 'Over 1600'))
                            .drop('centroid_long','centroid_lat')
)

parcel_join = (parcels_minus
            >>map(read_parcels)
            >>reduce(lambda df, df2: df.union(df2).distinct())
)

                                                                                

22/12/07 22:54:29 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.


                                                                                

In [4]:
len(lake_complete)

49

In [11]:
# (parcel_join
#  .write
#  .partitionBy('Monit_MAP_CODE1', 'Year')
#  .mode('overwrite')
#  .parquet('allparcels.parquet')
# )

22/12/06 19:07:35 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


[Stage 46:>                                                        (0 + 8) / 12]

22/12/06 19:08:28 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.


                                                                                

In [5]:
allparcels = spark.read.parquet('./allparcels.parquet/')

In [6]:
allparcels.count() #-- THIS COUNT TAKES TOO DAMN LONG TO RUN DO NOT RUN AGAIN JUST FOR FUTURE REF

                                                                                

1357337

> many many rows after we union(ized) all the parcel files together

## Problem 2 - Feature construction

**Overview.** Remember that our target output file will have one row per year-lake combination.  To attach property information, we will need to group and aggregate the parcel data to create features for each lake-year combination.  When grouping the data, be sure to maintain the variables needed to join to the water quality data, namely the lake ID and year.  Since we are looking at tracking property development/change over time, we will want to generate features tracking

* Number of properties close to each lake,
* The value of properties close to each lake,
* Aggregate size and type of the properties, and
* Other features that might impact water quality.
    
#### Task 1. Understanding parcel variables

Before we can construct features, we need to make sure we understand the parcel data.  The metro parcel data is provided by the State of Minnesota and the meta data can be found online.  For example, searching for *metro parcel 2014* lead to [this site](https://geo.btaa.org/catalog/304cf3d8-a53b-4ea9-b02a-f550bd68e320).  Clicking on the *Meta data* button in the top left, brought up more information.  Clicking *Download* opened in this meta data [in a separate page](https://resources.gisdata.mn.gov/pub/gdrs/data/pub/us_mn_state_metrogis/plan_regonal_parcels_2014/metadata/metadata.html)

Look through the **Section 4: Attributes** and identify variables that might impact the water quality of near-by lakes.

> <font color="orange"> From doing research on the watershed district attribute, it seems like that could impact the quality of water in the lakes as the goal is to fix/solve water related issues in the lakes - also agricultural preserve could have an impoact on the quality of water due to the overall quality of the land residing around it and see how that overall affect changes around lakes. the other attributes then under the agricultural preserve because if it is enrolled in the program then it is focused on keeping the land preserved and if it is expiring then it focuses on whether or not the lake is on the verge of being in poorer quality</font>

#### Task 2. Brainstorm about features

Remember that we need to aggregate down to a table with one row per lake-year, which means that feature construction will involve computing summary statistics. Below are some techniques for feature construction that you might employ.

1. **Numerical summaries.** For numeric variables, you could should compute one or more summary statistics (mean, median, SD, IQR, etc.) per group.
2. **Categorical summaries.**. For text data, we will have some more work.  Here are some strategies.
    * **Success rates.** Compute success rates for binary variables.  For example, we could compute the percent/fraction of residences that have a basement.
    * **Clean labels.** Be sure to inspect the unique labels and clean up duplicate/similar labels.
    * **Make broader classifications.**  Some categorical variables will have too many categories that apply to a small number of properties.  These should be recoded into a smaller set of broad categories.  Try to eliminate or combine rare categories in the process.
    * **Indicator columns.** Another strategy is to create indicator variables then aggregate, where the result can be either zero-one (presence/absence) or the total/proportion over all rows.  For example, we could create the number of properties of each use type.

Consider the variables you identified in the last step, and develop a feature construction strategy for each.

> <font color="orange"> The watershed district attribute is a text value attribute (WSHD_DIST) and grab then each of the unique different names associated with the lakes and DNR ID's - there can be calculated rates though for it though on what the levy is in that district while some can be - having no values which can be misleading whether there is a levy tax of 0 or there truly is none
>> Agricultural Preserve is a binary column (AG_PRESERV) which can be created into indicators or filtered down to only lakes that are YES and then can create statstical summaries on lakes that are preserved vs the ones that are not
>>>The other two agricultural preserves are date columns which can then do a year by year breakdown for each year - you could cast the cloumns to date format (assuming they are in YYYY-DD-MM format)
 </font>

#### Task 3.  Numerical Summaries

Two important categories of property data involve the size (e.g., finished square footage) and value (e.g., accessed value and/or taxes paid).

**Tasks.** 

1. Identify 2-3 variables for each of these categories.
2. Write a query that computes the summary statistics for each of these variables for each lake-year.  
3. Write this summary table out to a parquet file named `parcel_numerical_summaries.parquet`.  Again, you should partition by lake ID and year.

In [7]:
# Your code here.
from pyspark.sql.functions import mean, stddev, max, percent_rank, sum, regexp_extract, regexp_replace, column

numsum = (allparcels
        .groupBy(col('MONIT_MAP_CODE1'), col('Year'))
        .agg(max(col('EMV_TOTAL')).alias('Max_Estimated_Total'), 
             mean(col('TOTAL_TAX')).alias('Avg_Tax'))
)
numsum.toPandas()

                                                                                

Unnamed: 0,MONIT_MAP_CODE1,Year,Max_Estimated_Total,Avg_Tax
0,02000500-01,2004,97858.0,3467.759058
1,02000500-01,2005,94200.0,0.000000
2,02000500-01,2006,93600.0,3435.787510
3,02000500-01,2007,99500.0,3230.503464
4,02000500-01,2008,99900.0,3317.029727
...,...,...,...,...
523,82036800-01,2010,961400.0,0.000000
524,82036800-01,2011,985500.0,0.000000
525,82036800-01,2012,954300.0,0.000000
526,82036800-01,2013,963300.0,0.000000


In [17]:
#already ran this code chunk parquet file exists

# (numsum
#  .write
#  .partitionBy('Monit_MAP_CODE1', 'Year')
#  .mode('overwrite')
#  .parquet('parcel_numerical_summaries.parquet')
# )

                                                                                

## Problem 3.  Simple categorical summaries.

In this part, you will create summary statistics for some of the simpler categorical variables.

**Binary variables.** There are two examples of binary variables, listed below.  You will need to compute the percent of `Yes` for each.

* GARAGE: Garage Y/N
* BASEMENT: Basement Y/N

**Other categorical variables.** There are a number of other categorical variables.  You need to select one of these variables, inspect/clean your variable as needed, create indicator variables for each resulting label, and compute summary statistics for each label.

* HOMESTEAD: Homestead Status
* TAX_EXEMPT: Tax Exempt Status 
* DWELL_TYPE: Dwelling Type 
* HOME_STYLE: Home Style
* HEATING: Heating type
* COOLING: Cooling type

**Tasks.**
Create a query that

1. Select one binary and two other categorical variables for feature construction.
2. Reads in the parcel data and selects the relevant columns (be sure to keep the lake ID and year).
3. Inspect unique labels and recode/clean as needed.
4. Creates indicator columns for all labels.
5. Groups/aggregates to compute summary statistics for each lake year.

Write this summary table out to a parquet file named `parcel_categorical_summaries.parquet`.  Again, you should partition by lake ID and year.

In [9]:
bateman = allparcels.drop_duplicates(['HEATING'])
trying = bateman.select(['HEATING']).distinct()
trying.show(truncate=False)



+----------+
|HEATING   |
+----------+
|FA Gas    |
|null      |
|Yes       |
|0         |
|Forced Air|
|Electric  |
|Gravity   |
|Hot Water |
|Other     |
|H. Water  |
|Oil F.A.  |
|FORCED AIR|
|No        |
|FHA Gas   |
|RAD/BBELEC|
|SPACE HTR |
|FRC AIR ND|
|HOT WATER |
|NONE      |
|RAD INFRED|
+----------+
only showing top 20 rows



                                                                                

In [10]:
from pyspark.sql.functions import array, explode, struct, lit, col, collect_list

In [11]:
basementrecode = {'Y':1, 'N':0, 'None':0}
tax_exemptrec = {'Y':1, 'N':0, 'None':0}

In [12]:
list(allparcels.select('TAX_EXEMPT').distinct().toPandas()['TAX_EXEMPT'])

                                                                                

[None, 'Y', 'N']

In [13]:
list(allparcels.select('BASEMENT').distinct().toPandas()['BASEMENT'])

                                                                                

[None, 'Y', 'N']

In [14]:
hvac = list(allparcels.select('HEATING').distinct().toPandas()['HEATING'])
hvac

                                                                                

['FA Gas',
 None,
 'Yes',
 '0',
 'Forced Air',
 'Electric',
 'Gravity',
 'Hot Water',
 'Other',
 'H. Water',
 'Oil F.A.',
 'FORCED AIR',
 'No',
 'FHA Gas',
 'RAD/BBELEC',
 'SPACE HTR',
 'FRC AIR ND',
 'HOT WATER',
 'NONE',
 'RAD INFRED',
 'GRAVITY/WA',
 'STEAM W A/',
 'LP',
 'Wood',
 'HOT AIR',
 'AIR DUCTED',
 'ENG F AIR',
 'CONVECTION',
 'IN FLOOR',
 'FHA',
 'ENG STEAM',
 'ELEC BASBD',
 'RAD WATER',
 'ELECTRIC',
 'SPACE HEAT',
 'OTHER W A/',
 'Evaporative Cooling',
 'Forced Air Furnace',
 'Electric Baseboard',
 'Complete HVAC',
 'Space Heater',
 'Package Unit',
 'Baseboard, Hot Water',
 'Y',
 'N',
 'Radiant Space Heaters',
 'Gravity Furnace',
 'ELEC WALL',
 'GEO THERM',
 'RAD ELEC',
 'Solar',
 'STEAM',
 'HEAT PUMP',
 'SP HT W/FN',
 'SPACE-FAN']

In [15]:
basementrecode = {'Y':1, 'N':0, 'None':0}
tax_exemptrec = {'Y':1, 'N':0, 'None':0}
heatrec = {'FA Gas':1, None:0, 'Yes':1, '0':0, 'Forced Air':0,'Electric':1,'Gravity':0,'Hot Water':1,'Other':1,'H. Water':1,'Oil F.A.':1,'FORCED AIR':1,'No':0,'FHA Gas':0,
'RAD/BBELEC':1,'SPACE HTR':1,'FRC AIR ND':1,'HOT WATER':1,'NONE':0,'RAD INFRED':1,'GRAVITY/WA':1,'STEAM W A/':1,'LP':1,'Wood':1,'HOT AIR':1,'AIR DUCTED':1,'ENG F AIR':1,
 'CONVECTION':1,'IN FLOOR':1,'FHA':1,'ENG STEAM':1,'ELEC BASBD':1,'RAD WATER':1,'ELECTRIC':1,'SPACE HEAT':1,'OTHER W A/':1,'Evaporative Cooling':0,'Forced Air Furnace':1,
 'Electric Baseboard':1,'Complete HVAC':1,'Space Heater':1,'Package Unit':1,'Baseboard, Hot Water':1,'Y':1,'N':0,'Radiant Space Heaters':1,'Gravity Furnace':1,'ELEC WALL':1,
 'GEO THERM':1,'RAD ELEC':1,'Solar':1,'STEAM':1,'HEAT PUMP':1,'SP HT W/FN':1,'SPACE-FAN':1}

In [16]:
# Your code here

catsum = (allparcels
        .select('BASEMENT', 'MONIT_MAP_CODE1', 'Year', 'TAX_EXEMPT', 'HEATING')
        .withColumn('Basement_Ind', recode('BASEMENT', basementrecode, default=0))
        .withColumn('Tax_Ind', recode('TAX_EXEMPT', tax_exemptrec, default=0))
        .withColumn('Heat_Ind', recode('HEATING', heatrec, default=0))
        .groupBy('Year', 'MONIT_MAP_CODE1')
        .agg(mean(col('Basement_Ind')).alias('Percent_Basement'),(mean(col('Tax_Ind')).alias('Percent_Tax_Exempt')),(mean(col('Heat_Ind')).alias('Percent_Heated')))
)
catsum.take(4) >> to_pandas

                                                                                

Unnamed: 0,Year,MONIT_MAP_CODE1,Percent_Basement,Percent_Tax_Exempt,Percent_Heated
0,2014,82009400-01,0.945263,0.0,0.945263
1,2012,82009400-01,0.945972,0.0,0.945972
2,2013,82009400-01,0.945972,0.0,0.945972
3,2013,27005300-01,0.495488,0.064807,0.033908


In [17]:
yes = catsum
yes.show(truncate = False)



+----+---------------+--------------------+--------------------+--------------------+
|Year|MONIT_MAP_CODE1|Percent_Basement    |Percent_Tax_Exempt  |Percent_Heated      |
+----+---------------+--------------------+--------------------+--------------------+
|2014|82009400-01    |0.9452626411389298  |0.0                 |0.9452626411389298  |
|2012|82009400-01    |0.9459724950884086  |0.0                 |0.9459724950884086  |
|2013|82009400-01    |0.9459724950884086  |0.0                 |0.9459724950884086  |
|2013|27005300-01    |0.49548810500410173 |0.06480721903199343 |0.03390757451462948 |
|2010|27005300-01    |0.008487337440109514|0.06351813826146475 |0.033949349760438056|
|2014|27005300-01    |0.49562841530054647 |0.06448087431693988 |0.033879781420765025|
|2010|82009400-01    |0.9467030242935052  |0.041398116013882005|0.9467030242935052  |
|2014|27062700-01    |0.8828422388270125  |0.02354433948963767 |0.04370506132806993 |
|2009|27005300-01    |0.00849780701754386 |0.063596491

                                                                                

In [55]:
# (catsum
#  .write
#  .partitionBy('Monit_MAP_CODE1', 'Year')
#  .mode('overwrite')
#  .parquet('parccel_categorical_variables.parquet')
# )

                                                                                

## Problem 4.  Join all the summaries.

Finally, you need to join all the summaries created above, along with the water quality summaries created in a previous lab, into one overall summary file.  Write the resulting table to a CSV file named `water_quality_and_parcel_summaries_2004_to_2015.csv`.

Next, we need to recode the 

In [18]:
quality = spark.read.parquet('./water_quality_by_year.parquet', header=True, sep='|')
quality.take(5)>>to_pandas

Unnamed: 0,LAKE_NAME,DNR_ID_Site_Number,AvgSecchi,AvgPhos,year
0,Big Comfort Lake,13005300-01,1.954167,0.03225,2009
1,Colby Lake,82009400-01,0.533333,0.125333,2009
2,DeMontreville Lake,82010100-01,3.038462,0.022,2009
3,Forest Lake,82015900-01,1.957143,0.023929,2009
4,Goggins Lake,82007700-01,1.088571,0.093286,2009


In [19]:
catsum_with_binary = (catsum.join(numsum, on=['MONIT_MAP_CODE1', 'Year'], how='inner'))

catsum_with_binary.take(5) >> to_pandas

                                                                                

Unnamed: 0,MONIT_MAP_CODE1,Year,Percent_Basement,Percent_Tax_Exempt,Percent_Heated,Max_Estimated_Total,Avg_Tax
0,02000500-01,2004,0.0,0.085145,0.0,97858.0,3467.759058
1,02000500-01,2005,0.750656,0.087489,0.0,94200.0,0.0
2,02000500-01,2006,0.70073,0.084347,0.0,93600.0,3435.78751
3,02000500-01,2007,0.724403,0.083141,0.0,99500.0,3230.503464
4,02000500-01,2008,0.735756,0.079273,0.0,99900.0,3317.029727


In [20]:
stupidyear = (catsum_with_binary
    .withColumn("OfficialYear", col('Year'))
    .drop('Year')
)

In [21]:
summarywater = (stupidyear.join(quality, on=[(stupidyear.MONIT_MAP_CODE1 == quality.DNR_ID_Site_Number),
                                             (stupidyear.OfficialYear == quality.year)], how='left'))
summarywater.take(10)>>to_pandas

                                                                                

Unnamed: 0,MONIT_MAP_CODE1,Percent_Basement,Percent_Tax_Exempt,Percent_Heated,Max_Estimated_Total,Avg_Tax,OfficialYear,LAKE_NAME,DNR_ID_Site_Number,AvgSecchi,AvgPhos,year
0,02000500-01,0.0,0.085145,0.0,97858.0,3467.759058,2004,George Watch Lake,02000500-01,0.705,0.199,2004
1,02000500-01,0.750656,0.087489,0.0,94200.0,0.0,2005,George Watch Lake,02000500-01,0.681667,0.210083,2005
2,02000500-01,0.70073,0.084347,0.0,93600.0,3435.78751,2006,George Watch Lake,02000500-01,0.728571,0.164286,2006
3,02000500-01,0.724403,0.083141,0.0,99500.0,3230.503464,2007,George Watch Lake,02000500-01,0.562857,0.203714,2007
4,02000500-01,0.735756,0.079273,0.0,99900.0,3317.029727,2008,George Watch Lake,02000500-01,0.55,0.148833,2008
5,02000500-01,0.745665,0.079273,0.0,98600.0,3489.229562,2009,George Watch Lake,02000500-01,0.538,0.1056,2009
6,02000500-01,0.758877,0.080925,0.0,98200.0,3384.645747,2010,George Watch Lake,02000500-01,0.493333,0.173,2010
7,02000500-01,0.768404,0.082713,0.0,99900.0,3459.313482,2011,George Watch Lake,02000500-01,0.973333,0.119417,2011
8,02000500-01,0.70174,0.081193,0.0,97900.0,3460.064623,2012,George Watch Lake,02000500-01,0.359,0.2649,2012
9,02000500-01,0.699918,0.079143,0.825227,981400.0,3319.081616,2013,George Watch Lake,02000500-01,0.365,0.3105,2013


In [22]:
water_quality_and_parcel_summaries = (summarywater
            .drop(column('year'))
)

In [24]:
# water_quality_and_parcel_summaries.write.csv('./water_quality_and_parcel_summaries_2004_2014.csv')

^^ I already wrote that file so line is commented out

## Problem 5.  Put it all together

It is often useful to package all of the data constructions steps together in one convenient place.  Your last task is to

1. Gather all of your data construction code below.
    * You don't need to include exploratory code, e.g., exploring join mismatches; only the code necessary to combine, clean, and write your data.
2. Clean/refactor the code.

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col, pandas_udf, regexp_extract, year, to_date, avg
from more_pyspark import to_pandas, recode
spark = (SparkSession.builder.appName('Ops')
         .getOrCreate())
from composable.strict import map, star_map, filter, sorted
from composable.sequence import reduce
from composable import pipeable
from pyspark.sql.functions import lit
import pandas as pd
from composable.tuple import split_by
from pyspark.sql.types import IntegerType
from glob import glob as orignal_glob
from composable.glob import glob
import re

In [5]:
# Your code here.
from parcel import sort_common_cols_2004_to_2015
from lake import lake_complete

glob = pipeable(orignal_glob)
parcel1_files = sorted(glob('./data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/201*parcel*.txt'))
parcel2_files = sorted(glob('./data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/200[456789]*parcel*.txt'))
parcel_files=(parcel2_files + parcel1_files)


getcol = (parcel_files
                >> map(lambda path: spark.read.csv(path, header=True, sep='|'))
                >> map(lambda df: df.columns)
                >> map(lambda columnar: set(columnar))
)

update_set = lambda final, cols: final.intersection(cols)

reducingcols = (getcol
            >>reduce(update_set)
)
sort_common_cols_2004_to_2015 = (sorted(list(reducingcols)))

# with open ('parcel.py', 'w') as pywrite:
#     pywrite.write(f'sort_common_cols_2004_to_2015 = {sort_common_cols_2004_to_2015}')

xref_file = spark.read.csv('./data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/Parcel_Lake_Monitoring_Site_Xref.txt',
                    header=True,
                    sep='\t')

water_quality = spark.read.csv('./data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/mces_lakes_1999_2014.txt',
                              header = True,
                              sep='\t')

xrefcolsel = (xref_file.select('Monit_MAP_CODE1', 'centroid_lat', 'centroid_long', 'Distance_Parcel_Monitoring_Site_meters', 'Distance_Parcel_Lake_meters'))

xrefcolmutate= (xref_file.withColumn('distance', when(col('Distance_Parcel_Lake_meters') < 500, 'Less Than 500')
                                        .when(col('Distance_Parcel_Lake_meters') > 1600, "Over 1600")
                                        .when(col('Distance_Parcel_Lake_meters') <= 1600, "Between 500 and 1600")
                                        .otherwise('Unknown Distance')))


xref_selected = (xrefcolsel, xrefcolmutate)

# (xref_selected
#  .write
#  .partitionBy('Monit_MAP_CODE1')
#  .mode('overwrite')
#  .parquet('xref_selected.parquet')
# )

# (water_quality
#  .write
#  .partitionBy('DNR_ID_Site_Number')
#  .mode('overwrite')
#  .parquet('water_quality.parquet')
# )

waterparq = spark.read.parquet('water_quality.parquet')
xref_parq = spark.read.parquet('xref_selected.parquet/')

read_parcel = lambda path: spark.read.csv(path, header=True, sep='|').select(sort_common_cols_2004_to_2015).join(xref_parq, on=['centroid_lat', 'centroid_long'], how='inner')

compile_yr = re.compile('./data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/(\d{4})_metro_tax_parcels.txt')
get_year = lambda path: compile_yr.search(path).group(1)

joined_parcels = (parcel_files
                    >>map(split_by((read_parcel, get_year)))
)

to_parquet = lambda df, year:df.write.partitionBy('Monit_MAP_CODE1', 'distance').mode('overwrite').parquet(f'parcel_{year}.parquet')

# (joined_parcels
#     >>star_map(to_parquet)
#     )

newfunc = {'':1}

mces = spark.read.csv('./data/MinneMUDAC_raw_files/MinneMUDAC_raw_files/mces_lakes_1999_2014.txt', header=True, sep='\t')

colsaggregated = (mces.filter(col('Total_Phosphorus_QUALIFIER') == 'Approved')
        .filter(col('Secchi_Depth_QUALIFIER') == 'Approved')
        .filter(col('Secchi_Depth_RESULT').isNotNull())
        .filter(col('Total_Phosphorus_RESULT').isNotNull())
        .withColumn('year', regexp_extract(to_date(col('END_DATE')), r'\d{4}', 0))
        .filter((col('year') > 2003) & (col('year') <= 2014))
        .groupBy(col('year'), col('LAKE_NAME'), col('DNR_ID_Site_Number'))
        .agg(avg(col('Secchi_Depth_RESULT')).alias('AvgSecchi'), avg(col('Total_Phosphorus_RESULT')).alias('AvgPhos')))

water_quality_by_year = (colsaggregated)

watermadetolist= (water_quality_by_year.select('DNR_ID_Site_Number', 'year')
            .withColumn('addingstuff', recode('year', newfunc, default=1))
            .groupBy('DNR_ID_Site_Number')
            .agg(sum(col('addingstuff')).alias('addingstuff'))
            .filter(col('addingstuff')==11)
            .toPandas()['DNR_ID_Site_Number'])

listedwater = (watermadetolist)
sortedwater = list(sorted(listedwater))

# with open ('lake.py', 'w') as pywrite:
#     pywrite.write(f'lake_complete = {(sortedwater)}')

confirmation = (water_quality_by_year
        .where(col('DNR_ID_Site_Number').isin(lake_complete) == True)        
)

# (confirmation
#  .write
#  .partitionBy('year')
#  .mode('overwrite')
#  .parquet('water_quality_by_year.parquet')


parcels = sorted('./parcel_20*' >> glob)
parcels_minus = parcels[:-1]

read_parcels = lambda path: (spark.read.parquet(path, header=True, sep='|')
                            .where((col('Monit_MAP_CODE1').isin(lake_complete))&(col('distance')!= 'Over 1600'))
                            .drop('centroid_long','centroid_lat')
)

parcel_join = (parcels_minus
            >>map(read_parcels)
            >>reduce(lambda df, df2: df.union(df2).distinct())
)

# (parcel_join
#  .write
#  .partitionBy('Monit_MAP_CODE1', 'Year')
#  .mode('overwrite')
#  .parquet('allparcels.parquet')
# )

allparcels = spark.read.parquet('./allparcels.parquet/')

binaryaggregates = (allparcels
        .groupBy(col('MONIT_MAP_CODE1'), col('Year'))
        .agg(max(col('EMV_TOTAL')).alias('Max_Estimated_Total'), 
             mean(col('TOTAL_TAX')).alias('Avg_Tax')))

numsum = (binaryaggregates)

# (numsum
#  .write
#  .partitionBy('Monit_MAP_CODE1', 'Year')
#  .mode('overwrite')
#  .parquet('parcel_numerical_summaries.parquet')
# )

categoryaggregates = (allparcels.select('BASEMENT', 'MONIT_MAP_CODE1', 'Year', 'TAX_EXEMPT', 'HEATING')
        .withColumn('Basement_Ind', recode('BASEMENT', basementrecode, default=0))
        .withColumn('Tax_Ind', recode('TAX_EXEMPT', tax_exemptrec, default=0))
        .withColumn('Heat_Ind', recode('HEATING', heatrec, default=0))
        .groupBy('Year', 'MONIT_MAP_CODE1')
        .agg(mean(col('Basement_Ind')).alias('Percent_Basement'),(mean(col('Tax_Ind')).alias('Percent_Tax_Exempt')),(mean(col('Heat_Ind')).alias('Percent_Heated'))))

basementrecode = {'Y':1, 'N':0, 'None':0}
tax_exemptrec = {'Y':1, 'N':0, 'None':0}
heatrec = {'FA Gas':1, None:0, 'Yes':1, '0':0, 'Forced Air':0,'Electric':1,'Gravity':0,'Hot Water':1,'Other':1,'H. Water':1,'Oil F.A.':1,'FORCED AIR':1,'No':0,'FHA Gas':0,
'RAD/BBELEC':1,'SPACE HTR':1,'FRC AIR ND':1,'HOT WATER':1,'NONE':0,'RAD INFRED':1,'GRAVITY/WA':1,'STEAM W A/':1,'LP':1,'Wood':1,'HOT AIR':1,'AIR DUCTED':1,'ENG F AIR':1,
 'CONVECTION':1,'IN FLOOR':1,'FHA':1,'ENG STEAM':1,'ELEC BASBD':1,'RAD WATER':1,'ELECTRIC':1,'SPACE HEAT':1,'OTHER W A/':1,'Evaporative Cooling':0,'Forced Air Furnace':1,
 'Electric Baseboard':1,'Complete HVAC':1,'Space Heater':1,'Package Unit':1,'Baseboard, Hot Water':1,'Y':1,'N':0,'Radiant Space Heaters':1,'Gravity Furnace':1,'ELEC WALL':1,
 'GEO THERM':1,'RAD ELEC':1,'Solar':1,'STEAM':1,'HEAT PUMP':1,'SP HT W/FN':1,'SPACE-FAN':1}

catsum = (categoryaggregates)

# (catsum
#  .write
#  .partitionBy('Monit_MAP_CODE1', 'Year')
#  .mode('overwrite')
#  .parquet('parccel_categorical_variables.parquet')
# )

quality = spark.read.parquet('./water_quality_by_year.parquet', header=True, sep='|')
catsum_with_binary = (catsum.join(numsum, on=['MONIT_MAP_CODE1', 'Year'], how='inner'))

stupidyear = (catsum_with_binary
    .withColumn("OfficialYear", col('Year'))
    .drop('Year')
)

summarywater = (stupidyear.join(quality, on=[(stupidyear.MONIT_MAP_CODE1 == quality.DNR_ID_Site_Number),
                                             (stupidyear.OfficialYear == quality.year)], how='left'))

water_quality_and_parcel_summaries = (summarywater
            .drop(column('year'))
)

# water_quality_and_parcel_summaries.write.csv('./water_quality_and_parcel_summaries_2004_2014.csv')




TypeError: Column is not iterable

22/12/08 09:03:15 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 157369 ms exceeds timeout 120000 ms
22/12/08 09:03:15 WARN SparkContext: Killing executors is not supported by current scheduler.


## Deliverables.

Make sure you have pushed all of your lab notebooks, along with the final combined `CSV` to the GitHub Classroom repo.  Submit a WORD document on D2L that includes

1. A link to your repository.
2. Screen shots of verifying the construction of the larger parquet files. You don't (and probably can't) record all of the folders/files, but should be able to capture the basic structure/partitioning.