In [1]:
!pip install more_pyspark



In [2]:
!pip install composable --upgrade



# Lab 4 - Creating partitioned parquet files

In this lab, we will perform our first round of data preparation by writing the larger files (`XREF` and the yearly `parcel` files) to the `parquet` format.  

In the process, we will discuss and investigate an important concept in managing lots of data: the principle of locality.  Big data problems are IO bound, meaning that almost all of the time/resources will be used managing the input/output of data.  The principle of locality holds that data is often reused over a short period of time (temporal locality) and that data stored in similar locations tends to be used at similar points in a program (spatial locality).  The `parquet` format allows us to partition a data set to leverage these properties.   *The correct partitioning can result in orders of magnitude speed up in processing time!*



## The Principle of Locality

<img src="./img/locality.png" width="800">

We can leverage the behavior of the operating system (OS), in particular the loading of chunks of data in proximity and keeping that data in memory for a time, by partitioning our data so that similar data is stored together.
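
To make the idea concrete, here is a toy sketch in plain Python (not Spark). It assumes a made-up block size of 100 rows and 50 hypothetical lake ids, and counts how many simulated blocks must be loaded to read every row for one lake, first with rows in arbitrary order and then with rows grouped by lake.

```python
import random

BLOCK_SIZE = 100  # rows per simulated disk block (an arbitrary assumption)


def blocks_touched(rows, lake_id):
    """Count the distinct blocks holding at least one row for lake_id."""
    return len({i // BLOCK_SIZE for i, row in enumerate(rows) if row == lake_id})


random.seed(0)
# 10,000 rows, each tagged with one of 50 hypothetical lake ids.
rows = [f"lake_{random.randrange(50)}" for _ in range(10_000)]

scattered = blocks_touched(rows, "lake_7")        # rows in arrival order
grouped = blocks_touched(sorted(rows), "lake_7")  # rows grouped by lake

print(f"blocks read when scattered: {scattered}")
print(f"blocks read when grouped:   {grouped}")
```

In this model, grouping the rows shrinks the number of blocks read from most of the file to a handful, which is the effect we hope to get from partitioning.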

## We need to group the data by lake and distance to the lake

<img src="./img/row_proximity.png" width="800">

## Problem 1 - Understanding the big picture and tables keys

**Tasks.**  

1. Explain why it makes sense, in the case of parcel data, to group the rows by lake id and distance to the lake.
2. Neither of these columns is present in the parcel data files.  How will we go about adding this information?

> <font color="orange"> 1. I think we would group the rows by lake id and distance to the lake because, by the principle of locality, it would be easier and faster to retrieve the data we use together.<br>
2. The columns could be added to the parcel data by first unioning the parcel data and then joining this information onto the parcel data.</font>
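
As a rough illustration of the second answer, the sketch below adds lake information to parcel rows with an inner join written in plain Python. The rows and the field names `lake_id` and `distance_m` are hypothetical stand-ins (the centroid values are borrowed from the `XREF` preview later in this lab), and a dictionary lookup stands in for the Spark join.

```python
# Hypothetical XREF lookup keyed by parcel centroid (lat, long).
xref = {
    ("44.94283", "-93.11451"): {"lake_id": "19007900-01", "distance_m": 2571.5},
    ("44.94234", "-93.11539"): {"lake_id": "19007900-01", "distance_m": 2515.4},
}

# Hypothetical parcel rows; the last one has no XREF match.
parcels = [
    {"centroid_lat": "44.94283", "centroid_long": "-93.11451", "EMV_TOTAL": 6600.0},
    {"centroid_lat": "44.94234", "centroid_long": "-93.11539", "EMV_TOTAL": 330200.0},
    {"centroid_lat": "45.00000", "centroid_long": "-93.00000", "EMV_TOTAL": 43800.0},
]


def add_lake_info(parcel_rows, xref_lookup):
    """Inner join: keep only parcels whose centroid appears in the lookup."""
    joined = []
    for row in parcel_rows:
        key = (row["centroid_lat"], row["centroid_long"])
        if key in xref_lookup:
            joined.append({**row, **xref_lookup[key]})
    return joined


parcels_with_lake = add_lake_info(parcels, xref)
print(len(parcels_with_lake), "of", len(parcels), "parcels matched")
```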

## Problem 2 - Writing the XREF to a partitioned parquet "file"

**Tasks.**

1. Load the `XREF` data and select the relevant columns (Lake ID, centroid lat & long, distance to the lake).
2. Create a new categorical variable with three categories based on distance to the lake: within 500m, between 501-1600m, and over 1600m.
3. Write the table to a `parquet` "file" partitioned by lake ID and distance category.
4. Read in each of these files and suggest the columns that will be used to join the tables.
5. To understand the relationship (one-to-one; one-to-many; many-to-many) between tables, perform aggregation on each table to determine if there is one or many keys per row.
6. Based on the results of the last task, suggest a join type and justify your response.
7. For each table, create a query that results in a column with one unique key per row.
8. Perform the join suggested in **6.** and investigate any mismatches.  Document your findings and suggest necessary remedies.
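
The aggregation used to classify the relationship between two tables can be sketched in plain Python as follows. This is only an illustration: the rows are invented, and `max_rows_per_key` and `relationship` are hypothetical helper names, not part of any library used in this lab.

```python
from collections import Counter


def max_rows_per_key(rows, key_columns):
    """Return the largest number of rows that share a single key."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return max(counts.values())


def relationship(left_rows, right_rows, key_columns):
    """Classify the join relationship from the per-key row counts."""
    left_many = max_rows_per_key(left_rows, key_columns) > 1
    right_many = max_rows_per_key(right_rows, key_columns) > 1
    if left_many and right_many:
        return "many-to-many"
    if left_many or right_many:
        return "one-to-many"
    return "one-to-one"


keys = ["centroid_lat", "centroid_long"]
xref_rows = [
    {"centroid_lat": "45.31604", "centroid_long": "-93.15177"},
    {"centroid_lat": "45.31259", "centroid_long": "-93.17935"},
]
parcel_rows = [
    {"centroid_lat": "45.31604", "centroid_long": "-93.15177"},
    {"centroid_lat": "45.31259", "centroid_long": "-93.17935"},
]
print(relationship(xref_rows, parcel_rows, keys))
```

Note that looking only at the top few counts, as in a `take(5)` after sorting, is enough here because the counts are sorted in descending order: if the largest count is 1, every count is 1.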

**Note.** The code for partitioning and writing a parquet file for the water quality data is provided as an example. 

#### Example - Writing a partitioned water quality file

In [3]:
from pyspark.sql import SparkSession


spark = (SparkSession.builder.appName('Ops').getOrCreate())

22/12/05 12:07:43 WARN Utils: Your hostname, jt7372wd222 resolves to a loopback address: 127.0.1.1; using 172.21.135.20 instead (on interface eth0)
22/12/05 12:07:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/12/05 12:07:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/12/05 12:07:45 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [4]:
from more_pyspark import to_pandas, pprint_schema

water_quality = spark.read.csv('./data/MinneMUDAC_raw_files/mces_lakes_1999_2014.txt',
                              header = True,
                              sep='\t')
water_quality.take(2) >> to_pandas

22/12/05 12:07:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


Unnamed: 0,PROJECT_ID,DATA_SET_TITLE,LAKE_NAME,CITY,COUNTY,DNR_ID_Site_Number,MAJOR_WATERSHED,WATER_PLANNING_AUTHORITY,LAKE_SITE_NUMBER,START_DATE,...,Secchi_Depth_RESULT_SIGN,Secchi_Depth_RESULT,Secchi_Depth_QUALIFIER,Secchi_Depth_Units,Total_Phosphorus_RESULT_SIGN,Total_Phosphorus_RESULT,Total_Phosphorus_QUALIFIER,Total_Phosphorus_Units,longitude,latitude
0,7108,Citizen Assisted Monitoring Program (CAMP) for...,Acorn Lake,Oakdale,Washington,82010200-01,Lower St. Croix River,Valley Branch WD,1,2006-04-16,...,,1.0,Approved,m,,0.156,Approved,mg/L,-92.97171054,45.01655642
1,7108,Citizen Assisted Monitoring Program (CAMP) for...,Acorn Lake,Oakdale,Washington,82010200-01,Lower St. Croix River,Valley Branch WD,1,2006-05-01,...,,,,m,,,,mg/L,-92.97171054,45.01655642


In [5]:
%%timeit -n 1 -r 1

(water_quality
 .write
 .partitionBy('DNR_ID_Site_Number')
 .mode('overwrite')
 .parquet('water_quality.parquet')
)

[Stage 2:>                                                          (0 + 3) / 3]

4.65 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


                                                                                

In [6]:
# Place your code/thoughts in one or more code/markdown cells, respectively.
import pandas as pd
from pyspark.sql.functions import col, when
pd.set_option('display.max_columns', None)

In [7]:
xref_data = spark.read.csv('./data/MinneMUDAC_raw_files/Parcel_Lake_Monitoring_Site_Xref.txt',
                              header = True,
                              sep='\t')
xref_data.take(2) >> to_pandas

Unnamed: 0,Parcel_PIN,Monit_MAP_CODE1,Monit_SITE_CODE,Monit_LAKE_SITE,Distance_Parcel_Monitoring_Site_meters,Lake_Hydroid,Distance_Parcel_Lake_meters,centroid_long,centroid_lat,Parcel_pkey
0,,19007900-01,19007900,1,2815.4927104148846,110517277058,2571.526792225838,-93.11451,44.94283,2163034
1,,19007900-01,19007900,1,2753.474687531216,110517277058,2515.3738022144425,-93.11539,44.94234,2163035


In [8]:
# %%timeit -n 1 -r 1

xref_data_w_distance_var = (xref_data
 .select('Monit_MAP_CODE1','Distance_Parcel_Lake_meters','centroid_long','centroid_lat')
 .withColumn("distance_categories", when(col('Distance_Parcel_Lake_meters') <= 500, 'within 500m')
                         .when((col('Distance_Parcel_Lake_meters') > 500)
                             & (col('Distance_Parcel_Lake_meters') <= 1600), 'between 501-1600m')
                         .otherwise('over 1600m')
            )
)
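
To check the category boundaries without starting Spark, the same rule can be written as a small plain-Python function. This is a sketch that mirrors the `when` chain above under one assumption: a missing distance is returned as `None` here, whereas in the Spark cell a null distance fails both conditions and falls through to `otherwise`, so it would be labeled `over 1600m`.

```python
def distance_category(distance_m):
    """Label a parcel by its distance to the lake, in meters."""
    if distance_m is None:
        return None  # assumption: keep missing distances unlabeled
    if distance_m <= 500:
        return "within 500m"
    if distance_m <= 1600:
        return "between 501-1600m"
    return "over 1600m"


for d in (0, 500, 500.5, 1600, 1600.1, 2571.5):
    print(d, "->", distance_category(d))
```

The middle label reads "501-1600m", but the rule actually covers every distance greater than 500 and up to 1600, including fractional values such as 500.5.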

In [9]:
(xref_data_w_distance_var
 .write
 .partitionBy('Monit_MAP_CODE1','distance_categories')
 .mode('overwrite')
 .parquet('./data/xref.parquet')
)

                                                                                

In [10]:
xref_parquet = spark.read.parquet("data/xref.parquet")

                                                                                

In [11]:
xref_parquet_within_500 = xref_parquet.where(col('distance_categories') == 'within 500m')
xref_parquet_between_501_and_1600 = xref_parquet.where(col('distance_categories') == 'between 501-1600m')
xref_parquet_over_1600 = xref_parquet.where(col('distance_categories') == 'over 1600m')

> We will use the columns `centroid_long` and `centroid_lat` to join the parcel data to xref data and use the column `Monit_MAP_CODE1` to join the xref data to the water quality data.

In [12]:
(xref_parquet_within_500
 .groupBy("centroid_lat","centroid_long")
 .count()
 .orderBy(col('count').desc())
 .take(5)
) >> to_pandas

                                                                                

Unnamed: 0,centroid_lat,centroid_long,count
0,45.31604,-93.15177,1
1,45.31259,-93.17935,1
2,45.30731,-93.16032,1
3,45.30093,-93.18571,1
4,45.30534,-93.15771,1


In [13]:
(xref_parquet_between_501_and_1600
 .groupBy("centroid_lat","centroid_long")
 .count()
 .orderBy(col('count').desc())
 .take(5)
) >> to_pandas

                                                                                

Unnamed: 0,centroid_lat,centroid_long,count
0,44.91728,-93.22746,1
1,44.91771,-93.2287,1
2,44.91059,-93.22877,1
3,44.90976,-93.22878,1
4,44.90609,-93.23137,1


In [14]:
(xref_parquet_over_1600
 .groupBy("centroid_lat","centroid_long")
 .count()
 .orderBy(col('count').desc())
 .take(5)
) >> to_pandas

                                                                                

Unnamed: 0,centroid_lat,centroid_long,count
0,45.16768,-93.35504,1
1,45.22941,-93.40192,1
2,45.22526,-93.40267,1
3,45.22969,-93.40316,1
4,45.1752,-93.33356,1


> A one-to-one relationship can be seen: there is one key per row. The join type could be an inner join since the relationship is one-to-one; anything missing from either the xref file or the parcel file could be dropped, as mentioned in the previous lab, lab 4. Also, for 7 and 8, I am confused about what to join on and which data we are talking about. If we are talking about the parcel data or the water_quality data, the answer would be the same as in the last lab since it is the same data.
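
One way to investigate mismatches before committing to an inner join is to list the keys on one side that have no partner on the other. The sketch below does this in plain Python with invented rows; `key_of` and `unmatched_keys` are hypothetical helpers. It also shows one possible cause of mismatches when coordinates are compared as text: the same number written with a different number of digits.

```python
def key_of(row, key_columns, normalize=None):
    """Build the join key for a row, optionally normalizing each value."""
    values = [row[c] for c in key_columns]
    return tuple(normalize(v) for v in values) if normalize else tuple(values)


def unmatched_keys(left_rows, right_rows, key_columns, normalize=None):
    """Return the keys of left rows that have no match among the right rows."""
    right_keys = {key_of(r, key_columns, normalize) for r in right_rows}
    left_keys = [key_of(r, key_columns, normalize) for r in left_rows]
    return [k for k in left_keys if k not in right_keys]


keys = ["centroid_lat", "centroid_long"]
parcel_rows = [{"centroid_lat": "45.17520", "centroid_long": "-93.33356"}]
xref_rows = [{"centroid_lat": "45.1752", "centroid_long": "-93.33356"}]

as_text = unmatched_keys(parcel_rows, xref_rows, keys)
as_number = unmatched_keys(
    parcel_rows, xref_rows, keys, normalize=lambda v: round(float(v), 5)
)
print("unmatched as text:  ", as_text)
print("unmatched as number:", as_number)
```

Whether this formatting issue occurs in the real files is something to check rather than assume; the sketch only shows how such a check could look.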

### Part 3 - Inspecting the partitioned `parquet` file

**Tasks.** Inspect the resulting "file" (actually a folder) from the last set and answer the following questions.

1. What impact did the partitioning have on the way the data was saved?
2. How would this structure help `pyspark` apply predicate pushdown?
3. How would this structure provide help via the principle of locality?
4. When working with a cluster of machines, operations such as `groupby` are WIDE operations, meaning they generally need to shuffle data between machines.  Such a shuffle is *very* expensive.  In a future lab, we will be creating features for each lake by grouping and aggregating on the lakes and years.  How would applying a similar structure to the parcel data help in this case? **Hint.** Remember that the data will be distributed across multiple machines using the partitions, i.e., each machine will load all or some of the same partition(s).


> <font color="orange"> 
    1. The file is a folder containing many sub-folders, one for each value of the partition variables (the lake id and the distance category). <br>
    2. Pyspark can filter rows using the folder structure, since the folders are named after the partition values. <br>
    3. The data is already partitioned, so similar data is kept close together and in memory at the same time, which helps via the principle of locality. <br>
    4. When working with a cluster of machines, applying a similar technique to the parcel data would let each machine load all or some of a partition, which saves us from having to move the data across multiple machines and is therefore more efficient.
</font>
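
The folder layout described in these answers can be imitated with the standard library alone. The sketch below is a simplified model, not how Spark writes or reads `parquet`: it writes hypothetical rows into `column=value` sub-folders and then decides which files to open purely from the folder names, which is the idea behind skipping whole partitions when a filter is applied.

```python
import tempfile
from pathlib import Path


def write_partitioned(root, rows, partition_columns):
    """Append each row's remaining fields to a file under col=value folders."""
    for row in rows:
        folder = Path(root).joinpath(*(f"{c}={row[c]}" for c in partition_columns))
        folder.mkdir(parents=True, exist_ok=True)
        rest = {k: v for k, v in row.items() if k not in partition_columns}
        with open(folder / "part-0.txt", "a") as handle:
            handle.write(f"{rest}\n")


def files_to_read(root, **wanted):
    """Return only the files whose folder names satisfy the filter."""
    selected = []
    for path in sorted(Path(root).rglob("part-0.txt")):
        folders = path.relative_to(root).parts[:-1]
        labels = dict(part.split("=", 1) for part in folders)
        if all(labels.get(c) == v for c, v in wanted.items()):
            selected.append(path)
    return selected


# Hypothetical rows using the column names from this lab.
rows = [
    {"Monit_MAP_CODE1": "19007900-01", "distance_categories": "within 500m", "centroid_lat": "44.94283"},
    {"Monit_MAP_CODE1": "19007900-01", "distance_categories": "over 1600m", "centroid_lat": "44.94234"},
    {"Monit_MAP_CODE1": "19002000-01", "distance_categories": "over 1600m", "centroid_lat": "44.47145"},
]

with tempfile.TemporaryDirectory() as root:
    write_partitioned(root, rows, ["Monit_MAP_CODE1", "distance_categories"])
    all_files = files_to_read(root)
    near_files = files_to_read(root, distance_categories="within 500m")
    near_names = [str(p.relative_to(root)) for p in near_files]

print(len(all_files), "files in total;", len(near_files), "needed for the filter")
```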

## <font color="blue"> Key </font>

> <font color="orange"> <b>1.</b> The "file" is actually a directory with sub-folders for each combination of labels for the partitioning variables.  <b>2.</b> <code>pyspark</code> can use the directory structure to totally skip any combination that we filter out. <b>3.</b> Having the data partitioned/sorted should also help with the principle of locality by keeping similar data close and thus in memory at the same time.  <b>4.</b> When spreading our data across multiple machines, this will be particularly advantageous, as each machine can just load some/all of a partition, saving us from having to shuffle data between machines. </font>
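
The last point can be illustrated with a toy model in plain Python. It is not a description of how Spark plans a `groupby`; it simply counts, for a hypothetical assignment of rows to workers, how many rows would have to move so that every lake's rows end up on a single worker.

```python
from collections import defaultdict


def rows_to_shuffle(workers):
    """Count rows that must move so each lake lives on exactly one worker."""
    per_lake = defaultdict(lambda: defaultdict(int))
    for worker, rows in workers.items():
        for lake, _value in rows:
            per_lake[lake][worker] += 1
    # Each lake stays on the worker that already holds most of its rows.
    return sum(sum(c.values()) - max(c.values()) for c in per_lake.values())


# Rows are (lake, value) pairs; worker names and values are made up.
mixed = {"w1": [("A", 1), ("B", 2)], "w2": [("A", 3), ("B", 4)]}
by_lake = {"w1": [("A", 1), ("A", 3)], "w2": [("B", 2), ("B", 4)]}

print("rows moved when lakes are mixed:      ", rows_to_shuffle(mixed))
print("rows moved when partitioned by lake:  ", rows_to_shuffle(by_lake))
```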

### Part 4 - Filtering parcels and joining the lake ID

Next, we will partition and write each of the 2004-2015 parcel files to a `parquet` "file".  To do this, complete each of the following tasks.

**Tasks.**

1. Write a helper function that takes a parcel file path as input, reads the corresponding CSV, selects the common columns (import from `parcel.py`), and joins on the necessary info from the `XREF` (lake ID, distance to the lake, distance category defined above, and centroid lat & long).
2. Write a helper function that takes a `year` and parcel `df`, partitions the file by the lake ID and distance category, and writes the data to a "file" named `parcel_year.parquet`.
3. Test the two helper functions on one of the parcel file years to make sure they are bug free.
4. Write a pipe with a familiar shape:
    * Use `glob` to get all parcel file paths.
    * Filter the paths to 2004-2015.
    * Split into year/df tuples using `get_year` and your helper function from **1.**
    * `star_map` your helper function from **2.** to write each of the files.
    
**Important note.** Each parcel file took 10+ minutes on my laptop, so running the whole pipe will take a while.  Pick a convenient time and be sure to plug in your laptop!
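
Because the full pipe is slow, it can help to rehearse its shape first with stand-ins that do no real work. The sketch below uses only the standard library: `get_year_from_name`, `load_stub`, and `write_stub` are hypothetical placeholders for `get_year` and the two helper functions, and the empty files are created in a temporary folder using the `YYYY_metro_tax_parcels.txt` naming seen in the raw data.

```python
import re
import tempfile
from itertools import starmap
from pathlib import Path


def get_year_from_name(path):
    """Hypothetical stand-in for get_year: read the leading 4-digit year."""
    return int(re.match(r"(\d{4})_", Path(path).name).group(1))


def load_stub(path):
    """Hypothetical stand-in for helper 1; returns a label, not a data frame."""
    return f"df<{Path(path).name}>"


def write_stub(year, df):
    """Hypothetical stand-in for helper 2; returns the target file name."""
    return f"parcel_{year}.parquet"


with tempfile.TemporaryDirectory() as root:
    # Create empty placeholder files for 2002 through 2015.
    for year in range(2002, 2016):
        Path(root, f"{year}_metro_tax_parcels.txt").touch()

    paths = sorted(Path(root).glob("*parcels.txt"))                     # glob
    kept = [p for p in paths if 2004 <= get_year_from_name(p) <= 2015]  # filter
    pairs = [(get_year_from_name(p), load_stub(p)) for p in kept]       # map
    written = list(starmap(write_stub, pairs))                          # star_map

print(written)
```

Once the rehearsal produces the expected twelve targets, the stubs can be swapped for the real helpers.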

In [15]:
# Imports 

from parcel import sorted_common_columns_2004_to_2015
from utility import get_year, make_data_frame
from composable.strict import map, filter, sorted, star_map
from composable.glob import glob

In [16]:
# Place your code/thoughts in one or more code/markdown cells, respectively.

# Helper function 1

# parcel_join_with_xref_lambda = lambda file_path: (make_data_frame(file_path)
#                                                   .select(sorted_common_columns_2004_to_2015)
#                                                   .join(xref_data_w_distance_var, on=["centroid_lat", "centroid_long"], how='inner'))

def parcel_join_with_xref(file_path):
    join_xref = (make_data_frame(file_path)
                 .select(sorted_common_columns_2004_to_2015)
                 .join(xref_data_w_distance_var, on=["centroid_lat", "centroid_long"], how='inner')
                )
    return join_xref
    
# parcel_join_with_xref_lambda('data/MinneMUDAC_raw_files/2006_metro_tax_parcels.txt').take(5)>>to_pandas
parcel_join_with_xref('data/MinneMUDAC_raw_files/2006_metro_tax_parcels.txt').take(5)>>to_pandas

                                                                                

Unnamed: 0,centroid_lat,centroid_long,ACRES_DEED,ACRES_POLY,AGPRE_ENRD,AGPRE_EXPD,AG_PRESERV,BASEMENT,BLDG_NUM,BLOCK,CITY,CITY_USPS,COOLING,COUNTY_ID,DWELL_TYPE,EMV_BLDG,EMV_LAND,EMV_TOTAL,FIN_SQ_FT,GARAGESQFT,GREEN_ACRE,HEATING,HOME_STYLE,LANDMARK,LOT,MULTI_USES,NUM_UNITS,OPEN_SPACE,OWNER_MORE,OWNER_NAME,OWN_ADD_L1,OWN_ADD_L2,OWN_ADD_L3,PARC_CODE,PIN,PLAT_NAME,PREFIXTYPE,PREFIX_DIR,SALE_DATE,SALE_VALUE,SCHOOL_DST,SPEC_ASSES,STREETNAME,STREETTYPE,SUFFIX_DIR,Shape_Area,Shape_Leng,TAX_ADD_L1,TAX_ADD_L2,TAX_ADD_L3,TAX_CAPAC,TAX_EXEMPT,TAX_NAME,TOTAL_TAX,UNIT_INFO,USE1_DESC,USE2_DESC,USE3_DESC,USE4_DESC,WSHD_DIST,XUSE1_DESC,XUSE2_DESC,XUSE3_DESC,XUSE4_DESC,YEAR_BUILT,Year,ZIP,ZIP4,Monit_MAP_CODE1,Distance_Parcel_Lake_meters,distance_categories
0,44.47145,-93.15725,0.0,0.35,,,N,,,,NORTHFIELD,,,37,,0.0,6600.0,6600.0,0.0,,N,,,,,N,0,N,,SKLUZACEK DEBORA,804 MAYFLOWER CT,,NORTHFIELD MN 55057,1.0,037-436760001000,,,,,0.0,659,0.0,,,,1433.97645355,315.516837022,,,,90.0,N,,142.0,,RESIDENTIAL,,,,NORTH CANNON RIVER,,,,,0.0,2006,,,19002000-01,10702.704528893337,over 1600m
1,44.47158,-93.17607,0.0,0.34,,,N,,1202.0,,NORTHFIELD,NORTHFIELD,,37,S.FAM.RES,260300.0,69900.0,330200.0,3372.0,,N,,TWO STORY,,,N,1,N,,BORNHAUSER TODD J & DIANE E,1202 BLUESTEM CT,,NORTHFIELD MN 55057-5291,1.0,037-435865001001,,,,1997-10-01,200000.0,659,0.0,BLUESTEM,CT,,1381.85902107,151.671876342,,,,3243.0,N,,3576.0,,RESIDENTIAL,,,,NORTH CANNON RIVER,,,,,1994.0,2006,55057.0,,19002000-01,10089.983605115398,over 1600m
2,44.47165,-93.15479,0.0,1.09,,,N,,,,WATERFORD TWP,,,37,,0.0,43800.0,43800.0,0.0,,N,,,NORTHFIELD TRACTOR,,N,0,N,,LANGER & ESTREM PR OP,32980 NORTHFIELD BLVD,,NORTHFIELD MN 55057-1484,1.0,037-410300001058,,,,,0.0,659,0.0,,,,,,,,,248.0,N,,367.0,,COMMERCIAL,,,,NORTH CANNON RIVER,,,,,0.0,2006,,,19002000-01,10769.861066280544,over 1600m
3,44.47182,-93.17815,0.0,0.25,,,N,,1205.0,,NORTHFIELD,NORTHFIELD,,37,S.FAM.RES,281800.0,76700.0,358500.0,3224.0,,N,,TWO STORY,,,N,1,N,TOURE-KEITA MAIMOUNA,KEITA CHEICK M C,1205 CANNON VALLEY DR,,NORTHFIELD MN 55057-5292,1.0,037-435865002002,,,,2002-06-01,299000.0,659,0.0,CANNON VALLEY,DR,,1025.66974075,131.005985483,,,,3519.0,N,,3912.0,,RESIDENTIAL,,,,NORTH CANNON RIVER,,,,,1995.0,2006,55057.0,,19002000-01,10006.6277290234,over 1600m
4,44.47185,-93.17551,0.0,0.63,,,N,,1233.0,,NORTHFIELD,NORTHFIELD,,37,S.FAM.RES,150600.0,75000.0,225600.0,1768.0,,N,,ONE STORY,,,N,1,N,ETT THOMAS E,GLIMSDAL ELIZABETH J,1233 WOODLAND TRAIL,,NORTHFIELD MN 55057-5287,1.0,037-431640027000,,,,2001-07-01,189900.0,659,0.0,WOODLAND,TRL,,,,,,,2178.0,N,,2279.0,,RESIDENTIAL,,,,NORTH CANNON RIVER,,,,,1971.0,2006,55057.0,,19002000-01,10078.147771489736,over 1600m


In [17]:
# Helper function 2

def create_partition(year, df):
    (df
    .write
    .partitionBy('Monit_MAP_CODE1','distance_categories')
    .mode('overwrite')
    .parquet(f'./data/parcel_{year}.parquet')
    )
    
# create_partition(2005, parcel_join_with_xref('data/MinneMUDAC_raw_files/2005_metro_tax_parcels.txt'))

In [17]:
('./data/MinneMUDAC_raw_files/*parcels.txt'
 >> glob
 >> filter(lambda parcel_file: 2004 <= int(get_year(parcel_file)) <= 2015)
 >> map(lambda parcel_data_file: (get_year(parcel_data_file),parcel_join_with_xref(parcel_data_file)))
 >> star_map(create_partition)
)

[Stage 39:>                                                        (0 + 8) / 15]

22/12/04 22:50:35 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers


[Stage 44:>                                                        (0 + 8) / 16]

22/12/04 22:53:48 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers




22/12/04 22:54:40 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers


[Stage 49:>                                                        (0 + 8) / 17]

22/12/04 22:56:13 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers




22/12/04 22:58:50 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers


[Stage 69:>                                                        (0 + 8) / 14]

22/12/04 23:04:19 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers


[Stage 74:>                                                        (0 + 8) / 16]

22/12/04 23:06:07 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers


[Stage 79:>                                                        (0 + 8) / 17]

22/12/04 23:08:14 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers


[Stage 89:>                                                        (0 + 8) / 16]

22/12/04 23:12:45 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again.


                                                                                

[None, None, None, None, None, None, None, None, None, None, None, None]