## Preprocess Data from Landing to Raw
This notebook preprocesses the landing data, applying basic transformations such as column renaming and data type conversion.
### Load constants and libraries
First, we load the constants used within the notebook, import the required libraries, and initiate a Spark session.

In [None]:
# import all constants used in the notebooks
from constants import *

# libraries required
import os
from pyspark.sql import functions as F
import pandas as pd

In [2]:
from pyspark.sql import SparkSession

# Create a spark session (which will run spark jobs)
spark = (
    SparkSession.builder.appName("MAST30034 Project 1")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config('spark.driver.memory', '4g')
    .config('spark.executor.memory', '2g')
    .getOrCreate()
)

23/08/16 11:53:56 WARN Utils: Your hostname, PVM203L resolves to a loopback address: 127.0.1.1; using 172.20.203.188 instead (on interface eth0)
23/08/16 11:53:56 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/16 11:53:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Preprocess the TLC Data

In [3]:

def prepare_raw_tlc_data(source: str, use_schema: str, rename_cols: dict[str, str]) -> None:
    '''
    Retrieves downloaded TLC data (landing data), applies the schema of a selected month of data (use_schema)
    to all other months in the dataset to ensure consistency, and renames columns of interest.
    Columns are matched to the reference schema by position, and each month is saved to the raw data folder.
    Arguments:
        source = TLC data source (yellow, green, or fhvhv)
        use_schema = YYYY-MM to use as default schema for all months
        rename_cols = columns of interest to be renamed
                      dict with keys: original column name, and values: new column name
    Output: None
    '''
    # get the schema of the supplied year-month
    get_schema = spark.read.parquet(LANDING_DATA + source +'/' + use_schema + '.parquet').schema
    # cycle through the landing data and save down with columns cast to correct types
    for file in os.listdir(LANDING_DATA+source):
        month_data = spark.read.parquet(LANDING_DATA + source +'/' + file)
        month_data = month_data \
            .select([F.col(c).cast(get_schema[i].dataType) for i, c in enumerate(month_data.columns)])
        # rename columns if they exist in rename_cols, otherwise just keep original (in lower case)
        month_data = month_data.select([F.col(c).alias(rename_cols.get(c, c.lower())) for c in month_data.columns])
        # save the raw data
        month_data.write.mode('overwrite').parquet(RAW_DATA+source+'/'+file.split('.')[0])

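The renaming step above relies on `rename_cols.get(c, c.lower())`. The following is a minimal sketch of how that lookup resolves names, using plain Python lists rather than a Spark DataFrame. The helper name `resolve_column_names` and the sample column list are hypothetical and chosen only for illustration.

```python
def resolve_column_names(columns, rename_cols):
    """Return the output name for each column.

    Hypothetical helper mirroring the list comprehension in
    prepare_raw_tlc_data: use the mapped name when the column is a key
    of rename_cols, otherwise fall back to the lower-cased original.
    """
    return [rename_cols.get(c, c.lower()) for c in columns]


# Illustrative subset of column names (an assumption, not read from the data)
sample_columns = ["VendorID", "tpep_pickup_datetime", "PULocationID", "fare_amount"]
sample_rename = {
    "tpep_pickup_datetime": "pickup_datetime",
    "PULocationID": "pickup_id",
    "fare_amount": "fare",
}

new_names = resolve_column_names(sample_columns, sample_rename)
print(new_names)
```

Note that the dictionary lookup is case-sensitive: a key whose capitalisation differs from the actual column name is never matched, and the column simply falls through to the lower-case fallback.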
In [1]:
# prepare the raw data for each of the yellow, green, and fhvhv data sets in turn
print(f'Saving yellow data')
rename_cols = {'tpep_pickup_datetime': 'pickup_datetime', 'tpep_dropoff_datetime': 'dropoff_datetime',
               'trip_distance': 'trip_distance', 'PULocationID':'pickup_id','fare_amount':'fare'}
prepare_raw_tlc_data(source='yellow', use_schema='2023-02', rename_cols=rename_cols)

print(f'Saving green data')
rename_cols = {'lpep_pickup_datetime': 'pickup_datetime', 'lpep_dropoff_datetime': 'dropoff_datetime',
               'trip_distance': 'trip_distance', 'PULocationID':'pickup_id','fare_amount':'fare'}
prepare_raw_tlc_data(source='green', use_schema='2023-02', rename_cols=rename_cols)

print(f'Saving fhvhv data')
rename_cols = {'Pickup_datetime': 'pickup_datetime', 'DropOff_datetime': 'dropoff_datetime',
               'trip_miles': 'trip_distance', 'PULocationID':'pickup_id','base_passenger_fare':'fare'}
prepare_raw_tlc_data(source='fhvhv', use_schema='2023-02', rename_cols=rename_cols)

Saving yellow data
Saving green data
Saving fhvhv data


### Preprocess the Weather Data

In [3]:
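def isfloat(value): 
    ''' 
    A helper function to determine whether value (a string) can be converted to float.
    Returns a tuple: (True, converted float) on success, (False, original value) otherwise
    '''    
    try: 
        return True, float(value) 
    except (ValueError, TypeError): 
        # TypeError covers non-string inputs such as None
        return False, value 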

def fill_missing(all_data: pd.DataFrame, fill_col: str) -> None:
    '''
    Imputes non-numeric entries in a column of strings, modifying all_data in place. This is for timeseries
    values, therefore the method of imputation is to take an average of the adjacent values if both are
    available, or just the preceding or next value if only one of these is available. If neither is
    available, the first valid value in the series is used.
    Arguments:
        all_data = DataFrame to execute on
        fill_col = column to execute on
    Output: None
    '''
    
    # Extract the column being imputed
    data = all_data[fill_col]
    
    # find first valid entry, this is used to fill entries at the beginning of the series
    for row, entry in enumerate(data):
        found_float, first_float = isfloat(entry)
        if found_float:
            break
    
    # if no numeric values are found we cannot run imputation
    if not(found_float):
        print('no numeric values found')
        return

    # iterate through the data and apply the imputation methodology described in the docstring
    for row, entry in enumerate(data):
        if not(isfloat(entry)[0]):
            found_prior, prior_float = isfloat(data[max(row-1,0)])
            # clamp to the last valid index; len(data) itself would be out of range
            found_next, next_float = isfloat(data[min(row+1,len(data)-1)])
            
            if not(found_prior) and not(found_next):
                all_data.loc[row, fill_col] = first_float
            elif not(found_prior) and found_next:
                all_data.loc[row, fill_col] = next_float
            elif found_prior and not(found_next):
                all_data.loc[row, fill_col] = prior_float
            else:
                all_data.loc[row, fill_col] = (prior_float+next_float)/2
    print(f'data for {fill_col} successfully filled')
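The imputation rule described in the docstring can also be sketched without an explicit loop. This is a hedged alternative, not the method used in this notebook: the function name `impute_from_neighbours` is hypothetical, the sample values are made up, and it assumes neighbours are read from the original (unfilled) series and that any entry `pd.to_numeric` cannot parse counts as missing.

```python
import pandas as pd


def impute_from_neighbours(series: pd.Series) -> pd.Series:
    """Hypothetical vectorised version of the neighbour-averaging rule."""
    # Non-numeric strings become NaN
    numeric = pd.to_numeric(series, errors="coerce")

    # Values immediately before and after each position
    prior = numeric.shift(1)
    following = numeric.shift(-1)

    # Row-wise mean skips NaN: average of both neighbours if available,
    # the single available neighbour otherwise, NaN if neither exists
    neighbour_mean = pd.concat([prior, following], axis=1).mean(axis=1)
    filled = numeric.fillna(neighbour_mean)

    # Fall back to the first valid value when neither neighbour is numeric
    first_valid = numeric.first_valid_index()
    if first_valid is not None:
        filled = filled.fillna(numeric.loc[first_valid])
    return filled


# Toy series with 'M' standing in for a dummy value (an assumption)
toy = pd.Series(["1.0", "M", "3.0", "M", "M", "6.0"])
toy_filled = impute_from_neighbours(toy)
print(toy_filled.tolist())
```

Unlike `fill_missing`, this sketch returns a new numeric series rather than modifying the DataFrame in place, so the separate `pd.to_numeric` conversion would not be needed afterwards.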

In [25]:
# speed and precipitation data require imputation where dummy values have been included
weather_data = pd.read_csv(f'{LANDING_DATA}JFK.csv', sep=',')
fill_missing(weather_data, 'sped')
fill_missing(weather_data, 'p01m')

# Data type conversion
weather_data[['sped', 'p01m']] = weather_data[['sped', 'p01m']].apply(pd.to_numeric)
weather_data['valid'] = weather_data['valid'].apply(pd.to_datetime)

# pickle data format is implemented to preserve the schema
weather_data.to_pickle(f'{RAW_DATA}weather_data.pkl')

data for sped successfully filled
data for p01m successfully filled
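To illustrate why pickle is chosen over CSV for preserving the schema, here is a small sketch comparing a round trip through each format. The column names follow the weather data, but the values and file names are made up for the example, and the files are written to a temporary directory rather than `RAW_DATA`.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a datetime column and a numeric column (illustrative values)
frame = pd.DataFrame({
    "valid": pd.to_datetime(["2023-02-01 00:00", "2023-02-01 01:00"]),
    "sped": [5.0, 7.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example.csv")
    pkl_path = os.path.join(tmp_dir, "example.pkl")

    # Write the same frame in both formats, then read each back
    frame.to_csv(csv_path, index=False)
    frame.to_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

# Check whether the datetime dtype survived each round trip
csv_keeps_datetime = pd.api.types.is_datetime64_any_dtype(from_csv["valid"])
pkl_keeps_datetime = pd.api.types.is_datetime64_any_dtype(from_pkl["valid"])
print(csv_keeps_datetime, pkl_keeps_datetime)
```

With default arguments, `read_csv` returns the datetime column as text, so the conversion would have to be repeated in every downstream notebook, whereas the pickled frame comes back with its dtypes intact. Pickle files should only be loaded from trusted sources.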


### Preprocess the Events Data

In [36]:
events_data = pd.read_csv(f'{LANDING_DATA}events_data.csv')

# Data type conversion
events_data['create_time'] = events_data['create_time'].apply(pd.to_datetime)
events_data['close_time'] = events_data['close_time'].apply(pd.to_datetime)

# pickle data format is implemented to preserve the schema
events_data.to_pickle(f'{RAW_DATA}events_data.pkl')