# NYC Taxi Dataset Project - Data Prep AWS Spark

## Overall Steps

**Step 0:** Prerequisites

**Step 1:** Start Spark Cluster

**Step 2:** Upload this notebook and packages

**Step 3:** Clean Data and generate features

**Step 4:** Save RDD to file on s3


Note: Step 1 is based on the CS109 [instructions](https://piazza.com/class/icf0cypdc3243c?cid=1369), with modifications to optimize performance for this project.

### Step 0: Prerequisites

1. You need the files CS109.pem and credentials.csv. If you followed the CS109 instructions (for Lab 8 or HW5), you already have these files.

2. You will need a directory containing the following files:
    
    a) CS109.pem
    
    b) credentials.csv
    
    c) Setup Project.ipynb
    
    d) myConfig.json
    
    e) DataPrepAWSSpark.ipynb (this notebook)
    
    f) geohash.py
3. You must have completed the instructions in Setup Project.ipynb

#### Note: The notebook was updated so the datafiles must be downloaded again in order to be properly preprocessed.

### Step 1: Start Spark cluster and sanity check

#### Step 1a) Start your Spark cluster as described in Step 1 from Setup Project (unless your spark cluster is already running)



#### Step 1c) Wait for the cluster to be ready: AWS web console has to show "WAITING"

#### Step 1d)  Get the cluster master's IP:

#### Step 1e) Run the script to configure Spark 

#### Step 1f) Create an SSH tunnel to the AWS box and connect to the cluster. This command assumes your SSH key is in the same directory you are invoking the SSH command from. At the end of this you will be in a terminal session on the cluster's master node.

### Step 2: Upload this notebook and geohash.py

#### Upload this Jupyter Notebook and geohash.py using the console from http://localhost:8989

Notes: 
1. All the steps in Step 2 are to be executed from the Jupyter Notebook itself.
2. We will frequently be loading data from the S3 bucket you created in Step 3 of Setup Project. (I will use the bucket name "sdaultontestbucket", but replace this with your own.)

#### Sanity check: make sure Spark cluster is working

In [1]:
import sys
rdd = sc.parallelize(xrange(10),10)
aa = rdd.map(lambda x: sys.version)
aa.count()

10

#### Make sure geohash.py was copied properly

In [2]:
import geohash
geohash.encode(40,74,6)

'txheec'

In [3]:
sc.addPyFile("geohash.py")

### Step 3: Clean Data and Calculate Features

In [11]:
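# Note: this glob matches the green-taxi files; point it at the yellow-taxi files to load yellow taxi data
y_rdd = sc.textFile("/home/parallels/Documents/TaxiPrediction-master/data/nyc/green*.csv")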
y_rdd = y_rdd.map(lambda line: tuple(line.split(',')))

In [12]:
g_rdd = sc.textFile("/home/parallels/Documents/TaxiPrediction-master/data/nyc/green*.csv")
g_rdd = g_rdd.map(lambda line: tuple(line.split(',')))

### Data preparation specification
Given a certain granularity in location (geohash length g), a granularity in time (bins per day b), and a chosen width (w) of the neighbourhood we want to look at, the aggregated data should end up with the following columns:

#### "geohash"
Geohash with length g (categorical feature). This column will not actually be used in the prediction. It is just an id and can be used when calculating the distances between the geohashes.
#### "time_cat" 
Time of the day as a categorical feature. If $b = 24$ (one bin for every hour), then "time_cat" for a pickup at 14:20:00 should be the string "14:00". If $b = 96$ (one bin for every quarter of an hour), then "time_cat" for a pickup at 14:20:00 should be the string "14:15".
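The binning rule above can be sketched in plain Python. This is a minimal illustration, not the notebook's `date_extractor`; the function name `time_cat` and its `(hour, minute, b)` signature are assumptions made for this example.

```python
def time_cat(hour, minute, b):
    """Return the start of the time bin containing hour:minute as 'HH:MM'.

    Hypothetical helper: assumes b evenly divides the 1440 minutes of a day.
    """
    minutes_per_bin = (24 * 60) // b           # e.g. b = 96 -> 15 minutes
    minute_of_day = hour * 60 + minute         # e.g. 14:20 -> 860
    bin_start = (minute_of_day // minutes_per_bin) * minutes_per_bin
    return "{:02d}:{:02d}".format(bin_start // 60, bin_start % 60)

print(time_cat(14, 20, 24))   # 14:00
print(time_cat(14, 20, 96))   # 14:15
```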
#### "time_num" 
Time of the day as a (binned!) floating point number between 0 and 1, where the center of the bin is converted to a floating point number between 0 and 1. So if $b = 24$, then "time_num" for a pickup at 14:20:00 should be $14.5\,/\,24 =  0.6042$. If $b = 96$, it should translate to $14.375\,/\,24 = 0.5990$.
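As a quick check of the arithmetic above, the bin center can be computed as follows. This is a sketch under the assumption that bins are equally sized and start at midnight; the helper name `time_num` is chosen for this example only.

```python
def time_num(hour, minute, b):
    """Return the center of the time bin as a fraction of the day in [0, 1)."""
    minutes_per_bin = (24 * 60) / float(b)
    bin_index = int((hour * 60 + minute) // minutes_per_bin)
    bin_center = (bin_index + 0.5) * minutes_per_bin   # minutes since midnight
    return bin_center / (24 * 60)

print(round(time_num(14, 20, 24), 4))   # 0.6042
print(round(time_num(14, 20, 96), 4))   # 0.599
```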
#### "time_cos" 
The binned "time_num" variable converted to a cosine version so that time nicely 'loops' rather than going saw-like when it traverses midnight. See the figure below. This transformation doesn't have any magic powers, but it can make it easier for a model to find the right patterns. "time_cos" = $\cos(\textrm{time_num} \cdot 2\pi)$. So for 24 bins, 14:20:00 would translate to $\cos(0.6042 \cdot 2\pi) = -0.7932$.
<img src="figures/cyclic-numeric-feature-transformation.png">
#### "time_sin" 
Same as "time_cos", but with sine. So, "time_sin" = $\sin(\textrm{time_num} \cdot 2 \pi)$. For 24 bins per day, 14:20:00 would translate to $\sin(0.6042 \cdot 2 \pi) = -0.6089$.
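The following sketch shows why the cosine/sine pair helps around midnight: the last and the first hourly bin of the day are far apart as plain numbers but close together on the unit circle. The helper name `cyclic_encode` is an assumption for this illustration.

```python
import math

def cyclic_encode(x):
    """Map a value x in [0, 1) to a (cos, sin) pair on the unit circle."""
    angle = 2 * math.pi * x
    return math.cos(angle), math.sin(angle)

late = cyclic_encode(23.5 / 24)    # center of the 23:00 bin
early = cyclic_encode(0.5 / 24)    # center of the 00:00 bin

# As plain numbers the two bins differ by almost a full day ...
numeric_gap = abs(23.5 / 24 - 0.5 / 24)
# ... but on the circle they are neighbours.
circle_gap = math.hypot(late[0] - early[0], late[1] - early[1])
print(numeric_gap, circle_gap)
```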
#### "day_cat" 
Day of the week as a categorical feature: "Monday", "Tuesday", etc.
#### "day_num" 
Day of the week as a numerical feature going from 0 (Monday morning, start of the week) to 1 (Sunday night), European style. With 24 bins, Tuesday afternoon 14:20:00 would translate to $(1 + \frac{14.5}{24})\,/\,7 = 0.2292$.
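The worked example above can be reproduced with the standard library. This is a sketch: the helper name `day_num` and the choice of 2 June 2015 (a Tuesday) as the sample date are assumptions for this illustration.

```python
from datetime import date

def day_num(d, time_num):
    """Position in the week in [0, 1): Monday 00:00 is 0, Sunday night approaches 1."""
    return (d.weekday() + time_num) / 7.0   # weekday(): Monday = 0 ... Sunday = 6

tuesday = date(2015, 6, 2)
print(round(day_num(tuesday, 14.5 / 24), 4))   # 0.2292
```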
#### "day_cos" 
Binned "day_num" variable converted to a cosine version. "day_cos" = $\cos(\textrm{day_num} \cdot 2\pi)$
#### "day_sin" 
Binned "day_num" variable converted to a sine version. "day_sin" = $\sin(\textrm{day_num} \cdot 2\pi)$
#### "weekend" 
0 if weekday, 1 if weekend (Saturday/Sunday)
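A minimal sketch of the "day_cat" and "weekend" features, assuming the same Monday = 0 convention as `date.weekday()`. The helper name `day_features` is hypothetical.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def day_features(d):
    """Return (day_cat, weekend) for a datetime.date."""
    dow = d.weekday()                 # 0 = Monday ... 6 = Sunday
    weekend = 1 if dow >= 5 else 0    # Saturday or Sunday
    return DAY_NAMES[dow], weekend

print(day_features(date(2015, 6, 2)))   # ('Tuesday', 0)
print(day_features(date(2015, 6, 6)))   # ('Saturday', 1)
```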
#### Location features 
Latitude and longitude of the center of the geohash the record was bucketed in.
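To make the location features concrete, here is a self-contained sketch of the standard geohash scheme: encoding a point and recovering the center of its cell. It is not the `geohash.py` module used by this notebook; the names `geohash_encode` and `geohash_center` are assumptions for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # standard geohash alphabet

def geohash_encode(lat, lon, precision):
    """Encode a point by alternately halving the longitude and latitude ranges."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, n_bits, use_lon = [], 0, 0, True
    while len(chars) < precision:
        rng, val = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2.0
        if val >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid
        else:
            bits = bits << 1
            rng[1] = mid
        use_lon = not use_lon
        n_bits += 1
        if n_bits == 5:                       # 5 bits per base-32 character
            chars.append(BASE32[bits])
            bits, n_bits = 0, 0
    return "".join(chars)

def geohash_center(gh):
    """Return (latitude, longitude) of the center of the geohash cell."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    use_lon = True
    for ch in gh:
        idx = BASE32.index(ch)
        for shift in range(4, -1, -1):        # most significant bit first
            rng = lon_rng if use_lon else lat_rng
            mid = (rng[0] + rng[1]) / 2.0
            if (idx >> shift) & 1:
                rng[0] = mid
            else:
                rng[1] = mid
            use_lon = not use_lon
    return (sum(lat_rng) / 2.0, sum(lon_rng) / 2.0)

print(geohash_encode(42.6, -5.6, 5))   # ezs42
print(geohash_center("ezs42"))
```

Every pickup that falls in the same cell gets the same center coordinates, so these two features carry exactly the spatial resolution chosen with g.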


#### Helper functions for cleaning and feature extraction/generation

In [7]:
# Needed libraries
import time
from datetime import date
import math

def date_extractor(date_str,b,minutes_per_bin):
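    # Takes a datetime string of the form "YYYY-MM-DD HH:MM:SS" as a parameter
    # and extracts and returns a tuple of the form: (as per the data specification)
    # (year, month, day, time_cat, time_num, time_cos, time_sin, day_cat, day_num, day_cos, day_sin, weekend)
    # Returns (None,) if the string does not contain a date and a time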
    # Split date string into list of date, time
    
    d = date_str.split()
    
    #safety check
    if len(d) != 2:
        return tuple([None,])
    
    # TIME (eg. for 16:56:20 and 15 mins per bin)
    #list of hour,min,sec (e.g. [16,56,20])
    time_list = [int(t) for t in d[1].split(':')]
    
    #safety check
    if len(time_list) != 3:
        return tuple([None,])
    
    # calculate number of minutes into the day (eg. 1016)
    num_minutes = time_list[0] * 60 + time_list[1]
    
    # Time of the start of the bin (// keeps the integer division explicit)
    time_bin = num_minutes // minutes_per_bin    # eg. 67 (bin index; the bin starts at minute 1005)
    hour_bin = num_minutes // 60                 # eg. 16
    min_bin = (time_bin * minutes_per_bin) % 60  # eg. 45
    
    #get time_cat
    hour_str = "%02d" % hour_bin                 # eg. "16"
    min_str = "%02d" % min_bin                   # eg. "45"
    time_cat = hour_str + ":" + min_str          # eg. "16:45"
    
    # Get a floating point representation of the center of the time bin
    time_num = (hour_bin*60 + min_bin + minutes_per_bin / 2.0)/(60*24)      # eg. 0.703125
    
    time_cos = math.cos(time_num * 2 * math.pi)
    time_sin = math.sin(time_num * 2 * math.pi)
    
    # DATE
    # Parse year, month, day
    date_list = d[0].split('-')
    d_obj = date(int(date_list[0]),int(date_list[1]),int(date_list[2]))
    day_to_str = {0: "Monday",
                  1: "Tuesday",
                  2: "Wednesday",
                  3: "Thursday",
                  4: "Friday",
                  5: "Saturday",
                  6: "Sunday"}
    day_of_week = d_obj.weekday()
    day_cat = day_to_str[day_of_week]
    day_num = (day_of_week + time_num)/7.0
    day_cos = math.cos(day_num * 2 * math.pi)
    day_sin = math.sin(day_num * 2 * math.pi)
    
    year = d_obj.year
    month = d_obj.month
    day = d_obj.day
    
    weekend = 0
    #check if it is the weekend
    if day_of_week in [5,6]:
        weekend = 1
       
    return (year, month, day, time_cat, time_num, time_cos, time_sin, day_cat, day_num, day_cos, day_sin, weekend)

In [8]:
def data_cleaner(zipped_row):
    # takes a tuple (row,g,b,minutes_per_bin) as a parameter and returns a tuple of the form:
    # (year, month, day, time_cat, time_num, time_cos, time_sin, day_cat, day_num, day_cos, day_sin, weekend, geohash)
    row = zipped_row[0]
    g = zipped_row[1]
    b = zipped_row[2]
    minutes_per_bin = zipped_row[3]
    # The indices of pickup datetime, latitude, and longitude respectively
    indices = (1, 6, 5)
    
    #safety check: make sure row has enough features
    if len(row) < 7:
        return None
    
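    #extract day of the week and hour
    date_str = row[indices[0]]
    clean_date = date_extractor(date_str,b,minutes_per_bin)
    #safety check: date_extractor returns (None,) for malformed datetimes
    if clean_date[0] is None:
        return None
    #get geo hash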

    latitude = float(row[indices[1]])
    longitude = float(row[indices[2]])
    location = None
    #safety check: make sure latitude and longitude are valid
    if latitude < 41.1 and latitude > 40.5 and longitude < -73.6 and longitude > -74.1:
        location = geohash.encode(latitude,longitude, g)
    else:
        return None

    return tuple(list(clean_date)+[location])

#### Specify Parameters

In [9]:
g = 7 #geohash length
b = 48 # number of time bins per day
# Note: the bin length in minutes (24 * 60 / b) must evenly divide 60, e.g. b = 24, 48 or 96
minutes_per_bin = int((24 / float(b)) * 60)
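A small guard for the choice of b might look like the sketch below. The function name `minutes_per_bin_for` is hypothetical. The second check reflects how `date_extractor` builds "time_cat": the hour comes from the pickup time and the minute from the bin start, which only agree when a bin never spans an hour boundary.

```python
def minutes_per_bin_for(b):
    """Return the bin length in minutes for b bins per day, or raise ValueError."""
    minutes_per_day = 24 * 60
    if minutes_per_day % b != 0:
        raise ValueError("b must evenly divide 1440 minutes, got %d" % b)
    minutes = minutes_per_day // b
    if 60 % minutes != 0:
        raise ValueError("bin length %d does not evenly divide 60" % minutes)
    return minutes

for bins in (24, 48, 96):
    print(bins, minutes_per_bin_for(bins))   # 60, 30 and 15 minutes
```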

#### Clean the data and create features as specified above

In [13]:
gclean_rdd = g_rdd.map(lambda row: (row, g, b, minutes_per_bin))\
                  .map(data_cleaner)\
                  .filter(lambda row: row != None)\
                  .map(lambda row: (row,1))\
                  .reduceByKey(lambda a,b: a + b)\
                  .map(lambda row: (row,'g'))  

In [16]:
# Add the parameters to each record
# Clean the data and add all the necessary features
# Filter out any records that were invalid and returned None
# Add a 1 to be able to count
# Group by each geohash/time combination and count the number of pickups
# Add the color of the cab
yclean_rdd = y_rdd.map(lambda row: (row, g, b, minutes_per_bin))\
                  .map(data_cleaner)\
                  .filter(lambda row: row != None)\
                  .map(lambda row: (row,1))\
                  .reduceByKey(lambda a,b: a + b)\
                  .map(lambda row: (row,'y'))
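For readers less familiar with Spark, the map / filter / reduceByKey chain above is roughly equivalent to the following single-machine sketch. The names `count_pickups` and `toy_cleaner` and the toy rows are made up for this illustration; `toy_cleaner` merely stands in for `data_cleaner`.

```python
from collections import Counter

def count_pickups(rows, cleaner):
    """Clean each row, drop invalid ones, and count rows per feature tuple."""
    counts = Counter()
    for row in rows:
        key = cleaner(row)          # corresponds to .map(data_cleaner)
        if key is not None:         # corresponds to .filter(...)
            counts[key] += 1        # corresponds to (row, 1) + reduceByKey
    return counts

def toy_cleaner(row):
    # Hypothetical stand-in: row is (geohash, time_cat, is_valid)
    gh, time_cat, is_valid = row
    return (gh, time_cat) if is_valid else None

rows = [("dr5ru7k", "14:00", True),
        ("dr5ru7k", "14:00", True),
        ("dr5ru6z", "14:30", True),
        ("invalid", "00:00", False)]
counts = count_pickups(rows, toy_cleaner)
print(counts[("dr5ru7k", "14:00")])   # 2
```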

#### Combine rows from both rdds

In this step we combine all the aggregated data from the yellow and the green taxis. We also add a random number, which will be used to sample the data before saving it to S3. This is necessary because the aggregated data, especially that of the training set (72 million records), is still too large to process on a laptop.

In [23]:
import numpy as np
# Create a combined dataset of yellow and green taxis.
# Only the yellow RDD is used here; uncomment the union to include the green taxis.
#combined_rdd = yclean_rdd.union(gclean_rdd)
combined_rdd = yclean_rdd

# Get rid of the taxi color
# Combine the counts
# Add a random number
final_rdd = combined_rdd.map(lambda row: row[0])\
                        .reduceByKey(lambda a,b: a + b)\
                        .map(lambda (a,b): (a,b,np.random.random()))

print final_rdd.count()

869980
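The random number attached to each record makes down-sampling a simple threshold filter: keeping records whose number is at most f retains roughly a fraction f of them. A plain-Python sketch of the idea follows; the helper names, the fixed seed, and the toy records are assumptions for this example.

```python
import random

def attach_random(records, seed=0):
    """Append a uniform random number in [0, 1) to each (key, count) record."""
    rng = random.Random(seed)       # fixed seed only to make the sketch repeatable
    return [(key, count, rng.random()) for key, count in records]

def sample(tagged_records, fraction):
    """Keep the records whose random number is at most `fraction`."""
    return [rec for rec in tagged_records if rec[2] <= fraction]

records = [(("cell%d" % i,), 1) for i in range(10000)]   # toy data
tagged = attach_random(records)
kept = sample(tagged, 0.1)
print(len(kept))   # roughly 1000 of the 10000 records
```

Because the number is stored with the record, a smaller fraction always selects a subset of what a larger fraction selects.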


### Step 4: Save RDD to file on s3

In [25]:
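# Fraction of the training set to keep. For the full training set (roughly 72 million records),
# 10% gives about 7 million records. Note that the value set below keeps 80%.
train_fraction = 0.8

# Fraction of the validation set to keep. For the full validation set (roughly 6 million records),
# 25% gives about 1.5 million records. Note that the value set below keeps 80%.
valid_fraction = 0.8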

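Below we create the training, validation and test sets. We do this by filtering on the dates and by sampling down to a manageable number of records. We then repartition the data into a single partition, so that each set is written as a single file, and save it. The intended date ranges are listed below. Note that the filters in the following cells select fixed 2013 year and month values and write to a local path, so adjust the year and month conditions and the output path to match these ranges and your S3 bucket.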
* **Training set**: January 2013 to February 2015
* **Validation set**: March 2015 to April 2015
* **Test set**: May 2015 to June 2015
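The date ranges listed above can be expressed as a small lookup on (year, month), the first two fields of each feature tuple. This is a sketch of that mapping under the ranges as stated, not the filter used in the cells that follow; the function name `assign_split` is hypothetical.

```python
def assign_split(year, month):
    """Return 'train', 'valid', 'test', or None for a (year, month) pair."""
    ym = (year, month)                      # tuples compare year first, then month
    if (2013, 1) <= ym <= (2015, 2):
        return "train"
    if (2015, 3) <= ym <= (2015, 4):
        return "valid"
    if (2015, 5) <= ym <= (2015, 6):
        return "test"
    return None                             # outside all three ranges

print(assign_split(2014, 7))   # train
print(assign_split(2015, 4))   # valid
print(assign_split(2015, 6))   # test
```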

In [26]:
trainset = final_rdd.filter(lambda (a,b,c): ((a[0] == 2013) | (a[1] == 12)) & (c <= train_fraction))
trainset.repartition(1).saveAsTextFile("/home/parallels/Documents/TaxiPrediction-master/data/nyc" +  "/trainset7")

In [27]:
validset = final_rdd.filter(lambda (a,b,c): (a[0] == 2013) & ((a[1] == 8) | (a[1] == 9)) & (c <= valid_fraction))
validset.repartition(1).saveAsTextFile("/home/parallels/Documents/TaxiPrediction-master/data/nyc" + "/validset7")

In [28]:
testset  = final_rdd.filter(lambda (a,b,c): (a[0] == 2013) & ((a[1] == 10) | (a[1] == 11))) # 
testset.repartition(1).saveAsTextFile("/home/parallels/Documents/TaxiPrediction-master/data/nyc" + "/testset7")

Now you can download the saved RDD files to your computer. With the ConvertRDDtoCSV notebook, the files can be converted to CSV so that they can be loaded by the machine learning algorithms.
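The ConvertRDDtoCSV notebook is not reproduced here. As a rough idea of what such a conversion involves, the sketch below assumes each saved line is the text form of a `((features...), count, random_number)` tuple, which is what `saveAsTextFile` would write for the records built above. The sample line and the function name `rdd_lines_to_csv` are made up for this illustration.

```python
import ast
import csv
import io

def rdd_lines_to_csv(lines, out):
    """Write one CSV row per saved RDD line: the feature columns, then the pickup count."""
    writer = csv.writer(out)
    for line in lines:
        # Assumed layout: ((year, month, ..., geohash), count, random_number)
        features, count, _rand = ast.literal_eval(line.strip())
        writer.writerow(list(features) + [count])

# Made-up example line in the assumed layout
sample_lines = [
    "((2013, 8, 5, '14:00', 0.6042, -0.7932, -0.6089, 'Monday', "
    "0.0863, 0.8565, 0.5162, 0, 'dr5ru7k'), 3, 0.4217)\n"
]
buffer = io.StringIO()
rdd_lines_to_csv(sample_lines, buffer)
print(buffer.getvalue())
```

The random number is dropped because it only served to sample the data before saving.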