# NYC Taxi Dataset Project - Setup

## Overall Steps

**Step 0:** Prerequisites

**Step 1:** Set up Spark cluster in AWS (do this from the shell/Git Bash)

**Step 2:** Sanity check to ensure Spark & S3 are set up properly

**Step 3:** Upload data files into S3 (you can skip this and use my S3 location)

**Step 4:** Check cluster performance on dataset

At the end I have also added other **useful notes**.

### Step 1: Set up Spark cluster in AWS & perform sanity check



**Note: Run all the items in Step 1 from unix shell/Git bash (not from the Jupyter notebook)**

#### Step 1a) Create the cluster in AWS: run the following command in unix shell/Git bash (you may change the instance type/count)

**Enhancements incorporated into the script**: 
1. Set up Spark to use the maximum available resources (the myConfig.json file has the instructions)
2. Download the admin application Ganglia

#### Step 1c) Wait for the cluster to be ready: AWS web console has to show "WAITING"

#### Step 1d) Get the cluster master's IP:

#### Step 1e) Run the script to configure Spark 

#### Step 1f) Create an SSH tunnel to the AWS box and connect to the cluster. This command assumes your SSH key is in the same directory you are invoking the SSH command from. At the end of this you will be in a terminal session on the cluster's master node.

#### Step 1g) Open your browser and go to http://localhost:8989

Note: The notebook you open will **already** have the spark context set up for you.

### Step 2: Sanity check to ensure Spark & S3 are set up properly

#### Step 2a) Upload this Jupyter Notebook using the console from http://localhost:8989

Note: all the steps in Step 2 are to be executed from the Jupyter Notebook itself

#### Step 2b) Set up the SparkContext (automatically set up by YARN)

In [1]:
sc

<pyspark.context.SparkContext at 0x7fe2fa205310>

In [2]:
#Just an informational message
sc.master

u'local[*]'

#### Step 2c) Spark sanity check

In [3]:
import sys
rdd = sc.parallelize(xrange(10),10)
aa = rdd.map(lambda x: sys.version)
aa.cache()
aa.count()

10

#### Step 2d) S3 sanity check

In [4]:
#Get a Wiktionary page and store it in a local folder
!pwd
!mkdir test_s3
!wget http://en.wiktionary.org/wiki/awesome -P test_s3/ --trust-server-names

/home/parallels/Documents/TaxiPrediction-master
--2018-03-29 15:39:23--  http://en.wiktionary.org/wiki/awesome
Resolving en.wiktionary.org (en.wiktionary.org)... 198.35.26.96, 2620:0:863:ed1a::1
Connecting to en.wiktionary.org (en.wiktionary.org)|198.35.26.96|:80... connected.
HTTP request sent, awaiting response... 301 TLS Redirect
Location: https://en.wiktionary.org/wiki/awesome [following]
--2018-03-29 15:39:24--  https://en.wiktionary.org/wiki/awesome
Connecting to en.wiktionary.org (en.wiktionary.org)|198.35.26.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘test_s3/awesome’

awesome                 [  <=>               ]  76.21K   196KB/s    in 0.4s    

2018-03-29 15:39:25 (196 KB/s) - ‘test_s3/awesome’ saved [78038]



##### Create an S3 bucket: replace "jeonghyonkim-seoul-files" below with your own unique bucket name

In [1]:
# Create the bucket: remember to replace "jeonghyonkim-seoul-files" with a unique bucket name
!aws s3 mb s3://jeonghyonkim-seoul-files/NYTaxi/
    
# Add the downloaded file to the test bucket in S3: remember to use the same bucket name here
!aws s3 cp test_s3/awesome s3://jeonghyonkim-seoul-files/NYTaxi/

make_bucket failed: s3://jeonghyonkim-seoul-files/NYTaxi/ An error occurred (BucketAlreadyOwnedByYou) when calling the CreateBucket operation: Your previous request to create the named bucket succeeded and you already own it.
upload: test_s3/awesome to s3://jeonghyonkim-seoul-files/NYTaxi/awesome


##### Test if you are able to lookup the S3 file from Spark

In [5]:
!pwd
!ls /home/parallels/Documents/TaxiPrediction-master/test_s3

/home/parallels/Documents/TaxiPrediction-master
awesome


In [6]:
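# Note: this path points at the local copy of the file; to exercise S3 itself,
# pass the s3:// URI of the uploaded object to sc.textFile instead
testS3RDD = sc.textFile("/home/parallels/Documents/TaxiPrediction-master/test_s3/awesome")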
testS3RDD.count()

439

##### Congrats! You now have a working Spark cluster with the ability to connect to S3!

In [9]:
!pwd

/home/parallels/Documents/TaxiPrediction-master


#### Download the data files into local folder

**Note**: We are downloading the data from **2013 onwards only** - though data is available from 2009

**Data source**: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

In [17]:
#Setup the variables

baseUrl = "/home/parallels/Documents/TaxiPrediction-master/data/nyc/"
#Yellow/green cab filename prefix
yCabFNPrefix = "/yellow_tripdata_"
gCabFNPrefix = "/green_tripdata_"

#Availability of data set by month & year
yDict = {}
gDict = {}

#Availability for Yellow cab
#Note: range(1,1) is empty, so no months are selected for that year
yDict[2015] = range(1,1) #data is available till Jun 2015
yDict[2014] = range(1,1)
yDict[2013] = range(1,1)

#Availability for Green cab
gDict[2015] = range(1,1) #data is available till Jun 2015
gDict[2014] = range(1,1)
gDict[2013] = range(8,13) #available only from August 2013

In [18]:
#  Yellow cab data file name list
# file name is of format:  yellow_tripdata_2015-01.csv
yCabUrls = []
yCabFilenames = []
for year, monthList in yDict.iteritems():
    yearStr = str(year)
    for month in monthList:
        monthStr = str(month)
        if len(monthStr) == 1:
            monthStr = "0"+monthStr    
        url = baseUrl+yearStr+yCabFNPrefix+yearStr+'-'+monthStr+".csv"
        yCabUrls.append(url)
        yCabFilenames.append(yCabFNPrefix+yearStr+'-'+monthStr+".csv")

#  green cab data file name list
gCabUrls = []
gCabFilenames = []
for year, monthList in gDict.iteritems():
    yearStr = str(year)
    for month in monthList:
        monthStr = str(month)
        if len(monthStr) == 1:
            monthStr = "0"+monthStr    
        url = baseUrl+yearStr+gCabFNPrefix+yearStr+'-'+monthStr+".csv"
        gCabFilenames.append(gCabFNPrefix+yearStr+'-'+monthStr+".csv")
        gCabUrls.append(url)
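
The two loops above differ only in the filename prefix, so the name-building step can be sketched as one helper. This is a minimal sketch, not the notebook's own code: `build_filenames` and `green_availability` are hypothetical names, and years are sorted here to make the output order deterministic.

```python
def build_filenames(prefix, availability):
    """Return names like '/green_tripdata_2013-08.csv' for every (year, month)."""
    names = []
    for year in sorted(availability):
        for month in availability[year]:
            # '%02d' zero-pads single-digit months, replacing the len() check
            names.append("%s%d-%02d.csv" % (prefix, year, month))
    return names


# Same availability as gDict above: only Aug-Dec 2013 is selected,
# because range(1, 1) is empty
green_availability = {2015: range(1, 1), 2014: range(1, 1), 2013: range(8, 13)}
green_names = build_filenames("/green_tripdata_", green_availability)
print(green_names)
```

Calling the same helper with `"/yellow_tripdata_"` and `yDict` would cover the yellow cab files without a second loop.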

In [20]:
#Disk space used by the downloaded cab files
!du -mh data/nyc

180M	data/nyc


In [22]:
print gCabFilenames


['/green_tripdata_2013-08.csv', '/green_tripdata_2013-09.csv', '/green_tripdata_2013-10.csv', '/green_tripdata_2013-11.csv', '/green_tripdata_2013-12.csv']
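
For Step 3 (uploading the data files into S3), one possible approach is to build an `aws s3 cp` command per file from a list like the one printed above. The sketch below only constructs the commands and does not run them. `upload_commands` is a hypothetical helper, and `my-unique-bucket` is a placeholder rather than a bucket from this project.

```python
def upload_commands(filenames, local_dir, s3_prefix):
    """Build one `aws s3 cp` argument list per file; nothing is executed here."""
    commands = []
    for name in filenames:
        # Filenames in this notebook start with '/', so strip it before joining
        bare = name.lstrip("/")
        local_path = local_dir.rstrip("/") + "/" + bare
        remote_path = s3_prefix.rstrip("/") + "/" + bare
        commands.append(["aws", "s3", "cp", local_path, remote_path])
    return commands


# "my-unique-bucket" is a placeholder, not a real bucket from this project
cmds = upload_commands(
    ["/green_tripdata_2013-08.csv", "/green_tripdata_2013-09.csv"],
    "data/nyc",
    "s3://my-unique-bucket/nyc",
)
for cmd in cmds:
    print(" ".join(cmd))
```

Each argument list could then be passed to `subprocess.check_call` once the bucket name has been replaced with a real one.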


In [23]:
def preprocess_data(cabFilenames, isYellow):
    """
    Function that takes a list of filenames (strings) and a boolean as parameters.
    Removes the header from each file and verifies the schema of the data.
    """
    # Dictionary where key = filename, value = (schema, bool==True if there is a blank line after header)
    file_schemas = {}
    # Yellow and green files live in the same folder here, so isYellow does not change the prefix
    prefix = 'data/nyc'
        
    for filename in cabFilenames:
        # Fetch schema
        with open(prefix+filename,'r') as in_fp:
            #read first two lines
            lines = [in_fp.readline() for i in xrange(2)]

        # record the schema and whether a blank line follows the header
        file_schemas[filename] = (tuple(lines[0].split(',')), lines[1]=='\r\n')
    
    # verify all files have the necessary columns in the same position
    for (schema,blank) in file_schemas.values():
        assert 'ickup' in schema[1]
        assert 'atetime' in schema[1]
        assert 'ickup' in schema[5]
        assert 'ongitude' in schema[5]
        assert 'ickup' in schema[6]
        assert 'atitude' in schema[6]
    print "Schema:", file_schemas[filename][0]
    
    # Remove header and blank line from file
    for filename in cabFilenames:
        print "Writing to %r" % filename 
        with open(prefix+filename,'r') as in_fp:
            #read whole file
            lines = in_fp.readlines()

        with open(prefix+filename,'w') as out_fp:

            # check if there is a blank line after the header
            if file_schemas[filename][1]:
                out_fp.writelines(lines[2:])
            else:
                out_fp.writelines(lines[1:])
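
The function above reads each file fully into memory with `readlines()`, which may be heavy for large monthly files. A streaming variant of the header-removal step could look like the sketch below. It is an alternative under stated assumptions, not the notebook's method: `strip_header` is a hypothetical helper, it works in binary mode so line endings are left untouched, and it treats any whitespace-only second line as the blank line.

```python
import os
import shutil
import tempfile


def strip_header(path):
    """Drop the header line (and one blank line after it, if present) in place."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then swap it in
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "wb") as out_fp, open(path, "rb") as in_fp:
        in_fp.readline()                   # skip the header
        second = in_fp.readline()
        if second.strip():                 # keep it unless it is blank
            out_fp.write(second)
        shutil.copyfileobj(in_fp, out_fp)  # stream the rest in chunks
    shutil.move(tmp_path, path)


# Small demo on a throwaway file (made-up rows, not real trip data)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.csv")
with open(demo_path, "wb") as fp:
    fp.write(b"VendorID,pickup\r\n\r\n1,2013-08-01\r\n2,2013-08-02\r\n")
strip_header(demo_path)
with open(demo_path, "rb") as fp:
    remaining = fp.read()
print(remaining)
```

As with `preprocess_data`, running this twice on the same file would remove a data row, so it should be applied only once per file.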

In [24]:
#Preprocess Green Cab files -- check schema and strip headers
preprocess_data(gCabFilenames, True)

Schema: ('VendorID', 'lpep_pickup_datetime', 'Lpep_dropoff_datetime', 'Store_and_fwd_flag', 'RateCodeID', 'Pickup_longitude', 'Pickup_latitude', 'Dropoff_longitude', 'Dropoff_latitude', 'Passenger_count', 'Trip_distance', 'Fare_amount', 'Extra', 'MTA_tax', 'Tip_amount', 'Tolls_amount', 'Ehail_fee', 'Total_amount', 'Payment_type', 'Trip_type \n')
Writing to '/green_tripdata_2013-08.csv'
Writing to '/green_tripdata_2013-09.csv'
Writing to '/green_tripdata_2013-10.csv'
Writing to '/green_tripdata_2013-11.csv'
Writing to '/green_tripdata_2013-12.csv'
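
The positional checks that `preprocess_data` performs with `assert` can also be sketched as a standalone function that reports what is wrong instead of stopping at the first failure. `REQUIRED_COLUMNS` and `schema_problems` are hypothetical names introduced for illustration. The column positions and substrings mirror the asserts in the function, and the sample header is a shortened copy of the schema printed above.

```python
# (column index, substrings that must all appear in that column's name)
REQUIRED_COLUMNS = [
    (1, ("ickup", "atetime")),
    (5, ("ickup", "ongitude")),
    (6, ("ickup", "atitude")),
]


def schema_problems(header_line):
    """Return a list of readable problems; an empty list means the header passes."""
    columns = [c.strip() for c in header_line.split(",")]
    problems = []
    for index, fragments in REQUIRED_COLUMNS:
        if index >= len(columns):
            problems.append("column %d is missing" % index)
            continue
        for fragment in fragments:
            if fragment not in columns[index]:
                problems.append(
                    "column %d (%r) lacks %r" % (index, columns[index], fragment)
                )
    return problems


# First nine columns of the green cab schema shown in the output above
green_header = (
    "VendorID,lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,"
    "RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude"
)
print(schema_problems(green_header))
print(schema_problems("a,b,c"))
```

Matching on fragments such as `ickup` keeps the check insensitive to the capitalization of the first letter, which is the same trick the asserts rely on.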


### Step 4: Check cluster performance

In [23]:
myRDD = sc.textFile("s3://sdaultonbucket1/nyc/yellow_tripdata_2015-02.csv")
%time myRDD.cache()
%time myRDD.count()

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 2.87 ms
CPU times: user 0 ns, sys: 12 ms, total: 12 ms
Wall time: 14.6 s


12450521

In [25]:
myRDD.is_cached

True
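
Note that `cache()` is lazy: it only marks the RDD for caching, which is why it returns in milliseconds above, and the first `count()` is the call that actually reads the file. Timing a second `count()` would show the cached speed. The sketch below shows one way to time repeated calls outside of the `%time` magic. `timed` is a hypothetical helper, and the workload is a stand-in so the sketch runs without a cluster.

```python
import time


def timed(fn):
    """Call fn() and return (result, elapsed_seconds)."""
    start = time.time()
    result = fn()
    return result, time.time() - start


# Stand-in workload so this runs anywhere; on the cluster,
# pass myRDD.count instead of the lambda
first, first_secs = timed(lambda: sum(range(1000000)))
second, second_secs = timed(lambda: sum(range(1000000)))
print("first: %.3fs, second: %.3fs" % (first_secs, second_secs))
```

With a cached RDD, the second measurement would be expected to be lower than the first, since the data no longer has to come from S3.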

### Other useful notes

**1. Enable the web admin interface** from the AWS console (follow the steps it lists). Note: when you open the SSH connection as instructed, it might not show anything (status, etc.); this is fine. The SSH command (as per the instructions) is: ssh -i CS109.pem $DNS_NAME -ND 8157

**2. Admin UIs**
    
    a) To get to the Spark Jobs Admin Console: Go to the Hadoop Resource Manager UI (from AWS console) and click on "Application master" link (it will be one of the items in the listed running applications).
    
    b) Spark history server: http://<domain>:18080/
    
    c) For CPU/Memory performance on each node use the Ganglia UI (link from the AWS console)
    