# IPython Notebook Sample
### Jukka Ruponen, IBM, 2016-03-30

#### Hypothesis: "To get the best tips, taxi drivers should favor carrying 3-4 people at once"

To test the hypothesis, this notebook does the following:
1. Read NYC taxi data from an open REST API
1. Normalize and test the data
1. Perform the analysis
1. Provide a few **optional** steps that show how to export the data as a CSV file and upload it elsewhere
1. Visualize the result to confirm (or reject) the hypothesis

### Add the required modules

In [None]:
import pandas
import requests
import json

### Read NYC taxi data from the REST API

In [None]:
raw_taxidata = requests.get('https://data.cityofnewyork.us/resource/2yzn-sicd.json')
json_taxidata = raw_taxidata.json()

In [None]:
# Print out the data - DO NOT RUN THIS STEP unless you really want to see lots of data printed out in here!!!
json_taxidata

In [None]:
# What is the length of the data (number of records)?
print(len(json_taxidata))
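
Printing the whole response is rarely necessary. Here is a minimal sketch for peeking at a single record instead. `sample_records` is made-up stand-in data (not real API output) whose field names are taken from the columns used later in this notebook, so the cell runs without network access; `preview` is a hypothetical helper. In the notebook you would pass `json_taxidata` instead.

```python
import json

# Made-up stand-in for json_taxidata (hypothetical values, not real API output)
sample_records = [
    {"vendor_id": "CMT", "passenger_count": "1", "fare_amount": "6.5", "tip_amount": "1.4"},
    {"vendor_id": "VTS", "passenger_count": "3", "fare_amount": "12", "tip_amount": "2.5"},
]

def preview(records, count=1):
    """Return the first `count` records as pretty-printed JSON text."""
    return json.dumps(records[:count], indent=2, sort_keys=True)

print(preview(sample_records))
print(len(sample_records))
```

Note that every value is a text string, including the amounts; that is why a conversion step is needed before any arithmetic.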

### Let's normalize and test the data, and try to find an answer

In [None]:
# Let's import json_normalize, which converts JSON data to tabular data
# (recent pandas versions expose it at the top level; pandas.io.json.json_normalize is deprecated)
from pandas import json_normalize

In [None]:
# Make a normalized data frame and print out the first five rows
taxidata = json_normalize(json_taxidata)
taxidata.head()

In [None]:
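# The numeric values arrive as text strings, so convert the columns used below to numbers first
# (convert_objects was removed in pandas 1.0; pandas.to_numeric is the current way to do this)
numeric_columns = ['fare_amount', 'tip_amount', 'trip_distance', 'passenger_count']
taxidata2 = taxidata.assign(**{name: pandas.to_numeric(taxidata[name], errors='coerce') for name in numeric_columns})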
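
To see why the conversion matters, here is a small sketch on made-up values (assumed for illustration, not taken from the dataset). As text, `max()` compares the strings character by character, so `'9.5'` ranks above `'12.0'`; after `pandas.to_numeric` the comparison is numeric.

```python
import pandas

# Made-up fare values stored as text, the way the JSON API delivers them
fares_as_text = pandas.Series(['9.5', '12.0', '7.25'])

# Text comparison is lexicographic: '9.5' sorts after '12.0'
print(fares_as_text.max())

# pandas.to_numeric parses the strings; errors='coerce' turns unparseable values into NaN
fares_as_numbers = pandas.to_numeric(fares_as_text, errors='coerce')
print(fares_as_numbers.max())
```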

In [None]:
# Test: What is the largest fare paid?
taxidata2['fare_amount'].max()

In [None]:
# Test: What is the largest tip paid?
taxidata2['tip_amount'].max()

In [None]:
# Test: How many trips were made with each number of passengers?
taxidata2['passenger_count'].value_counts()

In [None]:
# Just to play around, set the index to vendor_id
passengers = taxidata2.set_index(taxidata["vendor_id"])
# Drop the columns we don't need, to clean up the data
passengers = passengers.drop(['extra', 'mta_tax', 'vendor_id', 'dropoff_latitude', 'dropoff_longitude',
                              'pickup_latitude', 'pickup_longitude', 'rate_code', 'store_and_fwd_flag'],
                             axis=1)
passengers.head()

In [None]:
# Test: How much was tipped in total for each number of passengers in the taxi?
passengers.groupby('passenger_count')['tip_amount'].sum()

In [None]:
# Got it! So group the data by passenger_count and extract average stats for each 'number of passengers' group
averages = passengers.groupby(['passenger_count']).agg({'fare_amount': 'mean',
                                             'tip_amount': 'mean',
                                             'trip_distance': 'mean'})
averages
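
An average says little when only a handful of trips stand behind it. As a hedged sketch on made-up data (the values below are hypothetical; only the column names come from this notebook), named aggregation can report the number of trips next to each mean. The output column names `mean_tip` and `trip_count` are chosen here for illustration.

```python
import pandas

# Made-up trips (hypothetical values) with the column names used in this notebook
trips = pandas.DataFrame({
    'passenger_count': [1, 1, 1, 3, 4],
    'fare_amount': [10.0, 8.0, 12.0, 20.0, 15.0],
    'tip_amount': [2.0, 1.0, 3.0, 5.0, 1.5],
})

# Named aggregation: the mean tip plus the number of trips behind each mean
summary = trips.groupby('passenger_count').agg(
    mean_tip=('tip_amount', 'mean'),
    trip_count=('tip_amount', 'size'),
)
print(summary)
```

In this toy data the 3-passenger group has the highest mean tip, but it rests on a single trip, which is exactly the situation the extra column makes visible.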

### (Optional) Storing the previous output as a CSV file on the local GPFS file system

In [None]:
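# Save the averages as a CSV file on the local GPFS filesystem (to be clear: it is NOT saved in the "Object Store")
# Using the variable name is safer than "_", which only holds the output of the most recently run cell
averages.to_csv('NYC_taxi_passenger_tips.csv')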

In [None]:
# Just to confirm it's there, list the files in the current GPFS directory
!ls -l

In [None]:
# Just to confirm its content, print out the first 4 lines of the saved file
!head -n4 'NYC_taxi_passenger_tips.csv'

### (Optional) Uploading the stored CSV file from local GPFS onto an external Hadoop (if you have one) for further processing

Skip this step if you do not have a Hadoop system that can be accessed from this cloud environment.

In [None]:
# Change these variables according to your target HDFS
HDFS_WebhdfsUrl = "https://bi-hadoop-prod-2470.services.dal.bluemix.net:8443/gateway/default/webhdfs/v1" # Replace with your WebhdfsUrl
HDFS_userid = "" # Replace with your WebHDFS userid
HDFS_password = "" # Replace with your WebHDFS password
Local_filename = "NYC_taxi_passenger_tips.csv" # This is the local filename you stored above with the to_csv('filename') command
HDFS_filepath = "/user/biblumix/test/test.csv" # This is the full path and filename to upload to on the target HDFS (the directory will be created if it does not exist)
HDFS_operator = "op=CREATE" # 'op=CREATE' is an operator to create a new file. For other operators, see: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
HDFS_maxTime = 45 # Time in seconds after which the transfer times out, whether it has finished or not. Make sure it's long enough to cover the full transfer time.

In [None]:
# When you run this cell, the variables you've set above will be used to execute the upload
!curl -i -L -k -s --user "$HDFS_userid":"$HDFS_password" --max-time $HDFS_maxTime -X PUT -T "$Local_filename" "$HDFS_WebhdfsUrl$HDFS_filepath?$HDFS_operator"
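
The curl command builds its target by concatenating the WebHDFS base URL, the file path and the operator. A minimal Python 3 sketch of that composition, handy for checking the final URL before running the upload; `build_webhdfs_url` is a hypothetical helper and `example.com` is a placeholder host, neither is part of the original notebook.

```python
from urllib.parse import quote

def build_webhdfs_url(base_url, file_path, operator="op=CREATE"):
    """Join the WebHDFS base URL, the target path and the operator into one URL."""
    # quote() percent-encodes characters such as spaces but leaves "/" untouched
    return "{}{}?{}".format(base_url.rstrip("/"), quote(file_path), operator)

# Placeholder host, for illustration only
url = build_webhdfs_url("https://example.com:8443/gateway/default/webhdfs/v1",
                        "/user/biblumix/test/test.csv")
print(url)
```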

### (Optional) Uploading the stored CSV file from local GPFS onto your Object Storage

This is useful if you want to, for example, download the stored CSV file to your own computer.
The cells below upload the file to your Object Storage in Bluemix, from where you can then download it manually.
Skip this step if you don't need it.

In [None]:
# Fill in the missing values below according to your Object Store credentials.
# IMPORTANT: Make sure the FILENAME corresponds to the one you stored on the GPFS!

# Hint: If you are not sure what to enter here, place your cursor in the empty cell above and then
# click the "Insert to code" option under one of the existing files in your "Data Source" panel on the right.
# This gives you the values to use below, except for the filename, which is different here.

credentials = {
    'auth_url': 'https://identity.open.softlayer.com',
    'region': 'dallas',
    'domain_id': '',
    'username': '',
    'password': '',
    'filename': 'NYC_taxi_passenger_tips.csv',
    'container': 'notebooks'
}

In [None]:
# Don't change anything here. When this function is called, it uses the credentials above to:
# 1) acquire the required authentication token and storage URL from the Object Storage, and then
# 2) use the curl command in a shell to upload the file to the Object Store
def osUploadFile(credentials):
    '''This function will use the given credentials to upload the file'''

    auth_url = ''.join([credentials['auth_url'], '/v3/auth/tokens'])
    request = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'],'domain': {'id': credentials['domain_id']},
            'password': credentials['password']}}}}}
    auth_headers = {'Content-Type': 'application/json'}
    auth_response = requests.post(url=auth_url, data=json.dumps(request), headers=auth_headers)
    resp1_body = auth_response.json()
    for e1 in resp1_body['token']['catalog']:
        if(e1['type']=='object-store'):
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == credentials['region']:
                    upload_url = ''.join([e2['url'], '/', credentials['container'], '/', credentials['filename']])
    auth_token = auth_response.headers['x-subject-token']
    upload_headers = ''.join(["X-Auth-Token: ",auth_token])
    filename = credentials['filename']
    !curl -i -L -k -s -H "$upload_headers" $upload_url -X PUT -T $filename
    return

In [None]:
osUploadFile(credentials)
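
The nested loops in the function above pick the public object-store endpoint for the configured region out of the service catalog. The same lookup can be sketched as a standalone function that returns `None` when nothing matches, instead of leaving the URL undefined. `find_object_store_url` and `sample_catalog` are hypothetical; the catalog is a made-up fragment that only mirrors the keys the original code reads (`type`, `endpoints`, `interface`, `region`, `url`).

```python
def find_object_store_url(catalog, region, interface='public'):
    """Return the URL of the first matching object-store endpoint, or None."""
    for service in catalog:
        if service.get('type') != 'object-store':
            continue
        for endpoint in service.get('endpoints', []):
            if endpoint.get('interface') == interface and endpoint.get('region') == region:
                return endpoint['url']
    return None

# Made-up catalog fragment (placeholder URLs, not a real service response)
sample_catalog = [
    {'type': 'identity', 'endpoints': []},
    {'type': 'object-store', 'endpoints': [
        {'interface': 'internal', 'region': 'dallas', 'url': 'https://internal.example.com/v1/AUTH_x'},
        {'interface': 'public', 'region': 'dallas', 'url': 'https://public.example.com/v1/AUTH_x'},
    ]},
]

print(find_object_store_url(sample_catalog, 'dallas'))
print(find_object_store_url(sample_catalog, 'london'))
```

One difference from the original loops: this sketch stops at the first match, whereas the original keeps the last one. With a single public endpoint per region the result is the same.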

### Let's visualize the answer

In [None]:
%matplotlib inline

In [None]:
# Plot the per-group averages as a bar chart, one group of bars per passenger count
averages.plot(kind='bar', figsize=(8,5), title="Average earnings by # of passengers")

Extra challenge: What other valuable information could you derive from the data?
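
One possible direction, sketched here on made-up data rather than as a finding: look at tips relative to fares instead of in absolute dollars. The values below are hypothetical; only the column names come from this notebook, and `tip_share` is a name chosen for illustration.

```python
import pandas

# Made-up trips (hypothetical values)
trips = pandas.DataFrame({
    'passenger_count': [1, 1, 3, 3],
    'fare_amount': [10.0, 20.0, 10.0, 30.0],
    'tip_amount': [2.0, 4.0, 1.0, 3.0],
})

# Total tips divided by total fares per group: tips expressed as a share of the fare
totals = trips.groupby('passenger_count')[['fare_amount', 'tip_amount']].sum()
tip_share = totals['tip_amount'] / totals['fare_amount']
print(tip_share)
```

On the real data, the same two lines applied to `passengers` would show whether larger groups tip a larger share of the fare or simply take longer, more expensive trips.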