# IPython Notebook Sample
### Jukka Ruponen, IBM, 2015-11-18

#### Hypothesis: "In order to get best tips, taxi drivers should favor carrying 3-4 people at once"

In order to confirm the hypothesis, this notebook will do the following:
1. Read NYC taxi data from an open REST API
1. Normalize and test the data
1. Perform analysis
1. (Optional) Export the analysis result (as a CSV file) and upload it to an external HDFS (if you have one) for further processing
1. Visualize the result to confirm (or reject) our hypothesis

### Add the required modules

In [None]:
import pandas
import requests
import json

### Read NYC taxi data from the REST API

In [None]:
raw_taxidata = requests.get('https://data.cityofnewyork.us/resource/2yzn-sicd.json')
json_taxidata = raw_taxidata.json()
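
The request above parses whatever the endpoint sends back, including an error body. A slightly more defensive variant might look like the sketch below. The helper name `fetch_rows` and the injected `http_get` argument are illustrative only (in this notebook you would pass `requests.get`), and the `limit`, `offset` and `timeout` values are assumptions. `$limit` and `$offset` are the paging parameters of the Socrata API that serves this dataset.

```python
def fetch_rows(http_get, url, limit=1000, offset=0, timeout=30):
    """Fetch one page of rows as a list of dicts, failing loudly on HTTP errors.

    http_get is any callable with the same interface as requests.get.
    """
    response = http_get(url,
                        params={"$limit": limit, "$offset": offset},
                        timeout=timeout)
    response.raise_for_status()  # raise instead of parsing an error page
    rows = response.json()
    if not isinstance(rows, list):
        raise ValueError("expected a JSON list of trips, got %s" % type(rows).__name__)
    return rows

# In the notebook (needs network access):
# import requests
# json_taxidata = fetch_rows(requests.get,
#                            'https://data.cityofnewyork.us/resource/2yzn-sicd.json')
```

Passing the HTTP function in as an argument keeps the helper easy to try out with a stub when no network is available.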

In [None]:
# Print out the data - DO NOT RUN THIS STEP unless you really want to see lots of data printed out in here!!!
json_taxidata

In [None]:
# What is the length of the data (number of records)?
print(len(json_taxidata))

### Let's normalize and test the data, and try to find an answer

In [None]:
# Let's import json_normalize, which converts JSON data to tabular data
# (on pandas releases older than 1.0, import it from pandas.io.json instead)
from pandas import json_normalize

In [None]:
# Make a normalized data frame and print out the first five rows
taxidata = json_normalize(json_taxidata)
taxidata.head()
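
To illustrate what normalization does, here is a small sketch on two made-up records. The nested `pickup` object is hypothetical and is only there to show how nesting is flattened; the real feed may already be flat. The sketch assumes pandas 1.0 or later, where `json_normalize` is available at the top level.

```python
import pandas

# Two made-up trip records (not taken from the real dataset).
sample = [
    {"passenger_count": "1", "tip_amount": "2.00", "pickup": {"latitude": "40.75"}},
    {"passenger_count": "3", "tip_amount": "0.00", "pickup": {"latitude": "40.71"}},
]

flat = pandas.json_normalize(sample)

# Nested keys become dotted column names such as 'pickup.latitude'.
print(sorted(flat.columns))

# The values are still text at this point, not numbers.
print(repr(flat.loc[0, "tip_amount"]))
```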

In [None]:
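# Since the numeric values are actually text strings, we'll first need to convert them to numbers.
# Columns that contain no numeric values at all (IDs, flags, timestamps) are left unchanged.
to_num = lambda col: pandas.to_numeric(col, errors='coerce')
taxidata2 = taxidata.apply(lambda col: to_num(col) if to_num(col).notnull().any() else col)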
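
To see what a string-to-number conversion does with malformed values, here is a sketch on made-up data using `pandas.to_numeric` with `errors='coerce'`. The column values, including the `"n/a"` entry, are invented for illustration.

```python
import pandas

# Made-up values; "n/a" stands in for a malformed entry.
raw = pandas.DataFrame({"fare_amount": ["12.5", "7", "n/a"],
                        "tip_amount": ["2.5", "0", "1"]})

# errors='coerce' turns anything that does not parse into NaN.
numeric = raw.apply(pandas.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric["fare_amount"].isnull().sum())  # how many values failed to parse
```

Counting the NaN values after conversion is a quick way to check how much of a column was unusable before trusting a `max()` or `mean()` on it.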

In [None]:
# Test: What is the biggest amount of fare paid?
taxidata2['fare_amount'].max()

In [None]:
# Test: What is the biggest amount of tip paid?
taxidata2['tip_amount'].max()

In [None]:
# Test: How many trips were made with each number of passengers?
taxidata2['passenger_count'].value_counts()

In [None]:
# Just to play around, set the index to vendor_id
passengers = taxidata2.set_index(taxidata["vendor_id"])
# Drop unneeded columns to clean up the data
passengers.drop(['extra','mta_tax','vendor_id','dropoff_latitude','dropoff_longitude','pickup_latitude','pickup_longitude','rate_code','store_and_fwd_flag'], axis=1, inplace=True)
passengers.head()

In [None]:
# Test: What is the total amount of tips given, by the number of passengers in the taxi?
passengers.groupby('passenger_count')['tip_amount'].sum()

In [None]:
# Got it! So group the data by passenger_count and extract average stats for each 'number of passengers' group
averages = passengers.groupby(['passenger_count']).agg({'fare_amount': 'mean',
                                             'tip_amount': 'mean',
                                             'trip_distance': 'mean'})
averages
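
Averages alone do not show how many trips stand behind each group, and a group with only a handful of trips can produce a misleading mean. The sketch below, on made-up trips, puts the group size next to the averages. The frame `trips` and the column names `trips`, `mean_tip` and `mean_fare` are illustrative.

```python
import pandas

# Made-up trips, only to illustrate the grouping logic.
trips = pandas.DataFrame({
    "passenger_count": [1, 1, 1, 3, 3, 6],
    "fare_amount":     [10.0, 20.0, 30.0, 12.0, 18.0, 50.0],
    "tip_amount":      [1.0, 2.0, 3.0, 3.0, 5.0, 0.0],
})

grouped = trips.groupby("passenger_count")
summary = pandas.DataFrame({
    "trips": grouped.size(),                    # rows backing each average
    "mean_tip": grouped["tip_amount"].mean(),
    "mean_fare": grouped["fare_amount"].mean(),
})
print(summary)
```

In this toy data the 6-passenger group rests on a single trip, so its average says very little about the hypothesis.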

### (Optional) Save the previous output as a CSV file on the local GPFS file system

In [None]:
# Save the averages as a CSV file on the local GPFS file system (just to make it clear: it's NOT saved in the "Object Store")
averages.to_csv('NYC_taxi_passenger_tips.csv')

# Just to confirm it was saved, print out the first 4 lines of the saved file
!head -n4 'NYC_taxi_passenger_tips.csv'
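
The `!head` line is a shell escape, so it only works inside IPython on a system that has `head`. A portable way to confirm the file is to read it back with pandas, as in this sketch. The frame `averages_demo` holds made-up numbers and stands in for the real result, and the temporary directory is used only to avoid overwriting the real file.

```python
import os
import tempfile

import pandas

# Stand-in for the real averages (made-up numbers).
averages_demo = pandas.DataFrame(
    {"tip_amount": [1.5, 2.0]},
    index=pandas.Index([1, 2], name="passenger_count"),
)

path = os.path.join(tempfile.mkdtemp(), "NYC_taxi_passenger_tips.csv")
averages_demo.to_csv(path)

# index_col=0 restores passenger_count as the index.
restored = pandas.read_csv(path, index_col=0)
print(restored)
```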

### (Optional) Upload the stored file to an external Hadoop cluster

In [None]:
# Using curl, you could upload the stored file into an external HDFS (if you have one)
HDFS_WebhdfsUrl = "https://bi-hadoop-prod-2470.services.dal.bluemix.net:8443/gateway/default/webhdfs/v1" # Replace with your WebhdfsUrl
HDFS_userid = "" # WebHDFS userid
HDFS_password = "" # WebHDFS password
Local_filename = "NYC_taxi_passenger_tips.csv" # This is the local filename you used with the to_csv('filename') command above
HDFS_filepath = "/user/biblumix/test/test.csv" # This is the filename to be stored, with full path on HDFS (the directory is created if it does not exist)
HDFS_operator = "op=CREATE" # 'op=CREATE' will create a new file. For other operations, see: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
HDFS_maxTime = 45 # Number of seconds after which the transfer times out, whether or not it has finished. Make sure it's long enough for the full transfer.
!curl -i -L -k -s --user "$HDFS_userid":"$HDFS_password" --max-time $HDFS_maxTime -X PUT -T "$Local_filename" "$HDFS_WebhdfsUrl$HDFS_filepath?$HDFS_operator"
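
The same upload could be done from Python instead of curl. The sketch below follows the two-step `op=CREATE` flow described in the WebHDFS documentation: a first PUT without data that answers with a 307 redirect, then a second PUT that sends the file to the redirect target. The function name `webhdfs_create` is hypothetical, `session` is assumed to be a `requests.Session`-style object, and whether your gateway answers exactly this way is an assumption you should verify.

```python
def webhdfs_create(session, base_url, hdfs_path, local_filename, auth, timeout=45):
    """Upload a local file via WebHDFS op=CREATE and return the final HTTP status.

    session is any object with a requests.Session-style put() method.
    """
    # Step 1: ask where to send the data; no file content is sent yet.
    first = session.put(base_url + hdfs_path, params={"op": "CREATE"},
                        auth=auth, allow_redirects=False, timeout=timeout)
    if first.status_code != 307:
        return first.status_code  # e.g. 401 for bad credentials

    # Step 2: send the file content to the redirect target.
    with open(local_filename, "rb") as handle:
        second = session.put(first.headers["Location"], data=handle,
                             auth=auth, timeout=timeout)
    return second.status_code     # 201 means the file was created

# In the notebook (needs network access and valid credentials):
# import requests
# status = webhdfs_create(requests.Session(), HDFS_WebhdfsUrl, HDFS_filepath,
#                         Local_filename, (HDFS_userid, HDFS_password),
#                         timeout=HDFS_maxTime)
```

The curl command uses `-k` to skip TLS certificate checks; with `requests` the equivalent is setting `session.verify = False`, which should only be done for test systems.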

### Let's visualize the answer

In [None]:
%matplotlib inline

In [None]:
# Plot the per-group averages as a bar chart
averages.plot(kind='bar', figsize=(8,5), title="Average earnings by # of passengers")

Extra challenge: What other valuable information could you derive from the data?