# 3. Model Building

I figured that the single most important factor influencing the tip is the total charged before tipping. For that reason I added two new columns to the data: total_notip, the total cost before tipping, and tip_ratio, the tip as a percentage of total_notip. I also kept only card payments with a nonzero tip, and filtered out outliers in tip_ratio and tip_amount using fences set three interquartile ranges beyond the quartiles.

In [None]:
import pandas as pd
import numpy as np
import sys
import matplotlib.pyplot as plt

input = pd.read_csv('tlc_yellow_trips_2018_11_22_CLEAN.csv')

np.set_printoptions(threshold=sys.maxsize)
pd.set_option('display.max_columns', None)

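# Keep only card payments (payment_type 1) with a nonzero tip.
# .copy() avoids pandas' SettingWithCopyWarning when columns are added below.
input = input[(input['tip_amount'] > 0) & (input['payment_type'] == 1)].copy()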

input['total_notip'] = input['total_amount'] - input['tip_amount']
input['tip_ratio'] = input['tip_amount']*100/input['total_notip']

def remove_outliers(df, key):
    # Keep rows whose `key` value lies strictly within 3 IQRs of the quartiles
    q1 = df[key].quantile(q=0.25)
    q3 = df[key].quantile(q=0.75)
    iqr = q3 - q1
    return df[(df[key] > q1 - 3. * iqr) & (df[key] < q3 + 3. * iqr)]

input = remove_outliers(input,'tip_ratio')
input = remove_outliers(input,'tip_amount')
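
To make the outlier rule concrete, here is a minimal sketch of the same three-IQR fence applied to a handful of made-up tip ratios. The helper name `iqr_bounds` and the toy values are assumptions for illustration only and are not part of the analysis above.

```python
import pandas as pd

def iqr_bounds(series, k=3.0):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up tip ratios (%), including one implausible value
toy = pd.Series([15.0, 18.0, 20.0, 20.0, 22.0, 25.0, 30.0, 400.0])

lower, upper = iqr_bounds(toy)
kept = toy[(toy > lower) & (toy < upper)]

print(lower, upper)   # the fences computed from the quartiles
print(kept.tolist())  # the 400% ratio falls outside the upper fence
```

With a multiplier of 3 rather than the more common 1.5, only extreme values are dropped, so unusually generous but plausible tips stay in the data.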

After this, I created two scatterplots against total_notip, the first of tip_amount and the second of tip_ratio. Each includes the line of best fit.

In [None]:
total = np.array(input['total_notip'])
tip = np.array(input['tip_amount'])
plt.plot(total,tip, 'o', markersize=1)
plt.ylabel('Tip Amount ($)')
plt.xlabel('Total Cost Excluding Tips ($)')
m, b = np.polyfit(total, tip, 1)
plt.plot(total, m*total + b)
plt.show()

In [None]:
ratio = np.array(input['tip_ratio'])
plt.plot(total,ratio, 'o', markersize=1)
plt.ylabel('Tip to Cost Ratio (%)')
plt.xlabel('Total Cost Excluding Tips ($)')
m, b = np.polyfit(total, ratio, 1)
plt.plot(total, m*total + b)
plt.show()

Looking at the two scatterplots, we can see two contrasting trends.

One is that people tip a fixed percentage of the total cost (with all extra fees included). These percentages are usually multiples of 5, and the line of best fit (orange) and the density of points suggest that both the mean and the most common percentage are about 20%.

The other trend, visible as horizontal lines on the first plot, is that some passengers tip a whole-dollar amount. The amount varies somewhat with the total, but 3 USD is the most common tip.

Comparing the lines of best fit in both plots, we can see that many more passengers tip a percentage than a whole-dollar amount. Although not included here, I also tested tips against other factors: passenger count and location showed no significant trend, and fare amount behaved similarly to total_notip, but the lines at 20% were not as well defined.
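
One way to check the percentage-versus-whole-dollar comparison numerically, rather than by eye, is to count how many tips fall into each group. The sketch below is not the method used above. The function `tip_style_shares`, its tolerances, and the toy rows are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def tip_style_shares(df, tol_pct=0.1, tol_usd=0.005):
    """Share of trips whose tip looks like a multiple-of-5 percentage,
    a whole-dollar amount, both, or neither (tolerances are assumptions)."""
    ratio = df['tip_ratio'].to_numpy()
    tip = df['tip_amount'].to_numpy()
    # Within tol_pct percentage points of the nearest multiple of 5
    near_pct = np.abs(ratio - 5 * np.round(ratio / 5)) <= tol_pct
    # Within tol_usd dollars of the nearest whole dollar
    whole_usd = np.abs(tip - np.round(tip)) <= tol_usd
    return {
        'percentage': float(np.mean(near_pct & ~whole_usd)),
        'whole_dollar': float(np.mean(whole_usd & ~near_pct)),
        'both': float(np.mean(near_pct & whole_usd)),
        'neither': float(np.mean(~near_pct & ~whole_usd)),
    }

# Made-up trips: columns follow the names used in this notebook
toy = pd.DataFrame({
    'total_notip': [10.0, 12.3, 15.8, 8.8, 21.3],
    'tip_amount':  [2.0, 2.46, 3.0, 2.0, 3.7],
})
toy['tip_ratio'] = toy['tip_amount'] * 100 / toy['total_notip']

shares = tip_style_shares(toy)
print(shares)
```

A tip such as 2 USD on a 10 USD total fits both patterns, which is why the sketch reports a separate "both" share instead of forcing each trip into one group.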

Given this analysis, I conclude that the most reliable (and rather simple) model for predicting a tip is 20% of the total amount charged before the tip. The mathematical expression is below.

`predicted_tip = (fare_amount + extra + mta_tax + tolls_amount + imp_surcharge)/5`
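
As a rough sketch, the expression can be applied to a DataFrame and compared with the recorded tips. The helper `predict_tip`, the `COMPONENTS` list, the toy rows, and the use of mean absolute error are illustrative assumptions. No such evaluation was run in this analysis.

```python
import pandas as pd

# Charge components that make up the pre-tip total in the expression above
COMPONENTS = ['fare_amount', 'extra', 'mta_tax', 'tolls_amount', 'imp_surcharge']

def predict_tip(df):
    """Predict the tip as 20% (one fifth) of the pre-tip total."""
    return df[COMPONENTS].sum(axis=1) / 5

# Made-up trips for illustration only
toy = pd.DataFrame({
    'fare_amount':   [10.0, 25.0],
    'extra':         [0.5, 0.0],
    'mta_tax':       [0.5, 0.5],
    'tolls_amount':  [0.0, 5.76],
    'imp_surcharge': [0.3, 0.3],
    'tip_amount':    [2.26, 6.0],
})

toy['predicted_tip'] = predict_tip(toy)
# Mean absolute error between predicted and recorded tips, in dollars
mae = (toy['predicted_tip'] - toy['tip_amount']).abs().mean()
print(toy[['tip_amount', 'predicted_tip']])
print(round(mae, 3))
```

On the real data, the same two lines could be run against the filtered DataFrame to see how far the 20% rule is from the recorded tips on average.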

It is worth mentioning, however, that this model was built using only card-paid trips with a nonzero tip, so we shouldn't expect to see the same trend in cash tips.