### Entity Embeddings
Structured (tabular) data is usually handled with tree-based techniques. However, entity embeddings have shown that neural networks can also give very promising results. The approach was used by the winners of the taxi destination prediction challenge (de Brébisson et al., 2015) and by the third-place entry in the Rossmann store sales prediction competition (Guo & Berkhahn, 2016).

In short, this technique represents each category by a vector whose values are learned during training, so that the vector captures the characteristics of the category. For example, for the day of the week we would expect the distance between Saturday and Sunday to be smaller than the distance between Saturday and Wednesday. People who work with NLP will find this very familiar, since it is the same idea as word embeddings. For a deeper understanding of entity embeddings, see this [blog](https://towardsdatascience.com/deep-learning-structured-data-8d6a278f3088).
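
To make the idea concrete, here is a minimal sketch of an embedding lookup in plain Python. The three vectors are made-up values chosen for illustration, not learned weights; in a real model they start out random and are adjusted by training. The helper name `embedding_distance` is hypothetical.

```python
import math

# Hypothetical 2-dimensional embeddings for three days of the week.
# The numbers are invented for illustration; a trained model would learn them.
embedding = {
    "Wednesday": [0.9, -0.4],
    "Saturday": [-0.7, 0.8],
    "Sunday": [-0.6, 0.9],
}

def embedding_distance(day_a, day_b):
    """Euclidean distance between the vectors of two categories."""
    return math.dist(embedding[day_a], embedding[day_b])

sat_sun = embedding_distance("Saturday", "Sunday")
sat_wed = embedding_distance("Saturday", "Wednesday")
print(f"Saturday-Sunday: {sat_sun:.3f}, Saturday-Wednesday: {sat_wed:.3f}")
```

With these illustrative values the two weekend days sit close together while Wednesday is far away, which is the kind of structure training is expected to discover on its own.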

This kernel is based mostly on the fast.ai Rossmann notebook and uses the [fastai](http://www.fast.ai/) library.

The features are the pickup and dropoff locations, the trip distance, and the components of the parsed pickup date.

Update: Based on [NYC Taxi Fare - Data Exploration](https://www.kaggle.com/breemen/nyc-taxi-fare-data-exploration) and [XGBoost'ing Taxi Fares](https://www.kaggle.com/gunbl4d3/xgboost-ing-taxi-fares), the spherical distance and the distance to each airport have been added to the input features.

Number of rows: 5M. While experimenting, a smaller number (500_000) is recommended to shorten the training time.

In [None]:
%matplotlib inline

In [None]:
from fastai.structured import *
from fastai.column_data import *
# np.set_printoptions(threshold=50, edgeitems=20)
PATH = '../input'

In [None]:
os.listdir(PATH)

In [None]:
# Fix the random seeds so that results are reproducible across runs
manual_seed = 555
random.seed(manual_seed)
np.random.seed(manual_seed)
torch.manual_seed(manual_seed)
torch.cuda.manual_seed_all(manual_seed)
torch.backends.cudnn.deterministic = True

In [None]:
train_df_raw = pd.read_csv(f'{PATH}/train.csv', nrows=5000000)

In [None]:
test_df_raw = pd.read_csv(f'{PATH}/test.csv')

In [None]:
def add_travel_vector_features(df):
    df['abs_diff_longitude'] = (df.dropoff_longitude - df.pickup_longitude).abs()
    df['abs_diff_latitude'] = (df.dropoff_latitude - df.pickup_latitude).abs()

In [None]:
# this function will also be used with the test set below
def select_within_boundingbox(df, BB):
    return (df.pickup_longitude >= BB[0]) & (df.pickup_longitude <= BB[1]) & \
           (df.pickup_latitude >= BB[2]) & (df.pickup_latitude <= BB[3]) & \
           (df.dropoff_longitude >= BB[0]) & (df.dropoff_longitude <= BB[1]) & \
           (df.dropoff_latitude >= BB[2]) & (df.dropoff_latitude <= BB[3])

BB = (-74.5, -72.8, 40.5, 41.8)

In [None]:
def sphere_dist(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """
    Return the great-circle (haversine) distance in km between pickup and dropoff coordinates.
    """
    #Define earth radius (km)
    R_earth = 6371
    #Convert degrees to radians
    pickup_lat, pickup_lon, dropoff_lat, dropoff_lon = map(np.radians,
                                                             [pickup_lat, pickup_lon, 
                                                              dropoff_lat, dropoff_lon])
    #Compute distances along lat, lon dimensions
    dlat = dropoff_lat - pickup_lat
    dlon = dropoff_lon - pickup_lon
    
    #Compute haversine distance
    a = np.sin(dlat/2.0)**2 + np.cos(pickup_lat) * np.cos(dropoff_lat) * np.sin(dlon/2.0)**2
    
    return 2 * R_earth * np.arcsin(np.sqrt(a))

def add_airport_dist(dataset):
    """
    Add the minimum distance (km) from the pickup or dropoff coordinates to each airport.
    JFK: John F. Kennedy International Airport
    EWR: Newark Liberty International Airport
    LGA: LaGuardia Airport
    """
    jfk_coord = (40.639722, -73.778889)
    ewr_coord = (40.6925, -74.168611)
    lga_coord = (40.77725, -73.872611)
    
    pickup_lat = dataset['pickup_latitude']
    dropoff_lat = dataset['dropoff_latitude']
    pickup_lon = dataset['pickup_longitude']
    dropoff_lon = dataset['dropoff_longitude']
    
    pickup_jfk = sphere_dist(pickup_lat, pickup_lon, jfk_coord[0], jfk_coord[1]) 
    dropoff_jfk = sphere_dist(jfk_coord[0], jfk_coord[1], dropoff_lat, dropoff_lon) 
    pickup_ewr = sphere_dist(pickup_lat, pickup_lon, ewr_coord[0], ewr_coord[1])
    dropoff_ewr = sphere_dist(ewr_coord[0], ewr_coord[1], dropoff_lat, dropoff_lon) 
    pickup_lga = sphere_dist(pickup_lat, pickup_lon, lga_coord[0], lga_coord[1]) 
    dropoff_lga = sphere_dist(lga_coord[0], lga_coord[1], dropoff_lat, dropoff_lon) 
    
    dataset['jfk_dist'] = pd.concat([pickup_jfk, dropoff_jfk], axis=1).min(axis=1)
    dataset['ewr_dist'] = pd.concat([pickup_ewr, dropoff_ewr], axis=1).min(axis=1)
    dataset['lga_dist'] = pd.concat([pickup_lga, dropoff_lga], axis=1).min(axis=1)
    
    return dataset
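
As a quick sanity check of the haversine formula, the sketch below applies the same computation to two of the airport coordinates used above. It is a scalar re-implementation with the standard `math` module, and `haversine_km` is a hypothetical helper name, not part of the kernel.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, r_earth=6371.0):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * r_earth * math.asin(math.sqrt(a))

jfk = (40.639722, -73.778889)
lga = (40.77725, -73.872611)

jfk_to_lga = haversine_km(*jfk, *lga)
print(f"JFK to LGA: {jfk_to_lga:.1f} km")
```

The distance is symmetric and is zero for identical points, which makes it easy to spot-check before applying the vectorized version to millions of rows.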

In [None]:
def data_preprocessing(df, testset=0):
    add_travel_vector_features(df)
    if testset==0:
        df = df.dropna(how='any',axis='rows')
        df = df[(df.abs_diff_longitude<5) & (df.abs_diff_latitude<5)]
        df = df[df.fare_amount>0]
        df = df[(df.passenger_count >= 0) & (df.passenger_count <= 6)]
        df = df[select_within_boundingbox(df, BB)]
        
    df[['date','time','timezone']] = df['pickup_datetime'].str.split(expand=True)
    add_datepart(df, "date", drop=False)

    df[['hour','minute','second']] = df['time'].str.split(':',expand=True).astype('int64')
    df[['trash', 'order_no']] = df['key'].str.split('.',expand=True)
    df['order_no'] = df['order_no'].astype('int64')
    df = df.drop(['timezone','time', 'pickup_datetime','trash','date'], axis = 1)
    
    df = add_airport_dist(df)
    df['distance'] = sphere_dist(df['pickup_latitude'], df['pickup_longitude'], 
                                   df['dropoff_latitude'] , df['dropoff_longitude'])
    return df

In [None]:
train_df = data_preprocessing(train_df_raw)

In [None]:
test_df = data_preprocessing(test_df_raw, testset=1)

In [None]:
train_df = train_df.reset_index()
test_df = test_df.reset_index()

In [None]:
train_df.columns


Assign the categorical and continuous variables.

In [None]:
cat_vars = ['passenger_count', 'Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
    'Is_month_end','Is_month_start','Is_quarter_end','Is_quarter_start','Is_year_end','Is_year_start','hour','minute','second','order_no']

contin_vars = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude'
               ,'jfk_dist','ewr_dist','lga_dist','distance']

dep = 'fare_amount'
n = len(train_df); n

In [None]:
train_df = train_df[cat_vars+contin_vars+ [dep,'key']].copy()
test_df[dep] = 0
test_df = test_df[cat_vars+contin_vars+ [dep,'key']].copy()

In [None]:
for v in cat_vars: train_df[v] = train_df[v].astype('category').cat.as_ordered()

Convert the categorical variables to the pandas category type, using the same mapping for test_df as for train_df.

In [None]:
apply_cats(test_df, train_df)
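
Why the mapping has to be shared can be sketched in plain Python. This is not how `apply_cats` is implemented; it only illustrates that codes are assigned from the training categories and then reused for the test set. The helper names are hypothetical, and reserving code 0 for values never seen in training is an assumption of this sketch.

```python
def build_codes(train_values):
    """Assign integer codes 1..K to the sorted distinct training values."""
    return {value: code for code, value in enumerate(sorted(set(train_values)), start=1)}

def encode(values, codes, unknown=0):
    """Encode values with an existing mapping; unseen values get `unknown`."""
    return [codes.get(value, unknown) for value in values]

train_hours = [8, 17, 23, 8]
test_hours = [17, 8, 3]  # hour 3 never appears in the training data

codes = build_codes(train_hours)
train_encoded = encode(train_hours, codes)
test_encoded = encode(test_hours, codes)
print(codes, train_encoded, test_encoded)
```

If the test set built its own mapping instead, the same hour could receive a different code and would look up the wrong embedding row at prediction time.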

In [None]:
for v in contin_vars:
    train_df[v] = train_df[v].fillna(0).astype('float32')
    test_df[v] = test_df[v].fillna(0).astype('float32')

In [None]:
train_df = train_df.set_index("key")

proc_df prepares the data for training (normalizing the inputs, handling missing values, ...).

In [None]:
df, y, nas, mapper = proc_df(train_df, 'fare_amount', do_scale=True)

In [None]:
test_df = test_df.set_index("key")

In [None]:
df_test, _, nas, mapper = proc_df(test_df, 'fare_amount', do_scale=True,
                                  mapper=mapper, na_dict=nas)
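
Passing `mapper` and `na_dict` from the first `proc_df` call into the second means the test set is scaled with statistics learned on the training set. A minimal sketch of that idea, using only the standard library and hypothetical helper names (it does not reproduce `proc_df` itself):

```python
import statistics

def fit_scaler(values):
    """Return the mean and population standard deviation of the training values."""
    return statistics.fmean(values), statistics.pstdev(values)

def apply_scaler(values, mean, std):
    """Standardize values with statistics computed elsewhere (e.g. on the training set)."""
    return [(v - mean) / std for v in values]

train_distance = [1.0, 2.0, 3.0, 6.0]
test_distance = [3.0, 9.0]

# Fit on the training data only, then reuse the same statistics for the test data.
mean, std = fit_scaler(train_distance)
train_scaled = apply_scaler(train_distance, mean, std)
test_scaled = apply_scaler(test_distance, mean, std)
print(mean, std, test_scaled)
```

Fitting a separate scaler on the test set would shift its values relative to what the network saw during training, so the same raw distance would be presented as a different input.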

In [None]:
# train_ratio = 0.75
train_ratio = 0.8
train_size = int(n * train_ratio); train_size
val_idx = list(range(train_size, len(df)))

In [None]:
y

In [None]:
def rmse(y_pred, targ):
    # Root mean squared error between predictions and targets
    diff = targ - y_pred
    return math.sqrt((diff**2).mean())

At the time of writing (13/08/18) the fastai library available here has not been updated yet, so a few core pieces have to be modified (ColumnarModelData, TMP_PATH, MODEL_PATH).

The batch size (bs) can be increased (e.g. to 512) to train faster.

In [None]:
df_test

In [None]:
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y.astype(np.float32), cat_flds=cat_vars, bs=256,test_df=df_test)

In [None]:
cat_vars

In [None]:
cat_sz = [(c, len(train_df[c].cat.categories)+1) for c in cat_vars]

In [None]:
y

In [None]:
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz];emb_szs
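
The sizing rule above gives each categorical variable an embedding of roughly half its cardinality, capped at 50. The worked example below applies it to a few variables; the cardinalities assume every possible value occurs in the training data and include the extra slot added in `cat_sz`, and `embedding_size` is a hypothetical helper name.

```python
def embedding_size(cardinality, max_size=50):
    """Half the cardinality (rounded up), capped at max_size."""
    return min(max_size, (cardinality + 1) // 2)

# Assumed cardinalities: number of distinct values + 1 (as in cat_sz).
examples = {"Dayofweek": 8, "hour": 25, "minute": 61, "Dayofyear": 367}
sizes = {name: embedding_size(c) for name, c in examples.items()}
print(sizes)
```

The cap keeps high-cardinality variables such as `Dayofyear` from dominating the width of the input layer.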

In [None]:
max_y = np.max(y)
y_range = (0, max_y*1.2)
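
As far as I understand the fastai model used here, `y_range` is applied by passing the final activation through a sigmoid and rescaling it into the given interval; treat that as an assumption about the library rather than a statement of its API. The sketch below shows the rescaling with a hypothetical maximum fare and a hypothetical helper name.

```python
import math

def scale_to_range(raw_output, y_range):
    """Squash a raw network output into (low, high) with a sigmoid."""
    low, high = y_range
    sigmoid = 1.0 / (1.0 + math.exp(-raw_output))
    return low + sigmoid * (high - low)

max_y = 100.0  # hypothetical maximum fare, for illustration only
y_range = (0, max_y * 1.2)

for raw in (-5.0, 0.0, 5.0):
    print(raw, round(scale_to_range(raw, y_range), 2))
```

Because a sigmoid only approaches its upper bound asymptotically, setting the top of the range 20% above the largest observed fare lets the model reach the true maximum without needing extreme activations.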

In [None]:
TMP_PATH = "/tmp/tmp"
MODEL_PATH = "/tmp/model/"

In [None]:
!ls ../input

In [None]:
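# Positional arguments: embedding sizes, number of continuous inputs, embedding dropout,
# output size, hidden layer sizes, per-layer dropout
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [256, 128, 64, 32, 8], [0.008,0.008, 0.008, 0.01, 0.01], y_range=y_range,tmp_name=TMP_PATH,models_name=MODEL_PATH)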

Finding a good learning rate

In [None]:
m.lr_find()

In [None]:
m.sched.plot()

In [None]:
m.sched.plot_lr()

In [None]:
lr = 2e-5


In [None]:
m.fit(lr, 3, metrics=[rmse])

In [None]:
m.fit(lr, 6, cycle_len=1, metrics=[rmse])

In [None]:
m.fit(lr, 4, cycle_len=1, cycle_mult=2, metrics=[rmse])

In [None]:
m.fit(lr, 4, cycle_len=1, cycle_mult=2, metrics=[rmse])
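
The `cycle_len` and `cycle_mult` arguments make each call train for several restart cycles of growing length. Assuming the usual fastai semantics for this version of the library (cycle `i` lasts `cycle_len * cycle_mult**i` epochs), the number of epochs per call can be worked out as follows; `cycle_lengths` is a hypothetical helper, not a fastai function.

```python
def cycle_lengths(n_cycle, cycle_len=1, cycle_mult=1):
    """Length in epochs of each restart cycle under the assumed schedule."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycle)]

lengths = cycle_lengths(4, cycle_len=1, cycle_mult=2)
total_epochs = sum(lengths)
print(lengths, total_epochs)
```

Under that assumption, each of the two `cycle_mult=2` calls above runs for 15 epochs, while the earlier `cycle_len=1` call with 6 cycles runs for 6.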

In [None]:
pred_test=m.predict()

In [None]:
len(pred_test)

In [None]:
len(y[val_idx])

In [None]:
y[:20]

In [None]:
y_test = m.predict(True)

In [None]:
y_test = y_test.reshape(-1)


In [None]:
submission = pd.DataFrame(
    {'key': test_df.index, 'fare_amount': y_test},
    columns = ['key', 'fare_amount'])
submission.to_csv('submission.csv', index = False)

In [None]:
test_df.index