<a href="https://colab.research.google.com/github/jicksy/coursera-test/blob/master/Solution_data_science_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## TA DS Test

## Data

You are provided with two sample data sets:

- `events.csv.gz` - A sample of events collected from an online travel agency, containing:
  * `ts` - the timestamp of the event
  * `event_type` - either `search` for a search made on the site, or `book` for a conversion, i.e. the user books the flight
  * `user_id` - unique identifier of a user
  * `date_from` - desired start date of the journey
  * `date_to` - desired end date of the journey
  * `origin` - IATA airport code of the origin airport
  * `destination` - IATA airport code of the destination airport
  * `num_adults` - number of adults
  * `num_children` - number of children

- `iata.csv` - containing geo-coordinates of major airports
  * `iata_code` - IATA code of the airport
  * `lat` - latitude in floating point format
  * `lon` - longitude in floating point format


**Data Preparation**
* Rows with null values and duplicate rows are dropped.
* The output column `event_type` is converted to a numeric label (`book` = 1, `search` = 0).
* The geographic distance between origin and destination is calculated using the haversine formula.
* Length of stay (days from `date_from` to `date_to`) and time to travel (days from `ts` to `date_from`) are calculated.
* The origin and destination columns are label encoded.
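
As a quick illustration of the haversine step, here is a minimal standalone sketch that takes coordinates directly instead of IATA codes. The function name `haversine_km` and the constant name `EARTH_RADIUS_KM` are chosen for this example only; the radius value matches the one used in the solution code.

```python
import numpy as np

EARTH_RADIUS_KM = 6367  # same radius value as in the solution code

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Sanity check: two points on the equator, 90 degrees of longitude apart,
# should be a quarter of the circumference apart (radius * pi / 2).
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
print(quarter, EARTH_RADIUS_KM * np.pi / 2)
```

Checking against a distance that can be worked out by hand is a cheap way to catch swapped latitude and longitude arguments.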

**Feature Engineering**
* Distance, length of stay (LOS) and time to travel (TTT) are derived as candidate features, since they could be important predictors of whether a search ends in a booking.
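
The two date-based features can be sketched on a couple of made-up rows. The values below are hypothetical and only illustrate the arithmetic; they are not taken from the events data set.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "ts": ["2017-04-01 10:00:00", "2017-04-10 08:30:00"],
    "date_from": ["2017-04-15", "2017-05-01"],
    "date_to": ["2017-04-22", "2017-05-03"],
})
toy[["ts", "date_from", "date_to"]] = toy[["ts", "date_from", "date_to"]].apply(pd.to_datetime)

# Time to travel: whole days between the event and the start of the journey
toy["ttt"] = (toy["date_from"] - toy["ts"]).dt.days
# Length of stay: whole days between the start and the end of the journey
toy["los"] = (toy["date_to"] - toy["date_from"]).dt.days

print(toy[["ttt", "los"]])
```

Note that `.dt.days` keeps only the whole-day component of the difference, so a search made 13 days and 14 hours before departure gets `ttt = 13`.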

**Experimental design**
* The data is split into a training set (67%) and a test set (33%).
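
A minimal sketch of the split on synthetic data, using the same `test_size` and `random_state` as the solution code. The toy frame and its random values are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed feature table (illustrative values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "distance": rng.uniform(100, 5000, size=100),
    "outcome": rng.integers(0, 2, size=100),
})

train, test = train_test_split(toy, test_size=0.33, random_state=42)
print(len(train), len(test))
```

Passing `stratify=toy["outcome"]` would keep the ratio of bookings to searches the same in both parts; the solution code does not do this, so it is mentioned here only as an option.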

**Model**
* I have chosen a LightGBM classifier as my model. It achieves an AUC of about 0.60 on the test data.
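
To put the score in context, here is a small sketch of how `roc_auc_score` behaves on four made-up predictions (the labels and scores are illustrative, not model output).

```python
from sklearn import metrics

# Illustrative labels (1 = book, 0 = search) and predicted scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the fraction of (positive, negative) pairs that are ranked correctly
auc = metrics.roc_auc_score(y_true, y_score)
print(auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a value near 0.60 is modestly better than chance.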


**How to run the model**
* Running the following cell invokes the `main` function, which trains the model and evaluates it on the test data.
* Make sure the events and IATA data sets are in the directory you run the code from. The code reads them as `events_1_1.csv.gz` and `iata_1_1.csv`.

## The code

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
from sklearn import metrics

def haversine(origin, destination, iata_dict):
    """
    Great-circle distance in km between two airports given by IATA code.
    iata_dict maps 'lat' and 'lon' to {iata_code: value in degrees}.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [iata_dict['lon'][origin], iata_dict['lat'][origin], iata_dict['lon'][destination], iata_dict['lat'][destination]])
    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c  # 6367 km approximates the Earth's radius
    return km


def preprocessing(events, iata):
  print('\n *******Preprocessing Steps*******\n')
  print('No. of missing values in events dataset: \n', events.isnull().sum())
  print('\nNo. of missing values in iata dataset: \n', iata.isnull().sum())

  # drop rows with null values (if any)
  print('\n Dropping rows with null values')
  events.dropna(inplace=True)
  iata.dropna(inplace=True)

  # drop duplicate rows; for iata, keep the first row per airport code
  print('\n Dropping rows with duplicate values')
  events.drop_duplicates(keep='first', inplace=True)
  iata.drop_duplicates(subset='iata_code', keep='first', inplace=True)

  # For the distance lookup, convert iata into a dictionary:
  # {'lat': {code: lat}, 'lon': {code: lon}}
  iata_dict = iata.set_index('iata_code').to_dict()

  # Add outcome column, "book" == 1, search == 0
  print('\n Adding output column with book = 1 and search =0')
  events = events.assign(outcome=(events['event_type'] == 'book').astype(int))

  # Call haversine row by row to create a new column distance
  print('\n Compute the distance between origin and destination using haversine distance formula')
  events['distance'] = events.apply(lambda x: haversine(x['origin'], x['destination'], iata_dict), axis = 1)
  
  # Create new column length of stay(los) and time to travel (ttt)
  print('\n Create los and ttt columns')
  events[['ts','date_from', 'date_to']] = events[['ts','date_from','date_to']].apply(pd.to_datetime) #if conversion required
  events['ttt'] = (events['date_from'] - events['ts']).dt.days
  events['los'] = (events['date_to'] - events['date_from']).dt.days

  # Label encode categorical columns
  print('\n Label encode origin and destination columns')
  cat_features = ['origin', 'destination']
  encoder = LabelEncoder()
  encoded = events[cat_features].apply(encoder.fit_transform)

  # Since events and encoded have the same index, we can easily join them
  X = events[['num_adults', 'num_children', 'distance', 'ttt', 'los', 'outcome']].join(encoded)
  return X

def train_model(train, valid):
    feature_cols = train.columns.drop('outcome')
    
    dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])

    param = {'num_leaves': 8, 'objective': 'binary'}
    param['metric'] = 'auc'
    num_round = 500
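    # Early stopping is passed as a callback; the early_stopping_rounds and
    # verbose_eval keyword arguments were removed from lgb.train in LightGBM 4.0
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], callbacks=[lgb.early_stopping(stopping_rounds=5, verbose=False)])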

    ypred = bst.predict(valid[feature_cols])
    score = metrics.roc_auc_score(valid['outcome'], ypred)

    print(f"\nTest AUC score: {score}")

    return bst

def main():
    events = pd.read_csv('events_1_1.csv.gz')
    iata = pd.read_csv('iata_1_1.csv')

    X = preprocessing(events, iata)

    # Train the model
    print('\n *******Training the model*******')
    train, test = train_test_split(X, test_size=0.33, random_state=42)
    model = train_model(train, test)

if __name__ == '__main__':
    main()


 *******Preprocessing Steps*******

No. of missing values in events dataset: 
 ts               0
event_type       0
user_id          0
date_from       22
date_to          3
origin           0
destination      0
num_adults       0
num_children     0
dtype: int64

No. of missing values in iata dataset: 
 iata_code    0
lat          0
lon          0
dtype: int64

 Dropping rows with null values

 Dropping rows with duplicate values

 Adding output column with book = 1 and search =0

 Compute the distance between origin and destination using haversine distance formula

 Create los and ttt columns

 Label encode origin and destination columns

 *******Training the model*******

Test AUC score: 0.6037677305668645
