Modelling exercise for Grab AI Challenge: Safety dataset
- Project written in Python 3.6
- Repository contains:
- 5 Jupyter notebooks showing the whole workflow from data download to submission predictions
- `requirements.txt` specifying all the dependencies required for model evaluation
- Store folder containing older iterations and backups
- Outputs folder containing pickled:
- Feature transformation functions
- Saved model weights
- Notebooks labelled 1-4 are not required for submission but showcase:
- Thought process
- Code quality
- Download the required dependencies with `pip install -r requirements.txt`
- For evaluation of the model, please open `5. Grab Safety - Submission Notebook` and edit the `SOURCE_LIST` list to include all the paths of the feature data.
- Once `SOURCE_LIST` is updated, run all cells in the kernel and the predictions will be stored in the global variable `preds`.
- Depending on requirements, a `to_csv` command has been commented out.
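As a rough sketch, the commented-out export step might look like the following. The variable names and CSV columns here are illustrative assumptions, not the notebook's exact schema:

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's globals: `preds` holds the
# model output and `booking_ids` the matching booking keys.
booking_ids = [111, 222, 333]
preds = [0.12, 0.87, 0.45]

submission = pd.DataFrame({"bookingID": booking_ids, "prediction": preds})

# Uncomment to write the file, mirroring the commented-out call in the notebook:
# submission.to_csv("submission.csv", index=False)
```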
- Understand the data and telematics sufficiently to proceed with the exercise
- Decide on the most reasonable format to transform and extract features from the data before modelling
- Choose a suitable algorithm (or ensemble) that gives reasonable test results
- Grid search to optimize
- Decide on a technique for deployment: interactive web app, Jupyter results, etc.
- Chosen method of deployment: Jupyter notebooks
- `bookingID`: Numeric key to differentiate bookings. One-to-many relationship in terms of features to labels
- `Accuracy`: How accurate the GPS location is (with reference to what?)
- `Bearing`: The GPS bearing is the compass direction from the current position to the intended destination. In other words, it describes the direction of a destination or object. If you're facing due north and want to move toward a building directly to your right, the bearing would be east, or 90 degrees
- `acceleration_x`: How fast speed changes in the x direction
- `acceleration_y`: How fast speed changes in the y direction
- `acceleration_z`: How fast speed changes in the z direction
- `gyro_x`: How fast angular position changes in x
- `gyro_y`: How fast angular position changes in y
- `gyro_z`: How fast angular position changes in z
- `second`: Each booking starts at 0 and ends whenever the last record is taken
- `Speed`: Speed of the vehicle at the time of measurement
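For reference, a tiny synthetic frame with this schema, assuming the column names above (the values are made up; the real data comes from the files in `SOURCE_LIST`):

```python
import pandas as pd

# Hypothetical sample mirroring the schema described above.
sample = pd.DataFrame({
    "bookingID": [1, 1, 2],
    "Accuracy": [3.9, 3.9, 8.0],
    "Bearing": [90.0, 92.5, 180.0],
    "acceleration_x": [0.1, -0.2, 0.05],
    "acceleration_y": [9.7, 9.8, 9.6],
    "acceleration_z": [0.3, 0.1, -0.1],
    "gyro_x": [0.01, 0.0, 0.02],
    "gyro_y": [0.0, 0.01, 0.0],
    "gyro_z": [0.02, 0.0, 0.01],
    "second": [0.0, 1.0, 0.0],
    "Speed": [5.2, 5.5, 12.1],
})

# Each bookingID maps to many telemetry rows but a single safety label,
# hence the one-to-many relationship noted above.
rows_per_booking = sample.groupby("bookingID").size()
```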
- A high-level view shows that not all the features are useful in predicting the safety category
- Also, there are 18 duplicated labels. Whether they are supposed to be dangerous or safe is unknown, so these bookings are removed from the dataset for modelling purposes
- Some bookings show extremely long trips that make little sense: the longest converts to approximately 47 years. These entries are removed.
- Some entries also show implausible speed readings converting to about 500 km/h. These entries are also removed.
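The two filters above could be sketched like this; the exact thresholds used in the notebooks are assumptions here:

```python
import pandas as pd

# Assumed cutoffs for illustration only; the notebooks may use different values.
MAX_TRIP_SECONDS = 6 * 60 * 60   # drop records beyond ~6 hours into a trip
MAX_SPEED_MS = 300 / 3.6         # drop readings above ~300 km/h (Speed is m/s)

df = pd.DataFrame({
    "bookingID": [1, 1, 2, 3],
    "second": [0.0, 120.0, 1.5e9, 30.0],   # 1.5e9 seconds is roughly 47 years
    "Speed": [10.0, 12.0, 8.0, 140.0],     # 140 m/s is about 500 km/h
})

clean = df[(df["second"] <= MAX_TRIP_SECONDS) & (df["Speed"] <= MAX_SPEED_MS)]
```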
- Distribution plots show some differences between the categories for:
- Duration of trip
- Speed
- `Distance travelled`: Total length of a trip in meters, calculated per record as duration of the current record * speed recorded, then summed per trip
- `Duration of trip`: Measured in seconds and taken from the last recorded entry per booking
- `Acceleration`, `Gyro`: X, Y and Z readings are combined into a single datapoint using the Euclidean norm: `sqrt(x^2 + y^2 + z^2)`
- `Gyro`, `Acceleration`, `Speed`, `Change in bearing` are summarised per trip with:
- Maximum per trip
- Minimum per trip
- Mean per trip
- Standard deviation per trip
- Various percentiles
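The magnitude and per-trip aggregation steps above can be sketched with pandas. Synthetic data; the column and feature names are assumptions based on this README:

```python
import numpy as np
import pandas as pd

# Toy telemetry standing in for the real feature files.
df = pd.DataFrame({
    "bookingID": [1, 1, 1, 2, 2],
    "acceleration_x": [0.1, 0.3, -0.2, 0.0, 0.5],
    "acceleration_y": [9.8, 9.7, 9.9, 9.8, 9.6],
    "acceleration_z": [0.2, -0.1, 0.0, 0.3, 0.1],
    "Speed": [5.0, 6.5, 4.0, 12.0, 11.0],
})

# Combine the three axis readings with the Euclidean norm sqrt(x^2 + y^2 + z^2).
df["acc_mag"] = np.sqrt(
    df["acceleration_x"] ** 2
    + df["acceleration_y"] ** 2
    + df["acceleration_z"] ** 2
)

# Aggregate per trip: max, min, mean, std and a percentile (90th shown here).
features = df.groupby("bookingID")[["acc_mag", "Speed"]].agg(
    ["max", "min", "mean", "std", lambda s: s.quantile(0.9)]
)
```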
- Models tried: GBM, Random Forests, Neural Networks, Logistic Regression
- First few attempts showed consistent training and testing accuracy, but both at low values of ~77%, which suggests underfitting
- Iteration involved feature engineering before cycling through models
- Tried polynomial features to increase accuracy, but this was not effective
- Noticed a degree of class imbalance between classes 0 and 1 from the confusion matrix results and the overall base accuracy. Tried up-sampling the minority class
- Resampling data dropped accuracy from 78% to 68% as expected
- However, AUC score improved which suggests better results over different decision thresholds
- Trying to avoid up-sampling as it may bias the results; use sample weights instead
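A minimal sketch of the sample-weight approach, assuming scikit-learn's `GradientBoostingClassifier` on synthetic imbalanced data (the real features and model configuration are in the notebooks):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data standing in for the real trip features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.array([0] * 150 + [1] * 50)   # 3:1 class imbalance

# "balanced" gives minority-class rows proportionally larger weights,
# avoiding the duplication bias that up-sampling can introduce.
weights = compute_sample_weight(class_weight="balanced", y=y)

model = GradientBoostingClassifier(random_state=0)
model.fit(X, y, sample_weight=weights)
```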
- Realised that there are data leakage issues in the current model from outlier conditions and aggregations computed on the full dataset. The process is restructured to minimize this:
- Split data into training and testing before performing any feature extractions
- Choose a more robust condition for `second` outliers instead of deriving it from the full dataset (that was dumb)
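The split-first restructuring could look like this sketch: partition by `bookingID` before deriving any thresholds or aggregations, so nothing is computed from test trips. Synthetic data; the real pipeline lives in the notebooks:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: one safety label per booking.
labels = pd.DataFrame({"bookingID": list(range(10)), "label": [0, 1] * 5})

# Split at the booking level first, so no test trip leaks into
# outlier thresholds or aggregate statistics.
train_ids, test_ids = train_test_split(
    labels["bookingID"], test_size=0.2, random_state=42, stratify=labels["label"]
)

# Toy telemetry: three records per booking.
telemetry = pd.DataFrame({
    "bookingID": np.repeat(np.arange(10), 3),
    "second": np.tile([0.0, 10.0, 20.0], 10),
})

train_rows = telemetry[telemetry["bookingID"].isin(train_ids)]

# Outlier condition derived from training data only, then applied everywhere.
second_cap = train_rows["second"].quantile(0.99)
```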
- Model performance dropped from 96% back down to 78% accuracy after fixing the data leakage issue.
- Focusing on the models:
- Gradient Boosted Machine
- Logistic Regression
- Random Forests
- All 3 models show signs of underfitting with lower than desired training and testing accuracy.
- The current notebook doesn't show the selected model. The selected model is a Gradient Boosted Machine with a 74% AUC score and a microsecond timestamp of 87995.