Modelling exercise for Grab AI Challenge: Safety dataset
- Project written in Python 3.6
- Repository contains:
- 5 Jupyter notebooks showing the whole workflow from data download to submission predictions
- `requirements.txt` specifying all the dependencies required for model evaluation
- Store folder containing older iterations and backups
- Outputs folder containing pickled:
- Feature transformation functions
- Saved model weights
- Notebooks labelled 1-4 are not required for submission but showcase:
- Thought process
- Code quality
- Download the required dependencies with `pip install -r requirements.txt`
- For evaluation of the model, please open `5. Grab Safety - Submission Notebook` and edit the `SOURCE_LIST` list to include all the paths of the feature data.
- Once `SOURCE_LIST` is updated, run all cells in the kernel and the predictions will be stored in the global variable `preds`.
- Depending on requirements, a `to_csv` command has been commented out.
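As a rough sketch, the commented-out export step might look like the following. The variable names and CSV columns here are illustrative assumptions, not the notebook's exact schema:

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's globals: `preds` holds the
# model output and `booking_ids` the matching booking keys.
booking_ids = [111, 222, 333]
preds = [0.12, 0.87, 0.45]

submission = pd.DataFrame({"bookingID": booking_ids, "prediction": preds})

# Uncomment to write the file, mirroring the commented-out call in the notebook:
# submission.to_csv("submission.csv", index=False)
```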
- Understand the data and telematics sufficiently to proceed with the exercise
- Decide on the most reasonable format to transform and extract features from the data before modelling
- Choose a suitable algorithm (or ensemble) that gives reasonable test results
- Grid search to optimize
- Decide on a technique for deployment: interactive web app, Jupyter results, etc.
- Chosen method of deployment: Jupyter notebooks
- `bookingID`: Numeric key to differentiate bookings. One-to-many relationship in terms of features to labels
- `Accuracy`: How accurate the GPS location is (with reference to what?)
- `Bearing`: The GPS bearing is the compass direction from the current position to the intended destination. In other words, it describes the direction of a destination or object. If you're facing due north and want to move toward a building directly to your right, the bearing would be east, or 90 degrees
- `acceleration_x`: How fast speed changes in the x direction
- `acceleration_y`: How fast speed changes in the y direction
- `acceleration_z`: How fast speed changes in the z direction
- `gyro_x`: How fast angular position changes in x
- `gyro_y`: How fast angular position changes in y
- `gyro_z`: How fast angular position changes in z
- `second`: Each booking starts at 0 and ends whenever the last record is taken
- `Speed`: Speed of the vehicle at the time of measurement
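For reference, a tiny synthetic frame with this schema, assuming the column names above (the values are made up; the real data comes from the files in `SOURCE_LIST`):

```python
import pandas as pd

# Hypothetical sample mirroring the schema described above.
sample = pd.DataFrame({
    "bookingID": [1, 1, 2],
    "Accuracy": [3.9, 3.9, 8.0],
    "Bearing": [90.0, 92.5, 180.0],
    "acceleration_x": [0.1, -0.2, 0.05],
    "acceleration_y": [9.7, 9.8, 9.6],
    "acceleration_z": [0.3, 0.1, -0.1],
    "gyro_x": [0.01, 0.0, 0.02],
    "gyro_y": [0.0, 0.01, 0.0],
    "gyro_z": [0.02, 0.0, 0.01],
    "second": [0.0, 1.0, 0.0],
    "Speed": [5.2, 5.5, 12.1],
})

# Each bookingID maps to many telemetry rows but a single safety label,
# hence the one-to-many relationship noted above.
rows_per_booking = sample.groupby("bookingID").size()
```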
- A high-level view shows that not all the features are useful in predicting the safety category
- Also, there are 18 duplicated labels. Whether they are supposed to be dangerous or safe is unknown, so these bookings are removed from the dataset for modelling purposes
- Some bookings show extremely long trips that make little sense: the longest converts to approximately 47 years. These entries are removed.
- Some entries also show implausible speed readings converting to about 500 km/h. These entries are also removed.
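The two filters above could be sketched like this; the exact thresholds used in the notebooks are assumptions here:

```python
import pandas as pd

# Assumed cutoffs for illustration only; the notebooks may use different values.
MAX_TRIP_SECONDS = 6 * 60 * 60   # drop records beyond ~6 hours into a trip
MAX_SPEED_MS = 300 / 3.6         # drop readings above ~300 km/h (Speed is m/s)

df = pd.DataFrame({
    "bookingID": [1, 1, 2, 3],
    "second": [0.0, 120.0, 1.5e9, 30.0],   # 1.5e9 seconds is roughly 47 years
    "Speed": [10.0, 12.0, 8.0, 140.0],     # 140 m/s is about 500 km/h
})

clean = df[(df["second"] <= MAX_TRIP_SECONDS) & (df["Speed"] <= MAX_SPEED_MS)]
```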
- Distribution plots show some differences between the categories for:
- Duration of trip
- Speed
- `Distance travelled`: Total length of a trip in meters, calculated per record as duration of the current record * speed recorded, then summed per trip
- `Duration of trip`: Measured in seconds and taken from the last recorded entry per booking
- `Acceleration`, `Gyro`: X, Y and Z readings are combined into a single datapoint using the Euclidean norm: `sqrt(x^2 + y^2 + z^2)`
- `Gyro`, `Acceleration`, `Speed`, `Change in bearing` are summarised per trip with:
- Maximum per trip
- Minimum per trip
- Mean per trip
- Standard deviation per trip
- Various percentiles
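The magnitude and per-trip aggregation steps above can be sketched with pandas. Synthetic data; the column and feature names are assumptions based on this README:

```python
import numpy as np
import pandas as pd

# Toy telemetry standing in for the real feature files.
df = pd.DataFrame({
    "bookingID": [1, 1, 1, 2, 2],
    "acceleration_x": [0.1, 0.3, -0.2, 0.0, 0.5],
    "acceleration_y": [9.8, 9.7, 9.9, 9.8, 9.6],
    "acceleration_z": [0.2, -0.1, 0.0, 0.3, 0.1],
    "Speed": [5.0, 6.5, 4.0, 12.0, 11.0],
})

# Combine the three axis readings with the Euclidean norm sqrt(x^2 + y^2 + z^2).
df["acc_mag"] = np.sqrt(
    df["acceleration_x"] ** 2
    + df["acceleration_y"] ** 2
    + df["acceleration_z"] ** 2
)

# Aggregate per trip: max, min, mean, std and a percentile (90th shown here).
features = df.groupby("bookingID")[["acc_mag", "Speed"]].agg(
    ["max", "min", "mean", "std", lambda s: s.quantile(0.9)]
)
```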
- Models tried: GBM, Random Forests, Neural Networks, Logistic Regression
- First few attempts showed consistent training and testing accuracy, but both at low values of ~77%, which suggests underfitting
- Iteration involved feature engineering before cycling through models
- Tried polynomial features to increase accuracy, but this was not effective
- Noticed a degree of class imbalance between classes 0 and 1 from the confusion matrix results and the overall base accuracy. Tried up-sampling the minority class
- Resampling data dropped accuracy from 78% to 68% as expected
- However, AUC score improved which suggests better results over different decision thresholds
- Trying to avoid up-sampling as it may bias the results; use sample weights instead
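A minimal sketch of the sample-weight approach, assuming scikit-learn's `GradientBoostingClassifier` on synthetic imbalanced data (the real features and model configuration are in the notebooks):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data standing in for the real trip features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.array([0] * 150 + [1] * 50)   # 3:1 class imbalance

# "balanced" gives minority-class rows proportionally larger weights,
# avoiding the duplication bias that up-sampling can introduce.
weights = compute_sample_weight(class_weight="balanced", y=y)

model = GradientBoostingClassifier(random_state=0)
model.fit(X, y, sample_weight=weights)
```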
- Realised that there are data leakage issues in the current model from outlier conditions and aggregations computed on the full dataset. The process is restructured to minimize this:
- Split data into training and testing before performing any feature extractions
- Choose a more robust condition for `second` outliers instead of deriving it from the full dataset (that was dumb)
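The split-first restructuring could look like this sketch: partition by `bookingID` before deriving any thresholds or aggregations, so nothing is computed from test trips. Synthetic data; the real pipeline lives in the notebooks:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: one safety label per booking.
labels = pd.DataFrame({"bookingID": list(range(10)), "label": [0, 1] * 5})

# Split at the booking level first, so no test trip leaks into
# outlier thresholds or aggregate statistics.
train_ids, test_ids = train_test_split(
    labels["bookingID"], test_size=0.2, random_state=42, stratify=labels["label"]
)

# Toy telemetry: three records per booking.
telemetry = pd.DataFrame({
    "bookingID": np.repeat(np.arange(10), 3),
    "second": np.tile([0.0, 10.0, 20.0], 10),
})

train_rows = telemetry[telemetry["bookingID"].isin(train_ids)]

# Outlier condition derived from training data only, then applied everywhere.
second_cap = train_rows["second"].quantile(0.99)
```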
- Model performance dropped from 96% back down to 78% accuracy after fixing the data leakage issue.
- Focusing on the models:
- Gradient Boosted Machine
- Logistic Regression
- Random Forests
- All 3 models show signs of underfitting with lower than desired training and testing accuracy.
- The current notebook doesn't show the selected model. The selected model is a Gradient Boosted Machine with a 74% AUC score and a microsecond timestamp of 87995.