Mars-spectrometry-14th-place-solution

This is my final solution to the Mars Spectrometry challenge by NASA, hosted on DrivenData.

Why this challenge?

TLDR:

NASA's rover sends back data after conducting evolved gas analysis (EGA) on the soil samples it has collected. We need to model this data to predict whether each of 10 given compounds is present in a sample.

The story so far:

We are all curious about our neighbouring red planet, so NASA has been sending rovers to the surface of Mars. These rovers move across the surface and collect various soil samples. They are also equipped with evolved gas analysis (EGA) instruments. The collected soil samples are heated to different temperatures and the evolved gaseous ions are observed. Based on the abundance of the evolved ions, we can tell what the chemical composition of the soil sample is. This process is called EGA.

The data generated by EGA is sent back to Earth. Scientists then need to model this data to find clues about the presence of any life in the Martian soil.

The data is available on the competition website (you have to log in to DrivenData). As data scientists or machine learning practitioners, we have to model this data to find the best possible model and make the best predictions.

About the Problem:

The domain and nature of the problem are unusual.
The data has few samples and many features.
The problem requires us to classify the presence of 10 targets (carbonate, iron_oxide, sulfate, etc.); each sample can contain several of them.
The metric used is aggregated_log_loss.
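The metric above can be sketched in a few lines. This is a minimal illustration, assuming the aggregation is an unweighted mean of the binary log loss over the target columns; the function names and the clipping constant `eps` are illustrative and not taken from the competition code.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy for a single target column."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() stays finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def aggregated_log_loss(y_true_rows, y_prob_rows):
    """Unweighted mean of the per-column log losses (assumed aggregation)."""
    n_targets = len(y_true_rows[0])
    column_losses = []
    for j in range(n_targets):
        y_col = [row[j] for row in y_true_rows]
        p_col = [row[j] for row in y_prob_rows]
        column_losses.append(binary_log_loss(y_col, p_col))
    return sum(column_losses) / n_targets
```

Because every target contributes equally, a rare compound that is predicted poorly hurts the score as much as a common one.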

My solution:

My final solution is a simple average of the calibrated predictions of the ensemble models.
I used logistic regression to find the most relevant features (feature selection) [selected tens of features from more than 10k features].
I added other features such as total_abundance for each sample, relative abundance, changes in the abundance of ions, etc. (feature engineering).
I generated 8 different training sets based on various temperature and time bins (feature engineering).
I used 20-fold cross-validation.
I used 10 CatBoost classifiers to predict the 10 targets (one binary classifier per target) on each dataset.
I calibrated every model's predictions to better match the targets.
I combined the predictions by simple averaging.
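The binning and abundance features mentioned in the steps above might look roughly like the following. This is only a sketch under stated assumptions: each sample is taken to be a list of `(temp, m_z, abundance)` readings with non-negative abundances, the maximum is used as the per-bin aggregate, and the function name, bin width, and feature names are all hypothetical rather than taken from the notebooks. It uses only the standard library to stay self-contained.

```python
from collections import defaultdict

def binned_features(readings, bin_width=100.0):
    """Build one sample's features from (temp, m_z, abundance) readings."""
    peak = defaultdict(float)       # max abundance per (temperature bin, ion)
    ion_total = defaultdict(float)  # summed abundance per ion
    for temp, m_z, abundance in readings:
        bin_start = int(temp // bin_width * bin_width)
        key = (bin_start, m_z)
        peak[key] = max(peak[key], abundance)
        ion_total[m_z] += abundance

    total = sum(ion_total.values())
    features = {"total_abundance": total}
    for (bin_start, m_z), value in sorted(peak.items()):
        features[f"mz{m_z}_temp{bin_start}_max"] = value
    for m_z, value in sorted(ion_total.items()):
        # Relative abundance: this ion's share of the sample's total signal.
        features[f"mz{m_z}_rel_abundance"] = value / total if total else 0.0
    return features

# Different bin widths give differently shaped training sets from the same data.
example = [(35.0, 18, 2.0), (80.0, 18, 6.0), (135.0, 18, 1.0), (140.0, 44, 1.0)]
training_sets = {w: binned_features(example, w) for w in (50.0, 100.0, 200.0)}
```

Narrow bins keep more of the temperature profile but produce many more columns, which is one reason a feature selection step becomes useful afterwards.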

My best predictions are avg_preds.csv from the caliberate and predict notebook, with an agg_logloss of 0.13 on the private leaderboard.
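Calibrating each model's predictions and then averaging them can be sketched as follows. The calibration method is not stated here, so Platt scaling (a logistic fit on the logit of the raw probability) is swapped in purely as an assumption; the function names, learning rate, and step count are likewise hypothetical. In practice the calibrator would be fitted on out-of-fold predictions for one target and then applied to that model's test predictions.

```python
import math

def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def _logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def fit_platt(oof_probs, y_true, lr=0.05, n_steps=2000):
    """Fit p' = sigmoid(a * logit(p) + b) by gradient descent on log loss."""
    x = [_logit(p) for p in oof_probs]
    a, b = 1.0, 0.0  # start from the identity mapping
    n = len(x)
    for _ in range(n_steps):
        grad_a = grad_b = 0.0
        for xi, yi in zip(x, y_true):
            err = _sigmoid(a * xi + b) - yi  # derivative of loss w.r.t. score
            grad_a += err * xi
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def apply_platt(probs, a, b):
    """Map raw probabilities through the fitted calibrator."""
    return [_sigmoid(a * _logit(p) + b) for p in probs]

def average_predictions(per_model_probs):
    """Simple (unweighted) mean across models, element by element."""
    n_models = len(per_model_probs)
    return [sum(col) / n_models for col in zip(*per_model_probs)]
```

Since log loss punishes confident mistakes heavily, pulling overconfident probabilities back toward the observed frequencies before averaging tends to help this kind of metric.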

Other things I have tried

Neural networks: not very good for this competition.
Autoencoders and denoising autoencoders: these didn't work out either.
Upsampling the minority class.
Automated feature engineering (Featuretools): didn't work out due to the large data size and limited compute resources.
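For reference, upsampling the minority class for one binary target could be done along these lines. How it was actually done is not described here, so random oversampling with replacement is an assumption, and the function name and seed are hypothetical.

```python
import random

def upsample_minority(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(rows), list(labels)  # nothing to sample from
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    keep = list(range(len(labels))) + extra  # original rows first, then duplicates
    return [rows[i] for i in keep], [labels[i] for i in keep]
```

If this is combined with cross-validation, it should be applied to the training folds only, so that duplicated rows never end up in the validation fold.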

Thank you

Have a nice day😊

