Machine Learning - Project 1

Code for team 'answer42' in the EPFL Machine Learning Higgs challenge, 2020.

Software requirements

Python version: 3.6.9

Disclaimer: These requirements aren't strict, but they are recommended, as all the code was tested with them.

Data preprocessing

No data samples are filtered out.

Features are augmented and modified by applying different functions in the following order:

1. Numeric features are standardized.
2. The single nominal feature is one-hot encoded (3 new features are created and the original is dropped).
3. Missing values are replaced with the mean of their column. All missing values occur in numeric features, so they are set to 0, which is the column mean after standardization.
4. Sine and cosine are applied to all columns obtained after step 3; the new columns are appended to the existing ones.
5. The top 54 features obtained after step 4 are selected, using backward attribute selection evaluated with 10-fold cross-validation on logistic regression.
6. The features obtained after step 5 are multiplied with each other pairwise, and the products are appended to the features from step 5.
7. Polynomial degrees 2 and 3 of the features obtained after step 5 are appended to the features obtained after step 6.
8. For each feature that had missing values in the original data, a binary column is added indicating whether the value was missing before step 1; these columns are appended to those obtained after step 7.
9. A bias column is appended to the features obtained after step 8.
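
As an illustration only, here is a minimal NumPy sketch of the standardization, one-hot encoding, mean imputation, sin/cos augmentation and missing-value indicator steps. It is not the code from data_preprocessing.py. The function name `preprocess`, the `MISSING` sentinel value of -999, and the choice to compute the mean and standard deviation from observed values only are all assumptions made for this example.

```python
import numpy as np

MISSING = -999.0  # assumption: sentinel that marks a missing value


def preprocess(x, nominal_col, n_categories):
    """Hypothetical sketch: standardize, one-hot encode, impute, add sin/cos.

    Returns the augmented feature matrix and the missing-value indicators.
    """
    x = np.asarray(x, dtype=float)
    numeric_cols = [j for j in range(x.shape[1]) if j != nominal_col]
    num = x[:, numeric_cols]
    missing = num == MISSING  # remember where values were missing

    # Standardize numeric features using observed values only (assumption).
    masked = np.where(missing, np.nan, num)
    mean = np.nanmean(masked, axis=0)
    std = np.nanstd(masked, axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    num = (masked - mean) / std

    # One-hot encode the nominal column; the original column is dropped.
    codes = x[:, nominal_col].astype(int)
    one_hot = np.eye(n_categories)[codes]

    # Missing values become 0, the column mean after standardization.
    num = np.where(missing, 0.0, num)

    # Append sin and cos of every column next to the existing ones.
    base = np.hstack([num, one_hot])
    feats = np.hstack([base, np.sin(base), np.cos(base)])

    # One binary indicator per numeric column that had any missing value.
    indicators = missing[:, missing.any(axis=0)].astype(float)
    return feats, indicators
```

In the pipeline described above the indicator columns are only appended after the selection, multiplication and polynomial steps, which is why the sketch returns them separately.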

Model

The model is regularized logistic regression applied to the preprocessed features and trained with mini-batch gradient descent.

The model achieves a mean accuracy of 0.842 and an F1 score of 0.76 on the training set under 5-fold cross-validation, and a mean accuracy of 0.84 and a mean F1 score of 0.759 on the test set, with the following training parameters and hyperparameters:

Trade-off parameter: 10^(-9)
Learning rate: 0.04
Batch size: 2000
Number of epochs: 400
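
For readers who want to see the shape of the training loop, below is a minimal sketch of regularized logistic regression trained with mini-batch gradient descent, with the values listed above as defaults. It is not the code from implementations.py. The function names, the L2 form of the penalty, labels in {0, 1}, and the per-epoch shuffling are assumptions made for this example.

```python
import numpy as np


def sigmoid(z):
    # Clip the argument to keep exp() from overflowing.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))


def train_logreg(x, y, lambda_=1e-9, lr=0.04, batch_size=2000,
                 epochs=400, seed=0):
    """Hypothetical sketch: L2-regularized logistic regression, mini-batch GD.

    x is an (n, d) feature matrix that already contains the bias column,
    y is a vector of labels in {0, 1} (assumption).
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = x[idx], y[idx]
            # Gradient of the mean log-loss plus the L2 penalty term.
            grad = xb.T @ (sigmoid(xb @ w) - yb) / len(idx) + 2 * lambda_ * w
            w -= lr * grad
    return w


def predict(x, w):
    # Classify as positive when the predicted probability exceeds 0.5.
    return (sigmoid(x @ w) > 0.5).astype(float)
```

A call such as `w = train_logreg(features, labels)` would use the listed hyperparameters; `features` and `labels` are placeholder names.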

Authors' notes

Fairly late into experimenting with a large number of augmented features, we noticed that some of the implemented functions (notably cross-validation, mainly because of the stratification step, and pairwise multiplication) are not memory-optimized for large amounts of data and therefore require a lot of memory. This is especially noticeable in Google Colab, which we used to test different setups and models, as sessions would often crash for lack of RAM. We don't fully understand the effect this may have when running the code locally, but expect it to be rather slow if there is not enough RAM. The files provided to recreate the final submission are expected to require between 12 and 16 GB of memory.
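
To give a sense of why pairwise multiplication is memory-hungry, the sketch below counts the product columns and shows one possible way to build them without large temporary arrays. It is not the implementation used in this repository. The function names, the restriction to distinct pairs (i < j), and the float32 output type are assumptions made for this example.

```python
import numpy as np


def n_pairs(d):
    # Number of distinct feature pairs (i < j) among d features.
    return d * (d - 1) // 2


def pairwise_products(x, dtype=np.float32):
    """Hypothetical sketch: products of all distinct column pairs of x.

    The output is allocated once and filled in place, so no intermediate
    array of the full output size is created along the way.
    """
    x = np.asarray(x)
    n, d = x.shape
    out = np.empty((n, n_pairs(d)), dtype=dtype)
    col = 0
    for i in range(d - 1):
        width = d - 1 - i
        # Multiply column i by every later column, writing straight into out.
        np.multiply(x[:, i:i + 1], x[:, i + 1:], out=out[:, col:col + width])
        col += width
    return out
```

With the 54 selected features, `n_pairs(54)` gives 1431 product columns, so the output alone takes the number of rows times 1431 times the item size in bytes.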

Authors

Andrei Atanov
Valentina Shumovskaia
Miloš Vujasinović

