microsoft_malware

Malware infection is a serious security problem that can harm consumers and businesses in many ways. Predicting the likelihood of infection before it occurs would benefit both: a signal that infection is likely allows countermeasures to be applied in time. Software manufacturers could also incorporate such a model into their products as an additional security measure aimed at preventing infection.

The dataset used for this project is provided by Microsoft and is available as part of the Kaggle competition (https://www.kaggle.com/c/microsoft-malware-prediction/data). It contains telemetry properties and infection records of Windows machines, as produced by the Windows Defender software. Each row represents an individual machine and each column represents a variable. The “HasDetections” column of the train data indicates whether the machine is infected, and the goal of the competition is to predict the value of “HasDetections” for the test data. The train data contains 8,921,483 rows and 83 columns, which is large enough to explore many machine learning techniques. The dataset was resampled so that the frequency of machines with malware detected approximately matches the frequency of machines without.

As the evaluation metric I selected the area under the ROC curve, which was also used in the competition. For this project I performed feature engineering, feature selection and machine learning using logistic regression, Random Forest, gradient boosting and neural networks. The final outcome is a series of Jupyter notebooks with the code (including one summary notebook), located in the 'notebooks' folder, and a report with presentation slides, located in the 'reports' folder.
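To make the evaluation metric concrete, here is a minimal sketch of how the area under the ROC curve can be computed from true labels and predicted scores. It is illustrative only and is not taken from the project notebooks, which may well rely on a library routine instead. The function name `roc_auc` and the toy inputs are assumptions. The sketch uses the rank-sum identity: the AUC equals the probability that a randomly chosen infected machine receives a higher score than a randomly chosen clean one, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: iterable of 0/1 values (1 = malware detected, as in "HasDetections")
    scores: iterable of predicted scores, higher meaning more likely infected
    Note: `roc_auc` is a hypothetical helper written for this illustration.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n = len(pairs)

    # Assign 1-based ranks, giving tied scores the average of their ranks.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    # U statistic for the positive class, normalised by the number of
    # positive/negative pairs.
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy data, not drawn from the Kaggle dataset.
    y_true = [0, 0, 1, 1]
    y_score = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(y_true, y_score))
```

Because the dataset is roughly balanced between the two classes, a model that scores machines at random would be expected to land near 0.5 on this metric, and a perfect ranking gives 1.0.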

Project Organization

├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   └── raw            <- The original, immutable data dump.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py

Project based on the cookiecutter data science project template. #cookiecutterdatascience

pavelzimin/microsoft_malware
