pavelzimin/microsoft_malware

microsoft_malware

Malware infection is a serious security problem that can harm consumers and businesses in many ways. The ability to predict the chance of an infection before it occurs would benefit both groups: a signal that an infection is likely would allow timely countermeasures. Software manufacturers could incorporate such a model into their products as an additional security measure aimed at preventing malware infection.

The dataset used for this project is provided by Microsoft and is available as part of a Kaggle competition (https://www.kaggle.com/c/microsoft-malware-prediction/data). The data include telemetry properties and infection records of Windows machines as produced by the Windows Defender software. Each row represents an individual machine and each column represents a variable. The “HasDetections” column of the train data indicates whether the machine is infected. The goal of the competition is to predict the value of the “HasDetections” column for the test data. The train data contains 8,921,483 rows and 83 columns, which is large enough to explore many Machine Learning techniques. The dataset was resampled so that the frequency of machines with malware detected approximately matches the frequency of machines without malware detected. As the evaluation metric I selected the area under the ROC curve, which was also used in the competition.

For this project, I performed feature engineering, feature selection and machine learning using logistic regression, Random Forest, gradient boosting and neural networks. The final outcome is a series of Jupyter notebooks with the code (including one summary notebook) located in the 'notebooks' folder, and a report with presentation slides located in the 'reports' folder.
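To make the evaluation metric concrete, here is a minimal sketch of how the area under the ROC curve can be computed from true labels and predicted scores. It is not taken from the project notebooks: the function name `roc_auc` and the toy data are assumptions for illustration, and the implementation uses the rank-sum (Mann-Whitney U) identity in plain Python rather than whatever library the notebooks rely on.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: iterable of 0/1 values (1 = malware detected).
    scores: predicted scores, higher meaning "more likely infected".
    """
    labels = list(labels)
    scores = list(scores)
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")

    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Rank the scores in ascending order; tied scores share their average rank.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # U statistic of the positive class, normalised by the number of
    # positive/negative pairs, equals the AUC.
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Toy example (made-up values, not from the dataset)
has_detections = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(has_detections, predicted))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of infected and clean machines. Because the metric depends only on how the scores are ordered, it is insensitive to their absolute calibration.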

Project Organization

├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   └── raw            <- The original, immutable data dump.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py

Project based on the cookiecutter data science project template. #cookiecutterdatascience
