# AI-based resource scaling

## Case scenario

You and your team have been assigned to a project. The business idea is to reduce the carbon footprint by saving resources. Your initial investigation shows that the majority of applications deployed in the cloud have high- and low-traffic hours, but their resources are sized to support peak times. That means there are hours when the resources allocated to the application are not utilized.

You want to develop a model which will allow automatic scaling of these resources. For that purpose, your machine learning model should correctly predict the CPU usage.

## Data

An extract of the data for a sample application has been provided to you. In the `data` folder you will find 2 files: `train.csv` and `test.csv`. You should conduct your exploration and model building on the `train.csv` file.
Once you are happy with the model, use the `test.csv` file to predict the `cpu_usage`. You should store your predictions in a new CSV file called `<your_name>.csv`. The new file should have the following attributes:
- `id` column
- `timestamp` column
- `cpu_usage` column which should hold your predictions
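
The file layout described above could be produced roughly as follows. This is only a sketch: the two-row frame, the placeholder predictions, and the file name `your_name.csv` are made up for illustration, and the file is written to a temporary folder rather than to `data`.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for data/test.csv (illustration only)
df_test = pd.DataFrame(
    {
        "id": [101, 102],
        "timestamp": ["2023-02-01 00-00-00", "2023-02-01 01-00-00"],
        "number_of_requests": [250, 900],
    }
)
predictions = [22.5, 55.0]  # placeholder values, not real model output

# Keep only the required columns and attach the predictions
submission = df_test[["id", "timestamp"]].copy()
submission["cpu_usage"] = predictions

# Written to a temporary folder here; the assignment expects data/<your_name>.csv
out_path = os.path.join(tempfile.mkdtemp(), "your_name.csv")
submission.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column

# Read the file back to confirm its shape
check = pd.read_csv(out_path)
```

Reading the file back, as in the last line, is a cheap way to confirm that it contains exactly the three expected columns.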

### Data explanation

- `id` - identifier of the record
- `timestamp` - timestamp in the format YYYY-MM-DD HH-MM-SS
- `number_of_requests` - number of requests the application received in the given time
- `number_of_errors` - number of errors that the application logged in the given time
- `response_time` - cumulative time the application took to respond to a request, in milliseconds
- `cpu_cores` - number of CPU cores allocated to the application at a given time (maximum 8 are available)
- `memory_usage` - memory allocation in a given time, in percent
- `cpu_usage` - cpu allocation in a given time, in percent

## Your delivery

At the end of the day, you should provide us with your code in which you demonstrate that you followed these steps:
1. Data load
2. Data cleaning
3. Data Exploration
4. Data Modeling and validation
5. Prediction

The submission should be made as a pull request (PR) on GitHub to this repository. **Please use the branch with your name (not the main branch).** The PR should contain the `<your_name>.csv` file stored in the `data` folder and your code stored either in a Jupyter notebook or a Python module.
We have prepared this notebook to help you with the exercise; however, you are not obliged to use it.

# Evaluation

This exercise does not have a single correct solution. The problem can be approached in multiple ways. To evaluate how well your model performs, we will use the root mean squared error (RMSE) metric. You can learn more about this metric here: https://en.wikipedia.org/wiki/Root-mean-square_deviation
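
As a minimal sketch, RMSE can be computed with NumPy as shown here. The helper name `rmse` and the sample values are illustrative assumptions, not part of the assignment.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up CPU usage values in percent, for illustration only
actual = [50.0, 60.0, 70.0]
predicted = [52.0, 58.0, 71.0]

# Errors are 2, -2 and 1, so the mean squared error is (4 + 4 + 1) / 3 = 3
score = rmse(actual, predicted)
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones. Lower values are better.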

# Get started

## Getting this to your local environment

Make sure you have an account at github.com. We suggest you fork this repository to your own space. For a quick reference to the right git commands, see https://docs.github.com/en/get-started/quickstart/git-cheatsheet, or simply use GitHub Desktop: https://desktop.github.com/.
Before you start working, make sure that your work can be reproduced later on a different machine.

Hint: define the environment for your project along with all dependencies. Make sure that any random element you use in your code is started from the same seed value.
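
One possible way to fix the random elements is sketched below. The seed value `42` is arbitrary and the variable names are illustrative.

```python
import random

import numpy as np

SEED = 42  # arbitrary; any fixed integer works

# Seed both the standard library and NumPy generators
random.seed(SEED)
np.random.seed(SEED)
first_draw = np.random.rand(3)

# Re-seeding with the same value reproduces the same sequence
np.random.seed(SEED)
second_draw = np.random.rand(3)
```

scikit-learn estimators and splitters that involve randomness generally accept a `random_state` argument, which can be set to the same fixed value. For the environment itself, a `requirements.txt` or `environment.yml` file with pinned versions is one common option.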

## Git cheatsheet

If you are familiar with git/github and you know your way around, you can skip this section. 

### Basic Github Flow
The following [video](https://ibm.box.com/s/dvym4y5ktbcw8sdv02hecfs5wwe0dn22) describes the basic GitHub workflow: how to fork a repository, clone it, make some changes, push the changes to the remote repo, and create a pull request against the original repository.

### Cloning repo
Before you can clone the repo, you need either a GitHub token or an SSH key. If you do not have one set up, please follow this [guide](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account).

### Usual workflow
Your usual workflow might look like this:
1. Fork the repo and clone it to your local machine (first part of the video).
2. Develop your solution, periodically committing changes when you reach a milestone.
3. Push your changes to the remote server when you are finished, or whenever you want a backup copy there just in case :-).
4. Once you are done with your solution, create a pull request as shown in the second part of the video. **Please remember to create the pull request against the branch with your name (do not use the main branch).**

### Useful git commands
Here are some useful git commands:
* ```git clone <repo url>``` - clone repo from remote location to local directory
* ```git add <file|folder>``` - stage your changes 
* ```git commit -m "commit message"``` - commit your changes to local git repo
* ```git push``` - push changes to remote git repo


## 1. Load the data

In [3]:
# Your code goes here
import pandas as pd

df_test = pd.read_csv('data/test.csv')
df_train = pd.read_csv('data/train.csv')

## 2. Data cleaning

In this step you want to make sure that the data you work with is loaded correctly, that it does not contain any strange values, and that you are not missing any important records. You can read more about this step here: https://en.wikipedia.org/wiki/Data_cleansing
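
A few typical checks are sketched below on a tiny made-up frame that stands in for `train.csv`. The sample rows are invented. The valid ranges are assumptions derived from the data explanation (percentages between 0 and 100, at most 8 cores), and the lower bound of 1 core is a guess. Whether to drop or repair suspicious rows is a decision for your own analysis.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for data/train.csv
df = pd.DataFrame(
    {
        "id": [1, 2, 2, 3, 4],
        "timestamp": [
            "2023-01-01 00-00-00",
            "2023-01-01 01-00-00",
            "2023-01-01 01-00-00",
            "not a timestamp",
            "2023-01-01 03-00-00",
        ],
        "cpu_cores": [2, 4, 4, 8, 16],
        "cpu_usage": [35.0, 60.0, 60.0, np.nan, 140.0],
    }
)

# 1. Drop exact duplicate rows
df = df.drop_duplicates()

# 2. Parse timestamps in the documented format; unparseable values become NaT
df["timestamp"] = pd.to_datetime(
    df["timestamp"], format="%Y-%m-%d %H-%M-%S", errors="coerce"
)

# 3. Count missing values per column
missing = df.isna().sum()

# 4. Flag values outside the assumed valid ranges (NaN is flagged as well)
bad_cores = ~df["cpu_cores"].between(1, 8)
bad_usage = ~df["cpu_usage"].between(0, 100)

# Keep only rows that pass every check
clean = df[~bad_cores & ~bad_usage & df["timestamp"].notna()]
```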

In [None]:
# Your code goes here



## 3. Data exploration

This may be one of the most important steps in your analysis. Your objective is to explore patterns in the data that will later drive your decisions about the suitable prediction model. You can read more about this step here: https://en.wikipedia.org/wiki/Data_exploration.

There are many visualization libraries in Python which can help you visualize and better understand the relationships in the data. Two of the most widely used are `matplotlib` (https://matplotlib.org/) and `seaborn` (https://seaborn.pydata.org/).

At the end of this step, you should be able to make some important decisions. For example, will you include all features, or only a subset? Will you create new features? How will you treat your target variable? Will you need to encode the data in a different way?
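
As one hedged illustration of such questions, the sketch below derives an `hour` feature from the timestamp, averages the target per hour, and computes a correlation. The data is synthetic, generated so that traffic is higher between 09:00 and 18:00. The real data may look quite different.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic hourly data for one week (illustration only)
hours = np.arange(7 * 24)
timestamps = pd.Timestamp("2023-01-02") + pd.to_timedelta(hours, unit="h")
hour_of_day = np.asarray(timestamps.hour)

# More requests during assumed working hours, plus some noise
is_peak = (hour_of_day >= 9) & (hour_of_day < 18)
requests = np.where(is_peak, 1000, 200) + rng.integers(0, 50, size=hours.size)
cpu_usage = 10 + 0.05 * requests + rng.normal(0, 1, size=hours.size)

df = pd.DataFrame(
    {
        "timestamp": timestamps,
        "number_of_requests": requests,
        "cpu_usage": cpu_usage,
    }
)

# New feature: hour of day extracted from the timestamp
df["hour"] = df["timestamp"].dt.hour

# Average target per hour reveals the daily pattern
hourly_mean = df.groupby("hour")["cpu_usage"].mean()

# Pearson correlation between one feature and the target
corr = df["number_of_requests"].corr(df["cpu_usage"])
```

With `matplotlib` installed, `hourly_mean.plot()` would draw the daily profile as a line chart.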

In [None]:
# Your code goes here



## 4. Data Modeling and Validation

At this point, you should have an idea about your data. That means you are ready for the modeling part. Based on your exploration, you should choose the right type of model. Since we are operating in a supervised domain (i.e. we know what we want to predict), your main decision should be whether to use a classification, regression, or time-series model.

Scikit-learn (https://scikit-learn.org/stable/index.html) is a very broad library for machine learning practitioners. The documentation provides examples for different machine learning problems. Feel free to check it out before you choose your final implementation. If you are not sure which model will work best, you can also train multiple models and choose the best among them.

Do not forget that you need to validate your model, especially if you train multiple models to choose from. There are several ways to do the validation (read more here: https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets), but remember that your final solution will be evaluated on the RMSE metric.
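
One simple validation scheme is a time-ordered holdout scored with RMSE, sketched below. The data is synthetic, and `LinearRegression` is only a placeholder model chosen for brevity, not a recommendation. The 80/20 split ratio is likewise an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic, time-ordered data: CPU usage grows with the number of requests
n = 200
requests = rng.integers(100, 1000, size=n).astype(float)
cpu_usage = 5.0 + 0.08 * requests + rng.normal(0.0, 2.0, size=n)

X = requests.reshape(-1, 1)
y = cpu_usage

# Hold out the last 20% without shuffling, so validation rows follow training rows
split = int(n * 0.8)
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

# Fit on the training part only, then score on the held-out part
model = LinearRegression().fit(X_train, y_train)
val_rmse = float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))

# Baseline for comparison: always predict the training mean
baseline_rmse = float(np.sqrt(np.mean((y_val - y_train.mean()) ** 2)))
```

Comparing `val_rmse` against a naive baseline such as `baseline_rmse` shows whether the model has learned anything useful. The same validation split can be reused to compare several candidate models.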

In [None]:
# Your code goes here



## 5. Prediction

Well done! You are almost finished with the assignment. In this last step you want to call your model's `predict` method on the data from the `test.csv` file. Remember, any transformation or preprocessing steps you applied to the training data need to be applied to this dataset too.
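
One way to keep the preprocessing consistent is to put it in a single function and call it on both datasets. The sketch below uses a hypothetical helper named `preprocess`, made-up rows, and an `hour` feature chosen only as an example.

```python
import pandas as pd


def preprocess(df):
    """Apply the same feature steps to any frame (hypothetical helper)."""
    out = df.copy()
    out["timestamp"] = pd.to_datetime(
        out["timestamp"], format="%Y-%m-%d %H-%M-%S"
    )
    out["hour"] = out["timestamp"].dt.hour
    return out


# Made-up rows standing in for train.csv and test.csv
train = pd.DataFrame(
    {
        "timestamp": ["2023-01-01 09-00-00"],
        "number_of_requests": [900],
        "cpu_usage": [55.0],
    }
)
test = pd.DataFrame(
    {
        "timestamp": ["2023-02-01 21-00-00"],
        "number_of_requests": [250],
    }
)

# The same function runs on both frames, so their features match
train_prepared = preprocess(train)
test_prepared = preprocess(test)

# Columns the model would be trained on and later asked to predict from
feature_columns = ["hour", "number_of_requests"]
```

A fitted model would then be given `test_prepared[feature_columns]`, the same columns it saw during training.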

In [None]:
# Your code goes here



<div align="center"> Well done! </div>
You have completed all the steps necessary for the assignment. Don't forget to submit your solution according to the instructions.
We hope you have enjoyed this and we thank you for your time.