# AI-based resource scaling

## Case scenario

You and your team have been assigned to a project. The business idea is to reduce the carbon footprint by saving resources. Your initial investigation shows that the majority of applications deployed in the cloud have high- and low-traffic hours, but their resources are sized to support the peak times. That means there are hours when the resources allocated to an application are not fully utilized.

You want to develop a model which will allow automatic scaling of these resources. For that purpose, your machine learning model should correctly predict the CPU usage.

## Data

An extract of the data for a sample application has been provided to you. In the `data` folder you will find two files: `train.csv` and `test.csv`. You should conduct your exploration and model building on the `train.csv` file.
Once you are happy with the model, use the `test.csv` file to predict the `cpu_usage`. Store your predictions in a new CSV file called `<your_name>.csv`. The new file should have the following columns:
- `id` column
- `timestamp` column
- `cpu_usage` column which should hold your predictions

### Data explanation

- `id` - identifier of the record
- `timestamp` - timestamp in the format YYYY-MM-DD HH-MM-SS
- `number_of_requests` - number of requests the application received in the given time
- `number_of_errors` - number of errors that the application logged in the given time
- `response_time` - cumulative time the application took to respond to a request, in milliseconds
- `cpu_cores` - number of CPU cores allocated to the application at a given time (maximum 8 are available)
- `memory_usage` - memory allocation in a given time, in percent
- `cpu_usage` - CPU allocation in a given time, in percent
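
Because the scenario is about high- and low-traffic hours, the `timestamp` column can be turned into time-of-day features. The following is a minimal sketch, assuming the timestamps literally follow the `YYYY-MM-DD HH-MM-SS` pattern described above (dashes in the time part). The sample rows and the `hour` and `day_of_week` column names are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from data/train.csv.
sample = pd.DataFrame({
    "timestamp": ["2022-03-01 08-15-00", "2022-03-05 23-45-30"],
    "cpu_usage": [47.2, 12.9],
})

# Parse using the format stated in the data explanation.
# Adjust the format string if the file actually uses colons in the time part.
sample["timestamp"] = pd.to_datetime(sample["timestamp"], format="%Y-%m-%d %H-%M-%S")

# Derive simple calendar features that can capture peak and off-peak hours.
sample["hour"] = sample["timestamp"].dt.hour
sample["day_of_week"] = sample["timestamp"].dt.dayofweek  # Monday=0, Sunday=6

print(sample[["timestamp", "hour", "day_of_week"]])
```

Whether such features help depends on what the exploration step reveals about the data.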

## Your delivery

At the end of the day, you should provide us with your code in which you demonstrate that you followed these steps:
1. Data load
2. Data cleaning
3. Data Exploration
4. Data Modeling and validation
5. Prediction

The submission should be done as a pull request (PR) on GitHub to this repository. **Please use a branch with your name (not the main branch)**. The PR should contain the `<your_name>.csv` file stored in the `data` folder and your code, stored either in a Jupyter notebook or in a Python module.
We have prepared this notebook to help you with the exercise; however, you are not obliged to use it.

# Evaluation

This exercise does not have a single correct solution. The problem can be approached in multiple ways. To evaluate how well your model performs, we will use the root mean squared error (RMSE) metric. You can learn more about this metric here: https://en.wikipedia.org/wiki/Root-mean-square_deviation
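
As a quick illustration of the metric, RMSE is the square root of the mean of the squared differences between actual and predicted values. The sketch below uses a hypothetical helper named `rmse` and made-up numbers; it is not the evaluation script itself.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up CPU usage values (percent) for illustration only.
actual = [50.0, 48.0, 52.0]
predicted = [49.0, 50.0, 52.0]

# Errors are 1, -2 and 0, so the mean squared error is 5/3.
print(rmse(actual, predicted))
```

Lower values are better, and the result is expressed in the same unit as `cpu_usage`.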

# Get started

## Getting this to your local environment

Make sure you have an account at github.com. We suggest you fork this repository to your own space. For a quick reference to the right git commands, see https://docs.github.com/en/get-started/quickstart/git-cheatsheet, or simply use GitHub Desktop: https://desktop.github.com/.
Before you start working, make sure that your work can be reproduced later on a different machine.

Hint: define the environment for your project along with all dependencies. Make sure that any random element you use in your code is started from the same seed value.
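
One possible way to follow the seeding part of this hint is sketched below. The value `42` is arbitrary, and the variable names are illustrative. Many scikit-learn estimators and `train_test_split` also accept a `random_state` argument that serves the same purpose.

```python
import random

import numpy as np

SEED = 42  # arbitrary value; any fixed integer works

# Seed Python's and NumPy's global generators.
random.seed(SEED)
np.random.seed(SEED)

# A dedicated generator seeded with the same value draws the same numbers every run.
rng = np.random.default_rng(SEED)
first_draw = rng.random(3)

# Re-creating the generator with the same seed reproduces the same numbers.
second_draw = np.random.default_rng(SEED).random(3)
print(np.array_equal(first_draw, second_draw))
```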

## Git cheatsheet

If you are familiar with git/github and you know your way around, you can skip this section. 

### Basic Github Flow
The following [video](https://ibm.box.com/s/dvym4y5ktbcw8sdv02hecfs5wwe0dn22) describes the basic GitHub workflow: how to fork a repository, clone it, make some changes, push the changes to the remote repo and create a pull request against the original repository.

### Cloning repo
Before you can clone the repo, you need either a GitHub token or an SSH key. If you do not have one set up, please follow this [guide](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account).

### Usual workflow
Your usual workflow might look like this:
1. Fork and clone the repo to your local machine (first part of the video).
2. Develop your solution, periodically committing changes when you reach a milestone.
3. Push your changes to the remote server when you are finished, or whenever you just want a copy on the remote server just in case :-).
4. Once you are done with your solution, create a pull request as shown in the second part of the video. **Please remember to create the pull request against the branch with your name (do not use the main branch).**

### Useful git commands
Here are some useful git commands:
* ```git clone <repo url>``` - clone repo from remote location to local directory
* ```git add <file|folder>``` - stage your changes 
* ```git commit -m "commit message"``` - commit your changes to local git repo
* ```git push``` - push changes to remote git repo


## 1. Load the data

In [3]:
# Your code goes here
import pandas as pd
import numpy as np

df_test = pd.read_csv('data/test.csv')
df_train = pd.read_csv('data/train.csv')

## 2. Data cleaning

In this step you want to make sure that the data that you work with is loaded correctly, that it does not contain any strange values or that you are not missing any important records. You can read more about this step here: https://en.wikipedia.org/wiki/Data_cleansing
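
Before filling or dropping anything, it can help to quantify what is actually wrong with the data. The sketch below uses an invented toy frame (not the real `train.csv`) to count missing values, flag rows that violate the documented limit of 8 CPU cores, and show median imputation as one alternative to filling with 0. Which strategy is appropriate for the real data is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv; all values are invented.
toy = pd.DataFrame({
    "number_of_requests": [120, 95, np.nan, 110],
    "number_of_errors": [0.0, np.nan, 1.0, 0.0],
    "cpu_cores": [2, 2, 4, 16],  # 16 exceeds the documented maximum of 8
})

# Count missing values per column before deciding how to handle them.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Flag rows that violate a documented constraint (at most 8 cores).
invalid_cores = toy[toy["cpu_cores"] > 8]
print(invalid_cores)

# One possible imputation: the column median instead of 0, so that missing
# request counts are not mistaken for idle periods.
cleaned = toy.fillna(toy.median(numeric_only=True))
print(cleaned)
```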

In [4]:
# Your code goes here
# Replace NaN values with 0 in both datasets
df_test = df_test.fillna(0)
df_train = df_train.fillna(0)

In [5]:
df_test.isnull().values.any()

False

In [6]:
df_train.isnull().values.any() #check for NaN values after cleaning

False

In [7]:
df_test.rename(columns={'Unnamed: 0': 'id'}, inplace=True)  # give the unnamed first column the name 'id'

In [8]:
df_test['id'].is_unique

True

In [9]:
df_train['id'].is_unique

True

In [10]:
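# Drop the columns that are not used as model features.
# Note: the submission file still requires `id` and `timestamp`.
drop_c = ['id', 'timestamp']
df_test.drop(columns=drop_c, inplace=True)
df_train.drop(columns=drop_c, inplace=True)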

In [11]:
df_train.dtypes.value_counts() 

float64    4
int64      2
dtype: int64

In [12]:
df_test.dtypes.value_counts() 

float64    3
int64      2
dtype: int64

In [44]:
df_test.describe()

| | number_of_requests | number_of_errors | response_time | cpu_cores | memory_usage |
|---|---|---|---|---|---|
| count | 17200.0 | 17200.0 | 17200.0 | 17200.0 | 17200.0 |
| mean | 5201.38564 | 2.491919 | 12307.8229 | 3.363488 | 0.503066 |
| std | 10483.977459 | 22.124663 | 2037.352316 | 2.20217 | 0.289226 |
| min | 0.0 | 0.0 | 0.0 | 2.0 | 5.9e-05 |
| 25% | 249.0 | 0.0 | 10985.987705 | 2.0 | 0.253181 |
| 50% | 498.0 | 0.0 | 11948.646584 | 2.0 | 0.507247 |
| 75% | 2955.5 | 0.0 | 13739.689722 | 4.0 | 0.751626 |
| max | 49998.0 | 690.0 | 16773.110102 | 8.0 | 0.999834 |


In [45]:
df_train

| | number_of_requests | number_of_errors | response_time | cpu_cores | memory_usage | cpu_usage |
|---|---|---|---|---|---|---|
| 0 | 9758 | 0.0 | 14742.755324 | 6 | 0.347599 | 48.498589 |
| 1 | 9967 | 2.0 | 14897.201621 | 6 | 0.756413 | 53.355349 |
| 2 | 5210 | 1.0 | 14009.132817 | 6 | 0.698468 | 46.573140 |
| 3 | 7361 | 1.0 | 14716.491537 | 6 | 0.137349 | 46.624516 |
| 4 | 9667 | 0.0 | 15148.657690 | 6 | 0.157933 | 49.455284 |
| ... | ... | ... | ... | ... | ... | ... |
| 51596 | 9303 | 2.0 | 14376.052495 | 6 | 0.614495 | 51.684143 |
| 51597 | 5945 | 1.0 | 14373.898084 | 6 | 0.209691 | 47.571840 |
| 51598 | 6959 | 1.0 | 14768.549663 | 6 | 0.707395 | 47.585932 |
| 51599 | 8274 | 0.0 | 14352.171497 | 6 | 0.887683 | 48.178243 |


## 3. Data exploration

This may be one of the most important steps in your analysis. Your objective is to explore patterns in the data that will later drive your decisions about the suitable prediction model. You can read more about this step here: https://en.wikipedia.org/wiki/Data_exploration.

There are many visualization libraries in Python which can help you visualize and better understand the relationships in the data. Some of the most used ones are `matplotlib` (https://matplotlib.org/) and `seaborn` (https://seaborn.pydata.org/).

At the end of this step, you should be able to make some important decisions. For example, will you include all features, or only a subset? Will you create new features? How will you treat your target variable? Will you need to encode the data in a different way?
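
One simple starting point for these decisions is to rank the features by their correlation with the target. The sketch below runs on synthetic data in which `cpu_usage` is constructed to depend on `number_of_requests`. The relationship, the coefficients and the noise level are assumptions made for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
requests = rng.integers(0, 10_000, size=n)

toy = pd.DataFrame({
    "number_of_requests": requests,
    "memory_usage": rng.random(n),
    # Synthetic target that depends on the request count plus noise.
    "cpu_usage": 10 + 0.004 * requests + rng.normal(0, 1, size=n),
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr()["cpu_usage"]
    .drop("cpu_usage")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Correlation only captures linear relationships, so scatter plots or histograms from `matplotlib` or `seaborn` remain useful alongside it.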

In [14]:
# Your code goes here

#has_errors_train = np.where((df_train.number_of_errors >  0),1,df_train.number_of_errors) #encode number_of_errors train
#has_errors_test = np.where((df_test.number_of_errors >  0),1,df_test.number_of_errors) #encode number_of_errors test
#has_errors_train = {'has_error': has_errors_train}
#df_has_errors_train = pd.DataFrame(has_errors_train)
#has_errors_test = {'has_error': has_errors_test}
#df_has_errors_test = pd.DataFrame(has_errors_test)

In [16]:
#df_new_train.drop('number_of_errors',inplace=True, axis=1)
#df_new_test.drop('number_of_errors',inplace=True, axis=1)

In [17]:
#df_new_train = pd.concat([df_has_errors_train, df_new_train], axis=1)
#df_new_test = pd.concat([df_has_errors_test, df_new_test], axis=1)

In [18]:
df_new_train.describe()

| | has_error | number_of_requests | response_time | cpu_cores | memory_usage | cpu_usage |
|---|---|---|---|---|---|---|
| count | 51601.0 | 51601.0 | 51601.0 | 51601.0 | 51601.0 | 51601.0 |
| mean | 0.188989 | 0.104556 | 0.732122 | 0.420379 | 0.500593 | 0.476761 |
| std | 0.391503 | 0.211049 | 0.120264 | 0.275277 | 0.28837 | 0.07421 |
| min | 0.0 | 0.0 | 0.0 | 0.25 | 5e-06 | 0.0 |
| 25% | 0.0 | 0.005041 | 0.653422 | 0.25 | 0.249918 | 0.468391 |
| 50% | 0.0 | 0.009962 | 0.710186 | 0.25 | 0.502901 | 0.473184 |
| 75% | 0.0 | 0.058973 | 0.816401 | 0.5 | 0.750709 | 0.479614 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [33]:
df_new_train

| | has_error | number_of_requests | response_time | cpu_cores | memory_usage | cpu_usage |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.195203 | 0.876610 | 0.75 | 0.347608 | 0.484986 |
| 1 | 1.0 | 0.199384 | 0.885793 | 0.75 | 0.756432 | 0.533553 |
| 2 | 1.0 | 0.104223 | 0.832988 | 0.75 | 0.698485 | 0.465731 |
| 3 | 1.0 | 0.147252 | 0.875048 | 0.75 | 0.137353 | 0.466245 |
| 4 | 0.0 | 0.193383 | 0.900745 | 0.75 | 0.157937 | 0.494553 |
| ... | ... | ... | ... | ... | ... | ... |
| 51596 | 1.0 | 0.186101 | 0.854805 | 0.75 | 0.614510 | 0.516841 |
| 51597 | 1.0 | 0.118926 | 0.854677 | 0.75 | 0.209696 | 0.475718 |
| 51598 | 1.0 | 0.139211 | 0.878143 | 0.75 | 0.707412 | 0.475859 |
| 51599 | 0.0 | 0.165516 | 0.853385 | 0.75 | 0.887705 | 0.481782 |
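
The code that produced `df_new_train` is not shown in the cells above. The values are consistent with a binary `has_error` flag and with each numeric column being divided by its maximum (for example, 6 `cpu_cores` out of a maximum of 8 gives 0.75), but that is an inference from the output and not a confirmed description of the method used. A minimal sketch of such a transformation on invented rows:

```python
import pandas as pd

# Invented rows; the real data comes from data/train.csv.
toy = pd.DataFrame({
    "number_of_requests": [9758, 5210, 250],
    "number_of_errors": [0.0, 2.0, 1.0],
    "cpu_cores": [6, 6, 8],
})

scaled = pd.DataFrame({
    # 1.0 if any error was logged in the interval, else 0.0
    "has_error": (toy["number_of_errors"] > 0).astype(float),
    # Divide by the column maximum so the largest value becomes 1.0
    "number_of_requests": toy["number_of_requests"] / toy["number_of_requests"].max(),
    "cpu_cores": toy["cpu_cores"] / toy["cpu_cores"].max(),
})
print(scaled)
```

If a transformation like this is applied to the training data, the same divisors (taken from the training set) also need to be applied to the test data before prediction.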


## 4. Data Modeling and Validation

At this point, you should have an idea about your data. That means you are ready for the modeling part. Based on your exploration, you should choose the right type of model. Since we are operating in a supervised domain (i.e. we know what we want to predict), your main decision should be whether to use a classification, regression or time-series model.

Scikit-learn (https://scikit-learn.org/stable/index.html) is a very broad library for machine learning practitioners. The documentation provides examples for different machine learning problems. Feel free to check it out before you choose your final implementation. If you are not sure which model will work best, you can also train multiple models and choose the best among them.

Do not forget that you need to validate your model, especially if you train multiple models that you wish to choose from. There are several ways to do the validation (read more here: https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets), but remember that your final solution will be evaluated based on the RMSE metric.
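
A hold-out split is one of the simplest validation schemes. The sketch below uses synthetic data and a plain `LinearRegression` as a stand-in model (both are assumptions for illustration, not a recommendation), and reports RMSE on rows the model has not seen during training.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the features and cpu_usage of train.csv.
rng = np.random.default_rng(0)
X = rng.random((500, 3))
y = 40 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(0, 0.5, size=500)

# Hold out 20% of the rows; random_state makes the split reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# RMSE on the held-out rows, computed as the square root of the MSE.
val_rmse = float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))
print(f"validation RMSE: {val_rmse:.3f}")
```

For data ordered in time, a chronological split (for example with scikit-learn's `TimeSeriesSplit`) may give a more realistic estimate than a random one.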

In [23]:
# Your code goes here
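from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics

# Features: every column except the target
X = df_train.drop(columns='cpu_usage')
# Target as a 1-D Series; passing a single-column DataFrame makes scikit-learn warn about the shape of y
y = df_train['cpu_usage']

# Support vector regression with default hyperparameters
regr = svm.SVR()
regr.fit(X, y)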


SVR()


## 5. Prediction

Well done! You are almost finished with the assignment. In this last step you want to call the model's `predict` method on the data from the `test.csv` file. Remember that any transformation or preprocessing step you applied to the training data needs to be applied to this dataset too.
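
One way to make sure the same preprocessing reaches the test data is to bundle it with the model in a scikit-learn pipeline. The sketch below is an illustration on synthetic arrays. The choice of `StandardScaler` is an assumption and is not the preprocessing used elsewhere in this notebook.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features on very different scales, standing in for the real columns.
rng = np.random.default_rng(0)
X_train = rng.random((200, 4)) * [10_000, 5, 15_000, 8]
y_train = 30 + 0.002 * X_train[:, 0] + rng.normal(0, 1, size=200)
X_test = rng.random((50, 4)) * [10_000, 5, 15_000, 8]

# The scaler is fitted on the training data only; predict() re-applies
# the same scaling to whatever is passed in.
pipeline = make_pipeline(StandardScaler(), SVR())
pipeline.fit(X_train, y_train)

test_predictions = pipeline.predict(X_test)
print(test_predictions[:5])
```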

In [33]:
cpu_usage_t = regr.predict(df_test)

In [40]:
df_cpu_usage_t = pd.DataFrame(cpu_usage_t)
df_cpu_usage_t.rename(columns={0:'cpu_usage'}, inplace=True)

In [41]:
df_cpu_usage_t

| | cpu_usage |
|---|---|
| 0 | 48.837469 |
| 1 | 48.348246 |
| 2 | 48.586795 |
| 3 | 48.740181 |
| 4 | 48.844429 |
| ... | ... |
| 17195 | 48.873975 |
| 17196 | 49.028965 |
| 17197 | 48.651894 |
| 17198 | 49.065346 |


In [42]:
df_new_test = pd.concat([df_test, df_cpu_usage_t], axis=1)
df_new_test

| | number_of_requests | number_of_errors | response_time | cpu_cores | memory_usage | cpu_usage |
|---|---|---|---|---|---|---|
| 0 | 6013 | 1.0 | 13839.534811 | 6 | 0.332282 | 48.837469 |
| 1 | 8834 | 0.0 | 14657.609451 | 6 | 0.931217 | 48.348246 |
| 2 | 7971 | 0.0 | 14137.221618 | 6 | 0.803488 | 48.586795 |
| 3 | 7653 | 0.0 | 14257.821995 | 6 | 0.486343 | 48.740181 |
| 4 | 7313 | 0.0 | 14281.586371 | 6 | 0.369045 | 48.844429 |
| ... | ... | ... | ... | ... | ... | ... |
| 17195 | 6401 | 0.0 | 13977.819765 | 6 | 0.863818 | 48.873975 |
| 17196 | 6525 | 1.0 | 14401.515182 | 6 | 0.300279 | 49.028965 |
| 17197 | 8342 | 2.0 | 14782.197038 | 6 | 0.165270 | 48.651894 |
| 17198 | 5355 | 1.0 | 14421.113212 | 6 | 0.377384 | 49.065346 |


In [44]:
df_new_test.to_csv('data/melekhova.csv')
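
The delivery instructions ask for a file with `id`, `timestamp` and `cpu_usage` columns, whereas the call above writes the feature columns plus pandas' default index. A minimal sketch of building a file in the requested layout is shown below. The rows, the `test_raw` and `submission` names and the prediction values are placeholders, and an in-memory buffer is used instead of a real path.

```python
import io

import pandas as pd

# Invented rows standing in for data/test.csv before any columns are dropped.
test_raw = pd.DataFrame({
    "id": [101, 102, 103],
    "timestamp": ["2022-03-01 08-15-00", "2022-03-01 08-16-00", "2022-03-01 08-17-00"],
    "number_of_requests": [6013, 8834, 7971],
})
predictions = [48.84, 48.35, 48.59]  # placeholder for the model's predictions

# Keep only the identifying columns and attach the predictions.
submission = test_raw[["id", "timestamp"]].copy()
submission["cpu_usage"] = predictions

# index=False prevents pandas from writing an extra unnamed index column.
buffer = io.StringIO()  # in practice, pass a path such as 'data/<your_name>.csv'
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

This relies on the predictions being in the same row order as the test file, which holds as long as no rows are dropped or shuffled between loading and predicting.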

<div align="center"> Well done! </div>
You have completed all the steps necessary for the assignment. Don't forget to submit your solution according to instructions.
We hope you have enjoyed this and we thank you for your time.