# Scikit-learn library, overview of useful functions
In the previous lesson, we avoided programming as much as possible. Now you will want to try everything yourself. So that you can program the task you have thought out for your homework, we will go through the most important functions you will need.
We will mainly use the [Scikit-learn](https://scikit-learn.org) library and, of course, pandas. We will walk through everything on an example.

In [1]:
import pandas as pd
import numpy as np
np.random.seed(42)

## Loading and preparing data
<table> 
    <tr><td>
        
☑ selection of input variables
☑ division into training and test data        
☑ missing values
☑ categorical values
☑ scaling / standardization of values </td></tr>
</table> 

You will always need to prepare the data first. You should already be able to clean your data with the pandas library, so we will focus only on things that are specific to machine learning.
First, we read the data.

In [2]:
df_platy = pd.read_csv("static/salaries.csv", index_col=0)
df_platy.sample(10)

Unnamed: 0,rank,discipline,yrs.since.phd,yrs.service,sex,salary
66,AssocProf,B,9,8,Male,100522
115,Prof,A,12,0,Female,105000
17,Prof,B,19,20,Male,101000
142,AssocProf,A,15,10,Male,81500
157,AssocProf,B,12,18,Male,113341
127,Prof,A,28,26,Male,155500
141,AssocProf,A,14,8,Male,100102
31,Prof,B,20,4,Male,132261
19,Prof,A,37,23,Male,124750
168,Prof,B,18,19,Male,130664


For prediction we will use `rank`, `discipline`, `yrs.since.phd`, `yrs.service` and `sex` as features; we will predict the value of `salary`.
For learning, we need to convert all values to numbers (`float`). If the data contain missing values, the easiest solution is to drop those rows. (Bonus: if you have data with more missing values, see the options in [sklearn.impute](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute).)
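
The two options for missing values mentioned above could look roughly like this. This is only a minimal sketch on a made-up toy table (the salaries data are not used here); `SimpleImputer` with the mean strategy is just one of several choices in `sklearn.impute`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with two missing entries (not the salaries dataset).
df = pd.DataFrame({
    "yrs.service": [3.0, np.nan, 18.0, 41.0],
    "salary": [79750.0, 115000.0, np.nan, 141500.0],
})

# Option 1: drop every row that contains at least one missing value.
df_dropped = df.dropna()

# Option 2: replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_dropped.shape)
print(df_imputed)
```

Dropping rows is simple but throws data away; imputation keeps every row at the cost of inventing values.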
Categorical values need special handling. Columns containing Boolean values or only two distinct values (such as male / female) can easily be converted to the values $\{0, 1\}$.

In [3]:
df_platy = df_platy.replace({"Male": 0, "Female": 1})
df_platy.sample(10)

Unnamed: 0,rank,discipline,yrs.since.phd,yrs.service,sex,salary
10,Prof,B,18,18,1,129000
43,Prof,B,40,27,0,101299
34,AsstProf,B,4,2,0,80225
118,Prof,A,39,36,0,117515
5,Prof,B,40,41,0,141500
150,AsstProf,B,4,3,0,95079
56,AssocProf,B,14,5,0,83900
15,Prof,B,20,18,0,104800
170,Prof,B,25,18,0,181257
114,Prof,A,37,37,0,104279


For categorical variables with more than two options we will use so-called *one-hot encoding*.
For example, the `rank` column contains the values `Prof`, `AsstProf` and `AssocProf`. One-hot encoding needs three columns:

Original value | Code
--- | ---
Prof | 1 0 0
AsstProf | 0 1 0
AssocProf | 0 0 1


The Scikit-learn library offers [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder), but when working with pandas we can use the [get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function, which does the same job. (The name *dummies* refers to the auxiliary variables (columns) it creates, which are called *dummy variables*.)

In [4]:
df_platy = pd.get_dummies(df_platy)
df_platy

Unnamed: 0,yrs.since.phd,yrs.service,sex,salary,rank_AssocProf,rank_AsstProf,rank_Prof,discipline_A,discipline_B
1,19,18,0,139750,0,0,1,0,1
2,20,16,0,173200,0,0,1,0,1
3,4,3,0,79750,0,1,0,0,1
4,45,39,0,115000,0,0,1,0,1
5,40,41,0,141500,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...
194,19,19,0,86250,1,0,0,0,1
195,48,53,0,90000,1,0,0,0,1
196,9,7,0,113600,1,0,0,0,1
197,4,4,0,92700,0,1,0,0,1
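
For comparison, the same encoding can be sketched with `OneHotEncoder`. The toy column below is made up and only mirrors the values of `rank`; the default sparse output is converted with `toarray()` to keep the example readable.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column mirroring the values of `rank`.
ranks = pd.DataFrame({"rank": ["Prof", "AsstProf", "AssocProf", "Prof"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(ranks).toarray()

# Categories are sorted alphabetically; each one gets its own column.
print(encoder.categories_[0])
print(encoded)
```

The columns come out in alphabetical order of the categories, the same order as the `rank_...` columns produced by `get_dummies` above.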


The last preprocessing step is rescaling the values. This, however, is an intervention that uses statistics computed from the whole data set. It would therefore not be fair to include the part that will later be used for testing. It's high time to split off the test set.

## Creating a training and test set

In machine learning theory, model inputs (features, input variables) are typically denoted by the letter `X` and outputs by the letter `y`. Many programmers use the same names for variables in code. `X` represents a *matrix* (or table), where each row corresponds to one data sample and each column to one feature (input variable). `y` is a vector, i.e. a single column with the response.
The [pop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pop.html) method can be used to extract the response. Its disadvantage is that the cell cannot be run repeatedly: after the first run the column is no longer in the data frame.

In [5]:
#y = df_platy.pop("salary")
#X = df_platy

y = df_platy["salary"]
X = df_platy.drop(columns=["salary"])

print(X.columns)
print(y.name)

Index(['yrs.since.phd', 'yrs.service', 'sex', 'rank_AssocProf',
       'rank_AsstProf', 'rank_Prof', 'discipline_A', 'discipline_B'],
      dtype='object')
salary


In [6]:
X.head()

Unnamed: 0,yrs.since.phd,yrs.service,sex,rank_AssocProf,rank_AsstProf,rank_Prof,discipline_A,discipline_B
1,19,18,0,0,0,1,0,1
2,20,16,0,0,0,1,0,1
3,4,3,0,0,1,0,0,1
4,45,39,0,0,0,1,0,1
5,40,41,0,0,0,1,0,1


In [7]:
y.head()

1    139750
2    173200
3     79750
4    115000
5    141500
Name: salary, dtype: int64

It remains to split the data into a training and a test part. The [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train%20test%20split#sklearn.model_selection.train_test_split) function is used for this. The data are randomly divided into a training and a test set. The size of the test set can be specified by the parameter `test_size`; its default value is `0.25`, i.e. 25%.

In [8]:
from sklearn.model_selection import train_test_split 

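# default split: 25% of the data goes to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y)
# explicit test-set size of 30% (this call overrides the split above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)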
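
Because the split is random, it can be useful to pin it down. The sketch below uses made-up toy arrays and the `random_state` parameter of `train_test_split`; the names `X_demo` and `y_demo` are only illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# random_state fixes the shuffle, so the split is the same on every run.
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)            # 15 training rows, 5 test rows
print(np.array_equal(X_te, split_b[1]))  # both calls give the same test set
```

Without `random_state`, the function draws from NumPy's global random generator, which is why the `np.random.seed(42)` call in the first cell also makes the notebook reproducible when it is run from the top.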

## Back to scaling

Scaling is not always necessary, but it can help some models. We will use [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).

StandardScaler rescales each feature to zero mean and unit variance, which is what many algorithms expect of their inputs. Without it, a feature (column) with a significantly larger variance than the others may be treated as more significant.

In [9]:
from sklearn.preprocessing import StandardScaler 

priznaky_ke_konverzi = ["yrs.since.phd", "yrs.service"]
transformation = StandardScaler()
# just to get rid of SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

# fit the scaler on the training data only, then apply it to both sets
# (assigning whole columns lets the integer columns become floats)
X_train[priznaky_ke_konverzi] = transformation.fit_transform(X_train[priznaky_ke_konverzi])
X_test[priznaky_ke_konverzi] = transformation.transform(X_test[priznaky_ke_konverzi])
X_train.sample(10)
X_train.sample(10)


Unnamed: 0,yrs.since.phd,yrs.service,sex,rank_AssocProf,rank_AsstProf,rank_Prof,discipline_A,discipline_B
61,-0.821632,-0.603607,0,1,0,0,0,1
20,1.582294,1.680759,1,0,0,1,1,0
192,1.902818,0.538576,0,0,0,1,0,1
139,-0.741501,-0.685192,0,1,0,0,1,0
10,-0.100454,0.212238,1,0,0,1,0,1
44,1.502164,1.843928,0,0,0,1,0,1
183,-0.901763,-0.603607,0,1,0,0,0,1
36,-1.142155,-1.256283,1,0,1,0,0,1
149,1.341902,0.864914,1,0,0,1,0,1
123,0.380331,0.538576,0,0,0,1,1,0
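
What the scaler computes can be checked by hand. The following sketch uses made-up numbers and compares `StandardScaler` with the formula $z = (x - \mu) / \sigma$, where the mean and the (population) standard deviation are taken from the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values standing in for one numeric column.
train = np.array([[4.0], [19.0], [20.0], [45.0]])
test = np.array([[40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# The same computation by hand, with the population standard deviation.
mean, std = train.mean(), train.std()
manual_test = (test - mean) / std

print(train_scaled.mean(), train_scaled.std())
print(np.allclose(test_scaled, manual_test))
```

The test value is scaled with the training mean and deviation, so the test set never influences the transformation.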


## Models

We can move on to the learning itself. First we choose a model. An overview of the models can be found in the [Supervised learning](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) section.
                                        
You can use:

- [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

- [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso)
  + hyperparameters:
    * alpha, float, default=1.0

- [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR)
  + hyperparameters:
    * kernel, default `rbf`, one of `linear`, `poly`, `rbf`, `sigmoid`
    * C, float, optional (default=1.0)

For classification tasks (which we will get to in the next lesson) you will use:

- [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)

- [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
  + hyperparameters:
    * n_estimators, integer, optional (default=100)

- [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
  + hyperparameters:
    * C, float, optional (default=1.0)
    * kernel, string, optional (default=`rbf`)
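
The hyperparameters listed above are passed as keyword arguments when the model is created. A minimal sketch on made-up data follows; the chosen values (`alpha=0.1`, `C=10.0`, a linear kernel) are arbitrary illustrations, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.svm import SVR

# Hypothetical toy data: y depends linearly on the first feature only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=50)

# Hyperparameters are set in the constructor, before training.
models = {
    "Lasso": Lasso(alpha=0.1),
    "SVR": SVR(kernel="linear", C=10.0),
}

# Every estimator offers the same fit / predict / score interface.
for name, estimator in models.items():
    estimator.fit(X_demo, y_demo)
    print(name, estimator.score(X_demo, y_demo))
```

Because all models share this interface, swapping one for another usually means changing a single line.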
  


We create an instance of the selected model. For now we only want to learn how to use the library, so we take the simplest option, linear regression:

In [10]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

## Training

We train the model on a training set:

In [11]:
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

## Prediction

We typically want to use the trained model to evaluate new data samples; the `predict` method serves this purpose. Let's call it on both the training and the test data.

In [12]:
train_predikce = model.predict(X_train)
test_predikce = model.predict(X_test)

Let's list the first ten test samples and their predictions:

In [13]:
print("actual salary    predicted salary")
for i in range(10):
    print(f"{y_test.iloc[i]:>10.2f}         {test_predikce[i]:>10.2f}")


actual salary    predicted salary
  93418.00           97738.57
 128148.00          135101.30
 101299.00          131878.40
  82100.00           82375.82
  73000.00           67784.49
  86100.00           84455.33
 168635.00          115889.21
 113278.00          115355.42
 120806.00          118690.71
 150743.00          131604.85


## Model evaluation

We can use the `score` method, which returns the value of the $R^2$ metric:

In [14]:
print("R2 on training set: ", model.score(X_train, y_train))
print("R2 on test set: ", model.score(X_test, y_test))

R2 on training set:  0.5404681269886975
R2 on test set:  0.49459922918324023


Functions for many other metrics can be found in [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html?highlight=sklearn%20metrics#module-sklearn.metrics). (Right now we are interested in the [regression metrics](https://scikit-learn.org/stable/modules/classes.html?highlight=sklearn%20metrics#module-sklearn.metrics).)

In [15]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

MAE_train = mean_absolute_error(y_train, train_predikce)
MAE_test = mean_absolute_error(y_test, test_predikce)
MSE_train = mean_squared_error(y_train, train_predikce)
MSE_test = mean_squared_error(y_test, test_predikce)
R2_train = r2_score(y_train, train_predikce)
R2_test = r2_score(y_test, test_predikce)

print("     Training data       Test data")
print(f"MSE {MSE_train:>14.3f}  {MSE_test:>14.3f}")
print(f"MAE {MAE_train:>14.3f}  {MAE_test:>14.3f}")
print(f"R2  {R2_train:>14.3f}  {R2_test:>14.3f}")

     Training data       Test data
MSE  390137429.883   371972816.562
MAE      14194.085       13919.789
R2           0.540           0.495
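
To see what these three metrics measure, they can be recomputed by hand. The sketch below uses made-up values (not the salaries data) and the standard definitions: MAE is the mean of $|y - \hat{y}|$, MSE the mean of $(y - \hat{y})^2$, and $R^2 = 1 - SS_{res} / SS_{tot}$.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical toy values (not taken from the salaries data).
y_true = np.array([100.0, 120.0, 90.0, 150.0])
y_pred = np.array([110.0, 115.0, 95.0, 140.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                    # mean absolute error
mse = np.mean(errors ** 2)                       # mean squared error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

# Each pair should agree up to floating-point rounding.
print(mae, mean_absolute_error(y_true, y_pred))
print(mse, mean_squared_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))
```

MSE is expressed in squared units of the target, which is why its value in the table above is so much larger than MAE.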


## Save the model

Sometimes we need to keep the trained model for future use. The model can be saved to a file and loaded again with `pickle`. Let's get pickling:

In [16]:
import pickle 

with open("model.pickle", "wb") as soubor:
    pickle.dump(model, soubor)


with open("model.pickle", "rb") as soubor:
    staronovy_model = pickle.load(soubor)

staronovy_model.score(X_test, y_test)

0.49459922918324023

### Bonuses:
- Choosing a suitable model and its hyperparameters goes by the name **model selection**. The Scikit-learn library contains various tools to facilitate this selection. That is beyond the scope of this course, but if you come across the topic in self-study, read [sklearn.model_selection](https://scikit-learn.org/stable/modules/classes.html?highlight=model%20selection#module-sklearn.model_selection).
- In the example above, we applied several transformations to the data and then created the model. As you become more experienced, you will find it useful to bundle these steps together. The so-called [pipeline](https://scikit-learn.org/stable/modules/classes.html?highlight=pipeline#module-sklearn.pipeline) is used for this.
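
Both bonus topics can be combined in a few lines. The following is only a rough sketch on made-up data: `make_pipeline` chains a scaler with a `Lasso` model, and `GridSearchCV` tries a small, arbitrarily chosen grid of `alpha` values using cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy regression data.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=20.0, scale=10.0, size=(60, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=60)

# The pipeline refits the scaler on the training part of every CV fold,
# so no information leaks from the validation part.
pipeline = make_pipeline(StandardScaler(), Lasso())

# make_pipeline names each step after its lowercased class name,
# hence the "lasso__alpha" key for the Lasso hyperparameter.
search = GridSearchCV(pipeline, {"lasso__alpha": [0.01, 0.1, 1.0]}, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.predict(X_demo[:2]))
```

After fitting, `search` behaves like an ordinary model trained with the best hyperparameters found, so `predict` and `score` work as before.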