# Homework implementation - fish regression

In [None]:
import pandas as pd
import numpy as np 
np.random.seed(42)  # fix the seed so that the results are reproducible
__ = 0  # placeholder, ignore it; it only marks the blanks you are supposed to fill in

- Load the data using pandas, select the required columns, and convert everything to numeric values.
*(You will need [get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html), which encodes categorical values using one-hot encoding, i.e. with zeros and ones. It is called **dummies** because it produces auxiliary variables (columns) known as **dummy variables**.)*
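
As a small illustration of what one-hot encoding does, the sketch below applies `pd.get_dummies` to a tiny made-up table. The table `toy` and its values are invented for this example and are not taken from `fish_data.csv`.

```python
import pandas as pd

# Hypothetical toy table, not the real fish data
toy = pd.DataFrame({
    "Species": ["Bream", "Roach", "Bream"],
    "Weight": [242.0, 40.0, 290.0],
})

# Every categorical column is replaced by one indicator column per category;
# numeric columns are left untouched
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

Each species gets its own column named `Species_<name>`, which marks the rows belonging to that species (as 1 or `True`, depending on the pandas version).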

In [None]:
fish_data = pd.read_csv("fish_data.csv", index_col=0)

# enter the code...

- Choose the column you will use as the response (**Weight**). Store the columns you will use as features in the variable **X** and the response column in the variable **y**.
*In machine learning theory, model inputs (features, input variables) are typically denoted by the letter X and outputs by the letter y, and variables in code are often named the same way. X is an array (or table) in which each row corresponds to one data sample and each column to one feature (input variable). y is a vector, i.e. a single column holding the response.*
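
A minimal sketch of this convention on a made-up table (the frame `toy` and its numbers are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy table: two features and one response column
toy = pd.DataFrame({
    "Length3": [30.0, 31.2, 12.9],
    "Height": [11.5, 12.5, 3.5],
    "Weight": [242.0, 290.0, 9.8],
})

y = toy["Weight"]                 # response: one value per sample
X = toy.drop(columns=["Weight"])  # features: one row per sample, one column per feature

print(X.shape, y.shape)
```

Here X has three rows (samples) and two columns (features), while y holds three response values.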

In [None]:
# complete the selection of features and response
y = __
X = __

- Split the data into a training set and a test set. Note that the data contain several species of fish. What should you watch out for?
*The function [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train%20test%20split#sklearn.model_selection.train_test_split) randomly splits the data into a training set and a test set. The size of the test set can be specified with the `test_size` parameter; its default value is 0.25, i.e. 25%.*
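
One thing that can go wrong with a purely random split is that a rare species ends up almost entirely in one of the two sets. A possible remedy, shown here as a sketch on made-up data, is the `stratify` parameter of `train_test_split`, which keeps the class proportions the same in both parts. The `species` labels and the feature array below are invented for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 32 samples of one species and 8 of another
species = np.array(["Perch"] * 32 + ["Pike"] * 8)
X_toy = np.arange(40).reshape(-1, 1)  # one dummy feature

X_tr, X_te, sp_tr, sp_te = train_test_split(
    X_toy, species, test_size=0.25, stratify=species, random_state=42
)

# Both parts keep the 4:1 ratio of the two species
print(np.unique(sp_tr, return_counts=True))
print(np.unique(sp_te, return_counts=True))
```

With one-hot encoded data you would pass the original species column (or an equivalent label) to `stratify`.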

In [None]:
from sklearn.model_selection import train_test_split 

# complete the split into training and test data
# X_train, X_test, y_train, y_test = ...

- Choose several regression models and try them out.
For today you can try:
  - [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) 
 
  - [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso)
      + hyperparameters:
          * alpha, float, default=1.0 
 
  - [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR)
      + hyperparameters:
          * kernel, default ‘rbf’, one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’
          * C, float, optional (default=1.0)

# Aside: what are those *hyper-parameters*?
For the black-box examples in the first lesson, we (behind your back) quietly helped ourselves a few times by passing some parameters to the box when creating it. This is because the box often allows the user to configure it. In box terminology, we can imagine that the box has various levers that can be adjusted. These levers set the so-called **hyper-parameters** of the model. All models in the Scikit-learn library come with default settings and can be used without setting these hyper-parameters at all. If a model does not give a satisfactory result, you can try adjusting them, for example by trying several different settings and comparing the value of the metric on the test set.
The list above mentions some hyper-parameters. Hyper-parameters are often related to regularization (above, *alpha* and *C*). **Regularization** means that the model, in addition to trying to fit the data (giving the correct answers), also takes some other criterion into account. Typically, this criterion ensures that the model's output does not swing too wildly, and so on. As you said in the landscape example, you choose a solution that is *smooth*, *nice*, and *looks like the usual* landscapes.
The process of choosing a model, including its hyper-parameters, is called **model selection**. The Scikit-learn library offers tools that can help you under the heading [Model selection](https://scikit-learn.org/stable/model_selection.html).
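
To make the effect of regularization concrete, here is a small sketch on synthetic data (the data and the chosen `alpha` values are assumptions made up for this example). It fits `Lasso` with increasing `alpha` and prints the sum of the absolute values of the learned coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: the response depends on three of the five features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y_toy = X_toy @ true_coef + rng.normal(scale=0.1, size=200)

coef_sizes = []
for alpha in [0.01, 0.5, 5.0]:
    model = Lasso(alpha=alpha).fit(X_toy, y_toy)
    size = np.abs(model.coef_).sum()  # L1 norm of the coefficients
    coef_sizes.append(size)
    print(f"alpha={alpha:<5} sum of |coef| = {size:.3f}")
```

The stronger the regularization, the smaller the coefficients the model ends up with, at the price of a worse fit to the training data.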

In [None]:
from sklearn.linear_model import LinearRegression, Lasso, SGDRegressor  
from sklearn.svm import SVR 

In [None]:
# Don't be afraid to change the list or to try other parameter values
model_zoo = [
    LinearRegression(),
    Lasso(alpha=1.0),
    Lasso(alpha=0.5),
    SVR(kernel="rbf"),
    SVR(kernel="poly")
]

+ The `fit` method is used for training (fitting); the `predict` method is used for predicting on new samples.
```
  model.fit(X_train, y_train)
  pred = model.predict(X_test)
```

+ You don't have to implement the metrics yourself; `mean_absolute_error`, `mean_squared_error` and `r2_score` are available.
```
  metrika = mean_absolute_error(y_test, pred)
```
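
Put together, the two steps above might look as follows. This is only a sketch on made-up numbers: all arrays are invented, and `LinearRegression` is just one possible choice of model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up training data that follow y = 2x + 1 exactly
X_toy_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy_train = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
model.fit(X_toy_train, y_toy_train)             # training
pred = model.predict(np.array([[4.0], [5.0]]))  # prediction for new samples

# Metrics compare true values with predictions (made-up numbers)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 3.0, 6.0])
mae = mean_absolute_error(y_true, y_hat)  # (0 + 0 + 0 + 2) / 4
mse = mean_squared_error(y_true, y_hat)   # (0 + 0 + 0 + 4) / 4
r2 = r2_score(y_true, y_hat)              # 1 - 4 / 5
print(pred, mae, mse, r2)
```

For the two error metrics lower is better, while for `r2_score` higher is better, with 1.0 meaning a perfect fit.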

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def ml_game(X_train, y_train, X_test, y_test, model):
    """
    1. Trains the model on the training set.
    2. Computes and prints metric values on both the training and test sets.
    Returns the trained model.
    """

    # fill in the code according to the points above
    return model

In [None]:
# just uncomment
#trained_models = []
#for model in model_zoo: 
#    ml_game(X_train, y_train, X_test, y_test, model)
#    trained_models.append(model) 

We have trained several models. Now think about which one you would choose and why. Let's call it `best_model`. You can also try playing with the hyper-parameters and choose a different setting.
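
One possible way to make the choice less subjective is to compare a single metric on held-out data and take the model with the best value. The sketch below does this on synthetic data; the data, the two candidate models and the names `candidates` and `val_errors` are assumptions for illustration, not part of the homework.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data following y = 2x + 1
X_toy = np.arange(20, dtype=float).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel() + 1.0
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, random_state=42)

candidates = [LinearRegression(), Lasso(alpha=10.0)]
val_errors = []
for candidate in candidates:
    candidate.fit(X_tr, y_tr)
    val_errors.append(mean_absolute_error(y_val, candidate.predict(X_val)))

best_index = int(np.argmin(val_errors))  # lower MAE is better
print(val_errors, candidates[best_index])
```

A single number is not the whole story: a simpler model with a slightly worse score can still be the more sensible choice.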

In [None]:
# fill in the index of the model you selected
# best_model = trained_models[__]

# And beware, surprise ... another test set

We split the data into a training set and a test set. We used the training set to train the model. **But beware!** We also used the test set to select the model. The metric on the test set thus does not give an independent estimate of how our model will behave on unseen data: the model was chosen precisely so that it gives good results on the test set.

The test set serves to estimate the generalization ability of the model. It must therefore not be used during training, not even for selecting a model. The part that we set aside as "test" data for model selection purposes is properly called the **validation** set. **Caution:** once we have used this validation set to select a model, we must not use it to evaluate the generalization ability of that model.
And so now comes the real test data; load it from the `fish_data_test.csv` file. If you scaled the data when building the model, don't forget to scale the test data as well.
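
A common way to obtain all three sets from a single table is to call `train_test_split` twice, and to fit any scaler on the training part only. The following is a minimal sketch on random data; the split sizes and the use of `StandardScaler` are assumptions chosen for the example, not something the homework prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_all = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_all = rng.normal(size=100)

# First set aside the final test set, then split the rest into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

# The scaler learns the mean and variance from the training set only ...
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
# ... and the very same transformation is applied to the other sets
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(len(X_train), len(X_val), len(X_test))
```

Train on the training set, pick the model on the validation set, and touch the test set only once at the very end.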

In [None]:
test_data = pd.read_csv("fish_data_test.csv", index_col=0)
test_data = pd.get_dummies(test_data.drop(columns=["ID"]))

y_real_test = test_data["Weight"]
X_real_test = test_data.drop(columns=["Weight"])

#y_pred_test = best_model.predict(X_real_test)

#print(f"MAE {mean_absolute_error(y_real_test, y_pred_test):.3f}")
#print(f"MSE {mean_squared_error(y_real_test, y_pred_test):.3f}")


In [None]:
#for weight, predicted_weight in zip(y_real_test, y_pred_test):
#    print(f"{weight:>10.1f}     {predicted_weight:>10.1f}")

# Visualization at the end

+ To give you an idea, let's look at how the weight of a fish depends on the length Length3. We will plot it separately for different species, for example for Pike and Roach.

In [None]:
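mam_hotovy_prechozi_kod = False  # "I have finished the previous code"; set to True once you have
if mam_hotovy_prechozi_kod:
    is_pike = test_data["Species_Bream"] == 1  # selects the rows flagged in the Species_Bream column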
    pike = test_data[is_pike].sort_values(by=["Length3"])
    pike_weights = pike["Weight"]
    pike_length3 = pike["Length3"]
    X_pike = scaler.transform(pike.drop(columns=["Weight"]))
    
    predicted_pike_weights = best_model.predict(X_pike)

    is_roach = test_data["Species_Roach"] == 1
    roach = test_data[is_roach].sort_values(by=["Length3"])
    roach_weights = roach["Weight"]
    roach_length3 = roach["Length3"]
    X_roach = scaler.transform(roach.drop(columns=["Weight"]))
    
    predicted_roach_weights = best_model.predict(X_roach)

In [None]:
import matplotlib.pyplot as plt
if mam_hotovy_prechozi_kod:
    fig, ax = plt.subplots(1, 2)

    ax[0].scatter(pike_length3, pike_weights, label="true weight");
    ax[0].plot(pike_length3, predicted_pike_weights, label="prediction");
    ax[0].legend()

    ax[1].scatter(roach_length3, roach_weights, label="true weight");
    ax[1].plot(roach_length3, predicted_roach_weights, label="prediction");
    ax[1].legend()