# Classification

So far, we have only dealt with regression problems. However, supervised learning covers two main groups of tasks: regression and classification.
In regression problems the model outputs a continuous value (a float); in classification tasks it outputs a class label.
Let's stay at our fish market for an example. Predicting the weight of a fish was a regression task, because we predicted a continuous value. If we want to predict the species of a fish (Perch, Roach, Pike, ...), we are predicting a categorical value, i.e. doing classification.
Classification problems have somewhat different properties and logic than regression problems, so there are models designed specifically for them. They are called classifiers.
But first, let us look at classification from a perspective we already know: the landscape.

![data](static/ryby.png)

In [1]:
# load the data
import pandas as pd
import numpy as np
np.random.seed(2020)  # seed the random number generator so that results are reproducible
data = pd.read_csv("static/fish_data.csv", index_col=0)
data

Unnamed: 0,Species,Weight,Length1,Length2,Length3,Height,Width,ID
0,Bream,242.0,23.2,25.4,30.0,11.5200,4.0200,0
1,Bream,290.0,24.0,26.3,31.2,12.4800,4.3056,1
2,Bream,340.0,23.9,26.5,31.1,12.3778,4.6961,2
3,Bream,363.0,26.3,29.0,33.5,12.7300,4.4555,3
4,Bream,430.0,26.5,29.0,34.0,12.4440,5.1340,4
...,...,...,...,...,...,...,...,...
153,Smelt,9.8,11.4,12.0,13.2,2.2044,1.1484,153
154,Smelt,12.2,11.5,12.2,13.4,2.0904,1.3936,154
155,Smelt,13.4,11.7,12.4,13.5,2.4300,1.2690,155
156,Smelt,12.2,12.1,13.0,13.8,2.2770,1.2558,156


### Task 1:   
The most common fish species in our data is *Perch*. Our goal is to create a classifier that, given the measurements of a fish (weight, the various lengths, height and width), tells us whether it is a perch or another species. (So, for simplicity, we have only two classes, **Perch** and **others**.)
+ Can you fit this task into the landscape picture? What would the coordinates be, and what would the altitude be?
+ If you managed to answer the previous question, you can use one of the regression models for classification (it probably won't be ideal, but let's first try what we already know). What will the response value be, and how will we interpret it?
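
One possible way to think about the second question is sketched below, under stated assumptions: the target is encoded as 0/1 (1 = Perch), a regression model is fitted to it, and its continuous output is turned into a class by comparing it with a threshold. The data values, the single `lengths` feature, the `predict_class` helper and the 0.5 threshold are all made up for illustration; they do not come from the fish data set.

```python
# Toy, made-up data: one feature (a length in cm) and a 0/1 target (1 = Perch).
lengths = [10.0, 12.0, 14.0, 20.0, 22.0, 24.0]
is_perch = [0, 0, 0, 1, 1, 1]

# Ordinary least squares for a single feature (closed-form slope and intercept).
n = len(lengths)
mean_x = sum(lengths) / n
mean_y = sum(is_perch) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, is_perch))
    / sum((x - mean_x) ** 2 for x in lengths)
)
intercept = mean_y - slope * mean_x


def predict_class(length, threshold=0.5):
    """Turn the continuous regression output ("altitude") into a class label."""
    score = intercept + slope * length
    return int(score >= threshold)


print([predict_class(x) for x in lengths])
```

In the landscape picture, the measurements play the role of coordinates and the 0/1 label plays the role of altitude; the threshold decides which altitudes count as "Perch".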

## Classification models

Here is, once again, a basic selection of classification models:
+ [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)

+ [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)
    
+ [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
   - n_estimators, integer, optional (default=100)
   
+ [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
     - C, float, optional (default=1.0)
     - kernel, string, optional (default='rbf')
     
+ [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)  


### Task 2:
Choose one model and try to train it on the fish data.

In [2]:
# let's prepare the data
y = data["Species"] == "Perch"
y = y.astype(int)
X = data.drop(columns=["ID", "Species"])

In [3]:
# let's take a classifier
# (you can change it)
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

In [4]:
# let's split the data into a training and a validation set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

In [5]:
# train
model.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [6]:
# predict on the validation set
pred = model.predict(X_test)

In [7]:
print("Real class:      Prediction:")
for true, predicted in zip(y_test, pred):
    print(f"{true:<15}  {predicted:<10} {'OK' if true == predicted else 'X'}")

print(f"Number of errors: {sum(y_test != pred)}")

Real class:      Prediction:
0                0          OK
0                0          OK
0                0          OK
0                0          OK
0                0          OK
0                0          OK
1                1          OK
0                0          OK
1                1          OK
0                0          OK
0                0          OK
1                0          X
0                0          OK
1                0          X
0                0          OK
1                0          X
0                0          OK
0                0          OK
0                0          OK
1                1          OK
0                0          OK
0                1          X
0                1          X
0                0          OK
0                0          OK
1                1          OK
1                0          X
1                1          OK
1                1          OK
1                1          OK
0                0          OK
Number of errors: 6


### Task 3:
+ It is probably clear that regression metrics are not well suited to classification problems. What would you use as a metric for a classification task?

### Task 4:
- One option is to look at the percentage of correctly classified samples (the accuracy). In our case, it is:

In [8]:
print(f"Accuracy: {100 * sum(y_test == pred) / len(y_test):.2f} %")

Accuracy: 80.65 %


This accuracy is not bad at all; recognizing the species of a fish from its measurements is not an easy task.
But imagine a data set of 100 fish, 95 of which are perch. Would a classifier with the same accuracy as ours be a good one? Why or why not?
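
To make the question concrete, here is a minimal sketch of the hypothetical 100-fish data set described above, together with a "classifier" that ignores its input and always answers Perch. The names `y_true` and `y_always_perch` are illustrative only.

```python
# Hypothetical data set from the text: 100 fish, 95 of them perch (class 1).
y_true = [1] * 95 + [0] * 5

# A baseline that never looks at the fish and always predicts "Perch".
y_always_perch = [1] * len(y_true)

# Accuracy: percentage of samples where prediction and true class agree.
accuracy = 100 * sum(t == p for t, p in zip(y_true, y_always_perch)) / len(y_true)
print(f"Accuracy: {accuracy:.2f} %")  # prints "Accuracy: 95.00 %"
```

Comparing this baseline with the accuracy we measured suggests why accuracy alone can be misleading when one class dominates the data.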

### Task 5:
First, we will go through the classification metrics together. If you are studying on your own, read the chapter on classification metrics and then return to this exercise.
Choose a metric for our task and try to find the best classifier you can. Then load the test set and see how your classifier performs on it.
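
As an illustration of what such metrics look like, the sketch below counts the four confusion-matrix cells by hand and derives precision and recall for the Perch class. The two lists are copied from the validation listing printed in Task 2; the variable names are illustrative, and a different random split would give different numbers.

```python
# (true, predicted) labels copied from the validation listing above; 1 = Perch.
val_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1,
            0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
val_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]

pairs = list(zip(val_true, val_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # perch found
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # other fish called perch
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # perch missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # other fish recognized

precision = tp / (tp + fp)  # how many predicted perch really are perch
recall = tp / (tp + fn)     # how many real perch were found

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For that listing the counts are 7 true positives, 2 false positives, 4 false negatives and 18 true negatives. With real `y_test` and `pred` arrays, `sklearn.metrics.confusion_matrix` (imported in the next cells) yields the same four counts.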

In [9]:
# load the test data
test_data = pd.read_csv("static/fish_data_test.csv", index_col=0)
y_real_test = test_data["Species"] == "Perch"
y_real_test = y_real_test.astype(int)
X_real_test = test_data.drop(columns=["ID", "Species"])

In [10]:
# try training different models and choose the best one
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# candidate models to compare
models = {
    "nearest neighbors": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(),
    "forest": RandomForestClassifier(),
    "svc": SVC()
}
...

Ellipsis
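
One possible shape for the comparison step is sketched below. It is only an illustration under stated assumptions: it uses a synthetic data set from `make_classification` as a stand-in for the fish data, the F1 score as the chosen metric, and its own names (`candidates`, `scores`, `best_name`), none of which come from the exercise itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the fish data (assumption: NOT the real data set).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=2020)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=2020
)

candidates = {
    "nearest neighbors": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(random_state=2020),
    "forest": RandomForestClassifier(random_state=2020),
    "svc": SVC(),
}

# Fit every candidate on the training part and score it on the validation part.
scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    scores[name] = f1_score(y_val, clf.predict(X_val))

best_name = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name:<20} F1 = {score:.3f}")
print("best on validation:", best_name)
```

The important design choice is that models are compared on the validation split only; the test set is touched once, at the very end, with the model that was chosen.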

In [11]:
# prediction
# model = models[...]
test_pred = model.predict(X_real_test)

In [12]:
# try adding the selected metric
print("Real class:      Prediction:")
for true, predicted in zip(y_real_test, test_pred):
    print(f"{true:<15}  {predicted:<10} {'OK' if true == predicted else 'X'}")

print(f"Number of errors: {sum(y_real_test != test_pred)}")
print(f"Accuracy: {100 * sum(y_real_test == test_pred) / len(y_real_test):.2f} %")

Real class:      Prediction:
0                0          OK
0                0          OK
0                0          OK
0                0          OK
0                0          OK
0                1          X
0                1          X
0                1          X
0                0          OK
0                0          OK
0                1          X
0                1          X
0                0          OK
0                0          OK
0                0          OK
0                0          OK
0                0          OK
0                0          OK
0                0          OK
1                0          X
1                0          X
1                1          OK
1                1          OK
1                1          OK
1                0          X
1                1          OK
1                1          OK
1                1          OK
1                0          X
1                0          X
1                0          X
1                0     