# ___

# [ Machine Learning in Earth Observation and Geography ]

<br>

<div>
<img src="attachment:image.png" width="400">
</div>

---

Lecturer: *Lukas Brodsky lukas.brodsky@natur.cuni.cz*

**Department of Applied Geoinformatics and Cartography, Charles University**

___

# Scikit-learn (first steps)

Scikit-learn is a **machine learning library for Python**. It offers:
* Simple and efficient tools for **predictive data analysis**;
* A **high level of support** and a robust, mature code base;
* A **clear, consistent code style**, which helps keep your machine learning code easy to understand and reproducible;
* Tight integration with the scientific Python stack: it is built on **NumPy**, SciPy, and Matplotlib;
* An **open-source**, commercially usable BSD license.

After completing this lab, you should know:

* How to import sklearn modules. 
* How to run simple predictive analysis (classification and regression) with Scikit-learn in Python.

**Note: this lab does not aim to tune the models!**

## Documentation

    
Please refer to **[Scikit-learn official documentation](https://scikit-learn.org/)**, and use the **[Scikit-learn API Reference](https://scikit-learn.org/stable/modules/classes.html)**

Version used (the lab was written for 0.24.1; the installed version is printed in the *Library import* cell below):


In [None]:
# src: https://scikit-learn.org/stable/getting_started.html

## 1. Fitting and predicting: estimator basics

### Library import

In [1]:
import sklearn
sklearn.__version__

'1.2.2'

### Sklearn API

Scikit-learn provides dozens of built-in machine learning algorithms and models, called **`estimators`**. Each estimator can be fitted to some data using its **`fit`** method.

https://scikit-learn.org/stable/glossary.html#term-estimators 

https://scikit-learn.org/stable/glossary.html#term-fit

### Estimator(s)

An object which manages the estimation and decoding of a model. The model is estimated as a deterministic function of:

* parameters provided in object construction or with `set_params`;
* the global `numpy.random` random state if the estimator's `random_state` parameter is set to `None`; and
* any data or sample properties passed to the most recent call to `fit`, `fit_transform` or `fit_predict`, or data similarly passed in a sequence of calls to `partial_fit`.

The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods.

Estimators must provide a `fit` method, and should provide `set_params` and `get_params`, although these are usually provided by inheritance from `base.BaseEstimator`.
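
The parameter handling described above can be sketched as follows. This is a minimal illustration, assuming `RandomForestClassifier` as the example estimator; the name `demo_clf` and the parameter values are chosen only for this sketch.

```python
from sklearn.ensemble import RandomForestClassifier

# Parameters passed at construction are stored on the estimator instance
# (demo_clf and the values 10 / 25 are illustrative assumptions).
demo_clf = RandomForestClassifier(n_estimators=10, random_state=0)

# get_params() returns the current parameters as a dictionary
params = demo_clf.get_params()
print(params["n_estimators"])
print(params["random_state"])

# set_params() updates parameters on the existing object
demo_clf.set_params(n_estimators=25)
print(demo_clf.get_params()["n_estimators"])
```

Because `random_state` is fixed here, repeated fits on the same data should give the same model.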

### Fit 
The fit method is provided on every estimator. It usually takes some samples **`X`**, targets **`y`** if the model is supervised, and potentially other sample properties such as `sample_weight`.
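
As a hedged sketch of a sample property, the snippet below passes `sample_weight` to `fit`. The data, the weights, and the names `X_w`, `y_w` and `weighted_clf` are made up for illustration; `RandomForestClassifier` is used only because it accepts this argument.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy data (assumed for illustration): 3 samples, 3 features
X_w = [[1, 2, 3], [11, 12, 13], [2, 3, 4]]
y_w = [0, 1, 0]

# One weight per sample; the second sample counts twice as much
weights = [1.0, 2.0, 1.0]

weighted_clf = RandomForestClassifier(n_estimators=10, random_state=0)
weighted_clf.fit(X_w, y_w, sample_weight=weights)

# Classes seen during fitting
print(weighted_clf.classes_)
```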

#### Fitting classification model

In [2]:
# Estimator
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Instantiation
clf = RandomForestClassifier(random_state=0)

In [None]:
# Input data: 2 samples, 3 features
X = [[ 1,  2,  3],  
     [11, 12, 13]]

# classes of each sample
y = [0, 1]  

The shape of **`X`** is typically (**n_samples, n_features**), which means that samples are represented as rows and features are represented as columns.

The target values `y` are **real numbers for regression** tasks, or **integers** for classification (or any other discrete set of values). For unsupervised learning tasks, `y` does not need to be specified. 

**`y`** is usually a 1d array where the i-th entry corresponds to the target of the i-th sample (row) of `X`.

Both `X` and `y` are usually expected to be **numpy arrays** or equivalent array-like data types, though some estimators work with other formats such as sparse matrices.
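
A short sketch of the same toy data as NumPy arrays, to make the shapes explicit. The names `X_arr` and `y_arr` are used only in this illustration.

```python
import numpy as np

# Same toy data as above, stored as NumPy arrays
X_arr = np.array([[1, 2, 3],
                  [11, 12, 13]])
y_arr = np.array([0, 1])

print(X_arr.shape)  # (2, 3): n_samples rows, n_features columns
print(y_arr.shape)  # (2,): one target per sample
```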

In [None]:
# Fitting a model
clf.fit(X, y)

#### Model prediction
Once the estimator is fitted, it can be used for predicting target values of new data. 

In [None]:
# Predict classes of new data
clf.predict([[4, 5, 6], [14, 15, 16]])  
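
Besides hard class labels, many classifiers can also return class probabilities. The sketch below assumes the same toy data and uses `predict_proba`, which `RandomForestClassifier` provides; the name `proba_clf` is illustrative.

```python
from sklearn.ensemble import RandomForestClassifier

# Refit on the toy data so that the example is self-contained
proba_clf = RandomForestClassifier(random_state=0)
proba_clf.fit([[1, 2, 3], [11, 12, 13]], [0, 1])

# One row per new sample, one column per class in proba_clf.classes_
proba = proba_clf.predict_proba([[4, 5, 6], [14, 15, 16]])
print(proba_clf.classes_)
print(proba)
```

Each row of `proba` sums to 1, and `predict` returns the class with the highest probability.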

## 2. Model evaluation

Scikit-learn provides a wide set of evaluation metrics and techniques for all types of machine learning problems; see the [model evaluation guide](https://scikit-learn.org/stable/modules/model_evaluation.html).

For example, regression metrics that measure the distance between the model and the data, such as `metrics.mean_squared_error()`, are also available under the scoring name `neg_mean_squared_error`, which returns the negated value of the metric so that a higher score always means a better model.

For classification, `metrics.accuracy_score()` returns the fraction of correctly classified samples. In multilabel classification it computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in `y_true`.
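
A minimal sketch of the `neg_mean_squared_error` scoring name in use. It assumes a synthetic dataset from `make_regression` and a plain `LinearRegression` model, neither of which is part of this lab; they only serve to show the sign convention.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption for illustration only)
X_reg, y_reg = make_regression(n_samples=100, n_features=3,
                               noise=10.0, random_state=0)

# Scores follow the "higher is better" convention, so the MSE is negated
scores = cross_val_score(LinearRegression(), X_reg, y_reg,
                         cv=5, scoring="neg_mean_squared_error")
print(scores)          # five non-positive values, one per fold
print(-scores.mean())  # flip the sign to recover the average MSE
```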

#### Overall accuracy

In [None]:
# Classification task

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
# Model predictions
y_pred = [0, 2, 1, 3]
# Ground-truth values (references)
y_true = [0, 1, 2, 3]

In [None]:
accuracy_score(y_true, y_pred)
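
To see what the metric does, the same value can be reproduced by hand. This is a sketch using NumPy on the labels from the cell above; the names `y_true_cls` and `y_pred_cls` are used to avoid overwriting the notebook variables.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true_cls = np.array([0, 1, 2, 3])
y_pred_cls = np.array([0, 2, 1, 3])

# Fraction of positions where prediction and reference agree
manual_accuracy = np.mean(y_true_cls == y_pred_cls)
print(manual_accuracy)  # 0.5: two of the four samples are correct

# normalize=False returns the number of correct samples instead
n_correct = accuracy_score(y_true_cls, y_pred_cls, normalize=False)
print(n_correct)
```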

#### Mean squared error (MSE)

In [None]:
# Regression task

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

In [None]:
mean_squared_error(y_true, y_pred)
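
The MSE is the mean of the squared differences between reference and predicted values. A sketch of the manual computation with NumPy, using the values from the cell above under the illustrative names `y_true_reg` and `y_pred_reg`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true_reg = np.array([3, -0.5, 2, 7])
y_pred_reg = np.array([2.5, 0.0, 2, 8])

# Squared errors are 0.25, 0.25, 0 and 1; their mean is 0.375
manual_mse = np.mean((y_true_reg - y_pred_reg) ** 2)
print(manual_mse)  # 0.375

# The root of the MSE is expressed in the units of the target
rmse = np.sqrt(manual_mse)
print(rmse)
```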

## 3. Transformers and pre-processors
Machine learning workflows are often composed of different parts. A typical pipeline consists of a pre-processing step that transforms or imputes the data, and a final predictor that predicts target values.

In `scikit-learn`, pre-processors and transformers follow the same API as the estimator objects (they all inherit from the same `BaseEstimator` class). The **transformer objects** don't have a `predict` method but rather a `transform` method that outputs a newly transformed sample matrix `X`. 
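
A sketch of how a pre-processing step and a final predictor can be chained, assuming the Iris dataset, `StandardScaler` and `LogisticRegression` as stand-ins (these choices follow the scikit-learn getting-started guide and are not specific to this lab):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# The pipeline scales the features, then passes them to the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_iris, y_iris)
print(pipe.predict(X_iris[:3]))

# A transformer alone offers transform() but no predict()
print(hasattr(StandardScaler(), "transform"))
print(hasattr(StandardScaler(), "predict"))
```

The pipeline itself behaves like a single estimator, so it can be fitted and used for prediction in the same way as the classifier in section 1.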

#### Standard scaling 

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
X = [[0, 15],
     [1, -10]]

In [None]:
# Fit the scaler: compute the per-feature mean and standard deviation
scaler = StandardScaler()
scaler.fit(X)


In [None]:
# fit_transform() fits the scaler and transforms the data in one call
X_scaled = scaler.fit_transform(X)
X_scaled

In [None]:
# Inverse transformation
scaler.inverse_transform(X_scaled)
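
What the scaler learned can be inspected through its fitted attributes. The sketch below repeats the toy data under the illustrative names `X_demo` and `demo_scaler` and checks the result against a manual computation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[0, 15],
                   [1, -10]], dtype=float)

demo_scaler = StandardScaler().fit(X_demo)
print(demo_scaler.mean_)   # per-feature means: 0.5 and 2.5
print(demo_scaler.scale_)  # per-feature standard deviations: 0.5 and 12.5

# Standardization: subtract the mean, divide by the standard deviation
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
print(np.allclose(manual, demo_scaler.transform(X_demo)))  # True
```

After scaling, each column has zero mean and unit variance, which is why both features end up as -1 and 1 in this two-sample example.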

#### More in the following lessons!