# Data and Machine Learning

Machine learning is about creating models from data: for that reason, we'll start by discussing how data can be represented in order to be processed correctly by **Machine Learning** (i.e. `Data` $\mapsto$ `Machine Learning`).

It would also be worthwhile to investigate this relationship in the other direction, i.e. `Machine learning` $\mapsto$ `Data`$^{[1]}$, but this is beyond the scope of this tutorial. We will briefly touch on this in the next notebook.

<span class="fn"><i>[1]</i> Although nowadays the answer to this question always seems to be **Deep Learning**.</span>

### Data for Machine Learning

Despite what you might expect, machine learning algorithms are pretty _strict_ about what input data should look like, and so is `sklearn`.

Data in **scikit-learn**, with very few exceptions, is assumed to be stored as a
**two-dimensional array**, of size `[n_samples, n_features]`. 

This array is usually referred to as the **feature matrix**.

There is also the **label vector**$^{[2]}$, of size `n_samples`, containing the list of labels for each sample.

$$
{\rm feature~matrix:~~~} {\bf X}~=~\left[
\begin{matrix}
x_{11} & x_{12} & \cdots & x_{1D}\\
x_{21} & x_{22} & \cdots & x_{2D}\\
x_{31} & x_{32} & \cdots & x_{3D}\\
\vdots & \vdots & \ddots & \vdots\\
\vdots & \vdots & \ddots & \vdots\\
x_{N1} & x_{N2} & \cdots & x_{ND}\\
\end{matrix}
\right]
$$

$$
{\rm label~vector:~~~} {\bf y}~=~ [y_1, y_2, y_3, \cdots y_N]
$$

Here, $N$ is `n_samples`; $D$ is `n_features`.

Each sample (data point) is a row in the data array, and each feature is a column.

<span class="fn"><i>[2]</i> The **label vector** only applies to **Supervised** learning settings. More on this later.</span>

- $N$ (`n_samples`):   The number of samples: each sample is an item to process (e.g. classify).
  A sample can be a document, a picture, a sound, a video, an astronomical object,
  a row in a database or CSV file,
  or whatever you can describe with a fixed set of quantitative traits.
- $D$ (`n_features`):  The number of features or distinct traits that can be used to describe each
  item in a quantitative manner.  Features are generally real-valued, but may be boolean or
  discrete-valued in some cases.
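
As a minimal sketch of this layout, the snippet below builds a tiny feature matrix and label vector with `numpy`. The values are made up purely for illustration (they are hypothetical, not taken from any dataset used in this tutorial).

```python
import numpy as np

# Hypothetical toy data: 3 samples (rows), each described by 2 features (columns).
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.9]])

# One label per sample (only needed in supervised settings).
y = np.array([0, 0, 1])

n_samples, n_features = X.shape
print("n_samples:", n_samples)
print("n_features:", n_features)
print("label vector shape:", y.shape)
```

The first axis of `X` always indexes samples, so `X[0]` is the first sample and `X[:, 0]` is the first feature across all samples.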

The number of features must be fixed in advance. 

However, the feature space can be very high dimensional
(e.g. millions of features), with most features being zero for a given sample.

This is a case where **sparse** matrices can be useful, in that they are
much more memory-efficient. In these cases, `scipy.sparse` offers a more efficient solution than **dense** `numpy` arrays.

#### A Few Notes on `Sparse` and Data Sparsity

**Scipy** provides `sparse` matrix data structures which are optimized for storing sparse data. 

The main feature of sparse formats is that you don’t store zeros so if your data is 
sparse then you use much less memory. 

A non-zero value in a sparse (`CSR` or `CSC` [Scipy Doc](https://docs.scipy.org/doc/scipy/reference/sparse.html)) representation will only take on 
average one `32bit` integer position + the `64bit` floating point value + an 
additional `32bit` per row or column in the matrix. 
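
To get a feel for this storage cost, here is a rough sketch that compares the memory used by a dense array with that of its `CSR` counterpart. The matrix size and number of non-zeros are arbitrary choices for illustration, and the exact byte counts depend on the index dtype SciPy picks on your platform.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Hypothetical mostly-zero matrix: 1000 x 500 with at most 5000 non-zero entries.
dense = np.zeros((1000, 500))
rows = rng.integers(0, 1000, size=5000)
cols = rng.integers(0, 500, size=5000)
dense[rows, cols] = 1.0

X_csr = sparse.csr_matrix(dense)

# A CSR matrix stores three arrays: values, column indices, and row pointers.
dense_bytes = dense.nbytes
csr_bytes = X_csr.data.nbytes + X_csr.indices.nbytes + X_csr.indptr.nbytes

print("non-zeros:  ", X_csr.nnz)
print("dense bytes:", dense_bytes)
print("CSR bytes:  ", csr_bytes)
```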

Using sparse input on a dense (or sparse) linear model can speed up prediction by 
quite a bit, as only the non-zero valued features impact the dot product and thus 
the model predictions. Hence if you have `100` non-zeros in a `1e6` dimensional 
space, you only need `100` multiply and add operations instead of `1e6`.
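
The arithmetic above can be sketched in plain `numpy`. The weight vector and the sample below are randomly generated stand-ins (not a real model), and the "sparse" sample is represented by hand as an index array plus a value array rather than with `scipy.sparse`.

```python
import numpy as np

n_features = 1_000_000
rng = np.random.default_rng(0)

# Hypothetical linear-model weights and one sample with 100 non-zero features.
w = rng.normal(size=n_features)
nz_idx = rng.choice(n_features, size=100, replace=False)
nz_val = rng.normal(size=100)

# Dense representation of the same sample.
x_dense = np.zeros(n_features)
x_dense[nz_idx] = nz_val

dense_score = x_dense @ w          # multiplies and adds over all 1e6 entries
sparse_score = nz_val @ w[nz_idx]  # multiplies and adds over the 100 non-zeros only

print(np.isclose(dense_score, sparse_score))
```

Both computations give the same score up to floating point rounding; the second one simply skips the entries that cannot contribute.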

Calculation over a dense representation, however, may leverage highly optimised 
vector operations and multithreading in **BLAS**, and tends to result in fewer CPU 
cache misses. So the sparsity should typically be quite high (`10%` non-zeros max, 
to be checked depending on the hardware) for the sparse input representation to be 
faster than the dense input representation on a machine with many CPUs and an 
optimized BLAS implementation.

Here is a test function to check the sparsity of your data:

```python
import numpy as np

def sparsity_ratio(X):
    # Fraction of zero entries in the 2D array X (assumed to be defined already)
    return 1.0 - np.count_nonzero(X) / float(X.shape[0] * X.shape[1])

print("input sparsity ratio:", sparsity_ratio(X))
```

As a rule of thumb you can consider that if the sparsity ratio is greater than `90%` you can probably benefit from sparse formats. 
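
One way to act on this rule of thumb is sketched below. The helper `to_sparse_if_worthwhile` and its `threshold` parameter are hypothetical names introduced here for illustration; they are not part of `scipy` or `sklearn`.

```python
import numpy as np
from scipy import sparse

def to_sparse_if_worthwhile(X, threshold=0.9):
    """Return (X or a CSR copy of X, sparsity ratio) for a dense 2D array X."""
    ratio = 1.0 - np.count_nonzero(X) / X.size
    X_out = sparse.csr_matrix(X) if ratio > threshold else X
    return X_out, ratio

# Hypothetical data: 10 non-zero entries out of 100 * 50 = 5000.
X_dense = np.zeros((100, 50))
X_dense[::10, 0] = 1.0

X_out, ratio = to_sparse_if_worthwhile(X_dense)
print("sparsity ratio:", ratio)
print("converted to sparse:", sparse.issparse(X_out))
```

The `90%` threshold is only a starting point; as noted above, the actual break-even depends on the hardware and the BLAS implementation.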

Check Scipy’s sparse matrix formats documentation for more information on how to build (or convert your data to) sparse matrix formats. 

$\rightarrow$ Most of the time the `CSR` and `CSC` formats work best.

<span class="fn"><i>Adapted from:</i> [Scikit-learn Doc :: Computing](https://scikit-learn.org/stable/modules/computing.html)</span>

### Dataset API in `sklearn`

```python
from sklearn import datasets
```

There are **three** main kinds of dataset interfaces that can be used to get datasets depending on the desired type of dataset.

1. The dataset **loaders**:

```python
from sklearn.datasets import load_
```

They can be used to load small standard datasets (also referred to as *Toy datasets*).


2. The dataset **fetchers**:

```python
from sklearn.datasets import fetch_
```

They can be used to download and load larger datasets (also referred to as *Real world datasets*).

Both loader and fetcher functions return a `sklearn.utils.Bunch` object holding 
at least two items: an array of shape `n_samples * n_features` with key `data` 
(except for 20newsgroups) and a `numpy` array of length `n_samples`, containing the 
target values, with key `target`.

The `Bunch` object is a dictionary that exposes its keys as attributes.

It’s also possible for almost all of these functions to constrain the output to be a 
tuple containing only the data and the target, by setting the `return_X_y` 
parameter to `True`.

The datasets also contain a full description in their `DESCR` attribute and some 
contain `feature_names` and `target_names`. 
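
The `Bunch` interface described above can be sketched with `load_iris`, one of the bundled toy datasets. This is just one illustrative loader; other loaders and fetchers may expose a slightly different set of keys.

```python
from sklearn.datasets import load_iris

iris = load_iris()  # returns a Bunch

print(iris.keys())
print(iris.data.shape, iris.target.shape)

# Dictionary-style access and attribute access refer to the same object.
print(iris["data"] is iris.data)

# Optional metadata available for this dataset.
print(iris.feature_names)
print(iris.target_names)

# With return_X_y=True only the (data, target) tuple is returned.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```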

3. The dataset **generation** functions:

```python
from sklearn.datasets import make_
```

They can be used to generate controlled **synthetic datasets**.

These functions return a `tuple` `(X, y)` consisting of a `n_samples * n_features` 
`numpy` array `X` and an array of length `n_samples` containing the targets `y`.
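
As a short sketch, `make_blobs` is one such generator; the parameter values below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Generate 100 two-dimensional points grouped around 3 centers.
X, y = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)

print(X.shape)       # feature matrix: n_samples * n_features
print(y.shape)       # one cluster label per sample
print(np.unique(y))  # the distinct labels
```

Fixing `random_state` makes the generated dataset reproducible across runs.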

In addition, there are also miscellaneous tools to load datasets of other formats or from other locations (e.g. `fetch_openml`).

---

<span class="fn"><i>Adapted from</i> [Scikit-learn Documentation :: Datasets](https://scikit-learn.org/stable/datasets/index.html)</span>

###### Acquaintance with the API

Use the auto-completion feature of Jupyter notebooks or of your code editor (e.g. Visual Studio Code) to see which datasets and functions are included in `sklearn`.

In [1]:
from sklearn import datasets

```python
from sklearn.datasets import load_<TAB>
from sklearn.datasets import fetch_<TAB>
from sklearn.datasets import make_<TAB>
```

###### Iris data (again) under the "data representation" light

In [2]:
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True, as_frame=True)

In [3]:
first_sample = X.iloc[0]

In [4]:
first_sample

sepal length (cm)    5.1
sepal width (cm)     3.5
petal length (cm)    1.4
petal width (cm)     0.2
Name: 0, dtype: float64

###### A new (?) Dataset

Now we'll take a look at another dataset, one where we have to put a bit more thought into how to represent the data: the `digits` data (a small, **MNIST**-like set of 8×8 handwritten-digit images; it is not the MNIST dataset itself).

In [5]:
from sklearn.datasets import load_digits
digits = load_digits()

In [6]:
## Explore the dataset keys and stored information

# your code here...
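
One possible way to start this exploration is sketched below (try it yourself first). It only looks at the keys and at how each 8×8 image is flattened into a row of the feature matrix.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

print(digits.keys())
print(digits.images.shape)  # one 8x8 image per sample
print(digits.data.shape)    # the same images, flattened to 64 features per sample

# Flattening each image row by row reproduces the feature matrix.
flattened = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flattened, digits.data))
```

Flattening discards the 2D layout of the pixels, which is the representation choice this dataset asks us to think about.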

---

#### Extra (if we have time)

Try to `fetch` and `load` the RCV1-v2 dataset, then check the `type` of its `data`.

In [7]:
from sklearn.datasets import fetch_rcv1

In [8]:
rcv1 = fetch_rcv1()  # This may take some time depending on your internet connection

In [9]:
rcv1.keys()

dict_keys(['data', 'target', 'sample_id', 'target_names', 'DESCR'])

In [10]:
print(rcv1.DESCR)

.. _rcv1_dataset:

RCV1 dataset
------------

Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually 
categorized newswire stories made available by Reuters, Ltd. for research 
purposes. The dataset is extensively described in [1]_.

**Data Set Characteristics:**

    Classes                              103
    Samples total                     804414
    Dimensionality                     47236
    Features           real, between 0 and 1

:func:`sklearn.datasets.fetch_rcv1` will load the following 
version: RCV1-v2, vectors, full sets, topics multilabels::

    >>> from sklearn.datasets import fetch_rcv1
    >>> rcv1 = fetch_rcv1()

It returns a dictionary-like object, with the following attributes:

``data``:
The feature matrix is a scipy CSR sparse matrix, with 804414 samples and
47236 features. Non-zero values contains cosine-normalized, log TF-IDF vectors.
A nearly chronological split is proposed in [1]_: The first 23149 samples are
the training set. The last 7812

In [11]:
X = rcv1.data

In [12]:
X.shape

(804414, 47236)

In [None]:
## Run this!
type(X)