# Data science in Python I. - Problems in data science

## Glossary

<p style="font-size:16px;">
Data science primarily deals with datasets (surprise, surprise). The field is built on a handful of fundamental terms, and you need to be familiar with them before venturing any further. Below you will find a short list of the most important ones, each with a short description.
</p>

<ul style="font-size:16px;">
<li><b>Dataset</b> (<i>hu: adathalmaz</i>): A collection of datapoints, denoted by upper case <code>X</code> by convention. Datasets can be (and most of the time are) multidimensional, which means that the (lower case) <code>x</code> datapoints consist of more than one component. In this case, datapoints can be considered "vectors", or at least lists of continuous/discrete/non-numeric values. It is also a common convention that in a table, rows denote the individual datapoints, while columns denote the different dimensions/components of the datapoints.</li>

<li><b>Labels</b> (<i>hu: címke</i>): Some datasets consist not only of the datapoints themselves, but of corresponding <b>labels</b> too. Under normal circumstances, every <code>x</code> datapoint has a corresponding <code>y</code> value. The list of labels is denoted by lower case <code>y</code> by convention.</li>

<li><b>Features</b> (<i>no hu translation</i>): Another name for the different dimensions of an <code>X</code> dataset. This is the term that is primarily used for dataset dimensions in data science. In practice, most of the "features" represent an actual, measurable quantity.</li>

<li><b>Class</b> (<i>hu: osztály</i>): In classification problems, labels are discrete, meaning that every datapoint can be "classified" into a specific subset of the dataset. The interpretation of the subsets can be arbitrary. They could simply represent "bins" or "intervals" into which the labels are binned by value, or they could be more meaningful. E.g. if the datapoints are images, labels could indicate whether there is a dog or a cat in the image.</li>

<li><b>Model</b> (<i>same in Hungarian, with double 'L'</i>): A "model" in data science has a very similar meaning as in other fields of science. It is meant to represent some underlying connection between datapoints, or between datapoints and labels, in a dataset. In the context of modern data science, a "model" is an arbitrary mathematical operator or sequence of operators that maps the <code>X</code> dataset to the corresponding <code>y</code> values.</li>

<li><b>Training/Learning</b> (<i>hu: tanulás/tanítás</i>): This is just a fancy and generalized way of saying "fitting a model to data". In numerous cases, "fitting" is not a simple curve fit, but a much more complex process that is harder to interpret. Also, most machine learning methods optimize the model parameters in an iterative process, which is well described by the terms "training" and "learning": models are "trained" over iterations as they "learn" the underlying correlations in the dataset.</li>

<li><b>Supervised vs. unsupervised learning</b> (<i>hu: felügyelt-/felügyelet nélküli tanítás/tanulás</i>): This distinction refers to whether we are using data with or without labels. If labels are attached to the dataset during training/learning, we speak of <b>supervised learning</b>; if no labels are attached, we speak of <b>unsupervised learning</b>.</li>
</ul>
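
To make the notation concrete, here is a minimal sketch of a dataset `X` with labels `y`. The column names and numbers are made up purely for illustration; they are not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: 4 datapoints (rows) x 3 features (columns)
X = pd.DataFrame({
    "height_cm": [172.0, 181.5, 165.2, 190.1],
    "weight_kg": [68.0, 85.3, 54.1, 92.7],
    "age_yr":    [34, 27, 45, 31],
})

# One label per datapoint (here: discrete class labels)
y = np.array([0, 1, 0, 1])

n_datapoints, n_features = X.shape
print(f"{n_datapoints} datapoints, {n_features} features")

# A single datapoint x is one row of X
x = X.iloc[0]
print(x)
```

Rows are datapoints, columns are features, and `y` has exactly one entry for every row of `X`.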

# Types of problems

The overwhelming majority of problems in data science can be classified into 3 groups: regression, classification and clustering.

<img width="700px" src="./images/three-pillars.png" style="display:block; margin:auto;"/>
<p style="text-align:center; font-size:24px;">
  <b>Fig. 1. The three pillars of data science</b>
</p>
<p style="text-align:center; font-size:12px;">
  <b>Source: <a href="https://www.researchgate.net/figure/The-three-pillars-of-learning-in-data-science-clustering-flat-or-hierarchical_fig1_314626729">https://www.researchgate.net/figure/The-three-pillars-of-learning-in-data-science-clustering-flat-or-hierarchical_fig1_314626729</a></b>
</p>

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.cm as cm
import matplotlib.pyplot as plt

from tqdm import tqdm

# Scikit-learn, tensorflow, torch, etc.
#import torch
#import tensorflow as tf

from sklearn.datasets import make_regression, make_classification, \
                             make_blobs, make_moons, make_circles
# ...
# ...

In [None]:
# Initialize seaborn with custom settings
# Facecolor values from S. Conradi @S_Conradi/@profConradi
custom_settings = {
    'figure.facecolor': '#f4f0e8',
    'axes.facecolor': '#f4f0e8',
    'axes.edgecolor': '0.7',
    'axes.linewidth' : '2',
    'grid.color': '0.7',
    'grid.linestyle': 'none',
    'grid.alpha': 0.6,
}
sns.set_theme(palette=sns.color_palette('deep', as_cmap=False),
              rc=custom_settings)
plt.rcParams['text.usetex'] = False

## 1. Regression

<p style="text-align:center; font-size:20px;">
  <b>Data and label -> Model -> Continuous value</b>
</p>

### Data

In [None]:
X, y = make_regression(
    n_samples=1000,
    n_features=10,
    n_informative=10,
    n_targets=1,
    random_state=57
)
X = pd.DataFrame(X)

In [None]:
X

### Labels

In [None]:
fig, ax = plt.subplots(figsize=(20, 5), dpi=120)

ax.plot(y, color='indianred', lw=2)

ax.set_title('$y_{\\text{values}}$',
             fontsize=30, fontweight='bold')
ax.set_xticks([])

plt.show()

### Data $\times$ Labels

In [None]:
nr, nc = 2, 5
fig, axes = plt.subplots(nr, nc, figsize=(nc*5, nr*5), dpi=120)

for i, ax in enumerate(axes.flat):
    ax.scatter(X[i], y, color='indianred', alpha=0.6)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel(f'$X_{{{i+1}}}$', fontsize=30, fontweight='bold')
    ax.set_ylabel('$y$', fontsize=30, fontweight='bold')

plt.show()
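
The **Data and label -> Model -> Continuous value** flow can be sketched in a few lines. `LinearRegression` is an assumption here, picked only because it is the simplest regressor in scikit-learn; any other regressor would follow the same `fit`/`predict` pattern.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Same kind of synthetic data as in the cells above
X, y = make_regression(n_samples=1000, n_features=10, n_informative=10,
                       n_targets=1, random_state=57)

# Data AND labels go into the model during fitting ...
model = LinearRegression().fit(X, y)

# ... and the fitted model returns one continuous value per datapoint
y_pred = model.predict(X[:5])
print(y_pred.dtype, y_pred)
```

Since `make_regression` generates noiseless linear data by default, a linear model fits it almost perfectly; real data will rarely be this kind.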

## 2. Classification

<p style="text-align:center; font-size:20px;">
  <b>Data and label -> Model -> Discrete value</b>
</p>

### Data

In [None]:
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=10,
    n_redundant=0,
    n_classes=3
)
X = pd.DataFrame(X)

In [None]:
X

### Labels

In [None]:
print(' '.join(y.astype(str)))

In [None]:
fig, ax = plt.subplots(figsize=(24, 5))

ax.barh(*np.unique(y, return_counts=True), height=0.7,
        color=cm.tab10(np.unique(y)))

ax.set_yticks(np.unique(y))
ax.set_yticklabels(np.unique(y))

ax.tick_params(axis='both', which='major', labelsize=30)

plt.show()

### Data $\times$ Labels

In [None]:
nr, nc = 2, 5
fig, axes = plt.subplots(nr, nc, figsize=(nc*5, nr*5), dpi=120)

for i, ax in enumerate(axes.flat):
    ax.scatter(X[i], y, color='indianred', alpha=0.6)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel(f'$X_{{{i+1}}}$', fontsize=30, fontweight='bold')
    ax.set_ylabel('$y$', fontsize=30, fontweight='bold')

plt.show()
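
A minimal sketch of the **Data and label -> Model -> Discrete value** flow follows. `LogisticRegression` and the fixed `random_state` are assumptions made for this illustration; the cells above do not fit any model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Same kind of synthetic data as above (random_state added for repeatability)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10,
                           n_redundant=0, n_classes=3, random_state=57)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# The prediction is a discrete class label for each datapoint
y_pred = clf.predict(X[:10])
print(y_pred)

# Many classifiers can also return per-class probabilities (rows sum to 1)
proba = clf.predict_proba(X[:10])
print(proba.round(2))
```

Unlike in regression, the output can only take one of the values that appeared among the training labels.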

## 3. Clustering

<p style="text-align:center; font-size:20px;">
  <b>Only data -> Model -> Discrete value</b>
</p>

### Data

In [None]:
N = 1500
# Create a dummy dataset of blobs
Xb, yb = make_blobs(
    n_samples=N,    # Number of points in the dataset (number of rows)
    n_features=2,   # Dimension of the dataset (number of columns)
    centers=3,      # Number of blobs to create
    cluster_std=[1.0, 2.5, 0.5],
    center_box=(-10, 10),
    random_state=57
)

# Create a dummy dataset of circles
Xc, yc = make_circles(
    n_samples=N,
    noise=0.05,
    factor=0.6,
    random_state=57
)

# Create a dummy dataset of moons
Xm, ym = make_moons(
    n_samples=N,
    noise=0.05,
    random_state=57
)

### Labels

In [None]:
print('Blobs:   ' + ' '.join(yb.astype(str)))
print('Circles: ' + ' '.join(yc.astype(str)))
print('Moons:   ' + ' '.join(ym.astype(str)))

### Data $\times$ Labels

In [None]:
# Visualize them
nr, nc = 1, 3
fig, axes = plt.subplots(nrows=nr, ncols=nc, figsize=(8*nc, 8*nr))

Xi = (Xb, Xc, Xm)
yi = (yb, yc, ym)
for X, y, ax in zip(Xi, yi, axes.flat):

    X = X - np.mean(X, axis=0)  # center each feature (column) separately
    ax.scatter(*X.T, c=cm.viridis(y/np.max(y)), alpha=0.6)

    lim = 1.1 * np.max(np.abs(X))
    ax.set_xlim(-lim, lim)
    ax.set_ylim(-lim, lim)

plt.show()

### Let's have a look at the first one...

In [None]:
# Create a dummy dataset of blobs
Xb, yb = make_blobs(
    n_samples=1000,  # Number of points in the dataset (number of rows)
    n_features=10,   # Dimension of the dataset (number of columns)
    centers=3,       # Number of blobs to create
    cluster_std=1.5,
    center_box=(-10, 10),
    random_state=57
)
Xb = pd.DataFrame(Xb)

In [None]:
Xb

### Labels

Although we assign some ground-truth labels to our data, in the case of clustering we do not use them during the training process. In a real-life scenario we usually have no labels at all, only the data itself. To test the robustness and accuracy of our model, however, we can generate datasets with already known labels.

In [None]:
print(' '.join(yb.astype(str)))
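
The idea of "train without labels, evaluate with known labels" can be sketched as follows. `KMeans` and the adjusted Rand index are assumptions chosen for this illustration, not something the notebook prescribes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

Xb, yb = make_blobs(n_samples=1000, n_features=10, centers=3,
                    cluster_std=1.5, center_box=(-10, 10), random_state=57)

# Training uses only the data; the labels yb are never passed to the model
kmeans = KMeans(n_clusters=3, n_init=10, random_state=57).fit(Xb)

# The known labels are used only afterwards, to score the result.
# The adjusted Rand index does not care how the cluster IDs are numbered.
ari = adjusted_rand_score(yb, kmeans.labels_)
print(f"Adjusted Rand index: {ari:.3f}")
```

A score close to 1 means the clusters found by the model match the generated blobs; a score around 0 means the assignment is no better than random.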

### Data $\times$ Labels

In [None]:
nr, nc = 2, 5
fig, axes = plt.subplots(nr, nc, figsize=(nc*5, nr*5), dpi=120)

for i, ax in enumerate(axes.flat):
    ax.scatter(Xb[i], yb,
               color='indianred', alpha=0.6)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel(f'$X_{{{i+1}}}$', fontsize=30, fontweight='bold')
    ax.set_ylabel('$y$', fontsize=30, fontweight='bold')

plt.show()

## Examples

### Image processing

If we consider images as **datapoints** (**rows**) in a dataset, then pixels of images can be considered as individual *features* (*columns*) of this dataset.
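
As a minimal sketch of this idea, the snippet below flattens a stack of images into a table. The images are random numbers standing in for real cutouts, and the 96 x 96 size is just an assumption for the example.

```python
import numpy as np

# Hypothetical stack of 15 single-channel images, 96 x 96 pixels each,
# filled with random values purely for illustration
rng = np.random.default_rng(0)
imgs = rng.integers(0, 256, size=(15, 96, 96), dtype=np.uint8)

# Flatten every image into one row: (n_images, height * width)
X = imgs.reshape(len(imgs), -1)
print(X.shape)

# Nothing is lost: reshaping a row gives back the original image
restored = X[3].reshape(96, 96)
print(np.array_equal(restored, imgs[3]))
```

Each row of `X` is now one datapoint and each column is the brightness of one particular pixel.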

#### Sloan Digital Sky Survey (SDSS) galaxy tiles and corresponding redshift values

**Install SciScript using `conda`**

In [None]:
# !git clone https://github.com/sciserver/SciScript-Python.git ./tmp/SciScript-Python
# %cd ./tmp/SciScript-Python/py3
# !python -m build
# %pip install dist/*.whl
# %cd ../../..

**Install SciScript using `pip`**

In [None]:
# !git clone https://github.com/sciserver/SciScript-Python.git ./tmp/SciScript-Python
# %cd ./tmp/SciScript-Python/py3
# !python -m build
# %pip install dist/*.whl
# %cd ../../..

In [None]:
from astroquery.sdss import SDSS

from astropy import coordinates as coords
from astropy.visualization import ImageNormalize, ZScaleInterval
from sklearn.preprocessing import StandardScaler, Normalizer

from SciServer import SkyServer

import warnings
warnings.filterwarnings("ignore", category=Warning)

In [None]:
query = f'''
SELECT TOP 15
    g.ra, g.dec, g.PetroRad_r, s.z
FROM Galaxy g
    JOIN
        SpecObj s ON s.specObjID = g.specObjID
WHERE
    g.clean=1
    AND g.PetroRad_r BETWEEN 20 AND 32
    AND s.z BETWEEN 0.02 AND 0.05
'''
data = SDSS.query_sql(query, data_release=17).to_pandas()
co = coords.SkyCoord(ra=data['ra'], dec=data['dec'], unit='deg', frame='icrs')

In [None]:
imgs = []
for ra, dec in tqdm(zip(data['ra'], data['dec'])):
    img = SkyServer.getJpegImgCutout(ra=ra, dec=dec,
                                     width=96, height=96, scale=0.7, opt="")
    imgs.append(img)
imgs = np.array(imgs)

In [None]:
di = np.random.randint(len(imgs))
img = imgs[di]

print(f"{img.shape = }")
print(f"num of pixels = {img.size}")
print(f"Redshift is z = {data.z[di]}")

nr, nc = 1, 3
fig, axes = plt.subplots(nr, nc, figsize=(nc*6, nr*6), dpi=120)

for ax in axes:
    ax.axis('off')
# Plot all 3 color channels
ax = axes[0]
ax.imshow(img[..., 0], interpolation='none', cmap='Greys_r')
ax.text(0.025, 0.975, 'Red Channel', color='white', fontweight='bold',
        fontsize=20, ha='left', va='top', transform=ax.transAxes,
        bbox=dict(facecolor='black', alpha=0.5, edgecolor='none'))

ax = axes[1]
ax.imshow(img[..., 1], interpolation='none', cmap='Greys_r')
ax.text(0.025, 0.975, 'Green Channel', color='white', fontweight='bold',
        fontsize=20, ha='left', va='top', transform=ax.transAxes,
        bbox=dict(facecolor='black', alpha=0.5, edgecolor='none'))

ax = axes[2]
ax.imshow(img[..., 2], interpolation='none', cmap='Greys_r')
ax.text(0.025, 0.975, 'Blue Channel', color='white', fontweight='bold',
        fontsize=20, ha='left', va='top', transform=ax.transAxes,
        bbox=dict(facecolor='black', alpha=0.5, edgecolor='none'))

plt.show()

In [None]:
pd.DataFrame(img[..., 1])  # Printing one of the color channels

In [None]:
fig, ax = plt.subplots(figsize=(10, 2), dpi=200)

ax.plot(img.flatten(), color='indianred', lw=1)
ax.set_xticks([])
ax.set_xlim(0, img.size)

plt.show()

#### Now imagine a whole data set of images like this...

In [None]:
nr, nc = 3, 5
fig, axes = plt.subplots(nr, nc, figsize=(nc*4, nr*4), dpi=120,
                         facecolor='black')
fig.subplots_adjust(wspace=0.05, hspace=0.05)
axes = axes.flatten()

for ax in axes:
    ax.axis(False)
    
# Plot cutouts
for ax, img in zip(axes, imgs):
    ax.imshow(img)

plt.show()

In [None]:
X = pd.DataFrame(imgs[..., 0].reshape((-1, img.shape[0] * img.shape[1])))
X

You can even plot this flattened data set, but with so few rows, it is not very informative.

In [None]:
fig, ax = plt.subplots(figsize=(20, 20), dpi=600)
fig.patch.set_visible(False)  # remove figure background

ax.axis(False)
ax.set_aspect('equal')
ax.imshow(X.values, interpolation='none', cmap='Greys_r')

fig.tight_layout()
plt.show()

### Mixed dataset

Data of 891 Titanic passengers.

In [None]:
X = pd.read_csv("http://patbaa.web.elte.hu/physdm/data/titanic.csv")

In [None]:
X

In [None]:
fig, ax = plt.subplots(figsize=(24, 24), facecolor='black')

# Determine the image extent and axis limits for dear Mr. Matplotlib
x_lim = (0, X.values.shape[0]-1)
y_lim = [-0.5, X.values.shape[1]-0.5]

ax.imshow(X.isna().values.T,
          extent=(x_lim[0], x_lim[-1], y_lim[0], y_lim[-1]),
          aspect=10, cmap="Greys", interpolation='none')

# Y-AXIS FORMATTING
ax.set_yticks(range(X.columns.size))
ax.set_yticklabels(X.columns[::-1], ha='right')
ax.tick_params(axis='both', which='major',
               labelsize=12, pad=10, colors='white')

ax.grid(True, axis='y', ls='--', alpha=0.5)

plt.show()

**This dataset needs some preprocessing!**

### A completely different type of problem

In [None]:
fasta = "MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEK"

<img src="./images/alphafold.png"/>

# How to approach and handle a problem in data science?

Most of the problems should be approached and treated similarly by following these simple steps:
- Step 1.: Preprocess the dataset for analysis
- Step 2.: Find, tune and fit a model or models on the preprocessed dataset
- Step 3.: Make predictions using the trained model and evaluate and interpret the results
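
The three steps above can be sketched with scikit-learn as follows. The synthetic dataset, the `StandardScaler` + `LogisticRegression` combination and the train/test split ratio are all assumptions made for illustration; a real problem will need its own choices at every step.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset; hold out a part of it for the final evaluation
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 1 (preprocessing) and Step 2 (model) chained into a single estimator
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

# Step 3: predict on data the model has not seen, then evaluate
y_pred = pipe.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.3f}")
```

Wrapping the preprocessing into the pipeline ensures that the scaler is fitted on the training data only and then reused unchanged on the test data.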

<img src="./images/pipeline-full.png"/>

# Preprocessing

Many beginner machine learning/data science guides for specific datasets will tell you to work with the data in a very specific way without actually telling you **why** you should do it **that** way. Why is *scaling* the data necessary? Why should you use *one-hot encoding*? What else can be done about missing data entries besides simply dropping them?

## 1.0. Every data preprocessing starts with data exploration

### Why? Because looking at the data could be extremely insightful...

See this example at https://en.wikipedia.org/wiki/Anscombe's_quartet

In [None]:
X = sns.load_dataset("anscombe")  # Load Anscombe's quartet from seaborn

In [None]:
nr, nc = 1, 4
fig, axes = plt.subplots(nr, nc, figsize=(5*nc, 5*nr), dpi=120)
fig.subplots_adjust(hspace=0.1, wspace=0.1)

for di, ax in zip(X.dataset.unique(), axes.flat):
    Xs = X.query(f"dataset == '{di}'")['x']
    ys = X.query(f"dataset == '{di}'")['y']
    ax.scatter(Xs, ys, s=15**2)
    xi = np.linspace(-10, 25)
    yi = 1/2 * xi + 3.0
    ax.plot(xi, yi, color='tab:red', lw=5, alpha=0.7)
    ax.set_xlim(0.8 * np.min(Xs), 1.1 * np.max(Xs))
    ax.set_ylim(0.8 * np.min(ys), 1.1 * np.max(ys))

plt.show()
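
Why plotting matters here can also be checked numerically. The sketch below computes a few summary statistics for the four datasets; the values are typed in by hand from the published quartet (instead of using `sns.load_dataset`) only to keep the example self-contained.

```python
import numpy as np

# Published values of Anscombe's quartet (datasets I-III share the same x)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                   7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                   6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
             5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line
    summary[name] = (x.mean(), y.mean(), slope, intercept)
    print(f"{name:>3}: mean(x)={x.mean():.2f}  mean(y)={y.mean():.2f}  "
          f"slope={slope:.2f}  intercept={intercept:.2f}")
```

All four datasets give practically the same means and the same fitted line, even though the scatter plots look nothing alike. Summary statistics alone would never reveal that.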

### 1.0.1. Looking at the data

In [None]:
X = pd.read_csv("http://patbaa.web.elte.hu/physdm/data/titanic.csv")

In [None]:
X

### 1.0.2. Exploring missing data points

In [None]:
X.isna().sum()

In [None]:
fig, ax = plt.subplots(figsize=(24, 24), facecolor='black')

# Determine the image extent and axis limits for dear Mr. Matplotlib
x_lim = (0, X.values.shape[0]-1)
y_lim = [-0.5, X.values.shape[1]-0.5]

ax.imshow(X.isna().values.T,
          extent=(x_lim[0], x_lim[-1], y_lim[0], y_lim[-1]),
          aspect=10, cmap="Greys", interpolation='none')

# Y-AXIS FORMATTING
ax.set_yticks(range(X.columns.size))
ax.set_yticklabels(X.columns[::-1], ha='right')
ax.tick_params(axis='both', which='major',
               labelsize=15, pad=10, colors='white')

ax.grid(True, axis='y', ls='--', alpha=0.5)

plt.show()

### 1.0.3. Exploring datatypes in the dataset

Object? Int? Float? Other?

In [None]:
# The `dtypes` attribute of a pandas DataFrame stores the datatype
# of every column in that DataFrame
X.dtypes
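
Once the datatypes are known, a common next step is to separate numeric columns from everything else, since the two groups need different preprocessing. A minimal sketch, using a tiny made-up table (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical mixed-type table
df = pd.DataFrame({
    "age":  [22, 38, 26],
    "fare": [7.25, 71.28, 7.92],
    "sex":  ["male", "female", "female"],
})

# Columns holding numbers (ints and floats) ...
numeric_cols = df.select_dtypes(include="number").columns.tolist()
# ... and everything else (strings, objects, etc.)
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("other:  ", other_cols)
```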

### 1.0.4. Exploring distribution of feature values

Explore a randomly generated classification dataset with 2 distinct classes and 8 features.

In [None]:
X, y = make_classification(
    n_samples=200,
    n_features=8,
    n_informative=4,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=0,
)
X = pd.DataFrame(X)

In [None]:
X

In [None]:
nc, nr = 4, 2
fig, axes = plt.subplots(nr, nc, figsize=(6*nc, 5*nr))

mask = y.astype(bool)        # True where the label is 1
data = [X[~mask], X[mask]]   # class 0 first, then class 1 (matches `labl`)
cmap = [cm.Reds, cm.Blues]
labl = ['Class 0', 'Class 1']
for d, c, l in zip(data, cmap, labl):
    for i, ax in enumerate(axes.flat):
        ax.grid(True, ls='--', alpha=0.6)

        # Convention for plotting numpy.histogram results
        hist, bins = np.histogram(d.values[:, i], bins=20, density=True)
        width = 0.8 * (bins[1] - bins[0])
        center = (bins[:-1] + bins[1:]) / 2
        ax.bar(center, hist, width=width, label=l,
               color=c(0.6), alpha=0.6)

        ax.set_title(f"Feature {i}", fontsize=12, fontweight='bold')
        ax.legend(loc='upper left', fontsize=16)

plt.show()

### 1.0.5. Exploring the correlation of features

In [None]:
X, y = make_classification(
    n_samples=100,    # Number of points in the data set
    n_features=6,     # Number of features in the data set
    n_informative=4,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=0,
)
X = pd.DataFrame(X)

In [None]:
X

In [None]:
sns.pairplot(
    X,
    kind='scatter',
    diag_kind='kde'
)

plt.show()
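
The pair plot can be complemented with the correlation matrix itself. The sketch below regenerates the same synthetic dataset and uses the Pearson correlation (the pandas default), which is an assumption; other measures such as Spearman may suit other data better.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=6, n_informative=4,
                           n_redundant=2, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, random_state=0)
X = pd.DataFrame(X)

# Pairwise Pearson correlation of all features
corr = X.corr()
print(corr.round(2))

# Find the most strongly correlated pair of *different* features
abs_corr = corr.abs().to_numpy().copy()
np.fill_diagonal(abs_corr, 0.0)  # ignore the trivial self-correlation
i, j = np.unravel_index(abs_corr.argmax(), abs_corr.shape)
print(f"Strongest pair: features {i} and {j}, |r| = {abs_corr[i, j]:.2f}")
```

Strongly correlated features carry overlapping information, which is worth knowing before choosing a model or reducing dimensionality.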

## 1.1. Handling missing data

### 1.1.1. Deleting rows or columns with too many NaN values

- Rows or columns with too many missing values cannot be filled up in a meaningful way that is representative of the underlying distribution of the data
- It is better to simply drop rows or columns with too many missing values
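
A minimal sketch of dropping by missing-value fraction is shown below. The toy table and both thresholds (`max_missing`, `min_present`) are hypothetical; what counts as "too many" depends on the dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
    "c": [1.0, np.nan, np.nan, 4.0, 5.0],
})

max_missing = 0.5  # hypothetical cut-off: drop columns with >50% NaN
min_present = 1    # hypothetical cut-off: rows need at least 1 value

# Keep only the columns whose NaN fraction is small enough
keep_cols = df.columns[df.isna().mean() <= max_missing]
df_cols = df[keep_cols]

# Then drop the rows that have fewer than `min_present` non-NaN entries
df_clean = df_cols.dropna(axis=0, thresh=min_present)
print(df_clean)
```

Here column `b` is dropped first, and then the row that is empty in both remaining columns is removed.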

### 1.1.2. Filling empty entries with values

- Filling NaN entries with mean of existing values
- Filling NaN entries with values sampled from the distribution of existing ones
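
Both filling strategies can be sketched on a single column. The numbers are made up, and sampling uniformly from the observed values is just one simple way to "sample from the distribution of existing ones".

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 8.0])

# Option 1: fill every NaN with the mean of the observed values
s_mean = s.fillna(s.mean())

# Option 2: fill every NaN with a value drawn at random from the
# observed ones, which preserves the spread of the distribution
observed = s.dropna().to_numpy()
missing = s.isna()
s_sampled = s.copy()
s_sampled[missing] = rng.choice(observed, size=int(missing.sum()))

print(s_mean.tolist())
print(s_sampled.tolist())
```

Mean filling keeps the average unchanged but artificially shrinks the variance; sampling keeps the shape of the distribution but adds random noise.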

## 1.2. Handling non-numeric data

- E.g. utilize label encoding and one-hot encoding
- Or leave them as they are when using tree ensemble methods with native support for categorical features (e.g. CatBoost)

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [None]:
X, y = make_classification(
    n_samples=10,
    n_features=6,
    n_informative=4,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=0,
)
X = pd.DataFrame(X)

categories = ['Dog', 'Cat', 'Raccoon', 'Bear']
feat = np.random.choice(categories, size=len(X), replace=True)
X[5] = feat  # replace the last feature with a categorical column
X

### Label encoding

Used when the categorical feature is ordinal, i.e. the values have a natural order.
- E.g. "rite", "cum laude", "summa cum laude" can be encoded to 0, 1, 2
- E.g. "small", "medium", "large", "extra large" can be encoded to 0, 1, 2, 3

In [None]:
le = LabelEncoder()
Xenc = le.fit_transform(X.iloc[:, 5])
pd.concat([X.iloc[:, :-1], pd.Series(Xenc, name=X.columns[5])], axis=1).head(10)
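
One caveat is worth a quick sketch: `LabelEncoder` assigns codes in sorted (alphabetical) order, which usually does not match the natural order of an ordinal feature, and scikit-learn intends it for target labels rather than features. If the order matters, `OrdinalEncoder` with an explicit category list is one option. The `sizes` table below is a made-up example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature
sizes = pd.DataFrame(
    {"size": ["small", "large", "medium", "extra large", "small"]})

# LabelEncoder sorts the categories alphabetically:
# 'extra large' -> 0, 'large' -> 1, 'medium' -> 2, 'small' -> 3
le_codes = LabelEncoder().fit_transform(sizes["size"])
print(le_codes)

# OrdinalEncoder lets us spell out the natural order explicitly
order = ["small", "medium", "large", "extra large"]
oe = OrdinalEncoder(categories=[order])
oe_codes = oe.fit_transform(sizes[["size"]]).ravel().astype(int)
print(oe_codes)
```

With the explicit order, "small" $<$ "medium" $<$ "large" $<$ "extra large" is reflected in the codes 0, 1, 2, 3.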

### One-hot encoding

Used when the categorical feature is nominal, i.e. the values do not have a natural order.
- E.g. "red", "green", "blue" can be encoded to 3 binary features
- E.g. "bear", "cat", "dog", "raccoon" can be encoded to 4 binary features

In [None]:
# Method 1.
encoder = OneHotEncoder(categories=[categories], sparse_output=False)
Xenc = encoder.fit_transform(X.iloc[:, 5].values.reshape(-1, 1))
Xenc = pd.DataFrame(Xenc.astype(int), columns=categories)
pd.concat((X.iloc[:, :-1], Xenc), axis=1).head(10)

In [None]:
# Method 2.
Xenc = pd.get_dummies(X.iloc[:, -1], dtype=int)
pd.concat([X.iloc[:, :-1], Xenc], axis=1)

### Why can (and should) we use these two methods in these two different cases?

In the case of **one-hot encoding**, we prevent the model from learning false correlations between different values of a categorical feature. If we instead assigned ascending integer values to these categories, we would be mapping them to a **discrete, ordered scale**. Statistical models would then interpret this as if a meaningful order exists&mdash;for example, that if `0` is assigned to "small" and `1` to "medium", then "small" $<$ "medium", which is **correct** when the categories are **ordinal**.

However, this logic falls apart when applied to **nominal** categories such as "red", "green", and "blue". Assigning `1` to "green" and `2` to "blue" would incorrectly suggest to the model that "green" $<$ "blue"&mdash;which is not only **nonsensical**, but potentially **harmful** to model performance (unless you are a color zealot, in which case that is a separate issue and we will NOT ask your opinion about this question).

That is why **one-hot encoding** is used for nominal features: it treats each category as **independent**, avoiding any unintended assumptions about order or distance between values.
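
The "unintended distance" argument can be made explicit with a small numerical sketch. The integer codes below (red=0, green=1, blue=2) are an arbitrary assignment chosen for illustration.

```python
import numpy as np

colors = ["red", "green", "blue"]

# Integer codes: red=0, green=1, blue=2 (one column)
int_codes = np.arange(len(colors), dtype=float).reshape(-1, 1)
# One-hot codes: each color gets its own column
onehot = np.eye(len(colors))

def pairwise_dist(A):
    """Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

d_int = pairwise_dist(int_codes)
d_hot = pairwise_dist(onehot)

print(d_int)  # red-blue is twice as far apart as red-green
print(d_hot)  # every pair of different colors is equally far apart
```

With integer codes, a distance-based model would treat "red" and "blue" as twice as different as "red" and "green". With one-hot codes, all pairs of different colors are separated by the same distance, $\sqrt{2}$.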


## 1.3. Scaling

### 1.3.1 Example using an SDSS plate

In [None]:
imgs = SDSS.get_images(coordinates=co)
X = pd.DataFrame(imgs[0][0].data.astype(np.float32))

In [None]:
X

In [None]:
fig, ax = plt.subplots(figsize=(24, 5), dpi=120)
ax.grid(True, ls='--', alpha=0.6)

ax.plot(X.values.flatten(), color='indianred', lw=2)

ax.set_xticks([])
ax.tick_params(axis='y', labelsize=30)
ax.yaxis.get_offset_text().set_fontsize(30)

plt.show()

### Machine learning algorithms don't like values all over the scale!

#### StandardScaler

Scales each individual **feature (column)** to have a mean of $0$ and a standard deviation of $1$.

$$
    x' = \frac{x - \mu}{\sigma}
$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of each feature. This is useful when features have different scales or units.
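
The formula can be verified by hand against scikit-learn. The two-feature dataset below is randomly generated for illustration, with deliberately mismatched scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features living on very different scales
X = np.column_stack([
    rng.normal(loc=1e4, scale=500.0, size=200),
    rng.normal(loc=0.01, scale=0.002, size=200),
])

# x' = (x - mu) / sigma, computed column by column.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also the default of numpy's std().
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
X_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))
print(X_sklearn.mean(axis=0).round(6), X_sklearn.std(axis=0).round(6))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates the other just because of its units.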

In [None]:
Xs = StandardScaler().fit_transform(X)
Xs = pd.DataFrame(Xs)

In [None]:
Xs

In [None]:
fig, ax = plt.subplots(figsize=(24, 5), dpi=120)
ax.grid(True, ls='--', alpha=0.6)

ax.plot(Xs.values.flatten(), color='indianred', lw=2)

ax.set_xticks([])
ax.tick_params(axis='y', labelsize=30)
ax.yaxis.get_offset_text().set_fontsize(30)

plt.show()

#### Normalizer

Scales each individual **sample (row)** to have unit norm, meaning the entire row vector has length $1$. This is often useful when the direction of the data matters more than its magnitude. Normalizer uses either the L1 or the L2 norm to scale the data (scikit-learn additionally offers the max norm).

- **L2 norm** (Euclidean norm): scales the vector so that the scaled vector $\mathbf{x}'$ satisfies
  $$
      \| \mathbf{x}' \|_{2} = \sqrt{x_{1}'^{\,2} + x_{2}'^{\,2} + \cdots + x_{n}'^{\,2}} = 1\,.
  $$
  To achieve this, each element is divided by the L2 norm of the original vector:
  $$
      x'_i = \frac{x_{i}}{\| \mathbf{x} \|_2}
  $$

- **L1 norm** (Manhattan norm): scales the vector so that the scaled vector $\mathbf{x}'$ satisfies
  $$
      \| \mathbf{x}' \|_{1} = |x'_{1}| + |x'_{2}| + \cdots + |x'_{n}| = 1\,.
  $$
  To achieve this, each element is divided by the L1 norm of the original vector:
  $$
      x'_i = \frac{x_{i}}{\| \mathbf{x} \|_{1}}
  $$

In [None]:
Xn = Normalizer(norm='l2').fit_transform(X)
Xn = pd.DataFrame(Xn)

In [None]:
Xn
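
The row-wise behaviour of `Normalizer` is easy to check on a tiny, hand-made matrix (the numbers are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, -1.0],
              [0.5, 0.0]])

X_l2 = Normalizer(norm="l2").fit_transform(X)
X_l1 = Normalizer(norm="l1").fit_transform(X)

# Every ROW now has unit length in the chosen norm
print(np.linalg.norm(X_l2, axis=1))  # L2 norm of each row
print(np.abs(X_l1).sum(axis=1))      # L1 norm of each row

# First row by hand: ||(3, 4)||_2 = 5, so it becomes (0.6, 0.8)
print(X_l2[0])
```

Unlike `StandardScaler`, which works on columns, `Normalizer` never mixes information between different samples: each row is rescaled using only its own values.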

In [None]:
fig, ax = plt.subplots(figsize=(24, 5), dpi=120)
ax.grid(True, ls='--', alpha=0.6)

ax.plot(Xn.values.flatten(), color='indianred', lw=2)

ax.set_xticks([])
ax.tick_params(axis='y', labelsize=30)
ax.yaxis.get_offset_text().set_fontsize(30)

plt.show()

#### Images themselves need a different type of normalization!

In [None]:
nr, nc = 1, 3
fig, axes = plt.subplots(nr, nc, figsize=(nc*10, nr*10))
fig.subplots_adjust(hspace=0.1, wspace=0.025,
                    left=0.05, right=0.95, bottom=0.05, top=0.95)
ax = axes[0]
ax.axis(False)
ax.imshow(Xs.values, origin='lower', cmap='gray')
ax.text(0.025, 0.975, 'Scaled Image', color='white', fontweight='bold',
        fontsize=30, ha='left', va='top', transform=ax.transAxes,
        bbox=dict(facecolor='black', alpha=0.5, edgecolor='none'))

ax = axes[1]
ax.axis(False)
ax.imshow(Xn.values, origin='lower', cmap='gray')
ax.text(0.025, 0.975, 'Normalized Image', color='white', fontweight='bold',
        fontsize=30, ha='left', va='top', transform=ax.transAxes,
        bbox=dict(facecolor='black', alpha=0.5, edgecolor='none'))

ax = axes[2]
ax.axis(False)
norm = ImageNormalize(X.values, interval=ZScaleInterval())
ax.imshow(X.values, origin='lower', cmap='gray', norm=norm)
ax.text(0.025, 0.975, 'IRAF\'s zscale', color='white', fontweight='bold',
        fontsize=30, ha='left', va='top', transform=ax.transAxes,
        bbox=dict(facecolor='black', alpha=0.5, edgecolor='none'))


plt.show()

### 1.3.2 Example using a 2D map from the CAMELS data set

In [None]:
import io
import requests

In [None]:
# This is a larger image file (~100 MB), it could take a while to download,
# depending on your internet connection and the server load!
#
# URL might change in the future, search for the latest version
# at https://camels-multifield-dataset.readthedocs.io/en/latest/access.html
url = 'https://users.flatironinstitute.org/~fvillaescusa/priv/DEPnzxoWlaTQ6CjrXqsm0vYi8L7Jy/CMD/2D_maps/data/IllustrisTNG/Maps_Mtot_IllustrisTNG_CV_z=0.00.npy'

response = requests.get(url)
response.raise_for_status()  # Raises an error immediately if the download failed

X = np.load(io.BytesIO(response.content))
X = pd.DataFrame(X.reshape((-1, 256**2)))

In [None]:
nr, nc = 1, 3
fig, axes = plt.subplots(nr, nc, figsize=(nc*6, nr*6), dpi=120)
plt.subplots_adjust(hspace=0.1, wspace=0.025,
                    left=0.05, right=0.95, bottom=0.05, top=0.95)

img = X.iloc[X.sum(axis=1).argmax()].values.reshape((256, 256))
cmap = 'magma'

ax = axes[0]
ax.axis(False)
ax.imshow(img, cmap=cmap)
ax.text(0.025, 0.975, 'Raw Image', color='white', fontweight='bold',
        fontsize=30, ha='left', va='top', transform=ax.transAxes,
        bbox=dict(facecolor='black', alpha=0.5, edgecolor='none'))

ax = axes[1]
ax.axis(False)
ax.imshow(np.log10(img), cmap=cmap)
ax.text(0.025, 0.975, '$\\log_{10}$', color='white', fontweight='bold',
        fontsize=30, ha='left', va='top', transform=ax.transAxes,
        bbox=dict(facecolor='black', alpha=0.5, edgecolor='none'))

ax = axes[2]
ax.axis(False)
norm = ImageNormalize(img, interval=ZScaleInterval())
ax.imshow(img, cmap=cmap, norm=norm)
ax.text(0.025, 0.975, 'IRAF\'s zscale', color='white', fontweight='bold',
        fontsize=30, ha='left', va='top', transform=ax.transAxes,
        bbox=dict(facecolor='black', alpha=0.5, edgecolor='none'))

plt.show()

## 1.4. Dimensionality reduction

In [None]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

In [None]:
#...in the next section...