# Data science in Python I. - Problems in data science

## Glossary

<p style="font-size:16px;">
Data science primarily deals with datasets (surprise-surprise). The field is built upon a number of fundamental terms, and it is necessary to be familiar with them before venturing any further. Below you find a short list of the most important ones, each with a short description.
</p>

<ul style="font-size:16px;">
<li><b>Dataset</b> (<i>hu: adathalmaz</i>): An ensemble of datapoints, denoted by upper case <code>X</code> by convention. Datasets can be (and most of the time are) multidimensional, which means that the <code>x</code> (lower case) datapoints consist of more than one component. In this case, datapoints can be considered "vectors", or at least lists of continuous/discrete/non-numeric values. It is also a common convention that in a table, rows denote the individual datapoints, while columns denote the different dimensions/components of the datapoints.</li>

<li><b>Labels</b> (<i>hu: címke</i>): Some datasets consist not only of the datapoints themselves, but of corresponding <b>labels</b> too. Under normal circumstances, every <code>x</code> datapoint has a corresponding <code>y</code> value. The list of labels is denoted by lower case <code>y</code> by convention.</li>

<li><b>Features</b> (<i>no hu translation</i>): Another name for the different dimensions of an <code>X</code> dataset. This is the term that is primarily used for dataset dimensions in data science. In practice, most of the "features" represent an actual, measurable quantity.</li>

<li><b>Class</b> (<i>hu: osztály</i>): In classification problems, labels are discrete, which means that every datapoint can be "classified" into a specific subset of the dataset. The interpretation of these subsets can be arbitrary. They could simply represent "bins" or "intervals" into which continuous values are binned. Or they could be more meaningful, e.g. if the datapoints are images, labels could indicate whether there is a dog or a cat in the image.</li>

<li><b>Model</b> (<i>hu: modell</i>): A "model" in data science has a very similar meaning as in other fields of science. It is meant to represent some underlying connection between datapoints, or between datapoints and labels, in a dataset. In the context of modern data science, a "model" is an arbitrary mathematical operator or sequence of operators that maps the <code>X</code> dataset to the corresponding <code>y</code> values.</li>

<li><b>Training/Learning</b> (<i>hu: tanulás/tanítás</i>): This is just a fancy and generalized way of saying "fitting a model to data". In numerous cases this "fitting" is not just a simple curve fitting, but a much more complex process that is harder to interpret. Also, most machine learning methods optimize the model parameters during an iterative process, which is well described by the terms "training" and "learning": models are "trained" over iterations as they "learn" the underlying correlations in the dataset.</li>

<li><b>Supervised vs. unsupervised learning</b> (<i>hu: felügyelt-/felügyelet nélküli tanítás/tanulás</i>): It means whether we're using data with or without labels. If labels are attached to a dataset during training/learning, we're speaking about <b>supervised learning</b>, while if no labels are attached to our dataset, we're speaking about <b>unsupervised learning</b>.</li>
</ul>
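
The conventions above can be made concrete with a few lines of code. The following is only a minimal sketch: the tiny random dataset, the linear rule used to create `y`, and the choice of `LinearRegression` and `KMeans` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Dataset X: 6 datapoints (rows) with 2 features (columns)
X = rng.normal(size=(6, 2))
# Labels y: one value per datapoint (here a noiseless linear function of X)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1]

# Supervised learning: the model sees both X and y during training
reg = LinearRegression().fit(X, y)
y_pred = reg.predict(X)

# Unsupervised learning: the model only ever sees X
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = km.labels_

print(X.shape, y.shape)
print(np.round(reg.coef_, 3))
print(clusters)
```

Note that only the supervised model receives `y` in its `fit` call; the clustering model has to come up with discrete group ids on its own.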

# Types of problems

The overwhelming majority of problems in data science can be classified into 3 groups: regression, classification and clustering.

<img width="700px" src="./images/three-pillars.png" style="display:block; margin:auto;"/>
<p style="text-align:center; font-size:24px;">
  <b>Fig. 1. The three pillars of data science</b>
</p>
<p style="text-align:center; font-size:12px;">
  <b>Source: <a href="https://www.researchgate.net/figure/The-three-pillars-of-learning-in-data-science-clustering-flat-or-hierarchical_fig1_314626729">https://www.researchgate.net/figure/The-three-pillars-of-learning-in-data-science-clustering-flat-or-hierarchical_fig1_314626729</a></b>
</p>

In [None]:
import numpy as np
import scipy as sp
import pandas as pd

import seaborn as sns
import matplotlib.cm as cm
import matplotlib.pyplot as plt
plt.rcParams['text.usetex'] = False

# Scikit-learn, tensorflow, torch, etc.
#import torch
#import tensorflow as tf

from sklearn.datasets import make_regression, make_classification, \
                             make_blobs, make_moons, make_circles
# ...
# ...

## 1. Regression

<p style="text-align:center; font-size:20px;">
  <b>Data and label -> Model -> Continuous value</b>
</p>

### Data

In [None]:
X, y = make_regression(
    n_samples=1000,
    n_features=10,
    n_informative=10,
    n_targets=1,
    random_state=57
)
X = pd.DataFrame(X)

In [None]:
X

### Labels

In [None]:
fig, ax = plt.subplots(figsize=(20, 5), dpi=120)
ax.grid(True, ls='--', alpha=0.6)

ax.plot(y, lw=2)

# LaTeX rendering is switched off (text.usetex = False), so use plain text
ax.set_title("y values",
             fontsize=30, fontweight='bold')
ax.set_xticks([])

plt.show()

### Data $\times$ Labels

In [None]:
nr, nc = 2, 5
fig, axes = plt.subplots(nr, nc, figsize=(nc*5, nr*5), dpi=120)

for i, ax in enumerate(axes.flat):
    ax.scatter(X[i], y, 
               color='indianred', alpha=0.6)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel(f'$X_{{{i+1}}}$', fontsize=30, fontweight='bold')
    ax.set_ylabel('$y$', fontsize=30, fontweight='bold')

plt.show()

## 2. Classification

<p style="text-align:center; font-size:20px;">
  <b>Data and label -> Model -> Discrete value</b>
</p>

### Data

In [None]:
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=10,
    n_redundant=0,
    n_classes=3,
    random_state=57   # Fixed seed, so the cell is reproducible
)
X = pd.DataFrame(X)

In [None]:
X

### Labels

In [None]:
# Print all labels on one line, separated by spaces
print(' '.join(str(i) for i in y))

In [None]:
fig, ax = plt.subplots(figsize=(24, 5))
ax.grid(True, ls='--', alpha=0.6)

ax.barh(*np.unique(y, return_counts=True), height=0.7,
        color=cm.tab10(np.unique(y)))

ax.set_yticks(np.unique(y))
ax.set_yticklabels(np.unique(y))

ax.tick_params(axis='both', which='major', labelsize=30)

plt.show()

### Data $\times$ Labels

In [None]:
nr, nc = 2, 5
fig, axes = plt.subplots(nr, nc, figsize=(nc*5, nr*5), dpi=120)

for i, ax in enumerate(axes.flat):
    ax.scatter(X[i], y, 
               color='indianred', alpha=0.6)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel(f'$X_{{{i+1}}}$', fontsize=30, fontweight='bold')
    ax.set_ylabel('$y$', fontsize=30, fontweight='bold')

plt.show()

## 3. Clustering

<p style="text-align:center; font-size:20px;">
  <b>Only data -> Model -> Discrete value</b>
</p>

### Data

In [None]:
N = 1500
# Create a dummy dataset of blobs
Xb, yb = make_blobs(
    n_samples=N,    # Number of points in the dataset
    n_features=2,   # Dimension of the dataset (Here it's a 2D dataset)
    centers=3,      # Number of blobs to create
    cluster_std=[1.0, 2.5, 0.5],
    center_box=(-10, 10),
    random_state=57
)

# Create a dummy dataset of circles
Xc, yc = make_circles(
    n_samples=N,    # Number of points in the dataset
    noise=0.05,
    factor=0.6,
    random_state=57
)

# Create a dummy dataset of moons
Xm, ym = make_moons(
    n_samples=N,    # Number of points in the dataset
    noise=0.05,
    random_state=57
)

In [None]:
# Visualize them
nr, nc = 1, 3
fig, axes = plt.subplots(nrows=nr, ncols=nc, figsize=(8*nc, 8*nr))

Xi = (Xb, Xc, Xm)
yi = (yb, yc, ym)
for X, y, ax in zip(Xi, yi, axes.flat):
    ax.grid(True, ls='--', alpha=0.6)

    # Center each feature (column) around zero
    X = X - np.mean(X, axis=0)
    ax.scatter(*X.T, c=y)

    lim = 1.1 * np.max(np.abs(X))
    ax.set_xlim(-lim, lim)
    ax.set_ylim(-lim, lim)

plt.show()

### Let's have a look at the first one...

In [None]:
# Create a dummy dataset of blobs
Xb, yb = make_blobs(
    n_samples=1000,  # Number of points in the dataset
    n_features=10,   # Dimension of the dataset (Here it's 10D)
    centers=3,       # Number of blobs to create
    cluster_std=1.5,
    center_box=(-10, 10),
    random_state=57
)
Xb = pd.DataFrame(Xb)

In [None]:
Xb

### Labels

In the case of clustering, however, we do not have this information. The labels printed below are known only because we generated the dataset ourselves.
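
What "no labels" means in practice can be sketched as follows. This is a hedged illustration, not part of the original analysis: `KMeans` and the adjusted Rand index are choices made here, and the names `X_demo`, `y_true` and `y_found` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same kind of toy data as above; y_true exists only because we generated it
X_demo, y_true = make_blobs(n_samples=1000, n_features=10, centers=3,
                            cluster_std=1.5, center_box=(-10, 10),
                            random_state=57)

# The clustering model is fitted on the datapoints alone
km = KMeans(n_clusters=3, n_init=10, random_state=57)
y_found = km.fit_predict(X_demo)

# Cluster ids are arbitrary, so compare with a permutation-invariant score
ari = adjusted_rand_score(y_true, y_found)
print(f"Adjusted Rand index: {ari:.3f}")
```

With real data there would be no `y_true` to compare against, which is exactly what makes evaluating a clustering harder than evaluating a classifier.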

In [None]:
# Print all labels on one line, separated by spaces
print(' '.join(str(i) for i in yb))

### Data $\times$ Labels

In [None]:
nr, nc = 2, 5
fig, axes = plt.subplots(nr, nc, figsize=(nc*5, nr*5), dpi=120)

for i, ax in enumerate(axes.flat):
    ax.scatter(Xb[i], yb,
               color='indianred', alpha=0.6)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel(f'$X_{{{i+1}}}$', fontsize=30, fontweight='bold')
    ax.set_ylabel('$y$', fontsize=30, fontweight='bold')

plt.show()

## Examples

### Image processing

If we consider images as **datapoints** (**rows**) in a dataset, then pixels of images can be considered as individual *features* (*columns*) of this dataset.

#### Subaru Telescope images with spectro-Z data from SDSS

In [None]:
f = "260_z0.132621.png"
img = plt.imread(f'./data/{f}')[:,:,0]

fig, ax = plt.subplots(figsize=(10, 10),
                       facecolor='black')
ax.axis('off')
ax.imshow(img, cmap="Greys_r")
plt.show()
print(f"{img.shape = }")
print(f"num of pixels = {img.size}")
print(f"Redshift is z = {f.split('z')[-1].split('.png')[0]}")

#### Now let's imagine a whole dataset of images like this...

In [None]:
import os

In [None]:
DDIR = '/home/masterdesky/data/Subaru/'
files = np.array([os.path.join(DDIR, f) for f in os.listdir(DDIR)])
X = np.array([plt.imread(f)[:,:,0].flatten() for f in files])
X = pd.DataFrame(X)

In [None]:
X

### Mixed dataset

Data of 891 Titanic passengers.

In [None]:
X = pd.read_csv("http://patbaa.web.elte.hu/physdm/data/titanic.csv")

In [None]:
X

In [None]:
fig, ax = plt.subplots(figsize=(24, 24), facecolor='black')

# Determine the image extent and axis limits for dear Mr. Matplotlib
x_lim = (0, X.values.shape[0]-1)
y_lim = [-0.5, X.values.shape[1]-0.5]

ax.imshow(X.isna().values.T,
          extent=(x_lim[0], x_lim[-1], y_lim[0], y_lim[-1]),
          aspect=10, cmap="Greys", interpolation='none')

# Y-AXIS FORMATTING
ax.set_yticks(range(X.columns.size))
ax.set_yticklabels(X.columns[::-1], ha='right')
ax.tick_params(axis='both', which='major',
               labelsize=12, pad=10, colors='white')

ax.grid(True, axis='y', ls='--', alpha=0.5)

plt.show()

**This dataset needs some preprocessing!**

### A completely different type of problem

In [None]:
fasta = "MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEK"

<img src="./images/alphafold.png"/>

# How to approach and handle a problem in data science?

Most of the problems should be approached and treated similarly by following these simple steps:
- Step 1.: Preprocess the dataset for analysis
- Step 2.: Find, tune and fit a model or models on the preprocessed dataset
- Step 3.: Make predictions using the trained model and evaluate and interpret the results
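
The three steps above can be sketched with scikit-learn's pipeline utilities. Everything specific in this sketch is an assumption made for illustration: the toy dataset, the 5% of entries removed at random, mean imputation, and `LogisticRegression` as the model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with a few artificially removed entries
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=4,
                                     n_clusters_per_class=1,
                                     random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Step 1 (preprocessing) and Step 2 (model) chained into one estimator
model = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# Step 3: predict on unseen data and evaluate
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Chaining the preprocessing and the model means the imputer and the scaler are fitted on the training split only, so no information from the test split leaks into training.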

<img src="./images/pipeline-full.png"/>

# Preprocessing

A lot of beginner machine learning/data science guides for specific datasets will tell you to work with the data in a very specific way without actually telling you **why** you should do it **that** way. Why is *scaling* the data necessary? Why should you use *one-hot encoding*? What else can be done about missing data entries besides simply dropping them?

## 1.0. Every preprocessing starts with data exploration

### Why? Because looking at the data could be extremely insightful...

See this example at https://en.wikipedia.org/wiki/Anscombe's_quartet
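
The point of the quartet is that summary statistics alone can hide very different datasets. The sketch below checks this numerically; the values are Anscombe's published ones, typed in by hand here rather than read from the CSV file used in this notebook, and the names `quartet` and `stats` are hypothetical.

```python
import numpy as np

# Anscombe's published values, typed in by hand for this sketch
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
quartet = {
    'I': (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                          7.24, 4.26, 10.84, 4.82, 5.68])),
    'II': (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                           6.13, 3.10, 9.13, 7.26, 4.74])),
    'III': (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                            6.08, 5.39, 8.15, 6.42, 5.73])),
    'IV': (x4, np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                         5.25, 12.50, 5.56, 7.91, 6.89])),
}

stats = {}
for name, (xs, ys) in quartet.items():
    slope, intercept = np.polyfit(xs, ys, deg=1)  # least-squares line
    stats[name] = {
        'mean_x': xs.mean(),
        'mean_y': ys.mean(),
        'corr': np.corrcoef(xs, ys)[0, 1],
        'slope': slope,
        'intercept': intercept,
    }
    print(name, {k: round(float(v), 2) for k, v in stats[name].items()})
```

All four sets share nearly the same means, correlation and fitted line, yet only a plot reveals how differently the points are actually arranged.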

In [None]:
X = pd.read_csv('./data/Anscombe_quartet_data.csv')

In [None]:
nr, nc = 1, 4
fig, axes = plt.subplots(nr, nc, figsize=(5*nc, 5*nr), dpi=120)
fig.subplots_adjust(hspace=0.1, wspace=0.1)

sc = 15
Xs = [X['x123'], X['x123'], X['x123'], X['x4']]
ys = [X['y1'], X['y2'], X['y3'], X['y4']]

for i, ax in enumerate(axes.flat):
    ax.scatter(Xs[i], ys[i], s=sc**2)
    xi = np.linspace(-10, 25)
    yi = 1/2 * xi + 3.0
    ax.plot(xi, yi, color='tab:red', lw=5, alpha=0.7)
    ax.set_xlim(0.8 * np.min(Xs), 1.1 * np.max(Xs))
    ax.set_ylim(0.8 * np.min(ys), 1.1 * np.max(ys))

plt.show()

### 1.0.1. Looking at the data

In [None]:
X = pd.read_csv("http://patbaa.web.elte.hu/physdm/data/titanic.csv")

In [None]:
X

### 1.0.2. Exploring missing datapoints

In [None]:
X.isna().sum()

In [None]:
fig, ax = plt.subplots(figsize=(24, 24), facecolor='black')

# Determine the image extent and axis limits for dear Mr. Matplotlib
x_lim = (0, X.values.shape[0]-1)
y_lim = [-0.5, X.values.shape[1]-0.5]

ax.imshow(X.isna().values.T,
          extent=(x_lim[0], x_lim[-1], y_lim[0], y_lim[-1]),
          aspect=10, cmap="Greys", interpolation='none')

# Y-AXIS FORMATTING
ax.set_yticks(range(X.columns.size))
ax.set_yticklabels(X.columns[::-1], ha='right')
ax.tick_params(axis='both', which='major',
               labelsize=15, pad=10, colors='white')

ax.grid(True, axis='y', ls='--', alpha=0.5)

plt.show()

### 1.0.3. Exploring datatypes in the dataset

Object? Int? Float? Other?

In [None]:
# The `dtypes` attribute of a pandas DataFrame stores the datatypes
# of its columns
X.dtypes

### 1.0.4. Exploring distribution of feature values

Explore a randomly generated classification dataset with 2 distinct classes and 8 features.

In [None]:
X, y = make_classification(
    n_samples=200,
    n_features=8,
    n_informative=4,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=0,
)
X = pd.DataFrame(X)

In [None]:
X

In [None]:
nc, nr = 4, 2
fig, axes = plt.subplots(nr, nc, figsize=(6*nc, 5*nr))

# True where y == 1; order the subsets to match the 'Class 0', 'Class 1' labels
mask = y.astype(bool)
data = [X[~mask], X[mask]]
cmap = [cm.Reds, cm.Blues]
labl = ['Class 0', 'Class 1']
for d, c, l in zip(data, cmap, labl):
    for i, ax in enumerate(axes.flat):
        ax.grid(True, ls='--', alpha=0.6)

        # Convention for plotting numpy.histogram results
        hist, bins = np.histogram(d.values[:, i], bins=20, density=True)
        width = 0.8 * (bins[1] - bins[0])
        center = (bins[:-1] + bins[1:]) / 2
        ax.bar(center, hist, width=width, label=l,
               color=c(0.6), alpha=0.6)

        ax.set_title(f"Feature {i}", fontsize=12, fontweight='bold')
        ax.legend(loc='upper left', fontsize=16)

plt.show()

### 1.0.5. Exploring the correlation of features

In [None]:
X, y = make_classification(
    n_samples=100,    # Number of points in the data set
    n_features=6,     # Number of features in the data set
    n_informative=4,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=0,
)
X = pd.DataFrame(X)

In [None]:
X

In [None]:
sns.pairplot(
    X,
    kind='scatter',
    diag_kind='kde'
)

plt.show()
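
A numeric complement to the pair plot is the correlation matrix of the features. The sketch below regenerates the same kind of dataset under the hypothetical name `X_demo` and uses the Pearson correlation, which is an assumption made here and only captures linear relationships.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

X_demo, _ = make_classification(n_samples=100, n_features=6,
                                n_informative=4, n_redundant=2,
                                n_repeated=0, n_classes=2,
                                n_clusters_per_class=2, random_state=0)
X_demo = pd.DataFrame(X_demo)

# Pearson correlation between every pair of features
corr = X_demo.corr()

# Hide the trivial diagonal, then find each feature's strongest partner
off_diag = corr.abs().where(~np.eye(len(corr), dtype=bool))
strongest = off_diag.max()

print(corr.round(2))
print(strongest.round(2))
```

Since the redundant features are generated as linear combinations of the informative ones, some of the off-diagonal entries can be expected to be large.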

## 1.1. Handling missing data

### 1.1.1. Deleting rows or columns with too many NaN values

- Rows with too many missing values cannot be filled in a meaningful way
- It's better to simply drop rows or columns with too many missing values
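
A minimal sketch of this idea with pandas is shown below. The toy table and the 50% threshold (`max_missing`) are hypothetical; what counts as "too many" depends on the dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0, 5.0],
    'b': [np.nan, np.nan, np.nan, np.nan, 1.0],   # mostly empty column
    'c': [1.0, np.nan, np.nan, 4.0, 5.0],
    'd': [1.0, 2.0, np.nan, 4.0, 5.0],
})

max_missing = 0.5  # hypothetical threshold: tolerated fraction of NaNs

# Drop columns whose fraction of missing values exceeds the threshold
col_missing = df.isna().mean(axis=0)
df_cols = df.loc[:, col_missing <= max_missing]

# Then drop rows by the same criterion
row_missing = df_cols.isna().mean(axis=1)
df_clean = df_cols.loc[row_missing <= max_missing]

print(df_clean)
```

Dropping columns first is a design choice: a nearly empty column would otherwise inflate the missing fraction of every row.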

### 1.1.2. Filling empty entries with values

- Filling NaN entries with mean of existing values
- Filling NaN entries with values sampled from the distribution of existing ones
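
Both options can be sketched on a single feature. The toy values and the names `s_mean` and `s_sampled` are hypothetical, and sampling uniformly from the observed values is only one simple way of "sampling from the distribution".

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series([2.0, np.nan, 4.0, 6.0, np.nan, 8.0], name='feature')

# Option 1: replace every NaN with the mean of the observed values
s_mean = s.fillna(s.mean())

# Option 2: replace every NaN with a value drawn at random from the
# observed ones, which keeps the spread of the feature intact
observed = s.dropna().to_numpy()
missing = s.isna()
s_sampled = s.copy()
s_sampled.loc[missing] = rng.choice(observed, size=int(missing.sum()))

print(s_mean.tolist())
print(s_sampled.tolist())
```

Mean filling shrinks the variance of the feature, while sampling preserves it at the price of adding noise to individual rows.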

## 1.2. Handling non-numeric data

- Label encoding and one-hot encoding
- Leave them as they are for models that handle categorical features natively (e.g. some gradient-boosted tree implementations such as CatBoost or LightGBM)
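
The difference between the two encodings can be sketched with plain pandas. The toy `animals` column is hypothetical, and `pd.factorize` and `pd.get_dummies` are used here merely as compact stand-ins for the scikit-learn encoders.

```python
import pandas as pd

animals = pd.Series(['Dog', 'Cat', 'Bear', 'Cat', 'Dog'], name='animal')

# Label encoding: each category becomes one integer code
codes, uniques = pd.factorize(animals, sort=True)

# One-hot encoding: each category becomes its own 0/1 column
onehot = pd.get_dummies(animals, dtype=int)

print(dict(zip(uniques, range(len(uniques)))))
print(codes)
print(onehot)
```

Label codes impose an artificial ordering on the categories, which some models will pick up on; one-hot encoding avoids this at the cost of one extra column per category.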

## 1.3. Scaling

In [None]:
from sklearn.preprocessing import StandardScaler, Normalizer, \
                                  LabelEncoder, OneHotEncoder

In [None]:
X = np.load('/home/masterdesky/data/CAMELS/2D_maps/data/Maps_Mtot_Nbody_IllustrisTNG_CV_z=0.00.npy')
X = X.reshape((-1, 256**2))
X = pd.DataFrame(X)

In [None]:
X

In [None]:
fig, ax = plt.subplots(figsize=(24, 5), dpi=120)
ax.grid(True, ls='--', alpha=0.6)

ax.plot(X.iloc[0])

ax.set_xticks([])
ax.tick_params(axis='y', labelsize=30)
ax.yaxis.get_offset_text().set_fontsize(30)

plt.show()

### Machine learning algorithms don't like values all over the scale!
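
Before applying the scalers to a large dataset, it helps to see on a tiny array what each of them does. The 3x2 array `X_demo` is a hypothetical toy example.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# 3 datapoints (rows), 2 features (columns) on very different scales
X_demo = np.array([[1.0, 1000.0],
                   [2.0, 2000.0],
                   [3.0, 6000.0]])

# StandardScaler works column by column: every feature ends up with
# zero mean and unit variance across the datapoints
X_std = StandardScaler().fit_transform(X_demo)

# Normalizer works row by row: every datapoint is rescaled to unit L2 norm
X_nrm = Normalizer().fit_transform(X_demo)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(np.linalg.norm(X_nrm, axis=1))
```

When every row is a flattened image, column-wise scaling treats each pixel with its own statistics, so a scaled row need not resemble the original image, whereas row-wise normalization only changes the overall amplitude of each image.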

#### StandardScaler

In [None]:
Xs = StandardScaler().fit_transform(X)
Xs = pd.DataFrame(Xs)

In [None]:
Xs

In [None]:
fig, ax = plt.subplots(figsize=(24, 5), dpi=120)
ax.grid(True, ls='--', alpha=0.6)

ax.plot(Xs.iloc[0])

ax.set_xticks([])
ax.tick_params(axis='y', labelsize=30)
ax.yaxis.get_offset_text().set_fontsize(30)

plt.show()

### But what is this?

In [None]:
img1 = X.iloc[0].values.reshape((256, 256))
img2 = Xs.iloc[0].values.reshape((256, 256))

nr, nc = 1, 2
fig, axes = plt.subplots(nr, nc, figsize=(nc*6, nr*6), facecolor='black')
fig.subplots_adjust(wspace=0.1)

for ax in axes.flat:
    for spine in ax.spines.values():
        spine.set_edgecolor('white')
    ax.set_xticks([])
    ax.set_yticks([])

axes[0].imshow(img1, cmap='magma')
axes[1].imshow(img2, cmap='magma')

plt.show()

#### Normalizer

In [None]:
Xn = Normalizer().fit_transform(X)
Xn = pd.DataFrame(Xn)

In [None]:
Xn

In [None]:
fig, ax = plt.subplots(figsize=(24, 5), dpi=120)
ax.grid(True, ls='--', alpha=0.6)

ax.plot(Xn.iloc[0])

ax.set_xticks([])
ax.tick_params(axis='y', labelsize=30)
ax.yaxis.get_offset_text().set_fontsize(30)

plt.show()

In [None]:
img1 = X.iloc[0].values.reshape((256, 256))
img2 = Xn.iloc[0].values.reshape((256, 256))

nr, nc = 1, 2
fig, axes = plt.subplots(nr, nc, figsize=(nc*6, nr*6), facecolor='black')
fig.subplots_adjust(wspace=0.1)

for ax in axes.flat:
    for spine in ax.spines.values():
        spine.set_edgecolor('white')
    ax.set_xticks([])
    ax.set_yticks([])

axes[0].imshow(img1, cmap='magma')
axes[1].imshow(img2, cmap='magma')

plt.show()

### One-hot encoding

In [None]:
X, y = make_classification(
    n_samples=10,     # Number of points in the data set
    n_features=6,     # Number of features in the data set
    n_informative=4,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=0,
)
X = pd.DataFrame(X)

# Overwrite the last feature with a random categorical one,
# drawing exactly one category per row
categories = ['Dog', 'Cat', 'Raccoon', 'Bear']
feat = np.random.choice(categories, size=len(X), replace=True)
X[5] = feat

In [None]:
X

In [None]:
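# Fix the category order explicitly; by default OneHotEncoder sorts categories alphabetically
enc = OneHotEncoder(categories=[categories])
Xenc = enc.fit_transform(X[5].values.reshape(-1, 1)).toarray()
Xenc = pd.DataFrame(Xenc.astype(int), columns=categories)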

In [None]:
Xenc

In [None]:
Xfull = pd.concat((X.drop(columns=[5]), Xenc), axis=1)
Xfull

## 1.4. Dimensionality reduction

In [None]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
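
As a first impression of what dimensionality reduction does, here is a minimal sketch with `PCA`. The toy blobs and the choice of 2 components are assumptions made for illustration; `TSNE` exposes the same `fit_transform` interface.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X_demo, y_demo = make_blobs(n_samples=300, n_features=10, centers=3,
                            cluster_std=1.5, random_state=57)

# Project the 10 features onto the 2 directions of largest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_demo)

print(X_2d.shape)
print(pca.explained_variance_ratio_)
```

The two resulting columns can be drawn on a single scatter plot, which is not possible with the original 10 features.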