# 09 Working with data

Part of ["Introduction to Data Science" course](https://github.com/kupav/data-sc-intro) by Pavel Kuptsov, [kupav@mail.ru](mailto:kupav@mail.ru)

Recommended reading for this section:

1. Grus, J. (2019). Data Science From Scratch: First Principles with Python (Vol. Second edition). Sebastopol, CA: O’Reilly Media

1. Beginners Tutorial for Regular Expressions in Python http://www.analyticsvidhya.com/blog/2015/06/regular-expression-python/

The following Python modules will be required. Make sure that you have them installed (`re` is part of the standard library and needs no installation).
- `matplotlib`
- `numpy`
- `scipy`
- `requests`
- `re`

## Lesson 1

### Get familiar with new data

The first step in processing a new dataset is to explore it.

First of all, we need to understand what sort of data we have: the size of the dataset, the number of dimensions (columns in a table), the units of measurement, and the scale along each dimension.

Moreover, we need to estimate how poor our data are: are there badly formatted rows, missing values, outliers, duplicated entries, and so on.

There are no strict algorithms for getting familiar with the data.

An (incomplete) list of common ideas:

- visualize everything in as many ways as you can;
- look through the data file itself, if possible;
- be suspicious of evident patterns: they may be artifacts, i.e., they may have appeared due to systematic errors;
- do not compare colors with heights: only identical units can be compared, so convert everything into dimensionless numbers;
- do not compare very small features with very large ones: rescale everything to a standard range.
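
The last two ideas can be sketched as follows. This is only a minimal illustration, assuming a small made-up table of heights (cm) and weights (kg); the array `raw` and the helper `standardize` are hypothetical and are not part of the course code.

```python
import numpy as np

def standardize(table):
    """Rescale each column to zero mean and unit standard deviation."""
    table = np.asarray(table, dtype=float)
    return (table - table.mean(axis=0)) / table.std(axis=0)

# Hypothetical data: heights in cm and weights in kg
raw = np.array([[170.0, 65.0],
                [182.0, 80.0],
                [158.0, 52.0],
                [175.0, 71.0]])

scaled = standardize(raw)
print(scaled.mean(axis=0))  # close to 0 for each column
print(scaled.std(axis=0))   # close to 1 for each column
```

After this transformation both columns are dimensionless and have comparable scales.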

### Description of one-dimensional data

A one-dimensional dataset is just a list or column of numbers.

This is the simplest case to explore.

To get familiar with a one-dimensional dataset we can combine its description via basic statistics, like the mean and standard deviation, with its visualization and direct inspection.

Since we are going to load many datasets, we first define a function that does it:

In [None]:
import numpy as np
import requests

def load_dataset(file_name, dtype=float):
    """Downloads 1D dataset from repo to numpy array."""
    base_url = "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/"
    web_data = requests.get(base_url + file_name)
    assert web_data.status_code == 200
    data = [dtype(s) for s in web_data.text.strip().split('\n')]
    return np.array(data)

Let us load the first dataset to a numpy array:

In [None]:
data1 = load_dataset("data1d_descr1.txt")

Previously we discussed the basic statistics that give a first impression of the data: mean, variance and range.

The module `scipy.stats` has a function `describe` that computes these values at once. 

Moreover this function returns *skewness* and *kurtosis* of the data distribution.

Let us briefly discuss them before continuing.

Skewness is the degree of distortion from the symmetrical bell curve (the normal distribution). It measures the lack of symmetry in data distribution.

![skewness.svg](fig/skewness.svg)

If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.

If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed.

If the skewness is less than -1 or greater than 1, the data are highly skewed.

Kurtosis describes the sharpness of the peak and the weight of the tails of the distribution.

It indicates the presence of outliers in the distribution.

![kurtosis.svg](fig/kurtosis.svg)

Mesokurtic, kurtosis = 0: Similar to a normal distribution. The peak is quadratic and the tails decay sufficiently fast. Here and below, kurtosis means the excess kurtosis, which is zero for the normal distribution; this is the value reported by `scipy.stats.describe`.

Leptokurtic, kurtosis > 0: "Lepto-" means "slender". The peak is sharp, but the tails decay slower than for the normal distribution. These are called heavy or fat tails. It means that most of the data are concentrated near the center, but there are noticeable outliers.

Platykurtic, kurtosis < 0: "Platy-" means "broad". The peak is flat and the tails are thinner. The uniform distribution is platykurtic. A platykurtic distribution means that the data have light tails, i.e., a lack of outliers.
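
To make these two quantities less abstract, here is a minimal sketch that computes them directly as the third and fourth standardized moments and compares the results with `scipy.stats.skew` and `scipy.stats.kurtosis`. The helper names `skewness` and `excess_kurtosis` and the synthetic exponential sample are assumptions made for this illustration only.

```python
import numpy as np
from scipy import stats

def skewness(x):
    """Third standardized moment (biased estimate)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z**3)

def excess_kurtosis(x):
    """Fourth standardized moment minus 3, so that a normal distribution gives 0."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z**4) - 3.0

# Synthetic sample with a long right tail, used only for this illustration
rng = np.random.default_rng(0)
sample = rng.exponential(size=10_000)

print(skewness(sample), stats.skew(sample))
print(excess_kurtosis(sample), stats.kurtosis(sample))
```

Both pairs of numbers should agree, and both values are clearly positive for such a right-skewed, heavy-tailed sample.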

In [None]:
from scipy import stats

print(stats.describe(data1))

Here `nobs` means the length of the data. We have `100000` numbers.

The smallest and the largest values are more or less symmetric with respect to zero, -22 and 25, respectively. 

The mean value is 3. If we check the deviations, 3 - (-22) = 25 and 25 - 3 = 22, we see that the mean value is close to the middle of the data range.

We also have a pretty small skewness and kurtosis. So we expect that the data have a normal distribution.

Let us plot the histogram:

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data1, bins=300);

Indeed, the curve has the bell shape typical of the normal distribution.

Let us now consider another dataset.

In [None]:
data2 = load_dataset("data1d_descr2.txt")

Consider its histogram

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data2, bins=300);

It also looks like a bell. But what about its statistics?

In [None]:
from scipy import stats

print(stats.describe(data2))

There is a problem with the data: instead of the expected values we have `nan`.

This special value means "Not a Number".

The appearance of `nan` in the statistics indicates that there are `nan`s in the dataset itself.

It means that an error occurred when the data were computed.

Before working with the dataset any further we have to remove the `nan`s.

Let us first discuss how it can be done.

In [None]:
# Test array with nan
tst1 = np.array([1.0, 2.0, np.nan, 3.0])

# Check for nan: True where nan is found
print(np.isnan(tst1))

# Check for not nan: True for ordinary numbers
print(~np.isnan(tst1))

# A Boolean array of True and False can be used to select elements from the array.
print(tst1[~np.isnan(tst1)])

Thus dropping the `nan`s can be done as follows:

In [None]:
data2 = data2[~np.isnan(data2)]
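
Before dropping values it can be useful to know how many of them are missing. The sketch below does this for a made-up array (`tst2` is hypothetical). It also shows `np.nanmean` and `np.nanvar`, NumPy functions that skip `nan`s without modifying the array.

```python
import numpy as np

# Hypothetical test array with two missing values
tst2 = np.array([1.0, np.nan, 2.0, 3.0, np.nan])

# Number of missing values
n_missing = np.count_nonzero(np.isnan(tst2))
print(n_missing)

# nan-aware statistics give the same result as filtering first
clean = tst2[~np.isnan(tst2)]
print(np.nanmean(tst2), clean.mean())
print(np.nanvar(tst2), clean.var())
```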

Now describe it again:

In [None]:
print(stats.describe(data2))

Observe the nonzero negative kurtosis. It indicates that the data are not quite normal.

If we look more closely at the histogram, we will see that it is indeed flatter than the normal one above.

In fact the second dataset was generated as a union of two normally distributed lists of data with different parameters.
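
This effect is easy to reproduce. The sketch below is only an illustration: the parameters actually used to generate the dataset are not known here, so the means -1.5 and 1.5 and the unit standard deviations are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two normal samples; the means -1.5 and 1.5 are arbitrary choices for this sketch
part_a = rng.normal(loc=-1.5, scale=1.0, size=50_000)
part_b = rng.normal(loc=1.5, scale=1.0, size=50_000)
mixed = np.concatenate([part_a, part_b])

kurt_single = stats.kurtosis(part_a)
kurt_mixed = stats.kurtosis(mixed)
print(kurt_single)  # near 0 for a single normal sample
print(kurt_mixed)   # noticeably negative for the union
```

The union of two shifted bells has a flatter top than a single bell, which is what the negative kurtosis reflects.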

Let us consider two more distributions that differ strongly from a normal distribution.

In [None]:
data3 = load_dataset("data1d_descr3.txt")
print(stats.describe(data3))
fig, ax = plt.subplots()
ax.hist(data3, bins=300);

This distribution is highly asymmetric: its left tail is longer than the right one. So its skewness is large and negative.

This is also a leptokurtic distribution: it has a very sharp peak and a heavy tail. So its kurtosis is large.

In [None]:
data4 = load_dataset("data1d_descr4.txt")
print(stats.describe(data4))
fig, ax = plt.subplots()
ax.hist(data4, bins=300);

This is a strongly platykurtic distribution: there are no tails running to infinity. The kurtosis is negative.

The distribution is highly symmetric, so the skewness is almost zero.

### Two-dimensional data

Two-dimensional data represent dependencies: wind speed vs. atmospheric pressure, car speed vs. fuel consumption, and so on.

Given the data we can first explore its columns separately as described above. 

Of course, we also need to check what the dependency itself looks like.

In [None]:
import csv
import numpy as np
import requests

def load_csv_dataset(file_name, dtype=float):
    """Downloads csv numeric dataset from repo to numpy array."""
    base_url = "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/"
    web_data = requests.get(base_url + file_name)
    assert web_data.status_code == 200
    
    reader = csv.reader(web_data.text.splitlines(), delimiter=',')
    data = []
    for row in reader:
        try:
            # Try to parse as a row of floats
            float_row = [dtype(x) for x in row]
            data.append(float_row)
        except ValueError:
            # If parsing as floats failed - this is header
            print(row)
            
    return np.array(data)

In [None]:
data = load_csv_dataset("data2d_descr.csv")

First we need to know the shape of our dataset:

In [None]:
print(data.shape)

We have 10000 records, each of length 3.

The first column contains $x$-values and two others are $y_1$ and $y_2$.

Let us first describe them as 1D arrays:

In [None]:
from scipy import stats
import matplotlib.pyplot as plt

# xs
xs = data[:, 0]
print(stats.describe(xs))
fig, ax = plt.subplots()
ax.hist(xs, bins=300);

In [None]:
# ys1
ys1 = data[:, 1]
print(stats.describe(ys1))
fig, ax = plt.subplots()
ax.hist(ys1, bins=300);

In [None]:
# ys2
ys2 = data[:, 2]
print(stats.describe(ys2))
fig, ax = plt.subplots()
ax.hist(ys2, bins=300);

Observe that the three distributions are almost identical. Their descriptions indicate that they are most probably sampled from a standard normal distribution.

But separate histograms lose the information about dependencies between the values in each row.

Let us compare dependencies `ys1` vs. `xs` and `ys2` vs. `xs`.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

ax.scatter(xs, ys1, s=1, label="ys1 vs xs")
ax.scatter(xs, ys2, s=1, label="ys2 vs xs")
ax.legend();

We see two quite different dependencies.

These dependencies can be analyzed a little bit further. 

Let us recall Pearson's correlation coefficient.

A high correlation indicates that two data sequences vary similarly.

If the coefficient is negative, the data also vary similarly but in opposite directions.
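
As a reminder, the coefficient can be sketched directly from its definition. The function name `pearson_r` and the test arrays are made up for this illustration and are not necessarily the same as the function used earlier in the course.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Made-up sequences for illustration
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2.0 * a + 1.0      # varies together with a
c = -0.5 * a + 3.0     # varies in the opposite direction

print(pearson_r(a, b))
print(pearson_r(a, c))
```

The first value is close to 1 and the second is close to -1, in agreement with the interpretation above.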

Previously we computed the correlation coefficient using our own function. 

The `numpy` function `np.corrcoef` was also considered.

Now we use another function for the correlation coefficient, from `scipy.stats`. Its name is `pearsonr` (with 'r' at the end because it stands for Pearson's r).

In [None]:
from scipy import stats

print(stats.pearsonr(xs, ys1))
print(stats.pearsonr(xs, ys2))

This function returns a tuple. The first element is the correlation coefficient itself. The second one is a $p$-value.

Let us recall what it means.

When computing the correlation coefficient $r$ we assume that our `xs` and `ys` are samples from some true and very large dataset of $y$ vs. $x$.

It means that different samples will give different coefficients $r$.

In other words, $r$ is itself a random value.

We adopt the null hypothesis that the true $x$ and $y$ are uncorrelated, so that the mean value of the random value $r$ is 0.

Even then, the value $r$ computed for a particular sample `xs` and `ys` can be large in magnitude purely by chance.

The probability that a sample `xs`, `ys` of an uncorrelated true dataset $x$, $y$ gives by chance a correlation at least as extreme as $r$ is called the $p$-value.

A zero (or very small) $p$-value indicates very high confidence that the data are correlated.

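Note that this $p$-value is computed under the assumption that the samples are normally distributed. In our case this holds because the data are sampled from the standard normal distribution.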
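
The meaning of the $p$-value can be illustrated with a permutation test: shuffling one of the sequences destroys any dependence, so the shuffled pairs play the role of samples from an uncorrelated dataset. Note that this is a different technique from the analytical computation performed by `stats.pearsonr`; the function `permutation_p_value` and the synthetic data below are assumptions made for this sketch.

```python
import numpy as np

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Estimate a two-sided p-value for Pearson's r by shuffling y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_obs = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(n_perm):
        # Shuffling destroys any dependence between x and y
        r_perm = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r_perm) >= abs(r_obs):
            count += 1
    return r_obs, count / n_perm

rng = np.random.default_rng(42)
u = rng.normal(size=500)
v_dep = u + 0.5 * rng.normal(size=500)   # depends on u
v_ind = rng.normal(size=500)             # independent of u

r_dep, p_dep = permutation_p_value(u, v_dep)
r_ind, p_ind = permutation_p_value(u, v_ind)
print(r_dep, p_dep)
print(r_ind, p_ind)
```

For the dependent pair no shuffle comes close to the observed correlation, so the estimated $p$-value is zero. For the independent pair the observed $r$ is small and is reached by chance in some of the shuffles.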

### Multidimensional data

Dealing with multidimensional data we again can look at their separate histograms. 

Also it can be useful to check all their pairwise relations.

Let us load the data.

In [None]:
data = load_csv_dataset("datand_descr.csv")

We have four-dimensional data.

Let us first compute its description. 

We do not need to feed the function `stats.describe` with the separate columns of our dataset. 

It understands multidimensional data and describes each column separately:

In [None]:
from scipy import stats

stats.describe(data)

We have 10000 data records.

All the data lie within more or less similar ranges: their minima and maxima are not very different.

They are also centered near the origin, see `mean`.

The widths of their distributions (variances) are close to the standard value, except for `x3`, where the variance is two times higher.

Up to this point all four columns look similar to each other.

But the skewness tells us that the distributions of `x1`, `x2`, and `x4` are rather symmetric, while `x3` is highly asymmetric.

The same holds for the kurtosis: `x3` strongly deviates from the three others. It must have a sharp peak and heavy tails.

Now we are going to compute their correlation coefficients. 

We use the function `np.corrcoef` since it computes all pairwise correlations at once.

In [None]:
import numpy as np

# rowvar=False means that columns (not rows) are treated as variables and compared pairwise
cor = np.corrcoef(data, rowvar=False)

# We use np.printoptions to round the results
with np.printoptions(precision=2):
    print(cor)

We observe a very high correlation between `x1` and `x4`: $r=0.98$.
 
The next largest correlations are for `x1`-`x3` and `x4`-`x3`: $r=0.03$.

All others are even smaller.

The almost identical correlations for `x1`-`x3` and `x4`-`x3` are explained by the high correlation between `x1` and `x4`.

Now we draw pairwise plots. We are going to build a 4 by 4 grid of plots.

The diagonal will contain histograms of the corresponding data columns, and the other cells will show scatter plots.

In [None]:
import matplotlib.pyplot as plt

N = 4
fig, axs = plt.subplots(nrows=N, ncols=N, figsize=(10, 10))
for i in range(N):
    for j in range(N):
        if i == j:
            axs[i, i].hist(data[:, i], bins=300, color='C1')
            axs[i, i].set_title(f"x{i+1}")
        else:
            axs[i, j].scatter(data[:, i], data[:, j], s=1)
            axs[i, j].set_title(f"r[{i+1},{j+1}]={cor[i,j]:.3f}")

# Required to avoid overlapping of the subplots            
fig.tight_layout()

In this plot we see that `x1`, `x2` and `x4` have identical distributions. They look like a standard normal one.

The statistics computed above confirm it: their skewness and kurtosis are close to zero, their means are close to zero, and their variances are close to one.

The distribution of `x3` is quite different. It indeed has a very sharp peak and is very asymmetric.

The scatter plot of `x1` vs. `x4` shows that these two data columns almost coincide: we observe the functional dependence
$$
x_1\approx x_4.
$$

Surprisingly, `x1` and `x3` also demonstrate a functional dependence. The scatter plot looks like a parabola, i.e., 
$$
x_3\approx x_1^2
$$

But their correlation coefficient, although the second largest, does not indicate such a pronounced dependence.

This is because Pearson's correlation coefficient can reveal only a simple linear dependence, like the one between `x1` and `x4`.
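
This limitation is easy to check on synthetic data. In the sketch below the variable `t` and both dependencies are made up for the illustration: one dependence is exactly linear, the other is exactly parabolic.

```python
import numpy as np

rng = np.random.default_rng(3)
t = rng.normal(size=10_000)

r_linear = np.corrcoef(t, 2.0 * t + 1.0)[0, 1]  # exact linear dependence
r_parabola = np.corrcoef(t, t**2)[0, 1]         # exact but nonlinear dependence

print(r_linear)
print(r_parabola)
```

The first coefficient is practically 1, while the second stays close to zero even though the second sequence is fully determined by `t`.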

Finally, let us notice that `x2` is not correlated with any other column. Despite the fact that `x1` and `x2` have identical distributions, the scatter plot for `x1` vs. `x2` is merely a cloud of points without any visible structure. The same holds for the pairs `x2`-`x3` and `x2`-`x4`.

### Removing outliers

Data often contain outliers. These are data points that somehow do not fit the rest of the data.

Determining what is or is not an outlier is very subjective and depends on the study.

The most obvious approach to detecting outliers is based on common sense.

If we are already familiar with the data and have an idea of what should be there, we can filter out the outliers.

For example, a dataset of human ages cannot contain a number like 212.

Detecting outliers in multidimensional data is less obvious since dependencies must be taken into account.

Say a study is using both people’s ages and marital status to draw conclusions. 

Looking at the columns separately we can miss outliers. For example, "10 years old" is not an outlier and "widow" is not an outlier.

But a record of a 10-year-old widow is likely an outlier.
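
Such a combined check can be sketched as follows. The records, the helper `is_suspicious` and the threshold `min_age=16` are all hypothetical and serve only to show that the condition must involve both columns at once.

```python
# Hypothetical records: (age in years, marital status)
records = [
    (34, "married"),
    (10, "single"),
    (71, "widow"),
    (10, "widow"),    # each value is plausible alone, the pair is not
    (45, "divorced"),
]

def is_suspicious(age, status, min_age=16):
    """Flag records where a non-single status is combined with a very low age.

    The threshold min_age=16 is an arbitrary assumption for this sketch.
    """
    return status != "single" and age < min_age

suspicious = [rec for rec in records if is_suspicious(*rec)]
print(suspicious)
```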

Other outliers that can be found using common sense are data reported in the wrong units.

For example, let the data report how many minutes it took someone to complete a task.

The task took most people 3 to 10 minutes, but there is also a data point of 300. 

Common sense tells us this could be a data point that was accidentally recorded in seconds.

Consider some illustrations.

Assume we have a dataset of student ages (the data are synthetic). The values are integers.

In [None]:
data = load_dataset("outliers_1d.txt", int)

A histogram is a convenient way to detect outliers.

In [None]:
import matplotlib.pyplot as plt

# Since the data are integers it will be better to compute the exact number of bins
bins = data.max() - data.min() + 1

fig, ax = plt.subplots()
ax.hist(data, bins=bins);

We definitely have outliers in the distribution: the bars are gathered on the left because there is a small number of large ages.

But the outliers are so rare that their bars have almost zero height and are thus invisible.

To see them we can change the vertical scale to logarithmic:

In [None]:
fig, ax = plt.subplots()
ax.hist(data, bins=bins)
ax.set_yscale('log')

Now we clearly see two outliers.

Let us filter them out and re-plot the histogram.

In [None]:
# The fast way to filter out numpy-array
data_flt = data[data<50]

bins = data_flt.max() - data_flt.min() + 1

fig, ax = plt.subplots()
ax.hist(data_flt, bins=bins);

Let us now consider a 2D dataset with outliers.

We are going to consider a file that contains averaged day temperatures. Exact dates are omitted. Only season name is kept.

In [None]:
# We need a modified version of the csv downloader that reads strings and floats

import csv
import numpy as np
import requests

file_name = "outliers_2d.csv"

base_url = "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/"
web_data = requests.get(base_url + file_name)
assert web_data.status_code == 200

reader = csv.reader(web_data.text.splitlines(), delimiter=',')
seasons = []
avgtemp = []
for row in reader:
    try:
        s = row[0]
        t = float(row[1])
        seasons.append(s)
        avgtemp.append(t)
    except ValueError:
        print(row)

print(seasons[:10])
print(avgtemp[:10])

First we plot the temperatures histogram. 

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(avgtemp, bins=300);

The histogram does not contain any visible outliers. 

To check the correctness of the season names we can convert a list `seasons` into a set to remove duplicates:

In [None]:
print(set(seasons))

We see that two names, `Fall` and `Autumn`, are used for the same season.

Now check the dependencies in the dataset.

We need a scatter plot of average temperature vs. season.

First we substitute a season name with an integer code using a dictionary.

To fix the duplicated name we can simply assign the same code to both `Fall` and `Autumn`:

In [None]:
codes = {'Winter':0, 'Spring':1, 'Summer':2, 'Fall':3, 'Autumn': 3}
seas_ints = [codes[s] for s in seasons]

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(seas_ints, avgtemp);

There are two obvious outliers: too hot in winter and too cold in summer.

At least we hope that these are the outliers. 

Let us filter them out.

In [None]:
seas_ints1 = []
avgtemp1 = []
for s, a in zip(seas_ints, avgtemp):
    # Skip the first outlier
    if s == 0 and a > 15:
        continue
    # Skip the second outlier
    if s == 2 and a < -5:
        continue
    seas_ints1.append(s)
    avgtemp1.append(a)
            
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(seas_ints1, avgtemp1);        
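
The same filtering can also be written with NumPy Boolean masks instead of an explicit loop. This is only a sketch on made-up values: the arrays `codes_arr` and `temps_arr` are hypothetical stand-ins for the season codes and temperatures, while the thresholds are the same as in the loop above.

```python
import numpy as np

# Hypothetical season codes (0 = winter, 2 = summer) and temperatures
codes_arr = np.array([0, 0, 1, 2, 2, 3])
temps_arr = np.array([-8.0, 25.0, 9.0, 24.0, -12.0, 7.0])

# True for records that look like outliers
hot_winter = (codes_arr == 0) & (temps_arr > 15)
cold_summer = (codes_arr == 2) & (temps_arr < -5)
keep = ~(hot_winter | cold_summer)

print(codes_arr[keep])
print(temps_arr[keep])
```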

### Statistical criterion for detecting outliers

Outliers can be found using statistical criteria.

There are several more or less sophisticated methods.

We consider the one based on computing $p$-values.

The idea is as follows. 

First we need to guess the probability distribution for the data set.

Then we start to check each data point.

The null-hypothesis is that the checked point is sampled from the distribution and thus is not an outlier.

We compute the $p$-value for the data point, i.e., we find the probability of a value at least as extreme as this point.
Recall that for a two-sided test the $p$-value equals the doubled one-sided tail probability.

Given the $p$-values for each data point, we choose a significance level $\alpha$ and remove all data points whose $p$-values are less than $\alpha$.

Or instead we can remove the data points whose $p$-values are significantly smaller than all the others.

We will illustrate it using the already considered dataset of student ages.

As we have seen, these data follow a normal distribution.

Thus we need a function that computes the CDF of a normal distribution.

In [None]:
from scipy.special import erf
import numpy as np

def norm_cdf(x, mu, sig):
    """Normal cumulative distribution function"""
    return 0.5 * (1 + erf((x-mu)/(sig*np.sqrt(2))))

Now we load the dataset again.

In [None]:
data = load_dataset("outliers_1d.txt", int)

Here we compute the $p$-values. 

Since the CDF of a normal distribution saturates at 0 and 1 very fast, large outliers can have zero $p$-values.

For better visualization we substitute zeros with very small numbers. 

In [None]:
mu = np.mean(data)
sig = np.std(data)

prob = []
for d in data:
    if d <= mu:
        p = 2 * norm_cdf(d, mu, sig)
    else:
        p = 2 * (1 - norm_cdf(d, mu, sig))
    # Zeros are substituted with small numbers
    if p == 0.0:
        p = 1e-15
    prob.append(p)

To see what we have, we draw a scatter plot of the $p$-values vs. the data.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

ax.scatter(data, prob)
# we can use a logarithmic scale since zeros were changed to small numbers
ax.set_yscale('log')

This figure clearly indicates that the outliers at 63 and 99 must be removed.

Their $p$-values are so small that we reject the null hypothesis for them, which means that they do not belong to the distribution.
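
The whole procedure, including the final removal by a significance level $\alpha$, can be sketched in vectorized form. This is not the course implementation: the data are synthetic, $\alpha=0.001$ is an arbitrary choice, and `scipy.stats.norm.sf` (the survival function, i.e., one minus the CDF) is used in place of the hand-written CDF because it keeps precision in the far tail.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic ages: a normal bulk plus two injected outliers (values are made up)
ages = np.concatenate([rng.normal(loc=20, scale=2, size=1000).round(), [63, 99]])

mu = ages.mean()
sig = ages.std()

# Two-sided p-value: doubled tail probability beyond |z|
z = np.abs(ages - mu) / sig
p_values = 2 * stats.norm.sf(z)

alpha = 0.001  # significance level, an arbitrary choice for this sketch
cleaned = ages[p_values >= alpha]
removed = ages[p_values < alpha]
print(removed)
```

Only the two injected values are removed, and their $p$-values remain tiny but nonzero, so no substitution of zeros is needed before plotting on a logarithmic scale.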

### Exercises

1\. Download the file "skewed_data_1d.txt" from the repository "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/".
This file contains one column of data. Describe it using the corresponding function from the module `scipy.stats`. Analyze its skewness and kurtosis. What can you say about the dataset based on these values? Plot a histogram.

2\. Download the file "multidim_corr.csv" from the repository "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/".
This file contains several columns of data. Plot pairwise scatter plots as well as separate histograms. Compute the correlation matrix. What can you say about the dependencies between the data columns?

3\. Download the file "dirty_data_1d.txt" from the repository "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/".
This file contains normally distributed data with some contamination. Compute $p$-values for the dataset and remove the outliers whose $p$-values are less than $\alpha=0.001$. Plot histograms for the original and cleaned data.

## Lesson 2

### Native Python string find and replace methods

Assume that you need to find some word in a text. 

Maybe also you need to replace it with another word.

Python strings provide simple tools for it.

The following example shows how to find a substring.

In [None]:
txt = """The English Wikipedia was the first Wikipedia edition and has
remained the lagest."""

The method `.find(sub, beg=0, end=len(txt))` searches for the first occurrence of the substring. It returns the position of the substring, or -1 if the search fails.

In [None]:
pos = txt.find("Wiki")
print(pos)
print(txt[pos:])

To find the next occurrence we can start searching from `pos+1`:

In [None]:
pos2 = txt.find("Wiki", pos + 1)
print(pos2)
print(txt[pos2:])
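
Repeating this step in a loop gives all occurrences. The helper `find_all` and the string `sample` below are hypothetical names introduced for this sketch.

```python
def find_all(text, sub):
    """Return start positions of all (possibly overlapping) occurrences of sub."""
    positions = []
    pos = text.find(sub)
    while pos != -1:
        positions.append(pos)
        # Continue the search right after the start of the previous hit
        pos = text.find(sub, pos + 1)
    return positions

sample = "The English Wikipedia was the first Wikipedia edition"
print(find_all(sample, "Wiki"))
print(find_all(sample, "xyz"))
```

The loop stops as soon as `.find` returns -1, so a substring that never occurs yields an empty list.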

There is a version of the search method that finds the last occurrence of the substring.

In [None]:
pos = txt.rfind("the")
print(pos)
print(txt[pos:])

When we want the occurrence before the last one, we proceed as follows:

In [None]:
pos2 = txt.rfind("the", 0, pos)
print(pos2)
print(txt[pos2:])

Replacing a substring is done using the `.replace(old, new, count=-1)` method.

Let us fix a typo in our string (note that `.replace` returns a new string and leaves `txt` unchanged):

In [None]:
txt.replace("lagest", "largest")
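
Two details of `.replace` are worth a short sketch: strings are immutable, so the result must be assigned to a variable to be kept, and the optional third argument limits the number of replacements. The string `phrase` is made up for this illustration.

```python
phrase = "one fish, two fish, red fish"

# The original string is left untouched; a new string is returned
all_replaced = phrase.replace("fish", "cat")
# The third argument limits the number of replacements
first_only = phrase.replace("fish", "cat", 1)

print(phrase)
print(all_replaced)
print(first_only)
```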

### Regular expressions

Regular expressions extend the possibilities provided by the find and replace methods.

Regular expressions are sequences of characters used to perform find-and-replace operations.

The power of regular expressions is that they allow us to find sets of similar substrings.

For example, the symbol `\w` corresponds to any alphanumeric character, and `\d` matches any digit.

Working with regular expressions in Python is done via the standard module `re`. 

In addition to regular expression support, this module comes with powerful tools for finding and replacing:

- `re.match()`
- `re.search()`
- `re.findall()`
- `re.split()`
- `re.sub()`
- `re.compile()`

A `Match` object contains information about the search and its result.

Some functions of the module `re` return the result as a `Match` object, others return a mere list of strings.

The function `re.match(pattern, string)` looks for an occurrence of a pattern at the beginning of a string.

It returns a `Match` object, or `None` if there is no match.

In [None]:
import re

txt = "When in Rome, do as the Romans"

mtch = re.match(r"When", txt)
print(mtch)

`Match` has a method `.group()` that gives the found pattern:

In [None]:
print("Found pattern:", mtch.group())

The name of the method `.group()` may be unclear for now.

When our pattern is plain text as above, only one group is always found.

But when we specify the pattern as a regular expression, we may want not only to find it but also to dissect the string into several parts that match different components of interest.

The found parts are returned by `Match` as groups.

Examples of a nontrivial use of groups will be given below.

`Match` can also report the positions of the beginning and the end of the match, as well as its span:

In [None]:
print(mtch.start(), mtch.end(), mtch.span())

If we try to find another word, the search fails: the function `re.match()` checks only the beginning of the string.

In [None]:
mtch = re.match(r"Rome", txt)
print(mtch)

Observe that patterns should be specified with the `r` prefix, i.e., as raw strings:

```python
r"Hello\n Good bye"
```

This prevents Python from treating the backslash as the start of an escape sequence.

Let us check:

In [None]:
s1 = "\nHello\nGood bye"
s2 = r"\nHello\nGood bye"
print("Normal string:", s1)
print()
print("Raw string:", s2)

The function `re.search(pattern, string)` searches the whole string and returns a `Match` for the first occurrence of the pattern.

In [None]:
import re

txt = "Hope for the best, but prepare for the worst"

mtch = re.search(r"for", txt)
print(mtch.group(), mtch.span())

The function `re.findall(pattern, string)` returns a list of all occurrences of the pattern.

Observe that this function returns a list, not a `Match` object.

In [None]:
import re

txt = "Keep your friends close and your enemies closer"

lst = re.findall(r"close", txt)
print(lst)

The function `re.split(pattern, string, maxsplit=0)` splits a string by the pattern. If `maxsplit` is zero (the default), as many splits as possible are made. Otherwise the number of splits is limited to `maxsplit`.

In [None]:
import re

txt = "One man's trash is another man's treasure"

spl = re.split(r"man", txt)
print(spl)

spl = re.split(r" ", txt)
print(spl)

spl = re.split(r" ", txt, 3)
print(spl)

If many searches with the same pattern are performed, it is recommended to compile the pattern first: `re.compile(pattern)`.

The returned object is a compiled pattern (`re.Pattern` in Python 3).

It has its own search methods.

In [None]:
import re

txt1 = "If you can't beat them, join them"
txt2 = "You can't judge a book by its cover"
txt3 = "You can lead a horse to water, but you can't make him drink"

rge = re.compile(r"can")
print(rge.findall(txt1))
print(rge.findall(txt2))
print(rge.findall(txt3))

In what follows we will always use compiled patterns.

Let us now discuss the regular expressions. 

Regular expressions are built of special symbols that match one or many different characters.

- `\w` : Matches an alphanumeric character (letters, digits and the underscore)
- `\d` : Matches a digit \[0-9\]
- `\s` : Matches a single whitespace character (space, newline, tab)

An example is below. 

Observe that each of these symbols matches only one character.

Also notice that the exclamation point is not matched at all.

In [None]:
import re 

txt = "Agent 007!"

rge_w = re.compile(r"\w")
rge_d = re.compile(r"\d")
rge_s = re.compile(r"\s")

print(rge_w.findall(txt))
print(rge_d.findall(txt))
print(rge_s.findall(txt))

The capital-letter versions of these patterns mean the negation:

- `\W` : Matches any character that is not alphanumeric
- `\D` : Matches any character that is not a digit \[0-9\]
- `\S` : Matches any character that is not whitespace (space, newline, tab)

In [None]:
import re 

txt = "Agent 007!"

rge_W = re.compile(r"\W")
rge_D = re.compile(r"\D")
rge_S = re.compile(r"\S")

print(rge_W.findall(txt))
print(rge_D.findall(txt))
print(rge_S.findall(txt))

The pattern `\W` detects all non-alphanumeric symbols: the space and the exclamation point.

The pattern `\D` returns all non-digits: these are the letters, the space and the exclamation point.

Finally, `\S` finds all non-space symbols.

We can specify particular characters that we want to match:

- `[..]` : Matches any single character listed in the square brackets
- `[^..]` : The negation: matches any single character not listed in the square brackets

In [None]:
import re 

txt = "experimentalist"

rge_c = re.compile(r"[aei]")
rge_C = re.compile(r"[^aei]")

print(rge_c.findall(txt))
print(rge_C.findall(txt))

Square brackets admit range specification via the `-` (minus) sign:

- `[a-d]` : Matches characters from a to d
- `[a-zA-Z]` : Matches all Latin letters

Observe that `\w` matches both letters and digits, while the range `[a-zA-Z]` allows us to get only letters.

In [None]:
import re 

txt = "Agent 007!"

rge_w = re.compile(r"\w")
rge_l = re.compile(r"[a-zA-Z]")

print(rge_w.findall(txt))
print(rge_l.findall(txt))
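
The difference becomes more visible on a string with an underscore and a non-Latin letter. In the sketch below the string `sample` is made up; `re.ASCII` is a standard flag that restricts `\w` to `[a-zA-Z0-9_]`.

```python
import re

sample = "user_name1 café"

words_unicode = re.findall(r"\w+", sample)
words_latin = re.findall(r"[a-zA-Z]+", sample)
# The re.ASCII flag restricts \w to [a-zA-Z0-9_]
words_ascii = re.findall(r"\w+", sample, re.ASCII)

print(words_unicode)
print(words_latin)
print(words_ascii)
```

So `\w` also accepts the underscore and, by default, letters outside the Latin alphabet, while the explicit range does not.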

Finally, any single character except a newline is matched like this:

- `.` (period) : Matches any single character except newline
- `\n` : Matches newline symbols

In the example below the string `txt1` is a raw string, so `\n` in the middle is treated by Python literally as a backslash followed by the character `n`.

And `txt2` is a plain string where Python treats `\n` as a newline symbol.

Observe how the period pattern processes these strings.

It matches all symbols of the first string since there are no newline symbols in it.

And it skips the newline symbol in the second string.

In [None]:
import re 

txt1 = r"Two roads diverged in a yellow wood,\nAnd sorry I could not travel both"
txt2 = "Two roads diverged in a yellow wood,\nAnd sorry I could not travel both"

rge_p = re.compile(r".")

print(rge_p.findall(txt1))
print(rge_p.findall(txt2))

Accordingly, if we search for newline symbols we will find one only in the second string:

In [None]:
rge_n = re.compile(r"\n")

print(rge_n.findall(txt1))
print(rge_n.findall(txt2))

Single pattern symbols can be combined with each other and with plain characters:

In [None]:
import re 

txt = """The longest recorded rated chess game in history: 
Ivan Nikolic vs. Goran Arsovic, 17 Feb 1989. 
1. d4 Nf6 2. c4 g6 3. Nc3 Bg7 4. e4 d6 5. Nf3 O-O 6. Be2 Nbd7
etc
"""

rge1 = re.compile(r"[a-zA-Z]\d")
rge2 = re.compile(r"\d\d \w\w\w \d\d\d\d")

print(rge1.findall(txt))
print(rge2.findall(txt))

In the above example we found a date by repeating `\d` and `\w`.

In general, repeating a pattern symbol is not very convenient.

To match repeated characters we can use special symbols:

- `?` : Matches 0 or 1 occurrence of the pattern to its left
- `+` : Matches 1 or more occurrences of the pattern to its left
- `*` : Matches 0 or more occurrences of the pattern to its left
- `{n,m}` : Matches at least n and at most m occurrences of the preceding expression.
- `{,m}` : Matches at most m occurrences of the preceding expression. Zero occurrences are also matched.
- `{n,}` : Matches n or more occurrences of the preceding expression.
- `{n}` : Matches exactly n occurrences of the preceding expression.

So another version of a pattern to extract the date above is as follows:

In [None]:
rge3 = re.compile(r"\d+ [a-zA-Z]+ \d+")

print(rge3.findall(txt))

More exact specification:

In [None]:
rge3 = re.compile(r"\d{2} [a-zA-Z]{3} \d{4}")

print(rge3.findall(txt))
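The remaining quantifiers from the list above can be tried in the same way. The sketch below uses a made-up string (`demo` is a name introduced here only for illustration):

```python
import re

demo = "color colour colouur 7 42 2021"

# u? : zero or one 'u'
print(re.findall(r"colou?r", demo))
# u* : zero or more 'u'
print(re.findall(r"colou*r", demo))
# u+ : one or more 'u'
print(re.findall(r"colou+r", demo))
# \d{2,4} : from two to four digits in a row
print(re.findall(r"\d{2,4}", demo))
# \d{3,} : three or more digits in a row
print(re.findall(r"\d{3,}", demo))
```

The single digit 7 is too short for both numeric patterns, and 42 is too short for `\d{3,}`.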

The following pattern symbols match the start and the end of a string:

- `^` : Matches the start of the string.
- `$` : Matches the end of the string.

The example below uses the pattern `\w+` to match all words in the text.

The pattern reads as follows: "Find every run of one or more (`+`) alphanumeric symbols (`\w`) in a row". Note that an apostrophe is not an alphanumeric symbol, so "I've" is split into "I" and "ve".

In [None]:
import re

txt = """I've a cat named Vesters,
And he eats all day.
He always lays around,
And never wants to play.

Not even with a squeaky toy, 
Nor anything that moves.
When I have him exercise,
He always disapproves.

So we've put him on a diet,
But now he yells all day.
And even though he's thinner,
He still won't come and play.
"""

rge1 = re.compile(r'\w+')
print(rge1.findall(txt))

Now let us try to find the words at line starts by adding the `^` symbol before the pattern.

In [None]:
rge2 = re.compile(r'^\w+')
print(rge2.findall(txt))

It has found only the very first word because `^` matches only the beginning of the whole string.

If we want to find the words at the beginning of every line, we need to switch the search to
multiline mode:

In [None]:
rge3 = re.compile(r'^\w+', re.MULTILINE)
print(rge3.findall(txt))
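The `$` symbol behaves symmetrically. The sketch below uses a short two-line string without punctuation (`demo_txt` is introduced here only for illustration) to pick the last word of the whole string and then, in multiline mode, the last word of every line:

```python
import re

demo_txt = "Mary had a little lamb\nWhose fleece was white as snow"

# Without re.MULTILINE the symbol $ matches only at the end of the whole string
print(re.findall(r"\w+$", demo_txt))

# With re.MULTILINE it also matches right before every line break
print(re.findall(r"\w+$", demo_txt, re.MULTILINE))
```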

Complex patterns can be combined with the logical operator Or:

- `a | b` : Matches either a or b

Assume that we have a text with dates in different formats. 

The following pattern will extract all of them:

In [None]:
import re

txt = """Writers have traditionally written abbreviated dates according 
to their local custom, creating all-numeric equivalents to dates such as, 
"15 February 2021" (15/02/21, 15/02/2021, 15-02-2021 or 15.02.2021)
"""

rge = re.compile(r"\d{2}\s\w+\s\d{4}|\d{2}[/-]\d{2}[/-]\d+|\d{2}\.\d{2}\.\d+")

print(rge.findall(txt))

Here three patterns are combined with the logical Or `|`:

- `\d{2}\s\w+\s\d{4}` : two digits, a space, a word, four digits
- `\d{2}[/-]\d{2}[/-]\d+` : two digits, a slash or a minus, two digits, a slash or a minus, one or more digits (we need this to match both 21 and 2021)
- `\d{2}\.\d{2}\.\d+` : two digits, a period protected by a backslash (to treat it as a plain character and not as a pattern symbol), two digits, a protected period, one or more digits

As we have seen special symbols like `.`, `?` or `*` can be used as plain characters when protected by backslash:

- `\.` `\?` `\+` `\*` : Match special symbols as plain characters.
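To see why the backslash matters, compare a pattern with a bare period and one with a protected period. The string and the names below are made up for illustration; `re.escape` is the standard-library helper that adds the backslashes automatically:

```python
import re

demo = "3.14 3514 3-14"

# Bare period: any character is accepted between '3' and '14'
print(re.findall(r"3.14", demo))

# Protected period: only a literal '.' is accepted
print(re.findall(r"3\.14", demo))

# re.escape builds the protected pattern from a plain string
print(re.escape("3.14"))
print(re.findall(re.escape("3.14"), demo))
```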

Below is another illustration of pattern search.

The pattern `\w+,?\s\w+` matches the following sequences:

- `\w+` : one or more alphanumeric symbols, which in practice matches a word
- `,?` : one comma or no comma
- `\s` : a whitespace symbol such as a space or a newline
- `\w+` : again one or more alphanumeric symbols, i.e., a word again

This pattern splits the string into pairs of successive words:

In [None]:
import re

txt = """Mary had a little lamb,
Little lamb, little lamb,
Mary had a little lamb
Whose fleece was white as snow.
"""

rge = re.compile(r"\w+,?\s\w+")
print(rge.findall(txt))

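The boldface highlighting helps to clarify what was found. Successive matches are shown alternately in bold and in plain text, so every word belongs to some match: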

**Mary had** a little **lamb,
Little** lamb, little **lamb,
Mary** had a **little lamb**
Whose fleece **was white** as snow.


Round brackets do exactly what we think they should do - they group patterns:

- `(`, `)` : Create a group of pattern symbols

Here is one example of using round brackets.

The pattern is almost the same, but the words are grouped by brackets.

Observe that now the matched words are extracted separately and the spaces, commas and newlines between them are omitted:

In [None]:
rge = re.compile(r"(\w+),?\s(\w+)")

print(rge.findall(txt))

And in this pattern the middle spaces, commas and newlines are grouped instead:

In [None]:
rge = re.compile(r"\w+(,?\s)\w+")

print(rge.findall(txt))

In this example we use regular expressions to extract the names of English monarchs.

The pattern `\w+\s+\w+\s+[IV]{1,3}` includes the following parts:

- `\w+` : a word
- `\s+` : one or more spaces or newlines in a row
- `\w+` : a word again
- `\s+` : again spaces and/or newlines
- `[IV]{1,3}` : one, two or three characters I or V, a naive Roman numeral matcher.

This is the analyzed text with the monarch names highlighted:

"The Principality of Wales was incorporated into the Kingdom of England under the 
Statute of Rhuddlan in 1284, and in 1301 **King Edward I** invested his eldest son, 
the future **King Edward II**, as Prince of Wales. Since that time, except for **King 
Edward III**, the eldest sons of all English monarchs have borne this title.

After the death of **Queen Elizabeth I** without issue, in 1603, **King James VI** 
of Scotland also became **James I** of England, joining the crowns of England 
and Scotland in personal union."

Now the search:

In [None]:
import re

txt = """
The Principality of Wales was incorporated into the Kingdom of England under the 
Statute of Rhuddlan in 1284, and in 1301 King Edward I invested his eldest son, 
the future King Edward II, as Prince of Wales. Since that time, except for King 
Edward III, the eldest sons of all English monarchs have borne this title.

After the death of Queen Elizabeth I without issue, in 1603, King James VI 
of Scotland also became James I of England, joining the crowns of England 
and Scotland in personal union. 
"""

rge1 = re.compile(r"\w+\s+\w+\s+[IV]{1,3}")

print(rge1.findall(txt))

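Notice that "King Edward III" is split by a newline. The pattern `\s+` processes it correctly.

Notice also that the last match is "became James I": the pattern accepts any two words before the numeral, not only a title and a name.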

Now the same pattern with the grouped parts responsible for a personal name and a number: 

In [None]:
rge1 = re.compile(r"\w+\s+(\w+)\s+([IV]{1,3})")
print(rge1.findall(txt))

Now we discuss the function `re.sub(pattern, repl, string)`.

It finds the pattern in the string and replaces it with `repl`.

`repl` can be a string or a function. 

If it is a string, backreferences such as `\6` are replaced with the substring matched by group 6 in the pattern. 

First, a trivial example without regular expressions:

In [None]:
import re

txt = "Keep your friends close and your enemies closer"

sbs = re.sub(r"close", "distant", txt)
print(sbs)

Consider now using patterns.

The pattern `(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)[;,]?\s*` below matches arithmetical expressions and extracts the numbers from it.

It contains three parts enclosed in round brackets. The pieces of text matched by these parts are called groups.

- `(\d+)` : an integer number; it will be group 1 since it comes first
- `\s*\+\s*` : a plus sign protected by a backslash and surrounded by optional spaces
- `(\d+)` : an integer number; it will be the group 2
- `\s*=\s*` : an equal sign surrounded by optional spaces
- `(\d+)` : an integer number; it will be the group 3
- `[;,]?\s*` : separators - optional comma, semicolon and spaces

In [None]:
import re

txt = '1 + 2 = 3, 3+ 4 = 7; 7+8=15'
pat = r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)[;,]?\s*"

rge = re.compile(pat)
rge.findall(txt)

Now we will use the function `sub` to replace the `+` and `=` signs with their verbal equivalents.

We also want to drop all separators and replace them with newline symbols.

Observe a key point here: the groups are inserted into the replacement string as `\1`, `\2`, `\3`, 
i.e., a backslash followed by the group number.

In [None]:
s = re.sub(pat, r"\1 plus \2 equals \3\n", txt)
print(s)

This substitution could be done in a simpler way. 

Instead of using groups we could replace `+` and `=` with `plus` and `equals`, respectively, and replace the separators with newlines. 

But the following cannot be done without grouping parts of the pattern:

In [None]:
s = re.sub(pat, r"\3 minus \2 equals \1\n", txt)
print(s)
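Reordering groups is useful well beyond arithmetic. As a small sketch on made-up data (`demo_dates` and `date_pat` are illustrative names), the same technique converts dates from day/month/year to year-month-day:

```python
import re

demo_dates = "15/02/2021, 01/12/1999"

# group 1 is the day, group 2 is the month, group 3 is the year
date_pat = r"(\d{2})/(\d{2})/(\d{4})"

# Put the groups back in reverse order with a new separator
iso = re.sub(date_pat, r"\3-\2-\1", demo_dates)
print(iso)
```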

If the `repl` parameter of the function `sub` is itself a function, it is called for every occurrence of the pattern. The function takes a single match object as its argument and returns the replacement string.

The example below takes a string with an arithmetical expression in it and replaces the expression with its result.

The pattern `(\d+)\s*\+\s*(\d+)` contains the following parts:

- `(\d+)` : an integer number; it will be the group 1
- `\s*\+\s*` : a plus sign protected by a backslash and surrounded by optional spaces
- `(\d+)` : an integer number; it will be the group 2

When `re.sub` finds a match, it calls the function `add_replacer` and passes a `Match` object to it. 

This object has a method `.group()` that provides access to the matched groups. 

We convert them into integers, add them and return the result converted to a string.

This string is inserted in place of the matched pattern.

In [None]:
# Example from https://medium.com/python-in-plain-english/the-incredible-power-of-pythons-replace-regex-6cc217643f37
import re
  
def add_replacer(match_obj):
    return str(int(match_obj.group(1)) + int(match_obj.group(2)))

def eval_adds(string):
    return re.sub(r"(\d+)\s*\+\s*(\d+)", add_replacer, string)

print(eval_adds("the result is 1 + 2"))
print(eval_adds("the result is 6 + 4"))
print(eval_adds("the result is 15 + 5"))
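A replacement function can perform any computation on the matched text. As one more sketch (the function `knots_to_ms` and the sample sentence are made up for illustration), the code below converts wind speeds from knots to meters per second, using the fact that one knot is 1852 meters per hour:

```python
import re

def knots_to_ms(match_obj):
    # group(1) holds the digits captured by (\d+)
    knots = int(match_obj.group(1))
    return f"{knots * 1852 / 3600:.2f} m/s"

demo = "Wind 10 knots in the morning, 22 knots at noon"
converted = re.sub(r"(\d+)\s*knots", knots_to_ms, demo)
print(converted)
```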

### Exercises

4\. In this exercise you will write a program that uses a regular expression to search the text below and extract the words beginning with a capital letter. For example, the first two such words are 'Anglo' and 'Saxon'.

In [None]:
txt = """Anglo-Saxon Chronicle, chronological account of events in Anglo-Saxon and Norman 
England, a compilation of seven surviving interrelated manuscript records that is the primary 
source for the early history of England. The narrative was first assembled in the reign of 
King Alfred (871–899) from materials that included some epitome of universal history: the 
Venerable Bede’s Historia ecclesiastica gentis Anglorum, genealogies, regnal and episcopal lists, 
a few northern annals, and probably some sets of earlier West Saxon annals. The compiler also had 
access to a set of Frankish annals for the late 9th century. Soon after the year 890 several 
manuscripts were being circulated; one was available to Asser in 893, another, which appears 
to have gone no further than that year, to the late 10th-century chronicler Aethelweard, while 
one version, which eventually reached the north and which is best represented by the surviving 
E version, stopped in 892.
"""

5\. In this exercise you will take the two verses of the song 'Mary had a little lamb' below and 
write a program that uses a regular expression to find all words that come right after the name 'Mary'. 
For example, the first such word is 'had'.

In [None]:
txt = """Mary had a little lamb,
Little lamb, little lamb,
Mary had a little lamb
Whose fleece was white as snow.

And everywhere that Mary went,
Mary went, Mary went,
Everywhere that Mary went
The lamb was sure to go.
"""

6\. Look through the document above and find the example where an arithmetic expression is replaced with its result. 
That program handles only addition. Modify it to process both addition and subtraction correctly. For example, the string '1 + 2' must be replaced with '3' and '4-3' must be replaced with '1'.

## Lesson 3

### Rescaling (standardizing) of data

Most methods of data processing are highly sensitive to the scale of the data. 

Let us assume that we have a dataset of maritime weather observations. The data include the daily averaged wind speed and air temperature. 

The wind speed is measured in knots, i.e., nautical miles (1852 meters) per hour, and the temperature in degrees Celsius.

Consider three records only:

In [None]:
import numpy as np

# Wind speed (knots), Temperature (C)
data = np.array(
    [[10, -5],
     [22, 0],
     [30, 10]])

Our goal is to collect the days with similar weather. 

One of the simplest approaches is to compute the distances between all pairs of day records, treating them as vectors.

In [None]:
def dist(x, y):
    return np.sqrt((x[0]-y[0])**2 + (x[1]-y[1])**2)

N = len(data)
for i in range(N):
    for j in range(i + 1, N):
        d = dist(data[i], data[j])
        print(f"{i} to {j} distance = {d:.2f}")

We observe that days 1 and 2 have the most similar weather.

But then we realize that knots are not a very common unit ashore. 

So we convert knots to meters per second.

In [None]:
data2 = np.array([[x[0] * 1852 / 3600, x[1]] for x in data])
print(data2)

Let us now recompute the pairwise distances:

In [None]:
N = len(data2)
for i in range(N):
    for j in range(i + 1, N):
        d = dist(data2[i], data2[j])
        print(f"{i} to {j} distance = {d:.2f}")

We clearly see that now the most similar weather is in days 0 and 1.

The source of the problem is that we combine incompatible values.

The distance is:
$$
\sqrt{(\textrm{speed1}-\textrm{speed2})^2 + (\textrm{temp1}-\textrm{temp2})^2}
$$
Here we add a squared speed to a squared temperature. 

This is incorrect because only quantities with the same units can be added and subtracted. 

Thus, before doing computations with data it is recommended to rescale each data field (column) 
so that it has mean 0 and standard deviation 1:
$$
z_i = \frac{x_i - \mu}{\sigma}
$$
Here $\mu$ and $\sigma$ are the mean and the standard deviation of the column. This procedure is also called data standardization.

In this way we get rid of the units and thus make the data columns compatible.

To perform the rescaling we will use `numpy` functions `mean` and `std` for mean and standard deviation.

Let us see how they work:

In [None]:
print(data)
print("mu=", np.mean(data, axis=0))
print("sigma=", np.std(data, axis=0))

In [None]:
def rescale(data):
    return (data - np.mean(data, axis=0)) / np.std(data, axis=0)

data3 = rescale(data)
print(data3)
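It is easy to check that the rescaled columns indeed have mean 0 and standard deviation 1. The sketch below repeats the computation on a copy of the three weather records (`wind_temp` and `z` are names introduced here only for illustration). Note that `np.std` computes the population standard deviation by default, i.e., it divides by the number of records:

```python
import numpy as np

# Wind speed (knots), Temperature (C)
wind_temp = np.array([[10, -5], [22, 0], [30, 10]], dtype=float)

# Standardize every column
z = (wind_temp - wind_temp.mean(axis=0)) / wind_temp.std(axis=0)

print(np.round(z.mean(axis=0), 12))  # column means, zero up to rounding errors
print(z.std(axis=0))                 # column standard deviations
```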

Now given the compatible data we can recompute the distances correctly:

In [None]:
N = len(data3)
for i in range(N):
    for j in range(i + 1, N):
        d = dist(data3[i], data3[j])
        print(f"{i} to {j} distance = {d:.2f}")

We see that days 0 and 1 have the most similar weather. This conclusion is more trustworthy. 

From the statistical point of view three records are not enough for computing mean and standard deviation. 

But if some reasonable scales are required and no additional data are available this rescaling is the only more or less adequate solution.
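Standardizing also removes the dependence on the chosen units, which caused the contradictory results above. A minimal sketch, assuming the same three records (the helper `standardize` and the array names are introduced here only for illustration): the rescaled data are identical whether the wind speed is given in knots or in meters per second.

```python
import numpy as np

def standardize(a):
    # Rescale every column to mean 0 and standard deviation 1
    return (a - a.mean(axis=0)) / a.std(axis=0)

in_knots = np.array([[10, -5], [22, 0], [30, 10]], dtype=float)

# The same records with the wind speed converted to meters per second
in_ms = in_knots.copy()
in_ms[:, 0] *= 1852 / 3600

same = np.allclose(standardize(in_knots), standardize(in_ms))
print(same)
```

Multiplying a column by a constant multiplies both the deviations from the mean and the standard deviation by that constant, so the factor cancels.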

### Categorical data

Data are often represented in a categorical form. 

In this case a categorical column of a dataset table contains text strings from a limited set of possible values. 

For example, these can be names or countries. 

Categorical data can even be represented by numbers. But such numbers can not be manipulated 
(added, subtracted, compared) like ordinary numbers.

An example of numerical categorical data is the numbers on athletes' uniforms. 

These numbers are actually just labels. 

It makes little sense to add athlete number 20 to athlete number 11.

In the example below the last column 'Origin' is an example of textual categorical data. It holds country names, and we can not perform arithmetical manipulations with them.

In [None]:
data = [
    {'Car': 'AMC Concord d/l', 'MPG': 18.1, 'Cylinders': 6, 'Horsepower': 120.0, 'Origin': 'US'},
    {'Car': 'Toyota Corolla', 'MPG': 24.0, 'Cylinders': 4, 'Horsepower': 96.0, 'Origin': 'Japan'},
    {'Car': 'Ford Gran Torino', 'MPG': 14.5, 'Cylinders': 8, 'Horsepower': 152.0, 'Origin': 'US'},
    {'Car': 'Buick Opel Isuzu Deluxe', 'MPG': 30.0, 'Cylinders': 4, 'Horsepower': 80.0, 'Origin': 'US'},
    {'Car': 'Volkswagen Rabbit Custom', 'MPG': 29.0, 'Cylinders': 4, 'Horsepower': 78.0, 'Origin': 'Europe'},
    {'Car': 'Dodge Coronet Brougham', 'MPG': 16.0, 'Cylinders': 8, 'Horsepower': 150.0, 'Origin': 'US'},
    {'Car': 'Chrysler Cordoba', 'MPG': 15.5, 'Cylinders': 8, 'Horsepower': 190.0, 'Origin': 'US'},
    {'Car': 'Toyota Corolla 1200', 'MPG': 32.0, 'Cylinders': 4, 'Horsepower': 65.0, 'Origin': 'Japan'},
    {'Car': 'Volvo 244DL', 'MPG': 22.0, 'Cylinders': 4, 'Horsepower': 98.0, 'Origin': 'Europe'},
    {'Car': 'Chevrolet Woody', 'MPG': 24.5, 'Cylinders': 4, 'Horsepower': 60.0, 'Origin': 'US'}]

### Label encoding of categorical data

Most methods of data processing operate on numerical data. 

It means that textual categorical data must somehow be converted to numbers.

The obvious approach is to extract all distinct strings, order them (or leave them as they are) and enumerate them. 

Then we assign the numbers instead of the strings.

Let us first build a list of unique country names:

In [None]:
# Extract all country names
orig = [d['Origin'] for d in data]
print(orig)

In [None]:
# Convert to set to remove duplicates
uniq_orig = set(orig)
print(uniq_orig)

In [None]:
# Convert back to list
uniq_orig1 = list(uniq_orig)

# and create a dictionary with labels
orig_dict = {x: i for i, x in enumerate(uniq_orig1)}
    
print(orig_dict)

In [None]:
# New dataset with numerical labels
data1 = []
for rec in data:
    orig = rec['Origin']
    numer = orig_dict[orig]
    rec1 = rec.copy()  # we copy record to avoid change of the original dataset
    rec1['Origin'] = numer
    data1.append(rec1)

In [None]:
# Print what we have
for d in data1:
    print(d)

We changed categorical data from textual to numeric form. 
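Because a Python `set` has no guaranteed order, the labels assigned above may differ from run to run. A minimal sketch, assuming that alphabetical order is acceptable, sorts the unique values first so that the labels are reproducible (`origins`, `labels` and `encoded` are names introduced here only for illustration):

```python
# The 'Origin' values of the ten cars, in the order of the dataset
origins = ['US', 'Japan', 'US', 'US', 'Europe', 'US', 'US', 'Japan', 'Europe', 'US']

# Sorting the unique values fixes the numbering
labels = {name: i for i, name in enumerate(sorted(set(origins)))}
print(labels)

# Replace every string with its label
encoded = [labels[name] for name in origins]
print(encoded)
```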

### One-hot encoding of categorical data

Above we have converted textual categorical data to numerical labels. 

But the data are still categorical, and their numerical representation has made things even worse.

The numbers are misleading. 

If we want to compare cars like we have compared weather conditions above we can not use the column 'Origin' because the numbers 
have been assigned arbitrarily. 

It makes no sense to compute the distance between codes 0 and 2.

An encoding method that allows further numerical processing is called one-hot encoding.

Given a field with categorical data we
- count the unique string values (above there were three of them: 'US', 'Japan', and 'Europe')
- replace the categorical column with as many new columns as there are unique values (three new columns in our example)
- name each new column like 'Is value1', 'Is value2', ... (in our case 'From_US', 'From_Japan', and 'From_Europe')
- fill the new columns only with zeros and ones
- write a one in the rows where the corresponding value was in the original column, and zeros everywhere else.

Let us convert the column 'Origin' in our dataset into one-hot encoding.

In [None]:
# Extract all country names
orig = [d['Origin'] for d in data]
# Convert to set to remove duplicates
uniq_orig = set(orig)
# Convert back to list 
uniq_orig1 = list(uniq_orig)
print(uniq_orig1)

In [None]:
data2 = []
for rec in data:
    orig = rec['Origin']
    rec2 = rec.copy()  # copy record to preserve the original data
    del rec2['Origin']  # remove old column
    print(orig)  # show the 'Origin' value of the current record
    for orig1 in uniq_orig1:
        name = 'From_' + orig1  # name for a new column
        rec2[name] = 1 if orig1 == orig else 0  # 1 if origin coincides
        print('\t', name, rec2[name])
    data2.append(rec2)

In [None]:
# Print what we have
for d in data2:
    print(d)
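The same encoding can be written compactly with `numpy`. This is only a sketch on a shortened list of origins (`origins`, `categories` and `one_hot` are illustrative names): `np.unique` returns the sorted unique values, and comparing a column of values with a row of categories produces the table of zeros and ones.

```python
import numpy as np

origins = np.array(['US', 'Japan', 'US', 'Europe'])

# Sorted unique values define the order of the new columns
categories = np.unique(origins)
print(categories)

# Compare every record (rows) with every category (columns)
one_hot = (origins[:, None] == categories[None, :]).astype(int)
print(one_hot)
```

Every row contains exactly one 1, in the column of its category.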

Using the one-hot representation we can compare these cars. 

When computing the distance we subtract the values in the corresponding columns.

Consider what happens when we compare the first two records, 'AMC Concord d/l' and 'Toyota Corolla'.

The first one is from the US and the second one is not: the subtraction gives 1.

The first one is not from Japan and the second one is: the subtraction gives -1, whose square is again 1.

Neither of them is from Europe: they coincide in this feature, so the difference is 0.

Let us compute all pairwise distances and plot them as a heat map.
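This arithmetic can be reproduced directly. In the sketch below the column order From_US, From_Japan, From_Europe is an assumption made only for illustration:

```python
import numpy as np

# One-hot columns in the assumed order: From_US, From_Japan, From_Europe
us = np.array([1, 0, 0])
japan = np.array([0, 1, 0])

diff = us - japan
print(diff)  # differences column by column

# Contribution of 'Origin' to the distance between cars of different origin
d_diff = np.sqrt(np.sum(diff**2))
print(d_diff)

# For two cars of the same origin the contribution vanishes
d_same = np.sqrt(np.sum((us - us)**2))
print(d_same)
```

So any two cars of different origin are separated by the same amount, the square root of 2, no matter which countries are involved.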

In [None]:
# First convert the list of dictionaries to a list of lists
plot_data1 = [[v for v in d.values()] for d in data2]

for x in plot_data1:
    print(x)

In [None]:
# Remove the first column
plot_data2 = np.array([x[1:] for x in plot_data1])
print(plot_data2)

Now we need to rescale our data according to the discussion above.

This function rescales a single column of a numpy array.

In [None]:
def rescale_column(data):
    # Assume one column here
    return (data - np.mean(data)) / np.std(data)

The question is: do we need to rescale one-hot data?

On the one hand, the idea of rescaling is quite generic, so the one-hot data should also be processed.

On the other hand, these data are already well scaled: they contain only zeros and ones.

Actually the answer depends on the goals, and the researcher must make the decision.

Let us discuss what we obtain after rescaling. 

Create two one-hot columns (not related to our cars). The first one contains zeros except for a single one. 

It means that we have a very rare car: there are no other cars from this country.

In [None]:
col1 = np.array([0] * 9 + [1])
print(col1)
resc_col1 = rescale_column(col1)
print(resc_col1)

The rare car deviates strongly from all others. The difference between the rescaled one and the rescaled zeros is 3.333.

And if there is an equal number of ones and zeros:

In [None]:
col2 = np.array([0] * 5 + [1] * 5)
print(col2)
resc_col2 = rescale_column(col2)
print(resc_col2)

Now the difference between the rescaled ones and the rescaled zeros is only 2.

We see that rescaling one-hot columns highlights records with a rare feature. 

Thus we should rescale if we want this effect. If rareness is not important, we leave the one-hot columns as they are.
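The two numbers above follow from a simple formula. For a column of zeros and ones with a fraction p of ones, the mean is p and the standard deviation is the square root of p(1 - p), so the rescaled one and the rescaled zero differ by 1 / sqrt(p(1 - p)). A small sketch (the function name `one_hot_gap` is introduced here only for illustration):

```python
import numpy as np

def one_hot_gap(p):
    """Difference between the rescaled 1 and the rescaled 0 when a fraction p of the column is ones."""
    return 1 / np.sqrt(p * (1 - p))

for p in [0.5, 0.1, 0.01]:
    print(p, round(one_hot_gap(p), 3))
```

The rarer the feature, the larger the gap, so rare categories dominate the distances after rescaling.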

For our example we consider both options.

In [None]:
# Do not rescale one-hot columns
raw_plot_data = np.zeros_like(plot_data2)
for i in [0,1,2]:
    raw_plot_data[:, i] = rescale_column(plot_data2[:, i])
    
raw_plot_data[:, 3:] = plot_data2[:, 3:]       
print(raw_plot_data)
print()
    
# Do rescale one-hot columns    
scl_plot_data = np.zeros_like(plot_data2)
for i in [0,1,2,3,4,5]:
    scl_plot_data[:, i] = rescale_column(plot_data2[:, i])
    
print(scl_plot_data)    

Here we compute all pairwise distances and save them as a matrix

In [None]:
def dist(x, y):
    return np.sqrt(np.sum((x - y)**2))

def pairwise_dist(data):
    N = len(data)
    dst = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            dst[i,j] = dist(data[i], data[j])

    return dst

raw_dst = pairwise_dist(raw_plot_data)
scl_dst = pairwise_dist(scl_plot_data)


# We use np.printoptions to round the results
with np.printoptions(precision=2):
    print(raw_dst)
    print()
    print(scl_dst)

We plot these matrices via color intensity: the darker the color, the closer the records.

In [None]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
axs[0].imshow(raw_dst, cmap='hot')
axs[1].imshow(scl_dst, cmap='hot');

In our case, leaving the one-hot columns unscaled results in more clusters: we see darker areas that indicate similarity of the cars. 

For the fully rescaled data there are fewer clusters, but instead we reveal pairs of very similar cars.
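The double loop in `pairwise_dist` above can also be replaced with `numpy` broadcasting, which is usually faster for larger datasets. This is only a sketch (`pairwise_dist_vec` and `pts` are illustrative names), checked on three points lying on one line:

```python
import numpy as np

def pairwise_dist_vec(a):
    # diff[i, j, :] is the difference between records i and j
    diff = a[:, None, :] - a[None, :, :]
    # Sum the squared differences over the columns and take the root
    return np.sqrt(np.sum(diff**2, axis=-1))

pts = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dst = pairwise_dist_vec(pts)
print(dst)
```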

### Exercises

7\. Download the file "rescale.csv" from the repository "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/". Rescale its columns, compute all pairwise distances and find the three closest records.

8\. Download the file "happiness_score.csv" from the repository "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/". Rescale its column 'Happiness Score'. Transform its column 'Region' into the one-hot representation. Compute the pairwise distances using 'Happiness Score' and the one-hot columns for 'Region'. Find the two most similar countries.