### Exercise 1: Flow sensors

The data file contains water flow measurements recorded by three sensors: S1, S2, and S3.

- Compute the central tendency measures
- Compute the spread measures
- What can you say about S1, S2, and S3 in terms of mean, variance, and standard deviation?
- Which one of the sensors is different from the others?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# read the data
df = pd.read_csv('group3.csv', sep = ';')
print('Number of rows    = ' + str(df.shape[0]))
print('Number of columns = ' + str(df.shape[1]))

# compute central tendency and spread measures
for col_name, col_data in df.items():
    print(f'\n** Sensor {col_name} **')
    print(f'min, max     = {col_data.min():.1f}, {col_data.max():.1f}')
    print(f'mean, median = {col_data.mean():.1f}, {col_data.median():.1f}')
    print(f'std, var     = {col_data.std():.1f}, {col_data.var():.1f}')
    print(f'skew         = {col_data.skew():.1f}')
    
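# create a figure with two subplots side by side
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# boxplot: one box per sensor
df.plot(kind='box', ax=axs[0], title='Boxplot per sensor')

# overlaid histograms: one per sensor
for col_name in df.columns:
    axs[1].hist(df[col_name], bins=20, alpha=0.5, label=col_name)
axs[1].set_title('Histogram per sensor')
axs[1].legend()

plt.tight_layout()
plt.show()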

**Question**<br>
What can you say about S1, S2, and S3 in terms of mean, variance, and standard deviation?

**Answer**<br>
Means: $S1 > S2 \approx S3$<br>
Variance and std: $S1 \approx S2 < S3$

**Question**<br>
Which one of the sensors is different from the others?

**Answer**<br>
- In terms of central tendency, S1 is different from S2 and S3: it gives much higher values. One possible explanation is that the measurements of S1 are biased (e.g. due to a production error).
- In terms of spread, S3 is different from S1 and S2: it has much higher variance in its measurements. One possible explanation is that S3 has more measurement noise.

**Conclusion**<br>
S2 seems to give the best measurements: its variance is low (high precision, comparable to S1) and it doesn't have the apparent bias of S1.

(Of course, without more information on the background of this dataset, any conclusion is highly speculative.)
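
The distinction between bias (shifted mean) and noise (larger spread) can be illustrated with a small simulation. This is only a sketch on synthetic data: the column names `biased`, `good`, and `noisy`, the means, and the standard deviations are all assumptions chosen for illustration and are not taken from `group3.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; these are NOT the values in group3.csv
rng = np.random.default_rng(seed=0)
n = 1000
sim = pd.DataFrame({
    'biased': rng.normal(loc=12.0, scale=1.0, size=n),  # shifted mean, low noise
    'good':   rng.normal(loc=10.0, scale=1.0, size=n),  # reference sensor
    'noisy':  rng.normal(loc=10.0, scale=3.0, size=n),  # same mean, more noise
})

# one row per statistic, one column per simulated sensor
summary = sim.agg(['mean', 'median', 'std', 'var']).round(2)
print(summary)
```

In the resulting table, the biased sensor stands out in the `mean` and `median` rows, while the noisy sensor stands out in the `std` and `var` rows, which mirrors the pattern described for S1 and S3.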

### Exercise 2: Categorical quartiles
Is it possible to divide a categorical dataset using quartiles?

**Answer**<br>
It depends on the nature of the categorical data. In order to compute quartiles, we need to be able to sort the data. Since *nominal* categorical data by definition has no ordering, it is not possible to compute quartiles on such data. With *ordinal* categorical data, however, it is in principle possible to compute quartiles, but issues may arise when there are very few levels (many repeated values) or when a quartile falls in between two categories.<br>
<br>
Example: 

![image.png](attachment:44c218c2-4b20-4083-a47e-8fbc34fea595.png)

Conclusion: it's possible, but somewhat ugly and not without issues.<br>
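
One possible way to do this in pandas is sketched below. The rating levels and the sample values are hypothetical, and working on the integer category codes with `interpolation='lower'` or `'higher'` is just one choice for forcing each quartile onto an actual category.

```python
import pandas as pd

# Hypothetical ordinal data: ratings with a natural order
levels = ['poor', 'fair', 'good', 'excellent']
ratings = pd.Series(
    ['good', 'poor', 'fair', 'good', 'excellent',
     'fair', 'good', 'poor', 'good', 'fair'],
    dtype=pd.CategoricalDtype(categories=levels, ordered=True),
)

# integer codes follow the category order (poor=0, ..., excellent=3)
codes = ratings.cat.codes.astype(int)

def ordinal_quartiles(codes, interpolation):
    """Quartiles on the codes, mapped back to category labels."""
    q = codes.quantile([0.25, 0.5, 0.75], interpolation=interpolation)
    return {level: levels[int(code)] for level, code in q.items()}

lower = ordinal_quartiles(codes, 'lower')
higher = ordinal_quartiles(codes, 'higher')
print(lower)   # {0.25: 'fair', 0.5: 'fair', 0.75: 'good'}
print(higher)  # {0.25: 'fair', 0.5: 'good', 0.75: 'good'}
```

Here the median falls between "fair" and "good", so the answer depends on the rounding convention, and with only four levels the first quartile and the median can coincide.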

**Question**<br>
Suppose we have the following list of colors:

Colors = [Red, Green, Blue, Blue, Green, Red, Yellow, Orange, Purple, Blue, Green, Red, Blue, Yellow, Orange, Green, Blue, Red, Yellow, Green]

How can we compute quartiles in such a case?

**Answer**<br>
There is no natural way to compute quartiles on nominal data, because the categories have no inherent order.
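
What can still be summarised for nominal data are frequencies and the mode, neither of which needs an ordering. A minimal sketch for the color list above (the variable names are illustrative):

```python
import pandas as pd

colors = pd.Series([
    'Red', 'Green', 'Blue', 'Blue', 'Green', 'Red', 'Yellow', 'Orange',
    'Purple', 'Blue', 'Green', 'Red', 'Blue', 'Yellow', 'Orange', 'Green',
    'Blue', 'Red', 'Yellow', 'Green',
])

# frequency of each category; no ordering of the colors is required
counts = colors.value_counts()
print(counts)

# the mode(s): the most frequent categories (Blue and Green tie here)
modes = colors.mode().tolist()
print(modes)  # ['Blue', 'Green']
```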