### CS4102 - Geometric Foundations of Data Analysis I
Prof. Götz Pfeiffer<br />
School of Mathematical and Statistical Sciences<br />
University of Galway

# Week 3

## 0. Questions

* How to plot the data (scatter) and the least squares fit (line)?
* How to get the data from the problem sheet into Python?
* How to compute the $r^2$ for the skin care example?

## Q 1.3

Let's try and solve Question 3 from Lecture 1.

In [None]:
import numpy as np

* list the $x$- and $y$-values and turn them into numpy arrays.

In [None]:
xs = [2,4,5,6]
ys = [6.5,8.5,11.0,12.5]

In [None]:
Y = np.array(ys)
Y

In [None]:
X = np.array([[1, x] for x in xs])
X

* use the normal equation $B = (X^t X)^{-1} (X^t Y)$ to find the least squares fit

In [None]:
B = np.linalg.inv(X.T @ X) @ (X.T @ Y)
B
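
Forming the inverse explicitly mirrors the formula, but the same coefficients can be obtained without it. The following is a minimal sketch, reusing the data from above, that cross-checks the normal equation against the standard NumPy routines `np.linalg.solve` and `np.linalg.lstsq`. The names `B_inv`, `B_solve` and `B_lstsq` are chosen here for illustration only.

```python
import numpy as np

xs = [2, 4, 5, 6]
ys = [6.5, 8.5, 11.0, 12.5]
X = np.array([[1, x] for x in xs])
Y = np.array(ys)

# normal equation with an explicit inverse, as in the cell above
B_inv = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# same linear system (X^t X) B = X^t Y, solved without forming the inverse
B_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# NumPy's least squares routine; only the first return value (the coefficients) is used
B_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(B_inv, B_solve, B_lstsq)
```

All three should agree up to rounding: the first entry is the intercept and the second the slope of the fitted line.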

## Plotting

Now for the plot ...

In [None]:
import matplotlib.pyplot as plt

* multiple `plt.plot` commands draw in the same axes object; by default, `plt.plot` simply joins the dots with line segments.

In [None]:
plt.plot(xs, ys, 'o')
plt.plot(xs, ys)

* for the least squares straight line, compute the $y$-values corresponding to the end points $2.0$ and $6.0$ of the $x$-range and join the two dots ... first for $x = 2.0$:

In [None]:
[1, 2.0] @ B

* then for $x = 6.0$:

In [None]:
[1, 6.0] @ B

* now for both in one go:

In [None]:
xx = [2.0, 6.0]
yy = [[1,x] for x in xx] @ B
yy

* plot the line on top of the scatter plot:

In [None]:
plt.plot(xs, ys, 'o')
plt.plot(xx, yy)

* Of course, these two end points of the least squares fit form part of the fitted values $\hat{Y}$:

In [None]:
Yhat = X @ B
Yhat

* and we might just plot the straight line corresponding to $\hat{Y}$: 

In [None]:
plt.plot(xs, Y, 'o')
plt.plot(xs, Yhat)
plt.plot(xs, Yhat, 'o')

* Which colour of dot corresponds to which kind of $y$-value?

* Let's also plot the sample mean as a straight line

In [None]:
ybar = sum(ys)/len(ys)
ybar

In [None]:
Ybar = [ybar for x in xs]

In [None]:
plt.plot(xs, Y, 'o')
plt.plot(xs, Yhat)
plt.plot(xs, Yhat, 'o')
plt.plot(xs, Ybar)
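
The plot compares the fitted values with the sample mean; the same comparison, expressed as a single number, is the $r^2$ asked about in the questions at the top. The following is a minimal sketch for the data of Q 1.3 (not the skin care data), assuming the usual definition $r^2 = 1 - \mathrm{SSE}/\mathrm{SST}$. The names `sse`, `sst` and `r2` are introduced here for illustration.

```python
import numpy as np

xs = [2, 4, 5, 6]
ys = [6.5, 8.5, 11.0, 12.5]
X = np.array([[1, x] for x in xs])
Y = np.array(ys)

B = np.linalg.inv(X.T @ X) @ (X.T @ Y)
Yhat = X @ B                     # fitted values on the least squares line
ybar = sum(ys) / len(ys)         # sample mean, as computed above

sse = np.sum((Y - Yhat) ** 2)    # squared distances from the data to the fitted line
sst = np.sum((Y - ybar) ** 2)    # squared distances from the data to the mean line
r2 = 1 - sse / sst
print(r2)
```

A value close to $1$ means the fitted line accounts for most of the variation of the $y$-values around their mean.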

## File Processing

Let's try and plot the data in `points1.txt`, after examining the format of the lines in that file.

* Use a **shell command** (`!`) to show the first few lines of the file `points1.txt`

In [None]:
!head points1.txt

* Extract $x$- and $y$-values "by hand"

In [None]:
name = 'points1.txt'
with open(name) as f:
    lines = f.readlines(100)  # size hint: stop reading once about 100 characters have been read

In [None]:
lines

* each line consists of three (!) short text pieces, separated by `;`
* we can split the line on `;` and assign the three pieces to three variables

In [None]:
lines[1]

In [None]:
x, y, z = lines[1].split(';')
x, y

* now `x` contains a name (`x_2`) separated from a value (`3`) by `:`
* we are interested in the value: split on `:` and pick entry `1` from the resulting list

In [None]:
x.split(':')

In [None]:
x.split(':')[1]

* this is a string which needs to be converted to a number (`int` or `float`)

In [None]:
int(x.split(':')[1])

* same for `y`, and all of this for each `line` of the file:

In [None]:
xs, ys = [], []
with open(name) as f:
    for line in f:
        x, y, z = line.split(';')
        xs.append(int(x.split(':')[1]))
        ys.append(float(y.split(':')[1]))

In [None]:
print(xs[:10])
print(ys[:10])
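
The splitting steps above can be collected into one small function, which makes them easy to try out on a single line. This is a sketch only: the helper name `parse_line` and the sample lines are hypothetical, made up to match the `name:value;name:value;name:value` layout described above, and are not taken from `points1.txt`.

```python
def parse_line(line):
    """Return (x, y) from a line of the form 'x_2:3;y_2:7.5;z_2:1'."""
    # three pieces separated by ';' (the third one is ignored)
    x_piece, y_piece, _ = line.strip().split(';')
    # each piece is 'name:value'; keep the value and convert it to a number
    x = int(x_piece.split(':')[1])
    y = float(y_piece.split(':')[1])
    return x, y

# hypothetical sample lines, NOT the real contents of points1.txt
sample_lines = ["x_1:1;y_1:2.5;z_1:0\n", "x_2:3;y_2:10.0;z_2:1\n"]

pairs = [parse_line(line) for line in sample_lines]
xs_demo, ys_demo = zip(*pairs)   # unzip the pairs into x-values and y-values
print(xs_demo, ys_demo)
```

With the real file, the same function could be applied to each `line` inside the `with open(name) as f:` loop.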

* plotting the $y$-values on their own (against their position in the list) only yields what looks like random noise ...

In [None]:
plt.plot(ys, '.')

* plotting $y$s against $x$s shows that the data lie on a parabola

In [None]:
plt.plot(xs, ys, 'r+')

* a least squares straight line can still be computed, even though a straight line is clearly not a good model for this data ...

In [None]:
import numpy as np

In [None]:
Y = np.array(ys)
Y

In [None]:
X = np.array([[1, x] for x in xs])

In [None]:
B0 = np.linalg.inv(X.T @ X) @ (X.T @ Y)
B0

In [None]:
Yhat = X @ B0
Yhat

In [None]:
xxx = np.arange(-20,70)
xxx

In [None]:
plt.plot(xs, ys, '.')
plt.plot(xxx, [[1, x] @ B0 for x in xxx])

* now prepare the data for a **quadratic** least squares best fit ...

In [None]:
X = np.array([[1, x, x*x] for x in xs])
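
With the extra column $x^2$ in place, the normal equation applies unchanged; it now returns three coefficients instead of two. The following sketch checks this on synthetic, noise-free data lying on a known parabola. The data and the names `xs_demo`, `ys_demo`, `Xq` and `Bq` are assumptions made for illustration; the data in `points1.txt` are left for the exercise below.

```python
import numpy as np

# synthetic data on the parabola y = 2 - 3x + 0.5x^2 (NOT the contents of points1.txt)
xs_demo = np.arange(-5, 6)
ys_demo = 2.0 - 3.0 * xs_demo + 0.5 * xs_demo**2

# design matrix with columns 1, x, x^2
Xq = np.array([[1, x, x * x] for x in xs_demo])

# same normal equation as for the straight line
Bq = np.linalg.inv(Xq.T @ Xq) @ (Xq.T @ ys_demo)
print(Bq)
```

Since the synthetic data have no noise, the computed coefficients should match $2$, $-3$ and $0.5$ up to rounding.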

## Exercises

1. What is the `numpy` command for computing the mean?
1. What is the command for plotting a horizontal line (of slope $0$)?
1. What exactly does the Unix command `head` do?
1. Compute and plot the least squares best quadratic fit to the data in `points1.txt`.