# Week 1: Least Squares Fitting

## Goals
- Learn how to get data into Python via [pandas](https://pandas.pydata.org/),
- Perform basic manipulations of data,
- Construct a least squares fit in a few ways.

## Getting Started

Python is a general-purpose programming language. According to the [TIOBE index](https://www.tiobe.com/tiobe-index/), Python was the most popular programming language as of September 2024.

Python has built-in functions:

In [None]:
a = [2, 4, 4, 5, 8]     # A Python list
sum(a)                  # Sums the entries of a

Python has a large ecosystem of libraries (called modules) and communities:

In [None]:
import pandas as pd     # Python code for loading modules
print(pd.__version__)   # Checking the version of pandas

If you want to brush up on your Python skills, there are a variety of tutorials online:
- [learnpython.org](https://www.learnpython.org/),
- [w3schools.com/python](https://www.w3schools.com/python/),
- and many more...

## Basics

- Add together an integer and a float

In [None]:
1 + 3.2

- "Add" (or concatenate) two strings

In [None]:
"hello" + " world"

- Repeat strings by multiplying an integer with a string

In [None]:
5*"hello "

- Convert integers to strings

In [None]:
str(365*24*60) + " minutes per year" 

- Convert some strings to integers

In [None]:
int("4" + "0") + 2

In [None]:
# int("one")
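
Not every string can be converted: uncommenting the line above raises a `ValueError`, because `"one"` is not an integer literal. As a minimal sketch, the error can be caught with `try`/`except`. The helper name `to_int_or_none` is hypothetical and used only for illustration.

```python
def to_int_or_none(text):
    """Return int(text), or None if text is not an integer literal."""
    try:
        return int(text)
    except ValueError:
        # int() raises ValueError for strings such as "one"
        return None

print(to_int_or_none("40"))   # a valid integer literal
print(to_int_or_none("one"))  # not a valid integer literal
```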

## Our first example

There are seemingly countless ways to get data into a useful format. 

First, we read the file `./data/ex1.csv` directly into Python as plain text.

In [None]:
with open("data/ex1.csv", "r") as ex1_data:
    print(ex1_data.read())
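
At this point the data is just one long string. As a rough sketch of what massaging it by hand could look like, the standard-library `csv` module can split it into rows. The layout below (a header line `i,x_i,y_i` followed by comma-separated integers) is an assumption about the file, and only the first three data rows are reproduced.

```python
import csv
import io

# Assumed file layout: a header line, then comma-separated integer rows
csv_text = """i,x_i,y_i
1,30,73
2,20,50
3,60,128
"""

# DictReader uses the header line as the keys of each row
reader = csv.DictReader(io.StringIO(csv_text))

# Every field arrives as a string, so convert each one to int by hand
rows = [{key: int(value) for key, value in row.items()} for row in reader]
print(rows[0])
```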

We could further massage this data, but pandas takes care of all of this. Let's use it.

In [None]:
# Imported above "import pandas as pd"
df = pd.read_csv("data/ex1.csv")
print(df)

Alternatively, we can enter the data by hand. To turn this on, uncomment the following code by removing the leading `#` from each line.

In [None]:
# df = pd.DataFrame({
#     "i" : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
#     "x_i" : [30, 20, 60, 80, 40, 50, 60, 30, 70, 60],
#     "y_i" : [73, 50, 128, 170, 87, 108, 135, 69, 148, 132],
# })

Python starts its indexing at 0, not 1, and pandas already supplies its own 0-based row index, so the column labeled `i` is redundant. Let's remove it.

In [None]:
df = df[["x_i", "y_i"]]
print(df)
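
Selecting the columns to keep is one option. Another, sketched here, is to name the column to drop with `DataFrame.drop`. The names `df_full` and `df_dropped` are introduced only for this example, and the data is rebuilt by hand so the block runs on its own.

```python
import pandas as pd

# Same values as the hand-entered DataFrame above
df_full = pd.DataFrame({
    "i": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "x_i": [30, 20, 60, 80, 40, 50, 60, 30, 70, 60],
    "y_i": [73, 50, 128, 170, 87, 108, 135, 69, 148, 132],
})

# drop returns a new DataFrame; df_full itself is left unchanged
df_dropped = df_full.drop(columns=["i"])

print(df_dropped.head(3))   # first three rows
print(df_dropped.shape)     # (number of rows, number of columns)
```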

For a line of best fit, we write 
$$
    y = b_0 + b_1x.
$$

The equations for $b_0$ and $b_1$ are given by 
$$
\begin{aligned} 
    nb_0 + b_1\sum x_i &= \sum y_i, \\
    b_0 \sum x_i + b_1 \sum x_i^2 &= \sum x_iy_i. 
\end{aligned}
$$
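
As a brief sketch of where these two equations come from, assume the usual least squares criterion (the symbol $S$ is introduced here only for this derivation): we choose $b_0$ and $b_1$ to minimize the sum of squared residuals
$$
    S(b_0, b_1) = \sum (y_i - b_0 - b_1x_i)^2.
$$
Setting both partial derivatives to zero gives
$$
\begin{aligned} 
    \frac{\partial S}{\partial b_0} &= -2\sum (y_i - b_0 - b_1x_i) = 0, \\
    \frac{\partial S}{\partial b_1} &= -2\sum x_i(y_i - b_0 - b_1x_i) = 0,
\end{aligned}
$$
and rearranging each line recovers the system above.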

Let's take each quantity in turn.

- $n$

In [None]:
n = len(df)         # number of rows (data points)
print(n)

- $\sum x_i$

In [None]:
sum_x = sum(df["x_i"])      # Sums all entries x_i
print(sum_x)

- $\sum y_i$

In [None]:
sum_y = sum(df["y_i"])      # Sum all entries y_i
print(sum_y)

- $\sum x_i^2$

In [None]:
sum_xx = sum(x**2 for x in df["x_i"])
print(sum_xx)

- $\sum x_iy_i$

In [None]:
sum_xy = sum(t[0] * t[1] for t in zip(df["x_i"], df["y_i"]))
print(sum_xy)
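
The generator expressions above work on any Python iterable. With pandas, the same sums can also be written with vectorized arithmetic, as in this sketch. The names `x`, `y`, `sum_xx_vec`, and `sum_xy_vec` are introduced only for this example.

```python
import pandas as pd

# Same values as the x_i and y_i columns
x = pd.Series([30, 20, 60, 80, 40, 50, 60, 30, 70, 60])
y = pd.Series([73, 50, 128, 170, 87, 108, 135, 69, 148, 132])

# Arithmetic on Series acts entry by entry; .sum() then adds the entries
sum_xx_vec = (x**2).sum()
sum_xy_vec = (x * y).sum()

print(sum_xx_vec)
print(sum_xy_vec)
```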

We will put these values into a matrix to solve the system of 2 linear equations. 

We'll use the `numpy` package for this. You can learn more about numpy [here](https://numpy.org/doc/stable/index.html).

In [None]:
import numpy as np                              # Loading numpy
A = np.array([[n, sum_x], [sum_x, sum_xx]])     # Numpy matrix
b = np.array([[sum_y], [sum_xy]])
print(A)
print("")
print(b)

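Now we use `numpy` to solve the system below, where the unknown vector $x$ holds the coefficients $b_0$ and $b_1$ (not to be confused with the data values $x_i$):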
$$
    Ax = b
$$

In [None]:
np.linalg.solve(A, b)

Thus, $b_0=10$ and $b_1=2$, and the line of best fit is
$$
    y = 10 + 2x.
$$
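
As a cross-check, the two equations can also be solved by hand for $b_1$ and $b_0$ and compared with `np.polyfit`, which fits a polynomial of a given degree by least squares. This is only a sketch: the data is re-entered so the block runs on its own, and the variable names are chosen for this example.

```python
import numpy as np

x = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)
n = len(x)

# Closed-form solution of the two linear equations for b_1, then b_0
b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b0 = np.mean(y) - b1 * np.mean(x)
print(b0, b1)

# np.polyfit returns coefficients from the highest degree to the lowest,
# so for degree 1 the slope comes first and the intercept second
slope, intercept = np.polyfit(x, y, deg=1)
print(intercept, slope)
```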

### Formulating as matrices
It is significantly easier to use matrix notation to compute the least squares fit.

Recall we defined
$$
\begin{aligned} 
    X &= \begin{pmatrix} 
        1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n
    \end{pmatrix}, & 
    Y &= \begin{pmatrix} 
        y_1 \\ y_2 \\ \vdots \\ y_n 
    \end{pmatrix}, & 
    B &= \begin{pmatrix} 
        b_0 \\ b_1
    \end{pmatrix}. 
\end{aligned}
$$

- $X$

In [None]:
X = np.array([[1]*len(df), df["x_i"]]).transpose()   # One 1 per data point, not a hard-coded 10
print(X)

- $Y$ 

In [None]:
Y = np.array([df["y_i"]]).transpose()
print(Y)

We can get the matrix $B$ from the following equation: 
$$
    B = (X^{\mathrm{t}}X)^{-1}X^{\mathrm{t}} Y.
$$

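- $A = (X^{\mathrm{t}}X)^{-1}$ (note that this reuses the name `A`; the matrix $X^{\mathrm{t}}X$ is the coefficient matrix from the system we solved earlier)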

In [None]:
A_pre = np.matmul(X.transpose(), X)   # Use np.matmul to multiply
print("A_pre = \n{}".format(A_pre))
A = np.linalg.inv(A_pre)              # Use np.linalg.inv to invert
print("A = \n{}".format(A))

- $C = X^{\mathrm{t}}Y$ 

In [None]:
C = X.transpose() @ Y                # Use the '@' operator to multiply matrices
print(C)

- $B = AC$

In [None]:
B = A @ C
print(B)
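
Forming the inverse explicitly mirrors the formula, but it is not the only route. As a sketch, `np.linalg.lstsq` solves the same least squares problem directly from $X$ and $Y$ without computing an inverse. The data is re-entered here so the block runs on its own, and the names `B_inv` and `B_lstsq` are introduced only for this comparison.

```python
import numpy as np

x = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)

X = np.column_stack([np.ones_like(x), x])   # columns: 1s, then x_i
Y = y.reshape(-1, 1)                        # column vector of y_i

# The formula from the text: B = (X^t X)^{-1} X^t Y
B_inv = np.linalg.inv(X.T @ X) @ X.T @ Y

# lstsq returns the solution, the residual sum of squares,
# the rank of X, and the singular values of X
B_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, Y, rcond=None)

print(B_inv.ravel())
print(B_lstsq.ravel())
```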

### An implementation of least squares fitting
We can use the `statsmodels` module, which provides an implementation of [(ordinary) least squares](https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html).

In [None]:
import statsmodels.api as sm
model = sm.OLS(df["y_i"], sm.add_constant(df["x_i"]))   # Prepend a column of 1s to the "x_i" column.
results = model.fit()

*Note* (not important): this column of 1s is necessary because we are trying to fit a nonhomogeneous linear equation. A polynomial is **homogeneous** if each term has the same degree: $x^2 + y^2$ is homogeneous while $x^2 + x + 1$ is not. By including a column of 1s, we are fitting a line of the form $y = b_0x_0 + b_1x_1$ and *then* setting $x_0=1$ and $x_1=x$.
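
To see what the column of 1s contributes, here is a small sketch that fits the same data with and without it, using `np.linalg.lstsq` rather than `statsmodels`. Without the column, the fitted line is forced through the origin. The names `coef_with` and `coef_without` are introduced only for this example.

```python
import numpy as np

x = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)

# With the column of 1s: fits y = b_0 + b_1 x
X_with = np.column_stack([np.ones_like(x), x])
coef_with, *_ = np.linalg.lstsq(X_with, y, rcond=None)

# Without it: fits y = b_1 x, a line through the origin
X_without = x.reshape(-1, 1)
coef_without, *_ = np.linalg.lstsq(X_without, y, rcond=None)

print(coef_with)      # intercept and slope
print(coef_without)   # slope only
```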

We can view these results, which will give much more information than we need:

In [None]:
print(results.summary(slim=True))