Mathematical Models  
Students are normally taught to draw a straight line from an equation such as:  
y = 2x + 1  

We choose two values for x and calculate the corresponding values for y using the equation above.  
Then the axes are drawn, the two points are plotted, and a line is drawn through them, extending beyond both points.


In [None]:
# importing matplotlib
import matplotlib.pyplot as plt

# drawing axes
plt.plot([-1, 10], [0, 0], 'k-')
plt.plot([0, 0], [-1, 10], 'k-')

# plotting blue and red lines
plt.plot([1, 1], [-1, 3], 'b:')
plt.plot([-1, 1], [3, 3], 'r:')

# plotting the two points (1, 3) & (2, 5)
plt.plot([1, 2], [3, 5], 'ko')

# joining points with an extending green line
plt.plot([-1, 10], [-1, 21], 'g-')

# setting plot limits
plt.xlim([-1, 10])
plt.ylim([-1, 10])

# showing plot
plt.show()

Scenario  
What if we have some points and are instead looking for the equation?  
For example, suppose we want to weigh our travel case to avoid extra charges at a certain airline.  

We don't have weighing scales, but we do have a spring and a set of three weights: 7kg, 14kg & 21kg.  
We attach the spring to the wall hook and mark where the bottom of the spring hangs without any weights attached.  
Hang the 7kg weight on the end of the spring and mark where the bottom of the spring is now hanging.  
Repeat the above step with the 14kg & 21kg weights.  
Finally, we attach our travel case to the spring and it hangs approximately halfway between 7kg & 14kg.  
So is the travel case over the 10kg limit?  

When we look at the marks made on the wall, they are evenly spaced in 7kg increments (0, 7, 14, 21).  
Does the case weigh 10.5kg?  
That depends on whether there is a linear relationship between:  
#1. The distance of the spring's hook from its resting position  
#2. The mass on the end of the spring's hook
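
If the relationship really is linear, the 10.5kg guess follows from simple interpolation between the marks. A minimal sketch, assuming hypothetical mark positions (the distances in cm are invented for illustration; only their even spacing comes from the scenario):

```python
import numpy as np

# Known masses and hypothetical mark positions (cm below the resting mark).
# The positions are made-up values; only the even spacing matters here.
masses_kg = np.array([0.0, 7.0, 14.0, 21.0])
marks_cm = np.array([0.0, 35.0, 70.0, 105.0])

# The case hangs halfway between the 7 kg and 14 kg marks
case_mark_cm = (marks_cm[1] + marks_cm[2]) / 2.0

# Linear interpolation between the known marks gives the implied mass
case_mass_kg = np.interp(case_mark_cm, marks_cm, masses_kg)
print("Estimated mass of the case: %.1f kg" % case_mass_kg)
```

The estimate is only as good as the linearity assumption, which is what the experiment in the next section tests.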

Experiment  
So we buy some new weights: 1kg, 2kg, 3kg and so on up to 20kg, in 1kg increments.  
We place each weight in turn on the spring and measure how far the spring moves from its resting position.  
The data is recorded and then plotted.

In [None]:
# importing numpy to deal with numerical multi-dimensional arrays
import numpy as np

# importing matplotlib, a plotting library, along with pyplot module
import matplotlib.pyplot as plt

# setting plot size
plt.rcParams['figure.figsize'] = (8, 6)

# w is an array containing weight values
# d is an array containing corresponding distance measurements
w = np.arange(0.0, 21.0, 1.0)
d = 5.0 * w + 10.0 + np.random.normal(0.0, 5.0, w.size)

w
d

# creating the plot
plt.plot(w, d, 'k.')

# labelling axes for the plot
plt.xlabel('Weight (kg)')
plt.ylabel('Distance (cm)')

# showing the plot
plt.show()

The plot shows that the data could be linear: the points don't lie exactly on a straight line, but they are close to one.  
Other factors could prevent a perfect line, such as air density, human error, etc.

Straight Lines  
A straight line is expressed via y = mx + c,  
where m is the slope of the line and c is the y-intercept of the line (the value of y when x is 0).  

We must pick values for m & c in order to fit a straight line to the above data.  
These are the parameters of our model and we want to choose the best possible values.

In [None]:
# Plotting w versus d with black dots
plt.plot(w, d, 'k.', label = "Data")

# Overlaying some lines on the plot
x = np.arange(0.0, 21.0, 1.0)
plt.plot(x, 5.0 * x + 10.0, 'r-', label = r"$5x + 10$")
plt.plot(x, 6.0 * x + 5.0, 'g-', label = r"$6x + 5$")
plt.plot(x, 5.0 * x + 15.0, 'b-', label = r"$5x + 15$")

# adding a legend to the plot
plt.legend()

# adding labels for axes
plt.xlabel('Weight (kg)')
plt.ylabel('Distance (cm)')

# showing the plot
plt.show()

Calculating the Cost  
Each line above roughly fits the data, but which is best?  
The best line minimises the following calculated value:  

$$\sum_i (y_i - m x_i - c)^2$$  
where $(x_i, y_i)$ is the $i^{th}$ point in our data set, and $\sum_i$ means to sum over all data points.  

The values of $m$ and $c$ are still to be determined, so we denote this quantity as $Cost(m, c)$.  

Looking at $(y_i - m x_i - c)$: $y_i$ is the measured value corresponding to $x_i$.  
$m x_i + c$ is what the model says that $y_i$ should have been.  

The difference between the observed value $y_i$ and the value that the model gives, $m x_i + c$, is $y_i - m x_i - c$.

Why Square the Value?  
The value may be positive or negative, and we sum over all values.  
If we allow values to be positive or negative, then positives can cancel negatives.  
One option is to take the absolute value: $|y_i - m x_i - c|$  
Alternatively, we can square the quantity, as the square of a real number is never negative.
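
The cancellation problem can be seen on a tiny example. This is a sketch using made-up data (three points lying exactly on y = x) and a deliberately poor model (the horizontal line y = 1); the variable names are chosen for this illustration only:

```python
import numpy as np

# Hypothetical data lying exactly on the line y = x
x_demo = np.array([0.0, 1.0, 2.0])
y_demo = np.array([0.0, 1.0, 2.0])

# A deliberately poor model: the horizontal line y = 1 (m = 0, c = 1)
m_bad, c_bad = 0.0, 1.0
residuals = y_demo - m_bad * x_demo - c_bad   # [-1, 0, 1]

# The signed residuals cancel, which would wrongly suggest a perfect fit
print("Sum of residuals:         ", np.sum(residuals))

# Absolute values and squares both avoid the cancellation
print("Sum of absolute residuals:", np.sum(np.abs(residuals)))
print("Sum of squared residuals: ", np.sum(residuals**2))
```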

In [None]:
# calculating the cost of the lines above for the above data
cost = lambda m,c: np.sum([(d[i] - m * w[i] - c)**2 for i in range(w.size)])

print("Cost with m = %5.2f and c = %5.2f: %8.2f" % (5.0, 10.0, cost(5.0, 10.0)))
print("Cost with m = %5.2f and c = %5.2f: %8.2f" % (6.0, 5.0, cost(6.0, 5.0)))
print("Cost with m = %5.2f and c = %5.2f: %8.2f" % (5.0, 15.0, cost(5.0, 15.0)))

Minimising the Cost  
We want to calculate the values of m & c that give the lowest cost value.  

Recall the formula for the cost:  
$$Cost(m, c) = \sum_i (y_i - m x_i - c)^2$$  

We could plot the cost, but it is a function of two variables, m & c, so a 3D plot would be required.  
When fitting a straight line to a few data points in 2D, however, we can easily calculate the best values for m & c exactly.

We must first calculate the mean values of x and y.  
Then subtract the mean of x from each value of x, and the mean of y from each value of y.  

Then we take the dot product of the new x values and new y values, and divide it by the dot product of the new x values with themselves, which gives us m.  
We then use m to calculate c: c is the mean of y minus m times the mean of x.  

In the code below, x is called w (weight) and y is called d (distance).
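
The recipe above can be justified with a short derivation, sketched here in the notation of the Cost formula. The symbols $\bar{x}$ and $\bar{y}$ (the means of x and y) are introduced for this sketch. Setting both partial derivatives of the Cost to zero gives:

```latex
\begin{align*}
\frac{\partial\, Cost}{\partial c} &= -2 \sum_i (y_i - m x_i - c) = 0
  \quad\Rightarrow\quad c = \bar{y} - m \bar{x} \\
\frac{\partial\, Cost}{\partial m} &= -2 \sum_i x_i (y_i - m x_i - c) = 0 \\
\text{substituting } c: \quad
  0 &= \sum_i x_i \big( (y_i - \bar{y}) - m (x_i - \bar{x}) \big) \\
\text{since } \sum_i \big( (y_i - \bar{y}) - m (x_i - \bar{x}) \big) = 0: \quad
  0 &= \sum_i (x_i - \bar{x}) \big( (y_i - \bar{y}) - m (x_i - \bar{x}) \big) \\
\Rightarrow\quad m &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
\end{align*}
```

The numerator is the dot product of the mean-subtracted x and y values, and the denominator is the dot product of the mean-subtracted x values with themselves.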

In [None]:
# We must first calculate the mean values of x (weight) and of y (distance)
w_avg = np.mean(w)
d_avg = np.mean(d)

# Then subtract the mean of w from each value of w, and the mean of d from each value of d
w_zero = w - w_avg
d_zero = d - d_avg

# The best value for m is calculated
m = np.sum(w_zero * d_zero) / np.sum(w_zero * w_zero)

# Using m to calculate the best value for c
c = d_avg - m * w_avg

# printing results
print("m is %8.6f and c is %6.6f." % (m, c))

In [None]:
# Numpy can perform this calculation via polyfit
np.polyfit(w, d, 1)

Best Fit Line

In [None]:
# Plotting the best fit line
plt.plot(w, d, 'k.', label = 'Original Data')
plt.plot(w, m * w + c, 'b-', label = 'Best Fit Line')

# adding labels for axes and a legend
plt.xlabel('Weight (kg)')
plt.ylabel('Distance (cm)')
plt.legend()

# showing the plot
plt.show()

The Cost of the best m & c is not zero.


In [None]:
print("Cost with m = %5.2f and c = %5.2f: %8.2f" % (m, c, cost(m, c)))

Exercise 1:  
Use numpy & matplotlib to plot the absolute value function.  
Research and explain why the absolute value function is not typically used in fitting straight line data.

References:  
Program to Plotting Absolute Function?, ePythonGuru, https://www.epythonguru.com/2019/11/how-to-plot-absolute-function.html  
Grace Alfie, Python Absolute Value - abs() for real and complex numbers, LearnDataSci, https://www.learndatasci.com/solutions/python-absolute-value/  

The abs() function is built into Python, and NumPy provides the equivalent np.abs() (also available as np.absolute()), which works element-wise on arrays.  
The function takes a single argument, which can be an integer, a float or a complex number.  
It returns the absolute value for integers or floats, whereas the magnitude is returned for complex numbers.  

A complex number is a combination of a real part and an imaginary part, where the imaginary unit, written as i (or j in Python), is the square root of -1.  

In [None]:
# complex number with real & imaginary parts
complex_no = (5 - 9j)

print("The Magnitude of 5 - 9j is:", abs(complex_no))

In [None]:
# plotting the absolute function abs()
import numpy as np
import matplotlib.pyplot as plt

# x values spanning both sides of zero
x = np.linspace(-10.0, 10.0, 201)

# y is the absolute value of x
y = np.abs(x)

plt.plot(x, y)
plt.xlabel("X Values")
plt.ylabel("Y Values")
plt.title("Absolute Function abs()")

plt.grid(True)
plt.show()
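
One commonly given reason for preferring squares over absolute values is smoothness. The absolute value has a kink at zero, where its slope jumps from -1 to +1, whereas the square has a slope that passes smoothly through zero. That smoothness is what lets calculus give an exact formula for the best m & c. The sketch below estimates the slopes numerically; the helper name and the step size h are arbitrary choices for this illustration, and this is one part of an answer rather than a complete one:

```python
import numpy as np

h = 1e-6  # small step for the one-sided slope estimates (arbitrary choice)

def one_sided_slopes(f, r0=0.0):
    """Estimate the slope of f just to the left and just to the right of r0."""
    left = (f(r0) - f(r0 - h)) / h
    right = (f(r0 + h) - f(r0)) / h
    return left, right

# Slopes of the absolute value and of the square either side of zero
abs_left, abs_right = one_sided_slopes(np.abs)
sq_left, sq_right = one_sided_slopes(np.square)

print("Slopes of |r| either side of 0:", abs_left, abs_right)
print("Slopes of r^2 either side of 0:", sq_left, sq_right)
```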

Optimization  
We will use SciPy to estimate the parameters m & c by minimising the cost:  

$$Cost(m, c) = \sum_i (y_i - m x_i - c)^2$$

In [None]:
import scipy.optimize as so
import numpy as np

# fixing x values
x = np.arange(0.0, 21.0, 1.0)

# fixing y values
y = 5.0 * x + 10.0 + np.random.normal(0.0, 5.0, x.size)

x, y

Now we will use the minimize function in scipy.optimize.  
We first need a function to minimize: the Cost function.  
The x and y values are fixed, as shown above.  
The function passed to minimize must take a single argument, whereas our cost function takes two: m & c.  
To remedy this, m & c are packed into a single tuple called MC, e.g. MC = (5, 10).

In [None]:
def cost(MC):
    # Unpacking the values, m & c
    m, c = MC
    # Summing the squared differences over the fixed data points x and y
    total = np.sum((y - m * x - c)**2)
    # Returning the value
    return total

# Running a quick test
cost((5.0, 10.0))

# https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#scipy.optimize.minimize/
result = so.minimize(cost, (2.0, 2.0))

# showing results
result

# extracting the optimized m & c
m_o, c_o = result.x

# printing the optimized values
m_o, c_o

# the previous analytical result via polyfit
m_a, c_a = np.polyfit(x, y, 1)

# printing the analytical values
m_a, c_a

In [None]:
# Plotting the best fit line from the optimization
fig, ax = plt.subplots(figsize = (8, 6))
ax.plot(x, y, 'k.', label = 'Original Data')
ax.plot(x, m_o * x + c_o, 'b-', label = 'Optimized Line')
ax.plot(x, m_a * x + c_a, 'g-', label = 'Analytical Line')
ax.legend()

# showing the plot
plt.show()
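
The two lines in the plot should sit almost exactly on top of each other. A minimal, self-contained sketch of a numerical check is shown below. It regenerates simulated data with an arbitrarily chosen seed, and the names ending in `_chk` are introduced only for this illustration:

```python
import numpy as np
import scipy.optimize as so

# Simulated data in the same style as above; the seed is an arbitrary choice
rng = np.random.default_rng(0)
x_chk = np.arange(0.0, 21.0, 1.0)
y_chk = 5.0 * x_chk + 10.0 + rng.normal(0.0, 5.0, x_chk.size)

def cost_chk(MC):
    # Sum of squared differences between the data and the line m * x + c
    m, c = MC
    return np.sum((y_chk - m * x_chk - c)**2)

# Numerical minimisation versus the least-squares fit from polyfit
m_num, c_num = so.minimize(cost_chk, (2.0, 2.0)).x
m_fit, c_fit = np.polyfit(x_chk, y_chk, 1)

print("minimize:", m_num, c_num)
print("polyfit: ", m_fit, c_fit)
print("Agree to within 0.01:", np.allclose([m_num, c_num], [m_fit, c_fit], atol=1e-2))
```

The tolerance is deliberately loose, because minimize is an iterative numerical method and stops once it is close enough rather than at the exact minimum.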

Curve Fitting  
SciPy's curve_fit function fits a model function directly to the data.  
Reference: SciPy Optimize Curve Fit, SciPy, https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html

In [None]:
# Creating function for a model
def f(x, m, c):
    return m * x + c

# running curve_fit
result = so.curve_fit(f, x, y)

# looking at the result
result

# pulling out the parameters
m_f, c_f = result[0]

# printing values
m_f, c_f

# plotting best fit line from the optimization
fig, ax = plt.subplots(figsize = (8, 6))
ax.plot(x, y, 'k.', label = 'Original Data')
ax.plot(x, m_f * x + c_f, 'r-', label = 'Curve Fit Line')
ax.plot(x, m_a * x + c_a, 'g-', label = 'Analytical Line')
ax.legend()

# showing the plot
plt.show()
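
curve_fit returns two items: the fitted parameters and their estimated covariance matrix. A rough one-standard-deviation uncertainty for each parameter can be read from the square roots of the diagonal of that matrix. The sketch below uses freshly simulated data with an arbitrary seed; `line_model` and the names ending in `_cf` are introduced only for this illustration:

```python
import numpy as np
import scipy.optimize as so

def line_model(x, m, c):
    # Straight-line model, same form as f above
    return m * x + c

# Simulated data in the same style as above; the seed is an arbitrary choice
rng = np.random.default_rng(1)
x_cf = np.arange(0.0, 21.0, 1.0)
y_cf = 5.0 * x_cf + 10.0 + rng.normal(0.0, 5.0, x_cf.size)

# popt holds the fitted (m, c); pcov is their estimated covariance matrix
popt, pcov = so.curve_fit(line_model, x_cf, y_cf)

# Square roots of the diagonal give approximate standard errors
perr = np.sqrt(np.diag(pcov))

print("m = %.3f +/- %.3f" % (popt[0], perr[0]))
print("c = %.3f +/- %.3f" % (popt[1], perr[1]))
```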

Exercise 2:  
Fit a straight line to the following data points using all three methods from the lecture notes.  
Do you think a straight line is a good model for the points below?  
x = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]  
y = [0.7, 1.1, 1.5, 1.6, 1.7, 2.0, 2.3, 2.4, 2.2, 2.1, 2.4, 2.6, 2.2, 2.7, 2.5, 2.7, 2.8, 2.9, 3.1]

References:  
Zach, How to Plot Line of Best Fit in Python (With Examples), Statology, https://www.statology.org/line-of-best-fit-python/  
Ashley Michael, Fitting a Straight Line to Data Points, University of New South Wales, Australia, https://newt.phys.unsw.edu.au/~mcba/mcba12.pdf

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# data points
x = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0])
y = np.array([0.7, 1.1, 1.5, 1.6, 1.7, 2.0, 2.3, 2.4, 2.2, 2.1, 2.4, 2.6, 2.2, 2.7, 2.5, 2.7, 2.8, 2.9, 3.1])

# method 1: fitting a straight line via np.polyfit
a, b = np.polyfit(x, y, 1)

# plotting the data and the fitted line
plt.scatter(x, y)
plt.plot(x, a*x+b)
plt.show()

In [None]:
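# method 2: minimising the cost with scipy.optimize.minimize
import scipy.optimize as so

# data points
x = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0])
y = np.array([0.7, 1.1, 1.5, 1.6, 1.7, 2.0, 2.3, 2.4, 2.2, 2.1, 2.4, 2.6, 2.2, 2.7, 2.5, 2.7, 2.8, 2.9, 3.1])

# cost of the line m * x + c for the data points above, with MC = (m, c)
cost_xy = lambda MC: np.sum((y - MC[0] * x - MC[1])**2)
m_o, c_o = so.minimize(cost_xy, (1.0, 1.0)).x

# plotting the data and the best fit line
plt.plot(x, y, 'k.')
plt.plot(x, m_o * x + c_o, 'b-')
plt.show()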

In [None]:
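# method 3: fitting the straight-line model with scipy.optimize.curve_fit
import scipy.optimize as so

(m_f, c_f), _ = so.curve_fit(lambda x, m, c: m * x + c, x, y)

# plotting x versus y with the fitted line
plt.plot(x, y, 'k.', label = 'Data')
plt.plot(x, m_f * x + c_f, 'r-', label = 'Curve Fit Line')
plt.legend()

# adding labels for axes
plt.xlabel('x')
plt.ylabel('y')

# showing the plot
plt.show()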

Simulating Data  
Previously, the data was simulated as follows:

In [None]:
w = np.arange(0.0, 21.0, 1.0)
d = 5.0 * w + 10.0 + np.random.normal(0.0, 5.0, w.size)

The first command creates a numpy array containing all values between 0.0 & 21.0 (including 0.0 but excluding 21.0) in steps/increments of 1.0.  
The second command takes the values in the w array, multiplies each value by 5.0 and then adds 10.0.  
It then adds an array of the same length containing random values, drawn from a normal distribution with mean = 0.0 and standard deviation = 5.0.  
The normal distribution follows a bell-shaped curve, centred on the mean, with its width determined by the standard deviation.

In [None]:
# probability density of the normal distribution with mean mu and standard deviation s
normpdf = lambda mu, s, x: (1.0 / np.sqrt(2.0 * np.pi * s**2)) * np.exp(-((x - mu)**2)/(2 * s**2))

x = np.linspace(-20.0, 20.0, 100)
y = normpdf(0.0, 5.0, x)

plt.plot(x, y)
plt.show()

The idea is to add some randomness to the distance measurements.  
The random values are centred around 0.0, with a greater than 99% chance that they fall within the range -15.0 to 15.0 (three standard deviations either side of the mean).  
The normal distribution is used because of the Central Limit Theorem: when many independent random effects add together, the outcome looks roughly like a normal distribution.
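
The "greater than 99%" figure can be checked with the error function from Python's standard library. For a normal distribution with mean 0, the probability of landing within a distance b of the mean is erf(b / (sigma * sqrt(2))). The variable names in this sketch are chosen for illustration:

```python
import math

sigma = 5.0    # standard deviation used when simulating the data
bound = 15.0   # three standard deviations

# Probability that a normal random value with mean 0 lies between -bound and bound
p_within = math.erf(bound / (sigma * math.sqrt(2.0)))
print("P(-15 < X < 15) = %.4f" % p_within)
```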

Plotting the Cost function  
Using the data from earlier, we can plot the Cost function for a set of data points.  
Cost involves two variables, m & c, with the formula:  

$$Cost(m, c) = \sum_i (y_i - m x_i - c)^2$$  

A 3D plot is required for a function of two variables.

In [None]:
from mpl_toolkits.mplot3d import Axes3D

# asking pyplot for a 3D set of axes
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(111, projection = '3d')

# making data
mvals = np.linspace(4.5, 5.5, 100)
cvals = np.linspace(0.0, 20.0, 100)

# filling the grid
mvals, cvals = np.meshgrid(mvals, cvals)

# flattening the meshes to make it more convenient
mflat = np.ravel(mvals)
cflat = np.ravel(cvals)

# calculating the cost at each point on the grid, then reshaping to match the grid
C = np.array([np.sum([(d[i] - m * w[i] - c)**2 for i in range(w.size)]) for m, c in zip(mflat, cflat)]).reshape(mvals.shape)

# plotting the surface
surf = ax.plot_surface(mvals, cvals, C)

# setting the axes labels
ax.set_xlabel("$m$", fontsize = 16)
ax.set_ylabel("$c$", fontsize = 16)
ax.set_zlabel("$Cost$", fontsize = 16)

# showing the plot
plt.show()
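
The same grid of cost values can also be computed without explicit Python loops by using NumPy broadcasting. This is an alternative sketch rather than the method used above; it regenerates simulated data with an arbitrary seed, and the names ending in `_demo` and `_grid` are introduced only for this illustration:

```python
import numpy as np

# Simulated data in the same style as above; the seed is an arbitrary choice
rng = np.random.default_rng(2)
w_demo = np.arange(0.0, 21.0, 1.0)
d_demo = 5.0 * w_demo + 10.0 + rng.normal(0.0, 5.0, w_demo.size)

# Grid of candidate m and c values, each of shape (100, 100)
m_grid, c_grid = np.meshgrid(np.linspace(4.5, 5.5, 100), np.linspace(0.0, 20.0, 100))

# Adding a trailing axis lets each grid point broadcast against all 21 data points,
# so residuals_grid has shape (100, 100, 21)
residuals_grid = d_demo - m_grid[..., np.newaxis] * w_demo - c_grid[..., np.newaxis]

# Summing the squared residuals over the data axis gives the cost at each grid point
C_grid = np.sum(residuals_grid**2, axis=-1)

# Locating the grid point with the lowest cost
i_min = np.unravel_index(np.argmin(C_grid), C_grid.shape)
print("Lowest cost on the grid at m = %.3f, c = %.3f" % (m_grid[i_min], c_grid[i_min]))
```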

Coefficient of Determination (R-Squared Value):  
We used a Cost function to determine the best line to fit the data.  
Usually the data does not perfectly fit on the best fit line, so the Cost is greater than 0.  
A quantity closely related to Cost is known as the coefficient of determination, or the R-squared value.
The R-squared value measures how much of the variance in y is explained by x.  

In the travel case example earlier, the main thing that affects the distance that the spring hangs down is the weight on the end, but it is not the only thing that affects it.  
Room temperature and air density while taking our measurements could possibly affect it, as could the age of the spring, previous usage, etc.  

The R-squared value estimates how much of the change in the y value is due to the change in the x value, compared to all of the other factors affecting the y value:  

$$R^2 = 1 - \frac{\sum_i (y_i - m x_i - c)^2}{\sum_i (y_i - \bar{y})^2}$$  
where $\bar{y}$ is the mean of the y values.  

The Pearson correlation coefficient is often used instead of the R-squared value; for a straight-line fit, we can square the Pearson correlation coefficient to get the R-squared value.

In [None]:
# Calculating the R-squared value for our data set above
rsq = 1.0 - (np.sum((d - m * w - c)**2) / np.sum((d - d_avg)**2))

print("The R-squared value is %6.4f" % rsq)
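
The link between the Pearson correlation coefficient and the R-squared value can be checked numerically. The sketch below regenerates simulated data with an arbitrary seed, fits a least-squares line with polyfit, and compares the two quantities; the names ending in `_r` are introduced only for this illustration:

```python
import numpy as np

# Simulated data in the same style as above; the seed is an arbitrary choice
rng = np.random.default_rng(3)
w_r = np.arange(0.0, 21.0, 1.0)
d_r = 5.0 * w_r + 10.0 + rng.normal(0.0, 5.0, w_r.size)

# Least-squares straight line through the data
m_r, c_r = np.polyfit(w_r, d_r, 1)

# R-squared from the formula above
rsq_r = 1.0 - np.sum((d_r - m_r * w_r - c_r)**2) / np.sum((d_r - np.mean(d_r))**2)

# Pearson correlation coefficient: off-diagonal entry of the correlation matrix
pearson_r = np.corrcoef(w_r, d_r)[0, 1]

print("R-squared:       %6.4f" % rsq_r)
print("Pearson squared: %6.4f" % pearson_r**2)
```

The two values agree only when m & c are the least-squares best fit for the same data, so R-squared should be computed with the line fitted to the data being scored.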