# Linear Regression

In this notebook, I showcase how to do a simple linear regression on a simulated data set. The idea is that we have some underlying linear function,

$$f(x) = 2 + 3x \ ,$$

which we will use to create some simulated data. To create this data, we first import numpy and generate 100 random points from a uniform distribution between $$[-2,2)$$:

In [2]:
import numpy

x_values = numpy.random.uniform (low=-2, high=2, size=100)

x_values

array([ 0.19400221, -1.8272468 , -0.815422  ,  1.88685038, -0.44190032,
       -1.4877507 , -0.42341515, -1.71156156, -0.09844605,  1.96696142,
       -0.31873922, -0.55701425, -1.66991682,  1.57086108, -1.79824542,
        0.76335304, -1.76323042, -0.92306553,  1.90733913, -1.53621744,
        1.27283491, -0.00271588, -0.07329207,  0.06038414, -0.32478827,
        1.74419069, -1.86179761, -1.6175619 , -0.91153141,  1.52982934,
       -0.93211017,  1.22049813,  0.99897249,  0.67908176, -0.22191142,
        0.85400956,  0.72622715, -0.58956425,  0.52953962, -0.81513454,
       -0.88281033,  1.85167187, -1.1040552 , -0.75101835, -0.02133216,
        0.54560405, -1.63127041,  1.95434498, -0.39933391, -0.65440774,
        0.53570984,  0.71928763, -0.59916747, -0.45649931, -0.77842147,
        1.03154181, -1.53398104,  0.18549824,  0.63504435, -0.31823978,
        0.53687862, -1.54757997, -0.57135264,  1.16616721, -1.38281156,
        0.23826593, -0.86293023,  0.50386557, -0.60741641,  1.50

These x values will then be fed into our function $$f(x)$$. However just inputting these valus won't generate a realistic data set. To make something more realistic, we need to add some random noise $$\epsilon$$, 

$$f(x)_{\text{realistic}} = 2 + 3x + \epsilon$$

To do this, let's generate 100 random values from a normal (Gaussian) distribution:

In [11]:
errors = numpy.random.normal (loc=0.0, scale=1.0, size=100)

simulated_data = x_values + errors

simulated_data

array([-0.51412949, -1.15092283, -0.87341117,  0.5145388 , -0.95689862,
       -0.52576287, -2.14775497, -2.88013364, -0.44476098,  1.57742749,
       -1.84001388, -0.88803303, -2.43192931,  0.5481678 , -2.18219895,
       -1.28623204, -0.63525322, -1.55561136,  0.68212774, -0.977696  ,
        1.39937392, -0.87380695,  1.73090304, -0.69422983, -1.10251656,
        3.46633794, -0.51715726, -2.09536369, -1.53697367,  0.94323639,
       -1.85020039,  1.07604005,  0.88092962, -1.44455806,  0.62789538,
        0.19357298,  0.17841873, -0.36526597, -0.24453935, -0.15740415,
       -0.51859095,  1.24053988, -0.84839414,  0.43660875, -0.54176917,
       -1.73488359, -0.6484865 ,  1.83042837, -0.08088819, -1.38472825,
        0.39376888,  0.56278893, -0.78412265, -0.43226004,  0.4327579 ,
        0.86402793, -0.20358758,  0.24752173,  2.85029498,  1.05362078,
       -0.27036904, -0.97992609, -1.12346673,  3.62151306, -2.42231969,
        0.22649093, -1.38923916, -0.81194131, -1.78305739,  0.90

Let's plot this

In [None]:
plot stuff here

Much better! Suppose we are now looking at this set of data above and think that it could be well modelled by linear regression, that is,

$$y(x) = b_0 + b_1 x$$

where $$y(x)$$ is our prediction for a given input $$x$$ with $$b_0$$ and $$b_1$$ being our two parametric variables (y-intercept and slop respectively).