## Producing some fake data



Let's use `numpy` to produce some fake data.



In [1]:
import numpy as np

N = 100
xs = np.random.uniform(0,np.pi,N)
xs = np.sort(xs) # for plotting later
ys = np.sin(xs) + np.random.normal(0,0.1,N)

Note that `xs` and `ys` are not Python lists but rather `numpy`
arrays.  For "technical reasons" (scikit-learn expects its inputs as
two-dimensional arrays, with one row per sample and one column per
feature) we'd prefer to restructure the arrays as follows.



In [1]:
xs = xs[:, np.newaxis]
ys = ys[:, np.newaxis]
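
To see what this restructuring does, here is a small sketch that checks the shape of an array before and after.  The array `a` is a made-up stand-in for `xs`, not part of the data above.

```python
import numpy as np

# a hypothetical stand-in for xs: a one-dimensional array of 5 samples
a = np.linspace(0, np.pi, 5)
print(a.shape)        # (5,)

# np.newaxis adds a second axis: one row per sample, one column per feature
a_col = a[:, np.newaxis]
print(a_col.shape)    # (5, 1)

# reshape(-1, 1) is another way to write the same restructuring
assert np.array_equal(a_col, a.reshape(-1, 1))
```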

Let's see a plot of our random data.



In [1]:
import matplotlib.pyplot as plt
plt.scatter(xs,ys)
plt.show()

We now pretend that we don't know the source of this data, and we wish
to "learn" the relationship between the $x$'s and the $y$'s.  Of
course, the truth is that $y = \sin x$ plus some noise, but let's
forget about that and see what we can recover.



## Linear regression



We've been learning about linear regression, so let's use
`scikit-learn` to **again** perform linear regression on our data.



In [1]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression().fit( xs, ys )

Let's make a plot of our (linear!) model.



In [1]:
ys_predicted = lm.predict(xs)
plt.scatter(xs, ys)
plt.plot(xs, ys_predicted, color='r')
plt.show()

Because we followed the usual advice to **look at our data**, we know
our data isn't modeled well by a straight line.  This is an example of
**underfitting**.  We need a more complex model to capture the actual
pattern of the data.



## Use polynomials



We find the "polynomial features" associated to the $x$'s.  This
replaces the vector $(x) \in \mathbb{R}^1$ with the vector $(1,x,x^2)
\in \mathbb{R}^3$.



In [1]:
from sklearn.preprocessing import PolynomialFeatures
polynomial_features = PolynomialFeatures(degree=2)
xs_poly = polynomial_features.fit_transform(xs)
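
To make the map $(x) \mapsto (1,x,x^2)$ concrete, here is a minimal sketch applying `PolynomialFeatures` to two hand-picked values.  The array `tiny` is made up for illustration and is not part of our data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# two hypothetical samples, x = 2 and x = 3, as a column
tiny = np.array([[2.0], [3.0]])

# each row (x) becomes (1, x, x^2)
tiny_poly = PolynomialFeatures(degree=2).fit_transform(tiny)
print(tiny_poly)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```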

Next we perform linear regression.  A **common misconception** would
be that **linear** regression is the wrong choice since we want to find
an approximating polynomial.  Can you explain why we are nevertheless
using `LinearRegression` in an attempt to find a polynomial fit?



In [1]:
qm = LinearRegression().fit( xs_poly, ys )

Let's plot the data as a scatterplot, and our model's predicted values
as a red curve.



In [1]:
ys_predicted = qm.predict(xs_poly)
plt.scatter(xs, ys)
plt.plot(xs, ys_predicted, color='r')
plt.show()

That **looks** much better.  Is it "actually" better?



In [1]:
print("linear model score:",lm.score(xs,ys))
print("quadratic model score:",qm.score(xs_poly,ys))
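
For a regression model, `score` reports the coefficient of determination $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$, so a value closer to 1 means the predictions explain more of the variation in the $y$'s.  The sketch below recomputes it by hand to show what the number means.  It generates its own data in the same way as before; the fixed seed and the names `x_demo`, `y_demo`, and `r2_by_hand` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # fixed seed, assumed here for reproducibility
x_demo = np.sort(rng.uniform(0, np.pi, 100))[:, np.newaxis]
y_demo = np.sin(x_demo) + rng.normal(0, 0.1, x_demo.shape)

model = LinearRegression().fit(x_demo, y_demo)
y_hat = model.predict(x_demo)

ss_res = np.sum((y_demo - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total sum of squares
r2_by_hand = 1 - ss_res / ss_tot

# the two printed numbers should agree
print(r2_by_hand, model.score(x_demo, y_demo))
```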

Would a cubic polynomial be an even better choice?



## Overfitting



If degree 2 worked well, surely degree 25 is even better!



In [1]:
from sklearn.preprocessing import PolynomialFeatures
polynomial_features = PolynomialFeatures(degree=25)
xs_poly = polynomial_features.fit_transform(xs)

# note: this rebinds lm, replacing the straight-line model from earlier
lm = LinearRegression().fit( xs_poly, ys )

ys_predicted = lm.predict(xs_poly)
plt.scatter(xs, ys)
plt.plot(xs, ys_predicted, color='r')
plt.show()

Our prediction now is rather wiggly; what we are doing is not so much
learning the regularities in the data as **learning the noise**.
How could we possibly deal with this?  What sort of framework could
help us pick "hyperparameters" like degree?  These questions pave the
way for Week 2, when we study the **bias/variance tradeoff**.
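
One way to probe whether a model is learning the noise, sketched here under assumptions rather than as the method we will settle on, is to score each model on fresh data drawn from the same source.  The helper `make_data`, the fixed seed, and the choice of degrees are all made up for this illustration.  If the degree-25 model is mostly fitting noise, we would expect its score on fresh data to fall short of its score on the data it was fitted to.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)  # fixed seed, assumed here for reproducibility

def make_data(n):
    """Draw n points from y = sin(x) + noise, as in the first section."""
    x = rng.uniform(0, np.pi, n)[:, np.newaxis]
    y = np.sin(x) + rng.normal(0, 0.1, x.shape)
    return x, y

x_train, y_train = make_data(100)
x_fresh, y_fresh = make_data(100)   # data the models never see while fitting

scores = {}
for degree in (1, 2, 25):
    features = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(features.fit_transform(x_train), y_train)
    scores[degree] = (
        model.score(features.transform(x_train), y_train),  # score on fitted data
        model.score(features.transform(x_fresh), y_fresh),  # score on fresh data
    )
    print(degree, scores[degree])
```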

