# Introduction to Data Science and Machine Learning

<p align="center">
    <img width="699" alt="image" src="https://user-images.githubusercontent.com/49638680/159042792-8510fbd1-c4ac-4a48-8320-bc6c1a49cdae.png">
</p>

---

## Model issues

In this lecture we begin describing the main issues that can affect machine learning models.

Let's start by revisiting an example of _polynomial regression_.

Consider the problem of predicting $y \in \mathbb{R}$ from $x \in \mathbb{R}$.

Consider the set of fits below. 

<p align="center">
    <img width="951" alt="image" src="https://user-images.githubusercontent.com/49638680/162145915-a5201fce-fa50-4944-8baf-295a1abb92ac.png">
</p>

The yellow curve shows the result of a linear fit $y = \beta_0 + \beta_1 x$.

Looking at where the points lie, one can see that this is not a good fit.

If we add an extra feature, $x^2$, so that the fit becomes $y = \beta_0 + \beta_1 x + \beta_2 x^2$, we obtain a slightly better fit.
Naively, one might think that the more features one adds, the better the fit becomes. However appealing the idea is, it is actually dangerous. The red fit in the picture is the result of a degree-$7$ polynomial $y = \sum_{k=0}^{7} \beta_k x^k$.
We see that even though the fitted curve passes through the data perfectly, we would not expect it to be a very good predictor of, say, housing prices for inputs it has not seen.
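
The comparison above can be reproduced with a minimal sketch. The data here are hypothetical (8 noisy samples of an assumed quadratic trend, not the points in the figure), and the helper names `fit_polynomial` and `mse` are introduced only for illustration. The sketch fits polynomials of degree 1, 2 and 7 and reports the mean squared error on the training points and on fresh points drawn from the same assumed trend.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data (NOT the points in the figure): 8 noisy samples
# of an assumed quadratic trend.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.1, size=x.size)

# Fresh points from the same assumed trend, used only to check predictions.
x_new = np.linspace(0.05, 0.95, 50)
y_new = 1.0 + 2.0 * x_new - 1.5 * x_new**2 + rng.normal(scale=0.1, size=x_new.size)


def fit_polynomial(degree):
    """Fit y = sum_k beta_k x^k up to the given degree by least squares."""
    # include_bias=False because LinearRegression already fits the intercept beta_0.
    features = PolynomialFeatures(degree=degree, include_bias=False)
    model = LinearRegression().fit(features.fit_transform(x.reshape(-1, 1)), y)
    return features, model


def mse(features, model, x_values, y_values):
    """Mean squared error of the fitted polynomial on the given points."""
    predictions = model.predict(features.transform(x_values.reshape(-1, 1)))
    return float(np.mean((predictions - y_values) ** 2))


train_errors = {}
new_errors = {}
for degree in (1, 2, 7):
    features, model = fit_polynomial(degree)
    train_errors[degree] = mse(features, model, x, y)
    new_errors[degree] = mse(features, model, x_new, y_new)
    print(f"degree {degree}: train MSE = {train_errors[degree]:.4f}, "
          f"new-data MSE = {new_errors[degree]:.4f}")
```

With 8 training points, the degree-7 polynomial has 8 coefficients and can pass through every point, so its training error is essentially zero. The training error can only go down as the degree grows, which is why it is a misleading measure of quality on its own. The error on the fresh points is the quantity to watch, and its exact value depends on the sampled noise.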


### Import libraries

```python
# Import libraries we will need in the following
import pandas as pd
import numpy as np
import scipy.io as sio

import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# set plot size
plt.rcParams['figure.figsize'] = (20, 13)
# IPython/Jupyter magics: they only work inside a notebook
%matplotlib inline
%config InlineBackend.figure_format = "retina"
```