# Polynomial Regression - Position vs Salary
This notebook introduces the basics of Polynomial Regression. We will use the Position vs Salary dataset to see how to implement Polynomial Regression in Python.

You will see that we use the `LinearRegression` class from the `sklearn.linear_model` module to train the model. This is where the "AI" takes place: the model learns its coefficients from the data.

To use it, you will need to install the scikit-learn library into your Python virtual environment.

First, activate your virtual environment in your terminal (if it is not already active). If you do not see the green (.venv) prefix in your terminal prompt, you will need to activate it.

This is how you activate it on a Windows PC:
```bash
.venv\Scripts\activate
```

Once it is activated, you can install the library by running the following command in your terminal:

```bash
pip install scikit-learn
```
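
To confirm the installation worked, one quick check (a sketch that is not part of the original notebook) is to import the package from the same environment and print its version. If the import fails, the library was most likely installed into a different environment.

```python
# Run this inside the activated virtual environment.
import sklearn

# Any version string printed here means the import succeeded.
print(sklearn.__version__)
```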

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression

In [None]:
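dataset = pd.read_csv('data/Position_Salaries.csv')

# Features: columns from index 1 up to (not including) the last one.
# The slice keeps X two-dimensional, which scikit-learn expects.
X = dataset.iloc[:, 1:-1].values
# Target: the column at index 2, as a one-dimensional array.
y = dataset.iloc[:, 2].values
print(X)
print(y)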

In [None]:
regressor = LinearRegression() # We are choosing the LinearRegression class
regressor.fit(X, y) # We are fitting the regressor object to our data. "Fitting" is also called "training" the model

In [None]:
plt.scatter(X, y, color = 'red') # We are plotting the real data
plt.plot(X, regressor.predict(X), color = 'blue') # We are plotting the predicted data
plt.title('Position Level vs Salary')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()

The "straight line" of the Linear Regression does not seem to fit the data very well. We can see that the data is more of a curve. This is where Polynomial Regression comes in. It is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x.

The straight-line fit above is an example of underfitting: the model is too simple to capture the pattern in the data.
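
One way to put a number on underfitting is to look at the R² score and the residuals (actual minus predicted) of the straight line. The sketch below uses made-up data, not the Position vs Salary dataset: `X_demo` and `y_demo` are hypothetical names, and the cubic target is an assumption chosen only because it curves sharply.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: levels 1..10 with a made-up, sharply curving target.
X_demo = np.arange(1, 11, dtype=float).reshape(-1, 1)  # shape (10, 1)
y_demo = 1000.0 * X_demo.ravel() ** 3                  # cubic growth, no noise

line = LinearRegression().fit(X_demo, y_demo)

# Residual = actual - predicted. A straight line fitted to an upward-bending
# curve predicts too low at both ends and too high in the middle.
residuals = y_demo - line.predict(X_demo)
print("R^2 of the straight line:", line.score(X_demo, y_demo))
print("Residuals:", residuals.round(0))
```

Residuals that are positive at both ends and negative in the middle, rather than scattered randomly around zero, are a typical sign that the model is missing curvature.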

In [None]:
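from sklearn.preprocessing import PolynomialFeatures
# Expand each level x into the features [1, x, x^2, x^3, x^4]
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
# Polynomial regression is still a linear model, trained on the expanded features
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)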
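
The cell above relies on `PolynomialFeatures` to turn the single level column into several columns of powers. The sketch below shows what that expansion looks like and how to predict for a level the model has not seen. It uses made-up data: `X_demo`, `y_demo`, `poly`, `model`, and `new_level` are hypothetical names, and the targets are simply set to x⁴ so the result is easy to check by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: six levels with made-up targets y = x^4.
X_demo = np.arange(1, 7, dtype=float).reshape(-1, 1)
y_demo = X_demo.ravel() ** 4

poly = PolynomialFeatures(degree=4)
X_demo_poly = poly.fit_transform(X_demo)

# Each row is [1, x, x^2, x^3, x^4]; the row for x = 2 is [1, 2, 4, 8, 16].
print(X_demo_poly[:3])

model = LinearRegression().fit(X_demo_poly, y_demo)

# For unseen inputs, reuse the fitted transformer with transform(),
# then pass the expanded features to predict().
new_level = np.array([[2.5]])
prediction = model.predict(poly.transform(new_level))
print(prediction)  # close to 2.5 ** 4 = 39.0625 for this noise-free toy data
```

The same pattern applies to the notebook's own objects: expand the new level with the fitted transformer first, then call `predict` on the result.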

In [None]:
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg_2.predict(poly_reg.transform(X)), color = 'blue') # poly_reg is already fitted, so transform is enough
plt.title('Position Level vs Salary')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()

A polynomial degree somewhere between 2 and 4 should be a good fit for this dataset.
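
One possible way to compare candidate degrees is leave-one-out cross-validation: fit on all points but one, predict the point left out, and average the squared errors. This is a sketch on made-up data rather than the notebook's method. The cubic trend, the noise level, and the names `X_demo`, `y_demo`, and `cv_mse` are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: a cubic trend plus a little noise (fixed seed for repeatability).
rng = np.random.default_rng(0)
X_demo = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel() ** 3 + rng.normal(scale=5.0, size=10)

cv_mse = {}
for degree in (1, 2, 3, 4):
    pipeline = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    # scikit-learn reports negated MSE so that "higher is better"; flip the sign back.
    scores = cross_val_score(pipeline, X_demo, y_demo,
                             cv=LeaveOneOut(), scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()

for degree, mse in cv_mse.items():
    print(f"degree {degree}: leave-one-out MSE = {mse:.1f}")
```

A lower leave-one-out error suggests a degree that predicts unseen points better. With only ten points the estimate is noisy, so treat it as a guide rather than a verdict.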

The more we increase the degree of the polynomial, the more complex the model becomes. This can lead to overfitting. Overfitting is when the model is so complex that it fits the noise in the training data instead of the underlying pattern. In other words, it is not abstract enough to generalize to new data.
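
A common way to spot overfitting is to hold some points back and compare the error on the training points with the error on the held-out points. The sketch below does this on made-up data. The quadratic trend, the noise, the alternating split, and the names `X_all`, `y_all`, `train_mse`, and `test_mse` are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: a quadratic trend plus noise (fixed seed for repeatability).
rng = np.random.default_rng(42)
X_all = np.linspace(1, 10, 20).reshape(-1, 1)
y_all = 3.0 * X_all.ravel() ** 2 + rng.normal(scale=5.0, size=20)

# Alternate points between a training set and a held-out set.
X_train, y_train = X_all[::2], y_all[::2]
X_test, y_test = X_all[1::2], y_all[1::2]

train_mse, test_mse = {}, {}
for degree in (1, 2, 9):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree}: train MSE = {train_mse[degree]:.2f}, "
          f"held-out MSE = {test_mse[degree]:.2f}")
```

The training error keeps shrinking as the degree grows, because a higher-degree polynomial can always bend closer to the training points. The held-out error is the one to watch: if it stops improving or gets worse while the training error keeps falling, the extra complexity is fitting noise.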