# Polynomial regression 
### Linear and polynomial regression with the SALARY dataset
You'll be using different types of regression to predict the salary of employees, based on historical salaries.

**1. Importing modules needed for the work**

In [1]:
import matplotlib.pyplot as plt 
import numpy as np 
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns 
plt.style.use('fivethirtyeight')  
import warnings
warnings.filterwarnings('ignore')  # suppress all warnings so they are not displayed in the notebook

**2. Importing the salary data**

In [2]:
dataset=pd.read_csv('Position_Salaries.csv')
dataset

The dataset has 10 position levels, each with the corresponding salary paid to the employee.

> #### We only have one usable feature in this dataset
> The features 'Position' and 'Level' carry the same information, so one of them is redundant.
> The **regressor** is the column 'Level'.

In [None]:
# extracting the regressor/column/feature 'Level'
X=dataset.iloc[:,1:2].values  
X

> #### We now need to extract the regressand, that is, the variable that we eventually want to be able to predict
> The **regressand** is the column 'Salary'.

In [None]:
# Extracting the column 'Salary'
y=dataset.iloc[:,2].values    
y
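The two slices above return arrays of different shapes: scikit-learn estimators expect a 2-D feature matrix, while a 1-D target is fine. A minimal sketch on a hypothetical three-row table (the values and the names `toy`, `X_toy`, `y_toy` are made up for illustration and are not taken from `Position_Salaries.csv`):

```python
import pandas as pd

# Hypothetical stand-in for the salary table (made-up values, illustration only).
toy = pd.DataFrame({
    "Position": ["Junior", "Senior", "Manager"],
    "Level": [1, 2, 3],
    "Salary": [40000, 55000, 90000],
})

X_toy = toy.iloc[:, 1:2].values  # slice 1:2 keeps the column axis -> 2-D array
y_toy = toy.iloc[:, 2].values    # single index drops the column axis -> 1-D array

print(X_toy.shape)  # (3, 1)
print(y_toy.shape)  # (3,)
```

Writing `iloc[:, 1]` instead of `iloc[:, 1:2]` would give a 1-D array, which `LinearRegression.fit` rejects as a feature matrix.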

**3. Splitting the data into training and test data**
> Typically, at this point we would split the dataset into a training set and a test set, so that we can evaluate the model on data it has not seen during training. However, this simple example has only a few data points, so we'll be using all of them for training.

> **This is a very bad thing to do!!**


In [None]:
# uncomment the following to create the training and testing datasets
#from sklearn.model_selection import train_test_split
#X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)
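To see what the commented-out split would produce, here is a small sketch on synthetic data. The arrays `X_demo` and `y_demo` and the output names `X_tr`, `X_te`, `y_tr`, `y_te` are assumptions for illustration; only `train_test_split` and its `test_size` and `random_state` arguments come from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 levels with made-up salaries.
X_demo = np.arange(1, 11).reshape(-1, 1)
y_demo = 1000 * np.arange(1, 11) ** 2

# Hold out 20% of the points; random_state makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (8, 1) (2, 1)
```

With only 10 points, the test set holds just 2 samples, which is why any error measured on it would be very noisy.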

**4. Linear Regression**

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()
lin_reg.fit(X,y)

**5. Visualizing Linear Regression result**

In [None]:
plt.scatter(X,y,color='red')
plt.plot(X,lin_reg.predict(X),color='blue')
plt.title('Linear Regression')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()

**6. Polynomial Regression**<br>
Since our data is distributed in a nonlinear way, we'll try fitting it with polynomial curves to get more accurate models.

In [None]:
# polynomial curve of degree 2
from sklearn.preprocessing import PolynomialFeatures
poly_reg2=PolynomialFeatures(degree=2)
X_poly=poly_reg2.fit_transform(X)
lin_reg_2=LinearRegression()
lin_reg_2.fit(X_poly,y)
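It may help to look at what `PolynomialFeatures` actually produces. For a single input column x and `degree=2`, it returns the columns 1, x and x², and `LinearRegression` then fits an ordinary linear model on those columns. A minimal sketch, where the input values and the names `levels`, `expander` and `expanded` are chosen only for illustration:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

levels = np.array([[2.0], [3.0]])        # two sample inputs, shape (2, 1)
expander = PolynomialFeatures(degree=2)  # the bias column of ones is included by default
expanded = expander.fit_transform(levels)

print(expanded)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```

So "polynomial regression" here is still linear in its coefficients; only the features are nonlinear in x.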

In [None]:
plt.scatter(X,y,color='red')
plt.plot(X,lin_reg_2.predict(poly_reg2.transform(X)),color='blue')  # poly_reg2 is already fitted, so transform is enough
#plt.plot(X,lin_reg_3.predict(poly_reg3.fit_transform(X)),color='green')
plt.title('Polynomial Regression with degree=2')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()

**7. Third-degree Polynomial Regression**<br>
Since the degree-2 curve still does not fit the data very well, we'll try a higher-degree polynomial (degree=3) to get a more accurate model.

In [None]:
# Polynomial curve of degree 3
poly_reg3=PolynomialFeatures(degree=3)
X_poly3=poly_reg3.fit_transform(X)
lin_reg_3=LinearRegression()
lin_reg_3.fit(X_poly3,y)

**8. Visualizing third-degree polynomial**

In [None]:
plt.scatter(X,y,color='red')
#plt.plot(X,lin_reg_2.predict(poly_reg2.fit_transform(X)),color='blue')
plt.plot(X,lin_reg_3.predict(poly_reg3.transform(X)),color='green')  # poly_reg3 is already fitted, so transform is enough
plt.title('Third-degree Polynomial Regression')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()

**9. Predicting the salary of an employee with each of the 3 curves**<br>
Let's use the three models to make predictions, to get a feel for the accuracy we have achieved.
> We now want to predict the salary of an employee at a new, intermediate level of 6.5.

In [None]:
lin_reg.predict([[6.5]])  # We are assuming the level of the employee is 6.5

In [None]:
lin_reg_2.predict(poly_reg2.transform([[6.5]]))  # reuse the fitted transformer

In [None]:
lin_reg_3.predict(poly_reg3.transform([[6.5]]))  # reuse the fitted transformer

The polynomial regression models clearly fit the data much better than the linear regression model, and the fit to the training data improves as we increase the degree of the polynomial. **Linear regression overshoots** by a large amount: it predicts that a level-6.5 employee earns much more than the actual salary of a level-7 employee. **Not a good model!!**

> At the end of this notebook, you are required to compute the errors of each of the models more accurately, looking at the metrics introduced in one of the earlier notebooks.
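As a reminder of how such metrics are computed, here is a tiny sketch using the two metrics named in the exercises below, MSE and R². The three-point `y_true` and `y_pred` lists are made up so that the arithmetic can be checked by hand; they are not salary data.

```python
from sklearn.metrics import mean_squared_error, r2_score

y_true = [1.0, 2.0, 3.0]  # made-up targets
y_pred = [1.0, 2.0, 4.0]  # made-up predictions, only the last one is off by 1

# MSE: mean of squared residuals = (0 + 0 + 1) / 3
mse = mean_squared_error(y_true, y_pred)

# R^2: 1 - SS_res / SS_tot = 1 - 1 / 2, with SS_tot measured around mean(y_true) = 2
r2 = r2_score(y_true, y_pred)

print(mse, r2)
```

A lower MSE and an R² closer to 1 both indicate a closer fit on the data they are evaluated on.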


## Overfitting and underfitting

**You should not be tempted into using high-degree polynomials though!!**

We should always aim for the lowest degree that fits the data adequately (linear regression when possible), which has two benefits:<br>
* it is much faster to compute (particularly on big datasets)
* it avoids over-fitting
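The over-fitting point can be sketched numerically. The example below is an illustration under stated assumptions, not part of the salary exercise: the data is synthetic (a straight line plus noise), and the names `x_train`, `x_new`, `train_mse` and `new_mse` are made up. It uses `make_pipeline` to chain the feature expansion and the regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic data (assumption): y = 2x plus Gaussian noise.
x_train = np.linspace(0, 1, 10).reshape(-1, 1)
y_train = 2 * x_train.ravel() + rng.normal(scale=0.2, size=10)
# Fresh points from the same process, never seen during fitting.
x_new = np.linspace(0.05, 0.95, 10).reshape(-1, 1)
y_new = 2 * x_new.ravel() + rng.normal(scale=0.2, size=10)

train_mse = {}
new_mse = {}
for degree in (1, 9):
    # Expand the features to the given degree, then fit a linear model on them.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(x_train))
    new_mse[degree] = mean_squared_error(y_new, model.predict(x_new))
    print(degree, train_mse[degree], new_mse[degree])
```

With 10 training points, the degree-9 polynomial has enough coefficients to pass through every one of them, so its training error is essentially zero. Comparing the two `new_mse` values shows whether that apparent accuracy carries over to unseen points; when it does not, the model has fitted the noise.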

Look at the following example to appreciate the problem at hand in a qualitative way.

![PA Work Flow](./figures/underfitting_and_overfitting.png)


## Exercises for this assignment
1. First, solve this exercise in the 'incorrect' way: don't split the dataset into training and testing sets. Tune the polynomial model: write a script (a set of functions) that, given the dataset, automatically generates all the polynomial models of degrees 1 to 10. For each model, compute the MSE and the R2 score. Plot the error functions (error vs degree_value). Determine the optimal degree_value.
2. Now, reuse the code you have developed to solve the same tuning exercise in the 'correct' way. Split the dataset into training (70% of points) and testing (30% of points), and repeat the tuning as for point 1. Determine the optimal degree_value.
3. Extend the tuning from point 2: vary the ratio of the training set from 20% up to 100%, finding the optimal degree_value for each setup.
4. Write your considerations about this exercise in a markdown cell. Compare and contrast the different results achieved for points 1 to 3.

In [None]:
# write your code here


In [None]:
# write your code here


In [None]:
# write your code here
