# Linear regression exercises

In this notebook we use scikit-learn (`sklearn`) to fit and evaluate linear regression models.

**1. We will build a linear regression model for the Airfoil Self-Noise dataset from the UCI Machine Learning Repository. The dataset is available here: https://archive.ics.uci.edu/dataset/291/airfoil+self+noise. The following cell loads it into a pandas DataFrame with descriptive column names.**

**Compute summary statistics for the dataset using the DataFrame `.describe()` method.**

In [None]:
import pandas as pd
airfoil = pd.read_csv('airfoil_self_noise.dat', sep='\t', names=['Frequency', 'Angle', 'Chord length', 'Velocity', 
                                                         'Displacement thickness', 'Sound pressure level'])
airfoil.head()

In [None]:
airfoil.describe()

**2. We will predict the scaled sound pressure level from the other 5 feature variables in the dataset. Create an 80/20 train/test split. Train a linear regression model on the training set and calculate the root mean squared error (RMSE) on both the training and test sets.**

In [None]:
from sklearn.model_selection import train_test_split

X = airfoil[['Frequency', 'Angle', 'Chord length', 'Velocity', 'Displacement thickness']]
y = airfoil[['Sound pressure level']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
from sklearn.linear_model import LinearRegression

airfoil_model = LinearRegression()
airfoil_model.fit(X_train, y_train)
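
To make concrete what `fit` estimates, here is a minimal sketch on made-up data rather than the airfoil set. The arrays `X_toy` and `y_toy` and the coefficients 3, 2 and -1 are assumptions chosen for illustration; because the targets are generated without noise, the fitted model should recover them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data: y = 3 + 2*x1 - 1*x2 (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1.0, 1.0, size=(50, 2))
y_toy = 3.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# The learned parameters live in intercept_ and coef_
print("intercept:", toy_model.intercept_)
print("coefficients:", toy_model.coef_)
```

The same two attributes are available on `airfoil_model` after fitting. Since `y` is passed there as a single-column DataFrame, `coef_` should come back as a 2-D array with one row rather than a flat vector.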

In [None]:
y_train_pred = airfoil_model.predict(X_train)
y_test_pred = airfoil_model.predict(X_test)

In [None]:
from sklearn import metrics
import numpy as np

print("Train RMSE: ", np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))
print("Test RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))
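
The RMSE used above is the square root of the mean of the squared prediction errors. As a small sketch with made-up numbers (`y_true_toy` and `y_pred_toy` are hypothetical, not taken from the dataset), the quantity can be written out by hand and compared with the `sklearn` version:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, for illustration only
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.0, 5.0, 8.0, 11.0])

# RMSE written out: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true_toy - y_pred_toy) ** 2))

# The same quantity via scikit-learn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))

print(rmse_manual, rmse_sklearn)
```

Because the RMSE is expressed in the units of the target, it can be read directly against the spread of the sound pressure level reported by `.describe()`.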

**3. Try to improve the performance of your model by adding quadratic basis functions. Evaluate the RMSE performance on the training and test sets.**

In [None]:
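# Optional NumPy copies of the feature matrices. The pipeline below is fitted
# on the DataFrames directly, so these arrays are not used in later cells.
X_train_arr = X_train.to_numpy()
X_test_arr = X_test.to_numpy()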

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

airfoil_model2 = make_pipeline(PolynomialFeatures(2), LinearRegression())
airfoil_model2.fit(X_train, y_train)

In [None]:
y_train_pred = airfoil_model2.predict(X_train)
y_test_pred = airfoil_model2.predict(X_test)

In [None]:
print("Train RMSE: ", np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))
print("Test RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))

**4. Finally, add cubic basis functions as well and evaluate the RMSE on the training and test sets.**

In [None]:
airfoil_model3 = make_pipeline(PolynomialFeatures(3), LinearRegression())
airfoil_model3.fit(X_train, y_train)

In [None]:
y_train_pred = airfoil_model3.predict(X_train)
y_test_pred = airfoil_model3.predict(X_test)

In [None]:
print("Train RMSE: ", np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))
print("Test RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))
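
One way to compare the three models side by side is to loop over the polynomial degree. The sketch below uses synthetic stand-in data (not the airfoil set) with features on very different scales, and it adds a `StandardScaler` step before the expansion. The scaler is an assumption and not part of the exercise. It is included because cubing a large-valued feature can produce very large numbers, which may hurt numerical conditioning.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: three features on very different scales
rng = np.random.default_rng(0)
U = rng.uniform(-1.0, 1.0, size=(300, 3))
X_demo = U * np.array([1e4, 1.0, 1e-2])
y_demo = (1.0 + 2.0 * U[:, 0] - U[:, 1] ** 2 + 0.5 * U[:, 0] * U[:, 2]
          + rng.normal(0.0, 0.1, size=300))

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

rmse_by_degree = {}
for degree in (1, 2, 3):
    # Standardize first so the polynomial terms stay on comparable scales
    model = make_pipeline(StandardScaler(), PolynomialFeatures(degree), LinearRegression())
    model.fit(Xd_train, yd_train)
    train_rmse = np.sqrt(mean_squared_error(yd_train, model.predict(Xd_train)))
    test_rmse = np.sqrt(mean_squared_error(yd_test, model.predict(Xd_test)))
    rmse_by_degree[degree] = (train_rmse, test_rmse)
    print(f"degree {degree}: train RMSE {train_rmse:.3f}, test RMSE {test_rmse:.3f}")
```

The training RMSE cannot increase as the degree grows, because each larger feature set contains the smaller one. The test RMSE carries no such guarantee, so a widening gap between the two is a sign of overfitting.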