# Linear Regression Notebook

**Category:** Supervised Learning <br>
**Type:** Classification / Regression (Note: some regression algorithms can also be used for classification)

**What is the mathematical representation of a linear function?** <br>
Answer: `f(x) = a*x + b`, where `f(x)` is sometimes denoted as `y`. <br>
Also, this is a function that has only one feature, `x`. A linear function with two features would be `y = a*x1 + b*x2 + c`, one with three would be ... you get it... :)
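
To make the formula concrete, here is a minimal sketch of a linear function with any number of features. The name `linear_function` and the example numbers are made up for illustration; they are not part of the notebook's dataset.

```python
import numpy as np

def linear_function(x, weights, bias):
    """Evaluate y = w1*x1 + w2*x2 + ... + bias for a single sample."""
    return float(np.dot(weights, x) + bias)

# Two features: y = a*x1 + b*x2 + c with a=2, b=3, c=1 (illustrative values)
y = linear_function(x=[1.0, 2.0], weights=[2.0, 3.0], bias=1.0)
print(y)  # 2*1 + 3*2 + 1 = 9.0
```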

**What are we trying to achieve?** <br>
Answer: Find the linear function that relates the label `y` to the feature(s) `x(i)`. In other words, we are looking for the values of `a`, `b`, `c`, etc. You didn't expect this, did you? ;) ...**BUUUT** not just any values for these, the *right* ones.

**What are the right values for the parameters?** <br>
Training a model using Linear Regression means finding the linear function that is representative of your training data. So, the line needs to be as *close* as possible to all the different data points. The closest you could get to a point is right through it, but crossing all of them would mean that the model simply "memorised" things; it did not "learn", and it would not be able to predict new values. A common way to measure this *closeness* is the [Root Mean Square Error](https://www.statisticshowto.datasciencecentral.com/rmse/) (RMSE): the right parameters are the ones that give you the smallest RMSE.
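
As a rough illustration of what "closeness" means, the sketch below computes the RMSE by hand for two candidate lines on a tiny made-up dataset. The helper `rmse` and the data points are assumptions for this example only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up data that lies exactly on the line y = 2*x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y_true = np.array([1.0, 3.0, 5.0, 7.0])

good_line = 2.0 * x + 1.0   # candidate parameters a=2, b=1
bad_line = 1.0 * x + 0.0    # candidate parameters a=1, b=0

print(rmse(y_true, good_line))  # 0.0, the line passes through every point
print(rmse(y_true, bad_line))   # larger, so these parameters are a worse fit
```

The smaller the RMSE, the closer the line is to the data, which is why the first set of parameters would be preferred here.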

The good part is that you don't have to do this yourself. This is what scikit-learn does for you. This notebook shows you how to do just that.

In [14]:
import pandas as pd

In [15]:
df = pd.read_csv("train.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
0,2,1,1,38.0,1,0,71.2833
1,4,1,1,35.0,1,0,53.1
2,7,0,1,54.0,0,0,51.8625
3,11,1,3,4.0,1,1,16.7
4,12,1,1,58.0,0,0,26.55


X will be the matrix of all the features, so we need to drop the label column.

In [16]:
X = df.drop(columns="Survived")
X.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
0,2,1,38.0,1,0,71.2833
1,4,1,35.0,1,0,53.1
2,7,1,54.0,0,0,51.8625
3,11,3,4.0,1,1,16.7
4,12,1,58.0,0,0,26.55


y is the vector containing the label

In [17]:
y = df["Survived"]
y.head()

0    1
1    1
2    0
3    1
4    1
Name: Survived, dtype: int64

The training dataset is split into 80% to use for actual training and 20% to use for validating the performance of the model:
* X_train - matrix of features to use for training
* y_train - vector of labels to use for training
* X_val   - matrix of features to use for validating the trained model
* y_val   - vector of labels to use for validating the trained model

In [33]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
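
Because `train_test_split` shuffles the rows, each run of the cell above can produce a different split and therefore slightly different error values. A minimal sketch of a reproducible split, using the `random_state` parameter on made-up data (the `*_demo` arrays and the seed `42` are placeholders, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Fixing random_state makes the shuffle, and so the split, repeatable
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_va.shape)  # 8 rows for training, 2 for validation
```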

We are using the LinearRegression model. The fit method trains the model.

In [34]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train,y_train) 

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
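
After fitting, the learned parameters (the `a`, `b`, `c` from the beginning of the notebook) are available as the `coef_` and `intercept_` attributes of the model. The sketch below shows this on synthetic data generated from a known linear function, so the recovered values can be checked; the data and the name `demo_reg` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 2*x1 + 3*x2 + 1 (no noise)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1] + 1.0

demo_reg = LinearRegression().fit(X_demo, y_demo)

print(demo_reg.coef_)       # one coefficient per feature, close to [2, 3]
print(demo_reg.intercept_)  # the constant term, close to 1
```

On the notebook's model, the same attributes would be read from `lin_reg` after the `fit` call.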

### Mean Squared Error (MSE)
Now that the model is trained, we can test it on the validation dataset and then use it to make predictions on new datasets. Before that, we can also look at some performance indicators. Let's see the Mean Squared Error on the dataset used for training.

In [36]:
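from sklearn.metrics import mean_squared_error

# Predict on the training features and compare with the known labels
y_train_predict = lin_reg.predict(X_train)
train_errors = mean_squared_error(y_train, y_train_predict)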
train_errors

0.18834397435800695

Next we will see how the model performs on the validation dataset

In [41]:
y_val_predict = lin_reg.predict(X_val)
val_errors = mean_squared_error(y_val, y_val_predict)
val_errors

0.22694557640559312
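
The introduction talks about the Root Mean Square Error, while the cells above report the MSE. The two are related by a square root, so a small sketch of the conversion, using the two values printed above, could look like this:

```python
import math

# MSE values copied from the outputs above
train_mse = 0.18834397435800695
val_mse = 0.22694557640559312

# RMSE is the square root of MSE and is in the same units as the label
train_rmse = math.sqrt(train_mse)
val_rmse = math.sqrt(val_mse)

print(round(train_rmse, 2), round(val_rmse, 2))  # roughly 0.43 and 0.48
```

The validation error is somewhat higher than the training error, which is expected since the model has not seen the validation rows during training.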

### Predict on new data
After training the model, we can use it on new data. This data does not contain the label column; that is what we want the model to predict.


In [38]:
tstdf = pd.read_csv("predict.csv")
tstdf.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
0,904,1,23.0,1,0,82.2667
1,906,1,47.0,1,0,61.175
2,916,1,48.0,1,3,262.375
3,918,1,22.0,0,1,61.9792
4,920,1,41.0,0,0,30.5


In [39]:
X_new = tstdf
lin_reg.predict(X_new)

array([0.99280942, 0.78794534, 0.65394523, 0.89657244, 0.79916342,
       0.93169223, 0.80953556, 0.77132196, 0.66835755, 0.98648303,
       0.95598785, 0.76186815, 0.94563748, 1.05257788, 0.89201046,
       0.4982435 , 0.91362244, 0.93706949, 0.95784973, 0.75537841,
       0.71045067, 0.8292712 , 0.58165386, 0.84352507, 0.91629023,
       0.86522689, 0.75289415, 0.93837681, 0.88509864, 0.91648706,
       0.73015439, 0.58196545, 0.93349622, 1.01452633, 0.82739471,
       0.79052508, 0.77592891, 0.59149663, 0.53519526, 0.85714041,
       1.07143803, 1.00831592, 1.03683096, 0.9121725 , 0.91627808,
       0.85071895, 0.81480957, 0.91965884, 0.92213253, 0.71949359,
       0.86445512, 0.83079332, 0.90088799, 1.05297852, 0.8476602 ,
       1.06703022, 1.0651446 , 0.75827813, 0.64864855, 0.90874958,
       0.75112176, 0.81190318, 0.82992204, 0.8390021 , 0.91720225,
       1.03170651, 0.90395025, 0.97733769, 0.88165347, 0.8035409 ,
       0.81929516, 0.81501989, 1.06834335, 1.02480275, 0.82165
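
The predictions above are continuous numbers (some even above 1), while `Survived` is either 0 or 1. One possible way to turn them into class labels is to apply a cut-off. The threshold of `0.5` below is an assumption for illustration, not something the notebook prescribes, and the sample values are a few numbers copied from the output above.

```python
import numpy as np

# A few raw predictions copied from the array above
raw_predictions = np.array([0.99280942, 0.78794534, 0.4982435, 1.05257788])

# Assumed rule: predict "survived" (1) when the value is at least 0.5
predicted_labels = (raw_predictions >= 0.5).astype(int)
print(predicted_labels)  # [1 1 0 1]
```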

# Done!