# Linear Regression Notebook

**Category:** Supervised Learning <br>
**Type:** Classification / Regression (Note: some regression algorithms can also be used for classification)

**What is the mathematical representation of a linear function?** <br>
Answer: `f(x) = a*x + b`, where `f(x)` is sometimes denoted as `y`. <br>
Also, this function has only one feature, `x`. A linear function with two features would be `y = a*x1 + b*x2 + c`, one with three would be ... you get it... :)
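To make the pattern concrete, here is a minimal sketch of a linear function with any number of features. The helper name `linear_function` and the example values for `a`, `b` and `c` are made up for illustration; they are not part of this notebook's data.

```python
def linear_function(features, weights, intercept):
    """Return weights[0]*features[0] + weights[1]*features[1] + ... + intercept."""
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature")
    return sum(w * x for w, x in zip(weights, features)) + intercept


# One feature: f(x) = a*x + b, with the made-up values a=2, b=1
y_one = linear_function([3.0], [2.0], 1.0)

# Two features: y = a*x1 + b*x2 + c, with the made-up values a=2, b=-1, c=0.5
y_two = linear_function([3.0, 4.0], [2.0, -1.0], 0.5)

print(y_one, y_two)
```

Adding a feature only adds one more weight; the shape of the formula stays the same.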

**What are we trying to achieve?** <br>
Answer: Find the linear function that has the label as `y` and the feature(s) as `x(i)`. In other words, we are looking for the values of `a`, `b`, `c`, etc. You didn't expect this, did you? ;) ...**BUUUT** not just any values for these, the *right* ones.

**What are the right values for the parameters?** <br>
Training a model using Linear Regression means finding the linear function that best represents your training data. So, the line needs to be as *close* as possible to all the data points. The closest you can get to a point is right through it, but a model that passes through every training point has simply "memorised" the data rather than "learned" from it, and it would predict new values poorly. A common way to measure this *closeness* is the [Root Mean Square Error](https://www.statisticshowto.datasciencecentral.com/rmse/) (RMSE): the right parameters are the ones that give you the smallest RMSE.
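As a rough illustration of how the Root Mean Square Error compares two candidate lines, here is a small sketch in plain Python. The four data points and both candidate parameter pairs are invented for the example.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean of the squared differences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented data that lies roughly on the line y = 2*x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]


def predict(a, b):
    """Predictions of the line f(x) = a*x + b for every x in xs."""
    return [a * x + b for x in xs]


rmse_good = rmse(ys, predict(2.0, 1.0))  # candidate close to the data
rmse_bad = rmse(ys, predict(1.0, 0.0))   # candidate far from the data

print(rmse_good, rmse_bad)
```

The candidate with the smaller RMSE is the one whose line sits closer to the points overall.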

The good part is that you don't have to do this yourself: scikit-learn does it for you. This notebook shows you how.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("train.csv")
df.head()

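| Unnamed: 0 | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 1 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1 |
| 2 | 7 | 0 | 1 | 54.0 | 0 | 0 | 51.8625 |
| 3 | 11 | 1 | 3 | 4.0 | 1 | 1 | 16.7 |
| 4 | 12 | 1 | 1 | 58.0 | 0 | 0 | 26.55 |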


In [3]:
# X is the feature matrix, so we drop the label column ("Survived") from it
X = df.drop(columns="Survived")
X.head()

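| Unnamed: 0 | PassengerId | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 1 | 4 | 1 | 35.0 | 1 | 0 | 53.1 |
| 2 | 7 | 1 | 54.0 | 0 | 0 | 51.8625 |
| 3 | 11 | 3 | 4.0 | 1 | 1 | 16.7 |
| 4 | 12 | 1 | 58.0 | 0 | 0 | 26.55 |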


In [4]:
# y is the vector containing the label
y = df["Survived"]
y.head()

0    1
1    1
2    0
3    1
4    1
Name: Survived, dtype: int64

In [5]:
from sklearn.linear_model import LinearRegression

In [6]:
lin_reg = LinearRegression()
# the fit method trains the model
lin_reg.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
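Once a `LinearRegression` model is fitted, the values of `a`, `b`, `c`, etc. that it found can be read from its `coef_` attribute (one weight per feature) and its `intercept_` attribute (the constant term). The sketch below uses synthetic, noise-free data generated from known parameters instead of `train.csv`, so the names `X_demo`, `y_demo` and `demo_reg` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data built from the known function y = 2*x1 - 1*x2 + 0.5
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.5

demo_reg = LinearRegression()
demo_reg.fit(X_demo, y_demo)

# One weight per feature, in the same order as the columns of X_demo
print(demo_reg.coef_)
# The constant term of the linear function
print(demo_reg.intercept_)
```

Because the synthetic data contains no noise, the fitted weights and intercept should match the parameters used to generate it, up to floating-point error.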

In [7]:
# after training the model we need to test it, so we need some test data
# the test data does not contain the label column, that is what we want the model to predict
tstdf = pd.read_csv("test.csv")
tstdf.head()

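| Unnamed: 0 | PassengerId | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 0 | 904 | 1 | 23.0 | 1 | 0 | 82.2667 |
| 1 | 906 | 1 | 47.0 | 1 | 0 | 61.175 |
| 2 | 916 | 1 | 48.0 | 1 | 3 | 262.375 |
| 3 | 918 | 1 | 22.0 | 0 | 1 | 61.9792 |
| 4 | 920 | 1 | 41.0 | 0 | 0 | 30.5 |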
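Since this test file has no label column, predictions made on it cannot be scored directly. One common option, sketched here as an assumption rather than as part of this notebook's workflow, is to hold back a slice of the labelled data with `train_test_split` and compute the RMSE on that slice. The data below is synthetic and all variable names are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic labelled data: y = 3*x + 2 plus some random noise
rng = np.random.default_rng(42)
X_all = rng.uniform(0, 10, size=(200, 1))
y_all = 3.0 * X_all[:, 0] + 2.0 + rng.normal(0, 0.5, size=200)

# Keep 20% of the labelled rows aside; the model never sees them during fit
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0
)

holdout_reg = LinearRegression().fit(X_train, y_train)

# RMSE on the held-out rows estimates the error on unseen data
holdout_rmse = np.sqrt(mean_squared_error(y_holdout, holdout_reg.predict(X_holdout)))
print(holdout_rmse)
```

The held-out RMSE is usually a more honest estimate than the RMSE on the rows the model was trained on.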


In [8]:
X_new = tstdf
lin_reg.predict(X_new)

array([0.97630805, 0.75664821, 0.69260885, 0.86750492, 0.74657212,
       0.90656288, 0.77685894, 0.71670008, 0.61784079, 0.9642168 ,
       1.02802784, 0.7590797 , 0.94666662, 1.11384248, 0.84278484,
       0.54008152, 0.8652139 , 0.92700149, 0.94882018, 0.72900463,
       0.70601366, 0.82055678, 0.54135259, 0.81178871, 0.89267628,
       0.81418001, 0.74981382, 0.97241137, 0.84372992, 0.88840713,
       0.67108507, 0.61403319, 0.90753931, 1.00847531, 0.77223932,
       0.73812845, 0.7388332 , 0.59119493, 0.50026983, 0.84183075,
       1.04886378, 1.03390548, 1.0375327 , 0.91623029, 0.86467234,
       0.79857512, 0.82237527, 0.89292176, 0.89431306, 0.68205977,
       0.84066809, 0.82301251, 0.86778259, 1.04465406, 0.80062333,
       1.05893086, 1.04555437, 0.7348825 , 0.60816291, 0.91894981,
       0.72934396, 0.7741636 , 0.81057577, 0.83355679, 0.88837613,
       1.05666066, 0.84907727, 0.92537226, 0.93507043, 0.76199335,
       0.75875035, 0.79002241, 1.04159087, 0.9963188 , 0.75555
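The `Survived` label only takes the values 0 and 1, but the predictions above are continuous and a few are larger than 1. If class labels are needed, one simple option is to apply a cut-off. The sketch below assumes a threshold of 0.5, which is a choice made for this example and not something the notebook prescribes. It uses a handful of values copied from the output above, plus one made-up value to show the other class.

```python
# A few predictions copied from the output above
predictions = [0.97630805, 0.75664821, 1.02802784, 1.11384248, 0.50026983]
# One made-up value (not from the output) so both classes appear
predictions.append(0.31)

THRESHOLD = 0.5  # assumption: treat predictions at or above 0.5 as "survived"

# Turn each continuous prediction into a 0/1 label
labels = [1 if p >= THRESHOLD else 0 for p in predictions]

# Optionally clip the raw predictions into the [0, 1] range
clipped = [min(max(p, 0.0), 1.0) for p in predictions]

print(labels)
print(clipped)
```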

# Done!