# Simple linear regression

## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

#import regression (machine learning) module
from sklearn.linear_model import LinearRegression

## Load the data

In [2]:
#load
data = pd.read_csv('1.01. Simple linear regression.csv')

data.head()

Unnamed: 0,SAT,GPA
0,1714,2.4
1,1664,2.52
2,1760,2.54
3,1685,2.74
4,1693,2.83


## Create the regression

**The machine learning algorithm will find the optimal coefficients of a linear regression model that, given an SAT score, best predict a student's GPA.**

### Declare the dependent and independent variables

In [3]:
#independent variable: 'SAT'
x = data['SAT']

#dependent variable: 'GPA'
y = data['GPA']

In [4]:
#useful to check the shapes of the features
x.shape

(84,)

In [5]:
y.shape

(84,)

**Inputs and targets are both vectors of length 84.**

Sklearn **requires 2D arrays for its INPUTS, not 1D arrays.** Passing a 1D array of inputs to fit raises an error.

**A MATRIX IS A 2D OBJECT, ALSO KNOWN AS A 2D ARRAY!**
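As a minimal sketch of why the 2D shape matters, the snippet below uses hypothetical toy data (not the SAT/GPA file) and assumes a reasonably recent scikit-learn. The names `x_1d`, `x_2d`, `y_toy` and `raised` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data (not the course file): y = 2x + 1
x_1d = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = 2 * x_1d + 1

# A 1D input is rejected by fit with a ValueError
try:
    LinearRegression().fit(x_1d, y_toy)
    raised = False
except ValueError:
    raised = True

# Reshaping to a single column makes the same data acceptable
x_2d = x_1d.reshape(-1, 1)
model = LinearRegression().fit(x_2d, y_toy)

print(raised, x_1d.shape, x_2d.shape)
```

The targets can stay 1D; only the inputs need the extra dimension.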

In [6]:
# To feed x to sklearn, it must be a 2D array (a matrix), so we reshape it.
# Note that this is not needed when we have more than one feature (the inputs will already be a 2D array).

# Equivalent with a hard-coded row count: x_matrix = x.values.reshape(84,1)
# NOTE: reshaping changes only the dimensionality, not the values.
# We use (-1,1) rather than (84,1) because -1 tells NumPy to infer the number of rows from the data.
x_matrix = x.values.reshape(-1,1)

# Check the shape just in case
x_matrix.shape

(84, 1)

**NOTE:** We had to reshape the feature into a 2D object because it was 1D. This is only necessary for a **SIMPLE LINEAR REGRESSION**, where there is a single feature.
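For comparison, here is a small sketch of two ways to get a single feature into 2D form. The `toy` frame is a hypothetical stand-in built from the first three rows shown by `data.head()`, and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the course data (first three rows of data.head())
toy = pd.DataFrame({'SAT': [1714, 1664, 1760], 'GPA': [2.40, 2.52, 2.54]})

# Single brackets return a 1D Series
series_1d = toy['SAT']

# reshape(-1, 1): one column, number of rows inferred by NumPy
matrix_reshaped = toy['SAT'].values.reshape(-1, 1)

# Double brackets return a DataFrame with one column, which is already 2D
frame_2d = toy[['SAT']]

print(series_1d.shape, matrix_reshaped.shape, frame_2d.shape)
```

Either 2D form should be accepted by fit. One difference to be aware of: a model fitted on a DataFrame remembers the column name, so recent scikit-learn versions may warn when predict is later given a plain list or array.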

**Sklearn takes full advantage of the object-oriented capabilities of Python and works a lot with very well written classes. More often than not, we would need to create an object or an instance of the given class**

So in this case, we're letting **reg be equal to LinearRegression()**. After executing that line, reg will be an instance of the linear regression class. Afterwards, all you need to do is fit the regression. 

The order of the arguments to fit is **very important**: it must be (inputs, targets).

### Regression itself
Full documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [7]:
# We start by creating a linear regression object
reg = LinearRegression()

In [8]:
# The whole learning process boils down to fitting the regression
# Note that the first argument is the independent variable and the second is the dependent variable (unlike with StatsModels)
reg.fit(x_matrix,y)

LinearRegression()

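On Udemy, **reg.fit(x_matrix,y)** produces **LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)**, which is different from what I get. The most likely reason is a difference in scikit-learn versions: recent versions print only the parameters that differ from their defaults, so a model created with default settings displays as **LinearRegression()**. The **normalize** parameter has also been removed from LinearRegression in newer releases.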
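One way to see every parameter of the estimator, whatever the printed representation shows, is the `get_params()` method. A minimal sketch follows, with `reg_demo` as an illustrative name kept separate from `reg`. The exact set of parameters depends on the installed scikit-learn version, so no output is shown.

```python
from sklearn.linear_model import LinearRegression

# Illustrative estimator with default settings
reg_demo = LinearRegression()

# get_params() returns a dict of all constructor parameters and their values
params = reg_demo.get_params()
for name, value in sorted(params.items()):
    print(name, '=', value)
```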

**copy_X=True** --> when it's set to True, sklearn copies the inputs before fitting. This is a safety net against normalization and other transformations that sklearn may apply during fitting, so the original data is not overwritten. If you remember, in our StatsModels lectures we would create copies of data frames every now and then. Sklearn takes care of that automatically.

**fit_intercept=True** --> in StatsModels, we had to add a constant manually. The fit_intercept parameter takes care of precisely that. If you don't want an intercept, you can just set it to False.
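To illustrate the effect of fit_intercept, here is a minimal sketch on hypothetical toy data that lies exactly on the line y = 2x + 1. The names `with_const` and `no_const` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data lying exactly on y = 2x + 1
x_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

# Default behaviour: an intercept is estimated alongside the slope
with_const = LinearRegression(fit_intercept=True).fit(x_toy, y_toy)

# No intercept: the fitted line is forced through the origin
no_const = LinearRegression(fit_intercept=False).fit(x_toy, y_toy)

print(with_const.coef_, with_const.intercept_)  # slope close to 2, intercept close to 1
print(no_const.coef_, no_const.intercept_)      # different slope, intercept fixed at 0.0
```

With the intercept switched off, the slope has to compensate for the missing constant, so it no longer matches the line that generated the data.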

**n_jobs=1** --> a parameter used when we want to parallelize routines. By default, only one CPU is used. For this simple example, we won't see a difference no matter how many jobs we set. However, if you work on problems with lots and lots of data and have more than one CPU available, you can take advantage of this parameter by setting it to 2, 3, 5, and so on.

In [12]:
reg.predict([[1750]])

array([3.17249439])
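A prediction from a simple linear regression is just the intercept plus the coefficient times the input. The sketch below checks this on hypothetical toy numbers (not the course data, so the values differ from the output above); `toy_reg`, `new_scores`, `by_predict` and `by_hand` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, made up for illustration
x_toy = np.array([[1600.0], [1700.0], [1800.0], [1900.0]])
y_toy = np.array([2.8, 3.1, 3.3, 3.6])
toy_reg = LinearRegression().fit(x_toy, y_toy)

# predict expects a 2D input, one row per observation
new_scores = np.array([[1750.0], [1850.0]])
by_predict = toy_reg.predict(new_scores)

# The same numbers computed by hand: intercept + coefficient * input
by_hand = toy_reg.intercept_ + toy_reg.coef_[0] * new_scores[:, 0]

print(np.allclose(by_predict, by_hand))
```

This is also why the input to predict is written with double brackets, as in `[[1750]]`: it is a 2D structure with one row and one column.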