# Lecture 2: Linear and Logistic Regression
### [Areeb Gani](https://github.com/Qwerty71), [Michael Ilie](https://www.mci.sh), [Vijay Shanmugam](https://www.vijayrs.ml)
This notebook introduces the basic concepts and applications of linear regression and logistic regression.

## Note: you will need to run the following code cell every time you restart this notebook


In [None]:
!pip install -r requirements.txt
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import seaborn as sns
from sklearn import datasets, linear_model

from IPython.display import display

iris = sns.load_dataset('iris')

## What is linear regression? 
The long, fancy Wikipedia answer to this question is:
```
In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables).
```
In plainer terms, linear regression is a method for approximating a linear relationship (think `y=mx+b`) between a variable (or set of variables) and a value. It can be as simple as relating the number of hours a student slept to their test score. Multiple variables can be used as well: for example, a linear regression model can relate a student's GPA, hours slept, and time studied to their test score.

The essence of the problem is that when you have many data points, say an entire class of students, the dots on the graph will almost certainly not line up perfectly. So if you want a model that predicts test scores as accurately as possible from those inputs, your job is to calculate the coefficients of the line that best fits the data.

## How exactly does it work?
As mentioned previously, your model will look something like `y=mx+b`, though it is a bit more complicated. Basic linear regression with one independent variable has the following equation:
```
y = B0 + B1*x + E
```
where B0 is the y-intercept, B1 is the slope (the coefficient of the x variable), and E is the error term. This can be generalized to the following:
```
y = B0 + B1*x1 + B2*x2 + ... + Bn*xn + E
```
where each Bi is the coefficient of the corresponding variable xi.
Now that we know what our model should look like, how exactly do we get those magical values?
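
To make the general equation concrete, here is a minimal sketch that evaluates it with NumPy. The coefficient values and the feature rows are made-up numbers chosen only for illustration, and the error term E is left out because it is unknown at prediction time.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
b0 = 2.0                        # intercept (B0)
b = np.array([0.5, -1.0, 3.0])  # B1, B2, B3

# Two made-up observations with three features each (x1, x2, x3)
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 0.0, 1.0],
])

# y = B0 + B1*x1 + B2*x2 + B3*x3, computed for every row at once
y_hat = b0 + X @ b
print(y_hat)
```

The matrix product `X @ b` does the multiply-and-sum for each row, so the same line works for any number of variables.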

## Training 
Machine learning algorithms usually work in the following manner:
1) You have a model, and data that you can plug into that model.
2) You have a function called the cost function, which measures how far off your model is from the real data provided (also known as your error, or loss).
3) You use an algorithm to minimize the cost function (find its lowest point), which corresponds to the least error. This in turn gives you the optimal coefficients for your model.
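
The three steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data is synthetic, the cost function is assumed to be the mean squared error, and the minimizing algorithm is assumed to be plain gradient descent with a hand-picked learning rate. Libraries used elsewhere in this notebook may solve the problem differently.

```python
import numpy as np

# Synthetic data generated from y = 3 + 2x plus small noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 3.0 + 2.0 * x + rng.normal(0, 0.1, size=100)

def cost(b0, b1):
    """Step 2: mean squared error between the model's predictions and the data."""
    return np.mean((b0 + b1 * x - y) ** 2)

# Step 1: start with a model (both coefficients at zero)
b0, b1 = 0.0, 0.0
learning_rate = 0.5  # assumed step size, hand-picked for this data
initial_cost = cost(b0, b1)

# Step 3: repeatedly nudge the coefficients downhill on the cost function
for _ in range(2000):
    error = b0 + b1 * x - y
    grad_b0 = 2 * np.mean(error)      # derivative of the cost with respect to b0
    grad_b1 = 2 * np.mean(error * x)  # derivative of the cost with respect to b1
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

final_cost = cost(b0, b1)
print(f"b0={b0:.2f}, b1={b1:.2f}, cost {initial_cost:.3f} -> {final_cost:.4f}")
```

The learned `b0` and `b1` should land close to the values 3 and 2 used to generate the data, and the cost should drop sharply from its starting value.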

### Cost Function

![Picture title](image-20211005-095748.png)
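
The contents of the figure are not reproduced in text here. A common cost function for linear regression is the mean squared error (MSE), so the sketch below assumes that choice; the function name and the sample values are made up for illustration.

```python
import numpy as np

def mean_squared_error_cost(y_true, y_pred):
    """Average of the squared differences between real and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Made-up values for illustration
y_true = [3.0, 5.0, 7.0]
good_fit = [3.0, 5.0, 8.0]   # one prediction is off by 1
poor_fit = [0.0, 5.0, 10.0]  # two predictions are off by 3 each

good_cost = mean_squared_error_cost(y_true, good_fit)
poor_cost = mean_squared_error_cost(y_true, poor_fit)
print(good_cost, poor_cost)
```

Squaring the differences keeps positive and negative errors from cancelling out and penalizes large misses more heavily, which is why the poorer fit gets a much larger cost.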






### Follow along with this simple tutorial
Get familiar with your first linear regression example.


Use `sklearn.datasets` to load and arrange your dataset (this one is about diabetes).

In [None]:
# Metrics used at the end of this cell to evaluate the predictions
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Use only one feature (column 2) instead of all the features in this dataset; we'll get to that later
diabetes_X = diabetes_X[:, np.newaxis, 2]

# Split the data into training/testing sets. We will use some data to test the accuracy of our linear regression model
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets (values we will predict, or learn to predict)
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(diabetes_y_test, diabetes_y_pred))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()
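
The cell above fits a single feature. As a hedged sketch of the multi-variable form of the equation, the block below fits the same kind of model on all ten features of the diabetes dataset and compares it with the single-feature fit. It reuses the same last-20-rows test split; the variable names are new and chosen only for this illustration.

```python
from sklearn import datasets, linear_model
from sklearn.metrics import r2_score

# Load all features and reuse the same split: the last 20 rows are held out
X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = y[:-20], y[-20:]

# Model with one feature (column 2)
one = linear_model.LinearRegression().fit(X_train[:, [2]], y_train)

# Model with every feature: one coefficient is learned per column
full = linear_model.LinearRegression().fit(X_train, y_train)

r2_one = r2_score(y_test, one.predict(X_test[:, [2]]))
r2_all = r2_score(y_test, full.predict(X_test))

print("Intercept (B0):", full.intercept_)
print("Coefficients (B1..B10):", full.coef_)
print(f"R^2 with one feature: {r2_one:.2f}")
print(f"R^2 with all features: {r2_all:.2f}")
```

With only 20 test rows the scores are rough estimates, so treat the comparison as a quick sanity check and not a rigorous evaluation.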
