# Linear Regression

* R = Pearson correlation
* R-squared (R2)

### Pearson Correlation Coefficient 

This is a measure of the linear correlation between two variables $X$ and $Y$, providing information about the strength and direction of their linear relationship. The value of $r$ ranges from -1 to 1, where:
   - 1 indicates a perfect positive linear relationship,
   - -1 indicates a perfect negative linear relationship, and
   - 0 indicates no linear relationship.
   
$r$ is calculated as the covariance of the two variables divided by the product of their standard deviations. It gives the degree to which two variables are linearly related, but it carries no information about the slope of the line or its intercept.
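The definition above can be checked by hand. The following is a minimal sketch using a small made-up sample (not the iris data); it computes $r$ from the covariance and standard deviations, compares it with NumPy's built-in correlation, and shows that rescaling one variable (which changes the slope) leaves $r$ unchanged.

```python
import numpy as np

# Small made-up sample, for illustration only (not taken from the iris data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Sample covariance divided by the product of the sample standard deviations.
# ddof=1 is used consistently in both places.
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# NumPy's correlation matrix should give the same number.
r_numpy = np.corrcoef(x, y)[0, 1]

# Multiplying y by 10 changes the slope of the line but not the correlation.
r_scaled = np.corrcoef(x, 10 * y)[0, 1]

print(r_manual, r_numpy, r_scaled)
```

All three printed values should agree up to floating-point rounding.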


<img src="pearson1.png" width="30%">
<img src="pearson2.png" width="30%">

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas
from scipy.stats import pearsonr

iris = pandas.read_csv('../Datasets/iris.csv')
abalone = pandas.read_csv('../Datasets/abalone.csv')

pearsonr(iris['PetalWidth'], iris['PetalLength'])

In [None]:
iris.drop(columns=['Species']).corr().round(2)

---

### Coefficient of Determination $(R^2)$

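This measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s) in a regression model. It is a key output of regression analysis. For a least-squares linear model with an intercept, evaluated on its training data, $R^2$ ranges from 0 to 1 (it can be negative when a model predicts worse than the mean, for example on held-out data), where: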
   - 0 indicates that the model explains none of the variability of the response data around its mean, and
   - 1 indicates that the model explains all the variability of the response data around its mean.
   
$R^2$ provides information about the goodness of fit of a model. 

In simple linear regression, $R^2$ is the square of the Pearson correlation coefficient, indicating how much of the variance in the dependent variable is explained by the model. 


### Pearson coefficient vs $R^2$

- **Purpose**: The Pearson correlation coefficient measures the strength and direction of a linear relationship between two variables, while $R^2$ measures how well a regression model fits the data.
- **Interpretation**: The Pearson correlation coefficient reflects correlation without implying causation, and its square can be interpreted as the proportion of variance shared by the two variables. $R^2$, however, indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

Understanding the distinctions between these two measures is crucial for accurately interpreting the relationship between variables and the performance of regression models.

### Examples

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score


Here, we create a linear model and fit it so that it learns `y` from `X` (the `fit` call on line 4 of the next cell).

In [None]:
X = iris[['PetalLength']]
y = iris['PetalWidth']
model = LinearRegression()
model.fit(X,y)


The fitted model expresses `y` as a linear function of `X`:
```python
y = X*coef + intercept
```

In [None]:
model.coef_*3.7 + model.intercept_

In [None]:
some_data = [
    [3.5],
    [3.7],
]
model.predict(some_data)

In [None]:
print('R^2: ',model.score(X,y))

In [None]:
r, pvalue = pearsonr(X['PetalLength'],y)
print(r, r*r, pvalue)

### Understanding R-squared

R-squared can be viewed as how good the model is compared to a baseline model.


#### How $R^2$ is calculated

$R^2 = 1 - {\sum_i (y_i-f_i)^2 \over \sum_i (y_i - \mu_y)^2}$
$= {\sum_i (y_i - \mu_y)^2 - \sum_i (y_i-f_i)^2 \over \sum_i (y_i - \mu_y)^2}$


$f_i = f(X_i)$ is the predicted value of $y_i$, and $\mu_y$ is the mean of the $y_i$.
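As a sanity check, the formula can be evaluated directly and compared with scikit-learn. This is a minimal sketch on synthetic data (a noisy straight line generated here for illustration, not one of the notebook's datasets); the variable names `ss_res` and `ss_tot` are just labels chosen for the two sums.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data, for illustration only: a noisy straight line.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 1.0, size=50)

model = LinearRegression().fit(X, y)
f = model.predict(X)                    # f_i, the predicted values

ss_res = np.sum((y - f) ** 2)           # sum_i (y_i - f_i)^2
ss_tot = np.sum((y - y.mean()) ** 2)    # sum_i (y_i - mu_y)^2
r2_manual = 1 - ss_res / ss_tot

# The three values should agree up to floating-point rounding.
print(r2_manual, r2_score(y, f), model.score(X, y))
```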


#### Concept of a baseline model

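Here's a baseline model: always predict the average value of `y`. For this model $f_i = \mu_y$, so the two sums in the formula for $R^2$ are equal and $R^2 = 0$. A model scores above 0 only to the extent that it beats this baseline.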
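The baseline idea can be illustrated with a few made-up numbers. In this sketch, predicting the mean everywhere gives an $R^2$ of zero, while a deliberately poor constant prediction gives a negative value, meaning it does worse than the baseline.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up target values, for illustration only.
y = np.array([3.0, 5.0, 4.0, 8.0, 10.0])

# Baseline model: predict the mean of y for every sample.
baseline_pred = np.full_like(y, y.mean())
r2_baseline = r2_score(y, baseline_pred)

# A deliberately poor model: predict a constant far from the mean.
bad_pred = np.full_like(y, 20.0)
r2_bad = r2_score(y, bad_pred)

print(r2_baseline, r2_bad)
```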



### Multi-variable Regression

In [None]:
import pandas
from sklearn.linear_model import LinearRegression

iris = pandas.read_csv('../Datasets/iris.csv')
X = iris[['PetalLength','PetalWidth']]
y = iris['SepalLength']
model = LinearRegression()
model.fit(X,y)
print(model.coef_, model.intercept_)

In [None]:
model.predict([
    (3.7,2.5)
])
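With several features, the prediction is still a linear function: each feature is multiplied by its own coefficient and the intercept is added. The sketch below uses synthetic two-feature data (generated here for illustration, not the iris data) to show that this dot product matches `model.predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic two-feature data, for illustration only.
rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(40, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + 4.0 + rng.normal(0, 0.1, size=40)

model = LinearRegression().fit(X, y)

# Manual prediction: x1*coef1 + x2*coef2 + intercept.
x_new = np.array([3.7, 2.5])
manual = x_new @ model.coef_ + model.intercept_

# predict expects a 2-D array with one row per sample.
from_predict = model.predict(x_new.reshape(1, -1))[0]

print(manual, from_predict)
```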

### Exercises

In [None]:
abalone = pandas.read_csv('../Datasets/abalone.csv')
abalone.sample(3)

In [None]:
#PID:15
#
# Exercise: use a linear regression to model Rings using Length, and Whole weight 
#

import pandas
from sklearn.linear_model import LinearRegression

abalone = pandas.read_csv('../Datasets/abalone.csv')
X = abalone[['Length','Whole weight']]
y = abalone['Rings']
model = LinearRegression()
model.fit(X,y)

In [None]:
#PID:16
#
# Exercise: use a linear regression to model Rings using Sex, Length, and Whole weight 
#

import pandas
from sklearn.linear_model import LinearRegression

abalone = pandas.read_csv('../Datasets/abalone.csv')
X = abalone[['Sex','Length','Whole weight']]
X = pandas.get_dummies(X)
y = abalone['Rings']
model = LinearRegression()
model.fit(X,y)


In [None]:
abalone.iloc[2024][['Sex','Length','Whole weight']]

In [None]:
#PID:17
#
# Exercise: use a linear regression to model Rings using Sex, Length, and Whole weight 
# and predict the number of rings of a male abalone with length 0.6 and weight 0.75.
#

import pandas
from sklearn.linear_model import LinearRegression

abalone = pandas.read_csv('../Datasets/abalone.csv')
X = abalone[['Sex','Length','Whole weight']]
X = pandas.get_dummies(X)
y = abalone['Rings']
model = LinearRegression()
model.fit(X,y)

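# The row layout must match X.columns (check with print(X.columns)).
# Assuming Sex takes the values F, I, M, the order is:
# Length, Whole weight, Sex_F, Sex_I, Sex_M
data = [
    [0.6, 0.75, 0, 0, 1],
]
# X.head()
model.predict(data)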
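Writing the dummy columns by hand is easy to get wrong, because the row must follow the column order produced by `get_dummies`. One possible alternative, sketched here on a tiny made-up table that only borrows the exercise's column names (the values are not from the abalone data), is to encode the new sample the same way and align it to the training columns with `reindex`. Passing `dtype=float` to `get_dummies` is a choice made for this sketch so that all columns are numeric floats.

```python
import pandas
from sklearn.linear_model import LinearRegression

# Tiny made-up table with the exercise's column names, for illustration only.
train = pandas.DataFrame({
    'Sex': ['M', 'F', 'I', 'M', 'F', 'I'],
    'Length': [0.45, 0.53, 0.33, 0.60, 0.55, 0.40],
    'Whole weight': [0.51, 0.68, 0.20, 0.90, 0.77, 0.35],
    'Rings': [15, 9, 7, 12, 10, 8],
})
X = pandas.get_dummies(train[['Sex', 'Length', 'Whole weight']], dtype=float)
y = train['Rings']
model = LinearRegression().fit(X, y)

# Encode the new sample the same way, then align it to the training columns.
# Dummy columns missing from the single row (Sex_F, Sex_I) are filled with 0.
new = pandas.DataFrame({'Sex': ['M'], 'Length': [0.6], 'Whole weight': [0.75]})
new_X = pandas.get_dummies(new, dtype=float).reindex(columns=X.columns, fill_value=0.0)

print(list(X.columns))
print(new_X.iloc[0].tolist())
print(model.predict(new_X))
```

The second printed line shows the same row layout as the hand-written `data` list, but derived from the training columns rather than typed in.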

In [None]:
abalone.loc[2024]

In [None]:
#PID:18
#
# Exercise: use a linear regression to model Rings using Sex, Length, and Whole weight 
# and identify the most important feature in this modeling
#

import pandas
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

abalone = pandas.read_csv('../Datasets/abalone.csv')
X = abalone[['Sex','Length','Whole weight']]
X = pandas.get_dummies(X)
print(X.columns)
X = MinMaxScaler().fit_transform(X)
y = abalone['Rings']
model = LinearRegression()
model.fit(X,y)

print(model.coef_)
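"""
y  =  X @ model.coef_ + model.intercept_   (one coefficient per column of X)

After normalization every feature lies in [0, 1], so the coefficient
magnitudes are comparable. Length is the most impactful/important feature
because it has the largest coefficient (in absolute value).
"""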



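For example, if all features are on the same scale and the fitted model is the one below, `x2` has the largest coefficient and therefore the largest effect on `y`:

```python
y = 2*x1 + 10*x2 + 3*x3  + b
```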
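Comparing coefficients only makes sense when the features share a scale. The following sketch uses synthetic data (two made-up features with very different ranges) to show how the ranking of coefficients can flip once the features are rescaled with `MinMaxScaler`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic data, for illustration only: x1 spans about 1 unit,
# x2 spans about 1000 units.
rng = np.random.default_rng(2)
n = 200
x1 = rng.uniform(0, 1, n)
x2 = rng.uniform(0, 1000, n)
y = 5.0 * x1 + 0.02 * x2 + rng.normal(0, 0.1, n)
X = np.column_stack([x1, x2])

# Raw units: x1 has the larger coefficient.
raw = LinearRegression().fit(X, y)

# After min-max scaling both features lie in [0, 1], and x2 turns out
# to have the larger effect over its full range.
scaled = LinearRegression().fit(MinMaxScaler().fit_transform(X), y)

print(raw.coef_)
print(scaled.coef_)
```

On the raw data the coefficient of `x1` is larger only because `x2` is measured in much bigger units; after scaling, each coefficient reflects the change in `y` across that feature's full observed range.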
