### Introduction to Linear Regression
- **Objective**: Understand the basic concept of linear regression as a statistical method for modeling the relationship between a dependent variable and one or more independent variables.
- **Key Concepts**:
  - **Dependent Variable (DV)**: The outcome you're trying to predict or explain.
  - **Independent Variable (IV)**: Factors that might influence the DV.
  - **Linear Relationship**: The assumption that the DV changes by a constant amount for each unit change in the IV, so the relationship can be drawn as a straight line.

### Real-World Applications of Linear Regression

Linear regression is used in many real-world settings, for example:

- **Business**: Predicting sales based on advertising spend.
- **Healthcare**: Estimating patient recovery time based on treatment dosage.
- **Economics**: Forecasting economic growth from factors like employment rate and industrial production.
- **Environmental Science**: Modeling the relationship between pollution levels and public health.

### The Mathematics of Linear Regression
- **Simple Linear Regression Formula**: \( Y = a + bX + \epsilon \)
  - \( Y \): Dependent variable.
  - \( X \): Independent variable.
  - \( a \): Y-intercept (value of Y when X = 0).
  - \( b \): Slope of the line (change in Y for a unit change in X).
  - \( \epsilon \): Error term.
- **Example**:
  - Consider a dataset with 'Hours Studied' as IV and 'Test Score' as DV.
  - Plot these points on a graph and draw the best-fit line.
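
To make the "best-fit line" concrete, here is a minimal sketch that estimates \( a \) and \( b \) with ordinary least squares. The choice of least squares is an assumption (it is the usual method, but the text above does not name one), the function name `fit_simple_linear_regression` is hypothetical, and the data values are invented purely for illustration.

```python
def fit_simple_linear_regression(x, y):
    """Return (a, b) for the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: sum of products of deviations divided by sum of squared x deviations.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx
    # Intercept: the fitted line passes through the point (mean_x, mean_y).
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data, invented for illustration only.
hours_studied = [1, 2, 3, 4, 5]
test_scores = [52, 58, 65, 70, 75]

a, b = fit_simple_linear_regression(hours_studied, test_scores)
print(f"Best-fit line: score = {a:.1f} + {b:.1f} * hours")

# Use the fitted line to predict the score for 6 hours of study.
predicted = a + b * 6
print(f"Predicted score for 6 hours: {predicted:.1f}")
```

With this made-up data the slope \( b \) is the estimated gain in test score per additional hour studied, and the intercept \( a \) is the estimated score at zero hours.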

### Hands-On Example
Navigate to the Kaggle tutorial on scatter plots (linked below).
Download the data and follow the instructions on the page to replicate the scatter plot.
Pay attention to how the data is loaded into Python from an Excel spreadsheet.

Note that you will need to import the Seaborn library as `sns` (`import seaborn as sns`).

You may need to install the package first. If it is not installed, the import raises a `ModuleNotFoundError`.

To install the package, open a new terminal in VS Code and run:

`pip install seaborn`

[Kaggle: Scatter Plots Tutorial](https://www.kaggle.com/code/alexisbcook/scatter-plots/tutorial)
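
If you are unsure whether Seaborn is already available, one way to check before starting the tutorial is sketched below. It uses only the standard library; the helper name `is_installed` is hypothetical and not part of the Kaggle material.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("seaborn"):
    print("seaborn is installed; 'import seaborn as sns' should work.")
else:
    print("seaborn is missing; run 'pip install seaborn' in a terminal.")
```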


## R Value 
In linear regression, the \( R \) value, also known as the correlation coefficient, measures the strength and direction of a linear relationship between two variables. It ranges from -1 to +1. Here are some examples to illustrate different \( R \) values and their interpretations:

### 1. High Positive Correlation (\( R \approx 0.9 \) to \( 1 \))
- **Example**: Height and Weight in Adults
  - **Context**: Generally, as height in adults increases, their weight also tends to increase.
  - **Interpretation**: An \( R \) value close to +1 indicates a strong positive linear relationship. This means that as one variable (height) increases, the other (weight) increases at a roughly constant rate, with the data points clustering tightly around a straight line.

### 2. Moderate Positive Correlation (\( R \approx 0.5 \) to \( 0.7 \))
- **Example**: Years of Experience and Salary
  - **Context**: Employees with more years of experience tend to have higher salaries, but other factors like job role and industry also play a significant role.
  - **Interpretation**: An \( R \) value in this range suggests a moderate positive relationship, meaning that while the trend is for salary to increase with experience, the relationship is not perfectly linear and other factors influence salary as well.

### 3. Low Positive Correlation (\( R \approx 0.1 \) to \( 0.3 \))
- **Example**: Daily Coffee Consumption and Productivity
  - **Context**: There might be a slight tendency for increased coffee consumption to be associated with higher productivity, but it's not a strong relationship.
  - **Interpretation**: An \( R \) value in this range indicates a weak positive linear relationship, suggesting that while there might be a trend, it's not strong or consistent.

### 4. No Correlation (\( R \approx 0 \))
- **Example**: Number of Shoes Owned and Intelligence
  - **Context**: There's no logical or observed consistent relationship between the number of shoes a person owns and their intelligence.
  - **Interpretation**: An \( R \) value near zero suggests no linear relationship between the variables.

### 5. Low Negative Correlation (\( R \approx -0.1 \) to \( -0.3 \))
- **Example**: Time Spent Watching TV and Physical Fitness Level
  - **Context**: People who spend more time watching TV may tend to be slightly less physically fit, but the relationship is not strongly linear.
  - **Interpretation**: A small negative \( R \) value indicates a weak inverse relationship; as one variable increases, the other slightly decreases.

### 6. Moderate Negative Correlation (\( R \approx -0.5 \) to \( -0.7 \))
- **Example**: Cigarette Smoking and Lung Capacity
  - **Context**: Generally, heavier smoking is associated with lower lung capacity.
  - **Interpretation**: A moderately negative \( R \) value suggests a moderate inverse relationship; as smoking increases, lung capacity tends to decrease.

### 7. High Negative Correlation (\( R \approx -0.9 \) to \( -1 \))
- **Example**: Outdoor Temperature and Heating Costs
  - **Context**: In colder months, heating costs tend to be higher.
  - **Interpretation**: An \( R \) value close to -1 indicates a strong negative linear relationship, meaning that as the outdoor temperature decreases, heating costs increase in a consistent, nearly linear way.

### Important Considerations
- **Correlation does not imply causation.** High correlation values do not necessarily mean that one variable causes changes in the other.
- Other statistical methods should be used to establish causality and understand the nature of the relationship between variables.

In a simple linear regression between two variables, R is the Pearson correlation coefficient, which measures the strength and direction of the linear relationship between them. The sections below show how it is calculated.

### Formula for Pearson Correlation Coefficient ( R )
The Pearson correlation coefficient ( R ) is calculated as follows:

<div align="center">
    <img src=".\images\RValue1.svg" alt="R Value Formulae">
</div>

Where:
- \( n \) is the number of data points.
- \( \sum \) denotes summation over all data points.
- \( x \) and \( y \) are the individual values of the two variables.
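
In case the image does not render, the formula can be written out in LaTeX. This assumes the image shows the standard computational form of the Pearson coefficient, which is the form that matches the symbols listed above:

$$ R = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[ n \sum x^2 - \left( \sum x \right)^2 \right] \left[ n \sum y^2 - \left( \sum y \right)^2 \right]}} $$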

### Steps to Calculate R
1. **Collect Data Points**: You need paired data points for the two variables you are correlating (e.g., \( x_1, y_1 \), \( x_2, y_2 \), ..., \( x_n, y_n \)).

2. **Calculate Summations**:
   - Compute the sum of all x values ($\sum x$), the sum of all y values ($\sum y$), the sum of the products of paired x and y values ($\sum xy$), the sum of squared x values ($\sum x^2$), and the sum of squared y values ($\sum y^2$).

3. **Apply the Formula**:
   - Substitute these sums into the formula and calculate R.

### Example Calculation
Suppose you have a small dataset:

| \( x \) | \( y \) |
|-----|-----|
| 1   | 2   |
| 2   | 3   |
| 3   | 4   |
| 4   | 5   |

1. Calculate the necessary summations:
   - $$ \sum x = 1 + 2 + 3 + 4 = 10 $$
   - $$ \sum y = 2 + 3 + 4 + 5 = 14 $$
   - $$ \sum xy = 1 \cdot 2 + 2 \cdot 3 + 3 \cdot 4 + 4 \cdot 5 = 40 $$
   - $$ \sum x^2 = 1^2 + 2^2 + 3^2 + 4^2 = 30 $$
   - $$ \sum y^2 = 2^2 + 3^2 + 4^2 + 5^2 = 54 $$

2. Substitute these values into the \( R \) formula.
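
Carrying out the substitution, and assuming the standard computational form of the Pearson formula, with \( n = 4 \), \( \sum x = 10 \), \( \sum y = 14 \), \( \sum xy = 40 \), \( \sum x^2 = 30 \), and \( \sum y^2 = 54 \):

$$ R = \frac{4 \cdot 40 - 10 \cdot 14}{\sqrt{\left( 4 \cdot 30 - 10^2 \right) \left( 4 \cdot 54 - 14^2 \right)}} = \frac{160 - 140}{\sqrt{20 \cdot 20}} = \frac{20}{20} = 1 $$

A result of exactly 1 is expected for this dataset, because every point lies on the straight line \( y = x + 1 \).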

### Interpretation of R
- A value of R close to +1 indicates a strong positive linear relationship.
- A value close to -1 indicates a strong negative linear relationship.
- A value near 0 suggests little to no linear relationship.

### Limitations
- R only measures linear relationships. Non-linear relationships are not well represented by this statistic.
- Outliers can significantly affect the value of R.
- Correlation does not imply causation.
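
The first two limitations can be seen on tiny datasets. The sketch below is illustrative only: the helper `pearson_r` is a hypothetical name, and all data values are invented for the demonstration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed from deviations around the means."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# A perfect but non-linear relationship: y = x^2 on symmetric x values.
x_sym = [-2, -1, 0, 1, 2]
y_parabola = [xi ** 2 for xi in x_sym]
r_parabola = pearson_r(x_sym, y_parabola)

# Perfectly linear data, then the same data with the last point replaced
# by a single outlier.
x_line = [1, 2, 3, 4, 5]
y_line = [2, 4, 6, 8, 10]
y_outlier = [2, 4, 6, 8, 0]
r_line = pearson_r(x_line, y_line)
r_outlier = pearson_r(x_line, y_outlier)

print(f"Parabola:            R = {r_parabola:.2f}")
print(f"Straight line:       R = {r_line:.2f}")
print(f"Line with 1 outlier: R = {r_outlier:.2f}")
```

In this example the parabola gives R = 0 even though y is fully determined by x, and a single outlying point is enough to pull R from 1 down to 0.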

In practice, many statistical software programs and tools can calculate R quickly and accurately, especially for large datasets.

## R value calculation in Python 

The Pearson correlation coefficient is defined as:

<div align="center">
    <img src=".\images\RValue1.svg" alt="R Value Formulae">
</div>


where:
- $x_i$ and $y_i$ are individual sample points.
- $\bar{x}$ and $\bar{y}$ are the means of the $x$ and $y$ samples, respectively.
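
Written out in LaTeX, the definition in terms of these symbols is the deviation form below. This is the standard textbook definition and is assumed here to be what the image is meant to convey:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

Expanding the deviations gives \( \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i \) and \( \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{1}{n} \left( \sum x_i \right)^2 \), which is the algebraically equivalent sum-based form that the function below computes.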

Here's a Python function that implements this calculation:

```python
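def pearson_correlation(x, y):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(x) != len(y):
        raise ValueError("Lists x and y must have the same length.")

    n = len(x)
    if n == 0:
        raise ValueError("Lists x and y must not be empty.")

    sum_x = sum(x)
    sum_y = sum(y)
    sum_x_sq = sum(xi**2 for xi in x)
    sum_y_sq = sum(yi**2 for yi in y)
    psum = sum(xi * yi for xi, yi in zip(x, y))  # sum of the products x_i * y_i

    # Numerator: sum of the products of deviations from the means.
    num = psum - (sum_x * sum_y / n)
    # Sums of squared deviations of x and of y from their means.
    ss_x = sum_x_sq - sum_x**2 / n
    ss_y = sum_y_sq - sum_y**2 / n

    # If either variable is constant, R is mathematically undefined; this
    # function returns 0.0 by convention. Using <= also guards against tiny
    # negative values caused by floating-point rounding.
    if ss_x <= 0 or ss_y <= 0:
        return 0.0

    return num / (ss_x * ss_y) ** 0.5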

# Example usage
x = [1, 2, 3, 4, 5]
y = [2, 3, 2, 5, 7]

r = pearson_correlation(x, y)
print(f"Pearson Correlation Coefficient: {r}")
```