<a href="https://colab.research.google.com/github/kkokay07/Learning-Machine-Learning/blob/main/Regression%20Model/Simple_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Regression Analysis

## What is Regression?

Regression analysis is a statistical method used to model and analyze the relationship between a dependent variable (target) and one or more independent variables (features/predictors). It helps us understand how changes in independent variables influence the dependent variable, enabling prediction and inference.

### Core Concept
At its heart, regression attempts to find the best-fitting mathematical relationship between variables by minimizing the difference between predicted and actual values.

### Historical Context
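The term "regression" was coined by Francis Galton in the 19th century during his study of human heights. He observed that children of very tall parents tend to be tall, but closer to the average height than their parents, and likewise that children of very short parents tend to be closer to the average. He described this pattern as regression toward the mean.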

## Why Do We Use Regression?

Regression serves multiple crucial purposes in data analysis and machine learning:

1. **Prediction**
   - Forecasting future values based on historical data
   - Example: Predicting house prices based on square footage, location, and amenities
   - Example: Estimating sales volume based on advertising spend

2. **Relationship Understanding**
   - Quantifying the strength and direction of relationships between variables
   - Example: Understanding how temperature affects ice cream sales
   - Example: Measuring the impact of education on income levels

3. **Causal Analysis**
   - Investigating cause-and-effect relationships (with proper experimental design)
   - Example: Analyzing the effect of a new drug on patient recovery time
   - Example: Measuring the impact of price changes on product demand

## Key Terminology

### 1. Variables
- **Dependent Variable (Y)**
  - Also called the response variable or target
  - What we're trying to predict or explain
  - Example: House price, student performance

- **Independent Variables (X)**
  - Also called predictors, features, or explanatory variables
  - Used to predict the dependent variable
  - Example: Square footage, study hours

### 2. Model Components
- **Coefficients (β)**
  - Weights assigned to each independent variable
  - Represent the expected change in Y for a one-unit change in X, holding any other predictors constant
  - Example: Price increase per square foot in house pricing

- **Intercept (β₀)**
  - The base value when all independent variables are zero
  - Starting point of the regression line
  - Example: Base price of a house before considering any features

- **Error Term (ε)**
  - The difference between an observed value and the true (unobservable) regression line
  - Represents unexplained variation
  - Example: Factors affecting house price not included in the model

### 3. Statistical Terms
- **Residuals**
  - Differences between observed and predicted values (observed minus predicted); the sample estimates of the error term
  - Used to assess model fit
  - Example: Difference between predicted and actual house price

- **Variance**
  - Measure of spread in the data
  - Important for understanding prediction reliability
  - Example: Spread of house prices around the mean

- **Standard Error**
  - Measure of coefficient estimate precision
  - Used for confidence intervals and hypothesis testing
  - Example: Uncertainty in price per square foot estimate

## Real-World Example: Ice Cream Sales

Let's consider a simple example to illustrate regression concepts:

### Scenario
- Dependent Variable (Y): Daily ice cream sales ($)
- Independent Variable (X): Daily temperature (°F)
- Goal: Understand and predict how temperature affects sales

### Data Patterns
- As temperature increases, ice cream sales tend to increase
- The relationship might not be perfectly linear
- Other factors (weekday, holidays, events) also influence sales

### Business Applications
1. **Inventory Management**
   - Predict sales based on weather forecast
   - Optimize stock levels
   - Reduce waste and stockouts

2. **Staff Planning**
   - Schedule appropriate number of workers
   - Plan for busy periods
   - Optimize labor costs

3. **Financial Planning**
   - Project revenue based on seasonal temperatures
   - Plan marketing campaigns
   - Set realistic business targets

## Mathematical Foundation

### Basic Linear Relationship
Y = β₀ + β₁X + ε

Where:
- Y: Ice cream sales ($)
- β₀: Base sales (when temperature is 0°F)
- β₁: Sales increase per degree increase
- X: Temperature (°F)
- ε: Error term

### Example Values
- β₀ = $100 (base sales)
- β₁ = $5 (sales increase per degree)
- At 75°F: Expected sales = $100 + $5(75) = $475
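
The arithmetic above can be wrapped in a small function. This is only a sketch: the function name `expected_sales` is hypothetical, and the default coefficients are the illustrative values from this example, not estimates from real data.

```python
def expected_sales(temperature_f, beta0=100.0, beta1=5.0):
    """Expected daily sales ($) under the illustrative line Y = beta0 + beta1 * X."""
    return beta0 + beta1 * temperature_f


# Evaluate the line at a few temperatures (°F)
for temp in (60, 75, 90):
    print(f"{temp}°F -> ${expected_sales(temp):.2f}")
```

The error term ε is left out because the function returns the expected value; actual sales on any given day would scatter around it.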

## Transitioning to Advanced Concepts

This introduction sets the foundation for more complex topics we'll cover:
- Multiple regression with several predictors
- Non-linear relationships
- Interaction effects
- Model validation and diagnostics

Understanding these fundamentals is crucial for:
1. Selecting appropriate regression methods
2. Interpreting results correctly
3. Making valid predictions
4. Communicating findings effectively


# Simple Linear Regression

## Mathematical Foundation

### Basic Equation
The simple linear regression model is represented by:
Y = β₀ + β₁X + ε

Where:
- Y: Dependent variable (target)
- β₀: Y-intercept (constant term)
- β₁: Slope coefficient
- X: Independent variable (feature)
- ε: Error term (unobserved; estimated by the residuals)

### Ordinary Least Squares (OLS)
OLS is the most common method for estimating the parameters β₀ and β₁. It minimizes the sum of squared residuals:

minimize Σ(yᵢ - ŷᵢ)²

Where:
- yᵢ: Actual value
- ŷᵢ: Predicted value (β₀ + β₁xᵢ)
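
For simple linear regression the OLS minimization has a closed-form solution: β₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)² and β₀ = ȳ - β₁x̄. The sketch below applies these formulas with NumPy to the house-price sample data used in this notebook; the variable names are illustrative choices, and this is not how scikit-learn computes the fit internally.

```python
import numpy as np

# House sizes (sq ft) and prices ($) from this notebook's sample dataset
x = np.array([1200, 1400, 1600, 1800, 2000, 2200, 2400], dtype=float)
y = np.array([150000, 165000, 180000, 200000, 220000, 235000, 255000], dtype=float)

x_mean, y_mean = x.mean(), y.mean()

# Closed-form OLS estimates for one predictor
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

# Sum of squared residuals at the OLS solution
y_hat = beta0 + beta1 * x
ssr = np.sum((y - y_hat) ** 2)

print(f"beta0 = {beta0:.2f}, beta1 = {beta1:.2f}, SSR = {ssr:.2f}")
```

Any other choice of intercept and slope gives a larger sum of squared residuals on the same data, which is what "least squares" means.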

## Key Assumptions

1. **Linearity**
   - The relationship between X and Y is linear
   - Can be verified using scatter plots
   - Violation requires non-linear transformation or different model

2. **Independence**
   - Observations are independent of each other
   - Common violation in time series data
   - Test using Durbin-Watson statistic

3. **Homoscedasticity**
   - Constant variance of residuals
   - Check using residual plots
   - Violation requires weighted regression or transformation

4. **Normality**
   - Residuals are normally distributed
   - Verify using Q-Q plots
   - Less critical for large samples (Central Limit Theorem)

5. **No Perfect Multicollinearity**
   - Not a concern in simple linear regression
   - Becomes important in multiple regression
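
The Durbin-Watson statistic mentioned under Independence can be computed directly from the residuals as the sum of squared successive differences divided by the sum of squared residuals. The sketch below is a plain NumPy version (the helper name `durbin_watson` and both residual series are hypothetical, chosen to show the two extremes). Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.

```python
import numpy as np


def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)


# Hypothetical residuals in time order, for illustration only
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])  # sign flips every step
trending = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])     # slow drift upward

print(f"alternating: {durbin_watson(alternating):.3f}")  # well above 2
print(f"trending:    {durbin_watson(trending):.3f}")     # well below 2
```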


## Practical Example: House Price Prediction

### Dataset Description
```python
# Sample dataset
houses = {
    'size_sqft': [1200, 1400, 1600, 1800, 2000, 2200, 2400],
    'price': [150000, 165000, 180000, 200000, 220000, 235000, 255000]
}
```

```python
# Import libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
```

```python
# Create DataFrame
df = pd.DataFrame(houses)

# Split features and target
X = df[['size_sqft']]
y = df['price']
```

```python
# Create and fit model
model = LinearRegression()
model.fit(X, y)
```

```python
# Model coefficients
print(f"Intercept (β₀): {model.intercept_:.2f}")
print(f"Slope (β₁): {model.coef_[0]:.2f}")
```

```python
# Make predictions
# Use a DataFrame with the training column name so scikit-learn does not warn about feature names
X_new = pd.DataFrame({'size_sqft': [2100]})  # Predict price for 2100 sq ft house
predicted_price = model.predict(X_new)
print(f"Predicted price: {predicted_price[0]:.2f}")
```

### Interpretation of Results

1. **Coefficient Interpretation**
   - β₀ (Intercept): Base price when size is 0 (may not have practical meaning)
   - β₁ (Slope): Price increase per square foot increase

2. **Model Equation**
   - Price = β₀ + β₁ × Size
   - For the sample dataset above, the OLS fit is approximately Price = 41607.14 + 88.39 × size_sqft


## Model Evaluation
### Key Metrics

1. **R-squared (R²)**
   - Proportion of variance in the target explained by the model
   - Ranges from 0 to 1 on the data used to fit an OLS model with an intercept; it can be negative on held-out data
   - Higher values indicate better fit

   ```python
   r2 = model.score(X, y)
   ```

2. **Mean Squared Error (MSE)**
   - Average squared difference between predicted and actual values
   - Penalizes larger errors more heavily

   ```python
   from sklearn.metrics import mean_squared_error
   mse = mean_squared_error(y, model.predict(X))
   ```

3. **Root Mean Squared Error (RMSE)**
   - Same units as target variable
   - More interpretable than MSE

   ```python
   rmse = np.sqrt(mse)
   ```

4. **Mean Absolute Error (MAE)**
   - Average absolute difference between predicted and actual values
   - Less sensitive to outliers than RMSE

   ```python
   from sklearn.metrics import mean_absolute_error
   mae = mean_absolute_error(y, model.predict(X))
   ```
## Visualization and Diagnostics


### 1. Scatter Plot with Regression Line

```python
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', alpha=0.5)
plt.plot(X, model.predict(X), color='red', linewidth=2)
plt.xlabel('Size (sq ft)')
plt.ylabel('Price ($)')
plt.title('House Prices vs Size with Regression Line')
plt.show()
```

### 2. Residual Plot

```python
residuals = y - model.predict(X)
plt.figure(figsize=(10, 6))
plt.scatter(model.predict(X), residuals, color='blue', alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
```

## Common Issues and Solutions

1. **Outliers**
   - Identify using scatter plots and residual analysis
   - Consider removal or robust regression techniques
   - Investigate unusual cases for data quality issues

2. **Non-linearity**
   - Transform variables (log, square root, etc.)
   - Consider polynomial regression
   - Use non-linear models if necessary

3. **Heteroscedasticity**
   - Use weighted least squares
   - Transform dependent variable
   - Use robust standard errors
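
As one illustration of the non-linearity remedies listed above, the sketch below compares a straight-line fit with a quadratic (polynomial) fit using `np.polyfit`. The data are hypothetical and generated without noise from y = 2 + 3x + 0.5x², so the quadratic recovers the generating coefficients up to floating-point error; real data would not behave this cleanly.

```python
import numpy as np

# Hypothetical curved data generated from y = 2 + 3x + 0.5x^2 (no noise)
x = np.arange(0.0, 6.0)
y = 2.0 + 3.0 * x + 0.5 * x ** 2

# Fit degree-1 and degree-2 polynomials (coefficients are returned highest degree first)
linear_coefs = np.polyfit(x, y, 1)
quad_coefs = np.polyfit(x, y, 2)

# Compare the sum of squared errors of the two fits
linear_sse = np.sum((y - np.polyval(linear_coefs, x)) ** 2)
quad_sse = np.sum((y - np.polyval(quad_coefs, x)) ** 2)

print(f"linear SSE:    {linear_sse:.4f}")
print(f"quadratic SSE: {quad_sse:.4f}")
print(f"quadratic coefficients: {np.round(quad_coefs, 4)}")
```

A higher-degree polynomial never fits the training data worse, so the comparison should be made on held-out data before preferring the more flexible model.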

## Practical Applications

1. **Sales Forecasting**
   - Predict sales based on advertising spend
   - Plan inventory based on historical data
   - Set realistic sales targets

2. **Real Estate**
   - Estimate property values
   - Analyze price trends
   - Support investment decisions

3. **Scientific Research**
   - Study relationships between variables
   - Test hypotheses
   - Control for confounding factors (this requires extending the model to multiple regression)

## Best Practices

1. **Data Preparation**
   - Remove or handle missing values
   - Identify and handle outliers
   - Scale variables if necessary

2. **Model Validation**
   - Use train-test split
   - Perform cross-validation
   - Check assumptions

3. **Documentation**
   - Record data sources and preprocessing steps
   - Document model assumptions and limitations
   - Keep track of model performance metrics

4. **Communication**
   - Present results clearly
   - Use appropriate visualizations
   - Explain limitations and uncertainties
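
The model-validation practices above can be sketched with leave-one-out cross-validation, a form of cross-validation that suits very small datasets such as the seven-house sample in this notebook. The code is a NumPy-only illustration; the helper `fit_line` is a hypothetical name, and scikit-learn's own cross-validation utilities would be the more usual choice in practice.

```python
import numpy as np

# House sizes (sq ft) and prices ($) from this notebook's sample dataset
x = np.array([1200, 1400, 1600, 1800, 2000, 2200, 2400], dtype=float)
y = np.array([150000, 165000, 180000, 200000, 220000, 235000, 255000], dtype=float)


def fit_line(x_train, y_train):
    """Closed-form OLS for one predictor; returns (intercept, slope)."""
    x_mean, y_mean = x_train.mean(), y_train.mean()
    slope = np.sum((x_train - x_mean) * (y_train - y_mean)) / np.sum((x_train - x_mean) ** 2)
    return y_mean - slope * x_mean, slope


# Hold out each house in turn, fit on the rest, and predict the held-out price
squared_errors = []
for i in range(len(x)):
    train = np.arange(len(x)) != i
    intercept, slope = fit_line(x[train], y[train])
    prediction = intercept + slope * x[i]
    squared_errors.append((y[i] - prediction) ** 2)

loo_rmse = np.sqrt(np.mean(squared_errors))
print(f"Leave-one-out RMSE: ${loo_rmse:,.2f}")
```

The leave-one-out RMSE is typically larger than the RMSE computed on the training data, because each prediction is made for a point the model did not see.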