## Statistical Modelling
- Statistical modelling is a way of using data to understand and predict the world.
- What we do:
1. Observe something in real life
2. Collect data about it
3. Use math to describe patterns in the data
4. Accept that data is noisy and uncertain

### Variables: The building blocks of models 
- A variable is simply something that can change.

#### Independent Variables (Inputs)
- AKA Predictor, Feature, Explanatory variable.
- Things that we believe influence or help explain something else.
- They are what we put into the model.
- Usually denoted as $X, X_1, X_2, \dots$

#### Dependent Variables (Output)
- AKA Response, Target
- This is what we want to predict, explain or understand.
- Depends on the independent variables.
- Usually denoted as $Y$


### What is a Statistical Model?
- A statistical model is a rule that connects inputs to outputs.
- Denoted as $Y = \text{model}(X) + \text{error}$
- We include the error because:
1. Measurements aren't perfect
2. People behave differently
3. The world of data is messy
- This error is also known as random error or noise.
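
To make the idea of $Y = \text{model}(X) + \text{error}$ concrete, here is a minimal sketch that generates noisy observations around a rule. The rule `model`, its numbers (40 and 7), the noise level, and the random seed are all made-up assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

def model(x):
    # Hypothetical rule linking input to output: a straight line
    return 40.0 + 7.0 * x

x = np.linspace(0, 10, 50)                             # inputs
error = rng.normal(loc=0.0, scale=5.0, size=x.shape)   # random noise
y = model(x) + error                                   # what we actually observe

# The observations scatter around the rule instead of sitting exactly on it
print("largest deviation from the rule:", np.max(np.abs(y - model(x))))
```

In real data we only see `x` and `y`; the rule and the noise are both unknown, which is why the parameters have to be estimated.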

#### Parameters: The knobs of the Model
- Parameters are numbers inside the model that control how it behaves.
- e.g. the straight-line model:
$$
Y = \beta_0 + \beta_1 X + \varepsilon
$$
- Where:
- $\beta_0$ -> The value of $Y$ when $X = 0$, i.e. where the line crosses the $Y$-axis (Intercept)
- $\beta_1$ -> How steep the line is (Slope)
- $\varepsilon$ -> Random error


#### How do we find parameters?
- We don't settle for a one-off guess of the parameters. Instead, we estimate them from the data:
1. Try some parameter values
2. See how wrong the model is
3. Adjust the parameters
4. Repeat until the model is “least wrong”
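
The four steps above can be sketched as a loop. These notes do not name a specific method, so the sketch assumes gradient descent on the mean squared error, which is one common choice. The data are the five hours-studied and exam-score points from the worked example in these notes; the starting values, `learning_rate`, and number of iterations are assumptions.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # hours studied
y = np.array([50.0, 55.0, 65.0, 70.0, 75.0])  # exam score

b0, b1 = 0.0, 0.0        # 1. try some parameter values
learning_rate = 0.02     # assumed step size

for _ in range(5000):
    error = (b0 + b1 * x) - y            # 2. see how wrong the model is
    grad_b0 = 2 * error.mean()           # how the squared error changes with b0
    grad_b1 = 2 * (error * x).mean()     # how the squared error changes with b1
    b0 -= learning_rate * grad_b0        # 3. adjust the parameters
    b1 -= learning_rate * grad_b1        # 4. repeat

print(round(b0, 3), round(b1, 3))
```

For a straight line this loop settles on the same intercept and slope that the closed-form least-squares formulas give, so the loop is mainly useful for models that have no closed-form solution.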

This leads to the idea of *loss*.

#### What is loss?
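- Loss is a number that tells us how bad a prediction is.
- For each data point:
- Error (residual) = actual value - predicted value
- Squared-error loss = (actual value - predicted value)$^2$, so that positive and negative errors do not cancel out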

- Without loss:
1. We don't know which model is better
2. We can't improve the model
3. Learning is impossible

- *No loss = No learning*
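
As a small illustration of how loss lets us compare models, the sketch below scores two candidate lines on the same data. It assumes squared error averaged over the data points (mean squared error) as the loss, which is a common choice and not the only one. The helper `mse` and the two candidate parameter pairs are hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # hours studied
y = np.array([50.0, 55.0, 65.0, 70.0, 75.0])  # exam score

def mse(y_true, y_pred):
    # Mean of the squared differences between actual and predicted values
    return np.mean((y_true - y_pred) ** 2)

# Two hypothetical candidate lines
loss_a = mse(y, 40.0 + 5.0 * x)
loss_b = mse(y, 43.5 + 6.5 * x)

print("candidate A:", loss_a)
print("candidate B:", loss_b)
```

The candidate with the smaller loss is the better fit to this data, which is exactly the comparison that would be impossible without a loss.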

#### Linear Regression
- This is one of the simplest statistical models.
- It assumes the relationship between $X$ and $Y$ is roughly a straight line.
- Model Equation:
$$
Y = \beta_0 + \beta_1 X + \varepsilon
$$
- Where: $\varepsilon$ = random error

- Linear regression finds:
1. The best slope
2. The best intercept

#### Relationship between Linear Regression and Other Statistical Measures
1. Linear Regression and Mean
- If a statistical model has no $X$, so that $Y = \beta_0 + \varepsilon$, then the best value of $\beta_0$ (the one that minimises the total squared error) is: $\beta_0$ = mean of $Y$
- In other words, the mean is the simplest statistical model.
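
A quick numerical check of this claim, assuming "best" means smallest sum of squared errors: try many constant predictions and see which one wins. The grid of candidates (40 to 90 in steps of 0.1) is an arbitrary assumption.

```python
import numpy as np

y = np.array([50.0, 55.0, 65.0, 70.0, 75.0])  # exam scores

# Try many constant predictions and record the sum of squared errors for each
candidates = np.linspace(40, 90, 501)  # step of 0.1
sse = np.array([np.sum((y - c) ** 2) for c in candidates])

best = candidates[np.argmin(sse)]
print("best constant:", best)
print("mean of Y:", y.mean())
```

The winning constant coincides with the mean of $Y$, up to the resolution of the grid.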

2. Linear Regression and Variance
- Variance measures how spread out data is.
- Regression splits variance into two parts:
    1. Explained by the model.
    2. Not explained (error).
- This leads to:
$R^2$ = the proportion of the variance in $Y$ that the model explains
- If $R^2 = 0.70$, then we could say that 70% of the variation in $Y$ is explained by $X$
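
The split into explained and unexplained variation can be computed directly. This sketch uses the hours-studied data from the worked example in these notes and the standard least-squares formulas; the variable names (`ss_total`, `ss_explained`, `ss_residual`) are my own.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # hours studied
y = np.array([50.0, 55.0, 65.0, 70.0, 75.0])  # exam score

# Least-squares slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)          # total variation in Y
ss_explained = np.sum((y_hat - y.mean()) ** 2)  # explained by the model
ss_residual = np.sum((y - y_hat) ** 2)          # not explained (error)

r_squared = 1 - ss_residual / ss_total
print("total:", ss_total, "explained:", ss_explained, "residual:", ss_residual)
print("R^2:", round(r_squared, 4))
```

For a least-squares line with an intercept, the explained and residual parts add up to the total, so $R^2$ can equally be written as explained divided by total.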

3. Linear Regression and Correlation
- Correlation answers: Do $X$ and $Y$ move together, and how strongly?
- Regression answers: How much does $Y$ change when $X$ changes?
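
One way to see the difference is to change the unit of $X$. In the sketch below (the helper `fit_slope` and the hours-to-minutes conversion are illustrative assumptions), the correlation stays the same while the slope changes, because the slope is expressed in units of $Y$ per unit of $X$.

```python
import numpy as np

x_hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([50.0, 55.0, 65.0, 70.0, 75.0])

def fit_slope(x, y):
    # Least-squares slope of y on x
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

x_minutes = x_hours * 60  # same information, different unit

r_hours = np.corrcoef(x_hours, y)[0, 1]
r_minutes = np.corrcoef(x_minutes, y)[0, 1]

print("correlation:", r_hours, r_minutes)                         # unit-free
print("slope:", fit_slope(x_hours, y), fit_slope(x_minutes, y))   # per hour vs per minute
```

In simple linear regression with one input, the square of the correlation equals $R^2$.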

- *Statistical modelling is the process of using data, math, and probability to explain relationships and make predictions while accepting uncertainty.* 

### Linear Regression using Self-Constructed Functions

In [1]:
# x = hours studied
# y = exam score

import numpy as np

# Independent variable (hours studied)
X = np.array([1, 2, 3, 4, 5])

# Dependent variable (exam score)
Y = np.array([50, 55, 65, 70, 75])

# function for calculating mean
def mean(x):
    return sum(x) / len(x)


#### Formula for computing slope:
$$
\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
               {\sum_{i=1}^{n} (x_i - \bar{x})^2}
$$


In [2]:
# function for computing slope
def slope(x, y):
    x_mean = mean(x)
    y_mean = mean(y)
    
    numerator = sum((x - x_mean) * (y - y_mean))
    denominator = sum((x - x_mean) ** 2)
    
    return numerator / denominator

#### Formula for computing intercept:
$$
\beta_0 = \bar{y} - \beta_1 \bar{x}
$$


In [4]:
# function for computing intercept
def intercept(x, y):
    return mean(y) - slope(x, y) * mean(x)

In [5]:
# fit the model
beta_1 = slope(X, Y)
beta_0 = intercept(X, Y)

beta_0, beta_1

(43.5, 6.5)

$\beta_0$ = 43.5, $\beta_1$ = 6.5
- Regression equation is: $y = 43.5 + 6.5x$
- Intercept = 43.5 means that when hours studied = 0, the predicted exam score is 43.5
- Slope = 6.5 means that for every additional hour studied, the predicted exam score increases by 6.5 points on average (the sign gives the direction of the relationship and the size gives the rate of change)
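
As a sanity check on the hand-built functions, the same line can be fitted with NumPy's `np.polyfit`. This is a cross-check added here, not part of the original workflow; the names `slope_np` and `intercept_np` are my own.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])        # hours studied
Y = np.array([50, 55, 65, 70, 75])   # exam score

# For degree 1, polyfit returns the coefficients from highest power
# to lowest: [slope, intercept]
slope_np, intercept_np = np.polyfit(X, Y, deg=1)

print(intercept_np, slope_np)
```

The values should agree with $\beta_0$ and $\beta_1$ above up to floating-point rounding.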


In [7]:
def predict(x, beta_0, beta_1):
    return beta_0 + beta_1 * x

# Predict score for 6 hours of study
predict(6, beta_0, beta_1)


82.5
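
Because `predict` only uses arithmetic, it also works on a whole NumPy array at once. The sketch below (restating the data and the fitted values so that it runs on its own) computes the fitted score for every observed student and the leftover errors, called residuals.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])        # hours studied
Y = np.array([50, 55, 65, 70, 75])   # exam score
beta_0, beta_1 = 43.5, 6.5           # fitted values from the model above

def predict(x, beta_0, beta_1):
    return beta_0 + beta_1 * x

fitted = predict(X, beta_0, beta_1)  # elementwise prediction for each x
residuals = Y - fitted               # actual minus predicted

print("fitted:", fitted)
print("residuals:", residuals)
print("sum of residuals:", residuals.sum())
```

The residuals of a least-squares line with an intercept sum to zero. Note also that 6 hours lies outside the observed range of 1 to 5 hours, so the prediction of 82.5 is an extrapolation and relies on the straight-line pattern continuing.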