# WEEK 27 : Regularization

https://learnwith.campusx.in/s/courses/637339afe4b0615a1bbed390/take

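f(x) $\rightarrow$ population model (e.g. built from the data of all the students in the world)

$\hat y$ = f'(x) $\rightarrow$ our prediction model, built from sample data


There will always be a difference between y and $\hat y$, because we are estimating the population model from sample data and because y itself contains noise. This difference has two parts:

- f(x) - f'(x) is the __reducible error__: the gap between the population model and our model, which a better model or more data can shrink.

- y - f(x) is the __irreducible error__: the noise in y that no model can explain.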


$$y = f(x) + \text{irreducible error}$$

## Bias Variance Trade-off

$$\text{Reducible error} = {Bias}^2 + \text{Variance}$$

### ✅ **Bias**

* **Definition**: Bias refers to the **error introduced by approximating a real-world problem (which may be very complex) with a simplified model**.


* It is the **difference between the average prediction of the model and the correct value** you're trying to predict.


* High bias means the model makes strong assumptions about the data and fails to capture its complexity → **underfitting**.


* Low bias means the model fits the training data more closely, assuming fewer incorrect generalizations.


* Bias is not strictly the same as "training error," although **high training error may be a sign of high bias**.


* **Example**: A linear model trying to fit nonlinear data will have high bias.

### ✅ **Variance**

* **Definition**: Variance refers to how much the model's predictions **change if it is trained on different subsets of the training data**.


* A model with **high variance** learns not only the underlying patterns but also the **noise** in the training data → **overfitting**.


* Low variance means the model's predictions are consistent across different training sets.


* High variance models usually perform well on training data but poorly on unseen data (test data).


* **Note**: Variance is **not defined as test error**, but rather as the **sensitivity of the model to training data**. However, **high test error combined with low training error is often a sign of high variance**.



### 🚫 Common misconceptions, corrected:

| Common Statement | Corrected Version |
| --- | --- |
| Bias is the error rate between `y_predicted` and `y_train`. | ❌ Not exactly. Bias is the **expected difference between the model's prediction and the true value**, over many training datasets. High training error **may** imply high bias, but they are not the same. |
| Variance is the error rate between `y_predicted` and `y_test`. | ❌ Incorrect. Variance is **how much predictions change** with different training datasets. High test error may be a symptom of high variance, but it is **not the definition**. |
| Bias = Error rate on training data, Variance = Error rate on testing data. | ❌ Oversimplification. These relationships may correlate in practice, but **bias and variance are about sources of error**, not specific dataset errors. |

### 📌 Summary:

| Term         | Real Definition                                 | High Case    | Low Case                  |
| ------------ | ----------------------------------------------- | ------------ | ------------------------- |
| **Bias**     | Error due to **wrong assumptions in the model** | Underfitting | Captures complex patterns |
| **Variance** | Error due to **sensitivity to training data**   | Overfitting  | Generalizes better        |



### High Bias and Low variance : 

![Screenshot%202023-10-04%20183616.png](attachment:Screenshot%202023-10-04%20183616.png)

- High bias, as the models predict very poorly even on the training data.


- Low variance, because the 3 models differ very little from each other.

![Screenshot%202023-10-04%20184833.png](attachment:Screenshot%202023-10-04%20184833.png)

![Screenshot%202023-10-18%20204147.png](attachment:Screenshot%202023-10-18%20204147.png)

- High variance, because for the same input point we get 3 very different results.

### Bias Variance Trade-Off

#### The "trade-off" in bias-variance trade-off refers to the fact that minimizing bias will usually increase variance and vice versa.

When we decrease bias, we typically use a more complex model that can capture intricate patterns and relationships in the data. This allows the model to fit the training data more closely, reducing the bias. 

However, a more complex model tends to have more variance. This means that if the model is trained on different subsets of the data, it might produce significantly different results.

__Our target is Low Bias and Low Variance__

In the bias-variance trade-off we select a model that is neither overfitting nor underfitting, i.e. one that is not perfect on the training data but also not bad on unseen data.

In practice, we often accept a slightly higher bias in exchange for a lower variance.

We use the following methods to manage the trade-off:

- Regularization

- Bagging

- Boosting
    

### NOTE : Lowering Bias will increase Variance leading to overfitting and increasing Bias will lower Variance leading to underfitting.

### An ideal machine learning model should achieve a balance between bias and variance, meaning it should be both:

✅ Low Bias → It learns the true underlying patterns (does not underfit)


✅ Low Variance → It generalizes well to unseen data (does not overfit)

## ✅ **Why do we *prefer* high variance, low bias models** (in some cases)?

We **prefer low bias** (i.e., more complex/flexible models) **because they have the potential to learn the true pattern** in the data. Even though they may suffer from **high variance**, variance can often be **reduced using techniques like regularization, ensemble methods, or more data**.

---

### 🔍 Here's why we prefer **low bias / high variance** over **high bias / low variance**:

| Preference | Reason |
| --- | --- |
| ✅ **Low Bias** | The model has **enough capacity** to learn the true relationship in data, including nonlinear and complex patterns. |
| ⚠️ **High Bias** | The model **misses important patterns** (underfitting), and adding more data alone will not fix it; the model itself has to become more flexible. |

---

### 🎯 Key Intuition:

* **High bias (simple models)**:

  * Easy to train.
  * **Tend to underperform** on complex tasks.
  * Adding more data won't help much.

* **High variance (complex models)**:

  * Powerful learners.
  * May **overfit**, but that can be addressed by:

    * More data.
    * Regularization (e.g., L1, L2).
    * Cross-validation.
    * Ensemble methods (e.g., Random Forests, Bagging).
    * Dropout (for deep learning).
  * Once controlled, they can give **excellent generalization**.

---

### 🔁 Analogy:

Imagine you're learning to play piano.

* **High bias model**: You only learn a few basic chords (not enough to play a full song). Even with more practice, your knowledge is limited.
* **High variance model**: You try to play every detail perfectly (you might make mistakes), but **with enough practice**, you'll become very good.

---

## 🔍 Summary

| Model Type | Can Learn Complex Patterns? | Easy to Fix? | Preferred When |
| --- | --- | --- | --- |
| High Bias, Low Variance | ❌ No | ❌ Hard to fix | When task is simple or data is small |
| **Low Bias, High Variance** | ✅ Yes | ✅ Can fix with regularization/data | ✅ Most real-world cases |
| Low Bias, Low Variance | ✅ Yes | ✅ Ideal | Hard to achieve directly |

---

## ✅ Conclusion:

> We prefer **low bias / high variance** models because:
>
> * They can **learn complex patterns** (important for real-world data).
> * High variance can be **reduced** using techniques like **regularization, ensembling, or more data**.
> * But **high bias is not fixed by more data or longer training**; it needs a more flexible model or better features.

## 🔁 Bias and Variance in Different Models

| Model                          | Bias | Variance | Explanation                                                                                                        |
| ------------------------------ | ---- | -------- | ------------------------------------------------------------------------------------------------------------------ |
| **Linear Regression**          | High | Low      | Assumes linear relationship. Simple model = low flexibility, so may **miss complex patterns** (underfitting).      |
| **Logistic Regression**        | High | Low      | Similar to linear regression, makes **strong assumptions** about class boundaries.                                 |
| **Naive Bayes**                | High | Low      | Assumes features are **independent**, which is rarely true. Fast and simple, but can miss interactions.            |
| **Decision Tree (deep)**       | Low  | High     | Very flexible, can fit training data exactly. But can **overfit** and perform poorly on new data.                  |
| **k-NN (low k)**               | Low  | High     | Uses very few neighbors = highly flexible and can capture fine patterns, but **too sensitive** to noise.           |
| **k-NN (high k)**              | High | Low      | With many neighbors, becomes too smooth and may **miss complex structures** in the data.                           |
| **SVM (with complex kernel)**  | Low  | High     | Captures complex boundaries but can overfit if not regularized well.                                               |
| **Neural Networks (deep)**     | Low  | High     | Highly flexible, powerful learners. But they need lots of data; with small data there is an **overfitting** risk.  |
| **Random Forest (well-tuned)** | Low  | Low      | Combines many trees and averages their predictions, which **reduces variance** while still capturing complex patterns. |
| **Boosting (e.g., XGBoost)**   | Low  | Low      | Sequentially focuses on errors, so it is good at fitting complex data. Can be regularized to control variance.     |

---

## Overfitting and Underfitting

### **Overfitting :** 
- It means our regression has focused on the particular data set so much it has missed the point.


- __HIGH VARIANCE__ is closely related to overfitting


- ___Low Bias and High Variance___


- It has high training accuracy. It performs well on the training dataset (**LOW BIAS**) but fails to perform well on test data (**HIGH VARIANCE**).


- Overfitting refers to models that fit the training data so closely that they pass through, or very near, each observation. The problem is that the random noise is captured inside an overfitting model.

### **Underfitting :** 

- It means the model has not captured the underlying logic of the data. It doesn't know what to do and therefore provides an answer that is far from correct.


- __HIGH BIAS__ is closely related to underfitting


- ___High Bias and Low Variance___


- It has low training accuracy. It performs badly on the training dataset (**HIGH BIAS**) and also performs badly on test data. Because its predictions barely change from one training set to another, it has **LOW VARIANCE**.


- For example, a linear model fitted to clearly nonlinear data would be an underfitting model: it provides an answer, but does not capture the underlying logic of the data and has weak predictive power. Underfitted models are clumsy and have low accuracy; when you see this, either the model is too simple or there is no relationship to be found in the data.

Underfitting is easy to spot, because accuracy is poor even on the training data. Overfitting is much harder to spot, because the accuracy of the model on the training data seems outstanding.

### This can lead to the following scenarios:

- Low bias, low variance: Aiming at the target and hitting it with good precision.


- Low bias, high variance: Aiming at the target, but not hitting it consistently.


- High bias, low variance: Aiming off the target, but being consistent.


- High bias, high variance: Aiming off the target and being inconsistent.

![bias_variance.jfif](attachment:bias_variance.jfif)

### **Solution to Overfitting:**

1. __Cross-validation :__ Dividing the data into multiple subsets to train and test the model, ensuring the model's performance on unseen data.



2. __Regularization :__ Adding a penalty term to the loss function to discourage complex models, preventing them from fitting the noise in the training data.



3. __Dropout :__ Randomly deactivating a fraction of neurons during training, preventing the model from relying too much on specific features.



4. __Early stopping :__ Stopping the training process when the model's performance on the validation set starts to degrade, preventing it from over-optimizing the training data.



5. __Ensembling :__ Combining predictions from multiple models to improve generalization and reduce overfitting by leveraging the wisdom of the crowd.
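The cross-validation idea from point 1 can be sketched in a few lines. This is a minimal illustration in plain Python, not a library implementation: the helper names `k_fold_indices` and `cross_val_mse`, the toy mean-predictor model, and the synthetic data are all assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_mse(xs, ys, fit, predict, k=5):
    """Average validation MSE over k folds; each fold is held out exactly once."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - predict(model, xs[i])) ** 2 for i in held_out]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / k

# Toy model (illustrative only): always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def predict_mean(model, x):
    return model

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
cv_score = cross_val_mse(xs, ys, fit_mean, predict_mean, k=5)
print(cv_score)
```

Every observation is used for validation exactly once, so the averaged score is a less noisy estimate of performance on unseen data than a single split.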

We can split the initial dataset into training and test sets; 90% training / 10% test or 80% / 20% are common choices.

It works like this: we fit the regression on the training data, and once we have the coefficients, we test the model on the test data by assessing its accuracy. The whole point is that the model has never seen the test dataset, therefore it cannot overfit on it.

<br></br>
It makes sense to split our data into two parts training and testing, we train the model on the training data set and then check how well it behaves on the testing one.
Ultimately, we are trying to avoid the scenario where the model learns to predict the training data very well, but fails miserably when given new samples.
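A small sketch of why the held-out test set matters. It assumes synthetic data (a sine curve plus noise) and uses a 1-nearest-neighbour predictor as a deliberately overfitting model; both choices are illustrative and not part of the course material.

```python
import math
import random

rng = random.Random(42)

# Synthetic data (assumption): y = sin(2*pi*x) plus Gaussian noise.
xs = [rng.random() for _ in range(100)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]

# 80/20 split; the points were generated in random order, so slicing is enough.
split = int(0.8 * len(xs))
x_train, y_train = xs[:split], ys[:split]
x_test, y_test = xs[split:], ys[split:]

def predict_1nn(x):
    """Return the target of the closest training point (this memorises the training set)."""
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return y_train[nearest]

def mse(x_values, y_values):
    """Mean squared error of the 1-NN predictor on the given points."""
    return sum((y - predict_1nn(x)) ** 2 for x, y in zip(x_values, y_values)) / len(x_values)

train_mse = mse(x_train, y_train)
test_mse = mse(x_test, y_test)
print("train MSE:", train_mse)
print("test MSE:", test_mse)
```

The training error is zero because every training point is its own nearest neighbour, yet the test error is not. Looking only at the training data would hide the overfitting.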

In [1]:
import random
import numpy as np

### Expected Value and Variance:

Expected value represents the average outcome of a random variable over a large number of
trials or experiments. 

Expected value is basically the average: $E= \frac{\sum_{i} x_i}{n}$

__Example : X = rolling a dice__

$$E[X] = x_1\;p(x_1) + x_2\;p(x_2) + x_3\;p(x_3) + x_4\;p(x_4) + x_5\;p(x_5) + x_6\;p(x_6)$$




$$E[X] = 1 * \frac{1}{6}\;+\;2 * \frac{1}{6}\;+\;3 * \frac{1}{6}\;+\;4 * \frac{1}{6}\;+\;5 * \frac{1}{6}\;+\;6 * \frac{1}{6}$$


$$E[X] = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5$$

In [2]:
outcome = []

for i in range(100000):
    outcome.append(random.randint(1,6))
    
expected_value = np.array(outcome).mean()
expected_value

3.49885

In a simple sense, the expected value of a random variable is the long-term average value of repetitions of the experiment it represents. For example, the expected value of rolling a six-sided die is 3.5 because, over many rolls, we would expect to average about 3.5.


### NOTE : Expected value is nothing but the population mean (which we usually cannot observe directly, so we estimate it from a sample)


$$E = \frac{\sum_{i=1}^{n} x_i}{n}$$

> no need to remember the below formulas : $\Downarrow$

### Expected value formula for Discrete Random variable

$$E[X] = \sum^n_{i=1} \big[x_i * P(x_i)\big]$$

Where:
- E[X] represents the expected value or mean of the random variable X.
- Σ denotes the summation symbol, which means you need to sum up the following values.
- $x_i$ represents each possible value of the random variable X.
- $P(x_i)$ represents the probability of the random variable taking on the value $x_i$.

In simpler terms, you multiply each possible value of the random variable by its probability of occurring and then sum up these products to find the expected value.
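The discrete formula can be evaluated exactly instead of by simulation. A minimal sketch using exact fractions; the function name `expected_value` and the loaded die are illustrative assumptions.

```python
from fractions import Fraction

def expected_value(distribution):
    """E[X] = sum of x * P(x) over every possible value x."""
    return sum(x * p for x, p in distribution.items())

# Fair die: every face has probability 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

# Hypothetical loaded die: a six comes up half of the time.
loaded_die = {1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(1, 10),
              4: Fraction(1, 10), 5: Fraction(1, 10), 6: Fraction(1, 2)}

print(expected_value(fair_die))    # 7/2
print(expected_value(loaded_die))  # 9/2
```

The fair die gives exactly 7/2 = 3.5, the value that the simulation with 100000 rolls only approximates.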


### Expected value formula for Continuous Random variable

$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$$

In this formula:
- $E(X)$ represents the expected value or mean of the continuous random variable $X$.
- The integral sign $\int$ indicates that you need to integrate.
- $x$ represents the value of the random variable.
- $f(x)$ represents the probability density function (PDF) of the continuous random variable.
- The limits of integration, $-\infty$ to $\infty$, indicate that you integrate over the entire range of possible values for $(X)$.

### Variance formula

$$\text{population Variance of x}\Longrightarrow \text{Var(x)} = E[X^2] - \big(E[X]\big)^2$$

$$OR$$
$$\Longrightarrow E\;\Big[\;\big(X - E[X]\;\big)^2\;\Big]$$


where E[X] is the expected value (mean) of X.

#### NOTE : This relationship holds true for both discrete and continuous random variables

___For n equally likely observations, this reduces to the familiar formula of variance:___ 


$$\sum_{i=1}^n \frac{{(y_i\;-\;y_{mean})}^2}{n}$$


where 


- $\frac{\sum}{n} = E$ which is basically the mean


- and $(y_i-y_{mean})^2 = \big(X - E[X]\;\big)^2$
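Both forms of the variance formula can be checked against each other on the fair die. A short sketch with exact fractions; the helper name `expectation` is an assumption made for the example.

```python
from fractions import Fraction

fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation(distribution, g=lambda x: x):
    """E[g(X)] for a discrete distribution given as {value: probability}."""
    return sum(g(x) * p for x, p in distribution.items())

mean = expectation(fair_die)

# Var(X) = E[X^2] - (E[X])^2
var_shortcut = expectation(fair_die, lambda x: x ** 2) - mean ** 2

# Var(X) = E[(X - E[X])^2]
var_definition = expectation(fair_die, lambda x: (x - mean) ** 2)

print(var_shortcut, var_definition)  # 35/12 35/12
```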

### What exactly are Bias and Variance Mathematically?

#### Bias : 

In the context of machine learning and statistics, bias refers to the systematic
error that a model introduces because it cannot capture the true relationship in
the data. __It represents the difference between the expected prediction of our
model and the correct value which we are trying to predict.__ 


More bias leads to
underfitting, where the model does not fit the training data well.

$$\text{Bias of f'(x)} = E\big[f'(x)\big] - f(x)$$

where:

- f'(x) is our model (made from samples).

- f(x) is the population model


### NOTE : an unbiased predictor is one where E[f'(x)] - f(x) = 0

#### Variance : 

In the context of machine learning and statistics, __variance refers to the amount by
which the prediction of our model will change__ if we used a different training data
set. 

In other words, it measures how much the predictions for a given point vary
between different realizations of the model.

$$\text{Variance of f'(x)}\Longrightarrow E\;\Big[\;\big(f'(x) - E\big[f'(x)\big]\;\big)^2\;\Big]$$


- E here means we take the average over models trained on different training sets $\Longrightarrow \frac{\sum}{n}$
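These two definitions can be estimated by simulation: train the same kind of model on many different training sets and look at its predictions at one fixed point. The sketch below assumes a known population model f(x) = sin(2πx), Gaussian noise, and two toy predictors (a constant mean predictor and a 1-nearest-neighbour predictor); none of these come from the course material.

```python
import math
import random

rng = random.Random(0)

def true_f(x):
    """Population model f(x); known here only because the data is synthetic."""
    return math.sin(2 * math.pi * x)

def sample_training_set(n=30, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def predict_mean(xs, ys, x0):
    """Very simple model: ignore x0 and predict the mean target."""
    return sum(ys) / len(ys)

def predict_1nn(xs, ys, x0):
    """Very flexible model: copy the target of the nearest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]

def bias_and_variance(predict, x0, n_repeats=2000):
    """Estimate E[f'(x0)] - f(x0) and E[(f'(x0) - E[f'(x0)])^2] over many training sets."""
    preds = []
    for _ in range(n_repeats):
        xs, ys = sample_training_set()
        preds.append(predict(xs, ys, x0))
    avg = sum(preds) / len(preds)
    bias = avg - true_f(x0)
    variance = sum((p - avg) ** 2 for p in preds) / len(preds)
    return bias, variance

x0 = 0.25
bias_mean, var_mean = bias_and_variance(predict_mean, x0)
bias_nn, var_nn = bias_and_variance(predict_1nn, x0)
print(f"mean predictor: bias^2={bias_mean ** 2:.3f}  variance={var_mean:.3f}")
print(f"1-NN predictor: bias^2={bias_nn ** 2:.3f}  variance={var_nn:.3f}")
```

The mean predictor should show a large squared bias and a small variance, while the 1-NN predictor should show the opposite pattern.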

# WEEK 27 : Regularization Part 2

## Bias Variance Decomposition

$$\text{Loss = reducible error + irreducible error}$$
<br></br>
$$\text{Loss} = ({\text{bias}}^2 + \text{variance}) + \text{variance of error}(\epsilon)$$

Bias-variance decomposition is a way of analysing a learning algorithm's expected
generalization error with respect to a particular problem by expressing it as the sum of three
very different quantities: bias, variance, and irreducible error.



1. __Bias :__ This is the error from erroneous assumptions in the learning algorithm. High bias can
cause an algorithm to miss the relevant relations between features and target outputs
(underfitting).
    
 ___eg : I take a golf shot and the ball lands short of the flag. The distance between the average landing spot and the flag is the bias.___


2. __Variance :__ This is the error from sensitivity to small fluctuations in the training set. High
variance can cause an algorithm to model the random noise in the training data, rather
than the intended outputs (overfitting).

   ___eg : how widely the landing spots of all my shots are scattered around their average landing spot.___



3. __Irreducible Error :__ This is the noise term. This part of the error is due to the inherent noise
in the problem itself, and can't be reduced by any model.

   ___eg : the golf target itself keeps moving___

### NOTE : Loss can be divided into 2 parts : reducible error and irreducible error. 

i.e. whatever error the model makes on a prediction can be broken down into 2 parts.
<br></br>


$$\text{MSE = reducible error + irreducible error}$$
<br></br>
$$\text{MSE} = ({\text{bias}}^2 + \text{variance}) + \text{variance of error}(\epsilon)$$



If we plot all the irreducible errors, their spread around their own mean is called the variance of error $(\epsilon)$.

![Screenshot%202023-10-04%20212253.png](attachment:Screenshot%202023-10-04%20212253.png)

### <span class="mark">why we do square of bias but not variance squared:</span>

Bias is a signed quantity measured in the same units as y: it is positive when the model overshoots on average and negative when it undershoots. Variance is already an average of squared deviations, so it is measured in squared units of y and can never be negative.


1. __Squared Bias :__ Squaring the bias puts it in the same squared units as the variance and the MSE, and ensures that positive and negative systematic errors both count towards the overall error instead of cancelling out.


2. __Variance without Squaring :__ Variance measures how far the model's predictions for a given point spread around the model's mean prediction, $E\big[(f'(x) - E[f'(x)])^2\big]$. The square is already inside its definition, so squaring it again would change its units and it would no longer add up to the MSE.


__In essence, the square on the bias is not an arbitrary emphasis: when the expected squared error is expanded, the two terms that come out are exactly the squared bias and the variance.__

By examining both bias and variance, we can gain insights into how the model is performing and make informed decisions about how to balance these two components to optimize the model's predictive capabilities.
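The decomposition can be derived in a few lines, which shows where the square on the bias comes from. This is a sketch under stated assumptions: $y = f(x) + \epsilon$ with $E[\epsilon] = 0$, the noise is independent of the training set, and $\sigma^2$ (a symbol introduced here) denotes the variance of error $(\epsilon)$.

First, separate the noise from the model error. The cross term vanishes because $E[\epsilon] = 0$:

$$E\Big[\big(y - f'(x)\big)^2\Big] = E\Big[\big(f(x) - f'(x)\big)^2\Big] + \sigma^2$$

Next, add and subtract $E[f'(x)]$ inside the first term. The cross term vanishes again because $E\big[f'(x) - E[f'(x)]\big] = 0$:

$$E\Big[\big(f(x) - f'(x)\big)^2\Big] = \big(E[f'(x)] - f(x)\big)^2 + E\Big[\big(f'(x) - E[f'(x)]\big)^2\Big]$$

The first term is the bias squared and the second term is the variance, so

$$\text{MSE} = {\text{bias}}^2 + \text{variance} + \sigma^2$$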

### example of high bias low variance and low bias high variance models:

In [4]:
from mlxtend.evaluate import bias_variance_decomp

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from mlxtend.data import boston_housing_data

In [6]:
X, y = boston_housing_data()

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=123, shuffle=True)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((354, 13), (152, 13), (354,), (152,))

In [7]:
lr =LinearRegression()

##### Linear regression is a high bias but low variance model

In [19]:
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(lr, X_train, y_train, X_test, y_test,
                                                            loss= 'mse',
                                                            random_seed=123)


print('Average expected loss: %.3f'%avg_expected_loss)
print('Average bias: %.3f'%avg_bias) # this is bias^2 value
print('Average Variance: %.3f'%avg_var)

Average expected loss: 29.891
Average bias: 28.609
Average Variance: 1.282


In [20]:
dt = DecisionTreeRegressor(random_state=123)

##### Decision Tree is a low bias but high variance model

In [21]:
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(dt, X_train, y_train, X_test, y_test,
                                                            loss= 'mse',
                                                            random_seed=123)


print('Average expected loss: %.3f'%avg_expected_loss)
print('Average bias: %.3f'%avg_bias)             # this bias^2 value
print('Average Variance: %.3f'%avg_var)

Average expected loss: 31.536
Average bias: 14.096
Average Variance: 17.440


---
---

# Regularization

Think of it as a way to impose constraints on a model's optimization process, ensuring that the model doesn't fit the noise in the training data but rather captures the underlying patterns. Regularization encourages the model to generalize better by finding the simplest or smoothest function that can explain the data adequately. This helps to strike a balance between bias and variance, ultimately leading to improved performance on unseen data.

- Regularization is a technique used for tuning the function by __adding an additional penalty term in the error function.__ The additional term controls an excessively fluctuating function so that the coefficients don't take extreme values, **preventing OVERFITTING**.


- It controls the total change in the error terms (i.e. $\Delta E_m\;and\;\Delta E_c$) and restricts the degrees of freedom of m and c: the penalty adds an extra term to the gradients $\frac{dL}{dm}\;and\; \frac{dL}{dc}$, which limits how far the parameters can move.


- By adding the penalty term we decrease the slope of the fitted line. The model fits the training data slightly worse but starts performing better on testing data, without overfitting. ***We change the original regression line***


- **We try to decrease the slope and prevent overfitting. A very high value of "m" (slope of the regression line) tends to lead to overfitting, and shrinking "m" too far leads to underfitting.**

### When to use Regularization?


![image.png](attachment:image.png)

## **Types of Regularization techniques :**
### 1. LASSO (L-1 Norm) 
### 2. RIDGE (L-2 Norm)
### 3. ELASTIC NET

## 1. LASSO (Least Absolute Shrinkage and Selection Operator, L-1 Norm) : 


**The L1 norm is the sum of absolute values.** When it is applied to the residuals as a loss it is known as Least Absolute Deviations (LAD); in LASSO it is applied to the coefficients as a penalty.

   - **Prevents overfitting and helps to perform feature selection i.e which features are important**


   - It minimizes the Residual Sum of Squares (RSS/SSR) between the target value $(y)$ and the estimated values $(\hat y)\;$ plus a penalty proportional to the absolute value of each slope $(m_j)$
    
    $$L1\;=\;\sum_{i=1}^{n}(y_i-\hat y_i)^2\;+\;\lambda\;\sum_{j=1}^{p}\;|m_j|$$
    
    
or, using the mean instead of the sum of the squared residuals:

$$L1 = MSE + \lambda\;\sum_{j=1}^{p}\;|m_j|$$
    


$where$


- RSS = Residual Sum of Squares $\sum (y_i-\hat y_i)^2\;$  or the original cost function or total error or loss.     
   
   
- $\lambda\;\;$is the Shrinkage Factor, a constant chosen through hyperparameter tuning
   
   
- $m_j$ is the slope (coefficient) of the j-th of the p input columns
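The penalised loss is straightforward to compute by hand. A minimal sketch of the `MSE + λ Σ|m|` form above; the function name `lasso_loss` and the example numbers are illustrative assumptions.

```python
def lasso_loss(y_true, y_pred, slopes, lam):
    """MSE of the predictions plus lam times the sum of absolute slopes."""
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
    penalty = lam * sum(abs(m) for m in slopes)
    return mse + penalty

# Perfect fit but large slopes: the loss comes only from the penalty.
print(lasso_loss([1.0, 2.0], [1.0, 2.0], slopes=[3.0, -4.0], lam=0.5))  # 3.5

# All slopes zero: no penalty, but the fit itself is poor.
print(lasso_loss([1.0, 2.0], [0.0, 0.0], slopes=[0.0, 0.0], lam=0.5))   # 2.5
```

The two calls show the tension that $\lambda$ controls: large coefficients are charged even when the fit is perfect, so the optimiser has to trade fit quality against coefficient size.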

## 4 keypoints of LASSO regression :

### 1. How are coefficients affected?

> In Lasso Regression the coefficients can become exactly 0 for higher values of $\lambda\;(alpha)$, whereas in Ridge they only tend towards zero.

> In data with a high number of columns, there is a high chance of overfitting. As $\lambda$ increases in LASSO, the coefficients of the less important columns become zero, which effectively performs feature selection.


### NOTE : LASSO for higher values of $\lambda$ will perform feature selection. Ridge can't do that, as it can never make the value of a coefficient exactly 0

### 2. Higher Coefficients are affected more

> A higher value of $\lambda$ causes sparsity, i.e. more and more coefficients m become exactly zero.
<br></br>
The effect of a higher lambda is more visible on coefficients with higher weightage than on coefficients with lesser weightage.<br></br> 
For a high enough value of $\lambda$ every coefficient becomes 0, which causes underfitting.

![Screenshot%202023-10-15%20060552.png](attachment:Screenshot%202023-10-15%20060552.png)

### 3. Impact on Bias and Variance
- $\lambda \downarrow$

    - $Bias\;\downarrow\;\;Overfit\;\;Variance\;\uparrow$
    
<br></br>    
- $\lambda \uparrow$

    - $Bias\;\uparrow\;\;Underfit\;\;Variance\;\downarrow$
    
    
    
$$\lambda\;\uparrow\;\Longrightarrow\;BIAS\;\uparrow$$

$$\lambda\;\uparrow\;\Longrightarrow\;VARIANCE\;\downarrow$$

### 4. Effect of Regularization on Loss Function

> **With an increase in the value of $\lambda\;(alpha)$ we can see that the coefficients are shrinking and becoming zero.**

![Screenshot%202023-10-15%20061440.png](attachment:Screenshot%202023-10-15%20061440.png)

### NOTE : for a higher number of columns where some columns do not have high weightage, use LASSO.

## Sparsity : How LASSO is able to perform feature selection? 

https://youtu.be/FN4aZPIAfI4

### Why is there Sparsity in Lasso but not in Ridge?

With respect to the coefficient (m):

In Ridge, $\lambda$ is in the denominator, so the fraction only becomes zero when the numerator becomes zero. Increasing $\lambda$ makes m smaller but not zero.


In Lasso, $\lambda$ is in the numerator, so a large enough $\lambda$ can cancel the numerator and make the fraction exactly zero.

#### ridge:

![ridge_m.PNG](attachment:ridge_m.PNG)

#### LASSO : as |m| is not differentiable at 0 we take 3 cases of m :

![lasso_m.PNG](attachment:lasso_m.PNG)

- __When a positive m would be pushed below zero, the m < 0 case applies, in which $\lambda$ is added to, instead of subtracted from, the numerator of $\frac{\sum (y_i-\bar y)\;(x_i - \bar x)}{\sum (x_i - \bar x)^2}$, which would make m positive again. <br></br>Neither case is consistent, so the coefficient stops at exactly 0 instead of changing sign.__
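The three cases can be written as a small function for simple (one-feature) regression. This is a sketch under an explicit assumption about the loss, $\sum (y_i - m x_i - c)^2 + \lambda |m|$; with this loss the threshold is $\lambda / 2$, and the figure above may absorb that constant into $\lambda$. The function name `lasso_slope` is hypothetical.

```python
def lasso_slope(xs, ys, lam):
    """Slope minimising sum((y - m*x - c)^2) + lam * |m| for one feature."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xy > lam / 2:       # case m > 0: lambda is subtracted
        return (s_xy - lam / 2) / s_xx
    if s_xy < -lam / 2:      # case m < 0: lambda is added
        return (s_xy + lam / 2) / s_xx
    return 0.0               # case m = 0: the penalty wins and m stays at zero

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]    # exact line y = 2x

for lam in (0, 10, 20, 30):
    print(lam, lasso_slope(xs, ys, lam))
```

With this data the slope falls from 2.0 to 1.0 and then reaches exactly 0.0, where it stays for every larger $\lambda$. That hard stop at zero is the sparsity behaviour that Ridge does not have.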

## 2. RIDGE REGRESSION (L-2 Norm) :

The L2 penalty is the sum of squared coefficients. (When the L2 norm is applied to the residuals as a loss, it is known as least squares.)


- It is called L-2 as we are squaring the slope.


- It also minimizes the Residual Sum of Squares (RSS/SSR) between the target value $(y)$ and the estimated values $(\hat y)\;$, plus a penalty on the squared slope $(m_j^2)$
    
    $$L2\;=\;\sum_{i=1}^{n}(y_i-\hat y_i)^2\;+\;\lambda\;\sum_{j=1}^{p}\;(m_j)^2$$
    
    
where     
    
    
- RSS = Residual Sum of Squares $\sum (y_i-\hat y_i)^2\;$ or the original cost function or total error or loss.    
 
 
- $\lambda\;\;$is the Shrinkage Factor, a constant that ranges from 0 to $\infty$
 
 
- $m_j$ is the slope (coefficient) of the j-th of the p input columns
    
   

### NOTE: 

- __When the value of "m" is low (below 1 in magnitude), Ridge regression (L-2) has minimal effect because $m^2 < |m|$, whereas Lasso regression (L-1) has a significant impact.__



- __Conversely, when "m" is high, Ridge regression exerts greater control compared to Lasso, as its penalty squares the coefficient.__

#### value of m and c in Ridge:

![m_for_ridge.PNG](attachment:m_for_ridge.PNG)

$$c = \bar y - m\bar x$$

#### Higher value of $\lambda$ will decrease "m"
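For one feature, the Ridge slope has a closed form with $\lambda$ in the denominator. The sketch below assumes the loss $\sum (y_i - m x_i - c)^2 + \lambda m^2$ with an unpenalised intercept, which gives $m = \frac{\sum (y_i-\bar y)(x_i-\bar x)}{\sum (x_i-\bar x)^2 + \lambda}$; the function name `ridge_line` is hypothetical.

```python
def ridge_line(xs, ys, lam):
    """Return (m, c) minimising sum((y - m*x - c)^2) + lam * m^2 for one feature."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    m = s_xy / (s_xx + lam)      # lambda only enlarges the denominator
    c = y_bar - m * x_bar        # intercept, same formula as in plain regression
    return m, c

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]        # exact line y = 2x

for lam in (0, 5, 1000, 1_000_000):
    m, c = ridge_line(xs, ys, lam)
    print(lam, m, c)
```

As $\lambda$ grows the slope keeps shrinking towards zero, but it only reaches zero if the numerator itself is zero.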

## 5 keypoints of Ridge regression : 

https://www.youtube.com/watch?v=8osKeShYVRQ

### 1. How are coefficients affected?

> After applying Ridge regression, as the value of lambda $(\lambda)$ increases, the coefficients of each column start to shrink. They tend towards zero but never reach exactly zero.

##### range of y-axis (coefficients) is shrinking as $\lambda$ is increasing:

![Screenshot%202023-10-15%20052550.png](attachment:Screenshot%202023-10-15%20052550.png)

### NOTE : Coefficients tend towards zero but never reach exactly zero

### 2. Higher Coefficients are affected more

> The larger the value of a coefficient, the faster it shrinks after applying Ridge.<br></br>
The effect of a higher lambda is greater on coefficients with higher weightage than on coefficients with lesser weightage.

### 3. Impact on Bias and Variance

- $\lambda \downarrow$

    - $Bias\;\downarrow\;\;Overfit\;\;Variance\;\uparrow$
    
<br></br>    
- $\lambda \uparrow$

    - $Bias\;\uparrow\;\;Underfit\;\;Variance\;\downarrow$
    
    
    
$$\lambda\;\uparrow\;\Longrightarrow\;BIAS\;\uparrow$$

$$\lambda\;\uparrow\;\Longrightarrow\;VARIANCE\;\downarrow$$
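The effect of $\lambda$ on bias and variance can be checked with a small simulation. The sketch assumes a population model y = 2x plus Gaussian noise, a fixed set of centred inputs, and the one-feature Ridge slope $\frac{S_{xy}}{S_{xx} + \lambda}$; all names and numbers are illustrative.

```python
import random

rng = random.Random(1)
TRUE_SLOPE = 2.0

# Fixed, centred inputs (their mean is 0), so S_xx is simply the sum of x^2.
xs = [i / 10 for i in range(-10, 11)]
s_xx = sum(x * x for x in xs)

def ridge_slope(ys, lam):
    """One-feature Ridge slope: S_xy / (S_xx + lambda)."""
    y_bar = sum(ys) / len(ys)
    s_xy = sum(x * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / (s_xx + lam)

# Many noisy training sets drawn from the same population model.
datasets = [[TRUE_SLOPE * x + rng.gauss(0, 1) for x in xs] for _ in range(2000)]

results = {}
for lam in (0, 5, 50):
    slopes = [ridge_slope(ys, lam) for ys in datasets]
    avg = sum(slopes) / len(slopes)
    bias_sq = (avg - TRUE_SLOPE) ** 2
    variance = sum((m - avg) ** 2 for m in slopes) / len(slopes)
    results[lam] = (bias_sq, variance)
    print(f"lambda={lam:>2}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

The squared bias of the estimated slope rises with $\lambda$ while its variance falls.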

### 4. Effect of Regularization on Loss Function


> With an increase in the value of $\lambda\;(alpha)$ we can see that the coefficients are shrinking and tending to zero. <br></br>

![Screenshot%202023-10-15%20053952.png](attachment:Screenshot%202023-10-15%20053952.png)

#### for 2 coefficients : Loss function is shifting towards centre or zero

![Screenshot%202023-10-24%20233610.png](attachment:Screenshot%202023-10-24%20233610.png)

### 5. Why Ridge Regression is called Ridge?

Hard Constraint Ridge Regression

- Ridge (meaning) :  a line where two surfaces meet at an angle


- By adding the penalty term, the coefficients are shrunk towards zero but never exactly become zero, which helps in reducing the model's complexity and multicollinearity issues. This shrinking effect can be visualized as a "ridge" that prevents the coefficients from growing too large.

<br></br>

Loss Function is a combination of 2 things, $\sum (y-\hat y)^2\;\;+\;\;\lambda ||W||^2$:

1. __Squared error (OLS Estimate)__ $\sum (y-\hat y)^2 = \sum_{i=1}^{n}\;\big(y_i\;-\;(\beta_{0}\;+\;\beta_{1}x_{i1}\;+\;\beta_{2}x_{i2})\big)^2\;\longrightarrow\;$ Contour Plot



2. $\lambda\;(\beta_{1}^2\;+\;\beta_{2}^2)\;\longrightarrow$ blue circle $\longrightarrow$ Ridge regularization





<br></br>
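**So, depending on the loss function, the solution lies on the perimeter of the blue circle (the plot of the penalty term), at the point where the circle touches the lowest-valued contour of the contour plot. This holds whenever the unregularized OLS minimum lies outside the circle.**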



>__As we are getting the solution at the boundary/ridge that's why it is called Ridge Regression__

![Ridge_regress.PNG](attachment:Ridge_regress.PNG)
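The "hard constraint" view can be written out explicitly. A sketch for the two-coefficient case used above, where the budget $t$ is a symbol introduced here for illustration (it does not appear in the notes):

$$\min_{\beta_0,\beta_1,\beta_2}\;\sum_{i=1}^{n}\big(y_i-(\beta_0+\beta_1x_{i1}+\beta_2x_{i2})\big)^2\quad\text{subject to}\quad\beta_1^2+\beta_2^2\;\le\;t$$

The constraint region is the blue circle of radius $\sqrt{t}$. Every $\lambda \ge 0$ in the penalized form corresponds to some budget $t$, and a larger $\lambda$ corresponds to a smaller $t$, i.e. a smaller circle.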

#### NOTE: Use Ridge when number of columns is 2 or greater.

## 3. **ELASTIC NET :**

https://youtu.be/2g2DBkFhTTY?list=PLKnIA16_Rmvbr7zKYQuBfsVkjoLcJgxHH


https://machinelearningcompass.com/machine_learning_models/elastic_net_regression/    

- When we don't know whether all the columns are important or only a few of them are, we use Elastic Net.


- ElasticNet is a combination of Ridge and Lasso.


- __Also applied when input columns have multicollinearity__


- To control the high and low values of "m", we use Elastic Net


- *ElasticNet mitigates the drawbacks of both L1 and L2.*


- __We get ridge regression if L1-ratio = 0 and lasso regression if L1-ratio = 1.__ 

$$Loss\;=\;\sum (y_{i}\;-\;\hat y_{i})^2\;+\;a\;||W||\;+\;b\;||W||^2$$


In Scikit-learn

- $\lambda\;=\;a\;+\;b$


- $l1-ratio\;=\;\frac{a}{a+b}$






accordingly 
- $a\;=\;l_{1} * \lambda$


- $b\;=\;\lambda\;-\;a$

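$$Elastic\;Net\;=\;\frac{\sum (y-\hat y)^2}{2n}\;+\;\lambda\,\alpha \sum |\beta|\;+\;\lambda\;\bigg( \frac{1-\alpha}{2}\bigg) \sum \beta^2\;$$

Here $\lambda$ is the overall penalty strength (`alpha` in scikit-learn) and $\alpha$ is the L1 ratio (`l1_ratio` in scikit-learn).

$Lasso\;\longrightarrow \;\lambda\,\alpha \sum |\beta|$

$Ridge\;\longrightarrow\;\lambda\;\bigg( \frac{1-\alpha}{2}\bigg) \sum \beta^2$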
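The mapping between the two penalty weights $a$, $b$ and scikit-learn's `alpha` / `l1_ratio` can be sketched as two small helpers. The function names `to_sklearn` and `from_sklearn` are hypothetical and only for illustration. The last lines check the L1-ratio = 1 case against `Lasso` on the diabetes data.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso

def to_sklearn(a, b):
    """Map penalty weights (a on L1, b on L2) to (alpha, l1_ratio): lambda = a + b."""
    alpha = a + b
    l1_ratio = a / (a + b)
    return alpha, l1_ratio

def from_sklearn(alpha, l1_ratio):
    """Inverse mapping: a = l1_ratio * lambda, b = lambda - a."""
    a = l1_ratio * alpha
    b = alpha - a
    return a, b

alpha, l1_ratio = to_sklearn(a=0.0045, b=0.0005)
a, b = from_sklearn(alpha, l1_ratio)
print(alpha, l1_ratio, a, b)

# With l1_ratio = 1 the L2 part vanishes and ElasticNet reduces to Lasso
X, y = load_diabetes(return_X_y=True)
enet_as_lasso = ElasticNet(alpha=0.01, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.01).fit(X, y)
same_as_lasso = np.allclose(enet_as_lasso.coef_, lasso.coef_)
print("ElasticNet(l1_ratio=1) matches Lasso:", same_as_lasso)
```

Under this mapping, `alpha=0.005` with `l1_ratio=0.9` corresponds to $a = 0.0045$ and $b = 0.0005$.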

### NOTE: if your dataset has multicollinearity then definitely apply ElasticNet

In [2]:
import pandas as pd
import numpy as np

In [3]:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression,Ridge,Lasso,ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [4]:
X,y = load_diabetes(return_X_y=True)

In [5]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)

In [6]:
# Linear Regression
reg = LinearRegression()
reg.fit(X_train,y_train)
y_pred = reg.predict(X_test)
r2_score(y_test,y_pred)

0.4399387660024644

In [7]:
# Ridge 
reg = Ridge(alpha=0.1)
reg.fit(X_train,y_train)
y_pred = reg.predict(X_test)
r2_score(y_test,y_pred)

0.4519973816947851

In [8]:
# Lasso
reg = Lasso(alpha=0.01)
reg.fit(X_train,y_train)
y_pred = reg.predict(X_test)
r2_score(y_test,y_pred)

0.4411227990495633

In [9]:
# ElasticNet
reg = ElasticNet(alpha=0.005,l1_ratio=0.9)
reg.fit(X_train,y_train)
y_pred = reg.predict(X_test)
r2_score(y_test,y_pred)

0.4531493801165679

### The SGDRegressor class also has an option to apply ElasticNet

![Screenshot%202023-10-15%20062307.png](attachment:Screenshot%202023-10-15%20062307.png)
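A minimal sketch of that option, assuming the same diabetes split as in the cells above. The `alpha`, `l1_ratio`, `max_iter` and `random_state` values are illustrative, and the `StandardScaler` step is added here because gradient descent is sensitive to feature scale.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# penalty="elasticnet" switches on the combined L1 + L2 penalty;
# l1_ratio plays the same role as in the ElasticNet class
sgd_enet = make_pipeline(
    StandardScaler(),
    SGDRegressor(penalty="elasticnet", alpha=0.005, l1_ratio=0.9,
                 max_iter=2000, random_state=42),
)
sgd_enet.fit(X_train, y_train)

sgd_r2 = r2_score(y_test, sgd_enet.predict(X_test))
print(sgd_r2)
```

Because SGD is an iterative, randomized solver, the score will generally differ a little from the one given by the `ElasticNet` class.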

## L2 vs L1 : 

![L1-vs-L2-properties1.png](attachment:L1-vs-L2-properties1.png)

## Q- When to use Ridge or Lasso?

>In cases where only a **small number of predictor variables are significant and the rest are not, Lasso regression tends to perform better** because it's able to shrink insignificant variables completely to zero and remove them from the model.


>However, when **all predictor variables are significant in the model and their coefficients are roughly equal then ridge regression tends to perform better** because it keeps all of the predictors in the model.

Lasso tends to remove the columns with less significance, thus performing feature selection.

To determine which model is better at making predictions, we typically perform k-fold cross-validation and choose whichever model produces the lowest test mean squared error.
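The comparison described above can be sketched with `cross_val_score`. The dataset (diabetes) and the two `alpha` values are assumptions carried over from the earlier cells, not tuned choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

candidates = {"ridge": Ridge(alpha=0.1), "lasso": Lasso(alpha=0.01)}
cv_mse = {}
for name, model in candidates.items():
    # scikit-learn reports the negated MSE so that "higher is better"; flip the sign back
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()

best = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print("lowest cross-validated MSE:", best)
```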

![solver.PNG](attachment:solver.PNG)

---
---

# PART 5

In [22]:
import pandas as pd

In [23]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV, ElasticNet, ElasticNetCV, LinearRegression
import statsmodels.api as sm

In [25]:
admission_prediction = pd.read_csv(r'Admission_Prediction.csv')
admission_prediction.head(8)

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337.0,118.0,4.0,4.5,4.5,9.65,1,0.92
1,2,324.0,107.0,4.0,4.0,4.5,8.87,1,0.76
2,3,,104.0,3.0,3.0,3.5,8.0,1,0.72
3,4,322.0,110.0,3.0,3.5,2.5,8.67,1,0.8
4,5,314.0,103.0,2.0,2.0,3.0,8.21,0,0.65
5,6,330.0,115.0,5.0,4.5,3.0,9.34,1,0.9
6,7,321.0,109.0,,3.0,4.0,8.2,1,0.75
7,8,308.0,101.0,2.0,3.0,4.0,7.9,0,0.68


In [26]:
admission_prediction.describe()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
count,500.0,485.0,490.0,485.0,500.0,500.0,500.0,500.0,500.0
mean,250.5,316.558763,107.187755,3.121649,3.374,3.484,8.57644,0.56,0.72174
std,144.481833,11.274704,6.112899,1.14616,0.991004,0.92545,0.604813,0.496884,0.14114
min,1.0,290.0,92.0,1.0,1.0,1.0,6.8,0.0,0.34
25%,125.75,308.0,103.0,2.0,2.5,3.0,8.1275,0.0,0.63
50%,250.5,317.0,107.0,3.0,3.5,3.5,8.56,1.0,0.72
75%,375.25,325.0,112.0,4.0,4.0,4.0,9.04,1.0,0.82
max,500.0,340.0,120.0,5.0,5.0,5.0,9.92,1.0,0.97


In [27]:
admission_prediction

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337.0,118.0,4.0,4.5,4.5,9.65,1,0.92
1,2,324.0,107.0,4.0,4.0,4.5,8.87,1,0.76
2,3,,104.0,3.0,3.0,3.5,8.00,1,0.72
3,4,322.0,110.0,3.0,3.5,2.5,8.67,1,0.80
4,5,314.0,103.0,2.0,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...,...
495,496,332.0,108.0,5.0,4.5,4.0,9.02,1,0.87
496,497,337.0,117.0,5.0,5.0,5.0,9.87,1,0.96
497,498,330.0,120.0,5.0,4.5,5.0,9.56,1,0.93
498,499,312.0,103.0,4.0,4.0,5.0,8.43,0,0.73


#### Handling Missing Values

In [28]:
# replacing null values in GRE Scores column with average of GRE Score column

admission_prediction['GRE Score'] = admission_prediction['GRE Score'].fillna(admission_prediction['GRE Score'].mean())

In [29]:
admission_prediction['TOEFL Score'] = admission_prediction['TOEFL Score'].fillna(admission_prediction['TOEFL Score'].mean())

In [30]:
admission_prediction['University Rating'] = admission_prediction['University Rating'].fillna(admission_prediction['University Rating'].mean())
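The three `fillna` calls above can also be written as one step. A small sketch on a made-up four-row frame (the values are illustrative, not taken from `Admission_Prediction.csv`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "GRE Score": [337.0, 324.0, np.nan, 322.0],
    "TOEFL Score": [118.0, 107.0, 104.0, np.nan],
    "University Rating": [4.0, np.nan, 3.0, 3.0],
})

cols = ["GRE Score", "TOEFL Score", "University Rating"]
# df[cols].mean() is a Series indexed by column name, so each column
# is filled with its own mean
df[cols] = df[cols].fillna(df[cols].mean())
print(df)
```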

In [31]:
# checking for null values

admission_prediction.isnull().sum()

Serial No.           0
GRE Score            0
TOEFL Score          0
University Rating    0
SOP                  0
LOR                  0
CGPA                 0
Research             0
Chance of Admit      0
dtype: int64

In [32]:
admission_prediction.describe()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
count,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0
mean,250.5,316.558763,107.187755,3.121649,3.374,3.484,8.57644,0.56,0.72174
std,144.481833,11.103952,6.051338,1.128802,0.991004,0.92545,0.604813,0.496884,0.14114
min,1.0,290.0,92.0,1.0,1.0,1.0,6.8,0.0,0.34
25%,125.75,309.0,103.0,2.0,2.5,3.0,8.1275,0.0,0.63
50%,250.5,316.558763,107.0,3.0,3.5,3.5,8.56,1.0,0.72
75%,375.25,324.0,112.0,4.0,4.0,4.0,9.04,1.0,0.82
max,500.0,340.0,120.0,5.0,5.0,5.0,9.92,1.0,0.97


In [33]:
admission_prediction.drop(columns = ['Serial No.'],inplace = True)
admission_prediction

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,337.000000,118.0,4.0,4.5,4.5,9.65,1,0.92
1,324.000000,107.0,4.0,4.0,4.5,8.87,1,0.76
2,316.558763,104.0,3.0,3.0,3.5,8.00,1,0.72
3,322.000000,110.0,3.0,3.5,2.5,8.67,1,0.80
4,314.000000,103.0,2.0,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...
495,332.000000,108.0,5.0,4.5,4.0,9.02,1,0.87
496,337.000000,117.0,5.0,5.0,5.0,9.87,1,0.96
497,330.000000,120.0,5.0,4.5,5.0,9.56,1,0.93
498,312.000000,103.0,4.0,4.0,5.0,8.43,0,0.73


In [34]:
# label
y = admission_prediction['Chance of Admit']

# features
x = admission_prediction.drop(columns=['Chance of Admit'])

In [35]:
x.head()

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research
0,337.0,118.0,4.0,4.5,4.5,9.65,1
1,324.0,107.0,4.0,4.0,4.5,8.87,1
2,316.558763,104.0,3.0,3.0,3.5,8.0,1
3,322.0,110.0,3.0,3.5,2.5,8.67,1
4,314.0,103.0,2.0,2.0,3.0,8.21,0


In [36]:
x.columns

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR', 'CGPA',
       'Research'],
      dtype='object')

### Feature Scaling (Standardization):

If the dispersion of the data varies a lot between features, the model will not be able to understand the relation between features and label well. So, to find a stronger and better relation between features and label, we use StandardScaler. 

It is the process of transforming the data we are working with into a standard scale.


$$standardized\;value\;=\;\frac{\chi - \mu}{\sigma}$$

$$where\; \chi \rightarrow\;original\;variable,\;\; \mu \rightarrow\;mean\;of\;the\;original\;variable,\;\;\sigma\rightarrow\;standard\;deviation\;of\;original\;variable$$

**The formula for standardization is the same as the formula for the Z-score; the standardized variable has $\mu\;=\;0\;and\;\sigma\;=\;1$**


This translates to subtracting the mean and dividing by the standard deviation. In this way, regardless of the data set, we will always obtain a distribution with a mean of zero and a standard deviation of one which could easily be proven.

This will ensure our linear combinations treat the two variables equally.
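A quick check of the formula, assuming a tiny made-up array with two columns on very different scales (GRE-like and CGPA-like values). It compares the manual computation with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_raw = np.array([[337.0, 9.65],
                  [324.0, 8.87],
                  [314.0, 8.21],
                  [330.0, 9.34]])

# (x - mean) / std, column by column; np.std uses ddof=0, the same as StandardScaler
manual = (x_raw - x_raw.mean(axis=0)) / x_raw.std(axis=0)
scaled = StandardScaler().fit_transform(x_raw)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```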

#### Standardization

In [37]:
from sklearn.preprocessing import StandardScaler

In [38]:
# creating a StandardScaler object

scaler = StandardScaler()

In [39]:
# transforming the features

arr = scaler.fit_transform(x)
arr

array([[ 1.84274116e+00,  1.78854223e+00,  7.78905651e-01, ...,
         1.09894429e+00,  1.77680627e+00,  8.86405260e-01],
       [ 6.70814288e-01, -3.10581135e-02,  7.78905651e-01, ...,
         1.09894429e+00,  4.85859428e-01,  8.86405260e-01],
       [ 5.12433309e-15, -5.27312752e-01, -1.07876604e-01, ...,
         1.73062093e-02, -9.54042814e-01,  8.86405260e-01],
       ...,
       [ 1.21170361e+00,  2.11937866e+00,  1.66568791e+00, ...,
         1.63976333e+00,  1.62785086e+00,  8.86405260e-01],
       [-4.10964364e-01, -6.92730965e-01,  7.78905651e-01, ...,
         1.63976333e+00, -2.42366993e-01, -1.12815215e+00],
       [ 9.41258951e-01,  9.61451165e-01,  7.78905651e-01, ...,
         1.09894429e+00,  7.67219636e-01, -1.12815215e+00]])

In [40]:
df1 = pd.DataFrame(arr,columns=['GRE_Score_New', 'TOEFL_Score_New', 'University_Rating_New', 'SOP_New', 'LOR_New', 'CGPA_New', 'Research_New'])
df1.head()

Unnamed: 0,GRE_Score_New,TOEFL_Score_New,University_Rating_New,SOP_New,LOR_New,CGPA_New,Research_New
0,1.842741,1.788542,0.778906,1.13736,1.098944,1.776806,0.886405
1,0.6708143,-0.031058,0.778906,0.632315,1.098944,0.485859,0.886405
2,5.124333e-15,-0.527313,-0.107877,-0.377773,0.017306,-0.954043,0.886405
3,0.4905178,0.465197,-0.107877,0.127271,-1.064332,0.154847,0.886405
4,-0.2306679,-0.692731,-0.994659,-1.387862,-0.523513,-0.60648,-1.128152


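# optional profiling report; assumption: the ydata-profiling package (formerly pandas-profiling) is installed
from ydata_profiling import ProfileReport
pf_df1 = ProfileReport(df1)

pf_df1.to_widgets()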

# after standard scaling each feature has mean ≈ 0 and standard deviation ≈ 1 (describe() shows 1.001002 because pandas uses the sample standard deviation, ddof=1)

In [41]:
df1.describe()

Unnamed: 0,GRE_Score_New,TOEFL_Score_New,University_Rating_New,SOP_New,LOR_New,CGPA_New,Research_New
count,500.0,500.0,500.0,500.0,500.0,500.0,500.0
mean,4.35052e-15,9.419132e-16,5.608847e-16,2.926548e-16,-1.3322680000000001e-17,3.091971e-15,-2.202682e-16
std,1.001002,1.001002,1.001002,1.001002,1.001002,1.001002,1.001002
min,-2.394225,-2.512331,-1.881441,-2.39795,-2.686789,-2.940115,-1.128152
25%,-0.681409,-0.692731,-0.9946589,-0.8828175,-0.5235128,-0.7430227,-1.128152
50%,5.124333e-15,-0.03105811,-0.1078766,0.1272712,0.01730621,-0.02720919,0.8864053
75%,0.6708143,0.796033,0.7789057,0.6323155,0.5581253,0.7672196,0.8864053
max,2.113186,2.119379,1.665688,1.642404,1.639763,2.223672,0.8864053


#### Checking Multicollinearity through VIF Calculations

In [42]:
# checking multicollinearity using VIF for all features
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [43]:
arr.shape

(500, 7)

In [44]:
# so for columns
arr.shape[1]

7

In [45]:
# creating vif_df dataframe
vif_df = pd.DataFrame()

# calculating VIF for each column of the standardized array arr
vif_df['vif'] = [variance_inflation_factor(arr, i) for i in range(arr.shape[1])]
vif_df

Unnamed: 0,vif
0,4.153268
1,3.792866
2,2.508768
3,2.77575
4,2.037308
5,4.65167
6,1.459311


In [46]:
vif_df['feature'] = x.columns
vif_df

# VIF score for each column is less than 10, so multicollinearity is not a concern: we don't need Lasso, Ridge or ElasticNet for that reason, nor do we need to drop a column
# So we can use the df1 dataset as it is

Unnamed: 0,vif,feature
0,4.153268,GRE Score
1,3.792866,TOEFL Score
2,2.508768,University Rating
3,2.77575,SOP
4,2.037308,LOR
5,4.65167,CGPA
6,1.459311,Research
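For intuition, VIF for column $j$ is $1/(1-R_j^2)$, where $R_j^2$ comes from regressing column $j$ on all the other columns. A sketch on synthetic data (the three columns are made up; `x2` is deliberately built from `x1`). The helper name `vif_by_regression` is hypothetical. It also shows the equivalent shortcut through the inverse correlation matrix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)  # correlated with x1
x3 = rng.normal(size=n)                    # unrelated to the others
X = np.column_stack([x1, x2, x3])

def vif_by_regression(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the remaining columns."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

vif_reg = vif_by_regression(X)
# Same numbers: the diagonal of the inverse correlation matrix
vif_corr = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))
print(np.round(vif_reg, 3))
print(np.round(vif_corr, 3))
```

The correlated pair gets a clearly higher VIF than the unrelated column, and a VIF can never fall below 1.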


In [47]:
# this dataset is standardised
df1

Unnamed: 0,GRE_Score_New,TOEFL_Score_New,University_Rating_New,SOP_New,LOR_New,CGPA_New,Research_New
0,1.842741e+00,1.788542,0.778906,1.137360,1.098944,1.776806,0.886405
1,6.708143e-01,-0.031058,0.778906,0.632315,1.098944,0.485859,0.886405
2,5.124333e-15,-0.527313,-0.107877,-0.377773,0.017306,-0.954043,0.886405
3,4.905178e-01,0.465197,-0.107877,0.127271,-1.064332,0.154847,0.886405
4,-2.306679e-01,-0.692731,-0.994659,-1.387862,-0.523513,-0.606480,-1.128152
...,...,...,...,...,...,...,...
495,1.392000e+00,0.134360,1.665688,1.137360,0.558125,0.734118,0.886405
496,1.842741e+00,1.623124,1.665688,1.642404,1.639763,2.140919,0.886405
497,1.211704e+00,2.119379,1.665688,1.137360,1.639763,1.627851,0.886405
498,-4.109644e-01,-0.692731,0.778906,0.632315,1.639763,-0.242367,-1.128152


## Splitting the dataset

In [48]:
from sklearn.model_selection import train_test_split

In [49]:
train_test_split(arr, y, test_size= 0.25,random_state = 345)
# returns a list of 4 objects: x_train, x_test, y_train, y_test

[array([[-4.10964364e-01, -6.92730965e-01,  7.78905651e-01, ...,
          1.63976333e+00, -2.42366993e-01, -1.12815215e+00],
        [-1.67303946e+00, -1.02356739e+00,  7.78905651e-01, ...,
          1.09894429e+00, -1.46711143e+00,  8.86405260e-01],
        [-4.10964364e-01,  1.34360100e-01, -1.07876604e-01, ...,
         -5.23512832e-01, -7.68609886e-02, -1.12815215e+00],
        ...,
        [ 4.90517846e-01,  1.29228759e+00,  1.66568791e+00, ...,
          1.09894429e+00,  1.29683885e+00,  8.86405260e-01],
        [-1.67303946e+00, -1.68524024e+00,  3.93810431e-16, ...,
         -5.23512832e-01, -2.26154025e+00, -1.12815215e+00],
        [ 1.75259294e+00,  1.95396044e+00,  1.66568791e+00, ...,
          1.73062093e-02,  2.02506527e+00,  8.86405260e-01]]),
 array([[-3.20816143e-01,  2.99778313e-01, -1.07876604e-01,
          6.32315491e-01,  1.73062093e-02,  7.01017234e-01,
         -1.12815215e+00],
        [-3.20816143e-01, -8.58149178e-01, -1.07876604e-01,
         -1.38786180e+

In [50]:
# x_train, x_test, y_train, y_test = train_test_split(arr, y, test_size= 0.15, random_state = 345)
# random state is same as seed in numpy

In [51]:
# changing seed to 100
x_train, x_test, y_train, y_test = train_test_split(arr, y, test_size= 0.25, random_state = 100)
# random state is same as seed in numpy

In [52]:
# same seed (100), test size changed to 0.15; this is the split used below
x_train, x_test, y_train, y_test = train_test_split(arr, y, test_size= 0.15, random_state = 100)
# random state is same as seed in numpy

In [53]:
x_train

array([[ 0.85111073,  0.46519653, -0.1078766 , ...,  0.01730621,
         0.30380282,  0.88640526],
       [-1.58289124, -1.1889856 , -1.88144112, ..., -1.60515091,
        -1.13609942, -1.12815215],
       [ 0.67081429,  0.63061474, -0.1078766 , ..., -2.14596996,
         0.35345462,  0.88640526],
       ...,
       [-1.04200191, -0.85814918, -0.99465886, ..., -1.06433187,
        -0.65613201, -1.12815215],
       [-0.50111259, -0.85814918, -0.1078766 , ...,  0.55812525,
         0.10519562,  0.88640526],
       [-1.31244657, -0.85814918, -1.88144112, ..., -2.14596996,
        -0.95404281, -1.12815215]])

In [54]:
linear = LinearRegression()

#### train data

In [55]:
# training the model with train data
linear.fit(x_train, y_train)

LinearRegression()

In [56]:
# storing the model

# pickle.dump(linear, open('admission_linear_model.pickle','wb'))

In [57]:
# now predicting

linear.predict([[337.000000,118.0,4.0,4.5,4.5,9.65,1]])

array([10.08318535])

> In the dataset, the chance of admit for row 0 with these values is 0.92, but the model returns about 10.08.

#### This happens because the model was trained on standardized features, while this input is in the original units. New inputs must be passed through the same fitted scaler before prediction.

In [58]:
admission_prediction.head(3)

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,337.0,118.0,4.0,4.5,4.5,9.65,1,0.92
1,324.0,107.0,4.0,4.0,4.5,8.87,1,0.76
2,316.558763,104.0,3.0,3.0,3.5,8.0,1,0.72


In [59]:
# using same values under standardization for correct output

test1 = scaler.transform([[337.000000,118.0,4.0,4.5,4.5,9.65,1]])
test1



array([[1.84274116, 1.78854223, 0.77890565, 1.13735981, 1.09894429,
        1.77680627, 0.88640526]])

In [60]:
linear.predict(test1)
# now the result is close

array([0.95117594])

In [61]:
df1.head(2)

Unnamed: 0,GRE_Score_New,TOEFL_Score_New,University_Rating_New,SOP_New,LOR_New,CGPA_New,Research_New
0,1.842741,1.788542,0.778906,1.13736,1.098944,1.776806,0.886405
1,0.670814,-0.031058,0.778906,0.632315,1.098944,0.485859,0.886405


In [62]:
# or I can directly use the standardized values as the input
linear.predict([[1.842741,1.788542,0.778906,1.13736,1.098944,1.776806,0.886405]])

array([0.95117591])

In [63]:
admission_prediction.head(3)

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,337.0,118.0,4.0,4.5,4.5,9.65,1,0.92
1,324.0,107.0,4.0,4.0,4.5,8.87,1,0.76
2,316.558763,104.0,3.0,3.0,3.5,8.0,1,0.72


In [64]:
# checking for next row

scaler.transform([[324.000000,107.0,4.0,4.0,4.5,8.87,1]])



array([[ 0.67081429, -0.03105811,  0.77890565,  0.63231549,  1.09894429,
         0.48585943,  0.88640526]])

In [65]:
linear.predict([[ 0.67081429, -0.03105811,  0.77890565,  0.63231549,  1.09894429, 0.48585943,  0.88640526]])

array([0.80284701])

In [66]:
admission_prediction[1:2]

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,324.0,107.0,4.0,4.0,4.5,8.87,1,0.76
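One way to avoid forgetting the scaling step is to bundle the scaler and the model in a `Pipeline`, so raw values can be passed straight to `predict`. A sketch on synthetic data: the two columns and the coefficients used to generate `y` are made up to look GRE-like and CGPA-like, and are not the admission dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# hypothetical raw features on very different scales
X_raw = np.column_stack([rng.uniform(290, 340, 200), rng.uniform(6.8, 9.9, 200)])
y = 0.004 * X_raw[:, 0] + 0.1 * X_raw[:, 1] - 1.4 + rng.normal(scale=0.02, size=200)

# the pipeline standardizes inside fit() and predict(), so callers pass raw values
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_raw, y)
new_student = [[337.0, 9.65]]
pipe_pred = pipe.predict(new_student)

# the manual route: scale first, then predict
scaler = StandardScaler().fit(X_raw)
manual_model = LinearRegression().fit(scaler.transform(X_raw), y)
manual_pred = manual_model.predict(scaler.transform(new_student))

print(pipe_pred, manual_pred)
```

Both routes give the same prediction. With the pipeline the scaler also travels with the model when it is pickled.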


In [67]:
# reading the model present in the system

# model = pickle.load(open('admission_linear_model.pickle','rb'))
# model.predict(test1)

#### test data 

In [68]:
# R² score of the model on the unseen test set

linear.score(x_test,y_test)

# R² ≈ 0.842: the model explains about 84.2% of the variance in the test labels

0.8420039560601401

In [69]:
# function to calculate adjusted R-squared (uses the fitted `linear` model defined above)

def adj_r2(x,y):
    r2 = linear.score(x,y)
    n = x.shape[0]
    p = x.shape[1]
    adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
    return adjusted_r2

In [70]:
adj_r2(x_test,y_test)

0.8254969066932891
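The same arithmetic as a standalone helper that does not depend on the global `linear` model. The function name `adjusted_r2` is hypothetical. The inputs are the test R² printed above, $n = 75$ test rows (15% of 500) and $p = 7$ features.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

adj = adjusted_r2(0.8420039560601401, n_samples=75, n_features=7)
print(round(adj, 4))  # 0.8255
```

The adjusted value is slightly lower than R² because it penalizes the number of features relative to the number of rows.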

In [71]:
# coefficient (m) for each feature

linear.coef_

array([ 0.01912905,  0.01780082,  0.00550634, -0.00025051,  0.01844312,
        0.07254151,  0.01195331])

In [72]:
linear.intercept_

0.7203289055688045

### using Lasso regression

In [73]:
lassocv = LassoCV(cv = 5, max_iter = 2000000, normalize=True)
lassocv.fit(x_train,y_train)

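# cv=5 means 5-fold cross-validation:
# in each round 4 folds are used for training and the remaining 1 for validation
# note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2; the text below is the resulting deprecation warning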


If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:

from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(with_mean=False), Lasso())

If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:

kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)

Set parameter alpha to: original_alpha * np.sqrt(n_samples). 


LassoCV(cv=5, max_iter=2000000, normalize=True)

**LassoCV finds the best value of alpha by cross-validation: it fits the model over a path of candidate alpha values and keeps the one with the lowest average validation error.**
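On scikit-learn versions where `normalize` is no longer accepted, the approach suggested by the warning text is a `Pipeline` with a `StandardScaler`. A sketch on synthetic data from `make_regression` (all parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=7, n_informative=4,
                       noise=10.0, random_state=0)

# scaling happens inside the pipeline, so no `normalize` argument is needed
lasso_pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=100_000))
lasso_pipe.fit(X, y)

lasso_cv = lasso_pipe.named_steps["lassocv"]
print("chosen alpha:", lasso_cv.alpha_)
print("candidate alphas tried:", lasso_cv.alphas_.shape[0])
print("validation MSE grid (alphas x folds):", lasso_cv.mse_path_.shape)
```

`alphas_` holds the candidate grid and `mse_path_` the validation error for every alpha and fold, which makes the selection easy to inspect.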

In [74]:
# value of alpha or lambda from the equations
lassocv.alpha_

4.203962663551952e-05

In [75]:
# linear regression model using Lasso

lasso = Lasso(alpha=lassocv.alpha_)
lasso.fit(x_train,y_train)

Lasso(alpha=4.203962663551952e-05)

In [76]:
lasso.score(x_test,y_test)

# R² ≈ 0.8421

0.8421260048013872

### Using Ridge regression

In [77]:
ridgecv = RidgeCV(alphas = (0.1, 1.0, 10.0), cv = 10, normalize=True)
# giving higher range of alpha to choose from to get better results

ridgecv.fit(x_train,y_train)

If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:

from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(with_mean=False), Ridge())

If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:

kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)

Set parameter alpha to: original_alpha * n_samples. 

RidgeCV(alphas=array([ 0.1,  1. , 10. ]), cv=10, normalize=True)

In [78]:
ridgecv.alpha_

# best lambda/alpha is 0.1, which sits at the lower edge of the grid, so a wider range of alphas is worth trying

0.1

In [79]:
# creating random range through numpy

np.random.uniform(0,10,50)

array([4.62623183, 5.6408602 , 6.51282733, 5.87374087, 5.80505191,
       3.37965155, 2.42610492, 0.99637915, 2.62135191, 5.21226626,
       3.62663716, 6.50693518, 5.04896583, 7.92291569, 7.22453773,
       4.05372827, 1.52191471, 5.52732798, 7.23012405, 4.44247946,
       4.39367812, 7.41466138, 8.03322036, 4.48673883, 7.52575494,
       0.33952705, 9.25503555, 9.24177989, 7.24349186, 3.49999974,
       4.0876295 , 2.2952895 , 1.44724287, 3.13078555, 2.90009224,
       1.71475074, 5.95423753, 8.48872589, 2.89502793, 5.81222086,
       8.67934524, 4.80125509, 5.11956004, 8.20862742, 3.51447701,
       6.96046588, 5.30091827, 4.03721989, 6.59440094, 9.12793155])

In [80]:
# passing a uniform random range to RidgeCV (this is a fresh draw, so the values differ from the ones printed above)

ridgecv = RidgeCV(alphas = np.random.uniform(0,10,50), cv = 10, normalize=True)

ridgecv.fit(x_train,y_train)

RidgeCV(alphas=array([7.82702948, 2.4176247 , 8.12639783, 8.23577074, 2.8676836 ,
       5.54111761, 3.77217526, 9.40134973, 5.24065661, 9.46619388,
       4.11026274, 2.19029967, 3.70980748, 5.99239615, 9.56586123,
       5.81692496, 2.43708709, 0.48315815, 6.78965708, 0.15514936,
       9.6097212 , 5.94354935, 7.20998904, 7.24582145, 8.82324649,
       8.46497141, 4.17879122, 3.35191978, 3.55747247, 9.56225967,
       8.50360866, 1.71017143, 2.93355958, 4.15385635, 9.16162358,
       6.88270054, 8.31296315, 1.32983039, 9.87591852, 2.29464985,
       6.04474857, 2.61882948, 1.33872814, 3.36873486, 3.12672922,
       1.26511823, 9.83175435, 0.56666566, 6.81495676, 1.30420265]),
        cv=10, normalize=True)

In [81]:
ridgecv.alpha_

0.15514936318296257
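Because a useful `alpha` can span several orders of magnitude, a log-spaced grid is a common alternative to a uniform random draw. A sketch on synthetic data; the grid bounds and the dataset are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=0)

alpha_grid = np.logspace(-3, 3, 25)  # 25 values from 0.001 to 1000
ridge_cv = RidgeCV(alphas=alpha_grid, cv=10).fit(X, y)

# if the winner sits on an edge of the grid, the grid should be widened
on_edge = ridge_cv.alpha_ in (alpha_grid[0], alpha_grid[-1])
print("chosen alpha:", ridge_cv.alpha_, "| on grid edge:", on_edge)
```

A log-spaced grid is also reproducible, unlike `np.random.uniform` without a seed.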

**ridge model**

In [82]:
ridge_lr = Ridge(alpha=ridgecv.alpha_)
ridge_lr.fit(x_train,y_train)

Ridge(alpha=0.15514936318296257)

In [83]:
# testing the model
# R² score on the test set

ridge_lr.score(x_test,y_test)

0.8420064922169082

### Using ELASTIC NET regression

In [84]:
# keeping initial alpha as None

elastic = ElasticNetCV(alphas=None, cv = 10)
elastic.fit(x_train,y_train)

ElasticNetCV(cv=10)

In [85]:
elastic.alpha_

# here alpha is lambda for elastic model

0.001391101145529104

In [86]:
# effect of l1-L2 ratio
elastic.l1_ratio_

0.5
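The value 0.5 is simply the default: `ElasticNetCV` searches only over `alpha` unless a list of `l1_ratio` candidates is supplied. A sketch on synthetic data, with an illustrative candidate list:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=7, n_informative=4,
                       noise=10.0, random_state=0)

l1_grid = [0.1, 0.5, 0.7, 0.9, 0.95, 1.0]  # candidate L1 ratios to cross-validate
enet_cv = ElasticNetCV(l1_ratio=l1_grid, cv=10, max_iter=100_000).fit(X, y)

print("chosen alpha:", enet_cv.alpha_)
print("chosen l1_ratio:", enet_cv.l1_ratio_)
```

With a list, both `alpha_` and `l1_ratio_` are chosen by cross-validation.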

**elastic model**

In [87]:
elastic_lr = ElasticNet(alpha = elastic.alpha_, l1_ratio=elastic.l1_ratio_)

In [88]:
elastic_lr.fit(x_train,y_train)

ElasticNet(alpha=0.001391101145529104)

In [89]:
# testing the model
# R² score on the test set

elastic_lr.score(x_test,y_test)

0.8419586493164081

All 3 regularized models give approximately the same R² (≈ 0.842) as plain linear regression, i.e. the model is stable. So the linear model is not overfitting.

___