# **Theoretical**

#### 1. What does R-squared represent in a regression model

R-squared (R²) is a statistical measure that indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s). Essentially, it tells you how well the regression model fits the data. Here’s a bit more detail:

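1. Range: For ordinary least squares with an intercept, R-squared on the training data ranges from 0 to 1. It can be negative when evaluated on new data or when the model is fitted without an intercept.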
2. Interpretation:
   - An R² value of 0 means that the independent variables do not explain any of the variability in the dependent variable.
   - An R² value of 1 means that the independent variables explain all the variability in the dependent variable.
   - Values between 0 and 1 indicate the proportion of the variance in the dependent variable that can be explained by the model.
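The definition can be made concrete with a small sketch. The function name `r_squared` and the sample values are illustrative assumptions, not part of the original notes; the calculation is the standard `1 - SS_res / SS_tot`.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]

r2 = r_squared(y_true, y_pred)
print(round(r2, 4))  # SS_res = 1, SS_tot = 20, so R^2 is about 0.95
```

Here the predictions leave 1 unit of squared error out of 20 units of total variation, so about 95% of the variance is explained.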

#### 2. What are the assumptions of linear regression

Linear regression relies on several key assumptions to ensure the validity of the model and its results. Here are the primary assumptions:

1. **Linearity**: The relationship between the independent variables and the dependent variable is linear. This means each independent variable contributes additively, and a one-unit change in a predictor shifts the expected value of the dependent variable by a constant amount.

2. **Independence**: The observations are independent of each other. In other words, the value of the dependent variable for any observation is not influenced by the value of the dependent variable for any other observation.

3. **Homoscedasticity**: The residuals (errors) of the regression model have constant variance across all levels of the independent variables. This means that the spread of the residuals is the same for all values of the independent variables.

4. **No Perfect Multicollinearity**: The independent variables are not perfectly correlated with each other. If there is perfect multicollinearity, it means that one independent variable can be perfectly predicted from the others, making it difficult to estimate the regression coefficients.

5. **Normality of Residuals**: The residuals of the regression model are normally distributed. This assumption is important for making statistical inferences, such as hypothesis testing and constructing confidence intervals.

#### 3. What is the difference between R-squared and Adjusted R-squared

Both R-squared (R²) and Adjusted R-squared (Adjusted R²) are metrics used to evaluate the goodness-of-fit of a regression model, but they serve slightly different purposes.

**R-squared (R²)**:
- **Definition**: R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variables in the model.
- **Range**: It ranges from 0 to 1.
- **Limitation**: One of the main limitations of R-squared is that it can only increase or stay the same when more independent variables are added to the model, regardless of whether those variables are actually significant.

**Adjusted R-squared (Adjusted R²)**:
- **Definition**: Adjusted R-squared adjusts the R-squared value for the number of independent variables in the model. It takes into account the complexity of the model.
- **Adjustment**: It penalizes the addition of unnecessary variables that do not improve the model. Therefore, it can increase or decrease depending on whether the new variables improve the model.
- **Range**: It is at most 1 and never exceeds R-squared, but unlike R-squared it can be negative when the model explains very little variance relative to the number of predictors. It is usually the more informative measure when comparing models with different numbers of independent variables.

To put it simply, while R-squared can sometimes be misleading due to its tendency to always increase with additional variables, Adjusted R-squared provides a more balanced view by accounting for the model's complexity and penalizing overfitting.

Here's a quick summary in table form for clarity:

| Metric           | Description                                                                 | Range | Behavior with More Variables          |
|------------------|-----------------------------------------------------------------------------|-------|----------------------------------------|
| R-squared        | Proportion of variance explained by the model                               | 0 - 1 | Always increases or stays the same     |
| Adjusted R-squared | Adjusts R-squared for the number of predictors, penalizes overfitting      | ≤ 1 (can be negative) | Can increase or decrease based on model improvement |


#### 4. Why do we use Mean Squared Error (MSE)

Mean Squared Error (MSE) is a widely-used metric for evaluating the performance of regression models. Here are the main reasons why we use MSE:

1. **Penalty for Large Errors**: MSE penalizes larger errors more than smaller errors because it squares the differences between the actual and predicted values. This is useful because larger errors are typically more problematic than smaller ones.

2. **Differentiability**: MSE is differentiable, making it suitable for optimization algorithms like gradient descent. This property allows us to find the minimum error by adjusting the model's parameters during training.

3. **Mathematical Simplicity**: MSE is relatively simple to compute and understand. It's the average of the squared differences between actual and predicted values, providing a straightforward measure of model accuracy.

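4. **Convexity**: For linear regression, the MSE is a convex function of the coefficients, so any local minimum is also a global minimum. This makes it easier to optimize than non-convex functions, which might have multiple local minima. For non-linear models such as neural networks, the MSE is generally not convex in the parameters.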

5. **Interpretability**: MSE itself is expressed in squared units of the dependent variable. Taking its square root (RMSE) gives a typical error magnitude in the same units as the dependent variable.

To sum up, MSE provides a balanced, mathematically convenient way to assess the accuracy of regression models while placing a higher penalty on larger errors.
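The differentiability and convexity points can be illustrated with a minimal gradient descent sketch for a one-feature linear model. The data, learning rate, and iteration count are assumptions chosen for illustration; the data follow `y = 2x + 1` exactly so the minimum is known.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w, b, xs, ys):
    """Partial derivatives of the MSE with respect to w and b."""
    n = len(xs)
    dw = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
    db = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
    return dw, db


# Hypothetical noise-free data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0
learning_rate = 0.05  # assumed step size, small enough for this data
for _ in range(2000):
    dw, db = mse_gradient(w, b, xs, ys)
    w -= learning_rate * dw
    b -= learning_rate * db

print(round(w, 4), round(b, 4), round(mse(w, b, xs, ys), 8))
```

Because the loss is convex in `w` and `b`, the updates converge to the single minimum at `w = 2`, `b = 1` regardless of the starting point, provided the step size is small enough.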

#### 5. What does an Adjusted R-squared value of 0.85 indicate
An Adjusted R-squared value of 0.85 indicates that approximately 85% of the variance in the dependent variable is explained by the independent variables in the model, while accounting for the number of predictors. This suggests that the model has a strong explanatory power and fits the data well.

In practical terms, this high Adjusted R-squared value suggests that the model captures much of the systematic variation in the data. It does not by itself guarantee accurate predictions on new data, so it's always important to consider other factors such as the residual plots, potential overfitting, and the theoretical basis of the model.

#### 6.  How do we check for normality of residuals in linear regression

Checking for the normality of residuals is an important step in validating a linear regression model. Here are some common methods to assess the normality of residuals:

1. **Histogram**: Plot a histogram of the residuals. If the residuals are normally distributed, the histogram should resemble a bell curve.

2. **Q-Q Plot (Quantile-Quantile Plot)**: A Q-Q plot compares the quantiles of the residuals to the quantiles of a normal distribution. If the residuals are normally distributed, the points should lie approximately along a straight line.

3. **Shapiro-Wilk Test**: This is a statistical test that evaluates the normality of residuals. A small p-value (typically less than 0.05) suggests that the residuals are not normally distributed.

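4. **Kolmogorov-Smirnov Test**: Another statistical test that compares the residuals' distribution to a normal distribution. Like the Shapiro-Wilk test, a small p-value indicates non-normality. Because the mean and variance are estimated from the residuals themselves, the Lilliefors variant of the test is the appropriate one in this setting.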

5. **Skewness and Kurtosis**: Calculate the skewness and kurtosis of the residuals. Skewness close to 0 and kurtosis close to 3 (equivalently, excess kurtosis close to 0) indicate normality.

6. **Normal Probability Plot**: Similar to the Q-Q plot, this plot shows the cumulative probability of the residuals versus a normal distribution. A straight line suggests normality.

Here's a quick example of how you might visualize residuals with a histogram and a Q-Q plot using Python (assuming you have a `residuals` array):

```python
import matplotlib.pyplot as plt
import scipy.stats as stats

# Histogram of residuals
plt.hist(residuals, bins=30, edgecolor='k')
plt.title('Histogram of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()

# Q-Q plot of residuals
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot of Residuals')
plt.show()
```
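The numerical checks from the list (Shapiro-Wilk, skewness, kurtosis) can be sketched with SciPy as well. The `residuals` array below is synthetic, drawn from a normal distribution purely as a stand-in; in practice it would come from a fitted model.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for model residuals (assumption for illustration)
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

# Shapiro-Wilk: a small p-value is evidence against normality
statistic, p_value = stats.shapiro(residuals)

# Skewness near 0 and excess kurtosis near 0 are consistent with normality.
# scipy.stats.kurtosis uses the Fisher definition by default (normal -> 0).
skewness = stats.skew(residuals)
excess_kurtosis = stats.kurtosis(residuals)

print(f"Shapiro-Wilk p-value: {p_value:.3f}")
print(f"skewness: {skewness:.3f}, excess kurtosis: {excess_kurtosis:.3f}")
```

Note that `scipy.stats.kurtosis` reports excess kurtosis by default, so the reference value for a normal distribution is 0 rather than 3.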

#### 7.  What is multicollinearity, and how does it impact regression

**Multicollinearity** refers to a situation in regression analysis where two or more independent variables are highly correlated. This high correlation means that one independent variable can be linearly predicted from the others with a substantial degree of accuracy.

### Impact of Multicollinearity on Regression

1. **Unstable Coefficients**: Multicollinearity can make the estimates of the regression coefficients unstable and highly sensitive to small changes in the model. This can lead to large standard errors and inflated confidence intervals, making it hard to determine the true effect of each independent variable.

2. **Interpretation Difficulty**: It becomes difficult to assess the individual impact of correlated variables on the dependent variable, as their effects are intertwined. This can obscure the true relationship between the predictors and the outcome.

3. **Reduced Statistical Power**: The statistical power to detect significant predictors diminishes because of the overlap in the information provided by the correlated variables. This can result in some predictors appearing non-significant when they might actually be important.

4. **Reduced Precision**: Multicollinearity can decrease the precision of the estimated coefficients, leading to less reliable predictions and interpretations.

### Detecting Multicollinearity

1. **Correlation Matrix**: Examining the pairwise correlations between independent variables can give a preliminary idea of potential multicollinearity issues.

2. **Variance Inflation Factor (VIF)**: VIF quantifies how much the variance of a regression coefficient is inflated due to multicollinearity. A VIF value greater than 10 is often considered indicative of significant multicollinearity.

3. **Tolerance**: Tolerance is the reciprocal of VIF (1/VIF), which equals 1 − R² from regressing that predictor on the other predictors. A low tolerance value (close to 0) indicates high multicollinearity.

4. **Eigenvalues and Condition Index**: The condition index assesses the sensitivity of the estimated coefficients to changes in the data. High values indicate potential multicollinearity problems.
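As a rough sketch of the VIF calculation, each predictor can be regressed on the remaining predictors and the resulting R² converted to `1 / (1 - R²)`. The helper name `vif` and the synthetic data are assumptions for illustration; libraries such as statsmodels provide their own implementations.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        centered = target - target.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        factors.append(1.0 / (1.0 - r2))
    return factors


# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)

vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])
```

In this setup the first two VIF values are far above the common threshold of 10, while the third stays close to 1.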

### Managing Multicollinearity

1. **Remove Highly Correlated Predictors**: If some variables are highly correlated, consider removing one or more to reduce multicollinearity.

2. **Combine Predictors**: Creating composite variables or using techniques like principal component analysis (PCA) can help combine correlated predictors into a single index.

3. **Regularization Techniques**: Methods like Ridge Regression and Lasso Regression can help mitigate the impact of multicollinearity by introducing penalties to the regression model.

#### 8. What is Mean Absolute Error (MAE)

Mean Absolute Error (MAE) is a widely-used metric for evaluating the accuracy of regression models. It measures the average magnitude of the errors between the predicted and actual values, without considering their direction. Here's a bit more detail:

### Definition
MAE is calculated by taking the average of the absolute differences between the predicted and actual values. Mathematically, it's expressed as:

$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} | y_i - \hat{y}_i | $$

where:
- \( n \) is the number of observations
- \( y_i \) is the actual value
- \( \hat{y}_i \) is the predicted value
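A direct translation of the formula, as a minimal sketch with made-up values (the function name is an illustrative choice):

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 9.0, 9.0]

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # (1 + 0 + 2 + 0) / 4 = 0.75
```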

### Characteristics
- **Interpretability**: MAE is easy to interpret as it provides the average error in the same units as the dependent variable.
- **Robustness**: MAE is less sensitive to outliers compared to Mean Squared Error (MSE) because it doesn't square the errors.
- **Range**: MAE can range from 0 (perfect prediction) to ∞, with lower values indicating better model performance.

### Advantages
- **Simplicity**: MAE is simple to understand and calculate.
- **Real-world relevance**: Since MAE uses the same units as the dependent variable, it's intuitively meaningful.

### Limitations
- **Equal weight to all errors**: MAE treats all errors equally, which might not always be desirable in situations where larger errors should be penalized more.

Here's a quick summary in a table form for clarity:

| Metric | Formula | Sensitivity to Outliers | Units | Interpretation |
|--------|---------|-------------------------|-------|----------------|
| MAE    | $$ \frac{1}{n} \sum \lvert y_i - \hat{y}_i \rvert $$ | Less sensitive  | Same as dependent variable | Average magnitude of errors |

#### 9. What are the benefits of using an ML pipeline

Using a Machine Learning (ML) pipeline offers several benefits, making the process of developing, deploying, and maintaining ML models more efficient and effective. Here are some key advantages:

1. **Automation**: ML pipelines automate repetitive tasks such as data preprocessing, feature engineering, model training, and evaluation. This reduces manual effort and speeds up the overall workflow.

2. **Consistency**: By defining a standardized sequence of steps, ML pipelines ensure that the same process is followed every time, leading to consistent results. This helps in reducing errors and improving reproducibility.

3. **Modularity**: ML pipelines are often built as modular components, allowing different stages (e.g., data preprocessing, model training) to be independently developed, tested, and reused. This makes it easier to update or swap out individual components without affecting the entire pipeline.

4. **Scalability**: Pipelines can handle large volumes of data and leverage distributed computing resources to scale the training and evaluation processes. This is particularly important for big data and complex models.

5. **Experimentation**: ML pipelines facilitate experimentation by enabling easy modifications and comparisons of different models, parameters, and preprocessing techniques. This accelerates the process of finding the best-performing model.

6. **Maintenance and Monitoring**: Pipelines can be integrated with monitoring tools to track the performance of deployed models in real-time. This helps in identifying issues, managing model drift, and ensuring the continued accuracy of predictions.

7. **Collaboration**: With a well-defined pipeline, multiple team members (e.g., data scientists, engineers) can collaborate more effectively, as each person can work on different stages of the pipeline without interfering with others.

8. **Efficiency**: Automating and streamlining the ML workflow reduces the time and effort required to build and deploy models, leading to faster delivery of insights and solutions.

Here’s a quick summary in a table format:

| Benefit               | Description                                                          |
|-----------------------|----------------------------------------------------------------------|
| Automation            | Reduces manual effort and speeds up the workflow                    |
| Consistency           | Ensures the same process is followed, leading to reproducible results|
| Modularity            | Allows independent development and reuse of pipeline components     |
| Scalability           | Handles large volumes of data and leverages distributed computing   |
| Experimentation       | Facilitates easy modifications and comparisons for better models    |
| Maintenance & Monitoring | Tracks performance and manages model drift in real-time         |
| Collaboration         | Enhances teamwork by defining clear stages and responsibilities     |
| Efficiency            | Streamlines the workflow, reducing time to deliver insights         |
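As one possible illustration, assuming scikit-learn is the library in use (the notes above do not name one), a pipeline can chain preprocessing and a model so that the same steps run at both fit and predict time. The data here are synthetic and the step names are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Scaling and the model are bundled into a single estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

pipe.fit(X[:80], y[:80])                  # scaler is fitted on training rows only
r2_test = pipe.score(X[80:], y[80:])      # same scaling is reused on held-out rows
print(round(r2_test, 3))
```

Bundling the scaler with the model means the held-out rows are transformed with statistics learned from the training rows only, which is the consistency benefit described in the table.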

#### 10.  Why is RMSE considered more interpretable than MSE

Root Mean Squared Error (RMSE) is often considered more interpretable than Mean Squared Error (MSE) because of the following reasons:

1. **Units of Measurement**: RMSE is in the same units as the dependent variable, making it more intuitively understandable. MSE, on the other hand, squares the errors, resulting in units that are the square of the dependent variable's units, which can be less intuitive.

2. **Magnitude of Errors**: RMSE reflects the typical magnitude of the errors on the same scale as the original data, making it easier to comprehend the size of a typical prediction error. In contrast, MSE can be harder to interpret due to the squaring of errors.

### Quick Comparison

| Metric | Interpretation | Units |
|--------|----------------|-------|
| MSE    | Average of squared errors | Squared units of the dependent variable |
| RMSE   | Square root of MSE, reflecting average error | Same units as the dependent variable |

By taking the square root of MSE, RMSE provides a more intuitive measure of model accuracy, which is why it's often preferred for interpreting model performance.

#### 11. What is pickling in Python, and how is it useful in ML

Pickling in Python refers to the process of serializing and deserializing objects so they can be saved to a file or transferred over a network and later restored back to their original state. This process is handled by the `pickle` module.

### How Pickling Works
- **Serialization (Pickling)**: Converts a Python object into a byte stream, which can be written to a file or sent over a network.
- **Deserialization (Unpickling)**: Converts the byte stream back into the original Python object.
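A minimal round trip with the standard-library `pickle` module might look like the sketch below. The `model` dictionary is a hypothetical stand-in for a trained model object, and the file name is arbitrary.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a trained model
model = {"weights": [0.5, -1.2], "intercept": 3.0}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"

    # Serialization (pickling): object -> byte stream on disk
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Deserialization (unpickling): byte stream -> equivalent object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)  # True: same contents, but a separate object
```

Unpickling can execute arbitrary code, so `pickle.load` should only be used on files from a trusted source.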


### Benefits of Pickling in Machine Learning
1. **Model Persistence**: You can save trained models to disk and load them later without having to retrain the model, saving time and computational resources.
   
2. **Data Storage**: Pickling allows you to save complex data structures (like dictionaries or custom objects) that you might use during the data preprocessing or feature engineering stages.

3. **Pipeline Portability**: You can pickle entire ML pipelines, including pre-processing steps, models, and post-processing steps, ensuring that the exact same process can be replicated or shared with others.

### Use Cases in ML
1. **Model Deployment**: Save trained models and deploy them in production environments for inference without retraining.
   
2. **Experimentation**: Save and reload models during experimentation to compare results without having to train the model each time.

3. **Data Sharing**: Share preprocessed datasets or feature sets with team members or across different environments.

#### 12.  What does a high R-squared value mean
A high R-squared (R²) value indicates that a large proportion of the variance in the dependent variable is explained by the independent variables in the regression model. In practical terms, it means the model fits the data well. Here are some key points:

1. **Good Fit**: A high R² value, close to 1, suggests that the model is capturing most of the underlying patterns in the data. This implies that the independent variables have a strong explanatory power over the dependent variable.

2. **Predictive Accuracy**: A high R² value means the model's fitted values are close to the actual values in the data used for fitting. Accuracy on new data should be checked separately, for example on a held-out set.

3. **Reduced Residuals**: With a high R², the residuals (the differences between the actual and predicted values) are typically smaller, indicating that the model's estimates are closer to reality.

However, it's important to be cautious:
- **Overfitting**: A very high R² might sometimes indicate overfitting, especially if the model is too complex for the data. Overfitting occurs when the model captures noise rather than the underlying relationship.
- **Context Matters**: The acceptable R² value can vary depending on the field of study. For example, in social sciences, an R² of 0.4 might be considered good, whereas in physical sciences, a higher threshold might be expected.

#### 13. What happens if linear regression assumptions are violated

Violating the assumptions of linear regression can impact the validity and reliability of your model's results. Here’s a breakdown of potential consequences for each assumption:

1. **Linearity**: If the relationship between the dependent and independent variables is not linear, the model may miss important patterns, leading to poor predictions and incorrect conclusions. Non-linear relationships can be addressed by transforming the variables or using non-linear regression techniques.

2. **Independence**: If the observations are not independent (e.g., in time series data), the standard errors of the coefficients may be underestimated, leading to overly optimistic p-values and confidence intervals. This can result in incorrect inferences about the significance of predictors. Techniques like time series analysis or mixed-effects models can help address this issue.

3. **Homoscedasticity**: If the residuals do not have constant variance (heteroscedasticity), the estimated coefficients may still be unbiased, but the standard errors may be incorrect. This can affect hypothesis tests and confidence intervals. Robust standard errors or weighted least squares can help mitigate this issue.

4. **No Perfect Multicollinearity**: With perfect multicollinearity the coefficients cannot be uniquely estimated at all, because one predictor is an exact linear combination of the others. Strong but imperfect multicollinearity still allows estimation, but leads to large standard errors and unstable coefficient estimates. Both cases can be addressed by removing or combining highly correlated predictors.

5. **Normality of Residuals**: If the residuals are not normally distributed, the inference based on the t-tests and F-tests may be invalid, particularly in small samples. However, in large samples, the Central Limit Theorem can mitigate this issue. Transforming the dependent variable or using bootstrapping can help address non-normality.

Here’s a summary table for quick reference:

| Assumption         | Consequence of Violation            | Potential Solutions                          |
|--------------------|-------------------------------------|----------------------------------------------|
| Linearity          | Missed patterns, poor predictions   | Transform variables, use non-linear models   |
| Independence       | Underestimated standard errors      | Use time series or mixed-effects models      |
| Homoscedasticity   | Incorrect standard errors, tests    | Robust standard errors, weighted least squares|
| No Multicollinearity | Unstable coefficients, large errors| Remove/combine correlated predictors         |
| Normality of Residuals | Invalid inference in small samples| Transform dependent variable, use bootstrapping|


#### 14. How can we address multicollinearity in regression

Addressing multicollinearity is crucial for ensuring the reliability and interpretability of a regression model. Here are some strategies to tackle multicollinearity:

1. **Remove Highly Correlated Predictors**: Identify and remove one of the highly correlated predictors. This is the simplest approach and can be effective if you can afford to lose one of the variables.

2. **Combine Predictors**: Combine the correlated variables into a single predictor through methods like:
   - **Principal Component Analysis (PCA)**: Transforms the correlated variables into a smaller set of uncorrelated components.
   - **Factor Analysis**: Groups correlated variables into underlying factors.

3. **Regularization Techniques**: Use regularization methods that can handle multicollinearity by adding a penalty to the regression model:
   - **Ridge Regression**: Adds a penalty term proportional to the square of the coefficients.
   - **Lasso Regression**: Adds a penalty term that can shrink some coefficients to zero, effectively performing variable selection.

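4. **Centering the Variables**: Center the predictors by subtracting the mean value from each predictor. This reduces the multicollinearity that interaction or polynomial terms introduce with their component variables. It does not change the correlation between two distinct predictors.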

5. **Variance Inflation Factor (VIF) Analysis**: Calculate the VIF for each predictor and remove or combine predictors with high VIF values. A VIF above 10 is often considered indicative of high multicollinearity.

6. **Domain Knowledge**: Use your knowledge of the subject matter to prioritize which variables are most important and which ones can be excluded without losing significant information.

Here’s a quick summary:

| Strategy                   | Description                                                         |
|----------------------------|---------------------------------------------------------------------|
| Remove Predictors          | Eliminate one of the highly correlated predictors                   |
| Combine Predictors         | Use PCA or factor analysis to create composite variables            |
| Regularization Techniques  | Apply Ridge or Lasso regression to handle multicollinearity         |
| Centering Variables        | Subtract the mean value from each predictor                         |
| VIF Analysis               | Calculate and address high VIF values                               |
| Domain Knowledge           | Use subject matter expertise to decide on variable importance       |
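The effect of centering is easiest to see with a polynomial term, which can be viewed as a predictor interacting with itself. The sketch below uses an assumed, evenly spaced predictor and compares the correlation between `x` and `x**2` before and after centering.

```python
import numpy as np

# Hypothetical predictor with strictly positive values
x = np.linspace(1.0, 10.0, 50)

# Raw x and its square are almost perfectly correlated
corr_raw = np.corrcoef(x, x ** 2)[0, 1]

# After centering, the squared term is nearly uncorrelated with x
x_centered = x - x.mean()
corr_centered = np.corrcoef(x_centered, x_centered ** 2)[0, 1]

print(round(corr_raw, 3), round(abs(corr_centered), 3))
```

The correlation drops to essentially zero here because the values are symmetric around their mean; with skewed data the reduction is usually smaller.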



#### 15.  How can feature selection improve model performance in regression analysis
Feature selection plays a crucial role in enhancing the performance of regression models. Here are some key benefits:

1. **Reduces Overfitting**: By removing irrelevant or redundant features, feature selection helps reduce the risk of overfitting. This ensures that the model generalizes better to new, unseen data.

2. **Improves Model Interpretability**: A model with fewer features is easier to interpret and understand. It helps in identifying the most important predictors and their relationship with the dependent variable.

3. **Enhances Model Accuracy**: Including only relevant features can improve the predictive accuracy of the model. Irrelevant features can introduce noise and reduce the model's performance.

4. **Reduces Complexity**: Feature selection simplifies the model, making it computationally efficient and faster to train and deploy. This is especially important for large datasets with many features.

5. **Reduces Multicollinearity**: By removing highly correlated predictors, feature selection helps in addressing multicollinearity, which can lead to more stable and reliable coefficient estimates.

6. **Facilitates Better Data Understanding**: By focusing on a smaller set of relevant features, it becomes easier to gain insights into the underlying patterns and relationships in the data.

### Common Feature Selection Methods

1. **Filter Methods**: Use statistical measures to evaluate the relevance of features independently of the model.
   - Examples: Correlation coefficients, ANOVA F-test, Chi-square test (for categorical features and targets).

2. **Wrapper Methods**: Evaluate subsets of features by training and testing a specific model.
   - Examples: Recursive Feature Elimination (RFE), Forward Selection, Backward Elimination.

3. **Embedded Methods**: Perform feature selection during the model training process.
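   - Examples: Lasso Regression, Elastic Net, tree-based feature importances. Ridge Regression shrinks coefficients but does not set them exactly to zero, so it does not select features on its own.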
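A filter method can be sketched in a few lines: score each feature by its absolute correlation with the target and keep the top ones. The data are synthetic, and the choice of keeping two features is an assumption made for the example.

```python
import numpy as np

# Synthetic data: only features 0 and 2 actually drive the target
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Filter step: absolute Pearson correlation of each feature with y
scores = np.array(
    [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
)

# Keep the two highest-scoring features (cut-off chosen for illustration)
ranking = np.argsort(scores)[::-1]
selected = sorted(ranking[:2].tolist())
print(selected)
```

The scores are computed without fitting any regression model, which is what distinguishes a filter method from wrapper and embedded methods.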

### Summary Table

| Benefit                    | Description                                                        |
|----------------------------|--------------------------------------------------------------------|
| Reduces Overfitting        | Minimizes the risk of overfitting by eliminating irrelevant features|
| Improves Interpretability  | Makes the model easier to understand by focusing on key predictors |
| Enhances Accuracy          | Increases predictive performance by reducing noise                |
| Reduces Complexity         | Simplifies the model, making it computationally efficient         |
| Addresses Multicollinearity| Removes highly correlated predictors for stable estimates         |
| Better Data Understanding  | Facilitates insights into underlying data patterns                |


#### 16. How is Adjusted R-squared calculated

Adjusted R-squared (\(R^2_{\text{adj}}\)) is a modified version of R-squared (\(R^2\)) that accounts for the number of predictors in the model. It adjusts for the potential inflation of R-squared when unnecessary predictors are added. Here's the formula to calculate Adjusted R-squared:

\[ R^2_{\text{adj}} = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - k - 1} \right) \]

where:
- \( R^2 \) is the R-squared value
- \( n \) is the number of observations
- \( k \) is the number of independent variables (predictors)

### Key Points
- **Penalization**: Adjusted R-squared imposes a penalty for adding more variables to the model, helping to prevent overfitting.
- **Interpretation**: While R-squared increases or stays the same as more predictors are added, Adjusted R-squared can decrease if the added predictors do not improve the model.
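The formula translates directly into code. This is a small sketch with made-up inputs; the function name and the guard on `n - k - 1` are illustrative choices.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 from R^2, number of observations n, and number of predictors k."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Strong fit with few predictors: small penalty
adj_high = adjusted_r_squared(0.90, n=50, k=5)
print(round(adj_high, 4))  # 1 - 0.10 * 49 / 44, about 0.8886

# Weak fit with many predictors relative to n: the value goes negative
adj_low = adjusted_r_squared(0.05, n=20, k=5)
print(round(adj_low, 4))   # 1 - 0.95 * 19 / 14, about -0.2893
```

The second case shows that Adjusted R-squared is not bounded below by 0.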

#### 17.  Why is MSE sensitive to outliers

Mean Squared Error (MSE) is sensitive to outliers because it squares the errors (the differences between the actual and predicted values). This squaring operation amplifies larger errors disproportionately compared to smaller errors. Here's why this happens:

1. **Squaring Effect**: The squared error grows quadratically with the size of the error. For example, an error of 10 becomes 100 when squared, whereas an error of 1 stays at 1. This means that a few large errors can dominate the MSE, overshadowing smaller errors and making the metric sensitive to outliers.

2. **Influence on Model**: Outliers can significantly impact the calculated MSE, leading to an overestimation of the overall error. This can make the model appear worse than it is for the majority of the data points.

3. **Misleading Performance**: Because MSE gives more weight to larger errors, it can be misleading in datasets where outliers are present. It might suggest that the model is performing poorly, even if it fits most of the data well.
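A small numeric sketch makes the effect visible. The error values are invented: four errors of magnitude 1, with the fifth either 1 or 10.

```python
def mse(errors):
    """Mean of the squared errors."""
    return sum(e ** 2 for e in errors) / len(errors)


def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical prediction errors, without and with one outlier
clean = [1.0, -1.0, 1.0, -1.0, 1.0]
with_outlier = [1.0, -1.0, 1.0, -1.0, 10.0]

print(mse(clean), mae(clean))                # 1.0 and 1.0
print(mse(with_outlier), mae(with_outlier))  # about 20.8 and 2.8
```

A single outlier raises the MSE by a factor of roughly 21, while the MAE rises by a factor of under 3.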

#### 18. What is the role of homoscedasticity in linear regression

Homoscedasticity, or constant variance of the residuals, is an important assumption in linear regression. Here’s why it matters:

### Role of Homoscedasticity

1. **Efficient Estimates**: OLS coefficient estimates are unbiased whether or not the residual variance is constant, but they are efficient only under homoscedasticity. When the residuals have constant variance and the other Gauss-Markov assumptions hold, the model provides the best linear unbiased estimates (BLUE) of the coefficients.

2. **Valid Inferences**: Homoscedasticity ensures that the standard errors of the coefficients are accurately estimated. This is crucial for valid hypothesis testing and constructing reliable confidence intervals. If the residuals have constant variance, the usual t-tests and F-tests remain valid.

3. **Predictive Accuracy**: When the assumption of homoscedasticity holds, the model's predictions are more reliable. Homoscedastic residuals indicate that the model's predictive performance is consistent across all levels of the independent variables.

### Consequences of Violating Homoscedasticity

If the assumption of homoscedasticity is violated (i.e., if the residuals exhibit heteroscedasticity):

1. **Biased Standard Errors**: The standard errors of the coefficients may be biased, leading to incorrect inferences. This can result in misleading p-values and confidence intervals, affecting the reliability of hypothesis tests.

2. **Inefficient Estimates**: The coefficient estimates remain unbiased but become inefficient, meaning they no longer have the smallest possible variance among all linear unbiased estimators.

3. **Loss of Predictive Performance**: The model's predictive performance can be inconsistent, as the variance of the errors changes across different levels of the independent variables.

### Addressing Heteroscedasticity

Several methods can be used to address heteroscedasticity:

1. **Transformations**: Apply a transformation to the dependent variable, such as the logarithm or square root, to stabilize the variance of the residuals.

2. **Weighted Least Squares (WLS)**: Use WLS instead of Ordinary Least Squares (OLS) to give less weight to observations with larger error variance. The weights are typically chosen proportional to the inverse of the error variance.

3. **Robust Standard Errors**: Use robust standard errors (also known as heteroscedasticity-consistent standard errors) to adjust for heteroscedasticity without transforming the data.

Here's a quick summary:

| Role                        | Description                                                    |
|-----------------------------|----------------------------------------------------------------|
| Efficient Estimates         | Makes the unbiased OLS estimates efficient (BLUE)              |
| Valid Inferences            | Maintains accurate standard errors for reliable hypothesis tests |
| Predictive Accuracy         | Improves consistency of model predictions                      |
| Consequences of Violation   | Biased standard errors, inefficient estimates, inconsistent predictions |
| Addressing Heteroscedasticity | Transformations, Weighted Least Squares, Robust Standard Errors|

Maintaining homoscedasticity helps ensure that the linear regression model performs accurately and reliably.

#### 19. What is Root Mean Squared Error (RMSE)

Root Mean Squared Error (RMSE) is a commonly used metric to evaluate the accuracy of a regression model. It measures the average magnitude of the errors between the predicted and actual values, giving higher weight to larger errors. Here's a breakdown:

### Definition
RMSE is the square root of the average of the squared differences between the predicted and actual values. Mathematically, it's expressed as:

\[ \text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 } \]

where:
- \( n \) is the number of observations
- \( y_i \) is the actual value
- \( \hat{y}_i \) is the predicted value

### Characteristics
- **Units of Measurement**: RMSE is expressed in the same units as the dependent variable, making it intuitively understandable.
- **Sensitivity to Outliers**: RMSE is sensitive to outliers because the errors are squared, amplifying the impact of larger errors.
- **Interpretability**: RMSE gives a clear sense of the average magnitude of prediction errors in the same units as the dependent variable.

### Advantages
- **Simplicity**: RMSE is easy to understand and calculate.
- **Penalty for Larger Errors**: By squaring the errors, RMSE penalizes larger errors more, highlighting their impact on model performance.

### Limitations
- **Sensitivity to Outliers**: The squaring of errors makes RMSE highly sensitive to outliers, which can sometimes distort the overall error measure.

Here's a quick summary in table form:

| Metric | Formula | Sensitivity to Outliers | Units | Interpretation |
|--------|---------|-------------------------|-------|----------------|
| RMSE   | $$ \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 } $$ | Sensitive  | Same as dependent variable | Average magnitude of errors |
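
To make the formula concrete, here is a small worked computation with NumPy. The four actual and predicted values are made up for illustration only.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical actual values
y_hat = np.array([2.5, 5.0, 4.0, 8.0])    # hypothetical predictions

errors = y_true - y_hat                   # [0.5, 0.0, -1.5, -1.0]
mse = np.mean(errors ** 2)                # (0.25 + 0 + 2.25 + 1) / 4 = 0.875
rmse = np.sqrt(mse)                       # square root of the MSE
mae = np.mean(np.abs(errors))             # (0.5 + 0 + 1.5 + 1) / 4 = 0.75

print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}, MAE = {mae:.3f}")
# MSE = 0.875, RMSE = 0.935, MAE = 0.750
```

RMSE comes out larger than MAE here because the single error of 1.5 is squared before averaging, which is the outlier sensitivity described above.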


#### 20. Why is pickling considered risky

Pickling in Python can be risky for several reasons:

1. **Security Risks**: Pickle can execute arbitrary code during deserialization, which makes it vulnerable to malicious attacks. If untrusted data is unpickled, it could execute harmful code, leading to security breaches.

2. **Data Corruption**: Pickled files can become corrupted, leading to data loss. If a pickled file is altered or damaged, it might be impossible to unpickle it successfully.

3. **Compatibility Issues**: Pickled objects are not always compatible between different versions of Python or different environments. This can cause issues when sharing pickled data across different systems.

4. **Limited Language Support**: Pickling is specific to Python, meaning that pickled data is not easily readable or usable by programs written in other languages.

5. **Maintenance Challenges**: Over time, as code evolves, the structure of pickled objects might change, making it difficult to unpickle old data with newer versions of the code.

### Summary

| Risk                | Description                                                               |
|---------------------|---------------------------------------------------------------------------|
| Security Risks      | Vulnerable to executing arbitrary code, leading to potential attacks      |
| Data Corruption     | Risk of data loss if pickled files become corrupted                       |
| Compatibility Issues| Incompatibility between different Python versions or environments         |
| Limited Language Support | Specific to Python, not easily usable by other languages              |
| Maintenance Challenges  | Difficult to unpickle old data with evolving code structures          |

Considering these risks, it's important to use pickling cautiously and only with trusted data sources. Alternatives like JSON or specialized serialization libraries can provide safer options for data serialization.
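
The security risk can be seen in a harmless sketch. The `Payload` class below is hypothetical and exists only to illustrate the mechanism: `__reduce__` lets an object tell pickle which callable to run on load. A plain `print` is used here, but any importable callable could stand in its place.

```python
import pickle


class Payload:
    """Hypothetical class used only to illustrate the mechanism."""

    def __reduce__(self):
        # pickle stores "call this callable with these arguments" and runs it on load.
        # A harmless print is used here; an attacker could name any importable callable.
        return (print, ("this line was executed by pickle.loads",))


blob = pickle.dumps(Payload())
result = pickle.loads(blob)   # runs print(...) instead of rebuilding a Payload
print(type(result))           # the "unpickled object" is just print's return value
```

Nothing in the byte string looks like a model or a data structure, yet loading it runs code. This is why pickle files should only be loaded from sources you trust.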

#### 21.  What alternatives exist to pickling for saving ML models

There are several alternatives to pickling for saving machine learning models that address some of the risks associated with pickling. Here are a few popular ones:

1. **Joblib**: Joblib is optimized for serializing objects that contain large NumPy arrays, so it is often faster and more compact than plain pickle for scikit-learn models. Note that joblib is built on top of pickle, so it carries the same security risk: only load files from sources you trust.

   ```python
   import joblib

   # Save model
   joblib.dump(model, 'model.joblib')

   # Load model
   model = joblib.load('model.joblib')
   ```

2. **JSON**: JSON is a lightweight data-interchange format that is easy to read and write. You can serialize the model parameters and architecture to JSON format. This approach is language-agnostic and can be used across different programming languages.

   ```python
   import json

   # Save model parameters to JSON
   # (`model_params` is assumed to be a dict of JSON-serializable values, e.g. lists of floats)
   with open('model.json', 'w') as f:
       json.dump(model_params, f)

   # Load model parameters from JSON
   with open('model.json', 'r') as f:
       model_params = json.load(f)
   ```

3. **HDF5**: Hierarchical Data Format (HDF5) is a file format designed to store and organize large amounts of data. Libraries like h5py or TensorFlow's `tf.keras` API support saving models in HDF5 format.

   ```python
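   import tensorflow as tf

   # `model` is assumed to be a trained tf.keras model;
   # the .h5 extension selects the HDF5 format (h5py must be installed)
   model.save('model.h5')

   # Load model
   model = tf.keras.models.load_model('model.h5')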
   ```

4. **ONNX (Open Neural Network Exchange)**: ONNX is an open-source format for representing machine learning models. It allows models to be shared across different frameworks, promoting interoperability.

   ```python
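   import onnx
   import onnxruntime as ort

   # `onnx_model` is assumed to be an ONNX ModelProto produced by a converter
   # (for example skl2onnx or torch.onnx.export); onnx.save cannot take a raw model
   onnx.save(onnx_model, 'model.onnx')

   # Load the ONNX file for inference
   ort_session = ort.InferenceSession('model.onnx')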
   ```

5. **Model-Specific Serialization**: Many machine learning libraries have their own built-in methods for saving and loading models. For example, Keras models provide `model.save`, and PyTorch provides `torch.save`. scikit-learn has no format of its own; its documentation points to pickle-based tools such as `joblib`.

   ```python
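   import torch

   # TensorFlow/Keras example (`model` is a Keras model; recent Keras versions expect a .keras or .h5 extension)
   model.save('model.keras')

   # PyTorch example (`model` is a torch.nn.Module)
   torch.save(model.state_dict(), 'model_path.pth')
   model.load_state_dict(torch.load('model_path.pth'))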
   ```

Here's a quick summary in a table format:

| Method           | Description                                     | Use Case                                                |
|------------------|-------------------------------------------------|---------------------------------------------------------|
| Joblib           | Efficient serialization for large NumPy arrays  | Saving large datasets and models                        |
| JSON             | Lightweight, language-agnostic data format      | Saving model parameters and architecture                |
| HDF5             | File format for large data storage              | Saving complex model structures                         |
| ONNX             | Open-source format for model interoperability   | Sharing models across different frameworks              |
| Model-Specific   | Library-specific serialization methods          | Using built-in methods for TensorFlow, PyTorch, etc.    |
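
For a linear model, the JSON route is simple enough to sketch end to end. This is an illustrative example on noise-free synthetic data; the file name `linear_model.json` is made up, and the approach only works for models whose state is a handful of plain numbers.

```python
import json
import os
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a small model on noise-free synthetic data: y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
X_json = rng.normal(size=(50, 2))
y_json = 1.0 + 2.0 * X_json[:, 0] - 3.0 * X_json[:, 1]
json_model = LinearRegression().fit(X_json, y_json)

# Save only plain numbers, never executable objects
params = {"coef": json_model.coef_.tolist(), "intercept": float(json_model.intercept_)}
path = os.path.join(tempfile.gettempdir(), "linear_model.json")  # hypothetical file name
with open(path, "w") as f:
    json.dump(params, f)

# Load the numbers back and predict with nothing but NumPy
with open(path) as f:
    loaded = json.load(f)
restored_pred = X_json @ np.array(loaded["coef"]) + loaded["intercept"]

print(np.allclose(restored_pred, json_model.predict(X_json)))
```

Because the file contains only numbers, loading it cannot execute code, and the same file can be read from any language with a JSON parser.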


#### 22.  What is heteroscedasticity, and why is it a problem


Heteroscedasticity refers to a condition in regression analysis where the variance of the residuals (errors) is not constant across all levels of the independent variables. Instead, the spread of the residuals changes, leading to an unequal distribution.

### Why Heteroscedasticity is a Problem

1. **Biased Standard Errors**: Heteroscedasticity can result in biased estimates of the standard errors of the regression coefficients. This affects hypothesis testing, leading to unreliable p-values and confidence intervals. Consequently, you might incorrectly judge the significance of predictors.

2. **Inefficient Estimates**: Although the regression coefficients themselves remain unbiased, they become inefficient. This means that the estimates do not have the minimum possible variance, leading to less precise predictions.

3. **Invalid Inferences**: The usual t-tests and F-tests assume homoscedasticity. When heteroscedasticity is present, the results of these tests can be misleading, potentially leading to incorrect conclusions about the relationships between variables.

4. **Distorted Model Performance**: If the variance of the residuals is not constant, the model's predictive performance can vary across different levels of the independent variables. This means the model may perform well in certain ranges but poorly in others, leading to inconsistent predictions.

### Detecting Heteroscedasticity

- **Residual Plots**: Plotting the residuals against the fitted values or independent variables can visually indicate heteroscedasticity if there is a pattern or funnel shape.
- **Breusch-Pagan Test**: A statistical test specifically designed to detect heteroscedasticity.
- **White Test**: Another test used to check for heteroscedasticity by examining whether the variance of the residuals is related to the independent variables.

### Addressing Heteroscedasticity

- **Transformations**: Apply transformations like the logarithm, square root, or Box-Cox transformation to the dependent variable to stabilize the variance.
- **Weighted Least Squares (WLS)**: Use WLS instead of Ordinary Least Squares (OLS) to give different weights to observations, reducing the impact of heteroscedasticity.
- **Robust Standard Errors**: Use robust standard errors to adjust for heteroscedasticity without transforming the data.

Here’s a quick summary:

| Issue                      | Description                                                   |
|----------------------------|---------------------------------------------------------------|
| Biased Standard Errors     | Leads to unreliable p-values and confidence intervals         |
| Inefficient Estimates      | Reduces the precision of regression coefficients              |
| Invalid Inferences         | Misleading hypothesis tests                                   |
| Distorted Model Performance| Inconsistent predictions across different data ranges         |
| Detection Methods          | Residual plots, Breusch-Pagan test, White test                |
| Solutions                  | Transformations, Weighted Least Squares, Robust Standard Errors|


#### 23.  How can interaction terms enhance a regression model's predictive power?
Interaction terms can significantly enhance a regression model's predictive power by capturing the combined effects of two or more independent variables on the dependent variable. Here’s how:

### Capturing Combined Effects
Interaction terms allow the model to account for situations where the effect of one independent variable depends on the level of another independent variable. This is particularly useful in complex scenarios where variables do not operate independently but interact with each other.

### Improving Model Fit
By including interaction terms, the model can better fit the data, reducing residual variance and improving overall predictive accuracy. This leads to more accurate predictions and a better understanding of the relationships between variables.

### Revealing Insights
Interaction terms can reveal hidden relationships that are not apparent when considering variables individually. For example, in a marketing context, the combined effect of price and promotion might significantly influence sales, more than either factor alone.

### Addressing Non-Linearity
In some cases, interaction terms can help address non-linear relationships between variables, providing a more accurate representation of the underlying data patterns.

### Example
Consider a regression model with two independent variables, \(X_1\) and \(X_2\). An interaction term \(X_1 \times X_2\) can be included in the model:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \times X_2) + \epsilon \]

In this model, \(\beta_3\) represents the interaction effect, showing how the relationship between \(X_1\) and \(Y\) changes with different levels of \(X_2\).

### Summary Table

| Benefit                  | Description                                                   |
|--------------------------|---------------------------------------------------------------|
| Capturing Combined Effects | Accounts for the combined influence of variables              |
| Improving Model Fit      | Reduces residual variance and enhances predictive accuracy    |
| Revealing Insights       | Uncovers hidden relationships between variables               |
| Addressing Non-Linearity | Helps model non-linear relationships                          |
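
A small simulation illustrates the gain in fit. The data are synthetic, and the true coefficients (including an interaction coefficient of 5) are assumptions chosen for the example. In the model above, the effect of \(X_1\) on \(Y\) is \(\beta_1 + \beta_3 X_2\), which is what the product column lets the regression estimate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): the true model contains an interaction with beta_3 = 5
rng = np.random.default_rng(0)
X_two = rng.uniform(0, 1, size=(500, 2))
x1, x2 = X_two[:, 0], X_two[:, 1]
y_int = 1.0 + 2.0 * x1 + 3.0 * x2 + 5.0 * x1 * x2 + rng.normal(scale=0.1, size=500)

# Model without the interaction term
base = LinearRegression().fit(X_two, y_int)
r2_base = base.score(X_two, y_int)

# Model with the product x1 * x2 added as a third feature
X_inter = np.column_stack([x1, x2, x1 * x2])
inter = LinearRegression().fit(X_inter, y_int)
r2_inter = inter.score(X_inter, y_int)

print(f"R2 without interaction = {r2_base:.3f}")
print(f"R2 with interaction    = {r2_inter:.3f}")
print(f"Estimated beta_3       = {inter.coef_[2]:.2f}")
```

The same product column can be generated with `PolynomialFeatures(degree=2, interaction_only=True)` when there are many features.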


# **Practical**

#### 1. Write a Python script to visualize the distribution of errors (residuals) for a multiple linear regression model using Seaborn's "diamonds" dataset.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [None]:
# sns.get_dataset_names()
diamond_data = sns.load_dataset('diamonds')

In [None]:
# diamond_data.info()
# #  diamond_data["cut"].value_counts()
# diamond_data["cut"] = diamond_data["cut"].map({"Ideal":1,"Premium":2,"Very Good":3,"Good":4,"Fair":5})

# # diamond_data["color"].value_counts()
# diamond_data["color"] = diamond_data["color"].map({"G":1,"E":2,"F":3,"H":4,"D":5,"I":6,"J":7})

# # diamond_data["clarity"].value_counts()
# diamond_data["clarity"] = diamond_data["clarity"].map({"SI1":1,"VS2":2,"SI2":3,"VS1":4,"VVS2":5,"VVS1":6,"IF":7,"I1":8})

In [None]:
diamond_data.drop(columns=["cut","color","clarity"],inplace=True)

In [None]:
sns.heatmap(diamond_data.corr(),annot=True)

In [None]:
X = diamond_data.drop(columns=["price"])
y = diamond_data["price"]

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.24,random_state=1)

print(f"Shape of X_train = {X_train.shape} \nShape of X_test = {X_test.shape}")
print(f"Shape of y_train = {y_train.shape} \nShape of y_test = {y_test.shape}")

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model

In [None]:
model.fit(X_train,y_train)

In [None]:
y_pred = model.predict(X_test)

from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
print(f"Mean Absolute Error = {mean_absolute_error(y_test,y_pred)}")
print(f"Mean Squared Error = {mean_squared_error(y_test,y_pred)}")
r2_score(y_test,y_pred)

In [None]:
residuals = y_test - y_pred
# A histogram with a KDE curve shows the distribution of the residuals
sns.histplot(residuals, kde=True)
plt.axvline(x=0,color="red",linestyle="--")
plt.title("Distribution of Residuals")

#### 2. Write a Python script to calculate and print Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) for a linear regression model

In [None]:
from sklearn.metrics import mean_squared_error,mean_absolute_error, root_mean_squared_error
print(f"Mean Squared Error = {round(mean_squared_error(y_test,y_pred),3)}")
print(f"Mean Absolute Error = {round(mean_absolute_error(y_test,y_pred),3)}")
print(f"Root Mean Squared Error = {round(root_mean_squared_error(y_test,y_pred),3)}")

#### 3. Write a Python script to check if the assumptions of linear regression are met. Use a scatter plot to check linearity, residuals plot for homoscedasticity, and correlation matrix for multicollinearity.

In [None]:
# Assumption 1: check the linearity of each feature in the dataset
# plt.title("To Check Linearity")
sns.pairplot(diamond_data)
plt.show()

In [None]:
# Assumption 2: no or little multicollinearity

from statsmodels.stats.outliers_influence import variance_inflation_factor
# Compute the VIF for every feature column (start at index 0 so the first column is included)
vif = {X.columns[i] : variance_inflation_factor(X.values,i) for i in range(X.shape[1])}
print(vif)

sns.clustermap(diamond_data.corr(),annot=True)

In [None]:
# Residuals vs. fitted values plot for homoscedasticity
residuals = y_test - y_pred
sns.scatterplot(x=y_pred, y=np.asarray(residuals))
plt.axhline(y=0,color="red",linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals Plot")

#### 4. Write a Python script that creates a machine learning pipeline with feature scaling and evaluates the performance of different regression models

In [None]:
# Create Simple Dataset
# sns.get_dataset_names()
data = sns.load_dataset('mpg')
data.head()

In [None]:
# Data Preprocessing
data.drop("name",axis=1,inplace=True)

In [None]:
data.info()

In [None]:
# Feature Engineering
# Plot the distribution of horsepower itself (not of its value counts)
sns.histplot(data["horsepower"], kde=True)
plt.show()
# horsepower contains outliers, so we fill the null values with the median

In [None]:
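# Assign the result back; in-place fillna on a selected column is unreliable in recent pandas
data["horsepower"] = data["horsepower"].fillna(data["horsepower"].median())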

In [None]:
data["origin"].value_counts()

In [None]:
data["origin"] = data["origin"].map({"usa":1,"europe":2,"japan":3})

In [None]:
# Make Independent Features and Target Feature
X = data.drop("mpg",axis=1)
y = pd.DataFrame(data["mpg"])

In [None]:
X.isnull().sum()

In [None]:
# Split the Data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=1)
print(f"Shape of X_train = {X_train.shape} \nShape of X_test = {X_test.shape}")
print(f"Shape of y_train = {y_train.shape} \nShape of y_test = {y_test.shape}")

In [None]:
# Make The Model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model

In [None]:
model.fit(X_train,y_train)

In [None]:
# Model Evaluation
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
y_pred = model.predict(X_test)

print(f"Mean Absolute Error = {mean_absolute_error(y_test,y_pred)}")
print(f"Mean Squared Error = {mean_squared_error(y_test,y_pred)}")
print(f"Root Mean Squared Error = {np.sqrt(mean_squared_error(y_test,y_pred))}")
print(f"R2 Score = {r2_score(y_test,y_pred)}")
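
The exercise also asks for a pipeline with feature scaling and a comparison of several models. One way this could look is sketched below. As assumptions for the sketch, it uses scikit-learn's bundled diabetes dataset instead of `mpg` so that it runs without a download, and the three estimators and their `alpha` values are arbitrary choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundled dataset (assumption: stands in for any numeric feature table)
X_d, y_d = load_diabetes(return_X_y=True)
X_d_train, X_d_test, y_d_train, y_d_test = train_test_split(
    X_d, y_d, test_size=0.25, random_state=1
)

# Candidate models; the alpha values are illustrative, not tuned
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
}

scores = {}
for name, estimator in candidates.items():
    # The scaler is fitted on the training split only, inside the pipeline
    pipe = Pipeline([("scaler", StandardScaler()), ("model", estimator)])
    pipe.fit(X_d_train, y_d_train)
    scores[name] = r2_score(y_d_test, pipe.predict(X_d_test))
    print(f"{name}: R2 = {scores[name]:.3f}")
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into training, which is the main reason to prefer a pipeline over scaling by hand.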

#### 5.  Implement a simple linear regression model on a dataset and print the model's coefficients, intercept, and R-squared score

In [None]:
# data.head()
# Build a simple linear regression model on the mpg dataset: the single feature is horsepower and the target is mpg
simple_data = pd.DataFrame()
simple_data["horsepower"] = data["horsepower"]
simple_data["mpg"] = data["mpg"]

In [None]:
# horsepower contains some null values; assign the filled column back instead of using inplace
simple_data['horsepower'] = simple_data['horsepower'].fillna(simple_data['horsepower'].median())

In [None]:
X = simple_data.drop("mpg",axis=1)
y = simple_data["mpg"]

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=1)
print(f"Shape of X_train = {X_train.shape} \nShape of X_test = {X_test.shape}")
print(f"Shape of y_train = {y_train.shape} \nShape of y_test = {y_test.shape}")


In [None]:
# make the model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model

In [None]:
model.fit(X_train,y_train)

In [None]:
# model evaluation
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
y_pred = model.predict(X_test)

# The exercise asks for the coefficient and intercept as well as the R-squared score
print(f"Coefficient = {model.coef_}")
print(f"Intercept = {model.intercept_}")
print(f"Mean Absolute Error = {mean_absolute_error(y_test,y_pred)}")
print(f"Mean Squared Error = {mean_squared_error(y_test,y_pred)}")
print(f"Root Mean Squared Error = {np.sqrt(mean_squared_error(y_test,y_pred))}")
print(f"R2 Score = {r2_score(y_test,y_pred)}")

#### 6. Write a Python script that analyzes the relationship between total bill and tip in the 'tips' dataset using simple linear regression and visualizes the results

In [None]:
tips_data = sns.load_dataset('tips')
tips_data.head(2)

In [None]:
tips_data.info()

In [None]:
sns.scatterplot(x="total_bill",y="tip",data=tips_data)
plt.title("Relation Total Bill And Tip")
plt.show()

In [None]:
X = tips_data["total_bill"]
y = tips_data["tip"]

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=1)
print(f"Shape of X_train = {X_train.shape} \nShape of X_test = {X_test.shape}")
print(f"Shape of y_train = {y_train.shape} \nShape of y_test = {y_test.shape}")

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model

In [None]:
X_train = X_train.values.reshape(-1,1)
X_test = X_test.values.reshape(-1,1)

In [None]:
model.fit(X_train,y_train)

In [None]:
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
y_pred = model.predict(X_test)

print(f"Mean Absolute Error = {mean_absolute_error(y_test,y_pred)}")
print(f"Mean Squared Error = {mean_squared_error(y_test,y_pred)}")
print(f"Root Mean Squared Error = {np.sqrt(mean_squared_error(y_test,y_pred))}")

#### 7. Write a Python script that fits a linear regression model to a synthetic dataset with one feature. Use the model to predict new values and plot the data points along with the regression line

In [None]:
# Generate synthetic dataset
np.random.seed(42)
X = 2 * np.random.rand(1000, 1)
y = 4 + 3 * X + np.random.randn(1000, 1)

In [None]:
# make the linear regression model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=1)
print(f"Shape of X_train = {X_train.shape} \nShape of X_test = {X_test.shape}")
print(f"Shape of y_train = {y_train.shape} \nShape of y_test = {y_test.shape}")

In [None]:
model.fit(X_train,y_train)

In [None]:
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
y_pred = model.predict(X_test)

print(f"Mean Absolute Error = {mean_absolute_error(y_test,y_pred)}")
print(f"Mean Squared Error = {mean_squared_error(y_test,y_pred)}")
print(f"Root Mean Squared Error = {np.sqrt(mean_squared_error(y_test,y_pred))}")

In [None]:
plt.scatter(X_test,y_test,color="black",label="Data Points")
plt.plot(X_test,y_pred,color="brown",linewidth=2,label="Regression Line")
plt.xlabel("Feature")
plt.ylabel("Target")
plt.title("Linear Regression")
plt.legend()
plt.show()

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate synthetic dataset
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Fit a linear regression model
model = LinearRegression()
model.fit(X, y)

# Predict new values
X_new = np.array([[0], [2]])
y_predict = model.predict(X_new)

# Plot the data points and the regression line
plt.scatter(X, y, color='blue', label='Data Points')
plt.plot(X_new, y_predict, color='red', linewidth=2, label='Regression Line')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Linear Regression')
plt.legend()
plt.show()


#### 8. Write a Python script that pickles a trained linear regression model and saves it to a file.

In [None]:
# import pickle
import pickle

# Serialization: the context manager closes the file even if dump fails
filename = "model.pkl"
with open(filename, "wb") as f:
    pickle.dump(model, f)

In [None]:
# Deserialization (only unpickle files you trust)
with open(filename, "rb") as f:
    pickled_model = pickle.load(f)
y_pred = pickled_model.predict(X_test)

#### 9. Write a Python script that fits a polynomial regression model (degree 2) to a dataset and plots the regression curve

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Generate sample data
np.random.seed(0)
X = np.random.rand(100, 1) * 10  # Values between 0 and 10
y = 2 * X**2 + 3 * X + 5 + np.random.randn(100, 1) * 10  # Quadratic relationship with noise

# Create polynomial features (degree 2)
poly_features = PolynomialFeatures(degree=2)
X_poly = poly_features.fit_transform(X)

# Fit the polynomial regression model
model = LinearRegression()
model.fit(X_poly, y)

# Generate points for plotting the curve
X_plot = np.linspace(0, 10, 100).reshape(-1, 1)
X_plot_poly = poly_features.transform(X_plot)
y_plot = model.predict(X_plot_poly)

# Plot the data points and the regression curve
plt.scatter(X, y, color='blue', label='Data Points')
plt.plot(X_plot, y_plot, color='red', linewidth=2, label='Regression Curve (Degree 2)')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression (Degree 2)')
plt.legend()
plt.show()

#### 10. Generate synthetic data for simple linear regression (use random values for X and y) and fit a linear regression model to the data. Print the model's coefficient and intercept.

In [None]:
# Generate synthetic dataset
np.random.seed(42)  # fixed seed so the printed coefficient and intercept are reproducible
X = 2 * np.random.rand(1000, 1)
y = 4 - 3 * X + np.random.randn(1000, 1)

In [None]:
plt.scatter(X,y)
plt.title("Synthetic Data")
plt.xlabel("Feature")
plt.ylabel("Target")
plt.show()

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model

In [None]:
model.fit(X,y)

In [None]:
print(f"Coefficient = {model.coef_}")
print(f"Intercept = {model.intercept_}")

#### 11. Write a Python script that fits polynomial regression models of different degrees to a synthetic dataset and compares their performance
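
One possible approach is sketched below. The data are synthetic with an assumed cubic ground truth, and the list of degrees is an arbitrary choice for illustration. Each model is scored on a held-out test split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): cubic relationship plus Gaussian noise
rng = np.random.default_rng(0)
X_cmp = rng.uniform(-3, 3, size=(300, 1))
y_cmp = 0.5 * X_cmp[:, 0] ** 3 - X_cmp[:, 0] ** 2 + 2 + rng.normal(scale=2.0, size=300)

X_cmp_train, X_cmp_test, y_cmp_train, y_cmp_test = train_test_split(
    X_cmp, y_cmp, test_size=0.25, random_state=1
)

results = {}
for degree in [1, 2, 3, 5, 10]:
    # PolynomialFeatures expands X; LinearRegression fits the expanded features
    poly_model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    poly_model.fit(X_cmp_train, y_cmp_train)
    pred = poly_model.predict(X_cmp_test)
    results[degree] = {
        "rmse": float(np.sqrt(mean_squared_error(y_cmp_test, pred))),
        "r2": float(r2_score(y_cmp_test, pred)),
    }
    print(f"degree {degree:2d}: RMSE = {results[degree]['rmse']:.3f}, R2 = {results[degree]['r2']:.3f}")
```

Degrees below the true one underfit, while much higher degrees tend to fit the training data more closely without improving the test scores.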

#### 12.  Write a Python script that fits a simple linear regression model with two features and prints the model's coefficients, intercept, and R-squared score

In [None]:
# Generate synthetic data with two features
np.random.seed(0)
X = pd.DataFrame(np.random.rand(1000, 2), columns=['feature1', 'feature2'])
y = 2 + 3 * X['feature1'] + 4 * X['feature2'] + np.random.randn(1000)

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=1)

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model

In [None]:
model.fit(X_train,y_train)

In [None]:
from sklearn.metrics import r2_score
y_pred = model.predict(X_test)

print(f"Coefficients {model.coef_}")
print(f"intercept {model.intercept_}")
print(f"R2 Score {r2_score(y_test,y_pred)}")

#### 13. Write a Python script that generates synthetic data, fits a linear regression model, and visualizes the regression line along with the data points

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate synthetic data
np.random.seed(42)  # Set random seed for reproducibility
X = 2 * np.random.rand(100, 1)  # Generate 100 random values for X between 0 and 2
y = 4 + 3 * X + np.random.randn(100, 1)  # Generate y values with a linear relationship and some noise

# Create and fit the linear regression model
model = LinearRegression()
model.fit(X, y)

# Predict values using the model
y_pred = model.predict(X)

# Plot the data points and the regression line
plt.scatter(X, y, color='blue', label='Data Points')
plt.plot(X, y_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel('Feature (X)')
plt.ylabel('Target (y)')
plt.title('Linear Regression with Synthetic Data')
plt.legend()
plt.show()

#### 14.  Write a Python script that uses the Variance Inflation Factor (VIF) to check for multicollinearity in a dataset with multiple features.

In [None]:
diamond = sns.load_dataset("diamonds")
# diamond.head(2)

In [None]:
# diamond["cut"].unique()
# diamond["clarity"].unique()
# diamond["color"].unique()
diamond["cut"] = diamond["cut"].map({"Ideal":1,"Premium":2,"Very Good":3,"Good":4,"Fair":5})
diamond["clarity"]= diamond["clarity"].map({"SI1":1,"VS2":2,"SI2":3,"VS1":4,"VVS2":5,"VVS1":6,"IF":7,"I1":8})
diamond["color"] = diamond["color"].map({"G":1,"E":2,"F":3,"H":4,"D":5,"I":6,"J":7})

In [None]:
sns.clustermap(data=diamond.drop("price",axis = 1).corr(),annot=True)
plt.title("Correlation Matrix")
plt.show()

In [None]:
# Import variance_inflation_factor
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Create a DataFrame that stores the Variance Inflation Factor of each feature
# Use only the feature columns (price excluded) so each VIF lines up with its feature name
features = diamond.drop("price",axis=1).astype(float)
vif_data = pd.DataFrame()
vif_data["features"] = features.columns
vif_data["VIF"] = [variance_inflation_factor(features.values,i) for i in range(features.shape[1])]
vif_data

In [None]:
pd.DataFrame(diamond.values)

#### 15. Write a Python script that generates synthetic data for a polynomial relationship (degree 4), fits a polynomial regression model, and plots the regression curve.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Generate synthetic data
np.random.seed(0)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = (X**4 + X**3 + X**2 + X + 1 + np.random.randn(80, 1)*20).ravel() # Introduce some noise

# Create polynomial features (degree 4)
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)

# Fit polynomial regression model
model = LinearRegression()
model.fit(X_poly, y)

# Generate points for plotting the curve
X_plot = np.linspace(0, 5, 100).reshape(-1, 1)
X_plot_poly = poly.transform(X_plot)
y_plot = model.predict(X_plot_poly)

# Plot the data points and the regression curve
plt.scatter(X, y, color='blue', label='Data Points')
plt.plot(X_plot, y_plot, color='red', linewidth=2, label='Regression Curve (Degree 4)')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression (Degree 4)')
plt.legend()
plt.show()

#### 16. Write a Python script that creates a machine learning pipeline with data standardization and a multiple linear regression model, and prints the R-squared score.

In [None]:
# Import Dataset from Sklearn Library
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

In [None]:
# Make the Dataset
housing_data = pd.DataFrame(housing.data,columns=housing.feature_names)
housing_data["price"] = housing.target
housing_data.head()

In [None]:
# To check Linearity
sns.pairplot(housing_data)
plt.show()

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
housing_data.drop("price",axis=1,inplace=True)

In [None]:
vif_housing_data = pd.DataFrame()
vif_housing_data["features"] = housing_data.columns
vif_housing_data["vif"] = [variance_inflation_factor(housing_data.values,i) for i in range(len(housing_data.columns))]
vif_housing_data

In [None]:
housing_data.drop("Longitude",axis=1,inplace=True)

In [None]:
vif_housing_data = pd.DataFrame()  # rebuild: the number of features changed after the drop
vif_housing_data["features"] = housing_data.columns
vif_housing_data["vif"] = [variance_inflation_factor(housing_data.values,i) for i in range(len(housing_data.columns))]
vif_housing_data

In [None]:
housing_data.drop("AveRooms",axis=1,inplace=True)

In [None]:
vif_housing_data = pd.DataFrame()
vif_housing_data["features"] = housing_data.columns
vif_housing_data["vif"] = [variance_inflation_factor(housing_data.values,i) for i in range(len(housing_data.columns))]
vif_housing_data

In [None]:
housing_data.drop("Latitude",axis=1,inplace=True)

In [None]:
vif_housing_data = pd.DataFrame()
vif_housing_data["features"] = housing_data.columns
vif_housing_data["vif"] = [variance_inflation_factor(housing_data.values,i) for i in range(len(housing_data.columns))]
vif_housing_data

In [None]:
X = housing_data
y = pd.DataFrame(housing.target)
print(f"X Shape : {X.shape}")
print(f"y Shape : {y.shape}")

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=1)
print(f"X_train Shape : {X_train.shape}")
print(f"X_test Shape : {X_test.shape}")
print(f"y_train Shape : {y_train.shape}")
print(f"y_test Shape : {y_test.shape}")

In [None]:
# Scale the features
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print(f"X_train Shape : {X_train.shape}")
print(f"X_test Shape : {X_test.shape}")

In [None]:
# Make the model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model

In [None]:
# Train The Model
model.fit(X_train,y_train)

In [None]:
y_pred = model.predict(X_test)
print(f"Coefficient = {model.coef_}")
print(f"Intercept = {model.intercept_}")

from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
print(f"Mean Absolute Error = {mean_absolute_error(y_test,y_pred)}")
print(f"Mean Squared Error = {mean_squared_error(y_test,y_pred)}")
print(f"Root Mean Squared Error = {np.sqrt(mean_squared_error(y_test,y_pred))}")
print(f"R2 Score = {r2_score(y_test,y_pred)}")
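
The question asks for a pipeline, while the cells above scale and fit in separate steps. A minimal sketch of the same idea with scikit-learn's `Pipeline` follows. It assumes synthetic data from `make_regression` as a stand-in for the housing features, and the names `X_demo`, `y_demo`, and `pipe` are illustrative rather than taken from the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the housing features (5 numeric columns)
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

# Chain standardization and the regressor so both are fitted in one call
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
pipe.fit(X_tr, y_tr)

# score() applies the fitted scaler to the test data, then returns R-squared
pipeline_r2 = pipe.score(X_te, y_te)
print(f"Pipeline R2 Score = {pipeline_r2}")
```

Because the scaler lives inside the pipeline, it is fitted on the training split only, which avoids leaking test-set statistics into the model.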

#### 17. Write a Python script that performs polynomial regression (degree 3) on a synthetic dataset and plots the regression curve.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Generate synthetic data with a non-linear relationship
np.random.seed(0)
X = np.random.rand(100, 1) * 10  # Feature values between 0 and 10
y = 2 * X**3 + 3 * X**2 + 5 + np.random.randn(100, 1) * 10  # Target with cubic relationship and noise
X = X.reshape(-1,1)
y = y.reshape(-1,1)
# Create polynomial features (degree 3)
poly_features = PolynomialFeatures(degree=3)
X_poly = poly_features.fit_transform(X)

# Fit linear regression model to polynomial features
model = LinearRegression()
model.fit(X_poly, y)

# Generate points for plotting the curve
X_curve = np.linspace(0, 10, 100).reshape(-1, 1)  # Evenly spaced points between 0 and 10
X_curve_poly = poly_features.transform(X_curve)  # Transform to polynomial features
y_curve = model.predict(X_curve_poly)  # Predict target values for the curve

# Plot the data points and the regression curve
plt.scatter(X, y, color='blue', label='Data Points')
plt.plot(X_curve, y_curve, color='red', linewidth=2, label='Regression Curve (Degree 3)')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Polynomial Regression (Degree 3)')
plt.legend()
plt.show()

#### 18. Write a Python script that performs multiple linear regression on a synthetic dataset with 5 features. Print the R-squared score and model coefficients

In [None]:
np.random.seed(42)
num_samples = 1000
X = np.random.rand(num_samples, 5)  # 5 features
coefficients = np.array([2, -3.5, 1, 4.2, -1.5])
y = X @ coefficients + np.random.randn(num_samples) * 0.5  # Add some noise

In [None]:
feature_names = [f'feature_{i}' for i in range(1, 6)]
data = pd.DataFrame(X,columns=feature_names)
data["target"] = y
data.head(2)

In [None]:
from sklearn.model_selection import train_test_split
X = data.drop("target",axis=1)
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=1)

In [None]:
from sklearn.linear_model import LinearRegression
model_18 = LinearRegression()
model_18

In [None]:
model_18.fit(X_train,y_train)

In [None]:
y_pred = model_18.predict(X_test)
print(f"Coefficient = {model_18.coef_}")
print(f"Intercept = {model_18.intercept_}")

from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
print(f"Mean Absolute Error = {mean_absolute_error(y_test,y_pred)}")
print(f"Mean Squared Error = {mean_squared_error(y_test,y_pred)}")
print(f"Root Mean Squared Error = {np.sqrt(mean_squared_error(y_test,y_pred))}")
print(f"R2 Score = {r2_score(y_test,y_pred)}")

#### 19. Write a Python script that generates synthetic data for linear regression, fits a model, and visualizes the data points along with the regression line.

In [None]:
np.random.seed(42)
num_samples = 1000
X = np.random.rand(num_samples)  # Single feature
y = X * 2.1 + np.random.randn(num_samples)

In [None]:
# Keep all rows; store the feature and target together in one DataFrame
data = pd.DataFrame({"feature_1": X, "target": y})
data.head(2)

In [None]:
plt.scatter(X,y)
plt.title("Synthetic Data")
plt.xlabel("Feature")
plt.ylabel("Target")
plt.show()

In [None]:
from sklearn.model_selection import train_test_split
X = data.drop("target",axis=1)
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=1)

In [None]:
from sklearn.linear_model import LinearRegression
model_19 = LinearRegression()
model_19.fit(X_train,y_train)
model_19

In [None]:
y_pred = model_19.predict(X_test)
print(f"Coefficient = {model_19.coef_}")
print(f"Intercept = {model_19.intercept_}")

from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
print(f"Mean Absolute Error = {mean_absolute_error(y_test,y_pred)}")
print(f"Mean Squared Error = {mean_squared_error(y_test,y_pred)}")
print(f"Root Mean Squared Error = {np.sqrt(mean_squared_error(y_test,y_pred))}")
print(f"R2 Score = {r2_score(y_test,y_pred)}")

In [None]:
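plt.scatter(X_test,y_test,color="black",label="Data Points")
# Overlay the fitted line using the test-set predictions
plt.plot(np.ravel(X_test),y_pred,color="red",linewidth=2,label="Regression Line")
plt.legend()
plt.show()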

#### 20. Create a synthetic dataset with 3 features and perform multiple linear regression. Print the model's R-squared score and coefficients

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Generate synthetic data with 3 features
np.random.seed(0)
num_samples = 1000
X = pd.DataFrame(np.random.rand(num_samples, 3), columns=['feature1', 'feature2', 'feature3'])
y = 2 + 3 * X['feature1'] + 4 * X['feature2'] + 1.5 * X['feature3'] + np.random.randn(num_samples)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Evaluate the model using R-squared score
r2 = r2_score(y_test, y_pred)

# Print the results
print("R-squared score:", r2)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

#### 21. Write a Python script that demonstrates how to serialize and deserialize machine learning models using joblib instead of pickling

In [None]:
import joblib
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Serialize the model using joblib
joblib.dump(model, 'linear_regression_model.joblib')

# Deserialize the model using joblib
loaded_model = joblib.load('linear_regression_model.joblib')

# Make predictions using the loaded model
predictions = loaded_model.predict(X_test)

# Show the first few predictions from the reloaded model
print("Predictions from the reloaded model:", predictions[:5])
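
One way to check that serialization preserved the model is to compare predictions before and after the round trip. The sketch below assumes a small synthetic regression problem and writes to a temporary directory. The names `original`, `restored`, and `model.joblib` are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: tiny noiseless synthetic dataset, used only to exercise the round trip
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 3.0

original = LinearRegression().fit(X_demo, y_demo)

# Dump and reload inside a temporary directory so no file is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(original, path)
    restored = joblib.load(path)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(original.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

As with pickle, `joblib.load` can execute arbitrary code, so it should only be used on files from a trusted source.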

#### 22. Write a Python script to perform linear regression with categorical features using one-hot encoding. Use the Seaborn 'tips' dataset

In [None]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import OneHotEncoder

# Load the 'tips' dataset
tips = sns.load_dataset('tips')

# Create a OneHotEncoder object
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore') # Dense output; `sparse_output` needs scikit-learn >= 1.2 (older versions use `sparse`)

# Select categorical features for one-hot encoding
categorical_features = ['sex', 'smoker', 'day', 'time']

# Transform categorical features using one-hot encoding
encoded_features = encoder.fit_transform(tips[categorical_features])

# Create a DataFrame with encoded features
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(categorical_features))

# Concatenate encoded features with numerical features
numerical_features = ['total_bill', 'size']
X = pd.concat([tips[numerical_features], encoded_df], axis=1)
y = tips['tip']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Evaluate the model using R-squared score
r2 = r2_score(y_test, y_pred)
print(f"R-squared score: {r2}")

#### 23. Compare Ridge Regression with Linear Regression on a synthetic dataset and print the coefficients and R-squared score

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score

# Generate synthetic data
np.random.seed(0)
X = pd.DataFrame(np.random.rand(100, 5), columns=['feature1', 'feature2', 'feature3', 'feature4', 'feature5'])
y = 2 + 3 * X['feature1'] + 4 * X['feature2'] + np.random.randn(100)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
linear_pred = linear_model.predict(X_test)
linear_r2 = r2_score(y_test, linear_pred)

# Ridge Regression
ridge_model = Ridge(alpha=1.0)  # Alpha is the regularization parameter
ridge_model.fit(X_train, y_train)
ridge_pred = ridge_model.predict(X_test)
ridge_r2 = r2_score(y_test, ridge_pred)

# Print results
print("Linear Regression:")
print("Coefficients:", linear_model.coef_)
print("R-squared:", linear_r2)
print("\nRidge Regression:")
print("Coefficients:", ridge_model.coef_)
print("R-squared:", ridge_r2)
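
With a single `alpha`, the difference between the two models can be hard to see. The sketch below refits Ridge for a few `alpha` values on similar synthetic data and prints the size of the coefficient vector, which should shrink as the penalty grows. The `alphas` list and the `coef_norms` name are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Assumption: synthetic data shaped like the example above (only 2 of 5 features matter)
rng = np.random.default_rng(0)
X_demo = rng.random((100, 5))
y_demo = 2 + 3 * X_demo[:, 0] + 4 * X_demo[:, 1] + rng.standard_normal(100)

alphas = [0.01, 1.0, 100.0]
coef_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # Euclidean norm of the coefficient vector (the intercept is not penalized)
    coef_norms.append(np.linalg.norm(ridge.coef_))
    print(f"alpha={alpha}: ||coef|| = {coef_norms[-1]:.3f}")
```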

#### 24. Write a Python script that uses cross-validation to evaluate a Linear Regression model on a synthetic dataset.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, r2_score

# Generate synthetic data
np.random.seed(0)
X = pd.DataFrame(np.random.rand(100, 5), columns=['feature1', 'feature2', 'feature3', 'feature4', 'feature5'])
y = 2 + 3 * X['feature1'] + 4 * X['feature2'] + np.random.randn(100)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Linear Regression model
model = LinearRegression()

# Define the scoring metric (R-squared)
scoring = make_scorer(r2_score)

# Perform cross-validation
scores = cross_val_score(model, X_train, y_train, cv=5, scoring=scoring)  # 5-fold cross-validation

# Print the cross-validation scores
print("Cross-validation scores:", scores)
print("Average R-squared:", scores.mean())
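
A slightly more compact variant, sketched under the same synthetic-data assumption, passes the built-in `"r2"` scorer name instead of `make_scorer` and uses an explicit `KFold` with shuffling so the fold assignment is reproducible. Reporting the spread alongside the mean gives a sense of how stable the estimate is.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumption: synthetic data shaped like the example above
rng = np.random.default_rng(0)
X_demo = rng.random((100, 5))
y_demo = 2 + 3 * X_demo[:, 0] + 4 * X_demo[:, 1] + rng.standard_normal(100)

# Shuffled 5-fold split with a fixed seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("Fold scores:", cv_scores)
print(f"Mean R-squared: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```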

#### 25. Write a Python script that compares polynomial regression models of different degrees and prints the R-squared score for each

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# Generate synthetic data with a non-linear relationship
np.random.seed(0)
X = np.random.rand(100, 1) * 10  # Feature values between 0 and 10
y = 2 * X**2 + 3 * X + 5 + np.random.randn(100, 1) * 10  # Target with quadratic relationship and noise

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define degrees of polynomial to test
degrees = [1, 2, 3, 4, 5]

# Iterate over degrees and fit polynomial regression models
for degree in degrees:
    # Create polynomial features
    poly_features = PolynomialFeatures(degree=degree)
    X_train_poly = poly_features.fit_transform(X_train)
    X_test_poly = poly_features.transform(X_test)

    # Fit linear regression model to polynomial features
    model = LinearRegression()
    model.fit(X_train_poly, y_train)

    # Make predictions on the testing set
    y_pred = model.predict(X_test_poly)

    # Calculate and print R-squared score
    r2 = r2_score(y_test, y_pred)
    print(f"Degree {degree}: R-squared = {r2}")
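
A single train/test split can favor one degree by chance. As a rough sketch, the same comparison can be run with 5-fold cross-validation by wrapping `PolynomialFeatures` and `LinearRegression` in `make_pipeline`. The data below is regenerated with an assumed seed, and `mean_scores` is an illustrative name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumption: quadratic synthetic data like the example above, with a 1-D target
rng = np.random.default_rng(0)
X_demo = rng.random((100, 1)) * 10
y_demo = (2 * X_demo**2 + 3 * X_demo + 5).ravel() + rng.standard_normal(100) * 10

mean_scores = {}
for degree in [1, 2, 3, 4, 5]:
    # The pipeline refits the polynomial expansion inside every fold
    poly_model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    fold_scores = cross_val_score(poly_model, X_demo, y_demo, cv=5, scoring="r2")
    mean_scores[degree] = fold_scores.mean()
    print(f"Degree {degree}: mean CV R-squared = {mean_scores[degree]:.3f}")
```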