# Example usage

To use `linreg_ally` in a project:

In [None]:
import linreg_ally

print(linreg_ally.__version__)

In [None]:
# Imports

In [None]:
# Paramveer - EDA

In [None]:
# Alex - VIF 

In [None]:
# Cheng - model fitting

## Running Linear Regression Tutorial

In this tutorial, you will learn a streamlined way to preprocess data, fit a linear regression model, and report scoring metrics.

First, import the `run_linear_regression` function from the `models` module.

In [1]:
from linreg_ally.models import run_linear_regression

We will be using the `cars` dataset provided by `vega_datasets`. This dataset contains various features related to cars, including both numerical and categorical variables, making it ideal for demonstrating the full capabilities of our linear regression function.

In [5]:
from vega_datasets import data

df = data.cars()
df.head()

| | Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 1970-01-01 | USA |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 1970-01-01 | USA |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 1970-01-01 | USA |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 1970-01-01 | USA |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 1970-01-01 | USA |


As shown above, each row describes one car model, with attributes such as `Miles_per_Gallon`, `Cylinders`, and `Displacement`. We will use these attributes to build a linear regression model that predicts the target variable `Horsepower`.

We will first perform some data cleaning by removing rows that contain `NA` values.

In [6]:
# Drop rows with missing values, keeping every column for the model below
df = df.dropna()
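
As a side note on the cleaning step above, the following minimal sketch shows how `dropna` behaves along each axis. The `toy` frame and its values are made up for illustration and are not taken from the `cars` dataset.

```python
import pandas as pd

# Toy frame (made-up values, not the cars data): one missing Horsepower entry.
toy = pd.DataFrame(
    {
        "Horsepower": [130.0, None, 150.0],
        "Displacement": [307.0, 350.0, 318.0],
    }
)

rows_dropped = toy.dropna()        # default axis=0: removes the row holding the NA
cols_dropped = toy.dropna(axis=1)  # axis=1: removes the column holding the NA

print(rows_dropped.shape)  # two complete rows remain, both columns kept
print(cols_dropped.shape)  # all three rows remain, only Displacement kept
```

With the default arguments, `dropna` discards rows rather than columns, so no feature is lost during cleaning.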

With the dataset loaded, you're all set to move forward to the next step: using our package's `run_linear_regression` function to prepare the data, fit a model, and evaluate its performance.

We will specify `target_column`, `numeric_feats`, `categorical_feats` and `drop_feats`. `target_column` is `Horsepower`, the value we are trying to predict. `numeric_feats` lists the numeric features to scale with scikit-learn's `StandardScaler`. `categorical_feats` lists the categorical features (here only `Origin`) to one-hot encode with scikit-learn's `OneHotEncoder`. `drop_feats` lists the columns to exclude from the analysis; here that is `Name`, since a car's name carries no meaningful information for predicting horsepower.
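
To make the three kinds of preprocessing concrete, here is a minimal sketch built directly with scikit-learn. It is an assumption about what such a pipeline could look like, not the actual internals of `run_linear_regression`. The `toy` frame holds made-up values; only its column names mirror the `cars` dataset.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows with made-up values; only the column names mirror the cars dataset.
toy = pd.DataFrame(
    {
        "Name": ["car a", "car b", "car c", "car d"],
        "Displacement": [307.0, 97.0, 113.0, 350.0],
        "Weight_in_lbs": [3504, 2130, 2372, 3693],
        "Origin": ["USA", "Europe", "Japan", "USA"],
    }
)

# Hypothetical preprocessor: one transformer per feature group.
preprocessor = make_column_transformer(
    (StandardScaler(), ["Displacement", "Weight_in_lbs"]),  # scale numeric features
    (OneHotEncoder(handle_unknown="ignore"), ["Origin"]),   # one-hot encode categories
    ("drop", ["Name"]),                                     # discard uninformative columns
)

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)  # 2 scaled columns + 3 one-hot columns for 4 rows
```

Each feature group maps to exactly one transformer, which is why the function asks for the three lists separately.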

For the `scoring_metrics`, we will specify `r2` to evaluate the performance of the model on test data.

In [17]:
from vega_datasets import data
from linreg_ally.models import run_linear_regression

df = data.cars()
df = df.dropna()

# Define parameters for run_linear_regression
target_column = "Horsepower"
numeric_feats = ["Miles_per_Gallon", "Cylinders", "Displacement", "Weight_in_lbs", "Acceleration"] 
categorical_feats = ["Origin"]
drop_feats = ["Name"]
random_state = 123
scoring_metrics = ["r2"]

best_model, X_train, X_test, y_train, y_test, scores = run_linear_regression(
    dataframe=df,
    target_column=target_column,
    numeric_feats=numeric_feats,
    categorical_feats=categorical_feats,
    drop_feats=drop_feats,
    random_state=random_state,
    scoring_metrics=scoring_metrics
)

Model Summary
------------------------
Test r2: 0.846


Displaying `best_model` shows a visual summary of the steps in the fitted linear regression pipeline.

In [18]:
best_model

`scores` is a dictionary holding each metric requested in `scoring_metrics` (here only `r2`), which tells us how the model performs on test data.

In [19]:
scores

{'r2': 0.8463952369304465}

As shown above, an R² score of about 0.85 means that roughly 85% of the variance in the target variable `Horsepower` in the test data is explained by the features included in the model, which suggests the model fits the data well.
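
To see where that interpretation comes from, the sketch below computes R² from its definition, one minus the ratio of residual variation to total variation. The helper `r_squared` and the sample values are hypothetical and are not part of `linreg_ally`.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot


# Made-up target values and predictions, for illustration only.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]

r2 = r_squared(y_true, y_pred)
print(round(r2, 3))  # 0.968

# A model that always predicts the mean explains none of the variance.
baseline = r_squared(y_true, [175.0] * 4)
print(baseline)  # 0.0
```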

However, R² alone does not tell the whole story. For example, it does not reveal multicollinearity among the features or other modelling issues. Consider also checking other metrics such as Mean Squared Error (MSE) or Root Mean Squared Error (RMSE), and visually inspecting residual plots, to gain a more complete picture of model performance.
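
As a rough illustration of those complementary metrics, the following sketch computes residuals, MSE and RMSE by hand. The helper names and sample values are hypothetical; in practice you would pass `y_test` and the model's predictions instead.

```python
import math


def residuals(y_true, y_pred):
    """Observed minus predicted, one value per sample."""
    return [t - p for t, p in zip(y_true, y_pred)]


def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    res = residuals(y_true, y_pred)
    return sum(r ** 2 for r in res) / len(res)


def rmse(y_true, y_pred):
    """Square root of MSE, expressed in the units of the target."""
    return math.sqrt(mse(y_true, y_pred))


# Made-up target values and predictions, for illustration only.
observed = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]

print(residuals(observed, predicted))  # [-10.0, 10.0, -10.0, 10.0]
print(mse(observed, predicted))        # 100.0
print(rmse(observed, predicted))       # 10.0
```

Because RMSE is in the same units as the target, an RMSE of 10 here would read as a typical error of about 10 horsepower.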

This concludes the tutorial. You have seen how the `run_linear_regression` function in our package preprocesses data, fits a linear regression model, and reports scoring metrics.

In [None]:
# Merari - plot