<a href="https://colab.research.google.com/github/mohith789p/Machine-Learning/blob/main/regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Training and Evaluation with Multiple Regression Techniques

The dataset contains information about 50 startups, and several regression models are applied to predict each startup's profit from its R&D, administration, and marketing spend and its state. The data is processed with pandas and sklearn, and the categorical `State` column is one-hot encoded. Three models (Linear Regression, Lasso, and Ridge) are trained and evaluated with MAE, MSE, RMSE, and R².

## Importing the Required Libraries and Loading the Data

The required libraries are imported to handle data manipulation, preprocessing, and model training. pandas and numpy are used for data handling and numerical operations, while sklearn provides tools for splitting the data, encoding categorical features, building regression models, and computing evaluation metrics. The dataset, `50_Startups.csv`, is then loaded into a pandas DataFrame, which is used for training and testing the models.

In [57]:
# importing the necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# loading the data
data = pd.read_csv('50_Startups.csv')
data.head()

|   | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


## Exploring the Data

Missing values are counted per column with `data.isnull().sum()`. `data.info()` reports each column's data type and non-null count, and `data.describe()` produces summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numerical columns. Here every column has 50 non-null entries, so no imputation is needed, and `State` is the only non-numeric column.

In [58]:
# Check for missing values
missing = data.isnull().sum()
print(missing)
# Show column dtypes and non-null counts
# (data.info() prints its report itself and returns None, so print() also emits "None")
print(data.info())
# Get summary statistics
print(data.describe())

R&D Spend          0
Administration     0
Marketing Spend    0
State              0
Profit             0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB
None
           R&D Spend  Administration  Marketing Spend         Profit
count      50.000000       50.000000        50.000000      50.000000
mean    73721.615600   121344.639600    211025.097800  112012.639200
std     45902.256482    28017.802755    122290.310726   40306.180338
min         0.000000    51283.140000         0.000000   14681.400000
25%     39936.370000   103730.875000    129300.132500   

## Encoding Categorical Variable **'State'**

The categorical `State` column is one-hot encoded with sklearn's `OneHotEncoder`. The `drop='first'` argument drops the first category (California, alphabetically), which becomes the baseline; without it, the three indicator columns would always sum to 1 and be perfectly collinear with the model's intercept. `sparse_output=False` makes the encoder return a dense array. The encoded columns are concatenated with the remaining columns, the original `State` column is dropped, and the result is stored in a new DataFrame, `data_processed`. `head()` shows its first few rows, now with a numerical representation of `State`.

In [59]:
# Handle categorical variable: "State"
categorical_features = ['State']
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_state = encoder.fit_transform(data[categorical_features])

# Add encoded state to the dataset and drop original "State" column
encoded_state_df = pd.DataFrame(encoded_state, columns=encoder.get_feature_names_out(categorical_features))
data_processed = pd.concat([data.drop(columns=categorical_features), encoded_state_df], axis=1)
data_processed.head()

|   | R&D Spend | Administration | Marketing Spend | Profit | State_Florida | State_New York |
|---|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | 192261.83 | 0.0 | 1.0 |
| 1 | 162597.7 | 151377.59 | 443898.53 | 191792.06 | 0.0 | 0.0 |
| 2 | 153441.51 | 101145.55 | 407934.54 | 191050.39 | 1.0 | 0.0 |
| 3 | 144372.41 | 118671.85 | 383199.62 | 182901.99 | 0.0 | 1.0 |
| 4 | 142107.34 | 91391.77 | 366168.42 | 166187.94 | 1.0 | 0.0 |


## Defining the Features (X)

The features (X) are defined by dropping the target variable "Profit" from the `data_processed` dataset. This results in a DataFrame `X` that contains all the independent variables (or predictors) for the model. The `head()` function is used to display the first few rows of `X`, providing a preview of the feature set that will be used for training the regression models.

In [60]:
# Define features (X)
X = data_processed.drop(columns=['Profit'])
X.head()

|   | R&D Spend | Administration | Marketing Spend | State_Florida | State_New York |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | 0.0 | 1.0 |
| 1 | 162597.7 | 151377.59 | 443898.53 | 0.0 | 0.0 |
| 2 | 153441.51 | 101145.55 | 407934.54 | 1.0 | 0.0 |
| 3 | 144372.41 | 118671.85 | 383199.62 | 0.0 | 1.0 |
| 4 | 142107.34 | 91391.77 | 366168.42 | 1.0 | 0.0 |


## Defining the Target (y)

The target variable (y) is defined by selecting the "Profit" column from the `data_processed` dataset. This results in a pandas Series `y` containing the dependent variable, which the model will aim to predict. The `head()` function is used to display the first few rows of `y`, giving a preview of the target values that correspond to the features in `X`.

In [61]:
# Define target (y)
y = data_processed['Profit']
y.head()

|   | Profit |
|---|---|
| 0 | 192261.83 |
| 1 | 191792.06 |
| 2 | 191050.39 |
| 3 | 182901.99 |
| 4 | 166187.94 |


## Splitting the Data

The data is split into training and test sets with sklearn's `train_test_split`, which divides the features (`X`) and the target (`y`) consistently so that rows stay aligned. With `test_size=0.2`, 10 of the 50 rows are held out for testing and the remaining 40 are used for training. Setting `random_state=42` makes the shuffle, and therefore the split, reproducible. `head()` previews the training features (`X_train`); the non-sequential row indices in the output (12, 4, 37, ...) show that rows are shuffled before splitting.

In [62]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.head()

|   | R&D Spend | Administration | Marketing Spend | State_Florida | State_New York |
|---|---|---|---|---|---|
| 12 | 93863.75 | 127320.38 | 249839.44 | 1.0 | 0.0 |
| 4 | 142107.34 | 91391.77 | 366168.42 | 1.0 | 0.0 |
| 37 | 44069.95 | 51283.14 | 197029.42 | 0.0 | 0.0 |
| 8 | 120542.52 | 148718.95 | 311613.29 | 0.0 | 1.0 |
| 3 | 144372.41 | 118671.85 | 383199.62 | 0.0 | 1.0 |
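
The behaviour of `test_size` and `random_state` can be checked on stand-in data of the same size. This is a small sketch under assumptions: `X_demo` and `y_demo` are synthetic arrays invented for illustration, not the startup data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 50 rows, matching the size of the startup dataset.
# Row i of X_demo is [2*i, 2*i + 1], and y_demo[i] is i.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (40, 2) (10, 2)

# Repeating the call with the same random_state reproduces the same split.
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(y_te, y_te_again))

# Features and targets remain aligned row by row after shuffling.
print(np.array_equal(X_tr[:, 0], 2 * y_tr))
```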


## Linear Regression

The Linear Regression model is built and trained using the `LinearRegression()` class from sklearn. The model is fitted to the training data (`X_train` and `y_train`) using the `fit()` method. Predictions are then made on the test set (`X_test`) using the `predict()` method, and the model's performance is evaluated using several metrics:

- **Mean Absolute Error (MAE)**: Measures the average magnitude of the errors in predictions.
- **Mean Squared Error (MSE)**: Measures the average of the squared differences between predicted and actual values.
- **Root Mean Squared Error (RMSE)**: The square root of MSE, providing error in the same units as the target variable.
- **R-squared (R²)**: Indicates the proportion of variance in the target variable that is explained by the model.

The results for each of these metrics are printed to assess the performance of the Linear Regression model.

In [63]:
# Build and train the model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Make predictions
y_pred = lr_model.predict(X_test)

# Evaluate the model
lr_mae = mean_absolute_error(y_test, y_pred)
lr_mse = mean_squared_error(y_test, y_pred)
lr_rmse = np.sqrt(lr_mse)
lr_r2 = r2_score(y_test, y_pred)

print("Linear Regression:")
print("Mean Absolute Error (MAE):", lr_mae)
print("Mean Squared Error (MSE):", lr_mse)
print("Root Mean Squared Error (RMSE):", lr_rmse)
print("R-squared Score (R2):", lr_r2)

Linear Regression:
Mean Absolute Error (MAE): 6961.477813252376
Mean Squared Error (MSE): 82010363.04430099
Root Mean Squared Error (RMSE): 9055.957323458464
R-squared Score (R2): 0.8987266414328637
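
The four metrics listed above can also be written out directly with NumPy, which makes their definitions concrete. The following is a minimal sketch; the arrays `y_true` and `y_hat` are made-up illustrative values, not results from the startup dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative values only; they are not taken from the startup dataset.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

errors = y_true - y_hat                    # [0.5, 0.0, -1.0, -0.5]

mae = np.mean(np.abs(errors))              # (0.5 + 0 + 1 + 0.5) / 4 = 0.5
mse = np.mean(errors ** 2)                 # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
rmse = np.sqrt(mse)                        # same units as the target

ss_res = np.sum(errors ** 2)               # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                   # share of variance explained

# The hand-written versions agree with sklearn's implementations.
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

Since MSE squares each error, a single large miss raises MSE and RMSE more than it raises MAE, which is why the two families of metrics can rank models differently.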


## Lasso Regression Model

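The Lasso Regression model is trained using the `Lasso()` class from sklearn with `alpha=1.0`. Lasso adds an L1 penalty (`alpha` times the sum of the absolute coefficient values) to the least-squares loss, so larger values of `alpha` shrink the coefficients more strongly and can set some of them exactly to zero. The model is fitted to the training data (`X_train` and `y_train`) using the `fit()` method, and predictions are made on the test set (`X_test`) using the `predict()` method.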

The model's performance is then evaluated using the following metrics:
- **Mean Squared Error (MSE)**: Measures the average squared difference between predicted and actual values.
- **Mean Absolute Error (MAE)**: Measures the average absolute differences between predicted and actual values.
- **Root Mean Squared Error (RMSE)**: The square root of MSE, representing error in the same units as the target variable.
- **R-squared (R²)**: Indicates how well the model explains the variance in the target variable.

The results for each of these metrics are printed to assess the performance of the Lasso Regression model.

In [64]:
# Train Lasso Regression model
lasso_model = Lasso(alpha=1.0)
lasso_model.fit(X_train, y_train)

# Make predictions
y_lasso_pred = lasso_model.predict(X_test)

# Evaluate the model
lasso_mse = mean_squared_error(y_test, y_lasso_pred)
lasso_mae = mean_absolute_error(y_test, y_lasso_pred)
lasso_rmse = np.sqrt(lasso_mse)
lasso_r2 = r2_score(y_test, y_lasso_pred)

print("Lasso Regression Results:")
print("Mean Absolute Error (MAE):", lasso_mae)
print("Mean Squared Error (MSE):", lasso_mse)
print("Root Mean Squared Error (RMSE):", lasso_rmse)
print("R-squared Score (R2):", lasso_r2)


Lasso Regression Results:
Mean Absolute Error (MAE): 6961.5746884671735
Mean Squared Error (MSE): 82004202.15414938
Root Mean Squared Error (RMSE): 9055.617160312675
R-squared Score (R2): 0.8987342494230525
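
How `alpha` controls the strength of the Lasso penalty can be illustrated on synthetic data. This is a sketch under stated assumptions: `X_syn`, `y_syn`, the true coefficients, and the chosen `alphas` are all invented for the illustration and are unrelated to the startup dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: five features, but only the first two influence the target.
X_syn = rng.normal(size=(200, 5))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=200)

# Fit one model per alpha and keep the learned coefficients.
alphas = (0.01, 0.5, 10.0)
coefs = {a: Lasso(alpha=a).fit(X_syn, y_syn).coef_ for a in alphas}

for a in alphas:
    print(a, np.round(coefs[a], 2), "non-zero:", np.count_nonzero(coefs[a]))
```

With a small `alpha` the fitted coefficients stay close to the values used to generate the data; as `alpha` grows they shrink, and with a large enough `alpha` every coefficient is driven to zero. In the notebook above, Lasso with `alpha=1.0` scores almost identically to plain linear regression; a plausible reason is that `Profit` is measured in the hundreds of thousands, so a penalty of this size is small next to the squared-error term.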


## Ridge Regression Model

The Ridge Regression model is trained using the `Ridge()` class from sklearn with `alpha=1.0`. Ridge adds an L2 penalty (`alpha` times the sum of the squared coefficients) to the least-squares loss, which shrinks the coefficients toward zero but, unlike Lasso, generally does not set them exactly to zero. The model is fitted to the training data (`X_train` and `y_train`) using the `fit()` method, and predictions are made on the test set (`X_test`) using the `predict()` method.

The model's performance is then evaluated using the following metrics:
- **Mean Squared Error (MSE)**: Measures the average squared difference between predicted and actual values.
- **Mean Absolute Error (MAE)**: Measures the average absolute differences between predicted and actual values.
- **Root Mean Squared Error (RMSE)**: The square root of MSE, representing error in the same units as the target variable.
- **R-squared (R²)**: Indicates how well the model explains the variance in the target variable.

The results for these evaluation metrics are printed to assess the performance of the Ridge Regression model.

In [65]:
# Train Ridge Regression model
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)

# Make predictions
y_ridge_pred = ridge_model.predict(X_test)

# Evaluate the model
ridge_mse = mean_squared_error(y_test, y_ridge_pred)
ridge_mae = mean_absolute_error(y_test, y_ridge_pred)
ridge_rmse = np.sqrt(ridge_mse)
ridge_r2 = r2_score(y_test, y_ridge_pred)

print("Ridge Regression Results:")
print("Mean Absolute Error (MAE):", ridge_mae)
print("Mean Squared Error (MSE):", ridge_mse)
print("Root Mean Squared Error (RMSE):", ridge_rmse)
print("R-squared Score (R2):", ridge_r2)


Ridge Regression Results:
Mean Absolute Error (MAE): 6963.340034795974
Mean Squared Error (MSE): 81887773.66036233
Root Mean Squared Error (RMSE): 9049.186353499541
R-squared Score (R2): 0.8988780252113923
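
For reference, the ridge penalty has a closed-form solution, which can be compared against sklearn's `Ridge` on synthetic data. This is a sketch under assumptions: the data, the variable names, and the choice to centre the data so that the intercept is not penalised are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

# Synthetic regression problem with three features and an intercept of 4.
X_syn = rng.normal(size=(100, 3))
y_syn = X_syn @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.2, size=100)

alpha = 1.0

# Centre features and target so the intercept is left out of the penalty.
X_c = X_syn - X_syn.mean(axis=0)
y_c = y_syn - y_syn.mean()

# Closed-form ridge coefficients: w = (X^T X + alpha * I)^(-1) X^T y
w = np.linalg.solve(X_c.T @ X_c + alpha * np.eye(X_c.shape[1]), X_c.T @ y_c)

# Recover the intercept from the feature and target means.
b = y_syn.mean() - X_syn.mean(axis=0) @ w

model = Ridge(alpha=alpha).fit(X_syn, y_syn)
print(np.allclose(w, model.coef_), np.isclose(b, model.intercept_))
```

The `alpha * I` term is what distinguishes this from ordinary least squares: setting `alpha` to zero recovers the unregularised solution, and increasing it pulls all coefficients toward zero.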


## Comparison of Regression Models: Evaluation Metrics

The evaluation metrics for different regression models (Linear Regression, Lasso Regression, and Ridge Regression) are organized into a DataFrame with the metrics as rows and the models as columns. This table provides a clear comparison of the performance of each model based on metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²). The table helps to visually assess the strengths and weaknesses of each model in predicting the target variable.

In [68]:
# Collect the metrics per model; each dict key becomes a column (one per model)
metrics_data = {
    'Linear Regression': [lr_mae, lr_mse, lr_rmse, lr_r2],
    'Lasso Regression': [lasso_mae, lasso_mse, lasso_rmse, lasso_r2],
    'Ridge Regression': [ridge_mae, ridge_mse, ridge_rmse, ridge_r2]
}

# Create the DataFrame with the metrics as rows
metrics_df = pd.DataFrame(metrics_data, index=['MAE', 'MSE', 'RMSE', 'R²'])

# Display the table
print("Metrics Table:")
metrics_df


Metrics Table:


|   | Linear Regression | Lasso Regression | Ridge Regression |
|---|---|---|---|
| MAE | 6961.478 | 6961.575 | 6963.34 |
| MSE | 82010360.0 | 82004200.0 | 81887770.0 |
| RMSE | 9055.957 | 9055.617 | 9049.186 |
| R² | 0.8987266 | 0.8987342 | 0.898878 |


## Identifying the Best Model Based on Evaluation Metrics

The function `find_best_model` picks the best model for each metric from `metrics_df`: the model with the minimum value for MAE, MSE, and RMSE (lower is better) and the maximum value for R² (higher is better). It uses `idxmin()` and `idxmax()`, which return the column label (the model name) of the smallest or largest value in a row. In the metrics table above, the three models differ by less than 0.2% on every metric, and the test set holds only 10 of the 50 rows, so the ranking is best read as a near tie rather than a decisive result.

In [67]:
# a function to find the best model for each metric
def find_best_model(metrics_df):
    best_model = {
        'Best MAE': metrics_df.loc['MAE'].idxmin(),
        'Best MSE': metrics_df.loc['MSE'].idxmin(),
        'Best RMSE': metrics_df.loc['RMSE'].idxmin(),
        'Best R²': metrics_df.loc['R²'].idxmax()
    }
    return best_model

# Get the best model based on the comparison
best_model = find_best_model(metrics_df)

# Display the best model for each metric
print("Best Model based on the metrics:")
print("Best MAE: ", best_model['Best MAE'])
print("Best MSE: ", best_model['Best MSE'])
print("Best RMSE:",  best_model['Best RMSE'])
print("Best R²: ", best_model['Best R²'])


Best Model based on the metrics:
Best MAE:  Linear Regression
Best MSE:  Ridge Regression
Best RMSE: Ridge Regression
Best R²:  Ridge Regression
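
The same selection can be written more compactly with row-wise `idxmin` and `idxmax`. The sketch below is an alternative formulation, not the notebook's own code; `metrics_demo` is a hypothetical name, and its values are copied from the metrics table above as displayed (rounded).

```python
import pandas as pd

# Values copied from the metrics table above (rounded as displayed).
metrics_demo = pd.DataFrame(
    {
        "Linear Regression": [6961.478, 82010360.0, 9055.957, 0.8987266],
        "Lasso Regression": [6961.575, 82004200.0, 9055.617, 0.8987342],
        "Ridge Regression": [6963.34, 81887770.0, 9049.186, 0.898878],
    },
    index=["MAE", "MSE", "RMSE", "R²"],
)

# Lower is better for the error metrics; higher is better for R².
best_errors = metrics_demo.loc[["MAE", "MSE", "RMSE"]].idxmin(axis=1)
best_r2 = metrics_demo.loc[["R²"]].idxmax(axis=1)
best = pd.concat([best_errors, best_r2])
print(best)

# Gap between the best and worst model per metric, as a percentage.
spread_pct = (
    (metrics_demo.max(axis=1) - metrics_demo.min(axis=1))
    / metrics_demo.min(axis=1)
    * 100
)
print(spread_pct.round(3))
```

`spread_pct` expresses the gap between the highest and lowest value on each metric as a percentage of the lowest value, which puts the size of the differences between models next to the ranking itself.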
