# REGRESSION ANALYSIS ON CORPORATION FAVORITA PRODUCTS



## DESCRIPTION

Corporation Favorita wants to know how much stock of each product it should hold at any given time. To that end, this project analyses consumer demand trends across all of its products and forecasts demand with a machine learning model. The regression analysis uses past sales data to identify patterns in demand and to develop a model that can accurately forecast future demand. This model will then be used to inform purchasing and stocking decisions, reducing the likelihood of stockouts and overstocking.


## OBJECTIVE

The goal of this regression analysis is to optimize stock management at Corporation Favorita by accurately predicting demand for products in order to ensure that the right quantity of each product is always in stock. 



## HYPOTHESES

### Hypothesis 1

H0 - The type of day does not play a significant role in determining the demand for oil.

H1 - The type of day plays a significant role in determining the demand for oil.


### Hypothesis 2

H0 - The location does not have an impact on the demand for oil.

H1 - The location has an impact on the demand for oil.

### Hypothesis 3

H0 - There is no significant correlation between oil price and an increase in sales.

H1 - There is a significant correlation between oil price and an increase in sales.


## QUESTIONS

1. Is the train dataset complete (has all the required dates)?

2. Which dates have the lowest and highest sales for each year?

3. Are certain groups of stores selling more products? (Cluster, city, state, type)

4. Are sales affected by promotions, oil prices and holidays?

5. What analysis can we get from the date and its extractable features?

6. What is the difference between RMSLE, RMSE, MSE (or why is the MAE greater than all of them?)

7. What is the relationship between oil prices and sales?

8. What is the relationship between product and sales?

9. What is the trend of sales over time?

10. What is the relationship between oil prices and promotion?


## Import Libraries


In [4]:
# Library for EDA
import pandas as pd
import numpy as np 
import seaborn as sns

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings('ignore')

In [5]:
# Import datasets
# Folder that holds the CSV files; change this one path to match your machine
data_dir = 'C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression'

df_sample_sub = pd.read_csv(f'{data_dir}/sample_submission.csv')
df_stores = pd.read_csv(f'{data_dir}/stores.csv')
df_trans = pd.read_csv(f'{data_dir}/transactions.csv')
df_holi = pd.read_csv(f'{data_dir}/holidays_events.csv')
df_oil = pd.read_csv(f'{data_dir}/oil.csv')

# Loading train & test dataset
df_train = pd.read_csv(f'{data_dir}/train.csv')
df_test = pd.read_csv(f'{data_dir}/test.csv')


FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression/sample_submission.csv'

## Exploratory Data Analysis

#### Sample Submission data 

In [None]:
# Sample Submission data top 5
df_sample_sub.head(5)

In [None]:
# Sample Submission data from bottom 5
df_sample_sub.tail(5)

In [None]:
# Checking Null values
df_sample_sub.isnull().sum()

#### Stores data

In [None]:
# The stores data - top 5
df_stores.head(5)

In [None]:
# The stores data from bottom 5
df_stores.tail(5)

In [None]:
# Checking Null values
df_stores.isnull().sum()

#### Transactions data 

In [None]:
# The transaction data - top 5
df_trans.head(5)

In [None]:
# The transactions data - bottom 5
df_trans.tail(5)

In [None]:
# Checking Null values
df_trans.isnull().sum()

#### Holidays_events data frame

In [None]:
# Holidays Event data - top 5
df_holi.head(5)

In [None]:
# Holidays Event data - bottom 5
df_holi.tail(5)

In [None]:
# Checking Null values
df_holi.isnull().sum()

#### Oil data 

In [None]:
# The oil data - top 5
df_oil.head(5)

In [None]:
# The oil data - bottom 5
df_oil.tail(5)

In [None]:
# Checking Null values
df_oil.isnull().sum()

#### Train data

In [None]:
# The Train data - top 5
df_train.head(5)

In [None]:
# The Train data - bottom 5
df_train.tail(5)

In [None]:
# Checking Null values
df_train.isnull().sum()

#### Test Data

In [None]:
# The Test data - top 5
df_test.head(5)

In [None]:
# The Test data - bottom 5
df_test.tail(5)

In [None]:
# Checking Null values
df_test.isnull().sum()

### Checking the data types of all the data frames

In [None]:
# Sample Submission data types
df_sample_sub.info()

In [None]:
# Stores data types
df_stores.info()

In [None]:
# Transaction data types
df_trans.info()

In [None]:
# Holidays event data types
df_holi.info()

In [None]:
# Oil data types
df_oil.info()

In [None]:
# Train data types
df_train.info()

In [None]:
# Test data types
df_test.info()

In [None]:
df_test.nunique()

### Checking the shapes of all the data frames

In [None]:
# Checking shapes for all datasets (including the holidays data)
print(df_sample_sub.shape, df_stores.shape, df_trans.shape, df_holi.shape, df_oil.shape, df_train.shape, df_test.shape)

### Data Issues


1. 43 null values in the oil data before the merge
2. More missing values after merging


### Solution to Issues


1. Use the simple imputer after merge


In [None]:
# Function to convert date column to datetime format
def to_dateTime(df):
    # Convert 'date' column to datetime format
    df['date'] = pd.to_datetime(df['date'])

# List of dataframes to convert
dataframes = [df_trans, df_holi, df_oil, df_train, df_test]

# Loop through dataframes and convert 'date' column to datetime format
for df in dataframes:
    to_dateTime(df)


In [None]:
# Checking the datetime conversion on Train data
df_train.info()

# Merge all datasets for further EDA

In [None]:
# combine the datasets on common columns

merged_data = pd.merge(df_train, df_trans, on=['date', 'store_nbr'])


In [None]:
# Merge Holiday data to previous merged data on date column
merged_data2 = pd.merge(merged_data, df_holi, on='date')


In [None]:
# Merge Oil data to previous merged data on date column
merged_data3 = pd.merge(merged_data2, df_oil, on='date')


In [None]:
# Merge Store data to previous merged data on store_nbr column

merged_data4 = pd.merge(merged_data3, df_stores, on='store_nbr')
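
The merges above rely on the default `how='inner'` of `pd.merge`, which keeps only rows whose key appears in both frames. For the holiday merge this means that only dates listed in the holidays data remain. The sketch below uses small made-up frames (the names `sales`, `holidays`, `inner` and `left` are hypothetical) to show how an inner join and a left join differ; whether a left join suits this project is a judgment call, not something the notebook states.

```python
import pandas as pd

# Toy stand-ins for the sales and holiday tables (values are made up)
sales = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"]),
    "sales": [10.0, 20.0, 30.0],
})
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01"]),
    "type": ["Holiday"],
})

# Default inner join: only dates present in both tables survive
inner = pd.merge(sales, holidays, on="date")

# Left join: every sales row is kept; non-holiday dates get NaN in 'type'
left = pd.merge(sales, holidays, on="date", how="left")

print(len(inner), len(left))
print(left["type"].isna().sum())
```

With a left join the non-holiday rows would need a fill value (for example a "Work Day" label) before encoding.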

In [None]:
# Preview Merged data
merged_data4.head()

In [None]:
# Rename columns using the rename method
new_merged_data = merged_data4.rename(columns={"type_x": "holiday_type", "type_y": "store_type"})

In [None]:
# Preview of new merged data - top 5
new_merged_data.head()

In [None]:
# Preview of new merged data - bottom 5
new_merged_data.tail()

In [None]:
# Years covered by the merged data (the 'year' column is only created further down)
new_merged_data['date'].dt.year.unique()

In [None]:
# Datatypes of new merged data
new_merged_data.info()

In [None]:
# Inspect data for null values
new_merged_data.isnull().sum()

In [None]:
# Preview of shape of new merged data
new_merged_data.shape                   

In [None]:
#display random sample of 5 rows
new_merged_data.sample(5, random_state = 0)

In [None]:
# New datatypes after changing date datatype as datetime
new_merged_data.info()

In [None]:
# Generate summary statistics for numerical columns in the DataFrame
new_merged_data.describe()

In [None]:
# Finding duplicated values
new_merged_data.duplicated().sum()

In [None]:
# Convert dataset to CSV 
new_merged_data.to_csv('new_merged_data.csv', index=False)

In [None]:
# Create a boxplot of the 'transactions' column grouped by 'locale'
sns.boxplot(x='transactions', y='locale', data=new_merged_data)

# Show the plot
plt.show()


In [None]:
# Create the boxplot using the Seaborn library
sns.boxplot(
    x='transactions', y='city', data=new_merged_data,
    width=0.5,       # Adjust the width of the boxes
    fliersize=3,     # Adjust the size of the outliers
    showmeans=True,  # Show the mean value
    meanline=True,   # Show a line for the mean
    notch=True,      # Make the boxes "notched"
)

# Add a title and labels for the x and y axis
plt.title("Transactions by City", fontsize=18)
plt.xlabel("Transactions", fontsize=16)
plt.ylabel("City", fontsize=16)

# Show the plot
plt.show()

In [None]:
# Create a histogram of the 'transactions' column
new_merged_data.transactions.hist()

# Add labels to the x-axis, y-axis, and title
plt.xlabel('Transactions', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Histogram of Transactions', fontsize=20)

# Show the plot
plt.show()


In [None]:
# create a dataframe with numerical columns only
numerical_df = new_merged_data.select_dtypes(include=['float64', 'int64'])

# calculate the correlation matrix
corr_matrix = numerical_df.corr()

# display the correlation matrix
print(corr_matrix)


In [None]:
#change date datatype as datetime to create new features

new_merged_data.date = pd.to_datetime(new_merged_data.date)


new_merged_data['year'] = new_merged_data.date.dt.year

new_merged_data['month'] = new_merged_data.date.dt.month


new_merged_data['dayofmonth'] = new_merged_data.date.dt.day


new_merged_data['dayofweek'] = new_merged_data.date.dt.dayofweek


new_merged_data['dayname'] = new_merged_data.date.dt.strftime('%A')


## Answering Questions

### 1. Is the train dataset complete (has all the required dates)?

In [None]:
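# Check for missing values
if df_train.isnull().values.any():
  print("The dataset is not complete. There are missing values.")

# Check for missing dates: compare the dates present with the full daily range
expected_dates = pd.date_range(df_train['date'].min(), df_train['date'].max(), freq='D')
missing_dates = expected_dates.difference(pd.DatetimeIndex(df_train['date'].unique()))
if len(missing_dates) > 0:
  print("The dataset is not complete. Missing dates:", missing_dates.strftime('%Y-%m-%d').tolist())
else:
  print("The dataset is complete.")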


### 2. Which dates have the lowest and highest sales for each year?

In [None]:
# Group the data by year and get the minimum and maximum sales for each year
grouped_by_year = new_merged_data.groupby("year")["sales"].agg(["min", "max"])

# Get the dates corresponding to the minimum and maximum sales for each year
result = pd.concat([new_merged_data[new_merged_data["sales"] == grouped_by_year.loc[year, "min"]][["year", "date"]].rename(columns={"date": "date_min"}) for year in grouped_by_year.index] +
                  [new_merged_data[new_merged_data["sales"] == grouped_by_year.loc[year, "max"]][["year", "date"]].rename(columns={"date": "date_max"}) for year in grouped_by_year.index])

# Set the index to be the year
result = result.set_index("year")

# Group the data by year to get the minimum and maximum sales on separate rows
result = result.groupby(level=0).agg({"date_min": "first", "date_max": "first"})

# Reset the index to get a regular dataframe
result = result.reset_index()

print(result)
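
The cell above looks for the single rows with the smallest and largest `sales` value. If the question is read as "which calendar date had the lowest and highest total sales", one possible approach is to sum sales per date first and then use `idxmin` and `idxmax` within each year. The sketch below runs on a small made-up frame named `toy` (a hypothetical stand-in for `new_merged_data`); it is an alternative reading, not the notebook's own method.

```python
import pandas as pd

# Toy data standing in for new_merged_data (values are made up)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01", "2016-01-01", "2016-06-15",
                            "2017-02-01", "2017-12-24", "2017-12-24"]),
    "sales": [5.0, 7.0, 40.0, 3.0, 50.0, 25.0],
})

# Total sales per calendar date
daily_sales = toy.groupby("date")["sales"].sum()

# Group the daily totals by year
by_year = daily_sales.groupby(daily_sales.index.year)

# idxmin/idxmax return the index label (here the date) of the extreme value
extremes = pd.DataFrame({
    "date_min": by_year.idxmin(),
    "min_sales": by_year.min(),
    "date_max": by_year.idxmax(),
    "max_sales": by_year.max(),
}).rename_axis("year")

print(extremes)
```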


In [None]:
# Group the data by year and get the minimum and maximum sales for each year
grouped_by_year = new_merged_data.groupby("year")["sales"].agg(["min", "max"])

# Get the dates corresponding to the minimum and maximum sales for each year
result = pd.concat([new_merged_data[new_merged_data["sales"] == grouped_by_year.loc[year, "min"]][["year", "date"]].rename(columns={"date": "date_min"}) for year in grouped_by_year.index] +
                  [new_merged_data[new_merged_data["sales"] == grouped_by_year.loc[year, "max"]][["year", "date"]].rename(columns={"date": "date_max"}) for year in grouped_by_year.index])

# Set the index to be the year
result = result.set_index("year")

# Group the data by year to get the minimum and maximum sales on separate rows
result = result.groupby(level=0).agg({"date_min": "first", "date_max": "first"})

# Reset the index to get a regular dataframe
result = result.reset_index()

# Plot the minimum and maximum sales for each year
plt.plot(result["year"], grouped_by_year["min"], label="Minimum Sales")
plt.plot(result["year"], grouped_by_year["max"], label="Maximum Sales")

# Add a legend
plt.legend()

# Add axis labels
plt.xlabel("Year")
plt.ylabel("Sales")

# Show the plot
plt.show()


### 3. Are certain groups of stores selling more products? (Cluster, city, state, type)

In [None]:
#display random sample of 5 rows
df_stores.sample(5, random_state = 0)

In [None]:

# Plot the number of stores by city
plt.figure(figsize=(10, 5))
sns.countplot(x='city', data=df_stores)

# Add title and labels
plt.title("Number of Stores by City")
plt.xlabel("City")
plt.xticks(rotation=45)
plt.ylabel("Number of Stores")

# Show the plot
plt.show()

# Plot the number of stores by state
plt.figure(figsize=(10, 5))
sns.countplot(x='state', data=df_stores)

# Add title and labels
plt.title("Number of Stores by State")
plt.xlabel("State")
plt.xticks(rotation=45)
plt.ylabel("Number of Stores")

# Show the plot
plt.show()

# Plot the number of stores by type
plt.figure(figsize=(10, 5))
sns.countplot(x='type', data=df_stores)

# Add title and labels
plt.title("Number of Stores by Type")
plt.xlabel("Type")
plt.ylabel("Number of Stores")

# Show the plot
plt.show()

# Plot the number of stores by cluster
plt.figure(figsize=(10, 5))
sns.countplot(x='cluster', data=df_stores)

# Add title and labels
plt.title("Number of Stores by Cluster")
plt.xlabel("Cluster")
plt.ylabel("Number of Stores")

# Show the plot
plt.show()
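
The count plots above show how many stores fall into each city, state, type and cluster, but not how much those groups sell. A minimal sketch of one way to compare sales across groups follows, assuming a merged table with `store_nbr`, `sales` and the grouping column. The frame `toy` and the helper `sales_by_group` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy data standing in for new_merged_data (values are made up)
toy = pd.DataFrame({
    "store_nbr": [1, 1, 2, 2, 3, 3],
    "city": ["Quito", "Quito", "Quito", "Quito", "Guayaquil", "Guayaquil"],
    "cluster": [13, 13, 8, 8, 13, 13],
    "sales": [100.0, 150.0, 80.0, 70.0, 30.0, 20.0],
})

def sales_by_group(df, column):
    """Total sales, number of stores, and sales per store for each value of `column`."""
    summary = df.groupby(column).agg(
        total_sales=("sales", "sum"),
        n_stores=("store_nbr", "nunique"),
    )
    # Normalise by store count so large groups do not win on size alone
    summary["sales_per_store"] = summary["total_sales"] / summary["n_stores"]
    return summary.sort_values("total_sales", ascending=False)

by_city = sales_by_group(toy, "city")
by_cluster = sales_by_group(toy, "cluster")
print(by_city)
print(by_cluster)
```

Reporting sales per store next to the total helps separate "this group has more stores" from "stores in this group sell more".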


### 5. What analysis can we get from the date and its extractable features?

In [None]:
# create a copy of the dataframe
df_train_copy = df_train.copy()


# extract year, quarter, month, day, and weekday information from the date column
df_train_copy['year'] = df_train_copy['date'].dt.year
df_train_copy['quarter'] = df_train_copy['date'].dt.quarter
df_train_copy['month'] = df_train_copy['date'].dt.month
df_train_copy['day'] = df_train_copy['date'].dt.day
df_train_copy['weekday'] = df_train_copy['date'].dt.weekday

# group sales data by year (sum numeric columns only; dates and text cannot be summed)
grouped_by_year = df_train_copy.groupby('year').sum(numeric_only=True)

# plot the aggregated sales data by year
plt.plot(grouped_by_year.index, grouped_by_year['sales'])
plt.xlabel("Year")
plt.ylabel("Sales")
plt.title("Sales by Year")
plt.show()

# group sales data by month (sum numeric columns only; dates and text cannot be summed)
grouped_by_month = df_train_copy.groupby('month').sum(numeric_only=True)

# plot the aggregated sales data by month
plt.bar(grouped_by_month.index, grouped_by_month['sales'])
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Sales by Month")
plt.show()


# group sales data by quarter (sum numeric columns only; dates and text cannot be summed)
grouped_by_quarter = df_train_copy.groupby('quarter').sum(numeric_only=True)

# plot the aggregated sales data by quarter
plt.plot(grouped_by_quarter.index, grouped_by_quarter['sales'])
plt.xlabel("quarter")
plt.ylabel("Sales")
plt.title("Sales by Quarter")
plt.show()

### 7. What is the relationship between oil prices and sales?

In [None]:
# Plot a scatter plot to visualize the relationship between oil prices and sales
plt.scatter(new_merged_data['dcoilwtico'], new_merged_data['sales'])
plt.xlabel('Oil Price')
plt.ylabel('Sales')
plt.title('Relationship between Oil Prices and Sales')
plt.show()
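
A scatter plot gives a visual impression only. To test the hypothesis about oil price and sales, a correlation coefficient with a p-value is one option. The sketch below uses `scipy.stats.pearsonr` on synthetic data. The frame `toy`, the generated numbers and the significance level `alpha = 0.05` are assumptions made for illustration and say nothing about the real Favorita data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the oil price and sales columns (not real data)
rng = np.random.default_rng(0)
oil_price = rng.uniform(30, 110, size=200)
sales = 500 - 2.0 * oil_price + rng.normal(0, 20, size=200)
toy = pd.DataFrame({"dcoilwtico": oil_price, "sales": sales})

# Drop rows with missing values before testing
clean = toy.dropna(subset=["dcoilwtico", "sales"])

# Pearson correlation coefficient and two-sided p-value
r, p_value = stats.pearsonr(clean["dcoilwtico"], clean["sales"])

alpha = 0.05  # conventional significance level, chosen here for illustration
print(f"r = {r:.3f}, p = {p_value:.3g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Pearson's r measures linear association only; with very many rows even a tiny correlation can come out statistically significant, so the size of r matters as much as the p-value.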


### 8. What is the relationship between product and sales?

In [None]:

# Group data by product family and sum the sales
grouped_data_1 = new_merged_data.groupby('family')['sales'].sum()

# Sort the data by sales
grouped_data_1 = grouped_data_1.sort_values(ascending=False)

# Plot the top 10 product families
sns.barplot(x=grouped_data_1.index[:10], y=grouped_data_1.values[:10])

# Add labels and title
plt.xlabel('Product Family')
plt.ylabel('Sales')
plt.title('Relationship between Product Family and Sales (Top 10)')
# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)

# Show the plot
plt.show()


### 9. What is the trend of sales over time?

In [None]:

# Group data by date and sum the sales (numeric columns only)
date_group = new_merged_data.groupby("date").sum(numeric_only=True)

# Plot the sales over time
plt.figure(figsize=(12,5))
plt.plot(date_group.index, date_group["sales"])
plt.title("Sales Over Time")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


# Feature Processing & Engineering
This section **cleans** and **processes** the dataset and **creates new features**.

## Drop Duplicates

In [None]:
# Checking duplicates in the merged data
new_merged_data.duplicated().sum()

In [None]:
# Drop the specified columns
new_merged_data = new_merged_data.drop(columns=["year", "month", "dayofmonth", "dayofweek", "dayname"])


In [None]:
new_merged_data

## New Features Creation

In [None]:
#change date datatype as datetime to create new features

new_merged_data.date = pd.to_datetime(new_merged_data.date)


new_merged_data['year'] = new_merged_data.date.dt.year

new_merged_data['month'] = new_merged_data.date.dt.month


new_merged_data['dayofmonth'] = new_merged_data.date.dt.day


new_merged_data['dayofweek'] = new_merged_data.date.dt.dayofweek


new_merged_data['dayname'] = new_merged_data.date.dt.strftime('%A')



In [None]:
# Preview data with new features
new_merged_data.head()

## Impute Missing Values

In [None]:
from sklearn.impute import SimpleImputer

# create an instance of the SimpleImputer class with mean strategy
imputer = SimpleImputer(strategy='mean')

# fit the imputer to the dcoilwtico column of new_merged_data
imputer.fit(new_merged_data[['dcoilwtico']])

# use the imputer to transform the dcoilwtico column of new_merged_data, replacing missing values with the mean value
new_merged_data['dcoilwtico'] = imputer.transform(new_merged_data[['dcoilwtico']])
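
Mean imputation gives every gap the same global average. For a daily price series, interpolating between neighbouring observations is a common alternative. The sketch below compares the two on a made-up series (the frame `oil` and its values are hypothetical); it illustrates the option and does not replace the imputer used above.

```python
import numpy as np
import pandas as pd

# Toy oil price series with gaps (values are made up)
oil = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=6, freq="D"),
    "dcoilwtico": [np.nan, 50.0, np.nan, np.nan, 56.0, 58.0],
})

# Mean imputation: every gap receives the same global mean
mean_filled = oil["dcoilwtico"].fillna(oil["dcoilwtico"].mean())

# Linear interpolation between neighbouring observations;
# limit_direction="both" also fills a gap at the start or end of the series
interpolated = oil["dcoilwtico"].interpolate(method="linear", limit_direction="both")

print(pd.DataFrame({"mean": mean_filled, "interpolated": interpolated}))
```

Interpolation should be applied to the oil table sorted by date, before it is merged with the sales rows.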


In [None]:
# Preview data columns after imputing
new_merged_data.isnull().sum()

In [None]:
# Write the DataFrame to a CSV file
new_merged_data.to_csv('new_merged_data.csv', index=False)

In [None]:
# drop unnecessary columns (inplace=True modifies new_merged_data and returns None, so nothing is assigned)
new_merged_data.drop(columns=['id','locale', 'locale_name', 'description', 'transferred'], inplace=True)


In [None]:
new_merged_data.head()

In [None]:
# set the date column as the index
new_merged_data.set_index('date', inplace=True)

In [None]:
new_merged_data.head()

In [None]:
# drop more columns (inplace=True modifies new_merged_data and returns None, so nothing is assigned)

new_merged_data.drop(columns=['state',  'store_type', 'dayname'], inplace=True)

In [None]:
final_data = new_merged_data.copy()

In [None]:
final_data.head()

In [None]:
# categorizing the products
food_families = ['BEVERAGES', 'BREAD/BAKERY', 'FROZEN FOODS', 'MEATS', 'PREPARED FOODS', 'DELI','PRODUCE', 'DAIRY','POULTRY','EGGS','SEAFOOD']
final_data['family'] = np.where(final_data['family'].isin(food_families), 'FOODS', final_data['family'])
home_families = ['HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES']
final_data['family'] = np.where(final_data['family'].isin(home_families), 'HOME', final_data['family'])
clothing_families = ['LINGERIE', 'LADIESWEAR']
final_data['family'] = np.where(final_data['family'].isin(clothing_families), 'CLOTHING', final_data['family'])
grocery_families = ['GROCERY I', 'GROCERY II']
final_data['family'] = np.where(final_data['family'].isin(grocery_families), 'GROCERY', final_data['family'])
stationery_families = ['BOOKS', 'MAGAZINES','SCHOOL AND OFFICE SUPPLIES']
final_data['family'] = np.where(final_data['family'].isin(stationery_families), 'STATIONERY', final_data['family'])
cleaning_families = ['HOME CARE', 'BABY CARE','PERSONAL CARE']
final_data['family'] = np.where(final_data['family'].isin(cleaning_families), 'CLEANING', final_data['family'])
hardware_families = ['PLAYERS AND ELECTRONICS','HARDWARE']
final_data['family'] = np.where(final_data['family'].isin(hardware_families), 'HARDWARE', final_data['family'])

In [None]:
from sklearn.preprocessing import StandardScaler

# create an instance of StandardScaler
scaler = StandardScaler()

# select numerical columns
num_cols = ['sales', 'transactions', 'dcoilwtico', 'year', 'month', 'dayofmonth', 'dayofweek']

# fit and transform the numerical columns
final_data[num_cols] = scaler.fit_transform(final_data[num_cols])
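
The cell above standardises `sales` together with the features and fits the scaler on the whole table. Standardised sales can be negative, which is why the evaluation cells later take absolute values before computing RMSLE. One alternative is to leave the target unscaled and fit the scaler on the training rows only. The sketch below shows this on synthetic data; the frame `toy` and the list `feature_cols` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the modelling table (not real data)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "transactions": rng.integers(500, 3000, size=100).astype(float),
    "dcoilwtico": rng.uniform(30, 110, size=100),
    "sales": rng.uniform(0, 1000, size=100),
})

feature_cols = ["transactions", "dcoilwtico"]  # the target 'sales' is left unscaled
X_train, X_test, y_train, y_test = train_test_split(
    toy[feature_cols], toy["sales"], test_size=0.2, random_state=42
)

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse those training statistics on the test rows
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))
print((y_train >= 0).all())
```

Keeping the target in its original units means RMSLE can be computed directly, without taking absolute values.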


## Features Encoding


In [None]:
from sklearn.preprocessing import OneHotEncoder

# Select the categorical columns
categorical_columns = ["family", "city", "holiday_type"]
categorical_data = final_data[categorical_columns]

# Initialize the OneHotEncoder
encoder = OneHotEncoder()

# Fit and transform the data to one hot encoding
one_hot_encoded_data = encoder.fit_transform(categorical_data)

# Build the column names for the one hot encoded data, e.g. "family_FOODS"
# (get_feature_names_out requires scikit-learn 1.0 or newer)
column_names = encoder.get_feature_names_out(categorical_columns)

# Convert the one hot encoding data to a DataFrame
one_hot_encoded_data = pd.DataFrame(one_hot_encoded_data.toarray(), columns=column_names)


# Reset the index of both dataframes
final_data = final_data.reset_index(drop=True)
one_hot_encoded_data = one_hot_encoded_data.reset_index(drop=True)

# Concatenate the original dataframe with the one hot encoded data
final_data_encoded = pd.concat([final_data, one_hot_encoded_data], axis=1)

# Drop the original categorical columns
final_data_encoded.drop(categorical_columns, axis=1, inplace=True)



In [None]:
final_data_encoded.head()

In [None]:
#Rename dcoilwtico column to oil price
final_data_encoded.rename(columns={'dcoilwtico':'oil_price'}, inplace=True)


In [None]:
final_data_encoded.head()

In [None]:
# Make a copy of the final_data_encoded as data
data = final_data_encoded.copy()

In [None]:
data.head()

In [None]:
fig, ax = plt.subplots(figsize=(16, 11))
ax.plot(new_merged_data['sales'])
ax.set_xlabel('Time')
ax.set_ylabel('Sales')
fig.autofmt_xdate()
plt.tight_layout()

In [None]:
# Write the DataFrame to a CSV file
data.to_csv('encoded_data.csv', index=False)

# Machine Learning Modeling


This section builds, trains, evaluates and compares the models.

## Simple Model #001


Keep the following structure for every model you want to try.

### Create and Train the Model

In [None]:
# Import libraries
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_squared_log_error

In [None]:
# Split Data to train and Test
from sklearn.model_selection import train_test_split

# Create the feature dataframe using the selected columns
X = data.drop(["sales"], axis=1)

# Get the target variable
y = data.sales

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
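
`train_test_split` shuffles rows by default, so the test set can contain dates that fall before some training dates. For a forecasting task, a chronological split is a common alternative. The sketch below assumes a `date` column is still available in the modelling table (in this notebook it was dropped when the index was reset). The frame `toy` and the 80/20 cutoff are illustrative choices.

```python
import pandas as pd

# Toy daily data standing in for the modelling table (values are made up)
toy = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=10, freq="D"),
    "oil_price": [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
    "sales": [10, 12, 11, 13, 15, 14, 16, 18, 17, 19],
}).sort_values("date")

# Use the last 20% of dates as the hold-out period
unique_dates = toy["date"].drop_duplicates().sort_values()
cutoff = unique_dates.iloc[int(len(unique_dates) * 0.8)]

train = toy[toy["date"] < cutoff]
test = toy[toy["date"] >= cutoff]

# Separate features and target; the raw date is not used as a feature here
X_train, y_train = train.drop(columns=["sales", "date"]), train["sales"]
X_test, y_test = test.drop(columns=["sales", "date"]), test["sales"]

print(train["date"].max(), test["date"].min())
```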


### Linear Regression Model

In [None]:
# Linear Regression Model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Make prediction on X_test
lr_predictions = lr.predict(X_test)


In [None]:

plt.scatter(y_test, lr_predictions)
plt.xlabel("True Values")
plt.ylabel("Predictions")
plt.title("Linear Regression")
plt.show()


In [None]:
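# Evaluation Metrics for Linear Regression
# round() works whether the metric comes back as a Python float or a NumPy float
lr_mse = round(mean_squared_error(y_test, lr_predictions), 2)
lr_rmse = round(np.sqrt(mean_squared_error(y_test, lr_predictions)), 2)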

In [None]:
# apply the absolute value function to y_test to remove negative signs
y_test_abs = abs(y_test)
lr_predictions_abs = abs(lr_predictions)


In [None]:
# calculate the root mean squared logarithmic error using the new y_test_abs and lr_predictions_abs arrays
lr_rmsle = np.sqrt(mean_squared_log_error(y_test_abs, lr_predictions_abs)).round(2)

In [None]:
# Print the evaluation results for Linear Regression model
print("\nEvaluation Results for Linear Regression:")
print("MSE:", lr_mse)
print("RMSE:", lr_rmse)
print("RMSLE:", lr_rmsle)

### Decision Tree Regression Model

In [None]:
# Decision Tree Regression Model
dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)

# Make prediction on X_test
dt_predictions = dt.predict(X_test)

In [None]:
plt.scatter(y_test, dt_predictions)
plt.xlabel("True Values")
plt.ylabel("Predictions")
plt.title("Decision Tree Regression")
plt.show()

In [None]:
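# Evaluation Metrics for Decision Tree Regression
# round() works whether the metric comes back as a Python float or a NumPy float
dt_mse = round(mean_squared_error(y_test, dt_predictions), 2)
dt_rmse = round(np.sqrt(mean_squared_error(y_test, dt_predictions)), 2)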


In [None]:
# apply the absolute value function to y_test to remove negative signs
#y_test_abs = abs(y_test)
dt_predictions_abs = abs(dt_predictions)


In [None]:
# calculate the root mean squared logarithmic error using the new y_test_abs and dt_predictions_abs arrays

dt_rmsle = np.sqrt(mean_squared_log_error(y_test_abs, dt_predictions_abs)).round(2)

In [None]:
# Print the evaluation results for Decision Tree Regression model
print("\nEvaluation Results for Decision Tree Regression:")
print("MSE:", dt_mse)
print("RMSE:", dt_rmse)
print("RMSLE:", dt_rmsle)

### XGBoost

In [None]:
# XGBoost Model
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
xgb_predictions = xgb.predict(X_test)


In [None]:
plt.scatter(y_test, xgb_predictions)
plt.xlabel("True Values")
plt.ylabel("Predictions")
plt.title("XGBoost")
plt.show()

In [None]:
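# Evaluation Metrics for XGBoost
# round() works whether the metric comes back as a Python float or a NumPy float
xgb_mse = round(mean_squared_error(y_test, xgb_predictions), 2)
xgb_rmse = round(np.sqrt(mean_squared_error(y_test, xgb_predictions)), 2)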

In [None]:
# apply the absolute value function to y_test to remove negative signs
#y_test_abs = abs(y_test)
xgb_predictions_abs = abs(xgb_predictions)

In [None]:
# calculate the root mean squared logarithmic error using the new y_test_abs and xgb_predictions_abs arrays

xgb_rmsle = np.sqrt(mean_squared_log_error(y_test_abs, xgb_predictions_abs)).round(2)

In [None]:
# Print the evaluation results for XGBoost model
print("\nEvaluation Results for XGBoost:")
print("MSE:", xgb_mse)
print("RMSE:", xgb_rmse)
print("RMSLE:", xgb_rmsle)

### Random Forest Regression Model

In [None]:
# Random Forest Regression Model
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Make prediction on X_test
rf_predictions = rf.predict(X_test)

In [None]:
plt.scatter(y_test, rf_predictions)
plt.xlabel("True Values")
plt.ylabel("Predictions")
plt.title("Random Forest Regression")
plt.show()

In [None]:
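# Evaluation Metrics for Random Forest Regression
# round() works whether the metric comes back as a Python float or a NumPy float
rf_mse = round(mean_squared_error(y_test, rf_predictions), 2)
rf_rmse = round(np.sqrt(mean_squared_error(y_test, rf_predictions)), 2)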
#rf_rmsle = np.sqrt(mean_squared_error(np.log(y_test), np.log(rf_predictions)))

In [None]:
# apply the absolute value function to y_test to remove negative signs
#y_test_abs = abs(y_test)
rf_predictions_abs = abs(rf_predictions)


In [None]:
# calculate the root mean squared logarithmic error using the new y_test_abs and rf_predictions_abs arrays

rf_rmsle = np.sqrt(mean_squared_log_error(y_test_abs, rf_predictions_abs)).round(2)

In [None]:
# Print the evaluation results for Random Forest Regression model
print("\nEvaluation Results for Random Forest:")
print("MSE:", rf_mse)
print("RMSE:", rf_rmse)
print("RMSLE:", rf_rmsle)

In [None]:
# Create a table to compare the evaluation results
results_table = pd.DataFrame({'Model': ['Linear Regression', 'Decision Tree', 'XGBoost', 'Random Forest'],
                              'MSE': [lr_mse, dt_mse, xgb_mse, rf_mse],
                              'RMSE': [lr_rmse, dt_rmse, xgb_rmse, rf_rmse],
                              'RMSLE': [lr_rmsle, dt_rmsle, xgb_rmsle, rf_rmsle]})

# Print the comparison table
print("\nComparison Table of Evaluation Results:")
print(results_table)
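
The question list asks how RMSLE, RMSE and MSE differ. The sketch below computes each metric by hand with NumPy on three made-up values (`y_true` and `y_pred` are hypothetical), so the definitions can be compared side by side. MAE is included because the question mentions it.

```python
import numpy as np

# Toy true values and predictions (made up, non-negative as RMSLE requires)
y_true = np.array([10.0, 100.0, 1000.0])
y_pred = np.array([12.0, 90.0, 1300.0])

errors = y_pred - y_true

# MSE: mean of squared errors, in squared units of the target
mse = np.mean(errors ** 2)

# RMSE: square root of MSE, back in the units of the target
rmse = np.sqrt(mse)

# MAE: mean of absolute errors; never larger than RMSE
mae = np.mean(np.abs(errors))

# RMSLE: RMSE computed on log(1 + value), so it measures relative error
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(f"MSE={mse:.2f} RMSE={rmse:.2f} MAE={mae:.2f} RMSLE={rmsle:.3f}")
```

MAE can exceed MSE when the individual errors are smaller than 1 in absolute value, because squaring such errors makes them smaller. That situation is likely when the target has been standardised, as it is in this notebook.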

