# Prediction of sales

### Problem Statement
This dataset contains sales data for 1559 products across 10 stores in different cities, along with certain attributes of each product and store. The aim is to build a predictive model that estimates the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Note that the data may contain missing values, because some stores might not report everything due to technical glitches. These values need to be treated before modeling.
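
As a minimal sketch of what treating missing values could look like, the snippet below counts and fills gaps in a small, hypothetical frame. The toy values are invented for illustration; only the column names `Item_Weight` and `Outlet_Size` come from the table above, and median/mode imputation is an assumed strategy, not necessarily the one used to build `df_final.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing two column names from the table above
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", "Medium"],
})

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Assumed strategy: median for the numeric column, mode for the categorical one
filled = toy.copy()
filled["Item_Weight"] = filled["Item_Weight"].fillna(filled["Item_Weight"].median())
filled["Outlet_Size"] = filled["Outlet_Size"].fillna(filled["Outlet_Size"].mode()[0])
print(filled)
```

Other strategies, such as imputing per product or per outlet type, may suit the real data better.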



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

We covered data preparation and feature engineering two weeks ago. Now it's time to build some predictive models.

## Model Building

## Task
Make a baseline model. Baseline models set a benchmark against which to gauge the performance of future models. If a new model performs below the baseline, something has gone wrong, and you should check your data.

To make a baseline model, run a simple regression model without altering the default parameters in sklearn. 

In [82]:
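import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Load the prepared dataset, forcing Item_Weight to float
dataframe = pd.read_csv('df_final.csv', dtype={'Item_Weight': 'float64'}, low_memory=False)
data = dataframe.values
# Features are every column except the last; the target is the last column
X, y = data[:, :-1], data[:, -1]
# Cast the target to integers so it can serve as class labels
y = y.astype(int)
print(X.shape, y.shape)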

(8523, 1607) (8523,)


In [83]:
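from sklearn.preprocessing import StandardScaler

# Standardize features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit the default model and report its accuracy on the training data
model = LogisticRegression()
result = model.fit(X_scaled, y)
model.score(X_scaled, y)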

1.0
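
The cell above fits a classifier (`LogisticRegression`) on integer-cast targets, so the reported score is training accuracy. For a continuous target such as `Item_Outlet_Sales`, a default `LinearRegression` may be a more natural baseline. The sketch below shows this on synthetic stand-in data; `X_demo`, `y_demo`, and `true_coef` are invented for illustration and are not taken from `df_final.csv`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (assumption): 200 rows, 5 features, continuous target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

# Default-parameter linear regression as the baseline
baseline = LinearRegression()
baseline.fit(X_demo, y_demo)

# For a regressor, score() returns R^2 rather than accuracy
baseline_r2 = baseline.score(X_demo, y_demo)
baseline_mse = mean_squared_error(y_demo, baseline.predict(X_demo))
print(f"Baseline R^2: {baseline_r2:.4f}, MSE: {baseline_mse:.4f}")
```

Scores computed on the training data tend to be optimistic, so the baseline is more informative when evaluated on held-out data.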

## Task
Split your data into an 80% train set and a 20% test set.

In [84]:
from sklearn.model_selection import train_test_split

In [85]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
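
Note that `X_scaled` was produced by fitting the scaler on the full dataset before the split, so test-set statistics influence the training features. One possible way to avoid this is to split first and wrap the scaler and model in a pipeline, so the scaler is fit on the training rows only. The sketch below uses synthetic data; `X_demo`, `y_demo`, and the choice of `Ridge` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 100 rows, 3 features
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

# Split the raw (unscaled) features first: 80% train, 20% test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when scoring the test rows
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)
print(X_tr.shape, X_te.shape)
```

A pipeline like this can also be passed to a cross-validated search, in which case the scaler is refit inside each fold.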

## Task
Use grid search to find the best value of the `alpha` parameter for the Ridge and Lasso regressions from `sklearn`.

In [86]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import RepeatedKFold, RandomizedSearchCV
from scipy.stats import loguniform


cv = RepeatedKFold(n_splits=10, n_repeats=2, random_state=1)

space = {
    'alpha': loguniform(1e-3, 100)
}

# Randomized search over alpha: 10 sampled candidates, all CPU cores (n_jobs=-1)
ridge_model = Ridge()
ridge_search = RandomizedSearchCV(ridge_model, space, n_iter=10, scoring='neg_mean_squared_error', n_jobs=-1, cv=cv, random_state=1)
ridge_result = ridge_search.fit(X_train, y_train)

# Same search for Lasso, with a higher max_iter to help convergence
lasso_model = Lasso(max_iter=5000)
lasso_search = RandomizedSearchCV(lasso_model, space, n_iter=10, scoring='neg_mean_squared_error', n_jobs=-1, cv=cv, random_state=1)
lasso_result = lasso_search.fit(X_train, y_train)

In [88]:
from sklearn.model_selection import GridSearchCV

# Define a range of discrete alpha values for grid search
space = {'alpha': [0.01, 0.1, 1, 10, 100]}

ridge_search = GridSearchCV(ridge_model, space, scoring='neg_mean_squared_error', n_jobs=-1, cv=cv)
ridge_result = ridge_search.fit(X_train, y_train)

In [89]:
# Best parameters found
print("Best Parameters for Ridge:", ridge_result.best_params_)
print("Best Score for Ridge:", ridge_result.best_score_)

print("Best Parameters for Lasso:", lasso_result.best_params_)
print("Best Score for Lasso:", lasso_result.best_score_)


Best Parameters for Ridge: {'alpha': 0.01}
Best Score for Ridge: -0.00023726807804257008
Best Parameters for Lasso: {'alpha': 0.0010013176560941261}
Best Score for Lasso: -3.805989732935029e-06
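
The scores above are negative because `neg_mean_squared_error` flips the sign of the MSE so that higher is better. Negating `best_score_` recovers the cross-validated MSE. The sketch below runs the same kind of discrete grid search for both Ridge and Lasso on synthetic data and reports MSE and RMSE. `X_demo`, `y_demo`, and the 5-fold `KFold` are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (assumption): 120 rows, 4 features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([1.5, 0.0, -2.0, 0.0]) + rng.normal(scale=0.2, size=120)

param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}
cv = KFold(n_splits=5, shuffle=True, random_state=1)

results = {}
for name, estimator in [("Ridge", Ridge()), ("Lasso", Lasso(max_iter=5000))]:
    search = GridSearchCV(estimator, param_grid, scoring="neg_mean_squared_error", cv=cv)
    search.fit(X_demo, y_demo)
    # best_score_ is the negated MSE, so flip the sign back
    cv_mse = -search.best_score_
    results[name] = (search.best_params_["alpha"], cv_mse)
    print(f"{name}: best alpha={search.best_params_['alpha']}, "
          f"CV MSE={cv_mse:.4f}, CV RMSE={np.sqrt(cv_mse):.4f}")
```

If the best `alpha` sits at the edge of the grid, as with `0.01` for Ridge above, extending the grid in that direction may be worth trying.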


## Task
Using the best model from the grid search, predict the values in the test set and compare the result against your benchmark.

In [87]:
from sklearn.metrics import mean_squared_error

# 'ridge_result' and 'lasso_result' are the fitted search objects from above

# Select the best Ridge model found by the search
best_ridge = ridge_result.best_estimator_

# Predict on the test set
ridge_predictions = best_ridge.predict(X_test)

# Calculate the error for Ridge regression
ridge_mse = mean_squared_error(y_test, ridge_predictions)
print(f"Ridge Mean Squared Error: {ridge_mse}")

# Repeat for Lasso
best_lasso = lasso_result.best_estimator_
lasso_predictions = best_lasso.predict(X_test)
lasso_mse = mean_squared_error(y_test, lasso_predictions)
print(f"Lasso Mean Squared Error: {lasso_mse}")

# Benchmark: always predict the mean of y_train
benchmark_mse = mean_squared_error(y_test, np.full(len(y_test), y_train.mean()))
print(f"Benchmark Mean Squared Error: {benchmark_mse}")

print(f"Improvement over benchmark (Ridge): {benchmark_mse - ridge_mse}")
print(f"Improvement over benchmark (Lasso): {benchmark_mse - lasso_mse}")


Ridge Mean Squared Error: 1.915904310693906e-05
Lasso Mean Squared Error: 3.6182718411989197e-06
Benchmark Mean Squared Error: 0.14772664182509457
Improvement over benchmark (Ridge): 0.14770748278198764
Improvement over benchmark (Lasso): 0.14772302355325337
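
Absolute MSE differences depend on the scale of the target, so a relative reduction can be easier to read. The mean-of-`y_train` benchmark used above is also what scikit-learn's `DummyRegressor(strategy="mean")` computes. The sketch below shows both ideas on a few hand-made numbers; all `*_demo` arrays are hypothetical and unrelated to the results printed above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical targets and model predictions, for illustration only
y_train_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_test_demo = np.array([2.0, 4.0])
model_pred_demo = np.array([2.5, 3.5])

# DummyRegressor ignores the features and always predicts the training mean
dummy = DummyRegressor(strategy="mean")
dummy.fit(np.zeros((len(y_train_demo), 1)), y_train_demo)
dummy_pred = dummy.predict(np.zeros((len(y_test_demo), 1)))

benchmark_mse_demo = mean_squared_error(y_test_demo, dummy_pred)
model_mse_demo = mean_squared_error(y_test_demo, model_pred_demo)

# Fraction of the benchmark error removed by the model
relative_reduction = 1 - model_mse_demo / benchmark_mse_demo
print(f"Benchmark MSE: {benchmark_mse_demo}, model MSE: {model_mse_demo}")
print(f"Relative MSE reduction: {relative_reduction:.2f}")
```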
