## LearnX Sales Forecasting

<h3>Your task is to predict the course sales for each course in the test set for the next 60 days </h3>

<h5>We first predict User_Traffic for the test data with a model fitted on the training data. With User_Traffic added to the test data, we then predict Sales.</h5>
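
The two-stage idea described above can be sketched on made-up data. This is only an illustration under stated assumptions: the frames below are synthetic, only two of the notebook's columns are used as features, and the names `traffic_model` and `sales_model` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): column names follow the notebook,
# the values are invented for illustration only.
n_train, n_test = 200, 50
train = pd.DataFrame({
    "Public_Holiday": rng.integers(0, 2, n_train),
    "Competition_Metric": rng.random(n_train),
})
train["User_Traffic"] = (
    1000 + 500 * train["Public_Holiday"]
    - 300 * train["Competition_Metric"]
    + rng.normal(0, 20, n_train)
)
train["Sales"] = 0.1 * train["User_Traffic"] + rng.normal(0, 2, n_train)

test = pd.DataFrame({
    "Public_Holiday": rng.integers(0, 2, n_test),
    "Competition_Metric": rng.random(n_test),
})

features = ["Public_Holiday", "Competition_Metric"]

# Stage 1: learn User_Traffic from features present in both train and test,
# then fill in the missing User_Traffic column of the test frame.
traffic_model = LinearRegression().fit(train[features], train["User_Traffic"])
test["User_Traffic"] = traffic_model.predict(test[features])

# Stage 2: learn Sales from the same features plus User_Traffic.
sales_features = features + ["User_Traffic"]
sales_model = LinearRegression().fit(train[sales_features], train["Sales"])
test["Sales"] = sales_model.predict(test[sales_features])

print(test.head())
```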

In [None]:
#Importing all required packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder 
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import accuracy_score, recall_score, f1_score
from sklearn.metrics import mean_squared_log_error as msle
from sklearn.linear_model import LinearRegression as LR
from imblearn.over_sampling import SMOTE


In [None]:
#Reading data from csv files
train_data = pd.read_csv("train_data.csv")
test_data = pd.read_csv("test_data.csv")

train_data.shape, test_data.shape

In [None]:
train_data.dtypes

In [None]:
train_data['Public_Holiday'].value_counts()

In [None]:
train_data['Competition_Metric'].value_counts()

In [None]:
train_data.isnull().sum()

In [None]:
#Checking the KDE plot of Competition_Metric.
#Since the distribution is right-skewed, we later scale the variable by 100 and take its log,
#which gives a more normal-like distribution.
#fill=True replaces the deprecated shade=True argument in recent seaborn versions.
sns.kdeplot(train_data['Competition_Metric'], fill=True)

In [None]:
train_data['Competition_Metric'].isnull().sum()

Since Competition_Metric has very few missing values relative to the dataset size, we fill them with the column mean.

We then label-encode the Course_Domain and Course_Type variables, scale Competition_Metric by 100, and take its log.
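
The effect of the scale-and-log step on skewness can be checked numerically. The sketch below uses synthetic log-normal values as a stand-in for Competition_Metric (an assumption, since the real column is not reproduced here) and applies the same transform as the notebook, `log(100 * x + 0.1)`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for Competition_Metric (assumption).
metric = pd.Series(rng.lognormal(mean=-3.0, sigma=1.0, size=5000))

# Same transform as the notebook: scale by 100, then take the log with a
# small 0.1 offset so that exact zeros do not produce -inf.
transformed = np.log(metric * 100 + 0.1)

skew_before = metric.skew()
skew_after = transformed.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness closer to zero after the transform indicates a more symmetric distribution.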

In [None]:
#Creating a copy of train_data; plain assignment would only alias the same frame
train_data_transformed = train_data.copy()

#Replacing the null values in Competition Metric with the mean value
train_data_transformed['Competition_Metric'] = train_data_transformed['Competition_Metric'].fillna(train_data_transformed['Competition_Metric'].mean())

# typecasting Object variables to category
train_data_transformed['Course_Domain'] = train_data_transformed['Course_Domain'].astype('category')
train_data_transformed['Course_Type'] = train_data_transformed['Course_Type'].astype('category')


#Using label encoding instead of one-hot encoding
le = LabelEncoder()

train_data_transformed['Course_Domain'] = le.fit_transform(train_data_transformed['Course_Domain'])
train_data_transformed['Course_Type'] = le.fit_transform(train_data_transformed['Course_Type'])

#Log for Scaled Competition metric
train_data_transformed['Competition_Metric'] = train_data_transformed['Competition_Metric'] *100
train_data_transformed['Competition_Metric'] = np.log(train_data_transformed['Competition_Metric'] + 0.1)

We now drop the User_Traffic and Sales columns from the training data to build the feature matrix for the linear regression model, and use User_Traffic as the target.

In [None]:
train_data_model_x = train_data_transformed.drop(['User_Traffic','Sales'], axis = 1)
train_data_model_y = train_data_transformed['User_Traffic']

<h5>We now apply the same transformations to the test data to bring it in line with the training data.</h5>
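
One caveat with label encoding: a `LabelEncoder` fitted separately on the test data assigns integers from the categories it sees there, so the same category can end up with a different code than in the training data. The sketch below uses hypothetical category values `"A"`, `"B"`, `"C"` to show the difference between a separate fit and reusing the encoder fitted on the training column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values (assumption); the test column lacks "A".
train_col = pd.Series(["A", "B", "C", "B"])
test_col = pd.Series(["C", "B"])

# Separate fit on test: codes are assigned from the test categories only,
# so "B" -> 0 and "C" -> 1 here.
separate = LabelEncoder().fit_transform(test_col)

# Shared encoder: fit on train, reuse on test, so "B" -> 1 and "C" -> 2,
# which matches the codes used in the training data.
shared_le = LabelEncoder().fit(train_col)
shared = shared_le.transform(test_col)

print("separate fit:", list(separate))
print("shared fit:  ", list(shared))
```

Note that `transform` raises a `ValueError` for categories that were not present when the encoder was fitted, so this variant assumes the test categories are a subset of the training ones.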

In [None]:
# Creating a copy of test_data and performing the transformations on it
test_data_model = test_data.copy()

#Replacing the null values in Competition Metric with the mean value
test_data_model['Competition_Metric'] = test_data_model['Competition_Metric'].fillna(test_data_model['Competition_Metric'].mean())

# typecasting Object variables to category
test_data_model['Course_Domain'] = test_data_model['Course_Domain'].astype('category')
test_data_model['Course_Type'] = test_data_model['Course_Type'].astype('category')


#Using label encoding instead of one-hot encoding
le = LabelEncoder()

test_data_model['Course_Domain'] = le.fit_transform(test_data_model['Course_Domain'])
test_data_model['Course_Type'] = le.fit_transform(test_data_model['Course_Type'])

#Log for Scaled Competition metric
test_data_model['Competition_Metric'] = test_data_model['Competition_Metric'] *100
test_data_model['Competition_Metric'] = np.log(test_data_model['Competition_Metric'] + 0.1)

In [None]:
train_data_model_x.shape, test_data_model.shape

<h5>The train and test data are now in the same format.
We will now apply linear regression to predict "User_Traffic".</h5>

In [None]:
lr = LR()

#Fitting the Linear Regression Model
lr.fit(train_data_model_x,train_data_model_y)
train_data_model_yhat = lr.predict(train_data_model_x)

#Checking the root mean squared log error (RMSLE) on the training data
train_score = msle(train_data_model_y,train_data_model_yhat)

np.sqrt(train_score) * 1000
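
For reference, the quantity computed above is the root mean squared logarithmic error,

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(1+\hat{y}_i) - \log(1+y_i)\bigr)^2},$$

multiplied by 1000, which only rescales the number. A small check on made-up values (the arrays below are illustrative, not taken from the dataset) shows that the manual formula agrees with `mean_squared_log_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Illustrative values only (assumption); not taken from the dataset.
y_true = np.array([10.0, 100.0, 1000.0])
y_pred = np.array([12.0, 90.0, 1100.0])

# Manual RMSLE: log1p(x) computes log(1 + x).
manual = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Same quantity via scikit-learn's MSLE followed by a square root.
via_sklearn = np.sqrt(mean_squared_log_error(y_true, y_pred))

print(manual * 1000, via_sklearn * 1000)
```

`mean_squared_log_error` raises a `ValueError` if either array contains negative values, which is why negative predictions have to be replaced before scoring.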

We will now predict the "User_Traffic" values for the test data.

In [None]:
test_uh = lr.predict(test_data_model)
test_data_uh = test_data_model
test_data_uh['User_Traffic'] = test_uh
test_data_uh.shape

<h5> We have successfully predicted the User_Traffic for the Testing data.</h5>
<h6>We now train a model on Sales and predict the Sales values.</h6>

In [None]:
# Re-creating the training data so that User_Traffic is included as a feature
train_data_model_x = train_data_transformed.drop(['Sales'], axis = 1)
train_data_model_y = train_data_transformed['Sales']

In [None]:
#fill=True replaces the deprecated shade=True argument in recent seaborn versions
sns.kdeplot(train_data_model_x['User_Traffic'], fill=True)

Since User_Traffic is right-skewed, we apply a square-root transform to it.
The same transform is applied to both the train and test data.
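
One point to watch: the test User_Traffic values come from a linear model, which can produce negative numbers, and `np.sqrt` returns `nan` for those. A minimal guard, assuming that clipping at zero is acceptable for such rows, is sketched below on made-up predictions.

```python
import numpy as np

# Made-up predicted traffic values (assumption), including one negative value
# of the kind a linear model can produce.
predicted_traffic = np.array([2500.0, 400.0, -30.0])

# Clip negatives to zero first; np.sqrt on a negative input would give nan.
clipped = np.clip(predicted_traffic, 0, None)
root = np.sqrt(clipped)

print(root)
```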

In [None]:
# Applying a square-root transformation to User_Traffic to account for the right-skewed distribution
train_data_model_x['User_Traffic'] = np.sqrt(train_data_model_x['User_Traffic'])
test_data_uh['User_Traffic'] = np.sqrt(test_data_uh['User_Traffic'])

In [None]:
lr = LR()

#Fitting the Linear Regression Model
lr.fit(train_data_model_x,train_data_model_y)
train_data_model_yhat = lr.predict(train_data_model_x)

#A few predictions are below 0, so we replace them with 1 for the time being
train_data_model_yhat = np.where(train_data_model_yhat < 0, 1, train_data_model_yhat)

#Checking the root mean squared log error (RMSLE) on the training data
train_score = msle(train_data_model_y,train_data_model_yhat)

np.sqrt(train_score) * 1000

We will now predict the Sales values for the test data.

In [None]:
test_sales = lr.predict(test_data_uh)

test_sales_submit = pd.DataFrame(test_data_uh['ID'])
test_sales_submit['Sales'] = test_sales

test_sales_submit.set_index('ID', inplace = True)

test_sales_submit.to_csv('Lakshay_submit.csv')

<h2>We now re-create the model, this time without predicting User_Traffic</h2>

In [None]:
train_data = pd.read_csv("train_data.csv")

#Creating a copy of train_data; plain assignment would only alias the same frame
train_data_transformed = train_data.copy()

#Replacing the null values in Competition Metric with the mean value
train_data_transformed['Competition_Metric'] = train_data_transformed['Competition_Metric'].fillna(train_data_transformed['Competition_Metric'].mean())

# typecasting Object variables to category
train_data_transformed['Course_Domain'] = train_data_transformed['Course_Domain'].astype('category')
train_data_transformed['Course_Type'] = train_data_transformed['Course_Type'].astype('category')


#Using label encoding instead of one-hot encoding
le = LabelEncoder()

train_data_transformed['Course_Domain'] = le.fit_transform(train_data_transformed['Course_Domain'])
train_data_transformed['Course_Type'] = le.fit_transform(train_data_transformed['Course_Type'])

#Square root of Competition_Metric (replaces the scaled log used earlier)
train_data_transformed['Competition_Metric'] = np.sqrt(train_data_transformed['Competition_Metric'])

In [None]:
test_data = pd.read_csv("test_data.csv")

# Creating a copy of test_data and performing the transformations on it
test_data_model = test_data.copy()

#Replacing the null values in Competition Metric with the mean value
test_data_model['Competition_Metric'] = test_data_model['Competition_Metric'].fillna(test_data_model['Competition_Metric'].mean())

# typecasting Object variables to category
test_data_model['Course_Domain'] = test_data_model['Course_Domain'].astype('category')
test_data_model['Course_Type'] = test_data_model['Course_Type'].astype('category')


#Using label encoding instead of one-hot encoding
le = LabelEncoder()

test_data_model['Course_Domain'] = le.fit_transform(test_data_model['Course_Domain'])
test_data_model['Course_Type'] = le.fit_transform(test_data_model['Course_Type'])

#Square root of Competition_Metric (replaces the scaled log used earlier)

test_data_model['Competition_Metric'] = np.sqrt(test_data_model['Competition_Metric'])

In [None]:
train_data_model_x = train_data_transformed.drop(['User_Traffic','Sales'], axis = 1)
train_data_model_y = train_data_transformed['Sales']

lr = LR()

#Fitting the Linear Regression Model
lr.fit(train_data_model_x,train_data_model_y)
train_data_model_yhat = lr.predict(train_data_model_x)

#A few predictions are below 0, so we replace them with 1 for the time being
train_data_model_yhat = np.where(train_data_model_yhat < 0, 1, train_data_model_yhat)

#Checking the root mean squared log error (RMSLE) on the training data
train_score = msle(train_data_model_y,train_data_model_yhat)

np.sqrt(train_score) * 1000
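
The score above is measured on the same rows the model was fitted on. A hold-out split, using the `train_test_split` function already imported at the top, gives an estimate on rows the model has not seen. The sketch below runs on synthetic data, and the helper name `rmsle_1000` is hypothetical. Since the task is a forecast over future days, a split by day may be more faithful than the random split shown here, which is only the simplest variant.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features and a positive target (assumption); not the real dataset.
X = rng.random((500, 3))
y = 50 + 100 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 5, 500)

# Hold out 20% of the rows for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_fit, y_fit)


def rmsle_1000(y_true, y_pred):
    """RMSLE scaled by 1000, with the notebook's guard for negative predictions."""
    y_pred = np.where(y_pred < 0, 1, y_pred)
    return np.sqrt(mean_squared_log_error(y_true, y_pred)) * 1000


fit_score = rmsle_1000(y_fit, model.predict(X_fit))
val_score = rmsle_1000(y_val, model.predict(X_val))
print(f"fit rows: {fit_score:.1f}, held-out rows: {val_score:.1f}")
```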

In [None]:
test_sales = lr.predict(test_data_model)

test_sales_submit = pd.DataFrame(test_data_model['ID'])
test_sales_submit['Sales'] = test_sales

test_sales_submit.set_index('ID', inplace = True)

test_sales_submit.to_csv('Lakshay_submit_3.csv')