![INSA](https://gi.insa-lyon.fr/sites/all/themes/insa_satellites/logo.png)

# GI-5-DSC - Data Science: Sales prediction and fraud detection
***


In this tutorial, we will perform a comparison study of forecasting methods. More specifically, we will compare the performance of traditional machine learning techniques with that of neural networks, and we will experiment with both classification and regression on the same dataset.

The dataset used in this tutorial is maintained openly under the Creative Commons 4.0 license by Fabian Constante, Fernando Silva, and António Pereira through the Mendeley Data repository. It consists of roughly 180k transactions from the supply chains used by the company DataCo Global over 3 years (from 2015 to 2018). The dataset can be downloaded from:

https://data.mendeley.com/datasets/8gx2fvg2k6/5

It contains 3 files:

1. DescriptionDataCoSupplyChain.csv: the description of each variable in DataCoSupplyChain.csv.
2. DataCoSupplyChain.csv: structured data
3. tokenized_access_logs.csv: unstructured data


We will train deep neural networks and machine learning models for:
1. Classification
 * detection of fraudulent transactions,
 * late delivery of orders,
2. Regression
 * sales revenue,
 * order quantity.


The machine learning classifiers used in this project for fraud detection and late delivery prediction are Logistic Regression, Linear Discriminant Analysis, Gaussian Naive Bayes, Support Vector Machines, k-Nearest Neighbors, Decision Tree classification, and Random Forest classification.
These models will be evaluated and compared using accuracy and F1 score.

The regression models used to predict sales and the quantity of products required are Lasso, Ridge, Gradient Boosting, Decision Tree regression, Random Forest regression, and Linear Regression.
These models will be evaluated and compared with mean absolute error (MAE) and root mean square error (RMSE).

***

## 1. Set up the Environment : import libraries & read the data

### 1.1. Importing all required libraries

In [None]:
!pip install statsmodels

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import calendar,warnings,itertools,matplotlib,keras,shutil
import tensorflow as tf
import statsmodels.api as sm
from datetime import datetime

from sklearn.model_selection import train_test_split,cross_val_score, cross_val_predict
from sklearn import svm,metrics,tree,preprocessing,linear_model
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge,LinearRegression,LogisticRegression,ElasticNet, Lasso
from sklearn.ensemble import RandomForestRegressor,RandomForestClassifier, GradientBoostingRegressor,BaggingClassifier,ExtraTreesClassifier
from sklearn.metrics import accuracy_score,mean_squared_error,recall_score,confusion_matrix,f1_score,roc_curve, auc
from sklearn.datasets import load_iris,make_regression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.kernel_ridge import KernelRidge
from keras import Sequential
from keras.layers import Dense
from IPython.core import display as ICD

## hiding the warnings
warnings.filterwarnings('ignore') 

### 1.2. Import data and take a first look at it


Use the [read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) method from the Pandas library to load the `DataCoSupplyChain.csv` file into a dataframe. 

Hint: you need to use the `encoding='unicode_escape'` param for this dataset.


In [None]:
## Importing Dataset using pandas
## Your code here
data = ******

data.head() # Checking 5 rows in dataset

In [None]:
## Print the number of rows and columns of the dataset
## Your code here




In [None]:
data.info()

Thanks to the 'non-null' column, we can see that some values are missing.

## 2. Exploratory data analysis

### 2.1. Data cleaning


As we saw before, there are some missing values in `Customer Lname`, `Product Description`, `Order Zipcode`, and `Customer Zipcode`, which should be removed or replaced before proceeding with the analysis.


Then, since different customers might share the same first name or the same last name, a new `Customer Full Name` column is created to avoid any ambiguity.

In [None]:
# Adding first name and last name together to create new column
data['Customer Full Name'] = data['Customer Fname'].astype(str) + data['Customer Lname'].astype(str)

To make the analysis easier, some unimportant columns are dropped: `Customer Fname`, `Customer Lname`, `Product Description`, `Customer Email`, `Product Status`, `Customer Password`, `Customer Street`, `Latitude`, `Longitude`, `Product Image`, `Order Zipcode`, `shipping date (DateOrders)`.

Hint: Use the [drop()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method from the Pandas library.

In [None]:
## Drop the columns
# Your code here
data = *****



In [None]:
## Check that the number of columns has decreased
data.shape


There are 3 missing values in the `Customer Zipcode` column.
Since the missing values are only zip codes, which are not very important here, they are replaced with zero before proceeding with the data analysis.

Hint: Use the [fillna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) method from the Pandas library.

In [None]:
## Filling NaN values with zero. Tip: use the fillna method from the pandas library
# Your code here
data['Customer Zipcode'] = *****
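As a minimal sketch of the cleaning step described above, the following toy example counts missing values and replaces them with zero. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are invented).
toy = pd.DataFrame({
    "Customer Zipcode": [725.0, np.nan, 95125.0, np.nan],
    "Sales": [327.75, 299.98, 199.99, 50.0],
})

# Count the missing values per column before cleaning.
missing_before = toy.isna().sum()

# Replace missing zip codes with 0; the other columns are left untouched.
toy["Customer Zipcode"] = toy["Customer Zipcode"].fillna(0)

missing_after = toy.isna().sum()
print(missing_before["Customer Zipcode"], "->", missing_after["Customer Zipcode"])
```

Assigning the result back to the column avoids relying on in-place modification of a column view.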


In [None]:
data.info()

### 2.2. Data Visualisation

To find the important parameters, the correlation between the numeric variables is computed.

In [None]:
fig, ax = plt.subplots(figsize=(24,12))

## Plot heatmap for correlation matrix
sns.heatmap(data.corr(numeric_only=True), annot=True, linewidths=.5, fmt='.1g', cmap= 'coolwarm') # numeric_only skips text columns

For instance, we can observe that `Product Price` has a high correlation with `Sales per customer`, `Order Item Product Price`, `Sales`, and `Order Item Total`.

As the data being analysed is related to a supply chain, it makes sense to find out which region has the most sales.
This can be done with the `groupby` method, which groups rows from the same market or region together, followed by the `sum` function, which adds up all the sales for each group.
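The group-then-sum idea can be sketched on a small invented table. The column names mirror the dataset, but the rows below are made up for illustration only.

```python
import pandas as pd

# Invented rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "Market": ["Europe", "Africa", "Europe", "LATAM", "Africa", "Europe"],
    "Sales per customer": [100.0, 20.0, 150.0, 80.0, 10.0, 50.0],
})

# Group rows by market, select the sales column, then add it up per group.
sales_by_market = (
    toy.groupby("Market")["Sales per customer"]
    .sum()
    .sort_values(ascending=False)
)
print(sales_by_market)
```

Selecting the column before calling `sum()` keeps the result a simple Series that can be plotted directly with `.plot.bar()`.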

In [None]:
## Grouping by market
# Your code here
market = *****

## Grouping by order region
# Your code here
region = *****

plt.figure(figsize=(20, 6))

plt.subplot(121)
******.sum().sort_values(ascending=False).plot.bar(title="Total sales per customer for all markets")

plt.subplot(122)
******.sum().sort_values(ascending=False).plot.bar(title="Total sales per customer for all regions")

It can be seen from the graph that the European market has the highest sales, whereas Africa has the lowest. Within these markets, the Western Europe and Central America regions recorded the highest sales.

Which category of products has the highest sales? The same method can be followed here to see the product category with the highest sales.

In [None]:
## Plot the total sales, average sales and average price per category

# Your code here
******


As we can see from the first figure, the Fishing category had the highest number of sales, followed by Cleats.
However, it is surprising to see that the top 7 products with the highest average price are also the most sold products on average, with computers having almost 1350 sales despite a price of $1500.

Which month or week day recorded the highest sales? This can be found by splitting the order time into year, month, week day, and hour to better observe the trend.

In [None]:
## Create a new column for each part of the date (year, month, etc.).
## This cell can take a while to run

data['order_year']= pd.DatetimeIndex(data['order date (DateOrders)']).year
data['order_month'] = pd.DatetimeIndex(data['order date (DateOrders)']).month
data['order_week_day'] = pd.DatetimeIndex(data['order date (DateOrders)']).weekday
data['order_hour'] = pd.DatetimeIndex(data['order date (DateOrders)']).hour
data['order_month_year'] = pd.to_datetime(data['order date (DateOrders)']).dt.to_period('M')
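An alternative sketch, assuming the dates are stored as text such as `1/31/2018 22:56`: parsing the column once and reusing the result through the `.dt` accessor avoids re-parsing the same strings for every new column. The toy rows and the format string are assumptions, not taken from the dataset.

```python
import pandas as pd

# Two invented timestamps in an assumed month/day/year format.
toy = pd.DataFrame({
    "order date (DateOrders)": ["1/31/2018 22:56", "12/25/2017 8:05"],
})

# Parse once with an explicit format, then reuse the parsed column.
parsed = pd.to_datetime(toy["order date (DateOrders)"], format="%m/%d/%Y %H:%M")

toy["order_year"] = parsed.dt.year
toy["order_month"] = parsed.dt.month
toy["order_week_day"] = parsed.dt.weekday   # Monday=0 ... Sunday=6
toy["order_hour"] = parsed.dt.hour
print(toy)
```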

So what is the purchase trend in week days, hours and months?

In [None]:
# Your code here

           
## Plot the average sales per year



## Plot the average sales per week in days



## Plot the average sales per day in hours



## Plot the average sales per year in month






We have now seen how price impacts sales, and when and which products sell the most.
The highest number of orders came in October, followed by November; orders for all other months are consistent.
The highest number of orders was placed in 2017.
Saturday recorded the highest average sales and Wednesday the lowest. The average sales are consistent throughout the day, irrespective of the hour, with a standard deviation of 3.

It is also important to know which payment methods people prefer when buying these products in all regions. The different payment methods can be listed using the [.unique()](https://pandas.pydata.org/docs/reference/api/pandas.unique.html) method.

In [None]:
# Your code here



It is found that four types of payment methods are used.

Which payment method is preferred the most by people in different regions?

In [None]:
## Region labels, ordered by total number of orders
names = data['Order Region'].value_counts().index

## Count orders per region for each payment type.
## value_counts() sorts each result by its own frequency, so reindex on `names` to align the bars
count1 = data[data['Type'] == 'TRANSFER']['Order Region'].value_counts().reindex(names, fill_value=0)
count2 = data[data['Type'] == 'CASH']['Order Region'].value_counts().reindex(names, fill_value=0)
count3 = data[data['Type'] == 'PAYMENT']['Order Region'].value_counts().reindex(names, fill_value=0)
count4 = data[data['Type'] == 'DEBIT']['Order Region'].value_counts().reindex(names, fill_value=0)

n_groups = len(names)
fig,ax = plt.subplots(figsize=(20,8))
index = np.arange(n_groups)
bar_width = 0.2
opacity = 0.6

type1 = plt.bar(index, count1, bar_width, alpha=opacity, color='b', label='Transfer')
type2 = plt.bar(index + bar_width, count2, bar_width, alpha=opacity, color='r', label='Cash')
type3 = plt.bar(index + 2*bar_width, count3, bar_width, alpha=opacity, color='g', label='Payment')
type4 = plt.bar(index + 3*bar_width, count4, bar_width, alpha=opacity, color='y', label='Debit')

plt.xlabel('Order Regions')
plt.ylabel('Number of payments')
plt.title('Different Type of payments used in all regions')
plt.legend()
plt.xticks(index + 1.5*bar_width, names, rotation=90) # centre each tick under its group of 4 bars

plt.tight_layout()
plt.show()


Debit is the most preferred payment method in all regions, and Cash is the least preferred.

Finding which payment method is used to commit fraud can help to prevent fraud in the future.

Hint: Use the [value_counts()](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) method.
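A minimal sketch of filtering rows and then counting categories is given here on invented data. The status label `SUSPECTED_FRAUD` is an assumption made for this toy example; check the actual values of `Order Status` in the dataset before reusing it.

```python
import pandas as pd

# Invented rows; the 'SUSPECTED_FRAUD' label is an assumption for illustration.
toy = pd.DataFrame({
    "Type": ["TRANSFER", "DEBIT", "TRANSFER", "CASH", "TRANSFER"],
    "Order Status": ["SUSPECTED_FRAUD", "COMPLETE", "SUSPECTED_FRAUD", "CLOSED", "PENDING"],
})

# Boolean mask selecting the suspected fraud rows.
is_fraud = toy["Order Status"] == "SUSPECTED_FRAUD"

# Count how many of those rows fall under each payment type.
fraud_by_type = toy.loc[is_fraud, "Type"].value_counts()
print(fraud_by_type)
```

Payment types with no matching row simply do not appear in the result.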

In [None]:
## Checking type of payment used to conduct fraud (how many fraud per type of payment)
# Your code here



It can be clearly seen that no frauds are conducted with the DEBIT, CASH, or PAYMENT methods, so all the suspected fraud orders are made using wire transfer, probably from abroad.

Which region and which product are suspected of fraud the most?

In [None]:
## Get suspected fraud orders made by transfer payment
high_fraud = ***** # Your code here

## Plotting pie chart with respect to order region
fraud = high_fraud['Order Region'].value_counts().plot.pie(figsize=(24,12),
                                                  startangle=180, explode=(0.1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), autopct='%.1f', shadow=True,)

plt.title("Regions with Highest Fraud", size=15, color='y') # Plotting title
plt.ylabel(" ")
fraud.axis('equal') 

plt.show()

It can be observed that the highest number of suspected fraud orders comes from Western Europe, with approximately 17.4% of the suspected fraud orders, followed by Central America with 15.5%.

Following the same method, which product is suspected of fraud the most (for all regions and for Western Europe only)?

In [None]:
## Plotting bar chart for top 10 most suspected fraud department in all regions and in Western Europe
# Your code here
fraud_all = *****
fraud_WE = *****

fraud_all.nlargest(10).plot.bar(figsize=(20,8), title="Fraud Category",color='orange')
fraud_WE.nlargest(10).plot.bar(figsize=(20,8), title="Fraud product in Western Europe",color='green')

plt.legend(["All regions", "Western Europe"])
plt.title("Top 10 products with highest fraud detections", size=15)
plt.xlabel("Products", size=13)
plt.ylim(0,600)
plt.show()

It is very surprising to see that the Men's Footwear department is suspected of fraud the most, followed by Cleats, in all regions and also in Western Europe.


Delivering products to customers on time is another important aspect for a supply chain company.

Which category of products is delivered late the most?

In [None]:
# Your code here

## Filtering columns with late delivery status


## Plot bar: Top 10 products with most late deliveries


It can be seen that orders from the Cleats department are delayed the most, followed by Men's Footwear.


## 3. Data Modelling

Now that we know the data better, we can start to train models!

First, we will prepare the data: separate output variables from features and split train and test sets.

Then, classification models will be trained to detect fraud and late delivery, and regression models will predict sales and order quantity.



### 3.1. Data preparation

A copy of the original data is created for training and validation.

In [None]:
train_data = data.copy()

Two new binary columns (0 or 1) are created for orders with suspected fraud and for late delivery. Framing both tasks as binary classification makes it easier to measure and compare the performance of the different models.

In [None]:
# Your code here
train_data['fraud'] = *****
train_data['late_delivery'] = *****
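One possible way to derive such 0/1 flags from text columns is `np.where`, sketched below on invented rows. The labels `SUSPECTED_FRAUD` and `Late delivery` are assumptions for this toy example and should be checked against the actual values in the dataset.

```python
import numpy as np
import pandas as pd

# Invented rows; the status labels are assumptions for illustration.
toy = pd.DataFrame({
    "Order Status": ["COMPLETE", "SUSPECTED_FRAUD", "PENDING"],
    "Delivery Status": ["Late delivery", "Shipping on time", "Late delivery"],
})

# np.where(condition, value_if_true, value_if_false) builds the binary flag.
toy["fraud"] = np.where(toy["Order Status"] == "SUSPECTED_FRAUD", 1, 0)
toy["late_delivery"] = np.where(toy["Delivery Status"] == "Late delivery", 1, 0)
print(toy[["fraud", "late_delivery"]])
```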


To measure the models fairly, all the columns that repeat the target values are dropped. The `Late_delivery_risk` column is dropped because it is known that all the products with a late delivery risk are delivered late. The `Order Status` column is dropped as well because a new column for fraud detection has been created, so there is a chance that a machine learning model might take values directly from these columns to predict the output.

In [None]:
## Dropping columns with repeated values
train_data.drop(['Delivery Status','Late_delivery_risk','Order Status','order_month_year','order date (DateOrders)'], axis=1, inplace=True)

It is important to check the type of variables in the data because machine learning models can only be trained with numerical values.

In [None]:
train_data.dtypes

In [None]:
## Display the first 5 rows
train_data.head()

Some columns contain object type data, which cannot be used to train machine learning models, so all the object type data is converted to int type using the label encoder from the Scikit-Learn preprocessing module.

Hint: use the [LabelEncoder()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) class from Scikit-Learn.


In [None]:
## Create the Labelencoder object used to transform categorical values into numeric
# Your code here

*****

## Convert the categorical columns into numeric
train_data['Customer Country']  = *****
train_data['Market']            = *****
train_data['Type']              = *****
train_data['Product Name']      = *****
train_data['Customer Segment']  = *****
train_data['Customer State']    = *****
train_data['Order Region']      = *****
train_data['Order City']        = *****
train_data['Category Name']     = *****
train_data['Customer City']     = *****
train_data['Department Name']   = *****
train_data['Order State']       = *****
train_data['Shipping Mode']     = *****
train_data['order_week_day']    = *****
train_data['Order Country']     = *****
train_data['Customer Full Name']= *****

## Display the first 5 rows
train_data.head()
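How the label encoder maps categories to integers can be sketched on a toy column (the values are invented). The codes follow the sorted order of the distinct labels, so they carry no real ranking between categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented shipping modes for illustration.
toy = pd.DataFrame({
    "Shipping Mode": ["Standard Class", "First Class", "Standard Class", "Same Day"],
})

le = LabelEncoder()
# fit_transform learns the distinct labels and replaces each one by its index.
toy["Shipping Mode"] = le.fit_transform(toy["Shipping Mode"])

print(list(le.classes_))              # the learned labels, in sorted order
print(toy["Shipping Mode"].tolist())  # the integer codes
```

`le.inverse_transform` maps the integer codes back to the original labels, which is handy when reading feature importance plots later on.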

Now all the columns are numeric.


### 3.2. Splitting train and test datasets

The dataset is split into train data and test data so that the models can be trained on the train data and their performance can be evaluated on the test data.
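An 80/20 split can be sketched as follows on synthetic data. The `stratify=y` argument is an optional addition, not something the tutorial requires: it keeps the share of the rare class the same in both parts, which can help with imbalanced labels such as fraud.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced label vector (10% positives).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 10 + [0] * 90)

# 80% of the rows go to training, 20% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape, y_train.sum(), y_test.sum())
```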

In [None]:
## Your code here

## All columns except fraud


## Only fraud column


##  Splitting the data into two parts in which 80% data will be used for training the model and 20% for testing


# All columns except late_delivery


# Only late delivery column


## Splitting the data into two parts in which 80% data will be used for training the model and 20% for testing



Since there are many variables with different ranges, the [standard scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) is used to standardize all the data so that it is internally consistent before training the machine learning models.



In [None]:
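## Use one scaler per task so that each keeps the statistics of its own training set
sc_f = StandardScaler()
xf_train = sc_f.fit_transform(xf_train)
xf_test = sc_f.transform(xf_test)

sc_l = StandardScaler()
xl_train = sc_l.fit_transform(xl_train)
xl_test = sc_l.transform(xl_test)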
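The reason for calling `fit_transform` on the training set and only `transform` on the test set can be sketched with tiny made-up arrays: the mean and standard deviation are learned from the training data alone and then reused, so no information from the test set leaks into the preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up arrays with two features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # learns mean and std from the train set
X_test_std = scaler.transform(X_test)         # reuses the train statistics

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std)
```

After scaling, each training column has mean 0 and standard deviation 1, while test values may fall outside the range seen in training.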


### 3.3. Comparison of Classification Models

The data is now ready to be used in machine learning models. Since many different models are compared, training every model from scratch would be redundant, so a function is defined to make the process easier.
The output is a binary classification, so all the models are measured with the accuracy and F1 score metrics.

To measure the performance of the different models, the F1 score is used as the main metric because it is the harmonic mean of the precision and recall scores.
The function will also display the confusion matrix.

The parameters of the function will be the name of the task (for example, 'fraud detection'), the classifier, the training input data, the test input data, the training labels, and the test labels.
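To illustrate why the F1 score is more informative than accuracy on imbalanced labels, here is a small pure-Python sketch. The helper `binary_scores` is hypothetical and written only for this illustration; in the exercise itself, the Scikit-Learn functions `accuracy_score`, `f1_score`, and `confusion_matrix` do the same job.

```python
def binary_scores(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 5 positives out of 100 samples, and a model that always predicts 0.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100
lazy = binary_scores(y_true, always_negative)

# A model that finds 4 of the 5 positives and raises 2 false alarms.
useful_pred = [1, 1, 1, 1, 0] + [1, 1] + [0] * 93
useful = binary_scores(y_true, useful_pred)

print(lazy)
print(useful)
```

The always-negative model reaches high accuracy without detecting a single positive, while its F1 score is zero. The second model is slightly less accurate but has a much higher F1 score.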


In [None]:
## Your code here
## Define a function which trains a model and prints the accuracy, F1 score, and confusion matrix for its predictions














#### 3.3.1 Logistic classification model

Train and evaluate a logistic classification model for both fraud detection and late delivery prediction.

Hint: Use the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) algorithm from Scikit-Learn with default parameters.

In [None]:
# Your code here






#### 3.3.2 Gaussian naive bayes model

Train and evaluate a Gaussian naive bayes model for both fraud detection and late delivery prediction.

Hint: Use the [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) algorithm from Scikit-Learn with default parameters.

In [None]:
# Your code here





#### 3.3.3 Support vector machines

Train and evaluate a Support Vector Machine model for both fraud detection and late delivery prediction.

Hint: Use the [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) algorithm from Scikit-Learn with default parameters.

In [None]:
# Your code here






#### 3.3.4 K Nearest Neighbors Classification

Train and evaluate a K Nearest Neighbors Classification model for both fraud detection and late delivery prediction.

Hint: Use the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) algorithm from Scikit-Learn with default parameters.

In [None]:
# Your code here
# This cell can take a while to run






#### 3.3.5 Random Forest Classification

Train and evaluate a Random Forest model for both fraud detection and late delivery prediction.

Hint: Use the [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) algorithm from Scikit-Learn with default parameters.


In [None]:
# Your code here






#### 3.3.6 Decision Tree Classification

Train and evaluate a Decision Tree model for both fraud detection and late delivery prediction.

Hint: Use the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) algorithm from Scikit-Learn with default parameters.

In [None]:
# Your code here






### 3.4. Feature Importance

The variables that were given the most importance in the model are found using the `feature_importances_` attribute from Scikit-Learn.

In [None]:
important_col = model_f.feature_importances_.argsort()
feat_imp = pd.DataFrame({'Variables': xf.columns[important_col], 'importance': model_f.feature_importances_[important_col]})
feat_imp = feat_imp.sort_values(by='importance', ascending=False)
ax = sns.catplot(x='Variables', y = 'importance', data=feat_imp, height=5, aspect=2, kind="bar")
plt.xticks(rotation=90)
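What `feature_importances_` reports can be sketched on synthetic data where the answer is known in advance: the label depends only on the invented `signal` column, so the tree should assign it all of the importance. The names `signal` and `noise` are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on 'signal' only.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y = (X["signal"] > 0).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pair each importance with its column name and sort in descending order.
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances of a fitted tree sum to 1, so each value can be read as a share of the total impurity reduction.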

Even though fraud detection is not obviously related to `Days for shipping (real)`, it is surprising to see that this variable was given an importance of 0.12. Other important variables such as customer full name, shipping mode, and type of payment are given an importance of 0.7, which helps the company to detect fraud accurately when the same customer commits fraud repeatedly.

In the same way, the variables that were given importance for the prediction of late delivery are found.

In [None]:
# Your code here








It can be seen that the shipping-day columns are given almost 90% of the importance in the decision tree model. It will be interesting to see how well the model can predict when these variables are removed.

So a new model is created from a copy of the train data, and we go through the whole process again after dropping the two columns `Days for shipping (real)` and `Days for shipment (scheduled)`:
1. Copy the dataset
2. Drop the columns
3. Separate input and labels data
4. Split data train and test
5. Standardize the data
6. Fit the decision tree model
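The six steps above can be sketched end to end on a synthetic, already-encoded table. All values are randomly generated and the variable names (`toy`, `reduced`, `x_new`, and so on) are hypothetical, so the printed scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded training table (values are random).
rng = np.random.default_rng(1)
n = 400
toy = pd.DataFrame({
    "Days for shipping (real)": rng.integers(0, 7, size=n),
    "Days for shipment (scheduled)": rng.integers(0, 5, size=n),
    "Shipping Mode": rng.integers(0, 4, size=n),
    "Order Region": rng.integers(0, 23, size=n),
})
toy["late_delivery"] = (
    toy["Days for shipping (real)"] > toy["Days for shipment (scheduled)"]
).astype(int)

# Steps 1-2: copy the dataset and drop the two shipping-day columns.
reduced = toy.copy().drop(
    columns=["Days for shipping (real)", "Days for shipment (scheduled)"]
)

# Step 3: separate inputs and labels.
x_new = reduced.drop(columns="late_delivery")
y_new = reduced["late_delivery"]

# Step 4: 80/20 train/test split.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_new, y_new, test_size=0.2, random_state=42
)

# Step 5: standardize with statistics learned on the training part only.
scaler = StandardScaler()
x_tr_std = scaler.fit_transform(x_tr)
x_te_std = scaler.transform(x_te)

# Step 6: fit the decision tree and score it on the held-out part.
tree = DecisionTreeClassifier(random_state=0).fit(x_tr_std, y_tr)
y_hat = tree.predict(x_te_std)
acc = accuracy_score(y_te, y_hat)
f1 = f1_score(y_te, y_hat)
print("accuracy:", acc, "f1:", f1)
```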

In [None]:
# Your code here


## Dropping columns in new data set


## Separate input and labels data


## Splitting the data into two parts in which 80% data will be used for training the model and 20% for testing


## Standardize the data


## Train the model



Even when the shipping-day variables are removed, the F1 score and the accuracy of the new model are nearly 84%, which is still pretty good. Which variables are given more importance this time?

In [None]:
# Your code here







This time, variables like shipping mode, order city, and order state are given more importance, which can help the company to use different shipping methods to deliver products faster.

### 3.5. Deep Neural Network Model for Classification

The Decision Tree classifier is identified as the best of all the machine learning models for classification.
How well does it perform compared with a deep neural network model?

We will use the `Tensorflow-Keras` library and more specifically the [Sequential](https://www.tensorflow.org/guide/keras/sequential_model) model with 10 hidden layers.

In [None]:
classifier = Sequential()
#First Hidden Layer
classifier.add(Dense(1024, activation='relu',kernel_initializer='random_normal', input_dim=xf_train.shape[1])) #One input per feature column (43 here)
#Second Hidden Layer
classifier.add(Dense(512, activation='relu',kernel_initializer='random_normal'))
#Third Hidden Layer
classifier.add(Dense(256, activation='relu',kernel_initializer='random_normal'))
#Fourth Hidden Layer
classifier.add(Dense(128, activation='relu',kernel_initializer='random_normal'))
#Fifth Hidden Layer
classifier.add(Dense(64, activation='relu',kernel_initializer='random_normal'))
#Sixth Hidden Layer
classifier.add(Dense(32, activation='relu',kernel_initializer='random_normal'))
#Seventh Hidden Layer
classifier.add(Dense(16, activation='relu',kernel_initializer='random_normal'))
#Eighth Hidden Layer
classifier.add(Dense(8, activation='relu',kernel_initializer='random_normal'))
#Ninth Hidden Layer
classifier.add(Dense(4, activation='relu',kernel_initializer='random_normal'))
#Tenth Hidden Layer
classifier.add(Dense(2, activation='relu',kernel_initializer='random_normal'))
#Output Layer
classifier.add(Dense(1, activation='sigmoid',kernel_initializer='random_normal'))

Since the output is a binary classification, `binary_crossentropy` is used to measure the loss. `accuracy` is used as the metric while training the model, and the F1 score is computed afterwards with Scikit-Learn.

In [None]:
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

The model is trained with batch size of 512 and 40 epochs.

In [None]:
## Fitting the data to the training dataset
classifier.fit(xf_train, yf_train, batch_size=512, epochs=40)

It can be seen that the neural network model improves during the first 20 epochs: even though the accuracy remains the same, the loss keeps decreasing. After that, the loss is stable, which means that we can stop training.
In general, training can be continued until the loss value stabilises.


The model is evaluated with the test data set. See the [Tensorflow-Keras doc](https://www.tensorflow.org/guide/keras/train_and_evaluate) for more information.

In [None]:
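## evaluate() returns [loss, accuracy] because the model was compiled with metrics=['accuracy']
train_evaluate = classifier.evaluate(xf_train, yf_train)
test_evaluate = classifier.evaluate(xf_test, yf_test)

print('[loss, accuracy] for Train set is',train_evaluate)
print('[loss, accuracy] for Test set is',test_evaluate)

## The sigmoid output is a single probability per sample, so threshold it at 0.5
## (np.argmax over a single column would always return class 0)
yf_pred1 = classifier.predict(xf_test, batch_size=512, verbose=1)
yf_pred = (yf_pred1 > 0.5).astype(int).ravel()
print(f1_score(yf_test, yf_pred, average="weighted"))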
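A model with a single sigmoid output returns one probability per sample, in an array of shape `(n, 1)`. The sketch below, on made-up probabilities, shows why class labels should then be obtained by thresholding: `np.argmax` along `axis=1` only has one column to choose from and always returns 0.

```python
import numpy as np

# Made-up sigmoid outputs with shape (4, 1).
probs = np.array([[0.10], [0.80], [0.55], [0.30]])

# argmax over a single column: every row gets index 0.
argmax_labels = np.argmax(probs, axis=1)

# Thresholding at 0.5 turns each probability into a 0/1 label.
threshold_labels = (probs > 0.5).astype(int).ravel()

print(argmax_labels)
print(threshold_labels)
```

`np.argmax` is the right tool only when the output layer has one unit per class, for example with a softmax activation.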

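The weighted F1 score reported for the deep neural network model is 96.48%, which is high compared with the decision tree F1 score of 80.64%. Because this score uses `average="weighted"`, it is dominated by the majority class, so the two figures are only comparable if the decision tree score was computed with the same averaging.

However, comparing accuracy scores, it can be concluded that the machine learning models also did well for fraud detection and late delivery prediction.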

### 3.6. Comparison of Regression Models

For comparison of regression models `Sales` and `Order Item Quantity` are predicted.

In [None]:
# Your code here

## For Sales
## Prepare features and labels data


## Split train/test datasets


## For Order Item Quantity
## Prepare features and labels data


## Split train/test datasets



The MinMax scaler is used to rescale the data for the regression task.

Hint: use the [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) method from `Scikit-Learn`.
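As a small sketch of what min-max scaling does, the made-up arrays below are rescaled with the minimum and maximum learned from the training set. Training values end up in [0, 1], while test values outside the training range can fall outside that interval.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature arrays.
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[25.0], [40.0]])

scaler = MinMaxScaler()
# fit_transform learns min=10 and max=30, then maps x to (x - min) / (max - min).
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses the training min and max for the test set.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```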

In [None]:
# Your code here


## Fit and transform input train data (for both sales and quantity sets)



## transform input test data (for both sales and quantity sets)




The data is now ready to be used in machine learning models. Since different models are compared here, a function is defined as above. The output is a continuous value, so accuracy cannot be used to compare the models as it was for classification; instead, all the models are compared using the mean absolute error (MAE) and the root mean square error (RMSE).

The lower the MAE, the better the model is performing, and lower values of RMSE indicate a better fit.
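Both metrics can be computed by hand on a few made-up values, as in the pure-Python sketch below. The helper names `mae` and `rmse` are hypothetical; in the exercise, `mean_squared_error` from Scikit-Learn combined with `np.sqrt` gives the RMSE.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: penalises large errors more strongly."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Made-up targets and predictions; the errors are -1, 0, 2 and 0.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]

mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(mae_value, rmse_value)
```

The RMSE is never smaller than the MAE, and the gap between them grows when a few predictions are far off.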

In [None]:
# Your code here










#### 3.6.1 Lasso Regression

Train and evaluate a Lasso Regression model for both sales and quantity prediction.

Hint: Use the [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) algorithm from Scikit-Learn with default parameters.

In [None]:
# Your code here





#### 3.6.2 Ridge Regression

Train and evaluate a Ridge Regression model for both sales and quantity prediction.

Hint: Use the [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) algorithm from Scikit-Learn with default parameters.

In [None]:
# Your code here





#### 3.6.3 Gradient Boosting Regression

Train and evaluate a Gradient Boosting Regression model for both sales and quantity prediction.

Hint: Use the [GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) algorithm from Scikit-Learn with default parameters.

In [None]:
# Your code here





#### 3.6.4 Random Forest Regression

Train and evaluate a Random Forest Regression model for both sales and quantity prediction.

Hint: Use the [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) algorithm from Scikit-Learn with default parameters. Use max_depth=10 to limit the size of the trees.

In [None]:
# Your code here





#### 3.6.5 Decision Tree Regression

Train and evaluate a Decision Tree Regression model for both sales and quantity prediction.

Hint: Use the [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) algorithm from Scikit-Learn with default parameters.

In [None]:
# Your code here





#### 3.6.6 Linear Regression

Train and evaluate a Linear Regression model for both sales and quantity prediction.

Hint: Use the [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) algorithm from Scikit-Learn with default parameters.

In [None]:
# Your code here





Here, surprisingly, the Linear Regression model performed better than the other models for predicting sales, followed by the Decision Tree regression model.
For predicting order quantity, Random Forest did very well. How well do these models perform against a neural network model for predicting order quantity?

The neural network model is trained with 5 hidden layers.

### 3.7. Neural Network Model for Regression


Again, we will use the `Tensorflow-Keras` library and more specifically the [Sequential](https://www.tensorflow.org/guide/keras/sequential_model) model with 5 hidden layers.

In [None]:
regressor = Sequential()

#First Hidden Layer
regressor.add(Dense(512, activation='relu', kernel_initializer='normal', input_dim=43))
#Second  Hidden Layer
regressor.add(Dense(256, activation='relu', kernel_initializer='normal'))
#Third  Hidden Layer
regressor.add(Dense(256, activation='relu', kernel_initializer='normal'))
#Fourth  Hidden Layer
regressor.add(Dense(256, activation='relu', kernel_initializer='normal'))
#Fifth  Hidden Layer
regressor.add(Dense(256, activation='relu', kernel_initializer='normal'))

#Output Layer
regressor.add(Dense(1, activation='linear'))# Linear activation is used.

The mean absolute error is used as the loss function to train the model.

In [None]:
regressor.compile(optimizer='adam', loss='mean_absolute_error', metrics=['mean_absolute_error'])

In [None]:
regressor.fit(xq_train, yq_train, batch_size=256, epochs=40)

The model is evaluated on the train and test data to find the MAE and RMSE values.

In [None]:
pred_train_q = regressor.predict(xq_train)
pred_q_test = regressor.predict(xq_test)
print('MAE Value train data:',regressor.evaluate(xq_train,yq_train))
print('RMSE of train data:',np.sqrt(mean_squared_error(yq_train,pred_train_q)))
print('MAE Value test data:',regressor.evaluate(xq_test,yq_test))
print('RMSE of test data:',np.sqrt(mean_squared_error(yq_test,pred_q_test)))

The MAE and RMSE scores for the neural network model are 0.022 and 0.065, which are pretty good.