**Customer Satisfaction Prediction using MLBox and Auto-sklearn**

In [1]:
# !pip install mlbox
# !pip install auto-sklearn

In [2]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math

In [3]:
from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *

In [4]:
target_name = "is_happy_customer"  # will change as per user choice
xl = pd.read_excel('dirty_data.xls')
# Hold out the first 3 percent of rows as an unlabeled test set.
# MLBox compares it with the training set to detect drifting (unnecessary) columns.
n_test = math.ceil(xl.shape[0] * 0.03)
ndf = xl.iloc[:n_test].drop(columns=target_name)  # drop() returns a new frame, so no SettingWithCopyWarning
xl = xl.iloc[n_test:]
ndf.to_excel('test.xls')
xl.to_excel('train.xls')
paths = ["train.xls", "test.xls"]


In [5]:
rd = Reader(sep = ',')
df = rd.train_test_split(paths, target_name)


reading xls : train.xls ...
cleaning data ...
CPU time: 5.452819585800171 seconds

reading xls : test.xls ...
cleaning data ...
CPU time: 0.06067371368408203 seconds

> Number of common features : 13

gathering and crunching for train and test datasets ...
reindexing for train and test datasets ...
dropping training duplicates ...
dropping constant variables on training set ...

> Number of categorical features: 5
> Number of numerical features: 8
> Number of training samples : 386
> Number of test samples : 12

> You have no missing values on train set...

> Task : classification
1.0    277
0.0    109
Name: is_happy_customer, dtype: int64

encoding target ...


In [6]:
dft = Drift_thresholder()
df = dft.fit_transform(df)  # automatically drops columns whose distribution differs strongly between train and test, such as IDs


computing drifts ...
CPU time: 0.13759779930114746 seconds

> Top 10 drifts

('customer_id', 1.0)
('order_id', 1.0)
('shopping_cart', 0.9948186528497409)
('coupon_discount', 0.46027633851468064)
('nearest_warehouse', 0.28411053540587217)
('order_price', 0.1567357512953369)
('is_expedited_delivery', 0.14853195164075972)
('distance_to_nearest_warehouse', 0.13428324697754745)
('customer_lat', 0.10535405872193415)
('order_total', 0.07599309153713296)

> Deleted variables : ['customer_id', 'order_id', 'shopping_cart']
> Drift coefficients dumped into directory : save
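The drift values above can be read as "how well can a classifier tell train rows from test rows using this one column". The following is a minimal sketch of that idea, not MLBox's implementation: MLBox fits a classifier per feature, whereas this simplified stand-in uses the raw feature value as the score and reports 2 * |AUC - 0.5|. The ID ranges are hypothetical, and the 0.6 threshold is an assumption (believed to be the `Drift_thresholder` default) that happens to reproduce the deleted variables listed above.

```python
def drift_score(train_values, test_values):
    """Simplified univariate drift: 2 * |AUC - 0.5|, with the raw value used as the score."""
    wins = 0.0
    for t in test_values:
        for x in train_values:
            if t > x:
                wins += 1.0
            elif t == x:
                wins += 0.5
    auc = wins / (len(test_values) * len(train_values))
    return 2 * abs(auc - 0.5)


# Hypothetical ID-like column: the held-out rows are simply the first rows of the file,
# so every test ID is smaller than every train ID and the two sets separate perfectly.
id_drift = drift_score(train_values=range(12, 398), test_values=range(0, 12))

# Identical distributions in both sets cannot be told apart.
no_drift = drift_score(train_values=[1, 2, 3, 4], test_values=[1, 2, 3, 4])

# Drift values copied from the notebook output above.
reported = {
    "customer_id": 1.0,
    "order_id": 1.0,
    "shopping_cart": 0.9948186528497409,
    "coupon_discount": 0.46027633851468064,
    "nearest_warehouse": 0.28411053540587217,
    "order_price": 0.1567357512953369,
}

THRESHOLD = 0.6  # assumption: believed to be the Drift_thresholder default
dropped = [name for name, drift in reported.items() if drift > THRESHOLD]

print(id_drift)  # 1.0
print(no_drift)  # 0.0
print(dropped)   # ['customer_id', 'order_id', 'shopping_cart']
```

A drift near 1.0 means the column alone identifies which set a row came from, which is typical for identifiers and free-text fields and explains why they are removed before training.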


In [7]:
# Convert any remaining object columns to categorical, since object is the only non-numeric dtype left
for col in df['train'].columns:
  if df['train'].dtypes[col] == object:  # np.object was removed in NumPy 1.24; the builtin object is equivalent
    df['train'] = df['train'].astype({col:'category'})
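The loop above can also be written without iterating over columns. The sketch below assumes a toy frame standing in for `df['train']`: the column names mirror the notebook, but the frame itself is made up for illustration. The same conversion would presumably be needed on `df['test']` before predicting on it.

```python
import pandas as pd

# Toy stand-in for df['train']; string columns are created with an explicit object dtype
toy = pd.DataFrame({
    "order_price": [23900.0, 14950.0, 18295.0],
    "nearest_warehouse": pd.Series(["Thompson", "Nickolson", "Bakers"], dtype=object),
    "season": pd.Series(["Spring", "Autumn", "Summer"], dtype=object),
})

# Select every object column at once, then cast them all in a single astype call
object_cols = toy.select_dtypes(include="object").columns
toy = toy.astype({col: "category" for col in object_cols})

print(toy.dtypes)
```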

In [8]:
#confirming that all columns have the right datatype
df['train'].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 386 entries, 0 to 385
Data columns (total 10 columns):
 #   Column                         Non-Null Count  Dtype   
---  ------                         --------------  -----   
 0   coupon_discount                386 non-null    float64 
 1   customer_lat                   386 non-null    float64 
 2   customer_long                  386 non-null    float64 
 3   delivery_charges               386 non-null    float64 
 4   distance_to_nearest_warehouse  386 non-null    float64 
 5   is_expedited_delivery          386 non-null    float64 
 6   nearest_warehouse              386 non-null    category
 7   order_price                    386 non-null    float64 
 8   order_total                    386 non-null    float64 
 9   season                         386 non-null    category
dtypes: category(2), float64(8)
memory usage: 28.5 KB


In [9]:
# Split train and test set
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(df['train'],df['target'],test_size=0.2,random_state=42)

In [10]:
# Training
# For now this code handles the classification task; regression works the same way with autosklearn.regression.
import autosklearn.classification
# time_left_for_this_task is the total search budget in seconds; per_run_time_limit caps each single model fit.
# The budget can be exposed as a user input and raised to improve accuracy on very large datasets.
automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=240, per_run_time_limit=30)
automl.fit(X_train, Y_train)
print("AutoSkLearn Model Accuracy: {:.2f}%".format(automl.score(X_test, Y_test) * 100))

AutoSkLearn Model Accuracy: 91.03%
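Because the classes are imbalanced (277 happy vs. 109 unhappy customers in the training output above), the accuracy of about 91% is easier to judge next to a majority-class baseline. The sketch below is plain arithmetic on numbers printed earlier in the notebook. It assumes the held-out split has roughly the same class mix as the full training set, which the notebook does not show, and that scikit-learn rounds the test size up.

```python
import math

# Class counts from the Reader output: 1.0 -> 277, 0.0 -> 109
class_counts = {1: 277, 0: 109}
n_samples = sum(class_counts.values())

# Accuracy of always predicting the most frequent class
majority_baseline = max(class_counts.values()) / n_samples

# test_size=0.2 on 386 rows; assumes the test share is rounded up
n_test = math.ceil(n_samples * 0.2)

# Reported accuracy from the cell output, rounded to two decimals
reported_accuracy = 0.9103
n_correct = round(reported_accuracy * n_test)

print(f"majority baseline: {majority_baseline:.4f}")       # 0.7176
print(f"test rows: {n_test}, correctly classified: {n_correct}")  # 78, 71
```

Under those assumptions the model classifies 71 of 78 test rows correctly, about 19 percentage points above the baseline of always predicting a happy customer.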


In [11]:
import pickle
# Save the model to disk; the context manager closes the file handle
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(automl, f)

In [12]:
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

In [13]:
#shows stats of the training process
print(loaded_model.sprint_statistics())

auto-sklearn results:
  Dataset name: 11ba2848-2dda-11ec-a21e-00155df4b369
  Metric: accuracy
  Best validation score: 0.882353
  Number of target algorithm runs: 48
  Number of successful target algorithm runs: 47
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 1
  Number of target algorithms that exceeded the memory limit: 0



In [14]:
#shows the different algorithms tested, and their ranks
print(loaded_model.leaderboard())

          rank  ensemble_weight               type      cost  duration
model_id                                                              
45           1             0.02                sgd  0.117647  1.194382
40           2             0.02                sgd  0.127451  1.058690
17           3             0.02  gradient_boosting  0.176471  1.655259
29           4             0.04           adaboost  0.205882  1.641421
44           5             0.02       bernoulli_nb  0.225490  1.222593
27           6             0.02                lda  0.245098  1.090954
36           7             0.42                sgd  0.294118  1.034279
41           8             0.02        extra_trees  0.333333  2.366200
31           9             0.42                sgd  0.382353  0.999049
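Two things in the leaderboard are worth checking by hand. The `cost` column appears to be 1 minus the validation accuracy, since the best cost matches the best validation score from `sprint_statistics()`, and the ensemble weights should sum to 1. The sketch below copies the values from the two outputs above; nothing in it calls auto-sklearn.

```python
from collections import defaultdict

# (model_id, ensemble_weight, type, cost) copied from the leaderboard output
leaderboard = [
    (45, 0.02, "sgd", 0.117647),
    (40, 0.02, "sgd", 0.127451),
    (17, 0.02, "gradient_boosting", 0.176471),
    (29, 0.04, "adaboost", 0.205882),
    (44, 0.02, "bernoulli_nb", 0.225490),
    (27, 0.02, "lda", 0.245098),
    (36, 0.42, "sgd", 0.294118),
    (41, 0.02, "extra_trees", 0.333333),
    (31, 0.42, "sgd", 0.382353),
]
best_validation_score = 0.882353  # from sprint_statistics()

# The lowest cost should equal 1 - best validation accuracy
best_cost = min(cost for _, _, _, cost in leaderboard)

# Total ensemble weight, and the share contributed by each algorithm type
total_weight = sum(weight for _, weight, _, _ in leaderboard)
weight_by_type = defaultdict(float)
for _, weight, model_type, _ in leaderboard:
    weight_by_type[model_type] += weight

print(round(best_cost, 6), round(1 - best_validation_score, 6))  # 0.117647 0.117647
print(round(total_weight, 2))                                    # 1.0
print(round(weight_by_type["sgd"], 2))                           # 0.88
```

Note that rank follows individual cost, not ensemble weight: the two SGD models with weight 0.42 are ranked 7 and 9, so the final ensemble leans mostly on models that are not the best single performers.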


In [15]:
X_test.head()

     coupon_discount  customer_lat  customer_long  delivery_charges  distance_to_nearest_warehouse  is_expedited_delivery  nearest_warehouse  order_price  order_total  season
336             25.0    -37.818112     144.948641             75.41                         0.6211                    0.0           Thompson      23900.0     18000.41  Spring
307              5.0    -37.819576     144.963249             48.99                         0.5649                    0.0          Nickolson      14950.0     14251.49  Autumn
 90             10.0    -37.818592     145.005238              79.5                            1.3                    1.0             Bakers      18295.0      16545.0  Summer
265              5.0    -37.811595     144.945697             76.34                         0.1702                    1.0           Thompson       7760.0    507327.39  Autumn
150             10.0    -37.814972      144.96024             80.74                         0.9127                    1.0          Nickolson       8125.0      7393.24  Autumn


In [16]:
#test input
t=X_test.iloc[[3]]

In [17]:
Y_test.iloc[[3]]

265    1
Name: is_happy_customer, dtype: int64

In [18]:
loaded_model.predict(t)[0]

1