# <b>Notebook 1: Data Import & Processing</b>
This notebook imports the data and processes it.

In [20]:
%run ../klicp/klicp_00_import_and_functions.ipynb
%run ../klicp/klicp_01_data_import_tools.ipynb

In [21]:
%run ../code_patrimony/00_fonctions_preparation.ipynb

In [22]:
# Import the dataset (source: https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset)
data = pd.read_csv("online_shoppers_intention.csv", header=0)

## <b>STEP 1: Data Cleaning (if necessary)</b>

#### A quick look at the data type of each column shows which columns are numeric and which are categorical. It is a first sanity check before looking for missing values.

In [23]:
data.dtypes

Administrative               int64
Administrative_Duration    float64
Informational                int64
Informational_Duration     float64
ProductRelated               int64
ProductRelated_Duration    float64
BounceRates                float64
ExitRates                  float64
PageValues                 float64
SpecialDay                 float64
Month                       object
OperatingSystems             int64
Browser                      int64
Region                       int64
TrafficType                  int64
VisitorType                 object
Weekend                       bool
Revenue                       bool
dtype: object
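The dtypes alone do not reveal missing values, so a direct count can complement them. The sketch below is illustrative only: it uses a small made-up frame (`toy`) rather than the real dataset, and the column values are assumptions chosen to show the behaviour.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "ProductRelated": [1, 2, 10],
    "BounceRates": [0.2, np.nan, 0.05],
    "Month": ["Feb", "Mar", None],
})

# Number of missing entries per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if at least one value is missing anywhere in the frame.
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

On the real data, the same call would be `data.isna().sum()`.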

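#### Month and VisitorType have the object type. We can quickly check them for missing values by counting the values: `value_counts()` ignores missing entries by default, so the counts add up to the number of rows only if nothing is missing.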

In [24]:
data['Month'].value_counts()

May     3364
Nov     2998
Mar     1907
Dec     1727
Oct      549
Sep      448
Aug      433
Jul      432
June     288
Feb      184
Name: Month, dtype: int64

In [25]:
data['VisitorType'].value_counts()

Returning_Visitor    10551
New_Visitor           1694
Other                   85
Name: VisitorType, dtype: int64
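Both outputs above add up to 12330, the number of rows in the dataset. As a minimal sketch of that check, the example below compares the sum of the counts with the length of a made-up series (`toy` is hypothetical and contains one missing entry on purpose).

```python
import pandas as pd

# Toy series with one missing entry (values are made up).
toy = pd.Series(["May", "Nov", None, "May"], name="Month")

# By default value_counts() leaves missing entries out.
counts = toy.value_counts()

# With dropna=False the missing entries get their own row.
counts_with_nan = toy.value_counts(dropna=False)

# Rows not accounted for by the default counts are the missing ones.
n_missing = len(toy) - int(counts.sum())
print(n_missing)
```

On the real data, `len(data) - data['Month'].value_counts().sum()` would give the number of missing months.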

## <b>STEP 2: Data preprocessing</b>

#### Since our dataset has categorical features, we need to encode them before we can use them in the analysis later.

In [26]:
data['Revenue']

0        False
1        False
2        False
3        False
4        False
         ...  
12325    False
12326    False
12327    False
12328    False
12329    False
Name: Revenue, Length: 12330, dtype: bool
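Before resampling, it can help to see how unbalanced the target is. The sketch below is one possible way to downsample every class to the size of the smallest one using pandas only. It is an assumption about what a downsampling step might look like, not the implementation of the notebook's own `resample` helper, and the frame `toy` is made up.

```python
import pandas as pd

# Toy frame with an unbalanced boolean target (values are made up).
toy = pd.DataFrame({
    "PageValues": [0.0, 0.0, 12.5, 0.0, 30.1, 0.0, 0.0, 4.2],
    "Revenue": [False, False, True, False, True, False, False, True],
})

# Class sizes before resampling.
class_counts = toy["Revenue"].value_counts()
print(class_counts)

# Size of the smallest class.
n_minority = int(class_counts.min())

# Draw n_minority rows from each class, without replacement.
balanced = (
    toy.groupby("Revenue")
       .sample(n=n_minority, random_state=0)
       .reset_index(drop=True)
)
print(balanced["Revenue"].value_counts())
```

The fixed `random_state` only makes the draw reproducible; any seed would do.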

In [27]:
# Downsample the dataset on the 'Revenue' target
data = resample(data, 'Revenue', data['Revenue'].unique())

# One-hot encode the categorical columns and keep the fitted encoder
categorical_col = ['Month', 'OperatingSystems', 'Browser', 'VisitorType', 'Weekend']
encoder, data = custom_label_encoder(data, OneHotEncoder(), categorical_col, 'create')

# Power-transform the numerical columns and keep the fitted scaler
numerical_col = [
    'Administrative', 'Administrative_Duration',
    'Informational', 'Informational_Duration',
    'ProductRelated', 'ProductRelated_Duration',
    'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay',
    'Region', 'TrafficType',
]
scaler, data = custom_scaler_encoder(data, PowerTransformer(), numerical_col, 'create')

# Split into train and test sets, holding out 25% for testing
data, X_train, X_test, y_train, y_test = ml_build(data, 'Revenue', ['Revenue'], test_size=0.25)

In [29]:
with open('artifacts/X_train.pickle', 'wb') as file:
    pickle.dump(X_train, file)
with open('artifacts/X_test.pickle', 'wb') as file:
    pickle.dump(X_test, file)
with open('artifacts/data.pickle', 'wb') as file:
    pickle.dump(data, file)

In [30]:
with open('artifacts/encoder.pickle', 'wb') as file:
    pickle.dump(encoder, file)

In [31]:
with open('artifacts/scaler.pickle', 'wb') as file:
    pickle.dump(scaler, file)
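A later notebook can read these artifacts back with `pickle.load`. The sketch below shows the round trip on a stand-in object in a temporary folder. The dictionary `artifact` and the folder layout are assumptions for illustration. The `mkdir` call is included because `open(..., 'wb')` fails if the target folder does not exist.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted encoder or scaler (hypothetical content).
artifact = {"columns": ["Month", "Browser"], "mode": "create"}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Make sure the target folder exists before writing to it.
    artifacts_dir = Path(tmp_dir) / "artifacts"
    artifacts_dir.mkdir(parents=True, exist_ok=True)
    path = artifacts_dir / "encoder.pickle"

    # Save the object in binary mode.
    with open(path, "wb") as file:
        pickle.dump(artifact, file)

    # Load it back, also in binary mode.
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == artifact)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.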