# <b> Notebook 1 : Data Import & Processing
This notebook imports the data and processes it.

In [18]:
%run ../klicp/klicp_00_import_and_functions.ipynb
%run ../klicp/klicp_01_data_import_tools.ipynb

In [19]:
# Importing the dataset (Link - https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset)
data = pd.read_csv("online_shoppers_intention.csv", header = 0)

## <b> STEP 1 : Data Cleaning (if necessary)

#### We can take a quick look at the data type of each column as a first check for missing values: pandas cannot store NaN in an int64 or bool column, so any column read with one of those types is complete.

In [14]:
data.dtypes

Administrative               int64
Administrative_Duration    float64
Informational                int64
Informational_Duration     float64
ProductRelated               int64
ProductRelated_Duration    float64
BounceRates                float64
ExitRates                  float64
PageValues                 float64
SpecialDay                 float64
Month                       object
OperatingSystems             int64
Browser                      int64
Region                       int64
TrafficType                  int64
VisitorType                 object
Weekend                       bool
Revenue                       bool
dtype: object
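
The link between dtypes and missing values can be seen on a small example. With default settings, `pd.read_csv` reads an integer column that contains an empty cell as a float column, because NaN is a float. This is only a minimal sketch on made-up CSV text; the column names `clicks` and `label` are illustrative and not part of the dataset.

```python
import io

import pandas as pd

# Toy CSV text; "clicks" and "label" are hypothetical column names.
complete = "clicks,label\n1,a\n2,b\n3,c\n"
with_gap = "clicks,label\n1,a\n,b\n3,c\n"  # second row has an empty clicks cell

df_complete = pd.read_csv(io.StringIO(complete))
df_gap = pd.read_csv(io.StringIO(with_gap))

print(df_complete["clicks"].dtype)  # integer dtype: no missing values possible
print(df_gap["clicks"].dtype)       # float dtype: the empty cell became NaN
print(df_gap.isna().sum())          # explicit per-column count of missing values
```

A float64 or object column may still hide missing values, so those columns need an explicit check.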

#### Month and VisitorType are of object type; we can quickly check them for missing values by counting the values in each column.

In [15]:
data['Month'].value_counts()

May     3364
Nov     2998
Mar     1907
Dec     1727
Oct      549
Sep      448
Aug      433
Jul      432
June     288
Feb      184
Name: Month, dtype: int64

In [16]:
data['VisitorType'].value_counts()

Returning_Visitor    10551
New_Visitor           1694
Other                   85
Name: VisitorType, dtype: int64
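
Note that `value_counts()` drops NaN by default, so the counts above rule out missing values only if they add up to the number of rows (both outputs total 12,330). A more direct check is sketched below on a small made-up frame standing in for `data`; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "Month": ["May", "Nov", np.nan, "May"],
    "VisitorType": ["New_Visitor", "Returning_Visitor", "Other", "New_Visitor"],
})

# dropna=False keeps NaN as its own category instead of silently dropping it.
month_counts = toy["Month"].value_counts(dropna=False)
print(month_counts)

# isna().sum() gives the number of missing entries per column directly.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Applied to the real dataset, `data.isna().sum()` would report the count for all 18 columns at once, including the float64 ones.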

## <b> STEP 2 : Data preprocessing

#### Since our dataset contains categorical features, we need to encode them before we can use them in our analysis later.
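
To make the encoding step concrete, here is a minimal sketch of one-hot encoding on a toy frame (the values are invented). Each category of `VisitorType` becomes its own 0/1 indicator column, and the numeric column is passed through unchanged. `sparse_threshold=0` is set here only so that the result is a dense array that prints readably.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one categorical and one numeric column (values invented).
toy = pd.DataFrame({
    "VisitorType": ["New_Visitor", "Returning_Visitor", "Other", "New_Visitor"],
    "PageValues": [0.0, 12.5, 3.1, 0.0],
})

# One-hot encode VisitorType; pass PageValues through unchanged.
# sparse_threshold=0 forces a dense NumPy array as output.
encoder = make_column_transformer(
    (OneHotEncoder(), ["VisitorType"]),
    remainder="passthrough",
    sparse_threshold=0,
)
encoded = encoder.fit_transform(toy)

print(encoded.shape)  # 3 indicator columns + 1 passthrough column
print(encoded)
```

The indicator columns follow the sorted category order (`New_Visitor`, `Other`, `Returning_Visitor`), and passthrough columns are appended after them.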

In [17]:
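# One-hot encode the categorical columns; all other columns pass through unchanged
column_trans = make_column_transformer(
    (OneHotEncoder(), ['Month', 'OperatingSystems', 'Browser', 'VisitorType', 'Weekend']),
    remainder='passthrough',
)

# Scaler (rescales each feature to the [0, 1] range)
scalar = MinMaxScaler()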

# Purchased
dataset_p = data[data.Revenue==True]
# Not Purchased
dataset_np = data[data.Revenue==False]

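# Balanced dataset: 5000 rows per class. Purchases are sampled with replacement,
# non-purchases without; random_state makes the draw reproducible.
dataset_p_down = resample(dataset_p, replace=True, n_samples=5000, random_state=0)
dataset_np_down = resample(dataset_np, replace=False, n_samples=5000, random_state=0)
dataset = pd.concat([dataset_p_down, dataset_np_down])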

# Identifying the class label
X = dataset.drop(columns=['Revenue'])
y = dataset['Revenue']

# Encoding categorical features
column_trans.fit(X)
X = column_trans.transform(X)

# Creating training and testing set
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=0)

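# Scale each feature to [0, 1]; fit on the training set only so that no
# information from the test set leaks into the scaling
scalar.fit(X_train)
X_train = scalar.transform(X_train)
X_test = scalar.transform(X_test)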
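
One thing to watch in the cell above: because rows are resampled with replacement before the split, copies of the same session can end up in both the training and the test set. A possible alternative, sketched here on synthetic data and not the approach this notebook takes, is to split first and rebalance only the training portion. The `feature` column, the class sizes, and `n_per_class` are all assumptions made for the sketch.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data: 40 positive and 160 negative rows, random feature.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=200),
    "Revenue": np.r_[np.ones(40, dtype=bool), np.zeros(160, dtype=bool)],
})

# 1) Hold out a test set first, keeping the original class ratio.
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["Revenue"], random_state=0
)

# 2) Rebalance only the training rows.
train_p = train[train["Revenue"]]
train_np = train[~train["Revenue"]]
n_per_class = 60  # arbitrary target size for this sketch
train_balanced = pd.concat([
    resample(train_p, replace=True, n_samples=n_per_class, random_state=0),
    resample(train_np, replace=False, n_samples=n_per_class, random_state=0),
])

# No test row index appears in the balanced training set.
print(set(train_balanced.index) & set(test.index))
print(train_balanced["Revenue"].value_counts())
```

With this ordering the test set keeps the original class imbalance, so evaluation metrics reflect the data as it actually occurs.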