Importing the standard packages, plus imbalanced-learn so that I can oversample and undersample the training data to work around the class imbalance.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from collections import Counter

In [2]:
df = pd.read_csv('df_new.csv')

In [3]:
X = df.loc[:, df.columns != 'Revenue']
y = df['Revenue']

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1)

In [5]:
num_cols = X.columns[X.dtypes.apply(lambda c: np.issubdtype(c, np.number))]

In [6]:
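# Work on copies so the column assignments below don't raise SettingWithCopyWarning.
X_train, X_test = X_train.copy(), X_test.copy()
# Fit the scaler on the training split only, then apply it to both splits.
scaler = MinMaxScaler().fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])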

In [7]:
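X_train = pd.get_dummies(X_train)
# Reindex so the test set has the same dummy columns, in the same order, as the training set.
X_test = pd.get_dummies(X_test).reindex(columns=X_train.columns, fill_value=0)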

In [8]:
X_train.dtypes

Administrative                   float64
Administrative_Duration          float64
Informational                    float64
Informational_Duration           float64
ProductRelated                   float64
ProductRelated_Duration          float64
BounceRates                      float64
ExitRates                        float64
PageValues                       float64
SpecialDay                       float64
OperatingSystems                 float64
Browser                          float64
Region                           float64
TrafficType                      float64
Weekend                             bool
Month_Aug                          uint8
Month_Dec                          uint8
Month_Feb                          uint8
Month_Jul                          uint8
Month_June                         uint8
Month_Mar                          uint8
Month_May                          uint8
Month_Nov                          uint8
Month_Oct                          uint8
Month_Sep       

Everything converted as expected; only the Weekend values still need to be changed manually from booleans to integers.

In [9]:
# Convert only the boolean Weekend column to 0/1 integers.
X_train['Weekend'] = X_train['Weekend'].astype('int64')
X_test['Weekend'] = X_test['Weekend'].astype('int64')

In [10]:
X_train.dtypes

Administrative                   float64
Administrative_Duration          float64
Informational                    float64
Informational_Duration           float64
ProductRelated                   float64
ProductRelated_Duration          float64
BounceRates                      float64
ExitRates                        float64
PageValues                       float64
SpecialDay                       float64
OperatingSystems                 float64
Browser                          float64
Region                           float64
TrafficType                      float64
Weekend                            int64
Month_Aug                          uint8
Month_Dec                          uint8
Month_Feb                          uint8
Month_Jul                          uint8
Month_June                         uint8
Month_Mar                          uint8
Month_May                          uint8
Month_Nov                          uint8
Month_Oct                          uint8
Month_Sep       

In [11]:
Counter(y_train)

Counter({False: 8307, True: 1557})

With only 1,557 positive examples against 8,307 negatives, undersampling will cut the training set down quite a bit; it will be interesting to see how that affects the results.
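
To put rough numbers on that, here is a small back-of-the-envelope sketch using the counts printed above. It assumes both samplers are left at their default settings, which (as far as I understand them) balance the two classes; the variable names are just for illustration.

```python
from collections import Counter

# Class counts copied from the Counter(y_train) output above.
train_counts = Counter({False: 8307, True: 1557})

n_majority = max(train_counts.values())
n_minority = min(train_counts.values())
n_total = sum(train_counts.values())

# Share of positive (Revenue == True) rows in the training split.
minority_share = n_minority / n_total

# Assumption: the samplers fully balance the classes.
# Undersampling shrinks the majority class down to the minority size;
# oversampling grows the minority class up to the majority size.
rows_after_undersampling = 2 * n_minority
rows_after_oversampling = 2 * n_majority

print(f"minority share: {minority_share:.1%}")
print(f"rows after undersampling: {rows_after_undersampling} of {n_total}")
print(f"rows after oversampling: {rows_after_oversampling}")
```

Under that assumption, undersampling would keep a little under a third of the training rows, while oversampling would grow the training set by roughly two thirds.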

In [None]:
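# Oversample the minority class with SMOTE (training data only).
sm = SMOTE(random_state=42)
X_trainOS, y_trainOS = sm.fit_resample(X_train, y_train)
# Undersample the majority class with NearMiss (training data only).
nm = NearMiss()
X_trainUS, y_trainUS = nm.fit_resample(X_train, y_train)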

For classification, I'm currently planning to look at Random Forest and Gradient Boosting models; for clustering, I'll use PCA and KMeans to visualize any groupings in the data.
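
As a rough idea of what those next steps could look like, here is a minimal sketch on synthetic data. Everything in it is an assumption rather than a settled choice: the demo dataset, the F1 metric, the default hyperparameters, and the two clusters are placeholders to be revisited on the real data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Classification: fit both candidate models and score them on the held-out split.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
}
f1_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    f1_scores[name] = f1_score(y_te, model.predict(X_te))

# Clustering: project onto two principal components, then cluster in that 2-D space.
pca = PCA(n_components=2, random_state=1)
X_2d = pca.fit_transform(X_tr)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=1)
cluster_labels = kmeans.fit_predict(X_2d)

print(f1_scores)
print(X_2d.shape, len(set(cluster_labels)))
```

F1 is used here instead of accuracy only because accuracy can look good on imbalanced data even when the minority class is mostly missed; the 2-D projection in `X_2d` is what would be plotted, colored by `cluster_labels`.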