# STA4724 Final Project
## Team 6: Andres Machado, Jackson Smalls, Sarah Taha, and Thomas Tibbets

### EDA (Exploratory Data Analysis) Notebook 

---

### Importing the libraries, data and initial inspections

In [None]:
import pandas as pd
import plotly.express as px

In [3]:
# Import the dataset
raw_data = pd.read_csv("online_shoppers_intention.csv")

# Preview the dataset
raw_data.info()
raw_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  object 
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  TrafficType           

| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.0 | 0.2 | 0.2 | 0.0 | 0.0 | Feb | 1 | 1 | 1 | 1 | Returning_Visitor | False | False |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.0 | 0.0 | 0.1 | 0.0 | 0.0 | Feb | 2 | 2 | 1 | 2 | Returning_Visitor | False | False |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.0 | 0.2 | 0.2 | 0.0 | 0.0 | Feb | 4 | 1 | 9 | 3 | Returning_Visitor | False | False |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.05 | 0.14 | 0.0 | 0.0 | Feb | 3 | 2 | 2 | 4 | Returning_Visitor | False | False |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.5 | 0.02 | 0.05 | 0.0 | 0.0 | Feb | 3 | 3 | 1 | 4 | Returning_Visitor | True | False |


The data imported correctly and, as expected, contains no null or missing values. Next, we check for duplicate rows:

In [4]:
# Counting and printing the duplicate rows
duplicated_rows_count = raw_data.duplicated().sum()
print(f"Total number of duplicate rows: {duplicated_rows_count}")
print(f"Percentage of rows that are duplicates: {(duplicated_rows_count / len(raw_data)) * 100:.2f}%")

Total number of duplicate rows: 125
Percentage of rows that are duplicates: 1.01%


Only 1.01% of rows (125 of 12,330) are duplicates, which is a small share. We keep them for the EDA, but we will remove them before the train/test split when we fit models.
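The removal step planned for the modelling stage could look like the sketch below. It runs on a small made-up frame rather than `raw_data`, and the names `toy` and `deduped` are illustrative only. Dropping duplicates before splitting prevents the same row from landing in both the training and the test set.

```python
import pandas as pd

# Toy stand-in for raw_data (illustrative values, not taken from the dataset)
toy = pd.DataFrame({
    "ProductRelated": [1, 2, 1, 10],
    "Month": ["Feb", "Feb", "Feb", "Mar"],
    "Revenue": [False, False, False, True],
})

# keep="first" retains one copy of every fully identical row
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(f"Rows before: {len(toy)}, rows after: {len(deduped)}")
```

Resetting the index afterwards is optional, but it keeps positional and label-based indexing aligned for any later split.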

Before going further, note that the columns `OperatingSystems`, `Browser`, `Region`, and `TrafficType` are categorical features that pandas has read as `int64`. For EDA purposes, we convert them to the `category` dtype:

In [5]:
# Converting certain numerical columns to categorical 
num_to_cat_cols = ["OperatingSystems", "Browser", "Region", "TrafficType"]
raw_data[num_to_cat_cols] = raw_data[num_to_cat_cols].astype("category")

# Verifying
raw_data["OperatingSystems"].dtype

CategoricalDtype(categories=[1, 2, 3, 4, 5, 6, 7, 8], ordered=False, categories_dtype=int64)
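The check above inspects a single column. A minimal sketch for verifying all four converted columns at once is shown below. It uses a small made-up frame, and `toy` and `all_converted` are hypothetical names.

```python
import pandas as pd

# Toy stand-in with the same column names (values are illustrative)
toy = pd.DataFrame({
    "OperatingSystems": [1, 2, 4, 3],
    "Browser": [1, 2, 1, 2],
    "Region": [1, 1, 9, 2],
    "TrafficType": [1, 2, 3, 4],
    "PageValues": [0.0, 0.0, 0.0, 0.0],
})

num_to_cat_cols = ["OperatingSystems", "Browser", "Region", "TrafficType"]
toy[num_to_cat_cols] = toy[num_to_cat_cols].astype("category")

# True only if every listed column now carries a categorical dtype
all_converted = all(
    isinstance(toy[col].dtype, pd.CategoricalDtype) for col in num_to_cat_cols
)
print(all_converted)
```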

Next, we will generate the summary statistics for numerical features of the dataset:

In [6]:
# Summary stats
raw_data.describe()

| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 |
| mean | 2.315166 | 80.818611 | 0.503569 | 34.472398 | 31.731468 | 1194.74622 | 0.022191 | 0.043073 | 5.889258 | 0.061427 |
| std | 3.321784 | 176.779107 | 1.270156 | 140.749294 | 44.475503 | 1913.669288 | 0.048488 | 0.048597 | 18.568437 | 0.198917 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 184.1375 | 0.0 | 0.014286 | 0.0 | 0.0 |
| 50% | 1.0 | 7.5 | 0.0 | 0.0 | 18.0 | 598.936905 | 0.003112 | 0.025156 | 0.0 | 0.0 |
| 75% | 4.0 | 93.25625 | 0.0 | 0.0 | 38.0 | 1464.157214 | 0.016813 | 0.05 | 0.0 | 0.0 |
| max | 27.0 | 3398.75 | 24.0 | 2549.375 | 705.0 | 63973.52223 | 0.2 | 0.2 | 361.763742 | 1.0 |


These statistics look reasonable and match exactly those reported by the researchers from whom we obtained this dataset. Some of the maxima are extreme, which we will examine more closely through visualizations.
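One way to put a number on the extreme maxima before plotting is Tukey's upper fence, Q3 + 1.5 × IQR. The sketch below is not part of our pipeline. The helper name `share_above_upper_fence` and the toy series are hypothetical.

```python
import pandas as pd

def share_above_upper_fence(series: pd.Series, k: float = 1.5) -> float:
    """Return the fraction of values above Q3 + k * IQR."""
    q1, q3 = series.quantile([0.25, 0.75])
    upper_fence = q3 + k * (q3 - q1)
    return float((series > upper_fence).mean())

# Toy series with one obvious extreme value (illustrative only)
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
share = share_above_upper_fence(toy)
print(f"Share above the upper fence: {share:.2f}")
```

For zero-inflated features such as `PageValues`, whose 25th and 75th percentiles are both 0.0 in the table above, the fence collapses to 0 and every positive value is flagged, so the result needs to be read with care.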

---

### 1D Data Visualizations

In this section, we explore each variable individually through visualizations, splitting the features into numerical and categorical/boolean groups.

In [7]:
# Splitting of the raw data frame into cat and num dfs
cat_df = raw_data.select_dtypes(include=["object", "bool", "category"])
num_df = raw_data.select_dtypes(include=["int64", "float64"])

##### Numerical Features

In [None]:
# Example histogram for one feature; the remaining numerical features will be automated later
fig = px.histogram(raw_data, x="PageValues", color="Revenue", nbins=50, marginal="box",
                   color_discrete_map={True: "green", False: "red"})
fig.show()