## Analysis of online shoppers' purchase intention based on their online behavior
Data source: https://www.kaggle.com/datasets/imakash3011/online-shoppers-purchasing-intention-dataset?resource=download 

Phyl Peng

### Goals
The goal of this analysis is to gain insight into which aspects of an online customer’s behavior
lead to revenue. The results could inform online retail companies about which aspects of their
online sales experience to focus on, and which aspects are not worth extra resources, when they
are advertising.

**The questions I’m hoping to answer are**:
1. What factors correlate with revenue?
2. What aspects of online sales can companies target to better present and promote their products?
3. What aspects of the company, product, or shopping experience best predict revenue?
4. What aspects of the company, product, or shopping experience negatively affect product sales?

### Notes from the source:
The dataset consists of feature vectors belonging to **12,330 sessions**.

The **'Revenue'** attribute can be used as the class label.

"Administrative", "Administrative Duration", "Informational", "Informational Duration", "Product Related" and "Product Related Duration" represent the **number of different types of pages visited** by the visitor in that session and **total time spent** in each of these page categories. The values of these features are derived from the URL information of the pages visited by the user and updated in real time when a user takes an action, e.g. moving from one page to another. 

The "Bounce Rate", "Exit Rate" and "Page Value" features represent the metrics measured by "Google Analytics" for each page in the e-commerce site. The value of "Bounce Rate" feature for a web page refers to the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session. 

The value of the "Exit Rate" feature for a specific web page is the percentage of all pageviews of that page that were the last in their session.
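
The bounce rate and exit rate definitions above can be illustrated on a handful of made-up sessions. This is only a minimal sketch: the `sessions` list and the helper names `bounce_rates` and `exit_rates` are hypothetical, and a bounce is approximated here as a session with a single pageview, which is simpler than what Google Analytics actually tracks.

```python
from collections import Counter

# Hypothetical toy sessions: each one is the ordered list of pages viewed.
sessions = [
    ["home"],                     # single pageview, counted as a bounce on "home"
    ["home", "product", "cart"],
    ["product", "home"],
    ["home", "product"],
]

def bounce_rates(sessions):
    """Share of sessions entering on a page that viewed only that page."""
    entrances = Counter(s[0] for s in sessions)
    bounces = Counter(s[0] for s in sessions if len(s) == 1)
    return {page: bounces[page] / n for page, n in entrances.items()}

def exit_rates(sessions):
    """Share of a page's pageviews that were the last in their session."""
    views = Counter(page for s in sessions for page in s)
    exits = Counter(s[-1] for s in sessions)
    return {page: exits[page] / n for page, n in views.items()}

print(bounce_rates(sessions))
print(exit_rates(sessions))
```

In this toy data "home" is the entry page of three sessions and one of them bounces, while two of its four pageviews are the last in their session, so the two rates differ even for the same page.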

The "Page Value" feature represents the average value for a web page that a user visited before completing an e-commerce transaction. 
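
As a rough illustration of the "Page Value" idea, the sketch below divides the revenue of the sessions that viewed a page by the number of sessions that viewed it. The toy `sessions` data and the `page_values` helper are hypothetical, and the calculation is a simplification: it assumes every listed page was viewed before the purchase and ignores goal values.

```python
# Hypothetical toy sessions: pages viewed before any purchase, and the
# revenue of the transaction that ended the session (0 if there was none).
sessions = [
    {"pages": ["home", "product", "checkout"], "revenue": 100.0},
    {"pages": ["home", "product"], "revenue": 0.0},
    {"pages": ["product", "checkout"], "revenue": 50.0},
    {"pages": ["home"], "revenue": 0.0},
]

def page_values(sessions):
    """Revenue credited to each page divided by its unique pageviews."""
    revenue = {}
    unique_views = {}
    for s in sessions:
        for page in set(s["pages"]):  # count each page once per session
            unique_views[page] = unique_views.get(page, 0) + 1
            revenue[page] = revenue.get(page, 0.0) + s["revenue"]
    return {page: revenue[page] / unique_views[page] for page in unique_views}

print(page_values(sessions))
```

Pages that sit closer to the purchase, such as "checkout" in this toy data, end up with a higher value because a larger share of the sessions that reach them end in a transaction.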

The "Special Day" feature indicates the **closeness of the site visiting time to a specific special day** (e.g. Mother’s Day, Valentine's Day) on which sessions are more likely to end in a transaction. The value of this attribute is determined by considering the dynamics of e-commerce, such as the duration between the order date and the delivery date. For example, for Valentine’s Day, this value is nonzero between February 2 and February 12, zero before and after these dates unless the visit is close to another special day, and reaches its maximum value of 1 on February 8.

The dataset also includes operating system, browser, region, traffic type, visitor type as returning or new visitor, a Boolean value indicating whether the date of the visit is weekend, and month of the year.

## Exploratory Data Analysis and Visualization

In [10]:
import pandas as pd
import plotly.express as px
print("setup complete")

setup complete


In [2]:
data = "online_shoppers_intention.csv"
shop = pd.read_csv(data)
print(type(shop))
shop.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False


In [3]:
shop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  object 
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  TrafficType           

In [4]:
shop.describe()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,OperatingSystems,Browser,Region,TrafficType
count,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0
mean,2.315166,80.818611,0.503569,34.472398,31.731468,1194.74622,0.022191,0.043073,5.889258,0.061427,2.124006,2.357097,3.147364,4.069586
std,3.321784,176.779107,1.270156,140.749294,44.475503,1913.669288,0.048488,0.048597,18.568437,0.198917,0.911325,1.717277,2.401591,4.025169
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,0.0,7.0,184.1375,0.0,0.014286,0.0,0.0,2.0,2.0,1.0,2.0
50%,1.0,7.5,0.0,0.0,18.0,598.936905,0.003112,0.025156,0.0,0.0,2.0,2.0,3.0,2.0
75%,4.0,93.25625,0.0,0.0,38.0,1464.157214,0.016813,0.05,0.0,0.0,3.0,2.0,4.0,4.0
max,27.0,3398.75,24.0,2549.375,705.0,63973.52223,0.2,0.2,361.763742,1.0,8.0,13.0,9.0,20.0


In [23]:
print(type(shop))

# Convert the boolean label to 0/1; astype(int) is safe to re-run on an already converted column
shop['Revenue'] = shop['Revenue'].astype(int)
shop.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,0
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,0
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,0
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,0
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,0


In [24]:
revenue_percentage = sum(shop.Revenue)/shop.shape[0]
print("amount of sessions that yielded success: ", revenue_percentage*100, "%")

amount of sessions that yielded success:  15.474452554744525 %


In [25]:
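# numeric_only=True skips the text columns (Month, VisitorType), which recent pandas versions refuse to average
shop.groupby('Revenue').mean(numeric_only=True)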

Revenue,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,OperatingSystems,Browser,Region,TrafficType,Weekend
0,2.117732,73.740111,0.451833,30.236237,28.714642,1069.987809,0.025317,0.047378,1.975998,0.068432,2.129726,2.339474,3.159278,4.078392,0.227308
1,3.393606,119.483244,0.786164,57.611427,48.210168,1876.209615,0.005117,0.019555,27.264518,0.023166,2.092767,2.453354,3.082285,4.021488,0.26153


In [30]:
# numeric_only=True skips the VisitorType text column, which cannot be averaged
shop_byMonth = shop.groupby('Month').mean(numeric_only=True).reset_index()
shop_byMonth

Unnamed: 0,Month,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,OperatingSystems,Browser,Region,TrafficType,Weekend,Revenue
0,Aug,3.136259,106.717288,0.542725,35.514365,38.258661,1272.653654,0.018211,0.037727,5.93807,0.0,2.071594,2.378753,3.249423,3.512702,0.221709,0.17552
1,Dec,2.196294,78.632929,0.512449,38.069358,27.994789,1111.470727,0.020149,0.041303,6.833243,0.0,2.253619,2.613202,3.393746,4.068327,0.211928,0.125072
2,Feb,0.543478,16.872418,0.086957,2.38587,11.184783,471.014647,0.047021,0.074148,0.890363,0.233696,1.918478,2.179348,2.663043,2.771739,0.152174,0.016304
3,Jul,2.423611,78.874358,0.516204,45.520409,36.407407,1217.604028,0.024676,0.04533,4.104414,0.0,2.094907,2.37963,3.414352,3.68287,0.240741,0.152778
4,June,2.274306,59.129946,0.5625,20.450775,36.065972,1213.377604,0.035102,0.058242,3.39144,0.0,2.131944,2.315972,3.190972,4.211806,0.163194,0.100694
5,Mar,1.887782,71.231507,0.420556,30.673764,19.8086,812.282992,0.021728,0.0446,3.959682,0.0,2.078658,2.291557,3.033561,3.178815,0.252229,0.100682
6,May,1.964923,69.47179,0.4239,27.163159,26.487812,981.89306,0.026867,0.04885,5.431574,0.212366,2.120095,2.368906,3.134958,4.476813,0.212545,0.108502
7,Nov,2.617412,90.93331,0.646431,43.634938,46.038692,1758.397922,0.019259,0.038202,7.129379,0.0,2.116077,2.250834,3.033689,4.454303,0.263843,0.253502
8,Oct,3.71949,125.939345,0.48816,38.666926,33.566485,1116.977684,0.011849,0.029011,8.64558,0.0,2.056466,2.227687,3.193078,4.276867,0.262295,0.209472
9,Sep,3.334821,109.325429,0.566964,35.736835,33.104911,1253.38815,0.012183,0.03032,7.556826,0.0,2.140625,2.486607,3.294643,3.332589,0.214286,0.191964


In [40]:
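# Calendar order of the months present in the data ('June' is spelled out; Jan and Apr are absent)
month_order = ['Feb', 'Mar', 'May', 'June', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
fig = px.bar(shop_byMonth,
             x='Month', y='Revenue',
             color='SpecialDay',  # numeric column, so plotly applies a continuous color scale
             category_orders={'Month': month_order},
             labels={'Revenue': 'Share of sessions with a purchase',
                     'SpecialDay': 'Mean SpecialDay'},
             title='Online Shopper Intention - Closeness to Special Day and Revenue')
fig.show()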

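It seems like the purchase rate is not related to closeness to a special day. The mean SpecialDay value is nonzero only in Feb and May, which could reflect Valentine's Day, graduation days, and Memorial Day, and neither month stands out with a high purchase rate (about 1.6% in Feb and 10.9% in May, compared with 25.4% in Nov).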
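
Monthly averages can hide what happens at the session level, so one way to check this observation more directly is to group sessions by their SpecialDay value and compare purchase rates. The sketch below assumes the `SpecialDay` and `Revenue` columns described above; the helper name `conversion_by_special_day` and the `toy` frame are hypothetical and only stand in for `shop`.

```python
import pandas as pd

def conversion_by_special_day(df):
    """Return the purchase rate and session count for each SpecialDay value."""
    return (
        df.groupby("SpecialDay")["Revenue"]
          .agg(purchase_rate="mean", sessions="size")
          .reset_index()
    )

# Hypothetical toy sessions, not rows from the real dataset.
toy = pd.DataFrame({
    "SpecialDay": [0.0, 0.0, 0.0, 0.0, 0.4, 0.4, 1.0, 1.0],
    "Revenue":    [1,   0,   1,   0,   0,   1,   0,   0],
})
print(conversion_by_special_day(toy))
```

Calling `conversion_by_special_day(shop)` on the real data would show whether sessions closer to a special day convert more or less often, along with how many sessions support each rate.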

In [44]:
# labels only renames columns; the legend entries themselves stay 0 (no sale) and 1 (sale)
fig = px.histogram(shop, 'ProductRelated_Duration', range_x=[0, 5000],
                   color='Revenue', barmode='overlay',
                   title='Histogram of Product Related Duration, by Revenue',
                   labels={'count': 'number of sessions',
                           'ProductRelated_Duration': 'Time Spent Viewing Product-related Webpage',
                           'Revenue': 'Successful Sale'})
fig.show()

It seems like there is no clear relationship between the time spent viewing product-related webpages and a successful sale. However, if very little time is spent looking at product-related pages, the session is unlikely to lead to a sale.
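
The visual impression from the histogram can be backed with numbers by computing the purchase rate within duration bands, plus a single correlation coefficient. This is a minimal sketch under stated assumptions: the bin edges, the helper name `purchase_rate_by_duration`, and the `toy` frame are hypothetical, and Revenue is assumed to be coded as 0/1. The Pearson correlation between a 0/1 column and a numeric column is the point-biserial correlation.

```python
import pandas as pd

def purchase_rate_by_duration(df, edges=(0, 60, 600, 1800, float("inf"))):
    """Purchase rate and session count per ProductRelated_Duration band."""
    # right=False makes the bands [0, 60), [60, 600), ... so 0 is included
    bands = pd.cut(df["ProductRelated_Duration"], bins=list(edges), right=False)
    return (
        df.groupby(bands, observed=False)["Revenue"]
          .agg(purchase_rate="mean", sessions="size")
    )

# Hypothetical toy sessions, not rows from the real dataset.
toy = pd.DataFrame({
    "ProductRelated_Duration": [0, 10, 30, 120, 300, 500, 900, 1500, 2500, 4000],
    "Revenue":                 [0, 0,  0,  0,   1,   0,   1,   0,    1,    0],
})

rates = purchase_rate_by_duration(toy)
r = toy["ProductRelated_Duration"].corr(toy["Revenue"])  # point-biserial correlation
print(rates)
print("correlation:", r)
```

A near-zero rate in the shortest band combined with similar rates in the longer bands would match the observation above, which a single correlation coefficient cannot show on its own.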

In [47]:
# This cell plots the number of product-related pages viewed, not the time spent on them
fig = px.histogram(shop, 'ProductRelated', range_x=[0, 200],
                   color='Revenue', barmode='overlay',
                   title='Histogram of Product Related Pages Viewed, by Revenue',
                   labels={'count': 'number of sessions',
                           'ProductRelated': 'Number of Product-related Webpages Viewed',
                           'Revenue': 'Successful Sale'})
fig.show()