In [1]:
from IPython.display import display
from IPython.display import HTML
import IPython.core.display as di # Example: di.display_html('<h3>%s:</h3>' % str, raw=True)

# This line will hide code by default when the notebook is exported as HTML
di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>', raw=True)

# This line will add a button to toggle visibility of code blocks, for use with the HTML export version
di.display_html('''<button onclick="jQuery('.input_area').toggle(); jQuery('.prompt').toggle();">Toggle code</button>''', raw=True)

# **Data Spaces - Analysis of the Online Shoppers Purchasing Intention Dataset**
**Martina Alutto, s265027**



# Introduction

The analysis that will be presented has been carried out on the dataset available online at the following link: https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset.

This dataset consists of feature vectors belonging to 12,330 online sessions.
The dataset was formed so that each session would belong to a different user in a 1-year period to avoid any tendency to a specific campaign, special day, user
profile, or period. Some of these sessions on the site end with a purchase, while others do not.
The dataset contains 18 attributes, of which 10 are numerical and 8 are categorical. Among them, the 'Revenue' attribute indicates whether the session ends with a purchase or not, and it can be used as the class label.

The columns containing the users' attributes are described below:


*  "Administrative", "Administrative_Duration", "Informational", "Informational_Duration", "ProductRelated" and "ProductRelated_Duration" represent the number of pages of each type visited by the visitor in that session and the total time spent in each of these page categories. The values of these features are derived from the URL information of the pages visited by the user and updated in real time when a user takes an action, e.g. moving from one page to another. 
*  "BounceRates" for a web page refers to the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session.
*  "ExitRates" for a specific web page is the percentage of all pageviews of that page that were the last in the session.
*  "PageValues" feature represents the average value for a web page that a user visited before completing an e-commerce transaction. 

The previous three features represent the metrics measured by *Google Analytics* for each page in the e-commerce site. 
*   "SpecialDay" feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother’s Day, Valentine's Day) in which the sessions are more likely to be finalized with transaction. The value of this attribute is determined by considering the dynamics of e-commerce such as the duration between the order date and delivery date. For example, for Valentine’s day, this value takes a nonzero value between February 2 and February 12, zero before and after this date unless it is close to another special day, and its maximum value of 1 on February 8.
*   "Month" indicates the month in which the session took place.
*   "OperatingSystems" and "Browser" indicate, as categorical values, the operating system and the browser used in the session.
*   "Region" indicates the geographic region of the user. 
*   "TrafficType" is a particular identifier type for any hierarchy of customer base: *user* or *account* are two of the most commonly defined traffic types. (In the dataset this is a categorical feature).
*   "VisitorType" states the nature of the user as returning or new visitor.
*   "Weekend" indicates whether the session took place on a weekend or not.

As mentioned earlier, the boolean attribute "Revenue" indicates whether the session ended with a purchase (and was therefore successful) or was limited to a search or a quick look at the products. Analysing the users' behaviour could be very useful to predict this outcome and to improve the organisation of the site (e.g. a restyling with more pop-ups or offers that encourage purchases), so that fewer users leave the site without shopping.

The analysis carried out consists of a supervised classification problem, using the Python language and Jupyter Notebook. 

The main packages imported for our analysis are: 
-  *pandas*: an open-source library providing high-performance data structures and data analysis tools for manipulating numerical tables and time series.
-  *numpy*: the fundamental package for scientific computing with Python. NumPy provides, among other things, a powerful N-dimensional array object, sophisticated (broadcasting) functions, and useful linear algebra and random number capabilities.
-  *sklearn*: a free machine learning library providing tools for data mining. It features various algorithms such as support vector machines, random forests, and k-nearest neighbours.
-  *matplotlib*: a comprehensive library for creating static, animated, and interactive visualizations.
-  *seaborn*: a Python data visualization library based on *matplotlib*, providing an interface for statistical graphics.

In [10]:
import plotly.graph_objects as go
import pandas as pd
import sklearn
from sklearn import decomposition, preprocessing
from sklearn import svm
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn import metrics
from sklearn import tree 

In [11]:


# set the seed for the analysis
SEED = 40

# pandas option for the output style 
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)  # None = no truncation (-1 is deprecated in recent pandas)
pd.set_option('expand_frame_repr', True)

**Exploration and Preprocessing**

After importing the dataset, we can start exploring it: we take a look at the structure of the data and check whether any missing values are present.

In [12]:
data = pd.read_csv('/Users/Martina/Google Drive/online_shoppers_intention.csv') 

print('\nDataset dimensions : ', data.shape)

# Data analysis
# Preview the first 5 lines of the loaded data 
print('This is a preview of the first 5 lines of the loaded dataset.\n')
print(data.head(5))

print("\nThere are " + ("some" if data.isnull().values.any() else "no")  + " null/missing values in the dataset.")


Dataset dimensions :  (12330, 18)
This is a preview of the first 5 lines of the loaded dataset.

   Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  BounceRates  ExitRates  PageValues  SpecialDay Month  OperatingSystems  Browser  Region  TrafficType        VisitorType  Weekend  Revenue
0  0               0.0                      0              0.0                     1               0.000000                 0.20         0.20       0.0         0.0         Feb   1                 1        1       1            Returning_Visitor  False    False  
1  0               0.0                      0              0.0                     2               64.000000                0.00         0.10       0.0         0.0         Feb   2                 2        1       2            Returning_Visitor  False    False  
2  0               0.0                      0              0.0                     1               0.000000         

In order to investigate the pair-wise correlation between two variables $X$ and $Y$, we use the Pearson correlation. 
Let $\sigma(X)$, $\sigma(Y)$ be the standard deviations of $X$, $Y$ and let $\mathrm{cov}(X,Y) = E[(X-E[X])(Y-E[Y])]$ be their covariance. 
Then the Pearson correlation is defined as $\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma(X)\,\sigma(Y)}$.
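
As a quick numerical check of this definition (covariance divided by the product of the two standard deviations), the following minimal sketch computes the coefficient by hand and compares it with NumPy's built-in routine. The helper name `pearson` and the toy values are assumptions made for illustration; they are not taken from the dataset.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: cov(X, Y) / (sigma(X) * sigma(Y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Population covariance E[(X - E[X]) (Y - E[Y])]
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    # np.std uses the population standard deviation (ddof=0) by default
    return cov_xy / (x.std() * y.std())

# Toy values (hypothetical, for illustration only)
bounce = [0.20, 0.00, 0.20, 0.05, 0.02]
exit_ = [0.20, 0.10, 0.20, 0.14, 0.05]

rho = pearson(bounce, exit_)

# The hand-written version should agree with NumPy's correlation matrix
assert np.isclose(rho, np.corrcoef(bounce, exit_)[0, 1])
```

On a whole table, pandas' `DataFrame.corr` uses the Pearson coefficient by default, so it yields the same pair-wise values for the numerical columns.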

The correlation matrix shows a strong positive correlation between the attributes "BounceRates" and "ExitRates". Recall that the bounce rate is the percentage of visitors who landed on a page and left without any further interaction, so bounces are always one-page sessions. A high bounce rate on a home page is usually a sign that something is wrong, although it really depends on context. The exit rate, instead, is the percentage of pageviews of a page that were the last in the session: visitors who exit may have viewed more than one page, and may have reached that page through site navigation rather than landing on it. Since every bounce is also an exit from the same page, the two metrics tend to move together; and, like high bounce rates, high exit rates can often reveal problem areas of the site. 
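
To make the difference between the two metrics concrete, here is a small sketch that computes per-page bounce and exit rates from a handful of sessions. The sessions and page names are invented for illustration, and the formulas follow the definitions given in the text (bounces over entrances, exits over pageviews); it is not the procedure used to build the dataset.

```python
from collections import Counter

# Hypothetical sessions: each one is the ordered list of pages viewed
sessions = [
    ["home"],                     # one-page session: a bounce on "home"
    ["home", "product", "cart"],
    ["product"],                  # one-page session: a bounce on "product"
    ["home", "product"],
    ["product", "home"],
]

entrances = Counter(s[0] for s in sessions)                # sessions starting on the page
bounces = Counter(s[0] for s in sessions if len(s) == 1)   # one-page sessions
exits = Counter(s[-1] for s in sessions)                   # sessions ending on the page
pageviews = Counter(p for s in sessions for p in s)        # all views of the page

# Bounce rate: share of entrances on a page that were one-page sessions
bounce_rate = {p: bounces[p] / entrances[p] for p in entrances}
# Exit rate: share of pageviews of a page that were the last in their session
exit_rate = {p: exits[p] / pageviews[p] for p in pageviews}

print(bounce_rate)
print(exit_rate)
```

Every bounce is counted both as a bounce and as an exit for the same page, which is one reason the two rates tend to rise and fall together.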

It's also clear that the attributes "Administrative", "Informational" and "ProductRelated" are quite positively correlated with their duration attributes, because the number of pages of a certain type visited during the session is related to the time spent on that type of pages. You can see a greater correlation in the case of product pages because, with the same number of pages of the 3 types, more time is spent on a product page where all its specifications and characteristics are looked at.

We can also observe a quite high positive correlation between the "PageValues" feature and the label "Revenue", and this is what we expected because the objective of the first value is to give an idea of the page that has contributed most to the site's revenue. If a page has not been involved in any way in the e-commerce transaction, its Page Value is € 0, since the page has never been visited in a session where a transaction was made.
-- **TODO: consider whether to add more** -- 

We can observe how the categorical attributes "Month", "VisitorType" and "Weekend" have string values but they can be easily transformed into numbers.
"Weekend" is transformed by placing *0* if it was *False* and *1* if it was *True*; similarly "VisitorType" is set to *1* if it was a returning visitor and *0* if it is a new one. 
Instead "Month" could be transformed using LabelEncoder, which automatically converts each distinct label into a unique integer. However, we must note that there are no user sessions in January and April, so LabelEncoder would assign codes that do not match the calendar month numbers. We therefore transform this attribute by hand in order to avoid future problems when decoding and interpreting our results. That way we have an encoding of the months from 1 through 12. 
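
The transformations described above can be sketched as follows on a toy frame. The month labels in `month_to_int` are an assumption (both "Jun" and "June" are listed because the exact spelling in the file should be checked with `data["Month"].unique()`), and the names `toy` and `month_to_int` are hypothetical.

```python
import pandas as pd

# Toy frame mimicking three columns of the dataset (illustrative values)
toy = pd.DataFrame({
    "Month": ["Feb", "Mar", "May", "Nov", "Dec"],
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor",
                    "New_Visitor", "Returning_Visitor"],
    "Weekend": [False, True, False, False, True],
})

# Hand-written calendar mapping (assumed labels); months that do not occur
# in the data keep their calendar number, so codes always run from 1 to 12
month_to_int = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6, "June": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}

# False -> 0, True -> 1
toy["Weekend"] = toy["Weekend"].astype(int)
# Returning visitor -> 1, any other label -> 0
toy["VisitorType"] = (toy["VisitorType"] == "Returning_Visitor").astype(int)
# Labels missing from the mapping would become NaN, which makes typos easy to spot
toy["Month"] = toy["Month"].map(month_to_int)

print(toy)
```

Compared with LabelEncoder, the explicit dictionary keeps the codes interpretable: a value of 11 always means November, regardless of which months appear in the data.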

In [13]:
trace = go.Scatter(x=[1,2,3,4], y=[1,2,3,4],
                    mode = "lines+markers",
                    name = "teaching",
                    marker = dict(color = 'rgba(80, 26, 80, 0.8)'))
layout = dict(title = 'Citation and Teaching vs World Rank of Top 100 Universities',
              xaxis= dict(title= 'World Rank',ticklen= 5,zeroline= False))
fig = go.Figure(data = trace, layout = layout)
fig.show()

In [14]:
revenue_dict = {False: "No Revenue", True: "Revenue"}
y = data["Revenue"].value_counts()

d = [go.Bar(x=[revenue_dict[x] for x in y.index], y=y.values, marker=dict(color=['#FF3333','#3399FF']))]
layout = go.Layout(
    title='Revenue distribution on the Total',
    autosize=False,
    width=400,
    height=400,
    yaxis=dict(
        title='Number of samples',
    ),
)
fig = go.Figure(data=d, layout=layout)
fig.show()