# Linkfire: Website Traffic Analysis

**Leng Yang**

**Last Updated: 1/27/25**

## Assignment

Project Source: https://platform.stratascratch.com/data-projects/website-traffic-analysis

The goal of this project is to better understand the traffic to a set of links, in particular the volume and distribution of events, and to develop ideas for increasing the links' click rates. With that in mind, please analyze the data using the Python libraries Pandas and SciPy where indicated, answering the questions below:

1. [Pandas] How many total pageview events did the links in the provided dataset receive in the full period, how many per day?
2. [Pandas] What about the other recorded events?
3. [Pandas] Which countries did the pageviews come from?
4. [Pandas] What was the overall click rate (clicks/pageviews)?
5. [Pandas] How does the clickrate distribute across different links?
6. [Pandas & SciPy] Is there any correlation between clicks and previews on a link? Is it significant? How large is the effect? Make sure to at least test for potential linear as well as categorical (think binary) relationships between both variables.

## Data Description

The data set provided (`traffic.csv`) contains web traffic data (`"events"`) from a few different pages (`"links"`) over a period of 7 days. It includes various categorical dimensions describing the geographic origin of that traffic (`country`, `city`) as well as each page's content (`artist`, `album`, `track`, and the recording code `isrc`).

<BR><BR>

### 1. How many total pageview events did the links in the provided dataset receive in the full period, how many per day?

In [7]:
#Import necessary libraries
import numpy as np
import pandas as pd
from scipy import stats

In [8]:
#Load in and preview data
df = pd.read_csv('./datasets/traffic.csv')
df.head()

| | event | date | country | city | artist | album | track | isrc | linkid |
|---|---|---|---|---|---|---|---|---|---|
| 0 | click | 2021-08-21 | Saudi Arabia | Jeddah | Tesher | Jalebi Baby | Jalebi Baby | QZNWQ2070741 | 2d896d31-97b6-4869-967b-1c5fb9cd4bb8 |
| 1 | click | 2021-08-21 | Saudi Arabia | Jeddah | Tesher | Jalebi Baby | Jalebi Baby | QZNWQ2070741 | 2d896d31-97b6-4869-967b-1c5fb9cd4bb8 |
| 2 | click | 2021-08-21 | India | Ludhiana | Reyanna Maria | So Pretty | So Pretty | USUM72100871 | 23199824-9cf5-4b98-942a-34965c3b0cc2 |
| 3 | click | 2021-08-21 | France | Unknown | Simone & Simaria, Sebastian Yatra | No Llores Más | No Llores Más | BRUM72003904 | 35573248-4e49-47c7-af80-08a960fa74cd |
| 4 | click | 2021-08-21 | Maldives | Malé | Tesher | Jalebi Baby | Jalebi Baby | QZNWQ2070741 | 2d896d31-97b6-4869-967b-1c5fb9cd4bb8 |


In [9]:
#Total number of events for each event type
df.event.value_counts()

event
pageview    142015
click        55732
preview      28531
Name: count, dtype: int64

In [10]:
#Number of events per day for each event type
df.groupby('date')['event'].value_counts().unstack()

| date | click | pageview | preview |
|---|---|---|---|
| 2021-08-19 | 9207 | 22366 | 3788 |
| 2021-08-20 | 8508 | 21382 | 4222 |
| 2021-08-21 | 8071 | 21349 | 4663 |
| 2021-08-22 | 7854 | 20430 | 4349 |
| 2021-08-23 | 7315 | 18646 | 3847 |
| 2021-08-24 | 7301 | 18693 | 3840 |
| 2021-08-25 | 7476 | 19149 | 3822 |
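
The daily counts above sum to the reported totals (for pageviews, 142,015 over 7 days, or roughly 20,288 per day on average). A minimal sketch of how the total and the per-day mean could be produced in one step is shown here; the helper name `daily_event_summary` and the toy rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def daily_event_summary(events: pd.DataFrame) -> pd.DataFrame:
    """Return the total and the mean-per-day count for each event type.

    Assumes `events` has a 'date' and an 'event' column, as traffic.csv does.
    """
    # Rows are dates, columns are event types, cells are event counts
    per_day = pd.crosstab(events["date"], events["event"])
    return pd.DataFrame({"total": per_day.sum(), "mean_per_day": per_day.mean()})

# Toy rows standing in for traffic.csv (hypothetical data)
toy = pd.DataFrame({
    "date": ["2021-08-19", "2021-08-19", "2021-08-19", "2021-08-20", "2021-08-20"],
    "event": ["pageview", "pageview", "click", "pageview", "click"],
})
summary = daily_event_summary(toy)
print(summary)
```

On the real data the same helper would be called as `daily_event_summary(df)`.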


<BR>

### 2. What about the other recorded events?

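The totals and daily counts for the other event types appear in the outputs of the previous section: 55,732 clicks and 28,531 previews over the full period, with the per-day breakdown in the same table as the pageviews.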

<BR>

### 3. Which countries did the pageviews come from?

In [16]:
#All countries with a 'pageview' event
df.loc[df.event == 'pageview', 'country'].unique()

array(['Saudi Arabia', 'United States', 'Ireland', 'United Kingdom',
       'France', 'Guatemala', 'Jordan', 'Kuwait', 'Pakistan', 'Italy',
       'Germany', 'Iraq', 'Peru', 'India', 'Nicaragua', 'Rwanda',
       'Tanzania', 'United Arab Emirates', 'Norway', 'Oman', 'Bahamas',
       'Algeria', 'Czechia', 'Mexico', 'Jamaica', 'Netherlands',
       'Colombia', 'Morocco', 'Australia', 'Myanmar', 'Uzbekistan',
       'Austria', 'Latvia', 'Turkey', 'Mauritania', 'Sri Lanka',
       'Bosnia and Herzegovina', 'Estonia', 'Nigeria', 'Bulgaria',
       'Greece', 'El Salvador', 'Philippines', 'Denmark', 'Serbia',
       'Canada', 'Spain', 'Libya', 'Palestine', 'Chad', 'Ecuador', 'Mali',
       'Romania', 'Switzerland', 'Portugal', 'Slovenia', 'Iceland',
       'Sweden', 'Bahrain', 'Egypt', 'Lithuania', 'Liberia', 'Israel',
       'Ukraine', 'Puerto Rico', 'South Africa', 'Ghana', 'Kenya',
       'Armenia', 'Nepal', 'Barbados', 'Azerbaijan', 'Qatar', 'Uganda',
       'Poland', 'Brazil', 'Guyana',
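
Listing the unique countries shows where pageviews came from, but not how many came from each. A small sketch of a ranked count follows; the function name `pageviews_by_country` and the toy rows are assumptions made for illustration.

```python
import pandas as pd

def pageviews_by_country(events: pd.DataFrame) -> pd.Series:
    """Count pageview events per country, largest first.

    value_counts() drops missing countries by default; pass dropna=False
    inside it if rows without a country should be counted as well.
    """
    pageviews = events.loc[events["event"] == "pageview", "country"]
    return pageviews.value_counts()

# Toy rows standing in for traffic.csv (hypothetical data)
toy = pd.DataFrame({
    "event": ["pageview", "pageview", "click", "pageview"],
    "country": ["India", "France", "India", "India"],
})
counts = pageviews_by_country(toy)
print(counts)
```

Calling `pageviews_by_country(df).head(10)` on the real data would give the ten largest sources of pageviews.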

<BR>

### 4. What was the overall click rate (clicks/pageviews)?

In [19]:
#Count the number of clicks and pageviews per linkid, dropping links that are missing either a click count or a pageview count
clickrate = df.groupby('linkid')['event'].value_counts().unstack()[['click','pageview']].dropna()

In [20]:
#Calculate clickrate for each linkid
clickrate['clickrate'] = clickrate.click / clickrate.pageview

In [21]:
#View of the clickrates across each linkid
clickrate

| linkid | click | pageview | clickrate |
|---|---|---|---|
| 00126b32-0c35-507b-981c-02c80d2aa8e7 | 2.0 | 2.0 | 1.000000 |
| 004b9724-abca-5481-b6e9-6148a7ca00a5 | 1.0 | 1.0 | 1.000000 |
| 0063a982-41cd-5629-96d0-e1c4dd72ea11 | 2.0 | 3.0 | 0.666667 |
| 006af6a0-1f0d-4b0c-93bf-756af9071c06 | 8.0 | 36.0 | 0.222222 |
| 00759b81-3f04-4a61-b934-f8fb3185f4a0 | 3.0 | 4.0 | 0.750000 |
| ... | ... | ... | ... |
| ffd8d5a7-91bc-48e1-a692-c26fca8a8ead | 29.0 | 84.0 | 0.345238 |
| fff38ca0-8043-50cd-a5f1-f65ebb7105c5 | 1.0 | 1.0 | 1.000000 |
| fff84c0e-90a1-59d8-9997-adc909d50e16 | 1.0 | 1.0 | 1.000000 |
| fffc17a7-f935-5d3e-bd3e-d761fd80d479 | 1.0 | 2.0 | 0.500000 |
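
The per-link rates do not by themselves give the single overall rate the question asks for. Using the totals reported in Section 1 (55,732 clicks and 142,015 pageviews), the overall click rate works out to roughly 0.392. The sketch below shows one way to compute it directly from the event column; `overall_click_rate` and the toy data are illustrative assumptions.

```python
import pandas as pd

def overall_click_rate(events: pd.DataFrame) -> float:
    """Overall click rate: total clicks divided by total pageviews."""
    counts = events["event"].value_counts()
    # .get() returns 0 if no click events are present at all
    return counts.get("click", 0) / counts["pageview"]

# Toy rows standing in for traffic.csv (hypothetical data)
toy = pd.DataFrame({"event": ["pageview"] * 4 + ["click"] * 1 + ["preview"] * 2})
rate = overall_click_rate(toy)
print(rate)

# The same ratio using the totals reported in Section 1
reported_rate = 55732 / 142015
print(round(reported_rate, 3))
```

Unlike the per-link table, this ratio includes links that recorded pageviews but no clicks, so it is not the same as averaging the per-link rates.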


<BR>

### 5. How does the clickrate distribute across different links?

In [24]:
#Distribution of clickrate
clickrate.clickrate.describe()

count    2253.000000
mean        0.809920
std         1.958030
min         0.090909
25%         0.500000
50%         1.000000
75%         1.000000
max        92.300000
Name: clickrate, dtype: float64
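
The summary above shows that the median and upper quartile are both 1.0 while the maximum is 92.3, so at least some links recorded more clicks than pageviews. One way to look at the shape of the distribution is to bucket the rates; this is a sketch only, and the bin edges, labels, and the name `clickrate_buckets` are assumptions chosen for illustration.

```python
import pandas as pd

def clickrate_buckets(rates: pd.Series) -> pd.Series:
    """Count links per click-rate bucket, with a separate bucket for rates above 1."""
    bins = [0, 0.25, 0.5, 0.75, 1.0, float("inf")]
    labels = ["0-0.25", "0.25-0.5", "0.5-0.75", "0.75-1", ">1"]
    # Intervals are right-closed, so a rate of exactly 1.0 lands in "0.75-1"
    return pd.cut(rates, bins=bins, labels=labels).value_counts(sort=False)

# Toy click rates (hypothetical data)
toy_rates = pd.Series([0.1, 0.5, 1.0, 1.0, 92.3])
buckets = clickrate_buckets(toy_rates)
print(buckets)
```

On the real data this would be called as `clickrate_buckets(clickrate["clickrate"])`; the `>1` bucket isolates the links whose click counts exceed their pageview counts.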

<BR>

### 6. Is there any correlation between clicks and previews on a link? Is it significant? How large is the effect? Make sure to at least test for potential linear as well as categorical (think binary) relationships between both variables.

In [27]:
#Count the number of clicks and previews per linkid
click_preview = df.groupby('linkid')['event'].value_counts().unstack()[['click','preview']]

In [28]:
#Clicks and previews per link are strongly correlated (Pearson r of about 0.99); note that .corr() skips links missing either count, and that a correlation alone does not show one event depends on the other
click_preview.corr()

| event | click | preview |
|---|---|---|
| click | 1.0 | 0.993422 |
| preview | 0.993422 | 1.0 |
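
The question also asks whether the correlation is significant and whether a binary relationship exists, which `DataFrame.corr()` alone does not answer. The following is a minimal sketch using SciPy, not the notebook's own method: it assumes links with no recorded clicks or previews should count as zero (the `.corr()` call above instead ignores them), and the name `click_preview_tests` and the toy data are hypothetical.

```python
import pandas as pd
from scipy import stats

def click_preview_tests(counts: pd.DataFrame) -> dict:
    """Linear and binary association tests between per-link clicks and previews.

    `counts` has one row per link with 'click' and 'preview' columns;
    NaN is treated as zero events (an assumption).
    """
    filled = counts[["click", "preview"]].fillna(0)

    # Linear relationship: Pearson r with its two-sided p-value
    r, p_linear = stats.pearsonr(filled["click"], filled["preview"])

    # Binary relationship: did the link get any click / any preview at all?
    has_click = filled["click"] > 0
    has_preview = filled["preview"] > 0
    # Pearson r on two 0/1 indicators is the phi coefficient (effect size)
    phi, _ = stats.pearsonr(has_click.astype(int), has_preview.astype(int))
    # Chi-square test of independence on the 2x2 table, without Yates correction
    table = pd.crosstab(has_click, has_preview)
    chi2, p_binary, dof, expected = stats.chi2_contingency(table, correction=False)

    return {"pearson_r": r, "p_linear": p_linear,
            "phi": phi, "chi2": chi2, "p_binary": p_binary}

# Toy per-link counts (hypothetical data); NaN means no such event was recorded
toy = pd.DataFrame({
    "click":   [5, float("nan"), 3, float("nan"), 8, 1],
    "preview": [2, float("nan"), 1, 1, 4, float("nan")],
})
result = click_preview_tests(toy)
print(result)
```

On the real data this would be called as `click_preview_tests(click_preview)`. If every link has at least one click (or at least one preview), the binary indicators are constant and the binary test is not defined, so the result should be checked for that case.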


In [46]:
#Define categorical variables
cats = ['country','city','artist','album','track']
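#Encode categorical values as integer codes in order to determine correlation -- artist, album, and track are highly correlated as they are dependent upon one another (factorize assigns codes in order of first appearance, so these coefficients are only a rough indication)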
df[cats].apply(lambda x: pd.factorize(x)[0]).corr()

| | country | city | artist | album | track |
|---|---|---|---|---|---|
| country | 1.0 | 0.188101 | 0.010852 | 0.006472 | 0.005394 |
| city | 0.188101 | 1.0 | 0.088732 | 0.095431 | 0.097308 |
| artist | 0.010852 | 0.088732 | 1.0 | 0.901881 | 0.875746 |
| album | 0.006472 | 0.095431 | 0.901881 | 1.0 | 0.963166 |
| track | 0.005394 | 0.097308 | 0.875746 | 0.963166 | 1.0 |
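
Because integer codes for nominal categories carry no order, a measure built for categorical data may be a better fit than Pearson correlation on factorized values. The sketch below swaps in Cramér's V, computed with `scipy.stats.contingency.association` (available in SciPy 1.7 and later); this is an alternative technique rather than what the notebook does, and `cramers_v` and the toy data are illustrative assumptions.

```python
import pandas as pd
from scipy.stats.contingency import association

def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V between two categorical columns (0 = independent, 1 = perfect)."""
    # Contingency table of co-occurrence counts (integer-valued)
    table = pd.crosstab(a, b)
    return float(association(table.to_numpy(), method="cramer"))

# Toy rows (hypothetical data): each artist has exactly one track,
# while country is spread evenly across artists
toy = pd.DataFrame({
    "artist":  ["A", "A", "B", "B", "C", "C"],
    "track":   ["t1", "t1", "t2", "t2", "t3", "t3"],
    "country": ["X", "Y", "X", "Y", "X", "Y"],
})
v_artist_track = cramers_v(toy["artist"], toy["track"])
v_artist_country = cramers_v(toy["artist"], toy["country"])
print(v_artist_track, v_artist_country)
```

On the real data, `cramers_v(df["artist"], df["track"])` would give an association measure that does not depend on how the categories happen to be numbered.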
