# A/B Testing Web Analytics - Data Analysis

This notebook walks through a data analysis assignment based on web analytics from a library website A/B test. It is divided into two parts:

**Part 1: Exploratory Data Analysis (EDA)**  
We explore the dataset to compute measures of central tendency and identify any outliers.

**Part 2: Analysis of Testing Results**  
We analyze five page variations (Control + 4 variations) to determine which generates the most user engagement.

## Part 1: Exploratory Data Analysis

### Dataset Description
We have two main data sources:
- **Google Analytics data**: expected to contain page-level metrics such as time on page, bounce rate, exit rate, and pageviews.
- **CrazyEgg clickthrough data**: contains heatmap click behavior.

We begin by loading the datasets.

In [None]:
import pandas as pd

# Load datasets (update the file paths as necessary)
google_df = pd.read_csv('google_analytics_data.csv')
crazyegg_df = pd.read_csv('crazyegg_click_data.csv')

# Display the first few rows
google_df.head()
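
Before computing anything, it can help to confirm that the loaded file has the columns the later cells rely on. The following is a minimal sketch under stated assumptions: the in-memory CSV is a made-up stand-in for `google_analytics_data.csv`, and the column names (`date`, `pageviews`, `avg_time_on_page`, and so on) are assumed from the way they are used later in this notebook, not confirmed from the real export.

```python
import io

import pandas as pd

# Hypothetical stand-in for google_analytics_data.csv.
# Column names and values are assumptions for illustration only.
sample_csv = io.StringIO(
    "date,pageviews,avg_time_on_page,bounce_rate,exit_rate\n"
    "2024-01-01,120,45.0,0.40,0.35\n"
    "2024-01-02,98,52.5,0.38,0.33\n"
)

# parse_dates converts the 'date' column to datetime64 on load
sample_df = pd.read_csv(sample_csv, parse_dates=["date"])

# Fail early if a column used later in the analysis is absent
required = {"date", "pageviews", "avg_time_on_page"}
missing = required - set(sample_df.columns)
if missing:
    raise KeyError(f"Missing expected columns: {sorted(missing)}")

# Quick look at column types and missing values
print(sample_df.dtypes)
print(sample_df.isna().sum())
```

If the real export uses different headers, adjust the `required` set and the column names in the later cells to match.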

### Questions
1. What is the median of the average time on page?
2. What is the average total daily page views?
3. Are there any outliers?

In [None]:
# Median of average time on page
median_time = google_df['avg_time_on_page'].median()

# Average total daily page views
daily_page_views = google_df.groupby('date')['pageviews'].sum().mean()

# Outlier detection using IQR
Q1 = google_df['avg_time_on_page'].quantile(0.25)
Q3 = google_df['avg_time_on_page'].quantile(0.75)
IQR = Q3 - Q1
outliers = google_df[(google_df['avg_time_on_page'] < Q1 - 1.5 * IQR) |
                     (google_df['avg_time_on_page'] > Q3 + 1.5 * IQR)]

median_time, daily_page_views, outliers.shape[0]
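
To make the IQR rule used above concrete, here is a small worked example on made-up values (seconds on page). The numbers are purely illustrative and are not taken from the dataset. It also shows why the median is a more robust summary than the mean when an outlier is present.

```python
import pandas as pd

# Made-up times in seconds, for illustration only
times = pd.Series([30, 32, 35, 36, 38, 40, 41, 300])

# Quartiles use pandas' default linear interpolation
q1 = times.quantile(0.25)
q3 = times.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
flagged = times[(times < lower) | (times > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"bounds=({lower}, {upper})")
print(f"flagged={flagged.tolist()}")

# The single extreme value pulls the mean far more than the median
print(f"median={times.median()}, mean={times.mean()}")
```

The 1.5 multiplier is a convention rather than a fixed rule; a flagged value is a candidate for inspection, not automatically an error.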

## Part 2: A/B Test Analysis

We evaluate 5 webpage variations:
- Control ("Interact")
- Variation 1 ("Connect")
- Variation 2 ("Learn")
- Variation 3 ("Help")
- Variation 4 ("Services")

The goal is to determine which variation produces the **best user engagement**, defined here as a higher click-through rate together with lower bounce and exit rates.

In [None]:
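# Assumed columns: 'experiment', 'clicks', 'pageviews', 'bounce_rate', 'exit_rate'

# Compute click-through rate (CTR) per row. Rows with zero pageviews are
# masked to NaN first so the division cannot produce inf, and mean() skips them.
valid_pageviews = crazyegg_df['pageviews'].where(crazyegg_df['pageviews'] > 0)
crazyegg_df['CTR'] = crazyegg_df['clicks'] / valid_pageviews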

# Group by experiment variation
results = crazyegg_df.groupby('experiment').agg({
    'CTR': 'mean',
    'bounce_rate': 'mean',
    'exit_rate': 'mean'
}).reset_index()

results
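
Averaging per-row CTRs gives every row equal weight, regardless of how many pageviews it represents. If the data contain several rows per variation, a pooled CTR (total clicks over total pageviews) may be a better summary. The sketch below contrasts the two on made-up numbers; the variation labels `A` and `B` and all counts are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the assumptions above
demo = pd.DataFrame({
    "experiment": ["A", "A", "B", "B"],
    "clicks": [1, 30, 5, 20],
    "pageviews": [10, 1000, 100, 400],
})

# Mean of per-row CTRs: each row counts equally
demo["CTR"] = demo["clicks"] / demo["pageviews"]
mean_of_ratios = demo.groupby("experiment")["CTR"].mean()

# Pooled CTR: total clicks divided by total pageviews per variation
totals = demo.groupby("experiment")[["clicks", "pageviews"]].sum()
pooled_ctr = totals["clicks"] / totals["pageviews"]

comparison = pd.DataFrame({
    "mean_of_ratios": mean_of_ratios,
    "pooled": pooled_ctr,
})
print(comparison)
```

In this made-up case the two summaries rank the variations differently, because one low-traffic row for `A` has an unusually high CTR. Which summary is appropriate depends on how the export is structured.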

### Visualization of Metrics

We use bar charts to represent:
- Click-Through Rate (CTR)
- Bounce Rate
- Exit Rate

These visualizations help us quickly compare engagement levels across the five page variations.

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))

# CTR Plot
plt.subplot(1, 3, 1)
plt.bar(results['experiment'], results['CTR'], color='green')
plt.title('Click-Through Rate')
plt.xticks(rotation=45)

# Bounce Rate Plot
plt.subplot(1, 3, 2)
plt.bar(results['experiment'], results['bounce_rate'], color='orange')
plt.title('Bounce Rate')
plt.xticks(rotation=45)

# Exit Rate Plot
plt.subplot(1, 3, 3)
plt.bar(results['experiment'], results['exit_rate'], color='red')
plt.title('Exit Rate')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## 📌 Recommendation and Summary

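The recommendation follows from the A/B test results:
- The variation with the **highest CTR** and the **lowest bounce and exit rates** is considered the most engaging.
- If, for example, "Variation 3 - Help" shows the highest CTR with bounce and exit rates no worse than the Control, we would recommend implementing that variation. This example is a placeholder and should be confirmed against the computed `results` table.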
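
A higher CTR alone does not show that a variation is better than the Control, since the gap could be due to chance. One common check, which is not part of the original analysis, is a two-proportion z-test on the click counts. The sketch below uses only the standard library; the function name `two_proportion_z_test` and all counts are hypothetical.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference between two click-through rates."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    # Pooled proportion under the null hypothesis of equal rates
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: Control vs. one variation
z, p_value = two_proportion_z_test(50, 1000, 80, 1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

Because four variations are each compared against the Control, a stricter significance threshold (for example, a Bonferroni-adjusted one) may be appropriate.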

**Why these visualizations?**  
Bar charts allow quick and intuitive comparison across categories. Since we compare five discrete versions of a webpage on one metric at a time, this format is well suited to the task.

**Metric derivation:**  
CTR = clicks / pageviews. Bounce and exit rates are taken directly from the source data, and averaged per variation when the data contain multiple (for example, daily) rows.
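
When rates are averaged over daily rows, a simple mean treats a low-traffic day the same as a high-traffic day. A traffic-weighted mean is one alternative. The sketch below uses made-up values and weights by `pageviews` as an assumption; the appropriate weight depends on how each rate is defined in the export.

```python
import pandas as pd

# Made-up daily rows for a single variation, for illustration only
daily = pd.DataFrame({
    "pageviews": [100, 300],
    "bounce_rate": [0.50, 0.30],
})

# Simple mean: both days count equally
simple_mean = daily["bounce_rate"].mean()

# Weighted mean: each day's rate counts in proportion to its pageviews
weighted_mean = (
    (daily["bounce_rate"] * daily["pageviews"]).sum()
    / daily["pageviews"].sum()
)

print(f"simple mean = {simple_mean:.3f}")
print(f"weighted mean = {weighted_mean:.3f}")
```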