## Always explore your data

We have three files from a simulated A/B test that measures clicks and views:
- `sim_ab_test_assignment.csv`
- `sim_ab_test_clicks.csv`
- `sim_ab_test_views.csv`

### Task 1

Load the files into sensible variable names using `numpy`.

In [124]:
import numpy as np

In [125]:
clicks = np.loadtxt("./sim_ab_test_clicks.csv", delimiter=",")
views = np.loadtxt("./sim_ab_test_views.csv", delimiter=",")
groups = np.loadtxt("./sim_ab_test_assignment.csv", delimiter=",", dtype=str)
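
`np.loadtxt` also accepts file-like objects, so the loading pattern can be tried without the CSVs. The sketch below uses made-up in-memory text (not the real files) to show why the assignment file is read with `dtype=str`: its values are labels, not numbers.

```python
import io
import numpy as np

# Made-up stand-ins for the CSV contents; the real files are not reproduced here.
numeric_csv = io.StringIO("0,1,0\n2,0,1\n")
labels_csv = io.StringIO("True\nFalse\n")

# Numeric data parses to floats by default.
demo_counts = np.loadtxt(numeric_csv, delimiter=",")
# Text labels need dtype=str, otherwise loadtxt tries to parse them as floats.
demo_labels = np.loadtxt(labels_csv, delimiter=",", dtype=str)

print(demo_counts.shape, demo_counts.dtype)
print(demo_labels)
```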

### Task 2

Examine the data:
- What are the dimensions of each CSV?
  - The shapes should raise some questions. What do you think the rows and columns represent?
- Can we look at some sample values in the data?
  - Try looking at the first few rows and columns
  - Try looking at the last few rows and columns

In [127]:
print(clicks.shape)
print(views.shape)
print(groups.shape)

(1000, 14)
(1000, 14)
(1000,)


In [128]:
clicks[:3,:5]

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.]])
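
Task 2 also asks for the last few rows and columns. A minimal sketch of that slice with negative indices follows; the small `demo` array is made up for illustration and stands in for `clicks` or `views`.

```python
import numpy as np

# Hypothetical stand-in for `clicks`: 6 rows by 5 columns, values 0..29.
demo = np.arange(30, dtype=float).reshape(6, 5)

# Negative indices count from the end: last 3 rows, last 2 columns.
tail = demo[-3:, -2:]
print(tail)
```

The same pattern applied to the real array would be `clicks[-3:, -5:]`.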

### Task 3

It's important to track a few summary statistics that will reveal bad data, such as missing or negative values.
- Calculate some of these statistics.
- Use `print()` to print a human-readable message that includes them.

In [129]:
clicks.sum() / clicks.shape[1]

266.2142857142857

In [130]:
is_neg_clicks = clicks < 0
print(is_neg_clicks.shape)
print(is_neg_clicks[:4,:4])

(1000, 14)
[[False False False False]
 [False False False False]
 [False False False False]
 [False False False False]]


In [131]:
np.array([np.nan,1,2,0]).sum()

nan
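
The checks above can be combined into one readable message. This is a sketch under assumptions: the helper name `summarize` and the demo array are invented for illustration, and the choice of statistics (total, NaN count, negative count) is only one reasonable option.

```python
import numpy as np

def summarize(name, arr):
    """Return a one-line, human-readable health summary for an array."""
    n_nan = int(np.isnan(arr).sum())   # missing values
    n_neg = int((arr < 0).sum())       # counts should never be negative
    total = np.nansum(arr)             # nansum skips NaN instead of returning nan
    return f"{name}: shape={arr.shape}, total={total:.0f}, NaN={n_nan}, negative={n_neg}"

# Made-up data containing one NaN and one negative value.
demo_clicks = np.array([[0.0, 1.0, np.nan], [2.0, -1.0, 0.0]])
message = summarize("clicks", demo_clicks)
print(message)
```

On the real data the call would be `print(summarize("clicks", clicks))`, and likewise for `views`.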

### Task 4

What mathematical notation would you use to describe our data?
Use that notation to express the following (no code):
- Total clicks over all users / total views over all users
- Average click-through rate (CTR)

Then write the code for both calculations (don't use a for-loop!).
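
One possible notation, assuming rows index users and columns index time periods (the data does not say what the 14 columns represent): let $c_{ij}$ and $v_{ij}$ be the clicks and views of user $i$ in period $j$, with $n$ users and $m$ periods. The two metrics can then be written as:

$$
\text{CTR}_{\text{total}} = \frac{\sum_{i=1}^{n}\sum_{j=1}^{m} c_{ij}}{\sum_{i=1}^{n}\sum_{j=1}^{m} v_{ij}},
\qquad
\text{CTR}_{\text{avg}} = \frac{1}{n}\sum_{i=1}^{n}\frac{\sum_{j=1}^{m} c_{ij}}{\sum_{j=1}^{m} v_{ij}}
$$

The first pools all users before dividing; the second divides per user and then averages, so every user carries equal weight regardless of how many views they have.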

1. Total over total

In [132]:
clicks.sum() / views.sum()

0.05291929346282728

2. Average CTR

In [133]:
def get_avg_ctr(clicks, views):
    # Per-user totals: sum across the columns (axis=1) of each row.
    user_clicks = clicks.sum(axis=1)
    user_views = views.sum(axis=1)
    # Per-user click-through rate, then the mean over all users.
    ctrs = user_clicks / user_views
    return ctrs.mean()
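
To see how the two metrics can diverge, here is a small sketch on made-up data. The arrays and the helper names `total_ctr` and `avg_ctr` are illustrative and not part of the dataset.

```python
import numpy as np

def total_ctr(clicks, views):
    # Pool everything: all clicks divided by all views.
    return clicks.sum() / views.sum()

def avg_ctr(clicks, views):
    # CTR per user (row), then the mean over users.
    return (clicks.sum(axis=1) / views.sum(axis=1)).mean()

# Two hypothetical users: a light user with a high CTR, a heavy user with a low CTR.
demo_clicks = np.array([[1.0, 1.0], [0.0, 10.0]])
demo_views = np.array([[2.0, 2.0], [50.0, 50.0]])

demo_total = total_ctr(demo_clicks, demo_views)
demo_avg = avg_ctr(demo_clicks, demo_views)
print(f"total/total CTR: {demo_total:.4f}")
print(f"average CTR:     {demo_avg:.4f}")
```

The pooled metric is dominated by the heavy user, while the average treats both users equally. Note that a user with zero total views would produce a division by zero in the per-user step, so that case may need handling on real data.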

In [134]:
user_views = views.sum(axis=1)  # per-user total views

### Task 5

Let's compare the 2 metrics:
- Remove one of the most active members from the data and re-calculate both metrics. 
- How would you compare the metric before/after the removal? Calculate something and `print()` it out.
- Which metric would you recommend?

In [135]:
is_most_active = user_views == np.max(user_views)
is_most_active.sum()

1

In [136]:
not_most_active = np.logical_not(is_most_active)

In [137]:
chill_views = views[not_most_active, :]
chill_views.shape
chill_clicks = clicks[not_most_active, :]
chill_clicks.shape

(999, 14)
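
One way to finish the comparison is to compute both metrics before and after the removal and print the relative change of each. The sketch below does this on made-up data; the arrays and the helpers `total_ctr`, `avg_ctr`, and `rel_change` are assumptions for illustration, and the numbers say nothing about the real dataset.

```python
import numpy as np

def total_ctr(clicks, views):
    return clicks.sum() / views.sum()

def avg_ctr(clicks, views):
    return (clicks.sum(axis=1) / views.sum(axis=1)).mean()

def rel_change(before, after):
    # Relative change of a metric, e.g. -0.05 means a 5% drop.
    return (after - before) / before

# Hypothetical data: 4 users, 2 periods; the last user is by far the most active.
demo_views = np.array([[10.0, 10.0], [20.0, 20.0], [5.0, 5.0], [100.0, 100.0]])
demo_clicks = np.array([[1.0, 0.0], [2.0, 2.0], [0.0, 1.0], [10.0, 8.0]])

# Boolean mask that drops the user with the largest total views.
user_views = demo_views.sum(axis=1)
keep = user_views != user_views.max()

total_before = total_ctr(demo_clicks, demo_views)
avg_before = avg_ctr(demo_clicks, demo_views)
total_after = total_ctr(demo_clicks[keep], demo_views[keep])
avg_after = avg_ctr(demo_clicks[keep], demo_views[keep])

print(f"total/total: {total_before:.4f} -> {total_after:.4f} "
      f"({rel_change(total_before, total_after):+.1%})")
print(f"average CTR: {avg_before:.4f} -> {avg_after:.4f} "
      f"({rel_change(avg_before, avg_after):+.1%})")
```

On the real data, the same calls would take `clicks`, `views`, `chill_clicks`, and `chill_views`.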

### Task 6

Calculate the recommended metric for each of the treatment groups.

In [138]:
is_treat = groups == "True"

In [139]:
clicks_treat = clicks[is_treat, :]
views_treat = views[is_treat, :]

clicks_control = clicks[np.logical_not(is_treat), :]
views_control = views[np.logical_not(is_treat), :]

In [140]:
clicks_treat.sum() / views_treat.sum()

0.10317675807765408

In [141]:
clicks_control.sum() / views_control.sum()

0.05014607835792943
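
The per-group calculation can also be written once and applied to each mask. The sketch below uses made-up arrays and assumes, as the code above does, that the assignment labels are the strings `"True"` and `"False"`; `np.unique` is a quick way to confirm which labels are present before filtering on them.

```python
import numpy as np

# Hypothetical data: 4 users, 2 periods, with string group labels.
demo_groups = np.array(["True", "False", "True", "False"])
demo_clicks = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, 0.0], [1.0, 0.0]])
demo_views = np.array([[10.0, 10.0], [10.0, 10.0], [20.0, 0.0], [10.0, 10.0]])

# Sanity check: which labels exist, and how many users have each?
labels, counts = np.unique(demo_groups, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))

is_treat = demo_groups == "True"
ctr_by_group = {}
for name, mask in (("treatment", is_treat), ("control", ~is_treat)):
    # Total clicks over total views within the group.
    ctr_by_group[name] = demo_clicks[mask].sum() / demo_views[mask].sum()
    print(f"{name}: CTR = {ctr_by_group[name]:.4f} over {int(mask.sum())} users")
```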