# Clustering and Non-Compliance in Experimental Data

#### Clustering

In the context of experimental design, clustering refers to assigning treatment to whole groups (clusters) of units, such as stores, schools, or villages, rather than to individual units. Clustered randomization is often the only practical option when a treatment is delivered at the group level or when individuals within a group could influence one another. The cost is statistical: outcomes of units in the same cluster tend to be correlated, so the effective sample size can be much smaller than the number of individuals, and standard errors that ignore the clustering are too small.

#### Data Context and Purpose

This dataset comes from a hypothetical Randomized Controlled Trial (RCT) that uses clustered randomization. The company has 200 stores, which were randomly assigned to two groups: 100 treatment stores and 100 control stores.

In the treatment group, pictures of individuals using the products sold in the store were displayed as in-store advertisements. The objective of the study is to examine whether these in-store advertisements have an impact on the amount of purchases made by customers.

Data were collected from the first 1,000 individuals who entered each store after the launch of the experiment, for a total of 200,000 observations. For each of these customers, the amount purchased at the store was recorded.

In [37]:
# Import The Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [38]:
# Read the data
df = pd.read_csv('cluster.csv')

In [39]:
# Preview the data
df.head()

,treatment,store_id,sale,customer_id
0,1,1,95.496437,1
1,1,1,66.05352,2
2,1,1,82.490273,3
3,1,1,81.939018,4
4,1,1,77.757187,5


In [40]:
# Check the data types
df.dtypes

treatment        int64
store_id         int64
sale           float64
customer_id      int64
dtype: object

In [41]:
# Check the shape of the data
df.shape

(200000, 4)

In [42]:
# Check the averages by treatment
df.groupby('treatment').mean()

treatment,store_id,sale,customer_id
0,98.11,75.148295,500.5
1,102.89,78.748684,500.5


In [43]:
# Count the number of records in each treatment group
df.groupby('treatment').size()

treatment
0    100000
1    100000
dtype: int64
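The row counts are balanced, but because randomization happened by store, the number of stores per arm matters more for inference than the number of rows. A minimal sketch of a sanity check follows, assuming a frame with the same column names as the dataset; the `toy` frame and its values are made up for illustration and are not taken from `cluster.csv`.

```python
import pandas as pd

# Toy frame with the same column names as the dataset (values are made up).
toy = pd.DataFrame({
    "store_id":  [1, 1, 2, 2, 3, 3],
    "treatment": [1, 1, 0, 0, 1, 1],
    "sale":      [95.5, 66.1, 70.2, 81.0, 77.8, 90.3],
})

# Number of distinct treatment values observed within each store.
arms_per_store = toy.groupby("store_id")["treatment"].nunique()

# True when every store sits entirely in one arm, i.e. assignment is by store.
assigned_by_store = bool((arms_per_store == 1).all())

# Number of stores (clusters) in each arm.
stores_per_arm = toy.groupby("treatment")["store_id"].nunique()

print(assigned_by_store)
print(stores_per_arm)
```

Running the same two `groupby` calls on `df` would confirm whether each store belongs to exactly one arm and how many stores each arm contains.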

As a first, naive pass, we calculate the average treatment effect (ATE) on sales and its 95% confidence interval (CI) while treating each of the 200,000 observations as independent. This assumption ignores the fact that treatment was assigned by store, and it is revisited with store-level data later in the analysis.

The ATE is the average difference in sales between the treatment group (stores with in-store advertisements) and the control group (stores without them).

To compute the ATE, we subtract the average sale in the control group from the average sale in the treatment group. The standard error is that of a difference in two independent means, and the 95% CI is the estimate plus or minus 1.96 standard errors.
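Written out, the calculation in the next cell corresponds to the formulas below. The notation is introduced here for illustration: $\bar{Y}_1$ and $\bar{Y}_0$ are the mean sales in the treatment and control groups, $s_1^2$ and $s_0^2$ the sample variances, and $n_1$ and $n_0$ the numbers of observations.

```latex
\widehat{\text{ATE}} = \bar{Y}_1 - \bar{Y}_0, \qquad
\widehat{\text{SE}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_0^2}{n_0}}, \qquad
\text{95\% CI} = \widehat{\text{ATE}} \pm 1.96\,\widehat{\text{SE}}
```

The standard error formula is valid only when the observations within each group are independent of one another.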

In [44]:
# Naive ATE on sale: difference in means, treating every customer as independent
treated = df.loc[df['treatment'] == 1, 'sale']
control = df.loc[df['treatment'] == 0, 'sale']
ATE = treated.mean() - control.mean()

# Calculate the standard error of the ATE (difference in means, unequal variances)
ATE_SE = np.sqrt(treated.var() / treated.count() + control.var() / control.count())

# Calculate the 95% confidence interval (normal approximation)
ATE_CI = (ATE - 1.96 * ATE_SE, ATE + 1.96 * ATE_SE)

# Print the results
print('The average treatment effect (ATE) is', ATE, 'with a 95% confidence interval of', ATE_CI)

The average treatment effect (ATE) is 3.60038876098929 with a 95% confidence interval of (3.41628632508447, 3.78449119689411)


In [45]:
ATE_SE

0.09392981423715299

Next, we aggregate the data to the store level. From the original 200,000 observations, we calculate the average sale at each of the 200 stores, that is, the total sales at a store divided by its number of customers.

The result is a new dataset with 200 observations, one per store, each containing the store's average sale and its treatment status. Because treatment was assigned by store, this puts the data at the same level as the randomization.

In [46]:
# Calculate the average sale at each store, keeping the store's treatment status
df_2 = df.groupby(['store_id', 'treatment'], as_index=False).mean()
df_2

,store_id,treatment,sale,customer_id
0,1,1,68.686135,500.5
1,2,0,85.262843,500.5
2,3,0,56.063686,500.5
3,4,1,72.525568,500.5
4,5,1,89.928989,500.5
...,...,...,...,...
195,196,0,71.414121,500.5
196,197,1,70.982562,500.5
197,198,1,87.530904,500.5
198,199,1,96.036842,500.5


We now recompute the ATE and its 95% CI using the store-level dataset of 200 observations, comparing treatment stores with control stores.

As before, the ATE is the average sale in the treatment group minus the average sale in the control group, and the 95% CI is the estimate plus or minus 1.96 standard errors. The difference is that the standard error is now based on variation across 200 store averages rather than across 200,000 customers, so it no longer relies on customers within a store being independent.

In [47]:
# Store-level ATE: difference in mean store-average sale between arms
treated_stores = df_2.loc[df_2['treatment'] == 1, 'sale']
control_stores = df_2.loc[df_2['treatment'] == 0, 'sale']
ATE = treated_stores.mean() - control_stores.mean()

# Calculate the standard error of the ATE across store averages
ATE_SE = np.sqrt(treated_stores.var() / treated_stores.count() + control_stores.var() / control_stores.count())

# Calculate the 95% confidence interval (normal approximation)
ATE_CI = (ATE - 1.96 * ATE_SE, ATE + 1.96 * ATE_SE)

# Print the results
print('The average treatment effect (ATE) is', ATE, 'with a 95% confidence interval of', ATE_CI)

The average treatment effect (ATE) is 3.6003887609905263 with a 95% confidence interval of (-0.4908495661125043, 7.691627088093557)


In [48]:
ATE_SE

2.0873664934199136
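The point estimate is unchanged at about 3.60, but the standard error rises from roughly 0.094 to 2.087, and the interval now includes zero. Customers in the same store share store-level influences, so their purchases are correlated, and 1,000 customers from one store carry far less information than 1,000 independent customers. The sketch below illustrates this on simulated data (not the `cluster.csv` data). All parameter values, the `sim` frame, and the helper `diff_in_means_se` are assumptions chosen for illustration. It compares the naive and store-level standard errors and estimates the intraclass correlation (ICC) and the design effect, $1 + (m - 1)\,\text{ICC}$, for clusters of equal size $m$.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical simulated data: 200 stores with 50 customers each.
n_stores, n_customers = 200, 50
store_id = np.repeat(np.arange(n_stores), n_customers)

# Treatment is assigned per store, so all customers in a store share it.
store_treatment = rng.permutation(np.repeat([0, 1], n_stores // 2))
store_effect = rng.normal(0, 10, n_stores)  # shock shared within a store
sale = (75 + 3 * store_treatment[store_id] + store_effect[store_id]
        + rng.normal(0, 10, n_stores * n_customers))

sim = pd.DataFrame({"store_id": store_id,
                    "treatment": store_treatment[store_id],
                    "sale": sale})


def diff_in_means_se(data: pd.DataFrame) -> float:
    """Standard error of a difference in means, treating rows as independent."""
    grouped = data.groupby("treatment")["sale"]
    return float(np.sqrt((grouped.var() / grouped.count()).sum()))


# Naive SE uses customer rows; store-level SE uses one row per store.
se_naive = diff_in_means_se(sim)
store_means = sim.groupby(["store_id", "treatment"], as_index=False)["sale"].mean()
se_store = diff_in_means_se(store_means)

# Variance components: within stores, and between stores within each arm.
within_var = sim.groupby("store_id")["sale"].var().mean()
var_of_store_means = store_means.groupby("treatment")["sale"].var().mean()
between_var = max(var_of_store_means - within_var / n_customers, 0.0)

icc = between_var / (between_var + within_var)
design_effect = 1 + (n_customers - 1) * icc

print("naive SE:", se_naive)
print("store-level SE:", se_store)
print("ICC:", icc, "design effect:", design_effect)
```

In this simulation the store-level standard error is several times the naive one, and the ratio is close to the square root of the design effect. An alternative that keeps the customer-level rows is a regression of `sale` on `treatment` with cluster-robust standard errors, for example `sm.OLS(y, X).fit(cov_type='cluster', cov_kwds={'groups': store_ids})` in statsmodels.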

#### Non-Compliance

Non-compliance, in the context of experimental design, refers to situations where participants or subjects do not adhere to the assigned treatment or fail to comply with the experimental protocol. Non-compliance can occur for various reasons, such as unwillingness to follow instructions, technical issues, or external factors influencing participant behavior. Non-compliance can introduce bias and impact the validity of experimental results. Researchers often employ strategies to address non-compliance, such as monitoring and tracking participant adherence, implementing incentives or reminders, or analyzing the data using intention-to-treat or complier average causal effect approaches. These approaches help to account for non-compliance and mitigate its potential effects on the experimental findings.
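In symbols, and as a sketch under the usual instrumental-variable assumptions rather than a full treatment: let $Z$ be the randomly assigned group, $D$ indicate whether a subject actually took up the treatment, and $Y$ be the outcome (this notation is introduced here for illustration). The intention-to-treat effect compares groups as assigned. The complier average causal effect, often called treatment-on-the-treated when the control group cannot take up the treatment, rescales it by the difference in take-up:

```latex
\text{ITT} = E[Y \mid Z = 1] - E[Y \mid Z = 0], \qquad
\text{CACE} = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[D \mid Z = 1] - E[D \mid Z = 0]}
```

The ratio can be read as a causal effect only if assignment influences the outcome solely through take-up and no one does the opposite of what their assignment encourages.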

#### Data Context and Purpose

A company offers free on-site flu shots for employees. The standard practice has been to put up flyers about the on-site clinic days and send a company-wide email encouraging everyone to get the flu shot. This year the company ran a Randomized Controlled Trial (RCT) to test a new program intended to increase the fraction of employees getting the shot. There were two treatment groups:

- Treatment 1: Employees were sent an email encouraging them to take a short online “flu awareness quiz” that asked them to guess statistics about the flu and then showed the correct answers. Taking the quiz entered the employee into a lottery for a $100 prize.
- Treatment 2: The same as Treatment 1, except that employees who had not taken the online quiz after one week received a short automated reminder call on their work phone.

The company tracked both the fraction of each treatment group that took the online quiz (referred to as the survey below) and whether each employee got the flu shot. The findings were:

- Control: 64% got the flu shot
- Treatment 1: 40% took the survey; 66% got the flu shot
- Treatment 2: 60% took the survey; 68% got the flu shot

In [49]:
# Define the variables
Control_Effect = 0.64
Treatment_1 = 0.66
Survey_1= 0.40
Treatment_2 = 0.68
Survey_2 = 0.60

What is the Intention to Treat (ITT) estimate of the treatment effect for each treatment relative to control?

In [53]:
# Calculate Intention to Treat
ITT1 = Treatment_1 - Control_Effect
ITT1

ITT2 = Treatment_2 - Control_Effect
ITT2

print('The ITT for Treatment 1 is', round(ITT1, 2), 'and the ITT for Treatment 2 is', round(ITT2, 2))

The ITT for Treatment 1 is 0.02 and the ITT for Treatment 2 is 0.04


What is the Treatment-On-The-Treated (TOT) estimate for the effect of the survey on flu shots, comparing Treatment 1 to control?

In [54]:
TOT1 = ITT1 / Survey_1
print('The TOT for Treatment 1 is', round(TOT1, 2))

The TOT for Treatment 1 is 0.05


What is the Treatment-On-The-Treated (TOT) estimate for the effect of the survey on flu shots, comparing Treatment 2 to control?

In [55]:
TOT2 = ITT2 / Survey_2
print('The TOT for Treatment 2 is', round(TOT2, 2))

The TOT for Treatment 2 is 0.07


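What is the Treatment-On-The-Treated (TOT) estimate for the effect of the survey on flu shots, comparing Treatment 2 to Treatment 1?

In [56]:
# ITT of Treatment 2 relative to Treatment 1 (not relative to control)
ITT_3 = Treatment_2 - Treatment_1
print('The ITT for Treatment 2 vs. Treatment 1 is', round(ITT_3, 2))

The ITT for Treatment 2 vs. Treatment 1 is 0.02


In [57]:
# Difference in survey take-up between Treatment 2 and Treatment 1
Survey_3 = Survey_2 - Survey_1
print('The difference in survey take-up is', round(Survey_3, 2))

The difference in survey take-up is 0.2


In [58]:
TOT3 = ITT_3 / Survey_3
print('The TOT for Treatment 2 vs. Treatment 1 is', round(TOT3, 2))

The TOT for Treatment 2 vs. Treatment 1 is 0.1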
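All three TOT questions follow the same pattern: divide the difference in flu-shot rates between two groups by the difference in survey take-up between the same two groups. A minimal sketch of that pattern as a reusable function follows. The function name `tot` and its parameter names are hypothetical, and the default take-up of zero assumes the control group had no access to the survey, which the study description implies but does not state.

```python
def tot(outcome_a: float, outcome_b: float,
        takeup_a: float, takeup_b: float = 0.0) -> float:
    """Treatment-on-the-treated estimate comparing group A to group B.

    outcome_* are flu-shot rates and takeup_* are survey take-up rates.
    takeup_b defaults to 0.0, an assumption that group B (for example,
    the control group) could not take the survey.
    """
    takeup_gap = takeup_a - takeup_b
    if takeup_gap == 0:
        raise ValueError("The groups have equal take-up, so the ratio is undefined.")
    return (outcome_a - outcome_b) / takeup_gap


# Rates reported in the study description.
control, treatment_1, treatment_2 = 0.64, 0.66, 0.68
survey_1, survey_2 = 0.40, 0.60

tot_1_vs_control = tot(treatment_1, control, survey_1)
tot_2_vs_control = tot(treatment_2, control, survey_2)
tot_2_vs_1 = tot(treatment_2, treatment_1, survey_2, survey_1)

print(round(tot_1_vs_control, 2), round(tot_2_vs_control, 2), round(tot_2_vs_1, 2))
```

When comparing two treatment arms, both the numerator and the denominator must use the same pair of groups. Mixing a treatment-versus-control outcome difference with a treatment-versus-treatment take-up difference overstates the effect.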
