You are provided with sales data from different stores under three promotional strategies. Your goal is to analyse the data and decide which promotional strategy is best suited to different stores. The sales data is provided in marketing_campaign.csv.

1. What is the median sales (SalesInThousands)?
2. Compute median sales in markets of different size (MarketSize). Markets of which size have the smallest median sales?
3. To select the best promotion strategy, the marketing team ran three promotion strategies (Promotion) for 4 weeks (week). Is there a more than 5% difference between average sales in Large markets across the 4 weeks? (In Orange, you can select rows by condition using the Select Rows operator.)
4. Which market size is the most important when selecting a promotion strategy? Compute total sales in markets of different size.
5. Using the bootstrap method, the sales team prepared 1000 samples of average sales in Medium markets (bootstrapped_means.csv). Help them estimate a two-sided 99.5% confidence interval for average sales (e.g. use `PERCENTILE` in Google Sheets).
6. Compare sales in markets where different promotions were run. Which promotion strategy showed the best results in terms of average sales?
7. Is the difference between strategies 1 and 2 statistically significant? Use a two-sided Student's t-test at the 5% significance level (e.g. select the appropriate rows in Orange, save them as a CSV, and use `TTEST` in Google Sheets with 2 tails and type 2; see the example at https://docs.google.com/spreadsheets/d/132BCqQUFpPbmxPvdZCFbPkkXVmx5iLHQQZas9dWi4cs/edit?usp=sharing).
8. Is the difference between strategies 1 and 3 statistically significant? Use a two-sided Student's t-test at the 5% significance level. Which strategy should we select?
9. Compare the sales for strategies 1 and 3 again, but now separately for each market size: Large markets with strategies 1 and 3, Medium markets with strategies 1 and 3, and Small markets with strategies 1 and 3. Are the differences between strategies 1 and 3, stratified by market size, statistically significant at the 5% level? Which strategy should we select?
10. If we had run the promotions for only 1 week instead of 4, would your answer to the previous question change? What if we now use a 10% significance level?


In [1]:
import pandas as pd
from scipy.stats import ttest_ind

In [2]:
df = pd.read_csv('WA_Marketing-Campaign.csv')

# Scale sales for selected (MarketSize, Promotion) groups.
mask = (df.MarketSize == 'Large') & (df.Promotion == 3)
df.loc[mask, 'SalesInThousands'] *= 1.1

mask = (df.MarketSize == 'Medium') & (df.Promotion == 3)
df.loc[mask, 'SalesInThousands'] *= 1.2

mask = (df.MarketSize == 'Small') & (df.Promotion == 1)
df.loc[mask, 'SalesInThousands'] *= 1.5

def get_age_group(x):
    if x <=10:
        return '<=10 years'
    else:
        return '>10 years'
    
df['StoreAge'] = df.AgeOfStore.apply(get_age_group)
df = df.drop(['MarketID', 'AgeOfStore'], axis=1)

In [3]:
df.to_csv('marketing_campaign.csv')

# 1. What is the median sales (SalesInThousands)? `53.63`

In [4]:
df.SalesInThousands.median()

53.62599999999999

In [24]:
df.SalesInThousands.mean()

57.16525912408751

In [26]:
df.query('MarketSize=="Medium"').SalesInThousands.median()

47.425

In [27]:
df.query('MarketSize=="Medium"').SalesInThousands.mean()

47.28183750000003

# 2. Compute median sales in markets of different size (MarketSize). Markets of which size have the smallest median sales? `Medium`

In [5]:
df.groupby('MarketSize').SalesInThousands.median()

MarketSize
Large     75.450
Medium    47.425
Small     61.420
Name: SalesInThousands, dtype: float64

# 3. To select the best promotion strategy, the marketing team ran three promotion strategies (Promotion) for 4 weeks (week). Is there a more than 5% difference between average sales in Large markets across the 4 weeks? (In Orange, you can select rows by condition using the Select Rows operator.) `No`

In [6]:
df.query('MarketSize=="Large"').groupby(['week']).SalesInThousands.mean()

week
1    73.070119
2    70.820690
3    72.844262
4    72.555143
Name: SalesInThousands, dtype: float64
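
One way to check the 5% threshold is to express the gap between the highest and lowest weekly mean as a percentage. The sketch below hardcodes the weekly means printed above. Measuring the gap relative to the lowest week is an assumption about how "difference" is defined, and `relative_spread_pct` is a hypothetical helper name.

```python
# Weekly mean sales in Large markets, copied from the output above.
weekly_means = {1: 73.070119, 2: 70.820690, 3: 72.844262, 4: 72.555143}


def relative_spread_pct(values):
    """Return (max - min) / min as a percentage (assumed definition of 'difference')."""
    values = list(values)
    lo, hi = min(values), max(values)
    return (hi - lo) / lo * 100


spread = relative_spread_pct(weekly_means.values())
print(f"{spread:.2f}%")  # a little over 3%, so below the 5% threshold
```

Under this definition the largest gap (week 1 versus week 2) stays below 5%, which agrees with the answer `No`.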

# 4. Which market size is the most important when selecting a promotion strategy? Compute total sales in markets of different size. `Large - 12150, Medium - 15130, Small - 4046`

In [7]:
df.groupby('MarketSize').SalesInThousands.sum()

MarketSize
Large     12150.189
Medium    15130.188
Small      4046.185
Name: SalesInThousands, dtype: float64
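
To make the totals easier to compare, they can be converted into shares of overall sales. This is a small sketch that reuses the totals printed above; the variable names are illustrative only.

```python
# Total sales per market size, copied from the groupby output above.
total_sales = {"Large": 12150.189, "Medium": 15130.188, "Small": 4046.185}

grand_total = sum(total_sales.values())
shares = {size: total / grand_total * 100 for size, total in total_sales.items()}

# Print market sizes from largest to smallest share of total sales.
for size, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{size}: {share:.1f}%")
```

Medium markets account for close to half of total sales even though their average sales per record are the lowest, so a promotion that works well there moves the most revenue.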

In [28]:
df.groupby('MarketSize').SalesInThousands.mean()

MarketSize
Large     72.322554
Medium    47.281838
Small     67.436417
Name: SalesInThousands, dtype: float64

# 5. Using the bootstrap method, the sales team prepared 1000 samples of average sales in Medium markets. Help them estimate a two-sided 99.5% confidence interval for average sales (e.g. use `PERCENTILE` in Google Sheets). `[45.48, 49.02]`

In [6]:
import numpy as np

In [9]:
def bootstrap(data, random_state=10):
    random = np.random.RandomState(random_state)
    n = data.shape[0]
    idx = random.randint(0, n, n)
    return data[idx]

In [10]:
x = df.query('MarketSize=="Medium"').SalesInThousands.values

In [11]:
means = []

for i in range(1000):
    sample = bootstrap(x, i)
    means.append(np.mean(sample))
means = np.array(means)

In [12]:
df_average_sales = pd.DataFrame()
df_average_sales['AverageSales'] = means
df_average_sales['BootstrapID'] = np.arange(1000)
df_average_sales.to_csv('bootstrapped_means.csv', index=False)

In [2]:
# Read the file written in the previous cell (relative path, so the notebook runs on other machines).
d = pd.read_csv('bootstrapped_means.csv')

In [7]:
# Two-sided 99.5% interval: cut 0.25% from each tail.
np.quantile(d.AverageSales.values, 0.0025), np.quantile(d.AverageSales.values, 0.9975)

(45.4754770625, 49.023133390625)

In [8]:
0.9975-0.0025

0.9950000000000001

In [34]:
# Two-sided 90% interval: cut 5% from each tail.
np.quantile(means, 0.05), np.quantile(means, 0.95)

(46.2255296875, 48.298515312499994)
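
The percentile method used above can be written as one helper that turns a confidence level into its two tail quantiles. The sketch below is stdlib-only and runs on toy data, not on the bootstrapped means. `quantile_linear` and `percentile_ci` are hypothetical names, and the interpolation rule is the linear one that `np.quantile` uses by default.

```python
import math


def quantile_linear(sorted_values, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    pos = q * (len(sorted_values) - 1)
    lower = math.floor(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def percentile_ci(samples, level):
    """Two-sided percentile interval: cut (1 - level) / 2 from each tail."""
    tail = (1 - level) / 2
    ordered = sorted(samples)
    return quantile_linear(ordered, tail), quantile_linear(ordered, 1 - tail)


# Toy data (the integers 0..100), used only to illustrate the tail arithmetic.
toy = list(range(101))
low, high = percentile_ci(toy, 0.995)
print(low, high)
```

For a 99.5% level the helper cuts 0.25% from each tail, which is where the 0.0025 and 0.9975 quantiles in the cell above come from.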

# 6. Compare sales in markets where different promotions were run. Which promotion strategy showed the best results in terms of average sales? `3`

In [15]:
df.groupby(['Promotion']).SalesInThousands.mean()

Promotion
1    61.596831
2    47.329415
3    62.946686
Name: SalesInThousands, dtype: float64

# 7. Is the difference between strategies 1 and 2 statistically significant? Use a two-sided Student's t-test at the 5% significance level (e.g. select the appropriate rows in Orange, save them as a CSV, and use TTEST in Google Sheets with 2 tails and type 2; see the example at https://docs.google.com/spreadsheets/d/132BCqQUFpPbmxPvdZCFbPkkXVmx5iLHQQZas9dWi4cs/edit?usp=sharing). `Yes`

In [16]:
ttest_ind(df.query('Promotion==1').SalesInThousands, 
          df.query('Promotion==2').SalesInThousands)

Ttest_indResult(statistic=7.761907735013977, pvalue=8.85304097461179e-14)
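
With its default `equal_var=True`, `ttest_ind` performs the pooled-variance Student's t-test, which is also what type 2 of `TTEST` in Google Sheets computes. As a rough illustration of what the statistic measures, the sketch below computes it by hand on toy samples (not the campaign data). `pooled_t_statistic` is a hypothetical helper, and the p-value step is left out because it needs the t distribution.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's t statistic with pooled variance; returns (t, degrees of freedom)."""
    n1, n2 = len(a), len(b)
    dof = n1 + n2 - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / dof
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / std_err, dof


# Toy samples, chosen so the arithmetic is easy to follow by hand.
t_stat, dof = pooled_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t_stat, dof)
```

The statistic is the difference in means (first sample minus second) divided by its standard error, so a positive value means the first group has the higher average.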

# 8. Is the difference between strategies 1 and 3 statistically significant? Use a two-sided Student's t-test at the 5% significance level. Which strategy should we select?

In [17]:
ttest_ind(df.query('Promotion==1').SalesInThousands, 
          df.query('Promotion==3').SalesInThousands)

Ttest_indResult(statistic=-0.6963281649703463, pvalue=0.48667517588801124)

# 9. Compare the sales for strategies 1 and 3 again, but now separately for each market size: Large markets with strategies 1 and 3, Medium markets with strategies 1 and 3, and Small markets with strategies 1 and 3. Are the differences between strategies 1 and 3, stratified by market size, statistically significant at the 5% level? Which strategy should we select?

In [18]:
ttest_ind(df.query('Promotion==1').query('MarketSize=="Small"').SalesInThousands, 
          df.query('Promotion==3').query('MarketSize=="Small"').SalesInThousands)

Ttest_indResult(statistic=15.729103026221376, pvalue=3.28652796016919e-19)

In [19]:
ttest_ind(df.query('Promotion==1').query('MarketSize=="Medium"').SalesInThousands, 
          df.query('Promotion==3').query('MarketSize=="Medium"').SalesInThousands)

Ttest_indResult(statistic=-5.546126944655506, pvalue=8.72691945185124e-08)

In [20]:
ttest_ind(df.query('Promotion==1').query('MarketSize=="Large"').SalesInThousands, 
          df.query('Promotion==3').query('MarketSize=="Large"').SalesInThousands)

Ttest_indResult(statistic=-3.146014761762151, pvalue=0.0021694397502372495)

# 10. If we had run the promotions for only 1 week instead of 4, would your answer to the previous question change? What if we now use a 0.5% significance level? Select all correct options.

In [21]:
ttest_ind(df.query('Promotion==1').query('week==1').query('MarketSize=="Small"').SalesInThousands, 
          df.query('Promotion==3').query('week==1').query('MarketSize=="Small"').SalesInThousands)

Ttest_indResult(statistic=7.177589277766924, pvalue=5.208042684076091e-05)

In [22]:
ttest_ind(df.query('Promotion==1').query('week==1').query('MarketSize=="Medium"').SalesInThousands, 
          df.query('Promotion==3').query('week==1').query('MarketSize=="Medium"').SalesInThousands)

Ttest_indResult(statistic=-2.434484067162056, pvalue=0.01845222033396582)

In [23]:
ttest_ind(df.query('Promotion==1').query('week==1').query('MarketSize=="Large"').SalesInThousands, 
          df.query('Promotion==3').query('week==1').query('MarketSize=="Large"').SalesInThousands)

Ttest_indResult(statistic=-1.7931023723764705, pvalue=0.08556982681378379)
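
To see how the conclusion depends on the chosen threshold, the week-1 p-values can be compared against several significance levels at once. The sketch below copies the p-values from the three outputs above. `significant_markets` is a hypothetical helper, and the set of thresholds is an assumption chosen for illustration.

```python
# p-values copied from the week-1 t-test outputs above (Promotion 1 vs Promotion 3).
week1_p_values = {
    "Small": 5.208042684076091e-05,
    "Medium": 0.01845222033396582,
    "Large": 0.08556982681378379,
}


def significant_markets(p_values, alpha):
    """Return the market sizes whose p-value falls below the significance level alpha."""
    return [size for size, p in p_values.items() if p < alpha]


for alpha in (0.005, 0.05, 0.10):
    print(alpha, significant_markets(week1_p_values, alpha))
```

With only one week of data, the Large-market difference is no longer significant at 5% but is at 10%, while only the Small-market difference survives a 0.5% threshold. The sign of each t statistic shows the direction: positive for Small markets (promotion 1 has the higher mean), negative for Medium and Large markets (promotion 3 has the higher mean).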