In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats



### Here's what I did to clean the data:
- deleted any incomplete survey results/previews
- deleted any results with self-reported technical difficulties
- deleted willingness-to-pay responses (Q5.5, Q5.6) that were more than 3 standard deviations above the mean
- deleted responses from anyone who failed the attention check


**I double checked and I have the same number of records as the original cleaned survey data.**
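The outlier rule above can be sketched in pandas. This is only an illustration on made-up toy data, not the exact code used for the cleaning: the helper name `drop_high_outliers` is hypothetical, and it assumes the cutoff is computed once per column using the sample standard deviation.

```python
import pandas as pd

def drop_high_outliers(df, columns, n_sd=3.0):
    """Drop rows where any listed column exceeds its mean + n_sd standard deviations."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        cutoff = df[col].mean() + n_sd * df[col].std()
        # Missing values are kept; only values above the cutoff are dropped.
        keep &= ~(df[col] > cutoff)
    return df.loc[keep]

# Toy data (made up): 19 plausible answers and one extreme value.
toy = pd.DataFrame({"Q5.6": [50.0] * 19 + [5000.0]})
cleaned = drop_high_outliers(toy, ["Q5.6"])
print(len(toy), len(cleaned))
```

Note that a single extreme value inflates both the mean and the standard deviation, so with very small samples this rule may not flag anything at all.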

In [8]:
survey_df = pd.read_csv('one_header_qualtrics_survey.csv')


display(survey_df)

Unnamed: 0,Progress,Duration (in seconds),Finished,Q51,Q2.1,Q2.2,Q2.2_5_TEXT,Q2.3_1,Q4.5,Q4.6,...,Q7.4.1,Q7.3.1,Q53.1,Q8.1.1,Q8.2.1,Q8.4.1,Q8.5.1,Q8.5_3_TEXT.1,Q8.7.1,Q8.8.1
0,100,93,1,1,3.0,13,,7.0,5,4,...,Yes,2 days,I prefer Dwayne,Bachelor's degree in college (4-year),"$70,000 to $79,999",No,Male,,No,
1,100,137,1,1,4.0,1423,,3.0,6,4,...,No,,I prefer Gordon,Master's degree,"$20,000 to $29,999",No,Female,,No,
2,100,106,1,1,4.0,43,,4.0,3,4,...,Yes,3 days,I prefer Gordon,Bachelor's degree in college (4-year),"$40,000 to $49,999",No,Female,,No,
3,100,116,1,1,3.0,123,,5.0,4,6,...,Yes,2 days,I prefer Dwayne,Master's degree,"$60,000 to $69,999",No,Male,,No,no
4,100,142,1,1,5.0,3,,9.0,7,7,...,Yes,2 days,I prefer Gordon,Associate degree in college (2-year),"$80,000 to $89,999",Yes,Female,,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
147,100,415,1,1,3.0,1423,,7.0,7,7,...,Yes,2 days,I prefer Gordon,Bachelor's degree in college (4-year),"$80,000 to $89,999",Yes,Male,,No,
148,100,175,1,1,2.0,1423,,6.0,5,6,...,Yes,2 days,I prefer Gordon,Some college but no degree,"$90,000 to $99,999",Yes,Male,,No,
149,100,122,1,1,4.0,3,,8.0,6,6,...,Yes,3 days,I prefer Dwayne,Bachelor's degree in college (4-year),"$40,000 to $49,999",Yes,Female,,No,
150,100,218,1,1,5.0,143,,8.0,7,7,...,Yes,4 or more days,I prefer Gordon,Bachelor's degree in college (4-year),"$90,000 to $99,999",Yes,Male,,No,Thank you!


I'll rename all of the columns, since the '.' in the question names causes a syntax error.

In [17]:
# Replace '.' with '_' in every column name (e.g. 'Q5.6' becomes 'Q5_6').
new_columns = [name.replace('.', '_') for name in survey_df.columns]

survey_df.columns = new_columns

Here I wanted to compare the means for question 5.6: "What is the most that you would be willing to pay for this air fryer? (Enter in X.XX format)". An independent two-sample t-test tells us whether the mean for the Philips-logo group differs significantly from the mean for the no-logo group, using a 95% confidence level (a significance threshold of 0.05).

In [19]:
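# Willingness-to-pay responses (Q5_6) for the Philips-logo condition
logo_philips = survey_df.loc[survey_df['Factor1'] == 'Logo=Philips']
lp_q56 = logo_philips['Q5_6']
# ... and for the no-logo condition
logo_none = survey_df.loc[survey_df['Factor1'] == 'Logo=None']
ln_q56 = logo_none['Q5_6']

# Independent two-sample t-test (scipy assumes equal variances by default)
stats.ttest_ind(lp_q56, ln_q56)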


Ttest_indResult(statistic=1.9386763915866523, pvalue=0.0544182178811881)

Here we can see that the p-value (about 0.054) is just slightly higher than our threshold of 0.05. If it were below 0.05, we could call the difference **statistically significant**; since it falls between 0.05 and 0.1, the difference between these means is at best **marginally significant**, and we cannot reject the null hypothesis of equal means at the 5% level.
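Because the result sits so close to the threshold, it may be worth checking how sensitive it is to the equal-variance assumption. The sketch below runs both the default Student's t-test and Welch's t-test (`equal_var=False`) on synthetic data; the helper name `compare_means` and the generated samples are assumptions for illustration, not the survey data.

```python
import numpy as np
from scipy import stats

def compare_means(a, b):
    """Return (Student, Welch) t-test results for two independent samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Drop missing values so they do not turn the result into NaN.
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    student = stats.ttest_ind(a, b)                 # assumes equal variances
    welch = stats.ttest_ind(a, b, equal_var=False)  # does not assume equal variances
    return student, welch

# Synthetic groups (made up) with different spreads and sizes.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=100.0, scale=20.0, size=80)
group_b = rng.normal(loc=100.0, scale=60.0, size=70)
student, welch = compare_means(group_a, group_b)
print(student.pvalue, welch.pvalue)
```

With the survey data, the call would be `compare_means(lp_q56, ln_q56)`. If the two p-values differ noticeably, the groups likely have unequal variances and the Welch result is the safer one to report.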

### Below I'll use a linear regression to get a better sense of how Factor1 and Factor2 interact with each other. 

In [23]:
import statsmodels.api as sm

model = sm.OLS.from_formula('Q5_6 ~ Factor1 + Factor2 + Factor1:Factor2', survey_df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                   Q5_6   R-squared:                       0.075
Model:                            OLS   Adj. R-squared:                  0.056
Method:                 Least Squares   F-statistic:                     4.006
Date:                Sun, 21 Aug 2022   Prob (F-statistic):            0.00891
Time:                        16:55:17   Log-Likelihood:                -1069.4
No. Observations:                 152   AIC:                             2147.
Df Residuals:                     148   BIC:                             2159.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                                                        coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------

### Below I'll make a note of a few things shown in the summary output
- The R-squared value is very low, i.e. the independent variables explain very little of the variability in the dependent variable. In this case, Factor1, Factor2, and their interaction explain only about 7.5% of the variability in willingness to pay.
- At a 95% confidence level (a significance threshold of 0.05), we have two statistically significant terms: Factor1 (p=0.003) and the Factor1:Factor2 interaction (p=0.025). For these terms, an estimate this large would be unlikely to arise from random variation alone if the true effect were zero.
- Last, we can relate the coefficient values to differences in willingness to pay. The Intercept represents the predicted willingness to pay in the baseline (reference) condition, about $91.35. The coefficient for Factor1 = Philips ($197.2405) is about 2.16 times the size of the Intercept. Because coefficients are additive differences from the baseline, this means the Philips logo is associated with a willingness to pay about $197.24 higher than the baseline when Factor2 is at its reference level. Conversely, the interaction term for Factor1 = 'Philips' & Factor2 = 'None' (no celebrity) has a coefficient of -$204.79. This is not a standalone effect: it adjusts the Philips effect when there is no celebrity, so the logo effect in that condition is about 197.24 - 204.79 = -$7.55, i.e. it essentially disappears. The standard errors for this term and for all of the other terms in this model are very large, so these estimates are imprecise and should be read with caution.
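To make the coefficient arithmetic above concrete, here is a minimal sketch of how an intercept, two main effects, and an interaction combine into predicted cell means under treatment (dummy) coding, which is the default for categorical variables in statsmodels formulas. The cell means below are made up for illustration and are not taken from the survey.

```python
import numpy as np

# Made-up cell means for a 2x2 design, keyed by (factor1, factor2),
# where 0 is the reference level and 1 is the other level.
cell_means = {(0, 0): 90.0, (1, 0): 290.0, (0, 1): 110.0, (1, 1): 105.0}

# Treatment-coded design matrix: intercept, factor1, factor2, factor1*factor2.
X = np.array([[1, f1, f2, f1 * f2] for (f1, f2) in cell_means], dtype=float)
y = np.array(list(cell_means.values()))

intercept, b1, b2, b12 = np.linalg.lstsq(X, y, rcond=None)[0]

# Effect of factor1 at each level of factor2.
effect_f1_at_ref = b1          # factor2 at its reference level
effect_f1_at_other = b1 + b12  # factor2 at its other level
print(intercept, b1, b2, b12, effect_f1_at_other)
```

In this toy case the factor1 coefficient is large and positive, but adding the negative interaction coefficient shows that factor1 has almost no effect once factor2 switches levels. The interaction coefficient on its own is a difference between two effects, not a predicted value.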