A friend of yours owns a frozen drink shop. On hot days, she seems extra happy. She says that she sees a line extending around the block and knows that means more sales. However, even before your friend started the business, she always seemed to love summer and thrived in the heat, so you think her jubilant attitude might be intrinsic. She makes a bet with you that sales really are higher on hotter days. She gets data on the sales numbers (in dollars) and the daily temperatures (in degrees Fahrenheit).
1.	What is the outcome?
2.	What is the main effect/predictor she wants to understand the impact of?
3.	What is the hypothesis?

Use the data she collected to conduct an analysis, test the hypothesis, and report results. The dataset is drinks.xlsx. Your analysis should have the following elements:

1.	An explanation of why the analysis is being conducted and what the hypothesis is
2.	Descriptive information about the data, including summary statistics (such as number of observations, measures of central tendency, and measures of dispersion) and plots of the data distributions
3.	Descriptive information about the relationship between the two variables, including correlation and scatterplots
4.	A regression analysis to test the hypothesis. If you have trouble getting the regression analysis to work, look closely at the data. Your friend wasn’t always able to get sales data for each day. Choose a method to handle rows with missing data.
5.	A description of the results of the analysis. Included in this description should be an interpretation of the coefficients, a description of the goodness of fit, and a discussion of whether the results are statistically significant.

1. The outcome is daily sales, measured in dollars. We want to test whether sales are higher on hotter days.
2. The main predictor is the daily temperature, measured in degrees Fahrenheit. She wants to know whether a change in temperature is associated with a change in sales.
3. The hypothesis is that higher temperatures lead to higher sales.
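
Written formally, and assuming the simple linear model that the regression cell fits, the hypothesis is a one-sided statement about the slope. The symbols $\beta_0$, $\beta_1$, and $\varepsilon_i$ are notation introduced here for illustration and do not come from the assignment:

```latex
\text{sales}_i = \beta_0 + \beta_1\,\text{temps}_i + \varepsilon_i,
\qquad H_0:\ \beta_1 \le 0 \quad \text{vs.} \quad H_1:\ \beta_1 > 0
```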

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np
import seaborn as sns

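# The assignment names the file drinks.xlsx; this assumes it was exported to CSV first
drinks_df = pd.read_csv('drinks.csv')
# Assumes the file has exactly two columns, in the order temperature, sales
drinks_df.columns = ['temps','sales']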

#summary stats
print(drinks_df[['temps','sales']].describe())

#Median
print('Median values:')
print(drinks_df[['temps', 'sales']].median())

# Histograms
# Missing values are dropped explicitly so that NaN handling
# does not depend on the matplotlib version

# Temperature histogram
plt.hist(drinks_df['temps'].dropna(), bins=20, edgecolor='navy')
plt.xlabel('Temperature (°F)')
plt.ylabel('Frequency')
plt.title('Distribution of Temperatures')
plt.show()

# Sales histogram
plt.hist(drinks_df['sales'].dropna(), bins=20, edgecolor='navy')
plt.xlabel('Sales ($)')
plt.ylabel('Frequency')
plt.title('Distribution of Sales')
plt.show()

In [None]:
# Drop rows where either variable is missing (listwise deletion),
# then calculate the correlation between the variables
cleaned_df = drinks_df.dropna(subset = ['temps','sales'])

corr = cleaned_df['temps'].corr(cleaned_df['sales'])
print(f"Correlation: {corr:.3f}")

plt.scatter(cleaned_df['temps'],cleaned_df['sales'])
plt.title("Temperatures vs Sales")
plt.xlabel('Temperatures')
plt.ylabel('Sales')
plt.show()

In [None]:
X = cleaned_df['temps']
Y = cleaned_df['sales']

# Add a constant term for the intercept
X = sm.add_constant(X)

# Ordinary Least Squares Regressions
model = sm.OLS(Y,X).fit()

# Print the regression results
print(model.summary())

# Scatter plot with regression line
sns.lmplot(x='temps', y='sales', data=cleaned_df, height=6, aspect=1.5)

plt.xlabel('Temperature')
plt.ylabel('Sales ($)')
plt.title('Sales vs Temperature')
plt.show()

We are conducting this analysis to determine whether sales at my friend's drink shop are higher on hotter days. The hypothesis is that higher temperatures lead to higher sales. We have a temperature variable and a sales variable, so we can test this by running a linear regression of sales on temperature and interpreting the results.

First, we look at the data, including how many observations there are for each variable. The counts are unequal because some sales values are missing, so we drop the rows with missing values before computing the correlation and running the regression. Then we examine the relationship between the two variables by creating a scatter plot and calculating the correlation coefficient.
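
Counting the missing values in each column before dropping rows makes the effect of the cleaning step visible. The sketch below uses a small made-up table, not the real drinks data; only the column names `temps` and `sales` match the notebook, and every value is purely illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for the drinks dataset; all values are made up
toy_df = pd.DataFrame({
    'temps': [70.0, 75.0, 80.0, 85.0, 90.0],
    'sales': [1500.0, None, 1800.0, None, 2100.0],
})

# Number of missing values in each column
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Listwise deletion: keep only rows where both variables are present
toy_clean = toy_df.dropna(subset=['temps', 'sales'])
print(f"Rows before: {len(toy_df)}, rows after: {len(toy_clean)}")
```

Dropping rows is the simplest option. It is reasonable when the days with missing sales are not systematically different from the other days, which is an assumption rather than something the data can confirm.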

Even from the scatter plot and the histograms, one could guess that there is a strong relationship between temperature and sales. The scatter plot shows a positive relationship: as temperatures get higher, sales get higher. The histograms have a similar shape, which suggests that the two variables are distributed similarly, although the histograms alone do not show how the variables move together.

The correlation coefficient of 0.823 indicates that there is in fact a strong, positive relationship between the two variables. The linear regression confirms that sales increase as temperatures increase. With 95% confidence, we can say that a 1-degree Fahrenheit increase in temperature is associated with an increase in daily sales of between $24.40 and $34.89. The R-squared statistic is 0.677, meaning that about 68% of the variance in sales is explained by temperature, which is a good fit for a model with a single predictor. The p-value for the temperature coefficient is low and the 95% confidence interval excludes zero, so the result is statistically significant. So yes, my friend is correct: hotter days are good for her business.
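
As a quick consistency check on the reported numbers: in a regression with a single predictor, R-squared equals the square of the correlation coefficient, and the OLS confidence interval is centered on the slope estimate. The sketch below only re-uses the values quoted in the write-up (0.823, 24.40, 34.89); the variable names are illustrative and are not taken from the notebook.

```python
# Values quoted in the write-up
corr = 0.823
ci_low, ci_high = 24.40, 34.89

# With a single predictor, R-squared is the squared Pearson correlation
r_squared = corr ** 2
print(f"R-squared implied by the correlation: {r_squared:.3f}")

# The interval is symmetric around the estimate, so its midpoint is the slope
slope_estimate = (ci_low + ci_high) / 2
print(f"Implied slope estimate: {slope_estimate:.3f} dollars per degree")
```

The implied R-squared agrees with the 0.677 reported by the model, and the midpoint of the interval gives the estimated change in sales per degree.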