# Purpose-Driven Data Science

Up until now, we have focused on the "what" and the "how" and less on the "when" and the "why". Today we are going to take an outcome-based approach to learning which types of problems warrant the use of various techniques.

It is important to understand that these techniques are often nuanced, and applying them blindly, without fully understanding their proper usage, can lead to unintended consequences. We are only going to scratch the surface of these techniques so that you can 1) know of their existence, and 2) know what situations may warrant their usage. It is still incumbent upon you to learn more about these techniques before using them.

Before we get into the details, I want to go over two final concepts that are critical to understanding applied data science:
- Hypothesis testing
- Statistical error types

Hypothesis testing is a foundational tool for statistical inference. Hypothesis testing is set up such that it is the analyst's burden to "prove" that two values are NOT the same or that some outcome is NOT likely.

For example, let's take the following data set, which records mobile phone usage alongside various user characteristics.

In [None]:
phone_usage_df = read.csv('device_usage.csv')

In [None]:
head(phone_usage_df, 5)

In [None]:
mean(phone_usage_df$avg_usage_hrs)

### Identifying Appropriate Hypothesis Statements

If we were to perform an analysis on phone usage based on gender, we could ask ourselves two similar, but very different, analytical hypotheses:
1. "Females use their phones more than males", or
2. "The amount of time females use their phones differs from that of males"

The way you frame your question is completely dependent on the study you are trying to perform. Sometimes being different is sufficient while other times you need to see a particular outcome in order for the results to be relevant.

What are possible analyses where we'd be more interested in Hypothesis 1? What about Hypothesis 2?

If we wanted to pursue Hypothesis 2 (usage simply differs between the two groups), then the following would be the null and alternate hypotheses:
\begin{equation*}
H_0: \mu_{female} = \mu_{male}
\end{equation*}
\begin{equation*}
H_a: \mu_{female} \ne \mu_{male}
\end{equation*}

In this case, we would "reject the null hypothesis" if we could show that females used their phones more OR less than males.

If we wanted to pursue Hypothesis 1 (females use their phones more), then the following would be the null and alternate hypotheses:
\begin{equation*}
H_0: \mu_{female} \le \mu_{male}
\end{equation*}
\begin{equation*}
H_a: \mu_{female} > \mu_{male}
\end{equation*}

#### Why can't I just use the sample means to make this assessment?

You may be wondering, why can't I just make my determination based on the means of the males and females from our sample? For our data, the following are our means:

In [None]:
# Mean usage hours per gender, using the formula interface of aggregate()
means_by_gender <- aggregate(avg_usage_hrs ~ gender, data=phone_usage_df, FUN=mean)
means_by_gender

While it seems that females use their phones more than males based on this data, we don't *know* whether they do because the true mean for females remains unknown. Let's look at the graphic below to understand why we can't just use these sample statistics to make a determination.

<img src="https://i.stack.imgur.com/ZfxV9m.png">

If we say that the curve on the left represents the range of plausible values for the true male mean (its confidence interval) and the curve on the right represents the same for females, we can see that there is overlap in where the true mean for each gender *could* be. Even though the curve on the right sits "further right" than the one on the left, there are regions where a plausible male mean is higher than a plausible female mean. The amount of uncertainty is what determines whether we can or cannot reject the null hypothesis.
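
To make the role of sampling uncertainty concrete, here is a minimal simulation sketch. It does not use `device_usage.csv`; the population mean, standard deviation, and group size below are made-up values chosen only for illustration. Both groups are drawn from the *same* population, so any gap between their sample means is due to chance alone.

```r
# Synthetic illustration only: these numbers are NOT from device_usage.csv.
set.seed(42)

# Hypothetical population of daily usage hours shared by BOTH groups,
# so the true difference in means is exactly zero.
true_mean <- 4
true_sd <- 1.5
n_per_group <- 30

# Draw two samples from the same population and return the gap
# between their sample means.
one_difference <- function() {
  group_a <- rnorm(n_per_group, mean = true_mean, sd = true_sd)
  group_b <- rnorm(n_per_group, mean = true_mean, sd = true_sd)
  mean(group_a) - mean(group_b)
}

# Repeat the experiment many times to see how much the gap varies by chance.
differences <- replicate(1000, one_difference())

# Spread of the chance-only gaps
summary(differences)

# Share of repetitions where the sample means differ by more than 0.25 hours
mean(abs(differences) > 0.25)
```

Even with no true difference, a sizeable share of the simulated samples show a visible gap between the two means, which is why a difference in sample means on its own is not enough to draw a conclusion.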

In [None]:
males <- subset(phone_usage_df, gender=='Male', select=avg_usage_hrs)
females <- subset(phone_usage_df, gender=='Female', select=avg_usage_hrs)
t.test(males$avg_usage_hrs, females$avg_usage_hrs)

In [None]:
t.test(females$avg_usage_hrs, males$avg_usage_hrs, alternative='greater')

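If we set our confidence level at 95%, then alpha, our cutoff for "statistical significance", would be 5%. Since our p-value is less than our alpha value, we can reject the null hypothesis and "accept" the alternate hypothesis. Had the p-value been larger than alpha, we would "fail to reject" the null hypothesis, which is not the same as proving it true.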
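
The decision rule can also be written out in code rather than read off the printed test output. The sketch below is only an illustration: `females_hrs` and `males_hrs` are simulated stand-ins with made-up means, not the columns from our data set. The object returned by `t.test` stores the p-value in its `p.value` element.

```r
# Synthetic stand-ins for the two groups (NOT the real data set).
set.seed(1)
females_hrs <- rnorm(40, mean = 4.5, sd = 1.5)
males_hrs <- rnorm(40, mean = 4.0, sd = 1.5)

# Significance level corresponding to a 95% confidence level
alpha <- 0.05

# One-sided test: is the female mean greater than the male mean?
result <- t.test(females_hrs, males_hrs, alternative = "greater")
p_value <- result$p.value

# Apply the decision rule explicitly
if (p_value < alpha) {
  decision <- "reject the null hypothesis"
} else {
  decision <- "fail to reject the null hypothesis"
}

cat("p-value:", signif(p_value, 3), "->", decision, "\n")
```

The same pattern should work with the real `females$avg_usage_hrs` and `males$avg_usage_hrs` vectors in place of the simulated ones.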

### Linear Modeling


In [None]:
phone_usage_df$gender <- factor(phone_usage_df$gender)
phone_usage_df$age <- factor(phone_usage_df$age)
phone_usage_df$income <- factor(phone_usage_df$income)
phone_usage_df$phone <- factor(phone_usage_df$phone)
phone_usage_df$has_degree <- factor(phone_usage_df$has_degree)
model <- lm(avg_usage_hrs ~ gender + phone, data=phone_usage_df)

In [None]:
summary(model)

In [None]:
plot(model)