# Waze app: user churn analysis and prediction model
Part 3: hypothesis testing

Research question:

*“Do drivers who open the application using an iPhone have the same number of drives on average as drivers who use Android devices?”*


In [1]:
# import packages and libraries

import pandas as pd
from scipy import stats

In [2]:
# load dataset into a dataframe
df = pd.read_csv("waze_dataset.csv")

Computing **descriptive statistics** gives a quick comparison of the average number of drives by device type.

In the dataset, device is a categorical variable with the labels iPhone and Android. We need to turn each label into an integer.

The following code assigns a 1 to iPhone users and a 2 to Android users. It stores the result in a new column, *device_type*, so that we don't overwrite the original data.

In [3]:
# create a dictionary that contains the class labels for keys and the values to convert them to as values
map_dictionary = {"Android": 2, "iPhone": 1}

# create a copy of the "device" column
df["device_type"] = df["device"]

# map the new column to the dictionary
# use the map() method on the device_type series and pass map_dictionary as its argument
df["device_type"] = df["device_type"].map(map_dictionary)

# When a dictionary is passed to Series.map(), each value that matches a
# dictionary key is replaced by the corresponding dictionary value.
# Values with no matching key become NaN.

df["device_type"].head()

0    2
1    1
2    2
3    1
4    2
Name: device_type, dtype: int64
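One behavior of Series.map() is worth checking before relying on the new column: labels that are missing from the dictionary are mapped to NaN rather than left unchanged. The sketch below illustrates this on a small hypothetical series (the "Windows Phone" label is invented for the example and is not in the Waze dataset).

```python
import pandas as pd

# Hypothetical toy data; "Windows Phone" is deliberately absent from the dictionary
devices = pd.Series(["Android", "iPhone", "Windows Phone"])
map_dictionary = {"Android": 2, "iPhone": 1}

# Labels found in the dictionary are converted; the unknown label becomes NaN
mapped = devices.map(map_dictionary)

# Count how many rows could not be mapped
n_unmapped = int(mapped.isna().sum())

print(mapped)
print("Unmapped rows:", n_unmapped)
```

On the real dataframe, a similar check such as `df["device_type"].isna().sum()` would confirm that every row received a label.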

In [4]:
df["device"].head()

0    Android
1     iPhone
2    Android
3     iPhone
4    Android
Name: device, dtype: object

In [6]:
# calculate the average number of drives for each device type
df.groupby("device_type")["drives"].mean()

device_type
1    67.859078
2    66.231838
Name: drives, dtype: float64

The above result shows that iPhone users (device_type 1) average about 67.9 drives, slightly more than the roughly 66.2 drives averaged by Android users (device_type 2).
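A difference in means is easier to interpret alongside the group sizes and spread. As a minimal sketch, assuming a small made-up dataframe `toy` with the same `device` and `drives` column names as the dataset, several statistics can be computed per group in one call:

```python
import pandas as pd

# Hypothetical toy data that mimics the "device" and "drives" columns
toy = pd.DataFrame({
    "device": ["iPhone", "iPhone", "iPhone", "Android", "Android", "Android"],
    "drives": [10, 20, 30, 5, 15, 25],
})

# Count, mean and sample standard deviation of drives for each device
summary = toy.groupby("device")["drives"].agg(["count", "mean", "std"])
print(summary)
```

The same `agg` call could be applied to `df` grouped by *device_type*; the numbers here come from the toy data only.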

To confirm whether this difference is statistically significant, we can conduct a hypothesis test.

### Hypothesis testing

We are going to perform a **two-sample t-test** since we have two independent samples: iPhone users and Android users.

The steps will be:

1. State the **null hypothesis (H0)** and the **alternative hypothesis (HA)**
2. Choose a **significance level** (5%)
3. Find the **p-value**
4. Reject or fail to reject the null hypothesis

**null hypothesis (H0):** there is no difference in the average number of drives between iPhone and Android users.

**alternative hypothesis (HA):** there is a difference in the average number of drives between iPhone and Android users.
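In symbols, the two hypotheses can be written as follows. The notation is an assumption introduced here for illustration: $\mu_{\text{iPhone}}$ and $\mu_{\text{Android}}$ stand for the population mean number of drives in each group.

$$
H_0: \mu_{\text{iPhone}} = \mu_{\text{Android}}
\qquad
H_A: \mu_{\text{iPhone}} \neq \mu_{\text{Android}}
$$

Because the alternative is "not equal" rather than "greater" or "less", the test is two-sided.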



In [12]:
# isolate the "drives" column for both devices
iPhone = df[df["device_type"] == 1]["drives"]
Android = df[df["device_type"] == 2]["drives"]

# perform the t-test
stats.ttest_ind(a=iPhone, b=Android, equal_var=False)

TtestResult(statistic=1.463523206885235, pvalue=0.143351972680206, df=11345.066049381952)

The default for the argument equal_var is True, which assumes population
variances are equal. However, there is no strong reason to assume that the
two groups have the same variance. By setting equal_var to False, we are
performing the unequal variances t-test (Welch’s t-test).
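To make the unequal variances test more concrete, the sketch below computes Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with `stats.ttest_ind`. It uses synthetic samples generated with NumPy (the sizes, means and standard deviations are arbitrary assumptions, not values from the Waze dataset).

```python
import numpy as np
from scipy import stats

# Synthetic samples with different sizes and spreads (assumed values)
rng = np.random.default_rng(0)
a = rng.normal(loc=68, scale=65, size=500)
b = rng.normal(loc=66, scale=60, size=300)

# Squared standard error of each sample mean: sample variance / sample size
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t statistic
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation of the degrees of freedom
dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# Reference result from SciPy
result = stats.ttest_ind(a=a, b=b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The manual values should agree with SciPy's up to floating-point error, which is why the degrees of freedom reported above (about 11345) are not a whole number.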

From the above result, the p-value (about 0.143, or 14%) is larger than our significance level (5%). We therefore fail to reject the null hypothesis: the data do not show a statistically significant difference in the average number of drives between iPhone and Android users.
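The decision rule from step 4 can also be written out explicitly. This is a minimal sketch that hard-codes the p-value reported in the test output above; the variable names `alpha` and `reject_null` are chosen for illustration.

```python
# Significance level chosen in step 2
alpha = 0.05

# p-value copied from the TtestResult output above
p_value = 0.143351972680206

# Reject H0 only when the p-value falls below the significance level
reject_null = p_value < alpha

if reject_null:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```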