# **Waze Project**
**Coursera - The Power of Statistics**

* **The purpose** of this project is to demonstrate knowledge of how to conduct a two-sample hypothesis test.

* **The goal** is to apply descriptive statistics and hypothesis testing in Python.

## Problem statements
   * Do drivers who open the application using an iPhone have the same number of drives on average as drivers who use Android devices?

### **1. Imports and data loading**
Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [1]:
import pandas as pd

from scipy import stats

Import the dataset.


In [2]:
# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')

### **2. Data exploration**

Use descriptive statistics to conduct exploratory data analysis (EDA).

In [3]:
df.head()

| | ID | label | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | device |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | retained | 283 | 226 | 296.748273 | 2276 | 208 | 0 | 2628.845068 | 1985.775061 | 28 | 19 | Android |
| 1 | 1 | retained | 133 | 107 | 326.896596 | 1225 | 19 | 64 | 13715.92055 | 3160.472914 | 13 | 11 | iPhone |
| 2 | 2 | retained | 114 | 95 | 135.522926 | 2651 | 0 | 0 | 3059.148818 | 1610.735904 | 14 | 8 | Android |
| 3 | 3 | retained | 49 | 40 | 67.589221 | 15 | 322 | 7 | 913.591123 | 587.196542 | 7 | 3 | iPhone |
| 4 | 4 | retained | 84 | 68 | 168.24702 | 1562 | 166 | 5 | 3950.202008 | 1219.555924 | 27 | 18 | Android |


In [4]:
df.isna().sum()

ID                           0
label                      700
sessions                     0
drives                       0
total_sessions               0
n_days_after_onboarding      0
total_navigations_fav1       0
total_navigations_fav2       0
driven_km_drives             0
duration_minutes_drives      0
activity_days                0
driving_days                 0
device                       0
dtype: int64

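The only missing values (700) are in `label`. This analysis uses only the `device` and `drives` columns, both of which are complete, so no missing-value handling is needed and we can continue to the next step.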
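The decision to ignore the missing `label` values can be sanity-checked. The sketch below is one possible check, run on a small made-up DataFrame (`toy` is a hypothetical stand-in, not the Waze data): it compares the device mix among rows with and without a label, and confirms that the two columns used by the test are complete.

```python
import pandas as pd

# Toy stand-in for the Waze data (values are made up for illustration).
toy = pd.DataFrame({
    "label": ["retained", None, "churned", "retained", None, "retained"],
    "drives": [226, 107, 95, 40, 68, 113],
    "device": ["Android", "iPhone", "Android", "iPhone", "Android", "iPhone"],
})

# Row-normalized device shares, split by whether `label` is missing.
# Similar shares in both rows suggest the missingness is unrelated to device.
device_share = pd.crosstab(toy["label"].isna(), toy["device"], normalize="index")
print(device_share)

# The columns the hypothesis test relies on should have no missing values.
cols_complete = bool(toy[["drives", "device"]].notna().all().all())
print(cols_complete)
```

On the real data, the same two calls would be applied to `df` instead of `toy`.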

**Note:** In the dataset, `device` is a categorical variable with the labels `iPhone` and `Android`.

For this analysis, we encode each label as an integer. The following code maps `iPhone` to `1` and `Android` to `2`, and stores the result in a new column, `device_new`.

In [5]:
map_label = {'iPhone':1,
             'Android':2}

df['device_new'] = df['device'].map(map_label)

We are interested in the relationship between device type and the number of drives. One approach is to look at the average number of drives for each device type. Let's calculate these averages.

In [6]:
pd.DataFrame(round(df.groupby('device')['drives'].mean(), 2)).reset_index()

| | device | drives |
|---|---|---|
| 0 | Android | 66.23 |
| 1 | iPhone | 67.86 |


Based on the averages shown, it appears that drivers who use an iPhone device to interact with the application have a higher number of drives on average. However, this difference might arise from random sampling, rather than being a true difference in the number of drives. To assess whether the difference is statistically significant, let's conduct a hypothesis test.
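Whether a gap between two means could be due to sampling depends on group sizes and spread, not just on the means themselves. As a minimal sketch, using a small made-up DataFrame (`toy` is hypothetical, not the Waze data), the per-group count, mean, and standard deviation can be summarized in one call:

```python
import pandas as pd

# Toy stand-in for the Waze data (values are made up for illustration).
toy = pd.DataFrame({
    "device": ["iPhone", "iPhone", "iPhone", "Android", "Android", "Android"],
    "drives": [107, 40, 113, 226, 95, 68],
})

# Count, mean, and sample standard deviation of `drives` per device type.
summary = (
    toy.groupby("device")["drives"]
       .agg(["count", "mean", "std"])
       .round(2)
)
print(summary)
```

A small difference in means combined with a large standard deviation in each group is a hint that the difference may not be statistically significant.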


### **3. Hypothesis testing**

The goal is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:

1.   State the null hypothesis and the alternative hypothesis
        * $H_0$: There is NO difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.
        * $H_A$: There is a difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.
2.   Choose a significance level
        * We use 5% as the significance level and proceed with a two-sample t-test.
3.   Find the p-value
        * We compute the t-statistic from the two samples and derive the p-value from it.
4.   Reject or fail to reject the null hypothesis

**Note:** This is a t-test for two independent samples. This is the appropriate test since the two groups are independent (Android users vs. iPhone users).

We can use the `stats.ttest_ind()` function to perform the test.


**Technical note**: The default for the argument `equal_var` in `stats.ttest_ind()` is `True`, which assumes population variances are equal. This equal variance assumption might not hold in practice (that is, there is no strong reason to assume that the two groups have the same variance); you can relax this assumption by setting `equal_var` to `False`, and `stats.ttest_ind()` will perform the unequal variances $t$-test (known as Welch's $t$-test). Refer to the [scipy t-test documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) for more information.
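To make the unequal-variances test less of a black box, the sketch below computes Welch's $t$-statistic, its Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with `stats.ttest_ind(..., equal_var=False)`. The two samples `a` and `b` are synthetic (randomly generated for illustration), not the Waze data.

```python
import numpy as np
from scipy import stats

# Synthetic samples with different sizes and spreads (illustration only).
rng = np.random.default_rng(0)
a = rng.normal(loc=68, scale=65, size=200)
b = rng.normal(loc=66, scale=60, size=120)

# Sample variance of each group divided by its sample size.
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t-statistic: difference in means over the combined standard error.
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation of the degrees of freedom.
dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# The library result should agree with the manual computation.
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```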


1. Isolate the `drives` column for iPhone users.
2. Isolate the `drives` column for Android users.
3. Perform the t-test

In [7]:
user_iphone = df[df['device_new'] == 1]['drives']
user_android = df[df['device_new'] == 2]['drives']

t_score, p_val = stats.ttest_ind(a=user_iphone,
                                 b=user_android,
                                 equal_var=False)

print(f'T-test\n• T-score: {t_score}')
print(f'• P-value: {p_val}\n\nTest result:')

if p_val < 0.05:
    print('• REJECT the null hypothesis')
else:
    print('• FAIL TO REJECT the null hypothesis')

T-test
• T-score: 1.4635232068852353
• P-value: 0.1433519726802059

Test result:
• FAIL TO REJECT the null hypothesis


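The p-value (`0.14`) is greater than the significance level (`0.05`), so we fail to reject the null hypothesis. There is <i>**no statistically significant difference**</i> in the average number of drives between drivers who use iPhone devices and drivers who use Android devices. Note that this does not prove the two averages are equal; the data simply do not provide enough evidence of a difference.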
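Another way to read a test result like this is through a confidence interval for the difference in means: at the 5% level, a two-sided Welch test fails to reject the null hypothesis exactly when the 95% interval contains zero. The sketch below illustrates that relationship on synthetic samples (`a` and `b` are randomly generated, not the Waze data).

```python
import numpy as np
from scipy import stats

# Synthetic samples (illustration only).
rng = np.random.default_rng(1)
a = rng.normal(loc=68, scale=65, size=300)
b = rng.normal(loc=66, scale=65, size=200)

# Difference in means and its Welch standard error.
diff = a.mean() - b.mean()
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size
se = np.sqrt(va + vb)

# Welch-Satterthwaite degrees of freedom and the 97.5th percentile of t.
dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
t_crit = stats.t.ppf(0.975, dof)

# 95% confidence interval for the difference in means.
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se
ci_contains_zero = bool(ci_low <= 0 <= ci_high)

# The interval contains zero exactly when the test fails to reject at 5%.
_, p_val = stats.ttest_ind(a, b, equal_var=False)
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f}), p-value: {p_val:.3f}")
```

The width of the interval also shows how large a difference the data could still be consistent with, which a p-value alone does not convey.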

### **4. Communicate insights**

Now that we've completed our hypothesis test, the next step is to share our findings with the Waze leadership team. Let's consider the following question as we prepare to write our executive summary:

* What business insight(s) can we draw from the result of our hypothesis test?

> *On average, drivers who use iPhone devices have a similar number of drives to those who use Android devices; the observed difference is not statistically significant.*

> *For further analysis, running temporary changes in marketing or in the Waze app's user interface may provide more data to investigate churn.*