![](WAZE.png)
# Waze User Churn Project: Results of Hypothesis Testing   
Waze is a free navigation app that makes it easier for drivers around the world to get to where they want to go. Waze leadership wants to optimize the company’s retention strategy, enhance user experience, and make data-driven decisions about product development. They would like an analysis of WAZE data to understand their users better and the development of a machine learning model that predicts user churn. (Churn is understood to be the number of users who have uninstalled the Waze app or stopped using it.) 

This project is part of a larger effort at Waze to increase growth. It assumes that high retention rates indicate satisfied users who repeatedly employ the Waze app over time. Identifying and predicting which users are likely to churn will allow the WAZE team to target such individuals to induce their retention, thereby allowing Waze to grow its business. 

**Data**    
Waze’s free navigation app makes it easier for drivers around the world to get to where they want to go. Waze’s community of map editors, beta testers, translators, partners, and users help make each drive better and safer. Waze partners with cities, transportation authorities, broadcasters, businesses, and first responders to help as many people as possible travel more efficiently and safely. The data set is in-house from Waze for Cities (https://www.transportation.gov/office-policy/transportation-policy/faq-waze-data).

**Deliverables**   
(Since this is an exercise, all models are predetermined. All Python code can be located at: https://github.com/izsolnay/WAZE_Python.)
* I.	An analysis of WAZE data to understand their users better
* II.	The development of a machine learning model that predicts user churn
    * a.	a binomial logistic regression model
    * b.	a winning tree-based model
* Appendix: A two-sample t-test based on a sample of user data to determine whether there is a statistically significant difference in the mean number of rides between iPhone® users and Android™ users

## Appendix: Hypothesis Testing, a two-sample t-test
Before any further investigation into the habits of users, the data team would like a statistical analysis of ride data based on device type. In particular, leadership wants to know if there is a statistically significant difference in the mean number of rides between iPhone® users and Android™ users. 

*Deliverables:* \
A two-sample hypothesis test (t-test) to analyze the difference in the mean number of `drives` taken by iPhone and Android users.

In [1]:
# Import standard operational packages
import pandas as pd
import numpy as np

# Import additional statistical package
from scipy import stats

# Set pandas to display all of the columns (no truncation)
pd.set_option('display.max_columns', None)

In [2]:
# Import data; create df
df0 = pd.read_csv('waze_dataset.csv', on_bad_lines='skip')
df0.head()

Unnamed: 0,ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device
0,0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android
1,1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone
2,2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android
3,3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone
4,4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android


In [3]:
df0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       14999 non-null  int64  
 1   label                    14299 non-null  object 
 2   sessions                 14999 non-null  int64  
 3   drives                   14999 non-null  int64  
 4   total_sessions           14999 non-null  float64
 5   n_days_after_onboarding  14999 non-null  int64  
 6   total_navigations_fav1   14999 non-null  int64  
 7   total_navigations_fav2   14999 non-null  int64  
 8   driven_km_drives         14999 non-null  float64
 9   duration_minutes_drives  14999 non-null  float64
 10  activity_days            14999 non-null  int64  
 11  driving_days             14999 non-null  int64  
 12  device                   14999 non-null  object 
dtypes: float64(3), int64(8), object(2)
memory usage: 1.5+ MB


In [4]:
df0.describe()

Unnamed: 0,ID,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days
count,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0
mean,7499.0,80.633776,67.281152,189.964447,1749.837789,121.605974,29.672512,4039.340921,1860.976012,15.537102,12.179879
std,4329.982679,80.699065,65.913872,136.405128,1008.513876,148.121544,45.394651,2502.149334,1446.702288,9.004655,7.824036
min,0.0,0.0,0.0,0.220211,4.0,0.0,0.0,60.44125,18.282082,0.0,0.0
25%,3749.5,23.0,20.0,90.661156,878.0,9.0,0.0,2212.600607,835.99626,8.0,5.0
50%,7499.0,56.0,48.0,159.568115,1741.0,71.0,9.0,3493.858085,1478.249859,16.0,12.0
75%,11248.5,112.0,93.0,254.192341,2623.5,178.0,43.0,5289.861262,2464.362632,23.0,19.0
max,14998.0,743.0,596.0,1216.154633,3500.0,1236.0,415.0,21183.40189,15851.72716,31.0,30.0


### Data frame and feature transformation
* drop ID variable
* drop observations (rows) with missing label values
* calculate ratios
* encode categorical variables for hypothesis test: map values
  * create map_dictionary
  * create `device_type` column
  * map `device_type` column to the created dictionary

In [5]:
# Drop ID column
df = df0.drop(['ID'], axis=1)  # axis=1 drops a column rather than a row
df.head(10)

Unnamed: 0,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device
0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android
1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone
2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android
3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone
4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android
5,retained,113,103,279.544437,2637,0,0,901.238699,439.101397,15,11,iPhone
6,retained,3,2,236.725314,360,185,18,5249.172828,726.577205,28,23,iPhone
7,retained,39,35,176.072845,2999,0,0,7892.052468,2466.981741,22,20,iPhone
8,retained,57,46,183.532018,424,0,26,2651.709764,1594.342984,25,20,Android
9,churned,84,68,244.802115,2997,72,0,6043.460295,2341.838528,7,3,iPhone


In [6]:
# Check for missing values
df.isna().sum()

label                      700
sessions                     0
drives                       0
total_sessions               0
n_days_after_onboarding      0
total_navigations_fav1       0
total_navigations_fav2       0
driven_km_drives             0
duration_minutes_drives      0
activity_days                0
driving_days                 0
device                       0
dtype: int64

In [7]:
# Drop rows with missing data in `label` column
df = df.dropna(subset=['label'])
df.isna().sum()

label                      0
sessions                   0
drives                     0
total_sessions             0
n_days_after_onboarding    0
total_navigations_fav1     0
total_navigations_fav2     0
driven_km_drives           0
duration_minutes_drives    0
activity_days              0
driving_days               0
device                     0
dtype: int64

In [8]:
# Calculate count & % of churned vs. retained
print(df['label'].value_counts())
print(df['label'].value_counts(normalize = True))

label
retained    11763
churned      2536
Name: count, dtype: int64
label
retained    0.822645
churned     0.177355
Name: proportion, dtype: float64


In [9]:
# isolate device
df['device'].describe()

count      14299
unique         2
top       iPhone
freq        9225
Name: device, dtype: object

In [10]:
# Calculate count and % of iPhone users and Android users in full dataset
print(df.shape)
print(df['device'].value_counts())  # counts the rows for each device type (the 700 rows with null labels were already dropped)
print(df['device'].value_counts(normalize=True))  # normalize=True returns proportions instead of counts

(14299, 12)
device
iPhone     9225
Android    5074
Name: count, dtype: int64
device
iPhone     0.64515
Android    0.35485
Name: proportion, dtype: float64


In [11]:
# Count Android users and iPhone users within each label (churned vs. retained)
print(df.groupby(['label', 'device']).size())

label     device 
churned   Android     891
          iPhone     1645
retained  Android    4183
          iPhone     7580
dtype: int64


In [12]:
# Calculate the proportions of Android users and iPhone users within each label
print(df.groupby('label')['device'].value_counts(normalize=True))

label     device 
churned   iPhone     0.648659
          Android    0.351341
retained  iPhone     0.644393
          Android    0.355607
Name: proportion, dtype: float64


In [13]:
# Create `map_dictionary` to encode new device type column
map_dictionary = {'Android': 0, 'iPhone': 1}

# Create new `device_type` column
df['device_type'] = df['device']

# Map the `device_type` column to dictionary
df['device_type'] = df['device'].map(map_dictionary)
df.head()

Unnamed: 0,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device,device_type
0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android,0
1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone,1
2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android,0
3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone,1
4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android,0


Results of new feature `device_type`  
* 0 = Android
* 1 = iPhone

In [14]:
# Calculate mean by device type
df.groupby('device_type')['drives'].mean()

device_type
0    66.024241
1    67.933225
Name: drives, dtype: float64

In [15]:
# Calculate median by device type
df.groupby('device_type')['drives'].median()

device_type
0    47.0
1    48.0
Name: drives, dtype: float64

Results\
The mean and median number of drives taken during the month differ only slightly between device types: iPhone users average about 67.9 drives (median 48) and Android users about 66.0 (median 47). A two-sample t-test will assess whether this difference is statistically significant or could plausibly be a random sampling occurrence. 

## Hypothesis testing: Welch's t-test
Conduct a two-sample t-test, because these are two independent groups (Android vs. iPhone users).\
Welch's t-test does not assume equal population variances (there is no reason to assume equal variances here).\
(Variance: the average of the squared differences of each data point from the mean.)
1. state the NULL hypothesis ($H_0$) and the alternative hypothesis ($H_a$)
    * $H_0$: there is no difference in the mean number of drives between iPhone users and Android users; any observed difference is due to CHANCE
    * $H_a$: there is a difference in the mean number of drives between iPhone users and Android users; the observed difference is not due to chance
2. choose a significance level: 5%
3. find the p-value; use the `stats.ttest_ind()` function to perform the test
4. reject or fail to reject the NULL hypothesis
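
What `equal_var=False` computes can be sketched by hand. The following is a minimal illustration rather than the notebook's own code: the helper name `welch_t` and the synthetic Poisson samples are assumptions made for the example. It uses the sample variance (dividing by n - 1) for each group, the Welch t-statistic, and the Welch-Satterthwaite approximation for the degrees of freedom, then compares the result with `stats.ttest_ind`.

```python
import numpy as np
from scipy import stats


def welch_t(a, b):
    """Return (t statistic, two-sided p-value, degrees of freedom) for Welch's t-test."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # Sample variance of each group divided by its sample size
    # (the squared standard error of each group mean)
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size

    # Welch's t statistic: difference in means over the combined standard error
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)

    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1)
    )

    # Two-sided p-value from the t distribution
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, p_value, dof


# Synthetic data for illustration only (NOT the Waze data)
rng = np.random.default_rng(0)
group_a = rng.poisson(lam=68, size=500)
group_b = rng.poisson(lam=66, size=300)

t_stat, p_value, dof = welch_t(group_a, group_b)
reference = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

print(t_stat, p_value, dof)
print(reference.statistic, reference.pvalue)
```

The hand-computed statistic and p-value should agree with SciPy's result up to floating-point error, which makes explicit that the test never pools the two group variances.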

In [16]:
# Set significance level
significance_level = 0.05
significance_level

0.05

In [17]:
# Isolate the `drives` column for each device type
Android = df[df['device_type'] == 0]['drives']
iPhone = df[df['device_type'] == 1]['drives']

# Perform the t-test
stats.ttest_ind(a=iPhone, b=Android, equal_var=False)  # equal_var=False: do not assume equal population variances (Welch's t-test)

TtestResult(statistic=1.676594122141587, pvalue=0.09365074661708836, df=10826.925404660755)

## Results
The p-value (~0.094) is above the 0.05 significance level. Fail to reject the null hypothesis: the observed difference in mean drives is consistent with CHANCE. There is no statistically significant evidence that drivers who use iPhones take a different number of drives than those who use Androids.\
Since the hypothesis test found no statistically significant relationship between device type and drives taken, other variables should be tested to see if there is a relationship.
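
The decision step can also be written out explicitly. The sketch below is illustrative only: it copies the statistic and degrees of freedom from the `TtestResult` reported above, recomputes the two-sided p-value from the t distribution, and applies the 5% rule. The variable names `dof` and `decision` are assumptions introduced for the example.

```python
from scipy import stats

# Values copied from the TtestResult reported above
statistic = 1.676594122141587
dof = 10826.925404660755
significance_level = 0.05

# Two-sided p-value: probability of a |t| at least this large if H0 is true
p_value = 2 * stats.t.sf(abs(statistic), dof)

# Step 4: compare the p-value with the significance level
decision = "reject H0" if p_value < significance_level else "fail to reject H0"
print(f"p-value = {p_value:.5f} -> {decision}")
```

The recomputed p-value should match the reported value of roughly 0.094, so the rule returns "fail to reject H0".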

### Rerun test on modified data set

In [19]:
# Import data; create df_mod
df_mod = pd.read_csv('WAZE_dataset_transformed_2.csv', on_bad_lines='skip')
df_mod.head()

Unnamed: 0,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device,percent_sessions_in_month,total_sessions_per_day,kms_driven_per_day_during_the_month,km_per_drive,km_per_hour,professional_driver,label2,device2
0,retained,243,200,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android,0.936313,0.130381,138.360267,11.632058,79.430298,1,0,0
1,retained,133,107,326.896596,1225,19,64,8898.716275,3160.472914,13,11,iPhone,0.406856,0.266854,1246.901868,128.186173,260.389902,1,0,1
2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android,0.841186,0.051121,382.393602,32.201567,113.95346,0,0,0
3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone,0.724968,1.089743,304.530374,22.839778,93.351141,0,0,1
4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android,0.499266,0.107713,219.455667,58.091206,194.34297,0,0,0


In [21]:
# Isolate the `drives` column for each device type
Android = df_mod[df_mod['device2']== 0]['drives']
iPhone = df_mod[df_mod['device2']== 1]['drives']

# Perform the t-test
stats.ttest_ind(a=iPhone, b=Android, equal_var=False)

TtestResult(statistic=1.4057121891392934, pvalue=0.1598388152623599, df=10636.379585225877)

## Results
Using the modified data frame results in an even higher p-value (~0.16), which is also above 0.05. Fail to reject the null hypothesis: the observed difference is consistent with CHANCE. There is no statistically significant difference in the mean number of drives between drivers who use iPhones and those who use Androids.