# Waze workplace scenario

Your team is nearing the midpoint of their user churn project. So far, you’ve completed a project proposal, and used Python to explore and analyze Waze’s user data. You’ve also used Python to create data visualizations. The next step is to use statistical methods to analyze and interpret your data.

You receive a new email from Sylvester Esperanza, your project manager. Sylvester tells your team about a new request from leadership: to analyze the relationship between the mean number of rides and device type. You also discover follow-up emails from three other team members: May Santner, Chidi Ga, and Harriet Hadzic. These emails discuss the details of the request: a statistical analysis of ride data based on device type. In particular, leadership wants to know whether there is a statistically significant difference in the mean number of rides between iPhone® users and Android™ users. A final email from Chidi includes your specific assignment: to conduct a two-sample hypothesis test (t-test) to analyze the difference in the mean number of rides between iPhone users and Android users.

A notebook has been structured and prepared to help you with this project. Please complete the following questions and prepare an executive summary.

### Assignment

You will conduct hypothesis testing on the churn data. The data team has asked you to investigate Waze's dataset to determine which hypothesis testing method best serves the data and the churn project.

### Import libraries and data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

In [2]:
df = pd.read_csv('waze_dataset.csv')
df.head()

,ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device
0,0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android
1,1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone
2,2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android
3,3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone
4,4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android


In [3]:
df = df.set_index('ID')

## Task 2. Data exploration

Let's transform the categorical values into numerical ones: iPhone = 1 and Android = 2.

In [4]:
map_dictionary = {'iPhone': 1, 'Android': 2}
df['device_type'] = df['device'].map(map_dictionary)
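
One thing worth checking after a dictionary mapping is whether every category was actually covered. The following is a minimal sketch on a made-up four-row frame (the `toy` DataFrame and its misspelled `'iphone'` entry are hypothetical, not taken from the Waze data): `Series.map` yields `NaN` for any value that is missing from the dictionary, so counting the `NaN` values reveals unmapped rows.

```python
import pandas as pd

# Hypothetical stand-in for the `device` column; values are invented
# for illustration, including one deliberately misspelled entry.
toy = pd.DataFrame({'device': ['iPhone', 'Android', 'Android', 'iphone']})

map_dictionary = {'iPhone': 1, 'Android': 2}
toy['device_type'] = toy['device'].map(map_dictionary)

# Series.map returns NaN for values that are not keys of the dictionary,
# so a non-zero count here signals a typo or an unexpected category.
unmapped = int(toy['device_type'].isna().sum())

print(toy)
print(f'Unmapped rows: {unmapped}')
```

If the same check on the real `df` reports zero unmapped rows, the mapping covers every device label in the dataset.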

In [5]:
df.head()

ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device,device_type
0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android,2
1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone,1
2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android,2
3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone,1
4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android,2


You are interested in the relationship between device type and the number of drives. One approach is to look at the average number of drives for each device type. Calculate these averages.

In [6]:
df.groupby(['device'])['drives'].mean().sort_values(ascending = False).to_frame()

device,drives
iPhone,67.859078
Android,66.231838


Based on the averages shown, it appears that drivers who use an iPhone device to interact with the application have a higher number of drives on average. However, this difference might arise from random sampling, rather than being a true difference in the number of drives. To assess whether the difference is statistically significant, you can conduct a hypothesis test.
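
Whether a gap between two means could plausibly come from sampling variation depends on the group sizes and the spread within each group, not only on the means. As a rough sketch, the group summary can be extended with counts and standard deviations. The `toy` frame below is synthetic (Poisson draws with an arbitrary seed), used only so the example runs on its own; with the real data, the same `agg` call would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: group sizes and the Poisson rate are arbitrary
# assumptions chosen for illustration, not values from the Waze dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'device': ['iPhone'] * 60 + ['Android'] * 40,
    'drives': rng.poisson(lam=67, size=100),
})

# Sample size, mean, and sample standard deviation for each device group.
summary = toy.groupby('device')['drives'].agg(['count', 'mean', 'std'])

# Observed difference in means (iPhone minus Android).
mean_diff = summary.loc['iPhone', 'mean'] - summary.loc['Android', 'mean']

print(summary)
print(f'Difference in means (iPhone - Android): {mean_diff:.3f}')
```

A small difference in means combined with large standard deviations is a hint that a formal test may not find the difference significant.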

## Task 3. Hypothesis testing

Your goal is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:

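1. State the null hypothesis and the alternative hypothesis
2. Choose a significance level
3. Find the p-value
4. Reject or fail to reject the null hypothesis

Here, the null hypothesis is that there is no difference in the mean number of drives between drivers who use iPhone devices and drivers who use Android devices. The alternative hypothesis is that there is a difference.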
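
The four steps above can be sketched as a small helper. This is an illustrative outline, not part of the original notebook: the function name `two_sample_test`, the default `alpha`, and the randomly generated samples are all assumptions made for the example.

```python
import numpy as np
from scipy import stats


def two_sample_test(sample_a, sample_b, alpha=0.05):
    """Run a two-sided two-sample t-test without assuming equal variances.

    Returns the t statistic, the p-value, and whether H0 is rejected.
    """
    # Step 1: H0 says the two population means are equal;
    #         Ha says they differ (two-sided alternative).
    # Step 2: `alpha` is the chosen significance level.
    # Step 3: compute the test statistic and p-value.
    result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
    # Step 4: reject H0 only when the p-value falls below alpha.
    reject = bool(result.pvalue < alpha)
    return result.statistic, result.pvalue, reject


# Hypothetical samples drawn from the same distribution (arbitrary seed).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=67.0, scale=65.0, size=500)
group_b = rng.normal(loc=67.0, scale=65.0, size=500)

t_stat, p_value, reject = two_sample_test(group_a, group_b)
print(f't = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {reject}')
```

Fixing the significance level before looking at the p-value keeps the decision rule from being tuned to the result.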


In [10]:
iphone = df[df['device'] == 'iPhone']
android = df[df['device'] == 'Android']

In [11]:
android

ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device,device_type
0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android,2
2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android,2
4,retained,84,68,168.247020,1562,166,5,3950.202008,1219.555924,27,18,Android,2
8,retained,57,46,183.532018,424,0,26,2651.709764,1594.342984,25,20,Android,2
13,retained,80,64,132.830506,3154,39,16,8531.248070,6324.273457,1,0,Android,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
14983,retained,48,40,50.823538,2488,504,0,5340.491350,2513.410279,13,8,Android,2
14984,retained,66,53,128.256561,1353,59,0,2556.749325,1982.442725,22,19,Android,2
14988,churned,13,11,41.804981,770,132,87,1533.521450,823.418616,0,0,Android,2
14990,churned,73,61,329.904300,614,60,46,6090.450154,3323.880771,0,0,Android,2


In [14]:
# Conduct a two-sample t-test on the two groups selected above.
# equal_var=False does not assume equal variances (Welch's t-test).
stats.ttest_ind(a=iphone['drives'], b=android['drives'], equal_var=False)

Ttest_indResult(statistic=1.4635232068852353, pvalue=0.1433519726802059)

Let's use 5% as the significance level and compare it with the p-value returned by the two-sample t-test (about 0.143).

Since the p-value is larger than the chosen significance level (5%), you fail to reject the null hypothesis. You conclude that there is not a statistically significant difference in the average number of drives between drivers who use iPhone devices and drivers who use Android devices.
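
To make the `equal_var=False` option less of a black box, here is a hedged sketch of the calculation it corresponds to, Welch's t-test, written out by hand and checked against SciPy. The helper name `welch_t_test` and the generated samples are assumptions for illustration; the samples are not the Waze data.

```python
import numpy as np
from scipy import stats


def welch_t_test(a, b):
    """Two-sided Welch's t-test computed from the textbook formulas."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error contributed by each sample.
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    # Test statistic: difference in means over the combined standard error.
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1)
    )
    # Two-sided p-value from the t distribution's survival function.
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, p_value


# Hypothetical samples with unequal sizes and spreads (arbitrary seed).
rng = np.random.default_rng(7)
sample_a = rng.normal(loc=68.0, scale=60.0, size=300)
sample_b = rng.normal(loc=66.0, scale=70.0, size=200)

manual_t, manual_p = welch_t_test(sample_a, sample_b)
scipy_result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)

print(f'manual: t = {manual_t:.4f}, p = {manual_p:.4f}')
print(f'scipy:  t = {scipy_result.statistic:.4f}, p = {scipy_result.pvalue:.4f}')
```

The two lines of output should agree up to floating-point error, which shows that the p-value depends on the means, the variances, and the sample sizes of both groups.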

## Conclusion

The key business insight is that, on average, drivers who use iPhone devices have a similar number of drives to those who use Android devices: the observed difference in means is not statistically significant at the 5% level.

One potential next step is to explore what other factors influence the variation in the number of drives, and to run additional hypothesis tests to learn more about user behavior. Further, temporary changes in marketing or in the user interface of the Waze app may provide more data for investigating churn.