# **Waze Project**

# **Data exploration and hypothesis testing**

**The goal** is to apply descriptive statistics and hypothesis testing in Python.


# **PACE stages**


## **PACE: Plan**

In [1]:
# Import any relevant packages or libraries
import numpy as np
import pandas as pd
from scipy import stats

In [2]:
# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')

## **PACE: Analyze and Construct**

### **Data exploration**

In [3]:
df.head()

Unnamed: 0,ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device
0,0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android
1,1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone
2,2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android
3,3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone
4,4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android


We are interested in the relationship between device type and the number of drives. One approach is to look at the average number of drives for each device type.

In [4]:
# Average number of drives for each device type
df.groupby('device')['drives'].mean()

device
Android    66.231838
iPhone     67.859078
Name: drives, dtype: float64

Based on these averages, iPhone users appear to have slightly more drives on average than Android users (about 67.9 vs. 66.2). However, a difference this small might be due to random sampling variability rather than a true difference in the number of drives. To assess whether the difference is statistically significant, let's conduct a hypothesis test.
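Comparing means alone hides how many users are in each group and how spread out their drive counts are. As a minimal sketch, the cell below computes the count, mean, and standard deviation per device on a small made-up DataFrame. The `toy` data and the `summary` and `mean_diff` names are hypothetical and are not part of the Waze dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `device` and `drives` columns.
toy = pd.DataFrame({
    "device": ["iPhone", "Android", "iPhone", "Android", "iPhone", "Android"],
    "drives": [70, 60, 50, 80, 90, 55],
})

# Count, mean, and standard deviation of drives for each device type.
summary = toy.groupby("device")["drives"].agg(["count", "mean", "std"])
print(summary)

# Difference in group means (iPhone minus Android).
mean_diff = summary.loc["iPhone", "mean"] - summary.loc["Android", "mean"]
print("Difference in means:", mean_diff)
```

The same `agg` call should work on `df` if the toy DataFrame is replaced with the loaded dataset.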


### **Hypothesis testing**

The goal is to conduct a two-sample t-test. The steps for conducting a hypothesis test:


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis

**Note:** A t-test for two independent samples is appropriate here because the two groups (Android users and iPhone users) are independent of each other.
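The four steps listed above can be sketched as a small helper. This is an illustration under stated assumptions, not the notebook's own method: `run_two_sample_ttest` is a hypothetical name, and the samples are synthetic draws rather than Waze data.

```python
import numpy as np
from scipy import stats

def run_two_sample_ttest(sample_a, sample_b, alpha=0.05):
    """Hypothetical helper: two-sided independent t-test plus the decision."""
    # Step 3: compute the test statistic and the p-value.
    statistic, pvalue = stats.ttest_ind(sample_a, sample_b)
    # Step 4: compare the p-value with the significance level chosen in step 2.
    decision = "reject H0" if pvalue < alpha else "fail to reject H0"
    return statistic, pvalue, decision

# Synthetic samples for illustration only (not the Waze data).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=67, scale=65, size=500)
group_b = rng.normal(loc=67, scale=65, size=500)

statistic, pvalue, decision = run_two_sample_ttest(group_a, group_b, alpha=0.05)
print("Statistic:", statistic)
print("P-value:", pvalue)
print("Decision:", decision)
```

Step 1 (stating the hypotheses) happens outside the code; the helper only automates the comparison once the hypotheses and `alpha` are fixed.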

**Hypotheses for this data project:**

H0 - There is no difference in the mean number of drives between iPhone users and Android users.

H1 - There is a difference in the mean number of drives between iPhone users and Android users.

Next, we choose 5% as the significance level and proceed with a two-sample t-test.
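For reference, with its default setting (`equal_var=True`), `stats.ttest_ind` computes the pooled-variance (Student's) t statistic. The notation below is introduced here for illustration and does not come from the notebook: $\bar{x}_1, \bar{x}_2$ are the sample means, $s_1^2, s_2^2$ the sample variances, and $n_1, n_2$ the sample sizes of the two groups.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
$$

The two-sided p-value is then taken from a t distribution with $n_1 + n_2 - 2$ degrees of freedom.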

In [5]:
# 1. Isolate the `drives` column for iPhone users.
drives_with_iphone = df[df['device'] == 'iPhone'].drives

# 2. Isolate the `drives` column for Android users.
drives_with_android = df[df['device'] == 'Android'].drives

# 3. Perform the t-test
statistic, pvalue = stats.ttest_ind(drives_with_iphone, drives_with_android)

print("Statistic:", statistic)
print("P-value:", pvalue)

Statistic: 1.4469680558136442
P-value: 0.14792677140889893


Since the p-value (0.148) is greater than the significance level (0.05), we fail to reject the null hypothesis. The data do not provide evidence of a statistically significant difference in the mean number of drives between iPhone and Android users.
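The call to `stats.ttest_ind` above uses SciPy's default `equal_var=True`, which assumes both groups have the same population variance. If that assumption is in doubt, one possible robustness check is Welch's t-test (`equal_var=False`). The sketch below compares the two on synthetic samples; the data and variable names are hypothetical, and the printed values say nothing about the Waze result.

```python
import numpy as np
from scipy import stats

# Synthetic samples with unequal spread and size, for illustration only.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=67, scale=60, size=400)
sample_b = rng.normal(loc=67, scale=80, size=250)

# Student's t-test (pooled variance): SciPy's default.
student = stats.ttest_ind(sample_a, sample_b)

# Welch's t-test: equal_var=False drops the equal-variance assumption.
welch = stats.ttest_ind(sample_a, sample_b, equal_var=False)

print("Student p-value:", student.pvalue)
print("Welch p-value:", welch.pvalue)
```

Passing `drives_with_iphone` and `drives_with_android` in place of the synthetic samples would apply the same check to the notebook's data.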

## **PACE: Execute**

### **Communicate insights with stakeholders**

**Business insights:**
The hypothesis test found no statistically significant difference in the mean number of drives between iPhone and Android users. This suggests that device type alone is unlikely to explain differences in driving activity, so Waze may be better served by examining other factors (e.g., user behavior, engagement patterns, or app features) when analyzing user retention and churn. Based on this result, there is no evidence that marketing and feature optimizations need to be device-specific.