# Waze - Preliminary Data Summary

This work prepares a dataset for future exploratory data analysis (EDA). The purpose is to investigate and understand the data provided, using pandas dataframes for a cursory inspection and reporting the findings to team members. By the end of this stage, the data should be ready to answer questions, yield insights, support visualizations and undergo hypothesis testing and other statistical methods.

In [1]:
import pandas as pd
import numpy as np

In [26]:
# reading the data
df = pd.read_csv("waze_dataset.csv")
df.head(10)

,ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device
0,0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android
1,1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone
2,2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android
3,3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone
4,4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android
5,5,retained,113,103,279.544437,2637,0,0,901.238699,439.101397,15,11,iPhone
6,6,retained,3,2,236.725314,360,185,18,5249.172828,726.577205,28,23,iPhone
7,7,retained,39,35,176.072845,2999,0,0,7892.052468,2466.981741,22,20,iPhone
8,8,retained,57,46,183.532018,424,0,26,2651.709764,1594.342984,25,20,Android
9,9,churned,84,68,244.802115,2997,72,0,6043.460295,2341.838528,7,3,iPhone


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       14999 non-null  int64  
 1   label                    14299 non-null  object 
 2   sessions                 14999 non-null  int64  
 3   drives                   14999 non-null  int64  
 4   total_sessions           14999 non-null  float64
 5   n_days_after_onboarding  14999 non-null  int64  
 6   total_navigations_fav1   14999 non-null  int64  
 7   total_navigations_fav2   14999 non-null  int64  
 8   driven_km_drives         14999 non-null  float64
 9   duration_minutes_drives  14999 non-null  float64
 10  activity_days            14999 non-null  int64  
 11  driving_days             14999 non-null  int64  
 12  device                   14999 non-null  object 
dtypes: float64(3), int64(8), object(2)
memory usage: 1.5+ MB


In [4]:
# isolating rows with null values
with_null_df = df[df["label"].isnull()]

# displaying summary statistics of samples with null values
with_null_df.describe()

,ID,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days
count,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0
mean,7405.584286,80.837143,67.798571,198.483348,1709.295714,118.717143,30.371429,3935.967029,1795.123358,15.382857,12.125714
std,4306.900234,79.98744,65.271926,140.561715,1005.306562,156.30814,46.306984,2443.107121,1419.242246,8.772714,7.626373
min,77.0,0.0,0.0,5.582648,16.0,0.0,0.0,290.119811,66.588493,0.0,0.0
25%,3744.5,23.0,20.0,94.05634,869.0,4.0,0.0,2119.344818,779.009271,8.0,6.0
50%,7443.0,56.0,47.5,177.255925,1650.5,62.5,10.0,3421.156721,1414.966279,15.0,12.0
75%,11007.0,112.25,94.0,266.058022,2508.75,169.25,43.0,5166.097373,2443.955404,23.0,18.0
max,14993.0,556.0,445.0,1076.879741,3498.0,1096.0,352.0,15135.39128,9746.253023,31.0,30.0


In [5]:
# isolating rows without null values
without_null_df = df[~df["label"].isnull()]

# displaying summary statistics of samples without null values
without_null_df.describe()

,ID,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days
count,14299.0,14299.0,14299.0,14299.0,14299.0,14299.0,14299.0,14299.0,14299.0,14299.0,14299.0
mean,7503.573117,80.62382,67.255822,189.547409,1751.822505,121.747395,29.638296,4044.401535,1864.199794,15.544653,12.18253
std,4331.207621,80.736502,65.947295,136.189764,1008.663834,147.713428,45.35089,2504.97797,1448.005047,9.016088,7.833835
min,0.0,0.0,0.0,0.220211,4.0,0.0,0.0,60.44125,18.282082,0.0,0.0
25%,3749.5,23.0,20.0,90.457733,878.5,10.0,0.0,2217.319909,840.181344,8.0,5.0
50%,7504.0,56.0,48.0,158.718571,1749.0,71.0,9.0,3496.545617,1479.394387,16.0,12.0
75%,11257.5,111.0,93.0,253.54045,2627.5,178.0,43.0,5299.972162,2466.928876,23.0,19.0
max,14998.0,743.0,596.0,1216.154633,3500.0,1236.0,415.0,21183.40189,15851.72716,31.0,30.0


In [6]:
# calculating the count of null values by device
with_null_df["device"].value_counts()

device,count
iPhone,447
Android,253


In [7]:
# calculating the percentage of null values for iPhone and Android devices
with_null_df["device"].value_counts(normalize=True)

device,proportion
iPhone,0.638571
Android,0.361429


In [8]:
# calculating the percentage of iPhone and Android users
df["device"].value_counts(normalize=True)

device,proportion
iPhone,0.644843
Android,0.355157


None of the first 10 observations has missing values, though this does not imply that the entire dataset is free of gaps. The `label` and `device` variables are of type object, while `total_sessions`, `driven_km_drives` and `duration_minutes_drives` are floats; the remaining variables are integers. There are 14,999 rows and 13 columns in total, with 700 missing values in the `label` column. Comparing the summary statistics of rows with missing retention labels against those with complete data reveals no substantial differences: the means and standard deviations are similar across both groups. Of the 700 rows with missing labels, 447 (63.9%) belong to iPhone users and 253 (36.1%) to Android users, closely matching the overall device split of 64.5% to 35.5%. There is no evidence of a systematic cause for the missing data.
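The device comparison above can be made slightly more formal. The following is a minimal sketch, not part of the original analysis, of a chi-square goodness-of-fit check that uses only the counts and proportions already reported. It treats the overall device split as the expected distribution, which is an approximation because the overall split includes the 700 rows themselves. The variable names are illustrative.

```python
import math

# observed device counts among the 700 rows with a missing label (from the output above)
observed = {"iPhone": 447, "Android": 253}

# overall device proportions in the full dataset (from the output above)
overall_share = {"iPhone": 0.644843, "Android": 0.355157}

n_missing = sum(observed.values())

# expected counts if missing labels were spread across devices like the full dataset
expected = {device: n_missing * share for device, share in overall_share.items()}

# chi-square statistic with one degree of freedom (two categories)
chi2 = sum((observed[d] - expected[d]) ** 2 / expected[d] for d in observed)

# for one degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-square: {chi2:.3f}, p-value: {p_value:.3f}")
```

A large p-value here is consistent with the conclusion that the missing labels are not concentrated on one device type, though it does not rule out other causes of missingness.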

In [9]:
# calculating the counts of churned and retained users
print(df["label"].value_counts())
print()
print(df["label"].value_counts(normalize=True))

label
retained    11763
churned      2536
Name: count, dtype: int64

label
retained    0.822645
churned     0.177355
Name: proportion, dtype: float64


In [10]:
# calculating the median values of all columns for churned and retained users
df.groupby("label").median(numeric_only=True)

label,ID,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days
churned,7477.5,59.0,50.0,164.339042,1321.0,84.5,11.0,3652.655666,1607.183785,8.0,6.0
retained,7509.0,56.0,47.0,157.586756,1843.0,68.0,9.0,3464.684614,1458.046141,17.0,14.0


In [11]:
# calculating the percentage of Android and iPhone users for each label
df.groupby("label")["device"].value_counts(normalize=True)

label,device,proportion
churned,iPhone,0.648659
churned,Android,0.351341
retained,iPhone,0.644393
retained,Android,0.355607


In [12]:
# calculating the number of Android and iPhone users for each label
lst_columns = ["label", "device"]
df.groupby(lst_columns).size()

label,device,0
churned,Android,891
churned,iPhone,1645
retained,Android,4183
retained,iPhone,7580
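
The proportions in cell 11 describe the device mix within each label. To read the same counts the other way round, as a churn rate within each device, a small sketch such as the one below can be used. It relies only on the four counts shown above; the helper `churn_rate` and the variable names are hypothetical.

```python
# counts of users by label and device, copied from the groupby output above
counts = {
    ("churned", "Android"): 891,
    ("churned", "iPhone"): 1645,
    ("retained", "Android"): 4183,
    ("retained", "iPhone"): 7580,
}


def churn_rate(device):
    """Return the share of labeled users on a device who churned."""
    churned = counts[("churned", device)]
    retained = counts[("retained", device)]
    return churned / (churned + retained)


churn_rates = {device: churn_rate(device) for device in ("Android", "iPhone")}

# absolute gap between the two churn rates, in percentage points
gap_in_points = abs(churn_rates["iPhone"] - churn_rates["Android"]) * 100

for device, rate in churn_rates.items():
    print(f"{device}: {rate:.1%}")
print(f"gap: {gap_in_points:.2f} percentage points")
```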


In [13]:
# note: `drives` and `driving_days` contain zeros (see the minimums above),
# so these ratios can be inf (x / 0) or NaN (0 / 0) for some rows

# creating a column for the number of drives per driving day
df["drives_per_driving_day"] = df["drives"] / df["driving_days"]

# creating a column for the number of km per drive
df["km_per_drive"] = df["driven_km_drives"] / df["drives"]

# creating a column for the number of km per driving day
df["km_per_driving_day"] = df["driven_km_drives"] / df["driving_days"]
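
Since the summary statistics show a minimum of 0 for both `drives` and `driving_days`, the divisions above can produce infinite or undefined values. The sketch below illustrates one possible way to detect and neutralize them on a small toy dataframe. The toy data and the choice to replace infinities with `NaN` are assumptions for illustration; the notebook itself does not apply this step, and doing so on the real data could shift the medians reported below.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical) with zero denominators in the second and third rows
toy = pd.DataFrame(
    {
        "drives": [10, 0, 6],
        "driving_days": [5, 0, 0],
        "driven_km_drives": [500.0, 120.0, 300.0],
    }
)

# same ratio features as in the notebook
toy["drives_per_driving_day"] = toy["drives"] / toy["driving_days"]
toy["km_per_driving_day"] = toy["driven_km_drives"] / toy["driving_days"]

ratio_cols = ["drives_per_driving_day", "km_per_driving_day"]

# count infinite values per ratio column before cleaning
n_inf = np.isinf(toy[ratio_cols]).sum()
print(n_inf)

# replace +/- infinity with NaN so that later statistics skip those rows
cleaned = toy.replace([np.inf, -np.inf], np.nan)
print(cleaned[ratio_cols])
```

Whether to drop, zero out or keep such rows is a modeling decision that depends on what a zero-day user should mean in the analysis.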

In [21]:
# grouping by the label once and reusing the grouped object
grouped_by_label = df.groupby("label")

# calculating the median drives per driving day for each label
median_drives_per_driving_day = grouped_by_label[["drives_per_driving_day"]].median()

# calculating the median km per drive for each label
median_km_per_drive = grouped_by_label[["km_per_drive"]].median()

# calculating the median km per driving day for each label
median_km_per_driving_day = grouped_by_label[["km_per_driving_day"]].median()

The median churned user completed around three more drives in their final month than the median retained user, but retained users used the app on more than twice as many days over the same period (17 versus 8). The median churned user also drove approximately 200 more kilometers and 2.5 more hours than the median retained user during the last month. Churned users therefore packed more drives into fewer days and covered more total distance and time, although the median distance per drive was nearly identical for both groups (about 74 km versus 75 km). This suggests the potential for a distinct user profile. The median churned user drove 698 kilometers per driving day in their final month, around 2.4 times (roughly 140% more than) the median retained user's 290 kilometers. They also made more drives per driving day, with a median of 10 versus about 4 for retained users. Based on these figures, it appears that the dataset represents drivers with intense usage patterns, possibly even long-haul truckers. It is likely not representative of average drivers. It would be advisable for Waze to further study these "super-drivers", as their specific needs may explain why they stop using the app, and these needs could differ significantly from those of regular users.

In [25]:
median_drives_per_driving_day.head()

label,drives_per_driving_day
churned,10.0
retained,4.0625


In [23]:
median_km_per_drive.head()

label,km_per_drive
churned,74.109416
retained,75.014702


In [24]:
median_km_per_driving_day.head()

label,km_per_driving_day
churned,697.541999
retained,289.549333
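
When comparing the two medians above, it helps to separate "times as much" from "percent more", since the two are easy to confuse. The following sketch works through the arithmetic with the reported values; the variable names are illustrative.

```python
# median km per driving day for each label, copied from the output above
churned_km_per_day = 697.541999
retained_km_per_day = 289.549333

# how many times larger the churned median is
ratio = churned_km_per_day / retained_km_per_day

# the same gap expressed as a percentage increase over the retained median
percent_more = (ratio - 1) * 100

print(f"ratio: {ratio:.2f}x")
print(f"percent more: {percent_more:.0f}%")
```

A ratio of about 2.4 means the churned median is roughly 240% of the retained median, which corresponds to an increase of roughly 140%.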


To sum up, the dataset has 700 missing values in the `label` column with no apparent pattern in the omissions. Because the mean is sensitive to outliers, the comparisons rely on the median, which is far less affected by extreme data points. For instance, the median churned user drove 698 kilometers per driving day last month, around 2.4 times the figure for retained users. It is important to determine how the data was collected and whether the sample is representative. Android users made up 36% of the sample, while iPhone users accounted for 64%. In general, churned users drove longer distances over fewer days and used the app on about half as many days as retained users. The churn rate for iPhone and Android users was almost identical (about 17.8% versus 17.6%), differing by less than one percentage point, suggesting no link between device type and churn behavior.