The goal of this statistical analysis is to determine whether the effect of device type on the number of drives is statistically significant for Waze application users.

**Data Description**

<table border="1">
  <tr>
    <th>Column Name</th>
    <th>Type</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>ID</td>
    <td>int</td>
    <td>A sequential numbered index</td>
  </tr>
  <tr>
    <td>label</td>
    <td>obj</td>
    <td>Binary target variable (“retained” vs “churned”) indicating whether a user churned at any time during the month</td>
  </tr>
  <tr>
    <td>sessions</td>
    <td>int</td>
    <td>The number of occurrences of a user opening the app during the month</td>
  </tr>
  <tr>
    <td>drives</td>
    <td>int</td>
    <td>The number of occurrences of driving at least 1 km during the month</td>
  </tr>
  <tr>
    <td>device</td>
    <td>obj</td>
    <td>The type of device a user starts a session with</td>
  </tr>
  <tr>
    <td>total_sessions</td>
    <td>float</td>
    <td>A model estimate of the total number of sessions since a user has onboarded</td>
  </tr>
  <tr>
    <td>n_days_after_onboarding</td>
    <td>int</td>
    <td>The number of days since a user signed up for the app</td>
  </tr>
  <tr>
    <td>total_navigations_fav1</td>
    <td>int</td>
    <td>Total navigations since onboarding to the user’s favorite place 1</td>
  </tr>
  <tr>
    <td>total_navigations_fav2</td>
    <td>int</td>
    <td>Total navigations since onboarding to the user’s favorite place 2</td>
  </tr>
  <tr>
    <td>driven_km_drives</td>
    <td>float</td>
    <td>Total kilometers driven during the month</td>
  </tr>
  <tr>
    <td>duration_minutes_drives</td>
    <td>float</td>
    <td>Total duration driven in minutes during the month</td>
  </tr>
  <tr>
    <td>activity_days</td>
    <td>int</td>
    <td>Number of days the user opens the app during the month</td>
  </tr>
  <tr>
    <td>driving_days</td>
    <td>int</td>
    <td>Number of days the user drives (at least 1 km) during the month</td>
  </tr>
</table>


In [2]:
#Import the Required Libraries
import pandas as pd
from scipy import stats 

In [3]:
#import the dataset
filepath = "waze_dataset.csv"
df = pd.read_csv(filepath)

In [9]:
#explore the dataset
description = df.describe(include='all')
devices = df["device"].value_counts()
#df.info() prints its summary directly and returns None, so call it instead of printing its result
df.info()
print("The dataset description")
print(description)
print("The device type counts for this dataset")
print(devices)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       14999 non-null  int64  
 1   label                    14299 non-null  object 
 2   sessions                 14999 non-null  int64  
 3   drives                   14999 non-null  int64  
 4   total_sessions           14999 non-null  float64
 5   n_days_after_onboarding  14999 non-null  int64  
 6   total_navigations_fav1   14999 non-null  int64  
 7   total_navigations_fav2   14999 non-null  int64  
 8   driven_km_drives         14999 non-null  float64
 9   duration_minutes_drives  14999 non-null  float64
 10  activity_days            14999 non-null  int64  
 11  driving_days             14999 non-null  int64  
 12  device                   14999 non-null  object 
dtypes: float64(3), int64(8), object(2)
memory usage: 1.5+ MB
The dataset describ

In [10]:
#map the device type into integer format to use descriptive statistics to conduct exploratory data analysis (EDA)
map_dic = {"iPhone":1, "Android":2}
df["Mobile"] = df['device'].map(map_dic)
df["Mobile"]

0        2
1        1
2        2
3        1
4        2
        ..
14994    1
14995    2
14996    1
14997    1
14998    1
Name: Mobile, Length: 14999, dtype: int64

We are interested in the relationship between device type and the number of drives. One approach is to compare the average number of drives for each device type, so we calculate these averages first.

In [14]:
#prepare the dataset to calculate the average number of drives per Mobile Type
iphone = df[df["Mobile"] == 1]
android = df[df["Mobile"] == 2]
#Calculate the average(mean)
iphone_average_rides = iphone["drives"].mean()
android_average_rides = android["drives"].mean()
#show the results
print(f"iphone= {iphone_average_rides} android= {android_average_rides}")

iphone= 67.85907775020678 android= 66.23183780739629
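
The per-device averages can also be obtained in a single step with `groupby`. The following is a minimal sketch on a small made-up DataFrame: the `toy` frame and its values are hypothetical and are not the Waze data; only the `device` and `drives` column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy data for illustration only; this is not the Waze dataset.
toy = pd.DataFrame({
    "device": ["iPhone", "Android", "iPhone", "Android", "iPhone"],
    "drives": [10, 20, 30, 60, 50],
})

# Mean number of drives per device type, computed in one step.
mean_drives_by_device = toy.groupby("device")["drives"].mean()
print(mean_drives_by_device)
```

On the actual dataset the same pattern would be `df.groupby("device")["drives"].mean()`, assuming `df` has been loaded as in the earlier cell.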


Based on the averages computed from the dataset (iPhone ≈ 67.86, Android ≈ 66.23), it appears that drivers who use an iPhone to interact with the application have a slightly higher number of drives on average. However, this difference might arise from random sampling rather than from a true difference in the number of drives. To assess whether the difference is statistically significant, you can conduct a hypothesis test.

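Now it's time to test whether that difference is statistically significant.

H0: There is no difference in the mean number of drives between iPhone users and Android users; the observed difference is due to chance.
Ha: There is a difference in the mean number of drives between iPhone users and Android users.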

In [15]:
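#two-sample t-test; equal_var=False does not assume equal variances in the two groups (Welch's t-test)
ttest_result = stats.ttest_ind(a=iphone["drives"], b=android["drives"], equal_var=False)
ttest_result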

TtestResult(statistic=1.463523206885235, pvalue=0.14335197268020597, df=11345.066049381952)
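
Passing `equal_var=False` to `stats.ttest_ind` selects Welch's t-test, which does not pool the two group variances. To make the reported statistic, degrees of freedom, and p-value less of a black box, here is a minimal sketch that computes them by hand and cross-checks against SciPy. The arrays `group_a` and `group_b` are hypothetical values chosen for illustration and are not taken from the Waze data.

```python
import numpy as np
from scipy import stats

# Hypothetical samples for illustration only; not the Waze data.
group_a = np.array([12.0, 15.0, 11.0, 18.0, 14.0, 16.0])
group_b = np.array([10.0, 13.0, 9.0, 12.0, 20.0])

# Sample variance (ddof=1) of each group divided by its sample size.
var_a = group_a.var(ddof=1) / group_a.size
var_b = group_b.var(ddof=1) / group_b.size

# Welch's t statistic: difference in means over its estimated standard error.
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite approximation of the degrees of freedom.
dof_manual = (var_a + var_b) ** 2 / (
    var_a ** 2 / (group_a.size - 1) + var_b ** 2 / (group_b.size - 1)
)

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), dof_manual)

# Cross-check against SciPy's implementation.
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The manual values should agree with SciPy's up to floating-point rounding. The non-integer degrees of freedom in the notebook output (`df=11345.07`) come from this same approximation.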

Since the p-value (about 0.143) is larger than the chosen significance level (5%), you fail to reject the null hypothesis. You conclude that there is **not** a statistically significant difference in the average number of drives between drivers who use iPhones and drivers who use Android devices.
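
The decision rule can be written out explicitly. This is a small sketch that hard-codes the p-value from the t-test output reported in this notebook and the 5% significance level stated in the text; the variable names `p_value`, `alpha`, and `decision` are illustrative.

```python
# p-value copied from the t-test output reported in this notebook.
p_value = 0.14335197268020597
# Significance level of 5%, as stated in the text.
alpha = 0.05

# Reject the null hypothesis only when the p-value falls below alpha.
if p_value < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(f"p-value = {p_value:.4f}, alpha = {alpha}: {decision}")
```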