## Exploratory Data Analysis of Cab Companies: Hypothesis Test Results and Recommendation

## Hypothesis Tests

#### Test1 - Is there any difference in the mean profit per km for these companies?

In [165]:
# The mean profit per km for both companies
# (select the column before aggregating so non-numeric columns are not averaged)
df.groupby("Company")["Profit_dist"].mean()

Company
Pink Cab      2.769908
Yellow Cab    7.105508
Name: Profit_dist, dtype: float64

H0: There is no difference in the mean profit per km for companies Yellow Cab and Pink Cab.

H1: The mean profit per km for company Yellow Cab is higher than the mean profit per km for Pink Cab.

In [166]:
from scipy import stats
from scipy.stats import ttest_ind

sample1 = df[df["Company"] == "Pink Cab"]["Profit_dist"]
sample2 = df[df["Company"] == "Yellow Cab"]["Profit_dist"]

t_stat, p_value = ttest_ind(sample1, sample2)
print("T-statistic value: ", t_stat)  
print("P-Value: ", p_value)

T-statistic value:  -210.96861574553898
P-Value:  0.0


At the 0.05 significance level, the two-sample t-test rejects H0: the mean profit per km is higher for Yellow Cab than for Pink Cab.
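
Because H1 is directional, a one-sided test matches the hypotheses more closely than the two-sided default of `ttest_ind`. The sketch below is only an illustration on synthetic data (not the cab dataset). It assumes SciPy 1.6 or newer, where `ttest_ind` accepts the `alternative` argument, and it uses `equal_var=False` (Welch's test) so that equal variances are not assumed. The helper name `one_sided_welch` is hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind

def one_sided_welch(lower, higher):
    """Test H1: mean(lower) < mean(higher) with Welch's t-test."""
    # alternative="less" tests whether the first sample's mean is below the second's
    return ttest_ind(lower, higher, equal_var=False, alternative="less")

# Synthetic stand-ins for the profit per km of each company (illustrative values only)
rng = np.random.default_rng(0)
pink = rng.normal(loc=2.8, scale=2.0, size=1000)
yellow = rng.normal(loc=7.1, scale=4.0, size=1000)

result = one_sided_welch(pink, yellow)
print("T-statistic value: ", result.statistic)
print("P-Value: ", result.pvalue)
```

With the real data, `pink` and `yellow` would be replaced by `sample1` and `sample2` from the cell above.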

#### Test2 - Is there any difference in the mean profit per customer for these companies?

In [167]:
# The mean profit per customer for each company
df2 = df.groupby(["Company", "Customer ID"])["Profit"].mean().reset_index()
df2

| | Company | Customer ID | Profit |
|---|---|---|---|
| 0 | Pink Cab | 1 | 218.291000 |
| 1 | Pink Cab | 10 | 66.470000 |
| 2 | Pink Cab | 100 | 188.273750 |
| 3 | Pink Cab | 1000 | 154.964667 |
| 4 | Pink Cab | 10001 | 6.822000 |
| ... | ... | ... | ... |
| 72221 | Yellow Cab | 9994 | 84.971600 |
| 72222 | Yellow Cab | 9996 | 63.018700 |
| 72223 | Yellow Cab | 9997 | 81.553800 |
| 72224 | Yellow Cab | 9998 | 92.264667 |


In [168]:
df2.groupby("Company")[["Profit"]].mean()

| Company | Profit |
|---|---|
| Pink Cab | 58.507889 |
| Yellow Cab | 129.373032 |


H0: There is no difference in the mean profit per customer for companies Yellow Cab and Pink Cab.

H1: The mean profit per customer for company Yellow Cab is higher than the mean profit per customer for Pink Cab.

In [169]:
from scipy import stats
from scipy.stats import ttest_ind

sample1 = df2[df2["Company"] == "Pink Cab"]["Profit"]
sample2 = df2[df2["Company"] == "Yellow Cab"]["Profit"]

t_stat, p_value = ttest_ind(sample1, sample2)
print("T-statistic value: ", t_stat)  
print("P-Value: ", p_value)

T-statistic value:  -106.54408808351249
P-Value:  0.0


At the 0.05 significance level, the two-sample t-test rejects H0: the mean profit per customer is higher for Yellow Cab than for Pink Cab.

#### Test3 - Is there any difference in the mean charged price per km for these companies?

In [170]:
df["PriceCharged_km"] = df["Price Charged"] / df["KM Travelled"]

In [171]:
# The mean charged price per km for both companies
df.groupby("Company")["PriceCharged_km"].mean()

Company
Pink Cab      13.768510
Yellow Cab    20.306073
Name: PriceCharged_km, dtype: float64

H0: There is no difference in the mean charged price per km for companies Yellow Cab and Pink Cab.

H1: The mean charged price per km for company Yellow Cab is higher than the mean charged price per km for Pink Cab.

In [172]:
from scipy import stats
from scipy.stats import ttest_ind

sample1 = df[df["Company"] == "Pink Cab"]["PriceCharged_km"]
sample2 = df[df["Company"] == "Yellow Cab"]["PriceCharged_km"]

t_stat, p_value = ttest_ind(sample1, sample2)
print("T-statistic value: ", t_stat)  
print("P-Value: ", p_value)

T-statistic value:  -320.9807762543473
P-Value:  0.0


At the 0.05 significance level, the two-sample t-test rejects H0: the mean charged price per km is higher for Yellow Cab than for Pink Cab.

#### Test4 - Is there any difference in the mean profit of these companies considering the top 5 cities with the highest profits?

In [173]:
df.groupby("City")["Profit"].sum().sort_values(ascending=False)[:5].index

Index(['NEW YORK NY', 'LOS ANGELES CA', 'WASHINGTON DC', 'CHICAGO IL',
       'BOSTON MA'],
      dtype='object', name='City')

In [174]:
# Creating an index list to select the top 5 cities
top5_cities = ["NEW YORK NY", "LOS ANGELES CA", "WASHINGTON DC", "CHICAGO IL", "BOSTON MA"]
index_list = list(df[df["City"].isin(top5_cities)].index)



In [175]:
# Creating a new dataframe with the top 5 cities
# (index_list holds index labels, so select with .loc rather than .iloc)
df3 = df.loc[index_list]

In [176]:
# The mean profit per ride for each company and city
df3.groupby(["Company", "City"])["Profit"].mean()

Company     City          
Pink Cab    BOSTON MA          50.520960
            CHICAGO IL         34.047910
            LOS ANGELES CA     56.669120
            NEW YORK NY       108.217540
            WASHINGTON DC      52.482761
Yellow Cab  BOSTON MA          61.483619
            CHICAGO IL         64.924486
            LOS ANGELES CA    116.656368
            NEW YORK NY       307.864252
            WASHINGTON DC      82.384912
Name: Profit, dtype: float64

H0: There is no difference in the mean profit per ride for companies Yellow Cab and Pink Cab, considering the top 5 most profitable cities.

H1: The mean profit per ride for company Yellow Cab is higher than the mean profit per ride for Pink Cab, considering the top 5 cities.

In [177]:
from scipy import stats
from scipy.stats import ttest_ind

sample1 = df3[df3["Company"] == "Pink Cab"]["Profit"]
sample2 = df3[df3["Company"] == "Yellow Cab"]["Profit"]

t_stat, p_value = ttest_ind(sample1, sample2)
print("T-statistic value: ", t_stat)  
print("P-Value: ", p_value)

T-statistic value:  -125.54270120924046
P-Value:  0.0


At the 0.05 significance level, the two-sample t-test rejects H0: in the top 5 cities, the mean profit per ride is higher for Yellow Cab than for Pink Cab.

#### Test5 - Is there any difference in the mean income of the customers who use each company's service?

In [182]:
df[df["Company"] == "Yellow Cab"].groupby("Customer ID")["Income (USD/Month)"].mean().mean()

14983.896831762582

In [183]:
df[df["Company"] == "Pink Cab"].groupby("Customer ID")["Income (USD/Month)"].mean().mean()

15034.097618311165

H0: There is no difference in the mean income of customers who use the Yellow Cab and Pink Cab services.

H1: There is a difference in the mean income of customers who use the Yellow Cab and Pink Cab services.

In [184]:
from scipy import stats
from scipy.stats import ttest_ind

sample1 = df[df["Company"] == "Pink Cab"].groupby("Customer ID")["Income (USD/Month)"].mean()
sample2 = df[df["Company"] == "Yellow Cab"].groupby("Customer ID")["Income (USD/Month)"].mean()

t_stat, p_value = ttest_ind(sample1, sample2)
print("T-statistic value: ", t_stat)  
print("P-Value: ", p_value)

T-statistic value:  0.838939895792237
P-Value:  0.40150581507996785


There is no statistically significant difference in the mean income of Yellow Cab and Pink Cab customers (p-value = 0.40). Note that customers who used both services were included in both samples of the test.
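
Since customers who used both services appear in both samples, the two samples are not fully independent. One possible variation, sketched below on a made-up toy table, is to restrict the comparison to customers who used exactly one company. The column names follow this notebook; the data and the variable names (`rides`, `exclusive_ids`) are assumptions for illustration only.

```python
import pandas as pd

# Toy rides table (made-up values) with the column names used in this notebook
rides = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 3, 4],
    "Company": ["Pink Cab", "Yellow Cab", "Pink Cab", "Yellow Cab", "Yellow Cab", "Pink Cab"],
    "Income (USD/Month)": [9000, 9000, 15000, 21000, 21000, 12000],
})

# Number of distinct companies each customer has used
n_companies = rides.groupby("Customer ID")["Company"].nunique()

# Keep only customers who used exactly one company
exclusive_ids = n_companies[n_companies == 1].index
exclusive = rides[rides["Customer ID"].isin(exclusive_ids)]

# One income value per customer and company, ready to pass to ttest_ind
income = exclusive.groupby(["Company", "Customer ID"])["Income (USD/Month)"].mean()
pink_income = income.loc["Pink Cab"]
yellow_income = income.loc["Yellow Cab"]
print(exclusive_ids.tolist())
```

Customer 1 used both companies in the toy table and is therefore dropped before the samples are built.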

#### Test6 - Is there any difference in the mean profit for different payment methods?

In [186]:
# The mean profit per payment method
df.groupby("Payment_Mode")["Profit"].mean()

Payment_Mode
Card    137.086461
Cash    137.502924
Name: Profit, dtype: float64

H0: There is no difference in the mean profit for both payment methods.

H1: There is a difference in the mean profit for both payment methods.

In [188]:
from scipy import stats
from scipy.stats import ttest_ind

sample1 = df[df["Payment_Mode"] == "Card"]["Profit"]
sample2 = df[df["Payment_Mode"] == "Cash"]["Profit"]

t_stat, p_value = ttest_ind(sample1, sample2)
print("T-statistic value: ", t_stat)  
print("P-Value: ", p_value)

T-statistic value:  -0.7630743349933286
P-Value:  0.4454195660215009


There is no significant difference in the mean profit between card and cash payments (p-value = 0.45).

#### Test7 - Is there any difference in the mean profit for holidays and non-holidays?

In [192]:
df[df["Holiday"] == "No"]["Profit"].mean()

137.55501953350472

In [193]:
df[df["Holiday"] != "No"]["Profit"].mean()

122.19940113250196

H0: There is no difference in the mean profit for holidays or non-holidays.

H1: There is a difference in the mean profit for holidays or non-holidays.

In [194]:
from scipy import stats
from scipy.stats import ttest_ind

sample1 = df[df["Holiday"] == "No"]["Profit"]
sample2 = df[df["Holiday"] != "No"]["Profit"]

t_stat, p_value = ttest_ind(sample1, sample2)
print("T-statistic value: ", t_stat)  
print("P-Value: ", p_value)

T-statistic value:  7.971760168134794
P-Value:  1.5688395014131125e-15


At the 0.05 significance level, the mean profit on holidays is significantly lower than on non-holidays.

#### Test8 - Is there any difference in the mean profit for weekdays and weekends?
* Weekends were considered to be Friday, Saturday and Sunday in this analysis

In [212]:
weekend_days = ["Friday", "Saturday", "Sunday"]
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday"]
df4 = df[df["day_of_week"].isin(weekend_days)]  # weekend rides
df5 = df[df["day_of_week"].isin(weekdays)]      # weekday rides


In [214]:
df4["Profit"].mean()

148.73054734067915

In [215]:
df5["Profit"].mean()

116.86500203173533

H0: There is no difference in the mean profit for weekends and weekdays.

H1: There is a difference in the mean profit for weekends and weekdays.

In [194]:
from scipy import stats
from scipy.stats import ttest_ind

sample1 = df4["Profit"]
sample2 = df5["Profit"]

t_stat, p_value = ttest_ind(sample1, sample2)
print("T-statistic value: ", t_stat)  
print("P-Value: ", p_value)

T-statistic value:  7.971760168134794
P-Value:  1.5688395014131125e-15


At the 0.05 significance level, the mean profit on weekends is significantly higher than on weekdays.
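
The weekend definition used in this test (Friday to Sunday) can also be expressed as a boolean flag column. The sketch below uses a made-up toy table; `"Date of Travel"` is an assumed column name and `WEEKEND_DAYS` is an illustrative constant.

```python
import pandas as pd

WEEKEND_DAYS = ["Friday", "Saturday", "Sunday"]  # weekend definition used in this analysis

# Toy data (made-up values); "Date of Travel" is an assumed column name
toy = pd.DataFrame({
    "Date of Travel": pd.to_datetime(["2018-01-01", "2018-01-05", "2018-01-06", "2018-01-07"]),
    "Profit": [100.0, 150.0, 140.0, 160.0],
})

# Derive the day name, then flag rides that fall on the extended weekend
toy["day_of_week"] = toy["Date of Travel"].dt.day_name()
toy["is_weekend"] = toy["day_of_week"].isin(WEEKEND_DAYS)

# Mean profit per ride for weekdays (False) and weekends (True)
mean_profit = toy.groupby("is_weekend")["Profit"].mean()
print(mean_profit)
```

A flag column keeps all rides in one dataframe, so the same grouping can be reused for counts, means and tests.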

#### Test9 - Is there any correlation between the distance, cost of trip, price charged and profit?

According to the previous analysis, the distance travelled has a very strong positive correlation with the cost of the trip (0.98, p-value < 0.001): the longer the distance, the higher the cost of the trip.
The distance travelled is also strongly correlated with the price charged (0.84, p-value < 0.001), although this correlation is slightly weaker than the previous one.
Finally, there is a weak to moderate correlation between the distance travelled and the profit (margin) (0.46, p-value < 0.001), possibly because some trips were not profitable.
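
A correlation coefficient and its p-value of this kind can be obtained with `scipy.stats.pearsonr`. The sketch below runs on synthetic data (not the cab dataset), and the slope and noise level are arbitrary assumptions. With large samples the p-value can underflow to 0.0 in floating point, which is better reported as "< 0.001".

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic trips (illustrative only): cost grows with distance plus noise
rng = np.random.default_rng(42)
km = rng.uniform(2, 48, size=500)
cost = 12.0 * km + rng.normal(0, 20, size=500)

# pearsonr returns the correlation coefficient and the two-sided p-value
r, p_value = pearsonr(km, cost)
print("Pearson r: ", r)
print("P-Value: ", p_value)
```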

#### Test10 - Is there a difference in the profit along the years for Yellow Cab company?

In [216]:
df6 = df[df["Company"] == "Yellow Cab"]
df7 = df[df["Company"] == "Pink Cab"]

In [217]:
df6.groupby("year")["Profit"].mean()

year
2016    169.347821
2017    168.817057
2018    143.416122
Name: Profit, dtype: float64

H0: There is no difference in the mean profit of Yellow Cab company for the years 2016, 2017 and 2018.

H1: At least one of the years has a mean profit for Yellow Cab that differs from the other years.

In [226]:
from scipy import stats
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

sample1 = df6[df6["year"] == 2016]["Profit"]
sample2 = df6[df6["year"] == 2017]["Profit"]
sample3 = df6[df6["year"] == 2018]["Profit"]

f_stat, p_value = f_oneway(sample1, sample2, sample3)
print("F-statistic value: ", f_stat)
print("P-Value: ", p_value)

tukey = pairwise_tukeyhsd(endog=df6['Profit'],
                          groups=df6['year'],
                          alpha=0.05)

print("\n")
print(tukey)

F-statistic value:  693.1451171764937
P-Value:  5.3455921384708516e-301


 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj   lower    upper   reject
------------------------------------------------------
  2016   2017  -0.5308 0.7895  -2.4295    1.368  False
  2016   2018 -25.9317   -0.0 -27.8485 -24.0149   True
  2017   2018 -25.4009   -0.0 -27.2327 -23.5692   True
------------------------------------------------------


According to the ANOVA and Tukey tests, the mean profit of Yellow Cab in 2018 was significantly lower than in 2016 and 2017.

#### Test11 - Is there a difference in the profit along the years for Pink Cab company?

In [218]:
df7.groupby("year")["Profit"].mean()

year
2016    68.321819
2017    67.070839
2018    53.229689
Name: Profit, dtype: float64

H0: There is no difference in the mean profit of Pink Cab company for the years 2016, 2017 and 2018.

H1: At least one of the years has a mean profit for Pink Cab that differs from the other years.

In [227]:
from scipy import stats
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

sample1 = df7[df7["year"] == 2016]["Profit"]
sample2 = df7[df7["year"] == 2017]["Profit"]
sample3 = df7[df7["year"] == 2018]["Profit"]

f_stat, p_value = f_oneway(sample1, sample2, sample3)
print("F-statistic value: ", f_stat)
print("P-Value: ", p_value)

tukey = pairwise_tukeyhsd(endog=df7['Profit'],
                          groups=df7['year'],
                          alpha=0.05)

print("\n")
print(tukey)

P-Value:  1.3845866439321786e-145


 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj   lower    upper   reject
------------------------------------------------------
  2016   2017   -1.251 0.1397  -2.7971   0.2951  False
  2016   2018 -15.0921    0.0 -16.6502  -13.534   True
  2017   2018 -13.8411    0.0 -15.3249 -12.3574   True
------------------------------------------------------


According to the ANOVA and Tukey tests, the mean profit of Pink Cab in 2018 was also significantly lower than in 2016 and 2017.

#### Test12 - Is there a difference in the profit per km along the years for Yellow Cab company?

In [229]:
df6.groupby("year")["Profit_dist"].mean()

year
2016    7.489847
2017    7.494612
2018    6.364805
Name: Profit_dist, dtype: float64

H0: There is no difference in the mean profit per km of Yellow Cab company for the years 2016, 2017 and 2018.

H1: At least one of the years has a mean profit per km for Yellow Cab that differs from the other years.

In [230]:
from scipy import stats
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

sample1 = df6[df6["year"] == 2016]["Profit_dist"]
sample2 = df6[df6["year"] == 2017]["Profit_dist"]
sample3 = df6[df6["year"] == 2018]["Profit_dist"]

f_stat, p_value = f_oneway(sample1, sample2, sample3)
print("F-statistic value: ", f_stat)
print("P-Value: ", p_value)

tukey = pairwise_tukeyhsd(endog=df6['Profit_dist'],
                          groups=df6['year'],
                          alpha=0.05)

print("\n")
print(tukey)

P-Value:  0.0


Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
  2016   2017   0.0048 0.9833 -0.0591  0.0686  False
  2016   2018   -1.125   -0.0 -1.1895 -1.0606   True
  2017   2018  -1.1298   -0.0 -1.1914 -1.0682   True
----------------------------------------------------


A similar pattern was observed for the profit per km of Yellow Cab: it was significantly lower in 2018 than in the previous years assessed, with no significant difference between 2016 and 2017.

#### Test13 - Is there a difference in the profit per km along the years for Pink Cab company?

In [231]:
df7.groupby("year")["Profit_dist"].mean()

year
2016    3.026813
2017    2.962883
2018    2.350447
Name: Profit_dist, dtype: float64

H0: There is no difference in the mean profit per km of Pink Cab company for the years 2016, 2017 and 2018.

H1: At least one of the years has a mean profit per km for Pink Cab that differs from the other years.

In [232]:
from scipy import stats
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

sample1 = df7[df7["year"] == 2016]["Profit_dist"]
sample2 = df7[df7["year"] == 2017]["Profit_dist"]
sample3 = df7[df7["year"] == 2018]["Profit_dist"]

f_stat, p_value = f_oneway(sample1, sample2, sample3)
print("F-statistic value: ", f_stat)
print("P-Value: ", p_value)

tukey = pairwise_tukeyhsd(endog=df7['Profit_dist'],
                          groups=df7['year'],
                          alpha=0.05)

print("\n")
print(tukey)

P-Value:  2.6571847629273957e-239


Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
  2016   2017  -0.0639 0.0144 -0.1175 -0.0103   True
  2016   2018  -0.6764    0.0 -0.7304 -0.6223   True
  2017   2018  -0.6124    0.0 -0.6639  -0.561   True
----------------------------------------------------


In the case of the Pink Cab company, it was observed that the profit per km significantly decreased year by year from 2016 to 2018.
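
Tests 10 to 13 repeat the same ANOVA steps with a different dataframe or column each time. A small helper can reduce the repetition and the risk of printing a stale variable. The sketch below is a hypothetical helper (`anova_by_group` is not part of the original notebook) shown on made-up toy data.

```python
import pandas as pd
from scipy.stats import f_oneway

def anova_by_group(data, value_col, group_col="year"):
    """One-way ANOVA of value_col across the levels of group_col."""
    # Build one sample per group level, then pass them all to f_oneway
    samples = [group[value_col].to_numpy() for _, group in data.groupby(group_col)]
    return f_oneway(*samples)

# Toy data (made-up values) using the column names from this notebook
toy = pd.DataFrame({
    "year": [2016] * 4 + [2017] * 4 + [2018] * 4,
    "Profit": [170, 168, 171, 169, 169, 167, 170, 168, 143, 145, 142, 144],
})

f_stat, p_value = anova_by_group(toy, "Profit")
print("F-statistic value: ", f_stat)
print("P-Value: ", p_value)
```

With the real data, the calls would look like `anova_by_group(df6, "Profit")` or `anova_by_group(df7, "Profit_dist")`.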

#### Assumptions:

* As there were more transactions in the transaction table than in the cab-data table, it was suspected that the transaction table came from a source that also covered other cab companies; therefore, the data not related to the Yellow Cab or Pink Cab companies were disregarded.
* The outliers identified in the Price Charged variable were kept in the dataset and used in the analysis, as it was assumed that the price charged could legitimately be higher under circumstances that we do not have the information to investigate further.
* The profit was calculated based on the difference between the price charged and the cost of the trip.
* Weekends were assumed to be Friday, Saturday and Sunday in any analysis involving the weekdays and weekends.
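
The derived variables used throughout the tests follow directly from the profit definition above. The sketch below shows them on a made-up toy table; `"Cost of Trip"` is an assumed column name, while `Profit`, `Profit_dist` and `PriceCharged_km` follow the names used in this notebook.

```python
import pandas as pd

# Toy rides (made-up values); "Cost of Trip" is an assumed column name
rides = pd.DataFrame({
    "Price Charged": [370.95, 125.20, 80.00],
    "Cost of Trip": [313.64, 97.63, 95.00],
    "KM Travelled": [30.45, 9.04, 10.00],
})

# Profit is the price charged minus the cost of the trip
rides["Profit"] = rides["Price Charged"] - rides["Cost of Trip"]
# Profit and price charged per km travelled
rides["Profit_dist"] = rides["Profit"] / rides["KM Travelled"]
rides["PriceCharged_km"] = rides["Price Charged"] / rides["KM Travelled"]

# Share of rides that lost money (negative profit)
unprofitable_share = (rides["Profit"] < 0).mean()
print(rides[["Profit", "Profit_dist", "PriceCharged_km"]])
print(unprofitable_share)
```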

#### Insights:

* The cab_data table contains 359,392 ride records for two companies (Yellow Cab and Pink Cab) covering the years 2016, 2017 and 2018.
* Yellow Cab (274,681 rides) had about 3 times as many rides as Pink Cab (84,711 rides) over the assessed years.
* The mean customer age was 35 years, with most customers between 18 and 40 years old. The mean income was about USD 15,000.00 per month, and most customers had an income below USD 25,000.00 per month.
* Most customers had only one transaction, meaning only one ride; however, many customers had a higher number of rides. The highest number of rides by a single customer in the 3-year period evaluated was 54.
* Washington DC, Boston MA, Los Angeles CA, Chicago IL and San Diego CA had the highest proportion of cab users among the 19 cities assessed.
* The top 5 cities by number of rides for Yellow Cab were New York NY, Chicago IL, Washington DC, Los Angeles CA and Boston MA. The top 5 cities by profit were exactly the same. These 5 cities accounted for about 85% of the total profit of this company over the years 2016, 2017 and 2018.
* The top 5 cities by number of rides for Pink Cab were Los Angeles CA, New York NY, San Diego CA, Chicago IL and Boston MA. The most profitable cities were the same, except that Boston MA was replaced by the region of Silicon Valley. These top 5 cities accounted for about 74% of the total profit of this company over the years 2016, 2017 and 2018.
* Comparing the number of rides per company per city, Yellow Cab had more rides in all but 4 of the 19 cities: Nashville TN, Pittsburgh PA, Sacramento CA and San Diego CA.
* In terms of the proportion of profitable rides, Yellow Cab outperformed Pink Cab in even more cities (18 out of 19); the exception was Tucson AZ, where Pink Cab was more profitable over the 3-year period.
* A significant, strong positive correlation (0.98, p-value < 0.001) was observed between the distance travelled and the cost of the ride; however, the correlation between the distance travelled and the profit was weaker (0.46, p-value < 0.001).
* Some rides were not profitable for either company during the 3-year period evaluated: 13.1% of the rides for Pink Cab and 5.0% for Yellow Cab.
* Yellow Cab had a higher profit per distance travelled (USD 7.11/km) than Pink Cab (USD 2.77/km) (p-value < 0.001).
* Most trips happened on weekends (Friday, Saturday and Sunday) rather than on weekdays. The mean profit per ride was USD 148.73 on weekends compared to USD 116.87 on weekdays (p-value < 0.001).
* The last months of the year, especially September to December, had more rides than the first months of the year. Yellow Cab provided over 25,000 rides in each of these months, peaking at more than 35,000 rides in December. Pink Cab had fewer rides but followed a similar pattern, with its busiest months also from September to December and a maximum of about 11,500 rides in December.
* For monthly profit, Pink Cab showed a similar pattern, with higher profits from September to December. For Yellow Cab, however, although May and June had far fewer rides than September and October, the profit made in May and June was very close to the profit made in September and October.
* Comparing yearly profits, both companies showed an increase in total profit between 2016 and 2017, followed by a significant decrease in 2018 compared to the previous years (p-value < 0.001).
* 11,442 customers out of 46,148 (approximately 25%) had only 1 ride, whereas 34,706 customers had more than one ride.
* Of the 34,706 customers who had at least 2 rides, 26,078 (approximately 75%) used both services, whereas the remaining 8,628 customers were loyal to one company.
* The mean profit per customer was higher for Yellow Cab (USD 129.37) than for Pink Cab (USD 58.51) over the years assessed (p-value < 0.001).
* There is no significant difference in the mean income for users of Yellow Cab and Pink Cab companies.
* Non-holidays generated a higher mean profit per ride than national holidays (USD 137.56 vs. USD 122.20).
* The mean profit per ride was higher on weekends (USD 148.73) than on weekdays (USD 116.87), with weekends defined as Friday, Saturday and Sunday.
* The mean profit per km decreased significantly year by year from 2016 to 2018 for Pink Cab, whereas for Yellow Cab it was significantly lower only in 2018 compared to the two previous years.
* A strong positive correlation (0.94, p-value < 0.001) was observed between the number of customers using the service per day and the daily profit.
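
The counts and percentages quoted in the list above can be cross-checked with simple arithmetic. The snippet below only re-uses figures stated in this section; the variable names are illustrative.

```python
# Counts quoted in the insights above
rides_yellow, rides_pink = 274_681, 84_711
customers_total, customers_one_ride = 46_148, 11_442
customers_multi, customers_both = 34_706, 26_078

total_rides = rides_yellow + rides_pink
print(total_rides)                                           # 359392
print(round(rides_yellow / rides_pink, 2))                   # 3.24
print(round(100 * customers_one_ride / customers_total, 1))  # 24.8
print(round(100 * customers_both / customers_multi, 1))      # 75.1
print(customers_multi - customers_both)                      # 8628
```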


#### Conclusion:

* Most rides were provided by the Yellow Cab company, which also has a higher number of customers.
* The profit per km and the profit per customer were higher for the Yellow Cab company in the years assessed (2016, 2017 and 2018).
* Investing in the Yellow Cab company is recommended, although both companies showed a significant decrease in profit in 2018.