# Data Analysis and Reporting with SQL: Project (15th November 2020)

## Overview 


Over the past few years, ride sharing apps have been on the rise in many cities around the world. Unlike public transport fares, Uber and Lyft ride prices are not constant: they are strongly affected by the demand for and supply of rides at a given time. 

As a Data Scientist working to understand this market, you have been tasked with producing a **descriptive analysis report** to help a ride sharing startup entering the space understand how pricing works for the existing ride sharing companies. 

Luckily, you were able to access some real-time data from Uber and Lyft's APIs, along with weather conditions from a weather API.
You built a custom application in Scala to query the data at regular intervals and save it to DynamoDB. Cab ride estimates were queried every 5 minutes and weather data every hour. 

The cab ride data covers various types of cabs for Uber & Lyft and their price for the given location. Weather data contains weather attributes like temperature, rain, cloud, etc for all the locations taken into consideration.

Now that you have your data in the given dataset, write SQL queries to perform descriptive analysis highlighting key insights that would be helpful for the startup in developing a new product.



## Useful Information

Specific research question would then be : <em>**How do demand, supply as well as weather patterns affect cab pricing by Uber and Lyft?** </em>

1.   Metric for success : The main goal here is **to deduce factors affecting cab pricing**. We will not be making any predictions, so there is no specific metric on which to base our success. Success is determined by whether we can deduce useful relationships in the data that help us understand the factors that influence cab pricing variations.
2.   Get started by asking, for instance, "When do prices hit a high during 3 periods of the day?" This gives you a feel for average pricing over a single day. You could select a random day and break it into periods of the day: early morning (5am - 10am), mid morning (etc.) and afternoon (etc.). 
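
The period-of-day question can be sketched with a `CASE` expression over the hour of `time_stamp`. This is only an illustration: the rows are made up, and the hour boundaries for each period are assumptions rather than part of the brief.

```python
import sqlite3

# Toy rows for illustration only; the notebook itself queries the `cabs` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cabs (time_stamp TEXT, price REAL)")
conn.executemany(
    "INSERT INTO cabs VALUES (?, ?)",
    [
        ("2018-11-28 06:15:00.000", 10.0),
        ("2018-11-28 08:30:00.000", 14.0),
        ("2018-11-28 11:00:00.000", 9.0),
        ("2018-11-28 15:45:00.000", 12.0),
        ("2018-11-28 16:10:00.000", 18.0),
    ],
)

# The hour boundaries below are assumptions chosen for the example.
query = """
SELECT CASE
         WHEN CAST(strftime('%H', time_stamp) AS INTEGER) BETWEEN 5 AND 9   THEN 'early morning'
         WHEN CAST(strftime('%H', time_stamp) AS INTEGER) BETWEEN 10 AND 11 THEN 'mid morning'
         WHEN CAST(strftime('%H', time_stamp) AS INTEGER) BETWEEN 12 AND 17 THEN 'afternoon'
         ELSE 'other'
       END AS period,
       AVG(price) AS avg_price
FROM cabs
WHERE date(time_stamp) = '2018-11-28'   -- the single day being examined
GROUP BY period
ORDER BY avg_price DESC
"""
period_prices = conn.execute(query).fetchall()
print(period_prices)
```

The same `SELECT` could be pasted into a `%%sql` cell once the `cabs` table has been persisted.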


## Project Deliverable

The expected deliverable for this project will be an SQL notebook with your data analysis in SQL. You will need to create this notebook based on your understanding of the approaches that we've taken in previous workshop projects.


## Dataset Explained


Weather Dataset URL = https://bit.ly/cabsweatherdata <br>
Cabs Dataset URL = https://bit.ly/cabsdataset <br> <br>

Cab Rides Dataset
1. distance: distance between source and destination. 
2. cab_type: Uber or Lyft
3. time_stamp: time when data was queried
4. destination: destination of the ride
5. source: the starting point of the ride
6. price: price estimate for the ride in USD
7. surge_multiplier: the multiplier by which price was increased, default 1
8. id: unique identifier
9. product_id: uber/lyft identifier for cab-type
10. name: Visible type of the cab eg: Uber Pool, UberXL

Weather Dataset
1. temp: Temperature 
2. location: Location name
3. clouds: Clouds
4. pressure: pressure in mb
5. rain: rain in inches for the last hr
6. time_stamp: time when row data was collected
7. humidity: humidity in %
8. wind: wind speed in mph

Project Source: https://bit.ly/2AKnlBL


## Step 1. Pre-requisites

In [None]:
# Import the pandas library, which we use to read data from an external
# source and to manipulate it.
import pandas as pd

# Loading SQL extension which will allow us to run SQL code in our Notebook.
%load_ext sql

# Connecting to an in-memory SQLite database within colaboratory
%sql sqlite://

'Connected: @None'

## Step 2. Preparing Datasets

### Step 2.1 Importing Datasets

The dataset on cab rides

In [None]:
# Load the cab rides dataset from an external csv file and store it in a dataframe called cabs
cabs = pd.read_csv('https://bit.ly/cabsdataset')

# Store the dataset in our in-memory SQLite database. As a safeguard, first
# check whether the table already exists in the database and, if so, drop it.
%sql DROP TABLE if EXISTS cabs

# Finally create an SQL table of our sqlite database and store 
# the recordset in readiness for further analysis with SQL 
%sql PERSIST cabs;

 * sqlite://
Done.
 * sqlite://


'Persisted cabs'

The dataset on weather

In [None]:
# Load the weather dataset from an external csv file and store it in a dataframe "weather"
weather = pd.read_csv('https://bit.ly/cabsweatherdata')

#check for table existence and drop it accordingly.
%sql DROP TABLE if EXISTS weather;

# Finally create an SQL table in the sqlite database running in memory and store 
# the recordset in readiness for further analysis with SQL 
%sql PERSIST weather; 


 * sqlite://
Done.
 * sqlite://


'Persisted weather'

### Step 2.2 Exploratory Data Analysis

In [None]:
# sampling the cab rides dataset

%sql SELECT * FROM cabs LIMIT 10;

 * sqlite://
Done.


index,Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name
0,0,0.44,Lyft,2018-12-16 09:30:07.890,North Station,Haymarket Square,5.0,1.0,424553bb-7174-41ea-aeb4-fe06d4f4b9d7,lyft_line,Shared
1,1,0.44,Lyft,2018-11-27 02:00:23.677,North Station,Haymarket Square,11.0,1.0,4bd23055-6827-41c6-b23b-3c491f24e74d,lyft_premier,Lux
2,2,0.44,Lyft,2018-11-28 01:00:22.198,North Station,Haymarket Square,7.0,1.0,981a3613-77af-4620-a42a-0c0866077d1e,lyft,Lyft
3,3,0.44,Lyft,2018-11-30 04:53:02.749,North Station,Haymarket Square,26.0,1.0,c2d88af2-d278-4bfd-a8d0-29ca77cc5512,lyft_luxsuv,Lux Black XL
4,4,0.44,Lyft,2018-11-29 03:49:20.223,North Station,Haymarket Square,9.0,1.0,e0126e1f-8ca9-4f2e-82b3-50505a09db9a,lyft_plus,Lyft XL
5,5,0.44,Lyft,2018-12-17 18:25:12.138,North Station,Haymarket Square,16.5,1.0,f6f6d7e4-3e18-4922-a5f5-181cdd3fa6f2,lyft_lux,Lux Black
6,6,1.08,Lyft,2018-11-26 05:03:00.200,Northeastern University,Back Bay,10.5,1.0,462816a3-820d-408b-8549-0b39e82f65ac,lyft_plus,Lyft XL
7,7,1.08,Lyft,2018-12-02 19:53:04.677,Northeastern University,Back Bay,16.5,1.0,474d6376-bc59-4ec9-bf57-4e6d6faeb165,lyft_lux,Lux Black
8,8,1.08,Lyft,2018-12-03 06:28:02.645,Northeastern University,Back Bay,3.0,1.0,4f9fee41-fde3-4767-bbf1-a00e108701fb,lyft_line,Shared
9,9,1.08,Lyft,2018-11-27 10:45:22.249,Northeastern University,Back Bay,27.5,1.0,8612d909-98b8-4454-a093-30bd48de0cb3,lyft_luxsuv,Lux Black XL


In [None]:
# sampling the weather dataset

%sql SELECT * FROM weather LIMIT 5;

 * sqlite://
Done.


index,Unnamed: 0,temp,location,clouds,pressure,rain,time_stamp,humidity,wind
0,0,42.42,Back Bay,1.0,1012.14,0.1228,2018-12-16 23:45:01,0.77,11.25
1,1,42.43,Beacon Hill,1.0,1012.15,0.1846,2018-12-16 23:45:01,0.76,11.32
2,2,42.5,Boston University,1.0,1012.15,0.1089,2018-12-16 23:45:01,0.76,11.07
3,3,42.11,Fenway,1.0,1012.13,0.0969,2018-12-16 23:45:01,0.77,11.09
4,4,43.13,Financial District,1.0,1012.14,0.1786,2018-12-16 23:45:01,0.75,11.49


In [None]:
#we seek to know the number of records that exist for each of the datasets.

%sql SELECT COUNT(*) FROM cabs;

 * sqlite://
Done.


COUNT(*)
693071


In [None]:
%sql SELECT COUNT (*) FROM weather;

 * sqlite://
Done.


COUNT (*)
6276


There are a total of `693,071` cab ride records in the dataset and `6,276` weather records. Note that cab ride estimates were queried every 5 minutes, while weather records were collected every hour.

In [None]:
# let us determine the time period of the data that we use for analysis.
# pick the cab ride records based on time to determine the start and end of data collection.
# start  time stamp for the collection of cab rides data.

%%sql SELECT * , datetime(time_stamp) as TIME_STAMP1 from cabs
ORDER BY datetime(time_stamp) ASC
LIMIT 4;

 * sqlite://
Done.


index,Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name,TIME_STAMP1
10462,10462,2.4,Lyft,2018-11-26 03:40:46.972,Beacon Hill,Fenway,19.5,1.25,c241e674-57c8-4b00-bcbb-f1f59a98fac5,lyft_plus,Lyft XL,2018-11-26 03:40:46
13895,13895,2.01,Lyft,2018-11-26 03:40:46.527,South Station,North Station,16.5,1.0,65e9134a-dff4-45b9-81e3-90ba4cae702f,lyft_premier,Lux,2018-11-26 03:40:46
23285,23285,4.32,Lyft,2018-11-26 03:40:46.528,Northeastern University,Financial District,22.5,1.0,716f939f-56eb-425a-b125-64246b8e9907,lyft_plus,Lyft XL,2018-11-26 03:40:46
29102,29102,1.16,Uber,2018-11-26 03:40:46.421,Theatre District,Haymarket Square,,1.0,63fbb593-95ba-4b74-88d1-e3eb0d99fe33,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi,2018-11-26 03:40:46


In [None]:
# end time stamp for the collection of cab rides data.

%%sql SELECT * , datetime(time_stamp) as TIME_STAMP1 from cabs
ORDER BY datetime(time_stamp) DESC
LIMIT 4;

 * sqlite://
Done.


index,Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name,TIME_STAMP1
10350,10350,4.55,Uber,2018-12-18 19:15:10.579,Financial District,Northeastern University,30.5,1.0,9d29a29d-4d8f-44c4-b400-621b4f33301b,6c84fd89-3f11-4782-9b50-97c468b19529,Black,2018-12-18 19:15:10
17138,17138,3.91,Uber,2018-12-18 19:15:10.603,Financial District,Boston University,12.0,1.0,7273d46d-3270-43ef-9fbc-c218f45c3496,997acbb5-e102-41e1-b155-9df7de0a73f2,UberPool,2018-12-18 19:15:10
17871,17871,0.46,Lyft,2018-12-18 19:15:10.762,South Station,Financial District,3.0,1.0,e0497e3a-3ddc-4ee9-aeec-0a539315d42b,lyft_line,Shared,2018-12-18 19:15:10
18039,18039,2.48,Uber,2018-12-18 19:15:10.943,South Station,Beacon Hill,14.0,1.0,e7ff1618-2d8b-4d13-a798-593858064a9e,6f72dfc5-27f1-42e8-84db-ccc7a75f6969,UberXL,2018-12-18 19:15:10


In [None]:
# weather records... Earliest data stamp.

%%sql 
SELECT * , strftime('%Y-%m-%d %H:%M:%f', time_stamp) as new_time from weather
ORDER BY datetime(strftime('%Y-%m-%d %H:%M:%f', time_stamp)) ASC
LIMIT 5;

 * sqlite://
Done.


index,Unnamed: 0,temp,location,clouds,pressure,rain,time_stamp,humidity,wind,new_time
2509,2509,40.98,Haymarket Square,0.87,1014.4,,2018-11-26 03:40:44,0.92,1.57,2018-11-26 03:40:44.000
2510,2510,40.81,Northeastern University,0.89,1014.35,,2018-11-26 03:40:44,0.93,1.36,2018-11-26 03:40:44.000
2511,2511,40.86,South Station,0.87,1014.39,,2018-11-26 03:40:44,0.93,1.6,2018-11-26 03:40:44.000
2512,2512,40.84,West End,0.87,1014.4,,2018-11-26 03:40:44,0.93,1.52,2018-11-26 03:40:44.000
3998,3998,41.04,Back Bay,0.87,1014.39,,2018-11-26 03:40:45,0.92,1.46,2018-11-26 03:40:45.000


In [None]:
# weather records... end data stamp for data collection.

%%sql 
SELECT * , strftime('%Y-%m-%d %H:%M:%f', time_stamp) as new_time from weather
ORDER BY datetime(strftime('%Y-%m-%d %H:%M:%f', time_stamp)) DESC
LIMIT 5;

 * sqlite://
Done.


index,Unnamed: 0,temp,location,clouds,pressure,rain,time_stamp,humidity,wind,new_time
483,483,30.8,Beacon Hill,0.0,1012.32,,2018-12-18 18:45:02,0.46,13.08,2018-12-18 18:45:02.000
484,484,30.96,Boston University,0.0,1012.35,,2018-12-18 18:45:02,0.45,12.93,2018-12-18 18:45:02.000
485,485,30.93,Fenway,0.0,1012.35,,2018-12-18 18:45:02,0.45,12.99,2018-12-18 18:45:02.000
486,486,31.19,Financial District,0.0,1012.31,,2018-12-18 18:45:02,0.45,13.18,2018-12-18 18:45:02.000
487,487,30.83,North Station,0.0,1012.32,,2018-12-18 18:45:02,0.46,13.09,2018-12-18 18:45:02.000


Data in these recordsets was collected between 2018-11-26 03:40:46 and 2018-12-18 19:15:10 for the cab rides and between 2018-11-26 03:40:44 and 2018-12-18 18:45:02 for the weather. Any insights obtained therefore are based on this period of business transactions. 
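
The collection window can also be read off in a single query with `MIN()` and `MAX()` instead of sorting the whole table twice. A minimal sketch, assuming a few made-up timestamps in the same text format as the dataset:

```python
import sqlite3

# Toy timestamps for illustration; the notebook would run this against `cabs`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cabs (time_stamp TEXT)")
conn.executemany(
    "INSERT INTO cabs VALUES (?)",
    [
        ("2018-12-16 09:30:07.890",),
        ("2018-11-26 03:40:46.972",),
        ("2018-12-18 19:15:10.579",),
    ],
)

# datetime() drops the fractional seconds, as in the notebook's TIME_STAMP1 column.
first_seen, last_seen = conn.execute(
    "SELECT MIN(datetime(time_stamp)), MAX(datetime(time_stamp)) FROM cabs"
).fetchone()
print(first_seen, last_seen)
```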

In [None]:
# checking for NULL values. in my preliminary examination of the datasets using MS Excel, I was able to determine
# that cab rides had NULL values in the price column. Weather dataset too has some NULL entries in the  "rain" field.
# I interpret this to mean no rain and so will populate these with zeros.

%%sql 
SELECT *  FROM weather
WHERE rain ISNULL
LIMIT 4;

 * sqlite://
Done.


index,Unnamed: 0,temp,location,clouds,pressure,rain,time_stamp,humidity,wind
11,11,43.28,Back Bay,0.81,990.81,,2018-11-27 19:45:20,0.71,8.3
12,12,43.27,Beacon Hill,0.8,990.8,,2018-11-27 19:45:20,0.71,8.3
13,13,43.35,Boston University,0.82,990.82,,2018-11-27 19:45:20,0.71,8.24
14,14,43.07,Fenway,0.82,990.82,,2018-11-27 19:45:20,0.72,8.28


In [None]:
# count of NULL records.

%%sql
SELECT COUNT (*) as [No. of NULL] FROM weather
WHERE rain ISNULL;

 * sqlite://
Done.


No. of NULL
5382


In [None]:
# Update the NULLS to zeros in the rain column

%%sql
UPDATE weather
SET rain = 0
WHERE rain ISNULL;

 * sqlite://
5382 rows affected.


[]

In [None]:
# checking on the records with NULL in cab rides dataset.

%%sql 
SELECT * FROM cabs
WHERE price ISNULL
LIMIT 5;

 * sqlite://
Done.


index,Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name
18,18,1.11,Uber,2018-12-01 14:13:04.211,West End,North End,,1.0,fa5fb705-03a0-4eb9-82d9-7fe80872f754,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi
31,31,2.48,Uber,2018-12-02 23:52:56.318,South Station,Beacon Hill,,1.0,eee70d94-6706-4b95-a8ce-0e34f0fa8f37,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi
40,40,2.94,Uber,2018-11-29 20:38:05.298,Fenway,North Station,,1.0,7f47ff53-7cf2-4a6a-8049-83c90e042593,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi
60,60,1.16,Uber,2018-12-13 20:10:16.318,West End,North End,,1.0,43abdbe4-ab9e-4f39-afdc-31cfa375dc25,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi
69,69,2.67,Uber,2018-11-30 13:08:03.653,Beacon Hill,North End,,1.0,80db1c49-9d51-4575-a4f4-1ec23b4d3e31,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi


In [None]:
# checking the nature of these records.. from which company they are etc.

%%sql
SELECT DISTINCT cab_type FROM cabs
WHERE price ISNULL;

 * sqlite://
Done.


cab_type
Uber


In [None]:
# we notice that all records where prices have not been updated are from Uber and we need to know the count.

%%sql
SELECT COUNT(cab_type) FROM cabs
WHERE price ISNULL;

 * sqlite://
Done.


COUNT(cab_type)
55095


In [None]:
# determine the total number of Uber records so as to see if the NULLS are significant.

%%sql 
SELECT COUNT(cab_type) FROM cabs
WHERE cab_type = 'Uber';

 * sqlite://
Done.


COUNT(cab_type)
385663


In [None]:
# drop all the records where the price is NULL as the price feature is important in our analysis.

%%sql
DELETE FROM cabs
WHERE price ISNULL;

 * sqlite://
55095 rows affected.


[]

 From this check of null values, we notice that: 
1.  Out of 6,276 records in the weather dataset, 5,382 have a NULL `rain` value. We interpret these as times when it did not rain and update the NULLs to zeros, meaning 0 inches of rain.
2. The price feature of the cab rides has `55,095` NULL entries, all of them Uber records. Pricing is one of the most important features of the dataset in this analysis, so these rows are dropped; they represent only about 14% of all Uber records in the dataset.
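
The share of NULL prices per company can be computed directly in SQL before deciding whether to drop the rows. The sketch below uses made-up rows; only the final arithmetic line uses the counts reported in the notebook.

```python
import sqlite3

# Toy rows for illustration; None becomes NULL in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cabs (cab_type TEXT, price REAL)")
conn.executemany(
    "INSERT INTO cabs VALUES (?, ?)",
    [
        ("Uber", None), ("Uber", 12.0), ("Uber", 8.5), ("Uber", None),
        ("Lyft", 7.0), ("Lyft", 11.0),
    ],
)

# `price IS NULL` evaluates to 1 or 0, so SUM() counts the NULL rows.
query = """
SELECT cab_type,
       COUNT(*)                                        AS total,
       SUM(price IS NULL)                              AS null_prices,
       ROUND(100.0 * SUM(price IS NULL) / COUNT(*), 2) AS null_pct
FROM cabs
GROUP BY cab_type
ORDER BY cab_type
"""
null_share = conn.execute(query).fetchall()
print(null_share)

# The same ratio with the counts reported in the notebook (55,095 of 385,663 Uber rows).
notebook_pct = 100 * 55095 / 385663
print(round(notebook_pct, 2))
```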


**After this exercise, we now have a clean set of datasets that we can use for our analysis.** 

## Step 3. Business Needs
Our main interest is to interrogate the available data so as to deduce useful relationships, if any, between the dataset features that can help us understand the factors that influence cab pricing variations. <br> Rephrasing this into a research question: **"How do Uber and Lyft vary their cab pricing in relation to demand and supply?"**

**1. How many distinct cab types are reported on and further, among these cab types, which is the mode type?**

In [None]:
# We group by the 'cab_type' column and apply the COUNT() function to get the
# number of records for each company.

%%sql 
SELECT cab_type, count(cab_type) as 'frequency'
FROM cabs
GROUP BY cab_type
ORDER BY count(cab_type) DESC;

 * sqlite://
Done.


cab_type,frequency
Uber,330568
Lyft,307408


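After the NULL-price rows are dropped, `Uber` accounts for about `51.8%` of the sampled ride records and `Lyft` for about `48.2%`. Because the records are price estimates queried at regular intervals, this split says more about how the data was collected than about rider preference. 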
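
The percentage split can be produced in the same query as the counts by dividing each group's count by the table total. A small sketch on made-up rows:

```python
import sqlite3

# Toy rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cabs (cab_type TEXT)")
conn.executemany(
    "INSERT INTO cabs VALUES (?)",
    [("Uber",), ("Uber",), ("Uber",), ("Lyft",)],
)

# The scalar subquery supplies the overall row count used as the denominator.
query = """
SELECT cab_type,
       COUNT(*) AS frequency,
       ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM cabs), 2) AS share_pct
FROM cabs
GROUP BY cab_type
ORDER BY frequency DESC
"""
shares = conn.execute(query).fetchall()
print(shares)
```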



---



**2. Get the feel of the charges by picking the highest cost of a ride, min cost and the average cab ride and further, select the top 15 ride records in terms of travel cost ordered by the cab type then by cost so that we know the other associated details like the car type, distance etc.**

In [None]:
#highest registered charge for a ride
%%sql
select *, max(price)
from cabs

 * sqlite://
Done.


index,Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name,max(price)
597071,597071,4.43,Lyft,2018-12-02 01:28:02.123,Fenway,Financial District,97.5,2.0,ba1593a1-e4fd-4c7a-a011-e2d4fccbf081,lyft_luxsuv,Lux Black XL,97.5


In [None]:
#lowest registered charge for a ride

%%sql
select *, min(price)
from cabs

 * sqlite://
Done.


index,Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name,min(price)
5901,5901,1.53,Lyft,2018-11-28 23:33:41.778,Back Bay,Boston University,2.5,1.0,f6ed86e6-c3f1-42f5-9ce9-bb4bea19f18e,lyft_line,Shared,2.5


In [None]:
#Average cost for a ride.

%%sql
SELECT avg(price)
FROM cabs

 * sqlite://
Done.


avg(price)
16.545125490614065


 From the above exploration, we notice the following:
1. The highest price recorded for a ride is by Lyft, for the journey from Financial District to Fenway, at a cost of `$97.5` (with a multiplier of 2.0) on 2nd Dec 2018.
2.  The lowest price recorded for a ride is also by Lyft, for the journey from Boston University to Back Bay, at a cost of `$2.5` (default price, no multiplier) on 28th Nov 2018.
3. The average cost of a ride is about `$16.55`.
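
The three summary figures can be fetched in one aggregate query, and the full record behind the maximum can be retrieved with `ORDER BY ... LIMIT 1`, which avoids mixing `*` with an aggregate. A minimal sketch on made-up rows:

```python
import sqlite3

# Toy rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cabs (cab_type TEXT, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO cabs VALUES (?, ?, ?)",
    [
        ("Lyft", "Shared", 2.5),
        ("Uber", "UberX", 7.5),
        ("Uber", "Black", 16.5),
        ("Lyft", "Lux Black XL", 97.5),
    ],
)

# Minimum, maximum and average price in a single pass over the table.
low, high, mean = conn.execute(
    "SELECT MIN(price), MAX(price), ROUND(AVG(price), 2) FROM cabs"
).fetchone()

# The complete record for the most expensive ride.
priciest = conn.execute(
    "SELECT cab_type, name, price FROM cabs ORDER BY price DESC LIMIT 1"
).fetchone()
print(low, high, mean, priciest)
```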



In [None]:
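# Get the average price for rides between distinct source/destination pairs.
# Note: distance and price are not aggregated, so SQLite shows the values of
# one arbitrary row per route; only AVG(price) summarises the whole route.

%%sql
SELECT "source" || ' - ' || "destination" as route, distance, cast(price as float) as price, AVG(price)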
FROM cabs
GROUP BY route
ORDER BY AVG(price) DESC;

 * sqlite://
Done.


route,distance,price,AVG(price)
Financial District - Boston University,5.42,47.5,25.498434004474277
Boston University - Financial District,4.72,13.0,24.146085011185683
Fenway - Financial District,4.4,19.5,23.4388178555406
Financial District - Fenway,4.58,19.5,23.404849910394265
Northeastern University - Financial District,3.79,11.5,22.582093757043047
Financial District - Northeastern University,4.13,9.0,21.91858367162833
Theatre District - Boston University,3.07,13.5,20.360661724626624
Boston University - North Station,3.39,26.0,20.185337915234825
Northeastern University - North Station,3.22,11.0,19.910938903863432
Fenway - North Station,3.07,11.0,19.701839464882944


 The above information on average pricing will help the incoming ride company to estimate the pricing between different sources and destinations.
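
When estimating prices per route, it may help to report how many records each average rests on. The sketch below, on made-up rows, aggregates every selected column so that each output value describes the whole route.

```python
import sqlite3

# Toy rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cabs (source TEXT, destination TEXT, price REAL)")
conn.executemany(
    "INSERT INTO cabs VALUES (?, ?, ?)",
    [
        ("Fenway", "Back Bay", 10.0),
        ("Fenway", "Back Bay", 20.0),
        ("Back Bay", "Fenway", 12.0),
    ],
)

# Every non-grouped column is wrapped in an aggregate function.
query = """
SELECT source || ' - ' || destination AS route,
       COUNT(*)                       AS n_records,
       ROUND(AVG(price), 2)           AS avg_price
FROM cabs
GROUP BY route
ORDER BY avg_price DESC
"""
route_prices = conn.execute(query).fetchall()
print(route_prices)
```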

In [None]:
%%sql
SELECT cab_type, name,  "source" ||" - " || "destination" as route, distance, time_stamp, cast(price as float) as price_f, id, product_id
FROM cabs
ORDER BY price_f DESC
LIMIT 10;

 * sqlite://
Done.


cab_type,name,route,distance,time_stamp,price_f,id,product_id
Lyft,Lux Black XL,Financial District - Fenway,4.43,2018-12-02 01:28:02.123,97.5,ba1593a1-e4fd-4c7a-a011-e2d4fccbf081,lyft_luxsuv
Lyft,Lux Black XL,Fenway - Financial District,3.89,2018-11-28 21:41:35.635,92.0,edb9ba13-b129-487f-be93-2e3abdd700a3,lyft_luxsuv
Lyft,Lux Black XL,Boston University - Financial District,3.75,2018-12-13 17:15:14.374,92.0,e0032a2b-c328-457e-adb5-79c3d83064a8,lyft_luxsuv
Lyft,Lux Black XL,Financial District - Boston University,5.39,2018-11-29 01:02:08.714,92.0,6f17623a-f97c-4379-9188-4f3e07d1c48e,lyft_luxsuv
Lyft,Lux Black XL,Financial District - Boston University,5.39,2018-11-27 22:03:21.754,92.0,fe9f56b5-e54a-48b4-b702-4e625adcf0ab,lyft_luxsuv
Lyft,Lux Black XL,Financial District - Boston University,5.36,2018-12-17 23:45:13.982,92.0,16373de7-5ef0-45a7-86d8-842bd12780e9,lyft_luxsuv
Lyft,Lux Black XL,Financial District - Boston University,5.37,2018-12-01 20:07:59.927,92.0,2a3f4de9-7a24-4580-8079-a7953a1431e3,lyft_luxsuv
Lyft,Lux Black XL,Boston University - Financial District,3.75,2018-11-27 17:03:22.477,92.0,75c1c9bf-38a7-4cb6-81c1-e2effaaaedbd,lyft_luxsuv
Lyft,Lux Black XL,Boston University - Financial District,4.39,2018-12-14 03:40:07.486,92.0,fa0c51a4-acc0-4c49-8244-561e9ad945e3,lyft_luxsuv
Lyft,Lux Black XL,Boston University - Financial District,4.37,2018-12-16 06:00:05.181,92.0,71ad1473-ebc6-4b98-80c6-0f56046af3e2,lyft_luxsuv


In [None]:
# We intend to find out the relationship between the cost of the ride and the cab type
# and also the time of the day that these trips happen.

%%sql
SELECT cab_type, name,  "source" ||" - " || "destination" as route, distance, time_stamp, cast(price as float) as price, id, product_id, surge_multiplier
FROM cabs
ORDER BY price DESC
LIMIT 100;

 * sqlite://
Done.


cab_type,name,route,distance,time_stamp,price,id,product_id,surge_multiplier
Lyft,Lux Black XL,Financial District - Fenway,4.43,2018-12-02 01:28:02.123,97.5,ba1593a1-e4fd-4c7a-a011-e2d4fccbf081,lyft_luxsuv,2.0
Lyft,Lux Black XL,Fenway - Financial District,3.89,2018-11-28 21:41:35.635,92.0,edb9ba13-b129-487f-be93-2e3abdd700a3,lyft_luxsuv,2.0
Lyft,Lux Black XL,Boston University - Financial District,3.75,2018-12-13 17:15:14.374,92.0,e0032a2b-c328-457e-adb5-79c3d83064a8,lyft_luxsuv,2.0
Lyft,Lux Black XL,Financial District - Boston University,5.39,2018-11-29 01:02:08.714,92.0,6f17623a-f97c-4379-9188-4f3e07d1c48e,lyft_luxsuv,2.0
Lyft,Lux Black XL,Financial District - Boston University,5.39,2018-11-27 22:03:21.754,92.0,fe9f56b5-e54a-48b4-b702-4e625adcf0ab,lyft_luxsuv,2.0
Lyft,Lux Black XL,Financial District - Boston University,5.36,2018-12-17 23:45:13.982,92.0,16373de7-5ef0-45a7-86d8-842bd12780e9,lyft_luxsuv,2.0
Lyft,Lux Black XL,Financial District - Boston University,5.37,2018-12-01 20:07:59.927,92.0,2a3f4de9-7a24-4580-8079-a7953a1431e3,lyft_luxsuv,2.0
Lyft,Lux Black XL,Boston University - Financial District,3.75,2018-11-27 17:03:22.477,92.0,75c1c9bf-38a7-4cb6-81c1-e2effaaaedbd,lyft_luxsuv,2.0
Lyft,Lux Black XL,Boston University - Financial District,4.39,2018-12-14 03:40:07.486,92.0,fa0c51a4-acc0-4c49-8244-561e9ad945e3,lyft_luxsuv,2.0
Lyft,Lux Black XL,Boston University - Financial District,4.37,2018-12-16 06:00:05.181,92.0,71ad1473-ebc6-4b98-80c6-0f56046af3e2,lyft_luxsuv,2.0


Regardless of the distance, `94%` of the highest rates (Top 100) for the rides were recorded by Lyft. Also notice that among the top 100 rides in terms of cost, all the Lyft rides' prices were hiked, some to double the cost. This raises another probing question: does Uber ever increase its prices? Let's find out next.



---



**3. Does Uber ever increase its prices? How about Lyft?**

In [None]:
%%sql
SELECT cab_type, name, product_id, "source" || ' - ' || "destination" as route, distance, time_stamp, cast(price as float), surge_multiplier
FROM cabs
WHERE price IS NOT NULL AND cab_type = 'Uber' AND surge_multiplier > 1
ORDER BY price DESC
LIMIT 5

 * sqlite://
Done.


cab_type,name,product_id,route,distance,time_stamp,cast(price as float),surge_multiplier




---



In [None]:
%%sql
SELECT cab_type, name, product_id, "source" || ' - ' || "destination" as route, distance, time_stamp, cast(price as float), surge_multiplier
FROM cabs
WHERE price IS NOT NULL AND cab_type = 'Lyft' AND surge_multiplier > 1
ORDER BY price DESC
LIMIT 20

 * sqlite://
Done.


cab_type,name,product_id,route,distance,time_stamp,cast(price as float),surge_multiplier
Lyft,Lux Black XL,lyft_luxsuv,Financial District - Fenway,4.43,2018-12-02 01:28:02.123,97.5,2.0
Lyft,Lux Black XL,lyft_luxsuv,Fenway - Financial District,3.89,2018-11-28 21:41:35.635,92.0,2.0
Lyft,Lux Black XL,lyft_luxsuv,Boston University - Financial District,3.75,2018-12-13 17:15:14.374,92.0,2.0
Lyft,Lux Black XL,lyft_luxsuv,Financial District - Boston University,5.39,2018-11-29 01:02:08.714,92.0,2.0
Lyft,Lux Black XL,lyft_luxsuv,Financial District - Boston University,5.39,2018-11-27 22:03:21.754,92.0,2.0
Lyft,Lux Black XL,lyft_luxsuv,Financial District - Boston University,5.36,2018-12-17 23:45:13.982,92.0,2.0
Lyft,Lux Black XL,lyft_luxsuv,Financial District - Boston University,5.37,2018-12-01 20:07:59.927,92.0,2.0
Lyft,Lux Black XL,lyft_luxsuv,Boston University - Financial District,3.75,2018-11-27 17:03:22.477,92.0,2.0
Lyft,Lux Black XL,lyft_luxsuv,Boston University - Financial District,4.39,2018-12-14 03:40:07.486,92.0,2.0
Lyft,Lux Black XL,lyft_luxsuv,Boston University - Financial District,4.37,2018-12-16 06:00:05.181,92.0,2.0


In [None]:
#How many categories of price hikes are there?

%%sql
SELECT DISTINCT surge_multiplier
FROM cabs;

 * sqlite://
Done.


surge_multiplier
1.0
1.25
2.5
2.0
1.75
1.5
3.0


Interestingly, we notice that no `Uber` record in this dataset has a hiked price: its surge_multiplier is always 1. `Lyft`, on the other hand, has some rides at the normal fare and others with charges hiked to between 1.25 and 3 times the normal fare.
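
Both companies can be checked in one query by counting records per `cab_type` and `surge_multiplier`. A minimal sketch on made-up rows:

```python
import sqlite3

# Toy rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cabs (cab_type TEXT, surge_multiplier REAL)")
conn.executemany(
    "INSERT INTO cabs VALUES (?, ?)",
    [("Uber", 1.0), ("Uber", 1.0), ("Lyft", 1.0), ("Lyft", 1.25), ("Lyft", 2.0)],
)

# One row per (company, multiplier) pair with the number of records in each.
query = """
SELECT cab_type, surge_multiplier, COUNT(*) AS n_records
FROM cabs
GROUP BY cab_type, surge_multiplier
ORDER BY cab_type, surge_multiplier
"""
surge_counts = conn.execute(query).fetchall()
print(surge_counts)
```

A company that never surges shows up as a single row with a multiplier of 1.0.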

In [None]:
# Do we have any special / common property about the rides whose fare is hiked by three times?

%%sql
SELECT cab_type, name, product_id, "source" ||" - " || "destination" as route, distance, time_stamp, cast(price as float), surge_multiplier
FROM cabs
WHERE surge_multiplier = 3
ORDER BY route DESC;

 * sqlite://
Done.


cab_type,name,product_id,route,distance,time_stamp,cast(price as float),surge_multiplier
Lyft,Lyft XL,lyft_plus,Theatre District - Boston University,4.64,2018-11-27 22:12:23.100,62.5,3.0
Lyft,Lyft,lyft,Theatre District - Boston University,4.64,2018-12-16 23:40:18.739,38.0,3.0
Lyft,Lyft XL,lyft_plus,Boston University - Financial District,4.39,2018-11-27 04:42:21.898,65.0,3.0
Lyft,Lyft,lyft,Boston University - Financial District,4.39,2018-12-14 03:40:07.486,38.5,3.0
Lyft,Lyft XL,lyft_plus,Beacon Hill - Boston University,2.33,2018-12-01 13:57:58.833,42.5,3.0
Lyft,Lyft,lyft,Beacon Hill - Boston University,2.33,2018-12-01 13:57:58.833,26.0,3.0
Lyft,Lyft,lyft,Back Bay - South Station,1.84,2018-11-27 09:03:21.922,22.5,3.0
Lyft,Lyft XL,lyft_plus,Back Bay - South Station,1.84,2018-12-17 03:05:03.672,38.0,3.0
Lyft,Lyft XL,lyft_plus,Back Bay - North End,3.16,2018-11-28 12:32:08.522,55.0,3.0
Lyft,Lyft,lyft,Back Bay - North End,3.16,2018-11-28 12:32:08.522,27.5,3.0


We notice that prices are hiked by up to 3 times on only 6 routes, and this happens only with the Lyft and Lyft XL (`lyft_plus`) products of the company. The incoming company needs to find out what is special about these routes for proper placement.
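
The number of affected routes and products can be counted rather than read off the listing. The sketch below uses made-up rows, so its numbers are not the notebook's.

```python
import sqlite3

# Toy rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cabs (source TEXT, destination TEXT, name TEXT, surge_multiplier REAL)"
)
conn.executemany(
    "INSERT INTO cabs VALUES (?, ?, ?, ?)",
    [
        ("Back Bay", "North End", "Lyft", 3.0),
        ("Back Bay", "North End", "Lyft XL", 3.0),
        ("Back Bay", "South Station", "Lyft", 3.0),
        ("Fenway", "West End", "Lux", 1.0),
    ],
)

# Number of distinct routes that ever carry a 3x multiplier.
n_routes = conn.execute(
    "SELECT COUNT(DISTINCT source || ' - ' || destination) "
    "FROM cabs WHERE surge_multiplier = 3"
).fetchone()[0]

# Which products those 3x records belong to.
by_product = conn.execute(
    "SELECT name, COUNT(*) FROM cabs WHERE surge_multiplier = 3 "
    "GROUP BY name ORDER BY name"
).fetchall()
print(n_routes, by_product)
```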

**4. How many car types / categories do the two companies have, and does this have any relationship with the cost of the ride?**

In [None]:
# Use the COUNT() function to get the total number of cab subcategories for all the 
# rides registered.
%%sql
SELECT cab_type, name, COUNT(name) as 'Frequency'
FROM cabs
GROUP BY name


 * sqlite://
Done.


cab_type,name,Frequency
Uber,Black,55095
Uber,Black SUV,55096
Lyft,Lux,51235
Lyft,Lux Black,51235
Lyft,Lux Black XL,51235
Lyft,Lyft,51235
Lyft,Lyft XL,51235
Lyft,Shared,51233
Uber,UberPool,55091
Uber,UberX,55094


In [None]:
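# Count the records on each route. Because the query groups by route only, the
# name column shows one arbitrary car type per route, not the most popular one.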

%%sql
SELECT "source" ||" - " || "destination" as route, name, COUNT(name) as [car name popularity]
FROM cabs
GROUP BY route
ORDER BY route DESC
Limit 100

 * sqlite://
Done.


route,name,car name popularity
West End - South Station,UberXL,8784
West End - Northeastern University,Lux Black XL,8778
West End - North End,UberPool,8478
West End - Haymarket Square,Lux,8424
West End - Fenway,UberX,9360
West End - Boston University,UberXL,9156
Theatre District - South Station,UberX,8994
Theatre District - Northeastern University,Lyft XL,8874
Theatre District - North End,Lux Black XL,8760
Theatre District - Haymarket Square,Black SUV,8898


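The count is the number of records on each route; the car name shown beside it is simply the one SQLite picked from that group, so this schedule does not yet show which cars are popular on which routes. Grouping by both route and car name would give the incoming company that view and help it judge which cars to employ on which routes.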
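
One way to rank car types within each route is to group by both `route` and `name` and keep the top entry per route. A minimal sketch on made-up rows; ties are broken alphabetically here, which is an arbitrary choice for the example.

```python
import sqlite3

# Toy rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cabs (source TEXT, destination TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO cabs VALUES (?, ?, ?)",
    [
        ("Back Bay", "Fenway", "UberX"),
        ("Back Bay", "Fenway", "UberX"),
        ("Back Bay", "Fenway", "Lux"),
        ("Fenway", "West End", "Shared"),
    ],
)

# Count records per (route, car name); the most frequent name comes first per route.
query = """
SELECT source || ' - ' || destination AS route, name, COUNT(*) AS n
FROM cabs
GROUP BY route, name
ORDER BY route, n DESC, name
"""

# Keep only the first (most frequent) car name seen for each route.
most_common = {}
for route, name, n in conn.execute(query):
    most_common.setdefault(route, (name, n))
print(most_common)
```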

**5. How does Uber's rate compare to that of Lyft for the same distance and also at the same time?**

In [None]:
%%sql
SELECT cab_type, name, "source" || ' - ' || "destination" as route, distance, time_stamp, cast(price as float) as price, surge_multiplier, product_id
FROM cabs
WHERE price IS NOT NULL
ORDER BY route, distance DESC
Limit 100

 * sqlite://
Done.


cab_type,name,route,distance,time_stamp,price,surge_multiplier,product_id
Lyft,Shared,Back Bay - Boston University,1.78,2018-11-27 19:18:21.869,5.0,1.0,lyft_line
Lyft,Lyft XL,Back Bay - Boston University,1.78,2018-11-27 19:18:21.869,13.5,1.0,lyft_plus
Lyft,Lux Black,Back Bay - Boston University,1.78,2018-11-30 12:22:55.867,19.5,1.0,lyft_lux
Lyft,Lux Black XL,Back Bay - Boston University,1.78,2018-11-30 12:22:55.867,26.0,1.0,lyft_luxsuv
Lyft,Lyft,Back Bay - Boston University,1.78,2018-11-30 12:22:55.867,9.0,1.0,lyft
Lyft,Lux,Back Bay - Boston University,1.78,2018-11-30 12:22:55.867,16.5,1.0,lyft_premier
Lyft,Lyft,Back Bay - Boston University,1.77,2018-11-30 03:03:00.720,9.0,1.0,lyft
Lyft,Lux Black XL,Back Bay - Boston University,1.77,2018-11-30 13:43:01.082,27.5,1.0,lyft_luxsuv
Lyft,Lux,Back Bay - Boston University,1.77,2018-11-30 13:43:01.082,16.5,1.0,lyft_premier
Lyft,Lux Black,Back Bay - Boston University,1.77,2018-11-30 13:43:01.082,19.5,1.0,lyft_lux


In [None]:
%%sql
SELECT cab_type, name, "source" || ' - ' || "destination" as route, distance, time_stamp, cast(price as float) as price, surge_multiplier, product_id
FROM cabs
WHERE price IS NOT NULL AND surge_multiplier <> 1
ORDER BY route, cab_type, name, distance DESC
Limit 100

 * sqlite://
Done.


cab_type,name,route,distance,time_stamp,price,surge_multiplier,product_id
Lyft,Lux,Back Bay - Boston University,1.72,2018-12-14 10:10:08.997,16.5,1.25,lyft_premier
Lyft,Lux,Back Bay - Boston University,1.71,2018-11-29 01:18:40.764,16.5,1.25,lyft_premier
Lyft,Lux,Back Bay - Boston University,1.7,2018-12-18 00:25:03.783,19.5,1.25,lyft_premier
Lyft,Lux,Back Bay - Boston University,1.69,2018-11-28 16:26:07.939,16.5,1.25,lyft_premier
Lyft,Lux,Back Bay - Boston University,1.59,2018-12-01 17:07:56.318,16.5,1.25,lyft_premier
Lyft,Lux,Back Bay - Boston University,1.58,2018-11-27 11:30:22.385,27.5,2.0,lyft_premier
Lyft,Lux,Back Bay - Boston University,1.57,2018-11-28 17:59:08.408,26.0,2.0,lyft_premier
Lyft,Lux,Back Bay - Boston University,1.57,2018-12-17 01:35:10.629,16.5,1.25,lyft_premier
Lyft,Lux,Back Bay - Boston University,1.57,2018-11-28 16:31:26.473,26.0,2.0,lyft_premier
Lyft,Lux,Back Bay - Boston University,1.56,2018-12-14 21:45:04.669,22.5,1.75,lyft_premier


**6. Does the weather have any effect on the pricing?**

In [None]:
# Pair cab records with weather readings taken in the same clock hour.
# strftime() is a plain function, so the table name qualifies the column
# (cabs.time_stamp), not the function. Matching on the hour rather than the
# millisecond is a choice made here because weather was collected hourly.

%%sql
SELECT cabs.time_stamp, weather.location, weather.time_stamp
FROM cabs, weather
WHERE strftime('%Y-%m-%d %H', cabs.time_stamp) = strftime('%Y-%m-%d %H', weather.time_stamp)
LIMIT 10





---



In [None]:
# Join each cab record to the weather readings for its source location in the
# same clock hour (an assumption, since weather was collected about hourly),
# so that price can be inspected next to rain and temperature.

%%sql
SELECT cab_type, name, "source" || ' - ' || "destination" as route, distance, date(cabs.time_stamp) as mydate, time(cabs.time_stamp) as mytime, cast(price as float) as price, surge_multiplier, product_id, weather.location, weather.rain, weather.temp
FROM cabs
LEFT JOIN weather ON weather.location = cabs.source
  AND strftime('%Y-%m-%d %H', weather.time_stamp) = strftime('%Y-%m-%d %H', cabs.time_stamp)
ORDER BY route, cab_type, name, distance, mydate DESC
LIMIT 100
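
One way to line the two tables up is to join on location and on the clock hour of each timestamp, then compare average prices for rainy and dry readings. This is a sketch under stated assumptions: the rows are made up, the hour-level match is a simplification, and any reading with `rain > 0` is treated as rainy.

```python
import sqlite3

# Toy rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cabs (source TEXT, time_stamp TEXT, price REAL)")
conn.execute("CREATE TABLE weather (location TEXT, time_stamp TEXT, rain REAL)")
conn.executemany(
    "INSERT INTO cabs VALUES (?, ?, ?)",
    [
        ("Back Bay", "2018-11-27 19:18:21.869", 13.5),
        ("Back Bay", "2018-11-28 08:05:00.000", 9.0),
        ("Fenway", "2018-11-27 19:40:00.000", 20.0),
    ],
)
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("Back Bay", "2018-11-27 19:45:20", 0.0),
        ("Back Bay", "2018-11-28 08:45:00", 0.12),
        ("Fenway", "2018-11-27 19:45:20", 0.0),
    ],
)

# Match each ride to the weather reading at its source within the same hour.
query = """
SELECT CASE WHEN w.rain > 0 THEN 'rain' ELSE 'dry' END AS conditions,
       COUNT(*)               AS n_rides,
       ROUND(AVG(c.price), 2) AS avg_price
FROM cabs AS c
JOIN weather AS w
  ON w.location = c.source
 AND strftime('%Y-%m-%d %H', w.time_stamp) = strftime('%Y-%m-%d %H', c.time_stamp)
GROUP BY conditions
ORDER BY conditions
"""
by_conditions = conn.execute(query).fetchall()
print(by_conditions)
```

If a location has more than one weather reading in an hour, a ride will be counted once per reading, so real data may need the readings averaged per hour first.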


## Step 4. Report

1. The weather dataset has 6,276 records and 8 fields (features), while the cab rides dataset has 693,071 records and 10 fields.

2. Data in these recordsets was collected between 2018-11-26 03:40:46 and 2018-12-18 19:15:10 for the cab rides and between 2018-11-26 03:40:44 and 2018-12-18 18:45:02 for the weather. Any insights communicated therefore are based on this period of business transactions.

3. From the check of null values, we notice that:

- Out of 6,276 records in the weather dataset, 5,382 have a NULL `rain` value. We interpret these as times when it did not rain and update the NULLs to zeros, meaning 0 inches of rain.
- The price feature of the cab rides has 55,095 NULL entries, all of them Uber records. Pricing is one of the most important features of the dataset in this analysis, so these rows are dropped; they represent only about 14% of all Uber records in the dataset.

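4. After the NULL-price rows are dropped, Uber accounts for about 51.8% of the sampled ride records and Lyft for about 48.2%. Because the records are price estimates queried at regular intervals, this split says more about how the data was collected than about rider preference.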

5. From the above exploration, we notice the following:

- The highest price for a ride recorded is by Lyft for the journey between Financial District and Fenway at a cost of `$97.5` (with a multiplier of 2.0) on 2nd Dec 2018.
- The lowest price recorded for a ride is also by Lyft, for the journey from Boston University to Back Bay, at a cost of `$2.5` (default price, no multiplier) on 28th Nov 2018.
- The average cost of a ride is about `$16.55`.

6. The average pricing for each route represented in the sample dataset is given in the analysis and will help the incoming ride company to estimate the pricing between different sources and destinations.

7. Regardless of the distance, 94% of the highest rates (Top 100) for the rides were recorded by Lyft. Among the top 100 rides in terms of cost, all the Lyft rides' prices were hiked, some to double the cost.

- Interestingly, no Uber record in this dataset has a hiked price: its surge_multiplier is always 1. Lyft, on the other hand, has some rides at the normal fare and others with charges hiked to between 1.25 and 3 times the normal fare. The incoming ride company will then need to see which model works best for them.

8. Prices are hiked by up to 3 times on only 6 routes, and this happens only with the Lyft and Lyft XL (lyft_plus) products of the company. The incoming company needs to find out what is special about these routes for proper placement.

