# **Processing & Analyzing the Data**

First, we run some calculations on the dataset to get insights into the differences between types of users (**member\_casual**). I am going to use aggregate functions such as **COUNT**, **SUM** and **AVG**:

In [7]:
--Number of trips, average trip length and total trip time, by type of user (member or casual)

SELECT  
	member_casual,
	COUNT(ride_length) AS number_of_trips, 
	AVG(ride_length) AS average_ride, 
	SUM(ride_length) AS total_time_trips

FROM (
	SELECT	
		member_casual, 
		DATEDIFF(MINUTE,started_at, ended_at) AS ride_length 
	FROM [202103-divvy-tripdata]
) AS temp_table
GROUP BY member_casual

member_casual,number_of_trips,average_ride,total_time_trips
casual,84033,38,3206751
member,144463,13,2018031


_Casual riders_ make fewer trips (**number\_of\_trips**) but have a longer **average\_ride** time (in minutes) than _member riders_. As a result, **total\_time\_trips** is greater for _casual riders_. We have to explore why this happens.
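
The averages above are whole numbers because **DATEDIFF(MINUTE, ...)** returns an integer and SQL Server's **AVG** over integers also returns an integer, so the fraction is dropped (for members, 2018031 / 144463 is about 13.97, shown as 13). A minimal sketch of one way to keep the decimals, assuming the same table and columns as the query above; the alias `average_ride_minutes` and the two-decimal precision are illustrative choices, not part of the original analysis:

```sql
-- Sketch: average ride length in fractional minutes per type of user.
-- Assumes the same table and columns as the query above.
SELECT
	member_casual,
	COUNT(*) AS number_of_trips,
	-- Dividing by 60.0 switches to decimal arithmetic, so AVG keeps the fraction
	CAST(AVG(DATEDIFF(SECOND, started_at, ended_at) / 60.0) AS DECIMAL(10, 2)) AS average_ride_minutes
FROM [202103-divvy-tripdata]
GROUP BY member_casual;
```

Because this version measures elapsed seconds, its averages may differ slightly from the minute-based totals shown in the table.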

Next, we look for a relationship between the **day\_of\_week** on which the trip started and the type of user (**member\_casual**), in order to identify differences. But first, I look for the mode of **day\_of\_week** across all riders:

In [9]:
SELECT  
	day_of_week,
	COUNT(day_of_week) AS number_trips

FROM (
	SELECT	
		member_casual, 
		rideable_type,
		DATEPART(DW, started_at) AS day_of_week  -- CREATE A TABLE WITH ALL THE CALCULATIONS
	FROM [202103-divvy-tripdata]
) AS temp_table1
--WHERE NOT rideable_type='docked_bike'
GROUP BY day_of_week
ORDER BY number_trips DESC

day_of_week,number_trips
7,45252
1,35627
2,34825
3,34000
4,31669
6,25657
5,21466


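As you can see, riders show a clear preference for starting their trips on day 7 (Saturday), followed by day 1 (Sunday). _Remember this: with SQL Server's default DATEFIRST setting, DATEPART(DW, ...) returns Sunday=1 and Saturday=7. For example, a trip started on Monday 2021-03-08 gets day\_of\_week 2._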
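
Since the numbers returned by **DATEPART(DW, ...)** depend on the session's DATEFIRST setting, it can help to check the mapping directly instead of relying on memory. A minimal sketch, assuming the same table as above; the aliases `datefirst_setting` and `day_name` are illustrative:

```sql
-- Sketch: confirm how day numbers map to day names in the current session.
SELECT @@DATEFIRST AS datefirst_setting;  -- 7 (Sunday as first day) is the us_english default

-- List each day number next to its name, using the trips themselves
SELECT DISTINCT
	DATEPART(DW, started_at) AS day_of_week,
	DATENAME(WEEKDAY, started_at) AS day_name
FROM [202103-divvy-tripdata]
ORDER BY day_of_week;
```
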
Now we analyze the preferred **day\_of\_week** separately for each type of user (**member\_casual**) to keep looking for differences. We also apply aggregate functions such as **COUNT**, **AVG** and **SUM**, grouped by type of user (**member\_casual**) and **day\_of\_week**, to add detail to the analysis.

In [11]:

SELECT  
	member_casual,
	day_of_week,
	COUNT(ride_length) AS number_of_trips, 
	AVG(ride_length) AS average_ride, 
	SUM(ride_length) AS total_time_trips

FROM (
	SELECT	
		member_casual, 
		rideable_type,
		DATEDIFF(MINUTE,started_at, ended_at) AS ride_length,  -- CREATE A TABLE WITH ALL THE CALCULATIONS
		DATEPART(DW, started_at) AS day_of_week	
	FROM [202103-divvy-tripdata]
) AS temp_table
WHERE member_casual='casual'
GROUP BY member_casual, day_of_week
ORDER BY member_casual, number_of_trips DESC


SELECT  
	member_casual,
	day_of_week,
	COUNT(ride_length) AS number_of_trips, 
	AVG(ride_length) AS average_ride, 
	SUM(ride_length) AS total_time_trips

FROM (
	SELECT	
		member_casual, 
		rideable_type,
		DATEDIFF(MINUTE,started_at, ended_at) AS ride_length,  -- CREATE A TABLE WITH ALL THE CALCULATIONS
		DATEPART(DW, started_at) AS day_of_week	
	FROM [202103-divvy-tripdata]
) AS temp_table
WHERE member_casual='member'
GROUP BY member_casual, day_of_week
ORDER BY member_casual, number_of_trips DESC

member_casual,day_of_week,number_of_trips,average_ride,total_time_trips
casual,7,22090,42,934447
casual,1,17342,41,717331
casual,2,11990,44,529223
casual,3,10463,36,376865
casual,4,8883,28,255304
casual,6,7777,29,227023
casual,5,5488,30,166558


member_casual,day_of_week,number_of_trips,average_ride,total_time_trips
member,3,23537,13,317206
member,7,23162,15,367678
member,2,22835,14,322576
member,4,22786,12,294539
member,1,18285,16,295307
member,6,17880,12,227987
member,5,15978,12,192738


From the tables shown above, we can conclude the following:

- Each type of user prefers different days for riding. The **number\_of\_trips** column shows that _casual riders_ prefer Saturday (7) followed by Sunday (1), while _member riders_ prefer Tuesday (3) followed by Saturday (7).
- We can also see on which **day\_of\_week** the **average\_ride** time (in minutes) is longest. For _casual riders_, it is longest on Monday (44 minutes), Saturday (42) and Sunday (41).
- For _member riders_, the longest **average\_ride** times are much shorter than those of _casual riders_: Sunday (16 minutes), Saturday (15) and Monday (14). The average times of _member riders_ also vary less from day to day (12 to 16 minutes) than those of _casual riders_ (28 to 44 minutes).
- Notice that there is not much correlation between the **average\_ride** time for each **day\_of\_week** and the **number\_of\_trips** grouped by **day\_of\_week**.
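
The comparison in the list above can also be read from a single result set. A minimal sketch using conditional aggregation, assuming the same table and columns as the queries above; the output aliases (`casual_trips`, `member_trips`, and so on) are illustrative names:

```sql
-- Sketch: casual and member figures side by side, one row per day of the week.
SELECT
	DATENAME(WEEKDAY, started_at) AS day_name,
	SUM(CASE WHEN member_casual = 'casual' THEN 1 ELSE 0 END) AS casual_trips,
	SUM(CASE WHEN member_casual = 'member' THEN 1 ELSE 0 END) AS member_trips,
	-- CASE without ELSE yields NULL for the other user type, and AVG ignores NULLs
	AVG(CASE WHEN member_casual = 'casual' THEN DATEDIFF(MINUTE, started_at, ended_at) END) AS casual_average_ride,
	AVG(CASE WHEN member_casual = 'member' THEN DATEDIFF(MINUTE, started_at, ended_at) END) AS member_average_ride
FROM [202103-divvy-tripdata]
GROUP BY DATENAME(WEEKDAY, started_at), DATEPART(DW, started_at)
ORDER BY DATEPART(DW, started_at);
```

Showing the day name next to the figures avoids having to remember which number stands for which day.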

So, we explore the data a little further, looking at the longest trips (in _hours_, because of the magnitude of the longest **ride\_length** values):

In [12]:
SELECT TOP 100
	member_casual,
	started_at,
	ended_at,
	DATEDIFF(HOUR,started_at,ended_at) AS ride_length,
	DATEPART(DW, started_at) AS day_of_week,    -- DAY OF THE WEEK (1 TO 7)
	rideable_type

FROM [202103-divvy-tripdata]
ORDER BY ride_length desc


member_casual,started_at,ended_at,ride_length,day_of_week,rideable_type
casual,2021-03-08 16:48:24.0000000,2021-03-30 16:50:03.0000000,528,2,docked_bike
casual,2021-03-19 14:11:08.0000000,2021-04-02 12:45:14.0000000,334,6,docked_bike
casual,2021-03-02 17:48:41.0000000,2021-03-15 09:39:25.0000000,304,3,docked_bike
casual,2021-03-06 22:37:58.0000000,2021-03-18 22:08:43.0000000,288,7,docked_bike
casual,2021-03-06 22:58:09.0000000,2021-03-18 22:09:49.0000000,288,7,docked_bike
casual,2021-03-22 13:22:48.0000000,2021-04-02 14:27:23.0000000,265,2,docked_bike
casual,2021-03-20 17:35:12.0000000,2021-03-30 13:56:23.0000000,236,7,docked_bike
casual,2021-03-18 21:23:58.0000000,2021-03-27 22:05:30.0000000,217,5,docked_bike
casual,2021-03-20 23:58:11.0000000,2021-03-29 17:01:59.0000000,210,7,docked_bike
casual,2021-03-21 18:20:50.0000000,2021-03-29 21:30:13.0000000,195,1,docked_bike


The results show an anomaly in the **ride\_length** of the longest rides. Some insights we capture:

- The longest ride lasts 528 hours (22 days), which in my opinion is far too long for a real trip.
- The second longest lasts 334 hours, and so on down to 25 hours at the end of the top 100.
- All rides with a three-digit duration in hours have **rideable\_type** _'docked\_bike'_ and type of rider (**member\_casual**) _casual_.
- More generally, these very long rides are mainly from _casual riders_.
- The longest ride by a _member rider_ is 26 hours (row n°80).
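
To gauge how many rides fall into this anomalous range, a count by type of user and bike can help. A minimal sketch, assuming the same table and columns as above; the 24-hour cutoff and the aliases `long_rides` and `longest_ride_hours` are assumptions chosen for illustration, not thresholds taken from the dataset's documentation:

```sql
-- Sketch: rides longer than an assumed 24-hour cutoff, by user and bike type.
SELECT
	member_casual,
	rideable_type,
	COUNT(*) AS long_rides,
	MAX(DATEDIFF(MINUTE, started_at, ended_at)) / 60 AS longest_ride_hours  -- integer division, whole hours
FROM [202103-divvy-tripdata]
WHERE DATEDIFF(MINUTE, started_at, ended_at) > 24 * 60   -- assumed cutoff: 24 hours
GROUP BY member_casual, rideable_type
ORDER BY long_rides DESC;
```

The filter uses minutes rather than **DATEDIFF(HOUR, ...)** so that the cutoff is applied at a finer granularity.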

Now we take a deeper look at **rideable\_type**, separated by type of user (**member\_casual**):

In [14]:
SELECT  
	member_casual,
	rideable_type,
	COUNT(ride_length) AS number_of_trips, 
	AVG(ride_length) AS average_ride, 
	SUM(ride_length) AS total_time_trips

FROM (
	SELECT	
		member_casual, 
		rideable_type,
		DATEDIFF(MINUTE,started_at, ended_at) AS ride_length,  -- CREATE A TABLE WITH ALL THE CALCULATIONS
		DATEPART(DW, started_at) AS day_of_week	
	FROM [202103-divvy-tripdata]
) AS temp_table
WHERE member_casual='casual'
GROUP BY member_casual, rideable_type
ORDER BY member_casual, number_of_trips DESC


SELECT  
	member_casual,
	rideable_type,
	COUNT(ride_length) AS number_of_trips, 
	AVG(ride_length) AS average_ride, 
	SUM(ride_length) AS total_time_trips

FROM (
	SELECT	
		member_casual, 
		rideable_type,
		DATEDIFF(MINUTE,started_at, ended_at) AS ride_length,  -- CREATE A TABLE WITH ALL THE CALCULATIONS
		DATEPART(DW, started_at) AS day_of_week	
	FROM [202103-divvy-tripdata]
) AS temp_table
WHERE member_casual='member'
GROUP BY member_casual, rideable_type
ORDER BY member_casual, number_of_trips DESC


member_casual,rideable_type,number_of_trips,average_ride,total_time_trips
casual,classic_bike,45528,31,1435951
casual,electric_bike,22848,21,492410
casual,docked_bike,15657,81,1278390


member_casual,rideable_type,number_of_trips,average_ride,total_time_trips
member,classic_bike,107017,14,1519896
member,electric_bike,37446,13,498135


We can see that docked\_bike rides have an **average\_ride** time of 81 minutes, which explains the long rides mentioned above. It would also be useful to know which trip duration is the most frequent (the mode of **ride\_length**). For that, we group the trips by **ride\_length** in whole minutes to simplify the results table:

In [8]:
SELECT  TOP 50
	ride_length,
	COUNT(ride_length) AS number_of_trips, 
	AVG(ride_length) AS average_ride, 
	SUM(ride_length) AS total_time_trips

FROM (
	SELECT	
		ROUND(DATEDIFF(MINUTE,started_at, ended_at),0) AS ride_length,  -- DATEDIFF ALREADY RETURNS WHOLE MINUTES, SO ROUND LEAVES THE VALUE UNCHANGED
		DATEPART(DW, started_at) AS day_of_week	
	FROM [202103-divvy-tripdata]
) AS temp_table
GROUP BY ride_length
ORDER BY number_of_trips DESC

ride_length,number_of_trips,average_ride,total_time_trips
6,12578,6,75468
7,12442,7,87094
5,12337,5,61685
8,11821,8,94568
4,11168,4,44672
9,10857,9,97713
10,10043,10,100430
11,9116,11,100276
12,8531,12,102372
3,8199,3,24597


The mode of the variable **ride\_length** is 6 minutes, followed by 7, 5, 8, 4, 9 and so on (in whole minutes). As you may have noticed, **ride\_length** and **average\_ride** are equal in every row. This is expected: the rows are grouped by **ride\_length**, so every trip in a group has the same whole-minute value and the average of the group is that same value. Note that **DATEDIFF(MINUTE, ...)** counts the minute boundaries crossed between **started\_at** and **ended\_at** rather than rounding the elapsed time, so a **ride\_length** of 6 covers trips that actually lasted anywhere from just over 5 to just under 7 minutes.
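
The way **DATEDIFF(MINUTE, ...)** counts can be checked with two literal timestamps, and a nearest-minute grouping can be built from seconds instead. A minimal sketch, assuming the same table and columns as above; the literal timestamps and the alias `rounded_minutes` are made up for illustration:

```sql
-- DATEDIFF counts minute boundaries crossed, not elapsed minutes
SELECT
	DATEDIFF(MINUTE, CAST('2021-03-01 10:00:59' AS DATETIME2), CAST('2021-03-01 10:01:00' AS DATETIME2)) AS one_second_apart,         -- 1
	DATEDIFF(MINUTE, CAST('2021-03-01 10:00:00' AS DATETIME2), CAST('2021-03-01 10:00:59' AS DATETIME2)) AS fifty_nine_seconds_apart; -- 0

-- Sketch: group by ride length rounded to the nearest whole minute
SELECT TOP 50
	rounded_minutes,
	COUNT(*) AS number_of_trips
FROM (
	SELECT
		-- Elapsed seconds divided by 60.0, then rounded to the nearest minute
		CAST(ROUND(DATEDIFF(SECOND, started_at, ended_at) / 60.0, 0) AS INT) AS rounded_minutes
	FROM [202103-divvy-tripdata]
) AS temp_table
GROUP BY rounded_minutes
ORDER BY number_of_trips DESC;
```

With this variant, a group labelled 6 holds trips that lasted between 5.5 and 6.5 minutes, so its counts may differ from the table above.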