# Bike Share Market Strategy Case Study for Google Data Analytics Certification 
## (Details in the Readme file) <br> We will follow the A-P-P-A-S-A (Ask, Prepare, Process, Analyze, Share, Act) framework.

# 1. *__Ask__*
### ● What is the problem you are trying to solve?<br> ● How can your insights drive business decisions?<br><br> Key tasks
#### 1. Identify the business task<br>2. Consider key stakeholders
### Deliverable: <br> A clear statement of the business task

## Response To *__Ask__* 
### Identify trends in historical bike trip data to understand how Casual and Annual members use the service differently and why riders choose annual subscriptions, so that Marketing campaigns can be targeted at converting Casual members into Annual members.

# 2. *__Prepare__*
### ● Data preparation and elementary review on the dataset<br><br> Key tasks
#### 1. Download data and store it appropriately.<br>2. Identify how it’s organized.<br>3. Sort and filter the data.<br>4. Determine the credibility of the data.
### Deliverable: <br> A description of all data sources used.

In [5]:
import pandas as pd
import numpy as np

In [6]:
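df_apr = pd.read_csv("202304-divvy-tripdata.csv")
df_apr.info()       # column dtypes and non-null counts
df_apr.describe()   # not rendered: a notebook cell only displays its last expression
df_apr.head(10)     # first 10 rows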

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426590 entries, 0 to 426589
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   ride_id             426590 non-null  object 
 1   rideable_type       426590 non-null  object 
 2   started_at          426590 non-null  object 
 3   ended_at            426590 non-null  object 
 4   start_station_name  362776 non-null  object 
 5   start_station_id    362776 non-null  object 
 6   end_station_name    357960 non-null  object 
 7   end_station_id      357960 non-null  object 
 8   start_lat           426590 non-null  float64
 9   start_lng           426590 non-null  float64
 10  end_lat             426155 non-null  float64
 11  end_lng             426155 non-null  float64
 12  member_casual       426590 non-null  object 
dtypes: float64(4), object(9)
memory usage: 42.3+ MB


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,8FE8F7D9C10E88C7,electric_bike,2023-04-02 08:37:28,2023-04-02 08:41:37,,,,,41.8,-87.6,41.79,-87.6,member
1,34E4ED3ADF1D821B,electric_bike,2023-04-19 11:29:02,2023-04-19 11:52:12,,,,,41.87,-87.65,41.93,-87.68,member
2,5296BF07A2F77CB5,electric_bike,2023-04-19 08:41:22,2023-04-19 08:43:22,,,,,41.93,-87.66,41.93,-87.66,member
3,40759916B76D5D52,electric_bike,2023-04-19 13:31:30,2023-04-19 13:35:09,,,,,41.92,-87.65,41.91,-87.65,member
4,77A96F460101AC63,electric_bike,2023-04-19 12:05:36,2023-04-19 12:10:26,,,,,41.91,-87.65,41.91,-87.63,member
5,8D6A2328E19DC168,electric_bike,2023-04-19 12:17:34,2023-04-19 12:21:38,,,,,41.91,-87.63,41.92,-87.65,member
6,C97BBA66E07889F9,electric_bike,2023-04-19 09:35:48,2023-04-19 09:45:00,,,,,41.93,-87.66,41.91,-87.65,member
7,6687AD4C575FF734,electric_bike,2023-04-11 16:13:43,2023-04-11 16:18:41,,,,,42.0,-87.66,41.99,-87.66,member
8,A8FA4F73B22BC11F,electric_bike,2023-04-11 16:29:24,2023-04-11 16:40:23,,,,,41.99,-87.66,42.0,-87.66,member
9,81E158FE63D99994,electric_bike,2023-04-19 17:35:40,2023-04-19 17:36:11,,,,,41.88,-87.65,41.88,-87.65,member


#### Comment: <br>On a first look, the dataset for Apr 2023 appears to meet the Reliable, Original, Comprehensive, Current and Cited (ROCCC) criteria.
#### We will continue with a preliminary check on the May and June datasets.
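The `info()` output only shows dtypes and non-null counts, so a few extra integrity checks can help support the "Reliable" part of ROCCC. The following is a minimal sketch, not part of the original notebook: the helper name `basic_credibility_checks` and the tiny sample frame are made up for illustration, and only the column names come from the dataset.

```python
import pandas as pd

def basic_credibility_checks(df: pd.DataFrame, month: str) -> dict:
    """Return simple integrity counts for one month of trip data.

    `month` is a 'YYYY-MM' string such as '2023-04' (hypothetical helper).
    """
    started = pd.to_datetime(df["started_at"])
    ended = pd.to_datetime(df["ended_at"])
    return {
        # ride_id is expected to be unique per trip
        "duplicate_ride_ids": int(df["ride_id"].duplicated().sum()),
        # trips whose start timestamp falls outside the file's month
        "starts_outside_month": int((started.dt.strftime("%Y-%m") != month).sum()),
        # trips that end before they start
        "ended_before_started": int((ended < started).sum()),
    }

# Made-up rows, only to show the shape of the result
sample = pd.DataFrame({
    "ride_id": ["A1", "A2", "A2"],
    "started_at": ["2023-04-02 08:37:28", "2023-04-19 11:29:02", "2023-05-01 00:00:10"],
    "ended_at": ["2023-04-02 08:41:37", "2023-04-19 11:20:00", "2023-05-01 00:05:00"],
})
checks = basic_credibility_checks(sample, "2023-04")
print(checks)
```

On the real data this would be called as `basic_credibility_checks(df_apr, "2023-04")`. Non-zero counts would not necessarily disqualify the dataset, but they would point to rows worth reviewing in the Process step.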

In [7]:
df_may = pd.read_csv("202305-divvy-tripdata.csv")
df_jun = pd.read_csv("202306-divvy-tripdata.csv")
df_may.info()
df_jun.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 604827 entries, 0 to 604826
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   ride_id             604827 non-null  object 
 1   rideable_type       604827 non-null  object 
 2   started_at          604827 non-null  object 
 3   ended_at            604827 non-null  object 
 4   start_station_name  515587 non-null  object 
 5   start_station_id    515587 non-null  object 
 6   end_station_name    509560 non-null  object 
 7   end_station_id      509560 non-null  object 
 8   start_lat           604827 non-null  float64
 9   start_lng           604827 non-null  float64
 10  end_lat             604117 non-null  float64
 11  end_lng             604117 non-null  float64
 12  member_casual       604827 non-null  object 
dtypes: float64(4), object(9)
memory usage: 60.0+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 719618 entries, 0 to 719617
Data col

#### Comment: <br> The May and June datasets appear to be ROCCC as well.
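Comparing the `info()` printouts by eye works for three files, but the same check can be made explicit. The sketch below assumes the monthly frames are held in a dict; the function names `schema_of` and `schemas_match` and the toy frames are illustrative and do not appear in the original notebook.

```python
import pandas as pd

def schema_of(df: pd.DataFrame) -> list:
    """Column names with their dtypes (as strings), in column order."""
    return [(col, str(dtype)) for col, dtype in df.dtypes.items()]

def schemas_match(frames: dict) -> bool:
    """True when every DataFrame shares the same columns, order, and dtypes."""
    schemas = [schema_of(df) for df in frames.values()]
    return all(schema == schemas[0] for schema in schemas[1:])

# Toy frames standing in for the monthly files
a = pd.DataFrame({"ride_id": ["x"], "start_lat": [41.8]})
b = pd.DataFrame({"ride_id": ["y"], "start_lat": [41.9]})
c = pd.DataFrame({"ride_id": ["z"], "start_lat": ["41.9"]})  # latitude read as text

same = schemas_match({"apr": a, "may": b})
different = schemas_match({"apr": a, "jun": c})
print(same, different)
```

With the real data the call would be `schemas_match({"apr": df_apr, "may": df_may, "jun": df_jun})`.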

## Response To *__Prepare__* 
### We will be analyzing the first 3 months of FY 2023-24 (Apr-Jun 2023) for this study. The data comes from the trip data available at (https://divvy-tripdata.s3.amazonaws.com/index.html), which is provided as a part of the case study.<br><br> The dataset contains details of bike trips made by users of the Cyclistic Bike Share service.<br><br> A preliminary review of the dataset suggests that it is ROCCC and can be used for the analysis.
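Since the three months are analyzed together, one option is to stack them into a single frame while keeping track of which file each row came from. This is only a sketch of that idea: the helper `combine_months` and the `source_month` column are assumptions introduced here, and the toy frames are made up.

```python
import pandas as pd

def combine_months(frames: dict) -> pd.DataFrame:
    """Stack monthly frames and record the source month in a new column."""
    tagged = [df.assign(source_month=month) for month, df in frames.items()]
    # ignore_index gives the combined frame a fresh 0..n-1 index
    return pd.concat(tagged, ignore_index=True)

# Toy frames standing in for the monthly files
apr = pd.DataFrame({"ride_id": ["A1", "A2"], "member_casual": ["member", "casual"]})
may = pd.DataFrame({"ride_id": ["B1"], "member_casual": ["member"]})

combined = combine_months({"2023-04": apr, "2023-05": may})
print(combined)
```

With the real data the call would be `combine_months({"2023-04": df_apr, "2023-05": df_may, "2023-06": df_jun})`.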

# 3. *__Process__*
### Key tasks
#### 1. Check the data for errors.<br>2. Choose your tools.<br>3. Transform the data so you can work with it effectively.<br>4. Document the cleaning process.
### Deliverable: <br> Documentation of any cleaning or manipulation of data.

#### We start by analyzing null values in the datasets.

In [8]:
print("Null count in Apr:", df_apr.isnull().sum(), sep="\n")

Null count in Apr:
ride_id                   0
rideable_type             0
started_at                0
ended_at                  0
start_station_name    63814
start_station_id      63814
end_station_name      68630
end_station_id        68630
start_lat                 0
start_lng                 0
end_lat                 435
end_lng                 435
member_casual             0
dtype: int64


#### Comment:<br>Besides the primary key `ride_id`, the columns `rideable_type`, `started_at`, `ended_at`, `member_casual`, and the start coordinates (`start_lat`, `start_lng`) look reliable, with 0 null values each.
#### The nulls are concentrated in the start and end station names and IDs, along with the end coordinates. Let's investigate a few of these cases further.
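Raw null counts are easier to judge as a share of all rows. For April, 63,814 missing start stations out of 426,590 trips is roughly 15%, and 68,630 missing end stations is roughly 16%. A minimal sketch of that calculation follows; the helper name `null_summary` and the sample rows are made up for illustration.

```python
import pandas as pd

def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Null count and percentage per column, keeping only columns with nulls."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(2),
    })
    return summary[summary["null_count"] > 0].sort_values("null_count", ascending=False)

# Made-up rows, only to show the shape of the result
sample = pd.DataFrame({
    "ride_id": ["A1", "A2", "A3", "A4"],
    "end_station_name": ["Wood St & Chicago Ave", None, None, "Kenosha & Wellington"],
    "end_lat": [41.90, 41.91, None, 41.93],
})
summary = null_summary(sample)
print(summary)
```

On the real data this would be called as `null_summary(df_apr)`, and likewise for the May and June frames.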
### 3.1 Investigate missing and mismatched start and end station IDs/names

In [10]:
df_subset = df_apr[df_apr['start_station_name'].notnull() & df_apr['end_station_name'].isnull()]
df_subset.head(10)

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
28,8936AA6E8D57572E,electric_bike,2023-04-21 20:40:44,2023-04-21 20:48:44,Kenosha & Wellington,361,,,41.93,-87.73,41.93,-87.71,member
123,C14FC8B566D06DA8,electric_bike,2023-04-27 18:54:33,2023-04-27 19:05:50,California Ave & Division St,13256,,,41.90301,-87.697637,41.9,-87.67,member
249,62D4BD561EF92234,electric_bike,2023-04-21 18:01:04,2023-04-21 18:11:05,California Ave & Milwaukee Ave,13084,,,41.92266,-87.697057,41.92,-87.67,member
288,59D9C085B07268AC,electric_bike,2023-04-15 13:34:10,2023-04-15 13:57:35,Campbell Ave & Montrose Ave,15623,,,41.961605,-87.691177,41.97,-87.66,member
297,126A53B30E6DC376,electric_bike,2023-04-03 07:51:36,2023-04-03 08:00:45,Wood St & Chicago Ave,637,,,41.89567,-87.67232,41.87,-87.68,member
332,2333641422D7B6DE,electric_bike,2023-04-17 11:48:19,2023-04-17 11:54:33,California Ave & Milwaukee Ave,13084,,,41.922672,-87.697164,41.91,-87.7,member
378,CE757C127CC38E7C,electric_bike,2023-04-24 23:02:30,2023-04-24 23:05:49,California Ave & Milwaukee Ave,13084,,,41.922692,-87.697145,41.91,-87.7,member
389,E826DA0CCA666902,electric_bike,2023-04-26 17:06:35,2023-04-26 17:12:34,California Ave & Milwaukee Ave,13084,,,41.922745,-87.697295,41.94,-87.69,member
467,21EB373FBF2C6C26,electric_bike,2023-04-12 19:50:27,2023-04-12 20:09:01,California Ave & Milwaukee Ave,13084,,,41.922663,-87.697167,41.89,-87.67,member
470,EA5CFDA9CF9826A2,electric_bike,2023-04-02 23:09:01,2023-04-02 23:12:38,California Ave & Milwaukee Ave,13084,,,41.922688,-87.697211,41.91,-87.7,member
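All ten rows shown above are electric bikes, but ten rows are not enough to conclude that missing end stations are specific to that bike type. One way to check is to compute the share of trips without an end station per `rideable_type`. This is a sketch under that assumption: the helper name `missing_end_station_by_type` and the sample rows are made up, and it states nothing about the real distribution.

```python
import pandas as pd

def missing_end_station_by_type(df: pd.DataFrame) -> pd.Series:
    """Share of trips with no end station name, per rideable_type."""
    missing = df["end_station_name"].isnull()
    # The mean of a boolean mask is the fraction of True values in each group
    return missing.groupby(df["rideable_type"]).mean()

# Made-up rows, only to show the shape of the result
sample = pd.DataFrame({
    "rideable_type": ["electric_bike", "electric_bike", "electric_bike",
                      "classic_bike", "classic_bike"],
    "end_station_name": [None, None, "Wood St & Chicago Ave",
                         "Kenosha & Wellington", "Campbell Ave & Montrose Ave"],
})
shares = missing_end_station_by_type(sample)
print(shares)
```

On the real data this would be called as `missing_end_station_by_type(df_apr)`. Grouping by `member_casual` instead would show whether the gaps differ between rider types.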
