# STAGE 4: ANALYZE

- How should you organize your data to perform analysis on it?
- Has your data been properly formatted?
- What surprises did you discover in the data?
- What trends or relationships did you find in the data?
- How will these insights help answer your business questions?

## Key tasks
- Aggregate your data so it’s useful and accessible. 
- Organize and format your data.
- Perform calculations.
- Identify trends and relationships.

# STAGE 5: SHARE

- Were you able to answer the question of how annual members and casual riders use Cyclistic bikes differently?
- What story does your data tell?
- How do your findings relate to your original question?
- Who is your audience? What is the best way to communicate with them?
- Can data visualization help you share your findings?
- Is your presentation accessible to your audience?

## Key tasks
- Determine the best way to share your findings.
- Create effective data visualizations.
- Present your findings. 
- Ensure your work is accessible.

## Data Aggregation: Get ride counts by month with station lat/lng coordinates
- I will collect station names and member/casual ride counts for each month and save them to a CSV file:
<font color = 'red'>dfstations_ride_count_master.csv</font>
- Since each ride has both a start and an end station, I will aggregate the two separately and then combine the results into one master file.

In [1]:
import pandas as pd

# file_list_2020.csv holds the names of the cleaned monthly trip files, one per row
file_list_df = pd.read_csv('file_list_2020.csv', header=None, names=['filename'])
file_list = file_list_df['filename'].values

# read the repeated label columns as categoricals
dtypes = {'ride_id': 'str', 'rideable_type': 'category', 'start_station_name': 'category',
          'start_station_id': 'category', 'end_station_name': 'category',
          'end_station_id': 'category', 'member_casual': 'category'}

def read_csv_to_df(filename, dtype):
    """Read one cleaned trip file and parse the ride timestamps."""
    df = pd.read_csv('./Data/cleaned_csv/' + filename, parse_dates=['started_at', 'ended_at'], dtype=dtype)
    return df

In [2]:
from datetime import datetime

In [3]:
dfstations_start_ride_count_master = pd.DataFrame()
for filename in file_list:
    df_filename = read_csv_to_df(filename, dtypes)
    # average the lat/lng per station and rider type for the later map visualization
    # name the columns without the "start"/"end" prefix so both tables share one layout
    dfstation_lat = df_filename.groupby(['start_station_name','member_casual'])['start_lat'].mean()
    dfstation_lat.name = 'station_lat'
    dfstation_lng = df_filename.groupby(['start_station_name','member_casual'])['start_lng'].mean()
    dfstation_lng.name = 'station_lng'
    
    dfstation_latlng = pd.concat([dfstation_lat, dfstation_lng], axis=1)
    dfstation_latlng.index.set_names(['station_name','member'], inplace=True)
    
    dfstation_latlng['Ym']= datetime.strptime(filename[0:6],'%Y%m')
    dfstation_latlng['count']=df_filename.groupby(['start_station_name','member_casual'])['ride_id'].count()
    dfstations_start_ride_count_master = pd.concat([dfstations_start_ride_count_master, dfstation_latlng], axis=0)

# filter out 0 count rows    
dfstations_start_ride_count_master=dfstations_start_ride_count_master[dfstations_start_ride_count_master['count']>0]
    

In [4]:
print (dfstations_start_ride_count_master.shape)
dfstations_start_ride_count_master.head()

(40453, 4)


| station_name | member | station_lat | station_lng | Ym | count |
|---|---|---|---|---|---|
| 2112 W Peterson Ave | casual | 41.9912 | -87.6836 | 2020-04-01 | 13 |
| 2112 W Peterson Ave | member | 41.9912 | -87.6836 | 2020-04-01 | 29 |
| 63rd St Beach | casual | 41.781 | -87.5761 | 2020-04-01 | 5 |
| 63rd St Beach | member | 41.781 | -87.5761 | 2020-04-01 | 33 |
| 900 W Harrison St | casual | 41.8748 | -87.6498 | 2020-04-01 | 31 |
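
The zero-count rows filtered out in cell 3 most likely come from grouping on categorical columns: depending on the pandas version and the `observed` setting, `groupby` can emit every combination of categories, including ones with no rides. The sketch below illustrates this on toy data; the station names `A`, `B`, `C` and the ride IDs are made up for the example and are not from the Cyclistic files.

```python
import pandas as pd

# Toy data (hypothetical): station "C" is a known category but has no rides.
rides = pd.DataFrame({
    "start_station_name": pd.Categorical(["A", "A", "B"], categories=["A", "B", "C"]),
    "member_casual": pd.Categorical(["member", "casual", "member"]),
    "ride_id": ["r1", "r2", "r3"],
})

keys = ["start_station_name", "member_casual"]

# observed=False: one row per category combination, even if it never occurs.
all_combos = rides.groupby(keys, observed=False)["ride_id"].count()

# observed=True: only the combinations that actually appear in the data.
seen_only = rides.groupby(keys, observed=True)["ride_id"].count()

print(len(all_combos), "groups with observed=False")
print(len(seen_only), "groups with observed=True")
```

Passing `observed` explicitly keeps the result independent of the installed pandas version; with `observed=True` the later `count > 0` filter becomes a no-op.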


In [5]:
dfstations_end_ride_count_master = pd.DataFrame()
for filename in file_list:
    df_filename = read_csv_to_df(filename, dtypes)
    # average the lat/lng per station and rider type for the later map visualization
    # name the columns without the "start"/"end" prefix so both tables share one layout
    dfstation_lat = df_filename.groupby(['end_station_name','member_casual'])['end_lat'].mean()
    dfstation_lat.name = 'station_lat'
    dfstation_lng = df_filename.groupby(['end_station_name','member_casual'])['end_lng'].mean()
    dfstation_lng.name = 'station_lng'
    
    dfstation_latlng = pd.concat([dfstation_lat, dfstation_lng], axis=1)
    dfstation_latlng.index.set_names(['station_name','member'], inplace=True)
    
    dfstation_latlng['Ym']= datetime.strptime(filename[0:6],'%Y%m')
    dfstation_latlng['count']=df_filename.groupby(['end_station_name','member_casual'])['ride_id'].count()
    dfstations_end_ride_count_master = pd.concat([dfstations_end_ride_count_master, dfstation_latlng], axis=0)

# filter out 0 count rows    
dfstations_end_ride_count_master=dfstations_end_ride_count_master[dfstations_end_ride_count_master['count']>0]

In [6]:
print (dfstations_end_ride_count_master.shape)
dfstations_end_ride_count_master.head()

(40746, 4)


| station_name | member | station_lat | station_lng | Ym | count |
|---|---|---|---|---|---|
| 2112 W Peterson Ave | casual | 41.9912 | -87.6836 | 2020-04-01 | 13 |
| 2112 W Peterson Ave | member | 41.9912 | -87.6836 | 2020-04-01 | 37 |
| 63rd St Beach | casual | 41.781 | -87.5761 | 2020-04-01 | 4 |
| 63rd St Beach | member | 41.781 | -87.5761 | 2020-04-01 | 30 |
| 900 W Harrison St | casual | 41.8748 | -87.6498 | 2020-04-01 | 26 |


In [7]:
dfstations_start_ride_count_master['start_end']='start'
dfstations_end_ride_count_master['start_end']='end'

In [8]:
# Concat the start and end stations
dfstations_ride_count_master = pd.concat([dfstations_start_ride_count_master, dfstations_end_ride_count_master],axis=0)

In [9]:
print (dfstations_ride_count_master.shape)

(81199, 5)
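
The start and end loops differ only in the column prefix, so the same aggregation could be written once. The following is a minimal sketch of that idea, not the code used to produce the numbers above: the helper name `station_ride_counts` and the toy rides are hypothetical, and it assumes the column names from the cleaned trip files (`start_lat`, `end_lat`, and so on).

```python
from datetime import datetime

import pandas as pd


def station_ride_counts(df, side, month):
    """Mean coordinates and ride counts per station and rider type.

    side is "start" or "end"; month is the datetime assigned to the file.
    """
    keys = [f"{side}_station_name", "member_casual"]
    out = df.groupby(keys, observed=True).agg(
        station_lat=(f"{side}_lat", "mean"),
        station_lng=(f"{side}_lng", "mean"),
        count=("ride_id", "count"),
    )
    out.index = out.index.set_names(["station_name", "member"])
    out["Ym"] = month
    out["start_end"] = side
    # Same column order as the master table in this notebook.
    return out[["station_lat", "station_lng", "Ym", "count", "start_end"]]


# Toy rides (hypothetical values) standing in for one monthly file.
rides = pd.DataFrame({
    "ride_id": ["r1", "r2", "r3"],
    "member_casual": pd.Categorical(["member", "member", "casual"]),
    "start_station_name": pd.Categorical(["A", "A", "B"]),
    "start_lat": [41.0, 41.2, 42.0],
    "start_lng": [-87.0, -87.2, -88.0],
    "end_station_name": pd.Categorical(["B", "B", "A"]),
    "end_lat": [42.0, 42.0, 41.1],
    "end_lng": [-88.0, -88.0, -87.1],
})

# Same month parsing as filename[0:6] in the loops above.
month = datetime.strptime("202004", "%Y%m")

master = pd.concat([station_ride_counts(rides, side, month) for side in ("start", "end")])
print(master)
```

Because each ride contributes one start row and one end row, the `count` column of the combined table sums to twice the number of rides, which is a quick sanity check on the concatenation.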


In [10]:
dfstations_ride_count_master.to_csv('dfstations_ride_count_master.csv')
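
Since the master table has a two-level index, `to_csv` writes `station_name` and `member` as the first two columns of the file, which is the flat layout a tool like Tableau reads. The sketch below checks this round trip on a two-row toy frame built from the `63rd St Beach` rows shown earlier; it writes to an in-memory buffer rather than to disk, which is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

# Toy frame with the same index levels and columns as the master table.
idx = pd.MultiIndex.from_tuples(
    [("63rd St Beach", "casual"), ("63rd St Beach", "member")],
    names=["station_name", "member"],
)
toy = pd.DataFrame(
    {
        "station_lat": [41.781, 41.781],
        "station_lng": [-87.5761, -87.5761],
        "Ym": pd.to_datetime(["2020-04-01", "2020-04-01"]),
        "count": [5, 33],
        "start_end": ["start", "start"],
    },
    index=idx,
)

buffer = io.StringIO()
toy.to_csv(buffer)  # index levels become the leading CSV columns
buffer.seek(0)

# Read the file back as a flat table, parsing the month column as dates.
check = pd.read_csv(buffer, parse_dates=["Ym"])
print(check.columns.tolist())
print(check.dtypes)
```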

The resulting CSV file was uploaded to Tableau Public. The interactive dashboard is available at:
https://public.tableau.com/views/CyclisticTripCountsInterativeMap/Dashboard1?:language=en-US&:display_count=n&:origin=viz_share_link

![Tableau_map.png](Figures/Tableau_map.png)