# 2.6 Creating Dashboards with Python

## This script contains the following:
#### [1. Import Libraries](#import-libraries)
#### [2. Import Data](#import-data)
#### [3. Data Wrangling](#wrangling)
#### [4. Bar Chart](#barchart)
#### [5. Dual Axis Line Chart](#linechart)
#### [6. Export Dashboard Datasets](#export-data)

### 1. Import Libraries<a id='import-libraries'></a>

In [None]:
import streamlit as st
import pandas as pd
import numpy as np
import os
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from datetime import datetime as dt
from streamlit_keplergl import keplergl_static

### 2. Import Data<a id='import-data'></a>

In [None]:
folderpath = r'/Users/matthewjones/Documents/CareerFoundry/Data Visualization with Python/Achievement 2/NY-CitiBike/2. Data/Processed Data'

df = pd.read_pickle(os.path.join(folderpath, 'cleaned_nyc_bike_weather_data.pkl'))

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.shape

### 3. Data Wrangling<a id='wrangling'></a>

In [None]:
# Drop columns that are not needed for the dashboard
df_1 = df.drop(columns = ['ride_id', 'started_at', 'ended_at', 'start_station_id',
                          'end_station_id', 'start_lat', 'start_lng', 'end_lat',
                          'end_lng', 'value'])

In [None]:
df_1.head()

In [None]:
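# Keep a random subset of roughly 8% of the rows
np.random.seed(32) ### Fix the seed so the sample is reproducible
red = np.random.rand(len(df_1)) <= 0.92 ### True for about 92% of rows (the rows to discard)

small = df_1[~red] ### Keep the remaining ~8%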

In [None]:
small.shape
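
The mask-based sampling above keeps each row with a probability of about 8%, so the size of the subset varies slightly around that share. A minimal sketch on toy data follows, assuming a hypothetical frame `toy` in place of `df_1`. It also shows `DataFrame.sample`, which draws an exact fraction, as one possible alternative rather than the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_1: 100,000 rows with a single column
toy = pd.DataFrame({"trip": np.arange(100_000)})

# Same approach as the notebook: a random boolean mask
np.random.seed(32)
discard = np.random.rand(len(toy)) <= 0.92   # True for roughly 92% of rows
toy_small = toy[~discard]                    # keep the remaining rows

sampled_fraction = len(toy_small) / len(toy)
print(f"Mask-based sample keeps {sampled_fraction:.1%} of rows")

# Alternative: DataFrame.sample returns an exact fraction of the rows
toy_exact = toy.sample(frac=0.08, random_state=32)
print(f"DataFrame.sample keeps {len(toy_exact)} rows")
```

Either approach is reproducible as long as the seed (or `random_state`) stays fixed.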

#### GROUP BY START STATION NAME

In [None]:
# Group by start station name, then keep the 20 most popular stations by total trips
df_groupby_bar = df.groupby('start_station_name', as_index=False).agg({'value': 'sum'})
top20 = df_groupby_bar.nlargest(20, 'value')
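
To illustrate the aggregation above, here is a small sketch on made-up data. It assumes, as the notebook appears to, that `value` holds 1 for every trip, so summing it per station gives that station's trip count. The station names `A`, `B`, and `C` are placeholders.

```python
import pandas as pd

# Hypothetical trip-level data: one row per trip, value = 1 for each trip
trips = pd.DataFrame({
    "start_station_name": ["A", "B", "A", "C", "B", "A"],
    "value": [1, 1, 1, 1, 1, 1],
})

# Sum the per-trip counter for each station
totals = trips.groupby("start_station_name", as_index=False).agg({"value": "sum"})

# Keep the n stations with the largest totals, sorted in descending order
top2 = totals.nlargest(2, "value")
print(top2)
```

Because `nlargest` already returns rows sorted from largest to smallest, no separate sort is needed before plotting.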

#### REDUCE NUMBER OF COLUMNS NEEDED TO ANALYZE

In [None]:
# Select only the date, bike_rides_daily, and avgTemp columns to use for the dual axis line chart
df_temp = small[['date', 'bike_rides_daily', 'avgTemp']].set_index('date')

### 4. Bar Chart<a id='barchart'></a>
    Finding the top 20 stations by total number of trips started at each station

In [None]:
# Initialize the plotly graph object with a bar chart
bar_fig = go.Figure(go.Bar(x = top20['start_station_name'], 
                           y = top20['value'], 
                           marker={'color': top20['value'], 'colorscale' : 'blues'})) ### Use a blue color palette based on the 'value' column

# Add titles to the chart
bar_fig.update_layout(
    title = '<b>Top 20 Most Popular Citi Bike Stations in New York 2022</b>', ### <b></b> makes the titles bold
    xaxis_title = '<b>Start Stations</b>',
    yaxis_title ='<b>Total Trips</b>',
    width = 900, height = 600 ### Set the height and width of the chart area
)

bar_fig.update_xaxes(
    automargin = True ### Prevent the x-axis labels from overlapping with the x-axis title
)

bar_fig.show()

Recreated the bar chart previously made in matplotlib and seaborn, now using plotly. The plotly version required less code for a similar output, primarily because of plotly's innate interactivity: we did not have to code the bar labels, since plotly already displays those numbers on hover. Because the dashboard is wider than it is tall, the bar chart was kept vertical instead of using the horizontal layout of the earlier charts.

As with all the charts, we can see that W 21 St & 6 Ave is the most popular station. Of the top 20 start stations, most are in the most expensive neighborhoods of NYC (e.g. Chelsea, Upper East Side, Midtown, SoHo, Hell's Kitchen). Within these neighborhoods, most stations are near a popular tourist destination or park area (e.g. Central Park, Union Square, Madison Square Garden/Penn Station, 9/11 Memorial). This suggests that the most common users of the bikes are commuters of high socioeconomic status or tourists.

### 5. Dual Axis Line Chart<a id='linechart'></a>
    Overlaying the total number of trips taken each day in 2022, with the average temperature of that day in NYC

In [None]:
# Initialize the plotly figure with a secondary y-axis
line_fig = make_subplots(specs = [[{"secondary_y": True}]]) ### Set up the dual axis

# PRIMARY AXIS - Total bike rides
line_fig.add_trace(
    go.Scatter(x = df_temp.index, 
               y = df_temp['bike_rides_daily'], 
               marker = {'color': '#2B4B8D'}, ### Set the color of the line chart
               fill = 'tozeroy'), ### Fill the area underneath the line chart
    secondary_y = False
)

# SECONDARY AXIS - Average temperature
line_fig.add_trace(
    go.Scatter(x = df_temp.index, 
               y = df_temp['avgTemp'], 
               marker = {'color': '#EB392A'}), ### Set the color of the line chart
    secondary_y = True
)

# Add titles and plot formatting
line_fig.update_layout(
    title = dict(text = '<b>Daily Bike Trips and Average NYC Temperature in 2022</b>',
                 font = dict(size = 18)),
    xaxis_title = '',
    yaxis1_title = dict(text = '<b>Bike Rides Daily</b>', 
                        font = dict(size = 14, color = '#2B4B8D')),
    yaxis2_title = dict(text = '<b>Average Temperature (in C)</b>', 
                        font = dict(size = 14, color = '#EB392A')),
    xaxis = dict(showgrid = False, ### Hide x-axis gridlines
                 range = [dt(2022, 1, 1), dt(2023, 1, 1)]), ### Manually set the range for the x-axis
    yaxis1 = dict(showgrid = False), ### Hide y-axis gridlines
    yaxis2 = dict(showgrid = False,
                  zeroline = False), ### Hide zero line for the y-axis (temperature)
    showlegend = False, ### Hide legend
    margin = dict(pad = 10), ### Add spacing between the plot and the axis labels
    width = 900, height = 500 ### Set the height and width of the chart area
)

line_fig.update_yaxes(
    automargin = True ### Prevent the y-axis labels from overlapping with the y-axis titles
)

line_fig.show()

Recreated the dual axis chart originally made with matplotlib and seaborn, now using plotly. This chart took about the same amount of code, with no loss in the customization we wanted. As noted before, average temperature and total daily bike rides are very strongly correlated: customers are more likely to use Citi Bikes in warmer weather and less likely to use them in colder weather. If our primary customer base is tourists, this matches summer being the peak tourist season for international travelers.

This pattern appears to run opposite to subway usage in privileged areas, where usage goes down in warmer weather. In the summer months, more people turn to bicycles for their daily transportation, and the warm weather encourages exercise. Perhaps those commuters are working from home more often during the summer and don't need to travel as far by subway. The increased usage in the summer could also be due to children who are out of school during those months (and who wouldn't need to take a subway train to commute).
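
The relationship between temperature and daily rides described above could be quantified with a Pearson correlation coefficient. The sketch below uses made-up numbers in a hypothetical frame `toy_temp` shaped like `df_temp`. It assumes that `df_temp`, being derived from trip-level rows, repeats each date many times, so it keeps one row per date before computing the coefficient.

```python
import pandas as pd

# Hypothetical values shaped like df_temp: dates repeat because rows are trips
toy_temp = pd.DataFrame(
    {
        "bike_rides_daily": [20000, 20000, 35000, 60000, 90000, 110000, 110000, 70000],
        "avgTemp": [-1.0, -1.0, 4.5, 12.0, 21.0, 26.5, 26.5, 15.0],
    },
    index=pd.to_datetime(
        ["2022-01-15", "2022-01-15", "2022-03-01", "2022-04-20",
         "2022-06-10", "2022-07-25", "2022-07-25", "2022-10-05"]
    ),
)
toy_temp.index.name = "date"

# Keep one row per date so that busy days are not counted more than once
daily = toy_temp[~toy_temp.index.duplicated()]

# Pearson correlation between daily ride totals and average temperature
r = daily["bike_rides_daily"].corr(daily["avgTemp"])
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 would support the "strongly correlated" reading. The value printed here reflects only the toy numbers, not the Citi Bike data.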

### 6. Export Dashboard Datasets<a id='export-data'></a>

In [None]:
top20.to_csv(os.path.join(folderpath, 'DB_top20_stations.csv'), index=True)

In [None]:
df_temp.to_csv(os.path.join(folderpath, 'DB_dualaxis_rides_temp.csv'), index=True)

In [None]:
small.to_csv(os.path.join(folderpath, 'DB_reduced_bike_weather_data.csv'), index=True)
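
Since `df_temp` uses `date` as its index, writing it with `index=True` stores the dates as the first CSV column. A possible way to read such a file back in the dashboard script is sketched below. It uses a temporary directory and a toy frame instead of the real `folderpath` and data, both of which are illustrative assumptions.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-day frame shaped like df_temp
toy = pd.DataFrame(
    {"bike_rides_daily": [100, 200], "avgTemp": [1.5, 3.0]},
    index=pd.to_datetime(["2022-01-01", "2022-01-02"]),
)
toy.index.name = "date"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "DB_dualaxis_rides_temp.csv")
    toy.to_csv(path, index=True)  # the date index becomes the first column

    # Restore the index and parse it back into datetimes
    restored = pd.read_csv(path, index_col="date", parse_dates=["date"])

print(restored)
```

Without `parse_dates`, the dates would come back as plain strings, which would affect how a date axis is drawn.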