# Citibike ML
In this example we use the [Citibike dataset](https://ride.citibikenyc.com/system-data). Citibike is a bicycle sharing system in New York City. Every day, users choose from 20,000 bicycles at 1,300 stations around the city.

To ensure customer satisfaction, Citibike needs to predict how many bicycles will be needed at each station. Citibike maintenance teams check each station and repair or replace bicycles. The teams also relocate bicycles between stations based on predicted demand. The business needs to be able to run reports on how many bicycles will be needed at a given station on a given day.

## Streamlit Application
In this section of the demo, we will use Streamlit with Snowpark's Python client-side DataFrame API to create a visual front-end application that lets the Citibike operations team consume the insights from the ML forecast.

For this demo flow we will assume that the organization has the following **policies and processes**:
- **Dev Tools**: The ML engineer can develop in their tool of choice (e.g. VS Code, IntelliJ, PyCharm, Eclipse, etc.). Snowpark Python makes it possible to use any environment that has a Python kernel. For the sake of the demo we will use Jupyter.
- **Data Governance**: To preserve customer privacy, no data can be stored locally. The ingest system may store data temporarily, but it must be assumed that, in production, the ingest system will not preserve intermediate data products between runs. Snowpark Python allows the user to push down all operations to Snowflake and bring the code to the data.
- **Automation**: Although the ML engineer can use any IDE or notebook for development, the final product at the end of the work stream must be Python code. Well-documented, modularized code is necessary for good ML operations and for interfacing with the company's CI/CD and orchestration tools.
- **Compliance**: Any ML model must be traceable back to the original dataset used for training. The business needs to be able to easily remove specific user data from training datasets and retrain models.
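The Compliance policy above can be illustrated with a small, standard-library-only sketch. It assumes that training rows are plain dictionaries with a hypothetical `USER_ID` field, and it records a SHA-256 fingerprint of the exact rows used for training so that a model could be traced back to its dataset. The function names and fields are assumptions made for illustration, not part of the Citibike pipeline.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest that identifies this exact set of rows."""
    # Serialize each row deterministically, then sort so row order does not matter.
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def remove_users(rows, user_ids):
    """Return a new list without the rows that belong to the given users."""
    blocked = set(user_ids)
    return [row for row in rows if row["USER_ID"] not in blocked]


# Hypothetical trip records; USER_ID and the other fields are illustrative only.
trips = [
    {"USER_ID": "u1", "STATION_ID": "519", "DATE": "2020-03-01"},
    {"USER_ID": "u2", "STATION_ID": "545", "DATE": "2020-03-01"},
    {"USER_ID": "u1", "STATION_ID": "545", "DATE": "2020-03-02"},
]

original_id = dataset_fingerprint(trips)   # store alongside the trained model
cleaned = remove_users(trips, ["u1"])      # drop one user's data before retraining
cleaned_id = dataset_fingerprint(cleaned)  # the retrained model gets a new fingerprint
```

Storing the fingerprint next to each trained model is one possible way to tell which models were trained before a user's data was removed and therefore need retraining.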

In [10]:
!pip -q install streamlit

In [2]:
from dags.snowpark_connection import snowpark_connect
session, state_dict = snowpark_connect('./include/state.json')

In [11]:
import pandas as pd
import streamlit as st

In [345]:
import numpy as np  # required by the np.random.normal calls below
# eval_df=session.table('FLAT_EVAL_D4D511E2_A933_11EC_9280_0242AC180004')
# pd.DataFrame(eval_df.to_pandas()).to_csv('./eval_test.csv')
eval_df = pd.read_csv('./eval_test.csv').drop('Unnamed: 0', axis=1)
eval_df['DATE']='2020-02-29'
eval_df2 = pd.read_csv('./eval_test.csv').drop('Unnamed: 0', axis=1)
eval_df2['DATE']='2020-01-31'
eval_df2['RMSE']=eval_df2['RMSE']+np.random.normal(size=len(eval_df2)) #*.1
eval_df3 = pd.read_csv('./eval_test.csv').drop('Unnamed: 0', axis=1)
eval_df3['DATE']='2019-12-31'
eval_df3['RMSE']=eval_df3['RMSE']+np.random.normal(size=len(eval_df3)) #*.1
eval_df=pd.concat([eval_df, eval_df2])
eval_df=pd.concat([eval_df, eval_df3])
eval_df.to_csv('./eval_test1.csv')
# eval_df = pd.read_csv('./eval_test1.csv').drop('Unnamed: 0', axis=1)
# forecast_df=session.table('FLAT_FORECAST_D4D511E2_A933_11EC_9280_0242AC180004')
# pd.DataFrame(forecast_df.to_pandas()).to_csv('./forecast_test.csv')
# forecast_df = pd.read_csv('./forecast_test.csv').drop('Unnamed: 0', axis=1)
# forecast_df['DATE'] = pd.to_datetime(forecast_df['DATE'])
# forecast_df['STATION_ID']=forecast_df['STATION_ID'].astype(str)
# pred_df=session.table('FLAT_PRED_D4D511E2_A933_11EC_9280_0242AC180004')
# pd.DataFrame(pred_df.to_pandas()).to_csv('./pred_test.csv')
#pred_df = pd.read_csv('./pred_test.csv').drop('Unnamed: 0', axis=1)
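The cell above builds a three-month evaluation history by copying one month of RMSE values and adding Gaussian noise. A reproducible variant could look like the sketch below. The helper name `make_eval_history`, its parameters, and the sample values are assumptions for illustration; the seeded generator is swapped in so that repeated runs produce the same synthetic history.

```python
import numpy as np
import pandas as pd


def make_eval_history(base_df, dates, noise_scale=1.0, seed=0):
    """Stack one copy of base_df per date, adding Gaussian noise to RMSE
    for every date except the first one."""
    rng = np.random.default_rng(seed)  # seeded so repeated runs match
    frames = []
    for i, date in enumerate(dates):
        frame = base_df.copy()
        frame["DATE"] = date
        if i > 0:
            noise = rng.normal(scale=noise_scale, size=len(frame))
            frame["RMSE"] = frame["RMSE"] + noise
        frames.append(frame)
    # ignore_index avoids repeated index labels in the combined frame.
    return pd.concat(frames, ignore_index=True)


# Illustrative values only; the real RMSE values come from eval_test.csv.
base = pd.DataFrame({"STATION_ID": ["519", "545"], "RMSE": [10.0, 20.0]})
history = make_eval_history(base, ["2020-02-29", "2020-01-31", "2019-12-31"])
```

Writing the result with `history.to_csv(path, index=False)` would also avoid the extra `Unnamed: 0` column that the loaders have to drop.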

In [402]:
#%%writefile streamlit/app.py
import streamlit as st
import pandas as pd
from datetime import timedelta
import altair as alt

#@st.cache
def load_forecast_data(forecast_filename:str):
    df = pd.read_csv(forecast_filename).drop('Unnamed: 0', axis=1)
    df['DATE'] = pd.to_datetime(df['DATE'])
    df['STATION_ID']=df['STATION_ID'].astype(str)    
    return df

#@st.cache
def load_eval_data(eval_filename:str):
    df = pd.read_csv(eval_filename).drop('Unnamed: 0', axis=1)
    df['DATE'] = pd.to_datetime(df['DATE'])
    df['STATION_ID']=df['STATION_ID'].astype(str)    
    return df


def update_forecast_table(forecast_df, stations:list, start_date, end_date):
    # Copy the filtered rows so the DATE assignment below does not write to a
    # slice of the caller's DataFrame (this avoids SettingWithCopyWarning).
    forecast_df = forecast_df.loc[(forecast_df['DATE']>=pd.Timestamp(start_date)) & 
                                  (forecast_df['DATE']<pd.Timestamp(end_date))].copy()
    forecast_df['DATE'] = forecast_df['DATE'].dt.strftime('%Y-%m-%d')
    
    data = forecast_df.pivot(index="STATION_ID", columns="DATE", values="PRED").loc[stations]
    st.write("### Weekly Forecast", data)
    
    return None

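def update_eval_table(eval_df, stations:list):
    # Work on a copy so the caller's DATE column stays a datetime; otherwise a
    # second call would fail on .dt because DATE would already hold strings.
    eval_df = eval_df.copy()
    eval_df['DATE'] = eval_df['DATE'].dt.strftime('%Y-%m-%d')
    
    data = eval_df.pivot(index="STATION_ID", columns="DATE", values="RMSE").loc[stations]
    st.write("### Model Monitor (RMSE)", data)    
    return None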


forecast_df = load_forecast_data('./forecast_test.csv')
eval_df = load_eval_data('./eval_test1.csv')

min_date=forecast_df['DATE'].min()
max_date=forecast_df['DATE'].max()
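# Number of forecast rows for the first station; .iloc[0] selects by position, not by index label.
max_days=len(forecast_df[forecast_df['STATION_ID'] == forecast_df['STATION_ID'].iloc[0]])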

start_date = st.date_input('Start Date', value=min_date, min_value=min_date, max_value=max_date)
show_days = st.number_input('Number of days to show', value=7, min_value=1, max_value=max_days)
end_date = start_date+timedelta(days=show_days)

stations = st.multiselect('Choose stations', forecast_df['STATION_ID'].unique(), ["519", "545"])
if not stations:
    stations = forecast_df['STATION_ID'].unique()

update_forecast_table(forecast_df, stations, start_date, end_date)

update_eval_table(eval_df, stations)

download_file_names = st.multiselect(label='Monthly ingest file(s):', 
                                     options=['202003-citibike-tripdata.csv.zip'], 
                                     default=['202003-citibike-tripdata.csv.zip'])

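# args must be a tuple (note the trailing comma); it is only used once an on_click callback is supplied.
st.button('Run Ingest Taskflow', args=(download_file_names,))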



False
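The filtering and pivoting inside `update_forecast_table` can also be written as a pure function that returns the table instead of rendering it, which makes the logic testable without Streamlit. The sketch below is one way to do that; the helper name `forecast_window` and the sample values are assumptions, and the window is the same half-open interval `[start_date, start_date + show_days)` used in the app.

```python
from datetime import date, timedelta

import pandas as pd


def forecast_window(forecast_df, stations, start_date, show_days):
    """Return a STATION_ID x DATE table of PRED for the selected window."""
    start = pd.Timestamp(start_date)
    end = pd.Timestamp(start_date + timedelta(days=show_days))
    in_window = (forecast_df["DATE"] >= start) & (forecast_df["DATE"] < end)
    # Copy before modifying so the caller's DataFrame is left untouched.
    window = forecast_df.loc[in_window].copy()
    window["DATE"] = window["DATE"].dt.strftime("%Y-%m-%d")
    table = window.pivot(index="STATION_ID", columns="DATE", values="PRED")
    return table.loc[list(stations)]


# Made-up forecast values for two stations over three days.
demo = pd.DataFrame({
    "DATE": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"] * 2),
    "STATION_ID": ["519"] * 3 + ["545"] * 3,
    "PRED": [12, 442, 448, 105, 135, 146],
})
table = forecast_window(demo, ["545"], date(2020, 3, 1), show_days=2)
```

In the app, the returned table could then be passed to `st.write` exactly as `data` is today.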

In [405]:
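# Copy the filtered rows before modifying them to avoid SettingWithCopyWarning.
forecast_df = forecast_df.loc[(forecast_df['DATE']>=pd.Timestamp(start_date)) & 
                              (forecast_df['DATE']<pd.Timestamp(end_date))].copy()
forecast_df['DATE'] = forecast_df['DATE'].dt.strftime('%Y-%m-%d')

data = forecast_df.pivot(index="STATION_ID", columns="DATE", values="PRED").loc[stations]

# Altair expects long-form data with DATE and STATION_ID as columns, so the
# heatmap is built from the filtered rows rather than from the pivoted table.
long_df = forecast_df[forecast_df['STATION_ID'].isin(stations)]
rect = alt.Chart(long_df).mark_rect().encode(
    alt.X('DATE:O'),
    alt.Y('STATION_ID:N'),
    alt.Color('PRED:Q',
        scale=alt.Scale(scheme='greenblue'),
        legend=alt.Legend(title='Forecast (PRED)')
    )
)

data  # display the pivoted forecast table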




| STATION_ID | 2020-03-01 | 2020-03-02 | 2020-03-03 | 2020-03-04 | 2020-03-05 | 2020-03-06 | 2020-03-07 |
|---|---|---|---|---|---|---|---|
| 519 | 12 | 442 | 448 | 448 | 433 | 419 | 11 |
| 545 | 105 | 135 | 146 | 151 | 141 | 139 | 140 |
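Altair encodings refer to column names, so a pivoted table like the one above generally needs to be converted back to long form before charting. The sketch below shows one way to do that with `melt`; the small `wide` frame is rebuilt by hand from the first two date columns of the output and is only illustrative.

```python
import pandas as pd

# Part of the pivoted table from the output above, rebuilt by hand for illustration.
wide = pd.DataFrame(
    {"2020-03-01": [12, 105], "2020-03-02": [442, 135]},
    index=pd.Index(["519", "545"], name="STATION_ID"),
)
wide.columns.name = "DATE"

# Move STATION_ID out of the index, then unpivot the date columns into rows.
long_df = wide.reset_index().melt(
    id_vars="STATION_ID", var_name="DATE", value_name="PRED"
)
```

A frame shaped like `long_df` can then be passed to `alt.Chart` with encodings such as `DATE:O`, `STATION_ID:N`, and `PRED:Q`.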


In [396]:



# alt.Chart(data).mark_line().encode(
#             x="DATE:T",
#             y=alt.Y("RMSE:N", stack=None),
#             color="STATION_ID:N")


In [387]:
help(alt.Chart().mark_line())

Help on Chart in module altair.vegalite.v4.api object:

class Chart(TopLevelMixin, _EncodingMixin, altair.vegalite.v4.schema.mixins.MarkMethodMixin, altair.vegalite.v4.schema.core.TopLevelUnitSpec)
 |  Chart(data=Undefined, encoding=Undefined, mark=Undefined, width=Undefined, height=Undefined, **kwargs)
 |  
 |  Create a basic Altair/Vega-Lite chart.
 |  
 |  Although it is possible to set all Chart properties as constructor attributes,
 |  it is more idiomatic to use methods such as ``mark_point()``, ``encode()``,
 |  ``transform_filter()``, ``properties()``, etc. See Altair's documentation
 |  for details and examples: http://altair-viz.github.io/.
 |  
 |  Attributes
 |  ----------
 |  data : Data
 |      An object describing the data source
 |  mark : AnyMark
 |      A string describing the mark type (one of `"bar"`, `"circle"`, `"square"`, `"tick"`,
 |       `"line"`, * `"area"`, `"point"`, `"rule"`, `"geoshape"`, and `"text"`) or a
 |       MarkDef object.
 |  encoding : FacetedE

'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/movies.json'

In [None]:
import streamlit as st
import pandas as pd
import altair as alt

from urllib.error import URLError

@st.cache
def get_UN_data():
    AWS_BUCKET_URL = "http://streamlit-demo-data.s3-us-west-2.amazonaws.com"
    df = pd.read_csv(AWS_BUCKET_URL + "/agri.csv.gz")
    return df.set_index("Region")

try:
    df = get_UN_data()
    countries = st.multiselect(
        "Choose countries", list(df.index), ["China", "United States of America"]
    )
    if not countries:
        st.error("Please select at least one country.")
    else:
        data = df.loc[countries]
        data /= 1000000.0
        st.write("### Gross Agricultural Production ($B)", data.sort_index())

        data = data.T.reset_index()
        data = pd.melt(data, id_vars=["index"]).rename(
            columns={"index": "year", "value": "Gross Agricultural Product ($B)"}
        )
        chart = (
            alt.Chart(data)
            .mark_area(opacity=0.3)
            .encode(
                x="year:T",
                y=alt.Y("Gross Agricultural Product ($B):Q", stack=None),
                color="Region:N",
            )
        )
        st.altair_chart(chart, use_container_width=True)
except URLError as e:
    st.error(
        """
        **This demo requires internet access.**

        Connection error: %s
    """
        % e.reason
    )

In [382]:
AWS_BUCKET_URL = "http://streamlit-demo-data.s3-us-west-2.amazonaws.com"
df1 = pd.read_csv(AWS_BUCKET_URL + "/agri.csv.gz")
df1 = df1.set_index("Region")
countries = ["China", "United States of America"]
data = df1.loc[countries]
data /= 1000000.0
data = data.T.reset_index()
data = pd.melt(data, id_vars=["index"]).rename(
            columns={"index": "year", "value": "Gross Agricultural Product ($B)"})

In [381]:
data

|  | year | Region | Gross Agricultural Product ($B) |
|---|---|---|---|
| 0 | 1961 | China | 58.34074 |
| 1 | 1962 | China | 60.69086 |
| 2 | 1963 | China | 63.94270 |
| 3 | 1964 | China | 68.46261 |
| 4 | 1965 | China | 74.74790 |
| ... | ... | ... | ... |
| 89 | 2003 | United States of America | 172.45820 |
| 90 | 2004 | United States of America | 183.51910 |
| 91 | 2005 | United States of America | 181.43290 |
| 92 | 2006 | United States of America | 176.80300 |
