<h1>Charts with bokeh assignment</h1>
Download the NYC taxi data for January 2022 (see below) and prepare the following charts:

<ol>
    <li>A bokeh bar chart with day of the week (Monday, Tuesday, ...) on the x-axis and the average duration of rides on the y-axis. Make sure that the hover tool is activated and that it shows the average duration when the cursor hovers over a bar.</li>
    <li>A bokeh interactive chart with a slider for the hour of the day (0, 1, ..., 23) that shows the average number of rides for each day of the week at the selected hour. That is, the chart should have days of the week on the x-axis and the mean number of rides on the y-axis for a particular hour of the day. Moving the slider (e.g., from 10 to 11) should replace the chart for 1000 hrs with the chart for 1100 hrs. Don't forget the tooltip.</li>
    <ul><li><a href="https://docs.bokeh.org/en/latest/docs/gallery/slider.html">sliders</a></li>
        <li><a href="https://docs.bokeh.org/en/latest/docs/reference/models/glyphs/vbar.html">vbar</a></li>
        <li>note that column names must be strings for converting a data frame into a column data source</li>
    </ul>
    <li>A pie chart that shows how much of the total payment comes from each day of the week. The pie should have seven slices, one for each day, and the size of each slice depends on the fraction that day contributes to the total. Again, don't forget the tooltip.</li>
    
</ol>
<li>For the purposes of this exercise, remove any taxi rides that are 5 minutes or less in duration</li>

<h2>NYC taxi data</h2>
<li>NYC taxi trip data is collected and made available (yellow, green, and black cabs)</li>
<li>We'll use data from January 2022</li>
<li><a href="https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet">https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet</a></li>
<li>Use pandas <span style="color:blue">read_parquet</span> function to import the data</li>
<li>Apache Parquet is a columnar file format for data storage. Its main advantage over csv files is that each column retains its data type (a csv file stores everything as text, so types have to be inferred again when the file is read)</li>
<li>After running pd.read_parquet, try df.info() to see the data type of each column</li>
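To see why retained data types matter, here is a minimal sketch that pushes a small frame through csv and back. The two-row frame is hypothetical (only the column names follow the taxi data), and the in-memory buffer stands in for a file on disk. Numeric columns are usually re-inferred by read_csv, but a datetime column comes back as plain text unless you ask for it to be parsed.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the taxi data.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2022-01-01 00:35:40", "2022-01-01 00:33:43"]),
    "trip_distance": [3.8, 2.1],
})

# Write to csv in memory and read it back without any parsing hints.
buffer = io.StringIO()
trips.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

is_datetime = pd.api.types.is_datetime64_any_dtype
print(is_datetime(trips["tpep_pickup_datetime"]))     # column as originally built
print(is_datetime(from_csv["tpep_pickup_datetime"]))  # column after the csv round trip
print(from_csv.dtypes)
```

With a parquet file this extra step is not needed, which is why df.info() right after pd.read_parquet already reports datetime columns.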


In [313]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure

output_notebook()

In [314]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
%matplotlib inline

datasource = "../class-datasets/yellow_tripdata_2022-01.parquet"
df = pd.read_parquet(datasource)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2463931 entries, 0 to 2463930
Data columns (total 19 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int64         
 8   DOLocationID           int64         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18  airport_fee           

<span style="color:blue">Start with a small subset of the data</span>
<br>
<li>After you've completed the assignment with the subset, you can try using all the data</li>

In [315]:
# df = df.sample(frac=0.2)
# df.info()

<h3>Get the pickup hour (e.g., 11:20 corresponds to 11, 15:30 corresponds to 15, etc.)</h3>

In [316]:
df['pickup_hour'] = df.tpep_pickup_datetime.dt.hour

<h3>Get the day of week (0-Monday, 1-Tuesday, ...)</h3>

In [317]:
df['day_of_week'] = df.tpep_pickup_datetime.dt.weekday

<h3>Get the taxi ride duration in minutes</h3>
<li>I've done this for you</li>

In [318]:
df['duration'] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime)/np.timedelta64(1, 's')/60.0

<h3>Remove rides of 5 minutes or less and save in df</h3>

In [319]:
df = df[df.duration > 5 ]
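The derived columns and the filter can be checked on a tiny frame before running them on millions of rows. This is a sketch with three hypothetical rides; it uses `.dt.total_seconds()` for the duration, which is an alternative to dividing by `np.timedelta64(1, 's')` and should give the same minutes.

```python
import pandas as pd

# Hypothetical rides; only the column names follow the taxi data.
rides = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2022-01-01 00:35:40", "2022-01-03 15:30:00", "2022-01-04 11:20:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2022-01-01 00:53:29", "2022-01-03 15:34:00", "2022-01-04 11:25:00"]),
})

rides["pickup_hour"] = rides["tpep_pickup_datetime"].dt.hour
rides["day_of_week"] = rides["tpep_pickup_datetime"].dt.weekday  # 0 = Monday

# Duration in minutes from the timedelta between dropoff and pickup.
elapsed = rides["tpep_dropoff_datetime"] - rides["tpep_pickup_datetime"]
rides["duration"] = elapsed.dt.total_seconds() / 60.0

# Strictly greater than 5, so a ride of exactly 5 minutes is dropped too.
kept = rides[rides["duration"] > 5]
print(kept)
```

Note that the comparison is strict: the third ride lasts exactly 5 minutes and is removed along with the 4-minute one.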

<h1>PROBLEM 1: Average duration by day of week bar chart</h1>

<h3>group the data by day of week</h3>

In [320]:
day_of_week_group = df.groupby('day_of_week')

<h3>Get the mean ride duration for each group</h3>
<li>And make a df out of it</li>
<li>day_of_week_mean has the day of week as the index</li>
<li>the dataframe will have seven rows with indexes 0,1,2,...,6</li>
<li>add a new column with values Monday, Tuesday, Wednesday,...,Sunday</li>

In [321]:
day_of_week_mean = day_of_week_group.duration.mean()
day_of_week_mean_df = pd.DataFrame(day_of_week_mean)

In [322]:
day_of_week_mean_df['weekday'] = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
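Assigning a fixed list of seven names works only when all seven days are present in the data, which may not hold for a small random subset. A hedged alternative is to look each name up from the integer index. The frame and the `DAY_NAMES` list below are hypothetical and only illustrate the idea.

```python
import pandas as pd

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical durations; note that no ride falls on Wednesday (2).
rides = pd.DataFrame({
    "day_of_week": [0, 0, 1, 3, 4, 5, 6],
    "duration": [10.0, 20.0, 12.0, 9.0, 14.0, 16.0, 11.0],
})

mean_df = rides.groupby("day_of_week")["duration"].mean().to_frame()

# Map each index value to its name, so a missing day cannot shift the labels.
mean_df["weekday"] = [DAY_NAMES[i] for i in mean_df.index]
print(mean_df)
```

With the full January data every day appears, so both approaches should give the same seven labels.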

<h3>Make a column data source object from this dataframe</h3>

In [323]:
from bokeh.models import ColumnDataSource
cdata = ColumnDataSource(data=day_of_week_mean_df)

<h3>Draw the vertical bar chart</h3>
<li>You must include tooltips that show the duration when hovering over a bar</li>


In [324]:
text_labels = day_of_week_mean_df.weekday
values = day_of_week_mean_df.duration

tooltips = [
    ("avg_duration", "@duration")
]


p = figure(plot_height = 600, plot_width = 600, x_range=text_labels, y_range=(0, 18),
           title = 'Average Trip Duration by Day',
          x_axis_label = 'Day of Week', 
           y_axis_label = 'Duration (minutes)',tooltips=tooltips)



p.vbar(x='weekday', top='duration', source=cdata, width=0.6, color = "red", hover_fill_color = 'blue', line_color='black')

p.xgrid.grid_line_color = None
p.y_range.start = 0    
show(p)

<h1>PROBLEM 2: Interactive chart with slider</h1>
<li>In this second problem, construct an interactive chart that shows the number of taxi rides by day of week while varying the pickup_hour</li>
<li>Each chart will have day of the week on the x-axis and the number of trips as the height of the bars for a single pickup_hour</li>
<li>Construct a slider that runs from 0 to 23 so that the graph for each of the 24 pickup_hours can be displayed</li>

<h3>Group the data by day of week and, within day of week by pickup_hour</h3>

In [325]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,pickup_hour,day_of_week,duration
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,1.0,N,142,236,1,...,0.5,3.65,0.0,0.3,21.95,2.5,0.0,0,5,17.816667
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,1.0,N,236,42,1,...,0.5,4.0,0.0,0.3,13.3,0.0,0.0,0,5,8.4
2,2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,1.0,N,166,166,1,...,0.5,1.76,0.0,0.3,10.56,0.0,0.0,0,5,8.966667
3,2,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,1.0,N,114,68,2,...,0.5,0.0,0.0,0.3,11.8,2.5,0.0,0,5,10.033333
4,2,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.3,1.0,N,68,163,1,...,0.5,3.0,0.0,0.3,30.3,2.5,0.0,0,5,37.533333


In [326]:
df.groupby(['day_of_week', 'pickup_hour'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fdd5afb0b50>

In [327]:
hour_group = df.groupby(['day_of_week', 'pickup_hour'])

<h3>Get the size of each group and unstack so that rows are weekdays (0, 1,...,6) and cols are hours (0,1,...23)</h3>
<li>Then add an additional column (24) as a copy of column 0. Col 24 will be the display column</li>
<li>Finally, convert all column names into str (since pickup_hour is an int and column data source objects need str column names)</li>
<li>size_df should look like this (col names should be strings):</li>
<li>Note that your numbers may be different if you're using a random subset of the data</li>

<pre>
	0	1	2	3	4	5	6	7	8	9	...	16	17	18	19	20	21	22	23	24	dayname
day_of_week																					
0	22368	14049	9421	6574	7225	10072	24590	44526	56618	55145	...	55280	65609	75620	69939	62390	57315	50615	34107	22368	Monday
1	24205	13229	7721	5290	5876	10479	30027	61182	75061	68852	...	54956	66460	81502	77315	74094	73728	64593	43512	24205	Tuesday
2	26920	14924	9213	6438	6578	10100	29918	61453	75718	70062	...	54682	68361	84792	84012	80136	79127	72063	51758	26920	Wednesday
3	30990	17882	11170	7297	7084	10549	30728	62179	75545	70178	...	55160	68506	86230	87825	84255	84989	79568	63830	30990	Thursday
4	64946	51398	39164	29309	22635	17846	33878	64460	78808	74686	...	69599	85201	103036	104992	92350	89919	94841	91173	64946	Friday
5	81272	67218	52933	38654	23787	11349	13148	18395	27400	40364	...	66030	73453	81384	83086	70752	69776	78014	80825	81272	Saturday
6	77206	66788	56481	42242	25799	11053	10810	14220	22050	33439	...	73734	75534	78610	67409	58393	55413	50860	39138	77206	Sunday
</pre>

In [328]:
size_df = hour_group.size().unstack()
size_df[24] = size_df[0]

In [329]:
size_df.columns = size_df.columns.astype(str)

In [330]:
size_df['weekday'] = ['Monday', 'Tuesday', 'Wednesday','Thursday','Friday','Saturday','Sunday']
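The group, size, unstack, and rename steps can be traced on a handful of rows. This sketch uses six hypothetical rides. The `fill_value=0` argument is an addition not used in the cell above: on a small subset some (day, hour) pairs may have no rides, and without it those cells would be NaN.

```python
import pandas as pd

# Hypothetical rides: (day_of_week, pickup_hour) pairs only.
rides = pd.DataFrame({
    "day_of_week": [0, 0, 0, 1, 1, 6],
    "pickup_hour": [8, 8, 9, 8, 23, 9],
})

# Count rides per (day, hour), then pivot hours into columns.
counts = rides.groupby(["day_of_week", "pickup_hour"]).size().unstack(fill_value=0)

# Column data sources need string column names, so 8 becomes "8", and so on.
counts.columns = counts.columns.astype(str)
print(counts)
```

After the rename, a column is selected with `counts["8"]` rather than `counts[8]`, which is also how the JavaScript callback will look it up.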

<h3>Draw the interactive chart by filling in the code below</h3>
<li>Mostly done :)</li>

In [331]:
from bokeh.layouts import column, row
from bokeh.models import CustomJS, Slider
from bokeh.plotting import ColumnDataSource, figure, show, output_notebook

source = ColumnDataSource(size_df)

tooltips = [
    ("number of rides", "@24")
]
p = figure(x_range=size_df.weekday, plot_height=400, plot_width = 600, 
           x_axis_label = "day of week",
           y_axis_label = "size",
           title="Chart",tooltips=tooltips)

p.vbar(x='weekday', top='24', source=source, width=0.6, color = "red", hover_fill_color = 'blue', line_color='black',
hover_fill_alpha = 1.0)

p.xgrid.grid_line_color = None
p.y_range.start = 0    

slider = Slider(start=0, end=23, value=0, step=1, title="Hour of Day")

jscallback = CustomJS(args={'source':source,'slider':slider},code="""
        // Inside the callback, `this` refers to the widget that triggered it (the slider),
        // so this.value is the selected hour
        // We can print it on the JavaScript console
        // Chrome windows/linux: Ctrl - Shift - J
        // Chrome mac: Cmd - Option - J
        // Safari: Option - Cmd - C
        
        console.log(' changed selected option', slider.value);

        //data is the variable containing all the data
        var data = source.data;
        var col = this.value
        console.log(' changed selected option', data[col]);
        // allocate the selected column to the field for the y values
        data['24'] = data[col];

        // register the change 
        source.change.emit();
""")

slider.js_on_change("value", jscallback)

layout = row(p,slider)
show(layout)
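The slider pattern above boils down to one display column that the callback overwrites with the column for the selected hour. Here is a stripped-down sketch of that pattern, assuming a hypothetical three-day, two-hour data set and a display column named `shown` (the notebook uses `24` for the same purpose). It leaves out sizing arguments and does not call show, so it only builds the objects.

```python
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, CustomJS, Slider
from bokeh.plotting import figure

# Hypothetical counts: one column per hour ("0", "1") plus a display column.
days = ["Mon", "Tue", "Wed"]
data = {"weekday": days, "0": [5, 3, 8], "1": [2, 7, 4], "shown": [5, 3, 8]}
source = ColumnDataSource(data=data)

p = figure(x_range=days, title="Rides by day", tooltips=[("rides", "@shown")])
p.vbar(x="weekday", top="shown", width=0.6, source=source)

slider = Slider(start=0, end=1, value=0, step=1, title="Hour of day")

# cb_obj is the slider; copy the chosen hour's column into the display column.
callback = CustomJS(args=dict(source=source), code="""
    const hour = String(cb_obj.value);
    source.data['shown'] = source.data[hour];
    source.change.emit();
""")
slider.js_on_change("value", callback)

layout = column(slider, p)
```

Calling `show(layout)` in a notebook after `output_notebook()` should render it. Because the bars and the tooltip both read from `shown`, a single assignment in the callback updates both.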

<h1>PROBLEM 3: Piechart</h1>
<li>Use the total_amount column</li>
<li>Use the grouped by day of week data</li>
<li>Sum the total amount for each group and then compute the fractional amount for each day</li>
<li>Using the class notebook piechart as a guide, construct the piechart for distribution of total amount collected by day of week</li>

In [332]:
from math import pi 
import pandas as pd
import numpy as np
from bokeh.palettes import Category20c #The color palette
from bokeh.models import LabelSet, ColumnDataSource
from bokeh.transform import cumsum #a bokeh function that cumulatively sums the angle column of a column data source

#Make the df and create the fraction column
amount_by_day = pd.DataFrame(df.groupby('day_of_week')['total_amount'].sum())
amount_by_day['frac_amount'] = amount_by_day.total_amount / amount_by_day.total_amount.sum()
amount_by_day['angle'] = amount_by_day['frac_amount']/(amount_by_day['frac_amount'].sum()) * 2*pi

#Specify the color palette
amount_by_day['color'] = Category20c[len(amount_by_day)]

#Define a figure object
p = figure(plot_height=500, title="Payment Distribution by Day of Week", 
        tools="hover", tooltips="@day_of_week: @frac_amount",x_range=(-0.5, 1.0))

#Add the definition of each wedge
#Each wedge starts where the previous one ended: its start angle is the sum of the angles of all wedges before it
p.wedge(x=0, y=1, radius=0.4, #x,y specifies the location of the center of the pie
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'), #the start and end angle of each slice
        line_color="white", fill_color='color', legend_field='day_of_week', source=amount_by_day) #source is our amount_by_day (can be a df or a CDS)

#Make the label values column                             
amount_by_day["label_vals"]=(amount_by_day["frac_amount"] * 100).round(2).astype(str).str.pad(29, side = "left") #pad will move the labels out toward the circumference
amount_by_day["label_vals"]=amount_by_day["label_vals"].apply(lambda x: "" if x.strip()=="0.0" else x)

#LabelSet is the set of labels. Either a value or nothing
labels = LabelSet(x=0, y=1, text='label_vals',
        angle=cumsum('angle', include_zero=True), source=ColumnDataSource(amount_by_day), render_mode='canvas')

#Add the labels
p.add_layout(labels)

p.axis.axis_label=None #Piechart. No axes
p.axis.visible=False #We don't want to see the axes
p.grid.grid_line_color = None #If you want grid lines, specify a color

#Show the chart
show(p)
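The arithmetic behind the wedges can be checked without bokeh. This sketch uses plain Python and hypothetical daily totals (not taken from the taxi data). It computes each day's fraction, converts it to an angle in radians, and derives the start and end angles the way `cumsum('angle', include_zero=True)` and `cumsum('angle')` are meant to.

```python
from itertools import accumulate
from math import isclose, pi

# Hypothetical total_amount sums for Monday..Sunday.
totals = [120.0, 150.0, 160.0, 170.0, 180.0, 140.0, 80.0]

grand_total = sum(totals)
fractions = [t / grand_total for t in totals]
angles = [f * 2 * pi for f in fractions]  # a full circle is 2*pi radians

# End angle of each wedge is the running sum; start angle is the previous end.
end_angles = list(accumulate(angles))
start_angles = [0.0] + end_angles[:-1]

for start, end in zip(start_angles, end_angles):
    print(round(start, 3), round(end, 3))
```

Since the fractions already sum to 1, the last end angle closes the circle at 2*pi, and dividing by the sum of the fractions a second time (as the notebook cell does) leaves the angles unchanged.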