<br><br><br><br>
<h2 style="color:red;font-size:40px">Charts and graphs</h2>
<br><br>
<li><span style="color:green">matplotlib </span>Python's most widely used plotting library</li>
<li><span style="color:green">bokeh </span>Interactive plotting library</li>


<br><br><br><br>
<span style="color:blue;font-size:large">matplotlib</span>
<li><a href="https://matplotlib.org">https://matplotlib.org</a></li>
<li>easy to use and tightly coupled with pandas</li>


<br><br><br><br>
<span style="color:blue;font-size:large">bokeh</span>
<li><a href="https://docs.bokeh.org/en/latest/">https://docs.bokeh.org/en/latest/</a></li>
<li>interactive plotting, growing mapping functionality, presentation ready graphs</li>
<li>requires a bit more work</li>


<br><br><br><br>
<span style="color:blue;font-size:large">What kinds of visuals are possible</span>
<li>Tons and tons</li>
<li>Best approach: think of how you want to show your data or results and then google till you find the right way to do it</li>
<li><b>Almost anything is possible!</b></li>

<br><br><br><br>
<span style="color:green;font-size:xx-large">setup</span>
<br><br>

<li>!pip install bokeh</li>
<li>OR</li>
<li>!conda install --yes bokeh</li>
<li>matplotlib comes pre-installed with anaconda</li>
<li>In the notebook, select "File --> Trust notebook" from the menubar. Your graphs will be available when you reopen the notebook.</li>

In [1]:
!pip install bokeh --upgrade



In [5]:
import bokeh
bokeh.__version__

'3.2.2'

<br><br><br><br>
<span style="color:green;font-size:50px">Interactive charts with bokeh</span>
<br><br>

<br><br><br><br>
<span style="color:blue;font-size:large">bokeh</span>
<li>import libraries</li>
<li>run output_notebook() to display charts inline</li>
<li>output_notebook() should report "BokehJS 3.2.2 successfully loaded." (the version number may differ). If it doesn't, something went wrong!</li>

In [1]:
#output_notebook for inline charts in a notebook
#show to show a chart
from bokeh.io import output_notebook, show # these let us render plots inline, directly below the notebook cells
from bokeh.plotting import figure # we add things to the base figure

In [2]:
output_notebook()

<br><br><br><br>
<span style="color:blue;font-size:large">Construct a figure</span>
<li>specify labels</li>
<li>specify legend</li>
<li>give a title</li>

<br><br><br><br>
<span style="color:blue;font-size:large">A simple single line plot</span>
<li>first create a figure object</li>
<li>create a line object in the figure object</li>
<li>the line call needs both the x and the y values! Bokeh does not pick up the index of a pandas Series automatically, so pass the index and the values explicitly</li>
<li>then show the figure</li>


<h3 style="color:blue">The data</h3>
<li>The NYC 311 data subset that we used for maps</li>

In [3]:
import pandas as pd
df = pd.read_csv("nyc_311_2023_data.csv")
df['Created Date'] = pd.to_datetime(df['Created Date'],format="%Y-%m-%d %H:%M:%S")
df['Closed Date'] = pd.to_datetime(df['Closed Date'],format="%Y-%m-%d %H:%M:%S")
df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18338326 entries, 0 to 18338325
Data columns (total 8 columns):
 #   Column           Non-Null Count     Dtype         
---  ------           --------------     -----         
 0   Created Date     18338326 non-null  datetime64[ns]
 1   Closed Date      18338326 non-null  datetime64[ns]
 2   Agency           18338326 non-null  object        
 3   Incident Zip     18338326 non-null  int64         
 4   Borough          18338326 non-null  object        
 5   Latitude         18338326 non-null  float64       
 6   Longitude        18338326 non-null  float64       
 7   processing_days  18338326 non-null  float64       
dtypes: datetime64[ns](2), float64(3), int64(1), object(2)
memory usage: 1.1+ GB


<h3>Calculate mean processing time by month and day of week</h3>
<li>Pandas <b>Grouper</b> builds a grouping specification that can be passed to groupby</li>
<li>Useful when dealing with timeseries data</li>
<li><a href="https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html">https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html</a></li>
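To make the Grouper idea concrete, here is a minimal sketch on a made-up four-row frame (the `toy` data is invented for illustration and is not part of the 311 dataset). It groups by calendar month and checks that the result matches `resample`. It uses `freq="MS"` (month start) only because that alias behaves the same across pandas versions; the notebook itself labels months by month end.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({
    "Created Date": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-02-10", "2023-02-25"]),
    "processing_days": [2.0, 4.0, 1.0, 3.0],
})

# Group on a datetime column by month, then average each group
toy_monthly = toy.groupby(
    pd.Grouper(key="Created Date", freq="MS"))["processing_days"].mean()

# The same result via resample, which needs the dates in the index
toy_resampled = (toy.set_index("Created Date")["processing_days"]
                    .resample("MS").mean())

print(toy_monthly)
```

Grouper is handy when the dates live in an ordinary column, since no `set_index` step is needed.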

In [9]:
import pandas as pd
# Grouper lets groupby bucket timeseries data by a frequency --> similar to resampling with DataFrame.resample
# below we are grouping by month (newer pandas versions prefer freq='ME' for month end)
month_groups = df.groupby(pd.Grouper(freq='M',key="Created Date"))
months_mean = month_groups.mean(numeric_only=True)['processing_days']


In [10]:
months_mean

Created Date
2015-01-31    16.153866
2015-02-28    14.000722
2015-03-31    17.050071
2015-04-30    19.559423
2015-05-31    21.806223
                ...    
2023-04-30     4.099683
2023-05-31     3.690137
2023-06-30     3.074180
2023-07-31     2.675771
2023-08-31     1.318601
Freq: M, Name: processing_days, Length: 104, dtype: float64

In [11]:
# use the .dt accessor to get the day of the week (0 = Monday ... 6 = Sunday) of the created date column 
day_of_week_groups = df.groupby(df["Created Date"].dt.dayofweek)
day_of_week_mean = day_of_week_groups.mean(numeric_only=True)['processing_days']
day_of_week_mean

Created Date
0    13.580642
1    14.364547
2    14.680310
3    14.157120
4    13.133037
5     7.884347
6     6.833155
Name: processing_days, dtype: float64

<h3>Graph month means and day of week means</h3>

In [12]:
type(months_mean)

pandas.core.series.Series

In [13]:
# create the plot here
# create the figure object and assign the labels 
p = figure(title="Monthly Processing Time Averages", x_axis_label='Month', y_axis_label='Processing Time')
# pass the data, make the x data the date and the y data the months mean 
p.line(months_mean.index,months_mean,line_width=8,line_color="red")
# display the result 
show(p)

<li>without x_axis_type="datetime", bokeh treats the dates as plain numbers, so the x-axis ticks show raw values (milliseconds since the epoch) rather than dates</li>
<li>a basic graph takes three steps (figure, glyph, show) rather than one, so it is more work than matplotlib</li>
<li>but we get some nice interactive features along with the graph</li>
<li>we'll need to do some more customization</li>

<li>bokeh can handle time nicely. Convert x-axis into time</li>
<li>tell bokeh that the x-axis is of datetime type. bokeh will assume it is continuous</li>
<li>add gizmos to make the graph look nice</li>

In [14]:
months_mean.index

DatetimeIndex(['2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30',
               '2015-05-31', '2015-06-30', '2015-07-31', '2015-08-31',
               '2015-09-30', '2015-10-31',
               ...
               '2022-11-30', '2022-12-31', '2023-01-31', '2023-02-28',
               '2023-03-31', '2023-04-30', '2023-05-31', '2023-06-30',
               '2023-07-31', '2023-08-31'],
              dtype='datetime64[ns]', name='Created Date', length=104, freq='M')

In [15]:
# we want the x-axis to represent years over time but we need to specify that when creating the figure 
import pandas as pd
p = figure(title="Mean Processing Time (monthly)",width=600, height=400,
          x_axis_label = "months",y_axis_label = "time",x_axis_type="datetime")

p.line(months_mean.index, # the index is already a DatetimeIndex
       months_mean,
       legend_label="monthly mean",
      line_color = "red",
      line_width = 8)

show(p)

<br><br><br><br>
<span style="color:blue;font-size:large">tools</span>
<li>bokeh provides a number of interactive tools</li>
<li>in the above two graphs, we let it pick the tools (defaults), but we can control the tools list</li>
<li><a href="https://docs.bokeh.org/en/latest/docs/reference/models/tools.html">tool configuration</a></li>
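Besides the `tools` string, a tool object can be built and attached with `add_tools`. The sketch below uses invented values (`toy_months` and `toy_means` are placeholders, not the 311 results) and adds a `HoverTool` that formats the datetime x-value. When a glyph is given plain sequences rather than a ColumnDataSource, bokeh names the columns after the glyph arguments, which is why the tooltip refers to `@x` and `@y`.

```python
import pandas as pd
from bokeh.models import HoverTool
from bokeh.plotting import figure

# Made-up monthly means, for illustration only
toy_months = pd.to_datetime(["2023-01-31", "2023-02-28", "2023-03-31"])
toy_means = [4.8, 4.1, 3.7]

p = figure(title="Toy monthly means", width=600, height=300,
           x_axis_type="datetime", tools="pan,box_zoom,reset")
p.line(toy_months, toy_means, line_width=2)

# Format the x value as a date and the y value with two decimals
hover = HoverTool(
    tooltips=[("month", "@x{%b %Y}"), ("mean days", "@y{0.00}")],
    formatters={"@x": "datetime"},
)
p.add_tools(hover)

# In a notebook, show(p) (after output_notebook()) would display the chart
```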

In [16]:
import pandas as pd
# we can also use the crosshair tool, specify the tools argument in the figure object
p = figure(title="Mean Processing Time (monthly)",width=600, height=400,
          x_axis_label = "months",y_axis_label = "time",x_axis_type="datetime",
          tools="pan,box_zoom, crosshair,reset, save")

p.line(months_mean.index, # already a DatetimeIndex, so no conversion is needed
       months_mean,
       legend_label="monthly mean",
      line_color = "red",
      line_width = 2)

show(p)



<br><br><br><br>
<span style="color:blue;font-size:large">The bar chart</span>
<li><a href="https://docs.bokeh.org/en/latest/docs/reference/models/glyphs/vbar.html">vbar</a> for vertical bars (hbar for horizontal bars)</li>
<li>x axis labels need to be of type string</li>
<li>We can convert the day numbers to str or just create the label list</li>
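Both labelling options can be sketched without bokeh at all. The example below uses rounded, made-up means (`toy_means` is a placeholder) and relies on the fact that pandas' `dayofweek` and Python's `calendar.day_name` both count Monday as 0. Either list of strings could then be passed as `x_range`.

```python
import calendar
import pandas as pd

# Made-up means indexed by day number (0 = Monday), for illustration only
toy_means = pd.Series([13.6, 14.4, 14.7, 14.2, 13.1, 7.9, 6.8], index=range(7))

# Option 1: turn the day numbers themselves into strings
number_labels = [str(day) for day in toy_means.index]

# Option 2: look up the day names instead of typing them out
day_labels = [calendar.day_name[day] for day in toy_means.index]

print(number_labels)
print(day_labels)
```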


In [17]:
day_of_week_mean

Created Date
0    13.580642
1    14.364547
2    14.680310
3    14.157120
4    13.133037
5     7.884347
6     6.833155
Name: processing_days, dtype: float64

In [18]:
number_labels = [str(day) for day in day_of_week_mean.index] #But we'll use text_labels
text_labels = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
value = day_of_week_mean

p = figure(x_range=text_labels, y_range=(0, 20),width=600, height=300)

p.vbar(x=text_labels, top=value , width=0.5, color = "green",legend_label="means")

show(p)


<br><br><br><br>
<span style="color:blue;font-size:large">Grid plots in bokeh</span>
<li>define each figure separately</li>
<li>use gridplot to create a grid</li>
<li>gridplot takes a matrix of plots as its input</li>

In [19]:
from bokeh.io import output_notebook, show
from bokeh.layouts import gridplot
from bokeh.plotting import figure
import pandas as pd


p1 = figure(title="Mean Processing Time (monthly)",width=600, height=400,
          x_axis_label = "months",y_axis_label = "time",x_axis_type="datetime",
          tools="pan,box_zoom, crosshair,reset, save")

p1.line(months_mean.index, # already a DatetimeIndex, so no conversion is needed
       months_mean,
       legend_label="monthly mean",
      line_color = "red",
      line_width = 8)



text_labels = ["Mon","Tues","Wed","Thurs","Fri","Sat","Sun"]
values = day_of_week_mean

p2 = figure(x_range=text_labels, y_range=(0, 25),width=600, height=400)

p2.vbar(x=text_labels, top=values , width=0.5, color = "cyan",legend_label="means")



# if we want to show multiple plots together then we need to make a gridplot object 
grid = gridplot([[p1,p2]],sizing_mode="scale_both",merge_tools=True)


# display the grid (the bokeh equivalent of subplots in matplotlib)
show(grid)





<h2>Pie charts</h2>
<span style="color:blue;font-size:large">Breakdown of complaints by Borough</span>
<li>We might be interested in seeing if there are differences in complaint percent across boroughs</li>
<li>Let's make a set of 5 piecharts, one for each borough, that shows the relative number of complaints by agency</li>

<li>Group the data by borough and then, within each borough, by agency, and get the size of each group</li>


In [20]:
df.groupby(['Borough','Agency']).size()

Borough        Agency                                      
BRONX          3-1-1                                              133
               DCA                                               4364
               DCWP                                               108
               DEP                                              90332
               DEPARTMENT OF CONSUMER AND WORKER PROTECTION        14
                                                                ...  
STATEN ISLAND  NYCEM                                                1
               NYPD                                            265545
               OSE                                               2843
               OTI                                                  3
               TLC                                                433
Length: 116, dtype: int64

<li>this gives us a series with a two-layered index</li>
<li>the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html">unstack</a> function takes a two layered index and converts it into a dataframe</li>
<li>the outer index is the index of the new dataframe</li>
<li>the inner index becomes the columns of the new dataframe</li>
<li>We need to remove the agencies with a NaN in the data (FDNY)</li>
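The groupby, unstack, and dropna steps can be traced on a tiny made-up frame (the `toy` rows are invented for illustration). The sketch also shows one possible alternative, `unstack(fill_value=0)`, which keeps every agency and treats a missing borough/agency pair as zero cases; whether that is preferable depends on whether sparse agencies should appear in the charts.

```python
import pandas as pd

# Made-up complaints, for illustration only
toy = pd.DataFrame({
    "Borough": ["BRONX", "BRONX", "QUEENS", "QUEENS", "QUEENS"],
    "Agency":  ["NYPD",  "FDNY",  "NYPD",   "NYPD",   "HPD"],
})

# Series with a two-level (Borough, Agency) index
counts = toy.groupby(["Borough", "Agency"]).size()

# Inner level (Agency) becomes the columns; missing pairs become NaN
wide = counts.unstack()

# Keep only agencies that appear in every borough
complete = wide.dropna(axis=1)

# Alternative: keep all agencies and count missing pairs as 0
filled = counts.unstack(fill_value=0)

print(complete)
print(filled)
```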

In [21]:
df.groupby(['Borough','Agency']).size().unstack().columns

Index(['3-1-1', 'DCA', 'DCWP', 'DEP',
       'DEPARTMENT OF CONSUMER AND WORKER PROTECTION', 'DFTA', 'DHS', 'DOB',
       'DOE', 'DOF', 'DOHMH', 'DOITT', 'DOT', 'DPR', 'DSNY', 'EDC', 'FDNY',
       'HPD', 'NYC311-PRD', 'NYCEM', 'NYPD', 'OSE', 'OTI', 'TLC'],
      dtype='object', name='Agency')

In [4]:
df

,Created Date,Closed Date,Agency,Incident Zip,Borough,Latitude,Longitude,processing_days
0,2022-06-01 15:40:25,2022-06-02 19:15:43,HPD,10034,MANHATTAN,40.864261,-73.920824,1.149514
1,2022-06-01 07:14:24,2022-06-02 08:59:05,HPD,10453,BRONX,40.857910,-73.907076,1.072697
2,2022-06-01 16:55:37,2022-06-02 15:38:38,HPD,11226,BROOKLYN,40.637296,-73.959624,0.946539
3,2022-06-01 15:02:59,2022-06-09 12:01:55,HPD,10032,MANHATTAN,40.829582,-73.940710,7.874259
4,2022-06-01 17:16:56,2022-06-15 18:24:13,HPD,10031,MANHATTAN,40.826519,-73.952445,14.046725
...,...,...,...,...,...,...,...,...
18338321,2022-06-01 16:08:15,2022-06-16 06:01:24,HPD,11223,BROOKLYN,40.585740,-73.967828,14.578576
18338322,2022-06-01 14:22:07,2022-06-12 02:04:14,HPD,11212,BROOKLYN,40.657274,-73.921019,10.487581
18338323,2022-06-01 14:31:06,2022-06-19 17:31:54,HPD,11105,QUEENS,40.777195,-73.909416,18.125556
18338324,2022-06-01 18:31:01,2022-06-17 20:45:45,HPD,10456,BRONX,40.833941,-73.910644,16.093565


In [22]:
#dropna(axis=1) drops any column that has a nan in it
size_df = df.groupby(['Borough','Agency']).size().unstack().dropna(axis=1)
size_df

Borough,3-1-1,DCA,DCWP,DEP,DEPARTMENT OF CONSUMER AND WORKER PROTECTION,DFTA,DHS,DOB,DOE,DOF,...,DPR,DSNY,EDC,HPD,NYC311-PRD,NYCEM,NYPD,OSE,OTI,TLC
BRONX,133.0,4364.0,108.0,90332.0,14.0,25.0,5493.0,91477.0,2710.0,1886.0,...,55496.0,262624.0,114.0,1090789.0,70.0,4.0,1805040.0,5895.0,10.0,3302.0
BROOKLYN,218.0,7136.0,189.0,162373.0,23.0,41.0,21018.0,224412.0,3195.0,3709.0,...,176396.0,991926.0,1361.0,1019280.0,228.0,10.0,2511017.0,17608.0,19.0,11203.0
MANHATTAN,205.0,6221.0,197.0,143625.0,23.0,39.0,91913.0,149428.0,3064.0,2647.0,...,62919.0,301838.0,9793.0,721769.0,64.0,3.0,1636566.0,17169.0,39.0,40310.0
QUEENS,165.0,6715.0,185.0,146276.0,32.0,18.0,11379.0,167851.0,2645.0,3850.0,...,186730.0,831807.0,2254.0,387853.0,290.0,25.0,2157606.0,13999.0,24.0,10934.0
STATEN ISLAND,29.0,963.0,24.0,46545.0,2.0,6.0,1070.0,30438.0,620.0,1375.0,...,56593.0,291424.0,86.0,56085.0,66.0,1.0,265545.0,2843.0,3.0,433.0


<li>each row of the df will correspond to one piechart and we will have 5 piecharts in all</li>

<br><br><br><br>
<span style="color:blue;font-size:large">The pie chart</span>
<li>Drawing a piechart involves specifying each slice of the pie</li>
<li>The <span style="color:blue">color</span> of the slices is defined by importing a <a href="https://docs.bokeh.org/en/latest/docs/reference/palettes.html">color palette</a></li>
<li>Since there are 23 agencies, we need a color palette with at least 23 colors. Bokeh's large palettes are probably the best. We'll pick random colors from Turbo256's 0..255 range of colors</li>
<li>We need to use a <a href="https://docs.bokeh.org/en/latest/docs/user_guide/basic/data.html">ColumnDataSource</a> object to store the data:</li>
<ul>
    <li>The data repository for a bokeh chart</li>
    <li>Bokeh can directly reference named columns in this source</li>
    <li>Think of it as a dedicated table that bokeh understands</li>
</ul>
<li><a href="https://docs.bokeh.org/en/latest/docs/user_guide/basic/annotations.html">LabelSet</a>: the set of labels that we want to display. For the piechart, these will be the fraction of cases that are associated with an agency</li>
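Before drawing anything, it may help to see the slice arithmetic on its own. The sketch below uses three made-up agencies (`toy` is invented) and computes, in plain pandas, the start and end angle of each wedge. This mirrors what bokeh's `cumsum` transform is expected to produce: `cumsum('angle')` gives the end angles, and `cumsum('angle', include_zero=True)` gives the same running total shifted to start at zero.

```python
from math import pi
import pandas as pd

# Made-up counts, for illustration only
toy = pd.DataFrame({"Agency": ["A", "B", "C"], "value": [50, 30, 20]})

# Each slice gets its share of the full circle (2*pi radians)
toy["angle"] = toy["value"] / toy["value"].sum() * 2 * pi

# End angle of each wedge: running total of the angles
toy["end"] = toy["angle"].cumsum()

# Start angle of each wedge: the running total before this wedge
toy["start"] = toy["end"] - toy["angle"]

print(toy[["Agency", "start", "end"]])
```

The first wedge starts at 0, each wedge starts where the previous one ended, and the last wedge ends at 2*pi.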



In [23]:
size_df

Borough,3-1-1,DCA,DCWP,DEP,DEPARTMENT OF CONSUMER AND WORKER PROTECTION,DFTA,DHS,DOB,DOE,DOF,...,DPR,DSNY,EDC,HPD,NYC311-PRD,NYCEM,NYPD,OSE,OTI,TLC
BRONX,133.0,4364.0,108.0,90332.0,14.0,25.0,5493.0,91477.0,2710.0,1886.0,...,55496.0,262624.0,114.0,1090789.0,70.0,4.0,1805040.0,5895.0,10.0,3302.0
BROOKLYN,218.0,7136.0,189.0,162373.0,23.0,41.0,21018.0,224412.0,3195.0,3709.0,...,176396.0,991926.0,1361.0,1019280.0,228.0,10.0,2511017.0,17608.0,19.0,11203.0
MANHATTAN,205.0,6221.0,197.0,143625.0,23.0,39.0,91913.0,149428.0,3064.0,2647.0,...,62919.0,301838.0,9793.0,721769.0,64.0,3.0,1636566.0,17169.0,39.0,40310.0
QUEENS,165.0,6715.0,185.0,146276.0,32.0,18.0,11379.0,167851.0,2645.0,3850.0,...,186730.0,831807.0,2254.0,387853.0,290.0,25.0,2157606.0,13999.0,24.0,10934.0
STATEN ISLAND,29.0,963.0,24.0,46545.0,2.0,6.0,1070.0,30438.0,620.0,1375.0,...,56593.0,291424.0,86.0,56085.0,66.0,1.0,265545.0,2843.0,3.0,433.0


In [24]:
borough = "BRONX"
print(size_df.loc[borough])
data = size_df.loc[borough].reset_index(name='value') # we move the index into the columns 
data 

Agency
3-1-1                                               133.0
DCA                                                4364.0
DCWP                                                108.0
DEP                                               90332.0
DEPARTMENT OF CONSUMER AND WORKER PROTECTION         14.0
DFTA                                                 25.0
DHS                                                5493.0
DOB                                               91477.0
DOE                                                2710.0
DOF                                                1886.0
DOHMH                                             62894.0
DOITT                                               135.0
DOT                                              197831.0
DPR                                               55496.0
DSNY                                             262624.0
EDC                                                 114.0
HPD                                             1090789.0
NYC311-

,Agency,value
0,3-1-1,133.0
1,DCA,4364.0
2,DCWP,108.0
3,DEP,90332.0
4,DEPARTMENT OF CONSUMER AND WORKER PROTECTION,14.0
5,DFTA,25.0
6,DHS,5493.0
7,DOB,91477.0
8,DOE,2710.0
9,DOF,1886.0


In [6]:
from math import pi #Need this - a pie chart is a circle
import pandas as pd
import numpy as np
from bokeh.palettes import Turbo256 #The color palette
from bokeh.models import LabelSet, ColumnDataSource
from bokeh.transform import cumsum #a bokeh function that cumulatively sums the angle column in a column data source

borough = "BRONX"

#Create a dataframe with an Agency column and a value column for the borough
data = size_df.loc[borough].reset_index(name='value')

#We'll restrict the size of the string for Agencies. One Agency is way too long!
data['Agency'] = data['Agency'].str.slice(0,8)

#Calculate a percent column (the fraction of cases for each agency)
data['pct'] = (data['value']/sum(data['value'])*100).round(2)

#Calculate an angle column (the fraction of 360 degrees for each slice)
#Note that data['pct'].sum() should be approx 100.0
#The angle is calculated in radians. 360degrees = 2*pi (approx 6.28)
data['angle'] = data['pct']/(data['pct'].sum()) * 2*pi

#Specify the color palette
#Use Category20c if you have at most 20 wedges
#Or Category10 if you have at most 10 wedges

import random
colors = [Turbo256[random.randint(0,255)] for i in range(len(data))]
data['colors'] = colors

cdsdata = ColumnDataSource(data)
#Define a figure object
#tooltips refers to columns of the ColumnDataSource (cdsdata) that we pass to the wedge below
#@Agency maps to the Agency column and @pct to the pct column
#So we'll see NYPD: 49.04 when we hover over the NYPD slice in the piechart
p = figure(height=500, title="Cases by Agency: " + borough, 
        tools="hover", tooltips="@Agency: @pct",x_range=(-0.5, 1.0))

#Add the definition of each wedge
#Each wedge, starting with 0 degrees, will start at the sum of the angles up to that wedge

p.wedge(x=0, y=1, radius=0.4, #x,y specifies the location of the center of the pie
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'), #the start and end angle of each slice
        line_color="white", fill_color="colors", legend_field='Agency', source=cdsdata) #source is our data (can be a df or a CDS)

#Since there are way too many small slices, we'll only label slices with pct >= 5
data["label_vals"] = np.where(data["pct"]>=5,data["pct"],0)                              
data["label_vals"]=data["label_vals"].astype(str).str.pad(35, side = "left") #pad will move the labels out toward the circumference
data["label_vals"]=data["label_vals"].apply(lambda x: "" if x.strip()=="0.0" else x)

#cdsdata was built before label_vals existed, so copy the new column into it
cdsdata.data["label_vals"] = data["label_vals"].tolist()

#LabelSet is the set of labels. Either a value or nothing
labels = LabelSet(x=0, y=1, text='label_vals',
        angle=cumsum('angle', include_zero=True), source=cdsdata)

#Add the labels
p.add_layout(labels)


p.axis.axis_label=None #Piechart. No axes
p.axis.visible=False #We don't want to see the axes
p.grid.grid_line_color = None #If you want grid lines, specify a color

#Show the chart
show(p)
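Random picks from Turbo256 can repeat a color or land on near-identical neighbors. One possible alternative, sketched here as an option rather than what the notebook does, is `bokeh.palettes.linear_palette`, which samples a requested number of colors at even spacing across a palette. The count of 23 matches the number of agencies above.

```python
from bokeh.palettes import Turbo256, linear_palette

n_agencies = 23  # number of wedges in the pie chart above

# Evenly spaced colors from the first to the last entry of Turbo256
even_colors = linear_palette(Turbo256, n_agencies)

print(even_colors[:3])
```

The result could be assigned to the colors column in place of the random list, for example `data['colors'] = list(even_colors)`.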




In [28]:
data

,Agency,value,pct,angle,colors,label_vals
0,3-1-1,133.0,0.0,0.0,#18d5cc,
1,DCA,4364.0,0.12,0.007541,#3d3791,
2,DCWP,108.0,0.0,0.0,#4678f0,
3,DEP,90332.0,2.45,0.153969,#9d1001,
4,DEPARTME,14.0,0.0,0.0,#caed33,
5,DFTA,25.0,0.0,0.0,#3c9dfd,
6,DHS,5493.0,0.15,0.009427,#25c0e6,
7,DOB,91477.0,2.49,0.156483,#4040a1,
8,DOE,2710.0,0.07,0.004399,#bb1f01,
9,DOF,1886.0,0.05,0.003142,#b6f735,


In [26]:
cdsdata

In [27]:
cdsdata.data

{'index': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22]),
 'Agency': array(['3-1-1', 'DCA', 'DCWP', 'DEP', 'DEPARTME', 'DFTA', 'DHS', 'DOB',
        'DOE', 'DOF', 'DOHMH', 'DOITT', 'DOT', 'DPR', 'DSNY', 'EDC', 'HPD',
        'NYC311-P', 'NYCEM', 'NYPD', 'OSE', 'OTI', 'TLC'], dtype=object),
 'value': array([1.330000e+02, 4.364000e+03, 1.080000e+02, 9.033200e+04,
        1.400000e+01, 2.500000e+01, 5.493000e+03, 9.147700e+04,
        2.710000e+03, 1.886000e+03, 6.289400e+04, 1.350000e+02,
        1.978310e+05, 5.549600e+04, 2.626240e+05, 1.140000e+02,
        1.090789e+06, 7.000000e+01, 4.000000e+00, 1.805040e+06,
        5.895000e+03, 1.000000e+01, 3.302000e+03]),
 'pct': array([ 0.  ,  0.12,  0.  ,  2.45,  0.  ,  0.  ,  0.15,  2.49,  0.07,
         0.05,  1.71,  0.  ,  5.37,  1.51,  7.14,  0.  , 29.63,  0.  ,
         0.  , 49.04,  0.16,  0.  ,  0.09]),
 'angle': array([0.        , 0.00754133, 0.        , 0.15396883, 0.      

<br><br><br><br>
<span style="color:blue;font-size:large">Piecharts for all boroughs in a grid</span>
<li>bokeh's gridplot lays out figures in a grid</li>
<li>specify the grid layout (here, a flat list of figures plus the number of columns)</li>
<li>each grid cell holds one chart (they could be of different types)</li>
<li>We'll write a function that creates a piechart for each borough</li>

In [1]:
def create_bokeh_pi(df,row):
    from math import pi #Need this - a pie chart is a circle
    import numpy as np #np.where is used for the labels below
    import pandas as pd

    from bokeh.palettes import Turbo256 #The color palette
    from bokeh.models import LabelSet, ColumnDataSource
    from bokeh.transform import cumsum #a bokeh function that cumulatively sums the angle column in a column data source
    borough = row

    #Create a dataframe with an Agency column and a value column for the borough
    data = df.loc[borough].reset_index(name='value') #use the df argument rather than the global size_df

    #We'll restrict the size of the string for Agencies. One Agency is way too long!
    data['Agency'] = data['Agency'].str.slice(0,8)

    #Calculate a percent column (the fraction of cases for each agency)
    data['pct'] = (data['value']/sum(data['value'])*100).round(2)

    #Calculate an angle column (the fraction of 360 degrees for each slice, in radians)
    data['angle'] = data['pct']/(data['pct'].sum()) * 2*pi

    #Specify the color palette
    import random
    colors = [Turbo256[random.randint(0,255)] for i in range(len(data))]
    data['colors'] = colors

    cdsdata = ColumnDataSource(data)

    #Define a figure object
    p = figure(height=350,
                title="Cases by Agency: " + borough, 
                tools="hover", 
                tooltips="@Agency: @pct", # get data from these columns from the source argument in p.wedge
                x_range=(-0.5, 1.0)
                )

    #Add the definition of each wedge
    #Each wedge, starting with 0 degrees, will start at the sum of the angles up to that wedge

    p.wedge(x=0, y=1, radius=0.4, #x,y specifies the location of the center of the pie
            start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'), #the start and end angle of each slice
            line_color="white", fill_color='colors', legend_field='Agency', source=cdsdata,) #source is our data (can be a df or a CDS)
            
    #Since there are way too many small slices, we'll only label slices with pct > 5
    data["label_vals"] = np.where(data["pct"]>5,data["pct"],0)                              
    data["label_vals"]=data["label_vals"].astype(str).str.pad(20, side = "left") #pad will move the labels out toward the circumference
    data["label_vals"]=data["label_vals"].apply(lambda x: "" if x.strip()=="0.0" else x)

    #cdsdata was built before label_vals existed, so copy the new column into it
    cdsdata.data["label_vals"] = data["label_vals"].tolist()

    #LabelSet is the set of labels. Either a value or nothing
    labels = LabelSet(x=0, y=1, text='label_vals',
            angle=cumsum('angle', include_zero=True), source=cdsdata)

    #Add the labels
    p.add_layout(labels)


    p.axis.axis_label=None #Piechart. No axes
    p.axis.visible=False #We don't want to see the axes
    p.grid.grid_line_color = None #If you want grid lines, specify a color

    #Return the figure; the caller decides when to show it
    return p



In [35]:
show(create_bokeh_pi(size_df,"BRONX"))

In [36]:
plots=list()
for borough in size_df.index:
    plots.append(create_bokeh_pi(size_df,borough))
grid = gridplot(plots,sizing_mode="scale_both",merge_tools=True,ncols=2)
show(grid)

<br><br><br><br><br><br>
<span style="color:green;font-size:40px">Interactive charting</span>
<br><br>

<li>bokeh graphs are interactive</li>
<li>passive interaction (the graph does not change with user input) (e.g., the piechart above)</li>
<li>active interaction (the graph changes with user input)</li>

<br><br><br><br>
<span style="color:blue;font-size:x-large">distribution of processing time</span>
<li>We've looked at mean processing times over months and days</li>
<li>Let's get a sense of the distribution of processing times</li>

<br><br><br><br>
<span style="color:blue;font-size:large">Histogram of processing times</span>
<li>We'll only keep data for which the processing duration is greater than 0.5 days and less than 1 day</li>
<li>Why? No particular reason other than graph aesthetics!</li>
<li>And divide the data into ten bins</li>
<li>and use <a href="https://numpy.org/doc/stable/reference/generated/numpy.histogram.html">np.histogram</a> to create the histogram</li>
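As a quick illustration of what np.histogram returns, the sketch below bins seven made-up durations (the `values` array is invented) into five bins. It returns one count per bin and one more edge than there are bins, which is why the notebook later pairs `boundaries[:-1]` (left edges) with `boundaries[1:]` (right edges).

```python
import numpy as np

# Made-up processing times in days, for illustration only
values = np.array([0.50, 0.55, 0.62, 0.70, 0.71, 0.90, 0.99])

# Five equal-width bins spanning the min and max of the data
counts, edges = np.histogram(values, bins=5)

# Left and right edges of each bin, converted from days to minutes
left_minutes = edges[:-1] * 24 * 60
right_minutes = edges[1:] * 24 * 60

print(counts)
print(edges)
```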

In [7]:
#Load the data
import pandas as pd
import numpy as np
df = pd.read_csv("nyc_311_2023_data.csv",header=0)
df['Created Date'] = pd.to_datetime(df["Created Date"],format="%Y-%m-%d %H:%M:%S")
df['Closed Date'] = pd.to_datetime(df["Closed Date"],format="%Y-%m-%d %H:%M:%S")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18338326 entries, 0 to 18338325
Data columns (total 8 columns):
 #   Column           Dtype         
---  ------           -----         
 0   Created Date     datetime64[ns]
 1   Closed Date      datetime64[ns]
 2   Agency           object        
 3   Incident Zip     int64         
 4   Borough          object        
 5   Latitude         float64       
 6   Longitude        float64       
 7   processing_days  float64       
dtypes: datetime64[ns](2), float64(3), int64(1), object(2)
memory usage: 1.1+ GB


In [8]:
import numpy as np
import pandas as pd
df2 = df[(df['processing_days']>0.5) & (df['processing_days']<1)]
nbins=10
histogram,boundaries = np.histogram(df2['processing_days'],bins=nbins)
hist_df = pd.DataFrame({'cases': histogram, 
                       'left': boundaries[:-1]*24*60, 
                       'right': boundaries[1:]*24*60})

In [33]:
boundaries

array([0.50001157, 0.55000926, 0.60000694, 0.65000463, 0.70000231,
       0.75      , 0.79999769, 0.84999537, 0.89999306, 0.94999074,
       0.99998843])

In [34]:
histogram

array([ 97626,  90289,  85730,  82235,  86773,  92848,  96615,  98118,
       101889,  96576])

In [35]:
hist_df

,cases,left,right
0,97626,720.016667,792.013333
1,90289,792.013333,864.01
2,85730,864.01,936.006667
3,82235,936.006667,1008.003333
4,86773,1008.003333,1080.0
5,92848,1080.0,1151.996667
6,96615,1151.996667,1223.993333
7,98118,1223.993333,1295.99
8,101889,1295.99,1367.986667
9,96576,1367.986667,1439.983333


<br><br><br>
<span style="color:blue;font-size:x-large">Creating an interactive bokeh bar chart</span>
<li>We'll convert the dataframe into a ColumnDataSource object</li>
<li>Create a <span style="color:blue">tooltips</span> list of (label, value) pairs (for the data that will be shown when hovering)</li>
<li>Use the median case count as the baseline, so each bar extends above or below it</li>
<li>Use a <a href="https://docs.bokeh.org/en/latest/docs/reference/models/glyphs/quad.html">quad glyph</a> to construct each bar. Quad = Quadrilateral</li>

In [36]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource

output_notebook()

In [37]:
src = ColumnDataSource(hist_df)
src.data

{'index': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 'cases': array([ 97626,  90289,  85730,  82235,  86773,  92848,  96615,  98118,
        101889,  96576]),
 'left': array([ 720.01666667,  792.01333333,  864.01      ,  936.00666667,
        1008.00333333, 1080.        , 1151.99666667, 1223.99333333,
        1295.99      , 1367.98666667]),
 'right': array([ 792.01333333,  864.01      ,  936.00666667, 1008.00333333,
        1080.        , 1151.99666667, 1223.99333333, 1295.99      ,
        1367.98666667, 1439.98333333])}

In [38]:
#Place the data into a ColumnDataSource object
src = ColumnDataSource(hist_df)

#create the tooltips 
#whatever you put in here will be shown when hovering
#the @ tells bokeh to use data from the columns of the columndatasource
#formatting goes in curly braces
tooltips = [
    ("number of cases", "@cases"),
    ("left (minutes)", "@left{s}"),
    ("right (minutes)", "@right{s}"),
]

"""
tooltips = {
    "number of cases": "@cases",
    "left (minutes)": "@left{s}",
    "right (minutes)": "@right{s}",
}
"""
#get the median for centering the x-axis
import statistics
median = statistics.median(src.data['cases'])

#Create the figure
p = figure(height = 600, width = 600, 
           title = 'Histogram of cases by processing times',
          x_axis_label = 'time buckets', 
           y_axis_label = 'number of cases',tooltips=tooltips)

# Add a quad glyph (https://docs.bokeh.org/en/latest/docs/reference/models/glyphs/quad.html)
#we'll use the source argument to attach the columndatasource to the chart
r = p.quad(bottom=median, #baseline of the bars: each bar runs from the median up (or down) to its case count
           top='cases',  #the data that will determine the height of the bar
           left='left', #CDS column for the left value of the bar
           right='right', #CDS column for the right value of the bar
           source=src,  #The ColumnDataSource (CDS)
           fill_color='red', 
           line_color='black',
           fill_alpha = 0.95,
           hover_fill_alpha = 1.0, 
           hover_fill_color = 'orange')



# Show the plot
show(p)

In [39]:
p,r

(figure(id='p1349', ...), GlyphRenderer(id='p1386', ...))

<br><br><br>
<span style="color:green;font-size:xx-large">Active interactions</span>
<li>active interactions allow the user to delve into the data by selecting parameters</li>
<li>for example, the user may want to bin by Agency (or Borough)</li>
<li>or select bin sizes rather than rely on our 500-minute window</li>
<li>or select periods of time and see how the histogram changes</li>

<li>In this example, we'll let the user:</li>
<ol>
    <li>select an agency from a list of agencies</li>
    <li>show binned processing time only for that agency</li>
</ol>
<li>the graph will change with each change in agency selection</li>

<br><br><br>
<span style="color:blue;font-size:x-large">JavaScript callbacks</span>
<li>changes to a page in a browser are made using JavaScript</li>
<li>we can attach JavaScript code to widgets shown alongside the chart using bokeh's callback feature</li>
<li>the complication is that we can't call Python functions from JavaScript</li>
<li>nor can we refer to a dataframe from JavaScript, because a dataframe is a Python object with no direct JSON equivalent</li>
<li>ColumnDataSource objects are fine though, because their columns can be converted into JSON objects and arrays</li>
<li>and these JSON objects/arrays can be passed to a JavaScript function</li>

<b>Note:</b> While JavaScript is a full-fledged language in its own right, we'll use just enough of it to make our graph interactive while keeping the process simple!
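
To make the JSON point concrete, here is a minimal sketch using a made-up three-column table (the names `toy_df`, `columns` and `payload` are illustrative, not part of the notebook). It only shows the shape of the data: a dictionary of column names mapped to lists serializes cleanly to a JSON object of arrays. Bokeh does its own serialization internally, so this is not what bokeh literally runs.

```python
import json
import pandas as pd

# made-up histogram-like table: two bins with their case counts
toy_df = pd.DataFrame({"left": [0.0, 10.0], "right": [10.0, 20.0], "cases": [5, 7]})

# a dataframe is a Python object; its columns, as plain lists, map directly to JSON arrays
columns = toy_df.to_dict(orient="list")
payload = json.dumps(columns)

# the JSON text is what a browser-side function could consume
print(payload)
```
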

<br><br><br>
<span style="color:blue;font-size:x-large">columndatasource</span>
<li>each user selection (agency/bin size) requires data in a columndatasource object</li>
<li>so we need a function that creates this columndatasource given an agency and a bin size</li>


<br><br><br>
<span style="color:blue;font-size:x-large">dataframe for columndatasource</span>
<li>Create a dataframe that will be converted into a bokeh ColumnDataSource object</li>
<li>Each row corresponds to a bucket of the histogram</li>
<li>Columns correspond to agencies</li>
<li>Plus two additional columns for the left and right bin values for each bin</li>

In [None]:
#Create a dataframe for conversion to ColumnDataSource
#Upper and lower define the processing time bounds (aesthetic reasons!)

def prepareDataForCDS(df,upper=1,lower=0.5,nbins=10):
    df = df[(df['processing_days']<upper) & (df['processing_days']>lower)]
    agencies = df['Agency'].unique()
    
    #get a common set of bins for all agencies

    hist,bounds = np.histogram(df['processing_days'],bins=nbins)
    left = bounds[:-1]
    right = bounds[1:]
    
    #construct the initial dataframe with bin boundaries
    #the index for this df is 0,1,..,nbins-1
    #Initially, all it contains is columns for the left and right values
    #We'll add each agency column to this data frame
    final_df = pd.DataFrame({'left':left*24*60,'right':right*24*60})


    #for each agency, compute the number of cases in each bin
    #reuse the common bin edges (bounds) so each agency's counts line up with left/right
    #and concat this into final_df with the agency name as the column name
    #note that the index will be 0,1,...,nbins-1 therefore concat will use indices
    for agency in agencies:
        #print(final_df.info())
        a_df = df[df['Agency']==agency]
        hist,_ = np.histogram(a_df['processing_days'],bins=bounds)
        hist_df = pd.DataFrame({agency: hist})
        final_df = pd.concat([final_df,hist_df],axis=1)
        
    return final_df



In [None]:
prepareDataForCDS(df,upper=1,lower=0.5,nbins=10)
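
Why a common set of bins matters can be seen in a small sketch with made-up processing times for two hypothetical agencies (`times_a`, `times_b`). When `np.histogram` is given only a bin count, it derives the edges from whatever data it receives, so per-agency edges would not match the shared `left`/`right` columns. Passing the pooled edges keeps the counts aligned.

```python
import numpy as np

# made-up processing times for two hypothetical agencies
times_a = np.array([1.0, 1.5, 2.5])
times_b = np.array([3.5, 4.5, 5.0])
all_times = np.concatenate([times_a, times_b])

# common edges computed once from the pooled data
_, edges = np.histogram(all_times, bins=4)

# edges derived from agency A alone cover a different range
_, edges_a = np.histogram(times_a, bins=4)
print(np.array_equal(edges, edges_a))

# reusing the shared edges gives counts that refer to the same bins
counts_a, _ = np.histogram(times_a, bins=edges)
counts_b, _ = np.histogram(times_b, bins=edges)
print(counts_a, counts_b)
```
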

<br><br><br>
<span style="color:blue;font-size:x-large">Interactivity</span>
<li>Requires a dropdown box to get the user selection</li>
<li>Requires JavaScript code to change the underlying data when the user selects a new agency</li>
<li>The way this works is as follows:</li>
<ul>
    <li>Specify a column that contains the data to be displayed</li>
    <li>display this column</li>
    <li>when the user selects a new column, copy the contents of the new column into the display column</li>
    <li>and tell the graph to reflect the changed values</li>
</ul>
<li>We'll add a new column <span style="color:blue">display</span> for storing data to be displayed</li>
<li>We'll use the <span style="color:green">dropdown</span> widget to display the selectable list of agencies. Many more <a href="https://docs.bokeh.org/en/latest/docs/user_guide/interaction/widgets.html">widgets are available</a></li>
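
The copy-into-display idea can be tried in plain Python before any JavaScript is involved. The sketch below is only an analogue of what the browser-side callback will do; the table, the agency names and the helper `select_agency` are all made up for illustration.

```python
import pandas as pd

# hypothetical two-agency table shaped like the histogram dataframe
toy_df = pd.DataFrame({
    "left": [0, 10, 20],
    "right": [10, 20, 30],
    "AGENCY_A": [3, 5, 2],
    "AGENCY_B": [1, 4, 6],
})

# start by displaying an arbitrary agency
toy_df["display"] = toy_df["AGENCY_A"]

def select_agency(frame, agency):
    """Copy the chosen agency's counts into the display column."""
    frame["display"] = frame[agency]
    return frame

# simulate the user picking a different agency
select_agency(toy_df, "AGENCY_B")
print(toy_df["display"].tolist())
```

The chart only ever reads the `display` column, so swapping its contents is enough to change what is drawn.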

In [None]:
#row will allow placing the graph and the widget side by side
from bokeh.layouts import row

#import the Dropdown widget along with CustomJS, bokeh's model for JavaScript callbacks
from bokeh.models import CustomJS, ColumnDataSource, Dropdown

list_of_agencies=df['Agency'].unique()

#Get the dataframe
interactive_df = prepareDataForCDS(df)

#add a new column to the object called 'display'
#when the user selects a new agency, we'll copy that agency's data into this column
#the histogram will display whatever is in this column

#Arbitrarily pick an agency (here 'DOT') and copy its counts into this column

interactive_df['display'] = interactive_df['DOT']

#create a ColumnDataSource object.
#This is the source for the graph
source = ColumnDataSource(interactive_df)

#construct the graph (but don't "show" it)
#note that the display column is "display" not "cases" as before

tooltips = [
    ("number of cases", "@display"),
    ("left (minutes)", "@left{s}"),
    ("right (minutes)", "@right{s}"),
]
    
p = figure(height = 300, width = 500, 
           title = 'Histogram of cases by processing times',
           x_axis_label = 'time buckets', 
           y_axis_label = 'number of cases',
           tooltips=tooltips)

p.quad(bottom=0,
       top='display', left='left', right='right',source=source,
       fill_color='red', line_color='black',fill_alpha = 0.75,
       hover_fill_alpha = 1.0, hover_fill_color = 'navy')

#Create a javascript callback as a call to CustomJS
#arguments are the data (the source)
#and a string containing the JavaScript code

#this.item is the selected value
#"this" is a JavaScript identifier; in a js_on_event callback it refers to the event object (here, the menu_item_click event)
#'item' is the value of the menu entry that was clicked
#Note that other widgets and callbacks may use this.value or this.some_other_name for a widget value
#The JavaScript code does three things:
#  1. creates a variable that holds the data in source
#  2. replaces the display column in the data with the column for the selected agency
#  3. registers the change (emits it!)
jscallback = CustomJS(args={'srce':source},code="""
        // this.item is the value of the selected menu entry
        // We can print it on the JavaScript console
        // Chrome windows/linux: Ctrl - Shift - J
        // Chrome mac: Cmd - Option - J
        // Safari: Option - Cmd - C
        
        //console.log logs (prints) in the JavaScript console
        console.log(' changed selected option', this.item);

        //data is the variable containing all the data
        //srce (the ColumnDataSource) was passed in as an argument
        var data = srce.data;

        // assign the selected column to the field used for the y values
        data['display'] = data[this.item];

        // register the change 
        srce.change.emit();
""")

#Create the dropdown widget
#the widget contains a menu in the form of a list of tuples
#(long name, value to be returned)
#the widget has a callback argument. Use the jscallback as the value of this argument

menu = [("Agency " + x,x) for x in list_of_agencies]

dropdown = Dropdown(label="select agency", button_type="primary", menu=menu)

#Note that jscallback must be defined before it is attached to the dropdown!
dropdown.js_on_event("menu_item_click", jscallback)

#construct a row object containing the plot and the widget
layout = row(p,dropdown)

#show it!
show(layout)
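
As noted in the comments, other widgets expose the selection differently. The following is a minimal sketch of the same idea with bokeh's `Select` widget, where the callback is attached with `js_on_change` and the chosen option is read from `this.value`. The data and the names `toy_source`, `toy_plot` and `toy_layout` are made up for illustration; it assumes the same copy-into-display approach used above.

```python
from bokeh.layouts import row
from bokeh.models import ColumnDataSource, CustomJS, Select
from bokeh.plotting import figure

# made-up counts for two hypothetical agencies plus the display column
toy_source = ColumnDataSource({
    "left": [0, 10, 20],
    "right": [10, 20, 30],
    "AGENCY_A": [3, 5, 2],
    "AGENCY_B": [1, 4, 6],
    "display": [3, 5, 2],
})

toy_plot = figure(height=300, width=500, title="Toy histogram")
toy_plot.quad(bottom=0, top="display", left="left", right="right", source=toy_source)

# Select holds its current choice in the 'value' property
select = Select(title="select agency", value="AGENCY_A",
                options=["AGENCY_A", "AGENCY_B"])

# in a js_on_change callback, 'this' is the widget, so this.value is the chosen option
select.js_on_change("value", CustomJS(args={"srce": toy_source}, code="""
    const data = srce.data;
    data['display'] = data[this.value];
    srce.change.emit();
"""))

toy_layout = row(toy_plot, select)
```

Calling `show(toy_layout)` in a notebook where `output_notebook()` has been run should render the chart with the select box beside it.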





