# Hands-on with Python + matplotlib

# 1. Improving Pie Charts

*What is wrong with this figure?* 

![](https://drive.google.com/uc?id=1K6hCHovjZV5Icbn3zd-gW86RSRjiH0i-)

## Let's agree that this is a monstrosity.  Now, how do we improve it?

In [None]:
# import necessary libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

%matplotlib inline

## 1.1. Read in the data

*This is a made-up data set from a colleague of mine. We have 10 items, each with a text label and a numeric value.*

*I'm using the Python library `pandas` to read in the data.*

In [None]:
url = 'https://drive.google.com/file/d/1iWAtKk7aOinwb-pJ-Cy-hiB5xBv3Z-b5/view?usp=sharing'
url='https://drive.google.com/uc?id=' + url.split('/')[-2]

data = pd.read_csv(url)
data
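
*The URL rewrite in the cell above can be wrapped in a small helper. This is only a sketch: `drive_download_url` is a hypothetical name, and it assumes the share link has the form `.../file/d/<file id>/view...`, as the link above does.*

```python
def drive_download_url(share_url):
    """Turn a Google Drive share link into a direct-download link.

    Assumes the link looks like .../file/d/<file id>/view?..., so the
    file id is the second-to-last piece when splitting on '/'.
    """
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?id=' + file_id


share_url = 'https://drive.google.com/file/d/1iWAtKk7aOinwb-pJ-Cy-hiB5xBv3Z-b5/view?usp=sharing'
direct_url = drive_download_url(share_url)
```

*Passing `direct_url` to `pd.read_csv` would then read the file in the same way as the cell above.*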

## 1.2. For many use cases (including this one), a bar chart is a better option than a pie chart.

*Humans can interpret differences more easily in bar charts. Pie charts make us compare angles and areas, which is slow, while bar charts let us compare positions along a common scale, which is fast. Generally, you should choose a bar chart over a pie chart when:*
- *There are too many categories to easily distinguish between pie chart areas (as we have here).*
- *Slice sizes in the pie chart are too similar (as we have here).*
- *You have multiple data sets (which we do not have here).*
- *The raw percentages provide as much (or more) meaning than the fraction of a whole (as we have here).*

*Pie charts are only useful when there are few categories, each category has a very different percentage, AND the purpose of your visualization is to show fractions of a whole.*
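
*To see the difference for yourself, here is a minimal sketch that draws the same numbers both ways. The labels and values are made up for illustration (they are not the data set used in this notebook) and are chosen to be close together, which is where pie charts struggle most.*

```python
import numpy as np
from matplotlib import pyplot as plt

# hypothetical values, deliberately close together
labels = ['A', 'B', 'C', 'D', 'E']
values = np.array([18, 19, 20, 21, 22])

f, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# left: pie chart, where the slices are hard to rank by eye
ax_pie.pie(values, labels=labels)
ax_pie.set_title('Pie')

# right: bar chart of the same values on a common baseline
bars = ax_bar.bar(labels, values)
ax_bar.set_title('Bar')
```

*In the bar panel the ordering of the five values should be readable at a glance; in the pie panel it is much harder to tell which slice is largest.*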

*Here is the default bar chart from matplotlib. It leaves a lot to be desired...*

In [None]:
f,ax = plt.subplots()

ind = np.arange(len(data))  # the x locations for the bars
width = 0.5 # the width of the bars
rects = ax.bar(ind, data['Value'], width)

## 1.3. Add some labels

*The text labels for the bars are unreadable. How should we fix that?*

In [None]:
f,ax = plt.subplots()

ind = np.arange(len(data))  # the x locations for the bars
width = 0.5 # the width of the bars
rects = ax.bar(ind, data['Value'], width)

# add some text for labels, title and axes ticks
ax.set_ylabel('Percent')
ax.set_title('Percentage of Poor Usage')
ax.set_xticks(ind)
_ = ax.set_xticklabels(data['Label'])
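
*One possible fix is to rotate the tick labels so they no longer overlap. The sketch below uses made-up labels and values in place of `data['Label']` and `data['Value']`. The next section takes a different route and switches to horizontal bars, which keeps the text level and is often easier to read.*

```python
import numpy as np
from matplotlib import pyplot as plt

# hypothetical long labels standing in for data['Label']
labels = ['A fairly long label', 'Another long label', 'Yet another label']
values = [3.0, 5.5, 4.2]

f, ax = plt.subplots()
ind = np.arange(len(labels))  # the x locations for the bars
ax.bar(ind, values, 0.5)

ax.set_xticks(ind)
# rotate the labels and right-align them so each one ends under its bar
ax.set_xticklabels(labels, rotation=45, ha='right')
```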

## 1.4. Fix the bar text, sort the data, add the percentage values to each bar

In [None]:
f,ax = plt.subplots()

# sort the data (a nice feature of pandas DataFrames)
data.sort_values('Value', inplace=True)

ind = np.arange(len(data))  # the x locations for the bars
width = 0.5 # the width of the bars
rects = ax.barh(ind, data['Value'], width, zorder=2)

# add some text for labels, title and axes ticks
ax.set_xlabel('Percent')
ax.set_title('Percentage of Poor Usage')
ax.set_yticks(ind)
ax.set_yticklabels(data['Label'])

# add a grid behind the plot
ax.grid(color='gray', linestyle='-', linewidth=1, zorder = 1)

# I grabbed this from here : https://matplotlib.org/stable/gallery/lines_bars_and_markers/bar_label_demo.html
# Label with specially formatted floats
ax.bar_label(rects, fmt='%.1f')
ax.set_xlim(right=12)  # adjust xlim to fit labels
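
*Note that `sort_values(..., inplace=True)` changes `data` itself, so re-running the earlier cells afterwards will draw the bars in sorted order. If you would rather leave the original untouched, one option is to sort into a new DataFrame. This is a sketch with a made-up DataFrame; `data_demo` and `sorted_demo` are hypothetical names.*

```python
import pandas as pd

# hypothetical stand-in for the data set read in section 1.1
data_demo = pd.DataFrame({'Label': ['a', 'b', 'c'], 'Value': [5.0, 1.0, 3.0]})

# sort into a new DataFrame and renumber the rows 0..n-1;
# data_demo itself keeps its original order
sorted_demo = data_demo.sort_values('Value').reset_index(drop=True)
```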
        

## 1.5. Clean this up a bit
- *I don't want the grid lines anymore*
- *We can remove the axes entirely*
- *Make the font larger*
- *Let's change the colors, and highlight one of them*
- *Save the plot*

In [None]:
f,ax = plt.subplots(figsize=(10,8))

# sort the data (a nice feature of pandas DataFrames)
data.sort_values('Value', inplace=True)

ind = np.arange(len(data))  # the x locations for the bars
width = 0.7 # the width of the bars
rects = ax.barh(ind, data['Value'], width, zorder=2)

# add some text for labels, title and axes ticks
ax.set_title('Percentage of Poor Usage in Data Visualization', fontsize = 30)
ax.set_yticks(ind)
ax.set_yticklabels(data['Label'], fontsize=20)

# remove all the axes, ticks and lower x label
aoff = ['right', 'left', 'top', 'bottom']
for x in aoff:
    ax.spines[x].set_visible(False)
# remove the ticks and labels on the x axis
ax.set_xticks([])

# Label with specially formatted floats
ax.bar_label(rects, fmt='%.1f%%', fontsize=20)
ax.set_xlim(right=10)  # adjust xlim to fit labels

# change the colors
highlight = [4]
for i, r in enumerate(rects):
    r.set_color('gray')
    if (i in highlight):
        r.set_color('orange')


    
#f.savefig('bar.pdf',format='pdf', bbox_inches = 'tight') 
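
*As an alternative to recoloring the bars in a loop after drawing them, you can build the list of colors first and pass it to `barh` through its `color` argument. A minimal sketch, with made-up values standing in for `data['Value']`:*

```python
import numpy as np
from matplotlib import pyplot as plt

# hypothetical values standing in for data['Value']
values = np.array([1.5, 2.0, 3.2, 4.1, 5.6, 7.3])
ind = np.arange(len(values))  # the y locations for the bars

highlight = [4]  # positions of the bars to highlight
bar_colors = ['orange' if i in highlight else 'gray' for i in ind]

f, ax = plt.subplots()
# one color per bar, in the same order as the bars
rects = ax.barh(ind, values, 0.7, color=bar_colors)
```

*Both approaches should give the same picture; this one just keeps the color choice in one place.*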

# 2. Scatter Plots

In [None]:
# import necessary libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib as mpl

%matplotlib inline

## 2.1. Read in the data

*I downloaded [2024 Chicago taxi data](https://data.cityofchicago.org/Transportation/Taxi-Trips-2024-/ajtu-isnz/about_data) from the [Chicago data portal](https://data.cityofchicago.org/). This dataset has millions of rows and many columns (and is about 1.3 GB), and therefore may take some time to load and visualize. If you want to run this code locally, please download the data either from the Chicago Data Portal linked above or from the version that I have on Google Drive [here](https://drive.google.com/file/d/1QPS8DY2bDCbttMf4dEIIC3LOdYlph7sJ/view?usp=sharing). (The dataset is too large to host on GitHub.)*

*Here, we will look at columns for `Fare` and `Tips`.*

In [None]:
# this assumes that you have downloaded the data (as above), and placed it in a data directory with the file name 'Taxi_Trips__2024-__20240731.csv'
df = pd.read_csv('data/Taxi_Trips__2024-__20240731.csv')
df.head()
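
*Since we only look at `Fare` and `Tips`, one way to reduce load time and memory is to ask `pd.read_csv` for just those columns with `usecols`. The sketch below uses a tiny made-up CSV held in memory so that it runs without the real file; the other column names in it are hypothetical.*

```python
import io
import pandas as pd

# tiny made-up CSV standing in for the real taxi file
csv_text = """Trip ID,Fare,Tips,Company
t1,10.50,2.00,X
t2,25.00,5.00,Y
t3,8.25,0.00,X
"""

# read only the two columns used in this section
df_small = pd.read_csv(io.StringIO(csv_text), usecols=['Fare', 'Tips'])
```

*With the real data, the same `usecols` argument could be added to the `pd.read_csv` call in the cell above.*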

## 2.2. Let's plot the `Fare` vs. `Tips` data as a scatter plot.

*Is there anything that we should improve upon here?*

In [None]:
# define the figure and axis objects
f, ax = plt.subplots()

# plot the data as a scatterplot
ax.scatter(df['Fare'], df['Tips'])

## 2.3. Let's improve this
- *Change the axis range.*
- *Try open circles as symbols.*
- *Add a title and some descriptive labels to the axes.*
- *Increase the font sizes.*

In [None]:
# define the figure and axis objects
f, ax = plt.subplots()

# plot the data as a scatterplot
ax.scatter(df['Fare'], df['Tips'], s=40, facecolors='none', edgecolors='black')

# set the labels with correct font sizes
ax.set_title("How Chicagoans Tipped their Cab Drivers in 2024", fontsize=18, pad=20)
ax.set_xlabel("Fare ($)", fontsize=14)
ax.set_ylabel("Tip ($)", fontsize=14)

# change the axis range
ax.set_xlim(0, 150)
ax.set_ylim(0, 150)
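
*With millions of points, even open circles pile up on top of each other, and the plot is slow to draw. Two common workarounds are to plot a random subset and to use small, semi-transparent markers. The sketch below uses synthetic numbers, not the taxi data; the distributions are assumptions chosen only to produce a plausible-looking cloud.*

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# synthetic fares and tips standing in for the taxi data
rng = np.random.default_rng(0)
fare = rng.gamma(shape=2.0, scale=15.0, size=20000)
tips = np.clip(0.2 * fare + rng.normal(0.0, 2.0, size=fare.size), 0.0, None)
df_demo = pd.DataFrame({'Fare': fare, 'Tips': tips})

# draw a reproducible random subset so the plot renders quickly
sample = df_demo.sample(n=2000, random_state=0)

f, ax = plt.subplots()
# small, semi-transparent markers make dense regions appear darker
ax.scatter(sample['Fare'], sample['Tips'], s=10, alpha=0.2, color='black')
ax.set_xlim(0, 150)
ax.set_ylim(0, 150)
```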

## 2.4. Can we improve this more?
- *Use a 2d histogram instead.  (Often when you have so much overlapping data, it is easier for the viewer if you switch to a 2d histogram, contour plot, or similar.)*
- *Include a colorbar.*
- *Add a line where the tip is 20% of the fare, and label it.*

In [None]:
# define the figure and axis objects
f, ax = plt.subplots()

# plot the data as a 2d histogram, with a log color scale for the counts
h, xedges, yedges, image = ax.hist2d(
    df['Fare'], df['Tips'], bins=60, cmap='Blues',
    range=[[0, 150], [0, 60]], norm=mpl.colors.LogNorm(vmax=1e4),
)

# Add a colorbar
cbar = plt.colorbar(image, ax=ax)
cbar.set_label('Number of Rides', fontsize=14)  # Set the label for the colorbar

# add lines at standard tip rates (uncomment below to include the lines in the plot)
# xline = np.linspace(0,150,100) # define an x variable that spans our plot range
# tip_pcts = [0.2, 0.25, 0.30, 0.4] # choose a list of tip percentages to plot, then iterate through to plot the lines and labels
# for p in tip_pcts:
#     ax.plot(xline, p*xline, color = 'gray', linestyle = 'dashed', alpha = 0.7) 
#     ax.text(xline[-20], p*xline[-20] + 5*p, f'{p*100:.0f}%', rotation = 100*p, color = 'gray')


# set the labels with correct font sizes
ax.set_title("How Chicagoans Tipped their Cab Drivers in 2024", fontsize=18, pad=20)
ax.set_xlabel("Fare ($)", fontsize=14)
ax.set_ylabel("Tip ($)", fontsize=14)

# Note that we don't need to explicitly set the axis range because it will be limited by the range we supplied to hist2d above
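
*If you want to try the 2d histogram and the 20% line without downloading the taxi data, here is a self-contained sketch on synthetic data. The fares and tips are randomly generated under assumed distributions, so the picture only illustrates the technique and says nothing about real tipping behavior.*

```python
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt

# synthetic fares and tips (not the taxi data)
rng = np.random.default_rng(1)
fare = rng.gamma(shape=2.0, scale=15.0, size=50000)
tips = np.clip(0.2 * fare + rng.normal(0.0, 2.0, size=fare.size), 0.0, None)

f, ax = plt.subplots()

# 2d histogram; LogNorm keeps sparsely filled bins visible next to dense ones
h, xedges, yedges, image = ax.hist2d(
    fare, tips, bins=60, cmap='Blues',
    range=[[0, 150], [0, 60]], norm=mpl.colors.LogNorm(),
)
f.colorbar(image, ax=ax)

# a 20% tip corresponds to the line tip = 0.2 * fare
xline = np.linspace(0, 150, 100)
ax.plot(xline, 0.2 * xline, color='gray', linestyle='dashed')
ax.text(120, 0.2 * 120 + 2, '20%', color='gray')
```

*The array `h` holds the counts per bin (60 by 60 here), and `xedges` and `yedges` hold the bin boundaries, in case you want to work with the numbers directly.*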

