# Visualization

Visualizations, or visual representations of data, are a powerful way to show relationships in data that people understand more readily than text.  In this lecture, we will give a brief introduction to some of the basic visualization capabilities available in Python (with and without Pandas).  If you are interested in data visualization more broadly, you should consider LIS 2690 or CMPINF 2130 (offered this summer!).

## Bar Charts

Matplotlib provides A LOT of functionality similar to the charting features in Excel, but with a lot more control.  It also integrates well with Pandas.

Examples of all the visualizations can be found [here](https://matplotlib.org/3.1.1/gallery/index.html).

In [None]:
import pandas as pd

import scipy.stats as stats # <-- Use for some simple stats

import numpy as np  # <-- this is NumPy, used for numerical computing; we're not going to get into it, but many of the examples use it as a helper for the plotting code

import matplotlib.pyplot as plt # <-- Matplotlib is one of the most widely used graphing packages in Python

Let's jump in and load up a simple dataframe.  This dataframe has information about a drug trial and a measure of the result.

In [None]:
df = pd.read_csv("files/drug.csv")
df

We can use the set_index method to set the index of the dataframe:

In [None]:
df = df.set_index('person')
df

We can also use the replace function to replace the number values with the meaning of the dosing categories:

In [None]:
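# assign the result back; an in-place replace on a selected column may not update df in newer Pandas versions
df['dose'] = df['dose'].replace({1: 'placebo', 2: 'low', 3: 'high'})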
df
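
As a minimal sketch of the two steps above (setting the index, then replacing codes with labels), here is the same idea on a tiny made-up table. The values in `toy` are hypothetical and are not the contents of `drug.csv`; only the column names and the code-to-label mapping follow the lecture.

```python
import pandas as pd

# Hypothetical stand-in for the drug trial data (NOT the real file contents)
toy = pd.DataFrame({
    "person": [1, 2, 3, 4],
    "dose": [1, 2, 3, 1],
    "result": [3, 5, 7, 2],
})

# Use the 'person' column as the row labels
toy = toy.set_index("person")

# Map the numeric dose codes to readable category names
dose_labels = {1: "placebo", 2: "low", 3: "high"}
toy["dose"] = toy["dose"].replace(dose_labels)

print(toy)
```

Assigning the result back to `toy["dose"]` keeps the change explicit, which is why the sketch avoids `inplace=True`.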

Pandas can compute simple summary statistics, such as the mean and standard deviation, for a subset of rows... we'll see later how this is useful:

In [None]:
mask = df['dose'] == 'placebo'

df[mask]['result'].mean()  # pick the numeric column first; 'dose' now holds text and can't be averaged

In [None]:
df[mask]['result'].std()

Now, this is what I was struggling to show last week in the Activities... you can put the mask right into the code:

In [None]:
df[df['dose'] == 'placebo']['result'].mean()
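
To make the masking idea concrete, here is a small sketch on made-up data (the values are hypothetical, not from `drug.csv`). It shows that the mask is just a Series of True/False values, one per row, and that only the rows marked True contribute to the mean.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "dose": ["placebo", "low", "placebo", "high"],
    "result": [2, 5, 4, 7],
})

# The comparison produces one True/False value per row
mask = toy["dose"] == "placebo"

# Keep only the rows where the mask is True, then average the 'result' column
placebo_mean = toy.loc[mask, "result"].mean()

print(mask.tolist())
print(placebo_mean)
```

Using `.loc[mask, "result"]` selects the rows and the column in one step; it gives the same answer as `toy[mask]['result']`.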

Alright, now let's get into some plotting.  Let's create a simple bar chart for the placebo group, with the standard deviation as the error bar:

In [None]:
fig, ax = plt.subplots()

N = 1
ind = np.arange(N)    # the x locations for the groups
width = 0.35         # the width of the bars

means = df[df['dose'] == 'placebo']['result'].mean()
std = df[df['dose'] == 'placebo']['result'].std()

ax.bar(ind, means , width, bottom=0, yerr=std)
plt.show()

We can extend this approach to also include additional bars, with additional data.

Pay careful attention to how we define the X and Y variables.  We've seen this way of arranging data before!

In [None]:
fig, ax = plt.subplots()

N = 3
ind = np.arange(N)    # the x locations for the groups
width = 0.35         # the width of the bars

# one value per group, in the same order as the x tick labels below
means = [df[df['dose'] == 'placebo']['result'].mean(),
         df[df['dose'] == 'low']['result'].mean(),
         df[df['dose'] == 'high']['result'].mean()]
std = [df[df['dose'] == 'placebo']['result'].std(),
       df[df['dose'] == 'low']['result'].std(),
       df[df['dose'] == 'high']['result'].std()]

ax.bar(ind, means , width, bottom=0, yerr=std)
plt.title('Result by group')
plt.xticks(ind, ('Placebo', "Low Dose", "High Dose"))

plt.show()
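
As an alternative sketch, the per-group means and standard deviations can be computed in one step with `groupby` instead of one mask per group. The data below is made up for illustration; the resulting `means` and `stds` lists could be passed to `ax.bar` exactly like the lists above.

```python
import pandas as pd

# Hypothetical data: three measurements per dose group
toy = pd.DataFrame({
    "dose": ["placebo"] * 3 + ["low"] * 3 + ["high"] * 3,
    "result": [1, 2, 3, 2, 3, 4, 4, 6, 8],
})

# groupby sorts groups alphabetically, so reindex to the order we want on the x axis
order = ["placebo", "low", "high"]
summary = toy.groupby("dose")["result"].agg(["mean", "std"]).reindex(order)

means = summary["mean"].tolist()
stds = summary["std"].tolist()

print(summary)
```

The `reindex` call matters: without it the bars would come out in the order high, low, placebo, and the tick labels would no longer line up.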

Because MatPlotLib takes simple lists, we can send data NOT from Pandas just as easily!

In [None]:
fig, ax = plt.subplots()

N = 3
ind = np.arange(N)    # the x locations for the groups
width = 0.35         # the width of the bars

means = [2,3,2]
std = [1,1,1]

ax.bar(ind, means , width, bottom=0, yerr=std)
plt.title('Result by group')
plt.xticks(ind, ('Placebo', "Low Dose", "High Dose"))

plt.show()

## It's not just bar charts!

We can use the same data, but with a different type of plot:

In [None]:
x = np.linspace(1, 5, 5)
y = df[df['dose'] == 'placebo']['result'].sort_values()

fig, ax = plt.subplots()

# Using plot(..., dashes=...) to set the dash pattern when creating the line
line1 = ax.plot(x, y, dashes=[2, 2, 10, 2], label='Placebo')
                      # 2pt line, 2pt break, 10pt line, 2pt break


ax.legend()
plt.show()



In [None]:
x

In [None]:
y

As with the bar charts, we can add more lines.  In this case each line gets its own dash pattern and its own label in the legend:

In [None]:
fig, ax = plt.subplots()

# Using plot(..., dashes=...) to set the dash pattern when creating the line
line1 = ax.plot(x, y, dashes=[2, 2, 10, 2], label='Placebo')
                      # 2pt line, 2pt break, 10pt line, 2pt break

x2 = np.linspace(1, 5, 5)
y2 = df[df['dose'] == 'low']['result'].sort_values()

# Using plot(..., dashes=...) to set the dashing when creating a line
line2 = ax.plot(x2, y2, dashes=[6, 2], label='Low')


x3 = np.linspace(1, 5, 5)
y3 = df[df['dose'] == 'high']['result'].sort_values()

line3 = ax.plot(x3, y3, dashes=[20, 2], label='High')

ax.legend()
plt.show()
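
The three near-identical blocks above can also be written as a loop. This is only a sketch on made-up data: the `styles` dictionary and the values in `toy` are assumptions for illustration, and the `Agg` backend line is there only so the code also runs outside a notebook (drop it in Jupyter).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: three measurements per dose group
toy = pd.DataFrame({
    "dose": ["placebo"] * 3 + ["low"] * 3 + ["high"] * 3,
    "result": [3, 1, 2, 4, 2, 3, 8, 4, 6],
})

# One dash pattern per group (same patterns as in the cells above)
styles = {"placebo": [2, 2, 10, 2], "low": [6, 2], "high": [20, 2]}

fig, ax = plt.subplots()
for dose, dashes in styles.items():
    y = toy.loc[toy["dose"] == dose, "result"].sort_values()
    x = np.arange(1, len(y) + 1)  # 1, 2, ..., n so x always matches the group size
    ax.plot(x, y, dashes=dashes, label=dose.capitalize())

ax.legend()
plt.show()
```

Building `x` from `len(y)` avoids hard-coding the number of points per group.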

## Plotting with multiple data sets

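Let's start by creating a merged data set.  Lots of steps here, but let's walk through it: load both files, clean up the tract column, count the fast food locations per tract, and merge the counts with the obesity estimates.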

In [None]:
df_obesity = pd.read_csv("files/obesity-ac-2006-2010censustracts.csv")
df_obesity

In [None]:
df_fast_food_tract = pd.read_csv("files/fastfoodalleghenycountyupdatexy2plustract.csv")
df_fast_food_tract

In [None]:
df_fast_food_tract = df_fast_food_tract.dropna(subset=['tract'])
df_fast_food_tract

In [None]:
df_fast_food_tract['tract'] = df_fast_food_tract['tract'].astype('int32')
df_fast_food_tract

In [None]:
df_fast_food_tract_count = df_fast_food_tract.groupby('tract').count()
df_fast_food_tract_count = df_fast_food_tract_count.drop(['Name', 'Street Name', 'Legal Name', 'Start Date', 'Street Number', 'ZIP Code', 'Lat', 'Lon', 'Category'], axis=1).rename(columns={'Unnamed: 0' : 'count'})
df_fast_food_tract_count

In [None]:
df_merged = pd.merge(df_obesity, df_fast_food_tract_count, left_on='2000 Tract', right_on = 'tract', how='inner')
df_merged
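
The count-then-merge pattern used above can be sketched on two tiny made-up tables. The restaurant names, tract numbers, and obesity values here are hypothetical; only the column names `tract` and `2000 Tract` mirror the real files.

```python
import pandas as pd

# Hypothetical restaurant list: one row per location
restaurants = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "tract": [101, 101, 102, 103, 103, 103],
})

# Hypothetical obesity estimates: one row per tract
obesity = pd.DataFrame({
    "2000 Tract": [101, 102, 104],
    "obesity": [0.30, 0.25, 0.40],
})

# Count rows per tract and name the resulting column 'count'
counts = restaurants.groupby("tract").size().rename("count").reset_index()

# Inner merge keeps only tracts that appear in BOTH tables
merged = obesity.merge(counts, left_on="2000 Tract", right_on="tract", how="inner")

print(merged)
```

Tracts 103 and 104 drop out of `merged` because each appears in only one of the two tables; that is what `how='inner'` means.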

In [None]:
ax2 = df_merged.plot.scatter(x='count', y='2006-2010 estimate of obesity',c='DarkBlue')

Yikes, there is a clear outlier (Dahntahn!)

In [None]:
df_merged_no_outlier = df_merged.drop(1, axis=0)
ax2 = df_merged_no_outlier.plot.scatter(x='count', y='2006-2010 estimate of obesity',c='DarkBlue')
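
Dropping row `1` works because we looked at the table and saw where the outlier sits. A sketch of a less hard-coded approach, on made-up data, is to ask Pandas for the row label with the largest count and drop that one. Treating "largest count" as the outlier is an assumption that happens to fit this example.

```python
import pandas as pd

# Hypothetical merged data with one very large count
merged = pd.DataFrame({
    "count": [1, 2, 40, 3, 2],
    "obesity": [0.30, 0.25, 0.20, 0.35, 0.28],
})

# Row label of the largest count (assumed here to be the outlier)
outlier_label = merged["count"].idxmax()

# Drop that row by label; the original dataframe is left unchanged
no_outlier = merged.drop(index=outlier_label)

print(outlier_label)
print(no_outlier)
```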

Ok, let's get fancy and add a histogram for each of the axes.  Recall, a histogram is a way of visualizing one-dimensional data.

In [None]:
x = df_merged_no_outlier['count']
y = df_merged_no_outlier['2006-2010 estimate of obesity']

# definitions for the axes
left, width = 0.2, 0.65
bottom, height = 0.1, 0.65
spacing = 0.005


rect_scatter = [left, bottom, width, height]
rect_histx = [left, bottom + height + spacing, width, 0.2]
rect_histy = [left + width + spacing, bottom, 0.2, height]

# start with a rectangular Figure
plt.figure(figsize=(8, 8))

ax_scatter = plt.axes(rect_scatter)
ax_scatter.tick_params(direction='in', top=True, right=True)
ax_histx = plt.axes(rect_histx)
ax_histx.tick_params(direction='in', labelbottom=False)
ax_histy = plt.axes(rect_histy)
ax_histy.tick_params(direction='in', labelleft=False)

# the scatter plot:
ax_scatter.scatter(x, y)

# now determine nice limits by hand:
binwidth = 1
lim = np.ceil(np.abs(x).max() / binwidth) * binwidth
ax_scatter.set_xlim((0, lim))
ax_scatter.set_ylim((0, 0.6))

bins = np.arange(0, lim + binwidth, binwidth)
ax_histx.hist(x, bins=bins)
ax_histy.hist(y, bins=20, orientation='horizontal')

ax_histx.set_xlim(ax_scatter.get_xlim())
ax_histy.set_ylim(ax_scatter.get_ylim())

ax_scatter.set_ylabel('obesity')
ax_scatter.set_xlabel('fast food in census tract')


plt.show()

## Word Cloud
 
Not all data that we want to visualize is best shown in a numerical or relationship-based visual representation.  An example might be word counts, or word popularity.  Plotting words on a histogram isn't compelling, but a word cloud (maybe) is!

In [None]:
from wordcloud import WordCloud, STOPWORDS 

One thing about word clouds is that we know we don't want to include common words that are used as part of the language structure.  The library we are using calls these stop words:

In [None]:
STOPWORDS

Let's build a word cloud using the lyrics from Imagine.  First we need to load the lyrics into a string:

In [None]:
imagine_lyrics = """
Imagine there's no countries
It isn't hard to do
Nothing to kill or die for
And no religion, too
Imagine all the people
Living life in peace
You, you may say I'm a dreamer
But I'm not the only one
I hope someday you will join us
And the world will be as one
Imagine no possessions
I wonder if you can
No need for greed or hunger
A brotherhood of man
Imagine all the people
Sharing all the world
You, you may say I'm a dreamer
But I'm not the only one
I hope someday you will join us
And the world will live as one
"""

Next, we'll use the string methods to clean up the text and put the words in a list:

In [None]:
imagine_lyrics = imagine_lyrics.replace(',', '')  # remove the commas
imagine_tokens = imagine_lyrics.split()           # split on whitespace into a list of words
for x in imagine_tokens:
    print(x)

Let's make all the words lower case so the word cloud doesn't have a mix of upper and lower case letters:

In [None]:
for i in range(len(imagine_tokens)):
    imagine_tokens[i] = imagine_tokens[i].lower()

Now let's put it all back together in a simple single string:

In [None]:
imagine_words = ' '
for word in imagine_tokens: 
    imagine_words = imagine_words + word + ' '
  
imagine_words
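
The clean-up steps above (remove commas, split, lower-case, join) can also be chained into a few lines using only built-in string methods. This is a compact sketch on a three-line excerpt of the lyrics, not a replacement for the step-by-step version.

```python
# Short excerpt of the lyrics used above
lyrics = "Imagine all the people\nLiving life in peace\nYou, you may say I'm a dreamer"

# Remove commas, lower-case everything, and split on whitespace
tokens = lyrics.replace(",", "").lower().split()

# Join the tokens back into one space-separated string
words = " ".join(tokens)

print(words)
```

`" ".join(tokens)` does the same job as the loop that appends each word plus a space, without the extra leading and trailing spaces.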

The final step is creating the actual word cloud...  let's look at the code:

In [None]:
wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='white', 
                stopwords = set(STOPWORDS), 
                min_font_size = 10)

wordcloud.generate(imagine_words) 
  
# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show()
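
A word cloud sizes each word by how often it appears once the stop words are removed. As a rough sketch of that counting step, here is the same idea with `collections.Counter`. The tiny `stop_words` set is hand-picked for this example and is an assumption; it is not the `STOPWORDS` set from the wordcloud library, and the library's own counting may differ in its details.

```python
from collections import Counter

text = ("imagine all the people living life in peace "
        "imagine all the people sharing all the world")

# Tiny hand-picked stop word set (hypothetical, NOT the library's STOPWORDS)
stop_words = {"all", "the", "in"}

# Count every word that is not a stop word
counts = Counter(word for word in text.split() if word not in stop_words)

# The most frequent words are the ones a word cloud would draw largest
for word, n in counts.most_common(3):
    print(word, n)
```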

## Geospatial visualization

This exercise is modeled after the example in the GeoPandas documentation and this [GitHub example](https://github.com/bendoesdata/make-a-map-geopandas/blob/master/Let's%20make%20a%20map!%20Geopandas%20and%20Matplotlib.ipynb).

GeoPandas is a way to work with geographically encoded data.  It's very powerful, both from a computation standpoint (e.g. computing based on distance or geographic boundary) and from a visualization standpoint (which is what we will explore in this lecture).

In [None]:
import geopandas as gpd

In [None]:
# you load the .shp file, but its companion files (e.g. .shx and .dbf) are REQUIRED to be in the same directory too
map_df = gpd.read_file('files/pitt_neighborhoods.shp')
map_df.head()

Once loaded, you can plot the map_df to see what's in the file:

In [None]:
map_df.plot()

Let's load another dataframe that we will merge with the geo-dataframe to produce some geographic data:

In [None]:
df_crime = pd.read_csv("files/burgh_crime.csv")
df_crime.head()

In [None]:
df_crime_count = df_crime.groupby('INCIDENTNEIGHBORHOOD').count()
df_crime_count

In [None]:
map_merged4 = map_df.merge(df_crime_count, left_on="hood", right_on="INCIDENTNEIGHBORHOOD")
map_merged4
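
Because the merge matches on neighborhood names, any spelling difference between the two files silently drops a neighborhood from the map. A sketch of one way to check for that, using plain Pandas on made-up names and counts, is an outer merge with `indicator=True`. The neighborhood names and numbers below are hypothetical.

```python
import pandas as pd

# Hypothetical neighborhood names from the map file
hoods = pd.DataFrame({"hood": ["Oakland", "Shadyside", "Bloomfield"]})

# Hypothetical crime counts keyed by neighborhood name
crime_counts = pd.DataFrame({
    "INCIDENTNEIGHBORHOOD": ["Oakland", "Shadyside", "Mt. Oliver"],
    "OFFENSES": [10, 4, 2],
})

# An outer merge keeps every row; the _merge column says where each row came from
check = hoods.merge(crime_counts, left_on="hood",
                    right_on="INCIDENTNEIGHBORHOOD",
                    how="outer", indicator=True)

# Rows that did not find a partner in the other table
unmatched = check[check["_merge"] != "both"]

print(unmatched)
```

Rows marked `left_only` are map areas with no crime records under that name; rows marked `right_only` are crime records whose neighborhood name is not in the map file.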

Creating a geographic plot is not very complex: just call plot on the geo-dataframe (it has to be the geo-dataframe, not a plain Pandas dataframe) and name the column that should drive the color:

In [None]:
# create figure and axes for Matplotlib
fig, ax = plt.subplots(1, figsize=(10, 6))

# create map
map_merged4.plot(column='OFFENSES', cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8')
plt.show()

We can use additional methods and settings to make the visualization look nicer:

In [None]:
# create figure and axes for Matplotlib
fig, ax = plt.subplots(1, figsize=(10, 6))

# create map
map_merged4.plot(column='OFFENSES', cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8')

ax.axis('off')

# add a title
ax.set_title('Crime incidents by neighborhood (2016-present)', fontdict={'fontsize': '25', 'fontweight' : '3'})
# create an annotation for the data source
ax.annotate('Source: Western Pennsylvania Open Data Center, 2020',xy=(0.1, .08),  xycoords='figure fraction', horizontalalignment='left', verticalalignment='top', fontsize=12, color='#555555')
plt.show()

Making the legend takes a little more effort, because we need to know the range of the values being encoded on the map:

In [None]:
map_merged4['OFFENSES'].max()

In [None]:
# set the range for the choropleth
vmin, vmax = 0, map_merged4['OFFENSES'].max()

In [None]:
# create figure and axes for Matplotlib
fig, ax = plt.subplots(1, figsize=(13, 6))

# create map
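# pass vmin/vmax so the map colors use the same range as the colorbar below
map_merged4.plot(column='OFFENSES', cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8', vmin=vmin, vmax=vmax)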

ax.axis('off')

# add a title
ax.set_title('Crime incidents by neighborhood (2016-present)', fontdict={'fontsize': '25', 'fontweight' : '3'})
# create an annotation for the data source
ax.annotate('Source: Western Pennsylvania Open Data Center, 2020',xy=(0.1, .08),  xycoords='figure fraction', horizontalalignment='left', verticalalignment='top', fontsize=12, color='#555555')

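# Create colorbar as a legend, using the same colormap and value range as the map
sm = plt.cm.ScalarMappable(cmap='Blues', norm=plt.Normalize(vmin=vmin, vmax=vmax))
# the mappable needs a data array; an empty one is enough here
sm.set_array([])
# add the colorbar to the figure, next to the map axes
cbar = fig.colorbar(sm, ax=ax)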

plt.show()
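
The colorbar part of the cell above does not depend on GeoPandas, so it can be tried on its own. Below is a minimal Matplotlib-only sketch. The value range (0 to 250) and the label text are made up, and the `Agg` backend line is only there so the code also runs outside a notebook (drop it in Jupyter).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs without a display
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

# Hypothetical value range for the legend
vmin, vmax = 0, 250

fig, ax = plt.subplots(figsize=(6, 4))
ax.axis("off")  # empty axes standing in for the map

# A ScalarMappable ties a colormap to a value range without plotting any data
sm = ScalarMappable(norm=Normalize(vmin=vmin, vmax=vmax), cmap="Blues")
sm.set_array([])

# Attach the colorbar to the axes and describe what the colors mean
cbar = fig.colorbar(sm, ax=ax)
cbar.set_label("Number of offenses")

plt.show()
```

The colorbar only matches the map if the map is drawn with the same colormap and the same `vmin` and `vmax`.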