# More Visualization Examples and Practice Activities

Visualizations, or visual representations of data, are a powerful way to show relationships in data in a form that people can understand more readily than text.  In this lecture, we will give a brief introduction to some of the basic visualization capabilities that are available in Python (with and without Pandas).  If you are interested in data visualization more broadly, you should consider LIS 2690 or CMPINF 2130 (offered this summer!).

## Bar Charts

As shown in the lecture materials, Matplotlib provides a lot of functionality similar to the charting features in Excel, but with much more control.  It also integrates well with Pandas.

Examples of all the visualizations can be found [here](https://matplotlib.org/3.1.1/gallery/index.html).

In [None]:
import pandas as pd

import scipy.stats as stats # <-- Use for some simple stats

import numpy as np  # <-- this is NumPy, used for numerical computing; we're not going to get into it, but many of the examples use this package as a helper for the plotting code

import matplotlib.pyplot as plt # <-- Matplotlib is one of the most widely used graphing packages in Python

Let's jump in and load up a simple DataFrame.  This DataFrame has information about a drug trial and a measure of the result.

In [None]:
df = pd.read_csv("files/drug.csv")
df

### Fitness time!  Let's get the data ready for plotting

We can use the set_index method to set the index of the dataframe:

In [None]:
df = df.set_index('person')
df

We can also use the replace method to swap the numeric codes for the names of the dosing categories:

In [None]:
# Assign the result back to the column instead of using inplace=True on a column selection
df['dose'] = df['dose'].replace({1: 'placebo', 2: 'low', 3: 'high'})
df

We can use Pandas' built-in statistics methods to compute simple summary statistics.  We'll see later how this is useful.

In [None]:
mask = df['dose'] == 'placebo'

# Select the numeric 'result' column first; the 'dose' column now holds text
df.loc[mask, 'result'].mean()

In [None]:
df.loc[mask, 'result'].std()
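
The same statistics can be computed for every dose level at once with groupby.  The sketch below uses a small made-up table: only the column names dose and result come from the cells above, and the values are invented for illustration rather than taken from files/drug.csv.

```python
import pandas as pd

# Hypothetical stand-in for the drug trial data; the values are invented.
demo = pd.DataFrame({
    "dose": ["placebo", "placebo", "low", "low", "high", "high"],
    "result": [1, 3, 2, 4, 5, 7],
})

# Mean and sample standard deviation of 'result' for each dose group.
summary = demo.groupby("dose")["result"].agg(["mean", "std"])
print(summary)
```

Each row of summary is one dose level, so summary.loc['placebo', 'mean'] plays the same role as the mask-based calculation above.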

Alright, now let's get into some plotting.  Let's create a simple bar chart for the placebo group:

In [None]:
fig, ax = plt.subplots()

N = 1
ind = np.arange(N)    # the x locations for the groups
width = 0.35         # the width of the bars

# Select the numeric 'result' column before computing the statistics
means = df.loc[df['dose'] == 'placebo', 'result'].mean()
std = df.loc[df['dose'] == 'placebo', 'result'].std()

ax.bar(ind, means , width, bottom=0, yerr=std)
plt.show()

We can extend this approach to also include additional bars, with additional data.

Let's try... can we create a list of means and standard deviations of all three treatments?

In [None]:
fig, ax = plt.subplots()

N = 3
ind = np.arange(N)    # the x locations for the groups
width = 0.35         # the width of the bars

means = ## FILL IN CODE HERE
std = ## FILL IN CODE HERE

ax.bar(ind, means , width, bottom=0, yerr=std)
plt.title('Result by group')
plt.xticks(ind, ('Placebo', "Low Dose", "High Dose"))

plt.show()

In [None]:
## SOLUTION
fig, ax = plt.subplots()

N = 3
ind = np.arange(N)    # the x locations for the groups
width = 0.35         # the width of the bars

# One mean and one standard deviation per dose, in the order the bars are drawn
doses = ['placebo', 'low', 'high']
means = [df.loc[df['dose'] == d, 'result'].mean() for d in doses]
std = [df.loc[df['dose'] == d, 'result'].std() for d in doses]

ax.bar(ind, means , width, bottom=0, yerr=std)
plt.title('Result by group')
plt.xticks(ind, ('Placebo', "Low Dose", "High Dose"))

plt.show()
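
As a possible alternative to building the two lists by hand, the means and standard deviations can come from a single groupby.  This is only a sketch: the demo table and its values are invented, and the order list is an assumption used to keep the bars in placebo, low, high order.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data; the values are invented for illustration.
demo = pd.DataFrame({
    "dose": ["placebo", "placebo", "low", "low", "high", "high"],
    "result": [1, 3, 2, 4, 5, 7],
})

order = ["placebo", "low", "high"]  # left-to-right order of the bars

# One row per dose with its mean and standard deviation, in plotting order.
summary = demo.groupby("dose")["result"].agg(["mean", "std"]).reindex(order)

fig, ax = plt.subplots()
ind = np.arange(len(order))  # the x locations for the groups
bars = ax.bar(ind, summary["mean"].to_numpy(), 0.35,
              yerr=summary["std"].to_numpy())
ax.set_title("Result by group")
ax.set_xticks(ind)
ax.set_xticklabels(["Placebo", "Low Dose", "High Dose"])
plt.show()
```

Without the reindex step, groupby would return the groups in alphabetical order (high, low, placebo), and the bars would no longer match the tick labels.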

## It's not just bar charts!

We can use the same data, but with a different type of plot:

In [None]:
x = np.linspace(1, 5, 5)
y = df[df['dose'] == 'placebo']['result'].sort_values()

fig, ax = plt.subplots()

# Using plot(..., dashes=...) to set the dashing when creating a line
line1 = ax.plot(x, y, dashes=[2, 2, 10, 2], label='Placebo')
                      # 2pt line, 2pt break, 10pt line, 2pt break


ax.legend()
plt.show()



In [None]:
x

In [None]:
y

As with the bar charts, we can add more lines.  Each new line gets its own dash pattern and label.  Let's try:

In [None]:
fig, ax = plt.subplots()

# Using plot(..., dashes=...) to set the dashing when creating a line
line1 = ax.plot(x, y, dashes=[2, 2, 10, 2], label='Placebo')
                      # 2pt line, 2pt break, 10pt line, 2pt break

x2 = np.linspace(1, 5, 5)
y2 = ## LOW VALUES

line2 = ## LOW VALUES


x3 = np.linspace(1, 5, 5)
y3 = ## HIGH VALUES

line3 = ## HIGH VALUES

ax.legend()
plt.show()

In [None]:
fig, ax = plt.subplots()

# Using plot(..., dashes=...) to set the dashing when creating a line
line1 = ax.plot(x, y, dashes=[2, 2, 10, 2], label='Placebo')
                      # 2pt line, 2pt break, 10pt line, 2pt break

x2 = np.linspace(1, 5, 5)
y2 = df[df['dose'] == 'low']['result'].sort_values()

# Using plot(..., dashes=...) to set the dashing when creating a line
line2 = ax.plot(x2, y2, dashes=[6, 2], label='Low')


x3 = np.linspace(1, 5, 5)
y3 = df[df['dose'] == 'high']['result'].sort_values()

line3 = ax.plot(x3, y3, dashes=[20, 2], label='High')

ax.legend()
plt.show()
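
Since the three lines differ only in the dose, the dash pattern, and the label, the repeated code can be folded into a loop.  The sketch below assumes a made-up table with three rows per dose (the values are invented), and the styles dictionary is a hypothetical helper that reuses the dash patterns from the cell above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data; the values are invented for illustration.
demo = pd.DataFrame({
    "dose": ["placebo"] * 3 + ["low"] * 3 + ["high"] * 3,
    "result": [3, 1, 2, 4, 2, 3, 7, 5, 6],
})

# One dash pattern per dose, matching the patterns used above.
styles = {"placebo": [2, 2, 10, 2], "low": [6, 2], "high": [20, 2]}

fig, ax = plt.subplots()
for dose, dashes in styles.items():
    # Sorted results for this dose, plotted against their rank 1..n.
    y = demo.loc[demo["dose"] == dose, "result"].sort_values().to_numpy()
    x = np.arange(1, len(y) + 1)
    ax.plot(x, y, dashes=dashes, label=dose.capitalize())

ax.legend()
plt.show()
```

Computing x from len(y) means the loop does not need to know in advance how many rows each dose has.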

## Plotting with multiple data sets

Let's start by creating a merged data set.  There are a lot of steps here, but let's walk through them:

In [None]:
df_obesity = pd.read_csv("files/obesity-ac-2006-2010censustracts.csv")
df_obesity

In [None]:
df_fast_food_tract = pd.read_csv("files/fastfoodalleghenycountyupdatexy2plustract.csv")
df_fast_food_tract

Fitness time!!  Let's clear out values where we don't have a tract

In [None]:
df_fast_food_tract = df_fast_food_tract.dropna(subset=['tract'])
df_fast_food_tract

Now change the value type of the tract column to an integer.

In [None]:
df_fast_food_tract['tract'] = df_fast_food_tract['tract'].astype('int32')
df_fast_food_tract

We'll create a new DataFrame that is a grouping by tract.  We'll clean out unneeded columns too!

In [None]:
df_fast_food_tract_count = df_fast_food_tract.groupby('tract').count()
df_fast_food_tract_count = df_fast_food_tract_count.drop(['Name', 'Street Name', 'Legal Name', 'Start Date', 'Street Number', 'ZIP Code', 'Lat', 'Lon', 'Category'], axis=1).rename(columns={'Unnamed: 0' : 'count'})
df_fast_food_tract_count
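
When only the number of rows per group is needed, groupby followed by size is a shorter route to the same kind of table.  This is a sketch on invented data: the Name values and tract numbers below are hypothetical, not rows from the fast food file.

```python
import pandas as pd

# Hypothetical restaurant rows; names and tract numbers are invented.
places = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "tract": [101, 101, 202, 101],
})

# size() counts the rows in each tract and returns one value per group,
# so there are no leftover columns to drop afterwards.
counts = places.groupby("tract").size().rename("count").to_frame()
print(counts)
```

One difference to keep in mind: count() reports the number of non-missing values in each column, while size() reports the number of rows in each group.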

Now let's merge the obesity and fast food counts:

In [None]:
df_merged = pd.merge(df_obesity, df_fast_food_tract_count, left_on='2000 Tract', right_on = 'tract', how='inner')
df_merged
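
An inner merge keeps only the tracts that appear in both tables, so rows can drop out without any warning.  The sketch below, on invented data (the rate column and all values are hypothetical), shows the inner merge next to an outer merge with indicator=True, which is one way to see which rows had no partner.

```python
import pandas as pd

# Hypothetical tables; tract numbers and values are invented.
obesity = pd.DataFrame({
    "2000 Tract": [101, 202, 303],
    "rate": [0.25, 0.30, 0.20],
})
counts = pd.DataFrame({"tract": [101, 202, 404], "count": [3, 1, 5]})

# Inner merge: only tracts 101 and 202 are present on both sides.
merged = pd.merge(obesity, counts, left_on="2000 Tract", right_on="tract",
                  how="inner")

# Outer merge with an indicator column marking where each row came from.
check = pd.merge(obesity, counts, left_on="2000 Tract", right_on="tract",
                 how="outer", indicator=True)
print(merged)
print(check["_merge"].value_counts())
```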

In [None]:
ax2 = df_merged.plot.scatter(x='count', y='2006-2010 estimate of obesity',c='DarkBlue')

Yikes, there is a clear outlier (Dahntahn!)

In [None]:
df_merged_no_outlier = df_merged.drop(1, axis=0)
ax2 = df_merged_no_outlier.plot.scatter(x='count', y='2006-2010 estimate of obesity',c='DarkBlue')
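
Note that drop(1, axis=0) removes the row whose index label is 1, so it depends on where the outlier happens to sit in the merged table.  A sketch of a value-based alternative, on invented data with a hypothetical obesity column, looks up the label of the largest count first:

```python
import pandas as pd

# Hypothetical merged data; the values are invented for illustration.
demo = pd.DataFrame({
    "count": [2, 40, 1, 3],
    "obesity": [0.25, 0.10, 0.30, 0.28],
})

# Find the index label of the row with the largest count, then drop it.
outlier_label = demo["count"].idxmax()
no_outlier = demo.drop(outlier_label, axis=0)
print(no_outlier)
```

This only handles a single extreme row; whether a point really is an outlier is still a judgment call about the data.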

OK, let's get fancy and add some histograms for each of the axes.  Recall that a histogram is a way of visualizing one-dimensional data.

In [None]:
x = df_merged_no_outlier['count']
y = df_merged_no_outlier['2006-2010 estimate of obesity']

# definitions for the axes
left, width = 0.2, 0.65
bottom, height = 0.1, 0.65
spacing = 0.005


rect_scatter = [left, bottom, width, height]
rect_histx = [left, bottom + height + spacing, width, 0.2]
rect_histy = [left + width + spacing, bottom, 0.2, height]

# start with a rectangular Figure
plt.figure(figsize=(8, 8))

ax_scatter = plt.axes(rect_scatter)
ax_scatter.tick_params(direction='in', top=True, right=True)
ax_histx = plt.axes(rect_histx)
ax_histx.tick_params(direction='in', labelbottom=False)
ax_histy = plt.axes(rect_histy)
ax_histy.tick_params(direction='in', labelleft=False)

# the scatter plot:
ax_scatter.scatter(x, y)

# now determine nice limits by hand:
binwidth = 1
lim = np.ceil(np.abs(x).max() / binwidth) * binwidth
ax_scatter.set_xlim((0, lim))
ax_scatter.set_ylim((0, 0.6))

bins = np.arange(0, lim + binwidth, binwidth)
ax_histx.hist(x, bins=bins)
ax_histy.hist(y, bins=20, orientation='horizontal')

ax_histx.set_xlim(ax_scatter.get_xlim())
ax_histy.set_ylim(ax_scatter.get_ylim())

ax_scatter.set_ylabel('obesity')
ax_scatter.set_xlabel('fast food in census tract')


plt.show()
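
To see what the top histogram is actually counting, the same bin edges can be passed to np.histogram, which returns the counts without drawing anything.  The x values below are invented for illustration; the binwidth, lim, and bins lines repeat the calculation from the cell above.

```python
import numpy as np

x = np.array([1, 1, 2, 4, 4, 4])  # hypothetical fast food counts per tract

binwidth = 1
lim = np.ceil(np.abs(x).max() / binwidth) * binwidth
bins = np.arange(0, lim + binwidth, binwidth)

# np.histogram returns the number of values in each bin and the bin edges.
counts, edges = np.histogram(x, bins=bins)
print(edges)   # [0. 1. 2. 3. 4.]
print(counts)  # [0 2 1 3]
```

The last bin includes its right edge, which is why the three values equal to 4 land in the bin from 3 to 4.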

What kind of data stories can we tell with the above plot?

## Word Cloud
 
Not all data that we want to visualize is best shown in a numerical or relationship-based visual representation.  An example might be word counts, or word popularity.  Plotting words on a histogram isn't compelling, but a word cloud (maybe) is!

In [None]:
conda install wordcloud --yes

In [None]:
from wordcloud import WordCloud, STOPWORDS 

One thing about word clouds is that we know we don't want to include common words that are used as part of the language structure.  The library we are using calls these stop words:

In [None]:
STOPWORDS

Let's build a word cloud using the lyrics from Imagine.  First we need to load the lyrics into a string:

In [None]:
imagine_lyrics = """
Imagine there's no countries
It isn't hard to do
Nothing to kill or die for
And no religion, too
Imagine all the people
Living life in peace
You, you may say I'm a dreamer
But I'm not the only one
I hope someday you will join us
And the world will be as one
Imagine no possessions
I wonder if you can
No need for greed or hunger
A brotherhood of man
Imagine all the people
Sharing all the world
You, you may say I'm a dreamer
But I'm not the only one
I hope someday you will join us
And the world will live as one
"""

Next, we'll use the string methods to clean up the text and put the words in a list:

In [None]:
# Remove the commas, then split on whitespace to get a list of words
imagine_lyrics = imagine_lyrics.replace(',', '')
imagine_tokens = imagine_lyrics.split()
for token in imagine_tokens:
    print(token)

Let's make all the words lower case so the word cloud doesn't have a mix of upper and lower case letters:

In [None]:
imagine_tokens = [token.lower() for token in imagine_tokens]

Now let's put it all back together in a simple single string:

In [None]:
imagine_words = ' '
for word in imagine_tokens: 
    imagine_words = imagine_words + word + ' '

imagine_words
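
The three cleanup steps (strip punctuation, lower-case, rejoin) can also be written compactly with the standard library.  This is a sketch on two lines of the lyrics rather than a replacement for the cells above; keeping the apostrophe so that "I'm" survives is an assumption about what we want in the cloud.

```python
import string

text = "You, you may say I'm a dreamer\nBut I'm not the only one"

# Remove all punctuation except the apostrophe, so "I'm" stays one word.
to_strip = string.punctuation.replace("'", "")
cleaned = text.translate(str.maketrans("", "", to_strip))

# Lower-case every word, then join the words back into a single string.
tokens = [word.lower() for word in cleaned.split()]
words = " ".join(tokens)
print(words)
```

Unlike the loop above, join does not leave a space at the start and end of the string, which makes no difference when the text is split into words again.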

The final step is creating the actual word cloud...  let's look at the code:

In [None]:
wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='white', 
                stopwords = set(STOPWORDS), 
                min_font_size = 10)

wordcloud.generate(imagine_words) 
  
# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show()
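
Words that occur more often are drawn larger in the cloud, so counting the words is a quick way to sanity-check what the picture shows.  The sketch below uses a short fragment of the lyrics and a tiny hand-written stop word set, demo_stopwords, which is a hypothetical stand-in and not the STOPWORDS set from the wordcloud package.

```python
from collections import Counter

words = "imagine all the people living life in peace imagine all the people"

# A tiny hand-written stop word set for illustration only; it is not the
# STOPWORDS set from the wordcloud package.
demo_stopwords = {"the", "in", "all"}

# Count every word that is not a stop word.
counts = Counter(w for w in words.split() if w not in demo_stopwords)
print(counts.most_common(3))
```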

### Your turn!

Now use the code above to create a word cloud for your favorite song, poem, or excerpt!