# Introduction to Python

## Part III: Visualization with open data


This document is heavily based on the Python-Novice-Gapminder lesson developed by Software Carpentry, and the original lesson can be found online at http://swcarpentry.github.io/python-novice-gapminder/



In [None]:
import pandas
oceania_data = pandas.read_csv('data/gapminder_gdp_oceania.csv')

We're calling the `read_csv` function in the `pandas` library. Right now we're giving it a single text argument: the path to the file we want to read. If you go back to the file manager view of Jupyter, you'll see a folder called `data` that contains several files, one of which is `gapminder_gdp_oceania.csv`.

In [None]:
oceania_data

Jupyter is clever enough to recognize that this data is a table and display it in table format. The table starts with a column called `country`, followed by GDP measurements over many years. Each row of this dataset currently has a name, which is just `0` or `1`. In many cases we'd like the row names to come from the `country` column instead. We can do that in the `read_csv` function by specifying the `index_col` parameter.

Note that this is the first time we've called a function or method with a *named* parameter. This is extremely useful when a function or method has dozens of default parameters and we want to change just one: we don't have to list all the other parameters in exactly the right order.
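To see named parameters in isolation, here is a minimal sketch using a made-up function, `describe_file`. It is hypothetical and not part of pandas; it only imitates the shape of a function with several defaults.

```python
# A hypothetical function with several default parameters (not a pandas function).
def describe_file(path, sep=',', header=True, index_col=None):
    """Return a short description of how a file would be read."""
    return f"path={path} sep={sep!r} header={header} index_col={index_col}"

# Positional call: every default is kept.
print(describe_file('data/example.csv'))

# Named call: we change only index_col and skip sep and header entirely.
print(describe_file('data/example.csv', index_col='country'))
```

Because `index_col` is passed by name, we never have to spell out `sep` or `header`, even though they come first in the function definition.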

In [None]:
oceania_data = pandas.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
oceania_data

Now we see that the country column is used to set the row names. 

pandas DataFrames come with many attributes and methods, such as `columns` for the column names and `T` for the transpose, both of which we use below.
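As a rough tour of what a DataFrame offers, the sketch below builds a tiny table in memory and tries a few attributes and methods. The country names and numbers are made up for illustration, so this runs without the `data` folder.

```python
import io
import pandas

# Made-up values for illustration only; not the real gapminder data.
csv_text = """country,gdpPercap_1952,gdpPercap_1957
Aland,100.0,150.0
Bordia,200.0,260.0
"""
example = pandas.read_csv(io.StringIO(csv_text), index_col='country')

print(example.shape)       # (number of rows, number of columns)
print(example.columns)     # the column labels
print(example.T)           # transpose: rows become columns
print(example.describe())  # summary statistics for each column
```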

## Selecting Values in Pandas DataFrames

Remember how we could use square brackets to select specific values from lists? We can (mostly) do the same thing with DataFrames, but there are two ways to do it: by position with `iloc` and by label with `loc`.

In [None]:
# Just loading a larger dataset first
europe_data = pandas.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
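
The cell above only loads the data, so here is a minimal sketch of the two selection styles, assuming they are selection by position (`iloc`) and selection by label (`loc`). The country names and numbers are made up for illustration, so the example runs without the `data` folder.

```python
import io
import pandas

# Made-up values for illustration only; not the real gapminder data.
csv_text = """country,gdpPercap_1952,gdpPercap_1957
Aland,100.0,150.0
Bordia,200.0,260.0
Cestria,300.0,390.0
"""
example = pandas.read_csv(io.StringIO(csv_text), index_col='country')

# By position: first row, first column (positions start at 0, like lists).
by_position = example.iloc[0, 0]

# By label: row 'Aland', column 'gdpPercap_1952'.
by_label = example.loc['Aland', 'gdpPercap_1952']

# A colon selects everything along that axis: here, all columns for one row.
one_row = example.loc['Bordia', :]

print(by_position, by_label)
print(one_row)
```

Both `by_position` and `by_label` pick out the same cell; the difference is only in how we point at it.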

## Plotting

For plotting we use the `matplotlib` library.

In [None]:
import matplotlib.pyplot as plt
# We alias the module as plt so that we won't have to type matplotlib.pyplot over and over again

Here's a simple example that plots position against time:

In [None]:
time = [0, 1, 2, 3]
position = [0, 50, 150, 300]

plt.plot(time, position)
plt.xlabel('Time (hr)')
plt.ylabel('Position (km)')
plt.title('Distance Plot')

You can imagine we might want to plot data from our dataframe. Let's extract the years from the column names first. Just a reminder that we can access the column names using `oceania_data.columns`.

In [None]:
print(type(oceania_data.columns))
print(oceania_data.columns)

In [None]:
# Taking the last 4 characters from the string
# Also note that we're using a list comprehension, the shortcut way of working with each value in a list.
years_str = [x[-4:] for x in oceania_data.columns]
years_str # close; now we want it to be numeric

In [None]:
years_int = [int(x) for x in years_str]
years_int
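
As an alternative to the two list comprehensions, pandas can do both steps on the column labels at once. This is only a sketch: the labels below are hypothetical ones written in the same style as the gapminder column names.

```python
import pandas

# Hypothetical column labels in the same style as the gapminder files.
columns = pandas.Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962'])

# .str[-4:] slices the last four characters of every label at once,
# and .astype(int) converts the resulting strings to integers.
years = columns.str[-4:].astype(int)
print(list(years))
```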

In [None]:
# Plot Australia's GDP
plt.plot(years_int, oceania_data.loc['Australia', :])
plt.xlabel('Year')
plt.ylabel('GDP Per Capita')
plt.title('GDP Per Capita of Australia')

In [None]:
# We can plot both Australia and New Zealand, but we have to transpose the data (.T) so that the columns are countries
lines = plt.plot(years_int, oceania_data.T)
plt.xlabel('Year')
plt.ylabel('GDP Per Capita')
plt.title('GDP Per Capita of Oceania')

plt.legend(lines, ['Australia', 'New Zealand'])

It turns out that `pandas` dataframes have plot functionality built in that will handle labels for you. However, we'll first want to replace the column names with the actual year values.

In [None]:
# First what happens if we don't replace the columns
oceania_data.T.plot()

It tries to plot the column labels, equally spaced, which is seldom what we want.

In [None]:
oceania_data.columns = years_int # we change the actual column names
oceania_data.T.plot()

# we can still set axis values if we want
plt.ylabel('GDP Per Capita')
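
To see what changing the column names does to the x-axis, the sketch below compares the x-positions pandas uses for text labels and for integer labels. The data is made up, with deliberately uneven gaps between years, and the `Agg` backend is chosen only so the code also runs outside Jupyter.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so this also runs outside Jupyter
import pandas

# Made-up values; the years are deliberately unevenly spaced.
example = pandas.DataFrame(
    {'gdpPercap_1952': [100.0], 'gdpPercap_1957': [150.0], 'gdpPercap_2007': [400.0]},
    index=['Aland'],
)

# Text labels: pandas places the points at evenly spaced positions.
text_axes = example.T.plot()
text_x = [float(v) for v in text_axes.lines[0].get_xdata()]

# Integer labels: the points are placed at the actual years.
example.columns = [1952, 1957, 2007]
year_axes = example.T.plot()
year_x = [float(v) for v in year_axes.lines[0].get_xdata()]

print(text_x)
print(year_x)
```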

We can still plot individual countries as before.

In [None]:
oceania_data.loc['Australia', :].plot()

You can change the types of plots by specifying `kind`.

In [None]:
oceania_data.T.plot(kind='bar')
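
Here is a small sketch of a few `kind` values side by side, using made-up numbers rather than the gapminder data. Each call returns a matplotlib `Axes` object, which we can keep customizing afterwards. The `Agg` backend is chosen only so the code also runs outside Jupyter.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so this also runs outside Jupyter
import pandas

# Made-up values for illustration only; not the real gapminder data.
example = pandas.DataFrame(
    {'Aland': [100.0, 150.0, 210.0], 'Bordia': [200.0, 260.0, 300.0]},
    index=[1952, 1957, 1962],
)

# The same data drawn three different ways; each call returns a matplotlib Axes.
line_axes = example.plot(kind='line')
bar_axes = example.plot(kind='bar')
area_axes = example.plot(kind='area')

# The returned Axes can be labeled like any other matplotlib plot.
bar_axes.set_ylabel('GDP Per Capita')
```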

We can also make scatter plots easily. Suppose we want to plot Australia's GDP versus New Zealand's GDP.

In [None]:
# Using matplotlib directly
australia_gdp = oceania_data.loc['Australia', :]
new_zealand_gdp = oceania_data.loc['New Zealand', :]

plt.scatter(australia_gdp, new_zealand_gdp)

In [None]:
# Here we can call scatter from the dataframe directly, specifying the two columns (remember we took the transpose)
oceania_data.T.plot.scatter(x='Australia', y='New Zealand')

### Practice

* Using the dataframe of percent GDP change that you created in the last practice, plot the change in GDP for at least 3 countries of your choice. Remember to convert the columns to numeric years that can be plotted. If you weren't able to create that dataframe in the last practice, use `europe_relative_data` instead.
* Make sure your plot has an x-label, a y-label, a title, and a legend.

## Interactive Plots with Cufflinks

One drawback of using matplotlib is that the plots are static. We can use Plotly to create interactive plots with very little code.

Let's explore one library built on Plotly called Cufflinks.

Cufflinks works as a pandas method, adding one more tool to your toolkit.

To create a plot, start with the dataframe, then use dot notation (`.`) followed by the method `iplot()`, specifying the `kind` and passing any other parameters.

Let's try a few commands.


In [None]:
#!conda install -y -c conda-forge cufflinks-py

In [None]:
import cufflinks as cf
cf.go_offline()

In [None]:
import pandas as pd
all_of_gp = pd.read_csv("data/gapminder_all.csv")
all_of_gp.head()

In [None]:
all_of_gp.columns

In [None]:
gdp_2002_2007 = all_of_gp.groupby("continent")[["gdpPercap_2002","gdpPercap_2007"]].mean()

gdp_2002_2007.iplot(kind='bar',title="Average GDP per capita by continent, 2002 and 2007")

In [None]:
life_2002_2007 = all_of_gp.groupby("continent")[["lifeExp_2002", "lifeExp_2007"]].mean()


life_2002_2007.iplot(kind='line',title="Average Life Expectancy by continent, 2002, 2007")

In [None]:
pop_2002_2007 = all_of_gp.groupby("continent")[["pop_2002","pop_2007"]].mean()

pop_2002_2007.iplot(kind='area',title="Average Population by continent, 2002, 2007",
                   values="pop_2002",labels="pop_2002",fill=True)

In [None]:
countries_per_cont = all_of_gp.groupby("continent").size().reset_index(name="Count")

countries_per_cont.iplot(kind='pie',values="Count",labels='continent',title="How many countries per continent?")

## More Tools Using Plotly Express

Let's take a look at the relationship between GDP per capita and life expectancy in 2007 for countries in the Americas.

In [None]:
import plotly.express as px

In [None]:
all_of_gp.head()

In [None]:
all_of_gp.columns

In [None]:
america_data = all_of_gp[all_of_gp['continent']=="Americas"]

fig = px.scatter(america_data,x="lifeExp_2007",
           y="gdpPercap_2007")

fig.show()

### Customizing plot

In [None]:
fig = px.scatter(america_data,x="lifeExp_2007",
           y="gdpPercap_2007",
           color='country',
           title='Life expectancy vs GDP Per Capita (2007, Americas)')

fig.show()

In [None]:
fig = px.scatter(america_data,x="lifeExp_2007",
           y="gdpPercap_2007",
           color='country',
           title='Life expectancy vs GDP Per Capita (2007, Americas)',
                labels={
                    "lifeExp_2007": "Life Expectancy 2007",
                    "gdpPercap_2007":"GDP Per Capita 2007"
                })

fig.show()

In [None]:
fig.write_html("./scatter_plot_gdp_2007.html")

### Other kinds of plots with Plotly Express

In [None]:
fig = px.violin(all_of_gp,x="continent",y="pop_2007",
               title="Population for 2007 (violin plot, all continents)")

fig.show()

In [None]:
fig = px.box(all_of_gp,x="continent",y="pop_2007",
               title="Population for 2007 (box plot, all continents)")

fig.show()

In [None]:
fig = px.density_contour(all_of_gp,y="continent",x="gdpPercap_2007",
               title="Density contour map")

fig.show()

In [None]:
fig = px.density_heatmap(all_of_gp,
                         x='continent',
                         y='gdpPercap_2007',
                        title= "GDP Per Capita 2007, all continents")

fig.show()

## Exercise

* Open discussion: when would we want to use each of the methods above?
* Customize any of the plots we generated and save it to HTML.
