Esc + Shift + L turns line numbers on/off

# Visualization with plotnine

In [None]:
import pandas as pd
import numpy as np
import plotnine
from plotnine import *
%matplotlib inline

In [None]:
url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
                     skiprows=18,      # skip the first 18 rows
                     skipfooter=7,     # skip the last 7
                     usecols=[0,1,9,13], # select columns of interest
                     index_col=0,      # set the index as the first column
                     header=0      # set the variable names
                     )

In [None]:
pisa

In [None]:
# drop rows with missing values, then drop the first remaining row
pisa = pisa.dropna().iloc[1:,:]

In [None]:
pisa

In [None]:
pisa.columns = ['math', 'reading', 'science'] # simplify variable names

In [None]:
# Let's make sure every score is stored as a float
pisa = pisa.astype(float)

In [None]:
pisa

In [None]:
pisa=pisa.rename_axis('Region').reset_index()


In [None]:
pisa

## Getting started! Layers!

In [None]:
## plotnine needs a dataframe to start

(ggplot(data=pisa))


Well, nothing showed up! What is going on?! We have not yet told plotnine which columns to draw, so we need to add aesthetics!

In [None]:
(ggplot(data=pisa, mapping=aes(x='Region', y='math')))


We are getting somewhere! For now we have defined the space where the graph is going to live. We use layers called geoms to actually draw the data, and we add a new layer by putting a plus sign after the latest part of the command.

In [None]:
(ggplot(data=pisa, mapping=aes(x='Region', y='math'))+
geom_bar(stat='identity'))


Great! I can see a bar plot now, but the region names on the x axis are hard to read. Maybe a horizontal bar plot would work better.

In [None]:
(ggplot(data=pisa, mapping=aes(x='Region', y='math'))+
geom_bar(stat='identity')+
coord_flip())


Okay, I hate how this looks. Let me make these plots larger. We are going to set the following option (width and height, in inches) from now on:

In [None]:
plotnine.options.figure_size = (16, 8)

In [None]:
(ggplot(data=pisa, mapping=aes(x='Region', y='math'))+
geom_bar(stat='identity')+
coord_flip())

OK, this is better, but maybe we could look at only the first 5 regions and make the bars dark orange.

In [None]:
(ggplot(data=pisa.iloc[0:5,], mapping=aes(x='Region', y='math'))+
geom_bar(stat='identity', fill='darkorange')+
coord_flip())
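
Note that `pisa.iloc[0:5,]` takes the first five rows in file order, which is not necessarily the five highest scorers. Below is a minimal sketch of picking the top five by math score with pandas `nlargest`. The small frame `toy` is a made-up placeholder that only shares column names with `pisa`; its values are not real PISA scores.

```python
import pandas as pd

# Placeholder data with the same column names as `pisa`; the values are invented.
toy = pd.DataFrame({
    'Region': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'math': [500.0, 480.0, 530.0, 470.0, 510.0, 495.0, 520.0],
})

# Keep the five rows with the largest math score, highest first.
top5 = toy.nlargest(5, 'math')
print(top5)
```

Passing a frame like `top5` as `data=` in place of `pisa.iloc[0:5,]` would plot the highest scorers instead of the first rows.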

I'd like to do more work with this: change the axis labels and give the plot a nice title.

In [None]:
(ggplot(data=pisa.iloc[0:5,], mapping=aes(x='Region', y='math'))+
geom_bar(stat='identity', fill='darkorange')+
xlab("Region")+              # labels follow the aesthetic: x is still Region after coord_flip()
ylab("Mathematics Score")+
ggtitle("Math scores for 5 regions")+
coord_flip())


The gray background is pretty ugly. Let's get rid of it:

In [None]:
(ggplot(data=pisa.iloc[0:5,], mapping=aes(x='Region', y='math'))+
geom_bar(stat='identity', fill='darkorange')+
xlab("Region")+              # labels follow the aesthetic: x is still Region after coord_flip()
ylab("Mathematics Score")+
ggtitle("Math scores for 5 regions")+
coord_flip()+
theme_bw())

Maybe I would like to know whether the math and reading scores move together.

In [None]:
(ggplot(data=pisa, mapping=aes(x='math', y='reading'))+
 geom_point())
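
The scatter plot gives a visual impression; one number that summarizes the same question is the Pearson correlation. Here is a small sketch with invented placeholder scores (the real `pisa` frame would be used the same way):

```python
import pandas as pd

# Placeholder scores; invented values, only the column names match `pisa`.
toy = pd.DataFrame({
    'math':    [450.0, 480.0, 500.0, 520.0, 560.0],
    'reading': [455.0, 470.0, 505.0, 515.0, 550.0],
})

# Pearson correlation between the two columns (1 means a perfect rising line).
r = toy['math'].corr(toy['reading'])
print(round(r, 3))
```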

Let's add a linear trend line that passes as close as possible to the dots, and replace the gray background.

In [None]:
(ggplot(data=pisa, mapping=aes(x='math', y='reading'))+
 geom_point()+
geom_smooth(method='lm')+
theme_bw())
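
`geom_smooth(method='lm')` draws a straight line fitted by least squares. As a sketch of what such a fit computes, the slope and intercept can be obtained with `numpy.polyfit`. The scores below are invented placeholders, built to be exactly linear so the result is easy to check by eye.

```python
import numpy as np

# Placeholder scores (invented); reading is an exact linear function of math.
math_scores = np.array([450.0, 480.0, 500.0, 520.0, 560.0])
reading_scores = 0.5 * math_scores + 250.0

# Degree-1 polynomial fit = least-squares straight line.
slope, intercept = np.polyfit(math_scores, reading_scores, deg=1)
print(slope, intercept)
```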

## Let's play with a new dataset (gapminder)

You might have seen this:

https://www.youtube.com/watch?v=Z8t4k0Q8e8Y

We are going to try to plot just one of those years!

In [None]:
g_url='https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv'

gapminder=pd.read_csv(g_url)

In [None]:
gapminder.head()

In [None]:
# Let's do a scatter plot where the horizontal axis is GDP per capita and the vertical axis is life expectancy!
# Extra: try making the dots the color 'cornflowerblue'




















In [None]:
# Let's check which continents are in the dataframe


gapminder['continent'].unique()

In [None]:
# To get one of those nice plots, I'd like to give each continent its own color

(ggplot(data=gapminder, mapping=aes(x='gdpPercap', y='lifeExp', color='continent')) +
        geom_point(alpha = 0.5)
)

In [None]:
# Nice! But I want a single year, so let's see which years are available

gapminder['year'].unique()

In [None]:
# Let's pick 2002 and do the same plot as above!

(ggplot(data=gapminder.loc[gapminder['year']==2002], mapping=aes(x='gdpPercap', y='lifeExp', color='continent')) +
        geom_point(alpha = 0.5)
)


Nice! Now let's map each country's population to the size of its dot.

In [None]:

(ggplot(data=gapminder.loc[gapminder['year']==2002], mapping=aes(x='gdpPercap', y='lifeExp',
                                                                 color='continent',size='pop')) +
        geom_point(alpha = 0.5)+
 theme_bw()
)

It's a bit hard to see anything, while in the YouTube clip everything was easy to see. What can we do? In the video, GDP is shown on a log10 scale.

In [None]:
(ggplot(data=gapminder.loc[gapminder['year']==2002], mapping=aes(x='gdpPercap', y='lifeExp',
                                                                 color='continent',size='pop')) +
        geom_point(alpha = 0.5)+
 scale_x_log10()+
 theme_bw()
)
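
Why the log scale helps: GDP per capita spans several orders of magnitude, so on a linear axis most countries are squeezed to the left. On a log10 axis, every tenfold increase covers the same distance. A small numeric sketch with illustrative values (not taken from the dataset):

```python
import numpy as np

gdp = np.array([300.0, 3000.0, 30000.0])   # illustrative values, each 10x the previous

linear_gaps = np.diff(gdp)                 # distances on a linear axis
log_gaps = np.diff(np.log10(gdp))          # distances on a log10 axis
print(linear_gaps, log_gaps)
```

The linear gaps differ by a factor of ten, while the log10 gaps are equal.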

Let's change the plot size a bit to see things better

In [None]:
plotnine.options.figure_size = (9, 4.5)

In [None]:
(ggplot(data=gapminder.loc[gapminder['year']==2002], mapping=aes(x='gdpPercap', y='lifeExp',
                                                                 color='continent',size='pop')) +
        geom_point(alpha = 0.5)+
 scale_x_log10()+
 theme_bw()
)

In [None]:
# We can add a layer that widens the range of dot sizes, which makes the differences easier to see

(ggplot(data=gapminder.loc[gapminder['year']==2002], mapping=aes(x='gdpPercap', y='lifeExp',
                                                                 color='continent',size='pop')) +
        geom_point(alpha = 0.5)+
 scale_x_log10()+
 scale_size(range = [0.1, 10])+
 theme_bw()
)

Nice!, we are almost there... but before we get there let's do some exploration.

Maybe we can see how life expectancy in each country evolved over time.

In [None]:
(ggplot(gapminder, aes(x='year', y='lifeExp', group='country', color='continent')) +
        geom_line(alpha = 0.5)+
 theme_bw()
)

This shows me some trends. Then let's go back to our year 2002 and see how the life expectancy looks per continent

In [None]:
(ggplot(gapminder.loc[gapminder['year']==2002], aes(x='continent', y='lifeExp', fill='continent')) +
        geom_boxplot()+
 theme_bw()
)
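
The line inside each box marks the group median. As a sketch, the same summary can be computed as numbers with `groupby`. The rows below are invented placeholders that only mimic the column names of `gapminder`.

```python
import pandas as pd

# Placeholder rows; invented values, not real gapminder data.
toy = pd.DataFrame({
    'continent': ['Africa', 'Africa', 'Africa', 'Europe', 'Europe', 'Europe'],
    'year': [2002] * 6,
    'lifeExp': [50.0, 55.0, 60.0, 75.0, 78.0, 81.0],
})

# Median life expectancy per continent for the chosen year.
medians = (toy.loc[toy['year'] == 2002]
              .groupby('continent')['lifeExp']
              .median())
print(medians)
```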

In [None]:
(ggplot(gapminder.loc[gapminder['year']==2002], aes(x='lifeExp')) +
        geom_histogram(binwidth = 3, fill='darkorange')+
 theme_bw()
)

Let's add a density plot on top!

In [None]:
(ggplot(gapminder.loc[gapminder['year']==2002], aes(x='lifeExp')) +
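        # after_stat('density') rescales counts to densities (plotnine >= 0.8; older versions used 'stat(density)')
        geom_histogram(aes(y=after_stat('density')), binwidth = 3, fill='darkorange')+
        geom_density(color='steelblue')+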
 theme_bw()
)
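
Mapping `y` to the density statistic rescales the bars so that their total area is 1, which puts the histogram on the same vertical scale as `geom_density`. A sketch of that rescaling with `numpy.histogram`, using invented placeholder values and the same bin width of 3:

```python
import numpy as np

# Placeholder life expectancies (invented values).
life_exp = np.array([45.0, 52.0, 58.0, 63.0, 70.0, 72.0, 75.0, 78.0, 80.0, 81.0])
bins = np.arange(45, 85, 3)                  # bin edges 45, 48, ..., 84

counts, edges = np.histogram(life_exp, bins=bins)
density, _ = np.histogram(life_exp, bins=bins, density=True)

# Bar height times bar width, summed over all bars, gives the total area.
area = np.sum(density * np.diff(edges))
print(counts.sum(), area)
```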

But I want to split this by continent!!! (without density)

In [None]:
(ggplot(gapminder.loc[gapminder['year']==2002], aes(x='lifeExp', fill='continent')) +
        geom_histogram(binwidth = 3)+
    
 facet_wrap('continent')+

 theme_bw()
)

Do you remember our prettiest plot so far?

In [None]:
(ggplot(data=gapminder.loc[gapminder['year']==2002], mapping=aes(x='gdpPercap', y='lifeExp',
                                                                 color='continent',size='pop')) +
        geom_point(alpha = 0.5)+
 scale_x_log10()+
 scale_size(range = [0.1, 10])+
 theme_bw()
)

How do I save it?

In [None]:
p=(ggplot(data=gapminder.loc[gapminder['year']==2002], mapping=aes(x='gdpPercap', y='lifeExp',
                                                                 color='continent',size='pop')) +
        geom_point(alpha = 0.5)+
 scale_x_log10()+
 scale_size(range = [0.1, 10])+
 theme_bw()
)

In [None]:
p

Great, let's save this:

In [None]:
p.save("gapminder2002.png")


Can we make a function that takes the year and gives us the plot?

Let's test our function!

Next time let's add labels and then make an animation!