# Section 3. Data Visualization

#### Instructor: Pierre Biscaye

The content of this notebook draws on material from UC Berkeley's D-Lab Python Fundamentals [course](https://github.com/dlab-berkeley/Python-Fundamentals).
    
### Learning Objectives 
    
* Review loops for repeated computations.
* Understand how to replace loops in Pandas with a technique called "vectorization".
* Apply several Pandas methods to summarize and manipulate data.
* Distinguish Pandas methods for `DataFrame` and `Series` objects.
* Create simple visualizations using Pandas.
* Visualize data using the Matplotlib package.

### Sections
1. Iteration and Vectorization with data frames
2. Methods to visualize `DataFrame` and `Series` objects
3. Graphics using Matplotlib

# 1. Iteration and vectorization with data frames

The strength of using computers is their speed. We can leverage this through repeated computation, also called iteration. In Python, we can do this using **loops**. 

Reminder: A **[for loop](https://www.w3schools.com/python/python_for_loops.asp)** executes some statements once *for* each value in an iterable (like a list or a string). It says: "*for* each thing in this group, *do* these operations".

Let's take a look at the syntax of a for loop using the below example:

In [None]:
# We use a variable containing a list with the values to be iterated through
lifeExp_list = [28.801, 30.332, 31.997]

# Initialize the loop
for lifeExp in lifeExp_list:
    rounded = round(lifeExp)
    print(rounded)

# This will only be printed when the loop has ended!
print('The loop has ended.')

## Conditionals and Loops

Recall that we can use `if`-statements to check if a condition is `True` or `False`. Also recall that `True` and `False` are called **Boolean values**.

Conditionals are particularly useful when we're iterating through a list, and want to perform some operation only on specific components of that list that satisfy a certain condition.

In [None]:
numbers = [12, 20.2, 43, 88.88, 97, 100, 105, 110.9167]

for number in numbers:
    if number > 100:
        print(number, 'is greater than 100.')
    else:
        print(number, 'is less than or equal to 100.')

## Aggregating Values With Loops

In the above example, we are operating on each value in `numbers`. However, instead of simply printing the results, we often will want to save them somehow. We can do this with an **accumulator variable**.

A common strategy in programs is to:
1.  Initialize an accumulator variable appropriate to the datatype of the output:
    * `int` : `0`
    * `str` : `''`
    * `list` : `[]`
2.  Update the variable with values from a collection through a `for` loop. Typical update operations are:
    * `int` : `+`
    * `str` : `+`
    * `list` : `.append()`
    
The result of this is a single list, number, or string with a summary value for the entire collection being looped over.
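
The list case is shown in the next cell. As a minimal sketch of the other two cases in the table above, the cell below uses a number accumulator and a string accumulator. It reuses the values from the first loop example; the variable names are only illustrative.

```python
# Values reused from the first loop example in this notebook
life_expectancies = [28.801, 30.332, 31.997]

# Number accumulator: start at 0 and add each value
total = 0
for value in life_expectancies:
    total = total + value

# String accumulator: start with an empty string and concatenate
labels = ''
for value in life_expectancies:
    labels = labels + str(round(value)) + ' '

print('Sum:', round(total, 3))
print('Rounded values as text:', labels)
```

In both loops the accumulator is created before the loop starts, so it is not reset on each pass.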

We can make a new list with all of the rounded numbers:

In [None]:
rounded_numbers = []

for number in numbers: 
    rounded = round(number)
    rounded_numbers.append(rounded)

print('Rounded numbers:', rounded_numbers)

## Iteration: Vectorization

Let's have a look at our Gapminder dataset.

In [None]:
import pandas as pd

df = pd.read_csv('Data/gapminder.csv')
df.head()

Let's say that we want to multiply GDP per capita (`gdpPercap`) by population (`pop`) in order to get the total GDP of a country. We could do so using a `for` loop:

In [None]:
gdpTotal = []
df_length = len(df) # so we can loop over all rows

for each in range(df_length): # going through each row one at a time
    gdp = df['gdpPercap'][each]
    pop = df['pop'][each]
    gdpTotal.append(gdp * pop)
    
gdpTotal[:5]

This loop works, but it is verbose and slow. In Pandas, we prefer [**vectorized**](https://www.geeksforgeeks.org/vectorized-operations-in-numpy) operations, which apply to entire columns at once.

We can just multiply two columns, and Pandas will know we want to multiply each row of both columns!

In [None]:
gdpTotal = df['gdpPercap'] * df['pop']
gdpTotal[:5]

Note that the output of this operation is not a list but a `Series`, a data type specific to Pandas. It is like a list, but it is **labeled**, following the row labels from the original data.

We can add this as a new column to df!

In [None]:
df['gdpTotal'] = df['gdpPercap'] * df['pop']
df.head()

Vectorized operations like these are really handy, and they replace much of the use of `for`-loops in a context of Pandas and data analysis.
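
To check that the loop and the vectorized expression really compute the same thing, here is a small self-contained sketch. The `toy` table and its values are made up for illustration and are not taken from the Gapminder data.

```python
import pandas as pd

# Small made-up table standing in for the Gapminder data
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'gdpPercap': [1000.0, 2500.0, 400.0],
    'pop': [2_000_000, 500_000, 10_000_000],
})

# Loop version: build a list one row at a time
loop_result = []
for i in range(len(toy)):
    loop_result.append(toy['gdpPercap'].iloc[i] * toy['pop'].iloc[i])

# Vectorized version: one expression over whole columns
vector_result = toy['gdpPercap'] * toy['pop']

# Compare the two results element by element
print(loop_result == vector_result.tolist())

# The Series keeps the row labels of the original table
print(vector_result.index.tolist())
```

The loop needs four lines and an accumulator; the vectorized version needs one line and returns a labeled `Series` that can be assigned straight to a new column.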

# 2. Visualizing and exploring `DataFrame` objects

Pandas has many methods: some allow you to work with entire DataFrames, while others operate on individual columns. This section focuses on learning to distinguish between these methods.

Some methods work on entire DataFrames. We can look at the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) to see all the methods and attributes that are available for `DataFrame` objects. Learning how to read documentation is an important skill! 

## Summary Statistics
The [`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) method will give some summary statistics for a `DataFrame`. Run the cell below to see how it works.

In [None]:
df.describe()

In [None]:
df.columns

*Question*: Why are only some of the columns in the `DataFrame` visible in the output?

The result of `describe()` is itself a data frame. You can therefore index and subset it.

In [None]:
df.describe().loc['mean','lifeExp']

## Scatter Plots

Pandas has a convenient [`plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) method that allows you to create different kinds of visualizations. Some of these visualizations can be called on a DataFrame object.

For instance, **scatter plots** visualize the relationship between different variables (columns) in a DataFrame. This is why we run the method on an entire DataFrame.

We can create a scatter `plot()` by specifying the columns to use for the `x` and `y` axes. 

In [None]:
df.plot(kind='scatter', x='lifeExp', y='gdpPercap')

## Sorting Values

Let's say we want to find the countries with the highest `gdpPercap`.

If we want to sort the values in a DataFrame we can use the [`sort_values()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) method on a DataFrame. It takes as an argument the column we want to sort the DataFrame on. 

By default, `sort_values()` sorts in **ascending order**. You can add the argument `ascending=False` when running `sort_values()` to show results in descending order.

In [None]:
df.sort_values('gdpPercap')
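
The cell above sorts in ascending order, so the highest values end up at the bottom. A minimal sketch of the descending version, using a made-up `toy` table rather than the Gapminder data:

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'gdpPercap': [1000.0, 2500.0, 400.0, 1800.0],
})

# Sort from highest to lowest, then keep the first two rows
top_two = toy.sort_values('gdpPercap', ascending=False).head(2)
print(top_two)
```

Note that `sort_values()` returns a new DataFrame and leaves `toy` unchanged unless you assign the result back.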

## Bar Plots

Bar plots show the relationship between a numeric and a categorical variable. Here, we use the `country` (categorical) and `lifeExp` (numeric) columns. Use a bar plot when you want to compare a numeric value, such as a count or a mean, across categories.

In the below cell, we retrieve the 10 data points with the **lowest life expectancy** in our data using the `sort_values()` method, and then plot those data points in a bar plot.

Note that `plot.bar()` is a method of its own, and is an alternative to using `plot()` with the `kind='bar'` argument.

In [None]:
# Sort by life expectancy (lowest first) and keep the first 10 rows
low_lifeExp = df.sort_values('lifeExp', ascending=True)[:10]

# Visualize with bar plot 
low_lifeExp.plot.bar(x='country', y='lifeExp', figsize=(6,4))
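
As a quick sketch of the two spellings of a bar plot in Pandas, the cell below draws the same bars with `plot.bar()` and with `plot(kind='bar')`. The `toy` table is made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values for illustration only
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'lifeExp': [30.0, 35.5, 41.2],
})

# Two ways of asking Pandas for a bar plot
ax1 = toy.plot.bar(x='country', y='lifeExp')
ax2 = toy.plot(kind='bar', x='country', y='lifeExp')

# Each call returns a Matplotlib Axes object holding the drawn bars
print(type(ax1).__name__, type(ax2).__name__)
print(len(ax1.patches) == len(ax2.patches))
```

Which form to use is a matter of taste; `plot.bar()` tends to work better with tab completion in a notebook.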

## Grouping by variable

Many times we will want to do operations by group. The `groupby()` method is useful for this.

Suppose we want to calculate mean GDP per capita by continent. In the below code, we group the data by continent and then take the mean, and save this as a new dataframe.

In [None]:
# Group by 'continent' and calculate the mean of 'gdpPercap'
gdp_mean_df = df.groupby('continent', as_index=False)['gdpPercap'].mean()

# Display the resulting dataframe
gdp_mean_df
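
The `as_index=False` argument is what makes the result above a DataFrame with `continent` as a regular column. The sketch below compares it with the default behavior on a made-up `toy` table (values are illustrative only).

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'continent': ['Asia', 'Asia', 'Europe', 'Europe'],
    'gdpPercap': [1000.0, 3000.0, 20000.0, 30000.0],
})

# Default: the group labels become the index and the result is a Series
mean_series = toy.groupby('continent')['gdpPercap'].mean()

# as_index=False: the group labels stay as a regular column in a DataFrame
mean_frame = toy.groupby('continent', as_index=False)['gdpPercap'].mean()

print(type(mean_series).__name__)
print(type(mean_frame).__name__)
print(mean_frame.columns.tolist())
```

Keeping `continent` as a column is convenient for plotting, because it can then be passed by name as the `x` argument.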

**Practice**: Now visualize these values in a bar plot, after sorting.

In [None]:
# Code here

## Visualizing `Series` objects

Some Pandas methods work on `Series` objects – single columns – instead of entire DataFrames.

For instance, what if we wanted to calculate the median of life expectancy? We'd need to select just one column to operate on. 

Recall that we can select an individual column with bracket notation, using the column name inside the brackets. This is similar to indexing a list, except that we use a label instead of a position.

In [None]:
df['lifeExp']

A single column of a DataFrame is a `Series` object. It can be treated like a list or other iterable, and it lets you do calculations over its values.

We can look at the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) to see the methods and attributes that are available for `Series` objects. If we want the median, we can use the `median()` method.

In [None]:
df['lifeExp'].median()

## Histograms

A histogram shows the distribution of a variable using binned values. We can call this using the syntax: `df[column].plot(kind='hist')`. Use a histogram if you want to show distributions of continuous variables.

*Note*: You can try changing the value for the `bins` parameter. What does the `bins` parameter seem to be determining?

In [None]:
df['lifeExp'].plot(kind='hist', title='Histogram of life expectancy', bins=10)

## Key Points

* `for` loops work on lists and other list-like structures, but also on other iterables such as strings.
* We typically use an accumulator variable to store some information we retrieve using a `for` loop.    
* We typically do not use for-loops in Pandas - instead, we use "vectorized" operations.
* Pandas methods work on either `DataFrame` or `Series` objects--make sure you know which!
* Pandas methods yield as output either `DataFrame` or `Series` objects--make sure you know which!
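
One way to "know which" is simply to check with `type()` or `isinstance()`. The sketch below uses a made-up `toy` table to show how the selection syntax and the method called each affect the kind of object you get back.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'lifeExp': [30.0, 40.0, 50.0],
    'year': [1952, 1957, 1962],
})

one_column = toy['lifeExp']     # single brackets: a Series
sub_frame = toy[['lifeExp']]    # double brackets: a one-column DataFrame

print(type(one_column).__name__, type(sub_frame).__name__)

# The same method name can return different kinds of output
series_median = one_column.median()   # a single number
frame_median = sub_frame.median()     # a Series with one value per column

print(series_median)
print(type(frame_median).__name__)
```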


# 3. Graphics using Matplotlib

It's common to use `matplotlib` for graphics, similar to `ggplot` in R. In `matplotlib`, a plot consists of a figure and one or more axes. The axes contain important information about each plot, such as its axis labels, or title.

It is standard to import `matplotlib.pyplot` as `plt`.

In [None]:
import matplotlib.pyplot as plt

There is a 'magic function' in IPython that allows you to make sure your plots appear inside your notebook instead of appearing as pop-up windows.

Line magics are prefixed with the `%` character. The setting made by `%matplotlib inline` stays in effect for the rest of the session.

Running `%matplotlib inline` sets the output of plotting commands to be displayed inline in frontends like the Jupyter notebook, directly below the code cell that produced it. The resulting plots will then also be stored in the notebook document. This may already be the default for you, but it is useful to run it anyway just in case.

In [None]:
%matplotlib inline

Let's explore the `plt.plot()` function. It takes several arguments (see `plt.plot?` for more information), and plots a line between points.

The most basic structure is to pass a single x coordinate and y coordinate. This plots a **point**.

In [None]:
plt.plot(3, 4,marker='o')

In [None]:
# Adding a semicolon suppresses the text output describing the plot object
plt.plot(3, 4,marker='o');

We can also specify several **formatting** arguments.

In [None]:
plt.plot?

In [None]:
plt.plot(3, 4, marker='^',color='green',markersize=12)

We can also plot a **line** by specifying lists of coordinates.

In [None]:
plt.plot((3, 6), (4, 9))

In [None]:
# Longer line
plt.plot((3, 6, 4), (4, 9, 7))

In [None]:
# With formatting
plt.plot((3, 6), (4, 9), color='green', marker='o', linestyle='dashed',
         linewidth=2, markersize=12)

We can also **plot multiple things** together.

In [None]:
plt.plot(3, 4, '.')
plt.plot(6, 9, '^')
plt.plot(4, 7, 'X', color='red', markersize=12)
plt.plot((3, 6), (4, 9))
plt.show()

We can add **labels and titles** to the plot using `plt.xlabel()`, `plt.ylabel()`, and `plt.title()`. See [this resource](https://www.w3schools.com/python/matplotlib_labels.asp) for more information!

In [None]:
plt.plot(3, 4, '.')
plt.plot(6, 9, '^')
plt.plot(4, 7, 'X', color='red', markersize=12)
plt.plot((3, 6), (4, 9))
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Path from A to B, showing point C')
plt.show()

We can also **save and output** our figures using `plt.savefig()`. The file format is inferred from the extension of the filename.

By default, the file is saved to your current working directory unless you include a path in the filename.

In [None]:
plt.plot(3, 4, '.')
plt.plot(6, 9, '^')
plt.plot(4, 7, 'X', color='red', markersize=12)
plt.plot((3, 6), (4, 9))
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Path from A to B, showing point C')
plt.savefig('practice_fig.jpg')
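
To save somewhere other than the working directory, include a path in the filename. The sketch below assumes a hypothetical `figures` folder (created if it does not exist) and also sets two optional `savefig()` arguments, `dpi` and `bbox_inches`.

```python
from pathlib import Path
import matplotlib.pyplot as plt

# 'figures' is a hypothetical folder name used for illustration
out_dir = Path('figures')
out_dir.mkdir(exist_ok=True)
out_path = out_dir / 'practice_fig.png'

plt.plot((3, 6), (4, 9))
plt.title('Path from A to B')

# dpi sets the resolution; bbox_inches='tight' trims extra whitespace
plt.savefig(out_path, dpi=150, bbox_inches='tight')
plt.close()  # close the figure so later cells start from a blank plot

print(out_path.exists())
```

Call `plt.savefig()` before `plt.show()` in the same cell; in many setups the figure is cleared once it has been shown, and saving afterwards may produce an empty image.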

## Plotting existing data

Much of the time we will have existing x and y variables we want to plot.

In [None]:
x=[2,8,33,7,-1]
y=[-3,8,3,4,3]
plt.plot(x, y, '-o') # '-' draws a line linking the points, 'o' marks each point
plt.show() # renders the figure and suppresses the text output about the plot object

In [None]:
# Create two random variables as an example
import numpy as np

x = np.random.normal(size=100)
y = np.random.normal(size=100)
plt.plot(x,y,'o')

## Using subplots

Plots can also be organized into a figure with one or more subplots, each with its own axes. `plt.subplots()` returns both the figure and the axes objects; its arguments let you specify things like the figure size and arrange graphs side by side.

In [None]:
fig, ax = plt.subplots(ncols=1, figsize=(8,8))
ax.plot(x,y,'o')
plt.show()
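
With more than one row and more than one column, `plt.subplots()` returns the axes as a two-dimensional NumPy array, indexed as `ax[row, column]`. A minimal sketch with made-up random data (the names `rng`, `x_demo`, and `y_demo` are only illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up random data for illustration only
rng = np.random.default_rng(0)
x_demo = rng.normal(size=50)
y_demo = rng.normal(size=50)

# A 2-by-2 grid of subplots; ax is a 2-D array of Axes objects
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(8, 8))
print(ax.shape)

# Each subplot is addressed by its row and column
ax[0, 0].plot(x_demo, y_demo, 'o')
ax[0, 0].set_title('Top left')
ax[1, 1].plot(x_demo, y_demo, '^')
ax[1, 1].set_title('Bottom right')

# Adjust spacing so titles and tick labels do not overlap
fig.tight_layout()
```

Subplots that are never drawn on, such as `ax[0, 1]` here, simply stay empty.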

## Plotting data from data frames

There are two ways to do this.
1. You can pass the columns directly, e.g., `df['column']`.
2. You can pass the column names as strings and add `data=df`, so that Matplotlib looks the names up in the DataFrame.

Let's practice with the gapminder dataset and the `plt.scatter` function, which draws a scatterplot rather than a connected line.

In [None]:
df.columns

In [None]:
plt.plot(df['year'], df['lifeExp'],'o')

In [None]:
# Equivalent plot, using column names together with data=df
plt.scatter('year', 'lifeExp', data=df)

Let's plot life expectancy and GDP per capita by year from the gapminder dataset on side by side graphs.

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(12,6))
ax[0].scatter('year', 'lifeExp', data=df)
ax[1].scatter('year', 'gdpPercap', data=df)
ax[0].set_title('Life expectancy')
ax[1].set_title('GDP per capita')
for axis in ax:  # apply the same x-axis label to both subplots
    axis.set_xlabel('Year')
plt.show()

**Practice:** Create a figure with two subplots on separate rows. In the first, plot lifeExp against gdpPercap before 1980. In the second, plot lifeExp against gdpPercap after 1980. Add titles to each subfigure and to the axes. Save the plot to your working directory.

In [None]:
# Code here