# Section 3. Data Wrangling and Visualization Practice

#### Instructor: Pierre Biscaye

The content of this notebook draws on material from UC Berkeley's D-Lab Python Fundamentals [course](https://github.com/dlab-berkeley/Python-Fundamentals).

## Challenge 1: Putting Methods in Order

In the following code we want to find the three most frequently occurring continents in our data. First, load the gapminder data. Don't forget to import pandas first!

In [None]:
# Code here

Then, put the following code fragments in the right order on a single line of code to get this information!
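
As a hint for ordering the fragments, it may help to see what `.value_counts()` returns on its own. The sketch below uses a made-up Series (`pets` is a hypothetical name, not part of gapminder). The result is sorted from most to least frequent, and any method chained after it runs on that sorted result.

```python
import pandas as pd

# Hypothetical toy data, not the gapminder file.
pets = pd.Series(["cat", "dog", "cat", "bird", "cat", "dog"])

# value_counts() tallies each distinct value, most frequent first.
counts = pets.value_counts()
print(counts)
```

Method chains read left to right: each method acts on whatever the previous step returned.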

In [None]:
.head(3)
.value_counts()
df['continent']

## Challenge 2: Subsetting Data Frames

Besides `==` we can use [other operators](https://www.w3schools.com/python/gloss_python_comparison_operators.asp) to compare values. For instance:
- `<` less than
- `>` greater than

Fill in the code below to subset the gapminder data frame to include only observations (country-year rows) with a life expectancy (`lifeExp`) less than 50. Then, use the `.shape` attribute to determine how many observations meet this criterion.
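
The same pattern on a small made-up data frame may help before you fill in the blanks. This is only a sketch: `toy`, `city`, and `temp` are hypothetical names, and it uses `>` rather than `<`.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
toy = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "temp": [12.5, 30.1, 25.0, 8.3],
})

# The comparison yields a Boolean Series (the mask), one value per row.
mask = toy["temp"] > 20

# Indexing with the mask keeps only the rows where it is True.
warm = toy[mask]

# .shape is an attribute (no parentheses): (number of rows, number of columns).
print(warm.shape)  # (2, 2)
```

The first element of `.shape` is the number of rows that passed the filter.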

In [None]:
# Code here
df[df[...] < ...]

## Challenge 3: Subsetting and Calculating the Mean

Let's make use of subsetting to do some calculation! Calculate the **mean life expectancy** for a continent of your choice in the gapminder dataset. 

This means you will have to:
1. Subset the `continent` column using a Boolean mask.
2. Take the `lifeExp` column from that subset.
3. Apply a Pandas method to get the mean from that column.

You might not know how to get the mean of a column – yet! If that's the case, **use your search engine**.

1. Enter the name of the computer language or package, and your question (for instance: "python Pandas calculate mean").
2. Read and compare the results you find.
3. Try 'em out!
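
The three steps listed earlier can be sketched on a made-up data frame as follows. The names `toy`, `team`, and `score` are hypothetical, and the example deliberately uses the median so that finding the method for the mean is still up to you.

```python
import pandas as pd

# Hypothetical toy data, not the gapminder file.
toy = pd.DataFrame({
    "team": ["red", "blue", "red", "blue", "red"],
    "score": [10, 7, 14, 9, 12],
})

# Step 1: keep only the rows for one group, using a Boolean mask.
red_rows = toy[toy["team"] == "red"]

# Step 2: take a single column from that subset.
red_scores = red_rows["score"]

# Step 3: apply a summary method to the column (median here, not mean).
red_median = red_scores.median()
print(red_median)  # 12.0
```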

In [None]:
# Code here

## Challenge 4: Get Vectorized

Say the `year` column contains wrong information and we need to add one year to each value. Use a vectorized operation to get this done.
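
As a reminder of what "vectorized" means, here is a minimal sketch on a made-up column (`toy` and `height_cm` are hypothetical names). A single arithmetic expression is applied to every element at once, with no explicit loop.

```python
import pandas as pd

# Hypothetical toy data, not the gapminder file.
toy = pd.DataFrame({"height_cm": [150, 165, 180]})

# The division is applied element by element to the whole column.
toy["height_m"] = toy["height_cm"] / 100
print(toy)
```

Assigning the result back to a column name, new or existing, stores it in the data frame.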

In [None]:
# code here

## Challenge 5: Bar plots of maximum life expectancy

Suppose we want to plot each country's maximum life expectancy across all years, with countries ordered from lowest to highest.

You should approach this in 3 steps.
1. Calculate max life expectancy by country using `groupby()`, and save this as a new data frame.
2. Sort the values from lowest to highest.
3. Plot this in a bar plot.
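
The three steps above follow a group, sort, plot pattern. The sketch below shows that pattern on made-up sales data (`toy`, `store`, and `sales` are hypothetical names) and uses the minimum rather than the maximum. Note that selecting one column after `groupby()` gives a Series rather than a data frame, which is enough for plotting.

```python
import pandas as pd

# Hypothetical toy data, not the gapminder file.
toy = pd.DataFrame({
    "store": ["north", "north", "south", "south", "east", "east"],
    "sales": [5, 9, 12, 7, 3, 4],
})

# 1. One summary value per group (the minimum, in this sketch).
min_sales = toy.groupby("store")["sales"].min()

# 2. Sort the summary values from lowest to highest.
min_sales = min_sales.sort_values()

# 3. Draw a bar plot; pandas uses matplotlib behind the scenes.
ax = min_sales.plot(kind="bar")
ax.set_ylabel("Minimum sales")
```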

In [None]:
# code here

## Challenge 6: Loops and Plots

Let's say you have a list of countries whose life expectancy you want to compare in a single line plot. We will create a function for this.

We have set up the list and function for you. The function uses the `matplotlib` package for plotting.

Your goal is to:
1. Add three country names in the DataFrame to `country_list`.
2. Add two parameters to the function: one for a DataFrame, and one for the list of countries.
3. Within the function block, loop over the list of countries.
4. Within the for-loop, use the loop variable you named in step 3 to select the rows for one country at a time.
5. In the `label=` parameter of `plt.plot()`, fill in the loop variable name as well.

Run the cell when you're done: if you've succeeded, you should see a single line plot with life expectancy for all of the countries in `country_list`.
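
If the structure is unfamiliar, the following sketch shows the same loop-and-plot idea on made-up temperature data. Everything in it (`toy`, `cities`, `plot_temperature`, and the column names) is hypothetical, and it draws on an explicit `Axes` object rather than calling `plt.plot()` directly, which is one of several reasonable choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, not the gapminder file.
toy = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B"],
    "month": [1, 2, 3, 1, 2, 3],
    "temp": [2.0, 4.5, 9.0, 15.0, 16.5, 18.0],
})
cities = ["A", "B"]


def plot_temperature(data, names):
    """Draw one line per name on shared axes and return the Axes."""
    fig, ax = plt.subplots()
    for name in names:
        # Keep only the rows for the current name.
        subset = data[data["city"] == name]
        # Each call adds one labelled line to the same axes.
        ax.plot(subset["month"], subset["temp"], label=name)
    ax.legend()
    return ax


ax = plot_temperature(toy, cities)
```

Because every line is drawn on the same axes before the legend is added, all groups end up in one figure.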

In [None]:
import matplotlib.pyplot as plt

# YOUR CODE HERE

country_list = [..., ..., ...]

# Edit this code.
# Note: If you did not load gapminder as 'df', you will have to load it again with that name or change the code below

def plot_life_expectancy(..., ...):
    for ... in ...:
        country_data = df[df['country'] == ...]
        plt.plot(country_data['year'], country_data['lifeExp'], label=...)
    plt.legend()
    plt.show()

plot_life_expectancy(df, country_list)

## Challenge 7: Subplots

Let's create a grid of 3 subplots arranged vertically, where we plot the mean of each variable in the gapminder dataset (`pop`, `lifeExp`, and `gdpPercap`) by year as a line plot.

Here are the steps you will take:
1. Create a data frame with the mean values for each variable across countries by year.
2. Create a figure that stacks these variables vertically using matplotlib. Use `fig, ax = plt.subplots(nrows=3, figsize=(#,#))` to achieve this, specifying your own plot size.
3. Draw each subplot using `ax[#].plot()`, passing `-o` as the format string. This tells it to plot dots connected by lines.
4. Add x-axis and y-axis labels for each subplot using `ax[#].set_xlabel()` and `ax[#].set_ylabel()`.

Then, save the resulting figure to your working directory.
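
A smaller version of this layout, with two panels and made-up weather data, might look like the sketch below. The names `toy`, `day`, `temp`, and `rain`, the figure size, and the output file name are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, not the gapminder file.
toy = pd.DataFrame({
    "day": [1, 2, 3, 4],
    "temp": [10.0, 12.0, 11.0, 14.0],
    "rain": [0.0, 3.2, 1.1, 0.0],
})

# nrows=2 stacks two panels vertically; ax is an array with one Axes per panel.
fig, ax = plt.subplots(nrows=2, figsize=(6, 6))

# "-o" draws a solid line with a circle marker at each point.
ax[0].plot(toy["day"], toy["temp"], "-o")
ax[0].set_xlabel("Day")
ax[0].set_ylabel("Temperature")

ax[1].plot(toy["day"], toy["rain"], "-o")
ax[1].set_xlabel("Day")
ax[1].set_ylabel("Rainfall")

# Reduce overlap between panel labels, then write the figure to disk.
fig.tight_layout()
fig.savefig("toy_subplots.png")
```

A relative file name such as `toy_subplots.png` is written to the current working directory.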

In [None]:
# code here

## Challenge 8: Summary stats

Produce a table of summary statistics for the numerical variables in the gapminder dataset. Include the 5th and 95th percentiles. Pivot (transpose) the table so that each row holds the statistics for a given variable. Save this table as a CSV in your working folder.
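
One possible approach, sketched on made-up data, is to combine `describe()`, a transpose, and `to_csv()`. The names `toy`, `height`, and `weight` and the output file name are hypothetical, and the example asks for the 10th and 90th percentiles rather than the ones required here.

```python
import pandas as pd

# Hypothetical toy data, not the gapminder file.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "height": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight": [50.0, 60.0, 70.0, 80.0, 90.0],
})

# describe() summarizes numeric columns by default; percentiles are
# given as fractions between 0 and 1.
summary = toy.describe(percentiles=[0.1, 0.9])

# .T transposes the table so that each row is one variable.
summary_by_variable = summary.T

# Write the table, including its row labels, to the working directory.
summary_by_variable.to_csv("toy_summary.csv")
print(summary_by_variable)
```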

In [None]:
# code here

## Challenge 9: Relationships between variables

Using the gapminder dataset, plot the relationship between `gdpPercap` and `lifeExp`. Add a line showing the estimated linear relationship between the two variables.

Make sure you label your axes, add a legend for the regression line, and title the figure. Save it as a jpg in your working folder.
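
There are several ways to estimate and draw the line. The sketch below uses `np.polyfit` with `deg=1` on made-up data, which is an assumption about method rather than the intended solution. The names `toy`, `hours`, and `score` are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy data, not the gapminder file.
toy = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0],
    "score": [52.0, 55.0, 61.0, 64.0, 70.0],
})

# deg=1 fits a straight line by least squares; the coefficients are
# returned highest power first, so slope comes before intercept.
slope, intercept = np.polyfit(toy["hours"], toy["score"], deg=1)

fig, ax = plt.subplots()
# Raw observations as points.
ax.scatter(toy["hours"], toy["score"])
# Fitted line evaluated at the observed x values.
ax.plot(toy["hours"], slope * toy["hours"] + intercept,
        color="red", label="Linear fit")
ax.set_xlabel("Hours")
ax.set_ylabel("Score")
ax.set_title("Score vs. hours (toy data)")
ax.legend()
```

Only the fitted line is given a `label`, so it is the only entry shown in the legend.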

In [None]:
# code here