# Python data manipulation and plotting workshop

Welcome to our introductory python for bioinformatics workshop. Today our goal is to help everybody get familiar with how python handles data and how to make plots. We will first have a brief introduction to the basic data types in python and the basic data structures of lists, dictionaries, and arrays. Then we'll spend the rest of the workshop using pandas dataframes to transform data and make plots using matplotlib and seaborn.

If you want to learn more about these topics, I recommend the following resources:

- [The Python Tutorial](https://docs.python.org/3/tutorial/index.html)
- [Numpy beginner's guide](https://numpy.org/doc/stable/user/absolute_beginners.html)
- [Pandas User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html)
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)


## Python terminology

Python is an object oriented language, which means everything in python is an **object**, even numbers and strings. Objects have **attributes** and **methods** (functions) and are created from **classes**. When we use the equals sign `=`, we assign objects to **variables**. The variable name then becomes a way for us to refer to that object and we can access that object's attributes and methods. 

Object attributes and methods are accessed by using `.` after the variable name followed by the method or attribute name. **Classes** are like blueprints for an object, so when we say something is from X class, you can automatically know it should have certain attributes and methods. For example, any object of the `str` class should have the `upper()` method that converts every character to uppercase. 

In this workshop, we may use both "object" and "variable" in similar ways, but they mean different things. If there is any confusion, please don't hesitate to ask for clarification. Here is a list of terminology for you to refer to. 

| Term | Definition |
| --- | --- |
| Object | The thing itself (an instance of a class) |
| Variable | The name we give the object (a pointer to the object) |
| Class | The blueprint for the object (defines the attributes and methods of object) |
| Method | A function that belongs to an object |
| Attribute | A property of an object |
| Function | A piece of code that takes an input and gives an output/does something |

Here's a real world analogy.

My water bottle is an object. I named my water bottle Sally. Sally is a variable because it is a name that points to my water bottle, the object. Sally's class is WaterBottle, which means it has attributes like "volume", "color", etc. This particular WaterBottle object has methods like "Open" and "Pour" which I have to call every time I want to drink out of it. 

In computer science, unlike in real life, you can't have the same variable refer to two different objects at the same time, so I can't also name my computer Sally (doing so would make Sally stop pointing to the water bottle). But I can create another name and also have it point to the water bottle. 
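
The analogy can be sketched in code. This is a minimal illustration only: the `WaterBottle` class, its attribute names, and its methods are made up for this example and do not come from any library.

```python
# Hypothetical class for illustration only; not part of any library.
class WaterBottle:
    def __init__(self, volume_ml, color):
        # Attributes: properties stored on the object
        self.volume_ml = volume_ml
        self.color = color
        self.is_open = False

    def open(self):
        # Method: a function that belongs to the object
        self.is_open = True

    def pour(self, amount_ml):
        # Pouring only works on an open bottle; never pour more than is left
        if not self.is_open:
            raise ValueError("Open the bottle first")
        poured = min(amount_ml, self.volume_ml)
        self.volume_ml -= poured
        return poured


sally = WaterBottle(volume_ml=500, color="blue")  # `sally` points to a new object
bottle = sally                                    # a second name for the same object

sally.open()
sally.pour(200)

print(sally.color)       # attribute access with `.`
print(bottle.volume_ml)  # 300: both names see the change, there is only one object
print(bottle is sally)   # True
print(type(sally))       # the class the object was created from
```

Note that `bottle = sally` does not create a second water bottle. It only adds a second name for the one that already exists.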


## Data types

When we collect data, we tend to think of them as categorical, numerical, ordinal, etc, and the specific type of data influences how we deal with them in our downstream analysis. Similarly, in computer science and in python, we need to distinguish between different data types because the functions we use to manipulate them will depend on the data we have. The basic data types are numerical, string, and boolean. In python, you can use the `type()` function to check the data type of a variable. 

Numerical data are data represented by a number. Numerical data can further be divided into integers `int` and floats `float`. There's also support for complex numbers, but we won't get into that for this workshop. Integers are whole numbers, while floats are numbers with decimal points. When you perform arithmetic operations on integers, they might return floats, but the reverse is not true. 

Below is a table of basic arithmetic operations you can perform on numerical data. You can combine an operation with an equals sign `=` to perform the operation and set the variable equal to the result in one line (eg `x += 1` adds 1 to whatever number `x` was).

**Operators for numerical data**

| Operator | Description |
| --- | --- |
| `+` | Addition |
| `-` | Subtraction |
| `*` | Multiplication |
| `/` | Division |
| `%` | Modulo (returns remainder) |
| `**` | Exponentiation |
| `//` | Floor division |
| `abs()` | Absolute value (a built-in function rather than an operator) |
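
As a quick illustration of the table above, here is a small sketch using two arbitrary integers. Pay attention to which operations keep the result an `int` and which return a `float`.

```python
x = 7
y = 2

print(x / y)    # 3.5 -> true division always returns a float
print(x // y)   # 3   -> floor division of two ints stays an int
print(x % y)    # 1   -> remainder
print(x ** y)   # 49  -> exponentiation
print(type(x / y), type(x // y))

x += 1          # same as x = x + 1
print(x)        # 8
print(abs(-2.5))  # 2.5
```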

Boolean (`bool`) is a data type that can either be `True` or `False`. Booleans will come up when we start using conditional statements to compare or filter data. It's technically a subclass of `int`, where 0 is False and 1 is True. Below are some logical operators you can use to create boolean expressions.

**Operators for booleans**

| Operator | Description |
| --- | --- |
| `==` | Equal to |
| `!=` | Not equal to |
| `>` | Greater than |
| `<` | Less than |
| `>=` | Greater than or equal to |
| `<=` | Less than or equal to |
| `and` | Logical and |
| `or` | Logical or |
| `not` | Logical not |
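
A short sketch of how these operators combine into boolean expressions. The variable names and values are hypothetical and chosen only for illustration.

```python
n_reads = 120   # hypothetical values, for illustration
quality = 31.5

passes = (n_reads >= 100) and (quality > 30)
print(passes)                          # True
print(not passes)                      # False
print(n_reads == 100 or quality > 30)  # True: only one side needs to be True

# bool is a subclass of int, so True behaves like 1 in arithmetic
print(isinstance(True, int))           # True
print(True + True)                     # 2
```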

Strings are sequences of characters, and are represented by `str`. Strings are enclosed in `''` or `""`. If a string contains a quote character, you can use the escape character `\` to escape the quote. You can use the `+` operator to concatenate strings or the `*` operator to repeat a string. But you can't use `+` to combine a string with numerical data. You can use the `str()`, `int()`, and `float()` functions to convert between data types. You can do some special operations on strings, such as indexing, slicing and formatting. Indexing and slicing extract substrings from a string, while formatting allows you to insert variables into a string. Strings are **immutable**, meaning you can't edit a string, but you can create a new string from an existing one.

**Operators for strings**

| Operator | Description |
| --- | --- |
| `+` | Concatenation |
| `*` | Repetition |
| `[]` | Indexing |
| `[:]` | Slicing |
| `.format()` or `f""` | Formatting |
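
The points about escaping, immutability, and type conversion can be sketched as follows. The values are made up for illustration.

```python
gene = "brca1"                 # hypothetical example value

quote = 'It\'s a "gene" name'  # \' escapes the quote inside a single-quoted string
print(quote)

# Strings are immutable: item assignment raises a TypeError
try:
    gene[0] = "B"
except TypeError as err:
    print("Cannot edit in place:", err)

# Build a new string instead; the original is unchanged
gene_upper = gene.upper()
print(gene, gene_upper)        # brca1 BRCA1

# Convert between types before combining text and numbers
copies = 2
print(gene_upper + " has " + str(copies) + " copies")
print(int("42") + 1)           # 43
print(float("3.5") * 2)        # 7.0
```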

Try to predict the output of the following lines of code to check your understanding of the data types.

In [None]:
a = "5"
b = "10"
c = 15
d = "0123456789"
e = True
print(type(a))
print(type(c))
print(type(e))
print(type(int(d)))


In [None]:
a = "5"
b = "10"
c = 15
print(a + b)
print(5 + 10)
print(5 + 10 != c)
print(a + b == c or 15 == c)


In [None]:
a = "5"
b = "10"
c = 15
print("{} plus {} equals {}".format(a, b, c))
print(f"{a} plus {b} does not equal {a + b}")


In [None]:
d = "0123456789"
print(d[9])
print(d[:5])
print(d[::2])


# Review of basic data structures (lists, dictionaries, and numpy arrays)

In this section, we'll do a quick review of the basic data structures in python. Data structures are a way to store/organize multiple pieces of data or objects. Each data structure has different use cases and ways they represent data. 


## Lists

Lists are the most basic of data structures. They are created with `[]` and can contain any type of data. Each entry is separated by a comma. Lists are ordered and can be indexed, sliced, and concatenated just like strings. When lists are all numerical, they can also support mathematical operations like `max()` and `min()`. Lists can also be nested using another `[]` within the list. 

Lists are our first introduction to a **mutable** data structure, meaning you can change a list without having to create a new one. List methods may modify your data **in place**, **return** a new object, or both. A method that only modifies the object in place returns `None`, so you don't assign its result to a variable; a method that returns a new object needs its result assigned if you want to keep it. For example, `list.append(x)` updates the list in place and returns `None`, while `list.pop()` both removes the last element in place and returns it. 

Below are some useful operations and methods for lists. For a full list of methods, you can use `dir()` on the list or consult the [docs](https://docs.python.org/3/library/stdtypes.html#list) page. 

**Operations and methods for lists**

| Operation/Method | Description |
| --- | --- |
| `+` | Concatenation |
| `*` | Repetition |
| `[]`, `[:]` | Indexing, slicing |
| `.append(x)` | Add `x` to the end of the list |
| `.extend([x, y, z])` | Add `[x, y, z]` to the end of the list |
| `.insert(i, x)` | Add `x` at index `i` of the list |
| `.pop(i)` | Remove and return the element at index `i`, defaults to last element if none given |
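
The difference between modifying in place and returning a value can be sketched like this. The sample names are hypothetical.

```python
samples = ["s1", "s2", "s3"]     # hypothetical sample names

result = samples.append("s4")    # modifies the list in place
print(result)                    # None
print(samples)                   # ['s1', 's2', 's3', 's4']

last = samples.pop()             # removes the last element in place AND returns it
print(last)                      # s4
print(samples)                   # ['s1', 's2', 's3']

samples.insert(0, "s0")
samples.extend(["s5", "s6"])
print(samples)                   # ['s0', 's1', 's2', 's3', 's5', 's6']

longer = samples + ["s7"]        # `+` returns a new list; `samples` is unchanged
print(len(samples), len(longer)) # 6 7

print(max([3, 10, 7]), min([3, 10, 7]))  # 10 3
```

A common mistake is writing `samples = samples.append("s4")`, which replaces the list with `None`.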

## Dictionaries

Dictionaries store key:value pairs. Keys are typically strings or numerical identifiers, while the values can be just about anything, including other dictionaries, lists, or individual values. You can create a dictionary with `{}` or with the `dict()` function. The two ways to create a dictionary are shown below:

```python
my_dict = {'a': 1, 'b': 2, 'c': 3}
my_dict = dict([("a", 1), ("b", 2), ("c", 3)])  # dict() takes one iterable of (key, value) pairs
```

Dictionaries remember the order in which items were inserted (since Python 3.7), but they are not indexed by position, so you can't index/slice them like a list. Instead, you retrieve items by their key, e.g. `my_dict["a"]`. Like lists, dictionaries are mutable, so you can add, remove, or update the key:value pairs in place. Other methods return "view objects" that allow you to see the items in the dictionary, but won't allow you to modify the dictionary. Here are some useful methods for dictionaries:

**Operations and methods for dictionaries**

| Operation/Method | Description |
| --- | --- |
| `[]` | Retrieve value by key |
| `.keys()` | Returns a view object of the keys |
| `.values()` | Returns a view object of the values |
| `.items()` | Returns a view object of the key:value pairs |
| `.update(dict)` | Updates the dictionary with the key:value pairs from another dictionary |
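
Here is a small sketch of these operations. The gene names and lengths are made-up placeholders, not real measurements.

```python
gene_lengths = {"geneA": 1200, "geneB": 5600}  # made-up values, for illustration

print(gene_lengths["geneA"])           # retrieve a value by key

gene_lengths["geneC"] = 1300           # add a new key:value pair in place
gene_lengths.update({"geneA": 1250})   # update an existing key from another dictionary

keys = gene_lengths.keys()             # a view object, not a list
print(list(keys))                      # ['geneA', 'geneB', 'geneC']

del gene_lengths["geneB"]              # remove a pair in place
print(list(keys))                      # ['geneA', 'geneC']: the view reflects the change

for gene, length in gene_lengths.items():
    print(gene, length)
```

Because a view stays linked to its dictionary, wrap it in `list()` if you need a snapshot that will not change later.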

## Viewing vs updating data

Before we move on to the next data structure, let's go over a behavior of python that can be confusing. When we create a new object like a list and assign a variable to that object, the variable is a pointer to the object and not the object itself. You can assign multiple variables to the same object, like giving multiple nicknames to the same person. This is important because when you then use that variable to modify the object, you will change the object itself and therefore all the other nicknames you gave it will refer to the updated object. Run the code block below to see a demonstration. 

In [None]:
a = ["my", "list", "of", "words"]
b = a
b[0] = "your"
print(a)


If you want to create a COPY of the original object when you assign a new variable to it, you can use the `copy()` method. This creates a whole new object with the same values that is independent of the original. 

In [None]:
a = ["my", "list", "of", "words"]
b = a.copy()
b[0] = "your"
print(a)
print(b)


When you index a list, you get back a reference to the element stored at that position, not a copy of it. Slicing (`a[:]`) and `copy()` do create a new list, but only what is known as a "shallow copy": the new list is a new object, while the elements within it still reference the original objects. In the code block below, we have a list that contains another list `x`. When we index the outer list with `a[0]`, `b` becomes another name for the list that `x` references. When we change an element through `b`, we also change the object that `x` references, and the change shows up in `a` as well.

In [None]:
x = ["my", "nested"]
a = [x, "list", "of", "words"]
b = a[0]
print(b)
b[0] = "your"
print(a)
print(b)
print(x)


It's important to understand the concept of viewing vs copying data in python and to be aware of when you are modifying the data when you may not intend to. In general, it is good practice to be explicit about whether you want to make a copy of the data when you assign a new variable. This also applies to dictionaries and array objects as we'll learn in a moment. To save memory, these data structures generally avoid duplicating data: plain assignment never copies, slicing a list makes only a shallow copy, and slicing a numpy array returns a view of the same data. 
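
To make the distinction concrete, here is a minimal sketch comparing a second name, a shallow copy, and a deep copy of the same nested list. It uses `copy.deepcopy()` from the standard library, which is not covered elsewhere in this workshop.

```python
import copy

x = ["my", "nested"]
a = [x, "list", "of", "words"]

same = a                 # another name for the same list
shallow = a.copy()       # new outer list, but the inner list is shared
deep = copy.deepcopy(a)  # new outer list AND a new inner list

print(same is a)         # True
print(shallow is a)      # False
print(shallow[0] is x)   # True
print(deep[0] is x)      # False

x[0] = "your"
print(a[0])              # ['your', 'nested']
print(shallow[0])        # ['your', 'nested']
print(deep[0])           # ['my', 'nested']
```

The `is` operator checks whether two names point to the same object, which makes it a handy way to test whether you have a copy.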

>**Question:** Suppose I imported a large dataset into a dictionary but then only wanted to work with one of the entries. So I assign the entry I want to the variable `entry`. I then delete the original variable `data` to free up memory. Does this work? Why or why not?

```python
data = dict(a = [1, 2, 3], b = [2, 3, 4], c = [3, 4, 5])
entry = data["a"]
del data
```

## An aside on libraries

While python can do a lot of things on its own, many of the more specialized uses for the programming language are supported by additional code libraries. Libraries are a collection of classes, objects, and functions that extend the functionality of python. In this workshop, we'll be using `numpy`, `pandas`, `matplotlib`, and `seaborn`. Those form the basis of data analysis and plotting in python. The "Getting started" instructions will have already walked you through installing the libraries, but to actually use them in your code or jupyter notebook, they need to be imported. You can import an installed library with the `import` keyword, typically at the top of your script/notebook. You can also import specific functions or classes from a library using the `from` keyword. Run the code block below and notice that the libraries are imported with a shorthand name. Whenever you want to use a function or class from the library, you need to prefix it with the shorthand (e.g. `np.array()`) to tell python you're looking for something from that library. 

In [1]:
## Importing libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# Numpy arrays

Numpy arrays are a data structure that contains only one type of data, typically numerical, and can be N-dimensional (any number of dimensions). You can create numpy arrays using the `np.array()` function or by converting other data structures to an array using `np.asarray()`. There are also many other functions that can create an array with pre-filled numbers, such as `np.zeros()` and `np.arange()`. An array's `shape` describes the number of elements along each dimension, also known as an axis. For a 2D array, the first axis runs down the rows and the second runs across the columns. 

Numpy arrays can be navigated using indexing and slicing as well as boolean masks. Mathematical operations on numpy arrays occur element-wise, meaning the operation is applied to each element in the array. There are many useful functions and methods for manipulating arrays.
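
A brief sketch of creating arrays and inspecting them, using small made-up arrays. The integer dtype name printed for `counts` can differ between platforms.

```python
import numpy as np

zeros = np.zeros((2, 3))                   # 2 rows, 3 columns, filled with 0.0
counts = np.array([[0, 1, 2], [3, 4, 5]])  # a 2D array built from nested lists
from_list = np.asarray([1.5, 2.5, 3.5])    # convert an existing list

print(zeros.shape)         # (2, 3)
print(counts.shape)        # (2, 3)
print(counts.dtype)        # an integer dtype
print(from_list.dtype)     # float64

print(counts[1, 2])        # 5: row 1, column 2
print(counts[:, 0])        # [0 3]: every row, first column
print(counts * 10)         # element-wise multiplication
print(counts[counts > 2])  # [3 4 5]: boolean mask
```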

>**Exercise:** In the code block below, we create a numpy array using the function `np.arange()`, which creates evenly spaced values with a given interval (default 1). This creates a 1D array. We then `reshape()` that 1D array to a 10 by 10 2-D array. For your exercise, use slicing to extract the numbers between 30 and 80 (not inclusive of 80)

In [None]:
my_array = np.arange(0,100).reshape(10,10)
print(my_array)
# Your code here


>**Exercise:** Reshape `my_array` to any other shape and print the shape of the new array. Also print the new array itself. What do you notice about how the array is filled in when reshaped?

In [None]:
# Your code here



In the code block below, we have used a conditional statement to filter the array for values between 30 and 80. However, what problem do you notice?

In [None]:
filt = (my_array >= 30) & (my_array < 80)
print(filt)
print(my_array[(my_array >= 30) & (my_array < 80)])


The filter did not preserve the shape of the array and returned just the values divorced from their positions. If you instead want an array of the same shape that shows where the condition was true, you can use numpy's `np.where()` function to show one value if the condition is true and a different value if the condition is false. In the code below, we use `np.nan` as the placeholder value for where the condition is false and ask it to show the original value from `my_array` where the condition is true. Because `np.nan` is a float, the result is an array of floats. 

In [None]:
filt = (my_array >= 30) & (my_array < 80)
np.where(filt, my_array, np.nan)


You can perform mathematical operations on arrays and they'll propagate to each element. In order for the element-wise operation to work, the two objects you're operating with generally need to have the same shape, or one of them has to be a scalar (numpy's broadcasting rules also allow some other compatible shapes). Numpy also has functions that allow you to operate on the entire array, such as `np.sum()`, `np.mean()`, etc. In the code block below, we calculate the mean squared error of two dummy arrays using the formula $\frac{1}{n}\sum_{i=1}^{n}(predicted_i - expected_i)^2$.

In [None]:
predicted = np.array([1, 2, 3, 4, 5])
expected = np.array([2, 4, 1, 5, 5])
mse = (1/len(predicted)) * np.sum(np.square(predicted - expected))
print(mse)


Breaking down the code, you can see what is produced at each step

In [None]:
print(predicted - expected)
print(np.square(predicted - expected))
print(np.sum(np.square(predicted - expected)))
print(1/len(predicted))


>**Exercise:** In the code block below, I have an array of assignment grades in the first array and their weights in the second array. Calculate the weighted average of the grades.

In [None]:
gradeA = np.array([90, 88, 93, 85, 79, 100, 85, 92, 88, 95])
gradeW = np.array([0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.2, 0.4])

# Your code here


>**Exercise:** Now I've combined the grades and weights into a single 2D array. Do the same calculation using numpy array indexing. 

In [None]:
grades = np.array([
    [90, 0.05],
    [88, 0.05],
    [93, 0.05],
    [85, 0.05],
    [79, 0.05],
    [100, 0.05],
    [85, 0.05],
    [92, 0.05],
    [88, 0.2],
    [95, 0.4]
])

# Your code here


Numpy arrays are a powerful data structure that is optimized for rapid mathematical operations. They form the basis of other data structures in python like the tensor in tensorflow and our next topic, the pandas dataframe. If you are working with purely numerical data, numpy arrays are usually the best choice.

# Pandas dataframes



Often when we work with data, we have multiple variables with different data types. While numpy arrays are very flexible for numerical data, they do not work as well with a combination of numerical and categorical data or when you want to have labels for your rows and columns. Pandas is another python library that builds upon numpy and adds the ability to create DataFrames, which are 2D tables with labeled rows and columns. You can think of pandas DataFrames as spreadsheets from excel or dataframes from R. 

Let's start by learning about the pandas `Series` object. A Series is a one-dimensional array of indexed data. It can be created from a list or a numpy array with the `pd.Series()` function. Run the code block below and observe the differences between a `Series` and a numpy array. 

In [None]:
my_series = pd.Series([1, 2, 3, 4, 5])
my_array = np.array([1, 2, 3, 4, 5])
print(my_series)
print(my_array)


Notice that a series object has an index, which is a label for each element in the series. Numpy has an implicit index for its arrays, but Series objects have an explicit index. It's kind of like a dictionary where the keys are the indices and the values are the data. Similar to a dictionary, you can use `.values` to just get the data back. And you can also use `.index` to get the index labels. 

In [None]:
my_series[2]


In [None]:
my_series.values


In [None]:
my_series.index


Indices don't have to be just numbers. They can be whatever values we want. Although duplicate index values are allowed, it is best to avoid them. 

In [None]:
grades_english = pd.Series([90, 88, 91], index = ["Yann", "Xavier", "Zach"])
grades_english


In [None]:
grades_english['Yann']


We can construct dataframes from multiple Series objects. 

In [None]:
grades_math = pd.Series([60, 55, 84], index = ["Yann", "Xavier", "Zach"])
grades = pd.DataFrame({'English': grades_english, 'Math': grades_math})
grades


To add a column to an existing dataframe, you can use the `[]` operator and assign a new series, or use the `pd.concat()` function. 

In [None]:
grades["History"] = [88, 100, 93]
grades


Observe that when I index into the "History" column, it comes back with the dataframe's index attached, even though we assigned a plain list!

In [None]:
grades["History"]


## Indexing and slicing a dataframe

Pandas DataFrames are not just indexed by row, but also by column. You can access columns by name using the `[]` operator. If you want to access a row by its index label, you can use the `.loc[]` indexer. Run the code blocks below and note how the indexing returns Series objects in both cases. 

In [None]:
grades.columns


In [None]:
grades["English"]


In [None]:
grades.loc["Yann"]


There are many ways to select subsets of a dataframe. The rows and columns of a dataframe can be referred to either by their integer position or by their indexed name. Typically, for columns, you'll use the indexed name and can just do `[]` with the name of the column. For rows, if you want to use the integer position, you will use `.iloc[]`. If you want to use the index name, you will use `.loc[]`. 

For reference, here's a handy table on the best ways to index into a dataframe:

|Action|Named index|Integer Position|
|---|---|---|
|Select single column|`df['column_name']`|`df.iloc[:, column_position]`|
|Select multiple columns|`df[['column_name1', 'column_name2']]`|`df.iloc[:, [column_position1, column_position2]]`|
|Select single row|`df.loc['row_name']`|`df.iloc[row_position]`|
|Select multiple rows|`df.loc[['row_name1', 'row_name2']]`|`df.iloc[[row_position1, row_position2]]`|
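
The table can be tried out on a small dataframe. The sketch below builds its own self-contained `df` that mirrors the English and Math grades used earlier.

```python
import pandas as pd

# Small example dataframe, mirroring the grades used above
df = pd.DataFrame(
    {"English": [90, 88, 91], "Math": [60, 55, 84]},
    index=["Yann", "Xavier", "Zach"],
)

print(df["Math"])                # single column by name -> Series
print(df.iloc[:, 1])             # the same column by integer position
print(df[["English", "Math"]])   # multiple columns -> DataFrame
print(df.loc["Zach"])            # single row by index label -> Series
print(df.iloc[2])                # the same row by integer position
print(df.loc[["Yann", "Zach"]])  # multiple rows by label -> DataFrame
print(df.loc["Zach", "Math"])    # a single cell: row label, then column name
```

Passing a list of names (double brackets) always returns a DataFrame, even when the list holds only one name.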

Just like with numpy arrays, you can use boolean expressions to filter dataframes. In the code block below, we demonstrate how filtering works with dataframes. You can use the `[]` operator or the `.query()` method to filter on boolean expressions. 

In [None]:
grades[grades["English"] > 90] # extract rows that meet the condition where English grade > 90


In [None]:
grades.query("English > 90") # Another way to extract rows based on boolean expressions
# Note that the column name can be used directly inside the quotes. 


>**Exercise:** Using a boolean expression, find out which student(s) had a higher English score than their History score.

In [None]:
# Your code here


## Wide vs long data

The way our dataframe is organized can affect how we work with it. Currently, our `grades` dataframe is in what is considered a "Wide" format. Wide format is where levels of the same variable are spread out across the columns. In this case, the variable (implicit) of subject is split across the columns into English, Math, and History. In a "Long" format, subject would be one column and "grade" would be another column. We can convert Wide to Long format by **melting** the dataframe. 

In [None]:
# First we reset the index so the index gets its own column
grades.reset_index(names="Name")


Then we pick which column(s) will uniquely identify each row/observation. In our case, it's the Name of the student. We select which columns we want to melt together. They should all be levels of the same variable. Then, we select the name of the new column that will collect all the "value_vars". Finally, we give a name to the values associated with those variables, aka. "Grade". 

In [None]:
grades_long = grades.reset_index(names = "Name").melt(id_vars = 'Name', value_vars = ["English", "Math", "History"], var_name = 'Subject', value_name = 'Grade')
grades_long


Melting a dataset from Wide to Long will facilitate plotting with seaborn in the future. It also makes it easier to filter and manipulate dataframes.

## Plotting directly from dataframe

One neat feature of pandas is that you can directly plot dataframes using the `.plot()` method. This method is a wrapper around matplotlib. In the code block below, we create a bar plot of the students' grades. Just as we will see with matplotlib, the plot method returns an object we'll call `ax`, which we can use to further customize the plot. 

Ironically, the `plot()` method works better in our case with the wide data format. 

In [None]:
ax = grades.plot(kind = 'bar')
ax.set_ylabel('Grade')


Using `plot()` on a pandas DataFrame is a quick and easy way to visualize data on a high level. For more information on this method, you can consult the visualization section of the pandas documentation [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

## Loading data into a dataframe

One of the most useful features of pandas DataFrames is the ability to easily perform complex data transformations. This makes pandas a powerful tool for cleaning, filtering, and summarizing tabular data. Let's read some data into a DataFrame to demonstrate. This data comes from the Tidy Tuesday project, which posts a neat dataset every week. This data is from January 30, 2024 and is about all the recorded instances of groundhog day predictions. Groundhog Day is a North American tradition in which a groundhog emerges from its burrow on Feb 2. If it sees its shadow, it will go back inside and there will be 6 more weeks of winter. If it does not see its shadow, spring will come early. You can find out more about this dataset at the official TidyTuesday [github repo](https://github.com/rfordatascience/tidytuesday/tree/master/data/2024/2024-01-30).

Below you can see an example of how to read csvs into pandas using the `pd.read_csv()` function. Because the default delimiter is a comma, I didn't have to pass any additional arguments. Unlike numpy's `loadtxt` function, it can handle missing data and data of multiple types easily. For trickier imports, you can look at all the options on the pandas [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). 

In [None]:
groundhogs = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-30/groundhogs.csv')
predictions = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-30/predictions.csv')


The data is split into 2 pieces. `groundhogs` has information on each groundhog, such as its id, name, and location. `predictions` has the record of historical predictions for each groundhog, by id. 

In [None]:
groundhogs.head(5)


In [None]:
# It's always a good idea to check your dtypes!
groundhogs.info()


In [None]:
predictions.head(10)


In [None]:
# It's always a good idea to check your dtypes!
predictions.info()


## Groupby and summarize

One of the simplest things we can do is summarize how many predictions occurred in each year. We can use the `groupby()` method to combine all the data for each unique value of "year" and then the `size()` method on that grouped dataframe to count the size of each group. This returns a `pd.Series` object with the year as the index. 

In [None]:
predictions.groupby("year").size()


>**Exercise:** Using the groundhogs dataframe, display how many groundhogs vs non-groundhogs have participated. *hint* The column of interest is called "is_groundhog".

In [None]:
# your code here


Instead of yearly predictions, I want to instead bin these years into decades. Let's create a new column in the predictions dataframe that is the decade of the year. 

In the first line below, I'm creating a new column in the predictions dataframe called "decade" that is the year divided by 10 and floored `//` times 10 again. So the year 1999 becomes 199 becomes 1990. Notice that creating a new column in a dataframe is as simple as assigning a value to a new key.

In [None]:
predictions['decade'] = predictions['year'] // 10 * 10
predictions.head(10)


Now, to summarize the number of predictions per decade, I use the `groupby()` method again. In this case, I grouped by "decade" and used the method size() to see how many observations/rows exist per decade. 

In [None]:
predictions.groupby("decade").size()


You can "groupby" multiple columns by passing a list of column names. In the code block below, I group by both "decade" and "shadow" to see how many times a shadow was seen each decade. This returns a multi-indexed pandas Series, where each combination of "decade" and "shadow" (for which there is a value) is a unique index.

In [None]:
p = predictions.groupby(["decade", "shadow"]).size()
print(p)


In [None]:
p.index


If we want to turn this Series into a DataFrame where the index values are columns, we can use the `reset_index()` method. This will turn the multi-index into columns and create a new integer index. Why would we want to do this? DataFrames are a bit easier to work with than Series objects.

In [None]:
p.reset_index(name = "count")


Alternatively, we can use the `unstack()` method to turn the multi-index into columns. This will automatically fill in missing values with `NaN`. 

In [None]:
p.unstack()


## Merging and joining

Let's find out if there are differences between the predictions of groundhogs vs non-groundhogs. We will need to combine the groundhogs and the predictions dataframes. Merging is a way to combine two dataframes based on a common column or index. In the code block below, we've created two dataframes that share the column "A". The DataFrames do not share columns "B" and "C". Additionally, `df2` has one more row than `df1`. 

In [None]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3', 'A4'],'C': ['C0','C1', 'C2', 'C3', 'C4']})
print(df1)
print(df2)


When we merge the two dataframes using the `pd.merge()` function, we can specify that the column to merge on is "A" and whether to keep only rows in common (default) or to keep all rows (`how='outer'`). Note that when we do the outer merge, the missing values are filled in with `NaN`. 

In [None]:
print(pd.merge(df1, df2, on = 'A'))
print(pd.merge(df1, df2, on = 'A', how = 'outer'))


>**Exercise:** Merge the groundhogs and predictions dataframes on the 'id' column and save it to a dataframe called "combined". Then, use groupby to summarize the number of shadow predictions made by groundhogs vs non-groundhogs. (You'll need to groupby two columns)

In [None]:
# Your code here



Tip for the next exercise: When you are chaining multiple methods together, you can split the command across multiple lines for readability. Just enclose the entire command in parentheses.

In [None]:
# This plots a comparison of the predictions of groundhogs vs non-groundhogs
(combined
 .groupby(["is_groundhog", "shadow"])
 .size()
 .unstack()
 .plot(kind = "bar", xlabel = "Is it a groundhog?", ylabel = "Number of Predictions")
)


>**Exercise:** Using the combined dataframe, plot the number of predictions made each year by both real and fake groundhogs as a line plot. Color the line by whether the groundhog is a groundhog or not.

In [None]:
# Your code here


We've done some high level data manipulation and plotting with pandas. Now it's time to go back to the basics and gain a foundational understanding of how to make plots in python.

# Plotting basics with matplotlib

Matplotlib is probably the most popular library for plotting figures in python. It is the basis of other plotting libraries, such as seaborn and the pandas dataframe plots. In this section we will demonstrate how to plot with matplotlib to illustrate the underlying principles of plotting. However, because seaborn is essentially a user-friendly wrapper around matplotlib, we'll save the exercises for the seaborn section.

There are two ways to plot using matplotlib: the object-oriented interface and the pyplot interface. The pyplot interface was designed to help folks transitioning from MATLAB to python. The object-oriented interface is the recommended way to plot figures, especially if you're completely new. To begin plotting, we first have to import the library.

Simple plots such as scatter and line plots with one or two variables are easy to create in matplotlib. The steps to plotting are as follows:

1. Create a figure and Axes object using `plt.subplots()`
2. Use the Axes object to plot the data using one of the plot methods such as `.scatter()`, `.bar()`, etc. 
3. Customize the plot using the Axes object's `.set_...` methods
4. Add a legend using `ax.legend()`

In [None]:
import matplotlib.pyplot as plt

# Generate some data
expected = np.linspace(0, 10, 5)
predicted1 = expected + np.random.normal(0, 1, 5)
predicted2 = expected + np.random.normal(0, 1, 5)

# Create a subplot
fig, ax = plt.subplots()

# Plot each group
ax.scatter(x = expected, y = predicted1)
ax.scatter(x = expected, y = predicted2)

# Customize the plot
ax.set_xlabel("Expected")
ax.set_ylabel("Predicted")

# Add legend
ax.legend(["Group 1", "Group 2"])


If I want to plot these points in two different graphs, I can create two Axes objects and plot the data on each.

In [None]:
fig, ax = plt.subplots(1,2)

# plot each subplot

ax[0].scatter(x = expected, y = predicted1)
ax[1].scatter(x = expected, y = predicted2)


## Breaking down the grammar of matplotlib

There are many ways to customize a matplotlib plot, but to understand how, we first need to understand the objects returned by the `plt.subplots()` function and how the plot is constructed. The first object that `plt.subplots()` returns is a `matplotlib.figure.Figure` instance. It is a top-level container for all plot elements. This object is like the **canvas** of a painting, and holds parameters such as the size of the entire plot, the background color, how subplots are laid out, etc. The `Figure` also holds all the `Axes`, which are the actual plots that are going to be drawn. The second object is a `matplotlib.axes.Axes` object. Think of this like an **artist** that *draws* on the canvas. When customizing plots, such as choosing which data to plot, how to plot it, and in what colors or shapes, you will be interacting with the `Axes` object. 
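
As a small sketch of how these two objects relate, the snippet below checks their classes and shows that each one holds a reference to the other. The `figsize` and background color are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib.axes import Axes
from matplotlib.figure import Figure

fig, ax = plt.subplots(figsize=(6, 4))

# The two returned objects are instances of the classes described above
print(isinstance(fig, Figure), isinstance(ax, Axes))

# The Figure keeps a list of every Axes it contains...
print(fig.axes == [ax])

# ...and each Axes knows which Figure it belongs to
print(ax.figure is fig)

# Canvas-level settings live on the Figure
fig.set_facecolor("whitesmoke")
width, height = fig.get_size_inches()
```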

In the code blocks below, we break down the generation of the Figure object and the Axes object into multiple lines to show how they are related. Note that in a Jupyter notebook each code cell draws its own figure, so we have to repeat the previous lines in every cell.

In [None]:
# Making the data
N = 50

# line breaks are allowed inside the braces, so no backslashes are needed
df = pd.DataFrame({
    "x_values": np.random.rand(N),
    "y_values": np.random.rand(N),
    "class": np.random.randint(0, 4, size = N),
    "size": np.random.randint(20, 300, size = N)
})

df.head()


Just generating a figure creates a canvas, but no axes, so nothing is drawn. 

In [None]:
fig = plt.figure()


The Figure object can add a subplot, which returns an Axes object. (In matplotlib, everything that is drawn on a figure, including the Axes itself, is a kind of "Artist".) Now we have a blank canvas with a blank plot.

In [None]:
fig = plt.figure()
ax = fig.add_subplot()


We pass the data to the Axes object, which has methods for different types of plots, including `scatter()`. The Axes object draws the points, and then returns another object representing those points. This is an object of the class `PathCollection`. The pts (PathCollection) object has its own methods for customizing just those dots.

In [None]:
fig = plt.figure()
ax = fig.add_subplot()
pts = ax.scatter(x = df["x_values"], y = df["y_values"])
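
To illustrate what "its own methods" means, here is a minimal sketch that restyles the dots after they have been drawn. The random points and the specific style values are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up points, for illustration only
rng = np.random.default_rng(0)
x_demo = rng.random(20)
y_demo = rng.random(20)

fig, ax = plt.subplots()
pts = ax.scatter(x=x_demo, y=y_demo)

# Restyle only these dots, after they have been drawn
pts.set_alpha(0.5)           # make the markers semi-transparent
pts.set_edgecolor("black")   # outline each marker
pts.set_sizes([80])          # one size applied to every marker

# The object also remembers where its points are
xy = pts.get_offsets()       # one (x, y) row per point
```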


Most of the time, we'll be creating the fig and ax objects in the same line using the `plt.subplots()` function, as demonstrated below. We can pass additional parameters to the `scatter()` method to customize the plot, such as `c` and `s` to set the color and size of each point from additional data. 

The PathCollection object can be used to create a legend for the plot. Its `legend_elements()` method returns the legend handles and labels, which we unpack with `*` and pass to `ax.legend()`. 

Finally, we can use `ax.set()` to change the labels of the plot. 

In [None]:
# create the figure and axis
fig, ax = plt.subplots()

# plot the data
pts = ax.scatter(x = df["x_values"], y = df["y_values"], c = df["class"], s = df["size"])

# create the legend
legend1 = ax.legend(*pts.legend_elements(), loc = "lower left", title = "Classes")

# add x, y, and title labels
ax.set(xlabel = "X", ylabel = "Y", title = "Scatterplot")
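
If the `*pts.legend_elements()` syntax looks mysterious, the sketch below spells out what it does. The data are made up (four classes, two points each) so that the result is predictable.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data with four classes, for illustration only
x_demo = np.arange(8)
y_demo = np.arange(8)
class_demo = np.array([0, 1, 2, 3, 0, 1, 2, 3])

fig, ax = plt.subplots()
pts = ax.scatter(x=x_demo, y=y_demo, c=class_demo)

# legend_elements() returns two lists: marker handles and their text labels
handles, labels = pts.legend_elements()

# These two calls do the same thing; the * simply unpacks the two lists
ax.legend(handles, labels, title="Classes")
ax.legend(*pts.legend_elements(), title="Classes")
```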


In our case we actually have two legends to add, one for the color and one for the sizes. Unfortunately, `ax.legend()` replaces existing legends so here we are creating the first legend, using the `ax.add_artist()` method to add it to the plot, and then creating another legend to be added to the plot.

In [None]:
# create the figure and axis
fig, ax = plt.subplots()

# plot the data
pts = ax.scatter(x = df["x_values"], y = df["y_values"], c = df["class"], s = df["size"])

# create the legend
legend1 = ax.legend(*pts.legend_elements(), loc = "lower left", title = "Classes")

### new
# manually add the first legend (for the classes) to the plot
ax.add_artist(legend1) 
### new

### new
# create a second legend for the sizes
legend2 = ax.legend(*pts.legend_elements(prop = "sizes", alpha = 0.6), loc = "upper right", title = "Sizes") 
### new

# add x, y, and title labels
ax.set(xlabel = "X", ylabel = "Y", title = "Scatterplot")


In general, for every element of a plot, there is a corresponding object or parameter of an object that you use to customize it. Matplotlib can get pretty granular. The matplotlib "Anatomy of a figure" image and its corresponding [docs page](https://matplotlib.org/stable/gallery/showcase/anatomy.html) will be helpful to understand how to customize different parts of a plot. These labels are a mix of objects and their functions, but they should give you a good idea of what you can customize.

<img src="https://matplotlib.org/stable/_images/sphx_glr_anatomy_001_2_00x.png" width="700">
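
As a rough sketch of how a few of the elements labeled in the image map to code, the snippet below adjusts the spines, ticks, grid, and labels of a simple line plot. The data and style values are arbitrary and only for illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])  # made-up data, for illustration only

# Spines are the four border lines of the plot area
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

# Ticks and tick labels are controlled through the Axes
ax.set_xticks([0, 1, 2, 3])
ax.tick_params(axis="both", labelsize=12, length=6)

# Grid lines, title, and axis labels each have their own method
ax.grid(True, linestyle="--", alpha=0.5)
ax.set_title("Title")
ax.set_xlabel("x Axis label")
ax.set_ylabel("y Axis label")
```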

## Using loops to plot

It is relatively simple to plot two axes in one figure. Give `plt.subplots` two numbers representing the number of rows and columns you want, and it will return an array of Axes objects instead of a single one. You can then index into the array to plot on each Axes.

In [None]:
fig, ax = plt.subplots(2,1)
ax[0].text(0.5, 0.5, "Top plot", ha = "center")
ax[1].text(0.5, 0.5, "Bottom plot", ha = "center")


If you want to plot multiple subplots, pass the number of rows and columns of subplots you want to the `plt.subplots()` function. This will return an array of axes objects which you can index into to plot on each subplot. For larger plots, it's helpful to increase the size of the canvas using the `figsize` parameter in `plt.subplots()`.

In [None]:
# demonstration of looping through axes to plot multiple subplots
fig, ax = plt.subplots(3, 3, figsize = (10, 10))
# This flattens the 2D array of axes into a 1D array
# Alternatively, you can use a nested loop to plot by row and column
ax = ax.flatten()
for i in range(9):
    # draw number in center of plot
    ax[i].text(0.5, 0.5, str(i), fontsize = 18, ha = 'center')


Another way to plot several sets of data on the same plot is to use a loop over the discrete values. In the code block below, we demonstrate plotting the separate classes of data from the `df` object in a loop. Notice that we can use the `label` parameter to assign each class its own label and that we do not have to save the PathCollection object to create the legend.

In [None]:
# demo code block of looping through classes to plot multiple data sets on one axes

fig, ax = plt.subplots()

# make a dictionary of values you want associated with each class
classes = {
    0: {
        "color": "red",
        "label": "Class 0"
    },
    1: {
        "color": "blue",
        "label": "Class 1"
    },
    2: {
        "color": "green",
        "label": "Class 2"
    },
    3: {
        "color": "orange",
        "label": "Class 3"
    }
}

for key in classes:
    # filter the data to only include the current class you're plotting
    data = df[df["class"] == key]
    # plot just that class and assign the color and label
    ax.scatter(x = data["x_values"], y = data["y_values"], c = classes[key]["color"], label = classes[key]["label"])

# the bbox_to_anchor argument moves the legend outside of the plot
ax.legend(bbox_to_anchor=(1.25, 0.5))

# You can continue modifying the ax outside of the loop
ax.set(xlabel = "X", ylabel = "Y", title = "Scatterplot")
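
If you don't need to hand-pick a color for each class, one alternative is to loop over a pandas `groupby()` and let matplotlib cycle through its default colors. This is only a sketch: the `demo` dataframe is made up to mimic the shape of `df`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data with the same column names as df above
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x_values": rng.random(40),
    "y_values": rng.random(40),
    "class": np.repeat([0, 1, 2, 3], 10),
})

fig, ax = plt.subplots()

# groupby yields one (key, sub-dataframe) pair per class
for key, data in demo.groupby("class"):
    ax.scatter(x=data["x_values"], y=data["y_values"], label=f"Class {key}")

ax.legend(title="Classes")
```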


# Better plotting with seaborn



While matplotlib can be very powerful, it's not very user-friendly and takes a lot of in-depth knowledge to get a plot that is even slightly complicated. The purpose of the previous section was to introduce the conceptual underpinnings of the object-oriented plotting interface. Now, we will learn how to plot with the library seaborn, which is built on top of this interface and takes care of a lot of the details for you. One of the reasons seaborn is so easy to work with is that it is built to work with pandas DataFrames, whereas matplotlib existed before pandas and was built to work with numpy arrays. For the most part, we recommend using seaborn for your regular plotting needs while keeping in mind the matplotlib background if you want to tweak something super specific that may not be available in seaborn. 

In [None]:
import seaborn as sns


Below is the one-line code to plot the same scatter plot we did with matplotlib. As you can see, seaborn does a lot of the drawing of the plot for us, leaving us to just specify the names of the variables and where to plot them. Like matplotlib, seaborn's plotting functions return an object that can be used to further customize the plot.

The `relplot()` function is a high level function that is designed to plot both scatter and line plots. You can specify the type using the "kind" parameter. 

In [None]:
g = sns.relplot(data = df, x = "x_values", y = "y_values", hue = "class", size = "size", kind = "scatter")


`relplot()`, along with `displot()` and `catplot()`, are "figure-level" functions, each of which provides a unified interface to several types of "axes-level" plots. Figure-level functions return a `FacetGrid` object which can be used to customize the plot. You can also directly use the axes-level functions, such as `sns.scatterplot()`, and they will return the familiar `matplotlib.axes.Axes` object. In the image below, you can see how the axes-level functions are encompassed by each figure-level function.

![](https://seaborn.pydata.org/_images/function_overview_8_0.png)
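
To see the difference in what the two kinds of functions return, here is a minimal sketch using a tiny made-up dataframe (`demo`), so it runs without downloading anything.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from matplotlib.axes import Axes

# Made-up data, for illustration only
demo = pd.DataFrame({
    "x_values": [1, 2, 3, 4],
    "y_values": [2, 1, 4, 3],
    "group": ["a", "a", "b", "b"],
})

# Figure-level: creates its own figure and returns a FacetGrid
g = sns.relplot(data=demo, x="x_values", y="y_values", hue="group", kind="scatter")
print(type(g))

# Axes-level: draws on the Axes we give it and returns that same Axes
fig, ax = plt.subplots()
returned = sns.scatterplot(data=demo, x="x_values", y="y_values", hue="group", ax=ax)
print(type(returned))
```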

## Figure level plotting

Let's load a real dataset that contains a mixture of categorical and continuous variables to illustrate seaborn's plotting capabilities. 

In [None]:
penguins = sns.load_dataset("penguins")
penguins.info()


The code block below demonstrates how to use a figure-level function to plot a scatter plot of the penguins dataset. Notice that we now use `hue` instead of `c` to color the points by sex. 

In [None]:
sns.relplot(data = penguins, x = "bill_length_mm", y = "bill_depth_mm", hue = "sex", kind = "scatter")


The code block below demonstrates how to use the figure-level function `relplot()` to break up a scatterplot into three facets based on the species of penguin using the "col" parameter. We can then use the FacetGrid object to customize the axis labels and also to access the Axes-level objects it contains.

In [None]:
g = sns.relplot(data = penguins, x = "bill_length_mm", y = "bill_depth_mm", hue = "sex", kind = "scatter", col = "species")

# Using the FacetGrid methods to customize the figure
g.set_axis_labels("Bill Length (mm)", "Bill Depth (mm)")
# A tricky/annoying way to edit the legend. 
# You have to access a private attribute rather than use a method
g._legend.set_title("Sex")

# Dropping down to the matplotlib axes objects that are contained in the FacetGrid object
g.axes[0,0].set_title("Adelie")
g.axes[0,1].set_title("Chinstrap")
g.axes[0,2].set_title("Gentoo")
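
Typing each title by hand works, but it relies on knowing the order of the facets. As a sketch of an alternative, `FacetGrid.set_titles()` can fill in the category name for every facet, and `axes_dict` lets you look up an Axes by category name. The small `demo` dataframe below is made up to stand in for the penguins data.

```python
import pandas as pd
import seaborn as sns

# Made-up data standing in for the penguins dataframe
demo = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 50.0, 38.2, 47.1, 49.3],
    "bill_depth_mm": [18.7, 17.9, 15.2, 18.1, 18.5, 15.9],
    "species": ["Adelie", "Chinstrap", "Gentoo"] * 2,
})

g = sns.relplot(data=demo, x="bill_length_mm", y="bill_depth_mm",
                kind="scatter", col="species")

# {col_name} is replaced with the category shown in each facet
g.set_titles(col_template="{col_name}")

# axes_dict maps each category to its Axes, so no positional index is needed
g.axes_dict["Gentoo"].set_xlabel("Bill Length (mm)")
```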


## Axes level plotting

If you want to compose multiple plots of different kinds, it's best to use the axes-level functions, as they draw on regular matplotlib Axes objects and can be composed just like we learned previously. Just like before, we create the fig and ax objects using `plt.subplots()`, but this time we pass the axes objects as a parameter to the seaborn axes-level functions. Then, just as before with matplotlib, we use the returned ax objects to set the labels and other customizations for each subplot. 

This syntax is very similar to base matplotlib, but because of its native support for pandas dataframes, the code is much easier to read and write.

In [None]:
# standard matplotlib opening
fig, ax = plt.subplots(1, 2, figsize = (10, 8))
plt.tight_layout(pad = 3) # adjust the padding of the subplots and the main title

# now we switch to using seaborn plotting functions
p1 = sns.boxplot(data = penguins, x = "species", y = "bill_depth_mm", hue = "sex", ax = ax[0])
p2 = sns.scatterplot(data = penguins, x = "bill_length_mm", y = "bill_depth_mm", hue = "species", ax = ax[1])

# the below is all matplotlib code
p1.set(xlabel = "", ylabel = "Bill Depth (mm)")
p1.legend(title = "Sex")
p2.set(xlabel = "Bill Length (mm)", ylabel = "Bill Depth (mm)")
p2.legend(title = "Species")

fig.suptitle("Penguin Measurements")


In general, if you're doing one type of plot, it's easiest to use the figure-level function. This is because these functions take care of placing the legend sensibly outside the plot, resizing the canvas to fit the data, and can easily facet subplots by category automatically. 

If you want to compose a plot with multiple different types of subplots or integrate a plot with base matplotlib objects, use the axes-level functions. 

>**Exercise:** Create a single violin plot comparing the body mass of the penguins by both sex and species. Customize the axis labels and legend title. You may use either `sns.violinplot()` or `sns.catplot()` to do this. 

In [None]:
# Your code here
# Using catplot




In [None]:
# Your code here
# Using violinplot




>**Exercise:** The penguins dataset has 7 variables. Utilize a Figure-level function to plot as many of them as you can at once in one line.

In [None]:
# Your code here



## Jointplots and pairplots

Another group of figure-level functions are `jointplot()` and `pairplot()`. `jointplot()` is used to plot the relationship between two continuous variables and the distribution of each variable. `pairplot()` is used to plot the relationship between all pairs of continuous variables in a dataframe. 

In the code block below, you can see that a joint plot is a scatter plot in the middle and then a kernel density estimate plot of each variable on the sides.

In [None]:
sns.jointplot(data = penguins, x = "bill_depth_mm", y = "bill_length_mm", hue = "species")


Just like other figure-level functions, you can use "kind" to swap between different representations. In this case "kind" can be one of "scatter", "kde", "hist", "hex", "reg", or "resid". Note that "hex", "reg", and "resid" cannot be combined with `hue`.

In [None]:
sns.jointplot(data = penguins, x = "bill_depth_mm", y = "bill_length_mm", hue = "species", kind = "kde")


A pairplot plots a grid of scatter plots of all pairs of continuous variables. On the diagonal, it plots the distributions of each variable. 

In [None]:
sns.pairplot(data = penguins, hue = "species")


# Comparing plotting with matplotlib, pandas, and seaborn

In [None]:
## Seaborn

long_grades = grades.reset_index().melt(id_vars = 'index', value_vars = ["English", "Math", "History"], var_name = 'Subject', value_name = 'Grade')
fig, ax = plt.subplots()
# pass ax explicitly so it is clear which Axes seaborn draws on
long_bar = sns.barplot(data = long_grades, x = 'index', y = 'Grade', hue = 'Subject', ax = ax)
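
Seaborn expects "long" data, with one row per observation, which is why the seaborn version starts with `melt()`. The sketch below shows the reshaping on its own. `demo_grades` is a made-up stand-in for the `grades` dataframe, assuming one row per student and one column per subject.

```python
import pandas as pd

# Made-up stand-in for the grades dataframe: one row per student
demo_grades = pd.DataFrame(
    {"English": [90, 80], "Math": [85, 95], "History": [70, 75]},
    index=["Ana", "Ben"],
)

# Wide to long: one row per (student, subject) pair
long_demo = demo_grades.reset_index().melt(
    id_vars="index",
    value_vars=["English", "Math", "History"],
    var_name="Subject",
    value_name="Grade",
)
print(long_demo)
```

Two students and three subjects give six rows, each holding a student (in the `index` column), a `Subject`, and a `Grade`.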


In [None]:
## Pandas

ax = grades.plot(kind = 'bar')
ax.set_ylabel('Grade')


In [None]:
## Matplotlib

width = 0.3  # the width of the bars
x = np.arange(len(grades.index))  # the label locations

fig, ax = plt.subplots()
ax.bar(x - width, grades["Math"], width, label='Math')
ax.bar(x, grades["History"], width, label='History')
ax.bar(x + width, grades["English"], width, label='English')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_xlabel('Students')
ax.set_ylabel('Grades')
ax.set_title('Grades by subject')
ax.set_xticks(x)
ax.set_xticklabels(grades.index)
ax.legend()
