---

# Data visualisation with `pandas`

The `pandas` library doesn't just manipulate data; it can also visualise it.

In [None]:
import pandas as pd

In [None]:
loans = pd.read_csv("./data/loans.csv")

### Distributions

`pandas` supports different visualisations for looking at the distribution of our data.

In [None]:
loans.head()

The `.describe` method gives us some indication of the values in a column:

In [None]:
loans["loan_amnt"].describe()

But really we want to see the full distribution visually.

In most cases there is a plotting method we can add on to a `DataFrame` to visualise it.

For example, to see a histogram of a column, we can use `.hist()`:

In [None]:
loans["loan_amnt"].hist()

The default options are usually pretty good, but we can change most aspects of the plot through keyword arguments:

In [None]:
loans["loan_amnt"].hist(bins=20, color="green", figsize=(10, 4))

The `DataFrame` also has a `.hist()` method if you want a histogram *per group*.

We can also suppress the text output (`DataFrame.hist()` returns an array of `matplotlib` axes objects, which Jupyter would otherwise print) by either:

- adding a semicolon `;` to the end of the statement
- saving the output of the plotting function (in this case `.hist()`) into a variable we don't then use. The Python convention for that is to use `_`

In [None]:
_ = loans.hist(column="loan_amnt", by="purpose", figsize=(10, 10))

#### Note

As a rule of thumb, if a plot requires a *single column* (e.g. a histogram of a single column), the plotting function will be called on the `Series`.

If a plot requires multiple columns (e.g. a scatterplot) the plotting function will be called on the `DataFrame`.
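
As a minimal sketch of this rule of thumb, the example below uses a small made-up `DataFrame` (the `demo` data and its column names are assumptions for illustration, not the loans data). The histogram needs one column, so it is called on the `Series`; the scatter plot needs two, so it is called on the `DataFrame`.

```python
import pandas as pd

# small made-up dataset, standing in for the loans data
demo = pd.DataFrame({
    "amount": [1_000, 2_500, 4_000, 4_500, 7_000, 12_000],
    "income": [20_000, 35_000, 41_000, 52_000, 60_000, 95_000],
})

# one column needed -> call the plotting method on the Series
hist_axes = demo["amount"].hist(bins=3)

# two columns needed -> call the plotting method on the DataFrame
scatter_axes = demo.plot(kind="scatter", x="income", y="amount")
```

Both calls return the `matplotlib` object they drew on, which is why assigning the result to a variable suppresses the text output.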

---

`pandas` also supports box and whisker plots:

In [None]:
_ = loans.boxplot("loan_amnt")

Different plots have different options:

In [None]:
_ = loans.boxplot("loan_amnt", vert=False)

Box plots can also be drawn per category by calling `.boxplot()` on the `DataFrame` with the `by` argument:

In [None]:
_ = loans.boxplot(column="int_rate", by="home_ownership", figsize=(12, 5))

### Bar charts

The trick with `pandas` plots is to make sure the data is in the right format first. Then, it's usually a matter of calling the correct plotting function (sometimes it's just `.plot()`!).

For bar charts, if you have aggregated data, you can plot it as a bar chart with a simple command.

In [None]:
loans["grade"].value_counts().sort_index()

In [None]:
_ = loans["grade"].value_counts().sort_index().plot(kind="bar")

# can also do this:
# loans["grade"].value_counts().sort_index().plot.bar()

The same can be done for any aggregated data:

In [None]:
avg_loan_by_grade = loans.groupby("grade")["loan_amnt"].median()
avg_loan_by_grade

To create a horizontal bar chart, use `"barh"`

In [None]:
_ = avg_loan_by_grade.plot(kind="barh")

Ah, not quite! The trick here is to have the data in **reverse order**, because horizontal bar charts draw the first row next to the x-axis and work their way *up*.

In [None]:
_ = avg_loan_by_grade.sort_index(ascending=False).plot(kind="barh")
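
An alternative to reversing the data is to flip the y-axis on the object that `.plot()` returns. This is only a sketch on made-up values (the `median_by_grade` series below is an assumption for illustration); `invert_yaxis()` is a standard `matplotlib` method.

```python
import pandas as pd

# made-up aggregated values, indexed by grade
median_by_grade = pd.Series([5_000, 8_000, 10_000], index=["A", "B", "C"])

ax = median_by_grade.plot(kind="barh")

# flip the y-axis so the first category ("A") appears at the top
ax.invert_yaxis()
```

Which approach to prefer is a matter of taste: sorting keeps everything in `pandas`, while flipping the axis leaves the data untouched.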

<h1 style="color: #fcd805">Exercise: distributions and bar charts</h1>

Back to the Kickstarter dataset.

1. Read the Kickstarter data into a `DataFrame` (reminder: it's the `kickstarter.csv.gz` in the `data` folder)

2. Visualise the distribution of the goal amount across the entire dataset using a histogram.

What conclusions do you draw?

3. Compare the distribution of the goal amount across different categories using boxplots.

What do you conclude?

4. Create a column to calculate the pledged amount as a percentage of the goal amount.

5. Visualise the *average* of this percentage for each "state" of projects, as a bar chart.

Each bar will represent the average "completion rate" of the projects in one state: successful, failed, and any other value that appears in the `state` column.

## Scatter plots and correlation

To calculate correlation in our data, `pandas` has a built-in `.corr()` method.

In [None]:
loans.corr(numeric_only=True)

Some correlation values are high, and unsurprisingly so: the size of the installment is obviously correlated with the size of the loan.

Looks like a person's annual income is also positively correlated with the size of the loan (and therefore the installment). At around 0.22, this isn't a strong relationship though.
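
When only one pair of columns matters, the full matrix can be more than we need. The sketch below uses a made-up `demo` table (its values and column names are assumptions) to show two ways of narrowing it down: `Series.corr()` for a single pair, and one sorted column of the matrix.

```python
import pandas as pd

demo = pd.DataFrame({
    "loan": [1_000, 2_000, 3_000, 4_000, 5_000],
    "installment": [35, 65, 100, 130, 165],  # roughly proportional to loan
    "income": [40_000, 25_000, 60_000, 30_000, 55_000],
})

# correlation between exactly two columns (Pearson by default)
loan_vs_installment = demo["loan"].corr(demo["installment"])

# one column of the full matrix, sorted from strongest to weakest
ranked = demo.corr(numeric_only=True)["loan"].sort_values(ascending=False)
```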

Another way to assess the relationship between variables is to visualise it with a scatter plot:

In [None]:
_ = loans.plot(kind="scatter", x="annual_inc", y="loan_amnt")

There are some outliers, so let's zoom in by removing them.

In `pandas`, don't try to do this in the plot itself; filter the data before you plot. (You could also set the limits of the plot's axes after the fact.)

In [None]:
_ = loans[loans["annual_inc"] < 500_000].plot(kind="scatter", x="annual_inc", y="loan_amnt")

That's better, but messy. We can change the transparency of the points to prevent "overplotting" and see denser areas.

In [None]:
_ = loans[loans["annual_inc"] < 500_000].plot(kind="scatter", x="annual_inc", y="loan_amnt", alpha=0.2)
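
Setting the axis limits after the fact, as mentioned above, might look like the sketch below. The `demo` data is made up for illustration; `set_xlim()` is a standard `matplotlib` method on the object that `.plot()` returns.

```python
import pandas as pd

demo = pd.DataFrame({
    "income": [30_000, 45_000, 60_000, 80_000, 2_000_000],  # last row is an outlier
    "loan": [5_000, 8_000, 12_000, 15_000, 20_000],
})

ax = demo.plot(kind="scatter", x="income", y="loan")

# zoom in on the x-axis; the outlier is still in the data, just off-screen
_ = ax.set_xlim(0, 100_000)
```

Note the difference from filtering: the outlier is hidden from view but still present in `demo`, so any statistic computed from the data would still include it.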

<h1 style="color: #fcd805">Exercise: scatter plots and correlation</h1>

Looking at the Kickstarter data, answer the following questions.

1. Is there a relationship between the goal amount and the amount that was pledged for a project?

Answer this question both numerically and visually.

2. Is there a relationship between the number of backers and the *percentage* of the goal that was reached?

Answer this question both numerically and visually.

_Note: for this you will need the column you created in the previous exercise!_

## Line charts & dates in `pandas`

Line charts are generally used for time series data.

For time series data, we need dates. Specifically, columns that are a date type.

In [None]:
loans.head()

We actually have a date disguised as a text column!

`pandas` can convert text to dates as long as we specify the format the dates are in.

How do you know what to put in the `format` section?

Here is the reference page: https://strftime.org/

In [None]:
# %b is "abbreviated month name", e.g. "Jan"
# %Y is year, e.g. 2014
loans["date"] = pd.to_datetime(loans["issue_d"], format="%b-%Y")

loans.head()
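
To see what the format string does in isolation, here is a small sketch on made-up strings (the `raw` and `messy` series are assumptions, not taken from the loans data). It also shows the `errors="coerce"` option, which may be useful if some values don't match the format.

```python
import pandas as pd

raw = pd.Series(["Dec-2014", "Jan-2015", "Feb-2015"])

# %b = abbreviated month name, %Y = four-digit year
parsed = pd.to_datetime(raw, format="%b-%Y")

# errors="coerce" turns anything that doesn't match the format into NaT
messy = pd.to_datetime(
    pd.Series(["Dec-2014", "not a date"]), format="%b-%Y", errors="coerce"
)
```

Because the strings carry no day, each parsed date falls on the first day of its month.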

In [None]:
loans.dtypes

Now that we have a date type, we can access date functionality of that column:

In [None]:
loans["date"].dt.year
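
The year is only one of the attributes available through `.dt`. A few others, sketched on a made-up series of dates (the `dates` values are assumptions for illustration):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2014-12-01", "2015-01-01", "2015-04-01"]))

years = dates.dt.year                # 2014, 2015, 2015
months = dates.dt.month              # month number, 1 to 12
quarters = dates.dt.quarter          # quarter number, 1 to 4
month_names = dates.dt.month_name()  # full English month names
```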

Let's look at monthly total loan amounts:

In [None]:
monthly_loans = loans.groupby(loans["date"])["loan_amnt"].sum()
monthly_loans
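
As an aside, grouping by the date column works here because every date in this data already represents a month. If the dates were daily, one option would be `.resample()`, sketched below on made-up data (the `daily` table is an assumption for illustration).

```python
import pandas as pd

daily = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-03", "2015-01-20", "2015-02-11", "2015-02-28"]),
    "loan_amnt": [1_000, 2_000, 3_000, 4_000],
})

# "MS" groups rows into calendar months, labelled by the first day of the month
monthly_from_daily = daily.set_index("date")["loan_amnt"].resample("MS").sum()
```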

Again, to visualise this, we can call the correct plot function. The default of `.plot()` is actually a line chart, so we don't need to specify anything else:

In [None]:
_ = monthly_loans.plot()

Looks like the plot defaulted to "scientific notation" on the y-axis.

To change this, we need to dive into the plotting library that `pandas` uses, to gain full control of our plots. The arguments of the `.plot()` function in `pandas` don't expose this level of detail on their own.

The plotting library `pandas` uses under the hood is called `matplotlib` and the way to use it is to import its `pyplot` submodule.

For more information about `matplotlib` and the different ways to use this, you can refer to this excellent article: https://pbpython.com/effective-matplotlib.html

In [None]:
import matplotlib.pyplot as plt

You can set global options for all plots, such as the default theme.

Themes are called "styles" in `matplotlib` and you can use the many different built-in ones or even create your own.

Here is the reference page with all available styles: https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html

For example, we could set all our plots to mimic the style of the political blog FiveThirtyEight:

In [None]:
plt.style.use('fivethirtyeight')

_ = loans["loan_amnt"].hist()

And we can change it back:

In [None]:
plt.style.use('default')

_ = loans["loan_amnt"].hist()
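
If a style is only wanted for one plot, switching it on and off globally can be avoided with `plt.style.context()`. The sketch below uses a made-up `amounts` series and the built-in `"ggplot"` style as an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

amounts = pd.Series([1_000, 2_500, 4_000, 4_500, 7_000, 12_000])

# the style applies only to plots created inside the "with" block
with plt.style.context("ggplot"):
    fig, axis = plt.subplots()
    amounts.hist(ax=axis)

# names of every style available in this matplotlib installation
available_styles = plt.style.available
```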

The real power of using `matplotlib` directly is that we can fully control all aspects of a chart.

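In `matplotlib`, a **figure** is the entire chart area, and each individual plot inside it is an **axis**. (Strictly speaking, `matplotlib` calls this object an `Axes`, and reserves `Axis` for a single x- or y-axis.)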

In this example, we have **one** figure and **seven** axes.

In [None]:
_ = loans.hist(column="loan_amnt", by="grade", figsize=(8, 8))

To gain full control, we need to *first* create a blank figure and axis (or multiple axes), and then tell the `pandas` plot function to use our figure and axis/axes rather than create its own.

In [None]:
fig, axis = plt.subplots(figsize=(6, 6))

# tell the plot function which Axes object to use
monthly_loans.plot(ax=axis)

# one way to remove scientific notation
axis.ticklabel_format(axis="y", style="plain")

plt.show()
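
Another way to remove scientific notation is to give the y-axis an explicit tick formatter. This sketch uses `StrMethodFormatter` from `matplotlib.ticker` on a made-up `totals` series (the values are assumptions for illustration) to add thousands separators.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import StrMethodFormatter

totals = pd.Series(
    [12_500_000, 14_000_000, 17_250_000],
    index=pd.to_datetime(["2015-01-01", "2015-02-01", "2015-03-01"]),
)

fig, axis = plt.subplots(figsize=(6, 4))
totals.plot(ax=axis)

# format each y tick with a thousands separator and no decimals
formatter = StrMethodFormatter("{x:,.0f}")
axis.yaxis.set_major_formatter(formatter)
```

The format string follows the usual Python format syntax, with `x` standing for the tick value.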

These are the `Figure` and `Axes` objects we can control:

In [None]:
print(type(fig), type(axis))

Let's see what else we can do now we have our figure and axis:

In [None]:
fig, axis = plt.subplots(figsize=(6, 6))

loans["loan_amnt"].hist(ax=axis)

# now we have full control of the figure and axis and can set all the options!

# you can set options individually
axis.set_facecolor("pink")

# or all at once!
axis.set(
    title="Distribution of loan amount",
    xlabel="Loan amount",
    ylabel="Frequency"
)

# .show() is not necessary in Jupyter, but it is outside of it
# and it suppresses the text output
plt.show()

Another trick is to create multiple axes at once, which gives you a NumPy array of axes (1D for a single row or column, 2D for a grid):

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# axes is now a 1D array, so we pick each axis by position
loans["loan_amnt"].hist(ax=axes[0])

# set several options at once
axes[0].set(
    title="Distribution of loan amount",
    xlabel="Loan amount",
    ylabel="Frequency"
)

loans.boxplot(column="int_rate", vert=False, ax=axes[1])

axes[1].set(
    title="Distribution of interest rate",
    xlabel="Interest rate (%)",
    ylabel=None
)

# remove the "tick" label from the box plot
axes[1].tick_params(axis="y", labelleft=False)

# .show() is not necessary in Jupyter, but it is outside of it
# and it suppresses the text output
plt.show()
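
With more than one row and column, the axes come back as a 2D array. A minimal sketch of one way to loop over them, using a made-up `demo` table (its columns are assumptions for illustration):

```python
import matplotlib.pyplot as plt
import pandas as pd

demo = pd.DataFrame({
    "a": [1, 2, 2, 3, 5],
    "b": [4, 4, 6, 7, 9],
    "c": [0, 1, 1, 1, 8],
    "d": [3, 5, 5, 6, 6],
})

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# axes is a 2x2 NumPy array; .flat walks through it one axis at a time
for column, axis in zip(demo.columns, axes.flat):
    demo[column].hist(ax=axis)
    axis.set(title=f"Distribution of {column}")

# avoid overlapping titles and tick labels
fig.tight_layout()
```

A single axis can still be picked by position, for example `axes[1, 0]` for the bottom-left plot.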

<h1 style="color: #fcd805">Exercise: line charts</h1>

Back to our Kickstarter data.

1. Convert the `launched` column to be a date type.

2. Calculate the *number of projects* per day using the `launched` column.

_Tip: if a datetime column has a time component, you can isolate just the date using `.dt.date`_

3. Visualise the number of projects per day as a line chart.

Use `matplotlib` to create a figure and axis object and try lots of options to make your chart look unique!

### Other visualisation options

There are many other visualisation libraries in Python!

One popular complement to `matplotlib` is `seaborn` (https://seaborn.pydata.org). It has lots of interesting plot types and is fully compatible with `matplotlib`.

One thing that's easier in `seaborn` is to colour objects in the visualisation based on a column, such as colouring the points in a scatter plot based on a category:

In [None]:
import seaborn as sns

fig, axis = plt.subplots(figsize=(10, 5))

sns.scatterplot(data=loans[loans["annual_inc"] < 500_000],
                x="annual_inc",
                y="loan_amnt",
                hue="term",
                alpha=0.3,
                ax=axis)

axis.set(
    title="Does the relationship between income and loan amount vary across the length of the loan term?",
    xlabel="Annual income ($)",
    ylabel="Loan amount ($)"
)

plt.show()
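
For comparison, here is a rough sketch of what colouring by category involves in plain `matplotlib`: one `scatter` call per group. The `demo` data is made up for illustration, and this is only one of several possible approaches.

```python
import matplotlib.pyplot as plt
import pandas as pd

demo = pd.DataFrame({
    "income": [30_000, 45_000, 60_000, 80_000, 52_000, 71_000],
    "loan": [5_000, 8_000, 12_000, 15_000, 9_000, 20_000],
    "term": ["36 months", "36 months", "60 months",
             "60 months", "36 months", "60 months"],
})

fig, axis = plt.subplots(figsize=(6, 4))

# one scatter call per category; matplotlib picks a new colour each time
for term, group in demo.groupby("term"):
    axis.scatter(group["income"], group["loan"], label=term, alpha=0.6)

_ = axis.legend(title="term")
```

The `hue` argument in `seaborn` takes care of the grouping, colours and legend in a single call.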

Another useful `seaborn` visualisation is the heatmap. This colours any data table based on its values.

If you want to pick the right colour range (called a colormap), this is the documentation: https://matplotlib.org/stable/users/explain/colors/colormaps.html

In [None]:
fig, axis = plt.subplots(figsize=(5, 5))

sns.heatmap(
    data=loans.corr(numeric_only=True),
    vmin=-1,
    vmax=1,
    square=True,
    cmap="Blues",
    ax=axis
)

axis.set(
    title="Correlation matrix for the loans data"
)

plt.show()

Other options for visualisation include:

- Plotnine (if you are an R user familiar with ggplot): https://plotnine.readthedocs.io
- Plotly (for interactive visualisations): https://plotly.com/python/
- Altair (for declarative statistical visualisations): https://altair-viz.github.io/