                                                           Notebook created by Dragos Gruia and Valentina Giunchiglia

# Introduction to plotting

The most commonly used module for creating plots and figures in Python is `matplotlib`. More recently, the module `seaborn` has become popular, since it makes it easy to create better-looking plots. For now, we will focus on `matplotlib`, because `seaborn` adds another layer of complexity. If you would like to learn more about `seaborn` in the future, you can check out the online [tutorials](https://seaborn.pydata.org/tutorial.html). This is not required for this module, however.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [None]:
songs = pd.read_csv("Data/Day9_data.csv", low_memory = False)
songs.head()

Viewing data as a dataframe is useful, but plots are usually easier to interpret.

## Scatterplots

In Python, a common way to create a figure is to first call `plt.figure` and specify the size of the figure through the `figsize` argument as (width, height), in inches. Then, you can call the function for the specific plot you are interested in, in this case a scatterplot. Scatterplots are useful for showing the relationship between two variables in a dataframe. Let's see whether there is a relationship between loudness and energy.

In [None]:
plt.figure(figsize = (6, 4))
plt.scatter(songs["loudness"], songs["energy"])

Of course, there are ways to make the plot look better, for example by adding a title or axis labels, removing the ticks around the figure, or changing the colours.

In [None]:
plt.figure(figsize = (6, 4))
plt.scatter(songs["loudness"], songs["energy"], color = "grey")
plt.title("Relationship: loudness and energy")
plt.xlabel("Loudness")
plt.ylabel("Energy")
plt.xticks([])
plt.yticks([])

If you want to save the figure that you just created, add a call to `plt.savefig` at the end of the cell (as shown in the next plot).
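
As a minimal sketch of saving a figure, the example below draws a small scatterplot from made-up numbers and writes it to a temporary folder. The data, the title, and the file name `example_scatter.png` are assumptions for illustration only. The `dpi` and `bbox_inches` arguments are optional: the first sets the resolution and the second trims the empty border around the plot.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# Made-up data, only for illustration
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.figure(figsize=(6, 4))
plt.scatter(x, y, color="grey")
plt.title("Hypothetical example")

# Hypothetical output location: a file in the system's temporary folder
out_path = os.path.join(tempfile.gettempdir(), "example_scatter.png")

# Call savefig after all plotting commands for this figure
plt.savefig(out_path, dpi=150, bbox_inches="tight")
```

The file extension decides the format, so a name ending in `.pdf` would save a PDF instead of a PNG.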

## Histograms

Histograms are useful for looking at the distribution of a variable. The function to create a histogram in matplotlib is `plt.hist`; it takes as input arguments the column of the dataframe to plot and the number of bins. The higher the number of bins, the more bars the histogram has.

In [None]:
plt.figure(figsize = (6, 4))
_ = plt.hist(songs["danceability"], bins = 30)
plt.savefig("figure1.png")

An alternative to defining only the number of bins is to provide a list of bin edges as input.

In [None]:
plt.figure(figsize = (6, 4))
_ = plt.hist(songs["danceability"], bins = [0, 0.4, 0.6, 1], color = "green")

**IMPORTANT**: Be careful about how you define the bins, because they can substantially change the way the plot looks.
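
To see what the two ways of defining bins do to the underlying counts, here is a small sketch that uses `np.histogram`, which follows the same `bins` convention as `plt.hist`. The values are randomly generated and only stand in for a column such as `songs["danceability"]`; they are not the real data.

```python
import numpy as np

# Hypothetical stand-in for a dataframe column: 1000 random values between 0 and 1
rng = np.random.default_rng(0)
values = rng.beta(5, 3, size=1000)

# An integer asks for that many equal-width bins between the minimum and the maximum
counts_30, edges_30 = np.histogram(values, bins=30)

# A list gives the bin edges explicitly: 4 edges define 3 bins of unequal width
counts_3, edges_3 = np.histogram(values, bins=[0, 0.4, 0.6, 1])

print(len(counts_30), len(edges_30))
print(counts_3)
```

Both calls count the same 1000 values; only the grouping changes, which is why the resulting plots can look so different.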

## Boxplots

Boxplots are useful for visualising the distribution of the data in terms of the first quartile (Q1), median, and third quartile (Q3). They can also give you information about potential outliers in your dataset.

In [None]:
plt.figure(figsize = (6, 4))
plt.boxplot(songs["energy"])
plt.xticks([1],["Energy"])
plt.title("Distribution of energy")
plt.yticks([])
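
The quantities that a boxplot draws can also be computed by hand. The sketch below uses a small made-up array (not the songs data) and assumes the default matplotlib setting, in which points lying more than 1.5 times the interquartile range (IQR) beyond the box are drawn as outliers.

```python
import numpy as np

# Made-up values with one obviously extreme point
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# First quartile, median and third quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])

# The box spans Q1 to Q3; its height is the interquartile range
iqr = q3 - q1

# Points outside these limits are treated as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(outliers)
```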

It is also possible to draw multiple boxplots next to one another by providing a list of columns from the original dataframe as input.

In [None]:
data = [songs["energy"], songs["danceability"], songs["liveness"]]

In [None]:
plt.figure(figsize = (6, 4))
_ = plt.boxplot(data)
_ = plt.xticks([1, 2, 3],["Energy", "Dance", "Live"])

## Pie charts

Pie charts are useful for representing proportions of data, where the full pie corresponds to 100% and each slice to a share of that total. To draw one, it is therefore necessary to calculate these percentages. To understand how pie charts are created, let's start with a simple example.

In [None]:
perc = [40, 20, 10, 10, 20]
labels = ["English", "Spanish", "Italian", "German", "Chinese"]
plt.figure(figsize = (4, 4))
_ = plt.pie(perc, labels = labels)
plt.legend() # remove this command if you don't want to plot the legend

What if you want to show the percentage of each slice? Then you can add the argument `autopct`.

In [None]:
perc = [40, 20, 10, 10, 20]
labels = ["English", "Spanish", "Italian", "German", "Chinese"]
plt.figure(figsize = (4, 4))
_ = plt.pie(perc, labels = labels, autopct='%1.0f%%')
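
The string passed to `autopct` is a format string that is applied to the percentage of each slice. The sketch below applies the same format strings in plain Python to made-up raw counts, to show how the labels come about; the variable names are only illustrative.

```python
# Made-up raw counts; a pie chart shows each value as a share of the total
raw_counts = [4, 2, 1, 1, 2]
total = sum(raw_counts)

# Share of each slice, in percent
shares = [100 * value / total for value in raw_counts]

# "%1.0f" prints the number without decimals and "%%" prints a literal percent sign
labels_no_decimals = ["%1.0f%%" % share for share in shares]

# "%1.1f" keeps one decimal place instead
labels_one_decimal = ["%1.1f%%" % share for share in shares]

print(labels_no_decimals)
print(labels_one_decimal)
```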

-----------

### Code here
Now try to create a pie chart with the songs dataset. You want to show how many songs there are of each genre. Remember, you first need to calculate that; then you can create the plot.

In [None]:
# Code here



----------

## Barplots

Barplots are useful for representing categorical data, and in general can be used to show the relationship between numeric and categorical data.
Let's use a barplot to show how many songs there are for each genre. To do this, we need to get this information using the method `value_counts`.

The input of a barplot is the list of categories, followed by the numerical data (in this case the number of songs for each genre).
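
To see what `value_counts` returns, here is a small sketch on a made-up list of genres (not the songs dataset). The categories end up in the index and the number of occurrences in the values, sorted from most to least frequent.

```python
import pandas as pd

# Made-up genres, only for illustration
toy_genres = pd.Series(["rock", "pop", "rock", "jazz", "rock", "pop"])

# Count how many times each genre appears
toy_counts = toy_genres.value_counts()

print(toy_counts.index.tolist())   # the categories
print(toy_counts.values.tolist())  # how often each category occurs
```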

In [None]:
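count = pd.DataFrame(songs["genre"].value_counts())
plt.figure(figsize = (10, 5))
# Select the counts by position: the column is named "genre" in older pandas versions and "count" in pandas 2.x
plt.bar(count.index, count.iloc[:, 0])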

As you can see, the labels are overlapping. A solution is to rotate them.

In [None]:
plt.figure(figsize = (10, 5))
plt.bar(count.index, count.iloc[:, 0], color = "orange")
plt.title("Songs per genre")
_ = plt.xticks(rotation=90)

## Subplots

Until now, we have always drawn a single plot per figure. However, in Python it is possible to create a figure with multiple subplots by using the command `plt.subplot`. The `plt.subplot` command takes as input the (number of rows, number of columns, plot number).

In [None]:
plt.figure(figsize = (15, 5))
plt.subplot(1,2,1)
_ = plt.boxplot(data)
_ = plt.xticks([1, 2, 3],["Energy", "Dance", "Live"])
plt.subplot(1,2,2)
_ = plt.hist(songs["danceability"], bins = 30)
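
The plot number counts the positions of the grid from left to right and then from top to bottom, starting at 1. As a small sketch of this numbering, the example below fills a hypothetical grid of 2 rows and 3 columns with empty subplots and writes each position in the title.

```python
import matplotlib.pyplot as plt

# Hypothetical grid size, chosen only to illustrate the numbering
n_rows, n_cols = 2, 3

fig = plt.figure(figsize=(9, 5))
for plot_number in range(1, n_rows * n_cols + 1):
    ax = plt.subplot(n_rows, n_cols, plot_number)
    # Convert the plot number into a zero-based row and column
    row, col = divmod(plot_number - 1, n_cols)
    ax.set_title(f"plot {plot_number}: row {row + 1}, column {col + 1}")
    ax.set_xticks([])
    ax.set_yticks([])

# Adjust the spacing so that the titles do not overlap
plt.tight_layout()
```

In this grid, plot number 4 is therefore the first subplot of the second row.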