# Module 7 Data Visualization

**_Author: Favio Vázquez_**

**Expected time = 2 hours**

**Total points = 130 points**

## Assignment Overview

In this assignment you will use matplotlib and pandas to create and customize a number of data visualizations. You will begin by creating a simple line plot and gradually build your skills to create more complex and customized visualizations. You will use pandas to create a pairs plot so you can display multiple scatter plots in one figure. Finally, you will wrap up the assignment by creating a time series visualization in matplotlib. **Please review the Important Instructions section below. You must adhere to these instructions to ensure the grading for this assignment works properly in Vocareum.** 

This assignment is designed to build your familiarity and comfort coding in Python while also helping you review key topics from each module. As you progress through the assignment, answers will get increasingly complex. It is important that you adopt a data scientist's mindset when completing this assignment. **Remember to run your code from each cell before submitting your assignment.** Running your code beforehand will notify you of errors and give you a chance to fix your errors before submitting. You should view your Vocareum submission as if you are delivering a final project to your manager or client. 

***Vocareum Tips***

- Do not add arguments or options to functions unless you are specifically asked to. This will cause an error in Vocareum.
- Do not use a library unless you are explicitly asked to in the question.


### Learning Objectives

- Use matplotlib to visualize data including basic plots and time series. 
- Explain the benefits and limitations of various data visualizations. 
- Examine various data visualization tools. 
- Select the appropriate data visualization to effectively communicate the dataset and analysis. 

**IMPORTANT INSTRUCTIONS:** 

- To test this module, you will be asked to save your figures as PNG files in a folder called "results". Please don't change the file names we ask you to give the plots, so that you can get all the points for every question. The code you will use to save the PNG files is: `plt.savefig("results/plot.png")`
- Don't add any customization you're not asked to in the plots.
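
The saving step described above can be sketched as follows. This is a minimal illustration, not the grader's code: the temporary directory, the variable names, and the `Agg` backend line are assumptions made so the sketch runs on its own (in the notebook you would save straight into `results/` and would not switch backends).

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line inside a notebook
import matplotlib.pyplot as plt

# Assumption: a temporary directory stands in for the notebook's working directory.
base_dir = tempfile.mkdtemp()
results_dir = os.path.join(base_dir, "results")

# savefig does not create missing folders, so make sure "results" exists first.
os.makedirs(results_dir, exist_ok=True)

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])  # illustrative data only

# Keep the exact file name each question asks for (here the generic "plot.png").
out_path = os.path.join(results_dir, "plot.png")
fig.savefig(out_path)
plt.close(fig)
```

Saving before calling `plt.show()` or closing the figure matters, because a closed figure can no longer be written to disk with its contents.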

## Index:

#### Module 7: Data Visualization.

- [Question 1](#Question-1)
- [Question 2](#Question-2)
- [Question 3](#Question-3)
- [Question 4](#Question-4)
- [Question 5](#Question-5)
- [Question 6](#Question-6)
- [Question 7](#Question-7)
- [Question 8](#Question-8)
- [Question 9](#Question-9)
- [Question 10](#Question-10)
- [Question 11](#Question-11)
- [Question 12](#Question-12)

In [None]:
# Let's start by importing the libraries we will be using
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from dateutil.parser import parse
from sklearn import datasets
import scipy.stats as sp
import math

# Avoid warnings
import warnings
warnings.filterwarnings("ignore")

## Getting started with Matplotlib

You will start by creating some basic plots with matplotlib. You will also label the x-axis and y-axis of your figures. 

[Back to top](#Index:) 

### Question 1
*5 points*

Create a basic line plot with the data `x` and `y` given below. Add a `y-label` with the words "numbers y" and add an `x-label` with the words "numbers x".

Save your plot as a png with the name "plot1.png" in the folder "results".

In [None]:
### GRADED 

# Data
x = [1,2,3,4]
y = [1,2,3,4]

plt.plot(x,y)
plt.ylabel("numbers y")
plt.xlabel("numbers x")
plt.savefig("results/plot1.png")


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 2
*10 points*
    
Reproduce the plot from the image below. You will find the data in the grading cell. Make sure to limit the x axis to 0-6, and the y axis to 0-150.

![](images/1.png)

Save your plot as a png with the name "plot2.png" in the folder "results".

**Hint: The first line should be red with circles 'ro', the second line should be blue with squares 'bs', and the third line should be green with triangles 'g^'. You can find this plot in the first video of the module.**
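
To see how the format strings in the hint work, here is a small sketch on made-up data (the list `xs` and the variable names are illustrative, not part of the assignment). Each string combines a color letter with a marker symbol, and one `plot` call can take several `x, y, format` groups.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line inside a notebook
import matplotlib.pyplot as plt

xs = [0, 1, 2, 3]  # illustrative data, not the assignment's

fig, ax = plt.subplots()
# Three (x, y, format) groups in one call produce three Line2D objects.
lines = ax.plot(
    xs, xs, "ro",                    # red circles
    xs, [v ** 2 for v in xs], "bs",  # blue squares
    xs, [v ** 3 for v in xs], "g^",  # green triangles
)

# Inspect what matplotlib parsed out of each format string.
colors = [line.get_color() for line in lines]
markers = [line.get_marker() for line in lines]
plt.close(fig)
```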

In [None]:
### GRADED 

# Data
t = np.arange(0., 5., 0.2)

plt.plot(t, t, 'ro', t, t**2, 'bs', t, t**3, 'g^')
plt.axis([0, 6, 0, 150])
plt.savefig("results/plot2.png")


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Simple line plots

Next, you will create a simple line plot. As you may remember: 
- X-axis requires quantitative variable
- Y-axis values must be numbers and can be set to be interpreted (formatted) as decimal, percent, and currency.
- Variables have contiguous values
- Familiar/conventional ordering among ordinals
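
As a sketch of the formatting point above, the example below shows one way to display y-axis values as percentages. It assumes the values are stored as fractions (so 0.12 means 12%); the data and variable names are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line inside a notebook
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

quarters = [1, 2, 3, 4]            # illustrative x values
growth = [0.05, 0.12, 0.08, 0.15]  # illustrative fractions, 0.12 means 12%

fig, ax = plt.subplots()
ax.plot(quarters, growth)

# xmax=1.0 tells the formatter that 1.0 corresponds to 100%.
formatter = PercentFormatter(xmax=1.0, decimals=0)
ax.yaxis.set_major_formatter(formatter)

# Format a single value the same way the tick labels will be formatted.
example_label = formatter.format_pct(0.12, 1.0)
plt.close(fig)
```

The underlying numbers are unchanged; only the tick labels are rendered differently.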

[Back to top](#Index:) 

### Question 3
*10 points*
    
Plot the functions $f(x) = \sin(x)$ and $f(x) = 3\cos(x)$ as a line plot. Make these customizations to the plot:

- Title: "Function plots". **Hint: Notice the capitalization**.
- y-label = "values"
- x-label = "angles"

The data is given in the grading cell below. 

Save your plot as a png with the name "plot3.png" in the folder "results".

In [None]:
### GRADED 

# the x axis: ndarray object of angles between 0 and 2π
x = np.arange(0, math.pi*2, 0.05)

y_sin = np.sin(x)
y_cos = 3 * np.cos(x)
plt.plot(x,y_sin,x,y_cos)
plt.xlabel("angles")
plt.ylabel("values")
plt.title('Function plots')
plt.savefig("results/plot3.png")

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Bar plots

Next, you will create standard and multi-series bar plots. Bar graphs are used to compare data between different groups or to track changes over time (similar to a line chart). Multi-series bar graphs can also stack several groups within the same bar, so that the segments show the composition of each bar along the y-axis.

Matplotlib provides the **bar()** function for creating bar plots. 

You can use the **bar()** function parameters below with the axes object:

``bar(x, height, width, bottom, align)``
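
The stacked layout mentioned above can be sketched with the `bottom` parameter, which sets where each bar starts on the y-axis. The data and names below are made up for illustration and are not the assignment's.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

groups = ["A", "B", "C"]        # illustrative categories
x = np.arange(len(groups))
series_1 = np.array([3, 5, 2])  # illustrative heights
series_2 = np.array([4, 1, 6])

fig, ax = plt.subplots()
lower = ax.bar(x, series_1, width=0.5, label="series 1")
# bottom=series_1 starts each second-series bar where the first one ends.
upper = ax.bar(x, series_2, width=0.5, bottom=series_1, label="series 2")
ax.set_xticks(x)
ax.set_xticklabels(groups)
ax.legend()

# The top of each stack is the sum of both series.
stack_tops = [rect.get_y() + rect.get_height() for rect in upper]
plt.close(fig)
```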

[Back to top](#Index:) 

### Question 4
*5 points*

Given the data below, create a bar plot. Make these customizations to the plot:

- Title: "My bar plot"
- X-ticks: Use (y_pos, bars) in `plt.xticks`.
- The color of the bars should be changed with the argument `color=(0.2, 0.4, 0.6, 0.6)`

Save your plot as a png with the name "plot4.png" in the folder "results".

In [None]:
### GRADED 

# Data
height = [3, 12, 5, 18, 45]
bars = ('A', 'B', 'C', 'D', 'E')
y_pos = np.arange(len(bars))

plt.bar(y_pos, height, color=(0.2, 0.4, 0.6, 0.6))
plt.title("My bar plot")
plt.xticks(y_pos, bars)
plt.savefig("results/plot4.png")


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 5
*20 points*

Given the data below, create a bar plot. Make these customizations to the plot:

- The first bar should be created with the arguments `width=bar_width`, `color='blue'`, `edgecolor='black'`, `yerr=yer1`, `capsize=7`, `label='group1'`.
- The second bar should be created with the arguments `width=bar_width`, `color='cyan'`, `edgecolor='black'`, `yerr=yer2`, `capsize=7`, `label='group2'`.
- Title: "Advanced bar plot".
- Y-label = "height".
- Legend = Show a standard label with the name of the bars.

Save your plot as a png with the name "plot5.png" in the folder "results".

**Hint: Make sure to add the `plt.legend()` to show the legend.**

The final plot should look like this:

![](images/2.png)

In [None]:
### GRADED 

# Data

# width of the bars
bar_width = 0.3
 
# Choose the height of the blue bars
bars1 = [10, 9, 2]
 
# Choose the height of the cyan bars
bars2 = [10.8, 9.5, 4.5]
 
# Choose the height of the error bars (bars1)
yer1 = [0.5, 0.4, 0.5]
 
# Choose the height of the error bars (bars2)
yer2 = [1, 0.7, 1]

# The x position of bars
r1 = np.arange(len(bars1))
r2 = [x + bar_width for x in r1]

# general layout
plt.xticks([r + bar_width for r in range(len(bars1))], ['cond_A', 'cond_B', 'cond_C'])

### YOUR SOLUTION HERE
plt.bar(r1, bars1, width = bar_width, color = 'blue', edgecolor = 'black', yerr=yer1, capsize=7, label='group1')

# Create cyan bars
plt.bar(r2, bars2, width = bar_width, color = 'cyan', edgecolor = 'black', yerr=yer2, capsize=7, label='group2')

# Title and y-label requested in the question
plt.title('Advanced bar plot')
plt.ylabel('height')
plt.legend()
plt.savefig("results/plot5.png")

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Scatter plots
Next, you will build a scatter plot using the iris dataset.  A Scatter Plot is used to determine relationships between two different data dimensions. The x-axis is used to measure one dimension (or variable) and the y-axis is used to measure the other. If both variables increase at the same time, they have a positive relationship. If one variable decreases while the other increases, they have a negative relationship. Sometimes the variables don't follow any pattern and have no relationship.

Matplotlib provides the **scatter()** function for creating scatter plots. 

The **scatter()** function parameters below can be used with axes objects:

``scatter(x_coords, y_coords, shape, colors, alpha)``
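
The positive and negative relationships described above can be sketched on synthetic data. Everything here (the random data, the slopes, the variable names) is an illustrative assumption; the correlation coefficient is used only as a quick numeric check of the direction of each relationship.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y_up = 2 * x + rng.normal(0, 1, size=x.size)     # rises with x
y_down = -2 * x + rng.normal(0, 1, size=x.size)  # falls as x rises

fig, ax = plt.subplots()
ax.scatter(x, y_up, marker="o", c="tab:blue", alpha=0.6, label="positive")
ax.scatter(x, y_down, marker="^", c="tab:red", alpha=0.6, label="negative")
ax.legend()

# Pearson correlation: near +1 for a positive relationship, near -1 for a negative one.
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]
plt.close(fig)
```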

[Back to top](#Index:) 

### Question 6
*15 points*
    
Given the iris dataset, create a scatter plot of the petal length (cm) vs the sepal length (cm) with these customizations:

- The colors of the species should follow the `colours` and `species` lists from below. 
- X-label: 'sepal length (cm)'
- Y-label: 'petal length (cm)'
- Title: 'Iris dataset: petal length vs sepal length'
- Legend = Should be displayed with the argument `loc="lower right"`.

Save your plot as a png with the name "plot6.png" in the folder "results".

The final plot should look like this:

![](images/3.png)

In [None]:
### GRADED 

# Data
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
iris_df['species'] = iris['target']

# Set color and names for legend
colours = ['red', 'orange', 'blue']
species = ['I. setosa', 'I. versicolor', 'I. virginica']

### YOUR SOLUTION HERE
for i in range(0, 3):
    species_df = iris_df[iris_df['species'] == i]
    plt.scatter(
        species_df['sepal length (cm)'],
        species_df['petal length (cm)'],
        color=colours[i],
        alpha=0.5,
        label=species[i]
    )
    
plt.xlabel('sepal length (cm)')
plt.ylabel('petal length (cm)')
plt.title('Iris dataset: petal length vs sepal length')
plt.legend(loc='lower right')
plt.savefig("results/plot6.png")


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Histograms
A histogram is a display of statistical information that uses rectangles to show the frequency of data items in successive numerical intervals of equal size. In the most common form of histogram, the **independent** variable is plotted along the horizontal axis and the **dependent** variable is plotted along the vertical axis.

Matplotlib provides the **pyplot.hist()** function for creating histogram plots. 

You can use the **pyplot.hist()** function parameters below with the axes object:

``pyplot.hist(x_values, bins, color, transparency)``
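
A short sketch of what a histogram call returns may help here. The data and names are illustrative; note that in matplotlib the transparency is passed through the `alpha` keyword.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=1000)  # illustrative sample

fig, ax = plt.subplots()
# hist returns the per-bin counts, the bin edges, and the drawn rectangles.
counts, bin_edges, patches = ax.hist(values, bins=10, color="steelblue", alpha=0.7)

# With an integer `bins`, the intervals are of equal size.
bin_widths = np.diff(bin_edges)
plt.close(fig)
```

There is always one more bin edge than there are bins, and the counts add up to the number of data points.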

[Back to top](#Index:) 

### Question 7
*10 points*

Generate data (using numpy) for a normal distribution centered at x=0 with 100,000 points. Then plot a histogram of the data. You'll find more instructions in the grading cell below.

Save your plot as a png with the name "plot7.png" in the folder "results".

In [None]:
### GRADED 

np.random.seed(123)
N_points = 100000
n_bins = 20

### YOUR SOLUTION HERE
x = np.random.randn(N_points)
plt.hist(x, bins=n_bins)
plt.savefig("results/plot7.png")


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 8
*15 points*

Build a histogram with a fit line for the data given below and these customizations:

- Title = "Histogram of IQ: $\mu=120$, $\sigma=12$"
- X-label = 'x-axis'
- Y-label = 'y-axis'
- The fit line should be a blue dashed line (`'b--'`). 
- Use the argument `density=1` when creating the histogram.
- Use the argument `facecolor='red'` when creating the histogram.
- Use the argument `alpha=0.6` when creating the histogram.

Save your plot as a png with the name "plot8.png" in the folder "results".

The final plot should look like this:

![](images/4.png)

**Hint: Use scipy to create the fitted line.**
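
As a rough sketch of the hint, `scipy.stats.norm.pdf` evaluates the normal density at given points, which can then be drawn over a density-normalized histogram. The sample, the grid, and the variable names are illustrative assumptions; only `mu` and `sigma` are taken from the question.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

mu, sigma = 120, 12  # values from the question
rng = np.random.default_rng(0)
sample = rng.normal(mu, sigma, size=1000)  # illustrative sample

# Evaluate the density on an evenly spaced grid covering +/- 4 standard deviations.
grid = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 201)
density = norm.pdf(grid, loc=mu, scale=sigma)

fig, ax = plt.subplots()
ax.hist(sample, bins=20, density=True, alpha=0.6)  # density=True puts bars on the pdf scale
ax.plot(grid, density, "b--")
plt.close(fig)

# Sanity check: trapezoid-rule area under the curve should be close to 1.
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
```

Without `density=True` the histogram shows raw counts, and the curve would look flat next to the bars.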

In [None]:
### GRADED

# Data
np.random.seed(123)
mu = 120 # mean of distribution
sigma = 12 # standard deviation of distribution
x = mu + sigma * np.random.randn(10000)
num_bins = 20

### YOUR SOLUTION HERE
n, bins, patches = plt.hist(x, num_bins, density=1, facecolor='red', alpha=0.6)
y = sp.norm.pdf(bins, mu, sigma)
plt.plot(bins, y, 'b--')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.title(r'Histogram of IQ: $\mu=120$, $\sigma=12$')
plt.savefig("results/plot8.png")


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Box plots

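A box plot displays a summary of a set of data: the minimum, first quartile, median, third quartile, and maximum. In a box plot, we draw a box from the first quartile to the third quartile, and a line goes through the box at the median (horizontal in matplotlib's default vertical orientation). The whiskers go from each quartile toward the minimum or maximum; by default, matplotlib stops each whisker at the farthest data point within 1.5 times the interquartile range and draws anything beyond it as an outlier.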

We will use the `boxplot()` function to create a boxplot with Matplotlib. Parameters in the `boxplot()` function determine orientation, colors, labels for the figure. 
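
The five-number summary behind a box plot can be sketched numerically. The small array below is made up for illustration; the sketch relies on matplotlib's default whisker rule (1.5 times the interquartile range) to flag the last value as an outlier.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # illustrative data with one extreme value

# Quartiles and interquartile range computed directly with numpy.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # points above this are drawn as outliers by default

fig, ax = plt.subplots()
parts = ax.boxplot(data)  # returns a dict of the drawn artists

median_line = parts["medians"][0].get_ydata()   # y position of the median line
flier_values = parts["fliers"][0].get_ydata()   # y values of the outlier markers
plt.close(fig)
```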

[Back to top](#Index:) 

### Question 9
*10 points*

Using the data below, create a box plot with these customizations:

- Title: "My box plot"
- X-label: "data"
- Plot the outliers with the green_diamond dictionary.

Save your plot as a png with the name "plot9.png" in the folder "results".

In [None]:
### GRADED

# Data
np.random.seed(123)
spread = np.random.rand(50) * 100
center = np.ones(25) * 50
flier_high = np.random.rand(10) * 100 + 100
flier_low = np.random.rand(10) * -100
data = np.concatenate((spread, center, flier_high, flier_low))

# For outliers
green_diamond = dict(markerfacecolor='g', marker='D')

### YOUR SOLUTION HERE
plt.boxplot(data, flierprops=green_diamond)
plt.title("My box plot")
plt.xlabel("data")
plt.savefig("results/plot9.png")

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Pair plots

A “pairs plot” is also known as a scatter matrix. A scatter matrix is a collection of scatterplots organized into a grid (or matrix). Each scatterplot shows the relationship between a pair of variables. A pairs plot is just an elaboration on this, showing every variable paired with every other variable. 

In this assignment, you will create a pairs plot using pandas. 
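
To make the grid structure concrete, here is a minimal sketch on a synthetic three-column DataFrame (the column names and data are made up). `pandas.plotting.scatter_matrix` returns an array of axes with one row and one column per variable.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "a": rng.normal(size=30),  # illustrative columns
    "b": rng.normal(size=30),
    "c": rng.normal(size=30),
})

# One panel per pair of columns; the diagonal shows each column's own distribution.
axes = pd.plotting.scatter_matrix(frame, figsize=(6, 6))
grid_shape = axes.shape
plt.close("all")
```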

[Back to top](#Index:) 

### Question 10
*10 points*

Create a pair plot for the iris dataset using pandas. Use these customizations inside the `scatter_matrix` function:

```
c = Y
figsize = (15,15)
marker = 'o'
hist_kwds={'bins':20}
s = 60
alpha = 0.8
```

Save your plot as a png with the name "plot10.png" in the folder "results".

In [None]:
### GRADED

# Data
iris_dataset = datasets.load_iris()
X = iris_dataset.data
Y = iris_dataset.target

iris_dataframe = pd.DataFrame(X, columns=iris_dataset.feature_names)

### YOUR SOLUTION HERE
pd.plotting.scatter_matrix(iris_dataframe, c=Y, figsize=(15, 15), marker='o',hist_kwds={'bins': 20}, s=60, alpha=.8)
plt.savefig("results/plot10.png")


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Time series
A time series plot is a graph where some measure of time is the unit on the x-axis. In fact, we label the x-axis the time-axis. The y-axis is for the variable that is being measured. Data points are plotted and generally connected with straight lines, which allows for the analysis of the graph generated. 


A time series plot in matplotlib is, in essence, a special case of a line plot. That is, we are plotting x and y points connected by a line, but the x coordinates represent ``time``.
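
A minimal sketch of this idea, using a made-up monthly series (the dates, values, and names are illustrative assumptions): the only difference from an ordinary line plot is that the x values are dates.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Twelve month-start dates used as the time axis.
dates = pd.date_range("2020-01-01", periods=12, freq="MS")
series = pd.Series(np.arange(12) * 10.0, index=dates)  # illustrative values

fig, ax = plt.subplots()
# An ordinary line plot: dates on the x-axis, measured values on the y-axis.
ax.plot(series.index, series.values)
ax.set_xlabel("time")
n_points = len(ax.lines[0].get_xdata())
plt.close(fig)
```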

[Back to top](#Index:) 

### Question 11
*5 points*
    
For the time series data given below, create a lineplot. Change the title of the plot to 'My time series'. 

Save your plot as a png with the name "plot11.png" in the folder "results".

In [None]:
### GRADED

# Data
df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')
dates = pd.DatetimeIndex([parse(d).strftime('%Y-%m-01') for d in df['date']])
df.set_index(dates, inplace=True)

# convert the dataframe into a Series, there is only one column
ts = df['value']

ts.plot()
plt.title("My time series")
plt.savefig("results/plot11.png")


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 12
*15 points*

Prepare the data for the dataset given below. Then plot all of the data with lineplots. Set the title of the plot as "Seasonal Plot". Set the size of the plot to (16,12) and use a dpi of 80. When creating the plot add a text using the code:

```python
plt.text(df.loc[df.year==y, :].shape[0]-.9, df.loc[df.year==y, 'value'][-1:].values[0], y, fontsize=12)
```


Save your plot as a png with the name "plot12.png" in the folder "results".

The final plot should look like this:

![](images/5.png)

**Hint: Extract the year and month from the data and only use unique years.**
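
The preparation step in the hint can be sketched on a small synthetic frame. The column names mirror the ones used in this question, but the data is made up and the `.dt` accessor is just one possible way to do the extraction.

```python
import pandas as pd

# Illustrative monthly data covering two full years.
frame = pd.DataFrame({
    "date": pd.date_range("2001-01-01", periods=24, freq="MS"),
    "value": range(24),
})

# The .dt accessor exposes date parts of a datetime column.
frame["year"] = frame["date"].dt.year
frame["month"] = frame["date"].dt.strftime("%b")  # abbreviated month name

# One entry per distinct year, in order of first appearance.
years = frame["year"].unique()
rows_per_year = frame.groupby("year").size()
```

Looping over `years` and selecting `frame.loc[frame.year == y]` then gives one line per year, with the month names shared on the x-axis.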

In [None]:
### GRADED

# YOUR SOLUTION HERE
# Data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')
df.reset_index(inplace=True)

df['year'] = [d.year for d in df.date]
df['month'] = [d.strftime('%b') for d in df.date]
years = df['year'].unique()

plt.figure(figsize=(16,12), dpi=80)

for i, y in enumerate(years):
    if i > 0:
        plt.plot('month', 'value', data=df.loc[df.year==y, :], label=y)
        plt.text(df.loc[df.year==y, :].shape[0]-.9, df.loc[df.year==y, 'value'][-1:].values[0], y, fontsize=12)
        
plt.title("Seasonal Plot")
plt.savefig("results/plot12.png")

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
