# 1 Plotting and Data Visualisation

In this notebook we want to get familiar with plotting and the Python library ```matplotlib```. Plotting is an essential tool for data exploration that can help you to get an intuition about certain characteristics and features of data.


[Matplotlib](https://matplotlib.org/) is probably the most widely used Python plotting library and will be the one we are using in this course. However, there are alternatives that might be interesting for you, for instance [Seaborn](https://seaborn.pydata.org/) or [Plotly](https://plotly.com/) (for this course, we expect you to stick to matplotlib).

Let's install and import matplotlib:

In [None]:
!pip install matplotlib

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np #numpy will always have our back

## 1.1 Obtaining a dataset

For illustration purposes, we will use datasets that are canon in machine learning and data science. Those datasets are already preprocessed and easily obtainable through the [scikit-learn](https://scikit-learn.org/stable/) library.

Now, let's install scikit-learn (mind the version number 0.24.1! The ```load_boston``` function used below was removed in scikit-learn 1.2), import it, and load the dataset.

In [None]:
!pip install scikit-learn==0.24.1

In [None]:
from sklearn import datasets
boston = datasets.load_boston()

Now we have the Boston House Prices dataset. Let's try to explore it with ```matplotlib```.

In [None]:
print(boston["data"].shape)

As you can see, we have 506 samples with 13 features. (Actually, there is a 14th attribute, the target, which denotes the price of a house. It is stored separately under ```boston["target"]```.) Now, have a look at what features we are dealing with:

In [None]:
print(boston["feature_names"])

The given abbreviations may understandably seem cryptic to you. More information can be found in the dataset description:

In [None]:
print(boston["DESCR"])

## 1.2 Visualise features

Now, we want to get a feeling for a feature: how it is distributed and what we can learn from it. Hence, we will pick out one of the features and plot it in different ways.

As an example feature, we will take "RM", the average number of rooms.
For all subtasks of 1.2, the term "data" refers to the feature "RM" of the Boston Housing dataset.

### Task 1.2.1: Get the feature
Isolate the feature "RM" from the data and save the vector into the provided variable _rm_.
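
As a hint, here is a minimal sketch of how a single column can be selected from a samples-by-features matrix. The array ```toy_data``` and the names in ```toy_feature_names``` are made up for illustration and are not part of the Boston dataset; looking up the column index with ```np.where``` is only one of several possible approaches.

```python
import numpy as np

# Toy stand-in for a scikit-learn dataset: 4 samples (rows) x 3 features (columns).
# The names and values below are hypothetical and only serve as an illustration.
toy_data = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
    [4.0, 40.0, 400.0],
])
toy_feature_names = np.array(["A", "B", "C"])

# Find the column index that belongs to feature "B" ...
col = int(np.where(toy_feature_names == "B")[0][0])

# ... and slice that column out of every row.
feature_b = toy_data[:, col]

print(col)              # 1
print(feature_b.shape)  # (4,)
print(feature_b)        # [10. 20. 30. 40.]
```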

In [None]:
rm = # Your Code here

### Task 1.2.2: Describe the data
Use the skills you learnt in the lecture and the last assignment (and numpy) to extract meaningful properties from the data:
- attribute type (scale) of the data
- mean
- median
- maximum value
- minimum value
- variance
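
The numeric properties in the list above can be sketched on a small made-up array as follows. The values are hypothetical; note that ```np.var``` computes the population variance by default, while ```ddof=1``` gives the sample variance, so it is worth stating which one you report.

```python
import numpy as np

# Hypothetical values, chosen so the results are easy to check by hand.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

v_mean = np.mean(values)               # arithmetic mean
v_median = np.median(values)           # middle value of the sorted data
v_max = np.max(values)                 # largest value
v_min = np.min(values)                 # smallest value
v_var_pop = np.var(values)             # population variance (divides by n)
v_var_sample = np.var(values, ddof=1)  # sample variance (divides by n - 1)

print(v_mean, v_median, v_max, v_min)  # 5.0 4.5 9.0 2.0
print(v_var_pop)                       # 4.0
```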

In [None]:
# attribute type: your answer here
rm_mean =  # Your Code here
rm_median =  # Your Code here
rm_max =  # Your Code here
rm_min =  # Your Code here
rm_var =  # Your Code here

### Task 1.2.3: Show the distribution
Now, we are interested in the distribution of the number of rooms over the data, i.e. we want to visualise how many houses have how many rooms. 

Have a look at the sample plots on the [matplotlib website](https://matplotlib.org/stable/gallery/statistics/histogram_features.html) and choose an appropriate type of plot to display that information.

IMPORTANT: Do not forget to label your axes correctly!
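
For orientation, here is a minimal sketch of a labelled histogram on synthetic data. The random "heights" and the choice of 20 bins are assumptions made for this example only; a different bin count may suit the actual data better.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for illustration only: 500 made-up body heights in cm.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=170.0, scale=10.0, size=500)

# plt.hist returns the count per bin, the bin edges, and the drawn bars.
counts, bin_edges, _ = plt.hist(samples, bins=20, edgecolor="black")

plt.title("Distribution of synthetic heights")
plt.xlabel("Height in cm")
plt.ylabel("Number of samples")
plt.show()
```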

In [None]:
# Your Code here
plt.title("Your Title here")
plt.xlabel("Your X Label here")
plt.ylabel("Your Y Label here")
plt.show()

### Bonus Task 1.2.4: What type of function could describe the data approximately?

In [None]:
# Your Answer here

### Task 1.2.5: Show the boxplot and describe it
Now that you know how to plot with matplotlib, you are tasked to create a box and whiskers plot of the data. Have a look at the [official matplotlib documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html) if you need some guidance. Additionally, give a quick description of the plot and what you learn from it about the data.
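
The following sketch shows a box plot for a small made-up array and computes the quartiles it is based on. The values are hypothetical, and the whisker length of 1.5 times the interquartile range is matplotlib's default rather than something prescribed by this task.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical values; 30.0 is a deliberate outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# The box spans the first to the third quartile; the line inside is the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# By default, points further than 1.5 * IQR from the box are drawn as fliers.
box = plt.boxplot(values)
outliers = box["fliers"][0].get_ydata()

plt.title("Box plot of made-up values")
plt.xlabel("Sample group")
plt.ylabel("Value")
plt.show()

print(q1, median, q3, iqr)  # 3.0 5.0 7.0 4.0
```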

In [None]:
# Your Code here
plt.title("Your Title here")
plt.xlabel("Your X Label here")
plt.ylabel("Your Y Label here")
plt.show()

# Your description here

## 1.3 Visualise 2D data
After we have looked at the characteristics of a single feature, let's see how matplotlib can help to visualise two-dimensional data and single samples. Prominent 2D data are greyscale photos or images, where the two dimensions are the x and y positions of the pixel values.

### 1.3.1 Obtain an image dataset
Fortunately, Scikit-learn also provides a dataset with pixel images as samples.

In [None]:
digits = datasets.load_digits()
print(digits.keys())

Let's go the usual route and have a look at the shape of the data.

In [None]:
print(digits["data"].shape)

So, we have 1797 samples with 64 features each. That means, if we isolate a feature vector for a single sample, it has 64 features. But aren't we dealing with images that are usually 2-dimensional?

Perhaps the feature names can give more insight:

In [None]:
print(digits["feature_names"])

From the feature names, we can conclude that the pixel values are stored in a vector, row by row. In order to plot it as a picture, we need a 2D representation, though.

Therefore our tasks are now: 
- 1.) isolate the feature vector of a single sample
- 2.) reshape the vector into a 2D matrix
- 3.) plot the image using matplotlib

### Task 1.3.1 Isolate a feature vector
This task can be seen as the "inverse" of task 1.2.1. But now, instead of a single feature over all samples, we want all the features for a single sample!

Isolate a sample of your choice and save it in the variable _im_vec_.

In [None]:
# Your Code here
print(im_vec.shape)

### Task 1.3.2 Reshape the vector into a 2D matrix
Now, you should have a vector of length 64. The image samples of the dataset are square. So now you need to reshape the vector into the appropriate shape using NumPy. Save the resulting matrix into the variable _im_.

Hint: a helpful function is [numpy.reshape](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html).
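
To illustrate how reshaping works, here is a small sketch with a made-up vector of length 6 (not the digits data). With NumPy's default order, the vector is filled into the new shape row by row.

```python
import numpy as np

# Made-up flat vector of length 6.
flat = np.arange(6)

# Reshape into 2 rows and 3 columns; np.reshape(flat, (2, 3)) is equivalent.
grid = flat.reshape(2, 3)

print(grid.shape)  # (2, 3)
print(grid)
# [[0 1 2]
#  [3 4 5]]
```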

In [None]:
im = # your code here
print(im.shape)

### Task 1.3.3 Plot the image
Use the skills obtained above to plot the sample using the ```imshow``` [function from matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imshow.html).
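
As a rough sketch, ```imshow``` can be tried out on a small made-up array first. The checkerboard pattern and the ```"gray"``` colour map are choices made for this example only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up 4x4 checkerboard: the entry is 1 where row index + column index is odd.
pattern = np.indices((4, 4)).sum(axis=0) % 2

# Each array entry becomes one pixel; the colour map turns values into grey levels.
image = plt.imshow(pattern, cmap="gray")
plt.colorbar(image)
plt.title("Checkerboard test image")
plt.show()
```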

In [None]:
# your code here

## 1.4 Visual recognition of correlations

In this part you shall plot different attributes against each other in a scatterplot to find out if the selected attributes are correlated.

The [scatter()](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) function from matplotlib might be worthwhile to look into.
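
Here is a minimal sketch of a scatter plot on synthetic data. The variables ```x``` and ```y``` are made up, with ```y``` constructed to depend on ```x```. Computing the Pearson correlation coefficient with ```np.corrcoef``` is an optional addition that can back up the visual impression; it is not required by the tasks.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for illustration only: y grows with x, plus random noise.
rng = np.random.default_rng(seed=0)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 * x + rng.normal(0.0, 3.0, size=200)

# Pearson correlation coefficient: close to +1 or -1 means a strong linear relation.
r = np.corrcoef(x, y)[0, 1]

plt.scatter(x, y, s=10)
plt.title(f"Synthetic example (r = {r:.2f})")
plt.xlabel("x (made-up attribute)")
plt.ylabel("y (made-up attribute)")
plt.show()
```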

Don't forget the title and axis labels for your plots!


### Task 1.4.1 Nitric oxide concentration and industrial density

We assume a positive correlation between the proportion of non-retail business acres per town and the amount of nitric oxide in the air as a result of higher emissions in industrial areas.
Plot the associated attributes (INDUS and NOX) against each other and analyse the plot to find out if this assumption is true.

In [None]:
# Your code here

# is the assumption true?: your answer here.

### Task 1.4.2 House age and number of rooms

We assume a negative correlation between the number of rooms and the age of the house as a result of the trend towards more open interior design choices and larger rooms in modern housing.
Plot the associated attributes (RM and AGE) against each other and analyse the plot to find out if this assumption is true.

In [None]:
# Your code here

# is the assumption true?: your answer here.

### Task 1.4.3 Social status and price
We assume a negative correlation between the proportion of people with lower social status in a neighbourhood and the house prices.
Plot the associated attributes (LSTAT and MEDV, the target) against each other and analyse the plot to find out if this assumption is true.

In [None]:
# Your code here

# is the assumption true?: your answer here.