## ========================================================================
## Matplotlib and scikit-learn  
### Mahdi Shafiee Kamalabad
## ========================================================================

## 3. Matplotlib 
<center>
<img src = "attachment:4744b3e7-7011-483d-840c-4e38afcce8ce.jpeg" width = 400>
</center>


![matplotlib.png](attachment:matplotlib.png)


* *matplotlib* is a Python package for data plotting and visualization.
* It is highly recommended to explore its **documentation** to discover the many features that matplotlib offers. 
* Besides importing the package, to display plots in Jupyter Notebooks we need to specify the **magic command** `%matplotlib inline` (or `%matplotlib notebook` for interactive plotting). 
* The `%matplotlib` magic performs the necessary behind-the-scenes setup for IPython to work correctly hand in hand with matplotlib; it does not, however, actually execute any Python import command.
* We will make particular use of the `matplotlib.pyplot` module, which is the main tool used for data visualization.  

In [None]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

### 3.1 Overview 
We are now going to create a NumPy array and visualize it with pyplot. As we can see, plotting a set of data points is as easy as calling `plot` and passing the desired array as argument. 

In [None]:
data = np.arange(10) # + np.array([1,-2,3,2,6,-6,-4,3,3,7])
data
# np.arange(10)+[1,2,3,4,5,0,6,7,8,8]

In [None]:
# Plot 'data'
plt.plot(data)
plt.show()

### 3.2 Figures and Subplots
"Behind the scenes", `plot` creates a **figure** object. It is possible to **initialize** it explicitly with the `figure` function as follows: 

```
fig = plt.figure()
```

Among the options of `figure` is `figsize`, which sets the (width, height) of the figure in inches.

To show several plots in one figure, you can use the `add_subplot` method. It takes three arguments: **the number of rows, the number of columns, and the index of the subplot** (counted row by row, starting at 1). For example, 

```
ax1 = fig.add_subplot(2, 2, 1)
``` 

means that the figure should be 2x2 (thus, four plots in total) and we are going to operate on the first subplot. 
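
As a minimal sketch of how `figsize` and the subplot index work together, the snippet below builds a 2x2 grid and fills only two of its cells. The figure size and the names `ax_top_left` and `ax_bottom_right` are illustrative choices, not part of the original notebook.

```python
import matplotlib.pyplot as plt

# figsize is given as (width, height) in inches
fig = plt.figure(figsize=(8, 4))

# In a 2x2 grid, index 1 is the top-left cell and index 4 the bottom-right one
ax_top_left = fig.add_subplot(2, 2, 1)
ax_bottom_right = fig.add_subplot(2, 2, 4)

ax_top_left.set_title("index 1")
ax_bottom_right.set_title("index 4")

plt.show()
```

Cells that are never requested with `add_subplot` (here, indices 2 and 3) simply stay empty.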

Let's plot 4 blank subplots: 

In [None]:
fig = plt.figure()  # if you want to have subplots fig is necessary.
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)
ax4 = fig.add_subplot(2, 2, 4)

plt.show()

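Here we will plot the cumulative sum of a random NumPy array, and add it to the third subplot. Note that `plt.plot` draws on the most recently created subplot, which is why it is called right after creating `ax3`: 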

In [None]:
ForFig= np.random.randn(200)
#ForFig

In [None]:
ForFig.cumsum() # Cumulative sums are used to display the total sum of data 
                # as it grows with index (e.g. time)

In [None]:
# example
# print(np.arange(10))
# (np.arange(10)).cumsum()

In [None]:
fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)

ax3 = fig.add_subplot(2, 2, 3)
plt.plot(ForFig.cumsum(), "k--")  #---> plt.plot(data,....)

ax4 = fig.add_subplot(2, 2, 4)

plt.show()

`k--` is a style option; it says that the line we are drawing must be black (abbreviated with `k`) and dashed (the `--` symbol). We can add other plots to each subplot, either by calling a plotting method on an axes object such as `ax1` or `ax2`, or through the `plt` functions, which act on the most recently created subplot (in what follows, a histogram and a scatterplot). 

In [None]:
fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
# alpha sets the transparency of the bars (0 = invisible, 1 = opaque)
ax1.hist(np.random.randn(100), bins=20, color='k', alpha=0.3) # try increasing 100

ax2 = fig.add_subplot(2, 2, 2)
ax2.scatter(np.arange(30), np.arange(30) + 3 * np.random.randn(30))

ax3 = fig.add_subplot(2, 2, 3)
ax3.plot(np.random.randn(500).cumsum(), "k--")

plt.show()

### 3.3 Colors, Markers, Line Styles
* The `plot` function accepts an array of coordinates for the x-axis, an array of coordinates for the y-axis (they need to have the same size), and (optionally) `linestyle` and `color` arguments that define the appearance of the line. 
* The latter two arguments can be specified in abbreviated form. For instance, a green dashed line can be drawn with: 
```
ax.plot(x, y, linestyle='--', color='g')
```
which, abbreviated, can be specified as 
```
ax.plot(x, y, 'g--')
```
The `o` marker, instead, can be used to draw points. Points and lines can also be combined together, as in the next example. 

#### Colors
* `b`: blue
* `g`: green
* `r`: red
* `c`: cyan
* `m`: magenta
* `y`: yellow
* `k`: black
* `w`: white

#### Line styles
* `'-'` or `'solid'`: solid line
* `'--'` or `'dashed'`: dashed line
* `'-.'` or `'dashdot'`: dash-dotted line
* `':'` or `'dotted'`: dotted line
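
The abbreviations listed above can be tried out with a short sketch like the following. The data (sine and cosine curves) and the variable names are assumptions made for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 50)
fig, ax = plt.subplots()

# Keyword form: color and line style given separately
(line_kw,) = ax.plot(x, np.sin(x), color='g', linestyle='--', label="keywords")
# Abbreviated form: the format string 'r:' means red and dotted
(line_fmt,) = ax.plot(x, np.cos(x), 'r:', label="format string")
# Adding a marker letter draws points on top of the line: black, circles, dash-dotted
(line_marker,) = ax.plot(x[::5], np.sin(x[::5]) / 2, 'ko-.', label="with markers")

ax.legend()
plt.show()
```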

In [None]:
# The following command is equivalent to: 
# plt.plot(np.random.randn(50).cumsum(), color='k', linestyle='dashed', marker='o')
plt.plot(np.random.randn(50).cumsum(), 'ko--') # try other markers such as 'x' or 'v'
plt.show()

In the next example, we are going to see three more features: 
* `drawstyle` can be used to modify the default interpolation behaviour
* each plot can be given a label 
* labels given to each plot can then be visualized in a legend

In [None]:
data = np.random.randn(10).cumsum()
#plt.plot(data, 'go--', label="Default")
plt.plot(data, 'b-', drawstyle="steps-mid", label="steps-mid")   # 'b': blue color   # drawstyle='steps-mid'
plt.legend()
plt.show()


In [None]:
plt.plot(data, 'b-', drawstyle="steps-post", label="steps-post")   # 'b': blue color   # drawstyle='steps-post'
plt.legend()
plt.show()


Another useful option for lines is `linewidth`, which specifies their width. The corresponding argument for points is `markersize`. 
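
A minimal sketch of both options follows; the numeric values and the names `thick` and `thin` are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.random.randn(20).cumsum()

fig, ax = plt.subplots()
# A thick dashed line with large circular markers
(thick,) = ax.plot(values, 'bo--', linewidth=3, markersize=10,
                   label="linewidth=3, markersize=10")
# The same data shifted down, drawn with a thin line and small markers for comparison
(thin,) = ax.plot(values - 3, 'ro--', linewidth=0.5, markersize=3,
                  label="linewidth=0.5, markersize=3")
ax.legend()
plt.show()
```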

### 3.4 Ticks and Axis Labels
`xticks`, `xlim`, `xlabel` (and equivalently `yticks`, `ylim` and `ylabel`) can be used to customize, respectively, the tick values on the x/y-axis, the axis limits (useful for 'zooming in' and 'zooming out'), and the axis names. On an axes object, the corresponding methods are prefixed with `set_` (for example, `ax.set_xticks`). Let's see them 'in action': 

In [None]:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(np.random.randn(1000).cumsum())

# Set ticks for the x-axis
ticks = ax.set_xticks([0, 250, 500, 750, 1000])
labels = ax.set_xticklabels(['one', 'two', 'three', 'four', 'five'], rotation=30, fontsize=15) # comment this and see how it changes


# Set plot title
ax.set_title('Plotting with matplotlib is fun!')

# Set x-axis label
ax.set_xlabel('Stages')
# ax.set_ylabel('ffff')

plt.show()
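
The cell above sets ticks, tick labels, a title and an axis label, but not the axis limits. As a minimal sketch (the name `walk` and the limits 200 and 400 are arbitrary choices for illustration), `set_xlim` can be called on the axes object to zoom in on part of the series:

```python
import numpy as np
import matplotlib.pyplot as plt

walk = np.random.randn(1000).cumsum()

fig, ax = plt.subplots()
ax.plot(walk)

# Show only the points with index between 200 and 400 ('zooming in' on the x-axis)
ax.set_xlim(200, 400)
# Name the y-axis as well
ax.set_ylabel("Cumulative sum")

plt.show()
```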

### 3.5 Legends
We have seen above how to add legends. Let's see another example here, this time by plotting three lines. The option `loc='best'` automatically finds the best position for the legend. 

In [None]:
from numpy.random import randn
fig = plt.figure(); ax = fig.add_subplot(1, 1, 1)
ax.plot(randn(1000).cumsum(), 'r', label='one')
ax.plot(randn(1000).cumsum(), 'g--', label='two')
ax.plot(randn(1000).cumsum(), 'b.', label='three')
plt.legend(loc='best')
plt.show()

# For interested Students


### 3.6 Other objects
Matplotlib allows adding [text and annotations](https://matplotlib.org/3.1.1/tutorials/text/text_intro.html) to a plot, as well as [drawing shapes](https://matplotlib.org/api/patches_api.html) (called *patches* in matplotlib) such as arrows, rectangles, circles, etc. 
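
As a rough sketch of these objects (the coordinates, texts and shapes below are made up for illustration), text is added with `text`, an arrow with a label with `annotate`, and shapes are created as patch objects and attached to the axes with `add_patch`:

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle, Circle

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3, 4], [0, 1, 4, 9, 16])

# Plain text placed at data coordinates (x=0.5, y=12)
ax.text(0.5, 12, "y = x squared", fontsize=12)

# An annotation: text plus an arrow pointing at the point (3, 9)
ax.annotate("the point (3, 9)", xy=(3, 9), xytext=(1, 6),
            arrowprops=dict(arrowstyle="->"))

# Patches are shape objects that are added to the axes with add_patch
rect = Rectangle((0.2, 0.5), width=0.8, height=2.5, color='g', alpha=0.3)
circ = Circle((3.5, 2), radius=0.4, color='r', alpha=0.3)
ax.add_patch(rect)
ax.add_patch(circ)

plt.show()
```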

### 3.7 Barplots
The pandas methods `plot.bar` and `plot.barh` produce vertical and horizontal bar plots, respectively; they use the index of the Series as bar labels and its values as bar 'heights'. 

In [None]:
import pandas as pd 
data = pd.Series(np.random.rand(16), index=list('abcdefghijklmnop'))
data

In [None]:
fig, axes = plt.subplots(2,1, figsize=(5,7))
data.plot.bar(ax=axes[0], color='b', alpha=0.7)
data.plot.barh(ax=axes[1], color='y', alpha=0.7)
plt.show()
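
The same kind of chart can also be drawn with matplotlib alone. The sketch below assumes a small set of made-up labels and values; `bar` and `barh` take the labels (or positions) as first argument and the bar lengths as second argument.

```python
import numpy as np
import matplotlib.pyplot as plt

labels = ['a', 'b', 'c', 'd']
heights = np.array([0.3, 0.8, 0.5, 0.1])

fig, axes = plt.subplots(2, 1, figsize=(5, 7))
# Vertical bars: labels on the x-axis, heights as second argument
bars = axes[0].bar(labels, heights, color='b', alpha=0.7)
# Horizontal bars: labels on the y-axis, bar widths as second argument
hbars = axes[1].barh(labels, heights, color='y', alpha=0.7)
plt.show()
```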

### 3.8 Contours and heatmaps

We can draw **contours** and **heatmaps**, which allow us to represent a third dimension in a 2D-plot. We saw an example of a contour above. We replicate it here: 

In [None]:
points = np.arange(-5, 5, 0.01)    # 1000 equally spaced points
# Create a grid of coordinates for x-axis and y-axis 
x, y = np.meshgrid(points, points)
# Compute third dimensions across previous coordinates
z = np.sqrt(x ** 2 + y ** 2)
# Plot a contour plot
plt.figure(figsize=(5,5))
# contourf plots a `filled contour` (the function contour draws empty contours)
contour = plt.contourf(x, y, z, cmap="plasma")
plt.title(r"$\sqrt{x^2 + y^2}$", fontsize=15)
plt.xlabel('x')
plt.ylabel('y')
# colorbar adds a legend to the contour
plt.colorbar(contour)
plt.show()

A heatmap visualizes a table of numbers by replacing each number with a colored cell, where the color encodes the magnitude of the value. In other words, a heatmap (or heat map) depicts the values of a main variable of interest across two axis variables as a grid of colored cells. The axis variables are divided into ranges, as in a bar chart or histogram, and each cell’s color indicates the value of the main variable in the corresponding range.

We can also visualize heatmaps of a matrix, by means of the `imshow` function: 

In [None]:
mat = np.random.randn(10,5)
mat

In [None]:

plt.figure(figsize=(6,7))

plt.imshow(mat, cmap="winter")
plt.ylabel("Matrix rows")
plt.xlabel("Matrix columns")
plt.title("Matrix values")
plt.colorbar()
plt.show()

### 3.9 ***seaborn***
Another nice tool for data visualization is the `seaborn` package. It is particularly useful for plotting data directly from objects such as Pandas DataFrames: it can plot all columns of a DataFrame, as well as aggregated data, with a single command. We won't discuss `seaborn` here, but you should definitely take a look at its [website](https://seaborn.pydata.org/index.html) for some examples.  

<br>

## 4. scikit-learn
<center>
<img src = "attachment:32427e58-dafb-4670-b9cb-ea58b929f32a.png" width = 250>
</center>

First of all, a very simple definition of a *model*:

A **model** is a description of the data.

`scikit-learn` is probably the most famous library for Machine Learning with Python. It contains a wide range of models, evaluation measures, model selection methods, and many other functionalities. Furthermore, it has comprehensive documentation (linked at the end of the document), which includes accurate theoretical explanations along with detailed descriptions of its functions. We are going to see some features of `scikit-learn` here; in the rest of the course, we will use the library in more detail.    

### 4.1 Estimators 
`scikit-learn` provides modules for a large number of algorithms, including *supervised* and *unsupervised* learning. These algorithms allow estimating (also called *training* or *fitting*) Machine Learning models on NumPy arrays (and therefore also on Pandas objects). 

*Supervised learning* models (both classifiers and regressors) share a very similar interface: they have an *initializer*, a *fit* method for model training, a *predict* method to perform predictions, and a *score* method for model evaluation (*note*: this method returns a default score, but other metrics are possible, as we will see). The initializer takes the model hyperparameters; `predict` takes a data matrix of inputs X alone, while `fit` and `score` take X along with an array of outputs y. Such objects are called **estimators** in scikit-learn, and a blueprint of their interface can be represented as follows: 

```
class SupervisedEstimator(...):
    # Estimator initializer: takes hyperparameters only, never the data
    def __init__(self, **hyperparameters):
        ...

    # Fit to the training data given input X and output y
    def fit(self, X, y):
        ...
        return self

    # Perform prediction on input X
    def predict(self, X):
        ...
        return y_pred

    # Measure performance against true labels y
    def score(self, X, y):
        ...
        return score
```



Therefore, to train a model and, for instance, perform predictions with it, we should: 
1. initialize the estimator
1. train the model with `fit` 
1. perform the predictions with `predict` 
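
The three steps above can be sketched with `LinearRegression` as follows. The toy data, chosen to lie exactly on the line y = 2x + 1, are an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on the line y = 2x + 1 (chosen for illustration)
X = np.array([[0.0], [1.0], [2.0], [3.0]])   # inputs: 2D array, one row per observation
y = np.array([1.0, 3.0, 5.0, 7.0])           # outputs: 1D array

# 1. Initialize the estimator (hyperparameters, if any, go here)
model = LinearRegression()
# 2. Train the model with fit
model.fit(X, y)
# 3. Perform predictions with predict
y_new = model.predict(np.array([[4.0], [5.0]]))
print(y_new)

# Default score for regressors: coefficient of determination R^2
print(model.score(X, y))
```

Note that X must be two-dimensional even when there is a single input variable, which is why each value is wrapped in its own row.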

***

Besides estimators, scikit-learn also provides **transformers**, which perform data transformations. For example, the `SimpleImputer` transformer is used to impute missing values. Transformers share an interface similar to that of estimators, except that they replace the `predict` method with a `transform` method. They also allow performing the fit and the transformation in one go with the `fit_transform` method. 
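
A minimal sketch of the transformer interface, using `SimpleImputer` with mean imputation on a small made-up matrix (the data and the choice of `strategy="mean"` are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A small matrix with two missing values (np.nan)
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")

# fit learns the column means, transform fills in the gaps
imputer.fit(X)
X_filled = imputer.transform(X)

# fit_transform does both steps in one call
X_filled_again = SimpleImputer(strategy="mean").fit_transform(X)

print(X_filled)
```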

***

Both estimators and transformers, once fitted to the data, contain algorithm-specific attributes. Attributes are useful, for instance, to retrieve the estimated model coefficients or, for some classifiers, estimated class probabilities. Names of fitted attributes end with an underscore (`_`) in scikit-learn. 

Check, for example, the [linear regression algorithm documentation page](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html): can you recognize which ones are the arguments, the methods, and the attributes of the class?  
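
To make the distinction concrete, here is a small sketch on made-up data. It assumes the documented `fit_intercept` argument and the `coef_` and `intercept_` attributes of `LinearRegression`; the names `fitted_before` and `fitted_after` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X[:, 0] + 1          # exact line with slope 2 and intercept 1

# fit_intercept is an argument (hyperparameter) set at initialization
model = LinearRegression(fit_intercept=True)
fitted_before = hasattr(model, "coef_")    # fitted attributes do not exist yet

# fit is a method
model.fit(X, y)
fitted_after = hasattr(model, "coef_")

# coef_ and intercept_ are attributes estimated from the data
print(fitted_before, fitted_after)
print(model.coef_, model.intercept_)
```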

***

### 4.2 Other Features
**Unsupervised Learning**. Besides supervised learning estimators, scikit-learn also contains a huge number of unsupervised learning ones. For example, the [`sklearn.cluster`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster) module offers algorithms for clustering, while [`sklearn.decomposition`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition) offers algorithms for matrix decomposition (such as PCA, for instance). 
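
As a hedged illustration of the two modules mentioned above, the sketch below clusters simulated data with `KMeans` and reduces it to one dimension with `PCA`. The data, the number of clusters and the random seeds are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Two well-separated groups of 2D points (simulated for illustration)
group_a = rng.randn(50, 2)
group_b = rng.randn(50, 2) + 10
X = np.vstack([group_a, group_b])

# Clustering: no output y is passed to fit
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)
labels = kmeans.labels_            # cluster index assigned to each point

# Decomposition: project the data onto its first principal component
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
print(labels[:5], labels[-5:], X_reduced.shape)
```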

**Pipelines**. As we will see, scikit-learn allows chaining several transformers (plus one optional final estimator) into one object, called `Pipeline`. This can be found in the [`sklearn.pipeline` module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline). 
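
A minimal sketch of a pipeline that standardizes the input and then fits a linear regression. The simulated data and the step names `"scaler"` and `"regression"` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(100, 1)
y = 2 + 3.5 * X[:, 0] + rng.randn(100)

# Each step is a (name, object) pair; all steps but the last must be transformers
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regression", LinearRegression()),
])

# fit standardizes X and then trains the regression on the standardized data
pipe.fit(X, y)
# predict applies the same standardization before predicting
y_pred = pipe.predict(X)
print(pipe.score(X, y))
```

The pipeline itself behaves like an estimator, so it can be passed wherever a single model is expected.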

**Model and feature selection**. sklearn also includes several functions for model and feature selection, such as cross-validation, train-test split methods, and grid/random parameter search, as well as model-based feature selection algorithms. Model selection functions are in the [`sklearn.model_selection`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) module, while feature selection functions are in the [`sklearn.feature_selection`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection) module. 
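
For instance, a train-test split and a 5-fold cross-validation could look roughly like this. The simulated data, the 25% test size and the number of folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(100, 1)
y = 2 + 3.5 * X[:, 0] + rng.randn(100)

# Hold out 25% of the observations for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("Test R^2: {0:.2f}".format(model.score(X_test, y_test)))

# 5-fold cross-validation: one score per fold
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("Mean CV R^2: {0:.2f}".format(cv_scores.mean()))
```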

**Metrics**. Besides the default metrics contained in the estimators' `score` method, we can also compute other non-default metrics (such as root mean squared error, $F_1$ score, etc.). Metrics are in the [`sklearn.metrics`](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) module.
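
As a small sketch on made-up values, the root mean squared error can be obtained from `mean_squared_error`, and the $F_1$ score from `f1_score`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, f1_score

# Regression: root mean squared error computed from the mean squared error
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print("RMSE: {0:.3f}".format(rmse))

# Classification: F1 score for binary labels
labels_true = [0, 1, 1, 0, 1]
labels_pred = [0, 1, 0, 0, 1]
f1 = f1_score(labels_true, labels_pred)
print("F1: {0:.3f}".format(f1))
```

Metric functions take the true values first and the predicted values second.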

**Datasets and Data Generation routines**. `scikit-learn` also comes with a [`sklearn.datasets`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets) module, which lets you [load datasets](https://scikit-learn.org/stable/datasets/index.html) into your Python session and use them to compare model performance. Furthermore, scikit-learn also provides [toy data generators](https://scikit-learn.org/stable/datasets/index.html#generated-datasets), which can be used to generate data for various tasks and sample sizes. The [`make_classification`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification) function, for example, generates data with a desired sample size, a desired number of classes, and a desired number of features, for classification tasks.   
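
A brief sketch of both uses follows. The choice of the iris dataset and the particular sample size, number of features and number of classes are assumptions made for illustration.

```python
from sklearn.datasets import load_iris, make_classification

# Load a built-in dataset as (inputs, outputs) arrays
X_iris, y_iris = load_iris(return_X_y=True)
print(X_iris.shape, y_iris.shape)

# Simulate a classification dataset with 200 samples, 5 features and 3 classes
X_sim, y_sim = make_classification(n_samples=200, n_features=5, n_informative=3,
                                   n_redundant=0, n_classes=3, random_state=0)
print(X_sim.shape, set(y_sim))
```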

## 5. Final example

In this final example, we are going to use all the libraries we have explored in this tutorial to perform a small prediction task on a simulated dataset. In particular: 

* we will generate two continuous variables, an input x and an output y, with Numpy.
* we will include them in a Pandas DataFrame, and explore its descriptive statistics. 
* we will transform the data using a scikit-learn transformer (StandardScaler, that standardizes the data).
* we will train a linear regression model on such data with scikit-learn.
* we will compare the model's predictions against the true values using matplotlib.

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression 
%matplotlib inline

**1. Data Generation**

In [None]:
np.random.seed(1)
# Generate continuous x and y, with sample size 100 
x = np.random.randn(100)
y = 2 + 3.5 * x + np.random.randn(100)

**2. Summary Statistics**

In [None]:
# Create a Pandas DataFrame and compute summary statistics
data = pd.DataFrame({"x":x, "y":y})
data.head()

In [None]:
data.describe()

In [None]:
scatter_matrix(data)
plt.show()

**3. Standardize the input x with a scikit-learn transformer**.
Remember: scikit-learn transformers need to be initialized, before being used for estimation and transformation/prediction. 

In [None]:
# 1. Initialization
scl = StandardScaler()
# 2. Fit with the `fit` method
scl.fit(x.reshape(-1,1))
# 3. Transform the data 
x_standardized = scl.transform(x.reshape(-1,1))
# 4. new mean and std. deviation of x (compare with the results of Pandas describe() method above)
print("New mean: {0:.2f}; new standard deviation: {1:.2f}".format(x_standardized.mean(), x_standardized.std()))

**4. Train a Linear Regression model and perform predictions**.

In [None]:
# 1. Initialization
lin_reg = LinearRegression()
# 2. Train the model
lin_reg.fit(x_standardized,y)
# 3. Perform predictions
y_pred = lin_reg.predict(x_standardized)
y_pred

We can also evaluate the predictions with `score`, which by default returns the *coefficient of determination* $R^2$ for regression estimators:

In [None]:
print("Model R^2: {0:.2f}".format(lin_reg.score(x_standardized,y)))

We can also observe the estimated intercept and slope with the corresponding attributes (compare with the true values, 2 and 3.5, used in the data generating step: they are not exactly equal; can you detect the reason?): 

In [None]:
print("Model intercept: {0:.2f}; slope for x_standardized: {1:.2f}".format(lin_reg.intercept_, 
                                                                           lin_reg.coef_[0]))
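
One possible way to explore the question above is to convert the estimates back to the scale of the original x. The sketch below repeats the data generation so that it runs on its own, and assumes the `mean_` and `scale_` attributes of the fitted `StandardScaler`; the names `slope_original`, `intercept_original` and `raw_reg` are illustrative. Any difference from 2 and 3.5 that remains after the conversion comes from the random noise in a finite sample.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Same simulated data as in step 1
np.random.seed(1)
x = np.random.randn(100)
y = 2 + 3.5 * x + np.random.randn(100)

scl = StandardScaler().fit(x.reshape(-1, 1))
x_standardized = scl.transform(x.reshape(-1, 1))
lin_reg = LinearRegression().fit(x_standardized, y)

# x_standardized = (x - mean) / scale, so on the original scale:
#   slope     = coef / scale
#   intercept = intercept - coef * mean / scale
slope_original = lin_reg.coef_[0] / scl.scale_[0]
intercept_original = (lin_reg.intercept_
                      - lin_reg.coef_[0] * scl.mean_[0] / scl.scale_[0])
print("Intercept: {0:.2f}; slope for x: {1:.2f}".format(intercept_original,
                                                        slope_original))

# Fitting directly on the unstandardized x gives the same line
raw_reg = LinearRegression().fit(x.reshape(-1, 1), y)
```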

**5. Plot true values vs. model predictions**

In [None]:
plt.figure(figsize=(5,5))
plt.plot(x_standardized, y, "bo", markersize=8, label="True (generated) values")
plt.plot(x_standardized, y_pred, "r-", linewidth=1.5, label="Model predictions")
plt.title("Linear Regression: Observations vs. Model fit")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()

## 6. Other Useful Resources 

* [Numpy official tutorial](https://numpy.org/devdocs/user/quickstart.html) 
* [Numpy documentation](https://docs.scipy.org/doc/numpy/user/index.html)
* [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html) 
* [Matplotlib documentation](https://matplotlib.org/contents.html)
* [scikit-learn documentation](https://scikit-learn.org/stable/user_guide.html)

Other useful libraries for Data Science are: 
* [Scipy](https://docs.scipy.org/doc/scipy/reference/) for scientific computing with Python 
* [statsmodels](https://www.statsmodels.org/stable/index.html) for useful statistical routines 

Packages cheat-sheets: 
* [Numpy](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)
* [Pandas](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
* [Matplotlib](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf)
* [scikit-learn](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf)