Copyright © 2020 IUBH Internationale Hochschule

**Exploratory data analysis**

"Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns,to spot anomalies,to test hypothesis and to check assumptions with the help of summary statistics and graphical representations." [1]

This notebook gives some introductory examples of exploratory data analysis:

- introduction of the scikit-learn iris-dataset [->](#iris)
- scatter plot of 'sepal length' vs 'sepal width' in relation to their classes [->](#scat1)
- correlation matrix of the features [->](#corrMat)
- box-plot of features [->](#boxplot)
- Andrews curves plot [->](#ac)
- parallel coordinates plot [->](#parallelCoordinates)
- 'RadViz' plot [->](#radviz)

[1], [What is Exploratory Data Analysis?](https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15), visited on 24.08.2020

<a id="iris">**Iris Dataset**</a><br>
In this notebook the iris-dataset is used for demonstrating different kinds of data visualizations.

"The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.[1] It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species.[2] Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".[3] Fisher's paper was published in the journal, the Annals of Eugenics, creating controversy about the continued use of the Iris dataset for teaching statistical techniques today.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other."[1]

In scikit-learn the dataset is an object of the [Bunch](https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html#sklearn.utils.Bunch) class, a container object that exposes its keys as attributes.<br>
Detailed description of the iris data set in scikit-learn: [scikit.datasets](https://scikit-learn.org/stable/datasets/index.html#iris-dataset)

[1], [Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set), Wikipedia
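As a minimal sketch of what "exposing keys as attributes" means, the snippet below reads the same entries of the Bunch once by key and once by attribute. The variable name `iris` is chosen here for illustration only; the notebook itself uses `skBunch_iris`.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch behaves like a dict whose keys are also available as attributes.
print(sorted(iris.keys()))

# Key access and attribute access return the same underlying array.
print(iris['data'].shape)
print(iris.data.shape)

# Class labels and column names used throughout the notebook.
print(list(iris.target_names))
print(iris.feature_names)
```

Either access style can be used in the following cells; the notebook uses key access when building the DataFrame.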

In [None]:
# load the iris dataset into a pandas DataFrame
#
from sklearn.datasets import load_iris
import pandas as pd
#
# - load the dataframe manually:
# load numeric data into frame:
skBunch_iris = load_iris()
df_man = pd.DataFrame(skBunch_iris['data'], columns=skBunch_iris['feature_names'])
#
# add the iris-class names
targets = skBunch_iris['target']
target_names = skBunch_iris['target_names']
#
df_man['class'] = target_names[targets]

In [None]:
# Return a tuple representing the dimensionality of the DataFrame.
df_man.shape
# 

In [None]:
# Return the first n rows (default: n=5)
df_man.head()

In [None]:
# Generate descriptive statistics
df_man.describe()

<a id="scat1">**Plot 2 features in relation to their class**</a>

Plot a simple scatter plot of 2 features of the iris dataset. [1]

[1], [scikit-learn example](http://scipy-lectures.org/packages/scikit-learn/auto_examples/plot_iris_scatter.html)

In [None]:
#
# plot data in a scatter-plot
#
import matplotlib.pyplot as plt
#
#
# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: skBunch_iris.target_names[int(i)])

plt.figure(figsize=(8, 6))
plt.scatter(df_man['sepal length (cm)'], df_man['sepal width (cm)'], c=targets)
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')

plt.tight_layout()
plt.show()

<a id="corrMat">**Correlation matrix**</a><br>
Compute pairwise correlation of columns, excluding NA/null values.

Description of the function:

**DataFrame.corr(method='pearson', min_periods=1)**
<br>Parameters:
- method ({‘pearson’, ‘kendall’, ‘spearman’} or callable): method of correlation:<br>

    pearson : standard correlation coefficient

    kendall : Kendall Tau correlation coefficient

    spearman : Spearman rank correlation

    callable: callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

    New in version 0.24.0.<br>

- min_periods (int), optional:<br>Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.

<br>Returns:
- DataFrame: correlation matrix. [1]

[1],[pandas api](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)
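To make the `method` parameter more concrete, here is a small sketch that recomputes two of the coefficients by hand for one pair of columns and compares them with the result of `DataFrame.corr`. The names `df`, `pearson_manual` and `spearman_manual` are illustrative and not part of the notebook; the choice of the two petal columns is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris['data'], columns=iris['feature_names'])

x = df['petal length (cm)']
y = df['petal width (cm)']

# Pearson: covariance divided by the product of the standard deviations
# (the normalisation factors cancel, so plain sums are enough).
dx = x - x.mean()
dy = y - y.mean()
pearson_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Spearman: Pearson correlation of the ranks (ties get the average rank).
spearman_manual = x.rank().corr(y.rank())

corr_pearson = df.corr(method='pearson')
corr_spearman = df.corr(method='spearman')

print(pearson_manual, corr_pearson.loc['petal length (cm)', 'petal width (cm)'])
print(spearman_manual, corr_spearman.loc['petal length (cm)', 'petal width (cm)'])
```

Both printed pairs should agree up to floating-point rounding. The rank-based Spearman coefficient is less sensitive to outliers and to monotonic but non-linear relationships than Pearson.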

In [None]:
# correlation matrix of feature_names
# 
# create dataframe without class-column
dfFeature = pd.DataFrame(skBunch_iris['data'], columns=skBunch_iris['feature_names'])
#
f = plt.figure(figsize=(8, 6))
corrM = dfFeature.corr()
#
plt.matshow(corrM, fignum=f.number, cmap=plt.get_cmap('plasma'))
plt.xticks(range(dfFeature.shape[1]), dfFeature.columns, fontsize=11, rotation=45)
plt.yticks(range(dfFeature.shape[1]), dfFeature.columns, fontsize=11)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=11)
# plt.title('Correlation Matrix', fontsize=14);

plt.show()

<a id="boxplot">**Boxplot of each of the features**</a>

"Make a box-and-whisker plot from DataFrame columns, optionally grouped by some other columns.<br> A box plot is a method for graphically depicting groups of numerical data through their quartiles. The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2).<br> The whiskers extend from the edges of box to show the range of the data. By default, they extend no more than 1.5 * IQR (IQR = Q3 - Q1) from the edges of the box, ending at the farthest data point within that interval. Outliers are plotted as separate dots."[1]

[1], [pandas api](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.boxplot.html)
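The quantities that define a box plot can also be computed directly. The following sketch does this for a single feature, assuming the default whisker length of 1.5 * IQR described above; the variable names (`col`, `lower_fence`, `upper_fence`, and so on) are illustrative only.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
col = df['sepal width (cm)']

# The box: first quartile, median, third quartile.
q1, median, q3 = col.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whiskers may reach at most 1.5 * IQR beyond the box ...
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# ... and end at the farthest data point inside that interval.
inside = col[(col >= lower_fence) & (col <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual outlier point.
outliers = col[(col < lower_fence) | (col > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} .. {whisker_high}")
print("outliers:", sorted(outliers.tolist()))
```

Comparing these numbers with the plot of the same feature is a quick way to check how the box, the whiskers and the outlier dots are derived.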

In [None]:
# box plot of each of the features
#
df_man.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False, figsize=(8, 6))
plt.show()

<a id="ac">**Andrews curves**</a>

"Andrews curves allow one to plot multivariate data as a large number of curves that are created using the attributes of samples as coefficients for Fourier series, see the [Wikipedia entry](https://en.wikipedia.org/wiki/Andrews_plot) for more information. By coloring these curves differently for each class it is possible to visualize data clustering. Curves belonging to samples of the same class will usually be closer together and form larger structures." [1]

[1],  [pandas plotting-tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#plotting-tools)
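The following sketch shows how a single sample is turned into one curve. It assumes the usual Andrews form f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + ... evaluated on the interval from -pi to pi; the helper `andrews_curve` is a hypothetical illustration, not a pandas function, and the sample values are the first row of the iris data.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate the Andrews curve of one sample at the points t."""
    row = np.asarray(row, dtype=float)
    # The first feature is the constant term of the Fourier series.
    result = np.full_like(t, row[0] / np.sqrt(2.0))
    # Remaining features alternate between sine and cosine terms
    # with increasing frequency: sin(t), cos(t), sin(2t), cos(2t), ...
    for k, coeff in enumerate(row[1:]):
        harmonic = k // 2 + 1
        if k % 2 == 0:
            result += coeff * np.sin(harmonic * t)
        else:
            result += coeff * np.cos(harmonic * t)
    return result

t = np.linspace(-np.pi, np.pi, 200)
setosa_sample = [5.1, 3.5, 1.4, 0.2]  # first row of the iris data
curve = andrews_curve(setosa_sample, t)
print(curve[:5])
```

Samples with similar feature values produce similar coefficients and therefore similar curves, which is why classes show up as bundles in the plot.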

In [None]:

from pandas.plotting import andrews_curves
#
andrews_curves(df_man, 'class', colormap='jet')
plt.show()

<a id="parallelCoordinates">**Parallel coordinates**</a>

"Parallel coordinates is a plotting technique for plotting multivariate data, see the [Wikipedia entry](https://en.wikipedia.org/wiki/Parallel_coordinates) for an introduction. Parallel coordinates allows one to see clusters in data and to estimate other statistics visually. Using parallel coordinates points are represented as connected line segments. Each vertical line represents one attribute. One set of connected line segments represents one data point. Points that tend to cluster will appear closer together." [1]

[1], [pandas plotting-tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#plotting-tools)

In [None]:
#
from pandas.plotting import parallel_coordinates
#
parallel_coordinates(df_man, 'class', colormap='jet')
plt.show()

<a id="radviz">**RadViz-plot**</a>

"RadViz is a way of visualizing multi-variate data. It is based on a simple spring tension minimization algorithm. Basically you set up a bunch of points in a plane. In our case they are equally spaced on a unit circle. Each point represents a single attribute. You then pretend that each sample in the data set is attached to each of these points by a spring, the stiffness of which is proportional to the numerical value of that attribute (they are normalized to unit interval). The point in the plane, where our sample settles to (where the forces acting on our sample are at an equilibrium) is where a dot representing our sample will be drawn. Depending on which class that sample belongs it will be colored differently. See the R package [Radviz](https://cran.r-project.org/package=Radviz/) for more information." [1]

[1], [pandas plotting-tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#plotting-tools)
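The spring analogy can be sketched numerically: with each feature normalized to the unit interval, the equilibrium position of a sample is the weighted average of the anchor points, using the normalized feature values as weights. The snippet below is an illustrative reconstruction under that assumption, not the pandas source code; the names `normalized`, `anchors` and `positions` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
features = pd.DataFrame(iris['data'], columns=iris['feature_names'])

# Normalize every feature to the unit interval [0, 1].
normalized = (features - features.min()) / (features.max() - features.min())

# One anchor point per feature, equally spaced on the unit circle.
m = normalized.shape[1]
angles = 2 * np.pi * np.arange(m) / m
anchors = np.column_stack([np.cos(angles), np.sin(angles)])  # shape (m, 2)

# Equilibrium of the springs: weighted average of the anchor points,
# with the normalized feature values acting as spring stiffnesses.
weights = normalized.to_numpy()
positions = weights @ anchors / weights.sum(axis=1, keepdims=True)

print(positions[:3])
```

Because each position is a weighted average of points on the unit circle, every sample lands inside the circle, and a sample is pulled towards the anchors of the features in which it has relatively large values.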

In [None]:
#
from pandas.plotting import radviz
#
radviz(df_man, 'class', colormap='Spectral')
plt.show()
