# Exploring the iris dataset with Pandas

## First thing: import the pandas module.

In [None]:
import pandas as pd

You can find more information and tutorials on pandas here:

https://pandas.pydata.org/pandas-docs/stable/10min.html

https://pandas.pydata.org/pandas-docs/stable/tutorials.html

## Second: import the data.
We will play with the famous Iris dataset. This dataset can be found in many places on the net and is distributed by the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/index.php. For example, it is stored on Kaggle at https://www.kaggle.com/uciml/iris/, with many demos and Jupyter notebooks you can test (have a look at the "kernels" tab).

![Iris by Za, own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=144395](figures/iris_germanica.jpg)

In [None]:
irisData = pd.read_csv('Iris.csv')
irisData.head()

The description of the entries is given here:
https://www.kaggle.com/uciml/iris/home

In [None]:
irisData['Species'].unique()

In [None]:
irisData.describe()

## Let us make a graph from the features

We are going to build a graph from these data. The idea is to represent iris samples (rows of the table) as nodes, with connections depending on their physical similarity. 

The main question is how to define similarity between the flowers. For that, we need a measure that uses the properties of the flowers and provides a non-negative real value for each pair of samples.
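
As a concrete illustration of such a measure, the sketch below computes the Euclidean distance between two flowers by hand. The two measurement vectors are illustrative values, and the Euclidean distance is only one possible choice. Note that a distance is small for similar flowers, so strictly speaking it measures dissimilarity.

```python
import numpy as np

# Two illustrative samples: sepal length, sepal width, petal length, petal width (cm)
flower_a = np.array([5.1, 3.5, 1.4, 0.2])
flower_b = np.array([4.9, 3.0, 1.4, 0.2])

# Euclidean distance: square root of the sum of squared differences
distance = np.sqrt(np.sum((flower_a - flower_b) ** 2))
print(round(distance, 4))
```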

Let us separate the data into two parts, physical properties and labels.

In [None]:
# .copy() gives an independent table, so adding columns later does not touch irisData
irisfeatures = irisData.loc[:, ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].copy()
irisSpecies = irisData.loc[:, 'Species']

In [None]:
irisfeatures.head()

In [None]:
irisSpecies.head()

Let us import the tools needed to compute the similarity efficiently.

In [None]:
import numpy as np

In [None]:
from scipy.spatial.distance import pdist,squareform

The function `pdist` computes the pairwise distances between the rows of its input. By default it uses the Euclidean distance, so the more similar two flowers are, the smaller the value. `irisfeatures.values` is a numpy array extracted from the Pandas dataframe. Very handy.

In [None]:
weights = pdist(irisfeatures.values)

In [None]:
pdist?

In [None]:
# Turn the condensed vector of weights into a square matrix
W = squareform(weights)
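
To see what `pdist` and `squareform` return, here is a small sketch on three toy points (hypothetical data, not the iris features). `pdist` gives one value per pair of rows, and `squareform` arranges these values into a symmetric matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three toy points in the plane (hypothetical data, not the iris features)
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Condensed form: one entry per pair, in the order (0,1), (0,2), (1,2)
condensed = pdist(points)

# Square form: symmetric matrix with zeros on the diagonal
square = squareform(condensed)

print(condensed)      # three pairwise distances: 5, 10 and 5
print(square.shape)   # (3, 3)
```

For n samples the condensed vector has n(n-1)/2 entries, so the 150 samples of the Iris dataset give 11175 weights.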

Sometimes you need to compute additional features before feeding the data to a machine learning algorithm or another processing step. With Pandas, it is as simple as that:

In [None]:
# Compute a new column using the existing ones
irisfeatures['SepalLSquared'] = irisfeatures['SepalLengthCm']**2
irisfeatures.head()
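
If you prefer not to modify the feature table in place, one option is `DataFrame.assign`, which returns a new dataframe. The sketch below uses a small hypothetical table that reuses two of the iris column names; the `PetalArea` column is an invented example feature.

```python
import pandas as pd

# Small hypothetical table reusing two column names from the iris data
toy = pd.DataFrame({
    'PetalLengthCm': [1.4, 4.7, 6.0],
    'PetalWidthCm': [0.2, 1.4, 2.5],
})

# assign returns a new dataframe and leaves `toy` unchanged
toy_extended = toy.assign(PetalArea=toy['PetalLengthCm'] * toy['PetalWidthCm'])
print(toy_extended)
```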

Coming back to the weight matrix, we have obtained a full matrix but we may not need all the connections (reducing the number of connections saves some space and computations!). We can sparsify the graph by removing the values (edges) below some fixed threshold. Let us see what kind of threshold we could use:

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.hist(weights)
plt.title('Distribution of weights')
plt.show()
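
Besides reading the histogram, you can look at a few percentiles of the weights and at the fraction of pairs that a candidate threshold would keep. The sketch below uses random data as a stand-in for the iris features, and the choice of the median as threshold is arbitrary.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Random stand-in for a feature table: 50 samples, 4 features (not the iris data)
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 4))
weights_demo = pdist(features)

# Quartiles of the pairwise distances
q25, q50, q75 = np.percentile(weights_demo, [25, 50, 75])

# Fraction of pairs whose weight is at least the candidate threshold
threshold = q50
kept_fraction = np.mean(weights_demo >= threshold)
print(q25, q50, q75, kept_fraction)
```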

In [None]:
# Let us choose a threshold of 3
W[W<3] = 0
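
Keep in mind that the entries of `W` are distances, so the thresholding above keeps the pairs that are far apart. If you want large weights to mean similar flowers instead, a common option (not used in this notebook) is to pass the distances through a Gaussian kernel before thresholding. Here is a minimal sketch on toy points, where the kernel width `sigma` and the cutoff `0.1` are arbitrary assumptions:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy points: two close together, one far away (hypothetical data)
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
distances = squareform(pdist(points))

# Gaussian kernel: distance 0 maps to 1, large distances map to nearly 0
sigma = 2.0  # arbitrary kernel width
similarity = np.exp(-distances**2 / (2 * sigma**2))

# Remove self-loops and weak connections
np.fill_diagonal(similarity, 0)
similarity[similarity < 0.1] = 0
print(similarity)
```

With this construction only the two nearby points stay connected, and the distant point becomes isolated.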

To conclude the construction of the graph, let us visualize it. We will use the Python module named `networkx`. Don't forget to run `pip install networkx` in the command line beforehand.

In [None]:
import networkx as nx

In [None]:
# A simple command to create the graph from the weight matrix
G = nx.from_numpy_array(W)
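
To check how `from_numpy_array` interprets the matrix, here is a small sketch with a toy weight matrix (hypothetical values). Each nonzero entry becomes an edge whose `weight` attribute holds the matrix value, and zero entries produce no edge.

```python
import numpy as np
import networkx as nx

# Toy symmetric weight matrix: only nodes 0 and 1 are connected
W_toy = np.array([[0.0, 5.0, 0.0],
                  [5.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0]])

G_toy = nx.from_numpy_array(W_toy)

print(G_toy.number_of_nodes())   # 3
print(G_toy.number_of_edges())   # 1
print(G_toy[0][1]['weight'])     # 5.0
```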

In [None]:
# Save the graph to a gexf file, readable by Gephi
nx.write_gexf(G,'irisGraph.gexf')

Let us try some direct visualizations using networkx:

In [None]:
nx.draw_spectral(G)

Oh! It seems to be separated into three parts! Are they related to the three different species of iris?
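
One way to investigate is to color each node by its species label. The sketch below shows the idea on a toy graph with hypothetical labels; `pd.factorize` turns the labels into integer codes that networkx can map to colors.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

# Toy graph and hypothetical labels, one label per node
G_toy = nx.path_graph(4)
labels = pd.Series(['Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-virginica'])

# factorize maps each distinct label to an integer code
codes, species_names = pd.factorize(labels)

# Nodes with the same code get the same color
nx.draw_spectral(G_toy, node_color=codes, cmap=plt.cm.viridis)
plt.close()
```

With the variables of this notebook, the same idea would read `codes, _ = pd.factorize(irisSpecies)` followed by `nx.draw_spectral(G, node_color=codes)`, assuming the node order of `G` matches the row order of the table.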

Let us try another one, where the edges are modeled as springs:

In [None]:
nx.draw_spring(G)

You may now explore the graph using Gephi and compare the visualizations.