# Meet the data

The data we will use for this example is the Iris dataset, a classic dataset in machine learning and statistics. It is included in `scikit-learn` in the `datasets` module. We can load it by calling the `load_iris` function:

In [1]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()


The `iris_dataset` object returned by `load_iris` is a `Bunch` object, which is very similar to a dictionary. It contains keys and values:

In [3]:
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))

Keys of iris_dataset: 
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
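
Because a `Bunch` behaves much like a dictionary, its values can be read by key, and they can also be read as attributes. The following is a minimal sketch of both access styles; the variable names `data_by_key` and `data_by_attr` are introduced here only for illustration.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Dictionary-style access, as used throughout this section
data_by_key = iris_dataset['data']

# Attribute-style access refers to the same underlying array
data_by_attr = iris_dataset.data

print(data_by_key is data_by_attr)      # both names point at one object
print('target_names' in iris_dataset)   # membership test works as for a dict
```

Either style can be used; this section sticks to the key-based form.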


The value of the key `DESCR` is a short description of the dataset. We show the beginning of the description here (feel free to look up the rest yourself):

In [4]:
print(iris_dataset['DESCR'][:193] + "\n...")

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive 
...


In [5]:
print("Target names: {}".format(iris_dataset['target_names']))

Target names: ['setosa' 'versicolor' 'virginica']


In [6]:
print("Feature names: \n{}".format(iris_dataset['feature_names']))

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [7]:
print("Type of data: {}".format(type(iris_dataset['data'])))

Type of data: <class 'numpy.ndarray'>


In [8]:
print("Shape of data: {}".format(iris_dataset['data'].shape))

Shape of data: (150, 4)


`scikit-learn` always assumes that your data follows a specific layout, a two-dimensional array with:
- properties (features) in the columns
- items (samples) in the rows

Here the array holds 150 samples with 4 features each. The first five samples are shown below:

In [9]:
print("First five rows of data: \n{}".format(iris_dataset['data'][:5]))

First five rows of data: 
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
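
The row and column convention can be made concrete with a small sketch. It rebuilds the five rows printed above as a NumPy array (the names `first_five`, `one_sample`, and `one_feature` are illustrative, not part of the dataset) and selects one sample and one feature.

```python
import numpy as np

# First five samples, copied from the output above
first_five = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])

# Rows are samples, columns are features
n_samples, n_features = first_five.shape

one_sample = first_five[0]       # all four measurements of the first flower
one_feature = first_five[:, 2]   # third column: petal length (cm) of every flower

print(n_samples, n_features)
print(one_sample)
print(one_feature)
```

Indexing with a single integer picks a row (one flower), while `[:, j]` picks a column (one measurement across all flowers).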


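The `target` array encodes each species as an integer from 0 to 2. The meanings follow the order of `iris_dataset['target_names']`:
- 0 means _setosa_
- 1 means _versicolor_
- 2 means _virginica_
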
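
The integer encoding can be turned back into species names by indexing the array of names with the labels. The sketch below uses a small, hypothetical label array (`example_targets`) chosen only for illustration; it is not taken from the dataset.

```python
import numpy as np

target_names = np.array(['setosa', 'versicolor', 'virginica'])

# Hypothetical labels for four flowers, chosen only for illustration
example_targets = np.array([0, 2, 1, 0])

# Indexing the name array with the integer labels decodes them in one step
decoded = target_names[example_targets]
print(decoded)
```

The same indexing should work with the real arrays, for example `iris_dataset['target_names'][iris_dataset['target']]`.
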
## Measuring Success: Training and Testing Data

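Let's call `train_test_split` on our data and assign the outputs using the usual `scikit-learn` nomenclature, in which the data is denoted by a capital `X` and the labels by a lowercase `y`: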

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)


The `train_test_split` function returns `X_train`, `X_test`, `y_train`, and `y_test`, which are all NumPy arrays. By default, `X_train` contains 75% of the rows of the dataset, and `X_test` contains the remaining 25%:

In [11]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))

X_train shape: (112, 4)
y_train shape: (112,)


In [12]:
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_test shape: (38, 4)
y_test shape: (38,)
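
The 75%/25% proportion is a default and can be written out explicitly with the `test_size` parameter. The sketch below does that and also passes `stratify`, an optional argument not used in the split above, so that the three species stay roughly balanced in both parts. The `_strat` variable names are illustrative and chosen so the arrays created earlier are not overwritten.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# test_size=0.25 spells out the default 75%/25% split;
# stratify=y keeps the class proportions similar in both parts
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

print("Train shape: {}".format(X_train_strat.shape))
print("Test shape: {}".format(X_test_strat.shape))

# Number of samples per class in each part
print("Train class counts: {}".format(np.bincount(y_train_strat)))
print("Test class counts: {}".format(np.bincount(y_test_strat)))
```

Since 25% of 150 is 37.5, the test set is rounded up to 38 rows and the training set receives the remaining 112, matching the shapes printed above.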


## Working with Scatterplots

One of the best ways to inspect data is to visualize it. One way to do this is by using a _scatter plot_. A scatter plot of the data puts one feature along the x-axis and another along the y-axis, and draws a dot for each data point. Unfortunately, computer screens have only two dimensions, which allows us to plot only two (or maybe three) features at a time. It is difficult to plot datasets with more than three features this way. One way around this problem is to do a _pair plot_, which looks at all possible pairs of features. If you have a small number of features, such as the four we have here, this is quite reasonable. You should keep in mind, however, that a pair plot does not show the interaction of all of the features at once, so some interesting aspects of the data may not be revealed when visualizing it this way.
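
To see why a pair plot is manageable for four features, it helps to enumerate the pairs. The following sketch uses only the standard library and the feature names printed earlier; the variable names are illustrative.

```python
from itertools import combinations

feature_names = ['sepal length (cm)', 'sepal width (cm)',
                 'petal length (cm)', 'petal width (cm)']

# Every unordered pair of distinct features gets its own scatter plot
pairs = list(combinations(feature_names, 2))

for x_feature, y_feature in pairs:
    print("{} vs. {}".format(x_feature, y_feature))

print("Number of distinct pairs: {}".format(len(pairs)))
```

The number of pairs grows as n(n-1)/2, so four features give six pairs, while ten features would already give 45.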


## Converting the NumPy Array into a pandas DataFrame

To create the plot, we first convert the NumPy array into a `pandas` `DataFrame`. `pandas` has a function to create pair plots called `scatter_matrix`. The diagonal of this matrix is filled with histograms of each feature. Before converting the Iris data, here is a small example of building a `DataFrame` from a dictionary:

In [18]:
import pandas as pd
from IPython.display import display

# Create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location': ["New York", "Paris", "Berlin", "London"],
        'Age': [24, 13, 53, 33]
       }

data_pandas = pd.DataFrame(data)
# IPython.display allows "pretty printing" of DataFrames
# in the Jupyter notebook
display(data_pandas)



|   | Name  | Location | Age |
|---|-------|----------|-----|
| 0 | John  | New York | 24  |
| 1 | Anna  | Paris    | 13  |
| 2 | Peter | Berlin   | 53  |
| 3 | Linda | London   | 33  |
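
Once data is in a `DataFrame`, rows can be selected by a condition on a column. As a small sketch using the same people data (the name `older_than_30` is illustrative), the following keeps only the rows where `Age` is greater than 30.

```python
import pandas as pd

data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location': ["New York", "Paris", "Berlin", "London"],
        'Age': [24, 13, 53, 33]}
data_pandas = pd.DataFrame(data)

# The comparison yields a boolean mask; indexing with it keeps matching rows
older_than_30 = data_pandas[data_pandas['Age'] > 30]
print(older_than_30)
```

The selected rows keep their original index labels, so Peter and Linda still appear under 2 and 3.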


In [21]:
# mglearn is the helper package that provides the cm3 colormap used below;
# it must be installed separately (for example with `pip install mglearn`)
import mglearn

# Create a DataFrame from the data in X_train and
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)

# Create a scatter matrix from the DataFrame, colored by y_train
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                 hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)
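
The cell above takes its colors from `mglearn.cm3`, a colormap from the separate `mglearn` helper package. Where that package is not available, one alternative is a colormap that ships with Matplotlib. The sketch below is self-contained and uses `'viridis'`, which is an arbitrary choice made here, so the colors will differ from the `mglearn` version.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

# Label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)

# Same call as above, but with a built-in Matplotlib colormap
# ('viridis' is an assumption; any registered colormap name should work)
axes = pd.plotting.scatter_matrix(
    iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
    hist_kwds={'bins': 20}, s=60, alpha=.8, cmap='viridis')

print(axes.shape)  # one subplot per combination of row and column feature
```

The returned array holds one Matplotlib axes object per panel, so individual subplots can be adjusted afterwards if needed.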