# Data Format


### Supervised Learning

Predict continuous values (regression) or classify samples into discrete categories (classification).

1. **Feature** Matrix ( 2D Grid of Data ) 

- Rows represent **Samples** and columns represent **Features**.

2. **Target** Vector ( 1D Vector ) 

- **Numeric** Continuous Values to be **Predicted**. 

- **Categorical** Discrete Labels to be **Classified**.
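As a rough illustration of the layout described above, here is a minimal sketch using made-up numbers. The array names (`X`, `y_regression`, `y_classification`) and all values are hypothetical and exist only to show the expected shapes: a 2D feature matrix of shape `(n_samples, n_features)` and a 1D target vector of length `n_samples`.

```python
import numpy as np

# Hypothetical toy data: 5 samples (rows) and 3 features (columns).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
    [6.7, 3.1, 4.7],
])

# Numeric target: one continuous value per sample (to be predicted).
y_regression = np.array([0.2, 0.2, 1.3, 1.8, 1.5])

# Categorical target: one discrete label per sample (to be classified).
y_classification = np.array([0, 0, 1, 2, 1])

n_samples, n_features = X.shape
print(X.ndim, X.shape)         # 2 (5, 3)
print(y_regression.shape)      # (5,)
print(y_classification.shape)  # (5,)
```

In both cases the target vector has exactly one entry per row of the feature matrix.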


### Unsupervised Learning

Discover structure and patterns in data.

1. **Clustering** 
- Discover natural groupings of similar samples in a dataset.

2. **Dimensionality Reduction**
- Compress the data into fewer features while discarding redundant or irrelevant information.
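The two unsupervised tasks above can be sketched on synthetic data. This is only an illustration: the random data, the choice of `KMeans` for clustering and `PCA` for dimensionality reduction, and the parameter values are assumptions made for the example, not something prescribed by this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical unlabeled data: two well-separated groups of 3-feature points.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 3))
group_b = rng.normal(loc=5.0, scale=0.5, size=(20, 3))
X = np.vstack([group_a, group_b])  # shape (40, 3); there is no target vector

# Clustering: assign each sample to one of two groups based on similarity.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: compress 3 features into 2 components.
X_reduced = PCA(n_components=2).fit_transform(X)

print(labels.shape)     # (40,)
print(X_reduced.shape)  # (40, 2)
```

Note that only a feature matrix is passed in; unlike supervised learning, no target vector is involved.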

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

In [2]:
help(load_iris)

Help on function load_iris in module sklearn.datasets._base:

load_iris(*, return_X_y=False, as_frame=False)
    Load and return the iris dataset (classification).
    
    The iris dataset is a classic and very easy multi-class classification
    dataset.
    
    Classes                          3
    Samples per class               50
    Samples total                  150
    Dimensionality                   4
    Features            real, positive
    
    Read more in the :ref:`User Guide <iris_dataset>`.
    
    Parameters
    ----------
    return_X_y : bool, default=False.
        If True, returns ``(data, target)`` instead of a Bunch object. See
        below for more information about the `data` and `target` object.
    
        .. versionadded:: 0.18
    
    as_frame : bool, default=False
        If True, the data is a pandas DataFrame including columns with
        appropriate dtypes (numeric). The target is
        a pandas DataFrame or Series depending on the number of

**Import** Dataset 

In [3]:
data = load_iris()  # Bunch object holding data, target, feature_names, ...
df = pd.DataFrame(data.data, columns=data.feature_names)
df['species'] = data.target  # integer class labels, one per sample
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


**Feature Matrix**

In [4]:
X = df.iloc[:, :-1]  # all rows, every column except the last (species)
X.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


**Target Vector**

In [5]:
y = df.iloc[:, -1]  # all rows, last column only (species)
y.head()

0    0
1    0
2    0
3    0
4    0
Name: species, dtype: int32

scikit-learn accepts pandas objects too, but NumPy arrays are its native input format, so we convert the feature matrix and target vector into **NumPy Arrays**.

In [6]:
X_array = X.to_numpy()  # pandas recommends to_numpy() over .values
X_array[:10]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

In [7]:
y_array = y.to_numpy()  # pandas recommends to_numpy() over .values
y_array[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
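As an alternative to going through a DataFrame, the `return_X_y` parameter listed in the `help(load_iris)` output above returns the feature matrix and target vector directly as a `(data, target)` pair. A minimal sketch, reusing the names `X_array` and `y_array` from the cells above:

```python
from sklearn.datasets import load_iris

# return_X_y=True returns (data, target) instead of a Bunch object.
X_array, y_array = load_iris(return_X_y=True)

print(X_array.shape)  # (150, 4): 150 samples, 4 features
print(y_array.shape)  # (150,): one class label per sample
```

This skips the pandas step entirely, which is convenient when the column names are not needed.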