# Introduction to Features and Labels

In machine learning, the concepts of **Features** and **Labels** are fundamental:  
- **Features**: These are the input data that the model learns from. They consist of various attributes (characteristics) that describe the data. For example, petal length and width can be Features.  
- **Labels**: These are the target values that the model aims to predict. For example, the species of a flower serves as the Label.

We'll use the Iris dataset provided by sklearn to explore how to view Features and Labels.
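As a minimal sketch of this split, the snippet below assumes scikit-learn is installed and uses the `return_X_y=True` option of `load_iris`, which returns the Features and Labels as two separate arrays. The names `X` and `y` are only a common convention and are not used elsewhere in this tutorial.

```python
from sklearn.datasets import load_iris

# return_X_y=True returns (features, labels) instead of the full dataset object
X, y = load_iris(return_X_y=True)

# Each row of X describes one flower; the matching entry of y is its label
first_flower_features = X[0]
first_flower_label = y[0]

print("Features of the first flower:", first_flower_features)
print("Label of the first flower:", first_flower_label)
print("One label per row of features:", len(X) == len(y))
```

The model only ever sees rows of `X` as input and is trained to reproduce the matching entries of `y`.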



# 1. Loading the Iris Dataset in sklearn

scikit-learn (imported as `sklearn`) ships with several small built-in datasets. The `load_iris` function in its `datasets` module loads the Iris dataset.


```python
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
# View the dataset keys
print(iris.keys())
```

```
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
```


The output lists the keys available in the Iris dataset. The key components are:

- `data`: The Feature values, one row per flower.
- `target`: The Label values, encoded as integers.
- `feature_names`: The names of the Features.
- `target_names`: The species names that the integer Labels stand for.
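The following sketch, assuming the same `load_iris()` call as above, checks how these components fit together. The object returned by `load_iris` supports both attribute access (`iris.data`) and dictionary-style access (`iris["data"]`); the variable names `same_object`, `data_shape`, and `target_shape` are illustrative only.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Attribute access and key access return the same underlying array
same_object = iris["data"] is iris.data

data_shape = iris.data.shape      # (number of samples, number of features)
target_shape = iris.target.shape  # one label per sample

print("Same array via key and attribute:", same_object)
print("Shape of data:", data_shape)
print("Shape of target:", target_shape)
print("Number of feature names:", len(iris.feature_names))
```

The number of feature names matches the number of columns in `data`, and `target` has one entry per row.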

# 2. Viewing Features and Labels

Let’s extract and inspect the Features and Labels.

```python
# Features (input data)
features = iris.data
feature_names = iris.feature_names
print("Feature names:", feature_names)
print("Feature data:\n", features[:5])  # View the first 5 samples
```

```
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature data:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
```


```python
# Labels (target data)
labels = iris.target
label_names = iris.target_names
print("Label names:", label_names)
print("Label data:\n", labels[:5])  # View the first 5 samples
```

```
Label names: ['setosa' 'versicolor' 'virginica']
Label data:
 [0 0 0 0 0]
```


- `features` holds the numerical measurements (sepal and petal length and width, in cm), one row per flower.
- `labels` holds the target values (0, 1, or 2), each representing a flower species.
- `feature_names` gives the name of each Feature column (e.g., `sepal length (cm)`).
- `label_names` maps the numeric Labels to species names: 0 is setosa, 1 is versicolor, and 2 is virginica.
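The mapping from numeric Labels to species names can be sketched with NumPy indexing. This example assumes NumPy is available (it is a dependency of scikit-learn); the variable names `species` and `counts` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Index the array of species names with the integer labels
species = iris.target_names[iris.target]

# Count how many samples carry each integer label (0, 1, 2)
counts = np.bincount(iris.target)

print("First five species:", species[:5])
for name, count in zip(iris.target_names, counts):
    print(name, count)
```

Because the first five Labels are all 0, the first five entries of `species` are all `setosa`.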

# 3. Converting to a Pandas DataFrame

If you are familiar with pandas, you can convert the dataset into a DataFrame, which labels every column and is easier to inspect.

```python
import pandas as pd

# Convert Features and Labels into a DataFrame
df = pd.DataFrame(features, columns=feature_names)
df["target"] = labels

# In a notebook, the last expression of a cell is displayed
df
```

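|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |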
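As an alternative sketch, `load_iris` accepts an `as_frame=True` argument (available in scikit-learn 0.23 and later, which is an assumption about your installed version). In that case the `frame` key holds a ready-made DataFrame with the Features and the `target` column. The `species` column added below is an illustrative extra, not part of the dataset.

```python
from sklearn.datasets import load_iris

# as_frame=True makes the loader return pandas objects
iris_as_frame = load_iris(as_frame=True)

# .frame combines the four Feature columns with the "target" column
frame = iris_as_frame.frame.copy()

# Add a readable species column by mapping 0/1/2 to the species names
code_to_name = dict(enumerate(iris_as_frame.target_names))
frame["species"] = frame["target"].map(code_to_name)

print(frame.head())
```

Keeping both `target` and `species` makes it easy to read the table while still having the integer Labels available for a model.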


# 4. Understanding `iris.DESCR` and the Difference from `describe()`

When working with datasets, it's often helpful to understand their structure, attributes, and metadata. In sklearn, datasets like `iris` include a `DESCR` attribute, which provides a textual description of the dataset.

---

## 1. Using `iris.DESCR`

The `DESCR` attribute in sklearn datasets gives a detailed description of the dataset, including:
- The purpose of the dataset
- The number of samples and features
- The target variable (labels)
- The feature names
- Any additional information about the dataset

Here’s how to use it:


```python
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()

# Print the dataset description
print(iris.DESCR)
```

```
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica

    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :
```
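Because `DESCR` is an ordinary Python string, it can be sliced and searched like any other text. The sketch below previews the first few non-empty lines and picks out the lines containing the phrase "Number of"; the variable names `preview` and `mentions` are illustrative, and the exact wording of the description may vary between scikit-learn versions.

```python
from sklearn.datasets import load_iris

iris = load_iris()

description = iris.DESCR
is_text = isinstance(description, str)

# Keep only the first five non-empty lines as a short preview
preview = [line for line in description.splitlines() if line.strip()][:5]
for line in preview:
    print(line)

# Find the lines that report counts such as the number of instances
mentions = [line.strip() for line in description.splitlines() if "Number of" in line]
print(mentions)
```

This is handy when the full description is long and you only need a specific detail.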

## 2. Is `describe()` the same?
No. `describe()` is a method of pandas DataFrames (and Series) that computes summary statistics for numerical data. For example:

```python
import pandas as pd

# Convert the Iris dataset to a DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Get statistical summary
df.describe()
```

|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


### Key Differences:
- `DESCR`: A plain-text description attached to sklearn's built-in datasets. It is fixed text, useful for metadata and context.
- `describe()`: A pandas method that computes summary statistics (count, mean, standard deviation, min, quartiles, max) from whatever data the DataFrame currently holds.

### When to Use Each:
- Use `iris.DESCR` to understand the dataset's context, background, and structure when loading data from sklearn.
- Use `df.describe()` to analyze the statistical distribution of your numerical data after converting it to a Pandas DataFrame.
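Since `describe()` is computed from the data, it can also be applied per group, which the fixed `DESCR` text cannot do. The sketch below is one possible way to summarize a single Feature for each species; the `species` column and the variable name `per_species` are illustrative additions.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Build the DataFrame and attach the species name for each row
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Summary statistics of petal length, computed separately for each species
per_species = df.groupby("species")["petal length (cm)"].describe()
print(per_species)
```

Each row of the result summarizes one species, which makes it easy to see how strongly petal length separates the three classes.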