# Loading example datasets

**Agenda**

- How do we load datasets?

- Characteristics of the standard datasets bundled with scikit-learn (sklearn.datasets.load_*)

- Data requirements in scikit-learn


---

## How to load internal datasets 
### General dataset API
- Small standard datasets - sklearn.datasets.load_*
- Larger real world datasets - sklearn.datasets.fetch_*
- Generated (synthetic) datasets - sklearn.datasets.make_*
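
The three families listed above can be sketched as follows. This is only a minimal illustration: `make_classification` and its parameter values are arbitrary choices made here, and the `fetch_*` call is left commented out because it downloads data on first use.

```python
from sklearn.datasets import load_iris, make_classification

# load_*: small datasets bundled with scikit-learn (no download needed)
iris = load_iris()
print(iris.data.shape)

# fetch_*: larger datasets, downloaded on first use and cached locally
# from sklearn.datasets import fetch_california_housing
# housing = fetch_california_housing()

# make_*: synthetic data generated on the fly (parameter values are illustrative)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
print(X.shape, y.shape)
```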


### Introducing the iris dataset

![Iris](images/03_iris.png)

- 50 samples of each of 3 iris species (150 samples total)
- Features - 4 numeric measurements: sepal length, sepal width, petal length, petal width

### Machine learning on the iris dataset

- Framed as a **supervised learning** problem: Predict the species of an iris using the measurements
- This is perhaps the best-known dataset in the pattern recognition and machine learning literature.

### Loading the iris dataset into scikit-learn

In [2]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

In [3]:
# save "bunch" object containing iris dataset and its attributes
iris = load_iris()
type(iris)

sklearn.utils.Bunch

In [4]:
# print the iris data
print(iris.data)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 ...

### Machine learning terminology revisited

- Each row is a **sample** 
- Each column is a **feature** 

In [9]:
# print the names of the four features
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [6]:
# print integers representing the species of each observation
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [7]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


- Each value we are predicting is the **response** (also known as: target, outcome, label, dependent variable)
- **Classification** is supervised learning in which the response is categorical
- **Regression** is supervised learning in which the response is ordered and continuous
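
As a small sketch of why iris is a classification problem, the integer response codes can be mapped back to species names and counted. The variable names `species`, `codes`, and `counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# index the array of names with the integer codes to get one name per sample
species = iris.target_names[iris.target]
print(species[:3])

# a categorical response takes a small set of distinct values
codes, counts = np.unique(iris.target, return_counts=True)
print(codes, counts)
```

A response with only a handful of distinct, unordered values like this is categorical, so the task is classification rather than regression.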

## Requirements for working with data in scikit-learn

1. Features and response are **separate objects**
2. Features and response should be **numeric**
3. Features and response should be **NumPy arrays**
4. Features and response should have **specific shapes**

In [10]:
# check the types of the features and response
print(type(iris.data))
print(type(iris.target))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [None]:
# check the shape of the features (first dimension = number of observations, second dimension = number of features)
print(iris.data.shape)

In [None]:
# check the shape of the response (single dimension matching the number of observations)
print(iris.target.shape)
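
The four requirements can also be checked programmatically. The sketch below is one possible way to do it, not a required step; it uses the `return_X_y=True` option of `load_iris` to get features and response as two separate arrays, and the names `X` and `y` are just the usual convention.

```python
import numpy as np
from sklearn.datasets import load_iris

# 1. separate objects: return_X_y=True returns features and response separately
X, y = load_iris(return_X_y=True)

# 2. and 3. numeric NumPy arrays
assert isinstance(X, np.ndarray) and isinstance(y, np.ndarray)
assert np.issubdtype(X.dtype, np.number) and np.issubdtype(y.dtype, np.number)

# 4. specific shapes: X is 2-D (observations, features),
#    y is 1-D with one entry per observation
assert X.ndim == 2 and y.ndim == 1
assert X.shape[0] == y.shape[0]
print(X.shape, y.shape)
```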

## Exercise

Load the Wine dataset (load_wine) and print:
- data
- names of the features
- types of the features
- shape of the features
- names of the response
- types of the response
- shape of the response

In [13]:
# Load wine dataset
# import load_wine function from datasets module
from sklearn.datasets import load_wine
wines = load_wine()
type(wines)

sklearn.utils.Bunch

In [14]:
# Print the wine data 
print(wines.data)


[[1.423e+01 1.710e+00 2.430e+00 ... 1.040e+00 3.920e+00 1.065e+03]
 [1.320e+01 1.780e+00 2.140e+00 ... 1.050e+00 3.400e+00 1.050e+03]
 [1.316e+01 2.360e+00 2.670e+00 ... 1.030e+00 3.170e+00 1.185e+03]
 ...
 [1.327e+01 4.280e+00 2.260e+00 ... 5.900e-01 1.560e+00 8.350e+02]
 [1.317e+01 2.590e+00 2.370e+00 ... 6.000e-01 1.620e+00 8.400e+02]
 [1.413e+01 4.100e+00 2.740e+00 ... 6.100e-01 1.600e+00 5.600e+02]]


In [15]:
# Print the names of the features
print(wines.feature_names)

['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']


In [16]:
# Print the types of the features

print(type(wines.data))
print(type(wines.target))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [17]:
# Print the shape of the features
print(wines.data.shape)

(178, 13)


In [None]:
# Print the names of the response


In [None]:
# Print the types of the response


In [None]:
# Print the shape of the response


## How to load other datasets (example: importing a pandas DataFrame)

- scikit-learn expects X (feature matrix) and y (response vector) to be NumPy arrays.
- However, pandas is built on top of NumPy.
- Thus, X can be a pandas DataFrame and y can be a pandas Series!
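
A minimal sketch of this point, assuming the iris arrays are wrapped in pandas objects purely for illustration: `sklearn.utils.check_X_y` is a validation helper that scikit-learn provides, and here it shows a DataFrame and a Series being accepted and converted to NumPy arrays.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.utils import check_X_y

iris = load_iris()

# wrap the arrays in pandas objects (column and series names are illustrative)
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="species")

# validation accepts the pandas objects and returns plain NumPy arrays
X_arr, y_arr = check_X_y(X, y)
print(type(X_arr), X_arr.shape)
print(type(y_arr), y_arr.shape)
```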

In [7]:
import pandas as pd

# read the CSV file (a local path here; read_csv also accepts a URL) and save the result
Advertising_data = pd.read_csv('Advertising.csv', index_col=0)

print('Advertising data columns: ',Advertising_data.columns.values)

Advertising data columns:  ['TV' 'radio' 'newspaper' 'sales']
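
The `index_col=0` argument tells pandas to use the first CSV column as the row index instead of treating it as data. The sketch below shows the difference using made-up inline CSV text; the values are hypothetical and are not taken from Advertising.csv.

```python
import io
import pandas as pd

# made-up CSV text standing in for a file; the first column is a row label
csv_text = """,TV,radio,newspaper,sales
1,100.0,20.0,30.0,15.0
2,50.0,10.0,40.0,9.0
"""

# first column becomes the index, leaving only the data columns
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# without index_col, the unnamed first column is kept as an extra column
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```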


In [None]:
# create a Python list of feature names
feature_cols = ['TV', 'radio', 'newspaper']

# use the list to select a subset of the original DataFrame
Advertising_data_features = Advertising_data[feature_cols]

# equivalent command to do this in one line
# Advertising_data_features = Advertising_data[['TV', 'radio', 'newspaper']]

# display the first 5 rows of features
Advertising_data_features.head()

In [None]:
# display the last 5 rows of features
Advertising_data_features.tail()

In [None]:
# check the shape of the DataFrame (rows, columns) of features
Advertising_data_features.shape

In [None]:
# Select a subset of the original DataFrame to make target dataset
Advertising_data_target = Advertising_data['sales']
# display first 5 rows of target
Advertising_data_target.head()

In [None]:
# display last 5 rows of target
Advertising_data_target.tail()

In [None]:
# check the shape of target
Advertising_data_target.shape

What are the features?
- **TV:** advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
- **Radio:** advertising dollars spent on Radio
- **Newspaper:** advertising dollars spent on Newspaper

What is the response?
- **Sales:** sales of a single product in a given market (in thousands of items)

What else do we know?
- Because the response variable is continuous, this is a **regression** problem
- There are 200 **observations** (represented by the rows), and each observation is a single market

## Exercise
Load the Avocado dataset (avocado.csv) as Avocado_data. It has the following columns:
- Date - the date of the observation
- AveragePrice - the average price of a single avocado
- type - conventional or organic
- year - the year
- region - the city or region of the observation
- Total Volume - total number of avocados sold
- 4046 - total number of avocados with PLU 4046 sold
- 4225 - total number of avocados with PLU 4225 sold
- 4770 - total number of avocados with PLU 4770 sold

Then divide it into features (all columns but AveragePrice) and target (AveragePrice) and print:
- column names
- first 5 rows
- last 5 rows
- data shape




In [8]:
# Load avocado dataset as Avocado_data
Avocado_data = pd.read_csv('avocado.csv', index_col=0)

In [9]:
# split into features and targets
# 'Region' raised KeyError: "['Region'] not in index";
# assuming the CSV column is lowercase 'region' (column names are case-sensitive)
feature_cols = ['Date', 'type', 'year', 'region', 'Total Volume', '4046', '4225', '4770']
target_col = ['AveragePrice']
print(feature_cols)


['Date', 'type', 'year', 'region', 'Total Volume', '4046', '4225', '4770']


In [3]:
# targets
print(target_col)

['AveragePrice']


In [11]:
# Print last 5 rows for features
Avocado_data_features = Avocado_data[feature_cols]
Avocado_data_features.tail()
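
Selecting columns with a list raises a KeyError when any name is missing, and column labels are case-sensitive. One way to diagnose this before selecting is sketched below with a toy DataFrame; the values and the names `wanted`, `missing`, `lookup`, and `resolved` are hypothetical and only illustrate the check.

```python
import pandas as pd

# toy frame with made-up values; only the column names matter here
df = pd.DataFrame({
    "Date": ["2020-01-01"],
    "region": ["SomeCity"],
    "AveragePrice": [1.0],
})

wanted = ["Date", "Region"]

# column labels are case-sensitive, so 'Region' does not match 'region'
missing = [c for c in wanted if c not in df.columns]
print(missing)

# one possible workaround: resolve the names case-insensitively
lookup = {c.lower(): c for c in df.columns}
resolved = [lookup[c.lower()] for c in wanted]
print(resolved)
print(df[resolved].shape)
```

Printing `df.columns` first is usually the quickest check; the case-insensitive lookup is just one option when the exact spelling is uncertain.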

In [None]:
# Print first 5 rows for features


In [None]:
# Print first 5 rows for target


In [None]:
# Print last 5 rows for target


In [None]:
# Print data shape for features

In [None]:
# Print data shape for target