# Loading example datasets

**Agenda**

- How do we load datasets?

- Characteristics of the standard datasets that ship with scikit-learn (sklearn.datasets.load_*)

- Data Requirements in scikit-learn


---

## How to load internal datasets 
### General dataset API
- Small standard datasets - sklearn.datasets.load_*
- Larger real world datasets - sklearn.datasets.fetch_*
- Generate data - sklearn.datasets.make_*	
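
As a rough illustration of the three families listed above, the sketch below calls one `load_*` function and one `make_*` function. `make_classification` and its parameter values are picked here only as an example and are not part of the original material. A `fetch_*` call is left commented out because it downloads data on first use.

```python
from sklearn.datasets import load_iris, make_classification

# load_*: small dataset bundled with scikit-learn, no download needed
iris = load_iris()
print(iris.data.shape)

# make_*: synthetic data generated on the fly (parameter values are arbitrary)
X_synth, y_synth = make_classification(
    n_samples=100, n_features=5, random_state=0
)
print(X_synth.shape, y_synth.shape)

# fetch_*: larger dataset, downloaded and cached on first use, for example:
# from sklearn.datasets import fetch_california_housing
# housing = fetch_california_housing()
```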


### Introducing the iris dataset

![Iris](images/03_iris.png)

- 150 samples in total: 50 samples from each of 3 species of iris
- Features - 4 numeric measurements: sepal length, sepal width, petal length, petal width

### Machine learning on the iris dataset

- Framed as a **supervised learning** problem: Predict the species of an iris using the measurements
- This is perhaps the best known database to be found in the pattern recognition and machine learning literature.

### Loading the iris dataset into scikit-learn

In [None]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

In [None]:
# save "bunch" object containing iris dataset and its attributes
iris = load_iris()
type(iris)
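
The "bunch" object can be explored a little further. The sketch below assumes only the standard `load_iris` loader and shows that a Bunch works like a dictionary whose keys are also available as attributes. The `return_X_y=True` option is an alternative that returns the two arrays directly.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch behaves like a dictionary whose keys are also attributes
print(list(iris.keys()))
print(iris["data"] is iris.data)

# return_X_y=True skips the Bunch and returns the two arrays directly
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```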

In [None]:
# print the iris data
print(iris.data)

### Machine learning terminology revisited

- Each row is a **sample** 
- Each column is a **feature** 

In [None]:
# print the names of the four features
print(iris.feature_names)

In [None]:
# print integers representing the species of each observation
print(iris.target)

In [None]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)
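
To see the encoding scheme at work, one option is to index `target_names` with the integer codes in `target`, which yields one species name per sample. This is a small sketch built only on the two attributes printed above; the variable name `species` is introduced here for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Index the array of names with the integer codes: one name per sample
species = iris.target_names[iris.target]
print(species[:3])
print(species[-3:])
```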

- Each value we are predicting is the **response** (also known as: target, outcome, label, dependent variable)
- **Classification** is supervised learning in which the response is categorical
- **Regression** is supervised learning in which the response is ordered and continuous
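
One way to check which kind of response you have is scikit-learn's `type_of_target` helper, sketched below. The helper is not used elsewhere in this notebook, and the continuous values are made up purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils.multiclass import type_of_target

iris = load_iris()

# Integer class codes: a categorical response (classification)
print(type_of_target(iris.target))

# Made-up real-valued response, for illustration only (regression)
y_continuous = np.array([22.1, 10.4, 9.3, 18.5])
print(type_of_target(y_continuous))
```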

## Requirements for working with data in scikit-learn

1. Features and response are **separate objects**
2. Features and response should be **numeric**
3. Features and response should be **NumPy arrays**
4. Features and response should have **specific shapes**

In [None]:
# check the types of the features and response
print(type(iris.data))
print(type(iris.target))

In [None]:
# check the shape of the features (first dimension = number of observations, second dimension = number of features)
print(iris.data.shape)

In [None]:
# check the shape of the response (single dimension matching the number of observations)
print(iris.target.shape)
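
The manual checks above can also be approximated in a single call. The sketch below uses `sklearn.utils.check_X_y`, which is not part of the original walkthrough. It validates that the features are a 2-D numeric array, the response is 1-D, and both have the same number of samples. The `rejected` flag is a name introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.utils import check_X_y

iris = load_iris()

# Passes silently and returns the validated arrays when the inputs are fine
X, y = check_X_y(iris.data, iris.target)
print(X.shape, y.shape)

# A response with the wrong number of samples is rejected with a ValueError
rejected = False
try:
    check_X_y(iris.data, iris.target[:100])
except ValueError as err:
    rejected = True
    print("rejected:", err)
```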

## Exercise

Load the Wine dataset (load_wine) and print:
- data
- names of the features
- types of the features
- shape of the features
- names of the response
- types of the response
- shape of the response

In [None]:
# Load wine dataset
from sklearn.datasets import load_wine
wine = load_wine()

In [None]:
# Print the wine data 
print(wine.data)

In [None]:
# Print the names of the features
print(wine.feature_names)

In [None]:
# Print the types of the features
print(type(wine.data))


In [None]:
# Print the shape of the features
print(wine.data.shape)

In [None]:
# Print the names of the response
print(wine.target_names)

In [None]:
# Print the types of the response
print(type(wine.target))

In [None]:
# Print the shape of the response
print(wine.target.shape)

## How to load other datasets (example importing pandas dataframe)

- scikit-learn expects X (feature matrix) and y (response vector) to be NumPy arrays.
- However, pandas is built on top of NumPy.
- Thus, X can be a pandas DataFrame and y can be a pandas Series!
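
A minimal sketch of that point: a DataFrame and a Series are passed straight to an estimator without converting them first. The column names `a`, `b`, and `y`, the values, and the choice of `LinearRegression` are all assumptions made for illustration and do not come from the datasets used in this notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up values for illustration; y is constructed as exactly 2*a + 3*b
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 1.0, 4.0, 3.0, 6.0],
})
df["y"] = 2 * df["a"] + 3 * df["b"]

X = df[["a", "b"]]  # pandas DataFrame with shape (5, 2)
y = df["y"]         # pandas Series with shape (5,)

# The estimator accepts the pandas objects directly
model = LinearRegression().fit(X, y)
print(model.coef_)
```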

In [2]:
import pandas as pd

# read the CSV file from a local path and save the results
Advertising_data = pd.read_csv('Advertising.csv', index_col=0)

print('Advertising data columns: ',Advertising_data.columns.values)

Advertising data columns:  ['TV' 'radio' 'newspaper' 'sales']


In [None]:
# create a Python list of feature names
feature_cols = ['TV', 'radio', 'newspaper']

# use the list to select a subset of the original DataFrame
Advertising_data_features = Advertising_data[feature_cols]

# equivalent command to do this in one line
# Advertising_data_features = Advertising_data[['TV', 'radio', 'newspaper']]

# display the first 5 rows of features
Advertising_data_features.head()

In [None]:
# display the last 5 rows of features
Advertising_data_features.tail()

In [None]:
# check the shape of the DataFrame (rows, columns) of features
Advertising_data_features.shape

In [None]:
# Select a subset of the original DataFrame to make target dataset
Advertising_data_target = Advertising_data['sales']
# display first 5 rows of target
Advertising_data_target.head()

In [None]:
# display last 5 rows of target
Advertising_data_target.tail()

In [None]:
# check the shape of target
Advertising_data_target.shape
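
The shape of the target depends on how the column is selected. The sketch below uses a made-up stand-in with the same column names as `Advertising.csv` (the values are invented, not taken from the file) to contrast single-bracket and double-bracket selection.

```python
import pandas as pd

# Made-up stand-in with the same column names as Advertising.csv
demo = pd.DataFrame({
    "TV": [100.0, 50.0, 20.0],
    "radio": [30.0, 40.0, 45.0],
    "newspaper": [60.0, 45.0, 70.0],
    "sales": [20.0, 11.0, 9.0],
})

target_series = demo["sales"]    # single brackets: 1-D Series
target_frame = demo[["sales"]]   # double brackets: 2-D DataFrame

print(target_series.shape)
print(target_frame.shape)

# The underlying NumPy array of the Series is also 1-D
print(target_series.to_numpy().shape)
```

Single brackets give the one-dimensional response that matches the shape requirement described earlier.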

What are the features?
- **TV:** advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
- **Radio:** advertising dollars spent on Radio
- **Newspaper:** advertising dollars spent on Newspaper

What is the response?
- **Sales:** sales of a single product in a given market (in thousands of items)

What else do we know?
- Because the response variable is continuous, this is a **regression** problem
- There are 200 **observations** (represented by the rows), and each observation is a single market

## Exercise
Load the Avocado dataset (avocado.csv) as Avocado_data. Its columns include:
- Date - the date of the observation
- AveragePrice - the average price of a single avocado
- type - conventional or organic
- year - the year
- region - the city or region of the observation
- Total Volume - total number of avocados sold
- 4046 - total number of avocados with PLU 4046 sold
- 4225 - total number of avocados with PLU 4225 sold
- 4770 - total number of avocados with PLU 4770 sold

Divide it into features (all columns except AveragePrice) and target (AveragePrice), and print:
- column names
- first 5 rows
- last 5 rows
- data shape

In [3]:
# Load avocado dataset as Avocado_data
Avocado_data = pd.read_csv('avocado.csv', index_col=0)
print('Avocado data columns: ', Avocado_data.columns.values)

Avocado data columns:  ['Date' 'AveragePrice' 'Total Volume' '4046' '4225' '4770' 'Total Bags'
 'Small Bags' 'Large Bags' 'XLarge Bags' 'type' 'year' 'region']


In [None]:
# split into features and target
# first the features
Avocado_data_features = Avocado_data.drop(['AveragePrice'], axis=1)
# Print column names
print('Avocado data columns: ', Avocado_data_features.columns.values)

In [None]:
# then target
Avocado_data_target = Avocado_data['AveragePrice']


In [None]:
# Print last 5 rows for features
Avocado_data_features.tail()

In [None]:
# Print first 5 rows for features
Avocado_data_features.head()

In [None]:
# Print first 5 rows for target
Avocado_data_target.head()

In [None]:
# Print last 5 rows for target
Avocado_data_target.tail()

In [None]:
# Print data shape for features
Avocado_data_features.shape

In [None]:
# Print data shape for target
Avocado_data_target.shape
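
Some avocado columns (Date, type, region) hold text, so the feature table as split above would not yet meet the "numeric" requirement from earlier in this notebook. The sketch below shows one possible way to inspect column types and keep only the numeric ones. The rows are made up and only mimic a few of the avocado.csv column names, and `select_dtypes` is just one option among several.

```python
import pandas as pd

# Made-up rows that only mimic a few avocado.csv column names
demo = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20"],
    "Total Volume": [64236.62, 54876.98],
    "type": ["conventional", "organic"],
    "year": [2015, 2015],
})

# Inspect which columns are not numeric
print(demo.dtypes)

# Keep only the numeric columns
numeric_features = demo.select_dtypes(include="number")
print(list(numeric_features.columns))
```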