# Loading Datasets into Scikit-learn


**Lesson Goals**

In this lesson you will learn how to:

    Load Scikit-learn's bundled datasets.
    Load other external datasets of the most relevant formats.
    Visualize your dataset.

**Introduction**

In the Machine Learning workflow presented in previous lessons, extracting data, transforming it, and loading your dataset are the first stages. When you read a dataset from your Python application, you load it into a data object: typically a DataFrame, an ndarray, a dictionary, or a list. How you load a dataset depends on whether it is bundled with scikit-learn and, if it is not, on the file format it is stored in. We will cover these cases separately and provide code snippets that you can reuse in your own Machine Learning workflows.

**Load Bundled Dataset**

As mentioned in the lesson introducing Scikit-learn, the library comes bundled with several datasets that you can load quickly from your Python application. There are three datasets representing regression problems:

    Boston house prices (deprecated in scikit-learn 1.0 and removed in version 1.2).
    Diabetes.
    Linnerud.

In these datasets, the domain of the target attribute is numeric.

There are also four datasets representing classification problems:

    Iris.
    Digits.
    Wine.
    Breast cancer.

In these datasets, the target attribute might be categorical or an integer with a limited number of values (for example, 0 or 1).

All of them are public open datasets that you can use to test your Machine Learning workflows.

You can load one of these datasets and have a look at the structure using the following Python code:

In [1]:
from sklearn import datasets

# load_diabetes() returns a dictionary-like object
diabetesDataset = datasets.load_diabetes()

# Print all attributes
diabetesDataset.keys()

dict_keys(['data', 'target', 'DESCR', 'feature_names', 'data_filename', 'target_filename'])

Printing the DESCR attribute of the dataset displays its documentation: what the dataset describes, the number of instances, the number of attributes, the target attribute, any preprocessing already performed on the dataset, and a citation of its original source. Note that this only works for the bundled datasets; data that you load from your own files has no such attribute.
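
As a minimal sketch of inspecting the loaded object, the snippet below reads a few of the attributes listed above. The variable names are illustrative, and the exact set of keys may differ between scikit-learn versions.

```python
from sklearn import datasets

# Load the bundled diabetes dataset (a dictionary-like object)
diabetes = datasets.load_diabetes()

# The feature matrix and the target vector are NumPy arrays
X = diabetes.data
y = diabetes.target
print(X.shape)  # (number of instances, number of features)
print(y.shape)  # (number of instances,)

# Names of the columns in the feature matrix
print(diabetes.feature_names)

# First few lines of the bundled documentation
print("\n".join(diabetes.DESCR.splitlines()[:10]))
```

Attribute access (`diabetes.data`) and key access (`diabetes['data']`) should return the same array here.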


**Load External Dataset**

The datasets that come bundled with Scikit-learn are very convenient to get you started quickly with building Machine Learning workflows and testing your code. But most of the time you will be working with external datasets that you will download from the web and load from your computer storage device (e.g. your hard disk) into your Python program.


# CSV Format

CSV stands for "Comma Separated Values." In a CSV file, data is stored as a table: each line is a row, and the column values (or features) within a row are separated by commas.

Typically, we read a CSV dataset into a pandas DataFrame, which is convenient because it keeps the column names and supports columns of different types. In this lesson, we will use the census data to demonstrate loading.

In [2]:
import pandas as pd

census = pd.read_csv('../census.csv')
census.head()

Unnamed: 0,CensusId,State,County,TotalPop,Men,Women,Hispanic,White,Black,Native,...,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
0,1001,Alabama,Autauga,55221,26745,28476,2.6,75.8,18.5,0.4,...,0.5,1.3,1.8,26.5,23986,73.6,20.9,5.5,0.0,7.6
1,1003,Alabama,Baldwin,195121,95314,99807,4.5,83.1,9.5,0.6,...,1.0,1.4,3.9,26.4,85953,81.5,12.3,5.8,0.4,7.5
2,1005,Alabama,Barbour,26932,14497,12435,4.6,46.2,46.7,0.2,...,1.8,1.5,1.6,24.1,8597,71.8,20.8,7.3,0.1,17.6
3,1007,Alabama,Bibb,22604,12073,10531,2.2,74.5,21.4,0.4,...,0.6,1.5,0.7,28.8,8294,76.8,16.1,6.7,0.4,8.3
4,1009,Alabama,Blount,57710,28512,29198,8.6,87.9,1.5,0.3,...,0.9,0.4,2.3,34.9,22189,82.0,13.5,4.2,0.4,7.7


It is often a good idea to check the shape of the resulting DataFrame:

In [3]:
census.shape

(3220, 37)

We can also look at the columns and their types using the dtypes attribute:

In [4]:
census.dtypes

CensusId             int64
State               object
County              object
TotalPop             int64
Men                  int64
Women                int64
Hispanic           float64
White              float64
Black              float64
Native             float64
Asian              float64
Pacific            float64
Citizen              int64
Income             float64
IncomeErr          float64
IncomePerCap         int64
IncomePerCapErr      int64
Poverty            float64
ChildPoverty       float64
Professional       float64
Service            float64
Office             float64
Construction       float64
Production         float64
Drive              float64
Carpool            float64
Transit            float64
Walk               float64
OtherTransp        float64
WorkAtHome         float64
MeanCommute        float64
Employed             int64
PrivateWork        float64
PublicWork         float64
SelfEmployed       float64
FamilyWork         float64
Unemployment       float64
dtype: object
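
The dtypes matter because scikit-learn estimators generally expect numeric arrays, while columns of type object (such as State and County) hold text. The sketch below shows one possible way to separate the numeric columns and convert them to ndarrays. It uses a small hand-built DataFrame with a few values copied from the output above rather than the real file, and the choice of Unemployment as the target is only an assumption for illustration.

```python
import pandas as pd

# Small stand-in for the census data (values copied from the head() output above)
sample = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama"],
    "County": ["Autauga", "Baldwin", "Barbour"],
    "TotalPop": [55221, 195121, 26932],
    "Unemployment": [7.6, 7.5, 17.6],
})

# Keep only the numeric columns; the object columns hold text
numeric = sample.select_dtypes(include="number")

# Hypothetical split: Unemployment as the target, the remaining columns as features
X = numeric.drop(columns="Unemployment").to_numpy()
y = numeric["Unemployment"].to_numpy()
print(X.shape, y.shape)
```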

Note that not all CSV files contain the column names in the first row. When a file has no header row, pass header=None to the read_csv function and supply the column names yourself through the names argument:

In [5]:
# We name the columns ourselves by passing a list of column names.
# For example, for a dataset with student names and ages
# (path is a placeholder for the location of your file):
#column_list = ['name', 'age']
#df = pd.read_csv(path, header=None, names=column_list)
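
To try the call above without a file on disk, here is a small runnable sketch. The two rows of student data are made up, and `io.StringIO` stands in for the file path.

```python
import io

import pandas as pd

# Made-up file contents with no header row; StringIO stands in for a file path
raw = io.StringIO("Ann,21\nBob,19\n")

# Name the columns ourselves, since the data has no header row
column_list = ["name", "age"]
df = pd.read_csv(raw, header=None, names=column_list)
print(df)
```

Without header=None, pandas would treat the first data row as the column names.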

# JSON Format

JSON is a popular data format. JSON stands for JavaScript Object Notation, and it is documented at www.json.org. A JSON object is a collection of name:value pairs, conceptually similar to a Python dictionary or a database record. To read a JSON dataset into a pandas DataFrame you can use this code:

In [6]:
#df = pd.read_json(path)
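
As a runnable sketch of the same call, the snippet below reads a small JSON array of records. The student data is made up, and `io.StringIO` stands in for the file path.

```python
import io

import pandas as pd

# Made-up JSON content: a list of flat records; StringIO stands in for a file path
raw = io.StringIO('[{"name": "Ann", "age": 21}, {"name": "Bob", "age": 19}]')

# Each record becomes a row and each name becomes a column
df = pd.read_json(raw)
print(df)
```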

Sometimes our data might be nested. To flatten the data, we can use the json_normalize function provided in Pandas. Here is an example of flattening an online json file containing information about different Pokemon.

Since we are using a file that is stored online rather than saved locally, we need to load the data from the web using the urllib library and then parse it into a Python object using the json library.



In [7]:
import json
from urllib.request import urlopen
import pandas as pd
# json_normalize is available from the top-level pandas namespace since pandas 1.0
from pandas import json_normalize

url = 'https://raw.githubusercontent.com/Biuni/PokemonGO-Pokedex/master/pokedex.json'
# In Python 3, urlopen(...).read() returns bytes, so we decode them into a string before parsing
url_read = urlopen(url).read().decode()
url_json = json.loads(url_read)

Before flattening, let's look at the first record in the data to see how it is structured:

In [8]:
url_json['pokemon'][0]

{'id': 1,
 'num': '001',
 'name': 'Bulbasaur',
 'img': 'http://www.serebii.net/pokemongo/pokemon/001.png',
 'type': ['Grass', 'Poison'],
 'height': '0.71 m',
 'weight': '6.9 kg',
 'candy': 'Bulbasaur Candy',
 'candy_count': 25,
 'egg': '2 km',
 'spawn_chance': 0.69,
 'avg_spawns': 69,
 'spawn_time': '20:00',
 'multipliers': [1.58],
 'weaknesses': ['Fire', 'Ice', 'Flying', 'Psychic'],
 'next_evolution': [{'num': '002', 'name': 'Ivysaur'},
  {'num': '003', 'name': 'Venusaur'}]}

Each record is a dictionary, and the records are all stored under a single top-level key. Flattening at the top level would therefore not produce a useful table. We can confirm this by looking at the keys of the top-level dictionary:

In [9]:
url_json.keys()

dict_keys(['pokemon'])

Therefore, we flatten one level down, passing the list of records stored under the 'pokemon' key to json_normalize.

In [10]:
pokemon = json_normalize(url_json['pokemon'])
pokemon.head()

Unnamed: 0,avg_spawns,candy,candy_count,egg,height,id,img,multipliers,name,next_evolution,num,prev_evolution,spawn_chance,spawn_time,type,weaknesses,weight
0,69.0,Bulbasaur Candy,25.0,2 km,0.71 m,1,http://www.serebii.net/pokemongo/pokemon/001.png,[1.58],Bulbasaur,"[{'num': '002', 'name': 'Ivysaur'}, {'num': '0...",1,,0.69,20:00,"[Grass, Poison]","[Fire, Ice, Flying, Psychic]",6.9 kg
1,4.2,Bulbasaur Candy,100.0,Not in Eggs,0.99 m,2,http://www.serebii.net/pokemongo/pokemon/002.png,"[1.2, 1.6]",Ivysaur,"[{'num': '003', 'name': 'Venusaur'}]",2,"[{'num': '001', 'name': 'Bulbasaur'}]",0.042,07:00,"[Grass, Poison]","[Fire, Ice, Flying, Psychic]",13.0 kg
2,1.7,Bulbasaur Candy,,Not in Eggs,2.01 m,3,http://www.serebii.net/pokemongo/pokemon/003.png,,Venusaur,,3,"[{'num': '001', 'name': 'Bulbasaur'}, {'num': ...",0.017,11:30,"[Grass, Poison]","[Fire, Ice, Flying, Psychic]",100.0 kg
3,25.3,Charmander Candy,25.0,2 km,0.61 m,4,http://www.serebii.net/pokemongo/pokemon/004.png,[1.65],Charmander,"[{'num': '005', 'name': 'Charmeleon'}, {'num':...",4,,0.253,08:45,[Fire],"[Water, Ground, Rock]",8.5 kg
4,1.2,Charmander Candy,100.0,Not in Eggs,1.09 m,5,http://www.serebii.net/pokemongo/pokemon/005.png,[1.79],Charmeleon,"[{'num': '006', 'name': 'Charizard'}]",5,"[{'num': '004', 'name': 'Charmander'}]",0.012,19:00,[Fire],"[Water, Ground, Rock]",19.0 kg
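
Notice that list-valued fields such as next_evolution are left as lists inside a single column. The following sketch shows how json_normalize handles the two kinds of nesting, using a small made-up sample modeled on the records above. The `size` dictionary is hypothetical and does not appear in the real file.

```python
import pandas as pd

# Made-up records; the nested "size" dictionary is hypothetical
records = [
    {"id": 1, "name": "Bulbasaur",
     "size": {"height_m": 0.71, "weight_kg": 6.9},
     "next_evolution": [{"num": "002", "name": "Ivysaur"},
                        {"num": "003", "name": "Venusaur"}]},
    {"id": 2, "name": "Ivysaur",
     "size": {"height_m": 0.99, "weight_kg": 13.0},
     "next_evolution": [{"num": "003", "name": "Venusaur"}]},
]

# Nested dictionaries become dotted column names such as "size.height_m";
# nested lists stay as lists inside one column
flat = pd.json_normalize(records)
print(flat.columns.tolist())

# record_path expands a nested list into one row per element;
# meta carries fields of the parent record along with each row
evolutions = pd.json_normalize(
    records,
    record_path="next_evolution",
    meta=["id"],
    record_prefix="next_",
)
print(evolutions)
```

The `record_prefix` argument keeps the names of the expanded fields distinct from the fields of the parent record.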


# Dataset Generator

Instead of finding an appropriate dataset, downloading it, and reading it just to test your workflow, you can use Scikit-learn to create a synthetic dataset that fits your needs. For instance, you can use sklearn.datasets.make_classification() to generate a synthetic dataset to test your classification workflow:

In [11]:
from sklearn import datasets

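# amt_instances, amt_features, and amt_classes are placeholders for your own values
#featureVectors, targets = datasets.make_classification(n_samples=amt_instances, n_features=amt_features, n_classes=amt_classes)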
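
A runnable sketch with concrete values follows. The sizes, the `n_informative` setting, and the `random_state` seed are arbitrary choices for illustration and are not prescribed by the lesson.

```python
import numpy as np
from sklearn import datasets

# Generate 200 instances with 10 features spread over 3 classes.
# n_informative is raised from its default of 2 so that three classes
# (with two clusters each) fit; random_state makes the result repeatable.
feature_vectors, targets = datasets.make_classification(
    n_samples=200,
    n_features=10,
    n_informative=5,
    n_classes=3,
    random_state=0,
)

print(feature_vectors.shape)  # (instances, features)
print(targets.shape)          # one class label per instance
print(np.unique(targets))     # the distinct class labels
```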