# Available datasets

As mentioned in the [Reading data](reading-data) section, `creme` ships with several datasets to experiment with:

In [1]:
from creme import stream

stream.available_datasets()

['AirlinePassengers',
 'Bananas',
 'Bikes',
 'ChickWeights',
 'CreditCard',
 'Elec2',
 'Friedman',
 'HTTP',
 'Higgs',
 'ImageSegments',
 'Insects',
 'MaliciousURL',
 'MovieLens100K',
 'Music',
 'Phishing',
 'Restaurants',
 'SEA',
 'SMSSpam',
 'SMTP',
 'SolarFlare',
 'TREC07',
 'Taxis',
 'TrumpApproval']

## Regression

In [3]:
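def print_datasets(task):
    """Print the description of every available dataset for the given task."""
    datasets = (
        stream.iter_dataset(name)
        for name in stream.available_datasets()
    )
    matching = (dataset for dataset in datasets if dataset.task == task)
    # Separate consecutive descriptions with a dashed rule
    print(f"\n\n{'-' * 20}\n\n".join(map(str, matching)))

print_datasets('Regression')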

Monthly number of international airline passengers.

The stream contains 144 items and only one single feature, which is the month. The goal is to
predict the number of passengers each month by capturing the trend and the seasonality of the
data.

    Name  AirlinePassengers                                                                   
    Task  Regression                                                                          
 Samples  144                                                                                 
Features  1                                                                                   
  Sparse  False                                                                               
    Path  /Users/mhalford/projects/creme-ml/creme/creme/stream/datasets/airline-passengers.csv

--------------------

Bike sharing station information from the city of Toulouse.

The goal is to predict the number of bikes in 5 different bike stations from the city of
Toulouse.

## Binary classification

In [4]:
print_datasets('Binary classification')

Bananas dataset.

An artificial dataset where instances belong to several clusters with a banana shape.
There are two attributes that correspond to the x and y axis, respectively.

    Name  Bananas                                                                 
    Task  Binary classification                                                   
 Samples  5,300                                                                   
Features  2                                                                       
  Sparse  False                                                                   
    Path  /Users/mhalford/projects/creme-ml/creme/creme/stream/datasets/banana.zip

--------------------

Credit card frauds.

The dataset contains transactions made by credit cards in September 2013 by European
cardholders. This dataset presents transactions that occurred in two days, where we have 492
frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class
(frauds)

## Multi-class classification

In [5]:
print_datasets('Multi-class classification')

Image segments classification.

This dataset contains features that describe image segments into 7 classes: brickface, sky,
foliage, cement, window, path, and grass.

    Name  ImageSegments                                                                
    Task  Multi-class classification                                                   
 Samples  2,310                                                                        
Features  18                                                                           
  Sparse  False                                                                        
    Path  /Users/mhalford/projects/creme-ml/creme/creme/stream/datasets/segment.csv.zip

--------------------

Insects dataset.

This dataset has different variants, which are:

- abrupt_balanced
- abrupt_imbalanced
- gradual_balanced
- gradual_imbalanced
- incremental-abrupt_balanced
- incremental-abrupt_imbalanced
- incremental-reoccurring_balanced
- incremental-reoccurring_imbalanced
- i

Note that the `'Insects'` dataset has multiple variants:

In [6]:
insects = stream.iter_dataset('Insects')
insects.variants

['abrupt_balanced',
 'abrupt_imbalanced',
 'gradual_balanced',
 'gradual_imbalanced',
 'incremental-abrupt_balanced',
 'incremental-abrupt_imbalanced',
 'incremental-reoccurring_balanced',
 'incremental-reoccurring_imbalanced',
 'incremental_balanced',
 'incremental_imbalanced',
 'out-of-control']

You can load a particular variant by passing the `variant` keyword argument to `iter_dataset`:

In [7]:
dataset = stream.iter_dataset('Insects', variant='abrupt_imbalanced')
dataset

Insects dataset.

This dataset has different variants, which are:

- abrupt_balanced
- abrupt_imbalanced
- gradual_balanced
- gradual_imbalanced
- incremental-abrupt_balanced
- incremental-abrupt_imbalanced
- incremental-reoccurring_balanced
- incremental-reoccurring_imbalanced
- incremental_balanced
- incremental_imbalanced
- out-of-control

The number of samples and the difficulty change from one variant to another. The number of
classes is always the same (6), except for the last variant (24).

      Name  Insects                                                                                   
      Task  Multi-class classification                                                                
   Samples  355,275                                                                                   
  Features  33                                                                                        
   Classes  6                                                                        
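The class counts stated in the description above can be captured in a small lookup. This is only a sketch: `INSECTS_VARIANTS` copies the list printed by `insects.variants`, and `n_classes` is a hypothetical helper that is not part of `creme`.

```python
# Variant names copied from the output of `insects.variants` above.
INSECTS_VARIANTS = [
    'abrupt_balanced',
    'abrupt_imbalanced',
    'gradual_balanced',
    'gradual_imbalanced',
    'incremental-abrupt_balanced',
    'incremental-abrupt_imbalanced',
    'incremental-reoccurring_balanced',
    'incremental-reoccurring_imbalanced',
    'incremental_balanced',
    'incremental_imbalanced',
    'out-of-control',
]


def n_classes(variant):
    """Return the number of classes of an Insects variant (hypothetical helper).

    Per the dataset description, every variant has 6 classes except the
    last one, 'out-of-control', which has 24.
    """
    if variant not in INSECTS_VARIANTS:
        raise ValueError(
            f"Unknown variant {variant!r}; expected one of {INSECTS_VARIANTS}"
        )
    return 24 if variant == INSECTS_VARIANTS[-1] else 6


print(n_classes('abrupt_imbalanced'))
print(n_classes('out-of-control'))
```

Checking the name up front gives a readable error before any data is loaded, which may be convenient when looping over several variants.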

## Multi-output binary classification

In [8]:
print_datasets('Multi-output binary classification')

Multi-label music mood prediction.

The goal is to predict which kinds of moods a song pertains to.

      Name  Music                                                                                 
      Task  Multi-output binary classification                                                    
   Samples  593                                                                                   
  Features  72                                                                                    
   Outputs  6                                                                                     
    Sparse  False                                                                                 
      Path  /Users/mhalford/creme_data/Music/music.csv                                            
       URL  https://raw.githubusercontent.com/scikit-multiflow/streaming-datasets/master/music.csv
      Size  370.1 KB                                                           

## Multi-output regression

In [9]:
print_datasets('Multi-output regression')

Solar flare multi-output regression.

    

    Name  SolarFlare                                                                       
    Task  Multi-output regression                                                          
 Samples  1,066                                                                            
Features  10                                                                               
 Outputs  3                                                                                
  Sparse  False                                                                            
    Path  /Users/mhalford/projects/creme-ml/creme/creme/stream/datasets/solar-flare.csv.zip
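The per-task listings above can also be gathered in a single pass. The sketch below uses a hand-written `TASKS` mapping, limited to the datasets whose summary table appears on this page, so that it runs without `creme` installed. With `creme` available, the same mapping could presumably be built from `stream.iter_dataset(name).task`. `group_by_task` is a hypothetical helper, not a `creme` function.

```python
from collections import defaultdict

# Task of each dataset whose summary table is shown on this page.
# Assumption: a real script would read `dataset.task` instead of hard-coding.
TASKS = {
    'AirlinePassengers': 'Regression',
    'Bananas': 'Binary classification',
    'ImageSegments': 'Multi-class classification',
    'Insects': 'Multi-class classification',
    'Music': 'Multi-output binary classification',
    'SolarFlare': 'Multi-output regression',
}


def group_by_task(tasks):
    """Invert a {dataset name: task} mapping into {task: sorted dataset names}."""
    groups = defaultdict(list)
    for name, task in tasks.items():
        groups[task].append(name)
    return {task: sorted(names) for task, names in groups.items()}


for task, names in group_by_task(TASKS).items():
    print(f'{task}: {", ".join(names)}')
```

Grouping once avoids re-instantiating every dataset for each task, which is what repeated calls to `print_datasets` do.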
