# Load sklearn datasets into OpenTURNS

In [1]:
import openturns as ot
import sklearn.datasets as skd

In [2]:
def datasetToSample(dataset):
    """Convert a sklearn dataset into an ot.Sample, with the target as last column."""
    n, p = dataset.data.shape  # n observations, p features
    sample = ot.Sample(n, p + 1)
    sample[:, 0:p] = dataset.data  # feature columns
    sample[:, p] = ot.Sample(dataset.target, 1)  # target column
    descr = list(dataset.feature_names)
    descr.append("Target")
    sample.setDescription(descr)
    return sample
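
The column layout produced by this function (features first, target last, with a matching description) can be illustrated without either library. The sketch below is only an illustration with plain lists: `toy` is a hypothetical stand-in for a sklearn dataset and `dataset_to_rows` is a hypothetical helper, not part of OpenTURNS or sklearn.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a sklearn dataset: two observations, two features.
toy = SimpleNamespace(
    data=[[5.1, 3.5], [4.9, 3.0]],
    target=[0, 1],
    feature_names=["sepal length (cm)", "sepal width (cm)"],
)


def dataset_to_rows(dataset):
    """Return (description, rows) with the target appended as the last column."""
    description = list(dataset.feature_names) + ["Target"]
    rows = [list(x) + [float(y)] for x, y in zip(dataset.data, dataset.target)]
    return description, rows


description, rows = dataset_to_rows(toy)
print(description)
print(rows)
```

Each row ends up with one more entry than the number of features, which is why the OpenTURNS sample is allocated with `p+1` columns.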

The following cells show the function in action on the California housing dataset.

In [3]:
dataset = skd.fetch_california_housing()

In [4]:
sample = datasetToSample(dataset)

In [5]:
print(sample)

        [ MedInc     HouseAge   AveRooms   ... Latitude   Longitude  Target     ]
    0 : [    8.3252    41          6.98413 ...   37.88    -122.23       4.526   ]
    1 : [    8.3014    21          6.23814 ...   37.86    -122.22       3.585   ]
    2 : [    7.2574    52          8.28814 ...   37.85    -122.24       3.521   ]
...
20637 : [    1.7       17          5.20554 ...   39.43    -121.22       0.923   ]
20638 : [    1.8672    18          5.32951 ...   39.43    -121.32       0.847   ]
20639 : [    2.3886    16          5.25472 ...   39.37    -121.24       0.894   ]


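The same function converts other sklearn datasets. Each cell below prints the dataset description, converts the dataset and exports the sample to a CSV file.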

In [6]:
dataset = skd.load_iris()
print(dataset.DESCR)
sample = datasetToSample(dataset)
sample.exportToCSVFile("Iris_dataset.csv")

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

In [7]:
dataset = skd.load_diabetes()
print(dataset.DESCR)
sample = datasetToSample(dataset)
sample.exportToCSVFile("Diabetes_dataset.csv")

Diabetes dataset

Notes
-----

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

Data Set Characteristics:

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attributes:
    :Age:
    :Sex:
    :Body mass index:
    :Average blood pressure:
    :S1:
    :S2:
    :S3:
    :S4:
    :S5:
    :S6:

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani

In [8]:
dataset = skd.load_wine()
print(dataset.DESCR)
sample = datasetToSample(dataset)
sample.exportToCSVFile("Wine_dataset.csv")

Wine Data Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
        - 1) Alcohol
        - 2) Malic acid
        - 3) Ash
        - 4) Alcalinity of ash
        - 5) Magnesium
        - 6) Total phenols
        - 7) Flavanoids
        - 8) Nonflavanoid phenols
        - 9) Proanthocyanins
        - 10) Color intensity
        - 11) Hue
        - 12) OD280/OD315 of diluted wines
        - 13) Proline
        - class:
                - class_0
                - class_1
                - class_2

    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:     