# Data Generation - Data from Internet Repositories

## OpenML

**OpenML.org** is a platform that provides datasets, tasks, and experiments for machine learning. It lets researchers and developers share and reuse data and results. The site hosts a large collection of datasets that can be used directly for machine learning.

scikit-learn supports loading data from OpenML through the `fetch_openml` function. It downloads a dataset directly from the internet and returns it as a pandas DataFrame or as NumPy arrays.
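
As a minimal sketch of the two return formats, the helper below wraps `fetch_openml` using its `as_frame` and `return_X_y` parameters. The name `load_xy` and the `fetch` hook are hypothetical and exist only for illustration; the hook makes it possible to try the helper without network access.

```python
def load_xy(name, version=1, as_numpy=False, fetch=None):
    """Return (X, y) for an OpenML dataset.

    `fetch` is a hypothetical hook for offline experiments; when it is
    omitted, scikit-learn's fetch_openml is used (needs network access).
    """
    if fetch is None:
        # Imported lazily so the helper can be defined without scikit-learn.
        from sklearn.datasets import fetch_openml as fetch
    # as_frame=True  -> pandas DataFrame / Series
    # as_frame=False -> NumPy arrays
    # return_X_y=True returns the (data, target) pair instead of a Bunch.
    return fetch(name=name, version=version,
                 as_frame=not as_numpy, return_X_y=True)
```

Under these assumptions, `load_xy('iris')` should return a DataFrame and a Series, while `load_xy('iris', as_numpy=True)` should return NumPy arrays.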

In [1]:
from sklearn.datasets import fetch_openml

### Iris-DB

In [10]:
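# Load the Iris dataset (version 1) as pandas objects
iris = fetch_openml(name='iris', version=1, as_frame=True)

print(iris.DESCR)          # dataset description
print(iris.data.head())    # first five feature rows
print(iris.target.head())  # first five class labels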

**Author**: R.A. Fisher  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Iris) - 1936 - Donated by Michael Marshall  
**Please cite**:   

**Iris Plants Database**  
This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Predicted attribute: class of iris plant.  
This is an exceedingly simple domain.  
 
### Attribute Information:
    1. sepal length in cm
    2. sepal width in cm
    3. petal length in cm
    4. petal width in cm
    5. class: 
       -- Iris Setosa
       -- Iris Versicolour
       -- Iris Virginica

Downloaded from openml.org.
   sepallength  sepalwidth  petallength  petalwidth
0          5.1 
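
The description above states that the dataset has 3 classes of 50 instances each. A quick way to check such a claim is to count the labels; the sketch below uses a small stand-in list, and the helper names `class_counts` and `is_balanced` are hypothetical.

```python
from collections import Counter


def class_counts(labels):
    """Count how often each class label occurs."""
    return Counter(labels)


def is_balanced(counts):
    """True if every class occurs equally often."""
    return len(set(counts.values())) == 1


# Stand-in labels for illustration; with the dataset loaded,
# iris.target could be passed instead.
sample = ["Iris-setosa"] * 2 + ["Iris-versicolor"] * 2 + ["Iris-virginica"] * 2
counts = class_counts(sample)
print(counts)
print(is_balanced(counts))
```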

### MNIST

In [11]:
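# Load MNIST (70,000 images, 784 pixel features each)
mnist = fetch_openml(name='mnist_784', version=1)

print(mnist.DESCR)
print(mnist.data.shape)   # (number of samples, number of features)
print(mnist.target[:5])   # first five digit labels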

**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges  
**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date unknown  
**Please cite**:  

The MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split into a training set of the first 60,000 examples and a test set of 10,000 examples.

It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were centered in a 28x28 image b
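
The description mentions a split into the first 60,000 examples for training and the remaining 10,000 for testing, and 28x28 images that yield 784 features. The following sketch shows such a positional split on tiny stand-in data; the helper name `split_train_test` is hypothetical, and it assumes the inputs support positional slicing (lists or NumPy arrays).

```python
IMAGE_SIDE = 28                       # images are 28x28 pixels
N_FEATURES = IMAGE_SIDE * IMAGE_SIDE  # 784 pixel features per image


def split_train_test(X, y, n_train=60_000):
    """Split by position: the first n_train rows train, the rest test."""
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]


# Tiny stand-in data (10 fake "images") so the sketch runs offline.
X_demo = [[i] * N_FEATURES for i in range(10)]
y_demo = [str(i) for i in range(10)]

X_train, X_test, y_train, y_test = split_train_test(X_demo, y_demo, n_train=6)
print(len(X_train), len(X_test))  # 6 4
```

With the real data, one option would be to pass `mnist.data.to_numpy()` and `mnist.target.to_numpy()`, assuming the dataset was loaded as pandas objects.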

### credit-g (Creditworthiness)

In [12]:
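from sklearn.datasets import fetch_openml

# Load the German Credit dataset (version 1) as pandas objects
credit = fetch_openml(name='credit-g', version=1, as_frame=True)

print(credit.DESCR)
print(credit.data.head())            # first five feature rows
print(credit.target.value_counts())  # number of good vs. bad credit risks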


**Author**: Dr. Hans Hofmann  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)) - 1994    
**Please cite**: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)

**German Credit dataset**  
This dataset classifies people described by a set of attributes as good or bad credit risks.

This dataset comes with a cost matrix: 
```
        Good  Bad   (predicted)
Good    0     1     (actual)
Bad     5     0
```

It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).  

### Attribute description  

1. Status of existing checking account, in Deutsche Mark.  
2. Duration in months  
3. Credit history (credits taken, paid back duly, delays, critical accounts)  
4. Purpose of the credit (car, television,...)  
5. Credit amount  
6. Status of savings account/bonds, in Deutsche Mark.  
7. Present employment, in number of years.  
8. Installment rate in percentage of disposable income  
9. Perso
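
The cost matrix in the dataset description above can be applied to a set of predictions to obtain a total misclassification cost. The sketch below assumes the class labels are the strings `good` and `bad`; the names `COST` and `total_cost` are hypothetical.

```python
# Cost matrix from the dataset description, keyed as (actual, predicted).
# Assumption: the labels are the strings "good" and "bad".
COST = {
    ("good", "good"): 0,
    ("good", "bad"): 1,   # good customer rejected
    ("bad", "good"): 5,   # bad customer accepted (the costlier error)
    ("bad", "bad"): 0,
}


def total_cost(actual, predicted):
    """Sum the cost-matrix entries over all (actual, predicted) pairs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(COST[(a, p)] for a, p in zip(actual, predicted))


actual = ["good", "good", "bad", "bad"]
predicted = ["good", "bad", "good", "bad"]
print(total_cost(actual, predicted))  # 0 + 1 + 5 + 0 = 6
```

A metric of this kind penalizes accepting a bad credit risk five times as heavily as rejecting a good one, which plain accuracy would not reflect.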

## Links

https://openml.org/