# Datasets and Benchmark

Since we are going to work with data and datasets, ultimately to show some examples of data analysis and machine learning, 
**in this notebook** we present a fairly new package in the Python ecosystem, `pmlb`, which provides ready-to-use datasets for data analysis and Machine Learning.


### Don't worry

If you are an absolute beginner, or you simply don't know what Machine Learning is, **nothing to worry about**!
We will talk about it in the next section! :)

**In this notebook** we just focus on the `pmlb` package from a technical perspective. 

If you are completely new to concepts like **Supervised Learning**, **Classification**, or **Regression**, come back to this notebook after the "Introduction to Machine Learning" section for a quick recap and cross-check of these concepts ;-)


---

## Introducing the Penn Machine Learning Benchmarks

**Penn Machine Learning Benchmarks** (`PMLB`) is a collection of code and data for a large, curated set of benchmark datasets for evaluating and comparing supervised machine learning algorithms. 

These data sets cover a broad range of applications, and include _binary/multi-class classification_ problems and _regression problems_, as well as combinations of categorical, ordinal, and continuous features. 

**There are no missing values in these data sets**.

PMLB was developed in the [Computational Genetics Lab](http://epistasis.org/) at the [University of Pennsylvania](https://www.upenn.edu/) with funding from the [NIH](http://www.nih.gov/) under grant R01 AI117694. 

More information here: [github.com/EpistasisLab/penn-ml-benchmarks](https://github.com/EpistasisLab/penn-ml-benchmarks)

## Data set format

All data sets are stored in a common format:

* First row is the column names
* Each following row corresponds to one row of the data
* The target column is named `target`
* All columns are tab (`\t`) separated
* All files are compressed with `gzip` to conserve space
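
To make the format above concrete, here is a minimal sketch that writes and then reads back a tiny file with the same layout, using only the standard library. The file name `toy_dataset.tsv.gz`, the feature column names, and the values are made up for illustration; they are not taken from PMLB.

```python
import csv
import gzip
import os
import tempfile

# Tiny made-up table in the layout described above:
# header row first, tab-separated columns, a column named `target`
rows = [
    ['feature_a', 'feature_b', 'target'],
    ['1.5', '0', '1'],
    ['2.0', '1', '0'],
]

# Hypothetical file name, stored in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'toy_dataset.tsv.gz')

# Write the rows as gzip-compressed, tab-separated text
with gzip.open(path, 'wt', newline='') as handle:
    csv.writer(handle, delimiter='\t').writerows(rows)

# Read the file back: the first row holds the column names
with gzip.open(path, 'rt', newline='') as handle:
    reader = csv.reader(handle, delimiter='\t')
    header = next(reader)
    records = [dict(zip(header, row)) for row in reader]

# The values come back as strings because csv does no type conversion
targets = [record['target'] for record in records]
print(header)
print(targets)
```

A file in this layout can also be loaded with `pandas.read_csv(path, sep='\t', compression='gzip')`, which returns a `DataFrame` with typed columns.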

## Python wrapper

For easy access to the benchmark data sets, PMLB comes with a Python wrapper named `pmlb`. The wrapper can be installed via `pip`:

```
pip install pmlb
```

and used in Python scripts as follows:

In [None]:
try: 
    import pmlb
except ImportError:
    print('IMPORT ERROR: Please install the pmlb package: pip install pmlb')

In [None]:
# if you got an ImportError in the previous cell:

!pip install pmlb

#### Dataset List

You can list **all** the available data sets in `pmlb` as follows:


In [None]:
from pmlb import dataset_names

print('All Available Datasets: ')
print('========================')
for i, name in enumerate(dataset_names):
    print('{}) {}'.format(i+1, name))

Moreover, if you are particularly interested in getting the list of available datasets for **Classification** or **Regression**:

In [None]:
from pmlb import classification_dataset_names, regression_dataset_names

print('Classification Datasets: ')
print('=========================')
for i, name in enumerate(classification_dataset_names):
    print('{}) {}'.format(i+1, name))
    

In [None]:
print('Regression Datasets: ')
print('=========================')
for i, name in enumerate(regression_dataset_names):
    print('{}) {}'.format(i+1, name))

## Fetching a Dataset

To fetch a dataset from `pmlb`, all you need is the **name** of the dataset. 

By default, a `pandas.DataFrame` will be returned as output.

In [None]:
# Returns a pandas DataFrame
from pmlb import fetch_data

adult_data = fetch_data('adult')
print(adult_data.describe())

In the next notebooks, we will use this **package** to easily download datasets for further analysis and manipulation.

---

### Feel free to SKIP 

Please feel free to **skip** the rest of this notebook and come back to it later, once we have introduced **NumPy** and 
**Scikit-learn** for Machine Learning.

The `fetch_data` function has two additional parameters:

* `return_X_y` (True/False): Whether to return the data in scikit-learn format, with the features and labels stored in separate NumPy arrays.

* `local_cache_dir` (string): The directory on your local machine to store the data files so you don't have to fetch them over the web again. By default, the wrapper does not use a local cache directory.

For example:


In [None]:
from pmlb import fetch_data

# Returns NumPy arrays
adult_X, adult_y = fetch_data('adult', return_X_y=True, local_cache_dir='./data')
print(adult_X)
print(adult_y)
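
If you wonder what "scikit-learn format" means in practice, the sketch below splits a small `DataFrame` into a feature matrix and a label vector by hand. This is an assumption about what `return_X_y=True` is roughly equivalent to, not a copy of the `pmlb` implementation, and the toy data is made up; only the `target` column name follows the PMLB convention.

```python
import pandas as pd

# Made-up toy data; only the `target` column name follows the PMLB convention
frame = pd.DataFrame({
    'feature_a': [1.5, 2.0, 3.5],
    'feature_b': [0, 1, 1],
    'target': [1, 0, 1],
})

# Features: every column except `target`, as a 2-D NumPy array
X = frame.drop(columns='target').to_numpy()

# Labels: the `target` column, as a 1-D NumPy array
y = frame['target'].to_numpy()

print(X.shape)
print(y.shape)
```

Each row of `X` is one sample and `y` holds the matching label at the same index, which is the layout that scikit-learn estimators expect in `fit(X, y)`.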



## Example usage: Compare two classification algorithms with PMLB

PMLB is designed to make it easy to benchmark machine learning algorithms against each other. 

Below is a Python code snippet showing the most basic way to use PMLB to compare two algorithms.


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sb

from pmlb import fetch_data, classification_dataset_names

In [None]:
logit_test_scores = []
gnb_test_scores = []

for i, classification_dataset in enumerate(classification_dataset_names):
    if i > 20:
        break
        
    X, y = fetch_data(classification_dataset, return_X_y=True)
    train_X, test_X, train_y, test_y = train_test_split(X, y)

    logit = LogisticRegression(solver='liblinear')
    gnb = GaussianNB()

    logit.fit(train_X, train_y)
    gnb.fit(train_X, train_y)

    logit_test_scores.append(logit.score(test_X, test_y))
    gnb_test_scores.append(gnb.score(test_X, test_y))
    print('{} {} DONE'.format(i+1, classification_dataset))

sb.boxplot(data=[logit_test_scores, gnb_test_scores], notch=True)
plt.xticks([0, 1], ['LogisticRegression', 'GaussianNB'])
plt.ylabel('Test Accuracy')
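
Besides the box plot, a few summary numbers can help when comparing the two lists of scores. The helper below is a hypothetical sketch (the name `compare_scores` is ours, it is not part of `pmlb` or scikit-learn), and the values passed to it here are made-up placeholders, not real benchmark results.

```python
from statistics import mean, median


def compare_scores(scores_a, scores_b):
    """Summarize two equal-length lists of per-dataset accuracies."""
    if len(scores_a) != len(scores_b):
        raise ValueError('score lists must have the same length')

    # Count on how many datasets each algorithm scores strictly higher
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))

    return {
        'mean_a': mean(scores_a),
        'mean_b': mean(scores_b),
        'median_a': median(scores_a),
        'median_b': median(scores_b),
        'wins_a': wins_a,
        'wins_b': wins_b,
        'ties': len(scores_a) - wins_a - wins_b,
    }


# Made-up placeholder values, NOT real benchmark results
summary = compare_scores([0.5, 0.75, 1.0], [0.5, 0.25, 0.75])
print(summary)
```

In this notebook you would call it as `compare_scores(logit_test_scores, gnb_test_scores)` after the loop above has finished.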