# Datasets Preparation

In this notebook we prepare a set of machine learning datasets downloaded from the UCI Machine Learning Repository. The raw files are stored in the `data/raw` directory; the only change made to them was converting the column separator to a plain comma where required. All other preprocessing is done in this notebook to ensure reproducibility. We apply the following transforms:

* Remove categorical columns.
* Remove rows with NaN entries.
* Normalize each column using min-max feature scaling.
* Change the column order so the target column is the last one.
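
The four steps above can be sketched on a small in-memory table. This is only an illustration under stated assumptions: the helper name `prepare` and the toy columns `x`, `kind`, and `y` are hypothetical, and the sketch drops the categorical columns before removing NaN rows, whereas the preprocessing cell later in this notebook removes NaN rows first.

```python
import numpy as np
import pandas as pd


def prepare(df, target, drop=()):
    """Apply the four listed steps to a DataFrame (illustrative helper)."""
    # 1. Remove categorical columns.
    df = df.drop(columns=list(drop))
    # 2. Remove rows containing NaN entries.
    df = df.dropna()
    # 3. Min-max scale every remaining column to [0, 1].
    df = (df - df.min()) / (df.max() - df.min())
    # 4. Move the target column to the end.
    features = [c for c in df.columns if c != target]
    return df[features + [target]]


# Toy data: 'kind' is categorical and one value of 'x' is missing.
toy = pd.DataFrame({
    'y': [10.0, 20.0, 30.0, 40.0],
    'kind': ['a', 'b', 'a', 'b'],
    'x': [1.0, np.nan, 3.0, 5.0],
})
prepared = prepare(toy, target='y', drop=['kind'])
```

Here `prepared` keeps three rows, with `x` first and the target `y` last, both scaled to the unit interval.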

**Disclaimer**: This collection of datasets is intended for studying the performance of neural networks and their variants. We therefore avoid datasets whose features are mostly non-numerical, and in the few cases where categorical columns are present we remove them rather than applying a specific encoding.

In [29]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import json

## Regression dataset list

In [30]:
meta = []

### 1. Facebook Metrics

In [31]:
meta.append({
    'id': 'facebook_metrics.csv',
    'name': 'Facebook metrics',
    'type': 'Regression',
    'ref': 'https://archive.ics.uci.edu/ml/datasets/Facebook+metrics',
    'columns': ['page_likes','type','category','post_month','post_weekday','post_hour','paid',
                'life_time', 'life_impression','life_users','life_consumers','life_consumptions',
                'impressions_like','life_like','life_page_post','comment','like','share','interactions'],
    'target': 'interactions',
    'transforms': [],
    'drop': ['type'],
    'comment': ''})

### 2. Forest Fires

In [32]:
def log_transform(x):
    return np.log(x + 1)

meta.append({
    'id': 'forest_fires.csv',
    'name': 'Forest fires',
    'type': 'Regression',
    'ref': 'https://archive.ics.uci.edu/ml/datasets/Forest+Fires',
    'columns': ['x','y','month','day','FFMC','DMC','DC','ISI','temp','RH','wind','rain','area'],
    'target': 'area',
    'transforms': [{
        'attr': 'area',
        'op': np.vectorize(log_transform)
    }],
    'drop': ['month','day'],
    'comment': 'Target attr. is log-transformed according to original paper.'})
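
As a side note, `np.log(x + 1)` is the quantity that NumPy's built-in `np.log1p` computes elementwise, so the `np.vectorize` wrapper could probably be replaced by it. The sketch below uses made-up `area` values (not taken from the dataset) to check that the two agree.

```python
import numpy as np


def log_transform(x):
    return np.log(x + 1)


# Illustrative values only; these are not rows from forest_fires.csv.
area = np.array([0.0, 0.5, 10.0, 1000.0])

via_vectorize = np.vectorize(log_transform)(area)
via_log1p = np.log1p(area)  # vectorized, and more accurate for small x
```

An area of zero maps to zero under both versions, which is why the `+ 1` offset is needed for this target.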

### 3. Aquatic Toxicity

In [33]:
meta.append({
    'id': 'aquatic_toxicity.csv',
    'name': 'Aquatic toxicity',
    'type': 'Regression',
    'ref': 'https://archive.ics.uci.edu/ml/datasets/QSAR+aquatic+toxicity',
    'columns': ['TPSA','SAacc','H-050','MLOGP','RDCHI','GATS1p','nN','C-040', 'response'],
    'target': 'response',
    'transforms': [],
    'drop': [],
    'comment': ''})

### 4. Fish Toxicity

In [34]:
meta.append({
    'id': 'fish_toxicity.csv',
    'name': 'Fish toxicity',
    'type': 'Regression',
    'ref': 'https://archive.ics.uci.edu/ml/datasets/QSAR+fish+toxicity',
    'columns': ['CIC0','SM1_Dz','GATS1i','NdsCH','NdssC','MLOGP', 'response'],
    'target': 'response',
    'transforms': [],
    'drop': [],
    'comment': 'Target attr. is in the last column.'})

### 5. Airfoil Noise

In [35]:
meta.append({
    'id': 'airfoil_noise.csv',
    'name': 'Airfoil noise',
    'type': 'Regression',
    'ref': 'https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise',
    'columns': ['frequency', 'angle', 'chord', 'velocity', 'suction', 'decibels'],
    'target': 'decibels',
    'transforms': [],
    'drop': [],
    'comment': 'Target attr. is in the last column.'})

### 6. Concrete Compressive Strength

In [36]:
meta.append({
    'id': 'concrete.csv',
    'name': 'Concrete strength',
    'type': 'Regression',
    'ref': 'https://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength',
    'columns': ['cement', 'slag', 'ash', 'water', 'superplasticizer', 'coarse', 'fine', 'age', 'strength'],
    'target': 'strength',
    'transforms': [],
    'drop': [],
    'comment': 'Target attr. is in the last column.'})

## Preprocessing

In [37]:
df_summary = pd.DataFrame()

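for ds in meta:
    # Read the raw file, replacing its header row with our own column names.
    # "?" marks missing values; rows with any missing value are dropped.
    df = pd.read_csv('data/raw/' + ds['id'], header=0, na_values=["?"], names=ds['columns']).dropna()
    columns = [column for column in ds['columns'] if column not in ds['drop']]

    # Apply the per-attribute transforms (e.g. the log transform of the target).
    for transform in ds['transforms']:
        attr = transform['attr']
        op = transform['op']

        df[attr] = op(df[attr])
        # Callables are not JSON serializable, so clear 'op' before meta is dumped below.
        transform['op'] = []

    # Min-max scale every kept column to [0, 1].
    for column in columns:
        df[column] = (df[column] - df[column].min()) / (df[column].max() - df[column].min())

    # Move the target to the last position and keep only the selected columns.
    columns = [column for column in columns if column != ds['target']] + [ds['target']]
    df = df[columns]
    df.to_csv('data/prep/' + ds['id'], header=True, index=False)

    # Record the dataset type, number of examples, and number of features (target excluded).
    n, d = df.shape
    df_summary.loc[(ds['name'], 'Type')] = ds['type']
    df_summary.loc[(ds['name'], 'No. Examples')] = n
    df_summary.loc[(ds['name'], 'No. Features')] = d - 1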
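
One caveat of the min-max formula used above is that a constant column has `max - min == 0`, so the division produces NaN values. A minimal sketch of a guarded variant follows; the helper name `min_max_scale` and the choice to map constant columns to `0.0` are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd


def min_max_scale(column):
    """Scale a numeric Series to [0, 1]; constant columns map to 0.0."""
    span = column.max() - column.min()
    if span == 0:
        # Avoid 0/0: a constant column carries no information after scaling.
        return pd.Series(0.0, index=column.index, name=column.name)
    return (column - column.min()) / span


scaled = min_max_scale(pd.Series([2.0, 4.0, 6.0]))
flat = min_max_scale(pd.Series([3.0, 3.0]))
```

Whether any of the six datasets actually contains a constant column is not checked here; the guard only matters if one does.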

In [38]:
# Use a context manager so the file is flushed and closed after writing.
with open('data/prep/list.json', 'w') as f:
    json.dump(meta, f)
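
The metadata can only be dumped because the preprocessing loop overwrites each transform's `'op'` callable, since functions are not JSON serializable. One possible alternative, sketched below under the assumption that storing the function's name is enough, is to build a serializable copy and leave `meta` untouched. The helper `to_serializable` and the trimmed `entry` are hypothetical.

```python
import json


def log_transform(x):
    return x  # placeholder body; only the function name matters in this sketch


# A trimmed-down metadata entry with a callable transform.
entry = {
    'id': 'forest_fires.csv',
    'transforms': [{'attr': 'area', 'op': log_transform}],
}


def to_serializable(entry):
    """Return a copy of entry in which each 'op' is replaced by its name."""
    out = dict(entry)
    out['transforms'] = [
        {'attr': t['attr'], 'op': getattr(t['op'], '__name__', str(t['op']))}
        for t in entry['transforms']
    ]
    return out


text = json.dumps(to_serializable(entry))
```

With this approach the original entry keeps its callable, so the preprocessing cell could be rerun without redefining `meta`.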

## Summary

In [39]:
df_summary.to_latex('results/tables/datasets.tex', label='tab:datasets', 
                    caption='List of benchmark data sets.')
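
For reference, the same kind of summary table could also be assembled from a list of records instead of enlarging an empty DataFrame cell by cell. The sketch below is a hedged alternative using toy stand-ins; the names `Toy A` and `Toy B` and their shapes are hypothetical and unrelated to the real datasets.

```python
import pandas as pd

# Toy stand-ins for prepared datasets (hypothetical names and shapes).
prepared = {
    'Toy A': pd.DataFrame({'x1': [0.0, 1.0], 'x2': [1.0, 0.0], 'y': [0.0, 1.0]}),
    'Toy B': pd.DataFrame({'x1': [0.0, 0.5, 1.0], 'y': [1.0, 0.0, 0.5]}),
}

records = []
for name, df in prepared.items():
    n, d = df.shape
    records.append({
        'Dataset': name,
        'Type': 'Regression',
        'No. Examples': n,
        'No. Features': d - 1,  # the last column is the target
    })

summary = pd.DataFrame(records).set_index('Dataset')
```

Building the table in one step keeps the count columns as integers, which may give cleaner output when the table is exported.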