# Datasets

The quality of the data and the volume of useful information contained within a dataset are key factors that determine how well a machine learning algorithm can learn. Therefore, it is essential to study and appropriately preprocess a dataset before feeding it into a machine learning algorithm.

In notebooks MLLAB-11, MLLAB-12, and MLLAB-13 we will investigate the basic data preprocessing techniques that help us create good machine learning models. The topics that will be covered include:

* dataset loading and study of its properties,
* dealing with missing values,
* management of categorical features,
* selection of informative features for constructing effective models,
* creation of training and test sets.

## Datasets from plain text files

The predominant way of creating, downloading and sharing a dataset is through simple files. These files can be of multiple diverse types, e.g. CSV, XML, JSON, and so on. Here we will focus on CSV files, as they constitute the most popular format for creating and sharing datasets.

### CSV files

In CSV (Comma Separated Values) files each record (or sample, or example) occupies one row, while the various columns are separated by commas. Each column represents a feature of the corresponding record. One of the columns (usually the first one, or the last one) stores the target variable of this record. In some cases, the features are not delimited by commas, but by other symbols (such as semicolons, tabs, or spaces). In these cases we are not talking strictly about CSV files, but due to their similarity it has become common for these alternatives to be called CSV.
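
As a minimal sketch of the alternative delimiters mentioned above, the snippet below parses the same records once with semicolons and once with tabs. The three records and the column names (`height`, `weight`, `label`) are hypothetical and serve only as an illustration; the only library feature relied upon is the `sep` parameter of `pd.read_csv`.

```python
from io import StringIO

import pandas as pd

# Hypothetical three-record dataset; the last column plays the role of the target.
SEMICOLON_TEXT = "height;weight;label\n1.70;65;0\n1.82;80;1\n1.65;58;0\n"

# The same records, delimited by tabs instead of semicolons.
TAB_TEXT = SEMICOLON_TEXT.replace(";", "\t")

# The sep parameter tells pandas which symbol separates the columns.
df_semicolon = pd.read_csv(StringIO(SEMICOLON_TEXT), sep=";")
df_tab = pd.read_csv(StringIO(TAB_TEXT), sep="\t")

print(df_semicolon)

# Both variants should yield the same dataframe.
print(df_semicolon.equals(df_tab))
```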

The [UCI machine learning repository](https://archive.ics.uci.edu/ml/) contains numerous such datasets, in the form of redistributable CSV files. One of them is [CNNpred](https://archive.ics.uci.edu/ml/datasets/CNNpred%3A+CNN-based+stock+market+prediction+using+a+diverse+set+of+variables), a collection of daily features of the US S&P 500, NASDAQ, Dow Jones, RUSSELL 2000 and NYSE stock indices from 2010 to 2017. It covers features from various categories: technical indicators, futures contracts, commodity prices, major market indices around the world, the prices of large companies in the US market, etc. A detailed description of the features is given in the paper entitled ["CNNpred: CNN-based stock market prediction using a diverse set of variables"](https://www.sciencedirect.com/science/article/abs/pii/S0957417419301915).

### Loading CSV files into pandas dataframes

[Pandas](https://pandas.pydata.org/) is an open-source Python library that provides high-performance data structures and implements several data analysis tools. Among the supported data structures, the one that is of special interest here is the dataframe.

The pandas dataframes provide an efficient way of representing two-dimensional, potentially heterogeneous, tabular data. Since the vast majority of the real-world datasets are defective and extremely heterogeneous, pandas dataframes constitute a valuable tool for their processing. With them, complex operations are performed easily and quickly in a way that in some cases resembles a relational database.
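
The following sketch illustrates both points: a dataframe whose columns have different types, and two operations that resemble the `WHERE` and `GROUP BY` clauses of a relational database. The records and the column names (`index_name`, `close`, `volume`) are hypothetical values chosen for the example.

```python
import pandas as pd

# Hypothetical records: one text column, one float column, one integer column.
trades = pd.DataFrame({
    "index_name": ["NASDAQ", "NYSE", "NASDAQ", "NYSE"],
    "close": [2308.4, 7326.7, 2301.1, 7377.7],
    "volume": [10, 20, 30, 40],
})

# Each column keeps its own data type.
print(trades.dtypes)

# Row selection with a boolean mask, similar to a WHERE clause.
nasdaq_rows = trades[trades["index_name"] == "NASDAQ"]
print(nasdaq_rows)

# Aggregation per group, similar to a GROUP BY clause.
volume_per_index = trades.groupby("index_name")["volume"].sum()
print(volume_per_index)
```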

[Pandas](https://pandas.pydata.org/) is a standard library that is included in Anaconda, and it is automatically installed during Anaconda installation.

In the following example we load the `Processed_NASDAQ.csv` file (containing the NASDAQ index stock prices) into a [pandas](https://pandas.pydata.org/) dataframe. 


In [1]:
import numpy as np
import pandas as pd

DATASET_LOCATION = "datasets/Processed_NASDAQ.csv"

# Load the input CSV file into a Pandas dataframe
data = pd.read_csv(DATASET_LOCATION, sep=',')

# The head(n) method returns the first n rows of the dataframe
data.head(10)


,Date,Close,Volume,mom,mom1,mom2,mom3,ROC_5,ROC_10,ROC_15,...,NZD,silver-F,RUSSELL-F,S&P-F,CHF,Dollar index-F,Dollar index,wheat-F,XAG,XAU
0,2009-12-31,2269.149902,,,,,,,,,...,0.03,0.26,-1.08,-1.0,-0.11,-0.08,-0.06,-0.48,0.3,0.39
1,2010-01-04,2308.419922,0.560308,0.017306,,,,,,,...,1.52,3.26,1.61,1.62,-0.57,-0.59,-0.42,3.12,3.91,2.1
2,2010-01-05,2308.709961,0.225994,0.000126,0.017306,,,,,,...,-0.07,1.96,-0.2,0.31,0.43,0.03,0.12,-0.9,1.42,-0.12
3,2010-01-06,2301.090088,-0.048364,-0.0033,0.000126,0.017306,,,,,...,0.56,2.15,-0.02,0.07,-0.56,-0.24,-0.17,2.62,2.25,1.77
4,2010-01-07,2300.050049,0.007416,-0.000452,-0.0033,0.000126,0.017306,,,,...,-0.72,0.94,0.5,0.4,0.58,0.58,0.54,-1.85,0.22,-0.58
5,2010-01-08,2317.169922,-0.054915,0.007443,-0.000452,-0.0033,0.000126,2.116212,,,...,0.61,0.68,0.64,0.35,-0.98,-0.58,-0.56,2.07,1.26,0.38
6,2010-01-11,2312.409912,-0.031463,-0.002054,0.007443,-0.000452,-0.0033,0.172845,,,...,0.64,-0.13,-1.01,0.09,-0.66,-0.64,-0.61,1.08,0.65,1.44
7,2010-01-12,2282.310059,0.139772,-0.013017,-0.002054,0.007443,-0.000452,-1.143491,,,...,-0.47,-2.36,-0.67,-0.74,0.22,-0.05,-0.06,-6.33,-1.78,-2.19
8,2010-01-13,2307.899902,-0.021099,0.011212,-0.013017,-0.002054,0.007443,0.295939,,,...,0.26,1.62,0.82,0.66,-0.15,-0.17,-0.13,-0.51,1.97,0.98
9,2010-01-14,2316.73999,-0.027683,0.00383,0.011212,-0.013017,-0.002054,0.725634,,,...,0.27,0.57,0.76,0.33,0.12,-0.13,-0.16,-1.49,0.32,0.39
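
The `Date` column above is read as plain text unless pandas is told otherwise. As a hedged sketch, the snippet below uses the `parse_dates` parameter of `pd.read_csv` on a small hypothetical excerpt that mimics the layout of the file (dates in the first column, blank fields for missing values); it does not read the actual dataset.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row excerpt with the same layout as the file above.
SAMPLE = "Date,Close,Volume\n2009-12-31,2269.15,\n2010-01-04,2308.42,0.56\n"

# parse_dates converts the listed columns to datetime values while loading.
sample = pd.read_csv(StringIO(SAMPLE), parse_dates=["Date"])

print(sample.dtypes)

# Blank fields are loaded as NaN, so they can be counted with isna().
print(sample["Volume"].isna().sum())
```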


### Loading CSV files into NumPy arrays

Loading this dataset into a NumPy array is considerably more complex. The primary reason is that a NumPy array is homogeneous (all of its elements share a single data type), and `genfromtxt` parses every field as a floating-point number by default. Another reason is that a blank field has no numerical value, so it is read as `nan`. A plain numerical array is therefore suited to **numerical** and **non-blank** datasets only. However, [CNNpred](https://archive.ics.uci.edu/ml/datasets/CNNpred%3A+CNN-based+stock+market+prediction+using+a+diverse+set+of+variables) has the following prohibitive attributes:

* it contains date data in the first column, and
* it has missing values

So here is what happens if we try to load it directly into a NumPy array:


In [2]:
from numpy import genfromtxt

data = genfromtxt(DATASET_LOCATION, delimiter=',', skip_header=1)
print(data[0:1, :])


[[           nan  2.2691499e+03            nan            nan
             nan            nan            nan            nan
             nan            nan            nan            nan
             nan            nan            nan  4.0000000e-02
   6.0000000e-02  2.0000000e-01  2.6900000e+00  3.8500000e+00
             nan            nan  5.3300000e+00  6.3900000e+00
             nan            nan            nan            nan
             nan            nan            nan            nan
             nan            nan            nan            nan
             nan            nan            nan            nan
             nan            nan            nan            nan
             nan  3.8100000e+00  3.7900000e+00  3.6500000e+00
   2.0000000e-02  1.6000000e-01  1.0600000e+00  2.5400000e+00
   6.1900000e+00  6.3300000e+00  6.3500000e+00            nan
             nan            nan            nan  3.5000000e-01
  -1.3000000e-01  1.5000000e-01  9.0000000e-02  1.0000000e-01
   4.800

To manage the dates we will write a special function (`datestr2num`) that converts the dates to numbers. First, the strings of the first column are converted to dates with `strptime`. Next, `date2num` converts each date to a decimal number. This number represents the number of days that have elapsed since the Matplotlib epoch, which is 1970-01-01 00:00:00 UTC by default in Matplotlib 3.3 and later (older versions counted the days since 0001-01-01 00:00:00 UTC, plus 1). The decimal part represents the time (hours, minutes, and seconds).

Regarding the missing values, the argument `filling_values=0.0` converts them all to zero. The next notebook (MLLAB-12) presents more sophisticated methods for handling missing values in a given dataset.


In [3]:
from datetime import datetime
from matplotlib.dates import date2num

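# Convert the input string s to a number
def datestr2num(s):
    # genfromtxt may pass the field as bytes or as str, depending on the
    # NumPy version and the encoding argument
    if isinstance(s, bytes):
        s = s.decode('ascii')

    # At first, convert s to a date
    d = datetime.strptime(s, '%Y-%m-%d')

    # Then, convert d to a number: the number of days that have elapsed since
    # the Matplotlib epoch (1970-01-01 00:00:00 UTC by default in Matplotlib 3.3+)
    num_days = date2num(d)

    return num_days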

# Notice the converters={0 : datestr2num} statement. This will convert the
# data of column 0 (the first column) by passing it through datestr2num.
data = genfromtxt(DATASET_LOCATION, delimiter=',', skip_header=1, converters={0 : datestr2num}, filling_values=0.0)

print(data[0:10, 0:1])


[[14609.]
 [14613.]
 [14614.]
 [14615.]
 [14616.]
 [14617.]
 [14620.]
 [14621.]
 [14622.]
 [14623.]]
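
The same combination of a converter and a fill value can be tried on a small inline sample, without the dataset file and without Matplotlib. The sketch below makes a few assumptions: the three records are hypothetical, the helper `datestr2days` is an illustrative stand-in for `datestr2num`, and it counts days since 1970-01-01 with the standard library instead of calling `date2num`. The converter returns a `float` so that the resulting array stays a plain two-dimensional float array.

```python
from datetime import date, datetime
from io import StringIO

import numpy as np

# Hypothetical excerpt: a date column, plus blank fields for missing values.
SAMPLE = (
    "Date,Close,Volume\n"
    "2009-12-31,2269.15,\n"
    "2010-01-04,2308.42,0.56\n"
    "2010-01-05,,0.23\n"
)

# Ordinal of the reference day (assumption: days are counted from 1970-01-01).
EPOCH_ORDINAL = date(1970, 1, 1).toordinal()


def datestr2days(s):
    # The field may arrive as bytes or as str.
    if isinstance(s, bytes):
        s = s.decode("ascii")
    d = datetime.strptime(s, "%Y-%m-%d")
    # Return a float so that all columns share the same data type.
    return float(d.toordinal() - EPOCH_ORDINAL)


sample = np.genfromtxt(
    StringIO(SAMPLE),
    delimiter=",",
    skip_header=1,
    converters={0: datestr2days},  # column 0 passes through datestr2days
    filling_values=0.0,            # blank fields become 0.0
    encoding="utf-8",
)

print(sample)
```

Replacing missing values with zero is only a placeholder strategy here: a zero closing price or volume is indistinguishable from a real measurement once it is in the array.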


## scikit-learn toy datasets

The scikit-learn library includes several small-sized, standard [toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html). Therefore, it is not required to download the dataset files from an external site. The previous notebooks of this course demonstrated the properties of several machine learning algorithms by employing IRIS, the Breast cancer dataset, the Boston dataset, etc.

Usually, a dataset should include some training examples. Each training example consists of a number of features (also called attributes or input variables), plus a target variable. The target variable can be a class label (classification problems) or an arbitrary continuous value (regression problems).
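
The separation between features and target can be sketched with a small dataframe. The three examples and the column names below are hypothetical; `X` and `y` are merely conventional variable names for the feature matrix and the target vector.

```python
import pandas as pd

# Hypothetical training examples: two features and one class label per row.
examples = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": [0, 1, 2],  # class label, i.e. a classification target
})

# The features are all the columns except the target.
X = examples.drop(columns=["species"])

# The target variable is a single column.
y = examples["species"]

print(X)
print(y)
```

In a regression problem the layout would be the same, but the target column would hold continuous values instead of class labels.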

Below we load the [IRIS dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html), a collection of 150 flowers of 3 types (classes): Iris-Setosa, Iris-Versicolor, and Iris-Virginica. The dataset contains 50 instances of each type.

### Loading a scikit-learn toy dataset

If we use the `load_*()` functions of scikit-learn, then the dataset that is represented by `*` is loaded into an object with the following members:

* `data`: a two-dimensional NumPy array that contains the values of the input variables of the dataset.
* `target`: a one-dimensional NumPy array that contains the values of the target variable of the dataset.
* `feature_names`: the feature names.
* `DESCR`: a description of the dataset.
* `filename`: the location and the name of the CSV file that contains the dataset.


In [4]:
# Load the IRIS toy dataset into an object called dataset.
from sklearn.datasets import load_iris

dataset = load_iris()
# print(dataset)


In [5]:
# A NumPy array that contains the values of the input variables of the dataset.
# Here we print all the columns of the first 10 rows of dataset.data
print(dataset.data[:10, :])


[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]


In [6]:
# A NumPy array that contains the values of the target variable of the dataset.
print(dataset.target)


[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [7]:
# Print the feature names
print(dataset.feature_names)


['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [8]:
# Dataset description
print(dataset.DESCR)


.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [9]:
# The CSV file that contains the dataset
print(dataset.filename)


C:\Users\Leo\anaconda3\lib\site-packages\sklearn\datasets\data\iris.csv
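
The object returned by `load_iris()` also behaves like a dictionary, so its members can be reached either as attributes or by key. The sketch below checks the shapes of `data` and `target`, and uses the `return_X_y=True` option, which returns only the two arrays. The variable names `bunch`, `X`, and `y` are chosen for the example.

```python
from sklearn.datasets import load_iris

bunch = load_iris()

# Attribute access and key access refer to the same arrays.
print(bunch.data.shape)
print(bunch["target"].shape)

# The available members can be listed like dictionary keys.
print(list(bunch.keys()))

# return_X_y=True skips the container and returns (data, target) directly.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```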


### Copying a scikit-learn toy dataset into a pandas dataframe

The code below copies the values of the input variables (`dataset.data`) and their names (`dataset.feature_names`) into a pandas dataframe.


In [10]:
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# Display the first 10 rows of the dataframe
df.head(10)


,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
5,5.4,3.9,1.7,0.4
6,4.6,3.4,1.4,0.3
7,5.0,3.4,1.5,0.2
8,4.4,2.9,1.4,0.2
9,4.9,3.1,1.5,0.1


We may append an additional column to the dataframe with the aim of copying the target variables there. We assign this column the name `FlowerType`.



In [11]:
# Create a new column named 'FlowerType'. Its contents derive from the list of the target variables dataset.target
df['FlowerType'] = dataset.target

# Display the first 10 rows of the dataframe
df.head(10)


,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),FlowerType
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
5,5.4,3.9,1.7,0.4,0
6,4.6,3.4,1.4,0.3,0
7,5.0,3.4,1.5,0.2,0
8,4.4,2.9,1.4,0.2,0
9,4.9,3.1,1.5,0.1,0
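
The integer codes in `FlowerType` can be translated into readable class names with the `target_names` member of the loaded object. The sketch below rebuilds the dataframe so that it runs on its own; the names `iris`, `frame`, and the extra column `FlowerName` are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

frame = pd.DataFrame(iris.data, columns=iris.feature_names)
frame["FlowerType"] = iris.target

# target_names is an array of class names; indexing it with the integer
# codes yields one name per row.
frame["FlowerName"] = iris.target_names[iris.target]

# Count how many rows belong to each class.
name_counts = frame["FlowerName"].value_counts()
print(name_counts)
```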


### Obtaining useful information

Dataframes support a wide range of methods that are particularly useful for processing datasets in machine learning problems. For instance, the `info()` method prints a concise summary of the columns of the dataframe: their names, the number of non-null values, and their data types. In other words, it presents information on the features of the dataset.

For example, the third column below (`Non-Null Count`) indicates that there are no missing values in any feature: each of them has 150 non-null entries, as many as the rows of the dataframe.


In [12]:
# Summary information for the toy dataframe
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   FlowerType         150 non-null    int32  
dtypes: float64(4), int32(1)
memory usage: 5.4 KB


The absence of missing values can also be verified by counting the null entries of each column:


In [13]:
print(df.isnull().sum())


sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
FlowerType           0
dtype: int64
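
Since IRIS has no missing values, the counts above are all zero. For contrast, the sketch below applies the same call to a small hypothetical dataframe (`toy`, with made-up columns `a`, `b`, and `c`) that does contain missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with missing entries in columns a and b.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7, 8, 9],
})

# Number of missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Number of rows that contain at least one missing value.
rows_with_missing = toy.isnull().any(axis=1).sum()
print(rows_with_missing)
```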
