# 02.A: Working with Datasets

It's important to have a clear and sensible way of representing the datasets that learning algorithms train on. A dataset consists of $n$ examples, and each example consists of $m$ features, so $m$ is the number of dimensions of the dataset. In supervised learning, the dataset is a matrix like this:

$\boldsymbol{D} =\left[\begin{array}{cccccc} 
  x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_m^{(1)} & y^{(1)}\\ 
  x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_m^{(2)} & y^{(2)}\\
  x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \cdots & x_m^{(3)} & y^{(3)}\\
  \vdots    & \vdots    & \vdots    & \ddots & \vdots & \vdots \\
  x_1^{(n)} & x_2^{(n)} & x_3^{(n)} & \cdots & x_m^{(n)} & y^{(n)}
\end{array}\right]$

Each row of this matrix is an example consisting of the $m$ features plus the target label as the last element in the row. In other words, $\boldsymbol{D}$ consists of both the input matrix $\boldsymbol{X}$ and the target vector $\boldsymbol{y}$, where:

$\boldsymbol{X} =\left[\begin{array}{ccccc} 
  x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_m^{(1)}\\ 
  x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_m^{(2)}\\
  x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \cdots & x_m^{(3)}\\
  \vdots    & \vdots    & \vdots    & \ddots & \vdots \\
  x_1^{(n)} & x_2^{(n)} & x_3^{(n)} & \cdots & x_m^{(n)}
\end{array}\right]$

and

$\boldsymbol{y} =\left[\begin{array}{c} 
  y^{(1)}\\ 
  y^{(2)}\\
  y^{(3)}\\
  \vdots \\
  y^{(n)}
\end{array}\right]$
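
As a small illustration of this decomposition, the following sketch splits a toy matrix into its inputs and targets with NumPy slicing. The values in `D` are made up for the example and are not taken from any dataset used in this notebook.

```python
import numpy as np

# A toy supervised dataset D with n = 4 examples and m = 2 features.
# The values are invented purely for illustration.
D = np.array([
    [5.1, 3.5, 0],
    [4.9, 3.0, 0],
    [6.2, 2.9, 1],
    [5.9, 3.0, 1],
])

X = D[:, :-1]   # every column except the last: the n-by-m input matrix
y = D[:, -1:]   # the last column, kept 2-D: the n-by-1 target vector

n, m = X.shape
print(n, m)      # 4 2
print(y.shape)   # (4, 1)
```

Slicing with `-1:` rather than `-1` keeps the target as a column, which matches the shape of $\boldsymbol{y}$ shown above.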

For unsupervised learning, there is no target, so $\boldsymbol{D}$ is the same as $\boldsymbol{X}$.

Another name for $\boldsymbol{X}$ is `inputs`, and another name for $\boldsymbol{y}$ is `target`. In addition, features have names. Let's put all of this together in a class named `DataSet`, built on a pandas DataFrame, that we will be using in subsequent weeks.

In [1]:
import numpy as np
import pandas as pd

class DataSet:
    """
    A dataset for a machine learning problem. A dataset d has the following properties:
    d.examples   A DataFrame of examples. Each one contains both the features and the target.
    d.features   An array of the feature names.
    d.target     An n by 1 array containing the values of y
    d.y          Same as d.target
    d.inputs     An n by m array containing the values of X
    d.X          Same as d.inputs
    d.N          Number of examples
    d.M          Number of dimensions
    d.name       The name of the data set (for output display only)
    
    """
    def __init__(self, data, features=None, y=None, name=None):
        """
        If y is True, the data contains the target as the last column.
        If y is None or False, no target is available.
        Otherwise, y is an array to be added as the last column of the examples dataframe.
        """
        self.__name = name
        if isinstance(data, pd.DataFrame):
            self.__examples = data
        else:
            self.__examples = pd.DataFrame(data, columns=features)
            
        if y is True:
            self.__examples.columns = [*self.__examples.columns[:-1], 'y']
        elif y is not False and y is not None:
            self.__examples['y'] = y
    
    @property
    def examples(self):
        return self.__examples
    
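    @property
    def features(self):
        # the last column is the target only when a 'y' column exists
        columns = self.__examples.columns
        return columns[:-1].values if 'y' in columns else columns.values
    
    @property
    def target(self):
        if 'y' in self.__examples.columns:
            return self.__examples['y'].values.reshape(self.N, 1)
        return None
    
    @property
    def y(self):
        return self.target
    
    @property
    def inputs(self):
        # without a target, every column is an input
        if 'y' in self.__examples.columns:
            return self.__examples.iloc[:, :-1].values
        return self.__examples.values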
    
    @property
    def X(self):
        return self.inputs
    
    @property
    def name(self):
        return self.__name
    
    @property
    def N(self):
        return self.__examples.shape[0]
    
    @property
    def M(self):
        return self.inputs.shape[1]
    
    def shuffled(self, random_state=None):
        rgen = np.random.RandomState(random_state)
        indexes = np.arange(self.N)
        rgen.shuffle(indexes)
        return DataSet(self.__examples.iloc[indexes])
    
    def train_test_split(self, start=0, end=None, test_portion=None, shuffle=False, random_state=None):
        """
        Splits the dataset into a training set and a test set. 
        If test_portion is specified, return that portion of the dataset as test 
        and the rest as training. 
        Otherwise, return the examples between start and end as test and the 
        rest as training.
        """
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)

        if test_portion is None:
            end = end or self.N
        else:
            if not isinstance(test_portion, float) or test_portion < 0 or test_portion > 1:
                raise TypeError("Only fractions between ]0,1[ are allowed")

            start = self.N - int(self.N * test_portion)
            end = self.N

        test = DataSet(self.examples.iloc[indexes[range(start, end)]])
        train = DataSet(pd.concat([self.examples.iloc[indexes[range(start)]], 
                                      self.examples.iloc[indexes[range(end, self.N)]]], axis=0))    
        return train, test
    
    def __repr__(self):
        return repr(self.examples)

    def train_validation_test_split(self, portions, shuffle=False, random_state=None):
        """
        Splits the dataset into training, validation, and test sets.
        portions is a dictionary with the keys "training", "validation", and
        "test" whose values are fractions that add up to 1.
        """
        # read the requested fraction of each set
        FracTrain = portions["training"]
        FracValidate = portions["validation"]
        FracTest = portions["test"]

        # check that the fractions add up (with a tolerance, since they are floats)
        if not np.isclose(FracTrain + FracValidate + FracTest, 1.0):
            raise ValueError("The portion sizes must add up to 100 percent")

        # shuffled() returns a new DataSet, so the original order is left untouched
        data = self.shuffled(random_state) if shuffle is True else self

        # truncate the training and validation counts; the test set gets the remainder
        NumTrain = int(data.N * FracTrain)
        NumValidate = int(data.N * FracValidate)
        NumTest = data.N - NumTrain - NumValidate

        # see the amounts the examples should be split into
        print("Amount of examples: ", data.N)
        print("Train: ", NumTrain, "\nValidate: ", NumValidate, "\nTest: ", NumTest)

        # create datasets of the assigned amounts
        train = DataSet(data.examples.iloc[range(0, NumTrain)])
        validate = DataSet(data.examples.iloc[range(NumTrain, NumTrain + NumValidate)])
        test = DataSet(data.examples.iloc[range(NumTrain + NumValidate, data.N)])

        # return the datasets
        return train, validate, test

This class has several properties: `name` (informational), `features` (the names of the features), `inputs`, `target`, `X`, `y`, `N` (the number of examples), and `M` (the number of dimensions).

A DataSet object is created using a NumPy array or a Pandas dataframe. If it is a NumPy array, the class uses it to create a Pandas dataframe. The dataframe storing the data can be retrieved back using the `examples` property.
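
The conversion from a NumPy array to a dataframe can be sketched with plain pandas calls. This is only an illustration of the idea, not the class itself; the array `raw`, the vector `labels`, and the column names are hypothetical values chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-by-2 input array and matching targets (illustrative values).
raw = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])
labels = np.array([0, 1, 0])

# A NumPy array carries no column names, so they are supplied explicitly.
examples = pd.DataFrame(raw, columns=['x1', 'x2'])

# Assigning a new column appends it, so the target ends up last.
examples['y'] = labels

print(list(examples.columns))   # ['x1', 'x2', 'y']
print(examples.shape)           # (3, 3)
```

Because the target is always the last column, the inputs can be recovered with `examples.iloc[:, :-1]`.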


Let's test this class by creating a $27 \times 3$ input matrix and a separate $y$ column.

In [2]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")

ds

     x1   x2         x1  y
0   5.0  7.0  12.732921  0
1   5.0  8.0   7.705131  1
2   4.0  6.0   8.894093  0
3   2.0  1.0  10.685696  0
4   2.0  6.0  10.057909  1
5   4.0  7.0   8.320882  0
6   7.0  3.0  12.333498  1
7   5.0  2.0  12.882851  0
8   6.0  2.0  11.616481  0
9   8.0  4.0   9.824204  1
10  8.0  7.0  11.393600  0
11  2.0  5.0   8.239395  1
12  3.0  8.0   9.239006  0
13  3.0  8.0   9.840350  1
14  2.0  8.0  11.649701  0
15  4.0  2.0   9.728107  0
16  2.0  7.0  10.966576  1
17  3.0  5.0   7.111513  1
18  8.0  4.0   9.237051  1
19  4.0  3.0   7.566279  0
20  3.0  1.0   9.813140  1
21  2.0  2.0   8.913988  1
22  7.0  3.0   5.124349  0
23  5.0  7.0  12.731991  1
24  6.0  2.0   7.970008  1
25  2.0  8.0   8.939165  0
26  3.0  4.0   7.783040  0

In [3]:
ds.examples

    x1   x2         x1  y
0  5.0  7.0  12.732921  0
1  5.0  8.0   7.705131  1
2  4.0  6.0   8.894093  0
3  2.0  1.0  10.685696  0
4  2.0  6.0  10.057909  1
5  4.0  7.0   8.320882  0
6  7.0  3.0  12.333498  1
7  5.0  2.0  12.882851  0
8  6.0  2.0  11.616481  0
9  8.0  4.0   9.824204  1


In [4]:
ds.features

array(['x1', 'x2', 'x1'], dtype=object)

In [5]:
ds.target 

array([[0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0]])

In [6]:
ds.y 

array([[0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0]])
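
The target comes back as an $n \times 1$ column rather than a flat array. A minimal sketch of why that shape can matter, assuming the target is later combined with other arrays through NumPy broadcasting (the values below are made up):

```python
import numpy as np

y_flat = np.array([0, 1, 0, 1])   # shape (4,): a 1-D array
y_col = y_flat.reshape(-1, 1)     # shape (4, 1): a column vector

# Subtracting a 1-D array from a column vector broadcasts to a 4-by-4 matrix,
# which is rarely what is intended when comparing predictions with targets.
print((y_col - y_flat).shape)     # (4, 4)

# Keeping both operands as columns gives the element-wise result.
print((y_col - y_col).shape)      # (4, 1)
```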

In [7]:
ds.inputs

array([[ 5.        ,  7.        , 12.73292085],
       [ 5.        ,  8.        ,  7.70513096],
       [ 4.        ,  6.        ,  8.89409313],
       [ 2.        ,  1.        , 10.68569592],
       [ 2.        ,  6.        , 10.05790881],
       [ 4.        ,  7.        ,  8.32088218],
       [ 7.        ,  3.        , 12.33349796],
       [ 5.        ,  2.        , 12.88285083],
       [ 6.        ,  2.        , 11.61648061],
       [ 8.        ,  4.        ,  9.82420369],
       [ 8.        ,  7.        , 11.39359982],
       [ 2.        ,  5.        ,  8.23939549],
       [ 3.        ,  8.        ,  9.23900621],
       [ 3.        ,  8.        ,  9.84035013],
       [ 2.        ,  8.        , 11.64970142],
       [ 4.        ,  2.        ,  9.72810707],
       [ 2.        ,  7.        , 10.96657646],
       [ 3.        ,  5.        ,  7.11151292],
       [ 8.        ,  4.        ,  9.23705097],
       [ 4.        ,  3.        ,  7.56627896],
       [ 3.        ,  1.        ,  9.813

In [8]:
ds.X

array([[ 5.        ,  7.        , 12.73292085],
       [ 5.        ,  8.        ,  7.70513096],
       [ 4.        ,  6.        ,  8.89409313],
       [ 2.        ,  1.        , 10.68569592],
       [ 2.        ,  6.        , 10.05790881],
       [ 4.        ,  7.        ,  8.32088218],
       [ 7.        ,  3.        , 12.33349796],
       [ 5.        ,  2.        , 12.88285083],
       [ 6.        ,  2.        , 11.61648061],
       [ 8.        ,  4.        ,  9.82420369],
       [ 8.        ,  7.        , 11.39359982],
       [ 2.        ,  5.        ,  8.23939549],
       [ 3.        ,  8.        ,  9.23900621],
       [ 3.        ,  8.        ,  9.84035013],
       [ 2.        ,  8.        , 11.64970142],
       [ 4.        ,  2.        ,  9.72810707],
       [ 2.        ,  7.        , 10.96657646],
       [ 3.        ,  5.        ,  7.11151292],
       [ 8.        ,  4.        ,  9.23705097],
       [ 4.        ,  3.        ,  7.56627896],
       [ 3.        ,  1.        ,  9.813

In [9]:
ds.name

'Sample Data'

In [10]:
ds.N

27

In [11]:
ds.M

3

## Shuffling

The above class also supports a few useful methods. One such method is for shuffling the data, which we do often before training. This method returns a new DataSet instance with the shuffled data. Here is how this method is implemented:

```python
    ...
    def shuffled(self, random_state=None):
        rgen = np.random.RandomState(random_state)
        indexes = np.arange(self.N)
        rgen.shuffle(indexes)
        return DataSet(self.__examples.iloc[indexes])
   ...
```

Here is an example using this method.

In [12]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


ds.shuffled()

     x1   x2         x1  y
16  4.0  3.0   9.881227  1
12  8.0  8.0  13.254460  1
9   8.0  7.0  13.143351  0
20  6.0  8.0  10.742553  1
6   2.0  8.0  11.013104  1
14  6.0  5.0  10.114977  1
13  8.0  6.0   7.964049  0
21  7.0  4.0  14.189807  1
0   7.0  1.0   9.867608  1
2   5.0  2.0  14.575606  1
26  5.0  8.0  10.248868  0
17  5.0  3.0   9.222830  0
8   7.0  6.0   8.856367  0
18  3.0  6.0   9.812236  0
1   8.0  5.0  12.722751  0
23  4.0  3.0   7.835632  0
7   3.0  8.0   9.773252  0
3   8.0  4.0   9.382219  1
22  8.0  8.0  12.882059  1
10  8.0  2.0   9.333797  1
25  8.0  3.0   8.430229  0
5   8.0  2.0  10.934542  1
24  3.0  7.0   5.140607  1
11  5.0  3.0  12.005591  0
4   5.0  4.0   9.441198  1
19  6.0  6.0   9.548453  0
15  8.0  5.0   9.359273  0
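
Note that the row labels in the output above travel with the rows, and that passing a `random_state` makes the order reproducible. The following standalone sketch shows both effects on a small dataframe; `shuffled_rows` is a hypothetical helper that mirrors the logic of `shuffled`, and the data is invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': [10, 20, 30, 40, 50], 'y': [0, 1, 0, 1, 1]})

def shuffled_rows(frame, random_state=None):
    """Return a row-shuffled view of frame (hypothetical helper)."""
    rgen = np.random.RandomState(random_state)
    indexes = np.arange(len(frame))
    rgen.shuffle(indexes)          # permutes the positions in place
    return frame.iloc[indexes]     # original row labels stay attached to rows

a = shuffled_rows(df, random_state=17)
b = shuffled_rows(df, random_state=17)

print(a.equals(b))                         # True: same seed, same order
print(sorted(a.index) == list(df.index))   # True: no rows lost or duplicated
```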

## Splitting a dataset into training and test datasets

Another useful method provided by the above dataset class is `train_test_split`. This method splits the dataset into training and test sets. Here is how this method is implemented:

```python
    ...
    def train_test_split(self,start=0, end=None, test_portion=None, shuffle=False, random_state=None):
        """
        Splits the dataset into a training set and a test set. 
        If test_portion is specified, return that portion of the dataset as test 
        and the rest as training. 
        Otherwise, return the examples between start and end as test and the 
        rest as training.
        """
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)

        if test_portion is None:
            end = end or self.N
        else:
            if not isinstance(test_portion, float) or test_portion < 0 or test_portion > 1:
                raise TypeError("Only fractions between ]0,1[ are allowed")

            start = self.N - int(self.N * test_portion)
            end = self.N

        test = DataSet(self.examples.iloc[indexes[range(start, end)]])
        train = DataSet(pd.concat([self.examples.iloc[indexes[range(start)]], 
                                      self.examples.iloc[indexes[range(end, self.N)]]], axis=0))    
        return train, test
   ...
```

If `test_portion` is not provided, the method returns the examples from position `start` up to (but not including) position `end` as the test set and the rest of the data as the training set. If `test_portion` is provided, the last `int(N * test_portion)` examples are returned as the test set and the rest as the training set. When `shuffle` is `True`, the examples are shuffled before splitting, and `random_state` seeds that shuffle (it has no effect when `shuffle` is `False`). The method returns two `DataSet` instances: the training set and the test set.

Here is an example using this method.

In [13]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


ta, te = ds.train_test_split(test_portion=.25, shuffle=False, random_state=17)
print('Training set = \n', ta)
print('Test set = \n', te)

Training set = 
      x1   x2         x1  y
0   3.0  8.0  10.948005  0
1   4.0  6.0   8.187487  1
2   5.0  8.0  11.209373  0
3   5.0  8.0  12.305663  0
4   6.0  7.0  10.706698  1
5   3.0  8.0  13.053736  0
6   6.0  8.0   9.801108  0
7   3.0  1.0   7.404255  1
8   6.0  7.0   8.881717  1
9   5.0  8.0  11.241495  1
10  7.0  2.0   9.315980  0
11  2.0  5.0   8.692839  0
12  4.0  6.0  10.729373  0
13  4.0  1.0   9.435281  1
14  4.0  2.0  11.594021  0
15  5.0  5.0   8.088723  0
16  8.0  4.0   9.183652  0
17  4.0  2.0   8.759675  1
18  2.0  4.0   9.832103  0
19  4.0  6.0  10.253611  0
20  5.0  2.0   8.873369  1
Test set = 
      x1   x2         x1  y
21  5.0  2.0  12.318211  0
22  2.0  7.0  11.695600  0
23  2.0  6.0  10.392485  0
24  4.0  5.0   8.871367  1
25  2.0  3.0   7.382228  1
26  7.0  2.0  11.414077  0
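
The sizes above (21 training and 6 test examples) follow from the index arithmetic inside `train_test_split`. Here is a minimal sketch of just that arithmetic; `split_bounds` is a hypothetical helper written for this illustration and is not part of the class.

```python
def split_bounds(n, start=0, end=None, test_portion=None):
    """Return the (start, end) positions of the test slice (hypothetical helper)."""
    if test_portion is None:
        return start, end or n
    # the test set is the last int(n * test_portion) examples
    return n - int(n * test_portion), n

print(split_bounds(27, test_portion=.25))   # (21, 27)
print(split_bounds(27, start=5, end=10))    # (5, 10)
```

With 27 examples and `test_portion=.25`, `int(27 * .25)` truncates 6.75 to 6, so the test set holds rows 21 through 26 and the training set holds the first 21 rows.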


## Using this dataset class inside other notebooks

This class is part of the `mylib` library for this course, which is provided to you. Here is how to import this library:

In [14]:
import mylib as my

Once imported, one can use it like this:

In [15]:
ds = my.DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


ta, te = ds.train_test_split(test_portion=.25, shuffle=False, random_state=17)
print('Training set = \n', ta)
print('Test set = \n', te)

Training set = 
      x1   x2         x1  y
0   8.0  6.0  11.128912  1
1   6.0  6.0   7.939912  1
2   3.0  1.0  10.102054  1
3   4.0  6.0  10.819360  1
4   4.0  1.0   8.568149  0
5   4.0  2.0  11.658371  0
6   5.0  7.0  10.119335  0
7   8.0  2.0  10.630417  0
8   3.0  7.0  12.053922  1
9   5.0  6.0  10.911310  0
10  7.0  5.0  10.253731  0
11  2.0  8.0  11.429588  0
12  2.0  3.0   7.382770  0
13  7.0  3.0  10.379701  1
14  3.0  7.0   7.552759  1
15  3.0  5.0   7.586252  1
16  2.0  6.0   6.594795  1
17  2.0  2.0   9.836914  1
18  2.0  7.0   9.797370  0
19  5.0  8.0  10.294501  0
20  3.0  4.0   8.542844  1
Test set = 
      x1   x2         x1  y
21  3.0  3.0  11.046754  0
22  8.0  1.0  11.871901  0
23  4.0  8.0  10.352095  1
24  8.0  8.0   7.997280  0
25  6.0  6.0   9.567684  0
26  8.0  1.0   6.864585  0


## EXERCISE

Refactor the above DataSet class by adding a method named `train_validation_test_split` to it. This method should split the data into three sets: training, validation, and test. This method should receive a dictionary parameter named `portions` specifying how much of the data is in each set. For a 75%/15%/10% split, one can use the following portions parameter:

```python
portions={"training": .75, 'validation': .15, 'test': .10 }
```

The method should support the `shuffle` parameter as well. You may call the `train_test_split` method internally. Make sure to include a comment describing how your implementation of the method works. Test your method on the `ds` dataset above and show that it works.
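
Two details are worth keeping in mind when working with `portions`: sums of floats are not always exact, and truncating each size separately can leave examples unassigned. The sketch below illustrates one possible way to handle both, assuming a dataset of 27 examples as in `ds`; it is not the only reasonable approach.

```python
import math

portions = {"training": .75, "validation": .15, "test": .10}

# Float sums are not always exact, so an equality test can fail unexpectedly.
print(0.1 + 0.2 == 0.3)               # False
print(math.isclose(0.1 + 0.2, 0.3))   # True

# Tolerance-based validation of the portions.
print(math.isclose(sum(portions.values()), 1.0))   # True

# Sizes for N = 27: truncate the first two counts and give the
# remainder to the test set so that every example is assigned.
N = 27
n_train = int(N * portions["training"])
n_validation = int(N * portions["validation"])
n_test = N - n_train - n_validation
print(n_train, n_validation, n_test)   # 20 4 3
```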

In [16]:
########################## implementation code ###########################

# Create the additional method
def train_validation_test_split(self, portions, shuffle=False, random_state=None):
    """
    Splits the dataset into training, validation, and test sets according to
    the fractions in portions. The data is optionally shuffled first; the rows
    are then cut into three consecutive slices, with the test set receiving
    whatever remains after the training and validation sets are filled.
    """
    # read the requested fraction of each set
    FracTrain = portions["training"]
    FracValidate = portions["validation"]
    FracTest = portions["test"]

    # check that the fractions add up (with a tolerance, since they are floats)
    if not np.isclose(FracTrain + FracValidate + FracTest, 1.0):
        raise ValueError("The portion sizes must add up to 100 percent")

    # use the shuffled method if we want to shuffle the dataset;
    # passing random_state makes the shuffle reproducible
    data = self.shuffled(random_state) if shuffle is True else self

    # determine the number of examples in the training and validation sets;
    # the test set receives the remaining examples
    NumTrain = int(data.N * FracTrain)
    NumValidate = int(data.N * FracValidate)

    # see the amount of examples to be split
    print("Amount of examples:", data.N)

    # create datasets of the assigned amounts
    train = DataSet(data.examples.iloc[range(0, NumTrain)])
    validate = DataSet(data.examples.iloc[range(NumTrain, NumTrain + NumValidate)])
    test = DataSet(data.examples.iloc[range(NumTrain + NumValidate, data.N)])

    # return the datasets
    return train, validate, test

############################ exhibition code #############################

# add the new method to the DataSet class
my.DataSet.train_validation_test_split = train_validation_test_split

# create testing portions
portions={"training": .75, 'validation': .15, 'test': .10}

# splits without shuffle
print("Without Shuffling\n")
train, validate, test = ds.train_validation_test_split(portions)

print('Training Set:', train.N, '\n',train)
print('Validating Set:', validate.N,'\n',validate)
print('Testing set:',test.N,'\n',test)

# splits with shuffle
print("\n\nWith Shuffle")
train1, validate1, test1 = ds.train_validation_test_split(portions, shuffle=True, random_state = 17)

print('Training Set:', train1.N, '\n',train1)
print('Validating Set:', validate1.N,'\n',validate1)
print('Testing set:',test1.N,'\n',test1)



Without Shuffling

Amount of examples: 27
Training Set: 20 
      x1   x2         x1  y
0   8.0  6.0  11.128912  1
1   6.0  6.0   7.939912  1
2   3.0  1.0  10.102054  1
3   4.0  6.0  10.819360  1
4   4.0  1.0   8.568149  0
5   4.0  2.0  11.658371  0
6   5.0  7.0  10.119335  0
7   8.0  2.0  10.630417  0
8   3.0  7.0  12.053922  1
9   5.0  6.0  10.911310  0
10  7.0  5.0  10.253731  0
11  2.0  8.0  11.429588  0
12  2.0  3.0   7.382770  0
13  7.0  3.0  10.379701  1
14  3.0  7.0   7.552759  1
15  3.0  5.0   7.586252  1
16  2.0  6.0   6.594795  1
17  2.0  2.0   9.836914  1
18  2.0  7.0   9.797370  0
19  5.0  8.0  10.294501  0
Validating Set: 4 
      x1   x2         x1  y
20  3.0  4.0   8.542844  1
21  3.0  3.0  11.046754  0
22  8.0  1.0  11.871901  0
23  4.0  8.0  10.352095  1
Testing set: 3 
      x1   x2        x1  y
24  8.0  8.0  7.997280  0
25  6.0  6.0  9.567684  0
26  8.0  1.0  6.864585  0


With Shuffle
Amount of examples: 27
Training Set: 20 
      x1   x2         x1  y
24  8.0  8.0