## Importance of Data

In [1]:
import numpy as np
import pandas as pd

### Small Dataset

In [2]:
f = 10
n = 250

In [3]:
np.random.seed(100)

In [8]:
x = np.random.randint(0, 2, (n, f))
x[:4]

array([[1, 0, 1, 1, 0, 1, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]])

In [10]:
y = np.random.randint(0, 2, n)
y[:4]

array([0, 1, 0, 1])

In [11]:
2 ** f

1024

In order to proceed, the raw data is put into a pandas DataFrame object, which simplifies certain operations and analyses:

In [12]:
fcols = [f'f{_}' for _ in range(f)]
fcols

['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9']

In [13]:
data = pd.DataFrame(x, columns=fcols)
data['l'] = y

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   f0      250 non-null    int64
 1   f1      250 non-null    int64
 2   f2      250 non-null    int64
 3   f3      250 non-null    int64
 4   f4      250 non-null    int64
 5   f5      250 non-null    int64
 6   f6      250 non-null    int64
 7   f7      250 non-null    int64
 8   f8      250 non-null    int64
 9   f9      250 non-null    int64
 10  l       250 non-null    int64
dtypes: int64(11)
memory usage: 21.6 KB


In [15]:
grouped = data.groupby(list(data.columns))

In [16]:
freq = grouped['l'].size().unstack(fill_value=0)

In [17]:
freq['sum'] = freq[0] + freq[1]

In [18]:
freq.head(10)

l                              0  1  sum
f0 f1 f2 f3 f4 f5 f6 f7 f8 f9
0  0  0  0  0  0  0  1  0  0   0  2    2
0  0  0  0  0  0  0  1  0  1   1  0    1
0  0  0  0  0  0  1  1  1  0   0  1    1
0  0  0  0  0  1  0  0  0  0   1  0    1
0  0  0  0  0  1  0  1  0  1   0  1    1
0  0  0  0  0  1  1  1  1  0   0  1    1
0  0  0  0  1  0  0  0  1  1   0  1    1
0  0  0  0  1  0  0  1  0  1   2  0    2
0  0  0  0  1  0  1  0  0  0   0  1    1
0  0  0  0  1  0  1  1  0  1   0  1    1


In [19]:
freq['sum'].describe().astype(int)

count    225
mean       1
std        0
min        1
25%        1
50%        1
75%        1
max        3
Name: sum, dtype: int64

In [20]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

In [22]:
model = MLPClassifier(hidden_layer_sizes=[128, 128, 128], max_iter=1000, random_state=100)

In [23]:
model.fit(data[fcols], data['l'])

MLPClassifier(hidden_layer_sizes=[128, 128, 128], max_iter=1000,
              random_state=100)

In [24]:
accuracy_score(data['l'], model.predict(data[fcols]))

0.932

Now test the predictive power out-of-sample by splitting the data into a training and a test subset:

In [25]:
split = int(len(data) * 0.7)

In [26]:
train = data[:split]
test = data[split:]

In [27]:
model.fit(train[fcols], train['l'])

MLPClassifier(hidden_layer_sizes=[128, 128, 128], max_iter=1000,
              random_state=100)

In [28]:
accuracy_score(train['l'], model.predict(train[fcols]))

0.9657142857142857

In [29]:
accuracy_score(test['l'], model.predict(test[fcols]))

0.52

### Conclusion on the Small Dataset
Roughly speaking, a neural network trained on only a small data set learns wrong
relationships because of the two problems visible in the frequency analysis above:
only 225 of the 1,024 possible patterns appear in the data, and most observed
patterns appear just once. These problems hardly matter for learning relationships
in-sample. On the contrary, the smaller a data set is, the more easily in-sample
relationships can be learned in general. They are highly relevant, however, when the
trained neural network is used to generate predictions out-of-sample.
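
The gap between in-sample and out-of-sample performance can be reproduced without a neural network. The following is a minimal sketch, not the method used above: it swaps in a plain lookup table that memorizes the majority label per feature pattern. The seed, the `encode` helper, and the default label 0 for unseen patterns are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(100)  # illustrative seed, not the one used above
f, n = 10, 250

# random binary features and random binary labels
x = rng.integers(0, 2, (n, f))
y = rng.integers(0, 2, n)

split = int(n * 0.7)
x_train, y_train = x[:split], y[:split]
x_test, y_test = x[split:], y[split:]

weights = 2 ** np.arange(f)


def encode(features):
    """Map each binary feature pattern to a unique integer in [0, 2 ** f)."""
    return features @ weights


# per pattern: number of label-1 observations and total observations
codes_train = encode(x_train)
ones = np.bincount(codes_train, weights=y_train, minlength=2 ** f)
total = np.bincount(codes_train, minlength=2 ** f)

# memorized majority label; ties and unseen patterns fall back to 0
table = (2 * ones > total).astype(int)

acc_in = np.mean(table[codes_train] == y_train)
acc_out = np.mean(table[encode(x_test)] == y_test)
print(f'in-sample: {acc_in:.3f} | out-of-sample: {acc_out:.3f}')
```

Because the labels are random, pure memorization should score high in-sample and close to 50% out-of-sample, which mirrors the pattern of the accuracy values above.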

### Larger Dataset

In [30]:
factor = 50

In [32]:
big = pd.DataFrame(np.random.randint(0, 2, (factor * n, f)), columns=fcols)

In [33]:
big['l'] = np.random.randint(0, 2, (factor * n))

In [34]:
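# note: split still holds int(250 * 0.7) = 175 from the small data set,
# so train has 175 rows here; int(len(big) * 0.7) gives a 70/30 split of big
train = big[:split]
test = big[split:]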

In [35]:
model.fit(train[fcols], train['l'])

MLPClassifier(hidden_layer_sizes=[128, 128, 128], max_iter=1000,
              random_state=100)

In [36]:
accuracy_score(train['l'], model.predict(train[fcols]))

0.96

In [37]:
accuracy_score(test['l'], model.predict(test[fcols]))

0.5013387423935092
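
In the cells above, `split` was computed from the small data set (`int(250 * 0.7)`, that is 175), so slicing `big` with it yields 175 training rows and 12,325 test rows. A minimal sketch of a helper that derives the cut-off from the frame being split is given below; `split_frame`, `train_share`, and the `demo` frame are hypothetical names introduced for illustration.

```python
import pandas as pd


def split_frame(frame, train_share=0.7):
    """Split a DataFrame sequentially into train and test parts.

    The cut-off is derived from the length of `frame` itself, so the
    helper cannot silently reuse a value computed for another data set.
    """
    cut = int(len(frame) * train_share)
    return frame.iloc[:cut], frame.iloc[cut:]


# stand-in frame with as many rows as the larger data set (50 * 250)
demo = pd.DataFrame({'a': range(50 * 250)})
demo_train, demo_test = split_frame(demo)
print(len(demo_train), len(demo_test))
```

Applied to `big`, the call would read `train, test = split_frame(big)`. The accuracy values that result are not shown here because they depend on re-running the model.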

A quick analysis of the available data, as shown next, helps explain the prediction
accuracy of about 50% on the test data. First, all possible patterns are now represented
in the data set. Second, the patterns have an average frequency of above 10 in the data
set. In other words, a neural network trained on a proportional share of this data sees
basically all the patterns multiple times. This allows the neural network to “learn” that
both labels 0 and 1 are equally likely for all possible patterns. Of course, it is a
rather involved way of learning this, but it is a good illustration of the fact that a
relatively small data set might often be too small in the context of neural networks:

In [38]:
grouped = big.groupby(list(big.columns))

In [39]:
freq = grouped['l'].size().unstack(fill_value=0)

In [40]:
freq['sum'] = freq[0] + freq[1]

In [41]:
freq.head(6)

l                               0   1  sum
f0 f1 f2 f3 f4 f5 f6 f7 f8 f9
0  0  0  0  0  0  0  0  0  0    3   3    6
0  0  0  0  0  0  0  0  0  1    5   5   10
0  0  0  0  0  0  0  0  1  0    6   4   10
0  0  0  0  0  0  0  0  1  1    3   4    7
0  0  0  0  0  0  0  1  0  0    8  10   18
0  0  0  0  0  0  0  1  0  1   10   7   17


In [42]:
freq['sum'].describe().astype(int)

count    1024
mean       12
std         3
min         2
25%        10
50%        12
75%        14
max        25
Name: sum, dtype: int64
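
The two statistics that matter here, the share of possible patterns that is observed and the average frequency per observed pattern, can also be computed directly with NumPy. This is a minimal sketch under assumptions: `pattern_coverage` is a hypothetical helper, and the seed differs from the one used above, so the exact numbers will not match the tables.

```python
import numpy as np


def pattern_coverage(x):
    """Return (share of possible binary patterns observed,
    mean number of observations per observed pattern)."""
    n_features = x.shape[1]
    # unique rows are the distinct feature patterns in the data
    _, counts = np.unique(x, axis=0, return_counts=True)
    return len(counts) / 2 ** n_features, counts.mean()


rng = np.random.default_rng(100)  # illustrative seed
f, n, factor = 10, 250, 50

results = {}
for rows in (n, factor * n):
    x = rng.integers(0, 2, (rows, f))
    results[rows] = pattern_coverage(x)
    share, mean_count = results[rows]
    print(f'{rows:6d} rows | coverage {share:.1%} | mean frequency {mean_count:.1f}')
```

With 250 rows, at most 250 of the 1,024 patterns can occur, so coverage stays below 25%. With 12,500 rows, coverage should be at or very near 100%.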

### Big Dataset
A big data set is one that is large enough, in terms of volume, variety, and possibly velocity, for an AI algorithm to be trained properly,
such that the algorithm performs better at a prediction task than a baseline algorithm.
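
The comparison against a baseline can be made concrete. The sketch below assumes the simplest possible baseline, always predicting the most frequent training label; the text does not prescribe a particular baseline, and `majority_baseline_accuracy`, `beats_baseline`, and `margin` are hypothetical names introduced for illustration.

```python
import numpy as np


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    values, counts = np.unique(y_train, return_counts=True)
    majority = values[np.argmax(counts)]
    return float(np.mean(np.asarray(y_test) == majority))


def beats_baseline(model_accuracy, y_train, y_test, margin=0.0):
    """True if the model accuracy exceeds the baseline by more than `margin`."""
    baseline = majority_baseline_accuracy(y_train, y_test)
    return model_accuracy > baseline + margin


# toy labels for illustration only
y_train_demo = np.array([0, 0, 1])
y_test_demo = np.array([0, 1, 0, 0])

baseline_demo = majority_baseline_accuracy(y_train_demo, y_test_demo)
print(baseline_demo, beats_baseline(0.8, y_train_demo, y_test_demo))
```

For balanced random labels, as in the data sets above, this baseline sits at roughly 50%, so a test accuracy near 0.5 does not beat it.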