# Dataset overview

It this notebook we make a manual review of the dataset to understand its structure and the information it contains. While we won't use findings here directly in the feature selection, as we want to rely on dedicated methods, it will be an important knowledge for later decision making.


### Dataset structure

Let's check the structure of the data, first.

In [1]:
import numpy as np

x_path = "data/x_train.txt"
y_path = "data/y_train.txt"

X = np.loadtxt(x_path)
y = np.loadtxt(y_path)

print(f"X samples: {X.shape[0]}, features: {X.shape[1]}")
print(f"y samples: {y.shape[0]}, true samples: {np.sum(y):.0f} ({np.mean(y)*100:.2f}%)")

X samples: 5000, features: 500
y samples: 5000, true samples: 2496 (49.92%)


It seems that data structure is compliant with the description of the task. Classes are very well balanced.

### Features statistics

Now we will check the statistics of the features to understand their distribution. We will also check for the number of (extreme) outliers in the dataset.

In [7]:
for i in range(X.shape[1]):
    feat = X[:, i]
    print(f"Feature {i}: {np.mean(feat):.2f}, (+/- {np.std(feat):.2f})")

Feature 0: -0.02, (+/- 1.94)
Feature 1: -0.03, (+/- 2.17)
Feature 2: -0.01, (+/- 1.99)
Feature 3: -0.02, (+/- 1.59)
Feature 4: -0.01, (+/- 1.68)
Feature 5: -0.04, (+/- 1.69)
Feature 6: -0.02, (+/- 1.79)
Feature 7: -0.02, (+/- 1.93)
Feature 8: -0.02, (+/- 1.67)
Feature 9: -0.01, (+/- 1.60)
Feature 10: 0.02, (+/- 1.00)
Feature 11: -0.00, (+/- 1.01)
Feature 12: 0.01, (+/- 0.99)
Feature 13: -0.00, (+/- 1.01)
Feature 14: 0.01, (+/- 0.99)
Feature 15: -0.01, (+/- 0.97)
Feature 16: 0.00, (+/- 0.99)
Feature 17: 0.01, (+/- 0.99)
Feature 18: 0.00, (+/- 1.00)
Feature 19: -0.02, (+/- 1.00)
Feature 20: 0.04, (+/- 1.00)
Feature 21: -0.01, (+/- 1.01)
Feature 22: 0.02, (+/- 1.00)
Feature 23: 0.01, (+/- 1.01)
Feature 24: -0.01, (+/- 1.00)
Feature 25: -0.01, (+/- 1.00)
Feature 26: -0.01, (+/- 0.99)
Feature 27: 0.01, (+/- 0.99)
Feature 28: -0.01, (+/- 1.00)
Feature 29: -0.01, (+/- 1.00)
Feature 30: -0.02, (+/- 1.00)
Feature 31: 0.02, (+/- 1.02)
Feature 32: -0.02, (+/- 1.00)
Feature 33: 0.01, (+/- 1.00)
Fe

Clear pattern appears here. It is very probable that the data was generated artificially. We can see few repeated distributions of the features. Those are:
- features 0-9: mean ≈ 0, std ≈ 1.6-2.17
- features 10-199: mean ≈ 0, std ≈ 1
- features 200-399: mean ≈ 0.5, std ≈ 0.29
- features 400-499: mean ≈ 10, std ≈ 4.5

The most varying group is the first one, where either the features are not following the same distribution or the high level of noise was added to the data.

### Outliers detection

We will also check for outliers in the dataset.

In [18]:
outliers = {}
for i in range(X.shape[1]):
    feat = X[:, i]
    lower_bound = np.percentile(feat, 25) - 1.5 * (
        np.percentile(feat, 75) - np.percentile(feat, 25)
    )
    upper_bound = np.percentile(feat, 75) + 1.5 * (
        np.percentile(feat, 75) - np.percentile(feat, 25)
    )
    outliers[i] = np.sum((feat < lower_bound) | (feat > upper_bound))

for k, v in sorted(outliers.items(), key=lambda x: x[1], reverse=True):
    if v > 0:
        print(f"Feature {k}: {v} outliers ({v/X.shape[0]*100:.2f}%)")

Feature 486: 125 outliers (2.50%)
Feature 464: 119 outliers (2.38%)
Feature 448: 118 outliers (2.36%)
Feature 439: 117 outliers (2.34%)
Feature 434: 116 outliers (2.32%)
Feature 466: 115 outliers (2.30%)
Feature 436: 114 outliers (2.28%)
Feature 480: 114 outliers (2.28%)
Feature 425: 113 outliers (2.26%)
Feature 443: 111 outliers (2.22%)
Feature 452: 111 outliers (2.22%)
Feature 415: 110 outliers (2.20%)
Feature 474: 110 outliers (2.20%)
Feature 477: 110 outliers (2.20%)
Feature 403: 107 outliers (2.14%)
Feature 409: 107 outliers (2.14%)
Feature 465: 107 outliers (2.14%)
Feature 408: 106 outliers (2.12%)
Feature 431: 106 outliers (2.12%)
Feature 481: 106 outliers (2.12%)
Feature 494: 106 outliers (2.12%)
Feature 414: 105 outliers (2.10%)
Feature 424: 105 outliers (2.10%)
Feature 440: 105 outliers (2.10%)
Feature 485: 105 outliers (2.10%)
Feature 491: 105 outliers (2.10%)
Feature 416: 104 outliers (2.08%)
Feature 469: 104 outliers (2.08%)
Feature 411: 103 outliers (2.06%)
Feature 412: 1

Considering the size of the dataset, the number of outliers is not very high - 2.5 % at most. Hence, models not very sensitive to outliers should be fine without any additional preprocessing.

### Correlation analysis

Now, we will check the correlation between the features to understand the relationships between them (we will use the Pearson correlation coefficient for this).
We will only display those pairs which are correlated above a certain threshold: 0.7.

In [34]:
corr = np.corrcoef(X, rowvar=False)
threshold = 0.7

corr_pairs = np.argwhere(np.abs(corr) > threshold)
corr_pairs = corr_pairs[corr_pairs[:, 0] != corr_pairs[:, 1]]
corr_pairs = np.unique([tuple(sorted(pair)) for pair in corr_pairs], axis=0)

for i, j in corr_pairs:
    print(f"Features {i} and {j}: {corr[i, j]:.2f}")

Features 0 and 1: 0.86
Features 0 and 2: 0.76
Features 0 and 3: 0.74
Features 0 and 4: 0.82
Features 0 and 5: 0.76
Features 0 and 6: 0.96
Features 0 and 7: 0.81
Features 0 and 8: 0.78
Features 0 and 9: 0.73
Features 1 and 2: 0.77
Features 1 and 3: 0.81
Features 1 and 4: 0.76
Features 1 and 5: 0.84
Features 1 and 6: 0.89
Features 1 and 7: 0.83
Features 1 and 8: 0.87
Features 1 and 9: 0.93
Features 2 and 6: 0.81
Features 2 and 7: 0.71
Features 2 and 8: 0.80
Features 2 and 9: 0.70
Features 3 and 5: 0.77
Features 3 and 6: 0.72
Features 3 and 7: 0.74
Features 3 and 8: 0.90
Features 3 and 9: 0.71
Features 4 and 5: 0.80
Features 4 and 6: 0.74
Features 4 and 7: 0.94
Features 4 and 8: 0.80
Features 4 and 9: 0.70
Features 5 and 6: 0.78
Features 5 and 7: 0.86
Features 5 and 8: 0.90
Features 5 and 9: 0.75
Features 6 and 7: 0.78
Features 6 and 8: 0.80
Features 6 and 9: 0.80
Features 7 and 8: 0.85
Features 7 and 9: 0.73
Features 8 and 9: 0.81


High correlation between variables is observed only for the features 0-9. This is another indication that this group of features is different from the rest of the dataset. There is a high change their are linear combinations of each other.

### Multicollearity analysis

Next, we will check for multicollinearity in the dataset.

In [25]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = np.array([variance_inflation_factor(X, i) for i in range(X.shape[1])])

for i in np.argsort(vif)[::-1]:
    print(f"Feature {i}: {vif[i]:.2f}")


  vif = 1. / (1. - r_squared_i)


Feature 0: inf
Feature 9: inf
Feature 101: inf
Feature 8: inf
Feature 7: inf
Feature 6: inf
Feature 5: inf
Feature 102: inf
Feature 4: inf
Feature 103: inf
Feature 104: inf
Feature 100: inf
Feature 105: inf
Feature 3: inf
Feature 106: inf
Feature 2: inf
Feature 107: inf
Feature 108: inf
Feature 109: inf
Feature 1: inf
Feature 496: 6.93
Feature 427: 6.88
Feature 483: 6.88
Feature 412: 6.87
Feature 413: 6.86
Feature 408: 6.84
Feature 434: 6.80
Feature 415: 6.79
Feature 485: 6.79
Feature 459: 6.78
Feature 441: 6.77
Feature 426: 6.77
Feature 456: 6.77
Feature 493: 6.76
Feature 469: 6.75
Feature 498: 6.75
Feature 455: 6.74
Feature 402: 6.74
Feature 477: 6.73
Feature 458: 6.73
Feature 461: 6.72
Feature 406: 6.71
Feature 451: 6.71
Feature 446: 6.70
Feature 449: 6.70
Feature 421: 6.69
Feature 443: 6.69
Feature 488: 6.69
Feature 497: 6.69
Feature 476: 6.68
Feature 471: 6.68
Feature 436: 6.68
Feature 409: 6.67
Feature 450: 6.67
Feature 401: 6.67
Feature 475: 6.66
Feature 433: 6.66
Feature 422: 6

There are several variables which have infinite VIF. This means that they can be expressed as a linear combination of other variables. While usually one want to get rid of such variables, in this case they will be our main focus in the feature selection process, since it means that such variable holds information of several variables at once (since no ideal correlation was found between them, it must be more than one variable).

### Correlation with the target

Finally, we will check the correlation of the features with the target variable.

In [24]:
corr_target = np.corrcoef(X, y, rowvar=False)[:-1, -1]

for i, corr in enumerate(corr_target[np.argsort(np.abs(corr_target))[::-1]]):
    print(f"Feature {i}: {corr:.2f}")

0.045463713735111974 -0.040857772027013066
Feature 0: 0.05
Feature 1: -0.04
Feature 2: -0.04
Feature 3: -0.04
Feature 4: 0.04
Feature 5: -0.04
Feature 6: 0.04
Feature 7: 0.04
Feature 8: 0.04
Feature 9: 0.04
Feature 10: 0.04
Feature 11: 0.04
Feature 12: -0.04
Feature 13: -0.03
Feature 14: -0.03
Feature 15: 0.03
Feature 16: 0.03
Feature 17: -0.03
Feature 18: 0.03
Feature 19: 0.03
Feature 20: -0.03
Feature 21: 0.03
Feature 22: -0.03
Feature 23: 0.03
Feature 24: -0.03
Feature 25: 0.03
Feature 26: -0.03
Feature 27: 0.03
Feature 28: 0.03
Feature 29: 0.03
Feature 30: 0.03
Feature 31: 0.03
Feature 32: 0.03
Feature 33: -0.03
Feature 34: 0.03
Feature 35: 0.03
Feature 36: 0.03
Feature 37: 0.03
Feature 38: -0.03
Feature 39: -0.03
Feature 40: -0.03
Feature 41: -0.03
Feature 42: -0.03
Feature 43: 0.03
Feature 44: 0.02
Feature 45: 0.02
Feature 46: -0.02
Feature 47: -0.02
Feature 48: -0.02
Feature 49: -0.02
Feature 50: -0.02
Feature 51: 0.02
Feature 52: -0.02
Feature 53: -0.02
Feature 54: 0.02
Feature

There is no feature which individually has a high correlation with the target. This means that the data is not linearly separable and we will need to use more advanced models to classify it.

### Summary

Based on this analysis, we can conclude that the dataset is probably artificial and was generated with a specific pattern. Since maximal reduction of the used feature is goal, for this moment, the best candidates for the feature selection are the features 0-9 and 100-109. It must be noted that there are high correlations between the features in the first group, which means that they can be expressed as a linear combination of each other.

Considering the nature of the dataset, we will use the following models: Random Forest and Gradient Boosting. Neural Networks are also a good candidate, but given the size of the dataset, there is a high risk of overfitting. SVM seems as a good candidate, but it does not provide probability of the prediction, which is a necessity in this task (where we want to return the most probable samples).