Definitions
* X - Input
* Y - Output
* Column - Single input feature (ex: x1, x2, ...) across all rows
* Row - Single collection of columns
* Test data
* Training data

Computation
1. Scale the x_train columns (i.e. derive a function that adjusts the columns such that each column has range from -1 to 1)
2. Compute the variance of each column
3. Compute the correlation of each column with all other columns (ex: x1 | x2, x1 | x3, x2 | x3, ...)
4. Include all terms with the following above a threshold
  a. 1-Dimension: variance
  b. 2-Dimension: variance_1 * variance_2 * correlation_1_2
  c. 3-Dimension: variance_1 * variance_2 * variance_3 * correlation_1_2 * correlation_2_3 * correlation_1_3
5. Iteratively expand thresholds to see if they improve r-squared for test/train data

Open Questions
* Q1: Is high variance actually a good feature selector? 
  What if one feature perfectly predicts the output?
  What if one feature is randomly min/max?
  A1: The point of variance feature selection is to remove super low variace features, so maybe the threshold should be low
  Q2: What if the output variance is also low?
  A2: ????????
* Q: Is high correlation actually a good feature combiner?
  A: I think yes. No reason to combine features that cancel each other out.
* Q: What to do about variance_1 * variance_2 not being the same as variance_1_2?
  Example: x1=[1,0] and x2=[0,1]
  A: By combining with correlation, columns whose variance cancels should not be included.

Big Idea
* Look at similarities across datasets for insights into how to process new datasets

In [1]:
import mnist
import numpy
from sklearn.preprocessing import MinMaxScaler

x_train, y_train, x_test, y_test = mnist.load()
num_columns = len(x_train[0])
num_rows = len(x_train)
print(num_columns)
print(num_rows)

784
60000


In [2]:
scaler = MinMaxScaler()
x_train_scaled = scaler.fit_transform(x_train)

In [3]:
variances = [numpy.var(x_train_scaled[:,i]) for i in range(0, num_columns)]

In [4]:
covariances = numpy.cov(x_train_scaled, rowvar=False)

In [5]:
print(min(variances))
print(max(variances))
print(numpy.min(covariances))
print(numpy.max(covariances))

0.0
0.19920665087744033
-0.08161850172665423
0.1992099710436244


In [24]:
degree_1 = set()
for i, var in enumerate(variances):
    if var > 0.12:
        degree_1.add(i)
print(len(degree_1))

241


In [40]:
degree_2 = set()
for i, var1 in enumerate(variances):
    for j, var2 in enumerate(variances):
        if var1 * var2 > 0.036 and covariances[i][j] > 0.055:
            degree_2.add((i,j))
print(len(degree_2))
print(degree_2)

630
{(626, 184), (628, 602), (489, 462), (628, 629), (461, 461), (630, 657), (184, 598), (601, 630), (631, 631), (491, 489), (462, 462), (572, 572), (181, 183), (183, 629), (463, 436), (656, 184), (572, 599), (381, 464), (436, 409), (409, 410), (436, 436), (546, 546), (382, 465), (629, 184), (210, 182), (518, 573), (437, 410), (270, 242), (409, 437), (546, 573), (379, 405), (211, 183), (350, 378), (519, 546), (184, 184), (490, 519), (574, 491), (380, 406), (519, 573), (270, 354), (574, 518), (432, 434), (353, 379), (627, 628), (434, 462), (354, 353), (434, 489), (489, 434), (628, 601), (461, 433), (628, 183), (406, 408), (461, 460), (630, 656), (462, 434), (631, 630), (381, 436), (181, 182), (183, 628), (436, 381), (575, 576), (382, 410), (183, 210), (604, 630), (409, 382), (627, 214), (382, 437), (437, 382), (409, 409), (464, 354), (518, 572), (270, 241), (490, 491), (410, 382), (211, 182), (380, 378), (598, 600), (182, 631), (184, 183), (600, 628), (432, 406), (437, 521), (627, 600),

In [43]:
degree_3 = set()
for i, var1 in enumerate(variances):
    for j, var2 in enumerate(variances):
        for k, var3 in enumerate(variances):
            if var1 * var2 * var3 > 0.0025 and covariances[i][j] * covariances[i][k] * covariances[j][k] > 0.002:
                degree_3.add((i,j,k))
print(len(degree_3))
print(degree_3)

6455
{(519, 520, 520), (383, 410, 410), (326, 353, 298), (631, 632, 659), (598, 627, 599), (516, 543, 516), (653, 625, 625), (378, 379, 405), (373, 372, 400), (603, 575, 603), (571, 599, 570), (627, 655, 655), (237, 236, 237), (515, 515, 514), (374, 318, 346), (292, 292, 291), (463, 462, 462), (629, 629, 630), (577, 577, 605), (327, 327, 326), (180, 152, 180), (381, 353, 353), (460, 459, 432), (297, 270, 298), (541, 569, 541), (400, 372, 373), (493, 465, 492), (547, 491, 519), (576, 603, 576), (318, 346, 318), (348, 320, 348), (298, 270, 326), (524, 551, 551), (626, 598, 598), (317, 372, 344), (236, 263, 235), (206, 207, 234), (353, 381, 353), (377, 377, 405), (407, 407, 380), (571, 572, 571), (181, 153, 153), (271, 243, 242), (270, 298, 326), (573, 546, 546), (570, 569, 569), (243, 215, 244), (520, 519, 547), (262, 317, 262), (597, 626, 598), (207, 207, 180), (344, 289, 289), (466, 467, 467), (403, 402, 374), (373, 346, 373), (297, 297, 269), (321, 293, 321), (627, 629, 628), (237, 21

In [44]:
input_data = numpy.zeros((num_rows, len(degree_1)+len(degree_2)+len(degree_3)))

col = 0

for j in degree_1:
    for i in range(num_rows):
        input_data[i][col] = x_train_scaled[i][j]
    col += 1

for j,k in degree_2:
    for i in range(num_rows):
        input_data[i][col] = x_train_scaled[i][j] * x_train_scaled[i][k]
    col += 1

for j,k,l in degree_3:
    for i in range(num_rows):
        input_data[i][col] = x_train_scaled[i][j] * x_train_scaled[i][k] * x_train_scaled[i][l]
    col += 1



In [46]:
x_test_scaled = scaler.fit_transform(x_test)
print(len(x_test_scaled))
print(len(x_test_scaled[0]))
test_input_data = numpy.zeros((len(x_test_scaled[0]), len(degree_1)+len(degree_2)+len(degree_3)))

col = 0

for j in degree_1:
    for i in range(len(x_test_scaled[0])):
        test_input_data[i][col] = x_test_scaled[i][j]
    col += 1

for j,k in degree_2:
    for i in range(len(x_test_scaled[0])):
        test_input_data[i][col] = x_test_scaled[i][j] * x_test_scaled[i][k]
    col += 1

for j,k,l in degree_3:
    for i in range(len(x_test_scaled[0])):
        test_input_data[i][col] = x_test_scaled[i][j] * x_test_scaled[i][k] * x_test_scaled[i][l]
    col += 1

10000
784


In [47]:
from sklearn.linear_model import LinearRegression

test_predictions = [(-1, -1)] * len(y_test)
for i in range(10):
    print(f"Compute linear (or Lasso) regression for {i}")
    y_train_0 = list(map(lambda y: 1 if y == i else 0, y_train))
    linreg = LinearRegression().fit(input_data, y_train_0)
    
    predictions = linreg.predict(test_input_data)
    for j in range(len(predictions)):
        if predictions[j] > test_predictions[j][0]:
            test_predictions[j] = (predictions[j], i)

Compute linear (or Lasso) regression for 0


In [15]:
total_correct = 0
total_wrong = 0
for i in range(len(test_predictions)):
    if y_test[i] == test_predictions[i][1]:
        total_correct += 1
    else:
        total_wrong += 1

print(f"Total Correct: {total_correct}")
print(f"Total Wrong: {total_wrong}")

Actual: 2, Prediction: 6, Prob: 0.47120968882108216
Actual: 5, Prediction: 2, Prob: 0.4131203464257995
Actual: 6, Prediction: 4, Prob: 0.3119131635291668
Actual: 5, Prediction: 0, Prob: 0.42198862287493205
Actual: 3, Prediction: 2, Prob: 0.5453331713292986
Actual: 2, Prediction: 0, Prob: 0.5806487215968459
Actual: 2, Prediction: 7, Prob: 0.6244404205887437
Actual: 9, Prediction: 8, Prob: 0.3756825801986542
Actual: 7, Prediction: 1, Prob: 0.40836631117174216
Actual: 4, Prediction: 9, Prob: 0.4388262155489076
Actual: 7, Prediction: 4, Prob: 0.5161367963416668
Actual: 2, Prediction: 6, Prob: 0.2617611513702218
Actual: 5, Prediction: 3, Prob: 0.4992491015120295
Actual: 5, Prediction: 9, Prob: 0.45899290050389946
Actual: 6, Prediction: 5, Prob: 0.46663100417068604
Actual: 5, Prediction: 0, Prob: 0.27254308912249064
Actual: 5, Prediction: 3, Prob: 0.43133432696020707
Actual: 9, Prediction: 8, Prob: 0.34710892954287276
Actual: 3, Prediction: 6, Prob: 0.6513180361358468
Actual: 4, Prediction: 