## Sparse
This notebook simulates what happens when a model is given a sparse but highly discriminative feature. For example:

| x_1 | ... | x_n | y   |
| --- | --- | --- | --- |
| 1   | ... | 1   | 1   |
| 2   | ... | 1   | 1   |
| 3   | ... | 0   | 0   |
| 4   | ... | 0   | 0   |
| ... | ... | ... | ... |
| n-2 | ... | 0   | 1   |
| n-1 | ... | 0   | 1   |
| n   | ... | 0   | 1   |

Here, feature ``x_1`` alone puts the decision boundary somewhere around ``n/2``. However, whenever ``x_n`` equals 1, which is rare, ``y`` is also in the positive class. All features are linear in these cases.
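A dataset with this shape can be sketched as follows. The helper name `make_sparse_example` and its parameters (`n_rows`, `n_sparse`, `seed`) are hypothetical, chosen only for illustration; the notebook does not define them.

```python
import numpy as np

def make_sparse_example(n_rows=1000, n_sparse=20, seed=0):
    """Hypothetical helper: build a two-feature dataset like the table above."""
    rng = np.random.default_rng(seed)
    # Dense feature: 1..n_rows; the label switches to 1 past the midpoint.
    x_dense = np.arange(1, n_rows + 1)
    y = (x_dense > n_rows // 2).astype(int)
    # Sparse feature: set on a few rows below the midpoint, where it forces y to 1.
    x_sparse = np.zeros(n_rows, dtype=int)
    lower_half = np.flatnonzero(y == 0)
    hot = rng.choice(lower_half, size=n_sparse, replace=False)
    x_sparse[hot] = 1
    y[hot] = 1
    return np.column_stack((x_dense, x_sparse)), y

X, y = make_sparse_example()
```

Every row with the sparse feature set is positive, while the dense feature alone would have classified those rows as negative.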

In [1]:
import numpy as np
np.random.seed(42)

## Logistic Regression

To keep the tests as vanilla as possible, the algorithms are written from scratch.

In [150]:
class LogisticRegression:
    def __init__(self, learning_step, learning_rate):
        self.learning_step = learning_step
        self.learning_rate = learning_rate
        self.weights = None
        self.bias = None
        
    def linear_reg(self, w, x):
        return np.dot(x, w)
    
    def sigmoid(self, z):
        return 1.0 / ( 1.0 + np.exp(-1.0 * z))
    
    def fit(self, x, y):
        # Prepend a column of ones so that weights[0] acts as the intercept.
        # (A column of zeros would leave weights[0] stuck at 0.)
        self.intercepts = np.ones((x.shape[0], 1))
        x = np.concatenate((self.intercepts, x), axis=1)
        self.weights = np.zeros(x.shape[1])
        for step in range(self.learning_step):
            z = self.linear_reg(self.weights, x)
            sig = self.sigmoid(z)
            # Gradient of the mean log loss with respect to the weights.
            gradient = np.dot(x.transpose(), (sig - y)) / y.size
            self.weights -= self.learning_rate * gradient
        print(self.weights)

    def predict(self, x):
        # Use the same column of ones as in fit.
        self.intercepts = np.ones((x.shape[0], 1))
        x = np.concatenate((self.intercepts, x), axis=1)
        return self.sigmoid(np.dot(x, self.weights))
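
The update in `fit` uses `x.T @ (sigmoid(x @ w) - y) / y.size`, which is the gradient of the mean log loss. As a sanity check, the sketch below compares that expression with a central finite-difference estimate on a small random problem. The standalone function names and the random data are assumptions made for this check, not part of the class above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X, y):
    # Mean negative log-likelihood of the labels under the logistic model.
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradient(w, X, y):
    # Same expression as the gradient line in fit.
    return X.T @ (sigmoid(X @ w) - y) / y.size

def numeric_gradient(w, X, y, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (log_loss(w + step, X, y) - log_loss(w - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = np.column_stack((np.ones(8), rng.normal(size=(8, 2))))  # first column is the intercept
y = rng.integers(0, 2, size=8).astype(float)
w = rng.normal(size=3)
max_diff = np.max(np.abs(analytic_gradient(w, X, y) - numeric_gradient(w, X, y)))
```

If the two gradients agree, `max_diff` should be very small relative to the gradient entries.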

## Null
First, let's test the impact of completely non-pertinent features.

In [88]:
num_zeros = 10
num_ones = 20
x = np.concatenate(
    (
        np.zeros((num_zeros, 1)), 
        np.ones((num_ones, 1))
    )
)
y = np.concatenate(
    (
        np.zeros((num_zeros)), 
        np.ones((num_ones))
    )
)

In [151]:
lr = LogisticRegression(100000, 0.02)
lr.fit(x, y)
lr.predict(x)

[-5.38570938 11.87369213]


array([0.00456069, 0.00456069, 0.00456069, 0.00456069, 0.00456069,
       0.00456069, 0.00456069, 0.00456069, 0.00456069, 0.00456069,
       0.9984807 , 0.9984807 , 0.9984807 , 0.9984807 , 0.9984807 ,
       0.9984807 , 0.9984807 , 0.9984807 , 0.9984807 , 0.9984807 ,
       0.9984807 , 0.9984807 , 0.9984807 , 0.9984807 , 0.9984807 ,
       0.9984807 , 0.9984807 , 0.9984807 , 0.9984807 , 0.9984807 ])
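
The two distinct probabilities above follow directly from the printed weights, reading the first weight as the intercept and the second as the coefficient on `x`. A minimal sketch of that arithmetic, with the weights copied from the printed output:

```python
import numpy as np

# Weights copied from the output printed by fit: [intercept, coefficient on x].
weights = np.array([-5.38570938, 11.87369213])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p_zero = sigmoid(weights[0])               # rows where x = 0
p_one = sigmoid(weights[0] + weights[1])   # rows where x = 1
print(p_zero, p_one)
```

These reproduce the values shown in the prediction array up to the precision of the printed weights.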

In [47]:
x.shape

(60, 1)

In [254]:
z = lr.linear_reg(lr.weights, x)

array([[0.        ],
       [0.        ],
       [1.25547868]])

In [260]:
lr.sigmoid(0)

0.5

In [86]:
# Weights [3, 2, 1], inputs [1, 2, 3], plus a bias of 5.
z = lr.linear_reg(np.array([3,2,1]), np.array([1,2,3])) + 5
lr.sigmoid(z)

0.999999694097773

In [50]:
def generate_features(feature_ranges, samples=10000):
    feature_vec = []
    for feature in feature_ranges:
        increm = (feature[2] - feature[1])/(samples-1)
        feature_vec.append(
            (
                feature[0],
                [(increm * x) + feature[1] for x in range(samples)]
            )
        )
    return feature_vec
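
Each entry in `feature_ranges` is read as `(name, low, high)`, and the function produces `samples` evenly spaced values from `low` to `high` inclusive. Assuming NumPy arrays are acceptable in place of Python lists, the same spacing can be sketched with `np.linspace`. The name `generate_features_np` is hypothetical.

```python
import numpy as np

def generate_features_np(feature_ranges, samples=10000):
    """Hypothetical NumPy variant: (name, low, high) -> (name, array of values)."""
    return [
        (name, np.linspace(low, high, samples))  # endpoints are included
        for name, low, high in feature_ranges
    ]

name, values = generate_features_np([('x_1', 1, 100000)], samples=5)[0]
print(name, values)
```

With `samples=5` the step is `(100000 - 1) / 4 = 24999.75`, so the values run from 1 to 100000 in four equal steps.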

In [61]:
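def generate_y(pos_ratio, sample=10000):
    # Labels are ordered: all negatives first, then all positives.
    # Deriving the negative count from the positive count keeps the total at `sample`.
    num_pos = int(round(sample * pos_ratio))
    return [0] * (sample - num_pos) + [1] * num_pos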

In [62]:
def generate

In [62]:
features = [
    ('x_1', 1, 100000)
]

In [63]:
x = generate_features(features)

[0, 0, 0, 0, 0, 0, 0, 0, 0]