# Single Neuron (Logistic Regression) learns the AND gate — **SGD from scratch (NumPy only)**

This notebook teaches how a **neural network classifier** (a single neuron) works using:
- **logits** (pre-activation), **activation**, **probabilities**
- **binary cross-entropy** loss
- **backprop gradients** and **SGD updates**
- success + failure cases, and small tasks to explore hyperparameters

---


## 1) Theory & terminology (with formulas)

### Model (1 neuron)
**Inputs / features**:  \(\mathbf{x}=[x_1,x_2]^T\)  
**Parameters**: weights \(\mathbf{w}=[w_1,w_2]^T\), bias \(b\)

### Logit (a.k.a. score, pre-activation)
\[
z = \mathbf{w}^T\mathbf{x} + b = w_1x_1+w_2x_2+b
\]
- \(z\) is called the **logit** (raw score before activation).

### Activation (sigmoid → probability)
\[
\hat{y}=\sigma(z)=\frac{1}{1+e^{-z}}
\]
- \(\hat y \in (0,1)\) can be interpreted as \(P(y=1 \mid \mathbf{x})\).

### Decision rule (class prediction)
\[
\hat{c}=\mathbb{1}[\hat{y}\ge 0.5] \quad\Longleftrightarrow\quad \mathbb{1}[z\ge 0]
\]

### Loss (Binary Cross-Entropy, BCE)
\[
L(\hat y,y)= -\left(y\ln(\hat y) + (1-y)\ln(1-\hat y)\right)
\]

### Backprop (sigmoid + BCE gives a simple gradient)
\[
\delta=\frac{\partial L}{\partial z} = \hat y - y
\]
\[
\frac{\partial L}{\partial w_i}=\delta x_i,\quad \frac{\partial L}{\partial b}=\delta
\]
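
The simple form of \(\delta\) is not an extra assumption; it follows from the chain rule applied to the sigmoid and BCE definitions above. A short derivation, using only symbols already defined in this section:

\[
\frac{\partial L}{\partial \hat y} = -\frac{y}{\hat y} + \frac{1-y}{1-\hat y},
\qquad
\frac{\partial \hat y}{\partial z} = \hat y\,(1-\hat y)
\]
\[
\frac{\partial L}{\partial z}
= \left(-\frac{y}{\hat y} + \frac{1-y}{1-\hat y}\right)\hat y\,(1-\hat y)
= -y(1-\hat y) + (1-y)\hat y
= \hat y - y
\]

Since \(\partial z/\partial w_i = x_i\) and \(\partial z/\partial b = 1\), multiplying \(\delta\) by these factors gives the weight and bias gradients.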

### SGD update (learning rate \(\eta\))
\[
w_i \leftarrow w_i-\eta(\hat y-y)x_i,\quad b\leftarrow b-\eta(\hat y-y)
\]
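
The gradient formulas and the update rule above can be sanity-checked numerically. The sketch below compares the analytic gradients with central finite differences and then applies one SGD step. The values of `w`, `b`, `x`, `y`, and `eta` are illustrative assumptions, not the notebook's training setup, and `loss_at` is a hypothetical helper introduced only for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(yhat, y, eps=1e-12):
    yhat = np.clip(yhat, eps, 1 - eps)
    return -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

def loss_at(w, b, x, y):
    """Hypothetical helper: BCE loss of one sample for given parameters."""
    return bce(sigmoid(x @ w + b), y)

# Illustrative values (assumptions chosen for this check only).
w = np.array([0.5, -0.3])
b = 0.1
x = np.array([1.0, 1.0])
y = 1.0
eta = 0.1

# Analytic gradients: delta = yhat - y, dL/dw = delta * x, dL/db = delta.
delta = sigmoid(x @ w + b) - y
grad_w = delta * x
grad_b = delta

# Central finite differences as an independent estimate.
h = 1e-6
num_grad_w = np.array([
    (loss_at(w + h * e, b, x, y) - loss_at(w - h * e, b, x, y)) / (2 * h)
    for e in np.eye(2)
])
num_grad_b = (loss_at(w, b + h, x, y) - loss_at(w, b - h, x, y)) / (2 * h)

print("analytic  dL/dw =", grad_w, " dL/db =", grad_b)
print("numerical dL/dw =", num_grad_w, " dL/db =", num_grad_b)

# One SGD step using the update rule above.
loss_before = loss_at(w, b, x, y)
w_new = w - eta * grad_w
b_new = b - eta * grad_b
loss_after = loss_at(w_new, b_new, x, y)
print("loss before =", float(loss_before), " loss after =", float(loss_after))
```

The two gradient estimates should agree to several decimal places, and for this small step the loss on the sampled point decreases.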

---


## 2) Student tasks (do these by editing a few numbers)

1. **Learning rate sweep**: change `lr` to `0.001`, `0.01`, `0.1`, `1.0`, `5.0` and re-run training.  
   - What happens to the loss curve? Does it converge or blow up?
2. **Epochs**: increase/decrease `epochs` (e.g., 20 vs 500).  
   - How many epochs are needed for perfect AND accuracy?
3. **Initialization**: try starting with `w=[0,0]`, `w=[-1,-1]`, or random.  
   - Does learning still work?
4. **Threshold**: change the decision threshold from `0.5` to `0.7` or `0.3`.  
   - How does accuracy change?
5. **Failure case**: run the **no-bias** experiment and explain why AND cannot be learned without bias.
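6. **Sampling order**: change the RNG seed (the training loop draws one random training index per step, so the seed controls the order in which samples are seen).  
   - Does the loss curve look different?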
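
For the learning rate sweep in task 1, it can help to wrap the training loop in a function so that each setting starts from the same initial parameters. The sketch below is one possible way to do this. `train_sgd` is a hypothetical helper (not part of the notebook) that mirrors the per-sample SGD loop used later, and the initial values `w=[5,5]`, `b=-15` are the ones the notebook starts from.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(yhat, y, eps=1e-12):
    yhat = np.clip(yhat, eps, 1 - eps)
    return -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

def train_sgd(X, y, w0, b0, lr, epochs, seed=0):
    """Hypothetical helper: per-sample SGD with one random index per step."""
    w = np.array(w0, dtype=float)   # copy so every run starts fresh
    b = float(b0)
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(epochs * len(y)):
        i = rng.integers(len(y))
        yh = sigmoid(X[i] @ w + b)
        d = yh - y[i]               # delta = yhat - y
        w -= lr * d * X[i]
        b -= lr * d
        losses.append(float(bce(yh, y[i])))
    epoch_loss = np.mean(np.reshape(losses, (epochs, len(y))), axis=1)
    return w, b, epoch_loss

# Training subset used in this notebook: (0,0), (0,1), (1,1).
Xtr = np.array([[0., 0.], [0., 1.], [1., 1.]])
ytr = np.array([0., 0., 1.])

results = {}
with np.errstate(over="ignore"):    # large lr can push exp() into overflow
    for lr in [0.001, 0.01, 0.1, 1.0, 5.0]:
        w, b, epoch_loss = train_sgd(Xtr, ytr, [5., 5.], -15.0, lr, epochs=50)
        pred = (sigmoid(Xtr @ w + b) >= 0.5).astype(int)
        acc = float((pred == ytr).mean())
        results[lr] = (acc, float(epoch_loss[-1]))
        print(f"lr={lr:<6} train acc={acc:.2f} last-epoch loss={epoch_loss[-1]:.4f}")
```

With the smallest learning rate the parameters barely move in 50 epochs, so the positive sample \((1,1)\) stays on the wrong side of the boundary; comparing the rows is a starting point for answering the questions in task 1.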


In [2]:
import numpy as np
np.set_printoptions(precision=6, suppress=True)
print('NumPy version:', np.__version__)

NumPy version: 1.26.4


## 3) AND dataset (features + labels)

AND truth table:
- (0,0) → 0
- (0,1) → 0
- (1,0) → 0
- (1,1) → 1


In [3]:
X=np.array([[0.,0.],[0.,1.],[1.,0.],[1.,1.]]) # Inputs: one row per (x1, x2) pair
y=np.array([0.,0.,0.,1.]) # Target classes: 1 only for (1,1), otherwise 0
print('X=\n',X,'\n y=',y)

X=
 [[0. 0.]
 [0. 1.]
 [1. 0.]
 [1. 1.]] 
 y= [0. 0. 0. 1.]


## 4) Train/test split (tiny but valid)

We’ll train on 3 samples (including the positive one) and test on 1 sample.


In [4]:
# Tiny split: 3 of the 4 samples (75%) for training, 1 sample (25%) for testing
train_idx=np.array([0,1,3])
test_idx=np.array([2])
Xtr,ytr=X[train_idx],y[train_idx] # Train data: (0,0), (0,1), (1,1) with labels 0, 0, 1
Xte,yte=X[test_idx],y[test_idx] # Test data: (1,0) with label 0
print('Train X:\n',Xtr,'\nTrain y:',ytr,'\nTest X:\n',Xte,'\nTest y:',yte)

Train X:
 [[0. 0.]
 [0. 1.]
 [1. 1.]] 
Train y: [0. 0. 1.] 
Test X:
 [[1. 0.]] 
Test y: [0.]


## 5) Activation and loss (NumPy only)


In [5]:
# Logit: z = w.x + b (linear, pre-activation)
# Sigmoid converts the raw logit into a probability in (0, 1)
def sigmoid(z): return 1/(1+np.exp(-z))
print('sigmoid([-10,0,10]) =', sigmoid(np.array([-10.,0.,10.])))
print('sigmoid(0) =', sigmoid(0.0))
# For comparison, a squared-error loss would be (y - y_pred)**2; this notebook uses BCE instead

sigmoid([-10,0,10]) = [0.000045 0.5      0.999955]
sigmoid(0) = 0.5


In [6]:
# Binary cross-entropy loss: yhat is the predicted probability, y is the target (0 or 1)
def bce(yhat,y,eps=1e-12):
    yhat=np.clip(yhat,eps,1-eps) # keep log() away from 0
    return -(y*np.log(yhat)+(1-y)*np.log(1-yhat))
print('bce(yhat=0.5,y=1) =', float(bce(0.5,1.0))) # try replacing 0.5 with 0.75
print('bce(yhat=0.5,y=0) =', float(bce(0.5,0.0))) # try replacing 0.5 with 0.25
# bce_loss = -(y_target*log(y_pred)+(1-y_target)*log(1-y_pred))
# Categorical cross-entropy generalizes this loss to more than two classes

bce(yhat=0.5,y=1) = 0.6931471805599453
bce(yhat=0.5,y=0) = 0.6931471805599453
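
The `eps` clipping in `bce` keeps `log()` finite, but it also caps the loss for very confident wrong predictions. A common alternative, sketched here as an assumption and not as the notebook's method, computes the same loss directly from the logit using the identity \(L = \ln(1+e^{z}) - yz\). `bce_from_logits` is a hypothetical name introduced for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(yhat, y, eps=1e-12):
    yhat = np.clip(yhat, eps, 1 - eps)
    return -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

def bce_from_logits(z, y):
    """Hypothetical helper: BCE computed from the logit z.

    np.logaddexp(0, z) evaluates log(1 + exp(z)) without overflow.
    """
    return np.logaddexp(0.0, z) - y * z

# For moderate logits both versions agree.
z = np.array([-5.0, 0.0, 2.0])
y = np.array([1.0, 0.0, 1.0])
print("from logits :", bce_from_logits(z, y))
print("from probs  :", bce(sigmoid(z), y))

# For an extreme logit the clipped version saturates near -log(eps),
# while the logit version keeps growing linearly with |z|.
print("z=-50, y=1, from logits:", float(bce_from_logits(-50.0, 1.0)))
print("z=-50, y=1, from probs :", float(bce(sigmoid(-50.0), 1.0)))
```

For the small logits that occur in this notebook the two versions are interchangeable; the difference only matters when \(|z|\) becomes large.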


## 6) Initialize parameters (given)

We start from the user-provided values:
\[
w_1=5,\; w_2=5,\; b=-15
\]


In [7]:
# Two inputs, so two weights (plus one bias)
w=np.array([5.,5.]); b=-15.0
lr=0.1; epochs=50
print('init w=',w,' b=',b,' lr=',lr,' epochs=',epochs)

init w= [5. 5.]  b= -15.0  lr= 0.1  epochs= 50


## 7) Forward pass (initial): logits and activations on full dataset


In [8]:
z=X@w+b # Logits for all four inputs
yhat=sigmoid(z) # Predicted probabilities
print('logits z =', z,'\nprobs yhat =', yhat)
y

logits z = [-15. -10. -10.  -5.] 
probs yhat = [0.       0.000045 0.000045 0.006693]


array([0., 0., 0., 1.])

In [9]:
L=bce(yhat,y)
print('loss per sample =', L)
print('mean loss =', float(L.mean()))

loss per sample = [0.       0.000045 0.000045 5.006715]
mean loss = 1.2517016130474563


## 8) Train with SGD (one update per sampled training point)

We record:
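- `loss_steps`: loss at **each SGD step**, computed from the prediction made just before that step's update
- `epoch_loss`: mean loss per epoch (each epoch is a block of `len(ytr)` steps; indices are drawn at random with replacement, so an epoch need not visit every sample)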


In [10]:
loss_steps=[] # Loss per step
rng=np.random.default_rng(0) # Seeded random number generator
steps=epochs*len(ytr)
for t in range(steps):
    i=rng.integers(len(ytr))
    z=Xtr[i]@w+b
    yh=sigmoid(z)
    d=yh-ytr[i]
    w-=lr*d*Xtr[i] # w_new = w_old - lr*dl/dw
    b-=lr*d # b_new = b_old - lr*dl/db
    loss_steps.append(float(bce(yh,ytr[i])))
epoch_loss=np.mean(np.array(loss_steps).reshape(epochs,len(ytr)),axis=1)
print('trained w=',w,' b=',b,' final epoch loss=',float(epoch_loss[-1]))

trained w= [7.423593 7.41127 ]  b= -12.58873755162908  final epoch loss= 0.0363193467700361


## 9) Evaluate: train set predictions


In [11]:
ztr=Xtr@w+b; ytr_hat=sigmoid(ztr); ptr=(ytr_hat>=0.5).astype(int)
print('train probs =', ytr_hat,'\ntrain pred =', ptr)
print('train true =', ytr.astype(int))

train probs = [0.000003 0.005611 0.904316] 
train pred = [0 0 1]
train true = [0 0 1]


## 10) Evaluate: test set predictions


In [12]:
zte=Xte@w+b; yte_hat=sigmoid(zte); pte=(yte_hat>=0.5).astype(int)
print('test probs =', yte_hat,'\ntest pred =', pte)
print('test true =', yte.astype(int))

test probs = [0.00568] 
test pred = [0]
test true = [0]


In [13]:
acc_tr=float((ptr==ytr).mean()); acc_te=float((pte==yte).mean())
print('accuracy train =', acc_tr,' accuracy test =', acc_te)
print('final parameters: w=', w,' b=', b)

accuracy train = 1.0  accuracy test = 1.0
final parameters: w= [7.423593 7.41127 ]  b= -12.58873755162908


## 11) “Visualization” of loss (ASCII sparkline)

Because we restrict ourselves to **NumPy only**, we draw the loss curve as a text sparkline.
- Lower is better.


In [14]:
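a=np.array(loss_steps); chars=np.array(list('▁▂▃▄▅▆▇█'))
# np.ptp(a) is used instead of a.ptp() because the ndarray method was removed in NumPy 2.0
levels=((a-a.min())/(np.ptp(a)+1e-12)*(len(chars)-1)).astype(int) # map each loss to one of 8 bar heights
s=''.join(chars[levels][np.linspace(0,len(a)-1,80).astype(int)]) # subsample to 80 columns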
print('loss sparkline (down is better):\n', s)

loss sparkline (down is better):
 ▇▁▁▁▁▇▇▁▆▁▁▁▅▁▁▄▃▁▃▁▁▁▁▁▃▁▁▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁


In [15]:
print('epoch_loss first 10 =', epoch_loss[:10])
print('epoch_loss last  10 =', epoch_loss[-10:])
print('min epoch loss =', float(epoch_loss.min()))

epoch_loss first 10 = [1.668942 0.       0.       3.042674 1.374954 1.278442 1.183033 2.086221
 0.907795 1.559814]
epoch_loss last  10 = [0.050207 0.045534 0.047116 0.083591 0.078286 0.039128 0.003558 0.07349
 0.005536 0.036319]
min epoch loss = 3.3784431164144464e-07


## 12) Full truth table after training (logits, probabilities, predicted class)
Columns: \(x_1, x_2, y, z, \hat y, \hat c\)


In [16]:
z_all=X@w+b; y_all=sigmoid(z_all); p_all=(y_all>=0.5).astype(int)
table=np.c_[X,y,z_all,y_all,p_all]
print('cols: x1 x2 y z yhat pred\n', table)

cols: x1 x2 y z yhat pred
 [[  0.         0.         0.       -12.588738   0.000003   0.      ]
 [  0.         1.         0.        -5.177467   0.005611   0.      ]
 [  1.         0.         0.        -5.165145   0.00568    0.      ]
 [  1.         1.         1.         2.246125   0.904316   1.      ]]
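
The logits in this table can be read geometrically: the trained neuron predicts class 1 on one side of the line \(w_1x_1+w_2x_2+b=0\) and class 0 on the other. The sketch below computes where that line crosses the two axes, using the trained parameters copied from the printed output above (the variable names `x1_intercept`, `x2_intercept`, and `side` are introduced here for illustration).

```python
import numpy as np

# Trained parameters copied from the notebook's printed output.
w = np.array([7.423593, 7.41127])
b = -12.58873755162908

# The boundary w1*x1 + w2*x2 + b = 0 crosses each axis at -b / w_i.
x1_intercept = -b / w[0]
x2_intercept = -b / w[1]
print("boundary crosses x1 axis at", x1_intercept)
print("boundary crosses x2 axis at", x2_intercept)

# Sign of the logit tells which side of the boundary each input is on.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
side = np.sign(X @ w + b)
for xi, si in zip(X, side):
    print(xi, "->", "positive side" if si > 0 else "negative side")
```

Both intercepts fall between 1 and 2, so the inputs with a single 1 stay on the negative side while \((1,1)\) lands on the positive side.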


## 13) Failure case: **remove the bias** (b fixed to 0)

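A key lesson: **AND cannot be learned without a bias term**.  
Reason: without \(b\), the logit is \(z=w_1x_1+w_2x_2\), so the decision boundary must pass through the origin. Keeping \((1,0)\) and \((0,1)\) negative requires \(w_1<0\) and \(w_2<0\), but then \(z=w_1+w_2<0\) at \((1,1)\), so that point cannot be positive. In addition, \((0,0)\) always gets \(z=0\), i.e. \(\hat y=0.5\), which the \(\ge 0.5\) rule classifies as positive.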

We will train **only weights** and keep \(b=0\).
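
The argument above can also be checked by brute force. The sketch below tries a grid of weight pairs with \(b=0\) and records the best accuracy on the full truth table; the grid range and step are arbitrary assumptions chosen for illustration. For contrast, it also evaluates one hand-picked setting with a bias (`w=[1,1]`, `b=-1.5`, an illustrative choice rather than a trained result).

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 0, 1])

# Search over weights with the bias fixed to 0.
grid = np.linspace(-10, 10, 41)          # assumed search range, step 0.5
best_acc = 0.0
for w1 in grid:
    for w2 in grid:
        z = X @ np.array([w1, w2])       # no bias term
        pred = (z >= 0).astype(int)      # same decision as sigmoid(z) >= 0.5
        best_acc = max(best_acc, float((pred == y).mean()))
print("best accuracy without bias:", best_acc)

# With a bias, a simple hand-picked setting already separates AND.
z_bias = X @ np.array([1., 1.]) - 1.5
acc_with_bias = float(((z_bias >= 0).astype(int) == y).mean())
print("accuracy with w=[1,1], b=-1.5:", acc_with_bias)
```

No weight pair on the grid classifies more than half of the four inputs correctly without a bias, which matches the algebraic argument.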


In [17]:
w_nb=np.array([0.,0.]); b_nb=0.0
lr_nb=0.1; epochs_nb=200
print('no-bias init w=',w_nb,' b=',b_nb,' lr=',lr_nb,' epochs=',epochs_nb)

no-bias init w= [0. 0.]  b= 0.0  lr= 0.1  epochs= 200


In [18]:
loss_nb=[]; rng=np.random.default_rng(0); steps=epochs_nb*len(ytr)
for t in range(steps):
    i=rng.integers(len(ytr))
    z=Xtr[i]@w_nb+b_nb
    yh=sigmoid(z)
    d=yh-ytr[i]
    w_nb-=lr_nb*d*Xtr[i] # update the weights only; b_nb stays at 0
    loss_nb.append(float(bce(yh,ytr[i])))
epoch_nb=np.mean(np.array(loss_nb).reshape(epochs_nb,len(ytr)),axis=1)
print('trained (no bias) w=',w_nb,' final epoch loss=',float(epoch_nb[-1]))

trained (no bias) w= [ 4.005458 -1.697201]  final epoch loss= 0.12080642879382007


In [19]:
z_fail=X@w_nb+b_nb; y_fail=sigmoid(z_fail); p_fail=(y_fail>=0.5).astype(int)
print('no-bias probs =', y_fail,'\nno-bias pred =', p_fail)
print('true y =', y.astype(int))

no-bias probs = [0.5      0.154831 0.98211  0.909559] 
no-bias pred = [1 0 1 1]
true y = [0 0 0 1]


In [20]:
acc_fail=float((p_fail==y).mean())
print('no-bias accuracy on all 4 =', acc_fail)
print('why it fails: without bias, (1,0) and (0,1) cannot both be <0 while (1,1) >0')

no-bias accuracy on all 4 = 0.5
why it fails: without bias, (1,0) and (0,1) cannot both be <0 while (1,1) >0


In [21]:
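a=np.array(loss_nb); chars=np.array(list('▁▂▃▄▅▆▇█'))
# np.ptp(a) is used instead of a.ptp() because the ndarray method was removed in NumPy 2.0
levels=((a-a.min())/(np.ptp(a)+1e-12)*(len(chars)-1)).astype(int) # map each loss to one of 8 bar heights
s=''.join(chars[levels][np.linspace(0,len(a)-1,80).astype(int)]) # subsample to 80 columns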
print('no-bias loss sparkline:\n', s)

no-bias loss sparkline:
 ▇▇▆▆▅▇▅▅▄▃▃▇▇▅▇▅▅▂▇▅▄▄▇▂▂▁▁▁▄▁▇▇▇▇▇▁▃▇▁▁▃▁▃▃▇▂▇▂▇▇▂▇▂▇▂▂▁▁▁▇▇▇▇▂▁▁▇▇▇▇▁▇▇▁▂▇▂▇▁▁


## 14) Failure case: learning rate too large (unstable / oscillating)

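If `lr` is too big, a single SGD update can overshoot by a wide margin and make learning unstable. In the run below (`lr=20`), training ends with the held-out input \((1,0)\) predicted as class 1, so only 3 of the 4 truth-table rows are correct.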
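
To see the overshoot concretely, it is enough to look at a single update on the positive sample \((1,1)\), starting from the notebook's initial parameters. The sketch below applies that one step with `lr=0.1` and with `lr=20` and prints the resulting predictions on all four inputs; the dictionary `preds` is introduced here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
x = np.array([1., 1.])   # the positive training sample
y = 1.0

preds = {}
for lr in (0.1, 20.0):
    w = np.array([5., 5.])
    b = -15.0
    d = sigmoid(x @ w + b) - y      # delta is close to -1 at the start
    w = w - lr * d * x              # one SGD step on (1,1)
    b = b - lr * d
    z_all = X @ w + b
    preds[lr] = (z_all >= 0).astype(int)
    print(f"lr={lr}: w={w}, b={b:.4f}, logits={z_all}, pred={preds[lr]}")
```

With `lr=0.1` the parameters move only slightly and every input is still predicted as class 0. With `lr=20` the same gradient moves the bias from \(-15\) to a positive value in one step, so all four inputs, including \((0,0)\), are predicted as class 1.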


In [22]:
w_bad=np.array([5.,5.]); b_bad=-15.0
lr_bad=20.0; steps_bad=60; rng=np.random.default_rng(1)
print('bad-lr init w=',w_bad,' b=',b_bad,' lr=',lr_bad,' steps=',steps_bad)

bad-lr init w= [5. 5.]  b= -15.0  lr= 20.0  steps= 60


In [23]:
loss_bad=[]; np.seterr(over='ignore', under='ignore') # silence exp() overflow/underflow warnings
for t in range(steps_bad):
    i=rng.integers(len(ytr))
    z=Xtr[i]@w_bad+b_bad
    yh=sigmoid(z)
    d=yh-ytr[i]
    w_bad-=lr_bad*d*Xtr[i]
    b_bad-=lr_bad*d
    loss_bad.append(float(bce(yh,ytr[i])))
print('after bad-lr steps: w=',w_bad,' b=',b_bad,' finite?', bool(np.isfinite(w_bad).all() and np.isfinite(b_bad)))

after bad-lr steps: w= [44.762242  4.762705]  b= -35.08423476404552  finite? True


In [24]:
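a=np.array(loss_bad); chars=np.array(list('▁▂▃▄▅▆▇█'))
# np.ptp(a) is used instead of a.ptp() because the ndarray method was removed in NumPy 2.0
levels=((a-a.min())/(np.ptp(a)+1e-12)*(len(chars)-1)).astype(int) # map each loss to one of 8 bar heights
s=''.join(chars[levels][np.linspace(0,len(a)-1,60).astype(int)]) # one character per step (60 steps)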
print('bad-lr loss sparkline:\n', s,'\nlast loss =', loss_bad[-1])

bad-lr loss sparkline:
 ▁▁▄▁▄▁▁▁▁▁▁▇▁▄▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ 
last loss = 5.351708045459909e-07


In [25]:
z_bad=X@w_bad+b_bad; y_bad=sigmoid(z_bad); p_bad=(y_bad>=0.5).astype(int)
print('bad-lr probs =', y_bad,'\nbad-lr pred =', p_bad)
print('true y =', y.astype(int))

bad-lr probs = [0.       0.       0.999937 0.999999] 
bad-lr pred = [0 0 1 1]
true y = [0 0 0 1]


## What you should take away

- **Logit** \(z=w^Tx+b\) is the raw score; **sigmoid activation** turns it into a probability.
- With BCE + sigmoid, the backprop signal is simply \(\delta=\hat y-y\).
- **Bias is essential** for AND (without it, you cannot isolate only \((1,1)\) as positive).
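- Hyperparameters such as the learning rate and the number of epochs strongly affect convergence: `lr=0.1` for 50 epochs classified all four inputs correctly, while `lr=20` left \((1,0)\) misclassified.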

---
