<a href="https://colab.research.google.com/github/Uzmamushtaque/CSCI_4170_6170_Spring2026/blob/main/Lecture_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lecture 2

## Today's Lecture

1. Data manipulation and pre-processing (TensorFlow, PyTorch, NumPy)
2. Broadcasting
3. NumPy and pandas in Python
4. Logistic Regression using vectorization
5. Datasets
6. Projects in ML/AI
7. Gradient Descent (Optimization Algorithms)
8. About Homework 1

## Learning objectives (by the end of Lecture 2)

You should be able to:

- Explain **tensor rank, shape, dtype, and device**, and translate between *scalars / vectors / matrices / higher-order tensors*.
- Use **broadcasting** intentionally (and recognize when it will fail).
- Load tabular data with **pandas**, perform a first-pass **missing-value audit**, and avoid common data-leakage mistakes.
- Describe why **vectorization** is essential for performance (especially on GPUs/TPUs) and refactor a loop into array/tensor ops.
- Derive and implement the **vectorized forward pass + gradients** for logistic regression (high level), and connect this to gradient descent.

### In-lecture checkpoints
1. Shape reasoning (tensors)
2. Broadcasting “mental math” problems
3. Vectorization speed test
4. Logistic regression: forward pass + gradients


In [4]:
# Environment check (run once)
import sys, platform
import numpy as np
import pandas as pd

print("Python:", sys.version.split()[0], "|", platform.platform())
print("NumPy:", np.__version__)
print("pandas:", pd.__version__)

# Optional: deep learning frameworks
try:
    import torch
    print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
except Exception as e:
    print("PyTorch not available:", repr(e))

try:
    import tensorflow as tf
    print("TensorFlow:", tf.__version__)
except Exception as e:
    print("TensorFlow not available:", repr(e))


Python: 3.12.12 | Linux-6.6.105+-x86_64-with-glibc2.35
NumPy: 2.0.2
pandas: 2.2.2
PyTorch: 2.9.0+cpu | CUDA available: False
TensorFlow: 2.19.0


# Data manipulation

Generally, there are two important things we need to do with data:
(i) acquire them; and (ii) process them once they are inside the computer.

Once you acquire data, several pre-processing, exploratory data analysis (EDA, which includes data visualization), and feature engineering steps are needed to get the data into the right format. Two of the most popular libraries for data manipulation are pandas and NumPy.

Here is an example notebook with basic pre-processing and manipulation steps: [Link](https://github.com/Uzmamushtaque/Projects-in-Machine-Learning-and-AI/blob/main/TitanicExample.ipynb)

[TensorFlow](https://www.tensorflow.org/) is an open-source end-to-end machine learning library for preprocessing data, modelling data and serving models (getting them into the hands of others).

[PyTorch](https://pytorch.org/) is an open-source deep learning library built around tensors and automatic differentiation.

## Introduction to Tensors

If you've ever used [NumPy](https://numpy.org/), tensors are similar to NumPy arrays ([introduction](https://www.w3schools.com/python/numpy/numpy_intro.asp)).

You can think of a tensor as a multi-dimensional numerical representation (also referred to as n-dimensional, where n can be any number) of something, where that something can be almost anything you can imagine:

1. It could be numbers themselves (using tensors to represent the price of houses).
2. It could be an image (using tensors to represent the pixels of an image).
3. It could be text (using tensors to represent words).

Or it could be some other form of information (or data) you want to represent with numbers.

The main difference between tensors and NumPy arrays (also n-dimensional arrays of numbers) is that tensors can be used on GPUs (graphics processing units) and TPUs (tensor processing units).

The benefit of running on GPUs and TPUs is faster computation: if we want to find patterns in the numerical representations of our data, we can generally find them faster on this hardware.

Core attributes

**Rank (ndim):** number of axes (dimensions).

Rank 0: scalar, Rank 1: vector, Rank 2: matrix, Rank ≥3: higher-order tensor.

**Shape:** sizes along each axis; length of shape = rank.

Scalar (), vector (n,), matrix (m, n), rank-3 (a, b, c).

**dtype:** element type / precision (impacts memory, speed, numeric behavior).

Examples: int32/int64, float16/bfloat16/float32/float64, bool, complex64.

**Device:** where the tensor is stored/computed.

Examples: cpu, gpu/cuda, mps, tpu.

Operations generally require their operands to be on the same device (and to have compatible dtypes).
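
As a small illustration of why dtype matters, the sketch below compares memory use across float precisions and shows a rounding effect in float16. It uses NumPy only because it needs no accelerator; TensorFlow and PyTorch tensors expose analogous attributes. The array size and the value 2049 are arbitrary choices for this demo.

```python
import numpy as np

# Store the same 1,000 values at three precisions
n_values = 1000
sizes = {}
for dt in (np.float16, np.float32, np.float64):
    arr = np.zeros(n_values, dtype=dt)
    sizes[np.dtype(dt).name] = arr.nbytes  # bytes = n_values * bytes per element
    print(np.dtype(dt).name, "| ndim:", arr.ndim, "| shape:", arr.shape, "| bytes:", arr.nbytes)

# float16 has an 11-bit significand, so not every integer above 2048 is representable
rounded = float(np.float16(2049))
print("2049 stored as float16:", rounded)
```

Halving the precision halves the memory, but it also coarsens the set of representable numbers, which is why the dtype choice affects numeric behavior as well as speed.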

Let us get started with Tensors.
The first thing we'll do is import TensorFlow under the common alias tf.

In [6]:
# Import TensorFlow
import tensorflow as tf

print(tf.__version__) # find the version number (should be 2.x+)

2.19.0


In [7]:
# Create a scalar (rank 0 tensor)
scalar = tf.constant(7)
scalar

<tf.Tensor: shape=(), dtype=int32, numpy=7>

In [9]:
# Check the number of dimensions of a tensor (ndim stands for number of dimensions)
scalar.ndim


0

In [10]:
# Create a vector (more than 0 dimensions)
vector = tf.constant([10, 10])
vector

<tf.Tensor: shape=(2,), dtype=int32, numpy=array([10, 10], dtype=int32)>

In [11]:
# Check the number of dimensions of our vector tensor
vector.ndim

1

In [None]:
# Create a matrix (more than 1 dimension)
matrix = tf.constant([[10, 7],
                      [7, 10]])
matrix

<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[10,  7],
       [ 7, 10]], dtype=int32)>

In [None]:
matrix.shape

TensorShape([2, 2])

| Object      | Rank | Typical shape | Indexing       |
| ----------- | ---: | ------------- | -------------- |
| Scalar      |    0 | `()`          | `x`            |
| Vector      |    1 | `(n,)`        | `v[i]`         |
| Matrix      |    2 | `(m, n)`      | `M[i, j]`      |
| Tensor (3D) |    3 | `(a, b, c)`   | `T[i, j, k]`   |
| Tensor (kD) |    k | `(d1, …, dk)` | `A[i1, …, ik]` |


Common translations (axis manipulations)

Scalar → vector → matrix: () → (1,) → (1,1) (add axes).

Vector ↔ row/column matrix:

    Vector (n,)

    Column vector as matrix (n, 1)

    Row vector as matrix (1, n)

Matrix ↔ batch of matrices: (m, n) ↔ (b, m, n) (add leading batch axis).
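
The translations above can be sketched with NumPy as follows. This is only an illustration with arbitrary small shapes; TensorFlow offers analogous tools such as `tf.newaxis`, `tf.expand_dims`, and `tf.reshape`.

```python
import numpy as np

v = np.arange(3)                    # vector, shape (3,)
col = v[:, np.newaxis]              # column vector as matrix, shape (3, 1)
row = v[np.newaxis, :]              # row vector as matrix, shape (1, 3)

M = np.ones((2, 3))                 # matrix, shape (2, 3)
batch = np.expand_dims(M, axis=0)   # add a leading batch axis, shape (1, 2, 3)
back = np.squeeze(batch, axis=0)    # remove the batch axis again, shape (2, 3)

s = np.array(7)                     # scalar, shape ()
s_vec = s.reshape(1)                # shape (1,)
s_mat = s.reshape(1, 1)             # shape (1, 1)

for name, t in [("v", v), ("col", col), ("row", row), ("batch", batch),
                ("back", back), ("s", s), ("s_vec", s_vec), ("s_mat", s_mat)]:
    print(f"{name:>6}: shape {t.shape}")
```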

### Checkpoint 1: tensor shapes

Without running the next cell, predict the **shape** of each tensor.

- `x = tf.range(12)`  
- `X = tf.reshape(x, (3, 4))`  
- `a = tf.reshape(tf.range(3), (3, 1))`  
- `b = tf.reshape(tf.range(2), (1, 2))`  
- `a + b`  

Then run the cell and confirm your answers.


In [12]:
# Check your shape predictions
import tensorflow as tf

x = tf.range(12)
X = tf.reshape(x, (3, 4))

a = tf.reshape(tf.range(3), (3, 1))
b = tf.reshape(tf.range(2), (1, 2))

print("x shape:", x.shape)
print("X shape:", X.shape)
print("a shape:", a.shape)
print("b shape:", b.shape)
print("a+b shape:", (a+b).shape)


x shape: (12,)
X shape: (3, 4)
a shape: (3, 1)
b shape: (1, 2)
a+b shape: (3, 2)


The difference between `tf.Variable()` and `tf.constant()` is that tensors created with `tf.constant()` are immutable (they cannot be changed, only used to create new tensors), whereas tensors created with `tf.Variable()` are mutable (their values can be changed in place, for example with `.assign()`).

In [13]:
# simple pytorch tensor
import torch
x = torch.tensor(3.5)
print(x)

tensor(3.5000)


In [14]:
# simple pytorch tensor
x = torch.tensor(3.5)
print("x:", x)

# simple arithmetic with tensors
y = x + 3
print("y = x+3:", y)

x: tensor(3.5000)
y = x+3: tensor(6.5000)


A tensor represents a (possibly multi-dimensional) array of numerical values. With one axis, a tensor corresponds (in math) to a vector. With two axes, a tensor corresponds to a matrix. Just as vectors generalize scalars, and matrices generalize vectors, we can build data structures with even more axes. Tensors give us a generic way of describing  n -dimensional arrays with an arbitrary number of axes. Vectors, for example, are first-order tensors, and matrices are second-order tensors. Let us create one tensor and then update its shape:

In [None]:
x = tf.range(12)
x

<tf.Tensor: shape=(12,), dtype=int32, numpy=array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11], dtype=int32)>

We can access a tensor’s shape (the length along each axis) by inspecting its shape property.

In [None]:
x.shape

TensorShape([12])

If we just want to know the total number of elements in a tensor, i.e., the product of all of the shape elements, we can inspect its size.

In [None]:
tf.size(x)

<tf.Tensor: shape=(), dtype=int32, numpy=12>

To change the shape of a tensor without altering either the number of elements or their values, we can invoke the reshape function.

In [None]:
X = tf.reshape(x, (3, 4))
X

<tf.Tensor: shape=(3, 4), dtype=int32, numpy=
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]], dtype=int32)>

Specifying every dimension manually is unnecessary. If our target shape is a matrix with shape (height, width), then once we know the width, the height is given implicitly, so we can pass -1 for the dimension we want inferred. Try calling tf.reshape(x, (-1, 4)) or tf.reshape(x, (3, -1)) for x above. Why do you think you get the result you are getting?
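
For comparison, here is a minimal NumPy sketch of the same idea (NumPy arrays do have a `.reshape` method). It also shows what happens when the requested size does not divide the number of elements; the width of 5 is just an arbitrary failing choice.

```python
import numpy as np

x = np.arange(12)
A = x.reshape(-1, 4)   # 12 / 4 = 3 rows are inferred
B = x.reshape(3, -1)   # 12 / 3 = 4 columns are inferred
print(A.shape, B.shape)

# Inference only works when the known sizes divide the element count
failed = False
try:
    x.reshape(-1, 5)
except ValueError as err:
    failed = True
    print("reshape failed:", err)
```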

In [None]:
#tf.reshape(x,(3, -1))

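Typically, we will want our tensors initialized with zeros, ones, some other constants, or numbers randomly sampled from a specific distribution. In the cell below, the first line creates a tensor of shape (2, 3, 4) with all elements set to 0, and the second creates a tensor of shape (3, 3, 4) with all elements set to 1. A notebook cell displays only its last expression, so only the tensor of ones appears in the output: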

In [None]:
tf.zeros((2, 3, 4))
tf.ones((3,3,4))

<tf.Tensor: shape=(3, 3, 4), dtype=float32, numpy=
array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]], dtype=float32)>

The following snippet creates a tensor with shape (3, 4). Each of its elements is randomly sampled from a standard Gaussian (normal) distribution with a mean of 0 and a standard deviation of 1.

In [None]:
tf.random.normal(shape=[3, 4])

<tf.Tensor: shape=(3, 4), dtype=float32, numpy=
array([[ 5.9187937e-01, -8.8718188e-01, -3.9479160e-04,  1.2093159e+00],
       [ 5.9993702e-01,  2.8964812e-01,  8.4894878e-01,  1.8067051e+00],
       [-7.7732205e-01,  2.9494369e-01, -7.7629071e-01, -7.0611900e-01]],
      dtype=float32)>

In [None]:
# An exact input for a tensor- Python List
tf.constant([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])

<tf.Tensor: shape=(3, 4), dtype=int32, numpy=
array([[2, 1, 4, 3],
       [1, 2, 3, 4],
       [4, 3, 2, 1]], dtype=int32)>

# Operations

The standard binary arithmetic operators (+, -, *, /, **) are applied elementwise to tensors of the same shape.

In [None]:
x = tf.constant([1.0, 2, 4, 8])
y = tf.constant([2.0, 2, 2,2])
x + y, x - y, x * y, x / y, x**y  # The ** operator is exponentiation


(<tf.Tensor: shape=(4,), dtype=float32, numpy=array([ 3.,  4.,  6., 10.], dtype=float32)>,
 <tf.Tensor: shape=(4,), dtype=float32, numpy=array([-1.,  0.,  2.,  6.], dtype=float32)>,
 <tf.Tensor: shape=(4,), dtype=float32, numpy=array([ 2.,  4.,  8., 16.], dtype=float32)>,
 <tf.Tensor: shape=(4,), dtype=float32, numpy=array([0.5, 1. , 2. , 4. ], dtype=float32)>,
 <tf.Tensor: shape=(4,), dtype=float32, numpy=array([ 1.,  4., 16., 64.], dtype=float32)>)

In [None]:
tf.exp(x)

<tf.Tensor: shape=(4,), dtype=float32, numpy=
array([2.7182817e+00, 7.3890562e+00, 5.4598148e+01, 2.9809580e+03],
      dtype=float32)>

We can also concatenate multiple tensors together, stacking them end-to-end to form a larger tensor.

In [None]:
X = tf.reshape(tf.range(12, dtype=tf.float32), (3, 4))
Y = tf.constant([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
tf.concat([X, Y], axis=0), tf.concat([X, Y], axis=1)

(<tf.Tensor: shape=(6, 4), dtype=float32, numpy=
 array([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [ 2.,  1.,  4.,  3.],
        [ 1.,  2.,  3.,  4.],
        [ 4.,  3.,  2.,  1.]], dtype=float32)>,
 <tf.Tensor: shape=(3, 8), dtype=float32, numpy=
 array([[ 0.,  1.,  2.,  3.,  2.,  1.,  4.,  3.],
        [ 4.,  5.,  6.,  7.,  1.,  2.,  3.,  4.],
        [ 8.,  9., 10., 11.,  4.,  3.,  2.,  1.]], dtype=float32)>)

Sometimes, we want to construct a binary tensor via logical statements. Take X == Y as an example. For each position, if X and Y are equal at that position, the corresponding entry in the new tensor is True, meaning that the logical statement X == Y holds at that position; otherwise the entry is False. Casting the result to a numeric dtype (for example with tf.cast) turns these into 1s and 0s.

In [None]:
X == Y

<tf.Tensor: shape=(3, 4), dtype=bool, numpy=
array([[False,  True, False,  True],
       [False, False, False, False],
       [False, False, False, False]])>

# Broadcasting Mechanism

Under certain conditions, when shapes differ, we can still perform elementwise operations by invoking the broadcasting mechanism. This mechanism works in the following way: First, expand one or both arrays by copying elements appropriately so that after this transformation, the two tensors have the same shape. Second, carry out the elementwise operations on the resulting arrays.

### Broadcasting: practical rules + pitfalls

**Rule of thumb:** compare shapes from the **rightmost** dimension moving left.

Two dimensions are compatible when they are equal, or one of them is 1. If one tensor has fewer dimensions, treat missing leading dims as 1s.

Common pitfalls:
- Forgetting that broadcasting happens **per-dimension**, not per-tensor.
- Accidentally broadcasting when you intended a dot product (e.g., `A * B` vs `A @ B`).
- Silent shape expansion leading to unexpectedly large intermediate tensors (memory blow-ups).
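
The second and third pitfalls can be reproduced in a few lines. The sketch below uses NumPy with made-up arrays; the names `y_true` and `y_pred` are hypothetical stand-ins for labels and a model output with a trailing axis of size 1.

```python
import numpy as np

n = 4
y_true = np.arange(n, dtype=float)            # shape (4,)
y_pred = np.arange(n, dtype=float)[:, None]   # shape (4, 1)

# Intended: elementwise difference of n values. Actual: a broadcast (4, 4) grid.
diff_wrong = y_pred - y_true
print("accidental shape:", diff_wrong.shape)

# Fix: make the shapes agree explicitly before subtracting
diff_right = y_pred.ravel() - y_true
print("intended shape:", diff_right.shape)

# Elementwise product vs. matrix product on the same operands
A = np.arange(6.0).reshape(2, 3)
B = np.ones((3, 1))
elementwise = A * B.T   # (2, 3) * (1, 3) -> (2, 3): B.T is broadcast across rows
matmul = A @ B          # (2, 3) @ (3, 1) -> (2, 1): sums over the shared axis
print("A * B.T shape:", elementwise.shape, "| A @ B shape:", matmul.shape)
```

With `n` in the tens of thousands, the accidental `(n, n)` intermediate is also where the memory blow-up comes from.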

We will test three cases below:
1. Works: `(3, 1) + (1, 2) -> (3, 2)`
2. Works: `(4,) + (1,) -> (4,)`
3. Fails: `(3, 2) + (2, 2)` (mismatch on the second-to-last dim)


In [15]:
import tensorflow as tf

# Case 1: works
a = tf.reshape(tf.range(3), (3, 1))
b = tf.reshape(tf.range(2), (1, 2))
print("Case 1:", (a + b).shape)

# Case 2: works
u = tf.range(4)
v = tf.constant([10])
print("Case 2:", (u + v).shape)

# Case 3: fails (uncomment to see the error)
# p = tf.zeros((3, 2))
# q = tf.zeros((2, 2))
# p + q


Case 1: (3, 2)
Case 2: (4,)


In [16]:
a = tf.reshape(tf.range(3), (3, 1))
b = tf.reshape(tf.range(2), (1, 2))
a, b

(<tf.Tensor: shape=(3, 1), dtype=int32, numpy=
 array([[0],
        [1],
        [2]], dtype=int32)>,
 <tf.Tensor: shape=(1, 2), dtype=int32, numpy=array([[0, 1]], dtype=int32)>)

Since a and b are  3×1  and  1×2  matrices respectively, their shapes do not match up if we want to add them. We broadcast the entries of both matrices into a larger  3×2  matrix as follows:

In [17]:
a + b

<tf.Tensor: shape=(3, 2), dtype=int32, numpy=
array([[0, 1],
       [1, 2],
       [2, 3]], dtype=int32)>

[Source for this excerpt](https://numpy.org/doc/stable/user/basics.broadcasting.html)


When operating on two arrays, NumPy compares their shapes element-wise. Two dimensions are compatible when

- they are equal, or
- one of them is 1.

If these conditions are not met, a `ValueError: operands could not be broadcast together` exception is thrown, indicating that the arrays have incompatible shapes. The size of the resulting array is the size that is not 1 along each axis of the inputs.
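
To make the rule concrete, here is a minimal sketch that computes the broadcast shape by hand and cross-checks it against `np.broadcast_shapes` (available in recent NumPy versions). The helper name `broadcast_shape` is hypothetical and is not part of NumPy.

```python
from itertools import zip_longest
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Walk both shapes from the rightmost axis; missing leading axes count as 1
    for da, db in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"incompatible dimensions {da} and {db}")
    return tuple(reversed(result))

# Compatible cases agree with NumPy's own answer
for sa, sb in [((3, 1), (1, 2)), ((4,), (1,)), ((2, 1, 5), (4, 1))]:
    assert broadcast_shape(sa, sb) == np.broadcast_shapes(sa, sb)
    print(sa, "+", sb, "->", broadcast_shape(sa, sb))

# The failing case from this lecture
try:
    broadcast_shape((3, 2), (2, 2))
except ValueError as err:
    print("(3, 2) + (2, 2) fails:", err)
```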

# Data reading and Data Pre-processing in Colab

To apply deep learning to real-world problems, we often begin by preprocessing raw data rather than starting from data that is already nicely prepared in tensor format. Among popular data analysis tools in Python, the pandas package is commonly used.

[Pandas documentation](https://pandas.pydata.org/)

## Data loading and first-pass preprocessing (Colab + local-friendly)

In practice, **most model failures start in the data pipeline**. Before modeling, you should always do a quick audit:

- Shape: number of rows/columns
- Types: numeric vs categorical vs datetime
- Missingness: which columns are incomplete (and how incomplete)

Below are two loading patterns:
1. **Colab + Google Drive** (useful for Kaggle downloads, class datasets)
2. **Local / URL** (works in VSCode/Jupyter if the file is in your repo)


In [None]:
import pandas as pd

# Option A (Colab): Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
# path = '/content/drive/MyDrive/train.csv'

# Option B (local): put the CSV inside your repo (recommended)
# path = 'data/train.csv'

# Option C (URL): if you have a public raw CSV URL
# path = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

# Set ONE of the above:
path = 'data/train.csv'  # change as needed

data = pd.read_csv(path)
data.head()


In [None]:
# First-pass data audit
print("Shape:", data.shape)
print("\nDtypes:")
print(data.dtypes)

print("\nMissing values (count):")
na = data.isna().sum().sort_values(ascending=False)
print(na[na>0])

print("\nMissing values (%):")
na_pct = (data.isna().mean()*100).sort_values(ascending=False)
print(na_pct[na_pct>0].round(2))

# Quick peek at categorical columns
cat_cols = [c for c in data.columns if data[c].dtype == 'object']
print("\nCategorical columns:", cat_cols)


### Minimal preprocessing example (tabular)

This is not a “final pipeline,” but it demonstrates the canonical steps:
- Separate features/label
- Impute missing numeric values (median)
- One-hot encode categoricals
- Train/test split


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# If you are using Titanic, the label is often 'Survived'. Change as needed.
target_col = 'Survived'
assert target_col in data.columns, f"Expected target column '{target_col}' in data."

X = data.drop(columns=[target_col])
y = data[target_col]

num_cols = [c for c in X.columns if X[c].dtype != 'object']
cat_cols = [c for c in X.columns if X[c].dtype == 'object']

numeric_tf = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
])

categorical_tf = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_tf, num_cols),
        ("cat", categorical_tf, cat_cols),
    ]
)

model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=500))
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model.fit(X_train, y_train)
pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, pred))


In [None]:
## Reading the dataset

# If pandas is not installed, just uncomment the following line:
# !pip install pandas
import pandas as pd

data = pd.read_csv('/content/drive/MyDrive/train.csv')
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Handling Missing Values: NaN values are unknown or missing values. Typical methods for handling missing data are imputation and deletion: imputation replaces missing values with substituted ones, while deletion drops the rows or columns that contain them.

In [None]:
data.isna()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,False,False,False,False,False,False,True,False,False,False,...,False,True,True,True,False,False,False,False,False,False
1,False,False,False,False,False,False,True,False,False,False,...,False,True,True,True,False,False,False,False,False,False
2,False,False,False,False,False,False,True,False,False,False,...,False,True,True,True,False,False,False,False,False,False
3,False,False,False,False,False,False,True,False,False,False,...,False,True,True,True,False,False,False,False,False,False
4,False,False,False,False,False,False,True,False,False,False,...,False,True,True,True,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,False,False,False,False,False,False,True,False,False,False,...,False,True,True,True,False,False,False,False,False,False
1456,False,False,False,False,False,False,True,False,False,False,...,False,True,False,True,False,False,False,False,False,False
1457,False,False,False,False,False,False,True,False,False,False,...,False,True,False,False,False,False,False,False,False,False
1458,False,False,False,False,False,False,True,False,False,False,...,False,True,True,True,False,False,False,False,False,False


Different datatypes require different ways of dealing with missing values.

[Handling Missing Values in pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
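
A minimal sketch of deletion versus imputation on a toy frame follows. The column names are borrowed from the dataset shown above, but the rows are made up for illustration; the median and most-frequent strategies are one reasonable choice, not the only one.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical column (hypothetical values)
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, np.nan],
    "Alley": ["Pave", None, "Grvl", "Pave", None],
})

# Deletion: keep only rows with no missing values
dropped = df.dropna()

# Imputation: median for the numeric column, most frequent value for the categorical one
imputed = df.copy()
imputed["LotFrontage"] = imputed["LotFrontage"].fillna(imputed["LotFrontage"].median())
imputed["Alley"] = imputed["Alley"].fillna(imputed["Alley"].mode().iloc[0])

print("rows after deletion:", len(dropped))
print(imputed)
```

In a real pipeline, the median and the most frequent value should be computed on the training split only and then applied to the test split, which is what the scikit-learn `Pipeline` earlier in this notebook does for you.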

# Vectorization

A scalar is represented by a tensor with just one element. Vectors in ML problems represent examples from the dataset. In math notation, we will usually denote vectors as bold-faced, lower-cased letters (e.g., $\textbf{x}$ ,  $\textbf{y}$ , and  $\textbf{z}$) .

We can refer to any element of a vector by using a subscript. For example, we can refer to the  $i$th  element of  $x$  by  $x_i$ . Note that the element  $x_i$  is a scalar, so we do not bold-face the font when referring to it.

In [None]:
x = tf.range(4)
x
print(x[3])

tf.Tensor(3, shape=(), dtype=int32)


With the advent of deep learning, we usually work with extremely large datasets, so it is important to write efficient code. Inefficient code can become a bottleneck in your algorithm. To keep the code computationally efficient, we will use vectorization: replacing explicit Python loops with array operations that run in optimized library code. This is especially useful when operations such as dot products or matrix products need to be performed on vectors or matrices.

In [18]:
import numpy as np
a=np.array([1,2,3,4,5])
a

array([1, 2, 3, 4, 5])

In [19]:
import time
import numpy as np
a=np.random.rand(1000000)
b=np.random.rand(1000000)

#Vectorized version
start=time.time()
c=np.dot(a,b)
end=time.time()
print('Vectorized version '+str(end-start)+' s')  # time.time() differences are in seconds
print(c)

Vectorized version 0.0016067028045654297 s
249998.96907004397


In [20]:
print(a)

[0.98238711 0.5781509  0.42927071 ... 0.20686529 0.42851847 0.18310746]


In [21]:
#Non vectorized version
c=0
start=time.time()
for i in range(1000000):
  c+=a[i] * b[i]
end=time.time()
print('Non vectorized version '+str(end-start)+' s')  # time.time() differences are in seconds
print(c)

Non vectorized version 0.47296929359436035 s
249998.96907003334


How many times longer does the non-vectorized version take? (In the run recorded above, the ratio of the two timings is roughly 300.) Try checking the time for a nested loop.
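
One way to try the nested-loop case is a matrix-vector product, sketched below. The 300 x 300 size is an arbitrary choice that keeps the loop version quick, and the timings will vary by machine, so none are quoted here.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((300, 300))
v = rng.random(300)

# Nested-loop matrix-vector product
start = time.perf_counter()
out_loop = np.zeros(300)
for i in range(300):
    for j in range(300):
        out_loop[i] += M[i, j] * v[j]
loop_s = time.perf_counter() - start

# Vectorized equivalent
start = time.perf_counter()
out_vec = M @ v
vec_s = time.perf_counter() - start

print(f"nested loop: {loop_s:.6f} s | vectorized: {vec_s:.6f} s")
print("results match:", np.allclose(out_loop, out_vec))
```

`time.perf_counter()` is used here instead of `time.time()` because it is intended for measuring short intervals.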

You have probably heard of GPUs (graphics processing units) and CPUs (central processing units). Both have parallelization instructions, sometimes called SIMD instructions, which stands for single instruction, multiple data. In practice this means that when you use built-in functions such as NumPy or TensorFlow operations, which do not require you to write an explicit for loop, the library can take much better advantage of parallelism and do your computations much faster. This holds for computations on both CPUs and GPUs.

More information on this:
[Difference between GPU and CPU](https://blogs.nvidia.com/blog/2009/12/16/whats-the-difference-between-a-cpu-and-a-gpu/)

# Vectorizing Logistic Regression

We know that in logistic regression we are calculating the predicted value for each example using the following function:

$\hat{y} = \sigma(\textbf{w}^T\textbf{x} + b)$

where $\sigma(z) = \frac{1}{1+e^{-z}}$

For a given example $i$, the loss for a single instance is given by:

$l^{(i)}(y^{(i)},\hat{y}^{(i)}) = -\left(y^{(i)} \log\hat{y}^{(i)} + (1-y^{(i)}) \log(1-\hat{y}^{(i)})\right)$

Cost function for the entire data:

$L(y,\hat{y}) = \frac{1}{n} \sum_{i=1}^{n} l^{(i)}(y^{(i)},\hat{y}^{(i)})$

The computation here requires $\hat{y}$ for every instance. Let $z = \textbf{w}^T\textbf{x} + b$, so that the activation is $a = \sigma(z) = \hat{y}$. This $z$ needs to be computed for every instance. Instead of using an explicit for loop, we can stack the examples as the columns of a matrix $X$ and compute the product of the transposed weight vector with $X$. The bias term (if it exists) is added to every entry via broadcasting. The resulting vectors are:

$Z = \textbf{w}^TX + b = [z^{(1)}, z^{(2)}, \ldots, z^{(n)}]$ and $A = \sigma(Z) = [a^{(1)}, a^{(2)}, \ldots, a^{(n)}]$

This step can be completed in 1 line of code.

### Logistic regression (from scratch) with vectorization

Below is a minimal **NumPy-only** implementation on a standard binary dataset.  
Purpose: connect *vectorization* → *fast forward pass* → *fast gradients* → *gradient descent*.

You should be able to map each line of code to the math:

- Forward: `z = X @ w + b`, `a = sigmoid(z)`
- Loss: binary cross-entropy
- Gradients: `dw = (X.T @ (a - y)) / n`, `db = mean(a - y)`


In [None]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features for faster optimization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Helper
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Initialize
n, d = X_train.shape
w = np.zeros(d)
b = 0.0
lr = 0.1
steps = 300

for t in range(steps):
    z = X_train @ w + b            # (n,)
    a = sigmoid(z)                  # (n,)
    # binary cross-entropy
    eps = 1e-12
    loss = -(y_train*np.log(a+eps) + (1-y_train)*np.log(1-a+eps)).mean()

    # gradients
    dz = (a - y_train)              # (n,)
    dw = (X_train.T @ dz) / n       # (d,)
    db = dz.mean()

    # update
    w -= lr * dw
    b -= lr * db

    if t % 50 == 0:
        print(f"step {t:3d} | loss {loss:.4f}")

# Evaluate
test_probs = sigmoid(X_test @ w + b)
test_pred = (test_probs >= 0.5).astype(int)
acc = (test_pred == y_test).mean()
print("Test accuracy (from scratch):", round(float(acc), 4))


**Steps in implementing gradient descent**

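You get input $X$ of shape $(d, n)$, with one column per example (the transpose of the `(n, d)` layout used in the NumPy cell above), and labels $Y$ of shape $(1, n)$.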
- You compute $A = \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, ..., a^{(n-1)}, a^{(n)})$
- You calculate the cost function: $L = -\frac{1}{n}\sum_{i=1}^{n}(y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)}))$


Here are the two formulas you will be using (Try finding the derivative of the cost function with respect to the parameters):


$$ \frac{\partial L}{\partial w} = \frac{1}{n}X(A-Y)^T$$
$$ \frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^n (a^{(i)}-y^{(i)})$$
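
Once you have derived the formulas, a finite-difference check is a quick way to confirm them numerically. The sketch below assumes the column convention used here (`X` of shape `(d, n)`, `w` of shape `(d, 1)`); the random data, the step size `h`, and the function names are hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(w, b, X, Y):
    # X: (d, n), Y: (1, n), w: (d, 1)
    A = sigmoid(w.T @ X + b)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

def gradients(w, b, X, Y):
    n = X.shape[1]
    A = sigmoid(w.T @ X + b)       # (1, n)
    dw = X @ (A - Y).T / n         # (d, 1)
    db = float(np.sum(A - Y) / n)
    return dw, db

rng = np.random.default_rng(0)
d, n = 3, 20
X = rng.normal(size=(d, n))
Y = (rng.random((1, n)) > 0.5).astype(float)
w = rng.normal(size=(d, 1)) * 0.1
b = 0.05

dw, db = gradients(w, b, X, Y)

# Central finite differences as an independent check of the analytic gradients
h = 1e-6
dw_num = np.zeros_like(w)
for k in range(d):
    e = np.zeros_like(w)
    e[k] = h
    dw_num[k] = (cost(w + e, b, X, Y) - cost(w - e, b, X, Y)) / (2 * h)
db_num = (cost(w, b + h, X, Y) - cost(w, b - h, X, Y)) / (2 * h)

print("max |dw - dw_num|:", np.max(np.abs(dw - dw_num)))
print("|db - db_num|:", abs(db - db_num))
```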




In [None]:
#Assuming you have a custom sigmoid function
#A = sigmoid(np.dot(w.T,X) + b)
#Loss=-1/n * np.sum(Y * np.log(A) + (1-Y) * (np.log(1-A)))
#dw = np.dot(X, (A-Y).T)/n
#db= np.sum(A-Y)/n

**The Update Step**

Once you have initialized your parameters and computed the cost function and its gradient, the next step is to update the parameters using gradient descent.

Write down the optimization function. The goal is to learn $w$ and $b$ by minimizing the cost function $L$. For a parameter $w$, the update rule is $w = w - \eta \, dw$, where $\eta$ is the learning rate and $dw = \frac{\partial L}{\partial w}$.

In [None]:
# You need to write two steps and repeat them for a fixed number of iterations:
# 1) Calculate the cost and the gradient for the current parameters.
# 2) Update the parameters using the gradient descent rule for w and b.
# w = w - learning_rate * dw
# b = b - learning_rate * db

**Predict**
The previous function/code will output the learned w and b, which we can use to predict the labels for a dataset X. There are two steps to computing predictions:

1. Calculate $\hat{Y} = A = \sigma(w^T X + b)$

2. Convert the entries of $A$ into 0 (if activation <= 0.5) or 1 (if activation > 0.5) and store the predictions in a vector.
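
The two prediction steps can be sketched as a small function. This assumes the column convention (`X` of shape `(d, n)`); the parameter values and inputs below are hypothetical and were chosen so the result is easy to verify by hand.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(w, b, X, threshold=0.5):
    """Return 0/1 predictions of shape (1, n) for X of shape (d, n)."""
    A = sigmoid(w.T @ X + b)             # step 1: probabilities
    return (A > threshold).astype(int)   # step 2: threshold into labels

# Hypothetical parameters and inputs
w = np.array([[1.0], [-1.0]])
b = 0.0
X = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0]])

preds = predict(w, b, X)
print(preds)
```

Here $w^T X = [2, -3, 0]$, so the activations are above, below, and exactly at 0.5, and the third example is labeled 0 under the `<= 0.5` rule.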

# Activation Functions

The choice of activation function is critical in neural network (NN) design. The logistic regression model we saw above is very similar to the perceptron, which is a basic building block of any NN model.

In problems where a binary class label needs to be predicted, the sign function can be a choice. For problems where the target variable is real-valued, it makes sense to use the identity activation function. When predicting the probability of a binary class, it makes sense to use the sigmoid function, as it restricts the output to values between 0 and 1.

The importance of non-linear activation functions will become clear when we move to multi-layered architecture.
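
A small preview of why this matters: stacking layers with the identity activation collapses into a single linear map, while inserting a non-linearity such as ReLU breaks that equivalence. The weights and input below are hypothetical values picked to keep the arithmetic checkable by hand, and biases are omitted for brevity.

```python
import numpy as np

x = np.array([1.0, -2.0])
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])   # first layer (identity matrix for simplicity)
W2 = np.array([[1.0, 1.0]])   # second layer sums its two inputs

# Two layers with identity activation equal one layer with weights W2 @ W1
two_linear = W2 @ (W1 @ x)
one_linear = (W2 @ W1) @ x
print("identity activation:", two_linear, "| collapsed:", one_linear)

# With a ReLU in between, the negative hidden unit is zeroed out
with_relu = W2 @ np.maximum(0, W1 @ x)
print("ReLU in between:", with_relu)
```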

[More about activation functions](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6)

### Activation functions: visualize what “non-linearity” does

Key properties to watch:
- Range (bounded vs unbounded)
- Saturation regions (where gradients get tiny)
- Smooth vs piecewise-linear

We will plot sigmoid, tanh, and ReLU on the same input range.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 400)

sig = 1/(1+np.exp(-x))
tanh = np.tanh(x)
relu = np.maximum(0, x)

plt.figure(figsize=(7,4))
plt.plot(x, sig, label="sigmoid")
plt.plot(x, tanh, label="tanh")
plt.plot(x, relu, label="ReLU")
plt.axhline(0, linewidth=0.5)
plt.axvline(0, linewidth=0.5)
plt.legend()
plt.title("Common activation functions")
plt.show()


# Readings for Today

[Paper 1](https://arxiv.org/pdf/1609.04747.pdf)

### Summary
The paper covers traditional gradient descent and explores more advanced algorithms designed to improve convergence speed and performance. Some key points include:

**Gradient Descent Basics:** The paper starts by explaining the fundamental concept of gradient descent, where the goal is to minimize a cost function by iteratively adjusting model parameters in the direction of steepest descent.

**Variants of Gradient Descent:**

- Batch Gradient Descent: Computes the gradient of the entire dataset.
- Stochastic Gradient Descent (SGD): Updates parameters for each training example, introducing randomness.
- Mini-batch Gradient Descent: A compromise between batch and stochastic, where updates are made using a small random subset of the data.

**Challenges with Vanilla Gradient Descent:**

- Learning Rate: The choice of the learning rate can significantly impact the convergence of the algorithm.
- Saddle Points: Vanilla gradient descent may struggle with convergence in the presence of saddle points.

**Advanced Optimization Algorithms:**

- Momentum: Introduces a moving average of past gradients to accelerate convergence.
- Adagrad, Adadelta, RMSprop: Adaptive learning rate methods that adjust the learning rates for each parameter individually.
- Adam: A popular optimization algorithm that combines ideas from momentum and adaptive learning rates.

**Issues and Considerations:** The paper discusses potential issues with optimization algorithms, such as choosing hyperparameters and dealing with non-convex optimization challenges.

**Practical Recommendations:** Provides practical advice on selecting optimization algorithms based on the characteristics of the optimization problem.

The paper serves as a valuable resource for understanding the landscape of optimization algorithms, their strengths, and considerations for practical implementation in training machine learning models.
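
To see how the three variants relate, the sketch below trains the same model with different batch sizes. This is not code from the paper: it swaps in least-squares linear regression on synthetic data as a simpler objective, and the function name `run_gd`, the learning rate, and the epoch count are hypothetical choices.

```python
import numpy as np

def run_gd(X, y, batch_size, lr=0.05, epochs=50, seed=0):
    """Least-squares linear regression trained with (mini-)batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                  # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)   # gradient of 0.5 * mean squared error
            w -= lr * grad
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# Synthetic data (hypothetical): y is a noisy linear function of X
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

initial = mse(np.zeros(3), X, y)
results = {}
for name, bs in [("batch", 200), ("mini-batch", 32), ("stochastic", 1)]:
    results[name] = mse(run_gd(X, y, batch_size=bs), X, y)
    print(f"{name:>10} (batch_size={bs:3d}) | final MSE {results[name]:.4f} (initial {initial:.4f})")
```

The only difference between the three variants is `batch_size`: the full dataset gives one update per epoch, a size of 1 gives one noisy update per example, and anything in between is mini-batch gradient descent.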

[Paper 2](https://proceedings.neurips.cc/paper/2020/file/d3f5d4de09ea19461dab00590df91e4f-Paper.pdf)



## Building Projects

The life cycle of a Data Science/Machine Learning project encompasses a series of sequential stages:

**Problem Comprehension:** The project begins with a clear understanding of the problem: which questions need answers, and which insights we hope to draw from the available data.

**Data Retrieval:** In this stage (also called data acquisition) we gather the necessary data, often in collaboration with domain experts, and consolidate everything needed to address the problem.

**Data Wrangling:** Data wrangling uses domain knowledge to preprocess the data. It includes identifying missing values, investigating why they occur, deciding how to handle them, and shaping the dataset for visualization and the later stages of the life cycle.

**Data Investigation:** Data investigation uses visualization and statistical measures to gauge how far the initial questions can be answered. The role of the data analyst typically ends at this point.

**Feature Engineering and Selection:** A precursor to modeling in both machine learning and deep learning, this stage resembles data wrangling but applies specific algorithms for transforming and selecting features.

**Model Construction:** Modeling is the pivotal stage. It captures the underlying trends and behavior of the data in a model that can make predictions, and so answers the questions posed in the first stage.

**Implementation:** Once the model is built, the next step is to deploy it efficiently. Deployment targets include mobile and web applications, which make the model accessible to real-world users.

**Supervision:** After deployment, the model is monitored continuously. This phase involves exposing the model to new data and tracking the volume of requests it receives. The analysis is adjusted as needed, and the cycle restarts if necessary.

## Additional Resources: Gradient Calculations using tensors





In [None]:
import torch

# Scalar tensor; requires_grad=True asks autograd to track operations on x
x = torch.tensor(3.5, requires_grad=True)
print("x:", x)

# y is defined as a function of x
y = (x-1) * (x-2) * (x-3)
print("y = (x-1) * (x-2) * (x-3):", y)


print("Working out gradient dy/dx")
# Backpropagate from y; this stores dy/dx in x.grad
y.backward()

# Value of the gradient dy/dx at x = 3.5
print("Numerical value of gradient:", x.grad)

x: tensor(3.5000, requires_grad=True)
y = (x-1) * (x-2) * (x-3): tensor(1.8750, grad_fn=<MulBackward0>)
Working out gradient dy/dx
Numerical value of gradient: tensor(5.7500)
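
As a sanity check, the printed values can be reproduced by hand. Applying the product rule to $y = (x-1)(x-2)(x-3)$ gives

$$
\frac{dy}{dx} = (x-2)(x-3) + (x-1)(x-3) + (x-1)(x-2)
$$

and evaluating at $x = 3.5$:

$$
y(3.5) = (2.5)(1.5)(0.5) = 1.875, \qquad
\left.\frac{dy}{dx}\right|_{x=3.5} = (1.5)(0.5) + (2.5)(0.5) + (2.5)(1.5) = 0.75 + 1.25 + 3.75 = 5.75
$$

Both numbers agree with the tensor values in the output.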


The requires_grad=True option set on x tells PyTorch that we want gradients with respect to x.

PyTorch tensors offer more functionality than plain Python variables and NumPy arrays. Besides its numerical value, a tensor can carry additional information, such as a gradient.

A tensor produced by an operation also records which other tensors it depends on and the mathematical form of that dependency.
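
The extra information a tensor carries can be inspected directly. This is a minimal sketch that reuses the same `x` and `y` as the cell above; `grad_fn` and `next_functions` are PyTorch autograd attributes, and the exact text PyTorch prints for them may vary between versions.

```python
import torch

x = torch.tensor(3.5, requires_grad=True)
y = (x - 1) * (x - 2) * (x - 3)

# Before backward() is called, no gradient has been stored on x
print("x.grad before backward:", x.grad)

# y remembers the operation that produced it (the final multiplication) ...
print("y.grad_fn:", y.grad_fn)

# ... and that node links back to the nodes that produced its two inputs
print("inputs to y:", [type(fn).__name__ for fn, _ in y.grad_fn.next_functions])

# Walking this graph backwards is what y.backward() does to fill in x.grad
y.backward()
print("x.grad after backward:", x.grad)
```

The chain of `grad_fn` nodes is the recorded dependency: `backward()` follows it from `y` back to `x` and applies the chain rule at each node.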

## Wrap-up: what you should be able to do after Lecture 2

1. Read and write tensor shapes confidently; debug shape errors quickly.
2. Use broadcasting deliberately; avoid silent shape bugs.
3. Perform a quick data audit (types + missingness) before modeling.
4. Explain why vectorization matters, and convert loops to tensor operations.
5. Connect logistic regression math to a vectorized implementation and gradient descent.

### Suggested practice (15–25 minutes)
- Modify the “from scratch” logistic regression to:
  - change learning rate and number of steps
  - report train vs test accuracy (over/under-fitting intuition)
  - add L2 regularization (optional)

### Reading focus
In Ruder (2016), focus on: batch vs. stochastic vs. mini-batch gradient descent, momentum, RMSprop, Adam, and the practical guidance on learning-rate choice.


# Extra Reading

[Article](https://iamtrask.github.io/2015/07/27/python-network-part2/)