# Logistic Regression

---

## Brief:
In this notebook I implement a logistic regression model using only NumPy. The model classifies a cancer patient's tumor as malignant or benign from multiple input variables (multivariate logistic regression).  
#### Goals:
- [Extracting Data](#data-extraction)
- [Defining Logistic Model](#logistic-model)
- [Normalizing Data](#data-normalization)
- [Cost Function](#cost-function)
- [Gradient Descent](#gradient-descent)

In [5]:
import pandas as pd
import matplotlib.pyplot
import numpy as np

## Data Extraction
The dataset I'll be using is a very interesting one called the **"Diagnostic Wisconsin Breast Cancer Database."**  
It contains: <center>*"Features that are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass."*</center><center>*"They describe characteristics of the cell nuclei present in the image."*</center>

In brief, it's a dataset with 30 features and a target value that classifies a patient's tumor as "Malignant" or "Benign".  
Since the target array ${y}$ currently contains "M" for malignant and "B" for benign, we'll convert it to binary form with M = 1 and B = 0.  

Citation:
> 📊 **Dataset Reference**  
> Wolberg, W., Mangasarian, O., Street, N., & Street, W. (1993). *Breast Cancer Wisconsin (Diagnostic)* [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B

In [6]:
# Read the file once, then split it into features (x_raw) and target (y_raw)
data = pd.read_csv("wdbc.data", sep=',')
x_raw = data.drop(['ID', 'Diagnosis'], axis=1)
y_raw = data[['Diagnosis']]  # kept as a one-column DataFrame
x_raw

| Index | radius1 | texture1 | perimeter1 | area1 | smoothness1 | compactness1 | concavity1 | concave_points1 | symmetry1 | fractal_dimension1 | ... | radius3 | texture3 | perimeter3 | area3 | smoothness3 | compactness3 | concavity3 | concave_points3 | symmetry3 | fractal_dimension3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | ... | 25.380 | 17.33 | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | ... | 24.990 | 23.41 | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | ... | 23.570 | 25.53 | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | ... | 14.910 | 26.50 | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | ... | 22.540 | 16.67 | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | ... | 25.450 | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 |
| 565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | ... | 23.690 | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 |
| 566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | ... | 18.980 | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 |
| 567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | ... | 25.740 | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 |


As you can see, we successfully extracted the data into the ``x_raw`` variable. (Its features sit on very different scales, so normalization is needed. More on that in [Normalization of Data](#normalization).)  
However, the ``y_raw`` variable contains letters, as seen in this sample:  
| Index | Diagnosis |
|-------|-----------|
| 0     | M         |
| 1     | M         |
| 2     | M         |
| 3     | M         |
| 4     | M         |
| ...   | ...       |
| 564   | M         |
| 565   | M         |
| 566   | M         |
| 567   | M         |
| 568   | B         |

Therefore, we will convert the elements in the Diagnosis column into binary values using ``numpy.where(condition, x, y)``, which returns ``x`` wherever the condition is true and ``y`` everywhere else. Here: <center>"M" = 1 || "B" = 0</center>

In [7]:
y = np.where(y_raw.values == "M", 1, 0).reshape(-1)
pd.DataFrame(y, columns=['Diagnosis'])

| Index | Diagnosis |
|-------|-----------|
| 0     | 1         |
| 1     | 1         |
| 2     | 1         |
| 3     | 1         |
| 4     | 1         |
| ...   | ...       |
| 564   | 1         |
| 565   | 1         |
| 566   | 1         |
| 567   | 1         |
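
As a small, self-contained illustration of the same conversion, the sketch below applies ``np.where`` to a made-up array of labels and then counts each class with ``np.bincount``. The ``labels`` array is hypothetical and not taken from the dataset; in the notebook the labels come from ``y_raw``.

```python
import numpy as np

# Hypothetical labels for illustration only; the real ones come from y_raw
labels = np.array(["M", "B", "B", "M", "B"])

# 1 where the label is "M" (malignant), 0 otherwise (benign)
encoded = np.where(labels == "M", 1, 0)

# Count the samples in each class: index 0 = benign, index 1 = malignant
counts = np.bincount(encoded, minlength=2)

print(encoded)  # [1 0 0 1 0]
print(counts)   # [3 2]
```

Running the same ``np.bincount`` call on the real ``y`` is a quick way to confirm that both classes are present after the conversion.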


## Logistic Regression Model

The logistic model itself is quite simple: it takes the linear model we're already used to as *input* and uses its negative as the exponent of $e$.  
The logistic regression model is defined as:

$$
g(z) = \frac{1}{1 + e^{-(w \cdot x + b)}}
$$

Where:
- ${z}$ is the linear model ${w \cdot x + b}$
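
As a quick check on the formula, it helps to evaluate it at a few points. When the linear model outputs $z = 0$ the result is exactly one half, and for very large positive or negative $z$ the output approaches 1 or 0:

$$
g(0) = \frac{1}{1 + e^{0}} = \frac{1}{2}, \qquad \lim_{z \to \infty} g(z) = 1, \qquad \lim_{z \to -\infty} g(z) = 0
$$

So the output always lies strictly between 0 and 1, which is what allows it to be read as a probability.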

This results in an S-shaped graph like this:  

<center><img src="1.webp" height="300" width="300"></center>

This is wonderful for **Binary Classification**: the model outputs a value between 0 and 1, and by simply setting a threshold (say, 0.5) that output can be converted to either 0 or 1. (more on this later)

In [8]:
def g(z):
    return 1 / (1 + np.exp(-z))
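
To make the threshold idea concrete, here is a minimal sketch, assuming a threshold of 0.5 and treating values exactly at the threshold as class 1 (that tie-breaking rule is my assumption). The ``sigmoid`` and ``predict`` helpers and the demo arrays are hypothetical names and values introduced for illustration; ``sigmoid`` simply restates the logistic function so the block runs on its own.

```python
import numpy as np

def sigmoid(z):
    # Logistic function, restated here so the sketch is self-contained
    return 1 / (1 + np.exp(-z))

def predict(x, w, b, threshold=0.5):
    # Hypothetical helper: linear model, then sigmoid, then a hard 0/1 decision
    probabilities = sigmoid(x @ w + b)
    return (probabilities >= threshold).astype(int)

# Made-up inputs and parameters, for illustration only
x_demo = np.array([[2.0, 1.0], [-1.0, -3.0], [0.0, 0.0]])
w_demo = np.array([1.0, 0.5])
b_demo = 0.0

print(sigmoid(0.0))                     # 0.5
print(predict(x_demo, w_demo, b_demo))  # [1 0 1]
```

The three demo rows give linear outputs of 2.5, -2.5 and 0, so the first and last land at or above the threshold and the middle one falls below it.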

## Normalization of Data

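In this dataset, we have **30 features!!** Because of this, and especially since this is a logistic regression model trained with gradient descent, normalizing the data is a must.  
For this case, we will use *Z-Score Normalization*.  
Before normalization, the features differ in scale by several orders of magnitude. Z-score normalization, applied per feature, rescales each feature to mean 0 and standard deviation 1, so most values end up within a few units of 0 (they are not strictly limited to the range -1 to 1).  
- Before normalization: ``x[1] = [20.57   17.77   132.9   1326   0.08]`` (some are small, like ``0.08``, and some are large, like ``1326``)
- After normalization: ``x[1] = [-0.181 -0.193  0.311  5.537 -0.271]`` (same data, centered around 0; these sample values appear to come from a single mean and standard deviation taken over the whole matrix, which is why ``1326`` still stands out as ``5.537``)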

Z-score is defined as:

$$
z = \frac{x - \mu}{\sigma}
$$

Where:
- $x$ is the original data point
- $\mu$ is the mean of that feature
- $\sigma$ is the standard deviation of that feature

In [21]:
def z_score(x):
    # Normalize each feature (column) with its own mean and standard deviation
    return (x - np.mean(x, axis=0)) / np.std(x, axis=0)
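
To see what per-feature normalization does, here is a minimal sketch on a made-up matrix whose two columns differ in scale by a factor of 100. The function name ``z_score_per_feature`` and the array ``x_demo`` are hypothetical and only serve this illustration; the key assumption is that rows are samples and columns are features, so the statistics are taken along ``axis=0``.

```python
import numpy as np

def z_score_per_feature(x):
    # One mean and one standard deviation per column (feature)
    return (x - np.mean(x, axis=0)) / np.std(x, axis=0)

# Made-up matrix: 3 samples, 2 features on very different scales
x_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0]])

x_norm = z_score_per_feature(x_demo)

print(x_norm.mean(axis=0))  # approximately 0 for every feature
print(x_norm.std(axis=0))   # approximately 1 for every feature
```

After this step both columns hold the same values, because they only differed by a constant factor to begin with. That is the sense in which z-score normalization puts features on a comparable scale.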