In [103]:
import pandas as pd
import numpy as np

In [104]:
df = pd.read_csv("/Users/shamvi/stuff/DeepLearning/Dataset/Churn_Modelling.csv")


In [105]:
df.sample(5)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
680,681,15780804,Nucci,482,France,Male,55,5,97318.25,1,0,1,78416.14,0
3226,3227,15796351,Yao,603,Germany,Male,35,1,105346.03,2,1,1,130379.5,0
4452,4453,15704788,Krawczyk,812,Spain,Female,49,8,66079.45,2,0,0,91556.57,1
1622,1623,15783955,Miah,697,France,Female,25,4,165686.11,2,1,0,15467.98,0
1226,1227,15775572,Bergamaschi,531,Germany,Female,42,6,88324.31,2,1,0,75248.75,0


In [106]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [107]:
df.duplicated().sum()

np.int64(0)

In [108]:
df['Exited'].value_counts()

Exited
0    7963
1    2037
Name: count, dtype: int64

In [109]:
df['Geography'].value_counts()

Geography
France     5014
Germany    2509
Spain      2477
Name: count, dtype: int64

In [110]:
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [111]:
df.drop(columns=['RowNumber','CustomerId','Surname'],inplace=True)
# Drop identifier columns (row number, customer ID, surname) that should not be used as features

In [112]:
df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [113]:
df = pd.get_dummies(df,columns=['Geography','Gender'],drop_first=True)

## Why One-Hot Encoding is Used in MLP (Multi-Layer Perceptron)

A Multi-Layer Perceptron (MLP) works only with **numerical data** and performs mathematical operations such as weighted sums and gradient-based optimization.  
Categorical features like **Geography** or **Gender** are labels, not numbers, and cannot be directly processed by a neural network.

### Problem with Categorical Data
If categories are encoded as integers (e.g., France = 0, Germany = 1, Spain = 2), the model may incorrectly assume:
- An **order** between categories
- A **distance or magnitude** between categories

This leads to incorrect learning.

### What One-Hot Encoding Does
One-hot encoding converts each category into a **binary feature (0 or 1)**:
- No ordering between categories
- No false numerical meaning
- Safe and meaningful input for neural networks

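Example (full one-hot encoding):
- France → [1, 0, 0]
- Germany → [0, 1, 0]
- Spain → [0, 0, 1]

Because the code used here passes `drop_first=True`, the first category of each column is dropped to avoid a redundant column. France becomes the baseline (`Geography_Germany = 0`, `Geography_Spain = 0`), and Gender collapses to a single `Gender_Male` column.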
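
To make the effect of dropping the first category concrete, here is a small hand-rolled sketch. The helper `one_hot_drop_first` and the toy `geography` list are made up for illustration; the sketch only mimics the idea behind `drop_first=True` and is not how pandas implements it.

```python
# Hand-rolled illustration of one-hot encoding with the first category dropped.
# `one_hot_drop_first` is a hypothetical helper, not a pandas function.

def one_hot_drop_first(values):
    """Return (kept_categories, rows) with the alphabetically first category dropped."""
    categories = sorted(set(values))   # e.g. ['France', 'Germany', 'Spain']
    kept = categories[1:]              # the dropped category becomes the all-zero baseline
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

geography = ["France", "Spain", "Germany", "France"]  # toy values, not the real dataset
columns, encoded = one_hot_drop_first(geography)

print(columns)
for value, row in zip(geography, encoded):
    print(f"{value:8s} -> {row}")
```

Two columns are enough to tell three countries apart: a row with zeros in both columns can only be France.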

### Code Used
```python
pd.get_dummies(df, columns=['Geography','Gender'], drop_first=True)
```

In [114]:
x = df.drop(columns=['Exited'])  # `columns=` already selects the axis, so `axis=1` is redundant
y = df['Exited']

In [115]:
x

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_Germany,Geography_Spain,Gender_Male
0,619,42,2,0.00,1,1,1,101348.88,False,False,False
1,608,41,1,83807.86,1,0,1,112542.58,False,True,False
2,502,42,8,159660.80,3,1,0,113931.57,False,False,False
3,699,39,1,0.00,2,0,0,93826.63,False,False,False
4,850,43,2,125510.82,1,1,1,79084.10,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,39,5,0.00,2,1,0,96270.64,False,False,True
9996,516,35,10,57369.61,1,1,1,101699.77,False,False,True
9997,709,36,7,0.00,1,0,1,42085.58,False,False,False
9998,772,42,3,75075.31,2,1,0,92888.52,True,False,True


In [116]:
y

0       1
1       0
2       1
3       0
4       0
       ..
9995    0
9996    0
9997    1
9998    1
9999    0
Name: Exited, Length: 10000, dtype: int64

In [117]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1)

### Purpose

`train_test_split` is used to **split the dataset into training and testing sets** so that we can:

* Train the model on one part of the data
* Evaluate its performance on unseen data

---

### Explanation of Each Component

#### `x`

* Input features (independent variables)
* Example: age, salary, geography, gender, etc.

#### `y`

* Target variable (dependent variable)
* Example: churn = 0 or 1

---

### What the Split Means

* `test_size=0.2`

  * 20% of the data → **Test set**
  * 80% of the data → **Training set**

So:

* `x_train`, `y_train` → used to **train** the neural network
* `x_test`, `y_test` → used to **test** the trained model

---

### Why We Need a Test Set

* To check how well the model **generalizes**
* To detect **overfitting**
* To simulate performance on **new, unseen data**

---

### Role of `random_state=1`

* Ensures **reproducibility**
* The same rows will go into train and test every time you run the code
* Without it, the split would be different on every run
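
The reproducibility point can be sketched with a hand-rolled shuffle split. This is a stand-in written with NumPy for illustration, not scikit-learn's implementation; the helper name `shuffle_split` is an assumption, and it will not reproduce the exact rows that `train_test_split(..., random_state=1)` selects.

```python
import numpy as np

def shuffle_split(n_rows, test_size, seed):
    """Shuffle row indices with a seeded generator and cut off a test portion."""
    rng = np.random.default_rng(seed)      # same seed -> same shuffle
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    return order[n_test:], order[:n_test]  # (train indices, test indices)

train_a, test_a = shuffle_split(10000, 0.2, seed=1)
train_b, test_b = shuffle_split(10000, 0.2, seed=1)

print(len(train_a), len(test_a))
print("same split with same seed:", np.array_equal(test_a, test_b))
```

With a fixed seed, rerunning the cell gives the same train and test rows, so results can be compared across runs.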

---

### What This Means for an MLP

* Training set is used for:

  * Forward propagation
  * Backpropagation
  * Weight updates
* Test set is used only for:

  * Final evaluation
  * Checking accuracy, loss, etc.

---

### One-Line Exam / Notebook Answer

> `train_test_split` divides the dataset into training and testing sets. The model learns from the training data, while the test data is used to evaluate its performance on unseen examples. `random_state` ensures reproducibility of the split.



In [118]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()



### Purpose

`StandardScaler` is used to **scale numerical features** so that they have:

* **Mean = 0**
* **Standard Deviation = 1**

This process is called **standardization**.

---

### Why Feature Scaling Is Needed

In real datasets, features can have very different ranges:

* Age → 18 to 60
* Salary → 10,000 to 100,000
* Balance → 0 to 200,000

If we don’t scale:

* Large-value features dominate the loss function
* Gradient descent converges slowly
* Neural network training becomes unstable

---

### What StandardScaler Does Mathematically

For each feature:

$$
x_{\text{scaled}} = \frac{x - \mu}{\sigma}
$$

Where:

* $\mu$ = mean of the feature
* $\sigma$ = standard deviation of the feature
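
The formula can be checked by hand with NumPy. The arrays below are made-up toy values (an age column and a salary column), and the sketch applies the formula directly instead of calling `StandardScaler`. Note that the mean and standard deviation come from the training rows only and are then reused for the test rows.

```python
import numpy as np

# Toy data: columns are [age, salary]; values are illustrative only.
train = np.array([[25.0, 20000.0],
                  [40.0, 60000.0],
                  [55.0, 100000.0]])
test = np.array([[40.0, 60000.0],
                 [70.0, 140000.0]])

mu = train.mean(axis=0)     # per-feature mean, learned from training data only
sigma = train.std(axis=0)   # population std (ddof=0), the convention StandardScaler uses

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma   # reuse the training statistics; never refit on test data

print(train_scaled.mean(axis=0))    # approximately 0 for each feature
print(train_scaled.std(axis=0))     # approximately 1 for each feature
print(test_scaled)
```

The test rows do not end up with mean 0 and standard deviation 1 themselves, and that is expected: they are measured on the scale learned from the training set.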

---

### Why This Is Important for MLP / Deep Learning

* MLP uses **gradient descent**
* Equal feature scale → **balanced weight updates**
* Faster convergence
* Keeps the inputs to sigmoid units out of their flat, saturated regions, which reduces the risk of vanishing gradients

---

### Typical Usage in Practice

```python
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test  = scaler.transform(x_test)
```

* `fit` → learns mean and std **only from training data**
* `transform` → applies the same scaling to train and test data
* Prevents **data leakage**

---

### Important Note

* Apply `StandardScaler` **only to numerical features**
* Do **not** scale the target variable (`y`)
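* One-hot encoded columns (0/1) usually do **not** need scaling; this notebook nevertheless scales all 11 columns together for simplicity, which is why the dummy columns in the scaled output take values such as -0.58 and 1.71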

---

### One-Line Notebook / Exam Answer

> `StandardScaler` standardizes numerical features to zero mean and unit variance, ensuring all features are on a similar scale, which improves gradient descent convergence and neural network training stability.



In [119]:
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [120]:
x_train_scaled

array([[-0.23082038, -0.94449979, -0.70174202, ...,  1.71490137,
        -0.57273139,  0.91509065],
       [-0.25150912, -0.94449979, -0.35520275, ..., -0.58312392,
        -0.57273139, -1.09278791],
       [-0.3963303 ,  0.77498705,  0.33787579, ...,  1.71490137,
        -0.57273139, -1.09278791],
       ...,
       [ 0.22433188,  0.58393295,  1.3774936 , ..., -0.58312392,
        -0.57273139, -1.09278791],
       [ 0.13123255,  0.01077067,  1.03095433, ..., -0.58312392,
        -0.57273139, -1.09278791],
       [ 1.1656695 ,  0.29735181,  0.33787579, ...,  1.71490137,
        -0.57273139,  0.91509065]], shape=(8000, 11))

In [121]:
x_test_scaled

array([[-1.03768121,  0.77498705, -1.0482813 , ..., -0.58312392,
        -0.57273139,  0.91509065],
       [ 0.30708683, -0.46686456, -0.70174202, ..., -0.58312392,
        -0.57273139,  0.91509065],
       [-1.23422423,  0.29735181, -1.0482813 , ..., -0.58312392,
        -0.57273139, -1.09278791],
       ...,
       [-0.86182692, -0.46686456,  1.72403288, ..., -0.58312392,
         1.74601919,  0.91509065],
       [-0.30323097, -0.84897275, -1.0482813 , ...,  1.71490137,
        -0.57273139, -1.09278791],
       [ 0.04847759,  1.25262228,  1.3774936 , ...,  1.71490137,
        -0.57273139,  0.91509065]], shape=(2000, 11))

In [122]:
import tensorflow
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense


`import tensorflow`
This imports **TensorFlow**, which is a deep learning framework used to build, train, and deploy neural networks. It provides low-level operations (tensors, gradients) and high-level APIs.

---

`from tensorflow import keras`
Keras is a **high-level neural network API** inside TensorFlow.
It makes building neural networks easier by providing ready-made components like layers, optimizers, and loss functions.

Think of it as:

* TensorFlow → engine
* Keras → user-friendly controls

---

`from tensorflow.keras import Sequential`
`Sequential` is a **model type** in Keras.

It is used when:

* Layers are stacked **one after another**
* Data flows in **one direction** (input → hidden layers → output)

Most basic MLPs are built using `Sequential`.

Example idea:
Input → Dense → Dense → Output

---

`from tensorflow.keras.layers import Dense`
`Dense` represents a **fully connected layer**.

In a Dense layer:

* Every neuron is connected to **all neurons in the previous layer**
* It performs:
  weighted sum + bias + activation function

This is the **core building block of an MLP**.
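
As a rough sketch of what a single Dense layer computes, the NumPy snippet below applies "weighted sum + bias + activation" to a few random inputs. The weights, biases, and inputs are random placeholders rather than values from the trained model, and the layer size (11 inputs, 3 units) is chosen to match the network built later in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 11))   # 4 hypothetical samples with 11 features each
W = rng.normal(size=(11, 3))   # one weight per (input feature, neuron) pair
b = np.zeros(3)                # one bias per neuron

hidden = sigmoid(x @ W + b)    # weighted sum + bias, then activation

print(hidden.shape)            # one row per sample, one column per neuron
print(hidden)
```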

---

### In short

* TensorFlow → deep learning framework
* Keras → easy-to-use neural network API
* Sequential → simple stack-based model
* Dense → fully connected neural network layer



In [123]:
model = Sequential()
model.add(Dense(3,activation='sigmoid',input_dim=11))
model.add(Dense(1,activation='sigmoid'))

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)




`model = Sequential()`
This creates an **empty neural network model**.
At this point, the model has **no layers**.
You are basically saying: *“I’m going to build a neural network step by step, layer by layer.”*

---

`model.add(Dense(3, activation='sigmoid', input_dim=11))`

This adds the **first layer** of your neural network.

What each part means:

* `Dense`
  This is a **fully connected layer**.
  Every neuron in this layer is connected to **all 11 input features**.

* `input_dim=11`
  Your input data has **11 features** (after one-hot encoding and scaling).
  So each data point entering the network looks like:

  ```
  [x1, x2, x3, ..., x11]
  ```

* `3`
  This layer has **3 neurons**.
  Each neuron learns a different combination of the 11 input features.

* `activation='sigmoid'`
  After computing the weighted sum, the output of each neuron is passed through the **sigmoid function**, which squashes values into the range **0 to 1**.

So this layer:

* Takes 11 inputs
* Applies weights + bias
* Produces **3 outputs**
* Each output is between 0 and 1

This is your **hidden layer**.

---

`model.add(Dense(1, activation='sigmoid'))`

This adds the **output layer**.

* `1` neuron → because this is a **binary classification** problem (Exited: 0 or 1)
* The neuron takes input from the **3 hidden neurons**
* `sigmoid` converts the final value into a **probability**

So the output will be:

```
A number between 0 and 1
```

Example:

* 0.87 → 87% chance the customer exits
* 0.12 → 12% chance the customer exits

---

### Big Picture (what you actually built)

Input layer (11 features)
→ Hidden layer (3 neurons, sigmoid)
→ Output layer (1 neuron, sigmoid)

This is a **simple Multi-Layer Perceptron**.
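
The number of trainable parameters in this network can be worked out by hand: each Dense layer has one weight per input-neuron pair plus one bias per neuron. The helper `dense_params` below is only an illustration of that arithmetic.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden_params = dense_params(11, 3)   # 11*3 weights + 3 biases
output_params = dense_params(3, 1)    # 3*1 weights + 1 bias
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)  # 36 4 40
```

The output layer's 3 weights and 1 bias match the shapes returned by `model.layers[1].get_weights()` later in the notebook.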




In [124]:
model.summary()

In [125]:
model.compile(loss='binary_crossentropy',optimizer='Adam')


`model.compile(...)`
This step **configures the model for training**.
Until this point, you have only *designed the network structure*.
Now you are telling the model **how to learn**.

---

`loss='binary_crossentropy'`

This defines **how wrong the model’s prediction is**.

Your problem is:

* Binary classification (`Exited` = 0 or 1)
* Output layer uses `sigmoid` (gives probability between 0 and 1)

`binary_crossentropy` is specifically designed for this situation.

What it does conceptually:

* Compares the **true label** (0 or 1)
* With the **predicted probability**
* Penalizes confident wrong predictions **very strongly**

Example:

* True = 1, Predicted = 0.95 → very small loss
* True = 1, Predicted = 0.05 → very large loss

This pushes the model to become **confident and correct**.
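
The two examples above can be reproduced with the textbook formula for a single sample, `-(y*log(p) + (1-y)*log(1-p))`. This is a plain-Python sketch of the formula, not Keras's implementation, and it ignores the numerical safeguards a real library applies.

```python
import math

def binary_crossentropy(y_true, p):
    """Binary cross-entropy for one sample with true label y_true and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

print(round(binary_crossentropy(1, 0.95), 4))  # 0.0513 -> confident and right: small loss
print(round(binary_crossentropy(1, 0.05), 4))  # 2.9957 -> confident and wrong: large loss
```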

---

`optimizer='Adam'`

The optimizer decides **how the weights are updated**.

Adam is an advanced version of gradient descent that:

* Automatically adjusts learning rates
* Uses momentum (remembers past gradients)
* Converges faster and more reliably
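
To give a feel for these ideas, here is a minimal single-parameter sketch of the standard Adam update rule, used to minimize the toy function `(w - 3)^2`. The function `adam_minimize`, the toy objective, and the learning rate of 0.1 are assumptions made for illustration; Keras applies the same kind of update to every weight in the network with its own defaults.

```python
import math

def adam_minimize(grad_fn, w, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run `steps` Adam updates on a single scalar parameter w."""
    m = 0.0  # running average of gradients (the "momentum" part)
    v = 0.0  # running average of squared gradients (scales the step size)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Gradient of (w - 3)^2 is 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimize(lambda w: 2 * (w - 3.0), w=0.0, steps=1000)
print(w_final)
```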

In practice:

* Adam is a **common default choice** for many deep learning tasks
* Works very well without much tuning

---

### What `compile` really means in one sentence

You are telling the model:

* “This is how to measure mistakes” → loss function
* “This is how to fix those mistakes” → optimizer

---

### What happens after this step

After `compile`, the model is ready for:

```python
model.fit(...)
```

That’s when:

* Forward propagation happens
* Loss is calculated
* Backpropagation updates weights

---

### One practical note

Often you’ll also see:

```python
model.compile(loss='binary_crossentropy',
              optimizer='Adam',
              metrics=['accuracy'])
```

This just tells Keras to **report accuracy during training**.



In [126]:
model.fit(x_train_scaled,y_train,epochs=10)

Epoch 1/10
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 248us/step - loss: 0.6009
Epoch 2/10
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 220us/step - loss: 0.5058
Epoch 3/10
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 220us/step - loss: 0.4695
Epoch 4/10
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 218us/step - loss: 0.4545
Epoch 5/10
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 220us/step - loss: 0.4463
Epoch 6/10
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 221us/step - loss: 0.4405
Epoch 7/10
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 218us/step - loss: 0.4360
Epoch 8/10
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 218us/step - loss: 0.4327
Epoch 9/10
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 217us/step - loss: 0.4304
Epoch 10/10
[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s

<keras.src.callbacks.history.History at 0x122c00fd0>



`model.fit(x_train_scaled, y_train, epochs=10)`

This line **starts training your neural network**.

---

### What data you are giving

* `x_train_scaled`
  These are your **input features**, already:

  * one-hot encoded
  * scaled
  * shape (8000, 11)

* `y_train`
  These are the **true labels** (0 or 1: customer exited or not)

So the model now has:

* Inputs (X)
* Correct answers (y)

---

### What `epochs=10` means

An **epoch** =
One complete pass of the entire training dataset through the network.

So:

* Epoch 1 → model sees all 8000 samples once
* Epoch 10 → model has seen the entire data **10 times**

Each epoch helps the model:

* Make predictions
* Measure error (loss)
* Adjust weights to reduce that error

---

### What you see in the output

Example line:

```
Epoch 1/10
250/250 - loss: 0.6009
```

This means:

* Training dataset was divided into **250 mini-batches** (8000 samples / default batch size of 32)
* Model updated weights **250 times** in that epoch
* Average loss over epoch 1 = 0.6009

You can see (epochs 1 to 9; the epoch 10 value is cut off in the output above):

```
loss: 0.6009 → 0.5058 → 0.4695 → ... → 0.4304
```
```

Loss going **down** means:
✅ Model is learning
✅ Weights are improving

---

### What is happening internally (important intuition)

For **each batch**:

1. Forward pass → model predicts output
2. Loss is calculated using binary crossentropy
3. Backpropagation computes gradients
4. Adam optimizer updates weights

This repeats:

* 250 times per epoch
* 10 epochs total
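
The step counts above follow from simple arithmetic. The sketch below assumes the Keras default batch size of 32, since no `batch_size` was passed to `model.fit`.

```python
import math

n_samples = 8000    # rows in x_train_scaled
batch_size = 32     # Keras default when batch_size is not specified
epochs = 10

steps_per_epoch = math.ceil(n_samples / batch_size)  # a partial last batch would still count as a step
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)  # 250 2500
```

So over the whole run, the weights are updated 2500 times.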

---

### Why the final line appears

```
<keras.src.callbacks.history.History at 0x...>
```

This means:

* `model.fit()` returns a **History object**
* It contains loss (and metrics) for each epoch
* You can store it like this:

```python
history = model.fit(...)
```

And later access:

```python
history.history['loss']
```



In [127]:
model.layers[1].get_weights()

[array([[-1.7306589],
        [ 0.586282 ],
        [-0.7327666]], dtype=float32),
 array([-0.39685404], dtype=float32)]

In [128]:
y_log = model.predict(x_test_scaled)

63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 331us/step


In [129]:
y_pred = np.where(y_log>0.5,1,0)
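
The thresholding step turns predicted probabilities into class labels. Here is a small sketch with made-up probabilities, shaped `(n, 1)` like the output of `model.predict` for a single-neuron output layer. The values in `probs` are hypothetical.

```python
import numpy as np

probs = np.array([[0.87], [0.12], [0.50], [0.51]])  # hypothetical predicted probabilities

labels = np.where(probs > 0.5, 1, 0)  # strictly greater than 0.5 -> class 1
flat_labels = labels.ravel()          # flatten (n, 1) to (n,) if a 1-D array is preferred

print(labels.shape, flat_labels)      # (4, 1) [1 0 0 1]
```

Note that a probability of exactly 0.5 maps to class 0 because the comparison is strict. The 0.5 cut-off is a convention, and a different threshold may suit an imbalanced problem like churn better.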

In [130]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.8035
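
To put this accuracy in context, it helps to compare it with a model that always predicts the majority class. The sketch below uses the `Exited` counts from the full dataset shown earlier (7963 stayed, 2037 exited); the class ratio in the 2000-row test split may differ slightly, so treat the baseline as approximate.

```python
# Class counts taken from df['Exited'].value_counts() on the full dataset.
counts = {0: 7963, 1: 2037}

baseline_accuracy = max(counts.values()) / sum(counts.values())  # always predict "did not exit"
model_accuracy = 0.8035                                          # accuracy reported above

print(f"baseline: {baseline_accuracy:.4f}")
print(f"model:    {model_accuracy:.4f}")
print(f"gain:     {model_accuracy - baseline_accuracy:.4f}")
```

The gain over the baseline is small, which suggests looking at metrics beyond accuracy (for example, recall on the exited class) before judging the model.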