# Logistic Regression — Predicting `sex_female`

This notebook trains a Logistic Regression model to predict the binary target `sex_female` from an insurance dataset.
Workflow: load data → preprocess (encode + scale) → split → train → evaluate.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

# Suppress all warnings to keep notebook output clean.
# Note: this also hides scikit-learn's ConvergenceWarning.
warnings.filterwarnings('ignore')

In [19]:
df = pd.read_csv(
  '/home/harsh/Desktop/ML/ML/Data/Data.csv'
)
df

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


## Quick Data Preview

Inspect summary statistics, column types, and missing-value counts to understand the dataset structure before preprocessing.

In [20]:
df.describe()

,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [22]:
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [23]:
df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

## Preprocessing

Encode the categorical features (`sex`, `region`, `smoker`) with `OneHotEncoder` and scale the numeric features (`age`, `bmi`, `charges`) with `StandardScaler`; `children` is left unscaled.
The transformed columns are concatenated into `final_df`, and the original raw columns are dropped.

In [24]:
from sklearn.preprocessing import OneHotEncoder

# One-hot encode categorical features: 'sex', 'region', 'smoker'
# handle_unknown='ignore' prevents errors if new categories appear
ohe = OneHotEncoder(
  handle_unknown='ignore',
  # drop= 'first',
  sparse_output=False
)
# Transform and create a DataFrame with proper column names
ohe_df = pd.DataFrame(
  ohe.fit_transform(df[
    ['sex', 'region', 'smoker']
  ]),
  columns=ohe.get_feature_names_out(['sex', 'region', 'smoker'])
)

# quick peek at the generated feature names
ohe.get_feature_names_out(['sex', 'region', 'smoker'])

array(['sex_female', 'sex_male', 'region_northeast', 'region_northwest',
       'region_southeast', 'region_southwest', 'smoker_no', 'smoker_yes'],
      dtype=object)

In [25]:
final_df = pd.concat([df, ohe_df], axis=1)
final_df.drop(
  ['sex', 'region', 'smoker'], 
  axis=1,
  inplace = True
)

final_df

,age,bmi,children,charges,sex_female,sex_male,region_northeast,region_northwest,region_southeast,region_southwest,smoker_no,smoker_yes
0,19,27.900,0,16884.92400,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,18,33.770,1,1725.55230,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
2,28,33.000,3,4449.46200,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
3,33,22.705,0,21984.47061,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
4,32,28.880,0,3866.85520,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
1333,50,30.970,3,10600.54830,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
1334,18,31.920,0,2205.98080,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1335,18,36.850,0,1629.83350,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1336,21,25.800,0,2007.94500,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0


In [26]:
from sklearn.preprocessing import StandardScaler

# Scale numeric features so they have mean=0 and std=1
scaler = StandardScaler()

scaled = scaler.fit_transform(
  df[
    ['age', 'bmi', 'charges']
  ]
)

scaled_df = pd.DataFrame(
  scaled,
  columns = ['age_scaled', 'bmi_scaled', 'charges_scaled'],
  index= df.index
)

# Append scaled columns to final_df and remove original raw columns
final_df = pd.concat(
  [final_df, scaled_df],
  axis=1
)

final_df.drop(
  ['age', 'bmi', 'charges'],
  axis = 1,
  inplace = True
)

# show a few samples to confirm transformations
final_df.sample(10)

,children,sex_female,sex_male,region_northeast,region_northwest,region_southeast,region_southwest,smoker_no,smoker_yes,age_scaled,bmi_scaled,charges_scaled
609,2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,-0.655551,1.17072,2.145393
388,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,-0.940356,-1.321115,-0.833804
326,1,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,-0.869155,-1.222689,-0.801995
783,1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.768473,-0.502533,0.929318
231,3,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.409283,-0.464803,0.060375
99,0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,-0.085942,-1.864103,0.210671
1108,1,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,-0.940356,-0.108827,-0.856334
1066,2,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.62607,1.087058,-0.35457
761,1,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,-1.153959,0.744205,-0.896574
805,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.412467,0.845092,-0.457525
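The scaled columns above follow `z = (x - mean) / std`, where `StandardScaler` uses the population standard deviation (dividing by `n`). A minimal sketch of that formula follows. It uses only the first five ages from the data preview, so the numbers will not match the notebook's `age_scaled` values, which are computed over all rows.

```python
from statistics import fmean, pstdev

# First five ages from the data preview (illustration only)
ages = [19, 18, 28, 33, 32]

mu = fmean(ages)
sigma = pstdev(ages)  # population std (divides by n), as StandardScaler does

# Standardize: subtract the mean, divide by the standard deviation
z = [(a - mu) / sigma for a in ages]
print([round(v, 3) for v in z])
```

`df.describe()` reports the sample standard deviation (dividing by `n - 1`), so its `std` row is slightly larger than the value the scaler divides by.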


In [27]:
final_df.isnull().sum()

children            0
sex_female          0
sex_male            0
region_northeast    0
region_northwest    0
region_southeast    0
region_southwest    0
smoker_no           0
smoker_yes          0
age_scaled          0
bmi_scaled          0
charges_scaled      0
dtype: int64

## Train / Test Split

Split the prepared features and target (`sex_female`) into training and test sets. We use `random_state=42` for reproducibility and an 80/20 split.

In [28]:
from sklearn.model_selection import train_test_split

# Split features and target into train/test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
  final_df.drop('sex_female', axis = 1),
  final_df['sex_female'],
  random_state=42,
  test_size=0.2
)
# Print shapes to confirm split
X_train.shape, X_test.shape

((1070, 11), (268, 11))
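One thing worth checking at this point: the encoder was run without `drop`, so `sex_male` is always `1 - sex_female`, and the split above keeps `sex_male` among the features. A minimal sketch of excluding both columns follows. It uses a tiny made-up stand-in for `final_df`, and the names `toy_final` and `leaky_cols` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for final_df with a few of its columns
toy_final = pd.DataFrame({
    "children":   [0, 1, 3, 0],
    "sex_female": [1.0, 0.0, 0.0, 0.0],
    "sex_male":   [0.0, 1.0, 1.0, 1.0],
    "smoker_yes": [1.0, 0.0, 0.0, 0.0],
})

# The two one-hot columns are exact complements of each other
assert (toy_final["sex_male"] == 1 - toy_final["sex_female"]).all()

# Drop the target and its complement so no feature is a copy of the label
leaky_cols = ["sex_female", "sex_male"]
X = toy_final.drop(columns=leaky_cols)
y = toy_final["sex_female"].astype(int)
print(X.columns.tolist())
```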

## Model Training

Train a `LogisticRegression` model. `max_iter` is raised from the default of 100 to 500 to give the solver room to converge.

In [29]:
from sklearn.linear_model import LogisticRegression

# Initialize Logistic Regression and fit to training data
lr_reg = LogisticRegression(
  max_iter=500
)

lr_reg.fit(X_train, y_train)
# Model is now trained and can be used for predictions

parameter,value
penalty,'l2'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,None
random_state,None
solver,'lbfgs'
max_iter,500
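Whether the raised `max_iter` was actually needed can be checked after fitting through the `n_iter_` attribute. The following is a minimal sketch on synthetic data from `make_classification`, not on the insurance data; `X_demo`, `y_demo`, and `clf` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the convergence check
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, random_state=42
)

clf = LogisticRegression(max_iter=500).fit(X_demo, y_demo)

# n_iter_ holds the iterations actually used by the solver;
# a value below max_iter means it stopped on its own
print(clf.n_iter_, clf.max_iter)
```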


In [None]:
# Generate class predictions and positive-class log-probabilities for the test set
y_pred = lr_reg.predict(X_test)
print(y_pred)

# `predict_log_proba` returns log-probabilities; column 1 is the positive class.
# Use `predict_proba` instead if plain probabilities are needed.
y_log_prob = lr_reg.predict_log_proba(X_test)[:, 1]
print(y_log_prob)

[1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 0. 0. 1. 1.
 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 1. 1. 1. 0. 0. 1. 0. 0. 1.
 1. 0. 0. 1. 0. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0.
 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 1. 1. 0.
 1. 0. 1. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 0. 1.
 0. 0. 0. 1. 1. 0. 1. 1. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1.
 0. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1.
 1. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1.
 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0.
 1. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 0. 0.
 1. 0. 1. 1.]
[-0.01563502 -0.01595736 -0.01781292 -4.1942229  -4.36413657 -4.17324517
 -0.0158189  -4.18561677 -0.01506752 -4.2014191  -4.27591401 -4.19923871
 -0.01514673 -4.35489474 -4.30549939 
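Log-probabilities are easier to read once exponentiated. As a small sketch, the first four values printed above can be turned back into probabilities and thresholded at 0.5, which mirrors how the predicted class is chosen in the binary case.

```python
import math

# First four positive-class log-probabilities from the output above
log_probs = [-0.01563502, -0.01595736, -0.01781292, -4.1942229]

# exp() turns a log-probability back into a probability
probs = [math.exp(lp) for lp in log_probs]

# Threshold at 0.5 to recover the predicted class
labels = [int(p >= 0.5) for p in probs]
print([round(p, 4) for p in probs], labels)
```

The resulting labels agree with the first four entries of the predicted-class array.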

In [31]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

1.0

## Evaluation

Beyond the accuracy computed above, we look at a classification report (precision, recall, F1-score) and the confusion matrix to understand prediction performance and common error modes.

In [None]:
from sklearn.metrics import classification_report

# Show precision, recall and f1-score for each class
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       128
         1.0       1.00      1.00      1.00       140

    accuracy                           1.00       268
   macro avg       1.00      1.00      1.00       268
weighted avg       1.00      1.00      1.00       268



In [None]:
from sklearn.metrics import confusion_matrix

# Compute and display confusion matrix: rows=true labels, cols=predicted labels
confusion_matrix(y_test, y_pred)

array([[128,   0],
       [  0, 140]])
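The headline metrics can be read straight off this matrix. A minimal sketch follows, using the four counts shown above and assuming the scikit-learn layout of rows for true labels and columns for predicted labels.

```python
# Counts from the confusion matrix above (rows = true, cols = predicted)
cm = [[128, 0],
      [0, 140]]
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```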

In [None]:
# Quick checks: types, shapes, and first few values to debug mismatches
print(type(y_test), type(y_pred))
print(y_test.shape, y_pred.shape)
print(y_test[:10])
print(y_pred[:10])

<class 'pandas.core.series.Series'> <class 'numpy.ndarray'>
(268,) (268,)
764     1.0
887     1.0
890     1.0
1293    0.0
259     0.0
1312    0.0
899     1.0
752     0.0
1286    1.0
707     0.0
Name: sex_female, dtype: float64
[1. 1. 1. 0. 0. 0. 1. 0. 1. 0.]


In [None]:
# Cast both label arrays from float to int so their dtypes match
y_test = y_test.astype(int)
y_pred = y_pred.astype(int)

In [39]:
from sklearn.metrics import f1_score, precision_score, recall_score

# Each metric takes the true labels (`y_test`) first and the predictions
# (`y_pred`) second; passing `X_test` (features) here is a mistake.
# Store the returned values and print those, not the function objects.

precision = precision_score(y_test, y_pred)
f1_scores = f1_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print('Precision Score: ', precision, '\n')
print('F1-Score: ', f1_scores, '\n')
print('Recall Score: ', recall)

Precision Score:  1.0 

F1-Score:  1.0 

Recall Score:  1.0


## Summary & Next Steps

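- **Summary:** The model scores 1.0 on accuracy, precision, recall, and F1 across the 268 test rows. This is an artifact of target leakage: `sex_male`, the exact complement of `sex_female`, is still in the feature matrix, so the model only has to invert one column.
- **Next steps:** Drop `sex_male` from the features and re-evaluate, so the scores reflect how well the remaining features predict the target. After that, try feature selection, regularization tuning (`C`), or cross-validation to improve performance.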
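The regularization tuning and cross-validation mentioned above can be combined in one search. The following is a minimal sketch on synthetic data rather than the insurance data; the grid of `C` values is arbitrary, and `X_demo`, `y_demo`, `pipe`, and `search` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features and target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, random_state=42
)

# Scaling inside the pipeline is refit on each training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

# Hypothetical grid; the values of C are illustrative
search = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline means each fold's validation rows do not influence the scaling statistics.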