# Part 1: Understanding Logistic Regression

The goal of this part is to understand how logistic regression handles binary classification problems.

We will use the Python libraries numpy, matplotlib, scipy, and sklearn. Run the import cell below before running the rest of the notebook.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import expit  # Sigmoid function
from sklearn.linear_model import LinearRegression, LogisticRegression

We will create a simple toy dataset in which the X values are sampled from a Gaussian (normal) distribution, the positive values are stretched by a factor of 4, and a small amount of Gaussian noise is added. The target y is binary (0 or 1) and is set from the sign of the original sample, before the scaling and noise are applied.

In [None]:
xmin, xmax = -5, 5
n_samples = 1000  # Number of samples
np.random.seed(1)
X = np.random.normal(size=n_samples)
y = (X > 0).astype(float)  # Binary classification target

X[X > 0] *= 4  # Scale positive values
X += 0.3 * np.random.normal(size=n_samples)  # Add noise

X = X[:, np.newaxis]  # Reshape X for sklearn compatibility


In [None]:
print("X shape:", X.shape)
print("y shape:", y.shape)

In [None]:
# Visualize the dataset
plt.scatter(X, y)
plt.xlabel("X")
plt.ylabel("y")
plt.show()

Next, we fit a logistic regression model to the data. Logistic regression models the probability that `y=1` given `x`.

In [None]:
logistic_regr = LogisticRegression(C=1e5)  # C is the inverse regularization strength; a large C means very weak regularization
logistic_regr.fit(X, y)

The logistic function is of the form:
$$p = \frac{1}{1+e^{-(ax+b)}},$$
where $a$ is the coefficient and $b$ is the intercept.  
$p$ gives the probability that $y=1$ given $x$.
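
As a minimal sketch of the formula above, the snippet below evaluates $p$ at a few points. The names `logistic`, `a_demo`, and `b_demo` are hypothetical and the parameter values are chosen only for illustration; they are not the fitted values from this notebook.

```python
import math


def logistic(x, a, b):
    """Logistic function p = 1 / (1 + exp(-(a*x + b)))."""
    return 1.0 / (1.0 + math.exp(-(a * x + b)))


# Hypothetical parameters, chosen only for illustration (not the fitted values)
a_demo, b_demo = 2.0, -1.0

# p increases with x (because a_demo > 0) and always lies strictly between 0 and 1
for x in (-3.0, 0.0, 3.0):
    print(f"x = {x:+.1f} -> p = {logistic(x, a_demo, b_demo):.4f}")
```

With a positive coefficient, larger values of $x$ push $p$ toward 1 and smaller values push it toward 0.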

Print the coefficient and the intercept of the trained model:

In [None]:
print("Coefficient (a):", logistic_regr.coef_[0][0])
print("Intercept (b):", logistic_regr.intercept_[0])
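
The sketch below checks, on a small synthetic dataset invented for this purpose, that plugging `coef_` and `intercept_` into the logistic formula reproduces the class-1 probabilities returned by `predict_proba`. The names `X_demo`, `y_demo`, and `model` are assumptions made for the illustration and are separate from the notebook's own variables.

```python
import numpy as np
from scipy.special import expit  # Sigmoid function
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset, invented for this illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)
a = model.coef_[0][0]      # coefficient
b = model.intercept_[0]    # intercept

# Probability of class 1 from the formula vs. from sklearn
p_manual = expit(a * X_demo[:, 0] + b)
p_sklearn = model.predict_proba(X_demo)[:, 1]

# The two should agree up to floating-point error
print("Formula matches predict_proba:", np.allclose(p_manual, p_sklearn))
```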

**Open a code cell below and calculate the value of $x$ that gives $p=0.5$.  
Assign this value to the variable `x_threshold`.**

Now let's plot the logistic regression model, along with its prediction.

In [None]:
y_pred = logistic_regr.predict(X)

# Create a range of x values for plotting
x_plot = np.linspace(xmin, xmax, 100)

# Calculate the predicted probabilities using the logistic regression model
p_plot = 1 / (1 + np.exp(-(logistic_regr.coef_[0][0] * x_plot + logistic_regr.intercept_[0])))

# Plot the logistic function
plt.plot(x_plot, p_plot, label="Logistic Regression", c='orange')
plt.scatter(X, y_pred, label="Logistic Regression Predictions", c='blue')
plt.xlabel("X")
plt.ylabel("y")
plt.legend()

# Plot dashed lines where p = 0.5, x = x_threshold
plt.axhline(0.5, linestyle='--')
plt.axvline(x_threshold, linestyle='--')
plt.show()

**Open a text cell below, and answer the question:  
How does logistic regression determine the decision boundary between class 0 and class 1?**

Now let's compare the prediction with the original dataset (ground truth).

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12,4))

# Create a range of x values for plotting
x_plot = np.linspace(xmin, xmax, 100)

# Calculate the predicted probabilities using the logistic regression model
p_plot = 1 / (1 + np.exp(-(logistic_regr.coef_[0][0] * x_plot + logistic_regr.intercept_[0])))

# Plot the logistic function
ax[0].plot(x_plot, p_plot, c='orange')
ax[0].scatter(X, y, label="Ground truth")
ax[0].set_xlabel("X")
ax[0].set_ylabel("y")
ax[0].axhline(0.5, linestyle='--')
ax[0].axvline(x_threshold, linestyle='--')
ax[0].legend(loc='lower right')

ax[1].plot(x_plot, p_plot, c='orange')
ax[1].scatter(X, y_pred, label="Predictions", c='blue')
ax[1].set_xlabel("X")
ax[1].set_ylabel("y")
ax[1].axhline(0.5, linestyle='--')
ax[1].axvline(x_threshold, linestyle='--')
ax[1].legend(loc='lower right')
plt.show()


**Open a text cell below, and answer the question:  
Does the logistic regression model give 100% accuracy for this dataset? Justify your answer.**

# Part 2: Comparing Logistic Regression with Linear Regression

Using the same dataset, let's create a linear regression model.  
**Insert a code cell below, add code to create a linear regression model `linear_regr`, and fit the model with the dataset.**

**Open a text cell below, and answer the question:  
What assumptions does linear regression make about the relationship between X and y?**

**Open a code cell below and print the coefficient and intercept of the linear regression model.**

We now plot both the logistic regression model and the linear regression model on the same graph to compare them.

In [None]:
plt.figure(1, figsize=(8, 6))  # Set up figure
plt.scatter(X, y, label="Example data", color="blue", s=20, marker = 'o')  # Scatter plot of the data

X_test = np.linspace(-5, 10, 300)  # Test range for X-axis

# Logistic regression prediction (sigmoid curve)
p_logistic = expit(X_test * logistic_regr.coef_ + logistic_regr.intercept_).ravel()  # Predicted probabilities, not a loss
plt.plot(X_test, p_logistic, label="Logistic Regression Model", color="orange", linewidth=2)

# Linear regression prediction (straight line)
plt.plot(
    X_test,
    linear_regr.coef_ * X_test + linear_regr.intercept_,
    label="Linear Regression Model",
    linewidth=2,
)

plt.axhline(0.5, color=".5")  # Horizontal line at y=0.5
plt.ylabel("y")
plt.xlabel("X")
plt.ylim(-0.5, 1.5)  # Set y-limits
plt.xlim(-4, 10)  # Set x-limits

plt.legend(loc="lower right", fontsize="small")
plt.tight_layout()
plt.show()


**Open a text cell below, and answer the questions:**
1. What do you observe about the shape of the logistic regression curve compared to the linear regression line?
2. Why does logistic regression's output stay between 0 and 1, whereas linear regression does not?
3. If you were to classify the data into two groups based on the output of the linear regression model, what threshold would you use? How would this threshold compare to the 0.5 threshold in logistic regression?

# Part 3: Customer Churn Prediction (Binary Classification)

In this part of the lab, you will build a logistic regression model to predict customer churn (whether a customer will leave a service). This is a typical binary classification problem. The task will use a dataset with various customer features, and the goal is to predict whether a customer will churn or not (0 = no churn, 1 = churn).



In [None]:
import numpy as np
import pandas as pd
# Sklearn imports
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

In [None]:
!ls drive/MyDrive/ECEN250/lab5_logistic_regression/Telco_Customer_Churn.csv  # Change this to the path of your own CSV file.

Load the dataset and perform some basic exploratory data analysis to understand its structure and key characteristics.

In [None]:
# importing dataset
df = pd.read_csv('drive/MyDrive/ECEN250/lab5_logistic_regression/Telco_Customer_Churn.csv')

In [None]:
df.head()

In [None]:
df.info()

Column `TotalCharges` is of type `object`, which suggests that it contains some non-numeric values.  
Let's convert column `TotalCharges` to numeric using `pd.to_numeric()`, setting `errors='coerce'` to turn non-numeric values into NaN.

In [None]:
# Convert the TotalCharges column to numeric, forcing errors to NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
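
To illustrate what `errors='coerce'` does, here is a minimal sketch on a toy column. The values in `s` are made up for the example, and it assumes that the non-numeric entries are blank strings, which may or may not match your copy of the dataset.

```python
import pandas as pd

# Toy column (made up for illustration) mixing numeric strings and a blank entry
s = pd.Series(["29.85", "1889.5", " ", "108.15"])

# Entries that cannot be parsed as numbers become NaN instead of raising an error
converted = pd.to_numeric(s, errors="coerce")

print(converted)
print("dtype:", converted.dtype)
print("NaN count:", converted.isna().sum())
```

Counting the NaN values after the conversion shows how many rows were affected.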

Check the datatype of this column again:

In [None]:
print(df['TotalCharges'].dtype)

**Insert a code cell below to drop the rows that contain NaN values from the dataframe.**

In [None]:
# Reset the row index after dropping rows
df.reset_index(drop=True, inplace=True)

In [None]:
# Check for missing values
print(df.isnull().sum())

**Open a code cell below to drop the column 'customerID', since it's not relevant for predicting customer churn.**

In [None]:
df.head()

Check the values in column `Churn`:

In [None]:
df['Churn'].unique()

Column `Churn` contains values of `No` or `Yes`. Let's convert them to numerical values `0` or `1`.

In [None]:
# Convert 'Churn' column to numerical values: No -> 0, Yes -> 1
df['Churn'] = df['Churn'].map({'No': 0, 'Yes': 1})  # .map yields an integer column directly

# Verify the datatype of 'Churn' column
df['Churn'].dtype

In [None]:
df.info()

Let's start with a logistic regression model with only one feature.  
**Use the TotalCharges feature to predict customer churn. Insert code cells below, create a dataset (X, y) with this feature, and Churn as label. Split the dataset into 70% training and 30% testing.**

**Insert a code cell below. Create a logistic regression model, train the model with the training set, and predict on the testing set.**

Let's look at the accuracy:

In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
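
As a hedged illustration of how to read the output above, the sketch below builds a confusion matrix from toy labels that are invented for the example. In scikit-learn, rows correspond to the true labels and columns to the predicted labels, so for binary labels 0 and 1 the flattened matrix is ordered TN, FP, FN, TP.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels, invented for illustration (0 = no churn, 1 = churn)
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 0, 0, 0, 1]

cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()  # Row = true label, column = predicted label

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print("Accuracy:", accuracy_score(y_true_demo, y_pred_demo))
```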


**Question: Compared with the accuracy score, how does the confusion matrix help you understand the model's performance?**

Now we use all numerical columns in the original dataframe.

In [None]:
df_numerical = df.select_dtypes(include=['int64', 'float64'])
df_numerical.head()

**Insert code cells below and do the following:   
Create a dataset with the numerical features above (all columns of `df_numerical` except `Churn`, which is the label). Split the dataset into 70% training and 30% testing.  
Create a logistic regression model, train the model with the training set, and predict on the testing set.  
Calculate the prediction accuracy.**

**Question: Has the performance improved compared with the previous model with only one feature? Justify your answer.**

Now let's use all the features in the dataframe.

In [None]:
# Loop through all columns with 'object' dtype
for column in df.select_dtypes(include='object').columns:
    unique_values = df[column].unique()
    print(f"Unique values in '{column}' column: {unique_values}")


The following code converts all categorical columns into numerical.

In [None]:
categorical_cols = [col for col in df.columns if df[col].dtype == 'object']
df_categorical = df[categorical_cols].copy()
for col in categorical_cols:
    if df_categorical[col].nunique() == 2:
        df_categorical[col], _ = pd.factorize(df_categorical[col])
    else:
        df_categorical = pd.get_dummies(df_categorical, columns=[col])

df_categorical = df_categorical.astype('int')


df_categorical.head()
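
The two encodings used above can be seen on a small example. This is a sketch on a toy frame whose column names and values are made up for illustration. Note that `pd.factorize` assigns codes in order of first appearance, so which category becomes 0 depends on the data.

```python
import pandas as pd

# Toy frame, made up for illustration
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
})

# Two unique values: map to integer codes 0/1 in order of first appearance
codes, uniques = pd.factorize(toy["Partner"])
toy["Partner"] = codes

# More than two unique values: one indicator (one-hot) column per category
toy = pd.get_dummies(toy, columns=["Contract"]).astype("int")

print("Code order for Partner:", list(uniques))
print(toy)
```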

In [None]:
df_categorical.info()

In [None]:
df_numerical.info()

Apply a standard scaler to the numerical features.

In [None]:
numerical_cols = [col for col in df.columns if df[col].dtype != 'object' and col!='Churn']
df_std = pd.DataFrame(StandardScaler().fit_transform(df_numerical[numerical_cols].astype('float64')), columns=numerical_cols)
df_std.head()
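
As a minimal sketch of what standard scaling does, the snippet below applies `StandardScaler` to a toy matrix with invented values. Each column is shifted and rescaled so that it has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (invented values): two columns on very different scales
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])

X_scaled = StandardScaler().fit_transform(X_toy)

# After scaling, each column has (approximately) zero mean and unit standard deviation
print("Column means:", X_scaled.mean(axis=0))
print("Column stds: ", X_scaled.std(axis=0))
```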

In [None]:
df_std.info()

Combine the numerical and categorical columns together.

In [None]:
df_processed = pd.concat([df_std, df_categorical], axis=1)
df_processed['Churn'] = df_numerical['Churn'].astype(int)
df_processed.head()
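
`pd.concat(..., axis=1)` aligns rows by index label, which is why the row index was reset after dropping rows earlier. The sketch below, using toy frames invented for the example, shows what happens when the indices do not match.

```python
import pandas as pd

# Toy frames, invented for illustration
left = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[0, 2, 3])  # Gap at index 1, as after dropping a row
right = pd.DataFrame({"b": [10, 20, 30]})                     # Fresh index 0, 1, 2

# Mismatched indices: rows are matched by label, leaving NaN where a label is missing
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first lines the rows up by position
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned)
print(aligned)
```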

In [None]:
df_processed.info()

**Insert code cells below and do the following:   
Create a dataset using the dataframe above, with `Churn` as the label and all other columns as features.  
Split the dataset into 70% training and 30% testing.  
Create a logistic regression model, train the model with the training set, and predict on the testing set.  
Calculate the prediction accuracy.**

**Question: Open a text cell below, then summarize and compare the performance of (1) the model with only one feature, (2) the model with four numerical features, and (3) the model with all features. Share your observations and insights.**

Lab 5 is now complete.  Make sure all cells are visible and have been run (rerun if necessary).

The code below converts the .ipynb file to PDF and saves it in the same folder as the notebook.

In [None]:
# Enter the path to your notebook file below, e.g. "/content/drive/MyDrive/ECEN250/ECEN250_Lab5.ipynb".
# Do not change the lines after it, and make sure you do not have multiple notebooks with the same path.
NOTEBOOK_PATH = ""
! pip install playwright
! jupyter nbconvert --to webpdf --allow-chromium-download "$NOTEBOOK_PATH"

Download your notebook as an .ipynb file, then upload it along with the PDF file (saved in the same Google Drive folder as this notebook) to Canvas for Lab 5. Make sure that the PDF file matches your .ipynb file.