# Logistic Regression: Age vs. Smoking Quantity

In this notebook, we specifically analyze the relationship between **Age**, the **Number of Cigarettes Smoked Per Day**, and the risk of developing **Coronary Heart Disease (CHD)**.

We use the **Framingham Heart Study** dataset.

In [None]:
import numpy as np
import pandas as pd
import sklearn.linear_model as lm

### 1. Load and Select the Age & Smoking Data
We load the dataset and strictly isolate `age` and `cigsPerDay` as our features.

First, we load the data from GitHub.

In [None]:
url = "https://raw.githubusercontent.com/GauravPadawe/Framingham-Heart-Study/master/framingham.csv"
df = pd.read_csv(url)

Next, we select the columns we want to analyze.

In [None]:
df = df[['age', 'cigsPerDay', 'TenYearCHD']]

We drop the rows with missing values, e.g., those with an unknown smoking history.
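
Before dropping rows, it can be useful to check how many values are actually missing. The following is a minimal sketch on a small made-up table (the `toy` DataFrame and its values are hypothetical and only mimic the three columns selected above); it counts the missing entries per column and reports how many rows `dropna()` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the selected columns.
toy = pd.DataFrame({
    "age": [39, 46, 48, 61, 46],
    "cigsPerDay": [0.0, np.nan, 20.0, 30.0, np.nan],
    "TenYearCHD": [0, 0, 0, 1, 0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# dropna() removes every row that contains at least one missing value.
rows_before = len(toy)
toy_clean = toy.dropna()
rows_dropped = rows_before - len(toy_clean)

print(missing_per_column)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```

The same two calls, `isna().sum()` and `dropna()`, can be applied to `df` to see how much data the cleaning step costs.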

In [None]:
df = df.dropna()

Let us take a look.

In [None]:
print(f"Dataset Shape: {df.shape}")
df.head()

### 2. Train Logistic Regression
We train the model to separate the classes based *only* on these two features.

We define the feature matrix `X` and the target variable `Y`.

In [None]:
X = df[['age', 'cigsPerDay']].values
Y = df['TenYearCHD'].values

In [None]:
X

We train the model. Since `C` is the inverse of the regularization strength, setting `C=10_000` effectively turns off *regularization*.
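
To illustrate the effect of `C`, here is a small sketch on synthetic data (the arrays `X_toy` and `y_toy` and the "true" coefficients 1.5 and -1.0 are assumptions made up for this illustration, not part of the Framingham data). It fits the same model with a small and a large value of `C` and compares the size of the resulting coefficient vectors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data; the generating coefficients are arbitrary.
rng = np.random.default_rng(0)
n = 500
X_toy = rng.normal(size=(n, 2))
true_logits = 1.5 * X_toy[:, 0] - 1.0 * X_toy[:, 1]
y_toy = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logits))).astype(int)

# Fit once with strong regularization (small C) and once with
# almost no regularization (large C), then compare coefficient norms.
coef_norms = {}
for C in (0.01, 10_000):
    model = LogisticRegression(C=C, solver="lbfgs")
    model.fit(X_toy, y_toy)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: coefficients={model.coef_[0]}, norm={coef_norms[C]:.3f}")
```

With the small `C` the coefficients are shrunk towards zero, while the large `C` leaves them close to the unpenalized maximum-likelihood estimate.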

In [None]:
M = lm.LogisticRegression(C=10_000, solver='lbfgs')
M.fit(X, Y)

We extract the coefficients.

In [None]:
ϑ0 = M.intercept_[0]
ϑ1, ϑ2 = M.coef_[0]

print(f"Intercept: {ϑ0:.4f}")
print(f"Coefficient for Age: {ϑ1:.4f}")
print(f"Coefficient for CigsPerDay: {ϑ2:.4f}")

### 3. Visualizing the Risk Boundary
The plot below shows how Age and Smoking Quantity interact.

* **X-Axis:** Age
* **Y-Axis:** Cigarettes Per Day
* **Green Line:** The "Risk Threshold" (50% probability). Assuming both coefficients are positive, points **above** or **to the right** of this line are those for which the model predicts a CHD probability of more than 50%.
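
The 50% threshold is the set of points where the linear score `ϑ0 + ϑ1*Age + ϑ2*Cigs` is zero, because the logistic function maps a score of zero to a probability of 0.5. The sketch below makes this concrete. The helper functions `chd_probability` and `boundary_cigs` are introduced here for illustration only, and the coefficient values are placeholders, not the values fitted above.

```python
import math

def chd_probability(age, cigs, theta0, theta1, theta2):
    """Logistic model: probability from the linear score of age and cigs."""
    z = theta0 + theta1 * age + theta2 * cigs
    return 1.0 / (1.0 + math.exp(-z))

def boundary_cigs(age, theta0, theta1, theta2):
    """Cigarettes per day at which the linear score is zero for a given age."""
    return -(theta0 + theta1 * age) / theta2

# Placeholder coefficients (hypothetical, NOT the fitted values).
theta0, theta1, theta2 = -6.0, 0.08, 0.02

age = 60
cigs_on_line = boundary_cigs(age, theta0, theta1, theta2)
p_on_line = chd_probability(age, cigs_on_line, theta0, theta1, theta2)
p_above = chd_probability(age, cigs_on_line + 10, theta0, theta1, theta2)
p_below = chd_probability(age, cigs_on_line - 10, theta0, theta1, theta2)

print(f"On the line:  {p_on_line:.3f}")
print(f"10 cigs more: {p_above:.3f}")
print(f"10 cigs less: {p_below:.3f}")
```

Replacing the placeholders with the fitted `ϑ0`, `ϑ1`, `ϑ2` gives the probabilities that the actual model assigns.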

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Set plot style
sns.set(style='darkgrid')

plt.figure(figsize=(14, 8))

# Plot Labels
plt.title('Heart Disease Risk: Age vs. Smoking', fontsize=16)
plt.xlabel('Age (Years)', fontsize=14)
plt.ylabel('Cigarettes Per Day', fontsize=14)

# Scatter Plot of Actual Data
# We add "jitter" (random noise) to the points because many people have the exact same age/smoking count.
# Without jitter, the dots would stack on top of each other and hide the density.
jitter_age = np.random.normal(0, 0.3, len(X))
jitter_cigs = np.random.normal(0, 0.3, len(X))

plt.scatter(X[Y==0, 0] + jitter_age[Y==0], X[Y==0, 1] + jitter_cigs[Y==0], 
            color='blue', label='No Disease', alpha=0.3, s=20)
plt.scatter(X[Y==1, 0] + jitter_age[Y==1], X[Y==1, 1] + jitter_cigs[Y==1], 
            color='red', label='Developed CHD', alpha=0.6, s=20)

# Calculate Decision Boundary Line
# Formula: ϑ0 + ϑ1*Age + ϑ2*Cigs = 0
# Solve for Cigs (y): Cigs = -(ϑ0 + ϑ1*Age) / ϑ2
x_vals = np.linspace(X[:, 0].min(), X[:, 0].max() + 10, 100)
y_vals = -(ϑ0 + ϑ1 * x_vals) / ϑ2

# Plot the Line
plt.plot(x_vals, y_vals, color='green', linewidth=3, label='50% Probability Threshold')

# Set Limits
plt.ylim(-2, 70) # Cigarettes per day range
plt.xlim(30, 80) # Age range

plt.legend(fontsize=12, loc='upper left')
plt.show()

### 4. Interpretation
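You will notice that the green line slopes **downwards**. According to the model, the more cigarettes a person smokes per day, the lower the age at which the predicted probability of developing CHD exceeds 50%. In other words, the model expects heavy smokers to cross the risk threshold at an earlier age than non-smokers.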
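
The direction of the slope can be read off the coefficients. As a short derivation, assuming $\vartheta_2 \neq 0$, solving the boundary equation from the plotting code for the number of cigarettes gives

$$
\vartheta_0 + \vartheta_1 \cdot \text{age} + \vartheta_2 \cdot \text{cigs} = 0
\quad\Longleftrightarrow\quad
\text{cigs} = -\frac{\vartheta_0}{\vartheta_2} - \frac{\vartheta_1}{\vartheta_2} \cdot \text{age}
$$

The slope of the line is therefore $-\vartheta_1 / \vartheta_2$. It is negative exactly when $\vartheta_1$ and $\vartheta_2$ have the same sign, which is the case when both age and smoking increase the predicted risk.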