# Data Challenge 13 – Interpreting Logistic Regression 

**Purpose**  
Apply what you learned about logistic regression interpretation by analyzing NYC Restaurant Inspection data. 
 
You'll practice interpreting **continuous**, **binary**, and **categorical** predictors, computing **odds ratios**, and assessing model accuracy. 

**Learning Goals**
- Convert coefficients to odds ratios using `np.exp()`.  
- Interpret ORs for continuous, binary, and categorical predictors.  
- Use accuracy to assess logistic regression performance.  
- Communicate results clearly and responsibly.  
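
As a quick illustration of the first goal, here is a minimal sketch of turning log-odds coefficients into odds ratios and percent changes in the odds. The coefficient values are hypothetical and chosen only for illustration; they are not taken from the inspection data.

```python
import numpy as np

# Hypothetical log-odds coefficients (illustration only, not model output)
coefficients = np.array([-0.5, 0.0, 0.25])

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios = np.exp(coefficients)

# Percent change in the odds for a one-unit increase in the predictor
percent_change = (odds_ratios - 1) * 100

for beta, odds_ratio, pct in zip(coefficients, odds_ratios, percent_change):
    print(f"beta={beta:+.2f}  OR={odds_ratio:.4f}  change in odds={pct:+.1f}%")
```

A coefficient of 0 maps to an odds ratio of 1 (no change), negative coefficients map to odds ratios below 1, and positive coefficients map to odds ratios above 1.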

**Data:** NYC Restaurant Health Inspection results, June 1, 2025 to Nov 4, 2025

[Restaurant Health Inspection](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j/about_data)


## Instructor Guidance

**Hint: Use the Lecture Deck, Canvas Reading, and Docs to help you with the code**

Use this guide live; students implement below.

**Docs (Quick Links)**
- LogisticRegression – https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html  
- accuracy_score – https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html  
- OneHotEncoder – https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html  
- StandardScaler – https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html  
- np.exp – https://numpy.org/doc/stable/reference/generated/numpy.exp.html  

**Pseudocode Plan**

1️⃣ Load cleaned restaurant inspection data from the previous challenge.  
2️⃣ Define target = `IS_A` (1 = Grade A, 0 = otherwise).  
3️⃣ Predictors →  
    • Continuous = `SCORE`  
    • Binary = `CRITICAL_NUM`  
    • Categorical = `BORO`  
4️⃣ Scale continuous variables; encode categorical ones.  
5️⃣ Fit `LogisticRegression`.  
6️⃣ Exponentiate coefficients (`np.exp()`) → odds ratios.  
7️⃣ Interpret one continuous, one binary, and one categorical coefficient.  
8️⃣ Evaluate accuracy.  
9️⃣ Reflect on scaling choices and communication of odds.  


## You Do – Student Section
Work in pairs. Comment your choices briefly. Keep code simple: only coerce the columns you use.

## Step 1 – Imports and Plot Defaults

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


## Step 2 – Load CSV, Create Columns, Preview

- Point to your New York City Restaurant Inspection data 
- Create the `is_A` and `critical_num` columns as you did in the L11 notebook

In [2]:
df = pd.read_csv('/Users/Marcy_Student/Desktop/Marcy_Lab/DA2025_Lectures/Mod6/data/restaurant_inspection_cleaned.csv')
display(df.head())
display(df.info())
display(df.describe())

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE_DESCRIPTION,INSPECTION_DATE,ACTION,...,INSPECTION_TYPE,Latitude,Longitude,Community_Board,Council_District,Census_Tract,BIN,BBL,NTA,Location
0,50141498,DMM BAKERY,Brooklyn,6802,BAY PARKWAY,11204.0,7183314372,Chinese,2025-06-16,No violations were recorded at the time of thi...,...,Cycle Inspection / Initial Inspection,40.6121,-73.983252,311.0,47.0,25800.0,3135132.0,3055800000.0,BK28,POINT (-73.983252132334 40.612100428335)
1,50115119,JANNAT ADEN RESTAURANT,Bronx,2620,AVENUE Z,,7185004894,Middle Eastern,2025-07-09,Violations were cited in the following area(s).,...,Cycle Inspection / Initial Inspection,0.0,0.0,,,,,2.0,,
2,50121689,BELLA ITALY PIZZA,Bronx,1941,SOUTHERN BOULEVARD,10460.0,7183789577,Pizza,2025-06-09,No violations were recorded at the time of thi...,...,Administrative Miscellaneous / Initial Inspection,40.842262,-73.885759,206.0,15.0,36502.0,2010160.0,2029600000.0,BX17,POINT (-73.885758684936 40.842261957703)
3,50142981,PIZZA PLUS,Manhattan,2253,3 AVENUE,10035.0,2122892400,Pizza,2025-06-09,Violations were cited in the following area(s).,...,Cycle Inspection / Initial Inspection,40.802072,-73.936988,111.0,8.0,19400.0,1054626.0,1017870000.0,MN34,POINT (-73.936988471232 40.802071782442)
4,50139126,WONDER,Brooklyn,310,SCHERMERHORN STREET,11217.0,9142614549,Fusion,2025-06-07,No violations were recorded at the time of thi...,...,Inter-Agency Task Force / Initial Inspection,40.68748,-73.982245,302.0,33.0,4100.0,3000556.0,3001728000.0,BK38,POINT (-73.982245142975 40.687480172953)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41100 entries, 0 to 41099
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   CAMIS                  41100 non-null  int64  
 1   DBA                    41100 non-null  object 
 2   BORO                   41100 non-null  object 
 3   BUILDING               40940 non-null  object 
 4   STREET                 41100 non-null  object 
 5   ZIPCODE                40597 non-null  float64
 6   PHONE                  41100 non-null  object 
 7   CUISINE_DESCRIPTION    41100 non-null  object 
 8   INSPECTION_DATE        41100 non-null  object 
 9   ACTION                 41100 non-null  object 
 10  VIOLATION_CODE         40591 non-null  object 
 11  VIOLATION_DESCRIPTION  40591 non-null  object 
 12  CRITICAL_FLAG          41100 non-null  object 
 13  SCORE                  39329 non-null  float64
 14  GRADE                  22556 non-null  object 
 15  GR

None

Unnamed: 0,CAMIS,ZIPCODE,SCORE,Latitude,Longitude,Community_Board,Council_District,Census_Tract,BIN,BBL
count,41100.0,40597.0,39329.0,40993.0,40993.0,40423.0,40423.0,40423.0,40213.0,40881.0
mean,48454200.0,10714.489741,29.159196,40.274106,-73.112144,256.961631,20.35653,30462.868194,2606248.0,2494582000.0
std,3473063.0,595.241082,21.543035,4.28155,7.772006,130.928956,15.484138,31846.096493,1362256.0,1346751000.0
min,40356020.0,10001.0,0.0,0.0,-74.248708,101.0,1.0,100.0,1000000.0,1.0
25%,50035920.0,10023.0,13.0,40.688945,-73.988314,106.0,4.0,8100.0,1049916.0,1010710000.0
50%,50106740.0,11101.0,25.0,40.737404,-73.954362,302.0,20.0,17200.0,3028676.0,3008560000.0
75%,50141620.0,11233.0,39.0,40.760766,-73.889191,402.0,34.0,44100.0,4017574.0,4006920000.0
max,50178190.0,11694.0,203.0,40.912822,0.0,503.0,51.0,161700.0,5799501.0,5270001000.0


In [4]:
df.columns = df.columns.str.strip().str.replace(' ', '_').str.replace('-', '_')
df['INSPECTION_DATE'] = pd.to_datetime(df['INSPECTION_DATE'])
df = df[(df['INSPECTION_DATE'] >= '2025-06-01') & (df['INSPECTION_DATE'] <= '2025-11-04')]

In [5]:
# Drop rows that are missing GRADE or SCORE (infinite values are treated as missing)
required_cols = ['GRADE', 'SCORE']
df = df.replace([np.inf, -np.inf], np.nan).dropna(subset=required_cols)

In [6]:
df['SCORE'].info()

<class 'pandas.core.series.Series'>
Index: 22554 entries, 0 to 41099
Series name: SCORE
Non-Null Count  Dtype  
--------------  -----  
22554 non-null  float64
dtypes: float64(1)
memory usage: 352.4 KB


In [7]:
df['is_A'] = (df['GRADE'] == 'A').astype(int)

In [8]:
df['is_A'].value_counts()

is_A
1    11905
0    10649
Name: count, dtype: int64

In [9]:
df['CritiCal_num'] = (df['CRITICAL_FLAG'] == 'Critical').astype(int)

In [12]:
df['CritiCal_num'].value_counts()

CritiCal_num
1    11679
0    10875
Name: count, dtype: int64

## Step 3 – Define Predictors & Target

- Target is `is_A` 
- X predictors are: SCORE, CRITICAL_NUM (created in Step 2), BORO


In [None]:
df['BORO'] 

0         Brooklyn
5         Brooklyn
8         Brooklyn
11       Manhattan
12          Queens
           ...    
41094    Manhattan
41096       Queens
41097    Manhattan
41098       Queens
41099     Brooklyn
Name: BORO, Length: 22554, dtype: object

In [14]:
from sklearn.preprocessing import OneHotEncoder
# Dummy Variables (Method: sklearn.OneHotEncoder)
# We fit the encoder on the 'BORO' column
ohe = OneHotEncoder(drop='first', sparse_output=False)
feature_array = ohe.fit_transform(df[['BORO']])
feature_labels = list(ohe.get_feature_names_out())
dummies_skl = pd.DataFrame(feature_array, columns=feature_labels)

In [15]:
dummies_skl.head()

Unnamed: 0,BORO_Brooklyn,BORO_Manhattan,BORO_Queens,BORO_Staten Island
0,1.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0
4,0.0,0.0,1.0,0.0


In [16]:
# Concatenate back to the original dataframe
df = pd.concat([df.reset_index(drop=True), dummies_skl.reset_index(drop=True)], axis=1)
print("\n--- Data with sklearn.OneHotEncoder ---")
df.head()


--- Data with sklearn.OneHotEncoder ---


Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE_DESCRIPTION,INSPECTION_DATE,ACTION,...,BIN,BBL,NTA,Location,is_A,CritiCal_num,BORO_Brooklyn,BORO_Manhattan,BORO_Queens,BORO_Staten Island
0,50141498,DMM BAKERY,Brooklyn,6802,BAY PARKWAY,11204.0,7183314372,Chinese,2025-06-16,No violations were recorded at the time of thi...,...,3135132.0,3055800000.0,BK28,POINT (-73.983252132334 40.612100428335),1,0,1.0,0.0,0.0,0.0
1,50098804,GYRO EXPRESS,Brooklyn,3160,CONEY ISLAND AVENUE,11235.0,7187699228,Middle Eastern,2025-10-14,Violations were cited in the following area(s).,...,3245015.0,3086780000.0,BK19,POINT (-73.959713615942 40.578835975614),0,0,1.0,0.0,0.0,0.0
2,50154039,AMAR BARI,Brooklyn,1075,LIBERTY AVENUE,11208.0,9172507577,Bangladeshi,2025-08-27,Violations were cited in the following area(s).,...,3093593.0,3041710000.0,BK83,POINT (-73.869546373377 40.678346807646),0,0,1.0,0.0,0.0,0.0
3,50124438,TIPSY SHANGHAI,Manhattan,594,3 AVENUE,10016.0,2124666488,Chinese,2025-06-27,Establishment re-opened by DOHMH.,...,1019147.0,1008940000.0,MN20,POINT (-73.975981402783 40.748698335692),0,0,0.0,1.0,0.0,0.0
4,41395494,CITI FIELD SUITE KITCHEN,Queens,0,126TH ST & ROOSEVELT AVENUE,,7185958100,American,2025-09-13,Violations were cited in the following area(s).,...,,4.0,,,1,0,0.0,0.0,1.0,0.0


## Step 4 – Split Data (70/30, Stratified by Target)

In [26]:
X = df[['SCORE', 'CritiCal_num', 'BORO_Manhattan', 'BORO_Queens', 'BORO_Staten Island', 'BORO_Brooklyn']]
y = df['is_A']
# sm.add_constant prepends a 'const' column, so 'const' is column 0 and SCORE is column 1.
# LogisticRegression also fits its own intercept, which makes this column redundant.
X = sm.add_constant(X)
# SCORE is left unscaled, so its odds ratio is per one inspection point
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

## Step 5 – Preprocessing (You can choose to do this in a Pipeline)  

- Scale continuous features  
- Pass binary as is  
- One-hot encode categorical feature (`BORO`)  

### `BORO` was one-hot encoded in Step 3 and `CritiCal_num` is passed as is; `SCORE` enters the model unscaled, so its odds ratio is per inspection point

## Step 6 – Fit Model & Evaluate Accuracy

- Fit `is_A ~ SCORE + CritiCal_num + BORO` using **LogisticRegression**  
- Compute predictions with `.predict()`  
- Evaluate accuracy with `accuracy_score()`

In [27]:
# Initialize and fit the model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Make predictions on the test set (.predict() already returns 0/1 class labels)
y_pred = log_reg.predict(X_test)
# Print a simple accuracy score (we will learn better metrics later)
print(f"Model Accuracy: {accuracy_score(y_test, y_pred):.2f}")

Model Accuracy: 0.97
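
An accuracy figure is easier to judge next to a naive baseline that always predicts the majority class. A minimal sketch with made-up labels that roughly mirror the class balance from `is_A.value_counts()` (the arrays below are hypothetical, not the notebook's test split):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 53 Grade A rows and 47 non-A rows
y_true = np.array([1] * 53 + [0] * 47)

# A "model" that always predicts the majority class (Grade A = 1)
y_majority = np.ones_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

A fitted model is only informative to the extent that its accuracy exceeds this kind of baseline.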


In [39]:
log_reg.coef_[0][2]

np.float64(0.4309017704947399)

## Step 7 – Extract Coefficients and Convert to Odds Ratios


In [None]:
# X has columns ['const', 'SCORE', 'CritiCal_num', 'BORO_Manhattan', 'BORO_Queens',
# 'BORO_Staten Island', 'BORO_Brooklyn'], so coef_[0][0] belongs to 'const', not 'SCORE'
coef_const = log_reg.coef_[0][0]
intercept = log_reg.intercept_[0]

print(f"Intercept (Log-Odds): {intercept:.4f}")
print(f"Coefficient for 'const' (Log-Odds): {coef_const:.4f}")
# --- Convert to Odds Ratios ---
baseline_odds = np.exp(intercept)

odds_ratio_const = np.exp(coef_const)
odds_ratio_score = np.exp(log_reg.coef_[0][1])
odds_ratio_CritiCal = np.exp(log_reg.coef_[0][2])
odds_ratio_Boro_Manhattan = np.exp(log_reg.coef_[0][3])
odds_ratio_Boro_Queens = np.exp(log_reg.coef_[0][4])
odds_ratio_Boro_Staten_Island = np.exp(log_reg.coef_[0][5])
# coef_[0][6] ('BORO_Brooklyn') is not extracted in this cell

print(f"\nBaseline Odds: {baseline_odds:.4f}")

print(f"Odds Ratio for 'const': {odds_ratio_const:.4f}")
print(f"Odds Ratio for 'SCORE': {odds_ratio_score:.4f}")
print(f"Odds Ratio for 'CritiCal_num': {odds_ratio_CritiCal:.4f}")
print(f"Odds Ratio for 'BORO_Manhattan': {odds_ratio_Boro_Manhattan:.4f}")
print(f"Odds Ratio for 'BORO_Queens': {odds_ratio_Boro_Queens:.4f}")
print(f"Odds Ratio for 'BORO_Staten Island': {odds_ratio_Boro_Staten_Island:.4f}")

Intercept (Log-Odds): 8.0214
Coefficient for 'const' (Log-Odds): -0.0102

Baseline Odds: 3045.3660
Odds Ratio for 'const': 0.9899
Odds Ratio for 'SCORE': 0.6042
Odds Ratio for 'CritiCal_num': 1.5386
Odds Ratio for 'BORO_Manhattan': 1.1991
Odds Ratio for 'BORO_Queens': 1.0772
Odds Ratio for 'BORO_Staten Island': 1.0267
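
Positional indexing into `coef_` depends on the column order of `X`, and a constant column prepended by `sm.add_constant` sits at index 0. A minimal sketch, on synthetic data, of labelling odds ratios by column name so the order cannot be misread. The names `X_demo`, `y_demo`, and `clf` are hypothetical, the constant column is added with `DataFrame.insert` to mimic the prepended `const`, and `max_iter=1000` is an illustrative choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic predictors (values are made up)
rng = np.random.default_rng(1)
n = 400
X_demo = pd.DataFrame({
    "SCORE": rng.integers(0, 40, size=n).astype(float),
    "CritiCal_num": rng.integers(0, 2, size=n).astype(float),
})

# Mimic a prepended constant column, which shifts every other column by one
X_demo.insert(0, "const", 1.0)

# Synthetic outcome that depends on SCORE only
logit = 3.0 - 0.25 * X_demo["SCORE"]
y_demo = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Pair each coefficient with its column name before exponentiating
odds_ratios = pd.Series(np.exp(clf.coef_[0]), index=X_demo.columns, name="odds_ratio")
print(odds_ratios.round(4))
```

Looking up `odds_ratios["SCORE"]` then returns the right value regardless of where the column sits in `X`.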


In [38]:
df['CRITICAL_FLAG'].value_counts()


CRITICAL_FLAG
Critical          11679
Not Critical      10787
Not Applicable       88
Name: count, dtype: int64
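
The counts above show a third level, `Not Applicable`, which the `== 'Critical'` comparison folds into 0 together with `Not Critical`. A minimal sketch, on a toy Series, comparing that approach with an explicit mapping that leaves unmapped levels as missing (the names `flags`, `as_binary`, and `mapped` are hypothetical):

```python
import pandas as pd

# Toy flags covering all three levels seen in the data
flags = pd.Series(
    ["Critical", "Not Critical", "Not Applicable", "Critical"], name="CRITICAL_FLAG"
)

# Notebook approach: everything that is not "Critical" becomes 0
as_binary = (flags == "Critical").astype(int)

# Alternative: explicit mapping; levels without a mapping become NaN
mapped = flags.map({"Critical": 1, "Not Critical": 0})

comparison = pd.DataFrame({"flag": flags, "as_binary": as_binary, "mapped": mapped})
print(comparison)
```

Whether to keep `Not Applicable` rows as 0 or drop them is a judgment call; either way it is worth stating in the write-up.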

## Step 8 – Interpret Each Predictor 

**Remember**
💡 OR > 1 → increases odds of Grade A  
💡 OR < 1 → decreases odds of Grade A

**Type markdown interpreting all 3 predictors in plain english**

- Column order matters: `sm.add_constant` put `const` in column 0 of `X`, so `coef_[0][1]` is `SCORE`, `coef_[0][2]` is `CritiCal_num`, and `coef_[0][3]` through `coef_[0][5]` are Manhattan, Queens, and Staten Island. The odds ratios are read in that order below.
- Intercept: the baseline odds of a Grade A are 3045.3660 for a restaurant with a `SCORE` of 0, no critical flag, in the Bronx (the reference borough dropped by `drop='first'`). This describes one narrow reference case, so it is not very meaningful on its own.
- `SCORE` (continuous, OR 0.6042): each additional inspection point multiplies the odds of a Grade A by about 0.60, a decrease of about 39.6%, holding the other predictors constant.
- `CritiCal_num` (binary, OR 1.5386): at the same score and borough, rows with a critical flag have about 53.9% higher odds of a Grade A than rows without one.
- `BORO` (categorical, compared with the Bronx): the odds of a Grade A are about 19.9% higher in Manhattan (OR 1.1991), 7.7% higher in Queens (OR 1.0772), and 2.7% higher in Staten Island (OR 1.0267). The Brooklyn coefficient sits at `coef_[0][6]` and was not extracted.
- The value 0.9899 belongs to the redundant `const` column and has no substantive interpretation.
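
For a continuous predictor, the odds ratio describes a one-unit change, and changes of several units compound multiplicatively. A minimal sketch with a hypothetical per-point odds ratio of 0.9 (an assumed value for illustration, not one from this model):

```python
import numpy as np

# Hypothetical log-odds coefficient for a one-point increase (OR = 0.9)
beta_per_point = np.log(0.9)
points = 10

# Odds ratio for one point, and for a ten-point increase
or_one_point = np.exp(beta_per_point)
or_ten_points = np.exp(beta_per_point * points)

print(f"OR for +1 point:   {or_one_point:.4f}")
print(f"OR for +{points} points: {or_ten_points:.4f}")
```

The ten-point odds ratio equals the one-point odds ratio raised to the tenth power, which is why a per-unit odds ratio that looks close to 1 can still imply a large effect over the observed range of a predictor.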


# We Share – Reflection & Wrap-Up

Write **one short paragraph** (4–6 sentences). Be specific and use evidence from your notebook.

**Which predictor had the strongest relationship with getting an A grade?**  
Use the odds ratios and accuracy to support your answer.  

`SCORE` has the strongest relationship with getting an A grade. Reading the coefficients in the column order of `X` (where `const` is column 0), its odds ratio is 0.6042, so each additional inspection point lowers the odds of an A by about 40%; restaurants with lower scores, meaning fewer violation points, are far more likely to earn an A. The model's test accuracy of 0.97 is well above the roughly 53% share of A grades in the data, which is consistent with `SCORE` carrying most of the signal. With score and borough held constant, a critical flag is associated with about 54% higher odds of an A (OR 1.5386), so this effect should only be read conditional on the score. Borough differences are small by comparison: relative to the Bronx, the odds of an A are about 20% higher in Manhattan, 8% higher in Queens, and 3% higher in Staten Island. Because these are odds ratios, I would report them as changes in the odds of an A, not as changes in the probability.