by Kiara Dorion
ECON611 - Computation for Economics - University of San Francisco - Fall 2024

This analysis examines the factors influencing San Francisco residents’ decisions to own a home. 
Using a binary logit model, we assess the impact of key predictors, estimate parameters with 
Maximum Likelihood Estimation (MLE), and evaluate the goodness-of-fit metrics to provide actionable insights.

Source: San Francisco City Survey Data (https://data.sfgov.org/City-Management-and-Ethics/San-Francisco-City-Survey-Data/nufj-bfbw/about_data)

Description: The dataset contains detailed demographic and perception data for San Francisco residents, collected through city surveys.

In [None]:
1.	dem_homeown: Binary dependent variable indicating homeownership status.
	•	Values: Own (1), Not Own (0) -- post-dummy variable conversion
2.	dem_hhsize: Number of people in the household (numeric).
3.	child_05: Number of children aged 0–5 in the household (numeric).
4.	child_617: Number of children aged 6–17 in the household (numeric).
5.	safe_day: Perceived neighborhood safety during the day (numeric, scale 1–5).
6.	safe_polqual: Perceived quality of neighborhood police services (numeric, scale 1–5).
7.	dem_gender: Gender of respondent.
	•	Values: Male (0), Female (1), Others (3) -- post-conversion from object to numeric

We begin by selecting relevant columns from the dataset, handling missing values, and creating features for model compatibility.

In [None]:
1.	Filter relevant variables: Include demographic and safety-related predictors that plausibly influence homeownership.

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import optimize

file_path = '/Users/kdorion/Documents/MSAE/2024/Fall_2024/ECON611/FINAL PROJECT/San_Francisco_City_Survey_Data.csv'
columns_to_use = ['dem_homeown', 'dem_hhsize', 'child_05', 'child_617', 
                  'safe_day', 'safe_polqual', 'dem_gender']
df = pd.read_csv(file_path, usecols=columns_to_use)

df = df.dropna()

print(df.info())
print(df.head())

In [None]:
<class 'pandas.core.frame.DataFrame'>
Index: 2297 entries, 0 to 2529
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   safe_day      2297 non-null   float64
 1   safe_polqual  2297 non-null   float64
 2   child_05      2297 non-null   float64
 3   child_617     2297 non-null   float64
 4   dem_homeown   2297 non-null   object 
 5   dem_hhsize    2297 non-null   float64
 6   dem_gender    2297 non-null   object 
dtypes: float64(5), object(2)
memory usage: 143.6+ KB
None
   safe_day  safe_polqual  child_05  child_617 dem_homeown  dem_hhsize  \
0       4.0           3.0       0.0        0.0         Own         1.0   
1       5.0           3.0       1.0        0.0        Rent         3.0   
2       5.0           2.0       0.0        0.0        Rent         4.0   
3       5.0           5.0       0.0        0.0        Rent         1.0   
4       4.0           4.0       0.0        0.0        Rent         3.0   

  dem_gender  
0       Male  
1       Male  
2       Male  
3       Male  
4     Female

In [None]:
/var/folders/6j/pmgy91ns295gxlfrv0dz53bmt6tjzs/T/ipykernel_31823/55862392.py:10: DtypeWarning: Columns (91) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(file_path, usecols=columns_to_use)

In [None]:
1.	Binary Encoding: Convert the dependent variable dem_homeown into binary (1 = Own, 0 = Not Own).
2.	Feature Transformation: Map dem_gender into numeric categories (Male = 0, Female = 1, Others = 3).

In [None]:
#verify unique values in 'dem_homeown' and 'dem_gender' (the two variables with OBJECT datatypes)
print('Unique values in dem_homeown before converting:', df['dem_homeown'].unique())
print('Unique values in dem_gender before converting:', df['dem_gender'].unique())

# Convert 'dem_homeown' to a binary indicator (True = Own; 'Rent' and 'Other' become False)
df['dem_homeown'] = pd.get_dummies(df['dem_homeown'])['Own']

# Function to map 'dem_gender' values
def map_gender(value):
    if value == 'Male':
        return 0.0
    elif value == 'Female':
        return 1.0
    else:
        return 3.0

# Apply the function to transform 'dem_gender'
df['dem_gender'] = df['dem_gender'].apply(map_gender)

# Verify the transformation
print('Unique values in dem_homeown after converting:', df['dem_homeown'].unique())
print('Unique values in dem_gender after converting:', df['dem_gender'].unique())

In [None]:
Unique values in dem_homeown before converting: ['Own' 'Rent' 'Other']
Unique values in dem_gender before converting: ['Male' 'Female' 'Genderqueer/Non-Binary' 'Not listed' 'Trans Female'
 'Trans Male']
Unique values in dem_homeown after converting: [0. 1. 3.]
Unique values in dem_gender after converting: [0. 1. 3.]
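The two encoding steps can also be written without a row-wise function. The sketch below is a hedged alternative, not the notebook's own method: it runs on a small hypothetical frame (`toy`) built from the raw labels printed above, and the dictionary name `gender_codes` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical rows using the raw labels shown in the output above
toy = pd.DataFrame({
    'dem_homeown': ['Own', 'Rent', 'Other', 'Own'],
    'dem_gender': ['Male', 'Female', 'Trans Male', 'Not listed'],
})

# 1.0 where the respondent owns, 0.0 for every other label (Rent, Other)
toy['dem_homeown'] = toy['dem_homeown'].eq('Own').astype(float)

# Male -> 0, Female -> 1; labels missing from the dict map to NaN, then to 3
gender_codes = {'Male': 0.0, 'Female': 1.0}
toy['dem_gender'] = toy['dem_gender'].map(gender_codes).fillna(3.0)

print(toy)
```

Because `Series.map` returns NaN for labels that are not in the dictionary, this variant should only be applied after missing values have been dropped, as they are earlier in the notebook; otherwise genuinely missing genders would also be coded as 3.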

Standardization scales numeric predictors to ensure comparability of coefficients and facilitates the optimization process during MLE.

In [None]:
# Define a standardization function
def standardize_numeric_features(df, column):
    var_mean = df[column].mean()
    var_std = df[column].std()
    df[column] = round((df[column] - var_mean) / var_std, 6)
    return df[column]

numeric_columns = ['dem_hhsize', 'child_05', 'child_617', 'safe_day', 'safe_polqual', 'dem_gender']
for col in numeric_columns:
    df[col] = standardize_numeric_features(df, col)

df['intercept'] = 1

print(df.head())

In [None]:
   safe_day  safe_polqual  child_05  child_617  dem_homeown  dem_hhsize  \
0  0.317226     -0.103398 -0.408056  -0.549566         True   -0.086782   
1  1.209107     -0.103398  2.449579  -0.549566        False   -0.031302   
2  1.209107     -1.034794 -0.408056  -0.549566        False   -0.003563   
3  1.209107      1.759393 -0.408056  -0.549566        False   -0.086782   
4  0.317226      0.827998 -0.408056  -0.549566        False   -0.031302   

   dem_gender  intercept  
0   -0.915557          1  
1   -0.915557          1  
2   -0.915557          1  
3   -0.915557          1  
4    0.811072          1
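As a quick sanity check on the standardization step, a z-scored column should have a mean of zero and a standard deviation of one. The sketch below uses a small hypothetical column (`toy`), not the survey data, and relies on the fact that pandas' `.std()` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical household sizes, for illustration only
toy = pd.DataFrame({'dem_hhsize': [1.0, 3.0, 4.0, 1.0, 3.0]})

# Same transformation as standardize_numeric_features, without the rounding
z = (toy['dem_hhsize'] - toy['dem_hhsize'].mean()) / toy['dem_hhsize'].std()

# Mean should be ~0 and the sample standard deviation ~1
print(z.mean(), z.std())
```

After this transformation, a coefficient is read as the change in log-odds for a one-standard-deviation change in the predictor rather than a one-unit change.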

Using the statsmodels library, we fit a binary logit model to estimate the relationship between predictors and the likelihood of homeownership.

Mathematics:
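The logit model expresses the log-odds of the dependent variable  Y  as:

$$\ln\left(\frac{P(Y=1|X)}{1 - P(Y=1|X)}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_kX_k$$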

In [None]:
# Fit the binary logistic regression model
logit_model = sm.Logit(df['dem_homeown'], df[['intercept'] + numeric_columns]).fit()

print(logit_model.summary())

In [None]:
Optimization terminated successfully.
         Current function value: 0.681941
         Iterations 4
                           Logit Regression Results                           
==============================================================================
Dep. Variable:            dem_homeown   No. Observations:                 2297
Model:                          Logit   Df Residuals:                     2290
Method:                           MLE   Df Model:                            6
Date:                Fri, 13 Dec 2024   Pseudo R-squ.:                0.005247
Time:                        22:18:43   Log-Likelihood:                -1566.4
converged:                       True   LL-Null:                       -1574.7
Covariance Type:            nonrobust   LLR p-value:                   0.01120
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
intercept       -0.2490      0.042     -5.900      0.000      -0.332      -0.166
dem_hhsize       0.0465      0.046      1.015      0.310      -0.043       0.136
child_05         0.0208      0.043      0.479      0.632      -0.064       0.106
child_617        0.1311      0.043      3.015      0.003       0.046       0.216
safe_day        -0.0666      0.045     -1.495      0.135      -0.154       0.021
safe_polqual     0.0659      0.045      1.479      0.139      -0.021       0.153
dem_gender      -0.0226      0.042     -0.533      0.594      -0.106       0.061
================================================================================
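To make the log-odds scale concrete, a coefficient can be exponentiated into an odds ratio, and the intercept can be passed through the sigmoid to get the predicted probability for a respondent at the sample mean of every (standardized) predictor. The sketch below only re-uses two numbers copied from the summary table above; the variable names are introduced here for illustration.

```python
import math

# Values copied from the summary table above
coef_child_617 = 0.1311
intercept = -0.2490

# Multiplicative change in the odds of owning for a one-standard-deviation
# increase in child_617, holding the other predictors fixed
odds_ratio = math.exp(coef_child_617)

# Predicted ownership probability when every standardized predictor is 0
baseline_prob = 1.0 / (1.0 + math.exp(-intercept))

print(round(odds_ratio, 3), round(baseline_prob, 3))
```

Under these numbers the odds ratio is about 1.14 and the baseline probability is about 0.44.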

To complement the statsmodels fit, we estimate the same model by hand with MLE, finding the parameters
($\beta$) that maximize the likelihood of the observed data.

Sigmoid (the probability of choosing one alternative over another) Function:

“Own” = 1, “Not Own” = 0. The probability of the outcome depends on the predictors (the independent variables, e.g. gender, 'dem_gender')
and the coefficients  $\beta$, and is modeled by the sigmoid function:

$$P(y=1|X) = \frac{\exp(X\beta)}{1 + \exp(X\beta)}$$

Where:
$$X\beta = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_kX_k$$  is the representative utility (weighted sum of the predictors)
$\exp$  is the exponential function; dividing $\exp(X\beta)$ by $1 + \exp(X\beta)$ ensures probabilities  P(y=1|X)  are between 0 and 1

Log-Likelihood Function:

In [None]:
•	If someone owns a home ( y=1 ), the likelihood is just  P(y=1|X) .
•	If someone doesn’t own a home ( y=0 ), the likelihood is  1 - P(y=1|X) .

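For all observations combined, the total likelihood is the product of the individual likelihoods:

$$L(\beta) = \prod_{i=1}^{N} P(y_i=1|X_i)^{y_i} \, [1 - P(y_i=1|X_i)]^{1-y_i}$$

Taking logs turns the product into a sum, which gives the log-likelihood:

$$\ln(L) = \sum_{i=1}^{N} \left[ y_i \ln P(y_i=1|X_i) + (1 - y_i) \ln(1 - P(y_i=1|X_i)) \right]$$

The log-likelihood is maximized by taking the derivative of  $\ln(L)$  with respect to each  $\beta$ .
Finally, we solve the first-order necessary conditions (FONC), where the derivatives are equal to zero. This finds the values of  $\beta$  that give the maximum log-likelihood.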

These coefficients tell us how each predictor shifts the log-odds of homeownership.

In [None]:
def sigmoid(data, beta):
    """
    for converting representative utility into logit probabilities.
    """
    Xb = np.dot(data, beta)
    eXb = np.exp(Xb)
    probs = eXb/ (1 + eXb)
    return probs
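One caveat with the formulation above: `np.exp` overflows in float64 once `Xb` exceeds roughly 709, which turns the ratio into `nan`. The standardized predictors used here keep `Xb` small, so this does not affect the notebook's results, but a more defensive variant can delegate to `scipy.special.expit`. The function name `sigmoid_stable` and the demo arrays are assumptions made for this sketch.

```python
import numpy as np
from scipy.special import expit

def sigmoid_stable(data, beta):
    """Logit probabilities via scipy's expit, which avoids overflow in exp."""
    return expit(np.dot(data, beta))

# Hypothetical design matrix with deliberately extreme utilities (0, 800, -800)
X_demo = np.array([[1.0, 0.0], [1.0, 800.0], [1.0, -800.0]])
beta_demo = np.array([0.0, 1.0])

probs_demo = sigmoid_stable(X_demo, beta_demo)
print(probs_demo)
```

The probabilities stay finite at both extremes, whereas `eXb / (1 + eXb)` would evaluate `inf / inf` for the second row.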

In [None]:
def LL_logit(params, *args):
    y, X = args
    beta = np.array(params)

    # Compute probabilities
    probs = sigmoid(X, beta)
    
    # Log-likelihood computation
    ll = y * np.log(probs) + (1 - y) * np.log(1 - probs)
    
    return -np.sum(ll)
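The first-order conditions can be checked numerically. For the logit model the derivative of the log-likelihood with respect to $\beta$ is $X'(y - p)$, where $p$ is the vector of fitted probabilities, so this vector should be approximately zero at the optimum. The sketch below is a minimal illustration on simulated data, not the survey data: `X_sim`, `y_sim`, `beta_true`, `neg_ll`, and `neg_score` are all hypothetical names introduced here.

```python
import numpy as np
from scipy import optimize
from scipy.special import expit

# Simulated data: intercept plus two standard-normal predictors
rng = np.random.default_rng(0)
n = 500
X_sim = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.25, 0.5, -0.3])
y_sim = (rng.random(n) < expit(X_sim @ beta_true)).astype(float)

def neg_ll(beta, y, X):
    """Negative log-likelihood of the binary logit model."""
    p = expit(X @ beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def neg_score(beta, y, X):
    """Gradient of the negative log-likelihood: -X'(y - p)."""
    return -X.T @ (y - expit(X @ beta))

res = optimize.minimize(neg_ll, np.zeros(3), args=(y_sim, X_sim),
                        jac=neg_score, method='BFGS')

# FONC: every component of the gradient should be close to zero
score_at_optimum = neg_score(res.x, y_sim, X_sim)
print(res.x, np.abs(score_at_optimum).max())
```

Passing the analytic gradient through `jac` is optional; without it, BFGS approximates the derivatives by finite differences, which is what happens when only the objective is supplied.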

In [None]:
y = df['dem_homeown'].to_numpy()  # Dependent variable
X = df[['intercept'] + numeric_columns].to_numpy()  # Independent variables

# Initial values for the parameters (all zeros, not random)
start_values = np.zeros(X.shape[1])

In [None]:
# Perform MLE optimization
result = optimize.minimize(
    LL_logit, 
    x0=start_values, 
    args=(y, X), 
    method='BFGS'
)

estimated_params = result['x']
print('Estimated Parameters:\n', estimated_params)

In [None]:
Estimated Parameters:
 [-0.24903026  0.04648287  0.0207879   0.13107236 -0.06661513  0.06588379
 -0.02261164]

In [None]:
# Compute the log-likelihood for the estimated parameters
final_ll = -LL_logit(result['x'], y, X)
print('Final Log-Likelihood:', final_ll)

In [None]:
Final Log-Likelihood: -1566.419519824517
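The final log-likelihood matches the Log-Likelihood reported in the statsmodels summary, and the summary's goodness-of-fit figures follow from it. The sketch below recomputes McFadden's pseudo R-squared and the likelihood-ratio test from the rounded values printed in that summary, so small differences from the reported 0.005247 and 0.01120 are expected. The helper `chi2_sf_even_df` is a hypothetical name; it uses the closed-form chi-square tail probability that holds only for an even number of degrees of freedom.

```python
import math

# Rounded values copied from the statsmodels summary
ll_model = -1566.4   # Log-Likelihood
ll_null = -1574.7    # LL-Null (intercept-only model)
df_model = 6         # number of predictors excluding the intercept

# McFadden's pseudo R-squared
pseudo_r2 = 1.0 - ll_model / ll_null

# Likelihood-ratio statistic, chi-square with df_model degrees of freedom
lr_stat = 2.0 * (ll_model - ll_null)

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; valid only when df is even."""
    half = x / 2.0
    terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms

p_value = chi2_sf_even_df(lr_stat, df_model)
print(pseudo_r2, lr_stat, p_value)
```

A pseudo R-squared this close to zero indicates that the predictors improve very little on the intercept-only model, even though the likelihood-ratio test rejects it at the 5% level.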

In [None]:
•	Statsmodels Coef: Coefficients estimated using the logit model, representing the log-odds effect of each predictor on homeownership.
•	MLE Coef: Parameters derived from maximizing the likelihood function, validating the statsmodels estimates.

In [None]:
#Analyze coefficients from statsmodels and MLE
MLE_results = pd.DataFrame({
    'Variable': ['intercept'] + numeric_columns,
    'Statsmodels Coef': logit_model.params.values,  # Use params from the fitted statsmodels logit model
    'MLE Coef': estimated_params,  # Use the MLE estimated parameters
})

print(MLE_results)

In [None]:
       Variable  Statsmodels Coef  MLE Coef
0     intercept         -0.249030 -0.249030
1    dem_hhsize          0.046483  0.046483
2      child_05          0.020788  0.020788
3     child_617          0.131072  0.131072
4      safe_day         -0.066615 -0.066615
5  safe_polqual          0.065884  0.065884
6    dem_gender         -0.022612 -0.022612
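The comparison above validates the point estimates; the standard errors in the statsmodels summary can be checked in a similar spirit. For a logit model they come from the inverse of the information matrix $X'WX$, where $W$ is diagonal with entries $p_i(1 - p_i)$. The sketch below is a hedged illustration: `logit_standard_errors` is a hypothetical helper, and the demo uses an intercept-only design with the notebook's sample size rather than the full design matrix.

```python
import numpy as np

def logit_standard_errors(beta, X):
    """Standard errors from the inverse of the logit information matrix X'WX."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
    w = p * (1.0 - p)                        # Bernoulli variance per row
    information = X.T @ (X * w[:, None])     # negative Hessian of ln(L)
    return np.sqrt(np.diag(np.linalg.inv(information)))

# Hypothetical check: intercept-only design, 2297 rows, estimated intercept
X_demo = np.ones((2297, 1))
beta_demo = np.array([-0.2490])

se_demo = logit_standard_errors(beta_demo, X_demo)
print(se_demo)
```

The result is close to the 0.042 reported for the intercept, which is plausible because standardized predictors are centered and therefore nearly orthogonal to the constant. In the notebook itself the same function could be called with `estimated_params` and `X`.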

The binary logit model offers limited evidence on the factors influencing homeownership in San Francisco. The number of school-aged children (child_617) is the only predictor that is statistically significant at the 5% level (coefficient 0.131, p = 0.003): a one-standard-deviation increase is associated with higher log-odds of owning. Perceived police quality (safe_polqual, 0.066) and household size (dem_hhsize, 0.046) have positive coefficients, while perceived daytime safety (safe_day, -0.067) and the gender code (dem_gender, -0.023) have negative ones, but none of these is distinguishable from zero (all p-values exceed 0.1). The number of children aged 0–5 (child_05, 0.021) likewise shows no detectable effect. Overall fit is weak: the pseudo R-squared is 0.005, although the likelihood-ratio test rejects the intercept-only model (LLR p-value 0.011).

In [None]:
import matplotlib.pyplot as plt

coefficients = logit_model.params.values  # fitted statsmodels model from above
variables = ['Intercept'] + numeric_columns

plt.barh(variables, coefficients, color='skyblue', edgecolor='black')
plt.title('Coefficient Effects and Directions')
plt.xlabel('Coefficient Value (Log-Odds)')
plt.ylabel('Predictor')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

In [None]:
•	Redder regions indicate higher (more positive) log-odds, meaning the predicted
    likelihood of owning a home is higher at those predictor values.
•	Bluer regions indicate lower (more negative) log-odds, meaning the predicted
    likelihood of homeownership is lower at those predictor values.

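The heatmap visualizes how the predicted log-odds change across the range of each predictor. The gradient is steepest for child_617, the only statistically significant predictor; the gradients for the safety and other household variables are shallow and not distinguishable from zero.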

In [None]:
data = []
x_values = np.linspace(df[numeric_columns[0]].min(), df[numeric_columns[0]].max(), 100)

for column in numeric_columns:
    m = logit_model.params[column]
    b = logit_model.params['intercept']
    y_values = m * x_values + b
    data.append(y_values)

data = np.array(data)

plt.figure(figsize=(10, 6))
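# origin='lower' puts the first predictor in the bottom row, matching the tick order
plt.imshow(data, aspect='auto', cmap='coolwarm', origin='lower',
           extent=(x_values.min(), x_values.max(), 0, len(numeric_columns)))
plt.colorbar(label='Log-Odds')
plt.title('Heatmap of y = mx + b for Predictors', fontsize=14)
plt.xlabel('Predictor Values', fontsize=12)
plt.ylabel('Predictors', fontsize=12)
# Center each label on its row (row i spans [i, i + 1] on the y-axis)
plt.yticks(np.arange(len(numeric_columns)) + 0.5, numeric_columns)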
plt.tight_layout()
plt.show()

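Household Size (dem_hhsize): Larger households are slightly more likely to own a home, but the effect is not statistically significant (coefficient 0.046, p = 0.310).

Children (0–5) (child_05): The coefficient is positive but negligible and not significant (0.021, p = 0.632).

Children (6–17)	(child_617): Households with more school-aged children are more likely to own; this is the only statistically significant predictor (0.131, p = 0.003).

Safety (Day) (safe_day): The coefficient is negative (-0.067, p = 0.135), so feeling safer during the day is, if anything, associated with a lower likelihood of ownership; the effect is not significant.

Police Quality (safe_polqual): Better perceptions of police quality are associated with a higher likelihood of ownership (0.066, p = 0.139), but the effect is not significant.

Gender (dem_gender): The coefficient is negative and not significant (-0.023, p = 0.594). With the coding Male = 0, Female = 1, Others = 3, the model provides no evidence that women are more likely than men to own a home.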

Concluding Remarks

This binary logit analysis examines how demographic and safety-related factors relate to homeownership in San Francisco. With these predictors, only the number of school-aged children is statistically significant, and the model explains little of the variation in ownership (pseudo R-squared of 0.005). Stakeholders who want to design interventions that address barriers to ownership and support equitable housing opportunities will therefore need richer evidence than this model provides. Future studies can expand on these findings by incorporating additional variables, longitudinal data, or exploring multinomial choices (e.g., renting vs. owning different types of housing).