# Fitting a Logistic Regression Model - Lab

## Introduction

In the last lesson, you were given a broad overview of logistic regression, including an introduction to two separate packages for creating logistic regression models. In this lab, you'll fit logistic regression models with `statsmodels`. For your first foray into logistic regression, you'll build a model that classifies whether an individual survived the [Titanic](https://www.kaggle.com/c/titanic/data) shipwreck (yes, it's a bit morbid).


## Objectives

In this lab you will: 

* Implement logistic regression with `statsmodels` 
* Interpret the statistical results associated with model parameters

## Import the data

Import the data stored in the file `'titanic.csv'` and print the first five rows of the DataFrame to check its contents. 

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Load the Titanic dataset
df = pd.read_csv("titanic.csv")

# Display the first five rows
df.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Define independent and target variables

Your target variable is in the column `'Survived'`. A `0` indicates that the passenger didn't survive the shipwreck. Print the total number of people who didn't survive the shipwreck. How many people survived?

In [2]:
# Total number of people who survived/didn't survive
num_survived = df['Survived'].sum()
num_not_survived = len(df) - num_survived

print(f"Number of people who survived: {num_survived}")
print(f"Number of people who did not survive: {num_not_survived}")


Number of people who survived: 342
Number of people who did not survive: 549


Only consider the columns specified in `relevant_columns` when building your model. The next step is to create dummy variables from categorical variables. Remember to drop the first level for each categorical column and make sure all the values are of type `float`: 

In [3]:
# Select the relevant columns (dummy variables are created later, when the model is fit)
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Fare', 'Sex', 'Embarked', 'Survived']
dummy_dataframe = df[relevant_columns]

dummy_dataframe.shape

(891, 7)
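
As a minimal sketch of the dummy-encoding step described above, the following uses a small made-up frame (the values are illustrative, not rows from `titanic.csv`). It assumes `pd.get_dummies` with `drop_first=True` is used to drop the first level of each categorical column, and then casts every column to `float`.

```python
import pandas as pd

# Toy data for illustration only; not rows from titanic.csv
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# drop_first=True removes the first level of each categorical column
# (here 'female' and 'C'), which then acts as the baseline category
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)

# Cast every column, including the dummies, to float
encoded = encoded.astype(float)

print(encoded.columns.tolist())
print(encoded.dtypes)
```

Dropping one level per categorical column matters once an intercept is added, because keeping every level would make the dummy columns sum to the intercept column.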

Did you notice above that the DataFrame contains missing values? To keep things simple, delete all rows with missing values. 

> NOTE: You can use the [`.dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method to do this. 

In [4]:
# Drop missing rows
dummy_dataframe = dummy_dataframe.dropna()
dummy_dataframe.shape

(712, 7)
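
In the lab data, the shape went from `(891, 7)` to `(712, 7)`, so 179 rows were removed. Before dropping rows, it can help to see which columns the missing values come from. The sketch below shows one way to check, using a small made-up frame (illustrative values only).

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Survived": [0, 1, 1, 1],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
cleaned = toy.dropna()
print(cleaned.shape)
```

Note that `.dropna()` returns a new DataFrame and keeps the original index labels, so the surviving rows here are still labeled 0 and 3.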

Finally, assign the independent variables to `X` and the target variable to `y`: 

In [5]:
# Split the data into X and y
y = dummy_dataframe['Survived']
X = dummy_dataframe.drop(columns=['Survived'])

## Fit the model

Now with everything in place, you can build a logistic regression model using `statsmodels` (make sure you create an intercept term as we showed in the previous lesson).  

> Warning: Did you receive an error of the form "LinAlgError: Singular matrix"? This means that `statsmodels` was unable to fit the model because a matrix it needed to invert was not full rank. In other words, some of the columns are redundant: typically, at least one column can be written as a linear combination of the others. Try removing some features from the model and running it again.
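
One common way to end up with a matrix that is not full rank is to keep every dummy level and also add an intercept. The sketch below illustrates this on a made-up column (illustrative values only) by comparing the number of columns with the matrix rank from `np.linalg.matrix_rank`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Keep every dummy level (no drop_first) and add an intercept column
all_levels = pd.get_dummies(toy, columns=["Sex"], dtype=float)
all_levels.insert(0, "const", 1.0)

# Sex_female + Sex_male equals const in every row, so the columns
# are linearly dependent and the matrix is not full rank
rank_all = np.linalg.matrix_rank(all_levels.to_numpy())

# Dropping the first level removes the redundancy
dropped = pd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=float)
dropped.insert(0, "const", 1.0)
rank_dropped = np.linalg.matrix_rank(dropped.to_numpy())

print(all_levels.shape[1], rank_all)    # 3 columns, rank 2
print(dropped.shape[1], rank_dropped)   # 2 columns, rank 2
```

When the rank is smaller than the number of columns, at least one feature is redundant and is a candidate for removal.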

In [11]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Step 1: Load dataset
df = pd.read_csv("titanic.csv")

# Step 2: Define relevant columns
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Fare', 'Sex', 'Embarked']

# Step 3: Select only the relevant columns
dummy_dataframe = df[relevant_columns]

# Step 4: Convert categorical variables into dummy variables
dummy_dataframe = pd.get_dummies(dummy_dataframe, columns=['Sex', 'Embarked'], drop_first=True)

# Step 5: Fill missing values (Replace NaNs)
dummy_dataframe.fillna(dummy_dataframe.median(), inplace=True)

# Step 6: Assign independent (X) and target variable (y)
y = df['Survived']
X = dummy_dataframe

# Step 7: Add intercept column
X = sm.add_constant(X)

# Step 8: Convert boolean columns to integers
X = X.astype({col: int for col in X.select_dtypes(include=['bool']).columns})

# Step 9: Convert everything to numeric (Final check)
X = X.apply(pd.to_numeric, errors='coerce')
y = pd.to_numeric(y, errors='coerce')

# Step 10: Ensure no NaNs or infinities in X
X.replace([np.inf, -np.inf], np.nan, inplace=True)
X.dropna(inplace=True)
y = y.loc[X.index]  # Ensure `y` aligns with `X`

# Step 11: Final Check Before Model Fitting
print("\nFinal data types in X (After Fix):")
print(X.dtypes)

print(f"\nFinal dataset shape - X: {X.shape}, y: {y.shape}")

if X.shape[0] == 0 or y.shape[0] == 0:
    raise ValueError("Error: X or y is empty after preprocessing. Check for excessive missing values.")

# Step 12: Fit logistic regression model
logit_model = sm.Logit(y, X)
result = logit_model.fit()

# Step 13: Display model summary
print("\nModel Successfully Fitted! Summary Below:")
print(result.summary())




Final data types in X (After Fix):
const         float64
Pclass          int64
Age           float64
SibSp           int64
Fare          float64
Sex_male        int32
Embarked_Q      int32
Embarked_S      int32
dtype: object

Final dataset shape - X: (891, 8), y: (891,)
Optimization terminated successfully.
         Current function value: 0.441032
         Iterations 6

Model Successfully Fitted! Summary Below:
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  891
Model:                          Logit   Df Residuals:                      883
Method:                           MLE   Df Model:                            7
Date:                Sun, 23 Feb 2025   Pseudo R-squ.:                  0.3377
Time:                        12:51:11   Log-Likelihood:                -392.96
converged:                       True   LL-Null:                       -593.33
Covariance Type:          
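
The `Pseudo R-squ.` value in the summary can be reproduced from the two log-likelihoods printed next to it. As a quick check, assuming the reported value is McFadden's pseudo R-squared (one minus the ratio of the model log-likelihood to the null log-likelihood):

```python
# Values copied from the summary output above
log_likelihood = -392.96   # Log-Likelihood of the fitted model
ll_null = -593.33          # LL-Null (intercept-only model)

# McFadden's pseudo R-squared
pseudo_r2 = 1 - log_likelihood / ll_null
print(round(pseudo_r2, 4))  # 0.3377
```

A value closer to 0 means the fitted model barely improves on the intercept-only model; unlike the R-squared from linear regression, it is not a proportion of variance explained.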

## Analyze results

Generate the summary table for your model. Then, comment on the p-values associated with the various features you chose.

In [None]:
# Summary table
print("\nModel Summary:")
print(result.summary())



# Your comments here
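- The coefficients in the summary table describe the relationship between each feature and survival, holding the other features fixed.
- A **positive coefficient** means that an increase in that feature increases the log-odds (and therefore the probability) of survival.
- A **negative coefficient** means that an increase in that feature decreases the log-odds (and therefore the probability) of survival.
- A **p-value** below a chosen significance level (commonly 0.05) suggests that the coefficient differs from zero; features with large p-values are candidates for removal.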
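
One common way to make the sign of a coefficient more concrete is to exponentiate it, which gives an odds ratio. The sketch below uses hypothetical coefficients and p-values (the feature names and numbers are made up for illustration, not taken from the fitted model). With a fitted `statsmodels` result, `result.params` and `result.pvalues` could be used in their place.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only; replace with
# result.params and result.pvalues from a fitted model
params = pd.Series({"feature_a": 0.7, "feature_b": -1.2, "feature_c": 0.05})
pvalues = pd.Series({"feature_a": 0.01, "feature_b": 0.001, "feature_c": 0.6})

report = pd.DataFrame({
    "coef": params,
    "odds_ratio": np.exp(params),   # > 1 raises the odds, < 1 lowers them
    "p_value": pvalues,
})

# 0.05 is a conventional cutoff, not a rule
report["significant"] = report["p_value"] < 0.05

print(report.round(3))
```

An odds ratio of 2 means a one-unit increase in the feature doubles the odds of the outcome, holding the other features fixed.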


## Level up (Optional)

Create a new model, this time only using those features you determined were influential based on your analysis of the results above. How does this model perform?

In [None]:
# Your code here


In [None]:
# Your comments here

## Summary 

Well done! In this lab, you practiced using `statsmodels` to build a logistic regression model. You then interpreted the results, building on the statistics knowledge you developed with linear regression. Continue on to take a look at building logistic regression models in Scikit-learn!