
This notebook briefly covers three topics that come up during data cleaning and model preparation in Python:

1. **Selecting Unique Elements**: Using Python's `set()` function to extract distinct values from datasets, helping to eliminate duplicates.

2. **Degrees of Freedom in Logistic Regression**: Understanding how to calculate degrees of freedom, which is important for evaluating model complexity.

3. **Multicollinearity**: Detecting multicollinearity among independent variables using the Variance Inflation Factor (VIF), which helps assess the reliability of regression coefficients.

Two related tools are also worth knowing. **Dropping Unnecessary Columns**: The pandas `drop()` method removes irrelevant or redundant features from a DataFrame, streamlining analysis. Similarly, SQL's `DISTINCT` keyword filters duplicate records out of a query result. SQL also handles preprocessing tasks such as filtering, aggregating, and joining datasets, which makes it a useful companion to Python for data manipulation. Together, these methods support the data cleaning and preparation process and lead to more accurate analyses.
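
As a minimal sketch of the `drop()` idea, the cell below removes one column and then removes duplicate rows with `drop_duplicates()`, which plays roughly the role that `DISTINCT` plays in SQL. The DataFrame and its column names (`customer_id`, `credit_score`, `notes`) are made up for illustration and are not part of this notebook's data.

```python
import pandas as pd

# Hypothetical example data; column names are illustrative only
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "credit_score": [700, 720, 720, 680],
    "notes": ["a", "b", "b", "c"],
})

# Remove a column that is not needed for the analysis
trimmed = customers.drop(columns=["notes"])

# Remove fully duplicated rows (roughly what SELECT DISTINCT does in SQL)
deduped = trimmed.drop_duplicates()

print(trimmed.columns.tolist())
print(len(deduped))
```

Both calls return new DataFrames and leave `customers` unchanged, so the original data stays available for comparison.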

### 1. Selecting Unique Elements

In Python, you can use the `set()` function to select unique elements from a list or any iterable. This is particularly useful when processing datasets to ensure that you are working with distinct values.

In [1]:
# Sample data: a list of credit scores
credit_scores = [700, 720, 700, 680, 720, 750, 680]


In [3]:
# Using set() to get unique credit scores
unique_credit_scores = set(credit_scores)
print("Unique Credit Scores:", unique_credit_scores)

Unique Credit Scores: {720, 700, 680, 750}
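
A set has no defined order, which is why the printed values above do not follow the order of the original list. If the order matters, two common options are sorting the set or using `dict.fromkeys()` to keep the first occurrence of each value. The sketch below reuses the same list; the variable names `sorted_unique` and `first_seen_unique` are introduced here for illustration.

```python
credit_scores = [700, 720, 700, 680, 720, 750, 680]

# Unique values in ascending order
sorted_unique = sorted(set(credit_scores))

# Unique values in order of first appearance (dict keys keep insertion order)
first_seen_unique = list(dict.fromkeys(credit_scores))

print(sorted_unique)      # [680, 700, 720, 750]
print(first_seen_unique)  # [700, 720, 680, 750]
```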




### 2. Degrees of Freedom in Logistic Regression

The residual degrees of freedom can be calculated as $n - p - 1$, where $n$ is the number of observations and $p$ is the number of predictors; the extra 1 accounts for the intercept. Equivalently, this is $n$ minus the total number of estimated parameters, which is how the code below computes it. Degrees of freedom help evaluate a model's complexity relative to the data: every additional estimated parameter lowers the residual degrees of freedom, and a low value may suggest overfitting or insufficient data to support the model's parameters.
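
The arithmetic can be sketched in plain Python before fitting anything. The helper `residual_df` below is a hypothetical name used only for this illustration, and it assumes that $p$ counts predictors without the intercept. With 60 observations and one predictor, as in the model that follows, it gives the same value that statsmodels reports as `Df Residuals`.

```python
def residual_df(n_observations, n_predictors, intercept=True):
    """Residual degrees of freedom: n - p - 1 (or n - p without an intercept)."""
    return n_observations - n_predictors - (1 if intercept else 0)

# 60 observations, 1 predictor (credit_score), plus an intercept
print(residual_df(60, 1))  # 58
```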


In [8]:
import pandas as pd
import statsmodels.api as sm

# Expanded sample dataset with more observations
data = {
    'credit_score': [
        700, 720, 680, 750, 690, 710, 740, 730, 675, 665,
        780, 800, 690, 720, 710, 680, 740, 760, 690, 700,
        720, 710, 680, 750, 690, 720, 740, 730, 675, 685,
        700, 720, 680, 750, 690, 710, 740, 730, 675, 665,
        780, 800, 690, 720, 710, 680, 740, 760, 690, 700,
        720, 710, 680, 750, 690, 720, 740, 730, 675, 685
    ],
    'default': [
        0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
        0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
        0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
        0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
        0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
        0, 1, 0, 0, 1, 0, 0, 0, 1, 0
    ]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Adding a constant for the intercept
X = sm.add_constant(df['credit_score'])
y = df['default']

# Fitting the logistic regression model
model = sm.Logit(y, X).fit()

# Print the model summary
print("\nModel Summary:")
print(model.summary())

# Degrees of freedom
n_observations = len(df)
n_parameters = model.params.shape[0]  # Includes intercept
degrees_of_freedom = n_observations - n_parameters

print("\nDegrees of Freedom:", degrees_of_freedom)


Optimization terminated successfully.
         Current function value: 0.536857
         Iterations 6

Model Summary:
                           Logit Regression Results                           
Dep. Variable:                default   No. Observations:                   60
Model:                          Logit   Df Residuals:                       58
Method:                           MLE   Df Model:                            1
Date:                Fri, 30 May 2025   Pseudo R-squ.:                  0.1212
Time:                        22:23:01   Log-Likelihood:                -32.211
converged:                       True   LL-Null:                       -36.652
Covariance Type:            nonrobust   LLR p-value:                  0.002882
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const           21.4682      8.549      2.511      0.012       4.713      38.223
credit_

### 3. Multicollinearity

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated. This inflates the standard errors of the affected coefficients, which makes the estimates unstable and hard to interpret. You can detect multicollinearity using the Variance Inflation Factor (VIF).
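
For reference, the VIF of predictor $j$ is usually defined from the $R^2$ obtained when that predictor is regressed on all the other predictors. The symbol $R_j^2$ is introduced here and does not appear elsewhere in this notebook:

$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2} $$

A value of 1 means predictor $j$ is uncorrelated with the others. Common rules of thumb treat values above roughly 5 or 10 as a sign of problematic collinearity, though these cutoffs are conventions rather than strict tests.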


In [5]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Sample dataset with multiple features
data_multicollinearity = {
    'credit_score': [700, 720, 680, 750, 690, 710, 740],
    'income': [50000, 60000, 55000, 70000, 52000, 58000, 62000],
    'debt': [20000, 25000, 22000, 30000, 21000, 23000, 24000]
}

df_multicollinearity = pd.DataFrame(data_multicollinearity)

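# Calculating VIF for each feature.
# Note: no intercept column is added here, so each auxiliary regression is
# fitted without a constant and the resulting VIFs can be heavily inflated.
X_multicollinearity = df_multicollinearity
vif_data = pd.DataFrame()
vif_data["feature"] = X_multicollinearity.columns
vif_data["VIF"] = [
    variance_inflation_factor(X_multicollinearity.values, i)
    for i in range(X_multicollinearity.shape[1])
]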

print("Variance Inflation Factor (VIF):")
print(vif_data)


Variance Inflation Factor (VIF):
        feature          VIF
0  credit_score   238.603998
1        income  2108.113632
2          debt  1242.875791
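
The very large values above are partly a consequence of passing the raw columns without an intercept: `variance_inflation_factor` appears to use only the columns it is given, so the auxiliary regressions have no constant term. The sketch below is a hand-rolled NumPy version, not the statsmodels implementation, that regresses each feature on the others plus an intercept and applies $1 / (1 - R^2)$. The function name `centered_vif` is hypothetical.

```python
import numpy as np

# Same columns as the multicollinearity example above
features = {
    "credit_score": [700, 720, 680, 750, 690, 710, 740],
    "income": [50000, 60000, 55000, 70000, 52000, 58000, 62000],
    "debt": [20000, 25000, 22000, 30000, 21000, 23000, 24000],
}
names = list(features)
X = np.column_stack([features[name] for name in names]).astype(float)

def centered_vif(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    r_squared = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r_squared)

vif = {name: centered_vif(X, j) for j, name in enumerate(names)}
for name, value in vif.items():
    print(f"{name}: {value:.2f}")
```

With statsmodels, a comparable result is usually obtained by applying `sm.add_constant` to the features first and ignoring the VIF reported for the constant column.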


### 4. Correlation Matrix

By using the `corr()` method, you can generate a **correlation matrix** that displays the correlation coefficients between pairs of variables. A high correlation coefficient (close to +1 or -1) between two independent variables suggests that they may be multicollinear.


In [9]:

# Sample dataset
data = {
    'var1': [1, 2, 3, 4, 5],
    'var2': [2, 4, 6, 8, 10],  # Highly correlated with var1
    'var3': [5, 4, 3, 2, 1],   # Inversely correlated with var1
    'var4': [1, 1, 1, 1, 1]    # Constant variable
}

df = pd.DataFrame(data)

# Calculate the correlation matrix
correlation_matrix = df.corr()
print("Correlation Matrix:")
print(correlation_matrix)


Correlation Matrix:
      var1  var2  var3  var4
var1   1.0   1.0  -1.0   NaN
var2   1.0   1.0  -1.0   NaN
var3  -1.0  -1.0   1.0   NaN
var4   NaN   NaN   NaN   NaN
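
The `NaN` entries for `var4` arise because a constant column has zero variance, so its Pearson correlation with any other column is undefined. One way to handle this, sketched below, is to drop columns with a single distinct value before calling `corr()` and then list the pairs whose absolute correlation exceeds a cutoff. The 0.9 threshold and the names `varying` and `high_pairs` are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "var1": [1, 2, 3, 4, 5],
    "var2": [2, 4, 6, 8, 10],
    "var3": [5, 4, 3, 2, 1],
    "var4": [1, 1, 1, 1, 1],
})

# Keep only columns that take more than one distinct value
varying = df.loc[:, df.nunique() > 1]
corr = varying.corr()

threshold = 0.9  # hypothetical cutoff for "highly correlated"
cols = list(corr.columns)
high_pairs = [
    (a, b, float(corr.loc[a, b]))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

for a, b, r in high_pairs:
    print(f"{a} vs {b}: {r:+.2f}")
```

Scanning only the pairs above the diagonal avoids reporting each pair twice and skips the trivial correlation of a variable with itself.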



### Summary

- **Unique Elements**: We used the `set()` function to extract the unique credit scores from a list, removing duplicates.
- **Degrees of Freedom**: We calculated the residual degrees of freedom of a logistic regression model by subtracting the number of estimated parameters from the number of observations, which helps evaluate model complexity relative to the data.
- **Multicollinearity**: We assessed multicollinearity among independent variables using the Variance Inflation Factor (VIF), which quantifies how much the variance of a regression coefficient is inflated by correlations between predictors.
- **Correlation Matrix**: We used the `corr()` method to compute pairwise correlation coefficients, where values close to +1 or -1 between two independent variables point to possible multicollinearity.

