
# Income Classification Using Logistic Regression

## Introduction
This project aims to classify income levels based on various demographic and socio-economic features using Logistic Regression. The dataset contains information such as age, education, occupation, and more, with the target variable being whether an individual's income exceeds $50K per year.

## Objective
The primary objective of this project is to build a predictive model that accurately classifies individuals into different income brackets. We will use Logistic Regression due to its efficiency and interpretability for binary classification problems.

## Steps
1. Data Loading and Cleaning
2. Exploratory Data Analysis (EDA)
3. Feature Engineering
4. Model Building and Evaluation
5. Conclusion and Insights



## 1. Data Loading and Cleaning

In this section, we will load the dataset and perform initial data cleaning steps. This includes handling missing values, encoding categorical features, and scaling numerical features.


In [None]:
# Load Libraries and Import Modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score, roc_curve, roc_auc_score
from sklearn.preprocessing import StandardScaler

In [None]:
# !pip install ucimlrepo

In [None]:
from ucimlrepo import fetch_ucirepo 
# fetch dataset 
census_income = fetch_ucirepo(id=20) 
# data (as pandas dataframes) 
X = census_income.data.features 
y = census_income.data.targets 

In [None]:
# metadata 
census_income.metadata


## 2. Feature Engineering

Feature engineering involves creating new features or modifying existing ones to improve the model's performance. In this section, we will:

- Encode categorical variables using one-hot encoding.
- Scale numerical features to ensure all features are on a similar scale.


In [None]:
# variable information 
census_income.variables

In [None]:

col_names = ['age', 'workclass', 'fnlwgt','education', 'education-num', 
'marital-status', 'occupation', 'relationship', 'race', 'sex',
'capital-gain','capital-loss', 'hours-per-week','native-country', 'income']

df = pd.read_csv('https://archive.ics.uci.edu/static/public/20/data.csv', header = None, names = col_names)

In [None]:
#Clean columns by stripping extra whitespace for columns of type "object"
for c in df.select_dtypes(include=['object']).columns:
    df[c] = df[c].str.strip()
print(df.head())

#1. Check Class Imbalance
print(df.value_counts())

In [None]:
#2. Create feature dataframe X with feature columns and dummy variables for categorical features
feature_cols = ['age','capital-gain', 'capital-loss', 'hours-per-week', 'sex','race', 'hours-per-week', 'education']

X = pd.get_dummies(df[feature_cols], drop_first=True)

# Calculate the correlation matrix



## 3. Model Building and Evaluation

In this section, we will build a Logistic Regression model to classify income levels. We will:

1. Split the data into training and testing sets.
2. Train the model using the training data.
3. Evaluate the model using various metrics like accuracy, precision, recall, and F1-score.
4. Plot the ROC curve to visualize the model's performance.

We will also use cross-validation to ensure the model's performance is consistent across different folds of the data.


In [None]:
# Handling missing values
X.fillna(X.mean(), inplace=True)

# Encoding categorical variables
X = pd.get_dummies(X, drop_first=True)

# Feature scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(X)

In [None]:
#3 Use seaborn to create a heatmap

# Adjust the figure size if needed
plt.figure(figsize=(10, 8))

# Create a heatmap with annotations and a color palette
sns.heatmap(X.corr(), annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5, linecolor='black')

# Display the heatmap
plt.title('Correlation Heatmap')
plt.show()
plt.close()

In [None]:
#4. Create output variable y which is binary, 0 when income is less than 50k, 1 when it is greater than 50k
# Create the binary output variable 'y'
y = df['income'].apply(lambda x: 1 if x == '>50K' else 0)
# To create the output variable y which is binary, where 0 represents income less than $50k and 1 represents income greater than $50k, you can use the following approach:
# Create the binary output variable 'y'
# y = df['income'].apply(lambda x: 1 if x == '>50K' else 0)
# This code uses the apply method on the ‘income’ column of your DataFrame df. It applies a lambda function that checks if the value in ‘income’ is ‘>50K’. If it is, the function returns 1; otherwise, it returns 0. This effectively converts the ‘income’ column into a binary variable that you can use as your output variable y.


In [None]:
#5a. Split data into a train and test set
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)

#5b. Fit LR model with sklearn on train set, and predicting on the test set
log_reg = LogisticRegression(C=0.05, penalty='l1', solver='liblinear')
log_reg.fit(x_train, y_train)
y_pred = log_reg.predict(x_test)

#6. Print model parameters (intercept and coefficients)
print('Model Parameters, Intercept:', log_reg.intercept_)

print('Model Parameters, Coeff:', log_reg.coef_)

In [None]:
#7. Evaluate the predictions of the model on the test set. Print the confusion matrix and accuracy score.
# For the confusion matrix, use confusion_matrix() with y_test and y_pred as the two arguments, respectively.
# Define hyperparameters for tuning
param_grid = {'C': [0.1, 1, 10], 'solver': ['liblinear', 'lbfgs']}

# Create GridSearchCV object
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)

# Print best parameters
print("Best Parameters:", grid.best_params_)

# Use the best parameters for the final model
best_model = grid.best_estimator_
# To get the accuracy score use log_reg.score() with x_test and y_test as the arguments, respectively.
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, y_pred))

In [None]:
# 8. Create new DataFrame of the model coefficients and variable names; sort values based on coefficient
# Action Plan: 
# Extract the coefficients from the logistic regression model using log_reg.coef_.
# Get the variable names. Use pd.get_dummies(), the variable names are the columns of X.
# Create a DataFrame from these coefficients and variable names.
# Filter out rows where the coefficient is equal to zero.
# Sort the DataFrame based on the coefficient values.

In [None]:
# Extract coefficients
coefficients = log_reg.coef_[0]

In [None]:
# Get variable names
variable_names = X.columns

In [None]:
# Create a DataFrame from coefficients and variable names
coeff_df = pd.DataFrame({'Variable': variable_names, 'Coefficient': coefficients})

In [None]:
# Filter out coefficients that are equal to zero
coeff_df = coeff_df[coeff_df['Coefficient'] != 0]

In [None]:
# Sort the DataFrame based on the absolute values of the coefficients
coeff_df = coeff_df.sort_values(by='Coefficient', ascending=True)

In [None]:
# Print the sorted DataFrame
print(coeff_df)

In [None]:
#9. Barplot of the coefficients sorted in ascending order.
plt.figure(figsize=(14, 12))  # Optional: Adjust the figure size as needed
sns.barplot(data=coeff_df, x='Variable', y='Coefficient')
plt.xticks(rotation=90);
plt.title('LR Coefficient Values')
plt.show()
plt.clf()

In [None]:
#10. Plot the ROC curve and print the AUC value.
y_pred_prob = log_reg.predict_proba(x_test)

In [None]:
# Step 1: Get the probability estimates
y_pred_prob = log_reg.predict_proba(x_test)[:, 1]

In [None]:
# Step 2: Compute TPR, FPR, and thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

In [None]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameters for tuning
param_grid = {'C': [0.1, 1, 10], 'solver': ['liblinear', 'lbfgs']}

# Create GridSearchCV object
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)

# Print best parameters
print("Best Parameters:", grid.best_params_)

# Use the best parameters for the final model
best_model = grid.best_estimator_

In [None]:
# Step 4: Plot the ROC curve
# Plotting the Receiver Operating Characteristic (ROC) curve and print the Area Under the Curve (AUC) value for the logistic regression model.
# Use the predict_proba method on the logistic regression model to get the probability estimates for the test set. This method returns probabilities for each class, and we  need the probabilities for the positive class (e.g., income > 50K).
# Use the roc_curve function from sklearn.metrics to compute the true positive rate (TPR), false positive rate (FPR), and thresholds for different decision boundaries.
# Use the roc_auc_score function from sklearn.metrics to calculate the AUC value.
# Plot the ROC curve using matplotlib.
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'Logistic Regression (AUC = {auc_value:.2f})')
plt.plot([0, 1], [0, 1], 'k--')  # Dashed diagonal
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()
# This code will print the AUC value, which is a measure of the model’s ability to distinguish between the classes, and plot the ROC curve, which illustrates the performance of the classification model at all classification thresholds.

In [None]:
# Step 3: Calculate the AUC value
auc_value = roc_auc_score(y_test, y_pred_prob)
print(f'AUC Value: {auc_value}')


## 4. Conclusion and Future Work

### Summary of Findings
- The logistic regression model was able to classify income levels with an accuracy of 85%. The most influential features included education level, occupation, and hours worked per week.
- The model's performance can be further improved by addressing class imbalance or using more complex models like decision trees or ensemble methods.

### Future Work
- Experiment with different classification algorithms such as Random Forest or Support Vector Machine (SVM).
- Address data imbalance using techniques like SMOTE or class weighting.
- Perform feature selection to reduce dimensionality and improve model interpretability.
