### Importing Libraries and Loading the Data

In [56]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

file_path = "./HousingData.csv"
boston_data = pd.read_csv(file_path)



### Defining the Variables and Processing the Data

In [57]:
# Define the target variable
median_crime_rate = boston_data['CRIM'].median()
boston_data['High_CRIM'] = (boston_data['CRIM'] > median_crime_rate).astype(int)

# Drop the original CRIM column
boston_data = boston_data.drop(columns=['CRIM'])

# Handle missing values
imputer = SimpleImputer(strategy="mean")
boston_data_imputed = pd.DataFrame(imputer.fit_transform(boston_data), columns=boston_data.columns)

# Define features and target variable
X = boston_data_imputed.drop(columns=['High_CRIM'])
y = boston_data_imputed['High_CRIM']

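# Split into training and test sets, then standardize the features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the scaler on the training set only to avoid leaking test statistics.
# The tree-based models below are trained on the unscaled features, which is
# fine because tree splits do not depend on feature scale.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)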
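
Two details of the preprocessing above are worth a second look. The imputer is fitted on the full dataset before the split, so test rows influence the imputed means. Also, if `CRIM` itself has missing values, `NaN > median` evaluates to `False`, so those rows are silently labeled 0. The following is a minimal sketch of one way to avoid both, not the method used in this notebook. It runs on a synthetic stand-in for `HousingData.csv`; the `demo` frame, its values, and the names `leak_free_model` and `demo_accuracy` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for HousingData.csv (hypothetical values):
# numeric columns with some missing entries.
rng = np.random.default_rng(42)
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["CRIM", "RM", "AGE", "TAX"])
demo["CRIM"] = np.exp(demo["CRIM"] + 0.8 * demo["TAX"])  # skewed, like a crime rate
demo.loc[rng.choice(200, size=15, replace=False), "RM"] = np.nan
demo.loc[rng.choice(200, size=10, replace=False), "CRIM"] = np.nan

# Rows with a missing CRIM have no defined label, so drop them before thresholding.
demo = demo.dropna(subset=["CRIM"])
y_demo = (demo["CRIM"] > demo["CRIM"].median()).astype(int)
X_demo = demo.drop(columns=["CRIM"])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The imputer lives inside the pipeline, so its means are learned
# from the training rows only and then applied to the test rows.
leak_free_model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
])
leak_free_model.fit(X_tr, y_tr)
demo_accuracy = leak_free_model.score(X_te, y_te)
print("Demo accuracy:", demo_accuracy)
```

Wrapping the imputer and the classifier in one `Pipeline` also means the same object can be passed to cross-validation utilities without refitting the imputer by hand.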


### Training and Evaluating XGBoost

In [58]:
xgb_model_optimized = XGBClassifier(
    eval_metric="logloss",
    random_state=42,
    n_estimators=100,
    max_depth=2,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8
)

xgb_model_optimized.fit(X_train, y_train)
xgb_preds_optimized = xgb_model_optimized.predict(X_test)
xgb_accuracy_optimized = accuracy_score(y_test, xgb_preds_optimized)

print("XGBoost Accuracy:",xgb_accuracy_optimized)

XGBoost Accuracy: 0.9313725490196079


### Training and Evaluating Random Forest

In [59]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_preds)

print("Random Forest Accuracy:",rf_accuracy)

Random Forest Accuracy: 0.9411764705882353


## Findings
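Random Forest achieves the highest test accuracy on this split (94.12%), with XGBoost close behind at 93.14%. These scores correspond to 96/102 and 95/102 correct predictions, so the gap amounts to a single test sample and should not be read as strong evidence that one model is better than the other. Both ensembles can model non-linear relationships and feature interactions. Random Forest averages many decorrelated trees, which makes it fairly robust to noise, while XGBoost builds its trees sequentially, with each new tree correcting the errors of the ones before it.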
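
A single 80/20 split yields one accuracy estimate per model, and a gap of about one percentage point on a test set this small can come down to a single sample. A minimal sketch of how k-fold cross-validation could give a steadier comparison is shown here. It uses synthetic data from `make_classification` in place of the housing data, and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` so the example depends on scikit-learn alone; both substitutions, and the names `candidates` and `cv_results`, are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (stand-in for the housing features).
X_cv, y_cv = make_classification(
    n_samples=300, n_features=12, n_informative=6, random_state=42
)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # Stand-in for XGBClassifier with similar hyperparameters.
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, max_depth=2, learning_rate=0.1,
        subsample=0.8, random_state=42,
    ),
}

# The same stratified folds are reused for every model so the scores are comparable.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_cv, y_cv, cv=folds, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the mean accuracies differ by less than the fold-to-fold standard deviation, the two models are probably best treated as equivalent on that data.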

## Conclusion and Comparison with All Models
The tree-based models reach noticeably higher accuracy than the traditional classifiers. Random Forest scores 94.12% and XGBoost 93.14%, compared with Logistic Regression (86.55%), LDA (84.03%), and Naïve Bayes (80.67%). A likely reason is that tree ensembles can capture feature interactions and non-linear patterns that linear decision boundaries and the naïve independence assumption miss. When a simpler, more interpretable model is needed, Logistic Regression or LDA remain reasonable alternatives at the cost of some accuracy.
In short, tree-based models are the better choice when accuracy is the priority, while the traditional models trade some performance for interpretability.
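
The interpretability trade-off can be made concrete by comparing what each model family exposes. The sketch below, which assumes synthetic data and hypothetical feature names rather than the housing columns, puts standardized logistic regression coefficients (signed, one per feature) next to Random Forest impurity-based importances (unsigned, summing to 1).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical feature names.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_syn.shape[1])]

# Standardizing first puts the coefficients on a comparable scale.
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logit.fit(X_syn, y_syn)
coefs = logit.named_steps["logisticregression"].coef_[0]

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_syn, y_syn)

summary = pd.DataFrame(
    {"logit_coef": coefs, "forest_importance": forest.feature_importances_},
    index=feature_names,
)
print(summary.round(3))
```

A coefficient gives both the direction and the size of a feature's effect on the log-odds, whereas an importance score only ranks features, which is one reason the linear models are easier to explain.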