Problem Definition:
- Clearly define the problem you aim to solve using supervised learning (classification or regression).
- Describe the dataset you will work with, including the features and the target variable.

What is the projected sale price of a home with 2,000 gross square feet? (regression)
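
One way to describe the features and the target is a small per-column summary. The sketch below is only illustrative: the toy rows and the describe_dataset helper are hypothetical, and the column names are borrowed from the ones this notebook uses later.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the ones used later in this notebook.
toy = pd.DataFrame({
    "gross_square_feet": [1800, 2000, 2400, 1500],
    "land_square_feet": [2000, 2500, 3000, 1800],
    "residential_units": [1, 2, 2, 1],
    "commercial_units": [0, 0, 1, 0],
    "sale_price": [650000, 720000, 910000, 540000],
})

def describe_dataset(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and feature/target role."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
    })
    summary["role"] = ["target" if col == target else "feature" for col in df.columns]
    return summary

summary = describe_dataset(toy, target="sale_price")
print(summary)
# Distribution of the target, useful for spotting outliers or placeholder prices
print(toy["sale_price"].describe())
```

Running the same helper on the real DataFrame after loading it would give the feature list and target description this section asks for.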

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import math
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split


df = pd.read_csv('small.csv')

def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

from sklearn.preprocessing import LabelEncoder
import seaborn as sns

label_encoder = LabelEncoder()

for column in df.columns:
    if df[column].dtype == 'object':
        df[column] = label_encoder.fit_transform(df[column])
    
correlation_matrix = df.corr()

# df.corr()['sale_price'][:].plot(figsize=(4,2), kind='bar')
print(correlation(df, 0.7))

# Keep the target and the features used for modeling: dropping every flagged
# column also removes 'sale_price', so df.corr()['sale_price'] raises a KeyError.
keep = {'sale_price', 'gross_square_feet', 'residential_units', 'commercial_units', 'land_square_feet'}
df.drop(columns=list(correlation(df, 0.7) - keep), inplace=True)
cm = df.corr()

cm['sale_price'].plot(figsize=(4,2), kind='bar')
plt.figure(figsize = (17,13))
sns.heatmap(cm, annot=True, cmap=plt.cm.CMRmap_r)
plt.show()
df

{'building_class_at_time_of_sale', 'sale_price', 'tax_class_at_present', 'total_units', 'building_class_at_present', 'land_square_feet', 'apartment_number', 'gross_square_feet', 'tax_class_at_time_of_sale'}

Feature Scaling:
- Perform necessary feature scaling and justify your choice.

In [2]:
# Select columns by name so the selection does not depend on column order
feature_cols = ['gross_square_feet', 'residential_units', 'commercial_units', 'land_square_feet']
X = df[feature_cols].values
y = df['sale_price'].values

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# feature scaling 
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
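
To help justify the scaling choice, the sketch below checks two things on synthetic stand-in data (an assumption, not the notebook's dataset). First, a StandardScaler fitted on the training split yields roughly zero mean and unit variance there. Second, a tree-based model's predictions are compared with and without scaling. Tree splits depend only on the ordering of feature values, so the difference is expected to be negligible, but the code prints it rather than assuming it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): two features on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(500, 5000, size=200),   # e.g. square footage
    rng.integers(1, 5, size=200),       # e.g. number of units
]).astype(float)
y_demo = 300 * X_demo[:, 0] + 50_000 * X_demo[:, 1] + rng.normal(0, 10_000, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Fit the scaler on the training split only, then reuse its statistics on the test split.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

train_means = X_tr_scaled.mean(axis=0)
train_stds = X_tr_scaled.std(axis=0)
print("train means:", train_means)
print("train stds: ", train_stds)

# Compare a tree fitted on raw features with one fitted on scaled features.
tree_raw = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
tree_scaled = DecisionTreeRegressor(random_state=0).fit(X_tr_scaled, y_tr)
max_diff = np.max(np.abs(tree_raw.predict(X_te) - tree_scaled.predict(X_te_scaled)))
print("largest prediction difference:", max_diff)
```

If the difference is close to zero, scaling is harmless but not required for tree ensembles. It matters more for distance-based or gradient-based models.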


Data Modeling:

Model Selection:
- Choose at least three supervised learning models suitable for your problem.
- Provide a brief justification for selecting each model relative to your dataset. You'll receive zero points if you read the generic description from ChatGPT. 

Implementation:
- Implement each chosen model using a programming language of your choice (e.g., Python with libraries like Scikit-learn).
- Train each model on the training set.

In [3]:
# decision tree (classifier_dtc is used by the evaluation cell)
from sklearn.tree import DecisionTreeClassifier
classifier_dtc = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier_dtc.fit(X_train, y_train)

# random forest
from sklearn.ensemble import RandomForestClassifier
classifier_rf = RandomForestClassifier(criterion='entropy', random_state=0, verbose=1)
classifier_rf.fit(X_train, y_train)

# GBR gradient boosting regression (gb_regressor is used by the evaluation cell)
from sklearn.ensemble import GradientBoostingRegressor
gb_regressor = GradientBoostingRegressor()
gb_regressor.fit(X_train, y_train)


[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.3s
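
Because the problem is framed as a regression on sale_price, the regressor counterparts of these models may be a closer fit than the classifiers. The following is a minimal sketch under that assumption. It swaps in DecisionTreeRegressor, RandomForestRegressor, and GradientBoostingRegressor and runs on synthetic data, not on the notebook's dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for (features, sale price); values are assumptions for illustration.
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 5000, size=(300, 2))
y_demo = 250 * X_demo[:, 0] + 40 * X_demo[:, 1] + rng.normal(0, 20_000, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

regressors = {
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model and record its R-squared on the held-out split.
r2_by_model = {}
for name, model in regressors.items():
    model.fit(X_tr, y_tr)
    r2_by_model[name] = model.score(X_te, y_te)  # score() returns R-squared for regressors
    print(f"{name}: R2 = {r2_by_model[name]:.3f}")
```

With regressors, the models can be compared on the same regression metrics instead of mixing accuracy with R-squared.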


Hyperparameter Tuning:
- Perform hyperparameter tuning for each model to optimize its performance.
- Use both grid and randomized search techniques to find the best hyperparameters.
- Show the tradeoff between grid search and randomized search regarding resource consumption (e.g., time, memory, and CPU) and performance gain (e.g., model accuracy). 
- Justify your choice of hyperparameters based on experimentation and reasoning. You'll receive zero points if you read the generic description from ChatGPT. 

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=classifier_rf, param_grid=param_grid, cv=5, scoring='accuracy')

# Perform the grid search
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Get the best estimator
best_classifier = grid_search.best_estimator_

# Evaluate the best estimator
y_pred = best_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.3s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.2s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Do
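
The tradeoff between grid and randomized search can be measured by timing both over the same search space. The sketch below makes several assumptions: it uses synthetic data, a RandomForestRegressor with R-squared scoring, and a smaller grid than the one above so that it runs quickly. The timed_search helper is hypothetical. Memory and CPU usage are not measured here.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 5000, size=(200, 2))
y_demo = 250 * X_demo[:, 0] + 40 * X_demo[:, 1] + rng.normal(0, 20_000, size=200)

# Reduced search space: 2 * 2 * 2 = 8 candidates.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 4],
}

def timed_search(search):
    """Fit a search object and return the wall-clock seconds it took."""
    start = time.perf_counter()
    search.fit(X_demo, y_demo)
    return time.perf_counter() - start

# Grid search evaluates every candidate; randomized search samples n_iter of them.
grid = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3, scoring="r2")
rand = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_grid,
    n_iter=4,
    cv=3,
    scoring="r2",
    random_state=0,
)

grid_seconds = timed_search(grid)
rand_seconds = timed_search(rand)

print(f"grid:       {len(grid.cv_results_['params'])} candidates, "
      f"{grid_seconds:.2f}s, best R2 = {grid.best_score_:.4f}, {grid.best_params_}")
print(f"randomized: {len(rand.cv_results_['params'])} candidates, "
      f"{rand_seconds:.2f}s, best R2 = {rand.best_score_:.4f}, {rand.best_params_}")
```

Randomized search fits half as many candidates here, so it should usually finish sooner. Its best score can be no higher than the grid's, because it samples from the same candidates. The gap between the two best scores is the performance given up for the time saved.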

In [None]:
from sklearn.metrics import recall_score, precision_score, f1_score, r2_score, mean_absolute_error, mean_squared_error

# Evaluate Random Forest Classifier
rf_predictions = classifier_rf.predict(X_test)
# Calculate metrics
rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_precision = precision_score(y_test, rf_predictions, average='weighted')
rf_recall = recall_score(y_test, rf_predictions, average='weighted')
rf_f1 = f1_score(y_test, rf_predictions, average='weighted')
rf_conf_matrix = confusion_matrix(y_test, rf_predictions)

print("\nRandom Forest Classifier Metrics:")
print("Accuracy:", rf_accuracy)
print("Precision:", rf_precision)
print("Recall:", rf_recall)
print("F1 Score:", rf_f1)
print("Confusion Matrix:\n", rf_conf_matrix)

Evaluation:
- Evaluate the performance of each model on the testing set using appropriate metrics (e.g., R-squared, Adjusted R-squared, accuracy, precision, recall, F1 score, confusion matrix, sensitivity, specificity, ROC, AUC, etc.).
- Provide a comparative analysis of the models and discuss their strengths and weaknesses relative to your dataset. You'll receive zero points if you read the generic description from ChatGPT. 

In [None]:
from sklearn.metrics import recall_score, precision_score, f1_score, r2_score, mean_absolute_error, mean_squared_error

# Evaluate Decision Tree Classifier
dtc_predictions = classifier_dtc.predict(X_test)
# Calculate metrics
dtc_accuracy = accuracy_score(y_test, dtc_predictions)
dtc_precision = precision_score(y_test, dtc_predictions, average='weighted')
dtc_recall = recall_score(y_test, dtc_predictions, average='weighted')
dtc_f1 = f1_score(y_test, dtc_predictions, average='weighted')
dtc_conf_matrix = confusion_matrix(y_test, dtc_predictions)

# Evaluate Random Forest Classifier
rf_predictions = classifier_rf.predict(X_test)
# Calculate metrics
rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_precision = precision_score(y_test, rf_predictions, average='weighted')
rf_recall = recall_score(y_test, rf_predictions, average='weighted')
rf_f1 = f1_score(y_test, rf_predictions, average='weighted')
rf_conf_matrix = confusion_matrix(y_test, rf_predictions)

# Evaluate Gradient Boosting Regressor
gb_predictions = gb_regressor.predict(X_test)
# Calculate metrics
gb_r2 = r2_score(y_test, gb_predictions)
gb_mae = mean_absolute_error(y_test, gb_predictions)
gb_mse = mean_squared_error(y_test, gb_predictions)
gb_rmse = np.sqrt(gb_mse)

# Print Metrics
print("Decision Tree Classifier Metrics:")
print("Accuracy:", dtc_accuracy)
print("Precision:", dtc_precision)
print("Recall:", dtc_recall)
print("F1 Score:", dtc_f1)
print("Confusion Matrix:\n", dtc_conf_matrix)

print("\nRandom Forest Classifier Metrics:")
print("Accuracy:", rf_accuracy)
print("Precision:", rf_precision)
print("Recall:", rf_recall)
print("F1 Score:", rf_f1)
print("Confusion Matrix:\n", rf_conf_matrix)

print("\nGradient Boosting Regressor Metrics:")
print("R-squared (R2):", gb_r2)
print("Mean Absolute Error (MAE):", gb_mae)
print("Mean Squared Error (MSE):", gb_mse)
print("Root Mean Squared Error (RMSE):", gb_rmse)
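
Adjusted R-squared is listed among the metrics above but has no dedicated scikit-learn function, so it can be derived from R-squared as 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features. The sketch below uses a hypothetical regression_report helper and made-up prices purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred, n_features):
    """Return R2, adjusted R2, MAE, and RMSE for one set of predictions."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    # Adjusted R2 penalizes R2 for the number of features used.
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "r2": r2,
        "adjusted_r2": adjusted_r2,
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": float(np.sqrt(mse)),
    }

# Made-up sale prices and predictions (assumption, not notebook results).
y_true = np.array([500_000, 650_000, 720_000, 910_000, 1_200_000, 480_000])
y_pred = np.array([520_000, 640_000, 700_000, 950_000, 1_150_000, 500_000])

report = regression_report(y_true, y_pred, n_features=2)
for metric, value in report.items():
    print(f"{metric}: {value:.4f}")
```

Because MAE and RMSE are in the same units as sale_price, they are often easier to interpret for this problem than R-squared alone.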


Conclusion:
Summarize what you learned from this project and the course. 