# Run Models (includes RandomForest and XGBoost)

Documentation
https://model.earth/RealityStream  
https://model.earth/RealityStream/input/industries backed-up to Run-Models-bkup.ipynb

Haohao: Loading parameters.yaml file and saving as custom configs on colab user's Google Drive    
https://raw.githubusercontent.com/ModelEarth/RealityStream/main/parameters.yaml

DONE Aashish: Used Pandas for integrated_df (became df) to avoid loading saved .csv files when in Colab at Google.com.  
DONE Loren: Load parameters.yaml and save locally for customization.  
https://chatgpt.com/share/e4a2ee73-ab74-4551-9868-37b9b5b6b359  

TO DO: In the same panel as each accuracy report, call a new function called displayModelHeader to display the model name (as a bold header) and the file paths for features and targets above the report.

TO DO: Add a path parameter that pulls from [all-years](https://colab.research.google.com/drive/1zu0WcCiIJ5X3iN1Hd1KSW4dGn0JuodB8#scrollTo=jxZiI7xcrT4B) generated by our [Industry Features CoLab](https://colab.research.google.com/drive/1HJnuilyEFjBpZLrgxDa4S0diekwMeqnh?usp=sharing)

TO DO: Load 1 of these 4 bee targets using parameters.yaml setting, remove bees hardcoding from colab
https://github.com/ModelEarth/RealityStream/tree/main/input/bees/targets  

TO DO: Load targets from Google Data Commons by calling a separate python file.  
https://reality.streamlit.app/?parameters=https://raw.githubusercontent.com/ModelEarth/RealityStream/main/parameters.yaml

In [1]:
save_training = True  # When False, Pandas is used.

import pandas as pd
import regex as re
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import yaml
import requests

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_curve, roc_auc_score
from imblearn.over_sampling import SMOTE
from sklearn.impute import SimpleImputer

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import xgboost as xgb
from xgboost import plot_importance

In [17]:
# Default parameters file and local path to save at.
# After running you can edit parameters that appear to the right.
# Coming soon:
# You can change the bees year in the targets.path to: 2007, 2012, 1017, 2022
# TO DO: Changing the year in the bees target does not work yet. Make updates to other panels.
parametersSource = "https://raw.githubusercontent.com/ModelEarth/RealityStream/main/parameters.yaml"
importNewParameters = True
overwriteExistingParameter = False
localParametersPath = '/content/parametersLocal.yaml'

# Fetch the parameters from the source URL
response = requests.get(parametersSource)
parametersSourceData = yaml.safe_load(response.content)

# Function to merge dictionaries
def merge_dicts(source, local, import_new, overwrite_existing):
    for key, value in source.items():
        if key in local:
            if isinstance(value, dict) and isinstance(local[key], dict):
                merge_dicts(value, local[key], import_new, overwrite_existing)
            elif overwrite_existing:
                local[key] = value
        else:
            if import_new:
                local[key] = value

class DictToObject:
    def __init__(self, dictionary):
        for key, value in dictionary.items():
            if isinstance(value, dict):
                value = DictToObject(value)
            self.__dict__[key] = value

    def __getitem__(self, key):
        return self.__dict__[key]

    def __setitem__(self, key, value):
        self.__dict__[key] = value

# Load local parameters if they exist
if os.path.exists(localParametersPath):
    with open(localParametersPath, 'r') as file:
        parametersLocalData = yaml.safe_load(file)
else:
    parametersLocalData = {}

# Merge parameters according to specified rules
merge_dicts(parametersSourceData, parametersLocalData, importNewParameters, overwriteExistingParameter)

# Save the merged parameters locally
with open(localParametersPath, 'w') as file:
    yaml.dump(parametersLocalData, file)

# Display local parameters file in the left side of Colab
from google.colab import files
files.view(localParametersPath)

<IPython.core.display.Javascript object>

In [4]:
# Apply Parameters
# Load local parameters and print below.

import yaml

localParametersPath = '/content/parametersLocal.yaml'

# Load parameters from the local file
with open(localParametersPath, 'r') as file:
    param_dict = yaml.safe_load(file)

# Convert dictionary to an object with dot notation access
param = DictToObject(param_dict)

# Print the parameters
def print_param(obj, indent=0):
    for key in obj.__dict__.keys():
        value = getattr(obj, key)
        if isinstance(value, DictToObject):
            print(' ' * indent + f"{key}:")
            print_param(value, indent + 2)
        else:
            print(' ' * indent + f"{key}: {value}")

# Also not in use yet, these will only be used if parametersLocal.yaml omits.
features_data = param['features']['data'] # "industries"
features_path = param['features']['path'] # "https://raw.githubusercontent.com/ModelEarth/community-timelines/main/training/naics{naics}/US/counties/{year}/US-{state}-training-naics{naics}-counties-{year}.csv"
targets_data  = param['targets']['data'] # "bees"
targets_path  = param['targets']['path'] # "https://raw.githubusercontent.com/ModelEarth/RealityStream/main/input/bees/targets/bees-targets-increase2022.csv"

print_param(param)
print("\nparam.targets.data:", param.targets.data)

features:
  data: industries
  endyear: 2021
  naics: [2, 4, 6]
  path: https://raw.githubusercontent.com/ModelEarth/community-timelines/main/training/naics{naics}/US/counties/{year}/US-{state}-training-naics{naics}-counties-{year}.csv
  startyear: 2017
  state: ME
models: rbf
targets:
  data: bees
  path: https://raw.githubusercontent.com/ModelEarth/RealityStream/main/input/bees/targets/bees-targets-increase2017.csv

param.targets.data: bees


In [5]:
# TO DO: Setting model_name = "XGBoost" resulted in the error:
# ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, The experimental DMatrix parameter`enable_categorical` must be set to `True`.  Invalid columns:Population-2018: object, Population-2019: object, Population-2020: object

# TO DO: These are in use, replace with parameters

dataset_name = "bees"  # TO DO: eliminate since features and targets will differ.
model_name = "RandomForest"  # Specify the model to be trained
all_model_list = ["LogisticRegression", "SVM", "MLP", "RandomForest", "XGBoost"]  # All usable models
assert model_name in all_model_list
valid_report_list = ["RandomForest", "XGBoost"]  # All valid models to generate feature-importance report

random_state = 42  # Specify random state

# Feature related information:
country = "US"
years = range(2017, 2022)
naics_level = 2
naics_list = [2, 4, 6]
assert naics_level in naics_list

# Target related information:
target_url = f"https://raw.githubusercontent.com/ModelEarth/RealityStream/main/input/{dataset_name}/targets/{dataset_name}-targets.csv"
target_df = pd.read_csv(target_url)  # Get the target csv

if dataset_name == "bees":  # Eliminate these lines after switching to parameters.yaml settings
    target_column = '2022_increase'  # Specify the target column
    target_list = ['2007_increase', '2012_increase', '2017_increase', '2022_increase']  # Specify all usable target columns
    target_list.remove(target_column)  # Drop the one we are interested in

year_list = ["2002", "2007", "2012", "2017", "2022"]
drop_list = ['Unnamed: 0', 'Name', 'State', 'State ANSI', 'County ANSI', "Ag District", "Ag District Code"]
all_drop_list = drop_list + target_list + year_list  # Drop all columns that can affect the training procedure or are not related

feature_start_idx = 3  # Specify the starting column index in dataset csv for features, where first few columns are for target and id related stuff
target_idx = 0  # Specify the column index for target

# Directory Information:
merged_save_dir = f"../process/{dataset_name}/states-{target_column}-{dataset_name}"  # Specify the saving dir for state-separate dataset
full_save_dir = f"../output/{dataset_name}/training"  # Specify the saving dir for the integrated dataset


In [6]:
# STEP: Get Dictionaries for states and industries
STATE_DICT = {
    "AL": "ALABAMA","AK": "ALASKA","AZ": "ARIZONA","AR": "ARKANSAS","CA": "CALIFORNIA","CO": "COLORADO","CT": "CONNECTICUT","DE": "DELAWARE","FL": "FLORIDA","GA": "GEORGIA","HI": "HAWAII","ID": "IDAHO","IL": "ILLINOIS","IN": "INDIANA","IA": "IOWA","KS": "KANSAS","KY": "KENTUCKY","LA": "LOUISIANA","ME": "MAINE","MD": "MARYLAND","MA": "MASSACHUSETTS","MI": "MICHIGAN","MN": "MINNESOTA","MS": "MISSISSIPPI","MO": "MISSOURI","MT": "MONTANA","NE": "NEBRASKA","NV": "NEVADA","NH": "NEW HAMPSHIRE","NJ": "NEW JERSEY","NM": "NEW MEXICO","NY": "NEW YORK","NC": "NORTH CAROLINA","ND": "NORTH DAKOTA","OH": "OHIO","OK": "OKLAHOMA","OR": "OREGON","PA": "PENNSYLVANIA","RI": "RHODE ISLAND","SC": "SOUTH CAROLINA","SD": "SOUTH DAKOTA","TN": "TENNESSEE","TX": "TEXAS","UT": "UTAH","VT": "VERMONT","VA": "VIRGINIA","WA": "WASHINGTON","WV": "WEST VIRGINIA","WI": "WISCONSIN","WY": "WYOMING"
}
try:
    industries_df = pd.read_csv(f"https://raw.githubusercontent.com/ModelEarth/community-data/master/{country.lower()}/id_lists/naics{naics_level}.csv",header=None)
    INDUSTRIES_DICT = industries_df.set_index(0).to_dict()[1]
except:
    INDUSTRIES_DICT = dict()

In [7]:
# STEP: Create Functions
def rename_columns(df, year):
    rename_mapping = {}
    for column in df.columns:
      if column not in df.columns[:2]:
          new_column_name = column + f'-{year}'
          rename_mapping[column] = new_column_name

    df.rename(columns=rename_mapping, inplace=True)

def check_directory(directory_path): # Check whether the given directory exists, if not, then create it
    if not os.path.exists(directory_path):
        try:
            os.makedirs(directory_path)
            print(f"Directory '{directory_path}' created successfully.")
        except OSError as e:
            print(f"Error creating directory '{directory_path}': {e}")
    else:
        print(f"Directory '{directory_path}' already exists.")
    return directory_path

In [8]:
# STEP: Merge feature and target data
# If save_training=True, your files will reside in the "process" folder to the left.
# Hit the refresh icon above your folder list to the left.
if save_training:
    save_dir = merged_save_dir  # Save in the local directory if save_training is True

check_directory(save_dir)

# State-separately, for each state, merging industry features and target on Fips value and County Name, return the merged csv

for state in STATE_DICT:
    data = {}
    for year in years:
        url = f"https://raw.githubusercontent.com/ModelEarth/community-timelines/main/training/naics{naics_level}/{country}/counties/{year}/{country}-{state}-training-naics{naics_level}-counties-{year}.csv"
        data[year] = pd.read_csv(url)
        rename_columns(data[year], year)

    merged_df_feature = pd.merge(data[2017], data[2018], on=['Fips', 'Name'], how='inner')
    for year in range(2019, 2022):
        merged_df_feature = pd.merge(merged_df_feature, data[year], on=['Fips', 'Name'], how='inner')

    cols = merged_df_feature.columns.tolist()
    cols = cols[:2] + sorted(cols[2:])
    merged_df_feature = merged_df_feature[cols].rename(columns={"Name": "County"})

    merged_df = pd.merge(merged_df_feature, target_df[target_df["State"] == STATE_DICT[state]], on=["Fips", "County"], how="inner")
    merged_df.drop(columns=all_drop_list, axis=1, inplace=True)

    target = merged_df.iloc[:, -1]
    merged_df.drop(columns=[target_column], axis=1, inplace=True)
    merged_df.insert(0, 'target', target)

    merged_df.to_csv(os.path.join(merged_save_dir, f"{state}-{target_column}-{dataset_name}.csv"), index=False)

    if save_training:
      save_dir = merged_save_dir #Use the local directory if not in Google Colab
      file_path = os.path.join(save_dir, f"{state}-{target_column}-{dataset_name}.csv")
      merged_df.to_csv(file_path, index=False)
      print(f"Saved file at: {file_path}")

      # try:
      #   from google.colab import drive
      #   drive.mount('/content/drive')
      #   save_dir = '/content/drive/My Drive/RunModels' #Your Google Drive path
      #   check_directory(save_dir)

      # except ImportError:
      #   save_dir = merged_save_dir #Use the local directory if not in Google Colab

      # file_path = os.path.join(save_dir, f"{state}-{target_column}-{dataset_name}.csv")
      # merged_df.to_csv(file_path, index=False)
      # print(f"Saved file at: {file_path}")

      merged_df.to_csv(os.path.join(merged_save_dir, f"{state}-{target_column}-{dataset_name}.csv"), index=False)

if not save_training:
      print(f"Since save_training is false no files are currently saved.")




Directory '../process/bees/states-2022_increase-bees' created successfully.
Saved file at: ../process/bees/states-2022_increase-bees/AL-2022_increase-bees.csv
Saved file at: ../process/bees/states-2022_increase-bees/AK-2022_increase-bees.csv
Saved file at: ../process/bees/states-2022_increase-bees/AZ-2022_increase-bees.csv
Saved file at: ../process/bees/states-2022_increase-bees/AR-2022_increase-bees.csv
Saved file at: ../process/bees/states-2022_increase-bees/CA-2022_increase-bees.csv
Saved file at: ../process/bees/states-2022_increase-bees/CO-2022_increase-bees.csv
Saved file at: ../process/bees/states-2022_increase-bees/CT-2022_increase-bees.csv
Saved file at: ../process/bees/states-2022_increase-bees/DE-2022_increase-bees.csv
Saved file at: ../process/bees/states-2022_increase-bees/FL-2022_increase-bees.csv
Saved file at: ../process/bees/states-2022_increase-bees/GA-2022_increase-bees.csv
Saved file at: ../process/bees/states-2022_increase-bees/HI-2022_increase-bees.csv
Saved file 

In [10]:
# STEP: Integrate separate state data into one, return the full dataset csv
# If save_training=True, your files will reside in the "output" folder to the left.
# Hit the refresh icon above your folder list to the left.
save_dir = full_save_dir  # Use the local directory if save_training is True

check_directory(save_dir)

dataframes = []
csv_directory = f"../process/{dataset_name}/states-{target_column}-{dataset_name}"
csv_files = os.listdir(csv_directory)
for csv_file in csv_files:
    if csv_file.endswith('.csv'):
        dataframes.append(pd.read_csv(os.path.join(csv_directory, csv_file)))

integrated_df = pd.concat(dataframes, ignore_index=True)
df = integrated_df

if save_training:
  save_dir = full_save_dir #Use the local directory if not in Google Colab
  file_path = os.path.join(save_dir, f"{target_column}-{dataset_name}.csv")
  integrated_df.to_csv(file_path, index=False)
  print(f"Saved file at: {file_path}")
    # try:
    #   from google.colab import drive
    #   drive.mount('/content/drive', force_remount=True)
    #   save_dir = '/content/drive/My Drive/RunModels' #Your Google Drive path
    #   check_directory(save_dir)
    # except ImportError:
    #   save_dir = full_save_dir #Use the local directory if not in Google Colab

  file_path = os.path.join(save_dir, f"{target_column}-{dataset_name}.csv")
  integrated_df.to_csv(file_path, index=False)
  print(f"Saved file at: {file_path}")
  #integrated_df.to_csv(os.path.join(full_save_dir, f"{target_column}-{dataset_name}.csv"), index=False)



Directory '../output/bees/training' already exists.
Saved file at: ../output/bees/training/2022_increase-bees.csv
Saved file at: ../output/bees/training/2022_increase-bees.csv


In [11]:
# Train the model and get the test report
def train_model(model, X_train, y_train, X_test, y_test, over_sample):

    if over_sample:
        sm = SMOTE(random_state = 2)
        X_train, y_train = sm.fit_resample(X_train, y_train.ravel())
        print("Oversampling Done for Training Data.")

    model = model.fit(X_train, y_train)
    print("Model Fitted Successfully.")

    # calculating y_pred
    y_pred = model.predict(X_test)
    y_pred_prob = model.predict_proba(X_test)

    #roc_auc score
    roc_auc = round(roc_auc_score(y_test, y_pred_prob[:, 1]), 2)
    print(f"\033[1mROC-AUC Score\033[0m \t\t: {roc_auc*100} %")

    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob[:,1], pos_label=1)

    gmeans = np.sqrt(tpr * (1-fpr))

    ix = np.argmax(gmeans)

    print('\033[1mBest Threshold\033[0m \t\t: %.3f \n\033[1mG-Mean\033[0m \t\t\t: %.3f' % (thresholds[ix], gmeans[ix]))
    best_threshold_num = round(thresholds[ix], 3)

    gmeans_num = round(gmeans[ix], 3)

    y_pred = (y_pred > thresholds[ix])

    #ccuracy score
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_num = f"{accuracy * 100:.1f}"

    print("\033[1mModel Accuracy\033[0m \t\t:", round(accuracy,2,)*100, "%")
    print("\033[1m\nClassification Report:\033[0m")

    #Generate classification report for display and in dictionary for furture report generation
    cfc_report = classification_report(y_test, y_pred)
    cfc_report_dict = classification_report(y_test, y_pred, output_dict= True)
    print(cfc_report)

    return model, y_pred, accuracy_num, gmeans_num, accuracy_num, roc_auc, best_threshold_num, cfc_report_dict

# Train the specified model, impute the nan values, and save the trained model as well as the feature-target report
def train(model_name, target_column, dataset_name, X_train, y_train, X_test, y_test, report_gen, all_model_list, valid_report_list, over_sample=False, model_saving=True, random_state=42):
    assert model_name in all_model_list

    imputer = SimpleImputer(strategy='mean')
    X_train_imputed = imputer.fit_transform(X_train)
    X_test_imputed = imputer.transform(X_test)

    if model_name == "LogisticRegression":
        model = LogisticRegression(max_iter=10000, random_state=random_state)
    elif model_name == "SVM":
        model = SVC(random_state=random_state,probability=True)
    elif model_name == "MLP":
        model = MLPClassifier(hidden_layer_sizes=(64, 32), activation='relu', solver='adam', max_iter=1000, random_state=random_state)
    elif model_name == "RandomForest":
        model = RandomForestClassifier(n_jobs=3, n_estimators=1000, criterion="gini", random_state=random_state)
        model_fullname = "Random Forest"
    elif model_name == "XGBoost":
        model = xgb.XGBClassifier(random_state=random_state)
        model_fullname = "XGBoost"
    else:
        raise Exception

    if model_name == "XGBoost":
        model, y_pred, accuracy_num, gmeans_num, accuracy_num, roc_auc, best_threshold_num, cfc_report_dict = train_model(model, X_train, y_train, X_test, y_test, over_sample) # No need to impute nan values for XGBoost

    else:
        model, y_pred, accuracy_num, gmeans_num, accuracy_num, roc_auc, best_threshold_num, cfc_report_dict = train_model(model, X_train_imputed, y_train, X_test_imputed, y_test, over_sample)


    save_dir = f"../output/{dataset_name}/saved"
    check_directory(save_dir)

    if model_saving:
        if model_name == "XGBoost":
            save_model(model, None, target_column, dataset_name, model_name, save_dir) # No need to impute nan values for XGBoost
        else:
            save_model(model, imputer, target_column, dataset_name, model_name, save_dir)

    if report_gen:
        if model_name in valid_report_list:
            if model_name == "RandomForest":
                importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': model.feature_importances_})
                report = importance_df.sort_values(by='Importance', ascending=False)
            elif model_name == "XGBoost":
                importance_df = pd.DataFrame(list(model.get_booster().get_score().items()), columns=["Feature","Importance"])
                report = importance_df.sort_values(by='Importance', ascending=False)
            else:
                raise Exception

            report["Feature_Name"] = report["Feature"].apply(report_modify)
            report = report.reindex(columns=["Feature","Feature_Name","Importance"])
            report.to_csv(os.path.join(save_dir, f"{target_column}-{dataset_name}-report-{model_name}.csv"), index=False)
        else:
            report = None
            print("No Valid Report for Current Model")

    return model, y_pred, report, model_fullname, cfc_report_dict, accuracy_num, gmeans_num, accuracy_num, roc_auc, best_threshold_num



# Save the trained model and nan-value imputer
def save_model(model, imputer, target_column, dataset_name, model_name, save_dir):
    data = {
    "model": model,
    "imputer": imputer
    }
    with open(os.path.join(save_dir, f"{target_column}-{dataset_name}-trained-{model_name}.pkl"), 'wb') as file:
        pickle.dump(data, file)

# Modify the feature-importance report by adding an industry-correspondence introduction column
def report_modify(value):
    splitted = value.split("-")
    if splitted[0] in ["Emp","Est","Pay"]:
        try:
            modified = splitted[0]+"-"+INDUSTRIES_DICT[splitted[1]]+"-"+splitted[2]
        except:
            modified = value
        return modified
    else:
        return value


def report_generator(cfc_report_dict, model_fullname, model_name, gmeans_num, accuracy_num, roc_auc, best_threshold_num):
    #transfrom report from dictionary to df
    df_report = pd.DataFrame.from_dict(cfc_report_dict).transpose()

    #adjust data display format for md and yaml
    df_report['support'] = df_report['support'].astype(int)
    df_report.iloc[:, 0:3] = df_report.iloc[:, 0:3].round(2)
    df_report.iloc[2,0] = " "
    df_report.iloc[2,1] = " "
    df_report.iloc[2,3] = df_report.iloc[3,3]

    #edit roc_auc format
    roc_auc = roc_auc *100

    #covert numpy float to python float for yaml display
    roc_auc = roc_auc.item()
    best_threshold_num = best_threshold_num.item()
    gmeans_num = gmeans_num.item()

    #markdown file content
    markdown_content = f"""
## {model_fullname} Accuracy

**ROC-AUC Score:** {roc_auc}% &nbsp;&nbsp; **Best Threshold:** {best_threshold_num} &nbsp;&nbsp; **G-Mean:** {gmeans_num} &nbsp;&nbsp; **Model Accuracy:** {accuracy_num}%

                    Precision   Recall      F1-Score    Support

    0               {df_report.iloc[0,0]}        {df_report.iloc[0,1]}        {df_report.iloc[0,2]}        {df_report.iloc[0,3]}
    1               {df_report.iloc[1,0]}        {df_report.iloc[1,1]}        {df_report.iloc[1,2]}        {df_report.iloc[1,3]}

    Accuracy                                {df_report.iloc[2,2]}        {df_report.iloc[3,3]}
    Macro Avg       {df_report.iloc[3,0]}        {df_report.iloc[3,1]}        {df_report.iloc[3,2]}        {df_report.iloc[3,3]}
    Weighted Avg    {df_report.iloc[4,0]}        {df_report.iloc[4,1]}        {df_report.iloc[4,2]}        {df_report.iloc[3,3]}
"""

    #yaml output dictionary
    report_dict = {
    "model_fullname": model_fullname,
    "roc_auc": roc_auc,
    "best_threshold_num": best_threshold_num,
    "gmeans_num": gmeans_num,
    "accuracy_num": accuracy_num,
    "classification_report": df_report.to_dict(orient="index")
    }

    with open(f'{model_name}_accuracy.md','w') as markdown_file:
        markdown_file.write(markdown_content)

    with open(f'{model_name}_accuracy.yaml', "w") as f:
        yaml.dump(report_dict, f, default_flow_style=False)

In [12]:
# Read the integrated full dataset and do the train-test splitting and save the splitted files
if save_training:
  integrated_df = pd.read_csv(os.path.join(full_save_dir, f"{target_column}-{dataset_name}.csv"))

X_total, y_total = df.iloc[:, feature_start_idx:], df.iloc[:, target_idx] #X_total, y_total = integrated_df.iloc[:, feature_start_idx:], integrated_df.iloc[:, target_idx]
X_train, X_test, y_train, y_test = train_test_split(X_total, y_total, test_size=0.2, random_state=random_state)
X_train.to_csv(os.path.join(full_save_dir, f"{target_column}-{dataset_name}-X-train.csv"), index=False)
X_test.to_csv(os.path.join(full_save_dir, f"{target_column}-{dataset_name}-X-test.csv"), index=False)
y_train.to_csv(os.path.join(full_save_dir, f"{target_column}-{dataset_name}-y-train.csv"), index=False)
y_test.to_csv(os.path.join(full_save_dir, f"{target_column}-{dataset_name}-y-test.csv"), index=False)

if save_training:
  file_path = os.path.join(full_save_dir, f"X_train.csv")
  X_train.to_csv(file_path, index=False)

  file_path = os.path.join(full_save_dir, f"X_test.csv")
  X_test.to_csv(file_path, index=False)

  file_path = os.path.join(full_save_dir, f"y_train.csv")
  y_train.to_csv(file_path, index=False)

  file_path = os.path.join(full_save_dir, f"y_test.csv")
  y_test.to_csv(file_path, index=False)


Model training, testing and results saving:

In [13]:
# Training Random Forest
model, y_pred, report, model_fullname, cfc_report_dict, accuracy_num, gmeans_num, accuracy_num, roc_auc, best_threshold_num= train("RandomForest", target_column, dataset_name, X_train, y_train, X_test, y_test,
      report_gen=True, all_model_list=all_model_list, valid_report_list=valid_report_list, over_sample=False, model_saving=True, random_state=random_state)

Model Fitted Successfully.
[1mROC-AUC Score[0m 		: 56.00000000000001 %
[1mBest Threshold[0m 		: 0.639 
[1mG-Mean[0m 			: 0.551
[1mModel Accuracy[0m 		: 65.0 %
[1m
Classification Report:[0m
              precision    recall  f1-score   support

         0.0       0.51      0.11      0.18       198
         1.0       0.67      0.94      0.78       372

    accuracy                           0.65       570
   macro avg       0.59      0.53      0.48       570
weighted avg       0.61      0.65      0.57       570

Directory '../output/bees/saved' created successfully.


In [14]:
# Generate markdown and yaml file
# Results will appear in the content folder to the left
report_generator(cfc_report_dict, model_fullname, "RandomForest", gmeans_num, accuracy_num, roc_auc, best_threshold_num)

In [15]:
# Generating dummy values to handle the categorical columns: 'Population-2018', 'Population-2019', 'Population-2020'
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

#Train XGBoost model
model, y_pred, report, model_fullname, cfc_report_dict, accuracy_num, gmeans_num, accuracy_num, roc_auc, best_threshold_num  = train("XGBoost", target_column, dataset_name, X_train, y_train, X_test, y_test,
      report_gen=True, all_model_list=all_model_list, valid_report_list=valid_report_list, over_sample=False, model_saving=True, random_state=random_state)

Model Fitted Successfully.
[1mROC-AUC Score[0m 		: 51.0 %
[1mBest Threshold[0m 		: 0.798 
[1mG-Mean[0m 			: 0.516
[1mModel Accuracy[0m 		: 56.99999999999999 %
[1m
Classification Report:[0m
              precision    recall  f1-score   support

         0.0       0.33      0.22      0.26       198
         1.0       0.65      0.76      0.70       372

    accuracy                           0.57       570
   macro avg       0.49      0.49      0.48       570
weighted avg       0.53      0.57      0.55       570

Directory '../output/bees/saved' already exists.


In [16]:
# Generate Report for XGBoost
report_generator(cfc_report_dict, "XGBoost", "XGBoost", gmeans_num, accuracy_num, roc_auc, best_threshold_num)