# Comparison of models on CropData

In [1]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

## Data transformation and OneHotEncoding

In [2]:
df = pd.read_excel('files/food-twentieth-century-crop-statistics-1900-2017-xlsx.xlsx', sheet_name="CropStats")

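# Drop columns that are not used as features
df_transformed = df.drop(['Unnamed: 0', 'admin2', 'notes'], axis=1)
# Use the country (admin0) where the region (admin1) is missing
df_transformed['admin1'] = df_transformed['admin1'].fillna(df_transformed['admin0'])

# Derive a missing yield as production / area where both are known and the area is non-zero
area = df_transformed['hectares (ha)']
production = df_transformed['production (tonnes)']
derivable = df_transformed['yield(tonnes/ha)'].isna() & area.notna() & production.notna() & (area != 0)
df_transformed.loc[derivable, 'yield(tonnes/ha)'] = production[derivable] / area[derivable]

# Back-fill the remaining gaps, then drop the helper columns
df_transformed['yield(tonnes/ha)'] = df_transformed['yield(tonnes/ha)'].bfill()
df_transformed = df_transformed.drop(['hectares (ha)', 'production (tonnes)'], axis=1)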



# ---- One-hot encoding ----



# Create an instance of OneHotEncoder
encoder = OneHotEncoder()

# Fit and transform the data
encoded_data = encoder.fit_transform(df_transformed[['admin0', 'admin1', 'crop']])

# Convert the encoded data to a pandas DataFrame
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out(['admin0', 'admin1', 'crop']))

# Get the numerical columns to concatenate with the one-hot encoded df 
df_numerical = df_transformed.drop(['admin0', 'admin1', 'crop'], axis=1)

# Concatenate the encoded data with the original numerical data
final_df = pd.concat([encoded_df, df_numerical], axis=1)


X = final_df.drop('yield(tonnes/ha)', axis=1)
y = final_df['yield(tonnes/ha)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Model Wouter Selis DecisionTreeRegressor

In [38]:
import pickle

# Load the model from disk (only unpickle files from a trusted source)
filename = 'files/cropdata_chosen_model.sav'
with open(filename, 'rb') as f:
    loaded_model_DecisionTreeRegressor = pickle.load(f)
loaded_model_DecisionTreeRegressor

In [39]:
# Make predictions on the test data
y_pred = loaded_model_DecisionTreeRegressor.predict(X_test)

In [40]:
# Convert to discrete values
y_pred_discrete = [int(round(x)) for x in y_pred]
y_test_discrete = [int(round(x)) for x in y_test]

# Share of test rows whose rounded prediction equals the rounded true yield
accuracy_DecisionTreeRegressor = accuracy_score(y_test_discrete, y_pred_discrete)
print("Accuracy:", accuracy_DecisionTreeRegressor)

Accuracy: 0.7593298828657041
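
Accuracy is a classification metric, so after rounding it only counts a prediction as correct when it lands on the same integer as the true yield. The sketch below uses made-up yields (the names `y_true` and `y_hat` and all values are assumptions for illustration, not the crop data) to show how two near-perfect predictions can still count as misses once both sides are rounded.

```python
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up yields in tonnes/ha, purely for illustration
y_true = [1.49, 2.51, 3.20, 4.00]
y_hat = [1.51, 2.49, 3.90, 4.40]

# Same rounding step as in the cell above
y_true_discrete = [int(round(v)) for v in y_true]
y_hat_discrete = [int(round(v)) for v in y_hat]

# The first two predictions are off by only 0.02 but round to a different integer
acc = accuracy_score(y_true_discrete, y_hat_discrete)
mae = mean_absolute_error(y_true, y_hat)

print("Accuracy on rounded values:", acc)
print("MAE on continuous values:", mae)
```

For a regression target, the continuous error metrics in the comparison section are usually the more informative view.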


## Model Laurens PyCaret RandomForestRegressor

In [15]:
# Import only the PyCaret functions used below instead of a star import
from pycaret.regression import setup, predict_model, pull

In [41]:
filename = 'files/cropdata_pycaret_model.sav'
with open(filename, 'rb') as f:
    loaded_model_RandomForestRegressor = pickle.load(f)
loaded_model_RandomForestRegressor

In [42]:
s = setup(df_transformed, target='yield(tonnes/ha)', session_id=123, numeric_features=['Harvest_year', 'year'], categorical_features=['admin0', 'admin1', 'crop'])
predict_model(loaded_model_RandomForestRegressor, data=df_transformed)
measures_Laurens_RandomForestRegressor = pull()
measures_Laurens_RandomForestRegressor['MAE']


| | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Target | yield(tonnes/ha) |
| 2 | Target type | Regression |
| 3 | Original data shape | (36707, 6) |
| 4 | Transformed data shape | (36707, 34) |
| 5 | Transformed train set shape | (25694, 34) |
| 6 | Transformed test set shape | (11013, 34) |
| 7 | Numeric features | 2 |
| 8 | Categorical features | 3 |
| 9 | Preprocess | True |


| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest Regressor | 0.1436 | 0.074 | 0.272 | 0.9849 | 0.0707 | 0.0727 |


0    0.1436
Name: MAE, dtype: float64

## Kieran AWS Model

# Comparison



## Mean Absolute Error (MAE)

MAE is the average absolute difference between predicted and actual values, so a lower score means better performance. Comparing our two models, the RandomForestRegressor (0.1436) clearly beats the DecisionTreeRegressor (0.2788).
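
As a minimal sketch of what the metric does, the snippet below computes MAE by hand and checks it against scikit-learn. The arrays `y_true` and `y_hat` are made-up values for illustration, not taken from the crop data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted yields (assumption, for illustration only)
y_true = np.array([2.0, 3.5, 1.0, 4.0])
y_hat = np.array([2.5, 3.0, 1.0, 5.0])

# MAE: mean of the absolute differences
mae_manual = np.mean(np.abs(y_true - y_hat))
mae_sklearn = mean_absolute_error(y_true, y_hat)

print("Manual MAE:", mae_manual)
print("scikit-learn MAE:", mae_sklearn)
```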

In [43]:
from sklearn.metrics import mean_absolute_error
mae_model_1 = mean_absolute_error(y_test_discrete, y_pred_discrete)
print("Wouter DecisionTreeRegressor model:", mae_model_1)

print("Laurens Pycaret RandomForestRegressor model: ", measures_Laurens_RandomForestRegressor['MAE'])

Wouter DecisionTreeRegressor model: 0.27880686461454646
Laurens Pycaret RandomForestRegressor model:  0    0.1436
Name: MAE, dtype: float64


## Mean Squared Error (MSE)

MSE is the average squared difference between predicted and actual values, so it penalizes large errors more heavily than MAE does. The RandomForestRegressor again comes out ahead with the lower value (0.074 versus 0.4147).
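
To illustrate how squaring changes the picture, here is a small sketch with two made-up error patterns (the arrays `errors_a` and `errors_b` are assumptions for illustration). Both have the same MAE, but the one with a single large miss has a much higher MSE.

```python
import numpy as np

# Two hypothetical sets of prediction errors with the same total absolute error
errors_a = np.array([1.0, 1.0, 1.0, 1.0])  # four small misses
errors_b = np.array([0.0, 0.0, 0.0, 4.0])  # one large miss

mae_a, mae_b = np.mean(np.abs(errors_a)), np.mean(np.abs(errors_b))
mse_a, mse_b = np.mean(errors_a ** 2), np.mean(errors_b ** 2)

print("MAE:", mae_a, mae_b)
print("MSE:", mse_a, mse_b)
```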

In [44]:
from sklearn.metrics import mean_squared_error

mse_model_1 = mean_squared_error(y_test_discrete, y_pred_discrete)
print("Wouter DecisionTreeRegressor model:", mse_model_1)
print("Laurens Pycaret RandomForestRegressor model: ", measures_Laurens_RandomForestRegressor['MSE'])

Wouter DecisionTreeRegressor model: 0.4147371288477254
Laurens Pycaret RandomForestRegressor model:  0    0.074
Name: MSE, dtype: float64


## Root Mean Squared Error (RMSE)

RMSE is the square root of the MSE, which puts the error back in the same unit as the target (tonnes/ha). Like MSE, it penalizes large errors more than MAE does. A lower RMSE means that predictions are, on average, closer to the actual values.
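
The relationship between the three error metrics can be sketched on a tiny made-up example (the arrays `y_true` and `y_hat` are assumptions for illustration). With one large miss, RMSE ends up above MAE because the squaring step weights that miss more heavily.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values: three exact predictions and one miss of 4
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([10.0, 12.0, 14.0, 20.0])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)  # back in the unit of the target
mae = mean_absolute_error(y_true, y_hat)

print("MSE:", mse)
print("RMSE:", rmse)
print("MAE:", mae)
```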

In [45]:
import numpy as np

rmse_model_1 = np.sqrt(mse_model_1)
print("Wouter DecisionTreeRegressor model:", rmse_model_1)
rmse_model_2 = np.sqrt(measures_Laurens_RandomForestRegressor['MSE'])
print("Laurens Pycaret RandomForestRegressor model: ", rmse_model_2)

Wouter DecisionTreeRegressor model: 0.6440008764339731
Laurens Pycaret RandomForestRegressor model:  0    0.272029
Name: MSE, dtype: float64


## R-squared (R2)

R² is the proportion of the variance in the target that the model explains. A value of 1 means a perfect fit, and 0 means the model does no better than always predicting the mean. Higher values indicate a better fit.
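
A minimal sketch of the underlying formula, R² = 1 - SS_res / SS_tot, computed by hand and compared with scikit-learn. The arrays are made-up values for illustration, and the variable names are assumptions rather than anything from the notebook.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values (assumption, for illustration only)
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_hat)

# Baseline: always predicting the mean gives an R² of 0
r2_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print("Manual R2:", r2_manual)
print("scikit-learn R2:", r2_sklearn)
print("Mean-baseline R2:", r2_baseline)
```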


In [37]:
from sklearn.metrics import r2_score

r2_model_1 = r2_score(y_test_discrete, y_pred_discrete)
print("Wouter DecisionTreeRegressor model:", r2_model_1)

print("Laurens Pycaret RandomForestRegressor model: ", measures_Laurens_RandomForestRegressor['R2'])

Wouter DecisionTreeRegressor model: 0.9181124392342246
Laurens Pycaret RandomForestRegressor model:  0    0.9849
Name: R2, dtype: float64


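# Conclusion
On every metric computed here, Laurens's model scores better. This is plausible because PyCaret compares a wide range of candidate models and keeps the one that fits this dataset best. Wouter's DecisionTreeRegressor is among the model types PyCaret evaluates, so when PyCaret settles on a different model, that model scored better than the decision tree in PyCaret's own comparison. Note that the two sets of numbers are not computed identically: the decision tree is scored on rounded predictions for the 20% test split, while `predict_model` is run on the full `df_transformed`, so part of the gap may come from the evaluation setup.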
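
One way to put two regressors on equal footing is to score both on the same held-out rows using their continuous predictions. The sketch below does this on synthetic data from `make_regression`, not on the crop dataset. The helper `regression_report` and all variable names are hypothetical, and the two models are freshly fitted stand-ins rather than the pickled models loaded above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def regression_report(model, X_eval, y_eval):
    """Hypothetical helper: score a fitted model on its continuous predictions."""
    y_hat = model.predict(X_eval)
    mse = mean_squared_error(y_eval, y_hat)
    return {
        "MAE": mean_absolute_error(y_eval, y_hat),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_eval, y_hat),
    }


# Synthetic stand-in data (assumption): not the crop dataset
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Fit both model types on the same training rows
models = {
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr),
}

# Score both on the same held-out rows
reports = {name: regression_report(model, X_te, y_te) for name, model in models.items()}
for name, report in reports.items():
    print(name, report)
```

Applied to this notebook, the same idea would mean passing identical held-out rows to both loaded models and skipping the rounding step.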