## PyCaret Model Cropdata - Laurens Karakolev

We install the modules needed to use PyCaret.

In [42]:
!pip install pandas numpy
!pip install pycaret






We import the libraries.

In [53]:
from pycaret.regression import *
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import pickle

## Dataset Transformation

Our dataset transformation is explained in our EDA file.

In [44]:
df = pd.read_excel('files/food-twentieth-century-crop-statistics-1900-2017-xlsx.xlsx', sheet_name="CropStats")

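# Drop the columns we do not need.
df_transformed = df.drop(['Unnamed: 0', 'admin2', 'notes'], axis=1)
# Where admin1 is missing, fall back to admin0 (assigning back avoids chained in-place modification).
df_transformed['admin1'] = df_transformed['admin1'].fillna(df_transformed['admin0'])

# Derive a missing yield from production / area when both are known and the area is non-zero.
for index, row in df_transformed.iterrows():
    if pd.notna(row['hectares (ha)']) and pd.notna(row['production (tonnes)']) and pd.isna(row['yield(tonnes/ha)']) and row['hectares (ha)'] != 0:
        df_transformed.at[index, 'yield(tonnes/ha)'] = row['production (tonnes)'] / row['hectares (ha)']

# Back-fill any yield values that are still missing, then drop the helper columns.
df_transformed['yield(tonnes/ha)'] = df_transformed['yield(tonnes/ha)'].bfill()
df_transformed = df_transformed.drop(['hectares (ha)', 'production (tonnes)'], axis=1)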
df_transformed

| | Harvest_year | admin0 | admin1 | crop | year | yield(tonnes/ha) |
|---|---|---|---|---|---|---|
| 0 | 1902 | Austria | Austria | wheat | 1902 | 1.310000 |
| 1 | 1903 | Austria | Austria | wheat | 1903 | 1.470000 |
| 2 | 1904 | Austria | Austria | wheat | 1904 | 1.270000 |
| 3 | 1905 | Austria | Austria | wheat | 1905 | 1.330000 |
| 4 | 1906 | Austria | Austria | wheat | 1906 | 1.280000 |
| ... | ... | ... | ... | ... | ... | ... |
| 36702 | 2013 | China | zhejiang | wheat | 2013 | 3.685117 |
| 36703 | 2014 | China | zhejiang | wheat | 2014 | 3.768875 |
| 36704 | 2015 | China | zhejiang | wheat | 2015 | 3.912027 |
| 36705 | 2016 | China | zhejiang | wheat | 2016 | 3.315054 |
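
The row-by-row loop that derives missing yields can also be written with a boolean mask. The following is a minimal sketch on a made-up toy frame that only reuses the notebook's column names; the values and the variable names `toy` and `can_derive` are assumptions for illustration, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the notebook (values are made up).
toy = pd.DataFrame({
    "hectares (ha)": [100.0, 0.0, 50.0, np.nan],
    "production (tonnes)": [250.0, 10.0, 100.0, 30.0],
    "yield(tonnes/ha)": [np.nan, np.nan, 1.5, np.nan],
})

area = toy["hectares (ha)"]
production = toy["production (tonnes)"]

# Rows where the yield is missing but can be derived from production / area.
can_derive = (
    toy["yield(tonnes/ha)"].isna()
    & area.notna()
    & production.notna()
    & (area != 0)
)

# Fill only those rows; existing yields and underivable rows stay untouched.
toy.loc[can_derive, "yield(tonnes/ha)"] = production[can_derive] / area[can_derive]
print(toy)
```

Only the first row is filled here: the second has a zero area, the third already has a yield, and the fourth has no area at all. On a frame of roughly 36,000 rows a mask like this is usually much faster than `iterrows`.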


We check the data types of our features again to see which are categorical and which are numerical. See below.

In [45]:
df_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36707 entries, 0 to 36706
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Harvest_year      36707 non-null  int64  
 1   admin0            36707 non-null  object 
 2   admin1            36707 non-null  object 
 3   crop              36707 non-null  object 
 4   year              36707 non-null  int64  
 5   yield(tonnes/ha)  36707 non-null  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 1.7+ MB
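
The same split into numerical and categorical features can be derived programmatically instead of being read off the `info()` output. This is a small sketch on a two-row toy frame that mirrors the column names above; the frame itself and the variable names are assumptions for illustration.

```python
import pandas as pd

# Two-row stand-in with the same columns and dtypes as df_transformed.
toy = pd.DataFrame({
    "Harvest_year": [1902, 1903],
    "admin0": ["Austria", "Austria"],
    "admin1": ["Austria", "Austria"],
    "crop": ["wheat", "wheat"],
    "year": [1902, 1903],
    "yield(tonnes/ha)": [1.31, 1.47],
})

target = "yield(tonnes/ha)"

# Numeric columns, minus the target we want to predict.
numeric_cols = toy.select_dtypes(include="number").columns.drop(target).tolist()
# Everything that is not numeric is treated as categorical.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Lists built this way could be passed to the feature arguments of the setup step rather than typed by hand.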


## Set up the PyCaret environment
We initialize the training environment and create the transformation pipeline that prepares the data for modeling and deployment. The target is set to yield(tonnes/ha) because this is what we want to predict.

In [47]:
s = setup(df_transformed, target='yield(tonnes/ha)', session_id=123, numeric_features=['Harvest_year', 'year'], categorical_features=['admin0', 'admin1', 'crop'])


| | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Target | yield(tonnes/ha) |
| 2 | Target type | Regression |
| 3 | Original data shape | (36707, 6) |
| 4 | Transformed data shape | (36707, 34) |
| 5 | Transformed train set shape | (25694, 34) |
| 6 | Transformed test set shape | (11013, 34) |
| 7 | Numeric features | 2 |
| 8 | Categorical features | 3 |
| 9 | Preprocess | True |
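
The train and test set sizes in this table are consistent with a 70/30 split of the 36707 rows. The check below is only a sanity calculation; it assumes a training fraction of 0.7 (the `setup` call above does not override it) and is not a description of PyCaret's internals.

```python
import math

n_rows = 36707      # rows in df_transformed, from the info() output
train_size = 0.7    # assumed training fraction

n_train = math.floor(train_size * n_rows)
n_test = n_rows - n_train

print(n_train, n_test)  # 25694 11013
```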


## Compare the models

Here we compare the results of all the models. The best value for each metric is highlighted in yellow, and we can see that the Random Forest Regressor is the winner.

In [48]:
best = compare_models()
print(best)

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| rf | Random Forest Regressor | 0.2679 | 0.2095 | 0.457 | 0.9574 | 0.114 | 0.1335 | 1.658 |
| et | Extra Trees Regressor | 0.2965 | 0.2687 | 0.5176 | 0.9454 | 0.1239 | 0.1461 | 1.571 |
| lightgbm | Light Gradient Boosting Machine | 0.3788 | 0.3248 | 0.5694 | 0.9339 | 0.153 | 0.2057 | 0.197 |
| dt | Decision Tree Regressor | 0.3176 | 0.3505 | 0.5907 | 0.9288 | 0.1428 | 0.1524 | 0.083 |
| gbr | Gradient Boosting Regressor | 0.533 | 0.5921 | 0.769 | 0.8796 | 0.2048 | 0.2913 | 0.499 |
| knn | K Neighbors Regressor | 0.5303 | 0.6359 | 0.7968 | 0.8708 | 0.2081 | 0.2784 | 0.158 |
| ada | AdaBoost Regressor | 0.8364 | 1.1689 | 1.0809 | 0.7623 | 0.3096 | 0.5721 | 0.502 |
| ridge | Ridge Regression | 0.9884 | 1.7348 | 1.317 | 0.6472 | 0.3433 | 0.6153 | 0.064 |
| br | Bayesian Ridge | 0.9883 | 1.735 | 1.317 | 0.6472 | 0.3432 | 0.6151 | 0.089 |
| lr | Linear Regression | 0.9884 | 1.7348 | 1.317 | 0.6472 | 0.3433 | 0.6153 | 0.512 |


RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='squared_error',
                      max_depth=None, max_features=1.0, max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
                      oob_score=False, random_state=123, verbose=0,
                      warm_start=False)
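
Which model counts as the winner depends on the metric used for ranking. The sketch below re-ranks the top four rows of the comparison table in plain Python; the numbers are copied from the output above, while the `scores` structure is just an illustration and not how PyCaret stores its results.

```python
# (model id, MAE, R2) copied from the comparison table above.
scores = [
    ("rf", 0.2679, 0.9574),
    ("et", 0.2965, 0.9454),
    ("lightgbm", 0.3788, 0.9339),
    ("dt", 0.3176, 0.9288),
]

# Higher R2 is better, lower MAE is better.
by_r2 = [name for name, mae, r2 in sorted(scores, key=lambda s: s[2], reverse=True)]
by_mae = [name for name, mae, r2 in sorted(scores, key=lambda s: s[1])]

print(by_r2)
print(by_mae)
```

Random Forest comes first either way, but the decision tree and LightGBM swap places when ranking by MAE instead of R2, so it is worth checking the metric that matters most for the use case.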


## Test best model

Now we test this model on our dataset again, this time to see exactly what it predicted. The predictions appear in the prediction_label column. Note that df_transformed includes the rows the model was trained on, so the metrics below look better than the cross-validated ones from the comparison.
If this were a classification model, it would also have produced a prediction_score column indicating how confident the model is. PyCaret does not do this for regression models, which makes sense because a regression task has no "score" or "probability" as classification does.

In [49]:
predict_model(best, data=df_transformed)

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest Regressor | 0.1436 | 0.074 | 0.272 | 0.9849 | 0.0707 | 0.0727 |


| | Harvest_year | admin0 | admin1 | crop | year | yield(tonnes/ha) | prediction_label |
|---|---|---|---|---|---|---|---|
| 0 | 1902 | Austria | Austria | wheat | 1902 | 1.310000 | 1.257925 |
| 1 | 1903 | Austria | Austria | wheat | 1903 | 1.470000 | 1.237351 |
| 2 | 1904 | Austria | Austria | wheat | 1904 | 1.270000 | 1.253547 |
| 3 | 1905 | Austria | Austria | wheat | 1905 | 1.330000 | 1.306594 |
| 4 | 1906 | Austria | Austria | wheat | 1906 | 1.280000 | 1.294807 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 36702 | 2013 | China | zhejiang | wheat | 2013 | 3.685117 | 3.720229 |
| 36703 | 2014 | China | zhejiang | wheat | 2014 | 3.768875 | 3.795209 |
| 36704 | 2015 | China | zhejiang | wheat | 2015 | 3.912027 | 3.871542 |
| 36705 | 2016 | China | zhejiang | wheat | 2016 | 3.315054 | 3.461911 |
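
To make the metric columns more concrete, here is a small sketch that recomputes some of them by hand from the first five rows shown above (actual yield versus prediction_label). It uses the textbook definitions of MAE, MSE, RMSE and MAPE in plain Python; R2 and RMSLE are left out for brevity, and five rows are far too few to reproduce the scores reported for the whole dataset.

```python
import math

# First five rows of the prediction output above.
actual = [1.31, 1.47, 1.27, 1.33, 1.28]
predicted = [1.257925, 1.237351, 1.253547, 1.306594, 1.294807]

errors = [a - p for a, p in zip(actual, predicted)]
n = len(errors)

mae = sum(abs(e) for e in errors) / n                         # mean absolute error
mse = sum(e * e for e in errors) / n                          # mean squared error
rmse = math.sqrt(mse)                                         # root mean squared error
mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n  # mean absolute percentage error

print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} MAPE={mape:.4f}")
```

MAE and RMSE are in the same unit as the target (tonnes per hectare), which makes them the easiest to interpret for this dataset.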


We pull the metrics of this last prediction so we can compare them later on.

In [50]:
measures = pull()
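
As an illustration of what such a comparison might look like, the sketch below contrasts the cross-validated Random Forest scores from the model comparison with the scores from the prediction on the full dataset. The numbers are copied from the two outputs above; holding them in plain dictionaries named `cv_scores` and `full_data_scores` is an assumption for the example, whereas `pull()` itself returns the most recently displayed score grid as a DataFrame.

```python
# Random Forest scores copied from the two outputs above.
cv_scores = {"MAE": 0.2679, "MSE": 0.2095, "RMSE": 0.457, "R2": 0.9574, "RMSLE": 0.114, "MAPE": 0.1335}
full_data_scores = {"MAE": 0.1436, "MSE": 0.074, "RMSE": 0.272, "R2": 0.9849, "RMSLE": 0.0707, "MAPE": 0.0727}

# Positive difference = higher on the full dataset than in cross-validation.
difference = {name: round(full_data_scores[name] - cv_scores[name], 4) for name in cv_scores}

for name, delta in difference.items():
    print(f"{name}: {delta:+.4f}")
```

All error metrics drop and R2 rises on the full dataset, which is expected because most of those rows were used to train the model.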

## Save the model

In [51]:
save_model(best, 'files/crop_pycaret_model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['Harvest_year', 'year'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean',
                                                               verbose='deprecated'))),
                 ('categorical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=...
                 ('trained_model',
                  RandomForestRegressor(boot

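Let's also create a pickle file, which takes up far less storage. Note that this file contains only the trained estimator, not the transformation pipeline that save_model stores alongside it.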

In [54]:
filename = 'files/cropdata_pycaret_model.sav'
# Use a context manager so the file is closed even if pickling fails.
with open(filename, 'wb') as f:
    pickle.dump(best, f)
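
Loading the pickle file back works the same way in reverse. The following round-trip sketch uses a plain dictionary named `model_stub` as a hypothetical stand-in for the trained estimator and writes to a temporary directory, so it runs without the notebook's files.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained estimator; any picklable object behaves the same.
model_stub = {"name": "rf", "n_estimators": 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cropdata_pycaret_model.sav")

    # Save the object.
    with open(path, "wb") as f:
        pickle.dump(model_stub, f)
    size_bytes = os.path.getsize(path)

    # Load it back.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stub, size_bytes > 0)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code. Because the notebook's pickle holds only the estimator, new data would need the same preprocessing before it is passed to the restored model.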