# **PyCaret Fundamentals**

PyCaret is an open-source, low-code machine learning library in Python that simplifies the process of training and deploying models. In this notebook, we will explore how to use PyCaret for regression tasks.

---

#### **1. Loading a Dataset from PyCaret**
Before training models, we need a dataset. PyCaret provides built-in datasets that we can use for demonstration.


Calling `get_data('traffic')` loads a dataset for predicting traffic volume.

Its features include temperature, rainfall, snowfall, cloud coverage, the main weather condition, a holiday field, and a rush-hour indicator.

In [1]:
!pip install --pre pycaret

Collecting pycaret
  Downloading pycaret-3.3.2-py3-none-any.whl.metadata (17 kB)
Collecting pandas<2.2.0 (from pycaret)
  Downloading pandas-2.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting scipy<=1.11.4,>=1.6.1 (from pycaret)
  Downloading scipy-1.11.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting joblib<1.4,>=1.2.0 (from pycaret)
  Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting pyod>=1.1.3 (from pycaret)
  Downloading pyod-2.0.3.tar.gz (169 kB)
  Preparing metadata (setup.py) ... done
Collecting category-encoders>=2.4.0 (from pycaret)
  Downloading category_encoders-2.8.0-py3-none-any.whl.metadata (7.9 kB)
Collecti

In [1]:
from pycaret.datasets import get_data
data1 = get_data(dataset = 'traffic')

,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,Rush Hour,traffic_volume
0,,288.28,0.0,0.0,40,Clouds,1,5545
1,,289.36,0.0,0.0,75,Clouds,0,4516
2,,289.58,0.0,0.0,90,Clouds,0,4767
3,,290.13,0.0,0.0,90,Clouds,0,5026
4,,291.14,0.0,0.0,75,Clouds,0,4918


#### **2. Importing the PyCaret Regression Module**
To use PyCaret for regression tasks, we need to import the regression module.

In [2]:
from pycaret.regression import *

#### **3. Setting Up the PyCaret Environment**  
The `setup` function initializes PyCaret's machine learning environment by performing **data preprocessing, feature engineering, and model preparation**.

#### **Explanation of Parameters:**
- **`data`**: The dataset that will be used for modeling.
- **`target`**: The column name that represents the target variable (the variable we want to predict).
- **`session_id`**: A fixed number to ensure reproducibility of results.
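- **`silent`**: Not used here. This parameter belonged to PyCaret 2.x, where `silent=True` skipped the interactive data-type confirmation. It was removed in PyCaret 3.x, so it is not passed in the call below.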

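#### **What Happens During Setup?**
- It **infers numerical and categorical features** in the dataset.
- It **imputes missing values** (by default, the mean for numeric features and the most frequent value for categorical ones).
- It **encodes categorical variables**, for example with one-hot encoding.
- It **splits the data** into a training set and a hold-out test set.
- Outlier removal, feature scaling, and feature selection are **optional** and off by default; they can be enabled with the `remove_outliers`, `normalize`, and `feature_selection` arguments.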

After running `setup`, PyCaret provides a **summary of all preprocessing steps** before proceeding to model training.
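
To make the default preprocessing more concrete, here is a minimal plain-Python sketch of three of the steps: mean imputation for a numeric column, most-frequent imputation for a categorical column, and one-hot encoding. It is an illustration under simplifying assumptions, not PyCaret's implementation; the helper names (`impute_mean`, `impute_most_frequent`, `one_hot`) and the toy values are hypothetical.

```python
from collections import Counter

def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def impute_most_frequent(values):
    """Replace None with the most common observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def one_hot(values):
    """Turn one categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in categories}

# Toy columns (hypothetical values, not taken from the traffic dataset).
temp = [288.0, None, 290.0, 289.0]
weather = ["Clouds", None, "Clouds", "Rain"]

temp_filled = impute_mean(temp)                 # None -> 289.0
weather_filled = impute_most_frequent(weather)  # None -> "Clouds"
encoded = one_hot(weather_filled)

print(temp_filled)
print(encoded)
```

Because one-hot encoding creates one column per category, a transformed dataset can end up with more columns than the original.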


In [4]:
dataset = setup(data=data1, target='traffic_volume', session_id=438)

,Description,Value
0,Session id,438
1,Target,traffic_volume
2,Target type,Regression
3,Original data shape,"(48204, 8)"
4,Transformed data shape,"(48204, 28)"
5,Transformed train set shape,"(33742, 28)"
6,Transformed test set shape,"(14462, 28)"
7,Numeric features,5
8,Categorical features,2
9,Rows with missing values,99.9%
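
The train and test shapes in the summary are consistent with a 70/30 split of the 48,204 rows. The sketch below reproduces that arithmetic. The 0.7 train fraction and the round-up rule for the test set are assumptions about the defaults; the output itself does not state them.

```python
import math

n_rows = 48204          # "Original data shape" from the setup summary
train_fraction = 0.7    # assumed default train fraction

# Assumption: the test set takes the ceiling of its share, the train set the rest.
n_test = math.ceil(n_rows * (1 - train_fraction))
n_train = n_rows - n_test

print(n_train, n_test)  # 33742 14462
```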


#### **4. Comparing Models**
The `compare_models()` function automatically trains and evaluates multiple regression models to find the best one.

#### **Key Points:**
- Trains a set of candidate models with cross-validation and ranks them by performance.
- Reports **MAE, MSE, RMSE, R², RMSLE, and MAPE**, sorting by **R²** by default.
- Returns the best model for further tuning.

This saves time and simplifies model selection.
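
For readers who want to see what the main leaderboard metrics measure, the following plain-Python sketch computes MAE, MSE, RMSE, and R² and ranks two candidates by R². The function `regression_metrics`, the model names, and the numbers are hypothetical; this is not how PyCaret computes its scores internally.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Hypothetical targets and predictions from two imaginary models.
y_true = [3.0, 5.0, 7.0, 9.0]
candidates = {
    "model_a": [2.0, 5.0, 8.0, 9.0],
    "always_mean": [6.0, 6.0, 6.0, 6.0],  # baseline that predicts the mean
}

scores = {name: regression_metrics(y_true, preds) for name, preds in candidates.items()}

# Rank by R2, highest first.
ranking = sorted(scores, key=lambda name: scores[name]["R2"], reverse=True)
print(ranking)  # ['model_a', 'always_mean']
```

A model that always predicts the mean of the targets scores an R² of 0 on that data, which is why a dummy baseline sits near zero on a leaderboard; a negative R² means a model did worse than that baseline.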


In [5]:
best = compare_models()

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
xgboost,Extreme Gradient Boosting,1486.0287,3069799.6509,1751.9876,0.2207,0.9235,2.4047,0.469
lightgbm,Light Gradient Boosting Machine,1502.9547,3079513.6446,1754.7529,0.2183,0.9296,2.4701,1.077
gbr,Gradient Boosting Regressor,1531.0211,3150013.956,1774.7161,0.2004,0.9422,2.5583,2.705
ada,AdaBoost Regressor,1577.2863,3282090.0319,1811.574,0.1669,0.973,2.7074,1.345
knn,K Neighbors Regressor,1573.6762,3662855.0418,1913.7557,0.0702,0.9601,2.5601,1.007
rf,Random Forest Regressor,1547.2748,3741181.5363,1933.9648,0.0502,0.9554,2.4877,10.516
omp,Orthogonal Matching Pursuit,1712.8788,3848922.2942,1961.8288,0.0229,1.0224,2.964,0.184
dummy,Dummy Regressor,1743.7632,3939897.8403,1984.8734,-0.0001,1.0318,2.9022,0.143
et,Extra Trees Regressor,1678.7005,4588616.2268,2141.9299,-0.165,1.048,2.6702,7.161
dt,Decision Tree Regressor,1746.3217,5206293.5785,2281.4939,-0.3218,1.1239,2.5748,0.291
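
One way to read the leaderboard is to compare the top model with the Dummy Regressor baseline. The short calculation below uses only the two RMSE values from the table; the variable names are illustrative.

```python
# RMSE values copied from the leaderboard above.
rmse_xgboost = 1751.9876
rmse_dummy = 1984.8734

# Relative reduction in RMSE compared with the baseline.
improvement = (rmse_dummy - rmse_xgboost) / rmse_dummy
print(f"{improvement:.1%}")  # 11.7%
```

A reduction of roughly 12% over the baseline, together with an R² of about 0.22, suggests that the available features explain only a modest share of the variation in traffic volume.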



#### **5. Tuning the Best Model**
The `tune_model()` function optimizes the best model found using `compare_models()` by adjusting its hyperparameters.

#### **Key Points:**
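- **Searches for better hyperparameters** using cross-validation, by default with a random search over 10 candidate settings.
- **Optimizes R²** by default for regression.
- **Returns the tuned model**, or the original one when tuning does not improve the score.

Tuning is not guaranteed to help: in the run below, the original model scored better than the tuned one, so `tune_model()` returned the original.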


In [6]:
tuned_best = tune_model(estimator = best)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,1490.3054,3061569.0259,1749.734,0.2117,0.9331,5.0053
1,1520.1414,3150344.7758,1774.9211,0.203,0.9396,2.7265
2,1499.9276,3085576.1753,1756.5808,0.2226,0.9353,2.2002
3,1493.9599,3084641.7898,1756.3148,0.2128,0.9265,3.1083
4,1509.0597,3112668.3918,1764.2756,0.2302,0.9286,1.7668
5,1490.7023,3057320.6982,1748.5196,0.2139,0.9282,2.2817
6,1461.7845,2943290.1647,1715.602,0.2431,0.9033,1.4892
7,1513.2775,3144289.8509,1773.2146,0.194,0.9365,2.3216
8,1526.8819,3194609.5531,1787.3471,0.1873,0.934,1.5868
9,1531.3525,3180566.9799,1783.4144,0.2082,0.933,2.1803



Fitting 10 folds for each of 10 candidates, totalling 100 fits


Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
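
The log above reports 10 folds for each of 10 candidates, 100 fits in total, which is the shape of a random search scored by cross-validation. The following is a minimal sketch of that idea on a toy one-parameter model. The model, the `alpha` parameter, and the helper names are hypothetical and unrelated to XGBoost's actual hyperparameters or to PyCaret's internals.

```python
import random

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_size = n // k
    for i in range(k):
        test_idx = list(range(i * fold_size, (i + 1) * fold_size))
        train_idx = [j for j in range(n) if j not in test_idx]
        yield train_idx, test_idx

def cv_mse(alpha, y, k, counter):
    """Mean squared error of a toy model, averaged across k folds.

    The toy model predicts alpha * mean(training targets).
    """
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        counter["fits"] += 1
        pred = alpha * sum(y[j] for j in train_idx) / len(train_idx)
        fold_scores.append(sum((y[j] - pred) ** 2 for j in test_idx) / len(test_idx))
    return sum(fold_scores) / k

rng = random.Random(0)
y = [rng.gauss(10.0, 2.0) for _ in range(50)]  # hypothetical targets

# Score the original setting (not counted among the search fits).
original_alpha = 1.0
original_score = cv_mse(original_alpha, y, k=10, counter={"fits": 0})

# Random search: 10 candidates, each scored with 10-fold cross-validation.
counter = {"fits": 0}
candidates = [rng.uniform(0.0, 2.0) for _ in range(10)]
scored = [(cv_mse(a, y, k=10, counter=counter), a) for a in candidates]
best_score, best_alpha = min(scored)

# Keep the original setting when no candidate beats it.
chosen_alpha = best_alpha if best_score < original_score else original_alpha
print(counter["fits"], chosen_alpha)
```

The final comparison mirrors the message in the log: when no sampled candidate beats the starting point, the original setting is kept.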


#### **6. Finalizing the Model**
The `finalize_model()` function refits the chosen model on the complete dataset so it can be used for predictions on new data.

#### **Key Points:**
- Retrains the model pipeline on the **entire dataset**, including the hold-out test set.
- Produces the version of the model intended for **deployment**.
- Leaves no held-out data, so metrics computed on the old test set afterwards are no longer an unbiased estimate.

Run this step last, after model selection and evaluation are complete.


In [7]:
final_model = finalize_model(estimator = tuned_best)
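
The effect of refitting on all rows can be shown with a deliberately trivial model. In this sketch the "model" is just the mean of the targets; the data, the 70% split, and the function name `fit_mean` are hypothetical and chosen only to show that the finalized fit sees the held-out rows too.

```python
def fit_mean(targets):
    """'Train' a trivial model: remember the mean of the targets."""
    return sum(targets) / len(targets)

# Hypothetical target values.
targets = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0]

split = len(targets) * 7 // 10        # first 70% for training, rest held out
train, holdout = targets[:split], targets[split:]

model_during_experiment = fit_mean(train)   # sees 7 rows
model_finalized = fit_mean(targets)         # refit on all 10 rows

print(model_during_experiment, model_finalized)  # 16.0 19.0
```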

#### **7. Saving the Model**
The `save_model()` function saves the finalized model to a file for future use.

#### **Key Points:**
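- Saves the **preprocessing pipeline and the trained model** together as a pickle file; the `.pkl` extension is appended to the name you pass.
- The model can be **reloaded** later without retraining.
- Useful for **deployment and sharing**.

The saved file can be loaded at any time with `load_model()`. Note that the file name `rf_base_traffic` used below is only a label: the model being saved is the XGBoost regressor selected earlier, not a random forest.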


In [9]:
save_model(final_model, 'rf_base_traffic')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['temp', 'rain_1h', 'snow_1h',
                                              'clouds_all', 'Rush Hour'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=['holiday', 'weather_main'],
                                     transformer=SimpleImputer(strategy='most_frequent'))),
                 ('onehot_encoding',
                  TransformerWrapper(inclu...
                               feature_types=None, gamma=None, grow_policy=None,
                               importance_type=None,
                               interaction_constraints=None, learning_rate=None,
                               max_bin=None, max_cat_threshold=None,
                               max_cat_to_onehot=None, max_delta_step=None,
                               max_depth=None, max_le

#### Predictions

#### **8. Loading the Saved Model**
The `load_model()` function reloads a previously saved model for making predictions without retraining.

#### **Key Points:**
- Loads the trained model from the specified file.
- Allows **reuse of the model** without repeating the training process.
- Ensures **consistency in predictions** across different sessions.

This is useful for **deploying the model** in real-world applications.


In [13]:
saved_model = load_model('rf_base_traffic')

Transformation Pipeline and Model Successfully Loaded
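
The save-and-load round trip can be sketched with Python's standard `pickle` module. This assumes pickle-style serialization, which is an assumption about how the saved file is written rather than something the output confirms; the dictionary below is a hypothetical stand-in for a fitted pipeline.

```python
import os
import pickle
import tempfile

# A stand-in for a fitted pipeline (hypothetical; a real one would be a Pipeline object).
fitted_model = {"name": "toy_model", "coefficients": [0.5, -1.25], "intercept": 3.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")

    # Serialize the object to a binary file.
    with open(path, "wb") as f:
        pickle.dump(fitted_model, f)

    # Read it back, as a later session would.
    with open(path, "rb") as f:
        restored_model = pickle.load(f)

print(restored_model == fitted_model)  # True
```

Unpickling can execute arbitrary code, so only load model files that come from a source you trust.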


#### **9. Making Predictions**
The `predict_model()` function uses the trained model to make predictions on new data.

#### **Key Points:**
- Takes a **trained model** (here, the one loaded from disk) and a **new dataset** as input.
- Returns the input DataFrame with a **`prediction_label`** column holding the predicted values.
- Works without needing to retrain the model.

Calling `head()` on the result displays its first few rows.


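#### **Creating the New Dataset**
The new data must contain the same feature columns as the training data, without the target column.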

In [15]:
import pandas as pd

new_data = pd.DataFrame({
    'holiday': [None, None, None, None, None],  # No holiday on these rows (missing, as in the original data)
    'temp': [288.28, 289.36, 289.58, 290.13, 291.14],  # Temperature values
    'rain_1h': [0.0, 0.0, 0.0, 0.0, 0.0],  # Rain in the last hour
    'snow_1h': [0.0, 0.0, 0.0, 0.0, 0.0],  # Snow in the last hour
    'clouds_all': [40, 75, 90, 90, 75],  # Cloud percentage
    'weather_main': ['Clouds', 'Clouds', 'Clouds', 'Clouds', 'Clouds'],  # Weather condition
    'Rush Hour': [1, 0, 0, 0, 0]  # Indicator for rush hour
})

# Display the DataFrame
print(new_data)



  holiday    temp  rain_1h  snow_1h  clouds_all weather_main  Rush Hour
0    None  288.28      0.0      0.0          40       Clouds          1
1    None  289.36      0.0      0.0          75       Clouds          0
2    None  289.58      0.0      0.0          90       Clouds          0
3    None  290.13      0.0      0.0          90       Clouds          0
4    None  291.14      0.0      0.0          75       Clouds          0


In [16]:
predictions = predict_model(saved_model, data=new_data)
predictions.head()

,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,Rush Hour,prediction_label
0,,288.279999,0.0,0.0,40,Clouds,1,4633.11377
1,,289.359985,0.0,0.0,75,Clouds,0,3285.329102
2,,289.579987,0.0,0.0,90,Clouds,0,3202.917236
3,,290.130005,0.0,0.0,90,Clouds,0,3072.800049
4,,291.140015,0.0,0.0,75,Clouds,0,3251.032715
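
The `temp` values in the output differ slightly from the inputs (for example, 288.28 appears as 288.279999). This is what happens when a 64-bit float is stored as a 32-bit float. The sketch below reproduces the effect with the standard library; that the prediction step casts numeric columns to 32-bit floats is an assumption inferred from the output, and `to_float32` is a hypothetical helper.

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

temps = [288.28, 289.36, 289.58, 290.13, 291.14]
for t in temps:
    print(f"{t} -> {to_float32(t):.6f}")
# 288.28 -> 288.279999
# 289.36 -> 289.359985
# 289.58 -> 289.579987
# 290.13 -> 290.130005
# 291.14 -> 291.140015
```

The differences are on the order of 1e-5 and have no practical effect on the predictions.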
