# Testing functions

In [1]:
import os
import sys
# Add the parent directory to the import path so the local `utils` package can be found
sys.path.append('..')

from IPython.display import display, Markdown

from utils.llm import chat_completion, achat_completion, format_response

from dotenv import load_dotenv
load_dotenv()

from openai import OpenAI, AsyncOpenAI

client = OpenAI()
aclient = AsyncOpenAI()
pplx_client = OpenAI(api_key=os.getenv('PPLX_API_KEY'), base_url="https://api.perplexity.ai")
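
The helpers imported from `utils.llm` are not shown in this notebook. Judging only from the outputs below, `format_response` appears to return the message text together with a plain usage dictionary. The following is a minimal sketch under that assumption; `format_response_sketch` and `FakeUsage` are hypothetical names used for illustration and are not the actual implementation in `utils.llm`.

```python
from dataclasses import dataclass, asdict
from types import SimpleNamespace


@dataclass
class FakeUsage:
    """Stand-in for the SDK usage object; only model_dump() is mimicked."""
    completion_tokens: int
    prompt_tokens: int
    total_tokens: int

    def model_dump(self):
        return asdict(self)


def format_response_sketch(response):
    """Hypothetical helper: return (message text, usage as a dict)."""
    content = response.choices[0].message.content
    # Some providers may omit usage, so fall back to an empty dict
    usage = response.usage.model_dump() if response.usage is not None else {}
    return content, usage


# A fake response shaped like a chat completion (assumed structure)
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="hello"))],
    usage=FakeUsage(completion_tokens=539, prompt_tokens=42, total_tokens=581),
)

content, usage = format_response_sketch(fake_response)
print(content)
print(usage)
```

The stand-in objects keep the sketch runnable without an API key; with a real client the same attribute path (`choices[0].message.content`, `usage`) would be read from the returned completion.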

## Test Synchronous Chat Completion

In [13]:
model_name = "gpt-4o"
o1_model_name = 'o1-preview'
model_name_pplx = "llama-3.1-sonar-huge-128k-online"
temperature = 0.0

system_message = "You are a helpful assistant to an experienced data scientist."
user_message = "Using a table, explain the difference between LightGBM, XGBoost, and CatBoost."

### OpenAI

In [3]:
response = chat_completion(
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message}
    ],
    model_name=model_name,
    temperature=temperature,
    client=client
)

In [4]:
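# Bind the formatted text to a new name so `response` still holds the raw API object if this cell is re-run
response_content, usage = format_response(response)

display(Markdown(f'Usage: {usage}'))
display(Markdown(f'Response: {response_content}'))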

Usage: {'completion_tokens': 539, 'prompt_tokens': 42, 'total_tokens': 581, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}

Response: Certainly! Below is a table that highlights the key differences between LightGBM, XGBoost, and CatBoost, which are all popular gradient boosting frameworks:

| Feature/Aspect       | LightGBM                                      | XGBoost                                       | CatBoost                                      |
|----------------------|-----------------------------------------------|-----------------------------------------------|-----------------------------------------------|
| **Developer**        | Microsoft                                     | Distributed Machine Learning Community        | Yandex                                        |
| **Language Support** | C++, Python, R, Java, C#, and more            | C++, Python, R, Java, Scala, and more         | Python, R, C++, Java                          |
| **Tree Growth**      | Leaf-wise growth with depth limitation        | Level-wise growth                             | Symmetric tree (balanced) growth              |
| **Handling Categorical Features** | Requires preprocessing (e.g., one-hot encoding) | Requires preprocessing (e.g., one-hot encoding) | Native support for categorical features       |
| **Speed**            | Generally faster due to leaf-wise growth      | Fast, but can be slower than LightGBM         | Competitive speed, especially with categorical data |
| **Memory Usage**     | Efficient memory usage                        | Can be more memory-intensive                  | Efficient, especially with categorical data   |
| **GPU Support**      | Yes                                           | Yes                                           | Yes                                           |
| **Regularization**   | L1 (Lasso) and L2 (Ridge) regularization      | L1 (Lasso) and L2 (Ridge) regularization      | L2 (Ridge) regularization                     |
| **Handling Missing Values** | Automatic handling of missing values       | Automatic handling of missing values          | Automatic handling of missing values          |
| **Parallelization**  | Feature-wise parallelism                      | Tree and feature-wise parallelism             | Supports multi-threading                      |
| **Hyperparameter Tuning** | Many hyperparameters, can be complex to tune | Many hyperparameters, can be complex to tune  | Fewer hyperparameters, easier to tune         |
| **Use Cases**        | Large datasets, high-dimensional data         | General-purpose, wide range of applications   | Datasets with categorical features, text data |
| **Documentation and Community** | Good documentation, active community     | Extensive documentation, large community      | Good documentation, growing community         |

Each of these frameworks has its strengths and is suited to different types of problems. LightGBM is often preferred for its speed and efficiency on large datasets, XGBoost is known for its robustness and versatility, and CatBoost excels in handling categorical features and is user-friendly with fewer hyperparameters to tune.

### OpenAI - reasoning models

In [14]:
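# Note: at release, o1-preview did not accept system messages or a non-default temperature.
# This call assumes chat_completion adapts the request for reasoning models.
response_o1 = chat_completion(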
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message}
    ],
    model_name=o1_model_name,
    temperature=temperature,
    client=client
)
response_o1, usage_o1 = format_response(response_o1)

display(Markdown(f'Usage: {usage_o1}'))
display(Markdown(f'Response: {response_o1}'))

Usage: {'completion_tokens': 4311, 'prompt_tokens': 49, 'total_tokens': 4360, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 2432, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}

Response: Here is a comprehensive table that explains the differences between LightGBM, XGBoost, and CatBoost across various aspects:

| **Aspect**                     | **LightGBM**                                       | **XGBoost**                                       | **CatBoost**                                      |
|--------------------------------|----------------------------------------------------|---------------------------------------------------|---------------------------------------------------|
| **Developed by**               | Microsoft Research                                 | Tianqi Chen and collaborators (originally at University of Washington) | Yandex                                            |
| **Initial Release**            | 2016                                               | 2014                                              | 2017                                              |
| **Programming Languages**      | C++, with APIs for Python, R, Java, C#, Julia      | C++, with APIs for Python, R, Java, Scala, Julia  | C++, with APIs for Python, R, Java                |
| **License**                    | MIT License                                        | Apache License 2.0                                | Apache License 2.0                                |
| **Main Features**              | - Gradient boosting framework using tree-based learning algorithms<br>- Optimized for speed and efficiency<br>- Uses histogram-based algorithms and leaf-wise tree growth<br>- Efficient memory usage | - Scalable, portable, and distributed gradient boosting library<br>- Highly optimized, supports customized objective functions and evaluation metrics<br>- Handles missing data internally<br>- Uses level-wise (depth-wise) tree growth | - Gradient boosting with explicit support for categorical features<br>- Implements ordered boosting and categorical feature encoding internally<br>- Uses symmetric (oblivious) trees<br>- Reduces overfitting with novel techniques |
| **Tree Growth Strategy**       | Leaf-wise (best-first) growth                      | Level-wise (depth-wise) growth                    | Symmetric (oblivious) trees                       |
| **Handling of Categorical Features** | - Limited built-in support<br>- Requires preprocessing (e.g., label encoding, one-hot encoding)<br>- Can treat categorical variables as ordered integers | - Requires preprocessing to convert categorical variables into numerical form (e.g., one-hot encoding) | - Natively supports categorical features<br>- Automatically handles categorical variables internally<br>- Uses target statistics and ordered boosting to prevent overfitting |
| **Speed and Performance**      | - Generally faster training speed than XGBoost<br>- Efficient with large datasets<br>- Faster thanks to leaf-wise growth and histogram-based algorithms | - Highly optimized and efficient<br>- Supports multi-threading and out-of-core computation<br>- May be slower than LightGBM on large datasets | - Comparable speed to LightGBM<br>- Efficient handling of categorical features saves preprocessing time<br>- Optimized for both CPU and GPU computations |
| **Regularization Methods**     | - L1 and L2 regularization<br>- Depth constraints<br>- Leaf-wise growth can be constrained to prevent overfitting | - L1 and L2 regularization<br>- Tree pruning<br>- Parameters like `gamma` to control model complexity | - L2 regularization (`l2_leaf_reg` parameter)<br>- Ordered boosting to reduce overfitting<br>- Gradient priors and other techniques to improve generalization |
| **Parallelism and GPU Support**| - Supports parallel learning and GPU acceleration<br>- GPU support optimized for NVIDIA GPUs | - Supports parallelization and distributed training<br>- GPU acceleration available but may be less efficient than LightGBM | - Efficient multi-threaded CPU and GPU implementations<br>- GPU support for both NVIDIA and AMD GPUs |
| **Model Complexity Control**   | - Controls complexity through max depth, number of leaves, and regularization parameters<br>- Leaf-wise growth can create deep trees if not constrained | - Controls complexity via max depth, min child weight, `gamma`, subsample ratios, etc.<br>- Level-wise growth tends to produce balanced trees | - Controls complexity with max depth, L2 regularization<br>- Symmetric trees inherently limit complexity<br>- Ordered boosting reduces overfitting risk |
| **Hyperparameters**            | - Number of leaves (`num_leaves`)<br>- Max depth (`max_depth`)<br>- Learning rate (`learning_rate`)<br>- Feature fraction (`feature_fraction`)<br>- Bagging fraction (`bagging_fraction`), etc. | - Learning rate (`eta`)<br>- Max depth (`max_depth`)<br>- Min child weight (`min_child_weight`)<br>- Subsample (`subsample`)<br>- Column subsample (`colsample_bytree`), etc. | - Learning rate (`learning_rate`)<br>- Max depth (`depth`)<br>- L2 regularization (`l2_leaf_reg`)<br>- Subsample (`subsample`)<br>- Parameters for categorical handling (`one_hot_max_size`), etc. |
| **Advantages**                 | - Very fast training speed<br>- Efficient memory usage<br>- Handles large datasets well<br>- Leaf-wise growth can lead to better accuracy | - Mature and widely used library<br>- Robust performance across tasks<br>- Handles sparse data and missing values internally<br>- Supports various objective functions and evaluation metrics | - Excellent handling of categorical features without preprocessing<br>- Reduces need for complex data preprocessing<br>- Strong default parameters often require less tuning<br>- Effective against overfitting with ordered boosting |
| **Disadvantages**              | - Can overfit if not properly tuned due to leaf-wise growth<br>- Limited native support for categorical features compared to CatBoost<br>- May require careful parameter tuning to prevent overfitting | - May be slower than LightGBM on large datasets<br>- Requires preprocessing of categorical variables<br>- Can require extensive hyperparameter tuning for optimal performance | - Model size can be larger due to symmetric trees<br>- Less flexible tree structure compared to LightGBM and XGBoost<br>- Relatively newer library with potentially smaller community support compared to XGBoost |
| **Use Cases**                  | - Applications requiring fast training and prediction times<br>- Large-scale machine learning tasks<br>- When high performance on large datasets is needed | - Wide range of applications across domains<br>- When customized optimization and evaluation are required<br>- Suitable for data with missing values and need for robustness | - Datasets with significant categorical features<br>- Situations where minimal data preprocessing is desired<br>- Applications where overfitting is a concern and strong generalization is needed |
| **Documentation and Community Support** | - Good documentation<br>- Growing community support<br>- Active development | - Extensive documentation<br>- Large and active community<br>- Widely adopted in industry and academia<br>- Numerous resources and tutorials available | - Good documentation<br>- Growing community support<br>- Resources for learning are increasing<br>- May have fewer third-party resources compared to XGBoost |
| **Latest Version (as of 2023-10)** | - Continuously updated<br>- Check the official GitHub repository for the latest release | - Continuously updated<br>- Check the official GitHub repository for the latest release | - Continuously updated<br>- Check the official GitHub repository for the latest release |
| **Additional Features**        | - Supports weighted quantile sketch for efficient histogram building<br>- Supports direct multi-class classification<br>- Can handle lambdarank for ranking tasks | - Supports DART (Dropout meets Multiple Additive Regression Trees) for regularization<br>- Can optimize custom objective functions<br>- Handles ranking, regression, and classification tasks | - Supports text, embedding, and numerical features<br>- Provides tools for model interpretation<br>- Implements monotonic constraints and feature importance calculations |
| **Handling of Missing Values** | - Treats missing values as a separate category<br>- Finds optimal split points considering missing values | - Handles missing values internally<br>- Learns best directions to handle missing values | - Handles missing values internally without preprocessing<br>- Considers missing values during split calculations |

**Summary:**

- **LightGBM** is optimized for speed and performance, especially on large datasets. It uses a leaf-wise tree growth strategy and histogram-based algorithms, leading to faster training times but potentially deeper trees that may overfit if not properly regulated. It has limited support for categorical features compared to CatBoost.

- **XGBoost** is a mature and widely-used library known for its robustness and flexibility. It uses level-wise tree growth and has extensive hyperparameter options to fine-tune performance. While it handles missing values well, it requires preprocessing of categorical features.

- **CatBoost** excels in handling categorical features natively and reduces the need for extensive data preprocessing. It uses symmetric trees and incorporates innovative techniques like ordered boosting to combat overfitting. CatBoost often performs well on datasets with many categorical variables and can require less parameter tuning.

**When to Use Each Algorithm:**

- **LightGBM** is preferred when you need fast training on large datasets and can manage categorical variables through preprocessing or when they are not predominant.

- **XGBoost** is suitable when you need a proven, versatile algorithm with robust performance and have the capability to preprocess categorical variables as needed.

- **CatBoost** is ideal when your dataset contains many categorical features, and you wish to leverage their native handling to reduce preprocessing time and improve model performance.

---

This table should provide a clear comparison between LightGBM, XGBoost, and CatBoost, highlighting their differences in features, capabilities, and use cases.
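
The usage dictionary printed for the `o1-preview` call reports `reasoning_tokens` inside `completion_tokens_details`. Those tokens are counted as part of `completion_tokens`, so the visible answer is shorter than the completion count suggests. A small sketch of that arithmetic, using the numbers reported above (the function name `visible_completion_tokens` is an illustrative choice, not part of `utils.llm`):

```python
def visible_completion_tokens(usage: dict) -> int:
    """Completion tokens minus hidden reasoning tokens."""
    # Some providers return None for the details field, so fall back to {}
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens", 0)
    return usage["completion_tokens"] - reasoning


# Values copied from the o1-preview usage output in this notebook
usage_o1_example = {
    "completion_tokens": 4311,
    "prompt_tokens": 49,
    "total_tokens": 4360,
    "completion_tokens_details": {"reasoning_tokens": 2432},
}

# Values copied from the Perplexity usage output (details are None)
usage_pplx_example = {
    "completion_tokens": 625,
    "prompt_tokens": 31,
    "total_tokens": 656,
    "completion_tokens_details": None,
}

visible_o1 = visible_completion_tokens(usage_o1_example)
visible_pplx = visible_completion_tokens(usage_pplx_example)
print(visible_o1)    # 4311 - 2432 = 1879
print(visible_pplx)  # no reasoning tokens reported, so 625
```

For this run, more than half of the billed completion tokens were spent on reasoning rather than on the displayed table.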

### Perplexity

In [5]:
response_pplx = chat_completion(
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message}
    ],
    model_name=model_name_pplx,
    temperature=temperature,
    client=pplx_client
)

In [6]:
response_content_pplx, usage_pplx = format_response(response_pplx)

display(Markdown(f'Usage: {usage_pplx}'))
display(Markdown(f'Response: {response_content_pplx}'))
display(Markdown(f'Citations: {response_pplx.citations}'))

Usage: {'completion_tokens': 625, 'prompt_tokens': 31, 'total_tokens': 656, 'completion_tokens_details': None, 'prompt_tokens_details': None}

Response: ### Comparison of LightGBM, XGBoost, and CatBoost

| **Feature** | **LightGBM** | **XGBoost** | **CatBoost** |
|-------------|--------------|-------------|--------------|
| **Tree Growth Strategy** | Leaf-wise growth, which can lead to unbalanced trees but is generally faster[1][5] | Level-wise growth, similar to traditional gradient boosting, with advanced algorithms for optimal splits[4][5] | Symmetrical trees, going wide rather than deep, which can be slower in training but faster in prediction[1] |
| **Handling Categorical Features** | Native support for categorical features, but not as robust as CatBoost[5] | Supports categorical features through one-hot encoding or label encoding, but not as efficient as CatBoost[4][5] | Specifically designed to handle categorical features efficiently, using a unique encoding method[1][5] |
| **Parallelization** | Multi-threaded, parallelizes over trees with OpenMP, which can be beneficial in low concurrent load scenarios but may not perform as well in high concurrent load scenarios[1] | Similar to LightGBM, multi-threaded with OpenMP, but can be less efficient in high concurrent load scenarios[1] | Single-threaded by default, but can be parallelized; performs well even in high concurrent load scenarios due to its tree structure[1] |
| **Prediction Speed** | Generally slower than CatBoost, especially in high concurrent load scenarios[1] | Slower than CatBoost, with a significant difference in latency (up to 12X-15X slower)[1] | Faster prediction speed, especially in high concurrent load scenarios, with a 2X-3X improvement in latency over LightGBM and XGBoost[1] |
| **Training Time** | Can be faster than CatBoost in training, especially when using GOSS and EFB simultaneously[2] | Training time can be comparable to LightGBM but slower than CatBoost in some cases[2] | Can be slower in training due to its symmetrical tree growth strategy, but offers better prediction speed[1][2] |
| **Accuracy** | Similar in prediction quality to XGBoost and CatBoost when optimal parameters are found, but default values can lead to differences[1][2] | Comparable accuracy to LightGBM and CatBoost, but can be worse in some cases due to high dimensionality or specific parameter settings[1][2] | Generally provides the most prominent results across various datasets, but can be slower in training[2] |

### Summary:
- **LightGBM** uses leaf-wise tree growth and is generally faster in training but slower in prediction compared to CatBoost.
- **XGBoost** employs level-wise tree growth with advanced algorithms for optimal splits, offering comparable accuracy but slower prediction speeds than CatBoost.
- **CatBoost** is specifically designed for categorical features, uses symmetrical tree growth, and provides faster prediction speeds, especially in high concurrent load scenarios, but can be slower in training.

Citations: ['https://tech.deliveryhero.com/is-catboost-faster-than-lightgbm-and-xgboost/', 'https://arxiv.org/pdf/2305.17094.pdf', 'https://www.youtube.com/watch?v=R5FB1ZUejXM', 'https://www.geeksforgeeks.org/gradientboosting-vs-adaboost-vs-xgboost-vs-catboost-vs-lightgbm/', 'https://datamapu.com/posts/classical_ml/gradient_boosting_variants/']

## Test Asynchronous Chat Completion

In [11]:
response_async = await achat_completion(
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message}
    ],
    model_name=model_name,
    temperature=temperature,
    client=aclient
)

In [12]:
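# Bind the formatted text to a new name so `response_async` still holds the raw API object if this cell is re-run
response_content_async, usage_async = format_response(response_async)

display(Markdown(f'Usage: {usage_async}'))
display(Markdown(f'Response: {response_content_async}'))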

Usage: {'completion_tokens': 600, 'prompt_tokens': 42, 'total_tokens': 642, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}

Response: Certainly! Below is a table that outlines the key differences between LightGBM, XGBoost, and CatBoost, which are all popular gradient boosting frameworks:

| Feature/Aspect       | LightGBM                                      | XGBoost                                      | CatBoost                                      |
|----------------------|-----------------------------------------------|----------------------------------------------|-----------------------------------------------|
| **Developer**        | Microsoft                                     | Distributed Machine Learning Community       | Yandex                                        |
| **Language**         | C++ (with interfaces for Python, R, etc.)     | C++ (with interfaces for Python, R, etc.)    | C++ (with interfaces for Python, R, etc.)     |
| **Tree Structure**   | Leaf-wise growth with depth limitation        | Level-wise growth                            | Symmetric tree (balanced)                     |
| **Handling Categorical Features** | Requires preprocessing (e.g., one-hot encoding) | Requires preprocessing (e.g., one-hot encoding) | Native support for categorical features       |
| **Speed**            | Generally faster due to leaf-wise growth      | Slower compared to LightGBM and CatBoost     | Fast, especially with categorical features    |
| **Memory Usage**     | Lower memory usage due to efficient algorithms| Higher memory usage                          | Efficient memory usage                        |
| **Parallelization**  | Efficient parallelization                     | Good parallelization                         | Efficient parallelization                     |
| **Overfitting Control** | Can overfit if not properly tuned          | Good control with regularization parameters  | Good control with built-in regularization     |
| **Missing Values**   | Handles missing values natively               | Handles missing values natively              | Handles missing values natively               |
| **GPU Support**      | Yes                                           | Yes                                          | Yes                                           |
| **Ease of Use**      | Requires careful tuning for best performance  | Well-documented and widely used              | User-friendly with automatic handling of categorical features |
| **Use Cases**        | Large datasets, high-dimensional data         | General-purpose, widely applicable           | Datasets with categorical features, ease of use |

### Additional Notes:
- **LightGBM**: Known for its speed and efficiency, especially with large datasets. The leaf-wise growth strategy can lead to faster convergence but may require careful tuning to avoid overfitting.
- **XGBoost**: One of the most popular and widely used gradient boosting libraries. It is known for its robust performance and flexibility, with a wide range of parameters for fine-tuning.
- **CatBoost**: Particularly strong in handling categorical features without the need for extensive preprocessing. It is designed to be user-friendly and often requires less parameter tuning compared to the other two.

Each of these frameworks has its strengths and is suitable for different types of problems and datasets. The choice between them often depends on the specific requirements of the task, such as the size of the dataset, the presence of categorical features, and the need for speed versus interpretability.
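
The asynchronous test above awaits a single request, which behaves much like the synchronous call. The async variant mainly helps when several requests are in flight at once. The sketch below illustrates that pattern with `asyncio.gather`; `fake_achat_completion` is a hypothetical stand-in that sleeps instead of calling an API, so it makes no claim about the signature or behavior of `achat_completion`.

```python
import asyncio


async def fake_achat_completion(prompt: str, delay: float = 0.01) -> str:
    """Hypothetical stand-in for an async chat call; sleeps instead of hitting an API."""
    await asyncio.sleep(delay)
    return f"answer to: {prompt}"


async def run_all(prompts):
    # gather runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*(fake_achat_completion(p) for p in prompts))


prompts = ["LightGBM", "XGBoost", "CatBoost"]
# In a script, asyncio.run starts the event loop
results = asyncio.run(run_all(prompts))
print(results)
```

Inside a notebook cell an event loop is already running, so `results = await run_all(prompts)` would replace the `asyncio.run` line.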