# Gradient Boosting algorithms

In [None]:
import pandas as pd
import xgboost as xgb

import matplotlib.pyplot as plt
from likelihood.tools import *

# Dataset Description

**Objective:** The dataset is designed for predicting the `effectiveness` of various sales managers using the XGBoost algorithm.

## Columns

- **date**: Date of the observation, indicating when the data was collected.

- **id_manager**: Unique identifier for each sales manager.

- **id_business_line**: Unique identifier for each business line.

- **target_sales_manager**: A target metric related to the sales manager's goals.

- **departamento**: Department where the sales manager operates.

- **studies**: Educational qualifications of the sales manager, potentially impacting their effectiveness.

- **generation**: The generational cohort the sales manager belongs to (e.g., Millennials, Gen X), which may influence their work style and effectiveness.

- **location**: Geographical location where the sales manager operates, potentially affecting performance due to regional factors.

- **gender**: Gender of the sales manager, a demographic variable that can be used when analyzing effectiveness.

- **age**: Age of the sales manager, which might correlate with experience and effectiveness.

- **direct_boss**: The direct superior of the sales manager, which might influence the sales manager's effectiveness.

- **position**: Job position of the sales manager.

- **rh_location**: Location of the human resources department.

- **civil_status**: Civil status (e.g., single, married) of the sales manager, which could impact their personal and professional life balance.

- **cell_phone**: Contact number of the sales manager. It is useful for direct communication but acts as an identifier, so it may not be relevant for prediction.

- **total_sales**: Total sales achieved by the sales manager.

- **clients**: Number of clients handled by the sales manager, potentially influencing their sales performance.

- **potential_clients**: Number of potential clients identified by the sales manager, indicating future sales opportunities.

- **number_employees**: Number of employees under the sales manager’s supervision, which could affect their workload and performance.

- **program**: Program or initiative the sales manager is involved in, which might be linked to their effectiveness.

- **clients_ratio**: Ratio of clients to a reference quantity (e.g., total clients vs. potential clients), providing insight into client management efficiency.

- **sales_ratio**: Ratio of sales to another metric (e.g., total sales vs. target sales), which may reflect the sales manager's performance relative to targets.

- **effectiveness**: Target variable representing the effectiveness of the sales manager, which the model aims to predict.
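
The two ratio columns are described above only by example. As a hedged illustration, the sketch below derives them on a toy frame, assuming that `clients_ratio` is `clients / potential_clients` and that `sales_ratio` is `total_sales / target_sales_manager`. The real dataset may define them differently. The helper `safe_ratio` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, returning NaN where the denominator is zero."""
    # Replacing 0 with NaN avoids infinite values from division by zero.
    return numerator / denominator.replace(0, np.nan)


# Toy data; the values are made up for illustration only.
toy = pd.DataFrame(
    {
        "clients": [10, 4, 0],
        "potential_clients": [20, 0, 5],
        "total_sales": [150.0, 80.0, 0.0],
        "target_sales_manager": [100.0, 160.0, 50.0],
    }
)

# Assumed definitions of the two ratio columns (not confirmed by the dataset).
toy["clients_ratio"] = safe_ratio(toy["clients"], toy["potential_clients"])
toy["sales_ratio"] = safe_ratio(toy["total_sales"], toy["target_sales_manager"])
```

Under these assumptions, a `sales_ratio` above 1 would mean the manager exceeded the target, and a missing `clients_ratio` would flag a manager with no potential clients recorded.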

## Summary

The dataset contains a mix of demographic, operational, and performance-related features about sales managers. The goal is to use these attributes to build a predictive model that accurately estimates the effectiveness of sales managers.
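
Because the features mix numeric and categorical columns, some encoding step is usually needed before fitting a gradient boosting model. The sketch below shows one possible preparation using one-hot encoding with `pd.get_dummies`. The function `prepare_features`, the `ID_LIKE_COLUMNS` list, and the toy frame are assumptions made for illustration, not the notebook's actual pipeline.

```python
import pandas as pd

# Hypothetical choice of identifier-like columns to exclude from the features.
ID_LIKE_COLUMNS = ["date", "id_manager", "cell_phone"]


def prepare_features(frame: pd.DataFrame, target: str = "effectiveness"):
    """Split a frame into a one-hot encoded feature matrix X and a target y."""
    to_drop = [target] + [c for c in ID_LIKE_COLUMNS if c in frame.columns]
    features = frame.drop(columns=to_drop)
    # Text columns become 0/1 indicator columns; numeric columns pass through.
    X = pd.get_dummies(features, dtype=float)
    y = frame[target]
    return X, y


# Toy data; the values are made up for illustration only.
toy = pd.DataFrame(
    {
        "id_manager": [1, 2, 3, 4],
        "gender": ["F", "M", "F", "M"],
        "age": [31, 45, 28, 39],
        "clients": [12, 30, 8, 21],
        "effectiveness": [1, 0, 1, 0],
    }
)

X, y = prepare_features(toy)
```

A matrix prepared this way could then be passed to an XGBoost estimator such as `xgb.XGBClassifier` or `xgb.XGBRegressor`, depending on whether `effectiveness` is treated as a class label or as a continuous score.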


In [None]:
df = pd.read_parquet("data.parquet")
df.head()

In [None]:
import plotly.graph_objects as go

palette = [
    "#FC814A",
    "#816581",
    "#96939B",
    "#BFBFBF",
    "#E8E8E8",
    "#5F3B3B",
    "#F0EAD6",
    "#72A7D0",
    "#D3A76C",
    "#BB4430",
    "#4D4D4D",
    "#A6C5E2",
    "#B9C9A9",
    "#E3C8B3",
    "#7851A9",
    "#C8AD7F",
]

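# Columns excluded from the categorical plots
drop_columns = ["date", "id_manager", "location", "rh_location", "direct_boss"]
df_copy = df.drop(columns=drop_columns)
# Number of object-typed (categorical) columns, i.e. how many chart pairs the loop below draws
num_charts = len([col for col in df_copy.columns if df_copy[col].dtype == "object"])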

for col in df_copy.columns:
    if df_copy[col].dtype == "object":
        # Pie chart
        pie_data = df_copy[col].value_counts()
        pie_fig = go.Figure(
            data=[
                go.Pie(
                    labels=pie_data.index,
                    values=pie_data.values,
                    hole=0.3,
                    marker=dict(colors=palette),
                )
            ]
        )
        pie_fig.update_layout(
            title_text=f"Frequency per {col}",
            annotations=[dict(text=f"{col}", x=0.5, y=0.5, font_size=20, showarrow=False)],
        )
        pie_fig.show()

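        # Stacked bar chart with percentages
        # fill_value=0 keeps missing (effectiveness, category) pairs as zero counts instead of NaN
        grouped = df_copy.groupby(["effectiveness", col]).size().unstack(fill_value=0)
        # Share of each category within an effectiveness level (each row sums to 100)
        grouped_pct = grouped.div(grouped.sum(axis=1), axis=0) * 100

        bar_fig = go.Figure()

        for i, c in enumerate(grouped.columns):
            bar_fig.add_trace(
                go.Bar(
                    x=grouped.index,
                    y=grouped[c],
                    name=str(c),
                    text=grouped_pct[c].apply(lambda x: f"{x:.1f}%"),
                    textposition="inside",
                    # Cycle through the palette so columns with more categories than colors do not raise IndexError
                    marker=dict(color=palette[i % len(palette)]),
                )
            )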

        bar_fig.update_layout(
            title=f"Effectiveness distribution by {col}",
            xaxis_title="Effectiveness",
            yaxis_title="Count",
            barmode="stack",
            xaxis=dict(type="category"),
        )
        bar_fig.show()
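
The count and percentage tables behind the stacked bars can also be built with `pd.crosstab`. The sketch below is an alternative formulation on a toy frame, not the code used above; the `effectiveness` labels and `gender` values are made up for illustration.

```python
import pandas as pd

# Toy data; the values are made up for illustration only.
toy = pd.DataFrame(
    {
        "effectiveness": ["high", "high", "low", "low", "low"],
        "gender": ["F", "M", "F", "F", "M"],
    }
)

# Raw counts: one row per effectiveness level, one column per category.
counts = pd.crosstab(toy["effectiveness"], toy["gender"])

# normalize="index" divides each row by its total, so every row sums to 100.
row_pct = pd.crosstab(toy["effectiveness"], toy["gender"], normalize="index") * 100
```

Combinations that never occur appear as zero in `pd.crosstab`, which avoids missing values in the bar labels.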