---
title: scikit-learn survey
subtitle: Presentation of survey results
author:
  - name: Inessa Pawson
    email: ipawson@openteams.com
date: 2024/12/27
---

# Libraries and Data
```{tip} Show code
:class: dropdown 

```python
import seaborn as sns
import matplotlib.pyplot as plt

import pandas as pd
import matplotlib.colors as mcolors
import numpy as np
import altair as alt

url = 'https://raw.githubusercontent.com/Auslum/scikit_learn_survey/refs/heads/main/scikit-learn-survey-master-dataset.csv'

df = pd.read_csv(url)

#print(df.head())    
```

# Project Future Direction and Priorities

Respondents gave the highest priority to new features, performance, and technical documentation, and the lowest to website redesign and the "Other" category.
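One way to back up a ranking like this is to compare the mean score each category received. The sketch below is illustrative only: the scores are made up, the helper name `mean_priority` is hypothetical, and it assumes the scale shown in the chart legend (1 = lowest priority, 7 = highest).

```python
from statistics import fmean

# Made-up rankings from four hypothetical respondents (not survey data).
# Assumed scale: 1 = lowest priority, 7 = highest priority.
toy_rankings = {
    "New features": [7, 6, 7, 5],
    "Performance": [6, 7, 5, 6],
    "Website redesign": [1, 2, 1, 3],
}


def mean_priority(rankings):
    """Return (category, mean score) pairs, highest mean first."""
    means = {category: fmean(scores) for category, scores in rankings.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


ranking = mean_priority(toy_rankings)
for category, score in ranking:
    print(f"{category}: {score:.2f}")
```

With the survey table loaded, the same summary should be available from pandas as `priority_data.mean().sort_values(ascending=False)`.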

```{tip} Show code
:class: dropdown 

```python
# Define priority levels from 1 to 7
priority_levels = list(range(1, 8))

# Filter columns related to the question
priority_columns = [col for col in df.columns if "PROJECT FUTURE DIRECTION AND PRIORITIES" in col]
priority_data = df[priority_columns].dropna()

# Rename the categories
renamed_columns = [
    "Performance", "Reliability", "Packaging", "New features",
    "Technical documentation", "Educational materials",
    "Website redesign", "Other"
]
priority_data.columns = renamed_columns

# Convert the data to a long format for Altair
priority_data_melted = priority_data.melt(var_name="Category", value_name="Priority")

# Create interpolated colors from blue to orange
scikit_learn_colors = ["#0072B2", "#FF9900"]
priority_colors = [
    mcolors.to_hex(c)
    for c in mcolors.LinearSegmentedColormap.from_list("ScikitLearn", scikit_learn_colors)(
        np.linspace(0, 1, len(priority_levels))
    )
]

# Create custom labels for the legend
priority_labels = {1: "1 (Lowest priority)", 2: "2", 3: "3", 4: "4", 5: "5", 6: "6", 7: "7 (Highest priority)"}
priority_data_melted['Priority Label'] = priority_data_melted['Priority'].map(priority_labels)

# Create the stacked bar chart with Altair
chart = alt.Chart(priority_data_melted).mark_bar().encode(
    x=alt.X('Category:N', title='Categories', sort=renamed_columns),
    y=alt.Y(
        'count()',
        title='Frequency',
        scale=alt.Scale(domain=[0, 350])
    ),
    color=alt.Color(
        'Priority Label:N',
        scale=alt.Scale(
            domain=list(priority_labels.values()),
            range=priority_colors
        ),
        title='Priority Level'
    ),
    order=alt.Order('Priority:Q', sort='ascending')
).properties(
    title="Project Future Direction And Priorities",
    width=600,
    height=400
).configure_axis(
    labelAngle=45
)

# Display the chart
chart.show()
```


![Project Future Direction and Priorities](images/chart1.png)
***

# ML Tasks: Priority Levels

Among Machine Learning tasks, respondents ranked classification and regression highest, while clustering, outlier/anomaly detection, and "Other" ranked lowest.

```{tip} Show code
:class: dropdown 

```python

# Define priority levels from 1 to 7
priority_levels = list(range(1, 8))

# Filter columns related to the question
priority_columns = [col for col in df.columns if "Please order the following ML tasks in order of priority to you" in col]
priority_data = df[priority_columns].dropna()

# Rename the categories
renamed_columns = [
    "Regression", "Classification", "Forecasting",
    "Outlier/anomaly detection", "Dimensionality reduction",
    "Clustering", "Other"
]
priority_data.columns = renamed_columns

# Convert the data to a long format for Altair
priority_data_melted = priority_data.melt(var_name="Category", value_name="Priority")

# Create interpolated colors from blue to orange
scikit_learn_colors = ["#0072B2", "#FF9900"]
priority_colors = [
    mcolors.to_hex(c)
    for c in mcolors.LinearSegmentedColormap.from_list("ScikitLearn", scikit_learn_colors)(
        np.linspace(0, 1, len(priority_levels))
    )
]

# Create custom labels for the legend
priority_labels = {1: "1 (Lowest priority)", 2: "2", 3: "3", 4: "4", 5: "5", 6: "6", 7: "7 (Highest priority)"}
priority_data_melted['Priority Label'] = priority_data_melted['Priority'].map(priority_labels)

# Create the stacked bar chart with Altair
chart = alt.Chart(priority_data_melted).mark_bar().encode(
    x=alt.X('Category:N', title='Categories', sort=renamed_columns),
    y=alt.Y(
        'count()',
        title='Frequency',
        scale=alt.Scale(domain=[0, 350])
    ),
    color=alt.Color(
        'Priority Label:N',
        scale=alt.Scale(
            domain=list(priority_labels.values()),
            range=priority_colors
        ),
        title='Priority Level'
    ),
    order=alt.Order('Priority:Q', sort='ascending')
).properties(
    title="ML Tasks: Priority Levels",
    width=600,
    height=400
).configure_axis(
    labelAngle=45
)

# Display the chart
chart.show()
```

![ML Tasks: Priority Levels](images/chart2.png)
***

# Visualizations used to evaluate models

The confusion matrix is the visualization respondents use most often to evaluate models; feature importance and ROC curves are also common. Residual plots and reliability diagrams are the least used.
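This question is multi-select, and answers arrive as comma-separated strings in several languages. The sketch below shows the counting idea on made-up answers with a tiny, hypothetical subset of a translation mapping. The helper name `count_multiselect` is an assumption for illustration, not part of the survey code.

```python
from collections import Counter

# Tiny subset of a translation mapping, for illustration only.
toy_mapping = {
    "Confusion matrix": "Confusion matrix",
    "Matriz de confusión": "Confusion matrix",
    "ROC curve": "ROC curve",
    "Curva ROC": "ROC curve",
}

# Made-up raw answers (not survey data); None stands for a skipped question.
toy_answers = [
    "Confusion matrix, ROC curve",
    "Matriz de confusión, Curva ROC",
    "Confusion matrix",
    None,
]


def count_multiselect(answers, mapping):
    """Count canonical labels across comma-separated multi-select answers."""
    counts = Counter()
    for answer in answers:
        if not isinstance(answer, str):
            continue  # skipped question
        for option in answer.split(","):
            label = mapping.get(option.strip())
            if label is not None:  # unmapped text is ignored
                counts[label] += 1
    return counts


counts = count_multiselect(toy_answers, toy_mapping)
print(counts.most_common())
```

Because each respondent can select several options, the counts can add up to more than the number of respondents. Splitting on commas also assumes that no option label contains a comma itself.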

```{tip} Show code
:class: dropdown 

```python

# Mapping dictionary
mapping_dict = {
    # Confusion matrix responses
    "Confusion matrix": "Confusion matrix",
    "Matriz de confusão": "Confusion matrix",
    "Matriz de confusión": "Confusion matrix",
    "混淆矩阵": "Confusion matrix",
    "Matrice de confusion": "Confusion matrix",
    "مصفوفة الدقة": "Confusion matrix",
    # Reliability diagram responses
    "Reliability diagram": "Reliability diagram",
    "Diagrama de confiabilidade": "Reliability diagram",
    "Diagrama de confiabilidad": "Reliability diagram",
    "可靠性图": "Reliability diagram",
    "Diagramme de fiabilité": "Reliability diagram",
    "مخطط الموثوقية": "Reliability diagram",
    # ROC curve responses
    "ROC curve": "ROC curve",
    "Curva ROC": "ROC curve",
    "ROC曲线": "ROC curve",
    "Courbe ROC": "ROC curve",
    "منحنى ROC": "ROC curve",
    # Precision-Recall curve responses
    "Precision-Recall curve": "Precision-Recall curve",
    "Curva de Precisão-Recall": "Precision-Recall curve",
    "Curva de Precisión-Recall": "Precision-Recall curve",
    "PR曲线（精确率-召回率曲线）": "Precision-Recall curve",
    "Courbe Précision-Rappel": "Precision-Recall curve",
    "منحنى الدقة-الاسترجاع": "Precision-Recall curve",
    # Feature importance responses
    "Feature importance": "Feature importance",
    "Importância das características": "Feature importance",
    "Importancia de variables": "Feature importance",
    "特征重要性": "Feature importance",
    "Importance des caractéristiques (features)": "Feature importance",
    "الأهمية النسبية للخواص": "Feature importance",
    # Residual plots responses
    "Residual plots": "Residual plots",
    "Gráficos de resíduos": "Residual plots",
    "Gráficos de residuos": "Residual plots",
    "残差图": "Residual plots",
    "Graphiques des résidus": "Residual plots",
    "مخططات البواقي": "Residual plots",
    # Learning curves responses
    "Learning curves": "Learning curves",
    "Curvas de aprendizagem": "Learning curves",
    "Curvas de aprendizaje": "Learning curves",
    "学习曲线": "Learning curves",
    "Courbes d'apprentissage": "Learning curves",
    "منحنيات التعلم": "Learning curves",
    # Other responses
    "Other": "Other",
    "Outro": "Other",
    "Otro": "Other",
    "其它": "Other",
    "Autre": "Other",
    "أخرى": "Other"
}

# Function to normalize responses
def normalize_responses(response):
    if isinstance(response, str):
        response_split = [r.strip() for r in response.split(',')]
        normalized = [mapping_dict.get(r, None) for r in response_split]
        return [r for r in normalized if r is not None]
    return []

# Apply normalization and count responses
df['Normalized_Responses'] = df['What visualizations do you use to evaluate your models? Select all that apply.'].apply(normalize_responses)
all_responses = [item for sublist in df['Normalized_Responses'].dropna() for item in sublist]
response_counts = pd.Series(all_responses).value_counts().reset_index()
response_counts.columns = ['Visualization', 'Count']

# Chart using Altair with orange color
chart = alt.Chart(response_counts).mark_bar(color='#F7931E').encode(
    x='Count:Q',
    y=alt.Y('Visualization:N', sort='-x'),
    tooltip=['Visualization', 'Count']
).properties(
    title='Visualizations used to evaluate models',
    width=500,
    height=300
)

chart.show()
```

![Visualizations used to evaluate models](images/chart3.png)
***

# Dataframe libraries used

Respondents use pandas far more than any other DataFrame library. The other libraries are much less popular, with Modin the least used.
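Raw counts do not show what share of respondents selected a library. A minimal sketch of that calculation follows, using made-up answers and a hypothetical helper `selection_share`. It assumes that skipped questions should be left out of the denominator.

```python
def selection_share(answers, option):
    """Fraction of answered responses whose selections include `option`."""
    answered = [a for a in answers if isinstance(a, str) and a.strip()]
    if not answered:
        return 0.0
    hits = sum(
        option in [part.strip() for part in a.split(",")] for a in answered
    )
    return hits / len(answered)


# Made-up answers (not survey data); None stands for a skipped question.
toy_answers = ["pandas, Polars", "pandas", "Polars, DuckDB", "pandas", None]

share = selection_share(toy_answers, "pandas")
print(f"{share:.0%}")
```

A share above 50% is what would justify describing a library as used by the majority of respondents.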

```{tip} Show code
:class: dropdown 

```python

# Mapping dictionary
mapping_dict = {
    # cudf responses
    "cudf": "cudf",
    "cuDF كووديف": "cudf",
    # Dask DataFrame responses
    "Dask DataFrame": "Dask DataFrame",
    "Dask 数据框": "Dask DataFrame",
    "Dask DataFrame  اطر بيانات داسك": "Dask DataFrame",
    # DuckDB responses
    "DuckDB": "DuckDB",
    "DuckDB دك دي بي": "DuckDB",
    # Modin responses
    "Modin": "Modin",
    "Modin مودين": "Modin",
    # pandas responses
    "pandas": "pandas",
    "Pandas": "pandas",
    "pandas بنداز": "pandas",
    # Polars responses
    "Polars": "Polars",
    "Polars بولارز": "Polars",
    # Spark DataFrame responses
    "Spark DataFrame": "Spark DataFrame",
    "Spark 数据框": "Spark DataFrame",
    "Spark DataFrame اطر بيانات سبارك": "Spark DataFrame",
    # Other responses
    "Other": "Other",
    "Outro": "Other",
    "Otro": "Other",
    "其它": "Other",
    "Autre": "Other",
    "أخرى": "Other"
}

# Function to normalize responses
def normalize_responses(response):
    if isinstance(response, str):
        response_split = [r.strip() for r in response.split(',')]
        normalized = [mapping_dict.get(r, None) for r in response_split]
        return [r for r in normalized if r is not None]
    return []

# Apply normalization and count responses
df['Normalized_Responses'] = df['Which DataFrame libraries do you use? Select all that apply.'].apply(normalize_responses)
all_responses = [item for sublist in df['Normalized_Responses'].dropna() for item in sublist]
response_counts = pd.Series(all_responses).value_counts().reset_index()
response_counts.columns = ['Libraries', 'Count']

# Chart using Altair with orange color
chart = alt.Chart(response_counts).mark_bar(color='#F7931E').encode(
    x='Count:Q',
    y=alt.Y('Libraries:N', sort='-x'),
    tooltip=['Libraries', 'Count']
).properties(
    title='DataFrame libraries used',
    width=500,
    height=300
)

chart.show()
```

![Dataframe libraries used](images/chart4.png)
***

# Machine Learning libraries used

For Machine Learning, XGBoost and PyTorch are the most used libraries, while Jax is the least used.

```{tip} Show code
:class: dropdown 

```python

# Mapping dictionary
mapping_dict = {
    # CatBoost responses
    "CatBoost": "CatBoost",
    "CatBoost كات بوست": "CatBoost",
    # Jax responses
    "Jax": "Jax",
    "JAX چاكس": "Jax",
    # Keras responses
    "Keras": "Keras",
    "Keras كيراس": "Keras",
    # LightGBM responses
    "LightGBM": "LightGBM",
    "LightGBM لايت جي بي ام": "LightGBM",
    # PyTorch responses
    "PyTorch": "PyTorch",
    "PyTorch باي تورش": "PyTorch",
    # Transformers responses
    "Transformers": "Transformers",
    "Transformers المحولات (ترانسفورمرز)": "Transformers",
    # XGBoost responses
    "XGBoost": "XGBoost",
    "XGBoost اكس جي بوست": "XGBoost",
    # Other responses
    "Other": "Other",
    "Outro": "Other",
    "Otro": "Other",
    "其它": "Other",
    "Autre": "Other",
    "أخرى": "Other"
}

# Function to normalize responses
def normalize_responses(response):
    if isinstance(response, str):
        response_split = [r.strip() for r in response.split(',')]
        normalized = [mapping_dict.get(r, None) for r in response_split]
        return [r for r in normalized if r is not None]
    return []

# Apply normalization and count responses
df['Normalized_Responses'] = df['Which other Machine Learning libraries do you use? Select all that apply.'].apply(normalize_responses)
all_responses = [item for sublist in df['Normalized_Responses'].dropna() for item in sublist]
response_counts = pd.Series(all_responses).value_counts().reset_index()
response_counts.columns = ['Libraries', 'Count']

# Chart using Altair with orange color
chart = alt.Chart(response_counts).mark_bar(color='#F7931E').encode(
    x='Count:Q',
    y=alt.Y('Libraries:N', sort='-x'),
    tooltip=['Libraries', 'Count']
).properties(
    title='Machine Learning libraries used',
    width=500,
    height=300
)

chart.show()
```

![Machine Learning libraries used](images/chart5.png)
***

# Estimators regularly used

RandomForestClassifier and RandomForestRegressor are the most used estimators. The least used are HistGradientBoostingRegressor and HistGradientBoostingClassifier, together with estimators not listed among the survey options ("Other").
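Counts like these depend on every raw answer string having an entry in the mapping dictionary, because unmapped strings are dropped silently. A quick audit can list what was dropped. The sketch below uses made-up answers and a hypothetical helper name, `unmapped_options`.

```python
from collections import Counter


def unmapped_options(answers, mapping):
    """Count answer fragments that have no entry in `mapping`."""
    missing = Counter()
    for answer in answers:
        if not isinstance(answer, str):
            continue  # skipped question
        for option in answer.split(","):
            option = option.strip()
            if option and option not in mapping:
                missing[option] += 1
    return missing


# Made-up mapping and answers (not survey data).
toy_mapping = {"Pipeline": "Pipeline", "ColumnTransformer": "ColumnTransformer"}
toy_answers = [
    "Pipeline, ColumnTransformer",
    "Pipeline, GridSearchCV",
    "gridsearchcv",
]

missing = unmapped_options(toy_answers, toy_mapping)
print(missing.most_common())
```

An empty result suggests the mapping covers all answers. Anything else points to a translation or spelling variant that may need its own key.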

```{tip} Show code
:class: dropdown 

```python
# Mapping dictionary
mapping_dict = {
     # LogisticRegression responses
    "LogisticRegression": "LogisticRegression",
    "RandomForestClassifier أو RandomForestRegressorLogisticRegression الانحدار اللوجستي": "LogisticRegression",
    # RandomForestClassifier or RandomForestRegressor responses
    "RandomForestClassifier or RandomForestRegressor": "RandomForestClassifier or RandomForestRegressor",
    "RandomForestClassifier ou RandomForestRegressor": "RandomForestClassifier or RandomForestRegressor",
    "RandomForestClassifier o RandomForestRegressor": "RandomForestClassifier or RandomForestRegressor",
    "RandomForestClassifier 或 RandomForestRegressor": "RandomForestClassifier or RandomForestRegressor",
    "مصنف الغابة العشوائية أو انحدار الغابة العشوائية": "RandomForestClassifier or RandomForestRegressor",
    # HistGradientBoostingRegressor or HistGradientBoostingClassifier responses
    "HistGradientBoostingRegressor or HistGradientBoostingClassifier": "HistGradientBoostingRegressor or HistGradientBoostingClassifier",
    "HistGradientBoostingRegressor ou HistGradientBoostingClassifier": "HistGradientBoostingRegressor or HistGradientBoostingClassifier",
    "HistGradientBoostingRegressor o HistGradientBoostingClassifier": "HistGradientBoostingRegressor or HistGradientBoostingClassifier",
    "HistGradientBoostingRegressor 或 HistGradientBoostingClassifier": "HistGradientBoostingRegressor or HistGradientBoostingClassifier",
    "HistGradientBoostingRegressorأو HistGradientBoostingClassifier  مصنف الانحدار المدعم بتحليل التردد أو شجرة الانحدار المدعمة بتحليل التردد</li>": "HistGradientBoostingRegressor or HistGradientBoostingClassifier",
    # Pipeline responses
    "Pipeline": "Pipeline",
    "Pipeline الوصلات \\ خطوط الأنابيب": "Pipeline",
    # ColumnTransformer responses
    "ColumnTransformer": "ColumnTransformer",
    "ColumnTransforme محولات الاعمدة": "ColumnTransformer",
    # Other responses
    "Other": "Other",
    "Outro": "Other",
    "Otro": "Other",
    "其它": "Other",
    "Autre": "Other",
    "أخرى": "Other"
}

# Function to normalize responses
def normalize_responses(response):
    if isinstance(response, str):
        response_split = [r.strip() for r in response.split(',')]
        normalized = [mapping_dict.get(r, None) for r in response_split]
        return [r for r in normalized if r is not None]
    return []

# Apply normalization and count responses
df['Normalized_Responses'] = df['Which estimators do you regularly use? Select all that apply.'].apply(normalize_responses)
all_responses = [item for sublist in df['Normalized_Responses'].dropna() for item in sublist]
response_counts = pd.Series(all_responses).value_counts().reset_index()
response_counts.columns = ['Estimators', 'Count']

# Chart using Altair with orange color
chart = alt.Chart(response_counts).mark_bar(color='#F7931E').encode(
    x='Count:Q',
    y=alt.Y('Estimators:N', sort='-x'),
    tooltip=['Estimators', 'Count']
).properties(
    title='Estimators Regularly Used',
    width=500,
    height=300
)

chart.show()
```

![Estimators regularly used](images/chart6.png)
***

# Important ML features

Respondents consider feature importances and uncertainty estimates for prediction the most important Machine Learning features for their use cases, and metadata routing and non-euclidean metrics the least important.

```{tip} Show code
:class: dropdown 

```python
# Mapping dictionary
mapping_dict = {
     # Calibration of probabilistic classifiers responses
    "Calibration of probabilistic classifiers": "Calibration of probabilistic classifiers",
    "Calibração de classificadores probabilísticos": "Calibration of probabilistic classifiers",
    "Calibración de clasificadores probabilísticos": "Calibration of probabilistic classifiers",
    "概率分类器（probabilistic classifiers）的校准": "Calibration of probabilistic classifiers",
    "Calibration des classificateurs probabilistes": "Calibration of probabilistic classifiers",
    "معايرة المصنفات الاحتمالية": "Calibration of probabilistic classifiers",
    # Calibration of regressors responses
    "Calibration of regressors": "Calibration of regressors",
    "Calibração de regressores": "Calibration of regressors",
    "Calibración de regresores": "Calibration of regressors",
    "回归子（regressor）的校准": "Calibration of regressors",
    "Calibration des régressions": "Calibration of regressors",
    "معايرة نماذج الانحدار": "Calibration of regressors",
    # Uncertainty estimates for prediction responses
    "Uncertainty estimates for prediction": "Uncertainty estimates for prediction",
    "Estimativas de incerteza para previsão": "Uncertainty estimates for prediction",
    "Estimaciones de incertidumbre para la predicción": "Uncertainty estimates for prediction",
    "对预测的不确定性估计": "Uncertainty estimates for prediction",
    "Estimations d'incertitude pour les prédictions": "Uncertainty estimates for prediction",
    "تقديرات عدم اليقين للتنبؤ": "Uncertainty estimates for prediction",
    # Cost-sensitive learning responses
    "Cost-sensitive learning": "Cost-sensitive learning",
    "Aprendizado sensível a custo": "Cost-sensitive learning",
    "Aprendizaje sensible al costo (cost-sensitive learning)": "Cost-sensitive learning",
    "代价敏感学习": "Cost-sensitive learning",
    "Apprentissage sensible aux coûts (Cost-sensitive learning)": "Cost-sensitive learning",
    "التعلم الحساس للتكلفة": "Cost-sensitive learning",
    # Feature importances responses
    "Feature importances": "Feature importances",
    "Importância das características": "Feature importances",
    "Importancia de variables": "Feature importances",
    "特征重要性": "Feature importances",
    "Importance des caractéristiques (features)": "Feature importances",
    "الأهمية النسبية للخواص": "Feature importances",
    # Sample weights responses
    "Sample weights": "Sample weights",
    "Pesos de amostra": "Sample weights",
    "Pesos de muestra (sample weights)": "Sample weights",
    "样本权重": "Sample weights",
    "Poids des échantillons(Sample Weights)": "Sample weights",
    "أوزان العينات": "Sample weights",
    # Metadata routing responses
    "Metadata routing": "Metadata routing",
    "Roteamento de metadados": "Metadata routing",
    "Enrutamiento de metadatos": "Metadata routing",
    "元数据路由（Metadata routing）": "Metadata routing",
    "Routage des métadonnées": "Metadata routing",
    "توجيه البيانات الوصفية": "Metadata routing",
    # Non-euclidean metrics responses
    "Non-euclidean metrics": "Non-euclidean metrics",
    "Métricas não-euclidianas": "Non-euclidean metrics",
    "Métricas no-euclidianas": "Non-euclidean metrics",
    "非欧几里得度量（Non-Euclidean Metric)": "Non-euclidean metrics",
    "Métriques non-euclidiennes": "Non-euclidean metrics",
    "المقاييس غير الإقليدية": "Non-euclidean metrics",
}

# Function to normalize responses
def normalize_responses(response):
    if isinstance(response, str):
        response_split = [r.strip() for r in response.split(',')]
        normalized = [mapping_dict.get(r, None) for r in response_split]
        return [r for r in normalized if r is not None]
    return []

# Apply normalization and count responses
df['Normalized_Responses'] = df['What ML features are important for your use case? Select all that apply.'].apply(normalize_responses)
all_responses = [item for sublist in df['Normalized_Responses'].dropna() for item in sublist]
response_counts = pd.Series(all_responses).value_counts().reset_index()
response_counts.columns = ['Features', 'Count']

# Chart using Altair with orange color
chart = alt.Chart(response_counts).mark_bar(color='#F7931E').encode(
    x='Count:Q',
    y=alt.Y('Features:N', sort='-x'),
    tooltip=['Features', 'Count']
).properties(
    title='Important ML features',
    width=500,
    height=300
)

chart.show()
```

![Important ML features](images/chart7.png)
***

# Tools used for model registry and experiment tracking

The most popular tool is MLFlow. Respondents use the other tools much less, with Neptune the least popular among them.

```{tip} Show code
:class: dropdown 

```python
# Mapping dictionary
mapping_dict = {
    # DVC responses
    "DVC": "DVC",
    "DVC دي في سي": "DVC",
    # Neptune responses
    "Neptune": "Neptune",
    "Neptune نبتون": "Neptune",
    # MLFlow responses
    "MLFlow": "MLFlow",
    "MLFlow ام ال فلو": "MLFlow",
    # Weight and biases responses
    "Weight and biases": "Weight and biases",
    "Weights and Biases": "Weight and biases",
    "Weight and biases الوزن و الانحيازات": "Weight and biases",
    # Custom tool responses
    "Custom tool": "Custom tool",
    "Ferramenta personalizada": "Custom tool",
    "Herramientas personalizadas": "Custom tool",
    "自定义工具": "Custom tool",
    "Outil personnalisé": "Custom tool",
    "أداة مخصصة": "Custom tool",
    # Other responses
    "Other": "Other",
    "Outro": "Other",
    "Otro": "Other",
    "其它": "Other",
    "Autre": "Other",
    "أخرى": "Other"
}

# Function to normalize responses
def normalize_responses(response):
    if isinstance(response, str):
        response_split = [r.strip() for r in response.split(',')]
        normalized = [mapping_dict.get(r, None) for r in response_split]
        return [r for r in normalized if r is not None]
    return []

# Apply normalization and count responses
df['Normalized_Responses'] = df['For model registry and experiment tracking, do you use any of the following tools? Select all that apply.'].apply(normalize_responses)
all_responses = [item for sublist in df['Normalized_Responses'].dropna() for item in sublist]
response_counts = pd.Series(all_responses).value_counts().reset_index()
response_counts.columns = ['Tools', 'Count']

# Chart using Altair with orange color
chart = alt.Chart(response_counts).mark_bar(color='#F7931E').encode(
    x='Count:Q',
    y=alt.Y('Tools:N', sort='-x'),
    tooltip=['Tools', 'Count']
).properties(
    title='Tools used for model registry and experiment tracking',
    width=500,
    height=300
)

chart.show()
```

![Tools used for model registry and experiment tracking](images/chart8.png)
***

# Tools used for scheduling

For scheduling, the majority of respondents chose Airflow. Tools that were not listed in the survey are also popular among users. Coiled is the least popular of the listed options.

```{tip} Show code
:class: dropdown 

```python
# Mapping dictionary
mapping_dict = {
    # Airflow responses
    "Airflow": "Airflow",
    "Airflow اير فلو": "Airflow",
    # Argo responses
    "Argo": "Argo",
    "Argo  ارجو": "Argo",
    # Coiled responses
    "Coiled": "Coiled",
    "Coiled  كويلد": "Coiled",
    # Dagster responses
    "Dagster": "Dagster",
    "Dagster  داجستر": "Dagster",
    # Kubeflow responses
    "Kubeflow": "Kubeflow",
    "Kubeflow  كيوب فلو": "Kubeflow",
    # Metaflow (outerbounds) responses
    "Metaflow (outerbounds)": "Metaflow (outerbounds)",
    "Metaflow (outerbounds)(اوت باندز)  ميتا فلو": "Metaflow (outerbounds)",
    # Custom tool responses
    "Custom tool": "Custom tool",
    "Ferramenta personalizada": "Custom tool",
    "Herramientas personalizadas": "Custom tool",
    "自定义工具": "Custom tool",
    "Outil personnalisé": "Custom tool",
    "أداة مخصصة": "Custom tool",
    # Other responses
    "Other": "Other",
    "Outro": "Other",
    "Otro": "Other",
    "其它": "Other",
    "Autre": "Other",
    "أخرى": "Other"
}

# Function to normalize responses
def normalize_responses(response):
    if isinstance(response, str):
        response_split = [r.strip() for r in response.split(',')]
        normalized = [mapping_dict.get(r, None) for r in response_split]
        return [r for r in normalized if r is not None]
    return []

# Apply normalization and count responses
df['Normalized_Responses'] = df['For scheduling, do you use any of the following tools? Select all that apply.'].apply(normalize_responses)
all_responses = [item for sublist in df['Normalized_Responses'].dropna() for item in sublist]
response_counts = pd.Series(all_responses).value_counts().reset_index()
response_counts.columns = ['Tools', 'Count']

# Chart using Altair with orange color
chart = alt.Chart(response_counts).mark_bar(color='#F7931E').encode(
    x='Count:Q',
    y=alt.Y('Tools:N', sort='-x'),
    tooltip=['Tools', 'Count']
).properties(
    title='Tools used for scheduling',
    width=500,
    height=300
)

chart.show()
```

![Tools used for scheduling](images/chart9.png)
***

# Time that a typical model training takes in ML projects

A typical model training run takes respondents between a few minutes and a day. Training times under a minute or over a day are uncommon.
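The answer options for this question form an ordered scale, so a summary is easier to read in duration order than in order of frequency. The sketch below is illustrative only: the counts are made up, and `DURATION_ORDER` and `in_duration_order` are hypothetical names.

```python
# Answer options in increasing duration.
DURATION_ORDER = [
    "less than 10 seconds",
    "less than a minute",
    "less than 10 minutes",
    "less than an hour",
    "less than a day",
    "more than a day",
]

# Made-up counts (not survey results).
toy_counts = {
    "less than an hour": 5,
    "more than a day": 1,
    "less than 10 minutes": 4,
}


def in_duration_order(counts, order=DURATION_ORDER):
    """Return (label, count) pairs in duration order, filling gaps with 0."""
    return [(label, counts.get(label, 0)) for label in order]


ordered = in_duration_order(toy_counts)
for label, count in ordered:
    print(f"{label}: {count}")
```

In an Altair chart, the same list can be passed as the `sort` argument of the categorical axis so the bars follow duration order instead of count order.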

```{tip} Show code
:class: dropdown 

```python
# Mapping dictionary
mapping_dict = {
    # less than 10 seconds responses
    "less than 10 seconds": "less than 10 seconds",
    "menos de 10 segundos": "less than 10 seconds",
    "少于10秒": "less than 10 seconds",
    "moins de 10 secondes": "less than 10 seconds",
    "أقل من ١٠ ثوانٍ": "less than 10 seconds",
    # less than a minute responses
    "less than a minute": "less than a minute",
    "menos de um minuto": "less than a minute",
    "menos de un minuto": "less than a minute",
    "少于1分钟": "less than a minute",
    "moins d'une minute": "less than a minute",
    "أقل من دقيقة": "less than a minute",
    # less than 10 minutes responses
    "less than 10 minutes": "less than 10 minutes",
    "menos de 10 minutos": "less than 10 minutes",
    "少于10分钟": "less than 10 minutes",
    "moins de 10 minutes": "less than 10 minutes",
    "أقل من ١٠ دقائق": "less than 10 minutes",
    # less than an hour responses
    "less than an hour": "less than an hour",
    "menos de uma hora": "less than an hour",
    "menos de una hora": "less than an hour",
    "少于1小时": "less than an hour",
    "moins d'une heure": "less than an hour",
    "أقل من ساعة": "less than an hour",
    # less than a day responses
    "less than a day": "less than a day",
    "menos de um dia": "less than a day",
    "menos de un día": "less than a day",
    "少于1天": "less than a day",
    "moins d'une journée": "less than a day",
    "أقل من يوم": "less than a day",
    # more than a day responses
    "more than a day": "more than a day",
    "mais de um dia": "more than a day",
    "más de un día": "more than a day",
    "多于1天": "more than a day",
    "plus d'une journée": "more than a day",
    "أكثر من يوم": "more than a day"
}

# Function to normalize responses
def normalize_responses(response):
    if isinstance(response, str):
        response_split = [r.strip() for r in response.split(',')]
        normalized = [mapping_dict.get(r, None) for r in response_split]
        return [r for r in normalized if r is not None]
    return []

# Apply normalization and count responses
df['Normalized_Responses'] = df['How long does a typical model training take in your ML projects?'].apply(normalize_responses)
all_responses = [item for sublist in df['Normalized_Responses'].dropna() for item in sublist]
response_counts = pd.Series(all_responses).value_counts().reset_index()
response_counts.columns = ['Time', 'Count']

# Chart using Altair with orange color
chart = alt.Chart(response_counts).mark_bar(color='#F7931E').encode(
    x='Count:Q',
    y=alt.Y('Time:N', sort='-x'),
    tooltip=['Time', 'Count']
).properties(
    title='Time taken for a typical model training',
    width=500,
    height=300
)

chart.show()
```

![Time that a typical model training takes in ML projects](images/chart10.png)
***