<a href="https://colab.research.google.com/github/pvaluedotone/winsor/blob/main/Winsor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Winsorisation tool

Saiyidi MAT RONI

4 Apr 2025

Version 1.0


---


***Citation***

Mat Roni, S. (2025). *Winsorisation tool* (version 1.0) [software]. Google Colab. https://colab.research.google.com/drive/1dkLWC79uJ-MAyKPY_Hv_wAqQrSa59nhQ?usp=sharing


---


This notebook launches a Gradio interface for the winsorisation tool. You can use the interface directly in the output section of the Colab environment, or open it as a standalone web app. The public URL changes on every run; look for it in the output section, on the line that starts with `* Running on public URL: https://` and ends with `gradio.live`.





---


Install dependencies

In [1]:
!pip install numpy pandas scipy matplotlib gradio




---
Run Winsor on Gradio.


In [2]:
import numpy as np
import pandas as pd
import gradio as gr
from scipy.stats.mstats import winsorize
import matplotlib.pyplot as plt

# Helper function to create before and after box-whisker plots
def create_comparison_boxplot(df_original, df_winsorized, columns):
    fig, axes = plt.subplots(nrows=1, ncols=len(columns), figsize=(6 * len(columns), 5))
    if len(columns) == 1:
        axes = [axes]
    for ax, col in zip(axes, columns):
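        ax.boxplot([df_original[col], df_winsorized[col]])
        # Label the boxes via the axis; the `labels=` argument is deprecated since Matplotlib 3.9
        ax.set_xticklabels(['Before', 'After'])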
        ax.set_title(f'Box-Whisker Plot Comparison for {col}')
        ax.set_ylabel(col)
    plt.tight_layout()
    plot_filename = "comparison_boxplots.png"
    plt.savefig(plot_filename)
    plt.close(fig)
    return plot_filename

# Processing function with statistical comparison
def process_file(file, columns_to_winsorize, winsor_level):
    try:
        df = pd.read_csv(file.name)

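        # Split the comma-separated input, ignoring blank entries such as a trailing comma
        selected_columns = [col.strip() for col in columns_to_winsorize.split(',') if col.strip()]
        if selected_columns and all(var in df.columns for var in selected_columns):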

            winsor_limits = [winsor_level / 100, winsor_level / 100]
            df_winsorized = df.copy()

            stats_summary = []

            for col in selected_columns:
                original_stats = {
                    "Variable": col,
                    "Type": "Before",
                    "Min": np.min(df[col]),
                    "Max": np.max(df[col]),
                    "Mean": np.mean(df[col]),
                }
                df_winsorized[col] = winsorize(df[col], limits=winsor_limits)
                winsorized_stats = {
                    "Variable": col,
                    "Type": "After",
                    "Min": np.min(df_winsorized[col]),
                    "Max": np.max(df_winsorized[col]),
                    "Mean": np.mean(df_winsorized[col]),
                }
                stats_summary.extend([original_stats, winsorized_stats])

            output_filename = "winsorized_dataset.csv"
            df_winsorized.to_csv(output_filename, index=False)

            plot_filename = create_comparison_boxplot(df, df_winsorized, selected_columns)

            # Collect the before/after statistics into a table for display
            stats_df = pd.DataFrame(stats_summary)
            return output_filename, df_winsorized.head(), stats_df, plot_filename

        else:
            raise gr.Error("One or more specified variables do not exist in the dataset.")
    except gr.Error:
        raise
    except Exception as e:
        # Show the message in the interface instead of passing it to the file output
        raise gr.Error(str(e))

# Listing columns function
def list_columns(file):
    try:
        df = pd.read_csv(file.name)
        return ', '.join(df.columns)
    except Exception as e:
        return str(e)

# Gradio Interface
winsor = gr.Blocks()
with winsor:
    gr.Markdown("## Winsorisation tool")
    with gr.Row():
        file_input = gr.File(label="Upload CSV file")
        column_display = gr.Textbox(label="Column names", interactive=False)
        list_button = gr.Button("List columns")

    columns_input = gr.Textbox(label="Enter columns to winsorise (comma-separated)")
    winsor_level_input = gr.Number(label="Enter winsorisation level (%)", value=5, precision=1)
    process_button = gr.Button("Process data")

    output_file = gr.File(label="Download processed dataset")
    output_preview = gr.Dataframe(label="Preview of processed data")
    stats_output = gr.Dataframe(label="Statistics before and after winsorisation")
    plot_output = gr.Image(label="Comparative Box-Whisker plots")

    list_button.click(list_columns, inputs=[file_input], outputs=[column_display])
    process_button.click(
        process_file,
        inputs=[file_input, columns_input, winsor_level_input],
        outputs=[output_file, output_preview, stats_output, plot_output]
    )

winsor.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://c01db44ee5789bbf10.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
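
To make the winsorisation level concrete, the following is a minimal pure-Python sketch of what a symmetric level does to a single column. It is an illustration only, not the code the app runs: the function name `winsorise` and the sample data are hypothetical, and it assumes the number of observations replaced in each tail is the level times the sample size, truncated to an integer, which is how `scipy.stats.mstats.winsorize` behaves with its default `inclusive=(True, True)`. Missing values are not handled.

```python
def winsorise(values, level_percent):
    """Return a copy of `values` with each tail clipped at `level_percent` percent.

    Hypothetical helper for illustration; the app itself calls
    scipy.stats.mstats.winsorize.
    """
    if not 0 <= level_percent < 50:
        raise ValueError("level_percent must be in the range [0, 50)")
    n = len(values)
    # Number of observations replaced in each tail (truncated, not rounded)
    k = int(level_percent / 100 * n)
    if k == 0:
        return list(values)
    ordered = sorted(values)
    lower = ordered[k]          # smallest value that is kept unchanged
    upper = ordered[n - k - 1]  # largest value that is kept unchanged
    # Pull every value outside [lower, upper] back to the nearest bound
    return [min(max(v, lower), upper) for v in values]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
clipped = winsorise(data, 10)
print(clipped)                                         # [2, 2, 3, 4, 5, 6, 7, 8, 9, 9]
print(sum(data) / len(data), sum(clipped) / len(clipped))  # 14.5 5.5
```

With ten observations and a 10% level, one value is replaced in each tail: the outlier 100 becomes 9 and the minimum 1 becomes 2, which pulls the mean from 14.5 down to 5.5. This is the same kind of before and after shift that the app reports in its statistics table.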


