# Tutorial 4: Client-side processors

In this tutorial, we will learn how to prepare the data prior to running an avatarization by using processors on your local machine.

This step is necessary in some cases to handle and preserve data characteristics that are not natively handled by the avatarization or its embedded processors.

We'll also show how custom client-side processors can be defined to integrate domain knowledge into an avatarization.

## Principles

![pipeline](img/pipeline.png)

## Connection

In [1]:
import os

url = os.environ.get("AVATAR_BASE_URL")
username = os.environ.get("AVATAR_USERNAME")
password = os.environ.get("AVATAR_PASSWORD")

In [2]:
# This is the client that you'll be using for all of your requests
from avatars.client import ApiClient
from avatars.models import (
    AvatarizationJobCreate,
    AvatarizationParameters,
    ImputationParameters,
    ImputeMethod,
    ExcludeCategoricalParameters,
    ExcludeCategoricalMethod,
    RareCategoricalMethod,
)
from avatars.models import ReportCreate

from avatars.api import AvatarizationPipelineCreate
from avatars.processors.proportions import ProportionProcessor
from avatars.processors.group_modalities import GroupModalitiesProcessor
from avatars.processors.relative_difference import RelativeDifferenceProcessor
from avatars.processors.perturbation import PerturbationProcessor
from avatars.processors.expected_mean import ExpectedMeanProcessor
from avatars.processors.datetime import DatetimeProcessor

# The following are not necessary to run avatar but are used in this tutorial
import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Change this to your actual server endpoint, e.g. base_url="https://avatar.company.com"
client = ApiClient(base_url=url)
client.authenticate(username=username, password=password)

# Verify that we can connect to the API server
client.health.get_health()

{'message': 'ok'}

## A helper processor to reduce the number of modalities

We have seen in the previous tutorial one approach to handle categorical variables with large cardinality. We propose here an alternative to do so by means of a client-side processor.

This processor will group modalities together to ensure the target variable has a requested number of modalities. The least represented modalities will be brought together under a `other` modality. Note that this transformation is irreversible (the original value cannot be infered from `other`. 

Because this is an irreversible operation, this transformation of the data should be done outside the pipeline. The transformed data will be used as a basis for comparison when computing utility and privacy metrics.

In [3]:
df = pd.read_csv("../fixtures/adult_with_cities.csv").head(1000)
dataset = client.pandas_integration.upload_dataframe(df)
print(df.shape)
df.head()

(1000, 16)


Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income,city
0,25.0,Private,226802.0,11th,7.0,Never-married,Machine-op-inspct,Own-child,Black,Male,,0.0,40.0,United-States,<=50K,Gordonville
1,38.0,Private,89814.0,HS-grad,9.0,Married-civ-spouse,Farming-fishing,Husband,White,,0.0,0.0,50.0,United-States,<=50K,Connieburgh
2,28.0,Local-gov,336951.0,Assoc-acdm,12.0,Married-civ-spouse,,Husband,White,Male,0.0,0.0,40.0,United-States,>50K,New Jason
3,44.0,Private,160323.0,Some-college,10.0,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688.0,0.0,40.0,United-States,,Lake Kirstenberg
4,18.0,?,103497.0,Some-college,10.0,Never-married,?,Own-child,White,Female,0.0,0.0,30.0,United-States,<=50K,Diazfort


After loading the data, we decide we wish to reduce the number of modalities for the variable `city` which contains originally over 80 distinct values.

In [4]:
df["city"].value_counts()

South Monica         36
Port Amanda          35
Lake Rebeccaville    31
Calderonport         31
Keithport            30
                     ..
Scottburgh            1
Trevorbury            1
Nicholsmouth          1
South Joseph          1
Fletcherview          1
Name: city, Length: 84, dtype: int64

In [5]:
group_modalities_processor = GroupModalitiesProcessor(
    min_unique=10,  # number of modalities for a variable to be considered for grouping
    global_threshold=25,  # if considered for grouping, number of individuals in modality to preserve it
    new_category="other",
)

In [6]:
df_preprocessed = group_modalities_processor.preprocess(df)

Once the group modality processor has been applied, we can confirm that the number of modalities for the `city` variables has been reduced

In [7]:
df_preprocessed["city"].value_counts()

other                698
South Monica          36
Port Amanda           35
Lake Rebeccaville     31
Calderonport          31
Keithport             30
Lake Kirstenberg      29
Robinfort             29
Johnsonview           28
North Jacqueline      27
Charlesberg           26
Name: city, dtype: int64

In [8]:
%%time
dataset = client.pandas_integration.upload_dataframe(df_preprocessed)

result = client.pipelines.avatarization_pipeline_with_processors(
    AvatarizationPipelineCreate(
        avatarization_job_create=AvatarizationJobCreate(
            parameters=AvatarizationParameters(dataset_id=dataset.id, k=5),
        ),
        processors=[],
        df=df,
    ),
    timeout=1000,
)

CPU times: user 263 ms, sys: 23.2 ms, total: 286 ms
Wall time: 17.2 s


In [9]:
result

AvatarizationPipelineResult(privacy_metrics=PrivacyMetrics(hidden_rate=98.6, local_cloaking=52.0, distance_to_closest=5.424844741821289, closest_distances_ratio=0.8962736569284675, direct_match_protection=0.87012, categorical_hidden_rate=99.37823834196891), signal_metrics=SignalMetrics(hellinger_mean=0.1911243739822291, hellinger_std=0.1826347944284449, correlation_difference_ratio=1.9179414698104045), post_processed_avatars=      age         workclass    fnlwgt  education  educational-num  \
0    33.0           Private  128806.0  Bachelors             13.0   
1    38.0  Self-emp-not-inc  201089.0  Bachelors             13.0   
2     NaN           Private  153349.0  Bachelors             10.0   
3    31.0           Private  109891.0  Bachelors             13.0   
4    36.0      Self-emp-inc  202530.0    HS-grad              NaN   
..    ...               ...       ...        ...              ...   
995  38.0           Private       NaN  Bachelors             11.0   
996  23.0          

In [10]:
avatars = result.post_processed_avatars
avatars.head(3)

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income,city
0,33.0,Private,128806.0,Bachelors,13.0,Never-married,Exec-managerial,Not-in-family,White,Male,0.0,19.0,42.0,United-States,<=50K,Matthewfort
1,38.0,Self-emp-not-inc,201089.0,Bachelors,13.0,Married-civ-spouse,Other-service,Husband,,Male,0.0,14.0,55.0,United-States,>50K,Laurenside
2,,Private,153349.0,Bachelors,10.0,,Sales,Husband,White,Male,5927.0,-0.0,,United-States,>50K,Coreystad


In [11]:
avatars["city"].value_counts()

South Monica         40
Port Amanda          40
Charlesberg          35
Calderonport         34
Lake Rebeccaville    33
                     ..
New Mario             1
West Nathaniel        1
Cassandrafurt         1
South Kara            1
Blackbury             1
Name: city, Length: 75, dtype: int64

We observe that the avatars produced have a reduced number of cities and an extra `other` modality for the `city` variable. Note that the use of a client-side processor made the transformation of the data straightforward. 

The calculation of the metrics has been performed during the execution of the pipeline. Results can be obtained as shown below.

In [12]:
privacy_metrics = result.privacy_metrics
print("*** Privacy metrics ***")
for metric in privacy_metrics:
    print(metric)

utility_metrics = result.signal_metrics
print("\n*** Utility metrics ***")
for metric in utility_metrics:
    print(metric)

*** Privacy metrics ***
('hidden_rate', 98.6)
('local_cloaking', 52.0)
('distance_to_closest', 5.424844741821289)
('closest_distances_ratio', 0.8962736569284675)
('direct_match_protection', 0.87012)
('categorical_hidden_rate', 99.37823834196891)

*** Utility metrics ***
('hellinger_mean', 0.1911243739822291)
('hellinger_std', 0.1826347944284449)
('correlation_difference_ratio', 1.9179414698104045)


## Modeling inter-variables constraints with processors

We will now use two processors to enforce inter-variable constraints.

The two processors we will now apply are processors that temporarily transform the data in order to improve the avatarization. This means that they both contain a `preprocess` step and a `postprocess` step, ensuring that the effect of the `preprocess` action can be reversed via the use of the `postprocess` action. 

These processors will be used to demonstrate the use of the pipeline tool that automates the use of processors, the avatarization and the metric computation in a single command. 


In [13]:
df = pd.read_csv("../fixtures/epl.csv")

Prior to applying processors, it is important to check `dtypes` and eventually convert date variables to a `datetime` format using `pandas.to_datetime` function. 

In [14]:
df.dtypes

age                      int64
position                object
minutes_in_game          int64
appearances              int64
minutes_played_home      int64
minutes_played_away      int64
minutes_on_bench         int64
appearances_home         int64
appearances_away         int64
penalty_goals            int64
goals_home               int64
goals_away               int64
penalty_misses           int64
yellow_cards_overall     int64
red_cards_overall        int64
penalty_attempts         int64
career_start_date       object
club_signing_date       object
dtype: object

In [15]:
df['career_start_date'] = pd.to_datetime(df['career_start_date'], format='%Y-%m-%d %H:%M:%S')
df['club_signing_date'] = pd.to_datetime(df['club_signing_date'], format='%Y-%m-%d %H:%M:%S')

### Proportions

Variables may have relationships in which one or many variables to be represented as a proportion of another. In order to best preserve this type of relationships during avatarization, it is recommended to express such variables as proportions. To do so, the `proportion` processor can be used.

In [16]:
proportion_processor = ProportionProcessor(
    variable_names=["minutes_played_home", "minutes_played_away", "minutes_on_bench"],
    reference="minutes_in_game",
    sum_to_one=True,
    decimal_count=0,
)

In [17]:
df

Unnamed: 0,age,position,minutes_in_game,appearances,minutes_played_home,minutes_played_away,minutes_on_bench,appearances_home,appearances_away,penalty_goals,goals_home,goals_away,penalty_misses,yellow_cards_overall,red_cards_overall,penalty_attempts,career_start_date,club_signing_date
0,31,Defender,1964,22,898,711,355,12,10,0,0,0,0,1,0,0,2015-04-18 07:05:27,2018-07-10 05:13:43
1,33,Midfielder,1607,18,497,740,370,8,10,0,1,0,0,1,0,0,2014-01-20 21:09:29,2021-04-14 05:43:39
2,30,Midfielder,2920,31,1200,1147,573,16,15,1,1,2,0,4,0,1,2014-10-21 23:56:27,2021-03-05 19:52:44
3,30,Midfielder,1671,30,699,648,324,15,15,0,2,2,0,0,0,0,2017-09-25 17:57:41,2018-10-28 08:35:36
4,20,Forward,121,4,24,65,32,2,2,0,0,0,0,0,0,0,2016-02-23 08:34:56,2020-02-16 04:25:53
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
567,23,Midfielder,1375,15,585,527,263,8,7,0,2,1,0,2,0,0,2015-09-28 04:52:53,2018-05-25 01:23:34
568,24,Midfielder,2305,30,757,1032,516,14,16,0,0,0,0,5,0,0,2014-04-02 08:05:29,2018-01-30 20:27:10
569,20,Defender,25,2,10,10,5,1,1,0,0,0,0,0,0,0,2016-02-28 16:14:44,2020-03-26 21:02:43
570,20,Defender,25,2,10,10,5,1,1,0,0,0,0,0,0,0,2017-10-21 11:10:20,2019-11-10 07:34:54


### Relative differences

Some variables may have a hierarchy where on variable is always higher than an other. In order to be sure that this hierarchy is preserved at avatarization, it is recommended to express one variable as the difference from the other.

We take `penalty_attempts` and `penalty_goals` as an example where one variable (`penalty_goals`) cannot be greater than the other (`penalty_attempts`).

In [18]:
relative_difference_processor = RelativeDifferenceProcessor(
    target="penalty_goals",
    references=["penalty_attempts"],
)

### Relative differences with datetime variables

The relative difference processor can also be used to express a date relative to another. To do so, it is required to use the `DatetimeProcessor` processor that will transform datetime variables into numeric values, enabling differences to be computed between date variables. Because the `DatetimeProcessor` has a post-process function, the data output by the avtarization pipeline will contain the datetime variables in their original format (i.e. as datetime rather than numeric values). 

In [19]:
datetime_processor = DatetimeProcessor()

In [20]:
relative_difference_processor_dates = RelativeDifferenceProcessor(
    target="club_signing_date",
    references=["career_start_date"],
)

### Computed variables

The data also contains a third variable related to the penalty context: `penalty_misses`. This variable can be computed directly as the difference between `penalty_attempts` and `penalty_goals`. 

Computed variables should be removed from the data prior to running the avtarization and re-computed after.

In [21]:
df = df.drop(columns=["penalty_misses"])

### Run the pipeline

In [22]:
%%time
dataset = client.pandas_integration.upload_dataframe(df)

result = client.pipelines.avatarization_pipeline_with_processors(
    AvatarizationPipelineCreate(
        avatarization_job_create=AvatarizationJobCreate(
            parameters=AvatarizationParameters(dataset_id=dataset.id, k=5)
        ),
        processors=[proportion_processor, relative_difference_processor, datetime_processor, relative_difference_processor_dates],
        df=df,
    ),
    timeout=1000,
)

CPU times: user 381 ms, sys: 19.6 ms, total: 401 ms
Wall time: 9.29 s


In [23]:
avatars = result.post_processed_avatars
avatars.head(5)

Unnamed: 0,age,position,minutes_in_game,appearances,minutes_played_home,minutes_played_away,minutes_on_bench,appearances_home,appearances_away,penalty_goals,goals_home,goals_away,yellow_cards_overall,red_cards_overall,penalty_attempts,career_start_date,club_signing_date
0,24,Midfielder,199,5,39.0,107.0,53.0,2,3,0.0,0,0,0,0,0,2016-10-21 20:41:43,2019-06-11 13:57:02
1,31,Defender,2987,29,1127.0,1240.0,620.0,14,15,0.0,0,0,2,0,0,2015-09-21 08:35:40,2020-04-16 05:27:04
2,22,Midfielder,25,2,10.0,10.0,5.0,1,1,0.0,0,0,0,0,0,2014-09-20 08:21:20,2020-02-24 06:10:24
3,30,Midfielder,1880,26,964.0,611.0,305.0,15,11,0.0,2,1,3,0,0,2015-10-02 22:11:09,2020-10-11 17:16:57
4,24,Forward,3003,36,1321.0,1122.0,560.0,18,18,0.0,4,6,4,1,0,2015-08-03 05:20:01,2020-01-23 22:54:11


In [24]:
privacy_metrics = result.privacy_metrics
print("*** Privacy metrics ***")
for metric in privacy_metrics:
    print(metric)

utility_metrics = result.signal_metrics
print("\n*** Utility metrics ***")
for metric in utility_metrics:
    print(metric)

*** Privacy metrics ***
('hidden_rate', 93.18181818181819)
('local_cloaking', 12.0)
('distance_to_closest', 0.8525838255882263)
('closest_distances_ratio', 0.7774512262409512)
('direct_match_protection', 0.9063034867230672)
('categorical_hidden_rate', 100.0)

*** Utility metrics ***
('hellinger_mean', 0.19481139717883394)
('hellinger_std', 0.14968866351296)
('correlation_difference_ratio', 2.0990272363945204)


### Should these processors really be used ?

Let's try without ...

In [25]:
df2 = pd.read_csv("../fixtures/epl.csv")
df2['career_start_date'] = pd.to_datetime(df2['career_start_date'], format='%Y-%m-%d %H:%M:%S')
df2['club_signing_date'] = pd.to_datetime(df2['club_signing_date'], format='%Y-%m-%d %H:%M:%S')

dataset = client.pandas_integration.upload_dataframe(df2)
job = client.jobs.create_avatarization_job(
    AvatarizationJobCreate(
        parameters=AvatarizationParameters(k=20, ncp=2, dataset_id=dataset.id)
    )
)
job = client.jobs.get_avatarization_job(id=job.id)

In [26]:
avatars_noprocessing = client.pandas_integration.download_dataframe(
    job.result.avatars_dataset.id
)

In [27]:
avatars_noprocessing.head(5)

Unnamed: 0,age,position,minutes_in_game,appearances,minutes_played_home,minutes_played_away,minutes_on_bench,appearances_home,appearances_away,penalty_goals,goals_home,goals_away,penalty_misses,yellow_cards_overall,red_cards_overall,penalty_attempts,career_start_date,club_signing_date
0,27,Defender,25,2,10,10,5,1,1,0,0,0,0,0,0,0,2016-08-26 01:31:27,2019-04-10 11:12:46
1,26,Forward,1618,25,723,597,298,13,12,0,3,1,0,2,0,0,2016-03-29 02:09:50,2019-07-21 13:08:08
2,24,Defender,1152,15,432,480,240,7,8,0,0,1,0,3,1,0,2014-12-07 16:53:14,2020-03-23 16:59:32
3,25,Midfielder,2837,33,1114,1149,574,16,17,0,1,1,0,4,0,0,2016-06-15 15:13:14,2020-08-11 00:52:50
4,34,Goalkeeper,66,3,50,11,5,2,1,0,0,0,0,0,0,0,2016-04-14 19:56:08,2020-06-07 01:10:53


#### Preservation of the proportions

To confirm that proportions are well kept, we can compute the maximum difference between the reference variable (`minutes_in_game`) and the sum of the three proportion variables (`minutes_played_home`, `minutes_played_away` and `minutes_on_bench`). Where it may not be zero when no processor is used, this difference should be zero when using a proportion processor. 

In [28]:
np.max(
    abs(
        avatars_noprocessing["minutes_in_game"]
        - (
            avatars_noprocessing["minutes_played_home"]
            + avatars_noprocessing["minutes_played_away"]
            + avatars_noprocessing["minutes_on_bench"]
        )
    )
)

1

In [29]:
np.max(
    abs(
        avatars["minutes_in_game"]
        - (
            avatars["minutes_played_home"]
            + avatars["minutes_played_away"]
            + avatars["minutes_on_bench"]
        )
    )
)

0.0

#### Preservation of the relative difference

In [30]:
print("Avatars with processors")
print(
    "Number of players with penalty attempts > penalty goals: ",
    (sum(avatars["penalty_attempts"] - avatars["penalty_goals"] > 0)),
)
print(
    "Number of players with penalty attempts < penalty goals: ",
    (sum(avatars["penalty_attempts"] - avatars["penalty_goals"] < 0)),
)

Avatars with processors
Number of players with penalty attempts > penalty goals:  7
Number of players with penalty attempts < penalty goals:  0


In [31]:
print("Avatars without processors")
print(
    "Number of players with penalty attempts > penalty goals: ",
    (
        sum(
            avatars_noprocessing["penalty_attempts"]
            - avatars_noprocessing["penalty_goals"]
            > 0
        )
    ),
)
print(
    "Number of players with penalty attempts < penalty goals: ",
    (
        sum(
            avatars_noprocessing["penalty_attempts"]
            - avatars_noprocessing["penalty_goals"]
            < 0
        )
    ),
)

Avatars without processors
Number of players with penalty attempts > penalty goals:  11
Number of players with penalty attempts < penalty goals:  0


## Post-processors

Post-processors are processors that do not transform the data prior to the avatarization but after only. These can be used to fix some variables that could have been altered beyond acceptable. Care should always be taken when using such processors because they are likely to decrease the level of privacy. By using these processors via the pipeline feature, we ensure that metrics are computed after application of the post-process step and so that the privacy and utility metrics have taken these processors into consideration.

### Expected mean

In [32]:
expected_mean_processor = ExpectedMeanProcessor(
    target_variables=["goals_away", "goals_home"],
    groupby_variables=["position"],
    same_std=False,
)

### Run the pipeline

In [33]:
dataset = client.pandas_integration.upload_dataframe(df)

result = client.pipelines.avatarization_pipeline_with_processors(
    AvatarizationPipelineCreate(
        avatarization_job_create=AvatarizationJobCreate(
            parameters=AvatarizationParameters(dataset_id=dataset.id, k=5),
        ),
        processors=[
            proportion_processor,
            relative_difference_processor,
            expected_mean_processor,
        ],
        df=df,
    ),
    timeout=1000)

In [34]:
avatars = result.post_processed_avatars
avatars.head(5)

Unnamed: 0,age,position,minutes_in_game,appearances,minutes_played_home,minutes_played_away,minutes_on_bench,appearances_home,appearances_away,penalty_goals,goals_home,goals_away,yellow_cards_overall,red_cards_overall,penalty_attempts,career_start_date,club_signing_date
0,27,Goalkeeper,25,2,10.0,10.0,5.0,1,1,0.0,0.0,0.0,0,0,0,2015-07-17 14:34:56,2019-11-13 15:29:38
1,32,Midfielder,2320,33,970.0,900.0,450.0,16,17,0.0,4.195442,2.161519,2,0,0,2016-09-20 07:11:50,2019-09-27 01:39:26
2,23,Midfielder,25,2,10.0,10.0,5.0,1,1,0.0,0.195442,0.161519,0,0,0,2017-01-14 17:01:16,2020-11-01 19:29:38
3,24,Midfielder,1005,16,427.0,385.0,193.0,8,7,0.0,0.195442,0.161519,1,0,0,2015-03-11 11:58:39,2019-05-28 14:00:10
4,30,Defender,233,7,20.0,142.0,71.0,3,4,0.0,0.003934,0.029668,0,0,0,2015-04-09 22:17:12,2019-09-28 20:52:51


Looking at the mean of the two variables on which the expected mean processor was applied, we can confirm that the mean for each target category is preserved.

The same statistics computed on avatars that did not get post-processed by this same processor are more different than the statistics obtained on the original data.

In [35]:
df.groupby(["position"]).mean(numeric_only=True)[["goals_away", "goals_home"]]

  df.groupby(["position"]).mean()[["goals_away", "goals_home"]]


Unnamed: 0_level_0,goals_away,goals_home
position,Unnamed: 1_level_1,Unnamed: 2_level_1
Defender,0.333333,0.375661
Forward,1.833333,2.614035
Goalkeeper,0.0,0.0
Midfielder,0.910377,0.971698


In [36]:
avatars.groupby(["position"]).mean(numeric_only=True)[["goals_away", "goals_home"]]

  avatars.groupby(["position"]).mean()[["goals_away", "goals_home"]]


Unnamed: 0_level_0,goals_away,goals_home
position,Unnamed: 1_level_1,Unnamed: 2_level_1
Defender,0.333333,0.375661
Forward,1.833333,2.614035
Goalkeeper,0.0,0.0
Midfielder,0.910377,0.971698


In [37]:
avatars_noprocessing.groupby(["position"]).mean(numeric_only=True)[["goals_away", "goals_home"]]

  avatars_noprocessing.groupby(["position"]).mean()[["goals_away", "goals_home"]]


Unnamed: 0_level_0,goals_away,goals_home
position,Unnamed: 1_level_1,Unnamed: 2_level_1
Defender,0.320197,0.339901
Forward,1.796117,2.582524
Goalkeeper,0.0,0.042553
Midfielder,0.730594,0.863014


### Computed variables

To complete the anonymization process, variables that are the results of an operation between other variables and that should have been removed from the data should be added back to the avatarized data.

In [38]:
avatars["penalty_missed"] = avatars["penalty_attempts"] - avatars["penalty_goals"]
avatars_noprocessing["penalty_missed"] = (
    avatars_noprocessing["penalty_attempts"] - avatars_noprocessing["penalty_goals"]
)

In [39]:
avatars.head()

Unnamed: 0,age,position,minutes_in_game,appearances,minutes_played_home,minutes_played_away,minutes_on_bench,appearances_home,appearances_away,penalty_goals,goals_home,goals_away,yellow_cards_overall,red_cards_overall,penalty_attempts,career_start_date,club_signing_date,penalty_missed
0,27,Goalkeeper,25,2,10.0,10.0,5.0,1,1,0.0,0.0,0.0,0,0,0,2015-07-17 14:34:56,2019-11-13 15:29:38,0.0
1,32,Midfielder,2320,33,970.0,900.0,450.0,16,17,0.0,4.195442,2.161519,2,0,0,2016-09-20 07:11:50,2019-09-27 01:39:26,0.0
2,23,Midfielder,25,2,10.0,10.0,5.0,1,1,0.0,0.195442,0.161519,0,0,0,2017-01-14 17:01:16,2020-11-01 19:29:38,0.0
3,24,Midfielder,1005,16,427.0,385.0,193.0,8,7,0.0,0.195442,0.161519,1,0,0,2015-03-11 11:58:39,2019-05-28 14:00:10,0.0
4,30,Defender,233,7,20.0,142.0,71.0,3,4,0.0,0.003934,0.029668,0,0,0,2015-04-09 22:17:12,2019-09-28 20:52:51,0.0


### Perturbation level

The perturbation processor can be used to control how close to the avatarized values, the final values of a variable will be. At the extremes, if using a perturbation level of zero, the avatarized values will not contribute at all to the final values. On the other hand, with a perturbation level of 1, the original values will not contribute. A perturbation level of 0.3 will mean that the final value will be closer to the original values than it is from the avatraized value. By default, the perturbation level is set to 1.

In [40]:
perturbation_processor = PerturbationProcessor(perturbation_level={"age": 1})

In [41]:
result = client.pipelines.avatarization_pipeline_with_processors(
    AvatarizationPipelineCreate(
        avatarization_job_create=AvatarizationJobCreate(
            parameters=AvatarizationParameters(dataset_id=dataset.id, k=5),
        ),
        processors=[
            proportion_processor,
            relative_difference_processor,
            expected_mean_processor,
            perturbation_processor,
        ],
        df=df,
    ),
    timeout=1000,
)
avatars_perturbation_1 = result.post_processed_avatars

perturbation_processor = PerturbationProcessor(perturbation_level={"age": 0})
result = client.pipelines.avatarization_pipeline_with_processors(
    AvatarizationPipelineCreate(
        avatarization_job_create=AvatarizationJobCreate(
            parameters=AvatarizationParameters(dataset_id=dataset.id, k=5),
        ),
        processors=[
            proportion_processor,
            relative_difference_processor,
            expected_mean_processor,
            perturbation_processor,
        ],
        df=df,
    ),
    timeout=1000,
)
avatars_perturbation_0 = result.post_processed_avatars

perturbation_processor = PerturbationProcessor(perturbation_level={"age": 0.5})
result = client.pipelines.avatarization_pipeline_with_processors(
    AvatarizationPipelineCreate(
        avatarization_job_create=AvatarizationJobCreate(
            parameters=AvatarizationParameters(dataset_id=dataset.id, k=5),
        ),
        processors=[
            proportion_processor,
            relative_difference_processor,
            expected_mean_processor,
            perturbation_processor,
        ],
        df=df,
    ),
    timeout=1000,
)
avatars_perturbation_05 = result.post_processed_avatars

We observe that as expected, using a perturbation level of 0 on the variable `age`, this variable gets unchanged.

In [42]:
df["age"].value_counts() - avatars_perturbation_0["age"].value_counts()

29    0
27    0
31    0
28    0
30    0
32    0
26    0
25    0
24    0
23    0
33    0
34    0
22    0
21    0
20    0
35    0
37    0
36    0
38    0
40    0
41    0
0     0
19    0
39    0
Name: age, dtype: int64

The same comment does not hold when using a perturbation level of 0.5 or 1. A count of each modality shows this with new modalities being created at avatarization.

In [43]:
df["age"].value_counts() - avatars_perturbation_05["age"].value_counts()

0.0      NaN
10.0     NaN
10.5     NaN
19.0     NaN
20.0    17.0
20.5     NaN
21.0     8.0
21.5     NaN
22.0    13.0
22.5     NaN
23.0    18.0
23.5     NaN
24.0    20.0
24.5     NaN
25.0    19.0
25.5     NaN
26.0     7.0
26.5     NaN
27.0    23.0
27.5     NaN
28.0    18.0
28.5     NaN
29.0    24.0
29.5     NaN
30.0    15.0
30.5     NaN
31.0    26.0
31.5     NaN
32.0    26.0
32.5     NaN
33.0    11.0
33.5     NaN
34.0    15.0
34.5     NaN
35.0    10.0
36.0     4.0
36.5     NaN
37.0     7.0
37.5     NaN
38.0     3.0
38.5     NaN
39.0     NaN
40.0     NaN
41.0     NaN
Name: age, dtype: float64

In [44]:
df["age"].value_counts() - avatars_perturbation_1["age"].value_counts()

0      NaN
17     NaN
19     NaN
20    11.0
21    -3.0
22     6.0
23     5.0
24     4.0
25    -1.0
26     1.0
27   -20.0
28   -10.0
29   -21.0
30   -13.0
31   -10.0
32    12.0
33     1.0
34    13.0
35     7.0
36     5.0
37     1.0
38     NaN
39     0.0
40     NaN
41     NaN
Name: age, dtype: float64

*In the next tutorial, we will show how to define your own processor to be executed client-side.*