1 change: 1 addition & 0 deletions .gitignore
@@ -9,3 +9,4 @@ src/policies/__pycache__/
src/apps/__pycache__/
src/.coverage
src/maths/__pycache__/
src/networks/__pycache__/
84 changes: 2 additions & 82 deletions README.md
@@ -3,86 +3,6 @@

# RL anonymity (with Python)

An experimental effort to use reinforcement learning techniques for data anonymization.

## Conceptual overview

The term data anonymization refers to techniques that can be applied to a given dataset, D, such that it becomes
difficult for a third party to identify or infer the existence of specific individuals in D. Anonymization techniques
typically result in some sort of distortion of the original dataset. This means that in order to maintain some utility
of the transformed dataset, the transformations applied should be constrained in some sense. In the end, it can be argued
that data anonymization is an optimization problem, namely striking the right balance between data utility and privacy.

Reinforcement learning is a learning framework based on accumulated experience. In this paradigm, an agent learns by interacting with an environment
without (to a large extent) any supervision. The following image describes, schematically, the reinforcement learning framework.

![RL paradigm](images/agent_environment_interface.png "Reinforcement learning paradigm")

The agent chooses an action, ```a_t```, to perform out of a predefined set of actions ```A```. The chosen action is executed by the environment
instance, which returns to the agent a reward signal, ```r_t```, as well as the new state, ```s_t```, that the environment is in.
The framework has been used successfully in many recent advances in control, robotics, games and elsewhere.


Let's assume that we have at our disposal two numbers: a minimum distortion, ```MIN_DIST```, that should be applied to the dataset
in order to achieve privacy, and a maximum distortion, ```MAX_DIST```, that may be applied to the dataset while still maintaining some utility.
Let's also assume that any overall dataset distortion in ```[MIN_DIST, MAX_DIST]``` is acceptable in order to cast the dataset as
both privacy preserving and utility preserving. We can then train a reinforcement learning agent to distort the dataset
such that the aforementioned objective is achieved.

Overall, this is shown in the image below.

![RL anonymity paradigm](images/general_concept.png "Reinforcement learning anonymity schematics")

The images below show the overall running distortion average and running reward average achieved by using the
<a href="https://en.wikipedia.org/wiki/Q-learning">Q-learning</a> algorithm and various policies.

**Q-learning with epsilon-greedy policy and constant epsilon**
![RL anonymity paradigm](images/q_learn_epsilon_greedy_avg_run_distortion.png "Epsilon-greedy constant epsilon ")
![RL anonymity paradigm](images/q_learn_epsilon_greedy_avg_run_reward.png "Reinforcement learning anonymity schematics")

**Q-learning with epsilon-greedy policy and decaying epsilon per episode**
![RL anonymity paradigm](images/q_learn_epsilon_greedy_decay_avg_run_distortion.png "Reinforcement learning anonymity schematics")
![RL anonymity paradigm](images/q_learn_epsilon_greedy_decay_avg_run_reward.png "Reinforcement learning anonymity schematics")


**Q-learning with epsilon-greedy policy with decaying epsilon at constant rate**
![RL anonymity paradigm](images/q_learn_epsilon_greedy_decay_rate_avg_run_distortion.png "Reinforcement learning anonymity schematics")
![RL anonymity paradigm](images/q_learn_epsilon_greedy_decay_rate_avg_run_reward.png "Reinforcement learning anonymity schematics")

**Q-learning with softmax policy**
![RL anonymity paradigm](images/q_learn_softmax_avg_run_distortion.png "Reinforcement learning anonymity schematics")
![RL anonymity paradigm](images/q_learn_softmax_avg_run_reward.png "Reinforcement learning anonymity schematics")


## Dependencies

The following packages are required.

- <a href="#">NumPy</a>
- <a href="https://www.sphinx-doc.org/en/master/">Sphinx</a>
- <a href="#">Python Pandas</a>

You can use

```
pip install -r requirements.txt
```

## Examples

- <a href="src/examples/qlearning_three_columns.py"> Qlearning agent on a three columns dataset</a>
- <a href="src/examples/nstep_semi_grad_sarsa_three_columns.py"> n-step semi-gradient SARSA on a three columns dataset</a>

## Documentation

You will need <a href="https://www.sphinx-doc.org/en/master/">Sphinx</a> in order to generate the API documentation. Assuming that Sphinx is already installed
on your machine execute the following commands (see also <a href="https://www.sphinx-doc.org/en/master/tutorial/index.html">Sphinx tutorial</a>).

```
sphinx-quickstart docs
sphinx-build -b html docs/source/ docs/build/html
```

## References
An experimental effort to use reinforcement learning techniques for data anonymization. The project documentation
can be found at <a href="https://rl-anonymity-with-python.readthedocs.io/en/latest/index.html">RL anonymity (with Python)</a>

7 changes: 7 additions & 0 deletions docs/source/API/networks/a2c_networks.rst
@@ -0,0 +1,7 @@
a2c\_networks
=============

.. automodule:: a2c_networks

.. autoclass:: A2CNetSimpleLinear
:members: __init__, forward
File renamed without changes.
File renamed without changes.
File renamed without changes.
16 changes: 13 additions & 3 deletions docs/source/Examples/a2c_three_columns.rst
@@ -15,9 +15,19 @@ However, the true objective of reinforcement learning is to directly learn a policy

The main advantage of learning a parametrized policy is that it can be any learnable function e.g. a linear model or a deep neural network.

The A2C algorithm falls under the umbrella of actor-critic methods [REF]. In these methods, we estimate a parametrized policy; the actor
and a parametrized value function; the critic.
The A2C algorithm is a synchronous version of A3C. Both algorithms fall under the umbrella of actor-critic methods [REF]. In these methods, we estimate a parametrized policy (the actor)
and a parametrized value function (the critic). The role of the policy or actor network is to indicate which action to take in a given state. In our implementation below,
the policy network returns a probability distribution over the action space, specifically a tensor of probabilities. The role of the critic model is to evaluate how good
the selected action is.
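
Below is a rough illustrative sketch of such an actor-critic pair with a shared body. The class name, layer sizes and activation are assumptions made for illustration only; this is not the project's ``A2CNetSimpleLinear`` implementation (see the Code section below for the actual model).

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyActorCritic(nn.Module):
        """Illustrative actor-critic model: a shared linear body feeds an
        actor head (action probabilities) and a critic head (state value)."""

        def __init__(self, n_features: int, n_actions: int, n_hidden: int = 64):
            super().__init__()
            self.shared = nn.Linear(n_features, n_hidden)   # shared body
            self.actor = nn.Linear(n_hidden, n_actions)     # policy head
            self.critic = nn.Linear(n_hidden, 1)            # value head

        def forward(self, x: torch.Tensor):
            h = F.relu(self.shared(x))
            probs = F.softmax(self.actor(h), dim=-1)        # tensor of action probabilities
            value = self.critic(h)                          # estimate of how good the state is
            return probs, value

    # one forward pass on a dummy state with 3 features
    net = ToyActorCritic(n_features=3, n_actions=5)
    probs, value = net(torch.randn(1, 3))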

In A2C there is a single agent that interacts with multiple instances of the environment. In other words, we create a number of workers where each worker loads its own instance
of the data set to anonymize. A shared model is then optimized by each worker.

We can use neural networks to approximate both models.


Specifically, we will use a weight-sharing model. Moreover, the environment is a multi-process class that gathers samples from multiple
emvironments at once
environments at once.

Code
----
2 changes: 2 additions & 0 deletions docs/source/conf.py
@@ -20,6 +20,8 @@
sys.path.append(os.path.abspath("../../src/policies/"))
sys.path.append(os.path.abspath("../../src/maths/"))
sys.path.append(os.path.abspath("../../src/utils/"))
sys.path.append(os.path.abspath("../../src/datasets/"))
sys.path.append(os.path.abspath("../../src/networks/"))
print(sys.path)


Binary file added docs/source/images/general_concept.png
5 changes: 3 additions & 2 deletions docs/source/index.rst
@@ -3,10 +3,11 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.

Welcome to RL Anonymity (with Python)'s documentation!
======================================================
RL Anonymity (with Python)
==========================

An experimental effort to use reinforcement learning techniques for data anonymization.
The project repository is at `RL anonymity (with Python) <https://github.com/pockerman/rl_anonymity_with_python>`_.

Contents
--------
5 changes: 4 additions & 1 deletion docs/source/install.rst
@@ -6,14 +6,17 @@ The following packages are required:
- `NumPy <https://numpy.org/>`_
- `Sphinx <https://www.sphinx-doc.org/en/master/>`_
- `Python Pandas <https://pandas.pydata.org/>`_
- `PyTorch <https://pytorch.org/>`_

.. code-block:: console

pip install -r requirements.txt

Run tests
---------

Generate documentation
======================
----------------------

You will need `Sphinx <https://www.sphinx-doc.org/en/master/>`_ in order to generate the API documentation. Assuming that Sphinx is already installed
on your machine execute the following commands (see also `Sphinx tutorial <https://www.sphinx-doc.org/en/master/tutorial/index.html>`_).
24 changes: 14 additions & 10 deletions docs/source/modules.rst
@@ -4,20 +4,24 @@ API
.. toctree::
:maxdepth: 4

API/actions
API/action_space
API/state
API/time_step
API/epsilon_greedy_policy
API/epsilon_greedy_q_estimator
API/q_learning
API/trainer
API/optimizer_type
API/pytorch_optimizer_builder
API/datasets/column_type
API/exceptions/exceptions
API/maths/optimizer_type
API/maths/pytorch_optimizer_builder
API/networks/a2c_networks
API/spaces/actions
API/spaces/action_space
API/spaces/state
API/spaces/discrete_state_environment
API/spaces/tiled_environment
API/spaces/time_step
API/replay_buffer
API/a2c
API/exceptions
API/column_type
API/discrete_state_environment
API/tiled_environment




72 changes: 60 additions & 12 deletions docs/source/overview.rst
@@ -1,25 +1,73 @@
Conceptual overview
===================

The term data anonymization refers to techiniques that can be applied on a given dataset, D, such that after
the latter has been submitted to such techniques, it makes it difficult for a third party to identify or infer the existence
of specific individuals in D. Anonymization techniques, typically result into some sort of distortion
The term data anonymization refers to techniques that can be applied to a given dataset, :math:`D`, such that it becomes difficult for a third party to identify or infer the existence
of specific individuals in :math:`D`. Anonymization techniques typically result in some sort of distortion
of the original dataset. This means that in order to maintain some utility of the transformed dataset, the transformations
applied should be constrained in some sense. In the end, it can be argued that data anonymization is an optimization problem,
namely striking the right balance between data utility and privacy.

Reinforcement learning is a learning framework based on accumulated experience. In this paradigm, an agent is learning by iteracting with an environment
Reinforcement learning is a learning framework based on accumulated experience. In this paradigm, an agent learns by interacting with an environment
without (to a large extent) any supervision. The following image describes, schematically, the reinforcement learning framework.

![RL paradigm](images/agent_environment_interface.png "Reinforcement learning paradigm")
.. figure:: images/agent_environment_interface.png

The agent chooses an action, ```a_t```, to perform out of predefined set of actions ```A```. The chosen action is executed by the environment
instance and returns to the agent a reward signal, ```r_t```, as well as the new state, ```s_t```, that the enviroment is in.
Reinforcement learning paradigm.


The agent chooses an action, :math:`A_t \in \mathbb{A}`, to perform out of a predefined set of actions :math:`\mathbb{A}`. The chosen action is executed by the environment
instance, which returns to the agent a reward signal, :math:`R_{t+1}`, as well as the new state, :math:`S_{t+1}`, that the environment is in.
The overall goal of the agent is to maximize the expected total reward, i.e.

.. math::

\max \mathbb{E}\left[ R \right]


The framework has been used successfully in many recent advances in control, robotics, games and elsewhere.
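
As a toy illustration of this interaction loop, consider the sketch below. The ``ToyEnv`` and ``ToyAgent`` classes are invented stand-ins for illustration only; they are not part of this project's code base.

.. code-block:: python

    import random

    class ToyEnv:
        """Stand-in environment returning a random reward signal."""
        def reset(self):
            return 0                                   # initial state S_0
        def step(self, action):
            reward = random.random()                   # reward R_{t+1}
            next_state = action                        # new state S_{t+1}
            done = random.random() < 0.1               # episode termination flag
            return next_state, reward, done

    class ToyAgent:
        """Stand-in agent choosing actions uniformly at random."""
        def act(self, state):
            return random.randrange(5)                 # A_t out of a set of 5 actions

    env, agent = ToyEnv(), ToyAgent()
    state, total_reward = env.reset(), 0.0
    for t in range(100):
        action = agent.act(state)
        state, reward, done = env.step(action)
        total_reward += reward                         # accumulate the return
        if done:
            break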

In this work we are interested in applying reinforcement learning techniques in order to train agents to optimally anonymize a given
data set. In particular, we want to consider the following two scenarios:

- A tabular data set is to be publicly released
- A data set is behind a restrictive API that allows users to perform certain queries on the hidden data set.

For the first scenario, let's assume that we have at our disposal two numbers :math:`DIST_{min}` and :math:`DIST_{max}`. The former indicates
the minimum total data set distortion that should be applied in order to satisfy some minimum safety criteria. The latter indicates
the maximum total data set distortion that should be applied in order to satisfy some utility criteria. Note that the same idea can be
applied to enforce constraints on how much a column should be distorted. Furthermore, let's assume that the most common transformations
applied for data anonymization are the following:

- Generalization
- Suppression
- Permutation
- Perturbation
- Anatomization

We can conceive the above transformations as our action set :math:`\mathbb{A}`. We can now cast the data anonymity problem into a form
suitable for reinforcement learning. Specifically, our goal, and the agent's goal for that matter, is to obtain a policy :math:`\pi` of transformations such that by following :math:`\pi`,
the total data set distortion falls within the interval :math:`[DIST_{min}, DIST_{max}]`. This is done by choosing actions/transformations from :math:`\mathbb{A}`.
This is shown schematically in the figure below.

.. figure:: images/general_concept.png

Data anonymization using reinforcement learning.

Thus, the environment in our case is an entity that encapsulates the original data set and controls the actions applied on it, as well as the
reward signal :math:`R_{t+1}` and the next state :math:`S_{t+1}` to be presented to the agent.
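
For instance, an end-of-episode reward could be assigned along the following lines. This is only a sketch; the function name and the reward values are assumptions, not the reward policy actually used in this project.

.. code-block:: python

    def distortion_reward(total_distortion: float,
                          dist_min: float, dist_max: float) -> float:
        """Illustrative reward: positive inside [dist_min, dist_max], negative outside."""
        if dist_min <= total_distortion <= dist_max:
            return 1.0    # acceptable privacy/utility trade-off
        return -1.0       # either too little privacy or too much utility lost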

Nevertheless, there are some caveats that we need to take into account. We summarize these below.

First, we need a reward policy. The way we assign rewards implicitly specifies the degree of supervision we allow. For instance, we could
allow a reward to be assigned every time a transformation is applied. This strategy allows for faster learning, but it leaves little room
for the agent to come up with novel strategies. In contrast, returning a reward only at the end of the episode, although it increases the
training time, allows the agent to explore novel strategies. Related to the reward assignment is also the following issue: we need to reward
the agent in a way that convinces it to explore transformations. This is important, as we don't want the agent to simply exploit around the
zero distortion point.

Second, the metric we use to measure the data set distortion plays an important role.

Third, we need to hold in memory two copies of the data set: one copy to which no distortion is applied and one copy that we distort during
an episode. We need this setting so that we are able to compute the column distortions.

Fourth, we need to establish the episode termination criteria, i.e. when we consider an episode to be complete.

Finally, as we assume that a data set may contain strings, floating point numbers as well as integers, the computed distortions are normalized.
This is needed in order to avoid having disproportionately large column distortions, e.g. consider a salary column being distorted, and also
to be able to sum all the column distortions in a meaningful way.
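
As an illustration of the normalization point, a numeric column distortion could be scaled by the magnitude of the original column before summation. The functions below are a sketch under that assumption, not the project's actual distortion metrics; string columns would need, e.g., a normalized string distance instead.

.. code-block:: python

    import numpy as np

    def normalized_numeric_distortion(original: np.ndarray, distorted: np.ndarray) -> float:
        """Illustrative distortion of a numeric column, scaled by the column magnitude
        so that e.g. a salary column does not dominate the total."""
        denom = np.abs(original).sum()
        if denom == 0.0:
            return float(np.abs(distorted).sum() > 0.0)
        return float(np.abs(original - distorted).sum() / denom)

    def total_distortion(column_distortions):
        """Total distortion as the sum of the already normalized column distortions."""
        return float(sum(column_distortions))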

Let's assume that we have in our disposal two numbers a minimum distortion, ```MIN_DIST``` that should be applied to the dataset
for achieving privacy and a maximum distortion, ```MAX_DIST```, that should be applied to the dataset in order to maintain some utility.
Let's assume also that any overall dataset distortion in ```[MIN_DIST, MAX_DIST]``` is acceptable in order to cast the dataset as
preserving privacy and preserving dataset utility. We can then train a reinforcement learning agent to distort the dataset
such that the aforementioned objective is achieved.