## SAFE Protocol NLP Experiments 

[look at this link: https://medium.com/codex/the-magical-markdown-i-bet-you-dont-know-b51f8c049773]:# (resource for markdown)

This notebook contains the following contents:


It builds on the other notebooks, where I [illustrated the bare SAFE Protocol itself](./safe-illustration.ipynb) and offered [some suggestions to make it more secure](./safe-improved.ipynb). This notebook specifically illustrates the application of SAFE to trend detection in document interactions.

As with the other notebooks, much of the material here relies heavily on {cite}`SAFE`.

See Section 3.4.1 of the report for elaboration on the approach being implemented here.

### Trending Mood Detection For Survey Responses

As described in the report, users respond to a prompt question like "*How do you feel today?*" every day for a month, and an analysis of which responses are trending is carried out every 2 weeks.

When submitting a response to this question, the user picks from a pre-defined list of responses that contains strings like the following:

In [1]:
possible_responses = {
    0: "I'm feeling joyful!",
    1: "I'm feeling angry",
    2: "I'm feeling disgusted",
    3: "I'm feeling fearful",
    4: "I'm feeling sad...",
    5: "I'm feeling surprised!",
    6: "I'm feeling neutral"
}

These responses are each encoded as numbers, so what is actually stored locally on user devices are integers from `0` to `6` corresponding to each of these answers (`0` corresponds to `"I'm feeling joyful!"`, `1` corresponds to `"I'm feeling angry"`, and so on).

The target computation of the Bayesian approach to secure trend detection is:

$$
p(\text{t} \space =\space t|D)= \frac {p(D|\text{t} \space =\space t)\,p(\text{t} \space =\space t)}{p(D)}
$$

where $t$ is some term in the keyword set $V$. 

The **keyword set** $V$ for our experimental setting consists of the responses themselves, since what we want to determine is a probability distribution for the various moods given the overall document set contributed by users.

As discussed in the paper, calculating the marginal probability $p(D)$ would not be privacy-preserving. The authors advise avoiding this calculation by treating $p(D)$ as a constant. The posterior $p(\text{t} \space =\space t|D)$ is then proportional to ${p(D|\text{t} \space =\space t)\,p(\text{t} \space =\space t)}$. Since we are only trying to find trending keywords (in terms of rankings), we do not need the exact value of the posterior and can simply consider this relation:

$$
p(\text{t} \space =\space t|D) \propto {p(D|\text{t} \space =\space t)\,p(\text{t} \space =\space t)}
$$
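
Because only the ranking matters, the unnormalised product of likelihood and prior is enough to order the keywords. The following is a minimal sketch of that idea; the function name `rank_keywords` and the numbers are hypothetical and are not taken from the report.

```python
def rank_keywords(likelihoods, priors):
    """Return keyword indices ordered by likelihood * prior, highest first."""
    scores = [l * p for l, p in zip(likelihoods, priors)]
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)

# Hypothetical values for a 3-keyword vocabulary (illustration only).
likelihoods = [0.2, 0.5, 0.3]
priors = [0.5, 0.25, 0.25]

ranking = rank_keywords(likelihoods, priors)
print(ranking)
```

Dividing every score by the same constant $p(D)$ would not change this ordering, which is why the marginal can be skipped.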



#### Defining Our Terms

In particular, we want to define the terms on the RHS: the likelihood, $p(D|\text{t} \space =\space t)$, and the prior, $p(\text{t} \space =\space t)$.

##### Defining The Prior

For the first run of the protocol, the priors will be defined uniformly, i.e., $\frac{1}{d}$ where $d$ is the number of keywords in $V$.

For each subsequent run, the priors will be defined according to last round's posterior probability.
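
The two cases can be sketched as follows. Whether the posterior is renormalised between rounds is not stated here; this sketch renormalises so that the priors keep summing to 1, which is an assumption, and the helper names and likelihood values are hypothetical.

```python
def uniform_priors(d):
    """Priors for the first run: 1/d for each of the d keywords."""
    return [1 / d] * d

def next_priors(likelihoods, priors):
    """Turn likelihood * prior into the priors for the following run.

    Assumption: the products are rescaled to sum to 1.
    """
    unnormalised = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

priors = uniform_priors(4)
# Hypothetical aggregated likelihoods for a 4-keyword vocabulary.
round_one = next_priors([0.4, 0.3, 0.2, 0.1], priors)
print(round_one)
```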

##### Defining the Likelihood Vector

Let's suppose there are $N$ users who respond to the prompt question every day for $y$ days. (We assume that every user submits a response every day.)

This will mean that each user has a list of $y$ integer responses that represents the option they picked for each day. 

This will constitute their **document set** $D_{i}$, which we wish to keep private, where *each integer* (representing a response) is a **document**.

Each document's **primary keyword set** is simply a single-element set containing the integer that represents the response they picked (e.g., the primary keyword set of the document `1` is just `{1}`).

As described in the paper, what ends up being the user's raw feature vector is a vector of likelihoods,

$$
L_{i}  =(p( D_ {i}  |\text{t} \space =\space t_ {1}  ), \cdots  ,p(  D_ {i}  |\text{t} \space =\space t_ {d}  ))
$$

where $d$ is the number of words in the vocabulary $V$, and each $p(D_{i} | \text{t} \space =\space t)$ is calculated as the fraction of documents in the user's document set that contain the given keyword.

For instance, say that a user has the following document set of just 7 responses (for simplicity):

```
u_1 = [0, 1, 2, 3, 1, 4, 5]
```

They have 7 documents in total. Their feature vector of likelihoods is

```
V = [0, 1, 2, 3, 4, 5, 6]

u_1 = [0, 1, 2, 3, 1, 4, 5]  

u_1_feature_vector = [1/7, 2/7, 1/7, 1/7, 1/7,  1/7, 0]
```
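
As a check on the worked example, here is a small sketch that reproduces the same fractions exactly. The helper name `likelihood_vector` is hypothetical; it is not the method used inside `MoodAppUser`, which rounds to 4 decimal places.

```python
from collections import Counter
from fractions import Fraction

def likelihood_vector(document_set, keywords):
    """Fraction of documents matching each keyword, in keyword order."""
    counts = Counter(document_set)  # a missing keyword counts as 0
    total = len(document_set)
    return [Fraction(counts[k], total) for k in keywords]

V = [0, 1, 2, 3, 4, 5, 6]
u_1 = [0, 1, 2, 3, 1, 4, 5]

u_1_feature_vector = likelihood_vector(u_1, V)
print([str(f) for f in u_1_feature_vector])
```

Using `Fraction` keeps the values exact, so the entries can be checked to sum to 1.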

#### Computing The Target Function

Treating each user's document set $D_i$ as a random variable for a subset of the overall document set $D$ (i.e., the set of all user responses), the likelihood of the whole document set can be represented as the mean of the per-user likelihoods (ibid., 5).

So, the aggregator wants to compute the following for each keyword.

$$
p(D | \text{t} \space =\space t_1) = \frac{1}{N} \sum_{i = 1}^N {p(D_i|\text{t} \space =\space t_1)}
$$

Aggregating the feature vectors for all users can be done securely using the SAFE Protocol.

### Test Run : With Raw SAFE version

#### Generating The Raw Feature Vectors

Let's say there are 10 users who respond to the prompt question every day for 21 days, and as analysts we want to consider what responses are trending given this period.

First, let's do a sanity check.

In [2]:
import random as r
from itertools import repeat
from collections import Counter
from MoodAppUser import MoodAppUser

r.seed(123) # for reproducibility 

# as from above: 
POSSIBLE_RESPONSES = {
    0: "I'm feeling joyful!",
    1: "I'm feeling angry",
    2: "I'm feeling disgusted",
    3: "I'm feeling fearful",
    4: "I'm feeling sad...",
    5: "I'm feeling surprised!",
    6: "I'm feeling neutral",
}

# SANITY CHECK
KEYWORDS = [0, 1, 2, 3, 4, 5, 6] # corresponding to our 7-emotion taxonomy
user_response = [0, 1, 2, 3, 1, 4, 5]

dummy_user = MoodAppUser(0, user_response)

# roughly how the feature vectors are generated.
frequencies = Counter(user_response)
# Index by keyword so that position k always holds keyword k. A Counter returns 0 for a
# missing key, which covers the absent "I'm feeling neutral" response (the class handles
# this programmatically). NOTE : we round feature values to 4dp.
feature_vector = [round(frequencies[k] / len(user_response), 4) for k in KEYWORDS]

assert feature_vector == dummy_user.feature_vector


Next, let's create the 10 users and give them document sets that consist of a random selection of responses.

In [3]:
NO_OF_DAYS_TRACKED = 21
NO_OF_USERS = 10

random_document_sets = [r.choices(KEYWORDS, k = NO_OF_DAYS_TRACKED) for _ in repeat(None, NO_OF_USERS)]

In [4]:
users = [MoodAppUser(i, document_set) for i, document_set in enumerate(random_document_sets)]

print("Here are some users and their random document sets:\n")
for i in range(3):
    print(f"User {users[i].id} and their random document set:\n{users[i].document_set}\n")

Here are some users and their random document sets:

User 0 and their random document set:
[0, 0, 2, 0, 6, 0, 3, 2, 5, 1, 2, 2, 1, 0, 3, 0, 4, 0, 2, 3, 6]

User 1 and their random document set:
[0, 0, 5, 0, 6, 4, 1, 5, 5, 2, 5, 1, 4, 3, 5, 2, 2, 5, 3, 4, 4]

User 2 and their random document set:
[4, 6, 3, 4, 2, 0, 5, 1, 5, 6, 4, 1, 2, 2, 0, 3, 4, 2, 1, 0, 4]



We can also take a look at their raw feature vectors.

In [5]:
print("Here are some users and their raw feature vectors:\n")
for i in range(3):
    print(f"User {users[i].id} and their raw feature vectors:\n{users[i].feature_vector} of length {len(users[i].feature_vector)}\n")

Here are some users and their raw feature vectors:

User 0 and their raw feature vectors:
[0.3333, 0.2381, 0.0952, 0.1429, 0.0476, 0.0952, 0.0476] of length 7

User 1 and their raw feature vectors:
[0.1429, 0.2857, 0.0476, 0.1905, 0.0952, 0.1429, 0.0952] of length 7

User 2 and their raw feature vectors:
[0.2381, 0.0952, 0.0952, 0.1905, 0.1429, 0.0952, 0.1429] of length 7
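
One thing worth checking in the output above: User 0's vector begins `0.3333, 0.2381, 0.0952`, which matches the counts of responses `0`, `2` and `6`. That is the order in which responses first appear in the document set, not keyword order. Since aggregation combines vectors element-wise, every user's vector needs the same keyword at the same index. The sketch below contrasts the two orderings on User 0's document set; the helper name `keyword_ordered_vector` is hypothetical and is not part of `MoodAppUser`.

```python
from collections import Counter

KEYWORDS = [0, 1, 2, 3, 4, 5, 6]
user_0 = [0, 0, 2, 0, 6, 0, 3, 2, 5, 1, 2, 2, 1, 0, 3, 0, 4, 0, 2, 3, 6]

def keyword_ordered_vector(document_set, keywords):
    """Feature vector where index k always refers to keyword k."""
    counts = Counter(document_set)
    return [round(counts[k] / len(document_set), 4) for k in keywords]

# Counter.values() follows first-occurrence order (here 0, 2, 6, 3, 5, 1, 4).
insertion_ordered = [round(c / len(user_0), 4) for c in Counter(user_0).values()]
keyword_ordered = keyword_ordered_vector(user_0, KEYWORDS)

print(insertion_ordered)
print(keyword_ordered)
```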



#### Performing Secure Aggregation of Feature Vectors

Now that we have users and their raw feature vectors, let's compute the target function. 

Since this is the first run of the protocol, we begin with uniform priors.

In [7]:
no_of_keywords = len(KEYWORDS)
priors_for_keywords = [round(1 / no_of_keywords, 4) for _ in repeat(None, no_of_keywords)]

for prior in priors_for_keywords:
    print(prior)

0.1429
0.1429
0.1429
0.1429
0.1429
0.1429
0.1429


Next, let's use the raw feature vectors to calculate the target value:

In [None]:
# TODO - get vector of mean for raw feature vectors
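
One possible way to fill in this step is sketched below. It assumes every feature vector has the same length and the same keyword order; the helper name `mean_vector` and the sample vectors are hypothetical.

```python
def mean_vector(feature_vectors):
    """Element-wise mean of equally long feature vectors, rounded to 4dp."""
    n = len(feature_vectors)
    # zip(*...) groups the values for one keyword across all users.
    return [round(sum(column) / n, 4) for column in zip(*feature_vectors)]

# Hypothetical vectors for two users over a 3-keyword vocabulary.
sample_vectors = [
    [0.5, 0.25, 0.25],
    [0.25, 0.5, 0.25],
]
target = mean_vector(sample_vectors)
print(target)
```

In this notebook the call would presumably look like `mean_vector([u.feature_vector for u in users])`.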

We want to get the same result running SAFE. Let's see if that happens:

In [None]:
# TODO - try implementing the super class FIRST, and then see what's the best way to go about implementing the SAFE functionality into a class.
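
Until the class-based version exists, the property we want to test (the secure result equals the plain mean) can be tried out with a simplified stand-in. The sketch below is not the SAFE implementation from the companion notebooks: it has no encryption and no failover, and it simulates all parties in one process. It only shows the masking idea, where an initiator adds a random mask that is removed again at the end of the chain. All names and constants are assumptions.

```python
import random

MODULUS = 2**61 - 1  # all sums are taken modulo this value (assumed size)
SCALE = 10_000       # fixed-point scale: 4 decimal places, as in the rounding above

def masked_chain_mean(feature_vectors, rng=None):
    """Simulate a masked running sum passed from user to user."""
    rng = rng or random.Random()
    d = len(feature_vectors[0])
    # The initiator starts the chain with a random mask per keyword.
    mask = [rng.randrange(MODULUS) for _ in range(d)]
    running = list(mask)
    # Each user adds their (integer-scaled) vector to the running total.
    for vector in feature_vectors:
        running = [(acc + round(x * SCALE)) % MODULUS for acc, x in zip(running, vector)]
    # The initiator removes the mask, leaving only the sum.
    totals = [(acc - m) % MODULUS for acc, m in zip(running, mask)]
    n = len(feature_vectors)
    return [round(t / SCALE / n, 4) for t in totals]

sample_vectors = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
secure_mean = masked_chain_mean(sample_vectors, rng=random.Random(0))
plain_mean = [round(sum(c) / len(sample_vectors), 4) for c in zip(*sample_vectors)]
print(secure_mean, plain_mean)
```

Working with scaled integers rather than floats keeps the mask removal exact.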

### Detecting Trends In Interactions Over Some Static List Of Resources

A neat feature of the above protocol is that it can not only be used to detect trends in user responses to the prompt question, but also to detect trends for **interactions with a static list of resources**.

We simply map the list of resources to numbers, just as we did with the possible responses.

In [None]:
# TODO - try creating a super class from the original User ... that has a modified constructor and what not.
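
As a rough illustration of the mapping, the sketch below numbers a static resource list and builds the same kind of feature vector from a user's interactions. The resource titles and the helper name `interaction_vector` are hypothetical.

```python
from collections import Counter

# Hypothetical resource titles; any fixed, ordered list would do.
RESOURCES = ["Breathing exercise", "Sleep guide", "Journaling tips"]
RESOURCE_IDS = {title: i for i, title in enumerate(RESOURCES)}

def interaction_vector(opened_titles):
    """Fraction of a user's interactions that went to each resource."""
    counts = Counter(RESOURCE_IDS[title] for title in opened_titles)
    total = len(opened_titles)
    return [round(counts[i] / total, 4) for i in range(len(RESOURCES))]

vector = interaction_vector(
    ["Sleep guide", "Sleep guide", "Journaling tips", "Breathing exercise"]
)
print(vector)
```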

### Suggested Method For Free-Form Journal Entries

Here we consider how mood detection might be done when users do not select from a pre-defined list of responses but are instead allowed to journal freely about whatever comes to mind. There may or may not be a prompt, but what is stored as their individual document sets are plain texts, to which we apply some pre-processing.

These journal entries can then go through the various stages of pre-processing described in (Huth and Chaulwar, 4). There is still a pre-defined vocabulary; however, this time we can make use of a richer emotional taxonomy and expand the vocabulary set $V$ that is used to generate each user's primary keyword set.

Let's first import the `spacy` library we'll use for preprocessing.


In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)


Then, we'll read in our sample journal entry file.

In [None]:

# Read the whole file into a single string.
with open('sample-journal-entries.txt', 'r') as f:
    journal_entries = f.read()

print(f'These are the journal entries: \n {journal_entries}')

Next, let's try to pick out where the keywords are mentioned in these entries. In this case, the keywords are the emotions:

In [None]:

# Register the "Emotions-Identifier" pattern: match any token whose lemma is one of the emotion keywords
pattern = [{"LEMMA": {"IN": ["joyful", "sad", "angry", "disgusted", "afraid", "neutral", "surprised"]}}]
matcher.add("Emotions-Identifier", [pattern])

doc = nlp(journal_entries)

matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]  # The matched span
    print(start, end, span.text)

# TODO: build up a frequency count of the emotions.

# NOTE: this is obviously a woefully inadequate way of doing NLP (e.g. it counts 'angry' twice even though, in the second journal entry/line, it occurs in a rhetorical question), so there's still a lot more processing to do. However, if there's something to this idea I might go ahead with it.

# A big limitation to address is how to capture more word variations (sadness, joyous); this seems too complex, though.

# some other sources: https://towardsdatascience.com/keyword-extraction-process-in-python-with-natural-language-processing-nlp-d769a9069d5c

# NOTE: alternative: preprocess the words (remove stop words, lemmatize, etc.), generate a frequency count of all the words, filter this dictionary for the emotions, and work from that.

In [None]:
'''
# todo - take from this example code block things which would be helpful to use

# Rangarajan Krishnamoorthy, 2/2/2019
# Using neuralcoref for coreference resolution

taken from: https://www.rangakrish.com/index.php/2019/02/03/coreference-resolution-using-spacy/

import spacy
nlp = spacy.load('en_coref_lg') # TODO : see if the large coref model is the right choice.

examples = [
    u'My sister has a dog and she loves him.',
    u'My sister has a dog and she loves him. He is cute.',
    u'My sister has a dog and she loves her.',
    u'My brother has a dog and he loves her.',
    u'Mary and Julie are sisters. They love chocolates.',
    u'John and Mary are neighbours. She admires him because he works hard.',
    u'X and Y are neighbours. She admires him because he works hard.',
    u'The dog chased the cat. But it escaped.',
]

def printMentions(doc):
    print '\nAll the "mentions" in the given text:'
    for cluster in doc._.coref_clusters:
        print cluster.mentions

def printPronounReferences(doc):
    print '\nPronouns and their references:'
    for token in doc:
        if token.pos_ == 'PRON' and token._.in_coref:
            for cluster in token._.coref_clusters:
                print token.text + " => " + cluster.main.text

def processDoc(text):
    doc = nlp(text)
    if doc._.has_coref:
        print "Given text: " + text
        printMentions(doc)
        printPronounReferences(doc)

if __name__ == "__main__":
    processDoc(examples[8])

'''

Using `spacy`'s matcher, we can match the keywords in journal entries and build frequencies of them in the text.

In [None]:
# TODO : create frequencies of keywords in a few user's document set.
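
A rough idea of what this step could produce is sketched below. To keep the example self-contained it uses a plain regular-expression tokeniser and a hand-written variant table as a stand-in for the `spacy` matcher; the table, the helper name `emotion_frequencies`, and the sample text are all hypothetical (the text is not taken from `sample-journal-entries.txt`).

```python
import re
from collections import Counter

EMOTIONS = ["joyful", "sad", "angry", "disgusted", "afraid", "neutral", "surprised"]

# Hypothetical table mapping surface forms to the emotion they count towards.
VARIANTS = {
    "joyful": "joyful", "joyous": "joyful", "joy": "joyful",
    "sad": "sad", "sadness": "sad",
    "angry": "angry", "anger": "angry",
    "afraid": "afraid", "fearful": "afraid",
    "disgusted": "disgusted",
    "surprised": "surprised",
    "neutral": "neutral",
}

def emotion_frequencies(text):
    """Relative frequency of each emotion among all emotion mentions in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(VARIANTS[t] for t in tokens if t in VARIANTS)
    total = sum(counts.values())
    return [round(counts[e] / total, 4) if total else 0.0 for e in EMOTIONS]

sample = "I felt angry this morning. Later the sadness faded and I was joyous. Am I still angry?"
frequencies = emotion_frequencies(sample)
print(dict(zip(EMOTIONS, frequencies)))
```

Note that this stand-in still counts the `angry` in the closing question, so it shares the limitation discussed in this section.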

##### Obvious Drawbacks 

There are obvious drawbacks to using this simple NLP pre-processing as a way of counting keyword occurrences.

For instance, one might have noticed that the two occurrences of `angry` appear in drastically different semantic contexts (once in a declarative statement and once in a rhetorical question), yet both are treated alike and simply added to the count.

<!-- illustrate the drawback with the specific example -->


A more general drawback is that these methods do not make flat distributions easy to recognise.

For instance, 

<!-- illustrate the impact of flat distributions -->

Another issue that should be addressed is how to account for the fact that users may not use the app every day. How should we encode and handle non-responses?

In [None]:
# as from above: 
POSSIBLE_RESPONSES = {
    0: "I'm feeling joyful!",
    1: "I'm feeling angry",
    2: "I'm feeling disgusted",
    3: "I'm feeling fearful",
    4: "I'm feeling sad...",
    5: "I'm feeling surprised!",
    6: "I'm feeling neutral",
    -1: None # no response for that day.
}

u_1 = MoodAppUser(0, [0, 1, 2, 3, 1, 4, 5, -1]) # 8 days of collected data, with no response on the last day

# what do we do with the -1?
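
One option, among several, is to drop the `-1` entries and compute the fractions over the days that were actually answered. The sketch below shows only that option; the helper name `feature_vector_ignoring_gaps` is hypothetical, and whether ignoring gaps is the right choice for the analysis remains an open question.

```python
from collections import Counter

KEYWORDS = [0, 1, 2, 3, 4, 5, 6]
NO_RESPONSE = -1

def feature_vector_ignoring_gaps(document_set, keywords=KEYWORDS):
    """Feature vector computed only over the days with a response."""
    answered = [r for r in document_set if r != NO_RESPONSE]
    if not answered:
        # No responses at all: return an all-zero vector (an assumption).
        return [0.0] * len(keywords)
    counts = Counter(answered)
    return [round(counts[k] / len(answered), 4) for k in keywords]

vector = feature_vector_ignoring_gaps([0, 1, 2, 3, 1, 4, 5, -1])
print(vector)
```

An alternative would be to treat "no response" as an eighth keyword, which would keep the information that a day was skipped at the cost of a larger vocabulary.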