## SAFE Protocol NLP Experiments 

[update the other Jupyter files to account for two notebooks]:# (directory-integration)

[look at this link: https://medium.com/codex/the-magical-markdown-i-bet-you-dont-know-b51f8c049773]:# (resource for markdown)

This notebook contains the following contents:
* [Section 1: Introduction](#introduction)
* [Section 2: Prior Calculation](#prior-calculation)
* [Section 3: Likelihood Calculation](#likelihood)

It builds on top of [the other notebook where I illustrated the bare SAFE Protocol itself as well as some extensions to make it more secure](/safe-experiments.ipynb). This notebook specifically illustrates the application of SAFE to trend detection in document interactions.

### Introduction <a class="anchor" id="introduction"></a>

##### What Trends In Documents Are

Huth and Chaulwar offer the following definition of a trend for interacted documents in their paper (Huth, Chaulwar, 4),

[add proper citation]:# (citation)

>"Keywords that are not frequently present in past interacted documents but that appear frequently in current interacted documents"

They note that the definition is "very basic". However, it suffices to illustrate how their Bayesian approach to federated analytics works.
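To make the definition concrete, here is a minimal sketch that scores each keyword by how much more often it appears in current documents than in past ones. The function names `document_frequency` and `trend_scores`, the difference-based score, and the toy documents are illustrative assumptions; the paper itself uses a Bayesian formulation rather than this score.

```python
from collections import Counter


def document_frequency(docs):
    """Fraction of documents in which each word appears at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return {word: n / len(docs) for word, n in counts.items()}


def trend_scores(past_docs, current_docs, vocabulary):
    """Hypothetical score: current document frequency minus past frequency."""
    past = document_frequency(past_docs)
    current = document_frequency(current_docs)
    return {w: current.get(w, 0.0) - past.get(w, 0.0) for w in vocabulary}


# Toy data, made up for illustration only.
past_docs = ["i felt joy", "i felt joy", "i felt sadness"]
current_docs = ["i felt fear", "i felt fear", "i felt joy"]

scores = trend_scores(past_docs, current_docs, ["joy", "fear", "sadness"])
top_keyword = max(scores, key=scores.get)
print(top_keyword, scores)
```

A keyword with a large positive score is rare in past documents and frequent in current ones, which matches the quoted definition.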

[look into seasonality]: # (extra-bit)

##### Formal Specification of The Task

The terminology and description in this section are also taken from the paper.

Key Terms:
* $D$ - Document Set 
* $t$ - Trending Keyword
* $V$ - Set of possible trending keywords in $D$
* $D_i$ - subset of $D$ that $user_i$ has interacted with

[add proper citation]:# (citation)

The aim is to determine a distribution $p(t | D)$ that specifies a probability for a keyword in $V$ being trending. 

Using Bayes' formula, the task can be phrased as determining, for each term $t_i$ in $V$:

$$
    p(t_i = t | D) = \frac{p(D | t_i = t)\,p(t_i = t)}{p(D)}
$$
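As a rough sketch of how this formula could be evaluated over a discrete keyword set, the snippet below multiplies a prior by a likelihood and normalizes. It assumes the candidate keywords are mutually exclusive, so that $p(D)$ is the sum of $p(D | t_i = t)\,p(t_i = t)$ over all keywords in $V$. The function name `posterior` and all numbers are made up for illustration.

```python
def posterior(prior, likelihood):
    """Apply Bayes' formula over a discrete keyword set V.

    prior[t]      stands for p(t_i = t)
    likelihood[t] stands for p(D | t_i = t)
    The evidence p(D) is the sum of likelihood * prior over all keywords.
    """
    unnormalized = {t: likelihood[t] * prior[t] for t in prior}
    evidence = sum(unnormalized.values())
    return {t: value / evidence for t, value in unnormalized.items()}


# Made-up numbers, purely for illustration.
prior = {"joy": 0.5, "fear": 0.3, "sadness": 0.2}
likelihood = {"joy": 0.1, "fear": 0.6, "sadness": 0.3}

post = posterior(prior, likelihood)
print(post)
```

The result is a distribution over keywords that sums to one, with the most probable trending keyword being the one whose prior-weighted likelihood is largest.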

* Application To Use Case

[one issue of relating the use case to the trend detection methodology is that in the specific case mentioned in the paper, 'multiple users can interact with (the) same document' (4), however, this wouldn't make sense in our case?]:# (question-to-resolve)

### Prior Calculation <a class="anchor" id="prior-calculation"></a>

Here we are interested in calculating the term $p(t_i = t)$, the prior discrete probability that a keyword in $V$ is a trending keyword.


In [None]:
# load or create the data, then generate the prior distribution

'''
prior distribution is modelled as a Dirichlet distribution parameterized by the IDF of keywords in V
- generate IDF of keywords in V
- construct Dirichlet distribution using generated IDF as arg.

running it iteratively: start with uniform priors,

TODO - how do we update uniform priors over each distribution? -- assigning the result of each round as the new IDF parameter?

can use tensorflow's probability library to generate the Dirichlet distribution.
'''
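The two steps in the notes above (generate the IDF, then build a Dirichlet prior from it) could look roughly like the sketch below. It uses only the Python standard library rather than TensorFlow Probability, drawing a Dirichlet sample by normalizing independent Gamma draws. The helper names `idf_weights` and `sample_dirichlet`, the +1 smoothing, and the toy documents are assumptions for illustration, not the paper's exact procedure.

```python
import math
import random


def idf_weights(docs, vocabulary):
    """Smoothed inverse document frequency for each keyword in V.

    The +1 smoothing (an assumption here) keeps every weight strictly
    positive, which a Dirichlet concentration parameter requires.
    """
    n_docs = len(docs)
    tokenized = [set(doc.lower().split()) for doc in docs]
    weights = []
    for word in vocabulary:
        df = sum(word in tokens for tokens in tokenized)
        weights.append(math.log((1 + n_docs) / (1 + df)) + 1.0)
    return weights


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


# Toy document set, made up for illustration.
docs = ["i felt joy", "i felt joy", "i felt fear", "i felt sadness"]
vocabulary = ["joy", "fear", "sadness"]

alpha = idf_weights(docs, vocabulary)         # concentration parameters
prior_mean = [a / sum(alpha) for a in alpha]  # expected prior p(t_i = t)
prior_sample = sample_dirichlet(alpha, random.Random(0))
print(dict(zip(vocabulary, prior_mean)))
```

Rarer keywords get a larger IDF and therefore a larger share of the prior mass, which fits the idea that a keyword seldom seen in the past is a better trend candidate.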

### Likelihood Calculation <a class="anchor" id="likelihood"></a>

Each user has their own subset of the document set...

In [None]:
'''
Describing the use case:

3 users, they use the app for a month, once a day, 5 days a week. 

Every day, they answer a question, "How are you feeling today?"-- 

The answers given are something like,

"I felt joy"
"I felt anger"
"I felt fear"
"I felt sadness"
"I felt disgust"
"I felt surprise"

 Ekman 1992a:

(joy, anger, fear, sadness, disgust, and surprise). <= not very interesting.

What about...

And then they are prompted to journal in response to a follow-up question, "Why did you feel that way?". NOTE: sentiment analysis is kind of redundant... here. They've already indicated what they feel.

These responses are stored in a list over the period of two weeks. (That's their own Di which we want to keep private).

So, SAFE will help us to track, what emotions have people started to feel over time.

METRICS:
- count: how many people felt X over the 2 week period?
- simple mean: what's the average number of people that felt X over a 4 month period?

# TODO - feels like the use case doesn't make for very interesting extensions...

# NOTE : extending the use case: something similar to what is tracked. Now trying to track what resources are being accessed within the app, so the keywords are now not emotions but general topics.

# thinking through some of the extensions :(e.g. weighted mean)
thoughts on weighted mean:
- each obfuscated feature vector is sent with a weight.
- how secure would this be? Well, it would reveal : e.g., that the user has been accessing a lot of articles. (even though it's not clear which articles)...But, then, if it turns out that a particular topic/emotion has been 'trending', wouldn't that reveal a bit much about those who had heavier weights?

The Bayesian FA that uses SAFE would help us to compute : what's the probability of each keyword (emotion) being trending (i.e., felt more now than in past sessions)

'''

# NOTE: maybe a good dataset to use would be GoEmotion? That's a large dataset that's been benchmarked academically.

# step : Define D-- overall document set. 

# TODO - what are the document sets in this context? Would it be journal entries? 
''' 
Let's start with something simpler: let's say the document set is a set of possible answers to 'survey' like questions. For instance,

# NOTE:  base this on the emotion taxonomy mentioned in GoEmotion.
responses = [
    u'I was upset today',
    u'I was ...',
] <= this list contains all the responses from each user. 

And then each user's D_i is the specific responses they clicked?

And would the keywords be the emotions?
'''
# NOTE : if this is right, then there's no need for coreference resolution?
# TODO : are the only pre-processing steps necessary the keyword set calculation and the document frequency calculation in this particular use case?
# NOTE : c.f. the PyTorch Deep Dive into NLP slides for methods that do the pre-processing for you.

# D has a set of keywords, V, which are trending.

# TODO - are the keywords the emotions? And then isn't there only a small number of keywords in V?

# step: Each user's document sets are their answers.

# TODO - what is p(D_i | t)?

# Wouldn't it be fairly straightforward? e.g. given that 'joy' is trending, wouldn't the document set largely consist of 'I have been feeling joyful today'?

# NOTE : comparing by ranking based on total count or pooled trend seems to be pretty easy to do...

# TODO - get clear : why is a Dirichlet distribution used?

# TODO - updating uniform priors iteratively

'''
Does that involve setting the p(t | D) that we get at the end of one round to be the new parameters for the Dirichlet distribution used for the next round's priors?
'''
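As a starting point for the count metric described in the notes above, the following sketch turns each user's responses into a per-emotion count vector and tallies how many users reported each emotion. This is plain, non-private aggregation: in the intended setup the per-user vectors would be combined through SAFE instead of being read directly. The names `emotion_vector` and `users_who_felt` and the three users' responses are hypothetical.

```python
from collections import Counter

EMOTIONS = ["joy", "anger", "fear", "sadness", "disgust", "surprise"]


def emotion_vector(responses):
    """Count how often each emotion keyword appears in one user's D_i."""
    counts = Counter()
    for response in responses:
        for word in response.lower().split():
            if word in EMOTIONS:
                counts[word] += 1
    return [counts[emotion] for emotion in EMOTIONS]


def users_who_felt(user_docs):
    """Count metric: number of users who reported each emotion at least once."""
    totals = [0] * len(EMOTIONS)
    for responses in user_docs.values():
        vector = emotion_vector(responses)
        totals = [t + (1 if v > 0 else 0) for t, v in zip(totals, vector)]
    return dict(zip(EMOTIONS, totals))


# Hypothetical D_i for three users.
user_docs = {
    "user_1": ["I felt joy", "I felt fear"],
    "user_2": ["I felt fear", "I felt fear"],
    "user_3": ["I felt sadness", "I felt joy"],
}

counts = users_who_felt(user_docs)
print(counts)
```

With only six emotion keywords the vectors stay tiny, which is one reason the notes consider extending the keyword set to general topics.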


In [None]:
'''
# todo - take from this example code block things which would be helpful to use

# Rangarajan Krishnamoorthy, 2/2/2019
# Using neuralcoref for coreference resolution

taken from: https://www.rangakrish.com/index.php/2019/02/03/coreference-resolution-using-spacy/

import spacy
nlp = spacy.load('en_coref_lg') # TODO : see if the large coref model is the right choice.

examples = [
    u'My sister has a dog and she loves him.',
    u'My sister has a dog and she loves him. He is cute.',
    u'My sister has a dog and she loves her.',
    u'My brother has a dog and he loves her.',
    u'Mary and Julie are sisters. They love chocolates.',
    u'John and Mary are neighbours. She admires him because he works hard.',
    u'X and Y are neighbours. She admires him because he works hard.',
    u'The dog chased the cat. But it escaped.',
]

def printMentions(doc):
    # list every coreference cluster found in the text
    print('\nAll the "mentions" in the given text:')
    for cluster in doc._.coref_clusters:
        print(cluster.mentions)

def printPronounReferences(doc):
    # map each pronoun to the main mention of its cluster
    print('\nPronouns and their references:')
    for token in doc:
        if token.pos_ == 'PRON' and token._.in_coref:
            for cluster in token._.coref_clusters:
                print(token.text + " => " + cluster.main.text)

def processDoc(text):
    doc = nlp(text)
    if doc._.has_coref:
        print("Given text: " + text)
        printMentions(doc)
        printPronounReferences(doc)

if __name__ == "__main__":
    # the list above has 8 items (indices 0-7), so examples[8] would raise IndexError
    processDoc(examples[-1])

'''