# Extracting and Predicting Linked Social Media Accounts
> "This blog post walks through computing the combined precision from a range of possibly imprecise scrapers/classifiers."

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [fastpages, jupyter]
- hide: false
- search_exclude: true

With the overwhelming amount of information out there,
it is sometimes hard to find what one is looking for.
For every few true sources, there are often incorrect ones.

# Example Case: A Social Media Finder
This kind of scenario can often be found in web scraping.
Let's look at a simple example of a "social media finder".
Let's say we have lists of hundreds of thousands of domains,
and we want to see if we can find their linked social media accounts.

It turns out there is a practical way of judging how trustworthy such
findings are, which can often simplify reasoning about the correctness of
the knowledge graphs that one develops, and I'd like to show you how.

## Social Media Finder Example
Let's say, for example, that you have two candidate Twitter accounts for a certain domain, `cnn.com`: `@CNN` and `@JohnDoe`. How do you know whether they are right or wrong?

<img src="images/2021_05_24_cnn_example_1.png" alt="CNN Example" width="400"/>

Often in these cases, one will seek help from multiple sources to get the answer right. And often, each of these sources is correct only a certain percentage of the time:


<img src="images/2021_05_24_cnn_example_2.png" alt="CNN Example" width="400"/>

Such a scenario is common not only with automated scrapers, but also in situations such as knowledge graphs, where the "classifier" is actually a human.

## Social Media Finder Example: Some numbers
In order to better understand how to tackle this problem,
let's look at a specific example.

Let's say you have a bunch of social media finders
trying to figure out which social media accounts
are connected to `cnn.com`, and that two of them
have found `@CNN` to be
related {% fn 1 %}:
- method 1 uses some independent technique and
  is right **65%** of the time when it reports a link.
- method 2 uses a knowledge graph and is right **80%** of the time.

And one method has not found this relation:
- method 3 uses some scraping technique. Although it has not detected
  anything, it has a false negative rate of **60%** (it is wrong about
  **60%** of the accounts it fails to link).

<img src="images/2021_05_24_cnn_example_3.png" alt="CNN Example" width="400"/>

What would the combined precision be?





To get to this will take a bit of work, and I will walk you through it.

But to get a feel for what the answer would look like, I will give you a small preview. The answer to this problem should be:
$$\frac{.65 \times .8 \times .6}{.65 \times .8 \times .6 + (1-.65)(1-.8)(1-.6)} \approx 91.8\%$$
That's **91.8%**! That's quite a jump in confidence.

Before I continue to explain this result, here is the same formula in code:

In [4]:
from typing import List


def combined_p(precisions_or_fnr: List[float]) -> float:
    """Combine the precisions (positive votes) and false negative
    rates (negative votes) of several independent authorities."""
    if not precisions_or_fnr:
        return 0.0
    product = 1
    product_i = 1
    for precision_or_fnr in precisions_or_fnr:
        # Accumulates the evidence that the link is true...
        product *= precision_or_fnr
        # ...and the evidence that the link is false.
        product_i *= (1. - precision_or_fnr)
    return product / (product + product_i)

And run it on our test case above:

In [5]:
# Note: the first two quantities are precisions and the last is a false negative rate
combined_p([0.65, .8, .6])

0.9176470588235295
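
As a side note, and not part of the original derivation: the formula can also be read as multiplying the odds $p / (1 - p)$ of each vote and converting the product back into a probability. The sketch below checks that equivalence numerically for the example above. The name `combined_from_odds` is a hypothetical helper introduced only for illustration, and it assumes every value lies strictly between 0 and 1.

```python
from math import prod
from typing import List


def combined_from_odds(values: List[float]) -> float:
    """Same quantity as the formula above, written with odds.

    Assumes every value is strictly between 0 and 1.
    """
    odds = prod(v / (1.0 - v) for v in values)
    return odds / (1.0 + odds)


values = [0.65, 0.8, 0.6]
# Direct form: product / (product + product of complements).
direct = prod(values) / (prod(values) + prod(1.0 - v for v in values))
print(combined_from_odds(values), direct)
```

Read this way, any vote with a value above 0.5 has odds above 1 and pushes the combined result up, while a value below 0.5 pulls it down.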

How do we get to this?

To explain this, I will first go through some definitions
and state the general answer. I will then simulate three simple
authorities and check whether the formula above agrees
with what we measure on the simulated data.

The proof will be left for a later post.

# Definitions
For any candidate link (e.g. `@CNN` is a Twitter user for `cnn.com`), we will call it `True` if the link exists and `False` otherwise.

We can express how much we trust a classifier via its precision $Pr$:
$$Pr = \frac{TP}{TP + FP}$$

Where $TP$ stands for true positives: the times something actually
True was also classified True ("positive") by the classifier. $FP$ stands for
false positives, and finally $FN$ and $TN$ stand for false negatives and
true negatives.
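
To make these four counts concrete, here is a minimal sketch that tallies them from paired labels and computes the precision. The function name `confusion_counts` and the toy labels are made up for illustration and do not come from the simulation later in the post.

```python
from typing import List, Tuple


def confusion_counts(actual: List[bool], predicted: List[bool]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for paired ground-truth and predicted labels."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    return tp, fp, fn, tn


# Made-up toy labels, for illustration only.
actual = [True, True, True, False, False, False]
predicted = [True, True, False, True, False, False]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)  # Pr = TP / (TP + FP)
print(tp, fp, fn, tn, precision)
```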

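We will also need to know their false negative rate. This is needed
when the authority we are using does not think the link exists. In this
post it means the share of an authority's negative answers that are
actually True (a quantity often called the false omission rate):
$$FNR = \frac{FN}{FN + TN}$$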

## Problem Statement
> Given a set of authorities with measured precisions and false negative rates,
> where some classify an item as positive and some as negative,
> what is the combined precision, i.e. the probability that this item is
> actually positive? We also assume here that all
> authorities are **independent**. {% fn 2 %}


It turns out that, if you make the independence assumption,
there is a very simple answer. Say we have a set of authorities $A_i$,
each of which either classified an item as positive or did not.
Let $\mathbf{A}^+$ be the set of all authorities that classified the item
as positive, so that $A_i \in \mathbf{A}^+$ for those authorities
and $A_i \notin \mathbf{A}^+$ for the ones that did not.
For each authority $A_i$, we measure its precision $Pr_i$
and its false negative rate $FNR_i$. With this
notation, the combined precision is:


$$Pr_{combined} = \frac{\prod_{A_i \in \mathbf{A}^+} Pr_i \prod_{A_j \notin \mathbf{A}^+} FNR_j}
{\prod_{A_i \in \mathbf{A}^+} Pr_i \prod_{A_j \notin \mathbf{A}^+} FNR_j + \prod_{A_i \in \mathbf{A}^+} (1 - Pr_i) \prod_{A_j \notin \mathbf{A}^+} (1 - FNR_j)}
$$
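
The general formula can be sketched in code by recording, for each authority, whether it voted positive and which measured rate applies. This is a minimal illustration under the independence assumption. The `Vote` container and the `combined_precision` name are hypothetical, and the `0.5` entries are placeholders for rates the formula never reads.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Vote:
    """One authority's answer plus its measured rates (hypothetical container)."""
    positive: bool    # True if the authority belongs to A+
    precision: float  # Pr_i, used when the vote is positive
    fnr: float        # FNR_j, used when the vote is negative


def combined_precision(votes: List[Vote]) -> float:
    numerator = 1.0   # product of Pr_i over A+ and FNR_j over the rest
    complement = 1.0  # product of (1 - Pr_i) and (1 - FNR_j)
    for vote in votes:
        value = vote.precision if vote.positive else vote.fnr
        numerator *= value
        complement *= 1.0 - value
    return numerator / (numerator + complement)


# The cnn.com example: two positive votes and one negative vote.
# Rates a vote does not use are filled with a placeholder of 0.5.
votes = [
    Vote(positive=True, precision=0.65, fnr=0.5),
    Vote(positive=True, precision=0.80, fnr=0.5),
    Vote(positive=False, precision=0.5, fnr=0.60),
]
print(combined_precision(votes))
```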




# Generating an Authority with a Model
Let's call these extractors authorities.

We want to validate the formula stated above, so let's try
simulating three authorities, compute their precisions and
finally compute their combined precision both from data
and the formula stated above.

We can model an authority by its recall $Re$ and false positive rate $FPR$:

$$Re = \frac{TP}{T} =  \frac{TP}{TP + FN}$$

$$FPR = \frac{FP}{F} = \frac{FP}{FP + TN}$$

We will then simulate the decisions of an authority on a set
of candidates with a known ground truth. When the ground
truth is True, the probability of the authority reporting
True is the recall $Re$. When the ground truth is
False, the probability of the authority reporting True
is the false positive rate $FPR$ (so the probability of
reporting False is $1 - FPR$).


Let's now represent this authority with a class.

In [6]:
import numpy as np


class Authority:
    def __init__(self, recall: float, fpr: float):
        self.recall = recall
        self.fpr = fpr
        
    def __call__(self, actual_value: bool) -> bool:
        """
        Here is our simulated authority.
        We give it the actual value for simulation purposes.
        
        If the value is true, we only predict that it is true
        in accordance with its recall.
        If the value is false, we accidentally predict it is 
        true in accordance with its False positive rate.
        """
        if actual_value is True:
            if np.random.random() <= self.recall:
                return True
            else:
                return False
        else:
            if np.random.random() <= self.fpr:
                return True
            else:
                return False

And let's create three simple authorities.

Authority 1 has a low recall but a very low false positive rate:

In [7]:
a1 = Authority(recall=.3, fpr=.1)

Authority 2 has a high recall but a worse false positive rate:

In [8]:
a2 = Authority(recall=.7, fpr=.4)

Authority 3 has a low recall and a fairly high false positive rate:

In [9]:
a3 = Authority(recall=.1, fpr=.5)
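
Before simulating anything, we can estimate what to expect from each authority. Assuming a balanced dataset (as many True as False items), the expected precision is $Re / (Re + FPR)$, and the expected share of True items among an authority's negative answers is $(1 - Re) / ((1 - Re) + (1 - FPR))$. The helper names below are hypothetical and the sketch is only a back-of-the-envelope check.

```python
def expected_precision(recall: float, fpr: float) -> float:
    """Expected TP / (TP + FP) when True and False items are equally common."""
    return recall / (recall + fpr)


def expected_true_share_of_negatives(recall: float, fpr: float) -> float:
    """Expected FN / (FN + TN) when True and False items are equally common."""
    return (1.0 - recall) / ((1.0 - recall) + (1.0 - fpr))


exp_a1 = expected_precision(recall=0.3, fpr=0.1)                # authority 1
exp_a2 = expected_precision(recall=0.7, fpr=0.4)                # authority 2
exp_a3 = expected_true_share_of_negatives(recall=0.1, fpr=0.5)  # authority 3
print(exp_a1, exp_a2, exp_a3)
```

These come out at roughly 0.75, 0.64 and 0.64, which gives a reference point for the values measured on simulated data.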

The question is: if two independent authorities report a positive and a third reports a negative, what is the probability that the item is actually positive?

## Simulating the data

Let's create a dataset on which we will run our comparisons.
Let's run on 1,000,000 cases and, to make things closer to real-world examples,
let's introduce a bias of 80% toward False, as such scenarios are quite
common.

In [10]:
# Number of samples
N_samples = 1000000
# bias towards false values
bias = .8

In [11]:
import numpy as np
import pandas as pd


measurements = []
for _ in range(N_samples):
    actual_value = bool(np.random.random() > bias)
    
    decision_a1 = a1(actual_value)
    decision_a2 = a2(actual_value)
    decision_a3 = a3(actual_value)
    measurement = {
        'actual': actual_value,
        'a1': decision_a1,
        'a2': decision_a2,
        'a3': decision_a3,
    }
    measurements.append(measurement)
    
    
# Let's combine our results into a dataframe to make things easier to visualize:
df = pd.DataFrame(measurements)

In [12]:
# The measurements
df

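|        | actual | a1    | a2    | a3    |
|--------|--------|-------|-------|-------|
| 0      | False  | False | True  | True  |
| 1      | False  | False | False | True  |
| 2      | False  | False | False | False |
| 3      | False  | False | False | False |
| 4      | False  | False | False | True  |
| ...    | ...    | ...   | ...   | ...   |
| 999995 | False  | False | False | True  |
| 999996 | False  | False | True  | True  |
| 999997 | False  | False | True  | True  |
| 999998 | False  | False | False | False |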
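
A Python loop over a million rows can be slow. As an alternative, here is a vectorized sketch of the same simulation using NumPy arrays. It does not reuse the `Authority` class; the `simulate` helper and the seed are assumptions made for illustration, so the exact draws will differ from the table above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
n_samples = 1_000_000
bias = 0.8

# Ground truth: True with probability 1 - bias.
actual = rng.random(n_samples) > bias


def simulate(actual: np.ndarray, recall: float, fpr: float) -> np.ndarray:
    """Vote True with probability `recall` on True items and `fpr` on False ones."""
    p_true = np.where(actual, recall, fpr)
    return rng.random(actual.shape[0]) <= p_true


votes_a1 = simulate(actual, recall=0.3, fpr=0.1)
votes_a2 = simulate(actual, recall=0.7, fpr=0.4)
votes_a3 = simulate(actual, recall=0.1, fpr=0.5)
# Share of True items and share of positive votes per authority.
print(actual.mean(), votes_a1.mean(), votes_a2.mean(), votes_a3.mean())
```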


## Preparing the validation set

Now, I mentioned we have a bias. If we want to measure the true precision,
we have to account for this bias by resampling
an equal number of actually true and actually false
cases from our dataset. It is a bit silly to have introduced a bias in
simulated data in the first place if we are just going to
remove it. But this is something that can be easily overlooked
and is always worth considering.
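
To see how much the class balance matters, here is a rough calculation of authority 1's expected precision (recall 0.3, false positive rate 0.1) when 20% of the items are True versus when 50% are. The function name `precision_at_share` is hypothetical and the calculation ignores sampling noise.

```python
def precision_at_share(recall: float, fpr: float, share_true: float) -> float:
    """Expected TP / (TP + FP) when a fraction `share_true` of items is True."""
    true_positives = share_true * recall
    false_positives = (1.0 - share_true) * fpr
    return true_positives / (true_positives + false_positives)


biased = precision_at_share(recall=0.3, fpr=0.1, share_true=0.2)
balanced = precision_at_share(recall=0.3, fpr=0.1, share_true=0.5)
print(f"biased: {biased:.3f}, balanced: {balanced:.3f}")
```

The same authority goes from a precision of about 0.43 on the biased data to 0.75 on balanced data, so the two settings should not be mixed.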

In [13]:
# Filter dataframe to actually true and false cases
df_true = df[df['actual']]
df_false = df[~df['actual']]
# count them
number_actually_true = len(df_true)
number_actually_false = len(df_false)

# choose smallest number as sampling number
number_to_sample = min(number_actually_true, number_actually_false)

print(f'Of samples, {number_actually_true} are true and {number_actually_false} are false.')
print(f'We choose to keep {number_to_sample} of true and false cases.')

Of samples, 200995 are true and 799005 are false.
We choose to keep 200995 of true and false cases.


In [14]:
# Finally, we reduce the dataset to a sampled dataset:
df_sampled = pd.concat(
    (
        df_true.sample(number_to_sample),
        df_false.sample(number_to_sample),
    )
)

## Computing the Combined Precision

Now let's compute the precision of A1 and A2 and verify the false negative rate of A3:

In [15]:
# Precision of A1
prec_a1 = df_sampled[df_sampled['a1']]['actual'].mean()
print(f'The precision of A1 is {prec_a1}')

The precision of A1 is 0.7510203575283158


In [16]:
# Precision of A2
prec_a2 = df_sampled[df_sampled['a2']]['actual'].mean()
print(f'The precision of A2 is {prec_a2}')

The precision of A2 is 0.6346139096302646


In [18]:
# False negative rate of A3
fnr_a3 = df_sampled[~df_sampled['a3']]['actual'].mean()
print(f'The fnr of A3 is {fnr_a3}')

The fnr of A3 is 0.6425435872685785


Let's now take the samples where A1 and A2 voted for a link but A3 did not, and measure the precision over those samples:

In [19]:
# Precision when A1 and A2 vote positive and A3 does not
mask = df_sampled['a1'] & df_sampled['a2'] & ~df_sampled['a3']
prec_a1a2_not_a3 = df_sampled[mask]['actual'].mean()
matching_rows = len(df_sampled[mask])
print(f'The precision of A1 and A2 given A3 did not classify is {prec_a1a2_not_a3}')
print(f'This was measured over {matching_rows} measurements.')

The precision of A1 and A2 given A3 did not classify is 0.9027603364530269
This was measured over 42205 measurements.


## Comparing Against Our Formula

Now let's use our formula:

In [21]:
# Equivalent to combined_p([prec_a1, prec_a2, fnr_a3])
prec_a1 * prec_a2 * fnr_a3 / (prec_a1 * prec_a2 * fnr_a3 + (1 - prec_a1) * (1 - prec_a2) * (1 - fnr_a3))

0.9040055451209015

The results are pretty close: about 0.903 measured versus about 0.904
from the formula! If you are not convinced, I invite you to
try this out with different authorities with different values of
recall and false positive rates.
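
One rough way to judge "pretty close" is to compare the gap with the sampling noise of the measurement. The sketch below uses the binomial standard error of a proportion, assuming the matching rows are independent draws. The numbers are copied from the outputs above.

```python
from math import sqrt

# Values copied from the outputs above.
measured = 0.9027603364530269   # precision measured on the matching rows
predicted = 0.9040055451209015  # value given by the formula
n_rows = 42205                  # number of matching rows

# Binomial standard error of the measured proportion,
# assuming the rows are independent draws.
standard_error = sqrt(measured * (1.0 - measured) / n_rows)
z_score = (predicted - measured) / standard_error
print(f"standard error: {standard_error:.4f}, z-score: {z_score:.2f}")
```

Under that assumption the gap is less than one standard error, which is consistent with ordinary sampling noise.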

In my next post I will outline a proof.


If you have any questions or comments, or believe something is in 
error, I would love to hear about it in a comment!

Footnotes:
{{ "These precisions are assumed to be calculated 
on a balanced data set with as many true values as false
values. If you have an imbalanced dataset, it is suggested
to use random downsampling."
| fndetail: 1
}}
{{ "Note that this independence assumption often does not
hold. A good example is two cat classifiers trained on the same
pretrained network but with different parameters. In this case,
they will not be independent because they will likely start with
the same initial features (i.e. if classifier A predicts that
the image is a cat because it has a tail and tall ears, classifier B
will probably classify it as a cat too). However, the case of web
scraping will often be independent enough. A good example is something
that looks for article titles from different metadata, say the `<title>`
field and the `<meta property='og:title' content='...'>` meta field.
In this case, the presence of the `meta` field would not directly
influence the prediction of the title classifier that looks for the
`<title>` field."
| fndetail: 2}}
