# Different strategies for dealing with noise



Several different strategies for dealing with noise are attested in natural communication systems:

* **Reduplication**: simply repeat the signal several times. This is compatible with a compositional system. It is not costly in terms of learnability, because the only extra thing that needs to be learned is a single extra rule that applies to all signals. However, it is relatively costly in terms of utterance length (it would thus not do well under a pressure for minimal effort). 
* **Diversify signal**: make the individual segments that each signal consists of as distinct as possible. This strategy, however, is not compatible with compositionality, because it relies on making the segments as distinct from one another as possible. That means these languages are necessarily holistic, and therefore less easy to learn (so they would do less well under a pressure for learnability). Below is an example of a language in which each signal remains uniquely identifiable when noise obscures a single character:
    - 02 --> 'aaaa'
    - 03 --> 'bbbb'
    - 12 --> 'abba'
    - 13 --> 'baab'
* **Repair**: This strategy could be seen as a form of redundancy across turns, instead of within a signal. However, it will be initiated only when necessary, and should therefore fare slightly better than the reduplication strategy under a pressure for minimal effort (where effort is measured as the total utterance length of both interlocutors combined, across a set number of interactions).
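
The claim that the diversify-signal language survives single-character noise can be checked mechanically. The snippet below is a minimal sketch, assuming that noise replaces exactly one character with a placeholder symbol (`_` here, a hypothetical choice) and that the receiver knows the length of the signal. The helper names are illustrative and not part of the model code.

```python
# The "diversify signal" example language from the list above.
diversified = {"02": "aaaa", "03": "bbbb", "12": "abba", "13": "baab"}
# A plain compositional language without redundancy, for contrast.
short_compositional = {"02": "aa", "03": "ab", "12": "ba", "13": "bb"}

NOISE_CHAR = "_"  # hypothetical placeholder for an obscured character


def noisy_variants(form):
    """Return every variant of `form` with exactly one character obscured."""
    return [form[:i] + NOISE_CHAR + form[i + 1:] for i in range(len(form))]


def consistent_meanings(noisy_form, language):
    """Return the meanings whose form matches `noisy_form` at every visible position."""
    return [
        meaning
        for meaning, form in language.items()
        if len(form) == len(noisy_form)
        and all(n == NOISE_CHAR or n == f for n, f in zip(noisy_form, form))
    ]


def robust_to_single_erasure(language):
    """True if every single-character noisy variant still identifies one meaning."""
    return all(
        consistent_meanings(variant, language) == [meaning]
        for meaning, form in language.items()
        for variant in noisy_variants(form)
    )


print(robust_to_single_erasure(diversified))
print(robust_to_single_erasure(short_compositional))
```

Under these assumptions the first check succeeds and the second fails, because a noisy form such as `a_` is consistent with both `aa` and `ab`.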

Vinicius & Seán (2016 Evolang abstract titled "Language adapts to signal disruption in interaction") predict that, although the reduplication strategy and the repair strategy should do equally well under a pressure for learnability, adding the possibility of repair will 'lift the pressure for redundancy', such that receivers can request that speakers repeat a signal only after a problem occurs.
However, we would add that in the absence of a pressure for minimal effort, the repair strategy does not have an advantage over the reduplication strategy.
    


## Predictions under different selection pressures:

We predict that under the following assumptions:
- There is a pressure for expressivity/mutual understanding (or rather: a pressure to get one's signal/message across, which feels like a better way to describe the pressure that frogs and songbirds are under)
- Noise regularly disrupts part of the signal (Vinicius & Seán used a 0.5 probability in their experiment)
- Repair is a possibility


the following strategies will become dominant under the following combinations of the presence/absence of a pressure for learnability and a pressure for minimal effort:

|                | - minimal effort                                              | + minimal effort          |
|----------------|---------------------------------------------------------------|---------------------------|
| **- learnability** | Any of the three strategies above will do                                         | Repair + Compositional OR Holistic         |
| **+ learnability** | Reduplication + Compositional OR Repair + Compositional | Repair + Compositional |

**Note** that the prediction in the {-learnability, +minimal effort} condition above only holds if we do not distinguish between open and closed requests. If we do, as in the model we submitted to Evolang, we'd expect the Repair + Compositional strategy to fare best in this condition, without the need for a pressure for learnability.

## How to represent languages?

### Possibility 1: Different form lengths

If we continue with Kirby et al.'s (2015) way of representing meanings and forms (which is a minimal way of creating languages that we can classify as compositional, holistic or degenerate), where meanings consist of $f=2$ features, which can each have $v=2$ values, we can allow for each of the language strategies specified above ('reduplication' and 'diversify signal'), by simply allowing for multiple string lengths $l$, while keeping the alphabet size $|\Sigma|$ at 2.

For example, where Kirby et al. (2015) only allowed for a single possible string length, and specified $f = v = l = |\Sigma| = 2$, we could minimally allow for two possible string lengths: one being equal to $f$ (i.e. the minimum string length required to uniquely specify each meaning feature), and one being equal to $2*f$, to enable reduplication of the signal.

That would yield the following types of languages:


**Reduplication + compositional:**

02 --> aaaa

03 --> abab

12 --> baba

13 --> bbbb


**Diversify signal + holistic:**

02 --> aaaa

03 --> bbbb

12 --> abba

13 --> baab


**Repair + compositional:**

02 --> aa

03 --> ab

12 --> ba

13 --> bb


In order to still make it possible for iterated learning chains to transition from a language that uses forms of length 2 into a language that uses forms of length 4 and vice versa, we need to then also allow for languages that use a mixture of form lengths (e.g. three forms of length 2, and one form of length 4). This yields the following number of possible languages:

$$ (2^2+2^4)^4 = 160000$$

which means that compared to the Kirby et al. (2015) model (where there were $(2^2)^4 = 256$ possible languages), the hypothesis space expands by a factor of 625. That is not ideal, because if we assume that simulation run times increase linearly with the size of the hypothesis space, a simulation that took 1 hour to run in our previous model would now take 625 hours, i.e. almost 4 weeks.
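
The arithmetic behind these numbers can be reproduced in a few lines. This is only a sanity check of the counts given above. The variable names are illustrative, and the run-time estimate simply applies the linear-scaling assumption stated in the text.

```python
n_meanings = 4         # f = 2 features with v = 2 values each
alphabet_size = 2      # |Sigma| = 2
form_lengths = (2, 4)  # f and 2*f

# Number of possible forms when both string lengths are allowed: 2^2 + 2^4
n_forms = sum(alphabet_size ** length for length in form_lengths)

# A language assigns one form to each meaning
n_languages = n_forms ** n_meanings

# Kirby et al. (2015): only forms of length 2
n_languages_kirby = (alphabet_size ** 2) ** n_meanings

expansion_factor = n_languages // n_languages_kirby

# Under the linear-scaling assumption, a 1-hour simulation becomes:
hours = 1 * expansion_factor
weeks = hours / (24 * 7)

print(n_forms, n_languages, n_languages_kirby, expansion_factor)
print(round(weeks, 2))
```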

However, this linear relationship between simulation run time and the size of the hypothesis space only holds if, during learning, we actually loop through each hypothesis and update its posterior probability based on the data. There are a couple of ways in which this process can be optimised:

1. **memoisation**: This would require enumerating all possible data points (i.e. <meaning, form> pairs) (including all possible noisy forms), and for each of them calculating its likelihood for all possible hypotheses **once**, and caching the result. Whenever the same <meaning, form> pair is then encountered by any learner, the corresponding likelihood vector is then simply retrieved from memory and multiplied with the learner's current posterior. This should be doable given that the total number of meanings is 4, and the total number of forms (including all possible noisy variants, assuming that noise is restricted to a single character) is 56; which makes 4\*56 = 224 possible <meaning, form> pairs. For each of those 224 possible datapoints, we would then calculate its likelihood for all 160,000 hypotheses, and cache these values in a 224\*160,000 matrix. (That matrix thus has 224\*160,000 = 35,840,000 entries.)
2. Intergenerational learning could be sped up by representing data as simple counts of <meaning, form> pairs, and simply updating the posterior probability distribution for the full data set in one step, by multiplying the prior of the hypothesis with the likelihood of the <meaning, form> pair to the power of the number of times it occurs in the data set. This should speed things up a little in intergenerational learning, but won't make a difference in *intra*generational learning, because there we assume that the hearer updates their posterior in each interaction. 
3. Do not do exact inference over the full hypothesis space at all, but instead use an MCMC sampling technique (e.g. Burkett & Griffiths, 2010, and Kirby et al., 2015 use Gibbs sampling). This would require a bit more time to figure out, and is hopefully not necessary once optimisations 1 and 2 above have been implemented.
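
The first two optimisations can be sketched as follows. The first part enumerates the <meaning, form> pairs that a likelihood cache would need to cover, assuming that noise obscures exactly one character and is written as `_` (a hypothetical representation). The second part shows that a count-based batch update gives the same unnormalised log posterior as updating one datum at a time. The two hypotheses and their likelihood values are made up purely for illustration and are not the model's actual likelihoods.

```python
from collections import Counter
from itertools import product
from math import isclose, log

ALPHABET = "ab"
FORM_LENGTHS = (2, 4)
MEANINGS = ["02", "03", "12", "13"]
NOISE_CHAR = "_"  # hypothetical symbol for an obscured character


def single_char_noisy_variants(form):
    """All variants of `form` in which exactly one character is obscured."""
    return {form[:i] + NOISE_CHAR + form[i + 1:] for i in range(len(form))}


# Part 1: enumerate everything a likelihood cache would have to cover.
forms = ["".join(chars) for length in FORM_LENGTHS
         for chars in product(ALPHABET, repeat=length)]
noisy_forms = set()
for form in forms:
    noisy_forms |= single_char_noisy_variants(form)
observable_forms = sorted(set(forms) | noisy_forms)
datapoints = list(product(MEANINGS, observable_forms))
print(len(forms), len(noisy_forms), len(observable_forms), len(datapoints))


# Part 2: batch update from counts, in log space.
def log_posterior_from_counts(log_prior, log_likelihoods, counts):
    """Unnormalised log posterior: log prior + sum(count * log likelihood)."""
    return [
        log_prior[h] + sum(n * log_likelihoods[h][d] for d, n in counts.items())
        for h in range(len(log_prior))
    ]


# Toy example with two hypotheses and made-up likelihood values.
d1, d2 = ("02", "aa"), ("03", "ab")
toy_log_prior = [log(0.5), log(0.5)]
toy_log_likelihoods = [
    {d1: log(0.9), d2: log(0.1)},
    {d1: log(0.5), d2: log(0.5)},
]
data = [d1, d1, d2]

batch = log_posterior_from_counts(toy_log_prior, toy_log_likelihoods, Counter(data))

sequential = list(toy_log_prior)
for datum in data:
    sequential = [lp + toy_log_likelihoods[h][datum]
                  for h, lp in enumerate(sequential)]

print(all(isclose(b, s) for b, s in zip(batch, sequential)))
```

Working in log space turns "likelihood to the power of the count" into a multiplication by the count, which also avoids numerical underflow for larger data sets.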

### Possibility 2: Allow reduplication as grammatical rule, and increase alphabet size

If, rather than allowing for multiple form lengths, we increase the size of the alphabet $\Sigma$ from 2 to 4, that will make the diversify signal strategy possible. More concretely, that would mean that instead of there being an alphabet $[a, b]$, there would be an alphabet $[a, b, c, d]$.
That would allow for the following example languages, where the bit at the end of the signal specifies whether the signal should be repeated (1) or not (0).


**Reduplication + compositional:**

02 --> aa1

03 --> ab1

12 --> ba1

13 --> bb1


**Diversify signal + holistic:**

02 --> aa0

03 --> bb0

12 --> cc0

13 --> dd0


**Repair + compositional:**

02 --> aa0

03 --> ab0

12 --> ba0

13 --> bb0


Choosing this option would mean that instead of there being $(2^2 + 2^4) = 20$ possible forms, there would be $4^2 = 16$ possible forms, and therefore $(4^2)^4 = 65536$ possible languages. In addition, however, we'd need languages to have an extra bit that specifies whether signals are reduplicated or not (assuming there are only two options: reduplication ON versus reduplication OFF). That means that there'd be a total of $((4^2)^4)*2 = 131072$ possible languages. So compared to possibility 1, possibility 2 only reduces the size of the hypothesis space by a factor of 1.22. That is not much, but an added advantage of this way of representing languages is that it allows for a straightforward way of capturing the assumed simplicity of a reduplication rule in the coding of the languages, and therefore in the prior.
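
As a quick check of the comparison between the two representations, the counts can be computed directly. The variable names are illustrative only.

```python
n_meanings = 4

# Possibility 1: alphabet of size 2, form lengths 2 and 4
n_forms_p1 = 2 ** 2 + 2 ** 4
n_languages_p1 = n_forms_p1 ** n_meanings

# Possibility 2: alphabet of size 4, form length 2,
# plus one language-level bit for reduplication ON/OFF
n_forms_p2 = 4 ** 2
n_languages_p2 = (n_forms_p2 ** n_meanings) * 2

reduction_factor = n_languages_p1 / n_languages_p2

print(n_forms_p1, n_languages_p1)
print(n_forms_p2, n_languages_p2)
print(round(reduction_factor, 2))
```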

If we find that, despite the optimisation strategies outlined above, it is still not feasible to run simulations within a reasonable time-frame, we could consider tackling the different possible strategies for dealing with noise separately: one model in which we allow for the possibility of adding reduplication to signals vs. repair, and another model in which we allow for diversification of signal segments.

## Compressibility measure for two different language representation possibilities


### Possibility 1: Different form lengths

#### Rewrite rules:

**Reduplication + compositional:**

There are in fact two different ways of reduplicating a compositional language: either reduplicating the whole signal, or reduplicating each of the segments. In both cases the length of the minimally redundant form, and therefore the language type's compressibility, will be the same however, as shown below.

_**Reduplicate whole signal:**_

*Language:*

02 --> aaaa

03 --> abab

12 --> baba

13 --> bbbb


*Rewrite rules:*

S --> ABAB

A:0 --> a

A:1 --> b

B:2 --> a

B:3 --> b


*Minimally redundant form:*

SABAB.A0a.A1b.B2a.B3b


_**Reduplicate each segment:**_

*Language:*

02 --> aaaa

03 --> aabb

12 --> bbaa

13 --> bbbb


*Rewrite rules:*

S --> AABB

A:0 --> a

A:1 --> b

B:2 --> a

B:3 --> b


*Minimally redundant form:*

SAABB.A0a.A1b.B2a.B3b
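
To see why the two reduplication variants end up with the same compressibility, the rewrite rules can be expanded programmatically. The sketch below is illustrative and not part of the model code: `expand` and `mrf_from_rules` are hypothetical helpers, and it assumes that category `A` reads the first meaning feature and category `B` the second.

```python
# Rewrite rules shared by both reduplication variants.
rules = {"A": {"0": "a", "1": "b"}, "B": {"2": "a", "3": "b"}}
meanings = ["02", "03", "12", "13"]


def expand(template, rules, meaning):
    """Expand a start-rule template such as 'ABAB' into a form for one meaning."""
    # Category 'A' reads meaning feature 0, 'B' reads meaning feature 1.
    return "".join(rules[cat][meaning[ord(cat) - ord("A")]] for cat in template)


def mrf_from_rules(template, rules):
    """Write the grammar as a minimally redundant form string."""
    parts = ["S" + template]
    for cat, mapping in rules.items():
        for value, segment in mapping.items():
            parts.append(cat + value + segment)
    return ".".join(parts)


whole_signal = {m: expand("ABAB", rules, m) for m in meanings}
each_segment = {m: expand("AABB", rules, m) for m in meanings}
print(whole_signal)
print(each_segment)

mrf_whole = mrf_from_rules("ABAB", rules)
mrf_segments = mrf_from_rules("AABB", rules)
print(mrf_whole, mrf_segments)

# Both strings are built from exactly the same characters, just in a different order.
print(sorted(mrf_whole) == sorted(mrf_segments))
```

Because the two minimally redundant forms contain the same characters with the same frequencies, any coding length that depends only on character frequencies will come out identical for both.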




**Diversify signal + holistic:**

02 --> aaaa

03 --> bbbb

12 --> abba

13 --> baab


*Rewrite rules:*

S:02 --> aaaa

S:03 --> bbbb

S:12 --> abba

S:13 --> baab


*Minimally redundant form:*

S02aaaa.S03bbbb.S12abba.S13baab




**Repair + compositional:**

02 --> aa

03 --> ab

12 --> ba

13 --> bb


*Rewrite rules:*

S --> AB

A:0 --> a

A:1 --> b

B:2 --> a

B:3 --> b


*Minimally redundant form:*

SAB.A0a.A1b.B2a.B3b


### Now let's calculate the actual compressibility in terms of coding length, given the strings in minimally redundant form specified above:

In [33]:
def classify_language_four_forms(lang, forms, meaning_list):
    """
    Classify one particular language as either 0 = degenerate, 1 = holistic, 2 = hybrid, 3 = compositional, 4 = other
    (Kirby et al., 2015). NOTE that this function is specific to classifying languages that consist of exactly 4 forms,
    where each form consists of exactly 2 characters. For a more general version of this function, see
    classify_language_general() below.

    :param lang: a language; represented as a tuple of forms_without_noisy_variants, where each form index maps to same
    index in meanings
    :param forms: list of strings corresponding to all possible forms_without_noisy_variants
    :param meaning_list: list of strings corresponding to all possible meanings
    :returns: integer corresponding to category that language belongs to:
    0 = degenerate, 1 = holistic, 2 = hybrid, 3 = compositional, 4 = other (here I'm following the
    ordering used in the Kirby et al., 2015 paper; NOT the ordering from SimLang lab 21)
    """
    class_degenerate = 0
    class_holistic = 1
    class_hybrid = 2  # this is a hybrid between a holistic and a compositional language; where *half* of the partial
    # forms is mapped consistently to partial meanings (instead of that being the case for *all* partial forms)
    class_compositional = 3
    class_other = 4

    # First check whether some conditions are met, bc this function hasn't been coded up in the most general way yet:
    if len(forms) != 4:
        raise ValueError(
            "This function only works for a world in which there are 4 possible forms_without_noisy_variants"
        )
    if len(forms[0]) != 2:
        raise ValueError(
            "This function only works when each form consists of 2 elements")
    if len(lang) != len(meaning_list):
        raise ValueError("Lang should have same length as meanings")

    # lang is degenerate if it uses the same form for every meaning:
    if lang[0] == lang[1] and lang[1] == lang[2] and lang[2] == lang[3]:
        return class_degenerate

    # lang is compositional if it makes use of all possible forms_without_noisy_variants, *and* each form element maps
    # to the same meaning element for each form:
    elif forms[0] in lang and forms[1] in lang and forms[2] in lang and forms[
        3] in lang and lang[0][0] == lang[1][0] and lang[2][0] == lang[3][0] and lang[0][
        1] == lang[2][1] and lang[1][1] == lang[3][1]:
        return class_compositional

    # lang is holistic if it is *not* compositional, but *does* make use of all possible forms_without_noisy_variants:
    elif forms[0] in lang and forms[1] in lang and forms[2] in lang and forms[3] in lang:
        # within holistic languages, we can distinguish between those in which at least one part form is mapped
        # consistently onto one part meaning. This class we will call 'hybrid' (because for the purposes of repair, it
        # is a hybrid between a holistic and a compositional language, because for half of the possible noisy forms that
        # a listener could receive it allows the listener to figure out *part* of the meaning, and therefore use a
        # restricted request for repair instead of an open request.
        if lang[0][0] == lang[1][0] and lang[2][0] == lang[3][0]:
            return class_hybrid
        elif lang[0][1] == lang[2][1] and lang[1][1] == lang[3][1]:
            return class_hybrid
        else:
            return class_holistic

    # In all other cases, a language belongs to the 'other' category:
    else:
        return class_other

In [50]:
from math import log2
from string import ascii_uppercase


def mrf_degenerate(lang, meaning_list):
    mrf_string = 'S'
    for i in range(len(meaning_list)):
        meaning = meaning_list[i]
        if i != len(meaning_list)-1:
            mrf_string += str(meaning)+','
        else:
            mrf_string += str(meaning)
    mrf_string += lang[0]
    return mrf_string


def mrf_holistic(lang, meaning_list):
    mrf_string = ''
    for i in range(len(meaning_list)):
        meaning = meaning_list[i]
        form = lang[i]
        if i != len(meaning_list)-1:
            mrf_string += 'S'+meaning+form+'.'
        else:
            mrf_string += 'S'+meaning+form
    return mrf_string
    

def mrf_compositional(lang, meaning_list):
    n_features = len(meaning_list[0])
    categories = ascii_uppercase[:n_features]
    mrf_string = 'S'+categories
    for i in range(len(categories)):
        category = categories[i]
        category_feature_values = []
        feature_value_segments = []
        for j in range(len(meaning_list)):
            if meaning_list[j][i] not in category_feature_values:
                category_feature_values.append(meaning_list[j][i])
                feature_value_segments.append(lang[j][i])
        for k in range(len(category_feature_values)):
            value = category_feature_values[k]
            segment = feature_value_segments[k]
            mrf_string += "."+category+value+segment
    return mrf_string


def minimally_redundant_form(lang, forms, meaning_list):
    # 0 = degenerate, 1 = holistic, 2 = hybrid, 3 = compositional, 4 = other
    lang_class = classify_language_four_forms(lang, forms, meaning_list)
    if lang_class == 0:  # the language is 'degenerate'
        mrf_string = mrf_degenerate(lang, meaning_list)
    elif lang_class == 1 or lang_class == 2:  # the language is 'holistic' or 'hybrid'
        mrf_string = mrf_holistic(lang, meaning_list)
    elif lang_class == 3:  # the language is 'compositional'
        mrf_string = mrf_compositional(lang, meaning_list)
    else:  # the language is 'other': no minimally redundant form is defined for this class yet
        raise ValueError(
            "No minimally redundant form is defined for languages of class 'other'")
    return mrf_string




def character_probs(mrf_string):
    count_dict = {}
    for character in mrf_string:
        if character in count_dict.keys():
            count_dict[character] += 1
        else:
            count_dict[character] = 1       
    prob_dict = {}
    for character in count_dict.keys():
        char_prob = count_dict[character]/len(mrf_string)
        prob_dict[character] = char_prob
    return prob_dict


def coding_length(mrf_string):
    char_prob_dict = character_probs(mrf_string)
    coding_len = 0
    for character in mrf_string:
        coding_len += log2(char_prob_dict[character])
    return -coding_len
  

mrf_string_degenerate is:
S02,03,12,13aa

n_features
2
categories are:
AB
mrf_string before for-loop is:
SAB

i is:
0
category is:
A
category_feature_values are:
['0', '1']
feature_value_segments are:
['a', 'b']

i is:
1
category is:
B
category_feature_values are:
['2', '3']
feature_value_segments are:
['a', 'b']

mrf_string_compositional is:
SAB.A0a.A1b.B2a.B3b

mrf_string_holistic is:
S02aa.S03ab.S12bb.S13ba


In [42]:
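# First, let's check whether the functions defined above work correctly
# for the example languages given in Kirby et al. (2015):

meanings = ['02', '03', '12', '13']
forms_without_noisy_variants = ['aa', 'ab', 'ba', 'bb']

lang_degenerate = ['aa', 'aa', 'aa', 'aa']

lang_compositional = ['aa', 'ab', 'ba', 'bb']

lang_holistic = ['aa', 'ab', 'bb', 'ba']


# coding_length() expects a minimally redundant form *string*, so each
# language is first converted with minimally_redundant_form():
mrf_string_degenerate = minimally_redundant_form(lang_degenerate, forms_without_noisy_variants, meanings)
coding_len_degenerate = coding_length(mrf_string_degenerate)
print("coding_len_degenerate is:")
print(round(coding_len_degenerate, ndigits=2))


mrf_string_compositional = minimally_redundant_form(lang_compositional, forms_without_noisy_variants, meanings)
coding_len_compositional = coding_length(mrf_string_compositional)
print('')
print("coding_len_compositional is:")
print(round(coding_len_compositional, ndigits=2))


mrf_string_holistic = minimally_redundant_form(lang_holistic, forms_without_noisy_variants, meanings)
coding_len_holistic = coding_length(mrf_string_holistic)
print('')
print("coding_len_holistic is:")
print(round(coding_len_holistic, ndigits=2))

coding_len_degenerate is:
38.55

coding_len_compositional is:
59.2

coding_len_holistic is:
67.29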


In [30]:
# And now that we know that these functions are coded up correctly,
# let's have a look at the coding lengths for our example languages
# for Possibility 1: different form lengths

mrf_compositional_reduplicate_whole_signal = "SABAB.A0a.A1b.B2a.B3b"
coding_len_compositional_reduplicate_whole_signal = coding_length(mrf_compositional_reduplicate_whole_signal)
print("coding_len_compositional_reduplicate_whole_signal is:")
print(round(coding_len_compositional_reduplicate_whole_signal, ndigits=2))

mrf_compositional_reduplicate_segments = "SAABB.A0a.A1b.B2a.B3b"
coding_len_compositional_reduplicate_segments = coding_length(mrf_compositional_reduplicate_segments)
print('')
print("coding_len_compositional_reduplicate_segments is:")
print(round(coding_len_compositional_reduplicate_segments, ndigits=2))

mrf_holistic_diversify_signal = "S02aaaa.S03bbbb.S12abba.S13baab"
coding_len_holistic_diversify_signal = coding_length(mrf_holistic_diversify_signal)
print('')
print("coding_len_holistic_diversify_signal is:")
print(round(coding_len_holistic_diversify_signal, ndigits=2))

mrf_compositional_repair = "SAB.A0a.A1b.B2a.B3b"
coding_len_compositional_repair = coding_length(mrf_compositional_repair)
print('')
print("coding_len_compositional_repair is:")
print(round(coding_len_compositional_repair, ndigits=2))


ratio_reduplication_vs_repair = coding_len_compositional_reduplicate_whole_signal/coding_len_compositional_repair
print('')
print("ratio_reduplication_vs_repair is:")
print(round(ratio_reduplication_vs_repair, ndigits=2))


ratio_diversify_signal_vs_reduplication = coding_len_holistic_diversify_signal/coding_len_compositional_reduplicate_whole_signal
print('')
print("ratio_diversify_signal_vs_reduplication is:")
print(round(ratio_diversify_signal_vs_reduplication, ndigits=2))

ratio_diversify_signal_vs_repair = coding_len_holistic_diversify_signal/coding_len_compositional_repair
print('')
print("ratio_diversify_signal_vs_repair is:")
print(round(ratio_diversify_signal_vs_repair, ndigits=2))



coding_len_compositional_reduplicate_whole_signal is:
64.24

coding_len_compositional_reduplicate_segments is:
64.24

coding_len_holistic_diversify_signal is:
84.83

coding_len_compositional_repair is:
59.2

ratio_reduplication_vs_repair is:
1.09

ratio_diversify_signal_vs_reduplication is:
1.32

ratio_diversify_signal_vs_repair is:
1.43


Alright, so as we can see from the coding lengths above, possibility 1 of how to represent languages gives relative coding lengths that capture the intuitions we have about how hard it is to learn these different languages: the compositional languages with reduplication have only slightly longer coding lengths than the compositional language without it (ratio reduplication:repair = 1.09:1), whereas the holistic language resulting from the diversify signal strategy has a substantially longer coding length (ratio diversify_signal:repair = 1.43:1).
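
For reference, the quantity that `coding_length` computes can be written out explicitly. The notation here is ours rather than Kirby et al.'s: $n$ is the length of the minimally redundant form $s$, and $n_c$ is the number of times character $c$ occurs in it.

$$ L(s) = -\sum_{i=1}^{n} \log_2 \frac{n_{s_i}}{n} = \sum_{c} n_c \log_2 \frac{n}{n_c} $$

As a worked example, the Repair + compositional string `SAB.A0a.A1b.B2a.B3b` has $n = 19$ characters: `S`, `0`, `1`, `2` and `3` occur once each, `A` and `B` three times each, `.` four times, and `a` and `b` twice each. That gives:

$$ L = 5\log_2 19 + 6\log_2 \tfrac{19}{3} + 4\log_2 \tfrac{19}{4} + 4\log_2 \tfrac{19}{2} \approx 21.24 + 15.98 + 8.99 + 12.99 = 59.20 $$

which agrees with the value of 59.2 reported above.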

## Conclusion about how to represent languages:

Based on all the above, I'd say let's go for possibility 1 of allowing for different form lengths. That keeps our way of representing languages as close as possible to the one used by Kirby et al. (2015); it allows us to straightforwardly calculate the coding lengths; and it will not cause languages to make use of a different number of characters, as possibility 2 would.

# References

Burkett, D., & Griffiths, T. L. (2010). Iterated learning of multiple languages from multiple teachers. The Evolution of Language: Proceedings of the 8th International Conference (EVOLANG8), Utrecht, Netherlands, 14-17 April 2010, 58–65.

Kirby, S., Tamariz, M., Cornish, H., & Smith, K. (2015). Compression and communication in the cultural evolution of linguistic structure. Cognition, 141, 87–102. https://doi.org/10.1016/j.cognition.2015.03.016