# Comparison of Moralizations with other Texts

This notebook demonstrates how the modules in this directory can be used to compare linguistic features of moralizations with those of non-moralizing texts, such as thematizations of morality.


## Setup

The following cell needs to be executed only if the notebook is run in Google Colab.

In [None]:
# Check if the code is running in Google Colab
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount('/content/drive')
    !git clone https://github.com/maria-becker/Moralization/
    %cd "/content/Moralization/Annotation Analysis Tools/data_analysis"
else:
    print("This code should be run in Google Colab only.")

These modules always need to be imported; always execute the cell below.
Note that these imports (and the imports inside the imported modules) only work with Unix/Linux-style filepaths. Under Windows, *corpus_extraction* and *xmi_analysis_util* have to be imported by other means (such as copying them into the current directory or installing them with pip).

In [None]:
from analysis_functions import (
    moral_vs_nonmoral,
    surface_corpus,
    _util_ as util,
    _corpus_extraction_ as corpus_extraction
)
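If the imports above fail, one further workaround is to add the directory that holds the modules to Python's import search path before importing. The following is only a minimal sketch: the directory location `module_dir` is a hypothetical placeholder and has to be adapted to wherever the repository was cloned.

```python
import sys
from pathlib import Path

# Hypothetical location of the directory that contains the modules;
# adjust this to your own checkout of the repository.
module_dir = Path.cwd().parent / "xmi_analysis_util"

# Put the directory at the front of the import search path (only once).
module_dir_str = str(module_dir)
if module_dir_str not in sys.path:
    sys.path.insert(0, module_dir_str)
```

Because `pathlib` builds the path with the separator of the current operating system, the same cell can be used under Windows and Unix/Linux.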

## Creating a List of Items to Compare

There are two ways to create the list of items whose frequencies we want to compare. The first is to write the list manually, as below.

In [None]:
comparison_list = [
    "würde",
    "recht",
    "gerechtigkeit",
    "demokratie"
    ]

The other way is to use data from the corpora.
The module *comparison_list_gen* makes it possible to create dictionaries of lemmata or tokens that appear inside different types of annotations, where the keys are the tokens/lemmata and the values are the number of appearances.
Subsequently, these dictionaries can be used for comparison.

In [None]:
moral_filepath = "/home/brunobrocai/Desktop/Code/moralization/Testfiles/test_gerichtsurteile_DE.xmi"
corpus = corpus_extraction.Corpus(moral_filepath)

# Other functions that could be used here:
# tokens_in_annotations() for tokens
# pos_lemmata_in_annotations() for lemmata with specific POSs

comparison_lemmata = surface_corpus.lemmata_in_annotation(
    category="all_morals",
    language="de",
    corpus=corpus,
    tagger="HanTa"
)

Print the dictionary to get an overview of common lemmata.

In [None]:
for lemma, count in comparison_lemmata.items():
    print(lemma, count)
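For larger corpora, the overview is easier to read when the entries are sorted by frequency. The sketch below uses a small invented dictionary (`example_lemmata`, not real corpus data) as a stand-in for the lemma-to-count dictionary; the same pattern should apply to any dictionary of that shape.

```python
# Toy stand-in for a lemma -> count dictionary;
# the values are invented for illustration only.
example_lemmata = {"recht": 14, "würde": 23, "#": 40, "gesetz": 7}

# Sort the (lemma, count) pairs by count, highest first.
by_frequency = sorted(
    example_lemmata.items(),
    key=lambda item: item[1],
    reverse=True,
)

# Show only the most frequent entries.
for lemma, count in by_frequency[:3]:
    print(lemma, count)
```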

Use a list comprehension to get a list of lemmata whose frequencies we can compare. In this example, we take all lemmata with an absolute frequency of more than 10, then print the list to check that it contains only interesting lemmata (and remove those we do not care for).

In [None]:
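comparison_list = [lemma for lemma, count in comparison_lemmata.items() if count > 10]
# Remove uninteresting entries; the check avoids a ValueError if "#" is absent.
if "#" in comparison_list:
    comparison_list.remove("#")
print(comparison_list)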

## Creating Corpora to Compare

We now need to create two more lists of strings: one containing moralizations and one containing a type of non-moralizing text. These lists are the basis on which the frequencies of phenomena such as tokens or lemmata will be compared.

This can be achieved via the *comparison_corpus_gen* module.

We can create a list of non-moralizing strings from Excel files that contain categorizations. In these files, a 3 in the second column marks a moralization, while the values 0 to 2 mark non-moralizing speech.

In the following code, we retrieve text tagged with a 0 -- in other words, thematizations of morality (see annotation guidelines).

In [None]:
nonmoral_list = surface_corpus.list_nonmoral_strings_from_corpus(
    corpus
)
for element in nonmoral_list[:3]:
    print(element)

In [None]:
nonmoral_list2 = surface_corpus.list_nonmoral_strings_from_xlsx(
    "/home/brunobrocai/Data/Moralization/Excels/Alle_bearbeiteten_Annotationen_positiv_final.xlsx",
    ["Gerichtsurteile"],
    [0, 1, 2]
)

We retrieve moralizing strings via xmi files. (While it is possible to use the above function to retrieve items tagged with a 3, these annotations are of lower quality than those in the xmi files.) The function call below is rather self-explanatory.

In [None]:
moral_list = surface_corpus.list_moralization_strings_from_corpus(
    corpus,
)
for element in moral_list[:3]:
    print(element)
    print('-'*10)

## Comparison

Let's use the three lists of strings we generated: the lemmata whose frequencies will be compared, and the lists of moralizing and non-moralizing speech on which we base our analysis. With them, we can check whether the lemmata are significantly more frequent in moralizations and hence indicative of that speech act.

In [None]:
comparison_dict = moral_vs_nonmoral.compare_token_likelihood_dict(
    moral_list=moral_list,
    nonmoral_list=nonmoral_list,
    token_list=comparison_list,
    language='german'
)
for lemma, stats in comparison_dict.items():
    print(lemma, stats)
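To give an intuition for this kind of comparison, the sketch below computes a log-likelihood (G2) score for one token in two small text lists. This is an assumption about the kind of statistic involved, not a description of what `compare_token_likelihood_dict` computes internally. The helpers `log_likelihood` and `count_token` and the toy sentences are hypothetical, and both corpora are assumed to be non-empty.

```python
import math


def log_likelihood(count_a, total_a, count_b, total_b):
    """Log-likelihood (G2) score for one item observed in two corpora."""
    combined = count_a + count_b
    # Expected counts if the item were equally frequent in both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    score = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


def count_token(texts, token):
    """Return (occurrences of token, total tokens) using whitespace splitting."""
    tokens = [t for text in texts for t in text.lower().split()]
    return tokens.count(token), len(tokens)


# Invented toy data, for illustration only.
toy_moral = ["die würde des menschen ist unantastbar", "würde verpflichtet uns alle"]
toy_nonmoral = ["das gericht tagt am montag", "die würde wurde im text erwähnt"]

moral_count, moral_total = count_token(toy_moral, "würde")
nonmoral_count, nonmoral_total = count_token(toy_nonmoral, "würde")
print(log_likelihood(moral_count, moral_total, nonmoral_count, nonmoral_total))
```

A score of 0 means the token is equally frequent in both lists. With one degree of freedom, values above roughly 3.84 correspond to p < 0.05, although toy data of this size is far too small for a meaningful test.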

We can also write the results to an Excel file, like so:

In [None]:
moral_vs_nonmoral.dict_to_xlsx(comparison_dict, "output.xlsx")