# Comparison of Moralizations with Other Texts

This notebook demonstrates how the modules of this directory can be used to compare linguistic features of moralizations with non-moralizing texts, such as thematizations of morality.


## Setup

The following cell needs to be executed only if the notebook is run in Google Colab.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
!git clone https://github.com/maria-becker/Moralization/
%cd "/content/Moralization/Annotation Analysis Tools/data_analysis"

These modules always need to be imported; always execute the cell below.
Note that these imports (and the imports inside the imported modules) only work with Unix/Linux-style filepaths. Under Windows, *corpus_extraction* and *xmi_analysis_util* have to be imported by other means (such as copying them into the current directory or installing them with pip).

In [1]:
import comparison_corpus_gen as ccg
import comparison_list_gen as clg
import moral_vs_nonmoral as mvn

import sys
sys.path.append('../_utils_')
import corpus_extraction as ce
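
If the relative path above causes trouble on your platform, one option is to build the `sys.path` entry with `pathlib`, which uses the correct separator for the operating system. This is a minimal sketch: the helper name `add_to_sys_path` is hypothetical, and it only affects this notebook's own search path, not the imports inside the imported modules.

```python
import sys
from pathlib import Path


def add_to_sys_path(*parts):
    """Append the directory built from `parts` to sys.path once, as an absolute path."""
    directory = str(Path(*parts).resolve())
    if directory not in sys.path:
        sys.path.append(directory)
    return directory


# Same target as sys.path.append('../_utils_'), written platform-independently
utils_dir = add_to_sys_path("..", "_utils_")
```

Calling the helper a second time with the same arguments leaves `sys.path` unchanged, so the cell can be re-run safely.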

## Creating a List of Items to Compare

There are two ways of creating lists of items whose frequencies we want to compare. The first is to write the list by hand, as below.

In [2]:
comparison_tokens = [
    "bildung",
    "bildungsgerechtigkeit",
    "chancengleichheit",
    "erziehung"
    ]

The other way is to use data from the corpora.
The module *comparison_list_gen* makes it possible to create dictionaries of lemmata or tokens that appear inside different types of annotations, where the keys are the tokens/lemmata and the values are the number of appearances.
Subsequently, these dictionaries can be used for comparison.

In [10]:
filepath = [
    r"C:\Users\ed304\Documents\Moralization\Testfiles\test_gerichtsurteile_DE.xmi",
    r"C:\Users\ed304\Documents\Moralization\Testfiles\test_plenar_FR.xmi"
]
corpus = ce.XMI_TO_CORPUS_OBJECT(filepath)

# Other functions that could be used here:
# tokens_in_annotations() for tokens
# pos_lemmata_in_annotations() for lemmata with specific POSs

comparison_lemmata = clg.lemmata_in_annotations(
    label_type="all_morals",
    language="de",
    corpus=corpus,
    hanta=True
)


Print the dictionary to get an overview of common lemmata.

In [None]:
for lemma, count in comparison_lemmata.items():
    print(lemma, count)
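
For larger dictionaries, it can help to look at the most frequent lemmata first. The sketch below sorts a dictionary of the shape described above by count; `toy_lemmata` is made-up data standing in for `comparison_lemmata`.

```python
# Made-up stand-in for comparison_lemmata (lemma -> number of appearances)
toy_lemmata = {"gerechtigkeit": 14, "#": 31, "freiheit": 12, "haus": 2}

# Sort the (lemma, count) pairs by count, most frequent first
sorted_lemmata = sorted(toy_lemmata.items(), key=lambda item: item[1], reverse=True)

# Show only the three most frequent entries
for lemma, count in sorted_lemmata[:3]:
    print(lemma, count)
```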

Use a list comprehension to get a list of lemmata whose frequencies we can compare. In this example, we take all lemmata with an absolute frequency of more than 10, then print the list to check that it contains only interesting lemmata, and remove those we do not care about.

In [None]:
comparison_list = [l for l in comparison_lemmata if comparison_lemmata[l] > 10]
print(comparison_list)
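# Remove entries we do not care about (guarded, since remove() raises if absent)
if "#" in comparison_list:
    comparison_list.remove("#")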
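
When several unwanted entries turn up, filtering against a set of exclusions in the same comprehension can be more convenient than repeated `remove()` calls. This is a minimal sketch with made-up data; `toy_lemmata`, `excluded` and `min_count` are illustrative assumptions.

```python
# Made-up stand-in for comparison_lemmata (lemma -> number of appearances)
toy_lemmata = {"gerechtigkeit": 14, "#": 31, "freiheit": 12, "haus": 2, "sein": 40}
excluded = {"#", "sein"}  # entries we do not want to compare
min_count = 10

# Keep lemmata above the threshold that are not explicitly excluded
filtered_list = [
    lemma
    for lemma, count in toy_lemmata.items()
    if count > min_count and lemma not in excluded
]
print(filtered_list)
```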

## Creating Corpora to Compare

We now need to create two more lists of strings: one of moralizations and one of a type of non-moralizing text. They will be the basis on which the frequencies of phenomena such as tokens or lemmata are compared.

This can be achieved via the *comparison_corpus_gen* module.

We can create a list of non-moralizing strings from Excel files that contain categorizations, where a 3 in the second column marks moralizations and 0-2 mark non-moralizing speech.

In the following code, we retrieve text tagged with a 0 -- in other words, thematizations of morality (see annotation guidelines).

In [None]:
nonmoral_xlsx_filepath = "/home/bruno/Desktop/Databases/Moralization/Kategorisierungen/Alle_bearbeiteten_Annotationen_positiv_final.xlsx"
nonmoral_list = ccg.list_nonmoral_strings_from_xlsx(
    nonmoral_xlsx_filepath,
    "Interviews",
    [0],
    rm_duplicates=True
)
for element in nonmoral_list:
    print(element)
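
Conceptually, this call selects rows by their category label and drops repeated texts. The following sketch illustrates that idea on in-memory rows; it is not the module's implementation, and `select_by_category` and the sample rows are hypothetical.

```python
def select_by_category(rows, categories, rm_duplicates=True):
    """Return the texts of (text, category) rows whose category is in `categories`."""
    wanted = set(categories)
    texts = [text for text, category in rows if category in wanted]
    if rm_duplicates:
        # dict.fromkeys keeps the first occurrence and preserves order
        texts = list(dict.fromkeys(texts))
    return texts


# Made-up rows: (text, category label)
rows = [
    ("Text A", 0),
    ("Text B", 3),
    ("Text A", 0),
    ("Text C", 2),
]
print(select_by_category(rows, [0]))
```

Passing several labels, for example `[0, 2]`, would combine more than one type of non-moralizing text into a single list.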

We retrieve moralizing strings via xmi files. (While it is possible to use the above function to retrieve items tagged with a 3, these annotations are of lower quality than those in the xmi files.) The function call below is rather self-explanatory.

In [None]:
# Placeholder path: replace with the path to your own XMI file
moral_filepath = "filepath.xmi"
moral_list = ccg.list_moralization_strings_from_xmi(
    moral_filepath,
    rm_duplicates=True
)
for element in moral_list:
    print(element)

## Comparison

Let's use the three lists we generated (the lemmata whose frequencies will be compared, and the lists of moralizing and non-moralizing speech on which we base our analysis) to see whether the lemmata are significantly more frequent in moralizations and hence indicative of that speech act.

In [None]:
comparison_dict = mvn.compare_lemma_likelihood_dict(
    moral_list=moral_list,
    nonmoral_list=nonmoral_list,
    lemmata=comparison_list,
    language='german'
)
for lemma, stats in comparison_dict.items():
    print(lemma, stats)
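
To give an idea of what such a comparison can involve, the sketch below computes the log-likelihood ratio (G2), a common measure for comparing the frequency of an item in two corpora. This is an assumption based on the function name: the statistics actually returned by `compare_lemma_likelihood_dict` may differ. The function name `log_likelihood` and the counts are made up for illustration.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood ratio (G2) for one item's frequency in two corpora."""
    total = freq_a + freq_b
    # Expected frequencies if the item were equally likely in both corpora
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Made-up counts: 30 hits in 10,000 tokens vs. 10 hits in 10,000 tokens
print(log_likelihood(30, 10, 10_000, 10_000))
```

With one degree of freedom, a G2 value above roughly 3.84 corresponds to p < 0.05.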

We can also write the results to an Excel file, like so:

In [None]:
mvn.dict_to_xlsx(comparison_dict, "output.xlsx")

## Combining Subcorpora into Bigger Datasets

The examples above use single files for all operations. However, to compare a genre of text, such as newspaper writing, and/or achieve higher statistical significance, we might want to use bigger corpora. In other words, we want to base our analysis on the contents of several files. This can be achieved by looping over lists of filepaths and adding lists together.

For example, this is how to create a list of moralizations out of a collection of XMIs:

In [None]:
filepath_list = [
    'filepath1.xmi',
    'filepath2.xmi',
    'filepath3.xmi',
    'filepath4.xmi'
]
moral_list = []
for filepath in filepath_list:
    moral_subcorpus_list = ccg.list_moralization_strings_from_xmi(
        filepath,
        rm_duplicates=True
    )
    moral_list.extend(moral_subcorpus_list)
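
The same loop can be wrapped in a small helper that works for any of the list-generating functions. This is a sketch: `collect_strings` and `fake_loader` are hypothetical names, and the optional cross-file deduplication assumes that `rm_duplicates` in the module functions only removes duplicates within a single file.

```python
def collect_strings(filepaths, loader, rm_duplicates=False):
    """Call `loader` on each path and concatenate the returned lists."""
    combined = []
    for path in filepaths:
        combined.extend(loader(path))
    if rm_duplicates:
        # Drop repeats across files while keeping the original order
        combined = list(dict.fromkeys(combined))
    return combined


# Stand-in for a real loader such as ccg.list_moralization_strings_from_xmi
def fake_loader(path):
    return [f"sentence from {path}", "shared sentence"]


print(collect_strings(["a.xmi", "b.xmi"], fake_loader, rm_duplicates=True))
```

With the real module, the loader could be a `lambda` that forwards the path together with whatever keyword arguments the function needs.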