# Comparison of Moralizations with other Texts

This notebook demonstrates how the modules of this directory can be used to compare linguistic features of moralizations with non-moralizing texts, such as thematizations of morality.


## Setup

The following cell needs to be executed only if the notebook is run in Google Colab.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
!git clone https://github.com/maria-becker/Moralization/
%cd "/content/Moralization/Annotation Analysis Tools/data_analysis"

These modules always need to be imported; always execute the cell below.
Note that these imports (and the imports inside the imported modules) only work with Unix/Linux-style filepaths. Under Windows, *corpus_extraction* and *xmi_analysis_util* have to be imported by other means (such as copying them into the current directory or installing them with pip).

In [1]:
import comparison_corpus_gen as ccg
import comparison_list_gen as clg
import moral_vs_nonmoral as mvn
import comparison_gen_collection as cgc

import sys
sys.path.append('../_utils_')
import corpus_extraction as ce



## Creating a List of Items to Compare

There are two ways of creating lists of items whose frequencies we want to compare. The first is just to do it manually, like below.

In [7]:
comparison_list = [
    "würde",
    "recht",
    "gerechtigkeit",
    "demokratie"
    ]

The other way is to use data from the corpora.
The module *comparison_list_gen* makes it possible to create dictionaries of lemmata or tokens that appear inside different types of annotations, where the keys are the tokens/lemmata and the values are the number of appearances.
Subsequently, these dictionaries can be used for comparison.
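To make the shape of such a dictionary concrete, here is a minimal sketch that counts lemmata with `collections.Counter`. The nested list `annotated_spans` is hypothetical toy data standing in for the lemmata the module would extract from annotations; it does not call the project's code and makes no claim about how the module works internally.

```python
from collections import Counter

# Hypothetical toy data: each inner list stands for the lemmata
# found inside one annotation span.
annotated_spans = [
    ["die", "würde", "des", "menschen"],
    ["würde", "und", "recht"],
]

# Count every lemma across all spans; keys are lemmata, values are counts.
counts = Counter(lemma for span in annotated_spans for lemma in span)
comparison_lemmata_demo = dict(counts)

for lemma, count in comparison_lemmata_demo.items():
    print(lemma, count)
```

The resulting dictionary has the same key/value layout as described above, so it can be filtered and compared in the same way.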

In [2]:
moral_filepath = "/home/brunobrocai/Desktop/Code/moralization/Testfiles/test_gerichtsurteile_DE.xmi"
corpus = ce.Corpus(moral_filepath)

# Other functions that could be used here:
# tokens_in_annotations() for tokens
# pos_lemmata_in_annotations() for lemmata with specific POSs

comparison_lemmata = cgc.lemmata_in_annotation(
    category="all_morals",
    language="de",
    corpus=corpus,
    hanta=True
)

KeyError: {'Coordinates': (43846, 43875), 'Category': 'Subversion'} -- Is the label inside a moralization?


Print the dictionary to get an overview of common lemmata.

In [None]:
for lemma, count in comparison_lemmata.items():
    print(lemma, count)

Use a list comprehension to get a list of lemmata whose frequencies we can compare. In this example, we take all lemmata with an absolute frequency of more than 10, remove those we do not care about (here, "#"), and print the list to check that it contains only interesting lemmata.

In [5]:
comparison_list = [l for l in comparison_lemmata if comparison_lemmata[l] > 10]
comparison_list.remove("#")
print(comparison_list)

['der']
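Note that `list.remove()` raises a `ValueError` if the item is not in the list. As a small sketch of an alternative, the threshold and the exclusions can be combined in one comprehension. The helper `frequent_items` and the toy counts are hypothetical and only illustrate the idea.

```python
def frequent_items(counts, min_count=10, exclude=()):
    """Return keys whose count exceeds min_count, skipping excluded items."""
    excluded = set(exclude)
    return [
        item for item, n in counts.items()
        if n > min_count and item not in excluded
    ]

# Hypothetical toy counts, not taken from the corpus.
toy_counts = {"der": 42, "#": 15, "würde": 11, "recht": 3}

filtered = frequent_items(toy_counts, min_count=10, exclude={"#"})
print(filtered)
```

Because excluded items are simply skipped, this variant does not fail when an unwanted item happens to be absent.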


## Creating Corpora to Compare

We now need to create two more lists of strings: one containing moralizations and one containing a type of non-moralizing text. They are the basis on which the frequencies of phenomena such as tokens or lemmata will be compared.

This can be achieved via the *comparison_corpus_gen* module.

We can create a list of non-moralizing strings from Excel files that contain categorizations: in the second column, a 3 marks a moralization, while the values 0-2 mark non-moralizing speech.

In the following code, we retrieve text tagged with a 0 -- in other words, thematizations of morality (see annotation guidelines).

In [4]:
nonmoral_list = cgc.list_nonmoral_strings_from_corpus(
    corpus
)
for element in nonmoral_list:
    print(element)

Klägerin war eine 55-jährige Arbeitslose, die mit ihrem Mann in "Bedarfsgemeinschaft" zusammenlebt.
Deswegen bekommen beide pro Kopf nur 311 Euro ALG II statt des vollen Satzes von 354 Euro.
Es reiche aber für eine menschenwürdige "bescheidene Lebensführung" noch aus.
(AZ: VII ZR 52/06)

In [5]:
nonmoral_list2 = cgc.list_nonmoral_strings_from_xlsx(
    "/home/brunobrocai/Data/Moralization/Excels/Alle_bearbeiteten_Annotationen_positiv_final.xlsx",
    ["Gerichtsurteile"],
    [0, 1, 2]
)
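The label-based selection described above can be pictured with plain Python. This is only a sketch under the assumption that each row pairs a text with its numeric category; the `rows` data and the helper `select_by_label` are hypothetical and do not read any Excel file.

```python
# Hypothetical rows: (text, category), where 3 marks a moralization
# and 0-2 mark non-moralizing speech.
rows = [
    ("Text A", 0),
    ("Text B", 3),
    ("Text C", 2),
    ("Text D", 1),
]

def select_by_label(rows, labels):
    """Keep the texts whose category is one of the requested labels."""
    wanted = set(labels)
    return [text for text, label in rows if label in wanted]

nonmoral_demo = select_by_label(rows, [0, 1, 2])   # all non-moralizing speech
thematizations_demo = select_by_label(rows, [0])   # thematizations only

print(nonmoral_demo)
print(thematizations_demo)
```

Passing `[0]` instead of `[0, 1, 2]` narrows the selection to thematizations of morality.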

We retrieve moralizing strings from XMI files. (Although the function above can also retrieve items tagged with a 3, those annotations are of lower quality than the ones in the XMI files.) The function call below is rather self-explanatory.

In [3]:
moral_list = cgc.list_moralization_strings_from_corpus(
    corpus,
)
for element in moral_list:
    print(element)

Die Klausel ist unwirksam. ### Sie #benachteiligt# den Mieter unangemessen
Der stellvertretende Fraktionsvorsitzende der Linken sagte, nun müsse die Bundesregierung den Verfassungsschutz anweisen, grundsätzlich die Beobachtung der Linken einzustellen: „Wenn die Bundesregierung Mut und Kraft hat, wird sie den Kalten #Krieg# beenden und den unsäglichen Weg der Bespitzelung nicht weitergehen”, sagte Ramelow der Süddeutschen Zeitung.
Diese "eugenische" Argumentation könne er nicht nachvollziehen
Sucht eine Firma in einer internen Stellenausschreibung gezielt nur Berufsanfänger mit erst wenigen Berufsjahren, ist das wegen #Diskriminierung# von älteren Beschäftigten unzulässig
ORF-Generaldirektor Alexander Wrabetz sagte der Presseagentur APA: „Wir werden diesen beispiellosen #Eingriff# in die Meinungsfreiheit selbstverständlich auch diesmal nicht hinnehmen
Seine Anwältin argumentierte, er sei durch den vom Jobcenter geforderten Verkauf der Lebensversicherung „auf den Weg der #Altersarmut# ve

## Comparison

Let's use the three lists we generated (the lemmata whose frequencies will be compared, plus the lists of moralizing and non-moralizing speech on which the analysis is based) to see whether the lemmata are significantly more frequent in moralizations and hence indicative of that speech act.

In [10]:
comparison_dict = mvn.compare_token_likelihood_dict(
    moral_list=moral_list,
    nonmoral_list=nonmoral_list,
    token_list=comparison_list,
    language='german'
)
for lemma, stats in comparison_dict.items():
    print(lemma, stats)

Error: Division by zero.
Error: Division by zero.
würde {'likelihood_moral': 0.0008756567425569177, 'likelihood_nonmoral': 0.00033534540576794097, 'ratio': 2.6112084063047285, 'diff_coeficient': 0.446168768186227, 'pvalue_fisher': 0.47719753844735113, 'contingency_table': [[1, 1141], [1, 2981]]}
recht {'likelihood_moral': 0.0008756567425569177, 'likelihood_nonmoral': 0.001006036217303823, 'ratio': 0.8704028021015762, 'diff_coeficient': -0.06928838951310866, 'pvalue_fisher': 1.0, 'contingency_table': [[1, 1141], [3, 2979]]}
gerechtigkeit None
demokratie None
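The numbers printed above are consistent with the following reading of the contingency table `[[hits_moral, rest_moral], [hits_nonmoral, rest_nonmoral]]`: each likelihood is hits divided by the row total, the ratio divides the moral likelihood by the non-moral one, and the difference coefficient is (a - b) / (a + b) of the two likelihoods. The sketch below recomputes them in pure Python, including a two-sided Fisher exact test. It is a reconstruction from the output, not the module's code: the function names are hypothetical, the key is spelled `diff_coefficient` here, and returning `None` when the ratio is undefined is an assumption about what the module does.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher exact test p-value for a 2x2 table (pure Python)."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def pmf(x):
        # Hypergeometric probability of x hits falling into the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = pmf(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    # Sum the probabilities of all tables at most as likely as the observed one.
    p_value = sum(pmf(x) for x in support if pmf(x) <= p_observed * (1 + 1e-7))
    return min(1.0, p_value)

def describe(table):
    """Recompute the statistics for one token; assumes both rows are non-empty."""
    (hits_m, rest_m), (hits_n, rest_n) = table
    like_m = hits_m / (hits_m + rest_m)
    like_n = hits_n / (hits_n + rest_n)
    if like_n == 0:
        # Assumption: no result when the ratio would divide by zero.
        return None
    return {
        "likelihood_moral": like_m,
        "likelihood_nonmoral": like_n,
        "ratio": like_m / like_n,
        "diff_coefficient": (like_m - like_n) / (like_m + like_n),
        "pvalue_fisher": fisher_exact_two_sided(table),
    }

# Contingency table printed for "würde" in the output above.
stats = describe([[1, 1141], [1, 2981]])
print(stats)
```

With only one occurrence in each list, the p-value is far from any conventional significance threshold, so the ratio of about 2.6 should not be read as evidence that the lemma is typical of moralizations.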


We can also write the results into an Excel file, like so:

In [None]:
mvn.dict_to_xlsx(comparison_dict, "output.xlsx")