# Hybrid Nouns in Yiddish

The purpose of this notebook is to analyse the corpus entries of the [Corpus of Modern Yiddish (CMY)](http://web-corpora.net/YNC/search/index.php?interface_language=en) that were returned by `query_corpus.py`. In a first step, the entries were rated with the script `rate_entries.py` as to whether the corpus entry includes agreement between a hybrid noun and some goal. A hybrid noun has two different gender features, one grammatical and one semantic. For example, the noun 'meydl' (girl) has neuter grammatical gender, whereas its semantic gender is feminine.

The words that were queried are 'meydl' (girl), 'froyentsimer' (woman), 'vayb' (woman) and its diminutive 'vaybl'. These words trigger agreement on different words in a sentence. The list below gives an overview of the different agreement goals, the argument passed to the query script, the maximal distance to the controller, and the POS label:

- attributive adjectives: `--gram1=PRON,A` | max. 5 tokens preceding and 1 token distance succeeding | `attributive`
- articles: `--lex1=der` | max. 10 tokens preceding | `determiner`
- relative pronouns: `--lex2=vos|velkher` | max. 3 tokens succeeding | `rel_pron`
- anaphoric pronouns: `--lex2=zi|es --gram2=PRON,S` | max. 10 tokens succeeding | `pronoun`
- possessive pronouns: `--lex2=ir|zayn --gram2=PRON,A` | max. 10 tokens succeeding | `possessive`

The query results are stored in the `data` folder but are not included in this repository for copyright reasons. The file names follow the pattern: `(lex1|word1)_gram1_(lex2|word2)_gram2.jsonl`

### Import/Install libraries

First, some libraries need to be installed and imported.

In [69]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install pandas



In [70]:
import pandas as pd
import json
import os
import re

### Constants
The constants `LEX` and `POS` hold regexes for parsing the file names. The notebook assumes that the files live in the `data` folder. `LEX` maps file-name patterns to the controller words, and `POS` maps them to the POS labels of the agreement goals.

In [71]:
PATH = r'data'
LEX = {
    'froyentsimer': 'froyentsimer',
    'meydl': 'meydl',
    'vayb(?!l)': 'vayb',
    'vaybl': 'vaybl'
}
POS = {
    'velkh|vos': 'rel_pron',
    'PRON_A': 'possessive',
    'PRON_S': 'pronoun',
    '^_A': 'attributive',
    '^der': 'determiner'
}

### Load files

First, the rated jsonl files (those with `rated` in their name) are loaded, processed and combined into a single data frame.

In [72]:
def process_file(file):
    # Runs through the regex dict and returns the value, if the regex matches
    try:
        lex = next(lex for regex, lex in LEX.items() if re.search(regex,file))
    except StopIteration:
        lex = None
    try:
        pos = next(pos for regex, pos in POS.items() if re.search(regex,file))
    except StopIteration:
        pos = None
    return lex, pos
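The lookup performed by `process_file` can be tried out in isolation. The sketch below copies `LEX` and `POS` from above and uses a hypothetical helper `first_match`; the file names are made up for illustration and only assumed to follow the naming pattern described earlier.

```python
import re

# Copied from the constants cell above
LEX = {
    'froyentsimer': 'froyentsimer',
    'meydl': 'meydl',
    'vayb(?!l)': 'vayb',
    'vaybl': 'vaybl'
}
POS = {
    'velkh|vos': 'rel_pron',
    'PRON_A': 'possessive',
    'PRON_S': 'pronoun',
    '^_A': 'attributive',
    '^der': 'determiner'
}

def first_match(patterns, file_name):
    """Return the label of the first regex that matches, or None (hypothetical helper)."""
    return next((label for regex, label in patterns.items()
                 if re.search(regex, file_name)), None)

# Hypothetical file names, not taken from the real data folder
examples = [
    'der__vaybl__rated.jsonl',
    'meydl__zi_PRON_S_rated.jsonl',
    'vayb__vos__rated.jsonl',
    'notes.txt',
]
labels = {name: (first_match(LEX, name), first_match(POS, name)) for name in examples}
for name, pair in labels.items():
    print(name, '->', pair)
```

The negative lookahead in `vayb(?!l)` keeps `vayb` from matching file names that contain `vaybl`, and a file name that matches no pattern yields `None` for that label.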

In [73]:
dfs = []
files = [file for file in os.listdir(PATH) if 'rated' in file]

for file in files:
    lex, pos = process_file(file)
    # Build the path from PATH instead of hard-coding the folder name
    df_temp = pd.read_json(os.path.join(PATH, file), lines=True)
    df_temp['file_name'] = file
    df_temp['controller_lex'] = lex
    df_temp['goal_pos'] = pos
    dfs.append(df_temp)

total_data = pd.concat(dfs)
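Since the data files are not part of the repository, the loading step can be illustrated with an in-memory stand-in. This is a minimal sketch: the two records are invented, and only the field names `author`, `document` and `decision` are taken from the columns used in this notebook.

```python
import io
import pandas as pd

# Two made-up JSON Lines records standing in for a rated file
sample = io.StringIO(
    '{"author": "Author A", "document": "Doc 1", "decision": "n"}\n'
    '{"author": "Author B", "document": "Doc 2", "decision": "f"}\n'
)

# lines=True tells pandas to parse one JSON object per line
df = pd.read_json(sample, lines=True)
print(df)
```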

### Process and clean data

The column `decision` is checked and the result is written to the corresponding indicator column. After that, the columns that are not needed are dropped. The remaining columns are:
- `author`: the name of the author of the document where the entry was found
- `controller_lex`: the word that triggers ('controls') the agreement
- `goal_pos`: the POS label for the goal of the agreement
- `neut`: whether or not the 'goal' shows neuter gender
- `fem`: whether or not the 'goal' shows feminine gender
- `indiff`: whether the gender cannot be determined

In [74]:
total_data['neut'] = total_data['decision'].map(lambda x: int(x =='n'))
total_data['fem'] = total_data['decision'].map(lambda x: int(x == 'f'))
total_data['indiff'] = total_data['decision'].map(lambda x: int(x == 'i'))
clean_data = total_data.drop(['id', 'tokens', 'date', 'text', 'raw_text', 'file_name', 'decision', 'document'],axis=1)
clean_data

index,author,controller_lex,goal_pos,neut,fem,indiff
0,Rouz Dzheyn,froyentsimer,rel_pron,0,0,1
0,Vaysenberg Itshe Meyer,vayb,pronoun,1,0,0
0,Forverts,vaybl,determiner,1,0,0
1,Sholem-Aleykhem,vaybl,determiner,1,0,0
2,Sholem-Aleykhem,vaybl,determiner,1,0,0
...,...,...,...,...,...,...
19,Yehoyesh,vayb,rel_pron,0,0,1
20,Forverts,vayb,rel_pron,0,0,1
21,Forverts,vayb,rel_pron,0,0,1
22,Bergelson Dovid,vayb,rel_pron,0,0,1


### Inspect data
First, the data frame is grouped by the columns `author`, `controller_lex` and `goal_pos`. The values of the columns `neut`, `fem` and `indiff` are summed.

In [75]:
grouped_data = clean_data.groupby(['author', 'controller_lex', 'goal_pos']).sum()
grouped_data

author,controller_lex,goal_pos,neut,fem,indiff
Ash Sholem,meydl,attributive,1,0,0
Ash Sholem,meydl,pronoun,0,1,0
Ash Sholem,vayb,attributive,3,0,7
Balkin Leybl,meydl,attributive,1,0,0
Balkin Leybl,meydl,determiner,1,0,0
...,...,...,...,...,...
Yehoyesh,vayb,attributive,7,6,110
Yehoyesh,vayb,determiner,2,42,0
Yehoyesh,vayb,possessive,0,27,0
Yehoyesh,vayb,pronoun,0,58,0
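The two steps so far, turning `decision` into indicator columns and summing them per group, can be replayed on a toy frame. The rows below are invented; the decision codes `n`, `f` and `i` and the column names follow the cells above, while the `eq(...).astype(int)` spelling is just one possible way to write the indicator step.

```python
import pandas as pd

# Invented ratings for two placeholder authors
toy = pd.DataFrame({
    'author': ['A', 'A', 'A', 'B'],
    'controller_lex': ['meydl', 'meydl', 'meydl', 'vayb'],
    'goal_pos': ['determiner', 'determiner', 'pronoun', 'pronoun'],
    'decision': ['n', 'f', 'f', 'i'],
})

# One 0/1 indicator column per decision code
for column, code in {'neut': 'n', 'fem': 'f', 'indiff': 'i'}.items():
    toy[column] = toy['decision'].eq(code).astype(int)

# Summing the indicators per group yields the counts per gender
counts = (toy.drop(columns='decision')
             .groupby(['author', 'controller_lex', 'goal_pos'])
             .sum())
print(counts)
```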


In the next step, we group only by author and calculate, per author, the total number of corpus entries found and the percentage of agreement goals showing feminine gender. The total count includes the indifferent cases, i.e. those where it cannot be decided whether the word bears feminine or neuter gender. These indifferent forms are excluded from the calculation of the percentage of feminine gender.

In [76]:
def percentage(fem, neut):
    return round(fem/(fem + neut)*100,1) if (fem+neut) != 0 else 0

def process_data(data_frame):
    data_frame.loc['total'] = data_frame.sum(numeric_only = True)
    data_frame['total'] = data_frame.apply(lambda x: int(x.fem+x.neut+x.indiff), axis=1)
    data_frame['% fem.'] = data_frame.apply(lambda x: percentage(x.fem, x.neut),axis=1)

In [77]:
overall = grouped_data.groupby(['author']).sum()
process_data(overall)
overall[overall['total'] >= 10]

author,neut,fem,indiff,total,% fem.
Ash Sholem,4,1,7,12,20.0
Bashevis Zinger Yitskhok,4,0,9,13,0.0
Bergelson Dovid,44,15,19,78,25.4
Dik Ayzik-Meyer,1,5,5,11,83.3
Forverts,284,110,254,648,27.9
Khaver-Paver,4,7,6,17,63.6
Kobrin Leon,8,7,16,31,46.7
Lebns-Fragn,32,15,28,75,31.9
Manger Itsik,10,3,10,23,23.1
Ostrovski V.,6,7,0,13,53.8
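As a quick sanity check, the `percentage` function can be applied by hand to a few rows of the table above. The function is copied unchanged; the counts are the `fem` and `neut` values shown for the respective authors.

```python
def percentage(fem, neut):
    # Share of feminine forms among the decidable (fem + neut) forms
    return round(fem/(fem + neut)*100,1) if (fem+neut) != 0 else 0

# (fem, neut) pairs read off the table above
rows = {
    'Ash Sholem': (1, 4),
    'Bergelson Dovid': (15, 44),
    'Dik Ayzik-Meyer': (5, 1),
    'Forverts': (110, 284),
}
shares = {author: percentage(fem, neut) for author, (fem, neut) in rows.items()}
for author, share in shares.items():
    print(author, share)
```

Because `indiff` is left out of the denominator, an author such as Dik Ayzik-Meyer reaches 83.3% on the basis of only 6 decidable entries out of 11.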


As can be seen, the 'author' Forverts has a total count of 648 relevant corpus entries. It is not an individual author, however, but the newspaper [Forverts (The Forward)](https://forward.com/yiddish/), so it is of limited use for an analysis by author. The data also holds entries from the Yiddish newspaper [Lebns-Fragn](https://en.wikipedia.org/wiki/Lebns_Fragn). The 'real' author with the highest total count is [Yehoyesh](https://en.wikipedia.org/wiki/Yehoash_(poet)), who is mostly present in the data with translations of the Tanakh. He also seems to mainly use feminine gender (93.5%) for the agreement goal.

In [78]:
total_data[total_data['author'] == 'Yehoyesh']['document']

0               Tanakh: Yirmeyohu
1             Tanakh: Shmuel Alef
12    Tanakh: Divrey Hayomim Alef
14               Tanakh: Breyshis
18                Tanakh: Shoftim
                 ...             
15               Tanakh: Bamidber
16               Tanakh: Koyheles
17    Tanakh: Divrey Hayomim Alef
19            Tanakh: Shmuel Beyz
23               Tanakh: Breyshis
Name: document, Length: 310, dtype: object
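The figures for Yehoyesh can be cross-checked against the per-goal tables further down in this notebook. The sketch below simply adds up his `neut`, `fem` and `indiff` counts from those five tables; no new data is introduced.

```python
# (neut, fem, indiff) for Yehoyesh, read off the per-goal tables in this notebook
yehoyesh = {
    'attributive': (10, 7, 110),
    'determiner': (2, 70, 0),
    'rel_pron': (0, 0, 16),
    'pronoun': (0, 64, 0),
    'possessive': (0, 31, 0),
}

neut = sum(counts[0] for counts in yehoyesh.values())
fem = sum(counts[1] for counts in yehoyesh.values())
indiff = sum(counts[2] for counts in yehoyesh.values())

total = neut + fem + indiff
pct_fem = round(fem / (fem + neut) * 100, 1)
print(total, pct_fem)
```

The total agrees with the length of the `document` series shown above, and the percentage agrees with the 93.5% quoted in the text.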

When inspecting the distribution over the controller lexemes, it can be seen that the lexeme `froyentsimer` only shows semantic agreement, but it occurs only 4 times in the data. The lexeme `vayb` seems to have a relatively high tendency towards semantic agreement, whereas the lexeme `meydl` seems to mainly trigger grammatical agreement. The diminutive `vaybl` seems to have an even lower preference for semantic agreement.

In [79]:
controller_group = grouped_data.groupby(['controller_lex']).sum()
process_data(controller_group)
controller_group

controller_lex,neut,fem,indiff,total,% fem.
froyentsimer,0,2,2,4,100.0
meydl,360,143,92,595,28.4
vayb,109,230,410,749,67.8
vaybl,49,16,25,90,24.6
total,518,391,529,1438,43.0


Note that the lexemes may be unevenly distributed over the authors. Before investigating this further, the distribution over the different agreement goals can hint at whether the type of goal affects gender agreement. The table below shows that attributive adjectives mostly bear neuter gender when their gender is decidable. Overall, 13% of the attributives that are decidable for gender show feminine gender. Again, Yehoyesh has the highest proportion of semantic agreement with 41.2%, while Dovid Bergelson (20.8%) and Leon Kobrin (33.3%) lean more towards grammatical agreement on attributives.

In [80]:
attributive = grouped_data.filter(like = 'attributive', axis=0).groupby(['author']).sum()
process_data(attributive)
attributive[attributive['total'] >= 10]

author,neut,fem,indiff,total,% fem.
Ash Sholem,4,0,7,11,0.0
Bashevis Zinger Yitskhok,3,0,8,11,0.0
Bergelson Dovid,19,5,3,27,20.8
Forverts,162,10,179,351,5.8
Kobrin Leon,6,3,16,25,33.3
Lebns-Fragn,19,1,20,40,5.0
Manger Itsik,4,0,8,12,0.0
Perets Yitskhok-Leyb,3,0,10,13,0.0
Sholem-Aleykhem,5,0,5,10,0.0
Yehoyesh,10,7,110,127,41.2
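The cells in this part select one agreement goal with `filter(like=...)`, which keeps index labels by substring match. An exact-match alternative is `DataFrame.xs` on the `goal_pos` level; the sketch below shows it on an invented frame with the same index layout as `grouped_data`. It is an alternative spelling under that assumption, not the method used in this notebook.

```python
import pandas as pd

# Invented counts with the same three index levels as grouped_data
toy = pd.DataFrame({
    'author': ['A', 'A', 'A', 'B'],
    'controller_lex': ['meydl', 'vayb', 'meydl', 'vayb'],
    'goal_pos': ['attributive', 'attributive', 'pronoun', 'attributive'],
    'neut': [3, 2, 0, 1],
    'fem': [1, 0, 2, 4],
    'indiff': [0, 1, 0, 5],
}).set_index(['author', 'controller_lex', 'goal_pos'])

# Select the rows whose goal_pos is exactly 'attributive', then sum per author
attr = toy.xs('attributive', level='goal_pos').groupby(level='author').sum()
print(attr)
```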


As for the determiners, the overall proportion of feminine gender is higher than for the attributives: 28.2% of the relevant corpus entries show semantic agreement. Yehoyesh shows a clear tendency towards feminine gender here, with only 2 out of 72 entries bearing neuter gender. The author V. Ostrovski still favours neuter gender, with only 40% feminine agreement.

In [81]:
determiner = grouped_data.filter(like = 'determiner', axis=0).groupby(['author']).sum()
process_data(determiner)
determiner[determiner['total'] >= 10]

author,neut,fem,indiff,total,% fem.
Bergelson Dovid,23,3,0,26,11.5
Forverts,119,9,0,128,7.0
Lebns-Fragn,13,0,0,13,0.0
Ostrovski V.,6,4,0,10,40.0
Perets Yitskhok-Leyb,11,0,0,11,0.0
Sholem-Aleykhem,12,0,0,12,0.0
Yehoyesh,2,70,0,72,97.2
total,244,96,0,340,28.2


The relative pronouns appear to show an even greater preference for semantic agreement, with 50% of the decidable corpus entries showing feminine gender. Two important caveats apply. First, the only examples with either neuter or feminine gender come from Forverts, and they make up just 4 of its 78 query results. Second, this sparsity is probably due to the fact that the lexemes in question, 'vos' and 'velkh', show no inflection in the nominative case (for 'velkh') or no inflection at all (for 'vos').

In [82]:
relative = grouped_data.filter(like = 'rel_pron', axis=0).groupby(['author']).sum()
process_data(relative)
relative[relative['total'] >= 5]

author,neut,fem,indiff,total,% fem.
Bergelson Dovid,0,0,16,16,0.0
Forverts,2,2,74,78,50.0
Lebns-Fragn,0,0,8,8,0.0
Yehoyesh,0,0,16,16,0.0
total,2,2,130,134,50.0
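How thin the basis for the 50% figure is becomes visible when the share of decidable entries is computed alongside it. The helper `decidable_share` below is hypothetical and not part of the notebook; the counts are taken from the tables in this section.

```python
def decidable_share(neut, fem, indiff):
    """Percentage of entries whose gender could be decided (hypothetical helper)."""
    total = neut + fem + indiff
    return round((neut + fem) / total * 100, 1) if total else 0.0

# (neut, fem, indiff) read off the tables in this notebook
rel_pron_total = decidable_share(2, 2, 130)
rel_pron_forverts = decidable_share(2, 2, 74)
determiner_total = decidable_share(244, 96, 0)

print(rel_pron_total, rel_pron_forverts, determiner_total)
```

For the relative pronouns only about 3% of all entries are decidable, compared with all of the determiner entries, so the 50% should not be read as a stable estimate.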


In the case of anaphoric pronouns, the picture is much clearer. All authors show a clear preference for semantic agreement, with 98.2% of pronouns bearing feminine gender. Some authors, such as Yehoyesh, use feminine gender exclusively. The possessive pronouns show the same pattern with a slightly lower percentage (97.8%), but also with fewer query results overall.

In [83]:
pronoun = grouped_data.filter(like = 'pronoun', axis=0).groupby(['author']).sum()
process_data(pronoun)
pronoun[pronoun['total'] >= 3]

author,neut,fem,indiff,total,% fem.
Berditshevski Mika Yoysef,0,3,0,3,100.0
Bergelson Dovid,1,6,0,7,85.7
Forverts,1,60,0,61,98.4
Katle Kanye,0,3,0,3,100.0
Lebns-Fragn,0,5,0,5,100.0
Manger Itsik,0,3,0,3,100.0
Sholem-Aleykhem,0,6,0,6,100.0
Yehoyesh,0,64,0,64,100.0
total,3,165,0,168,98.2


In [84]:
possessive = grouped_data.filter(like = 'possessive', axis=0).groupby(['author']).sum()
process_data(possessive)
possessive[possessive['total'] >= 3]

author,neut,fem,indiff,total,% fem.
Forverts,0,29,1,30,100.0
Lebns-Fragn,0,9,0,9,100.0
Yehoyesh,0,31,0,31,100.0
total,2,88,2,92,97.8


To assess whether the lexemes differ in how likely they are to trigger semantic agreement, the table below gives some hints. The lexeme `froyentsimer` only shows feminine gender where gender is decidable, but it appears in only 4 query results.
The lexeme `meydl` seems to have a lower preference for semantic agreement on attributives and determiners than the lexeme `vayb`. For possessive and anaphoric pronouns, such a difference cannot be observed: both lexemes seem to have a strong tendency towards semantic agreement. Interestingly, the lexeme `vaybl` occurs more often with neuter gender than its non-diminutive counterpart. This could be explained by the diminutive suffix `l`, which triggers neuter grammatical gender.

In [85]:
lex_by_goal = clean_data.drop('author',axis=1).groupby(['controller_lex', 'goal_pos']).sum()
lex_by_goal['total'] = lex_by_goal.apply(lambda x: int(x.fem+x.neut+x.indiff), axis=1)
lex_by_goal['% fem.'] = lex_by_goal.apply(lambda x: percentage(x.fem, x.neut),axis=1)
lex_by_goal

controller_lex,goal_pos,neut,fem,indiff,total,% fem.
froyentsimer,attributive,0,1,0,1,100.0
froyentsimer,possessive,0,1,1,2,100.0
froyentsimer,rel_pron,0,0,1,1,0.0
meydl,attributive,194,16,9,219,7.6
meydl,determiner,162,38,0,200,19.0
meydl,possessive,0,41,0,41,100.0
meydl,pronoun,2,47,0,49,95.9
meydl,rel_pron,2,1,83,86,33.3
vayb,attributive,48,23,375,446,32.4
vayb,determiner,60,57,0,117,48.7


As observed above, some authors, such as Yehoyesh, have a strong tendency towards semantic agreement and are also overrepresented in the corpus. Normalizing the gender counts by author might give a clearer picture. To normalize, we divide each value by the number of times the author used the lexeme and multiply the result by 1000.

In [86]:
author_freq = clean_data.groupby(['author', 'controller_lex']).size().unstack(fill_value=0)

In [87]:
def normalize(value, author, lex):
    # Exact label lookup; filter(like=...) would also match any author
    # whose name merely contains `author` as a substring
    freq = author_freq.loc[author, lex]
    return value/freq*1000 if freq != 0 else 0

In [88]:
# Work on a copy so that clean_data keeps the raw 0/1 indicators
normalized_data = clean_data.copy()
normalized_data['neut'] = normalized_data.apply(lambda x: normalize(x.neut,x.author,x.controller_lex), axis=1)
normalized_data['fem'] = normalized_data.apply(lambda x: normalize(x.fem,x.author,x.controller_lex), axis=1)
normalized_data['indiff'] = normalized_data.apply(lambda x: normalize(x.indiff,x.author,x.controller_lex), axis=1)

normalized_lex = normalized_data.drop(['author', 'goal_pos'],axis=1).groupby(['controller_lex']).sum()
process_data(normalized_lex)
normalized_lex.drop(['neut', 'fem', 'indiff', 'total'],axis=1)

controller_lex,% fem.
froyentsimer,100.0
meydl,28.9
vayb,46.7
vaybl,20.1
total,33.3


As can be seen, the percentage of feminine gender triggered by the lexeme `vayb` dropped from 67.8% to 46.7%. The percentage for the lexeme `meydl` stayed almost the same (28.4% before, 28.9% after), whereas the lexeme `vaybl` shows a lower preference for feminine gender (20.1% instead of 24.6%).
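The effect of the normalization can be seen on a small invented example in which one author contributes far more entries than the other. The sketch uses a vectorised `groupby(...).transform('count')` instead of the row-wise `normalize` function; this is an assumed equivalent formulation, and the authors and counts are made up.

```python
import pandas as pd

# Author A: 8 entries, all feminine; author B: 2 entries, all neuter
toy = pd.DataFrame({
    'author': ['A'] * 8 + ['B'] * 2,
    'controller_lex': ['vayb'] * 10,
    'fem': [1] * 8 + [0] * 2,
    'neut': [0] * 8 + [1] * 2,
})

# Raw share: every corpus entry has the same weight
raw = toy[['fem', 'neut']].sum()
raw_pct = raw['fem'] / (raw['fem'] + raw['neut']) * 100

# Normalised share: divide each entry by the author's count for the lexeme
freq = toy.groupby(['author', 'controller_lex'])['fem'].transform('count')
norm = toy[['fem', 'neut']].div(freq, axis=0) * 1000
norm_sum = norm.sum()
norm_pct = norm_sum['fem'] / (norm_sum['fem'] + norm_sum['neut']) * 100

print(round(raw_pct, 1), round(norm_pct, 1))
```

After normalization each author carries the same weight per lexeme, so the prolific author A no longer dominates the percentage. This is the same mechanism by which Yehoyesh's many `vayb` entries lose influence in the table above.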