# Keyness analysis per Genre

## Loading Modules and Data

In [1]:
# This reload library is just used for developing the notebook
# code and can be removed once this is stable.
%reload_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from src.topic_summary import ModelAnalyser, NurGenreMapper, ReviewExtractor

  @numba.jit()
  @numba.jit()
  @numba.jit()
  @numba.jit()


### Set paths and load functions

In [3]:
# please adjust the following paths to reflect the location of the following files in your local directory

impact_file = '../data/review-impact_matches.tsv.gz'
raw_review_data = '../data/reviews-stats.tsv.gz'
isbn_map = "../data/work-isbn-mapping.tsv"
isbn_work_id_mappings_file = "../data/work_isbn_title_genre.tsv.gz"

In [101]:
from impfic_core.map.map_genre import read_genre_file

# read review metadata
review_stats = pd.read_csv(raw_review_data, sep='\t', compression='gzip')

# read work genre mapping
work_genre = read_genre_file(isbn_work_id_mappings_file)

# merge review metadata and genre data
review_stats = pd.merge(review_stats, work_genre[['work_id', 'nur_genre']].drop_duplicates(), 
                        on='work_id', how='left')

# review professional reviews
review_stats = review_stats[review_stats.source != 'NBD_Biblion']


In [102]:
review_stats['nur_genre'] = review_stats.nur_genre.apply(lambda x: 'unknown' if pd.isna(x) else x)


### Computing Basic Review Statistics

First, we want to know the distribution of book genres across the reviews in the corpus. That is, each review reviews one book, which has one main genre label. We want to know the fraction of reviews that review books of a given genre:

In [103]:
print('number of reviews:', len(review_stats))
review_stats.nur_genre.value_counts() / len(review_stats)

number of reviews: 634614


Literary_fiction      0.305584
Non-fiction           0.149828
unknown               0.142777
Literary_thriller     0.119334
Suspense              0.104566
Other fiction         0.054745
Young_adult           0.045815
Children_fiction      0.039511
Fantasy_fiction       0.020644
Romanticism           0.009847
Historical_fiction    0.005184
Regional_fiction      0.002165
Name: nur_genre, dtype: float64

### Load custom-made classes from `topic_summary.py`

In [8]:
# this class helps to preprocess the inputs and output a genre mapping file
mapper = NurGenreMapper(isbn_map, isbn_work_id_mappings_file)

# this class produces as output the impact_reviews
extractor = ReviewExtractor(impact_file, raw_review_data)

### Prepare dataset

In [11]:
# get the mapping file which contains `work_id` and `isbn` columns. These are necessary to merge the reviews with the genre information
mapped_df = mapper.process_genre_mapping()

# remove the isbn so each work_id occurs only once
mapped_df = mapped_df[['work_id', 'nur_genre']].drop_duplicates()

# this is our impact reviews dataset:
review_impact = extractor.get_impact_reviews()

# NB. left-join is the best way to merge the files without losing data
review_impact_with_genre = pd.merge(review_impact, mapped_df, on = 'work_id', how = 'left')

Dataset consists of impact terms extracted from the reviews of books by the impact model and scored according to _affect_, _style_, _narrative_ and _reflection_. 

The _reflection_ category is not validated by the manual annotations (see Boot & Koolen 2021), so we remove all _reflection_-only matches from the dataset:

In [12]:
review_impact_with_genre['nur_genre'] = (review_impact_with_genre
                                         .nur_genre
                                         .apply(lambda x: 'unknown' if pd.isna(x) else x))

review_impact_with_genre = (review_impact_with_genre[review_impact_with_genre[['affect', 'style', 'narrative']]
                            .sum(axis=1) > 0]
                            .drop('reflection', axis=1))

In [371]:
dt = review_impact_with_genre.rename(columns={'style': 'stylistic'})
dt.head()

Unnamed: 0,work_id,review_id,affect,stylistic,narrative,impact_term,review_num_words,nur_genre
0,impfic-work-3723,impfic-review-1,1,0,0,fantastisch,185,Young_adult
1,impfic-work-3723,impfic-review-1,1,0,1,fantastisch,185,Young_adult
2,impfic-work-3723,impfic-review-1,1,0,1,spanning,185,Young_adult
3,impfic-work-36913,impfic-review-2,1,1,0,prachtig,185,Literary_fiction
4,impfic-work-31725,impfic-review-3,1,0,0,leuk,217,Children_fiction


In [372]:
dt = dt[dt.review_id.isin(review_stats.review_id)]

In [373]:
# there are 2.09 million impact matches. 
# Check that the DataFrame has the correct shape
dt.shape

(2089576, 8)

## Compute deviation of proportions

Lijffijt et al. (2014) showed that Log-Likelihood Ratio ($G^2$, Dunning 1993) and several other frequency-based bag-of-words keyness measures suffer from excessively high confidence in the estimates. Du, Dudar & Schöch compare frequency-based and dispersion-based measures for a downstream task (text classification) to show that for identifying key terms in a sub-corpus compared to the rest of the corpus, dispersion-based measures are more effective. 

Deviation of Proportions (_DP_) is a measure introduced by Stefa Gries in 2008 (see reference below) which:

1. computes the proportion of a document's size $s_i$ (in number of tokens) w.r.t the total number of tokens $S$ in a (sub-)corpus,
2. computes the proportion of the frequency of a specific token type (a given word or phrase) $v_i$ in a document w.r.t. its total frequency in the same (sub-)corpus $V$,
3. then computes the absolute difference between the two proportions per document,
4. sums the absolute differences and divides that sum by two.

DP results in a number between 0 and 1. If $DP$ is 0, the given word or phrase is perfectly equally distributed or dispersed across all documents in the (sub-)corpus. If $DP$ is (close to) 1, the occurrences of a word or phrase are concentrated in one or a few documents of the (sub-)corpus. The underlying assumption is that a word or phrase is key to a target corpus $C_t$ w.r.t. to a reference corpus $C_r$, if the word or phrase is more equally dispersed across the documents in $C_t$ than in the documents in $C_r$. To make sure that the score for minimum dispersion is 1, a normalised version $DP_{norm}$ is calculated as $DP_{norm} = \frac{DP}{1 - s_{min}}$, where $s_{min}$ is the size of the smallest document in the corpus. 

To measure the keyness of a word or phrase in a target corpus compared to a reference corpus, Du et al. (2021) introduce _Eta_, which is a variant of the _Zeta_ measure by Burrows (2006). It computes $DP_{norm}(t, C)$ for a token $t$ in target corpus $C_{t}$ and reference corpus $C_{r}$ and subtracts them: $E(t) = DP_{norm}(t, C_t) - DP_{norm}(t, C_r)$. This gives an Eta score between -1 ($t$ is maximally dispersed in $C_r$ and minimally dispersed in $C_t$) and 1 ($t$ is maximally dispersed in $C_r$ and minimally dispersed in $C_t$). A score of 0 corresponds to t having the same dispersion in $C_r$ and $C_t$. 

- Burrows, J. (2006). "All the Way Through: Testing for Authorship in Different Frequency Strata." In: Literary and Linguistic Computing, 22(1), 27–47. doi:10.1093/llc/fqi067 
- Gries, Stefan Th. (2008). “Dispersions and adjusted frequencies in corpora”. In: International Journal of Corpus Linguistics 13 (4), pp. 403–437. DOI: http://doi.org/10.1075/ijcl.13.4.02gri.
- Lijffijt, Jefrey, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila (2014). “Significance testing of word frequencies in corpora”. In: Digital Scholarship in the Humanities 31 (2), pp. 374–397. DOI: http://doi.org/10.1093/llc/fqu064.
- Du, K., Dudar, J. & Schöch, C., (2022) “Evaluation of Measures of Distinctiveness. Classification of Literary Texts on the Basis of Distinctive Words”, In: Journal of Computational Literary Studies 1(1). doi: https://doi.org/10.48694/jcls.102
- Du, K., Dudar, J., Rok, C., & Schöch, C. (2021). Zeta & eta: An exploration and evaluation of two dispersion-based measures of distinctiveness. Proceedings http://ceur-ws. org ISSN, 1613, 0073.


### Calculating Deviation of Proportions for Impact Terms in Reviews

Below, we compute deviation of proportions for the impact terms in reviews of different genres, whereby the reviews of books of a certain genre represent different a sub-corpus of the total review corpus.

#### Step 1: counting impact terms per review

In [376]:
impact = {}
impact['affect'] = (dt[dt.affect == 1]
                    .groupby(['review_id', 'nur_genre'])
                    .impact_term.value_counts()
                    .unstack()
                    .fillna(0.0).reset_index())

impact['narrative'] = (dt[dt.narrative == 1]
                       .groupby(['review_id', 'nur_genre'])
                       .impact_term.value_counts()
                       .unstack()
                       .fillna(0.0).reset_index())

impact['stylistic'] = (dt[dt.stylistic == 1]
                       .groupby(['review_id', 'nur_genre'])
                       .impact_term.value_counts()
                       .unstack()
                       .fillna(0.0).reset_index())

for it in impact:
    impact[it] = pd.merge(impact[it], review_stats[['review_id', 'nur_genre']], on=['review_id', 'nur_genre'], how='right').fillna(0.0).set_index('review_id')

impact['narrative']

Unnamed: 0_level_0,nur_genre,(ik|je) (hoopte|hoopt),(ik|je|lezer) (voelt|voelde),(in).+(één|een|1).+(adem|avond|dag|keer|middag|ruk|stuk|zucht).+(gelezen|uitgelezen|uit),(laat|liet|lieten).+(mij|me|je|lezer).+(niet).+(los),(neem*|nam).+(je|me|lezer|ons).+(mee),(spreekt|spreken|sprak|spraken).+(me).+(aan),(zien|ziet|zag).+(voor (me|mij|je)),aangrijpend,als een trein,...,verbeelding,verdriet,verplaatsen,verrassen,verrassend,verrassing,verrast,verslavend,voelbaar,wegleggen
review_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
impfic-review-36635,unknown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
impfic-review-36636,Other fiction,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
impfic-review-36637,Other fiction,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
impfic-review-36638,Other fiction,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
impfic-review-36639,Suspense,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
impfic-review-671241,Non-fiction,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
impfic-review-671242,Literary_fiction,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
impfic-review-671243,Literary_fiction,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
impfic-review-671244,Non-fiction,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [377]:
genres = impact['affect'].nur_genre.unique()
genres

array(['unknown', 'Other fiction', 'Suspense', 'Literary_thriller',
       'Children_fiction', 'Literary_fiction', 'Romanticism',
       'Historical_fiction', 'Non-fiction', 'Young_adult',
       'Fantasy_fiction', 'Regional_fiction'], dtype=object)

Each type of impact has a different number of impact terms:

In [382]:
impact_terms = {it: [col for col in impact[it].columns if col != 'nur_genre'] for it in impact}
for it in impact_terms:
    print(it, len(impact_terms[it]))

affect 149
narrative 85
stylistic 56


#### Step 2: Computing Total Frequencies

To calculate proportions of impact terms frequencies, we need to the total number of occurrences of each impact term.


In [383]:
impact_term_freq = {it: impact[it][impact_terms[it]].sum() for it in impact}
impact_term_freq['stylistic'].sort_values().head(20)

overtuigend                                        9.0
intrigerend                                      171.0
(spreekt|spreken|sprak|spraken).+(me).+(aan)     291.0
opmerkelijk                                      415.0
geniaal                                          455.0
aantrekkelijk                                    464.0
pakkend                                          474.0
gevoelig                                         681.0
meesterlijk                                      710.0
apart                                            873.0
intens                                           880.0
treffend                                         969.0
droog                                            972.0
intrigeren                                      1021.0
uniek                                           1025.0
scherp                                          1118.0
aangenaam                                       1120.0
perfect                                         1441.0
meeslepend

We also need the total frequency of impact terms per genre.

In [385]:
from collections import defaultdict

genres = temp.nur_genre.unique()

genre_impact = defaultdict(dict)
genre_impact_totals = defaultdict(dict)

for it in impact:
    for genre in genres:
        genre_impact[it][genre] = impact[it][impact[it].nur_genre == genre].drop('nur_genre', axis=1)
        genre_impact_totals[it][genre] = genre_impact[it][genre].sum()

In [386]:
genre_impact_totals['stylistic']['Regional_fiction'].sort_values()

opmerkelijk                                       0.0
aantrekkelijk                                     0.0
overtuigend                                       0.0
intrigerend                                       0.0
intrigeren                                        0.0
droog                                             0.0
gevoelig                                          0.0
geniaal                                           0.0
(spreekt|spreken|sprak|spraken).+(me).+(aan)      1.0
meesterlijk                                       1.0
intens                                            1.0
perfect                                           1.0
gelachen                                          1.0
ontzettend                                        1.0
scherp                                            1.0
apart                                             1.0
meeslepend                                        2.0
origineel                                         2.0
treffend                    

#### Step 3: calculating proportions

Next, we calculate the proportions of impact terms per review (w.r.t. the total frequency of the impact term in a genre).

In [388]:
genre_impact_prop = defaultdict(dict)

for it in impact:
    for genre in genres:
        genre_impact_prop[it][genre] = (genre_impact[it][genre]
                                       .div(genre_impact_totals[it][genre])
                                       .fillna(0.0))



In [389]:
it = 'narrative'
genre = 'Regional_fiction'
genre_impact_prop[it][genre].sort_values(impact_term)


Unnamed: 0_level_0,(ik|je) (hoopte|hoopt),(ik|je|lezer) (voelt|voelde),(in).+(één|een|1).+(adem|avond|dag|keer|middag|ruk|stuk|zucht).+(gelezen|uitgelezen|uit),(laat|liet|lieten).+(mij|me|je|lezer).+(niet).+(los),(neem*|nam).+(je|me|lezer|ons).+(mee),(spreekt|spreken|sprak|spraken).+(me).+(aan),(zien|ziet|zag).+(voor (me|mij|je)),aangrijpend,als een trein,apart,...,verbeelding,verdriet,verplaatsen,verrassen,verrassend,verrassing,verrast,verslavend,voelbaar,wegleggen
review_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
impfic-review-59928,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0
impfic-review-464463,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
impfic-review-464461,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
impfic-review-464417,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
impfic-review-464406,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
impfic-review-167944,0.0,0.027027,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
impfic-review-195775,0.0,0.027027,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
impfic-review-355656,0.0,0.027027,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
impfic-review-183276,0.0,0.027027,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


And we need the proportion of the size of each review w.r.t. the total size of reviews in a genre.

In [391]:
review_num_words = (review_stats[['review_id', 'nur_genre', 'review_num_words']]
                    .set_index('review_id'))

genre_reviews = {}
genre_num_words = defaultdict(dict)
genre_size_prop = defaultdict(dict)

for it in impact:
    for genre in genres:
        genre_reviews[genre] = review_num_words[review_num_words.nur_genre == genre] 
        genre_num_words[it][genre] = genre_reviews[genre].review_num_words.sum()
        genre_size_prop[it][genre] = genre_reviews[genre].review_num_words.div(genre_num_words[it][genre])


it = 'narrative'
genre = 'Young_adult'

print(genre)
genre_size_prop[it][genre]
#genre_num_words[genre]
#genre_reviews[genre]

Young_adult


review_id
impfic-review-36852     0.000080
impfic-review-36853     0.000044
impfic-review-36854     0.000105
impfic-review-36855     0.000056
impfic-review-36856     0.000043
                          ...   
impfic-review-671020    0.000087
impfic-review-671032    0.000084
impfic-review-671071    0.000075
impfic-review-671072    0.000077
impfic-review-671094    0.000070
Name: review_num_words, Length: 29075, dtype: float64

Check that the two sets have the same number of proportions per genre:

In [392]:
for it in impact:
    for genre in genres:
        assert genre_size_prop[it][genre].shape[0] == genre_impact_prop[it][genre].shape[0]

#### Step 4: Calculating Deviations

In [394]:
genre_deviations = defaultdict(dict)

for it in impact:
    for genre in genres:
        genre_deviations[it][genre] = []
        for impact_term in impact_terms[it]:
            impact_dev = genre_size_prop[it][genre].sub(genre_impact_prop[it][genre][impact_term]).abs()
            genre_deviations[it][genre].append(impact_dev.rename(impact_term))
        genre_deviations[it][genre] = pd.concat(genre_deviations[it][genre], axis=1)

it = 'stylistic'
genre = 'Young_adult'

genre_deviations[it][genre]

Unnamed: 0_level_0,(spreekt|spreken|sprak|spraken).+(me).+(aan),aangenaam,aantrekkelijk,apart,beschreven,beschrijven,bijzonder,boeien,boeiend,droog,...,prettig,scherp,schitterend,schrijfstijl,stijl,subtiel,taalgebruik,treffend,uniek,zinnen
review_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
impfic-review-36852,0.000080,0.000080,0.000080,0.000080,0.000080,0.000080,0.000080,0.000080,0.000080,0.000080,...,0.000080,0.000080,0.000080,0.000135,0.000080,0.000080,0.000080,0.000080,0.000080,0.000080
impfic-review-36853,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044,...,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044
impfic-review-36854,0.000105,0.000105,0.000105,0.000105,0.000105,0.000105,0.000105,0.000105,0.000105,0.000105,...,0.000105,0.000105,0.000105,0.000105,0.000105,0.000105,0.000105,0.000105,0.000105,0.000105
impfic-review-36855,0.000056,0.000056,0.000056,0.000056,0.000056,0.000056,0.000056,0.000056,0.000056,0.000056,...,0.000056,0.000056,0.000056,0.000056,0.000056,0.000056,0.000056,0.000056,0.000056,0.000056
impfic-review-36856,0.000043,0.000043,0.000043,0.000043,0.000043,0.000043,0.000043,0.000043,0.000043,0.000043,...,0.000043,0.000043,0.000043,0.000173,0.000043,0.000043,0.000043,0.000043,0.000043,0.000043
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
impfic-review-671020,0.000087,0.000087,0.000087,0.000087,0.000087,0.000087,0.003686,0.000087,0.000087,0.000087,...,0.000087,0.000087,0.000087,0.000128,0.000087,0.000087,0.000087,0.000087,0.000087,0.000801
impfic-review-671032,0.000084,0.000084,0.000084,0.000084,0.000084,0.000084,0.000084,0.000084,0.000084,0.000084,...,0.000084,0.000084,0.000084,0.000131,0.000084,0.000084,0.000084,0.000084,0.000084,0.000084
impfic-review-671071,0.000075,0.000075,0.000075,0.000075,0.000075,0.000075,0.000075,0.000075,0.000075,0.000075,...,0.000075,0.000075,0.000075,0.000075,0.000075,0.000075,0.000075,0.000075,0.000075,0.000075
impfic-review-671072,0.000077,0.000077,0.000077,0.000077,0.000077,0.000077,0.000077,0.000077,0.000077,0.000077,...,0.000077,0.000077,0.000077,0.000077,0.000077,0.000077,0.000077,0.000077,0.000077,0.000077


There are impact terms that do not occur in any reviews of a particular genre:

In [396]:
it = 'stylistic'
for genre in genres:
    s = genre_impact_prop[it][genre].sum()
    if len(s[s==0]) == 0:
        continue
    print(f'\n--------\n{genre}\n')
    print(s[s == 0])


--------
Literary_thriller

overtuigend    0.0
dtype: float64

--------
Children_fiction

overtuigend    0.0
dtype: float64

--------
Romanticism

intrigeren     0.0
intrigerend    0.0
overtuigend    0.0
dtype: float64

--------
Historical_fiction

intrigerend    0.0
overtuigend    0.0
dtype: float64

--------
Young_adult

intrigerend    0.0
overtuigend    0.0
dtype: float64

--------
Fantasy_fiction

overtuigend    0.0
dtype: float64

--------
Regional_fiction

aantrekkelijk    0.0
droog            0.0
geniaal          0.0
gevoelig         0.0
intrigeren       0.0
intrigerend      0.0
opmerkelijk      0.0
overtuigend      0.0
dtype: float64


For impact terms that do not occur in reviews of certain genres, the deviation distribution completely matches the size distribution, which sums to 1. As a consequence, the $DP$ score is $\frac{1}{2} = 0.5$. This is a situation that is not mentioned in Gries (2008). A potential problem is that, for keywords that do not occur in most reviews (as is the case here with impact terms), the $DP$ score for an impact term $t_i$ might be higher for a genre where it _does_ occur than for a genre where it _does not_ occur. 

In [398]:
genre_data = {
    'impact_type': [],
    'genre': [],
    'impact_term': [],
    'DP': []
}

for it in impact:
    for genre in genres:
        for impact_term in impact_terms[it]:
            genre_data['impact_type'].append(it)
            genre_data['genre'].append(genre)
            genre_data['impact_term'].append(impact_term)
            genre_data['DP'].append(genre_deviations[it][genre][impact_term].sum() / 2)

genre_dp = pd.DataFrame(data=genre_data)
genre_dp.sort_values('DP')

Unnamed: 0,impact_type,genre,impact_term,DP
1739,affect,Regional_fiction,opmerkelijk,0.500000
2752,narrative,Regional_fiction,gruwelijk,0.500000
2749,narrative,Regional_fiction,geniaal,0.500000
2746,narrative,Regional_fiction,gegrepen,0.500000
2719,narrative,Fantasy_fiction,verrast,0.500000
...,...,...,...,...
289,affect,Other fiction,verrast,0.999990
1436,affect,Young_adult,onderhoudend,0.999991
1294,affect,Non-fiction,overtuigend,0.999993
3297,stylistic,Non-fiction,overtuigend,0.999993


Next, we add impact term frequencies of the total collection and per genres, as well as the number of reviews per genre, so we can see how $DP$ scores relate to characteristics of terms and genres.

In [404]:
genre_num_reviews = review_stats.nur_genre.value_counts()

genre_dp['impact_term_freq'] = (genre_dp.apply(lambda row: 
                                               impact_term_freq[row['impact_type']][row['impact_term']], 
                                               axis=1))
genre_dp['genre_num_reviews'] = genre_dp.genre.apply(lambda genre: genre_num_reviews[genre])
genre_dp['genre_term_freq'] = (genre_dp
                               .apply(lambda row: genre_totals[row['genre']][row['impact_term']], axis=1))
genre_dp[genre_dp.impact_term_freq > 100].sort_values('DP').head(50)


Unnamed: 0,impact_type,genre,impact_term,DP,impact_term_freq,genre_num_reviews,genre_term_freq
1739,affect,Regional_fiction,opmerkelijk,0.5,415.0,1374,0.0
2454,narrative,Historical_fiction,treffend,0.5,243.0,3290,15.0
1758,affect,Regional_fiction,stil van,0.5,776.0,1374,0.0
1753,affect,Regional_fiction,schokkend,0.5,421.0,1374,0.0
3226,stylistic,Historical_fiction,intrigerend,0.5,171.0,3290,15.0
1694,affect,Regional_fiction,gruwelijk,0.5,12429.0,1374,0.0
1684,affect,Regional_fiction,geniaal,0.5,4520.0,1374,0.0
3426,stylistic,Regional_fiction,aantrekkelijk,0.5,464.0,1374,0.0
1651,affect,Regional_fiction,aantrekkelijk,0.5,464.0,1374,0.0
2749,narrative,Regional_fiction,geniaal,0.5,805.0,1374,0.0


If we filter on term frequencies per genre above zero, we see another relationship between $DP$ and frequency, namely, that the highest $DP$s occur for genres where a term occurs only once: 

In [405]:
genre_dp[genre_dp.genre_term_freq > 0].sort_values('DP')

Unnamed: 0,impact_type,genre,impact_term,DP,impact_term_freq,genre_num_reviews,genre_term_freq
1845,narrative,unknown,overtuigend,0.500000,8.0,90608,1.0
3433,stylistic,Regional_fiction,droog,0.500000,972.0,1374,2.0
3439,stylistic,Regional_fiction,gevoelig,0.500000,681.0,1374,8.0
2762,narrative,Regional_fiction,intens,0.500000,417.0,1374,2.0
3409,stylistic,Fantasy_fiction,overtuigend,0.500000,9.0,13101,1.0
...,...,...,...,...,...,...,...
289,affect,Other fiction,verrast,0.999990,18.0,34742,1.0
1436,affect,Young_adult,onderhoudend,0.999991,308.0,29075,1.0
3297,stylistic,Non-fiction,overtuigend,0.999993,9.0,95083,1.0
1294,affect,Non-fiction,overtuigend,0.999993,17.0,95083,1.0


In [355]:
(genre_dp.groupby('impact_term').DP.max() - genre_dp.groupby('impact_term').DP.min()).sort_values().tail(20)



impact_term
romantisch       0.152192
grappig          0.158421
schrijfstijl     0.230604
leuk             0.241310
spannend         0.317034
spanning         0.331207
gruwelijk        0.495249
geniaal          0.498298
aantrekkelijk    0.498997
stil van         0.499631
schokkend        0.499816
opmerkelijk      0.499838
zucht dicht      0.499932
geboeid          0.499950
gegrepen         0.499986
meegezogen       0.499990
verrast          0.499990
onderhoudend     0.499991
overtuigend      0.499993
genoten          0.499999
Name: DP, dtype: float64

In [356]:
impact_term = 'schrijfstijl' # more romance, regional, YA, suspense, lit. thriller, fantasy
impact_term = 'wegleggen' # more suspsense, lit. thriller,  regional, YA
impact_term = 'pakkend' # balanced
impact_term = 'onderhoudend' # not in children's fiction
impact_term = 'overtuigend' # not in YA, children's, regional, romance, historical, lit. thriller
impact_term = 'gruwelijk' # mainly regional
impact_term = 'stil van' # not used in regional fiction, rare in all genres
impact_term = 'schrijfstijl' # mainly NOT non-fiction (concentrated in small number of reviews)
impact_term = 'mooi' # mainly regional, historical, romance, less suspsen, thriller
impact_term = 'meeslepend' # balanced, most dispersed in historical

genre_dp[genre_dp.DP == 0.5]
selected_impact_dp = genre_dp[genre_dp.impact_term == impact_term].reset_index().drop('index', axis=1)
selected_impact_dp.sort_values('DP')

Unnamed: 0,genre,impact_term,DP,impact_term_freq,genre_num_reviews,genre_term_freq
7,Historical_fiction,meeslepend,0.964551,8266.0,3290,134.0
11,Regional_fiction,meeslepend,0.977284,8266.0,1374,24.0
5,Literary_fiction,meeslepend,0.980174,8266.0,193928,3913.0
10,Fantasy_fiction,meeslepend,0.984197,8266.0,13101,266.0
0,unknown,meeslepend,0.987248,8266.0,90608,757.0
3,Literary_thriller,meeslepend,0.98766,8266.0,75731,934.0
9,Young_adult,meeslepend,0.987987,8266.0,29075,380.0
6,Romanticism,meeslepend,0.988098,8266.0,6249,82.0
1,Other fiction,meeslepend,0.988203,8266.0,34742,425.0
2,Suspense,meeslepend,0.98824,8266.0,66359,870.0


In [357]:
arr = abs(selected_impact_dp.DP.values - selected_impact_dp.DP.values[:, None])

pd.concat((selected_impact_dp.genre, pd.DataFrame(arr, columns=selected_impact_dp.genre)), axis=1)



Unnamed: 0,genre,unknown,Other fiction,Suspense,Literary_thriller,Children_fiction,Literary_fiction,Romanticism,Historical_fiction,Non-fiction,Young_adult,Fantasy_fiction,Regional_fiction
0,unknown,0.0,0.000955,0.000992,0.000413,0.002382,0.007074,0.00085,0.022697,0.009208,0.000739,0.003051,0.009963
1,Other fiction,0.000955,0.0,3.7e-05,0.000542,0.001427,0.008029,0.000105,0.023652,0.008253,0.000216,0.004006,0.010918
2,Suspense,0.000992,3.7e-05,0.0,0.00058,0.00139,0.008066,0.000142,0.023689,0.008216,0.000253,0.004043,0.010956
3,Literary_thriller,0.000413,0.000542,0.00058,0.0,0.00197,0.007487,0.000437,0.02311,0.008796,0.000327,0.003463,0.010376
4,Children_fiction,0.002382,0.001427,0.00139,0.00197,0.0,0.009456,0.001532,0.025079,0.006826,0.001643,0.005433,0.012346
5,Literary_fiction,0.007074,0.008029,0.008066,0.007487,0.009456,0.0,0.007924,0.015623,0.016282,0.007813,0.004023,0.002889
6,Romanticism,0.00085,0.000105,0.000142,0.000437,0.001532,0.007924,0.0,0.023547,0.008358,0.000111,0.003901,0.010813
7,Historical_fiction,0.022697,0.023652,0.023689,0.02311,0.025079,0.015623,0.023547,0.0,0.031905,0.023436,0.019646,0.012734
8,Non-fiction,0.009208,0.008253,0.008216,0.008796,0.006826,0.016282,0.008358,0.031905,0.0,0.008469,0.012259,0.019172
9,Young_adult,0.000739,0.000216,0.000253,0.000327,0.001643,0.007813,0.000111,0.023436,0.008469,0.0,0.00379,0.010703


In [358]:
genre_dp[genre_dp['genre'] == genre].DP.sort_values()
genre_dp.groupby('genre').DP.min()

genre
Children_fiction      0.5
Fantasy_fiction       0.5
Historical_fiction    0.5
Literary_fiction      0.5
Literary_thriller     0.5
Non-fiction           0.5
Other fiction         0.5
Regional_fiction      0.5
Romanticism           0.5
Suspense              0.5
Young_adult           0.5
unknown               0.5
Name: DP, dtype: float64