# Building a multilingual domain-specific vocabulary

Analysing book reviews across multiple languages can shine a light on how readers express their reading experiences in different languages.

Reviewers writing in different languages use different terminology to talk about writing style, narrative, characters and plot, or about what they like and don't like. But there is probably a lot of overlap between languages in how reviewers express themselves. That is, there is a **vocabulary for talking about book aspects and reading experiences with links between different languages**.

Although we can probably come up with a set of relevant words for such a vocabulary in the languages that we are fluent in, finding similar terms in other languages is challenging. Because the **register of words** that people use to talk about reading experience probably consists of relatively common words, we should focus on **words that are shared by many reviewers, and that are typical in reviews**. If we zoom in on less common words, we are likely to find words that are specific to a particular book, author or series.

In this notebook, we extract a domain-specific vocabulary of book reviews for a range of languages, using simple statistics.

In [1]:
%reload_ext autoreload
%autoreload 2


First, we import the Python libraries that will be used.

In [62]:
# python internal libraries
import glob
import gzip
import json
import os
from ast import literal_eval
from collections import Counter
from collections import defaultdict

# external libraries
import numpy as np
import pandas as pd
import spacy
import stopwordsiso

# scripts in this directory
from analyse import read_from_doc_bins
from language import lang_code_map, spacy_model_map
from language import load_language_nlp_model

# The language code map shows the list of languages for which reviews are available.
lang_code_map

{'Arabic': 'ar',
 'Czech': 'cs',
 'Danish': 'da',
 'German': 'de',
 'Greek': 'el',
 'English': 'en',
 'Spanish': 'es',
 'Persian': 'fa',
 'Finnish': 'fi',
 'French': 'fr',
 'Hindi': 'hi',
 'Hungarian': 'hu',
 'Indonesian': 'id',
 'Italian': 'it',
 'Japanese': 'ja',
 'Korean': 'ko',
 'Dutch': 'nl',
 'Norwegian': 'no',
 'Polish': 'pl',
 'Pashto': 'ps',
 'Portuguese': 'pt',
 'Russian': 'ru',
 'Slovak': 'sk',
 'Slovenian': 'sl',
 'Serbian': 'sr',
 'Swedish': 'sv',
 'Turkish': 'tr',
 'Ukranian': 'uk',
 'Urdu': 'ur',
 'Chinese': 'zh'}

## Language and linguistics parser

In this analysis we focus on languages for which linguistic parsers are available. [SpaCy](https://spacy.io) has parser models for a range of languages. For some languages, there is no SpaCy model, but linguistic parsers and other language-specific NLP techniques are available elsewhere (e.g. [Farsi/Persian](https://github.com/Dadmatech/DadmaTools), [Arabic](https://github.com/Curated-Awesome-Lists/awesome-arabic-nlp)).

In [11]:
languages = [
    # Add languages for which you want to do linguistic parsing of reviews
    'Danish', 'Dutch', 'English', 'French', 'German', 'Italian', 
    'Japanese', 'Korean', 'Persian', 'Slovenian', 'Spanish', 'Swedish', 'Ukranian'
    # Skipping Chinese because the parser seems to reduce everything to a single character
    # 'Chinese'
]

lang_codes = sorted([lang_code_map[language] for language in languages])
lang_codes

['da', 'de', 'en', 'es', 'fa', 'fr', 'it', 'ja', 'ko', 'nl', 'sl', 'sv', 'uk']

## Load the SpaCy models

If you want to parse the reviews yourself, you need to install the various SpaCy models. If you want to use the pre-parsed reviews (in the `spacy_doc_bins` directory), you just need to load a single model, to have access to a SpaCy `vocab` instance to load the reviews from so-called [DocBin](https://spacy.io/api/docbin)s.


In [63]:
# For now we'll just load the English parser model
lang_nlp = {lang: load_language_nlp_model(lang) for lang in ['en']}

## Reading book metadata and reviews

The review data contains a book identifier, so we know which book is associated with each review. There is a separate book metadata file with more info on the books, like title, author, publication year and statistics on ratings and reviews.

In [13]:
book_meta_file = '../data/book_metadata.csv'
book_df = pd.read_csv(book_meta_file, sep='\t')
book_df['book_author'] = book_df.book_author.apply(literal_eval)
book_df['book_author'] = book_df.book_author.apply(lambda x: x[0])
book_df.head(2)

Unnamed: 0,source_file,source_url,book_id,book_title,book_description,book_author,book_author_url,genres,format,num_pages,publication_date,rating_avg,rating_count,review_count,canonical_url
0,../data/Book_language_pages/en/19288043-gone-g...,https://www.goodreads.com/en/book/show/1928804...,19288043,Gone Girl,An alternative cover edition for this ISBN can...,Gillian Flynn,['https://www.goodreads.com/author/show/2383.G...,"['Fiction', 'Mystery', 'Thriller', 'Book Club'...",Paperback,415.0,2012-05-24T00:00:00,4.14,3399892,167690,https://www.goodreads.com/book/show/19288043-g...
1,../data/Book_language_pages/en/41865.Twilight....,https://www.goodreads.com/en/book/show/41865.T...,41865,Twilight,About three things I was absolutely positive. ...,Stephenie Meyer,['https://www.goodreads.com/author/show/941441...,"['Fantasy', 'Young Adult', 'Romance', 'Fiction...",Paperback,498.0,2005-10-05T00:00:00,3.67,7211130,146232,https://www.goodreads.com/book/show/41865.Twil...


Next, we load the reviews for the languages specified above.

In [14]:
def read_reviews(review_file):
    with gzip.open(review_file, 'rt') as fh:
        return [json.loads(line) for line in fh]

review_dir = '../data/lang_reviews/'
review_files = glob.glob(os.path.join(review_dir, '*'))
review_file_map = {rf.split('lang_')[-1][:2]: rf for rf in review_files}
reviews = {}
for lang in lang_codes:
    reviews[lang] = read_reviews(review_file_map[lang])
    print(f"{len(reviews[lang])} reviews for language {lang}")

review_df = pd.DataFrame([review for lang in reviews for review in reviews[lang]])
review_df = pd.merge(review_df, book_df[['book_id', 'book_title', 'book_author']], on='book_id')

3495 reviews for language da
5825 reviews for language de
6270 reviews for language en
6254 reviews for language es
5481 reviews for language fa
5982 reviews for language fr
6128 reviews for language it
358 reviews for language ja
232 reviews for language ko
5861 reviews for language nl
764 reviews for language sl
5140 reviews for language sv
4277 reviews for language uk
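
The loader above expects each review file to be gzip-compressed text with one JSON object per line. As a minimal sketch of that layout, the round trip below writes two made-up records to a temporary file and reads them back. The file name `toy_reviews.jsonl.gz`, the helper `write_reviews` and the record contents are illustrative assumptions, not part of the data set.

```python
import gzip
import json
import os
import tempfile


def write_reviews(review_file, records):
    # One JSON object per line, written as gzip-compressed UTF-8 text.
    with gzip.open(review_file, 'wt', encoding='utf-8') as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_reviews(review_file):
    # Same logic as the reader in this notebook, with an explicit encoding.
    with gzip.open(review_file, 'rt', encoding='utf-8') as fh:
        return [json.loads(line) for line in fh]


# Hypothetical records, only to illustrate the file layout.
toy_records = [
    {'book_id': 1, 'review_lang': 'da', 'review_text': 'Meget fin'},
    {'book_id': 1, 'review_lang': 'da', 'review_text': 'En god bog'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_file = os.path.join(tmp_dir, 'toy_reviews.jsonl.gz')
    write_reviews(toy_file, toy_records)
    loaded_records = read_reviews(toy_file)

print(len(loaded_records), loaded_records[0]['review_text'])
```

Passing `encoding='utf-8'` explicitly is a defensive choice for multilingual text; the original reader relies on the platform default.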


A quick peek at the review data to know what it looks like:

In [15]:
review_df.head(2)

Unnamed: 0,review_text,user_id,review_id,review_date,shelf_status,user_shelves,rating,book_id,source_url,review_lang,book_title,book_author
0,"Meget fin, men har også følgende ting som irri...",a731b23ab9c845ad76c48c9ae0c37201af81a7294dd8f3...,8eee415ac1699108c550ba2b5804a44141c5b542dab4d9...,2025-01-26T00:00:00,Read,[],,2175,https://goodreads.com/da/book/show/2175.Madame...,da,Madame Bovary,Gustave Flaubert
1,I jagten på at lykkes med livet kæmper vi med ...,9aeb2cd6b7ccc733d6d0c373c1c8e1d815a5ab40df3e07...,d2e728a22e350c07c7030f0ab53b9cf3ec646f6e25cfb3...,2025-04-11T00:00:00,,"[100-classics-penguin, 1001-books-boxall, 488-...",4.0,2175,https://goodreads.com/da/book/show/2175.Madame...,da,Madame Bovary,Gustave Flaubert


Next, we check per book how many reviews there are in each language. For many languages, there are 30 reviews per book. This is because the first page of reviews of a book contains at most 30 reviews. We have not crawled reviews beyond the first page, so 30 is the maximum number of reviews per book/language combination in our data set, even though many books have far more reviews.

In [16]:
review_df.groupby(['book_id', 'book_title', 'book_author']).review_lang.value_counts().unstack().fillna(0.0)

Unnamed: 0_level_0,Unnamed: 1_level_0,review_lang,da,de,en,es,fa,fr,it,ja,ko,nl,sl,sv,uk
book_id,book_title,book_author,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
11,The Hitchhiker’s Guide to the Galaxy,Douglas Adams,26.0,30.0,30.0,30.0,30.0,30.0,30.0,4.0,0.0,30.0,8.0,30.0,30.0
93,Heidi,Johanna Spyri,3.0,30.0,30.0,30.0,20.0,20.0,30.0,0.0,0.0,9.0,1.0,4.0,2.0
320,One Hundred Years of Solitude,Gabriel García Márquez,30.0,30.0,30.0,30.0,30.0,30.0,30.0,2.0,1.0,30.0,5.0,30.0,30.0
343,Perfume: The Story of a Murderer,Patrick Süskind,12.0,30.0,30.0,30.0,30.0,30.0,30.0,1.0,2.0,30.0,5.0,30.0,25.0
656,War and Peace,Leo Tolstoy,19.0,30.0,30.0,30.0,30.0,30.0,30.0,1.0,0.0,30.0,3.0,25.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61439040,1984,George Orwell,30.0,30.0,30.0,30.0,30.0,30.0,30.0,1.0,7.0,30.0,10.0,30.0,30.0
77265004,The Iliad,Homer,20.0,30.0,30.0,30.0,30.0,30.0,30.0,1.0,0.0,30.0,2.0,30.0,14.0
127441416,The Diary of a Young Girl,Anne Frank,30.0,30.0,30.0,30.0,30.0,30.0,30.0,2.0,0.0,30.0,4.0,30.0,30.0
129915654,Pride and Prejudice,Jane Austen,30.0,30.0,30.0,30.0,30.0,30.0,30.0,2.0,1.0,30.0,8.0,30.0,30.0


In [17]:
# show an example review
print(json.dumps(reviews['it'][0], indent=4))

{
    "review_text": "C'EST MOI   Meravigliosa come sempre, semplicemente perfetta, Isabelle Huppert nell\u2019adattamento del 1991 firmato da Claude Chabrol. Letto un paio di volte e sempre amato. Uno dei massimi capolavori della letteratura, secondo me. Flaubert \u00e8 uno dei sommi: me lo immagino di notte, solo nella sua casa di Rouen, che sono ovviamente stato a visitare, al lume di candela, che 'recita' le parole scritte, ancora e ancora, urlandole, cancellando, limando, riscrivendo, fino a trovare la formula giusta, quella perfetta. Le mot juste.  Perch\u00e9, lui \u00e8 con la perfezione che si misurava.  E alla perfezione si \u00e8 avvicinato, e, secondo me, la perfezione ha raggiunto. Realistico, il romanzo certamente lo \u00e8: non contiene nulla che non sia esistito nella vita reale (e facilissimo da riscontrare attraverso sopralluoghi e testimonianze); e anche se sbuffa ogni tanto \"nulla in questa storia \u00e8 tratto dalla vita, \u00e8 totalmente inventata\", non c'\u00e

## Loading parsed reviews

As mentioned, the pre-parsed reviews are available in the `spacy_doc_bins` sub-directory. We load them as well, so we have access to the individual word tokens in each review, with per token the Part-Of-Speech (POS) tag and word lemma.

In [18]:


parsed_dir = '../data/spacy_doc_bins/'
parsed_reviews = {}
for lang in lang_codes:
    # The English vocab is only used to deserialise the DocBins
    parsed_reviews[lang] = read_from_doc_bins(lang, parsed_dir, lang_nlp['en'].vocab)
    print(lang, len(parsed_reviews[lang]))

da 3495
de 5825
en 6270
es 6254
fa 5481
fr 5982
it 6128
ja 358
ko 232
nl 5861
sl 764
sv 5140
uk 4277


We also load stopword lists, because our domain-specific vocabularies should contain domain-specific terms, not words that are common across all text genres in a language.

In [23]:
for lang in sorted(lang_codes):
    print(lang, len(stopwordsiso.stopwords(lang)))

da 170
de 620
en 1298
es 732
fa 799
fr 691
it 632
ja 134
ko 679
nl 413
sl 446
sv 418
uk 73


The next step is to build a list of all non-stopwords used in reviews and keep track of the number of reviews in which they occur, so we can distinguish between words used in many reviews and words used in few reviews. The latter are probably rare or too specific to a single book.

In [24]:
def filter_content_words(doc, stopwords):
    tokens = [token for token in doc if token.text not in stopwords and token.lemma_ not in stopwords]
    tokens = [token for token in tokens if token.pos_ != 'PUNCT' and token.lemma_ != ' ']
    return [token for token in tokens if len(token.text) > 2 and len(token.lemma_) > 2]

doc_freq = defaultdict(Counter)
for lang in parsed_reviews:
    stopwords = stopwordsiso.stopwords(lang)
    for doc in parsed_reviews[lang]:
        lemmas = [token.lemma_ for token in filter_content_words(doc, stopwords)]
        doc_freq[lang].update(set(lemmas))
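
The counting step can be illustrated without SpaCy. In the sketch below, each token is a plain `(text, lemma)` pair and the stopword set is made up; `toy_reviews`, `toy_stopwords` and `content_lemmas` are hypothetical names used only for this illustration.

```python
from collections import Counter

# Hypothetical mini-corpus: each review is a list of (text, lemma) pairs.
toy_reviews = [
    [('The', 'the'), ('books', 'book'), ('were', 'be'), ('great', 'great')],
    [('A', 'a'), ('great', 'great'), ('book', 'book'), ('book', 'book')],
    [('Plot', 'plot'), ('was', 'be'), ('slow', 'slow')],
]
toy_stopwords = {'the', 'a', 'be', 'was', 'were'}


def content_lemmas(review, stopwords):
    # Keep lemmas of tokens whose text and lemma are both non-stopwords
    # and longer than two characters, following filter_content_words.
    return [
        lemma for text, lemma in review
        if text not in stopwords and lemma not in stopwords
        and len(text) > 2 and len(lemma) > 2
    ]


toy_doc_freq = Counter()
for review in toy_reviews:
    # set() makes a lemma count once per review, however often it occurs.
    toy_doc_freq.update(set(content_lemmas(review, toy_stopwords)))

print(toy_doc_freq['book'], toy_doc_freq['plot'], toy_doc_freq['the'])
```

Because of the `set()` call, the second toy review contributes only one count for `book`, even though the word occurs twice in it.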
        

If we remove stopwords, what are the most common words (words occurring in most documents)?

In English:

In [25]:
doc_freq['en'].most_common(10)

[('book', 4658),
 ('read', 4318),
 ('time', 3414),
 ('story', 3184),
 ('life', 2819),
 ('love', 2675),
 ('character', 2625),
 ('write', 2385),
 ('people', 2258),
 ('feel', 1893)]

In Italian:

In [26]:
doc_freq['it'].most_common(10)

[('libro', 3197),
 ('storia', 2128),
 ('leggere', 2108),
 ('di il', 2072),
 ('romanzo', 1983),
 ('potere', 1732),
 ('personaggio', 1709),
 ('a il', 1612),
 ('lettura', 1523),
 ('venire', 1411)]

In Ukrainian:

In [27]:
doc_freq['uk'].most_common(10)

[('книга', 2079),
 ('той', 1425),
 ('історія', 1400),
 ('дуже', 1360),
 ('читати', 1349),
 ('свій', 1341),
 ('могти', 1153),
 ('життя', 1101),
 ('себе', 1065),
 ('людина', 984)]

The most common words in each language (where common is measured by the number of reviews in which they occur) are clearly related to the domain of books and reading.

### Build a vocabulary of common terms

To build an initial domain-specific vocabulary per language, we use a simple threshold: a term should occur in at least 1% of all reviews in a given language to be considered a domain term. Since we have a few thousand reviews for most languages, this corresponds to a threshold of a few dozen reviews. That ensures that terms are not specific to a single book, author or book series.

The chosen threshold of 1% is arbitrary. Depending on the total number of reviews in a language and the diversity of the books that they are associated with, you can try different thresholds. A good rule of thumb is that a threshold should be substantially higher than the maximum number of reviews for a single book (preferably also higher than the number of reviews for a single book series or author), otherwise you may get plot-specific words or the names of characters or authors that are mentioned in many reviews of the same book, series or author.

In [29]:
min_df = {lang: len(parsed_reviews[lang]) * 0.01 for lang in parsed_reviews}
vocab = {}
for lang in doc_freq:
    vocab[lang] = set([term for term in doc_freq[lang] if doc_freq[lang][term] >= min_df[lang]])
    print(f"lang: {lang}  min_df: {min_df[lang]: >5.2f}  full vocab: {len(doc_freq[lang]): >7}"
          f"  common vocab: {len(vocab[lang]): >6}")

lang: da  min_df: 34.95  full vocab:   21921  common vocab:    513
lang: de  min_df: 58.25  full vocab:   47879  common vocab:    926
lang: en  min_df: 62.70  full vocab:   59070  common vocab:   2226
lang: es  min_df: 62.54  full vocab:   43067  common vocab:   1274
lang: fa  min_df: 54.81  full vocab:   57262  common vocab:   1290
lang: fr  min_df: 59.82  full vocab:   25074  common vocab:    724
lang: it  min_df: 61.28  full vocab:   48840  common vocab:   1319
lang: ja  min_df:  3.58  full vocab:    1880  common vocab:    160
lang: ko  min_df:  2.32  full vocab:    9573  common vocab:    560
lang: nl  min_df: 58.61  full vocab:   35978  common vocab:    557
lang: sl  min_df:  7.64  full vocab:   10933  common vocab:    617
lang: sv  min_df: 51.40  full vocab:   26982  common vocab:    418
lang: uk  min_df: 42.77  full vocab:   36285  common vocab:   1145
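
The rule of thumb can be checked mechanically. The sketch below, with the hypothetical helpers `select_common_terms` and `threshold_exceeds_book_cap`, applies a fractional threshold to made-up document frequencies and then tests whether the 1% threshold exceeds the crawl cap of 30 reviews per book for three of the review counts printed above.

```python
def select_common_terms(doc_freq, num_reviews, min_fraction=0.01):
    # Keep terms that occur in at least min_fraction of all reviews.
    min_df = num_reviews * min_fraction
    return {term for term, freq in doc_freq.items() if freq >= min_df}


def threshold_exceeds_book_cap(num_reviews, max_reviews_per_book, min_fraction=0.01):
    # Rule-of-thumb check: the threshold should be higher than the largest
    # number of reviews that a single book can contribute.
    return num_reviews * min_fraction > max_reviews_per_book


# Hypothetical document frequencies for a collection of 1000 reviews.
toy_doc_freq = {'book': 700, 'story': 120, 'heathcliff': 9, 'moors': 4}
print(sorted(select_common_terms(toy_doc_freq, num_reviews=1000)))

# Review counts taken from the output above; 30 is the per-book crawl cap.
for lang, num_reviews in [('en', 6270), ('sl', 764), ('ja', 358)]:
    print(lang, threshold_exceeds_book_cap(num_reviews, max_reviews_per_book=30))
```

For the smaller collections (Japanese, Korean, Slovenian) the 1% threshold is below 30, so their common vocabularies may still contain book-specific terms and could benefit from a higher fraction.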


The SpaCy docs with the parsed version of the review contain custom metadata including the identifier of the review. In the next steps we want to sort the reviews per book, so we add the book identifier to each SpaCy document for easy reference.

In [30]:
lang_book_docs = defaultdict(lambda: defaultdict(list))

for lang in parsed_reviews:
    for review, doc in zip(reviews[lang], parsed_reviews[lang]):
        doc.user_data['book_id'] = review['book_id']
        lang_book_docs[lang][doc.user_data['book_id']].append(doc)


### Language/Term vectors

Now that we have a domain-specific vocabulary per language, we can build vectors per language and term, where each element in the vector represents the number of reviews of a given book that contain the given term in the given language. That is, for each term in a language, the vector has 209 frequencies, one for each of the 209 books.

First we make a list of all the book identifiers:

In [28]:
book_ids = sorted(set([review['book_id'] for lang in reviews for review in reviews[lang]]))

Next, we build a dictionary with language and term as key, and a list of 209 frequencies as value:

In [32]:
lang_term_freq = defaultdict(lambda: defaultdict(list))
for lang in parsed_reviews:
    if lang == 'zh':
        continue
    for book_id in book_ids:
        review_freq = Counter()
        for doc in lang_book_docs[lang][book_id]:
            common_tokens = set([token.lemma_ for token in doc if token.lemma_ in vocab[lang]])
        
            review_freq.update(common_tokens)
        for term in vocab[lang]:
            lang_term_freq[lang][term].append(review_freq[term])
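
To make the shape of these vectors concrete, here is a small sketch with three hypothetical books and a three-term vocabulary. Reviews are represented as sets of lemmas, and all names prefixed with `toy_` are illustrative assumptions.

```python
from collections import Counter

# Hypothetical data: for each book, a list of reviews given as sets of lemmas.
toy_book_reviews = {
    'book_a': [{'plot', 'character'}, {'character', 'style'}, {'plot'}],
    'book_b': [{'style'}, {'character'}],
    'book_c': [],
}
toy_vocab = ['character', 'plot', 'style']
toy_book_ids = sorted(toy_book_reviews)

toy_term_freq = {term: [] for term in toy_vocab}
for book_id in toy_book_ids:
    review_freq = Counter()
    for lemmas in toy_book_reviews[book_id]:
        # Each review contributes at most one count per vocabulary term.
        review_freq.update(lemmas & set(toy_vocab))
    for term in toy_vocab:
        # A Counter returns 0 for terms that no review of this book used.
        toy_term_freq[term].append(review_freq[term])

print(toy_term_freq)
```

Every term ends up with one count per book, in the order of `toy_book_ids`, including a zero for a book without reviews.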


Absolute frequencies are not directly comparable, because for some books there are more reviews than for others. Therefore, we want to turn the absolute frequencies into relative frequencies.

For that, we need to know how many reviews there are in total in each language for each book identifier:

In [34]:
lang_book_num_reviews = (review_df
    .groupby('review_lang')
    .book_id
    .value_counts()
    .unstack()
    .fillna(0.0))


Check how many reviews we have for each book in a given language, e.g. Italian:

In [36]:
lang_book_num_reviews.loc['it'].sort_values()

book_id
2767052       0.0
52516332      9.0
762390        9.0
17802724     14.0
6193         15.0
             ... 
17245        30.0
17690        30.0
18386        30.0
14942        30.0
239775146    30.0
Name: it, Length: 209, dtype: float64

Now we compute the relative frequency (fraction) vectors:

In [38]:
lang_term_frac = defaultdict(lambda: defaultdict(list))
for lang in lang_term_freq:
    for term in lang_term_freq[lang]:
        freq_num_reviews = zip(lang_term_freq[lang][term], lang_book_num_reviews.loc[lang])
        lang_term_frac[lang][term] = [freq / num_reviews if num_reviews > 0 else 0 for freq, num_reviews in freq_num_reviews]
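
The normalisation is a per-book division with a guard for books that have no reviews in a language. A minimal sketch, using the hypothetical helper `to_fractions` and made-up counts:

```python
def to_fractions(term_counts, num_reviews_per_book):
    # Divide each per-book count by the number of reviews for that book;
    # books without reviews get a fraction of 0 to avoid division by zero.
    return [
        count / num_reviews if num_reviews > 0 else 0.0
        for count, num_reviews in zip(term_counts, num_reviews_per_book)
    ]


toy_counts = [15, 3, 0]
toy_num_reviews = [30, 4, 0]
print(to_fractions(toy_counts, toy_num_reviews))
```

Both lists have to follow the same book order. The notebook appears to rely on `book_ids` being sorted and on the columns of `lang_book_num_reviews` being sorted by book identifier as well.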

Next, we build a matrix of the vectors, where the index of a row in the matrix corresponds to a language/term combination, and a column corresponds to a book identifier.

We make mappings from row indexes to term/language pairs and vice versa:

In [66]:
docid2term = {}
term2docid = {}

for lang in lang_term_frac:
    for term in lang_term_frac[lang]:
        docid = len(docid2term)
        docid2term[docid] = (lang, term)
        term2docid[(lang, term)] = docid

term_vecs = np.array([lang_term_frac[lang][term] for lang in lang_term_frac for term in lang_term_frac[lang]])
term_vecs.shape

(11729, 209)

There are 11,729 term/language pairs and 209 books.

To measure how similar two rows (two term/language pairs) are, we can use cosine similarity. **Note** that with the multilingual term/language - book matrix, we can easily compute the similarity of terms from different languages.

In [41]:
from sklearn.metrics.pairwise import cosine_similarity

# compute cosine similarity of the term/language vectors
term_cosim = cosine_similarity(term_vecs)

# make a dataframe with the term/language pairs as index and column labels
term_cosim = pd.DataFrame(term_cosim, columns=term2docid.keys(), index=term2docid.keys())
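
For intuition, cosine similarity can be written out by hand. The sketch below uses made-up fraction vectors over four books; the vector names and values are assumptions for illustration, and returning 0 for an all-zero vector is a convention chosen for this sketch.

```python
import math


def cosine(vec_a, vec_b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        # Convention for this sketch: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical fraction vectors over four books.
character_en = [0.8, 0.6, 0.0, 0.2]
personaje_es = [0.4, 0.3, 0.0, 0.1]  # same profile at half the rate
advice_en = [0.0, 0.0, 0.9, 0.0]     # used for a different book only

print(round(cosine(character_en, personaje_es), 3))
print(round(cosine(character_en, advice_en), 3))
```

The first pair has the same distribution over books at a different overall rate, so its similarity is close to 1. Cosine similarity ignores vector length, which is why a term that is generally less frequent in one language can still match its counterpart in another. The second pair shares no books and scores 0.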


Now we can select a column (a term/language pair) and see which rows (term/language pairs) are most similar:

In [65]:
term_cosim[('en', 'character')].sort_values(ascending=False).head(20)

en  character      1.000000
es  personaje      0.948104
fr  personnage     0.940962
it  personaggio    0.928644
es  historia       0.920066
it  romanzo        0.915456
en  story          0.913704
    feel           0.912930
it  a il           0.910778
en  read           0.909190
it  da il          0.908457
en  love           0.908316
    time           0.908042
fa  داستان         0.907022
it  di il          0.904696
fr  roman          0.901980
en  reader         0.899937
it  in il          0.898537
en  book           0.897922
    start          0.897491
Name: (en, character), dtype: float64

The terms most similar to 'character' include its equivalents in other languages, e.g. 'personaje' in Spanish, 'personnage' in French and 'personaggio' in Italian. We don't see equivalents in most other languages, which might mean that those languages have no equivalent in the common vocabulary, or that the equivalents are used in different ways or at least in different contexts.
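
One way to turn such a ranking into candidate cross-lingual links is to keep the best-scoring term per language. The sketch below does this for a handful of scores copied from the output above; `best_match_per_language` and its `min_score` parameter are hypothetical, and the top match in a language is only a candidate that shares a similar distribution over books, not necessarily a translation.

```python
# Similarity scores to ('en', 'character'), copied from the output above.
scores = {
    ('es', 'personaje'): 0.948104,
    ('fr', 'personnage'): 0.940962,
    ('it', 'personaggio'): 0.928644,
    ('es', 'historia'): 0.920066,
    ('it', 'romanzo'): 0.915456,
    ('fa', 'داستان'): 0.907022,
    ('fr', 'roman'): 0.901980,
}


def best_match_per_language(scores, min_score=0.0):
    # Keep the highest-scoring term for each language, if it reaches min_score.
    best = {}
    for (lang, term), score in scores.items():
        if score >= min_score and score > best.get(lang, ('', -1.0))[1]:
            best[lang] = (term, score)
    return best


for lang, (term, score) in sorted(best_match_per_language(scores).items()):
    print(lang, term, score)
```

Raising `min_score` trades coverage for precision: with a cut-off of 0.93 only the Spanish and French candidates remain.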

We can also restrict the similarity to terms in a specific language:

In [64]:
term_cosim.loc['it'][('en', 'character')].sort_values(ascending=False).head(20)

personaggio    0.928644
romanzo        0.915456
a il           0.910778
da il          0.908457
di il          0.904696
in il          0.898537
storia         0.892953
potere         0.889716
riuscire       0.888808
libro          0.875448
venire         0.874147
su il          0.873745
leggere        0.872140
bello          0.866218
dovere         0.865983
piacere        0.864737
lettura        0.862533
trovare        0.862480
pagina         0.861608
andare         0.859184
Name: (en, character), dtype: float64

### Book similarity

We can also compute the similarity of books (in terms of how they are discussed in reviews using the domain-specific vocabulary). The 209 columns in the matrix represent book vectors, so we can transpose the matrix and compute cosine similarities between the 209 books based on the 11,729 term/language pairs.

First, we make more interpretable book labels for each book identifier, based on the title and author:

In [44]:
book_df['book_label'] = book_df.apply(lambda row: f"{row['book_author']}--{row['book_title']}", axis=1)
book_labels = book_df.sort_values('book_id').book_label

Next, we compute the similarities and make a dataframe with the book labels as index and column names:

In [46]:
doc_sim = cosine_similarity(term_vecs.T)
doc_sim = pd.DataFrame(doc_sim, columns=book_labels, index=book_labels)
doc_sim

book_label,Douglas Adams--The Hitchhiker’s Guide to the Galaxy,Johanna Spyri--Heidi,Gabriel García Márquez--One Hundred Years of Solitude,Patrick Süskind--Perfume: The Story of a Murderer,Leo Tolstoy--War and Peace,Arthur Golden--Memoirs of a Geisha,Dan Brown--Angels & Demons,Carlos Ruiz Zafón--The Shadow of the Wind,"John Gray--Men Are from Mars, Women Are from Venus",Homer--The Odyssey,...,Elizabeth Acevedo--Clap When You Land,Alice Walker--The Color Purple,Kristin Hannah--The Four Winds,Dan Brown--The Da Vinci Code,William Shakespeare--Romeo and Juliet,George Orwell--1984,Homer--The Iliad,Anne Frank--The Diary of a Young Girl,Jane Austen--Pride and Prejudice,Haruki Murakami--Kafka on the Shore
book_label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Douglas Adams--The Hitchhiker’s Guide to the Galaxy,1.000000,0.425141,0.595284,0.547511,0.509258,0.605838,0.555832,0.590952,0.459859,0.584057,...,0.385773,0.545468,0.547437,0.610680,0.518967,0.600041,0.549197,0.560569,0.616020,0.603604
Johanna Spyri--Heidi,0.425141,1.000000,0.435338,0.405507,0.383722,0.484655,0.371903,0.415139,0.321090,0.451553,...,0.363332,0.454200,0.475592,0.413403,0.413847,0.430965,0.383356,0.444191,0.474465,0.411653
Gabriel García Márquez--One Hundred Years of Solitude,0.595284,0.435338,1.000000,0.542135,0.548669,0.624059,0.503231,0.613721,0.424733,0.620170,...,0.418765,0.568516,0.610642,0.566524,0.543886,0.613246,0.578616,0.583595,0.612866,0.628941
Patrick Süskind--Perfume: The Story of a Murderer,0.547511,0.405507,0.542135,1.000000,0.476021,0.574646,0.477668,0.536150,0.414257,0.536513,...,0.382125,0.538712,0.524336,0.547662,0.495373,0.561447,0.507439,0.508425,0.561020,0.537347
Leo Tolstoy--War and Peace,0.509258,0.383722,0.548669,0.476021,1.000000,0.536463,0.435348,0.518193,0.378716,0.551711,...,0.357725,0.490663,0.519091,0.494643,0.482098,0.550265,0.539300,0.525844,0.540609,0.528436
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
George Orwell--1984,0.600041,0.430965,0.613246,0.561447,0.550265,0.609257,0.511272,0.574745,0.458798,0.603071,...,0.396385,0.554937,0.580984,0.564238,0.527214,1.000000,0.577074,0.600834,0.617593,0.597663
Homer--The Iliad,0.549197,0.383356,0.578616,0.507439,0.539300,0.544901,0.447153,0.526643,0.399127,0.675327,...,0.360706,0.514523,0.520810,0.516263,0.538091,0.577074,1.000000,0.550546,0.553530,0.534283
Anne Frank--The Diary of a Young Girl,0.560569,0.444191,0.583595,0.508425,0.525844,0.606855,0.471544,0.536692,0.450440,0.567767,...,0.425208,0.551491,0.594324,0.537827,0.520851,0.600834,0.550546,1.000000,0.576666,0.553042
Jane Austen--Pride and Prejudice,0.616020,0.474465,0.612866,0.561020,0.540609,0.635823,0.529020,0.608426,0.469970,0.614307,...,0.440668,0.602505,0.609780,0.588025,0.585373,0.617593,0.553530,0.576666,1.000000,0.585100
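
The effect of transposing can be seen on a tiny matrix. In this sketch, three hypothetical term rows over three books are turned into three book vectors, which are then compared pairwise; the values and the term labels in the comments are made up.

```python
import math

# Hypothetical term/book matrix: one row per term, one column per book.
term_rows = [
    [0.8, 0.7, 0.0],  # 'character'
    [0.5, 0.6, 0.1],  # 'plot'
    [0.0, 0.1, 0.9],  # 'advice'
]

# Transposing turns the three term rows into three book vectors.
book_vectors = [list(column) for column in zip(*term_rows)]


def cosine(vec_a, vec_b):
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norms = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return dot / norms if norms > 0 else 0.0


# Pairwise book-by-book similarity matrix.
book_sim = [[cosine(a, b) for b in book_vectors] for a in book_vectors]
for row in book_sim:
    print([round(value, 2) for value in row])
```

The first two books are described with the same terms and end up far more similar to each other than to the third, which is mostly described with a different term.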


Now, we can pick any book label and query the dataframe to find books that are similar in terms of the domain-specific vocabulary:

In [47]:
book_label = "John Green--The Fault in Our Stars"

doc_sim[book_label].sort_values()

book_label
Michael    Connelly--The Black Echo                                                               0.338156
Agatha Christie--Murder on the Orient Express                                                     0.357973
Dale Carnegie--How to Stop Worrying and Start Living: Time-Tested Methods for Conquering Worry    0.372302
J.M. Coetzee--Life & Times of Michael K                                                           0.374236
Adania Shibli--Minor Detail                                                                       0.384830
                                                                                                    ...   
Gabriel García Márquez--Love in the Time of Cholera                                               0.653445
Jojo Moyes--Me Before You                                                                         0.668524
Haruki Murakami--Norwegian Wood                                                                   0.679925
Stephenie Meyer--Twilight 

In [48]:
book_label = "Anne Frank--The Diary of a Young Girl"

doc_sim[book_label].sort_values()

book_label
Michael    Connelly--The Black Echo                                                               0.303608
Agatha Christie--Murder on the Orient Express                                                     0.310408
Dale Carnegie--How to Stop Worrying and Start Living: Time-Tested Methods for Conquering Worry    0.370094
Charles Baudelaire--Les Fleurs du Mal                                                             0.374467
J.M. Coetzee--Life & Times of Michael K                                                           0.376204
                                                                                                    ...   
Arthur Golden--Memoirs of a Geisha                                                                0.606855
Khaled Hosseini--A Thousand Splendid Suns                                                         0.609997
Ray Bradbury--Fahrenheit 451                                                                      0.613907
John Green--The Fault in O

## Term-Document matrix

Finally, we can turn the term-document matrix into a dataframe as well, which allows us to:

- use a vocabulary term to find which books are most commonly described by that term (e.g. for which books do reviewers often mention characters, plot or writing style)
- use a book label to find which vocabulary terms are most commonly used to describe it (e.g. what terms are most typically used to describe the diary of Anne Frank).

In [68]:
term_doc_frac = pd.DataFrame(term_vecs, columns=book_labels, index=term2docid.keys())

term_doc_frac.loc[('en', 'character')].sort_values()

book_label
Rhonda Byrne--The Secret                                                                          0.000000
Anne Frank--The Diary of a Young Girl                                                             0.000000
Dale Carnegie--How to Stop Worrying and Start Living: Time-Tested Methods for Conquering Worry    0.000000
Viktor E. Frankl--Man's Search for Meaning                                                        0.000000
Kahlil Gibran--The Prophet                                                                        0.000000
                                                                                                    ...   
Victor Hugo--Les Misérables                                                                       0.766667
Chimamanda Ngozi Adichie--Half of a Yellow Sun                                                    0.800000
Chimamanda Ngozi Adichie--Americanah                                                              0.800000
Leo Tolstoy--Anna Karenina

In [78]:
book_label = "Anne Frank--The Diary of a Young Girl"

term_doc_frac[book_label].loc[['en']].sort_values(ascending=False).head(20)

en  book      0.766667
    Anne      0.700000
    read      0.700000
    diary     0.700000
    time      0.666667
    people    0.600000
    life      0.600000
    write     0.533333
    girl      0.500000
    live      0.466667
    family    0.433333
    word      0.433333
    day       0.433333
    war       0.400000
    review    0.400000
    happen    0.366667
    story     0.366667
    feel      0.366667
    heart     0.333333
    love      0.333333
Name: Anne Frank--The Diary of a Young Girl, dtype: float64
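
The two lookup directions can also be sketched without pandas. The matrix below is made up, and `books_for_term` and `terms_for_book` are hypothetical helpers that mirror selecting a row and selecting a column of the dataframe.

```python
# Hypothetical term/book fractions: rows are (language, term) pairs, columns are books.
toy_books = ['Novel A', 'Novel B', 'Self-help C']
toy_matrix = {
    ('en', 'character'): [0.8, 0.6, 0.0],
    ('en', 'advice'): [0.0, 0.1, 0.7],
    ('it', 'personaggio'): [0.7, 0.5, 0.0],
}


def books_for_term(key, top_n=2):
    # Row lookup: rank books by the fraction of reviews that use the term.
    ranked = sorted(zip(toy_books, toy_matrix[key]), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


def terms_for_book(book, lang=None, top_n=2):
    # Column lookup: rank terms (optionally of one language) for the given book.
    column = toy_books.index(book)
    rows = [(key, values[column]) for key, values in toy_matrix.items()
            if lang is None or key[0] == lang]
    return sorted(rows, key=lambda pair: pair[1], reverse=True)[:top_n]


print(books_for_term(('en', 'character')))
print(terms_for_book('Self-help C', lang='en'))
```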

## Entity Intermezzo

In [51]:
book_meta = {record['book_id']: record for record in book_df.to_dict('records')}

In [52]:
book_lang_ents = []
for lang in parsed_reviews:
    print(lang, len(reviews[lang]), len(parsed_reviews[lang]))
    for review, doc in zip(reviews[lang], parsed_reviews[lang]):
        for ent in doc.ents:
            try:
                row = {
                    'book_id': review['book_id'],
                    'book_title': book_meta[review['book_id']]['book_title'],
                    'book_author': book_meta[review['book_id']]['book_author'],
                    'lang': lang,
                    'entity_type': ent.label_,
                    'entity_name': ent.text
                }
                book_lang_ents.append(row)
            except KeyError:
                print(review)
                raise

len(book_lang_ents)

da 3495 3495
de 5825 5825
en 6270 6270
es 6254 6254
fa 5481 5481
fr 5982 5982
it 6128 6128
ja 358 358
ko 232 232
nl 5861 5861
sl 764 764
sv 5140 5140
uk 4277 4277


360268

In [53]:
book_lang_ent_df = pd.DataFrame(book_lang_ents)
book_lang_ent_df.groupby('lang').entity_type.value_counts().unstack().T.sum()

lang
da      8067.0
de     36927.0
en    132364.0
es     64985.0
fr     21377.0
it     44268.0
ja      1109.0
ko      1307.0
nl     28313.0
sl      1586.0
sv      7692.0
uk     12273.0
dtype: float64

## Continue

In [54]:
from analyse import has_person, sort_tokens_by_head

In [55]:
num_docs = sum(len(parsed_reviews[lang]) for lang in lang_codes)
# Check that the first 10 characters of the review identifiers are still unique,
# so they can serve as short document identifiers.
full_ids = [doc.user_data['review_id'] for lang in parsed_reviews for doc in parsed_reviews[lang]]
short_ids = [full_id[:10] for full_id in full_ids]
len(set(full_ids)), len(set(short_ids))

(56067, 56067)

In [56]:
from typing import List, Union


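def get_person_pos_pairs(doc, pos: Union[str, List[str]], debug: int = 0):
    """Pair each grammatical person found in a clause with the clause's tokens of the given POS tag(s)."""
    doc_id = doc.user_data['review_id'][:10]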
    if isinstance(pos, str):
        pos = {pos}
    else:
        pos = set(pos)
    person_pos_pairs = []
    for si, sent in enumerate(doc.sents):
        #print([(token.text, token.pos_) for token in sent])
        if debug > 0:
            print(f"sent: {sent}")
        head_tokens = sort_tokens_by_head(sent)
        for hi, head in enumerate(head_tokens):
            clause_id = f"{doc_id}-sent_{si}-clause_{hi}"
            person_tokens = [token for token in head_tokens[head] if has_person(token)]
            clause_persons = set([token.morph.to_dict()['Person'] for token in person_tokens])
            pos_tokens = [token for token in head_tokens[head] if token.pos_ in pos]
            if debug > 0:
                print(f"\thead: {head}")
                print(f"\tperson_tokens: {person_tokens}")
                print(f"\tclause_persons: {clause_persons}")
                print(f"\tpos_tokens: {pos_tokens}")
            for person in clause_persons:
                for token in pos_tokens:
                    person_pos_pairs.append({
                        'doc_id': doc_id, 'clause_id': clause_id, 'person': person, 
                        'term': token.text, 'lemma': token.lemma_, 'pos': token.pos_
                    })
                    if token.lemma_ == '’':
                        if debug > 0:
                            print(head_tokens[head])
    return person_pos_pairs

lang = 'es'
all_pairs = []
for lang in lang_codes:
    for di, doc in enumerate(parsed_reviews[lang]):
        person_pos_pairs = get_person_pos_pairs(doc, ['VERB', 'AUX'])
        for pair in person_pos_pairs:
            pair['lang'] = lang
        all_pairs.extend(person_pos_pairs)
        if (di+1) % 100000 == 0:
            break

person_lemma = pd.DataFrame(all_pairs)
person_lemma

Unnamed: 0,doc_id,clause_id,person,term,lemma,pos,lang
0,8eee415ac1,8eee415ac1-sent_0-clause_3,1,fjerne,fjerne,VERB,da
1,8eee415ac1,8eee415ac1-sent_0-clause_7,3,overveje,overveje,VERB,da
2,8eee415ac1,8eee415ac1-sent_0-clause_7,3,kunne,kunne,AUX,da
3,8eee415ac1,8eee415ac1-sent_0-clause_7,3,ændre,ændre,VERB,da
4,8eee415ac1,8eee415ac1-sent_1-clause_1,2,lad,lade,VERB,da
...,...,...,...,...,...,...,...
1433262,d741e8b7d1,d741e8b7d1-sent_29-clause_0,1,бачу,бачити,VERB,uk
1433263,d741e8b7d1,d741e8b7d1-sent_30-clause_1,1,думаю,думати,VERB,uk
1433264,d741e8b7d1,d741e8b7d1-sent_30-clause_1,1,спробую,спробувати,VERB,uk
1433265,d741e8b7d1,d741e8b7d1-sent_30-clause_1,1,прочитати,прочитати,VERB,uk


In [57]:
person_lemma.groupby('lang').person.value_counts().unstack()

lang,person 0,person 1,"person 1,2","person 1,3",person 2,"person 2,3",person 3
da,,10848.0,,,619.0,,13419.0
de,,30730.0,,,1315.0,,107056.0
en,,105195.0,,,21369.0,,250396.0
es,,63417.0,4.0,3.0,7735.0,6.0,185776.0
fa,,39879.0,,,10206.0,,117753.0
fr,,37814.0,,,2618.0,,77984.0
it,,52979.0,105.0,41.0,8524.0,89.0,162750.0
nl,,21823.0,,,5235.0,,28189.0
sl,,3910.0,,,673.0,,8299.0
uk,1167.0,16666.0,,,4673.0,,34002.0


In [58]:
# drop rows whose morphology lists more than one person value, e.g. '1,2'
person_lemma = person_lemma[~person_lemma.person.str.contains(',')]

In [60]:
# How often is each person category used in total?
lang_person_freq = person_lemma.groupby(['lang', 'person']).clause_id.nunique().rename('freq')
lang_person_frac = (lang_person_freq / lang_person_freq.sum()).rename('frac')
#pd.concat([person_freq, person_frac], axis=1)

lang_person_freq = lang_person_freq.unstack()
clause_freq = person_lemma.groupby('lang').clause_id.nunique()
lang_person_freq.T.div(clause_freq).T

lang,person 0,person 1,person 2,person 3
da,,0.436637,0.024675,0.609911
de,,0.220168,0.009637,0.850126
en,,0.282503,0.059082,0.745148
es,,0.257395,0.033416,0.874075
fa,,0.245682,0.050731,0.847688
fr,,0.321106,0.024593,0.784516
it,,0.236726,0.040189,0.848903
nl,,0.417745,0.105405,0.592632
sl,,0.32052,0.06007,0.792166
uk,0.021536,0.30728,0.084805,0.689077


In every language, most clauses with person-marked verbs contain third-person forms (between 59% and 87% of clauses), while only 1% to 11% contain second-person forms.
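
The fractions in each row of the table add up to slightly more than 1, because a single clause can contain verbs in more than one person category. A minimal sketch with invented clause identifiers (not taken from the review data) shows how counting unique clauses per person leads to this:

```python
from collections import defaultdict

# Hypothetical (clause_id, person) pairs for one language, invented for illustration.
pairs = [
    ("c1", "1"), ("c1", "3"),   # one clause with first- and third-person forms
    ("c2", "3"),
    ("c3", "3"), ("c3", "3"),   # a repeated person within a clause counts once
    ("c4", "2"),
]

# Collect the set of unique clauses in which each person category occurs.
clauses_per_person = defaultdict(set)
for clause_id, person in pairs:
    clauses_per_person[person].add(clause_id)

# Divide by the number of unique clauses, as in the cell above.
num_clauses = len({clause_id for clause_id, _ in pairs})
person_frac = {person: len(ids) / num_clauses
               for person, ids in clauses_per_person.items()}
print(person_frac)
print(sum(person_frac.values()))
```

Clause `c1` is counted for both person 1 and person 3, which is why the sum exceeds 1.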

In [20]:
import re

person_lemma = person_lemma[person_lemma.lemma.apply(lambda x: re.match(r"^\w+$", x) is not None)]

In [21]:
lemma_doc_freq = (person_lemma
    .groupby(['lemma'])
    .clause_id
    .nunique())

vocab = lemma_doc_freq[lemma_doc_freq >= 10].index
len(vocab)

1008

In [22]:
lemma_person_doc_freq = (person_lemma[person_lemma.lemma.isin(vocab)]
    .groupby(['lemma', 'person'])
    .clause_id
    .nunique()
    .unstack()
    .fillna(0.0))

lemma_person_doc_freq

lemma,person 1,person 2,person 3
abandonar,34.0,3.0,119.0
abarcar,3.0,1.0,55.0
abordar,9.0,1.0,113.0
aborrecer,6.0,2.0,10.0
abrazar,8.0,3.0,20.0
...,...,...,...
vivir,268.0,28.0,1012.0
volar,12.0,4.0,35.0
volcar,1.0,0.0,9.0
volver,330.0,35.0,775.0


In [23]:
lemma_person_doc_frac = lemma_person_doc_freq.div(person_freq)
lemma_person_doc_frac

lemma,person 1,person 2,person 3
abandonar,0.000942,0.000640,0.000970
abarcar,0.000083,0.000213,0.000449
abordar,0.000249,0.000213,0.000922
aborrecer,0.000166,0.000427,0.000082
abrazar,0.000222,0.000640,0.000163
...,...,...,...
vivir,0.007422,0.005973,0.008253
volar,0.000332,0.000853,0.000285
volcar,0.000028,0.000000,0.000073
volver,0.009139,0.007466,0.006320


In [27]:
_SMALL = np.power(0.1, 10)
_SMALL

1.0000000000000006e-10

In [28]:
modals = "must will would should may can could might shall".split(' ')
modals

['must', 'will', 'would', 'should', 'may', 'can', 'could', 'might', 'shall']

In [29]:
def rel_prop(p1, p2):
    p1 += _SMALL
    p2 += _SMALL
    return (p1 - p2) / p1

p1, p2 = '1', '2'

lemma_person_doc_frac[f"rel_prop_{p1}_{p2}"] = lemma_person_doc_frac.apply(lambda row: rel_prop(row[p1], row[p2]), axis=1)
lemma_person_doc_frac.sort_values('2').tail(20)

lemma,person 1,person 2,person 3,rel_prop_1_2
llegar,0.010163,0.013225,0.009762,-0.301264
pasar,0.011493,0.013225,0.011254,-0.150757
creer,0.049017,0.013439,0.012192,0.725838
gustar,0.038909,0.016212,0.018129,0.583344
dar,0.018831,0.019625,0.016604,-0.042123
decir,0.038133,0.022611,0.017289,0.407059
dejar,0.017225,0.022824,0.013488,-0.325053
sentir,0.028524,0.026664,0.010683,0.065213
ver,0.028995,0.02773,0.014337,0.043607
pensar,0.019164,0.028584,0.007943,-0.491553
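
The `rel_prop` score is asymmetric: it cannot exceed 1 when the first proportion is larger, but it has no lower bound when the second proportion is larger. The following sketch re-declares the function with an assumed smoothing constant of `1e-10` and applies it to the rounded proportions of `creer` and `pensar` from the table above, so the results only match the table up to rounding:

```python
_SMALL = 1e-10  # assumed smoothing constant, same order of magnitude as in the notebook


def rel_prop(p1: float, p2: float) -> float:
    """Relative difference between two proportions, scaled by the first one."""
    p1 += _SMALL
    p2 += _SMALL
    return (p1 - p2) / p1


# 'creer' is relatively more frequent in first person: positive score below 1.
creer = rel_prop(0.049017, 0.013439)
# 'pensar' is relatively more frequent in second person: negative score.
pensar = rel_prop(0.019164, 0.028584)
# Swapping the arguments does not simply flip the sign.
pensar_swapped = rel_prop(0.028584, 0.019164)

print(round(creer, 4), round(pensar, 4), round(pensar_swapped, 4))
```

Because of this asymmetry, scores for lemmas that favour the second category are not directly comparable in magnitude with scores for lemmas that favour the first.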


In [221]:
lemma_person_doc_frac.loc[modals]

lemma,person 1,person 2,person 3,rel_prop_1_2
must,0.006307,0.012302,0.003407,-0.950546
will,0.032197,0.092354,0.012776,-1.868397
would,0.038736,0.031696,0.015478,0.181756
should,0.009756,0.019479,0.003949,-0.99669
may,0.003913,0.016062,0.003868,-3.104706
can,0.050261,0.123964,0.016522,-1.466414
could,0.02673,0.025203,0.009124,0.057115
might,0.003841,0.01777,0.003272,-3.625868
shall,0.003055,0.002905,0.000285,0.049283


In [31]:
lemma_person_doc_frac[lemma_person_doc_frac['2'] > 0.01].sort_values('rel_prop_1_2').head(20)

lemma,person 1,person 2,person 3,rel_prop_1_2
sar,0.000305,0.010666,0.000196,-34.012012
mirar,0.001745,0.011305,0.001158,-5.480003
querer,0.021268,0.043089,0.012363,-1.025956
ir,0.030435,0.049701,0.021464,-0.633045
poder,0.049986,0.078498,0.039568,-0.570401
pensar,0.019164,0.028584,0.007943,-0.491553
dejar,0.017225,0.022824,0.013488,-0.325053
llegar,0.010163,0.013225,0.009762,-0.301264
estar,0.039214,0.048422,0.042382,-0.234817
pasar,0.011493,0.013225,0.011254,-0.150757


In [32]:
person_lemma[person_lemma.lemma == 'sar']

Unnamed: 0,doc_id,clause_id,person,term,lemma,pos
6439,9ae910b30a,9ae910b30a-sent_2-clause_0,2,Sé,sar,AUX
6441,9ae910b30a,9ae910b30a-sent_2-clause_0,1,Sé,sar,AUX
17318,67d2dc66c6,67d2dc66c6-sent_95-clause_0,2,Sé,sar,AUX
18308,3db6b06d9b,3db6b06d9b-sent_10-clause_0,2,Sé,sar,AUX
18310,3db6b06d9b,3db6b06d9b-sent_10-clause_0,3,Sé,sar,AUX
...,...,...,...,...,...,...
252456,4c3ea3afc2,4c3ea3afc2-sent_3-clause_0,2,Sé,sar,AUX
252699,ad03d914b1,ad03d914b1-sent_13-clause_0,2,Sé,sar,AUX
252702,ad03d914b1,ad03d914b1-sent_13-clause_0,1,Sé,sar,AUX
255523,3d648b94d3,3d648b94d3-sent_7-clause_0,2,Sé,sar,AUX


In [84]:
lemma_person_coll_freq = person_lemma.groupby('lemma').person.value_counts().unstack().fillna(0.0)
lemma_person_coll_freq['total'] = lemma_person_coll_freq.apply(sum, axis=1)
lemma_person_coll_freq.sort_values('total')

lemma,person 1,person 2,person 3,total
30somethe,0.0,0.0,1.0,1.0
lockdown,0.0,1.0,0.0,1.0
lobby,0.0,0.0,1.0,1.0
lob,0.0,0.0,1.0,1.0
loan,0.0,0.0,1.0,1.0
...,...,...,...,...
read,3544.0,531.0,2384.0,6459.0
can,2816.0,1454.0,2439.0,6709.0
do,5666.0,1333.0,8263.0,15262.0
have,7623.0,1283.0,12674.0,21580.0


In [85]:
lemma_person_term_freq = (person_lemma.groupby(['lemma', 'doc_id'])
                          .person
                          .value_counts()
                          .unstack()
                          .fillna(0.0))
lemma_person_term_freq['total'] = lemma_person_term_freq.apply(sum, axis=1)
lemma_person_term_freq.sort_values('total')

lemma,doc_id,person 1,person 2,person 3,total
30somethe,8684046405,0.0,0.0,1.0,1.0
offer,15fc61099d,0.0,0.0,1.0,1.0
offer,15aafb5c3c,0.0,0.0,1.0,1.0
offer,1360e59d00,0.0,0.0,1.0,1.0
offer,120455e0b8,0.0,0.0,1.0,1.0
...,...,...,...,...,...
be,81313dcc1f,20.0,1.0,131.0,152.0
be,e091705b3f,31.0,3.0,126.0,160.0
be,1300365aee,39.0,18.0,108.0,165.0
be,58ad921a7e,42.0,6.0,134.0,182.0


In [108]:
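# count each (document, person, lemma) combination once, so the counts below are document frequencies
person_lemma_doc = person_lemma[['doc_id', 'person', 'lemma']].drop_duplicates()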
# lemma_person_doc_freq = (person_lemma_doc.groupby(['lemma', 'doc_id'])
#                           .person
#                           .value_counts()
#                           .unstack()
#                           .fillna(0.0))
lemma_person_doc_freq = person_lemma_doc.groupby('lemma').person.value_counts().unstack().fillna(0.0)

lemma_person_doc_freq['total'] = (person_lemma_doc[['doc_id', 'lemma']]
    .drop_duplicates()
    .lemma
    .value_counts())
lemma_person_doc_freq.sort_values('total')



lemma,person 1,person 2,person 3,total
30somethe,0.0,0.0,1.0,1
lob,0.0,0.0,1.0,1
lobby,0.0,0.0,1.0,1
lockdown,0.0,1.0,0.0,1
longliste,0.0,0.0,1.0,1
...,...,...,...,...
can,1798.0,1038.0,1612.0,2839
read,2180.0,439.0,1676.0,2963
do,2677.0,862.0,3240.0,4019
have,3196.0,892.0,3958.0,4701


### Determining minimum document frequency

What is the minimum document frequency for a difference to be significant?

In [73]:
from itertools import combinations

import numpy as np


def resample(sample, resample_length):
    resample_index = np.random.randint(0, len(sample), resample_length)
    return sample[resample_index]


def two_sample_bootstrap_test(sample1, sample2, stat_func, num_samples: int = 1000, debug: int = 0):
    sample1 = np.array(sample1)
    sample2 = np.array(sample2)
    resampled_stats = []
    resampled_signs = []
    if debug > 0:
        print(f"sample 1: {sample1}\n")
        print(f"sample 2: {sample2}\n")
    for n in range(num_samples):
        resample1 = resample(sample1, len(sample1))
        resample2 = resample(sample2, len(sample2))
        res1 = stat_func(resample1)
        res2 = stat_func(resample2)
        sign = 1 if res1 > res2 else 0
        stat = res1 - res2
        # if res2 > 0.0:
        #     stat = res1 / res2
        #     sign = 1 if stat > 1.0 else 0
        # else:
        #     stat = np.inf if res1 > 0.0 else 0.0
        #     sign = 1 if res1 > 0.0 else 0
        if debug > 0:
            print(f"\nresample\n")
            print(resample1)
            print(resample2)
            print(res1, res2, stat, sign)
            print('------\n')
        resampled_stats.append(stat)
        resampled_signs.append(sign)
    p_value = sum(resampled_signs) / len(resampled_signs)
    return p_value, np.array([resampled_stats, resampled_signs])
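
The value returned as `p_value` is the fraction of resample pairs in which the statistic of the first sample is strictly larger than that of the second, so values close to 0 and values close to 1 both indicate a consistent difference. The sketch below illustrates this with a simplified re-implementation that uses the standard-library `random` module instead of NumPy; the function name `bootstrap_greater_fraction` and the toy samples are assumptions made for this illustration.

```python
import random
from statistics import mean


def bootstrap_greater_fraction(sample1, sample2, num_samples=1000, seed=0):
    """Fraction of resample pairs in which mean(resample1) > mean(resample2)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(num_samples):
        # resample each sample with replacement, keeping its original length
        resample1 = rng.choices(sample1, k=len(sample1))
        resample2 = rng.choices(sample2, k=len(sample2))
        if mean(resample1) > mean(resample2):
            wins += 1
    return wins / num_samples


# The first sample is always higher, so every resample pair counts as a win.
always_higher = bootstrap_greater_fraction([1, 1, 1, 1], [0, 0, 0, 0])
# Identical constant samples always tie, and ties do not count as wins.
never_higher = bootstrap_greater_fraction([0, 0, 0, 0], [0, 0, 0, 0])
print(always_higher, never_higher)
```

Because ties count as "not greater", two samples without any real difference can still produce a value near 0, which matters for very sparse samples such as rare lemmas.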





In [53]:
sample2 = np.zeros(num_docs)
alpha = 0.01
two_sided_alpha = alpha / 2

min_df = num_docs

for n in range(1, 10):
    sample1 = np.concatenate([np.ones(n), np.zeros(num_docs - n)])
    #print(sample1.mean())
    p_value, stats = two_sample_bootstrap_test(sample1, sample2, stat_func=np.mean, num_samples=10000)
    significant = p_value <= two_sided_alpha or (1 - p_value) <= two_sided_alpha
    print(n, p_value, significant)
    if significant:
        min_df = n
        break
    

1 0.6334 False
2 0.8631 False
3 0.9527 False
4 0.9831 False
5 0.9933 False
6 0.9975 True
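
The outcome above can be cross-checked analytically. Since `sample2` contains only zeros, a resample of `sample1` has the higher mean exactly when it draws at least one of the `n` ones, which happens with probability 1 - (1 - n/N)^N, roughly 1 - e^(-n), for N documents. The sketch below assumes N = 56067, the number of unique review IDs shown earlier; the result barely depends on the exact value.

```python
num_docs = 56067         # assumption: number of unique review IDs shown earlier
alpha = 0.01
two_sided_alpha = alpha / 2


def prob_at_least_one(n: int, size: int) -> float:
    """Probability that `size` draws with replacement hit at least one of n ones."""
    return 1 - (1 - n / size) ** size


# Expected fraction of resamples in which sample1 has the higher mean.
expected = {n: prob_at_least_one(n, num_docs) for n in range(1, 10)}
for n, p in expected.items():
    print(n, round(p, 4), (1 - p) <= two_sided_alpha)

# Smallest document frequency that passes the two-sided threshold.
min_df = min(n for n, p in expected.items() if (1 - p) <= two_sided_alpha)
print(min_df)
```

The expected fractions are close to the simulated ones, and the threshold is first passed at a document frequency of 6, in line with the bootstrap run.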


In [124]:
min_df = 10
vocab = lemma_person_doc_freq[lemma_person_doc_freq.total >= min_df].index
len(vocab)

1294

### Testing differences in word usage


In [186]:
person_freq['1']

55968

In [188]:
import matplotlib.pyplot as plt
import seaborn as sns

persons = ['1', '2', '3']
lemma = 'read'
rows = []


for li, lemma in enumerate(vocab):
    doc_freqs = {}
    for person in persons:
        doc_freqs[person] = np.array(lemma_person_term_freq.loc[lemma][person])
        # pad with zeros so that the sample length equals person_freq[person]
        extra = person_freq[person] - len(doc_freqs[person])
        doc_freqs[person] = np.concatenate((doc_freqs[person], np.zeros(extra)))
        #print(lemma, person, len(doc_freqs[person]))
    for p1, p2 in combinations(persons, 2):
        p_value, stats = two_sample_bootstrap_test(doc_freqs[p1], doc_freqs[p2], 
                                                   np.mean, num_samples=1000, debug=0)
        rows.append([lemma, f"p{p1}-p{p2}", p_value])
        #print(lemma, p1, p2, p_value)
        # ax = sns.kdeplot(stats[0], label = f"p{p1} - p{p2}")
    if (li+1) % 100 == 0:
        print(f"{li+1} of {len(vocab)} terms tested")

pd.DataFrame(data=rows, columns=['lemma', 'test', 'p_value'])

100 of 1294 terms tested
200 of 1294 terms tested
300 of 1294 terms tested
400 of 1294 terms tested
500 of 1294 terms tested
600 of 1294 terms tested
700 of 1294 terms tested
800 of 1294 terms tested


KeyboardInterrupt: 

In [172]:
lemma = 'absorb'
lemma = 'imagine'
lemma = 'soar'
lemma = 'abandon'
lemma = 'realize'

lemma_person_coll_freq.loc[lemma]
lemma_person_doc_freq.loc[lemma]


person
1        182.0
2         41.0
3        282.0
total    409.0
Name: realize, dtype: float64

In [166]:
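term_test = pd.DataFrame(data=rows, columns=['lemma', 'test', 'p_value'])

# two-sided test: a p_value close to 0 or close to 1 counts as significant
term_test['significant'] = term_test.p_value.apply(lambda x: x <= two_sided_alpha or (1 - x) <= two_sided_alpha)

# one-sided test: this assignment overwrites the two-sided flag above,
# so the output below only marks tests with p_value <= alpha as significant
term_test['significant'] = term_test.p_value.apply(lambda x: x <= alpha)

term_test[term_test.significant == True]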

Unnamed: 0,lemma,test,p_value,significant
1,abandon,p1-p3,0.003,True
2,abandon,p2-p3,0.000,True
5,abhor,p2-p3,0.002,True
7,absorb,p1-p3,0.003,True
8,absorb,p2-p3,0.000,True
...,...,...,...,...
3874,yearn,p1-p3,0.000,True
3875,yearn,p2-p3,0.000,True
3878,yell,p2-p3,0.000,True
3880,yield,p1-p3,0.001,True


In [168]:
term_test[(term_test.test == "p1-p2")].sort_values('p_value')

Unnamed: 0,lemma,test,p_value,significant
3201,soar,p1-p2,0.0,True
969,devise,p1-p2,0.0,True
1917,kidnap,p1-p2,0.0,True
2148,morph,p1-p2,0.0,True
951,detail,p1-p2,0.0,True
...,...,...,...,...
2634,realize,p1-p2,1.0,False
2631,realise,p1-p2,1.0,False
2628,read,p1-p2,1.0,False
2694,refer,p1-p2,1.0,False


In [13]:
from collections import defaultdict, Counter

lang = 'en'

# the total frequency of words
term_freq = Counter()
# the document frequency of words, that is, in how many reviews does a word occur?
doc_freq = Counter()
# The total number of documents/reviews
num_reviews = len(parsed_reviews[lang])

for doc in parsed_reviews[lang]:
    # list all words in the review
    terms = [token.text for token in doc]
    # ignore case, turn all terms to lowercase
    terms = [term.lower() for term in terms]
    term_freq.update(terms)
    doc_freq.update(set(terms))
    

In [17]:
print(f"total number of terms: {len(term_freq):,}\ttokens: {sum(term_freq.values()):,}")

total number of terms: 139,286	tokens: 2,979,373


In [18]:
term_freq.most_common(10)

[(',', 133851),
 ('.', 113942),
 ('the', 111651),
 ('and', 65077),
 ('of', 60654),
 ('a', 56884),
 ('to', 53771),
 ('i', 41909),
 ('is', 36210),
 ('in', 35373)]

In [19]:
doc_freq.most_common(10)

[('.', 5259),
 (',', 4776),
 ('a', 4570),
 ('the', 4428),
 ('and', 4359),
 ('to', 4288),
 ('of', 4277),
 ('i', 4186),
 ('this', 4109),
 ('in', 4102)]
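
The difference between the two counters is easiest to see on a tiny example. The sketch below uses two invented mini-reviews (not taken from the dataset) that are already tokenised and lowercased:

```python
from collections import Counter

# Two invented mini-reviews, already tokenised and lowercased.
docs = [
    ["great", "book", ",", "great", "characters"],
    ["the", "book", "was", "slow"],
]

term_freq = Counter()
doc_freq = Counter()
for terms in docs:
    term_freq.update(terms)        # counts every occurrence of a term
    doc_freq.update(set(terms))    # counts each term at most once per review

# 'great' occurs twice but in only one review; 'book' occurs once in each review.
print(term_freq["great"], doc_freq["great"])
print(term_freq["book"], doc_freq["book"])
```

A term that is repeated often by a single reviewer gets a high term frequency but a low document frequency, which is why the document frequency is the more useful count for finding terms shared by many reviewers.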

In [24]:
stopword_file = '../resources/stopwords-en.json'
with open(stopword_file, 'rt') as fh:
    stopwords = set(json.load(fh))

print(f"number of stopwords: {len(stopwords)}")

number of stopwords: 1298


In [25]:
# use min_df to filter words that are rare, e.g. occur in fewer than 10 reviews
min_df = 10
# filter stopwords and rare words
vocab = set(term for term in doc_freq if term not in stopwords and doc_freq[term] >= min_df)
print(f"original vocabulary size: {len(doc_freq)}")
print(f"filtered vocabulary size: {len(vocab)}")


original vocabulary size: 139286
filtered vocabulary size: 13099


In [26]:
pd.DataFrame([{'term': term, 'tf': term_freq[term], 'df': doc_freq[term]} for term in vocab])

Unnamed: 0,term,tf,df
0,decay,49,27
1,playing,174,147
2,«,540,134
3,لغة,19,17
4,concepto,14,12
...,...,...,...
13094,noisy,16,16
13095,label,29,26
13096,flynn,150,56
13097,indulge,22,21


In [149]:
import re
from typing import Any, Dict, Generator, List


def find_term_in_context(term: str,
                         docs: List[Dict[str, Any]],
                         max_hits: int = -1,
                         context_size: int = 3,
                         ignorecase: bool = True) -> Generator[Dict[str, Any], None, None]:
    """Find a term and its context in the review text of a list of review documents.
    The term can include a wildcard symbol (*) at the start or the end of the term, or both.

    :param term: a term to find in the review texts
    :type term: str
    :param docs: a list of review dictionaries with a 'review_text' and a 'review_id' key
    :type docs: List[Dict[str, Any]]
    :param max_hits: the maximum number of term matches to return (-1 for no limit)
    :type max_hits: int
    :param context_size: the number of words before and after each term to return as context
    :type context_size: int
    :param ignorecase: flag to indicate whether case should be ignored
    :type ignorecase: bool
    :return: a generator yielding a dictionary per occurrence of the term with its context
    :rtype: Generator[Dict[str, Any], None, None]
    """
    pre_regex = r'(\w+\W+){,' + f'{context_size}' + r'}\b('
    post_regex = r')\b(\W+\w+){,' + f'{context_size}' + '}'
    pre_width = context_size * 10
    num_contexts = 0
    match_term = term
    if term.startswith('*'):
        match_term = r'\w*' + match_term[1:]
    if term.endswith('*'):
        match_term = match_term[:-1] + r'\w*'
    for doc in docs:
        if 'review_text' not in doc or doc['review_text'] is None:
            continue
        if ignorecase:
            re_gen = re.finditer(pre_regex + match_term + post_regex, doc['review_text'], re.IGNORECASE)
        else:
            re_gen = re.finditer(pre_regex + match_term + post_regex, doc['review_text'])
        for match in re_gen:
            main = match.group(2)
            pre, post = match.group(0).split(main, 1)
            context = {
                'term': term,
                'term_match': main,
                'match_offset': match.start(),
                'pre': pre,
                'post': post,
                'context': f"{pre: >{pre_width}}{main}{post}",
                'doc_id': doc['review_id']
            }
            num_contexts += 1
            yield context
            if num_contexts == max_hits:
                return None
    return None
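
To see what the wildcard handling and the context pattern do, the sketch below rebuilds the regular expression in a small helper and applies it to a single sentence. The helper name `build_context_pattern` and the example sentence are assumptions made for this illustration; they are not part of the notebook's scripts.

```python
import re


def build_context_pattern(term: str, context_size: int = 3) -> str:
    """Hypothetical helper that mirrors the pattern built in find_term_in_context."""
    match_term = term
    # a leading or trailing '*' becomes 'any number of word characters'
    if term.startswith('*'):
        match_term = r'\w*' + match_term[1:]
    if term.endswith('*'):
        match_term = match_term[:-1] + r'\w*'
    # up to context_size words before and after the term
    pre_regex = r'(\w+\W+){,' + str(context_size) + r'}\b('
    post_regex = r')\b(\W+\w+){,' + str(context_size) + '}'
    return pre_regex + match_term + post_regex


text = "Both women are fully drawn characters, completely exposed to our critical gaze."
pattern = build_context_pattern('character*', context_size=2)
match = re.search(pattern, text, re.IGNORECASE)

matched_term = match.group(2)   # the term itself, including the wildcard part
matched_span = match.group(0)   # the term with up to two words on either side
print(matched_term)
print(matched_span)
```

Note that the term is inserted into the pattern without escaping, so characters with a special meaning in regular expressions (such as `.` or `+`) in a search term are interpreted as regex syntax.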



In [161]:

hit_count = 0
term = 'character*'
for context in find_term_in_context(term, reviews['en'], context_size=5):
    #print(context)
    print(f"{context['pre']: >40}{context['term_match']}{context['post']: <20}")
    #break


       delicate similes, but rather has characters say what the mean in
                  way that makes Emma a character who can impact the lives
                 is no sympathy for the characters… The snide style of the
                  her daughter-in-law’s character.  
 ”Wouldn’t one have the
             Both women are fully drawn characters, completely exposed to our critical
    sure, inevitable progression of the characters' becoming. Emma believes in finding
       tiny references to the principle characters. It may surprise you to
                    on the moods of the characters and tone of the novel
                 it, and hated the main character, Emma. They said, “that woman
           using quotation marks when a character speaks. This did not enhance
               good thing: it makes her character believable; it makes her seem
             Of course, when a literary character plays with fire, you just
              this short.  Our two main characters are remarkabl