# Building a multilingual domain-specific vocabulary

Analysing book reviews across multiple languages can shine a light on how readers express their reading experiences in different languages.

Reviewers writing in different languages use different terminology to talk about writing style, narrative, characters and plot, or about what they like and dislike. But there is probably a lot of overlap between languages in how reviewers express themselves. That is, there is a **vocabulary for talking about book aspects and reading experiences with links between different languages**.

Although we can probably come up with a set of relevant words for such a vocabulary in the languages that we are fluent in, finding similar terms in different languages is challenging. Because the **register of words** that people use to talk about reading experience probably consists of relatively common words, we should focus on **words that are shared by many reviewers, and that are typical in reviews**. If we zoom in on less common words, we are likely to find words that are specific to a single book, author or series.

In this notebook, we extract a domain-specific vocabulary of book reviews for a range of languages, using simple statistics.

First, we install the required packages and import a set of Python libraries that will be used.

In [1]:
import sys

!{sys.executable} -m pip install stopwordsiso


In [2]:
# python internal libraries
import glob
import gzip
import json
import os
from ast import literal_eval
from collections import Counter
from collections import defaultdict

# external libraries
import numpy as np
import pandas as pd
import spacy
import stopwordsiso

# scripts in this directory

review_json_urls = {
    "ar": "https://surfdrive.surf.nl/public.php/dav/files/6EjpYBae65JKq35",
    "da": "https://surfdrive.surf.nl/public.php/dav/files/QyRbNXRjKH5qAoF",
    "de": "https://surfdrive.surf.nl/public.php/dav/files/qs3iD76PXpDZjdL",
    "en": "https://surfdrive.surf.nl/public.php/dav/files/FDWCKim3mQAn592",
    "es": "https://surfdrive.surf.nl/public.php/dav/files/7wYxiRNfdtyH6kd",
    "fa": "https://surfdrive.surf.nl/public.php/dav/files/mKYKiMRTMWesFDX",
    "fr": "https://surfdrive.surf.nl/public.php/dav/files/PXnYeBFgXb4iHpN",
    "it": "https://surfdrive.surf.nl/public.php/dav/files/Bpmc9Nf5Re2s5Eq",
    "ja": "https://surfdrive.surf.nl/public.php/dav/files/BceKXZ4QiM5G8mc",
    "ko": "https://surfdrive.surf.nl/public.php/dav/files/2o6P6aq4csEWr3g",
    "nl": "https://surfdrive.surf.nl/public.php/dav/files/GHtmzgGaTJwiRaR",
    "pl": "https://surfdrive.surf.nl/public.php/dav/files/eBe3XPzgm8sBzy3",
    "pt": "https://surfdrive.surf.nl/public.php/dav/files/MBazLdF5doNm9Rt",
    "sl": "https://surfdrive.surf.nl/public.php/dav/files/y3ZdqAwTfSPC4Ay",
    "sv": "https://surfdrive.surf.nl/public.php/dav/files/qRr9SbbtgLP2R9n",
    "tr": "https://surfdrive.surf.nl/public.php/dav/files/2rmd6M5EcBrKKTD",
    "uk": "https://surfdrive.surf.nl/public.php/dav/files/KTDMfpnKYP8LQDP"
}
spacy_doc_bin_urls = {
    "da": "https://surfdrive.surf.nl/public.php/dav/files/ewCq2dwEF5AR9B9",
    "de": "https://surfdrive.surf.nl/public.php/dav/files/8or8rxeiZjEZsLt",
    "en": "https://surfdrive.surf.nl/public.php/dav/files/zyCbNoSLLHrCkij",
    "es": "https://surfdrive.surf.nl/public.php/dav/files/GFk53JrcYSg3Gk3",
    "fa": "https://surfdrive.surf.nl/public.php/dav/files/yH6krzA8YoELSiR",
    "fr": "https://surfdrive.surf.nl/public.php/dav/files/nxqsbtWHz5BQASJ",
    "it": "https://surfdrive.surf.nl/public.php/dav/files/qatCX5Af6RkCX5E",
    "ja": "https://surfdrive.surf.nl/public.php/dav/files/dBrkkpt2g29cwNS",
    "ko": "https://surfdrive.surf.nl/public.php/dav/files/Bbrb2xaEDj6zndm",
    "nl": "https://surfdrive.surf.nl/public.php/dav/files/CFCkFWdHs27GsNE",
    "pl": "https://surfdrive.surf.nl/public.php/dav/files/kDxtrb3Tp4aPezP",
    "pt": "https://surfdrive.surf.nl/public.php/dav/files/nwkTAsTedbBiYyH",
    "sl": "https://surfdrive.surf.nl/public.php/dav/files/gYDJTB54CkaQob2",
    "sv": "https://surfdrive.surf.nl/public.php/dav/files/QNYX5dj2LigPDof",
    "uk": "https://surfdrive.surf.nl/public.php/dav/files/BCRg3SnZsa3deYT"
}

# The language code map shows the list of languages for which reviews are available.
code_lang_map = {
    'ar': 'Arabic',
    'cs': 'Czech',
    'da': 'Danish',
    'de': 'German',
    'el': 'Greek',
    'en': 'English',
    'es': 'Spanish',
    'fa': 'Persian',
    'fi': 'Finnish',
    'fr': 'French',
    'hi': 'Hindi',
    'hu': 'Hungarian',
    'id': 'Indonesian',
    'it': 'Italian',
    'ja': 'Japanese',
    'ko': 'Korean',
    'nl': 'Dutch',
    'no': 'Norwegian',
    'pl': 'Polish',
    'ps': 'Pashto',
    'pt': 'Portuguese',
    'ru': 'Russian',
    'sk': 'Slovak',
    'sl': 'Slovenian',
    'sr': 'Serbian',
    'sv': 'Swedish',
    'tr': 'Turkish',
    'uk': 'Ukranian',
    'ur': 'Urdu',
    'zh': 'Chinese' # (macro-language label)
}

lang_code_map = {lang: code for code, lang in code_lang_map.items()}




## Language and linguistics parser

In this analysis we focus on languages for which linguistic parsers are available. [SpaCy](https://spacy.io) has parser models for a range of languages. For some languages, there is no SpaCy model, but linguistic parsers and other language-specific NLP techniques are available elsewhere (e.g. [Farsi/Persian](https://github.com/Dadmatech/DadmaTools), [Arabic](https://github.com/Curated-Awesome-Lists/awesome-arabic-nlp)).

In [3]:
languages = [
    # Add languages for which you want to do linguistic parsing of reviews
    'Danish', 'Dutch', 'English', 'French', 'German', 'Italian', 
    'Japanese', 'Korean', 'Persian', 'Polish', 'Portuguese', 'Slovenian', 'Spanish', 'Swedish', 'Ukranian'
    # Skipping Chinese because the parser seems to reduce everything to a single character
    # 'Chinese'
]

lang_codes = sorted([lang_code_map[language] for language in languages])
lang_codes

['da',
 'de',
 'en',
 'es',
 'fa',
 'fr',
 'it',
 'ja',
 'ko',
 'nl',
 'pl',
 'pt',
 'sl',
 'sv',
 'uk']

## Load the SpaCy models

If you want to parse the reviews yourself, you need to install the various SpaCy models. If you want to use the pre-parsed reviews (in the `spacy_doc_bins` directory), you just need to load a single model, to have access to a SpaCy `vocab` instance to load the reviews from so-called [DocBin](https://spacy.io/api/docbin)s.


In [4]:
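# For now we'll just load the English parser model.
# Models are stored per language code, because the vocab is accessed as lang_nlp['en'].vocab below.
lang_nlp = {'en': spacy.load('en_core_web_lg')}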

## Reading book metadata and reviews

The review data contains a book identifier, so we know which book is associated with each review. There is a separate book metadata file with more info on the books, like title, author, publication year and statistics on ratings and reviews.

In [5]:
import requests

import pandas as pd


# the URL for the book metadata
book_meta_url = "https://surfdrive.surf.nl/public.php/dav/files/bN8qFtH4BKtJADC/"

response = requests.get(book_meta_url)
if response.status_code != 200:
    raise ValueError(f"Failed to download book metadata file, with HTTP code {response.status_code}")
book_meta_data = response.json()

book_df = pd.DataFrame(book_meta_data)
book_df.head(2)

Unnamed: 0,source_file,source_url,book_id,book_title,book_description,book_author,book_author_url,genres,format,num_pages,publication_date,rating_avg,rating_count,review_count,canonical_url
0,../data/Book_language_pages/en/19288043-gone-g...,https://www.goodreads.com/en/book/show/1928804...,19288043,Gone Girl,An alternative cover edition for this ISBN can...,Gillian Flynn,['https://www.goodreads.com/author/show/2383.G...,"['Fiction', 'Mystery', 'Thriller', 'Book Club'...",Paperback,415.0,2012-05-24T00:00:00,4.14,3399892,167690,https://www.goodreads.com/book/show/19288043-g...
1,../data/Book_language_pages/en/41865.Twilight....,https://www.goodreads.com/en/book/show/41865.T...,41865,Twilight,About three things I was absolutely positive. ...,Stephenie Meyer,['https://www.goodreads.com/author/show/941441...,"['Fantasy', 'Young Adult', 'Romance', 'Fiction...",Paperback,498.0,2005-10-05T00:00:00,3.67,7211130,146232,https://www.goodreads.com/book/show/41865.Twil...


Next, we load the reviews for the languages specified above.

In [6]:
import io
import gzip
import zipfile


def download_reviews(lang_code):
    if lang_code not in review_json_urls:
        choices = list(review_json_urls.keys())
        raise KeyError(f"'{lang_code}' is not a valid language code. Please choose from {choices}")
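    response = requests.get(review_json_urls[lang_code])
    # fail loudly instead of returning None, which would only break later on len()
    if response.status_code != 200:
        raise ValueError(f"Failed to download reviews for '{lang_code}', with HTTP code {response.status_code}")
    with gzip.open(io.BytesIO(response.content), 'rt') as fh:
        return [json.loads(line) for line in fh]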
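
The parsing step in `download_reviews` can be tried offline. The following is a minimal sketch, assuming (as the function above does) that each file is gzip-compressed with one JSON object per line. The helper name `parse_gzipped_json_lines` and the two sample reviews are made up for illustration.

```python
import gzip
import io
import json


def parse_gzipped_json_lines(raw_bytes):
    """Decompress gzip bytes and parse one JSON object per non-empty line."""
    with gzip.open(io.BytesIO(raw_bytes), 'rt', encoding='utf-8') as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical sample data standing in for a downloaded review file
sample_reviews = [
    {"review_id": "r1", "book_id": 2175, "review_lang": "da"},
    {"review_id": "r2", "book_id": 2175, "review_lang": "da"},
]
payload = "\n".join(json.dumps(review) for review in sample_reviews)
raw_bytes = gzip.compress(payload.encode('utf-8'))

parsed = parse_gzipped_json_lines(raw_bytes)
print(len(parsed), parsed[0]["book_id"])
```

Keeping the parsing separate from the HTTP request makes it possible to test the file format without network access.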



In [7]:
reviews = {}
for lang in lang_codes:
    reviews[lang] = download_reviews(lang)
    print(f"{len(reviews[lang])} reviews for language {lang}")

review_df = pd.DataFrame([review for lang in reviews for review in reviews[lang]])
review_df = pd.merge(review_df, book_df[['book_id', 'book_title', 'book_author']], on='book_id')

3495 reviews for language da
5825 reviews for language de
6270 reviews for language en
6254 reviews for language es
5481 reviews for language fa
5982 reviews for language fr
6128 reviews for language it
358 reviews for language ja
232 reviews for language ko
5861 reviews for language nl
764 reviews for language sl
5140 reviews for language sv
4277 reviews for language uk


A quick peek at the review data to know what it looks like:

In [8]:
review_df.head(2)

Unnamed: 0,review_text,user_id,review_id,review_date,shelf_status,user_shelves,rating,book_id,source_url,review_lang,book_title,book_author
0,"Meget fin, men har også følgende ting som irri...",a731b23ab9c845ad76c48c9ae0c37201af81a7294dd8f3...,8eee415ac1699108c550ba2b5804a44141c5b542dab4d9...,2025-01-26T00:00:00,Read,[],,2175,https://goodreads.com/da/book/show/2175.Madame...,da,Madame Bovary,Gustave Flaubert
1,I jagten på at lykkes med livet kæmper vi med ...,9aeb2cd6b7ccc733d6d0c373c1c8e1d815a5ab40df3e07...,d2e728a22e350c07c7030f0ab53b9cf3ec646f6e25cfb3...,2025-04-11T00:00:00,,"[100-classics-penguin, 1001-books-boxall, 488-...",4.0,2175,https://goodreads.com/da/book/show/2175.Madame...,da,Madame Bovary,Gustave Flaubert


Next, we check per book how many reviews there are in each language. For many languages, there are 30 reviews per book. This is because the first page of reviews of a book contains at most 30 reviews. We have not crawled reviews beyond the first page, so 30 is the maximum number of reviews per book/language combination in our data set, but for many books there are many more reviews.

In [9]:
review_df.groupby(['book_id', 'book_title', 'book_author']).review_lang.value_counts().unstack().fillna(0.0)

Unnamed: 0_level_0,Unnamed: 1_level_0,review_lang,da,de,en,es,fa,fr,it,ja,ko,nl,sl,sv,uk
book_id,book_title,book_author,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
11,The Hitchhiker’s Guide to the Galaxy,Douglas Adams,26.0,30.0,30.0,30.0,30.0,30.0,30.0,4.0,0.0,30.0,8.0,30.0,30.0
93,Heidi,Johanna Spyri,3.0,30.0,30.0,30.0,20.0,20.0,30.0,0.0,0.0,9.0,1.0,4.0,2.0
320,One Hundred Years of Solitude,Gabriel García Márquez,30.0,30.0,30.0,30.0,30.0,30.0,30.0,2.0,1.0,30.0,5.0,30.0,30.0
343,Perfume: The Story of a Murderer,Patrick Süskind,12.0,30.0,30.0,30.0,30.0,30.0,30.0,1.0,2.0,30.0,5.0,30.0,25.0
656,War and Peace,Leo Tolstoy,19.0,30.0,30.0,30.0,30.0,30.0,30.0,1.0,0.0,30.0,3.0,25.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61439040,1984,George Orwell,30.0,30.0,30.0,30.0,30.0,30.0,30.0,1.0,7.0,30.0,10.0,30.0,30.0
77265004,The Iliad,Homer,20.0,30.0,30.0,30.0,30.0,30.0,30.0,1.0,0.0,30.0,2.0,30.0,14.0
127441416,The Diary of a Young Girl,Anne Frank,30.0,30.0,30.0,30.0,30.0,30.0,30.0,2.0,0.0,30.0,4.0,30.0,30.0
129915654,Pride and Prejudice,Jane Austen,30.0,30.0,30.0,30.0,30.0,30.0,30.0,2.0,1.0,30.0,8.0,30.0,30.0


In [10]:
# show an example review
print(json.dumps(reviews['it'][0], indent=4))

{
    "review_text": "C'EST MOI   Meravigliosa come sempre, semplicemente perfetta, Isabelle Huppert nell\u2019adattamento del 1991 firmato da Claude Chabrol. Letto un paio di volte e sempre amato. Uno dei massimi capolavori della letteratura, secondo me. Flaubert \u00e8 uno dei sommi: me lo immagino di notte, solo nella sua casa di Rouen, che sono ovviamente stato a visitare, al lume di candela, che 'recita' le parole scritte, ancora e ancora, urlandole, cancellando, limando, riscrivendo, fino a trovare la formula giusta, quella perfetta. Le mot juste.  Perch\u00e9, lui \u00e8 con la perfezione che si misurava.  E alla perfezione si \u00e8 avvicinato, e, secondo me, la perfezione ha raggiunto. Realistico, il romanzo certamente lo \u00e8: non contiene nulla che non sia esistito nella vita reale (e facilissimo da riscontrare attraverso sopralluoghi e testimonianze); e anche se sbuffa ogni tanto \"nulla in questa storia \u00e8 tratto dalla vita, \u00e8 totalmente inventata\", non c'\u00e

## Loading parsed reviews

As mentioned, the pre-parsed reviews are available in the `spacy_doc_bins` sub-directory. We load them as well, so we have access to the individual word tokens in each review, each with its Part-Of-Speech (POS) tag and lemma.

In [23]:
from spacy.tokens import DocBin


def download_spacy_doc_bin(lang_code, vocab):
    if lang_code not in spacy_doc_bin_urls:
        choices = list(spacy_doc_bin_urls.keys())
        raise KeyError(f"'{lang_code}' is not a valid language code. Please choose from {choices}")
    response = requests.get(spacy_doc_bin_urls[lang_code])
    if response.status_code == 200:
        doc_bin = DocBin().from_bytes(response.content)
        return list(doc_bin.get_docs(vocab))
    return None


parsed_reviews = {}
for lang in lang_codes:
    parsed_reviews[lang] = download_spacy_doc_bin(lang, lang_nlp['en'].vocab)
    print(lang, len(parsed_reviews[lang]))

da 3495
de 5825
en 6270
es 6254
fa 5481
fr 5982
it 6128
ja 358
ko 232
nl 5861
sl 764
sv 5140
uk 4277


We also load stopword lists, because our domain-specific vocabularies should contain domain-specific terms, not words common across all text genres in a language.

In [24]:
for lang in sorted(lang_codes):
    print(lang, len(stopwordsiso.stopwords(lang)))

da 170
de 620
en 1298
es 732
fa 799
fr 691
it 632
ja 134
ko 679
nl 413
sl 446
sv 418
uk 73


The next step is to build a list of all non-stopwords used in reviews and keep track of the number of reviews in which they occur, so we can distinguish between words used in many reviews and words used in few reviews. The latter are probably too specific to a book, or simply rare.

In [24]:
def filter_content_words(doc, stopwords):
    tokens = [token for token in doc if token.text not in stopwords and token.lemma_ not in stopwords]
    tokens = [token for token in tokens if token.pos_ != 'PUNCT' and token.lemma_ != ' ']
    return [token for token in tokens if len(token.text) > 2 and len(token.lemma_) > 2]

doc_freq = defaultdict(Counter)
for lang in parsed_reviews:
    stopwords = stopwordsiso.stopwords(lang)
    for doc in parsed_reviews[lang]:
        lemmas = [token.lemma_ for token in filter_content_words(doc, stopwords)]
        doc_freq[lang].update(set(lemmas))
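
The difference between counting occurrences and counting reviews is what the `set(lemmas)` call above is for. A minimal sketch with made-up, already lemmatised reviews (all names prefixed with `toy_` are hypothetical):

```python
from collections import Counter

# Hypothetical lemmatised reviews (stopwords already removed)
toy_reviews = [
    ["book", "book", "book", "story"],
    ["book", "character"],
    ["story", "character", "character"],
]

toy_term_count = Counter()
toy_doc_freq = Counter()
for lemmas in toy_reviews:
    toy_term_count.update(lemmas)      # counts every occurrence
    toy_doc_freq.update(set(lemmas))   # counts each review at most once

print(toy_term_count["book"], toy_doc_freq["book"])
```

A reviewer who repeats a word many times inflates the occurrence count but adds only one to the document frequency, which is why the latter is the better indicator of words shared by many reviewers.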
        

If we remove stopwords, what are the most common words (words occurring in most documents)?

In English:

In [25]:
doc_freq['en'].most_common(10)

[('book', 4658),
 ('read', 4318),
 ('time', 3414),
 ('story', 3184),
 ('life', 2819),
 ('love', 2675),
 ('character', 2625),
 ('write', 2385),
 ('people', 2258),
 ('feel', 1893)]

In Italian:

In [26]:
doc_freq['it'].most_common(10)

[('libro', 3197),
 ('storia', 2128),
 ('leggere', 2108),
 ('di il', 2072),
 ('romanzo', 1983),
 ('potere', 1732),
 ('personaggio', 1709),
 ('a il', 1612),
 ('lettura', 1523),
 ('venire', 1411)]

In Ukrainian:

In [27]:
doc_freq['uk'].most_common(10)

[('книга', 2079),
 ('той', 1425),
 ('історія', 1400),
 ('дуже', 1360),
 ('читати', 1349),
 ('свій', 1341),
 ('могти', 1153),
 ('життя', 1101),
 ('себе', 1065),
 ('людина', 984)]

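The most common words in each language (where common is measured by the number of reviews in which they occur) are mostly related to the domain of books and reading. Some function words slip through the stopword filter, such as the Italian lemmas 'di il' and 'a il', which appear to come from contracted prepositions.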

### Build a vocabulary of common terms

To build an initial domain-specific vocabulary per language, we use a simple threshold: a term should occur in at least 1% of all reviews in a given language to be considered a domain term. Since we have a few thousand reviews for most languages, this corresponds to a threshold of a few dozen reviews. For those languages, this helps ensure that terms are not specific to a single book, author or book series.

The chosen threshold of 1% is arbitrary. Depending on the total number of reviews in a language and the diversity of the books that they are associated with, you can try different thresholds. A good rule of thumb is that a threshold should be substantially higher than the maximum number of reviews for a single book (preferably also higher than the number of reviews for a single book series or author), otherwise you may get plot-specific words or the names of characters or authors that are mentioned in many reviews of the same book, series or author.

In [29]:
min_df = {lang: len(parsed_reviews[lang]) * 0.01 for lang in parsed_reviews}
vocab = {}
for lang in doc_freq:
    vocab[lang] = set([term for term in doc_freq[lang] if doc_freq[lang][term] >= min_df[lang]])
    print(f"lang: {lang}  min_df: {min_df[lang]: >5.2f}  full vocab: {len(doc_freq[lang]): >7}"
          f"  common vocab: {len(vocab[lang]): >6}")

lang: da  min_df: 34.95  full vocab:   21921  common vocab:    513
lang: de  min_df: 58.25  full vocab:   47879  common vocab:    926
lang: en  min_df: 62.70  full vocab:   59070  common vocab:   2226
lang: es  min_df: 62.54  full vocab:   43067  common vocab:   1274
lang: fa  min_df: 54.81  full vocab:   57262  common vocab:   1290
lang: fr  min_df: 59.82  full vocab:   25074  common vocab:    724
lang: it  min_df: 61.28  full vocab:   48840  common vocab:   1319
lang: ja  min_df:  3.58  full vocab:    1880  common vocab:    160
lang: ko  min_df:  2.32  full vocab:    9573  common vocab:    560
lang: nl  min_df: 58.61  full vocab:   35978  common vocab:    557
lang: sl  min_df:  7.64  full vocab:   10933  common vocab:    617
lang: sv  min_df: 51.40  full vocab:   26982  common vocab:    418
lang: uk  min_df: 42.77  full vocab:   36285  common vocab:   1145
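
The rule of thumb about the threshold can be checked against the numbers printed above. The sketch below uses a handful of the review counts from that output and the cap of 30 reviews per book and language mentioned earlier; the variable names are illustrative only.

```python
# Review counts per language, copied from the output above
num_reviews = {"da": 3495, "en": 6270, "ja": 358, "ko": 232, "sl": 764}
max_reviews_per_book = 30  # first-page cap described earlier
fraction = 0.01            # the 1% threshold used above

# Languages where the reviews of a single book could push a term over the threshold
at_risk = sorted(
    lang for lang, total in num_reviews.items()
    if total * fraction <= max_reviews_per_book
)
print(at_risk)
```

For the languages flagged this way, the common vocabulary may still contain book-specific terms, so a higher fraction or an absolute minimum might be worth trying there.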


The SpaCy docs with the parsed version of the review contain custom metadata including the identifier of the review. In the next steps we want to sort the reviews per book, so we add the book identifier to each SpaCy document for easy reference.

In [30]:
lang_book_docs = defaultdict(lambda: defaultdict(list))

for lang in parsed_reviews:
    for review, doc in zip(reviews[lang], parsed_reviews[lang]):
        doc.user_data['book_id'] = review['book_id']
        lang_book_docs[lang][doc.user_data['book_id']].append(doc)


### Language/Term vectors

Now that we have a domain-specific vocabulary per language, we can build vectors per language and term, where each element in the vector represents the number of reviews of a given book that contain the given term in the given language. That is, for each term in a language, the vector has 209 frequencies, one for each of the 209 books.

First we make a list of all the book identifiers:

In [28]:
book_ids = sorted(set([review['book_id'] for lang in reviews for review in reviews[lang]]))

Next, we build a dictionary with language and term as key, and a list of 209 frequencies as value:

In [32]:
lang_term_freq = defaultdict(lambda: defaultdict(list))
for lang in parsed_reviews:
    if lang == 'zh':
        continue
    for book_id in book_ids:
        review_freq = Counter()
        for doc in lang_book_docs[lang][book_id]:
            common_tokens = set([token.lemma_ for token in doc if token.lemma_ in vocab[lang]])
        
            review_freq.update(common_tokens)
        for term in vocab[lang]:
            lang_term_freq[lang][term].append(review_freq[term])
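
To make the shape of these vectors concrete, here is a small sketch of the same loop on made-up data for a single language. The book identifiers, reviews and vocabulary are hypothetical.

```python
from collections import Counter

# Hypothetical data: book id -> list of lemmatised reviews
toy_book_reviews = {
    101: [["plot", "character"], ["character", "style"]],
    102: [["plot"], ["plot", "style"], ["style"]],
    103: [],  # a book without reviews in this language
}
toy_vocab = {"plot", "character", "style"}
toy_book_ids = sorted(toy_book_reviews)

toy_term_freq = {term: [] for term in toy_vocab}
for book_id in toy_book_ids:
    review_freq = Counter()
    for lemmas in toy_book_reviews[book_id]:
        # count each vocabulary term at most once per review
        review_freq.update(set(lemmas) & toy_vocab)
    for term in toy_vocab:
        # one entry per book, in the order of toy_book_ids
        toy_term_freq[term].append(review_freq[term])

print(toy_term_freq["character"])
```

Every vector has one entry per book in the same sorted order, which is what later allows terms from different languages to be compared position by position.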


Absolute frequencies are not directly comparable, because for some books there are more reviews than for others. Therefore, we want to turn the absolute frequencies into relative frequencies.

For that, we need to know how many reviews there are in total in each language for each book identifier:

In [34]:
# count reviews per language and book: rows are languages, columns are book identifiers
lang_book_num_reviews = (review_df
    .groupby('review_lang')
    .book_id
    .value_counts()
    .unstack()
    .fillna(0.0))


Check how many reviews we have for each book in a given language, e.g. Italian:

In [36]:
lang_book_num_reviews.loc['it'].sort_values()

book_id
2767052       0.0
52516332      9.0
762390        9.0
17802724     14.0
6193         15.0
             ... 
17245        30.0
17690        30.0
18386        30.0
14942        30.0
239775146    30.0
Name: it, Length: 209, dtype: float64

Now we compute the relative frequency (fraction) vectors:

In [38]:
lang_term_frac = defaultdict(lambda: defaultdict(list))
for lang in lang_term_freq:
    for term in lang_term_freq[lang]:
        freq_num_reviews = zip(lang_term_freq[lang][term], lang_book_num_reviews.loc[lang])
        lang_term_frac[lang][term] = [freq / num_reviews if num_reviews > 0 else 0 for freq, num_reviews in freq_num_reviews]
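
As a worked example of the normalisation, with made-up numbers: a term found in 2 of 4 reviews of one book and in 2 of 8 reviews of another has the same absolute frequency for both, but different fractions. The guard for books without reviews avoids a division by zero.

```python
# Hypothetical counts for one term across three books
term_review_counts = [2, 2, 0]
# Hypothetical total number of reviews per book, in the same book order
book_review_totals = [4, 8, 0]

term_fractions = [
    count / total if total > 0 else 0.0
    for count, total in zip(term_review_counts, book_review_totals)
]
print(term_fractions)
```

Note that `zip` pairs the values by position, so both lists must follow the same book order.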

Next, we build a matrix of the vectors, where the index of a row in the matrix corresponds to a language/term combination, and a column corresponds to a book identifier.

We make mappings from row indexes to term/language pairs and vice versa:

In [66]:
docid2term = {}
term2docid = {}

for lang in lang_term_frac:
    for term in lang_term_frac[lang]:
        docid = len(docid2term)
        docid2term[docid] = (lang, term)
        term2docid[(lang, term)] = docid

term_vecs = np.array([lang_term_frac[lang][term] for lang in lang_term_frac for term in lang_term_frac[lang]])
term_vecs.shape

(11729, 209)

There are 11,729 term/language pairs and 209 books.

To measure how similar two rows (two word/language pairs) are, we can use cosine similarity. **Note that with the multilingual term/language - book matrix, we can easily compute the similarity of terms from different languages.**

In [41]:
from sklearn.metrics.pairwise import cosine_similarity

# compute cosine similarity of the term/language vectors
term_cosim = cosine_similarity(term_vecs)

# make a dataframe with the term/language pairs as index and column labels
term_cosim = pd.DataFrame(term_cosim, columns=term2docid.keys(), index=term2docid.keys())
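
For readers unfamiliar with the measure, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following pure-Python sketch uses made-up vectors; returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical per-book fractions for three terms over four books
toy_vectors = {
    ("en", "character"): [0.8, 0.6, 0.0, 0.2],
    ("es", "personaje"): [0.4, 0.3, 0.0, 0.1],  # same profile, used half as often
    ("en", "diary"): [0.0, 0.0, 0.9, 0.0],      # used for a different book only
}

same_profile = cosine(toy_vectors[("en", "character")], toy_vectors[("es", "personaje")])
different_profile = cosine(toy_vectors[("en", "character")], toy_vectors[("en", "diary")])
print(round(same_profile, 3), round(different_profile, 3))
```

Because the measure ignores vector length, a term that is used half as often but for the same books still scores as highly similar, which suits a comparison between languages with different review volumes.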


Now we can select a column (a term/language pair) and see which rows (term/language pairs) are most similar:

In [65]:
term_cosim[('en', 'character')].sort_values(ascending=False).head(20)

en  character      1.000000
es  personaje      0.948104
fr  personnage     0.940962
it  personaggio    0.928644
es  historia       0.920066
it  romanzo        0.915456
en  story          0.913704
    feel           0.912930
it  a il           0.910778
en  read           0.909190
it  da il          0.908457
en  love           0.908316
    time           0.908042
fa  داستان         0.907022
it  di il          0.904696
fr  roman          0.901980
en  reader         0.899937
it  in il          0.898537
en  book           0.897922
    start          0.897491
Name: (en, character), dtype: float64

The terms most similar to 'character' include its equivalents in other languages, e.g. 'personaje' in Spanish, 'personnage' in French and 'personaggio' in Italian. We don't see equivalents in most other languages, which might mean that those languages have no equivalent in the common vocabulary, or that the equivalents are used in different ways or at least in different contexts.

We can also restrict the similarity to terms in a specific language:

In [64]:
term_cosim.loc['it'][('en', 'character')].sort_values(ascending=False).head(20)

personaggio    0.928644
romanzo        0.915456
a il           0.910778
da il          0.908457
di il          0.904696
in il          0.898537
storia         0.892953
potere         0.889716
riuscire       0.888808
libro          0.875448
venire         0.874147
su il          0.873745
leggere        0.872140
bello          0.866218
dovere         0.865983
piacere        0.864737
lettura        0.862533
trovare        0.862480
pagina         0.861608
andare         0.859184
Name: (en, character), dtype: float64
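
To get one candidate equivalent per language in a single pass, the highest-scoring term can be kept for each language. The sketch below uses a plain dictionary with a few scores rounded from the output above, rather than the `term_cosim` dataframe itself.

```python
# Similarity to ('en', 'character'), rounded from the output above
toy_scores = {
    ("es", "personaje"): 0.95,
    ("es", "historia"): 0.92,
    ("fr", "personnage"): 0.94,
    ("fr", "roman"): 0.90,
    ("it", "personaggio"): 0.93,
    ("it", "romanzo"): 0.92,
}

best_per_lang = {}
for (lang, term), score in toy_scores.items():
    # keep only the highest-scoring term seen so far for each language
    if lang not in best_per_lang or score > best_per_lang[lang][1]:
        best_per_lang[lang] = (term, score)

for lang in sorted(best_per_lang):
    print(lang, *best_per_lang[lang])
```

The top-ranked term is only a candidate translation: for a language without a true equivalent in its common vocabulary, the best match is likely to be a generic word such as the one for 'story' or 'novel'.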

### Book similarity

We can also compute the similarity of books (in terms of how they are discussed in reviews using the domain-specific vocabulary). The 209 columns in the matrix represent book vectors, so we can transpose the matrix and compute cosine similarities for the 209 books based on the 11,729 term/language pairs.
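
What transposing does can be seen on a tiny made-up matrix. This pure-Python sketch mirrors what `.T` does on a NumPy array; the values are hypothetical.

```python
# Hypothetical term-by-book matrix: 3 term rows, 2 book columns
toy_term_rows = [
    [0.5, 0.0],
    [0.2, 0.4],
    [0.0, 0.9],
]

# Transposing turns each book column into a row vector over all terms
toy_book_rows = [list(column) for column in zip(*toy_term_rows)]
print(toy_book_rows)
```

Each resulting row describes one book by how often every vocabulary term occurs in its reviews, so row-wise similarity now compares books instead of terms.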

First, we make more interpretable book labels for each book identifier, based on the title and author:

In [44]:
book_df['book_label'] = book_df.apply(lambda row: f"{row['book_author']}--{row['book_title']}", axis=1)
book_labels = book_df.sort_values('book_id').book_label

Next, we compute the similarities and make a dataframe with the book labels as index and column names:

In [46]:
doc_sim = cosine_similarity(term_vecs.T)
doc_sim = pd.DataFrame(doc_sim, columns=book_labels, index=book_labels)
doc_sim

book_label,Douglas Adams--The Hitchhiker’s Guide to the Galaxy,Johanna Spyri--Heidi,Gabriel García Márquez--One Hundred Years of Solitude,Patrick Süskind--Perfume: The Story of a Murderer,Leo Tolstoy--War and Peace,Arthur Golden--Memoirs of a Geisha,Dan Brown--Angels & Demons,Carlos Ruiz Zafón--The Shadow of the Wind,"John Gray--Men Are from Mars, Women Are from Venus",Homer--The Odyssey,...,Elizabeth Acevedo--Clap When You Land,Alice Walker--The Color Purple,Kristin Hannah--The Four Winds,Dan Brown--The Da Vinci Code,William Shakespeare--Romeo and Juliet,George Orwell--1984,Homer--The Iliad,Anne Frank--The Diary of a Young Girl,Jane Austen--Pride and Prejudice,Haruki Murakami--Kafka on the Shore
book_label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Douglas Adams--The Hitchhiker’s Guide to the Galaxy,1.000000,0.425141,0.595284,0.547511,0.509258,0.605838,0.555832,0.590952,0.459859,0.584057,...,0.385773,0.545468,0.547437,0.610680,0.518967,0.600041,0.549197,0.560569,0.616020,0.603604
Johanna Spyri--Heidi,0.425141,1.000000,0.435338,0.405507,0.383722,0.484655,0.371903,0.415139,0.321090,0.451553,...,0.363332,0.454200,0.475592,0.413403,0.413847,0.430965,0.383356,0.444191,0.474465,0.411653
Gabriel García Márquez--One Hundred Years of Solitude,0.595284,0.435338,1.000000,0.542135,0.548669,0.624059,0.503231,0.613721,0.424733,0.620170,...,0.418765,0.568516,0.610642,0.566524,0.543886,0.613246,0.578616,0.583595,0.612866,0.628941
Patrick Süskind--Perfume: The Story of a Murderer,0.547511,0.405507,0.542135,1.000000,0.476021,0.574646,0.477668,0.536150,0.414257,0.536513,...,0.382125,0.538712,0.524336,0.547662,0.495373,0.561447,0.507439,0.508425,0.561020,0.537347
Leo Tolstoy--War and Peace,0.509258,0.383722,0.548669,0.476021,1.000000,0.536463,0.435348,0.518193,0.378716,0.551711,...,0.357725,0.490663,0.519091,0.494643,0.482098,0.550265,0.539300,0.525844,0.540609,0.528436
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
George Orwell--1984,0.600041,0.430965,0.613246,0.561447,0.550265,0.609257,0.511272,0.574745,0.458798,0.603071,...,0.396385,0.554937,0.580984,0.564238,0.527214,1.000000,0.577074,0.600834,0.617593,0.597663
Homer--The Iliad,0.549197,0.383356,0.578616,0.507439,0.539300,0.544901,0.447153,0.526643,0.399127,0.675327,...,0.360706,0.514523,0.520810,0.516263,0.538091,0.577074,1.000000,0.550546,0.553530,0.534283
Anne Frank--The Diary of a Young Girl,0.560569,0.444191,0.583595,0.508425,0.525844,0.606855,0.471544,0.536692,0.450440,0.567767,...,0.425208,0.551491,0.594324,0.537827,0.520851,0.600834,0.550546,1.000000,0.576666,0.553042
Jane Austen--Pride and Prejudice,0.616020,0.474465,0.612866,0.561020,0.540609,0.635823,0.529020,0.608426,0.469970,0.614307,...,0.440668,0.602505,0.609780,0.588025,0.585373,0.617593,0.553530,0.576666,1.000000,0.585100


Now, we can pick any book label and query the dataframe to find books that are similar in terms of the domain-specific vocabulary:

In [47]:
book_label = "John Green--The Fault in Our Stars"

doc_sim[book_label].sort_values()

book_label
Michael    Connelly--The Black Echo                                                               0.338156
Agatha Christie--Murder on the Orient Express                                                     0.357973
Dale Carnegie--How to Stop Worrying and Start Living: Time-Tested Methods for Conquering Worry    0.372302
J.M. Coetzee--Life & Times of Michael K                                                           0.374236
Adania Shibli--Minor Detail                                                                       0.384830
                                                                                                    ...   
Gabriel García Márquez--Love in the Time of Cholera                                               0.653445
Jojo Moyes--Me Before You                                                                         0.668524
Haruki Murakami--Norwegian Wood                                                                   0.679925
Stephenie Meyer--Twilight 

In [48]:
book_label = "Anne Frank--The Diary of a Young Girl"

doc_sim[book_label].sort_values()

book_label
Michael    Connelly--The Black Echo                                                               0.303608
Agatha Christie--Murder on the Orient Express                                                     0.310408
Dale Carnegie--How to Stop Worrying and Start Living: Time-Tested Methods for Conquering Worry    0.370094
Charles Baudelaire--Les Fleurs du Mal                                                             0.374467
J.M. Coetzee--Life & Times of Michael K                                                           0.376204
                                                                                                    ...   
Arthur Golden--Memoirs of a Geisha                                                                0.606855
Khaled Hosseini--A Thousand Splendid Suns                                                         0.609997
Ray Bradbury--Fahrenheit 451                                                                      0.613907
John Green--The Fault in O

## Term-Document matrix

Finally, we can turn the term-document matrix into a dataframe as well, which allows us to:

- use a vocabulary term to find which books are most commonly described by that term (e.g. for which books do reviewers often mention characters, plot or writing style)
- use a book label to find which vocabulary terms are most commonly used to describe it (e.g. which terms are most typically used to describe the diary of Anne Frank).
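
Both lookups can be sketched on a small nested dictionary before turning to the full dataframe. The book labels, fractions and function names here are made up for illustration.

```python
# Hypothetical fractions: (language, term) -> {book label: fraction of reviews}
toy_term_doc = {
    ("en", "character"): {"Book A": 0.8, "Book B": 0.1, "Book C": 0.5},
    ("en", "diary"): {"Book A": 0.0, "Book B": 0.7, "Book C": 0.1},
}


def top_books_for_term(term_key, n=2):
    """Books whose reviews mention the term most often."""
    books = toy_term_doc[term_key]
    return sorted(books, key=books.get, reverse=True)[:n]


def top_terms_for_book(book_label, n=2):
    """Terms mentioned most often in reviews of the book."""
    return sorted(
        toy_term_doc,
        key=lambda term_key: toy_term_doc[term_key][book_label],
        reverse=True,
    )[:n]


print(top_books_for_term(("en", "character")))
print(top_terms_for_book("Book B"))
```

The first lookup reads along a row of the term-document matrix and the second reads down a column, which is the same distinction as between `.loc[...]` on the index and `[...]` on the columns in the dataframe version.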

In [68]:
term_doc_frac = pd.DataFrame(term_vecs, columns=book_labels, index=term2docid.keys())

term_doc_frac.loc[('en', 'character')].sort_values()

book_label
Rhonda Byrne--The Secret                                                                          0.000000
Anne Frank--The Diary of a Young Girl                                                             0.000000
Dale Carnegie--How to Stop Worrying and Start Living: Time-Tested Methods for Conquering Worry    0.000000
Viktor E. Frankl--Man's Search for Meaning                                                        0.000000
Kahlil Gibran--The Prophet                                                                        0.000000
                                                                                                    ...   
Victor Hugo--Les Misérables                                                                       0.766667
Chimamanda Ngozi Adichie--Half of a Yellow Sun                                                    0.800000
Chimamanda Ngozi Adichie--Americanah                                                              0.800000
Leo Tolstoy--Anna Karenina

In [78]:
book_label = "Anne Frank--The Diary of a Young Girl"

term_doc_frac[book_label].loc[['en']].sort_values(ascending=False).head(20)

en  book      0.766667
    Anne      0.700000
    read      0.700000
    diary     0.700000
    time      0.666667
    people    0.600000
    life      0.600000
    write     0.533333
    girl      0.500000
    live      0.466667
    family    0.433333
    word      0.433333
    day       0.433333
    war       0.400000
    review    0.400000
    happen    0.366667
    story     0.366667
    feel      0.366667
    heart     0.333333
    love      0.333333
Name: Anne Frank--The Diary of a Young Girl, dtype: float64