# Workshop Preparation

To follow the workshop, you need the review datasets and a number of Python libraries.

## Downloading the dataset

The datasets are available on SURF drive. Because we cannot share the datasets publicly, workshop participants will receive the URL via email.

There are three sets of data:

- `book_metadata.csv`: basic metadata for a set of 209 books.
- `lang_reviews`: a directory with reviews in different languages, one file per language.
- `spacy_doc_bins`: a directory with pre-parsed reviews in SpaCy DocBin format. Parsing was done with SpaCy and Dadmatools (the latter for reviews in Persian).
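After downloading, it can help to confirm that all three sets of data are in place before starting. The following is a minimal sketch using only the standard library. The directory name `data` and the helper `find_missing` are assumptions made for illustration; point the function at wherever you saved the files.

```python
from pathlib import Path

# Expected entries, taken from the list above. The name of the download
# directory ("data") is an assumption; use wherever you saved the files.
EXPECTED = {
    "book_metadata.csv": "file",
    "lang_reviews": "dir",
    "spacy_doc_bins": "dir",
}


def find_missing(data_dir):
    """Return the expected dataset entries that are absent from data_dir."""
    data_dir = Path(data_dir)
    missing = []
    for name, kind in EXPECTED.items():
        path = data_dir / name
        # A CSV must be a regular file; the other two must be directories.
        present = path.is_file() if kind == "file" else path.is_dir()
        if not present:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing("data")
    print("Missing:", missing if missing else "nothing, all datasets found")
```

An empty list means all three entries were found. The check looks only at names and types, not at the contents of the files.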

## Installing Python libraries

Install the libraries that you don't already have:

In [1]:
import sys

# Use !{sys.executable} instead of `python` so that the interpreter used to
# install is the same one that runs this notebook
!{sys.executable} -m pip install spacy


In [None]:
!{sys.executable} -m pip install pandas
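To see which libraries you still need before running the install cells, one option is to ask Python whether each module can be found. This is a sketch, not part of the workshop code: the helper `not_installed` is a hypothetical name, and the optional modules listed are only needed for parsing Persian/Farsi reviews yourself.

```python
import importlib.util

# Import names of the libraries used in this notebook. html2text and
# dadmatools are only needed for parsing Persian/Farsi reviews yourself.
REQUIRED = ["spacy", "pandas"]
OPTIONAL = ["html2text", "dadmatools"]


def not_installed(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


print("Missing required libraries:", not_installed(REQUIRED))
print("Missing optional libraries:", not_installed(OPTIONAL))
```

`importlib.util.find_spec` only looks a module up without importing it, so the check is fast and has no side effects. It reports whether a module is present, not which version is installed.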

## Optional: language parser for Persian/Farsi

The following library is only necessary if you want to do your own parsing of reviews in Persian/Farsi. A pre-parsed version of the reviews is available on the SURF drive (`spacy_doc_bins/parsed_reviews-fa.doc_bin`).

In [None]:

# if you want to parse reviews written in Persian/Farsi, install
# the following packages
!{sys.executable} -m pip install html2text
!{sys.executable} -m pip install protobuf==3.19
!{sys.executable} -m pip install dadmatools


In [24]:
import spacy

from language import code_lang_map, lang_code_map, spacy_model_map



## Download SpaCy models

Pick one or more languages and download the corresponding SpaCy model:

In [25]:
languages = [
    # Add languages for which you want to do linguistic parsing of reviews
    'Chinese', 'Dutch', 'English', 'French', 'German', 'Italian', 'Persian', 'Portuguese', 
]

In [20]:
for language in languages:
    # lang_code_map is the name imported from `language` above
    model_name = spacy_model_map[lang_code_map[language]]
    spacy.cli.download(model=model_name)


Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 21.4 MB/s 0:00:23
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting it-core-news-lg==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/it_core_news_lg-3.7.0/it_core_news_lg-3.7.0-py3-none-any.whl (567.9 MB)

In [21]:
[lang_code_map[lang] for lang in languages]

['en', 'it', 'de', 'nl', 'fr', 'pt']
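The download loop above indexes the maps directly, so a language without an entry would stop the loop with a `KeyError`. This could happen for Persian, for example, if `spacy_model_map` has no entry for it, since those reviews are parsed with Dadmatools. The sketch below shows one way to separate languages that have a model from those that do not. It assumes both maps are plain dictionaries; the two small dictionaries here are hypothetical stand-ins for the ones in the workshop's `language` module, and `resolve_models` is an illustrative helper, not part of that module.

```python
# Hypothetical stand-ins for the maps imported from the workshop's
# `language` module; the real maps may differ in content and shape.
lang_code_map = {"English": "en", "Italian": "it", "Persian": "fa"}
spacy_model_map = {"en": "en_core_web_lg", "it": "it_core_news_lg"}


def resolve_models(languages, lang_code_map, spacy_model_map):
    """Split languages into (model names to download, languages skipped)."""
    models, skipped = [], []
    for language in languages:
        # .get returns None for unknown keys instead of raising KeyError
        code = lang_code_map.get(language)
        model_name = spacy_model_map.get(code)
        if model_name is None:
            skipped.append(language)
        else:
            models.append(model_name)
    return models, skipped


models, skipped = resolve_models(
    ["English", "Italian", "Persian"], lang_code_map, spacy_model_map
)
print("Download:", models)
print("No SpaCy model in the map:", skipped)
```

The resulting `models` list can then be passed to `spacy.cli.download` one name at a time, as in the loop above.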