# Workshop Preparation

To follow the workshop, you need the review datasets and a number of Python libraries.

## Creating a virtual environment

We strongly advise running this notebook from a Python virtual environment. To create one, open a terminal and navigate into the directory where you stored this notebook. Now execute the following commands:

`python -m venv .venv` <br/>
`source .venv/bin/activate` (on Windows: `.venv\Scripts\activate`) <br/>
`pip install ipykernel` <br/>
`python -m ipykernel install --user --name=bologna` <br/>

Then run your notebook by executing `jupyter notebook` from the same directory. <br/>
Inside the notebook, make sure the active kernel is the one named 'bologna'.
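
To confirm that the notebook is actually running inside the virtual environment, a quick check along the following lines can help. This is a minimal sketch: the helper name `in_virtualenv` is our own, and it relies on the fact that environments created with `venv` make `sys.prefix` differ from `sys.base_prefix`.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # In a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtualenv())
```

If the second line prints `False`, the notebook is probably using a different kernel than the one registered above.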

## Downloading the dataset

The datasets are available on the SURF drive. Workshop participants will receive the URL via email, because we cannot share the datasets publicly.

There are three sets of data:

- `book_metadata.csv`: basic metadata for a set of 209 books.
- `lang_reviews`: a directory with reviews in different languages, one file per language.
- `spacy_doc_bins`: a directory with pre-parsed reviews in SpaCy DocBin format. Parsing was done with SpaCy and Dadmatools (the latter for reviews in Persian).
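
After downloading, it may be worth checking that all three items are in place. The sketch below assumes the data was saved in the notebook's working directory, which is only a guess: pass a different path if you stored it elsewhere. The helper name `missing_datasets` is hypothetical.

```python
from pathlib import Path

# Names taken from the list above; where they are stored is an assumption.
EXPECTED = ["book_metadata.csv", "lang_reviews", "spacy_doc_bins"]


def missing_datasets(data_dir):
    """Return the expected files or directories that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED if not (root / name).exists()]


# Assumed location: the current working directory.
print("Missing:", missing_datasets(".") or "none")
```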

# Installing Python libraries

Install the libraries that you don't already have:

In [None]:
import sys

In [None]:
# use !{sys.executable} instead of python to ensure the python version 
# used to install is the same as the one running the notebook
!{sys.executable} -m pip install spacy

# install pandas and matplotlib for data handling and visualisation
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install matplotlib

# install module with stopwords lists for many languages
!{sys.executable} -m pip install stopwordsiso
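
To see which of these libraries are still missing before or after installing, something like the following can be used. It is a sketch using only the standard library; the helper name `missing_packages` is our own, and the list holds the import names of the packages installed above.

```python
import importlib.util

# Import names of the libraries installed in the cell above.
REQUIRED = ["spacy", "pandas", "matplotlib", "stopwordsiso"]


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing:", missing_packages(REQUIRED) or "none")
```

Note that the import name can differ from the name used with `pip`: for example, `scikit-learn` is imported as `sklearn`.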


## Optional: language parser for Persian/Farsi

The following library is only necessary if you want to do your own parsing of reviews in Persian/Farsi. A pre-parsed version of the reviews is available on the SURF drive (`spacy_doc_bins/parsed_reviews-fa.doc_bin`).

In [None]:

# if you want to parse reviews written in Persian/Farsi, install
# the following packages
!{sys.executable} -m pip install html2text
!{sys.executable} -m pip install protobuf==3.19
!{sys.executable} -m pip install dadmatools
!{sys.executable} -m pip install torch
!{sys.executable} -m pip install huggingface_hub==0.34.3
!{sys.executable} -m pip install tokenizers
!{sys.executable} -m pip install sentencepiece
!{sys.executable} -m pip install transformers
!{sys.executable} -m pip install gdown
!{sys.executable} -m pip install scikit-learn

# Restart the kernel
After installing the libraries, it is a good idea to restart your kernel so that the newly installed packages can be imported.

In [None]:
import spacy

from language import code_lang_map, lang_code_map, spacy_model_map



# Download SpaCy models

Pick one or more languages and download the corresponding SpaCy model:

In [None]:
languages = [
    # Add languages for which you want to do linguistic parsing of reviews
    'Chinese', 'Dutch', 'English', 'French', 'German', 'Italian', 'Spanish', 'Portuguese' 
]

In [None]:
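for language in languages:
    # look up the language code, then the name of the matching SpaCy model
    model_name = spacy_model_map[lang_code_map[language]]
    # download and install the model into the current environment
    spacy.cli.download(model=model_name)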


In [None]:
# show the language codes of the selected languages
[lang_code_map[lang] for lang in languages]