<a href="https://colab.research.google.com/github/iued-uni-heidelberg/DAAD-Training-2021/blob/main/cwb2021experimentsV06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with CWB on Colab
Author: Bogdan Babych, IÜD, Heidelberg University

Adapting the CWB installation and related packages so that they work in the Colab environment.

### Downloading packages and data

In [None]:
!wget https://heibox.uni-heidelberg.de/f/7f1e8929352b4cf4b13a/?dl=1

In [2]:
!mv index.html?dl=1 cwb-3.4.22-source.tar.gz

In [None]:
!tar xvfz cwb-3.4.22-source.tar.gz
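
The `wget` call above stores the archive under the server-supplied name `index.html?dl=1`, which then has to be renamed by hand. The same step can be sketched in Python so that the file is written directly under the intended name. The helper name `fetch` is hypothetical, and only the standard library is assumed.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch(url: str, dest: str) -> Path:
    """Download `url` and store it under `dest`; return the destination path."""
    target = Path(dest)
    # Create the parent directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# In the notebook (requires network access):
# fetch("https://heibox.uni-heidelberg.de/f/7f1e8929352b4cf4b13a/?dl=1",
#       "cwb-3.4.22-source.tar.gz")
```

Writing to an explicit file name removes the need for the separate `mv` step.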

### Installing the parser generator 'bison'

In [None]:
!apt-get install flex bison

### Replacing the configuration file
The replacement file sets the correct platform and the 'standard' installation location (otherwise the Python bindings do not work).

In [None]:
!wget https://heibox.uni-heidelberg.de/f/67bb38a210064bc5961e/?dl=1
!mv /content/cwb-3.4.22/config.mk /content/cwb-3.4.22/config.mk.old.01
!mv index.html?dl=1 /content/cwb-3.4.22/config.mk

In [None]:
# alternative: editing the file at the line numbers
!awk '{ if (NR == 42) print "PLATFORM=linux-64"; else print $0}' /content/cwb-3.4.22/config.mk > /content/cwb-3.4.22/config.mk.TMP
!awk '{ if (NR == 63) print "SITE=standard"; else print $0}' /content/cwb-3.4.22/config.mk.TMP > /content/cwb-3.4.22/config.mk
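
Editing by line number, as in the `awk` calls above, breaks silently if the layout of `config.mk` changes between CWB releases. A minimal sketch of a more tolerant alternative replaces the assignment by variable name instead. It assumes each variable is assigned on a single uncommented line of the form `NAME=value`; the function name `set_make_variable` and the sample text are hypothetical.

```python
import re


def set_make_variable(text: str, name: str, value: str) -> str:
    """Replace the first uncommented `NAME=...` line in makefile-style text."""
    pattern = re.compile(rf"^{re.escape(name)}\s*=.*$", flags=re.MULTILINE)
    # A function is used as replacement so that backslashes in `value` stay literal.
    new_text, count = pattern.subn(lambda m: f"{name}={value}", text, count=1)
    if count == 0:
        raise ValueError(f"no assignment for {name} found")
    return new_text


# Hypothetical excerpt; the real config.mk contains many more lines.
sample = "# PLATFORM=darwin\nPLATFORM=darwin-brew\nSITE=beta-install\n"
updated = set_make_variable(sample, "PLATFORM", "linux-64")
updated = set_make_variable(updated, "SITE", "standard")
print(updated)
```

The commented-out line is left untouched because the pattern is anchored at the start of a line.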

### Changing into installation directory and running installation scripts

In [6]:
%cd /content/cwb-3.4.22/

/content/cwb-3.4.22


In [7]:
!pwd

/content/cwb-3.4.22


Create the default registry directory for the 'standard' CWB installation:

In [8]:
!mkdir -p /usr/local/share/cwb/registry/

In [None]:
!sudo ./install-scripts/config-basic
!sudo ./install-scripts/install-linux

In [None]:
%cd /content/

### Downloading and relocating the registry file of a sample corpus
The registry file is edited to point to the corpus data and then moved into the standard CWB registry location.

In [None]:
!wget https://heibox.uni-heidelberg.de/f/dd3538603aa84dd09a76/?dl=1
!mv index.html?dl=1 Dickens-1.0.tar.gz
!tar xvzf Dickens-1.0.tar.gz

In [12]:
!cp /content/Dickens-1.0/registry/dickens /content/Dickens-1.0/registry/dickens.old.01
!awk '{ if (NR == 10) print "HOME /content/Dickens-1.0/data"; else print $0}' /content/Dickens-1.0/registry/dickens > /content/Dickens-1.0/registry/dickens.TMP
!awk '{ if (NR == 12) print "INFO /content/Dickens-1.0/data/.info"; else print $0}' /content/Dickens-1.0/registry/dickens.TMP > /content/Dickens-1.0/registry/dickens
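
Because the two `awk` calls overwrite lines 10 and 12 regardless of their content, it can be worth checking that the `HOME` and `INFO` entries now hold the intended paths. The following is a minimal sketch of such a check; the helper name `registry_paths` is hypothetical and the sample text is an abbreviated stand-in for the real registry file.

```python
def registry_paths(text: str) -> dict:
    """Collect the HOME and INFO entries of a CWB registry file."""
    found = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2 and parts[0] in ("HOME", "INFO"):
            found[parts[0]] = parts[1].strip()
    return found


# Abbreviated, hypothetical registry content for illustration only.
sample_registry = """##
## registry entry for corpus DICKENS
##
HOME /content/Dickens-1.0/data
INFO /content/Dickens-1.0/data/.info
"""
print(registry_paths(sample_registry))
```

In the notebook, the text would be read from `/content/Dickens-1.0/registry/dickens` before the file is moved.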

In [13]:
!mv /content/Dickens-1.0/registry/dickens /usr/local/share/cwb/registry

### Updating the path (only needed if installing into a non-standard location)


In [None]:
# !echo $PATH

In [None]:
# %env PATH=/usr/local/cwb-3.4.22/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin:/opt/bin

In [None]:
# !echo $PATH

In [None]:
# %env PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin:/opt/bin
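
The commented-out `%env` lines hard-code the complete `PATH`, which may differ between Colab images. A minimal sketch that only prepends the CWB directory to whatever `PATH` is currently set; the helper name `prepend_to_path` is hypothetical, and the directory is the one from the commented-out cell above.

```python
import os


def prepend_to_path(directory: str, current: str) -> str:
    """Return `current` with `directory` in front, unless it is already listed."""
    parts = [p for p in current.split(os.pathsep) if p]
    if directory in parts:
        return os.pathsep.join(parts)
    return os.pathsep.join([directory] + parts)


# Only needed for a non-standard installation location.
os.environ["PATH"] = prepend_to_path(
    "/usr/local/cwb-3.4.22/bin", os.environ.get("PATH", "")
)
```

Shell commands started afterwards from the same kernel inherit the modified value.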

In [14]:
!pwd

/content


### Testing interactive Corpus Query Processor (CQP)


In [None]:
# try these commands in the interactive prompt (just copy and paste them):
# DICKENS;
# "question";
# q
# exit;
!cqp -e

In [None]:
!cwb-describe-corpus -h

In [None]:
!cwb-describe-corpus -s dickens

In [None]:
# !cwb-describe-corpus -s -r registry dickens

In [18]:
%cd /content/

/content


## Installing the Python interface to CWB
The interface is provided by the `cwb-ccc` package.

In [None]:
!python -m pip install cwb-ccc

## pandas versions are incompatible
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.

google-colab 1.0.0 requires pandas~=1.1.0; python_version >= "3.0", but you have pandas 1.3.5 which is incompatible.

Successfully installed association-measures-0.2.0 cwb-ccc-0.10.1 pandas-1.3.5 pyyaml-6.0 unidecode-1.3.2

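- As a compromise, pandas is pinned to 1.1.5: this satisfies google-colab, and cwb-ccc has worked with it so far despite its declared requirement of pandas>=1.2.0.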

In [20]:
!pip show pandas

Name: pandas
Version: 1.3.5
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: The Pandas Development Team
Author-email: pandas-dev@python.org
License: BSD-3-Clause
Location: /usr/local/lib/python3.7/dist-packages
Requires: pytz, numpy, python-dateutil
Required-by: xarray, vega-datasets, statsmodels, sklearn-pandas, seaborn, pymc3, plotnine, pandas-profiling, pandas-gbq, pandas-datareader, mlxtend, mizani, holoviews, gspread-dataframe, google-colab, fix-yahoo-finance, fbprophet, fastai, cwb-ccc, cufflinks, cmdstanpy, association-measures, arviz, altair


In [21]:
!pip install pandas==1.1.5
# afterwards, click the [restart runtime] button so that the downgraded pandas is loaded

Collecting pandas==1.1.5
  Downloading pandas-1.1.5-cp37-cp37m-manylinux1_x86_64.whl (9.5 MB)
     |████████████████████████████████| 9.5 MB 12.1 MB/s
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.3.5
    Uninstalling pandas-1.3.5:
      Successfully uninstalled pandas-1.3.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cwb-ccc 0.10.1 requires pandas>=1.2.0, but you have pandas 1.1.5 which is incompatible.
Successfully installed pandas-1.1.5
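
The two pip messages describe requirements that no single pandas version can satisfy: `pandas~=1.1.0` (google-colab) means at least 1.1.0 but below 1.2.0, whereas cwb-ccc declares `pandas>=1.2.0`. A minimal sketch that makes this explicit with plain tuple comparison; the function names are hypothetical, and the parser assumes purely numeric version strings.

```python
def parse(version: str) -> tuple:
    """Turn '1.1.5' into (1, 1, 5); assumes purely numeric components."""
    return tuple(int(part) for part in version.split("."))


def satisfies_colab(version: str) -> bool:
    """pandas~=1.1.0 is equivalent to >=1.1.0 and <1.2.0."""
    return (1, 1, 0) <= parse(version) < (1, 2, 0)


def satisfies_cwb_ccc(version: str) -> bool:
    """pandas>=1.2.0, as reported by pip for cwb-ccc 0.10.1."""
    return parse(version) >= (1, 2, 0)


for candidate in ("1.1.5", "1.3.5"):
    print(candidate, satisfies_colab(candidate), satisfies_cwb_ccc(candidate))
# prints:
# 1.1.5 True False
# 1.3.5 False True
```

Whichever version is installed, pip will therefore report one of the two conflicts.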


### Experiments with cwb-ccc 
The examples below are adapted from the cwb-ccc project page (https://pypi.org/project/cwb-ccc/).


In [None]:
from ccc import Corpora
corpora = Corpora(registry_path="/usr/local/share/cwb/registry/")
# corpora = Corpora("/content/Dickens-1.0/registry")
print(corpora)
corpora.show()  # returns a DataFrame

In [None]:
corpus = corpora.activate(corpus_name="DICKENS")

In [2]:
from ccc import Corpus
corpus = Corpus(
  corpus_name="DICKENS",
  registry_path="/usr/local/share/cwb/registry/"
)

In [3]:
query = r'[word="[A-Z0-9][A-Z0-9][A-Z0-9]+"]'
dump = corpus.query(query)
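
In CQP, the regular expression inside `[word="..."]` has to match the whole token, not just part of it. The effect of the pattern above can be previewed in plain Python with `re.fullmatch`; the token list is a made-up example, and Python's `re` is assumed to behave like CQP's regex engine for this simple pattern.

```python
import re

# Same pattern as in the CQP query: three or more uppercase letters or digits.
pattern = re.compile(r"[A-Z0-9][A-Z0-9][A-Z0-9]+")

# Hypothetical tokens for illustration.
tokens = ["MR.", "XIV", "Scrooge", "1843", "A1", "CHRISTMAS"]

# fullmatch mirrors the whole-token matching of CQP attribute expressions.
matches = [token for token in tokens if pattern.fullmatch(token)]
print(matches)  # ['XIV', '1843', 'CHRISTMAS']
```

`MR.` is rejected because of the trailing full stop, and `A1` because it has only two characters.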

In [None]:
dump.df

In [None]:
corpus.attributes_available

In [6]:
query = r'"question"'
dump = corpus.query(query)

In [None]:
dump.df

In [8]:
dump = corpus.query(
  cqp_query=query,
  context=20,
  context_break='s'
)

In [None]:
dump.df

In [10]:
dump.set_context(
    context_left=5,
    context_right=10,
    context_break='s'
)
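
The effect of `context_left`, `context_right` and `context_break` can be pictured as a window of token positions around the match that is cut off at the boundaries of the enclosing `s` region. The following sketch illustrates that idea on a toy token list; it is not the cwb-ccc implementation, and the function name `context_window` and the sample data are hypothetical.

```python
def context_window(sentence_ids, match, left, right):
    """Return (start, end) positions around `match`, clipped to its sentence."""
    sid = sentence_ids[match]
    start = match
    # Extend to the left while within the limit and inside the same sentence.
    while start > 0 and match - start < left and sentence_ids[start - 1] == sid:
        start -= 1
    end = match
    # Extend to the right under the same two conditions.
    while (end < len(sentence_ids) - 1 and end - match < right
           and sentence_ids[end + 1] == sid):
        end += 1
    return start, end


tokens = "He spoke . Scrooge asked the question , because he did not know .".split()
sentence_ids = [0] * 3 + [1] * 11   # sentence number of each token
match = tokens.index("question")

start, end = context_window(sentence_ids, match, left=5, right=10)
print(" ".join(tokens[start:end + 1]))
```

With `left=5`, only three tokens are available to the left before the sentence boundary, so the window starts at `Scrooge`.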

In [None]:
dump.df

In [None]:
dump.breakdown()

In [None]:
dump.concordance()

In [None]:
dump.concordance(p_show=["word", "lemma"], s_show=["text_id"])

In [None]:
dump.concordance(form="kwic")

In [16]:
lines = dump.concordance(
    p_show=['word', 'pos', 'lemma'],
    form='dataframe'
)

In [None]:
lines.iloc[0]['dataframe']

In [None]:
type(lines.iloc[2]['dataframe'])

In [19]:
lines = dump.concordance(
    p_show=['word', 'pos', 'lemma'],
    form='dict'
)

In [None]:
lines.iloc[0]['dict']

In [21]:
lines = dump.concordance(
    p_show=['word', 'pos', 'lemma'],
    form='slots'
)

In [22]:
lines.iloc[0]

word                     Scrooge asked the question , because he did n'...
pos                             NN VBD DT NN , IN PP VBD RB VB IN DT NN RB
lemma                    Scrooge ask the question , because he do not k...
match..matchend_word                                              question
match..matchend_pos                                                     NN
match..matchend_lemma                                             question
Name: (5614, 5614), dtype: object

In [23]:
lines.iloc[0]['lemma']

'Scrooge ask the question , because he do not know whether a ghost so'

In [24]:
dump = corpus.query(
  cqp_query=r'@1[pos="D.*"] @2[pos="NN"] @3[word="question"]',
  context=None, 
  context_break='s', 
  match_strategy='longest'
)
lines = dump.concordance(form='dataframe')

In [None]:
lines.iloc[1]['dataframe']

In [26]:
lines = dump.concordance(form='dict')

In [None]:
lines.iloc[1]['dict']

In [28]:
lines = dump.concordance(
  form='slots', 
  p_show=['word', 'lemma'],
  slots={"article": [1], "np": [2, 3]}
)

In [None]:
lines

In [30]:
dump.correct_anchors({2: -2, 3: +1})
lines = dump.concordance(
  form='slots',
  slots={"art": [1],
  "np": [2, 3]}
)

In [None]:
lines

In [32]:
dump = corpus.query(
    '[lemma="question"]', 
    context=10, 
    context_break='s'
)

In [None]:
dump.collocates(order='log_likelihood')
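
For orientation, the log-likelihood score used for ordering collocates is commonly computed as Dunning's G² over a 2×2 contingency table of observed frequencies. The sketch below shows that textbook formula; the exact definition used by cwb-ccc (via the association-measures package) may differ in detail, and the function name and the example counts are hypothetical.

```python
import math


def log_likelihood(o11, o12, o21, o22):
    """Dunning's G2 for a non-empty 2x2 table of observed frequencies."""
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    observed = [o11, o12, o21, o22]
    # Expected frequencies under independence of rows and columns.
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    # Cells with zero observations contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)


# o11: item inside the context window, o12: other tokens inside the window,
# o21: item elsewhere in the corpus,  o22: other tokens elsewhere.
print(log_likelihood(30, 70, 10, 890))
```

A table whose rows and columns are independent yields a score of zero; larger values indicate a stronger deviation from independence.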

In [34]:
dump = corpus.query(
    '[lemma="answer"]', 
    context=10, 
    context_break='s'
)

In [None]:
dump.collocates(p_query=['lemma'], order='conservative_log_ratio')
# ['lemma', 'pos']

In [None]:
corpus.query('[lemma="question" & pos="N.*"]').breakdown()

In [None]:
# https://pypi.org/project/cwb-ccc/#anchored-queries

In [37]:
%tb

No traceback available to show.


In [None]:
# !export CWB_DIR=/usr/local/cwb-3.4.10
# /usr/local/cwb-3.4.22/bin

In [None]:
# !python --version

### To do:
1. Add corpus lemmatization and encoding steps.
2. Add generation of interesting collocations and their export as lists.
3. Add parallel corpus functionality.


## Building own corpus
Data: Europarl 8M EN (DE), file `ep_en_de.txt`, downloaded from
https://heibox.uni-heidelberg.de/f/0e1fcda2b7bc494d83b8/?dl=1

- Create a data directory where files in the binary CWB format will be stored. Here, we assume that this directory is called /corpora/data/example. If this directory already exists and contains corpus data (from a previous version), you should delete all files in the directory. NB: You need a separate data directory for each corpus you want to encode.
- Choose a registry directory, where all encoded corpora have to be registered to make them accessible to the CWB tools. It is recommended that you use the default registry directory /usr/local/share/cwb/registry. Otherwise, you will have to specify the path to your registry directory with a -r flag whenever you invoke one of the CWB tools (or set an appropriate environment variable, see below). In the example commands in this manual, we assume that you use the standard registry directory.
- The next step is to encode the corpus, i.e. convert the verticalized text to CWB binary format with the cwb-encode tool. Note that the command below has to be entered on a single line.


```
$ cwb-encode -d /corpora/data/example
                      -xsBC9 -c ascii -f example.vrt
                      -R /usr/local/share/cwb/registry/example
                      -P pos -P lemma -S s
```
from: https://cwb.sourceforge.io/files/CWB_Encoding_Tutorial.pdf
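
The "verticalized text" expected by `cwb-encode` has one token per line, with structural units such as sentences marked by XML-style tags on lines of their own. A minimal sketch of producing such input from plain sentences; the function name `verticalize` is hypothetical, and tokenization is naive whitespace splitting, so the input is assumed to be pre-tokenized.

```python
def verticalize(sentences):
    """Convert pre-tokenized sentences into one-token-per-line text with <s> tags."""
    lines = []
    for sentence in sentences:
        lines.append("<s>")
        lines.extend(sentence.split())  # naive: tokens are separated by spaces
        lines.append("</s>")
    return "\n".join(lines) + "\n"


print(verticalize(["Scrooge asked the question ."]))
```

Further positional attributes such as `pos` and `lemma` would be added as tab-separated columns after each token, matching the `-P` flags in the command above.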


In [None]:
!wget https://heibox.uni-heidelberg.de/f/0e1fcda2b7bc494d83b8/?dl=1
!mkdir EP


In [39]:
!mv index.html?dl=1 EP/ep_en_de.txt

In [40]:
!mkdir epende0en

In [42]:
!cwb-encode -d /content/epende0en -xsBC9 -c UTF8 -f /content/EP/ep_en_de.txt -R /usr/local/share/cwb/registry/epende0en
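
The CWB Encoding Tutorial cited above follows `cwb-encode` with an indexing step, `cwb-makeall`, which does not appear in this notebook. Its absence is one possible, unverified explanation when queries on a freshly encoded corpus return no matches. The sketch below only assembles the two command lines so that they can be inspected before running; the helper names are hypothetical, and the flags are the ones shown in this notebook and in the tutorial.

```python
import shlex


def encode_command(data_dir, vrt_file, registry_file, charset="utf8",
                   p_attrs=(), s_attrs=()):
    """Build the cwb-encode call used in this notebook as an argument list."""
    cmd = ["cwb-encode", "-d", data_dir, "-xsBC9", "-c", charset,
           "-f", vrt_file, "-R", registry_file]
    for name in p_attrs:      # extra positional attributes, e.g. pos, lemma
        cmd += ["-P", name]
    for name in s_attrs:      # structural attributes, e.g. s
        cmd += ["-S", name]
    return cmd


def makeall_command(corpus_id, registry_dir="/usr/local/share/cwb/registry"):
    """Build the indexing call; the corpus ID is given in upper case."""
    return ["cwb-makeall", "-r", registry_dir, "-V", corpus_id.upper()]


for cmd in (
    encode_command("/content/epende0en", "/content/EP/ep_en_de.txt",
                   "/usr/local/share/cwb/registry/epende0en"),
    makeall_command("epende0en"),
):
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # To execute: import subprocess; subprocess.run(cmd, check=True)
```

Keeping the commands as argument lists avoids shell quoting problems if a path contains spaces.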

In [60]:
!cwb-describe-corpus epende0en


Corpus: epende0en

description:    
registry file:  /usr/local/share/cwb/registry/epende0en
home directory: /content/epende0en/
info file:      /content/epende0en/.info
encoding:       utf8
size (tokens):  7957477

  1 positional attributes:
      word            

  0 structural attributes:
      

  0 alignment  attributes:
      




In [69]:
corpora = Corpora(registry_path="/usr/local/share/cwb/registry/")
# corpora = Corpora("/content/Dickens-1.0/registry")
print(corpora)
corpora.show()  # returns a DataFrame


registry path: "/usr/local/share/cwb/registry/"
cqp binary   : "cqp"
found 2 corpora:
              size
corpus            
DICKENS    3407085
EPENDE0EN  7957477


corpus,size
DICKENS,3407085
EPENDE0EN,7957477


In [70]:
corpus = corpora.activate(corpus_name="EPENDE0EN")

In [73]:
from ccc import Corpus
corpus = Corpus(
  corpus_name="EPENDE0EN",
  registry_path="/usr/local/share/cwb/registry/"
)

In [74]:
query = r'"value"'

In [67]:
query = '[word="the"]'

In [75]:
dump = corpus.query(query)

found 0 matches


In [77]:
dump = corpus.query(
    '[word="quest.*"]', 
    context=10
)

found 0 matches
