<a href="https://colab.research.google.com/github/iued-uni-heidelberg/corpustools/blob/main/cwb2021experimentsV08.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with CWB on Colab
Author: Bogdan Babych, IÜD, Heidelberg University

Adapting the CWB installation and its packages so that they run in the Colab environment

### Downloading packages and data

In [None]:
!wget https://heibox.uni-heidelberg.de/f/7f1e8929352b4cf4b13a/?dl=1

In [None]:
!mv index.html?dl=1 cwb-3.4.22-source.tar.gz

In [None]:
!tar xvfz cwb-3.4.22-source.tar.gz
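The download above lands as `index.html?dl=1` because the URL ends in `/?dl=1`, so it has to be renamed afterwards. As an alternative, the following minimal sketch saves a download directly under a chosen name. It uses only the Python standard library; the helper name `fetch` is hypothetical and not part of the original notebook.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch(url, target):
    """Download `url` and store the response body under the name `target`."""
    target = Path(target)
    # create the destination directory if it does not exist yet
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Usage in the notebook (not executed here):
# fetch("https://heibox.uni-heidelberg.de/f/7f1e8929352b4cf4b13a/?dl=1",
#       "cwb-3.4.22-source.tar.gz")
```

With this helper the separate `mv` step would no longer be needed; the `tar` call stays the same.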

### Installing the lexer and parser generators 'flex' and 'bison'

In [None]:
!apt-get install flex bison

### Replacing the configuration file
Setting the correct platform and the 'standard' installation location (otherwise the Python bindings do not work)

In [None]:
!wget https://heibox.uni-heidelberg.de/f/67bb38a210064bc5961e/?dl=1
!mv /content/cwb-3.4.22/config.mk /content/cwb-3.4.22/config.mk.old.01
!mv index.html?dl=1 /content/cwb-3.4.22/config.mk

In [None]:
# alternative: set PLATFORM and SITE by overwriting lines 42 and 63 of config.mk
!awk '{ if (NR == 42) print "PLATFORM=linux-64"; else print $0}' /content/cwb-3.4.22/config.mk > /content/cwb-3.4.22/config.mk.TMP
!awk '{ if (NR == 63) print "SITE=standard"; else print $0}' /content/cwb-3.4.22/config.mk.TMP > /content/cwb-3.4.22/config.mk
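Overwriting fixed line numbers breaks silently if a different CWB release shifts the lines in `config.mk`. The sketch below replaces an assignment by variable name instead. The function name `set_make_variable` and the sample text are hypothetical; only the two target values `PLATFORM=linux-64` and `SITE=standard` come from the cell above.

```python
import re


def set_make_variable(text, name, value):
    """Replace the first uncommented `name=...` assignment in makefile-style text."""
    pattern = re.compile(
        r"^([ \t]*" + re.escape(name) + r"[ \t]*=).*$", flags=re.MULTILINE
    )
    # a function is used as replacement so that `value` is inserted literally
    new_text, count = pattern.subn(lambda m: m.group(1) + value, text, count=1)
    if count == 0:
        raise ValueError(f"no assignment to {name} found")
    return new_text


# Hypothetical stand-in for the relevant part of config.mk
sample = "# platform\nPLATFORM=darwin\n# site\nSITE=beta-install\n"
patched = set_make_variable(sample, "PLATFORM", "linux-64")
patched = set_make_variable(patched, "SITE", "standard")
print(patched)
```

To apply this to the real file, one would read `/content/cwb-3.4.22/config.mk`, pass its text through the function twice, and write the result back.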

### Changing into installation directory and running installation scripts

In [None]:
%cd /content/cwb-3.4.22/

In [None]:
!pwd

(this will be the default registry directory for the 'standard' CWB installation):

In [None]:
!mkdir -p /usr/local/share/cwb/registry/

In [None]:
!sudo ./install-scripts/config-basic
!sudo ./install-scripts/install-linux

In [None]:
%cd /content/

/content


### Downloading and relocating the registry file of a sample corpus
The registry file is placed into the standard CWB registry location

In [None]:
!wget https://heibox.uni-heidelberg.de/f/dd3538603aa84dd09a76/?dl=1
!mv index.html?dl=1 Dickens-1.0.tar.gz
!tar xvzf Dickens-1.0.tar.gz

In [None]:
!cp /content/Dickens-1.0/registry/dickens /content/Dickens-1.0/registry/dickens.old.01
!awk '{ if (NR == 10) print "HOME /content/Dickens-1.0/data"; else print $0}' /content/Dickens-1.0/registry/dickens > /content/Dickens-1.0/registry/dickens.TMP
!awk '{ if (NR == 12) print "INFO /content/Dickens-1.0/data/.info"; else print $0}' /content/Dickens-1.0/registry/dickens.TMP > /content/Dickens-1.0/registry/dickens

In [None]:
!mv /content/Dickens-1.0/registry/dickens /usr/local/share/cwb/registry

### Updating path (only needed if installing into a non-standard location)


In [None]:
# !echo $PATH

In [None]:
# %env PATH=/usr/local/cwb-3.4.22/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin:/opt/bin

In [None]:
# !echo $PATH

In [None]:
# %env PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin:/opt/bin

In [None]:
!pwd

/content


### Testing interactive Corpus Query Processor (CQP)


In [None]:
# try these commands in the interactive prompt (just copy and paste them):
# DICKENS;
# "question";
# q
# exit;
!cqp -e

In [None]:
!cwb-describe-corpus -h

In [None]:
!cwb-describe-corpus -s dickens

In [None]:
# !cwb-describe-corpus -s -r registry dickens

In [None]:
%cd /content/

/content


## Installing the Python interface to CWB
The interface is provided by the cwb-ccc package.

In [None]:
!python -m pip install cwb-ccc

## pandas versions are incompatible
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.

google-colab 1.0.0 requires pandas~=1.1.0; python_version >= "3.0", but you have pandas 1.3.5 which is incompatible.

Successfully installed association-measures-0.2.0 cwb-ccc-0.10.1 pandas-1.3.5 pyyaml-6.0 unidecode-1.3.2

- As a compromise we pin pandas to 1.1.5 (next cells); this has worked for both packages so far.

In [None]:
!pip show pandas

In [None]:
!pip install pandas==1.1.5
# then click the [restart runtime] button so that the downgraded pandas is loaded

Collecting pandas==1.1.5
  Downloading pandas-1.1.5-cp37-cp37m-manylinux1_x86_64.whl (9.5 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.3.5
    Uninstalling pandas-1.3.5:
      Successfully uninstalled pandas-1.3.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cwb-ccc 0.10.1 requires pandas>=1.2.0, but you have pandas 1.1.5 which is incompatible.
Successfully installed pandas-1.1.5


### Experiments with cwb-ccc 
Examples adapted from the cwb-ccc project page: https://pypi.org/project/cwb-ccc/


In [None]:
from ccc import Corpora
corpora = Corpora(registry_path="/usr/local/share/cwb/registry/")
# corpora = Corpora("/content/Dickens-1.0/registry")
print(corpora)
corpora.show()  # returns a DataFrame

In [None]:
corpus = corpora.activate(corpus_name="DICKENS")

In [None]:
from ccc import Corpus
corpus = Corpus(
  corpus_name="DICKENS",
  registry_path="/usr/local/share/cwb/registry/"
)

In [None]:
query = r'[word="[A-Z0-9][A-Z0-9][A-Z0-9]+"]'
dump = corpus.query(query)

In [None]:
dump.df

In [None]:
corpus.attributes_available

In [None]:
query = r'"question"'
dump = corpus.query(query)

In [None]:
dump.df

In [None]:
dump = corpus.query(
  cqp_query=query,
  context=20,
  context_break='s'
)

In [None]:
dump.df

In [None]:
dump.set_context(
    context_left=5,
    context_right=10,
    context_break='s'
)
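The `context_left`, `context_right` and `context_break` parameters can be read as a window around each match that is cut off at the boundaries of the enclosing region (here a sentence, `s`). The following toy function sketches that reading with plain corpus positions. It is an illustrative model under that assumption, not the cwb-ccc implementation, and the name `context_window` and the sentence boundaries in the example are hypothetical.

```python
def context_window(match, matchend, left, right, region=None):
    """Return (context, contextend) for one match, as corpus positions.

    `region` is an optional (start, end) pair, e.g. the sentence that
    contains the match; the window is clipped to it.
    """
    start = max(match - left, 0)   # a window cannot start before the corpus
    end = matchend + right
    if region is not None:
        region_start, region_end = region
        start = max(start, region_start)
        end = min(end, region_end)
    return start, end


# 5 tokens to the left, 10 to the right, without and with a sentence boundary
print(context_window(5614, 5614, 5, 10))
print(context_window(5614, 5614, 5, 10, region=(5612, 5630)))
```

The second call shows the effect of the break: the left context shrinks to the start of the hypothetical sentence while the right context is unaffected.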

In [None]:
dump.df

In [None]:
dump.breakdown()

In [None]:
dump.concordance()

In [None]:
dump.concordance(p_show=["word", "lemma"], s_show=["text_id"])

In [None]:
dump.concordance(form="kwic")

In [None]:
lines = dump.concordance(
    p_show=['word', 'pos', 'lemma'],
    form='dataframe'
)

In [None]:
lines.iloc[0]['dataframe']

In [None]:
type(lines.iloc[2]['dataframe'])

In [None]:
lines = dump.concordance(
    p_show=['word', 'pos', 'lemma'],
    form='dict'
)

In [None]:
lines.iloc[0]['dict']

In [None]:
lines = dump.concordance(
    p_show=['word', 'pos', 'lemma'],
    form='slots'
)

In [None]:
lines.iloc[0]

word                     Scrooge asked the question , because he did n'...
pos                             NN VBD DT NN , IN PP VBD RB VB IN DT NN RB
lemma                    Scrooge ask the question , because he do not k...
match..matchend_word                                              question
match..matchend_pos                                                     NN
match..matchend_lemma                                             question
Name: (5614, 5614), dtype: object

In [None]:
lines.iloc[0]['lemma']

'Scrooge ask the question , because he do not know whether a ghost so'

In [None]:
dump = corpus.query(
  cqp_query=r'@1[pos="D.*"] @2[pos="NN"] @3[word="question"]',
  context=None, 
  context_break='s', 
  match_strategy='longest'
)
lines = dump.concordance(form='dataframe')

In [None]:
lines.iloc[1]['dataframe']

In [None]:
lines = dump.concordance(form='dict')

In [None]:
lines.iloc[1]['dict']

In [None]:
lines = dump.concordance(
  form='slots', 
  p_show=['word', 'lemma'],
  slots={"article": [1], "np": [2, 3]}
)

In [None]:
lines

In [None]:
dump.correct_anchors({2: -2, 3: +1})
lines = dump.concordance(
  form='slots',
  slots={"art": [1],
  "np": [2, 3]}
)

In [None]:
lines

In [None]:
dump = corpus.query(
    '[lemma="question"]', 
    context=10, 
    context_break='s'
)

In [None]:
dump.collocates(order='log_likelihood')

In [None]:
dump = corpus.query(
    '[lemma="answer"]', 
    context=10, 
    context_break='s'
)

In [None]:
dump.collocates(p_query=['lemma'], order='conservative_log_ratio')
# ['lemma', 'pos']

In [None]:
corpus.query('[lemma="question" & pos="N.*"]').breakdown()

In [None]:
# https://pypi.org/project/cwb-ccc/#anchored-queries

In [None]:
%tb

No traceback available to show.


In [None]:
# !export CWB_DIR=/usr/local/cwb-3.4.10
# /usr/local/cwb-3.4.22/bin

In [None]:
# !python --version

### todo:
1. to add corpus lemmatization & encoding parts
2. to add generation of interesting collocations, exporting them as lists
3. to add parallel corpus functionality


## Building own corpus
Europarl 8M EN (DE):
https://heibox.uni-heidelberg.de/f/0e1fcda2b7bc494d83b8/?dl=1
ep_en_de.txt

- Create a data directory where files in the binary CWB format will be stored. Here, we assume that this directory is called /corpora/data/example. If this directory already exists and contains corpus data (from a previous version), you should delete all files in the directory. NB: You need a separate data directory for each corpus you want to encode.
- Choose a registry directory, where all encoded corpora have to be registered to make them accessible to the CWB tools. It is recommended that you use the default registry directory /usr/local/share/cwb/registry. Otherwise, you will have to specify the path to your registry directory with a -r flag whenever you invoke one of the CWB tools (or set an appropriate environment variable, see below). In the example commands in this manual, we assume that you use the standard registry directory.
- The next step is to encode the corpus, i.e. convert the verticalized text to CWB binary format with the cwb-encode tool. Note that the command below has to be entered on a single line.


```
$ cwb-encode -d /corpora/data/example
                      -xsBC9 -c ascii -f example.vrt
                      -R /usr/local/share/cwb/registry/example
                      -P pos -P lemma -S s
```
from: https://cwb.sourceforge.io/files/CWB_Encoding_Tutorial.pdf
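The tutorial command expects "verticalized" input: one token per line, with structural markup such as `<s>` and `</s>` on lines of their own. The sketch below produces such a file body from plain sentences. It assumes whitespace tokenization and emits only the word form; `pos` and `lemma` would be additional tab-separated columns supplied by a tagger, which is not shown. The function name `to_vrt` is hypothetical.

```python
from xml.sax.saxutils import escape


def to_vrt(sentences):
    """Turn an iterable of plain-text sentences into verticalized text."""
    lines = []
    for sentence in sentences:
        tokens = sentence.split()
        if not tokens:
            continue  # skip empty input lines
        lines.append("<s>")
        # escape &, < and > so that tokens cannot be mistaken for XML tags
        lines.extend(escape(token) for token in tokens)
        lines.append("</s>")
    return "".join(line + "\n" for line in lines)


print(to_vrt(["the question is open", "we agree"]))
```

Escaping the tokens is a design choice that pairs with the `-x` flag of `cwb-encode` in the command above; without that flag the entities would be stored literally.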


Cleanup (only needed when the corpus is re-encoded from scratch)

In [None]:
rm -r EP

In [None]:
rm -r corpepen/

In [None]:
rm /usr/local/share/cwb/registry/corpepen

Downloading the Europarl (EP) data

In [None]:
!wget https://heibox.uni-heidelberg.de/f/0e1fcda2b7bc494d83b8/?dl=1

In [None]:
!mkdir EP

In [None]:
!mv index.html?dl=1 EP/ep_en_de1.txt

In [None]:
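# lowercase every line of the downloaded file; 'with' closes both files,
# so the output is fully written before cwb-encode reads it
with open('EP/ep_en_de1.txt', 'r') as FIn, open('EP/ep_en_de.txt', 'w') as FOut:
    for SLine in FIn:
        SLine = SLine.strip().lower()
        FOut.write(SLine + '\n')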


In [None]:
!head --lines=100 EP/ep_en_de.txt >ep_en_de_100.txt

In [None]:
!mkdir corpepen

Encoding the corpus

In [None]:
!cwb-encode -d /content/corpepen -xsBC9 -c utf8 -f /content/EP/ep_en_de.txt -R /usr/local/share/cwb/registry/corpepen 2>&1

In [None]:
!cwb-makeall --help

In [None]:
!cwb-makeall -V CORPEPEN
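The encoding and indexing calls above are tied together by the corpus name: the registry file is written in lowercase, while `cwb-makeall` is given the uppercase corpus ID. The sketch below builds both command lines from one name so the two cannot drift apart. The flags are the ones used in this notebook (plus `-r` for the registry directory, as described in the tutorial excerpt); the function name `encode_commands` is hypothetical.

```python
import shlex


def encode_commands(name, data_dir, text_file,
                    registry_dir="/usr/local/share/cwb/registry"):
    """Build the cwb-encode and cwb-makeall calls for one corpus."""
    encode = [
        "cwb-encode", "-d", data_dir, "-xsBC9", "-c", "utf8",
        "-f", text_file,
        "-R", f"{registry_dir}/{name.lower()}",  # registry file name is lowercase
    ]
    # the corpus ID passed to cwb-makeall is uppercase
    makeall = ["cwb-makeall", "-r", registry_dir, "-V", name.upper()]
    return [encode, makeall]


for argv in encode_commands("corpepen", "/content/corpepen",
                            "/content/EP/ep_en_de.txt"):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Each returned list could be passed to `subprocess.run(argv, check=True)` in place of the `!` shell cells.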

In [None]:
%tb

No traceback available to show.


In [None]:
!cwb-describe-corpus corpepen

In [None]:
# try these commands in the interactive prompt (just copy and paste them):
# CORPEPEN;
# "question";
# "of";
# q
# exit;
!cqp -e

[no corpus]> CORPEPEN;
CORPEPEN> "question";
      122: oncern inadmissibility my <question> relate to something that
      513: ing on this matter on the <question> of the strategic plan of
     4825: st the european union the <question> be who prosecute that be
     5003: e at the moment the whole <question> as to whether we have th
     5046: ission should tackle this <question> we be aware that it be a
     7039: say in her remark but the <question> be how long will it take
    10935: ue relate to and surround <question> of liability despite my 
    11144: behrendt have ask me this <question> before and i say i think
    11184:  next time you ask me the <question> i will be in a position 
    11717: mr florenz also raise the <question> of anonymity i be happy 
    11899:  tooth i suspect that the <question> b

In [None]:
from ccc import Corpora
corpora = Corpora(registry_path="/usr/local/share/cwb/registry/")
# corpora = Corpora("/content/Dickens-1.0/registry")
print(corpora)
corpora.show()  # returns a DataFrame


registry path: "/usr/local/share/cwb/registry/"
cqp binary   : "cqp"
found 1 corpora:
             size
corpus           
CORPEPEN  7281561


| corpus | size |
|---|---|
| CORPEPEN | 7281561 |


In [None]:
corpus = corpora.activate(corpus_name="CORPEPEN")

In [None]:
from ccc import Corpus
corpus = Corpus(
  corpus_name="CORPEPEN",
  registry_path="/usr/local/share/cwb/registry/"
)

In [None]:
query = r'"value"'

In [None]:
query = '[word="the"]'

In [None]:
dump = corpus.query(query)

In [None]:
dump = corpus.query(
    '[word="quest.*"]', 
    context=10
)

In [None]:
lines = dump.concordance(
    form='dict'
)

In [None]:
lines

| match | matchend | context | contextend | dict |
|---|---|---|---|---|
| 2819 | 2819 | 2799 | 2839 | {'cpos': [2799, 2800, 2801, 2802, 2803, 2804, ... |
| 4589 | 4589 | 4569 | 4609 | {'cpos': [4569, 4570, 4571, 4572, 4573, 4574, ... |
| 15705 | 15705 | 15685 | 15725 | {'cpos': [15685, 15686, 15687, 15688, 15689, 1... |
| 22098 | 22098 | 22078 | 22118 | {'cpos': [22078, 22079, 22080, 22081, 22082, 2... |
| 23642 | 23642 | 23622 | 23662 | {'cpos': [23622, 23623, 23624, 23625, 23626, 2... |
| ... | ... | ... | ... | ... |
| 372781 | 372781 | 372761 | 372801 | {'cpos': [372761, 372762, 372763, 372764, 3727... |
| 385062 | 385062 | 385042 | 385082 | {'cpos': [385042, 385043, 385044, 385045, 3850... |
| 386601 | 386601 | 386581 | 386621 | {'cpos': [386581, 386582, 386583, 386584, 3865... |
| 386658 | 386658 | 386638 | 386678 | {'cpos': [386638, 386639, 386640, 386641, 3866... |


In [None]:
dump = corpus.query(
  cqp_query=query,
  context=5
)

In [None]:
dump.df

| match | matchend | context | contextend |
|---|---|---|---|
| 2819 | 2819 | 2809 | 2829 |
| 4589 | 4589 | 4579 | 4599 |
| 15705 | 15705 | 15695 | 15715 |
| 22098 | 22098 | 22088 | 22108 |
| 23642 | 23642 | 23632 | 23652 |
| ... | ... | ... | ... |
| 7932213 | 7932213 | 7932203 | 7932223 |
| 7932223 | 7932223 | 7932213 | 7932233 |
| 7949447 | 7949447 | 7949437 | 7949457 |
| 7951869 | 7951869 | 7951859 | 7951879 |


In [None]:
dump.concordance(form="kwic", cut_off=None)

| match | matchend | left_word | node_word | right_word |
|---|---|---|---|---|
| 2819 | 2819 | and addition relate to achieve what be know as " | value | for money " indicator in the grant-giving proc... |
| 4589 | 4589 | the figure for last year now show that the total | value | of merger in the European area be EUR @card@ t... |
| 15705 | 15705 | I be not sure that it would add much practical | value | to our effort . it would be unlikely to have |
| 22098 | 22098 | as a former official - and I think that the | value | of that be show by his ability to tackle the |
| 23642 | 23642 | we be at one in the belief that there be | value | in define the well possible mechanism for this... |
| ... | ... | ... | ... | ... |
| 7932213 | 7932213 | Kong , we would not be true to our own | value | , if we do not speak out when those value |
| 7932223 | 7932223 | value , if we do not speak out when those | value | appear to us to be under threat . What sensible |
| 7949447 | 7949447 | to sell that beef into our market , at good | value | to our consumer , and if we be to put |
| 7951869 | 7951869 | Kingdom , we have a long political debate over... | value | of resale price maintenance apply to book . th... |


In [None]:
# alternative ordering: dump.collocates(order='log_likelihood', cut_off=None)
# compute the collocation table once, keep it for export, and display it
dfColl = dump.collocates(order='mutual_information', cut_off=None)
dfColl


specfied p-attribute(s) ("lemma") not available
falling back to primary layer


item,O11,O12,O21,O22,E11,E12,E21,E22,z_score,t_score,log_likelihood,simple_ll,dice,log_ratio,mutual_information,local_mutual_information,conservative_log_ratio,ipm,ipm_reference,ipm_expected,ipm_reference_expected,in_nodes,marginal
parametric,2,21110,0,7934225,0.005308,21111.994692,1.994692,7.934223e+06,27.379501,1.410460,23.727216,19.737642,0.000189,10.553882,2.576129,5.152258,0.398702,94.732853,0.000000,0.251404,0.251404,0,2
mg/litre,2,21110,0,7934225,0.005308,21111.994692,1.994692,7.934223e+06,27.379501,1.410460,23.727216,19.737642,0.000189,10.553882,2.576129,5.152258,0.398702,94.732853,0.000000,0.251404,0.251404,0,2
calorific,2,21110,0,7934225,0.005308,21111.994692,1.994692,7.934223e+06,27.379501,1.410460,23.727216,19.737642,0.000189,10.553882,2.576129,5.152258,0.398702,94.732853,0.000000,0.251404,0.251404,0,2
added,202,20910,52,7934173,0.674069,21111.325931,253.325931,7.933972e+06,245.215241,14.165243,2141.140925,1901.234953,0.018909,10.511654,2.476647,500.282665,9.513875,9568.018189,6.553885,31.928251,31.928251,0,254
tax-based,2,21110,1,7934224,0.007961,21111.992039,2.992039,7.934222e+06,22.325527,1.408584,19.913445,18.121089,0.000189,9.553882,2.400038,4.800076,1.687763,94.732853,0.126036,0.377105,0.377105,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
purpose,3,21109,1121,7933104,2.982889,21109.017111,1121.017111,7.933104e+06,0.009907,0.009879,0.000098,0.000098,0.000270,0.008274,0.002484,0.007452,0.000000,142.099280,141.286641,141.288798,141.288798,0,1124
among,3,21109,1121,7933104,2.982889,21109.017111,1121.017111,7.933104e+06,0.009907,0.009879,0.000098,0.000098,0.000270,0.008274,0.002484,0.007452,0.000000,142.099280,141.286641,141.288798,141.288798,0,1124
that,374,20738,139828,7794397,372.070300,20739.929700,139829.929700,7.794395e+06,0.100041,0.099782,0.010198,0.009991,0.004637,0.007483,0.002247,0.840228,0.000000,17715.043577,17623.397370,17623.640582,17623.640582,0,140202
yes,2,21110,749,7933476,1.993016,21110.006984,749.006984,7.933476e+06,0.004947,0.004939,0.000025,0.000024,0.000183,0.005060,0.001519,0.003039,0.000000,94.732853,94.401154,94.402035,94.402035,0,751
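The `E11` and `mutual_information` columns can be checked by hand from the observed frequencies. The sketch below assumes the usual 2x2 contingency layout (O11: item inside the context windows, O12: other tokens inside, O21: item outside, O22: other tokens outside) and a base-10 logarithm of O11/E11. That formula is an assumption inferred from the numbers in the table above, not taken from the library documentation, and the function name `expected_and_mi` is hypothetical.

```python
import math


def expected_and_mi(o11, o12, o21, o22):
    """Return (E11, log10(O11 / E11)) for a 2x2 contingency table."""
    n = o11 + o12 + o21 + o22              # total number of tokens
    # expected co-occurrence frequency if item and node were independent
    e11 = (o11 + o12) * (o11 + o21) / n
    return e11, math.log10(o11 / e11)


# Observed frequencies for "added", copied from the table above
e11, mi = expected_and_mi(202, 20910, 52, 7934173)
print(round(e11, 6), round(mi, 6))
```

For this row the result should agree with the `E11` and `mutual_information` values printed in the table, which is what suggests the base-10 reading.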


In [None]:
dfColl.to_excel(r'value_collstat.xlsx', index = True)

In [None]:
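def query2colist(qry, ord='mutual_information', cut=100, minFrq=5, file=None, values=None):
    """Return up to `cut` collocates of `qry` with a corpus frequency of at least `minFrq`.

    Uses the global `corpus`. With values=True each entry is a tuple
    (word, O11, O21, O11 + O21, mutual_information); otherwise just the word.
    If `file` is given, the entries are also written to it, one per line.
    """
    retList = []
    print(qry)
    dump = corpus.query(cqp_query=qry, context=5)
    # rows are sorted by the association measure named in `ord`
    dfColl = dump.collocates(order=ord, cut_off=None)
    LIndex = dfColl.index.to_numpy().tolist()
    # note: the reported score is always mutual_information, whatever `ord` is
    LValues = dfColl[['O11', 'O21', 'mutual_information']].to_numpy().tolist()

    for word, (o11, o21, mutual_information) in zip(LIndex, LValues):
        oAll = o11 + o21  # frequency of the collocate inside plus outside the context windows
        if oAll < minFrq:
            continue
        if values:
            retList.append((word, int(o11), int(o21), int(oAll), mutual_information))
        else:
            retList.append(word)
        if len(retList) >= cut:
            break
    if file:
        # 'with' closes the file, so all lines are written before returning
        with open(file, 'w') as FOut:
            for el in retList:
                FOut.write(str(el) + '\n')
    return retList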


In [None]:
rList = query2colist("'however'", file='however_mi100_minFrq5.txt', values=True) 

specfied p-attribute(s) ("lemma") not available
falling back to primary layer


'however'


In [None]:
rList

In [None]:
query = r'"value"'
dump = corpus.query(
  cqp_query=query,
  context=5
)
dfColl = dump.collocates(order='mutual_information', cut_off=100)
LIndex = dfColl.index.to_numpy().tolist()
# 'with' closes the file, so the list is completely written to disk
with open('value_mi100_collstat.txt', 'w') as FOut:
    for el in LIndex:
        FOut.write(el + '\n')
# dfColl.to_excel(r'value_collstat.xlsx', index = True)


specfied p-attribute(s) ("lemma") not available
falling back to primary layer


In [None]:
query = r'"however"'
dump = corpus.query(
  cqp_query=query,
  context=5
)
dfColl = dump.collocates(order='mutual_information', cut_off=None)
dfColl.to_excel(r'however_collstat.xlsx', index = True)

specfied p-attribute(s) ("lemma") not available
falling back to primary layer


In [None]:
query = r'"commission"'
dump = corpus.query(
  cqp_query=query,
  context=5
)
dfColl = dump.collocates(order='mutual_information', cut_off=None)
dfColl.to_excel(r'commission_collstat.xlsx', index = True)

specfied p-attribute(s) ("lemma") not available
falling back to primary layer


In [None]:
query = r'"conclusion"'
dump = corpus.query(
  cqp_query=query,
  context=5
)
dfColl = dump.collocates(order='mutual_information', cut_off=None)
dfColl.to_excel(r'conclusion_collstat.xlsx', index = True)

specfied p-attribute(s) ("lemma") not available
falling back to primary layer


In [None]:
# lst = dfColl[['mutual_information', 'log_likelihood']].items()
LIndex = dfColl.index.to_numpy().tolist()

In [None]:
# columns to export: observed (O*) and expected (E*) frequencies, then the association measures
LValues = dfColl[['O11', 'O12', 'O21', 'O22', 'E11', 'E12', 'E21', 'E22', 'z_score', 't_score', 'log_likelihood', 'simple_ll', 'dice', 'log_ratio', 'mutual_information', 'local_mutual_information', 'conservative_log_ratio']].to_numpy().tolist()

In [None]:
print(len(LIndex), len(LValues))

710 710


In [None]:
LAll = list(zip(LIndex, LValues))

In [None]:
LAll

In [None]:
lex = 'value'

In [None]:
# write one comma-separated line per collocate; 'with' closes the file afterwards
with open(lex + '_collstat.txt', 'w') as FOut:
    FOut.write("'Collocation','O11','O12','O21','O22','E11','E12','E21','E22','z_score','t_score','log_likelihood','simple_ll','dice','log_ratio','mutual_information','local_mutual_information','conservative_log_ratio'\n")
    for Word, LVals in LAll:
        SVal = ','.join(str(Val) for Val in LVals)
        FOut.write("'" + Word + "'," + SVal + '\n')
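Hand-built comma-separated lines become ambiguous as soon as a collocate itself contains a comma or a quote character, which can happen with punctuation tokens. The sketch below uses the standard `csv` module, which quotes such fields automatically. The helper name `write_collocates` and the two sample rows are hypothetical; in the notebook the rows would come from `LAll`.

```python
import csv
import io


def write_collocates(rows, header, stream):
    """Write (word, [values...]) pairs as CSV, quoting fields where needed."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["Collocation"] + list(header))
    for word, values in rows:
        writer.writerow([word] + list(values))


# Hypothetical two-row example; the second "word" is a comma token
buffer = io.StringIO()
write_collocates([("added", [202, 52]), (",", [3, 1121])], ["O11", "O21"], buffer)
print(buffer.getvalue())
```

When writing to a real file, opening it with `open(path, "w", newline="")` is the usual pairing with `csv.writer`.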

In [None]:
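# iterate over the two score columns; `key` is the column name, `value` the column as a Series
for key, value in dfColl[['mutual_information', 'log_likelihood']].items():
  print(key, value)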

mutual_information item
parametric    2.576129
mg/litre      2.576129
calorific     2.576129
added         2.476647
tax-based     2.400038
                ...   
purpose       0.002484
among         0.002484
that          0.002247
yes           0.001519
examine       0.000941
Name: mutual_information, Length: 710, dtype: float64
log_likelihood item
parametric      23.727216
mg/litre        23.727216
calorific       23.727216
added         2141.140925
tax-based       19.913445
                 ...     
purpose          0.000098
among            0.000098
that             0.010198
yes              0.000025
examine          0.000009
Name: log_likelihood, Length: 710, dtype: float64
