Mika Hämäläinen edited this page Mar 22, 2019 · 8 revisions

Specifying the Korp URL

Before making any queries, you have to define which Korp instance to use. By default, the library knows kielipankki, språkbanken and GT. These can be passed as a string to the constructor.

from korp.korp import Korp
korppi = Korp(service_name="GT")  # Giellatekno
korppi = Korp(service_name="kielipankki")  # Language Bank of Finland
korppi = Korp(service_name="språkbanken")  # Language Bank of Sweden

If another provider runs its own Korp, you can instead pass the URL of its Korp API.

from korp.korp import Korp
korppi = Korp(url="http://www.yourkorp.com/korp.cgi")

Listing corpora

All queries require you to specify which corpora you want to use. The library lets you query for all available corpora easily.

corpora = korppi.list_corpora()

You can also limit the corpora to those whose names start with a specific string.

corpora = korppi.list_corpora("SME")

To get more information about the corpora in the system, you can use the following code, where corpora is a list of the corpora you want to learn more about.

corpora_info = korppi.corpus_information(corpora)

Queries

Queries are mostly written in the CQP query language.
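Since the queries on this page are plain CQP strings, they can also be assembled programmatically. The following is a minimal sketch, not part of the library: cqp_token and cqp_sequence are hypothetical helper names, and the attribute names available (such as pos) depend on the corpus. It does not escape quotation marks inside values.

```python
def cqp_token(word=None, **attributes):
    """Build one CQP token expression.

    A bare word becomes a quoted string ("go"); attribute constraints
    become a bracketed expression ([pos="A"]).
    """
    if word is not None and not attributes:
        return '"{}"'.format(word)
    constraints = []
    if word is not None:
        constraints.append('word="{}"'.format(word))
    # Sort attribute names so the generated query is deterministic
    for name in sorted(attributes):
        constraints.append('{}="{}"'.format(name, attributes[name]))
    return "[" + " & ".join(constraints) + "]"


def cqp_sequence(*tokens):
    """Join token expressions into a query that matches them in order."""
    return " ".join(tokens)


query = cqp_sequence(cqp_token(pos="A"), cqp_token("go"), cqp_token(pos="N"))
print(query)  # [pos="A"] "go" [pos="N"]
```

The resulting string is the same query used in the examples on this page and can be passed to the query methods unchanged.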

Concordance

To get concordances for a query, run

query = '[pos="A"] "go" [pos="N"]'
number, concordances = korppi.concordance(query, corpora)

Optionally, you can specify start and end indices. Note: this call returns only the first 1000 results, while the number variable holds the total number of results. To get all results, run

query = '[pos="A"] "go" [pos="N"]'
number, concordances = korppi.all_concordances(query, corpora)
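To picture how a result set larger than 1000 hits breaks down into start and end indices, here is a small sketch. It is not taken from the library: result_windows is a hypothetical helper, and it assumes zero-based indices with an inclusive end, which may differ from how the library interprets them.

```python
def result_windows(total, batch_size=1000):
    """Yield (start, end) index pairs that together cover `total` results.

    Assumption: indices are zero-based and `end` is inclusive.
    """
    start = 0
    while start < total:
        end = min(start + batch_size, total) - 1
        yield start, end
        start += batch_size


windows = list(result_windows(2500))
print(windows)  # [(0, 999), (1000, 1999), (2000, 2499)]
```

With a total of 2500 hits, three requests would be needed, the last one covering only 500 results.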

Fetching all concordances might yield a list too large to fit in memory. In that case, you can pass the use_function_on_iteration parameter to all_concordances(). It should point to a function that takes a concordance list and an integer (the iteration number). all_concordances() fetches the concordances in batches of 1000 and, as each batch is returned from Korp, calls that function with it. When this parameter is set, all_concordances() returns an empty list once it finishes, because the concordances have already been handed to the function on each iteration.

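def my_batch_processor(concordances, iteration_number):
    # Called once for every batch of concordances returned from Korp
    print("Iteration number {}".format(iteration_number))
    # First token of the first concordance in this batch
    print("The first token {}".format(concordances[0]["tokens"][0]))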

query = '[pos="A"] "go" [pos="N"]'
number, concordances = korppi.all_concordances(query, corpora, use_function_on_iteration=my_batch_processor)
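Because the callback is invoked once per batch, any state that should survive between batches has to live outside the function. One way to do that is a callable object, sketched below. WordCounter is a hypothetical name, the batches are made-up stand-in data, and the layout of a concordance (a dict with a "tokens" list whose items have a "word" key) is an assumption rather than something this page specifies.

```python
from collections import Counter


class WordCounter(object):
    """Callable that tallies word forms batch by batch."""

    def __init__(self):
        self.counts = Counter()
        self.batches_seen = 0

    def __call__(self, concordances, iteration_number):
        # Assumption: each concordance is a dict with a "tokens" list,
        # and each token is a dict with a "word" key.
        for concordance in concordances:
            for token in concordance["tokens"]:
                self.counts[token["word"]] += 1
        self.batches_seen += 1


counter = WordCounter()

# Made-up stand-in batches; in real use the object would be passed as
# use_function_on_iteration=counter to all_concordances().
fake_batches = [
    [{"tokens": [{"word": "big"}, {"word": "go"}]}],
    [{"tokens": [{"word": "go"}, {"word": "home"}]}],
]
for iteration_number, batch in enumerate(fake_batches):
    counter(batch, iteration_number)

print(counter.counts.most_common(1))  # [('go', 2)]
```

Only the running tally stays in memory, so the size of the full result set no longer matters.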

Statistics

To get frequencies out of the system, run

query = '[pos="A"] "go" [pos="N"]'
results = korppi.statistics(query, corpora, "pos")

This will group the results by pos.

Trend diagram

query = '[pos="A"] "go" [pos="N"]'
results = korppi.trend_diagram(query, corpora, "pos")

Log-likelihood

Log-likelihood takes two different queries and two different sets of corpora, in addition to a groupby variable.

query1 = '[pos="A"] "go" [pos="N"]'
query2 = '[pos="A"] "dago" [pos="N"]'
results = korppi.log_likelihood(query1, query2, corpora1, corpora2, "pos")

This will group the results by pos.
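For readers unfamiliar with the statistic, the sketch below shows the log-likelihood (G2) score commonly used to compare the frequency of an item in two corpora. It is a standalone illustration with made-up numbers; whether Korp computes exactly this variant is an assumption, and the function here is unrelated to the library method of the same name.

```python
import math


def log_likelihood(freq1, freq2, size1, size2):
    """Return the G2 log-likelihood score for one item in two corpora.

    freq1, freq2: how often the item occurs in each corpus.
    size1, size2: total number of tokens in each corpus.
    """
    combined = freq1 + freq2
    # Frequencies expected if the item were equally common in both corpora
    expected1 = size1 * combined / float(size1 + size2)
    expected2 = size2 * combined / float(size1 + size2)
    score = 0.0
    # A zero frequency contributes nothing (x * ln(x) tends to 0)
    if freq1 > 0:
        score += freq1 * math.log(freq1 / expected1)
    if freq2 > 0:
        score += freq2 * math.log(freq2 / expected2)
    return 2 * score


print(round(log_likelihood(10, 20, 1000, 1000), 3))  # 3.398
```

A score of 0 means the item is equally frequent in both corpora relative to their sizes; larger scores indicate a stronger difference.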

Word picture

Word picture takes a word, not a query, as its input.

results = korppi.word_picture("go", corpora)

Word picture hits

Word picture hits takes a head, corpora, a relation and sources (see the word_picture results for more information).

results = korppi.word_picture_hits('villiinty\xe4..vb.1', corpora, "ADV", ["S24:16174044", "S24:2343136"])

Additional parameters

All methods described above accept a dictionary of additional parameters, additional_parameters. To see a list of the parameters you can use, call the list_additional_parameters method with the desired method name.

print(korppi.list_additional_parameters("log_likelihood"))
>>> [u'[?] max = The maximum number of results', u'[?] incremental = true: Return results incrementally (as soon as the results for each corpus are ready) for a search from multiple corpora']
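The descriptions come back as plain strings, so extracting just the parameter names takes a little parsing. The sketch below assumes every entry has the shape "marker name = description", as the two entries in the output above do; parse_parameter_descriptions is a hypothetical helper, not a library function.

```python
def parse_parameter_descriptions(descriptions):
    """Map parameter names to their descriptions.

    Assumption: every entry looks like "<marker> <name> = <description>".
    """
    parsed = {}
    for entry in descriptions:
        left, separator, description = entry.partition(" = ")
        if not separator or not left.split():
            continue  # skip entries that do not follow the assumed layout
        name = left.split()[-1]  # drops a leading marker such as "[?]"
        parsed[name] = description
    return parsed


# The two entries shown in the output above
descriptions = [
    "[?] max = The maximum number of results",
    "[?] incremental = true: Return results incrementally (as soon as the "
    "results for each corpus are ready) for a search from multiple corpora",
]
parameters = parse_parameter_descriptions(descriptions)
print(sorted(parameters))  # ['incremental', 'max']
```

The names recovered this way are presumably the keys expected in the additional_parameters dictionary.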