Home
Before making any queries, you have to define which Korp service to use. By default, the library knows kielipankki, språkbanken and GT. These can be passed as a string to the constructor.
from korp.korp import Korp
korppi = Korp(service_name="GT")#Giellatekno
korppi = Korp(service_name="kielipankki")#Language Bank of Finland
korppi = Korp(service_name="språkbanken")#Language Bank of Sweden
If another provider runs its own Korp, you can pass the URL of its Korp API instead.
from korp.korp import Korp
korppi = Korp(url="http://www.yourkorp.com/korp.cgi")
All queries require you to specify which corpora you want to use. The library lets you query for all available corpora easily.
corpora = korppi.list_corpora()
You can also limit the corpora to the ones starting with a specific string
corpora = korppi.list_corpora("SME")
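As a rough illustration of what the prefix argument does, the same filtering can be reproduced on the client side. This is only a sketch: filter_by_prefix is a hypothetical helper that is not part of the library, the corpus names are made up, and the case-sensitive matching is an assumption.

```python
def filter_by_prefix(corpora, prefix):
    """Keep only the corpus names that start with the given prefix."""
    return [name for name in corpora if name.startswith(prefix)]

# Hypothetical corpus names, for illustration only
all_corpora = ["SME_NEWS", "SME_BLOGS", "FIN_WIKI"]
sme_corpora = filter_by_prefix(all_corpora, "SME")
print(sme_corpora)
```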
To get more information about the corpora in the system, you can use the following code, where corpora is a list of corpora you want to learn more about
corpora_info = korppi.corpus_information(corpora)
Queries are mostly written in the CQP query language.
To get concordances for a query, run
query = '[pos="A"] "go" [pos="N"]'
number, concordances = korppi.concordance(query, corpora)
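CQP queries like the one above are plain strings, so they can be assembled programmatically. The helpers below are a minimal sketch: token and word are hypothetical names that do not come from the library, and the sketch does not escape special characters in the values.

```python
def token(**attributes):
    """Build one CQP token pattern such as [pos="A"] from keyword arguments."""
    if not attributes:
        return "[]"  # in CQP, [] matches any single token
    conditions = ['%s="%s"' % (name, value)
                  for name, value in sorted(attributes.items())]
    # Several conditions on the same token are joined with &
    return "[" + " & ".join(conditions) + "]"

def word(form):
    """Build a CQP pattern that matches a word form, such as "go"."""
    return '"%s"' % form

# Rebuild the query used in the example above
query = " ".join([token(pos="A"), word("go"), token(pos="N")])
print(query)
```

The resulting string can be passed to the query methods in the same way as a hand-written one.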
Optionally, you can specify start and end indices. Note: This code will only return the first 1000 results. The number variable indicates the total number of results. To get all results, run
query = '[pos="A"] "go" [pos="N"]'
number, concordances = korppi.all_concordances(query, corpora)
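Because a single request is capped at 1000 results, fetching everything means several requests over consecutive index ranges. The sketch below only computes those ranges. batch_ranges is a hypothetical helper, and the zero-based, inclusive indices are an assumption about how start and end are interpreted.

```python
def batch_ranges(total, batch_size=1000):
    """Return inclusive (start, end) index pairs that cover `total` results."""
    ranges = []
    start = 0
    while start < total:
        # The last batch may be shorter than batch_size
        end = min(start + batch_size, total) - 1
        ranges.append((start, end))
        start += batch_size
    return ranges

print(batch_ranges(2500))
```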
Fetching all concordances might yield a list that is too large to fit in memory. In that case, you can pass the use_function_on_iteration parameter to the all_concordances() method. It points to a function that takes a list of concordances and an integer (the iteration number). all_concordances() fetches the concordances in batches of 1000 and, as each batch is returned from Korp, calls this function on it. When the parameter is set, all_concordances() returns an empty list once it finishes, because the concordances have already been handed to the function on each iteration.
def my_batch_processor(concordances, iteration_number):
    # Called once per batch of up to 1000 concordances
    print("Iteration number " + str(iteration_number))
    # concordances is a list, so index into it before reading "tokens"
    print("The first token " + str(concordances[0]["tokens"][0]))

query = '[pos="A"] "go" [pos="N"]'
number, concordances = korppi.all_concordances(query, corpora, use_function_on_iteration=my_batch_processor)
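A batch callback does not have to print. It can also aggregate results, so that only a small summary stays in memory. The following sketch counts word forms across batches. It assumes that each concordance is a dict with a "tokens" list and that each token is a dict with a "word" key. The batches are simulated here instead of coming from Korp, and make_word_counter is a hypothetical helper.

```python
from collections import Counter

def make_word_counter():
    """Return a (callback, counts) pair; the callback updates counts per batch."""
    counts = Counter()

    def process_batch(concordances, iteration_number):
        # Assumption: each concordance is a dict with a "tokens" list,
        # and each token is a dict with a "word" key.
        for concordance in concordances:
            for tok in concordance["tokens"]:
                counts[tok["word"]] += 1

    return process_batch, counts

# Simulated batches standing in for what all_concordances() would pass in
callback, counts = make_word_counter()
fake_batches = [
    [{"tokens": [{"word": "a"}, {"word": "go"}]}],
    [{"tokens": [{"word": "go"}, {"word": "b"}]}],
]
for i, batch in enumerate(fake_batches):
    callback(batch, i)
print(counts["go"])
```

In real use, callback would be passed as use_function_on_iteration, and counts would be read after all_concordances() returns.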
To get frequencies out of the system, run
query = '[pos="A"] "go" [pos="N"]'
results = korppi.statistics(query, corpora, "pos")
This will group the results by pos. A trend diagram is requested with the same arguments:
query = '[pos="A"] "go" [pos="N"]'
results = korppi.trend_diagram(query, corpora, "pos")
Log-likelihood takes two queries and two lists of corpora, in addition to a group-by attribute.
query1 = '[pos="A"] "go" [pos="N"]'
query2 = '[pos="A"] "dago" [pos="N"]'
results = korppi.log_likelihood(query1, query2, corpora1, corpora2, "pos")
This will group the results by pos
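For readers unfamiliar with the measure, the sketch below computes the common log-likelihood (G2) score for one item observed in two corpora, from its two frequencies and the two corpus sizes. It illustrates the statistic in general. Whether Korp computes its values in exactly this way is an assumption, and g2_score is a hypothetical name.

```python
import math

def g2_score(freq1, size1, freq2, size2):
    """Log-likelihood (G2) for one item observed in two corpora."""
    total_freq = freq1 + freq2
    total_size = size1 + size2
    # Expected frequencies if the item were equally common in both corpora
    expected1 = size1 * total_freq / float(total_size)
    expected2 = size2 * total_freq / float(total_size)
    g2 = 0.0
    # A zero observed frequency contributes nothing to the sum
    if freq1 > 0:
        g2 += freq1 * math.log(freq1 / expected1)
    if freq2 > 0:
        g2 += freq2 * math.log(freq2 / expected2)
    return 2 * g2

# Made-up frequencies and corpus sizes, for illustration only
print(g2_score(120, 10000, 45, 12000))
```

A score of zero means the item is equally frequent, relative to corpus size, in both corpora. Larger scores indicate a bigger difference.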
Word picture doesn't take a query but a word as its input
results = korppi.word_picture("go", corpora)
Word picture hits takes head, corpora, relation and sources (see word_picture results for more information)
results = korppi.word_picture_hits('villiinty\xe4..vb.1', corpora, "ADV", ["S24:16174044", "S24:2343136"])
All methods described above accept a dictionary of additional parameters, additional_parameters. To see a list of the parameters you can use, call the list_additional_parameters method and pass the name of the desired method.
print(korppi.list_additional_parameters("log_likelihood"))
>>> [u'[?] max = The maximum number of results', u'[?] incremental = true: Return results incrementally (as soon as the results for each corpus are ready) for a search from multiple corpora']
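The help strings shown above follow a "name = description" pattern, so they can be turned into a lookup table. This is a sketch that assumes the pattern holds for every entry. parse_parameter_help is a hypothetical helper, and the second sample string is shortened.

```python
def parse_parameter_help(lines):
    """Map parameter names to descriptions from 'name = description' strings."""
    parameters = {}
    for line in lines:
        # Drop a leading bracketed marker such as "[?]" if present
        if line.startswith("[") and "]" in line:
            line = line.split("]", 1)[1]
        # Split on the first "=" only; descriptions may contain more
        name, _, description = line.partition("=")
        parameters[name.strip()] = description.strip()
    return parameters

help_lines = [
    "[?] max = The maximum number of results",
    "[?] incremental = true: Return results incrementally",
]
params = parse_parameter_help(help_lines)
print(sorted(params))
```

The parsed names are the candidates for keys in the additional_parameters dictionary.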
(C) 2017 Mika Hämäläinen