# Get sentences of a corpus/subcorpus available (only) through the Korp interface.

Korp is a concordance search tool for the collections of the Swedish Language Bank (https://spraakbanken.gu.se/korp/) and the Language Bank of Finland (https://korp.csc.fi/). It lets you search text corpora for keywords and generate concordances for them. It also offers several other result views, which are described here: https://www.kielipankki.fi/support/korp/#Search_result_views.

However useful Korp is, sometimes you want to explore more than just words and their contexts. For example, you might want to trace whether the average length of an author's sentences changes over the years. For that, you need the sentences themselves, grouped by year, rather than a concordance. This is how you get them.

I would divide the task into three steps:
* Download the author subcorpora as multiple CSV files from the Korp interface.
* Concatenate the CSV files, get sentences and write them into a .txt file.
* Check that you've got everything right (or rightish). 

## Step 1: Download the subcorpora as multiple CSV files from the Korp interface.
* 1. Go to Korp and select the corpus you need. For example, the New Year's speeches of the presidents of Finland (https://metashare.csc.fi/repository/browse/new-years-speeches-of-the-presidents-of-the-republic-of-finland/6d69dc8e089a11e28bed005056be118e41f01f1cea2f47c3b6bd34181fb18aa7/). Let's select only Tarja Halonen's speeches.
<img src= "screens/screen1.png">
* 2. Click **Extended** search, then choose the number of KWIC hits per page and the attribute the statistics should be grouped by (this is what divides the corpus into subcorpora). For example, choose **1000** KWIC hits per page (this results in the smallest number of CSV parts) and **date**. Click **Search**.
<img src= "screens/screen2.png">
* 3. Go to the **Statistics** tab and click the year you are interested in (2001, for example). The interface will open an additional KWIC tab.
<img src= "screens/screen3.png">
<img src= "screens/screen4.png">
* 4. Scroll to the bottom of the new KWIC tab. You'll see a **Download KWIC** button. Next to the button, select **Sentence per row, match and contexts separated** as the data format and **CSV** as the file format. Press the **Download KWIC** button to start the download.
<img src= "screens/screen5.png">
* 5. Go page by page and download the rest of the files.
<img src= "screens/screen6.png">

NOTE1: These steps can be automated, for example with Selenium WebDriver.

NOTE2: For some corpora Korp can show a wider context, up to whole paragraphs of text. To get it, press **Show context** next to the pagination bar in the new KWIC tab after step 3.
<img src= "screens/screen7.png">

## Step 2: Concatenate the files, get sentences and write them into a .txt file.
* 1 Concatenate CSV files into one dataframe.
* 2 Select only rows with empty left context (start of the sentence). 
* 3 Create a column that contains full sentences.
* 4 Convert this column into a list of strings.
* 5 Write this list into a .txt file.

NOTE: in the case of presidential speeches, you have a "sentence_id" feature that you can use to extract sentences. Not all corpora in Korp have this, so I am describing a more general case.
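
For corpora that do have `sentence_id`, a minimal sketch of that shortcut might look like the following. It assumes that every row of the export holds one token in `match` (as with the query used here) and that `sentence_id` values are unique within the files being combined. The toy dataframe and its values are invented for illustration; only the column names come from the export.

```python
import pandas as pd

# Toy stand-in for the concatenated Korp export: one row per token.
# The tokens and ids are made up for this example.
toy = pd.DataFrame({
    'match': ['Hyvää', 'uutta', 'vuotta', '!', 'Kiitos', '.'],
    'sentence_id': ['s0', 's0', 's0', 's0', 's1', 's1'],
})

# Join the tokens of each sentence in their original row order.
# sort=False keeps the sentences in order of first appearance.
by_id = toy.groupby('sentence_id', sort=False)['match'].agg(' '.join)

sentences_by_id = by_id.tolist()
print(sentences_by_id)
```

If the ids may restart in every text, group by a text-level column (for example `text_title`) together with `sentence_id`, otherwise sentences from different texts would be merged.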

In [4]:
import pandas as pd
import glob
import io

### 2.1

In [15]:
# getting the parts of the corpus
# (I've collected the parts of the 2001 speech into a folder '2001')
paths = sorted(glob.glob('2001/*'))
# reading every csv file into a dataframe
dfs = [pd.read_csv(path) for path in paths]
# concatenating the dataframes in path order
df = pd.concat(dfs)

# quick look at the dataframe
df.head(3)

Unnamed: 0,hit number,corpus,left context,match,right context,left context lemmas,match lemmas,right context lemmas,text_title,text_distributor,...,paragraph_span,sentence_id,sentence_url,URN,metadata link,licence,date,total hits,Korp URL,params
0,0,KOTUS_NS_PRESIDENTTI_HALONEN,,Vuosituhannen,vaihtuminen lisäsi keskustelua elämämme arvois...,,vuosituhat,vaihtuminen lisätä keskustelu elämä arvo .,Tasavallan presidentin uudenvuodenpuhe 1.1.2001,Kotimaisten kielten tutkimuskeskus / Research ...,...,MENNYT,s0,http://kaino.kotus.fi/korpus/teko/teksti/presi...,urn:nbn:fi:lb-20151001,http://urn.fi/urn:nbn:fi:lb-20140730150,EUPL v1.1 (CLARIN PUB),2020-09-18 17:21:59,1243,https://korp.csc.fi/#?stats_reduce=text_date&c...,corpus=KOTUS_NS_PRESIDENTTI_HALONEN; cqp=[]; d...
1,1,KOTUS_NS_PRESIDENTTI_HALONEN,Vuosituhannen,vaihtuminen,lisäsi keskustelua elämämme arvoista .,vuosituhat,vaihtuminen,lisätä keskustelu elämä arvo .,Tasavallan presidentin uudenvuodenpuhe 1.1.2001,Kotimaisten kielten tutkimuskeskus / Research ...,...,MENNYT,s0,http://kaino.kotus.fi/korpus/teko/teksti/presi...,urn:nbn:fi:lb-20151001,http://urn.fi/urn:nbn:fi:lb-20140730150,EUPL v1.1 (CLARIN PUB),2020-09-18 17:21:59,1243,https://korp.csc.fi/#?stats_reduce=text_date&c...,corpus=KOTUS_NS_PRESIDENTTI_HALONEN; cqp=[]; d...
2,2,KOTUS_NS_PRESIDENTTI_HALONEN,Vuosituhannen vaihtuminen,lisäsi,keskustelua elämämme arvoista .,vuosituhat vaihtuminen,lisätä,keskustelu elämä arvo .,Tasavallan presidentin uudenvuodenpuhe 1.1.2001,Kotimaisten kielten tutkimuskeskus / Research ...,...,MENNYT,s0,http://kaino.kotus.fi/korpus/teko/teksti/presi...,urn:nbn:fi:lb-20151001,http://urn.fi/urn:nbn:fi:lb-20140730150,EUPL v1.1 (CLARIN PUB),2020-09-18 17:21:59,1243,https://korp.csc.fi/#?stats_reduce=text_date&c...,corpus=KOTUS_NS_PRESIDENTTI_HALONEN; cqp=[]; d...
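
One thing to watch for in 2.1: `sorted()` orders paths as plain strings, so a file for page 10 can end up before the file for page 2, and the sentences would then be written out of order. A possible workaround is a natural-sort key, sketched below. The file names are hypothetical (the names Korp gives your downloads may differ), and `natural_key` is a helper made up for this example.

```python
import re

def natural_key(path):
    """Split a path into text and integer chunks so 'page2' sorts before 'page10'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r'(\d+)', path)]

# Hypothetical file names, for illustration only.
example_paths = ['2001/page10.csv', '2001/page2.csv', '2001/page1.csv']

plain_order = sorted(example_paths)                     # string order
natural_order = sorted(example_paths, key=natural_key)  # numeric order

print(plain_order)
print(natural_order)
```

With real downloads this would become `paths = sorted(glob.glob('2001/*'), key=natural_key)`.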


### 2.2

In [17]:
# selecting only text columns
sentences = df[['left context','match','right context']]
# selecting only the rows where 'match' starts the sentence (no left context)
sentences = sentences[sentences['left context'].isnull()]
# dropping rows with an empty match (some corpora in Korp contain such errors)
sentences = sentences[sentences['match'].notnull()]
# converting remaining empty cells into empty strings 
# (for empty right contexts, if a sentence is one word)
sentences = sentences.fillna('')

### 2.3

In [28]:
# adding a column with full sentences
# by concatenating matches and their right contexts;
# strip() removes the trailing space left when the right context is empty
sentences['full sentence'] = (sentences['match'] + ' ' + sentences['right context']).str.strip() + '\n'

### 2.4 

In [26]:
# converting the column with sentences into a list
corpus = sentences['full sentence'].tolist()

# quick look at the last sentence
print(corpus[-1])

Tehkäämme vuodesta 2001 aidosti yhteisen vastuun vuosi .



### 2.5

In [27]:
# the with-block closes the file even if writing fails
with open('2001.txt', 'w', encoding='utf-8') as f:
    f.writelines(corpus)

## Step 3: Check that you've got everything right.
This step helps catch obvious mistakes. We'll compare the number of tokens in the extracted sentences to the number of 'match' tokens in the CSV files.

* 1 Split the sentences from step 2 by whitespaces and count the tokens.
* 2 Compare this count to the number of match tokens in the concatenated dataframe.
    * Get tokens from 'match' column.
    * Count the number of whitespaces these tokens contain. (A single match token can contain whitespace, so we can't simply compare the number of rows in the dataframe to the token count from the previous step, 3.1.)
    * Subtract the whitespace count from the token count in the sentences. Compare this number to the number of match tokens.
* 3 Manually check the number of tokens against the number given in Korp to be completely sure that you've got everything.
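
To see why the whitespace correction in 3.2 is needed, here is a small worked example on invented data, in which one match cell holds two whitespace-separated words. It only illustrates the arithmetic; the names `toy_sentences` and `toy_matches` are not part of the notebook.

```python
# Toy data, invented for illustration: two sentences, one of which
# contains a match cell made of two whitespace-separated words.
toy_sentences = ['New York is big .\n', 'Yes .\n']
toy_matches = ['New York', 'is', 'big', '.', 'Yes', '.']

# 3.1: whitespace-separated tokens in the sentences
n_toy = sum(len(sentence.split()) for sentence in toy_sentences)

# 3.2: extra words hidden inside multi-word match cells
k_toy = sum(len(match.split()) - 1 for match in toy_matches)

print(n_toy, k_toy, len(toy_matches))
print(n_toy - k_toy == len(toy_matches))
```

Splitting on whitespace counts "New York" as two tokens, while the dataframe has a single row for it, so the raw counts differ by exactly the number of whitespaces inside match cells.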

### 3.1

In [38]:
# count whitespace separated tokens
n = 0
for sent in corpus:
    n += len(sent.split())
print(n)

1243


### 3.2

In [44]:
# get match tokens; drop empty matches first, since NaN cells cannot be split
match_tokens = df['match'].dropna().astype(str).tolist()
# count the extra whitespace-separated words inside multi-word matches
k = 0
for word in match_tokens:
    k += max(len(word.split()) - 1, 0)

# compare token counts
print(n - k == len(match_tokens))
print(len(match_tokens))

True
1243


### 3.3
1243 is exactly the same as the number of tokens in Korp:
<img src= "screens/screen8.png">

## FINAL NOTES

Now you can do the same thing with speeches from other years. Not all corpora in Korp are as nice as this one, so you'll probably need to make some modifications to the code.
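
Once you have one .txt file per year, the question from the introduction (does the average sentence length change over the years?) comes down to a few lines. The sketch below is only an illustration: `average_sentence_length` is a made-up helper, and the toy sentences stand in for the contents of the yearly files.

```python
def average_sentence_length(lines):
    """Mean number of whitespace-separated tokens per non-empty line."""
    lengths = [len(line.split()) for line in lines if line.strip()]
    return sum(lengths) / len(lengths) if lengths else 0.0

# Invented stand-in for the contents of two yearly files.
toy_years = {
    '2001': ['Hyvää uutta vuotta !\n', 'Kiitos .\n'],
    '2002': ['Kiitos kaikille teille .\n'],
}

averages = {year: average_sentence_length(lines)
            for year, lines in toy_years.items()}
print(averages)
```

With real data you would pass an open file instead of a list, for example `with open('2001.txt', encoding='utf-8') as f: average_sentence_length(f)`. Note that punctuation marks count as tokens here, because Korp's tokenized text separates them with whitespace.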

You can find functions based on the steps taken in this notebook in the `write_korp_corpus.py` script.