<a href="https://colab.research.google.com/github/mavela/Linguistics-with-conllu-data/blob/master/langnet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using dependency parses for analyzing language

The focus here is on ready-made Python scripts.
* Some of the first commands are in [Bash](https://en.wikipedia.org/wiki/Bash_(Unix_shell)).
* All the scripts can be downloaded from https://github.com/mavela/Linguistics-with-conllu-data.git (disclaimer: the code is not pretty!)
* You can also use tagged data with corpus tools such as AntConc.
  * See https://www.youtube.com/watch?v=gkKna-ka9zw


The examples follow the distinction between two research designs in corpus linguistics ([see Biber & Jones 2009](https://jan.ucc.nau.edu/biber/Biber/Biber_offprint.pdf)):
*   Type A focuses on individual forms (words, lemmas, constructions)
*   Type B focuses on entire texts



## Preparations

Let's download the data from GitHub!
* git clone downloads the repository
* %cd takes us to the repository directory
* !ls lists the files in that directory





In [1]:
!git clone https://github.com/mavela/Linguistics-with-conllu-data.git
%cd Linguistics-with-conllu-data
!ls

Cloning into 'Linguistics-with-conllu-data'...
remote: Enumerating objects: 34, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 34 (delta 15), reused 27 (delta 8), pack-reused 0
Unpacking objects: 100% (34/34), done.
/content/Linguistics-with-conllu-data
analyze.py	  narrative_ext.conllu		  pb_smallpart.conllu.gz
how-to.conllu.gz  pb_even_smaller_part.conllu.gz


### Checking the basics
! zcat prints the entire file, and head -20 cuts the output after the first 20 lines.

In [2]:
! zcat pb_smallpart.conllu.gz | head -20 # What the data looks like

# <doc id="7-510353" length="0-1k" crawl_date="2015-06-05" url="http://yle.fi/uutiset/lahden_paikallisliikenteen_uudistus_edennyt_ilman_suuria_ongelmia/7338226?origin=rss" langdiff="0.11">
# delex_lm_mean_perplexity: 210.51
# lex_lm_mean_perplexity: 32064.49
# predicted register: narrative
# <p heading="0">
# paragraph_delex_lm_mean_perplexity: 368.98
# paragraph_lexical_lm_mean_perplexity: 50604.32
# text = Suurilta ja toistuvilta myöhästymisiltä tai muilta kommelluksilta on vältytty.
# </p>
1	Suurilta	suuri	ADJ	_	Case=Abl|Degree=Pos|Number=Plur	4	amod	_	_
2	ja	ja	CCONJ	_	_	3	cc	_	_
3	toistuvilta	toistuva	ADJ	_	Case=Abl|Degree=Pos|Number=Plur	1	conj	_	_
4	myöhästymisiltä	myöhästyminen	NOUN	_	Case=Nom|Derivation=Minen|Number=Plur	9	obl	_	_
5	tai	tai	CCONJ	_	_	7	cc	_	_
6	muilta	muu	PRON	_	Case=Abl|Number=Sing|PronType=Dem	7	det	_	_
7	kommelluksilta	kommellus	NOUN	_	Case=Abl|Number=Plur	4	conj	_	_
8	on	olla	AUX	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	9	aux:pass	
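
The same peek at the file can also be done in Python, which is handy on systems without zcat. The following is a minimal sketch using only the standard library; the helper name `head` is made up for this example and is not part of analyze.py.

```python
import gzip
from itertools import islice

def head(path, n=20):
    """Return the first n lines of a gzip-compressed text file."""
    # "rt" opens the compressed file in text mode, so we get str, not bytes.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]

# Usage in the notebook (file name taken from the repository listing above):
# for line in head("pb_smallpart.conllu.gz"):
#     print(line)
```

Because islice stops after n lines, the whole file is never decompressed, which matters for a corpus of this size.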

## Word counts
Note that we have to skip empty lines (sentence boundaries) and metadata lines (those starting with #) when counting.


In [3]:
from analyze import count_words, most_frequent, extract_register, print_text

print("Total word count of the conllu file is", count_words("pb_smallpart.conllu.gz"), "tokens")

Total word count of the conllu file is 6004519 tokens
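
To make the skipping of empty lines and metadata concrete, here is a minimal sketch of a token counter. It is not the code of `count_words` in analyze.py, which may differ in detail; the function name `count_tokens` and the three-token sample are invented for illustration. The sketch also assumes that only lines with a plain integer ID should count, so multiword ranges such as 1-2 are left out.

```python
def count_tokens(lines):
    """Count token lines in CoNLL-U input, skipping blank and comment lines."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # sentence separator or metadata
        token_id = line.split("\t")[0]
        if token_id.isdigit():  # assumption: ignore ranges (1-2) and empty nodes (1.1)
            total += 1
    return total

# Invented three-token sample sentence.
sample = [
    "# text = Koira haukkuu.",
    "1\tKoira\tkoira\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\thaukkuu\thaukkua\tVERB\t_\t_\t0\troot\t_\t_",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
]
print(count_tokens(sample))  # 3
```

For a real file, the same function can be fed an open handle, for example one returned by `gzip.open(path, "rt", encoding="utf-8")`.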


## Lemmatization

One of the most frequent uses of the parser is simply lemmatization.

print_text prints one text per line, with each token represented by the feature we specify.

We specify the feature by giving a column name of the CoNLL-U format.

The columns are: 
* ID, FORM, LEMMA, UPOS, XPOS, FEAT, HEAD, DEPREL, DEPS, MISC 

The number indicates how many texts we want.

In [4]:
print(print_text("pb_smallpart.conllu.gz", "FORM", 2))
print()
print(print_text("pb_smallpart.conllu.gz", "LEMMA", 2))

Suurilta ja toistuvilta myöhästymisiltä tai muilta kommelluksilta on vältytty . – Tietooni ei ole tullut liikennöitsijöiltä sellaisia linjoja , joissa toistuvia myöhästelyjä olisi . Kuljettajatkin ovat olleet tyytyväisiä uusin reitteihin , vaikka varmasti paljon uutta on ollut omaksuttavana . Myöhästelyt ovat Jorasmaan mukaan johtuneet lähes poikkeuksetta lipunmyyntijärjestelmän ongelmista .

Translate sunnuntai 2. joulukuuta 2012 Poronkäristys Aivan loistava ruoka näin tuulen tuivertaessa lunta ikkunoihin . On hanget korkeat nietokset ( ainakin toivottavasti muutaman päivän ) ja pihalta tullessa tuhti ruoka lämmittää sopivasti . Poro on lihana miedonmakuinen ja vähärasvainen joten se on hyvää vaihtelua naudan- ja possunlihalle . Saimme 10kg poroa tuoreena suoraan lapista setäni kautta ja tänä talvena pääsemme kokeilemaan tästä pohjoisen herkusta moninaisia ruokia . Paistia on jo tullut testattua ja käristystäkin kerran . Teimme ensimmäisen kerran käristyksen " ainoan ja oikean " ohjee
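
Picking a feature by column name boils down to mapping the name to a position in the tab-separated token line. The sketch below shows the idea under that assumption; `column_values` is a hypothetical helper, not the implementation of `print_text`, and the two sample lines are copied from the zcat output earlier in this notebook.

```python
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEAT", "HEAD", "DEPREL", "DEPS", "MISC"]

def column_values(lines, column):
    """Return the values of one named column for every token line."""
    idx = COLUMNS.index(column)  # raises ValueError for an unknown column name
    values = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence breaks and metadata
        fields = line.split("\t")
        if len(fields) == len(COLUMNS):  # ignore malformed or truncated lines
            values.append(fields[idx])
    return values

sample = [
    "# text = Suurilta ja ...",
    "1\tSuurilta\tsuuri\tADJ\t_\tCase=Abl|Degree=Pos|Number=Plur\t4\tamod\t_\t_",
    "2\tja\tja\tCCONJ\t_\t_\t3\tcc\t_\t_",
]
print(" ".join(column_values(sample, "FORM")))   # Suurilta ja
print(" ".join(column_values(sample, "LEMMA")))  # suuri ja
```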

## Type A perspective: Analyzing / searching for individual words / lemmas

How many times does a particular lemma appear in the corpus?

What are its most frequent dependency relations? (Or other tags)

The CoNLL-U columns (tagsets) are: ID,FORM,LEMMA,UPOS,XPOS,FEAT,HEAD,DEPREL,DEPS,MISC

In [5]:
from analyze import count_specific_lemma, count_word_context

# Note: the first argument is a lemma, not an inflected form
print(count_specific_lemma("koira", "pb_smallpart.conllu.gz", "FORM"))
print(count_specific_lemma("koira", "pb_smallpart.conllu.gz", "DEPREL"))

Total counts for the lemma koira: 563
The most frequent  FORM:
koira 149
koiran 68
koiraa 49
koiria 48
koirat 47
Koira 31
koirien 29
Koirat 16
Koiran 13
koirilla 13
koiransa 11
koiralle 9
koirille 9
koirasta 7
koirista 7

Total counts for the lemma koira: 563
The most frequent  DEPREL:
nsubj 94
obj 83
obl 68
conj 59
nmod 54
root 53
nmod:poss 52
nsubj:cop 40
compound:nn 17
appos 14
advcl 8
nmod:gobj 5
ccomp 4
nmod:gsubj 3
flat:name 3
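
A frequency profile like the ones above can be built with `collections.Counter`: find every token whose LEMMA column matches, then tally the value in the column of interest. This is a rough sketch rather than the code of `count_specific_lemma`; the name `lemma_profile` and the sample tokens are invented for the example.

```python
from collections import Counter

COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEAT", "HEAD", "DEPREL", "DEPS", "MISC"]

def lemma_profile(lines, lemma, column):
    """Count the values in `column` over all tokens whose lemma is `lemma`."""
    idx = COLUMNS.index(column)
    lemma_idx = COLUMNS.index("LEMMA")
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == len(COLUMNS) and fields[lemma_idx] == lemma:
            counts[fields[idx]] += 1
    return counts

# Invented sample tokens, not taken from the corpus.
sample = [
    "1\tKoira\tkoira\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tnäki\tnähdä\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tkoiran\tkoira\tNOUN\t_\t_\t2\tobj\t_\t_",
    "",
    "1\tkoiran\tkoira\tNOUN\t_\t_\t2\tnmod:poss\t_\t_",
    "2\thäntä\thäntä\tNOUN\t_\t_\t0\troot\t_\t_",
]
for value, count in lemma_profile(sample, "koira", "FORM").most_common():
    print(value, count)
```

Forms are compared as they appear in the file, so "Koira" and "koira" are counted separately, as in the output above.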



## Surrounding words or other context

What kinds of words does the target word co-occur with?
(Note that this is just a raw frequency list, not a statistical collocation measure.)

In [6]:
print(count_word_context("koira", "LEMMA", "pb_smallpart.conllu.gz"))
print()
print(count_word_context("koira", "UPOS", "pb_smallpart.conllu.gz"))

koira 105
olla 46
ja 43
se 18
hän 16
joka 12
kun 8
ei 6
tulla 5
voida 5
mutta 5
että 5
tehdä 5
tämä 5
minä 4
jos 4
tai 4
lapsi 4
oma 4
tuo 4


NOUN 337
VERB 125
PROPN 70
PRON 67
ADV 66
ADJ 64
CCONJ 56
AUX 51
SCONJ 23
NUM 11
ADP 10
SYM 6
INTJ 2
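
How "context" is defined matters for the interpretation of these lists. The notebook does not show how `count_word_context` defines it, so the sketch below makes its own assumptions: context means the tokens within a fixed window on either side of the target inside the same sentence, and punctuation is ignored. The names `sentences`, `context_counts` and `window` are hypothetical.

```python
from collections import Counter

LEMMA, UPOS = 2, 3  # zero-based positions of the LEMMA and UPOS columns

def sentences(lines):
    """Yield each sentence as a list of 10-field token lists."""
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():  # a blank line ends the sentence
            if current:
                yield current
                current = []
        elif not line.startswith("#"):
            fields = line.split("\t")
            if len(fields) == 10:
                current.append(fields)
    if current:
        yield current

def context_counts(lines, lemma, column, window=1):
    """Count `column` values of tokens within `window` positions of `lemma`."""
    counts = Counter()
    for sent in sentences(lines):
        for i, token in enumerate(sent):
            if token[LEMMA] != lemma:
                continue
            start = max(0, i - window)
            for neighbour in sent[start:i] + sent[i + 1:i + 1 + window]:
                if neighbour[UPOS] != "PUNCT":  # assumption: skip punctuation
                    counts[neighbour[column]] += 1
    return counts

# Invented sample sentence.
sample = [
    "1\tIso\tiso\tADJ\t_\t_\t2\tamod\t_\t_",
    "2\tkoira\tkoira\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\thaukkuu\thaukkua\tVERB\t_\t_\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
]
print(context_counts(sample, "koira", LEMMA))
print(context_counts(sample, "koira", UPOS))
```

A larger window captures looser associations at the cost of more noise from frequent function words.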



## Type B perspective: Text level
### Most frequent tokens + lemmas in a text


In [7]:
print("Most frequent tokens")
print(most_frequent("pb_smallpart.conllu.gz", "FORM", 10)) # the number defines how many we want to see

print("Most frequent lemmas")
print(most_frequent("pb_smallpart.conllu.gz", "LEMMA", 10)) # the number defines how many we want to see

Most frequent tokens
ja 187895
on 98379
oli 49492
että 37319
ei 29839
joka 24799
hän 23282
vuonna 20998
hänen 19782
se 18595

Most frequent lemmas
olla 224316
ja 190309
hän 80894
se 75515
joka 58328
vuosi 47916
ei 47496
että 37784
tämä 35210
kun 21057



## Distribution of POS tags or dependency relations
The CoNLL-U columns (tagsets) are:
ID,FORM,LEMMA,UPOS,XPOS,FEAT,HEAD,DEPREL,DEPS,MISC


In [8]:
print("Most frequent lemmas, or whatever tagset (column) is specified")
print(most_frequent("pb_smallpart.conllu.gz", "LEMMA", 10))

print("Then the most frequent part-of-speech tags")
print(most_frequent("pb_smallpart.conllu.gz", "UPOS", 10))

Most frequent lemmas, or whatever tagset (column) is specified
olla 224316
ja 190309
hän 80894
se 75515
joka 58328
vuosi 47916
ei 47496
että 37784
tämä 35210
kun 21057

Then the most frequent part-of-speech tags
NOUN 1557042
PROPN 808304
VERB 654108
ADJ 381399
ADV 381011
PRON 362355
AUX 287354
CCONJ 246675
NUM 195055
SCONJ 105037



## Focusing text-level analysis on particular tags

In [9]:
print("Most frequent lemmas under a specific tagset(column).")
print("For instance, the most frequent lemmas that receive the ADJ tag in the UPOS column.")
print()
print(most_frequent("pb_smallpart.conllu.gz", "UPOS", 10, "ADJ"))

print("Or the most frequent lemmas that receive the nsubj tag in the DEPREL column")
print()
print(most_frequent("pb_smallpart.conllu.gz", "DEPREL", 10, "nsubj"))

Most frequent lemmas under a specific tagset(column).
For instance, the most frequent lemmas that receive the ADJ tag in the UPOS column.

ensimmäinen 11915
suuri 9877
hyvä 8696
uusi 8267
usea 6615
oma 6224
toinen 6027
pieni 4724
koko 3994
eri 3896

Or the most frequent lemmas that receive the nsubj tag in the DEPREL column

hän 35406
joka 21604
se 14226
tämä 2813
minä 2734
mikä 2110
ihminen 1571
joku 1183
hallitus 1109
osa 1083
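
Restricting a frequency list to one tag is a filter followed by a count: keep only the tokens whose value in the chosen column equals the tag, then tally their lemmas. The following sketch illustrates that logic; it is not the code of `most_frequent`, and the name `most_frequent_lemmas` and the sample tokens are invented.

```python
from collections import Counter

COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEAT", "HEAD", "DEPREL", "DEPS", "MISC"]

def most_frequent_lemmas(lines, column, value, n=10):
    """Return the n most frequent lemmas among tokens where `column` == `value`."""
    idx = COLUMNS.index(column)
    lemma_idx = COLUMNS.index("LEMMA")
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == len(COLUMNS) and fields[idx] == value:
            counts[fields[lemma_idx]] += 1
    return counts.most_common(n)

# Invented sample tokens.
sample = [
    "1\tHän\thän\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\ttuli\ttulla\tVERB\t_\t_\t0\troot\t_\t_",
    "",
    "1\tHän\thän\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tnäki\tnähdä\tVERB\t_\t_\t0\troot\t_\t_",
    "3\thänet\thän\tPRON\t_\t_\t2\tobj\t_\t_",
    "",
    "1\tKoira\tkoira\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\thaukkui\thaukkua\tVERB\t_\t_\t0\troot\t_\t_",
]
for lemma, count in most_frequent_lemmas(sample, "DEPREL", "nsubj"):
    print(lemma, count)
```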



## Focus and / or compare specific registers

Registers used:
* how-to/instructions
* informational description
* informational persuasion general
* interactive discussion
* machine translated
* narrative
* opinion

The GitHub repo includes two register-specific datasets: narrative and how-to/instructions. You can also extract other registers.


In [10]:
from analyze import extract_register

extract_register("opinion", "pb_smallpart.conllu.gz")

! ls # see it's there!

Wrote opinion texts to a file!
analyze.py	      opinion_ext.conllu	      __pycache__
how-to.conllu.gz      pb_even_smaller_part.conllu.gz
narrative_ext.conllu  pb_smallpart.conllu.gz
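
Register extraction can rely on the metadata shown in the zcat output earlier: each document starts with a `# <doc ...>` line and carries a `# predicted register: ...` line. The sketch below assumes exactly that layout and keeps the lines of every document whose predicted register matches. How `extract_register` actually does this is not shown in this notebook; `docs_with_register` is a hypothetical name, and the sample lines are abbreviated and invented.

```python
def docs_with_register(lines, register):
    """Return all lines of the documents predicted to be in `register`."""
    wanted = "# predicted register: " + register
    kept, current = [], []

    def matches(doc):
        return any(line.strip() == wanted for line in doc)

    for line in lines:
        # Assumption: a "# <doc " line opens a new document.
        if line.startswith("# <doc ") and current:
            if matches(current):
                kept.extend(current)
            current = []
        current.append(line)
    if current and matches(current):  # do not forget the last document
        kept.extend(current)
    return kept

# Abbreviated, invented sample with two documents.
sample = [
    '# <doc id="1">',
    "# predicted register: narrative",
    "1\tKoira\tkoira\tNOUN\t_\t_\t0\troot\t_\t_",
    "",
    '# <doc id="2">',
    "# predicted register: opinion",
    "1\tHyvä\thyvä\tADJ\t_\t_\t0\troot\t_\t_",
    "",
]
print("\n".join(docs_with_register(sample, "opinion")))
```

Writing the returned lines to a file gives a register-specific corpus that the other functions can read in the same way as the full data.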
