<a href="https://colab.research.google.com/github/mavela/Linguistics-with-conllu-data/blob/master/Predicted_keywords_with_an_SVM_and_english_web_registers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text classification on web registers + predicted keywords
## Steps

1. Get the data from GitHub
2. Have a look at the CoNLL-U data
3. Decide which two registers you want to compare (the classifier is binary, so it takes exactly two classes)
4. Featurize
5. Split into train and test sets
6. Run the SVM
7. Evaluate and optimize
8. Analyze the keywords

## Things to analyze

1. Do the results vary between different registers?
2. Do the results vary between feature sets?
3. Do the keywords or keyfeatures make sense?
4. What do the keywords or features tell you about the text classes?


### 1. Data from GitHub

This is the same data we used for the keyword analysis.

In [1]:
! git clone https://github.com/mavela/Linguistics-with-conllu-data.git
%cd Linguistics-with-conllu-data/
! ls

Cloning into 'Linguistics-with-conllu-data'...
remote: Enumerating objects: 76, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 76 (delta 6), reused 13 (delta 4), pack-reused 61
Unpacking objects: 100% (76/76), done.
Checking out files: 100% (17/17), done.
/content/Linguistics-with-conllu-data
analyze.py
common.py
data
keyness-conllu.py
keyness-filt.py
keyness.py
Notebook_Poitiers.ipynb
Predicted_keywords_with_an_SVM_and_english_web_registers.ipynb
__pycache__
svm_explain.py
svm.py
text_dispersion_filt.py
text_dispersion.py


### 2. Have a look at the data

This step is also the same as before: a standard check that the data looks as expected.

In [2]:
! head -20 data/sr.conllu

## register SR
# ../CORE_raw/sr
# ../CORE_raw/sr/1+NA+SR+NA-NA-NA-NA+SR-SR-SR-SR+NNNY+1758585.txt
# <1758585>
# <http://www.infonews.co.nz/news.cfm?id=89071>
# <Rater 1: NA_SR * QU * N * ID: A38IZBOGFD2S9Y>
# <Rater 2: NA_SR * QU * N * ID: A3UOK7WMVFVHHE>
# <Rater 3: NA_SR *  * N * ID: A33SMNMTMIOJ6T>
# <Rater 4: NA_SR *  * Y * ID: A193YDFH1CD4JX>
# newdoc
# newpar
# sent_id = 1
# text = The kiwi crews just failed to qualify for a debut A Final.
1	The	the	DET	DT	Definite=Def|PronType=Art	3	det	_	SpacesBefore=\n
2	kiwi	kiwi	PROPN	NNP	Number=Sing	3	compound	_	_
3	crews	crew	NOUN	NNS	Number=Plur	5	nsubj	_	_
4	just	just	ADV	RB	_	5	advmod	_	_
5	failed	fail	VERB	VBD	Mood=Ind|Tense=Past|VerbForm=Fin	0	root	_	_
6	to	to	PART	TO	_	7	mark	_	_
7	qualify	qualify	VERB	VB	VerbForm=Inf	5	xcomp	_	_
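
To make the column layout concrete, here is a minimal sketch of how one token line can be split into its ten tab-separated columns. The column names are the ones used later in this notebook; `parse_token_line` is a hypothetical helper written for illustration, not a function from the repository.

```python
# Column indices of a CoNLL-U token line, in the order used later in this notebook
ID, FORM, LEMMA, UPOS, XPOS, FEAT, HEAD, DEPREL, DEPS, MISC = range(10)


def parse_token_line(line):
    """Split one CoNLL-U token line into its ten tab-separated columns.

    Hypothetical helper for illustration; the repository has its own reader.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    return cols


# Token line copied from the output above
line = "5\tfailed\tfail\tVERB\tVBD\tMood=Ind|Tense=Past|VerbForm=Fin\t0\troot\t_\t_"
token = parse_token_line(line)
print(token[FORM], token[LEMMA], token[UPOS], token[DEPREL])  # failed fail VERB root
```

Lines that start with `#` are comments and metadata (register, source URL, sentence text), so a reader would skip them before splitting.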


### 3. Featurize

Decide which registers you want to compare, then featurize them.

The script extracts the CoNLL-U column named by the argument (here `LEMMA`) and writes the result to a file called `[register].feats`.

Run the script for both register classes to get your data!


In [29]:
from analyze import save_text_format

save_text_format("data/df.conllu", "LEMMA")

O fi data/df.conllu
FILE data/df.conllu
ORIG DATA data/df.conllu


Have a look at the file content to be sure it's what you meant! You'll see that the script also adds a register label to each line.

In [21]:
! head -20 data/sr.feats # again, you can change the filename to match your files

sr	the kiwi crew just fail to qualify for a debut a final . the man 's eight be eliminate from the a final after finish fifth in the repechage . credit : Rowing New Zealand New Zealand 's new lightweight four fail to make the a final in Belgrade last night after a brave effort on its debut - lead until the final sprint where it be demote to third and consign to the b final . they will be join in the b final by the heavyweight four , who also narrowly fail to make it through . the eight be eliminate after finish fifth in the repechage . need a first or second place to qualify for the main a final , Curtis Rapley , James Lassche , Graham Oberlin Brown and Duncan grant take the lead early and be still ahead at 500 metre - ensure they would be in the battle as the race develop in a class that boast a huge number of competitive , close boat . they hold onto the lead through halfway by half a second from China and less than a second from France . and they be still ahead by just tenth of a se
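
The output format can be pictured with a small sketch: one line per document, made of the register label followed by the chosen column of every token. This is only an illustration of the format seen above, not the actual implementation of `save_text_format`. The function name `featurize_document` and the tab between label and text are assumptions.

```python
# Column indices of a CoNLL-U token line
ID, FORM, LEMMA, UPOS, XPOS, FEAT, HEAD, DEPREL, DEPS, MISC = range(10)


def featurize_document(label, sentences, column=LEMMA):
    """Build one .feats-style line: label, a tab, then the chosen column joined by spaces.

    Hypothetical helper; the tab separator is an assumption about the file format.
    """
    values = [token[column] for sent in sentences for token in sent]
    return label + "\t" + " ".join(values)


# One sentence with the first three tokens from the sr.conllu sample shown earlier
sentences = [[
    ["1", "The", "the", "DET", "DT", "Definite=Def|PronType=Art", "3", "det", "_", "SpacesBefore=\\n"],
    ["2", "kiwi", "kiwi", "PROPN", "NNP", "Number=Sing", "3", "compound", "_", "_"],
    ["3", "crews", "crew", "NOUN", "NNS", "Number=Plur", "5", "nsubj", "_", "_"],
]]

print(featurize_document("sr", sentences))               # label + lemmas
print(featurize_document("sr", sentences, column=UPOS))  # label + part-of-speech tags
```

Switching the column changes the feature set, which is one way to explore whether the results vary between feature sets.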

### 4. Split to train and test

Put the first 400 lines into the train set and the last 199 lines into the test set.

Again, do this for both of your register files.

In [38]:
! head -n 400 data/df.feats > data/df-train.feats # again, change the filenames here to match your registers
! tail -n 199 data/df.feats > data/df-test.feats

Next, combine the register-specific train and test sets into two files and shuffle the lines.

In [39]:
! cat data/sr-train.feats data/df-train.feats | shuf > data/sr-df-train.feats
! cat data/sr-test.feats data/df-test.feats | shuf > data/sr-df-test.feats
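
For readers who prefer to stay in Python, the same split and shuffle can be sketched as follows. The function names and the toy lines are hypothetical. Unlike `head` and `tail`, this version refuses to run when the train and test slices would overlap.

```python
import random


def split_train_test(lines, n_train=400, n_test=199):
    """Take the first n_train lines for training and the last n_test lines for testing."""
    if n_train + n_test > len(lines):
        raise ValueError("train and test would overlap: the file has too few lines")
    return lines[:n_train], lines[len(lines) - n_test:]


def combine_and_shuffle(first, second, seed=0):
    """Concatenate two lists of lines and shuffle them reproducibly."""
    combined = list(first) + list(second)
    random.Random(seed).shuffle(combined)
    return combined


# Toy stand-ins for the lines of two .feats files
df_lines = [f"df\tdocument {i}" for i in range(600)]
sr_lines = [f"sr\tdocument {i}" for i in range(600)]

df_train, df_test = split_train_test(df_lines)
sr_train, sr_test = split_train_test(sr_lines)
train = combine_and_shuffle(sr_train, df_train)
test = combine_and_shuffle(sr_test, df_test)
print(len(train), len(test))  # 800 398
```

The fixed seed makes the shuffle repeatable between runs, which `shuf` does not do by default.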

### 5. Then finally the classification!

### Questions: 

How well does the classifier perform? If you train several models with different registers, do the results differ? If so, what could this reflect?



In [40]:
! python3 svm.py data/sr-df-train.feats data/sr-df-test.feats

              precision    recall  f1-score   support

          df       0.96      0.90      0.93       198
          sr       0.91      0.96      0.93       199

    accuracy                           0.93       397
   macro avg       0.93      0.93      0.93       397
weighted avg       0.93      0.93      0.93       397
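
When reading the report, it helps to remember that the f1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values printed above. Because the inputs are already rounded, the last digit could in general differ from the report, although here it agrees.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the report above
report = {"df": (0.96, 0.90), "sr": (0.91, 0.96)}

for label, (precision, recall) in report.items():
    print(label, round(f1_score(precision, recall), 2))
# df 0.93
# sr 0.93
```

The macro average is the plain mean of the per-class scores, while the weighted average weights each class by its support. With two classes of almost equal size, the two are nearly identical.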



### 6. Extracting predicted keywords

`svm_explain.py` is otherwise the same script as `svm.py`, but it also prints the most positive features for each class.

### Questions:

Do the keywords make sense?

Do they differ from the keywords computed with the standard keyness method or with text dispersion?

In [41]:
! python3 svm_explain.py data/sr-df-train.feats data/sr-df-test.feats

              precision    recall  f1-score   support

          df       0.96      0.90      0.93       198
          sr       0.91      0.96      0.93       199

    accuracy                           0.93       397
   macro avg       0.93      0.93      0.93       397
weighted avg       0.93      0.93      0.93       397

Positive features for the first class
you
i
re
people
or
government
use
forum
if
not
would
do
allah
dog
post
no
my
someone
as
site
be
etc
die
visit

Positive features for the second class
week
champion
say
asp
goal
performance
year
nfl
match
ranger
at
three
two
in
coach
sport
arsenal
against
race
club
last
football
league
season
game
win
team
player
play
the
he
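
One common way to get such lists from a linear classifier is to sort the features by their learned weight: the most negative weights point to one class and the most positive to the other. Whether `svm_explain.py` does exactly this is an assumption. The helper `top_features` is hypothetical, and the weights below are made-up toy values, not numbers from the trained model.

```python
def top_features(weights, k=3):
    """Return the k most negative and the k most positive features of a linear model.

    In a binary linear classifier, strongly negative weights push a document
    towards one class and strongly positive weights towards the other.
    """
    ranked = sorted(weights, key=weights.get)  # ascending by weight
    negative = [f for f in ranked if weights[f] < 0][:k]
    positive = [f for f in reversed(ranked) if weights[f] > 0][:k]
    return negative, positive


# Hypothetical weights, for illustration only
toy_weights = {
    "you": -1.2, "forum": -0.7, "post": -0.4,
    "of": 0.0,
    "player": 0.8, "game": 0.9, "team": 1.1,
}

first_class, second_class = top_features(toy_weights, k=2)
print(first_class)   # ['you', 'forum']
print(second_class)  # ['team', 'game']
```

Features with a weight near zero, like `of` in the toy example, do not help the classifier separate the classes and never show up in either list.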


### 7. Understanding the keywords

Again, to understand what the keywords do in the texts, you can print example sentences that contain them.


In [42]:
from analyze import read_conllu

# Column indices of a CoNLL-U token line
ID, FORM, LEMMA, UPOS, XPOS, FEAT, HEAD, DEPREL, DEPS, MISC = range(10)

count = 0
with open("data/sr.conllu", "r") as f:  # here you can change the data file
    for comm, sent in read_conllu(f):
        # Keep sentences that contain an nsubj relation. To search for a word
        # such as "you" instead, test it against token[FORM].lower().
        if "nsubj" in [token[DEPREL].lower() for token in sent]:
            count += 1
            if count > 5:  # here we specify how many sentences we want to see
                break
            print(" ".join(token[FORM] for token in sent))    # the word forms
            print(" ".join(token[DEPREL] for token in sent))  # the dependency relations
            print()

The kiwi crews just failed to qualify for a debut A Final .
det compound nsubj advmod root mark xcomp case det obl det nmod:npmod punct

CREDIT : Rowing New Zealand New Zealand 's new lightweight four failed to make the A final in Belgrade last night after a brave effort on its debut - leading until the final sprint where it was demoted to third and consigned to the B Final .
root punct compound compound compound compound nmod:poss case amod amod nsubj appos mark xcomp det det obj case obl amod obl:tmod case det amod obl case nmod:poss nmod punct advcl case det amod obl advmod nsubj:pass aux:pass acl:relcl case obl cc conj case det compound obl punct

Finals by the heavyweight four , who also narrowly failed to make it through .
root case det nmod nummod punct nsubj advmod advmod acl:relcl mark xcomp obj advmod punct

Needing a first or second place to qualify for the main A Final , Curtis Rapley , James Lassche , Graham Oberlin Brown and Duncan Grant took the lead early and were still