<a href="https://colab.research.google.com/github/mavela/Linguistics-with-conllu-data/blob/master/Predicted_keywords_with_an_SVM_and_english_web_registers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text classification on web registers + predicted keywords
## Steps

1. Get data from GitHub
2. Have a look at the CoNLL-U data
3. Decide which two registers you want to compare (the classifier is binary, so it takes exactly two classes)
4. Featurize
5. Split into train and test sets
6. Run the SVM + Evaluate
7. Extract the keywords
8. Analyze the keywords

## Things to analyze

1. Do the results vary between different registers?
2. Do the results vary between feature sets?
3. Do the keywords or key features make sense?
4. What do the keywords or features tell you about the text classes?


### 1. Data from GitHub

This is the same data we used for the keywords.

In [None]:
! git clone https://github.com/mavela/Linguistics-with-conllu-data.git
%cd Linguistics-with-conllu-data/

! echo "The folder includes these files"
! ls

### 2. Have a look at the data

This is also the same as before: a standard check to make sure everything is fine!

In [None]:
! head -20 data/sr.conllu
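
To make the column layout concrete, here is a minimal sketch of how one CoNLL-U token line breaks down into its ten tab-separated fields. The helper `parse_token_line` and the example line are made up for illustration; they are not part of `analyze.py` or of `data/sr.conllu`.

```python
# Column names defined by the CoNLL-U format, in order.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a name -> value dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    return dict(zip(COLUMNS, fields))

# Made-up example line (hypothetical, not taken from the data)
example = "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
token = parse_token_line(example)
print(token["FORM"], token["LEMMA"], token["UPOS"], token["DEPREL"])
```

Lines starting with `#` in the file are comments (sentence metadata), and blank lines separate sentences, so a full reader would skip those before splitting.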

### 3. Choose your registers

Decide which registers you want to compare. Then featurize them.

### 4. Featurize

This script extracts one column of the CoNLL-U file, chosen by the second argument (here `LEMMA`), and uses its values as features. It writes the result to a file named `[register].feats`.

Run the script once for each of your two registers to get your data!


In [3]:
from analyze import save_text_format

save_text_format("data/df.conllu", "LEMMA")

Have a look at the file content to be sure it's what you meant! You'll see that the script also adds a register label to each line.

In [None]:
! head -15 data/df.feats # again, you can change the filename to match your files
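
As a rough illustration of what featurizing by column means, the sketch below joins one column of every token in a document into a single labelled line. This is not the code of `save_text_format`: the function name `column_features`, the tab between label and features, and the example sentences are all assumptions made for this sketch, so check the real layout with `head` as above.

```python
def column_features(sentences, column, label):
    """Join one column of every token into a single labelled line.

    sentences: list of sentences, each a list of 10-field token lists.
    column: index of the CoNLL-U column to keep (2 is LEMMA).
    label: register label placed first on the line (assumed layout).
    """
    values = [token[column] for sent in sentences for token in sent]
    return label + "\t" + " ".join(values)

LEMMA = 2

# Two made-up sentences standing in for one document (hypothetical data)
doc = [
    [["1", "Dogs", "dog", "NOUN", "NNS", "_", "2", "nsubj", "_", "_"],
     ["2", "bark", "bark", "VERB", "VBP", "_", "0", "root", "_", "_"]],
    [["1", "Cats", "cat", "NOUN", "NNS", "_", "2", "nsubj", "_", "_"],
     ["2", "sleep", "sleep", "VERB", "VBP", "_", "0", "root", "_", "_"]],
]

line = column_features(doc, LEMMA, "df")
print(line)
```

Swapping `LEMMA` for another column index (for example 3 for UPOS) gives a different feature set for the same texts, which is what the second analysis question above asks you to compare.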

### 5. Split to train and test

Let's put the first 400 lines into the train set and the last 199 lines into the test set.

**Again, this should be done for both of your register files**

In [6]:
! head -n 400 data/df.feats > data/df-train.feats # again, change the filenames here to match your registers
! tail -n 199 data/df.feats > data/df-test.feats
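
One thing to watch with a head/tail split: if the file has fewer lines than the two slices together, the same lines end up in both train and test, and the evaluation becomes too optimistic. The sketch below does the same split in Python and refuses to overlap. The function name `split_head_tail` and the toy data are assumptions for illustration only.

```python
def split_head_tail(lines, n_train, n_test):
    """Return (first n_train lines, last n_test lines), refusing to overlap."""
    if n_train + n_test > len(lines):
        raise ValueError("train and test slices would overlap")
    return lines[:n_train], lines[len(lines) - n_test:]

# Toy stand-in for the lines of a .feats file (hypothetical data)
lines = [f"doc{i}" for i in range(10)]
train, test = split_head_tail(lines, 6, 3)
print(len(train), len(test))
```

A quick `! wc -l data/df.feats` before splitting tells you whether your file is long enough for the sizes you chose.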

Next, let's combine the register-specific train and test sets into two files and shuffle them.

In [8]:
! cat data/sr-train.feats data/df-train.feats | shuf > data/sr-df-train.feats # combine the train sets of BOTH registers; change the filenames to match yours
! cat data/sr-test.feats data/df-test.feats | shuf > data/sr-df-test.feats
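
`shuf` gives a different order on every run. If you want a shuffle you can reproduce, a seeded version in Python is one option. This is only a sketch: the function name `combine_and_shuffle`, the seed value, and the example lines are made up for illustration.

```python
import random

def combine_and_shuffle(first, second, seed=0):
    """Concatenate two lists of lines and shuffle them with a fixed seed."""
    combined = list(first) + list(second)
    random.Random(seed).shuffle(combined)  # local generator, global state untouched
    return combined

# Made-up labelled lines for two registers (hypothetical data)
sr_train = ["sr\tmatch goal team", "sr\tcoach win league"]
df_train = ["df\tthink opinion agree", "df\tquestion thread reply"]

mixed = combine_and_shuffle(sr_train, df_train, seed=13)
print(len(mixed))
```

Calling the function again with the same seed returns the lines in the same order, which makes later results easier to compare across runs.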

### 6. Finally, the classification and the evaluation!

### Questions:

How well does the classifier perform? If you train several models with different registers, do the results differ? If so, what could this reflect?



In [None]:
! python3 svm.py data/sr-df-train.feats data/sr-df-test.feats
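
When reading the evaluation, it helps to know how the usual scores are computed. The exact output of `svm.py` is not shown here, so the sketch below is only a general illustration of accuracy, precision, recall and F1 for a binary task. The function name `evaluate` and the gold and predicted labels are made up.

```python
def evaluate(gold, predicted, positive):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels for six test documents (hypothetical data)
gold = ["sr", "sr", "df", "df", "sr", "df"]
pred = ["sr", "df", "df", "df", "sr", "sr"]

scores = evaluate(gold, pred, "sr")
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

With two equally sized classes, a classifier that always guesses the same label reaches an accuracy of 0.5, so that is the baseline to compare your results against.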

### 7. Extracting predicted keywords

This script is otherwise the same as `svm.py`, but it also prints the predicted keywords.

### Questions:

Do the keywords make sense?

Do they differ from the keywords calculated with the standard method or with text dispersion?

In [None]:
! python3 svm_explain.py data/sr-df-train.feats data/sr-df-test.feats | head -30
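
A common way to get keywords out of a linear classifier is to rank the features by their learned weights: large positive weights point to one class, large negative weights to the other. Whether `svm_explain.py` works exactly this way is an assumption here, and the function `top_features` and the weights below are invented for illustration.

```python
def top_features(weights, k):
    """Return the k most negative and the k most positive (feature, weight) pairs."""
    ranked = sorted(weights.items(), key=lambda item: item[1])
    most_negative = ranked[:k]
    most_positive = ranked[::-1][:k]
    return most_negative, most_positive

# Invented feature weights, NOT output of a real model
weights = {"goal": 1.4, "team": 0.9, "the": 0.05,
           "think": -0.8, "opinion": -1.2}

negative, positive = top_features(weights, 2)
print("class A keywords:", [feature for feature, _ in positive])
print("class B keywords:", [feature for feature, _ in negative])
```

Features with weights near zero, like `the` in this toy example, occur in both classes about equally and so tell the classifier little.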

### 8. Understanding the keywords

Again, to understand what the keywords do in the texts, you can print example sentences that contain them.


In [None]:
from analyze import read_conllu

# Column indices of the ten CoNLL-U fields
ID, FORM, LEMMA, UPOS, XPOS, FEAT, HEAD, DEPREL, DEPS, MISC = range(10)

count = 0
with open("data/sr.conllu", "r") as f:  # here you can change the data file
    for comm, sent in read_conllu(f):
        # Keep sentences containing an "nsubj" relation; to search for a keyword
        # instead, test token[FORM] or token[LEMMA] against that word
        if "nsubj" in [token[DEPREL].lower() for token in sent]:
            count += 1
            if count > 5:  # here we specify how many sentences we want to see
                break
            # Print the sentence FORMs and DEPRELs; other columns work as well
            print(" ".join(token[FORM] for token in sent))
            print(" ".join(token[DEPREL] for token in sent))
            print()