# BERT FEATURES
## embedding based
* ✓ get a sentence embedding (easy with any model)
* ✓ get a single word embedding (how to ensure for uncased model that the right word index is given? use cased model for now)
* ✓ get a MWE embedding (same as for single words that are tokenized into several subwords)

* get a top masked single word embedding
* get a masked MWE embedding (one mask for now)
* get a sentence embedding with masked target word
* get a distance from masked embedding to a target embedding
* ✓ get a score for a masked target (average for parts of a tokenized target)

## tokenization based
* ✓ get a length of a tokenized target (/2 for mwe)
* ✓ get a length of tokenized target in characters ((len-1)/2 for mwe)
* ✓ get a sentence length in tokens
* ✓ get an average word length in tokens (average number of tokens per word in a sentence)

# OTHER FEATURES
* ✓ number of WordNet senses for a word
* ✓ log frequency in general English
* frequency in a specific corpus vs general (rank comparison)
* ✓ number of morphs
* ✓ frequency of morphs


TO DO:
✓ deal with test 38LRF35D5LWPYKNDAPAKMD6HD1M3UI (lowercase the target)

# Table of contents
* [Data loading](#data_load) 
* [BERT loading](#bert_load)
    * [BERT cheat sheet](#bert_cheat)
* [1: get token indices](#1)
* [2: get non-mask features](#2)
    * [3: save non-mask features](#3)
* [4: get MASK-related features](#4)
* [5: get WordNet features](#5)
* [6: get word frequency features](#6)
* [7: get Morfessor features](#7)
* [8: save non-embedding features](#8)

In [1]:
import transformers
from transformers import BertTokenizer, BertModel, AutoModelWithLMHead
import pandas as pd
import numpy as np
import torch



# Load data <a class="anchor" id="data_load"></a>

In [141]:
def data_loader(file_name):
    # quoting=3 is csv.QUOTE_NONE: quote characters are kept as plain text
    df = pd.read_csv(file_name, sep='\t', quoting=3, na_filter=False)
    if 'subcorpus' in df.columns:
        df = df.rename(columns={'subcorpus':'corpus'})
    return df

In [152]:
single_train = data_loader('data/train/lcp_single_train.tsv')
multi_train = data_loader('data/train/lcp_multi_train.tsv')

single_trial = data_loader('data/trial/lcp_single_trial.tsv')
multi_trial = data_loader('data/trial/lcp_multi_trial.tsv')

single_test = data_loader('data/test/lcp_single_test_labels.tsv')
multi_test = data_loader('data/test/lcp_multi_test_labels.tsv')

# Load BERT models  <a class="anchor" id="bert_load"></a>

In [4]:
tokenizer_propercased = BertTokenizer.from_pretrained('bert-base-cased')
model_propercased = BertModel.from_pretrained("bert-base-cased")
config = model_propercased.config

## BERT cheat sheet <a class="anchor" id="bert_cheat"></a>

In [5]:
sentence = 'sea.'
# plain list of token ids, including [CLS] and [SEP]
sentence_token_ids = tokenizer_propercased(sentence)['input_ids']
# the same ids as a batch x num tokens tensor, ready to be passed to a model
sentence_token_ids_tensor = tokenizer_propercased(sentence, padding=True, truncation=True, return_tensors="pt" )['input_ids']
sentence_tokens = tokenizer_propercased.convert_ids_to_tokens(sentence_token_ids)
print(tokenizer_propercased.mask_token)
print(sentence_tokens)

[MASK]
['[CLS]', 'sea', '.', '[SEP]']


# 1 get the indices of the subwords for a target word/word pair <a class="anchor" id="1"></a>
1. tokenize a sentence + tokenize a target
2. check if target is one subword long
* **case 1**: word is tokenized as a whole
    3. check that a word is in sentence tokens and get its index (indices for multiple occurrences) in a sentence
    4. otherwise get a sentence token that contains the target as its part
* **case 2**: word is tokenized into subwords
    3. get indices of subwords 

In [6]:
def get_index(tokenized_sentence, tokenized_target):
    """Returns positions of target tokens in a tokenized sentence
    
    Parameters
    ----------
    tokenized_sentence : iterable object
        an iterable object with sentence tokens including special tokens
    tokenized_target : iterable object
        an iterable object with target tokens
    
    Returns
    -------
    target_inds : indices of target parts in a tokenized sentence
    """
    
    target_inds=[]
    parts_len = len(tokenized_target)
    
    # if target is tokenized into several subwords
    if parts_len > 1:
        sentence_len = len(tokenized_sentence)
        for i in range(sentence_len-parts_len+1):
            if tokenized_sentence[i:i+parts_len] == tokenized_target:
                target_inds += [j for j in range(i,i+parts_len)]
    
    # if target is left as a whole
    if parts_len == 1:
        target = tokenized_target[0]
        if target in tokenized_sentence:
            # get indices of all target occurrences
            target_inds += [i for i,val in enumerate(tokenized_sentence) if val==target]
        else:
            # get indices of all tokens that contain a target
            target_inds += [i for i,val in enumerate(tokenized_sentence) if target in val]
    
    return target_inds            

In [7]:
print(get_index([1,2,3,4,1],[1]))
print(get_index([1,2,3,4,1],[1,2]))
print(get_index([1,2,3,4,1,2],[1,2]))
print(get_index([1,3,2,4,1,2],[1,2]))
print(get_index(['ab','c','d','ac'],['a']))
print(get_index(['ab','c','d','a'],['a']))

[0, 4]
[0, 1]
[0, 1, 4, 5]
[4, 5]
[0, 3]
[3]


# 2 getting features (not mask-related) <a class="anchor" id="2"></a>

1. ✓ tokenize a sentence and a target word
2. ✓ get a sentence len in tokens (num of tokens - 2 special tokens)
3. ✓ get the number of tokens per word in a sentence ((num of tokens - 2 special tokens) / len(sentence.split()))
4. ✓ get a number of tokens in a target (/2 for mwe)
5. ✓ get a list of target indices
6. ✓ put a sentence into a model
7. ✓ get a sentence embedding
8. ✓ get an average embedding for target parts

**TO DO**:
1. transform to batch-applicable
2. step 8 does not handle multiple occurrences of the target yet (only the first one is used)
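
One possible way to approach the second item is sketched below. `get_index` returns a flat list of positions, so for a target with several subwords the list can be cut into chunks of `len(tokenized_target)`, one chunk per occurrence. The helper names `group_occurrences` and `mean_vector` and the toy vectors are assumptions made for illustration; plain lists stand in for the model output so the sketch runs without BERT.

```python
def group_occurrences(target_inds, parts_len):
    """Split the flat index list from get_index into one list per occurrence."""
    return [target_inds[i:i + parts_len] for i in range(0, len(target_inds), parts_len)]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (plain lists of floats)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# toy "model output": one 2-dimensional vector per token position (hypothetical values)
token_vectors = [[float(i), float(i) * 2] for i in range(11)]

# indices as returned for 'seahorse' occurring twice: ['sea', '##horse'] at 4-5 and 8-9
occurrences = group_occurrences([4, 5, 8, 9], parts_len=2)

# option 1: embedding of the first occurrence only
first_embedding = mean_vector([token_vectors[i] for i in occurrences[0]])
# option 2: embedding averaged over every occurrence
all_embedding = mean_vector([token_vectors[i] for occ in occurrences for i in occ])
```

Whether the first occurrence or the average over all occurrences is the better feature is left open here.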

In [45]:
def get_features(sentence, target_word, model, tokenizer, print_out=False):
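    """Extracts non-mask features for a sentence and a target
    
    Parameters
    ----------
    sentence : str   
    target_word : str
        a single word or a space-separated word pair
    model :
        BERT model that returns the hidden states as its first output
    tokenizer : 
        tokenizer matching the model
    print_out : bool
        print target tokens, sentence tokens and target indices
        
    Returns
    -------
    features : dict
        sentence_embedding :
            output vector at the first position ([CLS])
        target_embedding : 
            average output vector over the target's subwords
        sentence_len : 
            number of sentence tokens without the 2 special tokens
        tokens_per_word :
            average number of tokens per whitespace-separated word in a sentence
        target_len : 
            number of target tokens (/2 for mwe)
        target_len_chars :
            number of target characters ((len-1)/2 for mwe)
    """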
    # 1.1 tokenize a sentence
    sentence_len_words = len(sentence.split())
    sentence_token_ids = tokenizer(sentence)['input_ids']
    sentence_tokens = tokenizer.convert_ids_to_tokens(sentence_token_ids)
    
    # 1.2 tokenize a target word
    target_tokens = tokenizer.tokenize(target_word)
    
    # 2. get a sentence len in tokens
    sentence_len = len(sentence_tokens) - 2
    # 3. get an average word len in tokens
    tokens_per_word = (len(sentence_tokens)-2)/sentence_len_words
    # 4. get target len in tokens and characters
    if " " in target_word:
        target_len = len(target_tokens)/2
        target_len_chars = (len(target_word)-1)/2
    else:
        target_len = len(target_tokens)
        target_len_chars = len(target_word)
    
    # 5. get target parts indices
    target_parts_ids = get_index(sentence_tokens, target_tokens)
    
    if print_out:
        print(target_tokens)
        print(sentence_tokens)
        print(target_parts_ids)    
    
    # 6. put a sentence into a model
    sentence_ids_tensor = tokenizer(sentence, padding=True, truncation=True, return_tensors="pt" )['input_ids']
    model_output = model(sentence_ids_tensor)[0] # batch x num indices x 768
    
    # 7. get a sentence embedding
    sentence_embedding = model_output[0][0]
    
    # 8. get a target embedding
    if len(target_tokens)==1:
        # take only the first occurrence
        target_embedding = model_output[0][target_parts_ids[0]]
    else:
        if len(target_parts_ids) > len(target_tokens):
            print("MULTIPLES!!!", target_tokens)
        # take only the first occurrence; the slice end is exclusive,
        # so add len(target_tokens) to cover every subword of the target
        first = target_parts_ids[0]
        target_embedding = model_output[0][first:first+len(target_tokens)].mean(dim=0)
        # alternative: take the mean over all occurrences
        #target_embedding = model_output[0][target_parts_ids].mean(dim=0)
            
            
    features = {}
    features['sentence_embedding'] = sentence_embedding
    features['target_embedding'] = target_embedding
    features['sentence_len'] = sentence_len
    features['tokens_per_word'] = tokens_per_word
    features['target_len'] = target_len
    features['target_len_chars'] = target_len_chars
    return features

In [41]:
sentence = 'this is a seahorse. hello seahorse'
target_word = 'seahorse'

features = get_features(sentence, target_word, model_propercased, tokenizer_propercased, True)
print('--------')
print(features['sentence_len'])
print(features['tokens_per_word'])
print(features['target_len'])
print(features['target_len_chars'])

['sea', '##horse']
['[CLS]', 'this', 'is', 'a', 'sea', '##horse', '.', 'hello', 'sea', '##horse', '[SEP]']
[4, 5, 8, 9]
MULTIPLES!!! ['sea', '##horse']
--------
9
1.5
2
8


# 3 Save features <a class="anchor" id="3"></a>

To avoid repeating the loading and prediction steps again and again, the features are extracted once and saved.

**TO DO**:
1. transform to batch-applicable
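
A first step towards the batch version could be a helper that cuts the rows into fixed-size chunks, as in the sketch below. The name `batches` and the example sentences are assumptions for illustration. Each chunk could then be passed to the tokenizer as a list of strings with `padding=True` and `return_tensors="pt"`; with padding, the token positions of each sentence would also have to respect the attention mask, which this sketch does not cover.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# hypothetical sentences standing in for df.sentence.tolist()
sentences = ['first sentence', 'second one', 'a third', 'the fourth', 'fifth']
sentence_batches = list(batches(sentences, batch_size=2))
```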

In [42]:
def write_features(df, model, tokenizer):
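    """ Adds extracted features into a dataframe
    
    Parameters
    ----------
    df : pandas.DataFrame
        needs the columns id, sentence and token
    model :
        BERT model passed on to get_features
    tokenizer :
        tokenizer matching the model
    
    Returns
    -------
    df with one new column per feature; embeddings are stored as numpy arrays
    """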
    
    sentences=[]
    targets=[]
    sentence_lens=[]
    tokens_per_words=[]
    target_lens=[]
    target_lens_chars=[]
    
    for i, row in df.iterrows():
        
        if row.id =='38LRF35D5LWPYKNDAPAKMD6HD1M3UI':
            features= get_features(row.sentence, row.token.lower(), model, tokenizer)
        else:
            features= get_features(row.sentence, row.token, model, tokenizer)
        
        sentences.append(features['sentence_embedding'].detach().numpy())
        targets.append(features['target_embedding'].detach().numpy())
        sentence_lens.append(features['sentence_len'])
        target_lens.append(features['target_len'])
        tokens_per_words.append(features['tokens_per_word'])
        target_lens_chars.append(features['target_len_chars'])
        
    df['sentence_embedding'] = sentences
    df['target_embedding'] = targets
    df['sentence_len'] = sentence_lens
    df['tokens_per_word'] = tokens_per_words
    df['target_len'] = target_lens
    df['target_len_chars'] = target_lens_chars
    
    return df

### saving numpy arrays for embeddings

In [43]:
single_train = write_features(single_train, model_propercased, tokenizer_propercased)
single_trial = write_features(single_trial, model_propercased, tokenizer_propercased)
single_test = write_features(single_test, model_propercased, tokenizer_propercased)

In [157]:
multi_train = write_features(multi_train, model_propercased, tokenizer_propercased)
multi_trial = write_features(multi_trial, model_propercased, tokenizer_propercased)
multi_test = write_features(multi_test, model_propercased, tokenizer_propercased)

In [46]:
def embeddings_to_csv(file_name, df_embedding):
    
    numpy_embed = np.array(df_embedding.tolist())
    np.savetxt(file_name, numpy_embed, delimiter=",")

In [47]:
embeddings_to_csv("features/single_target_embedding_train.csv", single_train.target_embedding)
embeddings_to_csv("features/single_target_embedding_trial.csv", single_trial.target_embedding)
embeddings_to_csv("features/single_target_embedding_test.csv", single_test.target_embedding)

embeddings_to_csv("features/single_sentence_embedding_train.csv", single_train.sentence_embedding)
embeddings_to_csv("features/single_sentence_embedding_trial.csv", single_trial.sentence_embedding)
embeddings_to_csv("features/single_sentence_embedding_test.csv", single_test.sentence_embedding)

In [48]:
embeddings_to_csv("features/mwe_target_embedding_train.csv", multi_train.target_embedding)
embeddings_to_csv("features/mwe_target_embedding_trial.csv", multi_trial.target_embedding)
embeddings_to_csv("features/mwe_target_embedding_test.csv", multi_test.target_embedding)

embeddings_to_csv("features/mwe_sentence_embedding_train.csv", multi_train.sentence_embedding)
embeddings_to_csv("features/mwe_sentence_embedding_trial.csv", multi_trial.sentence_embedding)
embeddings_to_csv("features/mwe_sentence_embedding_test.csv", multi_test.sentence_embedding)

# 4 getting MASK-related features <a class="anchor" id="4"></a>

Question:
* how to deal with targets that need multiple masks? (only the first occurrence is used for now)

In [53]:
model_masked_lm = AutoModelWithLMHead.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## For single targets
1. ✓ replace target occurrences with a mask in a sentence
2. ✓ get sentence token ids
3. ✓ get mask id (first occurrence if multiple)
4. ✓ get logits for a masked token
5. ✓ get max prob for masked token prediction
6. ✓ get predicted token
7. ✓ see if token was predicted right (impossible for subword tokenized singles)
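
Steps 4 and 5 can be illustrated without a model. The sketch below assumes a toy vocabulary of four entries and made-up logits; `log_softmax` and `target_score` are hypothetical helper names. It shows why the reported scores are negative: they are log-probabilities, and the target score is the average over the vocabulary ids of the target's subwords.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    largest = max(logits)
    log_norm = largest + math.log(sum(math.exp(x - largest) for x in logits))
    return [x - log_norm for x in logits]


def target_score(log_probs, target_ids):
    """Average log-probability over the vocabulary ids of the target's subwords."""
    return sum(log_probs[i] for i in target_ids) / len(target_ids)


# made-up logits at the mask position for a vocabulary of 4 entries
toy_logits = [2.0, 1.0, 0.5, -1.0]
toy_log_probs = log_softmax(toy_logits)

# id of the predicted (most probable) entry
best_id = max(range(len(toy_log_probs)), key=toy_log_probs.__getitem__)
# ids 1 and 3 stand for the two subwords of a hypothetical target
score = target_score(toy_log_probs, [1, 3])
```

The target score can never exceed the score of the predicted entry, so the gap between the two is itself a possible feature.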


What if only the first occurrence is masked and later occurrences of the target are left unmasked?
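
A minimal sketch of that variant, assuming the mask should only replace whole words: `str.replace` also matches inside longer words, so a regular expression with word-boundary checks is used instead. The helper name `mask_first_occurrence` and the default mask string are assumptions for illustration.

```python
import re


def mask_first_occurrence(sentence, target, mask_token='[MASK]'):
    """Replace only the first whole-word occurrence of `target` with `mask_token`."""
    # the lookarounds reject matches that are part of a longer word
    pattern = r'(?<!\w)' + re.escape(target) + r'(?!\w)'
    # a function replacement avoids escape handling in the mask string
    return re.sub(pattern, lambda _match: mask_token, sentence, count=1)


masked = mask_first_occurrence('this is a sea. hello sea', 'sea')
masked_inside = mask_first_occurrence('seahorse at sea', 'sea')
```

With this variant the second occurrence stays visible to the model, which may make the masked target easier to predict.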

In [73]:
def get_mask_features_singles(sentence, token, model_lm, tokenizer):
    """
    Parameters
    ----------
    sentence
    token
    model_lm
    tokenizer
    
    Returns
    -------
    features: dict
        predicted_prob : log-probability of the word that was predicted by a model
        predicted_word : the word that was predicted by a model
        target_prob : log-probability of a target word to be predicted (average for parts)
    """
    features = {}
    
    # replace every occurrence of the target with a mask (plain substring match)
    sentence = sentence.replace(token, tokenizer.mask_token)
    # sentence = sentence.replace(token, tokenizer.mask_token, 1)
    
    # get sentence tokens ids
    sentence_tensor = tokenizer.encode(sentence, return_tensors="pt")
    
    # get mask index (first occurrence)
    mask_token_index = torch.where(sentence_tensor == tokenizer.mask_token_id)[1][0].reshape(1)

    # prediction for everything
    logits = model_lm(sentence_tensor).logits
    
    # prediction for a masked token
    mask_token_logits = logits[0, mask_token_index, :]

    # log-probabilities over the vocabulary for the masked token
    probs = torch.nn.functional.log_softmax(mask_token_logits, dim=1)
    #probs = torch.nn.functional.softmax(mask_token_logits, dim=1)
    predicted_prob = probs.max().item()
    
    # prediction as a word
    top_1_tokens = torch.topk(mask_token_logits, 1, dim=1).indices
    predicted_word = tokenizer.convert_ids_to_tokens(top_1_tokens)[0]
    
    # score the actual target: average log-probability of its subword ids
    tokenized_target = tokenizer.tokenize(token)
    target_ids = tokenizer.convert_tokens_to_ids(tokenized_target)
    target_prob = [probs[0][i].item() for i in target_ids]
    target_prob = sum(target_prob)/len(target_prob)
    
    features['predicted_prob'] = predicted_prob
    features['predicted_word'] = predicted_word
    features['target_prob'] = target_prob
    
    return features

In [74]:
sentence = 'this is a sea. hello sea'
target_word = 'sea'

features = get_mask_features_singles(sentence, target_word, model_masked_lm, tokenizer_propercased)
print('--------')
print(features['predicted_prob'])
print(features['predicted_word'])
print(features['target_prob'])

--------
-3.44461989402771
mistake
-9.556841850280762


In [95]:
def write_mask_features(df, model_lm, tokenizer):
    
    best_probs=[]
    predicted_words=[]
    target_probs=[]
    
    for i, row in df.iterrows():
        if row.id =='38LRF35D5LWPYKNDAPAKMD6HD1M3UI':
            features= get_mask_features_singles(row.sentence, row.token.lower(), model_lm, tokenizer)
        else:
            features= get_mask_features_singles(row.sentence, row.token, model_lm, tokenizer)
        best_probs.append(features['predicted_prob'])
        predicted_words.append(features['predicted_word'])
        target_probs.append(features['target_prob'])
        
    df['predicted_probs'] = best_probs
    df['predicted_words'] = predicted_words
    df['target_probs'] = target_probs
    
    return df

In [96]:
single_train = write_mask_features(single_train, model_masked_lm, tokenizer_propercased)
single_trial = write_mask_features(single_trial, model_masked_lm, tokenizer_propercased)
single_test = write_mask_features(single_test, model_masked_lm, tokenizer_propercased)

## for mwe targets
1. ✓ substitute targets for two masks in a sentence
2. ✓ get sentence tokens ids
3. ✓ get mask ids
4. ✓ get logits for masked tokens
5. ✓ get max probs for each of masked token predictions
6. ✓ get an average of max probs for masked token predictions
7. ✓ get probs for each of masked target parts
8. ✓ get average of probs for each of masked target parts
9. ✓ get predicted tokens
10.   see if any token was correctly predicted
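
Step 10 is still open. A minimal sketch of it, assuming the predicted tokens and the tokenized target words are already available as plain lists: the helper name `correct_predictions` is hypothetical, and the first example reuses the tokens from the "sea horse" test sentence.

```python
def correct_predictions(predicted_words, tokenized_targets):
    """For each target word, report whether its mask was predicted exactly.

    A word split into several subwords can never match, because one mask
    yields exactly one predicted token.
    """
    return [len(parts) == 1 and predicted == parts[0]
            for predicted, parts in zip(predicted_words, tokenized_targets)]


# predictions for the two masks vs. the tokenized pair 'sea horse'
flags = correct_predictions(['new', 'number'], [['sea'], ['horse']])
any_correct = any(flags)

# a made-up case in which the first word is predicted correctly
flags_partial = correct_predictions(['sea', 'lion'], [['sea'], ['horse']])
```

Since `zip` stops at the shorter list, extra predictions from later masked occurrences are ignored.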

In [97]:
def get_mask_features_mwes(sentence, token_pair, model_lm, tokenizer):
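    """
    Parameters
    ----------
    sentence : str
    token_pair : str
        target word pair separated by a space
    model_lm : masked language model
    tokenizer

    Returns
    -------
    features : dict
        best_probs : highest log-probability at each mask position
        predicted_words : top predicted token at each mask position
        pair_probs : log-probability of each target word at its mask (average for parts)
    """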
    
    features = {}
    
    # substitute a target pair with two masks
    sentence = sentence.replace(token_pair, tokenizer.mask_token*2)
    
    # get sentence tokens ids
    sentence_tensor = tokenizer.encode(sentence, return_tensors="pt")
    
    # get mask indices
    mask_token_index = torch.where(sentence_tensor == tokenizer.mask_token_id)[1]

    # prediction for everything
    logits = model_lm(sentence_tensor).logits
    
    # prediction for masked tokens
    mask_token_logits = logits[0, mask_token_index, :]

    # probs of masked tokens
    probs = torch.nn.functional.log_softmax(mask_token_logits, dim=1) # num_of_masks x vocab
    best_probs = probs.max(dim=1)

    # prediction as a word
    top_1_tokens = torch.topk(mask_token_logits, 1, dim=1).indices
    predicted_words = tokenizer.convert_ids_to_tokens(top_1_tokens)
    
    # check if correct
    # get tokens for each of the words
    tokenized_targets = [tokenizer.tokenize(token) for token in token_pair.split(' ')]
        
    pair_probs = [] 
    for i, part in enumerate(tokenized_targets):
        part_probs = []
        # ids of the subwords of the i-th word of the pair
        target_part_ids = tokenizer.convert_tokens_to_ids(part)
        for j in target_part_ids: 
            part_probs.append(probs[i][j].item())
        pair_probs.append(sum(part_probs)/len(part_probs))
    
    features['best_probs'] = best_probs.values.detach().tolist()
    features['predicted_words'] = predicted_words
    features['pair_probs'] = pair_probs
    return features

In [98]:
sentence = 'this is a sea horse. hello sea horse'
target_word = 'sea horse'

features = get_mask_features_mwes(sentence, target_word, model_masked_lm, tokenizer_propercased)
print('--------')
print(features['best_probs'])
print(features['predicted_words'])
print(features['pair_probs'])

--------
[-3.3664755821228027, -3.8095571994781494, -1.9187053442001343, -0.3489280641078949]
['new', 'number', '!', '.']
[-9.650083541870117, -8.456759452819824]


In [142]:
def write_mask_features_mwe(df, model_lm, tokenizer):
    
    best_probs1=[]
    best_probs2=[]
    predicted_words = []
    target_probs1=[]
    target_probs2=[]
    best_probs_average=[]
    target_probs_average=[]
    
    for i, row in df.iterrows():
        features = get_mask_features_mwes(row.sentence, row.token, model_lm, tokenizer)
        best_probs1.append(features['best_probs'][0])
        best_probs2.append(features['best_probs'][1])
        predicted_words.append(" ".join(features['predicted_words']))
        target_probs1.append(features['pair_probs'][0])
        target_probs2.append(features['pair_probs'][1])
        best_probs_average.append(np.mean(features['best_probs']))
        target_probs_average.append(np.mean(features['pair_probs']))
        
    df['best_probs1'] = best_probs1
    df['best_probs2'] = best_probs2
    df['predicted_probs'] = best_probs_average
    df['predicted_words'] = predicted_words
    df['target_probs1'] = target_probs1
    df['target_probs2'] = target_probs2
    df['target_probs'] = target_probs_average
    
    return df

In [153]:
mwe_train = write_mask_features_mwe(multi_train, model_masked_lm, tokenizer_propercased)
mwe_trial = write_mask_features_mwe(multi_trial, model_masked_lm, tokenizer_propercased)
mwe_test = write_mask_features_mwe(multi_test, model_masked_lm, tokenizer_propercased)

# 5 WordNet features <a class="anchor" id="5"></a>
1. get the number of synsets that a word is present in (for an MWE: synsets of the pair plus the synsets of each word)

In [102]:
import nltk
from nltk.corpus import wordnet

In [104]:
def wordnet_features(df):
    senses = []
    for i, row in df.iterrows():
        if ' ' in row.token:
            s = 0
            # look if a pair is present
            s += len(wordnet.synsets(row.token.lower().replace(' ','_')))
            # look for each word separately
            s += sum([len(wordnet.synsets(w)) for w in row.token.lower().split(' ')])
            senses.append(s)
        else:
            syns = wordnet.synsets(row.token.lower())
            senses.append(len(syns))
    df['senses'] = senses
    return df

In [154]:
single_train = wordnet_features(single_train)
single_trial = wordnet_features(single_trial)
single_test = wordnet_features(single_test)

multi_train = wordnet_features(mwe_train)
multi_trial = wordnet_features(mwe_trial)
multi_test = wordnet_features(mwe_test)

# 6 Word frequency features <a class="anchor" id="6"></a>

In [109]:
import wordfreq

In [110]:
def freq_features(df):
    freqs = []
    for i, row in df.iterrows():
        if ' ' not in row.token:
            freqs.append(wordfreq.zipf_frequency(row.token,'en', wordlist='best', minimum=0.0))
        else:
            tokens = row.token.split()
            av_freq = sum([wordfreq.zipf_frequency(token,'en', wordlist='best', minimum=0.0) for token in tokens])/len(tokens)
            freqs.append(av_freq)
    df['freqs'] = freqs
    return df

In [155]:
single_train = freq_features(single_train)
single_trial = freq_features(single_trial)
single_test = freq_features(single_test)

multi_train = freq_features(multi_train)
multi_trial = freq_features(multi_trial)
multi_test = freq_features(multi_test)

# 7 Morfessor features <a class="anchor" id="7"></a>
1. get frequencies of subwords
2. segment target sentence and target
3. get average freq
4. get average len
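
The scoring in steps 1 and 3 can be sketched without a trained Morfessor model. The morph counts below are made up, and normalising by the sum of the counts is an assumption of this sketch. The fallback for unseen morphs (the lowest known value) is also an assumption that the notebook itself does not make.

```python
import math


def morph_log_probs(counts):
    """Log relative frequency of each morph, given a dict of raw counts."""
    total = sum(counts.values())
    return {morph: math.log(count / total) for morph, count in counts.items()}


def average_log_prob(segments, log_probs):
    """Average log relative frequency of the morphs in `segments`.

    Unseen morphs fall back to the lowest known value (assumption of this sketch).
    """
    floor = min(log_probs.values())
    return sum(log_probs.get(seg, floor) for seg in segments) / len(segments)


# hypothetical morph counts standing in for the model's constructions
toy_counts = {'un': 50, 'break': 30, 'able': 20}
toy_log_probs = morph_log_probs(toy_counts)

segments = ['un', 'break', 'able']   # a possible segmentation of 'unbreakable'
morph_len = len(segments)
morph_score = average_log_prob(segments, toy_log_probs)
```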

In [120]:
import morfessor
import numpy as np

In [121]:
io = morfessor.MorfessorIO()
segmentation_model = io.read_binary_model_file('data/english_model')

I0212 15:49:13.236800 47459466215040 io.py:186] Loading model from 'data/english_model'...
I0212 15:49:13.963269 47459466215040 io.py:188] Done.


In [122]:
constructions = segmentation_model.get_constructions()
probs_dict = {}
for c in constructions:
    probs_dict[c[0]] = np.log(c[1]/segmentation_model.tokens)

In [136]:
def get_morfessor_features(df, seg_model, prob_dict):
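    """Adds Morfessor-based features to a dataframe
    
    Parameters
    ----------
    df : pandas.DataFrame
        needs the columns id and token
    seg_model : Morfessor model used for Viterbi segmentation
    prob_dict : dict
        log relative frequency of each Morfessor construction
    
    Returns
    -------
    df with two new columns:
        morfessor_len : number of morphs in the target
        morfessor_freqs : average log relative frequency of the target's morphs
    """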
    
    len_target_subs = []
    prob_subs = []
    for i,row in df.iterrows():
            
        if " " not in row.token:
            if row.id =='38LRF35D5LWPYKNDAPAKMD6HD1M3UI':
                segments = seg_model.viterbi_segment(row.token.lower())[0]
            else:
                segments = seg_model.viterbi_segment(row.token)[0]
            len_target_subs.append(len(segments))
            probs = sum([prob_dict[sub] for sub in segments])/len(segments)
            prob_subs.append(probs)
        else:
            tokens = row.token.split(" ")
            segments = []
            for token in tokens:
                segments+=seg_model.viterbi_segment(token)[0]
            len_target_subs.append(len(segments))
            probs = sum([prob_dict[sub] for sub in segments])/len(segments)
            prob_subs.append(probs)
    
    df['morfessor_len'] = len_target_subs
    df['morfessor_freqs'] = prob_subs
    return df

In [156]:
single_train = get_morfessor_features(single_train, segmentation_model, probs_dict)
single_trial = get_morfessor_features(single_trial, segmentation_model, probs_dict)
single_test = get_morfessor_features(single_test, segmentation_model, probs_dict)

multi_train = get_morfessor_features(multi_train, segmentation_model, probs_dict)
multi_trial = get_morfessor_features(multi_trial, segmentation_model, probs_dict)
multi_test = get_morfessor_features(multi_test, segmentation_model, probs_dict)

# 8 saving non-embedding features <a class="anchor" id="8"></a>

In [161]:
columns = ['id', 'corpus', 'sentence', 'token', 'complexity', 'sentence_len', 'tokens_per_word', 'target_len',
       'target_len_chars', 'best_probs1', 'best_probs2', 'predicted_words',
       'target_probs1', 'target_probs2', 'predicted_probs',
       'target_probs', 'senses', 'freqs', 'morfessor_len', 'morfessor_freqs']

multi_train[columns].to_csv('features/multi_train.csv')
multi_trial[columns].to_csv('features/multi_trial.csv')
multi_test[columns].to_csv('features/multi_test.csv')

In [None]:
columns = ['id', 'corpus', 'sentence', 'token', 'complexity', 'sentence_len', 'tokens_per_word', 'target_len',
       'target_len_chars', 'predicted_probs', 'predicted_words',
       'target_probs', 'senses', 'freqs','morfessor_len', 'morfessor_freqs']

single_train[columns].to_csv('features/single_train.csv')
single_trial[columns].to_csv('features/single_trial.csv')
single_test[columns].to_csv('features/single_test.csv')