# Attempt at using spaCy's NER training to do NER
## This did not work well in the end. I labeled ~300 rows, and recall on the labeled rows was not acceptable (labels with <5 examples were not identified well)
### Things I improved / learned, despite the failure:
* Found a better way to ensure dask uses all CPUs: the `client()` method creates a local cluster that uses cores more consistently, and scales linearly with the number of cores, compared to the other methods (see `sports_sentiment.py`)
* Learned how to train spaCy's NER; learned that you really need a good number of examples for every known entity for it to get decent recall
* Improved sentence extraction by removing trailing punctuation

In [50]:
import re
import ast
import pandas as pd
import sys, os
sys.path.append(os.path.join(os.path.curdir, '..', 'sentiment_sports'))
import sports_sentiment as ss

Load a list of all skins

In [51]:
skin_df = pd.read_csv('labeled_skin_covariates.tsv', sep='\t')
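# keep both the lowercased and the original casing of every skin name
skin_set = set(skin_df['skin_name'].str.lower().tolist()) | set(skin_df['skin_name'].tolist())
# also add a dot-free variant of each name, for skins written as acronyms
skin_set = skin_set | {skin.replace('.', '') for skin in skin_set}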

## Create training dataset for Spacy
Load most recent data

In [52]:
july_df = (pd.read_csv('d:/data/fortnite/201907-fortnitebr-comments_submissions.tsv', sep='\t')
              .dropna(subset=['text']))
print(july_df.shape)
july_df.sample(2)

(312235, 9)


Unnamed: 0,text,timestamp,user,flair,score,id,link_id,parent_id,source
124159,How does someone this young have reddit???,1562549000.0,ElevatedRaptor6,,2.0,et881zg,t3_caemjg,t3_caemjg,comment
113952,did they even have permission to do this?,1562466000.0,Stealth_Reflex,:stealthreflex: Stealth Reflex,-9.0,et57av7,t3_ca18hf,t3_ca18hf,comment
165746,Damn. Why it be like dis man,1562882000.0,R34CTz,:oblivion: Oblivion,2.0,etjv6pd,t3_cbwgsi,t1_etjv02r,comment


One of the few things this work did help with was improving the sentence chunking, which uses dask

In [54]:
%%time
july_sentences_df = ss.chunk_comments_sentences(july_df)

Chunking into sentences
Reshaping and fixing whitespace / punctuation
Chunked into 552840 sentences
Wall time: 1min 7s
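Trailing punctuation matters because an exact-token lookup treats `Damn.` and `Damn` as different tokens. The real logic lives in `ss.chunk_comments_sentences`, which is not shown here; the block below is only a rough stdlib sketch of the idea. The helper name `split_and_clean` and the naive regex split are assumptions for illustration, not the actual implementation.

```python
import re

def split_and_clean(comment):
    """Hypothetical helper: split a comment into sentences and strip trailing punctuation."""
    # Naive split on whitespace that follows '.', '!' or '?' (an assumption, not the real chunker)
    pieces = re.split(r'(?<=[.!?])\s+', comment.strip())
    # Drop trailing punctuation so the last token of each sentence can match a set lookup
    cleaned = [re.sub(r'[.!?,;:]+$', '', piece).strip() for piece in pieces]
    # Discard pieces that were nothing but punctuation
    return [sentence for sentence in cleaned if sentence]

# Example comment taken from the sample rows above
print(split_and_clean("Damn. Why it be like dis man"))
```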


Find sentences that contain an exact token match for one of the skin names (this fails on multi-token names, because each sentence is split on spaces)

In [8]:
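def token_search(row, token_set):
    ''' find the tokens from token_set that appear in a sentence;
    this will fail for multi-token entities, since the sentence is split on spaces
    '''
    set_row = set(row.split(' '))
    # set intersection avoids looping over the whole token_set for every row;
    # sorting keeps the order of matches the same from row to row
    return sorted(token_set & set_row)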
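One possible way around the multi-token limitation is to compare runs of adjacent tokens, not only single tokens, against the set. This is a sketch rather than what the notebook actually ran: `ngram_search`, `max_len` and the small `example_skins` set are hypothetical names introduced for illustration.

```python
def ngram_search(sentence, entity_set, max_len=3):
    """Return the entities that appear as runs of 1..max_len whole tokens."""
    tokens = sentence.split(' ')
    found = set()
    for n in range(1, max_len + 1):
        # Slide a window of n tokens across the sentence
        for i in range(len(tokens) - n + 1):
            candidate = ' '.join(tokens[i:i + n])
            if candidate in entity_set:
                found.add(candidate)
    return sorted(found)

# Hypothetical mini skin set; the real lookup would use skin_set
example_skins = {'dark voyager', 'mission specialist', 'moonwalker'}
sentence = ('Can we have same styles for dark vanguard and dark voyager '
            'like mission specialist and moonwalker')
print(ngram_search(sentence, example_skins))
```

The cost grows with `max_len`, but each lookup is still a constant-time set membership test, so this should stay far cheaper than running one regex per skin.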

In [9]:
%%time
july_sentences_df['contained_skin'] = july_sentences_df['sentences'].apply(lambda row: token_search(row, skin_set))

Wall time: 1min 23s


In [19]:
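# copy the filtered rows so that adding a column below does not write into a slice of july_sentences_df
skin_sentences_df = july_sentences_df[july_sentences_df['contained_skin'].str.len() > 0][['sentences', 'contained_skin']].copy()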
skin_sentences_df['skin_str'] = skin_sentences_df['contained_skin'].str.join(' ')

In [None]:
train_data = [
    ("Who is Chaka Khan?", [(7, 17, "PERSON")]),
    ("I like London and Berlin.", [(7, 13, "LOC"), (18, 24, "LOC")]),
]
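The entity tuples in this format are character offsets into the text, with an exclusive end, followed by the label. A quick sanity check is to slice the text with each span and confirm it returns the intended entity. The helper name `spans_to_text` is made up for this sketch.

```python
train_data = [
    ("Who is Chaka Khan?", [(7, 17, "PERSON")]),
    ("I like London and Berlin.", [(7, 13, "LOC"), (18, 24, "LOC")]),
]

def spans_to_text(train_data):
    """Hypothetical helper: return the text covered by every annotated span."""
    return [
        [(text[start:end], label) for start, end, label in entities]
        for text, entities in train_data
    ]

for covered in spans_to_text(train_data):
    print(covered)
```

Running the same check over hand-labeled rows is a cheap way to catch off-by-one offsets before training.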

In [22]:
unique_sentences = skin_sentences_df.drop_duplicates(subset =['skin_str'])
unique_sentences[['sentences', 'skin_str']].to_csv('d:/data/fortnite/skin_ner_examples.txt', index=None, header=False)

In [36]:
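def create_annotation(row):
    ''' create spacy annotations (start, end, skin) for the skins found in a row;
    only the first occurrence of each skin is recorded
    '''
    annotations = []
    for skin in row['contained_skin']:
        # escape the name so characters like '.' are matched literally, not as regex syntax
        match = re.search(re.escape(skin), row['sentences'])
        if match:
            annotations.append(match.span() + (skin,))
    return annotations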

In [39]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
unique_sentences = unique_sentences.assign(annotation=unique_sentences.apply(create_annotation, axis=1))


In [42]:
unique_sentences.to_csv('d:/data/fortnite/ner_training_examples.tsv', sep='\t',
                       index=False)

In [None]:
'Can we have same styles for dark vanguard and dark voyager like mission specialist and moonwalker'

In [124]:
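def slow_skin_annotation(sentence, skin_set):
    ''' search a sentence for every skin name, including multi-token names;
    slow because it runs one regex search per skin
    '''
    annotations = []
    for skin in skin_set:
        # escape the name so characters like '.' are matched literally, not as regex syntax
        search_result = re.search(re.escape(skin), sentence)
        if search_result:
            annotations.append(search_result.span() + (skin,))
    return annotations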

In [132]:
sample = "Today, Recon Scout came back to the shop and it has not been in shoo since March 17 2018 (395 days)"
print(slow_skin_annotation(sample, skin_set))

[(7, 18, 'Recon Scout'), (13, 18, 'Scout')]
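The output above contains two overlapping spans, because `Scout` is found inside `Recon Scout`. As far as I know, spaCy's NER training expects entity spans that do not overlap, so the shorter match would need to be dropped before training. A minimal sketch of one way to do that, keeping the longest span; `drop_overlaps` is a hypothetical helper, not part of the notebook.

```python
def drop_overlaps(annotations):
    """Keep the longest spans, discarding any span that overlaps one already kept."""
    kept = []
    # Visit the longest spans first; ties are broken by the earliest start
    ordered = sorted(annotations, key=lambda a: (-(a[1] - a[0]), a[0]))
    for start, end, label in ordered:
        # Two spans are disjoint if one ends before the other begins
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    # Return the survivors in document order
    return sorted(kept)

print(drop_overlaps([(7, 18, 'Recon Scout'), (13, 18, 'Scout')]))
# [(7, 18, 'Recon Scout')]
```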


In [58]:
'Lynx' in skin_set

False

## Test of NER training using spaCy
### This did not yield good results; it seems like I basically need to label each skin about 5 times to get it recognized
I have labeled 200 rows for fine-tuning

In [133]:
labeled_skins_df = pd.read_csv('ner_training_examples.tsv', sep='\t', nrows=300)
labeled_skins_df['annotation'] = labeled_skins_df['annotation'].fillna('[]')
labeled_skins_df['annotation'] = labeled_skins_df['annotation'].apply(ast.literal_eval)

In [134]:
def rename_skin(annotation_list):
    ''' replace the skin name in each annotation with the generic SKIN label
    '''
    return [list(annotation[:2]) + ['SKIN'] for annotation in annotation_list]

labeled_skins_df['annotation'] = labeled_skins_df['annotation'].apply(rename_skin)
# build the (text, {'entities': [...]}) pairs that nlp.update() is given below
spacy_ner_train_data = list(labeled_skins_df[['sentences', 'annotation']].values)
spacy_ner_train_data = [(sentence, {'entities': entities}) for sentence, entities in spacy_ner_train_data]

In [135]:
import random
import spacy
from spacy.util import minibatch, compounding
nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe('ner')
ner.add_label('SKIN')

In [136]:
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

In [138]:
n_iter = 15

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):  # only train NER
    sizes = compounding(1.0, 4.0, 1.001)
    # batch up the examples using spaCy's minibatch
    for itn in range(n_iter):
        random.shuffle(spacy_ner_train_data)
        batches = minibatch(spacy_ner_train_data, size=sizes)
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
        print("Losses", losses)

Losses {'ner': 3890.19620668759}
Losses {'ner': 3536.268500545567}
Losses {'ner': 3366.1052768380628}
Losses {'ner': 3259.820517916717}
Losses {'ner': 3196.235129074468}
Losses {'ner': 3151.2381334613165}
Losses {'ner': 2995.463249539025}
Losses {'ner': 2987.788801067276}
Losses {'ner': 2953.784096196294}
Losses {'ner': 2928.1729186736047}
Losses {'ner': 3045.6879339404404}
Losses {'ner': 2955.5015856027603}
Losses {'ner': 2850.174688756466}
Losses {'ner': 2791.492754817009}
Losses {'ner': 2950.26401770761}


In [182]:
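# sample from the labeled training data, to check recall on sentences the model has already seen;
# to try unseen text instead, use: july_sentences_df.sample(1).iloc[0]['sentences']
test_sentence = random.choice(spacy_ner_train_data)[0]
print(test_sentence)
print(nlp(test_sentence).ents)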

ok I was smart and added a flare
()
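Spot-checking single sentences only gives a feel for recall; a number over all labeled rows is easier to compare between runs. The sketch below is one way to compute entity-level recall, assuming examples in the `(text, {'entities': [...]})` shape used above. `entity_recall` and `toy_predict` are hypothetical names, and the toy predictor stands in for the trained model so the block runs on its own.

```python
def entity_recall(examples, predict):
    """Fraction of gold (start, end, label) spans that the predictor reproduces exactly.

    examples: list of (text, {'entities': [(start, end, label), ...]})
    predict:  callable mapping a text to an iterable of (start, end, label)
    """
    found = total = 0
    for text, annotation in examples:
        gold = {tuple(entity) for entity in annotation['entities']}
        predicted = {tuple(entity) for entity in predict(text)}
        found += len(gold & predicted)
        total += len(gold)
    return found / total if total else 0.0

examples = [
    ("Who is Chaka Khan?", {'entities': [(7, 17, 'PERSON')]}),
    ("I like London and Berlin.", {'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}),
]

def toy_predict(text):
    # Stand-in predictor that only ever finds "London"
    start = text.find('London')
    return [(start, start + len('London'), 'LOC')] if start >= 0 else []

print(entity_recall(examples, toy_predict))
```

With the trained pipeline, the predictor could be something like `lambda text: [(ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]`. Measuring on rows held out from training would give a more honest number than measuring on the training rows themselves.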
