# Data Mining Week 8 with Professor Sloan

## Maggie Boles

### From Blackboard: Using the Kaggle NER corpus (ner_database.csv), which you can also find in our GitHub, create a NER tagger using Scikit-learn, which implies creating the NER model.
    I highly encourage you to look at the Author's Notebook for Chapter 8. In the text, this all starts on p. 545 and note the Author's GitHub is a little different than what's in the text. Note that building this model is going to take some time so plan accordingly. For example, the fit() alone was 3 minutes (not too bad, but it could take much longer on your machine).
    There's also a package installed by the author in his Notebook (sklearn-crfsuite). He installs it in-line in the Notebook, which may not work with Visual Studio Code. But you can just install it at a terminal.
    Run the following sentence through your tagger: “Fourteen days ago, Emperor Palpatine left San Diego, CA for Tatooine to follow Luke Skywalker.” Report on the tags applied to the sentence.
    Run the same sentence through spaCy’s NER engine.
    Compare and contrast the results – you can do this in your Jupyter Notebook or as a comment in your .py file.

In [8]:
#PATCH FOR BOTTLENECK ISSUE I WAS HAVING
import os
os.environ["PANDAS_NO_BOTTLENECK"] = "1"
import warnings
warnings.filterwarnings("ignore", message=".*bottleneck.*")

# (Resource use: Sarkar, D. (2019). Text analytics with python: A practitioner’s Guide to Natural Language Processing. Apress. (Chapter 8))
# 2. IMPORTS
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import sklearn_crfsuite
from sklearn_crfsuite import CRF
import spacy

#LOAD DATA
df = pd.read_csv("ner_dataset.csv", encoding="latin1")
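# Forward-fill so every row carries its "Sentence #" (the CSV only sets it on each sentence's first row).
# fillna(method="ffill") is deprecated in recent pandas; ffill() is the equivalent call.
df = df.ffill()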
print(df.head())

#GROUP INTO SENTENCES
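# Select only the columns we need before apply(), so pandas does not warn
# about apply() operating on the grouping column.
sentences = (
    df.groupby("Sentence #")[["Word", "Tag"]]
      .apply(lambda s: list(zip(s["Word"], s["Tag"])))
      .tolist()
)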
print(f"{len(sentences)} sentences loaded")

#FEATURE FUNCTION 
def word2features(sent, i):
    word = str(sent[i][0])
    features = {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word[-3:]": word[-3:],
        "word[-2:]": word[-2:],
        "word.isupper()": word.isupper(),
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
    }
    if i > 0:
        prev = str(sent[i-1][0])
        features.update({
            "-1:word.lower()": prev.lower(),
            "-1:word.istitle()": prev.istitle(),
        })
    else:
        features["BOS"] = True
    if i < len(sent)-1:
        nxt = str(sent[i+1][0])
        features.update({
            "+1:word.lower()": nxt.lower(),
            "+1:word.istitle()": nxt.istitle(),
        })
    else:
        features["EOS"] = True
    return features

def sent2features(s): return [word2features(s, i) for i in range(len(s))]
def sent2labels(s):   return [label for _, label in s]

X = [sent2features(s) for s in sentences]
y = [sent2labels(s)   for s in sentences]

#TRAIN / TEST SPLIT
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

#TRAIN CRF
crf = CRF(algorithm='lbfgs', c1=0.1, c2=0.1,
          max_iterations=50, all_possible_transitions=False)
print("Training CRF…")
crf.fit(X_train, y_train)
print("Done!")

#TEST SENTENCE – CUSTOM TAGGER
test_sent = "Fourteen days ago , Emperor Palpatine left San Diego , CA for Tatooine to follow Luke Skywalker .".split()
dummy = [(w, "O") for w in test_sent]          # dummy tags
feat  = sent2features(dummy)
pred  = crf.predict_single(feat)

print("\n=== CUSTOM CRF TAGGER ===")
for w, t in zip(test_sent, pred):
    print(f"{w:15} {t}")


#spaCy NER
nlp = spacy.load("en_core_web_sm")
doc = nlp(" ".join(test_sent))
print("\n=== spaCy NER ===")
for ent in doc.ents:
    print(f"{ent.text:20} {ent.label_}")



    Sentence #           Word  POS Tag
0  Sentence: 1      Thousands  NNS   O
1  Sentence: 1             of   IN   O
2  Sentence: 1  demonstrators  NNS   O
3  Sentence: 1           have  VBP   O
4  Sentence: 1        marched  VBN   O



47959 sentences loaded
Training CRF…
Done!

=== CUSTOM CRF TAGGER ===
Fourteen        O
days            O
ago             O
,               O
Emperor         B-per
Palpatine       I-per
left            O
San             B-geo
Diego           I-geo
,               O
CA              B-org
for             I-org
Tatooine        I-org
to              O
follow          O
Luke            B-org
Skywalker       I-org
.               O

=== spaCy NER ===
Fourteen days ago    DATE
Palpatine            PERSON
San Diego            GPE
CA for Tatooine      WORK_OF_ART
Luke Skywalker       PERSON


##### Looking at the output objectively, the custom CRF tagger assigns a tag to every token, which makes it easy to see how it treats each word. It chunked Emperor Palpatine, San Diego, and Luke Skywalker with the Beginning/Inside tags, although it labeled Luke Skywalker as an organization rather than a person (it did at least group the two tokens together). It left "Fourteen days ago" untagged. It also tagged "CA for Tatooine" as an organization, which could be an error from the CRF's training; spaCy labeled the same span as a work of art, which is pretty funny. I think both errors could come from the structure of the phrase and the grouping that occurs: it makes me think of something like "Californians for Tatooine Politics," which is a little silly and fun. spaCy's NER gives a different output: it correctly grouped a date (Fourteen days ago), and it missed grouping Emperor with Palpatine but still correctly identified Palpatine as a person. San Diego is a GPE (though it is not grouped with CA), "CA for Tatooine" is a WORK_OF_ART, which is not accurate but suggests spaCy treats Tatooine as part of a title, and Luke Skywalker is a PERSON.
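To compare the two outputs side by side, it helps to collapse the CRF's per-token BIO tags into entity spans, which is the shape spaCy reports. The following is a minimal sketch of that step; `bio_to_spans` is a hypothetical helper name (not part of sklearn-crfsuite), and the tokens and tags are copied from the CRF output above.

```python
def bio_to_spans(tokens, tags):
    """Collapse token-level BIO tags into (entity_text, label) spans.

    Hypothetical helper for illustration; not part of sklearn-crfsuite.
    """
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")  # "B-per" -> ("B", "-", "per")
        # A B- tag, or an I- tag whose label differs, starts a new span.
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if prefix == "O" or starts_new:
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if prefix in ("B", "I"):
            current_tokens.append(token)
            current_label = label
    if current_tokens:  # flush a span that runs to the end of the sentence
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Tokens and predicted tags copied from the CRF output above.
tokens = ("Fourteen days ago , Emperor Palpatine left San Diego , "
          "CA for Tatooine to follow Luke Skywalker .").split()
tags = ["O", "O", "O", "O", "B-per", "I-per", "O", "B-geo", "I-geo", "O",
        "B-org", "I-org", "I-org", "O", "O", "B-org", "I-org", "O"]

for text, label in bio_to_spans(tokens, tags):
    print(f"{text:20} {label}")
```

In the notebook, the same call would be `bio_to_spans(test_sent, pred)`.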

##### Overall, we can compare and contrast the two. With the CRF we have to train the model ourselves before we can run sentences through it. Training took a little time, though not too bad on my rig: I upgraded it over the summer from a 2070 with an i5 processor, so my computer didn't struggle much with this part. The tag scheme is the BIO scheme we covered previously, tokenization is a simple split on spaces, entities are grouped through the B-/I- prefixes, and we have much more control over the features and training data, so we could keep fine-tuning this model if we wanted to.
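As an illustration of that control over features, here is a minimal sketch of a few extra per-word features that could be merged into the dictionary `word2features` builds. The helper names `word_shape` and `extra_features`, and the choice of features, are my own assumptions for illustration; they are not taken from the textbook notebook, and I have not measured whether they improve the tagger.

```python
import re


def word_shape(word):
    """Map a word to a coarse shape, e.g. 'Tatooine' -> 'Xxxxxxxx', 'R2D2' -> 'XdXd'."""
    shape = re.sub(r"[A-Z]", "X", word)   # uppercase letters
    shape = re.sub(r"[a-z]", "x", shape)  # lowercase letters
    shape = re.sub(r"\d", "d", shape)     # digits (done last so 'd' is not re-mapped)
    return shape


def extra_features(word):
    """Hypothetical additional features for a single token."""
    return {
        "word.shape": word_shape(word),
        "word[:3]": word[:3],  # prefix, complementing the suffix features
        "word.has_digit": any(ch.isdigit() for ch in word),
        "word.has_hyphen": "-" in word,
    }


print(extra_features("Tatooine"))
```

Inside `word2features`, these could be added with `features.update(extra_features(word))`, after which the CRF would need to be retrained.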

##### spaCy's NER, by contrast, is pre-trained and has a flat tag scheme, its tokenizer handles punctuation and contractions, and entity grouping is built in. There may be some tuning we could do, but I'm not 100% sure; from some of the things I read, it can be fine-tuned, but doing so can be complex.
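The tokenization difference is why `test_sent` in the code cell was typed with spaces around the punctuation: a plain `split()` would otherwise leave tokens such as `ago,` glued together. A minimal sketch of a regex tokenizer that separates punctuation is shown below. It is a simple stand-in that I am assuming for illustration, not spaCy's tokenizer, and it does not handle contractions the way spaCy does.

```python
import re

# Runs of word characters, or any single character that is neither
# a word character nor whitespace (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)


raw = ("Fourteen days ago, Emperor Palpatine left San Diego, CA for Tatooine "
       "to follow Luke Skywalker.")

print(raw.split()[:3])           # whitespace split keeps the comma attached
print(simple_tokenize(raw)[:4])  # regex split separates it
```

With this, the CRF could be fed `simple_tokenize(raw)` instead of a hand-spaced string.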

##### The one thing I think is important to note again is that both methods misclassify "CA for Tatooine," and the CRF classified Luke Skywalker as an ORG. Neither model was perfect, and both struggled on the same part of the sentence. The sentence may have been written to be confusing on purpose, or mixing real and fictional locations may be what causes the problem. For the trained model specifically, this could be a case where ensemble methods come into play to help tune the model and increase accuracy.
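A first step toward combining the two taggers would be to line up their entity spans and see where they agree. The sketch below does that using the spans from the two outputs above. `LABEL_MAP` and `compare_spans` are hypothetical names, the mapping between the corpus tags and spaCy's labels is my own rough assumption, and spans are matched by exact text only, so this is a starting point rather than a real ensemble.

```python
# Assumed (approximate) mapping from the Kaggle corpus tags to spaCy labels.
LABEL_MAP = {"per": "PERSON", "geo": "GPE", "gpe": "GPE",
             "org": "ORG", "tim": "DATE", "art": "WORK_OF_ART"}

# Entity spans taken from the two outputs above.
crf_spans = [("Emperor Palpatine", "per"), ("San Diego", "geo"),
             ("CA for Tatooine", "org"), ("Luke Skywalker", "org")]
spacy_spans = [("Fourteen days ago", "DATE"), ("Palpatine", "PERSON"),
               ("San Diego", "GPE"), ("CA for Tatooine", "WORK_OF_ART"),
               ("Luke Skywalker", "PERSON")]


def compare_spans(crf_spans, spacy_spans):
    """Group spans by whether the two taggers agree (exact text match only)."""
    crf = {text: LABEL_MAP.get(label, label.upper()) for text, label in crf_spans}
    spa = dict(spacy_spans)
    return {
        "agree": sorted(t for t in crf if spa.get(t) == crf[t]),
        "conflict": sorted(t for t in crf if t in spa and spa[t] != crf[t]),
        "only_crf": sorted(t for t in crf if t not in spa),
        "only_spacy": sorted(t for t in spa if t not in crf),
    }


for group, texts in compare_spans(crf_spans, spacy_spans).items():
    print(f"{group:12} {texts}")
```

Spans in the conflict group are the ones worth reviewing by hand or resolving with a voting rule; here that is exactly "CA for Tatooine" and "Luke Skywalker".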