# Who Are Our Rebels?

In this notebook I use some simple NLP to explore who our favorite rebels were. In the process I hope to demonstrate some of the data-wrangling challenges that come with NLP.

I have previously used the Canvas API to download the submissions for this assignment and used [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to extract the submission text. This was then saved as a JSON file, which is where this notebook begins.

In [None]:
import json
from collections import Counter
import matplotlib.pyplot as plt
import markdown
from IPython.display import HTML
import pandas as pd
import seaborn as sns

### Load the Data

In [None]:
with open("rebel_text.json", "r") as f:
    rebel_text = json.load(f)

### Data Format

The data are loaded as a list of strings, each string being the text submitted by a student.
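
As a quick sanity check of that format, here is a minimal sketch that round-trips a small list of strings through JSON and verifies its shape. The sample texts and the helper name `check_submissions` are assumptions made for illustration; they are not part of the actual data or notebook.

```python
import json

def check_submissions(data):
    """Return the number of submissions, raising if the shape is unexpected."""
    if not isinstance(data, list):
        raise TypeError("expected a list of submissions")
    for i, item in enumerate(data):
        if not isinstance(item, str):
            raise TypeError(f"submission {i} is not a string")
    return len(data)

# Hypothetical stand-in for the contents of rebel_text.json
sample = ["My rebel is Marie Curie.", "I chose Galileo Galilei."]
loaded = json.loads(json.dumps(sample))
n_submissions = check_submissions(loaded)
```

Running the same check on the real `rebel_text` list would confirm that every entry is a plain string before it is handed to the NLP pipeline.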

### We are going to use the very popular [spaCy](https://spacy.io/) NLP library

We will use spaCy's _named entity recognition_ functionality to identify proper names.

In [None]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

#### Entity Recognition

spaCy will parse the sentences and then try to recognize the different entities named in the text, such as people, organizations, or products. Let's see how it works.

In [None]:
for txt in rebel_text:
    doc = nlp(txt)
    displacy.render(doc, style="ent")
    print('-'*72)

### spaCy seems to do OK
#### But there are some consistent failures

For example:

- Nicolaus Copernicus is identified as an organization (`ORG`)
- Aristotle is identified as a product (`PRODUCT`, what product?)
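
One way to see how often such mislabeling happens is to tally the labels each name receives. The sketch below works on plain `(text, label)` pairs, which could be collected from `doc.ents` as `(ent.text, ent.label_)`. The helper name `tally_labels` and the sample pairs are assumptions for illustration, not output from the actual submissions.

```python
from collections import Counter, defaultdict

def tally_labels(entities):
    """Map each entity text to a Counter of the labels it received."""
    tally = defaultdict(Counter)
    for text, label in entities:
        tally[text][label] += 1
    return dict(tally)

# Hypothetical (text, label) pairs; real ones would come from doc.ents
observed = [
    ("Nicolaus Copernicus", "ORG"),
    ("Nicolaus Copernicus", "PERSON"),
    ("Aristotle", "PRODUCT"),
    ("Nicolaus Copernicus", "ORG"),
]
label_tally = tally_labels(observed)
```

A name whose counter holds more than one label, or only a non-`PERSON` label, is a candidate for the kind of failure listed above.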


### Filtering Entities

The function `get_top_rebel` is what I use to identify the person each student names as their rebel. Here are the assumptions I made:

- A person might have been labeled as an organization, a person or a product.
    - You can change the list of acceptable labels to see if you can get improved performance.
- Because so many submissions reference Freeman Dyson (because of the assignment), I decided to filter him out of the responses.
    - Sorry if he was your rebel
- I assume each shorter name refers to a longer name of which it is a substring
    - For example, I assume all `"Godfrey"` references are referring to `"Godfrey Hounsfield"`, the longer name
- I count which name is identified the most often in a submission
    - In case of a tie, I take the longer string as being the name.
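
The last two assumptions can be illustrated without spaCy at all. The following is a minimal sketch on plain strings; the helper names `merge_short_names` and `pick_top_name` and the sample mentions are hypothetical, and the logic is a simplified stand-in rather than the exact function used in this notebook.

```python
from collections import Counter

def merge_short_names(names):
    """Replace each name with the longest already-seen name that contains it."""
    # Longest first, so a short name is always compared against longer candidates.
    ordered = sorted(names, key=len, reverse=True)
    merged = []
    for name in ordered:
        # Reuse the first (longest) already-seen name containing this one.
        match = next((m for m in merged if name in m), name)
        merged.append(match)
    return merged

def pick_top_name(names):
    """Return the most frequent name, breaking ties by the longer string."""
    counts = Counter(merge_short_names(names))
    if not counts:
        return None
    # Prefer the higher count, then the longer name.
    return max(counts, key=lambda n: (counts[n], len(n)))

# Hypothetical mentions extracted from one submission
mentions = ["Godfrey", "Godfrey Hounsfield", "Hounsfield", "Allan Cormack", "Cormack"]
top = pick_top_name(mentions)
```

Note that plain substring matching is a blunt tool: a very short entity such as `"Al"` would be folded into any longer name that happens to contain those letters.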

In [None]:
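def get_top_rebel(txt, labels=None):
    """Return the most-mentioned name in a submission (raises IndexError if none is found)."""
    if not labels:
        labels = ['ORG', 'PERSON', 'PRODUCT']
    # Freeman Dyson is mentioned in the assignment itself, so skip references to him
    excluded = {'Freeman', 'Dyson', 'Freeman Dyson'}
    doc = nlp(txt)
    # ent.text is used instead of ent.string, which newer spaCy versions no longer provide
    names = [ent.text.strip() for ent in doc.ents if ent.label_ in labels]
    rtxts = [name for name in names if name not in excluded]
    # Sort longest first so shorter names can be folded into longer ones
    rtxts.sort(key=len, reverse=True)
    for i in range(len(rtxts)-1):
        for j in range(i+1, len(rtxts)):
            # Treat a shorter name as a reference to the longer name that contains it
            if rtxts[j] in rtxts[i]:
                rtxts[j] = rtxts[i]
    c = Counter(rtxts)
    top_count = c.most_common(1)[0][1]
    # On a tie, prefer the longest name
    cc = [k for k, v in c.items() if v == top_count]
    cc.sort(key=len, reverse=True)
    return cc[0]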

### Find the rebel for each submission

In [None]:
rebels = []

for txt in rebel_text:
    top_rebel = get_top_rebel(txt)
    rebels.append((top_rebel, txt))


### Write the submissions and matching rebels out as a Markdown file

In [None]:
# Note: markdown.markdown() returns HTML, so results.md ends up holding rendered HTML
txt = ""
for rebel, text in rebels:
    txt = txt + markdown.markdown("""-------\n## Text\n %s\n\n### Identified Rebel: %s\n"""%(text, rebel))
with open("results.md", "w") as f:
    f.write(txt)

In [None]:
HTML(txt)

### Count the identified entities

In [None]:
counted = Counter(r[0] for r in rebels)

In [None]:
counted.most_common()

### Use [Seaborn](https://seaborn.pydata.org) to visualize our counts

In [None]:
f, ax = plt.subplots(figsize=(14, 8))

data = pd.DataFrame(counted.items()).rename(columns={0:"Rebel", 1:"Count"}).sort_values("Rebel", ascending=True)
sns.set_color_codes("pastel")

sns.barplot(x="Count", y="Rebel", data=data, color="g")
#plt.yticks(rotation=45)
f.savefig("identified_rebels.png", dpi=300, facecolor='w', bbox_inches="tight")

## Discussion

I took a fairly simplistic approach to identifying the named rebels. The technique was not robust to several textual features, such as typos, misspellings, and possessive forms. Because I was counting mentions of names, if someone used a lot of pronouns to refer to their rebel, I might not have identified them properly. Find the answer you submitted. Did I correctly identify your rebel? If not, can you think of things in your writing that could be edited to make the identification task easier?
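
One possible way to soften the possessive and misspelling problems is to normalize names before counting them. The sketch below strips possessive endings and uses the standard library's `difflib.get_close_matches` to map near-misses onto a list of known names. The helper names, the `cutoff` value of 0.85, and the example names are assumptions for illustration; this was not part of the analysis above.

```python
import difflib

def normalize_name(name):
    """Strip surrounding whitespace and a trailing possessive ('s or a bare apostrophe)."""
    name = name.strip()
    for suffix in ("'s", "\u2019s"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    # Handles forms such as "Copernicus'" with a straight or curly apostrophe
    return name.rstrip("'\u2019")

def canonicalize(name, known_names, cutoff=0.85):
    """Map a possibly misspelled name to its closest known name, if any is close enough."""
    cleaned = normalize_name(name)
    matches = difflib.get_close_matches(cleaned, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned

# Hypothetical list of names already seen across submissions
known = ["Galileo Galilei", "Marie Curie", "Copernicus"]
fixed_typo = canonicalize("Galielo Galilei", known)
fixed_possessive = canonicalize("Copernicus's", known)
```

A stricter `cutoff` reduces the risk of merging two different people with similar names, at the cost of missing more misspellings. Pronoun references are a separate problem that this kind of string matching does not address.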