# Asking Questions In "Alice's Adventures In Wonderland"

## Concerning Nouns And Named Entities

### Or "Falling Down The Rabbit Hole and Getting (A Bit) Confused"

By Jan Eberhardt, 2/2/22

In [None]:
# Getting started
import spacy

nlp = spacy.load('en_core_web_sm')

Using spaCy by Explosion (Ines Montani et al.):
    
https://github.com/explosion/spaCy

The source data of "Alice's Adventures In Wonderland" is from https://gist.github.com/phillipj/4944029, because of a comment that said:

drjoms commented on 23 Feb 2020

thanks.
was studying regular expression. decided to use the book as sample material.
Gutenberg version seemed to add some characters at end of line, which rendered me in state of frustration.
your version restored my sanity. thanks!
https://gist.github.com/phillipj/4944029?permalink_comment_id=3186160

In [None]:
# Basically following the introductory notebook for text manipulation
with open('txt/alice_in_wonderland.txt', 'r', encoding='UTF-8') as f:
    text = f.read()
    print(text)
    print(repr(text))

In [None]:
processed_text = text.replace('\n',' ')
print(repr(processed_text))

In [None]:
processed_text = '    ' + processed_text
processed_text = processed_text.strip()
print(repr(processed_text))
text = processed_text

In [None]:
import re

pattern = r"[?]"

matches = re.findall(pattern, text)
print(matches)
print(len(matches))

In [None]:
# How can we extract the sentences in the text that contain a "?"? Trying different regular expressions:
re.findall(r"([^.]*?[^.]*\?)", text)

In [None]:
re.findall(r"([^.!?]*?[^.?]*\?)",text)
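Neither expression above isolates the questions cleanly. As a minimal sketch (not the approach used in this notebook), the text can first be cut into rough sentence segments and then filtered for segments that end in a question mark. The names `CLOSING_QUOTES`, `SENTENCE` and `extract_questions` are made up for this illustration, and the sample sentence is invented. Abbreviations such as "Mr." would still be split in the wrong place.

```python
import re

# Quote characters that may directly follow the sentence-final punctuation.
CLOSING_QUOTES = "\"'\u2019\u201d"

# A rough "sentence": anything up to ., ! or ?, plus optional closing quotes.
SENTENCE = re.compile(r"[^.!?]+[.!?]+[" + re.escape(CLOSING_QUOTES) + r"]*")

def extract_questions(text):
    """Return the rough sentence segments of `text` that end with '?'."""
    segments = (match.strip() for match in SENTENCE.findall(text))
    # Ignore trailing quote characters when checking the final punctuation.
    return [s for s in segments if s.rstrip(CLOSING_QUOTES).endswith("?")]

sample = "Alice was bored. 'What is the use of a book?' thought Alice. Oh dear! Who are you?"
questions = extract_questions(sample)
print(questions)
```

Applied to the notebook, `extract_questions(text)` would take the cleaned-up `text` from the cells above.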

In [None]:
# Also tried the spaCy Matcher; the first pattern attempt raised an error
from spacy.matcher import Matcher
matcher = Matcher(vocab=nlp.vocab)
matcher

In [None]:
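# The first attempt, [{'PunctType': 'Peri'}], raised an error, most likely because
# 'PunctType' is not a Matcher token attribute (and 'Peri' would denote a period).
# Matching the literal token text "?" works instead.
questionmark = [{'ORTH': '?'}]
matcher.add("question", [questionmark])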

In [None]:
from PIL import Image
import requests
from io import BytesIO

url = "https://24.media.tumblr.com/tumblr_m74l2lZ6Y41rb924bo1_r5_500.gif"
response = requests.get(url)
img = Image.open(BytesIO(response.content))
img.show()

In [None]:
with open('txt/questions.txt', 'r', encoding='UTF-8') as q:
    text = q.read()

In [None]:
doc = nlp(text)
doc

In [None]:
# Loop over items in the Doc object, using the variable 'token' to refer to items in the list
for token in doc:
    
    # Print the token and the POS tags
    print(token, token.pos_, token.tag_)

In [None]:
# Loop over the tokens in the Doc object.
# When the tag marks a singular noun (NN), a plural noun (NNS) or a proper noun (NNP), write the token to a file.
# Attention: the file is opened in append mode, so re-running this cell adds the same nouns again.
with open("txt/nouns.txt", "a", encoding="UTF-8") as myfile:
    for token in doc:
        if token.tag_ in ('NN', 'NNS', 'NNP'):
            myfile.write(token.text + "\n")
            print(token.text)
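Because the cell above appends, every re-run adds the same words to the file again and inflates their weight in the word cloud. A small sketch of an alternative that overwrites the file instead, so the cell can be re-run safely. The helper `write_lines` is hypothetical and not part of the original notebook; the demo writes to a temporary directory rather than to `txt/nouns.txt`.

```python
import tempfile
from pathlib import Path

def write_lines(path, items):
    """Overwrite `path` with one item per line and return the number of items."""
    items = list(items)
    Path(path).write_text("".join(f"{item}\n" for item in items), encoding="utf-8")
    return len(items)

# Demo: writing twice leaves a single copy of the data in the file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "nouns.txt"
    write_lines(target, ["Alice", "rabbit"])
    write_lines(target, ["Alice", "rabbit"])
    content = target.read_text(encoding="utf-8")

print(repr(content))
```

In the notebook this could be called as `write_lines("txt/nouns.txt", (t.text for t in doc if t.tag_ in ('NN', 'NNS', 'NNP')))`.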

For the visualization of the nouns, I use the Word Cloud repository by Andreas Mueller. He even has a mask looking like Alice and the White Rabbit:

https://github.com/amueller/word_cloud

In [None]:
from os import path
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import os

from wordcloud import WordCloud

# get data directory (using getcwd() is needed to support running example in generated IPython notebook)
d = path.dirname(__file__) if "__file__" in locals() else os.getcwd()

# Read the whole text.
wctext = open(path.join(d, 'txt/nouns.txt')).read()

# read the mask image
# taken from
# http://www.stencilry.org/stencils/movies/alice%20in%20wonderland/255fk.jpg
alice_mask = np.array(Image.open(path.join(d, "img/alice_mask.png")))

wc = WordCloud(background_color="white", max_words=2000, mask=alice_mask, contour_width=3, contour_color='steelblue')

# generate word cloud
wc.generate(wctext)

# store to file
wc.to_file(path.join(d, "img/nounswc.png"))

# show
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
Image.open("img/nounswc.png")
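The word cloud sizes words by how often they occur, so it can help to look at the raw counts next to the picture. A minimal sketch with the standard library; `top_words` is a name invented here and the sample list is made up. WordCloud applies its own stopword filtering and, by default, its own handling of word pairs, so its ranking may differ from these plain counts.

```python
from collections import Counter

def top_words(lines, n=10):
    """Count non-empty lines case-insensitively and return the n most common."""
    counts = Counter(line.strip().lower() for line in lines if line.strip())
    return counts.most_common(n)

sample = ["Alice", "cat", "alice", "", "Rabbit", "cat", "Alice"]
top = top_words(sample, n=2)
print(top)
```

For the real data, the file can be passed directly: `with open("txt/nouns.txt", encoding="UTF-8") as f: print(top_words(f))`.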

We also want to know which named entities appear in the questions:

In [None]:
# Loop over the named entities in the Doc object 
for ent in doc.ents:

    # Print the named entity and its label
    print(ent.text, ent.label_)

In [None]:
# import visualization tool from spacy, render the entities
from spacy import displacy
displacy.render(doc, style='ent')

In [None]:
# Create a file so that the entities can be visualized with the word cloud.
# Attention: the file is opened in append mode, so re-running this cell adds the same entities again.
with open("txt/ents.txt", "a", encoding="UTF-8") as myfile:
    for ent in doc.ents:
        myfile.write(ent.text + "\n")
        print(ent.text)

In [None]:
d = path.dirname(__file__) if "__file__" in locals() else os.getcwd()

wctext = open(path.join(d, 'txt/ents.txt')).read()

alice_mask = np.array(Image.open(path.join(d, "img/alice_mask.png")))

wc = WordCloud(background_color="white", max_words=2000, mask=alice_mask, contour_width=3, contour_color='red')

wc.generate(wctext)

wc.to_file(path.join(d, "img/entswc.png"))

plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
Image.open("img/entswc.png")

Let's see whether a different NLP tool recognizes the named entities better:

BookNLP by David Bamman: https://github.com/booknlp/booknlp

In [None]:
Image.open("img/booknlp.png")

In [None]:
from booknlp.booknlp import BookNLP

model_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "big"
}

booknlp = BookNLP("en", model_params)

# Input file to process
input_file="txt/questions.txt"

# Output directory to store resulting files in
output_directory="txt/booknlp/"

# File within this directory will be named ${book_id}.entities, ${book_id}.tokens, etc.
book_id="alice"

booknlp.process(input_file, output_directory, book_id)

In [None]:
# Open the file into which I extracted the nominal and proper mentions
with open('txt/entities_booknlp.txt', 'r', encoding='UTF-8') as ebnlp:
    ent_booknlp = ebnlp.read()
    print(ent_booknlp)
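The extraction of nominal and proper mentions from the BookNLP output is not shown in this notebook. A hedged sketch of how it could be done, assuming the `.entities` file is tab-separated with the header `COREF`, `start_token`, `end_token`, `prop`, `cat`, `text`, and that `prop` takes the values `PROP`, `NOM` and `PRON`. These column names are an assumption and should be checked against the actual file. The function name and the sample rows are invented for illustration.

```python
import csv
import io

def nominal_and_proper_mentions(tsv_file):
    """Return the mention texts whose `prop` column is PROP or NOM (assumed layout)."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row["text"] for row in reader if row["prop"] in {"PROP", "NOM"}]

# Made-up rows in the assumed layout, not real BookNLP output.
sample = io.StringIO(
    "COREF\tstart_token\tend_token\tprop\tcat\ttext\n"
    "0\t3\t3\tPROP\tPER\tAlice\n"
    "0\t7\t7\tPRON\tPER\tshe\n"
    "1\t12\t13\tNOM\tPER\tthe Duchess\n"
)
mentions = nominal_and_proper_mentions(sample)
print(mentions)
```

With the settings above, the real file would presumably be `txt/booknlp/alice.entities`, opened with `open(..., encoding="UTF-8", newline="")`.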

In [None]:
d = path.dirname(__file__) if "__file__" in locals() else os.getcwd()

#open file where I extracted the entities from entities_booknlp.txt
wctext = open(path.join(d, 'txt/booknlp.txt')).read()

alice_mask = np.array(Image.open(path.join(d, "img/alice_mask.png")))

wc = WordCloud(background_color="white", max_words=2000, mask=alice_mask, contour_width=3, contour_color='green')

wc.generate(wctext)

wc.to_file(path.join(d, "img/booknlpwc.png"))

plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
Image.open("img/booknlpwc.png")

## For The Future

- Optimize:
  - Working with the text to find the right pattern for extracting just the questions. Also: what counts as a question on a professional linguistic level?
  - French: use a different language model?
  - Mistakes in detecting nouns (e.g. "queer") and named entities (e.g. "Duchess" = work of art): use a larger language model, or train with another/larger corpus? Use Prodigy?


- How many questions are there in each chapter?


- Finding out who is asking whom about what - visualisation with the graph modeling tool Neo4j?
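For the question about chapters, a first rough sketch could split the text at the chapter headings and count question marks per chapter. This assumes the headings look like "CHAPTER I", "CHAPTER II" and so on, which still has to be checked against the source file. It counts "?" characters, not linguistically defined questions. The names `CHAPTER_HEADING` and `questions_per_chapter` and the sample text are invented here.

```python
import re

# Assumed heading format: the word CHAPTER followed by a Roman numeral.
CHAPTER_HEADING = re.compile(r"CHAPTER\s+([IVXLC]+)\b")

def questions_per_chapter(text):
    """Map each chapter numeral to the number of '?' characters in that chapter."""
    # With one capturing group, split() returns
    # [preamble, numeral_1, body_1, numeral_2, body_2, ...].
    parts = CHAPTER_HEADING.split(text)
    return {numeral: body.count("?") for numeral, body in zip(parts[1::2], parts[2::2])}

sample = (
    "Title CHAPTER I. Down the Rabbit-Hole Why? Who? "
    "CHAPTER II. The Pool of Tears Where?"
)
counts = questions_per_chapter(sample)
print(counts)
```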