# Exploring Text with spaCy

## The spaCy pipeline

"When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing [pipeline](https://spacy.io/usage/processing-pipelines)." 

From https://course.spacy.io/en/chapter3

<img src="https://spacy.io/images/pipeline.svg">

In [36]:
import sys
import spacy
nlp = spacy.load("en_core_web_md") #_sm does not have word vectors
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


## Adding sentiment via TextBlob

In [37]:
from spacytextblob.spacytextblob import SpacyTextBlob
nlp.add_pipe('spacytextblob')
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'spacytextblob']


## Exploring document structure

In [38]:
# Break the corpus into:
# Title, Chapters, Sections, Paragraphs, Thoughts, Sentences
#  Title: Knowing Gaia
#  C1:Introduction 1 section, 6 paragraphs
#  C2:Taking a leap 8 sections, 7 paragraphs
#  C3-C10: each with 10 sections  thought-embedding 
#  C11: Contemplation: 2 sections

In [71]:
# Open the text file and read its content into a variable
# (encoding is stated explicitly; UTF-8 is assumed here)
with open('../assets/clean.txt', 'r', encoding='utf-8') as file:
    text = file.read()

doc=nlp(text)

In [40]:
# Before extra bash cleaning:
#print([tok.text for tok in doc[21:30] 
# if not tok.is_stop and tok.text!="\n\n"])

# After extra bash cleaning:
print([tok.text for tok in doc[21:30]])

['traditional', 'non', '-', 'duality', 'speakers', ',', 'I', 'was', 'inspired']


In [41]:
sents=[ sent for sent in doc.sents]
type(sents[0])

spacy.tokens.span.Span

In [42]:
nlp.vocab["\n"].is_stop = True
words=[token.text for token in doc
       if not token.is_stop and not token.is_punct ]

In [43]:
from collections import Counter
word_freq=Counter(words)

In [44]:
common_words = word_freq.most_common(30)

In [45]:
print(common_words)

[('Gaia', 190), ('want', 99), ('know', 78), ('people', 75), ('self', 71), ('like', 68), ('system', 61), ('need', 58), ('think', 53), ('experience', 52), ('feel', 48), ('life', 46), ('feeling', 46), ('power', 44), ('body', 41), ('dream', 40), ('attention', 39), ('person', 38), ('mind', 35), ('systems', 34), ('living', 33), ('right', 33), ('way', 30), ('ideology', 30), ('emotions', 29), ('work', 29), ('time', 28), ('thinking', 27), ('away', 26), ('touch', 26)]


In [46]:
for token in doc[1:10]:
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))

Gaia NNP PROPN noun, proper singular

 _SP SPACE whitespace
C1 NNP PROPN noun, proper singular
: : PUNCT punctuation mark, colon or ellipsis
Introduction NN NOUN noun, singular or mass

 _SP SPACE whitespace
After IN ADP conjunction, subordinating or preposition
many JJ ADJ adjective (English), other noun-modifier (Chinese)
years NNS NOUN noun, plural


In [47]:
nouns = []
adjectives = []
for token in [t for t in doc[2:4000] if t.text != "-"]:
    if token.pos_ == 'NOUN':
        nouns.append(token)
    if token.pos_ == 'ADJ':
        adjectives.append(token)

print(f'Total nouns: {len(nouns)} \nFirst 5 nouns: {nouns[:5]}')
print(f'Total adjectives: {len(adjectives)}')
print(f'First 5 adjectives: {adjectives[:5]}')

Total nouns: 660 
First 5 nouns: [Introduction, years, study, practice, speakers]
Total adjectives: 282
First 5 adjectives: [many, Buddhist, traditional, non, duality]


In [48]:
from spacy import displacy
# Creates large image of relations
#displacy.render(ss[3:4], style='dep', jupyter=True)

### A spaCy document is a sequence of tokens
The spaCy **Doc** data structure is a sequence of Token objects that also gives access to **sentences** and named entities.

In [49]:
# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

# Iterate over first 2000 tokens
for token in doc[1:2000]:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

center=4
width=4
print(len(doc[1].vector))  # dimensionality of the word vectors
window = doc[center - width:center + width]
print(f'Similarity between {doc[center].text!r} and its surrounding window: '
      f'{doc[center].similarity(window)}')

Found proper noun before a verb: NASA
Found proper noun before a verb: jellyfish
Found proper noun before a verb: LeGuin
Found proper noun before a verb: Contemplation
Found proper noun before a verb: Gaia
Found proper noun before a verb: Contemplation
300
Similarity between ':' and its surrounding window: 0.615739643573761


In [50]:
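# Each vocabulary entry in en_core_web_md has a 300-dimensional pretrained
# static word vector (similar in spirit to word2vec).
# The raw dot product is not normalized, so it reflects vector length as well
# as direction; it still gives a rough signal of relatedness.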

In [51]:
import numpy
numpy.dot(doc.vocab['earth'].vector,doc.vocab['gaia'].vector)

126.3143
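
Because the raw dot product grows with the length of both vectors, a value such as 126.3 is hard to compare across word pairs. The following is a minimal sketch of the normalized alternative, cosine similarity, using only the standard library. The short vectors are made-up stand-ins for the 300-dimensional ones, and `cosine_similarity` is a hypothetical helper name, not a spaCy function.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy vectors (illustrative only, not real model vectors)
earth = [3.0, 4.0, 0.0]
gaia = [6.0, 8.0, 0.0]   # same direction as earth, twice as long

raw_dot = sum(a * b for a, b in zip(earth, gaia))
print(raw_dot)                         # changes if either vector is rescaled
print(cosine_similarity(earth, gaia))  # depends only on direction
```

In spaCy itself, `Lexeme.similarity` and `Token.similarity` already return a cosine similarity, so `doc.vocab['earth'].similarity(doc.vocab['gaia'])` should give a value that is comparable across word pairs.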

In [52]:
from itertools import zip_longest
c=["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "C9", "C10", "C11"]
# Doc index of each chapter marker token, in chapter order
starts=[ t.i for chap in c for t in doc if t.text==chap]
# Pair each start with the next start; the last chapter gets (start, None)
ci=list(zip_longest(starts,starts[1:]))
chaps=[ doc[s[0]:s[1]] for s in ci] # list of spans
print(ci)

[(3, 720), (720, 1912), (1912, 3603), (3603, 4985), (4985, 7204), (7204, 9617), (9617, 12594), (12594, 16739), (16739, 20888), (20888, 24152), (24152, None)]
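
The pairing of start offsets used above can be seen in isolation on a plain list. This is a small sketch, assuming a toy token list in place of the `Doc`; `tokens`, `markers`, and `bounds` are illustrative names.

```python
from itertools import zip_longest

# Toy token stream standing in for the Doc (illustrative data)
tokens = ["Title", "C1", ":", "Intro", "text",
          "C2", ":", "Leap", "more", "text",
          "C3", ":", "End"]
markers = ["C1", "C2", "C3"]

# Index of every chapter marker, in marker order
starts = [i for m in markers for i, tok in enumerate(tokens) if tok == m]

# Pair each start with the next start; the last pair gets None,
# which as a slice bound means "to the end of the sequence"
bounds = list(zip_longest(starts, starts[1:]))
chapters = [tokens[s:e] for s, e in bounds]

print(bounds)
print(chapters[0])
```

One caveat of this approach: a marker string that also occurs in the running text adds a spurious start, so it relies on markers such as `C1` appearing only as chapter headings.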


In [53]:
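pts=["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10", "P11"]
chap=5    # zero-based index into chaps, i.e. chapter C6
prompt=3
# token.i is an index into doc, not into the chapter span
pis=[ t.i for pstr in pts for t in chaps[chap] if t.text==pstr]
# Pair each prompt start with the next one; the last prompt ends where the chapter ends
pi=list(zip_longest(pis, pis[1:], fillvalue=chaps[chap].end))
# Slice doc rather than the chapter span, because the indices are doc-level
chap6prompts=[ doc[s:e] for s, e in pi]
ptexts=chap6prompts
sents=list(doc[pis[prompt-1]:pis[prompt]].sents)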

In [54]:
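import html

with open("./spacy.html", 'w', encoding='utf-8') as f:
    f.write("<!DOCTYPE html>")
    f.write("<html><head><style>")
    f.write("html{ font-family: monospace; color:#888}")
    f.write("</style></head><body>")
    for chapter_span in chaps:
        # Escape the text so characters such as < and & do not break the markup
        f.write(f'<h1>{html.escape(chapter_span[1:3].text)}</h1>')
        f.write(html.escape(chapter_span.text))

    f.write("</body></html>")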

## Adding spans to Doc

In [55]:
from spacy import displacy
from spacy.tokens import Span

sents = list(chaps[0].sents)
type(doc)
type(sents)
options = {"colors": {"PHRASE1": "green",
                      "PHRASE2": "red",
                      "THOUGHT": "orange"}}

# type(chaps[0]) = spacy.tokens.span.Span
# type(chaps[0][1:60]) = spacy.tokens.span.Span

smalldoc=chaps[0][0:64].as_doc()
span1 = Span(smalldoc, start=2, end=36, label="THOUGHT")
span2 = Span(smalldoc, start=4, end=23, label="PHRASE1")  # Update indices as appropriate
span3 = Span(smalldoc, start=24, end=36, label="PHRASE2")  # Update indices as appropriate

# Assign these new Span objects to a span group in smalldoc
smalldoc.spans["sc"] = [span1, span2, span3]

displacy.render( smalldoc, style='span',options=options)

## Create the document structure
The text has 11 chapters, marked `C1` to `C11`.

In [118]:
def parse_document(text):
    # Entry 0 is a front-matter placeholder; its sections collect the chapter
    # titles and so act as a table of contents
    chapters = [{"title": "Front matter", "lede": [], "sections": []}]
    prompts = []
    chapter = None
    chapter_number = 0

    # Define prompt names for mapping
    prompt_names = {
        "P1": "independently",
        "P2": "integrated",
        "P3": "dream",
        "P4": "body",
        "P5": "honesty",
        "P6": "power",
        "P7": "touched",
        "P8": "ideology",
        "P9": "presence",
        "P10": "self"
    }

    # Split the text into lines
    lines = text.split('\n')

    for line in lines:
        line = line.strip()
        if not line:
            continue

        # Start a new chapter when we find a chapter line
        if line.startswith('C') and line[1:].split(':', 1)[0].isdigit():
            # Append the current chapter before starting a new one
            if chapter is not None:
                chapters.append(chapter)
            chapter_number = int(line[1:].split(':', 1)[0])
            chapter_title = line.split(':', 1)[1].strip()
            chapter = {
                "title": chapter_title,
                "lede": [],
                "sections": []
            }
            # Add chapter titles to the Table of Contents
            chapters[0]["sections"].append({"name": chapter_title, "paragraphs": []})
            continue  # Continue to the next iteration after starting a new chapter

        # Handle prompts for chapters 3-10
        if 3 <= chapter_number <= 10:
            prompt_key = line.split(':', 1)[0].strip()
            # Treat the line as a prompt only if its key is a known one (P1-P10);
            # any other line containing a colon stays an ordinary paragraph
            # instead of raising a KeyError
            if ':' in line and prompt_key in prompt_names:
                prompt_name = prompt_names[prompt_key]
                prompt_text = line.split(':', 1)[1].strip()
                # Record each prompt once, the first time it is seen
                if not any(p['name'] == prompt_name for p in prompts):
                    prompts.append({"name": prompt_name, "text": prompt_text})
                # Add a section for this prompt if the chapter does not have one yet
                if not any(section["name"] == prompt_name for section in chapter["sections"]):
                    chapter["sections"].append({"name": prompt_name, "paragraphs": []})
                continue  # The prompt line itself is not stored as a paragraph

            # Any other line is a paragraph of the current section;
            # lines before the first prompt of a chapter are dropped
            if chapter["sections"]:
                chapter["sections"][-1]["paragraphs"].append(line)

        # Handle lede for chapters 1, 2, and 11
        elif chapter_number in [1, 2, 11]:
            chapter["lede"].append(line)

    # Append the last chapter if it exists
    if chapter is not None:
        chapters.append(chapter)

    return {"chapters": chapters, "prompts": prompts}
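
Marker lines of the form `C<number>: title` and `P<number>: text` can also be recognised with regular expressions, which makes the expected format explicit. The following is a sketch under that assumed format; `CHAPTER_RE`, `PROMPT_RE`, and `classify_line` are hypothetical names and are not part of the parser above.

```python
import re

# Hypothetical patterns for the two kinds of marker line
CHAPTER_RE = re.compile(r"^C(\d+):\s*(.*)$")
PROMPT_RE = re.compile(r"^(P\d+):\s*(.*)$")

def classify_line(line):
    """Return ('chapter', number, title), ('prompt', key, text) or ('paragraph', line)."""
    line = line.strip()
    m = CHAPTER_RE.match(line)
    if m:
        return ("chapter", int(m.group(1)), m.group(2))
    m = PROMPT_RE.match(line)
    if m:
        return ("prompt", m.group(1), m.group(2))
    return ("paragraph", line)

# Made-up sample lines (illustrative data)
sample = [
    "C3: Example chapter",
    "P1: Example prompt text",
    "People: this is an ordinary paragraph",
    "Plain paragraph",
]
for line in sample:
    print(classify_line(line))
```

The third sample line starts with `P` and contains a colon, yet it is classified as a paragraph because the pattern requires digits after the `P`.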


In [119]:
p=parse_document(text)

In [120]:
import json
# Write the parsed structure straight to a JSON file;
# ensure_ascii=False keeps non-ASCII characters readable
with open('spacy.json', 'w', encoding='utf-8') as f:
    json.dump(p, f, ensure_ascii=False, indent=4)
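
The `ensure_ascii=False` argument matters when the text contains non-ASCII characters: with the default setting they are written as `\uXXXX` escapes. A small sketch with a made-up record, shaped like one chapter entry, shows the difference:

```python
import json

# Made-up record shaped like one chapter entry (illustrative data)
record = {"title": "Café notes", "lede": ["a naïve first line"], "sections": []}

escaped = json.dumps(record)                       # default: ASCII-only output
readable = json.dumps(record, ensure_ascii=False)  # keeps characters as written

print(escaped)
print(readable)

# Both forms decode back to the same dictionary
same = json.loads(escaped) == json.loads(readable) == record
print(same)
```

Either form is valid JSON; the readable one is simply easier to inspect in an editor, provided the file is opened with a UTF-8 encoding.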

In [3]:
%reload_ext jupyter_ai

In [4]:
%%ai chat --format code
find the number of bits of information in a string

Cannot determine model provider from model ID `chat`.

To see a list of models you can use, run `%ai list`

If you were trying to run a command, run `%ai help` to see a list of commands.
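
The prompt in the cell above never reached a model. One common reading of "bits of information in a string" is the Shannon entropy of its character distribution. The following is a sketch under that assumption, with `shannon_bits` as a hypothetical helper name; other readings, such as compressed size, would give different numbers.

```python
import math
from collections import Counter

def shannon_bits(s):
    """Return (entropy per character in bits, total bits) for string s."""
    if not s:
        return 0.0, 0.0
    counts = Counter(s)
    n = len(s)
    # Sum of p * log2(1/p) over the observed characters
    per_char = sum((c / n) * math.log2(n / c) for c in counts.values())
    return per_char, per_char * n

print(shannon_bits("aaaa"))   # a single repeated symbol
print(shannon_bits("abab"))   # two equally frequent symbols
```

This estimate treats characters as independent draws, so it ignores any structure between neighbouring characters.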

In [5]:
%ai list

| Provider | Environment variable | Set? | Models |
|----------|----------------------|------|--------|
| `ai21` | `AI21_API_KEY` | <abbr title="You have not set this environment variable, so you cannot use this provider's models.">❌</abbr> | <ul><li>`ai21:j1-large`</li><li>`ai21:j1-grande`</li><li>`ai21:j1-jumbo`</li><li>`ai21:j1-grande-instruct`</li><li>`ai21:j2-large`</li><li>`ai21:j2-grande`</li><li>`ai21:j2-jumbo`</li><li>`ai21:j2-grande-instruct`</li><li>`ai21:j2-jumbo-instruct`</li></ul> |
| `gpt4all` | Not applicable. | <abbr title="Not applicable">N/A</abbr> | <ul><li>`gpt4all:ggml-gpt4all-j-v1.2-jazzy`</li><li>`gpt4all:ggml-gpt4all-j-v1.3-groovy`</li><li>`gpt4all:ggml-gpt4all-l13b-snoozy`</li><li>`gpt4all:mistral-7b-openorca.Q4_0`</li><li>`gpt4all:mistral-7b-instruct-v0.1.Q4_0`</li><li>`gpt4all:gpt4all-falcon-q4_0`</li><li>`gpt4all:wizardlm-13b-v1.2.Q4_0`</li><li>`gpt4all:nous-hermes-llama2-13b.Q4_0`</li><li>`gpt4all:gpt4all-13b-snoozy-q4_0`</li><li>`gpt4all:mpt-7b-chat-merges-q4_0`</li><li>`gpt4all:orca-mini-3b-gguf2-q4_0`</li><li>`gpt4all:starcoder-q4_0`</li><li>`gpt4all:rift-coder-v0-7b-q4_0`</li><li>`gpt4all:em_german_mistral_v01.Q4_0`</li></ul> |
| `huggingface_hub` | `HUGGINGFACEHUB_API_TOKEN` | <abbr title="You have not set this environment variable, so you cannot use this provider's models.">❌</abbr> | See [https://huggingface.co/models](https://huggingface.co/models) for a list of models. Pass a model's repository ID as the model ID; for example, `huggingface_hub:ExampleOwner/example-model`. |
| `nvidia-chat` | `NVIDIA_API_KEY` | <abbr title="You have not set this environment variable, so you cannot use this provider's models.">❌</abbr> | <ul><li>`nvidia-chat:playground_llama2_70b`</li><li>`nvidia-chat:playground_nemotron_steerlm_8b`</li><li>`nvidia-chat:playground_mistral_7b`</li><li>`nvidia-chat:playground_nv_llama2_rlhf_70b`</li><li>`nvidia-chat:playground_llama2_13b`</li><li>`nvidia-chat:playground_steerlm_llama_70b`</li><li>`nvidia-chat:playground_llama2_code_13b`</li><li>`nvidia-chat:playground_yi_34b`</li><li>`nvidia-chat:playground_mixtral_8x7b`</li><li>`nvidia-chat:playground_neva_22b`</li><li>`nvidia-chat:playground_llama2_code_34b`</li></ul> |
| `ollama` | Not applicable. | <abbr title="Not applicable">N/A</abbr> | See [https://www.ollama.com/library](https://www.ollama.com/library) for a list of models. Pass a model's name; for example, `deepseek-coder-v2`. |
| `qianfan` | `QIANFAN_AK`, `QIANFAN_SK` | <abbr title="You have not set all of these environment variables, so you cannot use this provider's models.">❌</abbr> | <ul><li>`qianfan:ERNIE-Bot`</li><li>`qianfan:ERNIE-Bot-4`</li></ul> |
| `togetherai` | `TOGETHER_API_KEY` | <abbr title="You have not set this environment variable, so you cannot use this provider's models.">❌</abbr> | <ul><li>`togetherai:Austism/chronos-hermes-13b`</li><li>`togetherai:DiscoResearch/DiscoLM-mixtral-8x7b-v2`</li><li>`togetherai:EleutherAI/llemma_7b`</li><li>`togetherai:Gryphe/MythoMax-L2-13b`</li><li>`togetherai:Meta-Llama/Llama-Guard-7b`</li><li>`togetherai:Nexusflow/NexusRaven-V2-13B`</li><li>`togetherai:NousResearch/Nous-Capybara-7B-V1p9`</li><li>`togetherai:NousResearch/Nous-Hermes-2-Yi-34B`</li><li>`togetherai:NousResearch/Nous-Hermes-Llama2-13b`</li><li>`togetherai:NousResearch/Nous-Hermes-Llama2-70b`</li></ul> |

Aliases and custom commands:

| Name | Target |
|------|--------|
| `gpt2` | `huggingface_hub:gpt2` |
| `gpt3` | `openai:davinci-002` |
| `chatgpt` | `openai-chat:gpt-3.5-turbo` |
| `gpt4` | `openai-chat:gpt-4` |
| `ernie-bot` | `qianfan:ERNIE-Bot` |
| `ernie-bot-4` | `qianfan:ERNIE-Bot-4` |
| `titan` | `bedrock:amazon.titan-tg1-large` |
