# Act 2 Python NLP

## Importing Libraries & Models



In [None]:
# Install the model in case Google Colab does not already have it installed
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 18.2 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
# import the libraries
import pandas as pd
import spacy # nlp library
from spacy import displacy # visualization functionality

# load the machine learning model, md = medium, sm = small, lg = large
nlp = spacy.load("en_core_web_md")


## Tokenization

Tokenization is the task of splitting a text into meaningful segments, called tokens.

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# a for loop in python
for token in doc:
  print(token)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
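For comparison, here is a minimal sketch of what a naive whitespace split does with the same sentence. It uses only Python's built-in `str.split` and is not how spaCy tokenizes; the list `spacy_tokens` is copied by hand from the output above.

```python
sentence = "Apple is looking at buying U.K. startup for $1 billion"

# Naive approach: split on whitespace only
whitespace_tokens = sentence.split()

# Tokens as printed by spaCy in the cell above (copied by hand)
spacy_tokens = ["Apple", "is", "looking", "at", "buying", "U.K.",
                "startup", "for", "$", "1", "billion"]

print(len(whitespace_tokens), whitespace_tokens)
print(len(spacy_tokens), spacy_tokens)

# Items that whitespace splitting keeps together but spaCy splits apart
only_in_split = [t for t in whitespace_tokens if t not in spacy_tokens]
print(only_in_split)
```

The whitespace split keeps `$1` as one item, while spaCy separates the currency symbol from the number and still keeps `U.K.` together.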


## Part of Speech Tagging

Attributes with a trailing underscore return readable strings; the variants without it (e.g. `.pos`) return integer IDs.

- `.lemma_` The base or dictionary form of the token (e.g. "run" for "running").
- `.pos_` The coarse-grained part-of-speech category, such as noun, verb, or adjective.
- `.tag_` The fine-grained part-of-speech tag.
- `.dep_` The syntactic dependency relation between the token and its parent word (head) in the parse tree.
- `.shape_` The shape of the token, which abstracts its capitalization and punctuation patterns. E.g. "Xxxxx" (for "Apple"), "xxx" (for "run").
- `.is_alpha` A boolean indicating whether the token consists only of alphabetic characters. E.g. True (for "running"), False (for "123" or ".").
- `.is_stop` A boolean indicating whether the token is a stop word. Stop words are common words like "the", "is", "and" that are often ignored in NLP tasks. E.g. True (for "is"), False (for "running").

See the full documentation [here](https://spacy.io/usage/linguistic-features#pos-tagging).
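To make the idea of a token shape concrete, here is a rough sketch of how such a shape string could be computed in plain Python. The helper `approx_shape` is hypothetical and written for illustration only; it is not a spaCy API call. It assumes that runs of the same shape character are cut off after four, which is why a long lowercase word gets the shape `xxxx`.

```python
def approx_shape(text):
    """Approximate a token shape: X = uppercase, x = lowercase, d = digit."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        # Keep at most four identical shape characters in a row
        if run < 4:
            shape.append(shape_char)
    return "".join(shape)

for word in ["Apple", "looking", "U.K.", "$", "1"]:
    print(word, approx_shape(word))
```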

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False


In [None]:
# Look up a short description of a spaCy tag or label code
spacy.explain("VBN")

'verb, past participle'

spaCy comes with visualization options that help you understand the structure of a text. All options are listed in the documentation [here](https://spacy.io/usage/visualizers/).

In [None]:
# "dep": Displays the dependency parse tree, showing relationships between tokens, such as subject, object, modifiers, etc.
displacy.render(doc, style="dep", jupyter=True)

## Named Entity Recognition (NER)

Full documentation can be seen [here](https://spacy.io/usage/linguistic-features#named-entities).

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
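The numbers printed next to each entity are character offsets into the original string. A small sketch in plain Python shows how they map back to the text; the `spans` list is copied by hand from the output above.

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char, label) triples copied from the output above
spans = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# Slicing the text with the offsets recovers the entity text
recovered = [(text[start:end], label) for start, end, label in spans]
for entity_text, label in recovered:
    print(entity_text, label)
```

As with any Python slice, `end_char` is exclusive, so `text[0:5]` covers the five characters of "Apple".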


In [None]:
displacy.render(doc, style="ent", jupyter=True)

## Sentence Segmentation

Full documentation can be seen [here](https://spacy.io/usage/linguistic-features#sbd).

In [None]:
doc = nlp("This is a sentence. This is another sentence.")
# Safety check: make sure the Doc has sentence boundary annotations (SENT_START)
# before iterating over the sentences in doc.sents
assert doc.has_annotation("SENT_START")
for sent in doc.sents:
    print(sent.text)

This is a sentence.
This is another sentence.


## Similarity

Full documentation can be found [here](https://spacy.io/usage/linguistic-features#vectors-similarity).

If you are interested in this topic, also check out `sense2vec` [here](https://github.com/explosion/sense2vec).

In [None]:
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))

# Similarity of tokens and spans
salty_fries = doc1[2:4]  # span covering the tokens "salty fries"
burgers = doc1[5]        # single token "hamburgers"
print(salty_fries, "<->", burgers, salty_fries.similarity(burgers))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.691649353055761
salty fries <-> hamburgers 0.6938489675521851
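As far as the spaCy documentation describes it, the default similarity score is the cosine similarity between vectors, and the vector of a document or span is the average of its token vectors. The sketch below reproduces that calculation in plain Python. The helper names and the three-dimensional vectors are made up for illustration; the real model uses much longer vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no direction to compare
    return dot / (norm_a * norm_b)

def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical toy word vectors (not taken from en_core_web_md)
salty = [1.0, 0.0, 1.0]
fries = [1.0, 0.0, 0.0]
hamburgers = [1.0, 1.0, 0.0]

span_vector = average_vector([salty, fries])
print(cosine_similarity(span_vector, hamburgers))
```

A score of 1 means the vectors point in the same direction, and 0 means they are orthogonal.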


## Analyzing the Prologue of Lord of the Rings

In [None]:
from collections import Counter

In [None]:
# mount google drive
from google.colab import drive
drive.mount('/content/gdrive')

file_name = "lord-of-rings-prologue.txt"
file_path = "/content/gdrive/MyDrive/Colab Notebooks/Act_2_Python_NLP/data/"
open_this = file_path + file_name

# Open and read the text file
with open(open_this, 'r', encoding='utf-8') as file:
    text = file.read()

len(text)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


42152

In [None]:
# Process the text with spaCy
doc = nlp(text)

### Word Frequency

In [None]:
# Create an empty list to store filtered tokens
filtered_tokens = []

# Loop through each token in the processed text
for token in doc:
    # Check if the token is a word (not punctuation or numbers) and is not a stop word
    if token.is_alpha and not token.is_stop:
        # Add the lowercase lemma (base form) of the token to the list
        filtered_tokens.append(token.lemma_.lower())

In [None]:
# Calculate word frequencies
word_freq = Counter(filtered_tokens)

# Get the top 20 most common words
most_common = word_freq.most_common(20)
most_common

[('shire', 49),
 ('hobbits', 42),
 ('bilbo', 38),
 ('long', 33),
 ('time', 28),
 ('hobbit', 25),
 ('book', 23),
 ('great', 23),
 ('day', 22),
 ('old', 22),
 ('find', 20),
 ('come', 19),
 ('ring', 19),
 ('history', 18),
 ('king', 18),
 ('year', 17),
 ('family', 16),
 ('large', 15),
 ('live', 15),
 ('gollum', 15)]
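Raw counts depend on the length of the text. A minimal sketch of how `Counter.most_common` could be combined with relative frequencies is shown below; `toy_tokens` is a hand-made stand-in for `filtered_tokens`, not data from the prologue.

```python
from collections import Counter

# Hypothetical, hand-made lemmas standing in for `filtered_tokens`
toy_tokens = ["shire", "hobbit", "shire", "ring", "hobbit", "shire"]

toy_freq = Counter(toy_tokens)
total = sum(toy_freq.values())

# most_common(n) returns (word, count) pairs sorted by descending count
top_two = toy_freq.most_common(2)

# Relative frequency: share of all counted tokens
relative = {word: count / total for word, count in top_two}
print(top_two)
print(relative)
```

Dividing by the total makes the numbers comparable across texts of different lengths.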

### Named Entity Recognition

- `CARDINAL`: Numerals that do not fall under another type.
- `DATE`: Absolute or relative dates or periods.
- `EVENT`: Named hurricanes, battles, wars, sports events, etc.
- `FAC`: Buildings, airports, highways, bridges, etc.
- `GPE`: Countries, cities, states.
- `LANGUAGE`: Any named language.
- `LAW`: Named documents made into laws.
- `LOC`: Non-GPE locations, mountain ranges, bodies of water.
- `MONEY`: Monetary values, including unit.
- `NORP`: Nationalities or religious or political groups.
- `ORDINAL`: "First", "second", etc.
- `ORG`: Companies, agencies, institutions, etc.
- `PERCENT`: Percentage, including "%".
- `PERSON`: People, including fictional.
- `PRODUCT`: Objects, vehicles, foods, etc. (not services).
- `QUANTITY`: Measurements, as of weight or distance.
- `TIME`: Times smaller than a day.
- `WORK_OF_ART`: Titles of books, songs, etc.

In [None]:
displacy.render(doc, style="ent", jupyter=True)

In [None]:
# Create an empty list to store entities
entities = []

# Loop through each entity in the document
for ent in doc.ents:
    # Save the entity text and its label as a tuple
    entities.append((ent.text, ent.label_))

# Now `entities` contains a list of tuples like ("Frodo", "PERSON")
entities

[('Westmarch', 'ORG'),
 ('The Hobbit', 'WORK_OF_ART'),
 ('the Red Book', 'EVENT'),
 ('Bilbo', 'ORG'),
 ('first', 'ORDINAL'),
 ('East', 'LOC'),
 ('first', 'ORDINAL'),
 ('today', 'DATE'),
 ('ancient days', 'DATE'),
 ('the Big Folk’', 'WORK_OF_ART'),
 ('first', 'ORDINAL'),
 ('Dwarves', 'NORP'),
 ('between two and four feet', 'CARDINAL'),
 ('2', 'CARDINAL'),
 ('three feet', 'QUANTITY'),
 ('ancient days', 'DATE'),
 ('the Red Book', 'ORG'),
 ('Bullroarer', 'PERSON'),
 ('Isumbras', 'ORG'),
 ('Third', 'ORDINAL'),
 ('four \nfoot', 'QUANTITY'),
 ('five', 'CARDINAL'),
 ('two', 'CARDINAL'),
 ('the days', 'DATE'),
 ('six', 'CARDINAL'),
 ('Dwarves', 'NORP'),
 ('the Elder Days', 'DATE'),
 ('Middle-earth', 'LOC'),
 ('many long years', 'DATE'),
 ('the days', 'DATE'),
 ('Bilbo', 'ORG'),
 ('Frodo', 'ORG'),
 ('Those days', 'DATE'),
 ('the Third Age of Middle-earth', 'EVENT'),
 ('Sea', 'LOC'),
 ('Bilbo', 'ORG'),
 ('3', 'CARDINAL'),
 ('Dwarves', 'LOC'),
 ('Wandering Days', 'EVENT'),
 ('Anduin', 'PERSON'),
 

In [None]:
# Create an empty list to store entity types (labels)
entity_labels = []

# Loop through each entity in the document
for ent in doc.ents:
    # Add the label of the entity (e.g., "PERSON", "LOC") to the list
    entity_labels.append(ent.label_)

# Now `entity_labels` contains a list of labels like ["PERSON", "LOC", "PERSON"]

In [None]:
label_counts = Counter(entity_labels)

# Top entity types and their counts
labels, counts = zip(*label_counts.most_common())
label_counts

Counter({'ORG': 109,
         'WORK_OF_ART': 7,
         'EVENT': 17,
         'ORDINAL': 18,
         'LOC': 34,
         'DATE': 45,
         'NORP': 12,
         'CARDINAL': 44,
         'QUANTITY': 3,
         'PERSON': 87,
         'FAC': 12,
         'GPE': 32,
         'PRODUCT': 8,
         'TIME': 2,
         'LAW': 2})
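The `zip(*...)` call in the cell above splits the sorted pairs into two parallel sequences, which is a convenient shape for plotting. A small sketch with three of the counts from the output above shows what it produces; the `toy_` names are chosen here for illustration.

```python
from collections import Counter

# A few counts copied from the output above
toy_counts = Counter({"ORG": 109, "PERSON": 87, "DATE": 45})

# most_common() yields (label, count) pairs sorted by descending count
pairs = toy_counts.most_common()

# zip(*pairs) transposes the pairs into one tuple of labels and one of counts
toy_labels, toy_values = zip(*pairs)
print(toy_labels)
print(toy_values)
```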

## Sentiment Analysis and Tone Mapping

For sentiment analysis we use another library, TextBlob. The full documentation can be found [here](https://textblob.readthedocs.io/en/dev/). A direct link to the sentiment analyzers can be found [here](https://textblob.readthedocs.io/en/dev/advanced_usage.html#sentiment-analyzers).

- Polarity lies in the range [-1, 1], where -1 is the most negative sentiment and 1 the most positive. Negation words reverse the polarity.
- Subjectivity lies in the range [0, 1] and quantifies how much of the text is personal opinion rather than factual information. A higher value means the text contains more personal opinion.

In [None]:
from textblob import TextBlob

In [None]:
# Example sentences
sentence1 = "Bilbo Baggins was loved by all the hobbits for his generosity and kindness."
sentence2 = "The dark shadows of Mordor filled Frodo with a sense of dread and hopelessness."

# Analyze the first sentence
blob1 = TextBlob(sentence1)
polarity1 = blob1.sentiment.polarity
subjectivity1 = blob1.sentiment.subjectivity

print(f"Sentence: {sentence1}")
print(f"Sentiment Polarity: {polarity1:.2f} ({'Positive' if polarity1 > 0 else 'Negative' if polarity1 < 0 else 'Neutral'})")
print(f"Subjectivity: {subjectivity1:.2f} ({'Subjective' if subjectivity1 > 0.5 else 'Objective'})\n")

# Analyze the second sentence
blob2 = TextBlob(sentence2)
polarity2 = blob2.sentiment.polarity
subjectivity2 = blob2.sentiment.subjectivity

print(f"Sentence: {sentence2}")
print(f"Sentiment Polarity: {polarity2:.2f} ({'Positive' if polarity2 > 0 else 'Negative' if polarity2 < 0 else 'Neutral'})")
print(f"Subjectivity: {subjectivity2:.2f} ({'Subjective' if subjectivity2 > 0.5 else 'Objective'})")

Sentence: Bilbo Baggins was loved by all the hobbits for his generosity and kindness.
Sentiment Polarity: 0.70 (Positive)
Subjectivity: 0.80 (Subjective)

Sentence: The dark shadows of Mordor filled Frodo with a sense of dread and hopelessness.
Sentiment Polarity: 0.12 (Positive)
Subjectivity: 0.65 (Subjective)
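The labelling rules used in the print statements above can be factored into small helper functions, which avoids repeating the nested conditional expressions. This is a sketch only: the function names are hypothetical, and the scores passed in are the rounded values from the output above, not fresh TextBlob results.

```python
def polarity_label(polarity):
    """Map a polarity score in [-1, 1] to a coarse label."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"

def subjectivity_label(subjectivity):
    """Map a subjectivity score in [0, 1] to a coarse label (cut-off 0.5)."""
    return "Subjective" if subjectivity > 0.5 else "Objective"

# Rounded (polarity, subjectivity) pairs copied from the output above
scores = [(0.70, 0.80), (0.12, 0.65)]
for polarity, subjectivity in scores:
    print(f"{polarity:.2f} ({polarity_label(polarity)}), "
          f"{subjectivity:.2f} ({subjectivity_label(subjectivity)})")
```

Because any score above zero counts as positive, the second sentence is labelled "Positive" with a polarity of only 0.12, although it describes dread and hopelessness.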
