# Act 2 Python NLP

## Importing Libraries & Models



In [None]:
# Install the model in case Google Colab does not already have it installed
!python -m spacy download en_core_web_md

In [None]:
# import the libraries
import pandas as pd
import spacy # nlp library
from spacy import displacy # visualization functionality

# load the machine learning model, md = medium, sm = small, lg = large
nlp = spacy.load("en_core_web_md")


## Tokenization

Tokenization is the task of splitting a text into meaningful segments, called tokens.
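
A minimal sketch of tokenization. To stay independent of the model download, it uses `spacy.blank("en")`, a pipeline that contains only the English tokenizer; that choice and the name `tokenizer_nlp` are assumptions made for this example. The notebook's `nlp` object can be used in the same way.

```python
import spacy

# Tokenizer-only English pipeline (assumption: chosen so no model download is needed)
tokenizer_nlp = spacy.blank("en")

doc = tokenizer_nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterating over a Doc yields Token objects; .text is the original string
tokens = [token.text for token in doc]
print(tokens)
```

Note that `U.K.` stays a single token, while `$1` is split into `$` and `1`.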

## Part of Speech Tagging

- `.lemma_` base or dictionary form of the token (e.g. "run" for "running").
- `.pos_` high-level category such as noun, verb, adjective etc.
- `.tag_` the fine-grained part-of-speech information.
- `.dep_` indicates the token’s relationship with its parent word in the syntactic parse tree.
- `.shape_` the shape of the token, which abstracts its capitalization and punctuation patterns. E.g. "Xxxxx" (for "Apple"), "xxx" (for "run").
- `.is_alpha` a boolean indicating whether the token consists only of alphabetic characters. E.g. True (for "running"), False (for "123" or ".").
- `.is_stop` a boolean indicating whether the token is a stop word. Stop words are common words like “the”, “is”, “and” that are often ignored in NLP tasks. E.g. True (for "is"), False (for "running").

The attributes with a trailing underscore return the readable string; the same names without it (e.g. `.pos`) return an integer ID.

See the full documentation [here](https://spacy.io/usage/linguistic-features#pos-tagging).

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

spaCy comes with some visualization options to better understand the text structure. All options are listed in the documentation [here](https://spacy.io/usage/visualizers/).

## Named Entity Recognition (NER)

Full documentation can be seen [here](https://spacy.io/usage/linguistic-features#named-entities).

## Sentence Segmentation

Full documentation can be seen [here](https://spacy.io/usage/linguistic-features#sbd).

In [None]:
doc = nlp("This is a sentence. This is another sentence.")

## Similarity

Full documentation can be found [here](https://spacy.io/usage/linguistic-features#vectors-similarity).

If you are interested in this aspect, also check out `sense2vec` [here](https://github.com/explosion/sense2vec).

In [None]:
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

## Analyzing the Prologue of Lord of the Rings

In [None]:
from collections import Counter

In [None]:
# mount google drive
from google.colab import drive
drive.mount('/content/gdrive')

file_name = "lord-of-rings-prologue.txt"
file_path = "/content/gdrive/MyDrive/Colab Notebooks/Act_2_Python_NLP/data/"
open_this = file_path + file_name

# Open and read the text file
with open(open_this, 'r') as file:
    text = file.read()

len(text)

In [None]:
# Process the text with spaCy

### Word Frequency

In [None]:
# Create an empty list to store filtered tokens

# Loop through each token in the processed text
# Check if the token is a word (not punctuation or numbers) and is not a stop word
# Add the lowercase lemma (base form) of the token to the list


In [None]:
# Calculate word frequencies

# Get the top 20 most common words
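
Counting can be done with `collections.Counter`. The sketch below uses a small hypothetical list in place of the lemmas collected in the previous cell, so the numbers are illustrative only.

```python
from collections import Counter

# Hypothetical stand-in for the list of lemmas built in the previous cell
filtered_tokens = ["hobbit", "ring", "shire", "hobbit", "bilbo", "ring", "hobbit"]

# Calculate word frequencies
word_freq = Counter(filtered_tokens)

# Get the top 20 most common words (fewer if the vocabulary is smaller)
top_words = word_freq.most_common(20)
for word, count in top_words:
    print(f"{word}: {count}")
```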

### Named Entity Recognition

- `CARDINAL`: Numerals that do not fall under another type.
- `DATE`: Absolute or relative dates or periods.
- `EVENT`: Named hurricanes, battles, wars, sports events, etc.
- `FAC`: Buildings, airports, highways, bridges, etc.
- `GPE`: Countries, cities, states.
- `LANGUAGE`: Any named language.
- `LAW`: Named documents made into laws.
- `LOC`: Non-GPE locations, mountain ranges, bodies of water.
- `MONEY`: Monetary values, including unit.
- `NORP`: Nationalities or religious or political groups.
- `ORDINAL`: "First", "second", etc.
- `ORG`: Companies, agencies, institutions, etc.
- `PERCENT`: Percentage, including "%".
- `PERSON`: People, including fictional.
- `PRODUCT`: Objects, vehicles, foods, etc. (not services).
- `QUANTITY`: Measurements, as of weight or distance.
- `TIME`: Times smaller than a day.
- `WORK_OF_ART`: Titles of books, songs, etc.

In [None]:
# Visualize the entities of the whole prologue with displacy

In [None]:
# Create an empty list to store entities


# Loop through each entity in the document
# Save the entity text and its label as a tuple

# Now `entities` contains a list of tuples like ("Frodo", "PERSON")

## Sentiment Analysis and Tone Mapping

For sentiment analysis we use another library, TextBlob. Full documentation can be found [here](https://textblob.readthedocs.io/en/dev/). A direct link to the sentiment analyzers can be found [here](https://textblob.readthedocs.io/en/dev/advanced_usage.html#sentiment-analyzers).

- Polarity lies in the range [-1, 1]: -1 is the most negative sentiment and 1 the most positive. Negation words reverse the polarity.
- Subjectivity lies in the range [0, 1] and quantifies how much of the text is personal opinion rather than factual information. A higher value means the text is more opinionated.

In [None]:
from textblob import TextBlob

In [None]:
# Example sentences
sentence1 = "Bilbo Baggins was loved by all the hobbits for his generosity and kindness."
sentence2 = "The dark shadows of Mordor filled Frodo with a sense of dread and hopelessness."

# Analyze the first sentence


# Analyze the second sentence


## Further Project Ideas

#### Word Frequency

-	Objective: Analyze word frequency and visualize the most common words.
-	Steps:
	1.	Use `doc` to count token frequencies (excluding stop words and punctuation).
	2.	Lemmatize tokens to group similar forms (e.g., “run” and “running”).
	3.	Visualize the results using a bar chart or word cloud.


#### Named Entity Recognition (NER) Analysis

- Objective: Identify and visualize named entities in the text (e.g., PERSON, LOC, ORG).
- Steps:
	1.	Use `doc.ents` to extract named entities.
	2.	Group entities by type and calculate frequencies.
	3.	Create a map or chart to show relationships between characters (e.g., Bilbo, Frodo) or places (e.g., Shire, Mordor).

#### Character Interaction Network

- Objective: Create a network graph of character interactions.
- Steps:
	1.	Use NER to extract character names (PERSON entities).
	2.	Identify sentences where multiple characters are mentioned.
	3.	Build a graph using nodes (characters) and edges (interactions).
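
The steps above can be sketched with the standard library alone. Treating "mentioned in the same sentence" as an interaction is a simplifying assumption, and `people_per_sentence` is hypothetical data standing in for the PERSON entities of each sentence (in the notebook these could come from `sent.ents` for each `sent` in `doc.sents`).

```python
from collections import Counter
from itertools import combinations

# Hypothetical stand-in: the set of PERSON entities found in each sentence
people_per_sentence = [
    {"Bilbo", "Gandalf"},
    {"Bilbo", "Frodo"},
    {"Frodo"},
    {"Bilbo", "Gandalf", "Frodo"},
]

# Edge weight = number of sentences in which both characters appear
edge_weights = Counter()
for people in people_per_sentence:
    # sorted() makes ("Bilbo", "Frodo") and ("Frodo", "Bilbo") the same edge
    for pair in combinations(sorted(people), 2):
        edge_weights[pair] += 1

for (first, second), weight in edge_weights.most_common():
    print(f"{first} -- {second}: {weight}")
```

The weighted edges could then be passed to a graph library such as `networkx` for drawing.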

#### Sentiment Analysis and Tone Mapping

- Objective: Identify the tone or sentiment of sections in the prologue.
- Steps:
	1.	Use a sentiment analysis library (e.g., textblob) alongside spaCy for preprocessing.
	2.	Segment the text into paragraphs or scenes.
	3.	Assign a sentiment score to each section and visualize the tone.


