<a href="https://colab.research.google.com/github/junting-huang/data_storytelling/blob/main/case_1_narrative.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# case_1. narrative


Natural Language Processing (NLP) is the set of techniques for analyzing human language and extracting useful information from it. It is one of the most important topics in machine learning and is used in many aspects of everyday life. Python supports NLP through a wide range of libraries.

One such library is TextBlob. It provides a simple API for common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and more. More information can be found in the documentation: https://textblob.readthedocs.io/en/dev/.

## 1.1 installation

Installation is quick and takes two steps: install the package, then download the NLTK corpora that TextBlob relies on:

In [None]:
! pip install textblob

In [None]:
! python -m textblob.download_corpora

## 1.2 basic usage

TextBlob aims to provide access to common text-processing operations through a familiar interface. You can treat TextBlob objects as if they were Python strings that learned how to do Natural Language Processing.

First, the import.

In [None]:
from textblob import TextBlob

Let’s create our first TextBlob. 

In [None]:
blob = TextBlob("When I wrote the following pages, or rather the bulk of them, I lived alone, in the woods, a mile from any neighbor, in a house which I had built myself, on the shore of Walden Pond, in Concord, Massachusetts, and earned my living by the labor of my hands only. ")

### Part-of-speech Tagging

Part-of-speech tagging, better known as POS tagging, is a natural language processing task that involves assigning a grammatical category (such as noun, verb, adjective, etc.) to each word in a text. This process is essential for several reasons:

* Syntax Analysis: Part-of-speech tagging helps in understanding the syntactic structure of a sentence. Identifying the part of speech of each word allows for the analysis of how words relate to each other in a grammatical sense.

* Semantic Analysis: It aids in understanding the meaning of words in context. Different parts of speech convey different semantic roles, and knowing the part of speech can provide insights into the intended meaning of a word.

* Information Retrieval: Part-of-speech tagging is crucial in information retrieval systems. It enables more accurate and relevant searches by considering the grammatical roles of words in queries and documents.


Part-of-speech tags can be accessed through the tags property.

In [None]:
print(blob.tags)
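
The tags property yields (word, tag) pairs, so they are easy to post-process with plain Python. The sketch below groups words into coarse categories by tag prefix. The sample pairs are hand-written Penn Treebank-style tags for illustration, not actual tagger output, and `group_by_category` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-written (word, tag) pairs in the same shape as blob.tags.
# Illustrative tags only, not the output of a real tagger run.
sample_tags = [
    ("I", "PRP"),
    ("lived", "VBD"),
    ("alone", "RB"),
    ("in", "IN"),
    ("the", "DT"),
    ("woods", "NNS"),
]


def group_by_category(tags):
    """Group words into coarse categories using the first two tag letters."""
    prefixes = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}
    groups = defaultdict(list)
    for word, tag in tags:
        category = prefixes.get(tag[:2], "other")
        groups[category].append(word)
    return dict(groups)


grouped = group_by_category(sample_tags)
print(grouped)
```

In the notebook, the same helper should work on `blob.tags` directly, since it only assumes an iterable of two-item tuples.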

### Noun Phrase Extraction

Noun phrase extraction is a natural language processing (NLP) task that involves identifying and extracting noun phrases from a given text. A noun phrase is a group of words that functions as a unit and consists of a noun (the head) along with its modifiers. Noun phrase extraction serves several important purposes:

* Semantic Analysis: Noun phrases often represent meaningful units of information in a sentence. Extracting them helps in understanding the key entities and concepts discussed in the text, contributing to semantic analysis.

* Information Retrieval: Noun phrases play a crucial role in information retrieval systems. By extracting relevant noun phrases from documents, search engines can improve the accuracy of search results and help users find information more effectively.

* Named Entity Recognition (NER): Noun phrase extraction is closely related to named entity recognition. Many named entities, such as people, organizations, and locations, are often part of noun phrases. Extracting noun phrases can be a preliminary step in identifying and categorizing named entities.


Similarly, noun phrases are accessed through the noun_phrases property.

In [None]:
print(blob.noun_phrases)
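
Once noun phrases are extracted, a common next step is to count them to see which entities a text keeps returning to. The sketch below does this with `collections.Counter`. The list is hand-written and stands in for `blob.noun_phrases`; it is not real extractor output.

```python
from collections import Counter

# Hypothetical list of strings standing in for blob.noun_phrases.
sample_phrases = ["walden pond", "concord", "walden pond", "massachusetts"]

# Count how often each phrase occurs and keep the most frequent one.
phrase_counts = Counter(sample_phrases)
top_phrase = phrase_counts.most_common(1)
print(top_phrase)
```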

### Tokenization

Tokenization is the process of splitting text into smaller units (tokens), such as words or sentences, so that they can be easily interpreted and manipulated later. With the TextBlob library, tokenization takes one line of code.

You can break TextBlobs into words or sentences. Sentence objects have the same properties and methods as TextBlobs. For more advanced tokenization, see the Advanced Usage guide: https://textblob.readthedocs.io/en/dev/advanced_usage.html#advanced.

In [None]:
blob.words

In [None]:
blob.sentences
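
To make the idea concrete, here is a minimal regex-based sketch of word and sentence splitting. It is a simplified stand-in for illustration, not TextBlob's tokenizer, and both function names are hypothetical. Real tokenizers handle abbreviations, quotes, and other edge cases that this ignores.

```python
import re


def simple_word_tokenize(text):
    """Return runs of letters, digits, and apostrophes; punctuation is dropped."""
    return re.findall(r"[A-Za-z0-9']+", text)


def simple_sentence_split(text):
    """Split wherever '.', '!' or '?' is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = "I lived alone, in the woods. I built the house myself!"
words = simple_word_tokenize(text)
sentences = simple_sentence_split(text)
print(words)
print(sentences)
```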

### Word Inflection and Lemmatization

Each word in TextBlob.words or Sentence.words is a Word object (a subclass of str in Python 3) with useful methods, e.g. for word inflection.

In [None]:
print(blob.words[5])

In [None]:
print(blob.words[5].singularize())

In [None]:
print(blob.words[19])

In [None]:
print(blob.words[19].pluralize())

Lemmatization reduces a word to its dictionary form (its lemma), grouping the different inflected forms of a word under one root form with the same meaning. Stemming, by contrast, strips suffixes by rule without consulting a dictionary, so its output need not be a real word. For instance, a crude stemmer that simply drops the suffix would turn ‘Caring’ into ‘Car’, whereas lemmatizing ‘Caring’ as a verb would return ‘Care’.
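
The contrast between suffix stripping and dictionary lookup can be sketched with a toy example. Both functions below are deliberately simplistic and hypothetical: `crude_stem` is not the stemmer TextBlob uses, and `LEMMA_TABLE` is a tiny hand-written table standing in for a real dictionary.

```python
def crude_stem(word):
    """Strip a few common suffixes by rule, with no dictionary check."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Tiny hand-written lookup table standing in for a real dictionary.
LEMMA_TABLE = {"caring": "care", "wrote": "write", "pages": "page"}


def toy_lemmatize(word):
    """Look the word up; fall back to the lowercased word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


for w in ("Caring", "wrote", "pages"):
    print(w, "| stem:", crude_stem(w), "| lemma:", toy_lemmatize(w))
```

The rule-based function cannot relate an irregular form like "wrote" to its base form, while the lookup can, which is why lemmatizers depend on a dictionary.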

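To perform lemmatization via TextBlob, call the lemmatize method on a Word object. Every item in blob.words is already a Word, so no extra import is needed here. Passing 'v' tells the method to treat the word as a verb. For comparison, the last cell calls the stem method on another word.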

In [None]:
print(blob.words[2])

In [None]:
print(blob.words[2].lemmatize('v'))

In [None]:
print(blob.words[4])

In [None]:
print(blob.words[4].stem())

The complete quickstart tutorial on the official TextBlob website can be found here: https://textblob.readthedocs.io/en/dev/quickstart.html#quickstart. Here is another useful article about TextBlob for your reference: https://www.scaler.com/topics/nlp/nlp-textblob/.

## 1.3 labeling agent/action

During the course, you will have the opportunity to explore several other Python packages designed for natural language processing. Some popular libraries include:

* spaCy: This is a highly popular NLP library in Python, known for its efficiency and ease of use. It can be used to parse sentences and identify various grammatical components, including the subject (agent) of the sentence.

* NLTK (Natural Language Toolkit): This is another widely-used library for NLP in Python. It provides tools for sentence parsing and can be used to identify the subject of a sentence, although it might require more manual effort compared to spaCy.

In this section, we demonstrate how to label the agent and the action in a given sentence using the spaCy package.

In [None]:
import spacy

# Load the English NLP model
nlp = spacy.load("en_core_web_sm")

# Example sentence
sentence = "John eats an apple."

# Process the sentence with spaCy
doc = nlp(sentence)

# Find the subject (agent) of the sentence
agent = None
for token in doc:
    if token.dep_ == "nsubj" and token.head.dep_ == "ROOT":
        agent = token.text

# Print the labeled agent
print("Labeled Agent:", agent)


In [None]:
# Find the root verb (action) of the sentence
action = None
for token in doc:
    if token.dep_ == "ROOT":
        action = token.text

# Print the labeled action
print("Labeled Action:", action)
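
The two loops above can be folded into one reusable helper. The sketch below is an assumption-laden illustration: `label_agent_action` is a hypothetical name, and the tokens are stand-ins built with `SimpleNamespace` (with hand-assigned dependency labels) so the snippet runs without downloading a model.

```python
from types import SimpleNamespace


def label_agent_action(tokens):
    """Return (agent, action) from tokens exposing .text, .dep_ and .head."""
    agent, action = None, None
    for token in tokens:
        if token.dep_ == "ROOT":
            action = token.text
        elif token.dep_ == "nsubj" and token.head.dep_ == "ROOT":
            agent = token.text
    return agent, action


# Stand-in tokens for "John eats an apple." with hand-assigned labels.
eats = SimpleNamespace(text="eats", dep_="ROOT")
eats.head = eats  # in spaCy, the root token is its own head
john = SimpleNamespace(text="John", dep_="nsubj", head=eats)
apple = SimpleNamespace(text="apple", dep_="dobj", head=eats)
an = SimpleNamespace(text="an", dep_="det", head=apple)

result = label_agent_action([john, eats, an, apple])
print(result)
```

In the notebook, the helper should accept a spaCy document directly, e.g. `label_agent_action(nlp("John eats an apple."))`, because it only reads the `text`, `dep_`, and `head` attributes.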

## 1.4 labeling poetic style

Labeling the poetic style of a text is a more subjective and complex task than part-of-speech tagging. Poetic style encompasses various elements such as metaphors, imagery, rhyme, rhythm, and more. Here's an example that uses spaCy and a simple heuristic to flag candidate metaphors in a text (it reuses the nlp model loaded in the previous section):

In [None]:
# Example poetic text
poetic_text = "The stars danced in the night sky, painting a canvas of dreams."

# Process the text with spaCy
doc = nlp(poetic_text)

# Flag candidate metaphors (rough heuristic): collect (noun, preposition)
# pairs where a preposition is attached to a noun
metaphors = []
for token in doc:
    if token.dep_ == "prep" and token.head.pos_ == "NOUN":
        metaphors.append((token.head.text, token.text))

# Print identified metaphors
print("Identified Metaphors:", metaphors)


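In this example, the code collects every preposition (dependency label prep) whose head is a noun, and stores the pair (noun, preposition). This matches phrases of the form "noun + preposition + ...", on the assumption that metaphors often relate one thing to another in this way. This is a simplistic approach that will also flag literal phrases, and identifying poetic style comprehensively would likely involve more advanced techniques and possibly machine learning models trained on poetic corpora.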