# Text Processing with Part-of-Speech (POS) Tagging

## What is Part-of-Speech (POS) Tagging?

Part-of-Speech (POS) tagging is the process of identifying and labeling the grammatical category of each word in a text. It assigns tags to words based on their role and function in a sentence.

### Common POS Tags Include:

- **Noun (N)**: Person, place, thing, or idea
  - Examples: "cat", "house", "happiness"
- **Verb (V)**: Action or state of being
  - Examples: "run", "is", "think"
- **Adjective (ADJ)**: Describes or modifies nouns
  - Examples: "beautiful", "tall", "red"
- **Adverb (ADV)**: Modifies verbs, adjectives, or other adverbs
  - Examples: "quickly", "very", "often"
- **Pronoun (PRON)**: Replaces nouns
  - Examples: "he", "she", "it", "they"
- **Preposition (PREP)**: Shows relationships between words
  - Examples: "in", "on", "under", "with"
- **Conjunction (CONJ)**: Connects words or phrases
  - Examples: "and", "but", "or"
- **Determiner (DET)**: Specifies nouns
  - Examples: "the", "a", "this", "some"
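
To make the idea of "assigning a tag to each word" concrete, here is a minimal sketch of a lookup-based tagger. The lexicon `TOY_LEXICON` and the function `toy_tag` are hypothetical names written for this illustration, and the labels follow the Universal POS naming (`NOUN`, `VERB`, `ADP`, `CCONJ`, ...) that spaCy reports in `token.pos_`. A real tagger such as spaCy's does not work this way: it uses the surrounding context, because the same word can play different roles (for example, "run" as a noun or as a verb).

```python
# Toy lexicon: maps a few lowercase words to Universal POS labels.
# Hand-written for illustration only; not part of any library.
TOY_LEXICON = {
    "the": "DET",
    "cat": "NOUN",
    "runs": "VERB",
    "quickly": "ADV",
    "and": "CCONJ",
    "she": "PRON",
    "under": "ADP",
    "tall": "ADJ",
    "house": "NOUN",
}


def toy_tag(sentence):
    """Return (word, tag) pairs; words missing from the lexicon get the catch-all tag "X"."""
    return [(word, TOY_LEXICON.get(word, "X")) for word in sentence.lower().split()]


tagged = toy_tag("The cat runs quickly under the tall house")
print(tagged)
```

A dictionary lookup cannot resolve ambiguous words, which is one reason the rest of this notebook relies on a trained model instead.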

### Why is POS Tagging Important?

POS tagging supports a range of downstream tasks:
- **Grammar checking**: Identifying grammatical errors
- **Information extraction**: Finding specific types of words (e.g., all nouns)
- **Text analysis**: Understanding sentence structure and meaning
- **Machine translation**: Proper translation requires understanding word roles
- **Named entity recognition**: Identifying proper nouns vs. common nouns
- **Text summarization**: Focusing on important word types

In [None]:
import spacy
import pandas as pd


In [None]:
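# download the small English model (needed once per environment); download() returns nothing, so don't assign it
spacy.cli.download("en_core_web_sm")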

In [None]:
original_text = (
    "Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, "
    "and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. "
    "It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; "
    "whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, "
    "and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, "
    "that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people’s hats off—then, "
    "I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. "
    "With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. "
    "If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."
)

In [None]:
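# convert the text to lowercase
# (note: this discards capitalization, which a tagger can use as a cue for proper nouns such as "Ishmael")
original_text = original_text.lower()
original_text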


In [None]:
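# remove punctuation using regex
import re
# replace em-dashes with a space first so the words they separate are not fused
# (otherwise "ago—never" would become "agonever")
original_text = original_text.replace("—", " ")
original_text = re.sub(r"[^\w\s]", "", original_text)
original_text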
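
As a quick check of how the `[^\w\s]` pattern treats dashes, the sketch below applies it to a short lowercase fragment of the passage. The variable names `sample`, `fused`, and `spaced` are introduced only for this illustration. It compares deleting punctuation outright with replacing it by a space and then collapsing repeated whitespace.

```python
import re

sample = "some years ago—never mind how long precisely—having little money"

# Deleting punctuation outright joins the words on either side of each dash.
fused = re.sub(r"[^\w\s]", "", sample)

# Replacing punctuation with a space, then collapsing runs of whitespace,
# keeps those words apart.
spaced = re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", sample)).strip()

print(fused)
print(spaced)
```

Fused strings such as "agonever" are not real words, so a tagger has to guess their category. Note that the space-replacement variant has its own trade-off: it would split a possessive like "people’s" into "people" and "s".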

In [None]:
# load the English pipeline and run it on the text to create a spaCy Doc
nlp = spacy.load("en_core_web_sm")
doc = nlp(original_text)

In [None]:
# collect each token's text, POS tag, and dependency label in a pandas DataFrame
df = pd.DataFrame(
    [(token.text, token.pos_, token.dep_) for token in doc],
    columns=["Token", "POS", "Dependency"],
    # Token is the word itself
    # POS is the coarse-grained part-of-speech tag
    # Dependency is the syntactic dependency relation, i.e. the relationship
    #   between the token and its head in the parse tree
)
df.head(10)  # display the first 10 rows of the DataFrame

In [None]:
# count how often each (Token, POS) pair occurs, most frequent first
tokens_df = (
    df.groupby(["Token", "POS"])
    .size()
    .reset_index(name="Count")
    .sort_values(by="Count", ascending=False)
)
tokens_df.head(10)  # display the 10 most frequent (Token, POS) pairs


In [None]:
# count how many tokens carry each POS tag, most frequent first
pos_df_pos_counts = (
    df.groupby("POS")
    .size()
    .reset_index(name="Count")
    .sort_values(by="Count", ascending=False)
)
pos_df_pos_counts.head(10)  # display the 10 most frequent POS tags
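
For readers less familiar with `groupby(...).size()`, the two aggregations above amount to counting occurrences. The sketch below shows the same idea with `collections.Counter` from the standard library. The `pairs` list is made-up sample data standing in for the rows of `df`, not output from the model.

```python
from collections import Counter

# Hypothetical (token, POS) pairs standing in for rows of `df`.
pairs = [
    ("i", "PRON"),
    ("sea", "NOUN"),
    ("i", "PRON"),
    ("sail", "VERB"),
    ("sea", "NOUN"),
    ("i", "PRON"),
]

# Equivalent in spirit to df.groupby(["Token", "POS"]).size()
token_pos_counts = Counter(pairs)

# Equivalent in spirit to df.groupby("POS").size()
pos_counts = Counter(pos for _, pos in pairs)

print(token_pos_counts.most_common(2))
print(pos_counts.most_common())
```

Grouping on the (token, POS) pair rather than on the token alone keeps separate counts for a word that appears with more than one tag.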

In [None]:
# show the 10 most frequent nouns with their counts
nouns = tokens_df[tokens_df.POS == "NOUN"].head(10)
nouns