In [6]:
# Part-of-speech tagging

"""
Big picture: how spaCy works
Load a language model → this gives you an NLP pipeline (an ordered list of steps).

Pass text into the pipeline: doc = nlp(text).

spaCy returns a Doc object that contains rich annotations:

- Tokens (Token objects) with attributes like text, lemma_, pos_, tag_, dep_, etc.

- Spans (contiguous slices of the Doc), including sentences and named entities.

- You read results (tokens, sentences, entities, noun phrases), or apply rules/patterns, or add custom components.

Think of nlp as a function that turns raw text → structured data.


The pipeline components (common ones)
- Tokenizer: splits text into Tokens (handles punctuation, contractions, etc.).

- Tagger: part‑of‑speech tags (e.g., NOUN, VERB).

- Lemmatizer: base form (run → run, running → run).

- Dependency Parser: grammatical relations (subject/object, etc.).

- NER (Entity Recognizer): finds entities (PERSON, ORG, GPE, DATE…).

- AttributeRuler / Morphologizer: enrich token features.

You can inspect your pipeline:

nlp.pipe_names
# e.g., for en_core_web_sm: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

You can disable parts you don’t need to speed things up:
with nlp.select_pipes(disable=["ner", "parser"]):
    doc = nlp("Text here")  # faster, only tagging/lemmatization


Core objects: Doc, Token, Span

doc = nlp("Apple is opening an office in Lagos next year.")

# Tokens
for t in doc:
    print(t.text, t.lemma_, t.pos_, t.dep_)
# Apple Apple PROPN nsubj
# is be AUX aux
# opening open VERB ROOT
# ...

# Sentences
[sent.text for sent in doc.sents]

# Named entities
[(ent.text, ent.label_) for ent in doc.ents]
# [('Apple','ORG'), ('Lagos','GPE'), ('next year','DATE')]

# Noun chunks (quick NP extractor)
[chunk.text for chunk in doc.noun_chunks]
# e.g., ['Apple', 'an office', 'Lagos']

Note on attributes:

pos_ is the coarse-grained Universal POS tag (e.g., NOUN); tag_ is the fine-grained, treebank-specific tag (e.g., NN, NNS).

A trailing underscore (_) gives you the string form. Without it you get an integer hash ID.



1) What is a “pipeline” in spaCy?
In plain English:

A pipeline is like an assembly line for text.

You give spaCy raw text.

The text passes through a sequence of processing steps, each step adding some analysis.

The final product is a Doc object that has tokens, parts of speech, lemmas, entities, etc.

Think: text → step 1 → step 2 → … → annotated text

2) Typical steps in en_core_web_sm
When you load this model:

import spacy
nlp = spacy.load("en_core_web_sm")
The pipeline components usually are:

Tokenizer – Splits raw text into tokens (words, punctuation).

"Apple is looking..." → ["Apple", "is", "looking", ...]

Tagger – Assigns part-of-speech tags to tokens.

"Apple" → PROPN (proper noun), "is" → AUX (auxiliary verb)

Lemmatizer / AttributeRuler – Finds the base form of words.

"looking" → "look", "cars" → "car"

Dependency Parser – Figures out grammar structure.

Who is the subject? What’s the object?

NER (Named Entity Recognizer) – Detects real-world entities.

"Apple" → ORG, "U.K." → GPE, "next year" → DATE

Visual “assembly line”
Raw Text
   |
   v
Tokenizer → Tokens
   |
   v
Tagger → POS tags
   |
   v
Lemmatizer → Base forms
   |
   v
Parser → Grammar (subject/object)
   |
   v
NER → Named entities
   |
   v
Doc object (richly annotated text)
Each step adds extra info without losing the original text.

3) Why is it called a pipeline?
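Because each step receives the Doc produced by the previous step and adds its own annotations, much like an assembly line in a factory.

Tokenizer creates tokens → tagger needs tokens to label them.

In spaCy v3 the NER does not read the tagger's or parser's output; it predicts entities from the tokens and its own (or a shared) tok2vec representation.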

End result: your text is enriched with layers of analysis.



"""
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")

emma_ja = "emma woodhouse handsome clever and rich with a comfortable home and happy disposition seemed to unite some of the best blessings of existence and had lived nearly twentyone years in the world with very little to distress or vex her she was the youngest of the two daughters of a most affectionate indulgent father and had in consequence of her sisters marriage been mistress of his house from a very early period her mother had died too long ago for her to have more than an indistinct remembrance of her caresses and her place had been supplied by an excellent woman as governess who had fallen little short of a mother in affection sixteen years had miss taylor been in mr woodhouses family less as a governess than a friend very fond of both daughters but particularly of emma between them it was more the intimacy of sisters even before miss taylor had ceased to hold the nominal office of governess the mildness of her temper had hardly allowed her to impose any restraint and the shadow of authority being now long passed away they had been living together as friend and friend very mutually attached and emma doing just what she liked highly esteeming miss taylors judgment but directed chiefly by her own"
print(emma_ja)

emma woodhouse handsome clever and rich with a comfortable home and happy disposition seemed to unite some of the best blessings of existence and had lived nearly twentyone years in the world with very little to distress or vex her she was the youngest of the two daughters of a most affectionate indulgent father and had in consequence of her sisters marriage been mistress of his house from a very early period her mother had died too long ago for her to have more than an indistinct remembrance of her caresses and her place had been supplied by an excellent woman as governess who had fallen little short of a mother in affection sixteen years had miss taylor been in mr woodhouses family less as a governess than a friend very fond of both daughters but particularly of emma between them it was more the intimacy of sisters even before miss taylor had ceased to hold the nominal office of governess the mildness of her temper had hardly allowed her to impose any restraint and the shadow of auth
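
The notes at the top of this cell describe the pipeline as an ordered list of components that can be inspected and temporarily disabled. A minimal sketch of that idea, assuming only that spaCy v3 is installed: it uses a blank English pipeline plus the rule-based `sentencizer` so no trained model is required. The names `nlp_blank`, `doc_full` and `doc_partial` and the sample text are illustrative and not part of the original notebook.

```python
import spacy

# A blank English pipeline contains only the tokenizer; no trained model is needed.
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")  # rule-based sentence splitter

# pipe_names lists the components that run after the tokenizer.
print(nlp_blank.pipe_names)

text = "Apple is opening an office in Lagos. It opens next year."

# Full pipeline: tokenizer, then sentencizer.
doc_full = nlp_blank(text)
sentences = [sent.text for sent in doc_full.sents]

# Temporarily disable a component; it is restored when the block exits.
with nlp_blank.select_pipes(disable=["sentencizer"]):
    doc_partial = nlp_blank(text)

# Sentence boundaries exist only on the Doc that went through the sentencizer.
has_sents_full = doc_full.has_annotation("SENT_START")
has_sents_partial = doc_partial.has_annotation("SENT_START")
print(sentences, has_sents_full, has_sents_partial)
```

The same pattern should apply to `en_core_web_sm`: a component that is disabled simply leaves its annotations unset on the returned Doc.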

In [13]:
doc = nlp(emma_ja)

# Build the token/POS table in one pass instead of concatenating a new DataFrame per token
pos_df = pd.DataFrame(
    [{"token": token.text, "pos_tag": token.pos_} for token in doc],
    columns=["token", "pos_tag"],
)

pos_df.head(10)


         token  pos_tag
0         emma    PROPN
1    woodhouse    PROPN
2     handsome      ADJ
3       clever      ADJ
4          and    CCONJ
5         rich      ADJ
6         with      ADP
7            a      DET
8  comfortable      ADJ
9         home     NOUN


In [12]:

pos_df_counts = (
    pos_df.groupby(["token", "pos_tag"])
    .size()
    .reset_index(name="counts")
    .sort_values(by="counts", ascending=False)
)
pos_df_counts.head(10)

     token  pos_tag  counts
 88     of      ADP      14
 49    had      AUX       9
 54    her     PRON       9
111    the      DET       8
  6    and    CCONJ       8
  0      a      DET       6
114     to     PART       5
 61     in      ADP       4
 13   been      AUX       4
120   very      ADV       4
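
The group-and-count step above can also be written with `DataFrame.value_counts`, which returns the counts already sorted in descending order. A minimal sketch on made-up data: `toy_pos_df` is a hypothetical stand-in for `pos_df`, and its rows are illustrative, not taken from the Emma text.

```python
import pandas as pd

# Hypothetical (token, pos_tag) rows standing in for the pos_df built earlier.
toy_pos_df = pd.DataFrame(
    {
        "token": ["of", "the", "of", "her", "her", "of", "to", "to"],
        "pos_tag": ["ADP", "DET", "ADP", "PRON", "PRON", "ADP", "PART", "ADP"],
    }
)

# Count each (token, pos_tag) pair; value_counts sorts by count, descending.
toy_counts = (
    toy_pos_df.value_counts(["token", "pos_tag"])
    .reset_index(name="counts")
)
print(toy_counts)
```

Counting on the pair rather than on the token alone keeps homographs apart: in the toy data, "to" tagged PART and "to" tagged ADP end up in separate rows.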


In [14]:
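# Number of distinct tokens per POS tag (each token/tag pair counted once, not weighted by frequency)
pos_df_poscounts = pos_df_counts.groupby("pos_tag")["token"].count().sort_values(ascending=False)
pos_df_poscounts.head(10)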

pos_tag
NOUN     35
VERB     19
ADJ      18
ADV      18
PRON      9
ADP       8
PROPN     6
DET       5
AUX       4
CCONJ     3
Name: token, dtype: int64
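
Because `pos_df_counts` has one row per (token, pos_tag) pair, counting rows per tag gives the number of distinct tokens, while summing the `counts` column gives the total number of occurrences. A small sketch of the difference, using five rows copied from the count table above; `toy_counts`, `distinct_per_tag` and `total_per_tag` are names introduced here for illustration.

```python
import pandas as pd

# Five rows copied from the (token, pos_tag, counts) table above.
toy_counts = pd.DataFrame(
    {
        "token": ["of", "in", "her", "the", "a"],
        "pos_tag": ["ADP", "ADP", "PRON", "DET", "DET"],
        "counts": [14, 4, 9, 8, 6],
    }
)

# Distinct tokens per tag: how many different words carry each tag.
distinct_per_tag = toy_counts.groupby("pos_tag")["token"].count()

# Total occurrences per tag: how often each tag appears in the text.
total_per_tag = toy_counts.groupby("pos_tag")["counts"].sum()

print(pd.DataFrame({"distinct": distinct_per_tag, "total": total_per_tag}))
```

On the full table the two views can rank tags quite differently: a tag with few distinct words, such as ADP, can still account for many occurrences.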

In [15]:
# Ten most frequent noun tokens (pos_df_counts is already sorted by counts, descending)
nouns = pos_df_counts[pos_df_counts["pos_tag"] == "NOUN"].head(10)
nouns

         token  pos_tag  counts
 48  governess     NOUN       3
 46     friend     NOUN       3
130      years     NOUN       2
 35       emma     NOUN       2
 28  daughters     NOUN       2
103    sisters     NOUN       2
 82     mother     NOUN       2
 89     office     NOUN       1
 78   mistress     NOUN       1
 75   mildness     NOUN       1
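
The noun filter above generalizes to any tag. One possible way to wrap it is sketched below; `top_tokens` is a hypothetical helper, not part of the original notebook, and it assumes a frame with `token`, `pos_tag` and `counts` columns. The toy rows are copied from the count tables above.

```python
import pandas as pd


def top_tokens(counts_df: pd.DataFrame, pos_tag: str, n: int = 10) -> pd.DataFrame:
    """Return the n most frequent tokens carrying the given POS tag.

    Hypothetical helper; expects columns "token", "pos_tag" and "counts".
    """
    subset = counts_df[counts_df["pos_tag"] == pos_tag]
    # nlargest does not rely on counts_df already being sorted.
    return subset.nlargest(n, "counts").reset_index(drop=True)


# Toy rows copied from the count tables above.
toy_counts = pd.DataFrame(
    {
        "token": ["of", "governess", "had", "friend", "years", "office"],
        "pos_tag": ["ADP", "NOUN", "AUX", "NOUN", "NOUN", "NOUN"],
        "counts": [14, 3, 9, 3, 2, 1],
    }
)

top_nouns = top_tokens(toy_counts, "NOUN", n=3)
print(top_nouns)
```

With the real data, `top_tokens(pos_df_counts, "VERB")` or `top_tokens(pos_df_counts, "ADJ")` would give the matching tables for other tags.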
