# NLP Pipelines
In this lab: spaCy   
Notable other mentions: **NLTK** (next week), **CoreNLP** (a.k.a. Stanford NLP)   
Hungarian specific: **magyarlanc**, other resources: https://github.com/oroszgy/awesome-hungarian-nlp

# Installation
pip install spacy

# Download
English pipeline models:   
python -m spacy download en_core_web_sm   
Details: https://spacy.io/usage/models

In [40]:
!nvidia-smi

Sat Sep 13 11:44:45 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0  On |                  N/A |
|  0%   39C    P8             30W /  420W |     503MiB /  24576MiB |     32%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [41]:
# Update spaCy to get the v3 features and install GPU support for better performance
!pip install -U pip setuptools wheel
# The CUDA 12.x extra is named cuda12x (nvidia-smi above reports CUDA 12.4)
!pip install --upgrade 'spacy[cuda12x]'

In [42]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 58.6 MB/s
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [43]:
import spacy
import pandas as pd

# spaCy pipeline schematic

<img src="https://spacy.io/images/pipeline-design.svg">

**Tokenizer**: "segment an input character sequence into
small meaningful units" (in spaCy, sentence boundaries are set later, by the parser or a dedicated sentence segmenter)

**Tagger**: "determining Part-of-speech (POS) tags"

**Parser**: "parsing syntactical dependencies";   
word D depends on word H if:
- D modifies the meaning of H
- D can be omitted from the sentence keeping H (but H cannot be omitted while keeping D)
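
The head/dependent relation can be inspected without a trained model. The following is a minimal sketch that builds a `Doc` by hand; the heads and dependency labels are typed in manually for illustration (they are assumptions, not parser output), with `heads` given as absolute token indices.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

# Hand-annotated toy sentence: heads are absolute indices, the root points to itself
words = ["Elisabeth", "grabbed", "2", "apples"]
heads = [1, 1, 3, 1]
deps = ["nsubj", "ROOT", "nummod", "dobj"]
doc = Doc(vocab, words=words, heads=heads, deps=deps)

for token in doc:
  # D --dep--> H: the dependent D modifies its head H
  print(f"{token.text:<10} --{token.dep_}--> {token.head.text}")

# Walking up the tree from "2": its head, then the head of its head
chain = [t.text for t in doc[2].ancestors]
print(chain)
```

Here "2" depends on "apples": dropping "2" leaves "grabbed apples" intact, while dropping "apples" would leave "2" without anything to modify.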

**NER**: "the task of finding
expressions in the input text that are naming entities and
tagging them with the corresponding entity type"

**Others**: e.g. `attribute_ruler`, `lemmatizer`, or custom components

# Pipeline management

In [44]:
# Load pipeline

# *EXTRA CODE*: require_gpu() is called to force spacy to use GPU
#spacy.require_gpu()
nlp = spacy.load("en_core_web_sm")

In [45]:
# Check pipeline elements
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [46]:
# Control loaded elements
nlp = spacy.load("en_core_web_sm", disable=["parser"], exclude=["ner"])
print(nlp.pipe_names)

nlp.enable_pipe("parser")
print(nlp.pipe_names)

['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer']
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']
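
The difference between the two options: a disabled component is still loaded but skipped, so it can be re-enabled later, while an excluded component is not loaded at all. Below is a small sketch of the disable/enable side on a blank pipeline, so no model download is needed; the `sentencizer` component is only a stand-in for a real trained component.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Disable: the component stays loaded but is skipped during processing
nlp.disable_pipe("sentencizer")
print("active:", nlp.pipe_names)
print("disabled:", nlp.disabled)
print("all loaded:", nlp.component_names)
nlp.enable_pipe("sentencizer")

# Temporary disabling: the pipeline is restored when the block exits
with nlp.select_pipes(disable=["sentencizer"]):
  names_inside = list(nlp.pipe_names)
names_after = list(nlp.pipe_names)
print(names_inside, names_after)
```

`select_pipes` is convenient when only one processing step needs a reduced pipeline.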


In [47]:
# Let's load the full pipeline
# tok2vec - required transformation for neural network based components
# attribute_ruler - handles exception rules, enhances tagger outputs
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


# Inference

In [48]:
text = "Elisabeth grabbed 2 apples and took a bite of each."

doc = nlp(text)
print(doc)

Elisabeth grabbed 2 apples and took a bite of each.


## Document attributes

In [49]:
#tokenized
for token in doc:
  print(token)

Elisabeth
grabbed
2
apples
and
took
a
bite
of
each
.


In [50]:
# Token attributes
# An attribute ending in _ returns the string representation of a category; without _ it usually returns an integer ID
data = [
  (token.text,      # token string
   token.is_alpha,  # contains letters only
   token.is_punct,  # punctuation
   token.is_stop,   # stop word
   token.norm_,     # normalized form
   token.pos_,      # coarse POS tag
   spacy.explain(token.pos_),
   token.tag_,      # fine-grained POS tag
   spacy.explain(token.tag_),
   token.dep_,      # dependency type
   spacy.explain(token.dep_),
   token.ent_type_, # NER entity type
   spacy.explain(token.ent_type_),
   token.lemma_,    # lemma of the word
   token.morph,     # morphological information
  ) for token in doc]



In [51]:
df = pd.DataFrame(data, columns=["word","isalpha","ispunct","isstop","norm","pos","pos_expl","fine_pos","fine_pos_expl","dep","dep_expl","entity","entity_expl","lemma","morph"])
df

Unnamed: 0,word,isalpha,ispunct,isstop,norm,pos,pos_expl,fine_pos,fine_pos_expl,dep,dep_expl,entity,entity_expl,lemma,morph
0,Elisabeth,True,False,False,elisabeth,PROPN,proper noun,NNP,"noun, proper singular",nsubj,nominal subject,PERSON,"People, including fictional",Elisabeth,(Number=Sing)
1,grabbed,True,False,False,grabbed,VERB,verb,VBD,"verb, past tense",ROOT,root,,,grab,"(Tense=Past, VerbForm=Fin)"
2,2,False,False,False,2,NUM,numeral,CD,cardinal number,nummod,numeric modifier,CARDINAL,Numerals that do not fall under another type,2,(NumType=Card)
3,apples,True,False,False,apples,NOUN,noun,NNS,"noun, plural",dobj,direct object,,,apple,(Number=Plur)
4,and,True,False,True,and,CCONJ,coordinating conjunction,CC,"conjunction, coordinating",cc,coordinating conjunction,,,and,(ConjType=Cmp)
5,took,True,False,False,took,VERB,verb,VBD,"verb, past tense",conj,conjunct,,,take,"(Tense=Past, VerbForm=Fin)"
6,a,True,False,True,a,DET,determiner,DT,determiner,det,determiner,,,a,"(Definite=Ind, PronType=Art)"
7,bite,True,False,False,bite,NOUN,noun,NN,"noun, singular or mass",dobj,direct object,,,bite,(Number=Sing)
8,of,True,False,True,of,ADP,adposition,IN,"conjunction, subordinating or preposition",prep,prepositional modifier,,,of,()
9,each,True,False,True,each,PRON,pronoun,DT,determiner,pobj,object of preposition,,,each,()
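
The underscore convention used for the token attributes above can be checked on a hand-built `Doc`. This is a minimal sketch: the POS tags are assigned manually for illustration rather than predicted by a tagger.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
# POS tags are set by hand here; a trained tagger would normally predict them
doc = Doc(vocab, words=["Elisabeth", "grabbed", "apples"], pos=["PROPN", "VERB", "NOUN"])

token = doc[0]
print(type(token.pos).__name__, token.pos)  # integer ID
print(token.pos_)                            # human-readable string
print(spacy.explain(token.pos_))             # short description of the label
```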


## Plot DEP graph

In [52]:
from spacy import displacy

displacy.render(doc, style="dep", jupyter=True, options={"distance":120, "bg":"transparent", "color":"white"})

In [53]:
for token in doc:
  print(token.text.ljust(10), "rightmost ancestor chain:", list(token.ancestors))

Elisabeth  rightmost ancestor chain: [grabbed]
grabbed    rightmost ancestor chain: []
2          rightmost ancestor chain: [apples, grabbed]
apples     rightmost ancestor chain: [grabbed]
and        rightmost ancestor chain: [grabbed]
took       rightmost ancestor chain: [grabbed]
a          rightmost ancestor chain: [bite, took, grabbed]
bite       rightmost ancestor chain: [took, grabbed]
of         rightmost ancestor chain: [bite, took, grabbed]
each       rightmost ancestor chain: [of, bite, took, grabbed]
.          rightmost ancestor chain: [grabbed]


## Plot entities

In [54]:
displacy.render(doc, style="ent", jupyter=True, options={"distance":120, "bg":"transparent", "color":"white"})

# Multiple sentences

In [55]:
text = "Bob went to the cinema. He watched Titanic with his friends."

doc = nlp(text)
for sent in doc.sents:
  displacy.render(sent, style="dep", jupyter=True, options={"distance":120, "bg":"transparent", "color":"white"})

# Batched pipeline

When working with statistical models, processing texts in batches with `nlp.pipe` is usually more efficient than calling `nlp` on each text separately.

In [56]:
!pip install gdown

import gdown

gdown.download("https://drive.google.com/uc?id=11ggHFzTXvvRETep7hwYlwyByFEHOLpa6", "source.txt", quiet=False)


Downloading...
From: https://drive.google.com/uc?id=11ggHFzTXvvRETep7hwYlwyByFEHOLpa6
To: /work/elte-nlp-course/practice_examples/source.txt
100%|██████████| 54.8k/54.8k [00:00<00:00, 1.63MB/s]


'source.txt'

In [57]:
with open("./source.txt","r", encoding="utf-8") as f:
  lines = f.readlines()

In [58]:
print(lines[:2])

['In the original BBS article, Searle identified and discussed several responses to the argument that he had come across in giving the argument in talks at various places. As a result, these early responses have received the most attention in subsequent discussion. What Searle 1980 calls “perhaps the most common reply” is the Systems Reply.\n', 'The Systems Reply (which Searle says was originally associated with Yale, the home of Schank’s AI work) concedes that the man in the room does not understand Chinese. But, the reply continues, the man is but a part, a central processing unit (CPU), in a larger system. The larger system includes the huge database, the memory (scratchpads) containing intermediate states, and the instructions – the complete system that is required for answering the Chinese questions. So the Sytems Reply is that while the man running the program does not understand Chinese, the system as a whole does.\n']


In [59]:
%%time
#Simple solution

docs = [nlp(line) for line in lines]

CPU times: user 1.16 s, sys: 12.6 ms, total: 1.17 s
Wall time: 1.17 s


In [60]:
%%time
#Batched solution

docs = list(nlp.pipe(lines))

CPU times: user 713 ms, sys: 11.6 ms, total: 724 ms
Wall time: 724 ms


In [61]:
%%time
#Batched solution with multiple processes

docs =  list(nlp.pipe(lines, n_process=4))

CPU times: user 26.9 ms, sys: 53.4 ms, total: 80.2 ms
Wall time: 878 ms
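
When each text comes with metadata (an ID, a row index), `nlp.pipe` can carry it along with `as_tuples=True`. The sketch below uses a blank tokenizer-only pipeline so it runs without a model; the `records` list and its `id` field are made up for illustration.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no model download needed

# (text, context) pairs; the context is passed through untouched
records = [
  ("The Systems Reply concedes the point.", {"id": 1}),
  ("Will Otto notice the difference?", {"id": 2}),
  ("If so, when?", {"id": 3}),
]

token_counts = {}
for doc, context in nlp.pipe(records, as_tuples=True, batch_size=2):
  token_counts[context["id"]] = len(doc)

print(token_counts)
```

Note that for a small input like the one timed above, `n_process` can end up slower in wall time than the single-process batch, since starting worker processes has its own cost.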


# Custom pipeline component

Let's collect only the questions.

In [62]:
from spacy.language import Language

questions = []
@Language.component("question_collector")
def my_component(doc):
  for sent in doc.sents:
    if sent[-1].text == "?" and sent[-1].is_sent_end:
      questions.append(sent)
  return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("question_collector", name="print_question", before="lemmatizer")
print(nlp.pipe_names)
docs = list(nlp.pipe(lines))
for question in questions:
  print(question)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'print_question', 'lemmatizer', 'ner']
In his 1989 paper, Harnad writes “Searle formulates the problem as follows: Is the mind a computer program?
Stevan Harnad also finds important our sensory and motor capabilities: “Who is to say that the Turing Test, whether conducted in Chinese or in any other language, could be successfully passed without operations that draw on our sensory, motor, and other higher cognitive capacities as well?
Ex hypothesi the rest of the world will not notice the difference; will Otto?
If so, when?
What physical properties of the brain are important?
In criticism of Searle’s response to the Brain Simulator Reply, Kurzweil says: “So if we scale up Searle’s Chinese Room to be the rather massive ‘room’ it needs to be, who’s to say that the entire system of a hundred trillion people simulating a Chinese Brain that knows Chinese isn’t conscious?
Related to the preceding is The Other Minds Reply: “How do you know tha
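
The component above appends to a module-level list, which keeps growing if the cell is run again. One possible alternative, sketched here under assumptions, is to store the questions on each `Doc` through a custom extension attribute. The names `question_marker` and `questions` are hypothetical, and a rule-based `sentencizer` on a blank pipeline replaces the parser so that the example runs without a model.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical extension attribute; force=True allows re-running the cell
Doc.set_extension("questions", default=None, force=True)

@Language.component("question_marker")
def question_marker(doc):
  # Keep the sentences whose last token is a question mark
  doc._.questions = [sent for sent in doc.sents if sent[-1].text == "?"]
  return doc

nlp_q = spacy.blank("en")
nlp_q.add_pipe("sentencizer")  # rule-based sentence boundaries
nlp_q.add_pipe("question_marker", after="sentencizer")

doc_q = nlp_q("Is the mind a computer program? Searle disagrees. If so, when?")
question_texts = [sent.text for sent in doc_q._.questions]
print(question_texts)
```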

# Collect all PERSON entities and count their mentions

In [63]:
personalMentions={}

@Language.component("person_counter")
def person_counter(doc):
  # Count every PERSON entity by its surface text
  for ent in doc.ents:
    if ent.label_ == "PERSON":
      personalMentions[ent.text] = personalMentions.get(ent.text, 0) + 1
  return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("person_counter", name="person_counter", after="ner")
print(nlp.pipe_names)
docs = list(nlp.pipe(lines))
print(personalMentions)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'person_counter']
{'Ned Block': 2, 'Jack Copeland': 2, 'Daniel Dennett': 4, 'Douglas Hofstadter': 1, 'Jerry Fodor': 3, 'John Haugeland': 2, 'Ray Kurzweil': 1, 'Georges Rey': 3, 'Rey': 5, 'Kurzweil': 1, 'Margaret Boden': 4, 'Boden': 1, 'Clark': 7, 'Otto': 3, 'Shaffer': 2, 'Stevan Harnad': 5, 'Harnad': 5, 'Block': 1, 'Virtual Mind': 8, 'Tim Maudlin': 3, 'Maudlin': 5, 'Minsky': 1, 'Perlis': 2, 'Chalmers': 1, 'Richard Hanley': 1, 'Patrick Hayes': 1, 'Don Perlis': 1, 'Schank': 1, 'Turing Test': 2, 'Mao': 1, 'Mind': 1, 'Roger Penrose': 1, 'Kurt Gödel’s': 1, 'Penrose': 1, 'Gödel': 1, 'Christian Kaernbach': 1, 'Robot': 2, 'Tim Crane': 2, 'Hans Moravec': 2, 'Vat': 1, 'E.g Carter': 1, 'Hilary Putnam': 1, 'David Lewis': 1, 'Helen Keller': 1, 'Jerry Fodor’s': 1, 'Roger Schank': 1, 'Paul': 2, 'Patricia Churchland': 2, 'Suppose Otto': 1, 'Otto’s': 4, 'John Searle': 1, 'Systems Reply': 1, 'Rod Serling’s': 1, 'Steven Pinker': 4, '

In [64]:
# Merge single-word mentions into full names

keys = list(personalMentions.keys())
# Map every part of a multi-word name to the full name (later names overwrite earlier ones)
splitkeys = {splitkey: key for key in keys if " " in key for splitkey in key.split(" ")}
print(splitkeys)

for key in keys:
  # A single-word mention that is part of a known full name is merged into it
  if " " not in key and key in splitkeys:
    print(key)
    personalMentions[splitkeys[key]] += personalMentions[key]
    personalMentions.pop(key)

{'Ned': 'Ned Block', 'Block': 'Ned Block', 'Jack': 'Jack Copeland', 'Copeland': 'Jack Copeland', 'Daniel': 'Daniel Tammet', 'Dennett': 'contra Dennett', 'Douglas': 'Douglas Hofstadter', 'Hofstadter': 'Douglas Hofstadter', 'Jerry': 'Jerry Fodor’s', 'Fodor': 'Jerry Fodor', 'John': 'John Searle', 'Haugeland': 'John Haugeland', 'Ray': 'Similarly Ray Kurzweil', 'Kurzweil': 'Similarly Ray Kurzweil', 'Georges': 'Georges Rey', 'Rey': 'Georges Rey', 'Margaret': 'Margaret Boden', 'Boden': 'Margaret Boden', 'Stevan': 'Stevan Harnad', 'Harnad': 'Stevan Harnad', 'Virtual': 'Virtual Mind', 'Mind': 'Transcendent Mind', 'Tim': 'Tim Crane', 'Maudlin': 'Tim Maudlin', 'Richard': 'Richard Hanley', 'Hanley': 'Richard Hanley', 'Patrick': 'Patrick Hayes', 'Hayes': 'Patrick Hayes', 'Don': 'Don Perlis', 'Perlis': 'Don Perlis', 'Turing': 'Turing Test', 'Test': 'Turing Test', 'Roger': 'Roger Schank', 'Penrose': 'Roger Penrose', 'Kurt': 'Kurt Gödel’s', 'Gödel’s': 'Kurt Gödel’s', 'Christian': 'Christian Kaernbach'

In [65]:
print(personalMentions)

{'Ned Block': 3, 'Jack Copeland': 2, 'Daniel Dennett': 4, 'Douglas Hofstadter': 1, 'Jerry Fodor': 3, 'John Haugeland': 2, 'Ray Kurzweil': 1, 'Georges Rey': 8, 'Margaret Boden': 5, 'Shaffer': 2, 'Stevan Harnad': 10, 'Virtual Mind': 8, 'Tim Maudlin': 8, 'Minsky': 1, 'Richard Hanley': 1, 'Patrick Hayes': 1, 'Don Perlis': 3, 'Turing Test': 2, 'Mao': 1, 'Roger Penrose': 2, 'Kurt Gödel’s': 1, 'Gödel': 1, 'Christian Kaernbach': 1, 'Robot': 2, 'Tim Crane': 2, 'Hans Moravec': 6, 'Vat': 1, 'E.g Carter': 1, 'Hilary Putnam': 1, 'David Lewis': 1, 'Helen Keller': 1, 'Jerry Fodor’s': 1, 'Roger Schank': 2, 'Patricia Churchland': 2, 'Suppose Otto': 4, 'Otto’s': 4, 'John Searle': 1, 'Systems Reply': 1, 'Rod Serling’s': 1, 'Steven Pinker': 4, 'Andy Clark': 8, 'William Lycan': 1, 'Similarly Ray Kurzweil': 2, 'Terry Horgan': 1, 'Wittgenstein': 1, 'Daniel Tammet': 1, 'Intentionality': 1, 'Robotics': 1, 'Transcendent Mind': 2, 'Wakefield': 1, 'Aliens': 1, 'SAM': 1, 'Simon': 3, 'Eisenstadt': 2, 'Rudolf Carnap
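
The merged counts show the limits of the heuristic: any part of a name is used as a key, and later names overwrite earlier ones, so for example 'Kurzweil' is folded into 'Similarly Ray Kurzweil'. A stricter variant, sketched below as one possible alternative rather than the notebook's method, only merges a single-word mention when it is the last word of exactly one full name. The helper name `merge_surnames` and the toy counts are made up for illustration.

```python
from collections import Counter, defaultdict

def merge_surnames(counts):
  # surname -> set of full names ending in that surname
  by_surname = defaultdict(set)
  for name in counts:
    parts = name.split()
    if len(parts) > 1:
      by_surname[parts[-1]].add(name)

  merged = Counter()
  for name, n in counts.items():
    candidates = by_surname.get(name, set())
    if " " not in name and len(candidates) == 1:
      # Unambiguous surname: add its count to the full name
      merged[next(iter(candidates))] += n
    else:
      merged[name] += n
  return dict(merged)

toy_counts = {"Georges Rey": 3, "Rey": 5, "Tim Maudlin": 3, "Tim Crane": 2, "Tim": 1, "Otto": 3}
merged_counts = merge_surnames(toy_counts)
print(merged_counts)
```

First names such as "Tim" are left alone here because they are never the last word of a full name.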

# Using Pandas with spaCy pipelines

Goal: "normalize" the text, i.e. keep only the lowercased lemmas of alphabetic, non-stop-word tokens.

In [None]:
# fuzzy=True lets gdown extract the file ID from a ".../file/d/<id>/view" sharing link
gdown.download("https://drive.google.com/file/d/1Oti4rfwPJcd4S79GOvzZPP9OCO7shnAd/view?usp=sharing", "winemag-data-130k-v2.json", fuzzy=True)

df = pd.read_json("winemag-data-130k-v2.json")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
nlp = spacy.load("en_core_web_sm", disable=["ner","parser"])

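Use only the first 20,000 rows to keep computation time low.

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added later
df = df[:20000].copy()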

- Run the spaCy computation as a single block so spaCy can batch efficiently
- Do the filtering afterwards with an `apply` function

In [None]:
%%time
df["doc"] = list(nlp.pipe(df["description"], batch_size=4096))

In [None]:
def prepare_text(doc):
  filtered_list = []
  for token in doc:
    if token.is_alpha and not token.is_stop and not token.is_punct:
      filtered_list.append(token.lemma_.lower())

  return " ".join(filtered_list)

In [None]:
df["normalized_text"] = df["doc"].apply(prepare_text)

In [None]:
df["price"] = df["price"].fillna(df["price"].mean())

In [None]:
with pd.option_context('display.max_colwidth', None):
  display(df[["description","normalized_text","price"]])

# Exploring some basic data connections

How does the average length of the words (lemmas) used to describe the product affect the price?

In [None]:
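# Mean length of the individual lemmas (separator spaces are not counted; empty texts give 0)
df["avg_lemma_len"] = df["normalized_text"].apply(lambda x: sum(len(w) for w in x.split()) / max(len(x.split()), 1))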
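
A quick sanity check of what is being measured, as a small sketch on a toy string: `len(text)` also counts the spaces between words, so dividing it by the word count overstates the per-word average by almost one character. The helper name `mean_word_length` is made up for illustration.

```python
def mean_word_length(text):
  # Average number of characters per whitespace-separated word
  words = text.split()
  if not words:
    return 0.0
  return sum(len(word) for word in words) / len(words)

sample = "ripe fruity wine"
with_spaces = len(sample) / len(sample.split(" "))  # 16 characters / 3 words
without_spaces = mean_word_length(sample)           # 14 letters / 3 words
print(with_spaces, without_spaces)
```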

In [None]:
import matplotlib.pyplot as plt

plt.scatter(df["avg_lemma_len"],df["price"])
plt.xlabel("Avg word len")
plt.ylabel("Price");

In [None]:
import numpy as np

stepnum = 15
steps = np.linspace(5,10,stepnum)

bin_center = np.zeros(stepnum-1)
bin_avg_price = np.zeros(stepnum-1)

for i in range(stepnum-1):
  bin_center[i] = (steps[i]+steps[i+1])/2.0
  bin_avg_price[i] = df["price"][(df["avg_lemma_len"]>=steps[i]) & (df["avg_lemma_len"]<steps[i+1])].median()

In [None]:
plt.plot(bin_center, bin_avg_price)
plt.xlabel("Average description word length")
plt.ylabel("Median price")
plt.grid()

Could extreme values (outliers) be driving this change? Which measure should be used then?


(median)
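
To see why the median is the safer choice, here is a small sketch on made-up numbers: a single extreme price pulls the bin mean far away, while the median barely moves. The binning uses `pd.cut` with left-closed intervals, which mirrors the `>=` / `<` conditions in the loop above; the toy values and bin edges are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
  "avg_lemma_len": [5.1, 5.2, 6.0, 6.1, 6.2],
  "price": [10.0, 20.0, 30.0, 1000.0, 40.0],  # 1000.0 is the outlier
})

edges = np.array([5.0, 5.5, 6.5])
# right=False gives intervals [5.0, 5.5) and [5.5, 6.5)
bins = pd.cut(toy["avg_lemma_len"], bins=edges, right=False)
summary = toy.groupby(bins, observed=False)["price"].agg(["mean", "median"])
print(summary)
```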