# NLP
Find your favorite news source and grab the article text. 

1. Show the most common words in the article.
2. Show the most common words under a part of speech (e.g. NOUN: {'Bob': 12, 'Alice': 4}).
3. Find a subject/object relationship through the dependency parser in any sentence.
4. Show the most common Entities and their types. 
5. Find Entities and their dependency (hint: entity.root.head).
6. Find the most similar words in the article.

Note: Yes, the notebook from the video is not provided. I leave it to you to make your own :) It's your final assignment for the semester. Enjoy!

In [2]:
# Text taken from Nature article: https://www.nature.com/articles/d41586-024-01103-7
# doi: https://doi.org/10.1038/d41586-024-01103-7
# Title: Peter Higgs obituary: physicist who predicted boson that explains why particles have mass
# Author: Christine Sutton
# Date: 12 April 2024

text = """During a few weeks in the summer of 1964, Peter Higgs, a theoretical physicist at the University of Edinburgh, UK, wrote two short papers outlining his ideas for a mechanism that could give mass to fundamental particles, the building blocks of the Universe. His aim was to rescue a theory that was mathematically appealing but ultimately unrealistic because the particles it described had no mass. The second paper drew attention to a measurable consequence of his proposal — it predicted the existence of a new massive particle. Nearly half a century later, the discovery of the predicted particle brought Higgs, who has died aged 94, a share of the 2013 Nobel Prize in Physics.
The mechanism Higgs described became a key component of the standard model of particle physics during the 1970s, but the associated particle remained stubbornly elusive. Then, in 2012, two giant experiments run by more than 6,000 physicists at CERN, Europe’s high-energy physics laboratory near Geneva, Switzerland, discovered something with the appropriate properties. By then, the particle had achieved fame as the Higgs boson, although the self-effacing Higgs would usually refer to it as ‘the scalar boson’ in reference to its key characteristic of having no intrinsic spin.
Higgs was born in Newcastle upon Tyne in 1929, but his father’s work as a sound engineer for the BBC took the family to Bristol, where Higgs attended Cotham Grammar School. There, he spotted several mentions on the honours boards of a previous pupil, Paul Dirac, who had earned a share of the 1933 Nobel Prize in Physics for his work in quantum mechanics. Inspired, Higgs took up physics at King’s College London, where he obtained a PhD in 1954.
As a hitch-hiking student, Higgs had discovered a liking for Edinburgh, so in 1960 he was happy to be appointed as a lecturer there. He picked up on a long-standing interest in symmetry in subatomic particle physics, inspired in particular by the work of the future Nobel prizewinner Yoichiro Nambu, a Japanese American physicist then at the University of Chicago in Illinois. In physics, symmetry is linked to the conservation of quantities such as energy, momentum and electric charge. Working on a theory that had an underlying symmetry but in which particles had no mass, Nambu was attempting to generate mass through a mechanism known as spontaneous symmetry breaking. However, such symmetry breaking would also produce massless particles with zero spin, for which there was no evidence.
This seemed a dead end, but in 1964, Higgs realized that it was possible to get round the difficulty using gauge theory, which has the kind of symmetry found, for example, in the established theory of electromagnetism. Higgs showed that the massless particles associated with spontaneous symmetry breaking become ‘absorbed’ into massive particles. He published two short papers on this theme1,2, the second of which explicitly predicted a massive spin-zero particle. It fell to other physicists to realize that this mechanism for spontaneous symmetry breaking was key to a mathematically coherent gauge theory of particle physics that unites the electromagnetic interactions between particles with the weak interactions involved in certain forms of radioactivity. This Nobel-prizewinning ‘electro-weak theory’ became one of the twin pillars of the standard model of particle physics, and was well established by experiments by the 1990s.
The field had seemed something of a scientific backwater in the 1960s, but Higgs was not working alone. Two other papers were published in 1964 on the mechanism3,4, one appearing just before his own. But only Higgs drew attention to the associated massive spin‑zero particle, which in the 1970s began to be called the Higgs boson. The catchy name stuck.
Around this time, after his marriage broke down, Higgs found he was losing his way in theoretical particle physics. In addition to teaching, he became involved in the union side of university life and wrote few physics papers. Nevertheless, as appreciation of his work on spontaneous symmetry breaking grew, he was increasingly asked to give talks, a popular title being ‘My life as a boson’. He retired in 1996.
The award of the Nobel prize to Higgs and Belgian physicist François Englert — the surviving author of the paper published just before Higgs’s in 1964 — came as little surprise to anyone, including Higgs, because it had been mooted since the 1980s. However, interest in the Higgs boson had increased over the years, not only among particle physicists but also among the general public and the media. It reached fever pitch after the construction of CERN’s Large Hadron Collider, which was billed by many as the machine that would discover the last missing piece of the standard model. This intense interest brought fame at a level no one could have imagined in the 1960s, and the quiet physicist became a media star, before retiring for a second time when he reached 85.
A keen music lover, Higgs had little interest in the trappings of modern technology — he famously had no television and did not use the Internet. Yet he was far from being remote from the world. He had a strong social conscience and was a member of the Campaign for Nuclear Disarmament and Greenpeace at various times. Humble in many respects, with an infectious sense of humour, he was proud of the work that he had always known was important and which ultimately brought him fame."""

Download the model from spaCy.

In [1]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)

Set up a text processor and load the provided news article.

In [3]:
import spacy

In [4]:
nlp = spacy.load('en_core_web_lg')
doc = nlp(text)
doc

During a few weeks in the summer of 1964, Peter Higgs, a theoretical physicist at the University of Edinburgh, UK, wrote two short papers outlining his ideas for a mechanism that could give mass to fundamental particles, the building blocks of the Universe. His aim was to rescue a theory that was mathematically appealing but ultimately unrealistic because the particles it described had no mass. The second paper drew attention to a measurable consequence of his proposal — it predicted the existence of a new massive particle. Nearly half a century later, the discovery of the predicted particle brought Higgs, who has died aged 94, a share of the 2013 Nobel Prize in Physics.
The mechanism Higgs described became a key component of the standard model of particle physics during the 1970s, but the associated particle remained stubbornly elusive. Then, in 2012, two giant experiments run by more than 6,000 physicists at CERN, Europe’s high-energy physics laboratory near Geneva, Switzerland, disc

#### Most common words

In [5]:

# Build a list of token texts, skipping punctuation
# (stop words and whitespace tokens such as '\n' are kept)
words = [token.text               # Text of each token
         for token in doc         # Iterate over the tokens
         if not token.is_punct]   # Drop punctuation only
print(words)


['During', 'a', 'few', 'weeks', 'in', 'the', 'summer', 'of', '1964', 'Peter', 'Higgs', 'a', 'theoretical', 'physicist', 'at', 'the', 'University', 'of', 'Edinburgh', 'UK', 'wrote', 'two', 'short', 'papers', 'outlining', 'his', 'ideas', 'for', 'a', 'mechanism', 'that', 'could', 'give', 'mass', 'to', 'fundamental', 'particles', 'the', 'building', 'blocks', 'of', 'the', 'Universe', 'His', 'aim', 'was', 'to', 'rescue', 'a', 'theory', 'that', 'was', 'mathematically', 'appealing', 'but', 'ultimately', 'unrealistic', 'because', 'the', 'particles', 'it', 'described', 'had', 'no', 'mass', 'The', 'second', 'paper', 'drew', 'attention', 'to', 'a', 'measurable', 'consequence', 'of', 'his', 'proposal', 'it', 'predicted', 'the', 'existence', 'of', 'a', 'new', 'massive', 'particle', 'Nearly', 'half', 'a', 'century', 'later', 'the', 'discovery', 'of', 'the', 'predicted', 'particle', 'brought', 'Higgs', 'who', 'has', 'died', 'aged', '94', 'a', 'share', 'of', 'the', '2013', 'Nobel', 'Prize', 'in', 'Phys

In [6]:
# Find the most common words
from collections import Counter

# Create a Counter object
word_freq = Counter(words)
# Get the most common words
most_common_words = word_freq.most_common()
print(most_common_words)

[('the', 60), ('of', 34), ('a', 31), ('in', 29), ('Higgs', 20), ('to', 19), ('was', 17), ('had', 13), ('particle', 12), ('he', 11), ('as', 10), ('and', 10), ('for', 9), ('that', 9), ('physics', 9), ('symmetry', 9), ('his', 8), ('\n', 8), ('particles', 7), ('but', 7), ('which', 7), ('at', 6), ('theory', 6), ('no', 6), ('on', 6), ('it', 5), ('The', 5), ('Nobel', 5), ('by', 5), ('’s', 5), ('with', 5), ('boson', 5), ('work', 5), ('breaking', 5), ('1964', 4), ('physicist', 4), ('papers', 4), ('mechanism', 4), ('mass', 4), ('massive', 4), ('became', 4), ('He', 4), ('interest', 4), ('spontaneous', 4), ('two', 3), ('second', 3), ('predicted', 3), ('brought', 3), ('key', 3), ('standard', 3), ('model', 3), ('associated', 3), ('physicists', 3), ('fame', 3), ('would', 3), ('spin', 3), ('This', 3), ('published', 3), ('this', 3), ('one', 3), ('not', 3), ('before', 3), ('few', 2), ('theoretical', 2), ('University', 2), ('Edinburgh', 2), ('wrote', 2), ('short', 2), ('could', 2), ('give', 2), ('mathema
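
The counts above are dominated by function words such as 'the' and 'of', and they also include the newline token '\n'. A minimal sketch of a stricter filter follows, assuming that lowercasing and dropping stop words, punctuation, and whitespace is what "most common words" should mean here. The helper name `content_word_counts` is hypothetical; it relies only on the spaCy token attributes `is_stop`, `is_punct`, and `is_space`.

```python
from collections import Counter

def content_word_counts(tokens, top_n=10):
    """Return the top_n most frequent words, ignoring stop words,
    punctuation, and whitespace tokens (hypothetical helper)."""
    words = [
        token.text.lower()   # fold case so 'The' and 'the' count together
        for token in tokens
        if not (token.is_stop or token.is_punct or token.is_space)
    ]
    return Counter(words).most_common(top_n)
```

With the notebook's `doc`, this could be called as `content_word_counts(doc, top_n=20)`.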

#### Most common words under a part of speech

In [7]:
# defaultdict gives each part-of-speech tag its own Counter on first use
from collections import defaultdict

In [8]:
# Map each part-of-speech tag to a Counter of the token texts carrying it
count_pos = defaultdict(Counter)
for token in doc:
    count_pos[token.pos_][token.text] += 1

# Print each tag followed by its words, most frequent first
for pos in count_pos:
    print(pos)
    print(count_pos[pos].most_common())
    print("")


ADP
[('of', 34), ('in', 29), ('to', 12), ('for', 9), ('as', 8), ('at', 6), ('on', 6), ('by', 5), ('with', 5), ('up', 2), ('In', 2), ('before', 2), ('among', 2), ('from', 2), ('During', 1), ('during', 1), ('than', 1), ('near', 1), ('By', 1), ('As', 1), ('through', 1), ('round', 1), ('into', 1), ('between', 1), ('Around', 1), ('down', 1), ('over', 1), ('after', 1)]

DET
[('the', 60), ('a', 31), ('no', 6), ('The', 5), ('this', 3), ('an', 2), ('This', 2), ('half', 1), ('A', 1)]

ADJ
[('massive', 4), ('spontaneous', 4), ('key', 3), ('standard', 3), ('few', 2), ('theoretical', 2), ('short', 2), ('second', 2), ('associated', 2), ('such', 2), ('massless', 2), ('other', 2), ('weak', 2), ('little', 2), ('many', 2), ('fundamental', 1), ('appealing', 1), ('unrealistic', 1), ('measurable', 1), ('new', 1), ('aged', 1), ('elusive', 1), ('giant', 1), ('more', 1), ('high', 1), ('appropriate', 1), ('scalar', 1), ('intrinsic', 1), ('sound', 1), ('several', 1), ('previous', 1), ('quantum', 1), ('happy', 1
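
The task statement asks for output shaped like `NOUN: {'Bob': 12, 'Alice': 4}`, whereas the cell above prints full lists that treat 'The' and 'the' as different words. As a rough sketch, the same counting could return a dictionary per tag, capped at a few entries. The name `top_words_by_pos` and the choice to lowercase and skip punctuation and whitespace are assumptions, not part of the original notebook.

```python
from collections import Counter, defaultdict

def top_words_by_pos(tokens, top_n=5):
    """Map each coarse POS tag to a dict of its top_n lowercased words
    and their counts (hypothetical helper)."""
    counts = defaultdict(Counter)
    for token in tokens:
        if token.is_punct or token.is_space:
            continue                                  # not words, skip them
        counts[token.pos_][token.text.lower()] += 1
    return {pos: dict(counter.most_common(top_n))
            for pos, counter in counts.items()}
```

For example, `top_words_by_pos(doc)["NOUN"]` would give the most frequent nouns as a dictionary.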

#### Find a subject/object relationship through the dependency parser in any sentence.

In [9]:
def pr_tree(word, level):
    """Walk the dependency tree in order, printing subjects and direct objects."""
    if word.is_punct:
        return

    # Visit left children first so the output follows sentence order
    for child in word.lefts:
        pr_tree(child, level+1)
    # Print only subjects and direct objects, indented by tree depth
    if word.dep_ in ('nsubj', 'dobj'):
        print('\t'* level + word.text + ' - ' + word.dep_)
    for child in word.rights:
        pr_tree(child, level+1)

In [10]:
for sentence in doc.sents:
    pr_tree(sentence.root, 0)
    print('-------------------------------------------')

	Higgs - nsubj
	papers - dobj
			ideas - dobj
						that - nsubj
						mass - dobj
-------------------------------------------
	aim - nsubj
		theory - dobj
				that - nsubj
			particles - nsubj
					it - nsubj
			mass - dobj
-------------------------------------------
		paper - nsubj
		attention - dobj
	it - nsubj
	existence - dobj
-------------------------------------------
	discovery - nsubj
	Higgs - dobj
			who - nsubj
				94 - dobj
-------------------------------------------
	Higgs - nsubj
		particle - nsubj
-------------------------------------------
	experiments - nsubj
	something - dobj
-------------------------------------------
	particle - nsubj
	fame - dobj
		Higgs - nsubj
								spin - dobj
-------------------------------------------
		work - nsubj
		family - dobj
					Higgs - nsubj
					School - dobj
-------------------------------------------
	he - nsubj
	mentions - dobj
						who - nsubj
						share - dobj
-------------------------------------------
	Higgs - nsubj
	physi
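
The tree printout above lists subjects and objects separately, without the verb that connects them. One possible way to make the relationship explicit is to pair each `nsubj` token with the `dobj` children of the same head. The sketch below assumes the `nsubj` and `dobj` labels seen in the output above; the helper name `subject_verb_object` is hypothetical.

```python
def subject_verb_object(tokens):
    """Collect (subject, verb, object) text triples from parsed tokens
    (hypothetical helper)."""
    triples = []
    for token in tokens:
        if token.dep_ != "nsubj":
            continue
        verb = token.head                    # the subject attaches to its governing verb
        for child in verb.children:
            if child.dep_ == "dobj":         # direct object of that same verb
                triples.append((token.text, verb.text, child.text))
    return triples
```

It can be run per sentence, for instance `for sentence in doc.sents: print(subject_verb_object(sentence))`. Verbs with a subject but no direct object produce no triple.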

#### Show the most common Entities and their types. 

In [11]:

# Map each entity label to a Counter of the entity texts carrying it
count_ent = defaultdict(Counter)
for entity in doc.ents:
    count_ent[entity.label_][entity.text] += 1

# Print each label followed by its entities, most frequent first
for label in count_ent:
    print(label)
    print(count_ent[label].most_common())
    print("")


DATE
[('1964', 3), ('the 1970s', 2), ('the 1960s', 2), ('a few weeks', 1), ('the summer of 1964', 1), ('Nearly half a century later', 1), ('aged 94', 1), ('2013', 1), ('2012', 1), ('1929', 1), ('1933', 1), ('1954', 1), ('1960', 1), ('the 1990s', 1), ('1996', 1), ('the 1980s', 1), ('the years', 1)]

PERSON
[('Higgs', 18), ('Peter Higgs', 1), ('Paul Dirac', 1), ('Yoichiro Nambu', 1), ('François Englert', 1)]

ORG
[('CERN', 2), ('the University of Edinburgh', 1), ('Universe', 1), ('BBC', 1), ('Bristol', 1), ('Cotham Grammar School', 1), ('King’s College London', 1), ('the University of Chicago', 1), ('the Campaign for Nuclear Disarmament and Greenpeace', 1)]

GPE
[('UK', 1), ('Geneva', 1), ('Switzerland', 1), ('Newcastle', 1), ('Tyne', 1), ('Edinburgh', 1), ('Illinois', 1), ('Nambu', 1)]

CARDINAL
[('two', 3), ('zero', 2), ('more than 6,000', 1), ('Two', 1), ('one', 1), ('spin‑zero', 1), ('85', 1)]

ORDINAL
[('second', 3)]

WORK_OF_ART
[('Nobel Prize in Physics', 2), ('Nobel', 2), ('PhD',
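
The grouping above ranks entities within each type. If a single ranking across all types is wanted instead, counting `(text, label)` pairs in one `Counter` is one option. This is only a sketch, and `top_entities` is a hypothetical name.

```python
from collections import Counter

def top_entities(ents, top_n=10):
    """Rank (entity text, entity label) pairs by frequency across all
    entity types (hypothetical helper)."""
    counts = Counter((ent.text, ent.label_) for ent in ents)
    return [(text, label, n) for (text, label), n in counts.most_common(top_n)]
```

Called as `top_entities(doc.ents)`, it returns tuples of entity text, label, and count.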

#### Find Entities and their dependency (hint: entity.root.head)

In [12]:
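# entity.root is the token that heads the entity span (Span.root), and
# entity.root.head is that token's parent in the dependency tree (Token.head).

# Print each entity, its label, and the dependency label of its root token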
for entity in doc.ents:
    print(entity.text, entity.label_, entity.root.dep_)


a few weeks DATE pobj
the summer of 1964 DATE pobj
Peter Higgs PERSON nsubj
the University of Edinburgh ORG pobj
UK GPE appos
two CARDINAL nummod
Universe ORG pobj
second ORDINAL amod
Nearly half a century later DATE advmod
Higgs PERSON dobj
aged 94 DATE advcl
2013 DATE nummod
Nobel Prize in Physics WORK_OF_ART pobj
Higgs PERSON nsubj
the 1970s DATE pobj
2012 DATE pobj
two CARDINAL nummod
more than 6,000 CARDINAL nummod
CERN ORG pobj
Europe LOC poss
Geneva GPE pobj
Switzerland GPE appos
Higgs PERSON compound
Higgs PERSON nsubj
Higgs PERSON nsubjpass
Newcastle GPE pobj
Tyne GPE pobj
1929 DATE pobj
BBC ORG pobj
Bristol ORG pobj
Higgs PERSON nsubj
Cotham Grammar School ORG dobj
Paul Dirac PERSON appos
1933 DATE nummod
Nobel Prize in Physics WORK_OF_ART pobj
Higgs PERSON nsubj
King’s College London ORG pobj
PhD WORK_OF_ART dobj
1954 DATE pobj
Higgs PERSON nsubj
Edinburgh GPE pobj
1960 DATE pobj
Nobel WORK_OF_ART compound
Yoichiro Nambu PERSON appos
Japanese American NORP amod
the Universit
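
The hint in the task points at `entity.root.head`, which the loop above does not print. A small sketch that adds the head word to each row follows. `Span.root` and `Token.head` are existing spaCy attributes; the helper name `entity_dependencies` is hypothetical.

```python
def entity_dependencies(ents):
    """Return (entity text, label, dependency label, head text) for each
    entity span (hypothetical helper)."""
    rows = []
    for ent in ents:
        root = ent.root   # token that links the span to the rest of the sentence
        rows.append((ent.text, ent.label_, root.dep_, root.head.text))
    return rows
```

For example, `for row in entity_dependencies(doc.ents): print(*row)` prints the same columns as above plus the word each entity depends on.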

#### Find the most similar words in the article

In [13]:
# Get the word vectors for each token in the text
word_vectors = [token.vector for token in doc]

# Score each pair of tokens by the dot product of their vectors
# (not normalized, so scores depend on vector length and are not bounded by 1)
similarities = []
# Loop through all of the tokens in the text
for i in range(len(word_vectors)):
    # Loop through all of the tokens that come after token i
    for j in range(i+1, len(word_vectors)):
        # Dot product of word vectors i and j
        similarity = word_vectors[i].dot(word_vectors[j])

        # Skip pairs that share a lemma
        if doc[i].lemma_ != doc[j].lemma_:
            # Append the word pair and its score; repeated words yield repeated pairs
            similarities.append((doc[i].text, doc[j].text, similarity))

# Sort the similarities in descending order
similarities.sort(key=lambda x: x[2], reverse=True)


In [14]:

# Print the 20 highest-scoring word pairs
for word1, word2, similarity in similarities[:20]:
    print(f"{word1} - {word2}: {similarity}")


By - In: 10500.205078125
By - In: 10500.205078125
As - In: 6958.2001953125
As - In: 6958.2001953125
He - It: 6924.5810546875
He - It: 6924.5810546875
He - It: 6924.5810546875
He - It: 6924.5810546875
It - He: 6924.5810546875
It - He: 6924.5810546875
He - It: 6924.5810546875
It - He: 6924.5810546875
His - My: 6719.4404296875
his - he: 6638.3701171875
his - he: 6638.3701171875
his - he: 6638.3701171875
his - he: 6638.3701171875
his - he: 6638.3701171875
his - he: 6638.3701171875
his - he: 6638.3701171875
