# NLP
Find your favorite news source and grab the article text. 

1. Show the most common words in the article.
2. Show the most common words under each part of speech (e.g. NOUN: {'Bob': 12, 'Alice': 4}).
3. Find a subject/object relationship through the dependency parser in any sentence.
4. Show the most common entities and their types.
5. Find entities and their dependencies (hint: entity.root.head).
6. Find the most similar words in the article.

Note: Yes, the notebook from the video is not provided. I leave it to you to make your own :) It's your final assignment for the semester. Enjoy!
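
The shape that task 2 asks for, a mapping from each part-of-speech tag to its word counts, can be sketched in plain Python as below. This is only a minimal sketch: `count_by_pos` is a hypothetical helper name, and the sample pairs are hand-written for illustration rather than real model output.

```python
from collections import Counter, defaultdict


def count_by_pos(tagged_words, top_n=3):
    """Group word counts by part-of-speech tag.

    tagged_words: iterable of (word, pos_tag) pairs.
    Returns {pos_tag: {word: count}} with the top_n words kept per tag.
    """
    counters = defaultdict(Counter)
    for word, pos in tagged_words:
        counters[pos][word] += 1
    return {pos: dict(counter.most_common(top_n))
            for pos, counter in counters.items()}


# Hand-written sample pairs (illustrative only, not real model output).
sample = [("Bob", "PROPN"), ("Alice", "PROPN"), ("Bob", "PROPN"),
          ("medal", "NOUN"), ("medal", "NOUN"), ("race", "NOUN")]
result = count_by_pos(sample)
print(result)
```

With a parsed spaCy `doc`, the pairs could be supplied as `count_by_pos((t.text, t.pos_) for t in doc if t.is_alpha and not t.is_stop)`, which also keeps stop words such as "the" out of the tallies.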

In [1]:
# Saving article text
article = """American superstar Katie Ledecky equalled the record for the most gold medals by a female Olympian as she won the 800m freestyle title at the Paris Games. Ledecky clocked eight minutes 11.04 seconds to become the only woman - and only swimmer other than the great Michael Phelps - with four Olympic golds in the same event. It was Ledecky's ninth Olympic gold, moving her level with former Soviet gymnast Larisa Latynina, and taking her overall tally to 14 medals. Phelps has the most medals of any Olympian with 28, including 23 golds. "The four-times record is the one that means the most to me," Ledecky, 27, said afterwards. "3 August is the day I won in 2012, and I didn't want 3 August to be a day I didn't like moving forwards. "I put a lot of pressure on myself, so I'm happy I got the job done." Earlier on Saturday, Summer McIntosh's astonishing debut Games continued, with the Canadian 17-year-old securing her third gold with victory in the women's 200m individual medley. But Great Britain's 4x100m medley relay defence ended in disappointment, with the quartet finishing seventh. Ledecky's dominance over distance continues. Ledecky has won four medals in Paris alone - two golds, a silver and a bronze. She became the United States' most decorated female Olympian with silver in the women's 4x200m relay on Thursday. Such is her dominance in the 800m freestyle that she has lost just once over the distance in 13 years - and that was to rising star McIntosh at a regional meet earlier in 2024. McIntosh opted not to swim the 800m in Paris, meaning Ledecky's biggest rival was old foe Ariarne Titmus. Australia's Titmus beat Ledecky to 400m freestyle gold earlier in the week but she could not stay with the American in the closing stages of her favourite distance. The two shared a warm moment at the end of the race, with Ledecky raising both their arms in the air before Titmus applauded her opponent as she left the arena. 
"We have just seen a little bit of history there," Steve Parry, Olympic bronze medallist for Britain in 2004, said on BBC 5 Live. "Ledecky is the absolute queen of the pool. To be able to see someone dominate a distance event for 13 years is absolutely brilliant." Titmus took silver in 8:12.29, with Ledecky's American team-mate Paige Madden (8:13.00) completing the podium."""

In [2]:
# Printing article
print(article)

American superstar Katie Ledecky equalled the record for the most gold medals by a female Olympian as she won the 800m freestyle title at the Paris Games. Ledecky clocked eight minutes 11.04 seconds to become the only woman - and only swimmer other than the great Michael Phelps - with four Olympic golds in the same event. It was Ledecky's ninth Olympic gold, moving her level with former Soviet gymnast Larisa Latynina, and taking her overall tally to 14 medals. Phelps has the most medals of any Olympian with 28, including 23 golds. "The four-times record is the one that means the most to me," Ledecky, 27, said afterwards. "3 August is the day I won in 2012, and I didn't want 3 August to be a day I didn't like moving forwards. "I put a lot of pressure on myself, so I'm happy I got the job done." Earlier on Saturday, Summer McIntosh's astonishing debut Games continued, with the Canadian 17-year-old securing her third gold with victory in the women's 200m individual medley. But Great Brita

In [3]:
# Installing packages
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md


# Importing libraries
import spacy
from collections import Counter

# Load the English model
nlp = spacy.load("en_core_web_sm")


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     --------------------------------------- 12.8/12.8 MB 40.9 MB/s eta 0:00:00
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     --------------------------------------- 42.8/42.8 MB 32.7 MB/s eta 0:00:00
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [6]:
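# Processing the text
# (if the cells are run top to bottom, nlp is still the small model here)
doc = nlp(article)

# Load the medium English model, which ships with word vectors
# Note: this rebinds nlp; the doc above has already been parsed
nlp = spacy.load("en_core_web_md")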

# Verify model name and word vectors
print(nlp.meta['name'])
print("Vectors loaded:", nlp.vocab.vectors_length > 0)


# Most common words
words = [token.text for token in doc if token.is_alpha]
word_freq = Counter(words)
most_common_words = word_freq.most_common(10)
print("Most common words:", most_common_words)

# Most common nouns
nouns = [token.text for token in doc if token.pos_ == "NOUN"]
noun_freq = Counter(nouns)
most_common_nouns = noun_freq.most_common(10)
print("Most common nouns:", most_common_nouns)

# Subject/Object relationships
print("\nSubject/Object relationships:")
for token in doc:
    if token.dep_ in ("nsubj", "dobj"):
        print(f"Token: {token.text}, Dep: {token.dep_}, Head: {token.head.text}")

# Most common entities and their types
entities = [(ent.text, ent.label_) for ent in doc.ents]
entity_freq = Counter(entities)
most_common_entities = entity_freq.most_common(10)
print("\nMost common entities:", most_common_entities)

# Entities and their dependency
print("\nEntities and their dependency:")
for ent in doc.ents:
    print(f"Entity: {ent.text}, Root Head: {ent.root.head.text}")

core_web_md
Vectors loaded: True
Most common words: [('the', 30), ('in', 15), ('Ledecky', 11), ('with', 10), ('a', 9), ('to', 8), ('m', 7), ('her', 6), ('of', 6), ('I', 6)]
Most common nouns: [('gold', 4), ('medals', 4), ('distance', 4), ('freestyle', 3), ('golds', 3), ('silver', 3), ('m', 3), ('record', 2), ('event', 2), ('day', 2)]

Subject/Object relationships:
Token: Ledecky, Dep: nsubj, Head: equalled
Token: record, Dep: dobj, Head: equalled
Token: she, Dep: nsubj, Head: won
Token: title, Dep: dobj, Head: won
Token: Ledecky, Dep: nsubj, Head: clocked
Token: It, Dep: nsubj, Head: was
Token: level, Dep: dobj, Head: moving
Token: her, Dep: dobj, Head: taking
Token: tally, Dep: dobj, Head: taking
Token: Phelps, Dep: nsubj, Head: has
Token: medals, Dep: dobj, Head: has
Token: record, Dep: nsubj, Head: is
Token: that, Dep: nsubj, Head: means
Token: most, Dep: dobj, Head: means
Token: Ledecky, Dep: nsubj, Head: said
Token: August, Dep: nsubj, Head: is
Token: I, Dep: nsubj, Head: won
Toke
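
The cell above lists subjects and objects one token at a time. To turn them into explicit subject/verb/object relationships, tokens that share the same head can be paired up. The following is a minimal sketch under a few assumptions: `subject_verb_object` is a hypothetical helper, it only looks at the `nsubj` and `dobj` labels used above (so passive subjects and conjoined verbs are ignored), and the stand-in tokens carry placeholder indices instead of coming from a real parse.

```python
from types import SimpleNamespace


def subject_verb_object(tokens):
    """Return (subject, verb, object) triples for heads that have both."""
    by_head = {}
    for tok in tokens:
        if tok.dep_ in ("nsubj", "dobj"):
            # Key on the head's index so repeated verbs stay separate.
            entry = by_head.setdefault(tok.head.i, {"verb": tok.head.text})
            entry.setdefault(tok.dep_, tok.text)
    return [(e["nsubj"], e["verb"], e["dobj"])
            for e in by_head.values()
            if "nsubj" in e and "dobj" in e]


def fake_token(text, dep, head):
    """Stand-in exposing only the attributes the helper reads."""
    return SimpleNamespace(text=text, dep_=dep, head=head)


# Heads taken from the first lines of the output above; indices are placeholders.
equalled = SimpleNamespace(text="equalled", i=4)
won = SimpleNamespace(text="won", i=18)
clocked = SimpleNamespace(text="clocked", i=30)

tokens = [
    fake_token("Ledecky", "nsubj", equalled),
    fake_token("record", "dobj", equalled),
    fake_token("she", "nsubj", won),
    fake_token("title", "dobj", won),
    fake_token("Ledecky", "nsubj", clocked),  # no dobj, so no triple
]
triples = subject_verb_object(tokens)
print(triples)
# [('Ledecky', 'equalled', 'record'), ('she', 'won', 'title')]
```

Because spaCy tokens expose `text`, `dep_`, `head` and `i`, the same helper should accept the parsed document directly, as in `subject_verb_object(doc)`.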

In [17]:
!python -m spacy download en_core_web_lg


Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     -------------------------------------- 587.7/587.7 MB 2.6 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [20]:
# Loading larger model
nlp = spacy.load('en_core_web_lg')

doc = nlp(article)

In [21]:
# Extracting unique words
unique_words = list(set([token.text for token in doc if token.is_alpha]))

# Creating function to find most similar words
def get_most_similar(word, topn=5):
    # Create a single token Doc object for the input word
    token_word = nlp(word)[0]
    if not token_word.has_vector:
        return []  # Skip words without vectors
    
    # Calculate similarity for each token in the document
    similarities = [(token.text, token.similarity(token_word)) 
                    for token in doc if token.has_vector]
    similarities = sorted(similarities, key=lambda item: -item[1])
    return similarities[:topn]

# Finding similar words for each unique word
similar_words = {}
for word in unique_words:
    try:
        similar_words[word] = get_most_similar(word)
    except KeyError:
        # Handle case where word may not be in the vocabulary
        similar_words[word] = []

# Printing the most similar words
print("\nMost similar words:")
for word, similarities in similar_words.items():
    print(f"{word}: {similarities}")


Most similar words:
Parry: [('Parry', 1.0), ('McIntosh', 0.6216792464256287), ('McIntosh', 0.6216792464256287), ('McIntosh', 0.6216792464256287), ('Madden', 0.49364355206489563)]
arena: [('arena', 1.0), ('event', 0.4486501216888428), ('event', 0.4486501216888428), ('team', 0.36738309264183044), ('pool', 0.3580174446105957)]
mate: [('mate', 1.0), ('someone', 0.3582513630390167), ('opponent', 0.33517709374427795), ('woman', 0.31376540660858154), ('swimmer', 0.29381030797958374)]
seen: [('seen', 1.0), ('done', 0.5543443560600281), ('see', 0.5273446440696716), ('equalled', 0.4409928321838379), ('shared', 0.43655744194984436)]
of: [('of', 1.0), ('of', 1.0), ('of', 1.0), ('of', 1.0), ('of', 1.0)]
absolute: [('absolute', 1.0), ('absolutely', 0.7233128547668457), ('astonishing', 0.5538218021392822), ('means', 0.5067089796066284), ('that', 0.5039765238761902)]
superstar: [('superstar', 1.0), ('star', 0.6966253519058228), ('brilliant', 0.41771119832992554), ('rival', 0.3953743278980255), ('team
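
In the output above, every word is its own best match (similarity 1.0), and words that occur several times in the article fill several slots (for example `McIntosh` three times under `Parry`). One possible refinement is to drop the query word and keep a single score per distinct word before slicing. The sketch below assumes a hypothetical helper named `top_unique` and reuses the `Parry` scores printed above as sample input.

```python
def top_unique(query, scored, topn=5):
    """Keep the best score per distinct word, excluding the query itself."""
    best = {}
    for text, score in scored:
        if text.lower() == query.lower():
            continue  # skip the trivial self-match
        if text not in best or score > best[text]:
            best[text] = score
    # Highest similarity first.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:topn]


# Scores copied from the "Parry" line of the output above.
parry_scores = [
    ("Parry", 1.0),
    ("McIntosh", 0.6216792464256287),
    ("McIntosh", 0.6216792464256287),
    ("McIntosh", 0.6216792464256287),
    ("Madden", 0.49364355206489563),
]
unique_matches = top_unique("Parry", parry_scores)
print(unique_matches)
# [('McIntosh', 0.6216792464256287), ('Madden', 0.49364355206489563)]
```

Applied to an already truncated top-5 list, this can return fewer than `topn` entries. Inside `get_most_similar`, it would make more sense to pass the full `similarities` list before it is sliced.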