# Natural Language Processing

---

## *Section 1: N-Grams*

**Concept:** N-grams are contiguous sequences of "n" items (typically
words) from a given sample of text. Commonly used n-grams are unigrams
(n=1), bigrams (n=2), and trigrams (n=3). N-grams are foundational in NLP, especially in
language modeling, text generation, and context-based analysis.

**How it works:**
To extract n-grams from a text, a sliding window of size `n` moves across the tokenized text. Each time the window moves, it captures a new set of `n` adjacent tokens. Frequency distributions of these n-grams can then be computed to identify common patterns or sequences.
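The sliding window described above can be sketched in plain Python. This is a minimal illustration, not how `CountVectorizer` is implemented: the helper name `extract_ngrams` and the whitespace tokenization are assumptions made to keep the example short.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    # The window starts at each index i where a full n-token span still fits.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Naive whitespace tokenization, used only to keep the sketch short.
tokens = "the cat sat on the mat and the cat slept".split()

bigrams = extract_ngrams(tokens, 2)
bigram_counts = Counter(bigrams)  # frequency distribution over bigrams

print(bigrams[:3])
print(bigram_counts.most_common(1))
```

A text with fewer than `n` tokens yields an empty list, because no full window fits.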

**Demonstration:**

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Download (on first use) and load the full 20 Newsgroups corpus.
newsgroups = fetch_20newsgroups(subset='all')
texts = newsgroups.data

# Count bigrams only, keeping the 20 most frequent across the sample.
vectorizer = CountVectorizer(ngram_range=(2, 2), max_features=20)
X = vectorizer.fit_transform(texts[:1000])
bigrams = vectorizer.get_feature_names_out()

# Sum each bigram's column to get its total count over the 1000 documents.
plt.barh(bigrams, X.toarray().sum(axis=0))
plt.xlabel("Frequency")
plt.title("Top Bigrams")
plt.tight_layout()
plt.show()

**Exercise 1:** The previous visualization is of limited use because stop words dominate the results. Remove the stop words and visualize the top 20 trigrams from the first 1000 documents.

In [None]:
# your code here
raise NotImplementedError

**Exercise 2:** Compare the frequency distribution of bigrams between the newsgroup categories `sci.space` and `rec.sport.hockey`. (See the LDA demo code for how categories are defined.)

In [None]:
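# your code here: define `cats` (the two category names), `labels` (a NumPy
# array holding each document's category index), `vectorizer`, and `X`
# (bigram counts with max_features=20), then remove the line below
raise NotImplementedError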
plt.figure(figsize=(10, 5))
for i, cat in enumerate(cats):
    mask = labels == i
    cat_counts = X[mask].sum(axis=0).A1
    plt.bar([x + i * 0.3 for x in range(20)], cat_counts, width=0.3, label=cat)

plt.xticks(range(20), vectorizer.get_feature_names_out(), rotation=90)
plt.legend()
plt.title("Top Bigrams by Category")
plt.tight_layout()
plt.show()

---

## *Section 2: Named Entity Recognition (NER)*

**Concept:** NER locates and classifies entities in text into predefined categories like persons, organizations, locations, dates, etc. Entities provide structure and semantic understanding to raw text, enabling more meaningful search, summarization, and recommendation systems.

**How it works:**
NER uses pre-trained models that rely on statistical or neural network-based approaches to detect entities. Tokens are passed through a model which assigns them a label such as `PERSON`, `ORG`, or `DATE`. These models often use contextual embeddings to make predictions. The demo here uses statistical models, but the interpretation remains the same.
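To make the per-token labeling concrete, the sketch below groups token-level tags into entity spans. It assumes the common IOB tagging scheme (`B-` begins an entity, `I-` continues it, `O` is outside any entity), which the text above does not specify. The sentence and its tags are written by hand for illustration and are not the output of any model, and `decode_entities` is a hypothetical helper.

```python
def decode_entities(tokens, tags):
    """Group per-token IOB tags into (entity_text, label) pairs."""
    entities = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        # An I- tag extends the open entity only if the labels match.
        if tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
            continue
        # Any other tag closes the entity that was open, if there was one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        # A B- tag opens a new entity; O and stray I- tags are skipped.
        if tag.startswith("B-"):
            current_tokens, current_label = [token], tag[2:]
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities

# Hand-labeled, fictional example sentence.
tokens = ["Jane", "Doe", "joined", "Acme", "Corp", "in", "2020"]
tags = ["B-PERSON", "I-PERSON", "O", "B-ORG", "I-ORG", "O", "B-DATE"]

entities = decode_entities(tokens, tags)
print(entities)
```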

**Demonstration:**

In [None]:
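import spacy
from spacy import displacy  # import the visualizer submodule explicitly

# Load the small English pipeline (must be downloaded beforehand).
nlp = spacy.load("en_core_web_sm")
doc = nlp(texts[0])

# Highlight the detected entities inline in the notebook.
displacy.render(doc, style="ent", jupyter=True)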

**Exercise 1:** Extract all unique organizations from the first 100
documents.

In [None]:
orgs = set()
# your code here
raise NotImplementedError
print(orgs)

**Exercise 2:** Count and visualize the frequency of each named entity
label in the first 200 documents.

In [None]:
# your code here
raise NotImplementedError

---

## *Section 3: Dependency Parsing*

**Concept:** Dependency parsing uncovers the grammatical structure of a sentence and the relationships between its words, identifying subjects, objects, and modifiers. Parsing is key to understanding sentence-level meaning and grammar in complex NLP applications.

**How it works:**
Dependency parsers analyze the grammatical structure of a sentence by identifying the "head" word of each word in the sentence and labeling the relationships (dependencies) between them. This generates a tree structure representing grammatical roles such as `nsubj` (nominal subject) or `dobj` (direct object).
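The tree structure can be illustrated without running a parser. In the sketch below, each token stores the index of its head and a dependency label. The annotations are written by hand for a short example sentence and are not parser output. Marking the root with a head of `None` is a simplification chosen for this example, and `children_of` and `format_tree` are hypothetical helpers.

```python
# Each entry: (token text, index of its head token, dependency label).
# Hand-annotated for illustration; the root has no head (None).
parse = [
    ("The", 1, "det"),
    ("cat", 2, "nsubj"),
    ("chased", None, "ROOT"),
    ("a", 4, "det"),
    ("mouse", 2, "dobj"),
]

def children_of(parse, head_index):
    """Return the indices of all tokens whose head is head_index."""
    return [i for i, (_, head, _) in enumerate(parse) if head == head_index]

def format_tree(parse, index, depth=0):
    """Render the subtree rooted at index as indented lines."""
    text, _, label = parse[index]
    lines = ["  " * depth + f"{text} ({label})"]
    for child in children_of(parse, index):
        lines.extend(format_tree(parse, child, depth + 1))
    return lines

# The root is the single token without a head.
root = next(i for i, (_, head, _) in enumerate(parse) if head is None)
tree_lines = format_tree(parse, root)
print("\n".join(tree_lines))
```

Reading the output top-down shows the verb as the root, with the subject and object attached to it and each determiner attached to its noun.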

**Applications:**

-   Grammar checking tools
-   Semantic search systems
-   Intelligent chatbots

**Demonstration:**



In [None]:
sentence = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(sentence)
spacy.displacy.render(doc, style="dep", jupyter=True)

**Exercise 1:** Count how many times a noun serves as a subject (`nsubj`) in
the first 200 documents.

In [None]:
nsubj_count = 0
# your code here
raise NotImplementedError
print(nsubj_count)

**Exercise 2:** Find and visualize the top 10 most frequent dependency labels in the dataset. (Also review what these dependency labels mean; `spacy.explain` gives a short description of each.)

In [None]:
# your code here
raise NotImplementedError

---

## *Section 4: Sentiment Analysis*

**Concept:** Sentiment analysis assigns an emotional tone to text, commonly classifying it as positive, negative, or neutral. Understanding sentiment helps organizations evaluate user opinions, reviews, or public reactions.

**How it works:**
Sentiment analysis often uses a combination of rule-based systems and machine learning models. Rule-based systems like VADER assign polarity scores to words and aggregate them, considering intensifiers and negations. Machine learning approaches use annotated corpora to learn patterns associated with sentiment.
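The rule-based idea can be sketched with a toy scorer. This is not VADER's algorithm: the lexicon, the intensifier weights, and the two-token negation window below are hypothetical values chosen for illustration, and VADER's actual rules and scores differ.

```python
# Hypothetical mini-lexicon; real lexicons such as VADER's are far larger.
LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "terrible": -3.0}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}
NEGATIONS = {"not", "never", "no"}

def polarity(text):
    """Sum word polarities, adjusting for intensifiers and negations."""
    tokens = text.lower().split()
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        score = LEXICON[token]
        # An intensifier directly before the word scales its score.
        if i > 0 and tokens[i - 1] in INTENSIFIERS:
            score *= INTENSIFIERS[tokens[i - 1]]
        # A negation within the two preceding tokens flips the sign.
        if any(t in NEGATIONS for t in tokens[max(0, i - 2):i]):
            score = -score
        total += score
    return total

for sentence in [
    "the movie was very good",
    "the movie was not very good",
    "the plot was terrible",
]:
    print(sentence, "->", polarity(sentence))
```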

**Applications:**

-   Product review classification
-   Social media monitoring
-   Political campaign analysis

**Demonstration:**



In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()
score = sia.polarity_scores(texts[0])
print(score)

**Exercise 1:** Count how many of the first 100 texts are mostly positive (for example, texts whose `compound` score is at least 0.05, a commonly used threshold for VADER).

In [None]:
positive_count = 0
# your code here
raise NotImplementedError
print(positive_count)

**Exercise 2:** Plot the sentiment distribution (positive, negative,
neutral) for the first 500 texts.

In [None]:
# your code here
raise NotImplementedError

---

## *Section 5: Topic Modeling with LDA*

**Concept:** Latent Dirichlet Allocation (LDA) is an unsupervised algorithm that finds abstract topics
in a corpus based on how words are distributed across its documents.

**How it works:**
LDA assumes each document is a mixture of topics and each topic is a probability distribution over words. The algorithm infers these distributions with approximate probabilistic inference; Gibbs sampling, for example, iteratively reassigns a topic to each word. Its Dirichlet priors, when given small concentration parameters, encourage sparsity: each document draws on only a few topics, and each topic concentrates on only a few words.
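The generative assumption behind LDA can be sketched as a small simulation. The example below generates one document from two hypothetical topics with hand-picked word probabilities. It shows only the forward process that LDA assumes, not the inference step that recovers topics from real text. In the full model the topic-word distributions are themselves drawn from a Dirichlet prior; here they are fixed for readability.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document following LDA's generative story."""
    # 1. Draw this document's mixture over topics.
    theta = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to theta.
        topic = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab, probs = zip(*topics[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Two hypothetical topics with hand-picked word probabilities.
topics = [
    {"orbit": 0.5, "rocket": 0.3, "launch": 0.2},
    {"goal": 0.5, "puck": 0.3, "team": 0.2},
]

rng = random.Random(0)  # fixed seed so the run is repeatable
theta, words = generate_document(topics, alpha=0.5, length=8, rng=rng)
print([round(p, 2) for p in theta])
print(" ".join(words))
```

With a concentration parameter `alpha` below 1, most sampled mixtures put the bulk of their weight on one topic, which is the sparsity effect the Dirichlet prior is meant to encourage.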

**Applications:**

-   News topic categorization
-   Academic literature review
-   Legal discovery

**Demonstration:** See the LDA notebook on Canvas.

**Exercise 1:** Print the top 10 words for each of 7 topics from the `rec.autos` category.



In [None]:
# your code here
raise NotImplementedError

**Exercise 2:** Visualize your LDA model using pyLDAvis. (See the LDA demo code for how categories are defined.)

In [None]:
# your code here
raise NotImplementedError