# POS tagging

In this notebook, we explore what part-of-speech (POS) tagging can tell us about the social media posts of state election officials.

The coarse-grained tags that spaCy assigns to each token (`token.pos_`) are:

- ADJ: Adjective (e.g., "happy", "large")
- ADP: Adposition (preposition or postposition, e.g., "in", "on")
- ADV: Adverb (e.g., "quickly", "very")
- AUX: Auxiliary verb (e.g., "is", "will", "can")
- CONJ: Conjunction (older label from Universal Dependencies v1; current spaCy models use CCONJ instead)
- CCONJ: Coordinating conjunction (e.g., "and", "or")
- DET: Determiner (e.g., "the", "a", "this")
- INTJ: Interjection (e.g., "oh", "wow")
- NOUN: Noun (e.g., "cat", "book")
- NUM: Numeral (e.g., "one", "2023")
- PART: Particle (e.g., "not", "to" in infinitives)
- PRON: Pronoun (e.g., "he", "she", "it")
- PROPN: Proper noun (e.g., "John", "London")
- PUNCT: Punctuation (e.g., ".", ",", "!")
- SCONJ: Subordinating conjunction (e.g., "if", "because")
- SYM: Symbol (e.g., "$", "%")
- VERB: Verb (e.g., "run", "eat")
- X: Other (e.g., foreign words, abbreviations)

### Fine-Grained POS Tags

Fine-grained tags (`token.tag_`) provide more detailed information about a word's role in the sentence. They are specific to English and are based on the Penn Treebank tag set. The groups below show some common examples, organized by the coarse tag they usually correspond to:

- **Adjectives (ADJ):**
  - JJ: Adjective (e.g., "happy")
  - JJR: Comparative adjective (e.g., "happier")
  - JJS: Superlative adjective (e.g., "happiest")

- **Nouns (NOUN):**
  - NN: Singular noun (e.g., "cat")
  - NNS: Plural noun (e.g., "cats")
  - NNP: Singular proper noun (e.g., "John")
  - NNPS: Plural proper noun (e.g., "Vikings")

- **Verbs (VERB):**
  - VB: Base form (e.g., "run")
  - VBD: Past tense (e.g., "ran")
  - VBG: Gerund/present participle (e.g., "running")
  - VBN: Past participle (e.g., "eaten")
  - VBP: Present tense, not 3rd person singular (e.g., "run")
  - VBZ: Present tense, 3rd person singular (e.g., "runs")

- **Adverbs (ADV):**
  - RB: Adverb (e.g., "quickly")
  - RBR: Comparative adverb (e.g., "faster")
  - RBS: Superlative adverb (e.g., "fastest")

- **Pronouns (PRON):**
  - PRP: Personal pronoun (e.g., "I", "you")
  - PRP$: Possessive pronoun (e.g., "my", "your")

- **Determiners (DET):**
  - DT: Determiner (e.g., "the", "a")
  - WDT: Wh-determiner (e.g., "which", "what")

- **Particles (PART):**
  - RP: Particle (e.g., "up" in "give up")
  - TO: Infinitive marker (e.g., "to" in "to run")

- **Conjunctions (CONJ/CCONJ/SCONJ):**
  - CC: Coordinating conjunction (e.g., "and", "or")
  - IN: Preposition or subordinating conjunction (e.g., "in", "because"); the coarse tag is ADP or SCONJ depending on how the word is used

- **Punctuation (PUNCT):**
  - .: Period
  - ,: Comma
  - HYPH: Hyphen
  - '': Closing quotation mark

- **Other (X):**
  - FW: Foreign word (e.g., "bonjour")
  - LS: List item marker (e.g., "1.", "a.")
  - UH: Interjection (e.g., "uh", "oh"); spaCy assigns this the coarse tag INTJ rather than X
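
The grouping above can be sketched as a lookup table. This is a hand-written, partial mapping for illustration only: `FINE_TO_COARSE` and `coarse_tag` are hypothetical names, not part of spaCy, and a real tagger decides the coarse tag in context. In the spaCy output further down, for example, "is" carries the fine-grained tag VBZ but the coarse tag AUX, and NNP is reported as PROPN rather than NOUN.

```python
# Partial, hand-written mapping from Penn Treebank tags to coarse tags.
# Illustrative only: spaCy derives token.pos_ itself and may differ in context.
FINE_TO_COARSE = {
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "PRP": "PRON", "PRP$": "PRON",
    "DT": "DET", "WDT": "DET",
    "CC": "CCONJ",
}


def coarse_tag(fine_tag: str) -> str:
    """Return the coarse tag for a fine-grained tag, or "X" if it is not in the table."""
    return FINE_TO_COARSE.get(fine_tag, "X")


print(coarse_tag("JJR"))   # comparative adjective
print(coarse_tag("NNPS"))  # plural proper noun
```

A table like this is mainly useful for collapsing tag counts into broader groups; for tagging itself, `token.pos_` is the better source.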

## Load data

In [14]:
import pandas as pd
import numpy as np
import spacy

df = pd.read_csv("../data/clean/pos_tagging_data.csv")
df.head()

Unnamed: 0,PostId,Combined_text
0,74414837,We hope everyone has a safe and Happy Hallowee...
1,74420801,Oconee County has the best Elections staff and...
2,74420802,🇺🇸Keep on voting Young County🇺🇸 Let’s try and ...
3,74420805,"Early Voting turnout for Monday, October 31, 2..."
4,74411274,Happy Halloween from the Clerk-Recorders Office!


## spaCy

In [9]:
import spacy

# Load the SpaCy model
nlp = spacy.load("en_core_web_lg")

# Sample text
text = "SpaCy is a powerful library for natural language processing."

# Process the text
doc = nlp(text)

# Iterate over the tokens and print their POS tags
for token in doc:
    print(f"Token: {token.text}, POS: {token.pos_}, Tag: {token.tag_}")

Token: SpaCy, POS: PROPN, Tag: NNP
Token: is, POS: AUX, Tag: VBZ
Token: a, POS: DET, Tag: DT
Token: powerful, POS: ADJ, Tag: JJ
Token: library, POS: NOUN, Tag: NN
Token: for, POS: ADP, Tag: IN
Token: natural, POS: ADJ, Tag: JJ
Token: language, POS: NOUN, Tag: NN
Token: processing, POS: NOUN, Tag: NN
Token: ., POS: PUNCT, Tag: .


## Counts of word types

In [11]:
import spacy
from collections import defaultdict

# Load the SpaCy model
nlp = spacy.load("en_core_web_lg")

# Sample documents
documents = [
    "The ambitious man and the emotional woman worked together.",
    "The confident leader and the supportive assistant presented the plan.",
    "The aggressive boy and the polite girl were in the same class."
]

# Initialize a dictionary to store word counts by POS
pos_counts = defaultdict(lambda: defaultdict(int))

# Process each document
for doc_text in documents:
    doc = nlp(doc_text)
    for token in doc:
        pos_counts[token.pos_][token.text.lower()] += 1

# Print the results
for pos, words in pos_counts.items():
    print(f"POS: {pos}")
    for word, count in words.items():
        print(f"  {word}: {count}")

POS: DET
  the: 8
POS: ADJ
  ambitious: 1
  emotional: 1
  confident: 1
  supportive: 1
  aggressive: 1
  polite: 1
  same: 1
POS: NOUN
  man: 1
  woman: 1
  leader: 1
  assistant: 1
  plan: 1
  boy: 1
  girl: 1
  class: 1
POS: CCONJ
  and: 3
POS: VERB
  worked: 1
  presented: 1
POS: ADV
  together: 1
POS: PUNCT
  .: 3
POS: AUX
  were: 1
POS: ADP
  in: 1


## Dependency parsing or collocation analysis

In [12]:
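# Extract adjective-noun pairs.
# In spaCy's dependency parse, an attributive adjective is a child of the noun
# it modifies (dependency label "amod"), so we follow each adjective's head.
# Looking through the adjective's own children finds no nouns and yields an empty list.
adj_noun_pairs = []

for doc_text in documents:
    doc = nlp(doc_text)
    for token in doc:
        if token.pos_ == "ADJ" and token.dep_ == "amod" and token.head.pos_ == "NOUN":
            adj_noun_pairs.append((token.text.lower(), token.head.text.lower()))

# Print the results
print("Adjective-Noun Pairs:")
for adj, noun in adj_noun_pairs:
    print(f"  {adj} {noun}")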
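
Collocation analysis offers a parser-free alternative: treat an adjective that is immediately followed by a noun as a pair. The sketch below assumes the tokens are already tagged; the tagged sentence is copied by hand from the first sample document and the tags in the counts output, and `adjacent_adj_noun_pairs` is a hypothetical helper name. It will miss pairs where other words intervene, such as "the boy was aggressive".

```python
def adjacent_adj_noun_pairs(tagged_tokens):
    """Return (adjective, noun) pairs where an ADJ is immediately followed by a NOUN."""
    pairs = []
    # Walk over consecutive token pairs
    for (word, pos), (next_word, next_pos) in zip(tagged_tokens, tagged_tokens[1:]):
        if pos == "ADJ" and next_pos == "NOUN":
            pairs.append((word.lower(), next_word.lower()))
    return pairs


# First sample sentence with its coarse tags, copied by hand
sentence = [
    ("The", "DET"), ("ambitious", "ADJ"), ("man", "NOUN"), ("and", "CCONJ"),
    ("the", "DET"), ("emotional", "ADJ"), ("woman", "NOUN"),
    ("worked", "VERB"), ("together", "ADV"), (".", "PUNCT"),
]

print(adjacent_adj_noun_pairs(sentence))
```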


## Sentiment on pairs

In [13]:
from textblob import TextBlob

# Score the sentiment of each adjective-noun pair
for adj, noun in adj_noun_pairs:
    blob = TextBlob(f"{adj} {noun}")
    sentiment = blob.sentiment.polarity  # Range: -1 (negative) to 1 (positive)
    print(f"Pair: {adj} {noun}, Sentiment: {sentiment}")
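
Per-pair scores become easier to compare when averaged per noun. The following is a rough sketch, not the notebook's method: `mean_polarity_by_noun` is a hypothetical helper that accepts any scoring function, and `toy_score` with its two-word lexicon is a made-up stand-in so that the example runs without TextBlob. The example pairs are invented for illustration as well.

```python
from collections import defaultdict
from statistics import mean


def mean_polarity_by_noun(pairs, score):
    """Average score("<adj> <noun>") over all pairs that share the same noun."""
    by_noun = defaultdict(list)
    for adj, noun in pairs:
        by_noun[noun].append(score(f"{adj} {noun}"))
    return {noun: mean(values) for noun, values in by_noun.items()}


# Made-up lexicon and scorer, used only to keep the example self-contained
TOY_LEXICON = {"polite": 0.5, "aggressive": -0.5}


def toy_score(text):
    """Sum the toy lexicon values of the words in text (unknown words count as 0)."""
    return sum(TOY_LEXICON.get(word, 0.0) for word in text.split())


# Invented pairs for illustration
example_pairs = [("polite", "girl"), ("aggressive", "boy"), ("polite", "boy")]
result = mean_polarity_by_noun(example_pairs, toy_score)
print(result)
```

To use TextBlob as in the cell above, pass `lambda text: TextBlob(text).sentiment.polarity` as the `score` argument.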

## Broad Topic Modeling

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# The vectorizer works on raw strings, so the sample documents are used as they are
texts = documents

# Create a document-term matrix
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(texts)

# Perform topic modeling
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(dtm)

# Print the top five words for each topic, most heavily weighted first
feature_names = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_indices = topic.argsort()[-5:][::-1]
    print(f"Topic {idx}:")
    print([feature_names[i] for i in top_indices])
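
One way POS tags could feed into topic modeling is by filtering each document down to content words before vectorizing. This is only a sketch of that idea: the choice of tags in `CONTENT_TAGS` is an assumption, `content_words` is a hypothetical helper, and the tagged sentence is copied by hand from the first sample document so that no model is required.

```python
# Assumed set of "content" tags; adjust to the analysis at hand
CONTENT_TAGS = {"NOUN", "PROPN", "ADJ", "VERB"}


def content_words(tagged_tokens, keep=CONTENT_TAGS):
    """Join the lowercased words whose coarse tag is in keep into one string."""
    return " ".join(word.lower() for word, pos in tagged_tokens if pos in keep)


# First sample sentence with its coarse tags, copied by hand
sentence = [
    ("The", "DET"), ("ambitious", "ADJ"), ("man", "NOUN"), ("and", "CCONJ"),
    ("the", "DET"), ("emotional", "ADJ"), ("woman", "NOUN"),
    ("worked", "VERB"), ("together", "ADV"), (".", "PUNCT"),
]

filtered = content_words(sentence)
print(filtered)
```

With spaCy, the tagged tokens would come from `[(t.text, t.pos_) for t in nlp(text)]`, and the filtered strings could be passed to `CountVectorizer` in place of `texts`.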