# Natural Language Processing (NLP) with Dimensions Grant Data

This notebook introduces Natural Language Processing (NLP) concepts using **Dimensions-style grant abstracts** as real-world examples.

1. **Fundamentals**
   - Tokenization
   - n-grams
   - Bag-of-Words
   - Markov text generation

2. **Classification (Naive Bayes)**
   - Classify grants as AI vs non-AI using text

3. **Word Representations**
   - One-hot vectors
   - Word embeddings (Word2Vec, Skip-Gram)

4. **Syntax & Semantics**
   - Context-Free Grammar parsing with NLTK
   - Syntactic ambiguity & semantics

5. **Neural NLP**
   - RNN text classifier
   - Attention concepts
   - Transformer sentence embeddings

6. **Applications in Grants Data**
   - Grant classification
   - Similarity analysis
   - Text generation
   - Named Entity Recognition (NER)

# Imports & Data Load

In [None]:
import pandas as pd
import numpy as np

# NLP tools
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.util import ngrams
from nltk import CFG, ChartParser

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Word embeddings
from gensim.models import Word2Vec

# Deep NLP
import tensorflow as tf
from tensorflow.keras import layers, models, preprocessing

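# Ensure NLTK tokenizer data is available
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # required by newer NLTK releases for word_tokenize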

# --- Load Dimensions grant data (must include "abstract" and "is_ai_ml") ---
grants = pd.read_csv("grants.csv")  # adjust the path to your own export

print("Columns:", grants.columns.tolist())

## 1. Tokenization & N-grams

Tokenization splits raw text into units such as sentences and words.  
An n-gram is a run of n consecutive tokens (bigrams for n = 2, trigrams for n = 3); n-gram counts are useful for next-word prediction and for exploring recurring phrases and topics.

In [None]:
sample_text = grants["abstract"].dropna().iloc[0]
print("Sample grant abstract:\n", sample_text)

# Word tokenization
tokens = word_tokenize(sample_text.lower())
print("\nTokens:", tokens[:20])

# Bigrams
bigrams = list(ngrams(tokens, 2))
print("\nExample bigrams:", bigrams[:10])
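
To make the n-gram idea concrete without any library, here is a minimal sketch on a made-up sentence. The `make_ngrams` helper and the toy tokens are illustrative assumptions, not part of NLTK or the Dimensions data.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy tokens, invented for illustration (not taken from a real abstract)
toy_tokens = "we study deep learning and we study data".split()

toy_bigrams = make_ngrams(toy_tokens, 2)
toy_trigrams = make_ngrams(toy_tokens, 3)
bigram_counts = Counter(toy_bigrams)

print(toy_bigrams[:3])
print(bigram_counts.most_common(1))
```

A sequence of k tokens yields k - n + 1 n-grams, so very short abstracts may produce none for larger n.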

## 2. Bag-of-Words Representation

Bag-of-Words ignores order and simply counts word occurrences.  
Useful for:
- Topic features
- Text classification
- Similarity queries

In [None]:
vectorizer = CountVectorizer(max_features=2000, stop_words="english")
X_bow = vectorizer.fit_transform(grants["abstract"].fillna(""))

print("Shape (documents × vocab):", X_bow.shape)
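
As a rough picture of what a count vectorizer produces, the sketch below builds a tiny document-term matrix by hand. The two documents are invented, and the whitespace tokenization is a simplifying assumption: `CountVectorizer` additionally lowercases, applies its own token pattern, and drops stop words when asked to.

```python
from collections import Counter

# Toy documents, invented for illustration
docs = [
    "machine learning for protein folding",
    "protein structure and protein function",
]

# Vocabulary: sorted unique words across all documents
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word
bow_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    bow_matrix.append([counts[word] for word in vocab])

print(vocab)
for row in bow_matrix:
    print(row)
```

Word order is gone in the matrix: any reordering of a document's words gives the same row.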

## 3. Naive Bayes Text Classifier (AI vs Non-AI)

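Naive Bayes assumes that words are conditionally independent given the class. Smoothing (add-one by default in `MultinomialNB`) keeps a word that never appeared with a class from forcing that class's probability to zero.  
We’ll classify grants into two groups based purely on their abstracts.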

In [None]:
y = grants["is_ai_ml"].fillna(0).astype(int).values  # fill missing labels first; NaN cannot be cast to int

X_train, X_test, y_train, y_test = train_test_split(
    X_bow, y, test_size=0.2, random_state=42, stratify=y
)

nb = MultinomialNB()
nb.fit(X_train, y_train)
pred = nb.predict(X_test)

print(classification_report(y_test, pred))
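
To show what the classifier computes under the hood, here is a minimal hand-rolled multinomial Naive Bayes with add-alpha smoothing. The four training texts, their labels, and the helper names are assumptions made up for illustration; this is a sketch of the technique, not scikit-learn's implementation.

```python
import math
from collections import Counter

# Toy labeled documents, invented for illustration (1 = AI, 0 = non-AI)
train = [
    ("neural network training", 1),
    ("deep neural model", 1),
    ("soil erosion study", 0),
    ("river soil sampling", 0),
]

vocab = {w for text, _ in train for w in text.split()}
word_counts = {0: Counter(), 1: Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

def log_posterior(text, label, alpha=1.0):
    """Unnormalized log P(label | text) with add-alpha (Laplace) smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))  # log prior
    for w in text.split():
        if w not in vocab:
            continue  # words never seen in training are skipped
        p = (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
        score += math.log(p)
    return score

def predict(text):
    """Pick the label with the higher unnormalized log posterior."""
    return max((0, 1), key=lambda label: log_posterior(text, label))

print(predict("neural model"))
print(predict("soil study"))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long abstracts.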

## 4. Markov Model (n-gram text generator)

A Markov chain predicts the next word from a fixed window of preceding words; in a bigram model that window is just the current word.  
We build a simple *bigram* model from Dimensions abstracts to generate synthetic grant descriptions.

In [None]:
from collections import defaultdict
import random

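# Build transitions: each word maps to the list of words seen right after it.
# Repeats are kept, so random.choice samples from the empirical P(next | current).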
transitions = defaultdict(list)

for abstract in grants["abstract"].dropna().head(500):
    words = ["<s>"] + word_tokenize(abstract.lower()) + ["</s>"]
    for w1, w2 in ngrams(words, 2):
        transitions[w1].append(w2)

def generate_text(seed="<s>", length=25):
    word = seed
    result = []
    for _ in range(length):
        next_words = transitions.get(word, ["</s>"])
        word = random.choice(next_words)
        if word == "</s>":
            break
        result.append(word)
    return " ".join(result)

print(generate_text())
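
The transition probabilities behind a bigram generator can also be written out explicitly. The sketch below estimates them from three invented sentences; `follow_counts` and `next_word_probs` are hypothetical names used only for this illustration.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration
toy_sentences = [
    "we study data",
    "we study models",
    "we collect data",
]

follow_counts = defaultdict(Counter)
for sentence in toy_sentences:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for current, nxt in zip(words, words[1:]):
        follow_counts[current][nxt] += 1

def next_word_probs(word):
    """Estimate P(next | word) by normalizing the observed follower counts."""
    counts = follow_counts[word]
    total = sum(counts.values())
    return {nxt: c / total for nxt, c in counts.items()}

print(next_word_probs("we"))
print(next_word_probs("study"))
```

Sampling from these distributions is equivalent to calling `random.choice` on a list of followers that keeps duplicates.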

## 5. Word Representations

### One-hot vectors
- Simple but high-dimensional
- No notion of similarity

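### Embeddings (Word2Vec)
- Dense, low-dimensional vectors in which words used in similar contexts end up close together
- Skip-Gram learns these vectors by predicting the context words around each target word
- Classic example (from models trained on large general corpora): king – man + woman ≈ queen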

In [None]:
abstract_tokens = [word_tokenize(a.lower()) for a in grants["abstract"].dropna().head(5000)]
# sg=1 selects Skip-Gram; gensim's default (sg=0) is CBOW
w2v = Word2Vec(abstract_tokens, vector_size=50, window=5, min_count=3, workers=4, sg=1)

print("Most similar to 'data':", w2v.wv.most_similar("data")[:10])
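
The contrast between one-hot and dense vectors shows up directly in cosine similarity. In the sketch below the three-word vocabulary and the 2-d "embeddings" are hand-picked assumptions, not values learned by Word2Vec.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["genome", "dna", "bridge"]

# One-hot: a 1 at the word's index, 0 elsewhere
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Hand-picked 2-d "embeddings", purely illustrative (not learned from data)
dense = {
    "genome": [0.9, 0.1],
    "dna": [0.8, 0.2],
    "bridge": [0.1, 0.9],
}

print(cosine(one_hot["genome"], one_hot["dna"]))
print(cosine(dense["genome"], dense["dna"]))
print(cosine(dense["genome"], dense["bridge"]))
```

Any two distinct one-hot vectors are orthogonal, so their similarity is always zero; dense vectors can place related words close together.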

## 6. Syntax & Semantics with a CFG

Syntax = structure  
Semantics = meaning

We demonstrate a **Context-Free Grammar (CFG)** and parse a simple sentence using NLTK’s ChartParser.

In [None]:
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'researcher' | 'dataset'
V -> 'analyzed'
""")

parser = ChartParser(grammar)
sentence = "the researcher analyzed the dataset".split()

for tree in parser.parse(sentence):
    tree.pretty_print()
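
Syntactic ambiguity means one sentence has more than one valid tree. As a minimal sketch, the code below counts parses with a small CYK chart written in plain Python. The prepositional-phrase rules and the extra words (`with`, `tool`) are assumptions added to the toy grammar for illustration, and `count_parses` is a hypothetical helper, not an NLTK function.

```python
from collections import defaultdict

# Toy grammar in Chomsky Normal Form, extended with a prepositional phrase (PP).
# The extra rules and words are illustrative assumptions.
binary_rules = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("PP", "P", "NP"),
]
lexicon = {
    "the": ["Det"],
    "researcher": ["N"], "dataset": ["N"], "tool": ["N"],
    "analyzed": ["V"],
    "with": ["P"],
}

def count_parses(words, start="S"):
    """CYK chart that counts how many distinct trees derive each span."""
    n = len(words)
    # chart[(i, j)][symbol] = number of trees for words[i:j] rooted at symbol
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for symbol in lexicon.get(word, []):
            chart[(i, i + 1)][symbol] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in binary_rules:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][start]

print(count_parses("the researcher analyzed the dataset".split()))
print(count_parses("the researcher analyzed the dataset with the tool".split()))
```

The second sentence is ambiguous because "with the tool" can modify either the dataset or the act of analyzing; choosing between the two readings is a question of semantics, not syntax.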

## 7. RNN (LSTM) Text Classifier

We classify AI vs non-AI grants using only the **abstract** text.

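Key idea:
- RNNs carry a hidden state from token to token, so word order influences the prediction  
- Plain RNNs tend to lose information over long sequences; the LSTM's gates are designed to preserve longer-range dependencies  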

In [None]:
# Prepare texts and labels
texts = grants["abstract"].fillna("").tolist()
labels = grants["is_ai_ml"].fillna(0).astype(int).values  # fill missing labels before casting to int

tokenizer = preprocessing.text.Tokenizer(num_words=5000)
tokenizer.fit_on_texts(texts)

X_seq = tokenizer.texts_to_sequences(texts)
X_pad = preprocessing.sequence.pad_sequences(X_seq, maxlen=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_pad, labels, test_size=0.2, random_state=42, stratify=labels
)

model_rnn = models.Sequential([
    layers.Embedding(input_dim=5000, output_dim=64),  # input_length is deprecated in recent Keras; the length is inferred from the padded input
    layers.LSTM(64),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid")
])

model_rnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model_rnn.fit(X_train, y_train, epochs=3, batch_size=32, validation_split=0.1, verbose=1)

loss_rnn, acc_rnn = model_rnn.evaluate(X_test, y_test, verbose=0)
print("LSTM classification accuracy:", acc_rnn)
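
The "hidden state across tokens" idea can be sketched in a few lines of NumPy. This is a plain (Elman-style) recurrence, not an LSTM, and the dimensions, random weights, and fake inputs are assumptions chosen only to make the loop runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

embed_dim, hidden_dim, seq_len = 4, 3, 5

# Randomly initialized weights stand in for trained parameters
W_x = rng.normal(scale=0.5, size=(hidden_dim, embed_dim))   # input -> hidden
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))  # hidden -> hidden
b = np.zeros(hidden_dim)

# A fake sequence of token embeddings, one row per token
inputs = rng.normal(size=(seq_len, embed_dim))

h = np.zeros(hidden_dim)  # initial hidden state
for x_t in inputs:
    # The new state mixes the current token with the previous state
    h = np.tanh(W_x @ x_t + W_h @ h + b)

print("Final hidden state:", h)
```

In a classifier like the one above, the final state is what the dense layers consume; an LSTM replaces the single `tanh` update with gated updates to a separate cell state.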

## 8. Transformers & Attention

Transformers use **self-attention** to process all words in parallel and capture global context.

Here we:
- Use a pretrained transformer (via TF Hub)
- Generate sentence embeddings for grant abstracts

In [None]:
import tensorflow_hub as hub

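# The "-large" variant is the Transformer-based encoder; "universal-sentence-encoder/4" is a Deep Averaging Network
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")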

sentences = grants["abstract"].dropna().head(5).tolist()
embeddings = embed(sentences)

print("Embedding shape:", embeddings.shape)
print("Similarity matrix:\n", np.inner(embeddings, embeddings))
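
For intuition about self-attention itself, here is a minimal single-head scaled dot-product attention in NumPy. The random matrices stand in for learned projections and token embeddings; all shapes and names are assumptions for illustration, and this is not how the pretrained encoder is called.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over token rows of X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (tokens, tokens)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 8
X = rng.normal(size=(n_tokens, d_model))  # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print("Output shape:", output.shape)
print("Attention weights (rows sum to 1):\n", weights.round(2))
```

Every token's output is a weighted average over all tokens, computed with matrix products and no sequential loop, which is what allows parallel processing and global context.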

## 9. Named Entity Recognition (NER)

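We extract named entities using spaCy. The general-purpose `en_core_web_sm` model tags types such as organizations (`ORG`), locations (`GPE`, `LOC`), people and dates; recognizing diseases or chemical compounds requires a biomedical model (for example, one from scispaCy).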

In [None]:
import spacy
# Requires the model to be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = grants["abstract"].dropna().iloc[0]
doc = nlp(text)

[(ent.text, ent.label_) for ent in doc.ents][:20]

# Final Summary

This notebook covered:

### Core NLP Tasks
- Tokenization, n-grams, Bag-of-Words
- Text classification via Naive Bayes
- Markov text generation

### Word Representation
- One-hot encoding, embeddings, Word2Vec Skip-Gram

### Syntax & Semantics
- Context-Free Grammar parsing with NLTK

### Deep NLP
- LSTM text classifier
- Transformer embeddings & attention

### Applications to Dimensions Data
- Grant classification
- Entity extraction
- Similarity analysis
- Text generation