# IMDB reviews as n-gram graphs

This example shows how to use n-gram graphs with the imdb review dataset.

## Setup

In [None]:
!pip3 install datasets scikit-learn nltk
!pip install -q dgl -f https://data.dgl.ai/wheels/cu118/repo.html
!pip install -q dglgo -f https://data.dgl.ai/wheels-test/repo.html

## Load Dataset

Download the imdb review dataset from huggingface:

In [1]:
from datasets import load_dataset

dataset = load_dataset("imdb")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

## Preprocessing

Before we can start, we have to preprocess the text. We remove punctuation, stop words, and break sentences into words. We use the nltk package to accomplish this:

In [2]:
from nltk import TweetTokenizer
import nltk

nltk.download("stopwords")
nltk.download("punkt")

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokenizer_words = TweetTokenizer()


def preprocess_text(sample):
    text = sample["text"]
    text = text.lower()
    text = tokenizer_words.tokenize(text)
    text = [w for w in text if not w in stop_words and w.isalpha()]
    return {"text": text}


dataset["train"] = dataset["train"].map(preprocess_text)
dataset["test"] = dataset["test"].map(preprocess_text)

[nltk_data] Downloading package stopwords to nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
dataset["train"]["text"][0]

['rented',
 'video',
 'store',
 'controversy',
 'surrounded',
 'first',
 'released',
 'also',
 'heard',
 'first',
 'seized',
 'u',
 'customs',
 'ever',
 'tried',
 'enter',
 'country',
 'therefore',
 'fan',
 'films',
 'considered',
 'controversial',
 'really',
 'see',
 'br',
 'br',
 'plot',
 'centered',
 'around',
 'young',
 'swedish',
 'drama',
 'student',
 'named',
 'lena',
 'wants',
 'learn',
 'everything',
 'life',
 'particular',
 'wants',
 'focus',
 'attentions',
 'making',
 'sort',
 'documentary',
 'average',
 'swede',
 'thought',
 'certain',
 'political',
 'issues',
 'vietnam',
 'war',
 'race',
 'issues',
 'united',
 'states',
 'asking',
 'politicians',
 'ordinary',
 'denizens',
 'stockholm',
 'opinions',
 'politics',
 'sex',
 'drama',
 'teacher',
 'classmates',
 'married',
 'men',
 'br',
 'br',
 'kills',
 'years',
 'ago',
 'considered',
 'pornographic',
 'really',
 'sex',
 'nudity',
 'scenes',
 'far',
 'even',
 'shot',
 'like',
 'cheaply',
 'made',
 'porno',
 'countrymen',
 'min

## Build N-Gram Graphs

Now we can build the n-gram graphs for these texts:

In [4]:
import os
from ngram_graph.graph import text_to_graph


def get_edges(samples):
    graphs = [text_to_graph(t, n=4) for t in samples["text"]]
    graphs = [g.as_dgl_graph() for g in graphs]
    return {
        "edges": [g[0].edges() for g in graphs],
        "n_grams": [g[1] for g in graphs],
    }


dataset["train"] = dataset["train"].map(get_edges, batched=True, batch_size=100, num_proc=os.cpu_count())
dataset["test"] = dataset["test"].map(get_edges, batched=True, batch_size=100, num_proc=os.cpu_count())

In [5]:
dataset["train"][0]["n_grams"]

['rented video store controversy',
 'video store controversy surrounded',
 'store controversy surrounded first',
 'controversy surrounded first released',
 'surrounded first released also',
 'first released also heard',
 'released also heard first',
 'also heard first seized',
 'heard first seized u',
 'first seized u customs',
 'seized u customs ever',
 'u customs ever tried',
 'customs ever tried enter',
 'ever tried enter country',
 'tried enter country therefore',
 'enter country therefore fan',
 'country therefore fan films',
 'therefore fan films considered',
 'fan films considered controversial',
 'films considered controversial really',
 'considered controversial really see',
 'controversial really see br',
 'really see br br',
 'see br br plot',
 'br br plot centered',
 'br plot centered around',
 'plot centered around young',
 'centered around young swedish',
 'around young swedish drama',
 'young swedish drama student',
 'swedish drama student named',
 'drama student named l

There can be empty graphs, because there are texts that do not contain more than N words after the pre-processing is done. So filter all empty graphs:

In [6]:
dataset["train"] = dataset["train"].filter(lambda sample: len(sample["edges"][0]) != 0, num_proc=os.cpu_count())
dataset["test"] = dataset["test"].filter(lambda sample: len(sample["edges"][0]) != 0, num_proc=os.cpu_count())

After we created the n-gram graphs, we have to vectorize the n-grams. For this purpose we use the CountVectorizer:

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer=lambda x: x, max_df=0.98, min_df=0.01)
vectorizer.fit(dataset["train"]["text"])
vectorizer.fit(dataset["test"]["text"])
len(vectorizer.vocabulary_)

1587

In [8]:
train_ids = [vectorizer.transform(g) for g in dataset["train"]["n_grams"]]
test_ids = [vectorizer.transform(g) for g in dataset["test"]["n_grams"]]

In [9]:
import torch
import dgl


def get_dgl_graphs(samples, ids):
    graphs = [dgl.graph(tuple(s)) for s in samples["edges"]]
    for g, i in zip(graphs, ids):
        g.ndata["id"] = torch.tensor(i.todense(), dtype=torch.float32)
    return graphs


train_graphs = get_dgl_graphs(dataset["train"], train_ids)
test_graphs = get_dgl_graphs(dataset["test"], test_ids)

As a final step, we will save the dgl graphs for future use:

In [10]:
dgl.save_graphs("train.bin", train_graphs)
dgl.save_graphs("test.bin", test_graphs)