# Text Representation and Pre-processing Practice

This notebook walks through encoding text data into machine-friendly representations
using a small sample from the provided tweets dataset.
We explore naive one-hot encodings, demonstrate how simple pre-processing helps reduce sparsity,
and then build co-occurrence matrices, term-frequency representations, TF–IDF, and compute distance metrics.

## Loading a sample of tweets

We start by loading the `tweets.json` file and inspecting a small sample.
The data is stored as a JSON object with a single key `tweets` mapping to a list of tweet records.
For demonstration purposes, we'll use the first two tweets from the dataset.

In [1]:

import json
from pathlib import Path

# Path to the tweets file
file_path = Path('tweets.json')

# Load the data
with open(file_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

# Extract the list of tweets
tweets = data['tweets']

# Take a small sample (two tweets)
sample_texts = []
for tweet in tweets:
    text = tweet.get('full_text') or tweet.get('text') or ''
    if text:
        sample_texts.append(text)
    if len(sample_texts) >= 2:
        break

print(f"Loaded {len(tweets)} tweets from the file.")
print('Sample tweet 1:', sample_texts[0])
print('Sample tweet 2:', sample_texts[1])


Loaded 500 tweets from the file.
Sample tweet 1: RT @mike_pence: Huge crowd gathered tonight at SNHU Arena in Manchester, NH for @realDonaldTrump! https://t.co/SvnB8xWHKm
Sample tweet 2: Springsteen said Hillary was born to run? She can't even walk. @realDonaldTrump @TomiLahren @WeNeedTrump


## Naïve one-hot encoding (no pre-processing)

To illustrate the limitations of one-hot encodings, we'll first build a one-hot representation without any pre-processing. This means we simply split each tweet on whitespace without removing mentions, URLs, punctuation or changing case.

This approach often yields a very large vocabulary containing many one-off tokens (e.g., user handles, URL fragments), which leads to extremely sparse vectors and little overlap between documents.

In [2]:

# Naive tokenization by splitting on whitespace (no cleaning)
naive_tokens = [text.split() for text in sample_texts]
print("Naive tokens:")
for i, toks in enumerate(naive_tokens, 1):
    print(f"Tweet {i} tokens: {toks}")

# Build vocabulary from naive tokens
vocab_naive = sorted(set().union(*naive_tokens))
word2idx_naive = {w: i for i, w in enumerate(vocab_naive)}

# Create one-hot presence vectors
vectors_naive = []
for toks in naive_tokens:
    s = set(toks)
    vec = [1 if w in s else 0 for w in vocab_naive]
    vectors_naive.append(vec)

print('Vocabulary size (naive):', len(vocab_naive))
print('Vocabulary:', vocab_naive)

# Display one-hot vectors and overlap
for i, vec in enumerate(vectors_naive, 1):
    print(f"One-hot vector for tweet {i}: {vec}")

# Compute overlap (dot product) between the two vectors
if len(vectors_naive) >= 2:
    from numpy import dot
    overlap_naive = dot(vectors_naive[0], vectors_naive[1])
    print('Overlap (dot product) without pre-processing:', overlap_naive)


Naive tokens:
Tweet 1 tokens: ['RT', '@mike_pence:', 'Huge', 'crowd', 'gathered', 'tonight', 'at', 'SNHU', 'Arena', 'in', 'Manchester,', 'NH', 'for', '@realDonaldTrump!', 'https://t.co/SvnB8xWHKm']
Tweet 2 tokens: ['Springsteen', 'said', 'Hillary', 'was', 'born', 'to', 'run?', 'She', "can't", 'even', 'walk.', '@realDonaldTrump', '@TomiLahren', '@WeNeedTrump']
Vocabulary size (naive): 29
Vocabulary: ['@TomiLahren', '@WeNeedTrump', '@mike_pence:', '@realDonaldTrump', '@realDonaldTrump!', 'Arena', 'Hillary', 'Huge', 'Manchester,', 'NH', 'RT', 'SNHU', 'She', 'Springsteen', 'at', 'born', "can't", 'crowd', 'even', 'for', 'gathered', 'https://t.co/SvnB8xWHKm', 'in', 'run?', 'said', 'to', 'tonight', 'walk.', 'was']
One-hot vector for tweet 1: [0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
One-hot vector for tweet 2: [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
Overlap (dot product) without pre-processing: 0


### Discussion

Without any pre-processing, the vocabulary consists entirely of tokens that occur in only one of the two tweets (mentions, URLs, mixed-case words, words with punctuation attached). Even the handle that both tweets mention fails to match: it appears as `@realDonaldTrump!` in tweet 1 and as `@realDonaldTrump` in tweet 2. As a result, the one-hot vectors share no dimensions and the overlap (dot product) is 0. This highlights the limitations of naive one-hot encoding. Next, we clean and normalize the text to create more meaningful representations.
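
To make the mismatch concrete, the following minimal sketch compares the shared handle as it appears in each tweet. The two tokens are copied from the output above; the variable names and the set of punctuation characters stripped are illustrative assumptions.

```python
# Tokens copied from the naive tokenization output above
tok_tweet1 = "@realDonaldTrump!"   # trailing "!" is glued to the handle
tok_tweet2 = "@realDonaldTrump"

# Exact string comparison treats these as different vocabulary entries
print(tok_tweet1 == tok_tweet2)

# Stripping trailing punctuation and lowercasing makes them match
normalized1 = tok_tweet1.rstrip("!?.,:").lower()
normalized2 = tok_tweet2.rstrip("!?.,:").lower()
print(normalized1 == normalized2)
```

Any difference in case or attached punctuation creates a separate vocabulary dimension, which is why normalization comes before building the vocabulary.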

## Pre-processing: Tokenization and normalization

A simple preprocessing pipeline helps reduce sparsity by lowercasing, removing punctuation, and splitting tweets into consistent tokens. This step is crucial for aggregating similar words and improving overlap.

In [3]:

import re

# A simple tokenizer: lowercase, then replace every run of characters
# other than a-z and 0-9 with a space. This also splits handles and URLs
# at '_', '.', '/' and ':'.
def tokenize(text):
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return cleaned.split()

# Apply tokenizer to each sample tweet
tokenized = [tokenize(t) for t in sample_texts]
print("Tokens after preprocessing:")
for i, toks in enumerate(tokenized, 1):
    print(f"Tweet {i} tokens: {toks}")

# Build vocabulary from the cleaned tokens
vocab = sorted(set().union(*tokenized))
word2idx = {w: i for i, w in enumerate(vocab)}

# Create one-hot presence vectors
vectors = []
for toks in tokenized:
    s = set(toks)
    vec = [1 if w in s else 0 for w in vocab]
    vectors.append(vec)

print('Vocabulary size (cleaned):', len(vocab))
print('Vocabulary:', vocab)

# Display one-hot vectors and overlap
for i, vec in enumerate(vectors, 1):
    print(f"One-hot vector for tweet {i}: {vec}")

# Compute overlap (dot product) between the two vectors
if len(vectors) >= 2:
    from numpy import dot
    overlap = dot(vectors[0], vectors[1])
    print('Overlap (dot product) after preprocessing:', overlap)


Tokens after preprocessing:
Tweet 1 tokens: ['rt', 'mike', 'pence', 'huge', 'crowd', 'gathered', 'tonight', 'at', 'snhu', 'arena', 'in', 'manchester', 'nh', 'for', 'realdonaldtrump', 'https', 't', 'co', 'svnb8xwhkm']
Tweet 2 tokens: ['springsteen', 'said', 'hillary', 'was', 'born', 'to', 'run', 'she', 'can', 't', 'even', 'walk', 'realdonaldtrump', 'tomilahren', 'weneedtrump']
Vocabulary size (cleaned): 32
Vocabulary: ['arena', 'at', 'born', 'can', 'co', 'crowd', 'even', 'for', 'gathered', 'hillary', 'https', 'huge', 'in', 'manchester', 'mike', 'nh', 'pence', 'realdonaldtrump', 'rt', 'run', 'said', 'she', 'snhu', 'springsteen', 'svnb8xwhkm', 't', 'to', 'tomilahren', 'tonight', 'walk', 'was', 'weneedtrump']
One-hot vector for tweet 1: [1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
One-hot vector for tweet 2: [0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1]
Overlap (dot product) after preprocessing: 2

### Discussion

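After normalization, casing and attached punctuation no longer separate otherwise identical tokens, so the two tweets now share two vocabulary entries: `realdonaldtrump` and `t`. Only the first is a meaningful match; `t` is an artifact of the regex splitting both `t.co` (from the URL) and `can't`. For the same reason the vocabulary grows from 29 to 32 entries rather than shrinking, because the URL and `@mike_pence` are broken into several fragments. With only two tweets the overlap remains small. The next sections build richer representations that capture context and frequency.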
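
One possible refinement, sketched here as an assumption rather than as part of this notebook's pipeline, is to drop URLs before tokenizing and to keep underscores and apostrophes inside words, so that fragments such as `https`, `t`, and `co` never enter the vocabulary. The function name `tokenize_tweet`, both regular expressions, and the shortened example tweets are illustrative.

```python
import re

# Hypothetical tweet-aware tokenizer (not part of the original pipeline)
URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[a-z0-9_]+(?:'[a-z]+)?")

def tokenize_tweet(text):
    """Lowercase, drop URLs, and keep handles and contractions whole."""
    no_urls = URL_RE.sub(" ", text.lower())
    return TOKEN_RE.findall(no_urls)

# Shortened versions of the two sample tweets
examples = [
    "RT @mike_pence: Huge crowd for @realDonaldTrump! https://t.co/SvnB8xWHKm",
    "She can't even walk. @realDonaldTrump",
]
tokens = [tokenize_tweet(t) for t in examples]
for toks in tokens:
    print(toks)

# Terms shared by the two example tweets
shared = set(tokens[0]) & set(tokens[1])
print(shared)
```

With this variant the only shared term is the handle itself. Whether to keep, strip, or replace handles and URLs with placeholder tokens depends on the downstream task.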

## Word × word co-occurrence matrix

Co-occurrence counts capture how often words appear near each other within a sliding window. We define a window size \(k\) and slide across each tweet, counting word pairs.

In [4]:

import numpy as np

# Window size (feel free to experiment)
k = 2

# Initialize co-occurrence matrix
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)

# Populate co-occurrence counts
for tokens in tokenized:
    for i, w in enumerate(tokens):
        if w not in word2idx:
            continue
        wi = word2idx[w]
        # look at k neighbors on each side
        window = tokens[max(0, i - k): i] + tokens[i + 1: i + k + 1]
        for u in window:
            if u in word2idx and u != w:
                ui = word2idx[u]
                cooc[wi, ui] += 1

print("Co-occurrence matrix (rows and columns in vocab order):")
print(cooc)


Co-occurrence matrix (rows and columns in vocab order):
[[0 1 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
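
Because NumPy elides the middle of the 32 × 32 matrix, the counting rule is easier to check on a toy input. The sketch below applies the same symmetric window to a four-token sequence taken from tweet 1, using `k = 1`. The helper name `window_cooc` is an assumption introduced for illustration.

```python
import numpy as np

def window_cooc(tokens, vocab, k):
    """Count, for each word, the neighbors within k positions on either side."""
    idx = {w: i for i, w in enumerate(vocab)}
    mat = np.zeros((len(vocab), len(vocab)), dtype=int)
    for i, w in enumerate(tokens):
        neighbors = tokens[max(0, i - k): i] + tokens[i + 1: i + k + 1]
        for u in neighbors:
            if u != w:  # skip self-pairs, as in the cell above
                mat[idx[w], idx[u]] += 1
    return mat

toy_tokens = ["huge", "crowd", "gathered", "tonight"]
toy_vocab = sorted(set(toy_tokens))  # ['crowd', 'gathered', 'huge', 'tonight']
toy_cooc = window_cooc(toy_tokens, toy_vocab, k=1)
print(toy_vocab)
print(toy_cooc)
```

The result is symmetric because every pair inside the window is counted once from each side. Here `crowd` has two neighbors (`huge` and `gathered`), while `huge` and `tonight`, at the ends of the sequence, have one each.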


## Word × document (term-frequency) matrix

A term-frequency (TF) matrix records how many times each word appears in each document. This is the foundation for bag-of-words and TF–IDF representations.

In [5]:

from collections import Counter

# Number of documents
num_docs = len(tokenized)

# Term-frequency matrix: rows = words, columns = tweets
import numpy as np
tf = np.zeros((len(vocab), num_docs), dtype=int)

for j, tokens_list in enumerate(tokenized):
    counts = Counter(tokens_list)
    for i, w in enumerate(vocab):
        tf[i, j] = counts[w]

print('Term-frequency matrix:')
print(tf)


Term-frequency matrix:
[[1 0]
 [1 0]
 [0 1]
 [0 1]
 [1 0]
 [1 0]
 [0 1]
 [1 0]
 [1 0]
 [0 1]
 [1 0]
 [1 0]
 [1 0]
 [1 0]
 [1 0]
 [1 0]
 [1 0]
 [1 1]
 [1 0]
 [0 1]
 [0 1]
 [0 1]
 [1 0]
 [0 1]
 [1 0]
 [1 1]
 [0 1]
 [0 1]
 [1 0]
 [0 1]
 [0 1]
 [0 1]]


### Deriving co-occurrence from the TF matrix

Multiplying the TF matrix by its transpose (\(TF \times TF^\top\)) gives a simple co-occurrence count where counts are aggregated at the document level. (This ignores sliding windows and instead counts words co-occurring within the same document.)

In [6]:

# Compute a document-level co-occurrence estimate
cooc_from_tf = tf @ tf.T
print('Co-occurrence derived from TF (TF @ TF.T):')
print(cooc_from_tf)


Co-occurrence derived from TF (TF @ TF.T):
[[1 1 0 ... 0 0 0]
 [1 1 0 ... 0 0 0]
 [0 0 1 ... 1 1 1]
 ...
 [0 0 1 ... 1 1 1]
 [0 0 1 ... 1 1 1]
 [0 0 1 ... 1 1 1]]
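
A small worked case makes the structure of this product easier to read. The sketch below uses a made-up TF matrix with three words and two documents; the words and counts are illustrative and not taken from the dataset.

```python
import numpy as np

# Hypothetical TF matrix: rows = words, columns = documents
toy_words = ["crowd", "run", "trump"]
toy_tf = np.array([
    [2, 0],  # "crowd" appears twice in doc 0
    [0, 1],  # "run" appears once in doc 1
    [1, 1],  # "trump" appears once in each doc
])

# Entry (a, b) sums, over documents, count(a) * count(b)
toy_doc_cooc = toy_tf @ toy_tf.T
print(toy_doc_cooc)
```

Each off-diagonal entry is zero when the two words never share a document (`crowd` and `run` here), and each diagonal entry is the sum of a word's squared counts rather than a co-occurrence, so the diagonal is often ignored or zeroed out.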


## TF–IDF weighting

Term frequency–inverse document frequency (TF–IDF) downweights very common words and upweights terms that are rare across documents. It is widely used in information retrieval and text mining.
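
For reference, the smoothed variant computed in the cell below can be written as follows, where \(N\) is the number of documents, \(\mathrm{df}(t)\) is the number of documents containing term \(t\), and \(\mathrm{tf}(t, d)\) is the raw count of \(t\) in document \(d\):

\[
\mathrm{idf}(t) = \ln\frac{N + 1}{\mathrm{df}(t) + 1} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\]

With \(N = 2\), a term found in one document gets \(\ln(3/2) + 1 \approx 1.405\), and a term found in both gets \(\ln(3/3) + 1 = 1\).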

In [7]:

import numpy as np

# Compute document frequency (df): number of documents where each word appears
df = (tf > 0).sum(axis=1)

# Number of documents
N = num_docs

# Inverse document frequency (smoothed)
idf = np.log((N + 1) / (df + 1)) + 1

# Compute TF–IDF matrix
# tf is integer counts; broadcasting idf across columns
tfidf = tf * idf[:, None]

print('IDF vector:')
print(np.round(idf, 3))
print('TF–IDF matrix:')
print(np.round(tfidf, 3))


IDF vector:
[1.405 1.405 1.405 1.405 1.405 1.405 1.405 1.405 1.405 1.405 1.405 1.405
 1.405 1.405 1.405 1.405 1.405 1.    1.405 1.405 1.405 1.405 1.405 1.405
 1.405 1.    1.405 1.405 1.405 1.405 1.405 1.405]
TF–IDF matrix:
[[1.405 0.   ]
 [1.405 0.   ]
 [0.    1.405]
 [0.    1.405]
 [1.405 0.   ]
 [1.405 0.   ]
 [0.    1.405]
 [1.405 0.   ]
 [1.405 0.   ]
 [0.    1.405]
 [1.405 0.   ]
 [1.405 0.   ]
 [1.405 0.   ]
 [1.405 0.   ]
 [1.405 0.   ]
 [1.405 0.   ]
 [1.405 0.   ]
 [1.    1.   ]
 [1.405 0.   ]
 [0.    1.405]
 [0.    1.405]
 [0.    1.405]
 [1.405 0.   ]
 [0.    1.405]
 [1.405 0.   ]
 [1.    1.   ]
 [0.    1.405]
 [0.    1.405]
 [1.405 0.   ]
 [0.    1.405]
 [0.    1.405]
 [0.    1.405]]
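
The introduction lists distance metrics among the notebook's goals. As a minimal sketch, cosine similarity between two documents can be computed from the columns of a TF–IDF matrix. The three-term matrix below is a hypothetical excerpt chosen to mirror the values above (two document-specific terms and one shared term), and the helper name `cosine_similarity` is an assumption.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0

# Hypothetical TF-IDF matrix: rows = terms, columns = documents
toy_tfidf = np.array([
    [1.405, 0.0],
    [0.0, 1.405],
    [1.0, 1.0],   # a term shared by both documents
])

doc0, doc1 = toy_tfidf[:, 0], toy_tfidf[:, 1]
sim = cosine_similarity(doc0, doc1)
print("cosine similarity:", round(sim, 3))
print("cosine distance:", round(1 - sim, 3))
```

Cosine similarity depends only on the direction of the vectors, so it is less sensitive to document length than the raw dot product used earlier for one-hot overlap.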


## Conclusion

In this notebook we explored how to represent text data for analysis. We started with naive one-hot encodings and observed how the lack of preprocessing left the two sample tweets with no overlapping dimensions. After tokenization and normalization, we built vocabularies, word × word co-occurrence matrices, bag-of-words (term-frequency) matrices, and TF–IDF weights. These techniques lay the groundwork for more advanced models such as word embeddings and topic models.