In [None]:
## Import useful packages
import pandas as pd # dataframes
import numpy as np
import nltk
import matplotlib.pyplot as plt
import statistics

## Download Data Stored on Google Drive

6,417 posts with many different features (for this exercise, we will only need the 'Text' feature of the dataframe)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [16]:
#!pip3 install datasets # uncomment if the code isn't working!
from datasets import load_dataset
data = load_dataset("csv", data_files="/content/drive/MyDrive/DATA/ALL_DATA_clean.csv")

Generating train split: 0 examples [00:00, ? examples/s]

In [17]:
data

DatasetDict({
    train: Dataset({
        features: ['level_0', 'Thread_ID', 'Text', 'User', 'Date', 'Comments', 'UpVotes', 'Flair', 'URL', 'Top', 'Hot', 'Rising', 'New', 'Controversial', 'Topic', 'Sentiment'],
        num_rows: 6417
    })
})

In [None]:
sequence = data['train']['Text'][50] # the column is named 'Text' (capital T) in this dataset
sequence

'how do you guys stop the chase or the “not enough” feeling with adderall irs? some days i don’t need any, and others i feel like i constantly need another to keep momentum. i struggled with stims in the past and took a five year break from all of it. this year i was put back on adderall because my adhd was getting out of hand. the past few months have been great but lately i feel like i’m chasing the dragon. my diet has been poor lately as is my hydration so it may play a part. how do you guys center yourselves and get the most out of your meds?  thanks!'

In BERT **uncased**, the text is lowercased (and accents are stripped) before the WordPiece tokenization step, while in BERT **cased** the text is passed through unchanged. Named Entity Recognition and Part-of-Speech tagging are two applications where case information matters, so BERT cased is the better choice for them.
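
To make the cased/uncased difference concrete, here is a minimal sketch of the kind of clean-up an uncased model applies before WordPiece. The helper name `uncased_normalize` and the sample sentence are made up for illustration; this is an approximation written with the standard library, not the tokenizer's actual code.

```python
import unicodedata

def uncased_normalize(text: str) -> str:
    """Approximate 'uncased' clean-up: lowercase, then strip accents."""
    lowered = text.lower()
    # NFD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category Mn), which removes the accents
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

sample = "Adderall XR helped my ADHD, said Zoë"  # hypothetical sentence
print(uncased_normalize(sample))  # what an uncased pipeline would roughly see
print(sample)                     # a cased pipeline keeps the text as-is
```

The capital letters that mark "ADHD" or "XR" as names disappear in the first line, which is the information NER and POS tagging tend to rely on.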

In [None]:
from tokenizers import Tokenizer

# BertWordPieceTokenizer has no from_pretrained method (it raised an AttributeError),
# so the pretrained tokenizer is loaded through the generic Tokenizer class instead
tokenizer = Tokenizer.from_pretrained("bert-base-cased")
tokenized_sequence = tokenizer.encode(sequence)
#print(tokenized_sequence.tokens)

## Investigate the number of words in each post

I was curious how much posts vary in word count and what the distribution of word counts looks like. This was necessary because some posts may need to be filtered out, either because they contain too little (1 word or fewer) or too much (over 10,000 words) information. The min, max, and average number of words, along with a histogram, are given below.

In [None]:
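# `texts` is the dataframe that holds the post text in its 'text' column
word_counts = []
for item in texts['text']:
  word_counts.append(len(item.split())) # whitespace-separated word count per post
plt.hist(word_counts, bins=50, range=[0, 1000]) # posts above 1,000 words fall outside the plotted range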

In [None]:
print(f"Max: {max(word_counts)} Min: {min(word_counts)} Average: {statistics.mean(word_counts)}")

Max: 5797 Min: 1 Average: 147.55165965404396


In [None]:
indices_above1000  = [index for (index, item) in enumerate(word_counts) if item > 1000]
indices_above500  = [index for (index, item) in enumerate(word_counts) if item > 500]
len(indices_above1000) # 31 posts with word counts above 1000
#indices_above10000
len(indices_above500) # 205 above 500 words

205

Based on this investigation, the average post is about 148 words long, with many long outliers in the upper tail (a boxplot would make this visible). For example, 31 posts have more than 1,000 words and 205 posts have more than 500. The longest post has 5,797 words while the shortest has 1, so very short posts may be something I should filter out.
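
One way to turn "many upper outliers" into a filtering rule is the usual boxplot fence (third quartile plus 1.5 times the interquartile range). The sketch below is only an illustration on made-up counts; `upper_fence`, `filter_by_length`, and the `min_words=2` default are assumptions, not part of the notebook.

```python
import statistics

def upper_fence(counts):
    """Boxplot-style upper fence: Q3 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(counts, n=4)
    return q3 + 1.5 * (q3 - q1)

def filter_by_length(counts, min_words=2, max_words=None):
    """Return the indices of posts whose word count lies within the bounds."""
    if max_words is None:
        max_words = upper_fence(counts)  # assumed default: drop boxplot outliers
    return [i for i, n in enumerate(counts) if min_words <= n <= max_words]

# Made-up word counts with one very short and one very long post
toy_counts = [1, 40, 85, 120, 150, 160, 210, 300, 5797]
kept = filter_by_length(toy_counts)
print(upper_fence(toy_counts), kept)
```

On the real data, `word_counts` would take the place of `toy_counts`, and the returned indices could be used to select rows of the dataframe.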

## Count Number of Unique Users and their Posting Activity

Based on the code below, there are 4,535 unique users, and 230 posts have no associated username (NaN). Of these users, 3,726 posted exactly once, 652 posted 2-3 times, 141 posted 4-10 times, 14 posted 11-20 times, and 2 posted more than 20 times (the maximum being 37 posts). The dictionary printed below maps the number of posts made by a single user (key) to the number of users with that many posts (value).

In [None]:
#df['User'].value_counts()
#len(df[df['User'].isna()==True]) # get number of posts with no user (NaN)
import collections
count_users = list(df['User'].value_counts())
frequency = collections.Counter(count_users)
print(dict(frequency))

{37: 1, 21: 1, 19: 1, 17: 3, 16: 2, 15: 1, 14: 1, 12: 2, 11: 4, 10: 2, 9: 9, 8: 6, 7: 18, 6: 20, 5: 26, 4: 60, 3: 135, 2: 517, 1: 3726}
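
The activity bands quoted above can be recomputed directly from the printed dictionary. This is a small sketch; the band labels and the `bins` layout are my own choice for illustration.

```python
# Dictionary copied from the output above: posts per user -> number of users
frequency = {37: 1, 21: 1, 19: 1, 17: 3, 16: 2, 15: 1, 14: 1, 12: 2, 11: 4,
             10: 2, 9: 9, 8: 6, 7: 18, 6: 20, 5: 26, 4: 60, 3: 135, 2: 517, 1: 3726}

# Inclusive (low, high) bounds for each activity band
bins = {"1": (1, 1), "2-3": (2, 3), "4-10": (4, 10),
        "11-20": (11, 20), "21+": (21, float("inf"))}

users_per_bin = {
    label: sum(n_users for n_posts, n_users in frequency.items() if low <= n_posts <= high)
    for label, (low, high) in bins.items()
}

total_users = sum(frequency.values())                       # unique named users
total_posts = sum(n * users for n, users in frequency.items())  # posts with a username

print(users_per_bin)
print(total_users, total_posts)
```

The difference between the 6,417 rows in the dataset and `total_posts` is the 230 posts without a username.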


## Tokenize each Post

Loop through every post and split it into tokens with NLTK's word_tokenize (wrapped in the make_tokens generator below), then collect the per-post token lists into one list. Note that punctuation is kept and no additional filtering is done...

In [None]:
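from nltk.tokenize import word_tokenize
#nltk.download('punkt') # uncomment if the Punkt models are missing (newer NLTK releases may ask for 'punkt_tab')
def make_tokens(data):
  for item in data:
    yield word_tokenize(item) # tokenize a given post
tokens_list = list(make_tokens(texts['text']))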

## Find the total number of unique tokens

In [None]:
unique_tokens = dict()
for tokens in tokens_list:
  for token in tokens:
    if token not in unique_tokens.keys():
      unique_tokens[token] = 1
    else:
      unique_tokens[token] = unique_tokens[token]+1

#len(unique_tokens)
sorted_unique_tokens = {k: v for k, v in sorted(unique_tokens.items(), key=lambda item: item[1], reverse=True)}
sorted_unique_tokens

Based on the code above, there are 22,686 unique tokens in the corpus of Reddit posts I will be working with.
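
The same frequency table can be built more compactly with `collections.Counter`. A minimal sketch on two made-up, already-tokenized posts (the toy data is an assumption; in the notebook `tokens_list` would play this role):

```python
from collections import Counter

# Two hypothetical posts, already split into tokens
toy_posts = [["i", "took", "my", "meds", "."],
             ["my", "meds", "help", "."]]

# Count every token across all posts in one pass
token_counts = Counter(token for post in toy_posts for token in post)

print(len(token_counts))            # number of unique tokens
print(token_counts.most_common(3))  # the most frequent tokens with their counts
```

`len(token_counts)` corresponds to `len(unique_tokens)` above, and `most_common()` replaces the manual sort by value.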

# Get the number of all unique tokens that also happen to be considered words

In [None]:
from nltk.corpus import words
#nltk.download('words')
english_words = set(words.words()) # build the word list once; set lookups are fast
word_tokens = dict()
for token in unique_tokens.keys():
  if token in english_words:
    word_tokens[token] = unique_tokens[token]

In [None]:
#word_tokens
#sum(word_tokens.values()) # get the total number of words found
len(word_tokens) # get the number of unique terms

9402

# Get number of tokens that ARE NOT considered words

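**WARNING:** the original version of this cell called `words.words()` inside the loop, so every membership test scanned the full word list and the cell took a very long time to execute. The version below builds a set once and looks tokens up in it.

In [None]:
english_words = set(words.words()) # build the lookup set once, outside the loop
not_word_tokens = dict()
for token in unique_tokens.keys():
  if token not in english_words:
    not_word_tokens[token] = unique_tokens[token]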

# AN ALTERNATIVE WAY TO PROCESS TEXT
(see code chunk below)

In [None]:
from gensim.utils import simple_preprocess
def tokenizer(data):
    for item in data:
      # lowercases, tokenizes, and (optionally) de-accents, so the output is a list of plain unicode tokens
      yield simple_preprocess(item, deacc=True)
tokens_list = list(tokenizer(texts['text'])) # convert tokens to list form
# simple_preprocess drops tokens shorter than min_len (default 2), which is why the word "i" disappears;
# the other words of the post remain intact (just split into words)...
posts_lst = texts['text'].tolist()
len(posts_lst)

## Tokenize by Sentence

In [None]:
from nltk.tokenize import sent_tokenize
sentence_tokens = []
for i in range(0,len(texts['text'])):
  tokens = sent_tokenize(texts['text'][i]) # tokenize a given post
  sentence_tokens.append(tokens)

In [None]:
total_sent = 0 # total number of sentences found in corpus
num_sent_lst = []
for post in sentence_tokens:
  num_sent = len(post) # get number of sentences in post
  num_sent_lst.append(num_sent)
  total_sent = total_sent + num_sent # add number of sentences to running total
print("Total Number of Sentences in Corpus: " + str(total_sent)) # print results

Total Number of Sentences in Corpus: 54586


Based on the work done above, there are 54,586 sentences in the post corpus, or about 8.5 sentences per post on average (54,586 / 6,417).

## EXTRA PRACTICE (Word Piece Tokenization is in the next section)

Using the link (https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt) provided in the assignment instructions for support!

In [None]:
# import needed packages
from transformers import AutoTokenizer
# use the "bert-base-cased" pretrained tokenizer...
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
from collections import defaultdict
word_freqs = defaultdict(int) # initialize empty dict of ints

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Gives an example of the output for the first post...

In [None]:
# tuples give the offset of that word in a sentence based on the words before and how long that word is
ex1 = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(texts['text'][0])
n_w = [word for word, offset in ex1]

Now do this for every post in the text and count the frequency of all unique words (similar to what I did before) so that the output is a dictionary of unique words and their total frequency/count in the corpus...

In [None]:
for post in texts['text']:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(post)
    new_words = [word for word, offset in words_with_offsets] # get the words from the post
    for word in new_words:
        word_freqs[word] += 1

# len(word_freqs)

Use the unique words to build a complete alphabet of characters, numbers, and emojis...

In [None]:
alphabet = []
for word in word_freqs.keys():
    if word[0] not in alphabet:
        alphabet.append(word[0])
    for letter in word[1:]:
        if f"##{letter}" not in alphabet:
            alphabet.append(f"##{letter}")

alphabet.sort()
# alphabet
# includes all characters and emojis

Build the starting vocabulary: BERT's special tokens first (they are used during preprocessing), followed by the alphabet from above...

In [None]:
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + alphabet.copy()

Loop through each unique word in the frequency dictionary created before. Then loop through every character in the given word and create a list of its chars. The first char will be as is and all the following chars will be prefixed by ##...

In [None]:
splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in word_freqs.keys()
}
# splits

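Write a function that computes the score of each pair (it is used at each step of the training). The score is the frequency of the pair divided by the product of the frequencies of its two parts...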

In [None]:
def compute_pair_scores(splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items(): # loop through all unique words and their frequencies
        split = splits[word] # get the list of pieces for that word (eg [w, ##o, ##r, ##d])
        if len(split) == 1: # if there is only one piece, eg [a] or [i]...
            letter_freqs[split[0]] += freq # count the frequency of that single-letter word
            continue
        for i in range(len(split) - 1): # loop through all pieces except the last one
            pair = (split[i], split[i + 1]) # get the piece and the piece right after it
            letter_freqs[split[i]] += freq # count the frequency of each piece
            pair_freqs[pair] += freq # count the frequency of each adjacent pair of pieces
        letter_freqs[split[-1]] += freq # add the last piece's frequency since the loop didn't reach it

    scores = { # score = pair frequency divided by the product of the frequencies of the two individual pieces
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores

In [None]:
# look at a part of this dictionary after the initial splits...
pair_scores = compute_pair_scores(splits)
for i, key in enumerate(pair_scores.keys()):
    print(f"{key}: {pair_scores[key]}")
    if i >= 5:
        break

('h', '##a'): 2.381161293631052e-06
('##a', '##v'): 1.562522606040896e-06
('##v', '##e'): 1.7951079864703374e-06
('o', '##n'): 1.2207334722530364e-06
('##n', '##e'): 1.5056709788849352e-07
('p', '##i'): 3.2271473483763843e-07
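
The scoring rule is easier to check by hand on a tiny corpus. The sketch below uses made-up words and frequencies (`toy_word_freqs`) and a compact re-implementation (`toy_pair_scores`) of the same formula; both names are only for illustration.

```python
from collections import defaultdict

# Made-up corpus: word -> frequency
toy_word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
# Initial splits: first character as-is, the rest prefixed with ##
toy_splits = {w: [c if i == 0 else f"##{c}" for i, c in enumerate(w)]
              for w in toy_word_freqs}

def toy_pair_scores(word_freqs, splits):
    """score(pair) = freq(pair) / (freq(first piece) * freq(second piece))"""
    piece_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        pieces = splits[word]
        for piece in pieces:
            piece_freqs[piece] += freq
        for left, right in zip(pieces, pieces[1:]):
            pair_freqs[(left, right)] += freq
    return {pair: f / (piece_freqs[pair[0]] * piece_freqs[pair[1]])
            for pair, f in pair_freqs.items()}

toy_scores = toy_pair_scores(toy_word_freqs, toy_splits)
toy_best = max(toy_scores, key=toy_scores.get)
print(toy_best, toy_scores[toy_best])
```

Here `##s` only ever appears after `##g`, so that pair wins even though `##u` pairs are far more frequent. By the same logic, a pair whose two pieces each occur exactly once, and together, scores 1 / (1 * 1) = 1.0, which is the highest score a pair can get.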


Find the pair with the best score...

In [None]:
best_pair = ""
max_score = None
for pair, score in pair_scores.items():
    if max_score is None or max_score < score:
        best_pair = pair
        max_score = score

print(best_pair, max_score)

('##😅', '##😩') 1.0


In [None]:
vocab.append("##😅😩") # add the merged highest-scoring pair to the vocabulary (both pieces are continuations, so the merged token keeps a single ## prefix)

In [None]:
len(vocab)

241

Apply that merge in our splits dictionary through another function...

In [None]:
def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                merge = a + b[2:] if b.startswith("##") else a + b
                split = split[:i] + [merge] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits

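Now look at the result of a merge. For illustration, the cell below applies the ("a", "##b") merge from the Hugging Face example rather than the emoji pair found above, since its effect is easier to see on an ordinary word...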

In [None]:
splits = merge_pair("a", "##b", splits)
splits["about"]

['ab', '##o', '##u', '##t']
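
The merge rule on a single word can be sketched as a standalone function. `merge_in_split` is a hypothetical helper written for this illustration; it follows the same convention as `merge_pair` (the second piece loses its `##` prefix when glued onto the first).

```python
def merge_in_split(split, a, b):
    """Return a copy of `split` with every adjacent (a, b) replaced by the merged piece."""
    merged = a + (b[2:] if b.startswith("##") else b)
    out, i = [], 0
    while i < len(split):
        if i < len(split) - 1 and split[i] == a and split[i + 1] == b:
            out.append(merged)  # glue the pair together
            i += 2              # skip both original pieces
        else:
            out.append(split[i])
            i += 1
    return out

print(merge_in_split(["a", "##b", "##o", "##u", "##t"], "a", "##b"))
print(merge_in_split(["h", "##u", "##g", "##s"], "##g", "##s"))
```

In the second call both pieces are continuations, so the merged piece keeps one `##` prefix (`##gs`).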

Now we have everything we need to loop until we have learned all the merges we want. Here I aim for a vocab size of 2,000, then look at the generated vocabulary.

In [None]:
vocab_size = 2000
while len(vocab) < vocab_size: # len(vocab) is 241
    scores = compute_pair_scores(splits)
    if not scores: # stop early if there are no pairs left to merge
        break
    best_pair, max_score = "", None
    for pair, score in scores.items():
        if max_score is None or max_score < score:
            best_pair = pair
            max_score = score
    splits = merge_pair(*best_pair, splits)
    new_token = (
        best_pair[0] + best_pair[1][2:]
        if best_pair[1].startswith("##")
        else best_pair[0] + best_pair[1]
    )
    vocab.append(new_token)

In [None]:
print(vocab)

['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', '!', '"', '#', '##0', '##1', '##2', '##3', '##4', '##5', '##6', '##7', '##8', '##9', '##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##j', '##k', '##l', '##m', '##n', '##o', '##p', '##q', '##r', '##s', '##t', '##u', '##v', '##w', '##x', '##y', '##z', '##¢', '##®', '##°', '##×', '##à', '##é', '##ê', '##ï', '##ʻ', '##\u200d', '##€', '##℃', '##℉', '##♀', '##♂', '##♥', '##✌', '##❤', '##️', '##\ufeff', '##🇺', '##🏻', '##🏼', '##🏽', '##🏾', '##👍', '##💕', '##💙', '##💛', '##💫', '##😃', '##😅', '##😆', '##😊', '##😌', '##😑', '##😖', '##😘', '##😜', '##😤', '##😩', '##😭', '##😳', '##😷', '##🙏', '##🤞', '##🤢', '##🤬', '##🤯', '##🤷', '##🥺', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~'

To tokenize a new text, we pre-tokenize it, split it, then apply the tokenization algorithm on each word. That is, we look for the biggest subword starting at the beginning of the first word and split it, then we repeat the process on the second part, and so on for the rest of that word and the following words in the text:
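
A minimal sketch of that longest-match-first step for a single word is shown below. The tiny `toy_vocab` is made up for illustration; with the notebook's objects, `vocab` would be passed instead.

```python
def encode_word(word, vocab):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    tokens = []
    while len(word) > 0:
        i = len(word)
        # Shrink the candidate prefix until it is found in the vocabulary
        while i > 0 and word[:i] not in vocab:
            i -= 1
        if i == 0:
            return ["[UNK]"]  # no prefix matched: the whole word is unknown
        tokens.append(word[:i])
        word = word[i:]
        if len(word) > 0:
            word = f"##{word}"  # the remainder is a continuation piece
    return tokens

# Hypothetical vocabulary, just large enough for the examples
toy_vocab = {"[UNK]", "h", "ab", "##u", "##gs", "##o", "##t"}

print(encode_word("hugs", toy_vocab))
print(encode_word("about", toy_vocab))
print(encode_word("hz", toy_vocab))
```

Note that a single unmatched piece turns the entire word into `[UNK]`, not just the part that failed to match.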

## Word Piece Tokenization

First, we need to save the dataframe with JUST the post text since that's all we will be needing for preprocessing. Below I save the dataframe to a csv in the same folder as the original data...

In [None]:
texts.to_csv("/content/drive/MyDrive/DATA/adderall_texts.csv") # save texts table only to file on drive for quick access

Next, I import the datasets library from the Hugging Face ecosystem, since this is the format I will be putting the Adderall post data into...

In [None]:
#!pip3 install datasets # uncomment if the code isn't working!
from datasets import load_dataset

Next, I read the csv file back in, this time directly into a Hugging Face dataset object. Below, we can see the format of this object: a DatasetDict containing a train key, with the features listed along with the total number of posts (i.e. rows). I could have added a split to create both train and test data, but I don't think this is necessary for this step at least...

In [None]:
raw_dataset = load_dataset("csv", data_files="/content/drive/MyDrive/DATA/adderall_texts.csv")

In [None]:
raw_dataset # print to see whats inside object
# raw_dataset['train']
# raw_dataset['train']['text'][0] # gives the first post text item in the data!

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'text'],
        num_rows: 6417
    })
})

Now that the dataset object is ready, I can create a Python generator, which avoids loading anything into memory until necessary. To do this I wrote a function containing a for loop with a yield statement, which allows more complex logic than a simple list comprehension. Calling the function doesn't fetch any elements of the dataset; it just creates an object to iterate over later. Texts are then loaded only when needed, 1,000 at a time, which prevents memory from being exhausted on a huge dataset. A generator can only be consumed once, so wrapping it in a function means a fresh one can be created whenever it is needed again.

In [None]:
def get_training_corpus():
    dataset = raw_dataset["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["text"]
training_corpus = get_training_corpus()
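
The batching and the "use once" behaviour of a generator can be seen on a plain list. This is a small sketch with made-up data; `batch_iterator` and `toy_texts` are hypothetical names.

```python
def batch_iterator(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

toy_texts = [f"post {i}" for i in range(7)]  # stand-in for the 'text' column

batches = batch_iterator(toy_texts, 3)
first_pass = [len(batch) for batch in batches]   # consumes the generator
second_pass = [len(batch) for batch in batches]  # nothing left to yield

print(first_pass, second_pass)
```

This is why `get_training_corpus()` would need to be called again before training a second tokenizer on the same data.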

Next, I instantiate a tokenizer from a pretrained model's vocabulary. I selected bert-base-cased because it is case sensitive and uses WordPiece tokenization...

In [None]:
from transformers import AutoTokenizer
old_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# another way is shown below:
# from transformers import BertTokenizer, BertModel
# tokenizer = BertTokenizer.from_pretrained('bert-base-cased')


In [None]:
# just trying out this tokenizer as is on the first data post...
old_tokens=old_tokenizer.tokenize(raw_dataset['train']['text'][0])
len(old_tokens)

105

Now that I have the right tokenizer, I can train a new tokenizer with the same algorithm on my own corpus of posts, using the generator created above. Note that this learns a new vocabulary from the posts rather than fine-tuning the old one. vocab_size (the second argument, 52000) is the target size of the new vocabulary. I was not sure what to set for this, so I used the same number as in the Hugging Face example...

In [None]:
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)

Here I define a function that applies the new tokenizer to the loaded dataset. I had to truncate posts because many are much longer than the maximum length the BERT model accepts. I wasn't sure whether padding is also necessary here (many posts are shorter than the max length), so I may need to add it later...

In [None]:
def tokenization(example):
  return tokenizer(example["text"], truncation=True) # most sentences are too long so we must truncate!
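
What truncation and padding do to a list of ids can be sketched in plain Python. This is a simplified illustration that ignores special tokens; `truncate_and_pad`, the toy ids, and `pad_id=0` are assumptions, not the tokenizer's actual implementation.

```python
def truncate_and_pad(ids, max_length, pad_id=0):
    """Clip ids to max_length, then right-pad with pad_id; also return the attention mask."""
    clipped = ids[:max_length]
    n_pad = max_length - len(clipped)
    padded = clipped + [pad_id] * n_pad
    attention_mask = [1] * len(clipped) + [0] * n_pad  # 1 = real token, 0 = padding
    return padded, attention_mask

long_post = [5, 6, 7, 8, 9]  # hypothetical token ids
short_post = [5, 6]

print(truncate_and_pad(long_post, 3))
print(truncate_and_pad(short_post, 4))
```

Truncation loses information from long posts, while padding only adds positions that the attention mask tells the model to ignore, so padding can be deferred until batches are built.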

In [None]:
dataset = raw_dataset.map(tokenization, batched=True)

Map:   0%|          | 0/6417 [00:00<?, ? examples/s]

Below we can see that the output has a new field (input_ids), which holds each post converted into integer ids that the model can read. Each id corresponds to exactly one token in the tokenizer's vocabulary, so the same id always stands for the same token. This means that, to get the number of unique WordPiece tokens, I needed to loop through all the ids and count the unique ones. That count is printed below and ended up being 17,635. Note that it includes special tokens such as [CLS] and [SEP], and it only covers the text that survived truncation...

In [None]:
dataset

In [None]:
uniq_tokens = set()
for ids in dataset['train']['input_ids']: # read the column once instead of re-reading it for every row
  uniq_tokens.update(ids) # collect all unique ids, which should match the number of unique wordpiece tokens

In [None]:
len(uniq_tokens)

17635
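
Because the ids were produced with truncation, this count only reflects tokens that appear before the cut-off. The toy sketch below (made-up ids, hypothetical variable names) shows how the unique count can shrink once sequences are clipped.

```python
# Hypothetical tokenized posts: each inner list is one post's input_ids
toy_input_ids = [[2, 15, 27, 15, 3],
                 [2, 27, 99, 3]]

# Unique ids over the full sequences
unique_full = set()
for ids in toy_input_ids:
    unique_full.update(ids)

# Unique ids when every sequence is clipped to its first 3 ids
unique_truncated = set()
for ids in toy_input_ids:
    unique_truncated.update(ids[:3])

print(len(unique_full), len(unique_truncated))
```

Counting on untruncated ids would give the vocabulary actually used by the whole corpus, if that is the number of interest.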