# 1. Exploratory Data Analysis





## 1.1 Let's load our data

The raw CSVs needed a little extra cleaning, which I did in Google Sheets rather than in Pandas (just because it was quicker)! The cleaned data is loaded below. :)

However, please feel free to experiment with loading your data in a different way! :)

In [1]:
!pip install transformers torch pandas gspread gspread-dataframe urlextract emoji langdetect



In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [3]:
import pandas as pd
import numpy as np
import threading, queue
import urlextract
import nltk
import time
import os
import re

from langdetect import detect

In [4]:
official_forum_names = {
    'cartalk'      : 'Car Talk Community',
    'amazon'       : 'Amazon Seller Forums',
    'tnation'      : 'Forums - T Nation',
    'episode'      : 'Episode Forums',
    'github'       : 'GitHub Support Community',
    'l1t'          : 'Level1Techs Forums',
    'smartthings'  : 'SmartThings Community',
    'gearbox'      : 'The Official Gearbox Software Forums',
    'elasticstack' : 'Discuss the Elastic Stack',
    'revolut'      : 'Revolut Community'
}

ROOT = '/content/gdrive/My Drive/'
INP_FOLDER = 'data'
url_extractor = urlextract.URLExtract()

In [5]:
def get_files(file_name_pattern=r'(?s).*', folder=INP_FOLDER):
  """
  Returns a list of all files that match `file_name_pattern` in `folder`.
  """
  files = []
  for file_name in os.listdir(os.path.join(ROOT, folder)):
    if re.search(file_name_pattern, file_name):
      files.append(os.path.join(folder, file_name))
  return files

def load_posts(forum_names=official_forum_names, folder=INP_FOLDER):
  """
  Returns a dictionary where the keys are those specified in 
  `forum_names` and the values are dataframes that contain the associated
  post data for each forum in `forum_names`.
  """
  posts = dict()
  # Pass `folder` through so a non-default folder is respected; escape the dot in '.csv'.
  files = [get_files(rf'{n}_posts\.csv', folder)[0] for n in forum_names]
  for forum, fname in zip(forum_names, files):
    posts[forum] = pd.read_csv(os.path.join(ROOT, fname), index_col=0).fillna('')
  return posts

In [6]:
df_all = pd.concat(load_posts().values())

In [7]:
df_all.shape

(692390, 15)

In [8]:
df_all[['text', 'forum']].sample(5)

Unnamed: 0,text,forum
118496,"Yes I have, it’s @epi.verve (like here)",Episode Forums
73351,"Hey everyone, I’m 40 years old 6’1” 210lbs and...",Forums - T Nation
107957,"yea, they are playing on normal level 40 to sh...",The Official Gearbox Software Forums
84991,Have several laser sploders in different eleme...,The Official Gearbox Software Forums
108692,,The Official Gearbox Software Forums


## 1.2 Data Cleansing and Prep

Now that we've loaded the data, we must remove noise from the dataset. Please explore some techniques for cleaning the data so that we can see how well pre-trained BERT works on our dataset. Luckily, due to the way BERT tokenises the data, we don't need the same extent of preprocessing that previous NLP models required. However, we still need to - 

1. Filter nulls
2. Filter for duplicates
3. [Optional] Remove post text that is not covered by the pre-trained BERT vocabulary. Later, we will leave this in for fine-tuning.
  * Hyperlinks 
  * Foreign languages - there are multilingual BERT models
  * Any more you can think of?
4. Encode the labels - map categorical labels to numerical values

* See [here](https://drive.google.com/open?id=1PotbhjemiMobHu0Loy-mDHIumdJh-LxC) for Pandas cleaning tutorials
* See [here](https://www.analyticsvidhya.com/blog/2020/04/beginners-guide-exploratory-data-analysis-text-data/#3) for beginner EDA tutorial for NLP


### 1.2.1 Filter nulls


In [9]:
# There shouldn't be any nulls because `fillna('')` replaced them when loading
df_all.isnull().sum()

post_id              0
username             0
created_at           0
cooked               0
post_num             0
updated_at           0
reply_count          0
reply_to_post_num    0
reads                0
topic_id             0
user_id              0
topic_slug           0
type                 0
forum                0
text                 0
dtype: int64
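
Because `load_posts` calls `fillna('')`, missing text shows up as empty strings rather than nulls (row 108692 in the sample above has an empty `text`), so `isnull()` cannot see it. A minimal sketch of catching those rows, using a small made-up frame (`toy` and its contents are hypothetical) in place of `df_all`:

```python
import pandas as pd

# Toy frame standing in for df_all (hypothetical data).
toy = pd.DataFrame({
    'text':  ['Hello there', '', '   ', 'Another post'],
    'forum': ['Episode Forums'] * 4,
})

# isnull() sees nothing, because '' is a valid string value.
n_null = int(toy['text'].isnull().sum())

# Treat empty or whitespace-only text as missing instead.
is_blank = toy['text'].str.strip().eq('')
toy_clean = toy[~is_blank]

print(n_null, int(is_blank.sum()), len(toy_clean))
```

The same mask could be applied to `df_all['text']` before moving on.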

### 1.2.2 Filter for duplicates


In [10]:
df_all = df_all.drop_duplicates()
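
`drop_duplicates()` with no arguments only drops rows that are identical in every column. If the same post was collected under more than one `type` (for example both `latest` and `top`; whether that actually happens in this dataset is an assumption worth checking), those rows would survive. A sketch on made-up data of de-duplicating on the identifying columns instead:

```python
import pandas as pd

# Hypothetical rows: post 1 appears under two types, post 2 is an exact repeat.
toy = pd.DataFrame({
    'post_id': [1, 1, 2, 2],
    'forum':   ['Episode Forums'] * 4,
    'type':    ['latest', 'top', 'latest', 'latest'],
    'text':    ['hi', 'hi', 'hello', 'hello'],
})

exact = toy.drop_duplicates()                               # all columns must match
by_post = toy.drop_duplicates(subset=['forum', 'post_id'])  # one row per post

print(len(exact), len(by_post))
```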

### 1.2.4 Encode the labels

We then need to encode the labels.

In [11]:
df_all['forum'] = df_all['forum'].astype('category')
df_all['forum_name_encoded'] = df_all['forum'].cat.codes.astype('int32')
df_all.sample(5)

Unnamed: 0,post_id,username,created_at,cooked,post_num,updated_at,reply_count,reply_to_post_num,reads,topic_id,user_id,topic_slug,type,forum,text,forum_name_encoded
102496,7378890,LynnAnn,2020-07-28T12:29:48.525Z,<p>Hey! Thanks for this thread. Here are my st...,15,2020-07-28T12:29:48.525Z,0,,18,421580,194415,story-recommendation-drop-yours,latest,Episode Forums,Hey! Thanks for this thread. Here are my stori...,3
31418,13636,darxtrix,2018-11-28T15:19:23.020Z,"<p>Hi,</p>\n<p>I am not able to find the query...",1,2020-05-23T05:33:32.359Z,0,,1,13633,20603,creating-a-github-release-using-github-api-v4,latest,GitHub Support Community,"Hi,\nI am not able to find the query/mutations...",5
39712,4412052,tj_s_emporium,2018-08-03T23:48:48.342Z,"<aside class=""quote no-group"" data-post=""4"" da...",14,2018-08-03T23:51:31.530Z,0,4.0,271,390324,70251,separate-husband-and-wife-seller-accounts-for-...,top,Amazon Seller Forums,\n\n\n goodseller_2483:\n\njust change the SS ...,0
98138,5946656,nat_zero_six,2020-06-25T23:19:34.410Z,<p>“we are back” - melee amara enthusiast.</p>,4,2020-06-25T23:27:56.598Z,0,,148,4539628,1501535,how-phase-2-action-skill-melee-damage-scaling-...,latest,The Official Gearbox Software Forums,“we are back” - melee amara enthusiast.,9
76354,6602822,GO-Logan,2020-05-03T23:17:58.614Z,<p>I can’t help you with that so I would try a...,8,2020-05-03T23:17:58.614Z,1,6.0,39,373543,186258,bathroom-mirror-reflection-scene-help,top,Episode Forums,I can’t help you with that so I would try and ...,3
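
To make the encoding concrete, here is a small sketch of what `astype('category')` and `cat.codes` do, and how to map the codes back to forum names later (for example when reading predictions). The series below is made up; the codes follow the sorted order of the category names, which is why `Amazon Seller Forums` is 0 in the output above.

```python
import pandas as pd

forums = pd.Series([
    'Episode Forums', 'Amazon Seller Forums',
    'Episode Forums', 'Car Talk Community',
]).astype('category')

codes = forums.cat.codes                                  # integer label per row
id_to_forum = dict(enumerate(forums.cat.categories))      # code -> forum name

print(list(codes))
print(id_to_forum)
```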


### 1.2.3 [Optional] Filter noise

Remove post text that is not covered by the pre-trained BERT vocabulary. Later, we will leave this in for fine-tuning.

* Emojis
* Hyperlinks
* Foreign languages
* Any more you can think of?

**How do we filter for these anomalies?** 

Perhaps we could split strings on spaces and remove the pieces that match markdown link or image syntax (`[]()` or `![]()`)?

We can always see how BERT performs with dirty data and then perform further pre-processing as we move forward such as expanding contractions etc.
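
As one possible answer to the question above, here is a regex-only sketch that strips markdown links or images and bare URLs. The patterns and the `strip_links` helper are illustrative assumptions, not what this notebook uses (the cell below relies on `urlextract` instead), and they will miss unusual URL forms.

```python
import re

# Markdown link or image: keep the label in [], drop the target in ().
MD_LINK = re.compile(r'!?\[([^\]]*)\]\([^)]*\)')
# Bare http(s) URL up to the next whitespace.
BARE_URL = re.compile(r'https?://\S+')

def strip_links(text):
    text = MD_LINK.sub(r'\1', text)
    text = BARE_URL.sub('', text)
    return ' '.join(text.split())   # collapse leftover whitespace

print(strip_links('See [the docs](https://example.com/a) or https://foo.bar/x now'))
```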




In [12]:
def replace_urls(text, repl=''):
  urls = list(set(url_extractor.find_urls(text)))
  urls.sort(key=lambda url: len(url), reverse=True)
  for url in urls:
    text = text.replace(url, repl)
  return text

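from langdetect.lang_detect_exception import LangDetectException

def is_english(text):
  # langdetect raises when the text has no usable features (e.g. only whitespace).
  try:
    return len(text.strip()) >= 3 and detect(text) == 'en'
  except LangDetectException:
    return False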

def remove_noise_serial(df, verbose=False):
  copy = df.copy()
  if verbose: print("Removing hyperlinks...", end='')
  copy['text'] = copy['text'].apply(replace_urls)
  if verbose: print("Done!")
  if verbose: print("Removing non-english posts...", end='')
  copy = copy[copy['text'].str.replace(r'[^A-Za-z]+', ' ', regex=True).apply(is_english)]
  if verbose: print("Done!")
  if verbose: print("Removing emojis and symbols...", end='')
  copy['text'] = copy['text'].str.encode('ascii', 'ignore')\
                             .str.decode('ascii')\
                             .str.replace(r'[^A-Za-z0-9]+', ' ', regex=True)\
                             .str.split()\
                             .str.join(" ")
  if verbose: print("Done!")
  return copy

def report_progress(q, event, time_between_reports=5):
  qsize_init = q.qsize()
  last_report_time = time.time()
  while not q.empty() and event.is_set():
    time_elapsed = time.time() - last_report_time
    if time_elapsed >= time_between_reports:
      portion_finished = 1 - q.qsize() / qsize_init
      print("\tProgress: {:.1%}".format(portion_finished))
      last_report_time = time.time()
    time.sleep(time_between_reports)

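def task(q, lock, event, file_path):
  while event.is_set():
    try:
      chunk = q.get_nowait()  # avoids blocking forever if another thread took the last item
    except queue.Empty:
      break
    out = remove_noise_serial(chunk)
    with lock:  # serialise appends to the shared CSV
      out.to_csv(file_path, mode='a', header=False, index=False)
    q.task_done()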

def remove_noise_parallel(df, num_threads, chunk_size, file_path, verbose=False, report_freq=5):
  run_event = threading.Event()
  run_event.set()
  lock = threading.Lock()
  work_q = queue.Queue()
  threads = []

  if verbose: print('Placing tasks in work queue...', end='')
  for d in np.array_split(df, chunk_size):
    work_q.put(d)
  if verbose: print('\tdone!')

  if verbose: print('Starting threads...', end='')
  for i in range(num_threads):
    threads.append(threading.Thread(target=task, name=f'thread {i}', args=(work_q, lock, run_event, file_path, )))
    threads[i].start()
  if verbose: print('\t\tdone!')

  if verbose: print("Processing data:")
  try:
    if verbose: 
      report_progress(work_q, run_event, report_freq)
    work_q.join()
  except (KeyboardInterrupt, SystemExit):
    run_event.clear()
    for t in threads:
      t.join()
  if verbose: print("\nExiting!")
  
  if any([t.is_alive() for t in threads]):
    print("WARNING: some threads may still be active!")
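
One detail to be aware of in `remove_noise_parallel`: `np.array_split` takes the *number of sections* as its second argument, not the size of each section, so the `chunk_size` parameter effectively sets how many chunks go on the queue. A quick illustration on a small array:

```python
import numpy as np

chunks = np.array_split(np.arange(10), 3)   # 3 sections, not sections of size 3
sizes = [len(c) for c in chunks]
print(sizes)   # [4, 3, 3]
```

To get chunks of roughly `n` rows each, one option is to pass `math.ceil(len(df) / n)` as the number of sections.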

In [13]:
# # Set up csv file (this will delete any existing csv file with the same name)
file_name = "bert_data.csv"
file_path = os.path.join(ROOT, 'Colab Notebooks', 'BERT', file_name)
# if os.path.exists(file_path):
#   os.remove(file_path)
# pd.DataFrame(columns=df_all.columns).to_csv(file_path, index=False)

In [14]:
# # No need to run this again, the file has been saved.
# # remove_noise_parallel writes its output to file_path and returns None.
# remove_noise_parallel(df_all, 1500, df_all.shape[0], file_path, verbose=True, report_freq=30)

In [15]:
cleaned_df = pd.read_csv(file_path)

In [16]:
cleaned_df['forum'].value_counts()

The Official Gearbox Software Forums    110797
Episode Forums                          105872
Amazon Seller Forums                    102445
Forums - T Nation                        91721
Car Talk Community                       76723
Discuss the Elastic Stack                51539
GitHub Support Community                 31494
SmartThings Community                    29821
Revolut Community                        29297
Level1Techs Forums                       28475
Name: forum, dtype: int64

In [17]:
samples_per_forum = 20000
forums = []
for f in official_forum_names.values():
  forums.append(cleaned_df[cleaned_df['forum'] == f].sample(samples_per_forum, random_state=42))
sampled_data = pd.concat(forums).sample(frac=1, random_state=42)

In [18]:
sampled_data.shape

(200000, 16)
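
The balanced sampling loop above can also be written with a group-wise sample. This is only a sketch on made-up data, and it assumes pandas 1.1 or newer, where `groupby(...).sample` is available; the row selection will not be identical to the loop version even with the same seed.

```python
import pandas as pd

# Hypothetical, unbalanced toy data: 5, 3 and 4 posts per forum.
toy = pd.DataFrame({
    'forum': ['a'] * 5 + ['b'] * 3 + ['c'] * 4,
    'text':  [f'post {i}' for i in range(12)],
})

per_forum = 2
balanced = (toy.groupby('forum')
               .sample(n=per_forum, random_state=42)   # same count from every forum
               .sample(frac=1, random_state=42))       # shuffle the rows
counts = balanced['forum'].value_counts()

print(counts.to_dict())
```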

# 2. Forum Classifier with BERT

There are two steps to creating a text classifier - 

1. Train NLP model to transform sentences into meaningful sentence embeddings  
2. Train a classifier to make predictions

There are many models we could use to transform our post text into meaningful sentence embeddings. However, for our project we have chosen the BERT model because it performs well on unseen data (it has been trained on large corpora of common text) and because its tokenisation scheme copes well with dirty data. More specifically, we will be using the [DistilBERT model](https://huggingface.co/transformers/model_doc/distilbert.html) from the HuggingFace transformers library.

* DistilBERT processes the sentence and passes along some information it extracted from it on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.
* We’ll be using [DistilBertForSequenceClassification](https://huggingface.co/transformers/v2.2.0/model_doc/distilbert.html#distilbertforsequenceclassification). This is the normal DistilBERT model with a classification head added on top (a pre-classifier layer followed by a linear output layer), which we will use as a sentence classifier. As we feed input data, the entire pre-trained DistilBERT model and the additional untrained classification head are trained on our specific task.

The data we pass between the two models is a vector of size 768 (the [CLS] vector). We can think of this vector as an embedding for the sentence that we can use for classification.
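
To picture where that vector comes from: the model's final hidden states have shape `(batch, sequence length, 768)`, and the [CLS] vector is simply the first position of each sequence. The sketch below uses random numbers in place of real model output, so only the shapes and the slicing are meaningful.

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 8, 768
# Random numbers standing in for the model's final hidden states.
last_hidden_state = np.random.rand(batch_size, seq_len, hidden_size)

# [CLS] is always the first token, so its vector is position 0 of each sequence.
cls_embeddings = last_hidden_state[:, 0, :]

print(cls_embeddings.shape)   # (2, 768)
```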


## 2.1 Sentence Embeddings with BERT
There are many flavours of BERT out there, all trained for a variety of different use cases. However, the model we are using in our project is [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html) from the HuggingFace transformers library as it's simple to use.

Please follow the steps below to generate a dataframe of sentence embeddings with their corresponding forum labels.

1. Tokenise the sentences
2. Pad & truncate all sentences to a single constant length for batch processing
3. Explicitly differentiate real tokens from padding tokens with an “attention mask” 
4. Pass tokenised sentences through BERT to generate sentence embedding features


* See this [visual starter notebook](https://colab.research.google.com/drive/1elYlJ_JKupvMJwLtuwizoYIBmwwjgaur?usp=sharing) for further understanding of how to use the BERT model.

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tutorial-sentence-embedding.png" />


Let’s extract the sentences and labels of our training set as numpy ndarrays.

In [19]:
# Get the lists of sentences and their labels to np array.
sentences = sampled_data.text.values
labels = sampled_data.forum_name_encoded.values

Let’s apply the tokenizer to one sentence just to see the output.

---



When we actually convert all of our sentences, we’ll use the `tokenizer.encode` function to handle both steps, rather than calling `tokenize` and `convert_tokens_to_ids` separately.

In [20]:
from transformers import DistilBertTokenizer

# Load pretrained DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Sample df
sampled_data[['text', 'forum']].sample(5)

Unnamed: 0,text,forum
411279,did you modified your grub config Use video ef...,Level1Techs Forums
110008,Prior to Amazon instituting the Automated retu...,Amazon Seller Forums
620677,how frequently your snapshot is scheduled to r...,Discuss the Elastic Stack
546798,The closest thing to a tank in this game is FL...,The Official Gearbox Software Forums
166797,Ha I am in Colorado For some reason none of th...,Amazon Seller Forums


In [21]:
# Print the original sentence.
print(' Original: ', sentences[1])

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(sentences[1]))

# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[1])))

 Original:  Insta nevaepi Name Nevada Body Neutral 04 Brow Arched Natural black Hair Two Braids Green Eyes Deepest downturn Aqua Blue Face Diamond Nose Defined Natural Lips Full Round Pouty Deep Re Matte Outfit Put her in something badass like a leather jacket Role Can she be a friend of the MC if not anything youd like her to be
Tokenized:  ['ins', '##ta', 'ne', '##va', '##ep', '##i', 'name', 'nevada', 'body', 'neutral', '04', 'brow', 'arched', 'natural', 'black', 'hair', 'two', 'braid', '##s', 'green', 'eyes', 'deepest', 'down', '##turn', 'aqua', 'blue', 'face', 'diamond', 'nose', 'defined', 'natural', 'lips', 'full', 'round', 'po', '##ut', '##y', 'deep', 're', 'matt', '##e', 'outfit', 'put', 'her', 'in', 'something', 'bad', '##ass', 'like', 'a', 'leather', 'jacket', 'role', 'can', 'she', 'be', 'a', 'friend', 'of', 'the', 'mc', 'if', 'not', 'anything', 'you', '##d', 'like', 'her', 'to', 'be']
Token IDs:  [16021, 2696, 11265, 3567, 13699, 2072, 2171, 7756, 2303, 8699, 5840, 8306, 9194

The cell below performs a tokenisation pass over the first 10 sentences to get a rough idea of the maximum sentence length. To measure the true maximum, loop over all of `sentences` instead.

In [22]:
max_len = 0
# For first 10 sentences - 
for s in sentences[:10]:
    # Tokenize the text and add `[CLS]` and `[SEP]` tokens.
    input_ids = tokenizer.encode(s, truncation=True, add_special_tokens=True)
    # Update the maximum sentence length.
    max_len = max(max_len, len(input_ids))

print('Max sentence length: ', max_len)

Max sentence length:  77




For BERT, all sentences must be padded or truncated to a single, fixed length. The maximum sentence length is 512 tokens, so you may have to split the post_text. The maximum length does impact training and evaluation speed, however. For example, with a Tesla K80 (Colab GPU):

`MAX_LEN = 128 --> Training epochs take ~5:28 each`

`MAX_LEN = 64 --> Training epochs take ~2:57 each`

Try and encode the dataset with the DistilBertTokenizer below.

See the [docs here](https://huggingface.co/transformers/main_classes/tokenizer.html?highlight=encode_plus#transformers.PreTrainedTokenizer.encode) for `tokenizer.encode`.


In [23]:
tokenizer.encode?

In [24]:
# Let's try it on a small subset of our data
tokenized = sampled_data['text'].iloc[:1000].apply(lambda x: tokenizer.encode(x, truncation=True, add_special_tokens=True))
tokenized.head()

431570    [101, 2092, 3849, 5875, 2012, 2560, 2453, 3046...
372215    [101, 16021, 2696, 11265, 3567, 13699, 2072, 2...
568337    [101, 6854, 2009, 2145, 3849, 2000, 2022, 2004...
373236    [101, 1045, 2031, 2589, 2070, 2077, 1045, 2064...
77862     [101, 2004, 5498, 2638, 9148, 16150, 5358, 204...
Name: text, dtype: object

### 2.1.2 Padding the Sentences and Attention Mask 

- For BERT, all sentences must be padded or truncated to a single, fixed length.
The maximum sentence length is 512 tokens.
- Padding is done with a special [PAD] token, which is at index 0 in the BERT vocabulary. The below illustration demonstrates padding out to a “MAX_LEN” of 8 tokens.




<img src="http://www.mccormickml.com/assets/BERT/padding_and_mask.png" width="500"/>

The “Attention Mask” is simply an array of 1s and 0s indicating which tokens are padding and which aren’t (seems kind of redundant, doesn’t it?!). This mask tells the “Self-Attention” mechanism in BERT not to incorporate these PAD tokens into its interpretation of the sentence.


Therefore the encoding task requires the following - 

1. Split the sentence into tokens.
2. Add the special [CLS] and [SEP] tokens.
3. Pad or truncate all sentences to the same length.
4. Create the attention masks which explicitly differentiate real tokens from [PAD] tokens.
5. Map the tokens to their IDs.


You should try to implement the padding and attention masks yourself with numpy. The model we use (`distilbert-base-uncased`) is trained on lower-cased English text. Hence the **do_lower_case** flag should be true in the tokenizer.


Otherwise, `tokenizer.encode` covers every step above except the attention masks; you can use `tokenizer.encode_plus` to get those as well. Documentation is [here](https://huggingface.co/transformers/main_classes/tokenizer.html?highlight=encode_plus#transformers.PreTrainedTokenizer.encode_plus).
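
If you want to try the numpy route, here is one possible sketch. It assumes the inputs are lists of token IDs that already include [CLS] (101) and [SEP] (102), as returned by `tokenizer.encode`; the `pad_and_mask` helper is a made-up name, and keeping the final [SEP] when truncating is a design choice rather than a requirement.

```python
import numpy as np

PAD_ID = 0  # [PAD] sits at index 0 in the BERT vocabulary

def pad_and_mask(id_lists, max_len):
    """Pad or truncate each list of token IDs to max_len and build the attention mask."""
    padded = np.full((len(id_lists), max_len), PAD_ID, dtype=np.int64)
    mask = np.zeros((len(id_lists), max_len), dtype=np.int64)
    for row, ids in enumerate(id_lists):
        ids = list(ids)
        if len(ids) > max_len:
            # Truncate, but keep the last token ([SEP]) in place.
            ids = ids[:max_len - 1] + ids[-1:]
        padded[row, :len(ids)] = ids
        mask[row, :len(ids)] = 1     # 1 = real token, 0 = padding
    return padded, mask

padded, mask = pad_and_mask([[101, 2092, 102], [101, 1, 2, 3, 4, 5, 102]], max_len=5)
print(padded)
print(mask)
```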


In [25]:
import torch

# Tokenize all of the sentences and map the tokens to their word IDs.
token_ids = []
attention_masks = []
T = 128

for s in sentences:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start and append the `[SEP]` token to the end.
    #   (3) Pad or truncate the sentence to `max_length`
    #   (4) Create attention masks for [PAD] tokens
    #   (5) Map tokens to their IDs.

    # `encode_plus` performs all five steps in a single call.
    encoded_dict = tokenizer.encode_plus(
                        s,                              # Sentence to encode.
                        add_special_tokens = True,      # Add '[CLS]' and '[SEP]'
                        truncation = True,              # Truncate sentences longer than `max_length`.
                        max_length = T,                 # Target length for padding & truncation.
                        padding = 'max_length',         # Pad shorter sentences up to `max_length`.
                        return_attention_mask = True,   # Construct attn masks.
                        return_tensors = 'pt',          # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.
    token_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
token_ids = torch.cat(token_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

# Print sentence 0, now as a list of IDs.
print('Original: ', sentences[0])
print('Token IDs:', token_ids[0])

Original:  Well seems interesting at least might try it out later Though Im not sure which Mesa Im currently at might even be 19 3
Token IDs: tensor([  101,  2092,  3849,  5875,  2012,  2560,  2453,  3046,  2009,  2041,
         2101,  2295, 10047,  2025,  2469,  2029, 15797, 10047,  2747,  2012,
         2453,  2130,  2022,  2539,  1017,   102,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,


## 3.1 Train the Classification Model

### 3.1.1 DistilBert For Sequence Classification


For this task, we first want to modify the pre-trained BERT model to give outputs for classification, and then we want to continue training the model on our dataset until the entire model, end-to-end, is well-suited for our task.

Thankfully, the HuggingFace Pytorch implementation includes a set of interfaces designed for a variety of NLP tasks. Though these interfaces are all built on top of a trained DistilBERT model, each has different top layers and output types designed to accommodate their specific NLP task.


We’ll be using [DistilBertForSequenceClassification](https://huggingface.co/transformers/v2.2.0/model_doc/distilbert.html#distilbertforsequenceclassification). This is the normal DistilBERT model with a classification head added on top (a pre-classifier layer followed by a linear output layer), which we will use as a sentence classifier. As we feed input data, the entire pre-trained DistilBERT model and the additional untrained classification head are trained on our specific task.

We then pass the sentence embeddings through this classification head to produce forum predictions.

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" />

<!-- 
1. Append the classification  layer to the BERT model


<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" />

### [Optional] Grid Search for Parameters
We can dive into Logistic regression directly with the Scikit Learn default parameters, but sometimes it's worth searching for the best value of the C parameter, which determines regularisation strength. -->

In [26]:
from torch.utils.data import TensorDataset, random_split

# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(token_ids, attention_masks, labels)

# Create a 90-10 train-validation split.

# Calculate the number of samples to include in each set.
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print(f'{train_size:>5,} training samples')
print(f'{val_size:>5,} validation samples')

180,000 training samples
20,000 validation samples


In [27]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# The DataLoader needs to know our batch size for training, so we specify it 
# here. For fine-tuning BERT on a specific task, the authors recommend a batch 
# size of 16 or 32.
batch_size = 32

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

In [28]:
sampled_data['forum'].unique()

array(['Level1Techs Forums', 'Episode Forums',
       'The Official Gearbox Software Forums', 'Amazon Seller Forums',
       'SmartThings Community', 'Discuss the Elastic Stack',
       'Car Talk Community', 'Revolut Community',
       'GitHub Support Community', 'Forums - T Nation'], dtype=object)

In [29]:
from transformers import DistilBertForSequenceClassification, AdamW, BertConfig

# Load DistilBertForSequenceClassification, the pretrained DistilBERT model with a 
# classification head on top. 
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", # Use the 6-layer DistilBERT model, with an uncased vocab.
    num_labels = len(sampled_data['forum'].unique()), # The number of output labels: one per forum (10 here).
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

# Tell pytorch to run this model on the GPU. Make sure you enable the GPU runtime by clicking [Runtime]->[Change Runtime Type]->[Hardware Accelerator]->GPU->[Save]
model.cuda()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [30]:
# Get all of the model's parameters as a list of tuples.
params = list(model.named_parameters())

print('The BERT model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')

# DistilBERT has 4 embedding parameters (no token-type embeddings) and 16 per transformer block.
for p in params[0:4]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[4:20]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The BERT model has 104 different named parameters.

==== Embedding Layer ====

distilbert.embeddings.word_embeddings.weight            (30522, 768)
distilbert.embeddings.position_embeddings.weight          (512, 768)
distilbert.embeddings.LayerNorm.weight                        (768,)
distilbert.embeddings.LayerNorm.bias                          (768,)

==== First Transformer ====

distilbert.transformer.layer.0.attention.q_lin.weight     (768, 768)
distilbert.transformer.layer.0.attention.q_lin.bias           (768,)
distilbert.transformer.layer.0.attention.k_lin.weight     (768, 768)
distilbert.transformer.layer.0.attention.k_lin.bias           (768,)
distilbert.transformer.layer.0.attention.v_lin.weight     (768, 768)
distilbert.transformer.layer.0.attention.v_lin.bias           (768,)
distilbert.transformer.layer.0.attention.out_lin.weight   (768, 768)
distilbert.transformer.layer.0.attention.out_lin.bias         (768,)
distilbert.transformer.layer.0.sa_layer_norm.weight           (

### 3.1.2 Optimizer & Learning Rate Scheduler

Now that we have our model loaded, we need to choose the training hyperparameters.

For the purposes of fine-tuning, the authors recommend choosing from the following values (from Appendix A.3 of the BERT paper):

* Batch size: 16, 32
* Learning rate (Adam): 5e-5, 3e-5, 2e-5
* Number of epochs: 2, 3, 4

We chose:

* Batch size: 32 (set when creating our DataLoaders)
* Learning rate: 2e-5
* Epochs: 4 (we’ll see that this is probably too many…)


The epsilon parameter eps = 1e-8 is “a very small number to prevent any division by zero in the implementation” (from [here](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)).

You can find the creation of the AdamW optimizer in run_glue.py [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L109).
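
To see what the epsilon term does, here is a textbook Adam update for a single scalar parameter. This is a simplified sketch (no weight decay, and not the library's exact code); `adam_step` and its argument names are made up for illustration. With a zero gradient, the square root of the second-moment estimate is 0, and `eps` keeps the division well defined.

```python
import math

def adam_step(theta, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (no weight decay)."""
    m = beta1 * m + (1 - beta1) * grad            # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Zero gradient: without eps the update would be 0 / 0.
theta_zero, _, _ = adam_step(theta=1.0, grad=0.0, m=0.0, v=0.0, t=1)
# Non-zero gradient: the first step moves the parameter by roughly lr.
theta_half, _, _ = adam_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)

print(theta_zero, theta_half)
```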

In [31]:
# Note: AdamW is a class from the huggingface library (as opposed to pytorch).
# I believe the 'W' stands for 'Weight Decay fix'. Newer transformers releases
# deprecate this class in favour of torch.optim.AdamW.
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                )


In [32]:
from transformers import get_linear_schedule_with_warmup

# Number of training epochs. The BERT authors recommend between 2 and 4. 
# We chose to run for 4, but we'll see later that this may be over-fitting the
# training data.
epochs = 4

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
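
For intuition, the numbers behind the scheduler can be worked out by hand. The sketch below recomputes `total_steps` for our 180,000 training samples and mirrors, as I understand it, the shape of a linear warmup/decay schedule; `linear_lr` is a hypothetical helper written for this illustration, not a library function.

```python
import math

train_samples, batch_size, epochs = 180_000, 32, 4
steps_per_epoch = math.ceil(train_samples / batch_size)   # batches per epoch
total_steps = steps_per_epoch * epochs

def linear_lr(step, base_lr=2e-5, warmup_steps=0, total=total_steps):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup_steps))

print(steps_per_epoch, total_steps)   # 5625 22500
print(linear_lr(0), linear_lr(total_steps // 2), linear_lr(total_steps))
```

With zero warmup steps the learning rate starts at 2e-5, is halved at the midpoint of training, and reaches 0 on the final step.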

Below is our training loop. There’s a lot going on, but fundamentally for each pass in our loop we have a training phase and a validation phase.

**Training:**

* Unpack our data inputs and labels
* Load data onto the GPU for acceleration
* Clear out the gradients calculated in the previous pass.
  * In pytorch the gradients accumulate by default (useful for things like RNNs) unless you explicitly clear them out.
* Forward pass (feed input data through the network)
* Backward pass (backpropagation)
* Tell the network to update parameters with optimizer.step()
* Track variables for monitoring progress

**Evaluation:**

* Unpack our data inputs and labels
* Load data onto the GPU for acceleration
* Forward pass (feed input data through the network)
* Compute accuracy on our validation data and track variables for monitoring progress

PyTorch hides all of the detailed calculations from us, but we’ve commented the code to point out which of the above steps are happening on each line.
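
The point about gradients accumulating is easy to check on a toy example. The sketch below is not part of the training code: it uses a single made-up scalar tensor `w` and calls `backward()` twice without clearing, then once more after clearing.

```python
import torch

# A single trainable scalar stands in for the model's parameters.
w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(3w)/dw = 3.
(3.0 * w).backward()
grad_first = w.grad.item()

# Second backward pass without clearing: the new gradient is added to the old one.
(3.0 * w).backward()
grad_accumulated = w.grad.item()

# Clearing the gradient first gives a fresh value again.
w.grad.zero_()
(3.0 * w).backward()
grad_after_reset = w.grad.item()
```

This is why the training loop clears the gradients at the start of every batch: without that step, each update would be based on the sum of the gradients from all previous batches.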

Define a helper function for calculating accuracy.

In [33]:
import numpy as np

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
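
Here is a quick worked example of `flat_accuracy` on made-up values. The logits and labels below are invented purely for illustration, with one row of logits per example and one column per class.

```python
import numpy as np

def flat_accuracy(preds, labels):
    # Same helper as above, repeated so this example runs on its own.
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Three examples, two classes (made-up logits).
example_logits = np.array([[0.1, 0.9],    # predicted class 1
                           [2.0, -1.0],   # predicted class 0
                           [0.3, 0.4]])   # predicted class 1
example_labels = np.array([1, 0, 0])      # the last prediction is wrong

example_accuracy = flat_accuracy(example_logits, example_labels)
```

Two of the three predictions match their labels, so the result is 2/3. Note that no softmax is needed: taking the argmax of the raw logits picks the same class.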

In [34]:
import time
import datetime

# Helper function for formatting elapsed times as hh:mm:ss
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
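
For example, this is what the helper returns for two made-up durations (the function is repeated so the snippet runs on its own):

```python
import datetime

def format_time(elapsed):
    # Round to the nearest second, then let timedelta format it as h:mm:ss.
    elapsed_rounded = int(round(elapsed))
    return str(datetime.timedelta(seconds=elapsed_rounded))

short_run = format_time(24.3)     # '0:00:24'
long_run = format_time(3725.4)    # '1:02:05'
```

The hours part is not zero-padded, which matches the `0:00:24` style seen in the training log.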

In order for torch to use the GPU, we need to identify and specify the GPU as the device. Later, in our training loop, we will load data onto the device. 

In [35]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla K80


In [36]:
import random

# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128


# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Store the average loss after each epoch so we can plot them.
loss_values = []

# For each epoch...
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_loss = 0

    # Put the model into training mode. Don't be misled--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Format the elapsed time as hh:mm:ss.
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].type(torch.LongTensor).to(device)
        # print(b_input_ids.shape)
       
        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()        

        # Perform a forward pass (evaluate the model on this training batch).
        # Because we have provided the `labels`, the first element of the
        # returned output is the loss (followed by the logits).
        # The documentation for this `model` function is here: 
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        outputs = model(b_input_ids,
                    attention_mask=b_input_mask, 
                    labels=b_labels)
        
        # The call to `model` always returns a tuple, so we need to pull the 
        # loss value out of the tuple.
        loss = outputs[0]

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over the training data.
    avg_train_loss = total_loss / len(train_dataloader)            
    
    # Store the loss value for plotting the learning curve.
    loss_values.append(avg_train_loss)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        
        # Telling the model not to compute or store gradients, saving memory and
        # speeding up validation
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # This will return the logits rather than the loss because we have
            # not provided labels.
            # We do not pass `token_type_ids` (the "segment ids", which 
            # differentiate sentence 1 and 2 in 2-sentence tasks) here.
            # The documentation for this `model` function is here: 
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            outputs = model(b_input_ids, 
                            attention_mask=b_input_mask)
        
        # Get the "logits" output by the model. The "logits" are the output
        # values prior to applying an activation function like the softmax.
        logits = outputs[0]

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        
        # Calculate the accuracy for this batch of validation examples.
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        
        # Accumulate the total accuracy.
        eval_accuracy += tmp_eval_accuracy

        # Track the number of batches
        nb_eval_steps += 1

    # Report the final accuracy for this validation run.
    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))

print("")
print("Training complete!")


Training...
  Batch    40  of  5,625.    Elapsed: 0:00:24.
  Batch    80  of  5,625.    Elapsed: 0:00:47.
  Batch   120  of  5,625.    Elapsed: 0:01:11.
  Batch   160  of  5,625.    Elapsed: 0:01:36.
  Batch   200  of  5,625.    Elapsed: 0:02:00.
  Batch   240  of  5,625.    Elapsed: 0:02:24.
  Batch   280  of  5,625.    Elapsed: 0:02:48.
  Batch   320  of  5,625.    Elapsed: 0:03:12.
  Batch   360  of  5,625.    Elapsed: 0:03:36.
  Batch   400  of  5,625.    Elapsed: 0:04:00.
  Batch   440  of  5,625.    Elapsed: 0:04:24.
  Batch   480  of  5,625.    Elapsed: 0:04:49.
  Batch   520  of  5,625.    Elapsed: 0:05:13.
  Batch   560  of  5,625.    Elapsed: 0:05:37.
  Batch   600  of  5,625.    Elapsed: 0:06:01.
  Batch   640  of  5,625.    Elapsed: 0:06:25.
  Batch   680  of  5,625.    Elapsed: 0:06:49.
  Batch   720  of  5,625.    Elapsed: 0:07:13.
  Batch   760  of  5,625.    Elapsed: 0:07:37.
  Batch   800  of  5,625.    Elapsed: 0:08:01.
  Batch   840  of  5,625.    Elapsed: 0:08:25.
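
One step in the loop above that is worth unpacking is the gradient clipping. The sketch below shows the idea in plain Python on a list of made-up gradient values. The helper name `clip_by_global_norm` and the small constant `tiny` are assumptions for this example; as far as I understand it, `clip_grad_norm_` applies the same kind of rescaling in place across all parameter tensors.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, tiny=1e-6):
    """Rescale `grads` so their overall L2 norm is at most `max_norm` (hypothetical helper)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only ever scale down: gradients already within the limit are left alone.
    scale = min(1.0, max_norm / (total_norm + tiny))
    return [g * scale for g in grads], total_norm

# A large gradient (norm 5) is scaled down to a norm of about 1.
clipped_large, norm_large = clip_by_global_norm([3.0, 4.0])

# A small gradient (norm 0.5) passes through unchanged.
clipped_small, norm_small = clip_by_global_norm([0.3, 0.4])
```

The direction of the gradient is preserved, since every component is multiplied by the same factor. Only the step length is capped, which is what helps against "exploding gradients".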


Next, let's try using the Hugging Face `Trainer` class.

In [37]:
import dataclasses
import logging
import os
import sys
from dataclasses import dataclass, field
from typing import Dict, Optional

import numpy as np

from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, EvalPrediction, GlueDataset
from transformers import GlueDataTrainingArguments as DataTrainingArguments
from transformers import (
    HfArgumentParser,
    Trainer,
    TrainingArguments,
    glue_compute_metrics,
    glue_output_modes,
    glue_tasks_num_labels,
    set_seed,
)

logging.basicConfig(level=logging.INFO)

In [38]:
@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
    )
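
If the `field(metadata=...)` pattern is unfamiliar, the sketch below shows how that metadata can be read back using only the standard library. It uses a trimmed-down copy of the class with shortened help strings, purely for illustration. I believe this is roughly how a tool such as `HfArgumentParser` can pick up the help text, but treat that as an assumption.

```python
from dataclasses import dataclass, field, fields
from typing import Optional

@dataclass
class ModelArguments:
    # Trimmed-down copy of the class above, for illustration only.
    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier"}
    )
    cache_dir: Optional[str] = field(
        default=None, metadata={"help": "Where to store downloaded models"}
    )

# Only the field without a default has to be supplied.
example_args = ModelArguments(model_name_or_path="distilbert-base-cased")

# `fields` exposes each field's name and the metadata attached to it.
help_by_field = {f.name: f.metadata["help"] for f in fields(example_args)}
```

The metadata has no effect on how the dataclass itself behaves. It is just extra information attached to each field for other code to use.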

In [39]:
model_args = ModelArguments(
    model_name_or_path="distilbert-base-cased",
)
data_args = DataTrainingArguments(task_name="mnli", data_dir="./glue_data/MNLI")
training_args = TrainingArguments(
    output_dir="./models/model_name",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    per_gpu_train_batch_size=32,
    per_gpu_eval_batch_size=128,
    num_train_epochs=1,
    logging_steps=500,
    logging_first_step=True,
    save_steps=1000,
    evaluate_during_training=True,
)

In [40]:
def compute_metrics(p: EvalPrediction) -> Dict:
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

In [41]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    # eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

INFO:transformers.training_args:PyTorch: setting up devices
INFO:transformers.trainer:You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
