# Project 2 Report - Dean Knudson

This semester I continued my work on a sentiment analysis engine. The semester began with the revelation that a new library called transformers, by HuggingFace, had recently been released. This library made fine-tuning some of the SOTA (state of the art) language models incredibly simple. With this, I decided to fine-tune a variety of BERT architectures using the custom dataset I created last semester. At the same time, another team was working on revamping the existing media analytics website and requested a sentiment analysis component from me. This progressed into the construction and deployment of a Python machine learning package for text classification. I also briefly experimented with the dataset I used in my [special topic](https://github.com/knuddj1/Sentiment-Analysis/blob/master/recap%20stuff/semester_recap.ipynb) to create a 5-label sentiment analysis model, comparing its results with those from my custom dataset.

---

## [Transformers](https://github.com/huggingface/transformers)

The beginning of this semester yielded a new tool in the NLP community. The work I did [last semester](https://github.com/knuddj1/Project-1-Sentiment-Analysis/blob/master/report.ipynb) was made largely redundant by transformers. Created by the HuggingFace team, transformers took the difficult implementation details out of using the recent, powerful, transformer-based language models and provided a fairly compact Python interface for interacting with them. Originally transformers only supported the popular machine learning framework [PyTorch](https://github.com/huggingface/transformers) as a backend, and was named pytorch_transformers. It now supports TensorFlow as well; however, I began using the library when only PyTorch was available, so all my work uses PyTorch.

## Familiarizing myself with Transformers and PyTorch

Transformers is very simple to use compared to other existing repos or to reimplementing these models from scratch. However, it's still not the most straightforward library, and it took me quite a few weeks to really figure it out. This was partly because I had little exposure to PyTorch, having predominantly used Keras for my machine learning work.

I started off working through PyTorch's [tutorials](https://pytorch.org/tutorials/) and completed the two that seemed most relevant to me.

1. ***Writing Custom Datasets, Dataloaders and Transforms*** went in depth on how to load and preprocess/augment data from a non-trivial dataset in PyTorch fashion. This introduced me to PyTorch dataloaders and custom datasets. A PyTorch dataset is an object that wraps a collection of data and implements the `__len__` and `__getitem__` dunder methods, the latter being used to extract a single sample given an index. A dataloader is an iterable that pulls batches of samples from a PyTorch dataset and can use worker processes to load them in parallel with the main thread of execution.

2. ***Sequence-to-Sequence Modeling with nn.Transformer and TorchText***, a tutorial on how to train a sequence-to-sequence model that uses the nn.Transformer module. The models featured in HuggingFace's Transformers are made up of layers of these Transformer blocks. I thought this tutorial would be useful before tackling the Transformers library, as it goes over the entire pipeline of training a transformer model. Most of it, however, turned out to be unnecessary because of the level of abstraction Transformers provides, and it served more as a learning exercise.
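
The dataset and dataloader roles described for the first tutorial can be sketched roughly as below. This is only an illustration: the class name `TextDataset`, the token ids and the labels are made up for the example and are not taken from my project data.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TextDataset(Dataset):
    """Hypothetical dataset wrapping pre-tokenized sentences and their labels."""

    def __init__(self, token_ids, labels):
        self.token_ids = token_ids
        self.labels = labels

    def __len__(self):
        # Number of samples in the dataset
        return len(self.labels)

    def __getitem__(self, index):
        # Return one (input, label) sample for the given index
        ids = torch.tensor(self.token_ids[index])
        label = torch.tensor(self.labels[index])
        return ids, label


# Made-up data: five sentences, each already encoded as four token ids
token_ids = [[2, 5, 6, 3], [2, 7, 8, 3], [2, 9, 10, 3], [2, 11, 12, 3], [2, 13, 14, 3]]
labels = [0, 1, 1, 0, 1]

dataset = TextDataset(token_ids, labels)

# num_workers=0 loads in the main process; a value above 0 would
# hand loading to background worker processes instead.
loader = DataLoader(dataset, batch_size=2, shuffle=False, num_workers=0)

batches = list(loader)
for batch_ids, batch_labels in batches:
    print(batch_ids.shape, batch_labels.tolist())
```

With five samples and a batch size of two, the loader yields three batches, the last of which holds the single leftover sample.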

Next I moved on to learning to use the Transformers library. I started off just wanting to get one of the available models instantiated and then running training/testing/inference/saving for sequence classification. 

Below is an example of how this initial test would have looked:

1. **Declare the required imports**

        import torch
        from transformers import BertTokenizer, BertForSequenceClassification
        
   - torch is the name under which the PyTorch package is imported. The transformers models used here are built on top of it.


2. **Define model type and load its components**

        model_type = 'bert-base-uncased'
        tokenizer = BertTokenizer.from_pretrained(model_type)  # Loads the matching vocabulary
        model = BertForSequenceClassification.from_pretrained(model_type)

   - model_type defines which version of the model is to be used.
   - BertTokenizer is a tool that preprocesses a piece of text into numerical tokens for the model.
     It has to be loaded with from_pretrained so that it uses the vocabulary the model was trained with.
   - BertForSequenceClassification is a pretrained BERT transformer model with an untrained
     classification head connected to the final hidden layer's output.


3. **Transform raw text into an input tensor for the model**

        input_text = "Hello, my dog is cute"
        encoded_text = tokenizer.encode(input_text)
        input_tensor = torch.tensor(encoded_text).unsqueeze(0)  # Add batch dimension
        input_label = torch.tensor([1]).unsqueeze(0)  # Add batch dimension

    - input_text is a raw string of text.
    - encoded_text is the output of the tokenizer's encode function. It is a list of integers in
      the range 0 to VOCABULARY_SIZE - 1, one per token. Its length can exceed the number of words
      in input_text, because punctuation and sub-word pieces count as separate tokens and special
      tokens such as `[CLS]` and `[SEP]` may be added.
    - input_tensor is the encoded_text list converted to a PyTorch tensor.
    - input_label is a tensor holding the index of the correct class (here 1). It is used as the
      ground truth that the model's prediction is compared to.
    
    
4. **Perform a forward pass of the model**

        outputs = model(input_tensor, labels=input_label)
        loss, logits = outputs[:2]

    - The first two elements of the model's output are the loss (how far its prediction was from
      the target label) and the logits (n unnormalised scores, where n is the number of class
      labels). Applying a softmax to the logits turns them into a probability distribution.
    
    
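5. **Backpropagate the loss**

        loss.backward()

    - Performs backpropagation, computing the gradient of the loss with respect to each of the
      model's weights. The weights themselves only change once an optimizer (for example one
      implementing gradient descent) applies those gradients with its step function.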
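
To make steps 4 and 5 concrete, here is a minimal sketch of one complete training step, including the optimizer call that actually changes the weights. To keep it small and runnable without downloading anything, a single `torch.nn.Linear` layer stands in for the BERT model; the 8-dimensional input, the learning rate and the choice of `AdamW` are assumptions made for the illustration, not the settings I used.

```python
import torch

torch.manual_seed(0)

# Stand-in for the transformer: one linear layer mapping a hypothetical
# 8-dimensional sentence representation to 2 class scores (logits).
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

features = torch.randn(1, 8)  # A batch containing one made-up "sentence"
label = torch.tensor([1])     # Ground-truth class index

weights_before = model.weight.detach().clone()

# Forward pass: logits are raw scores, softmax turns them into probabilities
logits = model(features)
probs = torch.softmax(logits, dim=-1)
loss = loss_fn(logits, label)

# Backward pass: fills in .grad for every weight, but changes no weights
optimizer.zero_grad()
loss.backward()
unchanged_after_backward = torch.equal(model.weight, weights_before)

# Optimizer step: applies the gradients and updates the weights
optimizer.step()
changed_after_step = not torch.equal(model.weight, weights_before)

print("probabilities:", probs.tolist())
print("weights unchanged after backward():", unchanged_after_backward)
print("weights changed after step():", changed_after_step)
```

The same three calls (`zero_grad`, `backward`, `step`) would wrap the forward pass of the BERT model, with the loss taken from the model's output rather than computed separately.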



**FULL EXAMPLE**

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    # Load the pretrained tokenizer and model weights
    model_type = 'bert-base-uncased'
    tokenizer = BertTokenizer.from_pretrained(model_type)
    model = BertForSequenceClassification.from_pretrained(model_type)

    # Convert the raw text and its label into tensors
    input_text = "Hello, my dog is cute"
    encoded_text = tokenizer.encode(input_text)
    input_tensor = torch.tensor(encoded_text).unsqueeze(0)  # Add batch dimension
    input_label = torch.tensor([1]).unsqueeze(0)  # Add batch dimension

    # Forward pass, then backpropagate the loss
    outputs = model(input_tensor, labels=input_label)
    loss, logits = outputs[:2]
    loss.backward()  # Computes gradients; an optimizer step is still needed to update the weights
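
The example above only covers a training pass. Inference and saving, which I also wanted to try in this initial test, might look roughly like the sketch below. To avoid downloading the pretrained weights, it builds a tiny, randomly initialised model from a `BertConfig`; all of the config sizes and the token ids are made-up values for illustration, and in practice the model would come from `from_pretrained('bert-base-uncased')` as above.

```python
import tempfile

import torch
from transformers import BertConfig, BertForSequenceClassification

torch.manual_seed(0)

# Hypothetical, deliberately tiny configuration so nothing has to be downloaded
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=2,
)
model = BertForSequenceClassification(config)

# Inference: disable dropout and gradient tracking
model.eval()
input_ids = torch.tensor([[2, 5, 6, 7, 3]])  # Made-up token ids, batch of one
with torch.no_grad():
    logits = model(input_ids)[0]  # No labels passed, so the first output is the logits
predicted_class = int(torch.argmax(logits, dim=-1))

# Saving and reloading: save_pretrained writes the config and weights to a folder
with tempfile.TemporaryDirectory() as save_dir:
    model.save_pretrained(save_dir)
    reloaded = BertForSequenceClassification.from_pretrained(save_dir)

reloaded.eval()
with torch.no_grad():
    reloaded_logits = reloaded(input_ids)[0]

same_output = torch.allclose(logits, reloaded_logits, atol=1e-6)
print("predicted class:", predicted_class)
print("reloaded model gives the same logits:", same_output)
```

Because the weights here are random, the predicted class is meaningless; the point is only that the reloaded model reproduces the original model's output.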