# Analysis & Results


This notebook summarizes the results obtained with different baselines and models.

Models used: 

#### Baselines 

1. **Majority Class with Feature Extraction**
2. **Logistic Regression Classifier with Feature Extraction**

#### Models 

1. **Ensemble Model for German using the `de_core_news_lg` spaCy package** - without cleaning the data
2. **Ensemble Model for German using the `de_core_news_lg` spaCy package** - with the data cleaned by removing punctuation and stop words and lemmatizing the tokens
3. **BERT Transformer model for German, `bert-base-german-cased`, fine-tuned on our data**

## Baselines definition

Before training any model, it is good practice to create baselines against which the models under test can be compared.

I chose to work as follows:

- After pre-processing the data, I used the Transformer model `bert-base-german-cased` to build a feature extractor that turns the text into vectors, as I wanted to take advantage of the German training data used for this model.


______________

#### Side note on Feature Extractor 

To vectorize the data for use in the baselines, I froze the weights of the model and took the hidden states that it outputs for the `[CLS]` token, so that they could be used as features for the baselines.
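The extraction step described above could look roughly like the sketch below. It is not the exact code used for the experiments: the function names, the batch size and the choice of the last hidden layer are assumptions made for illustration, and it needs `torch` and `transformers` installed when the extraction function is actually called.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_cls_features(texts, model_name="bert-base-german-cased", batch_size=16):
    """Return one [CLS] vector per text, computed with frozen weights.

    Hypothetical helper: expects a non-empty list of strings.
    """
    # Imported lazily so that `batches` stays usable without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # inference mode, disables dropout

    features = []
    with torch.no_grad():  # no gradients are computed, so the weights never change
        for batch in batches(texts, batch_size):
            encoded = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
            outputs = model(**encoded)
            # Assumption: use the last layer; position 0 is the [CLS] token.
            features.append(outputs.last_hidden_state[:, 0, :])
    return torch.cat(features).numpy()
```

The resulting array has one row per text and can be passed directly to a classical classifier as its feature matrix.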

## Model selection 

Initially, I wanted to try different models and selected two, described next:

### 1. TextCategorizer Ensemble model from spaCy. 

Quote: 
> Stacked ensemble of a linear bag-of-words model and a neural network model.

**Notes:** 
- Ensemble models usually perform well for classification. In this case, the model updated on the training data achieved good results, which are discussed later on.
- One caveat of this model is that bag-of-words models usually don't account for the context surrounding a token.
- spaCy specializes in building multilingual pipelines. This particular one draws on multiple training sources, such as the TIGER Corpus, Tiger2Dep and Explosion fastText vectors.

Reference: [spaCy de_core_news_lg](https://spacy.io/models/de#de_core_news_lg)

### 2. Transformer models. 

In my experience, working with text data is challenging. Luckily, Transformer models have recently seen increasing use. Internally they rely on attention mechanisms, which look at the surrounding text and not only at the token being processed.

**Notes:**

- The model was chosen because:
    - It was trained on a German corpus.
    - Its training corpus (`German Wikipedia, OpenLegalData, News Articles`) seems a bit bigger and more diverse than the one used for the spaCy model.
    
Reference: [Deepset - German BERT article](https://www.deepset.ai/german-bert)

-----------------

## Baseline Results

The following metrics were chosen for evaluation: 

- Accuracy, as it's one of the metrics most commonly used for evaluating classification models.
- Macro-F1 for the spaCy models, as this is the metric they are set to by default.
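Since the two metrics above are not interchangeable, a small sketch may help to see how they differ. The labels below are toy values made up for illustration (they are not taken from the dataset), and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy, imbalanced example: a model that only ever predicts `none`.
y_true = ["none"] * 6 + ["tech"] * 2 + ["soft"] * 2
y_pred = ["none"] * 10

toy_accuracy = accuracy(y_true, y_pred)   # 0.6
toy_macro_f1 = macro_f1(y_true, y_pred)   # 0.25
```

In this toy case accuracy looks acceptable while macro-F1 is low, because macro-F1 gives every class the same weight regardless of its size. This is worth keeping in mind when comparing the accuracy and macro-F1 figures reported in the following sections.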

To visualize the results, a *Confusion Matrix* was utilized. 

### Baseline 1. Majority Class Results

As expected, this classifier always predicts the majority class, no matter what the real class is.
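A minimal sketch of such a classifier is shown below. The class name and the toy labels are made up for illustration and are not the data or code used in the experiments.

```python
from collections import Counter


class MajorityClassifier:
    """Baseline that ignores the input and always predicts the most frequent training label."""

    def fit(self, labels):
        # Store the label that occurs most often in the training set.
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, samples):
        # The samples themselves are never inspected; only their count matters.
        return [self.majority_] * len(samples)


# Toy labels, for illustration only.
train_labels = ["none", "none", "none", "tech", "tech", "soft"]
test_labels = ["none", "tech", "soft", "none"]

clf = MajorityClassifier().fit(train_labels)
predictions = clf.predict(test_labels)
baseline_accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
```

By construction, the accuracy of this baseline equals the share of the majority class in the evaluated split.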

**Accuracy on Test Set**: 42.1 % 

**Analysis**: 

- This is consistent with always predicting the majority class: recalling the class distribution, the proportion of that class was 42%.

<img src="./assets/distrib_per_class.png" alt="Dataset Distribution per Class" width=350 height=350/>


### Baseline 2. Logistic Regression Classifier Results

This model was trained for 3000 iterations and achieved really good results on the test set.

**Accuracy on Test Set**: 94 %

**Analysis**: 

Even though Logistic Regression is a simple classifier, it's still powerful, and on this dataset it performed really well.

The confusion matrix below shows that accuracy was above 90% for every class:

- 91% correctly predicted labels for the `soft` class
- 93% correctly predicted labels for the `tech` class
- 97% correctly predicted labels for the `none` class

<img src="./assets/log_reg_matrix.png" alt="Confusion Matrix - Logistic Regression" width=400 height=400/>

**Notes**: 
- It would be interesting to perform error analysis for the following cases, where the wrong class was predicted:
    - Predicted: `soft`, True: `tech`, with a 5% error.
    - Predicted: `tech`, True: `soft`, with a 7% error.
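As a starting point for that error analysis, the per-class percentages and the confusions between classes can be read off a row-normalized confusion matrix. The sketch below assumes that rows are true classes and columns are predicted classes; the labels are toy values and the helper names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix with true classes as rows and predicted classes as columns."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total, so each row sums to 1 (or stays 0 if empty)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Toy labels, for illustration only.
labels = ["soft", "tech", "none"]
y_true = ["soft"] * 4 + ["tech"] * 4 + ["none"] * 2
y_pred = ["soft", "soft", "soft", "tech",
          "tech", "tech", "tech", "soft",
          "none", "none"]

matrix = confusion_matrix(y_true, y_pred, labels)
rates = normalize_rows(matrix)

# Off-diagonal cells are the confusions worth inspecting: (true, predicted, rate).
errors = [
    (labels[i], labels[j], rates[i][j])
    for i in range(len(labels))
    for j in range(len(labels))
    if i != j and rates[i][j] > 0
]
```

With this convention the diagonal holds the share of correctly labeled samples per class, and each entry of `errors` points to a pair of classes whose misclassified samples could be pulled out and read manually.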

-------------

## Models Results

### 1. Ensemble Model from spaCy -  No cleaning

This was a really interesting experiment. I am not 100% familiar with the library, but it is without doubt powerful, with lots of capabilities for building NLP pipelines for downstream tasks such as PoS tagging, NER and lemmatization, among others.

**Macro-F1 on Test Set**:  93.65 %

<img src="./assets/c_matrix_raw.png" alt="Confusion Matrix - spacy raw data" width=400 height=400/>

**Analysis**: 

The model performed really well at predicting the `tech` class. Depending on the requirements of the application, one could select this model if, for example, the customer is looking for candidates or job reqs focused on technology.


### 2. Ensemble Model from spaCy - Cleaning was performed 

This experiment differs from the first one in that punctuation and stop words were removed, and the text was then lemmatized.
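The cleaning step can be sketched as a filter over spaCy tokens, using the token attributes `is_punct`, `is_space`, `is_stop` and `lemma_`. This is an illustration rather than the exact code used here: the function name is hypothetical, and the stand-in tokens below are written by hand so that the example runs without downloading the German pipeline.

```python
from collections import namedtuple


def clean_tokens(doc):
    """Return the lemmas of all tokens that are not punctuation, whitespace or stop words.

    `doc` can be a spaCy Doc or any iterable of objects exposing the same attributes.
    """
    return [
        token.lemma_
        for token in doc
        if not (token.is_punct or token.is_space or token.is_stop)
    ]


# Hand-written stand-ins for spaCy tokens (lemmas and flags assigned manually),
# roughly mimicking the sentence "Wir suchen Entwickler mit Erfahrung."
FakeToken = namedtuple("FakeToken", "lemma_ is_punct is_space is_stop")
doc = [
    FakeToken("wir", False, False, True),
    FakeToken("suchen", False, False, False),
    FakeToken("Entwickler", False, False, False),
    FakeToken("mit", False, False, True),
    FakeToken("Erfahrung", False, False, False),
    FakeToken(".", True, False, False),
]

cleaned = clean_tokens(doc)

# With spaCy and the German pipeline installed, a real Doc could be used instead:
#   import spacy
#   nlp = spacy.load("de_core_news_lg")
#   cleaned = clean_tokens(nlp("Wir suchen Entwickler mit Erfahrung."))
```

Printing the cleaned tokens next to the original sentences is a quick way to check which words a given stop-word list actually removes.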

**Macro-F1 on Test Set**: 89.29 %

Let's look at the Confusion Matrix below.

<img src="./assets/c_matrix_clean.png" alt="Confusion Matrix - Cleaned data" width=400 height=400/>

**Analysis**:

The hypothesis was that cleaning the data by removing punctuation and stop words and then lemmatizing might help, but it seems it didn't. I would be curious to do another round with a native speaker. Perhaps context was lost after removing certain stop words, and the model couldn't generalize well.

### 3. German BERT Transformer Model

After fine-tuning on the data provided in the training and validation sets, this seems to be the best model, with an accuracy of around 95%.

**Accuracy on Test Set**: 95% 


**Analysis**: 

Looking at the confusion matrix for the BERT model, we can observe that the predictions were better. Apart from the fact that it uses Transformers, one reason this model might be outperforming the others is that its metrics were calculated taking into account the class imbalance of the `none` class.


<img src="./assets/c_matrix_bert.png" alt="Confusion Matrix - BERT" width=400 height=400/>


# Lessons learnt and further improvements

This challenge was very rewarding and fun!!

### Lessons learnt

- Language barrier: although I only know very basic German, I was able to build the classifier successfully.
- Cleaning the data took more time than expected. As I didn't notice that I was deleting the duplicates incorrectly, I had to re-run the experiments twice.
- Library learning curve: spaCy is a powerful and well-built library. I wish I had more time to tweak parameters or train a categorizer from scratch.
- Domain knowledge: I think I might have done a better job if I had asked whether the `none` class was relevant for the classification task.

---------------

### Further Improvements to the Model and the Work in General

- Perform error analysis to check what is happening around the misclassifications where the `tech` label is confused with the `soft` label.
- Downsample the `none` class if it's truly relevant.
- Under the hypothesis that the `none` class is not important, I think a model could be built for classifying the text into skills (soft or tech).
- My intuition is that, given time and annotated data, training a NER model on the data could have helped by identifying certain "tools", "skills" and "software", and the classifier could have been improved by this.
- I think the Transformer model performed really well given that it didn't have a lot of data to train with, but it has its downsides, such as size and carbon footprint. It would be interesting to build a hypothesis around distilled models, which are lighter in weight and cheaper to train in terms of resources like cost and memory (GPU).
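The downsampling idea from the list above could be sketched as follows. The function name, the `keep` parameter and the toy data are assumptions made for illustration. In practice this would typically be applied to the training split only, so that the test set keeps its original distribution.

```python
import random
from collections import Counter


def downsample_class(samples, labels, target_label, keep, seed=0):
    """Keep at most `keep` randomly chosen samples of `target_label`; keep all other classes."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    target_indices = [i for i, label in enumerate(labels) if label == target_label]
    kept = set(rng.sample(target_indices, min(keep, len(target_indices))))
    selected = [
        i for i, label in enumerate(labels)
        if label != target_label or i in kept
    ]
    return [samples[i] for i in selected], [labels[i] for i in selected]


# Toy data, for illustration only.
labels = ["none"] * 8 + ["tech"] * 3 + ["soft"] * 2
samples = [f"text {i}" for i in range(len(labels))]

new_samples, new_labels = downsample_class(samples, labels, "none", keep=3)
counts = Counter(new_labels)
```

The original order of the remaining samples is preserved, which keeps texts and labels aligned.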
