<a href="https://colab.research.google.com/github/jpsiegel/Projects/blob/master/caseStudy_Jan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

### Background

This case study is loosely based on an actual market research problem. Over the years, the client has tested a relatively large number of advertising messages. In these tests, respondents evaluated the credibility and appeal of each message. The evaluations are aggregated into a score, which is then binned into deciles. Top-decile claims are likely to be used in actual advertising campaigns.
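
To make the binning step concrete, here is a minimal sketch of quantile-based decile assignment. It assumes plain quantile cut points over the observed scores; the client's actual aggregation and binning rules are not described here, and both `decile_of` and the sample scores are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles

# Hypothetical survey scores in (0, 1]; not taken from the client data.
scores = [i / 100 for i in range(1, 101)]

# Nine cut points that split the scores into ten equally populated bins.
cut_points = quantiles(scores, n=10, method="inclusive")

def decile_of(score, cuts=cut_points):
    """Return the decile (1 = lowest, 10 = highest) that a score falls into."""
    return bisect_right(cuts, score) + 1

print(decile_of(0.05), decile_of(0.55), decile_of(0.99))
```

A message with a score above the ninth cut point lands in decile 10, which is the group the background paragraph describes as most likely to be used in campaigns.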

We try to predict the claim performance (as represented by the decile) from the text by applying a popular pre-trained model.

### Your task

Fill in the blanks to make a basic analysis work. Add to it as much as you like. During the subsequent interview, you can explain your solution and approach.

### Data

There are three columns in the data: `Message_Text` (the advertising message), `score` (the survey-based score), and `label` (the decile).

Our client wants to predict `label` from `Message_Text`. There are 10 classes in total.

The data structure is based on real data; for confidentiality reasons, however, it is not our actual client data.

### Model

The code below applies a *DistilBERT* model to do the classification. Please fill in the blanks.

We will use a pretrained model from the [Hugging Face](https://huggingface.co/) library. Hugging Face is an open-source AI library where cutting-edge models are published. You can find [courses](https://huggingface.co/course/chapter1/1) online.
 
This case study is a [text classification](https://huggingface.co/tasks/text-classification) task.
If you are not familiar with [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)), please read this [paper](https://arxiv.org/abs/1810.04805). Please also read up on the [attention mechanism](https://arxiv.org/abs/1706.03762) and the transformer architecture.

### Setup

In [5]:
!pip install transformers --quiet 
!pip install datasets --quiet
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil --quiet
!pip install psutil --quiet
!pip install humanize --quiet

import psutil
import humanize
import os
import GPUtil as GPU

In [None]:
# check GPU unit

GPUs = GPU.getGPUs()
gpu = GPUs[0]  # Only one GPU on Colab and not guaranteed

def printm():
  """Prints available ram and graphic memory"""
  process = psutil.Process(os.getpid())
  print("RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " | Used: " + humanize.naturalsize(process.memory_info().rss))
  print("VRAM Free: {0:.0f}MB | Used: {1:.0f}MB | Using {2:3.0f}% Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil * 100, gpu.memoryTotal))

printm()

In [6]:
# mount google drive for workspace

from google.colab import drive

def mount_gdrive():
  """Sets up google drive directory access"""
  path = "/content/drive"
  drive.mount(path, force_remount=True)
  
my_dir = "/content/drive/MyDrive/CaseStudySkim/"
mount_gdrive()
print(os.listdir(my_dir))

Mounted at /content/drive
['caseStudy_Jan.ipynb', 'sampled_data_NLP.xltx']


In [7]:
import pandas as pd
import torch
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, 
                          TrainingArguments, Trainer)

### Data Exploration

In [8]:
data = pd.read_excel(my_dir + "sampled_data_NLP.xltx")
print("Shape: ", data.shape, "\n")
data.head()

Shape:  (7211, 3) 



Unnamed: 0,Message_Text,label,score
0,Impeccable stain removal,class7,0.634615
1,With Odour Resistance Formula - Still fresh at...,class9,0.875
2,Original recipe,class8,0.788462
3,Natural hair gene awakening for 1000 new hair ...,class4,0.383333
4,Plastic-free packaging,class5,0.483871


In [9]:
symbol_example1 = 1254
symbol_example2 = 1649
dutch_example = 2000
spanish_example = 1104
print("Unwanted characters: \n\n", data.iloc[symbol_example1], "\n")
print(data.iloc[symbol_example2], "\n")
print("Non-english languages: \n\n", data.iloc[dutch_example], "\n")
print(data.iloc[spanish_example])

Unwanted characters: 

 Message_Text    Yumos: 2 fragrances in 1¬†fabcon - every time ...
label                                                      class2
score                                                    0.105263
Name: 1254, dtype: object 

Message_Text    "It‚Äôs better than any other Deodorants that ...
label                                                     class10
score                                                    0.928571
Name: 1649, dtype: object 

Non-english languages: 

 Message_Text    Vrij van conserveermiddelen
label                                class5
score                              0.433333
Name: 2000, dtype: object 

Message_Text    Textura ideal con muchos tomates
label                                     class9
score                                       0.85
Name: 1104, dtype: object
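
Sequences such as `‚Äô` and `¬†` in the output above look like UTF-8 bytes that were decoded as Mac Roman (a right single quotation mark and a non-breaking space, respectively). The sketch below tries to reverse that under this assumption and leaves the string unchanged if the assumption does not hold. `repair_text` is a hypothetical helper, not part of the original notebook.

```python
import unicodedata

def repair_text(text):
    """Undo UTF-8-read-as-Mac-Roman garbling, then apply NFKC normalization."""
    try:
        # Re-encode with the wrong codec to recover the original bytes,
        # then decode them with the codec they were written in.
        text = text.encode("mac_roman").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The string was not garbled this way; keep it unchanged.
        pass
    # NFKC turns compatibility characters such as the non-breaking space
    # into their plain equivalents.
    return unicodedata.normalize("NFKC", text)

print(repair_text("It‚Äôs better than any other"))
print(repair_text("2 fragrances in 1¬†fabcon"))
print(repair_text("Vrij van conserveermiddelen"))
```

Compared with dropping every character outside the printable ASCII range, this keeps apostrophes and word boundaries intact (`1 fabcon` rather than `1fabcon`).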


### Preprocessing

In [31]:
!pip install googletrans==4.0.0rc1 --quiet
!pip install chardet
!pip install langid


In [40]:
# Please preprocess the label column into the correct format so that the model runs smoothly
# your code here....

# Keep only printable ASCII characters (code points 32 to 126)
data["Message_Text"] = data["Message_Text"].apply(lambda x: ''.join(ch for ch in x if 32 <= ord(ch) <= 126))

#print(data.iloc[symbol_example1], "\n")
#print(data.iloc[symbol_example2])

from googletrans import Translator
translator = Translator()
def detect_language(string, translator=translator):
  try:
    ret = translator.detect(string).lang == "en"
  except AttributeError:
    print(string)
    ret = False
  return ret

import langid
def detect_lang(string):
  return langid.classify(string)[0] == "en"

#df = df.drop(df[df.score < 50].index)
#df[df['column name'].map(len) < 2]#

#data_noneng = data[data["Message_Text"].map(detect_lang) == False]
#for i in range(773):
#  print(data_noneng.iloc[i]["Message_Text"])

#import langid
langid.classify("More space, better taste")
 

# Keep both datasets: one with all languages, the other with the neat reduction done using langid

('it', -6.865324020385742)
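
One way to carry out the label preprocessing requested at the top of the cell above is to turn each `classN` string into a one-hot vector. This is a minimal sketch, assuming the labels always follow the `classN` pattern with N from 1 to 10. The helper names `label_to_index` and `one_hot` are hypothetical, and the index order (class1 first) is a choice made here, not something the data prescribes.

```python
NUM_CLASSES = 10
PREFIX = "class"

def label_to_index(label):
    """Map 'class1'..'class10' to the zero-based indices 0..9."""
    if not label.startswith(PREFIX):
        raise ValueError(f"unexpected label: {label!r}")
    index = int(label[len(PREFIX):]) - 1
    if not 0 <= index < NUM_CLASSES:
        raise ValueError(f"label out of range: {label!r}")
    return index

def one_hot(label):
    """Return a 10-element list with a single 1 at the label's index."""
    vector = [0] * NUM_CLASSES
    vector[label_to_index(label)] = 1
    return vector

print(one_hot("class7"))
```

With the DataFrame from this notebook, the function could be applied as `data["label"] = data["label"].map(one_hot)` before the conversion to a `Dataset`.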

In [None]:
# convert pandas into dataset and get train and test dataset
ds = (Dataset.from_pandas(data).train_test_split(train_size=0.8, test_size=0.2))

In [None]:
# peek at one example
ds["train"][0]

{'Message_Text': 'Now uses the power of nature for longer lasting fragrance',
 'label': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 'score': 0.35294117647058826,
 'idx': 5807}

## Tokenizer

We use the Hugging Face [AutoClass](https://huggingface.co/docs/transformers/model_doc/auto) API. For details on the tokenizer, see the [tokenizer documentation](https://huggingface.co/docs/transformers/main_classes/tokenizer).

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
# please write the correct parameters here....
)

In [None]:
def tokenize_and_encode(examples):
    return tokenizer(examples["Message_Text"], truncation=True)

In [None]:
cols = ds["train"].column_names
cols.remove("label")
ds_enc = ds.map(tokenize_and_encode, batched=True, remove_columns=cols)
ds_enc

  0%|          | 0/8 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 7025
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 1757
    })
})

In [None]:
# cast label IDs to floats
ds_enc.set_format("torch")
ds_enc = (ds_enc
          .map(lambda x : {"float_label": x["label"].to(torch.float)}, remove_columns=["label"])
          .rename_column("float_label", "label"))

  0%|          | 0/7025 [00:00<?, ?ex/s]

  0%|          | 0/1757 [00:00<?, ?ex/s]
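
The float cast is presumably there because the sequence-classification heads in Transformers treat float label vectors as multi-label targets and apply a binary cross-entropy loss to the logits. As a rough illustration of that loss for a single example, here is a pure-Python sketch with made-up logits. `bce_with_logits` is a hypothetical helper, not the library's implementation.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy between raw logits and 0/1 float targets."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of
        # -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# One-hot float target at index 4, as in the example record shown earlier.
target = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Uninformative logits (all zero) versus confident, correct logits.
flat_logits = [0.0] * 10
sharp_logits = [10.0 if y == 1.0 else -10.0 for y in target]

print(bce_with_logits(flat_logits, target))
print(bce_with_logits(sharp_logits, target))
```

All-zero logits give a loss of log(2) per class, and the loss shrinks towards zero as the logits become confident in the right direction.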

## Model

In [None]:
model = (AutoModelForSequenceClassification
         .from_pretrained(
            # please add your code here....
          ).to('cuda')
        )
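
One possible way to fill in the blank above is sketched below as a plain dictionary of keyword arguments, so that it can be inspected without downloading anything. The checkpoint name `distilbert-base-uncased` is an assumption (the text only says DistilBERT), and `problem_type` is chosen to match the float one-hot labels prepared earlier.

```python
# Assumed checkpoint; other DistilBERT checkpoints on the Hugging Face Hub
# could be substituted.
CHECKPOINT = "distilbert-base-uncased"

model_kwargs = {
    "pretrained_model_name_or_path": CHECKPOINT,
    "num_labels": 10,  # one output per decile class
    # Float one-hot labels go together with a multi-label (BCE) loss.
    "problem_type": "multi_label_classification",
}

# In the notebook this would be used roughly as follows (commented out here
# because it downloads weights and expects a GPU):
# tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
# model = AutoModelForSequenceClassification.from_pretrained(**model_kwargs).to("cuda")
print(model_kwargs)
```

An alternative design is to keep integer class ids as labels and rely on the default single-label setting, which uses a cross-entropy loss over the ten classes.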

In [None]:
# let's peek at the data
ds_enc["train"][0]

{'input_ids': tensor([  101,  2085,  3594,  1996,  2373,  1997,  3267,  2005,  2936,  9879,
         24980,   102]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 'label': tensor([0., 0., 0., 0., 1., 0., 0., 0., 0., 0.])}

In [None]:
args = TrainingArguments(
  # please put your code here....
)

trainer = Trainer(model=model, 
                  args=args, 
                  train_dataset=ds_enc["train"], 
                  eval_dataset=ds_enc["test"], 
                  tokenizer=tokenizer)
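
The `TrainingArguments` blank above could be filled with values along these lines. The numbers are common starting points for fine-tuning a BERT-style model, not values tuned for this dataset, and the output folder name is hypothetical.

```python
training_kwargs = {
    "output_dir": "distilbert-claims",   # hypothetical folder for checkpoints
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
}

# In the notebook: args = TrainingArguments(**training_kwargs)
for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

Keeping the settings in a dictionary makes it easy to log them alongside the results when several configurations are compared.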

In [None]:
trainer.train()

In [None]:
trainer.evaluate()
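
Without a metric function, `trainer.evaluate()` reports the loss but no accuracy. Below is a minimal sketch of such a function, assuming one-hot labels and one logit per class. It could be passed to `Trainer` through its `compute_metrics` argument. The toy inputs are made up.

```python
def argmax(values):
    """Index of the largest element in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])

def compute_metrics(eval_pred):
    """Accuracy from (logits, one-hot labels), both shaped [n_examples, n_classes]."""
    logits, labels = eval_pred
    hits = sum(argmax(row) == argmax(target) for row, target in zip(logits, labels))
    return {"accuracy": hits / len(labels)}

# Toy check with three classes: two of the three predictions are correct.
toy_logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.9, 1.1, 0.0]]
toy_labels = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
print(compute_metrics((toy_logits, toy_labels)))
```

Because deciles are ordered, a metric that also rewards near misses (for example, the mean absolute distance between predicted and true class index) may be worth reporting next to plain accuracy.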

# Improvement?

Congrats! You have finished training your model.

But this is a very basic model, and there are still many improvements that could be made.

For example, you could improve model accuracy by tuning hyperparameters, trying different models, or logging training runs to diagnose problems.
You could also try explainable AI to interpret why the model gives a particular prediction.

The client is also interested in directly predicting scores; you could try that as well.
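
For the score-prediction variant, the target becomes the continuous `score` column and evaluation switches to regression metrics. Here is a small sketch of two such metrics in pure Python. The example numbers are made up, and on the model side this would typically mean a single-output head (for instance `num_labels=1`), which is an assumption and not part of the original notebook.

```python
import math

def regression_metrics(predicted, actual):
    """Return mean absolute error and root mean squared error."""
    errors = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse}

# Made-up predictions against made-up survey scores in [0, 1].
predicted = [0.60, 0.80, 0.40]
actual = [0.50, 0.90, 0.40]
print(regression_metrics(predicted, actual))
```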

Have fun!