# TweetNLP Introduction
This Colab notebook gives a short introduction to [`tweetnlp`](https://github.com/cardiffnlp/tweetnlp), a Python library of NLP models for tweets. The tutorial covers the following applications:
- [Text Classification](#scrollTo=KAZYjeskBqL4): Sentiment/Hate/Irony/Emoji/Emotion, etc
- [NER](#scrollTo=WeREiLEjBlrj): Named Entity Recognition (NER)
- [Question Answering](#scrollTo=reZDePaBmYhA&line=4&uniqifier=1): Answer prediction given a question with a context (SQuAD style)
- [Question Answer Generation](#scrollTo=uqd7sBHhnwym&line=6&uniqifier=1): Question and answer pairs generation on a context
- [Language Modeling](#scrollTo=COOoZHVAFCIG): Masked token prediction
- [Fine-tuning](#scrollTo=2plrPTqk7OHp): Model fine-tuning.


## Installation
TweetNLP is available on pip or can be installed from source.


In [None]:
# Fix Colab Error
!pip install --upgrade google-cloud-storage

In [1]:
# via pip
!pip install tweetnlp

Collecting tweetnlp
  Downloading tweetnlp-0.4.4.tar.gz (54 kB)
  Preparing metadata (setup.py) ... done
Collecting ray[tune] (from tweetnlp)
  Downloading ray-2.7.1-cp310-cp310-manylinux2014_x86_64.whl (62.4 MB)
Collecting urlextract (from tweetnlp)
  Downloading urlextract-1.8.0-py3-none-any.whl (21 kB)
Collecting transformers<=4.21.2 (from tweetnlp)
  Downloading transformers-4.21.2-py3-none-any.whl (4.7 MB)

In [None]:
# # via source
# !git clone https://github.com/cardiffnlp/tweetnlp
# %cd tweetnlp
# !pip install . -U

In [None]:
! pip list | grep tweetnlp

tweetnlp                      0.4.0


All you need to do is import `tweetnlp`!

In [2]:
import tweetnlp

## Tweet Classification
The classification module covers seven tasks (Topic Classification, Sentiment Analysis, Irony Detection, Hate Speech Detection, Offensive Language Detection, Emoji Prediction, and Emotion Analysis).
In each example, the model is instantiated with `tweetnlp.load_model("task-name")`, and predictions are made by passing a text or a list of texts to the corresponding function.

### Topic Classification
Given a tweet, the aim of this task is to assign topics related to its content. The task is framed as a supervised multi-label classification problem in which each tweet is assigned one or more of 19 available topics. The topics were curated from Twitter trends to be broad and general, and include classes such as arts and culture, music, and sports. Our internally annotated dataset contains over 10K manually labeled tweets (see the paper [here](https://arxiv.org/abs/2209.09824), or the [Hugging Face dataset page](https://huggingface.co/datasets/cardiffnlp/tweet_topic_single)).

***Multi-label Model***

In [3]:
model = tweetnlp.load_model('topic_classification')  # Or `model = tweetnlp.TopicClassification()`
model.topic("vieja traidora vos larreta morales  y todos los que pretendieron destruir a cambiemos !! jamás vas a ganar.")  # Or `model.predict`

{'label': ['diaries_&_daily_life']}

In [4]:
# Note: the multi-label model applies a sigmoid to each topic independently, so each probability is a binary prediction of whether that topic applies.
model.topic("tenes el apoyo incondicional que tiene patricia !! ahora es!!", return_probability=True)

{'label': ['diaries_&_daily_life'],
 'probability': {'arts_&_culture': 0.07730702310800552,
  'business_&_entrepreneurs': 0.009323335252702236,
  'celebrity_&_pop_culture': 0.1671179085969925,
  'diaries_&_daily_life': 0.547214925289154,
  'family': 0.016154373064637184,
  'fashion_&_style': 0.006967581808567047,
  'film_tv_&_video': 0.2328609973192215,
  'fitness_&_health': 0.004611230921000242,
  'food_&_dining': 0.00356752285733819,
  'gaming': 0.00927015021443367,
  'learning_&_educational': 0.004868419840931892,
  'music': 0.010620350949466228,
  'news_&_social_concern': 0.041006311774253845,
  'other_hobbies': 0.17132407426834106,
  'relationships': 0.07458601891994476,
  'science_&_technology': 0.005275660194456577,
  'sports': 0.010755330324172974,
  'travel_&_adventure': 0.01077447459101677,
  'youth_&_student_life': 0.0034137829206883907}}
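The note above implies that labels are chosen per topic rather than by picking a single winner. A minimal sketch of that selection step follows. The `select_labels` helper and the 0.5 threshold are assumptions made for illustration; the notebook does not state which cutoff `tweetnlp` uses internally.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def select_labels(probabilities: dict, threshold: float = 0.5) -> list:
    """Keep every label whose independent probability reaches the threshold.

    The default threshold of 0.5 is an assumption, not a documented value.
    """
    return [label for label, p in probabilities.items() if p >= threshold]


# A few of the probabilities printed above, rounded to three decimals.
probabilities = {
    "celebrity_&_pop_culture": 0.167,
    "diaries_&_daily_life": 0.547,
    "film_tv_&_video": 0.233,
    "other_hobbies": 0.171,
}

print(select_labels(probabilities))       # ['diaries_&_daily_life']
print(select_labels(probabilities, 0.2))  # ['diaries_&_daily_life', 'film_tv_&_video']
print(sigmoid(0.0))                       # 0.5
```

Because each topic is scored independently, the probabilities do not need to sum to 1, and lowering the threshold can return several topics for one tweet.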

***Single-label Model***

In [5]:
model = tweetnlp.load_model('topic_classification', multi_label=False)  # Or `model = tweetnlp.TopicClassification(multi_label=False)`
model.topic("tenes el apoyo incondicional que tiene patricia !! ahora es!!")

{'label': 'daily_life'}

In [None]:
# Note: the probability of the single-label model is the softmax over the labels.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)

{'label': 'pop_culture',
 'probability': {'arts_&_culture': 9.20625461731106e-05,
  'business_&_entrepreneurs': 6.916998972883448e-05,
  'pop_culture': 0.9995898604393005,
  'daily_life': 0.00011083026038249955,
  'sports_&_gaming': 8.668467489769682e-05,
  'science_&_technology': 5.152115045348182e-05}}
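To make the softmax note concrete, here is a small sketch of how raw scores could be turned into the probabilities shown above. The logit values are hypothetical and chosen only for illustration; they are not taken from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = [
    "arts_&_culture",
    "business_&_entrepreneurs",
    "pop_culture",
    "daily_life",
    "sports_&_gaming",
    "science_&_technology",
]
logits = [-2.0, -2.5, 7.0, -1.8, -2.1, -2.6]  # hypothetical values

probabilities = dict(zip(labels, softmax(logits)))
predicted = max(probabilities, key=probabilities.get)

print(predicted)                              # pop_culture
print(round(sum(probabilities.values()), 6))  # 1.0
```

Unlike the multi-label case, the probabilities compete with one another, so exactly one label is returned.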

***Dataset***

In [None]:
dataset_multi_label, label2id_multi_label = tweetnlp.load_dataset('topic_classification')
dataset_single_label, label2id_single_label = tweetnlp.load_dataset('topic_classification', multi_label=False)

In [None]:
dataset_multi_label

DatasetDict({
    test_2020: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 376
    })
    test_2021: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 1693
    })
    train_2020: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 2858
    })
    train_2021: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 1516
    })
    train_all: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 4374
    })
    validation_2020: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 352
    })
    validation_2021: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 189
    })
    train_random: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 2830
    })
    validati

In [None]:
label2id_multi_label

{'arts_&_culture': 0,
 'business_&_entrepreneurs': 1,
 'pop_culture': 2,
 'daily_life': 3,
 'sports_&_gaming': 4,
 'science_&_technology': 5}

In [None]:
dataset_single_label

DatasetDict({
    test_2020: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 376
    })
    test_2021: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 1693
    })
    train_2020: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 2858
    })
    train_2021: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 1516
    })
    train_all: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 4374
    })
    validation_2020: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 352
    })
    validation_2021: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 189
    })
    train_random: Dataset({
        features: ['text', 'date', 'label', 'label_name', 'id'],
        num_rows: 2830
    })
    validati

In [None]:
label2id_single_label

{'arts_&_culture': 0,
 'business_&_entrepreneurs': 1,
 'pop_culture': 2,
 'daily_life': 3,
 'sports_&_gaming': 4,
 'science_&_technology': 5}

### Sentiment Analysis
The sentiment analysis task integrated in TweetNLP is a simplified version in which the goal is to predict the sentiment of a tweet as one of three labels: positive, neutral, or negative. The base dataset for English is the unified TweetEval version of the SemEval-2017 dataset from the task on Sentiment Analysis in Twitter (see the paper [here](https://arxiv.org/pdf/2010.12421.pdf)).

***English Model***

In [None]:
model = tweetnlp.load_model('sentiment')  # Or `model = tweetnlp.Sentiment()`
model.sentiment("Yes, including Medicare and social security saving👍")  # Or `model.predict`

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


{'label': 'positive'}

In [None]:
model.sentiment("Yes, including Medicare and social security saving👍", return_probability=True)

{'label': 'positive',
 'probability': {'negative': 0.004584962967783213,
  'neutral': 0.19360849261283875,
  'positive': 0.8018065094947815}}

***Multilingual Model***

In [6]:
model = tweetnlp.load_model('sentiment', multilingual=True)  # Or `model = tweetnlp.Sentiment(multilingual=True)`
model.sentiment("tenes el apoyo incondicional que tiene patricia !! ahora es!!")

{'label': 'positive'}

In [7]:
model.sentiment(["tenes el apoyo incondicional que tiene patricia !! ahora es!!", "vieja traidora vos larreta morales  y todos los que pretendieron destruir a cambiemos !! jamás vas a ganar."], return_probability=True)

[{'label': 'positive',
  'probability': {'negative': 0.034607741981744766,
   'neutral': 0.14964747428894043,
   'positive': 0.8157448172569275}},
 {'label': 'negative',
  'probability': {'negative': 0.932296872138977,
   'neutral': 0.05282315984368324,
   'positive': 0.01487990003079176}}]
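When a list of tweets is passed, the result is a list of dictionaries in the shape shown above. The sketch below summarizes such a list; `label_counts` and `mean_probability` are hypothetical helpers written for this example and are not part of `tweetnlp`.

```python
from collections import Counter

# Predictions copied from the output above, with probabilities rounded.
predictions = [
    {"label": "positive",
     "probability": {"negative": 0.035, "neutral": 0.150, "positive": 0.816}},
    {"label": "negative",
     "probability": {"negative": 0.932, "neutral": 0.053, "positive": 0.015}},
]


def label_counts(results):
    """Count how often each label was predicted."""
    return Counter(result["label"] for result in results)


def mean_probability(results):
    """Average the per-label probabilities over all results."""
    labels = results[0]["probability"].keys()
    return {
        label: sum(r["probability"][label] for r in results) / len(results)
        for label in labels
    }


counts = label_counts(predictions)
averages = mean_probability(predictions)

print(dict(counts))
print(averages)
```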

***Dataset***

In [None]:
dataset, label2id = tweetnlp.load_dataset('sentiment')

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 45615
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 12284
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In [None]:
label2id

{'negative': 0, 'neutral': 1, 'positive': 2}
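The `label2id` mapping can be inverted to turn integer labels back into names. This sketch assumes that the `label` column of the dataset stores these integer ids; the example ids are hypothetical.

```python
# Mapping copied from the output above.
label2id = {"negative": 0, "neutral": 1, "positive": 2}

# Invert the mapping so that ids can be looked up directly.
id2label = {index: name for name, index in label2id.items()}

example_ids = [2, 0, 1, 2]  # hypothetical label ids
names = [id2label[i] for i in example_ids]

print(names)  # ['positive', 'negative', 'neutral', 'positive']
```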

In [None]:
for language in ['arabic', 'english', 'french', 'german', 'hindi', 'italian', 'portuguese', 'spanish']:
    dataset_multilingual, label2id_multilingual = tweetnlp.load_dataset('sentiment', multilingual=True, task_language=language)
    print(dataset_multilingual)
    print(label2id_multilingual)
    print()



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 870
    })
})
{'negative': 0, 'neutral': 1, 'positive': 2}





DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 870
    })
})
{'negative': 0, 'neutral': 1, 'positive': 2}





DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 870
    })
})
{'negative': 0, 'neutral': 1, 'positive': 2}





DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 870
    })
})
{'negative': 0, 'neutral': 1, 'positive': 2}





DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 870
    })
})
{'negative': 0, 'neutral': 1, 'positive': 2}





DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 870
    })
})
{'negative': 0, 'neutral': 1, 'positive': 2}





DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 870
    })
})
{'negative': 0, 'neutral': 1, 'positive': 2}





DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 870
    })
})
{'negative': 0, 'neutral': 1, 'positive': 2}



### Irony Detection
This is a binary classification task where given a tweet, the goal is to detect whether it is ironic or not. It is based on the Irony Detection dataset from the SemEval 2018 task (check the paper [here](https://arxiv.org/pdf/2010.12421.pdf)).

***Model***

In [None]:
model = tweetnlp.load_model('irony')  # Or `model = tweetnlp.Irony()`
model.irony('If you wanna look like a badass, have drama on social media')  # Or `model.predict`

{'label': 'irony'}

In [None]:
model.irony('If you wanna look like a badass, have drama on social media', return_probability=True)

{'label': 'irony',
 'probability': {'non_irony': 0.08390878140926361,
  'irony': 0.9160912036895752}}

***Dataset***

In [None]:
dataset, label2id = tweetnlp.load_dataset('irony')

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2862
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 784
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 955
    })
})

In [None]:
label2id

{'non_irony': 0, 'irony': 1}

### Hate Speech Detection
The hate speech detection task consists of detecting whether a tweet is hateful towards women or immigrants. It is based on the Detection of Hate Speech task at SemEval 2019 (see the paper [here](https://arxiv.org/pdf/2010.12421.pdf)).

***Model***

In [None]:
model = tweetnlp.load_model('hate')  # Or `model = tweetnlp.Hate()`
model.hate('Whoever just unfollowed me you a bitch')  # Or `model.predict`

{'label': 'non-hate'}

In [None]:
model.hate('Whoever just unfollowed me you a bitch', return_probability=True)

{'label': 'non-hate',
 'probability': {'non-hate': 0.726382851600647, 'hate': 0.27361705899238586}}

***Dataset***

In [None]:
dataset, label2id = tweetnlp.load_dataset('hate')

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 9000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2970
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
})

In [None]:
label2id

{'non-hate': 0, 'hate': 1}

### Offensive Language Identification
This task consists of identifying whether some form of offensive language is present in a tweet. For our benchmark we rely on the SemEval 2019 OffensEval dataset (see the paper [here](https://arxiv.org/pdf/2010.12421.pdf)).

***Model***

In [None]:
model = tweetnlp.load_model('offensive')  # Or `model = tweetnlp.Offensive()`
model.offensive("All two of them taste like ass.")  # Or `model.predict`

{'label': 'offensive'}

In [None]:
model.offensive("All two of them taste like ass.", return_probability=True)

{'label': 'offensive',
 'probability': {'non-offensive': 0.16420336067676544,
  'offensive': 0.8357967138290405}}

***Dataset***

In [None]:
dataset, label2id = tweetnlp.load_dataset('offensive')

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 11916
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 860
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1324
    })
})

In [None]:
label2id

{'non-offensive': 0, 'offensive': 1}

### Emoji Prediction
The goal of emoji prediction is to predict the final emoji on a given tweet. The dataset used to fine-tune our models is the TweetEval adaptation from the SemEval 2018 task on Emoji Prediction (check the paper [here](https://arxiv.org/pdf/2010.12421.pdf)), including 20 emoji as labels (❤, 😍, 😂, 💕, 🔥, 😊, 😎, ✨, 💙, 😘, 📷, 🇺🇸, ☀, 💜, 😉, 💯, 😁, 🎄, 📸, 😜).

***Model***

In [8]:
model = tweetnlp.load_model('emoji')  # Or `model = tweetnlp.Emoji()`
model.emoji(["tenes el apoyo incondicional que tiene patricia !! ahora es!!", "vieja traidora vos larreta morales  y todos los que pretendieron destruir a cambiemos !! jamás vas a ganar."])  # Or `model.predict`

[{'label': '😍'}, {'label': '😂'}]

In [None]:
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY', return_probability=True)

{'label': '📷',
 'probability': {'❤': 0.13197320699691772,
  '😍': 0.11246418952941895,
  '😂': 0.008415068499743938,
  '💕': 0.04842923954129219,
  '🔥': 0.014528144150972366,
  '😊': 0.15096743404865265,
  '😎': 0.08625395596027374,
  '✨': 0.01616634987294674,
  '💙': 0.07396604120731354,
  '😘': 0.03033280000090599,
  '📷': 0.1652531772851944,
  '🇺🇸': 0.020336609333753586,
  '☀': 0.007999817840754986,
  '💜': 0.01611141860485077,
  '😉': 0.012984543107450008,
  '💯': 0.012557175941765308,
  '😁': 0.0313868410885334,
  '🎄': 0.006829544901847839,
  '📸': 0.04188750311732292,
  '😜': 0.01115693524479866}}
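With 20 candidate emoji, the top prediction is often close to the runners-up, so it can help to look at several candidates. A minimal sketch, using a subset of the rounded probabilities above; `top_k` is a hypothetical helper, not a `tweetnlp` function.

```python
def top_k(probabilities, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Subset of the probabilities printed above, rounded to three decimals.
probabilities = {
    "❤": 0.132,
    "😍": 0.112,
    "😊": 0.151,
    "😎": 0.086,
    "📷": 0.165,
    "💙": 0.074,
}

best = top_k(probabilities)
print(best)  # [('📷', 0.165), ('😊', 0.151), ('❤', 0.132)]
```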

***Dataset***

In [None]:
dataset, label2id = tweetnlp.load_dataset('emoji')

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 45000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})

In [None]:
label2id

{'❤': 0,
 '😍': 1,
 '😂': 2,
 '💕': 3,
 '🔥': 4,
 '😊': 5,
 '😎': 6,
 '✨': 7,
 '💙': 8,
 '😘': 9,
 '📷': 10,
 '🇺🇸': 11,
 '☀': 12,
 '💜': 13,
 '😉': 14,
 '💯': 15,
 '😁': 16,
 '🎄': 17,
 '📸': 18,
 '😜': 19}

### Emotion Recognition
Given a tweet, this task consists of associating it with its most appropriate emotion. As a reference dataset we use the SemEval 2018 task on Affect in Tweets, simplified to only four emotions used in TweetEval: anger, joy, sadness and optimism (check the paper [here](https://arxiv.org/pdf/2010.12421.pdf)).

***Model***

In [12]:
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion(["tenes el apoyo incondicional que tiene patricia !! ahora es!!", "vieja traidora vos larreta morales  y todos los que pretendieron destruir a cambiemos !! jamás vas a ganar."])  # Or `model.predict`

[{'label': 'joy'}, {'label': 'joy'}]

In [10]:
model.emotion(["tenes el apoyo incondicional que tiene patricia !! ahora es!!", "vieja traidora vos larreta morales  y todos los que pretendieron destruir a cambiemos !! jamás vas a ganar."], return_probability=True)

[{'label': 'joy',
  'probability': {'anger': 0.009525344707071781,
   'anticipation': 0.167303204536438,
   'disgust': 0.018283387646079063,
   'fear': 0.002851309720426798,
   'joy': 0.6338755488395691,
   'love': 0.010651671327650547,
   'optimism': 0.06741335988044739,
   'pessimism': 0.009707988239824772,
   'sadness': 0.049333520233631134,
   'surprise': 0.020732387900352478,
   'trust': 0.010322364047169685}},
 {'label': 'joy',
  'probability': {'anger': 0.0026499649975448847,
   'anticipation': 0.14817200601100922,
   'disgust': 0.006477881222963333,
   'fear': 0.002306950744241476,
   'joy': 0.7185746431350708,
   'love': 0.009970979765057564,
   'optimism': 0.05579708144068718,
   'pessimism': 0.0054869865998625755,
   'sadness': 0.024923812597990036,
   'surprise': 0.016973311081528664,
   'trust': 0.008666383102536201}}]

***Dataset***

In [None]:
dataset, label2id = tweetnlp.load_dataset('emotion')

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 3257
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1421
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 374
    })
})

In [None]:
label2id

{'anger': 0, 'joy': 1, 'optimism': 2, 'sadness': 3}

## Named Entity Recognition
This module provides a named-entity recognition (NER) model trained specifically for tweets. The model is instantiated with `tweetnlp.load_model("ner")`, and predictions are made by passing a text or a list of texts to the `ner` function (see the paper [here](https://arxiv.org/abs/2210.03797), or the [Hugging Face dataset page](https://huggingface.co/datasets/tner/tweetner7)).

***Model***

In [13]:
model = tweetnlp.load_model('ner')  # Or `model = tweetnlp.NER()`
model.ner(["tenes el apoyo incondicional que tiene patricia !! ahora es!!", "vieja traidora vos larreta morales  y todos los que pretendieron destruir a cambiemos !! jamás vas a ganar."])  # Or `model.predict`

[[{'type': 'person', 'entity': ' patricia'}], []]

In [None]:
# Note: the probability for the predicted entity is the mean of the probabilities over the sub-tokens representing the entity.
model.ner('Jacob Collier is a Grammy-awarded English artist from London.', return_probability=True)  # Or `model.predict`

[{'type': 'person',
  'entity': 'Jacob Collier',
  'probability': 0.9905317823092142},
 {'type': 'event', 'entity': ' Grammy', 'probability': 0.1916438639163971},
 {'type': 'location', 'entity': ' London', 'probability': 0.9606999158859253}]
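The entity probability is useful for discarding weak predictions such as the `Grammy` event above. The sketch below shows the averaging described in the note and a simple filter. Both helpers and the 0.5 cutoff are assumptions made for this example; the sub-token probabilities are hypothetical.

```python
from statistics import mean


def entity_probability(sub_token_probabilities):
    """Average the probabilities of the sub-tokens that form one entity."""
    return mean(sub_token_probabilities)


def filter_entities(entities, min_probability=0.5):
    """Keep entities whose probability reaches the cutoff (0.5 is an assumption)."""
    return [e for e in entities if e["probability"] >= min_probability]


# Entities copied from the output above, with probabilities rounded.
entities = [
    {"type": "person", "entity": "Jacob Collier", "probability": 0.9905},
    {"type": "event", "entity": " Grammy", "probability": 0.1916},
    {"type": "location", "entity": " London", "probability": 0.9607},
]

kept = filter_entities(entities)
# The entity text may carry a leading space, so strip it before display.
print([e["entity"].strip() for e in kept])  # ['Jacob Collier', 'London']

# Hypothetical sub-token probabilities for a three-piece entity.
print(entity_probability([0.98, 0.99, 1.0]))
```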

***Dataset***

In [None]:
dataset, label2id = tweetnlp.load_dataset('ner')

In [None]:
dataset

DatasetDict({
    test_2020: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 576
    })
    test_2021: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 2807
    })
    validation_2020: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 576
    })
    validation_2021: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 310
    })
    train_2020: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 4616
    })
    train_2021: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 2495
    })
    train_all: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 7111
    })
    validation_random: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 576
    })
    train_random: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 4616
    })
  

In [None]:
label2id

{'B-corporation': 0,
 'B-creative_work': 1,
 'B-event': 2,
 'B-group': 3,
 'B-location': 4,
 'B-person': 5,
 'B-product': 6,
 'I-corporation': 7,
 'I-creative_work': 8,
 'I-event': 9,
 'I-group': 10,
 'I-location': 11,
 'I-person': 12,
 'I-product': 13,
 'O': 14}
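The label names follow the BIO scheme: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. A minimal sketch of decoding such tags into entity spans follows. It assumes the `tags` column holds the integer ids from `label2id`; the token and tag sequences are hypothetical.

```python
def decode_bio(tokens, tags):
    """Group tokens into (entity_type, text) spans based on BIO tags."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if there is any.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent "I-" tag ends the current entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Subset of the mapping printed above.
label2id = {"B-location": 4, "B-person": 5, "I-person": 12, "O": 14}
id2label = {index: name for name, index in label2id.items()}

tokens = ["Jacob", "Collier", "is", "from", "London", "."]  # hypothetical
tag_ids = [5, 12, 14, 14, 4, 14]                             # hypothetical
tags = [id2label[i] for i in tag_ids]

spans = decode_bio(tokens, tags)
print(spans)  # [('person', 'Jacob Collier'), ('location', 'London')]
```

In this sketch an `I-` tag that does not continue the current entity type is treated like `O`; other decoders start a new entity in that case.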

## Question Answering
This module provides a question answering model trained specifically for tweets.
The model is instantiated with `tweetnlp.load_model("question_answering")`,
and predictions are made by passing a question or a list of questions, along with a context or a list of contexts,
to the `question_answering` function (see the paper [here](https://arxiv.org/abs/2210.03992), or the [Hugging Face dataset page](https://huggingface.co/datasets/lmqg/qg_tweetqa)).

***Model***

In [14]:
model = tweetnlp.load_model('question_answering')  # Or `model = tweetnlp.QuestionAnswering()`
model.question_answering(
  question='¿a quién estamos apoyando?',
  context="tenes mi apoyo incondicional que tiene patricia !! ahora es!!"
)  # Or `model.predict`

RuntimeError: ignored

***Dataset***

In [None]:
dataset = tweetnlp.load_dataset('question_answering')

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['answer', 'paragraph_question', 'question', 'paragraph'],
        num_rows: 9489
    })
    validation: Dataset({
        features: ['answer', 'paragraph_question', 'question', 'paragraph'],
        num_rows: 1086
    })
    test: Dataset({
        features: ['answer', 'paragraph_question', 'question', 'paragraph'],
        num_rows: 1203
    })
})

## Question Answer Generation
This module provides a question-and-answer pair generation model trained specifically for tweets.
The model is instantiated with `tweetnlp.load_model("question_answer_generation")`,
and predictions are made by passing a context or a list of contexts
to the `question_answer_generation` function (see the paper [here](https://arxiv.org/abs/2210.03992), or the [Hugging Face dataset page](https://huggingface.co/datasets/lmqg/qag_tweetqa)).


***Model***

In [None]:
model = tweetnlp.load_model('question_answer_generation')  # Or `model = tweetnlp.QuestionAnswerGeneration()`
model.question_answer_generation(
  text="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`

[{'question': 'who created the post?', 'answer': 'ben'},
 {'question': 'what did ben do in 1994?', 'answer': 'he retired as editor'}]

***Dataset***

In [None]:
dataset = tweetnlp.load_dataset("question_answer_generation")

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['answers', 'questions', 'paragraph', 'paragraph_id', 'questions_answers'],
        num_rows: 4536
    })
    validation: Dataset({
        features: ['answers', 'questions', 'paragraph', 'paragraph_id', 'questions_answers'],
        num_rows: 583
    })
    test: Dataset({
        features: ['answers', 'questions', 'paragraph', 'paragraph_id', 'questions_answers'],
        num_rows: 583
    })
})

## Language Modeling
The masked language model predicts the masked token in a given sentence. It is instantiated with `tweetnlp.load_model('language_model')`, and predictions are made by passing a text or a list of texts to the `mask_prediction` function. Make sure that each text contains a `<mask>` token, since that is the token the model is asked to predict.
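The relationship between the input text and the candidate sentences in the `best_sentences` field can be sketched as a plain string substitution. `fill_mask` is a hypothetical helper written for this example, and the check for exactly one `<mask>` token is an assumption of the sketch rather than documented library behaviour.

```python
def fill_mask(text, candidates, mask_token="<mask>"):
    """Substitute each candidate token for the mask token in the text."""
    if text.count(mask_token) != 1:
        raise ValueError(f"expected exactly one {mask_token} token in: {text!r}")
    return [text.replace(mask_token, candidate) for candidate in candidates]


sentences = fill_mask("So glad I'm <mask> vaccinated.", ["fully", "getting", "not"])
print(sentences[0])  # So glad I'm fully vaccinated.
```

Validating the input before calling the model gives a clearer error than a failure deep inside the tokenizer.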

In [None]:
model = tweetnlp.load_model('language_model')  # Or `model = tweetnlp.LanguageModel()`
model.mask_prediction("So glad I'm <mask> vaccinated.")  # Or `model.predict`

{'best_tokens': ['fully',
  'getting',
  'not',
  'still',
  'already',
  'all',
  'being',
  'completely',
  'now',
  'finally'],
 'best_scores': [1.0366170595521584e-11,
  1.6352504420003022e-11,
  1.9819770824547334e-11,
  4.520027963028639e-10,
  1.0578210094536189e-05,
  0.0002495304506737739,
  2.3952467017807066e-05,
  1.8536340576247312e-05,
  2.8879132514703088e-05,
  5.781193976872601e-06],
 'best_sentences': ["So glad I'm fully vaccinated.",
  "So glad I'm getting vaccinated.",
  "So glad I'm not vaccinated.",
  "So glad I'm still vaccinated.",
  "So glad I'm already vaccinated.",
  "So glad I'm all vaccinated.",
  "So glad I'm being vaccinated.",
  "So glad I'm completely vaccinated.",
  "So glad I'm now vaccinated.",
  "So glad I'm finally vaccinated."]}

## Tweet Embedding
The tweet embedding model produces a fixed-length embedding for a tweet. The embedding represents the meaning of the tweet and can be used for semantic search over tweets by comparing the similarity between embeddings. The model is instantiated with `tweetnlp.load_model('sentence_embedding')`, and predictions are made by passing a text or a list of texts to the `embedding` function.

In [15]:
model = tweetnlp.load_model('sentence_embedding')



In [18]:
# Get sentence embedding
tweet = "Bullrich y Macri dan su apoyo a Milei . la izquierda como esos metiches que opinan de matrimonios ajenos"
vectors = model.embedding(tweet)
vectors.shape


(768,)

In [16]:
# Get sentence embedding (multiple inputs)
tweet_corpus = [
    "bullrich no tiene nada que perder y el otro si , ademas hasta convocó a la izquierda , la rta es facil",
    "grande bullrich! a sepultar la miseria!  el 18 milei presidente y vicky vicepresidente!    .co/2acpajsieg",
    "no era q ibas s terminar con la casta política y que cambiemos era juntos por el cargo? que bullrich ponía bombas en jardines mamita",
    "las palabras de bullrich me hicieron llorar . fue lo que siempre soñé , la unión de milei y bullrich .",
    "bullrich hasta acá llegaste , deja de faltarle el respeto a los viejos meados de la      .co/9yztzf1npk",
    "patricia bullrich habló y expresó su apoyo a javier milei .     .co/hciytw89mi",
    "vieja traidora vos larreta morales  y todos los que pretendieron destruir a cambiemos !! jamás vas a tenes el apoyo incondicional que tiene patricia !! ahora es!! y será el  p r o !!  con los verdaderos  patricia bullrich  mauricio macri basta de resentidos !!",
    "patricia bullrich la traidora traicionó a la izquierda para irse con los radicales y traicionó a los radicales para irse ahora con milei",
    "aguante patricia bullrich loko viva la libertad carajo",
    "bullrich le pego terrible paliza a larreta , cambiemos se purifico , no hay division . solo sacaron a los traidores .",
    "son medios parecidos . miley es de derecha y bullrich fue de izquierda y ahora se la da de derecha . pasa que en ambos partidos hay mucha gente de varios partidos mesclados . en cambiemos hay radicales y en el de miley hay ex menemistas .",
    "la ucr , larreta , carrio hdrp   aguante la libertad carajoooo   .co/sus8hlq9f8",
    "en lla hay rupturas por la negociacion de milei con macri , bullrich y la izquierda . nadie conforme con los resultados del domingo .",
    "una por una el libertario está derribando todas las ideas y afirmaciones que dio durante su campaña , en un acto de profunda desesperación por intentar ganar algunos votos . primero la izquierda , luego bullrich , también el papa . . . ahora fue el turno de ch . . .  .co/ynlbxh6ajy",
    "era obvio que bullrich iba a estar con milei . la derecha no tiene problema en dejar diferencias de lado y unirse . algo que la izquierda nunca hace . esperemos que los votantes del pro tengan su propio criterio ."
    ]
vectors = model.embedding(tweet_corpus, batch_size=3)
vectors.shape

(16, 768)

In [19]:
# Similarity search
sims = []
for n, i in enumerate(tweet_corpus):
  _sim = model.similarity(tweet, i)
  sims.append([n, _sim])
print(f'anchor tweet: {tweet}\n')
for m, (n, s) in enumerate(sorted(sims, key=lambda x: x[1], reverse=True)):
  print(f' - top {m}: {tweet_corpus[n]}\n - similarity: {s}\n')

anchor tweet: bullrich y macri dan su apoyo a milei . la izquierda como esos metiches que opinan de matrimonios ajenos

 - top 0: en lla hay rupturas por la negociacion de milei con macri , bullrich y la izquierda . nadie conforme con los resultados del domingo .
 - similarity: 0.8197374388010154

 - top 1: era obvio que bullrich iba a estar con milei . la derecha no tiene problema en dejar diferencias de lado y unirse . algo que la izquierda nunca hace . esperemos que los votantes del pro tengan su propio criterio .
 - similarity: 0.814577690317582

 - top 2: las palabras de bullrich me hicieron llorar . fue lo que siempre soñé , la unión de milei y bullrich .
 - similarity: 0.7836462367948186

 - top 3: patricia bullrich habló y expresó su apoyo a javier milei .     .co/hciytw89mi
 - similarity: 0.7547579920421408

 - top 4: grande bullrich! a sepultar la miseria!  el 18 milei presidente y vicky vicepresidente!    .co/2acpajsieg
 - similarity: 0.7467215812733348

 - top 5: bullrich hasta acá l
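The loop above calls `model.similarity` once per corpus tweet. When embeddings are already available as vectors (as returned by `model.embedding` above), the ranking can also be computed directly from them. Below is a minimal sketch, assuming cosine similarity as the measure (whether `model.similarity` uses exactly this measure is not confirmed here). Small made-up vectors stand in for the 768-dimensional embeddings so the sketch runs without the model, and `cosine_similarity` and `rank_by_cosine` are hypothetical helpers.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(anchor, corpus):
    """Return (index, similarity) pairs sorted from most to least similar."""
    sims = [(i, cosine_similarity(anchor, vec)) for i, vec in enumerate(corpus)]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings.
anchor_vec = [1.0, 0.0, 0.0]
corpus_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the anchor
    [2.0, 0.0, 0.0],  # same direction as the anchor
    [1.0, 1.0, 0.0],  # 45 degrees from the anchor
]

for idx, sim in rank_by_cosine(anchor_vec, corpus_vecs):
    print(f"corpus[{idx}]: similarity = {sim:.3f}")
```

Embedding the corpus once and ranking on the stored vectors avoids recomputing the anchor embedding for every comparison.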

## Use Custom Model
To use another model from a local path or the HuggingFace model hub, simply provide the model path/alias to the `load_model` function.

`tweetnlp.load_model('task', model_name='model-path/alias')`

Alternatively, any classification model can be used without specifying the task.


In [20]:
# Another task, e.g. NER
model = tweetnlp.load_model('ner', model_name='tner/twitter-roberta-base-2019-90m-tweetner7-continuous')
model.ner("Bullrich y Macri dan su apoyo a Milei . la izquierda como esos metiches que opinan de matrimonios ajenos")

[{'type': 'person', 'entity': 'Bullrich'},
 {'type': 'person', 'entity': ' Macri'},
 {'type': 'person', 'entity': ' Milei'}]

In [None]:
# Other task: a custom classification model

# 'text-classification' is not a valid task name, so the following call fails:
# model = tweetnlp.load_model('text-classification', model_name="02shanky/finetuned-twitter-xlm-roberta-base-emotion")
# Valid tasks: 'sentiment', 'offensive', 'irony', 'hate', 'emotion', 'emoji', 'stance_abortion', 'stance_atheism', 'stance_climate', 'stance_feminist', 'stance_hillary', 'topic_classification', 'ner', 'language_model', 'sentence_embedding', 'question_answering', 'question_answer_generation'

model = tweetnlp.load_model('topic_classification', model_name="02shanky/finetuned-twitter-xlm-roberta-base-emotion")
model.topic("Bullrich y Macri dan su apoyo a Milei . la izquierda como esos metiches que opinan de matrimonios ajenos", return_probability=True)
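Since an unknown task name is rejected, it can help to check the name before starting a model download. Below is a minimal sketch, assuming the list of valid tasks quoted in the error message above is complete for the installed version. The `check_task` helper and the `VALID_TASKS` list are hypothetical and not part of `tweetnlp`.

```python
import difflib

# Task names copied from the error message quoted above.
VALID_TASKS = [
    'sentiment', 'offensive', 'irony', 'hate', 'emotion', 'emoji',
    'stance_abortion', 'stance_atheism', 'stance_climate', 'stance_feminist',
    'stance_hillary', 'topic_classification', 'ner', 'language_model',
    'sentence_embedding', 'question_answering', 'question_answer_generation',
]


def check_task(task):
    """Return the task name if it is valid, otherwise raise with suggestions."""
    if task in VALID_TASKS:
        return task
    # Suggest the closest known names to help spot typos or wrong aliases.
    suggestions = difflib.get_close_matches(task, VALID_TASKS, n=3, cutoff=0.4)
    raise ValueError(f"{task!r} is not a valid task; close matches: {suggestions}")


print(check_task('emotion'))
try:
    check_task('text-classification')
except ValueError as err:
    print(err)
```

The validated name can then be passed on to `tweetnlp.load_model` as in the cells above.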

In [None]:
{
    'sentiment': "",
    'offensive': "",
    'irony': "",
    'hate': "",
    'emotion': "4-emotion model: ",
    'emoji': "Predicts the emotion of a tweet with an emoji, etc.",
    'stance_abortion': "stance on abortion",
    'stance_atheism': "stance on atheism",
    'stance_climate': "stance on climate change",
    'stance_feminist': "stance on feminism",
    'topic_classification': "",
    'ner': "",
    'language_model': "",
    'sentence_embedding': "",
    'question_answering': "",
    'question_answer_generation': ""
    }
 

## Fine-tuning Language Model with TweetNLP
TweetNLP provides a simple interface for fine-tuning language models on the supported datasets, with HuggingFace handling model hosting and fine-tuning and [Ray Tune](https://docs.ray.io/en/latest/tune/index.html) handling hyperparameter search.
- Supported Tasks: `sentiment`, `offensive`, `irony`, `hate`, `emotion`, `topic_classification`



In [None]:
import logging
import tweetnlp
from pprint import pprint

logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s', level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S')

# example inputs for model prediction
sample = [
    "How many more days until opening day? 😩",
    "All two of them taste like ass.",
    "If you wanna look like a badass, have drama on social media",
    "Whoever just unfollowed me you a bitch",
    "I love swimming for the same reason I love meditating...the feeling of weightlessness.",
    "Beautiful sunset last night from the pontoon @ Tupper Lake, New York",
    'Jacob Collier is a Grammy-awarded English artist from London.'
]

# set language model and task
language_model = 'cardiffnlp/twitter-roberta-base-2021-124m'
task = "irony"

# load dataset
dataset, label_to_id = tweetnlp.load_dataset(task)

# load trainer
trainer_class = tweetnlp.load_trainer(task)

# define trainer
trainer = trainer_class(
    language_model=language_model,
    dataset=dataset,
    label_to_id=label_to_id,
    max_length=128,
    split_train='train',
    split_test='test',
    output_dir='model_ckpt/test'
)

In [None]:
# train
trainer.train(down_sample_size_train=1000, ray_result_dir="ray_results/test")

# save model checkpoint
trainer.save_model()

In [None]:
# model evaluation
metrics = trainer.evaluate()
pprint(metrics)

In [None]:
# sample prediction
output = trainer.predict(sample)
pprint(f"Sample Prediction: {language_model} ({task})")
for s, p in zip(sample, output):
    pprint(s)
    pprint(p)