# Natural Language Processing (NLP) in Python

## 1. Libraries

- The [Hugging Face Transformers library](https://huggingface.co/docs/transformers/v4.17.0/en/index) offers a large collection of pre-trained models based on transformer architectures such as BERT, RoBERTa, and T5. These models have been trained on large text corpora and can be used directly or fine-tuned for NLP tasks such as sentiment analysis, named entity recognition, or translation.

In [1]:
import pandas as pd # for data processing
from transformers import AutoTokenizer, TFAutoModelForCausalLM, TFAutoModelForSequenceClassification # to load pretrained models
import tensorflow as tf
import tensorflow.keras as keras
print(f"Tensorflow {tf.__version__}")


Tensorflow 2.19.0


## 2. Importing Data

In [3]:
# Import the CSV file
harring_data = pd.read_csv("./export/harring_data.csv")
harring_data.head()

| Unnamed: 0 | sentences |
|---|---|
| 0 | Keith Allen Haring (May 4, 1958 – February 16,... |
| 1 | His animated imagery has "become a widely reco... |
| 2 | Much of his work includes sexual allusions tha... |
| 3 | In addition to solo gallery exhibitions, he pa... |
| 4 | The Whitney Museum held a retrospective of his... |
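
The `Unnamed: 0` column in the output above typically appears when a CSV was written together with its row index and is read back without declaring that index. A minimal sketch of the effect, assuming that is what happened here; the two-row CSV text is a hypothetical stand-in, not the contents of `harring_data.csv`:

```python
import io

import pandas as pd

# Hypothetical CSV text: the first header cell is empty because the
# row index was written to the file along with the data.
csv_text = ",sentences\n0,First example sentence.\n1,Second example sentence.\n"

# Default read: the unnamed index column becomes a regular column "Unnamed: 0".
df_default = pd.read_csv(io.StringIO(csv_text))
print(list(df_default.columns))

# Declaring the first column as the index leaves only the data column.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(df_indexed.columns))
```

Whether to pass `index_col=0` when reading or `index=False` when writing depends on which side of the export is under your control.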


## 3. Applying Natural Language Processing

### 3.1 Sentiment Analysis

In [4]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [5]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, framework="tf")

Device set to use 0


Prepare the sentences from the DataFrame for classification:

In [10]:
# get the sentences column
sentences = harring_data["sentences"]

# make sure every entry is a string
sentences = sentences.astype(str)

# convert the column to a plain Python list
sentences = sentences.tolist()

# choose one & print it
one_sentence = sentences[12]
print(one_sentence)

In 2019, he was one of the inaugural 50 American "pioneers, trailblazers, and heroes" inducted on the National LGBTQ Wall of Honor within the Stonewall National Monument in New York City's Stonewall Inn


In [11]:
# classify the sentence
classifier(one_sentence)

[{'label': '5 stars', 'score': 0.5006921887397766}]

In [13]:
# classify another one
second_sentence = sentences[10]
print(second_sentence)
classifier(second_sentence)

Haring died of AIDS-related complications on February 16, 1990


[{'label': '1 star', 'score': 0.44514399766921997}]
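
Both scores above (about 0.50 and 0.45) are fairly low. The `nlptown` checkpoint predicts star ratings from 1 to 5 and was fine-tuned on product reviews, so its labels for encyclopedic sentences are best read with caution. The following is a minimal sketch of how the pipeline output could be post-processed: it converts the label string to an integer and flags predictions below a confidence threshold. The helper name `summarize_prediction` and the threshold of 0.6 are assumptions chosen for illustration; the two results are copied from the cells above.

```python
def summarize_prediction(prediction, threshold=0.6):
    """Turn one pipeline result dict into (stars, score, is_confident)."""
    # The label looks like "5 stars" or "1 star"; the first token is the rating.
    stars = int(prediction["label"].split()[0])
    score = float(prediction["score"])
    return stars, score, score >= threshold


# Results copied from the two classifier calls above.
results = [
    {"label": "5 stars", "score": 0.5006921887397766},
    {"label": "1 star", "score": 0.44514399766921997},
]

summaries = [summarize_prediction(r) for r in results]
for stars, score, confident in summaries:
    flag = "confident" if confident else "low confidence"
    print(f"{stars} star(s), score={score:.2f} ({flag})")
```

With this threshold, both predictions are flagged as low confidence, which suggests treating them as rough indications rather than firm sentiment labels.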

### 3.2 Named Entity Recognition

In [15]:
# define the pipeline and model
# aggregation_strategy="simple" groups sub-word tokens into whole entities
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


In [16]:
# Run NER
entities = ner(one_sentence)

# Print results
for entity in entities:
    print(entity)

{'entity_group': 'MISC', 'score': np.float32(0.99913186), 'word': 'American', 'start': 40, 'end': 48}
{'entity_group': 'MISC', 'score': np.float32(0.97617745), 'word': 'National LGBTQ Wall of Honor', 'start': 102, 'end': 130}
{'entity_group': 'LOC', 'score': np.float32(0.96628475), 'word': 'Stonewall National Monument', 'start': 142, 'end': 169}
{'entity_group': 'LOC', 'score': np.float32(0.9990921), 'word': 'New York City', 'start': 173, 'end': 186}
{'entity_group': 'LOC', 'score': np.float32(0.9532637), 'word': 'Stonewall Inn', 'start': 189, 'end': 202}
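
Each result carries `start` and `end` character offsets into the input sentence, so the entities can be sliced back out of the text and grouped by type. A minimal sketch, assuming the sentence and the entity dictionaries printed above (copied here with the scores written as plain floats); the variable name `by_group` is an illustrative choice:

```python
from collections import defaultdict

# The sentence classified above, split across lines for readability.
sentence = (
    'In 2019, he was one of the inaugural 50 American "pioneers, trailblazers, '
    'and heroes" inducted on the National LGBTQ Wall of Honor within the '
    "Stonewall National Monument in New York City's Stonewall Inn"
)

# Entities copied from the NER output above.
entities = [
    {"entity_group": "MISC", "score": 0.99913186, "word": "American", "start": 40, "end": 48},
    {"entity_group": "MISC", "score": 0.97617745, "word": "National LGBTQ Wall of Honor", "start": 102, "end": 130},
    {"entity_group": "LOC", "score": 0.96628475, "word": "Stonewall National Monument", "start": 142, "end": 169},
    {"entity_group": "LOC", "score": 0.9990921, "word": "New York City", "start": 173, "end": 186},
    {"entity_group": "LOC", "score": 0.9532637, "word": "Stonewall Inn", "start": 189, "end": 202},
]

# Group the text spans by entity type, using the offsets rather than "word".
by_group = defaultdict(list)
for entity in entities:
    span = sentence[entity["start"]:entity["end"]]
    by_group[entity["entity_group"]].append(span)

for group, spans in by_group.items():
    print(group, spans)
```

Slicing by offset returns the text exactly as it appears in the input, which can matter for models whose `word` field is lowercased or re-joined from sub-word tokens.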
