# Text and Natural Language Processing (NLP) in Python

## 1. Libraries

- Pandas is a powerful data manipulation library that facilitates loading, exploring, cleaning, and transforming text datasets stored in formats like CSV or Excel.
- Regular expressions (Python's built-in `re` module) are invaluable for text preprocessing tasks, such as removing special characters, punctuation, and digits from raw text data.
- The [Hugging Face Transformers library](https://huggingface.co/docs/transformers/v4.17.0/en/index) stands at the forefront of modern Natural Language Processing (NLP), offering a vast collection of pre-trained models based on transformer architectures like BERT, RoBERTa, and T5. These models have been trained on extensive amounts of text data and can be fine-tuned for various NLP tasks, such as sentiment analysis, named entity recognition, or language translation.
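
As a small illustration of the regular-expression preprocessing mentioned in the list above, the following sketch strips punctuation, special characters, and digits from a string. The function name `strip_noise` and the sample strings are made up for this example; whether digits should be removed at all depends on the task.

```python
import re

def strip_noise(text):
    """Remove punctuation, special characters and digits, then tidy whitespace."""
    # Replace everything that is neither a word character nor whitespace
    # (punctuation, symbols) with a space. In Python 3, \w also matches
    # accented letters such as "ã".
    text = re.sub(r"[^\w\s]", " ", text)
    # \w lets digits and underscores through, so remove them separately
    text = re.sub(r"[\d_]", " ", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

print(strip_noise("Haring (1958-1990) drew in NYC subways!"))
print(strip_noise("São Paulo, 1983."))
```

Replacing with a space rather than an empty string keeps words apart when punctuation was the only thing separating them, as in `1958-1990`.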

In [1]:
import pandas as pd # for data processing
import re
from transformers import AutoTokenizer, TFAutoModelForCausalLM, TFAutoModelForSequenceClassification # to load pretrained models
import tensorflow as tf
import tensorflow.keras as keras
print(f"Tensorflow {tf.__version__}")


Tensorflow 2.19.0


## 2. Importing Data

### 2.1 CSV Files

In [4]:
# Import a CSV file
anime_data = pd.read_csv("./data/AnimeQuotes.csv")
anime_data.head()

| | Quote | Character | Anime |
|---|---|---|---|
| 0 | People’s lives don’t end when they die, it end... | Itachi Uchiha | Naruto |
| 1 | If you don’t take risks, you can’t create a fu... | Monkey D Luffy | One Piece |
| 2 | If you don’t like your destiny, don’t accept it. | Naruto Uzumaki | Naruto |
| 3 | When you give up, that’s when the game ends. | Mitsuyoshi Anzai | Slam Dunk |
| 4 | All we can do is live until the day we die. Co... | Deneil Young | Uchuu Kyoudai or Space Brothers |


### 2.2 TXT Files

In [5]:
# Load a txt file
with open("./data/KeithHarring.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print(raw_text)

Keith Allen Haring (May 4, 1958 – February 16, 1990) was an American artist whose pop art emerged from the New York City graffiti subculture of the 1980s.[1] His animated imagery has "become a widely recognized visual language".[2] Much of his work includes sexual allusions that turned into social activism by using the images to advocate for safe sex and AIDS awareness.[3] In addition to solo gallery exhibitions, he participated in renowned national and international group shows such as documenta in Kassel, the Whitney Biennial in New York, the São Paulo Biennial, and the Venice Biennale. The Whitney Museum held a retrospective of his art in 1997.

Haring's popularity grew from his spontaneous drawings in New York City subways—chalk outlines of figures, dogs, and other stylized images on blank black advertising spaces.[4] After gaining public recognition, he created colorful larger scale murals, many commissioned.[4] He produced more than 50 public artworks between 1982 and 1989, many 

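If we have a "raw" text file, we first need to transform it into a list of sentences that we can use for NLP. Splitting on every period is the simplest approach, but it is crude: it also cuts at periods that do not end a sentence, and citation markers such as `[1]` end up at the start of the following piece.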

In [6]:
split_text = raw_text.split('.')
split_text

['Keith Allen Haring (May 4, 1958 – February 16, 1990) was an American artist whose pop art emerged from the New York City graffiti subculture of the 1980s',
 '[1] His animated imagery has "become a widely recognized visual language"',
 '[2] Much of his work includes sexual allusions that turned into social activism by using the images to advocate for safe sex and AIDS awareness',
 '[3] In addition to solo gallery exhibitions, he participated in renowned national and international group shows such as documenta in Kassel, the Whitney Biennial in New York, the São Paulo Biennial, and the Venice Biennale',
 ' The Whitney Museum held a retrospective of his art in 1997',
 "\n\nHaring's popularity grew from his spontaneous drawings in New York City subways—chalk outlines of figures, dogs, and other stylized images on blank black advertising spaces",
 '[4] After gaining public recognition, he created colorful larger scale murals, many commissioned',
 '[4] He produced more than 50 public artwo
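
The output above shows the limits of splitting on every period. A slightly more careful splitter can be sketched with `re`, assuming that a sentence ends with `.`, `!` or `?` followed by whitespace and an uppercase letter or opening quote. This is only a heuristic (abbreviations such as "Mr. Smith" would still be split), and the function name `split_sentences` and the sample text are made up for this example.

```python
import re

def split_sentences(text):
    """Heuristic sentence splitter for encyclopedia-style text."""
    # Drop numeric citation markers such as [1] or [20] before splitting
    text = re.sub(r"\[\d+\]", "", text)
    # Split at whitespace that follows ., ! or ? and precedes an
    # uppercase letter or an opening quote. The lookbehind/lookahead
    # keep the punctuation attached to the sentence it ends.
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z"“])', text)
    # Remove empty pieces and surrounding whitespace
    return [p.strip() for p in parts if p.strip()]

sample = (
    "Haring was an artist of the 1980s.[1] His imagery is widely known.[2]\n\n"
    "See Drenger, p. 53 for details."
)
for sentence in split_sentences(sample):
    print(sentence)
```

Because "p. 53" is followed by a digit rather than an uppercase letter, the page reference stays in one piece here.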

Create a data frame from the list for further processing.

In [7]:
harring_df = pd.DataFrame({'sentences': split_text})
harring_df.head()

| | sentences |
|---|---|
| 0 | Keith Allen Haring (May 4, 1958 – February 16,... |
| 1 | [1] His animated imagery has "become a widely ... |
| 2 | [2] Much of his work includes sexual allusions... |
| 3 | [3] In addition to solo gallery exhibitions, h... |
| 4 | The Whitney Museum held a retrospective of hi... |


## 3. Cleaning Text Data

In [9]:
def clean_text(text):
    # remove specific characters like '\n' (linebreaks)
    text = re.sub(r'\n', ' ', text)

    # define a pattern (e.g. numbers inside square brackets)
    pattern = r'\[(\d+)\]'
    text = re.sub(pattern, '', text)

    # Remove leading/trailing spaces
    text = text.strip()
    return text         # send cleaned text back

# get the column from the data frame
sentences = harring_df['sentences']

# apply the cleaning function
harring_data = sentences.apply(clean_text)
harring_data

0      Keith Allen Haring (May 4, 1958 – February 16,...
1      His animated imagery has "become a widely reco...
2      Much of his work includes sexual allusions tha...
3      In addition to solo gallery exhibitions, he pa...
4      The Whitney Museum held a retrospective of his...
                             ...                        
583                           Taking it off the pedestal
584            I’m giving it back to the people, I guess
585                                                 ”[20
586                                           Drenger, p
587                                                  53]
Name: sentences, Length: 588, dtype: object
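
The last rows of the output above are fragments rather than sentences, left over from splitting on periods inside citations. One possible way to drop them is a simple filter like the sketch below. The function name `looks_like_sentence`, the `min_words=3` threshold, and the bracket check are assumptions chosen for this example, not rules from the original notebook.

```python
def looks_like_sentence(text, min_words=3):
    """Heuristic: enough words and no unbalanced square brackets."""
    has_enough_words = len(text.split()) >= min_words
    brackets_balanced = text.count("[") == text.count("]")
    return has_enough_words and brackets_balanced

# A few strings taken from the cleaned output above
fragments = [
    "The Whitney Museum held a retrospective of his art in 1997",
    "Taking it off the pedestal",
    "”[20",
    "Drenger, p",
    "53]",
]

kept = [s for s in fragments if looks_like_sentence(s)]
print(kept)
```

On the pandas Series from the previous cell, the same predicate could be applied as `harring_data[harring_data.apply(looks_like_sentence)]`.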

## 4. Applying Natural Language Processing

### 4.1 Sentiment Analysis

In [2]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [3]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, framework="tf")

Device set to use 0


In [17]:
sentence1 = harring_data.iloc[12]
print(sentence1)

sentence2 = harring_data.iloc[10]
print(sentence2)

classifier(sentence1)  # result is discarded: a cell only displays its last expression
classifier(sentence2)  # displayed below; wrap both calls in print() to see both results

In 2019, he was one of the inaugural 50 American "pioneers, trailblazers, and heroes" inducted on the National LGBTQ Wall of Honor within the Stonewall National Monument in New York City's Stonewall Inn
Haring died of AIDS-related complications on February 16, 1990


[{'label': '1 star', 'score': 0.44514399766921997}]
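
The classifier returns a list of dictionaries whose `label` is a star rating such as `'1 star'`. To work with these results numerically, the label can be parsed into an integer, as in the sketch below. The helper names and the cut-offs for "negative", "neutral", and "positive" are assumptions made for this example, and it assumes every label starts with the number of stars.

```python
def stars_from_result(result):
    """Turn one result like {'label': '1 star', 'score': 0.45} into an int."""
    # The label is assumed to start with the number of stars
    return int(result["label"].split()[0])

def coarse_sentiment(stars):
    """Map a star rating to a coarse class (the thresholds are an assumption)."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

# The output shown above, copied as a plain Python literal
output = [{'label': '1 star', 'score': 0.44514399766921997}]

stars = stars_from_result(output[0])
print(stars, coarse_sentiment(stars), round(output[0]["score"], 2))
```

Note that the score of about 0.45 means the model is not very confident in this rating, which is worth keeping alongside the label.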

### 4.2 Named Entity Recognition

In [19]:
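# aggregation_strategy="simple" merges word pieces into whole entities
# (it replaces the deprecated grouped_entities=True)
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")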

# Example text (not used below; the next cell runs NER on sentence1)
text = "Hugging Face Inc. is a company based in New York. Its founders include Clément Delangue."

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


In [20]:
# Run NER
entities = ner(sentence1)

# Print results
for entity in entities:
    print(entity)

{'entity_group': 'MISC', 'score': np.float32(0.99913186), 'word': 'American', 'start': 40, 'end': 48}
{'entity_group': 'MISC', 'score': np.float32(0.97617745), 'word': 'National LGBTQ Wall of Honor', 'start': 102, 'end': 130}
{'entity_group': 'LOC', 'score': np.float32(0.96628475), 'word': 'Stonewall National Monument', 'start': 142, 'end': 169}
{'entity_group': 'LOC', 'score': np.float32(0.9990921), 'word': 'New York City', 'start': 173, 'end': 186}
{'entity_group': 'LOC', 'score': np.float32(0.9532637), 'word': 'Stonewall Inn', 'start': 189, 'end': 202}
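
Each entity is a dictionary with `entity_group`, `score`, `word`, `start`, and `end`. For further processing it can help to collect the words per entity type, optionally dropping low-confidence hits. The sketch below does this with plain Python; the function name `group_by_type` and the `min_score` parameter are made up for this example, and the sample list copies the output above with the scores written as plain floats.

```python
from collections import defaultdict

def group_by_type(entities, min_score=0.0):
    """Collect entity words per entity_group, skipping scores below min_score."""
    grouped = defaultdict(list)
    for ent in entities:
        # float() also accepts the np.float32 scores returned by the pipeline
        if float(ent["score"]) >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)

sample_entities = [
    {'entity_group': 'MISC', 'score': 0.99913186, 'word': 'American', 'start': 40, 'end': 48},
    {'entity_group': 'MISC', 'score': 0.97617745, 'word': 'National LGBTQ Wall of Honor', 'start': 102, 'end': 130},
    {'entity_group': 'LOC', 'score': 0.96628475, 'word': 'Stonewall National Monument', 'start': 142, 'end': 169},
    {'entity_group': 'LOC', 'score': 0.9990921, 'word': 'New York City', 'start': 173, 'end': 186},
    {'entity_group': 'LOC', 'score': 0.9532637, 'word': 'Stonewall Inn', 'start': 189, 'end': 202},
]

print(group_by_type(sample_entities))
print(group_by_type(sample_entities, min_score=0.97))
```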
