<strong>
    <font color="#0E1117">
        Author: lprtk
    </font>
</strong>

<br/>
<br/>


<Center>
    <h1 style="font-family: Arial">
        <font color="#0E1117">
            NLP: exploring HuggingFace Transformers
        </font>
    </h1>
</Center>

-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Introduction & context
        </font>
    </h2>
</div>

<p style="text-align: justify">
    This notebook explores how pre-trained Transformer language models can be applied, with only a few lines of Python, to several Natural Language Processing (NLP) tasks: sentiment analysis, named entity recognition, question answering and text summarization. The objective is to extract information and value from large volumes of textual data.
</p>

-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Library imports
        </font>
    </h2>
</div>

In [1]:
import pandas as pd
from transformers import pipeline

-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Sentiment analysis & scoring
        </font>
    </h2>
</div>

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            1) Data
        </font>
    </h3>
</div>

In [2]:
text = [
    "I love flying Air New Zealand because they have the best food!",
    "That orange Fiat Multipla is the ugliest car I’ve ever seen.",
    "I love this phone but wouldn’t recommend it to my friends.",
    "I have just broken down with my motorcycle, I AM MISTAKEN...",
    "Data Scientist must to have several skills: statistics, IT and business expertise.",
    "I'm happy because I'll be able to sunbathe under the coconut trees 😂."
]

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            2) Model
        </font>
    </h3>
</div>

In [3]:
# model=None lets the pipeline fall back to the default checkpoint for the task
sentiment_analyzer = pipeline(task="text-classification", model=None)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
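
<p style="text-align: justify">
    The log above shows that, with <code>model=None</code>, the pipeline falls back to a default checkpoint for the task. A minimal sketch, assuming we want runs to stay reproducible even if those defaults change, is to pass the checkpoint name explicitly. The checkpoint names below are copied from the log messages in this notebook; <code>DEFAULT_CHECKPOINTS</code> and <code>build_pipeline</code> are hypothetical helper names, not part of the Transformers API.
</p>

```python
# Checkpoints reported by the "No model was supplied" log messages in this notebook.
DEFAULT_CHECKPOINTS = {
    "text-classification": "distilbert-base-uncased-finetuned-sst-2-english",
    "ner": "dbmdz/bert-large-cased-finetuned-conll03-english",
    "question-answering": "distilbert-base-cased-distilled-squad",
    "summarization": "t5-small",
}


def build_pipeline(task, **kwargs):
    """Create a pipeline with an explicitly pinned checkpoint (hypothetical helper)."""
    # Imported lazily so the mapping above can be inspected without loading the library.
    from transformers import pipeline

    return pipeline(task=task, model=DEFAULT_CHECKPOINTS[task], **kwargs)


# Example (downloads the model on first use):
# sentiment_analyzer = build_pipeline("text-classification")
```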


In [4]:
df_result = pd.DataFrame(sentiment_analyzer(text))

In [5]:
df_result

index,label,score
0,POSITIVE,0.999876
1,NEGATIVE,0.663035
2,POSITIVE,0.973586
3,NEGATIVE,0.999557
4,POSITIVE,0.868442
5,POSITIVE,0.999863
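
<p style="text-align: justify">
    The table above reports the score of the predicted label only. A minimal sketch, assuming we want one signed value per sentence and a flag for uncertain predictions, could look as follows. The predictions are copied from the output above; the helper name <code>signed_score</code> and the 0.9 cut-off are illustrative assumptions, not values recommended by the library.
</p>

```python
# Predictions copied from the table above, in the pipeline's output format:
# a list of dicts with a "label" and a "score".
predictions = [
    {"label": "POSITIVE", "score": 0.999876},
    {"label": "NEGATIVE", "score": 0.663035},
    {"label": "POSITIVE", "score": 0.973586},
    {"label": "NEGATIVE", "score": 0.999557},
    {"label": "POSITIVE", "score": 0.868442},
    {"label": "POSITIVE", "score": 0.999863},
]

THRESHOLD = 0.9  # hypothetical cut-off below which a prediction is flagged


def signed_score(prediction):
    """Map a prediction to [-1, 1]: NEGATIVE predictions get a negative sign."""
    sign = 1.0 if prediction["label"] == "POSITIVE" else -1.0
    return sign * prediction["score"]


# Indices of the sentences whose predicted label has a score below the cut-off.
low_confidence = [i for i, p in enumerate(predictions) if p["score"] < THRESHOLD]

for i, p in enumerate(predictions):
    status = "review" if i in low_confidence else "ok"
    print(i, round(signed_score(p), 3), status)
```

<p style="text-align: justify">
    With this cut-off, sentences 1 and 4 are flagged for manual review, while the other four predictions are kept as they are.
</p>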


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Named entity recognition
        </font>
    </h2>
</div>

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            1) Data
        </font>
    </h3>
</div>

In [6]:
text = """
Microsoft has announced the launch of a student program to build ai skills. The Redmond 
giant wants to expand its reach and plans to build a strong developer ecosystem between 
India and the US. The company will provide tools and services such as Microsoft Cognitive 
Services, Bot Services and Azure Machine Learning. Manish Prakash, general manager, Microsoft 
India, said, "ai being the defining technology of our time is transforming lives and industry 
and the jobs of tomorrow will require different skills.
"""

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            2) Model
        </font>
    </h3>
</div>

In [7]:
# aggregation_strategy="simple" merges sub-word tokens of the same entity into one span
ner_analyzer = pipeline(task="ner", aggregation_strategy="simple", model=None)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)
Some layers from the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing TFBertForTokenClassification: ['dropout_147']
- This IS expected if you are initializing TFBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForTokenClassification were initialized from the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english.
If your task is similar to the task the

In [8]:
df_result = pd.DataFrame(ner_analyzer(text))

In [9]:
df_result

index,entity_group,score,word,start,end
0,ORG,0.99961,Microsoft,1,10
1,ORG,0.706997,Redmond,81,88
2,LOC,0.999741,India,179,184
3,LOC,0.999685,US,193,195
4,ORG,0.993,Microsoft Cognitive Services,249,278
5,ORG,0.987963,Bot Services,280,292
6,ORG,0.991273,Azure Machine Learning,297,319
7,PER,0.999419,Manish Prakash,321,335
8,ORG,0.978427,Microsoft India,354,370
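
<p style="text-align: justify">
    The entities above come back as a flat list. A minimal sketch, assuming we want one list of names per entity type and want to drop uncertain detections, is shown below. The rows are copied from the table above; <code>group_entities</code> and the 0.9 cut-off are hypothetical choices made for illustration.
</p>

```python
from collections import defaultdict

# Rows copied from the table above (start and end offsets omitted for brevity).
entities = [
    {"entity_group": "ORG", "score": 0.99961, "word": "Microsoft"},
    {"entity_group": "ORG", "score": 0.706997, "word": "Redmond"},
    {"entity_group": "LOC", "score": 0.999741, "word": "India"},
    {"entity_group": "LOC", "score": 0.999685, "word": "US"},
    {"entity_group": "ORG", "score": 0.993, "word": "Microsoft Cognitive Services"},
    {"entity_group": "ORG", "score": 0.987963, "word": "Bot Services"},
    {"entity_group": "ORG", "score": 0.991273, "word": "Azure Machine Learning"},
    {"entity_group": "PER", "score": 0.999419, "word": "Manish Prakash"},
    {"entity_group": "ORG", "score": 0.978427, "word": "Microsoft India"},
]


def group_entities(entities, min_score=0.0):
    """Group entity names by type, keeping only detections scored at least min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] < min_score:
            continue
        names = grouped[entity["entity_group"]]
        if entity["word"] not in names:  # skip repeated mentions
            names.append(entity["word"])
    return dict(grouped)


for group, names in group_entities(entities, min_score=0.9).items():
    print(group, names)
```

<p style="text-align: justify">
    With a cut-off of 0.9, "Redmond" (score 0.707, tagged as an organisation although it refers to a place) is dropped, and the remaining entities are kept.
</p>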


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Question answering
        </font>
    </h2>
</div>

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            1) Data
        </font>
    </h3>
</div>

In [10]:
text = """
The phone is in good condition and well protected in the packaging. I also bought 
a shell for my phone hoping it would fit, but it did not. It is a shell for the old 
model which explains the low price. Thanks to Amazon for shipping this product to 
France with Prime. I hope they will add more products under Prime to ship from the 
US to France. The prices and collection of items are still not good on the France site, 
and are nowhere near the same level as the US site. Thanks.
"""

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            2) Model
        </font>
    </h3>
</div>

In [11]:
question_answerer = pipeline(task="question-answering", model=None)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)
Some layers from the model checkpoint at distilbert-base-cased-distilled-squad were not used when initializing TFDistilBertForQuestionAnswering: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-cased-distilled-squad and are newly initialized: ['dropout_113']
You should probably TRAIN

In [12]:
question = "Is the customer satisfied?"
# A single question returns one dict, so wrap it in a list to get a one-row DataFrame
df_result = pd.DataFrame([question_answerer(question=question, context=text)])

In [13]:
df_result

index,score,start,end,answer
0,0.063303,130,140,it did not


In [14]:
question = "Why the customer is not satisfied?"
df_result = pd.DataFrame([question_answerer(question=question, context=text)])

In [15]:
df_result

index,score,start,end,answer
0,0.036813,142,174,It is a shell for the old \nmodel


In [16]:
question = "How to improve customer satisfaction?"
df_result = pd.DataFrame([question_answerer(question=question, context=text)])

In [17]:
df_result

index,score,start,end,answer
0,0.00833,278,348,they will add more products under Prime to shi...
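
<p style="text-align: justify">
    The question-answering pipeline is extractive: it returns a span copied from the context, and <code>start</code> and <code>end</code> are character offsets into it. This explains why a yes/no question such as "Is the customer satisfied?" is answered with "it did not". All three scores above are also low (0.063 at most). A minimal sketch, assuming we prefer no answer to an unreliable one, is to apply a score cut-off. The rows are copied from the outputs above; <code>reliable_answer</code> and the 0.5 cut-off are hypothetical.
</p>

```python
# Results copied from the three outputs above.
answers = [
    {"question": "Is the customer satisfied?",
     "score": 0.063303, "answer": "it did not"},
    {"question": "Why the customer is not satisfied?",
     "score": 0.036813, "answer": "It is a shell for the old \nmodel"},
    {"question": "How to improve customer satisfaction?",
     # This answer is truncated in the notebook display and is kept as shown.
     "score": 0.00833, "answer": "they will add more products under Prime to shi..."},
]


def reliable_answer(result, min_score=0.5):
    """Return the answer text, or None when its score is below min_score."""
    return result["answer"] if result["score"] >= min_score else None


for result in answers:
    print(result["question"], "->", reliable_answer(result))
```

<p style="text-align: justify">
    With a cut-off of 0.5, none of the three answers is retained, which suggests that these questions are poorly suited to an extractive model applied to this review.
</p>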


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#0E1117">
            Text summarization
        </font>
    </h2>
</div>

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            1) Data
        </font>
    </h3>
</div>

In [18]:
text = """
For years, Facebook gave some of the world's largest technology companies more intrusive 
access to users' personal data than it has disclosed, effectively exempting those business 
partners from its usual privacy rules, according to internal records and interviews. The 
special arrangements are detailed in hundreds of pages of Facebook documents obtained by 
The New York Times. The records, generated in 2017 by the company's internal system for 
tracking partnerships, provide the most complete picture yet of the social network's 
data-sharing practices. They also underscore how personal data has become the most prized 
commodity of the digital age, traded on a vast scale by some of the most powerful companies 
in Silicon Valley and beyond. The exchange was intended to benefit everyone. Pushing for 
explosive growth, Facebook got more users, lifting its advertising revenue. Partner companies 
acquired features to make their products more attractive. Facebook users connected with friends 
across different devices and websites. But Facebook also assumed extraordinary power over the 
personal information of its 2 billion users - control it has wielded with little transparency 
or outside oversight.Facebook allowed Microsoft's Bing search engine to see the names of virtually 
all Facebook user's friends without consent, the records show, and gave Netflix and Spotify the 
ability to read Facebook users' private messages.
"""

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#0E1117">
            2) Model
        </font>
    </h3>
</div>

In [19]:
text_summarizer = pipeline(task="summarization", model=None)

No model was supplied, defaulted to t5-small (https://huggingface.co/t5-small)
All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [20]:
# max_length caps the number of generated tokens in the summary
result = text_summarizer(text, max_length=50, clean_up_tokenization_spaces=True)

In [21]:
print(result[0]["summary_text"])

personal data has become the most prized commodity of the digital age, traded on a vast scale by some of the most powerful companies in Silicon Valley and beyond. the exchange was intended to benefit everyone, pushing for explosive growth.
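
<p style="text-align: justify">
    The tokenizer warning above notes that automatic truncation to 512 tokens should not be relied upon for t5-small. A minimal sketch, assuming that a word count is an acceptable rough proxy for the token count and that a naive split on sentence-ending punctuation is good enough, is to cut a long document into chunks of whole sentences and summarize each chunk separately. The helper <code>chunk_sentences</code> and its <code>max_words</code> parameter are hypothetical, not part of the Transformers API.
</p>

```python
import re


def chunk_sentences(text, max_words=300):
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept whole in its own chunk.
    """
    # Naive sentence split: ., ! or ? followed by whitespace (an assumption).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "Facebook got more users. Partner companies acquired features. "
    "Facebook users connected with friends."
)
for chunk in chunk_sentences(sample, max_words=8):
    print(chunk)
```

<p style="text-align: justify">
    Each chunk can then be passed to <code>text_summarizer</code> in the same way as the full text above. Because words and tokens do not map one to one, the word budget should be chosen conservatively.
</p>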
