# Sentiment Analysis with Hugging Face and DuckDB
In this notebook, we're going to learn how to do sentiment analysis with Hugging Face and DuckDB.

## Download the LLM
We're going to write some code to manually download the model's files from the Hugging Face Hub.

In [3]:
import os
from huggingface_hub import hf_hub_download

In [5]:
HUGGING_FACE_API_KEY = os.environ.get("HUGGING_FACE_API_KEY")

In [1]:
model_id = "cardiffnlp/twitter-roberta-base-sentiment"
filenames = [
    "pytorch_model.bin", "config.json", "merges.txt",
    "special_tokens_map.json", "vocab.json", "tf_model.h5", "flax_model.msgpack"
]

In [6]:
for filename in filenames:
    downloaded_model_path = hf_hub_download(
        repo_id=model_id,
        filename=filename,
        token=HUGGING_FACE_API_KEY
    )
    print(downloaded_model_path)

/Users/markhneedham/.cache/huggingface/hub/models--cardiffnlp--twitter-roberta-base-sentiment/snapshots/daefdd1f6ae931839bce4d0f3db0a1a4265cd50f/pytorch_model.bin
/Users/markhneedham/.cache/huggingface/hub/models--cardiffnlp--twitter-roberta-base-sentiment/snapshots/daefdd1f6ae931839bce4d0f3db0a1a4265cd50f/config.json
/Users/markhneedham/.cache/huggingface/hub/models--cardiffnlp--twitter-roberta-base-sentiment/snapshots/daefdd1f6ae931839bce4d0f3db0a1a4265cd50f/merges.txt
/Users/markhneedham/.cache/huggingface/hub/models--cardiffnlp--twitter-roberta-base-sentiment/snapshots/daefdd1f6ae931839bce4d0f3db0a1a4265cd50f/special_tokens_map.json
/Users/markhneedham/.cache/huggingface/hub/models--cardiffnlp--twitter-roberta-base-sentiment/snapshots/daefdd1f6ae931839bce4d0f3db0a1a4265cd50f/vocab.json
/Users/markhneedham/.cache/huggingface/hub/models--cardiffnlp--twitter-roberta-base-sentiment/snapshots/daefdd1f6ae931839bce4d0f3db0a1a4265cd50f/tf_model.h5
/Users/markhneedham/.cache/huggingface/hub/models--cardiffnlp--twitter-roberta-base-sentiment/snapshots/daefdd1f6ae931839bce4d0f3db0a1a4265cd50f/flax_model.msgpack
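
The paths printed above follow a regular layout: a `models--<org>--<name>` folder, then `snapshots/<revision>/<filename>`. As a minimal sketch, assuming that layout holds for every file, we can pull the pieces back out of a path. `parse_cache_path` is a hypothetical helper written for this notebook, not part of `huggingface_hub`.

```python
from pathlib import PurePosixPath


def parse_cache_path(path: str) -> dict:
    """Split a Hub cache path into repo id, revision and filename (hypothetical helper)."""
    parts = PurePosixPath(path).parts
    # The folder just before "snapshots" names the repo, e.g.
    # models--cardiffnlp--twitter-roberta-base-sentiment
    idx = parts.index("snapshots")
    repo_type, _, repo_name = parts[idx - 1].partition("--")
    return {
        "repo_type": repo_type,
        # Assumption: "--" only ever stands in for "/" in the repo id.
        "repo_id": repo_name.replace("--", "/"),
        "revision": parts[idx + 1],
        "filename": "/".join(parts[idx + 2:]),
    }


example = (
    "/Users/markhneedham/.cache/huggingface/hub/"
    "models--cardiffnlp--twitter-roberta-base-sentiment/"
    "snapshots/daefdd1f6ae931839bce4d0f3db0a1a4265cd50f/config.json"
)
info = parse_cache_path(example)
print(info["repo_id"], info["filename"])
```

This is only useful for inspecting what has been cached. It would mis-parse a repo whose own name contains `--`.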


## Run the LLM
Now let's try running the model. Before we do that, let's disable the Wi-Fi so that the model has to be loaded from the files we just downloaded.
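
If switching the Wi-Fi off isn't convenient, a rough software equivalent is to set the offline environment variables that the Hugging Face libraries document. This is a sketch rather than what this notebook did: it assumes a fresh kernel, because the variables are expected to be set before `huggingface_hub` and `transformers` are imported, and behaviour may vary between library versions.

```python
import os

# Ask the Hugging Face libraries to rely on the local cache only.
# Set these before importing `huggingface_hub` or `transformers`;
# setting them afterwards may have no effect.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

print(os.environ["HF_HUB_OFFLINE"], os.environ["TRANSFORMERS_OFFLINE"])
```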

In [16]:
import csv
import urllib.request

# Fetching the label mapping needs network access, so run this cell before going offline.
mapping_link = "https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/sentiment/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    lines = f.read().decode("utf-8").split("\n")

# Rows are tab-separated and the second column holds the label name.
csvreader = csv.reader(lines, delimiter="\t")
labels = {f"LABEL_{index}": row[1] for index, row in enumerate(csvreader) if len(row) > 1}
labels

{'LABEL_0': 'negative', 'LABEL_1': 'neutral', 'LABEL_2': 'positive'}
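
Since the mapping is only three rows, we could also keep a local copy and parse it without touching the network. The sketch below assumes the file contains one tab-separated `index<TAB>name` pair per line, which is what the output above suggests. `MAPPING_TEXT` and `parse_mapping` are names made up for this example.

```python
import csv

# Assumed contents of mapping.txt, reconstructed from the dictionary printed above.
MAPPING_TEXT = "0\tnegative\n1\tneutral\n2\tpositive\n"


def parse_mapping(text: str) -> dict:
    """Build a LABEL_<n> -> name lookup from tab-separated text (hypothetical helper)."""
    rows = csv.reader(text.splitlines(), delimiter="\t")
    # Unlike the cell above, this uses the index stored in the file
    # rather than the row's position.
    return {f"LABEL_{row[0]}": row[1] for row in rows if len(row) > 1}


labels_offline = parse_mapping(MAPPING_TEXT)
print(labels_offline)
```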

In [23]:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis", model=model_id)

In [41]:
data = ["Possibly the worst book I've ever read.It's a huge collection of biases for all the possible countries and cultures. The whole book is structured with examples like: if you are working with Chinese people, you should take this approach, instead if your team is composed by German people you should do this etc....",
 'A book full of oversimplifications, generalisations and self-contradiction. Plus many of the examples felt simply made up. Although it had one or two good ideas thrown in there, I am honestly not sure if this book can hardly help anyone.',
 'I had it on my recommendations list for a long time, but my impression was always like: "damn, I don\'t need a book on cultural differences; I\'ve worked in many international enterprises, I have been trained, I have practical experience - it would be just a waste of time". In the end, it wasn\'t (a waste of time).',
 'Candidate for the best book I have read in 2016 unless another one can beat it. The author made is fun to read with great examples that I could easily relate to.',
 'A practical and comprehensive guide to how different cultures should be approached regarding business relations, but it can also be used outside of that.',
 'The book was OK. It offers a good overview of differences between cultures. Sometimes we may assume that 2 cultures are similar, but in the end there is a possibility of conflict, because they have different "mentality" on a certain point (trust or time perception, for instance). But Erin often limits herself to personal stories and doesn\'t cite almost any researcher or study.']
results = sentiment_pipeline(data)

In [40]:
for value, sentiment in zip(data, results):
    print(value)
    print(labels[sentiment['label']], sentiment['score'])
    print("")

Possibly the worst book I've ever read.It's a huge collection of biases for all the possible countries and cultures. The whole book is structured with examples like: if you are working with Chinese people, you should take this approach, instead if your team is composed by German people you should do this etc....
negative 0.9155668020248413

A book full of oversimplifications, generalisations and self-contradiction. Plus many of the examples felt simply made up. Although it had one or two good ideas thrown in there, I am honestly not sure if this book can hardly help anyone.
negative 0.7787094116210938

I had it on my recommendations list for a long time, but my impression was always like: "damn, I don't need a book on cultural differences; I've worked in many international enterprises, I have been trained, I have practical experience - it would be just a waste of time". In the end, it wasn't (a waste of time).
neutral 0.4299629330635071

Candidate for the best book I have read in 2016 
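
The third review above comes back as neutral with a score of about 0.43, noticeably lower than the first two. Here is a small sketch of how we might flag results like that for a manual look. The `low_confidence` helper and the 0.5 threshold are arbitrary choices for illustration, and the sample list copies the scores printed above with the raw label names we would expect given the mapping.

```python
def low_confidence(results, threshold=0.5):
    """Return (index, result) pairs whose score falls below the threshold."""
    return [(i, r) for i, r in enumerate(results) if r["score"] < threshold]


# Shaped like the pipeline output; scores copied from the cell output above.
sample_results = [
    {"label": "LABEL_0", "score": 0.9155668020248413},
    {"label": "LABEL_0", "score": 0.7787094116210938},
    {"label": "LABEL_1", "score": 0.4299629330635071},
]

flagged = low_confidence(sample_results)
print(flagged)
```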

In [33]:
sentiment = sentiment_pipeline("It was epic")[0]
sentiment["label"] = labels[sentiment['label']]
sentiment

{'label': 'positive', 'score': 0.8966352343559265}
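
The cell above rewrites the label in place, so running it a second time on the same dictionary would fail with a `KeyError` (there is no `'positive'` key in `labels`). As a sketch, a small helper that returns a copy avoids that. `relabel` is a hypothetical name, and the raw input assumes the pipeline returned `LABEL_2` before the relabelling.

```python
def relabel(result: dict, labels: dict) -> dict:
    """Return a copy of a pipeline result with a readable label."""
    # Fall back to the original label if it isn't in the mapping.
    return {**result, "label": labels.get(result["label"], result["label"])}


labels = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}
raw = {"label": "LABEL_2", "score": 0.8966352343559265}

readable = relabel(raw, labels)
print(readable)
print(raw["label"])  # the input dictionary is left untouched
```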

In [39]:
sentiment_pipeline = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")

In [38]:
sentiment = sentiment_pipeline("It was epic")[0]
sentiment

{'label': 'POS', 'score': 0.9846311211585999}
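
Different models name their labels differently: the first model gave us `LABEL_0` to `LABEL_2`, while the output above uses `POS`. If we want to compare models, one option is to map everything onto a single vocabulary. This is only a sketch. The `CANONICAL` table and `normalise` function are made up for this notebook, and the `NEG` and `NEU` entries are assumed counterparts of `POS` that don't appear in any output here.

```python
# Hypothetical lookup: LABEL_0..2 and POS appear in this notebook,
# NEG and NEU are assumptions.
CANONICAL = {
    "LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive",
    "NEG": "negative", "NEU": "neutral", "POS": "positive",
}


def normalise(result: dict) -> dict:
    """Return a copy of a pipeline result using the shared label vocabulary."""
    # Unknown labels are lower-cased and passed through unchanged.
    label = CANONICAL.get(result["label"], result["label"].lower())
    return {"label": label, "score": result["score"]}


normalised = normalise({"label": "POS", "score": 0.9846311211585999})
print(normalised)
```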