### Compute Sentiment for MovieLens Tags

The notebook cells below compute an emotion label for each tag string in the MovieLens dataset using the Hugging Face model [j-hartmann/emotion-english-distilroberta-base](https://huggingface.co/j-hartmann/emotion-english-distilroberta-base).

In [4]:
import time

import typing as T
from functools import wraps
from pathlib import Path

from transformers import pipeline
import pandas as pd

# Link to model: https://huggingface.co/j-hartmann/emotion-english-distilroberta-base

classifier = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", return_all_scores=True)

In [2]:
classifier("some text")

[[{'label': 'anger', 'score': 0.029231058433651924},
  {'label': 'disgust', 'score': 0.22560150921344757},
  {'label': 'fear', 'score': 0.00604574428871274},
  {'label': 'joy', 'score': 0.00475974241271615},
  {'label': 'neutral', 'score': 0.628390371799469},
  {'label': 'sadness', 'score': 0.059987958520650864},
  {'label': 'surprise', 'score': 0.04598362371325493}]]
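The output above is a nested list: one inner list per input string, holding one score per emotion label. As a minimal sketch of how the highest-scoring label can be pulled out of that structure, the snippet below uses the scores printed above as literal data, so it runs without loading the model. The helper name `top_emotion` is hypothetical and not part of the notebook.

```python
from typing import Dict, List, Tuple, Union

Prediction = List[Dict[str, Union[str, float]]]

# Scores copied verbatim from the classifier output above for "some text".
scores: List[Prediction] = [[
    {"label": "anger", "score": 0.029231058433651924},
    {"label": "disgust", "score": 0.22560150921344757},
    {"label": "fear", "score": 0.00604574428871274},
    {"label": "joy", "score": 0.00475974241271615},
    {"label": "neutral", "score": 0.628390371799469},
    {"label": "sadness", "score": 0.059987958520650864},
    {"label": "surprise", "score": 0.04598362371325493},
]]


def top_emotion(prediction: Prediction) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score (hypothetical helper)."""
    best = max(prediction, key=lambda item: item["score"])
    return best["label"], best["score"]


# scores[0] is the list of label scores for the first (and only) input string.
label, score = top_emotion(scores[0])
print(label, round(score, 4))  # prints: neutral 0.6284
```

The same `max(..., key=...)` selection is what `compute_score` applies to every tag further down.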

#### Load Data

One row with an empty tag (which pandas reads as `NaN`) was dropped from the dataset to avoid edge cases when computing statistics on it.

In [33]:
DATA_PATH = Path.cwd().parent / "data"
df = pd.read_csv(DATA_PATH / "ml-latest" / "tags.csv")
print(f"Length of df: {len(df)}")
df.dropna(inplace=True)
tags = df["tag"].unique()

print(f"Number of unique tags: {len(tags)}")

Length of df: 1108997
Number of unique tags: 74714


In [34]:
def timeit(func):
    @wraps(func)
    def timeit_wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        total_time = end_time - start_time
        print(f"Function {func.__name__} Took {total_time:.4f} seconds.")

        return result

    return timeit_wrapper

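@timeit
def get_scores(classifier: T.Any, tags: T.List[str]) -> T.List[T.Tuple[str, str, float]]:
    """
    Compute a classification score for each tag, classifying 1000 tags at a time.
    """
    print(f"Classifying {len(tags)} tags.")
    predictions = []
    for i in range(0, len(tags), 1000):
        # Classify one chunk of up to 1000 tags and collect its predictions.
        predictions.extend(classifier(tags[i : i + 1000]))
    output = []
    for tag, prediction in zip(tags, predictions):
        # Keep only the highest-scoring emotion for each tag.
        max_pred = max(prediction, key=lambda x: x["score"])
        output.append((tag, max_pred["label"], max_pred["score"]))

    return output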

@timeit
def compute_score(classifier: T.Any, tags: T.List[str]) -> T.List[T.Tuple[str, str, float]]:
    """
    Compute a classification score for each tag in the provided tag list.

    Returns one (tag, emotion, score) tuple per tag.
    """
    print(f"Classifying {len(tags)} tags.")
    predictions = classifier(tags)
    output = []
    for tag, prediction in zip(tags, predictions):
        # Keep only the highest-scoring emotion for each tag.
        max_pred = max(prediction, key=lambda x: x["score"])
        output.append((tag, max_pred["label"], max_pred["score"]))

    return output


tag_tuples = compute_score(classifier, tags.tolist())

Classifying 74714 tags.
Function compute_score Took 3958.4121 seconds.


In [35]:
df_out = pd.DataFrame(tag_tuples, columns=["tag", "emotion", "score"])
df_out.head(5)

| | tag | emotion | score |
|---|---|---|---|
| 0 | epic | surprise | 0.406139 |
| 1 | Medieval | neutral | 0.795388 |
| 2 | sci-fi | neutral | 0.441692 |
| 3 | space action | neutral | 0.595582 |
| 4 | imdb top 250 | neutral | 0.629384 |
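One simple statistic to compute from these tuples is how often each emotion is assigned. The sketch below tallies the labels with `collections.Counter`, using only the five rows shown above as literal data. The variable names `sample_tuples` and `emotion_counts` are hypothetical, and the counts say nothing about the distribution over the full set of tags.

```python
from collections import Counter
from typing import List, Tuple

# The five (tag, emotion, score) rows displayed above, copied as literal data.
sample_tuples: List[Tuple[str, str, float]] = [
    ("epic", "surprise", 0.406139),
    ("Medieval", "neutral", 0.795388),
    ("sci-fi", "neutral", 0.441692),
    ("space action", "neutral", 0.595582),
    ("imdb top 250", "neutral", 0.629384),
]

# Count how many tags were assigned each emotion label.
emotion_counts = Counter(emotion for _, emotion, _ in sample_tuples)

for emotion, count in emotion_counts.most_common():
    print(f"{emotion}: {count}")
```

Running the same tally on the full `tag_tuples` list would give the label distribution across every unique tag.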


In [36]:
df_out.to_csv(DATA_PATH / "ml-latest" / "computed-tags.csv")