<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# CAPSTONE : Ubisoft's Skull & Bones

# Part 3 – Text Summarization: Hugging Face

As part of our experimentation, we will summarize each review into a more concise description, with the aim of giving our multi-label classification model clearer input.

We will do this with pretrained large language models from the Hugging Face library, as part of our unsupervised modelling process.

In [1]:
# !pip install bert-extractive-summarizer
# !pip install summarizer
# !pip install transformers
# !pip install tensorflow
# !pip install sacremoses==0.0.53
# !pip install torch torchvision torchaudio
# !pip install sentencepiece
# !pip install --upgrade jupyter ipywidgets
# !jupyter nbextension enable --py widgetsnbextension

In [19]:
# Importing all libraries used: 

import pandas as pd
from transformers import BartForConditionalGeneration, BartTokenizer

from transformers import pipeline

## Importing Data

### Assassin's Creed

In [11]:
ass_creed = pd.read_csv('../data/output/ass_creed_no_punc.csv')

In [12]:
ass_creed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21812 entries, 0 to 21811
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   num_games_owned              21812 non-null  int64  
 1   num_reviews                  21812 non-null  int64  
 2   playtime_forever             21812 non-null  int64  
 3   playtime_last_two_weeks      21812 non-null  int64  
 4   playtime_at_review           21812 non-null  int64  
 5   last_played                  21812 non-null  object 
 6   review                       21729 non-null  object 
 7   timestamp_created            21812 non-null  object 
 8   recommended                  21812 non-null  bool   
 9   votes_up                     21812 non-null  int64  
 10  votes_funny                  21812 non-null  int64  
 11  weighted_vote_score          21812 non-null  float64
 12  comment_count                21812 non-null  int64  
 13  steam_purchase  

In [13]:
ass_creed.head()

Unnamed: 0,num_games_owned,num_reviews,playtime_forever,playtime_last_two_weeks,playtime_at_review,last_played,review,timestamp_created,recommended,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access
0,267,21,2856,0,2856,2018-02-18 19:42:24,the only game where i avoid fast travel,2021-01-27 20:27:22,True,3748,185,0.986815,11,True,False,False
1,0,52,1838,0,1838,2020-07-01 11:12:26,this is the best assassins creed game and prob...,2020-09-19 13:12:59,True,2558,37,0.979318,11,True,False,False
2,0,2,117391,77,90196,2023-08-26 03:33:05,best game ever iv 'e been playing from day one...,2020-05-23 19:09:36,True,2190,227,0.979193,0,True,False,False
3,77,4,4719,0,4036,2021-04-03 17:48:13,shanties before panties,2021-03-10 19:36:02,True,2174,897,0.974363,11,True,False,False
4,0,10,1592,0,1177,2022-02-11 01:37:29,best part of the game are the sea shanties low...,2020-06-27 00:22:26,True,1041,240,0.958593,0,True,False,False


We will try a few different text summarization models on a small number of reviews and check whether each model summarizes them properly.
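The `info()` output above shows 21,729 non-null reviews out of 21,812 rows, so some `review` values are missing and would fail if passed straight to a tokenizer. A minimal sketch of a guard is shown here; `summarize_or_skip` and the stand-in `first_five_words` summarizer are hypothetical names for illustration and are not part of the original notebook.

```python
def summarize_or_skip(text, summarize_fn):
    """Return summarize_fn(text), or None when the review is missing or blank."""
    # pandas stores a missing review as float NaN, which is not a str
    if not isinstance(text, str) or not text.strip():
        return None
    return summarize_fn(text)


# Stand-in summarizer for illustration only: keeps the first five words
def first_five_words(text):
    return " ".join(text.split()[:5])


print(summarize_or_skip("the only game where i avoid fast travel", first_five_words))
print(summarize_or_skip(float("nan"), first_five_words))
```

In the notebook, any of the summarizers defined below could be passed in place of the stand-in function.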

## Text Summarization Models

### Hugging Face: Bart Large CNN

First, we try Hugging Face's [bart large cnn model](https://huggingface.co/facebook/bart-large-cnn), a model fine-tuned mainly on CNN Daily Mail news articles.

In [14]:
model_name = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)

def generate_summary(text):
    # Tokenize the review, keeping only the first 100 tokens
    input_ids = tokenizer.encode(text, return_tensors="pt", max_length=100, truncation=True)
    # Beam search with 4 beams for a summary of 10 to 30 tokens
    summary_ids = model.generate(input_ids, max_length=30, min_length=10, num_beams=4, length_penalty=2.0, early_stopping=True)
    # Convert the generated token ids back into text
    summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary_text

We will try this on individual reviews first to see the performance.

In [15]:
# Generating summary of the third review
generate_summary(ass_creed['review'][2])

"best game ever iv 'e been playing from day one (love it and i'm 75 years young arrrrr matey)"

In [16]:
# Comparing with the original review
ass_creed['review'][2]

"best game ever iv 'e been playing from day one (love it and i 'm 75 years young arrrrr matey"

We can see that the model essentially just extracted the whole original review. Let's try this same model on a longer review.
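To make "just extracted the review" measurable rather than judged by eye, one option is to compute how much of a summary is copied, in order, from its source. The sketch below uses the standard library's `difflib.SequenceMatcher` on whitespace-split words; the function name `copied_fraction` is an assumption for illustration, and the original notebook does not compute such a score.

```python
from difflib import SequenceMatcher


def copied_fraction(summary, source):
    """Fraction of summary words that also appear, in order, in the source.

    Values near 1.0 suggest the summary is mostly a verbatim extract.
    """
    summary_words = summary.lower().split()
    source_words = source.lower().split()
    if not summary_words:
        return 0.0
    # autojunk=False keeps frequent words from being ignored on long reviews
    matcher = SequenceMatcher(None, summary_words, source_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(summary_words)


original = "best game ever iv 'e been playing from day one (love it and i 'm 75 years young arrrrr matey"
bart_summary = "best game ever iv 'e been playing from day one (love it and i'm 75 years young arrrrr matey)"
print(round(copied_fraction(bart_summary, original), 2))
```

Because the comparison is word-based, small spacing or punctuation differences (such as `i'm` versus `i 'm`) count as mismatches, so the score is a rough indicator only.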

In [17]:
# Generating summary of the nineteenth review
generate_summary(ass_creed['review'][18])

'Gorgeous graphics detailed immersive environments very strong voice -acting tons to explore lots of environmental diversity does extremely well to encourage a sense of'

In [18]:
# Comparing with the original review
ass_creed['review'][18]

"[h1]richly populated extremely immersive // recommended for ac fans pirate fans alike [/h1] significantly less boring hand -holding in the main campaign than the previous entry of the series reliably solid parkour mechanics there were a few bugs glitches but were very minor gorgeous graphics detailed immersive environments very strong voice -acting tons to explore lots of environmental diversity does extremely well to encourage a sense of adventure ship sailing upgrade mechanics add a lot of depth to the game exploration doesn 't simply consist of empty fetch -quests lookout points there is a lot of treasure to be found myriad activities to be done naval combat seems to have found a comfortable middle -ground between simplicity immersive completxity it all feels very sensible to pick up without losing the warmth of immersion (and it 's very fun hunting crafting mechanics actually felt useful worthwhile (as opposed to ac3 this entry probably boasts the most interesting likeable protago

For the longer review, the model picked out a passage from the first half of the review. However, that passage does not summarise the review well.
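Note that `generate_summary` tokenizes with `max_length=100` and `truncation=True`, so the model only ever reads the first 100 tokens of a long review. One possible workaround, not tried in the original notebook, is to split the review into chunks and summarize each chunk separately. The sketch below splits on whitespace; `chunk_words` and the `max_words` default are assumptions, and word counts are only a rough proxy for the tokenizer's subword counts.

```python
def chunk_words(text, max_words=100):
    """Split text into consecutive chunks of at most max_words words each."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


# Illustration with a synthetic 250-word text
sample = " ".join(["word"] * 250)
print([len(chunk.split()) for chunk in chunk_words(sample)])
```

Each chunk could then be passed to `generate_summary` in turn, at the cost of one model call per chunk.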

### Hugging Face: T5 Small Model

Next, we try a different model from the Hugging Face library: T5 Small.

In [40]:
# Instantiating our model
summarizer = pipeline(
    task="summarization",
    model="t5-small",
    min_length=5,
    max_length=50,
    truncation=True)

In [41]:
# Performing text summarization on the same examples as above to test it out
summarizer(ass_creed['review'][2])

Your max_length is set to 50, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


[{'summary_text': "best game ever iv 'e been playing from day one (love it and i 'm 75 years young arrrr matey . i'm a 75-year-old arr"}]

In [42]:
# Comparing with the original
ass_creed['review'][2]

"best game ever iv 'e been playing from day one (love it and i 'm 75 years young arrrrr matey"

Like the Bart Large CNN model, the T5 Small model simply reproduced the entire review, and even appended a rephrased repeat of part of it. Let's see how well it does on the longer review.

In [43]:
summarizer(ass_creed['review'][18])

[{'summary_text': "combat was about as solid as ever didn 't do much to innovate in comparison to other entries of the series . the main plot was not horrible it felt considerably weaker than other entries in the series so many tailing /e"}]

In [44]:
ass_creed['review'][18]

"[h1]richly populated extremely immersive // recommended for ac fans pirate fans alike [/h1] significantly less boring hand -holding in the main campaign than the previous entry of the series reliably solid parkour mechanics there were a few bugs glitches but were very minor gorgeous graphics detailed immersive environments very strong voice -acting tons to explore lots of environmental diversity does extremely well to encourage a sense of adventure ship sailing upgrade mechanics add a lot of depth to the game exploration doesn 't simply consist of empty fetch -quests lookout points there is a lot of treasure to be found myriad activities to be done naval combat seems to have found a comfortable middle -ground between simplicity immersive completxity it all feels very sensible to pick up without losing the warmth of immersion (and it 's very fun hunting crafting mechanics actually felt useful worthwhile (as opposed to ac3 this entry probably boasts the most interesting likeable protago

Comparing the results to the original text, the T5 Small model, like the Bart Large CNN model, just extracts a small portion of text, this time from the middle of the review. Again, this does not summarize the review well.
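The length warning printed when T5 summarized the short review suggests lowering `max_length` for short inputs (19 for an input length of 38). A small helper could choose that value per review instead of using a fixed 50. This is only a sketch: `suggest_max_length` is a hypothetical name, and halving the input length simply mirrors the example value in the warning.

```python
def suggest_max_length(input_length, cap=50, floor=5):
    """Pick a summary max_length of about half the input, kept within [floor, cap]."""
    return max(floor, min(cap, input_length // 2))


# Input lengths are token counts; 38 is the value reported in the warning
for n_tokens in (38, 200, 4):
    print(n_tokens, suggest_max_length(n_tokens))
```

The result could be passed as the `max_length` argument when calling the pipeline, in the same way the Pegasus calls below pass `max_length` per review.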

### Hugging Face: Pegasus Xsum

Neither model has summarised even our two test reviews well. Let's test one more model before moving on to a different type of model. We will try the [Pegasus Xsum Model](https://huggingface.co/google/pegasus-xsum), which was pretrained on a public Common Crawl web scrape (C4) and HugeNews data, and fine-tuned on the XSum news summarization dataset.

In [51]:
summypeg = pipeline("summarization", model= "google/pegasus-xsum")

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [52]:
# Using our model on the test reviews
summypeg(ass_creed['review'][2], max_length=20, min_length=5, do_sample=False)

[{'summary_text': 'Happy birthday to one of my all-time favourite footballers, Paul Gascoigne.'}]

In [53]:
# Comparing it to the original
ass_creed['review'][2]

"best game ever iv 'e been playing from day one (love it and i 'm 75 years young arrrrr matey"

In [54]:
# Longer test review
summypeg(ass_creed['review'][18], max_length=30, min_length=10, do_sample=False)

[{'summary_text': 'Pirates of the Caribbean: Big Day Out is the latest entry in one of the most popular video game series in the world.'}]

In [55]:
# Original review
ass_creed['review'][18]

"[h1]richly populated extremely immersive // recommended for ac fans pirate fans alike [/h1] significantly less boring hand -holding in the main campaign than the previous entry of the series reliably solid parkour mechanics there were a few bugs glitches but were very minor gorgeous graphics detailed immersive environments very strong voice -acting tons to explore lots of environmental diversity does extremely well to encourage a sense of adventure ship sailing upgrade mechanics add a lot of depth to the game exploration doesn 't simply consist of empty fetch -quests lookout points there is a lot of treasure to be found myriad activities to be done naval combat seems to have found a comfortable middle -ground between simplicity immersive completxity it all feels very sensible to pick up without losing the warmth of immersion (and it 's very fun hunting crafting mechanics actually felt useful worthwhile (as opposed to ac3 this entry probably boasts the most interesting likeable protago

We can see that the summaries were completely unrelated to what was mentioned in the reviews, with one even bringing a separate game into the summary.
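A quick way to flag this kind of unrelated output is to list the words in a summary that never occur in the source review. The sketch below is an illustrative check under simple assumptions (lowercased alphanumeric tokens, exact word matches); `unsupported_words` is a hypothetical helper, not something the original notebook uses.

```python
import re


def unsupported_words(summary, source):
    """Return summary words (in order, without repeats) absent from the source."""
    source_vocab = set(re.findall(r"[a-z0-9]+", source.lower()))
    seen = set()
    missing = []
    for word in re.findall(r"[a-z0-9]+", summary.lower()):
        if word not in source_vocab and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing


review = "best game ever iv 'e been playing from day one (love it and i 'm 75 years young arrrrr matey"
pegasus_summary = "Happy birthday to one of my all-time favourite footballers, Paul Gascoigne."
print(unsupported_words(pegasus_summary, review))
```

A summary whose words are almost all unsupported, as here, is a strong hint that the model has drifted away from the review, although genuine abstractive summaries will also introduce some new words.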

As these models have not even passed the initial two-review sample stage, let's move on to a different unsupervised model. The next notebook will focus on using OpenAI's GPT-3.5 Turbo to perform text summarization on our data.