<a href="https://www.kaggle.com/thomaslazarus/steam-reviews-summarizing-reviews?scriptVersionId=84272379" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Summarizing Steam Reviews

## Pulling in the Data

In [1]:
import pandas as pd

raw_data = pd.read_csv('/kaggle/input/steam-reviews/dataset.csv')

Since I'm only summarizing the review text, the other columns are unnecessary.

In [2]:
raw_data.drop(columns=['review_votes', 'review_score', 'app_id'], inplace=True, errors='ignore')

In [3]:
raw_data.head(5)

| | app_name | review_text |
|---|---|---|
| 0 | Counter-Strike | Ruined my life. |
| 1 | Counter-Strike | This will be more of a ''my experience with th... |
| 2 | Counter-Strike | This game saved my virginity. |
| 3 | Counter-Strike | • Do you like original games? • Do you like ga... |
| 4 | Counter-Strike | Easy to learn, hard to master. |


In [4]:
raw_data.dropna(axis='index', inplace=True)

### Cleaning the Data

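Some of the reviews contain a lot of messy text, so I strip what looks like leftover markup: everything from a `/` to the end of its line, and everything between an `&` and the last `;` on a line. After that I drop reviews that end up empty and keep only reviews that are at least 500 characters long.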

In [5]:
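import re
import numpy as np

def clean_text(text):
    # Drop everything from the first '/' to the end of each line, then
    # everything from the first '&' to the last ';' on a line (both greedy).
    return re.sub(r'&.*;', '', re.sub(r'/.*', '', text))

raw_data['review_text'] = raw_data['review_text'].apply(clean_text)
# Treat reviews that became empty as missing and drop them.
# Assigning the result avoids an in-place replace on a column selection.
raw_data['review_text'] = raw_data['review_text'].replace('', np.nan)
raw_data.dropna(subset=['review_text'], inplace=True)
# Keep only reviews of at least 500 characters.
raw_data = raw_data.loc[raw_data['review_text'].str.len() >= 500]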
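
Both patterns in `clean_text` are greedy, so a single `/` discards the rest of its line (including scores like "10/10"), and an `&` early in a line plus a `;` late in it removes everything in between. As a hedged alternative, here is a minimal sketch that only removes URLs and decodes HTML entities instead of deleting them. The name `clean_text_narrow` and both patterns are assumptions for illustration, not part of the original notebook, and using them would change which reviews pass the 500-character filter.

```python
import html
import re

# Hypothetical alternative to clean_text: remove URLs, decode HTML entities,
# and collapse runs of whitespace, leaving the rest of the review intact.
URL_PATTERN = re.compile(r'https?://\S+')
WHITESPACE_PATTERN = re.compile(r'\s+')

def clean_text_narrow(text):
    text = URL_PATTERN.sub('', text)   # strip only the URL itself
    text = html.unescape(text)         # decode entities such as '&amp;'
    return WHITESPACE_PATTERN.sub(' ', text).strip()

sample = 'Great co-op &amp; PvP; see https://example.com/guide for tips. 10/10'
print(clean_text_narrow(sample))
```

With the original `clean_text`, the same sample would be cut at the first `/`, losing the second half of the sentence.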

In [6]:
# Sort reviews from longest to shortest, then drop exact duplicate rows.
reorder_raw_data = raw_data.review_text.str.len().sort_values(ascending=False).index
raw_data = raw_data.reindex(reorder_raw_data)
raw_data.drop_duplicates(inplace=True)
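
Long reviews are not always meaningful ones: the LEMONS review sampled near the end of this notebook is mostly one repeated word, and its summary is nonsense. One way to catch such inputs before summarizing, sketched here as an assumption rather than anything the notebook does, is to measure the share of distinct words in a review. The name `distinct_word_ratio` and the 0.2 cutoff are hypothetical and would need tuning on real reviews.

```python
def distinct_word_ratio(text: str) -> float:
    """Return the fraction of distinct (case-insensitive) words in text."""
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)

MIN_RATIO = 0.2  # hypothetical cutoff; tune on real data

spam = "So first of all .. " + "LEMONS " * 200
normal = "Easy to learn, hard to master, and still fun after many hours."

for label, text in [("spam", spam), ("normal", normal)]:
    ratio = distinct_word_ratio(text)
    print(label, round(ratio, 3), "keep" if ratio >= MIN_RATIO else "skip")
```

On a DataFrame, the same function could be applied to `review_text` and used as a boolean mask alongside the length filter.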

In [7]:
from datasets import Dataset
# Take a 25-row slice of the length-sorted reviews to summarize.
data_df = Dataset.from_pandas(raw_data[1975:2000])

In [8]:
import gc
del raw_data
gc.collect()

51

## Model Time

In [9]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "sshleifer/distilbart-cnn-12-6"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to('cuda')

In [10]:
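# Tokenize the selected reviews as one padded batch; inputs longer than the model's maximum length are truncated.
batch = tokenizer(data_df['review_text'], truncation=True, padding=True, return_tensors="pt").to('cuda')
# Generate summary token IDs, then decode them back into text.
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)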
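
Tokenizing and generating for every review in a single call works for this 25-review slice, but the padded batch grows with the number of reviews and may exhaust GPU memory on larger slices. A minimal sketch of processing the texts in fixed-size chunks follows. `summarize_in_batches`, `summarize_fn`, and the batch size of 8 are hypothetical names and values, not part of the notebook, and `fake_summarize` is only a stand-in so the example runs without a model or GPU.

```python
from typing import Callable, List

def summarize_in_batches(
    texts: List[str],
    summarize_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Apply summarize_fn to texts in chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    summaries: List[str] = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        summaries.extend(summarize_fn(chunk))
    return summaries

# Stand-in for the model: keeps the first three words of each text.
def fake_summarize(chunk: List[str]) -> List[str]:
    return [" ".join(text.split()[:3]) for text in chunk]

reviews = [f"review number {i} with some extra words" for i in range(20)]
summaries = summarize_in_batches(reviews, fake_summarize, batch_size=8)
print(len(summaries), summaries[0])
```

In the notebook, `summarize_fn` would wrap the tokenize, `model.generate`, and `tokenizer.batch_decode` steps from the cell above, applied to one chunk at a time.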

In [11]:
df = pd.DataFrame(data={'review_text': data_df['review_text'], 'summarized_text': tgt_text})

In [12]:
df.head(5)

| | review_text | summarized_text |
|---|---|---|
| 0 | Necrovision: Lost Company isn't a well-crafted... | The voice acting is terrible but fits the gam... |
| 1 | Spec Ops: The Line is an excellent example of ... | Spec Ops: The Line is a trite piece of underc... |
| 2 | To start off this review, I'll say that I Kick... | While it looks and sounds great, The Dwarves ... |
| 3 | Dragonfall Director's Cut is an interesting ga... | Dragonfall Director's Cut is based in Shadowr... |
| 4 | “I don't think any word can explain a rat's li... | "Bad Rats" is more than a great game; it is a... |


In [13]:
import random
random_num = random.randint(0, 24)

In [14]:
df['review_text'][random_num]

'A very brief summary       So first of all ..    LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMONS LEMONS LEMONS  LEMO

In [15]:
df['summarized_text'][random_num]

" A very brief summary of a very brief brief summary. A look at some of the history of the world's greatest moments. A few words:   LEMONS LEMON Lemons LemONS L.E. L.M. L.E. E. MARTIN LESSONS LESSES:  LESSESS: LESSERMENT LESSERS:  I'm all over the world, and I'm sure it's all over, but I can't wait to see what's going on from here."