# Summarizing Steam Reviews

## Pulling in the Data

In [1]:
import pandas as pd

raw_data = pd.read_csv('/kaggle/input/steam-reviews/dataset.csv')

Since I'm only summarizing the review text, the other columns are unnecessary.

In [2]:
raw_data.drop(columns=['review_votes', 'review_score', 'app_id'], inplace=True, errors='ignore')

In [3]:
raw_data.head(5)

| | app_name | review_text |
|---|---|---|
| 0 | Counter-Strike | Ruined my life. |
| 1 | Counter-Strike | This will be more of a ''my experience with th... |
| 2 | Counter-Strike | This game saved my virginity. |
| 3 | Counter-Strike | • Do you like original games? • Do you like ga... |
| 4 | Counter-Strike | Easy to learn, hard to master. |


In [4]:
raw_data.dropna(axis='index', inplace=True)

### Cleaning the Data

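Some of the reviews contain a lot of messy text, so I'm going to strip it out with two regexes: one removes everything from a `/` to the end of the line, the other removes everything between an `&` and the last `;` on the line. Reviews that end up empty or shorter than 50 characters are dropped.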

In [5]:
import re
import numpy as np
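def clean_text(text):
    # Inner sub: drop everything from a '/' to the end of that line.
    # Outer sub: drop everything from the first '&' through the last ';' on a line (greedy).
    return re.sub('&.*;', '', re.sub('/.*', '', text))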

raw_data['review_text'] = raw_data['review_text'].apply(clean_text)
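# Assign the result back: an in-place replace on a selected column may not update raw_data.
raw_data['review_text'] = raw_data['review_text'].replace('', np.nan)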
raw_data.dropna(subset=['review_text'], inplace=True)
raw_data = raw_data.loc[raw_data['review_text'].str.len() >= 50]
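
Both patterns in `clean_text` are greedy, so on some reviews they may remove more than intended. For example, a score like `10/10` cuts off the rest of the line. The sketch below compares that behavior with a narrower alternative. `clean_text_narrow` and its two patterns are hypothetical: they assume the goal was to strip HTML entities and URLs, and they are not the cleaning used for the results in this notebook.

```python
import re


def clean_text_greedy(text):
    # Same two patterns as the notebook cell above.
    return re.sub('&.*;', '', re.sub('/.*', '', text))


def clean_text_narrow(text):
    # Hypothetical alternative: remove only entity-like tokens such as &gt; or &#39;
    text = re.sub(r'&#?\w+;', '', text)
    # ...and only URL-like runs, leaving other slashes alone.
    return re.sub(r'https?://\S+', '', text)


# Made-up sample review, for illustration only.
sample = "10/10 would play again &gt; great; really"

print(repr(clean_text_greedy(sample)))
print(repr(clean_text_narrow(sample)))
```

On this sample the greedy version keeps only the text before the first slash, while the narrow version removes just the `&gt;` token.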

In [6]:
reorder_raw_data = raw_data.review_text.str.len().sort_values(ascending=True).index
raw_data = raw_data.reindex(reorder_raw_data)
raw_data.drop_duplicates(inplace=True)

In [7]:
from datasets import Dataset
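# Take 50 reviews (positions 1000-1049) from the length-sorted frame.
# The sort above is ascending, so these are among the shortest remaining reviews.
data_df = Dataset.from_pandas(raw_data.iloc[1000:1050])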
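
The slice above is positional on an ascending sort, so it draws from the short end of the data. If the longest reviews are wanted instead, one option is to take rows from the end of the sorted frame. This is a minimal sketch on a toy frame; `toy` and its strings are made up for illustration, and the slice sizes are arbitrary.

```python
import pandas as pd

# Toy stand-in for raw_data; the review strings are invented.
toy = pd.DataFrame({"review_text": ["aaaa", "a", "aaa", "aa", "aaaaa"]})

# Same pattern as the notebook: sort row labels by text length, then reindex.
order = toy["review_text"].str.len().sort_values(ascending=True).index
toy_sorted = toy.reindex(order)

shortest_two = toy_sorted.iloc[:2]  # front of an ascending sort = shortest rows
longest_two = toy_sorted.tail(2)    # back of an ascending sort = longest rows

print(shortest_two["review_text"].tolist())
print(longest_two["review_text"].tolist())
```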

In [8]:
import gc
del raw_data
gc.collect()

51

## Model Time

In [9]:
from transformers import T5ForConditionalGeneration, T5TokenizerFast

checkpoint = "t5-small"
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to('cuda')
tokenizer = T5TokenizerFast.from_pretrained(checkpoint)

In [10]:
batch = tokenizer(data_df['review_text'], truncation=True, padding=True, return_tensors="pt").to('cuda')

In [11]:
translated = model.generate(**batch)
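
The call above sends the raw review text to T5 with no task prefix and with default generation settings. The original T5 checkpoints were trained with text prefixes such as `summarize: ` and `translate English to German: `, so the missing prefix may be why many of the outputs below come back in German. The sketch that follows shows one way to add the prefix. `add_task_prefix` and `summarize` are hypothetical helpers, and the `max_new_tokens` and `num_beams` values are illustrative assumptions, not tuned settings.

```python
def add_task_prefix(texts, prefix="summarize: "):
    """Prepend a T5-style task prefix to every input string."""
    return [prefix + text for text in texts]


def summarize(texts, model, tokenizer, device="cuda", max_new_tokens=40, num_beams=4):
    """Hypothetical helper: prefix, tokenize, generate, and decode in one call."""
    batch = tokenizer(
        add_task_prefix(texts),
        truncation=True,
        padding=True,
        return_tensors="pt",
    ).to(device)
    # max_new_tokens and num_beams are assumed values for illustration.
    generated = model.generate(**batch, max_new_tokens=max_new_tokens, num_beams=num_beams)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


# The prefix step can be checked on its own, without loading a model.
prefixed = add_task_prefix(["Easy to learn, hard to master."])
print(prefixed)
```

With the model and tokenizer loaded as in cell 9, this could be called as `summarize(data_df['review_text'], model, tokenizer)`. The results would differ from the output shown in this notebook.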

In [12]:
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
# data_df['summarized'] = tgt_text

In [13]:
tgt_text

['ok its hard as so you no its a good game 4',
 '         ',
 '!',
 'Haben Sie 200-400 Stunden zu spare? Kaufen Sie dieses Spiel.',
 'Das ist das schlimmste, was sich mit der Menschheit ereignet hat.',
 'are looking for DOOM but with zombies?',
 'Wahrscheinlich das best game in the entire NFS franchise.',
 'Best Game ever defeated it in 12 hours in pasifest in 12 hours in pasifest',
 'Dieses Spiel zeigt, wie gut Nvdia gegenüber AMD ist.',
 'Dark Souls II ist noch schwerer als Dark Souls und Dark Souls II kombiniert.',
 'game and has a great storyline.',
 '- a game.',
 'game, just needs a new game plus.',
 'Brilliant story, excellent gameplay, great puzzles.',
 'Portals.',
 'Play as the most  off guy not on this planet',
 '',
 'Eines der besten Spiele, die Spaß und Spaß ist.',
 'Epic game, sowohl für Singleplayer als auch für Multiplayer.',
 'hnliches Gefühl wie Rainbow Six, es ist sehr einfach zu m',
 'Werden Sie fertig fertig fertig fertig fertig fertig fertig fertig',
 '!',
 'i say. 