## Create an NLP Pipeline to 'Clean' Reviews Data
- Load Input File and Read Reviews
- Tokenize
- Remove Stopwords
- Perform Stemming
- Write cleaned data to output file
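
The first and last steps in this list (reading reviews and writing the cleaned text) could look roughly like the sketch below. It assumes a UTF-8 input file with one review per line, which is an assumption rather than something stated here. `clean_file` and its `clean_fn` parameter are hypothetical names introduced for illustration; `clean_fn` stands for any function that maps a raw review string to a cleaned one.

```python
from typing import Callable


def clean_file(input_path: str, output_path: str,
               clean_fn: Callable[[str], str]) -> int:
    """Read one review per line, clean each one, and write the results.

    Assumes UTF-8 text with one review per line (an assumption; the
    real input layout may differ). Returns the number of reviews written.
    """
    count = 0
    with open(input_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            review = line.strip()
            if not review:  # skip blank lines
                continue
            dst.write(clean_fn(review) + "\n")
            count += 1
    return count
```

Passing the cleaning function as an argument keeps the file handling independent of the tokenizer, stopword list, and stemmer, so a call such as `clean_file("reviews.txt", "cleaned_reviews.txt", getStemmedReview)` would work with the function defined further down (both file names here are placeholders).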

In [1]:
sample_text = """
    First things first, Edison Chen did a fantastic, believable job as a Cambodian hit-man, 
    born and bred in the dumps and a gladiatorial ring, where he honed his craft of savage battery in order to survive, 
    living on the mantra of kill or be killed. In a role that had little dialogue, or at least a few lines in Cambodian/Thai,
    his performance is compelling, probably what should have been in the Jet Li vehicle Danny the Dog, 
    where a man is bred for the sole purpose of fighting, and on someone else's leash.<br /><br />Like Danny the Dog, 
    the much talked about bare knuckle fight sequences are not choreographed stylistically, but rather designed as normal, 
    brutal fisticuffs, where everything goes. This probably brought a sense of realism and grit when you see the characters 
    slug it out at each other's throats, in defending their own lives while taking it away from others. It's a grim, gritty 
    and dark movie both literally and figuratively, and this sets it apart from the usual run off the mill cop thriller 
    production.<br /><br />Edison plays a hired gun from Cambodia, who becomes a fugitive in Hong Kong, on the run from the 
    cops as his pickup had gone awry. Leading the chase is the team led by Cheung Siu-Fai, who has to contend with maverick 
    member Inspector Ti (Sam Lee), who's inclusion and acceptance in the team had to do with the sins of his father. 
    So begins a cat and mouse game in the dark shades and shadows of the seedier looking side of Hong Kong.<br /><br />
    The story itself works on multiple levels, especially in the character studies of the hit-man, and the cop. 
    On opposite sides of the law, we see within each character not the black and white, but the shades of grey. 
    With the hit-man, we see his caring side when he got hooked up and developed feelings of love for a girl (Pei Pei), 
    bringing about a sense of maturity, tenderness, and revealing a heart of gold. The cop, with questionable tactics and 
    attitudes, makes you wonder how one would buckle when willing to do anything it takes to get the job done. 
    There are many interesting moments of moral questioning, on how anti-hero, despicable strategies are adopted. 
    You'll ask, what makes a man, and what makes a beast, and if we have the tendency to switch sides depending on 
    circumstances - do we have that dark inner streak in all of us, transforming from man to dog, and dog to man? 
    Dog Bite Dog grips you from the start and never lets go until the end, though there are points mid way through 
    that seemed to drag, especially on its tender moments, and it suffered too from not knowing when to end. 
    If I should pick a favourite scene, then it must be the one in the market food centre - extremely well controlled and 
    delivered, a suspenseful edge of your seat moment. Listen out for the musical score too, and you're not dreaming if you 
    hear growls of dogs.<br /><br />Highly recommended, especially if you think that you've seen about almost everything from 
    the cop thriller genre."""

#### NLTK

In [2]:
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import sys

In [3]:
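# Initialize objects
# stopwords.words() needs the NLTK 'stopwords' corpus; run nltk.download('stopwords') once if it is missing.
tokenizer = RegexpTokenizer(r'\w+')  # keep runs of word characters, drop punctuation
en_stopwords = set(stopwords.words('english'))  # a set gives fast membership tests
ps = PorterStemmer()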
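
To see what the `\w+` pattern does to a review, the sketch below applies the same regular expression with the standard library's `re` module. This is a stand-in for illustration, not NLTK itself; it assumes `RegexpTokenizer` is used with its default settings, where it returns the pattern's matches. The `snippet` string is a shortened piece of the sample review.

```python
import re

# Same pattern that is passed to RegexpTokenizer in the cell above.
WORD_PATTERN = re.compile(r"\w+")

snippet = "a Cambodian hit-man, on someone else's leash.<br /><br />Like Danny"

# Lowercase and replace the HTML line breaks before matching word runs.
prepared = snippet.lower().replace("<br /><br />", " ")
tokens = WORD_PATTERN.findall(prepared)
print(tokens)
# ['a', 'cambodian', 'hit', 'man', 'on', 'someone', 'else', 's', 'leash', 'like', 'danny']
```

Punctuation is dropped, so `hit-man` becomes two tokens and `else's` leaves a stray `s`. If the `<br /><br />` markers were not replaced first, the pattern would also emit `br` tokens.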

In [4]:
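def getStemmedReview(review):
    """Lowercase a review, replace '<br /><br />' breaks with spaces, drop stopwords, and stem."""
    review = review.lower()
    review = review.replace('<br /><br />', ' ')

    # Tokenize into word tokens (punctuation is dropped)
    tokens = tokenizer.tokenize(review)
    # Remove stopwords
    new_tokens = [token for token in tokens if token not in en_stopwords]
    # Stem each remaining token
    stemmed_tokens = [ps.stem(token) for token in new_tokens]

    cleaned_review = ' '.join(stemmed_tokens)

    return cleaned_review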
    

In [5]:
getStemmedReview(sample_text)

'first thing first edison chen fantast believ job cambodian hit man born bred dump gladiatori ring hone craft savag batteri order surviv live mantra kill kill role littl dialogu least line cambodian thai perform compel probabl jet li vehicl danni dog man bred sole purpos fight someon els leash like danni dog much talk bare knuckl fight sequenc choreograph stylist rather design normal brutal fisticuff everyth goe probabl brought sens realism grit see charact slug throat defend live take away other grim gritti dark movi liter figur set apart usual run mill cop thriller product edison play hire gun cambodia becom fugit hong kong run cop pickup gone awri lead chase team led cheung siu fai contend maverick member inspector ti sam lee inclus accept team sin father begin cat mous game dark shade shadow seedier look side hong kong stori work multipl level especi charact studi hit man cop opposit side law see within charact black white shade grey hit man see care side got hook develop feel love