### A General Flow of Dataset Creation
1. load the CSV into a pandas dataframe
2. clean the text for tokenization
3. get `max_length` from the lengths of the tokenized reviews
4. write out `reviews_as_table` as a `.txt` file
   - join all 50000 entries into one big string (the corpus from which we will form our vocabulary)
   - separate the reviews into a list of 50000 lists, where each inner list contains the tokens for one review (feature vector representation)
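
As a rough sketch of step 4, the two sub-bullets could look like the following. It uses a two-review toy list in place of `reviews_as_table["cleaned_tokenized"].tolist()`, and the output file name `reviews_tokens.txt` is a hypothetical choice. It also assumes no token contains whitespace, so that splitting each line recovers the original lists.

```python
import os
import tempfile

# Toy stand-in for reviews_as_table["cleaned_tokenized"].tolist()
tokenized_reviews = [
    ["<s>", "read", "the", "book", ",", "forget", "the", "movie", "!"],
    ["<s>", "what", "a", "mess", "!"],
]

# One big string: the corpus from which the vocabulary is formed.
corpus = " ".join(" ".join(tokens) for tokens in tokenized_reviews)

# Vocabulary: every distinct token in the corpus, sorted for a stable order.
vocab = sorted(set(corpus.split()))

# Write one review per line so the list-of-lists structure can be recovered.
# The file name is hypothetical; a temp directory keeps the sketch side-effect free.
out_path = os.path.join(tempfile.gettempdir(), "reviews_tokens.txt")
with open(out_path, "w", encoding="utf-8") as f:
    for tokens in tokenized_reviews:
        f.write(" ".join(tokens) + "\n")

# Reading it back: splitting each line on whitespace restores the lists.
with open(out_path, encoding="utf-8") as f:
    restored = [line.split() for line in f]
```

Writing one review per line keeps both views available from a single file: reading the whole file gives the corpus string, and reading it line by line gives the per-review token lists.
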
### Nice to haves for Thursday Probs
- have the `max_length`
- a vector representation for each and every review (this depends on us having the corpus vocab)
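
One possible shape for these two items, sketched on toy data: map each token to an integer id from the corpus vocabulary, then pad every review to `max_length`. The `<pad>` and `<unk>` tokens and the `encode` helper are assumptions made for illustration and do not appear elsewhere in this notebook.

```python
# Toy stand-in for reviews_as_table["cleaned_tokenized"].tolist()
tokenized_reviews = [
    ["<s>", "read", "the", "book", ",", "forget", "the", "movie", "!"],
    ["<s>", "what", "a", "mess", "!"],
]

# Hypothetical special tokens: padding and out-of-vocabulary.
PAD, UNK = "<pad>", "<unk>"

# Corpus vocabulary: special tokens first, then every distinct token.
vocab = [PAD, UNK] + sorted({tok for review in tokenized_reviews for tok in review})
token_to_id = {tok: i for i, tok in enumerate(vocab)}

# max_length is the token count of the longest review.
max_length = max(len(review) for review in tokenized_reviews)


def encode(tokens, token_to_id, max_length):
    """Map tokens to ids, then right-pad with the PAD id up to max_length."""
    ids = [token_to_id.get(tok, token_to_id[UNK]) for tok in tokens[:max_length]]
    return ids + [token_to_id[PAD]] * (max_length - len(ids))


vectors = [encode(review, token_to_id, max_length) for review in tokenized_reviews]
```

Every entry of `vectors` has the same length, which is why `max_length` has to be known before the reviews can be stacked into a single array.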

In [1]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.util import bigrams
nltk.download('punkt')
nltk.download('wordnet')
import re

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\cacac\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\cacac\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Load into environment as Pandas DF

In [2]:
reviews_as_table = pd.read_csv('IMDB Dataset.csv')
reviews_as_table

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


# This cell was to test if I knew how to use Pandas

In [3]:
reviews_as_table["cleaned"] = reviews_as_table["review"].str.lower()
reviews_as_table.head()

Unnamed: 0,review,sentiment,cleaned
0,One of the other reviewers has mentioned that ...,positive,one of the other reviewers has mentioned that ...
1,A wonderful little production. <br /><br />The...,positive,a wonderful little production. <br /><br />the...
2,I thought this was a wonderful way to spend ti...,positive,i thought this was a wonderful way to spend ti...
3,Basically there's a family where a little boy ...,negative,basically there's a family where a little boy ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"petter mattei's ""love in the time of money"" is..."


# Stolen pre-processing steps from module 1 examples & lab

In [4]:
# Drop non-ASCII characters.
reviews_as_table["cleaned"] = reviews_as_table["cleaned"].apply(lambda x: re.sub(r'[^\x00-\x7f]', r'', x))
# Insert a <s> marker after each paragraph break that follows a period
# (with or without a space before the <br /> tags).
reviews_as_table["cleaned"] = reviews_as_table["cleaned"].apply(lambda x: x.replace(".<br /><br />",".<br /><br /><s>"))
reviews_as_table["cleaned"] = reviews_as_table["cleaned"].apply(lambda x: x.replace(". <br /><br />",". <br /><br /><s>"))
# Prefix every review with a start marker.
reviews_as_table["cleaned"] = reviews_as_table["cleaned"].apply(lambda x: "<s> " + x)

reviews_as_table

Unnamed: 0,review,sentiment,cleaned
0,One of the other reviewers has mentioned that ...,positive,<s> one of the other reviewers has mentioned t...
1,A wonderful little production. <br /><br />The...,positive,<s> a wonderful little production. <br /><br /...
2,I thought this was a wonderful way to spend ti...,positive,<s> i thought this was a wonderful way to spen...
3,Basically there's a family where a little boy ...,negative,<s> basically there's a family where a little ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"<s> petter mattei's ""love in the time of money..."
...,...,...,...
49995,I thought this movie did a down right good job...,positive,<s> i thought this movie did a down right good...
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,"<s> bad plot, bad dialogue, bad acting, idioti..."
49997,I am a Catholic taught in parochial elementary...,negative,<s> i am a catholic taught in parochial elemen...
49998,I'm going to have to disagree with the previou...,negative,<s> i'm going to have to disagree with the pre...


# Tokenization

In [5]:
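def custom_tokenize(text):
    # word_tokenize would split "<s>" into "<", "s", ">", so swap the markers
    # for plain-word placeholders before tokenizing.
    text = text.replace("<s>", " SPECIAL_START_TOKEN ").replace("</s>", " SPECIAL_END_TOKEN ")
    tokens = word_tokenize(text)
    # Restore the original markers now that each placeholder is a single token.
    tokens = [token.replace("SPECIAL_START_TOKEN", "<s>").replace("SPECIAL_END_TOKEN", "</s>") for token in tokens]
    return tokens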

reviews_as_table["cleaned_tokenized"] = reviews_as_table["cleaned"].apply(lambda x: custom_tokenize(str(x)))
reviews_as_table

Unnamed: 0,review,sentiment,cleaned,cleaned_tokenized
0,One of the other reviewers has mentioned that ...,positive,<s> one of the other reviewers has mentioned t...,"[<s>, one, of, the, other, reviewers, has, men..."
1,A wonderful little production. <br /><br />The...,positive,<s> a wonderful little production. <br /><br /...,"[<s>, a, wonderful, little, production, ., <, ..."
2,I thought this was a wonderful way to spend ti...,positive,<s> i thought this was a wonderful way to spen...,"[<s>, i, thought, this, was, a, wonderful, way..."
3,Basically there's a family where a little boy ...,negative,<s> basically there's a family where a little ...,"[<s>, basically, there, 's, a, family, where, ..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"<s> petter mattei's ""love in the time of money...","[<s>, petter, mattei, 's, ``, love, in, the, t..."
...,...,...,...,...
49995,I thought this movie did a down right good job...,positive,<s> i thought this movie did a down right good...,"[<s>, i, thought, this, movie, did, a, down, r..."
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,"<s> bad plot, bad dialogue, bad acting, idioti...","[<s>, bad, plot, ,, bad, dialogue, ,, bad, act..."
49997,I am a Catholic taught in parochial elementary...,negative,<s> i am a catholic taught in parochial elemen...,"[<s>, i, am, a, catholic, taught, in, parochia..."
49998,I'm going to have to disagree with the previou...,negative,<s> i'm going to have to disagree with the pre...,"[<s>, i, 'm, going, to, have, to, disagree, wi..."
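
The placeholder swap in `custom_tokenize` can be illustrated without NLTK. The sketch below uses a small regex tokenizer, `toy_tokenize`, as a stand-in for `word_tokenize`; it is an assumption for illustration only and does not reproduce NLTK's exact behavior, but it splits punctuation from words in a similar way, which is enough to show why the marker needs protecting.

```python
import re


def toy_tokenize(text):
    # Stand-in for word_tokenize: runs of word characters,
    # or single punctuation characters.
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_keeping_markers(text):
    # Protect the marker with a plain-word placeholder, tokenize, then restore it.
    text = text.replace("<s>", " SPECIAL_START_TOKEN ")
    return ["<s>" if tok == "SPECIAL_START_TOKEN" else tok for tok in toy_tokenize(text)]


naive = toy_tokenize("<s> read the book!")
kept = tokenize_keeping_markers("<s> read the book!")

print(naive)  # the marker is broken into three tokens
print(kept)   # the marker survives as one token
```

The same effect explains the `<, br, ...` fragments in the table above: the `<br />` tags are not protected, so they are split into punctuation and word tokens.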


# Counting

In [6]:
reviews_as_table["cleaned_tokenized_counts"] = reviews_as_table["cleaned_tokenized"].apply(len)
reviews_as_table

Unnamed: 0,review,sentiment,cleaned,cleaned_tokenized,cleaned_tokenized_counts
0,One of the other reviewers has mentioned that ...,positive,<s> one of the other reviewers has mentioned t...,"[<s>, one, of, the, other, reviewers, has, men...",384
1,A wonderful little production. <br /><br />The...,positive,<s> a wonderful little production. <br /><br /...,"[<s>, a, wonderful, little, production, ., <, ...",205
2,I thought this was a wonderful way to spend ti...,positive,<s> i thought this was a wonderful way to spen...,"[<s>, i, thought, this, was, a, wonderful, way...",208
3,Basically there's a family where a little boy ...,negative,<s> basically there's a family where a little ...,"[<s>, basically, there, 's, a, family, where, ...",179
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"<s> petter mattei's ""love in the time of money...","[<s>, petter, mattei, 's, ``, love, in, the, t...",288
...,...,...,...,...,...
49995,I thought this movie did a down right good job...,positive,<s> i thought this movie did a down right good...,"[<s>, i, thought, this, movie, did, a, down, r...",243
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,"<s> bad plot, bad dialogue, bad acting, idioti...","[<s>, bad, plot, ,, bad, dialogue, ,, bad, act...",141
49997,I am a Catholic taught in parochial elementary...,negative,<s> i am a catholic taught in parochial elemen...,"[<s>, i, am, a, catholic, taught, in, parochia...",274
49998,I'm going to have to disagree with the previou...,negative,<s> i'm going to have to disagree with the pre...,"[<s>, i, 'm, going, to, have, to, disagree, wi...",241


### obtaining `max_length`

In [7]:
reviews_as_table["cleaned_tokenized_counts"].max()

2944

### longest entries

In [9]:
longest_to_shortest = reviews_as_table.sort_values(by='cleaned_tokenized_counts', ascending=False)
longest_to_shortest.head(10)

Unnamed: 0,review,sentiment,cleaned,cleaned_tokenized,cleaned_tokenized_counts
40521,There's a sign on The Lost Highway that says:<...,positive,<s> there's a sign on the lost highway that sa...,"[<s>, there, 's, a, sign, on, the, lost, highw...",2944
31481,Match 1: Tag Team Table Match Bubba Ray and Sp...,positive,<s> match 1: tag team table match bubba ray an...,"[<s>, match, 1, :, tag, team, table, match, bu...",2819
31240,"(Some spoilers included:)<br /><br />Although,...",positive,<s> (some spoilers included:)<br /><br />altho...,"[<s>, (, some, spoilers, included, :, ), <, br...",2670
31436,"Back in the mid/late 80s, an OAV anime by titl...",positive,"<s> back in the mid/late 80s, an oav anime by ...","[<s>, back, in, the, mid/late, 80s, ,, an, oav...",2575
5708,**Attention Spoilers**<br /><br />First of all...,positive,<s> **attention spoilers**<br /><br />first of...,"[<s>, *, *, attention, spoilers, *, *, <, br, ...",2321
3654,*!!- SPOILERS - !!*<br /><br />Before I begin ...,positive,<s> *!!- spoilers - !!*<br /><br />before i be...,"[<s>, *, !, !, -, spoilers, -, !, !, *, <, br,...",2134
42946,By now you've probably heard a bit about the n...,positive,<s> by now you've probably heard a bit about t...,"[<s>, by, now, you, 've, probably, heard, a, b...",2046
12647,Titanic directed by James Cameron presents a f...,positive,<s> titanic directed by james cameron presents...,"[<s>, titanic, directed, by, james, cameron, p...",2028
3024,If anyone ever assembles a compendium on moder...,positive,<s> if anyone ever assembles a compendium on m...,"[<s>, if, anyone, ever, assembles, a, compendi...",2020
43821,Some have praised _Atlantis:_The_Lost_Empire_ ...,negative,<s> some have praised _atlantis:_the_lost_empi...,"[<s>, some, have, praised, _atlantis, :, _the_...",1947


### shortest entries

In [10]:
longest_to_shortest.tail(10)

Unnamed: 0,review,sentiment,cleaned,cleaned_tokenized,cleaned_tokenized_counts
48448,Adrian Pasdar is excellent is this film. He ma...,positive,<s> adrian pasdar is excellent is this film. h...,"[<s>, adrian, pasdar, is, excellent, is, this,...",15
31761,Ming The Merciless does a little Bardwork and ...,negative,<s> ming the merciless does a little bardwork ...,"[<s>, ming, the, merciless, does, a, little, b...",14
31072,"What a script, what a story, what a mess!",negative,"<s> what a script, what a story, what a mess!","[<s>, what, a, script, ,, what, a, story, ,, w...",13
13109,"More suspenseful, more subtle, much, much more...",negative,"<s> more suspenseful, more subtle, much, much ...","[<s>, more, suspenseful, ,, more, subtle, ,, m...",13
11926,I wouldn't rent this one even on dollar rental...,negative,<s> i wouldn't rent this one even on dollar re...,"[<s>, i, would, n't, rent, this, one, even, on...",13
19874,This movie is terrible but it has some good ef...,negative,<s> this movie is terrible but it has some goo...,"[<s>, this, movie, is, terrible, but, it, has,...",12
18400,Brilliant and moving performances by Tom Court...,positive,<s> brilliant and moving performances by tom c...,"[<s>, brilliant, and, moving, performances, by...",12
40817,I hope this group of film-makers never re-unites.,negative,<s> i hope this group of film-makers never re-...,"[<s>, i, hope, this, group, of, film-makers, n...",10
28920,Primary plot!Primary direction!Poor interpreta...,negative,<s> primary plot!primary direction!poor interp...,"[<s>, primary, plot, !, primary, direction, !,...",10
27521,"Read the book, forget the movie!",negative,"<s> read the book, forget the movie!","[<s>, read, the, book, ,, forget, the, movie, !]",9
