# Contents

1. Corpus
2. Load Corpus using Pandas
3. Access data using Pandas dataframe
4. Text Pre-processing
5. Intro to SpaCy and language processing tasks

# Corpus

A corpus is a collection of text. Here it refers specifically to the data to be analyzed.

In [2]:
!ls data/

labeledTrainData.tsv  testData.tsv  unlabeledTrainData.tsv


In [3]:
!wc -l data/labeledTrainData.tsv
!wc -l data/testData.tsv
!wc -l data/unlabeledTrainData.tsv

25001 data/labeledTrainData.tsv
25001 data/testData.tsv
50001 data/unlabeledTrainData.tsv


In [4]:
!head -n 2 data/labeledTrainData.tsv

id	sentiment	review
"5814_8"	1	"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feat

In [5]:
!head -n 2 data/testData.tsv

id	review
"12311_10"	"Naturally in a film who's main themes are of mortality, nostalgia, and loss of innocence it is perhaps not surprising that it is rated more highly by older viewers than younger ones. However there is a craftsmanship and completeness to the film which anyone can enjoy. The pace is steady and constant, the characters full and engaging, the relationships and interactions natural showing that you do not need floods of tears to show emotion, screams to show fear, shouting to show dispute or violence to show anger. Naturally Joyce's short story lends the film a ready made structure as perfect as a polished diamond, but the small changes Huston makes such as the inclusion of the poem fit in neatly. It is truly a masterpiece of tact, subtlety and overwhelming beauty."


In [6]:
!head -n 2 data/unlabeledTrainData.tsv

id	review
"9999_0"	"Watching Time Chasers, it obvious that it was made by a bunch of friends. Maybe they were sitting around one day in film school and said, \"Hey, let's pool our money together and make a really bad movie!\" Or something like that. What ever they said, they still ended up making a really bad movie--dull story, bad script, lame acting, poor cinematography, bottom of the barrel stock music, etc. All corners were cut, except the one that would have prevented this film's release. Life's like that."


# What is labeled data/Unlabeled data/test data?

Labels - In ML/NLP terminology, the dependent (target) variable that needs to be predicted.<br>
Labeled data/train data - Data for which the outcome labels are known; it is used to train the model.<br>
Unlabeled data - Data for which the outcome to be predicted is not available.<br>
Test data - Data on which the hypothesis (the trained model) can be tested.
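
As a rough illustration of these terms, the sketch below builds two tiny, made-up frames that mirror the column layout of the files above. The rows and the helper name `is_labeled` are assumptions for illustration only.

```python
import pandas as pd

# Tiny made-up frames that mirror the column layout of the TSV files above.
labeled = pd.DataFrame({
    "id": ["a_1", "b_2"],
    "sentiment": [1, 0],               # the label (target) column
    "review": ["Loved it", "Dull story"],
})
unlabeled = pd.DataFrame({
    "id": ["c_3"],
    "review": ["Not sure what to think"],
})

def is_labeled(df, label_col="sentiment"):
    """Hypothetical helper: a frame counts as labeled if it has the label column."""
    return label_col in df.columns

# Inputs and labels are usually separated before training a model.
X = labeled["review"]
y = labeled["sentiment"]

print(is_labeled(labeled), is_labeled(unlabeled))
```

Only the labeled frame can supply both `X` and `y`; the unlabeled frame has inputs but nothing to learn the target from.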

# Let's load the data into memory using the pandas library

Pandas is a Python package for data analysis on structured or labeled data.

In [7]:
import pandas as pd

In [8]:
data_dir = 'data/'

In [55]:
df_train = pd.read_csv(data_dir + 'labeledTrainData.tsv', sep="\t")

In [10]:
df_train.shape

(25000, 3)

In [11]:
df_train.columns

Index(['id', 'sentiment', 'review'], dtype='object')
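
The effect of the quoted fields on loading can be seen on a small in-memory sample. This is a sketch under assumptions: the two rows below are abridged stand-ins, not the real file contents, and `quoting=csv.QUOTE_NONE` is shown only for comparison with the default used above.

```python
import csv
import io

import pandas as pd

# Two made-up rows in the same tab-separated, quoted layout as the files above.
tsv = (
    'id\tsentiment\treview\n'
    '"5814_8"\t1\t"With all this stuff going down"\n'
    '"2381_9"\t1\t"A very entertaining film"\n'
)

# Default quoting: the surrounding double quotes are stripped from each field.
df = pd.read_csv(io.StringIO(tsv), sep="\t")

# QUOTE_NONE: quote characters are kept as ordinary characters.
raw = pd.read_csv(io.StringIO(tsv), sep="\t", quoting=csv.QUOTE_NONE)

print(df.shape)
print(df.loc[0, "id"], raw.loc[0, "id"])
```

With the default settings the header row is consumed as column names, which is why the 25001-line file yields 25000 rows.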

# How to access data using data frames

Selection with [], .loc and .iloc

In [56]:
# Directly access using column name. Gives Pandas series (list like)

reviews=df_train['review']
(type(reviews), reviews.shape)

(pandas.core.series.Series, (25000,))

In [57]:
#.loc can select subsets of rows and/or columns. Only selects data by the LABEL of the rows and columns.
#df_train.loc[row,column]
reviews = df_train.loc[:,'review']
(type(reviews), reviews.shape)

(pandas.core.series.Series, (25000,))

In [58]:
reviews = df_train.loc[:,['review', 'id']]
print((type(reviews), reviews.shape))

reviews = df_train.loc[0:5,['review', 'id']]
print((type(reviews), reviews.shape))

(<class 'pandas.core.frame.DataFrame'>, (25000, 2))
(<class 'pandas.core.frame.DataFrame'>, (6, 2))


In [59]:
reviews

                                              review      id
0  With all this stuff going down at the moment w...  5814_8
1  \The Classic War of the Worlds\" by Timothy Hi...  2381_9
2  The film starts with a manager (Nicholas Bell)...  7759_3
3  It must be assumed that those who praised this...  3630_4
4  Superbly trashy and wondrously unpretentious 8...  9495_8
5  I dont know why people think this is such a ba...  8196_8


In [17]:
# .iloc is very similar to .loc but only uses integer locations to make its selections.
r = df_train.iloc[0:3]
print(r.shape)

r = df_train.iloc[0:3, 1]
print(r.shape)

(3, 3)
(3,)
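
One difference between the two selectors is easy to miss in the outputs above: `.loc[0:5]` returned 6 rows while `.iloc[0:3]` returned 3. A minimal sketch on a made-up frame shows why.

```python
import pandas as pd

df = pd.DataFrame(
    {"sentiment": [1, 1, 0, 0, 1, 1]},
    index=[0, 1, 2, 3, 4, 5],
)

by_label = df.loc[0:3]      # labels 0, 1, 2 and 3: the end label is included
by_position = df.iloc[0:3]  # positions 0, 1 and 2: the end position is excluded

print(by_label.shape, by_position.shape)
```

Label slices are inclusive because labels need not be integers, so there is no general notion of "one before the end label".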


In [18]:
df_train['review'][0]

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

Let's see how a pandas DataFrame can be used to perform some data analysis.

In [60]:
#how many positive & negative reviews are present
df_train['sentiment'].value_counts()

1    12500
0    12500
Name: sentiment, dtype: int64
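
Two small variations on this kind of summary, sketched on a made-up frame: class proportions instead of counts, and the mean review length per class. The column name `n_chars` is an assumption introduced for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "sentiment": [1, 0, 1, 0],
    "review": ["Great", "Bad film", "Loved every minute", "Dull"],
})

# Share of each class instead of raw counts.
class_share = df["sentiment"].value_counts(normalize=True)

# Review length in characters, then the mean length per sentiment class.
df["n_chars"] = df["review"].str.len()
mean_len = df.groupby("sentiment")["n_chars"].mean()

print(class_share)
print(mean_len)
```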

In [20]:
# Segregate the reviews based on sentiment
pos_reviews=df_train['review'][df_train['sentiment']==1]
print(pos_reviews.shape)

(12500,)


In [21]:
pos_reviews[1]

'\\The Classic War of the Worlds\\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells\' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur \\"critics\\" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the \\"critics\\". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells\' classic novel, and we found it to be very entertaining. This made it easy to overlook what the \\"critics\\" perceive to be its shortcomings."'
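
Note that `pos_reviews[1]` looks a row up by its original label, not by its position in the filtered series. It works here because row 1 of the full data happens to be a positive review. The sketch below, on made-up data, shows the difference.

```python
import pandas as pd

df = pd.DataFrame({
    "sentiment": [0, 1, 0, 1],
    "review": ["bad", "good", "awful", "great"],
})

pos = df["review"][df["sentiment"] == 1]

print(pos.index.tolist())   # the original row labels are kept
print(pos.iloc[0])          # first positive review, selected by position

# Renumber from 0 when position-like labels are wanted.
pos_renumbered = pos.reset_index(drop=True)
print(pos_renumbered[0])
```

In this toy frame label 0 belongs to a negative review, so it is absent from `pos.index`; `.iloc` or `reset_index(drop=True)` avoids depending on which labels survived the filter.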

# Text Preprocessing

Pre-processing prepares the text before it is passed to any analysis or computation step. Some standard text pre-processing steps (depending on the task at hand) are:

1. Converting all letters to lower or upper case (case normalization)
2. Removing special characters such as punctuation and accent marks
3. Removing extra white space and stop words
4. Removing markup such as HTML tags
5. Removing numbers (depending on the task)
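
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used later in this notebook: the function name `normalize`, the tiny stop-word set, and the crude tag pattern are all assumptions made for the example.

```python
import re
import string
import unicodedata

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "was", "of", "a", "is"}

def normalize(text, stop_words=STOP_WORDS):
    """Apply the pre-processing steps listed above and return a list of words."""
    # Strip accent marks: decompose characters, then drop the non-ASCII parts.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.lower()                                   # 1. case conversion
    text = re.sub(r"<[^>]+>", " ", text)                  # 4. crude tag removal
    text = re.sub(r"\d+", " ", text)                      # 5. numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. punctuation
    # 3. split() collapses runs of white space; then drop stop words.
    return [word for word in text.split() if word not in stop_words]

print(normalize("The film was <br />GREAT, 10 out of 10!"))
print(normalize("Café"))
```

The order matters: tags are removed before punctuation, otherwise `<br />` would lose its angle brackets and leave a stray `br` behind.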

In [23]:
# Why do case conversion? So that the same word written in different cases is treated as one word

lc=pos_reviews[1].lower()
print(lc)

\the classic war of the worlds\" by timothy hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate h. g. wells' classic book. mr. hines succeeds in doing so. i, and those who watched his film with me, appreciated the fact that it was not the standard, predictable hollywood fare that comes out every year, e.g. the spielberg version with tom cruise that had only the slightest resemblance to the book. obviously, everyone looks for different things in a movie. those who envision themselves as amateur \"critics\" look only to criticize everything they can. others rate a movie on more important bases,like being entertained, which is why most people never agree with the \"critics\". we enjoyed the effort mr. hines put into being faithful to h.g. wells' classic novel, and we found it to be very entertaining. this made it easy to overlook what the \"critics\" perceive to be its shortcomings."


In [24]:
#Removing Special Characters using Regular Expressions

import re

lc=re.sub('\"','', lc)
print(lc)

\the classic war of the worlds\ by timothy hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate h. g. wells' classic book. mr. hines succeeds in doing so. i, and those who watched his film with me, appreciated the fact that it was not the standard, predictable hollywood fare that comes out every year, e.g. the spielberg version with tom cruise that had only the slightest resemblance to the book. obviously, everyone looks for different things in a movie. those who envision themselves as amateur \critics\ look only to criticize everything they can. others rate a movie on more important bases,like being entertained, which is why most people never agree with the \critics\. we enjoyed the effort mr. hines put into being faithful to h.g. wells' classic novel, and we found it to be very entertaining. this made it easy to overlook what the \critics\ perceive to be its shortcomings.


In [28]:
lc=re.sub('\\\\','', lc)
print(lc)

the classic war of the worlds by timothy hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate h. g. wells' classic book. mr. hines succeeds in doing so. i, and those who watched his film with me, appreciated the fact that it was not the standard, predictable hollywood fare that comes out every year, e.g. the spielberg version with tom cruise that had only the slightest resemblance to the book. obviously, everyone looks for different things in a movie. those who envision themselves as amateur critics look only to criticize everything they can. others rate a movie on more important bases,like being entertained, which is why most people never agree with the critics. we enjoyed the effort mr. hines put into being faithful to h.g. wells' classic novel, and we found it to be very entertaining. this made it easy to overlook what the critics perceive to be its shortcomings.
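
The four backslashes in the pattern above can look puzzling. A short sketch on a made-up fragment shows that a raw string, or `str.replace`, expresses the same removal with less escaping.

```python
import re

text = 'amateur \\"critics\\" look'   # contains literal backslashes and quotes
print(text)

# In a regular expression a literal backslash is written as two backslashes.
# A normal Python string needs each of those doubled again; a raw string does not.
a = re.sub('\\\\', '', text)
b = re.sub(r'\\', '', text)

# For fixed characters, str.replace avoids the regex escaping entirely.
c = text.replace('\\', '')

print(a == b == c)
```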


In [29]:
# The reviews still contain HTML tags such as <br />

pos_reviews[0]

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

# Removing HTML Markup from text

The BeautifulSoup package can be used to strip HTML markup from text.

In [30]:
from bs4 import BeautifulSoup

In [31]:
BeautifulSoup(pos_reviews[0], "lxml").get_text()

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 mi

# Combine all preprocessing steps

In [61]:
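# Combined pre-processing: strip HTML markup, then remove double quotes and backslashes
def preproc(line):
    # Parse the review as HTML and keep only the visible text
    text = BeautifulSoup(line, "lxml").get_text()
    # Drop the literal double quotes and backslashes left over in the review text
    return re.sub("[\"\\\\]","", text)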

In [62]:
sanitizied_rev=df_train['review'].apply(preproc)

In [47]:
df_train['review']=sanitizied_rev
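
In the `get_text()` output shown earlier, the text on either side of `<br /><br />` is joined without a space (`m'kay.Visually`). BeautifulSoup's `get_text` accepts a `separator` argument that can help with this. The sketch below shows the same idea using only the standard library's `html.parser`, which is a swapped-in technique rather than what this notebook uses; the names `TextExtractor` and `strip_html` are made up for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, turning each <br> into a space."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <br />.
        if tag == "br":
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse the runs of white space left behind by consecutive tags.
    return " ".join("".join(parser.parts).split())

print(strip_html("drugs are bad m'kay.<br /><br />Visually impressive"))
```

Keeping a space at tag boundaries matters later on, because sentence segmentation and tokenization rely on white space and punctuation to find boundaries.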

# Introduction to SpaCy

1. Fast & reliable language parsing
2. Written in Cython
3. Sentence Segmentation
4. Tokenization
5. Part of Speech Tagging
6. Named Entity Recognition
7. Dependency Parsing

# Install Spacy

pip install -U spacy

DOWNLOAD MODELS

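After installation, download the model for the language you want to process with SpaCy. The `en` shortcut used here works with spaCy 2.x; spaCy 3.x removed shortcut names, so a full package name such as `en_core_web_sm` is needed there.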

python -m spacy download en

In [None]:
#! pip install -U spacy
#! python -m spacy download en

In [48]:
#import spacy

import spacy
nlp = spacy.load('en')

In [63]:

sanitizied_rev[1]

"The Classic War of the Worlds by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur critics look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the critics. We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells' classic novel, and we found it to be very entertaining. This made it easy to overlook what the critics perceive to be its shortcomings."

# Sentence Segmentation

Breaking a piece of text (a document) into logical sentences.

In [71]:
doc=nlp(sanitizied_rev[1])

In [72]:
for sent in doc.sents:
    print(sent.text)

The Classic War of the Worlds by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book.
Mr. Hines succeeds in doing so.
I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book.
Obviously, everyone looks for different things in a movie.
Those who envision themselves as amateur critics look only to criticize everything they can.
Others rate a movie on more important bases,like being entertained, which is why most people never agree with the critics.
We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells' classic novel, and we found it to be very entertaining.
This made it easy to overlook what the critics perceive to be its shortcomings.
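
To see why a trained segmenter is useful, compare the output above with a naive rule. The sketch below uses a plain regular expression (an assumption made for contrast, not how spaCy works) on two sentences taken from the review.

```python
import re

text = ("Mr. Hines succeeds in doing so. "
        "Obviously, everyone looks for different things in a movie.")

# Naive rule: a sentence ends at '.', '!' or '?' followed by white space.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in naive_sentences:
    print(sentence)
```

The naive rule breaks after the abbreviation `Mr.` and returns three pieces for two sentences, whereas the spaCy output above keeps `Mr. Hines succeeds in doing so.` together.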


# Tokenization

Tokenization is the process of breaking text into smaller units called tokens, typically words and punctuation marks. (It is different from chunking, which groups tokens into phrases.)

In [74]:
# Build a list of token lists, one per sentence
sentList=[]
for sent in doc.sents:
    wordList=[]
    # Iterating over a sentence yields its tokens
    for word in sent:
        wordList.append(word.text)
    sentList.append(wordList)

In [75]:
for wList in sentList:
    print(wList)

['The', 'Classic', 'War', 'of', 'the', 'Worlds', 'by', 'Timothy', 'Hines', 'is', 'a', 'very', 'entertaining', 'film', 'that', 'obviously', 'goes', 'to', 'great', 'effort', 'and', 'lengths', 'to', 'faithfully', 'recreate', 'H.', 'G.', 'Wells', "'", 'classic', 'book', '.']
['Mr.', 'Hines', 'succeeds', 'in', 'doing', 'so', '.']
['I', ',', 'and', 'those', 'who', 'watched', 'his', 'film', 'with', 'me', ',', 'appreciated', 'the', 'fact', 'that', 'it', 'was', 'not', 'the', 'standard', ',', 'predictable', 'Hollywood', 'fare', 'that', 'comes', 'out', 'every', 'year', ',', 'e.g.', 'the', 'Spielberg', 'version', 'with', 'Tom', 'Cruise', 'that', 'had', 'only', 'the', 'slightest', 'resemblance', 'to', 'the', 'book', '.']
['Obviously', ',', 'everyone', 'looks', 'for', 'different', 'things', 'in', 'a', 'movie', '.']
['Those', 'who', 'envision', 'themselves', 'as', 'amateur', 'critics', 'look', 'only', 'to', 'criticize', 'everything', 'they', 'can', '.']
['Others', 'rate', 'a', 'movie', 'on', 'more', 
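
For comparison, here is a sketch of two simpler tokenizers built with the standard library, applied to the second sentence of the review. Neither is what spaCy does internally; they are assumptions chosen to show what a language-aware tokenizer adds.

```python
import re

sentence = "Mr. Hines succeeds in doing so."

# White-space splitting leaves punctuation glued to words ('so.').
whitespace_tokens = sentence.split()
print(whitespace_tokens)

# A simple regex splits every punctuation mark off, including the
# period that belongs to the abbreviation 'Mr.'.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(regex_tokens)
```

The spaCy output above keeps `Mr.` as a single token while still separating the final period, which neither simple rule manages.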

# POS Tagging

The process of assigning a part-of-speech tag (for example noun, verb, adjective) to each word in a text.<br>
It can be used for a variety of tasks such as relation extraction and topic detection.

In [78]:
# pos_ is the coarse-grained part of speech; tag_ is the fine-grained tag
firstSent=next(doc.sents)
for token in firstSent:
    print(token.text, token.pos_, token.tag_)

The DET DT
Classic PROPN NNP
War PROPN NNP
of ADP IN
the DET DT
Worlds PROPN NNPS
by ADP IN
Timothy PROPN NNP
Hines PROPN NNP
is VERB VBZ
a DET DT
very ADV RB
entertaining ADJ JJ
film NOUN NN
that ADJ WDT
obviously ADV RB
goes VERB VBZ
to ADP IN
great ADJ JJ
effort NOUN NN
and CCONJ CC
lengths NOUN NNS
to PART TO
faithfully ADV RB
recreate VERB VB
H. PROPN NNP
G. PROPN NNP
Wells PROPN NNP
' PART POS
classic ADJ JJ
book NOUN NN
. PUNCT .
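
Once tokens carry tags, simple summaries become possible. The sketch below copies the (token, coarse tag) pairs from the output above into a plain list, so it runs without spaCy, and then counts tags and picks out the common nouns. The variable names are made up for the example.

```python
from collections import Counter

# (token, coarse POS) pairs copied from the output above.
tagged = [
    ("The", "DET"), ("Classic", "PROPN"), ("War", "PROPN"), ("of", "ADP"),
    ("the", "DET"), ("Worlds", "PROPN"), ("by", "ADP"), ("Timothy", "PROPN"),
    ("Hines", "PROPN"), ("is", "VERB"), ("a", "DET"), ("very", "ADV"),
    ("entertaining", "ADJ"), ("film", "NOUN"), ("that", "ADJ"),
    ("obviously", "ADV"), ("goes", "VERB"), ("to", "ADP"), ("great", "ADJ"),
    ("effort", "NOUN"), ("and", "CCONJ"), ("lengths", "NOUN"), ("to", "PART"),
    ("faithfully", "ADV"), ("recreate", "VERB"), ("H.", "PROPN"),
    ("G.", "PROPN"), ("Wells", "PROPN"), ("'", "PART"), ("classic", "ADJ"),
    ("book", "NOUN"), (".", "PUNCT"),
]

pos_counts = Counter(pos for _, pos in tagged)
nouns = [word for word, pos in tagged if pos == "NOUN"]

print(pos_counts.most_common(3))
print(nouns)
```

Filtering on nouns and proper nouns like this is one simple starting point for the topic-detection use mentioned above.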


# Named Entity Recognition

The dual task of identifying key terms (entities) present in the text and assigning a label to each of them.

In [81]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

The Classic War 0 15 EVENT
Worlds 23 29 PERSON
Timothy Hines 33 46 PERSON
H. G. Wells 146 157 PERSON
Hines 177 182 PERSON
Hollywood 311 320 GPE
every year 341 351 DATE
Spielberg 362 371 PERSON
Tom Cruise 385 395 PERSON
Hines 750 755 PERSON
H.G. Wells' 783 794 PERSON
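
The character offsets and labels make it easy to post-process entities. The sketch below copies the first three rows of the output above into a plain list (so it runs without spaCy) and groups them by label; the opening words of the review are repeated only so the offsets can be checked.

```python
from collections import defaultdict

text = "The Classic War of the Worlds by Timothy Hines"

# (text, start_char, end_char, label) rows copied from the output above.
entities = [
    ("The Classic War", 0, 15, "EVENT"),
    ("Worlds", 23, 29, "PERSON"),
    ("Timothy Hines", 33, 46, "PERSON"),
]

by_label = defaultdict(list)
for ent_text, start, end, label in entities:
    # The character offsets slice the entity back out of the original text.
    assert text[start:end] == ent_text
    by_label[label].append(ent_text)

print(dict(by_label))
```

The grouping also makes the model's mistakes visible: the film title is split into an `EVENT` and a `PERSON`, so entity output usually deserves a sanity check before it is used downstream.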
