# Data Programming in Python | BAIS:6040
# Text Processing with NLTK and TextBlob

Instructor: Jeff Hendricks 

Topics to be covered:
- Text processing with NLTK and text mining with TextBlob (+ exercises)
- Adding new columns (+ exercises)
- Popular keywords
- Searching for records containing a certain string

References: 
- NLTK official website (http://www.nltk.org/)
- Pandas official website (http://pandas.pydata.org/) 
- Python Data Science Handbook by Jake VanderPlas (http://shop.oreilly.com/product/0636920034919.do)
- Python for Data Analysis by Wes McKinney (https://www.oreilly.com/library/view/python-for-data/9781491957653/)
- TextBlob official website (https://textblob.readthedocs.io/en/dev/)

## Prerequisites

In [1]:
# ! pip install --user nltk 
# ! pip install textblob==0.17.1
# ! python -m nltk.downloader stopwords
# ! python -m textblob.download_corpora

## Importing modules

In [2]:
from collections import Counter              # word counting
import nltk                                  # text processing
import pandas as pd                          # handling Pandas dataframes
from textblob import TextBlob                # sentiment analysis, language detection, and translation

pd.set_option('display.max_colwidth', 200)   # set the maximum column width to 200

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('brown')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/jerjacob/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jerjacob/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/jerjacob/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package brown to /home/jerjacob/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jerjacob/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Text processing with NLTK

In [3]:
text = "I am pleased to announce the launch of http://greatoutdoors.com . This new platform will allow all who love hiking to find enjoyable trails. This has been a priority of mine and I’m pleased to share that it is up and running! #KeepAmericaHiking"
text

'I am pleased to announce the launch of http://greatoutdoors.com . This new platform will allow all who love hiking to find enjoyable trails. This has been a priority of mine and I’m pleased to share that it is up and running! #KeepAmericaHiking'

In [4]:
sents = nltk.sent_tokenize(text)       # Tokenize text into sentences.

In [5]:
sents

['I am pleased to announce the launch of http://greatoutdoors.com .',
 'This new platform will allow all who love hiking to find enjoyable trails.',
 'This has been a priority of mine and I’m pleased to share that it is up and running!',
 '#KeepAmericaHiking']

In [6]:
tokens = nltk.word_tokenize(text)        # Tokenize text into words.

In [7]:
tokens

['I',
 'am',
 'pleased',
 'to',
 'announce',
 'the',
 'launch',
 'of',
 'http',
 ':',
 '//greatoutdoors.com',
 '.',
 'This',
 'new',
 'platform',
 'will',
 'allow',
 'all',
 'who',
 'love',
 'hiking',
 'to',
 'find',
 'enjoyable',
 'trails',
 '.',
 'This',
 'has',
 'been',
 'a',
 'priority',
 'of',
 'mine',
 'and',
 'I',
 '’',
 'm',
 'pleased',
 'to',
 'share',
 'that',
 'it',
 'is',
 'up',
 'and',
 'running',
 '!',
 '#',
 'KeepAmericaHiking']

In [8]:
tagged_tokens = nltk.pos_tag(tokens)          # Tag each token with its part-of-speech tag.

Note that the argument of the <b>pos_tag</b> function is a list of tokens, not raw text.

In [9]:
tagged_tokens

[('I', 'PRP'),
 ('am', 'VBP'),
 ('pleased', 'JJ'),
 ('to', 'TO'),
 ('announce', 'VB'),
 ('the', 'DT'),
 ('launch', 'NN'),
 ('of', 'IN'),
 ('http', 'NN'),
 (':', ':'),
 ('//greatoutdoors.com', 'NN'),
 ('.', '.'),
 ('This', 'DT'),
 ('new', 'JJ'),
 ('platform', 'NN'),
 ('will', 'MD'),
 ('allow', 'VB'),
 ('all', 'DT'),
 ('who', 'WP'),
 ('love', 'VBP'),
 ('hiking', 'VBG'),
 ('to', 'TO'),
 ('find', 'VB'),
 ('enjoyable', 'JJ'),
 ('trails', 'NNS'),
 ('.', '.'),
 ('This', 'DT'),
 ('has', 'VBZ'),
 ('been', 'VBN'),
 ('a', 'DT'),
 ('priority', 'NN'),
 ('of', 'IN'),
 ('mine', 'NN'),
 ('and', 'CC'),
 ('I', 'PRP'),
 ('’', 'VBP'),
 ('m', 'NNS'),
 ('pleased', 'VBD'),
 ('to', 'TO'),
 ('share', 'NN'),
 ('that', 'IN'),
 ('it', 'PRP'),
 ('is', 'VBZ'),
 ('up', 'JJ'),
 ('and', 'CC'),
 ('running', 'VBG'),
 ('!', '.'),
 ('#', '#'),
 ('KeepAmericaHiking', 'VBG')]

Penn Part of Speech Tags: https://cs.nyu.edu/grishman/jet/guide/PennPOS.html
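
Each tagged token is a `(word, tag)` tuple, so it can be unpacked and summarized with ordinary Python. The sketch below counts tags in a small hard-coded sample copied from the output above, so it runs without NLTK or its data files. The variable names are assumptions made for this illustration.

```python
from collections import Counter

# A hard-coded sample of (word, tag) pairs taken from the tagged output above.
sample_tagged = [
    ("I", "PRP"), ("am", "VBP"), ("pleased", "JJ"), ("to", "TO"),
    ("announce", "VB"), ("the", "DT"), ("launch", "NN"), ("of", "IN"),
]

# Unpack each tuple and count how often every tag occurs.
tag_counts = Counter(tag for word, tag in sample_tagged)

# Tags sharing a prefix belong to the same broad class (e.g. "VB" and "VBP" are both verbs).
verb_like = sum(count for tag, count in tag_counts.items() if tag.startswith("VB"))

print(tag_counts["NN"])
print(verb_like)
```

Grouping by prefix is convenient because the Penn tag set encodes subtypes as suffixes (for example, "VBD" for past tense and "VBG" for gerunds).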

In [10]:
stemmer = nltk.stem.SnowballStemmer("english")

In [11]:
stemmer.stem("car")

'car'

In [12]:
stemmer.stem("cars")

'car'

In [13]:
stemmer.stem("say")

'say'

In [14]:
stemmer.stem("saying")

'say'

In [15]:
stemmer.stem("said")

'said'

In [16]:
lemmatizer = nltk.stem.WordNetLemmatizer()

In [17]:
lemmatizer.lemmatize("cars", pos="n")

'car'

In [18]:
lemmatizer.lemmatize("saying", pos="v")

'say'

In [19]:
lemmatizer.lemmatize("said", pos="v")

'say'

In [20]:
lemmatizer.lemmatize("am", pos="v")

'be'

In [21]:
lemmatizer.lemmatize("are", pos="v")

'be'

In [22]:
lemmatizer.lemmatize("is", pos="v")

'be'

## Text mining with TextBlob

In [23]:
text = "It's just awesome!"

In [24]:
tb = TextBlob(text)

In [25]:
tb.sentiment                # polarity between -1 (negative) and 1 (positive) 
                            # subjectivity between 0 (objective) and 1 (subjective)

Sentiment(polarity=1.0, subjectivity=1.0)

In [26]:
tb.sentiment.polarity

1.0

In [27]:
tb.sentiment.subjectivity

1.0

In [28]:
text = "It's Friday."

In [29]:
tb = TextBlob(text)
tb.sentiment

Sentiment(polarity=0.0, subjectivity=0.0)
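
Polarity scores are often easier to read as coarse labels. The following is a minimal sketch, assuming a hypothetical helper `label_polarity` and an arbitrary threshold of 0.1; neither is part of TextBlob.

```python
def label_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to a coarse label.

    The default threshold of 0.1 is an arbitrary choice for illustration.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


print(label_polarity(1.0))   # the polarity reported for "It's just awesome!"
print(label_polarity(0.0))   # the polarity reported for "It's Friday."
```

A suitable threshold depends on the data, so it is worth inspecting a few labeled examples before settling on one.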

In [30]:
text = "Machine Learning is a branch of Artificial Intelligence in computer science."

In [31]:
tb = TextBlob(text)

In [32]:
tb.noun_phrases

WordList(['machine learning', 'artificial intelligence', 'computer science'])

In [33]:
tb.detect_language()

'en'

In [34]:
text_es = tb.translate(to="es")
text_es

TextBlob("El aprendizaje automático es una rama de la inteligencia artificial en informática.")

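TextBlob's language detection and translation are powered by the Google Translate API, so these calls require an internet connection. Both methods were deprecated in TextBlob 0.16.0 and removed in 0.18.0, so they work only with older releases such as the 0.17.1 version installed above.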
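
Because translation depends on a remote service, a call can fail when the network is unavailable. One possible way to keep a notebook running is to wrap the call, as in the sketch below. `translate_or_none` and `OfflineBlob` are hypothetical names invented for this example; `OfflineBlob` only imitates a failing `translate` call so that the sketch runs without TextBlob or a network connection.

```python
def translate_or_none(blob, to):
    """Return blob.translate(to=to), or None if the call fails.

    Catching Exception is deliberately broad for this sketch; real code
    would catch only the specific errors it expects.
    """
    try:
        return blob.translate(to=to)
    except Exception as exc:
        print(f"Translation to {to!r} failed: {exc}")
        return None


class OfflineBlob:
    """Hypothetical stand-in for a TextBlob whose translate call cannot reach the network."""

    def translate(self, to):
        raise ConnectionError("no network")


result = translate_or_none(OfflineBlob(), to="es")
print(result)
```

With a real blob, the call would look like `translate_or_none(tb, to="es")`.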

In [35]:
text_de = tb.translate(to="de")
text_de

TextBlob("Machine Learning ist ein Teilgebiet der Künstlichen Intelligenz in der Informatik.")

In [36]:
tb = TextBlob(text_es.string)
tb.detect_language()

'es'

In [37]:
tb = TextBlob(text_de.string)
tb.detect_language()

'de'

## Exercises for text processing and mining (9 questions)

In [38]:
text = "When our Country had no debt and built everything from Highways to the Military with CASH, we had a big system of Tariffs. Now we allow other countries to steal our wealth, treasure, and jobs - But no more! The USA is doing great, with unlimited upside into the future!"
text

'When our Country had no debt and built everything from Highways to the Military with CASH, we had a big system of Tariffs. Now we allow other countries to steal our wealth, treasure, and jobs - But no more! The USA is doing great, with unlimited upside into the future!'

1\. Using NLTK, get a list <i>sents</i> of sentences in <i>text</i>.

In [39]:
# Your answer here


2\. Using NLTK, get a list <i>tokens</i> of words in <i>text</i>.

In [40]:
# Your answer here


3\. Using NLTK, get a list <i>tagged_tokens</i> of tagged tokens in <i>text</i>.

In [41]:
# Your answer here


4\. Get a new list of the nouns in <i>tagged_tokens</i>, which are tagged with 'NN'.

In [42]:
# Your answer here


5\. Get a new list of the verbs in <i>tagged_tokens</i>, which are tagged with 'VB', 'VBD', 'VBG', 'VBN', 'VBP', or 'VBZ'.

In [43]:
# Your answer here


6\. Using TextBlob, get a new list of the polarities of the sentences in <i>text</i>.

In [44]:
# Your answer here


7\. Using TextBlob, get a new list of the subjectivities of the sentences in <i>text</i>.

In [45]:
# Your answer here


8\. Using TextBlob, get a new list of the Chinese sentences translated from the sentences in <i>text</i>.

In [46]:
# Your answer here


9\. Using TextBlob, get a new list of the detected languages of the sentences in <i>text</i>.

In [47]:
# Your answer here


## Loading a CSV file into a Pandas dataframe

In [48]:
df = pd.read_csv("../../Data/imdb_data.csv", sep=",", header=0)
df.columns=['text']

In [49]:
df.head()

Unnamed: 0,text
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me..."
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p..."
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i..."
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone..."


## Deriving a new column from an existing column

The raw dataset itself does not always come with all the information you may need. In many cases, you will have to derive new columns from a set of existing columns. 

### Adding a new column for text length (text ➔ text_length) 

In [50]:
df.text[:5]

0    One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...
1    A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...
2    I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...
3    Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...
4    Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be

In [51]:
df["text_length"] = df.text.apply(lambda x: len(x))

pandas.Series.apply: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html

When using the <b>apply</b> method, pay attention to the existing column to be used (<i>text</i>) and the new column to be added (<i>text_length</i>).
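
To make the pattern concrete, here is a small sketch on a hypothetical three-row dataframe (`toy` is not part of the course data). It derives the same length column twice: once with `apply`, and once with the vectorized `.str.len()` accessor, which gives the same result for string lengths.

```python
import pandas as pd

# A tiny hypothetical dataframe standing in for the movie reviews.
toy = pd.DataFrame({"text": ["Great film", "Bad", "It was fine."]})

# apply calls the function once for each value of the existing column.
toy["text_length"] = toy["text"].apply(len)

# The string accessor computes the same lengths without an explicit function.
toy["text_length_str"] = toy["text"].str.len()

print(toy)
```

`apply` is the more general tool because it accepts any function, which is why the following sections use it with NLTK and TextBlob calls.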

In [52]:
df.columns

Index(['text', 'text_length'], dtype='object')

In [53]:
df[["text", "text_length"]].head()

Unnamed: 0,text,text_length
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...",1761
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...",998
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...",926
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...,748
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone...",1317


### Adding a new column for tokens (text ➔ tokens)

In [54]:
df["tokens"] = df.text.apply(lambda x: nltk.word_tokenize(x))

In [55]:
df.columns

Index(['text', 'text_length', 'tokens'], dtype='object')

In [56]:
df[["text", "tokens"]].head()

Unnamed: 0,text,tokens
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...","[One, of, the, other, reviewers, has, mentioned, that, after, watching, just, 1, Oz, episode, you, 'll, be, hooked, ., They, are, right, ,, as, this, is, exactly, what, happened, with, me., <, br,..."
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...","[A, wonderful, little, production, ., <, br, /, >, <, br, /, >, The, filming, technique, is, very, unassuming-, very, old-time-BBC, fashion, and, gives, a, comforting, ,, and, sometimes, discomfor..."
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...","[I, thought, this, was, a, wonderful, way, to, spend, time, on, a, too, hot, summer, weekend, ,, sitting, in, the, air, conditioned, theater, and, watching, a, light-hearted, comedy, ., The, plot,..."
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...,"[Basically, there, 's, a, family, where, a, little, boy, (, Jake, ), thinks, there, 's, a, zombie, in, his, closet, &, his, parents, are, fighting, all, the, time., <, br, /, >, <, br, /, >, This,..."
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone...","[Petter, Mattei, 's, ``, Love, in, the, Time, of, Money, '', is, a, visually, stunning, film, to, watch, ., Mr., Mattei, offers, us, a, vivid, portrait, about, human, relations, ., This, is, a, mo..."


### Adding a new column for tagged tokens (tokens ➔ tagged_tokens)

In [57]:
df["tagged_tokens"] = df.tokens.apply(lambda x: nltk.pos_tag(x))

In [58]:
df.columns

Index(['text', 'text_length', 'tokens', 'tagged_tokens'], dtype='object')

In [59]:
df[["text", "tokens", "tagged_tokens"]].head()

Unnamed: 0,text,tokens,tagged_tokens
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...","[One, of, the, other, reviewers, has, mentioned, that, after, watching, just, 1, Oz, episode, you, 'll, be, hooked, ., They, are, right, ,, as, this, is, exactly, what, happened, with, me., <, br,...","[(One, CD), (of, IN), (the, DT), (other, JJ), (reviewers, NNS), (has, VBZ), (mentioned, VBN), (that, IN), (after, IN), (watching, VBG), (just, RB), (1, CD), (Oz, NNP), (episode, NN), (you, PRP), (..."
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...","[A, wonderful, little, production, ., <, br, /, >, <, br, /, >, The, filming, technique, is, very, unassuming-, very, old-time-BBC, fashion, and, gives, a, comforting, ,, and, sometimes, discomfor...","[(A, DT), (wonderful, JJ), (little, JJ), (production, NN), (., .), (<, JJ), (br, NN), (/, NNP), (>, NNP), (<, NNP), (br, NN), (/, NNP), (>, VBZ), (The, DT), (filming, NN), (technique, NN), (is, VB..."
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...","[I, thought, this, was, a, wonderful, way, to, spend, time, on, a, too, hot, summer, weekend, ,, sitting, in, the, air, conditioned, theater, and, watching, a, light-hearted, comedy, ., The, plot,...","[(I, PRP), (thought, VBD), (this, DT), (was, VBD), (a, DT), (wonderful, JJ), (way, NN), (to, TO), (spend, VB), (time, NN), (on, IN), (a, DT), (too, RB), (hot, JJ), (summer, NN), (weekend, NN), (,,..."
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...,"[Basically, there, 's, a, family, where, a, little, boy, (, Jake, ), thinks, there, 's, a, zombie, in, his, closet, &, his, parents, are, fighting, all, the, time., <, br, /, >, <, br, /, >, This,...","[(Basically, NNP), (there, EX), ('s, VBZ), (a, DT), (family, NN), (where, WRB), (a, DT), (little, JJ), (boy, NN), ((, (), (Jake, NNP), (), )), (thinks, VBZ), (there, EX), ('s, VBZ), (a, DT), (zomb..."
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone...","[Petter, Mattei, 's, ``, Love, in, the, Time, of, Money, '', is, a, visually, stunning, film, to, watch, ., Mr., Mattei, offers, us, a, vivid, portrait, about, human, relations, ., This, is, a, mo...","[(Petter, NNP), (Mattei, NNP), ('s, POS), (``, ``), (Love, NNP), (in, IN), (the, DT), (Time, NNP), (of, IN), (Money, NNP), ('', ''), (is, VBZ), (a, DT), (visually, RB), (stunning, JJ), (film, NN),..."


### Adding new columns for sentiment (text ➔ polarity, subjectivity)

- Polarity between -1 (negative) and 1 (positive)
- Subjectivity between 0 (objective) and 1 (subjective)

In [60]:
df["polarity"] = df.text.apply(lambda x: TextBlob(x).sentiment.polarity)
df["subjectivity"] = df.text.apply(lambda x: TextBlob(x).sentiment.subjectivity)

In [61]:
df.columns

Index(['text', 'text_length', 'tokens', 'tagged_tokens', 'polarity',
       'subjectivity'],
      dtype='object')

In [62]:
df[["text", "polarity", "subjectivity"]].head()

Unnamed: 0,text,polarity,subjectivity
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...",0.023433,0.490369
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...",0.109722,0.559343
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...",0.354008,0.65873
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...,-0.057813,0.454167
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone...",0.217952,0.452916


In [63]:
df[(df.polarity > 0.4) | (df.polarity < -0.4)][["text", "polarity"]].head()

Unnamed: 0,text,polarity
9,"If you like original gut wrenching laughter you will like this movie. If you are young or old then you will love this movie, hell even my mom liked it.<br /><br />Great Camp!!!",0.415
38,"Okay, so this series kind of takes the route of 'here we go again!' Week in, week out David Morse's character helps out his ride who is in a bit of a pickle - but what's wrong with that!? David Mo...",0.416142
84,"We brought this film as a joke for a friend, and could of been our worst joke to play. The film is barely watchable, and the acting is dire. The worst child actor ever used and Hasslehoff giving a...",-0.56875
160,"MYSTERY MEN has got to be THE stupidest film I've ever seen, but what a film! I thought it was fabulous, excellent and impressive. It was funny, well-done and nice to see ridiculous Super Heroes f...",0.495833
162,I can't believe it that was the worst movie i have ever seen in my life. i laughed a couple of times. ( probably because of how stupid it was ) If someone paid me to see that movie again i wouldn'...,-0.542857


## Exercises for adding new columns (2 questions)

Let's continue to use the dataframe <i>df</i> from the movie review dataset.

In [64]:
df.head()

Unnamed: 0,text,text_length,tokens,tagged_tokens,polarity,subjectivity
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...",1761,"[One, of, the, other, reviewers, has, mentioned, that, after, watching, just, 1, Oz, episode, you, 'll, be, hooked, ., They, are, right, ,, as, this, is, exactly, what, happened, with, me., <, br,...","[(One, CD), (of, IN), (the, DT), (other, JJ), (reviewers, NNS), (has, VBZ), (mentioned, VBN), (that, IN), (after, IN), (watching, VBG), (just, RB), (1, CD), (Oz, NNP), (episode, NN), (you, PRP), (...",0.023433,0.490369
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...",998,"[A, wonderful, little, production, ., <, br, /, >, <, br, /, >, The, filming, technique, is, very, unassuming-, very, old-time-BBC, fashion, and, gives, a, comforting, ,, and, sometimes, discomfor...","[(A, DT), (wonderful, JJ), (little, JJ), (production, NN), (., .), (<, JJ), (br, NN), (/, NNP), (>, NNP), (<, NNP), (br, NN), (/, NNP), (>, VBZ), (The, DT), (filming, NN), (technique, NN), (is, VB...",0.109722,0.559343
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...",926,"[I, thought, this, was, a, wonderful, way, to, spend, time, on, a, too, hot, summer, weekend, ,, sitting, in, the, air, conditioned, theater, and, watching, a, light-hearted, comedy, ., The, plot,...","[(I, PRP), (thought, VBD), (this, DT), (was, VBD), (a, DT), (wonderful, JJ), (way, NN), (to, TO), (spend, VB), (time, NN), (on, IN), (a, DT), (too, RB), (hot, JJ), (summer, NN), (weekend, NN), (,,...",0.354008,0.65873
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...,748,"[Basically, there, 's, a, family, where, a, little, boy, (, Jake, ), thinks, there, 's, a, zombie, in, his, closet, &, his, parents, are, fighting, all, the, time., <, br, /, >, <, br, /, >, This,...","[(Basically, NNP), (there, EX), ('s, VBZ), (a, DT), (family, NN), (where, WRB), (a, DT), (little, JJ), (boy, NN), ((, (), (Jake, NNP), (), )), (thinks, VBZ), (there, EX), ('s, VBZ), (a, DT), (zomb...",-0.057813,0.454167
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone...",1317,"[Petter, Mattei, 's, ``, Love, in, the, Time, of, Money, '', is, a, visually, stunning, film, to, watch, ., Mr., Mattei, offers, us, a, vivid, portrait, about, human, relations, ., This, is, a, mo...","[(Petter, NNP), (Mattei, NNP), ('s, POS), (``, ``), (Love, NNP), (in, IN), (the, DT), (Time, NNP), (of, IN), (Money, NNP), ('', ''), (is, VBZ), (a, DT), (visually, RB), (stunning, JJ), (film, NN),...",0.217952,0.452916


1\. Add a new column named <i>text_short</i> that contains only the first ten characters of each value in the column <i>text</i>.

In [65]:
# Your answer here


2\. Add a new column named <i>lang</i> that contains the detected language of each value in the column <i>text</i>.

In [66]:
# Your answer here


## Popular keywords

In [67]:
df.tagged_tokens[:5]

0    [(One, CD), (of, IN), (the, DT), (other, JJ), (reviewers, NNS), (has, VBZ), (mentioned, VBN), (that, IN), (after, IN), (watching, VBG), (just, RB), (1, CD), (Oz, NNP), (episode, NN), (you, PRP), (...
1    [(A, DT), (wonderful, JJ), (little, JJ), (production, NN), (., .), (<, JJ), (br, NN), (/, NNP), (>, NNP), (<, NNP), (br, NN), (/, NNP), (>, VBZ), (The, DT), (filming, NN), (technique, NN), (is, VB...
2    [(I, PRP), (thought, VBD), (this, DT), (was, VBD), (a, DT), (wonderful, JJ), (way, NN), (to, TO), (spend, VB), (time, NN), (on, IN), (a, DT), (too, RB), (hot, JJ), (summer, NN), (weekend, NN), (,,...
3    [(Basically, NNP), (there, EX), ('s, VBZ), (a, DT), (family, NN), (where, WRB), (a, DT), (little, JJ), (boy, NN), ((, (), (Jake, NNP), (), )), (thinks, VBZ), (there, EX), ('s, VBZ), (a, DT), (zomb...
4    [(Petter, NNP), (Mattei, NNP), ('s, POS), (``, ``), (Love, NNP), (in, IN), (the, DT), (Time, NNP), (of, IN), (Money, NNP), ('', ''), (is, VBZ), (a, DT), (visually, RB), (stunn

In [68]:
###################################################################################
# The 'counter' object will have all the word count information. 
###################################################################################

counter = Counter()

for pos in df.tagged_tokens:
    
    ###################################################################
    # Use a set to remove duplicate words within a review.
    # Each word is then counted at most once per review,
    # even if it appears multiple times in that review.
    ###################################################################
    word_set = set()
    
    for t in pos:
        word = t[0].lower()
        word_set.add(word)
            
    counter.update(word_set)

collections.Counter: https://docs.python.org/3/library/collections.html#collections.Counter
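
The effect of the per-review set can be seen on a toy example. The sketch below uses three hypothetical, already-tokenized reviews and compares counting every occurrence with counting each word at most once per review, which is what the cell above does.

```python
from collections import Counter

# Three hypothetical, already-tokenized reviews.
reviews = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]

term_freq = Counter()   # counts every occurrence of a word
doc_freq = Counter()    # counts each word at most once per review

for tokens in reviews:
    term_freq.update(tokens)
    doc_freq.update(set(tokens))

print(term_freq["good"])   # total occurrences across all reviews
print(doc_freq["good"])    # number of reviews that contain the word
```

Counting reviews rather than occurrences keeps a single repetitive review from dominating the ranking.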

In [69]:
counter.most_common(30)           # Show the top-n most popular words in 'counter'.

[('the', 994),
 ('.', 990),
 ('a', 967),
 ('and', 966),
 (',', 959),
 ('of', 956),
 ('to', 938),
 ('this', 912),
 ('is', 901),
 ('in', 882),
 ('it', 877),
 ('i', 811),
 ('that', 794),
 ('for', 716),
 ('but', 709),
 ('with', 706),
 ("'s", 702),
 ('was', 650),
 ('on', 638),
 ('as', 625),
 ('movie', 614),
 ('not', 601),
 ("n't", 597),
 ('/', 588),
 ('>', 587),
 ('<', 587),
 ('br', 587),
 ('you', 580),
 ('have', 565),
 ('be', 551)]

collections.Counter.most_common: https://docs.python.org/3/library/collections.html#collections.Counter.most_common

In [70]:
global_stopwords = nltk.corpus.stopwords.words("english") # the English stopword list provided by NLTK
global_stopwords[:30]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself']

Stopwords are very common words (such as "the", "is", and "of") that carry little meaning for keyword analysis, so they are usually removed before counting.

In [71]:
counter = Counter()

for pos in df.tagged_tokens:
    word_set = set()
    
    for t in pos:
        word = t[0].lower()
        
        ###################################
        # Check if the word is a stopword.
        ###################################
        if word in global_stopwords:
            continue
        else:
            word_set.add(word)
            
    counter.update(word_set)
    
counter.most_common(30)

[('.', 990),
 (',', 959),
 ("'s", 702),
 ('movie', 614),
 ("n't", 597),
 ('/', 588),
 ('>', 587),
 ('<', 587),
 ('br', 587),
 ('film', 544),
 ('one', 539),
 (')', 531),
 ('(', 528),
 ('like', 458),
 ('``', 451),
 ("''", 450),
 ('!', 374),
 ('would', 350),
 ('even', 348),
 ('good', 343),
 ('see', 339),
 ('?', 315),
 ('time', 310),
 ('really', 309),
 ('get', 308),
 ('story', 302),
 ('much', 285),
 ('well', 282),
 ('made', 256),
 ('could', 255)]

In [72]:
local_stopwords = [":", ";", ",", ".", "...", "!", "-", "#", "(", ")", "@", "&", "%", "'", "’", "“", "”", 
                   "amp", "https", "rt", "t…"]

counter = Counter()
for pos in df.tagged_tokens:
    word_set = set()
    
    for t in pos:
        word = t[0].lower()
        
        ##########################################################
        # Check if the word is either a global or a local stopword.
        ##########################################################
        if word in (global_stopwords + local_stopwords):
            continue
        else:
            word_set.add(word)
            
    counter.update(word_set)
    
counter.most_common(30)

[("'s", 702),
 ('movie', 614),
 ("n't", 597),
 ('/', 588),
 ('>', 587),
 ('<', 587),
 ('br', 587),
 ('film', 544),
 ('one', 539),
 ('like', 458),
 ('``', 451),
 ("''", 450),
 ('would', 350),
 ('even', 348),
 ('good', 343),
 ('see', 339),
 ('?', 315),
 ('time', 310),
 ('really', 309),
 ('get', 308),
 ('story', 302),
 ('much', 285),
 ('well', 282),
 ('made', 256),
 ('could', 255),
 ('way', 254),
 ('also', 252),
 ('first', 251),
 ('great', 246),
 ('people', 241)]
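
The list above still contains residue from the HTML line breaks in the reviews ('/', '>', '<', 'br') and quotation-mark tokens. One possible way to remove them is to strip the tags and keep only alphabetic tokens before counting. The sketch below is a rough illustration, assuming a hypothetical helper `clean_tokens` built on simple regular expressions; it is a stand-in for `nltk.word_tokenize`, not a replacement for it.

```python
import re


def clean_tokens(text):
    """Lowercase the text, drop HTML tags, and keep only runs of letters.

    The regular expressions are simple assumptions made for this sketch.
    """
    no_tags = re.sub(r"<[^>]+>", " ", text)          # remove tags such as <br />
    return re.findall(r"[a-z]+", no_tags.lower())    # keep alphabetic runs only


sample = "A wonderful little production. <br /><br />The filming technique is very unassuming."
print(clean_tokens(sample))
```

This simple pattern also splits contractions differently from NLTK (for example, "isn't" becomes "isn" and "t"), so the resulting counts are not directly comparable with the ones above.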

### Popular adjectives 

In [73]:
counter = Counter()
for pos in df.tagged_tokens:
    word_set = set()
    
    for t in pos:
        word = t[0].lower()
        tag = t[1]
        
        if word in (global_stopwords + local_stopwords):
            continue
            
        ##########################################################
        # Keep only adjectives: their Penn tags (JJ, JJR, JJS) start with "JJ".
        ##########################################################
        if tag.startswith("JJ"):
            word_set.add(word)
            
    counter.update(word_set)
    
counter.most_common(30)

[('good', 338),
 ('<', 241),
 ('great', 237),
 ('bad', 231),
 ('many', 210),
 ('first', 195),
 ('much', 192),
 ('little', 181),
 ('best', 166),
 ('br', 145),
 ('real', 138),
 ('old', 120),
 ('least', 118),
 ('better', 118),
 ('funny', 111),
 ('new', 102),
 ('young', 101),
 ('whole', 98),
 ('big', 92),
 ('worst', 88),
 ('last', 87),
 ('interesting', 80),
 ('original', 80),
 ('main', 77),
 ('hard', 77),
 ('sure', 76),
 ('different', 74),
 ('true', 74),
 ('excellent', 72),
 ('special', 71)]

### Popular verbs

In [74]:
counter = Counter()
for pos in df.tagged_tokens:
    word_set = set()
    
    for t in pos:
        word = t[0].lower()
        tag = t[1]
        
        if word in (global_stopwords + local_stopwords):
            continue
            
        ##########################################################
        # Keep only verbs: their Penn tags (VB, VBD, VBG, ...) start with "VB".
        ##########################################################
        if tag.startswith("VB"):
            word_set.add(word)
            
    counter.update(word_set)
    
counter.most_common(30)

[("'s", 464),
 ('see', 335),
 ('get', 298),
 ('>', 292),
 ('made', 255),
 ('make', 229),
 ('think', 221),
 ('watch', 206),
 ('seen', 201),
 ('know', 194),
 ('say', 161),
 ('go', 161),
 ('watching', 156),
 ("'m", 150),
 ("'ve", 143),
 ('find', 142),
 ('makes', 141),
 ('want', 130),
 ('going', 126),
 ('<', 125),
 ('gets', 120),
 ('got', 117),
 ('take', 115),
 ("'re", 109),
 ('done', 103),
 ('give', 102),
 ('seems', 101),
 ('saw', 99),
 ('thought', 96),
 ('come', 96)]
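
The adjective and verb cells differ only in the tag prefix, so the loop can be wrapped in a function. The following is a minimal sketch, assuming a hypothetical helper named `popular_words` and two made-up tagged reviews; it also converts the stopwords to a set once instead of concatenating the two lists for every token.

```python
from collections import Counter


def popular_words(tagged_docs, tag_prefix, stopwords, n=30):
    """Rank non-stopword tokens whose tag starts with tag_prefix.

    Each word is counted at most once per document.
    """
    stopwords = set(stopwords)   # set membership is faster than list membership
    counter = Counter()
    for tagged_tokens in tagged_docs:
        words = {
            word.lower()
            for word, tag in tagged_tokens
            if tag.startswith(tag_prefix) and word.lower() not in stopwords
        }
        counter.update(words)
    return counter.most_common(n)


# Two hypothetical tagged reviews.
docs = [
    [("A", "DT"), ("good", "JJ"), ("movie", "NN"), ("is", "VBZ"), ("good", "JJ")],
    [("I", "PRP"), ("saw", "VBD"), ("a", "DT"), ("good", "JJ"), ("old", "JJ"), ("film", "NN")],
]

print(popular_words(docs, "JJ", stopwords=["a", "i", "is"]))
print(popular_words(docs, "VB", stopwords=["a", "i", "is"]))
```

With the dataframe from this notebook, a call such as `popular_words(df.tagged_tokens, "JJ", global_stopwords + local_stopwords)` should reproduce the adjective ranking.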

## Searching for records containing a certain string

In [75]:
target_string = "movie"

In [76]:
mask = df.text.str.contains(target_string, case=False, regex=False)
mask

0      False
1      False
2      False
3       True
4       True
       ...  
995     True
996     True
997     True
998     True
999    False
Name: text, Length: 1000, dtype: bool

pandas.Series.str.contains: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
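
With `regex=False`, the target is matched as a plain substring, so "movie" also matches inside longer words. The sketch below uses a hypothetical three-item Series to contrast that with a whole-word match written as a regular expression.

```python
import pandas as pd

# A small hypothetical Series standing in for the review texts.
reviews = pd.Series(["A great MOVIE.", "Moviegoers loved it.", "A fine film."])

# Plain substring match, ignoring case.
substring_mask = reviews.str.contains("movie", case=False, regex=False)

# Whole-word match: \b marks a word boundary on each side.
whole_word_mask = reviews.str.contains(r"\bmovie\b", case=False, regex=True)

print(substring_mask.tolist())
print(whole_word_mask.tolist())
```

Which behavior is appropriate depends on the question being asked; the substring version is simpler and avoids having to escape special characters in the target string.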

In [77]:
df[mask][["text"]].head()

Unnamed: 0,text
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone..."
5,"Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring. It just never gets old, despite my having seen it some 15 o..."
6,I sure would like to see a resurrection of a up dated Seahunt series with the tech they have today it would bring back the kid excitement in me.I grew up on black and white TV and Seahunt with Gun...
9,"If you like original gut wrenching laughter you will like this movie. If you are young or old then you will love this movie, hell even my mom liked it.<br /><br />Great Camp!!!"


In [78]:
df[~mask][["text"]].head()

Unnamed: 0,text
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me..."
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p..."
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i..."
7,"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny ..."
8,Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful i...
