## Applied Text and Natural Language Analytics, Fall 2020

#### Assignment 3

Submitted by:

Harsh Dhanuka, hd2457

In [1]:
import re
import nltk
from nltk import word_tokenize, sent_tokenize, ngrams, pos_tag, RegexpParser
from collections import Counter
import urllib.request as url
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from bs4 import BeautifulSoup
from bs4.element import Comment

# 1. Use urllib or requests package to read this CNBC article through its URL link -
https://www.cnbc.com/2019/01/17/netflix-price-hike-helps-disney-upcoming-streaming-service-analyst.html 

In [2]:
link = "https://www.cnbc.com/2019/01/17/netflix-price-hike-helps-disney-upcoming-streaming-service-analyst.html"
html = url.urlopen(link)
raw = html.read()
#print(raw)

# 2. Use BeautifulSoup or another HTML parsing package to extract text from the article.

I've observed a few problems using this library: 
- For one, it picked up unwanted text, such as JavaScript source. 
- Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.

I will make some changes to the code to handle the above-mentioned challenges.

I will try two approaches, compare their results, and keep the one that works better.
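
The entity problem described above can be illustrated without any HTML parser. The following is a minimal sketch, assuming a hypothetical sample string (it is not taken from the article), that uses the standard library's `html.unescape` to turn references such as `&#39;` into the characters they stand for:

```python
import html

# Hypothetical fragment, for illustration only (not from the CNBC page)
sample = "Disney&#39;s streaming service &amp; Netflix"

# html.unescape converts numeric and named character references to text
decoded = html.unescape(sample)
print(decoded)
```

Whether this extra step is needed depends on how the text was extracted, so it is best treated as a fallback check rather than a required stage.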

### Approach 1
This approach is not ideal. I will comment it out. 

In [3]:
#soup_file = BeautifulSoup(raw, features = "html.parser")

# Remove all JavaScript and style elements from the parsed HTML:
#for script in soup_file(["script", "style"]):
#    script.extract()    # rip it out

# Get text content and assign to another variable
#text = soup_file.get_text()

# Break the text into different lines, to remove all leading and trailing white space on each line
#lines = (line.strip() for line in text.splitlines())

# Break multi-headlines into a line each
#chunks = (phrase.strip() for line in lines for phrase in line.split("  "))

# Drop all blank lines
#text = '\n'.join(chunk for chunk in chunks if chunk)

# Print the first 1000 text characters
#print(text[:1000])

### Approach 2

I will use the output from this approach for my analysis.

#### Note: 
I will keep the text inside the 'head' and 'title' elements. I consider them important for the NLP process, since the headline carries significant information.

In [4]:
def tag_visible(element):
    if element.parent.name in ['style', 'script', 
                               #'head', 'title', 
                               'meta', '[document]'
                              ]:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # Collect every text node, then keep only those inside visible tags
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

text = text_from_html(raw)
print(text[:1000])

Netflix price hike helps Disney upcoming streaming service: Analyst Skip Navigation SIGN IN Pro Watchlist Make It Select USA INTL Markets Pre-Markets U.S. Markets Currencies Cryptocurrency Futures & Commodities Bonds Funds & ETFs Watchlist Business Economy Finance Health & Science Media Real Estate Energy Transportation Industrials Retail Wealth Life Small Business Investing Invest In You Personal Finance Financial Advisors Trading Nation Options Action ETF Street Buffett Archive Earnings Trader Talk Tech Cybersecurity Enterprise Internet Media Mobile Social Media Venture Capital Tech Guide Politics White House Policy Defense Congress 2020 Elections CNBC TV Live TV Live Audio Latest Video Top Video CEO Interviews Business Day Shows The News with Shepard Smith Entertainment Shows CNBC World Digital Originals Full Episodes Menu SEARCH QUOTES Markets Pre-Markets U.S. Markets Currencies Cryptocurrency Futures & Commodities Bonds Funds & ETFs Watchlist Business Economy Finance Health & Scie

# 3. Use re (regular expression) package to:

## a. Find all matches of $ amounts in the article

In [5]:
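# Lazy match: capture everything after '$' up to the next space,
# so trailing punctuation (e.g. the '.' in '325.') is included
dollar_amounts = re.findall(r'\$(.+?) ', text)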

print("There are", len(dollar_amounts), "dollar amount values in the article. They are as follows:")
dollar_amounts

There are 2 dollar amount values in the article. They are as follows:


['325.', '351']
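
The first match above keeps a trailing period because the pattern captures everything up to the next space. A stricter pattern that matches only the digits (with optional thousands separators and decimals) avoids this. The sketch below runs on a hypothetical sentence, not on the article text, and the pattern is one possible choice rather than the only correct one:

```python
import re

# Hypothetical sentence, for illustration only (not from the article)
sample = "Shares rose to $325. The target is $351, up from $1,250.50 last year."

# '$', digits, optional ',ddd' groups, optional decimal part
pattern = r'\$\d+(?:,\d{3})*(?:\.\d+)?'

amounts = re.findall(pattern, sample)
print(amounts)
```

Because the decimal part requires at least one digit after the dot, a sentence-ending period is left out of the match.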

## b. Substitute all numbers with # character and print the output

In [6]:
text = re.sub('[0-9]', '#', text)

In [7]:
print(text[:1500])

Netflix price hike helps Disney upcoming streaming service: Analyst Skip Navigation SIGN IN Pro Watchlist Make It Select USA INTL Markets Pre-Markets U.S. Markets Currencies Cryptocurrency Futures & Commodities Bonds Funds & ETFs Watchlist Business Economy Finance Health & Science Media Real Estate Energy Transportation Industrials Retail Wealth Life Small Business Investing Invest In You Personal Finance Financial Advisors Trading Nation Options Action ETF Street Buffett Archive Earnings Trader Talk Tech Cybersecurity Enterprise Internet Media Mobile Social Media Venture Capital Tech Guide Politics White House Policy Defense Congress #### Elections CNBC TV Live TV Live Audio Latest Video Top Video CEO Interviews Business Day Shows The News with Shepard Smith Entertainment Shows CNBC World Digital Originals Full Episodes Menu SEARCH QUOTES Markets Pre-Markets U.S. Markets Currencies Cryptocurrency Futures & Commodities Bonds Funds & ETFs Watchlist Business Economy Finance Health & Scie
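
The substitution above replaces every digit individually, which is why a year appears as `####`. If one `#` per number is preferred instead, the pattern can match whole numbers. A minimal sketch on a hypothetical string (an assumption, not article text) comparing the two:

```python
import re

# Hypothetical string, for illustration only
sample = "Congress 2020 Elections, up 6.5% to $351"

# One '#' per digit (the behaviour used in this notebook)
per_digit = re.sub(r'[0-9]', '#', sample)

# One '#' per whole number, including an optional decimal part
per_number = re.sub(r'\d+(?:\.\d+)?', '#', sample)

print(per_digit)
print(per_number)
```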

## c. Count (using regular expressions) "Netflix" and "Disney" mentions

In [8]:
count_text = re.split(r'\W', text)

In [9]:
print(" ")
print("Count of the word 'Netflix' is as follows:")
print(" ")
print(count_text.count('Netflix'))

print(" ")
print("Count of the word 'Disney' is as follows:")
print(" ")
print(count_text.count('Disney'))

print(" ")
print("Count of the word 'Netflix' and 'Disney' together is as follows:")
print(" ")
print(count_text.count('Disney') + count_text.count('Netflix'))

 
Count of the word 'Netflix' is as follows:
 
14
 
Count of the word 'Disney' is as follows:
 
8
 
Count of the word 'Netflix' and 'Disney' together is as follows:
 
22
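
The counts above come from splitting on non-word characters and then counting list items. An alternative that stays entirely within regular expressions is to match the word directly with `\b` word boundaries. The sketch below uses a hypothetical helper name, `count_mentions`, and a made-up sentence, so its numbers are not the article's:

```python
import re

def count_mentions(word, text):
    """Count whole-word occurrences of `word` in `text` (case-sensitive)."""
    # re.escape guards against special characters in the search word
    return len(re.findall(r'\b' + re.escape(word) + r'\b', text))

# Hypothetical sentence, for illustration only
sample = "Netflix raised prices. Disney may benefit, and Netflix's rivals, like Disney+, are watching."

netflix_count = count_mentions("Netflix", sample)
disney_count = count_mentions("Disney", sample)
print(netflix_count, disney_count)
```

Word boundaries make possessives such as "Netflix's" count as a mention while still excluding longer words that merely contain the name.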


# 4. Use NLTK and/or spaCy tokenization features to:

## a. Tokenize sentences and words

### Sentences

In [10]:
# tokenize sentences
sentences = sent_tokenize(text)

for i in range(0,3):
    print()
    print(sentences[i])


Netflix price hike helps Disney upcoming streaming service: Analyst Skip Navigation SIGN IN Pro Watchlist Make It Select USA INTL Markets Pre-Markets U.S. Markets Currencies Cryptocurrency Futures & Commodities Bonds Funds & ETFs Watchlist Business Economy Finance Health & Science Media Real Estate Energy Transportation Industrials Retail Wealth Life Small Business Investing Invest In You Personal Finance Financial Advisors Trading Nation Options Action ETF Street Buffett Archive Earnings Trader Talk Tech Cybersecurity Enterprise Internet Media Mobile Social Media Venture Capital Tech Guide Politics White House Policy Defense Congress #### Elections CNBC TV Live TV Live Audio Latest Video Top Video CEO Interviews Business Day Shows The News with Shepard Smith Entertainment Shows CNBC World Digital Originals Full Episodes Menu SEARCH QUOTES Markets Pre-Markets U.S. Markets Currencies Cryptocurrency Futures & Commodities Bonds Funds & ETFs Watchlist Business Economy Finance Health & Sci

### Words

In [11]:
tokens = word_tokenize(text)

for i in range(0,10):
    print(tokens[i])

Netflix
price
hike
helps
Disney
upcoming
streaming
service
:
Analyst


## b. Remove all English stop words

In [12]:
print(stopwords.words('english')[:50])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be']


In [13]:
stop_words = set(stopwords.words('english')) 

filtered_sentence = [word for word in tokens if word not in stop_words] 

print(filtered_sentence[:50])

['Netflix', 'price', 'hike', 'helps', 'Disney', 'upcoming', 'streaming', 'service', ':', 'Analyst', 'Skip', 'Navigation', 'SIGN', 'IN', 'Pro', 'Watchlist', 'Make', 'It', 'Select', 'USA', 'INTL', 'Markets', 'Pre-Markets', 'U.S.', 'Markets', 'Currencies', 'Cryptocurrency', 'Futures', '&', 'Commodities', 'Bonds', 'Funds', '&', 'ETFs', 'Watchlist', 'Business', 'Economy', 'Finance', 'Health', '&', 'Science', 'Media', 'Real', 'Estate', 'Energy', 'Transportation', 'Industrials', 'Retail', 'Wealth', 'Life']
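
The output above still contains 'IN' and 'It' because the NLTK stop-word list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. It assumes a small hand-written stop-word set as a stand-in for `stopwords.words('english')`, and it also drops punctuation-only tokens, which is an optional extra step:

```python
# Small hand-written stand-in for stopwords.words('english') (assumption)
stop_words = {"in", "it", "the", "with", "you"}

# Hypothetical tokens, modelled on the navigation text above
tokens = ["SIGN", "IN", "Pro", "Watchlist", "Make", "It", "Select", ":", "&", "The", "News"]

filtered = [
    t for t in tokens
    if t.lower() not in stop_words          # compare in lowercase
    and any(c.isalnum() for c in t)         # skip punctuation-only tokens
]
print(filtered)
```

Lowercasing before the comparison can also remove acronyms that happen to match a stop word (for example "IT"), so this is a trade-off rather than a strict improvement.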


## c. List and count n-grams for any given input n

In [14]:
def n_grams(n):
    # Count every n-gram over the stop-word-filtered tokens
    grams = Counter(ngrams(filtered_sentence, n))
    items = grams.items()
    # Keep only the first 10 (n-gram, count) pairs to avoid long output
    first_ten = list(items)[:10]
    # len(grams) is the number of distinct n-grams
    return ("Count: ",len(grams)), ("--------------------"), ("The first 10 values/pairs are as follows:  "), first_ten

In [15]:
# Verify

n_grams(1)

(('Count: ', 434),
 '--------------------',
 'The first 10 values/pairs are as follows:  ',
 [(('Netflix',), 14),
  (('price',), 7),
  (('hike',), 3),
  (('helps',), 2),
  (('Disney',), 7),
  (('upcoming',), 2),
  (('streaming',), 6),
  (('service',), 3),
  ((':',), 8),
  (('Analyst',), 1)])

In [16]:
# Verify

n_grams(2)

(('Count: ', 684),
 '--------------------',
 'The first 10 values/pairs are as follows:  ',
 [(('Netflix', 'price'), 3),
  (('price', 'hike'), 3),
  (('hike', 'helps'), 1),
  (('helps', 'Disney'), 2),
  (('Disney', 'upcoming'), 1),
  (('upcoming', 'streaming'), 2),
  (('streaming', 'service'), 2),
  (('service', ':'), 1),
  ((':', 'Analyst'), 1),
  (('Analyst', 'Skip'), 1)])

In [17]:
# Verify

n_grams(3)

(('Count: ', 758),
 '--------------------',
 'The first 10 values/pairs are as follows:  ',
 [(('Netflix', 'price', 'hike'), 1),
  (('price', 'hike', 'helps'), 1),
  (('hike', 'helps', 'Disney'), 1),
  (('helps', 'Disney', 'upcoming'), 1),
  (('Disney', 'upcoming', 'streaming'), 1),
  (('upcoming', 'streaming', 'service'), 2),
  (('streaming', 'service', ':'), 1),
  (('service', ':', 'Analyst'), 1),
  ((':', 'Analyst', 'Skip'), 1),
  (('Analyst', 'Skip', 'Navigation'), 1)])
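
The `n_grams` function lists n-grams in order of first appearance. Sorting by frequency is often more informative, and `Counter.most_common` does this directly. The sketch below uses a hypothetical helper, `ngram_counts`, that builds n-grams with `zip` as a plain-Python stand-in for `nltk.ngrams`, and runs on a made-up token list:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams by zipping n staggered views of the token list."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

# Hypothetical tokens, for illustration only
tokens = ["price", "hike", "helps", "price", "hike"]

bigram_counts = ngram_counts(tokens, 2)
distinct = len(bigram_counts)          # number of distinct bigrams
top = bigram_counts.most_common(1)     # most frequent bigram with its count
print(distinct, top)
```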

## d. Lemmatize and deduplicate unigrams into a vocabulary of terms.

In [18]:
lemmatizer = WordNetLemmatizer()

In [19]:
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_sentence]
lemmatized_words[:10]

['Netflix',
 'price',
 'hike',
 'help',
 'Disney',
 'upcoming',
 'streaming',
 'service',
 ':',
 'Analyst']

In [20]:
# 'lemmatized_words' is the list of lemmatized unigrams (duplicates are still present)
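
To turn the lemmatized unigrams into a vocabulary, the duplicates still need to be removed. A minimal sketch, assuming a short hypothetical list in place of `lemmatized_words`:

```python
# Hypothetical lemmatized tokens standing in for `lemmatized_words`
lemmatized_sample = ["Netflix", "price", "hike", "help", "Disney", "price", "Netflix", ":"]

# A set removes duplicates; sorting gives a stable, readable order
vocabulary = sorted(set(lemmatized_sample))
print(len(vocabulary), vocabulary)
```

The same two calls applied to `lemmatized_words` would give the vocabulary for the article. Sorting is case-sensitive, so punctuation and capitalized terms come before lowercase ones.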

## e. Print bigrams and trigrams in the first 5 sentences

#### Initial cleaning: extract the first 5 sentences

In [21]:
# I will first break the article to extract only the first 5 sentences, using the tokenized sentences (as above)

five_sentences = sentences[:5]
len(five_sentences)

5

In [22]:
# Join the five sentences into one continuous string

five_sentences = ' '.join(five_sentences)
five_sentences[:1000]

'Netflix price hike helps Disney upcoming streaming service: Analyst Skip Navigation SIGN IN Pro Watchlist Make It Select USA INTL Markets Pre-Markets U.S. Markets Currencies Cryptocurrency Futures & Commodities Bonds Funds & ETFs Watchlist Business Economy Finance Health & Science Media Real Estate Energy Transportation Industrials Retail Wealth Life Small Business Investing Invest In You Personal Finance Financial Advisors Trading Nation Options Action ETF Street Buffett Archive Earnings Trader Talk Tech Cybersecurity Enterprise Internet Media Mobile Social Media Venture Capital Tech Guide Politics White House Policy Defense Congress #### Elections CNBC TV Live TV Live Audio Latest Video Top Video CEO Interviews Business Day Shows The News with Shepard Smith Entertainment Shows CNBC World Digital Originals Full Episodes Menu SEARCH QUOTES Markets Pre-Markets U.S. Markets Currencies Cryptocurrency Futures & Commodities Bonds Funds & ETFs Watchlist Business Economy Finance Health & Sci

In [23]:
# Convert to word tokens and remove stop words

tokens_sent = word_tokenize(five_sentences)

filtered_five_sent = [word for word in tokens_sent if word not in stop_words]

### Bigrams

In [24]:
bigrams = Counter(ngrams(filtered_five_sent, 2))
print("Count: ",len(bigrams))
list(bigrams.items())[:10]

Count:  256


[(('Netflix', 'price'), 2),
 (('price', 'hike'), 3),
 (('hike', 'helps'), 1),
 (('helps', 'Disney'), 2),
 (('Disney', 'upcoming'), 1),
 (('upcoming', 'streaming'), 2),
 (('streaming', 'service'), 2),
 (('service', ':'), 1),
 ((':', 'Analyst'), 1),
 (('Analyst', 'Skip'), 1)]

### Trigrams

In [25]:
trigrams = Counter(ngrams(filtered_five_sent, 3))
print("Count: ",len(trigrams))
list(trigrams.items())[:10]

Count:  276


[(('Netflix', 'price', 'hike'), 1),
 (('price', 'hike', 'helps'), 1),
 (('hike', 'helps', 'Disney'), 1),
 (('helps', 'Disney', 'upcoming'), 1),
 (('Disney', 'upcoming', 'streaming'), 1),
 (('upcoming', 'streaming', 'service'), 2),
 (('streaming', 'service', ':'), 1),
 (('service', ':', 'Analyst'), 1),
 ((':', 'Analyst', 'Skip'), 1),
 (('Analyst', 'Skip', 'Navigation'), 1)]

## f. Print POS tags in the first 5 sentences

In [26]:
sentence_pos = pos_tag(tokens_sent)
sentence_pos[:10]

[('Netflix', 'NNP'),
 ('price', 'NN'),
 ('hike', 'NN'),
 ('helps', 'VBZ'),
 ('Disney', 'NNP'),
 ('upcoming', 'VBG'),
 ('streaming', 'VBG'),
 ('service', 'NN'),
 (':', ':'),
 ('Analyst', 'NNP')]

In [27]:
# NP chunk: optional determiner, any number of adjectives, then a single proper noun
grammar = "NP: {<DT>?<JJ>*<NNP>}"
cp = RegexpParser(grammar)
cp.parse(sentence_pos)[:10]

[Tree('NP', [('Netflix', 'NNP')]),
 ('price', 'NN'),
 ('hike', 'NN'),
 ('helps', 'VBZ'),
 Tree('NP', [('Disney', 'NNP')]),
 ('upcoming', 'VBG'),
 ('streaming', 'VBG'),
 ('service', 'NN'),
 (':', ':'),
 Tree('NP', [('Analyst', 'NNP')])]