
# **Example 1: Submitting GET Requests**

This module imports `urlopen` from `urllib.request`. It then sends a GET request to http://www.math.unm.edu/writingHTML/tut/index.html and prints
the HTML returned. The response body is a `bytes` object, which is why the output starts with `b'`.

In [None]:
from urllib.request import urlopen

# Define URL as string.
url = "http://www.math.unm.edu/writingHTML/tut/index.html"

# Send GET request.
html = urlopen(url)

# Print raw HTML (a bytes object).
print(html.read())

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">\n<html>\n<head>\n<title>Writing HTML</title>\n<META name="description" content="More than just an HTML reference, this is a structured approach for learning how to create web pages, designed by specialists in learning at the Maricopa Center for Learning & Instruction.">\n<META  name="keywords" content="HTML, tutorial, learn, make, create, design, web page, home page, education, maricopa, mcli, writing, form, tables, frames, javascript">\n</head>\n<body bgcolor="#FFFFFF">\n<H5><I>Writing HTML</I> |\nAbout | \n<A HREF="faq.html">FAQ</A> |\n<A HREF="http://www.mcli.dist.maricopa.edu/cgi-bin/alumni.pl">Alumni</A> |\n<A HREF="http://www.mcli.dist.maricopa.edu/cgi-bin/kudos_tut.pl">Kudos</A> |\n<a href="ref.html">References</a> | \n<A HREF="tags/index.html">Tags</A> |\n<A HREF="lessons.html">Lessons</A> | \n</H5>\n\n<img src="pictures/tut.gif" alt="..." width="397" height="114"> <br>\n<i>/ May 1999 / version 4.0.1 / \n<A HREF="version.html"
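`urlopen(...).read()` returns raw bytes rather than text. The following is a minimal sketch of decoding such a payload into a string. The helper name `decode_body` and the sample bytes are illustrative assumptions, not part of the original notebook. With a live response, the declared charset could be read from `html.headers.get_content_charset()`, which may return `None`.

```python
def decode_body(raw, charset=None):
    """Decode a response body, falling back to UTF-8 with replacement."""
    # `charset` is whatever the server declared; it may be missing.
    encoding = charset or "utf-8"
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or undecodable bytes: degrade gracefully.
        return raw.decode("utf-8", errors="replace")

# Hypothetical payload standing in for html.read().
raw = b"<html>\n<head>\n<title>Writing HTML</title>\n</head>\n</html>"
text = decode_body(raw, charset="iso-8859-1")

# Print the third line of the decoded document.
print(text.splitlines()[2])
```

Decoding once, right after the request, keeps the rest of the pipeline working with ordinary strings.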

# **Example 2: Fixing Broken HTML**

This module imports `BeautifulSoup` from the `bs4` module. It then defines a string variable that contains broken HTML: the `class` attribute is unquoted and the `<li>` tags are never closed. The string is parsed into a `BeautifulSoup` object, which repairs these errors, and the result is printed with `prettify()`.

In [None]:
from bs4 import BeautifulSoup

# Define broken html as string.
broken_html = "<ul class=country><li>Area<li>Area<li>Population</ul>"

# Convert string to BeautifulSoup object.
soup = BeautifulSoup(broken_html)

# Correct formatting errors and then print.
fixed_html = soup.prettify()
print(fixed_html)

<html>
 <body>
  <ul class="country">
   <li>
    Area
   </li>
   <li>
    Area
   </li>
   <li>
    Population
   </li>
  </ul>
 </body>
</html>
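To see what the parser had to repair, the sketch below counts opening and closing tags in the broken string using only the standard library. The class `TagCounter` and the function `unclosed_tags` are hypothetical helpers introduced for illustration. Note that this simple count would also flag void elements such as `<br>`, which legitimately have no closing tag.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count opening and closing tags without repairing anything."""

    def __init__(self):
        super().__init__()
        self.opened = Counter()
        self.closed = Counter()

    def handle_starttag(self, tag, attrs):
        self.opened[tag] += 1

    def handle_endtag(self, tag):
        self.closed[tag] += 1

def unclosed_tags(html):
    """Return tags that were opened more often than they were closed."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return {tag: n - counter.closed[tag]
            for tag, n in counter.opened.items()
            if n > counter.closed[tag]}

broken_html = "<ul class=country><li>Area<li>Area<li>Population</ul>"
print(unclosed_tags(broken_html))  # {'li': 3}
```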


# **Example 3: Parsing HTML**

This module imports `urlopen` from `urllib` and
`BeautifulSoup` from `bs4`. It sends a GET request to
https://en.wikipedia.org/wiki/Richard_Thaler, reads the HTML from the response, and converts it to a `BeautifulSoup` object. It then prints the first `h1` tag and its text.

In [None]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Define URL as string.
url = "https://en.wikipedia.org/wiki/Richard_Thaler"

# Send GET request and convert result to BeautifulSoup object.
html = urlopen(url)
soup = BeautifulSoup(html.read())

# Print h1 header tag.
print(soup.h1)

# Print text attribute of h1 header.
print(soup.h1.text)

<h1 class="firstHeading mw-first-heading" id="firstHeading"><span class="mw-page-title-main">Richard Thaler</span></h1>
Richard Thaler


# **Example 4: Handling Exceptions**

This module imports `urlopen` from `urllib` and
`BeautifulSoup` from `bs4`. It sends a GET request to
http://google.com/404 and receives a 404 error. It then pauses for 5 seconds and attempts the same GET request. If that fails again, it sends a GET request to a different URL.

In [None]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError
import time

# Define URLs as strings.
url_0 = "http://google.com/404"
url_1 = "http://google.com"

# Use try-except block to handle exceptions.
try:
	html = urlopen(url_0)
	soup = BeautifulSoup(html.read())
except HTTPError as e:
	print(e)
	# Retry once after a short pause if the page was not found.
	if e.code == 404:
		time.sleep(5)
		try:
			html = urlopen(url_0)
			soup = BeautifulSoup(html.read())
		except HTTPError as e:
			print(e)
			# Fall back to the second URL.
			html = urlopen(url_1)
			soup = BeautifulSoup(html.read())
			print('Page found.')

HTTP Error 404: Not Found
HTTP Error 404: Not Found
Page found.
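The retry-then-fall-back pattern above can be generalized into a small helper. The sketch below is one possible arrangement, not the notebook's method: `fetch_with_retry`, its parameters, and the offline stand-in `fake_fetch` are all hypothetical names. The fetch function and the sleep function are passed in so the logic can be exercised without network access or real delays.

```python
import io
import time
from urllib.error import HTTPError

def fetch_with_retry(fetch, urls, retries=1, delay=5, sleep=time.sleep):
    """Try each URL in order, retrying 404 responses before moving on."""
    for url in urls:
        for attempt in range(retries + 1):
            try:
                return fetch(url)
            except HTTPError as e:
                # Only 404 errors are retried; anything else is re-raised.
                if e.code != 404:
                    raise
                print(f"{url}: {e}")
                # Pause before retrying; skip the pause after the last attempt.
                if attempt < retries:
                    sleep(delay)
    raise RuntimeError("All URLs failed with 404.")

# Hypothetical stand-in for urlopen so the sketch runs offline.
def fake_fetch(url):
    if url.endswith("/404"):
        raise HTTPError(url, 404, "Not Found", None, io.BytesIO())
    return "<html>ok</html>"

page = fetch_with_retry(fake_fetch,
                        ["http://google.com/404", "http://google.com"],
                        sleep=lambda seconds: None)
print(page)
```

In real use, `fetch` could be `lambda url: urlopen(url).read()` and `sleep` could be left at its default.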


# **Example 5: Parsing HTML**

This module imports `urlopen` from `urllib` and
`BeautifulSoup` from `bs4`. It then sends a GET request to
https://auctionet.com and converts the resulting HTML into a `BeautifulSoup` object. The remaining cells demonstrate basic navigation of the `BeautifulSoup` parse tree.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Load and parse HTML from auction site.
url = "https://auctionet.com"
html = urlopen(url)
soup = BeautifulSoup(html.read())

In [None]:
# Access first title, paragraph, and link tag.
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.a)

In [None]:
# Find all links and print the first five.
links = soup.find_all('a')
for link in links[:5]:
	print(link)
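The navigation calls above return whole tags. A common next step is to pull out the link targets and resolve relative paths. The sketch below does this on a small inline document so that it runs without network access. The HTML snippet and `base_url` are made up for illustration, and the parser is set explicitly to `html.parser`, which is an assumption rather than the notebook's choice.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page standing in for the downloaded auction site.
html_doc = """
<html><head><title>Auction house</title></head>
<body>
<p class="intro">Welcome.</p>
<a href="/en/items">Items</a>
<a href="https://example.com/about">About</a>
<a name="anchor-without-href">No link</a>
</body></html>
"""
demo_soup = BeautifulSoup(html_doc, "html.parser")
base_url = "https://example.com"

# Keep only <a> tags that carry an href, and make every URL absolute.
hrefs = [urljoin(base_url, a["href"])
         for a in demo_soup.find_all("a", href=True)]

print(demo_soup.title.string)
print(hrefs)
```

Filtering with `href=True` avoids a `KeyError` on anchor tags that have no `href` attribute.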

# **Example 6: Counting Words**

This example imports the Natural Language Toolkit (`nltk`) and the Reuters Corpus from `nltk.corpus`. It then loads an article about gold, performs both sentence and word tokenization on the text, and then constructs a frequency distribution from the words.

#### **Install and download modules.**

In [None]:
# Install and import nltk (v3.5)
!pip install nltk==3.5
import nltk

# Download popular submodules and corpora.
nltk.download('popular')
nltk.download('reuters')

Collecting nltk==3.5
  Downloading nltk-3.5.zip (1.4 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.5-py3-none-any.whl size=1434679 sha256=8dde4cf3712bc016babec3d35c86f95d5528c3e13e4792b62a508fdcf5934752
  Stored in directory: /root/.cache/pip/wheels/35/ab/82/f9667f6f884d272670a15382599a9c753a1dfdc83f7412e37d
Successfully built nltk
Installi

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nlt

True

#### **Print article categories and topics.**

In [None]:
print(nltk.corpus.reuters.categories())

In [None]:
print(nltk.corpus.reuters.fileids(['gold']))

#### **Load and explore example article about gold.**

In [None]:
# Load article.
gold = nltk.corpus.reuters.raw(fileids='training/9799')

# Print article length in characters.
print(len(gold))

# Print first 50 characters of article.
print(gold[:50])

#### **Tokenize data and explore tokens.**

In [None]:
# Tokenize sentences and print example sentence.
goldSentences = nltk.sent_tokenize(gold)
print(goldSentences[4])

In [None]:
# Tokenize words.
goldWords = nltk.word_tokenize(gold)

# Print number of words and words 150-159.
print(len(goldWords))
print(goldWords[150:160])

#### **Count number of instances of "rise" and "fall" in text.**

In [None]:
# Convert to lower case.
goldWords = [w.lower() for w in goldWords]

# Compute frequency distribution of words.
fdist = nltk.FreqDist(goldWords)

# Print frequencies for words in list.
print('Rise: ' + str(fdist['rise']))
print('Fall: ' + str(fdist['fall']))
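`nltk.FreqDist` is a subclass of `collections.Counter`, so the counting step can be illustrated with the standard library alone. The sketch below uses a made-up sentence and a simple regular-expression tokenizer, which is cruder than `nltk.word_tokenize`. It also shows that exact-match counting treats "fall" and "falls" as different tokens.

```python
import re
from collections import Counter

# Hypothetical sentence standing in for the Reuters article.
sample = ("Gold prices rise as the dollar falls. "
          "Analysts expect a further rise, not a fall.")

# Lowercase and keep alphabetic runs only.
sample_tokens = re.findall(r"[a-z]+", sample.lower())
word_counts = Counter(sample_tokens)

# Exact-match counts: "falls" is a different token from "fall".
print(word_counts["rise"], word_counts["fall"], word_counts["falls"])  # 2 1 1
```

If inflected forms should count together, the tokens would need stemming or lemmatization before counting.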

# **Example 7: Analyzing Central Bank Communication**

This example first imports `urlopen` from `urllib`, `BeautifulSoup` from `bs4`, and `nltk`. It uses `urlopen` and `BeautifulSoup` to download Janet Yellen's September 26, 2017 speech and parse the HTML content. It then extracts the text from all paragraph objects, removes the references section, and computes the frequency with which the word "inflation" appears in the speech, both before and after removing stop words.

#### **Import modules.**

In [None]:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords

#### **Download central bank speech and explore text.**

In [None]:
# Define URL.
url = "https://www.federalreserve.gov/newsevents/speech/yellen20170926a.htm"

# Define user-agent string.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/90.0.4430.85"}

# Define contents of GET request.
req = Request(url, headers=headers)

# Send get request.
html = urlopen(req)

# Parse HTML.
soup = BeautifulSoup(html.read())

In [None]:
soup

<!DOCTYPE html>
<html class="no-js" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0 maximum-scale=1.6, user-scalable=1" name="viewport"/>
<meta content=" MathJax.Hub.Config({extensions: [&quot;tex2jax.js&quot;],jax: [&quot;input/TeX&quot;, &quot;output/HTML-CSS&quot;],tex2jax: {inlineMath: [ ['$$','$$'], ['$$$','$$$'] ],displayMath: [ ['$$$$" name="description"/>
<meta content="Board of Governors of the Federal Reserve System" property="og:site_name"/>
<meta content="article" property="og:type"/>
<meta content="Speech by Chair Yellen on inflation, uncertainty, and monetary policy" property="og:title"/>
<meta content="https://www.federalreserve.gov/images/social-media/social-default-image-opengraph.jpg" property="og:image"/>
<meta content="Board of Governors of the Federal Reserve System" property="og:image:alt"/>
<meta content=" MathJax.Hub.Config({extensions: [&quot

In [None]:
# Extract text from all paragraph objects.
paragraphs = soup.find_all('p')
paragraphs = [p.text for p in paragraphs]
print(len(paragraphs))

162


In [None]:
# Join the paragraphs into a speech and identify references section.
speech = ' '.join(paragraphs)
print(speech.split('References')[1][:50])


Aaronson, Daniel, Luojia Hu, Arian Seifoddini, a


In [None]:
# Remove references section.
speech = speech.split('References')[0]

#### **Compute relative frequency of term inflation in text.**

In [None]:
# Tokenize the speech into words.
wordTokens = nltk.word_tokenize(speech)

# Convert characters to lower case.
wordTokens = [w.lower() for w in wordTokens]

# Compute the frequency distribution of word use.
fdist = nltk.FreqDist(wordTokens)

# Count the number of uses of inflation and of all words.
speechLength = len(wordTokens)
inflationCount = fdist['inflation']

# Compute and print inflation intensity.
print(100.0*inflationCount/speechLength)

2.0097136158097473


In [None]:
# Remove stop words.
stops = stopwords.words("english")
wordTokens = [word for word in wordTokens if word not in stops]
fdist = nltk.FreqDist(wordTokens)

# Print modified inflation intensity.
print(100.0*fdist['inflation']/len(wordTokens))

3.0698388334612434
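The intensity measure above is the term count divided by the total token count, times 100. Removing stop words leaves the numerator unchanged and shrinks the denominator, which is why the second figure is larger. A minimal sketch with a made-up token list and a made-up stop list (the function name `term_intensity` is hypothetical):

```python
def term_intensity(tokens, term, stops=frozenset()):
    """Percentage of non-stop-word tokens that equal `term`."""
    kept = [tok for tok in tokens if tok not in stops]
    if not kept:
        return 0.0
    return 100.0 * kept.count(term) / len(kept)

# Hypothetical tokens and stop list for illustration only.
demo_tokens = ["inflation", "is", "the", "rate", "of", "price",
               "growth", "and", "inflation", "matters"]
demo_stops = {"is", "the", "of", "and"}

print(term_intensity(demo_tokens, "inflation"))              # 2 of 10 tokens
print(term_intensity(demo_tokens, "inflation", demo_stops))  # 2 of 6 tokens
```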


# **Example 8: Text Classification**

This example imports classes from `nltk`, `re`,
`numpy`, `sklearn`, and `keras`. It then pulls Reuters articles about corn and wheat. The articles are separated into train and test sets and the text is cleaned. We then 1) compute a count vectorization of the text corpus; and 2) use the word counts to estimate a naive Bayes model and a dense neural network. On the test set, the neural network misclassifies 11 of 83 articles, compared with 18 for naive Bayes.

#### **Import modules**.

In [None]:
import nltk
from nltk.corpus import reuters, stopwords
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

#### **Create train and test sets.**

In [None]:
# Load reuters categories.
print(reuters.categories())

# Load corn and wheat categories.
corn = reuters.fileids(['corn'])
wheat = reuters.fileids(['wheat'])

# Drop common ids.
common = set(corn).intersection(wheat)
corn = [id for id in corn if id not in common]
wheat = [id for id in wheat if id not in common]

# Separate test and train files.
train_corn_ids = [train for train in corn if train.find('train')>-1]
test_corn_ids = [test for test in corn if test.find('test')>-1]

train_wheat_ids = [train for train in wheat if train.find('train')>-1]
test_wheat_ids = [test for test in wheat if test.find('test')>-1]

['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']


In [None]:
# Define function to load raw documents and targets (0 = corn, 1 = wheat).
def load_data(corn_ids, wheat_ids):
	docs = [reuters.raw(fid) for fid in corn_ids + wheat_ids]
	targets = [0] * len(corn_ids) + [1] * len(wheat_ids)
	return docs, targets

# Load train and test data.
train, train_target = load_data(train_corn_ids, train_wheat_ids)
test, test_target = load_data(test_corn_ids, test_wheat_ids)

#### **Clean text.**

In [None]:
# Remove special characters and stopwords.
stops = set(stopwords.words("english"))

def preprocess_text(text):
  try:
    text = re.sub('[^A-Za-z]+', ' ', text)
    wordTokens = nltk.word_tokenize(text)
    wordTokens = [token.lower() for token in wordTokens if len(token)>1]
    wordTokens = [token for token in wordTokens if token not in stops]
    cleanedText = ' '.join(wordTokens)
  except TypeError:
    # Non-string input cannot be cleaned.
    cleanedText = ''
  return cleanedText

# Pre-process data.
train = [preprocess_text(doc) for doc in train]
test = [preprocess_text(doc) for doc in test]

# Drop unusable documents, keeping targets aligned with the documents.
train_target = [t for doc, t in zip(train, train_target) if len(doc)>0]
test_target = [t for doc, t in zip(test, test_target) if len(doc)>0]
train = [doc for doc in train if len(doc)>0]
test = [doc for doc in test if len(doc)>0]

#### **Construct feature matrix.**

In [None]:
# Extract features. Use train and test.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(np.hstack([train,test])).toarray()
train_counts = counts[:len(train),:]
test_counts = counts[len(train):,:]
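A count matrix has one row per document and one column per vocabulary term. The sketch below builds one by hand for three made-up documents and then slices it into train and test rows, as in the cell above. It is a simplified stand-in for `CountVectorizer`: the function `count_matrix` is hypothetical and splits on whitespace only, whereas `CountVectorizer` applies its own tokenization rules.

```python
def count_matrix(docs):
    """Build a sorted vocabulary and one row of term counts per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, toks in enumerate(tokenized):
        for tok in toks:
            matrix[i][index[tok]] += 1
    return vocab, matrix

# Hypothetical cleaned documents: two for training, one for testing.
toy_train = ["corn exports rise", "wheat exports fall"]
toy_test = ["corn corn wheat"]

vocab, toy_counts = count_matrix(toy_train + toy_test)
toy_train_counts = toy_counts[:len(toy_train)]
toy_test_counts = toy_counts[len(toy_train):]

print(vocab)
print(toy_train_counts)
print(toy_test_counts)
```

Building the vocabulary from both sets guarantees that train and test rows share the same columns, at the cost of letting test-set words influence the feature space.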

#### **Train Naive Bayes classifier.**

In [None]:
# Train Naive Bayes classifier.
nbm = GaussianNB()
nbm.fit(train_counts, train_target)

# Predict train and test set.
train_pred = nbm.predict(train_counts)
test_pred = nbm.predict(test_counts)

In [None]:
# Compute train data confusion matrix.
confusion_matrix(train_target, train_pred)

array([[122,   0],
       [  3, 150]])

In [None]:
# Compute test data confusion matrix.
confusion_matrix(test_target, test_pred)

array([[22, 12],
       [ 6, 43]])

#### **Train deep learning classifier.**

In [None]:
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.utils import to_categorical

In [None]:
# Convert training target to categorical variable.
train_target = to_categorical(train_target)

# Define dense neural network.
model = Sequential()
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.30))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.30))
model.add(Dense(2, activation='softmax'))

# Compile network.
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

# Train model.
history = model.fit(train_counts, train_target,
 				epochs=20, batch_size=32, shuffle=True,
 				validation_split=0.20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [None]:
# Predict test class probabilities.
test_pred = model.predict(test_counts)

# Compute confusion matrix.
confusion_matrix(test_target, np.argmax(test_pred, axis=1))



array([[29,  5],
       [ 6, 43]])
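The two test-set confusion matrices reported above can be reduced to accuracy and per-class recall for a direct comparison. The sketch below uses the printed values, with rows as true classes (corn, wheat) and columns as predicted classes, following the `sklearn` convention. The helper names are hypothetical.

```python
def accuracy(cm):
    """Share of correct predictions: diagonal sum over total count."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def recall_per_class(cm):
    """Row-wise recall: correct predictions over true members of each class."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

# Test-set confusion matrices copied from the outputs above.
nb_test = [[22, 12], [6, 43]]
nn_test = [[29, 5], [6, 43]]

for name, cm in [("Naive Bayes", nb_test), ("Neural network", nn_test)]:
    print(name, round(accuracy(cm), 3),
          [round(r, 3) for r in recall_per_class(cm)])
```

Test accuracy rises from about 0.78 to about 0.87, and the gain comes entirely from the corn class: wheat recall is identical in both matrices.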

In [None]:
# Summarize model architecture.
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 32)                151712    
                                                                 
 dropout (Dropout)           (None, 32)                0         
                                                                 
 dense_1 (Dense)             (None, 16)                528       
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_2 (Dense)             (None, 2)                 34        
                                                                 
Total params: 152274 (594.82 KB)
Trainable params: 152274 (594.82 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


# **Example 9: Sentiment Analysis and Topic Modeling**

This example scrapes and parses an ECB speech using `urllib` and `BeautifulSoup`. It then uses `pysentiment2` to measure the sentiment of each paragraph in the speech and `sklearn` to uncover latent topics using the `LatentDirichletAllocation` model.

#### **Install and import modules.**

In [None]:
!pip install pysentiment2

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import numpy as np
import pysentiment2 as ps
from sklearn.feature_extraction.text import CountVectorizer

#### **Scrape and clean data.**

In [None]:
# Scrape ECB speech.
url = 'https://www.ecb.europa.eu/press/key/date/2021/html/ecb.sp210429~3f8606edca.en.html'
sp210429 = BeautifulSoup(urlopen(url).read())

# Extract paragraphs.
paragraphs = [p.text for p in sp210429.find_all('p')]
print(paragraphs[10])

#### **Compute sentiment.**

In [None]:
# Instantiate Loughran-McDonald dictionary.
lm = ps.LM()

# Tokenize paragraphs.
tokens = [lm.tokenize(p) for p in paragraphs]

# Compute sentiment.
sentiment = [lm.get_score(p)['Polarity'] for p in tokens]

# Print paragraph and sentiment score.
print(paragraphs[10])
print(sentiment[10])
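A dictionary-based polarity score compares the number of positive and negative words in a text. The sketch below assumes the score is (positive - negative) / (positive + negative), with a small constant in the denominator to avoid division by zero. That formula, the constant `EPSILON`, and the tiny word lists are assumptions made for illustration and are not taken from the `pysentiment2` source or from the real Loughran-McDonald lists.

```python
EPSILON = 1e-6  # assumed guard against division by zero

# Hypothetical mini word lists; real sentiment dictionaries are far larger.
POSITIVE = {"improve", "strong", "gain"}
NEGATIVE = {"risk", "loss", "weak"}

def polarity(tokens):
    """Score in (-1, 1): positive minus negative matches over all matches."""
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    return (pos - neg) / (pos + neg + EPSILON)

demo_paragraph = ["strong", "recovery", "gain", "despite", "risk"]
print(round(polarity(demo_paragraph), 3))  # 0.333
```

A paragraph with no dictionary matches scores exactly zero under this definition, so a zero can mean either balanced sentiment or no sentiment words at all.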

#### **Identify latent topics.**

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
# Vectorize tokens into count matrix.
vectorizer = CountVectorizer()
tfreq = vectorizer.fit_transform([' '.join(token) for token in tokens])
feature_names = vectorizer.get_feature_names_out()

# Instantiate LDA model.
lda = LatentDirichletAllocation(n_components=5)

# Fit LDA model and transform data.
props = lda.fit_transform(tfreq)

# Recover word distribution for each topic.
wordDist = lda.components_

# Collect three highest probability terms for each topic.
topics = []

for i in range(5):
	topics.append([feature_names[name] for name in
		wordDist[i].argsort()[-3:][::-1]])

# Print topics.
print(topics)
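The expression `argsort()[-3:][::-1]` sorts term indices by weight in ascending order, keeps the last three, and reverses them so the heaviest term comes first. The sketch below reproduces that selection in plain Python on made-up weights. The function `top_k_indices` and the sample vocabulary are hypothetical, and tie-breaking may differ from NumPy's.

```python
def top_k_indices(weights, k):
    """Indices of the k largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return order[:k]

# Hypothetical vocabulary and topic-word weights for a single topic.
demo_features = ["bank", "euro", "growth", "inflation", "policy"]
demo_weights = [0.5, 2.0, 0.1, 3.5, 1.2]

top_terms = [demo_features[j] for j in top_k_indices(demo_weights, 3)]
print(top_terms)  # ['inflation', 'euro', 'policy']
```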