<a href="https://colab.research.google.com/github/merriekay/S23-CS167-Notes/blob/main/Day26_Webscraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day 26
## Web Scraping

#### CS167: Machine Learning, Spring 2023

Tuesday, May 2nd, 2023

📆 [Course Schedule](https://docs.google.com/spreadsheets/d/e/2PACX-1vSvFV5Mz0_YZE1d5r3gQ8IMktE4cBAsJIlP30cl2GhEpSO0J-YWV62QokSDz-OcOCsEmxMuKpY0kVlR/pubhtml?gid=0&single=true) | 🙋[PollEverywhere](https://pollev.com/meredithmoore011) | 📜 [Syllabus](https://analytics.drake.edu/~moore/cs167_s23_syllabus.html) | 📬 [CodePost Login](https://codepost.io/login)

# Admin Stuff

You should be working on:
- Project #2, due Friday, May 5th, by 11:59 pm
- Quiz #2, due Tuesday, May 9th, by 11:59 pm

## Load your data:

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# ✨ New Material

# Web Scraping

Often in ML, the dataset you need just doesn't exist.

One potential way to create the dataset that you need is to use __web scraping__:
- Web scraping is the process of using scripts to extract information from existing websites.
- Python has a great web-scraping package, Beautiful Soup (imported as `bs4`).
- Another helpful package is `selenium`.

## Let's Solve a Problem 🐉📚🎮

I'm not sure if you know this yet, but Professor Moore is a huge nerd. 

I love reading fantasy books, playing video games, and training machine learning models. 

The natural progression of these interests is to train a machine learning model on names from fantasy books to be able to generate my own super nerdy gamertag. 

This problem can be broken down into 2 main steps:
1. Collect a dataset of names from fantasy books.
2. Train an RNN on the dataset so that it will generate new names.

# Step #1: Webscraping a Dataset

## Brando Sando Style:

Let's scrape the names of Brandon Sanderson characters from [this website](https://coppermind.net/wiki/Category:Characters).

Let's start by importing Beautiful Soup and providing the URL of the webpage. We use the `requests` library to download the page's HTML as text.

In [None]:
from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://coppermind.net/wiki/Category:Characters').text

soup = BeautifulSoup(source, 'html.parser')

csv_file = open('fantasy_characters.csv', 'w', newline='')  # newline='' is recommended by the csv module
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['name', 'author','genre', 'series', 'description']) 

print(soup.prettify())

Since the names are stored in `li` elements, we use the `soup.find_all` function with the argument `'li'`. This finds every list item on the page. It's a good first step, but notice that it's not quite what we want: it also includes the list items that come before the actual character names.

Also, notice `li.text`. We loop over each list item and look only at its text (that's what the `.text` attribute gives us).

In [5]:
count = 0
for li in soup.find_all('li'):
  count +=1
  if count < 10: #print out the first 9 list items
    print(li.text)

Alcatraz
Elantris & The Emperor's Soul
Legion
Mistborn series
The Reckoners
Rithmatist
Shadows for Silence in the Forests of Hell
Sixth of the Dusk
The Stormlight Archive


So now that we can get the characters' names, we want to start from the right place on the page.

Here we'll use the `soup.find` function. After using the browser's 'Inspect' tool, we know we're looking for the `div` element with `id='mw-pages'`.

If we save this as `section`, we can then use this section of the website, combined with the code from above to collect a list of all of the names of Brandon Sanderson characters. 

In [6]:
section = soup.find('div', id='mw-pages')

count = 0

for li in section.find_all('li'):
  count = count + 1
  name = li.text
  
  #print the first few lines just to make sure it looks good
  if count < 10: 
    print(count, name)

  #add some other information to the csv file.
  author = 'Brandon Sanderson'
  genre = 'fantasy'
  series = ''
  description =''
  csv_writer.writerow([name, author, genre, series, description])
  

1 Aarik
2 Aaron
3 Abaray
4 Abiajan
5 Abigail Casey
6 Abraham Desjardins
7 Abrem
8 Abrial
9 Abrobadar
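
To see why narrowing the search to one `div` matters, here is a minimal sketch on a made-up HTML snippet. The snippet, the `nav` id, and the handful of list items are stand-ins for illustration, not the real Coppermind page, and the sketch assumes `beautifulsoup4` is installed (as it is in Colab).

```python
from bs4 import BeautifulSoup

# Hypothetical page: one navigation list plus the list we actually want.
html = """
<div id="nav"><ul><li>Mistborn series</li><li>Legion</li></ul></div>
<div id="mw-pages"><ul><li>Aarik</li><li>Aaron</li></ul></div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Searching the whole document picks up the navigation items too.
all_items = [li.text for li in soup.find_all('li')]

# Searching inside the div with id="mw-pages" keeps only the names.
section = soup.find('div', id='mw-pages')
names = [li.text for li in section.find_all('li')]

print(all_items)
print(names)
```

Calling `find_all` on `section` instead of `soup` is the only change, but it limits the search to the descendants of that one element.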


## Now let's add some Wheel of Time 🛞: 

New URL: https://en.wikipedia.org/wiki/List_of_Wheel_of_Time_characters 

You'll notice that this website is set up a bit differently from the Brando Sando website. We'll have to adjust our web-scraping code accordingly.

What should we put on line 4, `class_ = ___________` ?


In [None]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_Wheel_of_Time_characters').text

soup = BeautifulSoup(source, 'html.parser') 
section = soup.find('div', class_= __________ ) #your code here

count = 0
for li in section.find_all('li'):

  #This one isn't quite as simple because we have a character name and description.
  count = count + 1
  list_item = li.text

  #not all of the entries are names, but all the names follow the syntax <name: description>
  if ':' in list_item:
    #split at the first colon only, so a colon inside the description doesn't cut it short
    name, description = list_item.split(':', 1)


    # SPOILER ALERT: DONT PRINT DESCRIPTION OUT IF YOU DON'T WANT TO READ THE CHARACTER DESCRIPTIONS.
    if count < 10:
      print(count, name)

    #then we add some other information and write a row to the csv file
    series = 'Wheel of Time'
    author = 'Robert Jordan'
    genre= 'fantasy'
    csv_writer.writerow([name, author, genre, series, description])
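
The `name: description` pattern can also be wrapped in a small helper that strips stray whitespace and signals when a list item is not a character entry at all. This is only a sketch: `split_entry` is a hypothetical helper and the example strings are made up, not text from the real page.

```python
def split_entry(list_item):
    """Split 'name: description' at the first colon only.

    Returns (name, description), or None when there is no colon.
    """
    name, sep, description = list_item.partition(':')
    if not sep:  # no colon, so this list item is not a character entry
        return None
    return name.strip(), description.strip()

# Made-up examples, not text from the real page
print(split_entry('Some Name: A farmer. Known for: stubbornness'))
print(split_entry('See also'))
```

`str.partition` always returns three parts, so there is no index to get wrong when the colon is missing.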

## How about some Kingkiller Chronicle Characters?
https://kingkiller.fandom.com/wiki/Category:Characters


In [None]:
source = requests.get('https://kingkiller.fandom.com/wiki/Category:Characters').text


# Now it's your turn... try to scrape the character names from the Kingkiller Chronicle




## Lord of the Rings, Anyone?
Yesss, my preciousssss: https://lotr.fandom.com/wiki/Category:The_Lord_of_the_Rings_Characters

In [None]:
source = requests.get('https://en.wikipedia.org/wiki/Category:The_Lord_of_the_Rings_characters').text

# And try to do Lord of the Rings also




# Annnnddd... let's give that bad boy a download

In [None]:
csv_file.close()
from google.colab import files
files.download('fantasy_characters.csv')
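
A file opened with `open` stays open until `close()` runs, so an error partway through the notebook can leave the CSV incomplete. Below is a minimal sketch of the same header-plus-rows pattern inside a `with` block, which closes the file automatically. The file name `demo_characters.csv` and the two rows are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows in the same column order as fantasy_characters.csv
rows = [
    ['Aarik', 'Brandon Sanderson', 'fantasy', '', ''],
    ['Aaron', 'Brandon Sanderson', 'fantasy', '', ''],
]

path = os.path.join(tempfile.gettempdir(), 'demo_characters.csv')

# newline='' is what the csv module recommends when opening files for writing
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'author', 'genre', 'series', 'description'])
    writer.writerows(rows)

# The file is closed here, so it is safe to read it back
with open(path, newline='') as f:
    saved = list(csv.reader(f))

print(len(saved), saved[1][0])
```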

# Part 2: Fantasy Character Name Generator

This code should look very familiar... it's the code from Thursday, but just with a new dataset.

[Here's a link to the dataset](https://drive.google.com/file/d/1xcqFHcQ5EU4NjNP0lX6yCR3658_zuMXw/view?usp=sharing) if you just want to download it:

In [1]:
#imports and things
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
    !pip install -q -U tensorflow-addons
    !pip install -q -U transformers
    IS_COLAB = True
except Exception:
    IS_COLAB = False

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

if not tf.config.list_physical_devices('GPU'):
    print("No GPU was detected. LSTMs and CNNs can be very slow without a GPU.")
    if IS_COLAB:
        print("Go to Runtime > Change runtime and select a GPU hardware accelerator.")

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.

# Import the Data
Import the data (it's a plain text file, so loading it is a little different from loading a CSV).
- Then we replace the newlines with spaces.
- We then get the vocabulary and use a tokenizer to convert the text to sequences.

In [2]:
from google.colab import drive
import pandas as pd
from tensorflow import keras
drive.mount('/content/drive')
#names = pd.read_csv('/content/drive/MyDrive/sanderson_names.txt',  header = None)
#names.head()

with open('/content/drive/MyDrive/CS167/datasets/sanderson_names.txt') as f:
    names = f.read()

Mounted at /content/drive


In [3]:
names = names.replace('\n'," ")
names[:100]

'Aarik Aaron Abaray Abiajan Abigail Reed Abraham Desjardins Abrial Abrobadar Abronai Abry Absence Aci'

In [4]:
# The vocabulary of our character-level language model looks like this:
"".join(sorted(set(names.lower())))

" '()-./:abcdefghijklmnopqrstuvwxyz©±√"

In [5]:
# Use Tokenizer to tokenize the Names
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(names)

In [6]:
# Encode the name 'Shallan' as a sequence of token ids:
tokenizer.texts_to_sequences(["Shallan"])

[[8, 11, 2, 7, 7, 2, 5]]

In [7]:
# Revert the sequence of tokens back to the word:
tokenizer.sequences_to_texts([[8, 11, 2, 7, 7, 2, 5]])

['s h a l l a n']
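
Roughly speaking, a character-level tokenizer lowercases the text, counts each character, and hands out integer ids starting at 1, with the most frequent character getting the smallest id. The sketch below mimics that idea in plain Python on a made-up string. `build_char_index`, `encode`, and `decode` are hypothetical helpers, not part of Keras, and details such as tie-breaking may differ from the real `Tokenizer`.

```python
from collections import Counter

def build_char_index(text):
    """Map each character to an id (1 = most frequent), after lowercasing."""
    counts = Counter(text.lower())
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(word, char_index):
    """Turn a word into a list of character ids."""
    return [char_index[ch] for ch in word.lower()]

def decode(ids, char_index):
    """Turn a list of character ids back into a string."""
    index_char = {i: ch for ch, i in char_index.items()}
    return ''.join(index_char[i] for i in ids)

char_index = build_char_index('anna ann')   # made-up corpus
ids = encode('Nan', char_index)
print(char_index)
print(ids, decode(ids, char_index))
```

Because the text is lowercased first, `'Nan'` and `'nan'` map to the same ids, which is also why the decoded name above comes back in lowercase.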

In [8]:
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count # total number of characters

[encoded] = np.array(tokenizer.texts_to_sequences([names])) - 1
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
dataset = dataset.repeat().window(window_length, shift=1, drop_remainder=True)

dataset = dataset.flat_map(lambda window: window.batch(window_length))

np.random.seed(42)
tf.random.set_seed(42)

batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))

dataset = dataset.prefetch(1)


for X_batch, Y_batch in dataset.take(1):
    print(X_batch.shape, Y_batch.shape)

(32, 100, 37) (32, 100)
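
The pipeline above slides a window of `n_steps + 1` characters across the encoded text, then uses the first `n_steps` characters of each window as the input and the same window shifted one character ahead as the target. Here is a plain-Python sketch of that idea on a toy sequence. `make_windows` is a hypothetical helper, and it leaves out the `repeat`, shuffling, batching, and one-hot steps.

```python
def make_windows(encoded, n_steps):
    """Return (input, target) pairs from sliding windows of n_steps + 1 ids."""
    window_length = n_steps + 1
    pairs = []
    # shift=1: start a new window at every position; drop incomplete windows
    for start in range(len(encoded) - window_length + 1):
        window = encoded[start:start + window_length]
        pairs.append((window[:-1], window[1:]))  # target = input shifted by one
    return pairs

toy = [0, 1, 2, 3, 4, 5]          # stand-in for the encoded characters
for X, Y in make_windows(toy, n_steps=3):
    print(X, '->', Y)
```

In the printed shapes, `(32, 100, 37)` is batch size, `n_steps`, and the one-hot depth `max_id`; the targets stay as integer ids, which is why their shape is just `(32, 100)`.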


# Build our model
Now we build our model. 

In [9]:
def create_model():
  model = keras.models.Sequential([
      keras.layers.GRU(64, return_sequences=True, input_shape=[None, max_id],
                      dropout=0.2),
      keras.layers.GRU(64, return_sequences=True,
                      dropout=0.2),
      keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                      activation="softmax"))
  ])
  return model


model = create_model()
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 gru (GRU)                   (None, None, 64)          19776     
                                                                 
 gru_1 (GRU)                 (None, None, 64)          24960     
                                                                 
 time_distributed (TimeDistr  (None, None, 37)         2405      
 ibuted)                                                         
                                                                 
Total params: 47,141
Trainable params: 47,141
Non-trainable params: 0
_________________________________________________________________
None
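
The parameter counts in the summary can be checked by hand. Assuming the TensorFlow 2 default GRU variant (`reset_after=True`, which keeps two bias vectors per gate), each GRU layer has 3 gates with `units * (input_dim + units) + 2 * units` parameters each, and the dense layer has `inputs * outputs + outputs`. The helper names below are made up for this check.

```python
def gru_params(input_dim, units):
    """Parameter count for a GRU layer, assuming reset_after=True (TF2 default)."""
    per_gate = units * (input_dim + units) + 2 * units  # weights + two biases
    return 3 * per_gate                                 # update, reset, candidate

def dense_params(inputs, outputs):
    """Parameter count for a dense layer: weights plus one bias per output."""
    return inputs * outputs + outputs

max_id = 37  # vocabulary size, taken from the shapes printed earlier
total = gru_params(max_id, 64) + gru_params(64, 64) + dense_params(64, max_id)
print(gru_params(max_id, 64), gru_params(64, 64), dense_params(64, max_id), total)
```

The four printed numbers line up with the `Param #` column and the total in the summary above.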


In [10]:
#save our models during training
checkpoint_path = "training_1/fantasy_name_gen.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)

# Now let's train our model. Notice the callbacks=[cp_callback], 
#this will save checkpoints so we can load our model later.
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
history = model.fit(dataset, steps_per_epoch=train_size // batch_size,
                    epochs=10,callbacks=[cp_callback])

Epoch 1/10
Epoch 1: saving model to training_1/fantasy_name_gen.ckpt
Epoch 2/10
Epoch 2: saving model to training_1/fantasy_name_gen.ckpt
Epoch 3/10
Epoch 3: saving model to training_1/fantasy_name_gen.ckpt
Epoch 4/10
Epoch 4: saving model to training_1/fantasy_name_gen.ckpt
Epoch 5/10
Epoch 5: saving model to training_1/fantasy_name_gen.ckpt
Epoch 6/10
Epoch 6: saving model to training_1/fantasy_name_gen.ckpt
Epoch 7/10
Epoch 7: saving model to training_1/fantasy_name_gen.ckpt
Epoch 8/10
Epoch 8: saving model to training_1/fantasy_name_gen.ckpt
Epoch 9/10
Epoch 9: saving model to training_1/fantasy_name_gen.ckpt
Epoch 10/10
Epoch 10: saving model to training_1/fantasy_name_gen.ckpt


# Let's try it out!

In [11]:
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

# Let's pass in 'Meredi' and see what it predicts the next letter should be according to Sanderson:
X_new = preprocess(["Meredi"])

#argmax picks the index of the most likely next character at each position
Y_pred = np.argmax(model(X_new), axis=-1)
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1] # 1st sentence, last char

'n'

In [12]:
def next_char(text, temperature=1):
    X_new = preprocess([text])
    y_proba = model(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]

In [13]:
tf.random.set_seed(42)

next_char("Meredi", temperature=1)

's'
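
`next_char` divides the log-probabilities by `temperature` before sampling. Here is a small sketch of what that does to a distribution, using a made-up three-character probability vector and a hypothetical helper `apply_temperature`. The real function samples with `tf.random.categorical` rather than returning probabilities.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale log-probabilities by 1/temperature and renormalize (softmax)."""
    scaled = [math.log(p) / temperature for p in probs]
    top = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.6, 0.3, 0.1]                     # made-up softmax output
for t in (0.3, 1, 2):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

Temperatures below 1 push probability toward the most likely character, so samples become more repetitive; temperatures above 1 flatten the distribution and make samples more random. A temperature of 1 leaves the model's probabilities unchanged.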

In [14]:
def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [15]:
tf.random.set_seed(42)

print(complete_text("t", temperature=0.3))

t joshin joshin jasten jaston jaston jastan jastan 


In [16]:
print(complete_text("t", temperature=1))

tizbbr thmalanathop rine thoret hersram trimet shov


In [17]:
print(complete_text("t", temperature=2))


tpari uvu dws.'a 'vurk jiadywxpn qemlliw tmvidannau


In [18]:
# Generate a few characters after 'mer' and keep only the first word as the name
new_name = complete_text('mer', 15, temperature=0.75)
new_name.split(" ")[0].title()

'Merlist'

# Let's try loading our model

In [19]:
!pip install pyyaml h5py

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [20]:
os.listdir(checkpoint_dir)

['fantasy_name_gen.ckpt.index',
 'checkpoint',
 'fantasy_name_gen.ckpt.data-00000-of-00001']

In [21]:
latest = tf.train.latest_checkpoint(checkpoint_dir)
latest

'training_1/fantasy_name_gen.ckpt'

In [22]:
# disable warnings because we live dangerously:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

# Create a new model instance
model = create_model()

# Load the previously saved weights
model.load_weights(latest)

print(complete_text("t", temperature=0.25))

t molin morash mores moresh mord moreth mord morden
