This notebook will go through the process of examining a forum webpage for the data we want to extract, and the HTML tags and properties that we can use to identify that data.

For the main program, we intend to use Scrapy both to fetch the web pages and to extract the data. For simplicity during exploration, we only use Scrapy to extract the data and use requests to fetch the page we are interested in.

It may help to open the `thread_url` below in a separate tab, so you can follow along with what is happening.

In [32]:
# import requests to fetch the website (there are many ways of doing this)
import requests
# import scrapy to develop the Selectors that extract the data we want, to prove they work
import scrapy
from pprint import pprint
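# we import Selector and remove_tags for use later
from scrapy.selector import Selector
# remove_tags lives in w3lib (installed alongside Scrapy); scrapy.utils.markup was only a
# transitional alias for it and may be missing from newer Scrapy releases
from w3lib.html import remove_tags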
import datetime



In [13]:
# This is the page from the test forum that I will use to explore the data.
# The selectors should be tried against multiple web pages to catch inconsistencies, especially if the site has
# complex mechanics like quotes, code blocks, and other forms of media.
thread_url = "https://learningautomaton.ca/wp-content/uploads/2019/02/FruitsAndVeggiesForum/Bears.%20Beets.%20-%20Fruits%20and%20Veggies.html"
# use requests to fetch the page; requests.get() returns a Response object
request_file = requests.get(thread_url)

# extract the raw HTML (the "text" property) from the Response object
thread_html_text = request_file.text
# note that the two steps could be combined for brevity
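
As the last comment notes, the fetch and the text extraction can be combined. Here is a minimal sketch, assuming a helper called `fetch_html` (a hypothetical name, not part of the original notebook) and an arbitrarily chosen 10 second timeout. The call to `raise_for_status()` is an addition, so that an HTTP error page is not silently parsed as if it were forum content:

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its HTML text (hypothetical helper for illustration)."""
    # requests.get() returns a Response object
    response = requests.get(url, timeout=timeout)
    # raise an exception for 4xx/5xx responses instead of returning an error page
    response.raise_for_status()
    # .text is the decoded body of the response
    return response.text


# Usage (requires network access):
# thread_html_text = fetch_html(thread_url)
```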

The tools that Scrapy uses to extract a particular part of an HTML response are Selectors, which select a certain part of the HTML file.
Scrapy Selectors are built on the [lxml](http://lxml.de/) XML and HTML parsing library, and are fairly fast.

We also make use of XPath strings. XPath strings are one method of navigating through HTML text to a particular
tag (the other being [CSS expressions](https://www.w3.org/TR/selectors/)).

XPath has a number of options; however, the key characters we will be using are:

* `/tag_name` : from the tag we are at, search for an immediate child tag called `tag_name`
* `//tag_name` : from the tag we are at, search for all contained tags (at any depth) called `tag_name`
* `@property_name` : extract the value of the property called `property_name` from the tag we are currently at
* `/tag_name[@property_name='property_value']` : go to a child tag that has a property called `property_name` with a value of `property_value`

These key characters can be strung together to make an XPath string that will bring us to where we want to be in the HTML text. For example, the XPath string

`//body/div[@class='title']/@text`

is saying:
* from the beginning of the HTML text, for every tag called 'body' (`//body`),
* find an immediate child of 'body' called 'div' (`/div`)
* that has a property called 'class' with a value of 'title' (`[@class='title']`)
* and navigate from the found 'div' tags to the value of the property 'text' (`/@text`)

When we pass the Selector a text object (in this case the stored HTML text) and then pass the Selector an XPath string, the Selector will navigate to the part of the HTML text described by the XPath string.

To actually get the value from the text that we navigated to, we need to call `extract()` on the returned value.

For an in-depth look into XPath and Selectors, the Scrapy [Selectors](https://doc.scrapy.org/en/latest/topics/selectors.html) page is a great resource.

In [None]:
# Let's take a look at the entire HTML text by running this cell
# (If the output takes up the entire page, you can right click this cell and select "enable scrolling for outputs")

print(thread_html_text)

#You can right click on this cell and select "clear outputs" (for the cell) or "clear all outputs" (for the entire notebook). 
# It does not run or un-run any code, just removes the currently displayed cell outputs.

In [14]:
# Let's play around with Selectors and XPath to get our feet wet.

print(Selector(text = thread_html_text).xpath("//a[@class='forumtitle']").extract())
# Try running the above line of code in the following variations:
# Selector(text = thread_html_text) will print the Selector object
# Selector(text = thread_html_text).xpath("//head") will print the SelectorList object (a subclass of list that implements the Selector interface) returned by the .xpath(query) method
# Selector(text = thread_html_text).xpath("//head").extract() will, for each object in the SelectorList, return the selected part of the HTML text as a list of text

[]


Let's further refine our XPath query to extract the title of the thread.

If you search through the text of the above cell output, you will see that the thread title (which we confirmed by looking at the [thread](https://learningautomaton.ca/wp-content/uploads/2019/02/FruitsAndVeggiesForum/Bears.%20Beets.%20-%20Fruits%20and%20Veggies.html) in our web browser) is "Bears. Beets.". 

This text appears in a few places. We will extract it from 
`<h2 class="topic-title"><a href="./viewtopic.php?f=3&amp;t=7">Bears. Beets.</a></h2>`

To extract this text, we can use the following XPath:

In [53]:
Selector(text = thread_html_text).xpath("//h2[@class='topic-title']/a/text()").extract()
# Adding /text() at the end will remove anything that is HTML, and not text content. Try removing /text() and see the output.
# Selector(text = thread_html_text).xpath("//head/title").extract()

Alright, progress!

You should have gotten `['Bears. Beets.']` as the output of the above cell. Note that it appears in brackets because `.extract()` returns a list of outputs; in this case, the list holds the single piece of text that matched our selector.

Let's make a thread object, and a function to pull what data we can about the thread:

In [None]:
# we create a thread object that we will use to store the information on the entire thread, to include the posts it contains.
# we are avoiding any complex or custom data types so that the object is easily serializable later on.
class Thread:
    def __init__(self):
        self.title = None
        self.posts = []
        self.url = None
        self.op_account_name = None

def extract_thread_data(html_text):
    new_thread = Thread()
    new_thread.title = Selector(text = html_text).xpath("//h2[@class='topic-title']/a/text()").extract()[0]
    new_thread.url = thread_url
    return new_thread

print(extract_thread_data(thread_html_text).__dict__)

This forum code does not keep information on the thread starter (aka original poster, or "op") so we will need to get that information from the posts in the thread.

Let's try and find the parts of the HTML file that we can use to extract the post.

Looking through the HTML file we notice `div` tags with the class `postbody`; let's see what we get by selecting them:

In [None]:
Selector(text = thread_html_text).xpath("//div[@class='postbody']")

Ok, that returned four items, and we see there are four posts. Let's extract one for a closer look. 

In [None]:
#We use print() here to make it more readable
print(Selector(text = thread_html_text).xpath("//div[@class='postbody']").extract()[0])

We quickly see that the post title, poster, and content are here. Let's try to extract them.

In [None]:
#the index at the end picks one item out of the list, as the Selector returns all matching tags, i.e. the bit for all posts in this thread. Here [2] picks the third match; use [0] for the first.
post_title = Selector(text = thread_html_text).xpath("//div[@class='postbody']//h3/a//text()").extract()[2] 
post_user = Selector(text = thread_html_text).xpath("//div[@class='postbody']//span[@class='username']//text()").extract()[2] 
post_content = Selector(text = thread_html_text).xpath("//div[@class='postbody']//div[@class='content']").extract()[2] 
#Note: Because of how the website uses <br> tags, using /text() won't work with the post content. We will strip out the HTML tags later.
print("title:",post_title)
print("user:",post_user)
print("content:")
print(post_content)

Ok, we're in business. Now we make a Post class, get the data for all the posts, and store them in the thread.

Let's also use Scrapy's `remove_tags` function that we imported earlier to clean up the post content.

In [None]:
class Post:
    def __init__(self):
        self.user = None
        self.title = None
        self.content = None
    
def extract_post_data(html_text):
    post_titles = Selector(text = html_text).xpath("//div[@class='postbody']//h3/a/text()").extract()
    post_users = Selector(text = html_text).xpath("//div[@class='postbody']//span[@class='username']/text()").extract()
    post_contents = Selector(text = html_text).xpath("//div[@class='postbody']//div[@class='content']").extract()
    posts = []
    for title, author, content in zip(post_titles, post_users, post_contents):
        new_post = Post()
        new_post.user = author
        new_post.title = title
        new_post.content = remove_tags(content) #remove_tags returns a single string with the HTML tags stripped out
        posts.append(new_post)
    return posts


for _post in extract_post_data(thread_html_text):
    print(_post.__dict__)

Ok, time to wrap everything together in one function.

In [None]:
def scrape_thread(html_text):
    thread = extract_thread_data(html_text)
    thread.posts = extract_post_data(html_text)
    thread.op_account_name = thread.posts[0].user
    return thread

#Yes I know I should be overriding the Thread __str__ and __repr__ methods; stretch goals.
def print_thread(thread):
    print("Thread title:", thread.title)
    print("Thread url:", thread.url)
    print("Thread op_account_name:", thread.op_account_name)
    print("Posts:")
    for post in thread.posts:
        print("\nPost title:", post.title)
        print("Post user:", post.user)
        print("Post content:")
        print("''\n",post.content,"\n''")
        
thread = scrape_thread(thread_html_text)
print_thread(thread) 
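
Since the classes were kept simple so they would be easy to serialize, here is one way that could look, as a sketch only. The `Thread` and `Post` classes are repeated so the example runs on its own; the helper name `thread_to_json` and the example values are hypothetical, and JSON is just one possible output format:

```python
import json


class Post:
    def __init__(self):
        self.user = None
        self.title = None
        self.content = None


class Thread:
    def __init__(self):
        self.title = None
        self.posts = []
        self.url = None
        self.op_account_name = None


def thread_to_json(thread):
    """Convert a Thread and its Posts into a JSON string (hypothetical helper)."""
    # vars() returns the attribute dictionary; copy it so the thread is not modified
    data = dict(vars(thread))
    # Post objects are not JSON serializable, so convert each one to a dict as well
    data["posts"] = [dict(vars(post)) for post in thread.posts]
    return json.dumps(data, indent=2)


# Made-up example values for illustration
example_post = Post()
example_post.user = "apple"
example_post.title = "Example post"
example_post.content = "Knock knock."

example_thread = Thread()
example_thread.title = "Example thread"
example_thread.posts = [example_post]
example_thread.op_account_name = example_post.user

thread_json = thread_to_json(example_thread)
print(thread_json)
```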

Success! Checking our results against the [forum page](https://learningautomaton.ca/wp-content/uploads/2019/02/FruitsAndVeggiesForum/Bears.%20Beets.%20-%20Fruits%20and%20Veggies.html) shows that we accurately extracted all of the data.

Let's try it against a different [forum page](https://learningautomaton.ca/wp-content/uploads/2019/02/FruitsAndVeggiesForum/I'm%20Going%20To%20The%20Country,%20I'm%20Gonna%20Eat%20Me%20A%20Lot%20of%20Peaches%20-%20Fruits%20and%20Veggies.html) to ensure our Selectors work across different pages.

In [None]:
thread_url2 = "https://learningautomaton.ca/wp-content/uploads/2019/02/FruitsAndVeggiesForum/I'm%20Going%20To%20The%20Country,%20I'm%20Gonna%20Eat%20Me%20A%20Lot%20of%20Peaches%20-%20Fruits%20and%20Veggies.html"
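# use requests to fetch the page; requests.get() returns a Response object
request_file2 = requests.get(thread_url2)

# extract the raw HTML (the "text" property) from the second Response object
thread_html_text2 = request_file2.text
thread2 = scrape_thread(thread_html_text2)
# extract_thread_data() records the global thread_url, so set this thread's own URL explicitly
thread2.url = thread_url2
print_thread(thread2)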

In summary, we have seen:
* The basics of XPath and Selectors
* How to explore an HTML source file to find the data you want to extract
* An example of pulling everything together to extract all of the data we need from a thread

The final source code can be found on GitHub here #TODO

If you have any comments or recommendations, please let me know by leaving a comment at https://learningautomaton.ca/forum-scrape-project---html-exploration.

<pre><code>if can_learn: 
    learn()</code></pre>