This notebook will go through the process of examining a forum webpage for the data we want to extract, and the HTML tags and properties that we can use to identify that data.

The main program will use scrapy both to fetch the webpages and to extract the data. For simplicity during exploration, we use scrapy only to extract the data, and use requests to fetch the page we are interested in.

In [1]:
# import requests to fetch the website (there are many ways of doing this)
import requests
# import scrapy to develop the Selectors that extract the data we want, to prove they work
import scrapy
# TODO Remove htmltotext?
from html2text import html2text

# we import the Selector class directly from scrapy.selector for ease of use later
from scrapy.selector import Selector
#from scrapy.http import HtmlResponse

ModuleNotFoundError: No module named 'scrapy'

In [60]:
# This is the page from the test forum that I will use to explore the data.
# The selectors should be tried against multiple web pages to catch inconsistencies, especially if the site has
# complex mechanics like quotes, code blocks, and other forms of media.
thread_url = "http://fruitsandveggies.forumotion.com/t4-bears-beets"

# use requests to fetch the page; requests.get returns a Response object
request_file = requests.get(thread_url)

# extract the raw html (the "text" property) from the Response object
thread_html_text = request_file.text

# note that the two steps could be combined for brevity 
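
The fetch and the `.text` lookup can be folded into one small helper. The sketch below is one possible shape, not part of the original notebook: the function name `fetch_html` and the `timeout` default are assumptions made for illustration, and the status check is an optional extra.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its raw HTML as a string (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of silently parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper the cell above would reduce to `thread_html_text = fetch_html(thread_url)`.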

In [61]:
# we create a thread object that we will use to store the information on the entire thread, including the posts it contains.
# we are avoiding any complex or custom data types, so that the object is easily serializable later on.
class thread:
    def __init__(self):
        self.title = None
        self.time_created = None
        self.posts = []
        self.url = None
        self.op_account_name = None
        self.op_account_link = None
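
Because the object only holds plain attributes, one way it could be serialized later is with the standard `json` module. This is a minimal sketch under assumptions: the trimmed `thread` and `post` classes below are stand-ins repeated so the block runs on its own, and the choice of JSON is illustrative rather than something the notebook has settled on.

```python
import json


class post:
    # Trimmed stand-in for a post, with plain attributes only.
    def __init__(self):
        self.author_account_name = None
        self.content = None


class thread:
    # Trimmed stand-in for the thread class above.
    def __init__(self):
        self.title = None
        self.url = None
        self.posts = []


example_post = post()
example_post.author_account_name = "DwightLight"
example_post.content = "Battlestar Galactica."

example_thread = thread()
example_thread.title = "Bears. Beets."
example_thread.url = "http://fruitsandveggies.forumotion.com/t4-bears-beets"
example_thread.posts.append(example_post)

# json cannot encode custom objects directly, so default=vars converts each
# object (the thread and every nested post) into its attribute dictionary.
serialized = json.dumps(example_thread, default=vars, indent=2)
print(serialized)

# Loading the JSON back gives ordinary dicts and lists with the same keys.
restored = json.loads(serialized)
```

This only works while every attribute is a basic type (strings, numbers, lists, `None`) or another plain object, which is the reason for avoiding complex data types in the class.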


Scrapy uses Selectors to extract a particular part of an HTML response.
We also make use of XPath strings, which are one method of navigating through HTML text to a particular tag.

XPath has a number of options; however, the key pieces of syntax we will be using are:

* `/tag_name` : from the tag we are at, search for an immediate child tag called `tag_name`
* `//tag_name` : from the tag we are at, search for all contained tags, at any depth, called `tag_name`
* `@property_name` : extract the value of the tag property (HTML attribute) called `property_name` from the tag we are currently at
* `/tag_name[@property_name='property_value']` : go to a child tag that has a property called `property_name` with a value of `property_value`

These pieces can be strung together to make an XPath string that will bring us to where we want to be in the HTML text. For example, the XPath string

`//body/div[@class='title']/@text`

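is saying:
* from the beginning of the HTML text, for every tag called 'body' (`//body`),
* find an immediate child of 'body' called 'div' (`/div`)
* that has a property called 'class' with a value of 'title' (`[@class='title']`)
* and navigate from the found 'div' tags to the value of the property 'text' (`/@text`)

Note that `@text` refers to a property literally named 'text'. To select the text inside a tag, use `text()` instead, as the selectors later in this notebook do.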

When we create a Selector from text (in this case the stored HTML text) and then call its `xpath()` method with an XPath string, the selector navigates to the part of the HTML text described by that string.

To actually get the values from what we navigated to, we need to call `extract()` on the returned result, which gives back a list of strings.



In [None]:

# create an empty thread object and fill in its title from the page's <meta name="title"> tag
new_thread = thread()
new_thread.title = Selector(text=thread_html_text).xpath("//meta[@name='title']/@content").extract()[0]

In [62]:
# a post object holds the data extracted for a single post in the thread
class post:
    def __init__(self):
        self.author_account_name = None
        self.title = None
        self.content = None


# build the Selector once and reuse it for every xpath query
page_selector = Selector(text=thread_html_text)
post_authors = page_selector.xpath("//body//dl/dt//div[@class='postprofile-name']/text()").extract()
post_titles = page_selector.xpath("//h2[@class='topic-title']//a/text()").extract()
post_contents = page_selector.xpath("//div[@class='postbody']/div[@class='content']/div/text()").extract()

# pair the three lists up by position; this assumes each post yields exactly one author, title, and content match
posts = []
for author, title, content in zip(post_authors, post_titles, post_contents):
    new_post = post()
    new_post.author_account_name = author
    new_post.title = title
    new_post.content = content
    posts.append(new_post)

# print each post's attributes to check the selectors against the page
for _post in posts:
    print(_post.__dict__)

{'author_account_name': 'DwightLight', 'title': 'Bears. Beets.', 'content': 'Battlestar Galactica.'}
{'author_account_name': 'Schrute Farms Official', 'title': 'NO ONE READ THE ABOVE, THAT IS NOT ME', 'content': 'Identity theft is not a joke, Jim! Millions of families suffer every year!'}
{'author_account_name': 'DwightLight', 'title': 'Michael!', 'content': '*storms off*'}
{'author_account_name': 'Schrute Farms Official', 'title': "Oh that's funny", 'content': 'MICHAEL'}
