### Web Scraping Example

This notebook provides another example of web scraping, using my own [blog](https://karenmazidi.blogspot.com/), which admittedly I tend to neglect. 

The blog uses a Blogger template, which controls how the posts are laid out on the page.

First, the `requests` library is used to fetch the page.

In [1]:
import requests

URL = 'https://karenmazidi.blogspot.com/'

page = requests.get(URL)
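
The call above assumes the request succeeds. A minimal sketch of a more defensive fetch is shown here; the helper name `fetch_page` and the 10-second default timeout are my own illustrative choices, not part of the original notebook:

```python
import requests

def fetch_page(url, timeout=10):
    """Return the response for url, raising on timeouts or HTTP error codes."""
    response = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response

# Usage (needs network access):
# page = fetch_page('https://karenmazidi.blogspot.com/')
```

Failing early like this avoids parsing an error page as if it were the blog.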

Then BeautifulSoup is used to create a soup object from the web page. 

In [2]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

When we extract the text, it is quite messy.

In [3]:
text = soup.get_text()
text[:100]

'\n\n\n\nNever Stop Learning Computer Science\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to main content\n\n\n\n\n\n\n\n\n\n\n'
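
The long runs of newlines can be cleaned up with plain string methods. A small sketch, using a hard-coded sample in place of the real `text` (the sample string and the names `lines` and `clean` are made up for illustration):

```python
# Sample standing in for the first characters of soup.get_text()
sample = '\n\n\n\nNever Stop Learning Computer Science\n\n\n\nSkip to main content\n\n'

# Keep only the non-empty lines, stripped of surrounding whitespace
lines = [line.strip() for line in sample.splitlines() if line.strip()]
clean = '\n'.join(lines)
print(clean)
```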

Selecting the paragraph elements doesn't retrieve the blog posts, only the blurb in the page header. The first code block below shows that selecting 'p' returns each element with its HTML tags; the second shows how to get just the text.

In [4]:
for p in soup.select('p'):
    print(p)

<p>
Learning about all things computer science, especially machine learning, natural language processing and computer architecture. 
</p>


In [5]:
# get just the text

for p in soup.select('p'):
    print(p.get_text())


Learning about all things computer science, especially machine learning, natural language processing and computer architecture. 



### Extract text from div

We will have to examine the page source to figure out where the text is:

* right-click on the page and choose View Source
* in Chrome, you can also right-click and choose Inspect

While *View Source* opens a separate tab, the *Inspect* option opens a side-by-side view. I tend to use the View Source option.

Taking a look at the page source shows that the text of each blog post is in a div container with class 'snippet-item r-snippetized'. The next code block shows how to extract all of those.

In [6]:
# find every div whose class attribute is exactly 'snippet-item r-snippetized'
results = soup.find_all('div', class_='snippet-item r-snippetized')
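
The same lookup can also be written as a CSS selector. Here is a minimal sketch, run against a small hand-written HTML fragment rather than the live page (the fragment and the names `html`, `demo_soup`, and `posts` are assumptions for illustration only):

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the blog's markup
html = '''
<div class="snippet-item r-snippetized">First post text.</div>
<div class="other">Not a post.</div>
<div class="snippet-item r-snippetized">  Second post text.  </div>
'''
demo_soup = BeautifulSoup(html, 'html.parser')

# 'div.snippet-item.r-snippetized' matches divs that carry both classes
posts = [div.get_text(strip=True)
         for div in demo_soup.select('div.snippet-item.r-snippetized')]
print(posts)
```

Unlike matching on the full class string, the selector still matches if the classes appear in a different order or alongside other classes, which may or may not be what you want.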

Now that we have those, we can get the text from each post:

In [7]:
for post in results:
    print(post.get_text())


Google is watching you, but you knew that already. Several years ago I saw a pop-up like this:  It scared the heck out of me so I just exited out of Chrome. It popped up again today and I hit 'I want to play' and this popped up:   Wow. I have heard that this is how Google recruits developers. I don't know if that's true but I exited out of this. Maybe next time . . .


I was watching an old Twilight Zone last night, filmed in 1962. The one where an elderly couple comes into a gleaming tech company to purchase new, young artificial bodies for themselves. Spoiler alert: the new technology has unforeseen negative consequences. (Isn't that the theme of every sci-fi work?)  While they are touring displays of young, healthy bodies they may inhabit, for a price, a leggy secretary comes in and tells the executive that he has a call on the video phone. Oh how modern. A video phone. Artificial biology. And yet, a woman still has to be the secretary?  Why was it easier for the writers/producers 

As you can see, I need to blog more often.