# Web Scraping with Python Workshop


We will use the Python package Beautiful Soup to scrape headlines from the NY Times. By scraping the headlines, we will examine how to search for metadata hidden within HTML tags and how to strip the HTML tags away from the scraped data.

After scraping a static webpage, we will crawl a similar webpage and use the crawler to "click" on links embedded within it.

We will then store the data in a pandas DataFrame and show how to write it to a CSV file.

Feel free to send questions to CDSS_executives@columbia.edu

Import the Beautiful Soup package as well as urllib, a standard-library package that is used to process URLs. If you don't have these packages, please install `python3` and `bs4` either with your favorite package manager or by means of `conda`. The `urllib.request` module used here exists only in Python 3, so run this notebook with a Python 3 kernel.

In [None]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

Find the search URL that you would like to use.

In [None]:
search_url = 'http://nytimes.com'

Form the Beautiful Soup query around the URL. For more information, visit the Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/. To try out different URLs, start by modifying the `search_url` variable from above.

In [None]:
soup = BeautifulSoup(urlopen(search_url).read(), 'html.parser')

Beautiful Soup essentially produces a copy of the HTML file served by the web application. So let's print it to see what we're working with!

In [None]:
print(soup)
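
The raw dump is easier to scan once it is indented. As a minimal sketch on a tiny made-up document (the `tiny_html` string is an illustration, not NYT markup), `prettify()` returns the parsed tree with one tag per line:

```python
from bs4 import BeautifulSoup

# Hypothetical two-tag document, used only to illustrate the output format
tiny_html = '<div><p>Hello, soup!</p></div>'
tiny_soup = BeautifulSoup(tiny_html, 'html.parser')

compact = str(tiny_soup)        # the markup as a single line
pretty = tiny_soup.prettify()   # the same markup, indented one tag per line

print(compact)
print(pretty)
```

On the real page, `print(soup.prettify())` should give the same kind of indented view of the full homepage.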

It doesn't look too nice! We need to find a way to parse through it. Let's start by looking at the HTML identifiers near the information that we want -- the headlines. Let's command+F our printed soup output for the context of one of the headlines that we want.

```
<h2 class="story-heading"><a href="https://www.nytimes.com/2017/02/01/arts/beyonce-pregnant-twins.html">Beyoncé Announces She Is Pregnant With Twins</a></h2>
<p class="byline">By JOE COSCARELLI <time class="timestamp" data-eastern-timestamp="2:49 PM" data-utc-timestamp="1485978561" datetime="2017-02-01">2:49 PM ET</time></p>
<p class="summary">
        The pop star shared an Instagram post in which she said her family with the rapper Jay Z “will be growing by two.”    </p>
```

This headline is denoted by the tag `<h2>` and the class "story-heading". Further command+F searches through the soup output confirm that "story-heading" denotes the headline for all of the stories on the homepage.

In [None]:
# find_all is the current name for the older findAll alias
soup2 = soup.find_all('h2', {'class': 'story-heading'})
print(soup2)
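
To see what this query returns without hitting the network, here is a small sketch on inline markup. The first heading is copied from the excerpt above; the other two `<h2>` tags are made up to show that only the "story-heading" class matches and that a match can be empty.

```python
from bs4 import BeautifulSoup

sample_html = '''
<h2 class="story-heading"><a href="https://www.nytimes.com/2017/02/01/arts/beyonce-pregnant-twins.html">Beyoncé Announces She Is Pregnant With Twins</a></h2>
<h2 class="section-heading">Opinion</h2>
<h2 class="story-heading"></h2>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_ (with a trailing underscore) filters on the HTML class attribute
sample_headings = sample_soup.find_all('h2', class_='story-heading')

print(len(sample_headings))
for tag in sample_headings:
    print(repr(tag.get_text()))
```

The `class_='story-heading'` keyword is equivalent to the `{'class': 'story-heading'}` dictionary form used in the cell above.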

Now we just need to strip away the HTML tags to get the text that we want. Since not every element in soup2 contains text, we first need to check for empty ones before getting the text.

In [None]:
lines = []
for line in soup2:
    # A Tag object is always truthy, so test its text rather than the tag itself
    if line.text.strip():
        lines.append(line.text)
        print(line.text)

In [None]:
print(lines)
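
The visible text is only part of what a tag carries; metadata such as the link target and the publication time sits in tag attributes. The sketch below, run on markup copied from the homepage excerpt shown earlier, reads both. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the homepage excerpt shown earlier
excerpt = '''
<h2 class="story-heading"><a href="https://www.nytimes.com/2017/02/01/arts/beyonce-pregnant-twins.html">Beyoncé Announces She Is Pregnant With Twins</a></h2>
<p class="byline">By JOE COSCARELLI <time class="timestamp" data-eastern-timestamp="2:49 PM" data-utc-timestamp="1485978561" datetime="2017-02-01">2:49 PM ET</time></p>
'''
excerpt_soup = BeautifulSoup(excerpt, 'html.parser')

heading = excerpt_soup.find('h2', class_='story-heading')
headline_text = heading.get_text(strip=True)   # visible text, tags removed
headline_url = heading.find('a')['href']       # attribute of the nested <a> tag

time_tag = excerpt_soup.find('time', class_='timestamp')
published_date = time_tag['datetime']                 # metadata stored in an attribute
utc_timestamp = int(time_tag['data-utc-timestamp'])   # attribute values are strings

print(headline_text, headline_url, published_date, utc_timestamp)
```

Attributes are read with dictionary-style indexing on the tag, while `get_text()` returns only what a reader would see on the page.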


How many headlines did we scrape? Does this number seem reasonable? Let's do some visual inspection of the data that we scraped.

In [None]:
print(len(lines))

Now that we've confirmed that our data looks like headlines, let's strip away all the \n, trailing or leading whitespace and other characters that we don't want. We will do this using the `str.strip()` method.

In [None]:
headlines = []
for line in lines:
    line = line.strip()
    headlines.append(line)
print(headlines)

Hmmmm...what seems to be going on here? The newline characters are still there! This is because they're in the middle of the strings, and `strip()` only removes whitespace from the ends. Since certain headlines appear to be repeated before and after a series of \n's, let's `split()` each line on '\n', strip the leading and trailing whitespace from the first element of the resulting list, and use this processed string as our headline.

In [None]:
for line_index in range(len(headlines)):
    # keep only the text before the first newline, then trim it
    first_elem = headlines[line_index].split('\n')[0]
    headlines[line_index] = first_elem.strip()
print(headlines)
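
The difference between the two cleaning steps is easiest to see on a single string. A minimal sketch, using a made-up headline that repeats around a run of newlines the way the scraped ones do:

```python
# Hypothetical raw text, shaped like the duplicated headlines described above
raw_headline = '\n   Markets Rally on Jobs Report\n\n\n   Markets Rally on Jobs Report   \n'

# strip() trims only the two ends, so the inner newlines survive
stripped = raw_headline.strip()
print(repr(stripped))

# split on '\n', keep the first piece, then trim that piece
first_elem = stripped.split('\n')[0].strip()
print(repr(first_elem))
```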

Now let's store the data in a useful way. How about a dataframe? 

In [None]:
import pandas as pd
df = pd.DataFrame(headlines)
print(df)

That was simple, right? We can easily write this data to a CSV from pandas also.

In [None]:
df.to_csv('headlines.csv')
# make sure that it's there...
# in IPython/Jupyter notebooks, a leading "!" runs the rest of the line as a shell command
!ls
!head headlines.csv
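
By default the CSV gets a column header of `0` and an extra column of row numbers. One possible variation, sketched here on made-up headlines, names the column and drops the index; it writes to an in-memory buffer (an assumption made so the example leaves no file behind) and reads the result back to check the round trip.

```python
import io
import pandas as pd

# Hypothetical headlines standing in for the scraped list
sample_headlines = ['First example headline', 'Second example headline']

# Naming the column gives the CSV a readable header instead of "0"
sample_df = pd.DataFrame({'headline': sample_headlines})

# Write to an in-memory buffer instead of a file; index=False drops the row numbers
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)

# Read the CSV text back to confirm the round trip
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing a filename such as `'headlines.csv'` in place of `buffer` would write the same content to disk.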

Now let's crawl (though this is not really crawling yet)! Let's try to store some text from Reuters articles. The legality of this is debatable, so do it at your own risk!

In [None]:
reuters_url = 'http://www.reuters.com/'
crawl_soup = BeautifulSoup(urlopen(reuters_url).read(), 'html.parser')
print(crawl_soup)
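
Given the legal caveat above, one courtesy check before crawling is the site's robots.txt file, which lists the paths the site asks crawlers to avoid. The sketch below uses the standard library's `urllib.robotparser` on a made-up rule set (the rules and the `example.com` URLs are assumptions for illustration, not Reuters' actual policy). It does not settle the legal question; it only shows how to read a site's stated preferences.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())   # parse rules from text, no network access

article_allowed = parser.can_fetch('*', 'http://example.com/article/1')
private_allowed = parser.can_fetch('*', 'http://example.com/private/page')
print(article_allowed, private_allowed)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would fetch the real file instead of the inline text.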

As with the NYT, Reuters stories are mostly identified by the "story-title" class, but this time we want the URLs! Each URL is stored in the `href` attribute of the link next to the story title, so we can pull it out once we have found the "story-title" elements.

In [None]:
crawl_soup1 = crawl_soup.find_all(True, {'class': 'story-title'})
#print(crawl_soup1)

# now get the href; note that it always starts with <a href="....
# remember that lines can be empty

r_headlines = []
# hacky way for when the soup tag accessors don't work:
# split the raw HTML on the double-quote character (written "\"" with a backslash escape)
for line in crawl_soup1:
    if len(line) > 1:
        for i in str(line).split("\""):
            if 'article' in i:
                r_headlines.append(i)

print(r_headlines)
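
When the tag accessors do work, the same links can be read from the `href` attribute directly. A minimal sketch, assuming each story title wraps an `<a>` tag with a site-relative link (the sample markup and article paths below are made up, and the live Reuters page may be structured differently):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'http://www.reuters.com/'

# Hypothetical markup shaped like the "story-title" elements described above
sample_html = '''
<h3 class="story-title"><a href="/article/example-one-idUS0001">Example story one</a></h3>
<h3 class="story-title"><a href="/article/example-two-idUS0002">Example story two</a></h3>
<h3 class="story-title">No link here</h3>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

article_urls = []
for tag in sample_soup.find_all(class_='story-title'):
    link = tag.find('a', href=True)   # None when the title has no link
    if link is not None and 'article' in link['href']:
        # turn the site-relative path into an absolute URL
        article_urls.append(urljoin(base_url, link['href']))

print(article_urls)
```

Each absolute URL could then be passed to `urlopen` and `BeautifulSoup`, just like the homepage, to "click" through to the article text.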