In [1]:
import requests
import csv
import pandas as pd
import re
from bs4 import BeautifulSoup
from time import sleep

### Read CSV File

We first need to read the .csv file that is in the current directory. The `keep_default_na=False` argument prevents Pandas from turning empty cells into NaN values, so they stay as empty strings.

Once you've loaded the .csv data into a variable, you can call the `.head()` method to peek at the first five rows of the data.

In [18]:
# Read CSV file
mycsv = pd.read_csv("news-links-with-id.csv", keep_default_na=False)
mycsv.head()

Unnamed: 0,id,newslink
0,6290375,http://archive.redding.com/news/ex-chp-officer...
1,7187899,http://archive.signalscv.com/archives/150962/
2,6646879,http://archive.vcstar.com/news/guilty-officer-...
3,6993951,http://archive.vcstar.com/news/local/sheriffs-...
4,1335727,http://archive.vcstar.com/news/sheriffs-deputy...
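
To see what `keep_default_na=False` changes, here is a small sketch that reads the same CSV text twice. The sample rows and the `example.com` URL are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Made-up CSV text; the second row has an empty newslink cell.
sample = "id,newslink\n1,http://example.com/a\n2,\n"

with_nan = pd.read_csv(io.StringIO(sample))
without_nan = pd.read_csv(io.StringIO(sample), keep_default_na=False)

# Default behaviour: the empty cell is read as a missing value (NaN).
print(with_nan["newslink"].isna().tolist())

# With keep_default_na=False the empty cell stays an empty string.
print(repr(without_nan.loc[1, "newslink"]))
```

This matters later in the notebook: calling `.strip()` on an empty string is harmless, while calling it on a NaN value would raise an error.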


### Create a new column

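To create a column with Pandas, assign a value to a new column name on your dataframe. Assigning a single value, such as an empty string, fills every row of the new column with that value.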

In [19]:
mycsv["news_headlines"] = ""
mycsv.head()

Unnamed: 0,id,newslink,news_headlines
0,6290375,http://archive.redding.com/news/ex-chp-officer...,
1,7187899,http://archive.signalscv.com/archives/150962/,
2,6646879,http://archive.vcstar.com/news/guilty-officer-...,
3,6993951,http://archive.vcstar.com/news/local/sheriffs-...,
4,1335727,http://archive.vcstar.com/news/sheriffs-deputy...,


### Set headers

When scraping websites, you should spoof your user agent and referer headers. This makes the website less likely to think you're a spammer. Of course, this isn't a foolproof measure, and most websites will still block you if you access them too many times in a row (which is a good thing). Spoofing the headers only really helps with very paranoid websites that block all requests from unknown referers. You can change the referer to a URL you control. Note that the website you're scraping can see the referer in its server logs.

In [34]:
headers = {
'referer': 'https://example.com',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
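
If you'd rather not pass `headers` on every call, one option is a `requests.Session`, which remembers headers across requests. This is a minimal sketch and not what the loop below does; the shortened user-agent string is a placeholder, and no request is actually sent here.

```python
import requests

# Placeholder values; substitute the full user-agent string and your own referer.
headers = {
    "referer": "https://example.com",
    "user-agent": "Mozilla/5.0 (placeholder user agent)",
}

session = requests.Session()
session.headers.update(headers)  # applied to every request made through this session

# Header names are case-insensitive on the session.
print(session.headers["Referer"])

# A later call such as session.get(url, timeout=4) would send these headers.
```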

### Loop through headlines to fetch HTML

This will go through each row of the table using a for loop and fetch the page behind each link. Study the code block first, and then come back here. We'll describe each step of the loop.

```
for index, row in mycsv.iterrows():
```

`index` is the index label of the current row. Because our table uses the default index, it starts from 0 and goes up by one each time the loop runs, so it also works as a counter. We will use the index variable later to insert the retrieved headline back into the table at the appropriate row. Just remember, the index refers to which row we're on.

`row` will be a reference to the data in each row in the spreadsheet. So, `row['column_name']` retrieves a particular cell. Every time the loop runs, the row variable will refer to data in the next row.

`mycsv.iterrows()` refers to all of the rows in the table, and it's just a way for us to cycle through all of them.
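
As a quick illustration of what `index` and `row` hold on each pass, here is a sketch on a made-up two-row table. The `demo` dataframe and its `example.com` links are stand-ins for `mycsv`.

```python
import pandas as pd

# Made-up two-row table standing in for mycsv.
demo = pd.DataFrame({
    "id": [101, 102],
    "newslink": ["http://example.com/a", "http://example.com/b"],
})

seen = []
for index, row in demo.iterrows():
    # index is the row label (0, 1, ... with the default index);
    # row is a Series, so row["newslink"] is one cell.
    seen.append((index, row["newslink"]))
    print(index, row["id"], row["newslink"])
```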

```
url = row['newslink'].strip()
```

Once in the loop, we extract the cell in the `newslink` column for the current row, which holds a URL. The `strip()` call removes leading and trailing whitespace.

```
try: 
    webpage = requests.get(url, headers = headers, timeout = 4)
    webpage.encoding = 'utf-8'
```

The `try` statement sets up a mechanism to catch errors while allowing the loop to continue (so it doesn't stop every time a website is unresponsive). The matching `except` clause at the end of the loop catches the error and prints a message instead of stopping the loop. So as it's running, we'll occasionally see messages from websites that failed to give us a headline.

We'll use the requests library to fetch the webpage, supplying the headers variable, and specifying how many seconds to wait before giving up and moving on to the next website.
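
The same pattern can be wrapped in a small function. The sketch below is an assumption-laden variation, not the notebook's loop: `fetch_html` is a hypothetical helper name, it catches the broader `requests.exceptions.RequestException` (the base class for timeouts, connection errors, and invalid URLs), and it adds `raise_for_status()` so that error pages count as failures.

```python
import requests

def fetch_html(url, headers=None, timeout=4):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
        return response.text
    except requests.exceptions.RequestException as error:
        # Timeout, ConnectionError, HTTPError, and invalid-URL errors all derive from this class.
        print(f"request failed for {url!r}: {error}")
        return None

# A URL with no scheme is rejected before any network traffic is sent.
print(fetch_html("not-a-url"))
```

Returning `None` on failure lets the calling loop skip the row and carry on.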

```
soup = BeautifulSoup(webpage.text, 'html.parser')
```

We'll use BeautifulSoup to parse the result from fetching the webpage. BeautifulSoup can work with several different parsers. Here we use `html.parser`, the parser built into Python, to interpret the HTML.

```
title = soup.find('meta', {'property': 'og:title'})
```

This finds the `<meta content="some headline" property="og:title">` tag in the webpage, which most news sites have as part of their SEO and to supply social networks with information about their articles.
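
Not every page has an `og:title` tag. As a sketch of one way to handle that, the hypothetical helper below falls back to the page's `<title>` tag, which the loop in this notebook does not do. The HTML strings are made up so the example runs without fetching anything.

```python
from bs4 import BeautifulSoup

def extract_headline(html):
    """Return the og:title content, else the <title> text, else None (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", {"property": "og:title"})
    if meta is not None and meta.get("content"):
        return meta["content"].strip()
    if soup.title is not None and soup.title.string:
        return soup.title.string.strip()
    return None

# Made-up page with both tags; og:title wins.
sample_html = """
<html><head>
<title>Fallback title</title>
<meta content="Some headline" property="og:title">
</head><body></body></html>
"""
print(extract_headline(sample_html))

# Made-up page with only a <title> tag.
print(extract_headline("<html><head><title>Only a title</title></head></html>"))
```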

```
if title is not None:
    mycsv.iat[index, mycsv.columns.get_loc("news_headlines")] = title['content']
sleep(0.1)
```

The if statement makes sure we were able to find a title. `.iat` is a Pandas accessor that sets a single cell by integer position: the first value is the row position and the second is the column position, which we look up with `mycsv.columns.get_loc("news_headlines")`. We then assign the title we found to that cell.

The sleep command slows things down, so we don't make too many requests at once.
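
One thing to keep in mind: `.iat` works by position, while `iterrows()` hands back index labels. The two match when the table has the default 0, 1, 2, ... index, as `mycsv` does here. If your table has a different index, a label-based accessor such as `.at` is an alternative. The sketch below uses a made-up dataframe with a non-default index to show the idea.

```python
import pandas as pd

# Made-up table whose index labels (10, 20) differ from the row positions (0, 1).
demo = pd.DataFrame(
    {
        "newslink": ["http://example.com/a", "http://example.com/b"],
        "news_headlines": ["", ""],
    },
    index=[10, 20],
)

for label, row in demo.iterrows():
    # .at addresses a cell by row label and column name.
    demo.at[label, "news_headlines"] = f"headline for {row['newslink']}"

print(demo)
```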

In [None]:
# Test things out before running the loop. Un-comment the code below to run for just one headline.
# url = mycsv['newslink'][0]
# webpage = requests.get(url, headers=headers,timeout=4)
# webpage.encoding = 'utf-8'
# soup = BeautifulSoup(webpage.text, 'html.parser')
# title = soup.find('meta', {'property': 'og:title'})
# title['content']


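for index, row in mycsv.iterrows():
    url = row['newslink'].strip()
    try:
        # Fetch the page, giving up after 4 seconds
        webpage = requests.get(url, headers=headers, timeout=4)
        webpage.encoding = 'utf-8'
        soup = BeautifulSoup(webpage.text, 'html.parser')
        title = soup.find('meta', {'property': 'og:title'})
        if title is not None:
            mycsv.iat[index, mycsv.columns.get_loc("news_headlines")] = title['content']
        sleep(0.1)
    except requests.exceptions.Timeout:
        print("timeout occurred")
    except requests.exceptions.RequestException as error:
        # Other request failures (connection errors, invalid URLs) shouldn't stop the loop either
        print("request failed:", error)
mycsv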

## Save CSV file of the output

The following command saves our `mycsv` data, including the additional column, to a .csv file.

In [10]:
mycsv.to_csv("data-with-headlines.csv", encoding="utf-8")
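
By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back in. If you don't want that, passing `index=False` is one option. Here is a sketch on a made-up dataframe, writing to an in-memory buffer instead of a real file.

```python
import io
import pandas as pd

# Made-up data standing in for mycsv.
demo = pd.DataFrame({"id": [1, 2], "news_headlines": ["First", "Second"]})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)  # index=False: don't write the row index as a column
print(buffer.getvalue())

# Reading the text back gives only the original columns.
buffer.seek(0)
round_trip = pd.read_csv(buffer, keep_default_na=False)
print(list(round_trip.columns))
```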