## Exercise: read and write .csv data

Create a directory called `data` ONE LEVEL UP from the directory that holds the file you are working in, and put the data file [`amazon_stock_data.csv`](https://raw.githubusercontent.com/paljenczy/progtools2-2016-winter/master/data/amazon_stock_data.csv) there. The file contains Amazon's stock price at a daily frequency.

1. Read the data using the `csv` module's `DictReader` class and build a list of the daily price data. The result is a list of dictionaries, where each dictionary holds the data for one day.

2. Go through this list and for each dictionary, create a new key-value pair: `"Avg"` will be the key and the value will be the average of the `Open` and `Close` prices.
    * define an empty list `list_new_dicts` in which you'll collect the new dictionaries
    * write a for loop that steps through the dictionaries
        * from each dictionary, get the values corresponding to the keys `"Open"` and `"Close"`. These are strings; turn them into `float`s using the `float` function
        * take the average of these two float numbers and assign it to a variable `avg_price`. Its type is `float`.
        * add the new key-value pair with `"Avg"` as key and the calculated average as value. CAUTION! First turn the float value into a string using `"{:.2f}".format(avg_price)`
        * append the updated dictionary to `list_new_dicts`

3. Take the new list and write it to a new file using `csv.DictWriter`. You have to supply an argument `fieldnames` that specifies the fields to be written and their order (copy and paste the code below, then inspect how it works).

```
import csv

# write to a new file so the original data is not overwritten
with open("../data/amazon_stock_data_new.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Date", "Open", "High", "Low", "Close", "Volume", "Adj Close", "Avg"],
                            delimiter=",")
    # write a header
    writer.writeheader()
    # loop through the new list of dictionaries
    for d in list_new_dicts:
        writer.writerow(d)
```

### Solution

In [64]:
# 1.
import csv

list_dicts = []
with open("../data/amazon_stock_data.csv", "r", newline="") as f:
    reader = csv.DictReader(f, delimiter=",")
    for d in reader:
        list_dicts.append(d)
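
To see why step 2 needs the `float` conversion, it can help to look at what `DictReader` actually yields. The following is a minimal sketch that reads from an in-memory string instead of the real file; the two data rows are made up for illustration, and only the column names come from the exercise.

```python
import csv
import io

# Hypothetical two-row sample; the real file has the same header but other values.
sample = (
    "Date,Open,High,Low,Close,Volume,Adj Close\n"
    "2016-01-04,10.00,11.00,9.50,10.50,1000,10.50\n"
    "2016-01-05,10.50,12.00,10.25,11.75,2000,11.75\n"
)

# io.StringIO lets DictReader read the string as if it were an open file.
rows = list(csv.DictReader(io.StringIO(sample)))

first = rows[0]
print(first["Open"], type(first["Open"]).__name__)

# Every value is a str, so arithmetic needs an explicit conversion.
avg = (float(first["Open"]) + float(first["Close"])) / 2
print("{:.2f}".format(avg))
```

Each row is a dictionary keyed by the header names, and all values are strings until they are converted.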

In [65]:
# 2.
list_new_dicts = []
for d in list_dicts:
    avg_price = (float(d["Open"]) + float(d["Close"]))/2
    d["Avg"] = "{:.2f}".format(avg_price)
    list_new_dicts.append(d)

In [66]:
# 3.
with open("../data/amazon_stock_data_new.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Date", "Open", "High", "Low", "Close", "Volume", "Adj Close", "Avg"],
                            delimiter=",")
    # write a header
    writer.writeheader()
    # loop through the new list of dictionaries
    for d in list_new_dicts:
        writer.writerow(d)
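
The role of `fieldnames` is easier to see on a small example. The sketch below writes to an in-memory buffer with a shortened, hypothetical column list and made-up rows; it shows what happens when a row lacks a listed key and when it carries a key that is not listed.

```python
import csv
import io

fieldnames = ["Date", "Open", "Close", "Avg"]  # shortened, hypothetical column set

buffer = io.StringIO()  # in-memory stand-in for an open file
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()

# A key missing from the row is written as an empty field (restval defaults to "").
writer.writerow({"Date": "2016-01-04", "Open": "10.00", "Close": "10.50"})

# A key that is not in fieldnames raises ValueError (extrasaction defaults to "raise").
try:
    writer.writerow({"Date": "2016-01-05", "Volume": "2000"})
    extra_key_rejected = False
except ValueError:
    extra_key_rejected = True

output = buffer.getvalue()
print(output)
print(extra_key_rejected)
```

This is why `"Avg"` has to be added to the `fieldnames` list in the solution: the dictionaries now contain that key, and the writer would otherwise reject them.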

## Exercise: Mining a news article with BS and regex

Take the news article http://www.sacbee.com/site-services/databases/article57154063.html. 

1. Extract the title of the article.
2. Extract the author information to a string variable (BY PHILLIP REESE - PREESE@SACBEE.COM)
3. Write a function that extracts an email address from a text
  * create a regular expression that matches an email address
  * search the input string for a match
  * if yes, return the email address, if not, return `None`
    * you can test if a match was found using
      ```
      result = pattern.search(text)
      if bool(result):
        ...
      ```
4. Extract the date and time information appearing above the title (JANUARY 28, 2016 4:26 PM)
5. Write a function that extracts time information from a string
  * it should match "hour:minute AM/PM"-like strings
    * hint: `[0-5]` matches all digits between 0 and 5
    * times such as 2:33 PM and 22:40 AM should be matched
    * times that do not make sense should not be matched (like 45:33 PM or 2:67 AM)
    * return the time if found and `None` otherwise
  * use this function to extract the time from the string extracted in the previous task
6. The text contains __two__ links to other news articles. Extract them using a combination of regex with Beautiful Soup search.
  * the links should appear in the body of the text, not in other linked content (thus, you should first extract a BS object that is the body of the text)
  * the links are contained in `<a>` tags' `href` attributes
  * the links should match a regex that begins with `http` and ends in `html`

### Solution

In [54]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

# set the default HTML parser:

def BS(html):
    """
    Overcome annoying warning message: set the html parser
    """
    return BeautifulSoup(html, "lxml")

In [55]:
url = "http://www.sacbee.com/site-services/databases/article57154063.html"
soup = BS(urlopen(url))

In [56]:
# 1.
# this is a Beautiful Soup object
title_obj = soup.find("h1", {"class": "title"})

# get the text and strip newline characters
print(title_obj.get_text().replace("\n", ""))

Top political parties continue to slip in Sacramento region


In [57]:
# 2.
author_text = soup.find("div", {"class": "byline element-spacing-small"}).p.get_text()
print(author_text)

By Phillip Reese - preese@sacbee.com


In [58]:
# 3.

import re

def email_extractor(text):
    # raw string, so the backslashes reach the regex engine unchanged
    email_pattern = re.compile(r"\w+@\w+\.\w+")
    result = email_pattern.search(text)
    if bool(result):
        email_address = result.group()
        return email_address
    else:
        return None

email_address = email_extractor(author_text)
print(email_address)

preese@sacbee.com
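
The pattern `\w+@\w+\.\w+` is enough for this byline, but it is deliberately simple. The sketch below compares it with a somewhat broader pattern on a made-up address; `broader_pattern` and `first_match` are hypothetical names introduced here, and neither pattern is a full email validator.

```python
import re

# Same regex as in the solution above: word characters on both sides of "@".
simple_pattern = re.compile(r"\w+@\w+\.\w+")
# Hypothetical broader pattern that also allows ".", "+" and "-" in the address.
broader_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def first_match(pattern, text):
    """Return the first match of pattern in text, or None."""
    result = pattern.search(text)
    return result.group() if result else None

text = "Contact: first.last@mail.example.com"  # made-up address
print(first_match(simple_pattern, text))
print(first_match(broader_pattern, text))
print(first_match(simple_pattern, "no address here"))
```

With the simple pattern, a dot in the part before the `@` or a multi-part domain truncates the match, so the pattern should be adapted to the kind of addresses expected in the text.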


In [59]:
# 4.
time_and_date = soup.find("p", {"class": "published-date"}).get_text()
print(time_and_date)


January 28, 2016 4:26 PM



In [60]:
# 5.

def extract_time(text):
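    # hours 0-23, minutes 00-59; the lookbehind stops "45:33 PM" from matching as "5:33 PM"
    time_pattern = re.compile(r"(?<!\d)(?:[01]?[0-9]|2[0-3]):[0-5][0-9] [AP]M")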
    match = time_pattern.search(text)
    if bool(match):
        return match.group()
    else:
        return None
    
print(extract_time(time_and_date))

4:26 PM
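
One subtlety with `search` is that a pattern without a guard on its left side can start matching in the middle of a longer run of digits. The sketch below compares two hypothetical patterns on the example strings from the task; the names `unguarded`, `guarded` and `find_time` are introduced here for illustration only.

```python
import re

# Unguarded: can start matching in the middle of a longer number.
unguarded = re.compile(r"[1-2]?[0-9]:[0-5][0-9] [AP]M")
# Guarded: hour limited to 0-23 and must not be preceded by another digit.
guarded = re.compile(r"(?<!\d)(?:[01]?[0-9]|2[0-3]):[0-5][0-9] [AP]M")

def find_time(pattern, text):
    """Return the first time-like match in text, or None."""
    match = pattern.search(text)
    return match.group() if match else None

for text in ["2:33 PM", "22:40 AM", "45:33 PM", "2:67 AM"]:
    print(text, "->", find_time(unguarded, text), "|", find_time(guarded, text))
```

For `45:33 PM` the unguarded pattern skips the leading `4` and reports `5:33 PM`, while the guarded one returns `None`, which is the behavior the task asks for.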


In [61]:
# 6.
story = soup.find("div", {"id": "story-body-items"})
# find_all is the current name; findAll is a deprecated alias
links_in_story = story.find_all("a", {"href": re.compile("^http.*html$")})
print(links_in_story)

# extract the links
list_links_in_text = [x["href"] for x in links_in_story]
print(list_links_in_text)

[<a href="http://www.aipca.org/platform.html">platform planks</a>, <a href="http://www.sacbee.com/site-services/databases/article4457401.html">he political makeup of every Sacramento neighborhood</a>]
['http://www.aipca.org/platform.html', 'http://www.sacbee.com/site-services/databases/article4457401.html']
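
To check what the `href` filter keeps and drops without fetching the page, the regex can be applied to a plain list of strings. This is a minimal sketch: the first two values are the links from the output above, the other three are made-up examples, and it assumes that Beautiful Soup applies a compiled pattern to the attribute value with `search`.

```python
import re

link_pattern = re.compile(r"^http.*html$")

hrefs = [
    "http://www.aipca.org/platform.html",
    "http://www.sacbee.com/site-services/databases/article4457401.html",
    "https://example.com/page.html?ref=1",   # hypothetical: does not end in "html"
    "/news/local/article1.html",             # hypothetical: relative link, no "http" prefix
    "mailto:someone@example.com",            # hypothetical: not a web link at all
]

# Keep only the values that start with "http" and end in "html".
matching = [h for h in hrefs if link_pattern.search(h)]
print(matching)
```

The `^` and `$` anchors do the real work here: without them, any `href` that merely contains `http` followed later by `html` would pass the filter.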
