# DSCI 511: Data acquisition and pre-processing<br>Chapter 4: Pre-processing considerations: foresight for downstream needs
## Exercises
Note: numberings refer to the main notes.

#### 4.1.1.1 Exercise: CSV to JSON conversion
Read the `cities.csv` file and look at its contents. It should have a header (the first line of the file) that tells you which fields contain what data. Next, take the data for only the cities that have a population listed and store it in JSON format.

#### Discussion: Object structure to deduplicate metadata
While we didn't filter the rows down to those with a population listed (as requested), we did nest our dictionary by `state` and then `city`, which avoids having to store the state information with each 'row'. Can you modify this to filter the rows by non-empty population?

In [3]:
import csv, json
from pprint import pprint

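# Read every row of the CSV; the `with` block closes the file when done.
with open("./data/cities.csv", "r") as f:
    cities_lists = list(csv.reader(f))

states = {}
for i, row in enumerate(cities_lists):
    if i:
        state = row[1]
        city = row[0]
        states.setdefault(state, {})
        # Keep only the last three fields; the state is already the outer key.
        states[state][city] = row[-3:]
    else:
        header = row  # the first line holds the column names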

# Dump the full nested dictionary (`states`), not the last `state` string.
with open("data/states.json", "w") as f:
    json.dump(states, f)
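
One way to answer the question above is to skip any row whose population field is empty before adding it to the nested dictionary. The sketch below is only an illustration: the column names (`city`, `state`, `population`) and the toy rows are assumptions, since the real layout of `cities.csv` may differ, and an in-memory string stands in for the file.

```python
import csv
import io
import json

# Made-up stand-in for data/cities.csv; names and values are placeholders.
sample_csv = io.StringIO(
    "city,state,population\n"
    "Alpha,AA,1000\n"
    "Beta,AA,\n"
    "Gamma,BB,250\n"
)

states = {}
for row in csv.DictReader(sample_csv):
    # Skip rows whose population field is empty or only whitespace.
    if not row["population"].strip():
        continue
    # Nest by state, then city, so the state is stored only once.
    states.setdefault(row["state"], {})[row["city"]] = row["population"]

print(json.dumps(states, indent=2))
```

With the real file, the same `if` test could be applied to whichever column holds the population before the `states[state][city] = ...` assignment.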

#### 4.1.2.1 Exercise: JSON to CSV conversion
Load the data in the `american-movies.json` file. We only want the movies that were made from 1990 to 1999 (it was a truly glorious decade for American cinema). Your task is to take the title and year of making for these movies and put these in a tab-separated values file.

#### Discussion: Selecting specific columns for a list of lists
We decided to keep the `'title'` and `'year'` fields of the data, but the `'title'` field is free text, with records that will likely contain commas! To save in a tabular format, e.g., for Excel, we can still easily write to file using a list of lists and Python's basic file I/O, as long as we join by a safe delimiter like tab (`'\t'`). Here's the file output using multiple `.join()`s in a compound comprehension (for fun):

In [24]:
import json
movies = json.load(open("data/american-movies.json", "r"))

movies_ordered = [['year', 'title']]
movies_ordered += [[str(movie['year']), movie['title']] for movie in movies]

with open("data/american-movies-year-title.tsv", "w") as f:
    f.write("\n".join(["\t".join(row) for row in movies_ordered]))

#### Discussion: Loop approach
For those less comfortable with comprehensions, or just trying to get the hang of them, here's the same process using a loop-based approach:

In [25]:
import json
movies = json.load(open("data/american-movies.json", "r"))

movies_ordered = [['year', 'title']]

for movie in movies:
    movies_ordered.append([str(movie['year']), movie['title']])

with open("data/american-movies-year-title.tsv", "w") as f:
    for movie in movies_ordered:
        f.write("\t".join(movie) + "\n")
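
The exercise also asks for only the movies made from 1990 to 1999, which the cells above do not enforce. Here is a minimal sketch of that filter, using the standard `csv` module with a tab delimiter as an alternative to joining strings by hand. The movie records are made up, the in-memory buffer stands in for the output file, and we assume `'year'` may be stored as either an integer or a string.

```python
import csv
import io

# Made-up records standing in for american-movies.json.
movies = [
    {"title": "Example One", "year": 1989},
    {"title": "Example, Two", "year": 1990},
    {"title": "Example Three", "year": "1999"},
    {"title": "Example Four", "year": 2000},
]

# Keep only the 1990s; int() tolerates years stored as strings.
nineties = [m for m in movies if 1990 <= int(m["year"]) <= 1999]

buffer = io.StringIO()  # stands in for open(path, "w", newline="")
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["year", "title"])
writer.writerows([m["year"], m["title"]] for m in nineties)

tsv_text = buffer.getvalue()
print(tsv_text)
```

Because the delimiter is a tab, the comma inside a title needs no quoting.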

#### 4.1.2.4 Exercise: Making JSON file reading scalable
Create a specialized JSON serialization of the data in `'nobel-laureates.json'`. Specifically, create a file called `'data/nobel-laureates-lines.json'` that has each laureate's record serialized separately as a JSON object, with newlines `'\n'` in between as delimiters. As a follow-up, combine the line-by-line file reading syntax introduced in Section 1.4.1.5 with the `json.dumps()` string serialization function in Section 1.4.2.2 to _read only the first ten lines_. As you read these lines, load each from JSON and print the laureate's list of prizes.

#### Discussion: JSON objects on each line
Here, we're just making sure that each line of the file is interpretable as a JSON object. For practice, here's a line-by-line reader to interact with the scalably-stored data.

In [26]:
import json
nobel_laureates = json.load(open("data/nobel-laureates.json", "r"))
with open('data/nobel-laureates-lines.json', 'w') as f:
    for laureate in nobel_laureates["laureates"]:
        f.write(json.dumps(laureate)+"\n")

In [27]:
with open('data/nobel-laureates-lines.json', 'r') as f:
    for line in f:
        laureate = json.loads(line)
        print(laureate)
        break

{'id': '1', 'firstname': 'Wilhelm Conrad', 'surname': 'Röntgen', 'born': '1845-03-27', 'died': '1923-02-10', 'bornCountry': 'Prussia (now Germany)', 'bornCountryCode': 'DE', 'bornCity': 'Lennep (now Remscheid)', 'diedCountry': 'Germany', 'diedCountryCode': 'DE', 'diedCity': 'Munich', 'gender': 'male', 'prizes': [{'year': '1901', 'category': 'physics', 'share': '1', 'motivation': '"in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him"', 'affiliations': [{'name': 'Munich University', 'city': 'Munich', 'country': 'Germany'}]}]}
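
The reader above stops after the first record and prints all of it. For the follow-up as stated (first ten lines, prizes only), one possible sketch uses `itertools.islice` to cap the number of lines read. The helper name `prizes_from_first_lines` and the toy records are hypothetical; an in-memory buffer stands in for `data/nobel-laureates-lines.json`.

```python
import io
import json
from itertools import islice

def prizes_from_first_lines(lines, n=10):
    """Parse at most n JSON lines and return each record's 'prizes' list."""
    return [json.loads(line)["prizes"] for line in islice(lines, n)]

# Made-up stand-in for the lines file: 12 toy records, one per line.
fake_file = io.StringIO(
    "".join(
        json.dumps({"id": str(i), "prizes": [{"year": str(1900 + i)}]}) + "\n"
        for i in range(1, 13)
    )
)

first_ten = prizes_from_first_lines(fake_file)
for prizes in first_ten:
    print(prizes)
```

Passing an open file handle instead of `fake_file` should work the same way, since `islice` only pulls the lines it needs and the rest of the file is never parsed.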


#### 4.4.1.3 Exercise: Regex phone numbers
Read the file `phone-numbers.txt`. It contains a phone number in each line. \[Hint: use something like `lines = open("file.txt", "r").readlines()`\] Store only the phone numbers with the area code "215" in a list and print it out. Use regex-based pattern matching, not any other methods which occur to you.

#### Discussion: Character classes
Moving from the syntax `[0-9]`, the pre-defined character class `\d` really makes the regular expression much more succinct. We want to make sure we put the `215` into our pattern explicitly, since it's the Philly area code!

In [39]:
import re
from pprint import pprint  # needed if this cell runs on its own
numbers = open("./data/phone-numbers.txt", "r").readlines()

phila_nums = []
for num in numbers:
    # Raw string, so that `\d` reaches the regex engine as written.
    if re.search(r"215-\d{3}-\d{4}", num):
        phila_nums.append(num.strip())

pprint(phila_nums)

['215-673-7554',
 '215-672-4085',
 '215-371-5261',
 '215-758-4303',
 '215-173-2648',
 '215-290-3681',
 '215-424-6180',
 '215-762-7704',
 '215-709-5404',
 '215-517-7535',
 '215-377-5293',
 '215-384-6874',
 '215-356-1368',
 '215-841-7294',
 '215-992-5760',
 '215-471-4965',
 '215-384-6622',
 '215-848-9952',
 '215-577-3006',
 '215-236-7893',
 '215-625-7823',
 '215-144-3179',
 '215-266-1567',
 '215-887-1117',
 '215-595-4262',
 '215-850-8796',
 '215-750-1293',
 '215-676-8811',
 '215-217-8676',
 '215-572-9395',
 '215-724-6998',
 '215-141-8609',
 '215-275-2164',
 '215-740-6238',
 '215-340-1427',
 '215-911-2531',
 '215-315-1104',
 '215-233-6324',
 '215-800-7926',
 '215-989-2630',
 '215-990-7215',
 '215-576-2113',
 '215-870-3616',
 '215-997-9490',
 '215-546-9201',
 '215-998-3660',
 '215-819-8806',
 '215-422-2358',
 '215-908-2121',
 '215-534-1397']
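
To make the character-class point concrete, here is a small sketch comparing the `[0-9]` and `\d` spellings on made-up numbers (not taken from `phone-numbers.txt`). It also uses `re.fullmatch`, which is our own addition here, to show how an unanchored `re.search` can accept a line with an extra trailing digit.

```python
import re

# Made-up numbers; the last one has one digit too many.
sample_numbers = ["215-555-0100", "610-555-0101", "215-555-01000"]

# The two spellings of "a digit" select the same numbers here.
with_ranges = [n for n in sample_numbers
               if re.fullmatch(r"215-[0-9]{3}-[0-9]{4}", n)]
with_digits = [n for n in sample_numbers
               if re.fullmatch(r"215-\d{3}-\d{4}", n)]

# re.search only needs the pattern to occur somewhere in the string.
unanchored = [n for n in sample_numbers
              if re.search(r"215-\d{3}-\d{4}", n)]

print(with_ranges)
print(with_digits)
print(unanchored)
```

One caveat: for Python 3 string patterns, `\d` also matches non-ASCII decimal digits unless the `re.ASCII` flag is set, so the two spellings are only equivalent on ASCII input like this.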


#### 4.4.1.8 Exercise: Names of the gods
In the cell below is some text. It's an extract from [A Clash of Kings](https://www.goodreads.com/book/show/10572.A_Clash_of_Kings), specifically, about a character's prayer to some fictional gods. Use regex to extract the names of these gods. Your output should be a list that looks something like `["the Father", "the Mother", "the Warrior"]`.

#### Discussion: Shaping a capitalized word
While allowing either case on `the` didn't turn up any extra matches in this example, we did _have_ to use the capitalization of the names to separate them from the other words that follow the determiner `the`.

In [15]:
import re
text = 'Lost and weary, Catelyn Stark gave herself over to her gods. She knelt before the Smith, who fixed things that were broken, and asked that he give her sweet Bran his protection. She went to the Maid and beseeched her to lend her courage to Arya and Sansa, to guard them in their innocence. To the Father, she prayed for justice, the strength to seek it and the wisdom to know it, and she asked the Warrior to keep Robb strong and shield him in his battles. Lastly she turned to the Crone, whose statues often showed her with a lamp in one hand. "Guide me, wise lady," she prayed. "Show me the path I must walk, and do not let me stumble in the dark places that lie ahead."'

gods = re.findall("[tT]he [A-Z][a-z]+", text)
gods

['the Smith', 'the Maid', 'the Father', 'the Warrior', 'the Crone']
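
Why spell out `[tT]` instead of passing a case-insensitive flag? A short sketch on a made-up sentence (not from the book) suggests the reason: `re.IGNORECASE` applies to the whole pattern, so `[A-Z]` stops enforcing capitalization. The capture-group variant at the end is an optional extra for when only the bare names are wanted.

```python
import re

# Made-up sentence for illustration only.
sample = ("She knelt before the Smith and asked the Warrior "
          "for the strength to go on. The Crone listened.")

# Case choice limited to the first letter of "the".
strict = re.findall(r"[tT]he [A-Z][a-z]+", sample)

# The flag also relaxes [A-Z], so ordinary nouns slip through.
loose = re.findall(r"the [A-Z][a-z]+", sample, flags=re.IGNORECASE)

# With a capture group, findall returns only the captured part.
names_only = re.findall(r"[tT]he ([A-Z][a-z]+)", sample)

print(strict)
print(loose)
print(names_only)
```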

#### 4.4.4.2 Exercise: Calculate your exact age
Calculate your own age using datetime parsing! Can you come up with a datetime format for your birthday that `dateutil.parser` doesn't recognize or recognizes incorrectly? If so, use the `datetime` module to specify the format exactly. [Hint. Review these docs: 
- https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime
- https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
]

#### Discussion: Using the `dateparser` shortcut
Using `dateparser` (our alias for `dateutil.parser`) to create the datetime object can be quite helpful when it works, but for this calculation we'd have to be careful when determining the ages of particularly old people, since a 2-digit year says nothing about the century. It seems the 2-digit year `85` is read as `1985` (can you figure out how `dateparser` determines the century?), so determining age becomes as simple as taking a difference with `-` (minus) and dividing the resulting `.days` attribute by 365.25 (the average year length, accounting for leap years). Note that this yields an approximate age in fractional years.

In [22]:
from datetime import datetime as dt
import dateutil.parser as dateparser

current_time = dt.now()

birthday = dateparser.parse("8/14/85")
print(birthday)

print((current_time - birthday).days/365.25)

1985-08-14 00:00:00
33.210130047912386
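
For the second half of the exercise, here is a minimal sketch that specifies the format exactly with `datetime.strptime` and counts whole calendar years rather than dividing days by 365.25. The day-first string `"14.08.85"`, the helper name `age_in_years`, and the fixed "today" date are all assumptions chosen for a reproducible illustration; swap in `datetime.now()` for a live result.

```python
from datetime import datetime

def age_in_years(birthday, today):
    """Whole years completed between birthday and today (calendar-based)."""
    # Has this year's birthday already happened (or is it today)?
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)

# Hypothetical day-first string; %y maps 69-99 to 1969-1999.
birthday = datetime.strptime("14.08.85", "%d.%m.%y")

# Fixed date so the result is reproducible.
today = datetime(2018, 10, 30)

print(birthday)
print(age_in_years(birthday, today))
```

The tuple comparison avoids leap-year drift: the age only ticks up once the month and day of the birthday have been reached.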
