# DSCI 511: Data acquisition and pre-processing<br>Chapter 4: Pre-processing considerations: foresight for downstream needs
## Exercises
Note: exercise numbers refer to the main notes.

#### 4.1.1.1 Exercise: CSV to JSON conversion
Read the `cities.csv` file and look at its contents. It should have a header (the first line of the file) that tells you which fields contain what data. Next, take the data for only the cities that have a population listed and store it in JSON format.

#### Discussion: Object structure to deduplicate metadata
While we didn't filter the rows down to those with a population listed (as requested), we did give our dictionary a default structure, keyed by `state` and then `city`, so as to avoid having to store the state information with each 'row'. Can you modify this to filter the rows by non-empty population?
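One way to approach the follow-up question is sketched below. It is only a sketch under assumptions: the inline CSV text stands in for `cities.csv`, the `population` column name is a guess at the real header, and the values are illustrative rather than real data.

```python
import csv
import io
import json

# Hypothetical stand-in for cities.csv; the real header and values may differ.
sample_csv = """city,state,population
Philadelphia,PA,1500000
Pittsburgh,PA,
Camden,NJ,70000
"""

cities_by_state = {}
for row in csv.DictReader(io.StringIO(sample_csv)):
    population = row["population"].strip()
    if not population:
        # Skip rows where no population is listed.
        continue
    # setdefault creates the per-state dictionary the first time a state appears,
    # so the state name is stored once instead of on every 'row'.
    cities_by_state.setdefault(row["state"], {})[row["city"]] = int(population)

cities_json = json.dumps(cities_by_state, indent=2)
print(cities_json)
```

With the real file, `io.StringIO(sample_csv)` would be replaced by an open file handle, and `json.dump()` could write straight to an output file.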

#### 4.1.2.1 Exercise: JSON to CSV conversion
Load the data in the `american-movies.json` file. We only want the movies that were made from 1990 to 1999 (it was a truly glorious decade for American cinema). Your task is to take the title and year of each of these movies and write them to a tab-separated values file.

#### Discussion: Selecting specific columns for a list of lists
We decided to keep the `'title'` and `'year'` fields of the data, but the `'title'` field is free text, with records that will likely contain commas! So to save in a tabular format, e.g., for Excel, we can still easily write out to file using a list of lists and Python's basic file I/O, as long as we join by a safe delimiter, like tab (`'\t'`). Here's the file output using multiple `.join()`s in a compound comprehension (for fun):
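A minimal sketch of that nested-`.join()` idea follows. The records are made-up stand-ins for `american-movies.json`, and the output file name and temporary directory are assumptions chosen so the example runs anywhere.

```python
import os
import tempfile

# Hypothetical records standing in for american-movies.json (titles are made up).
movies = [
    {"title": "First Film, Part One", "year": 1989},
    {"title": "Second Film", "year": 1994},
    {"title": "Third Film, Revisited", "year": 1999},
    {"title": "Fourth Film", "year": 2003},
]

# The inner join builds each tab-separated row; the outer join stacks rows with newlines.
tsv_text = "\n".join(
    "\t".join([movie["title"], str(movie["year"])])
    for movie in movies
    if 1990 <= int(movie["year"]) <= 1999
)

# Written to a temporary directory here; the file name is an assumption.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "movies-1990s.tsv")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(tsv_text + "\n")
    with open(out_path, encoding="utf-8") as f:
        written = f.read()

print(written)
```

Because the delimiter is a tab, the comma inside a title passes through untouched.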

#### Discussion: Loop approach
For those less comfortable with comprehensions, or just trying to get the hang of them, here's the same process using a loop-based approach:
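The loop version might look like the sketch below. Again the records are made up, and an in-memory `io.StringIO` buffer stands in for the real output file so that the result is easy to inspect.

```python
import io

# Hypothetical records standing in for american-movies.json (titles are made up).
movies = [
    {"title": "First Film, Part One", "year": 1989},
    {"title": "Second Film", "year": 1994},
    {"title": "Third Film, Revisited", "year": 1999},
    {"title": "Fourth Film", "year": 2003},
]

# Step 1: build the list of lists, keeping only the two columns we want.
kept = []
for movie in movies:
    year = int(movie["year"])
    if year < 1990 or year > 1999:
        continue  # outside the 1990s
    kept.append([movie["title"], str(year)])

# Step 2: write one tab-joined line per row.
out = io.StringIO()  # stands in for open(<output path>, "w")
for row in kept:
    out.write("\t".join(row) + "\n")

loop_tsv = out.getvalue()
print(loop_tsv)
```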

#### 4.1.2.4 Exercise: Making JSON file reading scalable
Create a specialized JSON serialization of the data in `'nobel-laureates.json'`. Specifically, create a file called `'data/nobel-laureates-lines.json'` that has each laureate's record serialized separately as a JSON object, with newlines (`'\n'`) in between as delimiters. As a follow-up, combine the line-by-line file reading syntax introduced in Section 1.4.1.5 with `json.loads()`, the deserializing counterpart of the `json.dumps()` string serialization function from Section 1.4.2.2, to _read only the first ten lines_. As you read these lines, load each from JSON and print the laureate's list of prizes.

#### Discussion: JSON objects on each line
Here, we're just making sure that each line of the file is interpretable as a JSON object. For practice, here's a line-by-line reader to interact with the scalably-stored data.
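The write-then-read round trip could be sketched as follows. The record layout (an `id` plus a `prizes` list) is an assumption about `nobel-laureates.json`, the records themselves are synthetic, and a temporary directory replaces the `data/` folder.

```python
import json
import os
import tempfile

# Synthetic records; the real file's fields are assumed, not known.
laureates = [
    {"id": i, "prizes": [{"year": 1900 + i, "category": "example"}]}
    for i in range(1, 13)
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "nobel-laureates-lines.json")

    # Write: one JSON object per line, newline-delimited.
    with open(path, "w", encoding="utf-8") as f:
        for laureate in laureates:
            f.write(json.dumps(laureate) + "\n")

    # Read: iterate lazily and stop after ten lines, so the rest of the
    # file is never loaded into memory.
    first_ten_prizes = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f):
            if line_number >= 10:
                break
            record = json.loads(line)
            first_ten_prizes.append(record["prizes"])
            print(record["prizes"])
```

`json.dumps()` never emits a raw newline inside a serialized object by default, which is what makes the newline safe as a record delimiter.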

#### 4.4.1.3 Exercise: Regex phone numbers
Read the file `phone-numbers.txt`. It contains a phone number in each line. \[Hint: use something like `lines = open("file.txt", "r").readlines()`\] Store only the phone numbers with the area code "215" in a list and print it out. Use regex-based pattern matching, not any other methods which occur to you.

#### Discussion: Character classes
Compared with the `[0-9]` syntax, the pre-defined character class `\d` makes the regular expression much more succinct. We want to make sure we put the `215` into our pattern explicitly, since it's the Philly area code!
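A small sketch of such a pattern is shown below. The phone number formats are assumptions, since the layout of `phone-numbers.txt` is not shown here, and the sample lines are invented to mimic what `readlines()` would return.

```python
import re

# Invented lines mimicking readlines() output (trailing newlines included).
lines = [
    "215-555-0123\n",
    "610-555-0142\n",
    "(215) 555-0199\n",
    "267-215-0000\n",  # 215 appears, but not as the area code
]

# ^ anchors the literal 215 to the start, so it can only match the area code.
# \d{3} and \d{4} replace the longer [0-9][0-9][0-9] style.
pattern = re.compile(r"^\(?215\)?[-.\s]?\d{3}[-.\s]?\d{4}$")

philly_numbers = [
    line.strip() for line in lines if pattern.match(line.strip())
]
print(philly_numbers)
```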

#### 4.4.1.8 Exercise: Names of the gods
In the cell below is some text. It's an extract from [A Clash of Kings](https://www.goodreads.com/book/show/10572.A_Clash_of_Kings), specifically, about a character's prayer to some fictional gods. Use regex to extract the names of these gods. Your output should be a list that looks something like `["the Father", "the Mother", "the Warrior"]`.

#### Discussion: Shaping a capitalized word
While case insensitivity on `the` didn't turn up any extra matches in this example, we did _have_ to use the capitalization of the names to separate them from other words that follow the determiner `the`.
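The sketch below illustrates the idea on an invented sentence (it is not the extract from the book). The pattern is one plausible choice, not necessarily the one used in the notes.

```python
import re

# Invented sample text, written only to exercise the pattern.
text = (
    "He asked the Father for judgment, the Mother for mercy, "
    "and the Warrior for strength, then left the sept before "
    "the candles burned out."
)

# [Tt] makes only the determiner case-insensitive; [A-Z] still requires
# the following word to be capitalized, which filters out 'the sept'.
god_names = re.findall(r"\b[Tt]he [A-Z][a-z]+", text)
print(god_names)
```

Note that passing `re.IGNORECASE` instead of writing `[Tt]` would also make `[A-Z]` match lowercase letters, so every word after `the` would match.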

#### 4.4.4.2 Exercise: Calculate your exact age
Calculate your own age using datetime parsing! Can you come up with a datetime format for your birthday that `dateutil.parser` doesn't recognize or recognizes incorrectly? If so, use the `datetime` module to specify the format exactly. Hint: review these docs:
- https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime
- https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

#### Discussion: Using the `dateparser` shortcut
Using `dateparser` to create our datetime object can be quite helpful when it works, but for this calculation we'd have to be careful when determining the ages of particularly old people, whose 2-digit birth years could be assigned to the wrong century. It seems the 2-digit year reference to `1985` works (can you figure out how `dateparser` determines century?), and as a result determining age becomes as simple as taking a difference with `-` (minus) and dividing the resulting `.days` attribute by 365.25 (accounting for leap years).
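For comparison, here is a standard-library sketch of the same arithmetic that uses `datetime.strptime` with an explicit format instead of the parser shortcut. The birthday string is hypothetical, and the reference date is fixed (rather than `datetime.now()`) so the result is reproducible.

```python
from datetime import datetime

# Hypothetical birthday written as month/day/2-digit year.
birthday_text = "03/14/85"

# An explicit format removes any guessing about field order.
# With %y, strptime maps 69-99 to 1969-1999 and 00-68 to 2000-2068.
birthday = datetime.strptime(birthday_text, "%m/%d/%y")

# Fixed reference date for a reproducible result; use datetime.now() in practice.
reference = datetime(2024, 3, 14)

# Subtracting datetimes gives a timedelta; .days / 365.25 approximates years.
age_years = (reference - birthday).days / 365.25
print(birthday.date(), round(age_years, 2))
```

Under `%y`, a birth year written as `30` would be read as 2030, which is exactly the kind of century error to watch for with older birth dates.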