# Homework 04

In this assignment, you will practice wrangling data in different formats, handling missing data, and working with text.

Whether you may use `for` loops depends on the question. **Please read each question's instructions carefully to determine if `for` loops can be used.** Generally, you should never iterate over the rows of a DataFrame using `for` loops. If you do, you will lose points.
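
As a rough illustration of what avoiding row-wise loops can look like, the sketch below transforms and filters a whole column at once. The DataFrame `demo` and its column names are made up for this example and are not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy data, not the assignment dataset.
demo = pd.DataFrame({"minutes": [1.5, 2.0, 0.5]})

# Vectorized arithmetic: one expression converts every row of the column.
demo["seconds"] = demo["minutes"] * 60

# Boolean mask: select rows by condition without looping over them.
long_runs = demo[demo["seconds"] > 60]
print(long_runs)
```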

## Part 1: Text wrangling and regular expressions
In this part, we will work with the citation file exported from the [Nature Review Article](https://www.nature.com/articles/s41586-020-2649-2) *Array Programming with NumPy*. Below we read the file into the Python variable `cite` and print the result for you to preview.

It is fine to use `for` loops in this part since it involves parsing text before putting it into a `DataFrame`.

In [1]:
# Run but do not modify this code
with open("numpy_nature.txt") as f:
    cite = f.read()

print(cite)

TY  - JOUR
AU  - Harris, Charles R.
AU  - Millman, K. Jarrod
AU  - van der Walt, Stéfan J.
AU  - Gommers, Ralf
AU  - Virtanen, Pauli
AU  - Cournapeau, David
AU  - Wieser, Eric
AU  - Taylor, Julian
AU  - Berg, Sebastian
AU  - Smith, Nathaniel J.
AU  - Kern, Robert
AU  - Picus, Matti
AU  - Hoyer, Stephan
AU  - van Kerkwijk, Marten H.
AU  - Brett, Matthew
AU  - Haldane, Allan
AU  - del Río, Jaime Fernández
AU  - Wiebe, Mark
AU  - Peterson, Pearu
AU  - Gérard-Marchant, Pierre
AU  - Sheppard, Kevin
AU  - Reddy, Tyler
AU  - Weckesser, Warren
AU  - Abbasi, Hameer
AU  - Gohlke, Christoph
AU  - Oliphant, Travis E.
PY  - 2020
DA  - 2020/09/01
TI  - Array programming with NumPy
JO  - Nature
SP  - 357
EP  - 362
VL  - 585
IS  - 7825
AB  - Array programming provides a powerful, compact and expressive syntax for accessing, manipulating and operating on data in vectors, matrices and higher-dimensional arrays. NumPy is the primary array programming library for the Python language. It has an essential r

### Question 1 (4 points)
There are several authors, each recorded on a separate line beginning with `AU`. In the variable `q1` put a Python list of all of the author names formatted as in the file but without the extra characters and whitespace (i.e., without the `AU  - ` or the newline `\n` characters). Your list should be of the form `['Harris, Charles R.', 'Millman, K. Jarrod', ..., 'Oliphant, Travis E.']`. When you are finished, print the resulting list. 
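
One possible way to pull out every line that starts with a given tag is a multiline regular expression. The sketch below uses a made-up `sample` string and the `KW` tag purely for illustration; it is not the assignment file and is only one of several workable approaches.

```python
import re

# Hypothetical sample text in the same "TAG  - value" layout; not the assignment file.
sample = "TY  - JOUR\nKW  - arrays\nKW  - linear algebra\nPY  - 2020\n"

# With re.MULTILINE, ^ and $ match at the start and end of every line.
# "." does not match a newline, so the captured group holds only the value.
keywords = re.findall(r"^KW  - (.+)$", sample, flags=re.MULTILINE)
print(keywords)
```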

In [None]:
# Put your code to answer the question here, and store the answer as the following variable
# Please keep author names in the original order

q1 = ... # List of Authors


print(q1)

### Question 2 (4 points)
Create a Pandas DataFrame that contains three columns: one for first names, one for middle names, and one for last names for all of the authors. Use the column names of the example table below. Keep the same order as the original text file and use the default primary index (the row labels) of 0, 1, 2, etc. as shown below. You are welcome to use the results of the prior question to answer this problem.

|      | first   | middle | last         |
| ---- | ------- | ------ | ------------ |
| 0    | Charles | R.     | Harris       |
| 1    | K.      | Jarrod | Millman      |
| 2    | Stéfan  | J.     | van der Walt |
| 3    | Ralf    |        | Gommers      |
| 4    | Pauli   |        | Virtanen     |

Note that some authors do not have any middle names, in which case you can leave the middle name column blank.
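
A minimal sketch of splitting `"Last, First Middle"` strings into parts is shown below. The names are invented for the example, and the sketch assumes that everything after the first given name counts as the middle name, which may or may not match how you choose to treat every author.

```python
import pandas as pd

# Hypothetical names in "Last, First Middle" form; not the assignment data.
names = ["Doe, Jane Q.", "de la Cruz, Ana", "Roe, R. Sam"]

rows = []
for name in names:
    # Split once on the comma so multi-word last names stay intact.
    last, given = name.split(", ", 1)
    # partition returns ("first", " ", "rest"); "rest" is "" when there is no middle name.
    first, _, middle = given.partition(" ")
    rows.append({"first": first, "middle": middle, "last": last})

demo = pd.DataFrame(rows, columns=["first", "middle", "last"])
print(demo)
```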

In [None]:
# Run but do not modify this code
import pandas as pd

In [None]:
# Put your code to answer the question here, and store the answer as the following variable
# Please keep author names in the original order

q2 = ... # Pandas DataFrame with three columns

...

q2 # This last line will display the table to check for correctness

### Question 3 (12 points)
Below we extract the abstract from the citation and store it in a string variable `abstract`. Write regular expressions to answer the following questions about the abstract.

1. In `q3_1` put a list of the starting indices of every occurrence of `NumPy` in the abstract (i.e., the index of the `N` wherever `NumPy` occurs in the `abstract` string). The match should be case sensitive.
2. In `q3_2` put all of the capitalized words in `abstract`, including words with extra capitalized letters like `NumPy` and `NumPy-like`.
3. In `q3_3` put all of the words that immediately follow `NumPy`, but do not include the word `NumPy` itself. For the one occurrence where it is hyphenated (`NumPy-like`), use `-like`.
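
The three kinds of patterns above can be sketched on a toy sentence as follows. The string `text` and the search word `Pandas` are stand-ins chosen for this example; the real abstract may contain cases (such as punctuation right after the word) that these patterns do not handle.

```python
import re

# Hypothetical sentence, not the assignment abstract.
text = "Pandas builds on NumPy arrays. Some Pandas-style tools wrap Pandas objects."

# finditer yields match objects; .start() is the index of the first matched character.
starts = [m.start() for m in re.finditer(r"Pandas", text)]

# A word boundary, an uppercase letter, then any word characters or hyphens.
capitalized = re.findall(r"\b[A-Z][\w-]*", text)

# The capture group keeps only what follows the word: an optional hyphen plus a word.
following = re.findall(r"Pandas\s?(-?\w+)", text)

print(starts, capitalized, following)
```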

In [None]:
# Run but do not modify this code
import re
abstract_query = re.compile(r"AB  - (.+)")
abstract = re.search(abstract_query, cite).group(1)
print(abstract)

In [None]:
# Put your code to answer the question here, and store the answer as the following variable
# Please allow duplication in your list

q3_1 = ... # List of indices
q3_2 = ... # List of all the capitalized words
q3_3 = ... # List of words that immediately follow NumPy

...

# Run but do not modify this code
print('q3_1:', q3_1)
print('q3_2:', q3_2)
print('q3_3:', q3_3)

## Part 2: Cleaning up more system logs CSV

In this part, we work with messy tabular data in the form of a poorly formatted CSV file containing data about programs running on computer systems. It contains all of the data about system time and memory from Worked Example 4, but also includes new information about user ids and machine ids, and some data are missing in every column. (The user ids are made up and do not correspond to any real individuals.)

### Question 4 (12 points, 2 manual points)

Below, we import the dataset using the Pandas `read_csv` function that creates a dataframe. Run the code; it will preview the first five rows.

**Do not** use `for` loops unless they are guaranteed to iterate fewer than 50 times. The manual points are partially for checking this. You are also likely to fail some tests if you loop, because iterating over the entire `DataFrame` sometimes causes formatting issues.

In [None]:
# Run but do not modify this code
import pandas as pd
import numpy as np
sys_df = pd.read_csv("more_monitor.csv")
sys_df.head()

There are several formatting issues with the default import. Address the following.

1. The `System User ID` and `System Machine ID` columns contain string data with the redundant prefixes `User ID: ` and `Machine ID: ` in every row that has data. Remove these prefixes so that the columns only contain the user ids and machine ids themselves. Leave the `?` cells as is. For example, the first row should just have `yw22` in the `System User ID` column and `Carrot` in the `System Machine ID` column, while the second row should keep `?` for both columns.

2. The first three columns (`System Time second`, `System Memory GB`, and `System Memory MB`) contain numerical data but are currently formatted as strings, with a redundant prefix repeating the column label and with missing data represented as the string `?` instead of the NumPy `NaN` sentinel value. Fix this so that each value in the first three columns is either a single numerical value or `NaN` (note: use the actual `np.nan` sentinel value, not a string containing the characters `N`, `a`, and `N`). For example, when you are done, the first three columns of the first row should all have `NaN` values, the second row should be `40`, `3`, and `382`, and so on. Note that the rows at index `400` and onward have System Time recorded in minutes instead of seconds; be sure to convert these to seconds by multiplying by 60.

3. Currently the System Memory is split across two columns, one for the GB and one for the MB. For example, the total memory of the first program is 3 GB and 414 MB. Instead, represent the full system memory in the `System Memory GB` column, and get rid of the `System Memory MB` column. To do so, you need to convert the values in the MB column to GB (1 MB is 0.001 GB) and add that to the GB column, then use the [`drop` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html). Missing values should remain missing after this transformation.

When you are finished, `sys_df` should have the above issues corrected. Run both of the cells with `sys_df.head()` and `sys_df.tail()` to show the first and last few rows of your dataframe. Create a variable `q4` and have it point to this DataFrame.
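
The kinds of column-wise operations described above can be sketched on a small made-up DataFrame. Every column name and value in `demo` is hypothetical, and the assumption that rows from index `2` onward are in minutes exists only to mirror the idea; the real file may be formatted differently.

```python
import pandas as pd

# Hypothetical toy data, not the assignment dataset.
demo = pd.DataFrame({
    "Time second": ["Time second: 30", "?", "Time second: 2"],
    "Size GB": ["Size GB: 2", "?", "Size GB: 5"],
    "Size MB": ["Size MB: 250", "Size MB: 100", "?"],
    "Owner": ["Owner: ab1", "?", "Owner: cd2"],
})

# Strip a literal prefix from the whole column; "?" cells have no prefix and stay as is.
demo["Owner"] = demo["Owner"].str.replace("Owner: ", "", regex=False)

# Looping over three column names (not over rows) keeps the iteration count tiny.
for col in ["Time second", "Size GB", "Size MB"]:
    stripped = demo[col].str.replace(col + ": ", "", regex=False)
    # errors="coerce" turns anything non-numeric, such as "?", into NaN.
    demo[col] = pd.to_numeric(stripped, errors="coerce")

# Assumed for this toy example only: rows from index 2 onward are in minutes.
demo.loc[2:, "Time second"] *= 60

# NaN propagates through arithmetic, so a missing part yields a missing total.
demo["Size GB"] = demo["Size GB"] + demo["Size MB"] * 0.001
demo = demo.drop(columns="Size MB")
print(demo)
```

Note that `errors="coerce"` converts every unparseable value to `NaN`, not only `?`, so it is worth checking that no other unexpected strings are silently discarded.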

In [None]:
# Put your code to answer the question here, and store sys_df in q4 for full credit


...

q4 = sys_df

# Run but do not modify this code
print(q4.head())

# Run but do not modify this code
print(q4.tail())

### Question 5 (12 points)
The `sys_df` dataframe from question 4 should now be a little easier to read and use. Answer the following questions about `sys_df`.

1. How many rows are missing data (have a `?`) in the `System Machine ID` column? Put your answer in `q5_1`.
2. What is the average value of `System Memory GB` among the rows that are missing data (have a `?`) in the `System User ID` column?  Put your answer in `q5_2`.
3. How many rows are missing data in both the `System Time second` and `System Memory GB` columns? Put your answer in `q5_3`.
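
Questions of this kind usually reduce to boolean masks. The sketch below uses an invented DataFrame `demo` with hypothetical column names to show counting with a mask, averaging a filtered column, and combining two missing-value masks.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the assignment dataset.
demo = pd.DataFrame({
    "Owner": ["ab1", "?", "cd2", "?"],
    "Time": [30.0, np.nan, np.nan, 12.0],
    "Size": [2.25, np.nan, 1.5, 4.0],
})

# Summing a boolean Series counts its True values.
n_unknown_owner = (demo["Owner"] == "?").sum()

# .loc with a mask selects rows; mean() skips NaN by default.
mean_size_unknown = demo.loc[demo["Owner"] == "?", "Size"].mean()

# & combines masks element-wise: True only where both columns are missing.
n_both_missing = (demo["Time"].isna() & demo["Size"].isna()).sum()

print(n_unknown_owner, mean_size_unknown, n_both_missing)
```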

In [None]:
# Put your code to answer the question here, and store answers as the following variables

q5_1 = ... # number of rows
q5_2 = ... # average System Memory GB
q5_3 = ... # number of missing rows

...

# Run but do not modify this code
print('q5_1:', q5_1)
print('q5_2:', q5_2)
print('q5_3:', q5_3)

## Part 3: Wrangling FDA JSON Dataset 
In this part, we work with a messy JSON dataset containing information about several drug labels.

### Question 6 (12 points)
Below we import the `FDADrugLabel.json` file into the `labels` variable. This is the same dataset as the worked example. The resulting Python object is somewhat messy; we encourage you to explore the data before answering the questions.

You may use `for` loops for this question.

In [None]:
# Run but do not modify this code
import json
with open("FDADrugLabel.json", encoding="utf-8") as f:
    labels = json.load(f)

In [None]:
# Use this cell to explore (and you are welcome to add more cells as needed)

Answer the following questions.

1. In `q6_1` put the average number of key/value (or name/value) pairs for the drugs.

2. In `q6_2` put the list of all of the `manufacturer_names` without any other information. `manufacturer_names` is not a top-level key/name, so you will need to find where the names are located and work out how to extract them.

3. In `q6_3` put how many drugs contain the string `child` anywhere in their `warnings` (including as part of larger strings like `children`). Note: `warnings` is a top level key/name.
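
Working with nested JSON mostly means walking through dictionaries and lists. The sketch below assumes a list of dictionaries with made-up keys (`notes`, `meta`, `maker`); the actual structure of `labels` may differ, so explore it first rather than relying on this layout.

```python
# Hypothetical records with made-up keys; the real data may be structured differently.
records = [
    {"id": "a", "notes": ["Keep away from heat."], "meta": {"maker": ["Acme Labs"]}},
    {"id": "b", "meta": {"maker": ["Beta Pharma"]}},
    {"id": "c", "notes": ["Store in a cool place.", "Avoid heat sources."], "meta": {}, "extra": 1},
]

# len() of a dict is its number of key/value pairs.
avg_keys = sum(len(r) for r in records) / len(records)

# .get with a default avoids a KeyError when a nested key is absent.
makers = []
for r in records:
    makers.extend(r.get("meta", {}).get("maker", []))

# Count records where any string under "notes" contains the substring "heat".
n_heat = sum(any("heat" in s for s in r.get("notes", [])) for r in records)

print(avg_keys, makers, n_heat)
```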

In [None]:
# Put your code to answer the question here, and store answers as the following variables

q6_1 = ... # average number of key/value pairs
q6_2 = ... # list of manufacturer names
q6_3 = ... # number of drugs with `child` in their warnings

...

# Run but do not modify this code
print('q6_1:', q6_1)
print('q6_2:', q6_2)
print('q6_3:', q6_3)

## Submitting

You should make sure any code that you write to answer the questions is included in this notebook. We recommend you go to the Kernel option and choose "Restart & Run All." Double check that your entire notebook runs correctly and generates the expected output. Finally, make sure to save your work (the timestamp at the top tells you the last checkpoint and whether there are unsaved changes). When you finish, submit your assignment at [Gradescope](http://gradescope.com/).