# Extracting non-tabular data

### Ingesting JSON data with pandas
When developing a data pipeline, you may have to work with non-tabular data and data sources, such as APIs or JSON files. In this exercise, we'll practice extracting data from a JSON file using pandas.

pandas has been imported as pd, and the JSON file you'll ingest is stored at the path "testing_scores.json".

In [32]:
import pandas as pd


def extract(file_path):
  # Read the JSON file into a DataFrame
  return pd.read_json(file_path, orient="records")

# Call the extract function with the appropriate path, assign to raw_testing_scores
raw_testing_scores = extract("../data/testing_scores.json")

# Output the head of the DataFrame
print(raw_testing_scores.head())


       id        street_address       city  math_score  reading_score  \
0  02M260  425 West 33rd Street  Manhattan         NaN            NaN   
1  06M211    650 Academy Street  Manhattan         NaN            NaN   
2  01M539   111 Columbia Street  Manhattan       657.0          601.0   
3  02M294      350 Grand Street  Manhattan       395.0          411.0   
4  02M308      350 Grand Street  Manhattan       418.0          428.0   

   writing_score  
0            NaN  
1            NaN  
2          601.0  
3          387.0  
4          415.0  
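
As a minimal sketch of how `pd.read_json` treats record-oriented JSON, the snippet below parses an inline string instead of the exercise file. The two records are trimmed down from the output above, and `sample_json` and `sample_scores` are illustrative names, not part of the exercise. A JSON `null` comes back as `NaN`, which is why the score columns above are floats.

```python
from io import StringIO

import pandas as pd

# Two trimmed-down records; the first has no math score
sample_json = """
[
    {"id": "02M260", "city": "Manhattan", "math_score": null},
    {"id": "01M539", "city": "Manhattan", "math_score": 657}
]
"""

# Each object in the top-level list becomes one row
sample_scores = pd.read_json(StringIO(sample_json), orient="records")

print(sample_scores)
print(sample_scores["math_score"].isna().tolist())
```

Wrapping the string in `StringIO` gives `read_json` a file-like object, so the same call works for a path or for in-memory text.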


### Reading JSON data into memory
When data is stored in JSON format, it's not always easy to load into a DataFrame. This is the case for the "nested_school_scores.json" file, where each school's scores are nested inside another object. Here, the data has to be manipulated manually before it can be stored in a DataFrame.

To help get you started, pandas has been loaded into the workspace as pd.

In [33]:
def extract(file_path):
    # Read the JSON file into a DataFrame, orient by index
    return pd.read_json(file_path, orient="index")

# Call the extract function, pass in the desired file_path
raw_testing_scores = extract("../data/nested_school_scores.json")
print(raw_testing_scores.head())


               street_address       city  \
01M539    111 Columbia Street  Manhattan   
02M294       220 Henry Street  Manhattan   
03M485  755 East 100th Street  Manhattan   
04Q301   18-01 Cornaga Avenue     Queens   

                                               scores  
01M539  {'math': 657, 'reading': 601, 'writing': 601}  
02M294  {'math': 683, 'reading': 592, 'writing': 588}  
03M485  {'math': 720, 'reading': 635, 'writing': 642}  
04Q301  {'math': 598, 'reading': 567, 'writing': 573}  


In [34]:
# Import the json library
import json

def extract(file_path):
    with open(file_path, "r") as json_file:
        # Load the data from the JSON file
        raw_data = json.load(json_file)
    return raw_data

raw_testing_scores = extract("../data/nested_school_scores.json")

# Print the raw_testing_scores
print(raw_testing_scores)

{'01M539': {'street_address': '111 Columbia Street', 'city': 'Manhattan', 'scores': {'math': 657, 'reading': 601, 'writing': 601}}, '02M294': {'street_address': '220 Henry Street', 'city': 'Manhattan', 'scores': {'math': 683, 'reading': 592, 'writing': 588}}, '03M485': {'street_address': '755 East 100th Street', 'city': 'Manhattan', 'scores': {'math': 720, 'reading': 635, 'writing': 642}}, '04Q301': {'street_address': '18-01 Cornaga Avenue', 'city': 'Queens', 'scores': {'math': 598, 'reading': 567, 'writing': 573}}}
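
For comparison, `json.load` reads from an open file object while `json.loads` parses a string. The sketch below uses `json.loads` on a one-school excerpt of the data above so that it runs without the exercise file; `sample_text` and `sample_data` are illustrative names.

```python
import json

# One-school excerpt of the nested structure, written as a JSON string
sample_text = """
{
    "01M539": {
        "street_address": "111 Columbia Street",
        "city": "Manhattan",
        "scores": {"math": 657, "reading": 601, "writing": 601}
    }
}
"""

# json.loads parses a string; json.load (used above) reads from a file object
sample_data = json.loads(sample_text)

# JSON objects become dicts, so nested values are reached by chaining keys
print(type(sample_data).__name__)
print(sample_data["01M539"]["scores"]["math"])
```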


### Iterating over dictionaries
Once JSON data is loaded into a dictionary, you can leverage Python's built-in tools to iterate over its keys and values.

The "nested_school_scores.json" file has been read into a dictionary stored in the raw_testing_scores variable, which takes the following form:

{
    "01M539": {
        "street_address": "111 Columbia Street",
        "city": "Manhattan",
        "scores": {
              "math": 657,
              "reading": 601,
              "writing": 601
        }
  }, ...
}

In [35]:
raw_testing_scores_keys = []

# Iterate through the keys of the raw_testing_scores dictionary
for school_id in raw_testing_scores.keys():
    # Append each key to the raw_testing_scores_keys list
    raw_testing_scores_keys.append(school_id)

print(raw_testing_scores_keys[0:3])


['01M539', '02M294', '03M485']


In [36]:
raw_testing_scores_values = []

# Iterate through the values of the raw_testing_scores dictionary
for school_info in raw_testing_scores.values():
	raw_testing_scores_values.append(school_info)

print(raw_testing_scores_values[0:3])


[{'street_address': '111 Columbia Street', 'city': 'Manhattan', 'scores': {'math': 657, 'reading': 601, 'writing': 601}}, {'street_address': '220 Henry Street', 'city': 'Manhattan', 'scores': {'math': 683, 'reading': 592, 'writing': 588}}, {'street_address': '755 East 100th Street', 'city': 'Manhattan', 'scores': {'math': 720, 'reading': 635, 'writing': 642}}]


In [37]:
raw_testing_scores_keys = []
raw_testing_scores_values = []

# Iterate through the keys and values of the raw_testing_scores dictionary
for school_id, school_info in raw_testing_scores.items():
	raw_testing_scores_keys.append(school_id)
	raw_testing_scores_values.append(school_info)

print(raw_testing_scores_keys[0:3])
print(raw_testing_scores_values[0:3])


['01M539', '02M294', '03M485']
[{'street_address': '111 Columbia Street', 'city': 'Manhattan', 'scores': {'math': 657, 'reading': 601, 'writing': 601}}, {'street_address': '220 Henry Street', 'city': 'Manhattan', 'scores': {'math': 683, 'reading': 592, 'writing': 588}}, {'street_address': '755 East 100th Street', 'city': 'Manhattan', 'scores': {'math': 720, 'reading': 635, 'writing': 642}}]
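
The loops above spell out each step. When only the collected keys or values are needed, passing the dictionary views to `list` gives the same lists. A small sketch on a two-school excerpt of the data (scores omitted for brevity), stored under the hypothetical name `sample_schools`:

```python
# Two-school excerpt of the nested dictionary
sample_schools = {
    "01M539": {"street_address": "111 Columbia Street", "city": "Manhattan"},
    "02M294": {"street_address": "220 Henry Street", "city": "Manhattan"},
}

# list() over the dictionary views collects keys and values in insertion order
sample_keys = list(sample_schools.keys())
sample_values = list(sample_schools.values())

# .items() yields (key, value) tuples, handy when both are needed together
for school_id, school_info in sample_schools.items():
    print(school_id, school_info["city"])
```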


### Parsing data from dictionaries
When JSON data is loaded into memory, the resulting dictionary can be complicated. A value in a key-value pair may itself be a dictionary; such structures are called nested dictionaries. They are frequently encountered when dealing with APIs or other JSON data. In this exercise, you will practice extracting data from nested dictionaries and handling missing values.

The dictionary below is stored in the school variable. Good luck!

{
    "street_address": "111 Columbia Street",
    "city": "Manhattan",
    "scores": {
        "math": 657,
        "reading": 601
    }
}

In [38]:
import pandas as pd


def extract(file_path):
  # Read the JSON file into a DataFrame
  return pd.read_json(file_path, orient="records")

# Call the extract function with the appropriate path, assign to school
school = extract("../data/school.json")

# Output the head of the DataFrame
print(school.head())

          street_address       city                         scores
0    111 Columbia Street  Manhattan  {'math': 657, 'reading': 601}
1       220 Henry Street  Manhattan  {'math': 683, 'reading': 592}
2  755 East 100th Street  Manhattan  {'math': 720, 'reading': 635}
3   18-01 Cornaga Avenue     Queens  {'math': 598, 'reading': 567}


In [39]:
# school was read as a DataFrame above; take its first record as a dictionary
school = school.iloc[0].to_dict()

# Parse the street_address from the dictionary
street_address = school.get("street_address")

# Parse the scores dictionary
scores = school.get("scores", {})

# Try to parse the math, reading and writing values from scores
math_score = scores.get("math", None)
reading_score = scores.get("reading", None)
writing_score = scores.get("writing", None)


print(f"Street Address: {street_address}")
print(f"Math: {math_score}, Reading: {reading_score}, Writing: {writing_score}")


Street Address: 111 Columbia Street
Math: 657, Reading: 601, Writing: None
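
A minimal sketch of the `.get()` pattern on a plain dictionary, using the record shown at the top of this exercise plus a hypothetical record that has no "scores" key at all. The names `sample_school`, `school_without_scores` and `parse_scores` are introduced here for illustration only.

```python
# The dictionary shown above, which has no "writing" score
sample_school = {
    "street_address": "111 Columbia Street",
    "city": "Manhattan",
    "scores": {"math": 657, "reading": 601},
}

# Hypothetical record with no "scores" key at all
school_without_scores = {"street_address": "220 Henry Street", "city": "Manhattan"}


def parse_scores(record):
    # Fall back to an empty dict so the chained .get() calls never hit None
    scores = record.get("scores", {})
    return scores.get("math"), scores.get("reading"), scores.get("writing")


print(parse_scores(sample_school))
print(parse_scores(school_without_scores))
```

Supplying `{}` as the default for the outer lookup is what keeps the inner `.get()` calls safe; without it, a missing "scores" key would return `None` and the next `.get()` would raise an `AttributeError`.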


### Transforming JSON data
When reading data from JSON format into a dictionary, you'll often have to apply some manual transformation before the data can be stored in a DataFrame. This is common when working with nested dictionaries, which you'll have the opportunity to explore in this exercise.

The "nested_school_scores.json" file has been read into a dictionary available in the raw_testing_scores variable, which takes the following form:

{
    "01M539": {
        "street_address": "111 Columbia Street",
        "city": "Manhattan",
        "scores": {
              "math": 657,
              "reading": 601,
              "writing": 601
        }
  }, ...
}

In [40]:
normalized_testing_scores = []

# Loop through each of the dictionary key-value pairs
for school_id, school_info in raw_testing_scores.items():
    # Fall back to an empty dict if a school has no "scores" entry
    scores = school_info.get("scores", {})
    normalized_testing_scores.append([
        school_id,
        school_info.get("street_address"),  # Pull the "street_address"
        school_info.get("city"),
        scores.get("math", 0),
        scores.get("reading", 0),
        scores.get("writing", 0),
    ])

print(normalized_testing_scores)


[['01M539', '111 Columbia Street', 'Manhattan', 657, 601, 601], ['02M294', '220 Henry Street', 'Manhattan', 683, 592, 588], ['03M485', '755 East 100th Street', 'Manhattan', 720, 635, 642], ['04Q301', '18-01 Cornaga Avenue', 'Queens', 598, 567, 573]]
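
pandas also provides `pd.json_normalize`, which flattens nested records without a hand-written loop. The sketch below is an alternative to the loop above rather than what the exercise asks for; it assumes the same dictionary shape and uses a two-school excerpt under the hypothetical name `sample_scores`.

```python
import pandas as pd

sample_scores = {
    "01M539": {
        "street_address": "111 Columbia Street",
        "city": "Manhattan",
        "scores": {"math": 657, "reading": 601, "writing": 601},
    },
    "02M294": {
        "street_address": "220 Henry Street",
        "city": "Manhattan",
        "scores": {"math": 683, "reading": 592, "writing": 588},
    },
}

# Turn the id-keyed dictionary into a list of records that carry their own id
records = [
    {"school_id": school_id, **school_info}
    for school_id, school_info in sample_scores.items()
]

# Nested keys are flattened into dotted column names such as "scores.math"
flat_scores = pd.json_normalize(records)

print(flat_scores.columns.tolist())
```

The manual loop gives full control over defaults and column order, while `json_normalize` is shorter when the nesting is regular.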


### Transforming and cleaning DataFrames
Once data has been curated into a cleaned Python data structure, such as a list of lists, it's easy to convert this into a pandas DataFrame. You'll practice doing just this with the data that was curated in the last exercise.

As usual, pandas has been imported as pd, and the normalized_testing_scores variable stores the list of each school's testing data, as shown below.

[
    ['01M539', '111 Columbia Street', 'Manhattan', 657.0, 601.0, 601.0],
    ...
]

In [41]:
# Create a DataFrame from the normalized_testing_scores list
normalized_data = pd.DataFrame(normalized_testing_scores)

# Set the column names
normalized_data.columns = [
    "school_id", "street_address", "city",
    "avg_score_math", "avg_score_reading", "avg_score_writing",
]

normalized_data = normalized_data.set_index("school_id")
print(normalized_data.head())


                  street_address       city  avg_score_math  \
school_id                                                     
01M539       111 Columbia Street  Manhattan             657   
02M294          220 Henry Street  Manhattan             683   
03M485     755 East 100th Street  Manhattan             720   
04Q301      18-01 Cornaga Avenue     Queens             598   

           avg_score_reading  avg_score_writing  
school_id                                        
01M539                   601                601  
02M294                   592                588  
03M485                   635                642  
04Q301                   567                573  
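
The column names can also be passed directly to the `pd.DataFrame` constructor through its `columns` argument, which avoids the separate assignment. A sketch using two rows copied from the previous exercise's output; `sample_rows`, `column_names` and `sample_frame` are illustrative names.

```python
import pandas as pd

sample_rows = [
    ["01M539", "111 Columbia Street", "Manhattan", 657, 601, 601],
    ["02M294", "220 Henry Street", "Manhattan", 683, 592, 588],
]

column_names = [
    "school_id", "street_address", "city",
    "avg_score_math", "avg_score_reading", "avg_score_writing",
]

# Name the columns at construction time, then promote school_id to the index
sample_frame = pd.DataFrame(sample_rows, columns=column_names).set_index("school_id")

print(sample_frame.loc["02M294", "avg_score_math"])
```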


### Filling missing values with pandas
When building data pipelines, it's inevitable that you'll stumble upon missing data. In some cases, you may want to remove these records from the dataset. But in others, you'll need to impute values for the missing information. In this exercise, you'll practice using pandas to impute missing test scores.

Data from the file "testing_scores.json" has been read into a DataFrame, and is stored in the variable raw_testing_scores. In addition to this, pandas has been loaded as pd.

In [43]:
# raw_testing_scores was overwritten with a dictionary above, so re-read the file
raw_testing_scores = pd.read_json("../data/testing_scores.json", orient="records")

# Print the head of the `raw_testing_scores` DataFrame
print(raw_testing_scores.head())

       id        street_address       city  math_score  reading_score  \
0  02M260  425 West 33rd Street  Manhattan         NaN            NaN   
1  06M211    650 Academy Street  Manhattan         NaN            NaN   
2  01M539   111 Columbia Street  Manhattan       657.0          601.0   
3  02M294      350 Grand Street  Manhattan       395.0          411.0   
4  02M308      350 Grand Street  Manhattan       418.0          428.0   

   writing_score  
0            NaN  
1            NaN  
2          601.0  
3          387.0  
4          415.0  


In [None]:
# Fill NaN values with the average from that column
raw_testing_scores["math_score"] = raw_testing_scores["math_score"].fillna(raw_testing_scores["math_score"].mean())

# Print the head of the raw_testing_scores DataFrame
print(raw_testing_scores.head())


In [None]:
def transform(raw_data):
    # Fill NaN values with each column's mean; fillna returns a new DataFrame
    return raw_data.fillna(
        value={
            "math_score": raw_data["math_score"].mean(),
            "reading_score": raw_data["reading_score"].mean(),
            "writing_score": raw_data["writing_score"].mean(),
        }
    )

clean_testing_scores = transform(raw_testing_scores)

# Print the head of the clean_testing_scores DataFrame
print(clean_testing_scores.head())
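
To make the imputation concrete, here is a small sketch on a three-row frame. The numbers are borrowed from the earlier output but arranged for illustration, and the names `sample_scores`, `math_mean` and `filled_scores` are not part of the exercise. `mean()` skips `NaN` by default, so the fill value is the average of the observed scores only.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing math score
sample_scores = pd.DataFrame({
    "math_score": [657.0, np.nan, 395.0],
    "reading_score": [601.0, 411.0, 428.0],
})

# mean() ignores NaN, so this is (657 + 395) / 2
math_mean = sample_scores["math_score"].mean()

# fillna returns a new DataFrame; the dict maps column name -> fill value
filled_scores = sample_scores.fillna(value={"math_score": math_mean})

print(filled_scores["math_score"].tolist())
```

Because `fillna` is called without `inplace=True`, `sample_scores` keeps its missing value and only `filled_scores` holds the imputed column.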

### Grouping data with pandas
The output of a data pipeline is typically a "modeled" dataset. This dataset provides data consumers easy access to information, without having to perform much manipulation. Grouping data with pandas helps to build modeled datasets.

pandas has been imported as pd, and the raw_testing_scores DataFrame contains data in the following form:

              street_address       city  math_score  reading_score  writing_score
01M539   111 Columbia Street  Manhattan       657.0          601.0          601.0
02M294      350 Grand Street  Manhattan       395.0          411.0          387.0
02M308      350 Grand Street  Manhattan       418.0          428.0          415.0

In [None]:
def find_street_name(street_address):
    # Extract street name by splitting and taking relevant parts
    # This assumes format like "111 Columbia Street"
    parts = street_address.split()
    # Return everything except the first part (number)
    return " ".join(parts[1:])

def transform(raw_data):
    # Use the apply function to extract the street_name from the street_address
    raw_data["street_name"] = raw_data["street_address"].apply(
        # Pass the correct function to the apply method
        find_street_name
    )
    return raw_data

# Transform the raw_testing_scores DataFrame
cleaned_testing_scores = transform(raw_testing_scores)

# Print the head of the cleaned_testing_scores DataFrame
print(cleaned_testing_scores.head())
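
The cell above derives `street_name`; a grouping step could then aggregate on that column. The sketch below rebuilds the three sample rows shown earlier in this section, derives `street_name` the same way, and averages the scores per street. Grouping by `street_name` and using the mean are assumptions made for illustration, since the source does not show the grouping step itself.

```python
import pandas as pd

sample_scores = pd.DataFrame(
    {
        "street_address": ["111 Columbia Street", "350 Grand Street", "350 Grand Street"],
        "city": ["Manhattan", "Manhattan", "Manhattan"],
        "math_score": [657.0, 395.0, 418.0],
        "reading_score": [601.0, 411.0, 428.0],
        "writing_score": [601.0, 387.0, 415.0],
    },
    index=["01M539", "02M294", "02M308"],
)

# Drop the leading house number, as find_street_name does above
sample_scores["street_name"] = sample_scores["street_address"].apply(
    lambda address: " ".join(address.split()[1:])
)

score_columns = ["math_score", "reading_score", "writing_score"]

# One row per street, holding the mean of each score column
scores_by_street = sample_scores.groupby("street_name")[score_columns].mean()

print(scores_by_street)
```

The two schools at 350 Grand Street collapse into a single "Grand Street" row, which is the kind of summarized view a modeled dataset typically exposes.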