In [1]:
import gzip
import json

import pandas as pd

## Prepare Data 

The first thing we need to do is access the file that contains the data we need. We've done this using multiple strategies before, but this time around, we're going to use the command line.

Open a terminal window and navigate to the directory where the data for this project is located.

What's the Linux command line?
Navigate a file system using the Linux command line.
As we've seen in our other projects, datasets can be large or small, messy or clean, and complex or easy to understand. Regardless of how the data looks, though, it needs to be saved in a file somewhere, and when that file gets too big, we need to compress it. Compressed files are easier to store because they take up less space. If you've ever come across a ZIP file, you've worked with compressed data.

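The file we're using for this project is compressed, so we'll need a file utility called gzip to decompress it. In the command below, the -d flag decompresses the file, -f overwrites any existing output, and -k keeps the original .gz file.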

In [None]:
%%bash
cd data
gzip -dfk poland-bankruptcy-data-2009.json.gz
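
If you'd rather stay inside Python, the same step can be approximated with the standard library. The following is a minimal sketch, not the method this project uses: `gunzip_keep` is a hypothetical helper name, and it assumes the archive name ends in `.gz`. It writes the decompressed copy next to the archive and leaves the archive in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_keep(src):
    """Decompress a .gz file next to itself and keep the original archive."""
    src_path = Path(src)
    dest_path = src_path.with_suffix("")  # drop the trailing ".gz"
    # Opening the destination with "wb" overwrites any existing file
    with gzip.open(src_path, "rb") as f_in, open(dest_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # stream the bytes in chunks
    return dest_path


# Example call (assumes the file exists at this path):
# gunzip_keep("data/poland-bankruptcy-data-2009.json.gz")
```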

## Explore

Now that we've decompressed the data, let's take a look and see what's there.

In the terminal window, examine the first 10 lines of poland-bankruptcy-data-2009.json.
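
In a terminal, `head` prints the first 10 lines of a file by default. If a terminal isn't handy, a rough Python counterpart might look like the sketch below. `peek` is a hypothetical helper, not part of this project, and it assumes the file is UTF-8 text.

```python
from itertools import islice


def peek(path, n=10):
    """Return the first `n` lines of a text file without reading the rest."""
    with open(path, "r", encoding="utf-8") as f:
        # islice stops after n lines, so large files are not loaded in full
        return [line.rstrip("\n") for line in islice(f, n)]


# Example call (assumes the decompressed file exists at this path):
# for line in peek("data/poland-bankruptcy-data-2009.json"):
#     print(line)
```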

Load the data into a DataFrame.

Read a JSON file into a DataFrame using pandas.

In [None]:
df = pd.read_json("data/poland-bankruptcy-data-2009.json")
df.head()


Hmmm. It looks like something went wrong, and we're going to have to fix it. Luckily for us, there's an error message to help us figure out what's happening here:

ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
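
One plausible reading of this error is that the top level of the file holds a mix of dictionaries and lists, which pandas can't line up as columns of a single table. The sketch below checks for that kind of mix on a tiny stand-in document. The key names match the ones explored later in this notebook, but the values are invented for illustration, and the type of `metadata` is an assumption.

```python
import json

# Hypothetical miniature of the file: key names only, invented values
toy = json.loads("""
{
  "schema": {"fields": []},
  "metadata": {"title": "toy"},
  "data": [{"company_id": 1}, {"company_id": 2}]
}
""")

# Map each top-level key to the Python type of its value
kinds = {key: type(value).__name__ for key, value in toy.items()}
print(kinds)  # {'schema': 'dict', 'metadata': 'dict', 'data': 'list'}
```

A mix of `dict` and `list` values like this one is a hint to load the file as a plain dictionary first and pick out the part that holds the records.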



Using a context manager, open the file poland-bankruptcy-data-2009.json and load it as a dictionary with the variable name poland_data.

What's a context manager?
Open a file in Python.
Load a JSON file into a dictionary using Python.

In [None]:
# Open file and load JSON
with open("data/poland-bankruptcy-data-2009.json","r") as read_file:
    poland_data = json.load(read_file)

print(type(poland_data))
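
To see what the context manager is doing for us, here's a small self-contained sketch. It writes a toy JSON file to a temporary directory (the path and contents are invented for illustration) and shows that the file is open inside the `with` block and closed as soon as the block ends.

```python
import json
import os
import tempfile

# Write a tiny hypothetical JSON file so the sketch runs on its own
tmp_path = os.path.join(tempfile.mkdtemp(), "toy.json")
with open(tmp_path, "w") as write_file:
    json.dump({"data": [{"company_id": 1}]}, write_file)

with open(tmp_path, "r") as read_file:
    toy_data = json.load(read_file)
    print(read_file.closed)  # False: still inside the block

print(read_file.closed)  # True: closed automatically on exit
print(type(toy_data))
```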

In [None]:
# Print `poland_data` keys
poland_data.keys()

In [None]:
# Continue Exploring `poland_data`
#poland_data["metadata"]
#poland_data["schema"].keys()
#type(poland_data["data"])
poland_data["data"][0]

In [None]:
# Calculate number of companies
len(poland_data["data"])

In [None]:
# Calculate number of features
len(poland_data["data"][0])

In [None]:
# Iterate through companies and check that each has 66 features
all_have_66 = True
for item in poland_data["data"]:
    if len(item) != 66:
        all_have_66 = False
        print("ALERT!!")
if all_have_66:
    print("All Data Has 66 Columns/Features")

In [None]:
# Open compressed file and load contents
with gzip.open("data/poland-bankruptcy-data-2009.json.gz","r") as read_file:
    poland_data_gz = json.load(read_file)

print(type(poland_data_gz))

In [None]:
# Explore `poland_data_gz`
print(poland_data_gz.keys())
print(len(poland_data_gz["data"]))
print(len(poland_data_gz["data"][0]))
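
Reading straight from the compressed file works because `gzip.open` returns a file object that `json.load` can consume, in binary mode (`"r"` or `"rb"`) as well as in text mode (`"rt"`). The sketch below illustrates this with a toy archive written to a temporary directory. The path and contents are invented for illustration.

```python
import gzip
import json
import os
import tempfile

toy = {"data": [{"company_id": 1, "feat_1": 0.5}]}  # hypothetical record

# Write the toy dictionary to a gzip-compressed JSON file
gz_path = os.path.join(tempfile.mkdtemp(), "toy.json.gz")
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    json.dump(toy, f)

# Text mode: gzip decodes the bytes to str before json parses them
with gzip.open(gz_path, "rt", encoding="utf-8") as f:
    from_text = json.load(f)

# Binary mode: json accepts the raw bytes and decodes them itself
with gzip.open(gz_path, "rb") as f:
    from_bytes = json.load(f)

print(from_text == from_bytes == toy)  # True
```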

In [None]:
df = pd.DataFrame.from_dict(poland_data_gz["data"]).set_index("company_id")
print(df.shape)
df.head()
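
The step from a list of records to an indexed DataFrame can be seen on a small scale below. This is a sketch with invented records: only the `company_id` key comes from this notebook, and the other column names are hypothetical. It uses the plain `pd.DataFrame` constructor, which accepts a list of dictionaries and turns each one into a row.

```python
import pandas as pd

# Hypothetical records standing in for the "data" list
records = [
    {"company_id": 1, "feat_1": 0.2, "bankrupt": False},
    {"company_id": 2, "feat_1": -0.1, "bankrupt": True},
]

# Each dictionary becomes a row; company_id moves from a column to the index
toy_df = pd.DataFrame(records).set_index("company_id")
print(toy_df.shape)  # (2, 2)
```

Because `company_id` becomes the index, it no longer counts as a column, so the shape reports one column fewer than the number of keys in each record.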

## Import

Now that we have everything set up the way we need it to be, let's combine all these steps into a single function that will decompress the file, load it into a DataFrame, and return it to us as something we can use.

In [None]:
def wrangle(filename):
    """Load a gzip-compressed JSON file into a DataFrame indexed by company_id."""
    # Open compressed file, load into dict
    with gzip.open(filename, "r") as f:
        data = json.load(f)

    # Turn dict into DataFrame
    df = pd.DataFrame.from_dict(data["data"]).set_index("company_id")
    return df

In [None]:
df = wrangle("data/poland-bankruptcy-data-2009.json.gz")
print(df.shape)
df.head()