# Importing Data

### Optional: JSON Files

Another common format is JSON. JSON files are non-tabular data files that are popular in many applications, particularly web applications and APIs.

JSON files are popular because they adapt easily to changing storage needs, are compatible with just about any system (they are encoded in plain text), and are easy for both humans and machines to read. You will frequently encounter them when working with an API or otherwise obtaining data from the web.

Here is an example JSON file containing information about an airplane:

```json
{
    "planeId": "1xc2345g",
    "manufacturerDetails": {
        "manufacturer": "Airbus",
        "model": "A330",
        "year": 1999
    },
    "airlineDetails": {
        "currentAirline": "Lufthansa",
        "previousAirlines": {
            "1st": "Emirates"
        },
        "lastPurchased": 2013
    },
    "numberOfFlights": 4654
}
```

<font class="question">
    <strong>Question</strong>:<br><em>Does this JSON data structure remind you of a Python data structure?</em>
</font>

The JSON file bears a striking resemblance to the Python `dict` structure due to the key-value pairings.

#### Importing JSON Files

JSON files can be imported using the `json` library paired with the `with` statement and the `open()` function:

In [None]:
import json
with open('data/json_example.json', 'r') as f:
    imported_json = json.load(f)

And indeed, when importing a JSON file this way, it will be loaded into a Python dictionary.  
We can then verify that the `imported_json` variable is a `dict`:

In [None]:
type(imported_json)
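
The `dict` is only the outermost layer: every JSON type is mapped to a built-in Python type. The following minimal sketch illustrates this with a small, made-up JSON string (not one of the course files), using `json.loads()`, which parses a string instead of a file:

```python
import json

# A small JSON document as a string (hypothetical example data)
raw = '{"model": "A330", "year": 1999, "inService": true, "routes": ["FRA", "DXB"], "retiredOn": null}'

# json.loads parses a string; json.load reads from an open file object
parsed = json.loads(raw)

# Print the Python type that each JSON value was converted to
for key, value in parsed.items():
    print(f"{key!r}: {type(value).__name__}")
```

JSON objects become `dict`, arrays become `list`, `true`/`false` become `True`/`False`, and `null` becomes `None`.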

This is what the data looks like, once loaded into Python:

In [None]:
imported_json

It is possible to convert this data to a Pandas DataFrame. However, since JSON often includes nested records, you will likely first have to do some extra work to reshape the data into a suitable tabular form.
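
One possible way to do this reshaping is `pd.json_normalize()`, which flattens nested dictionaries into columns. The sketch below retypes the airplane record from above as a Python dict so that it runs on its own; it is one option among several, not the only approach:

```python
import pandas as pd

# The airplane record from the example above, written as a Python dict
plane = {
    "planeId": "1xc2345g",
    "manufacturerDetails": {"manufacturer": "Airbus", "model": "A330", "year": 1999},
    "airlineDetails": {
        "currentAirline": "Lufthansa",
        "previousAirlines": {"1st": "Emirates"},
        "lastPurchased": 2013,
    },
    "numberOfFlights": 4654,
}

# json_normalize flattens nested dicts into columns, joining key names with "."
df_plane = pd.json_normalize(plane)
print(df_plane.columns.tolist())
```

The result is a DataFrame with a single row, in which each nested key gets its own column, for example `manufacturerDetails.model`.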

Let's import another JSON file that is already in a flatter format:

In [None]:
import json
with open('data/airlines.json', 'r') as f:
    flat_json = json.load(f)
flat_json

In [None]:
# This file can be directly passed to pd.DataFrame() and pandas will know how to read it
import pandas as pd
df_json = pd.DataFrame(flat_json)
df_json.head()
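
To make "flat" more concrete: a common flat layout is a list of records, where every record is a dict with the same keys and no nested values. The records below are invented for illustration and are not the contents of `airlines.json`:

```python
import pandas as pd

# Hypothetical flat records: every dict has the same keys and no nested values
records = [
    {"airline": "Airline A", "hub": "Hub 1", "fleet_size": 10},
    {"airline": "Airline B", "hub": "Hub 2", "fleet_size": 25},
    {"airline": "Airline C", "hub": "Hub 3", "fleet_size": 7},
]

# Each dict becomes one row, each key becomes one column
df_records = pd.DataFrame(records)
print(df_records)
```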

## Optional: Importing Other Files

Over time, you are likely to encounter data in other formats as well.

Here are some of the relatively common ones:
- Connecting to **relational databases** (like Postgres or Oracle)
    - tabular data, similar to csv
- Connecting to NoSQL or **document databases** (like MongoDB)
    - nested data, similar to JSON
- Obtaining data directly from an **external API**
    - often JSON
- Other data formats: xml, xlsx, avro, parquet, delta

The good news is that Pandas comes with native support for many of these. And Python, including its vast ecosystem of packages, generally offers first-class support to interface with any common data type or database system.
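
As a small illustration of the database case, the sketch below reads the result of a SQL query straight into a DataFrame with `pd.read_sql_query()`. It uses an in-memory SQLite database as a stand-in for a real database server; the table name and its rows are made up for this example:

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for a real database server
conn = sqlite3.connect(":memory:")

# Create and fill a small, hypothetical table
conn.execute("CREATE TABLE planes (plane_id TEXT, model TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO planes VALUES (?, ?, ?)",
    [("p1", "A330", 1999), ("p2", "A320", 2005)],
)
conn.commit()

# The query result arrives as a tabular DataFrame
df_sql = pd.read_sql_query("SELECT * FROM planes WHERE year > 2000", conn)
conn.close()
print(df_sql)
```

For a database such as Postgres or Oracle, the connection object would come from a different driver package, while the Pandas call stays much the same.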

If ever in doubt, the answer is usually just a Google search away.

That said, once we have access to the data, our **goal** should always be to **first** try and **bring our data into tidy, tabular form** to facilitate further processing, analysis and the building of models.

For the remainder of this course we will continue to work with tabular data. Most of the data you will encounter at BSH will already be in this form as well.

# Questions

Are there any questions up to this point?

<img src="images/any_questions.png" style="width: 1000px;"/>

----

### Optional: General Framework

A general way to conceptualize data import into and use within Python:

1. Data sits on the computer/server - this is frequently called "disk"
2. Python code can be used to copy a data file from disk to the Python session's memory
3. Python data then sits within Python's memory ready to be used by other Python code

Here is a visualization of this process:


<center>
<img src="images/import-framework.png" alt="import-framework.png" width="1000" height="1000">
</center>
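
The three steps can also be sketched in code. This example writes a small, hypothetical JSON file into a temporary directory, loads it, and then shows that the in-memory copy remains usable after the file on disk is gone:

```python
import json
import os
import tempfile

data = {"planeId": "1xc2345g", "numberOfFlights": 4654}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "plane.json")

    # Step 1: the data sits on disk as a file
    with open(path, "w") as f:
        json.dump(data, f)

    # Step 2: Python copies the file's contents from disk into memory
    with open(path, "r") as f:
        in_memory = json.load(f)

# Step 3: the temporary directory (and the file) has been deleted,
# but the copy in memory is still available to other Python code
file_still_exists = os.path.exists(path)
print(in_memory["numberOfFlights"], file_still_exists)
```

Because the loaded object is a copy, changing `in_memory` does not change the file on disk unless the data is written back explicitly.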

---

### Optional: Pickle Files

So far, we've seen that DataFrames can be represented as tabular data files and dicts can be represented as JSON files, but what about other, more complex data?

Python's native data files are known as **Pickle** files:

* Pickle files conventionally use the `.pickle` (or the shorter `.pkl`) extension

* Pickle files are great for saving native Python data that can't easily be represented by other file types
  * Pre-processed data
  * Models
  * Any other Python object...
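
To see what "can't easily be represented by other file types" means in practice, here is a minimal sketch with a made-up record containing a `set`, a `date`, and a tuple used as a dict key. JSON has no equivalent for any of these, while pickle restores them unchanged:

```python
import json
import pickle
from datetime import date

# Hypothetical pre-processed data using types that JSON has no equivalent for
record = {
    "airports": {"FRA", "DXB"},         # set
    "last_service": date(2013, 5, 17),  # date object
    ("FRA", "DXB"): 4654,               # tuple used as a dict key
}

# JSON serialization fails with a TypeError for this record
try:
    json.dumps(record)
    json_ok = True
except TypeError:
    json_ok = False

# Pickle can serialize the record to bytes and restore it unchanged
restored = pickle.loads(pickle.dumps(record))
print(json_ok, restored == record)
```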

#### Importing Pickle Files

Pickle files can be imported using the `pickle` library paired with the `with` statement and the `open()` function:

In [None]:
import pickle
with open('data/pickle_example.pickle', 'rb') as f:
    imported_pickle = pickle.load(f)

We can view the loaded object and see that it holds the same data as the JSON example:

In [None]:
imported_pickle

And that it was loaded directly as a `dict`:

In [None]:
type(imported_pickle)
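
Pickle is a binary format, which is why the file is opened with `'rb'` (read binary) above. A file like this could have been created along the following lines, using `pickle.dump()` with `'wb'` (write binary). The object and file name below are hypothetical, and a temporary directory is used so that nothing is left behind:

```python
import os
import pickle
import tempfile

# Hypothetical Python object; note the tuple, which JSON would turn into a list
model_settings = {"name": "baseline", "features": ("year", "numberOfFlights")}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "settings.pickle")

    # Write the object to disk in binary mode
    with open(path, "wb") as f:
        pickle.dump(model_settings, f)

    # Read it back, again in binary mode
    with open(path, "rb") as f:
        loaded_settings = pickle.load(f)

print(loaded_settings)
```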