# Introduction

So far, you've worked with many types of data, including numeric types (integers, floating point values), strings, and the [DATETIME](https://www.kaggle.com/dansbecker/order-by) type.  In this tutorial, you'll learn how to query **nested and repeated data**.  These are the most complex data types that you can find in BigQuery datasets! 

# Nested data

Consider a hypothetical dataset containing information about pets and their toys.  We could organize this information in two different tables (a `pets` table and a `toys` table).  The `toys` table could contain a "Pet_ID" column that could be used to match each toy to the pet that owns it.

Another option in BigQuery is to organize all of the information in a single table, similar to the `pets_and_toys` table below.  

![nested data](https://storage.googleapis.com/kaggle-media/learn/images/wxuogYA.png)

In this case, all of the information from the `toys` table is collapsed into a single column (the "Toy" column in the `pets_and_toys` table).  We refer to the "Toy" column in the `pets_and_toys` table as a **nested** column, and say that the "Name" and "Type" fields are nested inside of it.  

Nested columns have type **STRUCT** (or type **RECORD**).  This is reflected in the table schema below.
> Recall that we refer to the structure of a table as its **schema**.  If you need to review how to interpret table schema, feel free to check out [this lesson](https://www.kaggle.com/dansbecker/getting-started-with-sql-and-bigquery) from the Intro to SQL micro-course.

![nested data](https://storage.googleapis.com/kaggle-media/learn/images/epXFXdb.png)

To query a column with nested data, we need to identify each field in the context of the column that contains it: 
- `Toy.Name` refers to the "Name" field in the "Toy" column, and
- `Toy.Type` refers to the "Type" field in the "Toy" column.  

![nested data](https://storage.googleapis.com/kaggle-media/learn/images/eE2Gt62.png)

Otherwise, our usual rules remain the same: nothing else about our queries needs to change.
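
To make the dotted syntax concrete, here is a minimal sketch that imitates the nested column with plain Python dictionaries. The rows, the "Name" and "Animal" columns, and the table path in the query string are all made up for illustration; only the "Toy" column with its "Name" and "Type" fields comes from the description above.

```python
# Hypothetical stand-in for the `pets_and_toys` table: each row is a dict,
# and the nested "Toy" column is itself a dict (playing the role of a STRUCT).
pets_and_toys = [
    {"Name": "Moon", "Animal": "Dog", "Toy": {"Name": "McFly", "Type": "Frisbee"}},
    {"Name": "Ripley", "Animal": "Cat", "Toy": {"Name": "Fluffy", "Type": "Feather"}},
    {"Name": "Napoleon", "Animal": "Parrot", "Toy": {"Name": "Eddy", "Type": "Castle"}},
]

# The kind of BigQuery query the Python below imitates.
# The project and dataset names are placeholders, not a real table.
query = """
        SELECT Name AS Pet_Name,
            Toy.Name AS Toy_Name,
            Toy.Type AS Toy_Type
        FROM `your-project.your_dataset.pets_and_toys`
        """

# Pure-Python analog: each nested field needs one extra lookup,
# just as `Toy.Name` adds one extra dot in SQL.
rows = [
    {
        "Pet_Name": pet["Name"],
        "Toy_Name": pet["Toy"]["Name"],
        "Toy_Type": pet["Toy"]["Type"],
    }
    for pet in pets_and_toys
]

for row in rows:
    print(row)
```

The result still has one row per pet, because a STRUCT holds exactly one value per row.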

# Repeated data 

Now consider the (more realistic!) case where each pet can have multiple toys.  In this case, to collapse this information into a single table, we need to leverage a different datatype.

![repeated data](https://storage.googleapis.com/kaggle-media/learn/images/S93FJTE.png)

We say that the "Toys" column contains **repeated data**, because it permits more than one value for each row.  This is reflected in the table schema below, where the mode of the "Toys" column appears as **'REPEATED'**.

![repeated data](https://storage.googleapis.com/kaggle-media/learn/images/KlrjpDM.png)

Each entry in a repeated field is an **ARRAY**, or an ordered list of (zero or more) values with the same datatype.  For instance, the entry in the "Toys" column for Moon the Dog is **[Frisbee, Bone, Rope]**, which is an ARRAY with three values.

When querying repeated data, we need to put the name of the column containing the repeated data inside an **UNNEST()** function.  

![repeated data](https://storage.googleapis.com/kaggle-media/learn/images/p3fXPxY.png)

This essentially flattens the repeated data (which is then appended to the right side of the table) so that we have one element on each row.  For an illustration of this, check out the image below.

![repeated data](https://storage.googleapis.com/kaggle-media/learn/images/8j4XK8f.png)
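
The flattening can also be sketched in plain Python, where each ARRAY becomes a list. Apart from Moon's three toys, the rows below are invented, and the table name in the query string is a placeholder.

```python
# Hypothetical stand-in for a table with a repeated "Toys" column.
# Only Moon's toys come from the text; the other rows are made up.
pets = [
    {"Name": "Moon", "Animal": "Dog", "Toys": ["Frisbee", "Bone", "Rope"]},
    {"Name": "Ripley", "Animal": "Cat", "Toys": ["Feather", "Ball"]},
    {"Name": "Napoleon", "Animal": "Parrot", "Toys": []},
]

# The kind of BigQuery query the Python below imitates (placeholder table path).
query = """
        SELECT Name AS Pet_Name,
            Toy_Type
        FROM `your-project.your_dataset.pets_with_toys`,
            UNNEST(Toys) AS Toy_Type
        """

# Pure-Python analog of the comma join with UNNEST:
# pair every row with each element of its own array.
flattened = [
    {"Pet_Name": pet["Name"], "Toy_Type": toy}
    for pet in pets
    for toy in pet["Toys"]
]

for row in flattened:
    print(row)
```

In this sketch, a pet with an empty list contributes no rows at all. The comma join in BigQuery behaves the same way for an empty ARRAY.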

# Nested and repeated data

Now, what if pets can have multiple toys, _and_ we'd like to keep track of both the name and type of each toy?  In this case, we can make the "Toys" column both **nested** and **repeated**.

![repeated data](https://storage.googleapis.com/kaggle-media/learn/images/psKtza2.png)

In the `more_pets_and_toys` table above, "Name" and "Type" are both fields contained within the "Toys" STRUCT, and each entry in both "Toys.Name" and "Toys.Type" is an ARRAY.

![repeated data](https://storage.googleapis.com/kaggle-media/learn/images/fO5OymI.png)

Let's look at a sample query.

![repeated data](https://storage.googleapis.com/kaggle-media/learn/images/DiMCZaO.png)

Since the "Toys" column is repeated, we flatten it with the **UNNEST()** function.  And, since we give the flattened column an alias of `t`, we can refer to the "Name" and "Type" fields in the "Toys" column as `t.Name` and `t.Type`, respectively.  
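
As a rough sketch of how the alias works, the plain-Python version below treats the "Toys" column as a list of dictionaries, one per toy. The rows and the table path are hypothetical; the column name "Toys", its "Name" and "Type" fields, and the alias `t` follow the description above.

```python
# Hypothetical stand-in for `more_pets_and_toys`: "Toys" is a list (ARRAY)
# whose elements are dicts (STRUCTs) with "Name" and "Type" fields.
more_pets_and_toys = [
    {"Name": "Moon", "Toys": [
        {"Name": "McFly", "Type": "Frisbee"},
        {"Name": "Scully", "Type": "Bone"},
    ]},
    {"Name": "Ripley", "Toys": [
        {"Name": "Fluffy", "Type": "Feather"},
    ]},
]

# The kind of BigQuery query the Python below imitates (placeholder table path).
query = """
        SELECT Name AS Pet_Name,
            t.Name AS Toy_Name,
            t.Type AS Toy_Type
        FROM `your-project.your_dataset.more_pets_and_toys`,
            UNNEST(Toys) AS t
        """

# Pure-Python analog: the loop variable `t` plays the role of the alias,
# standing for one toy STRUCT at a time.
flattened = [
    {"Pet_Name": pet["Name"], "Toy_Name": t["Name"], "Toy_Type": t["Type"]}
    for pet in more_pets_and_toys
    for t in pet["Toys"]
]

for row in flattened:
    print(row)
```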

To reinforce what you've learned, we'll apply these ideas to a real dataset in the section below.

# Example

We'll work with the [Google Analytics Sample](https://www.kaggle.com/bigquery/google-analytics-sample) dataset.  It contains information tracking the behavior of visitors to the Google Merchandise store, an e-commerce website that sells Google branded items.

We begin by printing the first few rows of the `ga_sessions_20170801` table.  This table tracks visits to the website on August 1, 2017.  

In [None]:
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "google_analytics_sample" dataset
dataset_ref = client.dataset("google_analytics_sample", project="bigquery-public-data")

# Construct a reference to the "ga_sessions_20170801" table
table_ref = dataset_ref.table("ga_sessions_20170801")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five rows of the table
client.list_rows(table, max_results=5).to_dataframe()

For a description of each field, refer to this [data dictionary](https://support.google.com/analytics/answer/3437719?hl=en).

The table has many nested fields, which you can verify by looking at either the [data dictionary](https://support.google.com/analytics/answer/3437719?hl=en) (_hint: search for appearances of 'RECORD' on the page_) or the table preview above.
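
A third way to spot nested fields is to walk the schema in code. The helper below is a sketch, not part of the BigQuery client library: `describe_schema` is a hypothetical name, and it relies only on the `name`, `field_type`, `mode`, and `fields` attributes that `bigquery.SchemaField` objects expose. So that the example runs without credentials, it uses a small `namedtuple` stand-in with the same attribute names and a made-up schema.

```python
from collections import namedtuple

# Stand-in with the same attribute names as bigquery.SchemaField,
# so the sketch runs without a BigQuery connection.
Field = namedtuple("Field", ["name", "field_type", "mode", "fields"])


def describe_schema(schema, prefix=""):
    """Return one 'path: TYPE (MODE)' line per field, descending into nested fields."""
    lines = []
    for field in schema:
        path = prefix + field.name
        lines.append(f"{path}: {field.field_type} ({field.mode})")
        # Nested columns are reported with type RECORD (another name for STRUCT)
        if field.field_type in ("RECORD", "STRUCT"):
            lines.extend(describe_schema(field.fields, prefix=path + "."))
    return lines


# Made-up schema in the spirit of the pets example
toy_schema = [
    Field("Name", "STRING", "NULLABLE", ()),
    Field("Toys", "RECORD", "REPEATED", (
        Field("Name", "STRING", "NULLABLE", ()),
        Field("Type", "STRING", "NULLABLE", ()),
    )),
]

for line in describe_schema(toy_schema):
    print(line)
```

With the `table` object fetched above, `describe_schema(table.schema)` should produce the same kind of listing for the real table, with dotted paths such as `hits.page.pagePath`.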

In our first query against this table, we'll work with the "totals" and "device" columns. 

In [None]:
print("SCHEMA field for the 'totals' column:\n")
print(table.schema[5])

print("\nSCHEMA field for the 'device' column:\n")
print(table.schema[7])

We refer to the "browser" field (which is nested in the "device" column) and the "transactions" field (which is nested inside the "totals" column) as `device.browser` and `totals.transactions` in the query below:

In [None]:
# Query to count the number of transactions per browser
query = """
        SELECT device.browser AS device_browser,
            SUM(totals.transactions) AS total_transactions
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
        GROUP BY device_browser
        ORDER BY total_transactions DESC
        """

# Run the query, and return a pandas DataFrame
result = client.query(query).result().to_dataframe()
result.head()

By storing the information in the "device" and "totals" columns as STRUCTs (as opposed to separate tables), we avoid expensive JOINs.  This can improve query performance, and it spares us from keeping track of JOIN keys and working out which table holds the data we need.

Now we'll work with the "hits" column as an example of data that is both nested and repeated. Since:
- "hits" is a STRUCT (contains nested data) and is repeated,
- "hitNumber", "page", and "type" are all nested inside the "hits" column, and
- "pagePath" is nested inside the "page" field,

we can query these fields with the following syntax:

In [None]:
# Query to determine most popular landing point on the website
query = """
        SELECT hits.page.pagePath AS path,
            COUNT(hits.page.pagePath) AS counts
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
            UNNEST(hits) AS hits
        WHERE hits.type="PAGE" AND hits.hitNumber=1
        GROUP BY path
        ORDER BY counts DESC
        """

# Run the query, and return a pandas DataFrame
result = client.query(query).result().to_dataframe()
result.head()

In this case, the most common landing page is `"/home"`.
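
To see what the query does step by step, here is a small pure-Python imitation run on invented sessions. The field names mirror the real table, but the data is made up and far simpler than the actual dataset.

```python
from collections import Counter

# Invented sessions: each has a repeated "hits" column of STRUCTs,
# and each hit has a nested "page" STRUCT with a "pagePath" field.
sessions = [
    {"hits": [
        {"hitNumber": 1, "type": "PAGE", "page": {"pagePath": "/home"}},
        {"hitNumber": 2, "type": "PAGE", "page": {"pagePath": "/basket.html"}},
    ]},
    {"hits": [
        {"hitNumber": 1, "type": "PAGE", "page": {"pagePath": "/home"}},
    ]},
    {"hits": [
        {"hitNumber": 1, "type": "EVENT", "page": {"pagePath": "/home"}},
        {"hitNumber": 2, "type": "PAGE", "page": {"pagePath": "/signin.html"}},
    ]},
    {"hits": [
        {"hitNumber": 1, "type": "PAGE", "page": {"pagePath": "/google+redesign/apparel"}},
    ]},
]

counts = Counter(
    hit["page"]["pagePath"]                              # SELECT hits.page.pagePath
    for session in sessions                              # FROM the sessions table,
    for hit in session["hits"]                           #     UNNEST(hits) AS hits
    if hit["type"] == "PAGE" and hit["hitNumber"] == 1   # WHERE ...
)

# GROUP BY path, ORDER BY counts DESC
result = counts.most_common()
print(result)
```

The two nested `for` clauses correspond to the comma join with `UNNEST(hits)`: every session row is paired with each of its own hits before the filter and the grouping are applied.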

# Your turn 

Use what you've learned to **[query complex datatypes](https://www.kaggle.com/kernels/fork/5045823)** in a real-world dataset.

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/advanced-sql/discussion) to chat with other learners.*