**This notebook is an exercise in the [SQL](https://www.kaggle.com/learn/intro-to-sql) course.  You can reference the tutorial at [this link](https://www.kaggle.com/dansbecker/getting-started-with-sql-and-bigquery).**

---


# Introduction

The first test of your new data exploration skills uses data describing crime in the city of Chicago.

Before you get started, run the following cell. It sets up the automated feedback system to review your answers.

In [1]:
# Set up feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.sql.ex1 import *
print("Setup Complete")

Using Kaggle's public dataset BigQuery integration.
Setup Complete


Use the next code cell to fetch the dataset.

The first step in the workflow is to create a Client object. As you'll soon see, this Client object will play a central role in retrieving information from BigQuery datasets.

In BigQuery, each dataset is contained in a corresponding project. In this case, our `chicago_crime` dataset is contained in the `bigquery-public-data` project. To access the dataset:

- We begin by constructing a reference to the dataset with the dataset() method.
- Next, we use the get_dataset() method, along with the reference we just constructed, to fetch the dataset.

In [2]:
from google.cloud import bigquery
import pandas as pd

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "chicago_crime" dataset
dataset_ref = client.dataset("chicago_crime", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

Using Kaggle's public dataset BigQuery integration.


# Exercises

### 1) Count tables in the dataset

How many tables are in the Chicago Crime dataset?

Every dataset is just a collection of tables. You can think of a dataset as a spreadsheet file containing multiple tables, all composed of rows and columns.

We use the list_tables() method to list the tables in the dataset.

In [3]:
# List all the tables in the "chicago_crime" dataset
tables = list(client.list_tables(dataset))

# Print names of all tables in the dataset (there is only one)
for table in tables:
    print(table.table_id)
    


crime


In [4]:
num_tables = len(tables)  # Store the answer as num_tables and then run this cell

# Check your answer
q_1.check()
print(num_tables)

<IPython.core.display.Javascript object>

<span style="color:#33cc33">Correct</span>

1


For a hint or the solution, uncomment the appropriate line below.

In [5]:
# q_1.hint()
# q_1.solution()

### 2) Explore the table schema

How many columns in the `crime` table have `TIMESTAMP` data?

Similar to how we fetched a dataset, we can fetch a table. In the code cell below, we fetch the `crime` table in the `chicago_crime` dataset.

In [6]:
# Write the code to figure out the answer
# Construct a reference to the "crime" table
table_ref = dataset_ref.table("crime")

# API request - fetch the table
table_crime = client.get_table(table_ref)

Each SchemaField tells us about a specific column (which we also refer to as a field). In order, the information is:

- The name of the column
- The field type (or datatype) in the column
- The mode of the column ('NULLABLE' means that a column allows NULL values, and is the default)
- A description of the data in that column

In [7]:
# Print information on all the columns in the "crime" table in the "chicago_crime" dataset
table_crime.schema

[SchemaField('unique_key', 'INTEGER', 'REQUIRED', None, (), None),
 SchemaField('case_number', 'STRING', 'NULLABLE', None, (), None),
 SchemaField('date', 'TIMESTAMP', 'NULLABLE', None, (), None),
 SchemaField('block', 'STRING', 'NULLABLE', None, (), None),
 SchemaField('iucr', 'STRING', 'NULLABLE', None, (), None),
 SchemaField('primary_type', 'STRING', 'NULLABLE', None, (), None),
 SchemaField('description', 'STRING', 'NULLABLE', None, (), None),
 SchemaField('location_description', 'STRING', 'NULLABLE', None, (), None),
 SchemaField('arrest', 'BOOLEAN', 'NULLABLE', None, (), None),
 SchemaField('domestic', 'BOOLEAN', 'NULLABLE', None, (), None),
 SchemaField('beat', 'INTEGER', 'NULLABLE', None, (), None),
 SchemaField('district', 'INTEGER', 'NULLABLE', None, (), None),
 SchemaField('ward', 'INTEGER', 'NULLABLE', None, (), None),
 SchemaField('community_area', 'INTEGER', 'NULLABLE', None, (), None),
 SchemaField('fbi_code', 'STRING', 'NULLABLE', None, (), None),
 SchemaField('x_coord

We can use the **list_rows()** method to check just the first few rows of the `crime` table to make sure this is right. (Sometimes databases have outdated descriptions, so it's good to check.)

This returns a BigQuery RowIterator object that can quickly be converted to a pandas DataFrame with the **to_dataframe()** method.

In [8]:
for i in table_crime.schema:
    print(i)

SchemaField('unique_key', 'INTEGER', 'REQUIRED', None, (), None)
SchemaField('case_number', 'STRING', 'NULLABLE', None, (), None)
SchemaField('date', 'TIMESTAMP', 'NULLABLE', None, (), None)
SchemaField('block', 'STRING', 'NULLABLE', None, (), None)
SchemaField('iucr', 'STRING', 'NULLABLE', None, (), None)
SchemaField('primary_type', 'STRING', 'NULLABLE', None, (), None)
SchemaField('description', 'STRING', 'NULLABLE', None, (), None)
SchemaField('location_description', 'STRING', 'NULLABLE', None, (), None)
SchemaField('arrest', 'BOOLEAN', 'NULLABLE', None, (), None)
SchemaField('domestic', 'BOOLEAN', 'NULLABLE', None, (), None)
SchemaField('beat', 'INTEGER', 'NULLABLE', None, (), None)
SchemaField('district', 'INTEGER', 'NULLABLE', None, (), None)
SchemaField('ward', 'INTEGER', 'NULLABLE', None, (), None)
SchemaField('community_area', 'INTEGER', 'NULLABLE', None, (), None)
SchemaField('fbi_code', 'STRING', 'NULLABLE', None, (), None)
SchemaField('x_coordinate', 'FLOAT', 'NULLABLE', No
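
Rather than counting `TIMESTAMP` columns by eye, the schema can be filtered by field type. The sketch below is a minimal illustration that runs without a BigQuery connection: `Field` is a hypothetical stand-in for `SchemaField` that holds only the `name` and `field_type` attributes, and the list is abbreviated to a handful of columns from the `crime` table. In the notebook, the same list comprehension would presumably be applied to `table_crime.schema`.

```python
from collections import namedtuple

# Hypothetical stand-in for bigquery.SchemaField (assumption: only the
# `name` and `field_type` attributes are modelled here).
Field = namedtuple("Field", ["name", "field_type"])

# Abbreviated copy of the "crime" schema, not the full column list.
schema = [
    Field("unique_key", "INTEGER"),
    Field("case_number", "STRING"),
    Field("date", "TIMESTAMP"),
    Field("arrest", "BOOLEAN"),
    Field("updated_on", "TIMESTAMP"),
]

# Keep the names of the fields whose type is TIMESTAMP.
timestamp_columns = [field.name for field in schema if field.field_type == "TIMESTAMP"]

print(timestamp_columns)       # ['date', 'updated_on']
print(len(timestamp_columns))  # 2
```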

In [9]:
# Preview the first three rows of the "crime" table
client.list_rows(table_crime, max_results=3).to_dataframe()

  


Unnamed: 0,unique_key,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,...,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location
0,20216,HT651443,2011-12-29 11:20:00+00:00,007XX E 105TH ST,110,HOMICIDE,FIRST DEGREE MURDER,APARTMENT,False,False,...,9,50,01A,1183056.0,1835492.0,2011,2022-09-18 04:45:51+00:00,41.703787,-87.605294,"(41.703787141, -87.605294344)"
1,24716,JC412482,2019-08-29 06:40:00+00:00,105XX S WABASH AVE,110,HOMICIDE,FIRST DEGREE MURDER,STREET,False,False,...,9,49,01A,1178440.0,1835083.0,2019,2022-09-18 04:45:51+00:00,41.702771,-87.622209,"(41.702770631, -87.622209499)"
2,26396,JE271557,2021-10-17 12:35:00+00:00,105XX S PERRY AVE,110,HOMICIDE,FIRST DEGREE MURDER,HOUSE,True,True,...,34,49,01A,1177468.0,1835004.0,2021,2022-08-31 04:51:30+00:00,41.702576,-87.625771,"(41.702575814, -87.625771078)"


The **list_rows()** method will also let us look at just the information in a specific column. If we want to see the first five entries in the `date` column, for example, we can do that!

In [10]:
# Preview the first five entries in the "date" column of the "crime" table
client.list_rows(table_crime, selected_fields=table_crime.schema[2:3], max_results=5).to_dataframe()

  


Unnamed: 0,date
0,2011-12-29 11:20:00+00:00
1,2019-08-29 06:40:00+00:00
2,2021-10-17 12:35:00+00:00
3,2018-07-21 11:31:00+00:00
4,2018-06-19 02:20:00+00:00
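
The slice `table_crime.schema[2:3]` only works because `date` happens to be the third field. A possible alternative, sketched below, is to pick fields by name. This runs without a BigQuery connection: `Field` is a hypothetical stand-in for `SchemaField`, `fields_named` is an illustrative helper that is not part of the BigQuery library, and the schema is abbreviated.

```python
from collections import namedtuple

# Hypothetical stand-in for bigquery.SchemaField (assumption: only the
# `name` and `field_type` attributes are modelled here).
Field = namedtuple("Field", ["name", "field_type"])

# First four fields of the "crime" schema, in their original order.
schema = [
    Field("unique_key", "INTEGER"),
    Field("case_number", "STRING"),
    Field("date", "TIMESTAMP"),
    Field("block", "STRING"),
]


def fields_named(schema, names):
    """Return the schema fields whose name is in `names`, in schema order."""
    wanted = set(names)
    return [field for field in schema if field.name in wanted]


date_fields = fields_named(schema, ["date"])

# Selecting by name gives the same result as the positional slice.
print(date_fields == schema[2:3])  # True
```

In the notebook, the result of `fields_named(table_crime.schema, ["date"])` could then be passed as the `selected_fields` argument of `client.list_rows()`.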


In [11]:
# list_rows() is a method of the Client object; calling it bare raises a NameError
client.list_rows(table_crime, selected_fields=table_crime.schema[2:3], max_results=5).to_dataframe()

In [12]:
num_timestamp_fields = 2 # Put your answer here
# columns 'date' and 'updated_on'

# Check your answer
q_2.check()

<IPython.core.display.Javascript object>

<span style="color:#33cc33">Correct</span>

For a hint or the solution, uncomment the appropriate line below.

In [13]:
# q_2.hint()
# q_2.solution()

### 3) Create a crime map

If you wanted to create a map with a dot at the location of each crime, what are the names of the two fields you likely need to pull out of the `crime` table to plot the crimes on a map?

`'latitude'` and `'longitude'` are the better and more standard choice, although `'x_coordinate'` and `'y_coordinate'` might also work.


In [14]:
fields_for_plotting = ['latitude', 'longitude'] # Put your answers here
# Check your answer
q_3.check()

<IPython.core.display.Javascript object>

<span style="color:#33cc33">Correct</span>

For a hint or the solution, uncomment the appropriate line below.

In [15]:
# q_3.hint()
# q_3.solution()

Thinking about the question above, there are a few columns that appear to have geographic data. Look at a few values (with the `list_rows()` method) to see if you can determine their relationship. Two columns will still be hard to interpret, but it should be obvious how the `location` column relates to `latitude` and `longitude`.

In [16]:
client.list_rows(table_crime, max_results=5).to_dataframe()



Unnamed: 0,unique_key,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,...,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location
0,20216,HT651443,2011-12-29 11:20:00+00:00,007XX E 105TH ST,110,HOMICIDE,FIRST DEGREE MURDER,APARTMENT,False,False,...,9,50,01A,1183056.0,1835492.0,2011,2022-09-18 04:45:51+00:00,41.703787,-87.605294,"(41.703787141, -87.605294344)"
1,24716,JC412482,2019-08-29 06:40:00+00:00,105XX S WABASH AVE,110,HOMICIDE,FIRST DEGREE MURDER,STREET,False,False,...,9,49,01A,1178440.0,1835083.0,2019,2022-09-18 04:45:51+00:00,41.702771,-87.622209,"(41.702770631, -87.622209499)"
2,26396,JE271557,2021-10-17 12:35:00+00:00,105XX S PERRY AVE,110,HOMICIDE,FIRST DEGREE MURDER,HOUSE,True,True,...,34,49,01A,1177468.0,1835004.0,2021,2022-08-31 04:51:30+00:00,41.702576,-87.625771,"(41.702575814, -87.625771078)"
3,24071,JB359746,2018-07-21 11:31:00+00:00,002XX W 106TH ST,110,HOMICIDE,FIRST DEGREE MURDER,STREET,False,False,...,34,49,01A,1176557.0,1834650.0,2018,2022-09-18 04:45:51+00:00,41.701625,-87.629118,"(41.701624878, -87.629117507)"
4,24014,JB313093,2018-06-19 02:20:00+00:00,000XX W 103RD ST,110,HOMICIDE,FIRST DEGREE MURDER,HOTEL,False,False,...,34,49,01A,1177576.0,1836668.0,2018,2022-09-18 04:45:51+00:00,41.70714,-87.625326,"(41.707139627, -87.625325518)"
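
The relationship between `location` and the `latitude`/`longitude` columns can be checked with a few lines of plain Python. The sketch below copies three rows from the output above and assumes that `location` arrives as a string of the form `"(lat, lon)"`, as it is displayed. `parse_location` is an illustrative helper, not part of any library.

```python
def parse_location(text):
    """Split a "(lat, lon)" string into a pair of floats."""
    lat_text, lon_text = text.strip("()").split(",")
    return float(lat_text), float(lon_text)


# (latitude, longitude, location) values copied from the preview above.
rows = [
    (41.703787, -87.605294, "(41.703787141, -87.605294344)"),
    (41.702771, -87.622209, "(41.702770631, -87.622209499)"),
    (41.702576, -87.625771, "(41.702575814, -87.625771078)"),
]

# The displayed latitude/longitude are rounded, so compare with a tolerance.
matches = []
for latitude, longitude, location in rows:
    lat, lon = parse_location(location)
    matches.append(abs(lat - latitude) < 1e-6 and abs(lon - longitude) < 1e-6)

print(matches)  # [True, True, True]
```

For these rows, `location` is simply the `(latitude, longitude)` pair shown with more decimal places.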


# Keep going

You've looked at the schema, but you haven't yet done anything exciting with the data itself. Things get more interesting when you get to the data, so keep going to **[write your first SQL query](https://www.kaggle.com/dansbecker/select-from-where).**

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intro-to-sql/discussion) to chat with other learners.*