In [None]:
from google.cloud import bigquery  # BigQuery client library



In [5]:
# Create a client object, used to retrieve information from BigQuery datasets
client = bigquery.Client(project='intsql-2025')



In [6]:
# Construct a reference to the "hacker_news" dataset
# (DatasetReference replaces the deprecated client.dataset() helper)
dataset_ref = bigquery.DatasetReference("bigquery-public-data", "hacker_news")

# API request: fetch the dataset
dataset = client.get_dataset(dataset_ref)

Every dataset is just a collection of tables. We can think of a dataset as a spreadsheet file containing multiple tables, all composed of rows and columns.

In [8]:
# List all tables in the "hacker_news" dataset
tables = list(client.list_tables(dataset))

# Print the name of every table in the dataset
for table in tables:
    print(table.table_id)

full




Similar to how we fetched a dataset, we can fetch a table. In the code cell below, we fetch the `full` table in the `hacker_news` dataset.

In [9]:
# construct a reference to the "full" table
table_ref = dataset_ref.table("full")

# API request: fetch the table
table = client.get_table(table_ref)
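
BigQuery identifies a table by its project, dataset, and table name, and `client.get_table()` also accepts that identifier as a single dotted string. The sketch below only builds the string for the `full` table. `full_table_id` is a hypothetical helper introduced here for illustration, and the commented-out call assumes the `client` object created earlier.

```python
def full_table_id(project, dataset_id, table_id):
    """Join the three parts of a BigQuery table identifier with dots."""
    return f"{project}.{dataset_id}.{table_id}"

# Names taken from the cells above
table_id = full_table_id("bigquery-public-data", "hacker_news", "full")
print(table_id)

# With a live client, this should fetch the same table as the reference-based call:
# table = client.get_table(table_id)
```

The reference-based approach used in this notebook makes the project, dataset, and table hierarchy explicit. The string form is shorter when the full name is already known.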

The structure of a table is called its schema. We need to understand a table's schema to effectively pull out the data we want.

In this example, we'll investigate the `full` table that we fetched above.

In [10]:
# Print information on all the columns in the "full" table in the "hacker_news" dataset
table.schema

[SchemaField('title', 'STRING', 'NULLABLE', 'Story title', (), None),
 SchemaField('url', 'STRING', 'NULLABLE', 'Story url', (), None),
 SchemaField('text', 'STRING', 'NULLABLE', 'Story or comment text', (), None),
 SchemaField('dead', 'BOOLEAN', 'NULLABLE', 'Is dead?', (), None),
 SchemaField('by', 'STRING', 'NULLABLE', "The username of the item's author.", (), None),
 SchemaField('score', 'INTEGER', 'NULLABLE', 'Story score', (), None),
 SchemaField('time', 'INTEGER', 'NULLABLE', 'Unix time', (), None),
 SchemaField('timestamp', 'TIMESTAMP', 'NULLABLE', 'Timestamp for the unix time', (), None),
 SchemaField('type', 'STRING', 'NULLABLE', 'type of details (comment comment_ranking poll story job pollopt)', (), None),
 SchemaField('id', 'INTEGER', 'NULLABLE', "The item's unique id.", (), None),
 SchemaField('parent', 'INTEGER', 'NULLABLE', 'Parent comment ID', (), None),
 SchemaField('descendants', 'INTEGER', 'NULLABLE', 'Number of story or poll descendants', (), None),
 SchemaField('ran

In [16]:
# Print the schema in a more readable format: name | type | description
for field in table.schema:
    print(f"{field.name:12} | {field.field_type:10} | {field.description}")

title        | STRING     | Story title
url          | STRING     | Story url
text         | STRING     | Story or comment text
dead         | BOOLEAN    | Is dead?
by           | STRING     | The username of the item's author.
score        | INTEGER    | Story score
time         | INTEGER    | Unix time
timestamp    | TIMESTAMP  | Timestamp for the unix time
type         | STRING     | type of details (comment comment_ranking poll story job pollopt)
id           | INTEGER    | The item's unique id.
parent       | INTEGER    | Parent comment ID
descendants  | INTEGER    | Number of story or poll descendants
ranking      | INTEGER    | Comment ranking
deleted      | BOOLEAN    | Is deleted?


In [18]:
# Extract only the column names
column_names = [field.name for field in table.schema]
print("Columns:", column_names)

Columns: ['title', 'url', 'text', 'dead', 'by', 'score', 'time', 'timestamp', 'type', 'id', 'parent', 'descendants', 'ranking', 'deleted']
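
The same schema loop can also group columns by data type, which gives a quick view of which fields are numeric, text, or boolean. This is a minimal sketch: `columns_by_type` is a hypothetical helper, and `demo_schema` is a stand-in built from a few name/type pairs copied from the schema output above, so the example runs without a BigQuery connection. It assumes only that each field exposes `name` and `field_type`, as the `SchemaField` objects above do.

```python
from collections import defaultdict
from types import SimpleNamespace

def columns_by_type(schema):
    """Map each field type to the list of column names with that type."""
    groups = defaultdict(list)
    for field in schema:
        groups[field.field_type].append(field.name)
    return dict(groups)

# Stand-in for table.schema, using a subset of the fields printed above
demo_schema = [
    SimpleNamespace(name=name, field_type=field_type)
    for name, field_type in [
        ("title", "STRING"),
        ("score", "INTEGER"),
        ("time", "INTEGER"),
        ("timestamp", "TIMESTAMP"),
        ("dead", "BOOLEAN"),
    ]
]

groups = columns_by_type(demo_schema)
print(groups)

# With the table fetched earlier: columns_by_type(table.schema)
```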


We can use the `list_rows()` method to check just the first five rows of the `full` table and confirm that the schema matches the data. (Sometimes databases have outdated descriptions, so it's good to check.) This returns a BigQuery `RowIterator` object that can quickly be converted to a pandas DataFrame with the `to_dataframe()` method.

In [19]:
# Preview the first five rows of the "full" table
client.list_rows(table, max_results=5).to_dataframe()



Unnamed: 0,title,url,text,dead,by,score,time,timestamp,type,id,parent,descendants,ranking,deleted
0,,,,,,,1437681605,2015-07-23 20:00:05+00:00,story,9938127,,,,
1,,,,,,,1437682438,2015-07-23 20:13:58+00:00,story,9938200,,,,
2,,,,,,,1437683854,2015-07-23 20:37:34+00:00,story,9938343,,,,
3,,,,,,,1437684093,2015-07-23 20:41:33+00:00,story,9938369,,,,
4,,,,,,,1437684657,2015-07-23 20:50:57+00:00,story,9938432,,,,
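
The schema describes `timestamp` as the timestamp for the Unix `time` column, and the preview gives enough data to check that claim. The sketch below converts the `time` values to UTC with the standard library and compares them with the `timestamp` strings. The three pairs are copied from the preview output above, and `unix_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone

# (time, timestamp) pairs copied from the first three preview rows
preview = [
    (1437681605, "2015-07-23 20:00:05+00:00"),
    (1437682438, "2015-07-23 20:13:58+00:00"),
    (1437683854, "2015-07-23 20:37:34+00:00"),
]

def unix_to_utc(seconds):
    """Convert Unix seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Collect any rows where the converted time disagrees with the stored timestamp
mismatches = [(t, ts) for t, ts in preview if str(unix_to_utc(t)) != ts]
print("mismatches:", mismatches)
```

An empty list means the two columns agree for the sampled rows, which supports the schema description for this small sample.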


The `list_rows()` method will also let us look at just the information in a specific column. If we want to see the first five entries in the `title` column, which is the first field in the schema, we can pass `table.schema[:1]` as `selected_fields`.

In [20]:
client.list_rows(table, selected_fields=table.schema[:1], max_results=5).to_dataframe()



Unnamed: 0,title
0,
1,
2,
3,
4,
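
Slicing `table.schema` by position depends on the column order, which can differ between versions of a table. A minimal sketch of selecting fields by name instead, for example to view the `by` column: `select_fields` is a hypothetical helper, not part of the BigQuery client, and `demo_schema` is a stand-in so the example runs without a connection. It assumes only that each field has a `name` attribute.

```python
from types import SimpleNamespace

def select_fields(schema, names):
    """Return the schema fields whose names appear in `names`, in schema order."""
    wanted = set(names)
    found = [field for field in schema if field.name in wanted]
    missing = wanted - {field.name for field in found}
    if missing:
        raise KeyError(f"columns not in schema: {sorted(missing)}")
    return found

# Stand-in for table.schema, using the first five column names listed above
demo_schema = [SimpleNamespace(name=n) for n in ("title", "url", "text", "dead", "by")]

by_fields = select_fields(demo_schema, ["by"])
print([field.name for field in by_fields])

# With a live client and the table fetched earlier:
# client.list_rows(
#     table,
#     selected_fields=select_fields(table.schema, ["by"]),
#     max_results=5,
# ).to_dataframe()
```

Raising `KeyError` for an unknown name makes a misspelled column fail immediately instead of returning an empty selection.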
