## 0. Authentication and project ID

In [1]:
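# google.colab is only importable inside a Colab runtime;
# anywhere else this import raises ModuleNotFoundError.
from google.colab import auth

auth.authenticate_user()
print("Authenticated!")

# Project ID, passed to the %%bigquery magic via --project in the cells below
pid = "springer-nature-analytics"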

## 1. Running BQ queries in Colab

### Option 1: display result directly in Colab

In [0]:
%%bigquery --project $pid 

SELECT COUNT(doi) AS n_doi FROM `springer-nature-analytics.DS_dimensions.publications_full_refresh`

### Option 2: save results in a Pandas data frame

In [0]:
%%bigquery --project $pid df

SELECT doi, year, T.title AS art_title, j.title AS journal_title
FROM `springer-nature-analytics.DS_dimensions.publications_full_refresh` AS T,
UNNEST(journal) AS j
WHERE j.title LIKE "Scientific Reports"
LIMIT 1000

#### Exercises

Use Colab to write and execute one or more queries that show (in the Colab interface):

1. How many articles were published in Scientific Reports in 2018.
1. How many articles were published in PLoS ONE in 2018.

Make sure you save the Colab notebook and add comments in the text cells.

## 2. Exploratory data analysis

### Descriptive statistics

Basic descriptive statistics using SQL:


In [0]:
%%bigquery --project $pid 

SELECT AVG(n_doi) FROM (
  SELECT year, COUNT(DISTINCT doi) AS n_doi
  FROM `springer-nature-analytics.DS_dimensions.publications_full_refresh`, UNNEST(journal) AS j
  WHERE j.title LIKE "Scientific Reports"
  GROUP BY year
)

In [0]:
%%bigquery --project $pid df

SELECT STDDEV(n_doi) FROM (
  SELECT year, COUNT(DISTINCT doi) AS n_doi
  FROM `springer-nature-analytics.DS_dimensions.publications_full_refresh`, UNNEST(journal) AS j
  WHERE j.title LIKE "Scientific Reports"
  GROUP BY year
)

A more convenient way is to use Pandas, the Python library for handling data frames.

Pandas examples:

https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm

In [0]:
import pandas as pd

Example summary functions using a toy data frame example:

In [0]:
dummy_df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Pear", "Orange"],
                         "Weight": [185, 183, 159, 310],  # in grams
                         "Diameter": [8.3, 6.6, 5, 10],  # in cm
                         "Fruitbearing": [6 * 12, 9, 3 * 12, 15 * 12]})  # average months before bearing fruit

Basic data inspection, to look at the start (head) and end (tail) of the dataframe:

In [0]:
print(dummy_df.head())  # print the first 5 rows (the default); here the frame only has 4
print("---------------------------------------------------")
print(dummy_df.tail())  # print the last 5 rows (the default); here the frame only has 4
print("---------------------------------------------------")
print("N rows:", len(dummy_df))  # print the number of rows
print("N columns:", len(dummy_df.columns))  # print the number of columns

It is often worth adding a simple check that the data frame has the number of rows we expect. Here the check is trivial, since we created the data frame by hand. After multiple joins and/or transformations, however, such a check stops the execution of the notebook as soon as something has gone wrong:

In [0]:
assert len(dummy_df) == 4
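
As a sketch of how such a check might look after a join, the example below merges a trimmed copy of the toy data with a second table and asserts that the join neither dropped nor duplicated rows. The names `fruit_df`, `prices_df` and the `PricePerKg` column, along with its values, are invented for illustration only.

```python
import pandas as pd

# Trimmed copy of the toy fruit table.
fruit_df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Pear", "Orange"],
                         "Weight": [185, 183, 159, 310]})  # in grams

# Hypothetical lookup table, invented for this example.
prices_df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Pear", "Orange"],
                          "PricePerKg": [3.0, 1.5, 2.8, 2.2]})

# validate="one_to_one" makes merge raise an error if either side
# contains duplicate join keys.
merged_df = fruit_df.merge(prices_df, on="Fruit", how="left",
                           validate="one_to_one")

# Stop the notebook if the join changed the number of rows
# or left any fruit without a price.
assert len(merged_df) == len(fruit_df)
assert merged_df["PricePerKg"].notna().all()
```

With a left join, the row count can only grow when the right-hand table has duplicate keys, so the two assertions together cover both duplicated and unmatched rows.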

We can quickly get lots of summary statistics with the Pandas `describe` method:

In [0]:
# Get basic summary statistics for the data frame:

dummy_df.describe()

In [0]:
# mean for a single column:

dummy_df["Weight"].mean()
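
Other single-column summaries follow the same pattern as `mean`. The sketch below rebuilds the `Weight` column as a standalone series (the name `weights` is only used here) so that it runs on its own.

```python
import pandas as pd

# Same values as dummy_df["Weight"], in grams.
weights = pd.Series([185, 183, 159, 310], name="Weight")

print("mean:  ", weights.mean())
print("median:", weights.median())
print("std:   ", weights.std())  # sample standard deviation (ddof=1) by default
print("min:   ", weights.min())
print("max:   ", weights.max())
print("75%:   ", weights.quantile(0.75))  # third quartile
```
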

In [0]:
# Correlation (Pearson) between two columns:

dummy_df[["Weight", "Diameter"]].corr()
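
The per-year aggregates computed in SQL earlier can also be reproduced in Pandas. The sketch below uses a small made-up table (`pubs_df`, with invented DOIs and years) in place of a real query result. It counts distinct DOIs per year, then takes the mean and standard deviation of those counts. Pandas' `std` uses the sample formula (`ddof=1`) by default, which is worth keeping in mind when comparing against results computed elsewhere.

```python
import pandas as pd

# Made-up stand-in for a query result: one row per (doi, year).
pubs_df = pd.DataFrame({
    "doi": ["10.1/a", "10.1/b", "10.1/b", "10.1/c", "10.1/d", "10.1/e", "10.1/f"],
    "year": [2016, 2016, 2016, 2017, 2017, 2017, 2018],
})

# Equivalent of: SELECT year, COUNT(DISTINCT doi) AS n_doi ... GROUP BY year
n_doi = pubs_df.groupby("year")["doi"].nunique()

print(n_doi)
print("mean of yearly counts:", n_doi.mean())
print("std of yearly counts: ", n_doi.std())  # sample standard deviation (ddof=1)
```
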

### Plotting

The recommended Python library for quick data exploration in Colab is *plotly express*:

https://plot.ly/python/plotly-express/


In [0]:
import plotly.express as px

Scatter plot:

In [0]:
px.scatter(dummy_df, x="Weight", y="Diameter", trendline="ols").show()

Barplot:

In [0]:
# y must name an existing column ("Weight"); labels only changes the axis title
px.bar(dummy_df, x="Fruit", y="Weight", labels={"Weight": "Weight (g)"}).show()

Multi-variable scatter for multiple correlations:

In [0]:
px.scatter_matrix(dummy_df, dimensions=["Weight", "Diameter", "Fruitbearing"]).show()

#### Exercises

1. Use the query below to obtain a subset of Scientific Reports data.
1. Load the data into a Pandas data frame in Colab.
1. Use Pandas and Plotly express to explore the data.
1. Compare summary statistics from Pandas and from BQ SQL - are they the same?

In [0]:
%%bigquery --project $pid

SELECT
  doi,
  f.first_level.name AS for_1,
  IFNULL(ARRAY_LENGTH(references), 0) AS n_refs,
  IFNULL(ARRAY_LENGTH(author_affiliations), 0) AS n_authors,
  IFNULL(ARRAY_LENGTH(concepts), 0) AS n_kw
FROM `springer-nature-analytics.DS_dimensions.publications_full_refresh`,
  UNNEST(journal) AS j,
  UNNEST(`for`) AS f
WHERE j.title LIKE "Scientific Reports" AND year = 2018
LIMIT 1000