# Big Data for Engineers FS2021 – Exercises Week 1

## Exercise 1: Query operations in SQL

1. Label each of the following SQL statements with its query type.
  ```
  A) SELECT * FROM Posts WHERE Id = 123
  
  B) SELECT Id, ParentId FROM Posts WHERE ParentId IS NOT NULL
  
  C) SELECT u.Id, DisplayName
    FROM Users AS u
    JOIN Posts AS p ON u.id = p.OwnerUserId
    GROUP BY u.Id, DisplayName
  ```

2. What makes SQL a declarative language and what advantages does that have?

3. What aspects of functional languages are present in SQL, and what advantages does that have?

## Exercise 2: Explore the dataset

Here we will recall basic concepts from relational databases and try to illustrate them by example. First, some introductory questions:

1. What is a relational model? 
2. In what logical shape is the data stored? 
3. What is a primary key and what is its purpose?
4. What does 'first normal form' refer to? 

Now let us illustrate these concepts with a few examples. For this, we need to connect to the database we used in the first exercise. We repeat the steps here. First, we set the credentials for the connection.

In [None]:
server='ethbigdata2020.postgres.database.azure.com'
user='student@ethbigdata2020'
password='BigData2020'
database='poker.stackexchange.com'
connection_string=f'postgresql://{user}:{password}@{server}:5432/{database}?sslmode=require'

Then we run a first query against our server (following [this tutorial](https://docs.microsoft.com/en-us/azure/postgresql/quickstart-create-server-database-portal) from the Azure website). This should print the version information of the PostgreSQL server.

In [None]:
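import sqlalchemy

engine = sqlalchemy.create_engine(connection_string)
# Engine.execute() was removed in SQLAlchemy 2.0, so run the query on an explicit connection
with engine.connect() as connection:
    print(connection.execute(sqlalchemy.text('SELECT version()')).fetchall())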

We can now load (or reload, if already loaded) the extension and establish a connection to our database from above. Run the following cell and make sure the output says `Connected: <connection_string>`.

In [None]:
%load_ext sql
%sql $connection_string

Now we can use the ```%sql``` and ```%%sql``` magic commands to run SQL directly. ```%%sql``` turns a cell into a SQL cell. A SQL cell can run an arbitrary number of SQL statements and displays the result of the last one.

Let's see the version number again:

In [None]:
%%sql 
SELECT version();

Now let's run an SQL query:

In [None]:
%%sql 
SELECT Id, DisplayName FROM Users
LIMIT 10

### List of Tables

Now that you have established a connection to the database, let us try to understand it a bit better. Run the following queries, which show the content of a system table that lists the names of the tables.

In [None]:
%sql SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE'

In [None]:
%%sql
SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE' AND TABLE_CATALOG='poker.stackexchange.com';

In [None]:
TABLE_CATALOG='$database';

### List of attributes/columns

The following shows information about the attributes of the tables.

In [None]:
%sql SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE \
     FROM INFORMATION_SCHEMA.COLUMNS \
     WHERE TABLE_CATALOG='poker.stackexchange.com' AND TABLE_SCHEMA <> 'sys'\
     ORDER BY TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION;

For each table you can extract the primary key by running: 

In [None]:
%sql SELECT COLUMN_NAME \
     FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE \
     WHERE OBJECTPROPERTY(OBJECT_ID(CONSTRAINT_SCHEMA || '.' || QUOTENAME(CONSTRAINT_NAME)), 'IsPrimaryKey') = 1 \
        AND TABLE_NAME = 'Badges' AND TABLE_SCHEMA = 'dbo';

In [None]:
%%sql
-- This code block replaces the one above, which is specific to MS SQL (we are using PostgreSQL)
--   - get attribute name and type for the primary key of the `Badges` table
--   - taken from https://wiki.postgresql.org/wiki/Retrieve_primary_key_columns
SELECT a.attname, format_type(a.atttypid, a.atttypmod) AS data_type
FROM   pg_index i
JOIN   pg_attribute a ON a.attrelid = i.indrelid
                     AND a.attnum = ANY(i.indkey)
WHERE  i.indrelid = 'Badges'::regclass
AND    i.indisprimary;

Based on the results returned above, answer the following questions:

5. Which objects are modelled in the dataset and how do they relate (semantically) to each other?
6. What is the primary key of each table?

### Where we got the data from (if interested)

* [Info about the Stack Exchange dataset](http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede)
* [Web interface to query it](https://data.stackexchange.com/poker/query/new)
* [Link to the dataset](https://archive.org/download/stackexchange/) (you actually don't need this for these exercises)

## Exercise 3: Distribution of post scores

In this exercise, we want to find out how the scores of posts are distributed.

To start, write a query that selects the top 10 best-scored posts.

**Note**: `LIMIT <number>` is not part of the SQL standard. PostgreSQL supports it, but other systems use a different syntax to achieve the same thing, for example ```SELECT TOP <number>``` in MS SQL.

We now know what the best posts look like. What about "more normal" posts? Write a query that counts (using the COUNT operation) the number of posts for each score.

Did you use renaming in the query? If not, try to rename the result of the count operation.

Your query for the above exercise may give a very large result that is difficult to interpret. Let us write a query that rounds the scores of the posts to the nearest multiple of a constant that we define and counts the number of posts for each rounded score.  

In [None]:
%%sql
SELECT RoundedScore, Count(*) AS Count
FROM (
        SELECT ROUND((score+2.5)/5, 0) * 5 AS RoundedScore FROM Posts
    ) AS Rounded
GROUP BY RoundedScore
ORDER BY RoundedScore DESC;

Can you name the operation of calling a query from inside a query? What are the semantics of the GROUP BY and ORDER BY operations?

With the right constant for the rounding, you can already get a better grasp of the distribution of scores. Here, we round each score to the smallest integer multiple of 5 that is still strictly larger than the score (this is not the best way of rounding, but it will do for the purpose of this exercise).
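To see what the expression `ROUND((score+2.5)/5, 0) * 5` does to individual scores, the arithmetic can be reproduced in plain Python. This is only a sketch: the helper name `rounded_score` is made up for illustration, and it assumes that `ROUND` on numeric values resolves ties away from zero, which `ROUND_HALF_UP` mimics here.

```python
from decimal import Decimal, ROUND_HALF_UP


def rounded_score(score: int, step: int = 5) -> int:
    """Mimic ROUND((score + step/2) / step, 0) * step for an integer score.

    Hypothetical helper for illustration only; not part of the exercise.
    """
    half = Decimal(step) / 2
    quotient = (Decimal(score) + half) / Decimal(step)
    # ROUND_HALF_UP sends ties away from zero (assumed to match ROUND on numerics)
    return int(quotient.quantize(Decimal("1"), rounding=ROUND_HALF_UP)) * step


# Print the bucket for a few sample scores
for s in (-5, -1, 0, 4, 5, 9):
    print(s, "->", rounded_score(s))
```

Scores 0 to 4 land in bucket 5 and scores 5 to 9 in bucket 10. Under the tie-breaking assumption above, a negative multiple of 5 such as -5 maps to itself, so the "strictly larger" description is exact for all other scores.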

We will now execute the same query, but from within Python. This allows us to send the SQL query results to Matplotlib and plot them.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Store the result of the query in a Python object (add your query here!)
result = %sql SELECT RoundedScore, Count(*) AS Count \
     FROM ( \
             SELECT ROUND((score+2.5)/5, 0) * 5 AS RoundedScore FROM Posts \
        ) AS Rounded \
     GROUP BY RoundedScore \
     ORDER BY RoundedScore DESC;

# Convert the result to a Pandas data frame
df = result.DataFrame()

# Extract x and y values for a plot
x = df['RoundedScore'].tolist()
y = df['Count'].tolist()

# Print them just for debugging
print(x)
print(y)

# Plot the distribution of scores
fig, ax = plt.subplots()
ax.bar(range(len(df.index)), y, tick_label=[int(i) for i in x], align='center')
ax.set_xlabel('Score')
ax.set_ylabel('Number of Posts')

In [None]:
# This code block replaces the one above
#   - the only changes are in the lines `x = df[...` and `y = df[...`, where the data frame keys are lowercase
#     (PostgreSQL folds unquoted identifiers such as `RoundedScore` to lowercase)
%matplotlib inline
import matplotlib.pyplot as plt

# Store the result of the query in a Python object (add your query here!)
result = %sql SELECT RoundedScore, Count(*) AS Count \
     FROM ( \
             SELECT ROUND((score+2.5)/5, 0) * 5 AS RoundedScore FROM Posts \
        ) AS Rounded \
     GROUP BY RoundedScore \
     ORDER BY RoundedScore DESC;

# Convert the result to a Pandas data frame
df = result.DataFrame()

# Extract x and y values for a plot
x = df['roundedscore'].tolist()
y = df['count'].tolist()

# Print them just for debugging
print(x)
print(y)

# Plot the distribution of scores
fig, ax = plt.subplots()
ax.bar(range(len(df.index)), y, tick_label=[int(i) for i in x], align='center')
ax.set_xlabel('Score')
ax.set_ylabel('Number of Posts')

## Exercise 4: Impact of Post Count on Scores

We now want to find out whether the number of posts written by the owner of a post has an influence on the score of that post.
To that end, write queries that answer the following questions:

1. Who are the 10 users with the highest number of posts?
2. What is the average number of posts per user?
3. Which are the users with a number of posts higher than 10?
4. How many such users exist?

## Exercise 5: Included in the graded quiz

This is the exercise included in the [Week 1: SQL brush-up](https://moodle-app2.let.ethz.ch/mod/quiz/view.php?id=565933) quiz.

On `poker.stackexchange.com`, find the **frequent** tag such that the posts it appears in have the lowest average score. A tag is considered frequent if it appears in at least 50 posts.

**Example:** Say you have three frequent tags `<a>`, `<b>`, and `<c>`, which appear in the sets of posts `A`, `B`, and `C`, respectively. Note that `A`, `B`, and `C` are not necessarily disjoint, since a post may carry several tags. You have to look at the average scores of `A`, `B`, and `C`, see which set has the lowest, and report the tag `<a>`, `<b>`, or `<c>` that belongs to that set. For instance, if the average scores are `AVG(A) = 13`, `AVG(B) = 15`, `AVG(C) = 12`, then `<c>` is the tag we are looking for, since `C` has the lowest average score. You should report it without the `<`, `>` characters, i.e. just `c`.

**Note:** You can safely assume the result is unique.
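
The definition in the example can be made concrete with a small Python sketch on made-up data. Everything here is hypothetical: the posts, the helper name `lowest_scoring_frequent_tag`, and the toy threshold of 2 posts (the exercise uses 50). It only illustrates what "frequent" and "lowest average score" mean when the post sets overlap; it is not the SQL query the exercise asks for.

```python
from collections import defaultdict

# Hypothetical toy data: (score, tags) per post; not taken from the real dataset
posts = [
    (14, ["a", "b"]),  # belongs to both A and B
    (12, ["a", "c"]),  # belongs to both A and C
    (16, ["b"]),
    (12, ["c"]),
    (1, ["d"]),        # "d" appears only once, so it is not frequent here
]


def lowest_scoring_frequent_tag(posts, min_posts):
    """Return the frequent tag with the lowest average score, plus all averages."""
    scores_by_tag = defaultdict(list)
    for score, tags in posts:
        for tag in tags:
            scores_by_tag[tag].append(score)
    # Keep only tags that appear in at least `min_posts` posts
    averages = {
        tag: sum(scores) / len(scores)
        for tag, scores in scores_by_tag.items()
        if len(scores) >= min_posts
    }
    return min(averages, key=averages.get), averages


tag, averages = lowest_scoring_frequent_tag(posts, min_posts=2)
print(tag, averages)
```

With this toy data the averages are 13, 15, and 12 for `a`, `b`, and `c`, matching the numbers in the example. The tag `d` is ignored even though its single post has the lowest score.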

