# Stack Overflow Analysis using SQL
In this notebook I will work with publicly available data about the technical question and answer site [Stack Overflow](https://stackoverflow.com/). I demonstrate the use of SQL queries to extract data to answer the following questions.

  **1) What hour of the day are most questions and answers posted?**
  
  **2) How many daily questions were posted, on average, in 2021?**
  
  **3) In January 2021, did more users post questions or answers?**
  
  **4) Who are potential experts on the topic of BigQuery?**
  
This notebook has four sections, with each section devoted to answering one of the four questions above. 

I decided to create this notebook after completing the Kaggle course [Intro to SQL](https://www.kaggle.com/learn/intro-to-sql). I will emulate the basic syntax demonstrated throughout that course to access data using BigQuery from the [Google Cloud](https://cloud.google.com/python/docs/reference) library. The queries implemented below demonstrate my understanding of basic SQL, including the commands SELECT, AS, FROM, WHERE, GROUP BY, ORDER BY, and JOIN, as well as aggregate functions.

If you are viewing this notebook on GitHub, the output results from each cell are not shown. To see the full notebook, head over to Kaggle [here](https://www.kaggle.com/jasonphiltron/stack-overflow-analysis-using-sql).

We begin by 

In [1]:
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "stackoverflow" dataset
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

We also import pandas, matplotlib, and seaborn.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 1) What hour of the day are most questions and answers posted?

To answer this question, we will need to access information from both the `posts_questions` and `posts_answers` tables. For both of these tables in the Stack Overflow dataset we will count the number of posts grouped by the hour of the day. We will limit the query to posts created in 2021.

We begin by viewing the first few lines of each of the two tables to ensure that we understand the data format and column names.

In [3]:
# Construct a reference to the "posts_questions" table
questions_table_ref = dataset_ref.table("posts_questions")

# API request - fetch the table
questions_table = client.get_table(questions_table_ref)

# Preview the first five lines of the "posts_questions" table
client.list_rows(questions_table, max_results=3).to_dataframe()

In [4]:
# Construct a reference to the "posts_answers" table
answers_table_ref = dataset_ref.table("posts_answers")

# API request - fetch the table
answers_table = client.get_table(answers_table_ref)

# Preview the first five lines of the "posts_answers" table
client.list_rows(answers_table, max_results=3).to_dataframe()

In this example we will extract the data from the two tables separately, join the data using pandas, and then plot the data. (Use of a JOIN statement within the SQL string will be demonstrated later in the notebook.)

The column `creation_date` appears in both tables. We will count up the posts and group them by `hour_of_day`. We will also limit the query so that it only returns entries for posts created in 2021.

In [5]:
hour_answers_query = """
                SELECT EXTRACT(HOUR FROM creation_date) as hour_of_day,
                    COUNT(1) as num_answers
                FROM `bigquery-public-data.stackoverflow.posts_answers`
                WHERE creation_date >= '2021-01-01'
                    AND creation_date < '2022-01-01'
                GROUP BY hour_of_day
                ORDER BY hour_of_day
                """

# Set up the query, with a limit on the total memory accessed
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
hour_answers_query_job = client.query(hour_answers_query, job_config=safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
hour_answers_results = hour_answers_query_job.to_dataframe() # Your code goes here

# Preview results
print(hour_answers_results.head(3))

In [6]:
hour_questions_query = """
                SELECT EXTRACT(HOUR FROM creation_date) as hour_of_day,
                    COUNT(1) as num_questions
                FROM `bigquery-public-data.stackoverflow.posts_questions`
                WHERE creation_date >= '2021-01-01'
                    AND creation_date < '2022-01-01'
                GROUP BY hour_of_day
                ORDER BY hour_of_day
                """

# Set up the query, with a limit on the total memory accessed
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
hour_questions_query_job = client.query(hour_questions_query, job_config=safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
hour_questions_results = hour_questions_query_job.to_dataframe() # Your code goes here

# Preview results
print(hour_questions_results.head(3))

We now join the two dataframes on the `hour_of_day` column. Then we will plot the results.

In [7]:
hour_results = hour_questions_results.join(hour_answers_results['num_answers'], on=['hour_of_day'])
hour_results.head()

In [8]:
# Plot results
sns.set_style('whitegrid')
sns.lineplot(data=hour_results.set_index('hour_of_day'))


The time of day that mosts posts are created is in the afternoon, around 2 and 3 PM. This trend holds for both questions and answers, suggesting that the site as a whole gets more traffic during these hours, and that there aren't different times of day that people like to ask questions as opposed to answering them.

Also, there are very few posts created between 10 PM and 6 AM. This suggests that the users of the Stack Overflow site are largely diurnal.

## 2) How many daily questions were posted, on average, in 2021?

To answer this question, we will count the number of questions per day and then average that result for the 2021 year. Also, to demonstrate the use of additional aggregate functions beyond COUNT, we calculate the minimum, maximum, and standard deviation of the number of daily questions posted throughout 2021. In this example we also use a common table expression (CTE) to make the query easier to read and interpret.

In [59]:
daily_question_query = """
                WITH daily_posts AS
                (
                SELECT COUNT(id) AS num_posts,
                    EXTRACT(DAY from creation_date) AS day
                FROM `bigquery-public-data.stackoverflow.posts_questions`
                WHERE EXTRACT(YEAR from creation_date) = 2021
                GROUP BY day
                )
                SELECT AVG(num_posts) AS avg_num_daily_posts,
                    MIN(num_posts) AS min_num_daily_posts,
                    MAX(num_posts) AS max_num_daily_posts,
                    STDDEV(num_posts) AS stddev_num_daily_posts
                FROM daily_posts
                """

# Set up the query, with a limit on the total memory accessed
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
daily_question_query_job = client.query(daily_question_query, job_config=safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
daily_question_results = daily_question_query_job.to_dataframe() # Your code goes here

# View results
print(daily_question_results.head())

The average number of daily posts in 2021 is about 58,000. The minimum, maximum, and standard deviation are 27,793, 64,258, and 6,193 posts, respectively.

## 3) In January 2021, did more users post questions or answers?

To answer this question, we want to determine which users have posted a question and/or an answer in January 2021. A secondary question we want to answer is whether there more users that post questions or answers.

We will access information from both the `posts_questions` and `posts_answers` tables. We will group by `user_id` defined as the `owner_user_id` column, and count the `number_of_answers` or `number_of_questions` posted by each user. We will do this in two queries, one per table, and join the results using Python. 

In [17]:
question_numbers_query = """
                SELECT owner_user_id AS user_id,
                    COUNT(1) AS number_of_questions
                FROM `bigquery-public-data.stackoverflow.posts_questions`
                WHERE creation_date >= '2021-01-01'
                    AND creation_date < '2021-02-01'
                GROUP BY owner_user_id
                ORDER BY number_of_questions DESC
                """

# Set up the query, with a limit on the total memory accessed
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
question_numbers_query_job = client.query(question_numbers_query, job_config=safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
question_numbers_results = question_numbers_query_job.to_dataframe() # Your code goes here

# Preview results
print(question_numbers_results.head())

We see that someone with `user_id` NaN posted the most questions! Clearly this person does not exist, but the NaN entry is an artifact of the way that the data was collected. Perhaps these questions were posted by visitors to the site that did not have a username or were not logged in. In any case, we will ignore this eccentricity in our analysis here.

In [18]:
answer_numbers_query = """
                SELECT owner_user_id AS user_id,
                    COUNT(1) AS number_of_answers
                FROM `bigquery-public-data.stackoverflow.posts_answers`
                WHERE creation_date >= '2021-01-01'
                    AND creation_date < '2021-02-01'
                GROUP BY owner_user_id
                ORDER BY number_of_answers DESC
                """

# Set up the query, with a limit on the total memory accessed
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
answer_numbers_query_job = client.query(answer_numbers_query, job_config=safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
answer_numbers_results = answer_numbers_query_job.to_dataframe() # Your code goes here

# Preview results
print(answer_numbers_results.head())

There were far more posts by the user that answered the most questions (1020) than by the user who asked the most questions (46). 

In [19]:
q_and_a_numbers = question_numbers_results.join(answer_numbers_results['number_of_answers'], on='user_id', how='outer')
users_q_only = sum(q_and_a_numbers['number_of_answers'].isna())
users_a_only = sum(q_and_a_numbers['number_of_questions'].isna())
users_q_and_a = sum(q_and_a_numbers['number_of_questions'].notna() & q_and_a_numbers['number_of_answers'].notna())
print("Number of users that only asked questions: {}.".format(users_q_only))
print("Number of users that only answered questions: {}.".format(users_a_only))
print("Number of users that both asked and answered questions: {}.".format(users_q_and_a))
print("Total number of unique users in this dataframe: {}".format(len(q_and_a_numbers)))

In January 2021, more users asked questions (about 117,000) than answered questions (about 70,000). A small proportion of users both asked and answered questions. (I am surprised that there aren't more users that are both asking and answering questions!) Here is one additional question: For users that both asked and answered questions, which did they do more of?

In [20]:
users_q_and_a_row_TF = q_and_a_numbers['number_of_questions'].notna() & q_and_a_numbers['number_of_answers'].notna()
q_and_a_numbers[users_q_and_a_row_TF].drop('user_id', axis=1).mean()

On average, these users posted more than twice as many answers as questions. 

## 4) Who are potential experts on the topic of BigQuery?

To answer this question, we want a list of users that have answered many questions. We will make an assumption that if a user answers many questions related to a certain topic, then they are an expert on that topic. This question mirrors the Joining Data exercise in the previously mentioned Kaggle course Intro to SQL. However, we will add an additional filter to the query. We will only count answer posts that are of at least a minimum length. We will do this to make sure that very short posts, which presumably may not answer the question or add value, are not counted in the total number of answers given by a user. Finally, we will limit our search to answers posted in 2021.

We will access information from both the `posts_questions` and `posts_answers` tables. We will group by `user_id` defined as the `owner_user_id` column from the `posts_answers` table, and count the `number_of_answers` given by each user. We will join the `posts_questions` table on the `id` column to the `parent_id` column from `posts_answers`. Additionally, we will only select questions with `tags` containing "bigquery" and having a minimum post length of 200 characters.

In [16]:
bigquery_experts_query = """
                SELECT ans.owner_user_id AS user_id,
                    COUNT(1) AS number_of_answers
                FROM `bigquery-public-data.stackoverflow.posts_answers` AS ans
                INNER JOIN `bigquery-public-data.stackoverflow.posts_questions` AS que
                    ON ans.parent_id = que.id
                WHERE que.tags LIKE '%bigquery%'
                    AND CHAR_LENGTH(ans.body) >= 200
                    AND ans.creation_date >= '2021-01-01'
                    AND ans.creation_date < '2022-02-01'
                GROUP BY user_id
                ORDER BY number_of_answers DESC
                """

# Set up the query, with a limit on the total memory accessed
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=3*10**10)
bigquery_experts_query_job = client.query(bigquery_experts_query, job_config=safe_config) # Your code goes here

# API request - run the query, and return a pandas DataFrame
bigquery_experts_results = bigquery_experts_query_job.to_dataframe() # Your code goes here

# Preview results
print(bigquery_experts_results.head())

Users with id's 5221944, 1144035, and 13473525 have the highest number of posts that fit the criteria we set. Perhaps we should award them each a badge for **2021 Stack Overflow Expert on BigQuery**!

**This is the end of the notebook. Thanks for reading!**