# Bioinformatics Questions Asked On Stackoverflow

<img src="https://stackoverflow.design/assets/img/logos/so/logo-stackoverflow.png" alt="Stackoverflow-logo" style="width: 300px;"/>

If you're a coder, you are probably familiar with **Stackoverflow**. Almost any problem you may encounter in your code has already been asked about and answered on this website. In this project, I gather information about the *most recent* questions under a specific tag and summarize them in a table. <br> 

For each question, the details include: <br>
- What is the question headline?
- Who asked and answered the question, and what are their reputations (*if available*)?
- When was the question asked and answered (*if available*)?
- How popular is that question? (views, number of answers, number of votes for the question and top answer)
- What tags are associated with it?

## Outline

1. Extract raw information from [Stackoverflow question pages](https://stackoverflow.com/questions).
 * Inspect the structure of the webpage and analyze its URL's components.
 * Define functions: Retrieve all data shown on a single page of the website.
 * Checkpoint: Scrape the data about the 50 most recent bioinformatics questions on <b>Stackoverflow</b>.
2. Draw out the fields of interest from the raw data.
 * Determine the patterns associated with each of the 14 fields of interest: <i>asker's name, asker's reputation, question, tags, time asked, # views, # answers, # votes for the question, has accepted answer, top answerer's name, top answerer's reputation, # votes for the top answer, time of the top answer, link to the question </i> (example below).
 * Define functions: fetch relevant information from the chunk of raw data and correct their formats, if necessary.
 * Checkpoint: Parse the 50 most recent questions (scraped in the 1st step) and pull out the 14 fields.
3. Automate the process for multiple pages and organize the results into a table
 * Define functions: create a loop to extract and filter data from multiple pages of the website, then save all results in a csv file.
 * Checkpoint: create a csv file containing 100 recent bioinformatics questions on <b>Stackoverflow</b>.
4. Conclusion
 * Apply my functions for a different topic.
 * Summarize what I have accomplished in this project.
 * Ideas for future work.
5. References
 
Here is a glimpse of what our final output would look like.

<img src="https://i.ibb.co/Qkx2NWJ/Output-example.png" alt="Output-example">

## Page structure

I will be scraping from [Stackoverflow](https://stackoverflow.com/questions). There is an overwhelming number of questions on the website, and each of them is associated with tags (such as <code>python</code>, <code>machine-learning</code>, etc.) that narrow down its scope. Not every string is a valid tag; all existing tags are listed on the [tags page](https://stackoverflow.com/tags). <br>
For this project, I'm specifically interested in bioinformatics questions (the tag is <code>bioinformatics</code>), so I'm using it as the example.

### URL structure

After inspecting the website, I identified 2 URL patterns relevant to my objective: <br>
- <code>ht&#8203;tps://stackoverflow.com/questions/tagged/**[tag-name]**?tab=newest&page=**[page-number]**&pagesize=50</code>, this URL will show 50 questions with the tag **[tag-name]**, on the page numbered **[page-number]** of all available pages.
- <code>ht&#8203;tps://stackoverflow.com/questions/<strong>[question-info]</strong></code>, this URL shows a particular question (and its answers, if any), where **[question-info]** is the question's relative path within the website.
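
As a rough illustration of the first pattern, the query string can also be assembled with the standard library, so that tags containing special characters (for example `c++`) are escaped. The helper name `build_tag_url` is hypothetical and is not one of this project's functions.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://stackoverflow.com"

def build_tag_url(tag, page=1, page_size=50):
    """Hypothetical helper: build the 'newest questions for a tag' URL."""
    # quote() percent-encodes characters such as '+' that are not URL-safe
    path = "/questions/tagged/" + quote(tag, safe="")
    query = urlencode({"tab": "newest", "page": page, "pagesize": page_size})
    return BASE_URL + path + "?" + query

print(build_tag_url("bioinformatics"))
print(build_tag_url("c++", page=2))
```

The first call reproduces the URL pattern described above; the second shows how `+` becomes `%2B`.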

### Functions to scrape the website

The functions in this section aim to "copy" all the HTML content from the website into a variable (in particular, a BeautifulSoup document) for later scraping. <br>
These functions, along with those in other sections, will all be integrated for the final purpose.

In [1]:
# Import the libraries
import requests               # To extract the HTML document from the website
from bs4 import BeautifulSoup # To parse through the HTML document
import math                   # To round up a division, which is later needed to limit the number of pages that can be scraped
import csv                    # To convert result into a csv file
import pandas as pd           # To read the csv file

In [2]:
# Construct the URL
def construct_url(topic = None, k = 1, href = None):
    base_url = 'https://stackoverflow.com'
    if href is not None: # Second pattern, to check a particular question
        url = base_url + href
    elif topic is not None: # First pattern, to find all questions of a tag
        url = base_url + '/questions/tagged/' + topic + '?tab=newest&page=' + str(k) + '&pagesize=50'
    else: # Neither argument was given, so no URL can be built
        raise ValueError('Provide either a topic or an href')
    return url

In [3]:
def get_page(url):
    # Get the HTML page content using requests
    content = requests.get(url)
    
    # Ensure that the response is valid
    if content.status_code != 200:
        print('Status code:', content.status_code)
        raise Exception('The tag may not exist or you have been requesting too frequently')
    
    # Construct a beautiful soup document
    page = BeautifulSoup(content.text, 'html.parser')
    
    return page

### Checkpoint 1

To confirm our functions work:

In [4]:
# Get the BeautifulSoup document for the tag "bioinformatics"
url = construct_url(topic = "bioinformatics")
page = get_page(url)

In [5]:
# Inspect the first 1000 characters of the HTML page
print(page.prettify()[:1000])

<!DOCTYPE html>
<html class="html__responsive" lang="en">
 <head>
  <title>
   Newest 'bioinformatics' Questions - Stack Overflow
  </title>
  <link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196" rel="shortcut icon"/>
  <link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" rel="apple-touch-icon"/>
  <link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" rel="image_src"/>
  <link href="/opensearch.xml" rel="search" title="Stack Overflow" type="application/opensearchdescription+xml"/>
  <meta content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0" name="viewport"/>
  <meta content="website" property="og:type">
   <meta content="https://stackoverflow.com/questions/tagged/bioinformatics" property="og:url"/>
   <meta content="Stack Overflow" property="og:site_name"/>
   <meta content="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-to

## Details From Each Question

As stated in the beginning, I want to extract the questions, their statistics, and their askers and answerers. To do that, I inspected the fields of interest and identified their HTML tags and attributes.

### HTML Structure

On the page listing all questions, the information about each question is stored within a **\<div>** tag with the attribute <code>class='s-post-summary'</code>. These tags can be visualized as <u>blocks</u> separating the questions in the web browser interface. Each block contains 2 child **\<div>** tags, *content* and *stat*, with <code>class='s-post-summary--content'</code> and <code>class='s-post-summary--stats'</code>, respectively: <br>
- *content* tags contain the question title, the question tags, information about the asker (name and reputation), and when the question was asked. This information is displayed on the <u>right</u> side of the question block.
- *stat* tags contain some statistics about the question, including the number of votes for the question, the number of answers, the number of views, and whether the asker has accepted any answer. This information is displayed on the <u>left</u> side of the question block.

The image below is an example of a question block. <br>

<img src="https://i.ibb.co/mSxzjfs/block-example.png" alt="Block Example">

The *content* is on the right, containing the following information: <br>
- Question title: Calculate ratio of values within one column
- Tags: r, calculated-columns
- Asker: Jorge Cornick, reputation 1
- Time the question was asked: 2 hours ago

The *stat* is on the left, showing: <br>
- 0 votes for the question, 2 answers and 23 views.
- The asker has not accepted any answer (otherwise, the background color of <code>2 answers</code> would have been green with a check mark next to it).

To extract the information, I used the following parameters:

For the *content* tag:

|Detail | Tag | Attribute | Other information |
|-------|-----|-----------|------------------|
|Title  | a   |           |The first 'a' tag   |
|Tags   | ul, li|         | |
|Asker name|a |class='s-user-card--info'|Missing for some questions |
|Asker reputation|a |class='s-user-card--rep'|Missing for some questions |
|Time asked|time, span| |The first 'span' tag only. Missing for some questions|

For the *stat* tag:

|Detail | Tag | Attribute | Other information |
|-------|-----|-----------|------------------|
|Votes, Answers, Views|span|class='s-post-summary--stats-item-number'|Get 3 numbers in the same order |
|Has accepted answer| |class = 'has-accepted-answer'|A boolean value that checks whether such a class exists|

Below is an illustration of how I chose the tags and attributes of each section based on my inspection of the page's HTML.

<img src="https://i.ibb.co/PW94W2V/HTML-tags.png" alt="HTML-tags">
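
To make the two tables above concrete, here is a minimal sketch that applies the same selectors to a small, hand-written HTML fragment. The fragment is a simplified mock-up built from the example block above, not the real page markup, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hand-written, simplified mock-up of one question block (not the real page markup)
html = """
<div class="s-post-summary">
  <div class="s-post-summary--stats">
    <span class="s-post-summary--stats-item-number">0</span>
    <span class="s-post-summary--stats-item-number">2</span>
    <span class="s-post-summary--stats-item-number">23</span>
  </div>
  <div class="s-post-summary--content">
    <a href="/questions/74838141/calculate-ratio-of-values-within-one-column">Calculate ratio of values within one column</a>
    <ul><li>r</li><li>calculated-columns</li></ul>
  </div>
</div>
"""

block = BeautifulSoup(html, "html.parser").find("div", class_="s-post-summary")
content = block.find("div", class_="s-post-summary--content")
stats = block.find("div", class_="s-post-summary--stats")

title = content.find("a").text  # the first 'a' tag holds the title
tags = [li.text for li in content.find("ul").find_all("li")]
votes, answers, views = [
    int(span.text)
    for span in stats.find_all("span", class_="s-post-summary--stats-item-number")
]
# True only if some element inside the stats block carries this class
accepted = stats.find(class_="has-accepted-answer") is not None

print(title, tags, votes, answers, views, accepted)
```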

Because the question has some answers, I am also curious about the top answer (the one with the most votes, or the most recent if the votes are tied). To scrape that, I open the [URL of the question](https://stackoverflow.com/questions/74838141/calculate-ratio-of-values-within-one-column), which contains the question itself and all of its answers. <br>
The page is also organized into blocks (**\<div>** tags with <code>class='post-layout'</code>), and the top answer is always the second block. I can then get the information about the answer using the following parameters:

|Detail | Tag | Attribute | Other information |
|-------|-----|-----------|------------------|
|Answerer name|div, span |class='user-details', itemprop='author' |The attributes apply to the \<div> tag only. Missing for some answers |
|Answerer reputation|span |class='reputation-score'|Missing for some answers |
|Time answered|span|class='user-action-time'|The \<span> is looked up inside the element filtered by class|
|Votes| |class='js-vote-count'||
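
The "second block" rule from the table above can be sketched on a tiny mock-up as follows. The HTML is hand-written for illustration only, and `top_answer_summary` is a hypothetical helper: unlike the project's function, it returns `None` when the page holds fewer than two blocks and converts the vote count to an integer.

```python
from bs4 import BeautifulSoup

# Hand-written mock-up of a question page: first block = question, second = top answer
html = """
<div class="post-layout"><div class="js-vote-count" data-value="0">0</div></div>
<div class="post-layout">
  <div class="js-vote-count" data-value="-1">-1</div>
  <div class="user-action-time">answered <span title="2022-12-20 14:57:33Z">Dec 20, 2022</span></div>
</div>
"""

def top_answer_summary(soup):
    """Hypothetical helper: return (votes, time) of the top answer, or None."""
    blocks = soup.find_all("div", class_="post-layout", limit=2)
    if len(blocks) < 2:  # the page holds only the question
        return None
    answer = blocks[1]
    votes = int(answer.find(class_="js-vote-count")["data-value"])
    answered_at = answer.find(class_="user-action-time").find("span")["title"]
    return votes, answered_at

print(top_answer_summary(BeautifulSoup(html, "html.parser")))
```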


### Functions to Extract Relevant Information from the HTML Document

The functions in this section aim to extract the information from a single page and put it in a list of dictionaries. <br>
<b><i>Note </i></b>that these functions use specific tags and attributes on the page as of Dec 20th, 2022, so they may not always work in the future if Stackoverflow changes its design. <br>
These functions, along with those in other sections, will all be integrated for the final purpose.

In [6]:
def scrape_questions_per_page(page):
    # Get list of tags
    content_tags = page.find_all('div', {'class': "s-post-summary--content"})
    stat_tags = page.find_all('div', {'class': "s-post-summary--stats"})
    
    # Lengths of both should be equal
    if len(content_tags) != len(stat_tags):
        print(len(content_tags), len(stat_tags)) 
        raise Exception('The number of tags are unequal')
    else:
        recent_questions = []
        for i in range(len(content_tags)):
            content_tag = content_tags[i]
            stat_tag = stat_tags[i]
            # print(i) # Uncomment this for debug
            recent_questions.append(parse_info(content_tag, stat_tag))
    
    return recent_questions

In [7]:
def parse_info(content_tag, stat_tag):
    # Extract question summaries
    [question, user_name, user_reputation, other_tags, time_asked] = get_question_info(content_tag)
    href = content_tag.find('a')['href']
    question_url = construct_url(href=href)
    
    # Extract question statistics
    [no_votes, no_answers, views, accepted] = get_question_stat(stat_tag)

    # Extract top answer's info
    if no_answers > 0:
        [answer_name, answer_repu, answer_vote, answer_time] = get_answer_info(href)
    else:
        [answer_name, answer_repu, answer_vote, answer_time] = [None]*4
    return {
        'asker_name': user_name,
        'asker_reputation': user_reputation,
        'question': question,
        'tags': other_tags,
        'time_asked': time_asked,
        'views': views,
        'no_answers': no_answers,
        'no_votes_question': no_votes,
        'has_accepted_answer': accepted,
        'answerer_name': answer_name,
        'answerer_reputation': answer_repu,
        'no_votes_answer': answer_vote,
        'time_answered': answer_time,
        'link': question_url
    }

In [8]:
def get_question_info(content_tag):
    # Question
    question = content_tag.find('a').text
    # Asker
    user = content_tag.find(class_ = "s-user-card--info")
    if user.find('a') != None:
        user_name = user.find('a').text
        if user.find(class_="s-user-card--rep") != None:
            user_reputation = user.find(class_="s-user-card--rep").text
            user_reputation = int(''.join(filter(str.isdigit, user_reputation)))
        else:
            user_reputation = None
    else:
        [user_name, user_reputation] = [None]*2
    # Other tags
    other_tags = [tag.text for tag in content_tag.find('ul').find_all('li')]
    other_tags = str(other_tags).replace(",", "/") # Convert list to string, replace , to / to write into csv later
    # Time asked
    if content_tag.find('time').find('span') != None:
        time_asked = content_tag.find('time').find('span')['title']
    else:
        time_asked = None
    return [question, user_name, user_reputation, other_tags, time_asked]

In [9]:
def get_question_stat(stat_tag):
    # Votes, Answers, Views
    [no_votes, no_answers, views] = [str_to_int(stat.text) for stat in stat_tag.find_all('span', class_ = "s-post-summary--stats-item-number")]
    
    # Has accepted answers
    accepted = stat_tag.find(class_ = "has-accepted-answer") != None
    
    return [no_votes, no_answers, views, accepted]

# Large numbers use a 'k' suffix in place of x1000, so this function converts them to integers.
def str_to_int(text): # 'text' avoids shadowing the built-in str
    text = text.strip()
    return int(round(float(text[:-1]) * 1000)) if text[-1] == 'k' else int(text)
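
As a side note, the same conversion can be sketched with `Decimal`, which avoids binary floating-point rounding and makes it easy to add more suffixes. The name `parse_count` is hypothetical, and the `'m'` suffix is an assumption; only `'k'` is handled in this project.

```python
from decimal import Decimal

# Assumed suffixes: 'k' is seen in this project; 'm' is included as an assumption
SUFFIXES = {"k": 1_000, "m": 1_000_000}

def parse_count(text):
    """Hypothetical helper: turn '23', '1,234' or '1.2k' into an int."""
    cleaned = text.strip().lower().replace(",", "")
    if cleaned and cleaned[-1] in SUFFIXES:
        # Decimal keeps '1.2' exact before multiplying by the suffix value
        return int(Decimal(cleaned[:-1]) * SUFFIXES[cleaned[-1]])
    return int(cleaned)

for raw in ["23", "-1", "1,234", "1.2k", "198k"]:
    print(raw, "->", parse_count(raw))
```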

In [10]:
def get_answer_info(href):
    # Scrape the question page
    sub_url = construct_url(href=href)
    sub_page = get_page(sub_url)
    
    # Check only the top answer
    top_answer = sub_page.find_all('div', class_ = "post-layout", limit = 2)[1]
    # How many votes
    answer_vote = top_answer.find(class_ = "js-vote-count")['data-value']
    # When the answer was posted
    answer_time = top_answer.find(class_ = "user-action-time").find('span')['title']
    # Who posted the answer
    if top_answer.find('div', {'class': "user-details", 'itemprop':"author"}) != None:
        answer_name = top_answer.find('div', {'class': "user-details", 'itemprop':"author"}).find('span').text# Name
        if top_answer.find('span', class_ = "reputation-score") != None:
            answer_repu = top_answer.find('span', class_ = "reputation-score").text # Reputation
            answer_repu = str_to_int(answer_repu.replace(",","")) 
        else:
            answer_repu = None
    else:
        [answer_name, answer_repu] = [None]*2
    
    return [answer_name, answer_repu, answer_vote, answer_time]

### Checkpoint 2

In [11]:
# Extract information from the first page of questions with the [bioinformatics] tag
recent_questions = scrape_questions_per_page(page)

In [12]:
len(recent_questions) # Should have 50 questions

50

In [13]:
# View the 2 most recent ones
recent_questions[:2]

[{'asker_name': 'LaLuna Kon',
  'asker_reputation': 23,
  'question': 'Mean for multiple BigWig files',
  'tags': "['python'/ 'r'/ 'bioinformatics']",
  'time_asked': '2022-12-20 14:35:24Z',
  'views': 17,
  'no_answers': 0,
  'no_votes_question': 0,
  'has_accepted_answer': False,
  'answerer_name': None,
  'answerer_reputation': None,
  'no_votes_answer': None,
  'time_answered': None,
  'link': 'https://stackoverflow.com/questions/74864686/mean-for-multiple-bigwig-files'},
 {'asker_name': 'ricehound',
  'asker_reputation': 3,
  'question': 'How to open and close conda environment while running python script? [duplicate]',
  'tags': "['python'/ 'shell'/ 'terminal'/ 'conda'/ 'bioinformatics']",
  'time_asked': '2022-12-20 13:50:34Z',
  'views': 24,
  'no_answers': 1,
  'no_votes_question': 0,
  'has_accepted_answer': False,
  'answerer_name': 'August Axelsson',
  'answerer_reputation': 1,
  'no_votes_answer': '-1',
  'time_answered': '2022-12-20 14:57:33Z',
  'link': 'https://stackove

## Multiple pages

So far, I have successfully extracted the information from the website <code>*ht&#8203;tps://stackoverflow.com/questions/tagged/**[tag-name]**?tab=newest&page=**[page-number]**&pagesize=50*</code>, with **[tag-name]** = 'bioinformatics' and **[page-number]** = 1. <br>
While I can customize the **[tag-name]**, I have not yet varied the **[page-number]**. Since the URL structure is intuitive, I can loop over several pages. However, because scraping each page also involves scraping up to 50 sub-pages (one per answered question), the website starts rejecting requests when I try to scrape more than about <u>3 pages</u> at once. <br>
In addition, the scraped information is currently stored as a list of dictionaries, which is not easy to read. Thus, I will use another function to save it as a tabular document (csv file).
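
One possible way to soften this limit is to wait and retry when a request is rejected. The sketch below is not part of this project's functions: `fetch_with_retry` is a hypothetical helper, the waiting times are arbitrary, and the stub at the end only simulates a rejected request followed by a successful one.

```python
import time

def fetch_with_retry(fetch, url, retries=3, wait_seconds=60, sleep=time.sleep):
    """Hypothetical helper: call fetch(url) until it returns status 200.

    `fetch` is any callable returning an object with a `status_code`
    attribute, for example `requests.get`.
    """
    for attempt in range(1, retries + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < retries:
            # Wait longer after each failed attempt before trying again
            sleep(wait_seconds * attempt)
    raise RuntimeError(f"Giving up on {url} after {retries} attempts")

# Offline demonstration with a stub instead of a real HTTP request
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

statuses = iter([429, 200])  # first call is rejected, second succeeds
waits = []                   # records the waits instead of really sleeping
result = fetch_with_retry(lambda url: FakeResponse(next(statuses)),
                          "https://stackoverflow.com/questions", sleep=waits.append)
print(result.status_code, waits)
```

With the real library, the call would look like `fetch_with_retry(requests.get, url)`.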

### Functions

The functions in this section aim to integrate extracted information from multiple pages and write the final result into a csv file.
These functions, along with those in other sections, will all be integrated for the final purpose.

In [14]:
def scrape_recent_questions(topic, n=1, skip=0): 
    # n = number of pages to scrape, skip = number of pages to skip
    # By default, scrape only the first page (50 questions)
    
    # Estimate the limit
    url = construct_url(topic = topic, k = 1)
    page = get_page(url)
    # Get the number of questions
    no_questions = int(''.join(filter(str.isdigit, page.find('div', class_="fs-body3").text))) # Keep only the digit from text
    no_pages = math.ceil(no_questions/50)
    
    if skip >= no_pages:
        raise Exception('Skip too many pages. There are ' + str(no_pages) + ' pages available')
    elif skip + n > no_pages:
        n = no_pages - skip
        print("Not enough questions. Scraping " + str(n) + " pages")
    
    # Scraping
    questions = []
    for page_no in range(skip, skip+n):
        # Get the website content (bs4 document)
        url = construct_url(topic = topic, k = page_no+1)
        page = get_page(url)
        # Scrape that website
        questions += scrape_questions_per_page(page)
    return questions
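
The page arithmetic used above can be checked in isolation. This is a minimal sketch with a hypothetical helper `pages_to_scrape` and a made-up total of 120 questions; it mirrors the clamping logic but does no scraping.

```python
import math

def pages_to_scrape(total_questions, n=1, skip=0, page_size=50):
    """Hypothetical helper: 1-based page numbers to request."""
    total_pages = math.ceil(total_questions / page_size)
    if skip >= total_pages:
        raise ValueError(f"Only {total_pages} pages available")
    last = min(skip + n, total_pages)  # clamp to the pages that exist
    return list(range(skip + 1, last + 1))

print(pages_to_scrape(120, n=2, skip=1))  # pages 2 and 3
print(pages_to_scrape(120, n=5, skip=2))  # only page 3 is left
```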

In [15]:
def write_csv(items, path): # Reference: https://stackoverflow.com/questions/3086973/how-do-i-convert-this-list-of-dictionaries-to-a-csv-file
    keys = items[0].keys()

    with open(path, 'w', newline='', encoding='utf-8') as output_file: # utf-8 keeps non-ASCII user names intact
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(items)
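
A small aside on the csv format: the `csv` module quotes fields that contain commas, so values such as question titles survive a round trip unchanged. The sketch below uses a made-up row and an in-memory buffer instead of a file, and it joins the tags with `;` as an alternative to the string conversion used earlier (assuming tag names never contain `;`).

```python
import csv
import io

# Made-up example row; the title deliberately contains a comma
rows = [{"question": "Mean, median and mode in R", "tags": ["r", "statistics"]}]

buffer = io.StringIO()  # in-memory stand-in for a csv file
writer = csv.DictWriter(buffer, fieldnames=["question", "tags"])
writer.writeheader()
for row in rows:
    # Join the list with ';' so it is stored as one plain string
    writer.writerow({**row, "tags": ";".join(row["tags"])})

buffer.seek(0)
restored = [{**r, "tags": r["tags"].split(";")} for r in csv.DictReader(buffer)]
print(restored == rows)
```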

### Checkpoint 3 - Final

In [16]:
# Scrape questions from the same topic, but on pages 2 and 3
almost_recent_question = scrape_recent_questions('bioinformatics', 2, 1)

In [17]:
len(almost_recent_question) #Should be 100

100

In [18]:
# View the first 2 questions
almost_recent_question[:2]

[{'asker_name': 'Janie Olver',
  'asker_reputation': 1,
  'question': 'Problems downloading edgeR - R version 4.0.2',
  'tags': "['r'/ 'bioinformatics'/ 'rna-seq']",
  'time_asked': '2022-11-29 15:21:57Z',
  'views': 21,
  'no_answers': 0,
  'no_votes_question': 0,
  'has_accepted_answer': False,
  'answerer_name': None,
  'answerer_reputation': None,
  'no_votes_answer': None,
  'time_answered': None,
  'link': 'https://stackoverflow.com/questions/74616335/problems-downloading-edger-r-version-4-0-2'},
 {'asker_name': 'Nickmofoe',
  'asker_reputation': 57,
  'question': 'How can I print out multiple similar patterns I have matched in perl?',
  'tags': "['regex'/ 'perl'/ 'bioinformatics'/ 'protein-database']",
  'time_asked': '2022-11-29 12:44:46Z',
  'views': 61,
  'no_answers': 1,
  'no_votes_question': 1,
  'has_accepted_answer': True,
  'answerer_name': 'pmqs',
  'answerer_reputation': 2269,
  'no_votes_answer': '2',
  'time_answered': '2022-11-29 16:04:34Z',
  'link': 'https://stac

In [19]:
# Write the result to a csv file
write_csv(almost_recent_question, "bioinformatic_almost_recent_questions.csv")

In [20]:
# Glimpse at our table
pd.read_csv("bioinformatic_almost_recent_questions.csv") # Should seem legit!!!

Unnamed: 0,asker_name,asker_reputation,question,tags,time_asked,views,no_answers,no_votes_question,has_accepted_answer,answerer_name,answerer_reputation,no_votes_answer,time_answered,link
0,Janie Olver,1,Problems downloading edgeR - R version 4.0.2,['r'/ 'bioinformatics'/ 'rna-seq'],2022-11-29 15:21:57Z,21,0,0,False,,,,,https://stackoverflow.com/questions/74616335/p...
1,Nickmofoe,57,How can I print out multiple similar patterns ...,['regex'/ 'perl'/ 'bioinformatics'/ 'protein-d...,2022-11-29 12:44:46Z,61,1,1,True,pmqs,2269.0,2.0,2022-11-29 16:04:34Z,https://stackoverflow.com/questions/74614308/h...
2,batman,196,How do I make gsea plot with group names label...,['r'/ 'ggplot2'/ 'plot'/ 'bioinformatics'],2022-11-29 07:29:23Z,30,1,0,False,marco,41.0,0.0,2022-12-08 09:13:05Z,https://stackoverflow.com/questions/74610471/h...
3,plnnvkv,541,Remove a substring from lines starting with a ...,['bash'/ 'bioinformatics'/ 'fasta'],2022-11-28 15:06:06Z,41,4,1,True,William Pursell,198000.0,1.0,2022-11-28 15:21:02Z,https://stackoverflow.com/questions/74602571/r...
4,Nickmofoe,57,How can I find a protein sequence from a FASTA...,['regex'/ 'perl'/ 'bioinformatics'/ 'protein-d...,2022-11-28 11:33:07Z,52,1,1,True,pmqs,5003.0,3.0,2022-11-28 22:55:50Z,https://stackoverflow.com/questions/74599963/h...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Raksha,35,Import SNP data into DendroPy for popgen analyses,['python'/ 'bioinformatics'/ 'dendropy'],2022-10-03 21:38:28Z,48,0,0,False,,,,,https://stackoverflow.com/questions/73941081/i...
96,Quinn,27,Unable to access jar file (installed in a cond...,['java'/ 'bash'/ 'jar'/ 'bioinformatics'/ 'pic...,2022-10-03 15:45:25Z,118,1,-1,True,Will Holtz,196.0,1.0,2022-10-04 04:53:18Z,https://stackoverflow.com/questions/73937802/u...
97,Programming Noob,1035,Converting a data frame using a formula,['r'/ 'dataframe'/ 'bioinformatics'],2022-10-03 13:54:37Z,61,2,3,False,Allan Cameron,122000.0,2.0,2022-10-03 14:05:40Z,https://stackoverflow.com/questions/73936504/c...
98,Shelby Labuschagne,63,Why does my Manhattan plot look like this?,['python'/ 'bioinformatics'/ 'genetics'/ 'gwas...,2022-10-02 14:27:40Z,93,0,0,False,,,,,https://stackoverflow.com/questions/73926611/w...


## Conclusion

### Applying

Everything seems to work. Let's apply the functions to another tag, such as *'mysql'*. Note that we may run into code 406 because we've been sending too many requests. If that happens, wait for about 15 minutes and try again.

In [23]:
# Get the 100 most recent questions (2 pages) with the tag 'mysql'
mysql_recent_questions = scrape_recent_questions('mysql', 2, 0)
# Write result to a csv file
write_csv(mysql_recent_questions, "mysql_recent_questions.csv")

In [24]:
# Check results
pd.read_csv("mysql_recent_questions.csv")

Unnamed: 0,asker_name,asker_reputation,question,tags,time_asked,views,no_answers,no_votes_question,has_accepted_answer,answerer_name,answerer_reputation,no_votes_answer,time_answered,link
0,Pat Lefebvre,1,MariaDB SQL query help needed,['mysql'/ 'sql'/ 'join'/ 'mariadb'/ 'left-join'],2022-12-20 20:36:58Z,6,0,0,False,,,,,https://stackoverflow.com/questions/74868630/m...
1,momo668,138,SQL on splitting table by column and rejoining...,['mysql'/ 'sql'],2022-12-20 20:28:24Z,12,0,-2,False,,,,,https://stackoverflow.com/questions/74868574/s...
2,user20826386,1,How to select records with distinct combinatio...,['mysql'],2022-12-20 20:12:31Z,12,1,-2,False,Stu,25300.0,0.0,2022-12-20 20:23:43Z,https://stackoverflow.com/questions/74868441/h...
3,osi129,1,I am having problems in installing Mysql in ub...,['mysql'],2022-12-20 20:05:59Z,12,0,-4,False,,,,,https://stackoverflow.com/questions/74868382/i...
4,Andrew Ya,1,MySQL error 'Prepared statement needs to be re...,['mysql'],2022-12-20 19:39:06Z,11,0,-1,False,,,,,https://stackoverflow.com/questions/74868115/m...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Dit,1,Mysql Keeps deleting the written ID but the ID...,['php'/ 'html'/ 'mysql'/ 'mysqli'],2022-12-19 23:23:11Z,35,1,-2,False,Barmar,157000.0,1.0,2022-12-20 05:21:52Z,https://stackoverflow.com/questions/74857027/m...
96,darkstar,621,Would indexing a text field allow for better p...,['mysql'/ 'sql'/ 'indexing'],2022-12-19 23:09:16Z,32,3,1,False,Chris Maurer,1644.0,2.0,2022-12-19 23:41:09Z,https://stackoverflow.com/questions/74856945/w...
97,B.12,9,VARCHAR TO DATETIME,['mysql'],2022-12-19 22:45:01Z,33,1,-2,False,Barmar,709000.0,-1.0,2022-12-19 23:44:23Z,https://stackoverflow.com/questions/74856785/v...
98,Developer Jano,81,Best way to update 10k rows based on ID,['mysql'],2022-12-19 22:36:23Z,36,0,1,False,,,,,https://stackoverflow.com/questions/74856736/b...


### Summary

- I <u>wrote Python functions</u> using <i>requests</i> and <i>BeautifulSoup</i> libraries to retrieve details about the most recent topic-specific questions asked on <b>Stackoverflow</b>.
- For each question entry, I retrieve additional information if not visible on the summary page by <u>scraping its subpage</u> (in particular, the fields associated with the answer if there is one).
- The functions were applied to scrape the 100 most recent questions with the "mysql" tag and store the <u>result in a tabular format</u> (100 rows x 14 columns), showing the names and reputations of the askers and answerers, the times asked and answered, the associated tags, and the popularity of those questions and answers.

### Ideas for Future Work

Improvement for the functions:
- Because of the subpage scraping, I am limited to scraping 2-3 pages at a time. I can optimize it by creating 2 scraping options: <i>general</i>, showing only the fields that are visible on the summary page, and <i>detail</i>, adding information about the answers.
- Fetch additional detailed information from each question, such as whether the questions/answers have pictures, code sections, or the number of characters.
- Scrape the top results for the searched 'key word' instead of 'tag' by changing the URL structures.

Follow up on the scraped data:
- Analyze the time between when the question was asked and when it was answered.
- Identify the trending topics.
- Observe the relationship between the length of the question titles and how often they are viewed/answered.
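
As a starting point for the first follow-up idea, the scraped timestamps can be parsed with the standard library. This is only a sketch: `time_to_answer` is a hypothetical helper, and it measures the delay to the *top* answer, which is not necessarily the first answer posted. The two timestamps come from one of the rows scraped above.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%SZ"  # format of the scraped 'title' attributes

def time_to_answer(time_asked, time_answered):
    """Hypothetical helper: timedelta between question and top answer."""
    asked = datetime.strptime(time_asked, TIME_FORMAT)
    answered = datetime.strptime(time_answered, TIME_FORMAT)
    return answered - asked

delta = time_to_answer("2022-11-29 12:44:46Z", "2022-11-29 16:04:34Z")
print(delta)  # 3:19:48
```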

## References

- https://stackoverflow.com/questions
- https://stackoverflow.com/questions/3086973/how-do-i-convert-this-list-of-dictionaries-to-a-csv-file
- https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis