# Stack Overflow Scraping


In [None]:
pip install requests beautifulsoup4



In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import re
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Web Scraping Function Documentation

## Overview
This function is designed to scrape posts from a specific section of the Stack Overflow website, particularly the Python-tagged questions. It uses Python with libraries such as `requests` and `BeautifulSoup` to extract information from web pages. The function allows for customizable scraping based on different filters, a set number of pages, and a minimum view count threshold.

## Function Signature
```python
def scrape(base_url, filters, total_pages=3, min_view_count=100):
```

## Parameters
- `base_url` (str): The base URL of the section to scrape. For example, 'https://stackoverflow.com/questions/tagged/python'.
- `filters` (list): A list of filters to apply. Each filter represents a different sorting method, such as 'RecentActivity', 'MostVotes', or 'MostFrequent'.
- `total_pages` (int): The total number of pages to scrape across all filters. Default is 3.
- `min_view_count` (int): The minimum number of views a post must have to be included. Default is 100.

## Functionality
1. **Header Configuration**: Sets a user-agent for HTTP requests to mimic a web browser.
2. **Filter Distribution Logic**: Calculates how many pages to scrape per filter based on the total number of pages and the number of filters.
3. **Scraping Loop**:
   - Iterates over each filter and its assigned number of pages.
   - Constructs the URL for each page and sends a GET request.
   - Uses a delay to avoid rate limiting or server overload.
   - Parses the HTML response using BeautifulSoup.
4. **Post Data Extraction**:
   - Extracts title, upvotes, answer count, view count, and the post's content.
   - Converts view counts from strings to integers, handling 'k' (thousand) and 'm' (million) suffixes.
   - Filters out posts with views less than the `min_view_count`.
   - Fetches the full content of each post.
   - Removes code blocks and calculates the total length of removed code.
   - Appends the full content (excluding code blocks) to the post information.
5. **Error Handling**:
   - Handles HTTP status codes, particularly rate limiting (status code 429).
   - Stops and returns the data collected so far if a listing page fails to load.
   - Skips posts whose question container or body cannot be found.
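
The filter distribution logic in step 2 can be sketched on its own as follows. This is a minimal illustration, not the notebook's exact code: the helper name `distribute_pages` is hypothetical, and it uses `divmod` to give each filter an equal share of pages, with any remainder going to the earliest filters.

```python
def distribute_pages(total_pages, num_filters):
    """Split total_pages across num_filters; earlier filters get any remainder."""
    if num_filters <= 0:
        return []
    base, extra = divmod(total_pages, num_filters)
    return [base + (1 if i < extra else 0) for i in range(num_filters)]


print(distribute_pages(2, 3))  # [1, 1, 0]
print(distribute_pages(5, 3))  # [2, 2, 1]
```

With fewer pages than filters, the trailing filters receive zero pages and are not scraped at all.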

## Return Value
- Returns a list of dictionaries. Each dictionary contains the following keys: 'link', 'upvotes', 'answers', 'views', 'content', and 'code_length'.
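
As a rough illustration of working with this return value, the sketch below ranks records by upvotes after applying a view threshold. The sample records and the `top_posts` helper are hypothetical (the links are placeholders, not real posts); only the six keys come from the function described above.

```python
# Hypothetical sample records shaped like the scraper's return value.
sample_posts = [
    {'link': 'https://example.com/q/1', 'upvotes': 0, 'answers': 0,
     'views': 8, 'content': 'Title one\nBody one', 'code_length': 0},
    {'link': 'https://example.com/q/2', 'upvotes': 2, 'answers': 1,
     'views': 4000, 'content': 'Title two\nBody two', 'code_length': 270},
    {'link': 'https://example.com/q/3', 'upvotes': 1, 'answers': 2,
     'views': 866, 'content': 'Title three\nBody three', 'code_length': 765},
]


def top_posts(posts, min_views=100, limit=10):
    """Return up to `limit` posts with at least `min_views`, highest upvotes first."""
    eligible = [p for p in posts if p['views'] >= min_views]
    return sorted(eligible, key=lambda p: p['upvotes'], reverse=True)[:limit]


for post in top_posts(sample_posts):
    print(post['upvotes'], post['views'], post['link'])
```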

## Usage Example
```python
base_url = 'https://stackoverflow.com/questions/tagged/python'
filters = ['RecentActivity', 'MostVotes', 'MostFrequent']
posts_data = scrape(base_url, filters, total_pages=5, min_view_count=100)
print(len(posts_data), "posts collected.")
```

## Notes
- The function includes print statements for debugging and progress tracking.
- Due to its web scraping nature, the function is dependent on the structure of the Stack Overflow website. Changes in the website's layout or class names may require updates to the selectors used in the function.
- The function includes sleep calls to respect the server's load and to comply with web scraping best practices.
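
One way to space out requests is a small throttle that enforces a minimum interval between calls. This is only a sketch under stated assumptions: the `Throttle` class is hypothetical and not part of the notebook, and the one-second interval is an arbitrary choice rather than a documented Stack Overflow limit.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first call

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=1.0)
throttle.wait()  # the first call returns immediately
# A request would be issued here; the next wait() sleeps only if under 1 s has passed.
```

Calling `wait()` before every request keeps the pacing in one place instead of scattering `time.sleep` calls through the loop.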

The `scrape` function is a Python-based web scraper that extracts information from questions on Stack Overflow. For each question that meets the view-count threshold, it retrieves the link, upvote count, answer count, view count, textual content (with code removed), and the total length of the removed code.

#### Parameters

- `base_url` (str): The base URL of the Stack Overflow tag page to scrape.
- `filters` (list): The sort filters to apply to the page (e.g., 'RecentActivity', 'MostVotes', 'MostFrequent').
- `total_pages` (int): The total number of listing pages to scrape, split across the filters.
- `min_view_count` (int): The minimum number of views a post needs to be included.

#### How It Works

1. **Setup and Initialization**:
    - Initializes the `User-Agent` header to mimic a browser request.
    - Prepares an empty list, `all_posts_info`, to store the scraped data.
    - Splits `total_pages` across the filters.

2. **Page-wise Iteration**:
    - Iterates through each filter and, for that filter, each page number from 1 to its assigned page count.
    - Constructs the URL for each page using the `base_url`, the filter, and the current page number.

3. **Page Request and Validation**:
    - Performs an HTTP GET request to the constructed URL.
    - Checks if the response status code is 200 (OK). If not, it logs a message and returns the data collected so far.

4. **HTML Parsing**:
    - Parses the page content using BeautifulSoup.
    - Selects all elements with the class `s-post-summary` as post summaries.

5. **Individual Post Processing**:
    - Iterates over each post summary.
    - Extracts the title element (link to the post), the upvote count, the answer count, and the view count.
    - Skips any posts with fewer views than `min_view_count`.
    - Constructs the full URL of the post.

6. **Fetching Individual Post Content**:
    - Makes an HTTP GET request to each post's URL.
    - If the response status code is 429 (rate limit), returns the data collected so far.
    - Parses the post page, removes any post notice, measures and removes code, and extracts the remaining text.
    - If no content element is found, logs a message and continues with the next post.

7. **Data Aggregation**:
    - Appends a dictionary with the post's link, upvotes, answers, views, content, and code length to `all_posts_info`.

8. **Progress Logging**:
    - After processing each page, logs the number of pages left and an estimated time to completion.
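
The URL construction step can be sketched with the standard library as follows. The helper name `build_listing_url` is hypothetical; the `sort`, `page`, and `pagesize` query parameters mirror the ones used in the notebook's code, and `urlencode` is used here so that values are escaped safely.

```python
from urllib.parse import urlencode


def build_listing_url(base_url, sort_filter, page, pagesize=50):
    """Build the listing URL for one page of one sort filter."""
    query = urlencode({'sort': sort_filter, 'page': page, 'pagesize': pagesize})
    return f"{base_url}?{query}"


url = build_listing_url('https://stackoverflow.com/questions/tagged/python', 'RecentActivity', 1)
print(url)
```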

#### Output

- Returns `all_posts_info`: A list of dictionaries, where each dictionary contains the following keys:
  - `link`: The URL of the post.
  - `upvotes`: The number of upvotes the post has received.
  - `answers`: The number of answers the post has received.
  - `views`: The number of views the post has received.
  - `content`: The title followed by the textual content of the post, with code removed.
  - `code_length`: The total number of characters of code removed from the post.

#### Final Output

The script prints the total number of posts collected and returns a list of dictionaries, where each dictionary represents one row.


---
FOR MS JANE TO MODIFY

In [None]:
############################################
# To test the program, set NUMBER_OF_PAGES to a small value (e.g. 3).
# About 100 pages are needed to scrape around 5000 rows (50 posts per page).
NUMBER_OF_PAGES = 2
############################################

---

In [None]:
import requests
from bs4 import BeautifulSoup
import time

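def convert_view_count(view_count_str):
    """Convert a view count such as '866', '4k', or '1.2m' to an integer."""
    # Normalize case and drop thousands separators before parsing
    text = view_count_str.strip().lower().replace(',', '')
    if text.endswith('k'):
        return int(float(text[:-1]) * 1000)
    elif text.endswith('m'):
        return int(float(text[:-1]) * 1000000)
    else:
        return int(text)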

def scrape(base_url, filters, total_pages=3, min_view_count=100):
    headers = {'User-Agent': 'Mozilla/5.0'}
    all_posts_info = []

    num_filters = len(filters)
    pages_scraped = 0
    start_time = time.time()

    print(f"Number of filters: {num_filters}")
    print(f"Total pages requested: {total_pages}")

    # Page distribution logic
    if total_pages < num_filters:
        pages_to_scrape = [1 if i < total_pages else 0 for i in range(num_filters)]
    else:
        pages_per_filter = total_pages // num_filters
        extra_pages = total_pages % num_filters
        pages_to_scrape = [pages_per_filter + (1 if i < extra_pages else 0) for i in range(num_filters)]

    print(f"Pages to scrape per filter: {pages_to_scrape}")

    # Scraping loop
    # `sort_filter` avoids shadowing the built-in `filter`
    for sort_filter, pages in zip(filters, pages_to_scrape):
        print(f"Scraping filter: {sort_filter}, Pages to scrape: {pages}")
        for page in range(1, pages + 1):
            url = f"{base_url}?sort={sort_filter}&page={page}&pagesize=50"
            print(f"Scraping URL: {url}")
            response = requests.get(url, headers=headers)
            # Pause between listing-page requests to avoid rate limiting
            time.sleep(1)

            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'html.parser')
                posts = soup.select('.s-post-summary')

                for post in posts:
                    title_element = post.select_one('.s-post-summary--content-title a')
                    upvote_element = post.select_one('.s-post-summary--stats-item-number')

                    # Extract answer count and view count elements
                    answer_count_element = post.select_one('div:nth-child(2) > span.s-post-summary--stats-item-number')
                    view_count_element = post.select_one('div:nth-child(3) > span.s-post-summary--stats-item-number')

                    # Skip posts with missing elements instead of raising AttributeError
                    elements = (title_element, upvote_element, answer_count_element, view_count_element)
                    if any(element is None for element in elements):
                        print("Skipping post with missing elements.")
                        continue

                    answers = int(answer_count_element.text.strip())
                    views = convert_view_count(view_count_element.text.strip())

                    if views < min_view_count:
                        print(f"Skipping post, view count less than {min_view_count}: {views}")
                        continue
                    title = title_element.text.strip()
                    upvotes = int(upvote_element.text.strip())

                    link = 'https://stackoverflow.com' + title_element['href']

                    # Pause before each post request to avoid rate limiting
                    time.sleep(1)
                    post_response = requests.get(link, headers=headers)
                    if post_response.status_code == 429:
                        print("Rate limit hit, saving collected data and exiting...")
                        return all_posts_info

                    post_soup = BeautifulSoup(post_response.text, 'html.parser')

                    # Select the question container
                    question_container = post_soup.select_one('.question.js-question')

                    if question_container:
                        post_notice = question_container.select_one('.s-notice.s-notice__info.post-notice.js-post-notice.mb16')

                        # Remove the post notice if it exists
                        if post_notice:
                            # print(f"Skipping post notice, but keeping post content.")
                            post_notice.decompose()

                        # Check the length of code blocks
                        code_blocks = question_container.select('.js-post-body code')
                        # Calculate the total length of code
                        total_code_length = sum(len(code_block.get_text(strip=True)) for code_block in code_blocks)

                        # Now remove the code elements to clean up the content
                        for code_block in code_blocks:
                            code_block.decompose()  # Remove each 'code' element (inline code and code inside 'pre')

                        post_content_element = question_container.select_one('.js-post-body')

                        if not post_content_element:
                            print(f"No content element found for post: {link}. Check the selector.")
                            continue

                        post_content = post_content_element.get_text(strip=True)
                        full_content = title + "\n" + post_content  # Append title to content

                        print(f"Page {page}: {upvotes} upvotes, {answers} answers, {views} views, {total_code_length} code length,  {len(full_content)} content length.")

                        # Add the post info including code length
                        all_posts_info.append({
                            'link': link,
                            'upvotes': int(upvotes),
                            'answers': answers,
                            'views': views,
                            'content': full_content,
                            'code_length': total_code_length
                        })

                pages_scraped += 1
                elapsed_time = time.time() - start_time
                avg_time_per_page = elapsed_time / pages_scraped
                pages_left = total_pages - pages_scraped
                estimated_time_left = avg_time_per_page * pages_left

                hours, remainder = divmod(estimated_time_left, 3600)
                minutes, seconds = divmod(remainder, 60)

                print(f"\nPage {page} scraped. {pages_left} pages more to go. Estimated time to completion: {int(hours)}h {int(minutes)}m {int(seconds)}s.\n")

            elif response.status_code == 429:
                print("Rate limit hit, saving collected data and exiting...")
                return all_posts_info
            else:
                print(f"Error fetching page {page}, Status Code: {response.status_code}")
                return all_posts_info

    return all_posts_info

base_url = 'https://stackoverflow.com/questions/tagged/python'
filters = ['RecentActivity', 'MostVotes', 'MostFrequent']
posts_data = scrape(base_url, filters, total_pages=NUMBER_OF_PAGES, min_view_count=0)
print(len(posts_data), "posts collected.")

Number of filters: 3
Total pages requested: 2
Pages to scrape per filter: [1, 1, 0]
Scraping filter: RecentActivity, Pages to scrape: 1
Scraping URL: https://stackoverflow.com/questions/tagged/python?sort=RecentActivity&page=1&pagesize=50
Page 1: 0 upvotes, 0 answers, 8 views, 0 code length,  709 content length.
Page 1: 0 upvotes, 2 answers, 29 views, 677 code length,  328 content length.
Page 1: 0 upvotes, 1 answers, 41 views, 2002 code length,  1223 content length.
Page 1: 2 upvotes, 1 answers, 4000 views, 270 code length,  1137 content length.
Page 1: 0 upvotes, 0 answers, 9 views, 296 code length,  374 content length.
Page 1: 0 upvotes, 1 answers, 21 views, 1222 code length,  216 content length.
Page 1: 1 upvotes, 2 answers, 23 views, 451 code length,  452 content length.
Page 1: 1 upvotes, 2 answers, 866 views, 765 code length,  361 content length.
Page 1: 0 upvotes, 0 answers, 3 views, 3075 code length,  281 content length.
Page 1: 0 upvotes, 1 answers, 402 views, 0 code length, 

In [None]:
df = pd.DataFrame(posts_data)
df.head()

|   | link | upvotes | answers | views | content | code_length |
|---|------|---------|---------|-------|---------|-------------|
| 0 | https://stackoverflow.com/questions/77918138/h... | 0 | 0 | 8 | how to fix make validate identifier in git?\nT... | 0 |
| 1 | https://stackoverflow.com/questions/77917814/c... | 0 | 2 | 29 | Character outside of capture group not substit... | 677 |
| 2 | https://stackoverflow.com/questions/77913227/r... | 0 | 1 | 41 | Remove the boxes but keep the labels for a YOL... | 2002 |
| 3 | https://stackoverflow.com/questions/4897692/fa... | 2 | 1 | 4000 | Fast Bluetooth Name Lookup\nI'm experiencing p... | 270 |
| 4 | https://stackoverflow.com/questions/77918221/w... | 0 | 0 | 9 | Why wont pydnatic print the id as "_id"?\nHi e... | 296 |


## Saving the data to CSV

In [None]:
# Export the DataFrame to a CSV file
# file_path = "/content/drive/MyDrive/Y2S2/TSAP/Assignment/"
# df.to_csv(file_path + 'stack_overflow.csv', index=False)

# print("Data exported to 'stack_overflow.csv'")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Data exported to 'stack_overflow.csv'
