# Scraping the eCommerce Sellers Forum

link: https://forum.flowster.app/

Opening the [link](https://forum.flowster.app) above shows the forum's front page: a list of categories (the Category column), the number of topics in each category, and the latest posts.
![First page](assets/1.png)
What we want to extract:
- The category urls for all the available categories
- The topic urls for each category
- The title, tags, main post and replies of each topic
- Anything else you think would be interesting to investigate
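
To keep the target in view, the fields listed above can be written down as a small record type. This is only an illustrative sketch: the `TopicRecord` name and its field names are assumptions made for this example, not something the forum or the scraper class defines.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class TopicRecord:
    """One scraped topic (hypothetical container, names chosen for illustration)."""
    url: str
    title: str
    category: str = ''
    tags: List[str] = field(default_factory=list)
    leading_post: str = ''
    replies: List[str] = field(default_factory=list)


# Example values taken from a topic examined later in this walkthrough
record = TopicRecord(
    url='https://forum.flowster.app/t/securing-long-term-partnerships/1545',
    title='Securing long term partnerships',
    category='Store & Website Management',
)
print(asdict(record))
```

Defaulting the list fields with `default_factory` gives every record its own empty list instead of one shared list.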

I will use Selenium with the Firefox webdriver (geckodriver) and Beautiful Soup to extract the data needed from this forum.

## Get each category url

First, open the browser's inspector and find the HTML tag or identifier that holds these urls.
Clicking with the inspector's element picker on the block containing Category, Topics and Latest shows that it is wrapped in a div with the class `categories-and-latest ember-view`, which contains another div with the class `column categories`.
![Front page](assets/2.png)


To get the category url you can click on the category section itself:
![More on first page](assets/3.png)
You will find several HTML tags. The one we are interested in is the `<a>` (hyperlink) tag, whose `href` attribute holds the url of the category we just selected. We will then write code to reach that tag and read the value of its `href` attribute.
![More on first page 2](assets/4.png)

In [2]:
# Define the webdriver and provide the forum link
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

opts = Options()
opts.add_argument('-headless')  # Operate in headless mode (set_headless() is deprecated)
browser = Firefox(options=opts)
browser.get('https://forum.flowster.app')


In [3]:
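# Navigate to the a tag to get the content of its href
# (find_elements with By works in Selenium 3 and 4; the find_elements_by_* helpers were removed in newer Selenium 4 releases)
from selenium.webdriver.common.by import By

categ_links = browser.find_elements(By.CSS_SELECTOR, '.category > h3 > a')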

In [4]:
categ_links

[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="139dbe67-ce87-dd4f-9da2-7d0c99a317e6", element="228206b4-f575-2b4f-a674-ce0dfdd4a4ff")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="139dbe67-ce87-dd4f-9da2-7d0c99a317e6", element="e1d23a6f-84cf-684a-899a-84e72904e774")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="139dbe67-ce87-dd4f-9da2-7d0c99a317e6", element="20f00f40-305f-794c-bf44-fdc5e9c94e00")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="139dbe67-ce87-dd4f-9da2-7d0c99a317e6", element="feba649f-c15c-6048-9f55-e6f6e6058f8b")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="139dbe67-ce87-dd4f-9da2-7d0c99a317e6", element="683e202c-3335-c64e-8a7e-4975c369f7d3")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="139dbe67-ce87-dd4f-9da2-7d0c99a317e6", element="2353ed4b-8879-2045-9642-14b0b15c2fcd")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement 

In [7]:
# Print the url of the first category : Store & Website Management 
categ_links[0].get_attribute('href')

'https://forum.flowster.app/c/store-website-management/40'
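
The url printed above appears to follow a `/c/<slug>/<id>` layout. Assuming that layout holds, the slug and the numeric id can be pulled apart with the standard library. The `split_category_url` helper is a hypothetical name introduced for this sketch, and it deliberately raises on any url that does not match the assumed three-part shape.

```python
from urllib.parse import urlparse


def split_category_url(url):
    """Return (slug, category_id) from a url shaped like .../c/<slug>/<id>."""
    parts = urlparse(url).path.strip('/').split('/')
    # Assumed layout: ['c', '<slug>', '<id>'], as in the url printed above
    if len(parts) != 3 or parts[0] != 'c':
        raise ValueError('unexpected category url: ' + url)
    return parts[1], int(parts[2])


slug, categ_id = split_category_url('https://forum.flowster.app/c/store-website-management/40')
print(slug, categ_id)
```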

## Get the category topics urls

- Access the category by using its url.
- Find the tag holding each topic url.
- Extract the url and store it.

In [10]:
# Access the category by using its url
import time

CATEG_URL = categ_links[0].get_attribute('href')
browser.get(CATEG_URL)
# Scroll to the bottom so that lazily loaded topics get rendered, then give the page time to load
browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
time.sleep(3)
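
A single scroll is enough for a category with three topics, but a longer category may need several. Here is a minimal sketch of scrolling until the page height stops growing, the same idea the scraper class uses further down. The `scroll_to_bottom` function and the `FakeBrowser` stand-in are hypothetical names for this example; the stand-in only exists so the loop can be tried without launching Firefox.

```python
import time


def scroll_to_bottom(browser, pause=0.5, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = browser.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the next batch of topics time to load
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds  # safety cap so the loop cannot run forever


class FakeBrowser:
    """Stand-in for a Selenium driver so the loop can be tried without a browser."""

    def __init__(self, heights):
        self.heights = list(heights)  # page heights reported on successive reads

    def execute_script(self, script):
        if script.startswith("return"):
            # Report the next height, repeating the last one once the list runs out
            return self.heights.pop(0) if len(self.heights) > 1 else self.heights[0]
        return None  # scrollTo itself returns nothing


print(scroll_to_bottom(FakeBrowser([1000, 2000, 3000, 3000]), pause=0))  # 3
```

With the real driver, the call would simply be `scroll_to_bottom(browser)`.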

**Find the tag holding each topic url**
![Topics](assets/5.png)

In [11]:
# Find the tag holding each topic url
from bs4 import BeautifulSoup

soup = BeautifulSoup(browser.page_source, 'html.parser') # Get the current HTML page content

In [14]:
# Navigate directly to the tag holding the url
storeWebM_topics_links = soup.find_all('a', class_='title raw-link raw-topic-link') 

In [15]:
storeWebM_topics_links

[<a class="title raw-link raw-topic-link" data-topic-id="58" href="/t/about-the-store-website-management-category/58">About the Store &amp; Website Management category</a>,
 <a class="title raw-link raw-topic-link" data-topic-id="1545" href="/t/securing-long-term-partnerships/1545">Securing long term partnerships</a>,
 <a class="title raw-link raw-topic-link" data-topic-id="1470" href="/t/amazon-free-products/1470">Amazon Free Products</a>]

In [17]:
len(storeWebM_topics_links) # we have 3 topics available in the Store & Website Management category

3

In [21]:
# Extract the 1st topic url
storeWebM_topics_links[0]['href']

'/t/about-the-store-website-management-category/58'

In [25]:
# Add the base url to form a full accessible url
BASE_URL = 'https://forum.flowster.app'
storeWebM_topic_urls = [BASE_URL + topic_link['href'] for topic_link in storeWebM_topics_links]

In [26]:
storeWebM_topic_urls

['https://forum.flowster.app/t/about-the-store-website-management-category/58',
 'https://forum.flowster.app/t/securing-long-term-partnerships/1545',
 'https://forum.flowster.app/t/amazon-free-products/1470']
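
Plain concatenation works here because every `href` starts with `/`. A slightly more defensive sketch uses `urljoin` from the standard library, which also leaves an already absolute url untouched. The second `href` below is a made-up case added only to show that behaviour.

```python
from urllib.parse import urljoin

BASE_URL = 'https://forum.flowster.app'

hrefs = [
    '/t/securing-long-term-partnerships/1545',                 # relative, as returned above
    'https://forum.flowster.app/t/amazon-free-products/1470',  # already absolute (hypothetical case)
]

topic_urls = [urljoin(BASE_URL, href) for href in hrefs]
print(topic_urls)
```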

## Get the topic post, replies and other info

Note that the div holding the main post has a different class from the divs holding the replies.
- **Main post**  

![Main post](assets/6.png)

- **Replies**  

![Replies](assets/7.png)

That is why we will search for both classes to collect the main post and the replies.
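
The two class strings in the screenshots share the `topic-post` token and differ by `topic-owner`. As a small illustration of that difference, the sketch below walks a made-up HTML fragment with the standard library's `html.parser` (not Beautiful Soup, which the walkthrough itself uses) and records which `topic-post` divs carry `topic-owner`. The fragment and the `PostCounter` class are assumptions for this example.

```python
from html.parser import HTMLParser

# Made-up fragment that mimics the two class strings seen in the inspector
SAMPLE_HTML = """
<div class="post-stream">
  <div class="topic-post clearfix topic-owner regular"><div class="cooked">Main post</div></div>
  <div class="topic-post clearfix regular"><div class="cooked">A reply</div></div>
</div>
"""


class PostCounter(HTMLParser):
    """Record, for every div carrying the topic-post class, whether topic-owner is present."""

    def __init__(self):
        super().__init__()
        self.owner_flags = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'div' and 'topic-post' in classes:
            self.owner_flags.append('topic-owner' in classes)


counter = PostCounter()
counter.feed(SAMPLE_HTML)
print(counter.owner_flags)  # [True, False]
```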

In [27]:
# Access the 2nd topic url

SECOND_TOPIC_URL = storeWebM_topic_urls[1]
browser.get(SECOND_TOPIC_URL)
browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
time.sleep(3)
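
A fixed `time.sleep(3)` either waits too long or not long enough. One alternative is to poll for a condition with a timeout. The helper below is a generic, standard-library sketch; `wait_until` and `posts_loaded` are hypothetical names, and the condition here is a stand-in that becomes true on its third call. Selenium ships its own `WebDriverWait` class for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Call condition() repeatedly until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        time.sleep(interval)


# Stand-in condition: becomes true on the third call
calls = {'n': 0}


def posts_loaded():
    calls['n'] += 1
    return calls['n'] >= 3


wait_until(posts_loaded, timeout=2, interval=0.01)
print(calls['n'])  # 3
```

With the real browser, the condition could be something like `lambda: browser.find_elements('css selector', 'div.post-stream')`, which returns an empty (falsy) list until the posts are rendered.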

In [28]:
# Get the HTML content of the current page
secondTopic_soup = BeautifulSoup(browser.page_source, 'html.parser')

**Get all the posts HTML**

![Posts](assets/8.png)

In [29]:
# Get all the posts HTML
postStream = secondTopic_soup.find('div', class_='post-stream')

In [30]:
# Select the HTML of both the main post and the replies
postsDivs = postStream.find_all('div', {'class': ['topic-post clearfix topic-owner regular', 'topic-post clearfix regular']})

In [31]:
# Extract the text of the main post and of the replies
comments = [post.find('div', class_='cooked').text for post in postsDivs]

leading_comment = comments[0]  # the first post is the main post
other_comments = comments[1:]  # empty list when there are no replies

In [32]:
leading_comment # main post

'Hello, I just closed a deal to manage a quite large brand account on Amazon, my main concern now is that this partnership may only last 1-2 years and once the account has grown to higher levels they would just manage it by themselves. What methods, tips and tricks you can suggest to protect this deal and make sure they will stick with my company managing their account for long term? Essentially how can I make them dependent on me so it will be very hard for them to let go our partnership?'

In [33]:
other_comments # replies posts

['Brand management is going to have partner turnover.  There will always be high expectations and, even if you raise sales 20%, they’ll want a 30% increase the next quarter.  I think one good tip is to set reasonable expectations.  Let them know all the activities you will undertake and don’t over-promise a sales increase.']

**Extract the creation date, last reply date, number of replies and number of views**

In [34]:
created = postStream.find('li', class_="created-at")
created_at = created.find('span', class_='relative-date')['title']
created_at

'Dec 5, 2020 5:49 pm'

In [35]:
last_reply = postStream.find('li', class_="last-reply")
last_reply = last_reply.find('span', class_='relative-date')['title']
last_reply

'Dec 7, 2020 3:02 pm'
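
Both dates come back as plain strings. A minimal sketch of turning them into `datetime` objects follows, assuming the `title` attribute always uses the format seen above and that month names and am/pm are parsed under an English (or C) locale.

```python
from datetime import datetime

DATE_FORMAT = '%b %d, %Y %I:%M %p'  # e.g. 'Dec 5, 2020 5:49 pm'

created_dt = datetime.strptime('Dec 5, 2020 5:49 pm', DATE_FORMAT)
last_reply_dt = datetime.strptime('Dec 7, 2020 3:02 pm', DATE_FORMAT)

print(created_dt)                  # 2020-12-05 17:49:00
print(last_reply_dt - created_dt)  # 1 day, 21:13:00
```

Once parsed, the values can be sorted or subtracted, for example to see how long a topic stayed active.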

In [36]:
replies = postStream.find('li', class_="replies")
nbr_replies = replies.find('span', class_='number').text
nbr_replies

'1'

In [38]:
views = postStream.find('li', class_="secondary views")
nbr_views = views.find('span', class_='number').text
nbr_views

'30'
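
Note that the counts are returned as strings (`'1'`, `'30'`). Here is a small sketch of converting them to integers. The `parse_count` helper is a hypothetical name, and the handling of a trailing `k` is an assumption about how larger counts might be abbreviated, not something observed in the outputs above.

```python
def parse_count(text):
    """Convert a scraped count such as '30' or '1.2k' to an int."""
    cleaned = text.strip().lower().replace(',', '')
    if cleaned.endswith('k'):
        # Assumption: a trailing 'k' abbreviates thousands
        return int(round(float(cleaned[:-1]) * 1000))
    return int(cleaned)


print(parse_count('30'), parse_count('1.2k'))  # 30 1200
```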

# Scraper Class

In [None]:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
import time
from datetime import datetime
import os
import pandas as pd
import json

class WebScraper:
    def __init__(self, webdriverPath):
        # Set up the webdriver in headless mode (no browser window)
        opts = Options()
        opts.add_argument('-headless')
        # Selenium webdriver object
        # executable_path works in Selenium 3; Selenium 4 passes the driver path via a Service object instead
        self.browser = Firefox(options=opts, executable_path=webdriverPath)
        # Dictionary of all topics and their attributes (created per instance, not shared by the class)
        self.topic_dict = {}
        # Dataframe with one row per topic
        self.topic_df = pd.DataFrame(columns=[
            'Topic Title',
            'Category',
            'Tags',
            'Leading Post',
            'Post Replies',
            'Created_at',
            'Likes',
            'Views',
            'Replies',
        ])
        
        
    def get_topic_title_details(self, topic_soup):
        """
        Get topic title, category and tags
        """
        topic_title = topic_soup.find('a', class_='fancy-title').text.strip()

        title_wrapper = topic_soup.find('div', class_='title-wrapper')

        # The first category-name span is the category; any remaining ones are treated as tags
        names = [tag.text for tag in title_wrapper.find_all('span', class_='category-name')]

        if names:
            topic_category, topic_tags = names[0], names[1:]
        else:
            # Nothing found: return an empty category and an empty list of tags
            topic_category, topic_tags = '', []
            
        return topic_title, topic_category, topic_tags
        
    def get_topic_comments(self, topic_soup):
        """
        Get topic leading post and its replies.
        """
        postStream = topic_soup.find('div', class_='post-stream')
        postsDivs = postStream.find_all('div', {'class': ['topic-post clearfix topic-owner regular', 'topic-post clearfix regular']})

        comments = [post.find('div', class_='cooked').text for post in postsDivs]

        if comments:
            # The first post is the leading post, the rest are replies
            leading_comment, other_comments = comments[0], comments[1:]
        else:
            # No post found
            leading_comment, other_comments = [], []

        return leading_comment, other_comments
    
    def get_topic_created_at(self, topic_soup):
        """
        Get the topic creation date
        """
        created = topic_soup.find('li', class_="created-at")
        
        if created is None:
            created_at = str(0)
        else:
            created_at = created.find('span', class_='relative-date')['title']
    
        return created_at

    def get_topic_replies_nbr(self, topic_soup):
        """
        Get the topic's nbr of replies
        """    
        replies = topic_soup.find('li', class_="replies")
        
        if replies is None:
            nbr_replies = str(0)
        else:
            nbr_replies = replies.find('span', class_='number').text
        
        return nbr_replies

    def get_topic_views_nbr(self, topic_soup):
        """
        Get the topic's nbr of views
        """ 
        views = topic_soup.find('li', class_="secondary views")
        
        if views is None:
            nbr_views = str(0)
        else:
            nbr_views = views.find('span', class_='number').text
        
        return nbr_views

    def get_topic_likes_nbr(self, topic_soup):
        """
        Get the topic's nbr of likes
        """ 
        likes = topic_soup.find('li', class_="secondary likes")
        
        if likes is None:
            nbr_likes = str(0)
        else:
            nbr_likes = likes.find('span', class_='number').text
        
        return nbr_likes
    
    def runApp(self, BASE_URL, SITE_NAME):
        """
        Run the scraping process
        """
        # Open Firefox web client using Selenium and retrieve page source
        self.browser.get(BASE_URL)
        
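        # Get all the category links ('css selector' is the value of By.CSS_SELECTOR;
        # the find_elements_by_* helpers were removed in newer Selenium 4 releases)
        categ_links = self.browser.find_elements('css selector', '.category > h3 > a')
        categ_urls = [link.get_attribute('href') for link in categ_links]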
        
        # Go over each category url
        for categ_url in categ_urls:
            # Access category webpage
            self.browser.get(categ_url)
            
            # Load the entire webpage by scrolling to the bottom
            # (read the current height first; window.scrollTo itself returns nothing)
            lastHeight = self.browser.execute_script("return document.body.scrollHeight")
            
            while (True):
                # Scroll to bottom of page
                self.browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

                # Wait for new page segment to load
                time.sleep(0.5)

                # Calculate new scroll height and compare with last scroll height
                newHeight = self.browser.execute_script("return document.body.scrollHeight")
                if newHeight == lastHeight:
                    break
                    
                lastHeight = newHeight
            
            # Generate category soup
            categoryHTML = self.browser.page_source
            categ_topic_soup = BeautifulSoup(categoryHTML, 'html.parser')
    
            categ_topic_links = categ_topic_soup.find_all('a', class_='title raw-link raw-topic-link')
    
            # Get all the topic urls inside the current category
            categ_topic_urls = []
            for topic_link in categ_topic_links:
                categ_topic_urls.append(BASE_URL + topic_link['href'])
            
            # Loop through all the topics in the current category
            for categ_topic_url in categ_topic_urls:
                # Get current topic_soup
                self.browser.get(categ_topic_url)
                topicHTML = self.browser.page_source
                topic_soup = BeautifulSoup(topicHTML, 'html.parser')
                
                # Scrape all topic attributes of interest
                topic_title, topic_category, topic_tags = self.get_topic_title_details(topic_soup)
                leading_comment, other_comments = self.get_topic_comments(topic_soup)
                created_at = self.get_topic_created_at(topic_soup)
                nbr_replies = self.get_topic_replies_nbr(topic_soup)
                nbr_views = self.get_topic_views_nbr(topic_soup)
                nbr_likes = self.get_topic_likes_nbr(topic_soup)
                
                # Attribute dictionary for each topic in a category
                attribute_dict = {
                            'Topic Title': topic_title,
                            'Category': topic_category,
                            'Tags': topic_tags,
                            'Leading Post': leading_comment,
                            'Post Replies': other_comments,
                            'Created_at': created_at,
                            'Likes': nbr_likes,
                            'Views': nbr_views,
                            'Replies': nbr_replies}
                
                self.topic_dict[topic_title] = attribute_dict
                # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
                self.topic_df = pd.concat([self.topic_df, pd.DataFrame([attribute_dict])], ignore_index=True)
                
                # TEST
                print('Title :', topic_title)
                print('Category :', topic_category)
                print('URL :', categ_topic_url)
                
        # Get unique timestamp of the webscraping
        timeStamp = datetime.now().strftime('%Y%m%d%H%M%S')
        
        # Save data in JSON and CSV files, stored in the same folder as this program
        jsonFilename = SITE_NAME + '_SCRAPED_DATA_' + timeStamp + '.json'
        csvFilename = SITE_NAME + '_SCRAPED_DATA_' + timeStamp + '.csv'
        
        jsonFileFullPath = os.path.join(os.path.dirname(os.path.realpath(__file__)), jsonFilename)
        csvFileFullPath = os.path.join(os.path.dirname(os.path.realpath(__file__)), csvFilename)
        
        # Save scraped data  into json file
        with open(jsonFileFullPath, 'w') as f:
            json.dump(self.topic_dict, f)
        
        # Save dataframe into csv file
        self.topic_df.to_csv(csvFileFullPath)
        
if __name__=='__main__':
    # Local path to webdriver
    webdriverPath = r'/usr/local/bin/geckodriver'
    
    # Forum to scrape URL    
    BASE_URL = 'https://forum.flowster.app'
    
    # Name of the forum
    SITE_NAME = 'FLOWSTER'
        
    # WebScraping object
    webScraper = WebScraper(webdriverPath)
    
    # Run the webscraper and save scraped data
    webScraper.runApp(BASE_URL, SITE_NAME)
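
One design point in the class above: `topic_dict` is keyed by topic title, so two topics that happened to share a title would end up as a single JSON entry. The sketch below shows the effect on made-up data and one possible alternative, keying by url. Whether such duplicates exist on this forum is not known, and the `example.com` urls are placeholders.

```python
# Two made-up topics that share a title but live at different urls
scraped = [
    ('Welcome', 'https://forum.example.com/t/welcome/1', {'Views': '3'}),
    ('Welcome', 'https://forum.example.com/t/welcome/2', {'Views': '7'}),
]

by_title, by_url = {}, {}
for title, url, attributes in scraped:
    by_title[title] = attributes                        # second entry overwrites the first
    by_url[url] = {**attributes, 'Topic Title': title}  # one entry per topic

print(len(by_title), len(by_url))  # 1 2
```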

# Explore the scraped data

In [273]:
import pandas as pd

data = pd.read_csv('FLOWSTER_SCRAPED_DATA20201212224954.csv') 

In [274]:
data.head()

| Unnamed: 0.1 | Unnamed: 0 | Topic Title | Category | Tags | Leading Post | Post Replies | Created_at | Likes | Views | Replies |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | About the Store & Website Management category | Store & Website Management | `[]` | Have questions about Store & Website Managemen... | `[]` | 0 | 0 | 0 | 0 |
| 1 | 1 | Securing long term partnerships | Store & Website Management | `[]` | Hello, I just closed a deal to manage a quite ... | `['Brand management is going to have partner tu...` | Dec 5, 2020 5:49 pm | 0 | 19 | 1 |
| 2 | 2 | Amazon Free Products | Store & Website Management | `[]` | `Hello,\nI need some buyers for my products in ...` | `[]` | 0 | 0 | 0 | 0 |
| 3 | 3 | About the Product Sourcing Category | Product Sourcing | `[]` | Have questions about sourcing products? This i... | `[]` | 0 | 0 | 0 | 0 |
| 4 | 4 | Virtual Assistant sending Emails Hubspot | Product Sourcing | `[]` | `Hello Bright Ideas Tribe,\nI would like to giv...` | `['\n\n\nknowledge.hubspot.com 1\n\n\n\nCreate ...` | Nov 18, 2020 9:12 pm | 0 | 81 | 5 |


In [275]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278 entries, 0 to 277
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    278 non-null    int64 
 1   Topic Title   278 non-null    object
 2   Category      278 non-null    object
 3   Tags          278 non-null    object
 4   Leading Post  277 non-null    object
 5   Post Replies  278 non-null    object
 6   Created_at    278 non-null    object
 7   Likes         278 non-null    int64 
 8   Views         278 non-null    int64 
 9   Replies       278 non-null    int64 
dtypes: int64(4), object(6)
memory usage: 21.8+ KB
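
The `Tags` and `Post Replies` columns are reported as `object` because the lists were written to the csv file as text, so each cell is now a string such as `"[]"`. A minimal sketch of turning such a string back into a list uses `ast.literal_eval` from the standard library. The reply texts below are made-up values for illustration.

```python
import ast

# Cell values as they come back from the csv file: plain strings
tags_cell = "[]"
replies_cell = "['First reply', 'Second reply']"  # made-up values for illustration

tags = ast.literal_eval(tags_cell)
replies = ast.literal_eval(replies_cell)

print(type(replies_cell).__name__, '->', type(replies).__name__, len(replies))  # str -> list 2
```

On the dataframe itself, the same conversion could be applied column-wise, for example with `data['Post Replies'].apply(ast.literal_eval)`, assuming every cell holds a valid list literal.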


In [277]:
data['Leading Post'][1]

'Hello, I just closed a deal to manage a quite large brand account on Amazon, my main concern now is that this partnership may only last 1-2 years and once the account has grown to higher levels they would just manage it by themselves. What methods, tips and tricks you can suggest to protect this deal and make sure they will stick with my company managing their account for long term? Essentially how can I make them dependent on me so it will be very hard for them to let go our partnership?'

In [280]:
all_categ = data['Category'].unique() 

In [282]:
all_categ

array(['Store & Website Management', 'Product Sourcing', 'Management',
       'Amazon Specific', 'Fulfillment', 'Flowster-specific',
       'eCommerce Marketplaces', 'Traffic Sources', 'Software & Tools',
       'Financial Management', 'Human Resources', 'Misc Topics'],
      dtype=object)

In [281]:
len(all_categ)

12

# Resources for EDA (Exploratory Data Analysis)

Check out these two resources, along with the ones in the module, for further information and methods:
- https://www.analyticsvidhya.com/blog/2020/04/beginners-guide-exploratory-data-analysis-text-data/
- `Processing Textual Data.ipynb`