## Extracting Product Hunt data using Beautiful Soup

In [187]:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
import pandas as pd

My workflow for this web scraping process consists of three steps:
1. Building the whole scraping process for a single URL (one day of the year)
2. Extending it to the remaining days by setting a base URL and iterating over the day and month
3. Merging the results into one dataframe

STEP 1: Scraping process for a single URL

In [232]:
# Making a request to the website
url = 'https://www.producthunt.com/time-travel/2022/1/1'
data = requests.get(url)
# Parsing the HTML of the page
soup = BeautifulSoup(data.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Posts for January 1, 2022 | Product Hunt | Product Hunt
  </title>
  <script src="/cdn-cgi/apps/head/rf_aHEm7YUL4SP9DuLocdP0kRH0.js">
  </script>
  <link href="https://www.producthunt.com/time-travel/2022/1/1" rel="canonical"/>
  <meta content="Product Hunt is a curation of the best new products, every day. Discover the latest mobile apps, websites, and technology products that everyone's talking about." name="description"/>
  <meta content="width=device-width, initial-scale=1, minimum-scale=1" name="viewport"/>
  <meta content="1467820943460899" property="fb:app_id"/>
  <meta content="summary_large_image" name="twitter:card"/>
  <meta content="@producthunt" name="twitter:site"/>
  <meta content=" Posts for January 1, 2022 | Product Hunt | Product Hunt" name="twitter:title"/>
  <meta content="Product Hunt is a curation of the best new products, every day. Discover the latest mobile apps, websites, and techn
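The request above does not check whether the server actually returned the page. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html`, the `session` parameter and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the response body as bytes, raising on HTTP errors (hypothetical helper)."""
    # Fall back to the module-level requests API when no session is supplied
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.content
```

The returned bytes can be passed to `BeautifulSoup(..., 'html.parser')` exactly as `data.content` is in the cell above.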

In [189]:
# Find all the elements of the class to extract all posts
post_containers = soup.find_all(class_='styles_item__T_iXF')
print(post_containers) 

[<li class="styles_item__T_iXF" data-test="post-item-325054"><div class="styles_postContent__Be7Xz"><a href="/products/insured-nomads#world-explorer-by-insured-nomads"><div class="styles_thumbnail__Rmwk5 styles_thumbnail__GsGqZ" data-test="post-thumbnail"><img alt="World Explorer by Insured Nomads" class="styles_mediaThumbnail__LDCQN" loading="lazy" src="https://ph-files.imgix.net/9bc773c8-1720-45b6-86f3-75d257338bc3.png?auto=compress&amp;codec=mozjpeg&amp;cs=strip&amp;auto=format&amp;w=80&amp;h=80&amp;fit=crop&amp;bg=0fff" srcset="https://ph-files.imgix.net/9bc773c8-1720-45b6-86f3-75d257338bc3.png?auto=compress&amp;codec=mozjpeg&amp;cs=strip&amp;auto=format&amp;w=80&amp;h=80&amp;fit=crop&amp;bg=0fff&amp;dpr=1 1x, https://ph-files.imgix.net/9bc773c8-1720-45b6-86f3-75d257338bc3.png?auto=compress&amp;codec=mozjpeg&amp;cs=strip&amp;auto=format&amp;w=80&amp;h=80&amp;fit=crop&amp;bg=0fff&amp;dpr=2 2x, https://ph-files.imgix.net/9bc773c8-1720-45b6-86f3-75d257338bc3.png?auto=compress&amp;code
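Class names such as `styles_item__T_iXF` look auto-generated and may change when the site is rebuilt. The output above also shows a `data-test="post-item-..."` attribute on each post, so one possible alternative is to select on that attribute. The sketch below runs on a small hand-written HTML sample rather than the live page; the second list item and its class name are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample; only the first <li> mirrors attributes seen in the real output
sample_html = """
<ul>
  <li class="styles_item__T_iXF" data-test="post-item-325054">
    <div data-test="post-thumbnail"><img alt="World Explorer by Insured Nomads" src="thumb.png"/></div>
  </li>
  <li class="styles_item__CHANGED" data-test="post-item-1">
    <div data-test="post-thumbnail"><img alt="Hypothetical product" src="other.png"/></div>
  </li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
# Select every <li> whose data-test attribute starts with "post-item-"
sample_containers = sample_soup.select('li[data-test^="post-item-"]')
# The numeric suffix of the attribute can serve as a post identifier
post_ids = [li["data-test"].rsplit("-", 1)[-1] for li in sample_containers]
```

Both items are matched even though their class names differ, which is the point of selecting on the attribute.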

In [233]:
# Loop through each post container
data = []
for container in post_containers:
    # Extract the name of the product
    product_name = container.find(class_="styles_font__m46I_ styles_medium__pDQUU styles_semiBold__yhRqL styles_title__MqCmH styles_lineHeight__kGlRn styles_underline__IqHIA").text
    # Extract the number of upvotes
    upvotes = container.find(class_="styles_font__m46I_ styles_black__Z9fG_ styles_small__lLD08 styles_semiBold__yhRqL styles_lineHeight__kGlRn styles_underline__IqHIA").text
    # Extract the category (not every post has one)
    category = container.find(class_="styles_postTopicLink__9NDtt")
    if category is not None:
        category = category.text
    else:
        category = "N/A"
    # Extract the tagline
    tagline = container.find(class_="styles_font__m46I_ styles_grey__YlBrh styles_small__lLD08 styles_normal__FGFK7 styles_tagline__i9xwM styles_lineHeight__kGlRn").text
    # Extract the number of comments
    comments = container.find(class_="styles_button__9iLfH styles_smallSize__n7oLn styles_subtleVariant__qmS4e styles_simpleVariant__63ULX styles_button__dtE4Y").text
    # Extract the image link (the thumbnail can be missing)
    image = container.find(class_="styles_mediaThumbnail__LDCQN")
    image_link = image.get('src') if image is not None else "N/A"
    # Extract the pricing (not every post has one)
    pricing = container.find(class_="styles_font__m46I_ styles_black__Z9fG_ styles_xSmall__xeHkB styles_normal__FGFK7 styles_pricingType__EdAeQ styles_siblingTopicTag__wQEbN styles_lineHeight__kGlRn styles_underline__IqHIA")
    if pricing is not None:
        pricing = pricing.text
    else:
        pricing = "N/A"
    # Store the data in a dictionary
    post_data = {
        'product_name': product_name,
        'upvotes': upvotes,
        'category': category,
        'tagline': tagline,
        'comments': comments,
        'image_src': image_link,
        'pricing': pricing
    }
    # Add the dictionary to the list
    data.append(post_data)

# Create a Pandas DataFrame from the list of dictionaries (once, after the loop)
df = pd.DataFrame(data)
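The "find the element, then fall back to N/A" pattern is repeated for several fields in the loop above. A small helper could factor it out; `text_or_default` is a hypothetical name, and the sketch assumes that a single distinctive class (for example `styles_tagline__i9xwM`) is enough to identify each element.

```python
from bs4 import BeautifulSoup


def text_or_default(container, class_name, default="N/A"):
    """Return the stripped text of the first element with class_name, or default."""
    element = container.find(class_=class_name)
    return element.get_text(strip=True) if element is not None else default


# Hand-written sample container with a category and a tagline but no pricing
sample_container = BeautifulSoup(
    """
    <li class="styles_item__T_iXF">
      <a class="styles_postTopicLink__9NDtt">Web App</a>
      <div class="styles_font__m46I_ styles_tagline__i9xwM">Link sharing for GitHub repositories</div>
    </li>
    """,
    "html.parser",
).find("li")

category = text_or_default(sample_container, "styles_postTopicLink__9NDtt")
tagline = text_or_default(sample_container, "styles_tagline__i9xwM")
pricing = text_or_default(sample_container, "styles_pricingType__EdAeQ")
```

Passing a single class name matches any element that carries that class, whereas the full space-separated string used in the loop only matches when the whole `class` attribute is identical.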

In [234]:
# Add a column with the product's position
df['position'] = range(1, len(df) + 1)

# Add a 'date' column (which actually holds the page title)
page_title = soup.title.string
# Assigning a scalar broadcasts the page title to every row
df['date'] = page_title

# Print the DataFrame
display(df)

Unnamed: 0,product_name,upvotes,category,tagline,comments,image_src,pricing,position,date
0,World Explorer by Insured Nomads,389,Global Nomad,Insurance meets travel tech for the global wor...,84,https://ph-files.imgix.net/9bc773c8-1720-45b6-...,,1,"Posts for January 1, 2022 | Product Hunt | Pr..."
1,Tailwind Box Shadows,203,Productivity,Curated list of box shadows for your cards to ...,25,https://ph-files.imgix.net/f182991d-f034-43b8-...,Free,2,"Posts for January 1, 2022 | Product Hunt | Pr..."
2,24me Smart Personal Assistant,127,Productivity,Keep new year's resolutions and get organized ...,8,https://ph-files.imgix.net/9afadb00-151b-4f19-...,Free,3,"Posts for January 1, 2022 | Product Hunt | Pr..."
3,Habitify Challenge,168,Productivity,Track & build new habits with your friends in ...,10,https://ph-files.imgix.net/8941b193-9073-4864-...,Free,4,"Posts for January 1, 2022 | Product Hunt | Pr..."
4,Sunflower iOS App,108,iOS,Rewire your brain to associate sobriety with r...,14,https://ph-files.imgix.net/35f3ef3a-3a39-49b3-...,Free,5,"Posts for January 1, 2022 | Product Hunt | Pr..."
5,Best Stoic Quotes,95,Productivity,Get inspired for the new year by these quotes,10,https://ph-files.imgix.net/21d2e305-00a8-4ede-...,Free,6,"Posts for January 1, 2022 | Product Hunt | Pr..."
6,2021 Code Stats from WakaTime,90,Productivity,"Most used languages, editors and stats of deve...",3,https://ph-files.imgix.net/1cefa132-0cbe-468b-...,Free,7,"Posts for January 1, 2022 | Product Hunt | Pr..."
7,Octolink,75,Web App,Link sharing for GitHub repositories,16,https://ph-files.imgix.net/06367585-6044-4268-...,Free,8,"Posts for January 1, 2022 | Product Hunt | Pr..."
8,Lume,83,Developer Tools,A static site generator for Deno,3,https://ph-files.imgix.net/3f2582d8-972e-4a9b-...,Free,9,"Posts for January 1, 2022 | Product Hunt | Pr..."
9,My Good First Issue,56,Open Source,Find open source projects by the percentage o...,3,https://ph-files.imgix.net/450d3778-285e-4254-...,Free,10,"Posts for January 1, 2022 | Product Hunt | Pr..."
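The `date` column above holds the full page title rather than a date. If a real date is wanted, the title could be parsed as sketched below. The helper name `parse_page_date` is hypothetical, and the code assumes the title always begins with "Posts for" followed by an English month name, as in the output shown.

```python
from datetime import datetime


def parse_page_date(page_title):
    """Extract the date from a title like 'Posts for January 1, 2022 | Product Hunt | ...'."""
    # Keep only the part before the first '|' and trim surrounding whitespace
    first_part = page_title.split("|")[0].strip()
    prefix = "Posts for "
    if not first_part.startswith(prefix):
        raise ValueError(f"Unexpected title format: {page_title!r}")
    date_text = first_part[len(prefix):]
    # %B parses the full month name in the current locale (assumed to be English here)
    return datetime.strptime(date_text, "%B %d, %Y").date()


parsed = parse_page_date(" Posts for January 1, 2022 | Product Hunt | Product Hunt")
```

With a helper like this, `df['date'] = parse_page_date(page_title)` would store one date value per row instead of the title string.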


STEPS 2 and 3: Scraping data from multiple URLs by iterating over the day and month, then concatenating the results into one dataframe
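One way to drive the day-and-month iteration without hard-coding month lengths is to step through calendar dates with the standard library. This is only a sketch; the generator name `daily_urls` is an assumption, while the URL pattern is the one used in Step 1.

```python
from datetime import date, timedelta


def daily_urls(year, base="https://www.producthunt.com/time-travel"):
    """Yield one time-travel URL per calendar day of the given year."""
    current = date(year, 1, 1)
    while current.year == year:
        # Month and day are written without zero padding, as in the Step 1 URL
        yield f"{base}/{current.year}/{current.month}/{current.day}"
        current += timedelta(days=1)


urls_2022 = list(daily_urls(2022))
```

Because the dates come from `datetime.date`, leap years are handled automatically if the same code is reused for another year.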

In [241]:
#Defining the main function
def scrape_data(url):
    data = requests.get(url)
    soup = BeautifulSoup(data.content, 'html.parser')
    data_= []
    post_containers = soup.find_all(class_='styles_item__T_iXF')
    #Iterating over each container
    for container in post_containers:
        # Extracting all fields from each container
        product_name= container.find(class_="styles_font__m46I_ styles_medium__pDQUU styles_semiBold__yhRqL styles_title__MqCmH styles_lineHeight__kGlRn styles_underline__IqHIA").text
        upvotes= container.find(class_="styles_font__m46I_ styles_black__Z9fG_ styles_small__lLD08 styles_semiBold__yhRqL styles_lineHeight__kGlRn styles_underline__IqHIA").text
        category= container.find(class_="styles_postTopicLink__9NDtt")
        if category is not None:
            category = category.text
        else:
            category = "N/A"
        tagline= container.find(class_= "styles_font__m46I_ styles_grey__YlBrh styles_small__lLD08 styles_normal__FGFK7 styles_tagline__i9xwM styles_lineHeight__kGlRn").text
        comments= container.find(class_="styles_button__9iLfH styles_smallSize__n7oLn styles_subtleVariant__qmS4e styles_simpleVariant__63ULX styles_button__dtE4Y").text
        image= container.find(class_="styles_mediaThumbnail__LDCQN")
        if image is not None:
            image_link = image.get('src')
        else:
            image_link = "N/A"
        pricing= container.find(class_="styles_font__m46I_ styles_black__Z9fG_ styles_xSmall__xeHkB styles_normal__FGFK7 styles_pricingType__EdAeQ styles_siblingTopicTag__wQEbN styles_lineHeight__kGlRn styles_underline__IqHIA")
        if pricing is not None:
            pricing = pricing.text
        else:
            pricing = "N/A"
        # Storing the data in a dictionary
        post_data = {
        'product_name': product_name,
        'upvotes': upvotes,
        'category': category,
        'tagline': tagline,
        'comments': comments,
        'image link': image_link,
        'pricing': pricing
        }
        # Add the dictionary to the list
        data_.append(post_data)
    # Create a Pandas DataFrame from the list of dictionaries (once, after the loop,
    # so that a page without posts still yields a DataFrame)
    df = pd.DataFrame(data_)
    # Add a column with the product's position
    df['position'] = range(1, len(df) + 1)
    # Add a 'date' column (which actually holds the page title)
    page_title = soup.title.string
    # Assigning a scalar broadcasts the page title to every row
    df['date'] = page_title
    return df

In [242]:
# Set the base URL for the pages
base_url = "https://www.producthunt.com/time-travel/2022/"

# Initialize an empty list to store the dataframes
df_list = []

# Iterate over months and days, taking into account months with 28, 30 and 31 days (2022 is not a leap year)
for month in range(1, 13):
    # Set the number of days in each month
    if month in [1, 3, 5, 7, 8, 10, 12]:
        num_days = 31
    elif month in [4, 6, 9, 11]:
        num_days = 30
    else:
        num_days = 28
    for day in range(1, num_days+1):
        # Construct the URL for the current page
        url = f"{base_url}{month}/{day}"
        print(url)
        # Scrape the data from the page
        data = scrape_data(url)
        # Append the data to the list of dataframes
        df_list.append(data)
        
#Concatenate all dataframes corresponding to separate days
final_df = pd.concat(df_list, axis=0)

https://www.producthunt.com/time-travel/2022/1/1
https://www.producthunt.com/time-travel/2022/1/2
https://www.producthunt.com/time-travel/2022/1/3
https://www.producthunt.com/time-travel/2022/1/4
https://www.producthunt.com/time-travel/2022/1/5
https://www.producthunt.com/time-travel/2022/1/6
https://www.producthunt.com/time-travel/2022/1/7
https://www.producthunt.com/time-travel/2022/1/8
https://www.producthunt.com/time-travel/2022/1/9
https://www.producthunt.com/time-travel/2022/1/10
https://www.producthunt.com/time-travel/2022/1/11
https://www.producthunt.com/time-travel/2022/1/12
https://www.producthunt.com/time-travel/2022/1/13
https://www.producthunt.com/time-travel/2022/1/14
https://www.producthunt.com/time-travel/2022/1/15
https://www.producthunt.com/time-travel/2022/1/16
https://www.producthunt.com/time-travel/2022/1/17
https://www.producthunt.com/time-travel/2022/1/18
https://www.producthunt.com/time-travel/2022/1/19
https://www.producthunt.com/time-travel/2022/1/20
https://w

https://www.producthunt.com/time-travel/2022/6/15
https://www.producthunt.com/time-travel/2022/6/16
https://www.producthunt.com/time-travel/2022/6/17
https://www.producthunt.com/time-travel/2022/6/18
https://www.producthunt.com/time-travel/2022/6/19
https://www.producthunt.com/time-travel/2022/6/20
https://www.producthunt.com/time-travel/2022/6/21
https://www.producthunt.com/time-travel/2022/6/22
https://www.producthunt.com/time-travel/2022/6/23
https://www.producthunt.com/time-travel/2022/6/24
https://www.producthunt.com/time-travel/2022/6/25
https://www.producthunt.com/time-travel/2022/6/26
https://www.producthunt.com/time-travel/2022/6/27
https://www.producthunt.com/time-travel/2022/6/28
https://www.producthunt.com/time-travel/2022/6/29
https://www.producthunt.com/time-travel/2022/6/30
https://www.producthunt.com/time-travel/2022/7/1
https://www.producthunt.com/time-travel/2022/7/2
https://www.producthunt.com/time-travel/2022/7/3
https://www.producthunt.com/time-travel/2022/7/4
http

https://www.producthunt.com/time-travel/2022/11/26
https://www.producthunt.com/time-travel/2022/11/27
https://www.producthunt.com/time-travel/2022/11/28
https://www.producthunt.com/time-travel/2022/11/29
https://www.producthunt.com/time-travel/2022/11/30
https://www.producthunt.com/time-travel/2022/12/1
https://www.producthunt.com/time-travel/2022/12/2
https://www.producthunt.com/time-travel/2022/12/3
https://www.producthunt.com/time-travel/2022/12/4
https://www.producthunt.com/time-travel/2022/12/5
https://www.producthunt.com/time-travel/2022/12/6
https://www.producthunt.com/time-travel/2022/12/7
https://www.producthunt.com/time-travel/2022/12/8
https://www.producthunt.com/time-travel/2022/12/9
https://www.producthunt.com/time-travel/2022/12/10
https://www.producthunt.com/time-travel/2022/12/11
https://www.producthunt.com/time-travel/2022/12/12
https://www.producthunt.com/time-travel/2022/12/13
https://www.producthunt.com/time-travel/2022/12/14
https://www.producthunt.com/time-travel/
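A run over 365 pages like the one above stops at the first exception and sends requests back to back. A possible wrapper that pauses between pages and records failures instead of aborting is sketched here; `scrape_many`, its `scrape_fn` parameter and the one-second default delay are assumptions, not something the original notebook uses.

```python
import time


def scrape_many(urls, scrape_fn, delay_seconds=1.0):
    """Call scrape_fn on each URL, collecting results and failures separately."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(scrape_fn(url))
        except Exception as error:
            # Keep going; remember which URL failed and why
            failures.append((url, repr(error)))
        # Pause between requests to avoid hammering the server
        time.sleep(delay_seconds)
    return results, failures
```

With the notebook's function this could be called as `df_list, failed = scrape_many(urls, scrape_data)`, after which the failed URLs can be retried separately.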

In [243]:
final_df.reset_index(inplace=True, drop=True) 
display(final_df)

Unnamed: 0,product_name,upvotes,category,tagline,comments,image link,pricing,position,date
0,World Explorer by Insured Nomads,389,Global Nomad,Insurance meets travel tech for the global wor...,84,https://ph-files.imgix.net/9bc773c8-1720-45b6-...,,1,"Posts for January 1, 2022 | Product Hunt | Pr..."
1,Tailwind Box Shadows,203,Productivity,Curated list of box shadows for your cards to ...,25,https://ph-files.imgix.net/f182991d-f034-43b8-...,Free,2,"Posts for January 1, 2022 | Product Hunt | Pr..."
2,24me Smart Personal Assistant,127,Productivity,Keep new year's resolutions and get organized ...,8,https://ph-files.imgix.net/9afadb00-151b-4f19-...,Free,3,"Posts for January 1, 2022 | Product Hunt | Pr..."
3,Habitify Challenge,168,Productivity,Track & build new habits with your friends in ...,10,https://ph-files.imgix.net/8941b193-9073-4864-...,Free,4,"Posts for January 1, 2022 | Product Hunt | Pr..."
4,Sunflower iOS App,108,iOS,Rewire your brain to associate sobriety with r...,14,https://ph-files.imgix.net/35f3ef3a-3a39-49b3-...,Free,5,"Posts for January 1, 2022 | Product Hunt | Pr..."
...,...,...,...,...,...,...,...,...,...
6602,Cloud Rebels,25,Tech,IT has never been easier,1,https://ph-files.imgix.net/a1da6bb2-796c-4020-...,Payment Required,12,"Posts for December 31, 2022 | Product Hunt | ..."
6603,SuenaGringo AI,24,Productivity,Helps Spanish immigrants write natural & engag...,5,https://ph-files.imgix.net/ab8126a3-206e-483c-...,Free Options,13,"Posts for December 31, 2022 | Product Hunt | ..."
6604,Gmax CRM Open Source,23,Productivity,Gmax CRM is an open source invoicing software,2,https://ph-files.imgix.net/90f32823-fb61-4b22-...,Free,14,"Posts for December 31, 2022 | Product Hunt | ..."
6605,Grocery Delivery App Development,18,Productivity,SpotnEats developing customized apps for you,2,https://ph-files.imgix.net/96b9a8f3-459b-4557-...,Payment Required,15,"Posts for December 31, 2022 | Product Hunt | ..."
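The concatenation and the later `reset_index` can also be done in one call, since `pd.concat` accepts `ignore_index=True`. The sketch below uses two tiny hand-made frames; the second product name is invented for illustration.

```python
import pandas as pd

# Two small per-day frames standing in for the scraped results
day_one = pd.DataFrame({
    "product_name": ["World Explorer by Insured Nomads", "Tailwind Box Shadows"],
    "position": [1, 2],
})
day_two = pd.DataFrame({"product_name": ["Hypothetical product"], "position": [1]})

# ignore_index=True discards the per-day indexes and numbers rows 0..n-1
combined = pd.concat([day_one, day_two], axis=0, ignore_index=True)
```

The result has a continuous index, so no separate `reset_index(drop=True)` step is needed.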


In [244]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6607 entries, 0 to 6606
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   product_name  6607 non-null   object
 1   upvotes       6607 non-null   object
 2   category      6607 non-null   object
 3   tagline       6607 non-null   object
 4   comments      6607 non-null   object
 5   image link    4666 non-null   object
 6   pricing       6607 non-null   object
 7   position      6607 non-null   int64 
 8   date          6607 non-null   object
dtypes: int64(1), object(8)
memory usage: 464.7+ KB
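The summary above shows `upvotes` and `comments` stored as `object` (text) rather than numbers. Before any numeric analysis they could be converted with `pd.to_numeric`, as sketched here on a small sample; the "N/A" entry is a made-up example of a value that cannot be parsed.

```python
import pandas as pd

# Sample with string counts, as produced by reading element text from the page
sample = pd.DataFrame({
    "upvotes": ["389", "203", "N/A"],
    "comments": ["84", "25", "8"],
})

for column in ["upvotes", "comments"]:
    # errors="coerce" turns unparseable strings into NaN instead of raising
    sample[column] = pd.to_numeric(sample[column], errors="coerce")
```

Columns that contain a coerced NaN come out as floats, while fully numeric columns become integers.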


In [245]:
final_df.to_csv('productHunt2022_scrapingdata.csv', index=False)