# API Web Scraping

Version | Date | Author | Notes
:------:|:----:|:-------|:-----:
0.1 | 11 July 2023 | Ken Dizon | Initial version

**Objective**

Write a program that scrapes product and pricing data from websites (`Data Analytics, Data Manipulation`).
- Compare internal pricing against competitor pricing to determine market competitiveness.
- Analyze funnel metrics to identify underserved markets and industries for growth opportunities.
- Use online resources - such as competitor filings, subscriber services, and service offering tools - for gathering and developing information.
- Assist in building, monitoring, and distributing BI tools for Underwriting and Pricing leads that aid in goal tracking and execution.
- Provide information to make well-informed pricing segmentation changes as they relate to growth, retention, and profitability.

#### Content
* Libraries
1. 

In [1]:
try:
    import numpy as np #math library 
    import scipy #computation
    import matplotlib.pyplot as plt #visualization
    import pandas as pd #dataframes
    from datetime import datetime # timestamp
    # API Packages
    import requests
    from bs4 import BeautifulSoup
    from pprint import pprint

    print('Bs4 Documentation:', 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/')
    print("Libraries imported successfully!")
except ImportError as err:
    print(f"Missing library: {err.name}. Please install it before running this notebook.")

Bs4 Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Libraries imported successfully!


# Websites to Scrape

- **Data Information** 
    - Pickleball is a growing sport in the USA. The gear necessary to play is fairly minimal, but it can be expensive. As an avid player, I see numerous paddles and even know individuals who "paddle shame," as there are certain brands known for quality materials or priced higher because of the name.
    - Popular Paddles are from:

In [2]:
# URLs of the paddle listing pages to scrape (the Joola listing spans three pages)
joola_url = ["https://joolausa.com/pickleball-paddles/?orderby=price-desc",
              "https://joolausa.com/pickleball-paddles/page/2/?orderby=price-desc",
              "https://joolausa.com/pickleball-paddles/page/3/?orderby=price-desc"]
selkirk_url = 'https://www.selkirk.com/collections/paddles?sort_by=price-descending'
engage_url = 'https://engagepickleball.com/collections/allpaddles'
diadem_url = 'https://diademsports.com/collections/diadem-best-sellers?filter.p.product_type=Pickleball+Paddles'
crbn_url = 'https://crbnpickleball.com/collections/crbn-best-pickleball-paddles'
ronbus_url = 'https://www.ronbus.com/Paddles_c_11.html'

In [3]:
print(joola_url[0], '\n', selkirk_url, '\n', engage_url, '\n',
      diadem_url, '\n', crbn_url, '\n', ronbus_url, '\n')

https://joolausa.com/pickleball-paddles/?orderby=price-desc 
 https://www.selkirk.com/collections/paddles?sort_by=price-descending 
 https://engagepickleball.com/collections/allpaddles 
 https://diademsports.com/collections/diadem-best-sellers?filter.p.product_type=Pickleball+Paddles 
 https://crbnpickleball.com/collections/crbn-best-pickleball-paddles 
 https://www.ronbus.com/Paddles_c_11.html 



# [1] Scraping
#### 1.1 Joola

In [4]:
len(joola_url)

3

In [5]:
# page 2 (index 1 of the list)
joola_url[1]

'https://joolausa.com/pickleball-paddles/page/2/?orderby=price-desc'

In [6]:
# Extract using Bs4
response = requests.get(joola_url[1])
soup = BeautifulSoup(response.text, "html.parser")

In [9]:
# Look at the whole HTML
#print(soup.prettify())

In [11]:
soup.find_all("h2", class_="woocommerce-loop-product__title")[:3]

[<h2 class="woocommerce-loop-product__title">JOOLA Ben Johns Hyperion CGS 14 Pickleball Paddle</h2>,
 <h2 class="woocommerce-loop-product__title">JOOLA Solaire CFS 14 Swift Pickleball Paddle</h2>,
 <h2 class="woocommerce-loop-product__title">JOOLA Solaire CFS 14 Pickleball Paddle</h2>]

In [12]:
soup.find_all("span", class_="price")[:3]

[<span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">$</span>199.95</bdi></span></span>,
 <span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">$</span>189.95</bdi></span></span>,
 <span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">$</span>189.95</bdi></span></span>]

- This format should be the same across all three Joola pages. Next, create two lists to store the product names and prices.

In [13]:
# Create empty lists to store the data
product_names = []
prices = []

In [14]:
# Iterate over the URLs
for url in joola_url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Find the product name elements
    product_name_elements = soup.find_all("h2", class_="woocommerce-loop-product__title")
    product_names.extend([element.text.strip()
                          for element in product_name_elements])
    
    # Find the price elements
    price_elements = soup.find_all("span", class_="price")
    prices.extend([element.text.strip()
                   for element in price_elements])
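
The loop above assumes that every page yields the same number of titles and prices; if they drift apart, building a table from the two lists will fail later. A minimal sketch of a more defensive parsing step follows, reusing the same selectors. The helper name `parse_joola_page`, the length check, and the inline sample markup are illustrative assumptions, not part of the original notebook.

```python
from bs4 import BeautifulSoup


def parse_joola_page(html):
    """Return (name, price) pairs from one listing page's HTML.

    Hypothetical helper: it uses the same selectors as the loop above.
    """
    soup = BeautifulSoup(html, "html.parser")
    names = [el.get_text(strip=True)
             for el in soup.find_all("h2", class_="woocommerce-loop-product__title")]
    prices = [el.get_text(strip=True)
              for el in soup.find_all("span", class_="price")]
    # Fail loudly if names and prices fall out of step.
    if len(names) != len(prices):
        raise ValueError(f"{len(names)} names but {len(prices)} prices")
    return list(zip(names, prices))


# Small sample in the shape of the markup shown earlier (not fetched from the site).
sample_html = """
<h2 class="woocommerce-loop-product__title">JOOLA Solaire CFS 14 Pickleball Paddle</h2>
<span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">$</span>189.95</bdi></span></span>
"""
pairs = parse_joola_page(sample_html)
print(pairs)
```

In the notebook, this could be called with `response.text` for each URL, ideally after `response.raise_for_status()` so that an HTTP error is not silently parsed as an empty page.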

In [15]:
product_name_elements[:3]

[<h2 class="woocommerce-loop-product__title">JOOLA Seneca CDS 16 Pickleball Paddle</h2>,
 <h2 class="woocommerce-loop-product__title">JOOLA Essentials Pickleball Paddles &amp; Balls Set</h2>,
 <h2 class="woocommerce-loop-product__title">JOOLA Seneca FDS 14 Pickleball Paddle</h2>]

In [16]:
product_names[:3]

['JOOLA Anna Bright Scorpeus CFS 14 Pickleball Paddle',
 'JOOLA Collin Johns Scorpeus CFS 16 Pickleball Paddle',
 'JOOLA Ben Johns Perseus CFS 16 Pickleball Paddle']

In [17]:
prices[:3]

['$249.95', '$249.95', '$249.95']

In [18]:
type(prices)

list
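
The scraped prices are strings such as `'$249.95'`, so they need converting before any numeric comparison. A minimal sketch, assuming US-style formatting with a leading `$` and optional thousands commas; `parse_price` is a hypothetical helper name.

```python
def parse_price(text):
    """Convert a price string like '$1,249.95' to a float (hypothetical helper)."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned)


# Sample values taken from the scraped output above.
sample_prices = ['$249.95', '$249.95', '$219.95']
numeric_prices = [parse_price(p) for p in sample_prices]
print(numeric_prices)
```

On a DataFrame, the same idea could be applied with `df['Price'].map(parse_price)`.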

In [19]:
# Create a DataFrame from the scraped data
data = {"Product Name": product_names, "Price": prices}
df = pd.DataFrame(data)

# assign the current date to the table
df['Prod_Date'] = datetime.today().strftime('%Y-%m-%d')

# assign the brand
df['Brand'] = 'JOOLA'

# Print the DataFrame
df

| | Product Name | Price | Prod_Date | Brand |
|:-:|:-------|:-----:|:----:|:-----:|
| 0 | JOOLA Anna Bright Scorpeus CFS 14 Pickleball P... | $249.95 | 2023-07-23 | JOOLA |
| 1 | JOOLA Collin Johns Scorpeus CFS 16 Pickleball ... | $249.95 | 2023-07-23 | JOOLA |
| 2 | JOOLA Ben Johns Perseus CFS 16 Pickleball Paddle | $249.95 | 2023-07-23 | JOOLA |
| 3 | JOOLA Ben Johns Perseus CFS 14 Pickleball Paddle | $249.95 | 2023-07-23 | JOOLA |
| 4 | JOOLA Simone Jardim Hyperion CFS 16 Swift Pick... | $219.95 | 2023-07-23 | JOOLA |
| 5 | JOOLA Simone Jardim Hyperion CFS 16 Pickleball... | $219.95 | 2023-07-23 | JOOLA |
| 6 | JOOLA Simone Jardim Hyperion CFS 14 Swift Pick... | $219.95 | 2023-07-23 | JOOLA |
| 7 | JOOLA Ben Johns Hyperion CFS 16 Swift Pickleba... | $219.95 | 2023-07-23 | JOOLA |
| 8 | JOOLA Ben Johns Hyperion CFS 14 Swift Pickleba... | $219.95 | 2023-07-23 | JOOLA |
| 9 | JOOLA Ben Johns Hyperion CFS 14 Pickleball Paddle | $219.95 | 2023-07-23 | JOOLA |


In [20]:
# Save Joola DF
#df.to_csv('prepped_Xs_data.csv')
#df.to_csv('/Users/Kenny/......./joolaData_23July2023.csv')
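
The commented-out save step hard-codes the date in the file name. A small sketch of deriving that name from the run date instead; `dated_filename` is a hypothetical helper, and the output directory is deliberately left out.

```python
from datetime import date


def dated_filename(brand, on=None):
    """Build a name like 'joolaData_23July2023.csv' (hypothetical helper)."""
    on = on or date.today()
    # %d = zero-padded day, %B = full month name, %Y = four-digit year
    return f"{brand}Data_{on.strftime('%d%B%Y')}.csv"


example_name = dated_filename("joola", date(2023, 7, 23))
print(example_name)
```

The result could then be passed to `df.to_csv(...)` together with whatever folder the data should live in.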

#### 1.2 Selkirk

In [21]:
# Lists to store the Selkirk paddle names and prices
product_name = []
product_price = []

In [22]:
response = requests.get(selkirk_url)
soup = BeautifulSoup(response.text, "html.parser")

In [23]:
#pprint(soup)
#print(soup.prettify())

In [24]:
product_name = [tag.text.strip()
                for tag in soup.find_all('a', class_='product-title')]

In [25]:
print(product_name[:3])
print(len(product_name))

['Vanguard 2.0 S2\n\n - Crimson Black', 'Vanguard 2.0 S2\n\n - Crimson Black', 'Vanguard 2.0 Invikta\n\n - Crimson Black']
52
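
The Selkirk titles carry embedded newlines between the model and the colorway (for example `'Vanguard 2.0 S2\n\n - Crimson Black'`). A minimal sketch of tidying them, assuming that collapsing all whitespace into single spaces is acceptable for the analysis; `clean_name` is a hypothetical helper.

```python
def clean_name(raw):
    """Collapse runs of whitespace, including newlines, into single spaces."""
    return " ".join(raw.split())


# Sample values taken from the scraped output above.
raw_names = ['Vanguard 2.0 S2\n\n - Crimson Black',
             'Vanguard 2.0 Invikta\n\n - Crimson Black']
cleaned_names = [clean_name(n) for n in raw_names]
print(cleaned_names)
```

The first rows also repeat the same title, so it may be worth checking whether those are true duplicates before analysis.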


In [26]:
product_price = [tag.text.strip().replace('from ', '')
                  for tag in soup.find_all('div', class_='price-regular')]

In [27]:
print(product_price[:3])
print(len(product_price))

['$200.00', '$200.00', '$200.00']
52


In [28]:
# Create a DataFrame from the scraped data
data = {"Product Name": product_name, "Price": product_price}
df_sk = pd.DataFrame(data)

# assign the current date to the table
df_sk['Prod_Date'] = datetime.today().strftime('%Y-%m-%d')

# assign the brand
df_sk['Brand'] = 'Selkirk'

# Print the DataFrame
df_sk.head()

| | Product Name | Price | Prod_Date | Brand |
|:-:|:-------|:-----:|:----:|:-----:|
| 0 | Vanguard 2.0 S2\n\n - Crimson Black | $200.00 | 2023-07-23 | Selkirk |
| 1 | Vanguard 2.0 S2\n\n - Crimson Black | $200.00 | 2023-07-23 | Selkirk |
| 2 | Vanguard 2.0 Invikta\n\n - Crimson Black | $200.00 | 2023-07-23 | Selkirk |
| 3 | Vanguard 2.0 Invikta\n\n - Crimson Black | $200.00 | 2023-07-23 | Selkirk |
| 4 | Vanguard 2.0 Epic\n\n - Arizona Sun | $200.00 | 2023-07-23 | Selkirk |


In [29]:
df_sk.shape

(52, 4)
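
Because the Joola and Selkirk tables share the same columns, they can be stacked into one table for cross-brand comparison. A minimal sketch using `pd.concat`, with small stand-in frames built from rows shown above rather than the full scraped tables; the names `joola_sample`, `selkirk_sample`, and `combined` are illustrative.

```python
import pandas as pd

# Stand-ins with the same columns as df and df_sk (rows copied from the outputs above).
joola_sample = pd.DataFrame({
    "Product Name": ["JOOLA Ben Johns Perseus CFS 16 Pickleball Paddle",
                     "JOOLA Ben Johns Hyperion CFS 14 Pickleball Paddle"],
    "Price": ["$249.95", "$219.95"],
    "Prod_Date": "2023-07-23",
    "Brand": "JOOLA",
})
selkirk_sample = pd.DataFrame({
    "Product Name": ["Vanguard 2.0 Epic\n\n - Arizona Sun"],
    "Price": ["$200.00"],
    "Prod_Date": "2023-07-23",
    "Brand": "Selkirk",
})

# Stack the rows and renumber the index from 0.
combined = pd.concat([joola_sample, selkirk_sample], ignore_index=True)
# Count how many products each brand contributes.
counts = combined["Brand"].value_counts()
print(counts)
```

With the real tables, the call would be `pd.concat([df, df_sk], ignore_index=True)`.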

In [30]:
# Save Selkirk DF
#df_sk.to_csv('prepped_Xs_data.csv')
#df_sk.to_csv('/Users/Kenny/......./selkirkData_23July2023.csv')

________
**NOTE**: I learned how to extract raw data from a website's HTML and use the BeautifulSoup (bs4) library in Python to place that data into a table. With the data in tabular form, I can now create insights through data visualization.

________
# Test DevOps

In [31]:
print(soup.prettify())

<!--                        // Selkirk // 

  
          ███████╗███████╗██╗     ██╗  ██╗██╗██████╗ ██╗  ██╗
          ██╔════╝██╔════╝██║     ██║ ██╔╝██║██╔══██╗██║ ██╔╝
          ███████╗█████╗  ██║     █████╔╝ ██║██████╔╝█████╔╝ 
          ╚════██║██╔══╝  ██║     ██╔═██╗ ██║██╔══██╗██╔═██╗ 
          ███████║███████╗███████╗██║  ██╗██║██║  ██║██║  ██╗
          ╚══════╝╚══════╝╚══════╝╚═╝  ╚═╝╚═╝╚═╝  ╚═╝╚═╝  ╚═╝


                           www.Selkirk.com
                                  •
                           We Are PickleBall
                            

  -->
<!DOCTYPE html>
<!--[if lt IE 7]><html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]-->
<!--[if IE 7]><html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]-->
<!--[if IE 8]><html class="no-js lt-ie9" lang="en"> <![endif]-->
<!--[if IE 9 ]><html class="ie9 no-js"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!-->
<html class="no-js">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge,ch