# API Web Scraping

Version  | Date | Author | Notes |
:-------:|:----:|:-------|:-----:|
0.1 |11 July 2023| Ken Dizon | Initial version
0.2 |24 July 2023| Ken Dizon | Test environment and data saves

**Objective**

Write a program that scrapes pickleball paddle names and prices from brand websites.

#### Content
* Libraries

In [1]:
try:
    import numpy as np #math library 
    import scipy #computation
    import matplotlib.pyplot as plt #visualization
    import pandas as pd #dataframes
    from datetime import datetime # timestamp
    # API Packages
    import requests
    from bs4 import BeautifulSoup
    # visuals
    from pprint import pprint
    from IPython.display import display, HTML

    print('Bs4 Documentation:', 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/')
    print("Libraries imported successfully!")
except ImportError as err:
    print(f"Missing library: {err.name}. Please install it before running this notebook.")

Bs4 Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Libraries imported successfully!


# Websites to Scrape

- **Data Information** 
    - Pickleball is a growing sport in the USA. The gear necessary to play is fairly minimal; however, it can be expensive. As an avid player, I see numerous paddles and even know individuals who "paddle shame", as there are certain brands known for quality materials or whose price is driven by the name.
    - Popular paddles are from JOOLA, Selkirk, Engage, Diadem, CRBN, and Ronbus:

In [2]:
# URLs of the paddle listing pages to scrape (the Joola listing spans three pages)
joola_url = ["https://joolausa.com/pickleball-paddles/?orderby=price-desc",
              "https://joolausa.com/pickleball-paddles/page/2/?orderby=price-desc",
              "https://joolausa.com/pickleball-paddles/page/3/?orderby=price-desc"]
selkirk_url = 'https://www.selkirk.com/collections/paddles?sort_by=price-descending'
engage_url = 'https://engagepickleball.com/collections/allpaddles'
diadem_url = 'https://diademsports.com/collections/diadem-best-sellers?filter.p.product_type=Pickleball+Paddles'
crbn_url = 'https://crbnpickleball.com/collections/crbn-best-pickleball-paddles'
ronbus_url = 'https://www.ronbus.com/Paddles_c_11.html'

In [3]:
print(joola_url[0], '\n', selkirk_url, '\n', engage_url, '\n',
      diadem_url, '\n', crbn_url, '\n', ronbus_url, '\n')

https://joolausa.com/pickleball-paddles/?orderby=price-desc 
 https://www.selkirk.com/collections/paddles?sort_by=price-descending 
 https://engagepickleball.com/collections/allpaddles 
 https://diademsports.com/collections/diadem-best-sellers?filter.p.product_type=Pickleball+Paddles 
 https://crbnpickleball.com/collections/crbn-best-pickleball-paddles 
 https://www.ronbus.com/Paddles_c_11.html 



# [1] Scraping
#### 1.1 Joola

In [4]:
len(joola_url)

3

In [5]:
# page 2 (index 1)
joola_url[1]

'https://joolausa.com/pickleball-paddles/page/2/?orderby=price-desc'

In [6]:
# Extract using Bs4
response = requests.get(joola_url[1])
soup = BeautifulSoup(response.text, "html.parser")

In [7]:
# Look at the whole HTML and search for the element
#print(soup.prettify())
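
The request above has no timeout and does not check the HTTP status, so a blocked or failed request would hand an error page to the parser without any warning. A minimal sketch of a more defensive fetch follows. The helper name `fetch_soup` and the 10-second timeout are assumptions, and the `get` parameter exists only so the helper can be exercised without network access:

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10, get=requests.get):
    """Download a page and parse it (hypothetical helper, not part of the notebook)."""
    response = get(url, timeout=timeout)  # timeout value is an assumption
    response.raise_for_status()           # raise on 4xx/5xx instead of parsing an error page
    return BeautifulSoup(response.text, "html.parser")


# Usage (requires network access):
# soup = fetch_soup(joola_url[1])
```

Some sites also reject requests that lack a browser-like `User-Agent` header; whether any of the sites listed here do so has not been checked.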

In [8]:
soup.find_all("h2", class_="woocommerce-loop-product__title")[:3]

[<h2 class="woocommerce-loop-product__title">JOOLA Ben Johns Hyperion CGS 14 Pickleball Paddle</h2>,
 <h2 class="woocommerce-loop-product__title">JOOLA Solaire CFS 14 Swift Pickleball Paddle</h2>,
 <h2 class="woocommerce-loop-product__title">JOOLA Solaire CFS 14 Pickleball Paddle</h2>]

In [9]:
soup.find_all("span", class_="price")[:3]

[<span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">$</span>199.95</bdi></span></span>,
 <span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">$</span>189.95</bdi></span></span>,
 <span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">$</span>189.95</bdi></span></span>]

- This format should be the same across all 3 Joola pages. We will now create a product-name list and a price list to store the values.

In [10]:
# Create empty lists to store the data
product_names = []
prices = []

In [11]:
# Iterate over the URLs
for url in joola_url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Find the product name elements
    product_name_elements = soup.find_all("h2", class_="woocommerce-loop-product__title")
    product_names.extend([element.text.strip()
                          for element in product_name_elements])
    
    # Find the price elements
    price_elements = soup.find_all("span", class_="price")
    prices.extend([element.text.strip()
                   for element in price_elements])

In [12]:
product_name_elements[:3]

[<h2 class="woocommerce-loop-product__title">JOOLA Seneca CDS 16 Pickleball Paddle</h2>,
 <h2 class="woocommerce-loop-product__title">JOOLA Essentials Pickleball Paddles &amp; Balls Set</h2>,
 <h2 class="woocommerce-loop-product__title">JOOLA Seneca FDS 14 Pickleball Paddle</h2>]

In [13]:
product_names[:3]

['JOOLA Anna Bright Scorpeus CFS 14 Pickleball Paddle',
 'JOOLA Collin Johns Scorpeus CFS 16 Pickleball Paddle',
 'JOOLA Ben Johns Perseus CFS 16 Pickleball Paddle']

In [14]:
prices[:3]

['$249.95', '$249.95', '$249.95']

In [15]:
type(prices)

list
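
The scraped prices are strings such as `'$249.95'`, so they cannot be sorted or averaged numerically as they stand. A minimal sketch of a conversion step follows, assuming a hypothetical helper `parse_price` and US-style formatting (comma as thousands separator, period as decimal point):

```python
import re


def parse_price(text):
    """Return the first dollar amount in `text` as a float, or None if absent.

    Hypothetical helper; assumes US-style formatting such as '$1,249.95'.
    """
    match = re.search(r"\$\s*(\d[\d,]*(?:\.\d+)?)", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


sample = ["$249.95", "$ 259.99", "from $200.00"]
print([parse_price(p) for p in sample])
# [249.95, 259.99, 200.0]
```

On a DataFrame column the same helper could be applied with `df['Price'].map(parse_price)`.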

In [16]:
# Create a DataFrame from the scraped data
data = {"Product Name": product_names, "Price": prices}
df = pd.DataFrame(data)

# assign the current date to the table
df['Prod_Date'] = datetime.today().strftime('%Y-%m-%d')

# assign the brand
df['Brand'] = 'JOOLA'

# Print the DataFrame
df

| | Product Name | Price | Prod_Date | Brand |
|---:|:---|:---|:---|:---|
| 0 | JOOLA Anna Bright Scorpeus CFS 14 Pickleball P... | $249.95 | 2023-07-24 | JOOLA |
| 1 | JOOLA Collin Johns Scorpeus CFS 16 Pickleball ... | $249.95 | 2023-07-24 | JOOLA |
| 2 | JOOLA Ben Johns Perseus CFS 16 Pickleball Paddle | $249.95 | 2023-07-24 | JOOLA |
| 3 | JOOLA Ben Johns Perseus CFS 14 Pickleball Paddle | $249.95 | 2023-07-24 | JOOLA |
| 4 | JOOLA Simone Jardim Hyperion CFS 16 Swift Pick... | $219.95 | 2023-07-24 | JOOLA |
| 5 | JOOLA Simone Jardim Hyperion CFS 16 Pickleball... | $219.95 | 2023-07-24 | JOOLA |
| 6 | JOOLA Simone Jardim Hyperion CFS 14 Swift Pick... | $219.95 | 2023-07-24 | JOOLA |
| 7 | JOOLA Ben Johns Hyperion CFS 16 Swift Pickleba... | $219.95 | 2023-07-24 | JOOLA |
| 8 | JOOLA Ben Johns Hyperion CFS 14 Swift Pickleba... | $219.95 | 2023-07-24 | JOOLA |
| 9 | JOOLA Ben Johns Hyperion CFS 14 Pickleball Paddle | $219.95 | 2023-07-24 | JOOLA |


In [18]:
# Save Joola DF
#df.to_csv('prepped_Xs_data.csv')
# df.to_csv('/Users/Kenny/Documents/DataSets/joolaData_23July2023.csv')
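
The commented-out save calls write to a fixed, user-specific path. A minimal sketch of a more portable save step follows, assuming a hypothetical `save_snapshot` helper, a hypothetical `data` output directory, and a file-name pattern loosely modelled on the one above. Passing `index=False` keeps the row index out of the file, which avoids an extra `Unnamed: 0` column when the CSV is read back:

```python
from datetime import datetime
from pathlib import Path

import pandas as pd


def save_snapshot(frame, brand, out_dir="data"):
    """Write `frame` to <out_dir>/<brand>Data_<YYYY-MM-DD>.csv and return the path.

    Hypothetical helper; the directory and file-name pattern are assumptions.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)   # create the folder if needed
    stamp = datetime.today().strftime("%Y-%m-%d")
    target = out_path / f"{brand}Data_{stamp}.csv"
    frame.to_csv(target, index=False)             # do not write the row index
    return target


# Usage with the Joola frame from above:
# save_snapshot(df, "joola")
```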

#### 1.2 Selkirk

In [19]:
# Storage of SK-paddle list
product_name = []
product_price = []

In [20]:
response = requests.get(selkirk_url)
soup = BeautifulSoup(response.text, "html.parser")

In [21]:
#pprint(soup)
#print(soup.prettify())

In [22]:
product_name = [tag.text.strip()
                for tag in soup.find_all('a', class_='product-title')]

In [23]:
print(product_name[:3])
print(len(product_name))

['Vanguard 2.0 S2\n\n - Crimson Black', 'Vanguard 2.0 S2\n\n - Crimson Black', 'Vanguard 2.0 Invikta\n\n - Crimson Black']
52
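
The Selkirk titles carry embedded newlines (`'Vanguard 2.0 S2\n\n - Crimson Black'`), presumably because the model and colour sit on separate lines inside the link. A minimal sketch of a cleanup step follows, assuming a hypothetical helper `clean_title` that collapses every run of whitespace into a single space:

```python
def clean_title(raw):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(raw.split())


titles = ["Vanguard 2.0 S2\n\n - Crimson Black",
          "Vanguard 2.0 Invikta\n\n - Crimson Black"]
print([clean_title(t) for t in titles])
# ['Vanguard 2.0 S2 - Crimson Black', 'Vanguard 2.0 Invikta - Crimson Black']
```

The helper could be used in the list comprehension above in place of `tag.text.strip()`.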


In [24]:
product_price = [tag.text.strip().replace('from ', '')
                  for tag in soup.find_all('div', class_='price-regular')]

In [25]:
print(product_price[:3])
print(len(product_price))

['$200.00', '$200.00', '$200.00']
52


In [26]:
# Create a DataFrame from the scraped data
data = {"Product Name": product_name, "Price": product_price}
df_sk = pd.DataFrame(data)

# assign the current date to the table
df_sk['Prod_Date'] = datetime.today().strftime('%Y-%m-%d')

# assign the brand
df_sk['Brand'] = 'Selkirk'

# Print the DataFrame
df_sk.head()

| | Product Name | Price | Prod_Date | Brand |
|---:|:---|:---|:---|:---|
| 0 | Vanguard 2.0 S2\n\n - Crimson Black | $200.00 | 2023-07-24 | Selkirk |
| 1 | Vanguard 2.0 S2\n\n - Crimson Black | $200.00 | 2023-07-24 | Selkirk |
| 2 | Vanguard 2.0 Invikta\n\n - Crimson Black | $200.00 | 2023-07-24 | Selkirk |
| 3 | Vanguard 2.0 Invikta\n\n - Crimson Black | $200.00 | 2023-07-24 | Selkirk |
| 4 | Vanguard 2.0 Epic\n\n - Arizona Sun | $200.00 | 2023-07-24 | Selkirk |


In [27]:
df_sk.shape

(52, 4)
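
Because the Joola and Selkirk tables share the same four columns, they could be stacked into a single frame for cross-brand comparison. A minimal sketch follows, using small stand-in frames whose rows are copied from the outputs above; the names `joola`, `selkirk`, and `paddles` are hypothetical:

```python
import pandas as pd

# Stand-in frames with the same columns as df and df_sk above.
joola = pd.DataFrame({
    "Product Name": ["JOOLA Ben Johns Perseus CFS 16 Pickleball Paddle"],
    "Price": ["$249.95"],
    "Prod_Date": ["2023-07-24"],
    "Brand": ["JOOLA"],
})
selkirk = pd.DataFrame({
    "Product Name": ["Vanguard 2.0 S2\n\n - Crimson Black"],
    "Price": ["$200.00"],
    "Prod_Date": ["2023-07-24"],
    "Brand": ["Selkirk"],
})

# Stack the brand tables and renumber the rows from 0.
paddles = pd.concat([joola, selkirk], ignore_index=True)

# Count how many rows each brand contributed.
print(paddles["Brand"].value_counts())
```

In the notebook itself the call would be `pd.concat([df, df_sk], ignore_index=True)`.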

In [28]:
# Save Selkirk DF
#df_sk.to_csv('prepped_Xs_data.csv')
# df_sk.to_csv('/Users/Kenny/Documents/DataSets/selkirkData_23July2023.csv')

#### 1.3 Engage

In [145]:
print(engage_url)

https://engagepickleball.com/collections/allpaddles


In [146]:
# Storage of Engage paddle list
product_name = []
product_price = []

In [147]:
# site
response = requests.get(engage_url)
soup = BeautifulSoup(response.text, "html.parser")

In [148]:
# element extraction
product = soup.find_all("span", class_="title")
product_name.extend([element.text.strip()
                          for element in product])

In [149]:
print(product_name[:3])
print('Number of Products = ', len(product_name))

['NEW. Pursuit Ultra EX 6.0 Carbon Fiber | Standard', 'NEW. Pursuit Ultra MX 6.0 Carbon Fiber | Elongated', 'NEW. Pursuit Ultra EX Carbon Fiber | Standard']
Number of Products =  21


In [150]:
price = soup.find_all("span", class_="money")
product_price.extend([element.text.strip()
                          for element in price])

In [151]:
print(product_price[:3])
print('Number of Prices = ', len(product_price))

['$ 259.99', '$ 259.99', '$ 259.99']
Number of Prices =  52


In [154]:
product_price

['$ 259.99',
 '$ 259.99',
 '$ 259.99',
 '$ 259.99',
 '$ 259.99',
 '$ 259.99',
 '$ 259.99',
 '$ 259.99',
 '$ 199.99',
 '$ 219.99',
 '$ 199.99',
 '$ 219.99',
 '$ 199.99',
 '$ 219.99',
 '$ 199.99',
 '$ 219.99',
 '$ 199.99',
 '$ 219.99',
 '$ 199.99',
 '$ 219.99',
 '$ 199.99',
 '$ 219.99',
 '$ 199.99',
 '$ 219.99',
 '$ 199.99',
 '$ 219.99',
 '$ 199.99',
 '$ 219.99',
 '$ 169.99',
 '$ 169.99',
 '$ 169.99',
 '$ 169.99',
 '$ 169.99',
 '$ 169.99',
 '$ 169.99',
 '$ 169.99',
 '$ 159.99',
 '$ 159.99',
 '$ 159.99',
 '$ 159.99',
 '$ 159.99',
 '$ 159.99',
 '$ 159.99',
 '$ 159.99',
 '$ 134.99',
 '$ 134.99',
 '$ 134.99',
 '$ 134.99',
 '$ 119.99',
 '$ 119.99',
 '$ 99.99',
 '$ 99.99']

In [155]:
len(product_name) == len(product_price)

False

**Note:** This still needs to be solved.
Problem: the product and price lists must be the same length, but 21 product names were found against 52 prices. The Engage website shows both a sale price and a previous price (`was_price`), which makes it challenging to extract only the current price for each product:
```html
<meta content="New" itemprop="itemCondition"/>
<span class="money">
 $ 199.99
</span>
<span class="was_price">
 <span class="money">
  $ 219.99
```
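
One possible way around the mismatch is to read the name and the price from the same product container, skipping any `money` span nested inside `was_price`. The sketch below runs on a hand-written HTML fragment modelled on the excerpt above. The container class `product-card` and the helper name `extract_products` are assumptions, so the real container selector would need to be read off the prettified HTML. The counts above are consistent with each product's price block being rendered twice (52 = 2 × (21 current prices + 5 previous prices)), so filtering out `was_price` alone would probably not be enough.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration; 'product-card' is a hypothetical class.
html = """
<div class="product-card">
  <span class="title">Paddle A</span>
  <span class="money">$ 199.99</span>
  <span class="was_price"><span class="money">$ 219.99</span></span>
  <span class="money">$ 199.99</span>
</div>
<div class="product-card">
  <span class="title">Paddle B</span>
  <span class="money">$ 259.99</span>
</div>
"""


def extract_products(soup, card_selector="div.product-card"):
    """Return (name, current_price) pairs, one per product container."""
    rows = []
    for card in soup.select(card_selector):
        title = card.find("span", class_="title")
        # Keep only prices that are not nested inside a 'was_price' span.
        current = [m for m in card.find_all("span", class_="money")
                   if m.find_parent("span", class_="was_price") is None]
        if title is not None and current:
            rows.append((title.get_text(strip=True),
                         current[0].get_text(strip=True)))
    return rows


rows = extract_products(BeautifulSoup(html, "html.parser"))
print(rows)
# [('Paddle A', '$ 199.99'), ('Paddle B', '$ 259.99')]
```

Because each row is built from a single container, the name and price lists cannot drift out of step, even when a product shows several prices.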

________
**NOTE**: What I learned was how to extract raw HTML from a website and use the BeautifulSoup (bs4) library in Python to place the data into a table. With the data in tabular form, I can now create insights via data visualization.

________
# Test DevOps
- Use the space below to view the HTML.

In [34]:
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7 ]><html class="ie ie6" lang="en"> <![endif]-->
<!--[if IE 7 ]><html class="ie ie7" lang="en"> <![endif]-->
<!--[if IE 8 ]><html class="ie ie8" lang="en"> <![endif]-->
<!--[if IE 9 ]><html class="ie ie9" lang="en"> <![endif]-->
<!--[if (gte IE 10)|!(IE)]><!-->
<html lang="en">
 <!--<![endif]-->
 <head>
  <meta content="8UykyQSoj3XEY2e6iZ6Qq6K4cGwBry7KHcGGqCfc7dk" name="google-site-verification"/>
  <meta content="3IRkf7a2kzwQquiwp1ybTY8FvuJpx2uhaS1WCpPONLY" name="google-site-verification"/>
  <meta content="JSiIuYJH51QVK47IvV3fwnhrUo3JOMMr8rDyh8O915I" name="google-site-verification"/>
  <!-- Google Tag Manager -->
  <script>
   (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-NC