In [1]:
!pip install beautifulsoup4 
!pip install requests
!pip install pandas



We're going to collect the links to every product across all pages of the listing.
Then we'll visit each product page individually and scrape the data we want.

In [1]:
# import required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

Here we set the base URL of the site, because we'll need it
when we construct the full URL of each individual product.

We will also send a User-Agent header with every HTTP request. By default,
requests identifies itself as python-requests/<version>, which some sites block.

In [2]:
baseurl = "https://www.thewhiskyexchange.com"

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}
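
As a quick check of the point above, the sketch below prints the User-Agent that requests would send if we supplied none, and shows one optional way to attach a browser-like header to every call through a Session. The `session` variable and the shortened User-Agent string are illustrative assumptions; the rest of this notebook keeps passing `headers` explicitly.

```python
import requests

# The User-Agent that requests sends when none is supplied,
# e.g. "python-requests/2.x.y" (the exact version depends on the install).
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A Session applies the same headers to every request made through it.
# The shortened browser string below is only an illustration.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
print(session.headers["User-Agent"])
```

With a session set up this way, `session.get(url)` would carry the header without a `headers=` argument on each call.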


Next we go through each item on the listing page and build a URL for it.
To do that we first make an HTTP call to the listing page,
then extract the li elements with BeautifulSoup.

In [3]:
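# send the browser-like User-Agent defined above with the request
k = requests.get('https://www.thewhiskyexchange.com/c/35/japanese-whisky', headers=headers).text
soup = BeautifulSoup(k, 'html.parser')
# every product on the listing page sits in its own <li class="product-grid__item">
productlist = soup.find_all("li", {"class": "product-grid__item"})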
print(productlist)

[<li class="product-grid__item"><a class="product-card" href="/p/37325/suntory-torys-classic" onclick="_gaq.push(['_trackEvent', 'Products-GridView', 'click', '37325 : Suntory Torys Classic'])" title="Suntory Torys Classic"><div class="product-card__image-container"><img alt="Suntory Torys Classic" class="product-card__image lazy" data-original="https://img.thewhiskyexchange.com/480/japan_sun20.jpg" height="4" src="https://img.thewhiskyexchange.com/ph.png" width="3"/></div><div class="product-card__content"><p class="product-card__name"> Suntory Torys Classic </p><p class="product-card__meta"> 70cl / 37% </p></div><div class="product-card__data"><p class="product-card__price"> £30.45 </p><p class="product-card__unit-price"> (£43.50 per litre) </p></div></a></li>, <li class="product-grid__item"><a class="product-card" href="/p/45577/nikka-days" onclick="_gaq.push(['_trackEvent', 'Products-GridView', 'click', '45577 : Nikka Days'])" title="Nikka Days"><div class="product-card__image-cont

That gives us the HTML for the items on this page.
Inside each of these list items there is a link to the individual product page.
We will write a script to scrape all of those links from productlist.

In [4]:
# get the product links
productlinks = []
for product in productlist:
    link = product.find("a",{"class":"product-card"}).get('href')
    productlinks.append(baseurl + link)


Here we first declare an empty list called productlinks.
Then a for loop visits each element of productlist and extracts its link.
The .get() method returns the value of the href attribute.
Because each href is a relative path, we prepend baseurl to build a complete URL.
Finally, we append every complete URL to productlinks.
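
String concatenation works here because every href starts with a slash. As an alternative sketch (not what the notebook uses), the standard library's `urljoin` builds the same URL and also copes with links that are already absolute. The `href` value is taken from the output above; the variable names are illustrative.

```python
from urllib.parse import urljoin

baseurl = "https://www.thewhiskyexchange.com"
href = "/p/37325/suntory-torys-classic"  # relative link, as found in the href attribute

# Resolve the relative path against the base URL.
full_url = urljoin(baseurl, href)
print(full_url)

# A link that is already absolute is returned unchanged.
absolute = urljoin(baseurl, "https://img.thewhiskyexchange.com/ph.png")
print(absolute)
```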

In [5]:
productlinks

['https://www.thewhiskyexchange.com/p/37325/suntory-torys-classic',
 'https://www.thewhiskyexchange.com/p/45577/nikka-days',
 'https://www.thewhiskyexchange.com/p/16917/akashi-blended-whisky',
 'https://www.thewhiskyexchange.com/p/49001/tokinoka-white-blended-whisky',
 'https://www.thewhiskyexchange.com/p/49821/hatozaki-blended-japanese-whisky',
 'https://www.thewhiskyexchange.com/p/57521/suntory-toki-glass-pack',
 'https://www.thewhiskyexchange.com/p/48272/mars-kasei-blended-whisky',
 'https://www.thewhiskyexchange.com/p/2928/nikka-from-the-barrel',
 'https://www.thewhiskyexchange.com/p/24587/togouchi-premium-blended-whisky',
 'https://www.thewhiskyexchange.com/p/2935/nikka-pure-malt-red',
 'https://www.thewhiskyexchange.com/p/49822/hatozaki-pure-malt-japanese-whisky',
 'https://www.thewhiskyexchange.com/p/34970/tokinoka-black-blended-whisky',
 'https://www.thewhiskyexchange.com/p/37317/suntory-chita-whisky',
 'https://www.thewhiskyexchange.com/p/30377/super-nikka-rare-old',
 'https:/

In [6]:
len(productlinks)

24

We have to cover all five pages of the category.
To do so we wrap the HTTP call in a for loop.
Since there are 5 pages we use range(1, 6), which yields the page numbers 1 through 5 (the end value 6 is excluded).
The target URL changes too: it now takes the page number in its pg query parameter.

In [7]:
# get the products links from all the pages
productlinks = []
for x in range(1,6):  
    # x is the page number; send the same User-Agent header as before
    k = requests.get('https://www.thewhiskyexchange.com/c/35/japanese-whisky?pg={}&psize=24&sort=pasc'.format(x), headers=headers).text
    soup = BeautifulSoup(k, 'html.parser')
    productlist = soup.find_all("li",{"class":"product-grid__item"})
 
    for product in productlist:
        link = product.find("a",{"class":"product-card"}).get('href')
        productlinks.append(baseurl + link)
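
The page numbers produced by range(1, 6) and the URLs they expand to can be checked without making any request. This is a minimal sketch that builds the same query string with the standard library's `urlencode`; `category_url` and `page_urls` are names introduced only for this illustration.

```python
from urllib.parse import urlencode

category_url = "https://www.thewhiskyexchange.com/c/35/japanese-whisky"

page_urls = []
for page in range(1, 6):  # yields 1, 2, 3, 4, 5
    # Same query parameters as the URL used in the loop above.
    query = urlencode({"pg": page, "psize": 24, "sort": "pasc"})
    page_urls.append(category_url + "?" + query)

for url in page_urls:
    print(url)
```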

In [8]:
len(productlinks)

112

Now we can loop through each of these links to extract the product information 
from each page and then store it in another list or dictionary.

Next we are going to analyze the pattern in which the information is displayed 
on the product page. We will extract the name, price, ratings, and about text.

The name is under an h1 tag, the about text and the rating are each under a div tag,
and the price is under a p tag. Now, let's extract them.

In [9]:
# iterate over each link 

data = []

for link in productlinks:
    f = requests.get(link,headers=headers).text
    soup=BeautifulSoup(f,'html.parser')

    # soup.find() returns None when an element is missing, so .text raises
    # AttributeError; in that case we record None for the field.
    try:
        price = soup.find("p",{"class":"product-action__price"}).text.replace('\n',"")
    except AttributeError:
        price = None

    try:
        about = soup.find("div",{"class":"product-main__description"}).text.replace('\n',"")
    except AttributeError:
        about = None

    try:
        rating = soup.find("div",{"class":"review-overview"}).text.replace('\n',"")
    except AttributeError:
        rating = None

    try:
        name = soup.find("h1",{"class":"product-main__name"}).text.replace('\n',"")
    except AttributeError:
        name = None

    whisky = {"name":name,"price":price,"rating":rating,"about":about}

    data.append(whisky)
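
The four try/except blocks all follow one pattern: look an element up, take its text, and fall back to None when the element is absent. A possible way to factor that out is sketched below. `text_or_none` is a hypothetical helper, not part of BeautifulSoup, and the sample HTML is made up to stand in for a product page with no reviews. Note that it strips surrounding whitespace with `get_text(strip=True)` rather than only removing newlines.

```python
from bs4 import BeautifulSoup

def text_or_none(soup, tag, class_name):
    """Return the stripped text of the first matching element, or None if absent."""
    element = soup.find(tag, {"class": class_name})
    return element.get_text(strip=True) if element is not None else None

# Made-up HTML standing in for a product page that has no reviews yet.
sample_html = """
<h1 class="product-main__name"> Example Whisky </h1>
<p class="product-action__price"> £10.00 </p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

name = text_or_none(sample_soup, "h1", "product-main__name")
price = text_or_none(sample_soup, "p", "product-action__price")
rating = text_or_none(sample_soup, "div", "review-overview")  # element is missing
print(name, price, rating)
```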

    

In [10]:
df = pd.DataFrame(data)
df.head()

   name                              price   rating           about
0  Suntory Torys Classic             £30.45  3.5(2 Reviews)   Suntory Torys Classic is a light and easy-drin...
1  Nikka Days                        £31.95  5(1 Review)      A vibrant addition to the Nikka range, Nikka D...
2  Akashi Blended Whisky             £32.75  3.5(19 Reviews)  A lesser-known whisky outside the local Japane...
3  Tokinoka White Blended Whisky     £32.95                   A Japanese blended whisky from White Oak disti...
4  Hatozaki Blended Japanese Whisky  £33.95  3(3 Reviews)     Named after the oldest stone lighthouse in Jap...


In [11]:
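# save the dataframe; index=False keeps the row index out of the file,
# so reading the CSV back does not produce an extra "Unnamed: 0" column
df.to_csv('e_commerce.csv', index=False)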