# Creating Dataset

To create the dataset, I scrape reviews from [seek.com.au](https://www.seek.com.au/) for two companies, __`Woolworths`__ and __`Coles`__, both of which have a fair number of reviews.

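Because the review pages are rendered with `JavaScript`, a plain HTTP request can return the page source before the target content has loaded. To avoid this, I use __`selenium`__ to load each page in a real browser and __`BeautifulSoup`__ to parse the result.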

In [39]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

from datetime import datetime

import csv

### 'Woolworths' Reviews

Looking at the Woolworths [reviews](https://www.seek.com.au/companies/woolworths-supermarkets-432295/reviews), we can observe that:
1. the reviews are spread over __96__ pages
2. each review link is an `<a>` tag with `class = '_2gLKVsp'`

In [3]:
content = requests.get('https://www.seek.com.au/companies/woolworths-supermarkets-432295/reviews')
soup = BeautifulSoup(content.text, 'html.parser')
links = soup.find('div', {'class' : '_1Y5kUMd'}).find_all('a', {'class' : '_2gLKVsp'})
links[0]['href']

'/companies/woolworths-supermarkets-432295/reviews/2276877'

Now let's scrape all the review links:

In [4]:
browser = webdriver.Firefox()
woolies = []

# The reviews are spread over pages 1 to 96
for page in range(1, 97):
    url = 'https://www.seek.com.au/companies/woolworths-supermarkets-432295/reviews?page={}'.format(page)
    browser.get(url)
    html = browser.page_source
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find('div', {'class' : '_1Y5kUMd'}).find_all('a', {'class' : '_2gLKVsp'})

    # The hrefs are site-relative, so prepend the domain
    for link in links:
        woolies.append("https://www.seek.com.au{}".format(link['href']))

# Close the browser window once all pages have been visited
browser.quit()
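
The URL handling in the loop above can also be written with the standard library. This is only a minimal sketch, not the notebook's original code: `page_urls` and `absolute_url` are hypothetical helper names, and `urljoin` is swapped in for the string formatting used above.

```python
from urllib.parse import urljoin

BASE = "https://www.seek.com.au"
REVIEWS = BASE + "/companies/woolworths-supermarkets-432295/reviews"


def page_urls(first_page, last_page):
    """Return the listing URL for every page from first_page to last_page inclusive."""
    return ["{}?page={}".format(REVIEWS, page) for page in range(first_page, last_page + 1)]


def absolute_url(href):
    """Turn a site-relative href (as found in the <a> tags) into a full URL."""
    return urljoin(BASE, href)
```

Calling `page_urls(1, 96)` yields the 96 listing pages, and `absolute_url(link['href'])` could replace the manual prefixing inside the inner loop.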

To save all 1908 links to a file:

In [5]:
with open('woolies_urls.txt', 'w') as file:
    for url in woolies:
        file.write('{}\n'.format(url))
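
Saving the links means a later session does not have to repeat the 96-page crawl. As a rough sketch of reading them back, the hypothetical helper `load_urls` below reads one URL per line; skipping blank lines and dropping duplicates are assumptions about what might be useful, not steps taken in this notebook.

```python
def load_urls(path):
    """Read one URL per line, skipping blank lines and dropping duplicates in order."""
    seen = set()
    urls = []
    with open(path) as file:
        for line in file:
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```

For example, `woolies = load_urls('woolies_urls.txt')` would restore the list written above.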

In [102]:
woolies[0:5]

['https://www.seek.com.au/companies/woolworths-supermarkets-432295/reviews/2276877',
 'https://www.seek.com.au/companies/woolworths-supermarkets-432295/reviews/256376',
 'https://www.seek.com.au/companies/woolworths-supermarkets-432295/reviews/340048',
 'https://www.seek.com.au/companies/woolworths-supermarkets-432295/reviews/277368',
 'https://www.seek.com.au/companies/woolworths-supermarkets-432295/reviews/197028']

During the scraping below, the process is often interrupted, probably because data is fetched before the page has finished loading. One of the errors, however, revealed that one of the links is faulty and should be removed from the list:

`woolies[1277]`  
`https://www.seek.com.au/companies/woolworths-supermarkets-432295/reviews/35519`

In [104]:
len(woolies)

1908

In [106]:
woolies.pop(1277)

'https://www.seek.com.au/companies/woolworths-supermarkets-432295/reviews/35519'

In [107]:
len(woolies)

1907
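
The interruptions described above could also be handled in code rather than by restarting by hand. The following is a minimal sketch under stated assumptions, not what this notebook does: `with_retries` and `scrape_all` are hypothetical helpers, and `scrape_one` stands for any function that takes a URL and returns one row.

```python
import time


def with_retries(action, attempts=3, delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    Returns (result, None) on success or (None, last_error) on failure.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return action(), None
        except Exception as error:  # broad on purpose: any scraping error triggers a retry
            last_error = error
            if attempt < attempts - 1:
                sleep(delay)  # give the page time to load before trying again
    return None, last_error


def scrape_all(urls, scrape_one, **retry_options):
    """Apply scrape_one to each URL; collect the rows and the URLs that kept failing."""
    rows, failed = [], []
    for url in urls:
        row, error = with_retries(lambda: scrape_one(url), **retry_options)
        if error is None:
            rows.append(row)
        else:
            failed.append(url)
    return rows, failed
```

With this approach a slow page gets a few more chances, while a genuinely faulty link such as the one removed above would end up in `failed` instead of stopping the run.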

Let's scrape the reviews along with some other data and save them to a `csv` file.

In [None]:
browser = webdriver.Firefox()

# Append mode keeps the rows written by earlier runs
with open('RawData.csv', 'a+', newline='') as csvfile:
    datawriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    for url in woolies:
        browser.get(url)
        html = browser.page_source
        soup = BeautifulSoup(html, 'html.parser')

        company = 'Woolworths'
        # The review text is the "pros" section followed by the "cons" section
        review = soup.find('span', {'data-automation' : 'reviewCardFull-pros'}).text + soup.find('span', {'data-automation' : 'reviewCardFull-cons'}).text
        rating = soup.find('span', {'class' : '_1erK2ob'}).text
        review_date = soup.find('span', {'class' : '_3FrNV7v _38Keb0I _2QG7TNq E6m4BZb'}).text
        # Year and month of scraping; review_date on the page is relative (e.g. "2 years ago")
        time_stamp = datetime.today().strftime('%Y, %B')

        datawriter.writerow([company, review, rating, review_date, time_stamp])

# Close the browser window once all reviews have been visited
browser.quit()
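
The loop above concatenates the pros and cons text directly and stores the rating as text. As an illustrative sketch only, the hypothetical helper `build_row` shows one way to assemble the same five fields; joining the two parts with a space and converting the rating to a number are assumptions, not the notebook's behaviour.

```python
from datetime import datetime


def build_row(company, pros, cons, rating_text, review_date, scraped_at):
    """Assemble one CSV row from the pieces extracted from a review page."""
    # Join pros and cons with a space so the last word of one
    # does not run into the first word of the other.
    parts = [part.strip() for part in (pros, cons) if part and part.strip()]
    review = " ".join(parts)
    # Assumes the rating text parses as a number, e.g. "4" or "4.0".
    rating = float(rating_text.strip())
    # Same 'Year, Month' format as the time_stamp used above.
    time_stamp = scraped_at.strftime('%Y, %B')
    return [company, review, rating, review_date.strip(), time_stamp]
```

Inside the loop this could be called as `datawriter.writerow(build_row(company, pros, cons, rating, review_date, datetime.today()))`, with `pros` and `cons` taken from the two `span` elements.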

In [47]:
import pandas as pd

In [110]:
df = pd.read_csv('RawData.csv', header=None)

In [112]:
df.head()

|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | Woolworths | -Great working environment with very good supp... | 5.0 | 2 years ago | 2020, January |
| 1 | Woolworths | I enjoyed what I am doing, it's a tough job, b... | 3.0 | 4 years ago | 2020, January |
| 2 | Woolworths | Working with staff everyday. The ability to wo... | 4.0 | 4 years ago | 2020, January |
| 3 | Woolworths | Great opportunities for career advancement for... | 4.0 | 4 years ago | 2020, January |
| 4 | Woolworths | During peek sales periods; casuals get great h... | 3.0 | 4 years ago | 2020, January |
