### Task 1

In Task 1, collect the review data from the personal website address:

http://mlg.ucd.ie/modules/python/assign2/20210711/

For each review, extract the following information:
- The star rating of the review
- The title text of the review
- The main body text of the review
- Review helpfulness information

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import pandas as pd

%matplotlib inline

#### Data Collection

Use Beautiful Soup to collect the URLs of all review pages for every month.

The index page links to the first review page of each month. A month can have more than one page of reviews, so we follow each page's "Next" link until no further page exists.

In [2]:
# Fetch the top-level index page
base_url = "http://mlg.ucd.ie/modules/python/assign2/20210711/"
html = urlopen(base_url).read().decode('utf-8')
# Parse the page with Beautiful Soup
soup = BeautifulSoup(html, features='lxml')
# Each month's first page is an <a> tag with these classes; keep its relative href
href_class = soup.find_all("a", {"class": "list-group-item list-group-item-action"})
href_suffix = [suffix["href"] for suffix in href_class]

# Build absolute URLs by prefixing the base URL
first_herf = [(base_url + suffix) for suffix in href_suffix]

print('numbers of links in first level page: ', len(first_herf))
print('The head of links: ')
for x in range(5):
    print (first_herf[x])

numbers of links in first level page:  72
The head of links: 
http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-jan-01.html
http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-feb-01.html
http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-mar-01.html
http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-apr-01.html
http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-may-01.html
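
The cell above builds absolute URLs by string concatenation, which works here because every `href` is relative to `base_url`. A minimal sketch of an alternative, assuming the same base URL, uses `urllib.parse.urljoin` from the standard library, which also leaves already-absolute links untouched. The helper name `to_absolute` is hypothetical and not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = "http://mlg.ucd.ie/modules/python/assign2/20210711/"

def to_absolute(base, hrefs):
    """Resolve each href against the base URL (hypothetical helper)."""
    return [urljoin(base, href) for href in hrefs]

# Relative hrefs as they appear on the index page
links = to_absolute(BASE_URL, ["reviews-2016-jan-01.html", "reviews-2016-feb-01.html"])
print(links[0])
```

The base URL must end with a slash for relative names to resolve inside that directory.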


Next, starting from each month's first page, find the URLs of that month's remaining pages.

In [3]:
# A month can have more than one page of reviews.
# For example, January 2016 has five pages.
total_herf = [i for i in first_herf]
# The list grows while we iterate over it, so newly appended pages are visited too
for link in total_herf:
    html = urlopen(link).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    # raw string avoids an invalid-escape warning for \.
    href_class = soup.find_all("a", {"class": "page-link", "href": re.compile(r'.*?\.html'), "aria-label": "Next"})
    # href_suffix holds at most one element: the "Next" link
    href_suffix = [suffix["href"] for suffix in href_class]
    if href_suffix:  # a next page exists
        # no duplicate check needed
        total_herf.append(base_url + href_suffix.pop())

print('numbers of all review pages link: ', len(total_herf))
print('The head of links: ')
for x in range(5):
    print (total_herf[x])

numbers of all review pages link:  344
The head of links: 
http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-jan-01.html
http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-feb-01.html
http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-mar-01.html
http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-apr-01.html
http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-may-01.html
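
The pagination step above can also be sketched as a small function that is independent of how a page is fetched. This is an illustration under assumptions, not the notebook's code: `collect_pages` and `get_next` are hypothetical names, and `get_next` stands in for the Beautiful Soup lookup of the "Next" link, returning `None` when there is none. The toy page names are made up.

```python
def collect_pages(first_pages, get_next):
    """Follow "Next" links from each first page until none remain.

    get_next takes a URL and returns the next page's URL, or None.
    """
    all_pages = []
    seen = set()
    for start in first_pages:
        url = start
        # Stop at the last page, or if a link loops back to a visited page
        while url is not None and url not in seen:
            seen.add(url)
            all_pages.append(url)
            url = get_next(url)
    return all_pages

# Toy link structure standing in for the real site (made-up page names)
fake_next = {
    "jan-01.html": "jan-02.html",
    "jan-02.html": "jan-03.html",
    "feb-01.html": "feb-02.html",
}
pages = collect_pages(["jan-01.html", "feb-01.html"], fake_next.get)
print(pages)
```

Unlike the cell above, which appends later pages to the end of the list, this version keeps each month's pages together.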


Save the URLs to a text file so they can be reused without crawling the site again.

In [4]:
with open("review_url.txt", "w") as outfile:
    outfile.write("\n".join(str(item) for item in total_herf))

Read the file back into a list.

In [5]:
with open("review_url.txt") as file:
    lines = file.readlines()
    lines = [line.rstrip() for line in lines]

print(len(lines))
for x in range(5):
    print (lines[x])

344
http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-jan-01.html
http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-feb-01.html
http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-mar-01.html
http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-apr-01.html
http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-may-01.html
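
The save and load steps above can be checked with a quick round trip. The sketch below uses `pathlib` and a temporary directory so that it does not touch the notebook's own `review_url.txt`; the two URLs are taken from the output above.

```python
import tempfile
from pathlib import Path

urls = [
    "http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-jan-01.html",
    "http://mlg.ucd.ie/modules/python/assign2/20210711/reviews-2016-feb-01.html",
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "review_url.txt"
    # One URL per line, no trailing newline
    path.write_text("\n".join(urls), encoding="utf-8")
    # splitlines() drops the line endings, like rstrip() in the cell above
    restored = path.read_text(encoding="utf-8").splitlines()

print(restored == urls)
```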


From the list of URLs above, parse every review from 2016 to 2021. For each review, extract the star rating, the title text, the main body text, and the helpfulness information (stored as two counts: users who found the review helpful, and total users who voted).

In [6]:
frames = []
# lines holds the same URLs as total_herf, reloaded from file
for url in lines:
    # fetch and parse the page
    html = urlopen(url).read().decode()
    soup = BeautifulSoup(html, features='lxml')

    # extract ratings: take the number n from each "n-star" alt text
    rating = soup.find_all("img", {"alt": re.compile(r'[0-9]')})
    rating = [re.findall(r'[0-9]+', line["alt"]) for line in rating]
    rating = [int(item) for num in rating for item in num]

    # extract titles, dropping non-breaking spaces
    title_class = soup.find_all('h5')
    title = [(item.get_text().replace(u'\xa0', u'')) for item in title_class]

    # extract bodies, dropping non-breaking spaces
    body_class = soup.find_all("p", {"class": "review-body"})
    body = [(item.get_text().replace(u'\xa0', u'')) for item in body_class]

    # extract helpfulness as a pair of numbers: (helpful users, total users)
    helpful_class = soup.find_all(string=re.compile(r"\d users found this review helpful$"))
    helpful = [[int(i) for i in re.findall(r'[0-9]+', line)] for line in helpful_class]
    helpful_user = [pair[0] for pair in helpful]
    total_user = [pair[1] for pair in helpful]

    frames.append(pd.DataFrame({"title": title, "body": body, "rating": rating,
                                "helpful_user_num": helpful_user, "total_user_num": total_user}))

# concatenate once at the end instead of growing the DataFrame page by page
df = pd.concat(frames)

df.head(5)

| | title | body | rating | helpful_user_num | total_user_num |
|---|---|---|---|---|---|
| 0 | The herbs were great...but the cherry tomatoes... | The herb kit that came with my Aerogarden was ... | 2 | 15 | 17 |
| 1 | Even more useful than regular parchment paper | I originally bought this just because it was c... | 5 | 19 | 19 |
| 2 | Shake it before you bake it | If you do it in reverse (bake before shaking),... | 2 | 2 | 13 |
| 3 | Not what the picture describes | I bought this steak for my father in law for C... | 2 | 7 | 14 |
| 4 | What a ripe off - GIVE ME A BREAK | Sorry but I had these noodles and they are no ... | 2 | 10 | 34 |
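
The two regex-based extractions in the cell above can be illustrated on plain strings. This is a minimal sketch, not the notebook's code: the helper names `parse_rating` and `parse_helpfulness` are hypothetical, and the exact wording of the helpfulness sentence ("15 out of 17 users found this review helpful") is an assumption based on the pattern the cell searches for.

```python
import re

def parse_rating(alt_text):
    """Return the first integer in an alt text such as "2-star", or None."""
    match = re.search(r"\d+", alt_text)
    return int(match.group()) if match else None

def parse_helpfulness(text):
    """Return (helpful, total), or None unless the text has exactly two numbers."""
    numbers = re.findall(r"\d+", text)
    if len(numbers) != 2:
        return None
    return int(numbers[0]), int(numbers[1])

print(parse_rating("2-star"))
# Assumed sentence format; the real page text may differ
print(parse_helpfulness("15 out of 17 users found this review helpful"))
```

Returning `None` for unexpected input makes a malformed review easy to spot, whereas indexing `line[1]` directly would raise an `IndexError`.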


Save the DataFrame to a CSV file so the dataset can be reused without scraping again:

In [7]:
df.to_csv('dataset.csv', index=False)
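
Review bodies often contain commas, so the CSV writer has to quote those fields for the file to load back correctly. The sketch below illustrates this with the standard-library `csv` module rather than pandas; as far as I know, `to_csv` applies the same minimal-quoting convention by default. The sample row is made up for illustration.

```python
import csv
import io

# Made-up row whose body contains a comma
rows = [{"title": "Sample title", "body": "Tasty, but too salty", "rating": 4}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "body", "rating"])
writer.writeheader()
writer.writerows(rows)

# Read the text back; the quoted field is restored as a single value
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]["body"])
```

Note that values read back this way are strings, so numeric columns such as `rating` need converting if the file is loaded without pandas.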