# Chapter 2: Data Collection and Pre-processing

Dear user, to help further analyse the design opportunities, \
you will now scrape data from several online sources for references.

In this notebook, you will scrape data from 4 different categories: \
**1. Social Media: Youtube**, \
**2. Social Media: Reddit**, \
**3. Websites**, \
**4. PDF File**

Each category will be using a different way to scrape data.

### TO DO SECTION

In [9]:
'''
Dear user, enter your Product here!
'''

product = "PICO 4 All-in-One VR Headset"

In [10]:
'''
Dear user, enter your online sources here (wherever required)!
'''

# In this notebook, we have pre-picked 5 online sources as follows:
### [1] Social Media: Youtube Comments
#       Please adjust the number of results to ensure a good sample size of 3000 - 5000 comments.
max_results = 3

### [2] Social Media: Reddit Comments
#       Please indicate the most relevant subreddit for your product:
subreddit = "r/virtualreality"

### [3] Website: Tech Magazine (PCGamer) Review
#       Please indicate the url to your website:
url_3 = "https://www.pcgamer.com/pico-4-ve-headset-review/"
#       Please inspect element on the website to find the most relevant container containing what you wish to scrape:
container_tag_3 = "div"
container_class_3 = "text-copy bodyCopy auto"

### [4] Website: Official Product Website
#       Please indicate the url to your website:
url_4 = "https://www.picoxr.com/sg/products/pico4"
###     Please inspect element on the website to find the most relevant container containing what you wish to scrape:
container_tag_4 = "main"
container_class_4 = "tIY88xTQJtQ9ZOGrUS1U"

### [5] PDF File: Official User Manual
#       Please indicate the url to your website:
url_5 = "https://pico-web-tob.oss-cn-beijing.aliyuncs.com/20230825/document/1695015503416348672.pdf"


### RUN AS INTENDED (DO NOT CHANGE ANYTHING.)

##### Set Up

In [11]:
! pip install requests
! pip install python-dotenv
! pip install PyPDF2
! pip install deep_translator



In [12]:
search_terms = product

## 2.1: Data Collection

### Category 1: Social Media Youtube
##### [1] Youtube Comments

In [13]:
""" Initialise and Set up Google API Key """
import os
from googleapiclient.discovery import build
from dotenv import load_dotenv

load_dotenv()

key = os.getenv("GOOGLE_API_KEY")

youtube = build('youtube', 'v3', developerKey=key)

file = "youtube"

In [14]:
""" Define containers to store info """
vid_id = []             	# video id
vid_page = []       		# video links (https...)
vid_title = []              # video title
num_comments = []           # official number of comments
load_error = 0              # error counter
can_load_title = []         # temp. list for storing title w/o loading error
can_load_page = []          # temp. list for storing links w/o loading error
num_page = []               # comment_response page number
page_title = []             # comment_response video title
comment_resp = []           # comment_response
comment_list = []           # temp. list for storing comments
comment_data = []           # comments & replies from comment_response
all_count = 0               # total number of comments

In [15]:
""" Search for Video IDs based on User Inputs """
print("Search for Videos IDs...")
request = youtube.search().list(
    q=search_terms,
    maxResults=max_results,
    part="id",
    type="video",
    order="relevance"         # Switch to "viewCount" if the number of comments are not sufficient
    )
search_response = request.execute()
print(search_response)

Search for Videos IDs...


{'kind': 'youtube#searchListResponse', 'etag': '2sgz1jctv84RvsJuRQ2wiqHH9vk', 'nextPageToken': 'CAMQAA', 'regionCode': 'SG', 'pageInfo': {'totalResults': 375936, 'resultsPerPage': 3}, 'items': [{'kind': 'youtube#searchResult', 'etag': 'prc_QMMDypcmDIw9gcNYp0n7vEI', 'id': {'kind': 'youtube#video', 'videoId': 'i6uMxzBMwLY'}}, {'kind': 'youtube#searchResult', 'etag': 'YjYA0rkMZ7iyYGer2_wxWUHmbj0', 'id': {'kind': 'youtube#video', 'videoId': 'y9ls5fEeG48'}}, {'kind': 'youtube#searchResult', 'etag': 'SPR8aguaaNaXMa4Q7WrsmCu5ZLE', 'id': {'kind': 'youtube#video', 'videoId': 'kCZ1BtCsguo'}}]}


In [16]:
""" Create a list of Video IDs and a corresponding list of weblinks """
print("Videos found...")
for i in range(max_results):
    videoId = search_response['items'][i]['id']['videoId']
    print(videoId)
    vid_id.append(videoId)                          # a list of Video IDs
    page = "https://www.youtube.com/watch?v=" + videoId
    print(page)
    print()
    vid_page.append(page)                           # a list of Video links
print("\nThere are", len(vid_page), "videos.")

Videos found...
i6uMxzBMwLY
https://www.youtube.com/watch?v=i6uMxzBMwLY

y9ls5fEeG48
https://www.youtube.com/watch?v=y9ls5fEeG48

kCZ1BtCsguo
https://www.youtube.com/watch?v=kCZ1BtCsguo


There are 3 videos.


In [17]:
""" Use the list of Video IDs to get video data """
print("Get video data...")
for i in range(len(vid_id)):
    request = youtube.videos().list(
        part="snippet, statistics",
        id=vid_id[i]
        )
    video_response = request.execute()
    print(video_response)

    title = video_response['items'][0]['snippet']['title']
    vid_title.append(title)
    try:                        # use try/except as some videos might not load
        comment_count = video_response['items'][0]['statistics']['commentCount']
        print("Video", i + 1, "-", title, "-- Comment count: ", comment_count)
        print()
        num_comments.append(comment_count)
    except:
        print("Video", i + 1, "-", title, "-- Comments are turned off")
        print()
        num_comments.append(0)

Get video data...
{'kind': 'youtube#videoListResponse', 'etag': 'Vv-Mx2subUf5g5vJzQ_krojLZNc', 'items': [{'kind': 'youtube#video', 'etag': '1OmlpDsJYUH0qU8DnMG50WYU7ns', 'id': 'i6uMxzBMwLY', 'snippet': {'publishedAt': '2022-10-18T20:00:11Z', 'channelId': 'UCsmk8NDVMct75j_Bfb9Ah7w', 'title': 'Pico 4 Review - A Great Quest 2 Alternative!', 'description': "I check out the new Pico 4 VR headset. The Pico 4 releases in the UK, Europe and Asia on the 18th October. I've spent some time with the Pico 4 to give you this review on it's specs and features along with a comparison of how it stacks up against the Quest 2.\n\nAmazon Accessory links (affiliate);\nPico 4 VR headset 128GB / 256GB;\nhttps://amzn.to/3EO0yEa\nAnker USB C to 3.5mm Adapter; \nhttps://amzn.to/3s0A3Ul\nBOBOVR B2 Battery Pack;\nhttps://amzn.to/3yKDQJg\nSteelseries 7P Headphones;\nhttps://amzn.to/3CDolnD\nOfficial Meta Quest Link Cable;\nhttps://amzn.to/3CDolnD\nShort Headphone Cable;\nhttps://amzn.to/3S4r2nN\n\nTimestamps;\n00:

In [18]:
""" Use the list of Video IDs to get comments (by page) """
print("Get comment data...")
for i in range(len(vid_id)):
    try:                                        # use try/except as some "comments are turned off"
        request = youtube.commentThreads().list(
            part="snippet,replies",
            videoId=vid_id[i]
            )
        comment_response = request.execute()
        print(comment_response)

        comment_resp.append(comment_response)   # append 1 page of comment_response
        pages = 1
        num_page.append(pages)                  # append page number of comment_response
        page_title.append(vid_title[i])         # append video title along with the comment_response

        can_load_page.append(vid_page[i])       # drop link if it can't load (have at least 1 comment page)
        can_load_title.append(vid_title[i])     # drop title if it can't load (have at least 1 comment page)

        test = comment_response.get('nextPageToken', 'nil')         # check for nextPageToken
        while test != 'nil':                                        # keep running until last comment page
            next_page_ = comment_response.get('nextPageToken')
            request = youtube.commentThreads().list(
                part="snippet,replies",
                pageToken=next_page_,
                videoId=vid_id[i]
                )
            comment_response = request.execute()
            print(comment_response)

            comment_resp.append(comment_response)                   # append next page of comment_response
            pages += 1
            num_page.append(pages)                                  # append page number of comment_response
            page_title.append(vid_title[i])                         # append video title along with the comment_response

            test = comment_response.get('nextPageToken', 'nil')     # check for nextPageToken (while loop)
    except:
        load_error += 1

Get comment data...
{'kind': 'youtube#commentThreadListResponse', 'etag': 'R2hbbdmQ-YKxH3xqSSU-DZOC3uQ', 'nextPageToken': 'Z2V0X25ld2VzdF9maXJzdC0tQ2dnSWdBUVZGN2ZST0JJRkNJY2dHQUFTQlFpSklCZ0FFZ1VJcUNBWUFCSUZDSWdnR0FBU0JRaWRJQmdCR0FBaURRb0xDSkNfbjZ3R0VKQzM5MG8=', 'pageInfo': {'totalResults': 20, 'resultsPerPage': 20}, 'items': [{'kind': 'youtube#commentThread', 'etag': 'viUVN83qKWyPsi_n7kLQVqVqhdM', 'id': 'Ugwbrn2qhQqXYgJAUQ94AaABAg', 'snippet': {'channelId': 'UCsmk8NDVMct75j_Bfb9Ah7w', 'videoId': 'i6uMxzBMwLY', 'topLevelComment': {'kind': 'youtube#comment', 'etag': 'LkUzHpSnmTR_XXxc_9MzDKAQ_BA', 'id': 'Ugwbrn2qhQqXYgJAUQ94AaABAg', 'snippet': {'channelId': 'UCsmk8NDVMct75j_Bfb9Ah7w', 'videoId': 'i6uMxzBMwLY', 'textDisplay': 'You need to be online with it? that&#39;s one way of them accessing information like face scans to be sent back to China for evaluation. I have lack of trust when electronics ask to be on-line &amp; ask for more personal details before you can use it!', 'textOriginal

In [19]:
""" Show videos without loading errors """
print("Videos that can load...")
vid_page = can_load_page                    # update vid_page with those with no load error
vid_title = can_load_title                  # update vid_title with those with no load error
for i in range(len(vid_title)):
    if vid_title[i] == 'YouTube':           # default error title is 'YouTube'
        vid_title[i] = 'Video_' + str(i+1)  # replace 'YouTube' with Video_1 format
    print(i + 1, vid_title[i])

Videos that can load...
1 Pico 4 Review - A Great Quest 2 Alternative!
2 My PICO 4 All-in-One VR Headset Final Review
3 Pico 4 All-In-One VR Headset Unboxing


In [20]:
""" Sift through and store comments as a list """
print("Get individual comment...")
for k in range(len(comment_resp)):
    count = 0                                                     # comment counter
    comments_found = comment_resp[k]['pageInfo']['totalResults']  # comments on 1 comment_response page
    count = count + comments_found
    for i in range(comments_found):
        try:
            comment_list.append(comment_resp[k]['items'][i]['snippet']['topLevelComment']['snippet']['textDisplay'])
            print(comment_resp[k]['items'][i]['snippet']['topLevelComment']['snippet']['textDisplay'])
        except:
            print("missing comment")                              # or too many comments (e.g. 7.3K comments)

print(comment_list)
print()
print(len(comment_list), "comments in total.")

Get individual comment...
You need to be online with it? that&#39;s one way of them accessing information like face scans to be sent back to China for evaluation. I have lack of trust when electronics ask to be on-line &amp; ask for more personal details before you can use it!
They are locking up these devices to their stores, games, controllers. VR should be open platform, so you can buy controllers (including specialty gun like) for any device, you should also be able to play steam vr games regardless of your vr headset, if they don&#39;t standardize these things VR is not going to be popular (at leas nowhere near as popular as regular consoles).
The middle fingers 😂 that’s the real test
Didn&#39;t Apple copy these guys?
Who also checked his Steam inbox @ <a href="https://www.youtube.com/watch?v=i6uMxzBMwLY&amp;t=439">7:19</a>? 😅
Hi  this head set have some sort of lag in games in wireless mode?  I buyed reverb g2 and sell it a few months ago because the lenses become blurry after mi

In [21]:
""" Saving files! """

""" Create directory """
try:                                              # Create directory named after search terms
    os.makedirs("support/%s" % search_terms)
    print("Directory", search_terms, "created")
except FileExistsError:
    print("Directory", search_terms, "exists")

try:                                              # Create directory to store current search terms
    os.makedirs("support/_current_")
    print("Directory _current_ created")
except FileExistsError:
    print("Directory _current_ exists")

try:                                              # Create directory in search terms named after file
    os.makedirs("support/%s/%s" % (search_terms, file))
    print("Directory", file, "created")
except FileExistsError:
    print("Directory", file, "exists")

try:                                              # Create directory in _current_ named after file
    os.makedirs("support/_current_/%s" % file)
    print("Directory _current_/", file, "created")
except FileExistsError:
    print("Directory _current_/", file, "exists")

""" Save files for future use """
import pickle

f = open("support/%s/%s/comments.txt" % (search_terms, file), "w+", encoding="utf-8")
for i in range(len(comment_list)):
    f.write("<<<" + comment_list[i] + ">>>")
f.close()

pickle.dump(search_terms, open("support/%s/searchTerms.pkl" % search_terms, "wb"))
pickle.dump(comment_list, open("support/%s/%s/comment_list.pkl" % (search_terms, file), "wb"))
pickle.dump(vid_title, open("support/%s/%s/vid_title.pkl" % (search_terms, file), "wb"))
pickle.dump(vid_page, open("support/%s/%s/vid_page.pkl" % (search_terms, file), "wb"))
pickle.dump(vid_id, open("support/%s/%s/vid_id.pkl" % (search_terms, file), "wb"))

""" Save files for next step """
import shutil

source = "support/%s/%s/comments.txt" % (search_terms, file)
destination = "support/_current_/%s/comments.txt" % file
shutil.copyfile(source, destination)

pickle.dump(search_terms, open("support/_current_/searchTerms.pkl", "wb"))

Directory PICO 4 All-in-One VR Headset exists
Directory _current_ exists
Directory youtube created
Directory _current_/ youtube created


### Category 2: Social Media Reddit
##### [2] Reddit Comments

In [22]:
""" Initialise and Set up Reddit API """
import requests
import os
from dotenv import load_dotenv

load_dotenv()

id = os.getenv("REDDIT_API_ID")
key = os.getenv("REDDIT_API_KEY")
user = os.getenv("REDDIT_API_USER")
pw = os.getenv("REDDIT_API_PW")

auth = requests.auth.HTTPBasicAuth(id, key)

data = {'grant_type': 'password',                                       # Initalize using login method (password), username, and password
        'username': user,
        'password': pw}

headers = {'User-Agent': 'DAI/AID'}                                     # Setup our header info, which gives reddit a brief description of our app

res = requests.post('https://www.reddit.com/api/v1/access_token',       # Send request for an OAuth token
                    auth=auth, data=data, headers=headers)

TOKEN = res.json()['access_token']                                      # Convert response to JSON and pull access_token value

headers = {**headers, **{'Authorization': f"bearer {TOKEN}"}}

requests.get('https://oauth.reddit.com/api/v1/me', headers=headers)     # Token is valid for ~2 hours. If get Response [200], you are good to go!

file = "reddit"

In [23]:
""" Define containers to store info """
post_id = []             	     # post id
post_page = []       		     # post links (https...)
post_title = []                  # post title
post_num_comments = []           # official number of comments
comment_list = []           # temp. list for storing comments

In [24]:
""" Search for Post IDs based on User Inputs """
print("Search for Post IDs...")
res = requests.get(f"https://oauth.reddit.com/{subreddit}/search/?q={search_terms}&restrict_sr=on&sort=relevance&t=all",   # Restrict search to only r/virtualreality
                   headers=headers)

if res.status_code == 200:                  # Check if the request was successful
    search_response = res.json()            # Parse the response as JSON
    print(search_response)
else:
    print(f"Error: {res.status_code} - {res.reason}")

Search for Post IDs...


In [25]:
""" Create a list of Post IDs and a corresponding list of weblinks """
print("Posts found...")

for post in search_response['data']['children']:
    postId = post['data']['id']
    print(postId)
    post_id.append(postId)                           # a list of Post IDs
    page = f"https://oauth.reddit.com/r/virtualreality/comments/{postId}"
    print(page)
    print()
    post_page.append(page)                           # a list of Post links

print(f"\nThere are {len(post_page)} posts.")

Posts found...
xvcvnx
https://oauth.reddit.com/r/virtualreality/comments/xvcvnx

ze3p1j
https://oauth.reddit.com/r/virtualreality/comments/ze3p1j

141ngvl
https://oauth.reddit.com/r/virtualreality/comments/141ngvl

1aikgkz
https://oauth.reddit.com/r/virtualreality/comments/1aikgkz

16jwbay
https://oauth.reddit.com/r/virtualreality/comments/16jwbay

11m3bwm
https://oauth.reddit.com/r/virtualreality/comments/11m3bwm

1avlm7n
https://oauth.reddit.com/r/virtualreality/comments/1avlm7n

1aspvkd
https://oauth.reddit.com/r/virtualreality/comments/1aspvkd

xtd8h2
https://oauth.reddit.com/r/virtualreality/comments/xtd8h2

4y0gsx
https://oauth.reddit.com/r/virtualreality/comments/4y0gsx

lkbnun
https://oauth.reddit.com/r/virtualreality/comments/lkbnun

18sac0i
https://oauth.reddit.com/r/virtualreality/comments/18sac0i

111b6lg
https://oauth.reddit.com/r/virtualreality/comments/111b6lg

ycs9xu
https://oauth.reddit.com/r/virtualreality/comments/ycs9xu

18xbufy
https://oauth.reddit.com/r/virtualrea

In [26]:
""" Use the list of Post IDs to get post data """
print("Get post data...")
for i, post_response in enumerate(search_response['data']['children']):
    try:
        print(post_response)
        title = post_response['data']['title']                                  # Extract title and comment count from post response
        post_title.append(title)
        comment_count = post_response['data']['num_comments']
        print(f"Post {i + 1} - {title} -- Comment count: {comment_count}")
        post_num_comments.append(comment_count)
        print()
        
    except Exception as e:
        print(f"Error fetching post data: {e}")

print(sum(post_num_comments), "comments in total.")

Get post data...
{'kind': 't3', 'data': {'approved_at_utc': None, 'subreddit': 'virtualreality', 'selftext': '', 'author_fullname': 't2_7jdj3', 'saved': False, 'mod_reason_title': None, 'gilded': 0, 'clicked': False, 'title': 'PSA - Amazon UK Pico 4 Pre-Orders are up', 'link_flair_richtext': [{'e': 'text', 't': 'News Article'}], 'subreddit_name_prefixed': 'r/virtualreality', 'hidden': False, 'pwls': 6, 'link_flair_css_class': '', 'downs': 0, 'thumbnail_height': 100, 'top_awarded_type': None, 'hide_score': False, 'name': 't3_xvcvnx', 'quarantine': False, 'link_flair_text_color': 'dark', 'upvote_ratio': 0.96, 'author_flair_background_color': None, 'ups': 1183, 'total_awards_received': 0, 'media_embed': {}, 'thumbnail_width': 140, 'author_flair_template_id': None, 'is_original_content': False, 'user_reports': [], 'secure_media': None, 'is_reddit_media_domain': True, 'is_meta': False, 'category': None, 'secure_media_embed': {}, 'link_flair_text': 'News Article', 'can_mod_post': False, 'sco

In [27]:
""" Sift through and store comments as a list """
print("Get individual comment...")

def scrape_comments(url, comment_list):
    try:
        page_json = f"{url}.json"

        post_comments_response = requests.get(page_json, headers=headers) 
        if post_comments_response.status_code == 200:                                       # Check if request was successful
            post_comments_data = post_comments_response.json()
            if isinstance(post_comments_data, list) and len(post_comments_data) > 1:
                total_comments = post_comments_data[1]['data']['children'][0]['data']['total_awards_received']
                print(f"Total comments: {total_comments}")
                comments = post_comments_data[1]['data']['children']                        # Extract comments from the response
                for comment in comments:
                    try:
                        comment_body = comment['data']['body']
                        print(comment_body)
                        comment_list.append(comment_body)
                        if 'replies' in comment['data'] and comment['data']['replies']:     # Check for replies and recursively scrape them
                            scrape_replies(comment['data']['replies']['data']['children'], comment_list)
                    except KeyError:
                        print("Error: Missing 'body' attribute for a comment.")
        else:
            print(f"Error fetching comments for page {url}: {post_comments_response.status_code} - {post_comments_response.reason}")
    except Exception as e:
        print(f"Error fetching comments for page {url}:", e)

def scrape_replies(replies, comment_list):                                                  # Function to scrape replies recursively
    for reply in replies:
        try:
            reply_body = reply['data']['body']
            print(reply_body)
            comment_list.append(reply_body)
            # Recursively scrape replies of replies
            if 'replies' in reply['data'] and reply['data']['replies']:
                scrape_replies(reply['data']['replies']['data']['children'], comment_list)
        except KeyError:
            print("Error: Missing 'body' attribute for a reply.")

for page in post_page:
    scrape_comments(page, comment_list)

print(comment_list)
print()
print(len(comment_list), "valid comments in total.")

Get individual comment...
Total comments: 0
This is the first affordable Pancake headset and according to reviewers it features bigger FoV than the Quest 2, too bad it's a bytedance product though...
I'm pretty new to the world of VR, what is the issue with bytedance?
It's basically Chinese Facebook
no no no...  its much much worse.   FB at least has US scrutiny now and is being sued up the ass for past mistakes.  Bytedance has no accountability and privacy policy is meaningless.
Not true, the device is being distributed in Europe and so has to comply with our strict regulations, additionally this headset doesn't require any login.

Edit: it does need a login....

Listen. To be fair, "Requires no social media account" would entail no login to most people, right? Am I crazy for thinking that?

Regardless they were one step ahead of me on this one so it turns out "Pico accounts technically aren't social media accounts"!!!! Sick of this bollucks.
[deleted]
"here" as in? Gonna be interesti

In [28]:
""" Saving files! """

""" Create directory """
try:                                              # Create directory named after search terms
    os.makedirs("support/%s" % search_terms)
    print("Directory", search_terms, "created")
except FileExistsError:
    print("Directory", search_terms, "exists")

try:                                              # Create directory to store current search terms
    os.makedirs("support/_current_")
    print("Directory _current_ created")
except FileExistsError:
    print("Directory _current_ exists")

try:                                              # Create directory in search terms named after file
    os.makedirs("support/%s/%s" % (search_terms, file))
    print("Directory", file, "created")
except FileExistsError:
    print("Directory", file, "exists")

try:                                              # Create directory in _current_ named after file
    os.makedirs("support/_current_/%s" % file)
    print("Directory _current_/", file, "created")
except FileExistsError:
    print("Directory _current_/", file, "exists")

""" Save files for future use """
import pickle

f = open("support/%s/%s/comments.txt" % (search_terms, file), "w+", encoding="utf-8")
for i in range(len(comment_list)):
    f.write("<<<" + comment_list[i] + ">>>")
f.close()

pickle.dump(search_terms, open("support/%s/searchTerms.pkl" % search_terms, "wb"))
pickle.dump(comment_list, open("support/%s/%s/comment_list.pkl" % (search_terms, file), "wb"))
pickle.dump(post_title, open("support/%s/%s/post_title.pkl" % (search_terms, file), "wb"))
pickle.dump(post_page, open("support/%s/%s/post_page.pkl" % (search_terms, file), "wb"))
pickle.dump(post_id, open("support/%s/%s/post_id.pkl" % (search_terms, file), "wb"))

""" Save files for next step """
import shutil

source = "support/%s/%s/comments.txt" % (search_terms, file)
destination = "support/_current_/%s/comments.txt" % file
shutil.copyfile(source, destination)

pickle.dump(search_terms, open("support/_current_/searchTerms.pkl", "wb"))

Directory PICO 4 All-in-One VR Headset exists
Directory _current_ exists
Directory reddit created
Directory _current_/ reddit created


### Category 3: Websites

In [29]:
""" Scrape Data from Website """
import requests
from bs4 import BeautifulSoup
import re

def extract_text_from_web(url, container_tag, container_class):
    response = requests.get(url)

    if response.status_code == 200:                                     # Check if the request was successful
        soup = BeautifulSoup(response.content, 'html.parser')           # Parse the HTML content

        title_tag = soup.find("title")                                  # Get the title of the webpage if it exists
        if title_tag:
            title = title_tag.get_text()
            print("Title:", title)
        else:
            print("No title found")

        extract = ""
        text = soup.find_all(container_tag, class_=container_class)     # Get text from the container containing the product description
        for i in range(len(text)):
            # Simple Cleaning of Text
            text[i] = re.sub(r'\<script.*?\<\/script\>', '', str(text[i]), flags=re.DOTALL)     # Remove Javascript stuff
            text[i] = re.sub(r'\<.*?\>', ' ', str(text[i]))
            text[i] = text[i].replace('\n', ' ')
            text[i] = text[i].replace("   ", ' ')
            text[i] = text[i].replace("  ", ' ')
            text[i] = re.sub(r'\s+', ' ', text[i].strip())
        extract += " ".join(text)

        return extract

    else:
        print("Failed to retrieve webpage. Status code:", response.status_code)

##### [3] Tech Magazine (PCGamer) Review

In [30]:
""" Initalize """
file = "pc_gamer"

url = url_3
container_tag = container_tag_3
container_class = container_class_3

In [31]:
""" Scrape Data from Website """
scrape_data = extract_text_from_web(url_3, container_tag_3, container_class_3)

print(scrape_data)

Title: Pico 4 VR headset review | PC Gamer


In [32]:
""" Saving files! """

""" Create directory """
try:                                              # Create directory named after search terms
    os.makedirs("support/%s" % search_terms)
    print("Directory", search_terms, "created")
except FileExistsError:
    print("Directory", search_terms, "exists")

try:                                              # Create directory to store current search terms
    os.makedirs("support/_current_")
    print("Directory _current_ created")
except FileExistsError:
    print("Directory _current_ exists")

try:                                              # Create directory in search terms named after file
    os.makedirs("support/%s/%s" % (search_terms, file))
    print("Directory", file, "created")
except FileExistsError:
    print("Directory", file, "exists")

try:                                              # Create directory in _current_ named after file
    os.makedirs("support/_current_/%s" % file)
    print("Directory _current_/", file, "created")
except FileExistsError:
    print("Directory _current_/", file, "exists")

""" Save files for future use """
import pickle

f = open("support/%s/%s/%s.txt" % (search_terms, file, file), "w+", encoding="utf-8")
f.write("<<<" + scrape_data + ">>>")
f.close()

pickle.dump(search_terms, open("support/%s/searchTerms.pkl" % search_terms, "wb"))
pickle.dump(scrape_data, open("support/%s/%s/%s.pkl" % (search_terms, file, file), "wb"))

""" Save files for next step """
import shutil

source = "support/%s/%s/%s.txt" % (search_terms, file, file)
destination = "support/_current_/%s/%s.txt" % (file, file)
shutil.copyfile(source, destination)

pickle.dump(search_terms, open("support/_current_/searchTerms.pkl", "wb"))

Directory PICO 4 All-in-One VR Headset exists
Directory _current_ exists
Directory pc_gamer created
Directory _current_/ pc_gamer created


##### [4] Official Product Website

In [33]:
""" Initalize """
file = "product_desc"

url = url_4
container_tag = container_tag_4
container_class = container_class_4

In [34]:
""" Scrape Data from Website """
scrape_data = extract_text_from_web(url_4, container_tag_4, container_class_4)

print(scrape_data)

Title: Live the Game with PICO 4 All-in-One VR Headset | PICO Singapore
Imagination, the Only Limitation Small but Mighty Balanced design, easy to wear The balanced design means that the weight of the PICO 4 is evenly distributed to the front and the rear. The centre of gravity fits snugly up against the face. The rear has a cushioned support. The result is a superbly comfortable fit. The front does not sway to and fro and the rear does not slip downwards. It is highly stable. The weight of the front end has been reduced by 26.2% and its thickness reduced by 38.8%. At the rear there is a high capacity 5300mAh battery. No matter how long you wear it, you’ll be able to play non-stop. Super Light Super Clear The Proprietary Pancake optical lens allows for a wider, clearer view The Pancake optical lens has made the PICO 4 lighter and, because it refracts and reflects light between lenses, means the PICO 4 has both a wider field of view and clearer images. 4K+ Super-Vision Display The PICO 

In [35]:
""" Saving files! """

""" Create directory """
try:                                              # Create directory named after search terms
    os.makedirs("support/%s" % search_terms)
    print("Directory", search_terms, "created")
except FileExistsError:
    print("Directory", search_terms, "exists")

try:                                              # Create directory to store current search terms
    os.makedirs("support/_current_")
    print("Directory _current_ created")
except FileExistsError:
    print("Directory _current_ exists")

try:                                              # Create directory in search terms named after file
    os.makedirs("support/%s/%s" % (search_terms, file))
    print("Directory", file, "created")
except FileExistsError:
    print("Directory", file, "exists")

try:                                              # Create directory in _current_ named after file
    os.makedirs("support/_current_/%s" % file)
    print("Directory _current_/", file, "created")
except FileExistsError:
    print("Directory _current_/", file, "exists")

""" Save files for future use """
import pickle

f = open("support/%s/%s/%s.txt" % (search_terms, file, file), "w+", encoding="utf-8")
f.write("<<<" + scrape_data + ">>>")
f.close()

pickle.dump(search_terms, open("support/%s/searchTerms.pkl" % search_terms, "wb"))
pickle.dump(scrape_data, open("support/%s/%s/%s.pkl" % (search_terms, file, file), "wb"))

""" Save files for next step """
import shutil

source = "support/%s/%s/%s.txt" % (search_terms, file, file)
destination = "support/_current_/%s/%s.txt" % (file, file)
shutil.copyfile(source, destination)

pickle.dump(search_terms, open("support/_current_/searchTerms.pkl", "wb"))

Directory PICO 4 All-in-One VR Headset exists
Directory _current_ exists
Directory product_desc exists
Directory _current_/ product_desc exists


### Category 4: PDF Files

In [36]:
""" Scrape Data from PDF """
import PyPDF2
import requests
import os

def extract_text_from_pdf(url):
    response = requests.get(url)
    with open('file.pdf', 'wb') as f:                           # Download the PDF file
        f.write(response.content)

    with open('file.pdf', 'rb') as f:                           # Open the PDF file
        reader = PyPDF2.PdfReader(f)
        
        text = ''
        for page_number in range(2, 11):
            page = reader.pages[page_number]
            text += "".join(page.extract_text())                # Extract text from each page
    
    text = text.replace('\n', ' ')
    text = text.replace('EN', ' ')
    text = text.replace("   ", ' ')                 
    text = text.replace("  ", ' ')                  
    
    return text

##### [5] Official User Manual

In [37]:
""" Initalize """
file = "user_manual"

In [38]:
""" Scrape Data from PDF """
scrape_data = extract_text_from_pdf(url_5)
os.remove('file.pdf')

print(scrape_data)

02 propriate IPD may increase the risk of discomfort. • This product has an “Eye Protection Mode”, certified by TÜV Rheinland (Germany), which can protect your eyes by reducing blue light in the three color channels using software algorithms. The screen ap - pears yellowish in this mode and you can turn this feature on/off in "Settings"►"Display"►"Color"►“ - Eye Protection”. • Protect optical lenses during use and storage to prevent damage, such as scratches or exposure to strong light or direct sunlight.In The Box: VR Headset / 2 Controllers / 4 1.5V AA Alkaline Batteries / Glasses Spacer / Nose Pad / 2 Controller Lan - yards / USB-C Power Adapter / USB-C to C 2.0 Data Cable / Quick Guide / User Guide / Safety and Warranty Guide Important Health & Safety Notes • This product is designed and intended to be used in an open and safe indoor area, free of any tripping or slipping hazards. To avoid accidents, remain conscious to the potential confines of your physical area and respect the b

In [39]:
""" Saving files! """

""" Create directory """
try:                                              # Create directory named after search terms
    os.makedirs("support/%s" % search_terms)
    print("Directory", search_terms, "created")
except FileExistsError:
    print("Directory", search_terms, "exists")

try:                                              # Create directory to store current search terms
    os.makedirs("support/_current_")
    print("Directory _current_ created")
except FileExistsError:
    print("Directory _current_ exists")

try:                                              # Create directory in search terms named after file
    os.makedirs("support/%s/%s" % (search_terms, file))
    print("Directory", file, "created")
except FileExistsError:
    print("Directory", file, "exists")

try:                                              # Create directory in _current_ named after file
    os.makedirs("support/_current_/%s" % file)
    print("Directory _current_/", file, "created")
except FileExistsError:
    print("Directory _current_/", file, "exists")

""" Save files for future use """
import pickle

f = open("support/%s/%s/%s.txt" % (search_terms, file, file), "w+", encoding="utf-8")
f.write("<<<" + scrape_data + ">>>")
f.close()

pickle.dump(search_terms, open("support/%s/searchTerms.pkl" % search_terms, "wb"))
pickle.dump(scrape_data, open("support/%s/%s/%s.pkl" % (search_terms, file, file), "wb"))

""" Save files for next step """
import shutil

source = "support/%s/%s/%s.txt" % (search_terms, file, file)
destination = "support/_current_/%s/%s.txt" % (file, file)
shutil.copyfile(source, destination)

pickle.dump(search_terms, open("support/_current_/searchTerms.pkl", "wb"))

Directory PICO 4 All-in-One VR Headset exists
Directory _current_ exists
Directory user_manual exists
Directory _current_/ user_manual exists


## 2.2: Further Preprocessing for Comments

In [11]:
""" Initialise and Establish Dataset """
import pandas as pd

search_terms = pd.read_pickle("support/_current_/searchTerms.pkl")
youtube_comment_list = pd.read_pickle("support/%s/youtube/comment_list.pkl" % search_terms)
reddit_comment_list = pd.read_pickle("support/%s/reddit/comment_list.pkl" % search_terms)

print("Search terms:", search_terms)
print("Social Media results:")
print("Number of Youtube comments:", len(youtube_comment_list))
print("Number of Reddit comments:", len(reddit_comment_list))

In [6]:
""" Translate to English """
from deep_translator import GoogleTranslator

def translate_to_en(comment_list):
    for i in range(len(comment_list)):        # translate all
        try:
            comment_list[i] = GoogleTranslator(source='auto', target='en').translate(str(comment_list[i]))
        except:
            comment_list[i] = ''              # Omitted as exceeded 5000 characters.

print("Translating Youtube comments...")
translate_to_en(youtube_comment_list)
print(youtube_comment_list)
print()
print("Translating Reddit comments...")
translate_to_en(reddit_comment_list)
print(reddit_comment_list)

Translating Youtube comments...

Translating Reddit comments...


In [13]:
""" Clean Text """
import re

def clean_text(text):                               # user defined function for cleaning text
    if text is None:
        return ("")
    else:
      text = text.lower()                             # all lower case
      text = re.sub(r'\[.*?\]', ' ', text)            # remove text within [ ] (' ' instead of '')
      text = re.sub(r'\<.*?\>', ' ', text)            # remove text within < > (' ' instead of '')
      text = re.sub(r'http\S+', ' ', text)            # remove website ref http
      text = re.sub(r'www\S+', ' ', text)             # remove website ref www

      text = text.replace('€', 'euros')               # replace special character with words
      text = text.replace('£', 'gbp')                 # replace special character with words
      text = text.replace('$', 'dollar')              # replace special character with words
      text = text.replace('%', 'percent')             # replace special character with words
      text = text.replace('\n', ' ')                  # remove \n in text that has it

      text = text.replace('\'', '’')                  # standardise apostrophe
      text = text.replace('&#39;', '’')               # standardise apostrophe

      text = text.replace('’d', ' would')             # remove ’ (for would, should? could? had + PP?)
      text = text.replace('’s', ' is')                # remove ’ (for is, John's + N?)
      text = text.replace('’re', ' are')              # remove ’ (for are)
      text = text.replace('’ll', ' will')             # remove ’ (for will)
      text = text.replace('’ve', ' have')             # remove ’ (for have)
      text = text.replace('’m', ' am')                # remove ’ (for am)
      text = text.replace('can’t', 'can not')         # remove ’ (for can't)
      text = text.replace('won’t', 'will not')        # remove ’ (for won't)
      text = text.replace('n’t', ' not')              # remove ’ (for don't, doesn't)

      text = text.replace('’', ' ')                   # remove apostrophe (in general)
      text = text.replace('&quot;', ' ')              # remove quotation sign (in general)

      text = text.replace('cant', 'can not')          # typo 'can't' (note that cant is a proper word)
      text = text.replace('dont', 'do not')           # typo 'don't'

      text = re.sub(r'[^a-zA-Z0-9]', r' ', text)      # only alphanumeric left
      text = text.replace("   ", ' ')                 # remove triple empty space
      text = text.replace("  ", ' ')                  # remove double empty space
      return text

def clean(text_list):
    cleaned_list = []
    for t in text_list:
        cleaned_list.append(clean_text(t))
    return cleaned_list

print("Cleaning Youtube comments...")
youtube_comment_list = clean(youtube_comment_list)
print(youtube_comment_list)
print()
print("Cleaning Reddit comments...")
reddit_comment_list = clean(reddit_comment_list)
print(reddit_comment_list)

Cleaning Youtube comments...
['you need to be online with it that is one way of them accessing information like face scans to be sent back to china for evaluation i have lack of trust when electronics ask to be on line amp ask for more personal details before you can use it ', 'they are locking up these devices to their stores games controllers vr should be open platform so you can buy controllers including specialty gun like for any device you should also be able to play steam vr games regardless of your vr headset if they do not standardize these things vr is not going to be popular at leas nowhere near as popular as regular consoles ', 'the middle fingers that is the real test', 'did not apple copy these guys ', 'who also checked his steam inbox 7 19 ', 'hi this head set have some sort of lag in games in wireless mode i buyed reverb g2 and sell it a few months ago because the lenses become blurry after minutes of use ', 'highly reluctant to give facebook my hard earned cash made my 

In [19]:
""" Saving clean files! """
import pickle
import shutil

""" Youtube """
print("Saving clean Youtube comments...")
file = "youtube"

f = open("support/%s/%s/comments.txt" % (search_terms, file), "w+", encoding="utf-8")
for i in range(len(youtube_comment_list)):
    f.write("<<<" + youtube_comment_list[i] + ">>>")
f.close()

pickle.dump(search_terms, open("support/%s/searchTerms.pkl" % search_terms, "wb"))
pickle.dump(youtube_comment_list, open("support/%s/%s/comment_list.pkl" % (search_terms, file), "wb"))

""" Save files for next step """
import shutil

source = "support/%s/%s/comments.txt" % (search_terms, file)
destination = "support/_current_/%s/comments.txt" % file
shutil.copyfile(source, destination)

pickle.dump(search_terms, open("support/_current_/searchTerms.pkl", "wb"))

""" Reddit """
print("Saving clean Reddit comments...")
file = "reddit"

f = open("support/%s/%s/comments.txt" % (search_terms, file), "w+", encoding="utf-8")
for i in range(len(reddit_comment_list)):
    f.write("<<<" + reddit_comment_list[i] + ">>>")
f.close()

pickle.dump(search_terms, open("support/%s/searchTerms.pkl" % search_terms, "wb"))
pickle.dump(reddit_comment_list, open("support/%s/%s/comment_list.pkl" % (search_terms, file), "wb"))

""" Save files for next step """
import shutil

source = "support/%s/%s/comments.txt" % (search_terms, file)
destination = "support/_current_/%s/comments.txt" % file
shutil.copyfile(source, destination)

pickle.dump(search_terms, open("support/_current_/searchTerms.pkl", "wb"))
