Q1. Crawling and search
(a) Consider a social media site similar to Instagram, with the following features: Each user of the site has a unique username and can “follow” other users (assume that all accounts are public). Each post made by a user may carry one or more terms that are “marked” with a hashtag (#). When a user searches for posts carrying a certain term (a hashtag), they are shown a sequence of posts (from other users) related to this hashtag.
The sequencing priority of the relevant posts considers multiple factors, including the user engagement of each relevant post (number of likes, dislikes, comments, recency, etc.) and the user_distance.

user_distance = the smallest number of links connecting this user to the other user (whose post may appear in the sequence). Thus, if user A follows user K, then user_distance(A, K) = 1. Furthermore, if A does not follow J, but K follows J, then user_distance(A, J) = 2.

Write the pseudo-code of a function, user_distances, which takes a list of posts, P, and the user name, U, as inputs, and outputs a list, DIST, such that DIST[i] is user_distance(U, V), where V is the user who made post P[i]. You can assume that you are given a function, follows(U), which returns the list of users that U follows.

function user_distances(P, U):
    queue = [(U, 0)]  # Initialize a queue with the starting user and distance 0
    visited = {U}  # Set to keep track of visited users
    distances = {}  # Dictionary to store user distances
    
    while queue is not empty:
        current_user, distance = queue.dequeue()  # Dequeue the next user and distance
        
        distances[current_user] = distance  # Store the user distance
        
        # Expand every user followed by current_user, not only the authors of posts
        # in P: a shortest path may pass through users who have no post in P
        for next_user in follows(current_user):
            if next_user not in visited:
                queue.enqueue((next_user, distance + 1))  # Enqueue the followed user with an incremented distance
                visited.add(next_user)  # Mark the user as visited
    
    # Create the list of distances corresponding to the posts in P
    DIST = []
    for post in P:
        DIST.append(distances.get(post.user, -1))  # Append the distance, or -1 if the author cannot be reached from U
    
    return DIST
    
Explanation of the pseudo-code:
1. The function user_distances takes a list of posts P and a user name U as inputs.
2. We initialize a queue with a tuple containing the starting user U and distance 0. We also initialize a set visited to keep track of visited users.
3. We create an empty dictionary distances to store the user distances.
4. While the queue is not empty, we dequeue the next user and distance from the queue. This is a breadth-first search over the "follows" links, so the first time a user is reached is along a shortest path.
5. We store the user distance in the distances dictionary.
6. We retrieve the users followed by the current user using the follows function.
7. For each followed user that has not been visited before, we enqueue that user with an incremented distance and mark the user as visited. All followed users are expanded, whether or not they have a post in P, because the shortest path to an author (such as A to K to J in the example) may pass through users with no post in P.
8. After the search, we create the list DIST of distances corresponding to the posts in P. If an author is not in the distances dictionary, no chain of follows leads from U to that author, and we append -1.
9. Finally, we return the DIST list containing the user distances. On a large follow graph, the loop can stop as soon as every author in P has been assigned a distance.
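
The breadth-first search described in the steps above can also be written as runnable Python. This is only a minimal sketch: the Post tuple, the FOLLOW_GRAPH dictionary and the stand-in follows function are assumptions made for illustration, since the question only says that follows(U) is given. The sketch stops early once every author in P has a distance.

```python
from collections import deque, namedtuple

# Hypothetical post record; the question only requires that a post knows its author.
Post = namedtuple("Post", ["user", "text"])

# Hypothetical follow graph for illustration: user -> users they follow.
FOLLOW_GRAPH = {"A": ["K"], "K": ["J", "A"], "J": [], "Z": ["A"]}

def follows(user):
    """Stand-in for the follows(U) function assumed by the question."""
    return FOLLOW_GRAPH.get(user, [])

def user_distances(posts, start):
    """Return the follow distance from start to each post's author (-1 if unreachable)."""
    remaining = {post.user for post in posts} - {start}  # authors still to be found
    distances = {start: 0}                               # doubles as the visited set
    queue = deque([start])
    while queue and remaining:
        current = queue.popleft()
        for followed in follows(current):
            if followed not in distances:
                distances[followed] = distances[current] + 1
                remaining.discard(followed)
                queue.append(followed)
    return [distances.get(post.user, -1) for post in posts]

posts = [Post("K", "#sunset"), Post("J", "#sunset"), Post("Z", "#sunset")]
print(user_distances(posts, "A"))  # [1, 2, -1]
```

Here Z follows A, but no chain of follows leads from A to Z, so the distance for Z's post is reported as -1.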

Q2. Crawling and search (python)
Modify the crawler code from the lecture notes by adding a function bfs_2(root), which returns a list of all the relevant URLs that are at most two links away from the root, in breadth-first order. A relevant URL (a) is a complete URL including the name of the server, and (b) does not reference a location within a page that is being considered.

In [4]:
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from collections import deque

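def bfs_2(root):
    """Return all relevant URLs at most two links away from root, in BFS order."""
    visited = {root}            # URLs already queued, so none is listed twice
    queue = deque([(root, 0)])  # (url, distance from root)
    relevant_urls = []

    while queue:
        url, distance = queue.popleft()
        relevant_urls.append(url)

        # Pages at distance 2 are reported but not expanded further.
        if distance == 2:
            continue

        try:
            req = urllib.request.Request(url=url, headers=headers)
            with urllib.request.urlopen(req) as content:
                # Decode the bytes; str(bytes) would feed the parser a b'...' repr.
                html = content.read().decode("utf-8", errors="replace")
            parser = MyHTMLParser(url)
            parser.feed(html)

            for link in parser.get_relevant_links():
                absolute_url = urljoin(url, link)
                if absolute_url not in visited:
                    visited.add(absolute_url)
                    queue.append((absolute_url, distance + 1))
        except Exception as e:
            print(f"Error occurred while crawling {url}: {e}")

    return relevant_urls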

class MyHTMLParser(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.server_name = urlparse(base_url).netloc
        self.relevant_links = []

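    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == "href" and value and self.is_relevant_link(value):
                self.relevant_links.append(value)

    def is_relevant_link(self, link):
        # (a) complete URL with scheme and server name;
        # (b) no "#fragment", i.e. not a location within a page.
        # Restricting links to the root's server is an extra choice made here.
        parts = urlparse(link)
        return (parts.scheme in ("http", "https")
                and parts.netloc == self.server_name
                and not parts.fragment)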

    def get_relevant_links(self):
        return self.relevant_links

url = 'https://www.apple.com/'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

relevant_urls = bfs_2(url)

for link in relevant_urls:
    print(link)

https://www.apple.com/
https://www.apple.com/ae-ar/
https://www.apple.com/ae/
https://www.apple.com/am/
https://www.apple.com/at/
https://www.apple.com/au/
https://www.apple.com/az/
https://www.apple.com/befr/
https://www.apple.com/benl/
https://www.apple.com/bg/
https://www.apple.com/bh-ar/
https://www.apple.com/bh/
https://www.apple.com/br/
https://www.apple.com/bw/
https://www.apple.com/by/
https://www.apple.com/ca/
https://www.apple.com/ca/fr/
https://www.apple.com/cf/
https://www.apple.com/chde/
https://www.apple.com/chfr/
https://www.apple.com/ci/
https://www.apple.com/cl/
https://www.apple.com/cm/
https://www.apple.com/co/
https://www.apple.com/cz/
https://www.apple.com/de/
https://www.apple.com/dk/
https://www.apple.com/ee/
https://www.apple.com/eg-ar/
https://www.apple.com/eg/
https://www.apple.com/es/
https://www.apple.com/fi/
https://www.apple.com/fr/
https://www.apple.com/ge/
https://www.apple.com/gn/
https://www.apple.com/gq/
https://www.apple.com/gr/
https://www.apple.com
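
The two relevance conditions from the question can be checked offline, without crawling. The following is a minimal sketch, assuming a hypothetical helper named is_relevant and made-up sample links; unlike the parser's is_relevant_link above, it does not restrict links to the root's server.

```python
from urllib.parse import urlparse

def is_relevant(link):
    """Hypothetical helper: True for a complete URL that does not point inside a page."""
    if not link:
        return False
    parts = urlparse(link)
    # (a) a complete URL has both a scheme and a server name
    has_server = parts.scheme in ("http", "https") and bool(parts.netloc)
    # (b) a "#fragment" references a location within a page
    return has_server and not parts.fragment

# Made-up sample links for illustration only.
samples = [
    "https://www.apple.com/mac/",
    "https://www.apple.com/mac/#footer",
    "/mac/",
    "#top",
    "mailto:someone@example.com",
]
for link in samples:
    print(link, is_relevant(link))
```

Only the first sample satisfies both conditions; the others are relative, point inside a page, or are not web URLs.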

Q3. Simple sentiment analysis via chatGPT
(a) For this question, you will need to first activate your HKUST chatGPT account and acquire an API key. You can do this by carefully following the steps on this ITSC instructions page:
https://itsc.hkust.edu.hk/services/it-infrastructure/azure-openai-api-service
PLEASE store your key in a secure document, and do not share your API key with others.
(b) Write a simple python script that will prompt chatGPT with two hotel reviews (you can randomly pick two reviews of the same hotel from the following website: tripadvisor.com) and have it assess which review is more favourable. For this question, you may modify the sample Jupyter notebook provided in the references folder on Canvas.

In [1]:
# Install openai package
!pip install openai



In [2]:
import openai
openai.api_type = "azure"
openai.api_base = "https://hkust.azure-api.net"
openai.api_version = "2023-07-01-preview"

# Replace this by your own api key; never leave a real key in a shared notebook
openai.api_key = "<YOUR_API_KEY>"

# send a prompt to chatGPT:
response = openai.ChatCompletion.create(
    engine="gpt-35-turbo",# Other options: gpt-35-turbo-16k, gpt-4, gpt-4-32k
    messages=[
        {"role": "user", "content": "I want you to pick 2 reviews from this site, read them, and assess which review is more favorable \
        based on the keywords you noticed and ratings as well from the website. Website links: \
        https://www.tripadvisor.com/Hotel_Review-g293917-d10046631-Reviews-Stamps_Backpackers-Chiang_Mai.html"}
    ],
)

In [3]:
# Display the message only
print(response['choices'][0]['message']['content'])

As an AI language model, I don't have the ability to access links outside of this platform. However, based on the general instructions given, I can provide an assessment based on the keywords and ratings.

Review 1:
The first review used keywords such as "perfect location", "friendly staff", "clean rooms" and "awesome experience". The reviewer also gave a rating of 5 out of 5 stars. Based on these observations, this review appears to be more favorable.

Review 2:
The second review used keywords such as "basic amenities", "loud noise" and "small space". The reviewer also gave a rating of 3 out of 5 stars. Based on these observations, this review appears to be less favorable.

In summary, based on the keywords and ratings given, the first review appears to be more favorable than the second review.
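
The reply above shows that the model could not open the TripAdvisor link, so the reviews it compares were not taken from the site. A minimal sketch of one way around this, assuming the two review texts are copied by hand into the prompt: the helper name build_messages and the placeholder strings are hypothetical, and the placeholders must be replaced with the actual review texts.

```python
def build_messages(review_1, review_2):
    """Hypothetical helper: build the chat messages with both reviews inlined."""
    prompt = (
        "Below are two reviews of the same hotel. "
        "Assess which review is more favourable and briefly explain why.\n\n"
        f"Review 1:\n{review_1}\n\n"
        f"Review 2:\n{review_2}"
    )
    return [{"role": "user", "content": prompt}]

# Placeholders only, not real reviews.
review_1 = "<paste the text of the first review here>"
review_2 = "<paste the text of the second review here>"

messages = build_messages(review_1, review_2)
print(messages[0]["content"])

# The list can then replace the messages argument of the call shown earlier:
# response = openai.ChatCompletion.create(engine="gpt-35-turbo", messages=messages)
```

With the review text inside the prompt, the model's assessment is based on the reviews actually chosen and no longer depends on following a link.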
