# Group 6 Assignment

* Members: ChiaYu Lin, Daumantas Patapas, Marcel Stelte, Michał Butkiewicz
* Institution: Vrije Universiteit Amsterdam

This notebook presents and discusses the code used to retrieve data from the social media platforms Twitter and Sina Weibo.

## Sina Weibo Crawler

This section describes the crawler for the Sina Weibo social media platform.

The platform is used predominantly in China, so the searched keywords are in Chinese as well. With only a Dutch phone number it is impossible to create a new account on Sina Weibo, which rules out the API provided by Sina Weibo itself. In addition, anyone who accesses the website without an account is automatically redirected to the front page. The only exception is the mobile version of the website: https://m.weibo.cn. This version has no restrictions regarding accounts or geolocation and can therefore serve as the basis for the crawler.

During research an existing crawler was found: https://github.com/KaidiGuo/keyword_based_Sina_weibo_crawler. That crawler, however, had not been maintained for several years and was no longer functional. It still gave insight into how to construct a working crawler for Sina Weibo using the mobile website. More specifically, it accesses the API that serves as the data model for the mobile version of the platform: https://m.weibo.cn/api/.

The crawler relies on several utility functions, which are explained first.

To make the crawled data reusable later, the results of the crawl are persisted in a file. This requires checking whether a suitable directory already exists or has to be created. The function below checks whether a folder called "data" exists in the location in which this notebook is executed, and creates it if it does not exist yet.

In [None]:
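import os

def init_directories():
    # __file__ is not defined inside a notebook, so the current working
    # directory (where the notebook is executed) is used instead.
    db_path = os.path.join(os.getcwd(), "data")

    # Create the directory only if it does not exist yet.
    os.makedirs(db_path, exist_ok=True)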
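
To illustrate how crawl results could be persisted in such a directory, the following is a minimal sketch. It assumes the results are stored as JSON; the function names `save_cards` and `load_cards`, the file naming and the JSON format are illustrative assumptions and not necessarily what the crawler uses.

```python
import json
from pathlib import Path
from typing import Any


def save_cards(cards: list[dict[str, Any]], directory: Path, name: str) -> Path:
    """Write the cards to <directory>/<name>.json and return the file path."""
    # Hypothetical helper: creates the directory if it is missing.
    directory.mkdir(parents=True, exist_ok=True)
    file_path = directory / f"{name}.json"
    # ensure_ascii=False keeps Chinese characters readable in the file.
    with file_path.open("w", encoding="utf-8") as file:
        json.dump(cards, file, ensure_ascii=False, indent=2)
    return file_path


def load_cards(file_path: Path) -> list[dict[str, Any]]:
    """Read back a list of cards that was written by save_cards."""
    with file_path.open("r", encoding="utf-8") as file:
        return json.load(file)
```

Writing the file as UTF-8 with `ensure_ascii=False` means the Chinese post text can be inspected directly in a text editor.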

Next, since the keywords are appended as part of a URL string to the URL of the mobile API, certain characters need to be percent-encoded. In this project this was only necessary for the "#" symbol, but the following function can easily be extended with further symbols:

In [None]:
def format_keyword_for_url(keyword: str) -> str:
    return keyword.replace("#", "%23")
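
If more symbols need to be escaped, the standard library function `urllib.parse.quote` could be used instead of extending the replacement by hand. The sketch below is an alternative and not the function used by the crawler; the name `format_keyword_with_quote` is hypothetical.

```python
from urllib.parse import quote


def format_keyword_with_quote(keyword: str) -> str:
    # safe="" makes quote() escape every reserved character, including "#" and "/".
    # Non-ASCII characters are encoded as percent-escaped UTF-8 bytes.
    return quote(keyword, safe="")


print(format_keyword_with_quote("#EDG#"))  # prints %23EDG%23
```

Unlike the manual replacement, this also percent-encodes the Chinese characters of a keyword.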

The Sina Weibo mobile API returns posts in the form of so-called cards. They are, however, not returned as a flat list but as a tree with any number of branches on each node. To make them easier to process later on, a function is provided that puts the cards into a list. Many cards contain no actual posts but, for example, suggestions for users that might be interesting to the user or advertisements. These are not relevant for this research and are therefore filtered out of the list.

The relevant fields of a card are:
* mblog: Contains a single post that can be added to the list
* card_group: Contains a list of cards that need to be analysed recursively
* left_element and right_element: Some cards are split horizontally and contain two "mblog" cards that can be added to the list

The following URL gives an example of how the posts are formatted in the API: https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%23%E8%8B%B1%E9%9B%84%E8%81%94%E7%9B%9F%23%23EDG%23. Cards 0 and 1 each contain a single "mblog", which can be added to the list. Card 2 contains a "card_group", which in turn contains one card with a "mblog" and one card with neither a group nor a post. Therefore only the first child of card 2 should be added to the list.

There is also a mechanism that stores the ID of each added card in a set. For each new card it is checked whether its ID is already in the set. This prevents duplicate entries in the list of cards, which could distort the results.

In [None]:
import json
from typing import Any

def unpack_nested_cards(retrieved_cards: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Flatten the card tree into a list of unique cards that contain a post."""

    cards = []
    card_ids = set()

    def add_card(card: dict[str, Any]) -> None:
        # Only keep cards with a post ("mblog") that has not been added before.
        mblog = card.get("mblog")
        if mblog and mblog["id"] not in card_ids:
            cards.append(card)
            card_ids.add(mblog["id"])

    for retrieved_card in retrieved_cards:

        # A card that directly contains a post.
        add_card(retrieved_card)

        # A card group contains further cards, which are unpacked recursively.
        for card in unpack_nested_cards(retrieved_card.get("card_group") or []):
            add_card(card)

        # Horizontally split cards contain one post on each side.
        for side in ("left_element", "right_element"):
            if retrieved_card.get(side):
                add_card(retrieved_card[side])

    return cards
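
To make the traversal and the duplicate check easier to follow, the sketch below separates the two steps: a generator walks the tree, and a second function filters duplicates by post ID. It is an alternative formulation for illustration only. The names `iter_post_cards` and `unique_post_cards` are hypothetical, and the example data is a heavily simplified stand-in for a real API response.

```python
from typing import Any, Iterator


def iter_post_cards(cards: list[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Yield every card in the tree that carries a post ("mblog")."""
    for card in cards:
        if card.get("mblog"):
            yield card
        # Descend into grouped cards.
        yield from iter_post_cards(card.get("card_group") or [])
        # Descend into horizontally split cards.
        for side in ("left_element", "right_element"):
            if card.get(side):
                yield from iter_post_cards([card[side]])


def unique_post_cards(cards: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collect the post cards in order, skipping IDs that were already seen."""
    seen_ids = set()
    unique = []
    for card in iter_post_cards(cards):
        post_id = card["mblog"]["id"]
        if post_id not in seen_ids:
            seen_ids.add(post_id)
            unique.append(card)
    return unique


# Hypothetical, simplified cards: two plain posts, one group, one split card.
example_cards = [
    {"mblog": {"id": "1"}},
    {"mblog": {"id": "2"}},
    {"card_group": [{"mblog": {"id": "3"}}, {"desc": "no post here"}]},
    {"left_element": {"mblog": {"id": "4"}}, "right_element": {"mblog": {"id": "1"}}},
]
print([card["mblog"]["id"] for card in unique_post_cards(example_cards)])
# prints ['1', '2', '3', '4']
```

The post with ID "1" appears twice in the example tree but only once in the result, and the card without a post is dropped.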

The next utility functions focus on parsing datetimes. Since the retrieved data has to be limited to a specific time frame, the creation date and time returned by the mobile API need to be parsed into a datetime object. This object can then be compared to previously specified datetime objects to evaluate whether the current card lies within the allowed time frame.

The previous example showed that the API returns the datetime in the following format: "Sun Nov 07 02:17:19 +0800 2021". Since the month is not returned as a number, a function that parses the month into a numerical value is needed first:

In [None]:
MONTH_ABBREVIATIONS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                       "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def parse_month_to_int(month: str) -> int:
    # Map the English three-letter abbreviation to its month number (Jan = 1).
    if month not in MONTH_ABBREVIATIONS:
        raise ValueError(f"Unknown month abbreviation: {month!r}")
    return MONTH_ABBREVIATIONS.index(month) + 1

Afterwards the rest of the retrieved datetime string needs to be processed:

In [None]:
import datetime

def parse_creation_time_of_card(creation_time: str) -> datetime.datetime:
    # Expected format: "Sun Nov 07 02:17:19 +0800 2021"
    (_weekday, month, day, time, timezone, year) = creation_time.split(" ")
    (hour, minute, second) = time.split(":")

    # The UTC offset has the form "+HHMM" or "-HHMM".
    if timezone[0] not in "+-":
        raise ValueError(f"Unexpected UTC offset: {timezone!r}")
    sign = -1 if timezone[0] == "-" else 1
    offset = sign * datetime.timedelta(hours=int(timezone[1:3]), minutes=int(timezone[3:5]))
    actual_timezone = datetime.timezone(offset)

    actual_creation_time = datetime.datetime(
        int(year), parse_month_to_int(month), int(day),
        int(hour), int(minute), int(second), 0, actual_timezone,
    )

    return actual_creation_time
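
As an alternative to splitting the string manually, the whole timestamp can likely be parsed in one step with `datetime.datetime.strptime`. The sketch below also shows the time-frame comparison described above. The names `WEIBO_TIME_FORMAT`, `parse_creation_time` and `is_within_time_frame`, as well as the chosen start and end dates, are assumptions for illustration. Note that `%a` and `%b` depend on the locale and expect English names in the default "C" locale.

```python
import datetime

# Format of "Sun Nov 07 02:17:19 +0800 2021" (hypothetical constant name).
WEIBO_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_creation_time(creation_time: str) -> datetime.datetime:
    # %z parses UTC offsets such as "+0800" into a timezone-aware datetime.
    return datetime.datetime.strptime(creation_time, WEIBO_TIME_FORMAT)


def is_within_time_frame(
    created_at: datetime.datetime,
    start: datetime.datetime,
    end: datetime.datetime,
) -> bool:
    # All three values must be timezone-aware to be comparable.
    return start <= created_at <= end


created_at = parse_creation_time("Sun Nov 07 02:17:19 +0800 2021")
start = datetime.datetime(2021, 11, 1, tzinfo=datetime.timezone.utc)
end = datetime.datetime(2021, 11, 30, tzinfo=datetime.timezone.utc)

print(created_at.isoformat())                        # prints 2021-11-07T02:17:19+08:00
print(is_within_time_frame(created_at, start, end))  # prints True
```

Because the parsed datetime carries its UTC offset, it can be compared directly with boundaries given in another time zone.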