## Project proposal

For this course, we aim to complete a data analysis project about the game [Palworld](https://en.wikipedia.org/wiki/Palworld). To help you get started, here are a few things to consider and work on so that you end up with a clean dataset for later analysis.

To begin, please take some time to get familiar with the game. You don't need to play it, but you should at least know the basic terminology, such as what a Pal is. (If you do play it, please do not spend too much time on it.)

The two recommended databases are [https://palworld.gg/](https://palworld.gg/) and [https://paldb.cc/en/](https://paldb.cc/en/). You can use either one, both, or some other Palworld database.

# DS 3000 HW 5

Due: Sunday July 20th @ 11:59 PM EST

### Submission Instructions
Submit this `ipynb` file and a `PDF` file that includes the code output to Gradescope (this can also be done via the assignment on Canvas).  To ensure that your submitted files represent your latest code, make sure to run a fresh `Kernel > Restart & Run All` just before uploading the files to Gradescope.

**Note that this is a group assignment. Each group only needs to submit one copy. When you submit the work, please include everyone in your group.**

### Tips for success
- Start early
- Make use of Piazza
- Make use of office hours
- Remember to use cells and headings to make the notebook easy to read (if a grader cannot find the answer to a problem, you will receive no points for it)
- Under no circumstances may one student view or share their ungraded homework or quiz with another student [(see also)](http://www.northeastern.edu/osccr/academic-integrity), though you are welcome to **talk about** (not show each other) the problems.

## Part 1.1 (10 points)

Please list 2-3 questions you are interested in studying with the Palworld database. They can be about anything in the game, such as Pals, items, or constructions. Some potential question structures are:
- Are `A` and `B` related? How are they related?
- Which features may affect changes in `C`?
- If I need a higher `D`, which features tend to have lower/higher values?
- Based on `E` and `F`, which items/pals are similar?
- I need to predict the value of `G`. Which features do I need to consider?

## Answer
- How is capture rate related to price?
- How does a Pal's type affect its stats (HP, damage, etc.)?
- If I need higher HP, what region should the Pal come from?
- How does defense affect speed?

## Part 1.2 (20 points)

Based on the questions proposed in Part 1.1, which features may we need to include in the analysis? Check the websites: which one has that information? **You need to pick at least 8 features for analysis.** We recommend a mix of numerical (numbers etc.) and categorical (level etc.) features. Are there any other features that you think may be important but are hard to extract or find on the website (they can be something in or not in the game)?

## Answer
- name - https://palworld.gg/pals
- type - https://palworld.gg/pals
- region - https://palworld.gg/pals
- stats (hp, defense, price, etc..) - https://palworld.gg/pals
- capture rate at level 1 - https://palworld.gg/capture-rate
- rarity - https://palworld.gg/pals & https://palworld.gg/capture-rate

## Part 1.3 (20 points)

Suppose you have all the features you mentioned in Part 1.2. List 3-4 data visualizations you can make with those features. You do not need to make those visualizations here. Just describe the type of each visualization (histogram, scatter plot, etc.), which features are involved, and whether any hover data or color will be added, and **discuss how these data visualizations may relate to (or even answer) your questions in Part 1.1**.

## Answer

### scatter plot of capture rate at level 1 - pal price
- color pals based on region
- hover gives info such as:
  - Pal name
  - Type
  - Rarity

### bar chart for region - HP? (or some other stat)
- average HP based on region (rarity can play a factor as well)
- color pals by region biome
- hover gives
  - average HP
  - have a standard deviation for some variability
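
The bar heights and the variability described above come from a per-group mean and standard deviation. The following is a minimal sketch of that aggregation using only the standard library. Region has not been scraped yet, so the rarity tier stands in as the grouping variable (an assumption for illustration); the HP values are copied from the ten rows of scraped output shown in Part 1.4, and the helper name `summarize_by_group` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

# (name, group, HP) rows copied from the scraped output in Part 1.4.
# ASSUMPTION: the rarity tier stands in for region, which is not scraped yet.
rows = [
    ("Anubis", "Epic", 120),
    ("Arsox", "Common", 85),
    ("Astegon", "Epic", 100),
    ("Azurmane", "Rare", 130),
    ("Azurobe", "Rare", 110),
    ("Azurobe Cryst", "Epic", 115),
    ("Bastigor", "Epic", 140),
    ("Beakon", "Rare", 105),
    ("Beegarde", "Common", 80),
    ("Bellanoir", "Legendary", 120),
]


def summarize_by_group(rows):
    """Return {group: (mean HP, sample std dev or None)} (hypothetical helper)."""
    groups = defaultdict(list)
    for _name, group, hp in rows:
        groups[group].append(hp)

    summary = {}
    for group, values in groups.items():
        # stdev needs at least two values, so single-member groups get None
        spread = stdev(values) if len(values) > 1 else None
        summary[group] = (mean(values), spread)
    return summary


summary = summarize_by_group(rows)
for group, (avg, spread) in sorted(summary.items()):
    print(group, avg, spread)
```

Groups with a single Pal have no meaningful standard deviation, so the sketch reports `None` for them rather than drawing a misleading error bar.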

### scatter plot for defense - speed
- hover over to see pal name/type/rarity
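
A scatter plot of defense against speed is usually read together with a correlation coefficient. As a rough sketch, Pearson's r (one possible choice, assumed here) can be computed by hand as below. The Defense and Running Speed values are copied from the ten rows of scraped output in Part 1.4; the function name `pearson_r` is hypothetical, and ten rows are far too few to draw a conclusion from.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences (hypothetical helper)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # sums of squared deviations and of cross products
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Defense and Running Speed for the first ten Pals in the scraped output
defense = [100, 95, 125, 110, 100, 105, 120, 80, 90, 100]
running_speed = [800, 700, 700, 900, 600, 600, 750, 750, 450, 600]

r = pearson_r(defense, running_speed)
print(f"Pearson r between defense and running speed: {r:.3f}")
```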

## Part 1.4  (50 points)

Now, go ahead and try to scrape the features you need.

Please show all the code you have for web scraping. Your current output data frame should include at least 4 features. (You do not need to scrape all features at this moment, although it is recommended to start early. You can also choose not to use the ones you have scraped in the later analysis. No need to worry if you need to change anything later.) **Please design your code as a pipeline and clearly document each function.** See the Python Style Guide in Week 1 for proper documentation. It is also recommended to save the data you have scraped.

In [None]:
# Modules that may be used
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [None]:
BASE_URL = "https://palworld.gg/pals"
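
The listing page links to each Pal with a site-relative `href`, which the parsing code in this notebook turns into a full URL by prepending `https://palworld.gg`. As a minimal sketch, the standard library's `urllib.parse.urljoin` can do the same join and also leaves an already-absolute link untouched. The helper name `absolute_link` and the sample paths are illustrative assumptions, not part of the original pipeline.

```python
from urllib.parse import urljoin

BASE_URL = "https://palworld.gg/pals"


def absolute_link(href: str, base: str = BASE_URL) -> str:
    """Resolve an href from the listing page against the base URL (hypothetical helper)."""
    return urljoin(base, href)


# a site-relative href is resolved against the site root
print(absolute_link("/pal/anubis"))
# an href that is already absolute is returned unchanged
print(absolute_link("https://palworld.gg/pal/arsox"))
```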

In [None]:
def fetch_pal(url: str) -> BeautifulSoup:
    """
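    Fetch a page and parse it into a BeautifulSoup object.

    Args:
        url (str): The URL of the page to fetch (the Pal list or a single Pal's page)

    Returns:
        BeautifulSoup: the parsed content of the page
    """
    # request the page; the timeout keeps a stalled connection from hanging the run
    resp = requests.get(url, timeout=10)
    # raise an HTTPError if the server returns an error status (e.g. a bad URL)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, features="lxml")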

## Parse the Pal list page, then extract the stats from each Pal's page

In [None]:
def parse_pals_data(main_soup: BeautifulSoup) -> list[dict]:
    """
    Parses Pal data from palworld.gg HTML using BeautifulSoup,
    extracting Name and link to each Pal's detail page.

    Args:
        main_soup (BeautifulSoup): HTML content of the page.

    Returns:
        list[dict]: List of pals with 'Name' and 'Link'.
    """
    pals = []

    # Get the main section
    pals_section = main_soup.find('section', class_='pals-list')
    if not pals_section:
        print("No pals-list section found.")
        return []

    # Each Pal is a div with class "pal"
    pal_divs = pals_section.find_all('div', class_='pal')

    for pal_div in pal_divs:
        try:
            a_tag = pal_div.find('a', href=True)
            if not a_tag:
                continue

            detail_url = a_tag['href']
            full_url = "https://palworld.gg" + detail_url

            name_div = a_tag.find('div', class_='name')
            name = name_div.get_text(strip=True) if name_div else "Unknown"

            pals.append({"Name": name, "Link": full_url})

        except Exception as e:
            print(f"Error parsing pal card: {e}")
            continue

    return pals
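
The progress log printed later in this notebook shows list-page names such as `Anubis#100` and `Blue Slime#-1`, which suggests the `name` div's text also contains the Pal's number. A small sketch of how that text could be split, assuming the `Name#number` shape seen in the log always holds; the function name `split_name_number` is hypothetical.

```python
from typing import Optional, Tuple


def split_name_number(raw: str) -> Tuple[str, Optional[int]]:
    """Split 'Anubis#100' into ('Anubis', 100) (hypothetical helper).

    Falls back to (raw, None) when there is no '#' or the suffix is not an integer.
    """
    # rpartition splits on the LAST '#', so a '#' inside a name would be preserved
    name, sep, number = raw.rpartition("#")
    if not sep:
        return raw.strip(), None
    try:
        return name.strip(), int(number)
    except ValueError:
        return raw.strip(), None


# sample strings taken from the progress log in this notebook
for raw in ["Anubis#100", "Azurobe Cryst#82", "Blue Slime#-1", "Unknown"]:
    print(split_name_number(raw))
```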


In [None]:
def strip_pal_stats(soup: BeautifulSoup) -> dict:
    """
    Extracts individual stats from a Pal's page.

    Args:
        soup(BeautifulSoup): Parsed data from the Pal's page

    Returns:
        dict: Dictionary of the stats
    """

    # Retrieve the name of the Pal
    name_tag = soup.find('h1', class_='name')
    name = name_tag.text.strip() if name_tag else None

    # extract the rarity
    rarity_div = soup.find('div', class_='rarity')
    rarity = rarity_div.text.strip() if rarity_div else None

    # Element name
    elements = []
    elements_div = soup.find('div', class_='elements')
    if elements_div:
        elements_divs = elements_div.find_all('div', class_='element')
        for elem_div in elements_divs:
            # this should get the element name
            elem_name = elem_div.find('div', class_ = 'name')
            if elem_name:
                elements.append(elem_name.text.strip())

    # stats
    # Initialize stats dictionary with expected fields
    stats = {
        'name': name,
        'rarity': rarity,
        'hp': None,
        'defense': None,
        'crafting_speed': None,
        'melee_attack': None,
        'shot_attack': None,
        'price': None,
        'stamina': None,
        'support': None,
        'running_speed': None,
        'sprinting_speed': None,
        'slow_walk_speed': None
    }

    stats_div = soup.find('div', class_='stats')
    # reach the stats section
    if stats_div:
        items_divs = stats_div.find_all('div', class_='item')
        # go into each item
        for item in items_divs:
            stat_name_div = item.find('div', class_='name')
            stat_value_div = item.find('div', class_='value')

            # using the name and value of the stats add to the dictionary
            if stat_name_div and stat_value_div:
                stat_name = stat_name_div.text.strip().lower().replace(' ', '_')
                stat_value = stat_value_div.text.strip()
                try:
                    stat_value = int(stat_value)
                except ValueError:
                    pass  # Keep as string if not an int


                if stat_name in stats:
                    stats[stat_name] = stat_value

    return {
        "Name": stats['name'],
        "Rarity": stats['rarity'],
        "Elements": elements,
        "Stats": {
            "HP": stats['hp'],
            "Defense": stats['defense'],
            "Crafting Speed": stats['crafting_speed'],
            "Melee Attack": stats['melee_attack'],
            "Shot Attack": stats['shot_attack'],
            "Support": stats['support'],
            "Stamina": stats['stamina'],
            "Price": stats['price'],
            "Running Speed": stats['running_speed'],
            "Sprinting Speed": stats['sprinting_speed'],
            "Slow Walk Speed": stats['slow_walk_speed'],
        }
    }

## Main Test

In [None]:
def create_dataframe() -> pd.DataFrame:
    """
    Scrape every Pal listed on BASE_URL and collect the results in a DataFrame.

    Returns:
        pd.DataFrame: One row per Pal, with its name, link, rarity, stats, and elements
    """

    main_pal_soup = fetch_pal(BASE_URL)

    #get the basic information
    pals_info = parse_pals_data(main_pal_soup)

    full_data = []
    # loop through the pals
    for idx, pal in enumerate(pals_info, start=1):
        try:
            print(f"Processing {idx}/{len(pals_info)}: {pal['Name']}")
            detail_soup = fetch_pal(pal["Link"])
            stats = strip_pal_stats(detail_soup)

            # flatten the stats dictionary
            rec = {
                'Name': stats.get('Name'),
                "Link": pal.get("Link"),
                "Rarity": stats['Rarity']
            }

            # Add stats (like HP, Defense, etc.)
            for key, value in stats.get("Stats", {}).items():
                rec[key] = value
            # Add elements as a joined string (optional: or keep as list)
            rec["Elements"] = ", ".join(stats.get("Elements", []))

            full_data.append(rec)
        except Exception as e:
            print(f"Skipping {pal.get('Name', '')}: {e}")
            continue

    df = pd.DataFrame(full_data)
    return df
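
Because `Elements` is stored as a joined string, a dual-element Pal such as `"Dragon, Dark"` would be treated as its own category in a later group-by or count. One way to handle this, sketched below with the standard library only, is to split the string back into individual elements before counting. The helper name `split_elements` is hypothetical, and the sample values are copied from the ten rows of output shown in this notebook.

```python
from collections import Counter
from typing import List


def split_elements(joined: str) -> List[str]:
    """Turn 'Dragon, Dark' back into ['Dragon', 'Dark'] (hypothetical helper)."""
    return [part.strip() for part in joined.split(",") if part.strip()]


# Elements column for the first ten Pals in the scraped output
elements_column = [
    "Earth", "Fire", "Dragon, Dark", "Electricity", "Water, Dragon",
    "Ice, Dragon", "Ice", "Electricity", "Leaf", "Dark",
]

# count each element once per Pal that has it
counts = Counter(
    element for cell in elements_column for element in split_elements(cell)
)
print(counts.most_common())
```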

In [None]:
df = create_dataframe()
df.to_csv("test_data1.csv", index=False)
df.head(10)

Processing 1/225: Anubis#100
Processing 2/225: Arsox#42
Processing 3/225: Astegon#98
Processing 4/225: Azurmane#136
Processing 5/225: Azurobe#82
Processing 6/225: Azurobe Cryst#82
Processing 7/225: Bastigor#137
Processing 8/225: Beakon#73
Processing 9/225: Beegarde#50
Processing 10/225: Bellanoir#112
Processing 11/225: Bellanoir Libero#112
Processing 12/225: Blazamut#96
Processing 13/225: Blazamut Ryu#96
Processing 14/225: Blazehowl#84
Processing 15/225: Blazehowl Noct#84
Processing 16/225: Blue Slime#-1
Processing 17/225: Braloha#145
Processing 18/225: Bristla#30
Processing 19/225: Broncherry#86
Processing 20/225: Broncherry Aqua#86
Processing 21/225: Bushi#72
Processing 22/225: Bushi Noct#72
Processing 23/225: Caprity#35
Processing 24/225: Caprity Noct#35
Processing 25/225: Cattiva#2
Processing 26/225: Cave Bat#-1
Processing 27/225: Cawgnito#44
Processing 28/225: Celaray#25
Processing 29/225: Celaray Lux#25
Processing 30/225: Celesdir#132
Processing 31/225: Chikipi#3
Processing 32/22

Unnamed: 0,Name,Link,Rarity,HP,Defense,Crafting Speed,Melee Attack,Shot Attack,Support,Stamina,Price,Running Speed,Sprinting Speed,Slow Walk Speed,Elements
0,Anubis,https://palworld.gg/pal/anubis,10Epic,120,100,100,130,130,100,100,4960,800,1000,80,Earth
1,Arsox,https://palworld.gg/pal/arsox,4Common,85,95,100,100,95,100,120,3520,700,1050,87,Fire
2,Astegon,https://palworld.gg/pal/astegon,9Epic,100,125,100,100,125,100,300,8200,700,1100,100,"Dragon, Dark"
3,Azurmane,https://palworld.gg/pal/azurmane,7Rare,130,110,100,100,120,100,220,6680,900,1260,90,Electricity
4,Azurobe,https://palworld.gg/pal/azurobe,7Rare,110,100,100,70,100,100,160,5600,600,900,75,"Water, Dragon"
5,Azurobe Cryst,https://palworld.gg/pal/azurobe-cryst,8Epic,115,105,100,100,105,100,160,6720,600,900,75,"Ice, Dragon"
6,Bastigor,https://palworld.gg/pal/bastigor,8Epic,140,120,100,100,130,100,270,9020,750,1100,120,Ice
7,Beakon,https://palworld.gg/pal/beakon,6Rare,105,80,100,100,115,100,160,7490,750,1200,100,Electricity
8,Beegarde,https://palworld.gg/pal/beegarde,4Common,80,90,100,100,90,100,100,1880,450,550,125,Leaf
9,Bellanoir,https://palworld.gg/pal/bellanoir,20Legendary,120,100,100,100,150,100,100,10030,600,800,100,Dark
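
The `Rarity` column in the output above fuses a number and a label (`10Epic`, `4Common`, `20Legendary`), presumably because the `rarity` div holds both pieces of text. A minimal sketch of separating them with a regular expression follows. Treating the leading digits as a numeric rarity level is an assumption, and the function name `split_rarity` is hypothetical.

```python
import re
from typing import Optional, Tuple

# leading digits, then a label made of letters (and possibly spaces)
RARITY_PATTERN = re.compile(r"^\s*(\d+)\s*([A-Za-z ]+?)\s*$")


def split_rarity(raw: Optional[str]) -> Tuple[Optional[int], Optional[str]]:
    """Split '10Epic' into (10, 'Epic') (hypothetical helper).

    Returns (None, raw) when the text does not match the expected shape.
    """
    match = RARITY_PATTERN.match(raw or "")
    if not match:
        return None, raw
    return int(match.group(1)), match.group(2)


# sample values taken from the Rarity column in the scraped output
for raw in ["10Epic", "4Common", "7Rare", "20Legendary"]:
    print(split_rarity(raw))
```

Keeping the number and the label in separate columns gives one numerical and one categorical feature, which fits the mix of feature types recommended in Part 1.2.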


In [None]:
df.head(50)

