
### Python Web Scraping on CN Honkai Star Rail Wiki Character Data

This project is a web scraping exercise that uses the Python library BeautifulSoup to build a dataset of basic information about the playable characters in Honkai: Star Rail, a popular free-to-play role-playing gacha video game published by miHoYo and released on 26 April 2023. It is the second game to become widely popular after Genshin Impact, and is known for its immersive world-building, storytelling and attractive character designs. The story follows the main character, the Trailblazer, and their companions on an intergalactic voyage to uncover the many mysteries of the game's science-fantasy world.

The objectives of this project are:

*   To practise web scraping with the Python library BeautifulSoup in order to create datasets for personal projects.
*   To become familiar with the various HTML tags and the methods used to extract the desired information from them.

Other tools and libraries that will be used for this project include:

*   Requests
*   Pandas





Import the Python modules/libraries required.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

The source is the '角色筛选' ('Character Filtering') page of the '星穹铁道 Wiki' (Honkai: Star Rail Wiki) hosted by '哔哩哔哩游戏中心' (bilibili Game Center), the games section of bilibili, a popular video platform from Mainland China comparable to YouTube. Although the wiki is unofficial, it is treated here as a credible source of data on the current playable characters. The data is gathered from a table listing the attributes of each playable character, such as name, rarity and faction.

In [2]:
url = "https://wiki.biligame.com/sr/%E8%A7%92%E8%89%B2%E7%AD%9B%E9%80%89"

Send a GET request to the URL and keep only the content of the response.

In [3]:
response = requests.get(url)
response = response.content
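
A slightly more defensive version of this request is sketched below. The helper name `fetch_page`, its `session` parameter and the 10-second timeout are assumptions made for illustration and are not part of the original notebook. `raise_for_status()` turns HTTP error codes into exceptions, so that an error page is not parsed by mistake.

```python
import requests


def fetch_page(url, session=None, timeout=10):
    """Return the raw bytes of `url`, raising on HTTP errors.

    `fetch_page`, `session` and the 10-second timeout are illustrative
    assumptions, not part of the original notebook.
    """
    # Allow a caller-supplied session (useful for reuse or for testing);
    # fall back to the module-level requests.get otherwise.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing them.
    response.raise_for_status()
    return response.content
```

With this helper, the cell above could be written as `response = fetch_page(url)`.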

Parse the downloaded content into a BeautifulSoup object so that it can be searched by HTML tag.

In [4]:
soup = BeautifulSoup(response, "html.parser")

Find the single table, identified by its id, that contains the data about the current playable characters.

In [5]:
table_char = soup.find('table', id='CardSelectTr')
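
`find` returns `None` when nothing matches, which would only surface later as an `AttributeError`. The sketch below shows one way to fail early with a clearer message. The inline HTML is a tiny stand-in rather than the real wiki markup, and `find_table` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def find_table(soup, table_id):
    """Return the <table> with the given id, or raise a clear error.

    `find_table` is a hypothetical helper, not part of the notebook.
    """
    table = soup.find('table', id=table_id)
    if table is None:
        raise ValueError(
            f"No <table id={table_id!r}> found; the page layout may have changed."
        )
    return table


# Tiny stand-in document (not the real wiki markup).
sample_html = '<table id="CardSelectTr"><tr><td><img alt="5星.png"/></td></tr></table>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_table = find_table(sample_soup, "CardSelectTr")
print(sample_table.find('img')['alt'])
```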

To obtain the basic information and stats for each character, the process is split into 4 sections, each targeting a different HTML tag (or class) that holds part of the information of interest.

In the first section, key information such as name, rarity, path and element is stored in the alt attribute of img tags. Each alt value is the file name of a .png image, and the table contains four such images for every playable character.

In [6]:
image = table_char.find_all('img')

basic = [alt['alt'] for alt in image]

basic

['阿格莱雅.png',
 '5星.png',
 '记忆-白.png',
 '雷.png',
 '大黑塔.png',
 '5星.png',
 '智识-白.png',
 '冰.png',
 '开拓者•记忆.png',
 '5星.png',
 '记忆-白.png',
 '冰.png',
 '星期日.png',
 '5星.png',
 '同谐-白.png',
 '虚数.png',
 '忘归人.png',
 '5星.png',
 '虚无-白.png',
 '火.png',
 '乱破.png',
 '5星.png',
 '智识-白.png',
 '虚数.png',
 '灵砂.png',
 '5星.png',
 '丰饶-白.png',
 '火.png',
 '飞霄.png',
 '5星.png',
 '巡猎-白.png',
 '风.png',
 '貊泽.png',
 '4星.png',
 '巡猎-白.png',
 '雷.png',
 '椒丘.png',
 '5星.png',
 '虚无-白.png',
 '火.png',
 '云璃.png',
 '5星.png',
 '毁灭-白.png',
 '物理.png',
 '三月七•巡猎.png',
 '4星.png',
 '巡猎-白.png',
 '虚数.png',
 '翡翠.png',
 '5星.png',
 '智识-白.png',
 '量子.png',
 '流萤.png',
 '5星.png',
 '毁灭-白.png',
 '火.png',
 '波提欧.png',
 '5星.png',
 '巡猎-白.png',
 '物理.png',
 '知更鸟.png',
 '5星.png',
 '同谐-白.png',
 '物理.png',
 '开拓者•同谐.png',
 '5星.png',
 '同谐-白.png',
 '虚数.png',
 '砂金.png',
 '5星.png',
 '存护-白.png',
 '虚数.png',
 '黄泉.png',
 '5星.png',
 '虚无-白.png',
 '雷.png',
 '加拉赫.png',
 '4星.png',
 '丰饶-白.png',
 '火.png',
 '花火.png',
 '5星.png',
 '同谐-白.png',
 '量子.png',
 '黑天鹅.png',
 '5星.png',
 '

After the alt values are stored in a list, a for loop processes each element so that only the relevant part is kept: the '.png' file extension and the '-白' suffix on the path icons are discarded.

In [7]:
char_info = []
for alt_text in basic:
    # Drop the file extension, e.g. '5星.png' -> '5星'
    item = alt_text.split('.')[0]
    # Drop the icon colour suffix, e.g. '记忆-白' -> '记忆'
    item = item.split('-')[0]
    char_info.append(item)

char_info

['阿格莱雅',
 '5星',
 '记忆',
 '雷',
 '大黑塔',
 '5星',
 '智识',
 '冰',
 '开拓者•记忆',
 '5星',
 '记忆',
 '冰',
 '星期日',
 '5星',
 '同谐',
 '虚数',
 '忘归人',
 '5星',
 '虚无',
 '火',
 '乱破',
 '5星',
 '智识',
 '虚数',
 '灵砂',
 '5星',
 '丰饶',
 '火',
 '飞霄',
 '5星',
 '巡猎',
 '风',
 '貊泽',
 '4星',
 '巡猎',
 '雷',
 '椒丘',
 '5星',
 '虚无',
 '火',
 '云璃',
 '5星',
 '毁灭',
 '物理',
 '三月七•巡猎',
 '4星',
 '巡猎',
 '虚数',
 '翡翠',
 '5星',
 '智识',
 '量子',
 '流萤',
 '5星',
 '毁灭',
 '火',
 '波提欧',
 '5星',
 '巡猎',
 '物理',
 '知更鸟',
 '5星',
 '同谐',
 '物理',
 '开拓者•同谐',
 '5星',
 '同谐',
 '虚数',
 '砂金',
 '5星',
 '存护',
 '虚数',
 '黄泉',
 '5星',
 '虚无',
 '雷',
 '加拉赫',
 '4星',
 '丰饶',
 '火',
 '花火',
 '5星',
 '同谐',
 '量子',
 '黑天鹅',
 '5星',
 '虚无',
 '风',
 '米沙',
 '4星',
 '毁灭',
 '冰',
 '真理医生',
 '5星',
 '巡猎',
 '虚数',
 '阮•梅',
 '5星',
 '同谐',
 '冰',
 '雪衣',
 '4星',
 '毁灭',
 '量子',
 '银枝',
 '5星',
 '智识',
 '物理',
 '寒鸦',
 '4星',
 '同谐',
 '物理',
 '藿藿',
 '5星',
 '丰饶',
 '风',
 '托帕&账账',
 '5星',
 '巡猎',
 '火',
 '桂乃芬',
 '4星',
 '虚无',
 '火',
 '镜流',
 '5星',
 '毁灭',
 '冰',
 '符玄',
 '5星',
 '存护',
 '量子',
 '玲可',
 '4星',
 '丰饶',
 '量子',
 '丹恒•饮月',
 '5星',
 '毁灭',
 '虚数',
 '卡芙卡',

There are a total of 65 playable characters as of 16/2/2025, and 4 pieces of information are obtained for each character, so the list should contain 65 * 4 = 260 elements.

In [8]:
len(char_info)

260
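
Beyond the length check, the flat list can be regrouped into one record per character to confirm that the fields line up. The sketch below uses a short hand-written sample rather than the scraped list, and the helper name `chunk` is an assumption.

```python
def chunk(values, size):
    """Split a flat list into consecutive groups of `size` items."""
    if len(values) % size != 0:
        raise ValueError(
            f"{len(values)} items cannot be split into groups of {size}"
        )
    return [values[i:i + size] for i in range(0, len(values), size)]


# Hand-written sample in the same order as char_info:
# name, rarity, path, element.
sample_info = ['阿格莱雅', '5星', '记忆', '雷',
               '貊泽', '4星', '巡猎', '雷']

records = chunk(sample_info, 4)
# Every record should carry a rarity such as '4星' or '5星' in its second field.
assert all(record[1].endswith('星') for record in records)
print(records)
```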

In the next section, information such as gender, faction and base stats is stored in td tags with the class 'hidden-xs'.

In [9]:
td = table_char.find_all('td', class_="hidden-xs")

td

[<td class="hidden-xs"><a href="/sr/%E9%98%BF%E6%A0%BC%E8%8E%B1%E9%9B%85" title="阿格莱雅">阿格莱雅</a></td>,
 <td class="hidden-xs"><img alt="5星.png" data-file-height="27" data-file-width="99" decoding="async" height="14" loading="lazy" src="https://patchwiki.biligame.com/images/sr/thumb/c/c7/g380fo4o5accoa1rmlckz3e24vpx23e.png/52px-5%E6%98%9F.png" srcset="https://patchwiki.biligame.com/images/sr/thumb/c/c7/g380fo4o5accoa1rmlckz3e24vpx23e.png/77px-5%E6%98%9F.png 1.5x, https://patchwiki.biligame.com/images/sr/c/c7/g380fo4o5accoa1rmlckz3e24vpx23e.png 2x" width="52"/></td>,
 <td class="hidden-xs">女</td>,
 <td class="hidden-xs">翁法罗斯</td>,
 <td class="hidden-xs">1242</td>,
 <td class="hidden-xs">699</td>,
 <td class="hidden-xs">485</td>,
 <td class="hidden-xs">102</td>,
 <td class="hidden-xs">350</td>,
 <td class="hidden-xs"><a href="/sr/%E5%A4%A7%E9%BB%91%E5%A1%94" title="大黑塔">大黑塔</a></td>,
 <td class="hidden-xs"><img alt="5星.png" data-file-height="27" data-file-width="99" decoding="async" height

Each character occupies nine of these cells, and the desired information is always in the third to ninth positions (indices 2 to 8) of each group. A for loop therefore steps through the previous list nine cells at a time and stores those seven cells in a new list, after which the text of each cell is extracted.

In [10]:
char_info2 = []
for i in range(2,len(td),9):
  char_info2.extend(td[i:i+7])

char_info2 = [item.text for item in char_info2]

char_info2

['女',
 '翁法罗斯',
 '1242',
 '699',
 '485',
 '102',
 '350',
 '女',
 '空间站「黑塔」',
 '1164',
 '679',
 '485',
 '99',
 '220',
 '男, 女',
 '星穹列车',
 '1203',
 '620',
 '460',
 '100',
 '160',
 '男',
 '银河',
 '1241',
 '640',
 '533',
 '96',
 '130',
 '女',
 '仙舟「罗浮」',
 '1125',
 '582',
 '558',
 '102',
 '130',
 '女',
 '巡海游侠',
 '1086',
 '717',
 '460',
 '96',
 '140',
 '女',
 '仙舟「罗浮」',
 '1358',
 '679',
 '437',
 '98',
 '110',
 '女',
 '仙舟「曜青」',
 '1047',
 '601',
 '388',
 '112',
 '12点【飞黄】',
 '男',
 '仙舟「曜青」',
 '811',
 '599',
 '352',
 '111',
 '120',
 '男',
 '仙舟「曜青」',
 '1358',
 '601',
 '509',
 '98',
 '100',
 '女',
 '仙舟「朱明」',
 '1358',
 '679',
 '460',
 '94',
 '240',
 '女',
 '星穹列车',
 '1058',
 '564',
 '441',
 '102',
 '110',
 '女',
 '星际和平公司',
 '1086',
 '659',
 '509',
 '103',
 '140',
 '女',
 '星核猎手',
 '814',
 '523',
 '776',
 '104',
 '240',
 '男',
 '巡海游侠',
 '1203',
 '620',
 '436',
 '107',
 '115',
 '女',
 '匹诺康尼',
 '1280',
 '640',
 '485',
 '102',
 '160',
 '男, 女',
 '星穹列车',
 '1086',
 '446',
 '679',
 '105',
 '140',
 '男',
 '星际和平公司',
 '1203',
 '446

Again, with 65 playable characters as of 16/2/2025 and 7 pieces of information obtained for each character, the new list should contain 65 * 7 = 455 elements.

In [11]:
len(char_info2)

455
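
The stride-and-slice pattern used for these cells can be checked on a small example. The sample below uses plain strings standing in for the `td` tags, with 'name' and 'rarity' as placeholders for the link and image cells that are skipped. The constant names are assumptions.

```python
# Two made-up rows of 9 cells each; 'name' and 'rarity' stand in for the
# link and image cells that the loop skips.
cells = ['name', 'rarity', '女', '翁法罗斯', '1242', '699', '485', '102', '350',
         'name', 'rarity', '女', '空间站「黑塔」', '1164', '679', '485', '99', '220']

CELLS_PER_ROW = 9                 # hidden-xs cells per character
SKIP = 2                          # leading cells with no text of interest
KEEP = CELLS_PER_ROW - SKIP       # 7 cells kept per character

wanted = []
for start in range(SKIP, len(cells), CELLS_PER_ROW):
    wanted.extend(cells[start:start + KEEP])

print(len(wanted))  # 14, i.e. 2 characters * 7 fields
```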

The next section finds all span tags with the class 'tag-data-1', which contain a brief description of each character's role, damage types, buffs offered, etc.

In [12]:
span1 = table_char.find_all('span', class_="tag-data-1")

char_tag1 = [item.text for item in span1]

char_tag1

['忆灵、高速主C',
 '对群输出、攻击强化',
 '忆灵、真实伤害',
 '效果解除、立即行动、增伤、暴击率提升、暴击伤害提升、能量恢复、能量、召唤物立即行动、战技点恢复',
 '击破特攻提升、无视弱点削韧、行动延后、减防、超击破、击破特攻、战技点恢复',
 '技能强化、无视弱点削韧、群攻、额外回合、弱点击破效率、击破特攻、额外伤害、伤害倍率、快速移动、秘技无受击、能量、超击破、击破易伤',
 '召唤、自身召唤物行动提前、治疗、击破易伤、群攻、额外伤害、追加攻击、攻击力、治疗量、能量',
 '追加攻击、无视弱点削韧、伤害倍率、自身伤害提升、攻击力、暴击伤害、快速移动、秘技牵引、特殊能量',
 '追加攻击、行动提前、战技点恢复、追加攻击易伤、附加伤害',
 '终结技易伤、易伤、持续伤害、特殊领域、群攻、攻击力、能量',
 '反击、自身治疗、能量、嘲讽、暴击伤害、额外伤害、自身减伤、攻击力',
 '速度提升、附加伤害、技能强化、削韧值、立即行动、自身伤害提升。攻击段数提升、能量、击破特攻提升、行动提前、暴伤提升',
 '追加攻击、暴击伤害、攻击力、群攻、速度提升、额外伤害、消耗队友生命值、自身追击伤害提升、秘技无受击、秘技群攻',
 '技能强化、击破特攻、弱点击破效率、速度、弱点植入、无视弱点削韧、秘技群攻、秘技无受击、消耗生命值、能量、自身减伤、自身治疗、快速移动、自身行动提前、效果抵抗、超击破、削韧值、自身击破易伤、自身效果解除',
 '强化技能、弱点植入、嘲讽、行动延后、能量、暴击率、暴击伤害、自身减伤、自身易伤',
 '立即行动、能量、特殊领域、秘技诱敌、秘技无受击、增伤、加攻、附加伤害、控制抵抗、暴伤提升、自身行动提前',
 '能量、击破特攻提升、超击破、行动延后',
 '群体护盾、护盾叠加、随机、暴伤提升、效果抵抗提升、控制抵抗、追加攻击、加防、暴击率',
 '特殊能量、无视弱点削韧、减抗、负面效果参照、探索便利、秘技无受击、自身普攻伤害提升、自身战技伤害提升、自身终结技伤害提升、自身伤害提升、额外伤害、群攻、快速移动、伤害额外提高',
 '技能强化、治疗、群攻、自身行动提前、攻击治疗、击破易伤、治疗量',
 '暴击伤害、行动提前、战技点恢复、增伤、战技点上限、秘技无受击、加攻、能量',
 '持续伤害、减防、自身伤害提升、额外效果、易伤、群攻、自身无视防

len() is used again to check that information for all 65 current characters was obtained.

In [13]:
len(char_tag1)

65
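
Each tag string holds several labels separated by the Chinese enumeration comma '、'. For analysis it may be convenient to split them into a list. The sketch below is one way to do so; `split_tags` is a hypothetical helper, and it also splits on '。' because one entry in the output above uses that character between two labels.

```python
import re


def split_tags(tag_text):
    """Split a tag string on '、' (and the stray '。' seen in one entry)."""
    return [part.strip() for part in re.split(r'[、。]', tag_text) if part.strip()]


sample_tag = '追加攻击、行动提前、战技点恢复、追加攻击易伤、附加伤害'
tags = split_tags(sample_tag)
print(len(tags))  # 5
```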

Similarly, span tags are searched again, this time with the class 'tag-data-2'.

In [14]:
span2 = table_char.find_all('span', class_="tag-data-2")

char_tag2 = [item.text for item in span2]

char_tag2

['忆灵、高速主C、回能、抗性穿透、无视防御、易伤',
 '对群输出、攻击强化、行动提前',
 '忆灵、真实伤害、回能',
 '效果解除、立即行动、增伤、暴击率提升、暴击伤害提升、能量恢复、能量、召唤物立即行动、战技点恢复、无视防御力',
 '击破特攻提升、无视弱点削韧、行动延后、减防、超击破、击破特攻、战技点恢复、击破易伤、弱点击破效率提升、能量、行动提前',
 '技能强化、无视弱点削韧、群攻、额外回合、弱点击破效率、击破特攻、额外伤害、伤害倍率、快速移动、秘技无受击、能量、超击破、击破易伤、削韧值、自身无视防御力、速度提升',
 '召唤、自身召唤物行动提前、治疗、击破易伤、群攻、额外伤害、追加攻击、攻击力、治疗量、能量、减抗、减防、击破特攻提升、弱点击破效率、额外伤害次数增加',
 '追加攻击、无视弱点削韧、伤害倍率、自身伤害提升、攻击力、暴击伤害、快速移动、秘技牵引、特殊能量、伤害额外提高、削韧值、抗性穿透、速度',
 '追加攻击、行动提前、战技点恢复、追加攻击易伤、附加伤害、伤害倍率、暴伤提升、能量、自身伤害提升',
 '终结技易伤、易伤、持续伤害、特殊领域、群攻、攻击力、能量、减抗、增伤、持续伤害倍率',
 '反击、自身治疗、能量、嘲讽、暴击伤害、额外伤害、自身减伤、攻击力、抗性穿透、效果抵抗、暴击率、自身无视防御力、自身终结技伤害提升、额外伤害次数增加',
 '速度提升、附加伤害、技能强化、削韧值、立即行动、自身伤害提升。攻击段数提升、能量、击破特攻提升、行动提前、暴伤提升、暴击伤害、追加攻击、速度',
 '追加攻击、暴击伤害、攻击力、群攻、速度提升、额外伤害、消耗队友生命值、自身追击伤害提升、秘技无受击、秘技群攻、抗性穿透、暴击率、自身无视防御力',
 '技能强化、击破特攻、弱点击破效率、速度、弱点植入、无视弱点削韧、秘技群攻、秘技无受击、消耗生命值、能量、自身减伤、自身治疗、快速移动、自身行动提前、效果抵抗、超击破、削韧值、自身击破易伤、自身效果解除、抗性穿透、效果抵抗、自身无视防御力、额外回合',
 '强化技能、弱点植入、嘲讽、行动延后、能量、暴击率、暴击伤害、自身减伤、自身易伤、击破特攻、战技点恢复、自身无视防御力、额外伤害',
 '立即行动、能量、特殊领域、秘技诱敌、秘技无受击、增伤、加攻、附加伤害、控制抵抗、暴伤提升、自身行动提前、减抗、效果

For this final section too, information for all 65 current characters was obtained.

In [15]:
len(char_tag2)

65
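
Because the four lists are combined by position in the next step, a mismatch in any one of them would silently shift data between characters. A minimal sketch of a pre-check follows, using stand-in data for a single character; `check_lengths` is a hypothetical helper.

```python
def check_lengths(n_chars, info, stats, tags1, tags2):
    """Raise ValueError unless every list matches the expected size."""
    expected = {
        'char_info': (len(info), n_chars * 4),
        'char_info2': (len(stats), n_chars * 7),
        'char_tag1': (len(tags1), n_chars),
        'char_tag2': (len(tags2), n_chars),
    }
    for name, (actual, wanted) in expected.items():
        if actual != wanted:
            raise ValueError(f"{name}: expected {wanted} items, got {actual}")


# Stand-in data for a single character; passes silently when sizes agree.
check_lengths(
    1,
    ['阿格莱雅', '5星', '记忆', '雷'],
    ['女', '翁法罗斯', '1242', '699', '485', '102', '350'],
    ['忆灵、高速主C'],
    ['忆灵、高速主C、回能、抗性穿透、无视防御、易伤'],
)
```

In the notebook this would be called as `check_lengths(65, char_info, char_info2, char_tag1, char_tag2)`.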

Before converting the information into a pandas DataFrame, all the information about each character has to be gathered into its own list. These per-character lists are then stored in another list to create a nested list. Since the length of char_tag1 (or char_tag2) always equals the total number of characters, it is used to set the number of iterations of the for loop.

In [16]:
all_char = []

for i in range(len(char_tag1)):
    chunk1 = char_info[i * 4:(i + 1) * 4]
    chunk2 = char_info2[i * 7:(i + 1) * 7]
    chunk3 = [char_tag1[i]]
    chunk4 = [char_tag2[i]]

    combined = chunk1 + chunk2 + chunk3 + chunk4
    all_char.append(combined)

all_char


[['阿格莱雅',
  '5星',
  '记忆',
  '雷',
  '女',
  '翁法罗斯',
  '1242',
  '699',
  '485',
  '102',
  '350',
  '忆灵、高速主C',
  '忆灵、高速主C、回能、抗性穿透、无视防御、易伤'],
 ['大黑塔',
  '5星',
  '智识',
  '冰',
  '女',
  '空间站「黑塔」',
  '1164',
  '679',
  '485',
  '99',
  '220',
  '对群输出、攻击强化',
  '对群输出、攻击强化、行动提前'],
 ['开拓者•记忆',
  '5星',
  '记忆',
  '冰',
  '男, 女',
  '星穹列车',
  '1203',
  '620',
  '460',
  '100',
  '160',
  '忆灵、真实伤害',
  '忆灵、真实伤害、回能'],
 ['星期日',
  '5星',
  '同谐',
  '虚数',
  '男',
  '银河',
  '1241',
  '640',
  '533',
  '96',
  '130',
  '效果解除、立即行动、增伤、暴击率提升、暴击伤害提升、能量恢复、能量、召唤物立即行动、战技点恢复',
  '效果解除、立即行动、增伤、暴击率提升、暴击伤害提升、能量恢复、能量、召唤物立即行动、战技点恢复、无视防御力'],
 ['忘归人',
  '5星',
  '虚无',
  '火',
  '女',
  '仙舟「罗浮」',
  '1125',
  '582',
  '558',
  '102',
  '130',
  '击破特攻提升、无视弱点削韧、行动延后、减防、超击破、击破特攻、战技点恢复',
  '击破特攻提升、无视弱点削韧、行动延后、减防、超击破、击破特攻、战技点恢复、击破易伤、弱点击破效率提升、能量、行动提前'],
 ['乱破',
  '5星',
  '智识',
  '虚数',
  '女',
  '巡海游侠',
  '1086',
  '717',
  '460',
  '96',
  '140',
  '技能强化、无视弱点削韧、群攻、额外回合、弱点击破效率、击破特攻、额外伤害、伤害倍率、快速移动、秘技无受击、能量、超击破、击破易伤',
  '技能强化、无视弱点削韧、群攻、额外回合、

After the nested list of characters is created, it is converted into a pandas DataFrame with the column names 'Name', 'Rarity', 'Path', 'Element', 'Gender', 'Faction', 'HP lvl80', 'ATK lvl80', 'DEF lvl80', 'Speed', 'Energy_Cap', 'Char_Tag_1' and 'Char_Tag_2', one for each piece of information stored per character.

In [17]:
hsr_df = pd.DataFrame(all_char, columns=['Name', 'Rarity', 'Path', 'Element', 'Gender', 'Faction', 'HP lvl80', 'ATK lvl80', 'DEF lvl80', 'Speed', 'Energy_Cap', 'Char_Tag_1', 'Char_Tag_2'])
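
All scraped values are strings, so the stat columns need converting before any numerical analysis. The sketch below uses a two-row sample with a subset of the notebook's column names and applies `pd.to_numeric` with `errors='coerce'`, which turns non-numeric entries such as '12点【飞黄】' (seen in the scraped output above) into `NaN` rather than raising. Whether to coerce such values or clean them by hand is a judgment call.

```python
import pandas as pd

# Two sample rows using a subset of the notebook's column names.
sample_df = pd.DataFrame(
    [['阿格莱雅', '1242', '699', '485', '102', '350'],
     ['飞霄', '1047', '601', '388', '112', '12点【飞黄】']],
    columns=['Name', 'HP lvl80', 'ATK lvl80', 'DEF lvl80', 'Speed', 'Energy_Cap'],
)

numeric_cols = ['HP lvl80', 'ATK lvl80', 'DEF lvl80', 'Speed', 'Energy_Cap']
for col in numeric_cols:
    # errors='coerce' replaces values that cannot be parsed with NaN.
    sample_df[col] = pd.to_numeric(sample_df[col], errors='coerce')

print(sample_df.dtypes)
```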

Finally, the DataFrame is written to a CSV file that can be used for data analysis.

In [19]:
hsr_df.to_csv('hsr_char_v3.0.csv', encoding="utf-8-sig")
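
By default `to_csv` also writes the row index as an unnamed first column. If that column is not wanted, `index=False` omits it. The sketch below writes a one-row sample to an in-memory buffer instead of a file and reads it back; the sample data is illustrative only.

```python
import io

import pandas as pd

sample_df = pd.DataFrame([['阿格莱雅', '5星']], columns=['Name', 'Rarity'])

buffer = io.StringIO()
# index=False leaves out the row index column.
sample_df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a DataFrame.
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
```

The `utf-8-sig` encoding used in the notebook adds a byte-order mark at the start of the file, which helps spreadsheet programs such as Excel detect UTF-8 and display the Chinese text correctly.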


References:

*   星穹铁道 Wiki (CN Honkai: Star Rail Wiki): https://wiki.biligame.com/sr/%E8%A7%92%E8%89%B2%E7%AD%9B%E9%80%89
*   Content scraped from the wiki is provided under CC BY-NC-SA 4.0, as stated on the wiki page: https://creativecommons.org/licenses/by-nc-sa/4.0/

