Jasper Wilson
6/13/2022
Hearthstone Card Analysis

This is a small data-science project to demonstrate my abilities with common tools and processes. The aims of this project are to:
<ol>
    <li> Create a dataframe that includes all Hearthstone cards and a value that we can associate with their strength</li>
    <li> Use https://rapidapi.com/omgvamp/api/hearthstone/ to get the full information from each card </li>
    <li> Create and normalize a dataframe that includes basic features that could be relevant to a card's strength </li>
    <li> Use ML tools to make a function that can predict the strength of hypothetical cards </li>
</ol>
This project is unlikely to yield impactful results, since the strength of a card is tied to complex factors, such as card-specific text, that will not be evaluated here. Hypothetically, text analysis could improve my results, but the sample size is likely too small for that approach to be effective.

Let's start by importing the libraries needed for this project.

In [48]:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time

The best source for information on the strength of individual cards is https://hsreplay.net/cards/#sortBy=includedPopularity&gameType=RANKED_WILD&showSparse=yes. So let's try scraping the information from this website.

In [12]:
page = requests.get("https://hsreplay.net/cards/#sortBy=includedPopularity&gameType=RANKED_WILD&showSparse=yes").text
soup = BeautifulSoup(page, 'html.parser')

Now that we have our page, let's use soup to get a list of all the card names and check that it has all the cards.

In [13]:
card_names = soup.find_all(class_='card-tile')
print(len(card_names))

0


Well, that isn't right. It seems this page loads the cards with JavaScript, so we need to use Selenium to load the page before we scrape it.

In [14]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://hsreplay.net/cards/#sortBy=includedPopularity&gameType=RANKED_WILD&showSparse=yes')




[WDM] - Current google-chrome version is 102.0.5005
[WDM] - Get LATEST chromedriver version for 102.0.5005 google-chrome
[WDM] - Driver [/home/jaspermwilson/.wdm/drivers/chromedriver/linux64/102.0.5005.61/chromedriver] found in cache


WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally.
  (unknown error: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Stacktrace:
#0 0x556bf8350f33 <unknown>
#1 0x556bf809b118 <unknown>
#2 0x556bf80be678 <unknown>
#3 0x556bf80b9d5a <unknown>
#4 0x556bf80f4d3a <unknown>
#5 0x556bf80eee63 <unknown>
#6 0x556bf80c482a <unknown>
#7 0x556bf80c5985 <unknown>
#8 0x556bf83954cd <unknown>
#9 0x556bf83995ec <unknown>
#10 0x556bf837f71e <unknown>
#11 0x556bf839a238 <unknown>
#12 0x556bf8374870 <unknown>
#13 0x556bf83b6608 <unknown>
#14 0x556bf83b6788 <unknown>
#15 0x556bf83d0f1d <unknown>
#16 0x7f32bfad6609 <unknown>


That didn't work for me, so let's use the solution from https://stackoverflow.com/questions/53073411/selenium-webdriverexceptionchrome-failed-to-start-crashed-as-google-chrome-is. If the previous block didn't throw an error for you, skip the next step.

In [62]:
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Replace the path below with the path to your chromedriver; see full details at https://chromedriver.chromium.org/getting-started
# Selenium 4 takes the driver path through Service and the options through `options`.
driver = webdriver.Chrome(service=Service('/home/jaspermwilson/hs_deck_completion/chromedriver'), options=chrome_options)
driver.get('https://hsreplay.net/cards/#sortBy=includedPopularity&gameType=RANKED_WILD&showSparse=yes')



Alright, now that the JavaScript is being rendered, let's see how many cards we have.

In [22]:
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
card_names = soup.find_all(class_='card-tile')
print(len(card_names))

24


In [23]:
soup

<html lang="en"><head><script async="" src="https://s0.2mdn.net/instream/video/client.js" type="text/javascript"></script><script async="" src="https://rules.quantcount.com/rules-p-5pR25819dph-b.js"></script><script defer="" src="https://tagan.adlightning.com/enthusiastgaming/bl-fe8bb3e-9afb9e9e.js" type="text/javascript"></script><script defer="" src="https://tagan.adlightning.com/enthusiastgaming/b-01880f1-7536a984.js" type="text/javascript"></script><script type="text/javascript"></script>
<meta charset="utf-8"/><title>Hearthstone Card Statistics - HSReplay.net</title><meta content="width=device-width, initial-scale=1" name="viewport"/><meta content="@HSReplayNet" name="twitter:site"/><meta content="Compare statistics about all collectible Hearthstone cards. Find the cards that are played the most or have the highest winrate." name="description"/><meta content="Compare statistics about all collectible Hearthstone cards. Find the cards that are played the most or have the highest win

That still isn't right. There should be well over 1,000 cards on this page. Unfortunately, the cards are loaded dynamically as the user scrolls, so let's take a solution from https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python to load all the cards. This code is inconsistent: loading the page may take a variable amount of time, or hsreplay.net could refuse the connection. By trying different pause times, I was able to get all the cards, and I saved the output from later functions into text files.

In [63]:
SCROLL_PAUSE_TIME = 2

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
print(last_height)

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

3278


  buttons = wrapper.find_elements_by_css_selector(".btn.btn-default")


MaxRetryError: HTTPConnectionPool(host='localhost', port=35097): Max retries exceeded with url: /session/e0b1e7d3254a0ea6b5eb28779d736086/element/e66f2d45-6b48-4a80-87fa-8e28480c3aad/elements (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8b5a3bc6a0>: Failed to establish a new connection: [Errno 111] Connection refused'))
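
Part of the inconsistency described above may come from the loop stopping the first time the height is unchanged, even when the page is merely slow to load. Below is a minimal sketch of one way to soften that, assuming it is acceptable to wait for several unchanged readings in a row. The names `scroll_until_stable`, `stable_checks`, `max_scrolls`, and `FakeDriver` are hypothetical and are not part of Selenium or of the original notebook; only `execute_script` is a real WebDriver method.

```python
import time


def scroll_until_stable(driver, pause=2.0, stable_checks=3, max_scrolls=500):
    """Scroll to the bottom until the height is unchanged `stable_checks` times in a row.

    Returns the last scroll height that was observed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    unchanged = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # No growth this round; only stop after several such rounds.
            unchanged += 1
            if unchanged >= stable_checks:
                break
        else:
            # The page grew, so reset the counter and keep scrolling.
            unchanged = 0
            last_height = new_height
    return last_height


class FakeDriver:
    """Hypothetical stand-in for a WebDriver that reports scripted page heights."""

    def __init__(self, heights):
        self.heights = list(heights)
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            # Repeat the final height once the scripted sequence runs out.
            height = self.heights[min(self.index, len(self.heights) - 1)]
            self.index += 1
            return height
        return None
```

With `stable_checks=1` this behaves like the loop above and can stop early on a slow load; a larger value trades run time for a better chance of reaching the end of the list. `FakeDriver` only exists so the logic can be tried without opening a browser.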

In [25]:
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
card_names = soup.find_all(class_='card-tile')
print(len(card_names))

48


In [20]:
soup

<html lang="en"><head><script async="" src="https://s0.2mdn.net/instream/video/client.js" type="text/javascript"></script><script defer="" src="https://tagan.adlightning.com/enthusiastgaming/bl-fe8bb3e-9afb9e9e.js" type="text/javascript"></script><script defer="" src="https://tagan.adlightning.com/enthusiastgaming/b-01880f1-7536a984.js" type="text/javascript"></script><script type="text/javascript"></script>
<meta charset="utf-8"/><title>Hearthstone Card Statistics - HSReplay.net</title><meta content="width=device-width, initial-scale=1" name="viewport"/><meta content="@HSReplayNet" name="twitter:site"/><meta content="Compare statistics about all collectible Hearthstone cards. Find the cards that are played the most or have the highest winrate." name="description"/><meta content="Compare statistics about all collectible Hearthstone cards. Find the cards that are played the most or have the highest winrate." property="og:description"/><link href="https://static.hsreplay.net/static/image

In [44]:
# `titles` is the list of card elements, each carrying an aria-label with the card name.
title_list = []
for i in titles:
    title_list.append(i['aria-label'])
    print(i['aria-label'])

Mutanus the Devourer
Lightning Bloom
Zephrys the Great
Dirty Rat
Secret Passage
Blademaster Okani
Loatheb
Amalgam of the Deep
Brann Bronzebeard
Rustrot Viper
Devolve
Raise Dead
Reno Jackson
Zola the Gorgon
Mr. Smite
Southsea Deckhand
Armor Vendor
Scrapyard Colossus
Shadow Visions
Hysteria
Wildheart Guff
Parachute Brigand
Cavern Shinyfinder
Drain Soul
Patches the Pirate
Pufferfist
Ship's Cannon
Archmage Vargoth
Neptulon the Tidehunter
Moonlit Guidance
Palm Reading
Defile
Prize Plunderer
Swindle
Preparation
Kazakus
Filletfighter
Dread Corsair
Cutting Class
Windchill
Zilliax
Swordfish
Sphere of Sapience
Thrive in the Shadows
Ferocious Howl
Mo'arg Forgefiend
Y'Shaarj, Rage Unbound
Shadowstep
Sir Finley, Sea Guide
Buccaneer
Aquatic Form
Ambassador Faelin
Cutlass Courier
Shudderwock
Brilliant Macaw
Shard of the Naaru
Eureka!
Ancestor's Call
Scalding Geyser
Ancestral Spirit
Biology Project
Gone Fishin'
Toxfin
Tour Guide
Muckmorpher
Click-Clocker
Branching Paths
Investment Opportunity
Reno Jac

In [42]:
# Raw string so that \S is read as a regex token rather than a string escape.
statistics = soup.find_all('div', {'aria-describedby':re.compile(r'table1-row\S+ table1-column0')})
percent_list = []
print(len(statistics))
print(len(titles))
for i in statistics:
    print(i.text)
    percent_list.append(i.text)

1610
1610
13.4%
13.2%
13.2%
12.6%
11.0%
10.8%
10.3%
10.0%
10.0%
9.8%
9.4%
9.2%
9.2%
8.6%
8.5%
8.4%
8.4%
8.1%
7.9%
7.8%
7.7%
7.6%
7.3%
7.3%
7.3%
7.3%
7.2%
7.2%
7.2%
7.2%
6.9%
6.9%
6.8%
6.8%
6.6%
6.5%
6.5%
6.5%
6.4%
6.3%
6.3%
6.2%
6.2%
6.2%
6.2%
6.1%
6.0%
5.9%
5.8%
5.8%
5.7%
5.7%
5.6%
5.5%
5.3%
5.3%
5.3%
5.2%
5.2%
5.2%
5.1%
5.0%
5.0%
5.0%
5.0%
4.9%
4.9%
4.9%
4.9%
4.8%
4.8%
4.8%
4.8%
4.8%
4.8%
4.7%
4.7%
4.7%
4.6%
4.6%
4.6%
4.5%
4.5%
4.5%
4.5%
4.5%
4.4%
4.4%
4.4%
4.4%
4.3%
4.3%
4.3%
4.3%
4.3%
4.2%
4.2%
4.1%
4.1%
4.0%
4.0%
3.9%
3.9%
3.9%
3.8%
3.8%
3.7%
3.7%
3.7%
3.7%
3.7%
3.6%
3.6%
3.5%
3.5%
3.5%
3.5%
3.4%
3.4%
3.4%
3.4%
3.4%
3.4%
3.4%
3.4%
3.4%
3.4%
3.3%
3.3%
3.3%
3.2%
3.2%
3.2%
3.2%
3.2%
3.2%
3.2%
3.2%
3.2%
3.1%
3.1%
3.1%
3.1%
3.1%
3.1%
3.0%
3.0%
3.0%
3.0%
3.0%
3.0%
3.0%
3.0%
2.9%
2.9%
2.9%
2.9%
2.9%
2.9%
2.9%
2.8%
2.8%
2.8%
2.8%
2.8%
2.8%
2.7%
2.7%
2.7%
2.7%
2.7%
2.7%
2.7%
2.7%
2.6%
2.6%
2.6%
2.6%
2.6%
2.6%
2.6%
2.6%
2.6%
2.6%
2.6%
2.6%
2.6%
2.5%
2.5%
2.5%
2.5%
2.5%
2.5%
2.5%
2.5%
2.5%
2

In [41]:
statistics.contents


AttributeError: ResultSet object has no attribute 'contents'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [54]:
df = pd.DataFrame()
df['Name'] = title_list
df['Inclusion Rate String'] = percent_list
df['Inclusion Rate'] = df['Inclusion Rate String'].str.rstrip('%').astype('float') / 100.0
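
The dataframe above relies on `title_list` and `percent_list` lining up by position, and on every rate string ending in `%`. As a rough sketch of the same conversion in plain Python, with a length check added, the helpers below may be useful when a scrape comes back incomplete. The function names `percent_to_fraction` and `pair_names_with_rates` are hypothetical and are not used elsewhere in this notebook.

```python
def percent_to_fraction(text):
    """Convert a string such as '13.4%' to a fraction such as 0.134."""
    return float(text.strip().rstrip('%')) / 100.0


def pair_names_with_rates(names, percents):
    """Pair card names with inclusion rates by position.

    Raises ValueError when the two lists differ in length, since a
    positional pairing would then attach rates to the wrong cards.
    """
    if len(names) != len(percents):
        raise ValueError(
            f"{len(names)} names but {len(percents)} rates; the lists do not line up"
        )
    return [(name, percent_to_fraction(p)) for name, p in zip(names, percents)]
```

For example, `pair_names_with_rates(title_list, percent_list)` would give a list of `(name, rate)` tuples that could be passed to `pd.DataFrame` with `columns=['Name', 'Inclusion Rate']`.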

In [56]:
df

Unnamed: 0,Name,Inclusion Rate String,Inclusion Rate
0,Mutanus the Devourer,13.4%,0.134
1,Lightning Bloom,13.2%,0.132
2,Zephrys the Great,13.2%,0.132
3,Dirty Rat,12.6%,0.126
4,Secret Passage,11.0%,0.110
...,...,...,...
1605,Iceblood Garrison,0.1%,0.001
1606,Serpent Wig,0.10%,0.001
1607,Jungle Giants,0.10%,0.001
1608,Grave Defiler,0.10%,0.001
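
Since the scrape is slow and does not always succeed, it helps to save the scraped lists once and reload them later, as mentioned earlier. The sketch below shows one simple way to do that with one item per line. It assumes that no card name contains a line break; `save_lines` and `load_lines` are hypothetical helper names, and the file name in the usage note is only an example.

```python
from pathlib import Path


def save_lines(path, items):
    """Write each item on its own line, UTF-8 encoded."""
    Path(path).write_text("".join(f"{item}\n" for item in items), encoding="utf-8")


def load_lines(path):
    """Read the file back into a list of strings, one per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()
```

A call such as `save_lines('card_names.txt', title_list)` after a successful scrape, followed by `title_list = load_lines('card_names.txt')` in a later session, avoids repeating the Selenium steps.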


In [83]:
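import os
from urllib.parse import quote

def getCardAttributes(card_name):
    # Percent-encode the name so spaces and punctuation are safe in the URL path.
    formatted_name = quote(card_name, safe="")
    url = f"https://omgvamp-hearthstone-v1.p.rapidapi.com/cards/{formatted_name}"

    headers = {
        # Assumption: the key is read from a RAPIDAPI_KEY environment variable
        # instead of being hard-coded in the notebook.
        "X-RapidAPI-Key": os.environ["RAPIDAPI_KEY"],
        "X-RapidAPI-Host": "omgvamp-hearthstone-v1.p.rapidapi.com"
    }

    response = requests.get(url, headers=headers).json()

    # The API returns a list of matching cards; use the first one.
    return response[0]['cardId']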

In [84]:
getCardAttributes('Dirty Rat')

CFM_790
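
Many card names in the scraped list contain apostrophes, commas, or exclamation marks (for example "Ship's Cannon" or "Eureka!"), so encoding only the spaces may not be enough for every lookup. The sketch below builds the lookup URL with the standard library's `quote` and reads the card ID defensively. It assumes that a failed lookup returns something other than a non-empty list; `card_lookup_url` and `first_card_id` are hypothetical helper names, and the exact error format of the API is not confirmed here.

```python
from urllib.parse import quote

API_BASE = "https://omgvamp-hearthstone-v1.p.rapidapi.com/cards/"


def card_lookup_url(card_name):
    """Build the lookup URL, percent-encoding every reserved character."""
    return API_BASE + quote(card_name, safe="")


def first_card_id(payload):
    """Return the cardId of the first match, or None if there is no usable match."""
    if isinstance(payload, list) and payload and isinstance(payload[0], dict):
        return payload[0].get("cardId")
    return None
```

Returning `None` instead of raising lets a loop over all 1,610 names continue and collect the failures for a later retry.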


Set className to the desired class.

In [18]:
className = 'WARRIOR'
pageNum = 1
driver.get(f'https://hsreplay.net/decks/#playerClasses={className}&page={pageNum}')

In [19]:
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

In [20]:
table = soup.find_all(class_='deck-tile')
for i in table:
    print(i['href'])
    gc = i.find(class_='game-count')
    print(gc.text)


/decks/RK3mmBH6tZxTyqYkirETwg/#gameType=RANKED_STANDARD
48,000
/decks/z5jwryCTWVRyZXKDolBdyg/#gameType=RANKED_STANDARD
26,000
/decks/xebhDnpxspFxPIFr47fGlc/#gameType=RANKED_STANDARD
8,500
/decks/nxZdsVWCt6jY6LJ2agYQ6c/#gameType=RANKED_STANDARD
6,100
/decks/7RdBcG9jwvvbXq89wjo4jd/#gameType=RANKED_STANDARD
4,500
/decks/KjxLS8kFwKSOb9b0ulURzh/#gameType=RANKED_STANDARD
4,200
/decks/J9GMaWunH0cGo5cPiE0mHf/#gameType=RANKED_STANDARD
4,000
/decks/Po4RPXs2Hp3LkUyekw0oTh/#gameType=RANKED_STANDARD
3,500
/decks/QqpLftnkOnFVkywsgktRub/#gameType=RANKED_STANDARD
3,400
/decks/8T0UnrAVXMDvuEtzEXaz1f/#gameType=RANKED_STANDARD
3,000
/decks/aHTqfuhibhR0ZOlOwll73f/#gameType=RANKED_STANDARD
2,900
/decks/Y0uHbZEqakXgl5OcBZIKyd/#gameType=RANKED_STANDARD
1,900
/decks/zUCCPrOU85be32O8WwuS1f/#gameType=RANKED_STANDARD
1,900
/decks/pXiLBVV1366HyZAwW4uPQ/#gameType=RANKED_STANDARD
1,800
/decks/80cVuyy2vPZXovpc9C5olh/#gameType=RANKED_STANDARD
1,500
/decks/ABePTzT6i7VJP6g9yTZvvh/#gameType=RANKED_STANDARD
1,400
/decks/
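
The printed values above are still raw strings: an href with a trailing `#gameType=...` fragment and a game count with thousands separators. As a small sketch of how they might be turned into usable fields, the helpers below pull the deck ID out of the href and convert the count to an integer. The names `deck_id_from_href` and `parse_game_count` are hypothetical, and the sketch assumes the href and count formats match the output shown above.

```python
import re

# Matches hrefs of the form /decks/<id>/... and captures <id>.
DECK_HREF = re.compile(r"^/decks/([^/]+)/")


def deck_id_from_href(href):
    """Return the deck ID from an href, or None if the href has no ID."""
    match = DECK_HREF.match(href)
    return match.group(1) if match else None


def parse_game_count(text):
    """Convert a count such as '48,000' to the integer 48000."""
    return int(text.strip().replace(",", ""))
```

Inside the loop above, `deck_id_from_href(i['href'])` and `parse_game_count(gc.text)` could replace the two `print` calls to build a list of `(deck_id, games)` pairs.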

In [21]:
def scroll_down(driver, pause=2):
    """Scroll to the bottom of the page until its height stops growing."""

    # Get scroll height.
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to the bottom.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load the page.
        time.sleep(pause)

        # Calculate new scroll height and compare with last scroll height.
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height