### Cards in each set
This notebook extends the set and card count table with one row for each card in a set. The table lets us route to pricecharting faster when looking up the price of a card. If, say, we enter "Lugia 9/111", we can search this table for "Lugia 9" among the sets with 111 cards, instead of sending a request to every pricecharting url that corresponds to a set with 111 cards.

We still want to use pricecharting for pricing (for now), since they have data on reverse holo cards, shadowless, first edition, etc.

We also scrape the picture for each card in case more than one "Houndoom H11/H32" exists (which it does). We can then show the user the pictures and ask, "Is your card A or B?"
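
As a rough sketch of the lookup described above, a query such as "Lugia 9/111" can first be split into a card name, a card number, and a set size. The helper name `parse_card_query` and the regex are assumptions made for illustration and are not part of this notebook; the pattern also accepts lettered numbers such as "H11/H32".

```python
import re

# Hypothetical helper (not part of the notebook): split "Lugia 9/111" into parts.
# Assumes the query ends with "<card number>/<set size>", where either side may
# carry a letter prefix such as "H11/H32".
QUERY_RE = re.compile(
    r'^\s*(?P<name>.+?)\s+(?P<num>[A-Za-z]*\d+)\s*/\s*(?P<total>[A-Za-z]*\d+)\s*$'
)


def parse_card_query(query):
    """Return (name, card_number, set_size) as strings, or None if nothing matches."""
    m = QUERY_RE.match(query)
    if m is None:
        return None
    return m.group('name'), m.group('num'), m.group('total')


print(parse_card_query('Lugia 9/111'))
print(parse_card_query('Houndoom H11/H32'))
```

With the three parts in hand, the table can be filtered on the card name and card number for the sets that have that many cards.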

In [1]:
# imports
import requests
from bs4 import BeautifulSoup
import re
import numpy as np
import pandas as pd
import warnings
import shutil
import os
warnings.filterwarnings('ignore')

In [2]:
url_sets = 'https://www.pokellector.com/sets/'  # pokellector sets page
r = requests.get(url_sets, verify=False)
soup = BeautifulSoup(r.text, "html.parser")

# we want to match urls of the form pokellector.com/sets/{extension}-{set name}
ext_set_dict = {}
for el in soup.select('.button'):  # each set is class "button"
    if el['title'][-3:].lower() == 'set':
        ext_set_dict[el['name']] = el['title'][:-4]
    else:
        ext_set_dict[el['name']] = el['title']

# create df with ext and set name to run down and grab the correct urls. At each url, we find pics,
# names and card #s for every card (save a select few we deal with later). We have 1 row for each set
pokellector_df = pd.DataFrame.from_dict(ext_set_dict, orient='index', columns=['Set Name']).reset_index().rename(columns={'index': 'Extension'})
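# build the url slug: spaces become hyphens and apostrophes are dropped. Both steps go through
# .str.replace, since a plain Series.replace only matches whole cell values
pokellector_df['URL Ext Full'] = pokellector_df['Extension'] + '-' + pokellector_df['Set Name'].str.replace(' ', '-', regex=False).str.replace("'", "", regex=False)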
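
The slug rule used for `URL Ext Full` can be sketched on plain strings. The function name `build_url_ext` is hypothetical, and the set name containing an apostrophe is a made-up example rather than a set taken from the site.

```python
# Hypothetical helper (not part of the notebook) showing the slug rule on plain strings.
def build_url_ext(extension, set_name):
    """Join extension and set name; spaces become hyphens, apostrophes are dropped."""
    return extension + '-' + set_name.replace(' ', '-').replace("'", '')


print(build_url_ext('SWSH7', 'Evolving Skies'))
print(build_url_ext('XX', "Trainer's Kit"))  # made-up extension and set name
```

On a pandas column the same substring replacement goes through the `.str` accessor, because `Series.replace` without `regex=True` compares whole cell values.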

In [95]:
'''
final df with set name, card and card # to allow for easy querying when a user enters something
like "Lugia 9/111". We use 111 to get the list of sets (from bulbapedia), then use pokellector to
determine if there is a "Lugia 9" in each of the sets based on the table created below.

We grab the picture from pokellector, and can grab prices from either pricecharting or pokellector.
Pricecharting has loose (avg ebay) prices as well as PSA/BGS graded prices starting at PSA 7.
Pokellector has avg ebay prices, tcg player prices, and troll & toad prices.
'''

full_df = pd.DataFrame()

for i in range(len(pokellector_df)):
    ext = pokellector_df.iloc[i]['URL Ext Full']
    url = url_sets + ext
    r_tmp = requests.get(url, verify=False)
    soup_tmp = BeautifulSoup(r_tmp.text, "html.parser")
    
    # card # and name live in the element with class "plaque". Split on the first hyphen only,
    # so hyphenated names such as "Ho-Oh" stay intact
    tmp_name_list = [el.text.split('-', 1) for el in soup_tmp.select('.card > .plaque')]
    # scrape the image source url
    tmp_pic_list = [el['data-src'] for el in soup_tmp.select('.card > a > img')]
    # pair each [card #, name] list with its picture url in a tuple
    tmp_name_and_pic_list = list(zip(tmp_name_list, tmp_pic_list))
    
    # create a dict with the card # as the key (unique in each set), and a tuple of values -
    # (name, picture url)
    card_and_num_dict = {}
    for list_ in tmp_name_and_pic_list:
        if len(list_[0]) > 1:  # energies are picked up from some sets, with no card #s. We ignore these
            card_and_num_dict[list_[0][0].strip().replace('#', '')] = (list_[0][1].strip(), list_[1])
        else:  # basic error handling
            pass
    
    # create a df for each set as we iterate through the sets. The df has 3 columns - card name,
    # card #, and picture url. Concatenate this df with the "full_df"
    df_tmp = pd.DataFrame.from_dict(card_and_num_dict, orient='index', columns=['Card Name', 'Picture']).reset_index().rename(columns={'index': 'Card Number'})
    df_tmp['Set Name'] = pokellector_df.iloc[i]['Set Name']
    df_tmp['Extension'] = pokellector_df.iloc[i]['Extension']
    df_tmp['URL Ext Full'] = pokellector_df.iloc[i]['URL Ext Full']
    
    full_df = pd.concat([full_df, df_tmp])

full_df.to_excel('Card_Num_Set_pokellector.xlsx')  # save to excel for future use
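
The plaque parsing in the loop above can be illustrated in isolation. This is a minimal sketch that assumes the plaque text looks like "#1 - Pinsir", which is inferred from how the loop strips "#" and splits on "-"; the helper name `split_plaque` and the card numbers in the examples are hypothetical.

```python
# Hypothetical helper (not part of the notebook): parse one plaque string.
# Assumed layout: "#<card number> - <card name>".
def split_plaque(text):
    """Return (card_number, card_name), or None when the text has no card number."""
    parts = text.split('-', 1)  # first hyphen only, so "Ho-Oh" is not cut in two
    if len(parts) < 2:
        return None  # e.g. an energy card listed without a number
    number = parts[0].strip().replace('#', '')
    name = parts[1].strip()
    return number, name


print(split_plaque('#1 - Pinsir'))
print(split_plaque('#7 - Ho-Oh'))
print(split_plaque('Grass Energy'))
```

Limiting the split to the first hyphen keeps the card number and the full name as exactly two pieces, whatever the name contains.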

In [2]:
# read in the df with card name, # and picture url
df = pd.read_excel('Card_Num_Set_pokellector.xlsx', index_col=0)

In [3]:
df.head()

Unnamed: 0,Card Number,Card Name,Picture,Set Name,Extension,URL Ext Full
0,1,Pinsir,https://den-cards.pokellector.com/325/Pinsir.S...,Evolving Skies,SWSH7,SWSH7-Evolving-Skies
1,2,Hoppip,https://den-cards.pokellector.com/325/Hoppip.S...,Evolving Skies,SWSH7,SWSH7-Evolving-Skies
2,3,Skiploom,https://den-cards.pokellector.com/325/Skiploom...,Evolving Skies,SWSH7,SWSH7-Evolving-Skies
3,4,Jumpluff,https://den-cards.pokellector.com/325/Jumpluff...,Evolving Skies,SWSH7,SWSH7-Evolving-Skies
4,5,Seedot,https://den-cards.pokellector.com/325/Seedot.S...,Evolving Skies,SWSH7,SWSH7-Evolving-Skies


In [17]:
# create a folder for each set to store the card images

# sets = list(df['Set Name'].unique())
# for set_ in sets:
#     os.makedirs(f'Images/{set_}', exist_ok=True)  # no error if the folder already exists

In [67]:
# run through the df loaded above to grab the picture urls and save them to the proper folder with
# a filename of the form "<name>_<num>.png"

src_df = df  # swap in df_tmp (built in the next cell) to re-run only the sets that were missed
for i in range(len(src_df)):
    # replace chars that create errors when trying to save images
    name = src_df.iloc[i]['Card Name'].replace('?', '').replace('*', 'Star')
    num = src_df.iloc[i]['Card Number']
    set_ = src_df.iloc[i]['Set Name']
    pic = src_df.iloc[i]['Picture']
    
    r_pic = requests.get(pic, verify=False, stream=True)
    with open(f'Images/{set_}/{name}_{num}.png', 'wb') as f:
        try:
            r_pic.raw.decode_content = True
            shutil.copyfileobj(r_pic.raw, f)
        except Exception:  # skip cards whose image fails to copy
            pass
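
The loop above opens the output file before the copy is attempted, so a failed copy can leave an empty .png behind. One possible way around that, sketched here under the assumption that any file-like object (such as `r_pic.raw`) is passed in, is to write to a temporary file and rename it only on success. Both `safe_image_path` and `save_stream` are hypothetical helpers, not part of this notebook.

```python
import io
import os
import shutil
import tempfile


def safe_image_path(set_name, card_name, card_number, root='Images'):
    """Build "<root>/<set>/<name>_<num>.png" with the same character swaps as the loop."""
    name = card_name.replace('?', '').replace('*', 'Star')
    return f'{root}/{set_name}/{name}_{card_number}.png'


def save_stream(raw, path):
    """Copy a file-like object to path via a temp file, so failures leave no partial file."""
    folder = os.path.dirname(path) or '.'
    os.makedirs(folder, exist_ok=True)  # create the set folder on demand
    fd, tmp_path = tempfile.mkstemp(dir=folder)
    try:
        with os.fdopen(fd, 'wb') as f:
            shutil.copyfileobj(raw, f)
        os.replace(tmp_path, path)  # move into place only after a full copy
    except Exception:
        os.remove(tmp_path)
        raise


# Demo with in-memory bytes standing in for a downloaded image.
with tempfile.TemporaryDirectory() as demo_root:
    demo_path = safe_image_path('Evolving Skies', 'Pinsir', 1, root=demo_root)
    save_stream(io.BytesIO(b'fake image bytes'), demo_path)
    print(os.path.basename(demo_path), os.path.getsize(demo_path))
```

In the download loop, the call would look like `save_stream(r_pic.raw, safe_image_path(set_, name, num))` after setting `r_pic.raw.decode_content = True`.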

In [68]:
# for whatever reason, these sets weren't picked up on the initial loop
missing_sets = ['EX Crystal Guardians', 'EX Delta Species', 'EX Deoxys', 'EX Dragon Frontiers',
                'EX Power Keepers', 'EX Emerald', 'EX FireRed & LeafGreen', 'EX Legend Maker',
                'Great Encounters', 'Legends Awakened', 'Majestic Dawn', 'Mysterious Treasures', 
                'Nintendo Promos', 'Platinum', 'Secret Wonders', 'Stormfront']

# filter df down to the sets listed here
retry_sets = ['EX Emerald', 'EX FireRed & LeafGreen', 'EX Legend Maker', 'EX Power Keepers',
              'Mysterious Treasures', 'Secret Wonders'] + [f'POP Series {n}' for n in range(1, 10)]
df_tmp = df[df['Set Name'].isin(retry_sets)]

In [24]:
# there are a few cards in Aquapolis and Skyridge (H##/H32) that don't have images on pokellector.
# Instead, we go to pkmncards.com to scrape the images

# the "H" (holo) cards live on pages 3 and 4
url_aquapolis_3 = 'https://pkmncards.com/page/3/?s=aquapolis'
url_aquapolis_4 = 'https://pkmncards.com/page/4/?s=aquapolis'
url_skyridge_3 = 'https://pkmncards.com/page/3/?s=skyridge'
url_skyridge_4 = 'https://pkmncards.com/page/4/?s=skyridge'
urls_aqua_sky = [url_aquapolis_3, url_aquapolis_4, url_skyridge_3, url_skyridge_4]

for url in urls_aqua_sky:
    r_H = requests.get(url, verify=False)
    soup_H = BeautifulSoup(r_H.text, "html.parser")
    
    initial_names = [el['title'] for el in soup_H.select('.type-pkmn_card > div > a')]
    initial_pics = [el['src'] for el in soup_H.select('.type-pkmn_card > div > a > img')]

    # down-select to "H" cards
    H_names = [el for el in initial_names if '#H' in el]
    re_pics = re.compile(r'.*/h\d\d.*')  # raw-string regex to pick up "/h##"
    H_pics = [el for el in initial_pics if re.match(re_pics, el)]

    names = [el.split('·')[0].strip() for el in H_names]
    H_nums = [el.split('·')[1].split('#')[1] for el in H_names]

    set_tmp = [el.split()[2] for el in H_names][0]

    for i in range(len(H_pics)):
        r_pic = requests.get(H_pics[i], verify=False, stream=True)
        with open(f'Images/{set_tmp}/{names[i]}_{H_nums[i]}.png', 'wb') as f:
            try:
                r_pic.raw.decode_content = True
                shutil.copyfileobj(r_pic.raw, f)
            except:
                pass
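
The title and image filters in the cell above can be sketched on plain strings. The title layout "name · set #number" is an assumption inferred from the splits used in that cell, and the title and URLs below are made-up examples (the URLs use example.com on purpose). The names `parse_h_title` and `is_h_pic` are hypothetical.

```python
import re

H_PIC_RE = re.compile(r'/h\d\d')  # "/h" followed by two digits, anywhere in the url


def parse_h_title(title):
    """Split an assumed "name · set #number" title into (name, set_name, number)."""
    name, rest = [part.strip() for part in title.split('·', 1)]
    set_part, number = rest.split('#', 1)
    return name, set_part.strip(), number.strip()


def is_h_pic(url):
    """True when the image url contains an "/h##" segment."""
    return H_PIC_RE.search(url) is not None


print(parse_h_title('Houndoom · Aquapolis #H11'))
print(is_h_pic('https://example.com/aquapolis/h11.jpg'))
print(is_h_pic('https://example.com/aquapolis/011.jpg'))
```

Because names and pictures are filtered as two separate lists, the pairing by index only holds if both filters keep the same cards in the same order; reading the title and the image from the same element would remove that dependency.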

In [19]:
set_df = pd.DataFrame.from_dict(new_dict, orient='index', columns=['Card Number']).reset_index().rename(columns={'index': 'Card Name'})
set_df['Set Name'] = 'Battle Styles'
pd.merge(pokellector_df, set_df, how='left', on='Set Name')

Unnamed: 0,Extension,Set Name,URL Ext Full,Card Name,Card Number
0,SWSH7,Evolving Skies,SWSH7-Evolving-Skies,,
1,CRE,Chilling Reign,CRE-Chilling-Reign,,
2,SWSH05,Battle Styles,SWSH05-Battle-Styles,Bellsprout,1
3,SWSH05,Battle Styles,SWSH05-Battle-Styles,Weepinbell,2
4,SWSH05,Battle Styles,SWSH05-Battle-Styles,Victreebel,3
...,...,...,...,...,...
260,B2,Base Set 2,B2-Base-Set-2,,
261,FO,Fossil,FO-Fossil,,
262,WOTC,Wizards of the Coast Promos,WOTC-Wizards-of-the-Coast-Promos,,
263,JU,Jungle,JU-Jungle,,


In [27]:
pic = soup2.select('.card')[0].img['data-src']

In [30]:
soup2.select('.card')[0].img['data-src']

'https://den-cards.pokellector.com/305/Bellsprout.SWSH05.1.37528.thumb.png'

In [37]:
# save the picture
r3 = requests.get(pic, verify=False, stream=True)
with open('Images/bellsprout.png', 'wb') as f:
    r3.raw.decode_content = True
    shutil.copyfileobj(r3.raw, f)