In [1]:
import pandas as pd
import psycopg2
from configparser import ConfigParser

## __About__
This 'sub-project' is about auditing and verifying all the data I'm going to use in future projects for my Pokemon MEGA Bank.

In [2]:
'''
Copied my config from my 'database-upload' file for ease-of-access.
'''

def config(filename="database.ini", section="postgresql"): # keeping filename for portability, changing later
    parser = ConfigParser() # creating parser
    parser.read(filename) # reading the .ini file
    db = {} # empty dictionary for database

    if parser.has_section(section): # checking if a config section exists
        params = parser.items(section) # fetching the section's key/value pairs
        for param in params: # reading every setting
            db[param[0]] = param[1] # applying these for later use
        
    else:
        raise Exception("Section {0} not found in the {1} file".format(section, filename))
    
    try:
        conn = psycopg2.connect(**db) # connecting to the db by unpacking the dictionary into keyword arguments
        print("Database connected successfully.")
    except Exception:
        print("Database not connected successfully.")
        raise

    return conn
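
The parsing half of `config` can be tried on its own, without a database. The following is a minimal sketch, assuming an INI layout with a `[postgresql]` section; the key names and values (`host`, `database`, `user`, `password`) are placeholders, not the real contents of database.ini, and `read_section` is a hypothetical helper name.

```python
from configparser import ConfigParser

# Hypothetical file contents; the real database.ini is not shown in this notebook.
SAMPLE_INI = """
[postgresql]
host = localhost
database = example_db
user = example_user
password = example_password
"""

def read_section(ini_text, section="postgresql"):
    """Return the key/value pairs of one INI section as a plain dict."""
    parser = ConfigParser()
    parser.read_string(ini_text)  # same parser as config(), but reading from a string
    if not parser.has_section(section):
        raise KeyError(f"Section {section} not found")
    return dict(parser.items(section))

db_params = read_section(SAMPLE_INI)
print(sorted(db_params))  # these keys become keyword arguments in psycopg2.connect(**db)
```

Because the dictionary is unpacked straight into `psycopg2.connect`, each key in the section has to be a parameter name the driver accepts.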

In [3]:
conn = config(filename=r"C:\Users\Jjoer\GitHub\Pokemon Stocks\database.ini") # connecting to db

df_cards = pd.read_sql_query("SELECT * FROM cards;", conn)
df_prices = pd.read_sql_query("SELECT * FROM prices;", conn)

conn.close() # closing the connection


Database connected successfully.




---

### __'Cards' table audit__

The following steps are to verify that the 'cards' table is working as intended.

In [4]:
df_cards.head(10)

Unnamed: 0,card_id,name,supertype,subtypes,set_name,series,card_number,printed_total,artist,rarity
0,hgss4-1,Aggron,Pokémon,[Stage 2],HS—Triumphant,HeartGold & SoulSilver,1,102,Kagemaru Himeno,Rare Holo
1,xy5-1,Weedle,Pokémon,[Basic],Primal Clash,XY,1,160,Midori Harada,Common
2,pl1-1,Ampharos,Pokémon,[Stage 2],Platinum,Platinum,1,127,Atsuko Nishida,Rare Holo
3,dp3-1,Ampharos,Pokémon,[Stage 2],Secret Wonders,Diamond & Pearl,1,132,Kouki Saitou,Rare Holo
4,det1-1,Bulbasaur,Pokémon,[Basic],Detective Pikachu,Sun & Moon,1,18,MPC Film,Common
5,dv1-1,Dratini,Pokémon,[Basic],Dragon Vault,Black & White,1,20,Masakazu Fukuda,Rare Holo
6,mcd19-1,Caterpie,Pokémon,[Basic],McDonald's Collection 2019,Other,1,12,Sekio,
7,pl3-1,Absol G,Pokémon,"[Basic, SP]",Supreme Victors,Platinum,1,147,Yusuke Ishikawa,Rare Holo
8,ex12-1,Aerodactyl,Pokémon,[Stage 1],Legend Maker,EX,1,92,Hajime Kusajima,Rare Holo
9,ex3-1,Absol,Pokémon,[Basic],Dragon,EX,1,97,Naoyo Kimura,Rare Holo


In [5]:
df_cards.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19818 entries, 0 to 19817
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   card_id        19818 non-null  object
 1   name           19818 non-null  object
 2   supertype      19818 non-null  object
 3   subtypes       19818 non-null  object
 4   set_name       19818 non-null  object
 5   series         19818 non-null  object
 6   card_number    19818 non-null  object
 7   printed_total  19818 non-null  object
 8   artist         18679 non-null  object
 9   rarity         19515 non-null  object
dtypes: object(10)
memory usage: 1.5+ MB


---

### Duplicates; Null and Missing Values

So far, we can tell there are missing values in both artist and rarity and nowhere else.  We'll check for duplicates first and then tally those nulls up; given the size of the database, I expect there to be few duplicates, if any.

In [6]:
duplicates = df_cards[df_cards.duplicated(subset="card_id")]
print(duplicates)

Empty DataFrame
Columns: [card_id, name, supertype, subtypes, set_name, series, card_number, printed_total, artist, rarity]
Index: []


In [7]:
null_values = df_cards.isna().sum()
print(null_values)

card_id             0
name                0
supertype           0
subtypes            0
set_name            0
series              0
card_number         0
printed_total       0
artist           1139
rarity            303
dtype: int64


##### Result:
No duplicates on 'card_id'.  That means that my card_id is reliably unique, future merges using the card_id will be safe, and our ingestion was clean.  There are, however, 1,139 missing values in artist and 303 in rarity.  We can assume these come from the cards themselves (some cards, such as promos, were never given an artist or a rarity) and are not indicative of poor db entry or error.  So, instead, we'll replace the nulls with 'Unknown Artist' and 'Unknown Rarity' and move on.
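
One way to have pandas enforce that uniqueness at merge time is the `validate` argument of `DataFrame.merge`. The following is a minimal sketch on toy frames standing in for df_cards and df_prices; the rows are illustrative only and this is not a step the notebook itself runs.

```python
import pandas as pd

# Toy stand-ins for df_cards and df_prices; the values are illustrative only.
cards = pd.DataFrame({
    "card_id": ["a-1", "b-1", "c-1"],
    "name": ["Aggron", "Weedle", "Absol"],
})
prices = pd.DataFrame({
    "card_id": ["a-1", "a-1", "b-1", "z-9"],
    "market_price": [3.17, 4.28, 0.12, 1.00],
})

# validate="one_to_many" raises pandas.errors.MergeError if card_id repeats in `cards`.
# indicator=True adds a `_merge` column saying which table each row came from.
merged = cards.merge(
    prices, on="card_id", how="outer", validate="one_to_many", indicator=True
)

# Rows present in only one table: cards without prices, or prices without a card.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["card_id", "_merge"]])
```

The outer join with `indicator=True` is a design choice for auditing: it keeps unmatched rows visible instead of silently dropping them, as an inner join would.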

In [8]:
replacement_values = {"artist": "Unknown Artist", "rarity": "Unknown Rarity"}
df_replace_artist_and_rarity = df_cards.fillna(value=replacement_values)
show_df_replace_artist_and_rarity = df_replace_artist_and_rarity.isna().sum()

print(show_df_replace_artist_and_rarity)

card_id          0
name             0
supertype        0
subtypes         0
set_name         0
series           0
card_number      0
printed_total    0
artist           0
rarity           0
dtype: int64


---

### Merging and/or Dropping

Merging columns could be useful for record keeping, but that's a thought for the future.  Subtypes are not relevant to any EDA that we're planning, so we'll drop that column.

In [9]:
df_cleaned_cards = df_replace_artist_and_rarity.drop(columns=['subtypes'])

df_cleaned_cards.head(10)

Unnamed: 0,card_id,name,supertype,set_name,series,card_number,printed_total,artist,rarity
0,hgss4-1,Aggron,Pokémon,HS—Triumphant,HeartGold & SoulSilver,1,102,Kagemaru Himeno,Rare Holo
1,xy5-1,Weedle,Pokémon,Primal Clash,XY,1,160,Midori Harada,Common
2,pl1-1,Ampharos,Pokémon,Platinum,Platinum,1,127,Atsuko Nishida,Rare Holo
3,dp3-1,Ampharos,Pokémon,Secret Wonders,Diamond & Pearl,1,132,Kouki Saitou,Rare Holo
4,det1-1,Bulbasaur,Pokémon,Detective Pikachu,Sun & Moon,1,18,MPC Film,Common
5,dv1-1,Dratini,Pokémon,Dragon Vault,Black & White,1,20,Masakazu Fukuda,Rare Holo
6,mcd19-1,Caterpie,Pokémon,McDonald's Collection 2019,Other,1,12,Sekio,Unknown Rarity
7,pl3-1,Absol G,Pokémon,Supreme Victors,Platinum,1,147,Yusuke Ishikawa,Rare Holo
8,ex12-1,Aerodactyl,Pokémon,Legend Maker,EX,1,92,Hajime Kusajima,Rare Holo
9,ex3-1,Absol,Pokémon,Dragon,EX,1,97,Naoyo Kimura,Rare Holo


---

### __'Prices' table audit__

The following steps are to verify that the 'prices' table is working as intended.

In [10]:
df_prices.head(10)

Unnamed: 0,price_id,card_id,source,variant,condition_txt,updated_at,market_price,low_price,mid_price,high_price,raw_json,created_at
0,1,hgss4-1,tcgplayer,holofoil,Near Mint,2025-10-16,3.17,3.15,4.99,19.99,"{'low': 3.15, 'mid': 4.99, 'high': 19.99, 'mar...",2025-11-18 16:14:30.472139+00:00
1,2,hgss4-1,tcgplayer,reverseHolofoil,Near Mint,2025-10-16,4.28,2.0,3.98,9.99,"{'low': 2.0, 'mid': 3.98, 'high': 9.99, 'marke...",2025-11-18 16:14:30.472139+00:00
2,3,xy5-1,tcgplayer,normal,Near Mint,2025-10-16,0.12,0.01,0.19,1.49,"{'low': 0.01, 'mid': 0.19, 'high': 1.49, 'mark...",2025-11-18 16:14:30.472139+00:00
3,4,xy5-1,tcgplayer,reverseHolofoil,Near Mint,2025-10-16,0.6,0.19,0.49,1.59,"{'low': 0.19, 'mid': 0.49, 'high': 1.59, 'mark...",2025-11-18 16:14:30.472139+00:00
4,5,pl1-1,tcgplayer,holofoil,Near Mint,2025-10-16,14.34,5.39,14.51,35.0,"{'low': 5.39, 'mid': 14.51, 'high': 35.0, 'mar...",2025-11-18 16:14:30.472139+00:00
5,6,pl1-1,tcgplayer,reverseHolofoil,Near Mint,2025-10-16,9.17,10.0,12.99,13.98,"{'low': 10.0, 'mid': 12.99, 'high': 13.98, 'ma...",2025-11-18 16:14:30.472139+00:00
6,7,dp3-1,tcgplayer,holofoil,Near Mint,2025-10-16,19.65,10.02,19.86,39.99,"{'low': 10.02, 'mid': 19.86, 'high': 39.99, 'm...",2025-11-18 16:14:30.472139+00:00
7,8,dp3-1,tcgplayer,reverseHolofoil,Near Mint,2025-10-16,17.26,4.99,16.65,18.45,"{'low': 4.99, 'mid': 16.65, 'high': 18.45, 'ma...",2025-11-18 16:14:30.472139+00:00
8,9,det1-1,tcgplayer,holofoil,Near Mint,2025-10-16,0.78,0.08,0.56,5.03,"{'low': 0.08, 'mid': 0.56, 'high': 5.03, 'mark...",2025-11-18 16:14:30.472139+00:00
9,10,dv1-1,tcgplayer,holofoil,Near Mint,2025-10-16,2.38,1.0,2.25,6.58,"{'low': 1.0, 'mid': 2.25, 'high': 6.58, 'marke...",2025-11-18 16:14:30.472139+00:00


In [11]:
df_prices.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196236 entries, 0 to 196235
Data columns (total 12 columns):
 #   Column         Non-Null Count   Dtype              
---  ------         --------------   -----              
 0   price_id       196236 non-null  int64              
 1   card_id        196236 non-null  object             
 2   source         196236 non-null  object             
 3   variant        196236 non-null  object             
 4   condition_txt  196236 non-null  object             
 5   updated_at     196236 non-null  object             
 6   market_price   196023 non-null  float64            
 7   low_price      196235 non-null  float64            
 8   mid_price      196235 non-null  float64            
 9   high_price     196235 non-null  float64            
 10  raw_json       196236 non-null  object             
 11  created_at     196236 non-null  datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), float64(4), int64(1), object(6)
memory usage: 18.0+ MB


---

### Duplicates; Null and Missing Values

The same applies to the 'prices' table.  This time, however, we'll crosscheck duplicates on the 'card_id' with 'source', 'variant', and 'updated_at'.

In [12]:
prices_duplicates = df_prices.duplicated(subset=["card_id", "source", "variant", "updated_at"]).sum()
print(prices_duplicates)

0


In [13]:
null_price_values = df_prices.isna().sum()
print(null_price_values)

price_id           0
card_id            0
source             0
variant            0
condition_txt      0
updated_at         0
market_price     213
low_price          1
mid_price          1
high_price         1
raw_json           0
created_at         0
dtype: int64


##### Results:
There are no duplicates across 'card_id', 'source', 'variant', and 'updated_at', which means that we don't have any accidental duplicate ingestions or repeated pulls from the API.  Later, when we populate our database with scraped pulls, we'll be able to see if we have redundant rows from different sessions.
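
For those later sessions, a count alone won't show which rows collide. The following is a minimal sketch of a helper that returns every row sharing a composite key; `find_key_duplicates` is a hypothetical name and the sample rows are made up for illustration.

```python
import pandas as pd

def find_key_duplicates(df, key_cols):
    """Return every row whose combination of key_cols appears more than once."""
    # keep=False marks all members of a duplicate group, not just the later ones.
    mask = df.duplicated(subset=key_cols, keep=False)
    return df[mask].sort_values(key_cols)

# Toy data: the last row repeats the first row's key, as a second session might.
sample = pd.DataFrame({
    "card_id": ["xy5-1", "xy5-1", "pl1-1", "xy5-1"],
    "source": ["tcgplayer"] * 4,
    "variant": ["normal", "reverseHolofoil", "holofoil", "normal"],
    "updated_at": ["2025-10-16"] * 4,
    "market_price": [0.12, 0.60, 14.34, 0.12],
})

key = ["card_id", "source", "variant", "updated_at"]
repeats = find_key_duplicates(sample, key)
print(repeats)
```

Using `keep=False` shows both the original and the repeat side by side, which makes it easier to decide which one to keep.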

As far as NULLs are concerned, we'll leave the missing values for low, mid, and high, but replace the missing 'market_price' values with the 'mid_price', as it's the most representative of the remaining price columns.

In [14]:
df_prices_replace = df_prices.copy() # keeping the original data safe and sound by creating a copy

df_prices_replace['market_price'] = df_prices_replace['market_price'].fillna(df_prices_replace['mid_price']) # filling all missing 'market_price' with mid_price for accuracy's sake
df_prices_replace['market_price'].isna().sum()

np.int64(0)

There are no more rows with an empty 'market_price'.  All were filled and taken care of.
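
Once filled, an imputed 'market_price' looks the same as a real one. The following is a minimal sketch of recording which rows were filled, assuming a hypothetical boolean column named `market_price_imputed`; the toy values are illustrative and this step is not part of the notebook's pipeline.

```python
import pandas as pd

# Toy prices: row 1 lacks market_price, row 2 lacks both market_price and mid_price.
prices = pd.DataFrame({
    "market_price": [3.17, None, None],
    "mid_price": [4.99, 0.19, None],
})

# Flag rows that will actually receive a value from mid_price (hypothetical column name).
was_missing = prices["market_price"].isna()
prices["market_price_imputed"] = was_missing & prices["mid_price"].notna()

# Same fill as the notebook: take mid_price wherever market_price is missing.
prices["market_price"] = prices["market_price"].fillna(prices["mid_price"])

# A row stays empty only if mid_price was missing as well.
still_missing = int(prices["market_price"].isna().sum())
print(prices)
print(still_missing)
```

Keeping the flag lets later analysis include or exclude the filled rows without going back to the original table.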

---

### Merging and/or Dropping

Merging for this table is unnecessary.  Dropping columns, however, is necessary: source, condition_txt, raw_json, and created_at can all be dropped, as they are either redundant (source, condition_txt) or vestigial (raw_json, created_at).

In [15]:
df_dropped_prices = df_prices_replace.drop(columns=["source", "condition_txt", "raw_json", "created_at"])

df_dropped_prices.head(10)

Unnamed: 0,price_id,card_id,variant,updated_at,market_price,low_price,mid_price,high_price
0,1,hgss4-1,holofoil,2025-10-16,3.17,3.15,4.99,19.99
1,2,hgss4-1,reverseHolofoil,2025-10-16,4.28,2.0,3.98,9.99
2,3,xy5-1,normal,2025-10-16,0.12,0.01,0.19,1.49
3,4,xy5-1,reverseHolofoil,2025-10-16,0.6,0.19,0.49,1.59
4,5,pl1-1,holofoil,2025-10-16,14.34,5.39,14.51,35.0
5,6,pl1-1,reverseHolofoil,2025-10-16,9.17,10.0,12.99,13.98
6,7,dp3-1,holofoil,2025-10-16,19.65,10.02,19.86,39.99
7,8,dp3-1,reverseHolofoil,2025-10-16,17.26,4.99,16.65,18.45
8,9,det1-1,holofoil,2025-10-16,0.78,0.08,0.56,5.03
9,10,dv1-1,holofoil,2025-10-16,2.38,1.0,2.25,6.58


---