# Data Exploration for the World of Warcraft Avatar dataset

Author: Marco Hernani

First, we change the working directory so that we work where the files are located.

This dataset was taken from: https://www.kaggle.com/datasets/mylesoneill/warcraft-avatar-history

In [1]:
import os
import numpy as np
import pandas as pd

In [2]:
#Path where the files that contain the dataset are located
path = "Files"

In [3]:
os.chdir(path)
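
Changing the working directory has a side effect: running the chdir cell a second time would normally fail, because "Files" is then resolved relative to the new directory. A minimal sketch of an alternative that builds full paths instead, assuming the same "Files" folder (the helper name list_csv_files is hypothetical, not part of this notebook):

```python
from pathlib import Path


def list_csv_files(data_dir):
    """Return the sorted names of the CSV files found directly in data_dir."""
    return sorted(p.name for p in Path(data_dir).glob("*.csv"))


# Hypothetical usage: the working directory is left untouched.
# data_dir = Path("Files")
# print(list_csv_files(data_dir))
# df_location = pd.read_csv(data_dir / "locations.csv")
```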

We get the names of the files

In [4]:
os.listdir()

['locations.csv', 'location_coords.csv', 'wowah_data.csv', 'zones.csv']

We load one of the files and start exploring the data.

In [5]:
df_location = pd.read_csv("locations.csv")

In [6]:
df_location.head(10)

Unnamed: 0,Map_ID,Location_Type,Location_Name,Game_Version
0,0,Continent,Eastern Kingdoms,WoW
1,1,Continent,Kalimdor,WoW
2,530,Continent,Outlands,TBC
3,571,Continent,Northrend,WLK
4,646,Continent,Deepholm,CAT
5,732,Continent,Tol Barad,CAT
6,870,Continent,Pandaria,MoP
7,1064,Continent,Mogu Island Daily Area (Isle of Thunder),MoP
8,1116,Continent,Draenor,WoD
9,1191,Continent,Ashran,WoD


We get a sense of how many different values there are in each column.

In [7]:
for col in df_location:
    print(df_location[col].unique())
    print(f"The number of unique values for {col} is {len(df_location[col].unique())}")
    print("######")

[   0    1  530  571  646  732  870 1064 1116 1191 1265 1464  562  617
  559  572  618 1134  980   30  529 1105  566  968  628  727  607  998
  761  726  489 1152 1330 1153 1154 1158 1331 1159 1160   48  230  229
  429   90  349  389  129   47 1004 1007   33  329   36   34  109   70
   43  209  469  409  509  531  558  543  585  557  560  556  555  552
  269  542  553  554  540  547  545  546  564  565  534  532  544  548
  580  550  619  601  600  604  602  668  599  658  595  632  576  578
  608  650  574  575  631  533  249  616  615  724  649  603  624  645
  938  670  644  940  755  725  657  643  939  568  859  757  669  967
  720  671  754  962  994  959 1011  961  960 1009 1008 1136  996 1098
 1182 1175 1208 1195 1176 1209 1279 1358 1228 1205 1448]
The number of unique values for Map_ID is 151
######
['Continent' 'Arena' 'Battleground' 'Garrison' 'Dungeon' 'Raid']
The number of unique values for Location_Type is 6
######
['Eastern Kingdoms' 'Kalimdor' 'Outlands' 'Northrend' 'De

Does NaN appear as a different value when unique is used?

In [8]:
df_location.isna().sum()

Map_ID           0
Location_Type    0
Location_Name    0
Game_Version     0
dtype: int64

There are no NaN values in this dataframe.
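
To answer the question above in general: unique() keeps NaN as one of its values, while nunique() ignores it unless dropna=False is passed. A small sketch on a toy series (the values are made up, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy series (not from the dataset) with one missing value.
s = pd.Series([1.0, 2.0, np.nan, 2.0])

print(s.unique())               # NaN is listed among the unique values
print(len(s.unique()))          # 3: NaN is counted
print(s.nunique())              # 2: NaN is dropped by default
print(s.nunique(dropna=False))  # 3: NaN is counted again
```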

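Reading the file with the latin-1 encoding was needed because it contains the byte 0xa0, which is the non-breaking space in latin-1 but is not a valid UTF-8 sequence on its own.
This character looks like the space character but behaves differently (for example, it is not equal to an ordinary space when strings are compared).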

In [9]:
df_loc_coords = pd.read_csv("location_coords.csv", encoding="latin-1")
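
The encoding issue can be reproduced on a few bytes. This is only a sketch on a made-up byte string, not on the actual file contents:

```python
raw = b"Some Place\xa0"  # toy bytes: a made-up name followed by the byte 0xa0

# latin-1 maps every byte to a character; 0xa0 becomes U+00A0 (non-breaking space)
text = raw.decode("latin-1")
print(repr(text))
print(text[-1] == "\u00a0")  # True
print(text[-1] == " ")       # False: it is not the ordinary space

# In UTF-8 a lone 0xa0 is a continuation byte, so decoding fails
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("utf-8 failed:", err.reason)
```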

In [10]:
df_loc_coords

Unnamed: 0,Location_Name,Map_ID,X_coord,Y_coord,Z_coord
0,Eastern Kingdom: Ironforge Airport,0.0,-4488.993311,-1580.19104,509.005066
1,Eastern Kingdom: Wetlands Mountain Camp,0.0,-3855.000000,-3479.00000,579.000000
2,Eastern Kingdom: Dun Morogh plane camp,0.0,-6161.000000,-786.00000,423.000000
3,Eastern Kingdom: Undercity,0.0,1831.260000,238.53000,60.520000
4,Eastern Kingdom: Stormwind City,0.0,-8913.230000,554.63300,93.794400
...,...,...,...,...,...
3804,Terokkar Forest/Allerian Stronghold,530.0,-2995.400000,3873.27000,9.724780
3805,Hellfire Peninsula/The Stair of Destiny,530.0,-323.810000,1027.61000,54.239900
3806,Shadowmoon Valley/Wildhammer Stronghold,530.0,-3980.970000,2156.29000,105.333000
3807,Blade's Edge Mountains/Toshley's Station,530.0,1860.730000,5528.27000,276.838000


In [11]:
df_loc_coords.isna().sum()

Location_Name    0
Map_ID           2
X_coord          0
Y_coord          0
Z_coord          0
dtype: int64

This file does contain nan values. What do we do with them?

In [12]:
df_loc_coords[df_loc_coords.isna().any(axis=1)] ##Selects all rows that have at least one nan value

Unnamed: 0,Location_Name,Map_ID,X_coord,Y_coord,Z_coord
2889,Azshara: The Shattered Strand,,4266.963379,-6277.396973,92.900871
2977,Durotar: Deadeye Shore,,918.715027,-5115.689941,-1.65438


Which values does the column Map_ID contain? Can we replace the values of this column in these 2 rows?

In [13]:
df_loc_coords.Map_ID.unique()

array([  0.,   1.,  30., 529., 530., 269., 230., 229., 469., 369., 429.,
       532., 349., 409., 533., 249., 389., 129.,  47., 189., 289.,  70.,
       209., 309., 559., 562., 560., 564., 550.,  33.,  36.,  37.,  43.,
        44.,  48.,  90., 109., 329., 489., 509., 531., 534., 540., 542.,
       543., 544., 545., 546., 547., 548., 552., 553., 554., 555., 556.,
       557., 558., 565., 566.,  17., 450., 571., 568., 599., 601., 574.,
       600., 578., 576., 602., 595.,  nan, 609., 572., 585., 580., 624.,
       615., 608., 604., 575.])

What type does Map_ID have in each of the location dataframes?

In [14]:
df_loc_coords.dtypes

Location_Name     object
Map_ID           float64
X_coord          float64
Y_coord          float64
Z_coord          float64
dtype: object

In [15]:
df_location.dtypes

Map_ID            int64
Location_Type    object
Location_Name    object
Game_Version     object
dtype: object

In one dataframe the column is int64 and in the other it is float64.

Reading the column descriptions on Kaggle, I now know that Map_ID is used as an id for zones in locations.csv,
and that in location_coords.csv it is a reference to that column of locations.csv. Because it is an id (a categorical value), it doesn't make sense for this column to hold real numbers.
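
The float64 type comes from the missing values: NaN is itself a float, so pandas stores the whole column as floats. As a side note, and not the route this notebook takes, pandas also offers a nullable "Int64" dtype that can hold integers together with missing values. A sketch on a toy column (values chosen for illustration):

```python
import numpy as np
import pandas as pd

# Toy column (not the real data): ids stored as floats because of one NaN.
map_id = pd.Series([0.0, 1.0, np.nan, 530.0])
print(map_id.dtype)  # float64

# A plain int64 column cannot hold NaN, so this cast raises an error.
try:
    map_id.astype("int64")
except ValueError as err:
    print("int64 cast failed:", err)

# The nullable "Int64" dtype (capital I) keeps the gap as <NA>.
nullable = map_id.astype("Int64")
print(nullable.dtype)
print(nullable.isna().sum())  # 1
```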

To change the type to a plain int64, we first need to replace the nan values:

In [16]:
mask = df_loc_coords.isna().any(axis=1)

In [17]:
df_loc_coords[mask].Map_ID

2889   NaN
2977   NaN
Name: Map_ID, dtype: float64

In [18]:
loc_names_nan = list(df_loc_coords[mask].Location_Name)
loc_names_nan

['Azshara: The Shattered Strand', 'Durotar: Deadeye Shore']

We look for the Map_ID values in the location dataframe:

In [19]:
df_location[df_location['Location_Name'].isin(loc_names_nan)]

Unnamed: 0,Map_ID,Location_Type,Location_Name,Game_Version


It seems that those values are not there:

In [20]:
loc_names = list(df_location["Location_Name"].unique())

We check that those values are not in location:

In [21]:
set(loc_names_nan) - set(loc_names)

{'Azshara: The Shattered Strand', 'Durotar: Deadeye Shore'}

Which values can we use to replace the nans in the location coordinates?

Could it be that those location names are in df_location but don't match because of lowercase or uppercase differences, or because one of the names contains a different character? What can we do in that case?
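
One possible answer is to compare normalized keys instead of the raw names. The sketch below is an assumption about what a reasonable key could be (Unicode NFKC form, collapsed whitespace, case-folded); the helper normalize_name and the toy names are hypothetical:

```python
import unicodedata


def normalize_name(name):
    """Build a comparison key: NFKC form, collapsed whitespace, case-folded."""
    # NFKC turns compatibility characters such as U+00A0 into their plain form
    text = unicodedata.normalize("NFKC", name)
    # split() with no argument drops tabs, repeated and trailing spaces
    return " ".join(text.split()).casefold()


# Toy names (made up) written in slightly different ways.
left = ["Blackfathom Deeps", "\tUndercity ", "Nagrand\u00a0Arena"]
right = ["blackfathom deeps", "undercity", "nagrand arena", "ashran"]

left_keys = {normalize_name(n) for n in left}
right_keys = {normalize_name(n) for n in right}
print(left_keys - right_keys)  # set(): every left name has a match
```

Keeping the key in a separate column would leave the original names available for display.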

In [22]:
#set with all location names in location
#in lowercase
set(df_location["Location_Name"].str.lower())

{"ahn'kahet: the old kingdom",
 'alterac valley',
 'arathi basin',
 'ashran',
 'auchenai crypts',
 'auchindoun<u+00a0>',
 'azjol-nerub',
 'baradin hold',
 'black temple',
 'blackfathom deeps',
 'blackrock caverns',
 'blackrock depths',
 'blackrock foundry<u+00a0>',
 'blackrock spire',
 'blackwing descent',
 'blackwing lair',
 "blade's edge arena",
 'bloodmaul slag mines<u+00a0>',
 'dalaran arena',
 'deepholm',
 'deepwind gorge',
 'dire maul',
 'draenor',
 'dragon soul',
 "drak'tharon keep",
 'eastern kingdoms',
 'end time',
 'eye of the storm',
 'eye of the storm (rated)',
 'firelands',
 'fw horde garrison level 1',
 'fw horde garrison level 2',
 'fw horde garrison level 3',
 'fw horde garrison level 4',
 'gate of the setting sun<u+00a0>',
 'gnomeregan',
 'grim batol',
 'grimrail depot<u+00a0>',
 "gruul's lair",
 'gundrak',
 'halls of lightning',
 'halls of origination',
 'halls of reflection',
 'halls of stone',
 'heart of fear<u+00a0>',
 'hellfire citadel<u+00a0>',
 'hellfire rampart

The marker <u+00a0> (the code point of the non-breaking space) appears in the column Location_Name.
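
A sketch of how such names might be flagged and cleaned with pandas string methods. It is an assumption that the marker can be either the real U+00A0 character or the literal text "<U+00A0>" left by an export tool, so the pattern covers both; the series is a toy one:

```python
import pandas as pd

# Toy names. Assumption: the marker may be the real U+00A0 character
# or the literal text "<U+00A0>" (in either letter case).
names = pd.Series(["Auchindoun<U+00A0>", "Ashran", "Heart of Fear\u00a0"])

pattern = r"(?i)<u\+00a0>|\u00a0"
flagged = names.str.contains(pattern, regex=True)
print(flagged.sum())  # 2 names carry the marker

# Replace the marker with a plain space, then trim the ends.
cleaned = names.str.replace(pattern, " ", regex=True).str.strip()
print(cleaned.tolist())  # ['Auchindoun', 'Ashran', 'Heart of Fear']
```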

In [23]:
set(df_loc_coords["Location_Name"].str.lower())

{'hellfire citadel: the blood furnace',
 "zul'gurub stranglethorn vale",
 'hyjal: cool ancient statue',
 'blackfathom deeps',
 'badlands:     two giant sitting dwarfs',
 "terokkar forest: lake ere'noru",
 'eversong woods : north sanctum ',
 'bloodmyst isle : cryo-core ',
 'undercity ',
 "ahn'qiraj      ",
 'teldrassil : darnassus ',
 'zangarmarsh : funggor cavern ',
 'barrens    ',
 'tirisfal glades: north coast ',
 "redridge mountains: render's rock",
 '\tthe barrens: the merchant cost',
 "isle of quel'danas: sunwell plateau - outside",
 'zangarmarsh: sporewind lake',
 'tanaris: southmoon ruins',
 'bloodmyst isle : middenvale ',
 'secret locations: howling fjord - burning trees',
 'alterac valley - bg: iceblood garrison',
 'the deadmines: the deadmines',
 'terokkar forest entrance',
 'terokkar forest : tomb of lights ',
 'nagrand arena',
 'dun morogh: coldridge valley',
 "eversong woods: tor'watha",
 "\tgruul's lair: gruul's lair",
 'thousand needles: splithoof crag',
 'the exodar: tr

This data first needs to be cleaned, because we can see that there are strange characters (tabs, repeated spaces and trailing spaces).

After the data in both dataframes is clean, we can try to replace the nan values again.

In [24]:
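# Placeholder for the replacement, once the values v1 and v2 are known.
# .loc writes into the dataframe itself and avoids chained assignment:
# df_loc_coords.loc[2889, "Map_ID"] = v1
# df_loc_coords.loc[2977, "Map_ID"] = v2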
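
One idea for finding the replacement values, offered only as a sketch: many names follow a "Zone: Place" pattern, so a missing Map_ID might be borrowed from other rows of the same zone. That other rows share the zone prefix and carry a known Map_ID is an assumption, and the rows and ids below are made up:

```python
import numpy as np
import pandas as pd

# Toy rows (made up, including the ids) in the "Zone: Place" style.
coords = pd.DataFrame({
    "Location_Name": ["Zone A: Camp", "Zone A: Shore",
                      "Zone B: Tower", "Zone B: Cave"],
    "Map_ID": [10.0, np.nan, 20.0, np.nan],
})

# Zone = text before the first colon, trimmed.
zone = coords["Location_Name"].str.split(":").str[0].str.strip()

# Most frequent known Map_ID per zone (mode() ignores NaN).
# This would fail for a zone whose Map_ID values are all missing.
zone_id = coords.groupby(zone)["Map_ID"].agg(lambda s: s.mode().iloc[0])

# Fill only the missing rows, then switch to a plain integer type.
missing = coords["Map_ID"].isna()
coords.loc[missing, "Map_ID"] = zone.loc[missing].map(zone_id)
coords["Map_ID"] = coords["Map_ID"].astype("int64")
print(coords)
```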

## Cleaning the location table

In [25]:
df_location.head(10)

Unnamed: 0,Map_ID,Location_Type,Location_Name,Game_Version
0,0,Continent,Eastern Kingdoms,WoW
1,1,Continent,Kalimdor,WoW
2,530,Continent,Outlands,TBC
3,571,Continent,Northrend,WLK
4,646,Continent,Deepholm,CAT
5,732,Continent,Tol Barad,CAT
6,870,Continent,Pandaria,MoP
7,1064,Continent,Mogu Island Daily Area (Isle of Thunder),MoP
8,1116,Continent,Draenor,WoD
9,1191,Continent,Ashran,WoD


We can check the values in Game_Version by eye because there are only a few of them:

In [26]:
df_location.Game_Version.unique()

array(['WoW', 'TBC', 'WLK', 'CAT', 'MoP', 'WoD'], dtype=object)

I suppose that 'WoW' refers to Vanilla WoW (the game without expansions).

How can we find which values have strange characters in the columns Location_Name and Location_Type?

In [27]:
from string import punctuation, whitespace, ascii_letters

In [28]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [29]:
whitespace

' \t\n\r\x0b\x0c'

In [30]:
ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

I suppose there are no string values with accented characters ("´"), but we can check:

In [33]:
set("Adiós").intersection( set("áéíóúÁÉÍÓÚ"))

{'ó'}

In [34]:
df_location.apply(lambda s: True if set(s).intersection( set("áéíóúÁÉÍÓÚ")) else False )

Map_ID           False
Location_Type    False
Location_Name    False
Game_Version     False
dtype: bool

In [35]:
df_location.apply(lambda s: True if set(s).intersection( set(punctuation)) else False )

Map_ID           False
Location_Type    False
Location_Name    False
Game_Version     False
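dtype: bool

Note that set(s) on a column is the set of its whole cell values, not of their characters, so the two checks above compare entire strings against single characters. The punctuation check therefore returns False for Location_Name even though names such as "ahn'kahet: the old kingdom" contain an apostrophe and a colon; a reliable check has to look inside each string.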
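
A per-character version of this kind of check could look like the sketch below. The helper has_any_char is hypothetical, and the frame is a toy one rather than df_location:

```python
from string import punctuation

import pandas as pd


def has_any_char(series, chars):
    """True if any value in the series contains at least one of chars."""
    wanted = set(chars)
    # astype(str) lets numeric columns go through the same check
    return any(wanted.intersection(value) for value in series.astype(str))


# Toy frame with punctuation only in the name column.
toy = pd.DataFrame({
    "Map_ID": [0, 1],
    "Location_Name": ["Gruul's Lair", "Ashran"],
})

# Extra keyword arguments of apply are passed on to the function.
print(toy.apply(has_any_char, chars=punctuation))
```

Here the intersection is taken between the wanted characters and the characters of each individual string, so the apostrophe in the first name is found.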

In [None]:
#A string with the only characters wanted in the string values
char_mask = ascii_letters + " " + "'"
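
One way char_mask might then be used, shown as a sketch on made-up names: collect, for each value, the characters that fall outside the allowed set and keep the rows where that collection is not empty.

```python
from string import ascii_letters

import pandas as pd

# Same definition as in the notebook, repeated so the sketch runs alone.
char_mask = ascii_letters + " " + "'"
allowed = set(char_mask)

# Toy names (made up): one clean, one with a colon, one with a tab.
names = pd.Series(["Gruul's Lair", "Hellfire Citadel: Ramparts", "\tUndercity"])

# Characters of each name that are not in the allowed set.
unwanted = names.map(lambda name: set(name) - allowed)
needs_cleaning = unwanted.map(bool)

print(names[needs_cleaning].tolist())
```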