# WOW AVATAR TRANSFORMATION DATA

The objective of this notebook is to explore, clean, and transform the data so that it can be used for analysis
or AI purposes (machine learning and deep learning).

In [21]:
import os
import pandas as pd
from string import ascii_letters

In [2]:
# Path where the original dataset files are stored
path = "./Files/Original"

In [3]:
os.chdir(path)

In [4]:
df_loc = pd.read_csv("locations.csv")

In [5]:
df_loc.head(10)

Unnamed: 0,Map_ID,Location_Type,Location_Name,Game_Version
0,0,Continent,Eastern Kingdoms,WoW
1,1,Continent,Kalimdor,WoW
2,530,Continent,Outlands,TBC
3,571,Continent,Northrend,WLK
4,646,Continent,Deepholm,CAT
5,732,Continent,Tol Barad,CAT
6,870,Continent,Pandaria,MoP
7,1064,Continent,Mogu Island Daily Area (Isle of Thunder),MoP
8,1116,Continent,Draenor,WoD
9,1191,Continent,Ashran,WoD


Are there any NaN or null values in this dataset?

In [6]:
df_loc.isna().sum()

Map_ID           0
Location_Type    0
Location_Name    0
Game_Version     0
dtype: int64

In [7]:
df_loc.dtypes

Map_ID            int64
Location_Type    object
Location_Name    object
Game_Version     object
dtype: object

The column types look correct for the values they contain.
Now let's explore each column in turn:

## Cleaning Map_ID column

Let's see how many rows this dataset has:

In [8]:
# count() skips NaN; since this column has none, the result equals the number of rows
n_rows = df_loc["Map_ID"].count()
n_rows

151

In [9]:
df_loc["Map_ID"].nunique()

151

The number of distinct IDs equals the number of rows, so every row has a unique Map_ID.
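As a complement to comparing counts, pandas can answer the uniqueness question directly and also show which rows would be at fault. The following is only a sketch on a hypothetical four-row sample (`sample`, `ids_unique`, and `dup_rows` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical miniature of the locations table, using rows shown earlier.
sample = pd.DataFrame(
    {
        "Map_ID": [0, 1, 530, 571],
        "Location_Name": ["Eastern Kingdoms", "Kalimdor", "Outlands", "Northrend"],
    }
)

# Series.is_unique is True when no value repeats.
ids_unique = sample["Map_ID"].is_unique

# duplicated(keep=False) flags every row whose Map_ID appears more than once,
# which helps to inspect the offending rows when the check fails.
dup_rows = sample[sample["Map_ID"].duplicated(keep=False)]

print(ids_unique)
print(len(dup_rows))
```

The same two expressions could be applied to `df_loc` in place of `sample`.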

## Cleaning Location_Type column

Because there are no NaN or null values, every column has the same number of values. Let's look at the unique values:

In [10]:
df_loc["Location_Type"].unique()

array(['Continent', 'Arena', 'Battleground', 'Garrison', 'Dungeon',
       'Raid'], dtype=object)

This column has only a few unique values, so they can be inspected by eye for strange characters.

There are none.

This column contains strings that identify the type of location.

## Cleaning Location_Name

In [11]:
loc_unique_len = len(df_loc["Location_Name"].unique())

In [12]:
n_rows == loc_unique_len

True

Every row has a unique location name.

In [13]:
df_loc["Location_Name"].unique()

array(['Eastern Kingdoms', 'Kalimdor', 'Outlands', 'Northrend',
       'Deepholm', 'Tol Barad', 'Pandaria',
       'Mogu Island Daily Area (Isle of Thunder)', 'Draenor', 'Ashran',
       'Tanaan Jungle Intro', 'Tanaan Jungle', "Blade's Edge Arena",
       'Dalaran Arena', 'Nagrand Arena', 'Ruins of Lordaeron',
       'The Ring of Valor', "The Tiger's Peak", "Tol'Viron Arena",
       'Alterac Valley', 'Arathi Basin', 'Deepwind Gorge',
       'Eye of the Storm', 'Eye of the Storm (Rated)', 'Isle of Conquest',
       'Silvershard Mines', 'Strand of the Ancients', 'Temple of Kotmogu',
       'The Battle for Gilneas', 'Twin Peaks', 'Warsong Gulch',
       'FW Horde Garrison Level 1', 'FW Horde Garrison Level 2',
       'FW Horde Garrison Level 3', 'FW Horde Garrison Level 4',
       'SMV Alliance Garrison Level 1', 'SMV Alliance Garrison Level 2',
       'SMV Alliance Garrison Level 3', 'SMV Alliance Garrison Level 4',
       'Blackfathom Deeps', 'Blackrock Depths', 'Blackrock Spire',
     

In [14]:
df_loc[df_loc["Location_Name"].str.contains('<U+00A0>', regex=False)]

Unnamed: 0,Map_ID,Location_Type,Location_Name,Game_Version
129,962,Dungeon,Gate of the Setting Sun<U+00A0>,MoP
130,994,Dungeon,Mogu'Shan Palace<U+00A0>,MoP
131,959,Dungeon,Shado-pan Monastery<U+00A0>,MoP
132,1011,Dungeon,Siege of Niuzao Temple<U+00A0>,MoP
133,961,Dungeon,Stormstout Brewery<U+00A0>,MoP
134,960,Dungeon,Temple of the Jade Serpent<U+00A0>,MoP
135,1009,Raid,Heart of Fear<U+00A0>,MoP
136,1008,Raid,Mogu'shan Vaults<U+00A0>,MoP
137,1136,Raid,Siege of Orgrimmar<U+00A0>,MoP
138,996,Raid,Terrace of Endless Spring<U+00A0>,MoP


There are values like 'Hellfire Citadel<U+00A0>' that end with the tag <U+00A0>.
This tag is a textual rendering of the Unicode code point U+00A0, the no-break space
(also available in Latin-1). Because the tag always appears at the end of the value, let's replace it
with the empty string "" rather than with a space.

In [15]:
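# regex=False: match "<U+00A0>" literally ("+" would otherwise act as a regex quantifier)
df_loc["Location_Name"] = df_loc["Location_Name"].str.replace("<U+00A0>", "", regex=False)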

In [16]:
df_loc[df_loc["Location_Name"].str.contains('<U+00A0>', regex=False)]

Unnamed: 0,Map_ID,Location_Type,Location_Name,Game_Version
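The tag removed above is literal text, but the same kind of data can also carry a real no-break space character. The sketch below, on a hypothetical three-value Series (`names` and `cleaned` are illustrative names), handles both cases. It passes `regex=False` explicitly because the default of `regex` in `str.replace` differs between pandas versions, and as a regular expression `<U+00A0>` would not match the literal tag.

```python
import pandas as pd

# Hypothetical sample: one name carries the literal text "<U+00A0>",
# another carries a real no-break space character (U+00A0).
names = pd.Series(["Heart of Fear<U+00A0>", "Mogu'shan Vaults\u00a0", "Kalimdor"])

cleaned = (
    names
    .str.replace("<U+00A0>", "", regex=False)  # remove the literal tag
    .str.replace("\u00a0", " ", regex=False)   # real no-break space -> ordinary space
    .str.strip()                               # drop leading/trailing whitespace
)

print(cleaned.tolist())
```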


In [19]:
df_loc["Location_Name"].unique()

array(['Eastern Kingdoms', 'Kalimdor', 'Outlands', 'Northrend',
       'Deepholm', 'Tol Barad', 'Pandaria',
       'Mogu Island Daily Area (Isle of Thunder)', 'Draenor', 'Ashran',
       'Tanaan Jungle Intro', 'Tanaan Jungle', "Blade's Edge Arena",
       'Dalaran Arena', 'Nagrand Arena', 'Ruins of Lordaeron',
       'The Ring of Valor', "The Tiger's Peak", "Tol'Viron Arena",
       'Alterac Valley', 'Arathi Basin', 'Deepwind Gorge',
       'Eye of the Storm', 'Eye of the Storm (Rated)', 'Isle of Conquest',
       'Silvershard Mines', 'Strand of the Ancients', 'Temple of Kotmogu',
       'The Battle for Gilneas', 'Twin Peaks', 'Warsong Gulch',
       'FW Horde Garrison Level 1', 'FW Horde Garrison Level 2',
       'FW Horde Garrison Level 3', 'FW Horde Garrison Level 4',
       'SMV Alliance Garrison Level 1', 'SMV Alliance Garrison Level 2',
       'SMV Alliance Garrison Level 3', 'SMV Alliance Garrison Level 4',
       'Blackfathom Deeps', 'Blackrock Depths', 'Blackrock Spire',
     

Let's list the unique characters that appear in the column values, to detect any unwanted ones:

In [20]:
letters_in_col = []
for letters in df_loc["Location_Name"].apply(set):
    letters_in_col.extend(letters)
letters_in_col = set(letters_in_col)
print(letters_in_col)

{'y', 'M', 'W', 'j', ':', 'H', 'x', 'r', 'e', 'A', 'V', 'q', 'G', 'R', 'C', 'P', 'p', 't', 'l', 'u', '(', 'L', ')', 'z', 'Z', 'B', 'N', 'c', 'b', 'I', 's', 'o', 'h', 'f', "'", 'S', ' ', 'v', 'a', 'Q', '-', 'U', 'd', 'D', 'K', 'w', '2', '4', 'k', 'T', 'F', 'g', 'O', '1', 'm', 'J', 'i', '3', 'n', 'E'}


In [24]:
strange_chars = letters_in_col - set(ascii_letters)
strange_chars

{' ', "'", '(', ')', '-', '1', '2', '3', '4', ':'}
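Knowing how often each non-letter character occurs can help decide which ones deserve a closer look. A minimal sketch using `collections.Counter` on a hypothetical handful of names from the listing (`names` and `non_letter_counts` are illustrative, not notebook variables):

```python
from collections import Counter
from string import ascii_letters

# Hypothetical sample of names taken from the listing above.
names = [
    "Blade's Edge Arena",
    "Mana-Tombs",
    "FW Horde Garrison Level 1",
    "Ahn'kahet: The Old Kingdom",
]

letters = set(ascii_letters)

# Count every character that is not an ASCII letter across all names.
non_letter_counts = Counter(
    ch for name in names for ch in name if ch not in letters
)

# Most frequent characters first.
print(non_letter_counts.most_common())
```

On the full column, `names` could be replaced by `df_loc["Location_Name"]`.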

Let's select all the values that contain any of these non-letter characters, except the space:

In [61]:
mask = df_loc["Location_Name"].str.contains( r"\b\w*['()\-\d:]\w*\b" , regex=True) 

In [62]:
df_loc[mask]["Location_Name"]

12                Blade's Edge Arena
17                  The Tiger's Peak
18                   Tol'Viron Arena
31         FW Horde Garrison Level 1
32         FW Horde Garrison Level 2
33         FW Horde Garrison Level 3
34         FW Horde Garrison Level 4
35     SMV Alliance Garrison Level 1
36     SMV Alliance Garrison Level 2
37     SMV Alliance Garrison Level 3
38     SMV Alliance Garrison Level 4
54         The Temple of Atal'Hakkar
57                        Zul'Farrak
60                Ruins of Ahn'Qiraj
61               Temple of Ahn'Qiraj
65                        Mana-Tombs
79                      Gruul's Lair
82                Magtheridon's Lair
86        Ahn'kahet: The Old Kingdom
87                       Azjol-Nerub
88                  Drak'Tharon Keep
104                    Onyxia's Lair
116         Lost City of the Tol'vir
121                         Zul'Aman
122                        Zul'Gurub
130                 Mogu'Shan Palace
131              Shado-pan Monastery
1

All of these values look correct. Now let's check whether any value has a space at the beginning, a space at the end, or
two or more consecutive spaces:

Getting the values with a space at the beginning:

In [64]:
# Select all the values with a space at the beginning
mask = df_loc["Location_Name"].str.contains( r"^\s" , regex=True)
df_loc[mask]["Location_Name"]

Series([], Name: Location_Name, dtype: object)

No values start with a space.

Getting the values with a space at the end:

In [65]:
# Select all the values with a space at the end
mask = df_loc["Location_Name"].str.contains( r"\s$" , regex=True)
df_loc[mask]["Location_Name"]

Series([], Name: Location_Name, dtype: object)

No values end with a space.

Getting all the values that contain two or more consecutive spaces between words:

In [66]:
mask = df_loc["Location_Name"].str.contains( r'\S\s{2,}\S' , regex=True)
df_loc[mask]["Location_Name"]

Series([], Name: Location_Name, dtype: object)
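The three whitespace checks above all came back empty, so no fix was needed here. If they had found something, one possible approach (a sketch only, on hypothetical messy names that do not occur in the dataset) is to combine the checks into a single mask and normalize in one pass:

```python
import pandas as pd

# Hypothetical messy names; none of these occur in the dataset.
names = pd.Series(["  Twin Peaks", "Warsong Gulch ", "Arathi   Basin", "Kalimdor"])

# One boolean mask covering leading, trailing, and repeated whitespace.
needs_fix = names.str.contains(r"^\s|\s$|\s{2,}", regex=True)

# Collapse runs of whitespace to a single space, then trim both ends.
normalized = names.str.replace(r"\s+", " ", regex=True).str.strip()

print(needs_fix.tolist())
print(normalized.tolist())
```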

## Cleaning Game_Version 

Finally, let's look at all the unique values of the last column to clean:

In [68]:
df_loc["Game_Version"].unique()

array(['WoW', 'TBC', 'WLK', 'CAT', 'MoP', 'WoD'], dtype=object)

All values look correct.
The values in this column identify the game version.
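With so few valid codes, the visual check can also be written as an explicit membership test, which would catch a stray value if the file were regenerated. This is a sketch under the assumption that the six codes shown above are the complete set; `expected_versions`, `versions`, and `unexpected` are illustrative names, and the sample deliberately contains one wrong code.

```python
import pandas as pd

# The six codes seen in the unique() output above (assumed to be the full set).
expected_versions = {"WoW", "TBC", "WLK", "CAT", "MoP", "WoD"}

# Hypothetical sample with one unexpected code ("Mop") to show the check firing.
versions = pd.Series(["WoW", "TBC", "Mop", "WoD"], name="Game_Version")

# Keep only the values that are not in the expected set.
unexpected = versions[~versions.isin(expected_versions)]

print(unexpected.tolist())
```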