## 0. Cleaning Data

The dataset used in this project is a table of Harry Potter characters. It was published on Kaggle ([see the original dataset here](https://www.kaggle.com/datasets/gulsahdemiryurek/harry-potter-dataset)).
The author notes: "The other data were collected from pottermore.com and https://harrypotter.fandom.com/wiki/Main_Page".

You can either download the data from the Kaggle link or take it from the data folder in this repo and clean it with the notebook steps below, or you can use the clean_data.csv file directly and continue to the Elastic steps of the tutorial here:

[1. Kibana Dashboard](/1.%20Kibana%20Dashboard.md)



In [89]:
import pandas as pd
import re
# The source file is semicolon-delimited
hp_data = pd.read_csv("data/Characters.csv", sep=";")

We first load the data into a pandas DataFrame and look at what needs to change.

In [81]:
hp_data

Unnamed: 0,Id,Name,Gender,Job,House,Wand,Patronus,Species,Blood status,Hair colour,Eye colour,Loyalty,Skills,Birth,Death
0,1,Harry James Potter,Male,Student,Gryffindor,"11"" Holly phoenix feather",Stag,Human,Half-blood,Black,Bright green,Albus Dumbledore | Dumbledore's Army | Order o...,Parseltongue| Defence Against the Dark Arts | ...,31 July 1980,
1,2,Ronald Bilius Weasley,Male,Student,Gryffindor,"12"" Ash unicorn tail hair",Jack Russell terrier,Human,Pure-blood,Red,Blue,Dumbledore's Army | Order of the Phoenix | Hog...,Wizard chess | Quidditch goalkeeping,1 March 1980,
2,3,Hermione Jean Granger,Female,Student,Gryffindor,"10¾"" vine wood dragon heartstring",Otter,Human,Muggle-born,Brown,Brown,Dumbledore's Army | Order of the Phoenix | Hog...,Almost everything,"19 September, 1979",
3,4,Albus Percival Wulfric Brian Dumbledore,Male,Headmaster,Gryffindor,"15"" Elder Thestral tail hair core",Phoenix,Human,Half-blood,Silver| formerly auburn,Blue,Dumbledore's Army | Order of the Phoenix | Hog...,Considered by many to be one of the most power...,Late August 1881,"30 June, 1997"
4,5,Rubeus Hagrid,Male,Keeper of Keys and Grounds | Professor of Care...,Gryffindor,"16"" Oak unknown core",,Half-Human/Half-Giant,Part-Human (Half-giant),Black,Black,Albus Dumbledore | Order of the Phoenix | Hogw...,Resistant to stunning spells| above average st...,6 December 1928,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
135,136,Wilhelmina Grubbly-Plank,Female,Substitute professor of Care of Magical Creatu...,,Unknown,Non-corporeal,Human,,Grey,,Hogwarts School of Witchcraft and Wizardry,,,
136,137,Fenrir Greyback,Male,,,Unknown,,Werewolf,,Grey,,Lord Voldemort | Death Eaters,Physical combat,Pre 1945,
137,138,Gellert Grindelwald,Male,Revolutionary leader(c. 1920s[6] – 1945),,"15"", Elder, Thestral tail hair core",,Human,Pure-blood or half-blood,Blond,Blue,Gellert Grindelwald's Acolytes,Duelling,1883,"March, 1998"
138,139,Dobby,Male,"Malfoy family's house-elf (? - 1993),\nHogwart...",,,,House elf,,Green,,,"A type of magic specific to house-elves, perfo...",28 June,"Late March, 1998"
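
One quick way to decide what needs cleaning is to count the missing values in each column. The following is a minimal sketch on a small hypothetical frame; the rows are illustrative and are not read from the CSV file.

```python
import pandas as pd

# Hypothetical sample that mimics a few columns of the characters table.
sample = pd.DataFrame({
    "Name": ["Harry James Potter", "Fenrir Greyback", "Dobby"],
    "House": ["Gryffindor", None, None],
    "Death": [None, None, "Late March, 1998"],
})

# Number of missing entries per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Running the same call on `hp_data` should give a per-column overview of how many values are missing.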


### Data cleaning
* Replace missing values with "Unknown" so they can be counted in statistics
* Remove all characters other than alphanumerics and spaces from every column that contains strings
* Fix date formatting (as far as possible, since the formats vary widely)
* For columns that should hold categorical data, map variant spellings to a single value

In [112]:
hp_data = hp_data.fillna("Unknown")

In [99]:
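# Keep only the string (object) columns; numeric columns such as Id are dropped here
df_obj = hp_data.select_dtypes(["object"])
# Strip everything except word characters (letters, digits, underscore) and spaces
hp_data = df_obj.applymap(lambda x: re.sub(r'[^ \w]', '', str(x)))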
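
To see what the substitution does to individual values, here is a minimal sketch on a few strings taken from the table above. The helper name `strip_special` is hypothetical, and the sketch uses only the standard `re` module.

```python
import re

def strip_special(value):
    """Remove every character that is not a word character or a plain space."""
    # \w matches letters, digits and the underscore. A "+" placed inside the
    # brackets would be a literal plus sign, not a quantifier.
    return re.sub(r"[^ \w]", "", str(value))

samples = [
    "Dumbledore's Army | Order of the Phoenix",
    "Pure-blood or half-blood",
    '10¾" vine wood dragon heartstring',
]
cleaned = [strip_special(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

The pipe separators in multi-valued columns such as Loyalty and Skills are removed as well, which leaves a double space between entries.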

In [110]:
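# Parse each value on its own; anything that cannot be read as a date becomes NaT
hp_data["Birth"] = hp_data["Birth"].apply(lambda x: pd.to_datetime(x, errors="coerce"))
hp_data["Death"] = hp_data["Death"].apply(lambda x: pd.to_datetime(x, errors="coerce"))
# Replace the NaT values produced above with "Unknown"
hp_data = hp_data.fillna("Unknown")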
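
The effect of `errors="coerce"` is easiest to see on single values. The sketch below parses a few strings in the style of the Birth and Death columns. The sample list is illustrative, and the parsing of loosely formatted strings may differ between pandas versions.

```python
import pandas as pd

samples = ["31 July 1980", "19 September 1979", "Unknown"]

# Parse one value at a time so that each string is interpreted on its own.
# Values that cannot be read as a date become NaT instead of raising an error.
parsed = [pd.to_datetime(s, errors="coerce") for s in samples]

for raw, value in zip(samples, parsed):
    print(f"{raw!r} -> {value}")
```

This is why free-text entries such as "Late August 1881" end up as "Unknown" in the cleaned table.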

In [120]:
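# Inspect the distinct spellings before normalizing them
hp_data["Blood status"].unique()
hp_data["Blood status"] = hp_data["Blood status"].str.lower()
# Collapse both orderings of the combined status into a single label
hp_data["Blood status"] = hp_data["Blood status"].replace(to_replace=["halfbloodorpureblood", "purebloodorhalfblood"], value="pureblood or halfblood")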
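
The same idea extends to any categorical column: lowercase the values, then map each variant spelling to one canonical label. Below is a minimal sketch on a small hypothetical Series whose values mimic the Blood status column; the `canonical` mapping is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical values in the style of the cleaned Blood status column.
status = pd.Series(
    ["Halfblood", "purebloodorhalfblood", "halfbloodorpureblood", "Muggleborn"]
)

# Variant spelling -> canonical label (illustrative mapping).
canonical = {
    "purebloodorhalfblood": "pureblood or halfblood",
    "halfbloodorpureblood": "pureblood or halfblood",
}

# Lowercase first, then replace exact matches using the mapping.
normalized = status.str.lower().replace(canonical)
print(normalized.value_counts())
```

Keeping the variants in a dictionary makes it straightforward to add further spellings as they turn up in `unique()`.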

In [126]:
hp_data

Unnamed: 0,Name,Gender,Job,House,Wand,Patronus,Species,Blood status,Hair colour,Eye colour,Loyalty,Skills,Birth,Death
0,Harry James Potter,Male,Student,Gryffindor,11 Holly phoenix feather,Stag,Human,halfblood,Black,Bright green,Albus Dumbledore Dumbledores Army Order of t...,Parseltongue Defence Against the Dark Arts Se...,1980-07-31 00:00:00,Unknown
1,Ronald Bilius Weasley,Male,Student,Gryffindor,12 Ash unicorn tail hair,Jack Russell terrier,Human,pureblood,Red,Blue,Dumbledores Army Order of the Phoenix Hogwar...,Wizard chess Quidditch goalkeeping,1980-03-01 00:00:00,Unknown
2,Hermione Jean Granger,Female,Student,Gryffindor,10¾ vine wood dragon heartstring,Otter,Human,muggleborn,Brown,Brown,Dumbledores Army Order of the Phoenix Hogwar...,Almost everything,1979-09-19 00:00:00,Unknown
3,Albus Percival Wulfric Brian Dumbledore,Male,Headmaster,Gryffindor,15 Elder Thestral tail hair core,Phoenix,Human,halfblood,Silver formerly auburn,Blue,Dumbledores Army Order of the Phoenix Hogwar...,Considered by many to be one of the most power...,Unknown,1997-06-30 00:00:00
4,Rubeus Hagrid,Male,Keeper of Keys and Grounds Professor of Care ...,Gryffindor,16 Oak unknown core,Unknown,HalfHumanHalfGiant,parthumanhalfgiant,Black,Black,Albus Dumbledore Order of the Phoenix Hogwar...,Resistant to stunning spells above average str...,1928-12-06 00:00:00,Unknown
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
135,Wilhelmina GrubblyPlank,Female,Substitute professorofCare of Magical Creature...,Unknown,Unknown,Noncorporeal,Human,unknown,Grey,Unknown,Hogwarts School of Witchcraft and Wizardry,Unknown,Unknown,Unknown
136,Fenrir Greyback,Male,Unknown,Unknown,Unknown,Unknown,Werewolf,unknown,Grey,Unknown,Lord Voldemort Death Eaters,Physical combat,Unknown,Unknown
137,Gellert Grindelwald,Male,Revolutionary leaderc1920s61945,Unknown,15 Elder Thestral tail hair core,Unknown,Human,pureblood or halfblood,Blond,Blue,Gellert Grindelwalds Acolytes,Duelling,1883-01-01 00:00:00,1998-03-01 00:00:00
138,Dobby,Male,Malfoy familys houseelf 1993Hogwarts kitchen...,Unknown,Unknown,Unknown,House elf,unknown,Green,Unknown,Unknown,A type of magic specific to houseelves perform...,Unknown,Unknown


In [128]:
hp_data.to_csv("data/clean_characters.csv")

In [None]:
import pandas as pd
import re
hp_data = pd.read_csv("data/clean_characters.csv")
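
One detail worth knowing when reloading the file: by default `to_csv` writes the row index as a first column with an empty header, and `read_csv` then loads it as a column named `Unnamed: 0`. The sketch below shows this on an in-memory buffer with a hypothetical two-row frame, along with two common ways to avoid the extra column.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned data.
df = pd.DataFrame({
    "Name": ["Dobby", "Fenrir Greyback"],
    "Species": ["House elf", "Werewolf"],
})

# Default behaviour: the index is written as a first column with an empty header.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# Option 1: do not write the index at all.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

# Option 2: keep the file as written and read the first column back as the index.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(with_index.columns))
print(list(without_index.columns))
print(list(restored.columns))
```

Either option works; which one fits depends on whether later steps expect the extra column.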


In [34]:
hp_data["Skills"][0:20]

0                                                   Parseltongue Defence Against the Dark Arts  Seeker
1                                                                  Wizard chess  Quidditch goalkeeping
2                                                                                    Almost everything
3                                Considered by many to be one of the most powerful wizards of his time
4                                 Resistant to stunning spells above average strength  crossbowmanship
5                                                                                            Herbology
6                                                                                               Beater
7                                                                                               Beater
8                                                                                  Chaser BatBogey hex
9                                                                        