<a href="https://colab.research.google.com/github/nurfnick/Data_Viz/blob/main/19_Missing_and_Incomplete.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Missing and Incomplete

Datasets are often missing entries.  There are many approaches we can take to dealing with these errors and omissions.  I will examine a dataset on the characters from <ins>The Lord of the Rings</ins>.

In [1]:
import pandas as pa

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/lotr_characters.csv')

df.head()

Unnamed: 0,birth,death,gender,hair,height,name,race,realm,spouse
0,,,Female,,,Adanel,Men,,Belemir
1,TA 2978,"February 26 ,3019",Male,Dark (book) Light brown (movie),,Boromir,Men,,
2,,"March ,3019",Male,,,Lagduf,Orcs,,
3,TA 280,TA 515,Male,,,Tarcil,Men,Arnor,Unnamed wife
4,,,Male,,,Fire-drake of Gondolin,Dragon,,


We see right away that there are lots of `NaN`s.  Each one marks an empty field in our dataset.  Some characters are mentioned but never given much more background than a name.
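
To make the link between empty fields and `NaN` concrete, here is a minimal sketch on a toy CSV. The rows and the names `toy_csv`, `toy`, and `missing_mask` are invented for illustration and do not come from the character dataset.

```python
import io
import pandas as pa

# Toy CSV (hypothetical rows, not from the real dataset) with two empty fields.
toy_csv = io.StringIO(
    "name,race,realm\n"
    "Alpha,Men,\n"
    "Beta,,Gondor\n"
    "Gamma,Elves,Lorien\n"
)
toy = pa.read_csv(toy_csv)

# Empty fields are read in as NaN, which isnull() flags as True.
missing_mask = toy.isnull()
print(missing_mask)
```

The mask has the same shape as the data, so it can be summed along either axis to count the missing values.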

In [2]:
df.isnull().sum(axis = 0)

birth     207
death     315
gender    143
hair      734
height    813
name        0
race      140
realm     714
spouse    403
dtype: int64

There are null values in every column except name.
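
Raw counts are easier to compare once they are turned into shares. One way to do that, sketched here on a small made-up frame (the values and the name `missing_share` are assumptions for illustration), is to take the mean of the boolean mask instead of the sum.

```python
import numpy as np
import pandas as pa

# Toy frame (hypothetical values) standing in for the character data.
toy = pa.DataFrame({
    "name": ["A", "B", "C", "D"],
    "race": ["Men", np.nan, "Elves", "Men"],
    "height": [np.nan, np.nan, np.nan, "Tall"],
})

# The mean of a True/False column is the fraction of True values,
# so this gives the share of missing entries in each column.
missing_share = toy.isnull().mean()
print(missing_share)
```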

In [3]:
df.isnull().sum(axis = 1).value_counts().sort_index()

0     15
1     59
2    185
3    236
4    178
5     81
6     20
7      1
8    136
dtype: int64

Here we see that only 15 entries have every field filled in, while 136 have nothing but a name (since name was never blank!).  Let's look at just the complete entries.

In [4]:
df[~df.isnull().any(axis = 1)]

Unnamed: 0,birth,death,gender,hair,height,name,race,realm,spouse
125,SA 3209,TA 2,Male,Black,Very tall almost 7'1,Isildur,Men,"Arnor,Gondor",Unnamed wife
134,"YT, and perhaps firstborn",Still Alive,Male,Probably Golden,Tall,Ingwë,Elves,"Valinor,Taniquetil",Unnamed wife
166,YT,FA 400,Male,Dark,Tall,Eöl,Elves,Nan Elmoth,Aredhel
186,TA 2990,FO 63,Male,Dirty blond,Tall-6'6,omer,Men,Rohan,Lothíriel after the War of the Ring
194,FA 532,"Still alive; departed to ,Aman, on ,September ...",Male,Dark,Tall,Elrond,Half-elven,Rivendell,Celebrían
204,SA 3119,SA 3441,Male,Brown,"7' 10""",Elendil,Men,"Arnor,Gondor",Unnamed wife
530,YT,"Still alive, departed over the sea in the earl...",Male,Silver,Tall,Celeborn,Elves,"Eregion,Lothlórien,Caras Galadhon",Galadriel
551,Possibly pre First Age,Unknown; possibly still alive,Most likely male,,Huge,Watcher in the Water,Urulóki,Doors of Durin,Most likely none
579,3019,February 293019,Male,Dark (movie),"6' 6"" (movie)",Uglúk,Uruk-hai,Isengard,
620,TA 2925,TA 3007,Male,Brown (film),"1.76m / 5'9"" (film)",Bain,Men,Dale,Unnamed wife
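
The filter above keeps the rows with no null values at all. pandas also offers `dropna()`, which with its default arguments should give the same result. The sketch below checks that on a toy frame; the values and variable names are made up for illustration.

```python
import numpy as np
import pandas as pa

# Toy frame (hypothetical values) with one complete row and two incomplete ones.
toy = pa.DataFrame({
    "name": ["A", "B", "C"],
    "birth": ["TA 1", np.nan, "SA 2"],
    "realm": ["Gondor", "Rohan", np.nan],
})

# Boolean-mask approach: keep rows where no column is null.
complete_by_mask = toy[~toy.isnull().any(axis=1)]

# dropna() with default arguments drops every row containing a null.
complete_by_dropna = toy.dropna()

print(complete_by_mask.equals(complete_by_dropna))
```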


We could also ask for just the entries with 8 null values.

In [6]:
df[df.isnull().sum(axis = 1) == 8].name

8                          Angrim
14                      Angelimar
17      Linda (Baggins) Proudfoot
18                 Bodo Proudfoot
40     Tanta (Hornblower) Baggins
                  ...            
886                        Andvír
891                        Amlach
904                         Aghan
905                       Agathor
907                      Aerandir
Name: name, Length: 136, dtype: object

I only included the names since every other field in these rows is null!

The same method can select only the entries that have 4 or fewer null values.

In [7]:
df[df.isnull().sum(axis = 1) <= 4]

Unnamed: 0,birth,death,gender,hair,height,name,race,realm,spouse
1,TA 2978,"February 26 ,3019",Male,Dark (book) Light brown (movie),,Boromir,Men,,
3,TA 280,TA 515,Male,,,Tarcil,Men,Arnor,Unnamed wife
5,SA 2709,SA 2962,Male,,,Ar-Adûnakhôr,Men,Númenor,Unnamed wife
7,YT,FA 455,Male,Golden,,Angrod,Elves,,Eldalótë
9,SA 3219,SA 3440,Male,,,Anárion,Men,Gondor,Unnamed wife
...,...,...,...,...,...,...,...,...,...
903,TA 2827,TA 2932,Male,,,Aglahad,Men,,Unnamed wife
906,"Mid ,First Age",FA 495,Female,,,Aerin,Men,,Brodda
908,"YT during the ,Noontide of Valinor",FA 455,Male,Golden,,Aegnor,Elves,,"Loved ,Andreth but remained unmarried"
909,TA 2917,TA 3010,Male,,,Adrahil II,Men,,Unnamed wife
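
A cap on the number of nulls can also be written with the `thresh` argument of `dropna()`, which sets the minimum number of non-null values a row needs to be kept. Allowing at most `k` nulls is the same as requiring at least `number of columns - k` non-null values. The sketch below uses a toy frame with three columns and `max_missing = 1`; both the data and the names are assumptions for illustration.

```python
import numpy as np
import pandas as pa

# Toy frame (hypothetical values): rows have 0, 1, and 2 nulls respectively.
toy = pa.DataFrame({
    "name": ["A", "B", "C"],
    "birth": ["TA 1", np.nan, np.nan],
    "realm": ["Gondor", "Rohan", np.nan],
})

max_missing = 1

# Boolean-mask approach: count nulls per row and compare.
by_mask = toy[toy.isnull().sum(axis=1) <= max_missing]

# dropna approach: require at least (columns - max_missing) non-null values.
by_thresh = toy.dropna(thresh=toy.shape[1] - max_missing)

print(by_mask.equals(by_thresh))
```

For the nine-column character data, keeping rows with 4 or fewer nulls would correspond to `thresh=5`.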


Maybe we only want the characters whose *realm* is recorded.  We'll negate the result of `isnull()` with `~`.

In [9]:
df[~df.realm.isnull()]

Unnamed: 0,birth,death,gender,hair,height,name,race,realm,spouse
3,TA 280,TA 515,Male,,,Tarcil,Men,Arnor,Unnamed wife
5,SA 2709,SA 2962,Male,,,Ar-Adûnakhôr,Men,Númenor,Unnamed wife
9,SA 3219,SA 3440,Male,,,Anárion,Men,Gondor,Unnamed wife
10,SA 3118,Still alive,Male,,Tall,Ar-Pharazôn,Men,Númenor,Tar-Míriel
11,SA 2876,SA 3102,Male,,,Ar-Sakalthôr,Men,Númenor,Unnamed wife
...,...,...,...,...,...,...,...,...,...
890,TA 726,TA 946,Male,,,Amlaith,Men,Arthedain,Unnamed wife
892,"Sometime during ,Years of the Trees, or the ,F...",SA 3434,Male,,,Amdír,Elves,Lórien,Unnamed wife
898,,,Female,,,Almarian,Men,Númenor,Tar-Meneldur
900,TA 2544,TA 2645,Male,,,Aldor,Men,Rohan,Unnamed wife
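
Two other spellings of the same filter may be worth knowing: `notnull()` avoids the explicit negation, and `dropna(subset=[...])` drops rows that are null in the listed columns only. The sketch below compares them on a toy frame with invented values.

```python
import numpy as np
import pandas as pa

# Toy frame (hypothetical values); only the realm column matters here.
toy = pa.DataFrame({
    "name": ["A", "B", "C"],
    "height": [np.nan, "Tall", np.nan],
    "realm": ["Gondor", np.nan, "Rohan"],
})

# Negated isnull(), as in the cell above.
by_negation = toy[~toy.realm.isnull()]

# notnull() expresses the same condition directly.
by_notnull = toy[toy.realm.notnull()]

# dropna restricted to the realm column ignores nulls elsewhere.
by_subset = toy.dropna(subset=["realm"])

print(by_negation.equals(by_notnull) and by_notnull.equals(by_subset))
```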
