# Pandas Project (Cleaning Sharks)

## Steps:
1. Check out the Data
2. Drop duplicate rows
3. Rename columns
4. Drop rows with missing main information
5. Fill missing information in Year from Date, and vice versa.
6. Fill Country, Area or Location with information from other rows.
7. Categorize activities
8. Categorize injuries
9. Classify columns into main and secondary. If rows are still missing main data after all that, drop those rows.


Methods I will probably need:

* df.fillna(0, inplace=True)
* df.isnull()
* df.drop_duplicates()
* df.reset_index(inplace=True)
* df['col'].str.replace()
* df['col'].value_counts()
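
As a quick reference, here is a minimal sketch of how the methods listed above behave. The `toy` frame and its values are made up for illustration and are not taken from the shark data.

```python
import numpy as np
import pandas as pd

# Toy frame with one duplicated row and one missing value (made-up data).
toy = pd.DataFrame({
    "country": ["USA", "USA", "spain ", None],
    "year": [2018.0, 2018.0, 2017.0, np.nan],
})

deduped = toy.drop_duplicates().reset_index(drop=True)  # returns a new frame
missing_per_column = deduped.isnull().sum()              # NaN count per column

# String methods live on a Series through the .str accessor, not on the DataFrame.
cleaned_country = deduped["country"].str.strip().str.upper()
counts = cleaned_country.value_counts(dropna=False)      # frequency of each value
year_filled = deduped["year"].fillna(0)                  # replace NaN with 0
```

Note that `.str` and `value_counts()` are called on a single column here, which is why the list uses `df['col']` for them.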



In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
sharks = pd.read_csv('attacks.csv.zip', encoding='latin-1')

In [3]:
sharks.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,...,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,...,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0,,


In [4]:
# Checking for NaNs in every column and shape of dataframe:

print(sharks.isna().sum())
print(sharks.shape)

Case Number               17021
Date                      19421
Year                      19423
Type                      19425
Country                   19471
Area                      19876
Location                  19961
Activity                  19965
Name                      19631
Sex                       19986
Age                       22252
Injury                    19449
Fatal (Y/N)               19960
Time                      22775
Species                   22259
Investigator or Source    19438
pdf                       19421
href formula              19422
href                      19421
Case Number.1             19421
Case Number.2             19421
original order            19414
Unnamed: 22               25722
Unnamed: 23               25721
dtype: int64
(25723, 24)


In [5]:
sharks.dtypes

Case Number                object
Date                       object
Year                      float64
Type                       object
Country                    object
Area                       object
Location                   object
Activity                   object
Name                       object
Sex                        object
Age                        object
Injury                     object
Fatal (Y/N)                object
Time                       object
Species                    object
Investigator or Source     object
pdf                        object
href formula               object
href                       object
Case Number.1              object
Case Number.2              object
original order            float64
Unnamed: 22                object
Unnamed: 23                object
dtype: object

In [6]:
# Eliminate duplicate rows:
sharks2 = sharks.drop_duplicates().reset_index(drop=True)
num_dupl = sharks.shape[0] - sharks2.shape[0]
print('Eliminated Duplicates: {}'.format(num_dupl))
print(sharks.shape)
print(sharks2.shape)

Eliminated Duplicates: 19411
(25723, 24)
(6312, 24)
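
The count above can be cross-checked with `duplicated()`. The sketch below uses a made-up `toy` frame; it also shows that rows that are empty in every column count as duplicates of each other, which may explain part of the large number of eliminated rows (an assumption, not verified against the file).

```python
import pandas as pd

# Made-up data: one repeated record and two completely empty rows.
toy = pd.DataFrame({
    "case": ["a", "a", "b", None, None],
    "year": [1, 1, 2, None, None],
})

n_duplicates = int(toy.duplicated().sum())   # rows that repeat an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)
removed = len(toy) - len(deduped)            # same number, computed from the shapes
```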


In [7]:
sharks2.isna().sum()

Case Number                  2
Date                        10
Year                        12
Type                        14
Country                     60
Area                       465
Location                   550
Activity                   554
Name                       220
Sex                        575
Age                       2841
Injury                      38
Fatal (Y/N)                549
Time                      3364
Species                   2848
Investigator or Source      27
pdf                         10
href formula                11
href                        10
Case Number.1               10
Case Number.2               10
original order               3
Unnamed: 22               6311
Unnamed: 23               6310
dtype: int64

In [8]:
# Renaming Columns

sharks2 = sharks2.rename(columns={'Case Number': 'Case', 'Fatal (Y/N)': 'Fatal', 'Species ':'Species',
                'Investigator or Source':'Source','href formula':'href_formula', 'Sex ':'Sex',
                'original order': 'order'})
sharks2.columns

Index(['Case', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex', 'Age', 'Injury', 'Fatal', 'Time', 'Species',
       'Source', 'pdf', 'href_formula', 'href', 'Case Number.1',
       'Case Number.2', 'order', 'Unnamed: 22', 'Unnamed: 23'],
      dtype='object')
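
The rename keys `'Species '` and `'Sex '` suggest that some raw headers carry a trailing space. A possible alternative, sketched on a made-up empty frame, is to strip all headers first and then rename only the columns whose meaning changes:

```python
import pandas as pd

# Hypothetical headers; two of them carry a trailing space.
toy = pd.DataFrame(columns=["Case Number", "Sex ", "Species ", "Fatal (Y/N)"])

# Strip stray whitespace from every header, then rename by meaning.
toy.columns = toy.columns.str.strip()
toy = toy.rename(columns={"Case Number": "Case", "Fatal (Y/N)": "Fatal"})
new_names = list(toy.columns)
```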

In [9]:
# Keeping only rows that have a Date

sharks2 = sharks2[sharks2['Date'].notna()]

# Filling missing Area with Country
# Filling Missing Location With Area

sharks2['Area'] = sharks2['Area'].fillna(sharks2['Country'])
sharks2['Location'] = sharks2['Location'].fillna(sharks2['Area'])
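
Passing a Series to `fillna` fills each missing value from the same row of the other column. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows covering the three cases: Location missing, Area and
# Location missing, Country and Area missing.
toy = pd.DataFrame({
    "Country": ["USA", "MEXICO", None],
    "Area": ["Florida", None, None],
    "Location": [None, None, "Open sea"],
})

toy["Area"] = toy["Area"].fillna(toy["Country"])       # row-by-row fallback
toy["Location"] = toy["Location"].fillna(toy["Area"])
```

If Country is missing too, Area stays missing, so this step does not remove every NaN.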

In [10]:
# Inspect the most common names, then replace missing names with 'UNKNOWN'

sharks2.Name.value_counts(dropna=False).head(30)

sharks2['Name'] = sharks2['Name'].fillna('UNKNOWN')


In [11]:
sharks3 = sharks2.copy()

In [12]:
sharks3.head()

Unnamed: 0,Case,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Source,pdf,href_formula,href,Case Number.1,Case Number.2,order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,...,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,...,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0,,


From here I decided to start checking every column in order:

## Case Col

In [13]:
# Checking Case Col:
sharks3.Case.value_counts()

# There appear to be duplicate cases, so I will check some examples

1980.07.00      2
2014.08.02      2
1966.12.26      2
2013.10.05      2
2009.12.18      2
               ..
2005.09.07      1
2012.03.14.b    1
1940.03.20      1
2000.09.25      1
2002.10.14      1
Name: Case, Length: 6285, dtype: int64

In [14]:
sharks3[sharks3.Case == '2012.09.02.b']
# The two rows describe different incidents, so Case is not a unique id. I will instead create my own id system later on.

Unnamed: 0,Case,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Source,pdf,href_formula,href,Case Number.1,Case Number.2,order,Unnamed: 22,Unnamed: 23
746,2012.09.02.b,02-Sep-2012,2012.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Boogie boarding,female,F,...,3.5' to 4' shark,"WYTV, 9/3/2012",2012.09.02.b-NSB-girl.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2012.09.02.b,2012.09.02.b,5557.0,,
747,2012.09.02.b,02-Sep-2012,2012.0,Provoked,USA,Hawaii,"Spreckelsville, Maui",Spearfishing,M. Malabon,,...,"Tiger shark, 10' to 12'",HawaiiNow.com,2012.09.02.c-Malabon.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2012.09.02.b,2012.09.02.b,5556.0,,
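
One way such an id system could look is sketched below. The column name `attack_id` and the toy rows are hypothetical; the notebook itself does not define an id later on.

```python
import pandas as pd

# Made-up rows: two different incidents share the same Case value.
toy = pd.DataFrame({
    "Case": ["2012.09.02.b", "2012.09.02.b", "2012.09.02.a"],
    "Country": ["USA", "USA", "AUSTRALIA"],
})

# keep=False flags every row whose Case value occurs more than once.
repeated_cases = toy[toy["Case"].duplicated(keep=False)]

# Hypothetical surrogate key: a running number that is unique per row.
toy = toy.reset_index(drop=True)
toy.insert(0, "attack_id", list(range(1, len(toy) + 1)))
```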


## Date Col

In [15]:
# Checking Date Col
# Some dates carry the words 'Reported', 'Before' or 'Circa'. Remove them with .apply


sharks3['Date'] = sharks3['Date'].apply(lambda x: re.sub('Reported ', '', str(x)))
sharks3['Date'] = sharks3['Date'].apply(lambda x: re.sub('Before ', '', str(x)))
sharks3['Date'] = sharks3['Date'].apply(lambda x: re.sub('Circa ', '', str(x)))

# A literal '?' has to be escaped in a regex; an unescaped "?" raises re.error
#sharks3['Date'] = sharks3['Date'].apply(lambda x: re.sub(r"\?", '', str(x)))

# Convert column to str

sharks3['Date'] = sharks3['Date'].astype(str)

#sharks3.head(50)
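
The qualifiers could also be removed in a single pass. This is a sketch only: the helper `strip_qualifiers` and the example strings are made up for illustration.

```python
import re

# One pattern for all qualifiers; the literal question mark is escaped.
QUALIFIERS = re.compile(r"Reported |Before |Circa |\?")

def strip_qualifiers(date_text):
    """Remove the qualifier words and any '?' from a date string."""
    return QUALIFIERS.sub("", str(date_text)).strip()

examples = ["Reported 08-Jan-2017", "Before 1906", "Circa 1862", "25-Jun-2018?"]
cleaned = [strip_qualifiers(d) for d in examples]
# cleaned == ['08-Jan-2017', '1906', '1862', '25-Jun-2018']
```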

## Year Col

In [16]:
# Checking Year Col

sharks3.Year.describe()

# Fill nans with 0
# Convert to string
# Fill 0s with info from Date

count    6300.000000
mean     1927.272381
std       281.116308
min         0.000000
25%      1942.000000
50%      1977.000000
75%      2005.000000
max      2018.000000
Name: Year, dtype: float64

In [17]:
sharks3['Year'] = sharks3['Year'].fillna(0)

In [18]:
# Convert to str
sharks3['Year'] = sharks3['Year'].astype(str)

# to get rid of the trailing .0
sharks3['Year'] = sharks3['Year'].apply(lambda x: re.sub(r"\.0$", "", x))



In [19]:
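def get_date2(x):
    # Take the last whitespace-separated token, then its last '-'-separated part
    # (e.g. '25-Jun-2018' -> '2018'); return None if x is not a usable string.
    try:
        x2 = x.split()
        x3 = x2[-1].split('-')
        return x3[-1]
    except (AttributeError, IndexError):
        return None

def turn_int(x):
    # Convert to int; anything that is not a valid integer becomes NaN.
    try:
        return int(x)
    except (TypeError, ValueError):
        return np.nan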

In [20]:
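# Note: this re-derives Year from Date for every row, not only for the missing values
sharks3['Year'] = sharks3['Date'].apply(get_date2)
sharks3['Year'] = sharks3['Year'].apply(turn_int)

# Drop rows where no year could be parsed
sharks4 = sharks3.dropna(subset=['Year'])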
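
An alternative to splitting the string is to look for a four-digit number at the end of the date text. The helper `year_from_date` below is hypothetical and its results may differ from `get_date2` on unusual inputs, so treat it as a sketch rather than a drop-in replacement.

```python
import re

# Four digits at the very end of the text, not preceded by another digit.
YEAR_AT_END = re.compile(r"(?<!\d)(\d{4})\s*$")

def year_from_date(date_text):
    """Return the trailing four-digit year as int, or None if there is none."""
    match = YEAR_AT_END.search(str(date_text))
    return int(match.group(1)) if match else None

years = [year_from_date(d) for d in ["25-Jun-2018", "Summer of 1996", "No date", "nan"]]
# years == [2018, 1996, None, None]
```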

## Type Col

In [21]:
sharks4.Type.value_counts()

# 'Boat', 'Boatomg' and 'Boating' are the same thing
# 'Questionable' -> 'Invalid'

Unprovoked      4536
Provoked         571
Invalid          541
Sea Disaster     234
Boating          201
Boat             137
Questionable       2
Boatomg            1
Name: Type, dtype: int64

In [22]:
sharks4.loc[sharks4['Type'] == 'Boat', 'Type'] = 'Boating'
sharks4.loc[sharks4['Type'] == 'Boatomg', 'Type'] = 'Boating'
sharks4.loc[sharks4['Type'] == 'Questionable', 'Type'] = 'Invalid'


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
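
The warning above most likely appears because `sharks4` was derived from `sharks3` by filtering, so pandas cannot be sure whether the assignment targets an independent frame. A common way to make the intent explicit is to call `.copy()` when creating the subset. The sketch below uses made-up data and the hypothetical names `base` and `subset`:

```python
import numpy as np
import pandas as pd

base = pd.DataFrame({
    "Year": [2018.0, np.nan, 2017.0],
    "Type": ["Boat", "Boating", "Questionable"],
})

# .copy() makes the filtered result an independent frame, so later
# assignments clearly modify `subset` and never `base`.
subset = base.dropna(subset=["Year"]).copy()
subset.loc[subset["Type"] == "Boat", "Type"] = "Boating"
```

In the notebook this would correspond to creating `sharks4` with `.copy()` at the end of the `dropna` call.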


In [23]:
sharks4.Type.value_counts()

Unprovoked      4536
Provoked         571
Invalid          543
Boating          339
Sea Disaster     234
Name: Type, dtype: int64
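
The three `.loc` assignments can also be written as one mapping passed to `Series.replace`, which swaps whole values that match a key. A sketch on a made-up Series:

```python
import pandas as pd

types = pd.Series(["Boat", "Boatomg", "Boating", "Questionable", "Unprovoked"])

# Each key is replaced by its value; unmatched entries are left alone.
type_fixes = {"Boat": "Boating", "Boatomg": "Boating", "Questionable": "Invalid"}
normalized = types.replace(type_fixes)
```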

## Country, Area and Location Cols
These columns were already handled above.

In [24]:
sharks4.Country.value_counts()

USA                      2219
AUSTRALIA                1324
SOUTH AFRICA              576
PAPUA NEW GUINEA          132
NEW ZEALAND               127
                         ... 
REUNION ISLAND              1
GUATEMALA                   1
NORTH ATLANTIC OCEAN        1
NEVIS                       1
AFRICA                      1
Name: Country, Length: 210, dtype: int64

## Activity

In [25]:
sharks4.Activity.value_counts().head(50)

Surfing                           968
Swimming                          863
Fishing                           427
Spearfishing                      328
Bathing                           160
Wading                            146
Diving                            126
Standing                           98
Snorkeling                         87
Scuba diving                       75
Body boarding                      61
Body surfing                       49
Swimming                           47
Kayaking                           33
Fell overboard                     32
Treading water                     32
Pearl diving                       31
Boogie boarding                    29
Free diving                        27
Windsurfing                        19
Walking                            16
Boogie Boarding                    16
Shark fishing                      15
Floating                           14
Fishing                            13
Rowing                             12
Surf fishing

In [26]:
act_dic = {'surf':'Surfing','swim':'Swimming','fish':'Fishing','bath':'Bathing','divi':'Diving'}

def cat_act(act0):
    x = 'Other'
    for key in act_dic:
        if key in act0.lower():
            x = act_dic[key]
    
    return x
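
Because `cat_act` keeps looping after a match, an activity that contains several keys ends up with the category of the last matching key in `act_dic`. The sketch below repeats the function so that it runs on its own; the sample strings are chosen for illustration.

```python
act_dic = {'surf': 'Surfing', 'swim': 'Swimming', 'fish': 'Fishing',
           'bath': 'Bathing', 'divi': 'Diving'}

def cat_act(act0):
    x = 'Other'
    for key in act_dic:
        if key in act0.lower():
            x = act_dic[key]   # no early return: a later key overrides an earlier one
    return x

samples = ['Surf fishing', 'Body surfing', 'Scuba diving', 'Kayaking', 'nan']
results = {s: cat_act(s) for s in samples}
# 'Surf fishing' matches both 'surf' and 'fish' and is labelled 'Fishing'.
```

Missing activities become the string `'nan'` after `astype(str)`, so they fall into `'Other'` as well.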

In [27]:
# Converting to str
sharks4['Activity'] = sharks4['Activity'].astype(str)

# Categorizing activities according to the most common ones, labelling the rest as 'Other'
sharks4['Activity'] = sharks4['Activity'].apply(cat_act)



In [28]:
sharks4.Activity.value_counts()

Other       2032
Surfing     1188
Fishing     1155
Swimming    1093
Diving       569
Bathing      190
Name: Activity, dtype: int64

## Sex Cols

In [29]:
sharks4.Sex.value_counts(dropna=False)

M      5032
F       629
NaN     560
M         2
N         2
lli       1
.         1
Name: Sex, dtype: int64

In [30]:
# Remove spaces from Sex (e.g. 'M ' -> 'M')

sharks4['Sex'] = sharks4['Sex'].astype(str)
sharks4['Sex'] = sharks4['Sex'].apply(lambda x: re.sub(r' ', '', str(x)))
sharks4.Sex.value_counts()



M      5034
F       629
nan     560
N         2
lli       1
.         1
Name: Sex, dtype: int64

In [31]:
# 'N', 'lli' & '.' don't tell us anything, so they are set to NaN.

sharks4['Sex'] = sharks4[(sharks4['Sex']!='N') & (sharks4['Sex']!='lli') & (sharks4['Sex']!='.')].Sex



In [32]:
# Considering Sex crucial for analysis, dropping the rows where it was missing ('nan' after astype(str))

sharks4 = sharks4[sharks4['Sex']!='nan']
sharks4.Sex.value_counts()

M    5034
F     629
Name: Sex, dtype: int64
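
The same cleanup can be sketched without converting NaN to the string `'nan'`: strip whitespace, keep only the accepted values with `where` and `isin`, then drop what is left missing. The Series below is made up, and unlike the cells above this version drops the `'N'`, `'lli'` and `'.'` rows immediately.

```python
import numpy as np
import pandas as pd

sex = pd.Series(["M", "F", "M ", "N", "lli", ".", np.nan])

sex = sex.str.strip()                    # 'M ' -> 'M'; NaN stays NaN
sex = sex.where(sex.isin(["M", "F"]))    # anything else becomes NaN
kept = sex.dropna()
```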

## Name Cols

In [33]:
sharks4.Name.value_counts().head(30)

lst_unk_names = ['male','female','boy','males','sailor','man','diver','girl','child','soldier',
                 'unidentified','men','native','unknown','local','pilot','crew','teacher']

sharks4['Name'] = sharks4['Name'].astype(str)

def cat_name(name0):
    # Substring match: any name containing one of the words above becomes 'UNKNOWN'
    for el in lst_unk_names:
        if el in name0.lower():
            return 'UNKNOWN'
    return name0

#sharks4.Name.value_counts().head(20)

In [34]:
sharks4['Name'] = sharks4.Name.apply(cat_name)

In [35]:
sharks4.Name.value_counts()

UNKNOWN              1354
John Williams           3
Kenny Burns             2
Seth Mead               2
Thomas McDonald         2
                     ... 
Angel B. Escartin       1
C.E. Slaughter          1
Walter Mitchell         1
Michael Carpenter       1
Liam Walker             1
Name: Name, Length: 4281, dtype: int64
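
Since `cat_name` matches substrings, a real name that merely contains one of the words (for example `'man'` inside a surname) is also turned into `'UNKNOWN'`. The sketch below compares that rule with whole-word matching. The shortened word list, the helper names and the sample name are made up for illustration, and whole-word matching is an assumption about the intent.

```python
import re

lst_unk_names = ['male', 'female', 'boy', 'man', 'unknown']   # shortened list

def cat_name_substring(name0):
    # Same rule as cat_name: any substring hit counts.
    return 'UNKNOWN' if any(el in name0.lower() for el in lst_unk_names) else name0

# Assumption: only whole words should mark a name as unknown.
WHOLE_WORD = re.compile(
    r'\b(' + '|'.join(map(re.escape, lst_unk_names)) + r')\b', re.IGNORECASE)

def cat_name_whole_word(name0):
    return 'UNKNOWN' if WHOLE_WORD.search(name0) else name0

by_substring = cat_name_substring('Norman Smith')    # 'UNKNOWN' ('man' in 'Norman')
by_whole_word = cat_name_whole_word('Norman Smith')  # 'Norman Smith'
```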

## Injury Cols

In [36]:
inj_dic = {'fatal':'FATAL', 'foot':'FOOT', 'leg':'LEG','no injury':'NO INJURY',
           'thigh':'THIGH','hand':'HAND', 'arm':'ARM','calf':'CALF','heel':'FOOT',
           'ankle':'FOOT'}

sharks4.Injury.value_counts().head(10)

FATAL                      744
Foot bitten                 85
Survived                    77
No injury                   72
Leg bitten                  71
Left foot bitten            50
Right foot bitten           38
No injury, board bitten     31
Hand bitten                 24
Thigh bitten                24
Name: Injury, dtype: int64

In [37]:
def cat_inj(inj0):
    # The first key of inj_dic found in the text wins, so 'fatal' takes priority
    for key in inj_dic:
        if key in inj0.lower():
            return inj_dic[key]

    return 'UNKNOWN INJURY'
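
Unlike `cat_act`, `cat_inj` returns on the first hit, so the order of the keys in `inj_dic` decides the label when a description mentions several body parts. A self-contained sketch with made-up descriptions:

```python
inj_dic = {'fatal': 'FATAL', 'foot': 'FOOT', 'leg': 'LEG', 'no injury': 'NO INJURY',
           'thigh': 'THIGH', 'hand': 'HAND', 'arm': 'ARM', 'calf': 'CALF',
           'heel': 'FOOT', 'ankle': 'FOOT'}

def cat_inj(inj0):
    for key in inj_dic:
        if key in inj0.lower():
            return inj_dic[key]   # first matching key wins
    return 'UNKNOWN INJURY'

samples = ['FATAL, leg severed', 'Lacerations to right forearm',
           'No injury, board bitten', 'Survived']
labels = [cat_inj(s) for s in samples]
# labels == ['FATAL', 'ARM', 'NO INJURY', 'UNKNOWN INJURY']
```

The second sample shows the substring effect again: `'arm'` is found inside `'forearm'`.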

In [38]:
sharks4['Injury'] = sharks4['Injury'].astype(str)
sharks4['Injury'] = sharks4['Injury'].apply(cat_inj)

In [39]:
sharks4.Injury.value_counts().head(10)

FATAL             1265
UNKNOWN INJURY    1007
FOOT               888
LEG                734
NO INJURY          552
ARM                384
HAND               352
THIGH              299
CALF               186
Name: Injury, dtype: int64

## Fatal Col

In [40]:
sharks4.Fatal.value_counts()

N          3903
Y          1259
UNKNOWN      40
 N            7
y             1
M             1
2017          1
Name: Fatal, dtype: int64

In [43]:
# eliminate white space
sharks4['Fatal'] = sharks4['Fatal'].astype(str)
sharks4['Fatal'] = sharks4['Fatal'].apply(lambda x: re.sub(r' ', '', str(x)))

# Filling nans with UNKNOWN
sharks4['Fatal'] = sharks4['Fatal'].apply(lambda x: re.sub(r'nan', 'UNKNOWN', str(x)))

# Keeping only 'N', 'Y' or 'UNKNOWN'; any other value (e.g. 'y', 'M', '2017') becomes NaN
sharks4['Fatal'] = sharks4[(sharks4['Fatal'] == 'N') | (sharks4['Fatal'] == 'Y') | (sharks4['Fatal'] == 'UNKNOWN')].Fatal


In [45]:
sharks4.Fatal.value_counts()

N          3910
Y          1259
UNKNOWN     498
Name: Fatal, dtype: int64
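
A variant of this cleanup that also rescues the lowercase `'y'` is sketched below on a made-up Series. Upper-casing before filtering is an assumption about what the stray value means; the cells above turn it into NaN instead.

```python
import numpy as np
import pandas as pd

fatal = pd.Series(["N", "Y", " N", "y", "M", "2017", "UNKNOWN", np.nan])

# Fill missing values, trim whitespace and unify the case before filtering.
normalized = fatal.fillna("UNKNOWN").str.strip().str.upper()
normalized = normalized.where(normalized.isin(["N", "Y", "UNKNOWN"]))
```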

## Species Col

In [50]:
sharks4.Species.value_counts(dropna=False)

UNKNOWN                                               2546
White shark                                            139
Invalid                                                 87
Shark involvement prior to death was not confirmed      86
Shark involvement not confirmed                         79
                                                      ... 
Porbeagle shark, 7'                                      1
Possibly a broadnose 7-gill shark                        1
Raggedtooth shark, 2 m [6'9"]                            1
Blacktip shark, 2'                                       1
1.8 m [6'] "cocktail shark                               1
Name: Species, Length: 1424, dtype: int64

In [49]:
# Assign the result back instead of using inplace=True on a column selection
sharks4['Species'] = sharks4['Species'].fillna('UNKNOWN')
sharks4['Species'] = sharks4['Species'].astype(str)


In [156]:
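# Note: the labels 'TIGUER' and 'RAGGETDTOOTH' are misspelled ('TIGER', 'RAGGEDTOOTH');
# they are left unchanged here so that the code matches the printed outputs.
species_dic = {'white':'WHITE','tiger':'TIGUER', 'bull':'BULL', 'wobbegong':'WOBBEGONG', 'blacktip': 'BLACKTIP',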
               'black t': 'BLACKTIP',
               'blue':'BLUE', 'bronze': 'BRONZE', 'raggedtooth':'RAGGETDTOOTH', 'zambesi':'ZAMBESI', 'zambezi':'ZAMBESI', 'mako':'MAKO',
               'hammerhead':'HAMMERHEAD', 'spinner':'SPINNER', 'lemon':'LEMON', 'sand':'SAND', 'gray': 'GRAY', 'grey':'GRAY',
               'caribbean':'CARIBBEAN','nurse':'NURSE', 'angel': 'ANGEL', 'dusk':'DUSKY', 'reef':'REEF','galapagos':'GALAPAGOS',
               'not confirmed':'INVALID', 'unconfirmed':'INVALID','questionable':'INVALID', 'not co':'INVALID','doubt':'INVALID',
               'invalid':'INVALID', 'no shark':'INVALID', 'unidentified':'UNKNOWN','small':'UNKNOWN','little':'UNKNOWN'}
sharks4.Species.value_counts().head(30)

UNKNOWN             2622
WHITE                580
INVALID              428
TIGUER               263
BULL                 167
BLACKTIP              99
NURSE                 92
BRONZE                58
WOBBEGONG             49
BLUE                  48
MAKO                  43
HAMMERHEAD            42
RAGGETDTOOTH          41
4' shark              39
6' shark              39
1.8 m [6'] shark      33
LEMON                 32
1.5 m [5'] shark      31
ZAMBESI               28
GRAY                  28
1.2 m [4'] shark      26
3' shark              26
4' to 5' shark        24
SAND                  24
5' shark              23
2 m shark             22
SPINNER               21
3 m [10'] shark       21
3' to 4' shark        18
CARIBBEAN             16
Name: Species, dtype: int64

In [163]:
def cat_species(spe0):
    for key in species_dic:
        if key in spe0.lower():
            return species_dic[key]
    return 'UNKNOWN'

In [164]:
sharks4['Species'] = sharks4.Species.apply(cat_species)

In [165]:
sharks4.Species.value_counts().iloc[0:60]


UNKNOWN       3871
WHITE          580
INVALID        432
BULL           167
BLACKTIP        99
NURSE           92
BRONZE          58
WOBBEGONG       49
BLUE            48
MAKO            43
HAMMERHEAD      42
LEMON           32
ZAMBESI         28
GRAY            28
SAND            24
SPINNER         21
CARIBBEAN       16
REEF            13
DUSKY           12
GALAPAGOS        7
ANGEL            5
Name: Species, dtype: int64
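
In the output above, `TIGUER` and `RAGGETDTOOTH` no longer appear, while `UNKNOWN` has grown. One plausible cause, assuming `cat_species` was applied to data that had already been categorized, is that the misspelled labels do not contain their own keys and therefore fall through to `'UNKNOWN'` on a second pass. The sketch uses a three-entry subset of `species_dic`:

```python
species_dic = {'white': 'WHITE', 'tiger': 'TIGUER', 'dusk': 'DUSKY'}   # subset only

def cat_species(spe0):
    for key in species_dic:
        if key in spe0.lower():
            return species_dic[key]
    return 'UNKNOWN'

first_pass = cat_species("Tiger shark, 3m")        # 'TIGUER'
second_pass = cat_species(first_pass)              # 'UNKNOWN': 'tiguer' contains no key
stable = cat_species(cat_species("Dusky shark"))   # 'DUSKY' survives a second pass
```

Labels that contain their own key, such as `'DUSKY'` or `'WHITE'`, are unaffected by re-running the cell, which is consistent with their counts staying the same.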

## Selecting Main Columns
### And dropping rows that are missing main info

Main columns:
* Type
* Country
* Sex
* Injury
* Species

In [195]:
sharks4 = sharks4.reset_index(drop=True)

In [196]:
sharks4.Type.value_counts(dropna=False)

Unprovoked      4376
Provoked         515
Invalid          470
Sea Disaster     169
Boating          133
NaN                4
Name: Type, dtype: int64

In [197]:
sharks4.Country.value_counts(dropna=False).head(20)

USA                 2121
AUSTRALIA           1185
SOUTH AFRICA         512
PAPUA NEW GUINEA     113
NEW ZEALAND          112
BAHAMAS              106
BRAZIL                98
MEXICO                79
ITALY                 58
FIJI                  55
REUNION               52
PHILIPPINES           47
NaN                   44
CUBA                  40
MOZAMBIQUE            38
NEW CALEDONIA         37
SPAIN                 37
EGYPT                 35
INDIA                 33
PANAMA                31
Name: Country, dtype: int64

In [198]:
sharks4.Sex.value_counts(dropna=False)

M      5034
F       629
NaN       4
Name: Sex, dtype: int64

In [211]:
sharks4.Injury.value_counts(dropna=False)

FATAL             1265
UNKNOWN INJURY    1007
FOOT               888
LEG                734
NO INJURY          552
ARM                384
HAND               352
THIGH              299
CALF               186
Name: Injury, dtype: int64

In [212]:
sharks4.Species.value_counts(dropna=False)

UNKNOWN       3871
WHITE          580
INVALID        432
BULL           167
BLACKTIP        99
NURSE           92
BRONZE          58
WOBBEGONG       49
BLUE            48
MAKO            43
HAMMERHEAD      42
LEMON           32
ZAMBESI         28
GRAY            28
SAND            24
SPINNER         21
CARIBBEAN       16
REEF            13
DUSKY           12
GALAPAGOS        7
ANGEL            5
Name: Species, dtype: int64

In [209]:
sharks5 = sharks4[sharks4.Type.notna()]
sharks5 = sharks5[sharks5.Country.notna()]
sharks5 = sharks5[sharks5.Sex.notna()]

# sharks4.Country.value_counts(dropna=False).head(20)

In [219]:
# Dropping INVALID because they're probably not real attacks

sharks5 = sharks5[sharks5.Species!='INVALID']
sharks5 = sharks5[sharks5.Type!='Invalid']
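
The two filtering steps above can be condensed with `dropna(subset=...)` and a combined boolean mask. The frame below is made up, and `main_cols` is a hypothetical name for the columns that must not be missing.

```python
import pandas as pd

toy = pd.DataFrame({
    "Type": ["Unprovoked", None, "Invalid", "Provoked"],
    "Country": ["USA", "USA", "SPAIN", None],
    "Sex": ["M", "F", "M", "M"],
    "Species": ["WHITE", "BULL", "UNKNOWN", "INVALID"],
})

main_cols = ["Type", "Country", "Sex"]
clean = toy.dropna(subset=main_cols)          # drop rows missing any main column
clean = clean[(clean["Species"] != "INVALID") & (clean["Type"] != "Invalid")]
```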

In [221]:
print('Total Dropped Rows: {}'.format(sharks.shape[0]-sharks5.shape[0]))

Total Dropped Rows: 20581


In [222]:
sharks5.to_csv('sharks_clean.csv')
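
By default `to_csv` also writes the index, which comes back as an extra `Unnamed: 0` column when the file is read again. If that column is not wanted, `index=False` leaves it out. A small in-memory sketch with made-up data:

```python
import io
import pandas as pd

toy = pd.DataFrame({"Type": ["Unprovoked", "Provoked"], "Sex": ["M", "F"]})

with_index = io.StringIO()
toy.to_csv(with_index)                    # default: index written as first column
without_index = io.StringIO()
toy.to_csv(without_index, index=False)    # only the data columns

reloaded = pd.read_csv(io.StringIO(with_index.getvalue()))
header_without_index = without_index.getvalue().splitlines()[0]
```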