### First of all, I will import all necessary libraries (including matplotlib, as I might need it later):

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mp

### I downloaded and unzipped the .csv file into the same folder as the project notebook. To import it, we need to set the **"engine"** argument of the **pd.read_csv** function to *"python"*.

In [2]:
sharks=pd.read_csv('./GSAF5.csv',sep=",",engine='python')

print(f"The df has {sharks.shape[0]} rows and {sharks.shape[1]} columns. It looks like this:")
sharks.head(3)

The df has 5992 rows and 24 columns. It looks like this:


Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.c,2016.09.18.c,5993,,
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.b,2016.09.18.b,5992,,
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.a,2016.09.18.a,5991,,
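
To make the import step easier to try without the real file, here is a minimal sketch of the same `pd.read_csv` call on a tiny in-memory sample. The sample rows and the names `sample_csv` and `demo` are my own placeholders, not part of the GSAF5 data set.

```python
import io

import pandas as pd

# Hypothetical sample standing in for GSAF5.csv (not the real data).
sample_csv = io.StringIO(
    "Case Number,Date,Year\n"
    "2016.09.18.c,18-Sep-16,2016\n"
    "2016.09.18.b,18-Sep-16,2016\n"
)

# engine="python" selects pandas' pure-Python parser instead of the default C parser.
demo = pd.read_csv(sample_csv, sep=",", engine="python")

print(f"The df has {demo.shape[0]} rows and {demo.shape[1]} columns.")
```

The Python engine is slower than the C one, which should not matter much for a file of about 6000 rows.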


### Let's see the types of each variable (column) in our data set:

In [3]:
sharks.dtypes

Case Number               object
Date                      object
Year                       int64
Type                      object
Country                   object
Area                      object
Location                  object
Activity                  object
Name                      object
Sex                       object
Age                       object
Injury                    object
Fatal (Y/N)               object
Time                      object
Species                   object
Investigator or Source    object
pdf                       object
href formula              object
href                      object
Case Number.1             object
Case Number.2             object
original order             int64
Unnamed: 22               object
Unnamed: 23               object
dtype: object

### Most variables have an "object" type (strings). It might be necessary to clean columns and change their types later on. Now, let's look for missing values:

In [4]:
sharks.isnull().sum()

Case Number                  0
Date                         0
Year                         0
Type                         0
Country                     43
Area                       402
Location                   496
Activity                   527
Name                       200
Sex                        567
Age                       2681
Injury                      27
Fatal (Y/N)                 19
Time                      3213
Species                   2934
Investigator or Source      15
pdf                          0
href formula                 1
href                         3
Case Number.1                0
Case Number.2                0
original order               0
Unnamed: 22               5991
Unnamed: 23               5990
dtype: int64
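
Raw counts are easier to judge as a share of all rows. The sketch below shows one way to list the columns that are mostly empty, using a made-up toy frame. The 50% threshold and the names `toy`, `missing_share` and `mostly_empty` are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with invented values, only to illustrate the idea.
toy = pd.DataFrame({
    "Country": ["USA", None, "AUSTRALIA", "USA"],
    "Unnamed: 22": [None, None, None, "x"],
    "Year": [2016, 2016, 2015, 2014],
})

# isnull() gives booleans, so mean() is the fraction of missing values per column.
missing_share = toy.isnull().mean()

# Columns where more than half of the values are missing (arbitrary threshold).
mostly_empty = missing_share[missing_share > 0.5].index.tolist()
print(mostly_empty)
```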

### We see that the last two columns ("Unnamed: 22" and "Unnamed: 23") are practically empty, so there's no need to keep them in our clean dataframe. I will do my cleaning in a new dataframe (clean_sharks). Let's get rid of those columns:

In [5]:
clean_sharks = sharks.copy().drop(["Unnamed: 22","Unnamed: 23"], axis=1)
clean_sharks.head(3)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,N,13h00,,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.c,2016.09.18.c,5993
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,...,N,11h00,,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.b,2016.09.18.b,5992
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,N,10h43,,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.a,2016.09.18.a,5991


### Great! Now, I've noticed that not all columns have "pretty" names. That is, some of them begin or end with spaces, or are just not intuitive:

In [6]:
clean_sharks.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order'],
      dtype='object')

### Let's "prettify" those column names:

In [8]:
clean_sharks.columns = [c.strip() for c in clean_sharks.columns]
# Rename only the columns; passing index=str would also turn the row labels into strings.
clean_sharks.rename(columns={"Fatal (Y/N)": "Fatal", "href formula": "url", "href": "url 2"}, inplace=True)
clean_sharks.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex', 'Age', 'Injury', 'Fatal', 'Time', 'Species',
       'Investigator or Source', 'pdf', 'url', 'url 2', 'Case Number.1',
       'Case Number.2', 'original order'],
      dtype='object')
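
The same clean-up can also be written with the vectorised `.str` accessor on the column index. This is only a sketch on an empty toy frame that reuses a few of the column names above; `toy` is a placeholder name.

```python
import pandas as pd

# Empty toy frame that reuses some of the original column names.
toy = pd.DataFrame(columns=["Sex ", "Species ", "Fatal (Y/N)", "href formula", "href"])

# Strip leading and trailing whitespace from every column name at once.
toy.columns = toy.columns.str.strip()

# rename returns a new frame; only the listed columns change.
toy = toy.rename(columns={"Fatal (Y/N)": "Fatal", "href formula": "url", "href": "url 2"})
print(list(toy.columns))
```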

### By looking at the dataframe I see that the columns "Case Number", "Case Number.1" and "Case Number.2" are quite similar. Let's see all the rows in which these columns differ:

In [9]:
case_cols = ["Case Number", "Case Number.1", "Case Number.2"]
# True where "Case Number" disagrees with either of the other two columns.
mismatch = ((clean_sharks["Case Number"] != clean_sharks["Case Number.1"])
            | (clean_sharks["Case Number"] != clean_sharks["Case Number.2"]))
print("rows with differences: ", mismatch.sum())
clean_sharks.loc[mismatch, case_cols]

rows with differences:  13


Unnamed: 0,Case Number,Case Number.1,Case Number.2
4,2016.09.15,2016.09.16,2016.09.15
33,2016.07.14.4,2016.07.14.R,2016.07.14.4
97,2016.01.24.b,2015.01.24.b,2016.01.24.b
116,2015.12.23,2015.11.07,2015.12.23
121,2015.10.28.a,2015.10.28,2015.10.28.a
169,2015.07-10,2015.07.10,2015.07.10
3296,1967.07.05,1967/07.05,1967.07.05
3569,"1962,08.30.b",1962.08.30.b,"1962,08.30.b"
3654,1961.09.02.R,"1961.09,06.R",1961.09.02.R
4177,1952.08.05,1952.08.04,1952.08.05


### That's only 13 rows. Let's consider the "Case Number" value as the valid one and discard the other two columns:

In [10]:
clean_sharks.drop(["Case Number.1","Case Number.2"], axis=1,inplace=True)

### The same goes for columns "url" and "url 2". Only 54 rows have different values between these two columns, and column "url 2" has more missing values than "url". I think it's safe to discard column "url 2" as well:

In [11]:
print("rows with differences: ", clean_sharks[(clean_sharks["url"] != clean_sharks["url 2"])].shape[0])
clean_sharks.drop(["url 2"], axis=1,inplace=True)

rows with differences:  54
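
One caveat about counting differences with `!=`: a missing value never equals another missing value, so rows where both columns are empty are counted as different. I have not checked how many of the 54 rows fall into that case. The sketch below, on invented values, shows how the two counts can be separated.

```python
import numpy as np
import pandas as pd

# Invented values; dtype=object mirrors the string columns in the notebook.
toy = pd.DataFrame({
    "url":   ["a.pdf", "b.pdf", np.nan, np.nan],
    "url 2": ["a.pdf", "x.pdf", np.nan, "d.pdf"],
}, dtype=object)

# NaN != NaN evaluates to True, so the third row counts as a difference here.
naive_diff = toy["url"] != toy["url 2"]

# Exclude rows where both columns are missing.
both_missing = toy["url"].isna() & toy["url 2"].isna()
real_diff = naive_diff & ~both_missing

print(naive_diff.sum(), real_diff.sum())
```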


### Out of nearly 6000 rows only 124 don't show the year of the incident (showing 0 as the year). We could try to correct those values with the information available in columns "Case Number" or "Date"... but is it really worth it?:

In [12]:
# Rows whose year is recorded as 0, keeping only the columns that might reveal the real date.
test = clean_sharks.loc[clean_sharks["Year"] == 0, ["Case Number", "Date"]].copy()
print("Number of rows with Year '0': ",test.shape[0])
test.head(10)

Number of rows with Year '0':  124


Unnamed: 0,Case Number,Date
5868,0.0214,Ca. 214 B.C.
5869,0.0336,Ca. 336.B.C..
5870,0.0493,493 B.C.
5871,0.0725,Ca. 725 B.C.
5872,ND-0153,1990 or 1991
5873,ND-0152,Before 2016
5874,ND-0151,Before Oct-2009
5875,ND-0150,Before 1934
5876,ND-0149,Before 1934
5877,ND-0148,2009?


### By looking at the Case Number and dates of these rows we see that, in most cases, there's no clear information on the date of occurrence. I will discard these rows from my clean dataframe:

In [13]:
clean_sharks.drop(test.index,axis=0,inplace=True)
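
Dropping by index works, but the same result can be reached by keeping the rows that pass a boolean filter. A minimal sketch on invented rows, where `toy` and `kept` are placeholder names:

```python
import pandas as pd

# Invented rows: one with a real year, two with the placeholder year 0.
toy = pd.DataFrame({
    "Case Number": ["2016.09.18.c", "ND-0153", "0.0214"],
    "Year": [2016, 0, 0],
})

# Keep every row whose year is known instead of dropping the others by index.
kept = toy[toy["Year"] != 0]
print(kept.shape[0])
```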

### We still have lots of null values in certain columns:

In [14]:
clean_sharks.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5868 entries, 0 to 5867
Data columns (total 19 columns):
Case Number               5868 non-null object
Date                      5868 non-null object
Year                      5868 non-null int64
Type                      5868 non-null object
Country                   5829 non-null object
Area                      5491 non-null object
Location                  5406 non-null object
Activity                  5360 non-null object
Name                      5675 non-null object
Sex                       5311 non-null object
Age                       3298 non-null object
Injury                    5842 non-null object
Fatal                     5849 non-null object
Time                      2772 non-null object
Species                   3024 non-null object
Investigator or Source    5853 non-null object
pdf                       5868 non-null object
url                       5867 non-null object
original order            5868 non-null int64
dtypes:

### Apart from "Age", all of these columns should keep their type (object). Therefore, we can replace null values in these columns with "Unknown" and avoid conflicts:

In [15]:
unknown_columns=["Country","Area","Location","Activity","Name","Sex","Injury","Fatal","Time","Species",
                 "Investigator or Source","url"]

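# Assign the result back: calling fillna(inplace=True) on a selected column may not update the dataframe in recent pandas versions.
clean_sharks[unknown_columns] = clean_sharks[unknown_columns].fillna("Unknown")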

null_cols = clean_sharks.isnull().sum()
null_cols[null_cols > 0]



Age    2570
dtype: int64

### For "Age" we can replace null values with 0:

In [16]:
# "Time" was already filled with "Unknown" above, so only "Age" still has nulls.
clean_sharks["Age"] = clean_sharks["Age"].fillna(0)
    
null_cols = clean_sharks.isnull().sum()
null_cols[null_cols > 0]

Series([], dtype: int64)
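
Both fill steps can also be expressed in one call, because `fillna` accepts a dictionary that maps column names to fill values. A small sketch on invented data, with `toy` and `filled` as placeholder names:

```python
import pandas as pd

# Invented data; dtype=object mirrors the object columns in the notebook.
toy = pd.DataFrame({
    "Country": ["USA", None],
    "Age": ["12", None],
}, dtype=object)

# One fill value per column: text columns get "Unknown", "Age" gets 0.
filled = toy.fillna({"Country": "Unknown", "Age": 0})

print(filled.isnull().sum().sum())
```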

### Great! Now that we have no null values in the dataframe, let's do some more cleaning. We would like the "Age" variable to be of *int64* type, but there are still some values that won't allow us to convert it directly. Here's the set of values in this column:

In [17]:
print(set(clean_sharks['Age']))

{0, '73', '52', '29', '50 & 30', 'X', '77', '74', '43', '36', '44', '18 months', '33 & 26', ' ', '10 or 12', '"young"', '60s', '12', '41', '84', '23 & 26', '11', 'F', '42', '63', '30 or 36', '40', '2 to 3 months', '54', '20s', '69', '33', '9 or 10', '72', '18 or 20', '28, 23 & 30', '49', '67', '36 & 26', '38', 'young', '71', '19', '34 & 19', 'mid-20s', '81', 'Ca. 33', '87', '9', '1', '14', '6�', '30', '7      &    31', '� ', "60's", '60', 'adult', '13', '75', '53', '16', '12 or 13', '39', 'MAKE LINE GREEN', '70', '?    &   14', '17', '37', '31 or 33', '78', '2�', '51', '17 & 16', '24', '7 or 8', '30s', '17 & 35', '35', '8', '58', '55', 'mid-30s', '28', '8 or 10', '30 & 32', '23', '31', '33 & 37', 'teen', '27', '37, 67, 35, 27,  ? & 27', '3', '50s', '13 or 18', 'Elderly', 'A.M.', '45', '>50', '25 or 28', 'Both 11', '6', '7', '46', '21 & ?', '65', '26', '25', '56', '(adult)', '21', '23 & 20', '9 months', '68', '10', 'Teen', '32', '20?', '59', '  ', '20', '21 or 26', '18', '16 to 18', '15

### OK, we see that some values contain letters or more than one number (separated by spaces or special characters). Let's see how many rows contain non-numeric strings:

In [18]:
clean_sharks[clean_sharks["Age"].str.isdigit()==False].shape[0]

100
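
A detail worth spelling out: `.str.isdigit()` returns NaN for entries that are not strings, such as the integer 0 used to fill missing ages, and `NaN == False` is False. That is why those rows are not counted here. A short sketch on invented values, with `ages` and `is_digit` as placeholder names:

```python
import pandas as pd

# Invented mix of values similar to the "Age" column after filling nulls with 0.
ages = pd.Series([0, "73", "50 & 30", " "], dtype=object)

# Non-string entries (the integer 0) give NaN; strings give True or False.
is_digit = ages.str.isdigit()

# Only strings that are not purely digits are flagged.
flagged = is_digit == False
print(flagged.tolist())
```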

### Not that many! I'd say we drop them and make this column *int64* type:

In [19]:
# Rows whose "Age" is a string that is not made up of digits only.
test = clean_sharks[clean_sharks["Age"].str.isdigit()==False]
clean_sharks.drop(test.index,axis=0,inplace=True)

In [21]:
clean_sharks["Age"]=clean_sharks["Age"].astype("int64")
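
As an alternative to filtering with `str.isdigit`, `pd.to_numeric` with `errors="coerce"` turns every value it cannot parse into NaN, which can then be dropped before casting. This is a sketch on invented values and not the approach used above. Note that it treats missing ages as NaN rather than 0.

```python
import pandas as pd

# Invented values similar to the ones found in the "Age" column.
ages = pd.Series(["73", "50 & 30", "teen", " ", None, "12"], dtype=object)

# Anything that cannot be parsed as a number becomes NaN.
numeric_age = pd.to_numeric(ages, errors="coerce")

# Drop the unparseable entries and cast the rest to int64.
clean_age = numeric_age.dropna().astype("int64")
print(clean_age.tolist())
```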