### First of all, I will import all necessary libraries (including matplotlib, as I might need it later):

In [33]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt

### I downloaded and unzipped the .csv file in the same folder as the project notebook. In order to read it we need to set the **"engine"** argument of the **pd.read_csv** function to *"python"*.

In [2]:
sharks=pd.read_csv('./GSAF5.csv',sep=",",engine='python')

print(f"The df has {sharks.shape[0]} rows and {sharks.shape[1]} columns. It looks like this:")
sharks.head(3)

The df has 5992 rows and 24 columns. It looks like this:


Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.c,2016.09.18.c,5993,,
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.b,2016.09.18.b,5992,,
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.a,2016.09.18.a,5991,,


### Let's see the types of each variable (column) in our data set:

In [3]:
sharks.dtypes

Case Number               object
Date                      object
Year                       int64
Type                      object
Country                   object
Area                      object
Location                  object
Activity                  object
Name                      object
Sex                       object
Age                       object
Injury                    object
Fatal (Y/N)               object
Time                      object
Species                   object
Investigator or Source    object
pdf                       object
href formula              object
href                      object
Case Number.1             object
Case Number.2             object
original order             int64
Unnamed: 22               object
Unnamed: 23               object
dtype: object

### Most variables have an "object" type (strings). It might be necessary to clean columns and change their types later on. Now, let's look for missing values:

In [4]:
sharks.copy().isnull().sum()

Case Number                  0
Date                         0
Year                         0
Type                         0
Country                     43
Area                       402
Location                   496
Activity                   527
Name                       200
Sex                        567
Age                       2681
Injury                      27
Fatal (Y/N)                 19
Time                      3213
Species                   2934
Investigator or Source      15
pdf                          0
href formula                 1
href                         3
Case Number.1                0
Case Number.2                0
original order               0
Unnamed: 22               5991
Unnamed: 23               5990
dtype: int64
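### Raw counts are easier to judge as a share of all rows. Here is a minimal sketch of that idea on a small made-up dataframe (`toy` and the 75% threshold are my own assumptions, not part of the GSAF data): the mean of `isnull()` gives the fraction of missing values per column, which can be used to flag nearly empty columns.

```python
import pandas as pd

# Hypothetical toy data standing in for the real file
toy = pd.DataFrame({
    "Country": ["USA", None, "AUSTRALIA", "USA"],
    "Age": ["16", None, None, "43"],
    "Unnamed: 22": [None, None, None, "x"],
})

# isnull() gives booleans; their mean is the fraction of missing values
missing_share = toy.isnull().mean()
print(missing_share)

# Columns where at least 75% of the values are missing (arbitrary threshold)
mostly_empty = missing_share[missing_share >= 0.75].index.tolist()
print(mostly_empty)
```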

### We see that the last two columns ("Unnamed: 22" and "Unnamed: 23") are practically empty, so there's no need to keep them in our clean dataframe. I will do my cleaning in a new dataframe (clean_sharks). Let's get rid of those columns:

In [5]:
clean_sharks = sharks.copy().drop(["Unnamed: 22","Unnamed: 23"], axis=1)
clean_sharks.head(3)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,N,13h00,,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.c,2016.09.18.c,5993
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,...,N,11h00,,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.b,2016.09.18.b,5992
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,N,10h43,,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.a,2016.09.18.a,5991


### Great! Now, I've noticed that not all columns have "pretty" names. That is, some of them begin or end with spaces, or are just not intuitive:

In [6]:
clean_sharks.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order'],
      dtype='object')

### Let's "prettify" those column names:

In [7]:
clean_sharks.columns = [c.strip() for c in clean_sharks.columns]
clean_sharks.rename(index=str, columns={"Fatal (Y/N)": "Fatal", "href formula": "url","href":"url 2"},inplace=True)
clean_sharks.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex', 'Age', 'Injury', 'Fatal', 'Time', 'Species',
       'Investigator or Source', 'pdf', 'url', 'url 2', 'Case Number.1',
       'Case Number.2', 'original order'],
      dtype='object')
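### A side note on the renaming step: passing `index=str` to `rename` also converts the row labels to strings, which is probably not needed here. The sketch below, on a made-up one-row dataframe (`toy` is hypothetical), shows an equivalent clean-up that leaves the row index untouched:

```python
import pandas as pd

# Hypothetical toy frame with the same kind of untidy column names
toy = pd.DataFrame({"Sex ": ["M"], "Species ": [None], "Fatal (Y/N)": ["N"], "href formula": ["x"]})

# Index.str.strip() trims whitespace from every column label at once
toy.columns = toy.columns.str.strip()
# Passing only `columns` renames the columns and keeps the integer row index
toy = toy.rename(columns={"Fatal (Y/N)": "Fatal", "href formula": "url"})
print(list(toy.columns))
```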

### Looking at the dataframe, I see that the columns "Case Number", "Case Number.1" and "Case Number.2" are quite similar. Let's see all the rows in which these columns differ:

In [8]:
case_cols = ["Case Number", "Case Number.1", "Case Number.2"]
# True for rows where "Case Number" differs from either of the other two columns
case_diff = ((clean_sharks["Case Number"] != clean_sharks["Case Number.1"])
             | (clean_sharks["Case Number"] != clean_sharks["Case Number.2"]))
print("rows with differences: ", case_diff.sum())
clean_sharks.loc[case_diff, case_cols]

rows with differences:  13


Unnamed: 0,Case Number,Case Number.1,Case Number.2
4,2016.09.15,2016.09.16,2016.09.15
33,2016.07.14.4,2016.07.14.R,2016.07.14.4
97,2016.01.24.b,2015.01.24.b,2016.01.24.b
116,2015.12.23,2015.11.07,2015.12.23
121,2015.10.28.a,2015.10.28,2015.10.28.a
169,2015.07-10,2015.07.10,2015.07.10
3296,1967.07.05,1967/07.05,1967.07.05
3569,"1962,08.30.b",1962.08.30.b,"1962,08.30.b"
3654,1961.09.02.R,"1961.09,06.R",1961.09.02.R
4177,1952.08.05,1952.08.04,1952.08.05


### That's only 13 rows. Let's consider the "Case Number" value as the valid one and discard the other two columns:

In [9]:
clean_sharks.drop(["Case Number.1","Case Number.2"], axis=1,inplace=True)

### It's the same case for columns "url" and "url 2". Only 54 rows have different values between these two columns, and column "url 2" has more missing values than "url". I think it's safe to say we can discard column "url 2" as well:

In [10]:
print("rows with differences: ", clean_sharks[(clean_sharks["url"] != clean_sharks["url 2"])].shape[0])
clean_sharks.drop(["url 2"], axis=1,inplace=True)

rows with differences:  54
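### One caveat about counting differences with `!=`: a missing value never equals anything, not even another missing value, so rows where both columns are empty are counted as different. The sketch below uses made-up values (`toy`, `naive_diff` and `real_diff` are hypothetical names) to show how the count changes once those rows are ignored:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; object dtype, like the columns in the notebook
toy = pd.DataFrame({
    "url": ["a.pdf", np.nan, "c.pdf", "d.pdf"],
    "url 2": ["a.pdf", np.nan, "x.pdf", np.nan],
}, dtype=object)

# Plain != treats NaN as different from everything, including another NaN
naive_diff = toy["url"] != toy["url 2"]

# Ignore rows where both values are missing
both_missing = toy["url"].isnull() & toy["url 2"].isnull()
real_diff = naive_diff & ~both_missing

print(naive_diff.sum(), real_diff.sum())
```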


### Out of nearly 6000 rows only 124 don't show the year of the incident (showing 0 as the year). We could try to correct those values with the information available in columns "Case Number" or "Date"... but is it really worth it?:

In [11]:
# Rows whose year is unknown are recorded with Year == 0
test=clean_sharks.loc[clean_sharks["Year"]==0, ["Case Number","Date"]].copy()
print("Number of rows with Year '0': ",test.shape[0])
test.head(10)

Number of rows with Year '0':  124


Unnamed: 0,Case Number,Date
5868,0.0214,Ca. 214 B.C.
5869,0.0336,Ca. 336.B.C..
5870,0.0493,493 B.C.
5871,0.0725,Ca. 725 B.C.
5872,ND-0153,1990 or 1991
5873,ND-0152,Before 2016
5874,ND-0151,Before Oct-2009
5875,ND-0150,Before 1934
5876,ND-0149,Before 1934
5877,ND-0148,2009?


### Looking at the Case Number and dates of these rows, we see that in most cases there's no clear information on the date of occurrence. I will discard these rows from my clean dataframe:

In [12]:
clean_sharks.drop(test.index,axis=0,inplace=True)
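### For reference, if we ever wanted to rescue some of these rows instead of dropping them, a rough year could be read from the "Date" text. This is only a sketch under my own assumptions (`guess_year` is a hypothetical helper, B.C. dates are left as unknown, and a text like "Before 1934" only gives an upper bound, so the result is approximate):

```python
import re

def guess_year(date_text):
    """Return the last 4-digit number found in the text, or 0 if none (hypothetical helper)."""
    if "B.C" in date_text.upper():
        return 0  # B.C. dates are left as unknown in this sketch
    # A run of exactly four digits, not glued to other digits
    years = re.findall(r"(?<!\d)(\d{4})(?!\d)", date_text)
    return int(years[-1]) if years else 0

for text in ["Before 1934", "2009?", "1990 or 1991", "Before Oct-2009", "Ca. 214 B.C."]:
    print(text, "->", guess_year(text))
```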

### We still have lots of null values in certain columns:

In [13]:
clean_sharks.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5868 entries, 0 to 5867
Data columns (total 19 columns):
Case Number               5868 non-null object
Date                      5868 non-null object
Year                      5868 non-null int64
Type                      5868 non-null object
Country                   5829 non-null object
Area                      5491 non-null object
Location                  5406 non-null object
Activity                  5360 non-null object
Name                      5675 non-null object
Sex                       5311 non-null object
Age                       3298 non-null object
Injury                    5842 non-null object
Fatal                     5849 non-null object
Time                      2772 non-null object
Species                   3024 non-null object
Investigator or Source    5853 non-null object
pdf                       5868 non-null object
url                       5867 non-null object
original order            5868 non-null int64
dtypes:

### Apart from "Age", all of these columns should keep their type (object). Therefore, we can replace null values in these columns with "Unknown" and avoid conflicts:

In [14]:
unknown_columns=["Country","Area","Location","Activity","Name","Sex","Injury","Fatal","Time","Species",
                 "Investigator or Source","url"]

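# Assign the result back: inplace=True on a selected column is not reliable in newer pandas versions
for u in unknown_columns: clean_sharks[u] = clean_sharks[u].fillna("Unknown")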

null_cols = clean_sharks.isnull().sum()
null_cols[null_cols > 0]



Age    2570
dtype: int64

### For "Age" we can replace null values with 0:

In [15]:
# "Time" was already filled with "Unknown" above, so only "Age" actually changes here
zero_columns=["Age","Time"]

for z in zero_columns: clean_sharks[z] = clean_sharks[z].fillna(0)
    
null_cols = clean_sharks.isnull().sum()
null_cols[null_cols > 0]

Series([], dtype: int64)

### Great! Now that we have no null values in the dataframe, let's do some more cleaning. We would like the "Age" variable to be of *int64* type, but there are still some values that won't allow us to convert it directly. Here's the set of values in this column:

In [16]:
print(set(clean_sharks['Age']))

{0, '32', '69', '36', '37', '7 or 8', '36 & 23', '57', ' ', '25 to 35', '40', '46', '10 or 12', '28', 'F', '53', 'Both 11', '25', '11', '10', '20', '22', '72', '46 & 34', '7      &    31', 'A.M.', '86', 'Elderly', '17 & 35', '9 or 10', '50 & 30', 'Ca. 33', '16', '45', '30', '47', '52', '1', '? & 19', '21 or 26', '51', '58', 'Teen', '17 & 16', '54', '33 & 26', '14', '?    &   14', '8 or 10', '29', 'mid-30s', '8', '71', '87', '64', '60', '31 or 33', '34', '19', '9 months', '"middle-age"', '13 or 18', '26', '33 & 37', 'X', '30 or 36', '77', '30 & 32', '35', '41', 'teen', '2�', '24', '44', '37, 67, 35, 27,  ? & 27', '2 to 3 months', '9', '70', '55', '30s', '21', '27', '28, 23 & 30', '25 or 28', '36 & 26', '62', '60s', '16 to 18', '40s', '63', '20s', '66', '65', '18', '17', '12 or 13', '23 & 26', '78', '9 & 12', '21, 34,24 & 35', '20?', '18 to 22', '18 months', '33', '38', '  ', '32 & 30', '34 & 19', '33 or 37', "60's", 'Teens', '42', '39', '3', '59', '73', '21 & ?', '31', '48', '43', '� ',

### OK, we see that some rows contain letters or more than one number (separated by spaces or special characters). Let's see how many rows contain non-numeric strings:

In [17]:
clean_sharks[clean_sharks["Age"].str.isdigit()==False].shape[0]

100

### Not that many! I'd say we drop them and make this column *int64* type:

In [18]:
# Rows whose "Age" is a string that is not purely digits (the integer 0 fills are not matched)
test = clean_sharks[clean_sharks["Age"].str.isdigit()==False]
clean_sharks.drop(test.index,axis=0,inplace=True)

In [19]:
clean_sharks["Age"]=clean_sharks["Age"].astype("int64")
print(set(clean_sharks['Age']))

{0, 1, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 77, 78, 81, 84, 86, 87}
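### Keep in mind that an age of 0 now means "unknown", so it would pull down any average computed on this column. An alternative that I am not using in this notebook is to let pandas mark unreadable ages as missing. The sketch below runs on made-up values (`raw_age` is hypothetical) and uses `pd.to_numeric` with the nullable "Int64" dtype:

```python
import pandas as pd

# Hypothetical sample of raw "Age" values
raw_age = pd.Series(["32", "7 or 8", "Teen", None, "18"], dtype=object)

# Anything that is not a clean number becomes NaN instead of raising an error
age = pd.to_numeric(raw_age, errors="coerce")

# The nullable "Int64" dtype keeps integers and marks missing values as <NA>
age = age.astype("Int64")
print(age)
print("mean age:", age.mean())
```

### With this approach the missing ages are skipped by `mean()` instead of being counted as zeros.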


### Now let's clean the "Sex" column using a function:

In [20]:
print("Values before cleaning: ",set(clean_sharks['Sex']))

def sex_clean(s):
    sex=str(s).strip().upper()
    if sex in ["M","F"]: return sex
    return "Unknown"

clean_sharks["Sex"]=clean_sharks["Sex"].apply(sex_clean)
print("Values after cleaning: ",set(clean_sharks['Sex']))
clean_sharks["Sex"].value_counts()

Values before cleaning:  {'F', 'Unknown', '.', 'N', 'M ', 'M', 'lli'}
Values after cleaning:  {'F', 'M', 'Unknown'}


M          4663
F           553
Unknown     552
Name: Sex, dtype: int64

### Same for the "Fatal" column:

In [21]:
print("Values before cleaning: ",set(clean_sharks['Fatal']))

def fatal_clean(f):
    fatal=str(f).strip().upper()
    if fatal in ["Y","N"]: return fatal
    return "Unknown"

clean_sharks["Fatal"]=clean_sharks["Fatal"].apply(fatal_clean)
print("Values after cleaning: ",set(clean_sharks['Fatal']))
clean_sharks["Fatal"].value_counts()

Values before cleaning:  {'F', 'N ', 'UNKNOWN', 'Unknown', 'Y', ' N', 'N', 'n', '#VALUE!'}
Values after cleaning:  {'N', 'Y', 'Unknown'}


N          4192
Y          1462
Unknown     114
Name: Fatal, dtype: int64
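### The same clean-up can also be written without `apply`, using the vectorized string methods. This is just a sketch on the raw values printed above (the names `raw_fatal`, `normalized` and `fatal` are hypothetical):

```python
import pandas as pd

# The distinct raw values seen in the "Fatal" column before cleaning
raw_fatal = pd.Series(["F", "N ", "UNKNOWN", "Unknown", "Y", " N", "N", "n", "#VALUE!"])

# Trim spaces and unify the case for the whole column at once
normalized = raw_fatal.str.strip().str.upper()
# Keep values that are in the allowed set, replace everything else
fatal = normalized.where(normalized.isin(["Y", "N"]), "Unknown")
print(fatal.tolist())
```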

### Now let's try cleaning the "Type" column:

In [22]:
print(set(clean_sharks['Type']))

{'Boating', 'Provoked', 'Boat', 'Unprovoked', 'Invalid', 'Sea Disaster'}


### By looking at rows with types "Boating" or "Boat" we can see that these refer to the same type of incidents. So let's group those into the same category ("Boat"):

In [23]:
print(clean_sharks[(clean_sharks["Type"]=="Boating")][["Type","Injury"]].head())
print(clean_sharks[(clean_sharks["Type"]=="Boat")][["Type","Injury"]].head())

def type_clean(t):
    typ=t
    if typ in ["Boat","Boating"]: return "Boat"
    return typ

clean_sharks["Type"]=clean_sharks["Type"].apply(type_clean)
print("Values after cleaning: ",set(clean_sharks['Type']))
clean_sharks["Type"].value_counts()



         Type                                             Injury
4025  Boating  FATAL. Shark sank fishing boat, causing death ...
4027  Boating              No injury, sharks bit propellers, etc
4030  Boating                                         No details
4047  Boating          No injury to occupants, shark gouged hull
4048  Boating  No injury to occupants, shark released from ne...
    Type                                            Injury
5   Boat          Shark rammed boat. No injury to occupant
22  Boat          No injury, shark nudged kayak repeatedly
29  Boat               No injury, shark bit trolling motor
35  Boat  No injury. Hull bitten, tooth fragment recovered
37  Boat  No injury. Hull bitten, tooth fragment recovered
Values after cleaning:  {'Provoked', 'Boat', 'Unprovoked', 'Invalid', 'Sea Disaster'}


Unprovoked      4211
Provoked         541
Invalid          506
Boat             300
Sea Disaster     210
Name: Type, dtype: int64
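### For a simple relabeling like this one, a dictionary passed to `replace` does the same job as the function. A minimal sketch on made-up values (`types` is a hypothetical name):

```python
import pandas as pd

# Hypothetical sample of "Type" values
types = pd.Series(["Boating", "Boat", "Unprovoked", "Provoked", "Boating"])
# Dictionary keys are the old labels, values are the labels to keep
types = types.replace({"Boating": "Boat"})
print(types.value_counts())
```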

### In *object* type variables we see lots of rows that hold the same value but are counted as different values due to spaces at the beginning or end of the string, or to differences in upper/lower case (for example: "Florida" vs "florida" vs "  Florida"). We can easily normalize these values:

In [24]:
def strip_cap(s):
    return str(s).strip().lower().capitalize()

def strip_title(s):
    return str(s).strip().lower().title()

to_cap=["Activity","Injury"]
to_title=["Area","Location","Name","Species"]

for c in to_cap:
    beforecap=len(set(clean_sharks[c]))
    print (f"Different values in {c} before cleaning: {beforecap}")
    clean_sharks[c] = clean_sharks[c].apply(strip_cap)
    aftercap=len(set(clean_sharks[c]))
    print (f"Different values in {c} after cleaning: {aftercap}")
    
for t in to_title:
    beforetitle=len(set(clean_sharks[t]))
    print (f"Different values in {t} before cleaning: {beforetitle}")
    clean_sharks[t] = clean_sharks[t].apply(strip_title)
    aftertitle=len(set(clean_sharks[t]))
    print (f"Different values in {t} after cleaning: {aftertitle}")
    
beforecount=len(set(clean_sharks["Country"]))
print (f"Different values in Country before cleaning: {beforecount}")
clean_sharks["Country"] = clean_sharks["Country"].str.strip().str.upper()
aftertcount=len(set(clean_sharks["Country"]))
print (f"Different values in Country after cleaning: {aftertcount}")

Different values in Activity before cleaning: 1425
Different values in Activity after cleaning: 1369
Different values in Injury before cleaning: 3484
Different values in Injury after cleaning: 3403
Different values in Area before cleaning: 754
Different values in Area after cleaning: 734
Different values in Location before cleaning: 3804
Different values in Location after cleaning: 3761
Different values in Name before cleaning: 4864
Different values in Name after cleaning: 4850
Different values in Species before cleaning: 1498
Different values in Species after cleaning: 1404
Different values in Country before cleaning: 197
Different values in Country after cleaning: 186
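### To make the effect of these helpers concrete, here is a small check on made-up strings (the sample values are my own). It also shows one side effect of `title()` worth knowing about: it capitalizes the letter after an apostrophe.

```python
def strip_title(s):
    # Same logic as the notebook's helper
    return str(s).strip().lower().title()

# Four spellings of the same place collapse into a single value
variants = ["Florida", "florida", "  Florida", "FLORIDA "]
print({strip_title(v) for v in variants})

# Caveat: title() capitalizes after every non-letter, including apostrophes
print(strip_title("hawke's bay"))
```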


### Come to think of it, there are still some columns that won't be really necessary for my analysis:

In [34]:
clean_sharks.head(3)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal,Time,Species,Investigator or Source,pdf,url,original order
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Male,M,16,Minor injury to thigh,N,13h00,Unknown,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,5993
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,36,Lacerations to hands,N,11h00,Unknown,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,5992
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Male,M,43,Lacerations to lower leg,N,10h43,Unknown,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,5991


### Specifically the "Name", "Investigator or Source", "pdf", "url" and "original order" columns. So, let's drop those too:

In [35]:
clean_sharks = clean_sharks.drop(["Name", "Investigator or Source", "pdf", "url","original order"], axis=1)
clean_sharks.head(3)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Sex,Age,Injury,Fatal,Time,Species
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,M,16,Minor injury to thigh,N,13h00,Unknown
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,M,36,Lacerations to hands,N,11h00,Unknown
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,M,43,Lacerations to lower leg,N,10h43,Unknown


### I will focus my analysis on the incidents that occurred after 1950:

In [46]:
clean_sharks[clean_sharks["Year"]>1950]

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Sex,Age,Injury,Fatal,Time,Species
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,M,16,Minor injury to thigh,N,13h00,Unknown
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,M,36,Lacerations to hands,N,11h00,Unknown
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,M,43,Lacerations to lower leg,N,10h43,Unknown
3,2016.09.17,17-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Surfing,M,0,Struck by fin on chest & leg,N,Unknown,Unknown
4,2016.09.15,16-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Bells Beach,Surfing,M,0,No injury: knocked off board by shark,N,Unknown,2 M Shark
5,2016.09.15.R,15-Sep-16,2016,Boat,AUSTRALIA,Western Australia,Bunbury,Fishing,Unknown,0,Shark rammed boat. no injury to occupant,N,Unknown,Unknown
7,2016.09.07,07-Sep-16,2016,Unprovoked,USA,Hawaii,"Makaha, Oahu",Swimming,F,51,Severe lacerations to shoulder & forearm,N,14h30,"Tiger Shark, 10?"
8,2016.09.06,06-Sep-16,2016,Unprovoked,NEW CALEDONIA,North Province,Koumac,Kite surfing,M,50,Fatal,Y,15h40,Unknown
9,2016.09.05.b,05-Sep-16,2016,Unprovoked,USA,South Carolina,"Kingston Plantation, Myrtle Beach, Horry County",Boogie boarding,F,12,Lacerations & punctures to lower right leg,N,Late afternoon,Unknown
10,2016.09.05.a,05-Sep-16,2016,Unprovoked,AUSTRALIA,Western Australia,Injidup,Surfing,M,0,"No inury, board broken in half by shark",N,Late afternoon,Unknown
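### Note that the expression in this cell only displays the filtered rows; it does not store them anywhere. If the subset is needed later, it has to be assigned. A minimal sketch on made-up rows (`toy` and `recent` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Year": [1949, 1950, 1951, 2016],
    "Type": ["Boat", "Provoked", "Unprovoked", "Unprovoked"],
})

# Strictly greater than 1950, so 1950 itself is excluded;
# .copy() makes the subset independent from the original dataframe
recent = toy[toy["Year"] > 1950].copy()
print(recent)
```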


### It would be useful to have a "Month" column in our clean dataframe. We could try extracting the month from the "Date" column, but it's not present in every row. However, in almost every row the "Case Number" value contains the month of the attack as a number (which will make it easier to sort later on). So let's try extracting the month from this column (an unknown month will be represented as 0):

In [78]:
# Capture the first ".NN." or ".NN-" group in the case number (normally the month)
clean_sharks["Month"] = clean_sharks["Case Number"].str.extract(r'(\.\d+\.|\.\d+\-)', expand=False)
# Rows with no match become NaN; store them as 0 (unknown month) before casting to int
clean_sharks["Month"] = clean_sharks["Month"].str.strip('.').str.strip('-').fillna(0).astype(int)
new_columns = ['Case Number', 'Date','Month', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Sex', 'Age', 'Injury', 'Fatal', 'Time', 'Species']
clean_sharks = clean_sharks[new_columns]
clean_sharks.head(3)


Unnamed: 0,Case Number,Date,Month,Year,Type,Country,Area,Location,Activity,Sex,Age,Injury,Fatal,Time,Species
0,2016.09.18.c,18-Sep-16,9,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,M,16,Minor injury to thigh,N,13h00,Unknown
1,2016.09.18.b,18-Sep-16,9,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,M,36,Lacerations to hands,N,11h00,Unknown
2,2016.09.18.a,18-Sep-16,9,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,M,43,Lacerations to lower leg,N,10h43,Unknown
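### One thing to watch: the pattern takes the first number enclosed by dots (or a dot and a dash), so a mistyped case number such as "1962,08.30.b" would yield 30 instead of 8. Below is a stricter sketch under my own assumptions (`MONTH_PATTERN` and `month_from_case` are hypothetical, and "1900.00.00" is a made-up example): it anchors on the 4-digit year, accepts the separators seen in the typos, and returns 0 for anything outside 1 to 12.

```python
import re

# Year (4 digits), one separator, then a 2-digit month
MONTH_PATTERN = re.compile(r"^\d{4}[.,/-](\d{2})")

def month_from_case(case_number):
    """Return the month as an int in 1..12, or 0 when it cannot be read (hypothetical helper)."""
    match = MONTH_PATTERN.match(case_number)
    if not match:
        return 0
    month = int(match.group(1))
    return month if 1 <= month <= 12 else 0

# "1900.00.00" is a made-up example of a case number with no real month
for case in ["2016.09.18.c", "2015.07-10", "1962,08.30.b", "1900.00.00", "ND-0153"]:
    print(case, "->", month_from_case(case))
```

### It could be applied with `clean_sharks["Case Number"].apply(month_from_case)`, although I have not checked it against every case number in the file.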


In [25]:
'''test = pd.DataFrame()
test ["Survived"] = clean_sharks[clean_sharks["Fatal"]=="N"].groupby(["Sex"]).count()["Fatal"]
test ["Died"] = clean_sharks[clean_sharks["Fatal"]=="Y"].groupby(["Sex"]).count()["Fatal"]
test ["Surv rate"] = (test["Survived"]/(test["Survived"]+test["Died"]))*100
test
'''


In [26]:
'''
test2=pd.DataFrame()
test2["Reported attacks"] = clean_sharks.groupby(["Country"]).count()["Year"]
test2["Common Sex"]=clean_sharks.groupby(["Country"])["Sex"].agg(pd.Series.mode)
test2["Victim age"]=clean_sharks[clean_sharks["Sex"]==test2["Sex"]].groupby(["Country"])["Age"].agg(pd.Series.mean)
test2 = test2.sort_values("Reported attacks",ascending=False)
test2
#clean_sharks[clean_sharks["Country"]=="USA"]["Sex"].mode()[0]
#test2["Sex"]==clean_sharks["Sex"]
'''
