# Pandas project:
# Are highly populated countries more susceptible to shark attacks?

Shark attacks are a global phenomenon that usually raises alarm in coastal communities, often with negative outcomes for shark populations.

## Hypothesis:

Popular culture is flooded with negative connotations of shark activity. However, are attacks by these species as common as we usually think? We would tend to think that the more people there are close to the water, the more attacks there will be.

Let's check the data:

For this analysis we imported the [Kaggle data on shark attack incidents](https://www.kaggle.com/teajay/global-shark-attacks/version/1#GSAF5.csv).

In [149]:
import pandas as pd
import numpy as np
import re

In [172]:
raw_df = pd.read_csv('GSAF5_uncleaned.csv', encoding='latin-1')
display(raw_df.head())
raw_df.shape

| | Case Number | Date | Year | Type | Country | Area | Location | Activity | Name | Sex | ... | Species | Investigator or Source | pdf | href formula | href | Case Number.1 | Case Number.2 | original order | Unnamed: 22 | Unnamed: 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016.09.18.c | 18-Sep-16 | 2016 | Unprovoked | USA | Florida | New Smyrna Beach, Volusia County | Surfing | male | M | ... | NaN | Orlando Sentinel, 9/19/2016 | 2016.09.18.c-NSB.pdf | http://sharkattackfile.net/spreadsheets/pdf_di... | http://sharkattackfile.net/spreadsheets/pdf_di... | 2016.09.18.c | 2016.09.18.c | 5993 | NaN | NaN |
| 1 | 2016.09.18.b | 18-Sep-16 | 2016 | Unprovoked | USA | Florida | New Smyrna Beach, Volusia County | Surfing | Chucky Luciano | M | ... | NaN | Orlando Sentinel, 9/19/2016 | 2016.09.18.b-Luciano.pdf | http://sharkattackfile.net/spreadsheets/pdf_di... | http://sharkattackfile.net/spreadsheets/pdf_di... | 2016.09.18.b | 2016.09.18.b | 5992 | NaN | NaN |
| 2 | 2016.09.18.a | 18-Sep-16 | 2016 | Unprovoked | USA | Florida | New Smyrna Beach, Volusia County | Surfing | male | M | ... | NaN | Orlando Sentinel, 9/19/2016 | 2016.09.18.a-NSB.pdf | http://sharkattackfile.net/spreadsheets/pdf_di... | http://sharkattackfile.net/spreadsheets/pdf_di... | 2016.09.18.a | 2016.09.18.a | 5991 | NaN | NaN |
| 3 | 2016.09.17 | 17-Sep-16 | 2016 | Unprovoked | AUSTRALIA | Victoria | Thirteenth Beach | Surfing | Rory Angiolella | M | ... | NaN | The Age, 9/18/2016 | 2016.09.17-Angiolella.pdf | http://sharkattackfile.net/spreadsheets/pdf_di... | http://sharkattackfile.net/spreadsheets/pdf_di... | 2016.09.17 | 2016.09.17 | 5990 | NaN | NaN |
| 4 | 2016.09.15 | 16-Sep-16 | 2016 | Unprovoked | AUSTRALIA | Victoria | Bells Beach | Surfing | male | M | ... | 2 m shark | The Age, 9/16/2016 | 2016.09.16-BellsBeach.pdf | http://sharkattackfile.net/spreadsheets/pdf_di... | http://sharkattackfile.net/spreadsheets/pdf_di... | 2016.09.16 | 2016.09.15 | 5989 | NaN | NaN |


(5992, 24)

As the preview above shows, the dataset is pretty messy when it comes to column naming and the free-text jargon used in several columns, for example Species.

Therefore, we clean the dataset step by step, guided by our main assumption: we will use the information related to time (month/year), country, location and fatality of attacks. We will also include the species involved, in order to check how aggressiveness varies from one species to another.

In [173]:
#Identify how many nulls you have for each variable
null_cols = raw_df.isnull().sum()

null_cols[null_cols > 0]

Country                     43
Area                       402
Location                   496
Activity                   527
Name                       200
Sex                        567
Age                       2681
Injury                      27
Fatal (Y/N)                 19
Time                      3213
Species                   2934
Investigator or Source      15
href formula                 1
href                         3
Unnamed: 22               5991
Unnamed: 23               5990
dtype: int64
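
Raw null counts are easier to judge relative to the number of rows. A minimal sketch of how the share of missing values per column could be computed; the `toy` frame and the `missing_share` helper are illustrative assumptions, not part of the notebook:

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first,
    keeping only columns with at least one missing value."""
    share = df.isnull().mean()
    return share[share > 0].sort_values(ascending=False)


# Hypothetical toy data standing in for raw_df
toy = pd.DataFrame({
    "Country": ["USA", None, "AUSTRALIA", "USA"],
    "Age": [np.nan, np.nan, 23, np.nan],
    "Year": [2016, 2016, 2015, 2014],
})

print(missing_share(toy))
```

Applied to `raw_df`, the same helper would put columns such as Unnamed: 22 and Unnamed: 23 at the top of the list.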

We have identified two columns that give us almost no information, Unnamed: 22 and Unnamed: 23, which are null in all but one and two rows respectively. We delete them, together with the rest of the columns that are unnecessary for our study:

In [174]:
# Confirm that the two unnamed columns are almost entirely null
unnamed_cols = raw_df[["Unnamed: 22", "Unnamed: 23"]]
unnamed_cols.isnull().sum()


Unnamed: 22    5991
Unnamed: 23    5990
dtype: int64

In [175]:
# Drop the two unnamed columns and keep the result as a new frame
raw_df2 = raw_df.drop(columns=["Unnamed: 22", "Unnamed: 23"])
raw_df2.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order'],
      dtype='object')
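
Note that some of the column names listed above carry a trailing space (`'Sex '`, `'Species '`), which is easy to mistype. A minimal sketch of how the names could be normalised, using a hypothetical empty frame with the same kind of padded names:

```python
import pandas as pd

# Hypothetical frame reproducing the padded column names seen above
toy = pd.DataFrame(columns=["Case Number", "Sex ", "Species ", "Fatal (Y/N)"])

# Strip leading and trailing whitespace from every column name
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

The notebook keeps the original names, so the following cells still refer to `'Species '` with the trailing space.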

In [176]:
data_clean = raw_df2.drop(columns=["Type", 'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Time', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order'])

For the fatalities column, we detected some issues that need fixing:

In [177]:
data_clean['Fatal (Y/N)'].value_counts()

N          4315
Y          1552
UNKNOWN      94
 N            8
#VALUE!       1
n             1
F             1
N             1
Name: Fatal (Y/N), dtype: int64

The unknown and invalid values are of no use for the current analysis, so we set them to null and normalise the different spellings of N:

In [178]:
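# Assign the result back instead of using inplace=True on a selected column,
# which is not guaranteed to modify data_clean itself
data_clean['Fatal (Y/N)'] = data_clean['Fatal (Y/N)'].replace(['UNKNOWN', '#VALUE!', "F"], np.nan)
data_clean['Fatal (Y/N)'] = data_clean['Fatal (Y/N)'].replace([" N", "N ", "n"], "N")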
data_clean.columns

Index(['Case Number', 'Date', 'Year', 'Country', 'Area', 'Location',
       'Fatal (Y/N)', 'Species '],
      dtype='object')

In [179]:
data_clean['Fatal (Y/N)'].value_counts()

N    4325
Y    1552
Name: Fatal (Y/N), dtype: int64

In [180]:
data_clean.columns

Index(['Case Number', 'Date', 'Year', 'Country', 'Area', 'Location',
       'Fatal (Y/N)', 'Species '],
      dtype='object')

As we are interested in the month of each attack, we extract it from the "Date" column. For that purpose, we created a dictionary that maps the month abbreviations appearing in the column to the output we want (a numeric value).

In [181]:
data_clean['Date'].value_counts().head()

1957    11
1942     9
1956     8
1941     7
1958     7
Name: Date, dtype: int64

In [182]:
def findMonth(m):
    """Return the number of the first month (in calendar order) whose abbreviation appears in m, else None."""
    months = {
        "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
        "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12
    }
    for month, numero in months.items():
        if month in m:
            return numero
    return None

data_clean["Month"] = data_clean["Date"].apply(findMonth)

# Dates without a month give None, which turns the column into floats (NaN);
# we mark those rows with 0 and convert the column to integers
data_clean["Month"] = data_clean["Month"].fillna(0).astype("int64")
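
The same extraction could also be written with vectorised string methods. A minimal sketch, assuming the month always appears as a three-letter English abbreviation; the `dates` sample and the `MONTHS` name are illustrative, and this is an alternative to `findMonth`, not the notebook's method:

```python
import pandas as pd

MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}

# Hypothetical sample of the formats found in the Date column
dates = pd.Series(["18-Sep-16", "Reported 08-Jan-2016", "1957", "Before 1906"])

# Capture the leftmost month abbreviation, if there is one
pattern = "(" + "|".join(MONTHS) + ")"
abbrev = dates.str.extract(pattern, expand=False)

# Map abbreviations to numbers; the nullable Int64 dtype keeps missing months as <NA>
month = abbrev.map(MONTHS).astype("Int64")

print(month.tolist())
```

Keeping missing months as `<NA>` avoids the 0 placeholder, so they show up in `isnull()` counts like any other missing value.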

In [183]:
display(data_clean.head())

| | Case Number | Date | Year | Country | Area | Location | Fatal (Y/N) | Species | Month |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016.09.18.c | 18-Sep-16 | 2016 | USA | Florida | New Smyrna Beach, Volusia County | N | NaN | 9 |
| 1 | 2016.09.18.b | 18-Sep-16 | 2016 | USA | Florida | New Smyrna Beach, Volusia County | N | NaN | 9 |
| 2 | 2016.09.18.a | 18-Sep-16 | 2016 | USA | Florida | New Smyrna Beach, Volusia County | N | NaN | 9 |
| 3 | 2016.09.17 | 17-Sep-16 | 2016 | AUSTRALIA | Victoria | Thirteenth Beach | N | NaN | 9 |
| 4 | 2016.09.15 | 16-Sep-16 | 2016 | AUSTRALIA | Victoria | Bells Beach | N | 2 m shark | 9 |


After applying this function, we check the null values in our dataframe again. We also remove all rows with a 0 in Month or Year, as we treat 0 as null too.

In [184]:
#Identify how many nulls you have for each variable
null_cols = data_clean.isnull().sum()

null_cols[null_cols > 0]

Country          43
Area            402
Location        496
Fatal (Y/N)     115
Species        2934
dtype: int64

In [185]:
data_clean.shape
data_clean.columns

Index(['Case Number', 'Date', 'Year', 'Country', 'Area', 'Location',
       'Fatal (Y/N)', 'Species ', 'Month'],
      dtype='object')

In [188]:
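# Keep rows with a known Month and Year. Each comparison needs its own parentheses: `&` binds tighter
# than `!=`, so the bare form tests (Month & Year) != 0 and wrongly drops rows (9 & 2016 == 0).
# The outputs recorded below were produced with the bare form.
data_clean2 = data_clean[(data_clean["Month"] != 0) & (data_clean["Year"] != 0)].copy()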
data_clean2.shape

(3793, 9)
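
When two conditions are combined in a filter, operator precedence matters, because `&` is evaluated before `!=`. A small sketch on hypothetical rows, comparing the unparenthesised expression with the parenthesised one:

```python
import pandas as pd

# Hypothetical rows: a valid date, an unknown month, and an unknown year
toy = pd.DataFrame({"Month": [9, 0, 7], "Year": [2016, 1990, 0]})

# Without parentheses this is (Month & Year) != 0, a bitwise AND of the integers
bitwise = toy[toy["Month"] & toy["Year"] != 0]

# With parentheses each column is tested on its own and the two masks are combined
logical = toy[(toy["Month"] != 0) & (toy["Year"] != 0)]

print(len(bitwise), len(logical))
```

In this toy frame the bitwise form discards even the valid row, since 9 and 2016 share no set bits, while the parenthesised form keeps it.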

As we consider Case Number the reference number for each attack, we make sure that we do not carry duplicates forward:

In [189]:
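# Drop rows with a repeated case number, keeping the first occurrence
# (removing duplicates here can lower the row counts recorded further down)
data_clean2 = data_clean2.drop_duplicates(subset=['Case Number'], keep='first')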
data_clean2.shape
data_clean2.columns

Index(['Case Number', 'Date', 'Year', 'Country', 'Area', 'Location',
       'Fatal (Y/N)', 'Species ', 'Month'],
      dtype='object')
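
Before dropping anything, it can help to count how many case numbers are actually repeated. A minimal sketch on a hypothetical frame with one repeated entry:

```python
import pandas as pd

# Hypothetical case numbers with one repeated entry
toy = pd.DataFrame({
    "Case Number": ["2016.09.18.a", "2016.09.18.b", "2016.09.18.a"],
    "Country": ["USA", "USA", "USA"],
})

# Count rows whose case number already appeared in an earlier row
n_duplicates = toy["Case Number"].duplicated(keep="first").sum()

# Keep only the first occurrence of each case number
deduplicated = toy.drop_duplicates(subset=["Case Number"], keep="first")

print(n_duplicates, deduplicated.shape)
```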

In [191]:
data_clean2.columns

Index(['Case Number', 'Date', 'Year', 'Country', 'Area', 'Location',
       'Fatal (Y/N)', 'Species ', 'Month'],
      dtype='object')

Now we clean the Species column. As the records did not follow any criteria or pattern, we created a dictionary, similar to the methodology used for the Month column.

In [194]:
# Rows without a species description cannot be categorised, so we drop them
data_clean2 = data_clean2.dropna(subset=["Species "])

# Make sure every remaining description is a string before matching keywords
data_clean2["Species "] = data_clean2["Species "].astype(str)

In [213]:
def findSpecies(s):
    """Return the category of the first keyword (in dictionary order) found in s, else "Other"."""
    sharks = {
        "Bull": "Bull Shark", "bull": "Bull Shark",
        "White": "White Shark", "white": "White Shark",
        "Tiger": "Tiger Shark", "tiger": "Tiger Shark",
        "Mako": "Mako Shark", "mako": "Mako Shark",
        "Blacktip": "Blacktip Shark", "blacktip": "Blacktip Shark",
        "Blue": "Blue Shark", "blue": "Blue Shark",
        # Entries that mention "involvement" are left uncategorised (None)
        "involvement": None, "involve": None,
        "Bronze": "Bronze Shark", "bronze": "Bronze Shark",
        # Whalers are requiem sharks (genus Carcharhinus), not whale sharks
        "whaler": "Whaler Shark"
    }
    for shark, species in sharks.items():
        if shark in s:
            return species
    return "Other"

In [214]:
data_clean2["Cat_shark"] = data_clean2["Species "].apply(findSpecies)
data_clean2["Cat_shark"].value_counts()

Other             988
White Shark       439
Tiger Shark       183
Bull Shark        100
Blacktip Shark     66
Mako Shark         37
Bronze Shark       35
Blue Shark         28
Whaler Shark        4
Name: Cat_shark, dtype: int64
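
Listing each keyword in two spellings still misses descriptions written in capitals. A hedged sketch of a case-insensitive variant; the `KEYWORDS` list is an illustrative subset rather than the full mapping, and `categorize` is a hypothetical name:

```python
from typing import Optional

# Keyword -> category pairs, checked in order (illustrative subset only)
KEYWORDS = [
    ("involve", None),  # descriptions that mention "involvement" stay uncategorised
    ("bull", "Bull Shark"),
    ("white", "White Shark"),
    ("tiger", "Tiger Shark"),
]


def categorize(description: str) -> Optional[str]:
    """Return the category of the first keyword found, ignoring case."""
    text = description.lower()
    for keyword, category in KEYWORDS:
        if keyword in text:
            return category
    return "Other"


print(categorize("TIGER SHARK, 3 m"))
```

Substring matching remains approximate in either version: a description such as "Oceanic whitetip shark" contains "white" and would be labelled White Shark. A case-insensitive match may also give counts that differ from the ones recorded above.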

In [223]:
data_clean2.columns
data_clean3 = data_clean2.drop(['Species ', "Date", "Location"], axis=1)

In [224]:
data_clean3.shape

(1985, 7)

In [225]:
display(data_clean3.head())

| | Case Number | Year | Country | Area | Fatal (Y/N) | Month | Cat_shark |
|---|---|---|---|---|---|---|---|
| 103 | 2015.12.26 | 2015 | SOUTH AFRICA | KwaZulu-Natal | N | 12 | White Shark |
| 104 | 2015.12.25 | 2015 | SPAIN | Grand Canary Island | N | 12 | Other |
| 105 | 2015.12.22 | 2015 | USA | Hawaii | N | 12 | Other |
| 106 | 2015.12.21.b | 2015 | AUSTRALIA | New South Wales | N | 12 | Bronze Shark |
| 107 | 2015.12.21.a | 2015 | BRAZIL | Pernambuco | N | 12 | Tiger Shark |


Now we analyse the variance, flagging any numeric column whose 90th percentile equals its minimum:

In [226]:
low_variance = []

# A numeric column counts as low variance if its 90th percentile equals its minimum
for col in data_clean3.select_dtypes(include=np.number).columns:
    minimum = data_clean3[col].min()
    ninety_perc = np.percentile(data_clean3[col], 90)
    if ninety_perc == minimum:
        low_variance.append(col)

print(low_variance)


[]


There are no columns with low variance.
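
To see what this check does flag, here is a small sketch on made-up data. The `low_variance_columns` helper and the `toy` frame are hypothetical, and `quantile` is used in place of `np.percentile` (both interpolate linearly by default):

```python
import pandas as pd


def low_variance_columns(df: pd.DataFrame, q: float = 0.9) -> list:
    """Return the numeric columns whose q-th quantile equals their minimum."""
    numeric = df.select_dtypes(include="number")
    return [
        col for col in numeric.columns
        if numeric[col].quantile(q) == numeric[col].min()
    ]


# Hypothetical data: one column that is almost constant, one that varies
toy = pd.DataFrame({
    "Mostly zero": [0] * 19 + [5],
    "Month": list(range(1, 13)) + list(range(1, 9)),
})

print(low_variance_columns(toy))
```

Only the almost constant column is reported, because at least 90% of its values sit at the minimum.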

In [227]:
stats = data_clean3.describe().transpose()
stats['IQR'] = stats['75%'] - stats['25%']
stats

| | count | mean | std | min | 25% | 50% | 75% | max | IQR |
|---|---|---|---|---|---|---|---|---|---|
| Year | 1985.0 | 1982.607053 | 32.276273 | 1785.0 | 1963.0 | 1994.0 | 2008.0 | 2015.0 | 45.0 |
| Month | 1985.0 | 7.069018 | 3.25105 | 1.0 | 5.0 | 7.0 | 10.0 | 12.0 | 5.0 |
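
One common use of the IQR is to flag unusually low or high values. A minimal sketch, assuming the conventional 1.5 × IQR rule (an assumption of this example, not something the notebook states) and using the Year quartiles from the table above; `tukey_fences` is a hypothetical helper:

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return the (lower, upper) bounds q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of Year taken from the describe() output: 25% = 1963, 75% = 2008
year_low, year_high = tukey_fences(1963.0, 2008.0)

print(year_low, year_high)  # 1895.5 2075.5
```

Under that rule, the earliest record (1785) falls below the lower bound for Year, so the oldest attacks would deserve a closer look before being compared with recent ones.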
