# Ironhack - Data Analytics Bootcamp
___________________________________________________________________________________________________________________________

## Project 2 - Shark Attacks

Data Cleaning and Manipulation

#### Main Objectives

The dataset provided by Ironhack contains significantly messy
data. Your job is to apply the different cleaning and
manipulation techniques to generate a cleaner CSV
version of this data.

#### The Data Set

- Go to kaggle.com and create an account;
- Go to the search bar and look for ‘Global Shark Attacks’;
- Download the data set;
- For more info: https://www.sharkattackfile.net.

#### Deliverables

- A clean CSV file on your GitHub account;
- The URL of the file on your GitHub. It must be readable with "pd.read_csv(url)";
- The link to the Jupyter notebook (or the GitHub project).
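
For "pd.read_csv(url)" to work, the URL has to serve the raw CSV rather than GitHub's HTML page for the file. A minimal sketch of the conversion, assuming the usual `github.com/<user>/<repo>/blob/<branch>/<path>` page layout; the helper name `to_raw_url` and the example user and repository are hypothetical:

```python
def to_raw_url(page_url: str) -> str:
    """Turn a GitHub file-page URL into its raw-content counterpart."""
    prefix = "https://github.com/"
    if not page_url.startswith(prefix) or "/blob/" not in page_url:
        raise ValueError("expected a github.com URL containing '/blob/'")
    user_repo, path = page_url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{user_repo}/{path}"


# Hypothetical repository, used only to illustrate the two URL shapes.
page = "https://github.com/some-user/shark-attacks/blob/main/project2-shark-attacks.csv"
raw = to_raw_url(page)
# `raw` is the form that would be passed to pd.read_csv.
```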

#### Deadline

- The same day.

___________________________________________________________________________________________________________________________

## Collaborators:

- Marcus Felipe Ferreira Brandão

- Pedro Di Gianni
___________________________________________________________________________________________________________________________

### Approach:

When we analyzed this data set, we decided that the best approach was to organize it by the date on which each incident occurred. Besides making past cases easier to look up, this shows the times of the year and the periods of the day when attacks usually occur. We therefore followed these steps:

- importing the modules "pandas", "numpy", "re" and "datetime";
- reading the file with pandas and inspecting it with "head", "info" and "shape";
- cleaning the data set by dropping near-duplicate columns and the columns and rows with little or no data;
- date treatment, creating three columns that hold the year, month and day of each incident;
- time treatment, standardizing the recorded time of each incident to its hour;
- creation of a new column with four day periods (morning, afternoon, night and "wee hours");
- standardization of the sex field for better visualization and manipulation of data: "M" for male and "F" for female;
- cleaning and changing formats of the field "age" for better visualization and manipulation of data;
- changing the field "fatal" to "True" or "False", for better manipulation of the data; and
- organizing all the columns of the data set in a more logical way, such as placing "activity", "type", "injury" and "fatal" next to each other.
___________________________________________________________________________________________________________________________

## Reading file:

In [115]:
import pandas as pd
import numpy as np
import re
import datetime

In [116]:
pd.set_option('display.max_columns', 24)

In [117]:
# Import comma-separated file

sharkattack = pd.read_csv('data/attacks.csv', sep=',', encoding='latin-1')

In [118]:
sharkattack.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25723 entries, 0 to 25722
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Case Number             8702 non-null   object 
 1   Date                    6302 non-null   object 
 2   Year                    6300 non-null   float64
 3   Type                    6298 non-null   object 
 4   Country                 6252 non-null   object 
 5   Area                    5847 non-null   object 
 6   Location                5762 non-null   object 
 7   Activity                5758 non-null   object 
 8   Name                    6092 non-null   object 
 9   Sex                     5737 non-null   object 
 10  Age                     3471 non-null   object 
 11  Injury                  6274 non-null   object 
 12  Fatal (Y/N)             5763 non-null   object 
 13  Time                    2948 non-null   object 
 14  Species                 3464 non-null 

In [119]:
sharkattack.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57.0,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11.0,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48.0,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0,,


## Cleaning data frame:

### - Dropping NaNs

In [120]:
# Dropping rows with more than 19 missing values (out of 24 columns)

mask = sharkattack.isnull().sum(axis=1) > 19

list_to_drop = sharkattack.loc[mask, :].index

sharkattack = sharkattack.drop(list_to_drop)
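
The same filter can be written with `DataFrame.dropna` and its `thresh` parameter, which keeps rows that have at least that many non-missing values. A minimal sketch on a toy frame (the frame and the variable `max_missing` are illustrative, not part of the notebook). With 24 columns and at most 19 missing values allowed, `thresh` would be 5:

```python
import numpy as np
import pandas as pd

# Toy frame with 4 columns: the second row is almost empty, the third is fully empty.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, np.nan, np.nan],
    "c": [3.0, np.nan, np.nan],
    "d": [4.0, 5.0, np.nan],
})

max_missing = 2  # rows with more missing values than this are dropped

# Mask-based version, mirroring the cell above.
by_mask = toy.drop(toy.index[toy.isnull().sum(axis=1) > max_missing])

# Equivalent dropna call: keep rows with at least (n_columns - max_missing) values.
by_thresh = toy.dropna(thresh=toy.shape[1] - max_missing)
```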

In [121]:
sharkattack = sharkattack.drop(columns=['Unnamed: 22', 'Unnamed: 23'])

In [122]:
sharkattack.shape

(6302, 22)

In [123]:
sharkattack.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6302 entries, 0 to 6301
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Case Number             6301 non-null   object 
 1   Date                    6302 non-null   object 
 2   Year                    6300 non-null   float64
 3   Type                    6298 non-null   object 
 4   Country                 6252 non-null   object 
 5   Area                    5847 non-null   object 
 6   Location                5762 non-null   object 
 7   Activity                5758 non-null   object 
 8   Name                    6092 non-null   object 
 9   Sex                     5737 non-null   object 
 10  Age                     3471 non-null   object 
 11  Injury                  6274 non-null   object 
 12  Fatal (Y/N)             5763 non-null   object 
 13  Time                    2948 non-null   object 
 14  Species                 3464 non-null   

In [68]:
sharkattack.head(10)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57.0,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11.0,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48.0,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0
5,2018.06.03.b,03-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,"Flat Rock, Ballina",Kite surfing,Chris,M,,"No injury, board bitten",N,,,"Daily Telegraph, 6/4/2018",2018.06.03.b-FlatRock.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.03.b,2018.06.03.b,6298.0
6,2018.06.03.a,03-Jun-2018,2018.0,Unprovoked,BRAZIL,Pernambuco,"Piedade Beach, Recife",Swimming,Jose Ernesto da Silva,M,18.0,FATAL,Y,Late afternoon,Tiger shark,"Diario de Pernambuco, 6/4/2018",2018.06.03.a-daSilva.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.03.a,2018.06.03.a,6297.0
7,2018.05.27,27-May-2018,2018.0,Unprovoked,USA,Florida,"Lighhouse Point Park, Ponce Inlet, Volusia County",Fishing,male,M,52.0,Minor injury to foot. PROVOKED INCIDENT,N,,"Lemon shark, 3'","K. McMurray, TrackingSharks.com",2018.05.27-Ponce.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.05.27,2018.05.27,6296.0
8,2018.05.26.b,26-May-2018,2018.0,Unprovoked,USA,Florida,"Cocoa Beach, Brevard County",Walking,Cody High,M,15.0,Lower left leg bitten,N,17h00,"Bull shark, 6'","K.McMurray, TrackingSharks.com",2018.05.26.b-High.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.05.26.b,2018.05.26.b,6295.0
9,2018.05.26.a,26-May-2018,2018.0,Unprovoked,USA,Florida,"Daytona Beach, Volusia County",Standing,male,M,12.0,Minor injury to foot,N,14h00,,"K. McMurray, Tracking Sharks.com",2018.05.26.a-DaytonaBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.05.26.a,2018.05.26.a,6294.0


### - Dropping similar columns:

In [124]:
(sharkattack['href formula'] == sharkattack['href']).mean()

0.9904792129482703

In [125]:
#Dropping column "href formula", because it is 99% equal to "href"

sharkattack = sharkattack.drop(columns='href formula')

In [126]:
(sharkattack['Case Number.2'] == sharkattack['Case Number.1']).mean()

0.9968264043160902

In [127]:
#Dropping column "Case Number.2", because it is 99% equal to "Case Number.1"

sharkattack = sharkattack.drop(columns='Case Number.2')

In [128]:
(sharkattack['Case Number'] == sharkattack['Case Number.1']).mean()

0.9961916851793081

In [129]:
#Dropping column "Case Number", because it is 99% equal to "Case Number.1"

sharkattack = sharkattack.drop(columns='Case Number')
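
One caveat with the comparisons above: `==` treats a missing value as different from everything, including another missing value, so rows where both columns are empty count as mismatches. A small sketch of a comparison that counts such rows as equal; the helper name `share_equal` and the toy data are assumptions made for illustration:

```python
import numpy as np
import pandas as pd


def share_equal(left: pd.Series, right: pd.Series) -> float:
    """Fraction of rows where both values match or both are missing."""
    both_missing = left.isna() & right.isna()
    return float(((left == right) | both_missing).mean())


toy = pd.DataFrame({
    "href formula": ["a.pdf", "b.pdf", np.nan, "d.pdf"],
    "href": ["a.pdf", "b.pdf", np.nan, "x.pdf"],
})

# Plain comparison: the row where both values are NaN counts as a mismatch.
plain = float((toy["href formula"] == toy["href"]).mean())

# NaN-aware comparison: that row counts as a match.
nan_aware = share_equal(toy["href formula"], toy["href"])
```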


In [130]:
sharkattack.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6302 entries, 0 to 6301
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Date                    6302 non-null   object 
 1   Year                    6300 non-null   float64
 2   Type                    6298 non-null   object 
 3   Country                 6252 non-null   object 
 4   Area                    5847 non-null   object 
 5   Location                5762 non-null   object 
 6   Activity                5758 non-null   object 
 7   Name                    6092 non-null   object 
 8   Sex                     5737 non-null   object 
 9   Age                     3471 non-null   object 
 10  Injury                  6274 non-null   object 
 11  Fatal (Y/N)             5763 non-null   object 
 12  Time                    2948 non-null   object 
 13  Species                 3464 non-null   object 
 14  Investigator or Source  6285 non-null   

## Data Treatment:

### - Time

In [131]:
# Extracting the leading digits (the hour) from each cell, so we can later convert this column to integers
sharkattack["Time"]=sharkattack["Time"].astype(str)
sharkattack["Time"]=sharkattack["Time"].apply(lambda x: re.findall(r"^\d{,2}",x)[0:2])

In [132]:
sharkattack["Time"]=sharkattack["Time"].apply(lambda x: x[0])

In [133]:
nullnumbers = sharkattack["Time"]==""
sharkattack.loc[nullnumbers,"Time"]=-1
sharkattack.head(5)

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href,Case Number.1,original order
0,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57.0,"No injury to occupant, outrigger canoe and pad...",N,18,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,6303.0
1,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11.0,Minor injury to left thigh,N,14,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,6302.0
2,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48.0,Injury to left lower leg from surfboard skeg,N,7,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,6301.0
3,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,-1,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,6300.0
4,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,-1,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,6299.0
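
The hour extraction can also be done without `apply`, using the vectorized `Series.str.extract` together with `pd.to_numeric`. This is a sketch on sample values taken from the table above, not the notebook's own code; it keeps missing hours as NaN instead of using the temporary -1 marker:

```python
import numpy as np
import pandas as pd

times = pd.Series(["18h00", "14h00 -15h00", "07h45", np.nan, "Late afternoon"])

# Capture one or two leading digits; rows without them become NaN.
hours = pd.to_numeric(times.str.extract(r"^(\d{1,2})", expand=False))
```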


In [134]:
sharkattack["Day Period"]=sharkattack["Time"].astype(int)

In [135]:
#Creating Day Period based on the criteria below
nodata = sharkattack["Day Period"]==-1
wee_hours = (sharkattack["Day Period"]>=0) & (sharkattack["Day Period"]<6)
morning = (sharkattack["Day Period"]>=6) & (sharkattack["Day Period"]<12)
afternoon = (sharkattack["Day Period"]>=12) & (sharkattack["Day Period"]<18)
night = (sharkattack["Day Period"]>=18)

In [136]:
sharkattack.loc[nodata,"Day Period"]="No Data"
sharkattack.loc[morning,"Day Period"]="Morning"
sharkattack.loc[afternoon,"Day Period"]="Afternoon"
sharkattack.loc[night,"Day Period"]="Night"
sharkattack.loc[wee_hours,"Day Period"]="Wee hours"

In [137]:
sharkattack["Day Period"].value_counts()

No Data      3944
Afternoon    1314
Morning       770
Night         233
Wee hours      41
Name: Day Period, dtype: int64
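
The five masks and five assignments above can be condensed with `np.select`, which picks the label of the first condition that holds. A minimal sketch with the same boundaries (0-5, 6-11, 12-17, 18 and later), assuming missing hours are stored as NaN rather than -1; the sample hours are illustrative:

```python
import numpy as np
import pandas as pd

hours = pd.Series([18, 14, 7, np.nan, 3])

# Conditions are checked in order, so each one only needs an upper bound.
conditions = [hours.isna(), hours < 6, hours < 12, hours < 18]
labels = ["No Data", "Wee hours", "Morning", "Afternoon"]

# Anything that matches no condition (18 and later) falls through to "Night".
day_period = pd.Series(np.select(conditions, labels, default="Night"), index=hours.index)
```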

In [138]:
sharkattack.loc[nullnumbers,"Time"]=np.nan

In [139]:
sharkattack.head(8)

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href,Case Number.1,original order,Day Period
0,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57.0,"No injury to occupant, outrigger canoe and pad...",N,18.0,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,6303.0,Night
1,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11.0,Minor injury to left thigh,N,14.0,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,6302.0,Afternoon
2,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48.0,Injury to left lower leg from surfboard skeg,N,7.0,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,6301.0,Morning
3,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,6300.0,No Data
4,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,6299.0,No Data
5,03-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,"Flat Rock, Ballina",Kite surfing,Chris,M,,"No injury, board bitten",N,,,"Daily Telegraph, 6/4/2018",2018.06.03.b-FlatRock.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.03.b,6298.0,No Data
6,03-Jun-2018,2018.0,Unprovoked,BRAZIL,Pernambuco,"Piedade Beach, Recife",Swimming,Jose Ernesto da Silva,M,18.0,FATAL,Y,,Tiger shark,"Diario de Pernambuco, 6/4/2018",2018.06.03.a-daSilva.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.03.a,6297.0,No Data
7,27-May-2018,2018.0,Unprovoked,USA,Florida,"Lighhouse Point Park, Ponce Inlet, Volusia County",Fishing,male,M,52.0,Minor injury to foot. PROVOKED INCIDENT,N,,"Lemon shark, 3'","K. McMurray, TrackingSharks.com",2018.05.27-Ponce.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.05.27,6296.0,No Data


### - Age

In [140]:
sharkattack.groupby(by="Age").count()

Age,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href,Case Number.1,original order,Day Period
,2,2,2,2,2,2,1,2,2,2,2,1,1,2,2,2,2,2,2
,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1
28,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
30,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1
43,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
mid-20s,1,1,1,1,1,1,0,1,1,1,0,0,1,1,1,1,1,1,1
mid-30s,1,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,1
teen,5,5,5,5,5,5,5,5,5,5,5,2,1,5,5,5,5,5,5
young,2,2,2,2,2,2,2,2,2,2,2,0,0,2,2,2,2,2,2


In [141]:
#Extracting the leading digits from each Age cell so we can turn this column into an integer
sharkattack["Age"]=sharkattack["Age"].astype(str)
sharkattack["Age"]=sharkattack["Age"].apply(lambda x: re.findall(r"^\d{,3}",x)[0:2])

In [142]:
sharkattack["Age"]=sharkattack["Age"].apply(lambda x: x[0])

In [143]:
nullnumbers = sharkattack["Age"]==""
sharkattack.loc[nullnumbers,"Age"]=0

In [144]:
sharkattack["Age"]=sharkattack["Age"].astype(int)
zeronums = sharkattack["Age"]==0
sharkattack.loc[zeronums,"Age"]=np.nan

In [145]:
sharkattack["Age"].mean()

27.314194112503642

In [146]:
sharkattack["Age"].value_counts()

17.0    156
18.0    153
20.0    150
19.0    142
15.0    139
       ... 
87.0      1
82.0      1
84.0      1
72.0      1
86.0      1
Name: Age, Length: 81, dtype: int64

### - Sex

In [149]:
sharkattack.rename(columns={"Sex ":"Sex"},inplace=True)

In [150]:
# Flagging every value other than "M" or "F" so it can be replaced with NaN
sexconditions = (sharkattack["Sex"] != "M") & (sharkattack["Sex"] != "F")

In [151]:
sharkattack.loc[sexconditions,"Sex"]=np.nan

In [152]:
sharkattack["Sex"].value_counts()

M    5094
F     637
Name: Sex, dtype: int64
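
The exact comparison above also discards values that differ only by whitespace or case, such as "M " or "m". If those should be kept, one hedged alternative is to normalize first and only then filter; the sample values below are illustrative, and applying this to the real column would change the counts shown above:

```python
import numpy as np
import pandas as pd

sex = pd.Series(["M", "F", "M ", "m", "lli", np.nan])

# Trim whitespace and uppercase, then keep only the two expected codes.
normalized = sex.str.strip().str.upper()
cleaned = normalized.where(normalized.isin(["M", "F"]))
```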

### - Date

In [153]:
#Partially cleaning column "Year".

sharkattack['Year'] = sharkattack['Year'].replace([np.inf, -np.inf], np.nan)

sharkattack['Year'] = sharkattack['Year'].fillna(0)

sharkattack['Year'] = sharkattack['Year'].astype('int64')

sharkattack['Year'].tolist()

[2018,
 2018,
 2018,
 ...
 2017,
 2017,
 2017,
 ...

In [154]:
#Creating column "Date Year" from column "Date".

date_year_list = []
for i in sharkattack['Date']:
    # The last four characters hold the year; anything non-numeric becomes '0'
    date_year = i[-4:]
    if not date_year.isdigit():
        date_year = '0'
    date_year_list.append(date_year)

sharkattack['Date Year'] = date_year_list

In [155]:
#Creating column "Date Month" from column "Date".

# Lookup from the month token (three-letter abbreviation or number) to the month number
month_lookup = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
                'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}
month_lookup.update({str(n): n for n in range(1, 13)})

# Unrecognized tokens map to 0
date_month_list = [month_lookup.get(i[-8:-5].lower(), 0) for i in sharkattack['Date']]

sharkattack['Date Month'] = date_month_list

In [156]:
#Creating column "Date Day" from column "Date".

date_day_list = []
for i in sharkattack['Date']:
    # The two characters before the month hold the day; anything non-numeric becomes '0'
    date_day = i[-11:-9]
    if not date_day.isdigit():
        date_day = '0'
    date_day_list.append(date_day)

sharkattack['Date Day'] = date_day_list
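
The three loops above rely on fixed character positions. As an alternative sketch, a single regular expression with named groups can pull day, month and year out in one pass with `Series.str.extract`. The pattern, the names `pattern`, `parts` and `month_numbers`, and the choice to mark missing parts with 0 are assumptions for illustration; the sample dates come from the tables in this notebook:

```python
import pandas as pd

dates = pd.Series(["25-Jun-2018", "Reported 08-Jan-2017", "Apr-2013", "Before 1903"])

# Optional day, optional three-letter month, then a four-digit year at the end.
pattern = r"(?:(?P<day>\d{1,2})-)?(?:(?P<month>[A-Za-z]{3})-)?(?P<year>\d{4})$"
parts = dates.str.extract(pattern)

month_numbers = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
                 "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Convert each part to numbers; anything that was not found becomes 0.
parts["day"] = pd.to_numeric(parts["day"])
parts["month"] = parts["month"].str.lower().map(month_numbers)
parts["year"] = pd.to_numeric(parts["year"])
parts = parts.fillna(0).astype(int)
```

Like the loops, this takes the year from the end of the string, so "Before 1903" yields 1903 with month and day set to 0.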

In [157]:
#Comparing columns "Date Year" and "Year" and replacing 0 values in the "Date Year" column.

sharkattack['Year'] = sharkattack['Year'].astype(int)

sharkattack['Date Year'] = sharkattack['Date Year'].astype(int) 

date_year_list2 = sharkattack['Date Year'].tolist()

year_list2 = sharkattack['Year'].tolist()

for i in range(len(date_year_list2)):
    if date_year_list2[i] == 0:
        date_year_list2[i] = year_list2[i]
        
sharkattack['Date Year'] = date_year_list2
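
The list-based fallback above can also be expressed with `Series.mask`, which replaces values where a condition holds. A minimal sketch on a toy frame whose values are illustrative only:

```python
import pandas as pd

toy = pd.DataFrame({"Date Year": [2018, 0, 1903], "Year": [2018, 2017, 0]})

# Where "Date Year" is 0, take the value from "Year" instead.
toy["Date Year"] = toy["Date Year"].mask(toy["Date Year"] == 0, toy["Year"])
```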

In [158]:
mask2 = (sharkattack['Date Year'] != sharkattack['Year'])

sharkattack.loc[mask2, :]

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href,Case Number.1,original order,Day Period,Date Year,Date Month,Date Day
187,Reported 08-Jan-2017,0,Invalid,AUSTRALIA,Queensland,,Spearfishing,Kerry Daniel,M,35.0,"No attack, shark made a threat display",,,Bull shark,Liquid Vision 1/8/2017,2017.01.08.R-KerryDaniel.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2017.01.08.R,6116.0,No Data,2017,1,08
675,14-May-2014,2013,Unprovoked,ECUADOR,Santa Cruz Island,"Playa Brava, Turtle Bay",Surfing,Intriago Diego,M,29.0,Superficial injury to left calf,N,12,Galapagos shark,"El Universo, 5/16/2013",2013.05.14-Diego.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2013.05.14,5628.0,Afternoon,2014,5,14
802,Apr-2013,2012,Unprovoked,USA,Florida,"Sanibel Island, Lee County",,Dylan Hapworth,M,,Right shin bitten,N,,a small shark,"Morning Sentinel, 4/20/2012",2012.04.00-Hapworth.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2012.04.00,5501.0,No Data,2013,4,0
905,26-Jun-2008,2011,Unprovoked,USA,North Carolina,"North Topsail Beach, Onslow County",Playing in the surf,Cassidy Cartwright,F,10.0,Ankle bitten,N,,"Bull shark, 6'",C. Creswell & G. Hubbell,2011.06.26-Cartwright.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2011.06.26,5398.0,No Data,2008,6,26
940,Reported 10-Mar-2010,2011,Invalid,EGYPT,South Sinai Peninsula,Sharm-el-Sheikh,,female,F,,"Apparent drowning, and subsequent scavenging b...",,,,"Swindon Advertiser, 3/10/2011",2011.03.10.R-Sharm-scavenging.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2011.03.10.R,5363.0,No Data,2010,3,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6297,Before 1903,0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,male,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, p. 234",ND-0005-RoebuckBay.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0005,6.0,No Data,1903,0,0
6298,Before 1903,0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,Ahmun,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, pp. 233-234",ND-0004-Ahmun.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0004,5.0,No Data,1903,0,0
6299,1900-1905,0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,M,,FATAL,Y,,,"F. Schwartz, p.23; C. Creswell, GSAF",ND-0003-Ocracoke_1900-1905.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0003,4.0,No Data,1905,0,0
6300,1883-1889,0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,Jules Patterson,M,,FATAL,Y,,,"The Sun, 10/20/1938",ND-0002-JulesPatterson.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0002,3.0,No Data,1889,0,0


In [159]:
sharkattack.head()

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href,Case Number.1,original order,Day Period,Date Year,Date Month,Date Day
0,25-Jun-2018,2018,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57.0,"No injury to occupant, outrigger canoe and pad...",N,18.0,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,6303.0,Night,2018,6,25
1,18-Jun-2018,2018,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11.0,Minor injury to left thigh,N,14.0,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,6302.0,Afternoon,2018,6,18
2,09-Jun-2018,2018,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48.0,Injury to left lower leg from surfboard skeg,N,7.0,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,6301.0,Morning,2018,6,9
3,08-Jun-2018,2018,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,6300.0,No Data,2018,6,8
4,04-Jun-2018,2018,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,6299.0,No Data,2018,6,4


### - Fatal

In [160]:
# Renaming fatal column, cleaning and turning into boolean

In [161]:
sharkattack["Fatal (Y/N)"].value_counts()

N          4293
Y          1388
UNKNOWN      71
 N            7
M             1
N             1
y             1
2017          1
Name: Fatal (Y/N), dtype: int64

In [162]:
sharkattack.rename(columns={"Fatal (Y/N)":"Fatal"},inplace=True)

In [163]:
fatalconditions = (sharkattack["Fatal"] != "Y") & (sharkattack["Fatal"] != "N")

In [164]:
sharkattack.loc[fatalconditions,"Fatal"]=np.nan

In [165]:
sharkattack.loc[sharkattack["Fatal"]=="Y","Fatal"]=True
sharkattack.loc[sharkattack["Fatal"]=="N","Fatal"]=False

In [167]:
sharkattack["Fatal"].value_counts()

False    4293
True     1388
Name: Fatal, dtype: int64

In [168]:
sharkattack["Fatal"].mean()

0.2443231825382855

In [169]:
sharkattack["Fatal"]=sharkattack["Fatal"].astype(bool)
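
Note that `astype(bool)` has no way to represent a missing value: NaN is truthy, so rows whose outcome was set to NaN come out as `True`. A minimal sketch of the difference, using pandas' nullable `"boolean"` dtype as one possible alternative (an assumption of this sketch, not what the notebook does):

```python
import numpy as np
import pandas as pd

fatal = pd.Series(["Y", "N", "UNKNOWN", np.nan])
fatal = fatal.map({"Y": True, "N": False})  # anything else becomes NaN

as_plain_bool = fatal.astype(bool)      # NaN is converted to True
as_nullable = fatal.astype("boolean")   # NaN is kept as <NA>
```

With the nullable dtype, aggregations such as `sum()` and `mean()` skip the missing entries by default, so the fatality rate is computed over known outcomes only.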

In [170]:
sharkattack.dtypes

Date                       object
Year                        int32
Type                       object
Country                    object
Area                       object
Location                   object
Activity                   object
Name                       object
Sex                        object
Age                       float64
Injury                     object
Fatal                        bool
Time                       object
Species                    object
Investigator or Source     object
pdf                        object
href                       object
Case Number.1              object
original order            float64
Day Period                 object
Date Year                   int64
Date Month                  int64
Date Day                   object
dtype: object

## Reorganizing Columns:

In [171]:
#Arranging the columns in a logical order

sharkattack.keys()

sharkattack = sharkattack[['Case Number.1', 'Date Year', 'Date Month', 'Date Day', 'Time', 'Day Period', 'Country', 'Area', 'Location', 'Name', 'Sex', 'Age', 'Activity', 'Type', 'Injury', 'Fatal', 'Species ', 'Investigator or Source', 'pdf', 'href', 'original order']]

sharkattack.head(2)

Unnamed: 0,Case Number.1,Date Year,Date Month,Date Day,Time,Day Period,Country,Area,Location,Name,Sex,Age,Activity,Type,Injury,Fatal,Species,Investigator or Source,pdf,href,original order
0,2018.06.25,2018,6,25,18,Night,USA,California,"Oceanside, San Diego County",Julie Wolfe,F,57.0,Paddling,Boating,"No injury to occupant, outrigger canoe and pad...",False,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,6303.0
1,2018.06.18,2018,6,18,14,Afternoon,USA,Georgia,"St. Simon Island, Glynn County",Adyson McNeely,F,11.0,Standing,Unprovoked,Minor injury to left thigh,False,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,6302.0


In [113]:
sharkattack.rename(columns={'Case Number.1': 'Case Number', 'Date Day': 'Day', 'Date Month': 'Month', 'Date Year': 'Year'}, inplace=True)

sharkattack.head(100)

Unnamed: 0,Case Number,Year,Month,Day,Time,Day Period,Country,Area,Location,Name,Sex,Age,Activity,Type,Injury,Fatal,Species,Investigator or Source,pdf,href,original order
0,2018.06.25,2018,6,25,18,Night,USA,California,"Oceanside, San Diego County",Julie Wolfe,F,57.0,Paddling,Boating,"No injury to occupant, outrigger canoe and pad...",False,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,6303.0
1,2018.06.18,2018,6,18,14,Afternoon,USA,Georgia,"St. Simon Island, Glynn County",Adyson McNeely,F,11.0,Standing,Unprovoked,Minor injury to left thigh,False,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,6302.0
2,2018.06.09,2018,6,09,07,Morning,USA,Hawaii,"Habush, Oahu",John Denges,M,48.0,Surfing,Invalid,Injury to left lower leg from surfboard skeg,False,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,6301.0
3,2018.06.08,2018,6,08,,No Data,AUSTRALIA,New South Wales,Arrawarra Headland,male,M,,Surfing,Unprovoked,Minor injury to lower leg,False,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,6300.0
4,2018.06.04,2018,6,04,,No Data,MEXICO,Colima,La Ticla,Gustavo Ramos,M,,Free diving,Provoked,Lacerations to leg & hand shark PROVOKED INCIDENT,False,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,6299.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2017.09.02.a,2017,9,02,10,Morning,USA,Florida,"New Smyrna Beach, Volusia County",Chase Elmore,M,17.0,Surfing,Unprovoked,Minor injury to right hand,False,,"Click Orlando, 9/2/2017",2017.09.02.a-Elmore.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,6208.0
96,2017.08.29,2017,8,29,10,Morning,AUSTRALIA,Victoria,Cathedral Rock,Marcel Brundler,M,37.0,Surfing,Unprovoked,"No injury, board bitten",False,"White shark, 3 m","B. Myatt, GSAF",2017.08.27-Brundler.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,6207.0
97,2017.08.27,2017,8,27,13,Afternoon,USA,Florida,Bathtub Beach,Violet Veatch,F,3.0,Wading,Unprovoked,Leg injured,False,,"Sun Sentinel, 8/27/2017",2017.08.27-Veatch.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,6206.0
98,2017.08.26.b,2017,8,26,,No Data,SPAIN,Castellón,Grao de Moncofa,female,F,11.0,Swimming,Invalid,Lacerations to left foot,True,Shark involvement questionable,"El Periodico Mediterraneo, 8/27/2017",2017.08.26.b-Spain.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,6205.0


## Exporting file:

In [114]:
sharkattack.to_csv('./project2-shark-attacks.csv', index=False)
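
Since the deliverable must be readable with "pd.read_csv", a quick round trip is a cheap sanity check before publishing. A sketch on a toy frame, writing to an in-memory buffer instead of the real file; the frame is illustrative:

```python
import io

import pandas as pd

toy = pd.DataFrame({"Case Number": ["2018.06.25", "2018.06.18"], "Year": [2018, 2018]})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
```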