# Shark Attack Database Mini Project

## Objective
<br>

* Organize and clean a CSV data file.

* Detail and explain all Python code and commands used for importing, cleaning, manipulating, exporting and analysing the data.

<br>

database version: 7.1

# Starting the code

## Importing libraries

In [111]:
import pandas as pd
import numpy as np
import re
import countryinfo

## Importing database

Importing the CSV database using latin-1 encoding.

In [3]:
# Importing csv database as "sharks"
sharks = pd.read_csv('GSAF5.csv', sep = ',', encoding='latin-1')

# Creating a backup copy
sharks_bkp = sharks.copy()
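The notebook does not say why latin-1 is needed; presumably the file contains accented characters stored as single bytes, which are not valid UTF-8. The sketch below illustrates the idea on a hypothetical two-line sample (the sample bytes and the `Area` value are made up for illustration, not taken from `GSAF5.csv`).

```python
import io
import pandas as pd

# Hypothetical sample: 'é' is stored as the single byte 0xE9 in latin-1
raw_bytes = 'Country,Area\nREUNION,Saint-André\n'.encode('latin-1')

# Decoding as latin-1 succeeds because every byte maps to a character
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding='latin-1')

# The same bytes are not valid UTF-8, so a strict UTF-8 decode fails
try:
    raw_bytes.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(sample.loc[0, 'Area'], utf8_ok)
```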

## Declaring functions

In [4]:
def standardize_headers(df, func=None):
    '''
    Clean the column names of a DataFrame: replace spaces with
    underscores and convert every name to lowercase.
    The column names are changed in place. If func is given, it is
    applied to the DataFrame and the result is returned.
    '''
    df.columns = df.columns.str.replace(' ', '_').str.lower()
    if func:
        df = df.apply(func)
    return df
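As a quick illustration of what the function does to the headers, here is a small sketch on a hypothetical empty DataFrame (the three column names are examples chosen to resemble the dataset). Note that a trailing space also becomes an underscore, which is where names such as `sex_` come from.

```python
import pandas as pd

def standardize_headers(df, func=None):
    # Same logic as the notebook function, repeated so the example runs alone
    df.columns = df.columns.str.replace(' ', '_').str.lower()
    if func:
        df = df.apply(func)
    return df

# Hypothetical headers, including one with a trailing space
demo = pd.DataFrame(columns=['Case Number', 'Sex ', 'Fatal (Y/N)'])
demo = standardize_headers(demo)
print(list(demo.columns))  # ['case_number', 'sex_', 'fatal_(y/n)']
```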

## Data Cleaning

The database is cleaned step by step to improve its quality: incorrect or unusable information is removed so that only the most reliable data remains.

In [5]:
# Dataframe shape
sharks.shape

(5992, 24)

### Cleaning Columns
<br>

* Replacing whitespace with underscores

<br>

* Special characters

<br>

* Lowercasing all characters

<br>


In [6]:
sharks = standardize_headers(sharks)
sharks.head()

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex_,...,species_,investigator_or_source,pdf,href_formula,href,case_number.1,case_number.2,original_order,unnamed:_22,unnamed:_23
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.c,2016.09.18.c,5993,,
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.b,2016.09.18.b,5992,,
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.a,2016.09.18.a,5991,,
3,2016.09.17,17-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Surfing,Rory Angiolella,M,...,,"The Age, 9/18/2016",2016.09.17-Angiolella.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.17,2016.09.17,5990,,
4,2016.09.15,16-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Bells Beach,Surfing,male,M,...,2 m shark,"The Age, 9/16/2016",2016.09.16-BellsBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.16,2016.09.15,5989,,


### Checking Null and NaN

In [7]:
# Searching for NaN values
sharks.isna().sum()

case_number                  0
date                         0
year                         0
type                         0
country                     43
area                       402
location                   496
activity                   527
name                       200
sex_                       567
age                       2681
injury                      27
fatal_(y/n)                 19
time                      3213
species_                  2934
investigator_or_source      15
pdf                          0
href_formula                 1
href                         3
case_number.1                0
case_number.2                0
original_order               0
unnamed:_22               5991
unnamed:_23               5990
dtype: int64

In [8]:
# Searching for Null values (isnull() is an alias of isna(), so the counts are identical)
sharks.isnull().sum()

case_number                  0
date                         0
year                         0
type                         0
country                     43
area                       402
location                   496
activity                   527
name                       200
sex_                       567
age                       2681
injury                      27
fatal_(y/n)                 19
time                      3213
species_                  2934
investigator_or_source      15
pdf                          0
href_formula                 1
href                         3
case_number.1                0
case_number.2                0
original_order               0
unnamed:_22               5991
unnamed:_23               5990
dtype: int64
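The two outputs above are identical because, in pandas, `isnull()` is an alias of `isna()`; both flag `NaN` and `None`. A minimal sketch on a hypothetical three-row DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical sample mixing NaN and None as missing markers
demo = pd.DataFrame({'age': [16, np.nan, 43], 'time': [None, '11h00', None]})

# Both methods return the same boolean mask
same = demo.isna().equals(demo.isnull())

# Missing values per column
counts = demo.isna().sum()
print(same)
print(counts)
```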

In [9]:
# Searching for duplicate rows
sharks.duplicated().sum()

0

### Dropping columns

#### Dropping two NaN columns 

<br>

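These two columns are almost entirely NaN (5991 and 5990 missing values out of 5992 rows), so they hold no useful data. Dropping them also keeps the DataFrame smaller.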

In [10]:
sharks = sharks.drop(axis = 1, columns = ['unnamed:_22', 'unnamed:_23'])
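Instead of naming the columns by hand, mostly empty columns could also be found from their share of missing values. This is only a sketch on a hypothetical four-row DataFrame; the 0.5 threshold is an assumption, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one complete column and one that is mostly NaN
demo = pd.DataFrame({
    'case_number': ['a', 'b', 'c', 'd'],
    'unnamed:_22': [np.nan, np.nan, np.nan, 'x'],
})

# Fraction of missing values in each column
missing_share = demo.isna().mean()

# Columns whose missing share exceeds the (assumed) threshold
mostly_empty = missing_share[missing_share > 0.5].index.tolist()

trimmed = demo.drop(columns=mostly_empty)
print(mostly_empty)
```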

#### Drop original_order column

<br>

This column held the original index, which is no longer useful.

In [11]:
sharks = sharks.drop(axis = 1, columns = ['original_order'])

In [36]:
sharks_case = sharks['case_number'] == sharks['case_number.1']
sharks_case.value_counts()

True     5979
False      13
dtype: int64

In [13]:
sharks_case = sharks['case_number.1'] == sharks['case_number.2']
sharks_case.value_counts()

True     5981
False      11
dtype: int64

In [29]:
sharks_href = sharks['href_formula'] == sharks['href']
sharks_href.value_counts()
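One caveat with `==` comparisons like these: a row counts as a mismatch whenever either value is missing, because NaN never equals NaN. Since `href_formula` and `href` both contain missing values, a NaN-aware comparison may give a clearer picture. The helper `columns_match` below is hypothetical and the sample data is made up.

```python
import numpy as np
import pandas as pd

def columns_match(left, right):
    """Element-wise equality that treats two missing values as a match."""
    return (left == right) | (left.isna() & right.isna())

# Hypothetical sample: one equal pair, one pair of NaN, one real mismatch
demo = pd.DataFrame({
    'href_formula': ['a.pdf', np.nan, 'c.pdf'],
    'href': ['a.pdf', np.nan, 'x.pdf'],
})

plain = (demo['href_formula'] == demo['href']).sum()
nan_aware = columns_match(demo['href_formula'], demo['href']).sum()

# Rows that really differ, for manual inspection
mismatched = demo[~columns_match(demo['href_formula'], demo['href'])]
print(plain, nan_aware)
```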

### Dropping duplicate columns

<br>

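* case_number.1 and case_number.2 were near-duplicates of case_number: only 13 of 5992 rows differ between case_number and case_number.1, and 11 between case_number.1 and case_number.2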


<br>

* href_formula was a duplicate column of href

<br>

In [37]:
sharks = sharks.drop(axis = 1, columns = ['case_number.1'])

In [23]:
sharks = sharks.drop(axis = 1, columns = ['case_number.2'])

In [30]:
sharks = sharks.drop(axis = 1, columns = ['href_formula'])

## Data Manipulation

### Creating and cleaning a column of months

In [58]:
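temp_lst = []
for row in sharks['date']:
    # Collect every three-letter month abbreviation wrapped in hyphens, e.g. '-Sep-'
    temp_row = ''.join(re.findall(r'-[A-Za-z]{3}-', row)).lower()
    # Strip the surrounding hyphens
    temp_row = temp_row.replace('-', '')

    # Dates without a recognisable month become NaN
    if temp_row == '':
        temp_row = np.nan

    temp_lst.append(temp_row)
sharks['month'] = temp_lst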

In [94]:
sharks['month'].value_counts(dropna=False)

NaN       870
jul       590
aug       537
sep       491
jan       476
jun       453
dec       397
oct       385
apr       375
mar       367
nov       365
may       344
feb       335
marmar      2
janjan      1
jut         1
novnov      1
augaug      1
sepsep      1
Name: month, dtype: int64
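Doubled values such as `marmar` appear because `re.findall` returns every match and the loop joins them all. A vectorised alternative, sketched here with `Series.str.extract`, keeps only the first match and returns NaN when there is none. The sample dates are hypothetical and only illustrate the kinds of strings the `date` column might contain.

```python
import pandas as pd

# Hypothetical date strings: a normal date, a prefixed date,
# a value with no month, and a range containing two months
dates = pd.Series([
    '18-Sep-16',
    'Reported 08-Mar-2006',
    '1970s',
    '15-Mar-1923 to 17-Mar-1923',
])

# Capture the first three-letter token between hyphens, then lowercase it
months = dates.str.extract(r'-([A-Za-z]{3})-', expand=False).str.lower()
print(months.tolist())
```

This does not fix misspellings such as `jut`, which still need a separate correction.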

Dropping rows without a month and correcting misspelled month names.

In [97]:
sharks = sharks.dropna(subset = ['month'])

In [99]:
# Doubled abbreviations such as 'marmar' keep only their first three letters
sharks['month'] = sharks['month'].str[:3]
# Correct the misspelled 'jut' to 'jun'
sharks['month'] = sharks['month'].replace('jut', 'jun')

In [100]:
sharks['month'].value_counts(dropna=False)

jul    590
aug    538
sep    492
jan    477
jun    454
dec    397
oct    385
apr    375
mar    369
nov    366
may    344
feb    335
Name: month, dtype: int64

### Cleaning Country column

In [157]:
# Map alternative or regional names to the country name used from here on
country_aliases = {
    'usa': 'united states',
    'bahamas': 'the bahamas',
    'england': 'united kingdom',
    'british isles': 'united kingdom',
    'reunion': 'réunion',
    'okinawa': 'japan',
    'azores': 'portugal',
    'red sea': 'egypt',
    'columbia': 'colombia',
    'new britain': 'papua new guinea',
    'new guinea': 'papua new guinea',
    'british new guinea': 'papua new guinea',
    'admiralty islands': 'papua new guinea',
}

def clean_country(value):
    # Non-string entries (missing values) become NaN
    if not isinstance(value, str):
        return np.nan
    # Remove anything after a slash, parenthesised text, periods and question marks
    cleaned = re.sub(r'/.+|\(.+\)|\.|\?', '', value)
    # Trim, lowercase and spell out '&'
    cleaned = cleaned.strip().lower().replace('&', 'and')
    return country_aliases.get(cleaned, cleaned)

sharks['country'] = sharks['country'].apply(clean_country)

In [190]:
sharks['country'].unique()

array(['united states', 'australia', 'new caledonia', 'réunion',
       'the bahamas', 'spain', 'china', 'japan', 'colombia',
       'south africa', 'egypt', 'new zealand', 'indonesia',
       'french polynesia', 'cape verde', 'fiji', 'brazil',
       'dominican republic', 'cayman islands', 'united arab emirates',
       'aruba', 'mozambique', 'thailand', 'puerto rico', 'italy',
       'mexico', 'atlantic ocean', 'greece', 'mauritius', 'st martin',
       'france', 'ecuador', 'papua new guinea', 'trinidad and tobago',
       'kiribati', 'israel', 'diego garcia', 'taiwan', 'jamaica',
       'palestinian territories', 'guam', 'seychelles', 'belize',
       'philippines', 'nigeria', 'tonga', 'scotland', 'canada', 'croatia',
       'saudi arabia', 'chile', 'antigua', 'kenya', 'russia',
       'turks and caicos', 'costa rica', 'united kingdom', 'malaysia',
       'samoa', 'portugal', 'solomon islands', 'south korea', 'malta',
       'vietnam', 'madagascar', 'panama', 'somalia', 'nevis', 'cu
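The cleaned list still contains values that are not countries, such as `atlantic ocean` and `diego garcia`. One way to surface them, sketched below, is to compare the unique values against a reference set. The set `known_countries` is a hypothetical placeholder; in practice it would come from a complete country list.

```python
import pandas as pd

# Hypothetical reference set, deliberately tiny for illustration
known_countries = {'united states', 'australia', 'japan', 'egypt'}

# Hypothetical sample of cleaned values, including a missing entry
countries = pd.Series(
    ['united states', 'atlantic ocean', 'japan', None, 'diego garcia']
)

# Unique non-missing values that the reference set does not recognise
unrecognised = sorted(set(countries.dropna()) - known_countries)
print(unrecognised)  # ['atlantic ocean', 'diego garcia']
```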

In [187]:
mask = (sharks['country'] == 'united states') & (sharks['activity'] == 'Surfing')
sharks.loc[mask, :]

Unnamed: 0,case_number,date,month,year,type,country,area,location,activity,name,sex_,age,injury,fatal_(y/n),time,species_,investigator_or_source,pdf,href
0,2016.09.18.c,18-Sep-16,sep,2016,Unprovoked,united states,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,16,Minor injury to thigh,N,13h00,,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
1,2016.09.18.b,18-Sep-16,sep,2016,Unprovoked,united states,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,36,Lacerations to hands,N,11h00,,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
2,2016.09.18.a,18-Sep-16,sep,2016,Unprovoked,united states,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,43,Lacerations to lower leg,N,10h43,,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
13,2016.08.29.b,29-Aug-16,aug,2016,Unprovoked,united states,Florida,"New Smyrna Beach, Volusia County",Surfing,Sam Cumiskey,M,25,Lacerations to right foot,N,15h00,"Bull shark, 6'","News Channel 8, 8/30/16",2016.08.29.b-Cumiskey.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
14,2016.08.29.a,29-Aug-16,aug,2016,Unprovoked,united states,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,37,Minor injury to ankle,N,14h00,,"News Channel 8, 8/30/16",2016.08.29.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3231,1968.09.22,22-Sep-68,sep,1968,Invalid,united states,Florida,"Riviera Beach, Palm Beach County",Surfing,,,,,UNKNOWN,12h48,Shark involvement not confirmed,M. Vorenberg,1968.09.22-NV-RivieraBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
3254,1968.03.10,10-Mar-68,mar,1968,Unprovoked,united states,Florida,"Jensen Beach, Martin County",Surfing,Jan Icyda,M,20,Foot lacerated,N,Afternoon,,H.D. Baldridge (1994) SAF Case #1548,1968.03.10-Icyda.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
3256,1968.02.26,26-Feb-68,feb,1968,Unprovoked,united states,Florida,Palm Beach County,Surfing,Fred Hennessee,M,17,Foot lacerated,N,13h30,Bull shark,"H.D. Baldridge, #1549",1968.02.26-Hennessee.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
3403,1965.04.25,25-Apr-65,apr,1965,Invalid,united states,California,"Pacifica, San Mateo County",Surfing,Michael Sammut,M,19,Drowned & his body was not recovered. Sharks s...,Y,16h00,,"H.D. Baldridge, SAF Case #1370; R. Collier, p....",1965.04.25-Sammut.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...


array(['Surfing', 'Fishing', 'Wading', ..., 'Conch diver',
       'Man fell overboard from ship. Those on board threw a rope to him with a wooden block & were pulling him to the ship',
       'Crossing river on a raft'], dtype=object)

## Data Export
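A minimal sketch of the export step, assuming the cleaned DataFrame is written back to CSV with `DataFrame.to_csv`. The file name `sharks_clean.csv` and the sample rows are hypothetical; the example writes to an in-memory buffer so it runs without touching the disk.

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame
demo = pd.DataFrame({'country': ['réunion', 'japan'], 'month': ['sep', 'mar']})

# index=False leaves out the row index, so no 'Unnamed: 0' column
# appears when the file is read again
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# With a real file this could be, for example:
# demo.to_csv('sharks_clean.csv', index=False, encoding='utf-8')

# Read the exported text back to confirm the round trip
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(list(round_trip.columns))
```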

## Data Analysis