# Shark Attack Database Mini Project

## Objective
<br>

* Organize and clean a CSV data file.

* Document and explain all Python code and commands used for importing, cleaning, manipulating, exporting, and analyzing the data.

<br>

database version: 7.1

# Starting the code

## Importing libraries

In [20]:
import pandas as pd
import numpy as np
import re
import countryinfo

## Importing database

Import the CSV database using latin-1 encoding.

In [2]:
# Importing csv database as "sharks"
sharks = pd.read_csv('attacks.csv', sep = ',', encoding='latin-1')

# Creating a backup copy
sharks_bkp = sharks.copy()

## Declaring functions

In [80]:
def standardize_headers(df, func=None):
    '''
    Clean the column names of a DataFrame in place:
    replace spaces with underscores and lowercase every name.
    If func is given, apply it to the DataFrame and return the result.
    '''
    df.columns = df.columns.str.replace(' ', '_').str.lower()
    if func:
        df = df.apply(func)
    return df

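def get_month(x):
    '''
    Extract the month token from a date string such as '25-Jun-2018'.
    Return NaN when the text between the hyphens is not three characters
    long or when x is not a string.
    '''
    try:
        pattern = r'-(\w+)-'
        x = ''.join(re.findall(pattern, x))
        if len(x) == 3:
            return x
        return np.nan
    except TypeError:
        # re.findall raises TypeError for non-string input such as NaN
        return np.nan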

## Data Cleaning

The database is cleaned step by step to improve its quality: incorrect or unusable information is removed so that only the most reliable records remain.

In [4]:
# Dataframe shape
sharks.shape

(25723, 24)

### Cleaning Columns
<br>

* Replace whitespace with underscores

<br>

* Special characters

<br>

* Lowercase all characters

<br>


In [5]:
standardize_headers(sharks)
sharks.shape

(25723, 24)
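The header step replaces spaces and lowercases the names, which still leaves headers such as `sex_` and `fatal_(y/n)`. A minimal sketch of a stricter cleaner follows. `clean_header` is a hypothetical helper, and the raw header spellings in the stand-in frame are assumptions about how the CSV names them.

```python
import re
import pandas as pd

def clean_header(name):
    """Return a snake_case version of one column name (hypothetical helper)."""
    name = str(name).strip().lower()
    # Collapse every run of non-alphanumeric characters into one underscore
    name = re.sub(r'[^0-9a-z]+', '_', name)
    # Remove underscores left at either end
    return name.strip('_')

# Stand-in frame; the raw header spellings are assumed, not read from the file
demo = pd.DataFrame(columns=['Case Number', 'Sex ', 'Fatal (Y/N)', 'Unnamed: 22'])
demo.columns = [clean_header(c) for c in demo.columns]
print(list(demo.columns))
```

Applying this to `sharks` would rename columns that later cells refer to (for example `sex_` and `unnamed:_22`), so those references would need to change as well.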

### Checking Null and NaN

In [6]:
# Searching for NaN values
sharks.isna().sum()

case_number               17021
date                      19421
year                      19423
type                      19425
country                   19471
area                      19876
location                  19961
activity                  19965
name                      19631
sex_                      19986
age                       22252
injury                    19449
fatal_(y/n)               19960
time                      22775
species_                  22259
investigator_or_source    19438
pdf                       19421
href_formula              19422
href                      19421
case_number.1             19421
case_number.2             19421
original_order            19414
unnamed:_22               25722
unnamed:_23               25721
dtype: int64

In [7]:
# Searching for Null values (isnull() is an alias of isna(), so the counts are identical)
sharks.isnull().sum()

case_number               17021
date                      19421
year                      19423
type                      19425
country                   19471
area                      19876
location                  19961
activity                  19965
name                      19631
sex_                      19986
age                       22252
injury                    19449
fatal_(y/n)               19960
time                      22775
species_                  22259
investigator_or_source    19438
pdf                       19421
href_formula              19422
href                      19421
case_number.1             19421
case_number.2             19421
original_order            19414
unnamed:_22               25722
unnamed:_23               25721
dtype: int64
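The two outputs match because `isnull()` is an alias of `isna()`. The sketch below checks this on a small made-up frame and also expresses the missing values as a share of each column, which can be easier to read than raw counts. The frame `demo` and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for two columns of the real frame
demo = pd.DataFrame({'age': [57, np.nan, None], 'sex': ['F', None, 'M']})

# isnull() is an alias of isna(), so both calls give the same counts
same_counts = demo.isna().sum().equals(demo.isnull().sum())

# Share of missing values per column, as a percentage
missing_pct = demo.isna().mean().mul(100).round(1)
print(same_counts)
print(missing_pct)
```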

In [8]:
# Searching for duplicate rows
sharks.duplicated().sum()

19411

In [9]:
# Dropping duplicates
sharks = sharks.drop_duplicates()
sharks.isna().sum()

case_number                  2
date                        10
year                        12
type                        14
country                     60
area                       465
location                   550
activity                   554
name                       220
sex_                       575
age                       2841
injury                      38
fatal_(y/n)                549
time                      3364
species_                  2848
investigator_or_source      27
pdf                         10
href_formula                11
href                        10
case_number.1               10
case_number.2               10
original_order               3
unnamed:_22               6311
unnamed:_23               6310
dtype: int64
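The NaN counts fall sharply after deduplication, which suggests that most duplicates were empty rows. `drop_duplicates()` still keeps one copy of each repeated row, so a few uninformative rows can survive. The sketch below shows two ways to remove them on a made-up frame; the threshold of two non-null fields is an assumption and not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame: one real record, two empty rows, two placeholder rows
demo = pd.DataFrame({
    'case_number': ['2018.06.25', np.nan, np.nan, '0', '0'],
    'date': ['25-Jun-2018', np.nan, np.nan, np.nan, np.nan],
    'country': ['USA', np.nan, np.nan, np.nan, np.nan],
})

# drop_duplicates() keeps the first copy of each repeated row,
# so one all-NaN row and one placeholder row survive
deduped = demo.drop_duplicates()

# Dropping rows that are entirely empty removes the all-NaN survivor
no_empty = deduped.dropna(how='all')

# Requiring at least two non-null fields also removes the placeholder row
informative = deduped.dropna(thresh=2)
print(len(deduped), len(no_empty), len(informative))
```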

### Dropping column

#### Dropping two NaN columns 

<br>

These two columns hold no useful data, so they are dropped, which also improves performance.

In [10]:
sharks = sharks.drop(axis = 1, columns = ['unnamed:_22', 'unnamed:_23'])

#### Drop original_order column

<br>

This column was the original index and is no longer useful.

In [11]:
sharks = sharks.drop(axis = 1, columns = ['original_order'])

In [12]:
sharks_case = sharks['case_number'] == sharks['case_number.1']
sharks_case.value_counts()

True     6278
False      34
dtype: int64

In [13]:
sharks_case = sharks['case_number.1'] == sharks['case_number.2']
sharks_case.value_counts()

True     6282
False      30
dtype: int64

In [14]:
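sharks_href = sharks['href_formula'] == sharks['href']
# Count the href comparison; the recorded output below came from
# sharks_case.value_counts(), so it repeats the counts of In [13]
sharks_href.value_counts()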

True     6282
False      30
dtype: int64
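A plain `==` between two columns reports a pair of missing values as a mismatch, so some of the `False` counts above may come from NaN rows and not from real differences. The sketch below shows a NaN-aware comparison on a made-up frame. The column names mirror the notebook, but the values are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names come from the notebook
demo = pd.DataFrame({
    'href_formula': ['a.pdf', 'b.pdf', np.nan, 'd.pdf'],
    'href': ['a.pdf', 'B.pdf', np.nan, 'd.pdf'],
})

# Plain == reports NaN == NaN as False, which inflates the mismatch count
naive_equal = demo['href_formula'] == demo['href']

# Treat two missing values as a match
both_missing = demo['href_formula'].isna() & demo['href'].isna()
nan_aware_equal = naive_equal | both_missing

# Inspect only the rows that genuinely differ
mismatches = demo.loc[~nan_aware_equal, ['href_formula', 'href']]
print(naive_equal.sum(), nan_aware_equal.sum())
print(mismatches)
```

Looking at `mismatches` before dropping a column helps confirm that the discarded copy carries no information the kept column lacks.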

### Drop duplicate column

<br>

* case_number.1 and case_number.2 were duplicate columns of case_number


<br>

* href_formula was a duplicate column of href

<br>

In [15]:
sharks = sharks.drop(axis = 1, columns = ['case_number.1'])

In [16]:
sharks = sharks.drop(axis = 1, columns = ['case_number.2'])

In [17]:
sharks = sharks.drop(axis = 1, columns = ['href_formula'])

## Data Manipulation

### Creating and cleaning a column of months

In [22]:
sharks.head()

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex_,age,injury,fatal_(y/n),time,species_,investigator_or_source,pdf,href
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57.0,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11.0,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48.0,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...


In [57]:
sharks['month'] = sharks['date'].apply(get_month)
sharks

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex_,age,injury,fatal_(y/n),time,species_,investigator_or_source,pdf,href,months,month
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,,Jun
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,,Jun
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,,Jun
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,,Jun
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,,Jun
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6307,0,,,,,,,,,,,,,,,,,,,
6308,0,,,,,,,,,,,,,,,,,,,
6309,0,,,,,,,,,,,,,,,,,,,
8702,,,,,,,,,,,,,,,,,,,,


In [70]:
sharks['month'].value_counts(dropna=False)

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas\_libs\hashtable_class_helper.pxi", line 1652, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


Jul            621
Aug            553
Sep            520
Jan            493
Jun            475
              ... 
[Sep, 1928]      1
[Aug, 1852]      1
28               1
[Ca. 1962]       1
[Aug, 1864]      1
Name: month, Length: 674, dtype: int64

Dropping NaN rows and correcting misspelled month names

In [71]:
sharks = sharks.dropna(subset = ['month'])

In [81]:
sharks['month'].value_counts(dropna=False)

          895
Jul       621
Aug       555
Sep       520
Jan       493
Jun       475
Apr       420
Oct       417
Dec       415
Mar       379
Nov       377
May       358
Feb       356
July        4
Sept        2
Ap          2
MarMar      2
17          1
March       1
SepSep      1
13          1
24          1
28          1
AugAug      1
JanJan      1
NovNov      1
26          1
30          1
Name: month, dtype: int64
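The counts above still contain variants such as `July`, `Sept`, and `MarMar`, plus day numbers and empty strings. One possible way to normalize them is sketched below. `normalize_month` and `VALID_MONTHS` are hypothetical names, and treating the ambiguous `Ap` as missing is an assumption.

```python
import numpy as np
import pandas as pd

VALID_MONTHS = {'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'}

def normalize_month(token):
    """Map a raw token to a three-letter month, or NaN (hypothetical helper)."""
    if not isinstance(token, str):
        return np.nan
    # Keep the first three characters: 'July' -> 'Jul', 'MarMar' -> 'Mar'
    candidate = token.strip()[:3].title()
    # Day numbers, empty strings and short tokens such as 'Ap' become NaN
    return candidate if candidate in VALID_MONTHS else np.nan

# Sample tokens copied from the value counts above, plus a missing value
raw = pd.Series(['Jun', 'July', 'Sept', 'MarMar', 'Ap', '17', '', np.nan])
months = raw.apply(normalize_month)
print(months.value_counts(dropna=False))
```

On the real frame the same idea would read `sharks['month'] = sharks['month'].apply(normalize_month)`, followed by `dropna(subset=['month'])` if rows without a usable month should be removed.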

In [85]:
sharks

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex_,age,injury,fatal_(y/n),time,species_,investigator_or_source,pdf,href,month
0,2018.06.25,25-Jun-2018,2018.0,Boating,united states,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,Jun
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,united states,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,Jun
2,2018.06.09,09-Jun-2018,2018.0,Invalid,united states,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,Jun
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,australia,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,Jun
4,2018.06.04,04-Jun-2018,2018.0,Provoked,mexico,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,Jun
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6297,ND.0005,Before 1903,0.0,Unprovoked,australia,Western Australia,Roebuck Bay,Diving,male,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, p. 234",ND-0005-RoebuckBay.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,
6298,ND.0004,Before 1903,0.0,Unprovoked,australia,Western Australia,,Pearl diving,Ahmun,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, pp. 233-234",ND-0004-Ahmun.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,
6299,ND.0003,1900-1905,0.0,Unprovoked,united states,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,M,,FATAL,Y,,,"F. Schwartz, p.23; C. Creswell, GSAF",ND-0003-Ocracoke_1900-1905.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,
6300,ND.0002,1883-1889,0.0,Unprovoked,panama,,"Panama Bay 8ºN, 79ºW",,Jules Patterson,M,,FATAL,Y,,,"The Sun, 10/20/1938",ND-0002-JulesPatterson.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,


### Cleaning Country column

In [67]:
# Country names that need an explicit correction after normalization
country_fixes = {
    'usa': 'united states',
    'bahamas': 'the bahamas',
    'england': 'united kingdom',
    'british isles': 'united kingdom',
    'reunion': 'réunion',
    'okinawa': 'japan',
    'azores': 'portugal',
    'red sea': 'egypt',
    'columbia': 'colombia',
    'new britain': 'papua new guinea',
    'new guinea': 'papua new guinea',
    'british new guinea': 'papua new guinea',
    'admiralty islands': 'papua new guinea',
}

def clean_country(value):
    '''Normalize one country value; non-string values become NaN.'''
    if not isinstance(value, str):
        return np.nan
    # Drop "/..." suffixes, parenthesized notes, periods and question marks
    value = re.sub(r'/.+|\(.+\)|\.|\?', '', value)
    value = value.strip().lower().replace('&', 'and')
    return country_fixes.get(value, value)

# Assign the result back instead of calling replace(..., inplace=True) per row
sharks['country'] = sharks['country'].apply(clean_country)

In [68]:
sharks['country'].unique()

array(['united states', 'australia', 'mexico', 'brazil', 'united kingdom',
       'south africa', 'thailand', 'costa rica', 'maldives',
       'the bahamas', 'new caledonia', 'ecuador', 'malaysia', 'libya',
       nan, 'cuba', 'mauritius', 'new zealand', 'spain', 'samoa',
       'solomon islands', 'japan', 'egypt',
       'st helena, british overseas territory', 'comoros', 'réunion',
       'french polynesia', 'united arab emirates', 'philippines',
       'indonesia', 'china', 'colombia', 'cape verde', 'fiji',
       'dominican republic', 'cayman islands', 'aruba', 'mozambique',
       'puerto rico', 'italy', 'atlantic ocean', 'greece', 'st martin',
       'france', 'papua new guinea', 'trinidad and tobago', 'kiribati',
       'israel', 'diego garcia', 'taiwan', 'jamaica',
       'palestinian territories', 'guam', 'seychelles', 'belize',
       'nigeria', 'tonga', 'scotland', 'canada', 'croatia',
       'saudi arabia', 'chile', 'antigua', 'kenya', 'russia',
       'turks and caicos', '

In [69]:
mask = (sharks['country'] == 'united states') & (sharks['activity'] == 'Surfing')
sharks.loc[mask, :]

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex_,age,injury,fatal_(y/n),time,species_,investigator_or_source,pdf,href,months,month
2,2018.06.09,09-Jun-2018,2018.0,Invalid,united states,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,,Jun
53,2017.12.31,31-Dec-2017,2017.0,Unprovoked,united states,Hawaii,"Hultin's Beach, Oahu",Surfing,Marjorie Mariano,F,54,Severe lacerations to left thigh & knee,N,18h00,Tiger shark,"J. Howard, Surfling Now, 1/2/2018",2017.12.31-Mariano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,,Dec
54,2017.12.30,30-Dec-2017,2017.0,Unprovoked,united states,California,"Drakes Estero, Point Reyes, Marin County",Surfing,Natalie Jones,F,35,Foot bitten,N,12h00,,R. Collier,2017.12.30.Jones.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,,Dec
61,2017.11.18,18-Nov-2017,2017.0,Unprovoked,united states,Florida,"Floridana Beach, Brevard County",Surfing,Kaia Anderson,F,14,Heel bitten,N,Late afternoon,,"Florida Today, 11/21/2017",2017.11.18-Anderson.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,,Nov
74,2017.10.09,09-Oct-2017,2017.0,Unprovoked,united states,Hawaii,"Davidsons Beach, Kekaha, Kauai",Surfing,Mitch Milan,M,54,Lacerations to left hand,N,18h30,"Tiger shark, 8 to 10 feet","Hawaii News Now, 10/10/2017",2017.10.09-Milan.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,,Oct
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3643,1965.00.00.f,1965,1965.0,Provoked,united states,California,"Dana Point, San Clemente, Orange County",Surfing,Barry Berg,M,Teen,Puncture wounds to foot when he stepped on a s...,N,,,"Orange County Register, 1/28/2998",1965.00.00.f-Berg.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,,[1965]
4284,1955.08.30.a,30-Aug-1955,1955.0,Provoked,united states,California,"Zuma Beach, Santa Monica, Los Angeles County",Surfing,Dale Strand,M,25,"Surfer grabbed shark, which turned & bit him a...",N,,5' thresher or blue shark. The shark was kill...,"SAF Case #244; D. Miller & R. Collier, V.M. Co...",1955.08.30.a-DaleStrand.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,,Aug
6097,1828.00.00,1828,1828.0,Unprovoked,united states,Hawaii,"Uo, Lahaina, Maui",Surfing,Male,M,,FATAL,Y,,,"J. Borg, p.68; L. Taylor (1993), pp.94-95",1828.00.00-male.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,,[1828]
6143,1779.00.00,1779,1779.0,Unprovoked,united states,Hawaii,"Maliu, Hawai'i",Surfing,Nu'u-anu-pa'a hu,M,young,"FATAL, buttock lacerated",Y,,,"G.H. Balazs; J. Borg, p.68; L. Taylor (1993), ...",1779.00.00-Hawaii.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,,[1779]
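The mask above matches `activity` exactly, so variants in case or spacing would be missed, and rows with a missing activity need care when string methods are used. The sketch below compares the exact mask with a looser one on made-up rows. Whether such variants actually occur in the data is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the notebook
demo = pd.DataFrame({
    'country': ['united states', 'united states', 'australia', 'united states'],
    'activity': ['Surfing', 'surfing ', 'Surfing', np.nan],
})

in_us = demo['country'] == 'united states'

# Exact comparison misses spelling variants such as 'surfing '
exact = in_us & (demo['activity'] == 'Surfing')

# case=False ignores capitalization; na=False maps a missing activity to False
loose = in_us & demo['activity'].str.contains('surf', case=False, na=False)

print(exact.sum(), loose.sum())
print(demo.loc[loose, :])
```

A substring match also picks up compound activities that mention surfing, so the looser mask answers a slightly broader question than the exact one.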


## Data Export

## Data Analysis