# Shark Attack Database Mini Project

## Objective
<br>

* Organize and clean a CSV data file.

* Detail and explain all Python code and commands used for importing, cleaning, manipulating, exporting and analysing the data.

<br>

database version: 7.1

# Starting the code

## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import re
import countryinfo

## Importing database

The CSV database is read with latin-1 encoding.

In [2]:
# Importing csv database as "sharks"
sharks = pd.read_csv('attacks.csv', sep = ',', encoding='latin-1')

# Creating a backup copy
sharks_bkp = sharks.copy()
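
As a minimal sketch of why the encoding argument matters, the snippet below builds a tiny in-memory file (hypothetical sample data, not taken from `attacks.csv`) containing one accented character stored as latin-1. It assumes the real file fails to decode as UTF-8 in the same way, which the notebook does not show.

```python
import io
import pandas as pd

# Hypothetical sample: "é" stored as the single latin-1 byte 0xE9,
# which is not valid UTF-8 on its own.
raw = 'Country,Name\nREUNION,André\n'.encode('latin-1')

# Reading with the matching encoding recovers the text.
sample = pd.read_csv(io.BytesIO(raw), encoding='latin-1')

# The default UTF-8 decoding rejects the same bytes.
# UnicodeDecodeError is a subclass of ValueError.
try:
    pd.read_csv(io.BytesIO(raw))
    utf8_ok = True
except ValueError:
    utf8_ok = False

print(sample)
print('Readable as UTF-8:', utf8_ok)
```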

## Declaring functions

In [3]:
def standardize_headers(df, func=None):
    '''
    Clean the column names of df in place: replace whitespace with
    underscores and convert all characters to lowercase.
    If func is given, it is applied to the DataFrame and the result is returned.
    '''
    df.columns = df.columns.str.replace(' ', '_').str.lower()
    if func:
        df = df.apply(func)
    return df

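def get_month(x):
    '''
    Return the text found between two hyphens (for example the month in a
    day-month-year date). Non-string input is returned unchanged.
    '''
    try:
        pattern = r'-(\w+)-'
        return ''.join(re.findall(pattern, x))
    except TypeError:
        return x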
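
To illustrate what the two helpers do, here is a small sketch that applies the same expressions to toy values. The header names and date strings are made-up examples shaped like the ones in this dataset.

```python
import re
import pandas as pd

# Toy frame whose headers contain spaces and mixed case.
toy = pd.DataFrame(columns=['Case Number', 'Sex ', 'Fatal (Y/N)'])
toy.columns = toy.columns.str.replace(' ', '_').str.lower()
headers = list(toy.columns)

# The month pattern captures the token between two hyphens.
month = ''.join(re.findall(r'-(\w+)-', '25-Jun-2018'))
# A string without two hyphens yields an empty result.
no_month = ''.join(re.findall(r'-(\w+)-', 'Reported 1960'))

print(headers)
print(repr(month), repr(no_month))
```

A trailing space in a header becomes a trailing underscore, which is why a column such as `sex_` needs a separate rename.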

## Data Cleaning

The database is cleaned step by step to improve its quality: incorrect or unusable information is removed so that only reliable data remains.

In [4]:
# Dataframe shape
sharks.shape

(25723, 24)

### Cleaning Columns
<br>

* Cleaning whitespaces

<br>

* Special characters

<br>

* Lower all characters. 

<br>


In [5]:
standardize_headers(sharks)
sharks.shape

(25723, 24)

In [6]:
sharks = sharks.rename(columns = {'sex_':'sex'})
sharks.columns

Index(['case_number', 'date', 'year', 'type', 'country', 'area', 'location',
       'activity', 'name', 'sex', 'age', 'injury', 'fatal_(y/n)', 'time',
       'species_', 'investigator_or_source', 'pdf', 'href_formula', 'href',
       'case_number.1', 'case_number.2', 'original_order', 'unnamed:_22',
       'unnamed:_23'],
      dtype='object')

### Checking Null and NaN

In [7]:
# Searching for NaN values
sharks.isna().sum()

case_number               17021
date                      19421
year                      19423
type                      19425
country                   19471
area                      19876
location                  19961
activity                  19965
name                      19631
sex                       19986
age                       22252
injury                    19449
fatal_(y/n)               19960
time                      22775
species_                  22259
investigator_or_source    19438
pdf                       19421
href_formula              19422
href                      19421
case_number.1             19421
case_number.2             19421
original_order            19414
unnamed:_22               25722
unnamed:_23               25721
dtype: int64

In [31]:
# Searching for Null values
sharks.isnull().sum()

case_number                  2
date                        10
year                        12
type                        14
country                     60
area                       465
location                   550
activity                   554
name                       220
sex_                       575
age                       2841
injury                      38
fatal_(y/n)                549
time                      3364
species_                  2848
investigator_or_source      27
pdf                         10
href                        10
dtype: int64
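
A short sketch on toy data (hypothetical values) showing that `isnull()` is an alias of `isna()`, so both calls count the same cells, and how the share of missing values per column can be read from the same boolean mask.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'age': [23, np.nan, 41, np.nan],
                    'sex': ['M', None, 'F', 'M']})

# Both methods build the same boolean mask.
na_counts = toy.isna().sum()
same = toy.isna().equals(toy.isnull())

# The mean of a boolean mask is the fraction of missing values per column.
na_share = toy.isna().mean()

print(na_counts)
print(same)
print(na_share)
```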

In [8]:
# Searching for duplicate rows
sharks.duplicated().sum()

19411

In [9]:
# Dropping duplicates
sharks = sharks.drop_duplicates()
sharks.isna().sum()

case_number                  2
date                        10
year                        12
type                        14
country                     60
area                       465
location                   550
activity                   554
name                       220
sex                        575
age                       2841
injury                      38
fatal_(y/n)                549
time                      3364
species_                  2848
investigator_or_source      27
pdf                         10
href_formula                11
href                        10
case_number.1               10
case_number.2               10
original_order               3
unnamed:_22               6311
unnamed:_23               6310
dtype: int64
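
The large drop in NaN counts after removing duplicates suggests that most duplicated rows were empty. The sketch below uses toy data (hypothetical values) to show how `duplicated()` treats repeated empty rows and why one of them survives `drop_duplicates()`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'case_number': ['2018.06.25', np.nan, np.nan, np.nan],
    'country': ['USA', np.nan, np.nan, np.nan],
})

# NaN is treated as equal to NaN here: the first empty row counts as
# the original and the two later ones are flagged as duplicates.
n_dupes = toy.duplicated().sum()

deduped = toy.drop_duplicates()

# One fully empty row is still present; dropna(how='all') removes it.
non_empty = deduped.dropna(how='all')

print(n_dupes, len(deduped), len(non_empty))
```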

### Dropping column

#### Dropping two NaN columns 

<br>

Both columns are almost entirely NaN (one and two non-null values respectively), so they are dropped.

In [10]:
sharks = sharks.drop(axis = 1, columns = ['unnamed:_22', 'unnamed:_23'])

#### Drop original_order column

<br>

This column held the original row order, which is no longer useful.

In [11]:
sharks = sharks.drop(axis = 1, columns = ['original_order'])

In [12]:
sharks_case = sharks['case_number'] == sharks['case_number.1']
sharks_case.value_counts()

True     6278
False      34
dtype: int64

In [13]:
sharks_case = sharks['case_number.1'] == sharks['case_number.2']
sharks_case.value_counts()

True     6282
False      30
dtype: int64

In [14]:
sharks_href = sharks['href_formula'] == sharks['href']
# Count matching rows for the href comparison (sharks_href, not sharks_case)
sharks_href.value_counts()
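
One caveat when comparing columns with `==`: two missing values do not compare as equal, so rows where both columns are NaN are counted as mismatches. A minimal sketch on toy data (hypothetical file names) that separates those rows from genuine differences:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'href_formula': ['a.pdf', np.nan, 'c.pdf'],
    'href':         ['a.pdf', np.nan, 'x.pdf'],
})

# Element-wise comparison reports NaN != NaN as True.
differs = toy['href_formula'] != toy['href']
naive_mismatches = differs.sum()

# Treat two missing values as a match to count real differences only.
both_missing = toy['href_formula'].isna() & toy['href'].isna()
real_mismatches = (differs & ~both_missing).sum()

print(naive_mismatches, real_mismatches)
```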

### Drop duplicate column

<br>

* case_number.1 and case_number.2 were near-duplicate columns of case_number


<br>

* href_formula was a duplicate column of href

<br>

In [15]:
sharks = sharks.drop(axis = 1, columns = ['case_number.1'])

In [16]:
sharks = sharks.drop(axis = 1, columns = ['case_number.2'])

In [17]:
sharks = sharks.drop(axis = 1, columns = ['href_formula'])

## Data Manipulation

### Cleaning country column

In [18]:
sharks['country'].value_counts(dropna=False)

USA                       2229
AUSTRALIA                 1338
SOUTH AFRICA               579
PAPUA NEW GUINEA           134
NEW ZEALAND                128
                          ... 
GREENLAND                    1
RED SEA?                     1
MEXICO                       1
ITALY / CROATIA              1
BRITISH VIRGIN ISLANDS       1
Name: country, Length: 213, dtype: int64

In [19]:
# Vectorized cleaning: drop text after "/", parenthesized notes, dots and
# question marks, then trim, lowercase and spell out "&".
# Non-string entries come out as NaN.
sharks['country'] = (
    sharks['country']
    .str.replace(r'/.+|\(.+\)|\.|\?', '', regex=True)
    .str.strip()
    .str.lower()
    .str.replace('&', 'and', regex=False)
)
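
To make the effect of the two substitutions concrete, the sketch below applies them to a few sample values through a hypothetical helper, `clean_country`. The raw spellings are illustrative and only partly taken from the output above.

```python
import re

def clean_country(value):
    """Hypothetical helper: apply the cell's two substitutions to one value."""
    if not isinstance(value, str):
        return None
    # Remove everything after "/", parenthesized text, dots and question marks.
    value = re.sub(r'/.+|\(.+\)|\.|\?', '', value)
    # Trim, lowercase and spell out the ampersand.
    return re.sub(r'&', 'and', value.strip().lower())

samples = ['ITALY / CROATIA', 'RED SEA?', 'TRINIDAD & TOBAGO',
           'ST. MARTIN', float('nan')]
cleaned = [clean_country(s) for s in samples]
print(cleaned)
```

Note that for a value like `ITALY / CROATIA` only the first country is kept.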

In [20]:
mask = sharks['country'] == 'between portugal and india'
sharks.loc[mask , :]

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex,age,injury,fatal_(y/n),time,species_,investigator_or_source,pdf,href
6170,1580.01.10.R,Letter dated 10-Jan-1580,1580.0,Unprovoked,between portugal and india,,,Man fell overboard from ship. Those on board t...,male,M,,"FATAL. ""Shark tore him to pieces.",Y,,,"G.P. Whitley, p. 10",1580.01.10.R-Portugal-India.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...


Dropping rows with a NaN country and checking for incorrectly written country names

In [21]:
sharks = sharks.dropna(subset = ['country'])

In [22]:
sharks['country'].value_counts(dropna=False)

usa                   2229
australia             1338
south africa           579
papua new guinea       134
new zealand            128
                      ... 
western samoa            1
british new guinea       1
guatemala                1
admiralty islands        1
roatan                   1
Name: country, Length: 191, dtype: int64

In [23]:
sharks['country'].unique()

array(['usa', 'australia', 'mexico', 'brazil', 'england', 'south africa',
       'thailand', 'costa rica', 'maldives', 'bahamas', 'new caledonia',
       'ecuador', 'malaysia', 'libya', 'cuba', 'mauritius', 'new zealand',
       'spain', 'samoa', 'solomon islands', 'japan', 'egypt',
       'st helena, british overseas territory', 'comoros', 'reunion',
       'french polynesia', 'united kingdom', 'united arab emirates',
       'philippines', 'indonesia', 'china', 'columbia', 'cape verde',
       'fiji', 'dominican republic', 'cayman islands', 'aruba',
       'mozambique', 'puerto rico', 'italy', 'atlantic ocean', 'greece',
       'st martin', 'france', 'papua new guinea', 'trinidad and tobago',
       'kiribati', 'israel', 'diego garcia', 'taiwan', 'jamaica',
       'palestinian territories', 'guam', 'seychelles', 'belize',
       'nigeria', 'tonga', 'scotland', 'canada', 'croatia',
       'saudi arabia', 'chile', 'antigua', 'kenya', 'russia',
       'turks and caicos', 'azores', 'south
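
The list above still contains spellings that may need manual correction. A possible approach, sketched here on toy data, is a small correction table applied with `Series.replace`. The table is hypothetical, and reading `columbia` as a misspelling of `colombia` is an assumption that should be checked against the records.

```python
import pandas as pd

# Hypothetical correction table; each entry is an assumption to verify.
corrections = {'columbia': 'colombia'}

countries = pd.Series(['usa', 'columbia', 'brazil'])

# With a dict, Series.replace matches whole values only.
fixed = countries.replace(corrections)
print(fixed.tolist())
```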

### Cleaning fatality column

In [35]:
sharks['fatal_(y/n)'].unique()

array(['N', 'Y', 'nan', 'M', 'UNKNOWN', 'y'], dtype=object)

In [27]:
sharks = sharks.drop(786)

In [30]:
# Convert every value to a string so the regex cleaning below can run on it
# (missing values become the text 'nan')
sharks['fatal_(y/n)'] = sharks['fatal_(y/n)'].apply(lambda x: str(x))

In [31]:
def cleaning_fatal(row):
    pattern = ' N|N '
    new_row = re.sub(pattern, 'N', row)
    return new_row
sharks['fatal_(y/n)'] = sharks['fatal_(y/n)'].apply(cleaning_fatal)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6251 entries, 0 to 6301
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   case_number             6250 non-null   object 
 1   date                    6251 non-null   object 
 2   year                    6249 non-null   float64
 3   type                    6247 non-null   object 
 4   country                 6251 non-null   object 
 5   area                    5831 non-null   object 
 6   location                5750 non-null   object 
 7   activity                5714 non-null   object 
 8   name                    6045 non-null   object 
 9   sex                     5690 non-null   object 
 10  age                     3461 non-null   object 
 11  injury                  6225 non-null   object 
 12  fatal_(y/n)             6251 non-null   object 
 13  time                    2940 non-null   object 
 14  species_                3451 non-null   

In [59]:
# Capitalizing the single lowercase 'y'
sharks['fatal_(y/n)'] = sharks['fatal_(y/n)'].str.replace('y', 'Y')

# Treating the single 'M' entry as 'N'
sharks['fatal_(y/n)'] = sharks['fatal_(y/n)'].str.replace('M', 'N')

# Replacing the text 'nan' (formerly missing values) with 'UNKNOWN'
sharks['fatal_(y/n)'] = sharks['fatal_(y/n)'].str.replace('nan', 'UNKNOWN')

In [60]:
sharks['fatal_(y/n)'].value_counts()

N          4283
Y          1365
UNKNOWN     603
Name: fatal_(y/n), dtype: int64
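
The same normalization can be sketched in one pass with a whole-value mapping. The toy values below mirror the kinds of entries seen in this column; mapping `M` to `N` follows this notebook's choice and is not a verified correction.

```python
import numpy as np
import pandas as pd

raw = pd.Series(['N', 'Y', ' N', 'N ', 'y', 'M', 'UNKNOWN', np.nan])

# Trim and upper-case first, then map whole values.
# Anything outside the mapping (including NaN) becomes 'UNKNOWN'.
mapping = {'N': 'N', 'Y': 'Y', 'M': 'N'}
fatal = raw.str.strip().str.upper().map(mapping).fillna('UNKNOWN')

counts = fatal.value_counts().to_dict()
print(counts)
```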

### Cleaning sex column

In [71]:
# Checking for 'M' values with a trailing space
mask_M = sharks['sex'] == 'M '
sharks.loc[mask_M , :]
# Removing the trailing space
sharks['sex'] = sharks['sex'].str.replace('M ', 'M')

array(['F', 'M', nan], dtype=object)

In [65]:
#Verifying lli info
mask = sharks['sex'] == 'lli'
sharks.loc[mask , :]

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex,age,injury,fatal_(y/n),time,species_,investigator_or_source,pdf,href


In [68]:
#Changing lli to M
sharks['sex'] = sharks['sex'].str.replace('lli', 'M')

#Verifying N
mask_N_to_M = sharks['sex'] == 'N'
sharks.loc[mask_N_to_M , :]



Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex,age,injury,fatal_(y/n),time,species_,investigator_or_source,pdf,href


In [69]:
# Changing N to M
sharks['sex'] = sharks['sex'].str.replace('N' , 'M')

# Verifying dot (.)
mask_dot = sharks['sex'] == '.'
sharks.loc[mask_dot , :]

# Dropping the dot row by mask, so the cell can be re-run without a KeyError
sharks = sharks[~mask_dot]

In [74]:
sharks['sex'] = sharks['sex'].fillna(value = 'UNKNOWN')

In [80]:
sharks['sex'].value_counts()

M          5055
F           634
UNKNOWN     561
Name: sex, dtype: int64
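
A point worth keeping in mind with the replacements in this section: `Series.str.replace` works on substrings, so re-running the `N` to `M` cell after the `UNKNOWN` fill would also alter those values. The toy sketch below contrasts it with `Series.replace`, which matches whole values.

```python
import pandas as pd

sex = pd.Series(['M', 'N', 'UNKNOWN'])

# Substring replacement rewrites every 'N', including those inside 'UNKNOWN'.
substring = sex.str.replace('N', 'M', regex=False).tolist()

# Whole-value replacement only touches entries that are exactly 'N'.
whole = sex.replace({'N': 'M'}).tolist()

print(substring)
print(whole)
```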

## Data Export
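
The notebook does not show the export step. As a rough sketch only, a cleaned frame could be written with `DataFrame.to_csv`. The toy frame below stands in for `sharks`, and an in-memory buffer stands in for a file so the example is self-contained.

```python
import io
import pandas as pd

# Toy stand-in for the cleaned DataFrame.
cleaned = pd.DataFrame({'country': ['usa', 'australia'],
                        'fatal_(y/n)': ['N', 'Y']})

# index=False leaves the row index out of the exported file.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Reading the export back confirms that columns and values survive.
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip)
```

With a real file, the buffer would be replaced by a path of your choice, for example a hypothetical `attacks_clean.csv`.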

## Data Analysis