# Shark Attack Database Mini Project

## Objective
<br>

* Organize and clean a CSV data file.

* Detail and explain all Python code and commands used for importing, cleaning, manipulating, exporting and analyzing the data.

<br>

database version: 7.1

# Starting the code

## Importing libraries

In [2]:
import pandas as pd
import numpy as np
import re

## Importing database

Import the CSV database using Latin-1 encoding.

In [3]:
# Importing csv database as "sharks"
sharks = pd.read_csv('GSAF5.csv', sep = ',', encoding='latin-1')

# Creating a backup copy
sharks_bkp = sharks.copy()
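
As a minimal sketch of why the `encoding` argument matters, the example below reads a tiny in-memory CSV whose bytes are Latin-1 encoded. The sample row and the `raw` variable are hypothetical and only stand in for `GSAF5.csv`.

```python
import io
import pandas as pd

# Hypothetical one-row CSV, encoded as Latin-1 (stands in for GSAF5.csv)
raw = 'Country,Area\nNEW CALEDONIA,Nouméa\n'.encode('latin-1')

# Decoding with the matching encoding recovers the accented character
df = pd.read_csv(io.BytesIO(raw), encoding='latin-1')
print(df.loc[0, 'Area'])

# The same bytes are not valid UTF-8, so a UTF-8 decode fails
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

If a file mixes encodings, Latin-1 never raises a decode error because every byte maps to a character, so a visual check of a few accented values is still worthwhile.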

## Declaring functions

In [4]:
def standardize_headers(df, func=None):
    '''
    Clean the column names of df in place: replace spaces with
    underscores and convert all characters to lowercase.
    If func is given, it is applied to the DataFrame with df.apply(func)
    and that result is returned instead.
    '''
    df.columns = df.columns.str.replace(' ', '_').str.lower()
    if func:
        df = df.apply(func)
    return df
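
A short usage sketch of the function on a hypothetical three-column frame. The function body is repeated so the example runs on its own, and the header names are assumptions chosen to resemble the raw file.

```python
import pandas as pd

def standardize_headers(df, func=None):
    # Same logic as the function above, repeated so the sketch is self-contained
    df.columns = df.columns.str.replace(' ', '_').str.lower()
    if func:
        df = df.apply(func)
    return df

# Hypothetical raw headers, including one with a trailing space
demo = pd.DataFrame([[1, 'M', 'N']], columns=['Case Number', 'Sex ', 'Fatal (Y/N)'])

# The column names are changed in place, so the return value can be ignored
standardize_headers(demo)
print(list(demo.columns))  # ['case_number', 'sex_', 'fatal_(y/n)']
```

Note that a trailing space becomes a trailing underscore and that parentheses and slashes are kept.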

## Data Cleaning

The database is cleaned step by step to improve data quality and make later analysis easier. Incorrect or unusable information is removed so that only reliable data remains.

In [5]:
# Dataframe shape
sharks.shape

(5992, 24)

### Cleaning Columns
<br>

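* Replace whitespace with underscores

<br>

* Special characters (note: `standardize_headers` does not remove them, so names such as `fatal_(y/n)` keep theirs)

<br>

* Convert all characters to lowercase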

<br>


In [6]:
standardize_headers(sharks)
sharks.head()

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex_,...,species_,investigator_or_source,pdf,href_formula,href,case_number.1,case_number.2,original_order,unnamed:_22,unnamed:_23
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.c,2016.09.18.c,5993,,
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.b,2016.09.18.b,5992,,
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.a,2016.09.18.a,5991,,
3,2016.09.17,17-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Surfing,Rory Angiolella,M,...,,"The Age, 9/18/2016",2016.09.17-Angiolella.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.17,2016.09.17,5990,,
4,2016.09.15,16-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Bells Beach,Surfing,male,M,...,2 m shark,"The Age, 9/16/2016",2016.09.16-BellsBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.16,2016.09.15,5989,,


### Checking Null and NaN

In [7]:
# Searching for NaN values
sharks.isna().sum()

case_number                  0
date                         0
year                         0
type                         0
country                     43
area                       402
location                   496
activity                   527
name                       200
sex_                       567
age                       2681
injury                      27
fatal_(y/n)                 19
time                      3213
species_                  2934
investigator_or_source      15
pdf                          0
href_formula                 1
href                         3
case_number.1                0
case_number.2                0
original_order               0
unnamed:_22               5991
unnamed:_23               5990
dtype: int64

In [8]:
# isnull() is an alias of isna(), so this returns the same counts as the previous cell
sharks.isnull().sum()

case_number                  0
date                         0
year                         0
type                         0
country                     43
area                       402
location                   496
activity                   527
name                       200
sex_                       567
age                       2681
injury                      27
fatal_(y/n)                 19
time                      3213
species_                  2934
investigator_or_source      15
pdf                          0
href_formula                 1
href                         3
case_number.1                0
case_number.2                0
original_order               0
unnamed:_22               5991
unnamed:_23               5990
dtype: int64
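
The identical outputs of the two cells above can be checked on a small scale. The sketch below compares both methods on a hypothetical frame; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values written as np.nan and as None
demo = pd.DataFrame({'age': [25, np.nan, 40], 'time': [None, '14h00', None]})

# Both methods flag exactly the same cells
same = demo.isna().equals(demo.isnull())
print(same)

# Missing values per column
counts = demo.isna().sum().to_dict()
print(counts)
```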

In [9]:
# Searching for duplicate rows
sharks.duplicated().sum()

0

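### Dropping columns

#### Dropping two NaN columns

<br>

Both columns are almost entirely NaN (5991 and 5990 of 5992 rows), so they hold no useful data. Dropping them also reduces the size of the dataframe.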

In [10]:
sharks = sharks.drop(columns=['unnamed:_22', 'unnamed:_23'])
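
Instead of naming such columns by hand, they can also be found from their share of missing values. This is only a sketch on a hypothetical frame, and the 75% cutoff is an assumption, not a value used in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one useful column and two that are mostly or fully empty
demo = pd.DataFrame({
    'year': [2016, 2015, 2014, 2013],
    'unnamed:_22': [np.nan, np.nan, np.nan, 'stopped here'],
    'unnamed:_23': [np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per column (0.0 to 1.0)
nan_share = demo.isna().mean()

# Assumed cutoff: drop columns with at least 75% missing values
mostly_empty = nan_share[nan_share >= 0.75].index.tolist()
demo = demo.drop(columns=mostly_empty)

print(mostly_empty)
print(list(demo.columns))
```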

#### Drop original_order column

<br>

This column held the original index and is no longer useful.

In [11]:
sharks = sharks.drop(columns=['original_order'])

In [36]:
sharks_case = sharks['case_number'] == sharks['case_number.1']
sharks_case.value_counts()

True     5979
False      13
dtype: int64

In [13]:
sharks_case = sharks['case_number.1'] == sharks['case_number.2']
sharks_case.value_counts()

True     5981
False      11
dtype: int64

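In [29]:
# Compare href_formula with href
sharks_href = sharks['href_formula'] == sharks['href']
sharks_href.value_counts()

(Output not recorded: the original run printed `sharks_case` again, which repeated the counts of the previous cell.)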
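
One caveat with these element-wise comparisons: two missing values never compare as equal, and `href_formula` and `href` contain 1 and 3 missing values respectively. The sketch below, on two hypothetical Series, shows the effect and one possible NaN-aware variant.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: equal in row 0, both missing in row 1, different in row 2
a = pd.Series(['u1', np.nan, 'u3'])
b = pd.Series(['u1', np.nan, 'u9'])

# Plain comparison: NaN == NaN evaluates to False
naive = a == b

# NaN-aware comparison: also count rows where both values are missing
nan_aware = naive | (a.isna() & b.isna())

print(naive.tolist())      # [True, False, False]
print(nan_aware.tolist())  # [True, True, False]
```

Whether two missing values should count as a match depends on the question being asked; for spotting duplicate columns the NaN-aware form is usually closer to the intent.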

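### Dropping duplicate columns

<br>

* `case_number.1` and `case_number.2` are near-duplicates of `case_number`: `case_number` matches `case_number.1` in 5979 of 5992 rows, and `case_number.1` matches `case_number.2` in 5981 rows


<br>

* `href_formula` is treated as a duplicate of `href`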

<br>

In [37]:
sharks = sharks.drop(columns=['case_number.1'])

In [23]:
sharks = sharks.drop(columns=['case_number.2'])

In [30]:
sharks = sharks.drop(columns=['href_formula'])

## Data Manipulation

### Creating a column of months

In [58]:
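temp_lst = []
for row in sharks['date']:
    # Collect every '-Mon-' fragment (e.g. '-Sep-'), join them and lowercase
    temp_row = ''.join(re.findall(r'-[A-Za-z]{3}-', row)).lower()
    # Remove the hyphens, leaving only the month abbreviation(s)
    temp_row = temp_row.replace('-', '')

    # Dates without a recognizable month become NaN
    if temp_row == '':
        temp_row = np.nan

    temp_lst.append(temp_row)
sharks['month'] = temp_lst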
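
The same extraction can be sketched without an explicit loop using `Series.str.extract`. This is an alternative, not the method used in this notebook: it keeps only the first match per row, so it would not reproduce doubled values such as those produced by joining several matches. The sample dates are hypothetical.

```python
import pandas as pd

# Hypothetical date strings in the style of the 'date' column
dates = pd.Series(['18-Sep-16', 'Reported 08-Jan-2016', '1995'])

# Capture the three letters between two hyphens; rows without a match become NaN
month = dates.str.extract(r'-([A-Za-z]{3})-', expand=False).str.lower()

print(month.tolist())
```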

In [94]:
sharks['month'].value_counts(dropna=False)

NaN       870
jul       590
aug       537
sep       491
jan       476
jun       453
dec       397
oct       385
apr       375
mar       367
nov       365
may       344
feb       335
marmar      2
janjan      1
jut         1
novnov      1
augaug      1
sepsep      1
Name: month, dtype: int64

Dropping the rows without a month and correcting the misspelled month names.

In [97]:
sharks = sharks.dropna(subset = ['month'])

In [99]:
# Keep only the first three letters ('marmar' -> 'mar'),
# then fix 'jut', which is assumed to be a typo for 'jun'
sharks['month'] = sharks['month'].str[:3].replace('jut', 'jun')
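
To see the effect of this correction on a small scale, here is a minimal sketch on a hypothetical Series, using vectorized string slicing followed by a single replacement. Mapping `'jut'` to `'jun'` follows the choice made in this notebook and is an assumption about the intended month.

```python
import pandas as pd

# Hypothetical month values, including a doubled abbreviation and a typo
months = pd.Series(['mar', 'marmar', 'jut', 'sep'])

# Truncate to three letters, then replace the one known typo
fixed = months.str[:3].replace('jut', 'jun')

print(fixed.tolist())  # ['mar', 'mar', 'jun', 'sep']
```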

In [100]:
sharks['month'].value_counts(dropna=False)

jul    590
aug    538
sep    492
jan    477
jun    454
dec    397
oct    385
apr    375
mar    369
nov    366
may    344
feb    335
Name: month, dtype: int64

## Data Export

## Data Analysis