# Shark Attack!!!

http://www.sharkattackfile.net/incidentlog.htm
https://github.com/awesomedata/awesome-public-datasets
https://www.kaggle.com/teajay/global-shark-attacks/version/1#GSAF5.csv
https://pandas.pydata.org/pandas-docs/stable/

Hypothesis ideas:
evolution of attacks by incident type, by area and by shark species
evolution of incidence
https://www.tiburonpedia.com/

[Shark GIF via GIPHY](https://giphy.com/gifs/shark-sharks-week-cCvWHbfVdn2bm)

In [20]:
# Importing pandas, NumPy and the regular expression module
import pandas as pd
import numpy as np
import re

In [2]:
#Importing data using Pandas
df = pd.read_csv('../GSAF5.csv',encoding = 'ISO-8859-1')

In [3]:
# rows in dataFrame
len(df)

5992

In [4]:
# dataFrame shape: rows and columns
display(df.shape)

(5992, 24)

In [5]:
# checking which variables are in the dataset
df.columns
# observation: some column names contain trailing white space (e.g. 'Sex ', 'Species ') --> next step: clean it

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')

In [6]:
# checking how many non-null values each variable has
df.info()
# Time, Age and Species have many missing values
# the last 2 columns (Unnamed: 22 and Unnamed: 23) are almost empty --> next step: drop them
# observation: Case Number, pdf, href, href formula, Case Number.1, Case Number.2,
# original order and Investigator or Source are not relevant for the analysis --> next step: drop them

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5992 entries, 0 to 5991
Data columns (total 24 columns):
Case Number               5992 non-null object
Date                      5992 non-null object
Year                      5992 non-null int64
Type                      5992 non-null object
Country                   5949 non-null object
Area                      5590 non-null object
Location                  5496 non-null object
Activity                  5465 non-null object
Name                      5792 non-null object
Sex                       5425 non-null object
Age                       3311 non-null object
Injury                    5965 non-null object
Fatal (Y/N)               5973 non-null object
Time                      2779 non-null object
Species                   3058 non-null object
Investigator or Source    5977 non-null object
pdf                       5992 non-null object
href formula              5991 non-null object
href                      5989 non-null object
C

In [9]:
# removing white spaces in columns name:
df = df.rename(columns=lambda x: x.strip())
df.columns

In [11]:
# removing white spaces in values:
# select columns with variables containing strings:
df_obj = df.select_dtypes(['object'])
# strip leading and trailing space: removing spaces from both left and right side of the strings
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
df.head(10)

In [8]:
# drop empty columns + variables not relevant for analysis:
df = df.drop(columns=['Case Number', 'Investigator or Source', 'pdf', 'href formula', 'href',
                      'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22', 'Unnamed: 23'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5992 entries, 0 to 5991
Data columns (total 14 columns):
Date           5992 non-null object
Year           5992 non-null int64
Type           5992 non-null object
Country        5949 non-null object
Area           5590 non-null object
Location       5496 non-null object
Activity       5465 non-null object
Name           5792 non-null object
Sex            5425 non-null object
Age            3311 non-null object
Injury         5965 non-null object
Fatal (Y/N)    5973 non-null object
Time           2779 non-null object
Species        3058 non-null object
dtypes: int64(1), object(13)
memory usage: 655.5+ KB


In [7]:
# checking data quality: the number of distinct values shows how normalized each variable is
# Type, Sex, Year, Fatal (Y/N), Age, Country, Time and Area are the most structured variables.
# With some cleaning of the free-text values, Activity and Species could also become usable
df.nunique()

Case Number               5976
Date                      5128
Year                       232
Type                         6
Country                    203
Area                       785
Location                  3929
Activity                  1492
Name                      5009
Sex                          6
Age                        151
Injury                    3595
Fatal (Y/N)                  8
Time                       357
Species                   1538
Investigator or Source    4752
pdf                       5981
href formula              5980
href                      5972
Case Number.1             5975
Case Number.2             5976
original order            5988
Unnamed: 22                  1
Unnamed: 23                  2
dtype: int64

# Summary Data flow per variable:
- column 0: row number --> drop
- Case Number ---> drop
- Date --> relevant
- Year --> relevant
- Type  --> relevant
- Country  --> relevant
- Area  --> relevant
- Location  --> relevant
- Activity --> relevant
- Name --> not relevant for analysis, but we keep it to detect and remove duplicate records
- Sex --> relevant
- Age --> relevant
- Injury --> relevant
- Fatal (Y/N) --> relevant
- Time --> relevant
- Species --> relevant 
- Investigator or Source --> drop
- pdf --> drop
- href formula --> drop
- href --> drop
- Case Number.1 --> drop
- Case Number.2 --> drop
- original order --> drop
- Unnamed: 22 --> drop
- Unnamed: 23 --> drop

In [14]:
# create norm_data with only the columns that hold normalized data
# (.copy() so that later changes do not affect df):
norm_data = df[['Type', 'Sex', 'Year', 'Date', 'Fatal (Y/N)', 'Age', 'Country', 'Time', 'Area', 'Activity', 'Species']].copy()

In [15]:
norm_data.head(10)

Unnamed: 0,Type,Sex,Year,Date,Fatal (Y/N),Age,Country,Time,Area,Activity,Species
0,Unprovoked,M,2016,18-Sep-16,N,16,USA,13h00,Florida,Surfing,
1,Unprovoked,M,2016,18-Sep-16,N,36,USA,11h00,Florida,Surfing,
2,Unprovoked,M,2016,18-Sep-16,N,43,USA,10h43,Florida,Surfing,
3,Unprovoked,M,2016,17-Sep-16,N,,AUSTRALIA,,Victoria,Surfing,
4,Unprovoked,M,2016,16-Sep-16,N,,AUSTRALIA,,Victoria,Surfing,2 m shark
5,Boat,,2016,15-Sep-16,N,,AUSTRALIA,,Western Australia,Fishing,
6,Unprovoked,M,2016,11-Sep-16,N,60s,USA,15h15,Florida,Wading,3' to 4' shark
7,Unprovoked,F,2016,07-Sep-16,N,51,USA,14h30,Hawaii,Swimming,"Tiger shark, 10?"
8,Unprovoked,M,2016,06-Sep-16,Y,50,NEW CALEDONIA,15h40,North Province,Kite surfing,
9,Unprovoked,F,2016,05-Sep-16,N,12,USA,Late afternoon,South Carolina,Boogie boarding,


In [16]:
# TYPE OF INCIDENT - exploratory analysis
df['Type'].value_counts()

Unprovoked      4386
Provoked         557
Invalid          519
Sea Disaster     220
Boat             200
Boating          110
Name: Type, dtype: int64
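
The counts above list `Boat` and `Boating` as separate categories, which look like two labels for the same kind of incident. A minimal sketch of merging them, assuming they really are equivalent (the source does not confirm this). The small `sample` frame is made up for illustration and stands in for `df`:

```python
import pandas as pd

# Illustrative sample only; the real values come from GSAF5.csv
sample = pd.DataFrame({'Type': ['Unprovoked', 'Boat', 'Boating', 'Provoked', 'Boating']})

# Assumption: 'Boating' and 'Boat' describe the same kind of incident
sample['Type'] = sample['Type'].replace({'Boating': 'Boat'})

type_counts = sample['Type'].value_counts()
print(type_counts)
```

Using an explicit mapping keeps the decision visible and easy to revert if the two labels turn out to mean different things.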

In [17]:
# SEX - exploratory analysis
df['Sex'].value_counts()


M      4837
F       585
lli       1
N         1
.         1
Name: Sex, dtype: int64
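
Besides `M` and `F`, the output above shows three stray values (`lli`, `N`, `.`), one row each. A possible cleanup, sketched on a made-up `sample` frame, is to keep only the two expected codes and treat everything else as missing. Treating `N` as missing is an assumption; it could just as well be a typo for `M`.

```python
import pandas as pd

# Illustrative sample using the values seen in the value_counts() output
sample = pd.DataFrame({'Sex': ['M', 'F', 'lli', 'N', '.', None]})

# Keep only the expected codes; any other value becomes NaN
valid_codes = ['M', 'F']
sample['Sex'] = sample['Sex'].where(sample['Sex'].isin(valid_codes))

print(sample['Sex'].value_counts(dropna=False))
```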

In [18]:
# YEAR - exploratory analysis
df['Year'].value_counts()

2015    139
2011    128
2014    125
0       124
2013    122
2008    121
2009    120
2012    117
2007    112
2006    103
2016    103
2005    103
2010    101
2000     97
1959     93
1960     93
2001     92
2004     92
2003     92
2002     88
1962     86
1961     78
1995     76
1964     66
1998     65
1999     65
1996     61
1963     61
1966     58
1997     57
       ... 
1785      1
1834      1
1791      1
1733      1
1721      1
1637      1
1617      1
77        1
5         1
1703      1
1755      1
1767      1
1771      1
1779      1
1787      1
1803      1
1749      1
1807      1
1811      1
1819      1
1805      1
1831      1
1555      1
1738      1
1859      1
1742      1
1758      1
1818      1
1822      1
1595      1
Name: Year, Length: 232, dtype: int64

In [None]:
'''Pattern for transforming a column:
def function(value):
    return new_value

df['column_name'] = df['column_name'].apply(function)
'''

In [19]:
# YEAR - Data cleaning & manipulation:
# identify years that are not in a 4-digit format (e.g. 0, 5, 77)
# and try to recover them from the Date column.
# function that extracts the first 4-digit number from a date string:
def cleaning_year(x):
    pattern = r'[0-9]{4}'
    matches = re.findall(pattern, str(x))
    # first 4-digit match as an integer, or NaN when there is none
    return int(matches[0]) if matches else np.nan

# rows whose Year does not have exactly 4 digits:
bad_year = df['Year'].astype(str).str.len() != 4
# keep valid years; replace the others with the year found in Date:
df['Year'] = df['Year'].where(~bad_year, df['Date'].apply(cleaning_year))

In [None]:
# Fatal (Y/N) - exploratory analysis
df['Fatal (Y/N)'].value_counts()
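
`nunique()` reported 8 distinct values for this column, although only `Y` and `N` are expected. The sketch below normalizes case and white space, then keeps only those two codes. The variants in `sample` are hypothetical, since the actual values are not shown in this notebook:

```python
import pandas as pd

# Hypothetical variants; check value_counts() on the real data first
sample = pd.DataFrame({'Fatal (Y/N)': ['N', 'Y', 'n', ' N', 'UNKNOWN', None]})

# Normalize white space and case, then keep only 'Y' and 'N'
fatal = sample['Fatal (Y/N)'].str.strip().str.upper()
sample['Fatal (Y/N)'] = fatal.where(fatal.isin(['Y', 'N']))

print(sample['Fatal (Y/N)'].value_counts(dropna=False))
```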

In [None]:
# AGE - exploratory analysis
df['Age'].value_counts()
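
`Age` is stored as text, and `head(10)` above shows values such as `60s` next to plain numbers. One way to get a numeric column, sketched here on a made-up `sample` frame, is to extract the first run of digits. Mapping `60s` to 60 is an approximation, and `Age_clean` is a hypothetical column name:

```python
import pandas as pd

# Illustrative sample based on values visible in head(10)
sample = pd.DataFrame({'Age': ['16', '36', '60s', None, '51']})

# Take the first run of digits; rows without digits become NaN
age_digits = sample['Age'].str.extract(r'(\d+)', expand=False)
sample['Age_clean'] = pd.to_numeric(age_digits, errors='coerce')

print(sample)
```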

In [None]:
# COUNTRY - exploratory analysis
df['Country'].value_counts()

In [None]:
# TIME - exploratory analysis
df['Time'].value_counts()
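
`head(10)` shows two kinds of `Time` values: clock times such as `13h00` and free text such as `Late afternoon`. A minimal sketch that extracts the hour from the `HHhMM` form and leaves everything else missing. `Hour` is a hypothetical column name, and the free-text values would need a separate mapping:

```python
import pandas as pd

# Illustrative sample based on values visible in head(10)
sample = pd.DataFrame({'Time': ['13h00', '10h43', 'Late afternoon', None]})

# Capture the hour part of values that look like '13h00'
hour = sample['Time'].str.extract(r'^(\d{1,2})h\d{2}$', expand=False)
sample['Hour'] = pd.to_numeric(hour, errors='coerce')

print(sample)
```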

In [None]:
# AREA - exploratory analysis
df['Area'].value_counts()

In [None]:
# ACTIVITY - exploratory analysis
df['Activity'].value_counts()

In [None]:
# SPECIES - exploratory analysis
df['Species'].value_counts()
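
`Species` mixes names with size descriptions (`Tiger shark, 10?`, `2 m shark`, `3' to 4' shark` in `head(10)`). The sketch below captures the word immediately before "shark" when it has at least three letters, which skips size units like `m`. This is a rough heuristic, not a full parser: for multi-word names it keeps only the last word. `Species_clean` is a hypothetical column name:

```python
import re

import pandas as pd

# Illustrative sample based on values visible in head(10)
sample = pd.DataFrame({'Species': ['2 m shark', "3' to 4' shark", 'Tiger shark, 10?', None]})

# Word of 3+ letters directly before 'shark'; no match gives NaN
name = sample['Species'].str.extract(r'\b([A-Za-z]{3,})\s+shark\b',
                                     flags=re.IGNORECASE, expand=False)
sample['Species_clean'] = name.str.capitalize()

print(sample)
```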

In [None]:
# Data cleaning & manipulation: (2/8 methods)

In [None]:
# Data cleaning & manipulation: (3/8 methods)

In [None]:
# Data cleaning & manipulation: (4/8 methods)

In [None]:
# Data cleaning & manipulation: (5/8 methods)

In [None]:
# Data cleaning & manipulation: (6/8 methods)

In [None]:
# Data cleaning & manipulation: (7/8 methods)
# These snippets use `poem`, `blacklist` and `data`, which are not defined in this notebook
# and must be supplied before running.
# First remove punctuation, then replace newlines with spaces and convert to lower case:
s = re.sub(r'[^\w\s]', '', poem).replace('\n', ' ').lower()
print(s)
# then split into a list:
list_of_strings = s.split(" ")
print(list_of_strings)

# Same steps in one expression; split() without arguments also collapses repeated spaces:
words = re.sub(r'[^\w\s]', '', poem).replace('\n', ' ').lower().split()
print(words)

# Then remove blacklisted words to get a clean list:
clean = [w for w in words if w not in blacklist]
print(clean)

# Then remove duplicates from the clean list to keep unique items:
unique = set(clean)
print(unique)

# Compare the lengths to see how much each step reduced the data:
print('poem len was', len(poem), 'VS words list len was', len(words),
      'VS clean list len was', len(clean), 'VS unique len is now', len(unique))

# Keep only the items of data that contain at least one digit:
pattern = r"\w*\d+"
filtered_data = []
for i in data:
    match = re.search(pattern, i)
    if match:
        print(match)
        filtered_data.append(i)
    else:
        print(i, 'NOT MATCH')
print(filtered_data)

In [None]:
# Data cleaning & manipulation: (8/8 methods)
# Last step: removing duplicates
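
The summary above keeps `Name` so that duplicate records can be detected. A minimal sketch of that step with `drop_duplicates`, on a made-up `sample` frame. Which columns identify one incident is an assumption here (`Date`, `Name`, `Country`), and the names are placeholders:

```python
import pandas as pd

# Illustrative sample; the names are placeholders, not real records
sample = pd.DataFrame({
    'Date': ['18-Sep-16', '18-Sep-16', '17-Sep-16'],
    'Name': ['person A', 'person A', 'person B'],
    'Country': ['USA', 'USA', 'AUSTRALIA'],
})

# Assumption: same Date + Name + Country means the same incident
key_columns = ['Date', 'Name', 'Country']
deduped = sample.drop_duplicates(subset=key_columns, keep='first').reset_index(drop=True)

print(len(sample), '->', len(deduped))
```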

In [None]:
# Exploratory analysis - with cleaned Data
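
One of the hypothesis ideas at the top is the evolution of attacks by incident type. A possible starting point, sketched on a made-up `sample` frame, is a year-by-type table built with `pd.crosstab`. The counts below are illustrative, not results:

```python
import pandas as pd

# Illustrative sample; not real counts
sample = pd.DataFrame({
    'Year': [2015, 2015, 2016, 2016, 2016],
    'Type': ['Unprovoked', 'Provoked', 'Unprovoked', 'Unprovoked', 'Boat'],
})

# Rows are years, columns are incident types, cells are incident counts
by_year_type = pd.crosstab(sample['Year'], sample['Type'])

print(by_year_type)
```

The same table can be built by area or species by swapping the second argument.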

In [None]:
# Exporting clean data in CSV using Pandas
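
A minimal sketch of the export step with `DataFrame.to_csv`. The file name `GSAF5_clean.csv` and the temporary directory are assumptions for illustration. `index=False` avoids writing the row index, which would otherwise come back as an extra `Unnamed: 0` column when the file is read again:

```python
import os
import tempfile

import pandas as pd

# Illustrative sample standing in for the cleaned DataFrame
sample = pd.DataFrame({
    'Year': [2016, 2016],
    'Type': ['Unprovoked', 'Boat'],
    'Country': ['USA', 'AUSTRALIA'],
})

# Hypothetical output location and file name
out_path = os.path.join(tempfile.mkdtemp(), 'GSAF5_clean.csv')
sample.to_csv(out_path, index=False, encoding='utf-8')

# Read the file back to confirm the export round-trips
check = pd.read_csv(out_path)
print(check)
```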

README.md file containing a detailed explanation of the process followed in the importing, cleaning, manipulation, 
and exporting of your data as well as your results, obstacles encountered, and lessons learned