In [1]:
!kaggle datasets download -d teajay/global-shark-attacks

Downloading global-shark-attacks.zip to C:\Users\juanp\Ironhack\proyectos\pandas-project




100%|##########| 548k/548k [00:00<00:00, 5.39MB/s]


In [2]:
!tar -xzvf global-shark-attacks.zip

x attacks.csv


In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import collections

In [4]:
data = pd.read_csv("data\\data_year.csv", encoding='cp1252')

## 1. Basic Analysis

In [33]:
data.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,...,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,...,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0,,


In [50]:
data.shape

(25723, 24)

In [35]:
data.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')

The dataset contains several columns that won't be relevant for the analysis, such as "pdf" or "href formula", and others whose meaning cannot be interpreted easily, such as "Unnamed: 22" or "original order". These columns can be dropped. Also, the columns "Case Number" and "Date" seem to contain the same date information in different formats, so we can delete one of them and work with the other.

In [4]:
data = data.drop(["Date","Name","Investigator or Source","pdf","href formula","href","Case Number.1","Case Number.2",
                 "original order","Unnamed: 22", "Unnamed: 23"], axis=1)

### NaN values

In [81]:
data.isnull().sum()

Case Number    17021
Year           19423
Type           19425
Country        19471
Area           19876
Location       19961
Activity       19965
Sex            19986
Age            22252
Injury         19449
Fatal (Y/N)    19960
Time           22775
Species        22259
dtype: int64

A large number of rows contain many NaN values. One way to deal with them would be to drop every row that is 100% NaN; however, there are rows with many NaN values that do not reach 100%. To keep as much information as possible, the **thresh** parameter of the pandas **dropna()** method will be used. It sets the minimum number of non-NaN values a row must have in order to be kept. In this case the parameter will be set to 6, so with 13 columns a row is kept only if it has at most 7 NaN values.
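To make the behaviour of **thresh** concrete, here is a minimal sketch on a made-up three-column frame (the `toy` data is hypothetical and not part of the shark dataset). It illustrates that `thresh` counts the non-NaN values a row needs in order to survive.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 3 columns (not the shark dataset)
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, 7.0],
})

# thresh=2 keeps only the rows that have at least 2 non-NaN values:
# row 0 has 3, row 1 has 2, row 2 has 1 (so row 2 is dropped)
kept = toy.dropna(axis=0, thresh=2)
print(kept.index.tolist())
```

Raising `thresh` makes the filter stricter, and lowering it keeps sparser rows.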

In [5]:
data = data.dropna(axis=0, thresh=6)

In [170]:
data.shape

(6301, 13)

With this, we have deleted around 19,400 rows of the dataset, about three quarters of the total.

### How to treat the remaining NaN values?

There are different ways to fill the NaN values in a dataset, and the right one depends on factors such as the type of the column and the kind of analysis. In this case, two approaches are going to be taken. NaN values in string columns are going to be replaced by **Unknown**, as this is something that should not be guessed or inferred. NaN values in date-type columns, even though they have not yet been converted, are going to be replaced with the previous value, because it can be assumed that the records follow a certain order.

#### When trying to replace the NaN values of categorical columns, there is an error related to the **Species** column. That is because its name has a white space at the end. This could be solved later, but in order to have all the columns in the same format, let's do it now

In [104]:
data.rename(columns={"Species ":"Species"}, inplace=True)

In [102]:
data.rename(columns={"Sex ":"Sex"}, inplace=True)

In [105]:
data.rename(columns={"Fatal (Y/N)":"Fatal"},inplace=True)

In [107]:
data.rename(columns={"Case Number":"Case_Number"},inplace=True)
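As an alternative to renaming the affected columns one by one, the trailing spaces could be stripped from every column name in a single step. The sketch below uses a small hypothetical frame (`toy`) that reproduces the problem; it is not what the notebook itself ran.

```python
import pandas as pd

# Hypothetical frame reproducing the trailing-space problem in the column names
toy = pd.DataFrame({
    "Sex ": ["F", "M"],
    "Species ": ["White shark", None],
    "Age": [11, 48],
})

# Strip leading/trailing whitespace from every column name at once
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```

This only removes whitespace; renames that change the name itself, such as "Fatal (Y/N)" to "Fatal", still need an explicit `rename`.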

In [7]:
# Replace NaN with "Unknown" in the categorical (object) columns.
# The column names are the ones set by the rename() calls above.
categorical = ["Type", "Country", "Area", "Activity", "Injury", "Sex", "Fatal", "Species"]
data[categorical] = data[categorical].fillna("Unknown")

In [8]:
# For 'Case_Number', 'Year' and 'Time' we fill NaN values with the previous one, as we assume that they
# follow a certain order, and doing this is better than filling them with 'NONE'

data[["Case_Number", "Time", "Year"]] = data[["Case_Number", "Time", "Year"]].ffill()
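The two filling strategies described above can be compared on a small made-up frame. This is only an illustrative sketch; the `toy` data is hypothetical. It also shows one caveat of forward filling: a NaN in the very first row has no previous value and therefore stays NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the shark dataset)
toy = pd.DataFrame({
    "Country": ["USA", np.nan, "MEXICO"],
    "Year": [np.nan, 2018.0, np.nan],
})

# Categorical column: replace missing values with a fixed label
toy["Country"] = toy["Country"].fillna("Unknown")

# Ordered column: propagate the previous valid value downwards
toy["Year"] = toy["Year"].ffill()

print(toy)
```

If a leading NaN survives the forward fill, a later integer conversion with `astype` would fail, so it can be worth checking `isnull().sum()` again after this step.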

## 2. Column Analysis

Now, let's check each of the columns to see what corrections need to be made

### **Year**

In [9]:
def sorting(data):
    return data.sort_values(ascending = True)

In [10]:
def counts(column):
    '''
    This function returns the value_counts() result of a column as a dictionary sorted by key in ascending order.
    Easier to examine.
    Args: the column of interest in the format of df.column or df["column"]
    '''
    s = (column.value_counts())
    dictionary =  s.to_dict()
    return collections.OrderedDict(sorted(dictionary.items()))

In [100]:
year = counts(data.Year)
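For illustration, this is how a helper like `counts` behaves on a short made-up series of years. The function is repeated here so that the sketch is self-contained, and the `years` values are hypothetical. The last line shows a shorter pandas-only formulation that should give the same mapping.

```python
import collections
import pandas as pd

def counts(column):
    # Same logic as the notebook helper: value counts sorted by key
    s = column.value_counts()
    return collections.OrderedDict(sorted(s.to_dict().items()))

# Hypothetical sample of years
years = pd.Series([2018, 2017, 2018, 0, 2017, 2018])

result = counts(years)
print(dict(result))

# Alternative: let pandas sort by the index (the year) directly
alt = years.value_counts().sort_index().to_dict()
print(alt)
```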

The first thing that should be noticed is that there are 125 attacks in year 0, which makes no sense. Also, there are records of attacks taking place in years that are implausibly old. These may be typing errors, so, instead of deleting all those records and losing all the information, a threshold will be set when analyzing the **Year** column to avoid taking these atypical years into account.
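A threshold of this kind can be applied as a filter at analysis time, leaving the stored data untouched. The sketch below is an assumption about how that might look: the cutoff `MIN_YEAR` and the `toy` frame are hypothetical, since the text does not state which value is used.

```python
import pandas as pd

# Hypothetical cutoff; the actual threshold is not stated in the text
MIN_YEAR = 1900

# Hypothetical toy frame (not the shark dataset)
toy = pd.DataFrame({
    "Year": [0, 1543, 1975, 2018],
    "Country": ["USA", "Unknown", "AUSTRALIA", "USA"],
})

# Filter only for the Year analysis; the original frame keeps all rows
recent = toy[toy["Year"] >= MIN_YEAR]
print(recent)
```

Because `toy` itself is not modified, the rows with atypical years remain available for analyses that do not depend on the year.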

It should also be mentioned that the years are stored as decimals (floats), which makes no sense for a year. Therefore, let's convert them to integers.

In [11]:
data.Year = data.Year.astype('int32')

In [16]:
# The df has been saved as a csv so that the progress is not lost if Jupyter Notebook is closed
data.to_csv(r'C:\Users\juanp\Ironhack\proyectos\pandas-project\data\data_year.csv' , index = True)

## Plots and Visualizations

In [7]:
from bokeh.plotting import figure, show

In [5]:
data = pd.read_csv('data\\data_regex.csv')

In [9]:
data.columns

Index(['Case_Number', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Sex', 'Age', 'Injury', 'Fatal', 'Time', 'Species',
       'fatal_injury', 'shark_species', 'month_incident'],
      dtype='object')

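In [32]:
# Count once so that labels and bar heights share the same order:
# unique() returns species in order of appearance, while value_counts()
# sorts by frequency, so building the two lists separately misaligns them.
species_counts = data.shark_species.value_counts()

In [23]:
# Assumption: the intent was to leave the 'unknown' label out of the plot
species_counts = species_counts.drop('unknown', errors='ignore')

In [52]:
sharks = list(species_counts.index)
shark_count = species_counts.tolist()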

In [41]:
p = figure(x_range=sharks, height=250, title="Shark counts",
           toolbar_location=None, tools="")

In [63]:
p.vbar(x=sharks, top=shark_count, width=0.2)

In [64]:
p.xgrid.grid_line_color = None
p.y_range.start = 0

In [65]:
show(p)