# Quest 2 : Shark Attacks


### Dream Team:

- <em>Karina</em>
- <em>Pollyana</em>
- <em>Salim</em>
- <em>Jules</em>

# Introduction

As an insurance company, we want to analyze shark attack statistics to assess incidence rates across various features and factors.

This analysis can help determine premium rates based on the likelihood of shark-related incidents.



Business goal



Develop a fair method to determine insurance rates for shark attack coverage using data from shark attack incidents.

Hypotheses

Certain types of activity make a shark attack more likely.

Shark attacks occur more frequently in certain countries.

The severity of injuries during a shark attack is positively correlated with higher insurance rates.

## <font color='DarkBlue'>I. <ins>Prerequisites</ins>: <font color='blue'></font>

### <font color='MediumBlue'>1 - <ins>Identifying the dataset source</ins>: <font color='violet'></font>

<ins><strong>Data Source: </strong></ins>

In [1]:
sharks = "GSAF5.xls"

In [2]:
source = sharks

### <font color='MediumBlue'>2 - <ins> Importing libraries</ins>: <font color='violet'></font>

<strong>pandas, numpy, datetime</strong>

In [3]:
import pandas as pd
import numpy as np
import datetime as dt

### <font color='MediumBlue'>3 - <ins>  Loading the dataset into a DataFrame</ins>: <font color='violet'></font>

In [4]:
df = pd.read_excel(source)

##  <font color='DarkBlue'>II. <ins>Exploring the Dataset</ins>: <font color='blue'></font>

### <font color='MediumBlue'>1 - <ins> Dataset Overview</ins>: <font color='violet'></font>

#### <font color='CornflowerBlue'>a) Displaying number of rows and number of columns: </font>

In [5]:
df.shape

(6947, 23)

#### <font color='CornflowerBlue'>b) Glancing at the dataset: </font>

<ins><strong>Displaying the first rows : </strong></ins>

In [6]:
df.head(5)

Unnamed: 0,Date,Year,Type,Country,State,Location,Activity,Name,Sex,Age,...,Species,Source,pdf,href formula,href,Case Number,Case Number.1,original order,Unnamed: 21,Unnamed: 22
0,08 Dec-2023,2023.0,Unprovoked,AUSTRALIA,Queensland,1770,Swimming,malle,20s,,...,,"B. Myatt, GSAF",,,,,,,,
1,04 Dec-2023,2023.0,Unprovoked,BAHAMAS,New Providence Isoad,Sandals Resort,Paddle boarding,Lauren Erickson Van Wart,F,44.0,...,,"NBC News, 12/4/2023",,,,,,,,
2,02 Dec-2023,2023.0,Unprovoked,MEXICO,Jalisco,San Patricio Melaque,Swimming,Maria Fernandez Martinez Jimenez,F,26.0,...,,"News Channel 21, 12/3.2023",,,,,,,,
3,30 Nov-2023,2023.0,Unprovoked,AUSTRALIA,Queensland,Clack Island,Swimming,Matthew Davitt,M,21.0,...,1.8m bull shark,"ABC Net, 11/30/2023",,,,,,,,
4,21 Nov-2023,2023.0,Unprovoked,BAHAMAS,Grand Bahama Island,Tiger Beach,Scuba diving,female,F,47.0,...,,"Eye Witness News, 11/22/2023",,,,,,,,


<ins><strong>Displaying column names and types : </strong></ins>

In [9]:
df.dtypes

Date               object
Year              float64
Type               object
Country            object
State              object
Location           object
Activity           object
Name               object
Sex                object
Age                object
Injury             object
Unnamed: 11        object
Time               object
Species            object
Source             object
pdf                object
href formula       object
href               object
Case Number        object
Case Number.1      object
original order    float64
Unnamed: 21        object
Unnamed: 22        object
dtype: object

<ins><strong>Displaying number of unique values for each column : </strong></ins>

In [10]:
df.nunique()

Date              5983
Year               258
Type                11
Country            224
State              896
Location          4495
Activity          1585
Name              5669
Sex                  9
Age                243
Injury            4071
Unnamed: 11         12
Time               409
Species           1671
Source            5284
pdf               6789
href formula      6785
href              6776
Case Number       6777
Case Number.1     6775
original order    6797
Unnamed: 21          1
Unnamed: 22          2
dtype: int64

<ins><strong>Displaying number of unique values for each column that has at most 10 distinct values : </strong></ins>

In [11]:
df.nunique()[lambda x: x <= 10]

Sex            9
Unnamed: 21    1
Unnamed: 22    2
dtype: int64

#### <font color='CornflowerBlue'>c) Displaying unique values for each column that has at most 10 distinct values:</font>

In [18]:
# Compute the per-column distinct counts once, then keep columns with at most 10
nunique = df.nunique()
dico = {col: n for col, n in nunique.items() if n <= 10}

for col in dico:
    print(col, ":", df[col].unique())

Sex : ['20s' 'F' 'M' nan ' M' 'M ' 'lli' 'M x 2' 'N' '.']
Unnamed: 21 : [nan 'stopped here']
Unnamed: 22 : [nan 'Teramo' 'change filename']
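
The output above shows that `Sex` mixes valid codes with stray whitespace and malformed entries. One possible cleanup is sketched below: strip whitespace and keep only `M` and `F`, treating everything else as missing. The helper name `clean_sex` and the decision to discard values such as `M x 2` or `N` are assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd

def clean_sex(series: pd.Series) -> pd.Series:
    """Strip whitespace and keep only 'M' / 'F'; anything else becomes missing."""
    stripped = series.astype("string").str.strip()
    return stripped.where(stripped.isin(["M", "F"]))

# Sample built from the unique values printed above
raw = pd.Series(["20s", "F", "M", None, " M", "M ", "lli", "M x 2", "N", "."])
cleaned = clean_sex(raw)
print(cleaned.value_counts(dropna=False))
```

On the real data this would be applied as `df["Sex"] = clean_sex(df["Sex"])`; whether ambiguous entries should be dropped or recoded by hand remains a judgment call.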


### <font color='MediumBlue'>2 - <ins> Identifying numerical variables and their specifications</ins>: <font color='violet'></font>

<ins><strong><font color='BlueViolet'>Numerical</font></strong> **variables specifications**:</ins>

From the data types output it is safe to assume that the following columns are numerical variables: 
- **Year**: <em><font color='DarkMagenta'> float64</font></em>
- **original order**: <em><font color='DarkMagenta'> float64</font></em>

However, the remaining columns should not necessarily all be treated as categorical variables.
Indeed, based on their names and values, the following columns are numerical in nature:

- **Age**:<font color='red'> should be</font> <em><font color='DarkMagenta'>int</font></em>
- **Date**:<font color='red'> should be</font><em><font color='DarkMagenta'> Date</font></em>
- **Time**:<font color='red'> should be</font><em><font color='DarkMagenta'> Date</font></em>



Additionally, the following actions could help fix some of the data discrepancies:
- Modify "Date" type and format
- Change "Year" type to int
- Deal with missing or malformed values, for example in "Sex"
- Address outliers (?)
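
The first two actions in the list above could look roughly like the sketch below. It assumes the `"%d %b-%Y"` format seen in the first rows (for example `08 Dec-2023`); the real column has thousands of distinct date strings, so many will likely not match and end up as `NaT`. The sample frame and the `"not a date"` entry are hypothetical.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the dataset
sample = pd.DataFrame({
    "Date": ["08 Dec-2023", "04 Dec-2023", "not a date"],
    "Year": [2023.0, 2023.0, None],
})

# Entries that do not match the assumed format become NaT instead of raising
sample["Date"] = pd.to_datetime(sample["Date"], format="%d %b-%Y", errors="coerce")

# Nullable integer dtype keeps missing years as <NA> rather than forcing floats
sample["Year"] = sample["Year"].astype("Int64")

print(sample.dtypes)
```

Using `errors="coerce"` trades silent data loss for robustness, so it is worth counting the resulting `NaT` values before relying on the parsed column.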

In [None]:
# Renaming variables: lower-case everything and replace spaces with underscores via the fix_col_names function

def fix_col_names(df):
    df.columns = df.columns.str.strip().str.lower().str.replace(r'\s+','_',regex=True)
    return df

fix_col_names(df)

# example df.rename(columns= {'foo': 'bar'}, inplace=True)



In [None]:
# Changing Age type

# example df['age'] = df['age'].apply(float)
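
Because `Age` is stored as `object`, a plain `apply(float)` as in the commented example above would raise on any non-numeric entry. A more tolerant alternative, sketched here, is `pd.to_numeric` with `errors="coerce"`. The sample values, including `"teen"`, are hypothetical stand-ins, since the actual non-numeric entries of the column are not shown in this notebook.

```python
import pandas as pd

# Hypothetical sample: a mix of numbers, numeric strings, text and missing values
age_raw = pd.Series([44, "26", "teen", None], dtype="object")

# Non-numeric entries become NaN; the result is a float column
age = pd.to_numeric(age_raw, errors="coerce")

print(age)
```

After `fix_col_names` has been applied, the same call would target `df['age']`.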

#### <font color='CornflowerBlue'>a) Continuous variables: </font>

In [None]:
cont_var = []


#### <font color='CornflowerBlue'>b) Discrete variables: </font>

In [None]:
disc_var = []     # None for this dataset

#### <font color='CornflowerBlue'>c) All numerical variables: </font>

In [None]:
num_var = disc_var + cont_var
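
The lists above are still empty. As a starting point for filling them, the columns pandas already treats as numeric can be listed with `select_dtypes`; this is only a sketch, and the sample values for `original order` are hypothetical. Columns such as `Age` would appear only after being converted to a numeric dtype.

```python
import pandas as pd

# Hypothetical sample with two numeric columns and one text column
sample = pd.DataFrame({
    "Year": [2023.0, 2023.0],
    "Country": ["AUSTRALIA", "BAHAMAS"],
    "original order": [6797.0, 6796.0],
})

# Columns whose dtype is numeric are candidates for num_var
num_candidates = sample.select_dtypes(include="number").columns.tolist()
print(num_candidates)
```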

### <font color='MediumBlue'>3 - <ins> Identifying categorical variables and their specifications</ins>: <font color='violet'></font>

<strong><font color='BlueViolet'>Categorical</font></strong> **variables specification**:

- **xxx**: <ins><em><font color='DarkMagenta'>Nominal</font></em></ins>. There is no obvious order for that variable.
- **xx**: <ins><em><font color='DarkMagenta'>Nominal</font></em></ins>. There is no obvious order for that variable.
- **xxx**: <ins><em><font color='DarkMagenta'>Nominal</font></em></ins>. This is a boolean-like variable with no particular order.
- **xxx**: <ins><em><font color='DarkMagenta'>Nominal</font></em></ins>. This is a boolean-like variable with no particular order.
- **xxx**: <ins><em><font color='DarkMagenta'>Ordinal</font></em></ins>. Clearly, we can order the level of education, for example by ascending order.



####  <font color='CornflowerBlue'>a) Nominal variables: </font>

In [None]:
nom_var = []

#### <font color='CornflowerBlue'>b) Ordinal variables:</font>

In [None]:
ord_var = []

####  <font color='CornflowerBlue'>c) All categorical variables:</font>

In [None]:
cat_var = nom_var + ord_var

## <font color='DarkBlue'>III. <ins>Analysing Descriptive Statistics</ins>: <font color='blue'></font>

### <font color='MediumBlue'>1 - <ins> Analysing numerical variables</ins>: <font color='violet'></font>

#### <font color='CornflowerBlue'>a) Measuring Central Tendencies: <font color='violet'></font> 

#### i) <font color='ForestGreen'>Mean</font>

In [None]:
df[num_var].mean().round(2)

#### ii)  <font color='ForestGreen'>Median</font>

In [None]:
df[num_var].median()

#### iii)  <font color='ForestGreen'>Mode</font>

In [None]:
df[num_var].mode()

**Findings on central tendencies**

Based on our calculations and interpretations of central measures, we can make the following observations:

 

#### <font color='CornflowerBlue'>b) Measuring Dispersion:</font> 

#### i) <font color='ForestGreen'>Standard Deviation</font>

In [None]:
df[num_var].std().round(2)

* **xxxx**: The Standard Deviation blablabla



#### ii) <font color='ForestGreen'>Range</font>

In [None]:
df[num_var].max() - df[num_var].min()

#### <font color='CornflowerBlue'>c)  Summarizing Statistics:</font> 

#### <font color='ForestGreen'> i) Statistics Summary: count, mean, standard deviation, min, quartiles, maximum</font>

In [None]:
df[num_var].describe().round(2)

### <font color='MediumBlue'>2 - <ins> Analysing categorical variables</ins>: <font color='violet'></font>

#### <font color='CornflowerBlue'>a) Measuring Frequency: </font>

#### <font color='ForestGreen'> i) frequency in counts</font>

In [None]:
for k in cat_var:
    print(f"{df[k].value_counts()} \n")    

#### <font color='ForestGreen'> ii) frequency in percentages</font>

In [None]:
for k in cat_var:
    percentage = (df[k].value_counts() / df.shape[0]) * 100
    print(f"{percentage} \n ")
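
Dividing by `df.shape[0]` as in the cell above gives each category's share of all rows, so missing values count in the denominator. The sketch below contrasts this with `value_counts(normalize=True)`, which gives the share among non-missing values only. The small `sex` series is a hypothetical sample.

```python
import pandas as pd

# Hypothetical sample with one missing value
sex = pd.Series(["M", "M", "F", None])

# Share of all rows (missing rows count in the denominator)
share_of_rows = sex.value_counts() / len(sex) * 100

# Share of non-missing values only
share_of_known = sex.value_counts(normalize=True) * 100

print(share_of_rows.round(2))
print(share_of_known.round(2))
```

Which denominator is appropriate depends on whether missing entries should be read as "unknown" or simply excluded from the comparison.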