Please run the following cell to have the proper styling for the notebook.

In [1]:
%%html
<style>
#notebook-container { 
    width: 50% !important; 
    min-width: 800px;
    padding-right: 5em !important;
}
h1 { margin-top: 3em !important; }
h2 { margin-top: 2em !important; }
h3 { margin-top: 1em !important; }
div.task {
  font-size: 1.2em;
  padding: .7em;
  border: 2px solid #ccc;
  background-color: #eee;
  border-radius: 5px;
  margin: 0.5em 0px;
  display: flex;
}
div.task div:first-child {
    padding-right: 10px;
    font-size: 1.2em;
    line-height: 1.1em;
    color: #777;
}
div.task tt {
    background-color: #fff;
    font-size: 0.9em;
    padding: 0px 5px 0px 5px;
    border-radius: 1px;
}    
</style>

<h2>How do I approach this assignment?</h2>

The core task of this assignment lies in coming up with interesting questions about the dataset. In order to find good questions, it will be necessary to <i>explore</i> it: try various plots and see whether you find something of note, something that looks interesting to you. Make use of your lab notebooks: we have seen several types of plots and how to create them with matplotlib. I recommend that you do this exploratory phase in a different notebook!

Once you have found an answer to your question, describe your process: what parts of the dataset you are using, how you deal with missing data, what you compute or plot and why, and finally what conclusions you draw from that. You should use <b>at least one</b> plot to answer your question&mdash;if you can answer your question by only computing statistics, it is probably not a good question to work on.

<b>Feel free to run your questions by me if you are unsure!</b>

This assignment will be marked like an essay: you need to explain to a potential reader what question you are asking and convince them that the answer you offer is supported by the data. 

If you are stuck please contact me well ahead of the deadline. I am happy to help!

<h1>Shark attacks!</h1>

You have already worked extensively with the shark attack dataset. This version has a few additional columns: 

<ul>
    <li><tt>Area</tt> contains a more precise description of where the incident occurred,
    <li><tt>Type</tt> a broad description of why it happened (importantly, whether the attack was provoked or not), 
    <li><tt>Injury</tt> describes the severity of the 
        injury sustained by the victim (if available),
    <li><tt>Species</tt> the shark species involved in the attack (insofar as it is known),
    <li>and finally <tt>Size (min)</tt> / <tt>Size (max)</tt>, an estimate of the shark's size in cm. The <tt>min</tt> and <tt>max</tt> stem from the fact that the original dataset only has a rough description of the attacking shark, and often estimates like &ldquo;between 5 and 7 foot&rdquo; are provided.
</ul>       
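
<div>To make the link between a rough description and the two size columns concrete, here is a minimal sketch of the arithmetic. It assumes the description is given in feet; <tt>size_range_cm</tt> is a hypothetical helper written for this illustration and not the procedure that was actually used to clean the dataset.</div>

```python
import re

CM_PER_FOOT = 30.48  # 1 foot = 12 inches of 2.54 cm each

def size_range_cm(description):
    """Return (min_cm, max_cm) from the numbers found in a size description in feet.

    Hypothetical helper for illustration only; the real cleaning rules are not known.
    """
    feet = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", description)]
    if not feet:
        return (None, None)
    return (round(min(feet) * CM_PER_FOOT), round(max(feet) * CM_PER_FOOT))

print(size_range_cm("between 5 and 7 foot"))  # (152, 213)
print(size_range_cm("6 foot"))                # (183, 183)
```

<div>A single estimate gives the same value for both columns, which matches rows in the table above where <tt>Size (min)</tt> and <tt>Size (max)</tt> are equal.</div>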

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

sharks = pd.read_csv('resources-02/shark-attacks-cleaned.csv', index_col=0)
sharks.head(8)

Unnamed: 0,Year,Month,Country,Area,Type,Activity,Sex,Age,Fatal,Injury,Species,Size (min),Size (max)
0,2018,Jun,USA,California,Boating,Paddling,F,57.0,N,Minor,White,,
1,2018,Jun,USA,Georgia,Unprovoked,Standing,F,11.0,N,Minor,,,
2,2018,Jun,USA,Hawaii,Invalid,Surfing,M,48.0,N,Minor,,,
3,2018,Jun,AUSTRALIA,New South Wales,Unprovoked,Surfing,M,,N,Minor,,200.0,200.0
4,2018,Jun,MEXICO,Colima,Provoked,Free diving,M,,N,Moderate,,300.0,300.0
5,2018,Jun,AUSTRALIA,New South Wales,Unprovoked,Kite surfing,M,,N,Minor,,,
6,2018,Jun,BRAZIL,Pernambuco,Unprovoked,Swimming,M,18.0,Y,Fatal,Tiger,,
7,2018,May,USA,Florida,Unprovoked,Fishing,M,52.0,N,Minor,,91.0,91.0


<h2>Dataset summary</h2>
<div class="task">
    <div>1)</div>
    <div>
        Describe every column of the dataset: what type of data it is (recall our data type classification!), what values we find in it and how they are roughly distributed. 
    </div>
</div>


<div>The following columns are present in the dataset:</div>

<div>Using 'describe()' I will quickly evaluate the numerical columns and refer to the resulting dataframe below in the descriptions:</div>

In [3]:
sharks.describe().T  # .T transposes the summary so that each dataset column becomes a row

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year,6300.0,1927.272381,281.116308,0.0,1942.0,1977.0,2005.0,2018.0
Age,3368.0,27.366093,13.909223,1.0,17.0,24.0,35.0,87.0
Size (min),1976.0,255.439777,136.289556,2.0,150.0,213.0,350.0,1006.0
Size (max),1976.0,267.893219,141.190139,2.0,152.0,240.0,366.0,1006.0


<h3>The next function will display the top 'x' values from each column in dataset 'dset' in turn:</h3>

In [4]:
def describeTopx(dset, x):
    """Display, for every column of dset, the number of distinct values and the x most frequent ones."""
    from IPython.display import display
    for colName in dset.columns:
        # Count on the dataframe passed in (dset), not on the global sharks
        counts = dset[colName].value_counts()
        display(colName)
        display("Total number of selections:", counts.count())
        display(counts.head(x))

describeTopx(sharks, 10)

'Year'

'Total number of selections:'

249

2015    143
2017    136
2016    130
2011    128
2014    127
0       125
2008    122
2013    122
2009    120
2012    117
Name: Year, dtype: int64

'Month'

'Total number of selections:'

12

Jul    671
Aug    601
Sep    555
Jan    518
Jun    497
Apr    455
Oct    445
Dec    438
Mar    413
Nov    408
Name: Month, dtype: int64

'Country'

'Total number of selections:'

212

USA                 2229
AUSTRALIA           1337
SOUTH AFRICA         579
PAPUA NEW GUINEA     134
NEW ZEALAND          128
BRAZIL               112
BAHAMAS              109
MEXICO                89
ITALY                 71
FIJI                  62
Name: Country, dtype: int64

'Area'

'Total number of selections:'

825

Florida                  1037
New South Wales           486
Queensland                310
Hawaii                    298
California                290
KwaZulu-Natal             213
Western Cape Province     195
Western Australia         189
Eastern Cape Province     160
South Carolina            160
Name: Area, dtype: int64

'Type'

'Total number of selections:'

6

Unprovoked      4594
Provoked         574
Invalid          546
Boating          341
Sea Disaster     239
Questionable       2
Name: Type, dtype: int64

'Activity'

'Total number of selections:'

1532

Surfing         971
Swimming        868
Fishing         431
Spearfishing    332
Bathing         162
Wading          149
Diving          127
Standing         99
Snorkeling       89
Scuba diving     76
Name: Activity, dtype: int64

'Sex'

'Total number of selections:'

2

M    5094
F     637
Name: Sex, dtype: int64

'Age'

'Total number of selections:'

80

17.0    154
18.0    150
20.0    142
19.0    142
15.0    139
16.0    138
21.0    119
22.0    117
25.0    108
24.0    106
Name: Age, dtype: int64

'Fatal'

'Total number of selections:'

2

N    4302
Y    1387
Name: Fatal, dtype: int64

'Injury'

'Total number of selections:'

5

Minor        2423
Moderate     1477
Fatal        1461
Major         440
No injury      19
Name: Injury, dtype: int64

'Species'

'Total number of selections:'

39

White            178
Tiger             80
Bull              56
Wobbegong         22
Zambesi           20
Blacktip          17
Mako              16
Blue              15
Raggedtooth       14
Bronze whaler     12
Name: Species, dtype: int64

'Size (min)'

'Total number of selections:'

99

300.0    134
122.0    129
183.0    115
200.0    111
180.0    107
150.0    107
152.0     87
91.0      77
305.0     68
400.0     64
Name: Size (min), dtype: int64

'Size (max)'

'Total number of selections:'

100

300.0    140
152.0    115
180.0    106
183.0    105
150.0    104
200.0    102
122.0     94
244.0     65
400.0     65
120.0     59
Name: Size (max), dtype: int64

<h2>Descriptions of the columns:</h2>

<div><b><u>Year:</u></b></div>
<div>This column contains numeric data: the year in which the attack took place. The values 0, 77 and 500 look like anomalies (year 0 alone appears 125 times, so it is probably a placeholder for an unknown year); the remaining years run from 1543 to 2018. The number of reports per year generally increases over time, and the years with the most reports are all recent. The mean is 1927, but this is skewed by the anomalies just mentioned; the median of 1977 is more representative.</div>
<div><b><u>Month:</u></b></div>
<div>This column contains ordinal data, namely the 12 months of the year. The northern hemisphere summer months have the highest numbers, with July and August leading the counts.</div>
<div><b><u>Country:</u></b></div>
<div>This column contains categorical data: the countries in which the attacks took place (or at least their coastal waters). With 212 distinct values, and there being only 195 to 206 countries in the world, there must be some errors here. Well over half of the reports (about two thirds) come from the top three countries.</div>
<div><b><u>Area:</u></b></div>
<div>This column also contains categorical data: more specific locations than the country. One sixth of all reports are from Florida, which suggests either that Florida is a dangerous state or that the figure is influenced by the fact that the International Shark Attack File is based in Florida!</div>
<div><b><u>Type:</u></b></div>
<div>This column contains categorical data. It states the type of attack, with 'Unprovoked' accounting for the vast majority (about 73%) of reports.</div>
<div><b><u>Activity:</u></b></div>
<div>This column describes the activity the victim was engaged in at the time of the attack. Out of a total of 1532 listed activities, surfing, swimming and fishing dominate the reports. Indeed, the entries at the other end of the list tend to be the same activities described in more detail.</div>
<div><b><u>Sex:</u></b></div>
<div>This column states whether the victim was male or female, with a ratio of about 8:1 in favour of males!</div>
<div><b><u>Age:</u></b></div>
<div>This column states the age of the victim, ranging from 1 to 87 with a mean of 27. The most frequently reported ages are in the late teens and early twenties.</div>
<div><b><u>Fatal:</u></b></div>
<div>This column states whether the attack involved a fatality or not. The ratio is about 3:1 non-fatal to fatal, with about 90% of all reports having data.</div>
<div><b><u>Injury:</u></b></div>
<div>This column describes the severity of the injury, if any. Over 92% of all reports have this data, with 'Minor' having the most entries. Interestingly, the number of 'Fatal' entries (1461) does not match the number of 'Y' entries in the Fatal column (1387).</div>
<div><b><u>Species:</u></b></div>
<div>This column describes the shark species. The 'White', or Great White Shark, dominates the reports, although a species is recorded for only about 10% of all reports.</div>
<div><b><u>Size (min) &amp; Size (max):</u></b></div>
<div>These columns describe the size of the shark and range from a minimum of 2 cm (another anomaly?) to over 10 m. Both have a similar mean (255 cm and 268 cm respectively), with 3 m being the most frequently reported size.</div>
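
<div>To illustrate how much a handful of placeholder years can distort the mean, here is a small sketch on made-up values (not the real dataset). The cut-off of 1500 is my own assumption for what counts as an anomaly.</div>

```python
import pandas as pd

# Made-up years for illustration; 0, 77 and 500 stand in for the suspicious entries
years = pd.Series([0, 0, 77, 500, 1543, 1960, 1995, 2010, 2018], name="Year")

# Assumption: anything before 1500 is a placeholder or a typo, so mark it as missing
clean_years = years.where(years >= 1500)

print(years.mean())          # pulled down by the placeholder values
print(clean_years.mean())    # missing values are skipped by default
print(clean_years.median())  # the median is much less sensitive to such outliers
```

<div>On the real data the same idea would be <tt>sharks['Year'].where(sharks['Year'] >= 1500)</tt>, with the threshold chosen after looking at the suspicious rows.</div>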

<h2>Data exploration</h2>

<div class="task">
    <div>2)</div>
    <div>
        Come up with <b>three</b> questions about the dataset
        and attempt to answer them using <tt>pandas</tt> and <tt>matplotlib</tt> (and any additional libraries you want to use).
    </div>
</div>

Per question, you should provide <b>at least one</b> plot which helps in exploring the question. <i>Every plot should be accompanied by a description of what is plotted and an interpretation of what it shows with respect to your question</i>.
    
Examples of good questions are &ldquo;Do larger sharks cause graver injuries?&rdquo;,
&ldquo;Which shark species is most dangerous?&rdquo;, or
&ldquo;Do men provoke sharks more than women?&rdquo; (you can use one of these, but please come up with different questions for the other two). 

Your write-up should contain an explanation of the question, a discussion on what part of the dataset you are focusing on (and why!), one or more suitable plots of this subset (including a description and an interpretation of the plot)
and an attempt at answering your question&mdash;if possible, quantitatively. 


A few hints:
<ul>
    <li>Select a <b>suitable subset</b> of the data. For example, if your question is related to, say, modern tourism, you should restrict yourself to rows with dates within the last 50 years or so.
    <li>Make sure that you select the <b>best suited plot</b>
       (among those that were introduced in the lecture and lab) and tune its appearance to make it <b>as readable as possible</b> (in particular with respect to the story you want to tell)
    <li>Feel free to use <b>external knowledge</b> to supplement your narrative about the data. Unless it is information that is easily verified, please supply a link to your source (Wikipedia is perfectly fine in this context)
</ul>

<h3>Question 1: Does age have any bearing on the severity of the attacks? Are there any other factors?</h3>

<div style="text-align: left; margin: 3em 6em">Here I will examine whether the age of a victim relates to the severity of the injuries they receive. If it does not, I will try other factors to see whether any of them correlates with severity:</div>


<h4>1.1: Does age correlate to the severity of the injury?</h4>

In [None]:
# Set a larger default figure size (width, height in inches) for all plots
plt.rcParams["figure.figsize"] = (15, 9)

# Please note: in order to see the following plots you will need to click on the relevant buttons. You can view all data together or split by sex.

In [None]:
import ipywidgets as widgets
from IPython.display import display, clear_output
import seaborn as sns

injury_order = ['No injury', 'Minor', 'Moderate', 'Major', 'Fatal']
prompt = 'Please select a button to show a boxplot for Age:'

btn = widgets.Button(description='Male/Female')
other_btn = widgets.Button(description='All')

def show_buttons():
    display(prompt)
    display(widgets.HBox((btn, other_btn)))

def make_handler(hue):
    # Build a click handler that redraws the buttons and plots Age per injury level
    def handler(btn_object):
        clear_output()
        show_buttons()
        sns.boxplot(x='Injury', y='Age', data=sharks, hue=hue, order=injury_order)
        plt.show()  # render the figure explicitly from inside the callback
    return handler

btn.on_click(make_handler('Sex'))       # split by sex
other_btn.on_click(make_handler(None))  # all victims together
show_buttons()

<div>Looking at both plots, the ages of the victims look much the same across the different types of injury. The 'No injury' group is a little older, but it contains only 19 reports.</div>
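
<div>The visual impression from a boxplot can be backed up with numbers. The following is a minimal sketch on made-up records (the column names match the dataset, the values do not) that tabulates the count and median age per injury category:</div>

```python
import pandas as pd

# Made-up records for illustration only; the notebook itself would use `sharks`
toy = pd.DataFrame({
    "Injury": ["Minor", "Minor", "Moderate", "Major", "Fatal", "Fatal", "No injury", "Minor"],
    "Age":    [17, 25, 30, None, 22, 40, 45, None],
})

injury_order = ["No injury", "Minor", "Moderate", "Major", "Fatal"]

age_summary = (
    toy.dropna(subset=["Age"])   # ignore victims of unknown age
       .groupby("Injury")["Age"]
       .agg(["count", "median"])
       .reindex(injury_order)    # severity order instead of alphabetical
)
print(age_summary)
```

<div>The <tt>count</tt> column matters as much as the median: a category with very few known ages should not be over-interpreted.</div>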
<div>We shall look at another factor:</div>

<h4>1.2: Does the size of the shark correlate to the severity of the injury?</h4>

In [None]:
import ipywidgets as widgets
from IPython.display import display, clear_output
import seaborn as sns

injury_order = ['No injury', 'Minor', 'Moderate', 'Major', 'Fatal']
prompt = 'Please select a button to show a boxplot for shark size(max):'

btn = widgets.Button(description='Male/Female')
other_btn = widgets.Button(description='All')

def show_buttons():
    display(prompt)
    display(widgets.HBox((btn, other_btn)))

def make_handler(hue):
    # Build a click handler that redraws the buttons and plots Size (max) per injury level
    def handler(btn_object):
        clear_output()
        show_buttons()
        sns.boxplot(x='Injury', y='Size (max)', data=sharks, hue=hue, order=injury_order)
        plt.show()  # render the figure explicitly from inside the callback
    return handler

btn.on_click(make_handler('Sex'))       # split by sex
other_btn.on_click(make_handler(None))  # all victims together
show_buttons()

<div>As expected, from 'Moderate' to 'Fatal' there is an increase in size: the bigger the shark, the harder the bite! There is a strange result, with 'Minor' seeming to have a similar reading to 'Major', where I would expect it to be lower than 'Moderate'.</div>
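
<div>One way to put a number on "bigger shark, worse injury" is a rank correlation between shark size and an ordinal coding of the injury. This is only a sketch on made-up values; coding the categories as 0 to 4 is my own assumption about how to order them.</div>

```python
import pandas as pd

# Made-up records for illustration only; the notebook itself would use `sharks`
toy = pd.DataFrame({
    "Injury": ["Minor", "Moderate", "Major", "Fatal", "Minor", "Fatal", "Moderate", "No injury"],
    "Size (max)": [150, 180, 300, 400, 120, 350, None, 100],
})

injury_order = ["No injury", "Minor", "Moderate", "Major", "Fatal"]
known = toy.dropna(subset=["Injury", "Size (max)"])

# Encode severity as ordered integer codes: No injury = 0 ... Fatal = 4
severity = pd.Series(
    pd.Categorical(known["Injury"], categories=injury_order, ordered=True).codes,
    index=known.index,
)

# Spearman's rho is the Pearson correlation of the ranks
rho = severity.rank().corr(known["Size (max)"].rank())
print(round(rho, 2))
```

<div>A value near +1 means that larger sharks go with more severe injuries, a value near 0 means there is no monotonic relationship. A rank correlation is used because the injury categories are ordinal, not numeric.</div>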

<h4>1.3: Does the year correlate to the severity of the injury?</h4>

In [None]:
import ipywidgets as widgets
from IPython.display import display, clear_output
import seaborn as sns

injury_order = ['No injury', 'Minor', 'Moderate', 'Major', 'Fatal']
prompt = 'Please select a button to show a boxplot for year:'

# (button label, only years after this value are plotted, split by sex?)
options = [
    ('Male/Female', 1750, True),
    ('All', 1750, False),
    ('M/F after 1900', 1900, True),
    ('All after 1900', 1900, False),
    ('M/F after 1960', 1960, True),
    ('All after 1960', 1960, False),
]
buttons = [widgets.Button(description=label) for label, _, _ in options]

def show_buttons():
    display(prompt)
    display(widgets.HBox(buttons))

def make_handler(min_year, by_sex):
    # Build a click handler that redraws the buttons and plots Year per injury level
    def handler(btn_object):
        clear_output()
        show_buttons()
        sns.boxplot(x='Injury', y='Year', data=sharks[sharks['Year'] > min_year],
                    hue='Sex' if by_sex else None, order=injury_order)
        plt.show()  # render the figure explicitly from inside the callback
    return handler

for button, (_, min_year, by_sex) in zip(buttons, options):
    button.on_click(make_handler(min_year, by_sex))

show_buttons()


<div>If you select all of the years, the boxplot does not tell you much about what we are looking for, or at least the evidence is quite compressed near the top. Once you select only the later years, it seems evident that there are fewer fatalities in this century. Indeed, the 'No injury' box seems to consist mainly of reports from the last 20 years or so.</div>
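
<div>The apparent decline in fatalities could be checked directly by computing the share of fatal attacks per decade. A minimal sketch on made-up records (the values are invented, only the column names follow the dataset):</div>

```python
import pandas as pd

# Made-up records for illustration only; the notebook itself would use `sharks`
toy = pd.DataFrame({
    "Year":  [1905, 1912, 1958, 1963, 1967, 2001, 2004, 2009, 2015, 2017],
    "Fatal": ["Y",  "Y",  "N",  "Y",  "N",  "N",  "N",  "Y",  "N",  "N"],
})

# Drop rows without a Fatal entry first, otherwise they would silently count as non-fatal
known = toy.dropna(subset=["Fatal"]).copy()
known["Decade"] = (known["Year"] // 10) * 10

# The mean of a True/False column is the share of True values
fatal_share = (known["Fatal"] == "Y").groupby(known["Decade"]).mean()
print(fatal_share)
```

<div>Plotting <tt>fatal_share</tt> as a line or bar chart would show the trend over time more directly than a boxplot of years.</div>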

<h3>Conclusion:</h3>
<div>The age of the person does not seem to have much bearing on the severity of the injuries reported, whereas the size of the shark does, which I do not find surprising. One thing I do find interesting is that fatalities have declined while reports of attacks with no injuries have increased in the last 20 years. Could the cause be that medical advances are saving lives? Are non-injury attacks being reported more now because the internet makes reporting easier?</div>

<h3>Question 2: Which shark species is most dangerous?  (Felix's suggestion...)</h3>


<div style="text-align: left; margin: 3em 6em">Here I will explore the species data to see which species causes the most serious attacks, and perhaps also which causes the most attacks of all types. First of all, I will make a copy of the data that only includes the columns we need:</div>

In [None]:
sharksQ2 = sharks[['Type','Activity','Injury','Species']].copy() 
#I have included Type and Activity in case they become useful later on.
sharksQ2.head()

<div>Now, I will remove all NaN entries from the 'Injury' and 'Species' columns:</div>

In [None]:
Q2Subset = sharksQ2.dropna(subset = ['Injury', 'Species'])
Q2Subset.head(10)

<h4>Initial series showing the species with the most attacks:</h4>

In [None]:
topspecies = Q2Subset['Species'].value_counts().head(10)
topspecies

<div>The above table gives us the basic information we are after in a simple format. It shows that the 'White', or Great White Shark, is the species with the most reported attacks. This is not a big surprise, but we should take into account that the Great White is the best-known species and has a poor reputation (thanks to a series of films), so it could be the 'go-to' species suggested by witnesses. Or maybe Spielberg was right! This table could have been produced using a spreadsheet, so I will explore the data further using the tools we have available.</div>
<div>Something else we can look at is the severity of the attacks, which is why I included the 'Injury' column. From here on I will only consider the top 10 species:</div>

<h4>Initial series showing the number of attacks for each severity of injury:</h4>

In [None]:
Q2Subset['Injury'].value_counts()

In [None]:
top10species = Q2Subset['Species'].value_counts().head(10)

import matplotlib.pyplot as plt
import numpy as np
 
x = top10species.index
y = top10species.values

 
plt.bar(x, y)
plt.xlabel('Species')
plt.title('Registered Shark Attacks by Species')
plt.ylabel('Total Shark Attacks')

plt.show()

<div>A stacked chart showing the different types of injury:</div>

In [None]:
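# Keep the ten species with the most reports (head(10) on the crosstab would give the first ten alphabetically)
top10 = Q2Subset['Species'].value_counts().head(10).index
a = pd.crosstab(Q2Subset.Species, Q2Subset.Injury).loc[top10]
a.plot.bar(stacked=True)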

<h4>Conclusion:</h4>
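<div>With no surprises, the Great White has the most reported attacks of any species, and on that measure it is the most dangerous shark. Bear in mind that a species is only recorded for a small fraction of the reports.</div>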
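
<div>"Most dangerous" can also be read as "most likely to kill" and not just "most attacks". As a sketch of how the two readings differ, the following uses made-up records to compute both the number of attacks and the share of fatal ones per species:</div>

```python
import pandas as pd

# Made-up records for illustration only; the notebook itself would use `Q2Subset`
toy = pd.DataFrame({
    "Species": ["White", "White", "White", "White", "Tiger", "Tiger", "Bull", "Bull", "Wobbegong"],
    "Injury":  ["Fatal", "Minor", "Major", "Fatal", "Fatal", "Minor", "Fatal", "Fatal", "Minor"],
})

counts = pd.crosstab(toy["Species"], toy["Injury"])
# normalize="index" turns each row into shares that sum to 1
shares = pd.crosstab(toy["Species"], toy["Injury"], normalize="index")

summary = pd.DataFrame({
    "attacks": counts.sum(axis=1),
    "fatal_share": shares["Fatal"],
}).sort_values("fatal_share", ascending=False)
print(summary)
```

<div>In this toy example the species with the most attacks is not the one with the highest fatal share, which is why it is worth looking at both columns. Shares based on only a few attacks are unreliable, so the <tt>attacks</tt> column should always be read alongside them.</div>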
   

<h3>Question 3: Which month is the most dangerous for shark attacks?</h3>
<br>
<div style="text-align: left; margin: 3em 6em">As the previous two questions only used a small amount of data due to the low number of times the species was recorded, I will try to use all of the records here. With this question I will explore the time of year at which attacks take place. I expect the southern and northern hemispheres having summers at different times to be an issue, and I will take into account the fact that more people are in the water in the warmer months.</div>
<h4>Initial series showing months and number of attacks:</h4>

In [None]:
sharks['Month'].value_counts()

<div>This shows that the northern summer months and January seem to be the most dangerous, which is not really a surprise considering that January is summer in the southern hemisphere. To look further, we will need to remove the NaN values, look at the severity of the attacks, and adjust the 'Month' data to account for the part of the world the report is from:</div>
<div>Pull out the required columns:</div>

In [None]:
sharksQ3 = sharks[['Month','Country','Injury']].copy() 

<div>Show the top 10 countries:</div>

In [None]:
sharksQ3["Country"].value_counts().head(10)

<div>Here it is quite clear that the majority of attacks take place in a small number of countries, in three distinct regions: America, Australasia and South Africa. A further study looking at which areas of the USA and Australia have frequent attacks would be interesting due to their large, varied coastal regions.</div>
<div>I will, arbitrarily, choose to use the countries that have had over 100 attacks reported (I will also remove all NaN values from all columns):</div>

In [None]:
topCountries = sharksQ3.groupby("Country").filter(lambda x: len(x) > 100).dropna()
display(topCountries.describe())
display(topCountries["Country"].value_counts())

<div>Because I filtered for countries with more than 100 reports before removing the NaN values, a few countries are left with fewer than 100 records. I will use these anyway.</div>
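
<div>The effect of the order of the two steps can be seen on a small made-up example, where a threshold of 2 stands in for the 100 used above:</div>

```python
import pandas as pd

# Made-up records for illustration only
toy = pd.DataFrame({
    "Country": ["USA"] * 4 + ["FIJI"] * 3,
    "Month":   ["Jan", "Feb", None, "Jul", "Mar", None, None],
    "Injury":  ["Minor"] * 7,
})
threshold = 2  # stands in for the 100 used in the notebook

# Filter first, then drop: FIJI passes with 3 rows but keeps only 1 complete row
filter_then_drop = toy.groupby("Country").filter(lambda g: len(g) > threshold).dropna()

# Drop first, then filter: only countries with enough complete rows survive
drop_then_filter = toy.dropna().groupby("Country").filter(lambda g: len(g) > threshold)

print(filter_then_drop["Country"].value_counts())
print(drop_then_filter["Country"].value_counts())
```

<div>Which order is preferable depends on the question: filtering first selects countries by the total number of reports, dropping first selects them by the number of usable reports.</div>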
<div>I now need to categorise these by geographical location, namely by hemisphere. This will create a new column, 'Hemisphere', containing either 'North' or 'South' depending on the 'Country' value (I have placed Brazil in the southern hemisphere):</div>

In [None]:
import numpy as np

# Countries treated as northern; every other country in topCountries is labelled 'South'
northern = ['USA', 'BAHAMAS']
topCountries['Hemisphere'] = np.where(topCountries['Country'].isin(northern), 'North', 'South')
topCountries.head(10)

<div>This is a stacked bar chart showing attacks by all countries:</div>

In [None]:
topNorth = topCountries[topCountries['Hemisphere']=='North']
topSouth = topCountries[topCountries['Hemisphere']=='South']
# Cross-tabulate all selected countries (both hemispheres) against injury severity
a = pd.crosstab(topCountries.Country, topCountries.Injury)
a.plot.bar(stacked=True)

In [None]:
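import matplotlib.pyplot as plt

# Put the bars in calendar order instead of ordering them by frequency
# (assumes the three-letter month abbreviations used in the data)
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
topNorthS = topNorth['Month'].value_counts().reindex(month_order, fill_value=0)

x = topNorthS.index
y = topNorthS.values

plt.bar(x, y)
plt.xlabel('Month')
plt.title('Shark Attacks by Month - Northern Hemisphere')
plt.ylabel('Total Shark Attacks')

plt.show()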


In [None]:
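import matplotlib.pyplot as plt

# Put the bars in calendar order instead of ordering them by frequency
# (assumes the three-letter month abbreviations used in the data)
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
topSouthS = topSouth['Month'].value_counts().reindex(month_order, fill_value=0)

x = topSouthS.index
y = topSouthS.values

plt.bar(x, y)
plt.xlabel('Month')
plt.title('Shark Attacks by Month - Southern Hemisphere')
plt.ylabel('Total Shark Attacks')

plt.show()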


<h4>Conclusion:</h4>
<div>The months with the most attacks are the summer months of whichever hemisphere the attacks take place in. I would have liked to plot the northern and southern hemisphere counts together, which would have given a clearer picture.</div>
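
<div>One possible way to plot both hemispheres together is a grouped bar chart built from a crosstab of month against hemisphere. The following is a sketch on made-up records; in the notebook the same call would use the 'Month' and 'Hemisphere' columns of <tt>topCountries</tt>.</div>

```python
import pandas as pd

month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Made-up records for illustration only
toy = pd.DataFrame({
    "Month": ["Jan", "Jan", "Feb", "Jul", "Jul", "Jul", "Aug", "Dec", "Jul", "Jan"],
    "Hemisphere": ["South", "South", "South", "North", "North",
                   "North", "North", "South", "South", "North"],
})

# One row per month, one column per hemisphere; months without reports become 0
by_month = (
    pd.crosstab(toy["Month"], toy["Hemisphere"])
      .reindex(month_order, fill_value=0)
)

# One pair of bars per month, side by side
ax = by_month.plot.bar()
ax.set_xlabel("Month")
ax.set_ylabel("Total Shark Attacks")
ax.set_title("Shark Attacks by Month and Hemisphere")
```

<div>Because the two hemispheres have very different totals, dividing each column by its sum before plotting (<tt>by_month / by_month.sum()</tt>) would make the seasonal shapes easier to compare.</div>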