![IronHack Logo](https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/upload_d5c5793015fec3be28a63c4fa3dd4d55.png)

# Project: Data Cleaning and Manipulation with Pandas

## Overview

The goal of this project is to combine everything you have learned about data wrangling, cleaning, and manipulation with Pandas so you can see how it all works together. For this project, you will start with this messy data set [Shark Attack](https://www.kaggle.com/teajay/global-shark-attacks/version/1). You will need to import it, use your data wrangling skills to clean it up, prepare it to be analyzed, and then export it as a clean CSV data file.

**You will be working individually for this project**, but we'll be guiding you along the process and helping you as you go. Show us what you've got!


---

## Technical Requirements

The technical requirements for this project are as follows:

* The dataset that we provide you is a significantly messy data set. Apply the different cleaning and manipulation techniques you have learned.
* Import the data using Pandas.
* Examine the data for potential issues.
* Use at least 8 of the cleaning and manipulation methods you have learned on the data.
* Produce a Jupyter Notebook that shows the steps you took and the code you used to clean and transform your data set.
* Export a clean CSV version of your data using Pandas.

## Necessary Deliverables

The following deliverables should be pushed to your GitHub repo for this chapter.

* **A cleaned CSV data file** containing the results of your data wrangling work.
* **A Jupyter Notebook (data-wrangling.ipynb)** containing all Python code and commands used in the importing, cleaning, manipulation, and exporting of your data set.
* **A ``README.md`` file** containing a detailed explanation of the process followed in the importing, cleaning, manipulation, and exporting of your data as well as your results, obstacles encountered, and lessons learned.

## Suggested Ways to Get Started

* **Examine the data and try to understand what the fields mean** before diving into data cleaning and manipulation methods.
* **Break the project down into different steps** - use the topics covered in the lessons to form a checklist, add anything else you can think of that may be wrong with your data set, and then work through the checklist.
* **Use the tools in your tool kit** - your knowledge of Python, data structures, Pandas, and data wrangling.
* **Work through the lessons in class** & ask questions when you need to! Think about adding relevant code to your project each night, instead of, you know... _procrastinating_.
* **Commit early, commit often.** Don’t be afraid of doing something incorrectly, because you can always roll back to a previous version.
* **Consult documentation and resources provided** to better understand the tools you are using and how to accomplish what you want.

## Useful Resources

* [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/)
* [Pandas Tutorials](https://pandas.pydata.org/pandas-docs/stable/tutorials.html)
* [StackOverflow Pandas Questions](https://stackoverflow.com/questions/tagged/pandas)
* [Awesome Public Data Sets](https://github.com/awesomedata/awesome-public-datasets)
* [Kaggle Data Sets](https://www.kaggle.com/datasets)


## Stories
### A case

https://www.theguardian.com/environment/2019/jan/03/shark-attack-bitten-great-white-survivors-near-death-experience

### "Instead of ignoring the problem or pretending that we can put up nets and protect everyone from sharks, what is needed is a public education campaign to teach people how to co-exist with sharks," Lewis Levine, M.D.


* Yet each year, for every human killed by a shark, our species slaughters millions of sharks - about 73 million sharks last year.

### Mission

* The Global Shark Attack File aims to present facts about these events.

### Methodology

* Early on, we became aware that the word "attack" was usually a misnomer. An "attack" by a shark is an extremely rare event, even less likely than statistics suggest. When a shark bites a surfboard, leaving the surfer unharmed, it was historically recorded as an "attack". Collisions between humans and sharks in low visibility water were also recorded as "attacks".

* Although incidents that occur in remote areas may go unrecorded, the Global Shark Attack File is a compilation of a number of data sources, and we have a team of qualified researchers throughout the world that actively investigate these incidents. One of our objectives is to provide a clear picture of the actual threat presented by sharks to humans. In this regard, we remind our visitors that more people drown in a single year in the United States than have been killed by sharks throughout the entire world in the last two centuries.



https://www.floridamuseum.ufl.edu/shark-attacks/yearly-worldwide-summary/

### Dataset

The International Shark Attack File is a global database of shark attacks. It began as an attempt to catalogue shark attacks on servicemen during World War II. The Office of Naval Research funded it from 1958 until 1968. During that time, a panel of shark experts developed a standard system for collecting accounts of shark attacks from around the world. The file was temporarily housed at the Mote Marine Laboratory in Sarasota, Florida, until a permanent home was found at the Florida Museum of Natural History at the University of Florida. It is currently under the direction of members of the American Elasmobranch Society, including George H. Burgess. The file contains information on over 5,300 shark attacks, and includes detailed, often privileged, information including autopsy reports and graphic photos. The file is accessible only to scientists whose access is permitted only by a review board.

A similar project called Global Shark Attack File is accessible to everybody as an XLS file that can be sorted by date and location of the shark attack.[1]

* http://www.sharkattackfile.net/incidentlog.htm

### Categories

* **Incidents involving boats**: incidents in which a boat was bitten or rammed by a shark are in green. However, in cases in which the shark was hooked, netted or gaffed, the entry is orange because they are classed as provoked incidents.
* **Casualties of war & air/sea disasters**: sharks maintain the health of the marine ecosystem by removing the dead or injured animals. Many incidents result because, like other animals that don't rely on instinct alone, sharks explore their environment. Lacking hands, they may investigate an unfamiliar object with their mouths. Unlike humans, there is no malice in sharks; they simply do what nature designed them to do. Air/sea disasters are accidents that place people into the day-to-day business of sharks. The wartime losses due to sharks result from man's cruelty to man. Air/sea disasters are in yellow.
* **Questionable incidents**: incidents in which there are insufficient data to determine if the injury was caused by a shark or the person drowned and the body was later scavenged by sharks. In a few cases, despite media reports to the contrary, evidence indicated there was no shark involvement whatsoever. Such incidents are in blue.

http://www.sharkattackfile.net/incidentlog.htm

#### Help the sharks
* https://www.sharkwater.com/shark-database/

#### shark profile
https://www.shark.ch/Database/

In [79]:
import pandas as pd
import numpy as np

## Import the data using Pandas

In [80]:
# pdf source: http://sharkattackfile.net/spreadsheets/pdf_directory/

In [101]:
df = pd.read_csv(
    # file to be imported
    'attacks.csv',
    # encoding
    encoding='latin1',
    # na values
    na_values=['xx', 'NaN', '0']
)

In [102]:
df.head(1)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,


In [103]:
df.tail(1)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
25722,,,,,,,,,,,...,,,,,,,,,,


In [104]:
df.shape

(25723, 24)

In [105]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25723 entries, 0 to 25722
Data columns (total 24 columns):
Case Number               6301 non-null object
Date                      6302 non-null object
Year                      6175 non-null float64
Type                      6298 non-null object
Country                   6252 non-null object
Area                      5847 non-null object
Location                  5762 non-null object
Activity                  5758 non-null object
Name                      6092 non-null object
Sex                       5737 non-null object
Age                       3471 non-null object
Injury                    6274 non-null object
Fatal (Y/N)               5763 non-null object
Time                      2948 non-null object
Species                   3464 non-null object
Investigator or Source    6285 non-null object
pdf                       6302 non-null object
href formula              6301 non-null object
href                      6302 non-null obje

## Feature selection: working with columns

In [106]:
# rename columns
def rename_cols(cols):
    # lower the case using the string method lower()
    cols = cols.str.lower()
    # replace characters literally (regex=False) to normalize the column names
    cols = cols.str.replace(' ', '_', regex=False).str.replace(')', '', regex=False)
    cols = cols.str.replace(':', '_', regex=False).str.replace('(', '', regex=False)
    cols = cols.str.replace('.', '_', regex=False).str.replace('/', '', regex=False)
    return cols

In [107]:
# call the rename_cols() function with df.columns as an argument
df.columns = rename_cols(df.columns)

In [108]:
# select the first 16 columns by slicing df.columns with brackets
df = df[df.columns[0:16]]

In [109]:
df.head()

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex_,age,injury,fatal_yn,time,species_,investigator_or_source
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57.0,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF"
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11.0,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com"
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48.0,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com"
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF"
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper


In [110]:
df.shape

(25723, 16)

In [111]:
df.columns

Index(['case_number', 'date', 'year', 'type', 'country', 'area', 'location',
       'activity', 'name', 'sex_', 'age', 'injury', 'fatal_yn', 'time',
       'species_', 'investigator_or_source'],
      dtype='object')
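
The trailing underscores in `sex_` and `species_` suggest that the original headers ended with a space. A minimal sketch of a stricter normalization, assuming we want to strip whitespace first and collapse every other separator into a single underscore. The helper name `normalize_columns` and the sample headers are illustrative and not part of the original notebook:

```python
import pandas as pd

def normalize_columns(cols):
    """Return lower-case, snake_case column labels (hypothetical helper)."""
    cols = cols.str.strip().str.lower()
    # collapse every run of non-alphanumeric characters into one underscore
    cols = cols.str.replace(r'[^0-9a-z]+', '_', regex=True)
    # drop underscores left at either end, e.g. from 'Fatal (Y/N)'
    return cols.str.strip('_')

sample = pd.Index(['Case Number', 'Sex ', 'Fatal (Y/N)', 'Species ', 'Case Number.1'])
normalized = normalize_columns(sample)
print(list(normalized))
```

Note that this variant yields `sex`, `species` and `fatal_y_n`, so the later cells that refer to `sex_`, `species_` and `fatal_yn` would need the new names if it were adopted.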

## Missing values - NaN
* How do we decide which values should be treated as NaN?
* How do we check for NaN values?
* How do we assign a NaN value?
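
The three questions above can be illustrated on a toy frame. This is only a sketch: the rows are made up, and the list of placeholders is an assumption about what counts as "missing" in this data set:

```python
import pandas as pd

toy = pd.DataFrame({
    'case_number': ['2018.06.25', '0', '2018.06.18'],
    'date': ['25-Jun-2018', 'No date', '18-Jun-2018'],
})

# 1. define: decide which placeholders mean "missing" in this data set
placeholders = ['0', 'No date']

# 2. assign: mask() sets every cell flagged by isin() to NaN
toy = toy.mask(toy.isin(placeholders))

# 3. check: isna() marks missing cells; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```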

In [112]:
# check the number of NaN values in each column
df.isnull().sum().sort_values(ascending=False)

time                      22775
species_                  22259
age                       22252
sex_                      19986
activity                  19965
location                  19961
fatal_yn                  19960
area                      19876
name                      19631
year                      19548
country                   19471
injury                    19449
investigator_or_source    19438
type                      19425
case_number               19422
date                      19421
dtype: int64

In [113]:
df.shape

(25723, 16)

In [114]:
# drop the rows in which every value is NaN
df = df.dropna(axis=0, how='all')

In [115]:
df.shape

(6302, 16)

In [116]:
# after dropping the all-NaN rows, recheck the NaN values in each column
df.isnull().sum().sort_values(ascending=False)

time                      3354
species_                  2838
age                       2831
sex_                       565
activity                   544
location                   540
fatal_yn                   539
area                       455
name                       210
year                       127
country                     50
injury                      28
investigator_or_source      17
type                         4
case_number                  1
date                         0
dtype: int64

##### Analyse column by column
Let's start with a question:
- Which column provides a unique identity for each row, like a unique ID for each case?

First of all, it helps to picture each row as a single case: one story that sits alongside the stories in the other rows. The analysis then looks at these cases in aggregate, as a block.

"The data talks about everyone, but not individually".


###### let's test if the column 'case_number' could be this ID column

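* Grouping identical values into a frequency table, we noticed in an earlier run that the value '0' appeared 2400 times. It is absent from the output below, most likely because '0' is already listed in `na_values` when the file is imported.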

In [117]:
df.case_number.value_counts().head()

2006.09.02      2
1923.00.00.a    2
1920.00.00.b    2
1990.05.10      2
1907.10.16.R    2
Name: case_number, dtype: int64

* What is the use of a repeated ID, considering that each case_number should, in theory, represent a unique code for each row?
 
Should we consider the '0' as a NaN value?
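
One way to test whether a column can serve as an ID is to check that it contains no repeated values. A sketch on a few made-up case numbers, not on the full data set:

```python
import pandas as pd

cases = pd.Series(['2006.09.02', '2006.09.02', '1990.05.10', '2018.06.25'],
                  name='case_number')

# True only when every value occurs exactly once
print(cases.is_unique)

# keep=False flags every member of a duplicated group, not just the repeats
repeated = cases[cases.duplicated(keep=False)]
print(repeated)
```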

In [119]:
# We can check specific values using .iloc
# df.case_number.iloc[8697]

In [120]:
# Let's check what data type are we talking about?
# type(df.case_number.iloc[8697])

* As we can see, '0' cannot play the role of an ID. Let's now check whether the rows with '0' carry any meaningful data:

Selecting the rows where the column 'case_number' is '0', we count the NaN values in each column:

In [121]:
# selecting all values from case_number equal to 0, and sum the NaN values:
df[df.case_number == '0'].isnull().sum()

case_number               0
date                      0
year                      0
type                      0
country                   0
area                      0
location                  0
activity                  0
name                      0
sex_                      0
age                       0
injury                    0
fatal_yn                  0
time                      0
species_                  0
investigator_or_source    0
dtype: int64

* If all other columns carry NaN and the '0' is not useful, we can relabel those values as NaN:

In [122]:
# Replace all '0' with np.nan (.loc assigns on df itself, not on a temporary copy)
df.loc[df.case_number == '0', 'case_number'] = np.nan

In [123]:
# check if the value were replaced -- as expected
# and it worked!
df.case_number.value_counts(ascending=False).head()

2006.09.02      2
1923.00.00.a    2
1920.00.00.b    2
1990.05.10      2
1907.10.16.R    2
Name: case_number, dtype: int64

* With every '0' in the column case_number replaced by np.nan, we can exclude the rows that contain nothing but NaN values using the dropna() function.
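
Keep in mind that `dropna(how='all')` removes a row only when every column is NaN. A sketch on made-up rows, contrasting it with `subset=`, which could be used if the goal were to drop every row that lacks a case number:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'case_number': [np.nan, np.nan, '2018.06.25'],
    'country': [np.nan, 'USA', 'USA'],
})

# how='all' removes only the rows in which every column is NaN
all_missing_dropped = toy.dropna(axis=0, how='all')

# subset= removes every row that lacks a value in the listed columns
id_missing_dropped = toy.dropna(subset=['case_number'])

print(all_missing_dropped.shape, id_missing_dropped.shape)
```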

In [124]:
# Before that, let's check the data shape

In [125]:
df.shape

(6302, 16)

In [126]:
# Dropping all rows in which all the values are NaN
df = df.dropna(axis=0, how='all')

In [127]:
# The shape is unchanged: no rows made up entirely of NaN values remained to drop
df.shape

(6302, 16)

* After analysing the column 'case_number', let's move forward to new columns

In [128]:
df.isnull().sum().sort_values(ascending=False)

time                      3354
species_                  2838
age                       2831
sex_                       565
activity                   544
location                   540
fatal_yn                   539
area                       455
name                       210
year                       127
country                     50
injury                      28
investigator_or_source      17
type                         4
case_number                  1
date                         0
dtype: int64

* Let's check the 'date' column:

In [155]:
df.date.str.replace('Reported', '').str.replace('No date', '').str.replace('Between ', '')

0                                             25-Jun-2018
1                                             18-Jun-2018
2                                             09-Jun-2018
3                                             08-Jun-2018
4                                             04-Jun-2018
5                                             03-Jun-2018
6                                             03-Jun-2018
7                                             27-May-2018
8                                             26-May-2018
9                                             26-May-2018
10                                            24-May-2018
11                                            21-May-2018
12                                            13-May-2018
13                                            13-May-2018
14                                               May 2018
15                                            12-May-2018
16                                            09-May-2018
17            

In [130]:
df.date.iloc[6275]

'No date'

* We can see that 'No date' stands for a missing value, even though it is stored as an ordinary string rather than as NaN.

So now, our job is to build a list of such placeholder values. The list can be passed as the `na_values` parameter while importing the file. Or we can replace them one by one, as we did with the '0' from the column 'case_number'.
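
As a sketch of the first option, the placeholders can be handed to `read_csv` through `na_values`. The in-memory CSV below is a made-up stand-in for `attacks.csv`, and the per-column dictionary is one possible design, chosen here so that '0' is not treated as missing in columns where it could be a real value:

```python
import io
import pandas as pd

# a tiny in-memory CSV standing in for attacks.csv (made-up rows)
raw = io.StringIO(
    "Case Number,Date\n"
    "2018.06.25,25-Jun-2018\n"
    "0,No date\n"
)

# a dict limits each placeholder to the column where it really means "missing"
frame = pd.read_csv(raw, na_values={'Case Number': ['0'], 'Date': ['No date']})
print(frame.isna().sum())
```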

In [131]:
nan_values = ['No date']

In [132]:
# assign through .loc so the change is applied to df itself;
# chained indexing (df.date[mask] = ...) triggers a SettingWithCopyWarning
df.loc[df.date == 'No date', 'date'] = np.nan

In [134]:
df.date.value_counts()

1957                    11
1942                     9
1956                     8
1950                     7
1941                     7
1958                     7
1949                     6
12-Apr-2001              5
Aug-1956                 5
1955                     5
1959                     5
1954                     5
No date, Before 1963     5
1970s                    5
1940                     5
05-Oct-2003              5
28-Jul-1995              5
Oct-1960                 5
23-Jan-1970              4
27-Dec-2008              4
29-Apr-2017              4
Before 1906              4
1995                     4
09-Jan-2010              4
1952                     4
27-Jul-1952              4
Before 1958              4
1938                     4
1960                     4
28-Dec-2014              4
                        ..
18-Apr-2003              1
24-Oct-2009              1
27-Jul-1960              1
17-Apr-2008              1
15-Oct-1897              1
04-Mar-1956              1
1

* Let's move on to analyse the column 'year'

In [135]:
# We found 125 cases with '0.0', which is not a valid year
df.year.value_counts()

2015.0    143
2017.0    136
2016.0    130
2011.0    128
2014.0    127
2008.0    122
2013.0    122
2009.0    120
2012.0    117
2007.0    112
2006.0    103
2005.0    103
2010.0    101
2000.0     97
1959.0     93
1960.0     93
2001.0     92
2003.0     92
2004.0     92
2002.0     88
1962.0     86
1961.0     78
1995.0     76
1964.0     66
1999.0     66
1998.0     65
1963.0     61
1996.0     61
1966.0     58
1997.0     57
         ... 
1733.0      1
1859.0      1
1822.0      1
1815.0      1
1703.0      1
1748.0      1
1807.0      1
1580.0      1
1801.0      1
1802.0      1
1834.0      1
1638.0      1
1637.0      1
1784.0      1
1755.0      1
1767.0      1
1819.0      1
1738.0      1
1753.0      1
1783.0      1
1742.0      1
1816.0      1
1555.0      1
500.0       1
1721.0      1
1771.0      1
1791.0      1
1554.0      1
1823.0      1
1786.0      1
Name: year, Length: 248, dtype: int64

In [136]:
# Checking value
df.year.iloc[6274]

nan

In [137]:
# Checking the type of that value
type(df.year.iloc[6274])

numpy.float64

In [138]:
# Let's transform the 0.0 values into NaN (.loc avoids the SettingWithCopyWarning of chained indexing)
df.loc[df.year == 0.0, 'year'] = np.nan


In [139]:
# Let's check again
df.year.value_counts()

2015.0    143
2017.0    136
2016.0    130
2011.0    128
2014.0    127
2008.0    122
2013.0    122
2009.0    120
2012.0    117
2007.0    112
2006.0    103
2005.0    103
2010.0    101
2000.0     97
1959.0     93
1960.0     93
2001.0     92
2003.0     92
2004.0     92
2002.0     88
1962.0     86
1961.0     78
1995.0     76
1964.0     66
1999.0     66
1998.0     65
1963.0     61
1996.0     61
1966.0     58
1997.0     57
         ... 
1733.0      1
1859.0      1
1822.0      1
1815.0      1
1703.0      1
1748.0      1
1807.0      1
1580.0      1
1801.0      1
1802.0      1
1834.0      1
1638.0      1
1637.0      1
1784.0      1
1755.0      1
1767.0      1
1819.0      1
1738.0      1
1753.0      1
1783.0      1
1742.0      1
1816.0      1
1555.0      1
500.0       1
1721.0      1
1771.0      1
1791.0      1
1554.0      1
1823.0      1
1786.0      1
Name: year, Length: 248, dtype: int64

In [140]:
df.year.isnull().sum()

127

In [141]:
df2 = df.copy()

In [142]:
df2.index = df.year

In [143]:
df2

year,case_number,date,year,type,country,area,location,activity,name,sex_,age,injury,fatal_yn,time,species_,investigator_or_source
2018.0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF"
2018.0,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com"
2018.0,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com"
2018.0,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF"
2018.0,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper
2018.0,2018.06.03.b,03-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,"Flat Rock, Ballina",Kite surfing,Chris,M,,"No injury, board bitten",N,,,"Daily Telegraph, 6/4/2018"
2018.0,2018.06.03.a,03-Jun-2018,2018.0,Unprovoked,BRAZIL,Pernambuco,"Piedade Beach, Recife",Swimming,Jose Ernesto da Silva,M,18,FATAL,Y,Late afternoon,Tiger shark,"Diario de Pernambuco, 6/4/2018"
2018.0,2018.05.27,27-May-2018,2018.0,Unprovoked,USA,Florida,"Lighhouse Point Park, Ponce Inlet, Volusia County",Fishing,male,M,52,Minor injury to foot. PROVOKED INCIDENT,N,,"Lemon shark, 3'","K. McMurray, TrackingSharks.com"
2018.0,2018.05.26.b,26-May-2018,2018.0,Unprovoked,USA,Florida,"Cocoa Beach, Brevard County",Walking,Cody High,M,15,Lower left leg bitten,N,17h00,"Bull shark, 6'","K.McMurray, TrackingSharks.com"
2018.0,2018.05.26.a,26-May-2018,2018.0,Unprovoked,USA,Florida,"Daytona Beach, Volusia County",Standing,male,M,12,Minor injury to foot,N,14h00,,"K. McMurray, Tracking Sharks.com"


## Cleaning the right data -- and save the boat

In [144]:
# replace values starting from date

In [145]:
# function to value_counts() all the data

## Index: working with dates and time series

In [146]:
df[df.date.notnull()].shape

(6296, 16)
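
The `date` column mixes full dates such as `25-Jun-2018` with partial ones such as `May 2018` or `1957`. A minimal sketch, assuming we only want to keep entries that match the day-month-year pattern and mark everything else as missing. The sample values are modelled on the outputs above, and English month abbreviations are assumed:

```python
import pandas as pd

dates = pd.Series(['25-Jun-2018', 'Reported 08-Jun-2018', 'May 2018', '1957', 'No date'])

# remove the 'Reported' prefix and surrounding whitespace
cleaned = dates.str.replace('Reported', '', regex=False).str.strip()

# an explicit format plus errors='coerce' turns non-matching strings into NaT
parsed = pd.to_datetime(cleaned, format='%d-%b-%Y', errors='coerce')
print(parsed)
```

Partial dates are lost with this approach; keeping them would need extra rules, for example a separate pass with a month-year format.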

In [149]:
# df.case_number.iloc[8697]

In [151]:
df.year.isnull().sum()

127

In [150]:
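# indexing with a float Series makes pandas look up its values as column labels,
# which raises the KeyError below; filtering rows needs a boolean mask such as df.year.notnull()
df[df.year]

KeyError: '[2018. 2018. 2018. ...   nan   nan   nan] not in index'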

In [611]:
df[df.date.notnull()]

0       2018.0
1       2018.0
2       2018.0
3       2018.0
4       2018.0
5       2018.0
6       2018.0
7       2018.0
8       2018.0
9       2018.0
10      2018.0
11      2018.0
12      2018.0
13      2018.0
14      2018.0
15      2018.0
16      2018.0
17      2018.0
18      2018.0
19      2018.0
20      2018.0
21      2018.0
22      2018.0
23      2018.0
24      2018.0
25      2018.0
26      2018.0
27      2018.0
28      2018.0
29      2018.0
         ...  
6272       0.0
6273       0.0
6274       0.0
6275       0.0
6276       0.0
6277       0.0
6278       0.0
6279       0.0
6280       0.0
6281       0.0
6282       0.0
6283       0.0
6284       0.0
6285       0.0
6286       0.0
6287       0.0
6288       0.0
6289       0.0
6290       0.0
6291       0.0
6292       0.0
6293       0.0
6294       0.0
6295       0.0
6296       0.0
6297       0.0
6298       0.0
6299       0.0
6300       0.0
6301       0.0
Name: year, Length: 6301, dtype: float64

### Examine the data for potential issues.

In [318]:
# clean data
df.sex_.value_counts()

M      5094
F       637
N         2
M         2
.         1
lli       1
Name: sex_, dtype: int64

In [343]:
# cleaning typos
df[(df.sex_=='M') | (df.sex_=='F')].sex_.value_counts()

M    5094
F     637
Name: sex_, dtype: int64
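
The filter above keeps only the exact strings `M` and `F`, so the second `M` entry in the earlier frequency table (presumably `M` with stray whitespace) is dropped with the typos. A sketch of an alternative that normalizes first and then marks the remaining odd values as missing; the sample values are illustrative:

```python
import pandas as pd

sex = pd.Series(['M', 'F', 'M ', 'N', '.', 'lli'])

# strip stray whitespace and unify case so 'M ' is counted with 'M'
normalized = sex.str.strip().str.upper()

# where() keeps values that pass the test and sets the rest to NaN
cleaned = normalized.where(normalized.isin(['M', 'F']))
print(cleaned.value_counts(dropna=False))
```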

In [304]:
df_1.sex_.value_counts()

M     5094
F      637
N        2
M        2
.        1
Name: sex_, dtype: int64

In [278]:
# create a reference dictionary to categorize the data
#df.species_.value_counts()
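
A reference dictionary of this kind could look like the sketch below. The keywords, the labels and the helper `categorize_species` are hypothetical; the sample strings are taken from the `species_` values shown earlier:

```python
import pandas as pd

# hypothetical reference dictionary: keyword -> category label
species_keywords = {'white': 'White shark', 'tiger': 'Tiger shark', 'bull': 'Bull shark'}

def categorize_species(text):
    """Return the first matching category, or None when nothing matches."""
    if not isinstance(text, str):
        return None  # NaN and other non-string values stay uncategorized
    lowered = text.lower()
    for keyword, label in species_keywords.items():
        if keyword in lowered:
            return label
    return None

species = pd.Series(['White shark', 'Tiger shark, 3m', "Bull shark, 6'", '2 m shark', None])
categories = species.map(categorize_species)
print(categories.tolist())
```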

In [30]:
# clean and categorize the data
df.activity.value_counts()

Surfing                                                                                                     971
Swimming                                                                                                    869
Fishing                                                                                                     431
Spearfishing                                                                                                333
Bathing                                                                                                     162
Wading                                                                                                      149
Diving                                                                                                      127
Standing                                                                                                     99
Snorkeling                                                                                              

In [164]:
df = df.sort_values('year')

In [158]:
import pandas_profiling

In [165]:
df.profile_report()

