![IronHack Logo](https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/upload_d5c5793015fec3be28a63c4fa3dd4d55.png)

# Project: Data Cleaning and Manipulation with Pandas

## Overview

The goal of this project is to combine everything you have learned about data wrangling, cleaning, and manipulation with Pandas so you can see how it all works together. You will start with a messy data set, [Shark Attack](https://www.kaggle.com/teajay/global-shark-attacks/version/1). You will need to import it, use your data wrangling skills to clean it up, prepare it for analysis, and then export it as a clean CSV data file.

**You will be working individually for this project**, but we'll be guiding you along the process and helping you as you go. Show us what you've got!


---

## Technical Requirements

The technical requirements for this project are as follows:

* The data set we provide is significantly messy. Apply the different cleaning and manipulation techniques you have learned.
* Import the data using Pandas.
* Examine the data for potential issues.
* Use at least 8 of the cleaning and manipulation methods you have learned on the data.
* Produce a Jupyter Notebook that shows the steps you took and the code you used to clean and transform your data set.
* Export a clean CSV version of your data using Pandas.

## Necessary Deliverables

The following deliverables should be pushed to your GitHub repo for this chapter.

* **A cleaned CSV data file** containing the results of your data wrangling work.
* **A Jupyter Notebook (data-wrangling.ipynb)** containing all Python code and commands used in the importing, cleaning, manipulation, and exporting of your data set.
* **A ``README.md`` file** containing a detailed explanation of the process followed in the importing, cleaning, manipulation, and exporting of your data as well as your results, obstacles encountered, and lessons learned.

## Suggested Ways to Get Started

* **Examine the data and try to understand what the fields mean** before diving into data cleaning and manipulation methods.
* **Break the project down into different steps**: use the topics covered in the lessons to form a checklist, add anything else you can think of that may be wrong with your data set, and then work through the checklist.
* **Use the tools in your tool kit** - your knowledge of Python, data structures, Pandas, and data wrangling.
* **Work through the lessons in class** & ask questions when you need to! Think about adding relevant code to your project each night, instead of, you know... _procrastinating_.
* **Commit early, commit often.** Don’t be afraid of doing something incorrectly, because you can always roll back to a previous version.
* **Consult documentation and resources provided** to better understand the tools you are using and how to accomplish what you want.

## Useful Resources

* [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/)
* [Pandas Tutorials](https://pandas.pydata.org/pandas-docs/stable/tutorials.html)
* [StackOverflow Pandas Questions](https://stackoverflow.com/questions/tagged/pandas)
* [Awesome Public Data Sets](https://github.com/awesomedata/awesome-public-datasets)
* [Kaggle Data Sets](https://www.kaggle.com/datasets)


In [647]:
import pandas as pd
import numpy as np

## Import the data using Pandas

In [648]:
# pdf source: http://sharkattackfile.net/spreadsheets/pdf_directory/

In [649]:
df = pd.read_csv(
    # file to be imported
    'attacks.csv',
    # encoding
    encoding='latin1',
    # na values
    na_values=['xx', 'NaN']
)

In [650]:
df.head(1)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,


In [651]:
df.tail()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
25718,,,,,,,,,,,...,,,,,,,,,,
25719,,,,,,,,,,,...,,,,,,,,,,
25720,,,,,,,,,,,...,,,,,,,,,,
25721,,,,,,,,,,,...,,,,,,,,,,
25722,,,,,,,,,,,...,,,,,,,,,,


In [652]:
df.shape

(25723, 24)

In [653]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25723 entries, 0 to 25722
Data columns (total 24 columns):
Case Number               8701 non-null object
Date                      6302 non-null object
Year                      6300 non-null float64
Type                      6298 non-null object
Country                   6252 non-null object
Area                      5847 non-null object
Location                  5762 non-null object
Activity                  5758 non-null object
Name                      6092 non-null object
Sex                       5737 non-null object
Age                       3471 non-null object
Injury                    6274 non-null object
Fatal (Y/N)               5763 non-null object
Time                      2948 non-null object
Species                   3464 non-null object
Investigator or Source    6285 non-null object
pdf                       6302 non-null object
href formula              6301 non-null object
href                      6302 non-null obje

## Feature selection: working with columns

In [654]:
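# rename columns: lower-case the names, turn spaces, colons and dots into
# underscores, and drop parentheses and slashes
# (regex=False makes every pattern a literal string in any pandas version)
def rename_cols(cols):
    cols = cols.str.lower()
    cols = cols.str.replace(' ', '_', regex=False).str.replace(')', '', regex=False)
    cols = cols.str.replace(':', '_', regex=False).str.replace('(', '', regex=False)
    cols = cols.str.replace('.', '_', regex=False).str.replace('/', '', regex=False)
    return cols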

In [655]:
# call the rename_cols() function with df.columns as an argument
df.columns = rename_cols(df.columns)

In [656]:
# select the first 16 columns by slicing df.columns
df = df[df.columns[0:16]]

In [657]:
df.shape

(25723, 16)

In [658]:
df.columns

Index(['case_number', 'date', 'year', 'type', 'country', 'area', 'location',
       'activity', 'name', 'sex_', 'age', 'injury', 'fatal_yn', 'time',
       'species_', 'investigator_or_source'],
      dtype='object')
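
The trailing underscore in `sex_` and `species_` suggests that the original headers end with a space. A minimal sketch of a variant that strips whitespace first, assuming that is the cause; `normalize_cols` and the sample headers are hypothetical and only illustrate the idea:

```python
import pandas as pd

def normalize_cols(cols):
    """Hypothetical variant of rename_cols that strips whitespace first."""
    cols = cols.str.strip().str.lower()
    # drop parentheses and slashes
    cols = cols.str.replace(r"[()/]", "", regex=True)
    # turn runs of spaces, colons and dots into a single underscore
    cols = cols.str.replace(r"[ :.]+", "_", regex=True)
    return cols

# sample headers modelled on the data set (the trailing spaces are an assumption)
example = pd.Index(["Case Number", "Sex ", "Fatal (Y/N)", "Species ", "Case Number.1"])
cleaned = normalize_cols(example)
print(list(cleaned))
```

With this variant the columns would be called `sex` and `species`, so later cells that use `sex_` or `species_` would need to be adjusted.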

## Missing values - NaN
* How do we decide which values should be treated as NaN?
* How do we check for NaN values?
* How do we assign a NaN value?
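
A short sketch of the three questions on toy data (the frame `toy` is made up for illustration): placeholder values are chosen by inspection, `isnull()` detects NaN, and `.loc` assigns it.

```python
import numpy as np
import pandas as pd

# toy frame; '0' and 0.0 play the role of placeholder values
toy = pd.DataFrame({
    "case_number": ["2018.06.25", "0", None],
    "year": [2018.0, 0.0, np.nan],
})

# NaN never compares equal to itself, so == cannot be used to find it
nan_is_equal = (np.nan == np.nan)

# isnull() (alias: isna()) is the reliable check
missing_per_col = toy.isnull().sum()

# assign NaN to the rows selected by a boolean mask
toy.loc[toy["year"] == 0.0, "year"] = np.nan

print(nan_is_equal)
print(missing_per_col)
print(toy)
```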

In [659]:
# check the number of NaN values in each column
df.isnull().sum().sort_values(ascending=False)

time                      22775
species_                  22259
age                       22252
sex_                      19986
activity                  19965
location                  19961
fatal_yn                  19960
area                      19876
name                      19631
country                   19471
injury                    19449
investigator_or_source    19438
type                      19425
year                      19423
date                      19421
case_number               17022
dtype: int64

In [660]:
df.shape

(25723, 16)

In [661]:
# drop the rows in which all values are NaN
df = df.dropna(axis=0, how='all')

In [662]:
df.shape

(8702, 16)
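
To see what `how='all'` does compared with `how='any'`, here is a small sketch on a made-up frame (not the shark data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "case_number": ["2018.06.25", np.nan, "2018.06.18"],
    "year": [2018.0, np.nan, np.nan],
})

# how='all' removes only the rows in which every column is NaN
rows_all = toy.dropna(axis=0, how="all")

# how='any' removes every row that has at least one NaN
rows_any = toy.dropna(axis=0, how="any")

print(rows_all.shape)
print(rows_any.shape)
```

`how='all'` is the conservative choice here, because partially filled rows may still describe a real case.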

In [663]:
# after dropping the all-NaN rows, recheck the NaN counts in each column
df.isnull().sum().sort_values(ascending=False)

time                      5754
species_                  5238
age                       5231
sex_                      2965
activity                  2944
location                  2940
fatal_yn                  2939
area                      2855
name                      2610
country                   2450
injury                    2428
investigator_or_source    2417
type                      2404
year                      2402
date                      2400
case_number                  1
dtype: int64

##### Analyse column by column
Let's start with a question:
- Which column represents a unique identity for each row, like a unique ID for each case?

First of all, it helps to picture each row as exactly one case: a unique story that is later aggregated with the stories in the other rows. This is called analysing by block.

"The data talks about everyone, but not individually".


###### Let's test whether the column 'case_number' could be this ID column

* Grouping equal values into a frequency table, we notice that the value '0' appears 2400 times:

In [664]:
df.case_number.value_counts().head()

0                 2400
1990.05.10           2
1915.07.06.a.R       2
1966.12.26           2
1983.06.15           2
Name: case_number, dtype: int64

* What is the use of a repeated ID, considering that each case_number should, in theory, represent a unique code for each row?
 
Should we consider the '0' as a NaN value?
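
Besides `value_counts()`, uniqueness can be tested directly. A sketch on a hypothetical handful of IDs, using `Series.is_unique` and `Series.duplicated`:

```python
import pandas as pd

# hypothetical sample of case numbers, including the '0' placeholder
ids = pd.Series(["2018.06.25", "0", "0", "1990.05.10", "1990.05.10", "1983.06.15"])

# True only if no value occurs more than once
all_unique = ids.is_unique

# keep=False marks every occurrence of a repeated value, not just the later ones
repeated = ids[ids.duplicated(keep=False)].value_counts()

print(all_unique)
print(repeated)
```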

In [665]:
# We can check specific values using .iloc
df.case_number.iloc[8697]

'0'

In [666]:
# Let's check which data type we are dealing with
type(df.case_number.iloc[8697])

str

* As we can see, '0' cannot serve as an ID. Let's now check whether the rows with '0' carry any meaningful data:

In the rows where 'case_number' is '0', every other column holds only NaN values, as shown below:

In [668]:
# select all rows where case_number equals '0' and count the NaN values per column:
df[df.case_number == '0'].isnull().sum()

case_number                  0
date                      2400
year                      2400
type                      2400
country                   2400
area                      2400
location                  2400
activity                  2400
name                      2400
sex_                      2400
age                       2400
injury                    2400
fatal_yn                  2400
time                      2400
species_                  2400
investigator_or_source    2400
dtype: int64

* If all other columns carry NaN and the '0' is not useful, we can relabel it as NaN:

In [669]:
# Replace all '0' with np.nan (a single .loc call avoids chained assignment)
df.loc[df.case_number == '0', 'case_number'] = np.nan
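
Writes of this kind are safest through `.loc` or `Series.replace`. Chained indexing such as `df.col[mask] = value` can trigger a `SettingWithCopyWarning` and, in pandas versions with copy-on-write enabled, leaves the frame unchanged. A sketch of both alternatives on toy data (the frame is made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "case_number": ["2018.06.25", "0", "0"],
    "year": [2018.0, np.nan, np.nan],
})

# Option 1: boolean mask and column label in a single .loc call
via_loc = toy.copy()
via_loc.loc[via_loc["case_number"] == "0", "case_number"] = np.nan

# Option 2: Series.replace returns a new Series that is assigned back
via_replace = toy.copy()
via_replace["case_number"] = via_replace["case_number"].replace("0", np.nan)

print(via_loc["case_number"].isnull().sum())
print(via_replace["case_number"].isnull().sum())
```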

In [670]:
# check that the values were replaced as expected
# and it worked!
df.case_number.value_counts(ascending=False).head()

1990.05.10        2
1915.07.06.a.R    2
1966.12.26        2
1983.06.15        2
1913.08.27.R      2
Name: case_number, dtype: int64

* Now that every '0' in the column case_number has been replaced with np.nan, we're ready to exclude those rows from the dataset using the dropna() function.

In [671]:
# Before that, let's check the data shape

In [672]:
df.shape

(8702, 16)

In [673]:
# Drop all rows in which every value is NaN
df = df.dropna(axis=0, how='all')

In [674]:
# It worked! Again, we dropped all the rows that contain only NaN values
df.shape

(6302, 16)

* After analysing the column 'case_number', let's move on to the other columns

In [675]:
df.isnull().sum().sort_values(ascending=False)

time                      3354
species_                  2838
age                       2831
sex_                       565
activity                   544
location                   540
fatal_yn                   539
area                       455
name                       210
country                     50
injury                      28
investigator_or_source      17
type                         4
year                         2
case_number                  1
date                         0
dtype: int64

* Let's check the 'date' column:

In [676]:
df.date.value_counts()

1957                    11
1942                     9
1956                     8
1941                     7
1958                     7
1950                     7
1949                     6
No date                  6
1940                     5
05-Oct-2003              5
1955                     5
Oct-1960                 5
Aug-1956                 5
1959                     5
1954                     5
1970s                    5
28-Jul-1995              5
12-Apr-2001              5
No date, Before 1963     5
Before 1958              4
Reported 10-Oct-1906     4
27-Dec-2008              4
1960s                    4
14-Jun-2012              4
09-Jul-1994              4
1960                     4
15-Apr-2018              4
29-Apr-2017              4
28-Dec-2014              4
1938                     4
                        ..
14-Sep-1979              1
11-Jan-1976              1
Reported 15-Dec-1877     1
11-Oct-1987              1
04-Aug-1980              1
14-Jan-2005              1
2

In [677]:
df.date.iloc[6275]

'No date'

* We can see that 'No date' stands for a missing value (NaN), even though it is not stored as one.

So now, our job is to build a list of NaN values. The list can be passed as a parameter while importing the file. Alternatively, we can replace the values one by one, as we did with the '0' from the column 'case_number'.
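
A sketch of the import-time option, using an in-memory CSV instead of `attacks.csv`. The sample rows and the per-column dictionary are assumptions for illustration. `na_values` accepts either a list, applied to every column, or a dict that maps column names to their own placeholders:

```python
import io
import pandas as pd

# tiny stand-in for the real file
csv_text = (
    "Case Number,Date,Year\n"
    "2018.06.25,25-Jun-2018,2018\n"
    "0,No date,0\n"
)

nan_values = ["No date"]

# per-column placeholders: '0' is treated as missing only in 'Case Number'
per_column = {"Case Number": ["0"], "Date": nan_values}

toy = pd.read_csv(io.StringIO(csv_text), na_values=per_column)
print(toy)
print(toy.isnull().sum())
```

The dict form keeps a `0` in other columns untouched, which matters when zero is a legitimate value there.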

In [691]:
nan_values = ['No date']

In [692]:
# a single .loc call avoids the SettingWithCopyWarning raised by chained assignment
df.loc[df.date == 'No date', 'date'] = np.nan


In [694]:
df.date.value_counts()

1957                    11
1942                     9
1956                     8
1950                     7
1941                     7
1958                     7
1949                     6
12-Apr-2001              5
1959                     5
1970s                    5
No date, Before 1963     5
1954                     5
05-Oct-2003              5
1955                     5
1940                     5
Aug-1956                 5
28-Jul-1995              5
Oct-1960                 5
27-Dec-2008              4
29-Apr-2017              4
1876                     4
1960s                    4
1945                     4
23-Jan-1970              4
09-Jul-1994              4
27-Jul-1952              4
14-Jun-2012              4
1904                     4
15-Apr-2018              4
28-Dec-2014              4
                        ..
14-Sep-1979              1
11-Jan-1976              1
Reported 15-Dec-1877     1
11-Oct-1987              1
04-Aug-1980              1
14-Jan-2005              1
2

* Let's move on to analyse the column 'year'

In [681]:
# We found 125 cases with '0.0', which is not a valid year
df.year.value_counts()

2015.0    143
2017.0    136
2016.0    130
2011.0    128
2014.0    127
0.0       125
2008.0    122
2013.0    122
2009.0    120
2012.0    117
2007.0    112
2005.0    103
2006.0    103
2010.0    101
2000.0     97
1960.0     93
1959.0     93
2004.0     92
2001.0     92
2003.0     92
2002.0     88
1962.0     86
1961.0     78
1995.0     76
1964.0     66
1999.0     66
1998.0     65
1996.0     61
1963.0     61
1966.0     58
         ... 
1823.0      1
1822.0      1
1733.0      1
1805.0      1
1617.0      1
77.0        1
1788.0      1
1811.0      1
1779.0      1
1767.0      1
1755.0      1
1753.0      1
1859.0      1
1764.0      1
1637.0      1
1810.0      1
1802.0      1
1703.0      1
1748.0      1
1841.0      1
1749.0      1
1784.0      1
1807.0      1
1595.0      1
1580.0      1
1801.0      1
1638.0      1
1834.0      1
1723.0      1
1786.0      1
Name: year, Length: 249, dtype: int64

In [682]:
# Checking value
df.year.iloc[6274]

0.0

In [683]:
# Checking the type of 0.0 value
type(df.year.iloc[6274])

numpy.float64

In [684]:
# Let's transform the 0.0 value into a NaN (a single .loc call avoids chained assignment)
df.loc[df.year == 0.0, 'year'] = np.nan

In [685]:
# Let's check again
df.year.value_counts()

2015.0    143
2017.0    136
2016.0    130
2011.0    128
2014.0    127
2008.0    122
2013.0    122
2009.0    120
2012.0    117
2007.0    112
2006.0    103
2005.0    103
2010.0    101
2000.0     97
1959.0     93
1960.0     93
2001.0     92
2003.0     92
2004.0     92
2002.0     88
1962.0     86
1961.0     78
1995.0     76
1964.0     66
1999.0     66
1998.0     65
1963.0     61
1996.0     61
1966.0     58
1997.0     57
         ... 
1733.0      1
1859.0      1
1822.0      1
1815.0      1
1703.0      1
1748.0      1
1807.0      1
1580.0      1
1801.0      1
1802.0      1
1834.0      1
1638.0      1
1637.0      1
1784.0      1
1755.0      1
1767.0      1
1819.0      1
1738.0      1
1753.0      1
1783.0      1
1742.0      1
1816.0      1
1555.0      1
500.0       1
1721.0      1
1771.0      1
1791.0      1
1554.0      1
1823.0      1
1786.0      1
Name: year, Length: 248, dtype: int64

In [686]:
df.year.isnull().sum()

127
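
Values such as 77.0 and 500.0 in the counts above also look implausible as years. A sketch of how such rows could be flagged for review, assuming a hypothetical lower cutoff of 1500 and an upper bound of 2018 (the most recent year shown in the data); both bounds are illustrative, not taken from the project:

```python
import numpy as np
import pandas as pd

# sample values taken from the year counts, plus one NaN
year = pd.Series([2018.0, 1957.0, 500.0, 77.0, np.nan])

earliest_plausible = 1500  # hypothetical cutoff, adjust after inspecting the rows
latest_plausible = 2018    # assumed upper bound

# between() is False for NaN, so missing years are excluded explicitly
suspicious = year.notnull() & ~year.between(earliest_plausible, latest_plausible)
flagged = year[suspicious]

print(flagged)
```

Flagging instead of overwriting leaves room to check each row's `date` field before deciding what to do with it.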

In [687]:
df2 = df.copy()

In [688]:
df2.index = df.year

In [689]:
df2

year,case_number,date,year,type,country,area,location,activity,name,sex_,age,injury,fatal_yn,time,species_,investigator_or_source
2018.0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF"
2018.0,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com"
2018.0,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com"
2018.0,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF"
2018.0,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper
2018.0,2018.06.03.b,03-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,"Flat Rock, Ballina",Kite surfing,Chris,M,,"No injury, board bitten",N,,,"Daily Telegraph, 6/4/2018"
2018.0,2018.06.03.a,03-Jun-2018,2018.0,Unprovoked,BRAZIL,Pernambuco,"Piedade Beach, Recife",Swimming,Jose Ernesto da Silva,M,18,FATAL,Y,Late afternoon,Tiger shark,"Diario de Pernambuco, 6/4/2018"
2018.0,2018.05.27,27-May-2018,2018.0,Unprovoked,USA,Florida,"Lighhouse Point Park, Ponce Inlet, Volusia County",Fishing,male,M,52,Minor injury to foot. PROVOKED INCIDENT,N,,"Lemon shark, 3'","K. McMurray, TrackingSharks.com"
2018.0,2018.05.26.b,26-May-2018,2018.0,Unprovoked,USA,Florida,"Cocoa Beach, Brevard County",Walking,Cody High,M,15,Lower left leg bitten,N,17h00,"Bull shark, 6'","K.McMurray, TrackingSharks.com"
2018.0,2018.05.26.a,26-May-2018,2018.0,Unprovoked,USA,Florida,"Daytona Beach, Volusia County",Standing,male,M,12,Minor injury to foot,N,14h00,,"K. McMurray, Tracking Sharks.com"
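
An index can also be set with `DataFrame.set_index`. A sketch on a two-row toy frame; `drop=False` keeps `year` as a regular column as well, which gives the same layout as the table above:

```python
import pandas as pd

toy = pd.DataFrame({
    "case_number": ["2018.06.25", "2018.06.18"],
    "year": [2018.0, 2018.0],
})

# returns a new frame; the original keeps its RangeIndex
by_year = toy.set_index("year", drop=False)

print(by_year)
```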


## Cleaning the right data -- and save the boat

In [579]:
# replace values starting from date

In [580]:
# function to value_counts() all the data

## Index: working with dates and time series

In [444]:
df[df.date.notnull()].shape

(6302, 16)
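
Before the `date` column can serve as a time index, its strings have to be parsed. A sketch with `pd.to_datetime`, using a few formats seen in the value counts earlier. It assumes that day-month-year entries follow `%d-%b-%Y` with English month abbreviations and that a leading `Reported ` can simply be removed:

```python
import pandas as pd

raw = pd.Series(["25-Jun-2018", "Reported 10-Oct-1906", "1957", "No date"])

# remove the 'Reported ' prefix so the remainder matches the expected format
stripped = raw.str.replace("Reported ", "", regex=False).str.strip()

# errors='coerce' turns anything that does not match the format into NaT
parsed = pd.to_datetime(stripped, format="%d-%b-%Y", errors="coerce")

print(parsed)
```

Year-only entries such as `1957` become `NaT` here; whether to keep them through the `year` column is a separate decision.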

In [445]:
df.case_number.iloc[8697]

'0'

In [None]:
df[df.year]

In [611]:
df[df.date.notnull()]

0       2018.0
1       2018.0
2       2018.0
3       2018.0
4       2018.0
5       2018.0
6       2018.0
7       2018.0
8       2018.0
9       2018.0
10      2018.0
11      2018.0
12      2018.0
13      2018.0
14      2018.0
15      2018.0
16      2018.0
17      2018.0
18      2018.0
19      2018.0
20      2018.0
21      2018.0
22      2018.0
23      2018.0
24      2018.0
25      2018.0
26      2018.0
27      2018.0
28      2018.0
29      2018.0
         ...  
6272       0.0
6273       0.0
6274       0.0
6275       0.0
6276       0.0
6277       0.0
6278       0.0
6279       0.0
6280       0.0
6281       0.0
6282       0.0
6283       0.0
6284       0.0
6285       0.0
6286       0.0
6287       0.0
6288       0.0
6289       0.0
6290       0.0
6291       0.0
6292       0.0
6293       0.0
6294       0.0
6295       0.0
6296       0.0
6297       0.0
6298       0.0
6299       0.0
6300       0.0
6301       0.0
Name: year, Length: 6301, dtype: float64

### Examine the data for potential issues.

In [318]:
# clean data
df.sex_.value_counts()

M      5094
F       637
N         2
M         2
.         1
lli       1
Name: sex_, dtype: int64

In [343]:
# cleaning typos
df[(df.sex_=='M') | (df.sex_=='F')].sex_.value_counts()

M    5094
F     637
Name: sex_, dtype: int64
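
Filtering rows as above discards the whole record when `sex_` has a typo. An alternative sketch keeps the rows and only blanks out the invalid entries. It assumes the second `M` in the counts differs by surrounding whitespace; the sample values are illustrative:

```python
import pandas as pd

# sample values modelled on the counts above ('M ' with a trailing space is an assumption)
sex = pd.Series(["M", "F", "M ", "N", ".", "lli", None])

# normalise whitespace and case first
cleaned = sex.str.strip().str.upper()

# where() keeps values that satisfy the condition and sets the rest to NaN
cleaned = cleaned.where(cleaned.isin(["M", "F"]))

print(cleaned.value_counts())
print(cleaned.isnull().sum())
```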

In [304]:
df_1.sex_.value_counts()

M     5094
F      637
N        2
M        2
.        1
Name: sex_, dtype: int64

In [278]:
# create a reference dictionary to categorize the data
#df.species_.value_counts()
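
One way such a reference dictionary could look is a keyword-to-label mapping applied to the free-text species field. This is only a sketch: the keywords, the labels and the `categorize` helper are hypothetical, and the sample strings are taken from the rows displayed earlier.

```python
import pandas as pd

species = pd.Series(["White shark", "Tiger shark, 3m", "Bull shark, 6'", "2 m shark", None])

# hypothetical reference dictionary: keyword -> category label
keywords = {"white": "White shark", "tiger": "Tiger shark", "bull": "Bull shark"}

def categorize(value):
    """Return the label of the first matching keyword (hypothetical helper)."""
    if pd.isnull(value):
        return value  # keep missing values missing
    lowered = value.lower()
    for key, label in keywords.items():
        if key in lowered:
            return label
    return "Unidentified"

category = species.apply(categorize)
print(category)
```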

In [30]:
# clean and categorize the data
df.activity.value_counts()

Surfing                                                                                                     971
Swimming                                                                                                    869
Fishing                                                                                                     431
Spearfishing                                                                                                333
Bathing                                                                                                     162
Wading                                                                                                      149
Diving                                                                                                      127
Standing                                                                                                     99
Snorkeling                                                                                              

In [256]:
df.type.value_counts()

Unprovoked      4595
Provoked         574
Invalid          547
Sea Disaster     239
Boating          203
Boat             137
Questionable       2
Boatomg            1
Name: type, dtype: int64
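
The counts show an obvious typo (`Boatomg`) and two labels that may mean the same thing (`Boat` and `Boating`). A sketch of a correction with `Series.replace` and a dictionary; merging `Boat` into `Boating` is an assumption that should be checked against the source before use:

```python
import pandas as pd

kind = pd.Series(["Unprovoked", "Boating", "Boat", "Boatomg", "Provoked"])

# hypothetical correction table: observed label -> intended label
fixes = {"Boatomg": "Boating", "Boat": "Boating"}

# a dict passed to replace() matches whole values, so 'Boating' itself is untouched
tidy = kind.replace(fixes)

print(tidy.value_counts())
```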

In [25]:
df.country.value_counts()

USA                             2229
AUSTRALIA                       1338
SOUTH AFRICA                     579
PAPUA NEW GUINEA                 134
NEW ZEALAND                      128
BRAZIL                           112
BAHAMAS                          109
MEXICO                            89
ITALY                             71
FIJI                              62
PHILIPPINES                       61
REUNION                           60
NEW CALEDONIA                     53
CUBA                              46
MOZAMBIQUE                        45
SPAIN                             44
INDIA                             40
EGYPT                             38
CROATIA                           34
JAPAN                             34
PANAMA                            32
SOLOMON ISLANDS                   30
IRAN                              29
JAMAICA                           27
FRENCH POLYNESIA                  25
GREECE                            25
HONG KONG                         24
I