# PANDAS PROJECT -- JUAN FERNÁNDEZ-DAZA

The goal of this project is to combine everything you have learned about data wrangling, 
cleaning, and manipulation with Pandas so you can see how it all works together. 
For this project, you will start with this messy data set Shark Attack. You will need to import it, 
use your data wrangling skills to clean it up, prepare it to be analyzed, and then export it as a clean CSV data file.

You will be working individually for this project, but we'll be guiding you along the process and helping you as you go. 
Show us what you've got!

# Instructions -- Technical Requirements

The dataset that we provide you is a significantly messy data set.
Apply the different cleaning and manipulation techniques you have learned.

Steps:
1. Import the data using Pandas.
2. Examine the data for potential issues.
3. Use at least 8 of the cleaning and manipulation methods you have learned on the data.
4. Produce a Jupyter Notebook that shows the steps you took and the code you used to clean and transform your data set.
5. Export a clean CSV version of your data using Pandas.

First step is to import the csv file -- however we don't know the encoding of the file, to do that we run this command 
in  the terminal --> file GSAF5.csv

In [102]:
import pandas as pd
import numpy as np
import matplotlib

In [103]:
sharks = pd.read_csv('GSAF5.csv',encoding='iso8859_15')
sharks.head() 

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.c,2016.09.18.c,5993,,
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.b,2016.09.18.b,5992,,
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.a,2016.09.18.a,5991,,
3,2016.09.17,17-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Surfing,Rory Angiolella,M,...,,"The Age, 9/18/2016",2016.09.17-Angiolella.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.17,2016.09.17,5990,,
4,2016.09.15,16-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Bells Beach,Surfing,male,M,...,2 m shark,"The Age, 9/16/2016",2016.09.16-BellsBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.16,2016.09.15,5989,,


In [104]:
# Lets do a quick examination of the file
f'''
The file contains 
{sharks.shape} rows and columns and {sharks.size} instances
'''

'\nThe file contains \n(5992, 24) rows and columns and 143808 instances\n'

In [105]:
# I am interested in knowing how many columns can I drop
null_cols = sharks.isnull().sum()
null_cols[null_cols > 0]

Country                     43
Area                       402
Location                   496
Activity                   527
Name                       200
Sex                        567
Age                       2681
Injury                      27
Fatal (Y/N)                 19
Time                      3213
Species                   2934
Investigator or Source      15
href formula                 1
href                         3
Unnamed: 22               5991
Unnamed: 23               5990
dtype: int64

# Method #1 drop the columns with more than 2k null lines --> nearly half of them

In [106]:
drop_cols = list(null_cols[null_cols > 2000].index)
sharks = sharks.drop(drop_cols, axis=1)

In [107]:
null_cols = sharks.isnull().sum()
null_cols[null_cols > 0]

Country                    43
Area                      402
Location                  496
Activity                  527
Name                      200
Sex                       567
Injury                     27
Fatal (Y/N)                19
Investigator or Source     15
href formula                1
href                        3
dtype: int64

In [108]:
# Obtaining the columns of the dataframe:
columns_sharks = list(sharks.columns.values)
print(columns_sharks)

['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location', 'Activity', 'Name', 'Sex ', 'Injury', 'Fatal (Y/N)', 'Investigator or Source', 'pdf', 'href formula', 'href', 'Case Number.1', 'Case Number.2', 'original order']


In [109]:
sharks.head(8)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Injury,Fatal (Y/N),Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,Minor injury to thigh,N,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.c,2016.09.18.c,5993
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,Lacerations to hands,N,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.b,2016.09.18.b,5992
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,Lacerations to lower leg,N,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.a,2016.09.18.a,5991
3,2016.09.17,17-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Surfing,Rory Angiolella,M,Struck by fin on chest & leg,N,"The Age, 9/18/2016",2016.09.17-Angiolella.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.17,2016.09.17,5990
4,2016.09.15,16-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Bells Beach,Surfing,male,M,No injury: Knocked off board by shark,N,"The Age, 9/16/2016",2016.09.16-BellsBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.16,2016.09.15,5989
5,2016.09.15.R,15-Sep-16,2016,Boat,AUSTRALIA,Western Australia,Bunbury,Fishing,Occupant: Ben Stratton,,Shark rammed boat. No injury to occupant,N,"West Australian, 9/15/2016",2016.09.15.R-boat.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.15.R,2016.09.15.R,5988
6,2016.09.11,11-Sep-16,2016,Unprovoked,USA,Florida,"Ponte Vedra, St. Johns County",Wading,male,M,Minor injury to arm,N,"News4Jax, 9/11/2016",2016.09.11-PonteVedra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.11,2016.09.11,5987
7,2016.09.07,07-Sep-16,2016,Unprovoked,USA,Hawaii,"Makaha, Oahu",Swimming,female,F,Severe lacerations to shoulder & forearm,N,"Hawaii News Now, 9/7/2016",2016.09.07-Oahu.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.07,2016.09.07,5986



# Method #2: check if the columns case number, case number.1 & case number.2 to see if we could drop duplicated columns

In [110]:
sharks['Check_CaseNumber'] = np.where((sharks['Case Number'] == sharks['Case Number.1']) & (sharks['Case Number.1'] == sharks['Case Number.2'])
                     , 1, 0)

In [111]:
sharks[['Check_CaseNumber']].sum()

Check_CaseNumber    5979
dtype: int64

Basically 99% of the rows are the same so we can drop these repeated columns too

In [112]:
sharks = sharks.drop(sharks[['Case Number.1','Case Number.2','Check_CaseNumber']], axis=1)

In [113]:
sharks.head(8)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Injury,Fatal (Y/N),Investigator or Source,pdf,href formula,href,original order
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,Minor injury to thigh,N,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5993
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,Lacerations to hands,N,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5992
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,Lacerations to lower leg,N,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5991
3,2016.09.17,17-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Surfing,Rory Angiolella,M,Struck by fin on chest & leg,N,"The Age, 9/18/2016",2016.09.17-Angiolella.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5990
4,2016.09.15,16-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Bells Beach,Surfing,male,M,No injury: Knocked off board by shark,N,"The Age, 9/16/2016",2016.09.16-BellsBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5989
5,2016.09.15.R,15-Sep-16,2016,Boat,AUSTRALIA,Western Australia,Bunbury,Fishing,Occupant: Ben Stratton,,Shark rammed boat. No injury to occupant,N,"West Australian, 9/15/2016",2016.09.15.R-boat.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5988
6,2016.09.11,11-Sep-16,2016,Unprovoked,USA,Florida,"Ponte Vedra, St. Johns County",Wading,male,M,Minor injury to arm,N,"News4Jax, 9/11/2016",2016.09.11-PonteVedra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5987
7,2016.09.07,07-Sep-16,2016,Unprovoked,USA,Hawaii,"Makaha, Oahu",Swimming,female,F,Severe lacerations to shoulder & forearm,N,"Hawaii News Now, 9/7/2016",2016.09.07-Oahu.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5986


# Method #3: check if there is data that doesnt make sense

In [114]:
activity = pd.Series(sharks.Activity.unique())
activity.head(4)

0     Surfing
1     Fishing
2      Wading
3    Swimming
dtype: object

In [115]:
year = pd.Series(sharks.Year.unique())
year.sort_values(ascending=True).head(20)

231       0
230       5
229      77
228     500
227    1543
226    1554
225    1555
224    1580
223    1595
222    1617
221    1637
220    1638
219    1642
218    1700
217    1703
216    1721
215    1733
214    1738
213    1742
212    1748
dtype: int64

In [116]:
# Since we want to create an app to locate the attacks, 
# the attacks that have happened before 1990 are not quite relevant to use, so we can drop the years before

In [117]:
sharks = sharks[(sharks['Year']>1990)]
sharks.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Injury,Fatal (Y/N),Investigator or Source,pdf,href formula,href,original order
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,Minor injury to thigh,N,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5993
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,Lacerations to hands,N,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5992
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,Lacerations to lower leg,N,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5991
3,2016.09.17,17-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Surfing,Rory Angiolella,M,Struck by fin on chest & leg,N,"The Age, 9/18/2016",2016.09.17-Angiolella.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5990
4,2016.09.15,16-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Bells Beach,Surfing,male,M,No injury: Knocked off board by shark,N,"The Age, 9/16/2016",2016.09.16-BellsBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5989


In [118]:
# Lets do a quick examination of the file
f'''
The file contains 
{sharks.shape} rows and columns and {sharks.size} instances
'''

'\nThe file contains \n(2385, 17) rows and columns and 40545 instances\n'

# Method #4: the Sex column contains a space, we need to rename it

In [119]:
sharks = sharks.rename(columns={'Sex ': 'Sex'}, inplace=False)

In [120]:
sex = pd.Series(sharks.Sex.unique())
sex

0      M
1    NaN
2      F
3     M 
4    lli
dtype: object

# Method #5: there are some sex specifications that can be renamed

In [121]:
sharks.loc[(sharks['Sex']=='M ')] = 'M'

In [122]:
sex = pd.Series(sharks.Sex.unique())
sex

0      M
1    NaN
2      F
3    lli
dtype: object

In [72]:
# Since the sex is important for our app, we are gonna drop those Nan or values that are diff from M/F

In [124]:
# Copy
sharks2 = sharks

In [127]:
sharks = sharks[(sharks['Sex'].isin(['M','F']))]

In [128]:
sex = pd.Series(sharks.Sex.unique())
sex

0    M
1    F
dtype: object

In [131]:
sharks = sharks.rename(columns={'Fatal (Y/N)': 'Fatal'}, inplace=False)
fatal = pd.Series(sharks.Fatal.unique())
fatal

0          N
1          Y
2        NaN
3          M
4    UNKNOWN
dtype: object

In [132]:
sharks = sharks[(sharks['Fatal'].isin(['Y','N']))]

In [133]:
# Lets do a quick examination of the file
f'''
The file contains 
{sharks.shape} rows and columns and {sharks.size} instances
'''

'\nThe file contains \n(2225, 17) rows and columns and 37825 instances\n'

# Method #6: Evaluate the countries with low attacks so we can discard them

In [168]:
sharks['N_Attacks'] = 1
sharks.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Injury,Fatal,Investigator or Source,pdf,href formula,href,original order,N_Attacks
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,Minor injury to thigh,N,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5993,1
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,Lacerations to hands,N,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5992,1
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,Lacerations to lower leg,N,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5991,1
3,2016.09.17,17-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Surfing,Rory Angiolella,M,Struck by fin on chest & leg,N,"The Age, 9/18/2016",2016.09.17-Angiolella.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5990,1
4,2016.09.15,16-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Bells Beach,Surfing,male,M,No injury: Knocked off board by shark,N,"The Age, 9/16/2016",2016.09.16-BellsBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5989,1


In [195]:
df_countries= sharks.loc[:,["Country",'N_Attacks']].groupby(["Country"], as_index =False).sum()
df_countries.head(5)

Unnamed: 0,Country,N_Attacks
0,TONGA,3
1,ANGOLA,1
2,ANTIGUA,1
3,ARUBA,1
4,ATLANTIC OCEAN,2


In [196]:
df_countries.columns

Index(['Country', 'N_Attacks'], dtype='object')

In [197]:
# I just want to keep the countries with less than 4 attacks because they are not relevant
df_countries = df_countries[(df_countries['N_Attacks'] < 4)]
df_countries.head()

Unnamed: 0,Country,N_Attacks
0,TONGA,3
1,ANGOLA,1
2,ANTIGUA,1
3,ARUBA,1
4,ATLANTIC OCEAN,2


In [202]:
# List of the countries to ignore
countries = [coun for coun in df_countries['Country']]

In [208]:
# Ignoring the countries
sharks = sharks[~sharks['Country'].isin(countries)]

In [210]:
# Drop the N_Attacks column I've created
sharks = sharks.drop(sharks[['N_Attacks']], axis=1)

In [211]:
# Lets do a quick examination of the file
f'''
The file contains 
{sharks.shape} rows and columns and {sharks.size} instances
'''

'\nThe file contains \n(2128, 17) rows and columns and 36176 instances\n'

# Method #7: Re-naming the colums to make the data more understandable

In [212]:
sharks.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex', 'Injury', 'Fatal', 'Investigator or Source',
       'pdf', 'href formula', 'href', 'original order'],
      dtype='object')

In [213]:
# First lets drop href formula
sharks = sharks.drop(sharks[['href formula']], axis=1)

In [215]:
sharks = sharks.rename(columns={'href': 'Link','pdf':'PDF_news'}, inplace=False)

# Method #8: Finally dropping duplicates

In [216]:
before = len(sharks)
sharks = sharks.drop_duplicates()
after = len(sharks)
print('Number of duplicate records dropped: ', str(before - after))

Number of duplicate records dropped:  0


# Finally we are in conditions to write the clean file

In [217]:
sharks.to_csv('Sharks_clean.csv', sep=';', index = False)