<img src="https://www.bisnia.es/wp-content/uploads/2019/10/limpieza-datos-870x466.jpg">

# Data Cleanup

### Import the necessary libraries

In [55]:
import pandas as pd
import src.limpieza_texto as lt

### Import the dataframe

In [56]:
original_data = pd.read_csv("data/attacks.csv", encoding="ISO-8859-1")

In [57]:
original_data.sample(5)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
25687,,,,,,,,,,,...,,,,,,,,,,
13802,,,,,,,,,,,...,,,,,,,,,,
24424,,,,,,,,,,,...,,,,,,,,,,
10522,,,,,,,,,,,...,,,,,,,,,,
21333,,,,,,,,,,,...,,,,,,,,,,


The sample shows that some rows contain no data at all.

In [58]:
original_data.isnull().sum()

Case Number               17021
Date                      19421
Year                      19423
Type                      19425
Country                   19471
Area                      19876
Location                  19961
Activity                  19965
Name                      19631
Sex                       19986
Age                       22252
Injury                    19449
Fatal (Y/N)               19960
Time                      22775
Species                   22259
Investigator or Source    19438
pdf                       19421
href formula              19422
href                      19421
Case Number.1             19421
Case Number.2             19421
original order            19414
Unnamed: 22               25722
Unnamed: 23               25721
dtype: int64
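
The per-column counts above do not say how many rows are empty across every column at once. A minimal sketch of that check, using a small made-up frame (the values `A.1`, `A.3` and the two-column layout are assumptions for illustration, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data: rows 1 and 3 are completely empty.
toy = pd.DataFrame(
    {
        "Case Number": ["A.1", np.nan, "A.3", np.nan],
        "Country": ["USA", np.nan, np.nan, np.nan],
    }
)

# isnull().sum() counts the missing values in each column.
missing_per_column = toy.isnull().sum()

# isnull().all(axis=1) flags the rows in which every column is missing.
empty_rows = toy.isnull().all(axis=1)
n_empty_rows = int(empty_rows.sum())

print(missing_per_column)
print("completely empty rows:", n_empty_rows)
```

Running the same two lines on `original_data` would show how many rows carry no information at all.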

### The first step is to delete the completely empty rows.
We use the `dropna` method with `how="all"` to delete only the rows in which every value is missing.
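
As a small illustration of why `how="all"` is used here rather than the default `how="any"`, the sketch below applies both to a made-up three-row frame (the toy values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Case Number": ["A.1", np.nan, "A.3"],
        "Country": ["USA", np.nan, np.nan],
    }
)

# how="all": drop a row only when every value in it is missing.
only_empty_dropped = toy.dropna(axis=0, how="all")

# how="any": drop a row as soon as a single value is missing.
any_missing_dropped = toy.dropna(axis=0, how="any")

print(only_empty_dropped)
print(any_missing_dropped)
```

With `how="any"` the partially filled row would be lost as well, which is too aggressive for a dataset where most records have at least one missing field.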

In [59]:
data_filtered_1 = original_data.dropna(axis=0, how="all")
print(data_filtered_1.shape)
data_filtered_1.tail()

(8703, 24)


Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
8698,0,,,,,,,,,,...,,,,,,,,,,
8699,0,,,,,,,,,,...,,,,,,,,,,
8700,0,,,,,,,,,,...,,,,,,,,,,
8701,0,,,,,,,,,,...,,,,,,,,,,
25722,xx,,,,,,,,,,...,,,,,,,,,,


As we can see, there are some rows where the `Case Number` value is `0` and the rest of the values are `NaN`.
First we create a new variable and select the rows of the dataframe where the `Case Number` value is different from `0`.
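
The source of `lt.valores_distintos` is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming it simply filters with a boolean mask (the name `valores_distintos_sketch` and the toy frame are hypothetical):

```python
import pandas as pd


def valores_distintos_sketch(df: pd.DataFrame, column: str, value) -> pd.DataFrame:
    """Return the rows of df whose `column` is different from `value`."""
    # Hypothetical stand-in for lt.valores_distintos; the real helper may differ.
    return df[df[column] != value]


toy = pd.DataFrame(
    {
        "Case Number": ["A.1", "0", "0", "xx"],
        "Year": [2000.0, None, None, None],
    }
)

kept = valores_distintos_sketch(toy, "Case Number", "0")
print(kept)
```

Note that the comparison is against the string `"0"`, which matches how the column is read from the CSV.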

In [64]:
data_filtered_2 = lt.valores_distintos(data_filtered_1, "Case Number", "0")
data_filtered_2.tail(5)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
6298,ND.0004,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,Ahmun,M,...,,"H. Taunton; N. Bartlett, pp. 233-234",ND-0004-Ahmun.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0004,ND.0004,5.0,,
6299,ND.0003,1900-1905,0.0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,M,...,,"F. Schwartz, p.23; C. Creswell, GSAF",ND-0003-Ocracoke_1900-1905.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0003,ND.0003,4.0,,
6300,ND.0002,1883-1889,0.0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,Jules Patterson,M,...,,"The Sun, 10/20/1938",ND-0002-JulesPatterson.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0002,ND.0002,3.0,,
6301,ND.0001,1845-1853,0.0,Unprovoked,CEYLON (SRI LANKA),Eastern Province,"Below the English fort, Trincomalee",Swimming,male,M,...,,S.W. Baker,ND-0001-Ceylon.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0001,ND.0001,2.0,,
25722,xx,,,,,,,,,,...,,,,,,,,,,


As we can see, there is a row at the end of the dataframe where the `Case Number` value is `xx` and the rest of the values are `NaN`, so I delete that row using the `.drop()` method and passing its index label.
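
For comparison, here is a small sketch showing that dropping by index label and filtering on the value give the same result on a made-up frame (the toy values are assumptions; only the label `25722` and the marker `xx` come from the output above):

```python
import pandas as pd

toy = pd.DataFrame({"Case Number": ["A.1", "A.2", "xx"]}, index=[0, 1, 25722])

# Drop by index label, as in the next cell.
by_label = toy.drop(index=25722)

# Same result without hard-coding the label: filter on the value itself.
by_value = toy[toy["Case Number"] != "xx"]

print(by_label.equals(by_value))
```

Filtering on the value is slightly more robust if the file is re-downloaded and the row lands at a different index.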

In [65]:
data_filtered_3 = data_filtered_2.drop(index=25722)

In [66]:
data_filtered_3.sample(3)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
2017,2000.09.00.b,Early Sep-2000,2000.0,Unprovoked,TANZANIA,,"Coco Beach, Dar-es-Salaam",Swimming,,,...,Thought to involve a Zambesi shark,T. Thierry,2000.09.00-CocoBeach4.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2000.09.00.b,2000.09.00.b,4286.0,,
2651,1989.12.02,02-Dec-1989,1989.0,Invalid,AUSTRALIA,Queensland,Fraser Island,Swimming,Michael Preston,M,...,Shark involvement prior to death was not confi...,"The Advertiser, 12/4/1989, p.2",1989.12.02-Preston.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,1989.12.02,1989.12.02,3652.0,,
4096,1959.05.03,03-May-1959,1959.0,Unprovoked,USA,Florida,"Panama City, Bay County",Spearfishing,Ernest Grover,M,...,"Tiger shark, 3.7 m [12']","Miami Herald, 5/5/1959; T. Helm, pp. 93 & 243",1959.05.03-Grover.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,1959.05.03,1959.05.03,2207.0,,


There are also some columns, `pdf`, `href formula`, `Unnamed: 22` and `Unnamed: 23`, which I do not consider useful.

The column `original order` is not useful either, because it is the same as the index but inverted.

I used a personal function called `elim_columnas()` to delete all the columns I have selected in a list called `eliminar`.
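
The body of `lt.elim_columnas` is not included in this notebook. As a rough sketch of the same idea, assuming a helper that returns a new dataframe instead of taking a target variable as a third argument (the name `elim_columnas_sketch` and the toy frame are hypothetical):

```python
import pandas as pd


def elim_columnas_sketch(df: pd.DataFrame, columnas: list) -> pd.DataFrame:
    """Return a copy of df without the listed columns."""
    # Hypothetical stand-in for lt.elim_columnas; the real helper may differ.
    return df.drop(columns=columnas)


toy = pd.DataFrame(
    {
        "Case Number": ["A.1"],
        "pdf": ["a.pdf"],
        "original order": [2.0],
    }
)

trimmed = elim_columnas_sketch(toy, ["pdf", "original order"])
print(list(trimmed.columns))
```

Returning the result keeps the original dataframe untouched, so each filtering step can still be inspected afterwards.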

In [72]:
eliminar = ["Unnamed: 22","Unnamed: 23", "pdf", "href formula", "original order"]
lt.elim_columnas(data_filtered_3, eliminar, data_filtered_4)
data_filtered_4.sample(5)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,href,Case Number.1,Case Number.2
2173,1998.08.12,12-Aug-1998,1998.0,Provoked,SOUTH AFRICA,Transvaal,"National Zoological Gardens Aquarium, Pretoria",Moving a shark in a net,Kobus Goosen,M,,Lacerations to right shin PROVOKED INCIDENT,N,,"Sandtiger shark, 2 m, male","Daily Dispatch, 8/13/1998",http://sharkattackfile.net/spreadsheets/pdf_di...,1998.08.12,1998.08.12
4632,1944.12.02,02-Dec-1944,1944.0,Provoked,USA,Florida,3 miles north of the inlet at Palm Beach,Fishing for mackerel,"28' sea skiff, occupants: Alan Moree and anoth...",,,No injury to occupants. After being prodded wi...,N,,"Tiger shark, 4.5 to 5.5 m [14'9"" to 18'], 2000...","New York Times, 12/3/1944, III, p.2, col.7",http://sharkattackfile.net/spreadsheets/pdf_di...,1944.12.02,1944.12.02
4265,1956.00.00.h,1956,1956.0,Unprovoked,PAPUA NEW GUINEA,Madang Province,"Singour, 60 miles south of Madang",Diving,native boy,M,,Lower leg & foot lacerated,N,,,H.D. Baldridge,http://sharkattackfile.net/spreadsheets/pdf_di...,1956.00.00.h,1956.00.00.h
3084,1979.12.00,05-Dec-1979,1979.0,Invalid,PORTUGAL,Madeira Islands,"Sao Jorge, Madeira Island",Spearfishing,Fernando Branco de Abreu,M,19.0,FATAL,Y,,White shark?,C. Moore. GSAF,http://sharkattackfile.net/spreadsheets/pdf_di...,1979.12.00,1979.12.00
5861,1877.01.24.R,Reported 24-Jan-1877,1877.0,Unprovoked,AUSTRALIA,Victoria,Boarding School Bay,Swimming,female,F,,Ankle injured,N,,6' shark,"The Argus, 1/24/1877",http://sharkattackfile.net/spreadsheets/pdf_di...,1877.01.24.R,1877.01.24.R


#### At the end of the dataframe there are two columns, `Case Number.1` and `Case Number.2`, so the next step is to check whether their values differ.
I have created a function called `check_columns` that finds the rows where the values of the two selected columns are not equal.
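
The implementation of `lt.check_columns` is not shown here. A minimal sketch of one way to do it, assuming a plain element-wise comparison (the name `check_columns_sketch` and the toy values are hypothetical):

```python
import pandas as pd


def check_columns_sketch(df: pd.DataFrame, col_a: str, col_b: str) -> pd.DataFrame:
    """Return the rows where col_a and col_b hold different values."""
    # Hypothetical stand-in for lt.check_columns; the real helper may differ.
    # Caveat: NaN != NaN is True, so rows missing in both columns are also returned.
    return df[df[col_a] != df[col_b]]


toy = pd.DataFrame(
    {
        "Case Number.1": ["A.1", "B.2", "C.3"],
        "Case Number.2": ["A.1", "B.2.a", "C.3"],
    }
)

mismatches = check_columns_sketch(toy, "Case Number.1", "Case Number.2")
print(mismatches)
```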

In [74]:
lt.check_columns(data_filtered_4, "Case Number.1", "Case Number.2").sample(3)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,href,Case Number.1,Case Number.2
5275,1900.00.00.R,Reported to have taken place in 1919,1919.0,Boating,ITALY,,Savona,Fishing,,M,,No injury,N,,13' shark,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...,1919.00.00.R,1900.00.00.R
339,2015.10.28.a,28-Oct-2015,2015.0,Unprovoked,USA,Hawaii,"Malaka, Oahu",Body boarding,Raymond Senensi,M,10.0,"Lacerations & puncture wounds to right thigh, ...",N,14h50,,"Star Advertiser, 10/28/2015",http://sharkattackfile.net/spreadsheets/pdf_di...,2015.10.28,2015.10.28.a
4425,1851.12.15.R,\n1951.12.15.R,1951.0,Invalid,MOZAMBIQUE,Sofala Province,Beira,Swimming,Antonio De Almeida Pinto,M,,No injury,,,Invalid,GSAF,http://sharkattackfile.net/spreadsheets/pdf_di...,1951.12.15.R,1851.12.15.R


After analyzing these two columns, we can see that they hold the same value as the `Case Number` column, differing only in a few characters, so I decided to delete them.

In [76]:
eliminar_2 = ["Case Number.1", "Case Number.2"]
lt.elim_columnas(data_filtered_4, eliminar_2, data_filtered_5)
data_filtered_5.sample(3)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,href
4622,1945.07.00,Jul-1945,1945.0,Sea Disaster,JAVA,Northern Java,Off Cheribon,"90 European civilians, many women & children, ...",Swimming,,,Sharks attacked the swimmers. The sole survivo...,Y,Dusk,,G. Duncan,http://sharkattackfile.net/spreadsheets/pdf_di...
422,2015.05.24,24-May-2015,2015.0,Unprovoked,USA,Florida,"Cocoa Beach, Brevard County",Swimming,Alysa Whetro,F,13.0,"Puncture wounds to lower left leg and ankle, s...",N,,,"WFTV, 5/27/2015",http://sharkattackfile.net/spreadsheets/pdf_di...
2751,1988.02.02,02-Feb-1988,1988.0,Unprovoked,Fiji,Vanua Levu,,Diving,Qalo Moceyawa,M,22.0,Lacerations to left arm & waist,N,,"Tiger shark, 3 m","Sun, 2/5/1988, p.15",http://sharkattackfile.net/spreadsheets/pdf_di...


## Hypothesis 1

I studied how many distinct investigators or sources there are (column `Investigator or Source`) and how many cases each one has reported. Is there a shark attack specialist?

In [78]:
data_filtered_5["Investigator or Source"].unique()

array(['R. Collier, GSAF', 'K.McMurray, TrackingSharks.com',
       'B. Myatt, GSAF', ..., 'F. Schwartz, p.23; C. Creswell, GSAF',
       'The Sun, 10/20/1938', 'S.W. Baker'], dtype=object)

In [35]:
data_filtered_5["Investigator or Source"].value_counts()

C. Moore, GSAF                                               105
C. Creswell, GSAF                                             92
S. Petersohn, GSAF                                            82
R. Collier                                                    55
R. Collier, GSAF                                              48
                                                            ... 
Coppleson (1962), p.247                                        1
Geraldton Guardian, 6/18/1925                                  1
V.M. Coppleson (1958), pp.67 & 233;  A. Sharpe, pp.89-90]      1
Coppleson.V1. (1933); Melbourne Herald, 1/7/1925               1
New Zealand Herald, 10/22/2017                                 1
Name: Investigator or Source, Length: 4969, dtype: int64
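
The counts above list `R. Collier` and `R. Collier, GSAF` as separate entries. A possible refinement, sketched on a made-up series and assuming that both spellings refer to the same investigator (that assumption is mine, not something the data confirms), is to strip the trailing `, GSAF` before counting:

```python
import pandas as pd

# Toy series; the real column has thousands of distinct values.
sources = pd.Series(
    ["C. Moore, GSAF", "R. Collier", "R. Collier, GSAF", "R. Collier, GSAF"],
    name="Investigator or Source",
)

# Assumption: a trailing ", GSAF" is only an affiliation tag, so removing it
# lets both spellings of the same name be counted together.
normalized = sources.str.replace(r",\s*GSAF$", "", regex=True).str.strip()
counts = normalized.value_counts()

print(counts)
```

Entries that combine several sources with `;` would need further splitting, which this sketch does not attempt.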

In [33]:
data_filtered_5[data_filtered_5["Investigator or Source"] == "C. Moore, GSAF"].sample(10)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,href
5489,1905.08.24,24-Aug-1905,1905.0,Invalid,EGYPT,Suez Canal,Port Said,Human head found in shark caught by British st...,,M,,Probable drowning & scavenging.,,,"Tiger shark, 3.9 m","C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
4467,1950.07.19,1950.07.19,1950.0,Provoked,ITALY,Savona,Albenga,Fishing,male,,,Harpooned shark bit his forehead PROVOKED INCI...,N,,,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
4826,1938.00.00.e.R,Reported 1938,1938.0,Unprovoked,EGYPT,,Mersa Matruh,Sponge diving,males,M,,FATAL,Y,,,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
5337,1914.06.12,13-Jun-1914,1914.0,Boating,MONTENEGRO,Adriatic Sea,Between St. Stjepan and Budva,Fishing boat,Occupants: Ivan Angjus & Stevo Kentera,M,,"No injury, shark bit paddle and stern of boat",N,07h00,,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
6148,1764.00.00,1764,1764.0,Unprovoked,SPAIN,,Guadalquivir River,Swimming,male,M,,FATAL,Y,,,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
3810,1962.07.03.R,Reported 03-Jul-1962,1962.0,Invalid,GREECE,Cyclades,Near Mykonos Island,,Boat with tourists onboard,,,No injury,,,Questionable incident,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
3513,1967.08.25,25-Aug-1967,1967.0,Unprovoked,ITALY,Liguria,"Marinella Sarzana, La Spezia",Spearfishing on Scuba,Gian Paolo Porta Casucci,M,,Minor injuries to face & forearm,N,,,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
4949,1934.01.08.R,Reported 08-Feb-1934,1934.0,Boating,TURKEY,Istanbul,"Haydarpasa jetty, Istanbul",Fishing,2 males,M,,No injury,N,,,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
5939,1864.09.18.R,Reported 18-Sep-1864,1864.0,Provoked,FRANCE,Alpes Maritime,Antibes,Dragging a shark,fisherman,M,,Knee bitten PROVOKED INCIDENT,N,,1.5 m shark,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
5044,1930.05.11.R,Reported 11-May-1930,1930.0,Boating,TURKEY,,Ye?ilköy,Fishing,small boat. Occupants: 2 Englishmen,M,,No injury but shark damaged boat,N,,,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...


In [47]:
import src.limpieza_texto as lt

In [48]:
lt.valores_iguales(data_filtered_5, "Investigator or Source", "C. Moore, GSAF")

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,href
136,2017.06.07.R,Reported 07-Jun-2017,2017.0,Unprovoked,UNITED KINGDOM,South Devon,Bantham Beach,Surfing,Rich Thomson,M,30,"Bruise to leg, cuts to hand sustained when he ...",N,,"3m shark, probably a smooth hound","C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
2718,1988.08.22.a,22-Aug-1988,1988.0,Unprovoked,ITALY,Manfredonia,Ippocampo,,male,M,16,Survived,N,,,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
3183,1976.06.02.R,Reported 02-Jun-1976,1976.0,Provoked,ITALY,Reggio Calabria Province,Bovalino,Fishing,Francisco Pelle,M,46,Shark rammed boat PROVOKED INCIDENT,N,,"Blue shark, 2m","C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
3229,1975.04.25,25-Apr-1975,1975.0,Invalid,ITALY,Genoa Province,Cervara,Scuba diving,Walter Sansoni,M,37,The press reported this as an attack by a whit...,,,Invalid,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
3513,1967.08.25,25-Aug-1967,1967.0,Unprovoked,ITALY,Liguria,"Marinella Sarzana, La Spezia",Spearfishing on Scuba,Gian Paolo Porta Casucci,M,,Minor injuries to face & forearm,N,,,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6155,1742.12.17,17-Dec-1742,1742.0,Unprovoked,,,Carlisle Bay,Swimming,2 impressed seamen,M,,FATAL,Y,,,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
6156,1738.04.06.R,Reported 06-Apr-1738,1738.0,Unprovoked,ITALY,Sicily,Strait of Messina,Swimming,male,M,,FATAL,Y,,,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
6193,ND-0134,Between 1951-1963,0.0,Unprovoked,GREECE,,,Swimming,Martha Hatagouei,F,,FATAL,Y,,,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...
6196,ND-0130,Before 1876,0.0,Unprovoked,LEBANON,,,Collecting fish,Kahlifeh,M,,Posterior thigh bitten,N,,,"C. Moore, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...


### Last step: export the dataset to the data folder as `midatasetlimpio.csv`

In [81]:
shrk = data_filtered_5
shrk.sample(5)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,href
1904,2001.09.08,08-Sep-2001,2001.0,Provoked,USA,Florida,"Everglades National Park, Monroe County",Fishing,male,M,44.0,Fingers & leg lacerated by hooked shark PROVO...,N,,,"Orlando Sentinel, 9/9/2001",http://sharkattackfile.net/spreadsheets/pdf_di...
2091,1999.11.00.b,Nov-1999,1999.0,Unprovoked,MARSHALL ISLANDS,Alinglaplap Atoll,Island J4H,Swimming,Dally Bayo,M,12.0,Lacerations to leg,N,,"Grey reef shark, 1.2 m [4']",www.svcherokee.com/pages/ Ailingilaplap.htm,http://sharkattackfile.net/spreadsheets/pdf_di...
2143,1999.00.00.b,1999,1999.0,Unprovoked,SOUTH AFRICA,Western Cape Province,"Hospital Rock, Dyers Island",Spearfishing,Healy Lootz,M,30.0,Heel lacerated,N,12h00,"White shark, 4.6 m [15']",V. Van der Merwe,http://sharkattackfile.net/spreadsheets/pdf_di...
2233,1997.08.11.b,11-Aug-1997,1997.0,Unprovoked,EGYPT,Red Sea,Safaga,Fishing,Nagah Attalah Al Sayed,M,17.0,Seriously injured,N,,,"Middle East Times, 8/15/1997",http://sharkattackfile.net/spreadsheets/pdf_di...
4748,1941.02.01,01-Feb-1941,1941.0,Provoked,JAMAICA,Trelawney Province,Bogue (near Falmouth),Seine netting,Albert Buchanan,M,,"Left knee, calf & heel bitten by shark trapped...",N,07h30,,"Daily Gleaner, 2/3/1941, p. 1",http://sharkattackfile.net/spreadsheets/pdf_di...


In [84]:
# Forward slashes avoid the invalid "\D" and "\m" escape sequences and work on every OS.
shrk.to_csv("./Data/midatasetlimpio.csv")
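
By default `to_csv` also writes the index, which comes back as an extra unnamed column when the file is read again. A small sketch of an export that avoids this, using a temporary folder and a made-up frame so it can run anywhere (the toy values and the temporary path are assumptions for illustration):

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"Case Number": ["A.1", "A.2"], "Country": ["USA", "SPAIN"]})

with tempfile.TemporaryDirectory() as tmp:
    # pathlib builds the path with the right separator for the current OS.
    out_path = Path(tmp) / "midatasetlimpio.csv"

    # index=False keeps the row labels out of the file.
    toy.to_csv(out_path, index=False)

    # Read the file back to confirm that only the real columns were written.
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```

Whether to keep the index depends on the next notebook: here the original row labels are not meaningful once rows have been dropped, so leaving them out is a reasonable choice.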