# Ironhack - Data Analytics Bootcamp
___________________________________________________________________________________________________________________________

## Project 2 - Shark Attacks

Data Cleaning and Manipulation

### Main Objectives

The dataset provided by Ironhack is very messy. Your job is to apply data cleaning and manipulation techniques to produce a cleaner CSV version of this data.

<img src="./objectives.jpg" width="600px"/>

### Main Objectives - HINT: Stay focused

Ask a question of the dataset and try to answer it using the data. This will keep your cleaning and data manipulation focused throughout the process.

### More Objectives

In the process of answering that question, we believe you’ll feel the need for some tools like:
- Visualization techniques;
- Statistics;
- More data sources.

All of which we’ll handle in the near future.

### The Data Set

- Go to kaggle.com and create an account;
- Go to the search bar and look for ‘Global Shark Attacks’;
- Download the data set;
- For more info: https://www.sharkattackfile.net.

### Deliverables

- A clean CSV file on your GitHub account;
- The URL of the file on your GitHub. It should be readable using "pd.read_csv(url)";
- The link to the Jupyter notebook (or the GitHub project).
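
For `pd.read_csv(url)` to work, the URL has to return the CSV itself rather than the GitHub web page that displays it. A minimal sketch, assuming the file lives in a public repository; the helper name `to_raw_github_url` and the example repository path are hypothetical:

```python
from urllib.parse import urlparse


def to_raw_github_url(url: str) -> str:
    """Turn a github.com '/blob/' page URL into its raw.githubusercontent.com form."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    # Expected shape: <user>/<repo>/blob/<branch>/<path to file>
    if parsed.netloc == "github.com" and len(parts) >= 5 and parts[2] == "blob":
        user, repo, _, *rest = parts
        return "https://raw.githubusercontent.com/" + "/".join([user, repo, *rest])
    # Anything else is returned unchanged
    return url


# Hypothetical repository path, for illustration only
raw_url = to_raw_github_url("https://github.com/user/repo/blob/main/data.csv")
```

The resulting `raw_url` is the kind of address that can be passed directly to `pd.read_csv`.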

### Deadline

- The same day.

### Methodology:

When we analyzed this data set, we decided that the best approach would be to organize it by the date on which the accidents occurred. Besides making past cases easier to search, this shows the times of the year and the periods of the day when accidents usually occur. We therefore followed these steps:

- importing the modules "pandas", "numpy", "re" and "datetime";
- reading the file with pandas and inspecting it (head, info and shape);
- cleaning the data set by dropping near-duplicate columns and the columns and rows with little or no data;
- date treatment with the creation of three columns containing the years, days and months of the accidents;
- time treatment by standardizing the format of the time when the accidents occurred;
- creation of a new column with four day periods (morning, afternoon, night and "wee hours");
- standardization of the sex field for better visualization and manipulation of data: "M" for male and "F" for female;
- cleaning and changing formats of the field "age" for better visualization and manipulation of data;
- changing the field "fatal" to "True" or "False", for better manipulation of the data; and
- organizing all the columns of the data set in a more logical way, such as placing "activity", "type", "injury" and "fatal" next to each other.

### Technologies used:

- Python
- Pandas

***

## Collaborators:

- Marcus Brandão

- Pedro Di Gianni
***

## Reading file:

In [None]:
import pandas as pd
import numpy as np
import re
import datetime

In [None]:
pd.set_option('display.max_columns', 24)

In [None]:
# Import comma-separated file

sharkattack = pd.read_csv('./attacks.csv', sep=',', encoding='latin-1')

In [None]:
sharkattack.info()

In [None]:
sharkattack.head(100)

## Cleaning data frame:

### - Dropping NaNs

In [None]:
#dropping rows that are almost entirely empty (more than 19 NaN values)

mask = sharkattack.isnull().sum(axis=1) > 19

list_to_drop = sharkattack.loc[mask, :].index

sharkattack = sharkattack.drop(list_to_drop)
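
The same filter can be written with `DataFrame.dropna` and its `thresh` parameter, which sets the minimum number of non-missing values a row needs in order to be kept. A small sketch on made-up data; the helper name `drop_sparse_rows` is an assumption, not part of the notebook:

```python
import numpy as np
import pandas as pd


def drop_sparse_rows(df, max_missing):
    """Keep only the rows that have at most `max_missing` NaN values."""
    # A row survives when it has at least (number of columns - max_missing) real values
    return df.dropna(thresh=df.shape[1] - max_missing)


toy = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [2, 2, np.nan],
    "c": [3, np.nan, np.nan],
})
cleaned = drop_sparse_rows(toy, max_missing=1)  # only the first row is kept
```

For the shark attack data, `drop_sparse_rows(sharkattack, max_missing=19)` should keep the same rows as the mask above.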

In [None]:
sharkattack = sharkattack.drop(columns=['Unnamed: 22', 'Unnamed: 23'])

In [None]:
sharkattack.shape

In [None]:
sharkattack.info()

In [None]:
sharkattack.head(10)

### - Dropping similar columns:

In [None]:
(sharkattack['href formula'] == sharkattack['href']).mean()

In [None]:
#Dropping column "href formula", because it is 99% equal to "href"

sharkattack = sharkattack.drop(columns='href formula')

In [None]:
(sharkattack['Case Number.2'] == sharkattack['Case Number.1']).mean()

In [None]:
#Dropping column "Case Number.2", because it is 99% equal to "Case Number.1"

sharkattack = sharkattack.drop(columns='Case Number.2')

In [None]:
(sharkattack['Case Number'] == sharkattack['Case Number.1']).mean()

In [None]:
#Dropping column "Case Number", because it is 99% equal to "Case Number.1"

sharkattack = sharkattack.drop(columns='Case Number')
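
The similarity checks in this section follow one pattern: compute the share of rows where two columns agree, then look at the rows where they differ before dropping one of them. A sketch on made-up data, with the hypothetical helper `compare_columns`:

```python
import pandas as pd


def compare_columns(df, left, right):
    """Return the share of matching rows and the rows where the two columns differ."""
    # NaN never equals NaN, so rows missing in both columns count as different
    same = df[left] == df[right]
    return same.mean(), df.loc[~same, [left, right]]


toy = pd.DataFrame({"x": ["a", "b", "c", "d"], "y": ["a", "b", "c", "z"]})
share, differing = compare_columns(toy, "x", "y")
```

Inspecting `differing` before the drop shows whether the few mismatches carry information worth keeping.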


In [None]:
sharkattack.info()

## Data Treatment:

### - Time

In [None]:
# Identifying in which cells the data starts with a number, so we can turn this series into an integer
sharkattack["Time"]=sharkattack["Time"].astype(str)
sharkattack["Time"]=sharkattack["Time"].apply(lambda x: re.findall(r"^\d{,2}",x)[0:2])

In [None]:
sharkattack["Time"]=sharkattack["Time"].apply(lambda x: x[0])

In [None]:
nullnumbers = sharkattack["Time"]==""
sharkattack.loc[nullnumbers,"Time"]=-1
sharkattack.head(5)
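
The hour extraction can also be expressed as a single function. This is a sketch, assuming time strings that begin with the hour (the sample values are made up); the helper name `extract_hour` and the rejection of values above 23 are assumptions, not part of the notebook:

```python
import re


def extract_hour(value):
    """Return the leading hour (0-23) of a time string, or None if there is none."""
    match = re.match(r"\d{1,2}", str(value).strip())
    if match is None:
        return None  # text such as "Late afternoon", or a missing value
    hour = int(match.group())
    # Assumption: a leading number above 23 is not an hour of the day
    return hour if hour <= 23 else None
```

It could be applied with `sharkattack["Time"].apply(extract_hour)`, which yields missing values directly instead of the temporary `-1` marker.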

In [None]:
sharkattack["Day Period"]=sharkattack["Time"].astype(int)

In [None]:
#Creating Day Period based on the criteria below
nodata = sharkattack["Day Period"]==-1
wee_hours = (sharkattack["Day Period"]>=0) & (sharkattack["Day Period"]<6)
morning = (sharkattack["Day Period"]>=6) & (sharkattack["Day Period"]<12)
afternoon = (sharkattack["Day Period"]>=12) & (sharkattack["Day Period"]<18)
night = (sharkattack["Day Period"]>=18)

In [None]:
sharkattack.loc[nodata,"Day Period"]="No Data"
sharkattack.loc[morning,"Day Period"]="Morning"
sharkattack.loc[afternoon,"Day Period"]="Afternoon"
sharkattack.loc[night,"Day Period"]="Night"
sharkattack.loc[wee_hours,"Day Period"]="Wee hours"

In [None]:
sharkattack["Day Period"].value_counts()
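
The period boundaries used above can be collected in one function, which makes the criteria easier to read and to test. A minimal sketch; the function name `day_period` is hypothetical, while the boundaries and labels are the ones from the masks above:

```python
def day_period(hour):
    """Map an hour of the day to the period labels used in this notebook."""
    if hour is None or hour < 0:
        return "No Data"      # -1 is the notebook's marker for a missing time
    if hour < 6:
        return "Wee hours"    # 0 to 5
    if hour < 12:
        return "Morning"      # 6 to 11
    if hour < 18:
        return "Afternoon"    # 12 to 17
    return "Night"            # 18 and later
```

Applied to an integer column, for example with `.apply(day_period)`, it yields the same labels as the masks.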

In [None]:
sharkattack.loc[nullnumbers,"Time"]=np.nan

In [None]:
sharkattack.head(8)

### - Age

In [None]:
sharkattack.groupby(by="Age").count()

In [None]:
#Identifying in which cells Age starts with a number so we can turn this column into an integer
sharkattack["Age"]=sharkattack["Age"].astype(str)
sharkattack["Age"]=sharkattack["Age"].apply(lambda x: re.findall(r"^\d{,3}",x)[0:2])

In [None]:
sharkattack["Age"]=sharkattack["Age"].apply(lambda x: x[0])

In [None]:
nullnumbers = sharkattack["Age"]==""
sharkattack.loc[nullnumbers,"Age"]=0

In [None]:
sharkattack["Age"]=sharkattack["Age"].astype(int)
zeronums = sharkattack["Age"]==0
sharkattack.loc[zeronums,"Age"]=np.nan

In [None]:
sharkattack["Age"].mean()

In [None]:
sharkattack["Age"].value_counts()
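
A shorter route to a numeric age column is to extract the leading digits with `Series.str.extract` and store them in pandas' nullable `Int64` dtype, which can hold integers and missing values together. A sketch on made-up values; the helper name `extract_age` is an assumption:

```python
import numpy as np
import pandas as pd


def extract_age(ages):
    """Keep the leading 1-3 digits of each value as a nullable integer."""
    # Values that do not start with a digit (e.g. "Teen", NaN) become missing
    digits = ages.astype(str).str.extract(r"^(\d{1,3})")[0]
    return pd.to_numeric(digits, errors="coerce").astype("Int64")


toy = pd.Series(["18", "30s", "Teen", np.nan, "7 or 8"])
clean = extract_age(toy)
```

Unlike the cells above, this sketch does not treat an age of 0 as missing.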

### - Sex

In [None]:
sharkattack.rename(columns={"Sex ":"Sex"},inplace=True)

In [None]:
# Putting condition in order to clean sex column
sexconditions = (sharkattack["Sex"] != "M") & (sharkattack["Sex"] != "F")

In [None]:
sharkattack.loc[sexconditions,"Sex"]=np.nan

In [None]:
sharkattack["Sex"].value_counts()
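
The comparison above is exact, so a value such as "M " with a trailing space is discarded. A sketch of a more tolerant variant, assuming that surrounding whitespace and lowercase letters should still count; the helper name `normalize_sex` is hypothetical:

```python
def normalize_sex(value):
    """Return "M" or "F" after trimming and upper-casing, otherwise None."""
    text = str(value).strip().upper()
    return text if text in {"M", "F"} else None
```

It could be applied with `sharkattack["Sex"].apply(normalize_sex)`; whether the extra values should be recovered is a judgment call about the data.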

### - Date

In [None]:
#Partially cleaning column "Year".

sharkattack['Year'] = sharkattack['Year'].replace([np.inf, -np.inf], np.nan)

sharkattack['Year'] = sharkattack['Year'].fillna(0)

sharkattack['Year'] = sharkattack['Year'].astype('int64')

sharkattack['Year'].tolist()

In [None]:
#Creating a list with the values from the column "Date".
sharkattack_date = sharkattack['Date'].values.tolist()

In [None]:
#Creating column "Date Year" from column "Date".

date_year_list = []
date_year = 0
for i in sharkattack_date:
    date_year = i[-4:]
    if not date_year.isdigit():
        date_year = '0'
    date_year_list.append(date_year)
    
sharkattack['Date Year'] = date_year_list
sharkattack['Date Year']


In [None]:
#Creating column "Date Month" from column "Date".

# Lookup table: month abbreviation (or month number) mapped to the month number
month_lookup = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12,
    **{str(n): n for n in range(1, 13)},
}

date_month_list = []
for i in sharkattack_date:
    # Values that are not in the table fall back to 0 (month unknown)
    date_month_list.append(month_lookup.get(i[-8:-5].lower(), 0))

sharkattack['Date Month'] = date_month_list

In [None]:
#Creating column "Date Day" from column "Date".

date_day_list = []
date_day = 0
for i in sharkattack_date:
    date_day = i[-11:-9]
    if not date_day.isdigit():
        date_day = '0'
    date_day_list.append(date_day)
    
sharkattack['Date Day'] = date_day_list
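
The three slicing loops can be combined into one regular expression. This is a sketch, assuming that dates end in a day-month-year pattern such as "25-Jun-2018", which is what the slice positions above imply; the names `MONTHS`, `DATE_PATTERN` and `split_date` are hypothetical:

```python
import re

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Optional day, three-letter month, four-digit year, anchored at the end of the string
DATE_PATTERN = re.compile(r"(?:(\d{1,2})[-\s])?([A-Za-z]{3})[-\s](\d{4})\s*$")


def split_date(text):
    """Return (year, month, day), using 0 for any part that cannot be read."""
    match = DATE_PATTERN.search(str(text))
    if match is None:
        return (0, 0, 0)
    day, month, year = match.groups()
    return (int(year), MONTHS.get(month.lower(), 0), int(day) if day else 0)
```

As in the loops above, 0 marks a part of the date that could not be recovered.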

In [None]:
#Comparing column "Date Year" and "Year" and replacing 0 values in the "Date Year column".

sharkattack['Year'] = sharkattack['Year'].astype(int)

sharkattack['Date Year'] = sharkattack['Date Year'].astype(int) 

date_year_list2 = sharkattack['Date Year'].tolist()

year_list2 = sharkattack['Year'].tolist()

for i in range(len(date_year_list2)):
    if date_year_list2[i] == 0:
        date_year_list2[i] = year_list2[i]
        
sharkattack['Date Year'] = date_year_list2
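
The element-by-element loop can be replaced by `Series.where`, which keeps the values where a condition holds and takes the replacement elsewhere. A sketch on made-up data that uses the notebook's column names:

```python
import pandas as pd

toy = pd.DataFrame({
    "Date Year": [2018, 0, 1999, 0],
    "Year": [2018, 2017, 2000, 0],
})

# Keep "Date Year" where it is non-zero; otherwise take the value from "Year"
toy["Date Year"] = toy["Date Year"].where(toy["Date Year"] != 0, toy["Year"])
```

Rows where both columns are 0 stay 0, exactly as in the loop.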

In [None]:
mask2 = (sharkattack['Date Year'] != sharkattack['Year'])

sharkattack.loc[mask2, :]

### - Fatal

In [None]:
# Renaming fatal column, cleaning and turning into boolean

In [None]:
sharkattack["Fatal (Y/N)"].value_counts()

In [None]:
sharkattack.rename(columns={"Fatal (Y/N)":"Fatal"},inplace=True)

In [None]:
fatalconditions = (sharkattack["Fatal"] != "Y") & (sharkattack["Fatal"] != "N")

In [None]:
sharkattack.loc[fatalconditions,"Fatal"]=np.nan

In [None]:
sharkattack.loc[sharkattack["Fatal"]=="Y","Fatal"]=True
sharkattack.loc[sharkattack["Fatal"]=="N","Fatal"]=False

In [None]:
sharkattack["Fatal"].value_counts()

In [None]:
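# Nullable boolean keeps missing values as <NA>; a plain astype(bool) would turn NaN into True
sharkattack["Fatal"]=sharkattack["Fatal"].astype("boolean")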

In [None]:
sharkattack.dtypes
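
A compact alternative for the Y/N conversion is to map the two known codes and store the result in pandas' nullable `boolean` dtype. The nullable dtype matters because a plain `bool` column has no way to represent a missing value. A sketch on made-up values, assuming that anything other than "Y" or "N" should stay missing; the helper name `to_fatal_flag` is hypothetical:

```python
import numpy as np
import pandas as pd


def to_fatal_flag(series):
    """Map "Y"/"N" to True/False and leave every other value missing."""
    # Values that are not keys of the dict become NaN, then <NA> after the cast
    return series.map({"Y": True, "N": False}).astype("boolean")


toy = pd.Series(["Y", "N", "UNKNOWN", np.nan])
flags = to_fatal_flag(toy)
```

With this dtype, `flags.sum()` counts the fatal cases and `flags.isna().sum()` counts the undetermined ones.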

## Reorganizing Columns:

In [None]:
#Organizing columns order in a logical way

sharkattack.keys()

sharkattack = sharkattack[['Case Number.1', 'Date Year', 'Date Month', 'Date Day', 'Time', 'Day Period', 'Country', 'Area', 'Location', 'Name', 'Sex', 'Age', 'Activity', 'Type', 'Injury', 'Fatal', 'Species ', 'Investigator or Source', 'pdf', 'href', 'original order']]

sharkattack.head(2)

In [None]:
sharkattack.rename(columns={'Case Number.1': 'Case Number', 'Date Day': 'Day', 'Date Month': 'Month', 'Date Year': 'Year'}, inplace=True)

sharkattack.head(100)

## Exporting file:

In [None]:
sharkattack.to_csv('./project2-shark-attacks.csv', index=False)
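
Before publishing the file, it may be worth checking that the exported CSV reads back with the expected shape and columns. Passing `index=False` is what prevents an extra unnamed index column from appearing. A sketch on made-up data, using an in-memory buffer in place of the real file:

```python
import io

import pandas as pd

toy = pd.DataFrame({"Year": [2018, 2017], "Fatal": [True, False]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind and read the CSV text back
buffer.seek(0)
restored = pd.read_csv(buffer)
```

The same comparison can be run against the real file by reading `./project2-shark-attacks.csv` back with `pd.read_csv`.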