# Exploratory Data Analysis

## Introduction
Data, from the Latin *datum* ("something given", from the verb *dare*, "to give"), refers to observations or facts about a subject. 
Data science aims to uncover hidden useful relationships within data. Before applying advanced analytical techniques, basic data exploration is crucial.
This involves understanding the dataset's basic characteristics, preparing it for further analysis, and sometimes quickly gaining insights.

Data exploration, or exploratory data analysis (EDA), uses simple techniques such as pivot tables, statistical computations (mean, standard deviation), and various charts (line, bar, scatter) to understand the data. 
EDA helps grasp the data's structure, distribution, and relationships between attributes, guiding further statistical and data science treatments. 
EDA includes two main types: descriptive statistics and data visualization.
Descriptive statistics condense key dataset characteristics into numeric metrics (mean, standard deviation, correlation).
Data visualization projects data into multi-dimensional space or abstract images, covering all types of charts. 
Both techniques are essential in data science for comprehensive data exploration.
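
As a minimal sketch of the descriptive statistics mentioned above, the snippet below computes a mean, a standard deviation, and a Pearson correlation using only Python's standard library. The two lists (`lengths_m`, `weights_kg`) and the helper `pearson` are hypothetical, invented for illustration; they are not taken from `sharks.csv`.

```python
import statistics

# Hypothetical measurements, invented purely for illustration.
lengths_m = [2.1, 2.4, 3.0, 3.3, 4.2]
weights_kg = [110.0, 150.0, 260.0, 330.0, 620.0]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


mean_length = statistics.mean(lengths_m)   # typical value
std_length = statistics.stdev(lengths_m)   # sample standard deviation
corr = pearson(lengths_m, weights_kg)      # linear relationship between attributes

print(f"mean length: {mean_length:.2f} m")
print(f"standard deviation: {std_length:.2f} m")
print(f"length/weight correlation: {corr:.3f}")
```

Each of the three numbers condenses one aspect of the data: where the values sit, how far they spread, and how strongly two attributes move together.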

In the data science process, data exploration is used at several steps, including data preparation (preprocessing), modeling, and interpretation of the modeling results:
1. Data understanding: Data exploration provides a high-level overview of each attribute (also called a *variable*) in the dataset and the interaction between the attributes. Data exploration helps answer questions like *what is the typical value of an attribute* or *how much do the data points differ from the typical value*, or *are there any extreme values in the data*.
2. Data preparation: Before applying an algorithm, the dataset has to be prepared by handling any of the anomalies that may be present in the data. These anomalies include outliers, missing values, or highly correlated attributes. Some data science algorithms do not work well when input attributes are correlated. Thus, correlated attributes need to be identified and handled or removed.
3. Data science tasks: Basic data exploration can sometimes substitute for the entire data science process. For example, scatterplots can identify clusters in low-dimensional data or can help develop regression or classification models with simple visual rules.
4. Interpreting the results: Finally, data exploration is used to understand the prediction, classification, and clustering results of the data science process. Histograms help to comprehend the distribution of an attribute and can also be useful for visualizing numeric predictions, error rate estimates, etc.
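
The data understanding and data preparation steps above can be sketched on a small example. The `readings` list is hypothetical and invented for illustration, with `None` standing in for a missing value. The 1.5 × IQR rule used to flag extreme values is one common convention, assumed here rather than prescribed by this notebook.

```python
import statistics

# Hypothetical readings; None marks a missing value. Invented for illustration.
readings = [12.0, 14.5, None, 13.2, 15.1, 98.0, 12.8, None, 14.0]

# Missing values: count them, then continue with the observed values only.
observed = [r for r in readings if r is not None]
n_missing = len(readings) - len(observed)

# Typical value: the median is less sensitive to extreme values than the mean.
median = statistics.median(observed)

# Extreme values: flag anything outside 1.5 * IQR beyond the quartiles.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [r for r in observed if r < low or r > high]

print("missing values:", n_missing)
print("median:", median)
print("outliers:", outliers)
```

Whether a flagged value is an error to remove or a genuine observation to keep is a judgment that depends on the dataset and the question being asked.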


## Shark Research Challenge
Welcome to the Shark Research Challenge! You have been hired as a data scientist by the Oceanic Research Institute to analyze data about sharks.
The institute has collected detailed information on different shark species, gender, geographical locations, and their movement patterns.
Your expertise is needed to uncover critical insights that can help marine biologists understand shark behaviours and migration patterns better.

### Task
The institute is planning a major expedition and needs your help to make data-driven decisions. They have provided you with a dataset named `sharks.csv`, which includes various observations of sharks. The research team has asked you to complete the following tasks:
- **Species Analysis**: Using Python code and the AI assistant in the chat on the left, determine the most and least common shark species in the data. This will help the team focus on species that may need more conservation efforts or those that are thriving.
   - If the chat is not already open on the left side, click the chat bubbles in the left toolbar, then select *Open chat* at the top and click **EDA**.
- Write a short (2-3 paragraphs) report describing your process and results as if you were describing it to your team at the Oceanic Research Institute. You can refer to your code, the AI assistant, and how their insights influenced your decisions. You can use visualizations if you prefer.

### First time using Jupyter Notebooks?
This is a Jupyter Notebook. It allows you to write and run code in an interactive environment, and it combines code execution, text, and visualizations in a single document, making it ideal for data analysis, machine learning, and educational purposes. 

Understand the Interface:
- Cells: The main components where you write your code. There are two main types: Code cells and Markdown cells.
    - Code Cell: For writing and executing code. 
    - Markdown Cell: For writing formatted text using the "Markdown" language. This cell is a Markdown cell; double-click it to see the unformatted text, then click the Play button on the toolbar above to format the text again.
- Toolbar: Buttons for saving, adding/deleting cells, and running code.

Basic Operations:
- Run Code: Write Python code in a cell and click on the Play button in the toolbar or press `Shift + Enter` to execute it. The result will be displayed below the cell.
- Add Cells: Click the + button in the toolbar or press `B` after running a cell to add a new cell below. You can use as many cells as you want.
- Delete Cells: Select the cell and click the scissors icon or press `D` twice.
- Change Cell Type:
    - To Markdown: Select the cell and select `Markdown` in the toolbar or press `M`.
    - To Code: Select the cell and select `Code` in the toolbar or press `Y`.
    
Try to run the cell below:

```python
animal = 'sharks'
print('I love ' + animal)
```

Now add a new cell below, copy-paste the code and change the animal.

Feel free to ask the assistant in the chat any questions! Please avoid using other resources (such as Google or other chats).

## Your task
Import the data, explore it and understand its structure to complete the task.
- **Species Analysis**: Using Python code and the AI assistant in the chat on the left, determine the most and least common shark species in the data. This will help the team focus on species that may need more conservation efforts or those that are thriving.

```python
# This is the location of the CSV (comma-separated values) dataset
dataset_path = 'sharks.csv'
```

## Report

Write your report below about the sharks and the prevalence of different species in the data. Use 1-2 sentences to describe your findings as if you were communicating with your team at the Oceanic Research Institute. 
Then write 1-2 sentences to describe your process to your "supervisor". 
You can refer to what you learned, your code, the AI assistant, and how the insights obtained influenced your decisions. 
You can use visualizations if you prefer.