# Onboarding DS - Part 2

Now that you have seen some of the most important packages we will use, let us introduce you to two of the main parts of a data science project: exploratory data analysis and data preparation.

Exploratory data analysis is the step where you get to know your data better: what your features are, how they are distributed, how they relate to one another, and what you can conclude from observing them in tables and graphs. Then, before you can use your data effectively (in machine learning models and other functions), you must prepare it, for example by normalizing it and handling missing values.

In this notebook, we will show you some of the analyses and preparation steps we can apply to the data through a mock project. You will do the coding on your own, but we will guide you through all the steps.

<div class = 'alert alert-block alert-warning'> Feel free to reach out to anyone in the DS team whenever you have doubts or questions! Exchanging knowledge is one of the greatest and most meaningful values we have here :)
</div>

## Packages

Before we start, it is important to know that we are going to plot a lot of graphs throughout this notebook. There are many Python packages you can use for this, such as Matplotlib and Seaborn. Here, we will use Plotly. We recommend installing it with `pip3 install plotly`.

Besides that, feel free to use this one or any other package when constructing your own analysis. Each one has its advantages and disadvantages. For instance, we chose to use Plotly due to its interactive plots and easy manipulation.

Ok, let's start now:

<div class = 'alert alert-block alert-info'> Task 1: To observe and manipulate our data, we will use Plotly and the packages shown in the previous notebook: numpy and pandas. Import them as <b>np</b> and <b>pd</b> respectively.
</div>

In [None]:
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from autocorrectors.exospart2 import part2exo1, part2exo3, part2exo7

# TO DO
# import the packages

We will work with a penguins **dataset** (Palmer Penguins dataset). It contains some physical characteristics of penguins from the Palmer Archipelago. The penguins belong to different species, and our goal is to group them according to their species, based on their physical characteristics. The data is stored in a csv file, so we must import it into a dataframe in order to use it. The `pd.read_csv` command imports the csv file into a dataframe (called `penguins` in this context). Then, `head()` shows us the first rows of the dataframe.

Note: the original dataset is composed of two tables. Here we will start working with only one of them. However, feel free to use both and test your results :)

In [None]:
penguins = pd.read_csv('./data/penguins_dataset.csv')
penguins.head()

<div class = 'alert alert-block alert-info'> Task 2: We can see that our dataframe has a column "Unnamed: 0" that is useless for us. So delete the column "Unnamed: 0".<br>
    
Then, using the attribute <b>shape</b>, find out the number of rows and columns (n_rows and n_columns) in the penguins dataframe. Using <b>penguins.isnull().sum()</b>, check if there are <b>null values</b> (in other words, missing values).<br><br>
    
Tip: look for <b>drop</b> in the pandas documentation
</div>    
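As a hint for the pattern only, here is a minimal sketch on a small made-up dataframe. The name `toy` and its columns are illustrative assumptions, not the penguins data. It shows `drop` removing a column, `shape` returning a `(rows, columns)` tuple, and `isnull().sum()` counting missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the penguins dataset
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "height_cm": [10.0, np.nan, 12.5],
    "color": ["red", "blue", None],
})

# drop returns a new dataframe, so reassign the result to keep the change
toy = toy.drop(columns=["Unnamed: 0"])

# shape is an attribute (no parentheses): (number of rows, number of columns)
toy_rows, toy_columns = toy.shape

# One count of missing values per column
missing_per_column = toy.isnull().sum()

print(toy_rows, toy_columns)
print(missing_per_column)
```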

In [None]:
# TO DO
# drop column "Unnamed: 0"

n_rows = None  # TO DO: find out the number of rows
n_columns = None  # TO DO: find out the number of columns

# TO DO
# check if there are missing values

part2exo1(n_rows, n_columns)

An important thing you can do is check which values each column can take. Sometimes there are values you were not expecting, and they can distort your later analysis.

In [None]:
display(penguins['island'].unique())
display(penguins['sex'].unique())

Besides null values, the sex column contains a placeholder value ('.'), which we can also treat as missing data. There are many ways of treating missing data. One of them is deleting it (delete the row or the column, depending on how the null values are distributed in your dataset). This is usually useful when we have lots of data and just want a general view of the context. Another possibility is replacing the missing data with reasonable values, for example by interpolating or by using the mode.
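The two strategies can be sketched on a small made-up dataframe. The name `toy`, its columns and its values are assumptions chosen for illustration; this is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the penguins dataset
toy = pd.DataFrame({
    "sex": ["MALE", "FEMALE", ".", None, "FEMALE"],
    "mass_g": [3700.0, np.nan, 3450.0, 3450.0, 4100.0],
})

# Turn the placeholder '.' into a real missing value
toy["sex"] = toy["sex"].replace(".", np.nan)

# Strategy 1: delete every row that has at least one missing value
dropped = toy.dropna()

# Strategy 2: fill each column with its most frequent value (the mode).
# mode() returns a Series because ties are possible; [0] takes the first one.
filled = toy.copy()
for column in filled.columns:
    filled[column] = filled[column].fillna(filled[column].mode()[0])

print(dropped)
print(filled)
```

Deleting keeps only fully observed rows (two of five here), while filling keeps every row at the cost of inventing plausible values.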

<div class = 'alert alert-block alert-info'> Task 3: Replace the missing values with the mode value of the respective column.
</div>

In [None]:
# TO DO
# replace missing values with mode value

## Data visualization

In [None]:
fig = go.Figure(data=[go.Histogram(x = penguins['island'])])
fig.show()

Through this histogram (a "graph of frequencies"), we can observe that only a small share of the penguins live on Torgersen (around 15%), while almost half of all penguins are from Biscoe.
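To read such proportions as numbers rather than estimating them from the bars, one option is `value_counts(normalize=True)`. The sketch below uses made-up labels (an assumption, not the real island counts). Plotly's `go.Histogram` also accepts `histnorm='percent'` if you prefer percentages on the y-axis.

```python
import pandas as pd

# Hypothetical category labels, not the real island counts
islands = pd.Series(["A", "A", "A", "B", "B", "C", "C", "C", "C", "C"])

# normalize=True returns fractions of the total instead of raw counts
shares = islands.value_counts(normalize=True)

print((shares * 100).round(1))
```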

<div class = 'alert alert-block alert-info'> Task 3: Using <b>make_subplots</b> from Plotly, plot histograms of the culmen length, culmen depth, flipper length, body mass and sex. Then calculate the mean culmen length.
    
Finally, observe what the method <b>describe()</b> does.

Extra: plot a histogram showing the proportion of females and males on each island (tip: take a look at the histogram documentation on Plotly)
</div>    

In [None]:
# TO DO
# plot the histograms in subplots

# TO DO
# calculate the mean culmen length

# TO DO
# Extra: plot the histogram where the islands are on x-axis,
# and showing the quantity of each sex on each island


mean_culmen_length = float(input("Mean culmen length: "))
part2exo3(mean_culmen_length)

penguins.describe()

<div class = 'alert alert-block alert-info'> Task 4: To better understand how the data is statistically distributed, try to plot <b>box plots</b>.
</div>
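To help you read a box plot, here is a rough sketch of the quantities it summarizes, computed with pandas on made-up numbers (the series `values` is an assumption). It uses the common 1.5 × IQR convention for flagging outliers. The quartiles a plotting library draws may differ slightly, since libraries use different quartile conventions.

```python
import pandas as pd

# Hypothetical values with one obvious outlier
values = pd.Series([2, 4, 4, 5, 6, 7, 9, 30])

# The box spans the first to the third quartile; the line inside is the median
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range, the height of the box

# Points beyond 1.5 * IQR from the box are usually drawn as individual dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```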

In [None]:
# TO DO
# plot box-plots

Tables are also very important in data visualization. They help to summarize information and compare values.

<div class = 'alert alert-block alert-info'> Task 5: Using the pandas method <b>groupby</b>, create a table with the mean value of each numeric feature grouped by sex. In other words, calculate the average culmen length and depth, the average flipper length and the average body mass for the female penguins and for the male penguins.
</div>
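The general `groupby` pattern can be sketched as follows on a made-up dataframe (`toy`, `team` and the other column names are assumptions): split the rows by a categorical column, then aggregate each group.

```python
import pandas as pd

# Hypothetical toy data, not the penguins dataset
toy = pd.DataFrame({
    "team": ["red", "red", "blue", "blue"],
    "height_cm": [10.0, 14.0, 20.0, 22.0],
    "weight_kg": [1.0, 3.0, 5.0, 7.0],
    "label": ["a", "b", "c", "d"],
})

# One row per team; numeric_only=True skips text columns such as "label"
means = toy.groupby("team").mean(numeric_only=True)

print(means)
```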

In [None]:
# TO DO
# create the table

An important analysis you can make is plotting the `correlation matrix`. It shows how correlated the features are with one another. The values range from -1 (perfect negative correlation) through 0 (no linear correlation) to 1 (perfect positive correlation). To plot the matrix as a heatmap, follow these steps:
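To make the range of values concrete, here is a small sketch on made-up columns (all names are assumptions) built so that the correlations with `x` come out at the extremes and at zero.

```python
import pandas as pd

# Hypothetical columns constructed to have known correlations with x
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
toy["double_x"] = 2 * toy["x"]           # moves exactly with x
toy["minus_x"] = -toy["x"]               # moves exactly against x
toy["zigzag"] = [1.0, -1.0, -1.0, 1.0]   # no linear relation with x

# Pairwise Pearson correlation between all numeric columns
corr = toy.corr()

print(corr.round(2))
```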

<div class = 'alert alert-block alert-info'> Task 6: First, calculate the correlation between the features. You can use the pandas method <b>corr</b>. Then, choose between <b>imshow</b> from plotly.express and <b>Heatmap</b> from plotly.graph_objects to plot your matrix.
</div>

In [None]:
# TO DO
# create the correlation matrix using corr
# then plot it as a heatmap


The colors help us see which features are strongly correlated with each other and which are not. Every value on the main diagonal of this matrix equals 1, because it is the correlation between a feature and itself. We can thus ignore this diagonal. Note also that the matrix is symmetric.
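Both properties are easy to check in code. The sketch below uses a made-up dataframe (the columns `a`, `b` and `c` are assumptions) and shows one way to rank the other features by their correlation with a chosen one while leaving out the trivial self-correlation. Whether "lowest" should mean most negative or closest to zero depends on the question you are asking.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the penguins dataset
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.9],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
corr = toy.corr()

# Symmetric: corr(a, b) equals corr(b, a)
is_symmetric = np.allclose(corr.values, corr.values.T)

# Correlations of every other feature with "a", without the diagonal entry
with_a = corr["a"].drop("a").sort_values(ascending=False)

print(is_symmetric)
print(with_a)
```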

<div class = 'alert alert-block alert-info'> Task 7: Write down which feature has the highest correlation and which has the lowest correlation with the flipper length.
</div>

In [None]:
hcorr = input("The feature with the highest correlation with flipper_length_mm is: ")
lcorr = input("The feature with the lowest correlation with flipper_length_mm is: ")

part2exo7(hcorr, lcorr)