# Data Preparation

## Introduction

**Data** is an essential part of problem solving in machine learning. Without data, it is not possible to solve problems using machine learning techniques.

With this in mind, the first step in problem solving using machine learning is to **prepare the data** that we will use to solve the problem. This preparation involves data preparation operations, which allow us to perform tasks such as importing, correcting and formatting data.

We can draw an **analogy** between the data preparation phase and the personal preparations we undergo before going to a wedding. Before we set off for a wedding, there is a whole range of operations to perform (e.g., taking a shower, preparing your clothes, fixing your hair). In machine learning it is much the same: we have to prepare the data before applying machine learning techniques.

In this tutorial, we will present the main elements of data preparation:

*   ***Pandas* library.** The *pandas* library has most of the methods and functions used to prepare data.
*   **Data structures.** *Pandas* has its own way of structuring data and you need to know it.
*   **Data import.** The data preparation process begins with data import. 
*   **Missing data.** Missing data is a common problem that we need to know how to resolve.
*   **Other data preparation operations.** There is a general set of operations that is useful to know.

This tutorial has several **examples** that illustrate how code is applied and the effects of its application. In addition, throughout the tutorial you will find several challenges that serve to check whether you understand the material. At the end, all of the content presented will be summarised.

Note that this is an **introductory-level** tutorial and for this reason, several important aspects are not covered. For more information, we recommend that you consult the [official *pandas* library documentation](https://pandas.pydata.org/).


## *Pandas* Library

***Pandas*** is a Python library that allows you to prepare and analyse data. As such, *pandas* has a set of features that make it easier to perform tasks related to data preparation and analysis.  

To **import** the *pandas* library, do the following: 


In [None]:
import pandas as pd

In the previous instruction:

*   The word `import` tells the computer that we want to import a library.
*   The word `pandas` identifies the library that we want to import (the `pandas` library).
*   The expression `as pd` means that we want to abbreviate the name used to access the library (instead of having to write `pandas` every time we want to use the library, we can write just `pd`).

The library’s **features become available** from the moment the library is imported.

This tutorial focuses primarily on features related to **data preparation**; on occasion, however, features related to data analysis will also be mentioned. The rest of the tutorial focuses on the following aspects:

*   Data structures.
*   Data import.
*   Missing data.
*   Other data preparation operations.

## Data Structures

**Data structures** refer to how data is saved and organised on a computer.

When using *pandas*, there are two types of data structures that are recurrent:

1.   *Series*.
2.   *DataFrame*.

### Series

The *Series* data structure acts as a kind of **list** and is used whenever we want to store sequences of values. 

***Series*** can be used to store data such as the weight of players on a basketball team, for example. Let us suppose that the players on a team weigh 73, 89, 64, 72, 78, 83, 92, 97, 70 and 68 kg. To store this data in a *Series* data structure, all we would have to do is:

In [None]:
weight = pd.Series([73,89,64,72,78,83,92,97,70,68])
print(weight)

0    73
1    89
2    64
3    72
4    78
5    83
6    92
7    97
8    70
9    68
dtype: int64


In the previous instruction:

*   `weight` is the variable where we store our *Series*.
*   `pd.Series()` is the instruction we use to build the *Series*.
  *   The use of `pd.` tells the computer that the `Series()` function is in the *pandas* library.
  *   Remember that we used the `as pd` instruction previously, when importing the library. This means that we can write `pd.` to access and use the library instead of having to write `pandas.`.
*   `[73,89,64,72,78,83,92,97,70,68]` is the information we want to store in the *Series*.
  *   For this reason, the information is found between the `()` of the `pd.Series()` instruction.
  *   The use of `[]` serves only to indicate that we are going to input a list of values. If it were just one single value, there would be no need to use `[]`.
*   `print(weight)` is used to display what is stored in the `weight` variable.

When displaying the `weight` variable, we have the identification number (index) of each of the elements to the left, and on the right we have the weight corresponding to that element. So, we can determine, for example, that the player identified by the index '3' weighs 72 kg. 
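As a minimal sketch of how the index can be used, the example below rebuilds the same *Series* and reads the value stored at index 3. The names `player_3`, `heaviest` and `average` are chosen here purely for illustration, and `max` and `mean` are standard *Series* methods that this tutorial does not otherwise cover.

```python
import pandas as pd

# Same player weights (kg) as in the example above
weight = pd.Series([73, 89, 64, 72, 78, 83, 92, 97, 70, 68])

# Read the value stored under index 3 (the fourth player)
player_3 = weight[3]

# Summary values computed directly on the Series
heaviest = weight.max()
average = weight.mean()

print(player_3)
print(heaviest)
print(average)
```

Because the index starts at zero, `weight[3]` returns the fourth value in the sequence, not the third.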

**Challenge**: Based on the previous code, create a *Series* that stores the height of 10 people you know.

In [None]:
# Solution for the challenge

### DataFrame

A *DataFrame* is a kind of **table**. Each column in a `DataFrame` corresponds to a variable and each row corresponds to an entry (or observation).

Let us imagine that we created a questionnaire for 100 people and that we got the following responses:

*   **Question 1**. 74 'Yes' responses and 26 'No' responses.
*   **Question 2**. 23 'Yes' responses and 77 'No' responses.
*   **Question 3**. 56 'Yes' responses and 44 'No' responses.

A possible way to record this information in a ***DataFrame*** would be: 

In [None]:
responses = pd.DataFrame({'Yes': [74,23,56], 'No':[26, 77, 44]})
print(responses)

   Yes  No
0   74  26
1   23  77
2   56  44


In the previous instruction:

*   `responses` is the variable where we store our *DataFrame*.
*   `pd.DataFrame()` is the instruction we use to build the `DataFrame`.
*   `{'Yes': [74,23,56], 'No': [26, 77, 44]}` is the information we want to store in the *DataFrame*.
  *   For this reason, the information is found between the `()` of the `pd.DataFrame()` instruction.
  *   The use of `{}` serves only to indicate that we are going to input a set of values and that these values are associated with a key (in this case, the keys are '`Yes`' and '`No`').
*   `print(responses)` serves to display what is stored in the `responses` variable.

When displaying the `responses` variable, we have the index of each question to the left, and on the right we have the number of 'Yes' and 'No' responses. So, we can determine, for example, that the question with an index of '0' has 74 'Yes' responses and 26 'No' responses.

Note that:

*   A *Series* can be thought of as a single column of a *DataFrame*.
*   In Python, **indexing** (assigning an identification number) starts at zero. That is why 'Question 1' has an index of '0': since this is the first question, Python will assign it the first available index, in this case '0', when indexing it.
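The two notes above can be illustrated with a short sketch. It assumes we would rather label the rows 'Question 1' to 'Question 3' than rely on the default numbering; the `index` argument of `pd.DataFrame()` allows this, and the labels themselves are chosen here only for illustration.

```python
import pandas as pd

# Same questionnaire data, with explicit row labels instead of 0, 1, 2
responses = pd.DataFrame(
    {'Yes': [74, 23, 56], 'No': [26, 77, 44]},
    index=['Question 1', 'Question 2', 'Question 3'],  # illustrative labels
)

# Selecting one column of a DataFrame gives back a Series
yes_column = responses['Yes']
print(type(yes_column))

# The Series keeps the row labels of the DataFrame
print(yes_column['Question 1'])
```

When no `index` is given, *pandas* falls back to the default numbering that starts at zero, as in the original example.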

**Challenge:** Based on the previous code, create a *DataFrame* that stores the height and weight of 10 people you know.

In [None]:
# Solution for the challenge

## Data Importing

In the real world, most of the machine learning problems that we are going to solve involve using data that is **already stored somewhere**. For example, if we want to create machine learning models to predict the weather, what we will most likely have to do is import data from a database. Fortunately, it is very unlikely that we will have to enter all of the data ourselves, observation by observation, just as we did to illustrate the concepts behind the *Series* and *DataFrame* data structures.

The data we plan to import can be saved in **different formats**. One of the most common storage formats is the CSV (*Comma-Separated Values*) file.

In **CSV files**, data is stored according to the following rules:

*   Existing fields are identified in the first row.
*   In the remaining rows, the values for each observation are described in the fields previously identified.
*   Fields and values are separated by commas (which explains the logic behind the name *Comma-Separated Values*).

For example, in a CSV file, the data that makes up the *DataFrame* `responses` seen earlier would be presented as follows:

`Yes,No`

`74,26`

`23,77`

`56,44`

Visually there is a significant difference between organising data in a *DataFrame* and organising it in a CSV file. However, computationally, there is no difference. All that matters to the computer is that the data is organised in a **consistent** manner and according to **logic** that it is already familiar with.

The *pandas* library allows us to read CSV files. For that, all we have to do is import *pandas* and use the `pd.read_csv()` function, which converts data from a CSV file to a *DataFrame*. This function accepts several arguments, one of which is the file’s location. The rest of the arguments can be found in the [official documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv).
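To see the conversion from CSV text to a *DataFrame* without depending on any file, the sketch below feeds the CSV content of the `responses` example to `pd.read_csv()` through `io.StringIO`, a standard-library object that makes a string behave like a file. The variable name `csv_text` is an assumption made for this illustration.

```python
import io

import pandas as pd

# CSV content: field names in the first row, one observation per row after that
csv_text = "Yes,No\n74,26\n23,77\n56,44\n"

# io.StringIO lets read_csv treat the string as if it were a file
responses = pd.read_csv(io.StringIO(csv_text))

print(responses)
print(type(responses))
```

The result should be the same *DataFrame* that we built by hand earlier, which shows that the CSV file and the *DataFrame* hold the same data in two different forms.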

For example, if we want to use a dataset related to COVID-19 that is available at https://raw.githubusercontent.com/pmarcelino/datasets/master/covid-19.csv, in a repository that we created, all we have to do is follow the syntax of the `pd.read_csv()` function and enter the file’s location:

In [None]:
pd.read_csv("https://raw.githubusercontent.com/pmarcelino/datasets/master/covid-19.csv")

Unnamed: 0,Key,Date,CountryCode,CountryName,RegionCode,RegionName,aggregation_level,Confirmed,Deaths,Population,Latitude,Longitude
0,AE,2020-01-01,AE,United Arab Emirates,,,0,0.0,0.0,9770529.0,24.400000,54.300000
1,AF,2020-01-01,AF,Afghanistan,,,0,0.0,0.0,38041754.0,34.000000,66.000000
2,AM,2020-01-01,AM,Armenia,,,0,0.0,0.0,2957731.0,40.383333,44.950000
3,AR,2020-01-01,AR,Argentina,,,0,0.0,0.0,44938712.0,-34.000000,-64.000000
4,AR_C,2020-01-01,AR,Argentina,C,City of Buenos Aires,1,0.0,0.0,3063728.0,-34.599722,-58.381944
...,...,...,...,...,...,...,...,...,...,...,...,...
236651,UA_65,2020-10-04,UA,Ukraine,65,Kherson,1,1340.0,26.0,1046981.0,46.500000,34.000000
236652,UA_68,2020-10-04,UA,Ukraine,68,Khmelnytskyi,1,6647.0,131.0,1274409.0,49.530000,26.870000
236653,UA_71,2020-10-04,UA,Ukraine,71,Cherkasy,1,4430.0,59.0,1220363.0,49.444722,32.060278
236654,UA_74,2020-10-04,UA,Ukraine,74,Chernihiv,1,4518.0,77.0,1020078.0,51.340000,32.060000


In the previous instruction:

*   `pd.read_csv()` is the instruction we use to read the CSV file.
*   `"https://raw.githubusercontent.com/pmarcelino/datasets/master/covid-19.csv"` is the *link* to the CSV file.
  *   For this reason, the link is found between the `()` of the `pd.read_csv()` instruction.
  *   We need to know in advance where the data file is located.

Now that we know how to read data files, we can demonstrate how to store this data in a variable:

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/datasets/master/covid-19.csv")

In the previous instruction:

*   `df` is the variable where we store the data from the CSV file.
  *   We use the name `df` because it is common practice to name the variable containing the original dataset this way.
  *   Since we used `pd.read_csv()`, the `df` variable will be a *DataFrame*.

To demonstrate that the `df` variable is a *DataFrame*, we can evaluate its type using the `type()` function:

In [None]:
type(df)

pandas.core.frame.DataFrame

For the sake of style and clarity, data import is usually done as follows:

In [None]:
url = "https://raw.githubusercontent.com/pmarcelino/datasets/master/covid-19.csv"
df = pd.read_csv(url)

In the previous instruction:

*   `url` is the variable where we store the *link* to the CSV file.

In practical terms, it is exactly the same as what we saw earlier, but the use of a `url` variable makes the code more readable.

After importing the data, it is routine to verify that the import went well. For this, we usually:

*   Display the first rows of the `df` variable to verify that the import was successful (`df.head()`).
*   Check the dimension of the dataset (`df.shape`).
*   Display a statistical summary of the dataset to see if the values are within a reasonable range (`df.describe()`).

For instance:

In [None]:
df.head()

Unnamed: 0,Key,Date,CountryCode,CountryName,RegionCode,RegionName,aggregation_level,Confirmed,Deaths,Population,Latitude,Longitude
0,AE,2020-01-01,AE,United Arab Emirates,,,0,0.0,0.0,9770529.0,24.4,54.3
1,AF,2020-01-01,AF,Afghanistan,,,0,0.0,0.0,38041754.0,34.0,66.0
2,AM,2020-01-01,AM,Armenia,,,0,0.0,0.0,2957731.0,40.383333,44.95
3,AR,2020-01-01,AR,Argentina,,,0,0.0,0.0,44938712.0,-34.0,-64.0
4,AR_C,2020-01-01,AR,Argentina,C,City of Buenos Aires,1,0.0,0.0,3063728.0,-34.599722,-58.381944


In [None]:
df.shape

(236656, 12)

In [None]:
df.describe()

Unnamed: 0,aggregation_level,Confirmed,Deaths,Population,Latitude,Longitude
count,236656.0,236580.0,203843.0,222033.0,235504.0,235504.0
mean,0.730575,19264.95,939.14574,13040020.0,24.482847,4.19156
std,0.443662,157354.2,5825.078273,73496730.0,26.107397,75.693291
min,0.0,0.0,0.0,50.0,-54.362,-178.10932
25%,0.0,29.0,0.0,581641.0,8.632279,-69.31
50%,1.0,561.0,18.0,1675502.0,29.646111,9.083333
75%,1.0,4849.0,218.0,5942089.0,46.825,47.0
max,1.0,7206769.0,206558.0,1397715000.0,72.0,178.005556


In a notebook, `df.head()` and `df.shape` can simply be replaced by `df`, which displays the first and last rows together with the dimensions of the dataset. Regarding the use of `df.describe()`, more details will be given in the 'Data Exploration' tutorial.

Finally, the only thing left to mention is that in addition to importing data from files, it is also possible to import **data directly from libraries**. For example, the *seaborn* library has several datasets that can be imported. In the 'Data Exploration' tutorial this situation is explained in detail.

**Challenge**: Based on the previous code, import the data from this [link](https://github.com/pmarcelino/datasets/blob/master/penguins.csv).

In [None]:
# Solution for the challenge

## Missing Data

It is common to have databases with **missing data**. Missing data can occur due to several reasons, such as equipment reading errors or lapses on the part of the individual entering the data into the database.

In general, building machine learning models requires working with **complete** datasets, in other words, datasets that have no missing data. Therefore, resolving the issue of missing data is fundamental in order to be able to train prediction models using machine learning algorithms. 

Let us import a dataset with missing data to see how we can **identify missing data**:

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/pmarcelino/datasets/master/mock-missing-data.csv'
df = pd.read_csv(url)

df

   A    B    C  D
0  1  NaN  2.0  4
1  2  4.0  NaN  6
2  3  3.0  2.0  5


As you can see, two cells have no value. The *pandas* library displays such cells with the expression `NaN`, which indicates that a cell has data missing. In this case, the dataset is small and it is easy to identify the missing data visually.

When datasets are large, it is no longer possible to identify missing data visually and we, therefore, have to use a combination of the methods `isnull` and `sum`. Let us use the previous example to demonstrate how to combine these methods:

In [None]:
df.isnull().sum()

A    0
B    1
C    1
D    0
dtype: int64

In the previous instruction:

*   `df` is the variable that contains the dataset where we will apply the methods.
*   `isnull()` checks every cell for missing data and returns `True` for cells where the value is missing (and `False` otherwise).
*   `sum()` then counts, column by column, the cells that have a `True` value.

As we can see, there is missing data in columns 'B' and 'C'. More specifically, it appears that both columns have one observation with missing data.
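The sketch below repeats the count on a small stand-in dataset, rebuilt by hand from the table shown above so that it runs without downloading anything. The names `missing_per_column`, `total_missing` and `rows_with_missing` are illustrative, and the use of `any(axis=1)` to find affected rows is an addition that goes slightly beyond what this tutorial covers.

```python
import numpy as np
import pandas as pd

# Stand-in for the dataset above; np.nan marks the missing cells
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [np.nan, 4.0, 3.0],
                   'C': [2.0, np.nan, 2.0],
                   'D': [4, 6, 5]})

# Number of missing values in each column
missing_per_column = df.isnull().sum()

# Applying sum a second time gives the total for the whole dataset
total_missing = df.isnull().sum().sum()

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(total_missing)
print(rows_with_missing)
```

Counting per column tells us where the problem is concentrated, while listing the affected rows helps when deciding which of the two solutions described next to adopt.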

One of two **solutions** can be adopted to resolve problems related to missing data:

1.   Data elimination
2.   Data imputation

### Data Elimination

**Data elimination** is one way to solve the problem of missing data. This solution involves eliminating observations with missing data. If we delete all of the observations with missing data, our dataset becomes complete.

The simplest way to eliminate missing data is to use the `dropna` method, which is a variant of the `drop` method found in the *pandas* library.

The `dropna` method allows you to delete rows with missing data:

In [None]:
df.dropna()

   A    B    C  D
2  3  3.0  2.0  5


As well as columns that contain missing data:

In [None]:
df.dropna(axis=1)

   A  D
0  1  4
1  2  6
2  3  5


Note that the only difference lies in the `axis` parameter. This parameter defines what is deleted: rows (`axis=0`) or columns (`axis=1`). By default, `axis` has a value of 0, so the rows that contain missing data are deleted. Therefore, if we want to delete the columns that contain missing data instead, we have to set `axis=1`.
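One detail worth checking is that `dropna` returns a new *DataFrame* and leaves the original untouched, so the result has to be stored if we want to keep it. The sketch below illustrates this on a hand-built copy of the dataset above; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the dataset above; np.nan marks the missing cells
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [np.nan, 4.0, 3.0],
                   'C': [2.0, np.nan, 2.0],
                   'D': [4, 6, 5]})

rows_dropped = df.dropna()          # axis=0 by default: removes rows 0 and 1
cols_dropped = df.dropna(axis=1)    # removes columns 'B' and 'C'

# The original DataFrame still has all of its rows and columns
original_shape = df.shape

# To keep the cleaned version, assign the result to a variable
df_clean = df.dropna()

print(rows_dropped.shape)
print(cols_dropped.shape)
print(original_shape)
```

In this small example, deleting rows keeps only one observation out of three, while deleting columns keeps all three observations but loses two variables, which is the trade-off to weigh when choosing the axis.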

**Challenge**: Eliminate the missing data from the following dataset so that you are left with a complete dataset.

In [None]:
import numpy as np

df = pd.DataFrame({'Altura':[np.nan, 1.72, 1.74, 1.76, 1.78], 
                   'Peso':[68, 68, np.nan, 72, 72]})
df

   Altura  Peso
0     NaN  68.0
1    1.72  68.0
2    1.74   NaN
3    1.76  72.0
4    1.78  72.0


In [None]:
# Solution for the challenge

### Data Imputation

In many cases, it is **not advisable to eliminate** rows (observations) or columns (variables) because this involves reducing the amount of data available.

To avoid this situation, we can use **data imputation**. Data imputation is a technique that allows us to estimate values to substitute missing data, based on existing data.

Mean imputation is one of the most common forms of data imputation. In this scenario, what we do is **exchange the missing values for the mean of the observed values**. Let us look at the following example:

In [None]:
df = pd.DataFrame({'Altura':[1.70, 1.72, 1.74, 1.76, 1.78], 
                   'Peso':[68, 68, np.nan, 72, 72]})
df

   Altura  Peso
0    1.70  68.0
1    1.72  68.0
2    1.74   NaN
3    1.76  72.0
4    1.78  72.0


In the previous example, if we wanted to estimate the missing value by imputing the mean, what we would do is say that this **value corresponds to the mean of the observed values in the 'Peso' (weight) column**. The observed values are 68, 68, 72 and 72, so the missing observation receives a value of 70. By using this method, we avoid losing the row with an index number of '2' or the 'Peso' column, depending on whether we chose to eliminate rows containing missing data or columns containing missing data.

In Python, mean imputation can be done in different ways. One of the simplest ways is by combining two methods from the *pandas* library: `fillna` and `mean`. For instance:

In [None]:
df.fillna(df.mean())

   Altura  Peso
0    1.70  68.0
1    1.72  68.0
2    1.74  70.0
3    1.76  72.0
4    1.78  72.0


In the previous instruction:

*   `df.fillna()` fills in the missing values in the `df` variable.
*   `df.mean()` computes the mean of each column of `df`, so each missing value is replaced by the mean of its own column.
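The example above works because every column is numeric. When a dataset also contains text columns, `df.mean()` may raise an error in recent versions of *pandas*, since a mean cannot be computed for text. A possible workaround, sketched below on made-up data, is to pass `numeric_only=True` to `mean` so that only numeric columns are averaged. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one text column and one numeric column with a gap
df = pd.DataFrame({'city': ['Lisbon', 'Porto', 'Faro'],
                   'temperature': [20.0, np.nan, 30.0]})

# Means of the numeric columns only; the text column is skipped
column_means = df.mean(numeric_only=True)

# Each missing value is replaced by the mean of its own column
filled = df.fillna(column_means)

print(column_means)
print(filled)
```

Here the missing temperature is replaced by 25.0, the mean of 20.0 and 30.0, and the 'city' column is left as it was.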

**Challenge**: Fill in the missing data from the following dataset using mean imputation.

In [None]:
df = pd.DataFrame({'Carro':['Honda', 'Toyota', 'Fiat', 'Peugeot', 'Ford'], 
                   'Preço':[17000, 23000, np.nan, np.nan, 24000]})
df

     Carro    Preço
0    Honda  17000.0
1   Toyota  23000.0
2     Fiat      NaN
3  Peugeot      NaN
4     Ford  24000.0


In [None]:
# Solution for the challenge

## Other Data Preparation Operations

In general, data preparation operations aim to **organise** data in a more convenient way. 

**Examples of data preparation operations** are:

*   Selection 
*   Assignment
*   Grouping
*   Sorting
*   Data Removal

**Note:** In the following section, we explain and give examples for some of these operations. Although most of these operations are not essential for this data preparation class, their application will be necessary later, either in the context of this course or generally, to solve real-world machine learning problems.

### Selection

An operation commonly used in data analysis is the **selection** of specific values. In many cases, it is necessary to select data subsets from an original dataset in order to solve problems. For example, if we wanted to use COVID-19 data to solve a problem related to Portugal, at some stage during the procedure we would most likely have to select specific observations from Portugal. Note that the data relevant to Portugal is a subset from the original data, which encompasses data from countries worldwide.

There are **countless ways to select data** using *pandas*. Below, we demonstrate several of them. In these examples, we will use the '[Iris](https://pt.wikipedia.org/wiki/Conjunto_de_dados_flor_Iris)' dataset.

The **'Iris' dataset** contains information about the physical characteristics of flowers and flower species. To be more specific, the information pertains to sepal length, sepal width, petal length, petal width and the species of different flowers.

Let us start by importing the data and saving it in a variable: 


In [None]:
url = 'https://raw.githubusercontent.com/pmarcelino/datasets/master/iris.csv'
df = pd.read_csv(url)
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


The first selection method that we will look at involves **selecting only one of the columns**. As an example, let us consider that we only want to focus on the values in the 'sepal_length' column. For this, we can do the following:

In [None]:
df.sepal_length

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

Another way to achieve the same result would be: 

In [None]:
df['sepal_length']

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

This last approach is particularly interesting when we want to view **more than one column**:

In [None]:
df[['sepal_length', 'species']]

Unnamed: 0,sepal_length,species
0,5.1,setosa
1,4.9,setosa
2,4.7,setosa
3,4.6,setosa
4,5.0,setosa
...,...,...
145,6.7,virginica
146,6.3,virginica
147,6.5,virginica
148,6.2,virginica


It should be noted that in the instruction above it was necessary to place an additional set of `[]` because we wanted to select more than one element (we wanted to select a list of elements).

Now, imagine that we wanted to **select information pertaining to a set of rows**, rather than a set of columns. If the set of rows referred only to the first row, we could do the following:

In [None]:
df.iloc[0]

sepal_length       5.1
sepal_width        3.5
petal_length       1.4
petal_width        0.2
species         setosa
Name: 0, dtype: object

This instruction locates the information according to the integer position provided (hence the name `iloc`, short for 'integer location'). Bearing in mind that Python starts counting from zero, position '0' corresponds to the first row of the data table.

In turn, if we wanted to **select the information pertaining to the first 5 rows**, we would do the following:

In [None]:
df.iloc[:5]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


As we can see, and this is always the case when slicing in Python, the code `:5` reads as 'from the beginning up to, but not including, position 5'.

Now, if we wanted to **select the information found in rows six to ten**, we would have to do the following:

In [None]:
df.iloc[5:10]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


Here, you have to keep in mind that the sixth row starts at index number '5' because the index count starts from zero. The same reasoning explains why the tenth row corresponds to index number '9'. 

If we wanted to select the **information found in the sixth to last row**, we would do the following:

In [None]:
df.iloc[5:]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


Lastly, and to finish up the different row selection methods using the `iloc` instruction, if we wanted to **select the last 5 rows** of the data table, we would do the following: 

In [None]:
df.iloc[-5:]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


In the case above, the instruction `-5:` reads as 'from the last 5 to the end'. Likewise, if we wanted to specify the last ten rows of the table, we would do `-10:`.
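The slicing rules seen so far can be summarised in a short sketch. It uses a small made-up *DataFrame* with ten rows (the column name `value` and its contents are hypothetical) so that each result is easy to check by eye.

```python
import pandas as pd

# Hypothetical data: ten rows holding the values 100 to 109
df = pd.DataFrame({'value': list(range(100, 110))})

first_three = df.iloc[:3]    # positions 0, 1 and 2 (position 3 is excluded)
middle = df.iloc[3:6]        # positions 3, 4 and 5 (position 6 is excluded)
from_eighth = df.iloc[7:]    # position 7 to the end
last_two = df.iloc[-2:]      # the last two rows

print(first_three)
print(middle)
print(from_eighth)
print(last_two)
```

In every case the start position is included and the end position is excluded, which is why `3:6` returns exactly three rows.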

If we now want to **combine the selection of columns with the selection of rows**, we can follow the logic of the following example:

In [None]:
df.iloc[:5,0]

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: sepal_length, dtype: float64

In the example above, we are selecting the first five rows and the first column (did we mention that Python starts counting from zero?) of the data table. Therefore, when we use the `iloc` instruction, values to the left of the comma refer to rows and values to the right of the comma refer to columns.

Now let us look at another example, where **we select the first 10 rows and the first 3 columns** of the data table:

In [None]:
df.iloc[:10, :3]

Unnamed: 0,sepal_length,sepal_width,petal_length
0,5.1,3.5,1.4
1,4.9,3.0,1.4
2,4.7,3.2,1.3
3,4.6,3.1,1.5
4,5.0,3.6,1.4
5,5.4,3.9,1.7
6,4.6,3.4,1.4
7,5.0,3.4,1.5
8,4.4,2.9,1.4
9,4.9,3.1,1.5


We have learned that the `iloc` instruction serves to select data from the table, by indicating the rows and columns we want. 

An **alternative** way of selecting data subsets would be using the `loc` instruction. In this case, rather than identifying rows and columns by their position, we identify them by their label (for columns, their name). For example:

In [None]:
df.loc[:9, ['sepal_length', 'sepal_width', 'petal_length']]

Unnamed: 0,sepal_length,sepal_width,petal_length
0,5.1,3.5,1.4
1,4.9,3.0,1.4
2,4.7,3.2,1.3
3,4.6,3.1,1.5
4,5.0,3.6,1.4
5,5.4,3.9,1.7
6,4.6,3.4,1.4
7,5.0,3.4,1.5
8,4.4,2.9,1.4
9,4.9,3.1,1.5


In this case, the result is the same as when we did `df.iloc[:10, :3]`, but instead of defining positions, we defined the names of the data columns we wanted to select. Note that, unlike `iloc`, a `loc` slice includes its end label, which is why we write `:9` here rather than `:10`.

In practice, it does not make much difference whether we use `iloc` or `loc`. However, it is common for us to associate columns of data with their name and not with their index. As a result, it is quite possible that `loc` is used more frequently than `iloc`.
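One case where the choice does matter is slicing, because `iloc` excludes the end position while `loc` includes the end label. The sketch below shows this on made-up data; the column name `score` and the text labels are hypothetical.

```python
import pandas as pd

# Hypothetical data with text labels as the index
scores = pd.DataFrame({'score': [10, 20, 30, 40]},
                      index=['a', 'b', 'c', 'd'])

by_position = scores.iloc[:2]   # positions 0 and 1; the end position is excluded
by_label = scores.loc[:'b']     # labels 'a' and 'b'; the end label is included

# With the default integer index, labels and positions coincide,
# but the two slices still differ in length
numbers = pd.DataFrame({'score': [10, 20, 30, 40]})
rows_with_iloc = len(numbers.iloc[:2])
rows_with_loc = len(numbers.loc[:2])

print(by_position)
print(by_label)
print(rows_with_iloc, rows_with_loc)
```

With text labels both slices return the rows 'a' and 'b'. With the default integer index, `iloc[:2]` returns two rows and `loc[:2]` returns three, because the row labelled 2 is included.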

Finally, it is important that we talk about **conditional selection**. Conditional selection serves to select data according to certain conditions. To illustrate this, we will use conditional selection to find **observations where the petal length is greater than 6**:

In [None]:
df[df['petal_length'] > 6]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
105,7.6,3.0,6.6,2.1,virginica
107,7.3,2.9,6.3,1.8,virginica
109,7.2,3.6,6.1,2.5,virginica
117,7.7,3.8,6.7,2.2,virginica
118,7.7,2.6,6.9,2.3,virginica
122,7.7,2.8,6.7,2.0,virginica
130,7.4,2.8,6.1,1.9,virginica
131,7.9,3.8,6.4,2.0,virginica
135,7.7,3.0,6.1,2.3,virginica


In the previous instruction:
*   `df[]` indicates that we want to select a data subset from the `df` set.
*   `df['petal_length'] > 6` defines the selection rule which, in this case, is based on the condition of having a petal length greater than 6.

Therefore, the instruction `df[df['petal_length'] > 6]` reads as follows: 'data from `df` for which the condition `df['petal_length'] > 6` gives an outcome of true'.

To make the 'gives an outcome of true' more evident, we can see what happens when we just do `df['petal_length'] > 6`:

In [None]:
df['petal_length'] > 6

0      False
1      False
2      False
3      False
4      False
       ...  
145    False
146    False
147    False
148    False
149    False
Name: petal_length, Length: 150, dtype: bool

As we can see, each row is assessed as either true or false, depending on whether or not the condition is met. Thus, it is clear that the `df[]` instruction serves to select the data where the condition is true.

A logical continuation of the previous examples would be to define **two conditions** instead of one. Supposing the additional condition were that the petal width must be greater than 2, we would have the following:

In [None]:
df[(df['petal_length'] > 6) & (df['petal_width'] > 2)]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
105,7.6,3.0,6.6,2.1,virginica
109,7.2,3.6,6.1,2.5,virginica
117,7.7,3.8,6.7,2.2,virginica
118,7.7,2.6,6.9,2.3,virginica
135,7.7,3.0,6.1,2.3,virginica


In this case, we would use the `&` operator to indicate that we want to select all cases where **both conditions**, `df['petal_length'] > 6` and `df['petal_width'] > 2`, are true. Here, the `&` operator acts as a logical `AND`, applied row by row.

Similarly, we could select cases where only **one of the conditions** has to be true. For that, it would be enough to exchange the `&` operator for the `|` operator (which acts as a logical `OR`):

In [None]:
df[(df['petal_length'] > 6) | (df['petal_width'] > 2)]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
100,6.3,3.3,6.0,2.5,virginica
102,7.1,3.0,5.9,2.1,virginica
104,6.5,3.0,5.8,2.2,virginica
105,7.6,3.0,6.6,2.1,virginica
107,7.3,2.9,6.3,1.8,virginica
109,7.2,3.6,6.1,2.5,virginica
112,6.8,3.0,5.5,2.1,virginica
114,5.8,2.8,5.1,2.4,virginica
115,6.4,3.2,5.3,2.3,virginica
117,7.7,3.8,6.7,2.2,virginica
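The way conditions combine can be checked on a small made-up table, as sketched below. The values are hypothetical, the conditions are stored in the illustrative variables `long_petal` and `wide_petal`, and the `~` operator (logical `NOT`) is an addition that this tutorial does not otherwise cover.

```python
import pandas as pd

# Hypothetical measurements for four flowers
df = pd.DataFrame({'petal_length': [1.4, 6.6, 6.1, 5.1],
                   'petal_width': [0.2, 2.1, 1.9, 2.4]})

# Each condition is a Series of True/False values, one per row
long_petal = df['petal_length'] > 6
wide_petal = df['petal_width'] > 2

both = df[long_petal & wide_petal]         # AND: both conditions are true
either = df[long_petal | wide_petal]       # OR: at least one condition is true
neither = df[~(long_petal | wide_petal)]   # NOT: neither condition is true

print(both)
print(either)
print(neither)
```

Storing each condition in a variable also avoids a common mistake: when conditions are written inline, each one has to be wrapped in `()` as in the examples above, because `&` and `|` are evaluated before comparisons such as `>`.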


**Challenge:** Import the data found [here](https://raw.githubusercontent.com/pmarcelino/datasets/master/titanic.csv) and select the data that allows you to identify which passengers on the Titanic were over 65 years old. 

In [None]:
# Solution for the challenge

### Assignment

**Assignment** is an operation that involves assigning values to rows or columns. In general, assignments alter existing data but they can also be used to create new data.

Once you know how to select data, assignment is very simple. Let us start by imagining that we want **to assign the value 1.0 to the petal length of all the flowers** in our dataset. In that case, we would do the following:

In [None]:
df['petal_length'] = 1.0
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.0,0.2,setosa
1,4.9,3.0,1.0,0.2,setosa
2,4.7,3.2,1.0,0.2,setosa
3,4.6,3.1,1.0,0.2,setosa
4,5.0,3.6,1.0,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,1.0,2.3,virginica
146,6.3,2.5,1.0,1.9,virginica
147,6.5,3.0,1.0,2.0,virginica
148,6.2,3.4,1.0,2.3,virginica


As we can see, all the flowers in the dataset now have a value of '1.0' in the 'petal_length' column. 

Upon analysing the code, we find that assignment involves nothing more than selecting the column 'petal_length' and assigning a value to that selection (the `=` operator represents that assignment).

Following the same methodology, we could **assign a value to just one row** in the dataset:

In [None]:
df.iloc[0] = 2
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,2.0,2.0,2.0,2.0,2
1,4.9,3.0,1.0,0.2,setosa
2,4.7,3.2,1.0,0.2,setosa
3,4.6,3.1,1.0,0.2,setosa
4,5.0,3.6,1.0,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,1.0,2.3,virginica
146,6.3,2.5,1.0,1.9,virginica
147,6.5,3.0,1.0,2.0,virginica
148,6.2,3.4,1.0,2.3,virginica


In the case above, we selected the first row and assigned the value 2 to every column in it, including the 'species' column, which previously held text.

Following this logic, we can therefore conclude that in order to **assign values to a specific subset of data**, all we have to write is:

In [None]:
df.loc[:10, ['sepal_length', 'sepal_width', 'petal_length']] = 100
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,100.0,100.0,100.0,2.0,2
1,100.0,100.0,100.0,0.2,setosa
2,100.0,100.0,100.0,0.2,setosa
3,100.0,100.0,100.0,0.2,setosa
4,100.0,100.0,100.0,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,1.0,2.3,virginica
146,6.3,2.5,1.0,1.9,virginica
147,6.5,3.0,1.0,2.0,virginica
148,6.2,3.4,1.0,2.3,virginica


Or, if a **conditional selection** defines the subset:

In [None]:
df[(df['petal_length'] > 6) | (df['petal_width'] > 2)] = 200
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,200.0,200.0,200.0,200.0,200
1,200.0,200.0,200.0,200.0,200
2,200.0,200.0,200.0,200.0,200
3,200.0,200.0,200.0,200.0,200
4,200.0,200.0,200.0,200.0,200
...,...,...,...,...,...
145,200.0,200.0,200.0,200.0,200
146,6.3,2.5,1.0,1.9,virginica
147,6.5,3.0,1.0,2.0,virginica
148,200.0,200.0,200.0,200.0,200


The series of examples shown above serves to illustrate that, for any data assignment, all you have to do is select the data and use the `=` operator, followed by the value you want to assign.
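The conditional example above overwrote every column of the selected rows. If only one column should change, one option is to combine the condition with a column name inside `.loc`. The following is a minimal sketch on made-up data (the `toy` DataFrame and the label 'long-petal' are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'petal_length': [1.4, 6.6, 5.1],
    'species': ['setosa', 'virginica', 'virginica'],
})

# Select the rows with a condition and the column by name, then assign
mask = toy['petal_length'] > 6
toy.loc[mask, 'species'] = 'long-petal'

print(toy['species'].tolist())       # ['setosa', 'long-petal', 'virginica']
print(toy['petal_length'].tolist())  # [1.4, 6.6, 5.1]
```

Only the 'species' value of the matching row changes; the other columns keep their original values.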

Finally, we will demonstrate how we would **assign values to a new column**. In this case, let us imagine that we want to add a 'colour' column to our dataset and that we want all of our observations in that column to have the value 'green'. This can be done in the following way:

In [None]:
df['colour'] = 'green'
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,colour
0,200.0,200.0,200.0,200.0,200,green
1,200.0,200.0,200.0,200.0,200,green
2,200.0,200.0,200.0,200.0,200,green
3,200.0,200.0,200.0,200.0,200,green
4,200.0,200.0,200.0,200.0,200,green
...,...,...,...,...,...,...
145,200.0,200.0,200.0,200.0,200,green
146,6.3,2.5,1.0,1.9,virginica,green
147,6.5,3.0,1.0,2.0,virginica,green
148,200.0,200.0,200.0,200.0,200,green
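A new column does not have to hold a single constant value. As a minimal sketch, the example below builds a column from two existing ones (the `toy` DataFrame and the column name 'petal_area' are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'petal_length': [1.0, 2.0, 4.0],
    'petal_width':  [0.5, 1.0, 2.0],
})

# The multiplication is applied row by row; the result becomes a new column
toy['petal_area'] = toy['petal_length'] * toy['petal_width']

print(toy['petal_area'].tolist())  # [0.5, 2.0, 8.0]
```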


**Challenge:** Import the data found [here](https://raw.githubusercontent.com/pmarcelino/datasets/master/titanic.csv) and change the column 'Sex', so that the observations with the value 'male' are replaced by the value '1' and observations with the value 'female' are replaced by the value '0'.

In [None]:
# Solution for the challenge

### Grouping

**Grouping** is the process of splitting the data into groups whose rows share the same value in one or more columns. This task is useful when we want to perform operations or analyses on specific data subsets. 

To perform grouping, we use the `groupby()` method of the *DataFrame* (for example, `df.groupby()`). In this tutorial we will show you examples of how this method is used, and you can consult all its uses in the [official documentation of the *pandas* library](https://pandas.pydata.org).

So, let us start by recovering the original dataset, because in the previous section we made several changes to it.

In [None]:
url = 'https://raw.githubusercontent.com/pmarcelino/datasets/master/iris.csv'
df = pd.read_csv(url)
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


Now, let us imagine that we want to analyse **information about each species**. More specifically, we want to: 

1.   Count the number of flowers in each species.
2.   For each of the flowers’ physical characteristics, view the minimum value.

To count the **number of flowers in each species**, we do the following:

In [None]:
df.groupby('species').count()

species,sepal_length,sepal_width,petal_length,petal_width
setosa,50,50,50,50
versicolor,50,50,50,50
virginica,50,50,50,50


In the previous instruction:

*   `df.groupby()` indicates that we want to group data.
*   `'species'` defines the column whose values are used to form the groups.
  *   This is why it falls within the `()` of the `df.groupby()` instruction.
*   `.count()` indicates that we want to count the number of non-missing observations in each group.

As we can see, `df.groupby()` defines how the data is grouped and then the `.count()` method defines the operation we want to perform on each group.
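Because `.count()` counts non-missing values, its result can differ from the number of rows when a group has missing data. The sketch below, on made-up data (the `toy` DataFrame is an assumption for illustration), compares it with `.size()`, which counts rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing petal length
toy = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica'],
    'petal_length': [1.4, np.nan, 5.1],
})

counts = toy.groupby('species').count()  # non-missing values per column
sizes = toy.groupby('species').size()    # rows per group

print(counts['petal_length'].tolist())  # [1, 1]
print(sizes.tolist())                   # [2, 1]
```

In the iris dataset there are no missing values, which is why every column shows the same count of 50.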

In turn, to find the **minimum value for each of the flowers’ physical characteristics**, we do the following: 

In [None]:
df.groupby('species').min()

species,sepal_length,sepal_width,petal_length,petal_width
setosa,4.3,2.3,1.0,0.1
versicolor,4.9,2.0,3.0,1.0
virginica,4.9,2.2,4.5,1.4


In the previous instruction:

*   `df.groupby()` indicates that we want to group data.
*   `'species'` defines the column whose values are used to form the groups.
  *   This is why it falls within the `()` of the `df.groupby()` instruction.
*   `.min()` indicates that we want to view the minimum values for each of the flowers’ physical characteristics.

The examples above illustrate how `groupby()` is nothing more than a way to select and carry out operations on specific data subsets. 

Delving a little further, we can explore the `agg()` method, which allows us to combine **several operations** at once. For example, if we wanted to identify the minimum and maximum values, we could do the following:

In [None]:
# The strings 'min' and 'max' name the aggregations that pandas should apply
df.groupby('species').agg(['min', 'max'])

,sepal_length,sepal_length,sepal_width,sepal_width,petal_length,petal_length,petal_width,petal_width
species,min,max,min,max,min,max,min,max
setosa,4.3,5.8,2.3,4.4,1.0,1.9,0.1,0.6
versicolor,4.9,7.0,2.0,3.4,3.0,5.1,1.0,1.8
virginica,4.9,7.9,2.2,3.8,4.5,6.9,1.4,2.5
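The `agg()` method also accepts a dictionary, which makes it possible to apply a different operation to each column. The following is a minimal sketch on made-up data (the `toy` DataFrame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [5.1, 4.9, 6.3, 7.1],
    'petal_length': [1.4, 1.3, 6.0, 5.9],
})

# Keys are column names, values are the operations to apply to them
summary = toy.groupby('species').agg({
    'sepal_length': 'min',
    'petal_length': 'max',
})

print(summary.loc['setosa', 'sepal_length'])     # 4.9
print(summary.loc['virginica', 'petal_length'])  # 6.0
```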


Finally, we just need to see how to **group by several columns**:

In [None]:
df.groupby(['species','petal_width']).min()

species,petal_width,sepal_length,sepal_width,petal_length
setosa,0.1,4.3,3.0,1.1
setosa,0.2,4.4,2.9,1.0
setosa,0.3,4.5,2.3,1.3
setosa,0.4,5.0,3.4,1.3
setosa,0.5,5.1,3.3,1.7
setosa,0.6,5.0,3.5,1.6
versicolor,1.0,4.9,2.0,3.3
versicolor,1.1,5.1,2.4,3.0
versicolor,1.2,5.5,2.6,3.9
versicolor,1.3,5.5,2.3,3.6


And how to **combine several grouping columns with several operations**: 

In [None]:
# Group by two columns, then compute the minimum and maximum of the others
df.groupby(['species','petal_width']).agg(['min', 'max'])

,,sepal_length,sepal_length,sepal_width,sepal_width,petal_length,petal_length
species,petal_width,min,max,min,max,min,max
setosa,0.1,4.3,5.2,3.0,4.1,1.1,1.5
setosa,0.2,4.4,5.8,2.9,4.2,1.0,1.9
setosa,0.3,4.5,5.7,2.3,3.8,1.3,1.7
setosa,0.4,5.0,5.7,3.4,4.4,1.3,1.9
setosa,0.5,5.1,5.1,3.3,3.3,1.7,1.7
setosa,0.6,5.0,5.0,3.5,3.5,1.6,1.6
versicolor,1.0,4.9,6.0,2.0,2.7,3.3,4.1
versicolor,1.1,5.1,5.6,2.4,2.5,3.0,3.9
versicolor,1.2,5.5,6.1,2.6,3.0,3.9,4.7
versicolor,1.3,5.5,6.6,2.3,3.0,3.6,4.6
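When we group by several columns, those columns become a two-level index in the result. If ordinary columns are more convenient, one option is to call `reset_index()` afterwards. A minimal sketch on made-up data (the `toy` DataFrame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'species': ['setosa', 'setosa', 'setosa', 'versicolor'],
    'petal_width': [0.1, 0.1, 0.2, 1.0],
    'sepal_length': [4.3, 5.2, 4.4, 4.9],
})

grouped = toy.groupby(['species', 'petal_width']).min()
print(grouped.index.nlevels)  # 2 (one index level per grouping column)

# Turn the grouping columns back into ordinary columns
flat = grouped.reset_index()
print(flat.columns.tolist())       # ['species', 'petal_width', 'sepal_length']
print(flat['sepal_length'].tolist())  # [4.3, 4.4, 4.9]
```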


**Challenge**: Import the data found [here](https://raw.githubusercontent.com/pmarcelino/datasets/master/iris.csv) and group it by species, determining the mean values (`mean`) of each of the flowers’ physical characteristics.

In [None]:
# Solution for the challenge

### Sorting

**Sorting** refers to operations that arrange the rows of a dataset according to the values of one or more variables. Datasets are not always ordered the way we want them to be, so it is common to have to perform sorting operations.

To sort, we use the `sort_values()` method. The following example shows how to arrange a dataset in **ascending order** of a given variable (in this case, 'petal_width'):


In [None]:
df.sort_values(by='petal_width')

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
32,5.2,4.1,1.5,0.1,setosa
13,4.3,3.0,1.1,0.1,setosa
37,4.9,3.1,1.5,0.1,setosa
9,4.9,3.1,1.5,0.1,setosa
12,4.8,3.0,1.4,0.1,setosa
...,...,...,...,...,...
140,6.7,3.1,5.6,2.4,virginica
114,5.8,2.8,5.1,2.4,virginica
100,6.3,3.3,6.0,2.5,virginica
144,6.7,3.3,5.7,2.5,virginica


As you can see, the observations are no longer ordered by their index (which runs from 0 to 149); instead, they are arranged in ascending order of petal width. 
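Each row keeps its original index label after sorting, and the original variable is not modified. If a fresh 0, 1, 2, ... index is preferred, one option is `reset_index(drop=True)`. A minimal sketch on made-up data (the `toy` DataFrame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({'petal_width': [2.5, 0.1, 1.3]})

ordered = toy.sort_values(by='petal_width')
print(ordered.index.tolist())  # [1, 2, 0] (labels travel with their rows)

# Discard the old labels and number the rows again from 0
renumbered = ordered.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]

# The original DataFrame is left as it was
print(toy['petal_width'].tolist())  # [2.5, 0.1, 1.3]
```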

We could also have sorted the data in **descending order**. For that, we would have to set the `ascending` parameter to `False` (it is `True` by default, and that is why we did not have to change this parameter in the previous example):

In [None]:
df.sort_values(by='petal_width', ascending=False)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
100,6.3,3.3,6.0,2.5,virginica
109,7.2,3.6,6.1,2.5,virginica
144,6.7,3.3,5.7,2.5,virginica
114,5.8,2.8,5.1,2.4,virginica
140,6.7,3.1,5.6,2.4,virginica
...,...,...,...,...,...
13,4.3,3.0,1.1,0.1,setosa
37,4.9,3.1,1.5,0.1,setosa
32,5.2,4.1,1.5,0.1,setosa
34,4.9,3.1,1.5,0.1,setosa


As you can imagine, it is possible **to combine `sort_values()` with data selections**. Let us look at the following example: 

In [None]:
df.loc[:10, ['sepal_length', 'sepal_width', 'petal_length']].sort_values(by='sepal_width')

Unnamed: 0,sepal_length,sepal_width,petal_length
8,4.4,2.9,1.4
1,4.9,3.0,1.4
3,4.6,3.1,1.5
9,4.9,3.1,1.5
2,4.7,3.2,1.3
6,4.6,3.4,1.4
7,5.0,3.4,1.5
0,5.1,3.5,1.4
4,5.0,3.6,1.4
10,5.4,3.7,1.5


In the example above, just to switch things up, we decided to sort according to the variable 'sepal_width'. 

To finish, here is an example where the data is **ordered by two variables** (first by 'sepal_length' and then by 'sepal_width'):

In [None]:
df.loc[:10, ['sepal_length', 'sepal_width', 'petal_length']].sort_values(by=['sepal_length','sepal_width'])

Unnamed: 0,sepal_length,sepal_width,petal_length
8,4.4,2.9,1.4
3,4.6,3.1,1.5
6,4.6,3.4,1.4
2,4.7,3.2,1.3
1,4.9,3.0,1.4
9,4.9,3.1,1.5
7,5.0,3.4,1.5
4,5.0,3.6,1.4
0,5.1,3.5,1.4
10,5.4,3.7,1.5
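When sorting by two variables, the `ascending` parameter can also receive a list, with one value per variable. The sketch below, on made-up data (the `toy` DataFrame is an assumption for illustration), sorts by 'sepal_length' in ascending order and breaks ties by 'sepal_width' in descending order:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'sepal_length': [4.6, 4.6, 5.0, 4.4],
    'sepal_width':  [3.1, 3.4, 3.6, 2.9],
})

# One True/False per column listed in `by`
mixed = toy.sort_values(by=['sepal_length', 'sepal_width'],
                        ascending=[True, False])

print(mixed.index.tolist())  # [3, 1, 0, 2]
```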


**Challenge**: Import the data found [here](https://raw.githubusercontent.com/pmarcelino/datasets/master/titanic.csv) and sort it in descending order of age.

In [None]:
# Solution for the challenge

### Data Removal

*Pandas* also allows us to **remove** data. This is done using the `drop()` method, which makes it possible to remove either rows or columns. Note that the `dropna()` method, which we saw earlier, is a related method used to remove rows with missing data.

If we want to remove rows, we have to tell the `drop()` method which rows to remove. This is done by passing it the index labels of those rows. For example, if we want to **remove the first five rows**, we can do the following:

In [None]:
df.drop([0,1,2,3,4])

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


Alternatively, we can use the `index` attribute of the *DataFrame* to obtain the labels of the first five rows: 

In [None]:
df.drop(df.index[:5])

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


If we want **to remove rows based on a condition**, we do the following:

In [None]:
df.drop(df[df['species']=='setosa'].index)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
50,7.0,3.2,4.7,1.4,versicolor
51,6.4,3.2,4.5,1.5,versicolor
52,6.9,3.1,4.9,1.5,versicolor
53,5.5,2.3,4.0,1.3,versicolor
54,6.5,2.8,4.6,1.5,versicolor
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
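Removing rows based on a condition gives the same rows as selecting the rows that do not meet that condition. The sketch below compares the two approaches on made-up data (the `toy` DataFrame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'species': ['setosa', 'versicolor', 'setosa', 'virginica'],
    'petal_length': [1.4, 4.7, 1.3, 6.0],
})

# Approach 1: drop the rows whose species is 'setosa'
dropped = toy.drop(toy[toy['species'] == 'setosa'].index)

# Approach 2: keep the rows whose species is not 'setosa'
kept = toy[toy['species'] != 'setosa']

print(dropped.index.tolist())  # [1, 3]
print(kept.index.tolist())     # [1, 3]
```

Which form to use is largely a matter of readability; the second one avoids the intermediate `.index` step.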


To remove columns, the process is similar. The only differences are:

1.   The way we identify the columns to be removed is simpler (it is done by using the variables’ names).
2.   We have to make it clear that we want to remove columns (by assigning the value `1` to the `axis` parameter of the `drop()` method).

For example, if we wanted **to remove the 'petal_width' column**, we would do the following:

In [None]:
df.drop('petal_width', axis=1)

Unnamed: 0,sepal_length,sepal_width,petal_length,species
0,5.1,3.5,1.4,setosa
1,4.9,3.0,1.4,setosa
2,4.7,3.2,1.3,setosa
3,4.6,3.1,1.5,setosa
4,5.0,3.6,1.4,setosa
...,...,...,...,...
145,6.7,3.0,5.2,virginica
146,6.3,2.5,5.0,virginica
147,6.5,3.0,5.2,virginica
148,6.2,3.4,5.4,virginica


To remove a set of columns, we would apply the same logic we have already seen in previous examples:

In [None]:
df.drop(['petal_width', 'petal_length'], axis=1)

Unnamed: 0,sepal_length,sepal_width,species
0,5.1,3.5,setosa
1,4.9,3.0,setosa
2,4.7,3.2,setosa
3,4.6,3.1,setosa
4,5.0,3.6,setosa
...,...,...,...
145,6.7,3.0,virginica
146,6.3,2.5,virginica
147,6.5,3.0,virginica
148,6.2,3.4,virginica
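As an alternative to `axis=1`, the `drop()` method also has a `columns` parameter, which states the intent by name. A minimal sketch on made-up data (the `toy` DataFrame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'sepal_length': [5.1, 4.9],
    'petal_length': [1.4, 1.4],
    'petal_width':  [0.2, 0.2],
})

# Both calls remove the same two columns
by_axis = toy.drop(['petal_width', 'petal_length'], axis=1)
by_keyword = toy.drop(columns=['petal_width', 'petal_length'])

print(by_axis.columns.tolist())     # ['sepal_length']
print(by_keyword.columns.tolist())  # ['sepal_length']
```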


In all of these examples, it is important to note that the operation to remove data was performed but the `df` variable remained unchanged:

In [None]:
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


As we can see, the variable maintains the same rows and columns as before.

For the variable to be changed, it is necessary to save these changes in the variable itself, something we can do by redefining the `df` variable:

In [None]:
df = df.drop('petal_width', axis=1)
df

Unnamed: 0,sepal_length,sepal_width,petal_length,species
0,5.1,3.5,1.4,setosa
1,4.9,3.0,1.4,setosa
2,4.7,3.2,1.3,setosa
3,4.6,3.1,1.5,setosa
4,5.0,3.6,1.4,setosa
...,...,...,...,...
145,6.7,3.0,5.2,virginica
146,6.3,2.5,5.0,virginica
147,6.5,3.0,5.2,virginica
148,6.2,3.4,5.4,virginica


Now when we call the `df` variable, the dataset appears without the petal width variable. 

**Challenge**: Import the data found [here](https://raw.githubusercontent.com/pmarcelino/datasets/master/titanic.csv) and remove the 'Name', 'Sex' and 'Age' columns.

In [None]:
# Solution for the challenge

## Summary

In this tutorial, we looked at:

*   The ***Pandas* library**. The *pandas* library has a set of features that allow us to prepare and analyse data.
*   **Data structures**. *Pandas* uses two types of data structures, *Series* and *DataFrames*.
*   **Data import**. Data can be stored in files, such as CSV files, and it is possible to use *pandas* to import this data.
*   **Missing data**. The problem of missing data can be solved by eliminating data or imputing data.
*   **Other data preparation operations**. Data selection can be done in several ways, depending on what you want to do. With *pandas* it is also possible to assign, group, sort and remove data.

This tutorial introduced several instructions and you are not expected to know them by heart. Above all, this tutorial intends to illustrate the **potential of the *pandas* library** and to serve as a **document for future reference**. Later, with practice, you will begin to retain the instructions that you use most often and the logic behind each instruction will become more intuitive. 