# Introduction to data manipulation and analysis with Pandas

## Try me
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ffraile/computer_science_tutorials/blob/main/source/Data%20Manipulation/tutorials/Pandas%20tutorial.ipynb)[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ffraile/computer_science_tutorials/main?labpath=source%2FData%20Manipulation%2Ftutorials%2FPandas%20tutorial.ipynb)

[Pandas](https://pandas.pydata.org/) is a powerful library that provides convenient data structures and functions to work with data.
Throughout the next chapters, we will use Pandas for data manipulation and analysis.
To use Pandas in your project, you first need to install it in your environment. Additionally, in this tutorial we will import `display` and `Markdown` from IPython to render the dataframes as tables.

In [None]:
# !pip install pandas
!pip install ipython

In [None]:
# Import pandas
import pandas as pd

# Import display and Markdown
from IPython.display import display, Markdown


## Introduction to Pandas data structures
This introduction covers the fundamentals of the Pandas data structures used in this course. It is nevertheless recommended that you also take a look at the [official documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dsintro).

There are primarily two types of Pandas data structures:

- **Dataframes**: Dataframes are two-dimensional data structures, similar to a two-dimensional numpy array, a spreadsheet, or an SQL table.
- **Series**: Series are one-dimensional data structures, similar to one-dimensional numpy arrays.
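
As a minimal sketch of how the two structures relate (the names `prices` and `table` are made up for this illustration), a dataframe can be assembled from series, and each of its columns is again a series:

```python
import pandas as pd

# A Series is a one-dimensional labelled array
prices = pd.Series([1.5, 2.0, 3.25], name="price")

# A DataFrame is a two-dimensional table; each column is a Series
table = pd.DataFrame({"price": prices, "units": [10, 20, 30]})

print(type(table["price"]).__name__)  # Series
print(table.shape)  # (3, 2)
```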

### Dataframes
Dataframes facilitate the manipulation and analysis of two-dimensional data, combining convenient functions to:

- **Extract** data from different sources, like CSV files, Excel sheets, or SQL databases
- **Transform** data, applying functions to change the values, or combining different data sources to prepare the data for our application
- **Load** data into output CSV files, Excel sheets or SQL databases
- **Analyse** the data

The columns of a dataframe are not necessarily of the same type. In the course, we will primarily work with numeric types, string types and the datetime type, so we will focus on those three types in this tutorial. We will also store complex objects and use convenient Pandas functions to manipulate dataframes that contain them, so Pandas is going to be very useful throughout the course!
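
To make the point about mixed column types concrete, here is a small sketch with one numeric, one string and one datetime column. The dataframe `mixed` and its column names are hypothetical, and the exact name printed for the string column's type may depend on the installed Pandas version:

```python
import pandas as pd

# Hypothetical dataframe with one column per type discussed above
mixed = pd.DataFrame({
    "quantity": [1, 2, 3],                      # numeric
    "label": ["a", "b", "c"],                   # string
    "day": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),  # datetime
})

# dtypes lists the type of every column
print(mixed.dtypes)
```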

You can create a dataframe from many different kinds of input. In this course, we will primarily use:

- A dictionary of one dimensional numpy arrays, lists, dicts, or Series
- CSV, or Excel files
- Another Pandas dataframe
- A Pandas series

Let us start with a simple example, using a dictionary.

In [4]:
# This is a dictionary (named data to avoid shadowing the built-in dict)
data = {"one": [1, 2, 3, 4, 5], "two": [5, 4, 3, 2, 1], "three":["Awesome", "Brutal", "Cool", "Divine", "Exquisite"], "date": ['2021-01-01', '2021-01-02', '2021-01-02', '2021-02-03', '2021-02-05'] }

# And this is a dataframe
dataframe = pd.DataFrame.from_dict(data)

display(dataframe)

Unnamed: 0,one,two,three,date
0,1,5,Awesome,2021-01-01
1,2,4,Brutal,2021-01-02
2,3,3,Cool,2021-01-02
3,4,2,Divine,2021-02-03
4,5,1,Exquisite,2021-02-05


Our Pandas dataframe has four columns, named 'one', 'two', 'three', and 'date'. Notice the integer number in the column to the left. This is the **index**, and by default, the index has no name.

Another convenient way to create a dataframe is to use a list of records, where each record is a dictionary:

In [5]:
# This is a list of dicts, one per record
records = [{"one": 1, "two": 5, "three":"Awesome", "date":"2021-01-01"},
           {"one": 2, "two": 4, "three":"Brutal", "date":"2021-01-02"},
           {"one": 3, "two": 3, "three":"Cool", "date":"2021-01-03"},
           {"one": 4, "two": 2, "three":"Divine", "date":"2021-02-04"},
           {"one": 5, "two": 1, "three":"Exquisite", "date":"2021-02-05"}
           ]

# This is a dataframe
dataframe_2 = pd.DataFrame.from_records(records)
display(dataframe_2)

Unnamed: 0,one,two,three,date
0,1,5,Awesome,2021-01-01
1,2,4,Brutal,2021-01-02
2,3,3,Cool,2021-01-03
3,4,2,Divine,2021-02-04
4,5,1,Exquisite,2021-02-05


Another convenient way to create a dataframe is to load a CSV file:

``` python
online_retail = pd.read_csv('datasets/Online Retail_2.csv', usecols=['InvoiceNo', 'StockCode', 'Quantity', 'InvoiceDate'])

```
This line of code will create a dataframe from the file 'datasets/Online Retail_2.csv', loading only data from the columns labeled 'InvoiceNo', 'StockCode', 'Quantity', and 'InvoiceDate'.
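
Since the CSV file may not be at hand, the following self-contained sketch reads the same kind of data from an in-memory string instead. The CSV content and the `Description` column are invented for illustration; `parse_dates` is an optional extra that asks Pandas to convert the listed columns to datetimes while loading:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """InvoiceNo,StockCode,Description,Quantity,InvoiceDate
536365,85123A,HOLDER,6,2010-12-01 08:26:00
536366,22633,WARMER,3,2010-12-01 08:28:00
"""

# usecols keeps only the listed columns; Description is skipped
sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["InvoiceNo", "StockCode", "Quantity", "InvoiceDate"],
    parse_dates=["InvoiceDate"],
)
print(sample.dtypes)
```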

#### Accessing data
There are multiple ways to access the data in a dataframe. Dataframe columns are **series**, the other important data structure in Pandas. You can access columns in a dataframe using column names as keys:

In [9]:
names = dataframe['three']
display(names)

Unnamed: 0,three
0,Awesome
1,Brutal
2,Cool
3,Divine
4,Exquisite


Notice that series are also indexed, so you can access *elements* or *slices* just as in numpy:


In [10]:
print(names[0])
print(names[1])
print(names[0:2])


Awesome
Brutal
0    Awesome
1     Brutal
Name: three, dtype: object


You can also access the elements directly from the dataframe:

In [11]:
print(dataframe['three'][0])

Awesome
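
Besides chaining square brackets, Pandas offers the `.loc` (label-based) and `.iloc` (position-based) accessors, which select a row and a column in a single step. A minimal sketch on a small stand-in dataframe (`df` is not the tutorial's dataframe, just a shortened copy for illustration):

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "three": ["Awesome", "Brutal", "Cool"]})

# .loc selects by label: row label 0, column label "three"
by_label = df.loc[0, "three"]

# .iloc selects by integer position: first row, second column
by_position = df.iloc[0, 1]

print(by_label, by_position)  # Awesome Awesome
```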


#### Queries
You can use queries to access parts of your dataframe, that is, to *filter* the data, using Python expressions.
The function *query()* is used to issue the query. For instance, the following query finds all rows where the column 'two' is less than or equal to 2:


In [13]:
filtered_df = dataframe.query('two <= 2')
display(filtered_df)

Unnamed: 0,one,two,three,date
3,4,2,Divine,2021-02-03
4,5,1,Exquisite,2021-02-05


You can combine different columns in the expressions. For instance, the first example below keeps the rows where the column 'two' is greater than 3 and the column 'three' is equal to "Awesome". The second example keeps the rows where the column 'one' is equal to the column 'two'.

In [14]:
filtered_df2 = dataframe.query('two > 3 and three == "Awesome"')
display(filtered_df2)
filtered_df3 = dataframe.query('one == two')
display(filtered_df3)

Unnamed: 0,one,two,three,date
0,1,5,Awesome,2021-01-01


Unnamed: 0,one,two,three,date
2,3,3,Cool,2021-01-02
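
The same kind of filter can also be written with a boolean mask instead of a query string. The sketch below, on a reduced stand-in dataframe `df`, suggests that both forms select the same rows:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4, 5], "two": [5, 4, 3, 2, 1]})

# Filter with a query string
with_query = df.query("two <= 2")

# Filter with a boolean mask: df["two"] <= 2 is a series of True/False values
with_mask = df[df["two"] <= 2]

print(with_query.equals(with_mask))  # True
```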



#### Vectorized functions
Pandas series have a huge set of vectorized functions that you can apply to the entire series, for instance, functions like those seen in the numpy tutorial:



In [15]:
display(dataframe['two'].sum())
display(dataframe['one'].cumsum())
display(dataframe['two'].mean())

15

Unnamed: 0,one
0,1
1,3
2,6
3,10
4,15


3.0

One interesting function is describe(), which returns statistical information about the series or dataframe. For a dataframe, the result is another dataframe with one column per numeric column:

In [16]:
dataframe.describe()


Unnamed: 0,one,two
count,5.0,5.0
mean,3.0,3.0
std,1.581139,1.581139
min,1.0,1.0
25%,2.0,2.0
50%,3.0,3.0
75%,4.0,4.0
max,5.0,5.0


Another important vectorized function is unique(), which returns the distinct values in a series.
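
As a small illustration of unique(), assuming a stand-in series `dates` with one repeated value, the function returns each distinct value once, in order of first appearance. The related function nunique() returns only the count:

```python
import pandas as pd

dates = pd.Series(["2021-01-01", "2021-01-02", "2021-01-02", "2021-02-03"])

# Distinct values, in order of first appearance
distinct = dates.unique()
print(distinct)

# Number of distinct values
print(dates.nunique())  # 3
```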

All string type functions are vectorized and can be accessed through *str*. For instance, the following example converts every string in the series to lowercase:

In [18]:
lower_three = dataframe['three'].str.lower()
display(lower_three)

Unnamed: 0,three
0,awesome
1,brutal
2,cool
3,divine
4,exquisite


*str* provides vectorized access to the string value of the series, and thus you can access parts of every string in a series using indexes. For instance, the following example gets the third character (index 2) of each string:

In [22]:
third_character = dataframe['three'].str[2]
display(third_character)

Unnamed: 0,three
0,e
1,u
2,o
3,v
4,q


Similar to str, *dt* implements vectorized functions to work with dates. The following example first transforms the column "date" into a datetime, and then calculates the year period.

In [25]:
year = pd.to_datetime(dataframe['date']).dt.to_period("Y")
display(year)

Unnamed: 0,date
0,2021
1,2021
2,2021
3,2021
4,2021


The function pd.to_datetime converts the series in column 'date' into a datetime. Then, dt.to_period() transforms this column into a period. In this case, the letter "Y" represents *years*, and therefore the period obtained is the year; passing "M" instead would give *months*.
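
A short sketch contrasting a few *dt* options on a stand-in series `dates`: to_period("M") gives a year-month period, while dt.month and dt.year return plain integers:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2021-01-01", "2021-02-03"]))

# Monthly periods, converted to strings for printing
print(dates.dt.to_period("M").astype(str).tolist())  # ['2021-01', '2021-02']

# Month and year as integers
print(dates.dt.month.tolist())  # [1, 2]
print(dates.dt.year.tolist())   # [2021, 2021]
```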

Finally, *apply* allows us to apply a function, for instance a *lambda* function, to all the values of a dataframe or of just a series:

In [29]:
scaled = dataframe['one'].apply(lambda x: x*.1)
display(scaled)

Unnamed: 0,one
0,0.1
1,0.2
2,0.3
3,0.4
4,0.5


In the code above, we apply a newly created lambda function that takes one parameter x and multiplies it by 0.1. apply() will apply this function to every value of the series in column 'one'.
There is a huge set of functions available, and many things you can do that are not covered in this tutorial, so it is worth checking the documentation at
[Pandas Series API Reference](https://pandas.pydata.org/docs/reference/series.html)
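
For simple arithmetic, the same result can usually be obtained without apply(), because arithmetic operators on a series are already vectorized. A minimal sketch, using a stand-in series `values` and an illustrative "add one" function:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

# Element by element with apply and a lambda function
with_apply = values.apply(lambda x: x + 1)

# The same computation written as a vectorized expression
vectorized = values + 1

print(with_apply.equals(vectorized))  # True
```

apply() remains useful when the transformation cannot be expressed with the built-in vectorized operations.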

#### Indexing
By default, the dataframes we have created have an integer *index* to access the data in a row, but we can define any type of index. Let us create another dataframe with another type of index:

In [32]:
data_2 = {"2021-01-01": ['003', '46021'], "2021-01-02": ['005', '46022'], "2021-01-03": ['007', '46022'],
          "2021-02-04": ['003', '46021'], "2021-02-05": ['003', '46027']}

dataframe_3 = pd.DataFrame.from_dict(data_2, orient='index', columns=['client_id', 'postal_code'])
display(dataframe_3)

Unnamed: 0,client_id,postal_code
2021-01-01,003,46021
2021-01-02,005,46022
2021-01-03,007,46022
2021-02-04,003,46021
2021-02-05,003,46027


We have used two new parameters to create our dataframe from a dictionary (from_dict): *orient* and *columns*. When *orient* is 'index', the dictionary keys are used as the index of the created dataframe. In this case, the keys are strings. The *columns* parameter allows us to give names to the columns of our dataframe.
Now, we can access the data using the new index:

In [33]:
print(dataframe_3['client_id']['2021-01-01'])

003
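
The same lookup can be sketched with `.loc`, which takes the index label first and, optionally, the column label second. The dataframe `df` below is a shortened stand-in for the one created above:

```python
import pandas as pd

df = pd.DataFrame.from_dict(
    {"2021-01-01": ["003", "46021"], "2021-01-02": ["005", "46022"]},
    orient="index",
    columns=["client_id", "postal_code"],
)

# A whole row, selected by its index label (returned as a series)
row = df.loc["2021-01-01"]

# A single value, selected by index label and column label
value = df.loc["2021-01-01", "client_id"]

print(value)  # 003
```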


We can also set the index of an existing dataframe using its columns. Let us change the index of our first dataframe:

In [34]:
new_dataframe = dataframe.set_index(['one'])
display(new_dataframe)

one,two,three,date
1,5,Awesome,2021-01-01
2,4,Brutal,2021-01-02
3,3,Cool,2021-01-02
4,2,Divine,2021-02-03
5,1,Exquisite,2021-02-05


Notice that, since our column had a name, our index now also has a name. We can also set the name of an existing index:


In [35]:
dataframe_3.index.set_names(['date'], inplace=True)
display(dataframe_3)

date,client_id,postal_code
2021-01-01,003,46021
2021-01-02,005,46022
2021-01-03,007,46022
2021-02-04,003,46021
2021-02-05,003,46027


We can use several columns as index levels to create a hierarchical index (a MultiIndex), which gives a dataframe similar to a **pivot table**:

In [36]:
dataframe_4 = dataframe.set_index(['one', 'two', 'three'])
display(dataframe_4)
print(dataframe_4['date'][1][5]['Awesome'])

one,two,three,date
1,5,Awesome,2021-01-01
2,4,Brutal,2021-01-02
3,3,Cool,2021-01-02
4,2,Divine,2021-02-03
5,1,Exquisite,2021-02-05


2021-01-01
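
With an index made of several columns, a value can also be looked up in one step by passing a tuple with one label per index level to `.loc`. A minimal sketch on a two-row stand-in dataframe `df`:

```python
import pandas as pd

df = pd.DataFrame({
    "one": [1, 2],
    "two": [5, 4],
    "three": ["Awesome", "Brutal"],
    "date": ["2021-01-01", "2021-01-02"],
})
indexed = df.set_index(["one", "two", "three"])

# One label per index level, then the column label
value = indexed.loc[(1, 5, "Awesome"), "date"]

print(value)  # 2021-01-01
print(list(indexed.index.names))  # ['one', 'two', 'three']
```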


#### New columns
To add a new column, you just need to assign values to a new column name. For instance, let us add to the original dataframe a new column containing the month:

In [37]:
dataframe['month'] = pd.to_datetime(dataframe['date']).dt.month
display(dataframe)

Unnamed: 0,one,two,three,date,month
0,1,5,Awesome,2021-01-01,1
1,2,4,Brutal,2021-01-02,1
2,3,3,Cool,2021-01-02,1
3,4,2,Divine,2021-02-03,2
4,5,1,Exquisite,2021-02-05,2
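
New columns are often computed from existing ones. The sketch below, on a stand-in dataframe `df` with made-up column names `total` and `double`, shows direct assignment, which modifies the dataframe in place, and assign(), which returns a new dataframe and leaves the original untouched:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [5, 4, 3]})

# Direct assignment adds the column to df itself
df["total"] = df["one"] + df["two"]

# assign() returns a copy with the extra column; df is not modified
extended = df.assign(double=df["one"] * 2)

print(df.columns.tolist())        # ['one', 'two', 'total']
print(extended.columns.tolist())  # ['one', 'two', 'total', 'double']
```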


#### Grouping
Grouping allows us to gather the rows that share the same value of an index (or of a column) and to aggregate each group:

In [38]:
ones_in_month = dataframe.set_index(['month']).groupby(['month']).sum()
display(ones_in_month)



month,one,two,three,date
1,6,12,AwesomeBrutalCool,2021-01-012021-01-022021-01-02
2,9,3,DivineExquisite,2021-02-032021-02-05


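Note that we first define the index month, and then we group by this index. Finally, we apply the vectorized function sum to add up all the values of each column that share the same value of the index month.
In the output above, the string columns 'three' and 'date' are concatenated, because summing strings joins them. Older Pandas versions silently dropped such columns instead, and you can pass numeric_only=True to sum() to keep only the numeric columns.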
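
As a sketch of how to control which columns are aggregated, the example below groups a stand-in dataframe `df` directly by a column (no index needed), restricts sum() to numeric columns with `numeric_only=True`, and uses agg() with named aggregations to compute several statistics at once. The output column names `total` and `rows` are made up for this illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "month": [1, 1, 2],
    "one": [1, 2, 4],
    "three": ["Awesome", "Brutal", "Divine"],
})

# Sum only the numeric columns of each group
totals = df.groupby("month").sum(numeric_only=True)

# Several aggregations at once, each with its own output column name
summary = df.groupby("month").agg(total=("one", "sum"), rows=("one", "count"))

print(totals)
print(summary)
```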

#### Joining
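Indexes allow us to join different dataframes: join() matches the rows of both dataframes whose index values are equal. With how="inner", only the index values present in both dataframes are kept: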

In [None]:
dataframe_5 = dataframe.set_index(['date'])
dataframe_6 = dataframe_5.join(dataframe_3, how="inner")
display(dataframe_6)
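
To illustrate the effect of the `how` parameter, the following sketch joins two small stand-in dataframes, `left` and `right`, that share only part of their index. An inner join keeps the index values found in both, whereas a left join keeps every row of the left dataframe and fills the missing values with NaN:

```python
import pandas as pd

left = pd.DataFrame(
    {"one": [1, 2, 3]},
    index=pd.Index(["2021-01-01", "2021-01-02", "2021-02-03"], name="date"),
)
right = pd.DataFrame(
    {"client_id": ["003", "005"]},
    index=pd.Index(["2021-01-01", "2021-01-02"], name="date"),
)

# Only the dates present in both dataframes
inner = left.join(right, how="inner")

# All dates of the left dataframe; client_id is NaN where there is no match
left_join = left.join(right, how="left")

print(len(inner), len(left_join))  # 2 3
```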