***
# Pandas

Author: Olatomiwa Bifarin. 

_Read as draft_

**Notebook Content**

1. [What is Pandas?](#1)
2. [Introducing Series and Dataframes](#2) <br>
    2.1 [Series](#2.1) <br>
    2.2 [Dataframe](#2.2)
3. [Main Dataframe functionalities](#3) <br>
    3.1 [Listing Columns](#3.1) <br>
    3.2 [Dataframe dimensions](#3.2) <br>
    3.3 [Dataframe Information and Statistics](#3.3) <br>
    3.4 [Dropping Columns](#3.4) <br>
    3.5 [Value Counts](#3.5) <br>
    3.6 [Sorting Columns](#3.6) <br>
    3.7 [Indexing and Retrieval](#3.7) <br>
    3.8 [Renaming Columns](#3.8)
4. [Resources and References](#4)

## 1 | What is Pandas?
<a id='1'></a>

__[Pandas](https://pandas.pydata.org/)__ is an open-source Python library used for data handling and manipulation. It was developed to work with the NumPy library, and it is helpful to think of pandas as a building block for data wrangling and analysis in Python. So, we are talking about a very essential tool here. 

If you don't have pandas installed on your system, you can install the library from the terminal with the following command: <br> 

*<center>conda install pandas</center>* 

Anaconda will take care of the rest. 


**Installed Pandas Version**

In [44]:
pd.__version__

'0.23.4'

## 2 | Introducing Series and Dataframes
<a id='2'></a>

The main data structures are __Series__ and __Dataframe__. As the name suggests, a series is a one-dimensional labeled array: each value is paired with an index label. A dataframe is two-dimensional: you can think of it as a table whose columns are each a series, all sharing the same index.  
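As a minimal sketch of how the two structures relate, the snippet below builds one of each from made-up values. The names `scores` and `table` are illustrative only and are not used elsewhere in this notebook.

```python
import pandas as pd

# A Series: one run of values, each paired with an index label
scores = pd.Series([3, 5, 8], index=['a', 'b', 'c'])

# A DataFrame: several columns that share one index
table = pd.DataFrame({'x': [3, 5, 8], 'y': [1.0, 2.0, 4.0]},
                     index=['a', 'b', 'c'])

print(scores.ndim, table.ndim)       # number of dimensions of each structure
print(type(table['x']).__name__)     # a single column comes back as a Series
print(table['x'].index.tolist())     # the column carries the frame's index
```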

Let's take a look at these things more closely. 

### 2.1 | Series
<a id='2.1'></a>

In [45]:
# import numpy and pandas libraries
import numpy as np
import pandas as pd

**Declare a series**

A series of age

In [46]:
age = pd.Series([31, 26, 37, 49, 57])
age

0    31
1    26
2    37
3    49
4    57
dtype: int64

There you go! Notice how pandas assigns a default integer index to these values. However, we can set the index ourselves; we don't really need pandas's help. After all, we need the names of actual people to match these ages.  

In [47]:
age = pd.Series([31, 26, 37, 49, 57], 
             index=['tobi', 'ife', 'ayo', 'olori', 'ope'])
age

tobi     31
ife      26
ayo      37
olori    49
ope      57
dtype: int64

Simple. 

**Select values**

Now let's see if I can programmatically select, say, ayo's age. In other words, let's select a single element by its index label. 

In [48]:
age['ayo']

37

Olori's age

In [49]:
age['olori']

49

And we can even do both at the same time.

In [50]:
age[['ayo','olori']]

ayo      37
olori    49
dtype: int64

**Change values**

Let's say I suddenly realize that we got Olori's age wrong: she is actually 22. We can assign a new value to that element without declaring the entire series again. 

In [51]:
age['olori'] = 22
age

tobi     31
ife      26
ayo      37
olori    22
ope      57
dtype: int64

**Filter Series**

Now, I want only the entries with an age greater than 30. That is possible too. 

In [52]:
age[age > 30]

tobi    31
ayo     37
ope     57
dtype: int64
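The filter works because `age > 30` is itself a series of True/False values. As a small sketch, reusing the ages from above (with Olori's corrected age), conditions can also be combined with `&` (and) and `|` (or). The variable names `mask`, `between` and `extremes` are just for illustration.

```python
import pandas as pd

# Same made-up ages as the series above, after Olori's correction
age = pd.Series([31, 26, 37, 22, 57],
                index=['tobi', 'ife', 'ayo', 'olori', 'ope'])

mask = age > 30                           # boolean Series, one entry per person
between = age[(age > 30) & (age < 50)]    # both conditions must hold
extremes = age[(age < 25) | (age > 50)]   # either condition may hold

print(mask.tolist())
print(between.index.tolist())
print(extremes.index.tolist())
```

Each condition needs its own parentheses, because `&` and `|` bind more tightly than the comparison operators.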

**Mathematics on Series**

In [53]:
age * 2

tobi      62
ife       52
ayo       74
olori     44
ope      114
dtype: int64

In [54]:
age ** 2

tobi      961
ife       676
ayo      1369
olori     484
ope      3249
dtype: int64

And so on: each operation is applied element by element.
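Beyond arithmetic with a single number, a series can be aggregated, passed to NumPy functions, or combined with another series. The sketch below uses a hypothetical second series, `bonus`, to show that pandas matches values by index label rather than by position.

```python
import numpy as np
import pandas as pd

age = pd.Series([31, 26, 37], index=['tobi', 'ife', 'ayo'])
bonus = pd.Series([1, 2], index=['ayo', 'tobi'])   # hypothetical, different order

print(age.mean())       # aggregate the whole series to one number
print(np.sqrt(age))     # NumPy functions apply element by element

total = age + bonus     # values are paired up by label, not by position
print(total)            # 'ife' has no partner in `bonus`, so it becomes NaN
```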

### 2.2 | Dataframes
<a id='2.2'></a>
If you have ever seen a spreadsheet filled with data, then you know what a dataframe is: a table of rows and columns. It is an extended form of a series, where each column is a series and all columns share the same index.

And I can show you. 

**Declaring Dataframes**

This can be done by passing a dictionary into <mark>DataFrame( )</mark>: each key becomes a column name and each list becomes that column's values. 

In [55]:
# our dictionary: age_updated
age_updated = {'name' : ['tobi', 'ife', 'ayo', 'olori', 'ope'],
            'age' : [31, 26, 37, 49, 57], 
            'height(cm)' : [1.6, 1.8, 0.72, 1.72, 1.9], 
            'BMI': [31.1, 28.6, 25.7, 25.9, 25]}

dfage = pd.DataFrame(age_updated)
dfage

| | name | age | height(cm) | BMI |
|---|---|---|---|---|
| 0 | tobi | 31 | 1.6 | 31.1 |
| 1 | ife | 26 | 1.8 | 28.6 |
| 2 | ayo | 37 | 0.72 | 25.7 |
| 3 | olori | 49 | 1.72 | 25.9 |
| 4 | ope | 57 | 1.9 | 25.0 |


**Selecting specific Columns in Dataframes**

Now, I want to see only the name, height, and the associated BMI in a new dataframe. In other words, I want to select specific columns from our <mark>dfage</mark> dataframe.

In [56]:
newframe = pd.DataFrame(dfage, columns=['name','height(cm)','BMI'])
newframe

| | name | height(cm) | BMI |
|---|---|---|---|
| 0 | tobi | 1.6 | 31.1 |
| 1 | ife | 1.8 | 28.6 |
| 2 | ayo | 0.72 | 25.7 |
| 3 | olori | 1.72 | 25.9 |
| 4 | ope | 1.9 | 25.0 |
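A more common way to pick columns is to pass a list of column names inside square brackets. The sketch below uses a shortened, made-up version of the dataframe and compares the two approaches. `subset` and `via_constructor` are illustrative names.

```python
import pandas as pd

dfage = pd.DataFrame({'name': ['tobi', 'ife'],
                      'age': [31, 26],
                      'BMI': [31.1, 28.6]})

# Pass a list of column names inside the square brackets
subset = dfage[['name', 'BMI']]

# The constructor-based approach used above, for comparison
via_constructor = pd.DataFrame(dfage, columns=['name', 'BMI'])

print(subset.columns.tolist())
print(via_constructor.columns.tolist())
```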


## 3 | Main Dataframe functionalities
<a id='3'></a>

Now, to delve deeper into the main dataframe functionalities, let's take a hopefully more exciting dataset.

In [57]:
# import seaborn library
import seaborn as sns
sns.get_dataset_names(); # to check the names of the datasets available in seaborn

I like how this _exercise_ data set looks, so let's use it. And we can use the <mark>head( )</mark> method to peep at the top-most section of the dataframe. We can do the same for the 'tail' too. 

In [58]:
exercise = sns.load_dataset('exercise');
exercise.head()

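| | Unnamed: 0 | id | diet | pulse | time | kind |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | low fat | 85 | 1 min | rest |
| 1 | 1 | 1 | low fat | 85 | 15 min | rest |
| 2 | 2 | 1 | low fat | 88 | 30 min | rest |
| 3 | 3 | 2 | low fat | 90 | 1 min | rest |
| 4 | 4 | 2 | low fat | 92 | 15 min | rest |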


In [59]:
exercise.tail()

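| | Unnamed: 0 | id | diet | pulse | time | kind |
|---|---|---|---|---|---|---|
| 85 | 85 | 29 | no fat | 135 | 15 min | running |
| 86 | 86 | 29 | no fat | 130 | 30 min | running |
| 87 | 87 | 30 | no fat | 99 | 1 min | running |
| 88 | 88 | 30 | no fat | 111 | 15 min | running |
| 89 | 89 | 30 | no fat | 150 | 30 min | running |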


### 3.1 Listing Columns
<a id='3.1'></a>

What are the columns in my (exercise) dataframe? Yes, we can do that too. 

In [60]:
exercise.columns

Index(['Unnamed: 0', 'id', 'diet', 'pulse', 'time', 'kind'], dtype='object')

### 3.2 Dataframe dimensions
<a id='3.2'></a>

In [61]:
exercise.shape

(90, 6)

The output states it clearly: 90 rows, 6 columns. 

### 3.3 Dataframe Information and Statistics
<a id='3.3'></a>

In [62]:
exercise.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 6 columns):
Unnamed: 0    90 non-null int64
id            90 non-null int64
diet          90 non-null category
pulse         90 non-null int64
time          90 non-null category
kind          90 non-null category
dtypes: category(3), int64(3)
memory usage: 2.7 KB


<mark>category</mark> and <mark>int64</mark> are the data types of the columns. The first line shows the class of a pandas dataframe, 'pandas.core.frame.DataFrame'. The output also reports the range of the index, the total number of data columns (6), the number of non-null values in each column, and even the memory usage. 

In [63]:
exercise.describe()

| | Unnamed: 0 | id | pulse |
|---|---|---|---|
| count | 90.0 | 90.0 | 90.0 |
| mean | 44.5 | 15.5 | 99.7 |
| std | 26.124701 | 8.703932 | 14.858471 |
| min | 0.0 | 1.0 | 80.0 |
| 25% | 22.25 | 8.0 | 90.25 |
| 50% | 44.5 | 15.5 | 96.0 |
| 75% | 66.75 | 23.0 | 103.0 |
| max | 89.0 | 30.0 | 150.0 |


The <mark>describe( )</mark> method gives you the basic summary statistics: count, mean, standard deviation, minimum and maximum values, and the 25th, 50th (median), and 75th percentiles. The count excludes missing values, so a count lower than the number of rows tells you that a column has missing values. <br>

Now notice that the category columns are missing from this summary. This is because, by default, the <mark>describe( )</mark> method only considers the numerical data types. 

Now let's see if we can get it to consider our categorical columns; all we need is the <mark>include</mark> parameter. 

In [64]:
exercise.describe(include=['category'])

| | diet | time | kind |
|---|---|---|---|
| count | 90 | 90 | 90 |
| unique | 2 | 3 | 3 |
| top | low fat | 30 min | running |
| freq | 45 | 30 | 30 |
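The exercise data has no missing values, so here is a small sketch on a hypothetical frame, `toy`, with one missing pulse reading. It illustrates how the count reported by `describe()` skips missing values, and how `include='all'` summarizes numeric and non-numeric columns together.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing pulse reading
toy = pd.DataFrame({'pulse': [85.0, np.nan, 90.0, 95.0],
                    'kind': ['rest', 'rest', 'walking', 'running']})

summary = toy.describe()                 # numeric columns only by default
print(summary.loc['count', 'pulse'])     # missing values are not counted
print(toy['pulse'].isna().sum())         # explicit count of missing values

everything = toy.describe(include='all')  # numeric and non-numeric together
print(everything.columns.tolist())
```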


### 3.4 Dropping Columns
<a id='3.4'></a>

Take a look at the first column of our exercise dataframe, <mark>Unnamed: 0</mark>, which just repeats the row index. I don't like that column, it pisses me off, so let's see if we can drop it. And I will give my new dataframe a new name - <mark>df</mark>. 

In [65]:
df = exercise.drop(['Unnamed: 0'], axis=1)
df.head()

| | id | diet | pulse | time | kind |
|---|---|---|---|---|---|
| 0 | 1 | low fat | 85 | 1 min | rest |
| 1 | 1 | low fat | 85 | 15 min | rest |
| 2 | 1 | low fat | 88 | 30 min | rest |
| 3 | 2 | low fat | 90 | 1 min | rest |
| 4 | 2 | low fat | 92 | 15 min | rest |
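As a sketch of an alternative spelling, `drop` also accepts a `columns=` keyword in place of `axis=1`. The example uses a tiny made-up frame, `toy`, and also shows that `drop` returns a new dataframe rather than changing the original.

```python
import pandas as pd

toy = pd.DataFrame({'Unnamed: 0': [0, 1], 'id': [1, 1], 'pulse': [85, 85]})

# `columns=` is an alternative to passing the labels with axis=1
trimmed = toy.drop(columns=['Unnamed: 0'])

print(trimmed.columns.tolist())
print(toy.columns.tolist())    # the original frame is left untouched
```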


### 3.5 Value Counts
<a id='3.5'></a>

In [66]:
df['diet'].value_counts()

low fat    45
no fat     45
Name: diet, dtype: int64

In [67]:
df['diet'].value_counts(normalize=True)

low fat    0.5
no fat     0.5
Name: diet, dtype: float64
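Here is a short sketch of how `value_counts` treats missing values, using a made-up series called `diet` with one missing entry. By default missing values are skipped; `dropna=False` gives them their own row.

```python
import numpy as np
import pandas as pd

diet = pd.Series(['low fat', 'no fat', 'low fat', np.nan])

counts = diet.value_counts()                    # most frequent first, NaN skipped
with_missing = diet.value_counts(dropna=False)  # NaN gets its own row
shares = diet.value_counts(normalize=True)      # fractions of the non-missing rows

print(counts)
print(with_missing)
print(shares)
```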

### 3.6 Sorting Columns
<a id='3.6'></a>

Let's call our __df__ dataframe again - just the 'head'.

In [68]:
df.head(3)

| | id | diet | pulse | time | kind |
|---|---|---|---|---|---|
| 0 | 1 | low fat | 85 | 1 min | rest |
| 1 | 1 | low fat | 85 | 15 min | rest |
| 2 | 1 | low fat | 88 | 30 min | rest |


Let's take <mark>pulse</mark> as an example. Say I want to sort my entire dataframe based on the pulse. This is how you do it: 

In [69]:
df.sort_values(by='pulse', ascending=True).head(10)

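| | id | diet | pulse | time | kind |
|---|---|---|---|---|---|
| 9 | 4 | low fat | 80 | 1 min | rest |
| 10 | 4 | low fat | 82 | 15 min | rest |
| 11 | 4 | low fat | 83 | 30 min | rest |
| 16 | 6 | no fat | 83 | 15 min | rest |
| 15 | 6 | no fat | 83 | 1 min | rest |
| 45 | 16 | no fat | 84 | 1 min | walking |
| 32 | 11 | low fat | 84 | 30 min | walking |
| 17 | 6 | no fat | 84 | 30 min | rest |
| 0 | 1 | low fat | 85 | 1 min | rest |
| 1 | 1 | low fat | 85 | 15 min | rest |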


The same can be done on multiple columns, with a separate sort direction for each: 

In [70]:
df.sort_values(by=['pulse', 'time'],
               ascending=[False, True]).head()

| | id | diet | pulse | time | kind |
|---|---|---|---|---|---|
| 89 | 30 | no fat | 150 | 30 min | running |
| 77 | 26 | no fat | 143 | 30 min | running |
| 80 | 27 | no fat | 140 | 30 min | running |
| 83 | 28 | no fat | 140 | 30 min | running |
| 85 | 29 | no fat | 135 | 15 min | running |
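To make the role of the second sort key easier to see, here is a sketch on a small made-up frame, `toy`, with a tie in `pulse`. The second key only decides the order among rows that tie on the first.

```python
import pandas as pd

toy = pd.DataFrame({'pulse': [140, 150, 140, 135],
                    'id': [28, 30, 27, 29]})

# Highest pulse first; rows with equal pulse are ordered by ascending id
ranked = toy.sort_values(by=['pulse', 'id'], ascending=[False, True])
print(ranked)

# The original row labels travel with the rows; reset_index(drop=True)
# renumbers them 0..n-1 if that is preferred
renumbered = ranked.reset_index(drop=True)
print(renumbered)
```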


### 3.7 Indexing and Retrieval
<a id='3.7'></a>

To retrieve the first and the last rows of our dataframe, we can slice it with <mark>df[:1]</mark> and <mark>df[-1:]</mark> respectively.

In [71]:
df[:1]

| | id | diet | pulse | time | kind |
|---|---|---|---|---|---|
| 0 | 1 | low fat | 85 | 1 min | rest |


In [72]:
df[-1:]

| | id | diet | pulse | time | kind |
|---|---|---|---|---|---|
| 89 | 30 | no fat | 150 | 30 min | running |


To retrieve only the rows where the <mark>kind</mark> column equals <mark>walking</mark>, we can do the following:  

In [73]:
df[df['kind'] == 'walking'].head(6)

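| | id | diet | pulse | time | kind |
|---|---|---|---|---|---|
| 30 | 11 | low fat | 86 | 1 min | walking |
| 31 | 11 | low fat | 86 | 15 min | walking |
| 32 | 11 | low fat | 84 | 30 min | walking |
| 33 | 12 | low fat | 93 | 1 min | walking |
| 34 | 12 | low fat | 103 | 15 min | walking |
| 35 | 12 | low fat | 104 | 30 min | walking |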


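Note the use of boolean indexing here: the comparison below produces a series of True/False values, one per row, and only the rows marked True are kept. 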

In [74]:
df['kind'] == 'walking';
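As a sketch on a small made-up frame, `toy`, the mask can be inspected on its own, inverted with `~`, or built with `isin()` to match several values at once. The names `not_walking` and `active` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({'kind': ['rest', 'walking', 'running', 'walking'],
                    'pulse': [85, 86, 135, 104]})

mask = toy['kind'] == 'walking'     # Series of True/False, aligned with the rows
print(mask.tolist())

not_walking = toy[~mask]            # ~ flips every True/False in the mask
active = toy[toy['kind'].isin(['walking', 'running'])]  # membership test

print(not_walking.index.tolist())
print(active.index.tolist())
```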

Other ways of indexing include the <mark>loc</mark> and <mark>iloc</mark> indexers. <mark>loc</mark> selects by row and column labels, while <mark>iloc</mark> selects by integer position. Let's see how they work. Notice that with <mark>iloc</mark> the start of a range is included but the end is not, whereas a <mark>loc</mark> slice includes both endpoints! 

In [75]:
df.iloc[30:33, 1:4]

| | diet | pulse | time |
|---|---|---|---|
| 30 | low fat | 86 | 1 min |
| 31 | low fat | 86 | 15 min |
| 32 | low fat | 84 | 30 min |


In [76]:
df.loc[30:33, 'diet':'time']

| | diet | pulse | time |
|---|---|---|---|
| 30 | low fat | 86 | 1 min |
| 31 | low fat | 86 | 15 min |
| 32 | low fat | 84 | 30 min |
| 33 | low fat | 93 | 1 min |
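The difference between the two outputs above comes down to how each indexer treats the end of a slice. Here is a minimal sketch on a made-up frame, `toy`, whose row labels start at 30 so that labels and positions are easy to tell apart.

```python
import pandas as pd

toy = pd.DataFrame({'diet': ['low fat', 'low fat', 'no fat', 'no fat'],
                    'pulse': [86, 84, 93, 103],
                    'time': ['1 min', '15 min', '1 min', '15 min']},
                   index=[30, 31, 32, 33])

# iloc works on positions: rows 0-1 and columns 0-1, end excluded
by_position = toy.iloc[0:2, 0:2]

# loc works on labels: rows 30-31 and columns 'diet' to 'pulse', ends included
by_label = toy.loc[30:31, 'diet':'pulse']

print(by_position.index.tolist(), by_position.columns.tolist())
print(by_label.index.tolist(), by_label.columns.tolist())
```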


### 3.8 Renaming Columns
<a id='3.8'></a>

In [77]:
df.columns

Index(['id', 'diet', 'pulse', 'time', 'kind'], dtype='object')

In [78]:
df = df.rename({'id':'No', 'diet':'food'}, axis=1)
df.columns

Index(['No', 'food', 'pulse', 'time', 'kind'], dtype='object')

To add a prefix or suffix to every column name:

In [82]:
df.add_prefix('X').head(1)

| | XNo | Xfood | Xpulse | Xtime | Xkind |
|---|---|---|---|---|---|
| 0 | 1 | low fat | 85 | 1 min | rest |


In [84]:
df.add_suffix('Y').head(1)

| | NoY | foodY | pulseY | timeY | kindY |
|---|---|---|---|---|---|
| 0 | 1 | low fat | 85 | 1 min | rest |


Now let's change things back. 

In [86]:
df = df.rename({'No':'id', 'food':'diet'}, axis=1)
df.columns

Index(['id', 'diet', 'pulse', 'time', 'kind'], dtype='object')
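As a sketch of an equivalent spelling, `rename` also accepts the mapping through a `columns=` keyword instead of `axis=1`. The example uses a one-row made-up frame, `toy`, and builds the reverse mapping with a dictionary comprehension; `mapping` and `undo` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({'id': [1], 'diet': ['low fat'], 'pulse': [85]})
mapping = {'id': 'No', 'diet': 'food'}

# `columns=` is an alternative to passing the mapping with axis=1
renamed = toy.rename(columns=mapping)
print(renamed.columns.tolist())

# Swap keys and values to undo the change
undo = {new: old for old, new in mapping.items()}
restored = renamed.rename(columns=undo)
print(restored.columns.tolist())
```

Columns that do not appear in the mapping, such as `pulse` here, keep their names.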

## 4 | Resources and References
<a id='4'></a>
-  __[Pandas website](https://pandas.pydata.org/)__
-  __[Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)__
-  __[F. Nelli, *Python Data Analytics* 2018](https://doi.org/10.1007/978-1-4842-3913-1_4)__
-  __[Data School, My Top 25 Pandas Tricks](https://www.youtube.com/watch?v=RlIiVeig3hc&list=WL&index=5&t=62s)__ 