# Pandas
---

Pandas is a data analysis library built on top of NumPy.

In [2]:
import pandas as pd

In [3]:
names = ["Nepal", "India", "Bhutan"]
codes = [977, 81, 123]

In [4]:
list(zip(names, codes))

[('Nepal', 977), ('India', 81), ('Bhutan', 123)]

**zip:** *`zip` pairs up the elements of the given lists position by position, producing one tuple per position*

In [5]:
zip([1, 2, 3], [11, 12, 13], [21, 22, 23])

<zip at 0x7fa978c41f88>

In [6]:
list(zip([1, 2, 3], [11, 12, 13], [21, 22, 23]))

[(1, 11, 21), (2, 12, 22), (3, 13, 23)]
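
One detail the cells above do not show is what happens when the lists differ in length. The sketch below uses a deliberately shortened `codes` list (an illustrative assumption, not part of the dataset used later) to show that `zip` stops at the shortest input, and that `zip(*pairs)` reverses the pairing.

```python
# zip stops as soon as the shortest input is exhausted.
names = ["Nepal", "India", "Bhutan"]
codes = [977, 81]  # deliberately one item short (illustrative only)

pairs = list(zip(names, codes))
print(pairs)  # [('Nepal', 977), ('India', 81)]

# zip(*pairs) undoes the pairing: one tuple per original list.
unzipped_names, unzipped_codes = zip(*pairs)
print(unzipped_names)  # ('Nepal', 'India')
print(unzipped_codes)  # (977, 81)
```

"Bhutan" is silently dropped here, so it is worth checking list lengths before zipping data that should line up.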

*We create a dataset with names and codes*

In [7]:
dataset = list(zip(names, codes))

In [8]:
dataset

[('Nepal', 977), ('India', 81), ('Bhutan', 123)]

**Now we create a dataframe**

A dataframe is tabular data with rows and columns, similar to a spreadsheet in Excel.

In [9]:
df = pd.DataFrame(data=dataset, columns=["Name", "Code"])

In [10]:
df

| | Name | Code |
|---|---|---|
| 0 | Nepal | 977 |
| 1 | India | 81 |
| 2 | Bhutan | 123 |


*Here, each row is one observation, and each column is a parameter recorded for every observation*

*In the dataframe above, Name and Code are the parameters we use to describe a country, and each row, e.g. Nepal and 977, together identifies one country*

*__0__, __1__, ... form the index of the dataframe; pandas generates it automatically when no index is supplied*
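
As a minimal sketch of the point about the index, the following compares the automatically generated index with one passed explicitly. The labels `"a"`, `"b"`, `"c"` and the names `df_default` and `df_labeled` are assumptions made up for this illustration.

```python
import pandas as pd

dataset = [("Nepal", 977), ("India", 81), ("Bhutan", 123)]

# Default: pandas generates an integer index 0, 1, 2, ...
df_default = pd.DataFrame(data=dataset, columns=["Name", "Code"])
print(list(df_default.index))  # [0, 1, 2]

# Illustrative assumption: custom string labels passed through `index`.
df_labeled = pd.DataFrame(data=dataset, columns=["Name", "Code"],
                          index=["a", "b", "c"])
print(list(df_labeled.index))  # ['a', 'b', 'c']
```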

*Let's create a larger dataset*

In [11]:
import numpy as np

In [12]:
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']

In [13]:
random_names = [names[np.random.randint(low=0, high=len(names))] 
                for i in range(1000)]

In [14]:
random_names[:10]

['Mel',
 'Mary',
 'Bob',
 'Jessica',
 'John',
 'Mel',
 'John',
 'Jessica',
 'Mary',
 'John']

In [15]:
ages = [np.random.randint(low=1, high=100) for i in range(1000)]

In [16]:
ages[:10]

[55, 58, 48, 56, 95, 42, 66, 42, 69, 40]

In [17]:
df = pd.DataFrame(list(zip(random_names, ages)), columns=["Name", "Age"])

In [18]:
df

| | Name | Age |
|---|---|---|
| 0 | Mel | 55 |
| 1 | Mary | 58 |
| 2 | Bob | 48 |
| 3 | Jessica | 56 |
| 4 | John | 95 |
| 5 | Mel | 42 |
| 6 | John | 66 |
| 7 | Jessica | 42 |
| 8 | Mary | 69 |
| 9 | John | 40 |
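
Because the data above is random, rerunning the cells gives different names and ages each time. One possible way to make it repeatable, sketched here as an alternative rather than as the method used above, is NumPy's seeded `Generator`. The seed `42` and the name `df_seeded` are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

names = ["Bob", "Jessica", "Mary", "John", "Mel"]

# Assumption: a fixed seed (42 is arbitrary) so every run gives the same data.
rng = np.random.default_rng(42)

random_names = rng.choice(names, size=1000)      # 1000 names drawn with replacement
ages = rng.integers(low=1, high=100, size=1000)  # integers from 1 to 99 inclusive

df_seeded = pd.DataFrame({"Name": random_names, "Age": ages})
print(df_seeded.shape)  # (1000, 2)
```

Like `np.random.randint`, `rng.integers` excludes `high` by default, so both approaches produce ages from 1 to 99.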


**Overview of data**

*show first 5 rows*

In [19]:
df.head()

| | Name | Age |
|---|---|---|
| 0 | Mel | 55 |
| 1 | Mary | 58 |
| 2 | Bob | 48 |
| 3 | Jessica | 56 |
| 4 | John | 95 |


*or 10*

In [20]:
df.head(10)

| | Name | Age |
|---|---|---|
| 0 | Mel | 55 |
| 1 | Mary | 58 |
| 2 | Bob | 48 |
| 3 | Jessica | 56 |
| 4 | John | 95 |
| 5 | Mel | 42 |
| 6 | John | 66 |
| 7 | Jessica | 42 |
| 8 | Mary | 69 |
| 9 | John | 40 |


*show last 5 rows*

In [21]:
df.tail()

| | Name | Age |
|---|---|---|
| 995 | Mary | 88 |
| 996 | Mary | 33 |
| 997 | Jessica | 1 |
| 998 | Jessica | 44 |
| 999 | Mary | 70 |


*or 10*

In [22]:
df.tail(10)

| | Name | Age |
|---|---|---|
| 990 | Mel | 11 |
| 991 | Mary | 84 |
| 992 | John | 5 |
| 993 | John | 9 |
| 994 | Jessica | 93 |
| 995 | Mary | 88 |
| 996 | Mary | 33 |
| 997 | Jessica | 1 |
| 998 | Jessica | 44 |
| 999 | Mary | 70 |
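
Besides `head` and `tail`, a few attributes give a quick structural overview. The sketch below uses a small stand-in dataframe called `small` (an assumption for this example) so the results are easy to check by eye.

```python
import pandas as pd

small = pd.DataFrame({"Name": ["Mel", "Mary", "Bob"], "Age": [55, 58, 48]})

print(small.shape)          # (3, 2): number of rows, number of columns
print(list(small.columns))  # ['Name', 'Age']
print(small.dtypes)         # one dtype per column
small.info()                # prints a summary with non-null counts per column

# head(n) and tail(n) simply return every row when n exceeds the row count.
print(len(small.head(10)))  # 3
```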


### Selecting rows and columns

**Select a single column from the dataframe**

In [23]:
df["Name"]

0          Mel
1         Mary
2          Bob
3      Jessica
4         John
5          Mel
6         John
7      Jessica
8         Mary
9         John
10     Jessica
11         Mel
12        John
13         Bob
14        Mary
15         Bob
16        John
17         Bob
18     Jessica
19        Mary
20        John
21         Mel
22        John
23        Mary
24         Mel
25         Mel
26        Mary
27        Mary
28     Jessica
29         Mel
        ...   
970        Bob
971        Mel
972       Mary
973       John
974        Mel
975        Mel
976        Mel
977       Mary
978        Mel
979        Mel
980       John
981       John
982       Mary
983       John
984        Mel
985        Mel
986       Mary
987        Mel
988        Mel
989       Mary
990        Mel
991       Mary
992       John
993       John
994    Jessica
995       Mary
996       Mary
997    Jessica
998    Jessica
999       Mary
Name: Name, dtype: object

*Note: a selection like this, which returns a single column, is called a Series in pandas*

**Select multiple columns from the dataframe**
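
To make the Series note concrete, here is a small sketch contrasting a single column label with a one-element list of labels. The dataframe `people` and the names `single` and `nested` are made up for this illustration.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Mel", "Mary", "Bob"], "Age": [55, 58, 48]})

single = people["Name"]    # one column label  -> Series
nested = people[["Name"]]  # list of one label -> DataFrame with one column

print(type(single).__name__)       # Series
print(type(nested).__name__)       # DataFrame
print(single.shape, nested.shape)  # (3,) (3, 1)
```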

In [24]:
df[["Name", "Age"]].head()

| | Name | Age |
|---|---|---|
| 0 | Mel | 55 |
| 1 | Mary | 58 |
| 2 | Bob | 48 |
| 3 | Jessica | 56 |
| 4 | John | 95 |


*We select multiple columns by passing a list of column names, here ["Name", "Age"]*

**Select a row from the dataframe**

*Select by index label (`ix`)*

In [25]:
df.ix[0]

Name    Mel
Age      55
Name: 0, dtype: object

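> The index can hold integers, as above, or strings, depending on the dataframe. Note that `.ix` was deprecated in pandas 0.20 and removed in pandas 1.0; on current versions use `.loc` (labels) or `.iloc` (positions) instead.

*Select by integer position (`iloc`)*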

In [26]:
df.iloc[0]

Name    Mel
Age      55
Name: 0, dtype: object

> `iloc` selects by integer position, counting from 0

*Select by index label (`loc`)*

In [27]:
df.loc[0]

Name    Mel
Age      55
Name: 0, dtype: object

*Slice-based selection*

In [28]:
df.loc[0:6]

| | Name | Age |
|---|---|---|
| 0 | Mel | 55 |
| 1 | Mary | 58 |
| 2 | Bob | 48 |
| 3 | Jessica | 56 |
| 4 | John | 95 |
| 5 | Mel | 42 |
| 6 | John | 66 |


In [29]:
df.iloc[0:6]

| | Name | Age |
|---|---|---|
| 0 | Mel | 55 |
| 1 | Mary | 58 |
| 2 | Bob | 48 |
| 3 | Jessica | 56 |
| 4 | John | 95 |
| 5 | Mel | 42 |
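
The two slices above return different numbers of rows: `loc[0:6]` includes the end label, while `iloc[0:6]` excludes the end position, as ordinary Python slicing does. A minimal sketch on a made-up ten-row dataframe (`people` is an assumption for this example):

```python
import pandas as pd

people = pd.DataFrame({"Name": list("abcdefghij"), "Age": range(10, 20)})

by_label = people.loc[0:6]      # labels 0..6, end label included  -> 7 rows
by_position = people.iloc[0:6]  # positions 0..5, end excluded     -> 6 rows

print(len(by_label), len(by_position))  # 7 6
```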


**Behavior of ix, loc and iloc**

In [33]:
# let's create a new dataframe with selected rows
ndf = df.loc[10:20]

In [34]:
ndf

| | Name | Age |
|---|---|---|
| 10 | Jessica | 64 |
| 11 | Mel | 83 |
| 12 | John | 72 |
| 13 | Bob | 3 |
| 14 | Mary | 29 |
| 15 | Bob | 28 |
| 16 | John | 86 |
| 17 | Bob | 40 |
| 18 | Jessica | 20 |
| 19 | Mary | 79 |


In [35]:
ndf.ix[0]

KeyError: 0

In [36]:
ndf.loc[0]

KeyError: 'the label [0] is not in the [index]'

In [37]:
ndf.iloc[0]

Name    Jessica
Age          64
Name: 10, dtype: object

*In the three cells above, __ix__ and __loc__ look for the index label __0__, which is not present in the new dataframe, while __iloc__ returns the first row (position 0) of the new dataframe*

**Slicing by rows and columns**
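
If label-based access starting from 0 is wanted on a subset, one option is to renumber the index with `reset_index(drop=True)`. This is a sketch on a made-up dataframe; the names `people`, `subset` and `renumbered` are assumptions for this example.

```python
import pandas as pd

people = pd.DataFrame({"Name": list("abcdefghijkl"), "Age": range(30, 42)})
subset = people.loc[5:8]          # keeps the original labels 5, 6, 7, 8

print(list(subset.index))         # [5, 6, 7, 8]
print(subset.iloc[0]["Name"])     # f  (first row by position)
print(0 in subset.index)          # False, so subset.loc[0] raises KeyError

# reset_index(drop=True) discards the old labels and renumbers from 0.
renumbered = subset.reset_index(drop=True)
print(renumbered.loc[0]["Name"])  # f
```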

In [38]:
# get rows 0 through 6 (inclusive) of column "Age"
df.loc[0:6, "Age"]

0    55
1    58
2    48
3    56
4    95
5    42
6    66
Name: Age, dtype: int64

In [39]:
# get rows 0 through 6 and columns "Name" through "Age"
df.loc[0:6, "Name": "Age"]

| | Name | Age |
|---|---|---|
| 0 | Mel | 55 |
| 1 | Mary | 58 |
| 2 | Bob | 48 |
| 3 | Jessica | 56 |
| 4 | John | 95 |
| 5 | Mel | 42 |
| 6 | John | 66 |


In [40]:
# a column slice must follow the column order of the dataframe;
# "Age" comes after "Name", so this slice selects no columns (only the index is shown)
df.loc[0:6, "Age": "Name"]

0
1
2
3
4
5
6
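
When the columns are wanted in a different order, an explicit list of labels works where a reversed slice does not. A minimal sketch on a made-up two-row dataframe (`people`, `reversed_slice` and `reordered` are assumptions for this example):

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Mel", "Mary"], "Age": [55, 58]})

# "Age" comes after "Name", so the slice "Age":"Name" matches no columns.
reversed_slice = people.loc[:, "Age":"Name"]
print(reversed_slice.shape)     # (2, 0)

# An explicit list of labels may use any order.
reordered = people.loc[:, ["Age", "Name"]]
print(list(reordered.columns))  # ['Age', 'Name']
```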


**Renaming the columns of a dataframe**

In [41]:
df.columns = ["Name of people", "Age of people"]

In [42]:
df.head()

| | Name of people | Age of people |
|---|---|---|
| 0 | Mel | 55 |
| 1 | Mary | 58 |
| 2 | Bob | 48 |
| 3 | Jessica | 56 |
| 4 | John | 95 |
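
Assigning to `df.columns` replaces every name at once and requires one name per column. As an alternative sketch, `rename` changes only the columns listed in a mapping and returns a new dataframe. The dataframe `people` and the name `renamed` are assumptions for this example.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Mel", "Mary"], "Age": [55, 58]})

# rename returns a new dataframe; `people` itself keeps its column names.
renamed = people.rename(columns={"Age": "Age of people"})

print(list(renamed.columns))  # ['Name', 'Age of people']
print(list(people.columns))   # ['Name', 'Age']
```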


**Filtering basics**

In [43]:
# get all people whose age is less than 10
df[df["Age of people"] < 10]

| | Name of people | Age of people |
|---|---|---|
| 13 | Bob | 3 |
| 29 | Mel | 6 |
| 58 | John | 9 |
| 60 | Bob | 8 |
| 73 | Jessica | 8 |
| 94 | John | 4 |
| 101 | Mel | 4 |
| 107 | Mel | 2 |
| 111 | Bob | 7 |
| 129 | Bob | 9 |


In [46]:
# get all people whose age is less than 10 or age is greater than 90
df[(df["Age of people"] < 10) | (df["Age of people"] > 90)]

| | Name of people | Age of people |
|---|---|---|
| 4 | John | 95 |
| 13 | Bob | 3 |
| 29 | Mel | 6 |
| 31 | Jessica | 95 |
| 58 | John | 9 |
| 60 | Bob | 8 |
| 61 | Mary | 95 |
| 62 | Mary | 92 |
| 66 | Mary | 99 |
| 68 | Bob | 98 |
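
The filters above compare a column against a value to build a boolean mask, then combine masks with `|`. The sketch below, on a made-up five-row dataframe with the original column names, shows the companion operators `&` (and), `~` (not), and the `isin` method.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Bob", "Mel", "John", "Mary", "Bob"],
    "Age": [3, 6, 95, 40, 98],
})

# & means "and"; each comparison needs its own parentheses,
# because & and | bind more tightly than < and >.
young_bobs = people[(people["Name"] == "Bob") & (people["Age"] < 10)]
print(len(young_bobs))          # 1

# isin tests membership in a list of values.
bob_or_mel = people[people["Name"].isin(["Bob", "Mel"])]
print(list(bob_or_mel["Age"]))  # [3, 6, 98]

# ~ negates a boolean mask.
not_bob = people[~(people["Name"] == "Bob")]
print(list(not_bob["Name"]))    # ['Mel', 'John', 'Mary']
```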


In [47]:
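# get the mean of the numeric columns
# (numeric_only=True skips the string column; pandas 2.0+ raises a TypeError without it)
df.mean(numeric_only=True)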

Age of people    49.826
dtype: float64

In [48]:
# get max
df.max()

Name of people    Mel
Age of people      99
dtype: object

In [49]:
# get min
df.min()

Name of people    Bob
Age of people       1
dtype: object

In [50]:
# get the median of the numeric columns
df.median(numeric_only=True)

Age of people    49.0
dtype: float64

In [51]:
# get the standard deviation of the numeric columns
df.std(numeric_only=True)

Age of people    28.771923
dtype: float64
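
The same statistics can also be taken from a single column, and `describe()` collects several of them in one call. This is a sketch on a made-up four-row dataframe (`people` and `summary` are assumptions for this example), chosen so the numbers are easy to verify by hand.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Bob", "Mel", "John", "Mary"],
    "Age": [10, 20, 30, 40],
})

# Calling the method on one column returns a single number.
print(people["Age"].mean())    # 25.0
print(people["Age"].median())  # 25.0
print(people["Age"].max())     # 40

# describe() reports count, mean, std, min, quartiles and max for numeric columns.
summary = people.describe()
print(summary.loc["mean", "Age"])  # 25.0
```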