# Pandas
---

Pandas is a data analysis library built on top of NumPy.

In [1]:
import pandas as pd

In [2]:
names = ["Nepal", "India", "Bhutan"]
codes = [977, 81, 123]

In [3]:
list(zip(names, codes))

[('Nepal', 977), ('India', 81), ('Bhutan', 123)]

**zip:** *`zip` pairs up the elements of the given lists position by position and yields one tuple per position*

In [4]:
zip([1, 2, 3], [11, 12, 13], [21, 22, 23])

<zip at 0x7fdbdf22d0c8>

In [5]:
list(zip([1, 2, 3], [11, 12, 13], [21, 22, 23]))

[(1, 11, 21), (2, 12, 22), (3, 13, 23)]
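
One detail worth knowing: when the lists have different lengths, `zip` stops at the shortest one. The sketch below uses a made-up fourth name ("Sri Lanka", added only for illustration) to show this, and also shows how `zip(*pairs)` undoes the pairing.

```python
names = ["Nepal", "India", "Bhutan", "Sri Lanka"]  # fourth entry is illustrative only
codes = [977, 81, 123]

# zip stops as soon as the shortest input is exhausted,
# so the unmatched fourth name is silently dropped
pairs = list(zip(names, codes))
print(pairs)

# zip(*pairs) reverses the operation and gives back one tuple per column
unzipped_names, unzipped_codes = zip(*pairs)
print(unzipped_names)
print(unzipped_codes)
```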

*We create a dataset with names and codes*

In [6]:
dataset = list(zip(names, codes))

In [7]:
dataset

[('Nepal', 977), ('India', 81), ('Bhutan', 123)]

**Now we create a dataframe**

A dataframe holds tabular data arranged in rows and columns (similar to an Excel sheet).

In [8]:
df = pd.DataFrame(data=dataset, columns=["Name", "Code"])

In [9]:
df

|   | Name | Code |
|---|------|------|
| 0 | Nepal | 977 |
| 1 | India | 81 |
| 2 | Bhutan | 123 |


*Here, each row is a single observation, while each column is a parameter recorded for every observation*

*In the dataframe above, Name and Code are the parameters we use to describe a country, and each row, e.g. Nepal and 977, together forms one unique record*

*__0__, __1__, ... are called the index in pandas. They are generated automatically when a dataframe is created without an explicit index*
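
The index does not have to be the generated integers. As a minimal sketch, the labels `"np"`, `"in"` and `"bt"` below are assumptions chosen for illustration; passing them through the `index` argument replaces the default `0, 1, 2`.

```python
import pandas as pd

dataset = [("Nepal", 977), ("India", 81), ("Bhutan", 123)]

# default: pandas generates the integer index 0, 1, 2
df_default = pd.DataFrame(data=dataset, columns=["Name", "Code"])
print(list(df_default.index))

# illustrative assumption: use short string labels as the index instead
df_labeled = pd.DataFrame(data=dataset, columns=["Name", "Code"],
                          index=["np", "in", "bt"])

# rows can now be looked up by their label
print(df_labeled.loc["np", "Name"])
```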

*Let's create a larger dataset*

In [10]:
import numpy as np

In [11]:
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']

In [32]:
np.random.randint(low=0, high=len(names))

4

In [33]:
random_names = [names[np.random.randint(low=0, high=len(names))] 
                for i in range(1000)]

In [34]:
random_names[:10]

['Jessica',
 'Mel',
 'Bob',
 'Jessica',
 'Jessica',
 'John',
 'John',
 'Mary',
 'Jessica',
 'John']

In [35]:
ages = [np.random.randint(low=1, high=100) for i in range(1000)]

In [36]:
ages[:10]

[15, 31, 30, 28, 90, 6, 59, 25, 77, 91]
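
The list comprehensions above call `np.random.randint` once per element. A possible alternative, sketched here with NumPy's `Generator` API (not used in the original cells), draws all values in one call. The seed `42` is an arbitrary choice to make the run repeatable.

```python
import numpy as np

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']

# a seeded generator makes the "random" data reproducible between runs
rng = np.random.default_rng(seed=42)

# draw 1000 names and 1000 ages in one call each, without a Python loop
random_names = rng.choice(names, size=1000)

# like np.random.randint, `high` is exclusive: ages fall in 1..99
ages = rng.integers(low=1, high=100, size=1000)

print(random_names[:5])
print(ages[:5])
```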

In [37]:
df = pd.DataFrame(list(zip(random_names, ages)), 
                  columns=["Name", "Age"])

In [38]:
df

|   | Name | Age |
|---|------|-----|
| 0 | Jessica | 15 |
| 1 | Mel | 31 |
| 2 | Bob | 30 |
| 3 | Jessica | 28 |
| 4 | Jessica | 90 |
| 5 | John | 6 |
| 6 | John | 59 |
| 7 | Mary | 25 |
| 8 | Jessica | 77 |
| 9 | John | 91 |


**Overview of data**

*show first 5 rows*

In [39]:
df.head()

|   | Name | Age |
|---|------|-----|
| 0 | Jessica | 15 |
| 1 | Mel | 31 |
| 2 | Bob | 30 |
| 3 | Jessica | 28 |
| 4 | Jessica | 90 |


*or 10*

In [40]:
df.head(10)

|   | Name | Age |
|---|------|-----|
| 0 | Jessica | 15 |
| 1 | Mel | 31 |
| 2 | Bob | 30 |
| 3 | Jessica | 28 |
| 4 | Jessica | 90 |
| 5 | John | 6 |
| 6 | John | 59 |
| 7 | Mary | 25 |
| 8 | Jessica | 77 |
| 9 | John | 91 |


*show last 5 rows*

In [41]:
df.tail()

|   | Name | Age |
|---|------|-----|
| 995 | John | 69 |
| 996 | Mary | 86 |
| 997 | Mary | 71 |
| 998 | Bob | 52 |
| 999 | Mary | 49 |


*or 10*

In [42]:
df.tail(10)

|   | Name | Age |
|---|------|-----|
| 990 | John | 73 |
| 991 | John | 28 |
| 992 | Mel | 48 |
| 993 | John | 95 |
| 994 | Bob | 66 |
| 995 | John | 69 |
| 996 | Mary | 86 |
| 997 | Mary | 71 |
| 998 | Bob | 52 |
| 999 | Mary | 49 |
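
Besides `head` and `tail`, a few other attributes help when getting an overview. The sketch below uses a three-row stand-in dataframe (an assumption, so the block runs on its own) and shows `shape`, `dtypes` and `info()`.

```python
import pandas as pd

# small stand-in for the 1000-row dataframe used above
df = pd.DataFrame({"Name": ["Jessica", "Mel", "Bob"], "Age": [15, 31, 30]})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
df.info()          # column names, non-null counts and memory usage
```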


### Selecting rows and columns

**Select a single column from the dataframe**

In [43]:
df["Name"]

0      Jessica
1          Mel
2          Bob
3      Jessica
4      Jessica
5         John
6         John
7         Mary
8      Jessica
9         John
10        John
11         Mel
12     Jessica
13        John
14        Mary
15         Mel
16     Jessica
17         Mel
18     Jessica
19        John
20        Mary
21         Bob
22     Jessica
23         Mel
24         Bob
25        John
26         Mel
27         Mel
28        John
29        John
        ...   
970        Mel
971    Jessica
972        Mel
973        Bob
974       John
975    Jessica
976       John
977       John
978        Mel
979        Mel
980       Mary
981       Mary
982       Mary
983       John
984        Mel
985        Mel
986       Mary
987        Mel
988        Mel
989       Mary
990       John
991       John
992        Mel
993       John
994        Bob
995       John
996       Mary
997       Mary
998        Bob
999       Mary
Name: Name, dtype: object

*Note: selecting a single column like this returns a Series, the one-dimensional labeled array type in pandas*
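
To make the difference concrete, here is a minimal sketch on a four-row stand-in dataframe (made up for illustration). Single brackets give a Series, double brackets keep a DataFrame, and Series methods such as `value_counts()` become available.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Jessica", "Mel", "Bob", "Jessica"],
                   "Age": [15, 31, 30, 28]})

name_column = df["Name"]     # single brackets: a Series
name_frame = df[["Name"]]    # list of one column: still a DataFrame
print(type(name_column))
print(type(name_frame))

# a Series shares the index of the dataframe it came from
print(name_column.index.equals(df.index))

# value_counts() counts how often each name occurs
counts = name_column.value_counts()
print(counts["Jessica"])
```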

**Select multiple columns from dataframe**

In [44]:
df[["Name", "Age"]].head()

|   | Name | Age |
|---|------|-----|
| 0 | Jessica | 15 |
| 1 | Mel | 31 |
| 2 | Bob | 30 |
| 3 | Jessica | 28 |
| 4 | Jessica | 90 |


*Here we select multiple columns by passing a list of column names, ["Name", "Age"]*

**Select row from dataframe**

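*Select by index (`ix` was deprecated in pandas 0.20 and removed in 1.0; use `loc` or `iloc` on current versions)*

In [45]:
# works only on pandas versions before 1.0
df.ix[0]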

Name    Jessica
Age          15
Name: 0, dtype: object

> the index can hold integers, as above, or other labels such as strings, depending on the dataframe

*Select by position*

In [46]:
df.iloc[0]

Name    Jessica
Age          15
Name: 0, dtype: object

> `iloc` selects by integer position, counting rows from 0 regardless of the index labels

*Select by label*

In [47]:
df.loc[0]

Name    Jessica
Age          15
Name: 0, dtype: object

*Slice-based selection (`loc` includes the end label, `iloc` excludes the end position)*

In [48]:
df.loc[0:6]

|   | Name | Age |
|---|------|-----|
| 0 | Jessica | 15 |
| 1 | Mel | 31 |
| 2 | Bob | 30 |
| 3 | Jessica | 28 |
| 4 | Jessica | 90 |
| 5 | John | 6 |
| 6 | John | 59 |


In [49]:
df.iloc[0:6]

|   | Name | Age |
|---|------|-----|
| 0 | Jessica | 15 |
| 1 | Mel | 31 |
| 2 | Bob | 30 |
| 3 | Jessica | 28 |
| 4 | Jessica | 90 |
| 5 | John | 6 |
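
The two outputs above differ by one row because the slices treat their end point differently. A minimal sketch on an eight-row stand-in dataframe (values made up for illustration) makes the rule explicit:

```python
import pandas as pd

df = pd.DataFrame({"Age": [15, 31, 30, 28, 90, 6, 59, 25]})

by_label = df.loc[0:6]      # labels 0..6, end label included
by_position = df.iloc[0:6]  # positions 0..5, end position excluded

print(len(by_label), len(by_position))
```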


**Behavior of ix, loc and iloc**

In [50]:
# let's create a new dataframe with selected rows
ndf = df.loc[10:20]

In [51]:
ndf

|   | Name | Age |
|---|------|-----|
| 10 | John | 16 |
| 11 | Mel | 72 |
| 12 | Jessica | 29 |
| 13 | John | 5 |
| 14 | Mary | 39 |
| 15 | Mel | 85 |
| 16 | Jessica | 46 |
| 17 | Mel | 75 |
| 18 | Jessica | 41 |
| 19 | John | 54 |


In [52]:
# ix is only available on pandas versions before 1.0
ndf.ix[0]

KeyError: 0

In [53]:
ndf.loc[0]

KeyError: 'the label [0] is not in the [index]'

In [54]:
ndf.iloc[0]

Name    John
Age       16
Name: 10, dtype: object

In [55]:
ndf.loc[10]

Name    John
Age       16
Name: 10, dtype: object

*In the cells above, __ix__ and __loc__ look up the index label __0__, which is not present in our new dataframe, so they raise a KeyError. __iloc__ fetches the first row (position 0) regardless of its label, and __loc__ succeeds once we pass a label that exists, __10__*
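
If label-based lookups should start from 0 again after selecting a subset, one option is `reset_index(drop=True)`. The sketch below uses a five-row stand-in dataframe with made-up values to show the labels before and after.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["a", "b", "c", "d", "e"], "Age": [1, 2, 3, 4, 5]})
ndf = df.loc[2:4]             # keeps the original labels 2, 3, 4

print(list(ndf.index))
print(ndf.iloc[0]["Name"])    # first row by position, whatever its label

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels
renumbered = ndf.reset_index(drop=True)
print(list(renumbered.index))
print(renumbered.loc[0, "Name"])
```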

**Slicing by rows and columns**

In [56]:
# get rows 0 to 6 (inclusive) of the column "Age"
df.loc[0:6, "Age"]

0    15
1    31
2    30
3    28
4    90
5     6
6    59
Name: Age, dtype: int64

In [57]:
# get rows 0 to 6 and the columns "Name" through "Age"
df.loc[0:6, "Name": "Age"]

|   | Name | Age |
|---|------|-----|
| 0 | Jessica | 15 |
| 1 | Mel | 31 |
| 2 | Bob | 30 |
| 3 | Jessica | 28 |
| 4 | Jessica | 90 |
| 5 | John | 6 |
| 6 | John | 59 |


In [58]:
# a column slice must follow the column order of the dataframe;
# slicing from "Age" back to "Name" selects no columns
df.loc[0:6, "Age": "Name"]

|   |
|---|
| 0 |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
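
To get the columns in a different order, a list of names can be used in place of a slice. A minimal sketch on a three-row stand-in dataframe (made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Jessica", "Mel", "Bob"], "Age": [15, 31, 30]})

# slicing against the column order finds nothing between "Age" and "Name"
empty = df.loc[0:1, "Age":"Name"]
print(empty.shape)

# a list of column names returns the columns in the requested order
reordered = df.loc[0:1, ["Age", "Name"]]
print(list(reordered.columns))
```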


**Renaming the columns of a dataframe**

In [59]:
df.columns = ["Name of people", "Age of people"]

In [60]:
df.head()

|   | Name of people | Age of people |
|---|----------------|---------------|
| 0 | Jessica | 15 |
| 1 | Mel | 31 |
| 2 | Bob | 30 |
| 3 | Jessica | 28 |
| 4 | Jessica | 90 |
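
Assigning to `df.columns` replaces every column name at once and modifies the dataframe in place. When only some columns need a new name, `rename` with a mapping is a possible alternative. The sketch below uses a two-row stand-in dataframe made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Jessica", "Mel"], "Age": [15, 31]})

# rename() returns a new dataframe and leaves the original untouched;
# columns missing from the mapping keep their names
renamed = df.rename(columns={"Age": "Age of people"})

print(list(renamed.columns))
print(list(df.columns))
```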


**Filtering basics**

In [61]:
# get all people whose age is less than 10
df[df["Age of people"] < 10]

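|   | Name of people | Age of people |
|---|----------------|---------------|
| 5 | John | 6 |
| 13 | John | 5 |
| 44 | Mel | 3 |
| 52 | Bob | 8 |
| 54 | Bob | 2 |
| 55 | John | 7 |
| 60 | John | 5 |
| 72 | John | 5 |
| 79 | John | 1 |
| 86 | Mary | 3 |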


In [62]:
# get all people whose age is less than 10 or age is greater than 90
df[(df["Age of people"] < 10) | (df["Age of people"] > 90)]

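|   | Name of people | Age of people |
|---|----------------|---------------|
| 5 | John | 6 |
| 9 | John | 91 |
| 13 | John | 5 |
| 22 | Jessica | 95 |
| 28 | John | 95 |
| 31 | Mary | 93 |
| 42 | Jessica | 93 |
| 44 | Mel | 3 |
| 52 | Bob | 8 |
| 54 | Bob | 2 |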
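
The filters above combine boolean masks with `|` ("or"). The same pattern works with `&` ("and") and `~` ("not"), and `isin()` tests membership in a list. A minimal sketch on a five-row stand-in dataframe with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "Name of people": ["John", "Mel", "Bob", "Mary", "Jessica"],
    "Age of people": [6, 31, 95, 3, 50],
})

age = df["Age of people"]

# & means "and": each comparison needs its own parentheses
between = df[(age >= 10) & (age <= 90)]

# ~ negates a mask: everyone who is NOT younger than 10
not_young = df[~(age < 10)]

# isin() keeps the rows whose value appears in the given list
johns_and_bobs = df[df["Name of people"].isin(["John", "Bob"])]

print(between)
print(not_young)
print(johns_and_bobs)
```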


In [63]:
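# get mean of the numeric columns
# (numeric_only=True skips the string column, which pandas 2.0+ no longer drops automatically)
df.mean(numeric_only=True)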

Age of people    50.6
dtype: float64

In [65]:
# get max
df.max()

Name of people    Mel
Age of people      99
dtype: object

In [66]:
# get min
df.min()

Name of people    Bob
Age of people       1
dtype: object

In [67]:
# get median of the numeric columns
df.median(numeric_only=True)

Age of people    51.0
dtype: float64

In [68]:
# get standard deviation of the numeric columns
df.std(numeric_only=True)

Age of people    29.067041
dtype: float64
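
Several of these statistics can also be obtained in one call with `describe()`. The sketch below runs on a four-row stand-in dataframe (values made up for illustration) and reads one value back from the summary.

```python
import pandas as pd

df = pd.DataFrame({"Name of people": ["John", "Mel", "Bob", "Mary"],
                   "Age of people": [10, 20, 30, 40]})

# describe() summarizes the numeric columns: count, mean, std, min, quartiles, max
summary = df.describe()
print(summary)

# individual values can be read back with .loc[statistic, column]
print(summary.loc["mean", "Age of people"])
```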

In [69]:
df.head()

|   | Name of people | Age of people |
|---|----------------|---------------|
| 0 | Jessica | 15 |
| 1 | Mel | 31 |
| 2 | Bob | 30 |
| 3 | Jessica | 28 |
| 4 | Jessica | 90 |


In [72]:
# get the mean age for each name
df.groupby("Name of people").mean()

| Name of people | Age of people |
|----------------|---------------|
| Bob | 52.129353 |
| Jessica | 54.720588 |
| John | 47.2 |
| Mary | 51.095745 |
| Mel | 47.923858 |
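
`groupby` is not limited to a single statistic. As a minimal sketch on a five-row stand-in dataframe (values made up for illustration), `agg` with a list of function names computes several statistics per group at once.

```python
import pandas as pd

df = pd.DataFrame({
    "Name of people": ["Bob", "Mel", "Bob", "Mel", "Mary"],
    "Age of people": [20, 30, 40, 50, 60],
})

# select the column first, then compute several statistics per group
stats = df.groupby("Name of people")["Age of people"].agg(["count", "mean", "max"])
print(stats)
```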
