### Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

# SERIES

The first main data type in pandas is the Series.

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.
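
As a small sketch of these two properties (the labels and values below are made up for illustration), a Series can be looked up by label as well as by position, and it can hold mixed Python objects:

```python
import pandas as pd

# A labelled Series: values can be looked up by label or by position.
temps = pd.Series([21.5, 19.0, 23.2], index=["mon", "tue", "wed"])
by_label = temps["tue"]        # label-based lookup
by_position = temps.iloc[1]    # position-based lookup, same element

# A Series is not limited to numbers; mixed objects get dtype "object".
mixed = pd.Series([1, "two", len])
print(by_label, by_position, mixed.dtype)
```

Both lookups return the same element; only the way of addressing it differs.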

In [1]:
#Importing dependencies

import numpy as np
import pandas as pd

### Creating a SERIES

A list, NumPy array, or dictionary can be converted into a Series.

In [2]:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

In [3]:
#Using a list

pd.Series(data=my_list)

0    10
1    20
2    30
dtype: int64

In [4]:
pd.Series(data=my_list,index=labels)

a    10
b    20
c    30
dtype: int64

In [5]:
pd.Series(my_list,labels)

a    10
b    20
c    30
dtype: int64

In [6]:
#Using Numpy array

pd.Series(arr)

0    10
1    20
2    30
dtype: int32

In [7]:
pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int32
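
The `int32` in the two outputs above, versus `int64` for the list, most likely reflects the default NumPy integer type on the platform where this notebook was run (an assumption; the notebook does not say). The sketch below only shows the part that is certain: a Series built from a NumPy array keeps the array's dtype, so passing an explicit dtype makes the result the same everywhere.

```python
import numpy as np
import pandas as pd

# A Series built from a NumPy array keeps the array's dtype.
arr32 = np.array([10, 20, 30], dtype=np.int32)
arr64 = np.array([10, 20, 30], dtype=np.int64)

ser32 = pd.Series(arr32)
ser64 = pd.Series(arr64)
print(ser32.dtype, ser64.dtype)
```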

In [8]:
#Using a dictionary

pd.Series(d)

a    10
b    20
c    30
dtype: int64
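
When a dictionary is passed, its keys become the index. A minimal sketch of what happens when an explicit `index` is also given (the label `'z'` is a made-up key that is not in the dictionary): the values are selected and ordered by that index, and a missing key yields NaN.

```python
import pandas as pd

d = {'a': 10, 'b': 20, 'c': 30}

# Keys are matched against the requested index; 'z' has no entry in d.
ser = pd.Series(d, index=['c', 'a', 'z'])
print(ser)
```

Because NaN is a floating-point value, the resulting Series has dtype `float64` rather than `int64`.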

## Using an INDEX

In [10]:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','Canada', 'Mexico'])

In [11]:
ser1

USA        1
Germany    2
Canada     3
Mexico     4
dtype: int64

In [16]:
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Canada', 'India'])

In [17]:
ser2

USA        1
Germany    2
Canada     5
India      4
dtype: int64

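Arithmetic between Series aligns on index labels, not on position. Labels that appear in only one Series (here India and Mexico) produce NaN, and the result is upcast to float64 to hold those NaN values: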

In [18]:
ser1 + ser2

Canada     8.0
Germany    4.0
India      NaN
Mexico     NaN
USA        2.0
dtype: float64
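
If the NaN values are not wanted, one option is the `add` method with `fill_value`, which treats a label missing from one side as the given value. A minimal sketch reusing the two Series from above:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'Canada', 'Mexico'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Canada', 'India'])

# Treat a label missing from one side as 0 instead of producing NaN.
total = ser1.add(ser2, fill_value=0)
print(total)
```

Whether filling with 0 is appropriate depends on the data; it is a choice, not a default.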

# DATAFRAMES

A DataFrame is a collection of Series objects put together so that they share the same index.
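
To make that idea concrete, here is a small sketch that builds a DataFrame from two Series. The names `height` and `width` and their values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Two hypothetical Series that share the labels 'a', 'b', 'c'.
height = pd.Series([1.2, 3.4, 5.6], index=['a', 'b', 'c'])
width = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Each Series becomes one column; the shared labels become the row index.
frame = pd.DataFrame({'height': height, 'width': width})
print(frame)
```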

In [23]:
from numpy.random import randn #random samples from the standard normal distribution
np.random.seed(49)             #seeding the pseudo-random number generator; re-using the same seed value will generate the same output
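
The effect of the seed can be checked directly. This sketch draws twice from `randn`, resetting the seed in between, and compares the results (the array shape is arbitrary):

```python
import numpy as np
from numpy.random import randn

np.random.seed(49)
first = randn(2, 2)

np.random.seed(49)      # reset the generator to the same starting state
second = randn(2, 2)

print(np.array_equal(first, second))  # the two draws are identical
```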

In [24]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

In [25]:
df

          W         X         Y         Z
A -1.043159 -0.820856  0.665146  1.822627
B -1.441583  0.233808  0.339619  0.231214
C -0.009926  1.803848  1.367844 -0.261362
D -0.447606  0.569842 -1.009020 -1.128953
E  1.059545  1.030441  0.142366  0.592879


### Selection and Indexing

In [27]:
df['W']

A   -1.043159
B   -1.441583
C   -0.009926
D   -0.447606
E    1.059545
Name: W, dtype: float64

In [28]:
#Passing a list of column names

df[['W','Z']]

          W         Z
A -1.043159  1.822627
B -1.441583  0.231214
C -0.009926 -0.261362
D -0.447606 -1.128953
E  1.059545  0.592879


DataFrame Columns are just Series!

In [29]:
type(df['W'])

pandas.core.series.Series

#### Creating a new column:

In [31]:
df['new'] = df['W'] + df['Y']

In [32]:
df

          W         X         Y         Z       new
A -1.043159 -0.820856  0.665146  1.822627 -0.378013
B -1.441583  0.233808  0.339619  0.231214 -1.101964
C -0.009926  1.803848  1.367844 -0.261362  1.357918
D -0.447606  0.569842 -1.009020 -1.128953 -1.456626
E  1.059545  1.030441  0.142366  0.592879  1.201911
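
Assigning to `df['new']` modifies `df` in place. A non-mutating alternative is `assign`, sketched here on a small stand-in frame with made-up values rather than the random `df` above:

```python
import pandas as pd

# Small stand-in frame with made-up values (not the random df above).
demo = pd.DataFrame({'W': [1.0, 2.0], 'Y': [0.5, 1.5]}, index=['A', 'B'])

# assign() returns a copy with the extra column; demo itself is unchanged.
with_new = demo.assign(new=demo['W'] + demo['Y'])
print(with_new)
```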


#### Removing a column:

In [34]:
df.drop('new',axis=1)     #axis=1 indicates dropping a column, axis=0 would drop row

          W         X         Y         Z
A -1.043159 -0.820856  0.665146  1.822627
B -1.441583  0.233808  0.339619  0.231214
C -0.009926  1.803848  1.367844 -0.261362
D -0.447606  0.569842 -1.009020 -1.128953
E  1.059545  1.030441  0.142366  0.592879


In [36]:
df 

#The 'new' column is still there: drop() returned a new DataFrame and left df unchanged

          W         X         Y         Z       new
A -1.043159 -0.820856  0.665146  1.822627 -0.378013
B -1.441583  0.233808  0.339619  0.231214 -1.101964
C -0.009926  1.803848  1.367844 -0.261362  1.357918
D -0.447606  0.569842 -1.009020 -1.128953 -1.456626
E  1.059545  1.030441  0.142366  0.592879  1.201911


In [37]:
df.drop('new',axis=1,inplace=True)

In [38]:
df

          W         X         Y         Z
A -1.043159 -0.820856  0.665146  1.822627
B -1.441583  0.233808  0.339619  0.231214
C -0.009926  1.803848  1.367844 -0.261362
D -0.447606  0.569842 -1.009020 -1.128953
E  1.059545  1.030441  0.142366  0.592879


#### Removing a row:

In [39]:
df.drop('E',axis=0)

          W         X         Y         Z
A -1.043159 -0.820856  0.665146  1.822627
B -1.441583  0.233808  0.339619  0.231214
C -0.009926  1.803848  1.367844 -0.261362
D -0.447606  0.569842 -1.009020 -1.128953
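
As with columns, dropping a row returns a new DataFrame unless `inplace=True` is passed. An alternative to `axis` and `inplace`, sketched on a small made-up frame, is to name the target with the `columns=` or `index=` keyword and rebind the variable:

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [5, 6]}, index=['A', 'B'])

# columns= and index= say directly what is being dropped (no axis needed).
demo = demo.drop(columns='new')   # rebind the name instead of inplace=True
without_b = demo.drop(index='B')  # returns a new frame; demo keeps row 'B'
print(demo.columns.tolist(), without_b.index.tolist())
```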


#### Selecting ROWS:

In [40]:
df.loc['A']

W   -1.043159
X   -0.820856
Y    0.665146
Z    1.822627
Name: A, dtype: float64

Selecting by integer position instead of by label:

In [41]:
df.iloc[2]

W   -0.009926
X    1.803848
Y    1.367844
Z   -0.261362
Name: C, dtype: float64

Selecting subsets of rows and columns:

In [42]:
df.loc['B','Y']

0.3396193339525251

In [43]:
df.loc[['A','B'],['W','Y']]

          W         Y
A -1.043159  0.665146
B -1.441583  0.339619
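
Both `loc` and `iloc` also accept slices, with one difference worth knowing: label slices include the end label, while position slices exclude the stop position. A minimal sketch on a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame(
    {'W': [1, 2, 3], 'X': [4, 5, 6], 'Y': [7, 8, 9]},
    index=['A', 'B', 'C'],
)

by_label = demo.loc['A':'B', 'W':'X']   # label slices include both ends
by_position = demo.iloc[0:2, 0:2]       # position slices exclude the stop

print(by_label.equals(by_position))
```

Both expressions select rows A and B and columns W and X.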


### Conditional Selection

In [44]:
df

          W         X         Y         Z
A -1.043159 -0.820856  0.665146  1.822627
B -1.441583  0.233808  0.339619  0.231214
C -0.009926  1.803848  1.367844 -0.261362
D -0.447606  0.569842 -1.009020 -1.128953
E  1.059545  1.030441  0.142366  0.592879


In [45]:
df > 0

       W      X      Y      Z
A  False  False   True   True
B  False   True   True   True
C  False   True   True  False
D  False   True  False  False
E   True   True   True   True


In [46]:
df[df > 0]

          W         X         Y         Z
A       NaN       NaN  0.665146  1.822627
B       NaN  0.233808  0.339619  0.231214
C       NaN  1.803848  1.367844       NaN
D       NaN  0.569842       NaN       NaN
E  1.059545  1.030441  0.142366  0.592879


In [47]:
df[df['W']>0]

          W         X         Y         Z
E  1.059545  1.030441  0.142366  0.592879


In [48]:
df[df['W']>0]['Y']

E    0.142366
Name: Y, dtype: float64

In [49]:
df[df['W']>0][['Y','X']]

          Y         X
E  0.142366  1.030441

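
Conditions can also be combined, and the row filter and column selection can be done in one step with `loc`. The sketch below uses a small frame with made-up values so the results are easy to verify by eye:

```python
import pandas as pd

demo = pd.DataFrame(
    {'W': [-1.0, 0.5, 2.0], 'Y': [0.3, -0.2, 1.1], 'X': [1, 2, 3]},
    index=['A', 'B', 'C'],
)

# Use & (and) / | (or) with parentheses; Python's `and`/`or` raise an error here.
both = demo[(demo['W'] > 0) & (demo['Y'] > 0)]
either = demo[(demo['W'] > 0) | (demo['Y'] > 0)]

# .loc filters rows and picks columns in a single step.
picked = demo.loc[demo['W'] > 0, ['Y', 'X']]
print(both.index.tolist(), either.index.tolist(), picked.shape)
```

The parentheses are needed because `&` and `|` bind more tightly than the comparison operators.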

#### Resetting index

In [50]:
df.reset_index()

  index         W         X         Y         Z
0     A -1.043159 -0.820856  0.665146  1.822627
1     B -1.441583  0.233808  0.339619  0.231214
2     C -0.009926  1.803848  1.367844 -0.261362
3     D -0.447606  0.569842 -1.009020 -1.128953
4     E  1.059545  1.030441  0.142366  0.592879


In [51]:
newind = 'CA NY WY OR CO'.split()

In [52]:
df['States'] = newind

In [53]:
df

          W         X         Y         Z  States
A -1.043159 -0.820856  0.665146  1.822627      CA
B -1.441583  0.233808  0.339619  0.231214      NY
C -0.009926  1.803848  1.367844 -0.261362      WY
D -0.447606  0.569842 -1.009020 -1.128953      OR
E  1.059545  1.030441  0.142366  0.592879      CO


In [54]:
df.set_index('States')

               W         X         Y         Z
States
CA     -1.043159 -0.820856  0.665146  1.822627
NY     -1.441583  0.233808  0.339619  0.231214
WY     -0.009926  1.803848  1.367844 -0.261362
OR     -0.447606  0.569842 -1.009020 -1.128953
CO      1.059545  1.030441  0.142366  0.592879


In [55]:
df.set_index('States',inplace=True)

In [56]:
df

               W         X         Y         Z
States
CA     -1.043159 -0.820856  0.665146  1.822627
NY     -1.441583  0.233808  0.339619  0.231214
WY     -0.009926  1.803848  1.367844 -0.261362
OR     -0.447606  0.569842 -1.009020 -1.128953
CO      1.059545  1.030441  0.142366  0.592879
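
Note that `set_index` replaces the existing index, so the old A to E labels are gone unless they were saved as a column first. A small sketch of the round trip on a made-up frame, including the `drop=True` option of `reset_index`:

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'States': ['CA', 'NY']}, index=['A', 'B'])

# 'States' becomes the index; the old 'A'/'B' labels are discarded.
by_state = demo.set_index('States')

# reset_index() moves the index back into a column and numbers rows from 0.
restored = by_state.reset_index()

# drop=True discards the index instead of keeping it as a column.
dropped = by_state.reset_index(drop=True)
print(restored.columns.tolist(), dropped.columns.tolist())
```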
