In [3]:
import pandas as pd
import numpy as np

### Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse
data structures:

1. Series
2. DataFrame

### Series

A Series is a one-dimensional array-like object containing an array of data (of any
NumPy data type) and an associated array of data labels, called its index. The simplest
Series is formed from only an array of data:


In [4]:
obj=pd.Series([1,2,3,4])
obj

0    1
1    2
2    3
3    4
dtype: int64

In [5]:
obj.shape

(4,)

In [6]:
obj.dtypes

dtype('int64')

In [7]:
obj.values

array([1, 2, 3, 4], dtype=int64)

Since we did not specify an index for the data, a default one consisting of the
integers 0 through N - 1 (where N is the length of the data) is created.

In [8]:
obj

0    1
1    2
2    3
3    4
dtype: int64

Often it will be desirable to create a Series with an index identifying each data point:

In [9]:
obj1=pd.Series([2,4,5,7],index=['a','b','c','d'])
obj1

a    2
b    4
c    5
d    7
dtype: int64

In [10]:
obj1.index

Index(['a', 'b', 'c', 'd'], dtype='object')

Compared with a regular NumPy array, you can use values in the index when selecting
single values or a set of values:

In [11]:
obj1['a']

2

In [12]:
obj1['b':]

b    4
c    5
d    7
dtype: int64
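
A minimal sketch of label-based selection, using a hypothetical Series `s` that is not part of the notebook. It illustrates two points worth knowing: a list of labels selects several values at once, and a slice written with labels includes its end point, unlike a positional slice.

```python
import pandas as pd

# Hypothetical data for illustration only
s = pd.Series([2, 4, 5, 7], index=['a', 'b', 'c', 'd'])

# A list of labels returns a Series in the requested order
subset = s[['d', 'a']]

# Label slices include both end points
label_slice = s['b':'c']

# Positional slices (via iloc) exclude the stop position
pos_slice = s.iloc[1:2]

print(subset)
print(label_slice)
print(pos_slice)
```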

Values can be set by label, and NumPy array operations such as filtering with a boolean
array, scalar multiplication, or applying math functions preserve the index-value link:


In [13]:
obj1['c']=56

In [14]:
obj1

a     2
b     4
c    56
d     7
dtype: int64

In [15]:
obj1*2

a      4
b      8
c    112
d     14
dtype: int64

In [16]:
obj1[obj1>4]

c    56
d     7
dtype: int64
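
The cells above cover scalar multiplication and boolean filtering. As a small sketch of the third case, applying a NumPy math function, the hypothetical Series `s` below is passed to `np.sqrt`; the result is still a Series carrying the same labels.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
s = pd.Series([1, 4, 9, 16], index=['a', 'b', 'c', 'd'])

roots = np.sqrt(s)   # NumPy ufunc applied element-wise; labels are kept
large = s[s > 4]     # boolean filtering keeps the labels of matching rows

print(roots)
print(large)
```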

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping
of index values to data values. It can be substituted into many functions that expect a
dict:


In [17]:
'b' in obj1

True

In [18]:
'f' in obj1

False

In [19]:
1 in obj1

False
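
Note that, as with a dict, `in` tests the index labels and not the stored values. A short sketch with a hypothetical Series `s` shows the difference, along with `to_dict` for the cases where a real dict is needed:

```python
import pandas as pd

# Hypothetical data for illustration only
s = pd.Series([2, 4, 5, 7], index=['a', 'b', 'c', 'd'])

label_present = 'b' in s        # True: 'b' is an index label
value_as_label = 4 in s         # False: 4 is a value, not a label
value_present = 4 in s.values   # True: membership test on the data itself

as_dict = s.to_dict()           # plain dict mapping label -> value
print(label_present, value_as_label, value_present)
print(as_dict)
```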

Should you have data contained in a Python dict, you can create a Series from it by
passing the dict:


In [20]:
sd={'surya':1000,'charles':1700,'hari':2000,'selva':5000}

In [21]:
obj2=pd.Series(sd)

In [22]:
obj2

surya      1000
charles    1700
hari       2000
selva      5000
dtype: int64

When only passing a dict, the index in the resulting Series takes the dict's keys. As the
output above shows, recent versions of pandas keep the keys in insertion order rather
than sorting them.
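
To get the labels in sorted order regardless of how the dict was built, `sort_index` can be called on the result. This is a small sketch, assuming a recent pandas release on Python 3.7 or later, where the dict's insertion order is what the constructor keeps; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical data for illustration only
salaries = {'surya': 1000, 'charles': 1700, 'hari': 2000, 'selva': 5000}

s = pd.Series(salaries)
in_order = list(s.index)     # keys in the order they were inserted

sorted_s = s.sort_index()    # new Series with labels sorted alphabetically
print(in_order)
print(sorted_s)
```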

In [24]:
lis=['hari','charles','surya','selva','jai']

In [25]:
obj3=pd.Series(sd,index=lis)
obj3

hari       2000.0
charles    1700.0
surya      1000.0
selva      5000.0
jai           NaN
dtype: float64

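Here the values found in sd were placed in the positions given by lis, but since no value
for 'jai' was found, it appears as NaN (not a number), which pandas uses to mark missing
values. The isnull and notnull methods detect missing data: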

In [26]:
obj3.isnull()

hari       False
charles    False
surya      False
selva      False
jai         True
dtype: bool

In [27]:
obj3.notnull()

hari        True
charles     True
surya       True
selva       True
jai        False
dtype: bool
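
The boolean masks returned by these methods are often used for counting or filtering. A brief sketch on a hypothetical Series `s`, also showing the top-level `pd.isnull` function, which gives the same result as the method:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
s = pd.Series([2000.0, 1700.0, np.nan], index=['hari', 'charles', 'jai'])

mask = pd.isnull(s)           # same result as s.isnull()
n_missing = int(mask.sum())   # True counts as 1 when summed
present = s[s.notnull()]      # keep only the non-missing entries

print(n_missing)
print(present)
```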

In [28]:
obj2

surya      1000
charles    1700
hari       2000
selva      5000
dtype: int64

In [29]:
obj3

hari       2000.0
charles    1700.0
surya      1000.0
selva      5000.0
jai           NaN
dtype: float64

In [30]:
obj2+obj3

charles     3400.0
hari        4000.0
jai            NaN
selva      10000.0
surya       2000.0
dtype: float64
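
The sum above is aligned on index labels rather than on position, and a label that is missing (or NaN) in either operand yields NaN. The sketch below repeats this with two hypothetical Series, `left` and `right`. The `fill_value` argument of `Series.add` is one way to treat a label missing from one side as zero instead.

```python
import pandas as pd

# Hypothetical data for illustration only
left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['c', 'b', 'd'])

total = left + right                      # 'a' and 'd' exist on one side only -> NaN
filled = left.add(right, fill_value=0)    # the missing side is treated as 0

print(total)
print(filled)
```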

Both the Series object itself and its index have a name attribute:

In [31]:
obj3.name='candidate'
obj3.index.name='li'

In [32]:
obj3

li
hari       2000.0
charles    1700.0
surya      1000.0
selva      5000.0
jai           NaN
Name: candidate, dtype: float64

A Series’s index can be altered in place by assignment:

In [33]:
obj.index=['one','two','three','four']

In [34]:
obj

one      1
two      2
three    3
four     4
dtype: int64
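
The new index must have the same length as the Series. A minimal sketch with a hypothetical Series `s`, catching the `ValueError` that pandas raises on a length mismatch:

```python
import pandas as pd

# Hypothetical data for illustration only
s = pd.Series([1, 2, 3, 4])
s.index = ['one', 'two', 'three', 'four']   # same length: allowed

try:
    s.index = ['x', 'y']                    # wrong length
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__)
print(s)
```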

### DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered
collection of columns, each of which can be a different value type (numeric, string,
boolean, etc.).

The DataFrame has both a row and a column index; it can be thought of as a dict of
Series that all share the same index.


There are numerous ways to construct a DataFrame, though one of the most common
is from a dict of equal-length lists or NumPy arrays:


In [35]:
df={'name':['selva','surya','charles','hari'],
    'age':[20,22,19,21],
    'score':[98,100,96,98]
   }

In [36]:
frame=pd.DataFrame(df)

In [37]:
frame

      name  age  score
0    selva   20     98
1    surya   22    100
2  charles   19     96
3     hari   21     98
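
The example above uses lists. As a sketch of the NumPy-array variant, the hypothetical dict `data` below holds one integer array and one float array; each column keeps its own dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
data = {
    'age': np.array([20, 22, 19, 21]),
    'score': np.array([98.0, 100.0, 96.0, 98.0]),
}

frame = pd.DataFrame(data)   # one column per key, default integer index
print(frame)
print(frame.dtypes)          # dtype is tracked per column
```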


In [38]:
c=pd.DataFrame(df,columns=['name','age','score'])
c

      name  age  score
0    selva   20     98
1    surya   22    100
2  charles   19     96
3     hari   21     98


As with Series, if you pass a column that isn’t contained in data, it will appear with NA
values in the result:

In [39]:
d=pd.DataFrame(df,columns=['name','age','score','dept'],index=['one','two','three','four'])
d

          name  age  score dept
one      selva   20     98  NaN
two      surya   22    100  NaN
three  charles   19     96  NaN
four      hari   21     98  NaN


In [40]:
d.columns

Index(['name', 'age', 'score', 'dept'], dtype='object')

In [41]:
d['name']

one        selva
two        surya
three    charles
four        hari
Name: name, dtype: object

In [42]:
d['age']

one      20
two      22
three    19
four     21
Name: age, dtype: int64

In [43]:
d['score']

one       98
two      100
three     96
four      98
Name: score, dtype: int64

In [44]:
d.score

one       98
two      100
three     96
four      98
Name: score, dtype: int64

Rows can be retrieved by label with the loc indexer (the older ix indexer is deprecated
and was removed in pandas 1.0):

In [45]:
d.loc['three']

name     charles
age           19
score         96
dept         NaN
Name: three, dtype: object
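
Rows are selected by label with `loc` and by integer position with `iloc`. A short sketch on a hypothetical frame, including selection of a single cell by row and column label:

```python
import pandas as pd

# Hypothetical data for illustration only
frame = pd.DataFrame(
    {'name': ['selva', 'surya', 'charles'], 'age': [20, 22, 19]},
    index=['one', 'two', 'three'],
)

by_label = frame.loc['three']       # row whose index label is 'three'
by_position = frame.iloc[2]         # third row, counting from 0
single = frame.loc['two', 'age']    # one cell: row label, column label

print(by_label)
print(single)
```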

Columns can be modified by assignment. For example, the empty 'dept' column could
be assigned a scalar value or an array of values:


In [46]:
d['dept']='DataScience'

In [47]:
d

          name  age  score         dept
one      selva   20     98  DataScience
two      surya   22    100  DataScience
three  charles   19     96  DataScience
four      hari   21     98  DataScience


In [48]:
d['dept']=np.arange(4)
d

          name  age  score  dept
one      selva   20     98     0
two      surya   22    100     1
three  charles   19     96     2
four      hari   21     98     3


When assigning lists or arrays to a column, the value's length must match the length
of the DataFrame. If you assign a Series, it is instead aligned to the DataFrame's
index by label, with missing values inserted wherever the Series has no matching label:

In [49]:
val=pd.Series(['DS','DS'],index=['one','four'])

In [50]:
d['dept']=val

In [51]:
d

          name  age  score dept
one      selva   20     98   DS
two      surya   22    100  NaN
three  charles   19     96  NaN
four      hari   21     98   DS
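
The two assignment rules can be contrasted in a few lines. This sketch uses a hypothetical frame: a Series is matched by label (and labels the frame does not have are ignored), while a list of the wrong length raises a `ValueError`.

```python
import pandas as pd

# Hypothetical data for illustration only
frame = pd.DataFrame({'age': [20, 22, 19, 21]},
                     index=['one', 'two', 'three', 'four'])

# A Series is aligned on the index; unmatched rows become NaN,
# and the label 'five', absent from the frame, is dropped
frame['dept'] = pd.Series(['DS', 'ML'], index=['four', 'five'])

try:
    frame['bad'] = ['x', 'y']   # list length 2 does not match 4 rows
    error = None
except ValueError as exc:
    error = exc

print(frame)
print(type(error).__name__)
```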


Assigning a column that doesn’t exist will create a new column. The del keyword will
delete columns as with a dict:

In [52]:
d['new']=d.dept=='DS'

In [53]:
d

          name  age  score dept    new
one      selva   20     98   DS   True
two      surya   22    100  NaN  False
three  charles   19     96  NaN  False
four      hari   21     98   DS   True


In [54]:
del d['new']

In [55]:
d

          name  age  score dept
one      selva   20     98   DS
two      surya   22    100  NaN
three  charles   19     96  NaN
four      hari   21     98   DS


In [56]:
d.columns

Index(['name', 'age', 'score', 'dept'], dtype='object')

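Another common form of data is a nested dict of dicts, where the outer keys become the
columns and the inner keys become the row index. The cells below instead build a small
frame from a dict of lists with an explicit index; it is reused in the rest of this section: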


In [58]:
d1={'name':['swetha','pavithra','rama'],
    'age':[20,22,19],
    'score':[98,95,94]
   }

In [59]:
e=pd.DataFrame(d1,index=['a','b','c'])

In [60]:
e

       name  age  score
a    swetha   20     98
b  pavithra   22     95
c      rama   19     94
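
For comparison, a nested dict of dicts can be passed directly to the constructor: outer keys become columns and inner keys become row labels. A sketch with hypothetical data, in which the inner dicts deliberately do not share all their keys:

```python
import pandas as pd

# Hypothetical data for illustration only
nested = {
    'age': {'a': 20, 'b': 22},
    'score': {'b': 95, 'c': 94},
}

frame = pd.DataFrame(nested)   # row index is the union of the inner keys
print(frame)                   # missing combinations show up as NaN
```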


The result can always be transposed:

In [61]:
e.T

            a         b     c
name   swetha  pavithra  rama
age        20        22    19
score      98        95    94


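When a nested dict of dicts is passed, the keys in the inner dicts are combined to form
the index of the result. This isn't the case if an explicit index is specified. The same
index argument also reindexes an existing DataFrame; labels with no data, such as 'd'
here, are filled with missing values: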

In [62]:
pd.DataFrame(e,index=['a','b','c','d'])

       name   age  score
a    swetha  20.0   98.0
b  pavithra  22.0   95.0
c      rama  19.0   94.0
d       NaN   NaN    NaN


In [63]:
# Dicts of Series are treated in much the same way.
# Note: d1 holds plain lists, so the slices below are lists, not Series.
pdata = {'name': d1['name'][:-1],'age': d1['age'][:2]}

In [64]:
pdata

{'name': ['swetha', 'pavithra'], 'age': [20, 22]}
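
Since `pdata` above contains plain lists, here is a separate sketch of an actual dict of Series. The Series `names` and `ages` are hypothetical; each is sliced to a different subset of labels, and the constructor aligns them on the union of their indexes.

```python
import pandas as pd

# Hypothetical data for illustration only
names = pd.Series(['swetha', 'pavithra', 'rama'], index=['a', 'b', 'c'])
ages = pd.Series([20, 22, 19], index=['a', 'b', 'c'])

# Slice each Series to a different subset of labels
series_dict = {'name': names.iloc[:-1], 'age': ages.iloc[1:]}

frame = pd.DataFrame(series_dict)   # aligned on the union of the indexes
print(frame)
```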

If a DataFrame's index or columns have their name attributes set, these are also
displayed:

In [65]:
e.index.name = 'students'

In [66]:
e

              name  age  score
students
a           swetha   20     98
b         pavithra   22     95
c             rama   19     94


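As with Series, the values attribute returns the data contained in the DataFrame, here as
a two-dimensional ndarray:

In [67]: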
e.values

array([['swetha', 20, 98],
       ['pavithra', 22, 95],
       ['rama', 19, 94]], dtype=object)

In [68]:
c.values

array([['selva', 20, 98],
       ['surya', 22, 100],
       ['charles', 19, 96],
       ['hari', 21, 98]], dtype=object)
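
Both arrays above have dtype object because each frame mixes strings and integers, so NumPy has to pick a dtype that accommodates every column. A sketch with two hypothetical frames contrasts this with an all-numeric frame:

```python
import pandas as pd

# Hypothetical data for illustration only
numeric = pd.DataFrame({'age': [20, 22], 'score': [98.5, 95.0]})
mixed = pd.DataFrame({'name': ['swetha', 'rama'], 'age': [20, 19]})

numeric_values = numeric.values   # int and float columns are upcast to float64
mixed_values = mixed.values       # strings and ints fall back to object

print(numeric_values.dtype)
print(mixed_values.dtype)
```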