# Introducing Pandas Objects


In [7]:
import numpy as np
import pandas as pd

# The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows.

In [8]:
data = pd.Series([0.25, 0.5, 0.75, 1])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [9]:
data.values

array([ 0.25,  0.5 ,  0.75,  1.  ])

In [10]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [11]:
data[1]

0.5

In [12]:
data[1:3]

1    0.50
2    0.75
dtype: float64

# Series as a generalized NumPy array

In [13]:
data = pd.Series([0.25, 0.5, 0.75, 1],
                index=['a', 'b', 'c', 'd'])

In [14]:
data['b']

0.5

In [15]:
data = pd.Series([.25, .5, .75, 1],
                index=[2, 5, 3, 7])

In [16]:
data[5]

0.5
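
With an integer index like this one, it can be unclear at a glance whether `data[5]` refers to a label or a position. A minimal sketch using the `loc` (label-based) and `iloc` (position-based) indexers, which make the intent explicit; the names `by_label` and `by_position` are only for illustration:

```python
import pandas as pd

data = pd.Series([.25, .5, .75, 1], index=[2, 5, 3, 7])

by_label = data.loc[5]      # element whose index label is 5
by_position = data.iloc[1]  # element at position 1 (the second element)

print(by_label, by_position)
```

Both expressions pick out the same element here, because the label 5 happens to sit at position 1.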

# Series as a specialized dictionary

In [17]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

population = pd.Series(population_dict)
population

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

In [18]:
population['California']

38332521

Unlike a dictionary, though, the Series also supports array-style operations such as slicing:

In [19]:
population['California':'Illinois']

California    38332521
Florida       19552860
Illinois      12882135
dtype: int64
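
One detail of the slice above is worth checking: when slicing by explicit labels, the stop label is included, whereas slicing by integer position follows the usual Python half-open convention. A minimal sketch, rebuilding the population data in the order shown above (the names `by_label` and `by_position` are only for illustration):

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Florida': 19552860,
                        'Illinois': 12882135,
                        'New York': 19651127,
                        'Texas': 26448193})

# Label-based slice: the stop label 'Illinois' is included
by_label = population['California':'Illinois']

# Position-based slice: the stop position 2 is excluded
by_position = population.iloc[0:2]

print(list(by_label.index))
print(list(by_position.index))
```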

# Constructing Series objects

We've already seen a few ways of constructing a Pandas Series from scratch; all of them are some version of the following:

>>> pd.Series(data, index=index)

Here index is an optional argument, and data can be one of many kinds of object.

For example, data can be a list or a NumPy array, in which case index defaults to an integer sequence:


In [20]:
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

Data can be a scalar, which is repeated to fill the specified index:

In [21]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

Data can be a dictionary, in which case *index* defaults to the dictionary keys (sorted, in the pandas version used here):


In [22]:
pd.Series({2:'a', 1:'b', 3:'c'})

1    b
2    a
3    c
dtype: object
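
The sorted output above reflects the pandas version this notebook was run with. As far as we know, more recent releases keep the dictionary's insertion order instead, so if sorted labels matter it is safer to ask for them explicitly. A small sketch using `sort_index` (the names `s` and `s_sorted` are only for illustration):

```python
import pandas as pd

s = pd.Series({2: 'a', 1: 'b', 3: 'c'})

# sort_index returns a new Series ordered by label,
# regardless of how the constructor ordered the keys
s_sorted = s.sort_index()
print(s_sorted)
```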

In each case, the index can be explicitly set if a different result is preferred:

In [23]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object

In this case, the Series is populated only with the explicitly specified keys.
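
The reverse situation is also worth a quick check: a label that is requested in `index` but absent from the dictionary does not raise an error; its value is filled with NaN. A minimal sketch (the label `4` is a made-up example):

```python
import pandas as pd

# 4 is not a key of the dictionary, so its value is missing
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 4])

print(s)
print(s.isna().tolist())
```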

In [24]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

population = pd.Series(population_dict)


area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
dtype: int64

In [25]:

states = pd.DataFrame({'population': population, 'area': area})

states

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


In [26]:
states.index

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

In [27]:
states.columns

Index(['area', 'population'], dtype='object')

The DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and the columns have a generalized index for accessing the data.

# DataFrame as specialized dictionary

Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. For example, asking for the 'area' attribute returns the Series object containing the areas we saw earlier:

In [28]:
states['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

Note the potential point of confusion here: in a two-dimensional NumPy array, data[0] will return the first row. For a DataFrame, data['col0'] will return the first column. Because of this, it is probably better to think about DataFrames as generalized dictionaries rather than generalized arrays, though both ways of looking at the situation can be useful. We'll explore more flexible means of indexing DataFrames later on.
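
The difference can be seen side by side. This sketch uses a small made-up array and the hypothetical column names `col0` and `col1`:

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2],
                [3, 4],
                [5, 6]])
df = pd.DataFrame(arr, columns=['col0', 'col1'])

first_row = arr[0]      # NumPy: an integer index selects a row
first_col = df['col0']  # DataFrame: a key selects a column

print(first_row)
print(first_col.tolist())
```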

# Constructing DataFrame objects

A Pandas DataFrame can be constructed in a variety of ways. Here we'll give several examples.

**From a single Series object**

A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:


In [29]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193


In [30]:
pd.DataFrame(area, columns=['area'])

              area
California  423967
Florida     170312
Illinois    149995
New York    141297
Texas       695662


**From a list of dicts**

Any list of dictionaries can be made into a DataFrame. We'll use a simple list comprehension to create some data:


In [32]:
data = [{'a': i, 'b': 2 * i}
       for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


Even if some keys in the dictionaries are missing, Pandas will fill them in with NaN (i.e., "not a number") values:

In [33]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
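
A side effect visible in the output above is that columns `a` and `c` are displayed as floats. NaN is a floating-point value, so an integer column that acquires a missing entry is upcast to a float dtype. A quick sketch that checks this on the same data:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# 'b' is complete and stays integer; 'a' and 'c' contain NaN and become float
print(df.dtypes)

# Number of missing entries per column
print(df.isna().sum())
```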


**From a dictionary of Series objects**

As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well:

In [34]:
pd.DataFrame({'population': population,
              'area': area})

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


**From a two-dimensional NumPy array**

Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If omitted, an integer index will be used for each:

In [38]:
pd.DataFrame(np.random.rand(3, 2),
            columns=['column1', 'column2'],
            index=['row1', 'row2', 'row3'])

       column1   column2
row1  0.735495  0.066894
row2  0.647538  0.562818
row3  0.488825  0.613811
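
When `columns` and `index` are left out, both default to integer ranges. A short sketch (a fixed array is used instead of random data so that the result is reproducible):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2))

print(df.index.tolist())    # default row labels
print(df.columns.tolist())  # default column labels
```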


**From a NumPy structured array**

A Pandas DataFrame operates much like a structured array, and can be created directly from one:

In [40]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.0), (0, 0.0), (0, 0.0)], 
      dtype=[('A', '<i8'), ('B', '<f8')])

In [41]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0
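
The field names of the structured array become the column names, and each column keeps the dtype of its field. A small sketch that verifies this for the array above:

```python
import numpy as np
import pandas as pd

A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
df = pd.DataFrame(A)

print(df.columns.tolist())  # field names become column names
print(df.dtypes)            # per-column dtypes follow the field dtypes
```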


# The Pandas Index Object

We have seen here that both the *Series* and *DataFrame* objects contain an explicit index that lets you reference and modify data. This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values). Those views have some interesting consequences in the operations available on Index objects. As a simple example, let's construct an Index from a list of integers:

In [64]:
ind = pd.Index([ 1, 2, 3, 5, 8, 13])
ind

Int64Index([1, 2, 3, 5, 8, 13], dtype='int64')

# Index as immutable array

The Index in many ways operates like an array. For example, we can use standard Python indexing notation to retrieve values or slices:

## Crash Course in Indexing

In [53]:
# Crash course in slicing, using a small example list
a = [10, 20, 30, 40, 50]
start, end, step = 1, 4, 2

a[start:end]  # items start through end-1
a[start:]     # items start through the rest of the list
a[:end]       # items from the beginning through end-1
a[:]          # a copy of the whole list

# There is also the step value, which can be used with any of the above:
a[start:end:step]  # start through not past end, by step

# The key point to remember is that the :end value represents the first value
# that is not in the selected slice. So the difference between end and start
# is the number of elements selected (if step is 1, the default).

# The other feature is that start or end may be a negative number, which means
# it counts from the end of the list instead of the beginning. So:
a[-1]   # last item in the list
a[-2:]  # last two items in the list

# Python is kind to the programmer if there are fewer items than you ask for.
# For example, if you ask for a[:-2] and a only contains one element, you get
# an empty list instead of an error. Sometimes you would prefer the error, so
# you have to be aware that this may happen.
a[:-2]  # everything except the last two items

[10, 20, 30]

In [54]:
ind[3]  # indexing is zero-based, so ind[3] is the fourth element, not the third

5

In [69]:
ind[::2]

Int64Index([1, 3, 8], dtype='int64')

Index objects also have many of the attributes familiar from NumPy arrays:


In [76]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

6 (6,) 1 int64


Since Index objects are immutable, they cannot be modified through normal assignment:

In [78]:
ind[0] = 9

TypeError: Index does not support mutable operations
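
When a changed index is needed, the usual route is to build a new one rather than modify the existing object. A sketch that catches the error and then constructs a modified copy; it assumes `Index.delete` and `Index.insert`, both of which return new Index objects, and the name `new_ind` is only for illustration:

```python
import pandas as pd

ind = pd.Index([1, 2, 3, 5, 8, 13])

try:
    ind[0] = 9
except TypeError as err:
    print("TypeError:", err)

# Build a new Index with the first element replaced; ind itself is unchanged
new_ind = ind.delete(0).insert(0, 9)

print(new_ind.tolist())
print(ind.tolist())
```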

# Index as ordered set

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [79]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [80]:
indA & indB # intersection

Int64Index([3, 5, 7], dtype='int64')

In [93]:
indA | indB   # union

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [94]:
indA ^ indB  # symmetric difference

Int64Index([1, 2, 9, 11], dtype='int64')
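
The operator forms above reflect the pandas version used in this notebook. As far as we know, newer releases deprecated them for set operations and later made them behave as element-wise logical operators, so the equivalent methods are the more portable spelling. A sketch (the result names are only for illustration):

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # values in both
either = indA.union(indB)                   # values in at least one
only_one = indA.symmetric_difference(indB)  # values in exactly one
only_a = indA.difference(indB)              # values in indA but not in indB

print(common.tolist())
print(either.tolist())
print(only_one.tolist())
print(only_a.tolist())
```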