# This tutorial is based on the "Python Data Science Handbook"

Pandas is a newer package built on top of NumPy, and provides an
efficient implementation of a DataFrame. DataFrames are essentially multidimensional
arrays with attached row and column labels, and often with heterogeneous
types and/or missing data. As well as offering a convenient storage interface for
labeled data, Pandas implements a number of powerful data operations familiar to
users of both database frameworks and spreadsheet programs.

NumPy’s ndarray data structure provides essential features for the type of
clean, well-organized data typically seen in numerical computing tasks. While it
serves this purpose very well, its limitations become clear when we need more flexibility
(attaching labels to data, working with missing data, etc.) and when attempting
operations that do not map well to element-wise broadcasting (groupings, pivots,
etc.), each of which is an important piece of analyzing the less structured data available
in many forms in the world around us. Pandas, and in particular its Series and
DataFrame objects, builds on the NumPy array structure and provides efficient access
to these sorts of “data munging” tasks that occupy much of a data scientist’s time.
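
As a rough illustration of the kind of task described above, the following sketch combines labels, a missing value, and a grouping in a few lines. The data and the column names `city` and `sales` are made up for this example and do not come from the original text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a label column, a numeric column, and one missing value.
df = pd.DataFrame(
    {
        "city": ["indore", "ujjain", "indore", "dewas"],
        "sales": [10.0, np.nan, 5.0, 7.0],
    }
)

# Count the missing entries in the numeric column.
n_missing = df["sales"].isna().sum()

# Group rows by label and sum each group; NaN values are skipped by default.
totals = df.groupby("city")["sales"].sum()

print(n_missing)
print(totals)
```

Expressing the same grouping with plain NumPy arrays would require tracking the labels and the missing entries by hand.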

In [9]:
import pandas as pd
import numpy as np

In [10]:
# to check pandas version
pd.__version__

'0.23.0'

## Introducing Pandas Objects

At the very basic level, Pandas objects can be thought of as enhanced versions of
NumPy structured arrays in which the rows and columns are identified with labels
rather than simple integer indices. As we will see during the course of this chapter,
Pandas provides a host of useful tools, methods, and functionality on top of the basic
data structures, but nearly everything that follows will require an understanding of
what these structures are. Thus, before we go any further, let’s introduce these three
fundamental Pandas data structures: the Series, DataFrame, and Index.

## The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from a
list or array as follows:

In [12]:
data = pd.Series([1, 2, 3, 4])
data

0    1
1    2
2    3
3    4
dtype: int64

In [13]:
data = pd.Series([1.0, 1.2, 2.3, 4, 5.4])
data

0    1.0
1    1.2
2    2.3
3    4.0
4    5.4
dtype: float64

As we see in the preceding output, the Series wraps both a sequence of values and a
sequence of indices, which we can access with the values and index attributes. The
values are simply a familiar NumPy array:

In [15]:
data.values

array([1. , 1.2, 2.3, 4. , 5.4])

In [17]:
# The index is an array-like object of type pd.Index, which we’ll discuss in more detail momentarily:
data.index

RangeIndex(start=0, stop=5, step=1)
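
To make the relationship between the two attributes concrete, here is a small sketch that checks the types involved. It only assumes the `data` Series defined above; the variable names `values` and `index` are chosen for this example.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, 1.2, 2.3, 4, 5.4])

values = data.values  # the underlying NumPy array
index = data.index    # the default integer index

print(isinstance(values, np.ndarray))  # True
print(values.dtype)                    # float64
print(list(index))                     # [0, 1, 2, 3, 4]
```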

As with a NumPy array, data can be accessed by the associated index via the familiar
Python square-bracket notation:

In [18]:
data[0]

1.0

In [19]:
data[1:3]

1    1.2
2    2.3
dtype: float64

As we will see, though, the Pandas Series is much more general and flexible than the
one-dimensional NumPy array that it emulates.

### Series as generalized NumPy array

From what we’ve seen so far, it may look like the Series object is basically interchangeable
with a one-dimensional NumPy array. The essential difference is the presence
of the index: while the NumPy array has an implicitly defined integer index used
to access the values, the Pandas Series has an explicitly defined index associated with
the values.

This explicit index definition gives the Series object additional capabilities. For
example, the index need not be an integer, but can consist of values of any desired
type. If we wish, we can use strings as an index:

In [20]:
data = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
data

a    1
b    2
c    3
d    4
dtype: int64

In [21]:
# And the item access works as expected:
data['a']

1

In [22]:
# The index can also consist of noncontiguous or nonsequential integers:
data = pd.Series([1, 2, 3, 4], index=[34, 23, 12, 22])
data

34    1
23    2
12    3
22    4
dtype: int64

In [23]:
data[23]

2
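
With integer labels like these, it is easy to confuse a label with a position. The sketch below separates the two using the `.loc` (label-based) and `.iloc` (position-based) accessors, which are standard pandas attributes but have not been introduced in the text so far.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4], index=[34, 23, 12, 22])

by_label = data.loc[23]     # the value stored under the label 23
by_position = data.iloc[0]  # the value at the first position

print(by_label)     # 2
print(by_position)  # 1
print(23 in data)   # True: membership is tested against the index labels
print(0 in data)    # False: 0 is a position here, not a label
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about which kind of lookup is intended.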

### Series as specialized dictionary

In this way, you can think of a Pandas Series a bit like a specialization of a Python
dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary
values, and a Series is a structure that maps typed keys to a set of typed values. This
typing is important: just as the type-specific compiled code behind a NumPy array
makes it more efficient than a Python list for certain operations, the type information
of a Pandas Series makes it much more efficient than Python dictionaries for certain
operations.

We can make the Series-as-dictionary analogy even clearer by constructing a
Series object directly from a Python dictionary:

In [25]:
population_dict = {'indore': 12345, 'ujjain': 123, 'dewas': 234, 'mhow': 345}
population = pd.Series(population_dict)
population

indore    12345
ujjain      123
dewas       234
mhow        345
dtype: int64

In [26]:
population['indore']

12345
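
The following sketch places a few dictionary-style operations next to array-style ones on the same `population` Series. It is meant only as an illustration of the analogy; the variable names are chosen for this example.

```python
import pandas as pd

population = pd.Series({"indore": 12345, "ujjain": 123, "dewas": 234, "mhow": 345})

# Dictionary-style: membership tests and key listing work on the index.
has_indore = "indore" in population
keys = list(population.keys())

# Array-style: typed values allow vectorized arithmetic and boolean masking.
doubled = population * 2
large = population[population > 200]

print(has_indore)
print(keys)
print(doubled["ujjain"])
print(list(large.index))
```

A plain Python dictionary would need an explicit loop or comprehension for the last two operations.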

Unlike a dictionary, though, the Series also supports array-style operations such as
slicing:

In [27]:
population['indore':'mhow']

indore    12345
ujjain      123
dewas       234
mhow        345
dtype: int64
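
One detail of the slice above is worth spelling out: when slicing by label, the final label is included in the result, whereas positional slicing excludes the end position. A minimal sketch, using `.iloc` for the positional case:

```python
import pandas as pd

population = pd.Series({"indore": 12345, "ujjain": 123, "dewas": 234, "mhow": 345})

label_slice = population["indore":"dewas"]  # the end label 'dewas' is included
position_slice = population.iloc[0:2]       # the end position 2 is excluded

print(list(label_slice.index))     # ['indore', 'ujjain', 'dewas']
print(list(position_slice.index))  # ['indore', 'ujjain']
```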

In [29]:
# data can be a scalar, which is repeated to fill the specified index:
data = pd.Series(5, index=[1, 2, 3])
data

1    5
2    5
3    5
dtype: int64

In [32]:
# data can be a dictionary, in which case the index defaults to the dictionary keys
# (sorted in older pandas versions; in insertion order since pandas 0.23 on Python 3.6+):
pd.Series({1: 'hi', 2: 'hello'})

1       hi
2    hello
dtype: object

In [34]:
# In each case, the index can be explicitly set if a different result is preferred.
# Notice that in this case, the Series is populated only with the explicitly identified keys:
pd.Series({1: 'hi', 2: 'hello'}, index=[2])

2    hello
dtype: object
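
The reverse situation can also occur: if the explicit index contains a label that is not a key of the dictionary, pandas fills that entry with a missing value rather than raising an error. A small sketch, where the label `3` is a hypothetical addition for this example:

```python
import pandas as pd

greetings = {1: "hi", 2: "hello"}

# Label 3 is not a key of the dictionary, so its value is missing (NaN).
s = pd.Series(greetings, index=[2, 3])

print(s)
print(s.isna().tolist())  # [False, True]
```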

## The Pandas DataFrame Object

The next fundamental structure in Pandas is the DataFrame. Like the Series object
discussed in the previous section, the DataFrame can be thought of either as a generalization
of a NumPy array, or as a specialization of a Python dictionary. We’ll now
take a look at each of these perspectives.

### DataFrame as a generalized NumPy array

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame
is an analog of a two-dimensional array with both flexible row indices and flexible
column names. Just as you might think of a two-dimensional array as an ordered
sequence of aligned one-dimensional columns, you can think of a DataFrame as a
sequence of aligned Series objects. Here, by “aligned” we mean that they share the
same index.

To demonstrate this, let’s first construct a new Series listing the area of each of the
four cities used in the previous section, and then combine it with the population
Series into a DataFrame:

In [40]:
population_dict = {'indore': 12345, 'ujjain': 123, 'dewas': 234, 'mhow': 345}
population = pd.Series(population_dict)

area_dict = {'indore': 23, 'ujjain': 34, 'dewas': 45, 'mhow': 56}
area = pd.Series(area_dict)

city = pd.DataFrame({'population': population, 'area': area})
city

|        | population | area |
|--------|------------|------|
| indore | 12345      | 23   |
| ujjain | 123        | 34   |
| dewas  | 234        | 45   |
| mhow   | 345        | 56   |


In [42]:
# Like the Series object, the DataFrame has an index attribute that gives access to the index labels:
city.index

Index(['indore', 'ujjain', 'dewas', 'mhow'], dtype='object')

In [44]:
# Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:
city.columns

Index(['population', 'area'], dtype='object')

Thus the DataFrame can be thought of as a generalization of a two-dimensional
NumPy array, where both the rows and columns have a generalized index for accessing
the data.
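
To see what “aligned” means in practice, consider a sketch in which the two Series do not share exactly the same labels. Here the `area` Series is deliberately given in a different order and without an entry for `'mhow'`; this variation is an assumption made for illustration and is not part of the original example.

```python
import pandas as pd

population = pd.Series({"indore": 12345, "ujjain": 123, "dewas": 234, "mhow": 345})

# Hypothetical variation: different order, and no entry for 'mhow'.
area = pd.Series({"dewas": 45, "indore": 23, "ujjain": 34})

# The DataFrame constructor aligns both Series on their index labels.
city = pd.DataFrame({"population": population, "area": area})

print(city)
print(city.shape)        # (4, 2)
print(city.values.ndim)  # 2: the underlying data is a two-dimensional array
```

Values are matched by label rather than by position, and the label missing from `area` shows up as `NaN` in the result.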

### DataFrame as specialized dictionary