# The Pandas DataFrame Object
The DataFrame is another fundamental structure in Pandas. Like the <i>Series object</i>, the DataFrame can be thought of either as a generalization
of a NumPy array, or as a specialization of a Python dictionary. We’ll now
take a look at each of these perspectives.

## DataFrame as a generalized NumPy array
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame
is an analog of a two-dimensional array with both flexible row indices and flexible
column names. Just as you might think of a two-dimensional array as an ordered
sequence of aligned one-dimensional columns, you can think of a DataFrame as a
sequence of aligned Series objects. Here, by “aligned” we mean that they share the
same index.

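To demonstrate this, let’s first construct two Series, one holding the revenue and one holding the unit price of each product. Because some of the values are given as strings, both Series have dtype object: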

In [1]:
import numpy as np 
import pandas as pd 
# Revenue Dictionary
product_rev_dict = {'Bananas': 4000000, 'Onions': '3000000', 'Tomatoes': '3500000', 'Maize Flour': 6500000}
# Product Price per Kg. dictionary
product_price_dict = {'Bananas': 34, 'Onions': '97', 'Tomatoes': '101', 'Maize Flour': 65}
product_rev = pd.Series(product_rev_dict)
product_price = pd.Series(product_price_dict)
print(product_rev)
print(product_price)

Bananas        4000000
Onions         3000000
Tomatoes       3500000
Maize Flour    6500000
dtype: object
Bananas         34
Onions          97
Tomatoes       101
Maize Flour     65
dtype: object


In [2]:
# We can create a single two-dimensional object containing the above information
sales = pd.DataFrame({'actual_revenue': product_rev, 'product_unit_price': product_price})
sales

            actual_revenue product_unit_price
Bananas            4000000                 34
Onions             3000000                 97
Tomatoes           3500000                101
Maize Flour        6500000                 65
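
The alignment mentioned earlier happens by label, not by position. The following is a minimal sketch of that behavior; the `revenue` and `price` Series are hypothetical and deliberately list the same labels in a different order:

```python
import pandas as pd

# Two hypothetical Series with the same labels in a different order
revenue = pd.Series({'Bananas': 4000000, 'Onions': 3000000})
price = pd.Series({'Onions': 97, 'Bananas': 34})

# The DataFrame constructor matches values by index label
aligned = pd.DataFrame({'revenue': revenue, 'price': price})
print(aligned)

# Each row holds the revenue and price that share a label
print(aligned.loc['Onions', 'price'])
```

Each row of `aligned` pairs the values that share a label, regardless of the order in which they were supplied.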


Like the Series object, the DataFrame has an index attribute that gives access to the
index labels. In addition, it has a columns attribute, an Index object holding the column labels:

In [3]:
# Index attribute
print("Index Attribute:", sales.index)
# column attribute
print("Column Attribute:", sales.columns)

Index Attribute: Index(['Bananas', 'Onions', 'Tomatoes', 'Maize Flour'], dtype='object')
Column Attribute: Index(['actual_revenue', 'product_unit_price'], dtype='object')


<i><b>Thus the DataFrame can be thought of as a generalization of a two-dimensional
NumPy array, where both the rows and columns have a generalized index for accessing
    the data.</b></i>

## DataFrame as specialized dictionary
We can also think of a DataFrame as a specialization of a dictionary. Where
a dictionary maps a key to a value, a DataFrame maps a column name to a Series of
column data. For example, asking for the 'actual_revenue' column returns the Series object
containing the actual revenue per product:

In [4]:
sales['actual_revenue']

Bananas        4000000
Onions         3000000
Tomatoes       3500000
Maize Flour    6500000
Name: actual_revenue, dtype: object
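
The dictionary analogy extends to a few other operations. The sketch below assumes a small hypothetical two-row version of `sales` and shows that membership tests, `keys()`, and `items()` all work on column names rather than on row labels:

```python
import pandas as pd

# Hypothetical two-row frame mirroring the sales example
sales = pd.DataFrame(
    {'actual_revenue': [4000000, 3000000], 'product_unit_price': [34, 97]},
    index=['Bananas', 'Onions'],
)

# Membership tests and keys() operate on column names, as with a dict
print('actual_revenue' in sales)   # a column name
print('Bananas' in sales)          # a row label, which is not a key
print(list(sales.keys()))

# items() yields (column name, Series) pairs
for name, column in sales.items():
    print(name, type(column).__name__)
```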

## Constructing DataFrame objects
A Pandas DataFrame can be constructed in a variety of ways:

### 1. From a single Series object. 
A DataFrame is a collection of Series objects, and a single-column
DataFrame can be constructed from a single Series:

In [5]:
# Constructing a Single Column DataFrame
actual_revenues = pd.DataFrame({'actual_revenue': product_rev})
actual_revenues

            actual_revenue
Bananas            4000000
Onions             3000000
Tomatoes           3500000
Maize Flour        6500000


### 2. From a list of dicts. 
Any list of dictionaries can be made into a DataFrame. We’ll use a
simple list comprehension to create some data:

In [6]:
# constructing DataFrame from a list of dictionaries
my_data = [{'a':i, 'b': i * 2, 'c': i * 3, 'd': i * 4, 'e': i * 4} for i in range(3)]
print(my_data)
print(pd.DataFrame(my_data))

[{'a': 0, 'b': 0, 'c': 0, 'd': 0, 'e': 0}, {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 4}, {'a': 2, 'b': 4, 'c': 6, 'd': 8, 'e': 8}]
   a  b  c  d  e
0  0  0  0  0  0
1  1  2  3  4  4
2  2  4  6  8  8


Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e.,
“not a number”) values:

In [7]:
pd.DataFrame([{'a':1, 'b':2, 'c': 3}, {'b': 5, 'c':6}, {'a':7}])

     a    b    c
0  1.0  2.0  3.0
1  NaN  5.0  6.0
2  7.0  NaN  NaN
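
A side effect of this filling is that the affected integer columns become floating point, since NaN is a float value. The sketch below rebuilds the same frame under the hypothetical name `partial` and counts the filled-in entries:

```python
import pandas as pd

partial = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'b': 5, 'c': 6}, {'a': 7}])

# Columns with missing entries are upcast to float64 so they can hold NaN
print(partial.dtypes)

# isna() marks the filled-in positions; sum() counts them per column
print(partial.isna().sum())
```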


### 3. From a dictionary of Series objects. 
As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well:

In [8]:
sales = pd.DataFrame({'actual_revenue': product_rev, 'product_unit_price': product_price})
sales

            actual_revenue product_unit_price
Bananas            4000000                 34
Onions             3000000                 97
Tomatoes           3500000                101
Maize Flour        6500000                 65


### 4. From a two-dimensional NumPy array.
Given a two-dimensional array of data, we can
create a DataFrame with any specified column and index names. If omitted, an integer
index will be used for each:

In [9]:
# DataFrame from a 2d array
pd.DataFrame(np.random.rand(5, 5),
            columns = ['foo', 'bar', 'egg', 'spam', 'hen'],
            index =  ['a', 'b', 'c', 'd', 'e'])

        foo       bar       egg      spam       hen
a  0.438722  0.576958  0.242379  0.940302  0.967286
b  0.887959  0.714468  0.065974  0.043574  0.871613
c  0.724005  0.313264  0.605285  0.579484  0.906878
d  0.840704  0.186129  0.045894  0.919648  0.399972
e  0.953896  0.423365  0.005171  0.707399  0.369648
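
To illustrate the case where the labels are omitted, here is a short sketch (the name `default_labels` is only for illustration) in which both the index and the columns fall back to integer ranges:

```python
import numpy as np
import pandas as pd

# With no index or columns given, both default to integer ranges
default_labels = pd.DataFrame(np.zeros((2, 3)))
print(default_labels.index)
print(default_labels.columns)
```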


### 5. From a NumPy structured array.
A Pandas DataFrame operates much like a
structured array, and can be created directly from one:

In [10]:
# DataFrame from a structured array
A = np.zeros(8, dtype = [('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.), (0, 0.), (0, 0.), (0, 0.), (0, 0.),
       (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [11]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0
3  0  0.0
4  0  0.0
5  0  0.0
6  0  0.0
7  0  0.0
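
As a quick check of how the structured array's fields carry over, the sketch below (with the hypothetical names `records` and `frame`) prints the resulting column names and dtypes:

```python
import numpy as np
import pandas as pd

# Each field of the structured array becomes one DataFrame column
records = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
frame = pd.DataFrame(records)

print(list(frame.columns))
print(frame.dtypes)
```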


## Data Selection in DataFrame

### DataFrame as a dictionary

In [24]:
# Revenue Dictionary
product_rev_dict = {'Bananas': 4000000, 'Onions': '3000000', 'Tomatoes': '3500000', 'Maize Flour': 6500000}
# Product Price per Kg. dictionary
product_price_dict = {'Bananas': 34, 'Onions': '97', 'Tomatoes': '101', 'Maize Flour': 65}
product_rev = pd.Series(product_rev_dict)
product_price = pd.Series(product_price_dict)
# We can create a single two-dimensional object containing the above information
sales = pd.DataFrame({'actual_revenue': product_rev, 'product_unit_price': product_price})
sales

            actual_revenue product_unit_price
Bananas            4000000                 34
Onions             3000000                 97
Tomatoes           3500000                101
Maize Flour        6500000                 65


The individual Series that make up the columns of the DataFrame can be accessed
via dictionary-style indexing of the column name:

In [25]:
# dictionary-style indexing access
print(sales['actual_revenue'])
# Equivalently, we can use attribute-style access with column names that are strings:
print(sales.actual_revenue)

Bananas        4000000
Onions         3000000
Tomatoes       3500000
Maize Flour    6500000
Name: actual_revenue, dtype: object
Bananas        4000000
Onions         3000000
Tomatoes       3500000
Maize Flour    6500000
Name: actual_revenue, dtype: object


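This attribute-style column access returns the same column data as the
dictionary-style access. It works only when the column name is a string that is a valid Python identifier and does not conflict with an existing DataFrame attribute or method.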
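
Attribute-style access has limits that dictionary-style access does not. The sketch below uses a hypothetical `inventory` frame with one column named after a DataFrame method and another whose name contains a space:

```python
import pandas as pd

# Hypothetical frame: 'count' clashes with the DataFrame.count method,
# and 'unit price' is not a valid Python identifier
inventory = pd.DataFrame({'count': [3, 5], 'unit price': [34, 97]})

# Dictionary-style access always returns the column
print(inventory['count'])
print(inventory['unit price'])

# Attribute-style access finds the method, not the column
print(callable(inventory.count))
```

For this reason, dictionary-style access is the safer choice in code that must work with arbitrary column names.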

This dictionary-style syntax can also be
used to modify the object, in this case to convert the two existing columns to floats and to add a new column:

In [32]:
sales['actual_revenue'] = sales['actual_revenue'].astype(float)
sales['product_unit_price'] = sales['product_unit_price'].astype(float)
# Element-by-element arithmetic between two Series: volume = revenue / unit price
sales['volume'] = sales['actual_revenue'] / sales['product_unit_price']
sales.round(0)

             actual_revenue  product_unit_price    volume
Bananas           4000000.0                34.0  117647.0
Onions            3000000.0                97.0   30928.0
Tomatoes          3500000.0               101.0   34653.0
Maize Flour       6500000.0                65.0  100000.0


### DataFrame as two-dimensional array
We can also view the DataFrame as an enhanced two-dimensional
array. We can examine the raw underlying data array using the values
attribute:

In [35]:
# examining the underlying data
sales.values

array([[4.00000000e+06, 3.40000000e+01, 1.17647059e+05],
       [3.00000000e+06, 9.70000000e+01, 3.09278351e+04],
       [3.50000000e+06, 1.01000000e+02, 3.46534653e+04],
       [6.50000000e+06, 6.50000000e+01, 1.00000000e+05]])
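
The `to_numpy()` method is an alternative to `values` that the Pandas documentation recommends for new code. The sketch below uses hypothetical frames to show that both give the same array, and that columns of different types are converted to one common dtype:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'revenue': [4000000.0, 3000000.0], 'price': [34.0, 97.0]},
                     index=['Bananas', 'Onions'])

# to_numpy() returns the same two-dimensional data as .values
as_array = frame.to_numpy()
print(as_array.shape)
print(np.array_equal(as_array, frame.values))

# Columns of different types are converted to a single common dtype
mixed = pd.DataFrame({'name': ['Bananas'], 'price': [34.0]})
print(mixed.to_numpy().dtype)
```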

With this picture in mind, we can perform many familiar array-like operations on the
DataFrame itself. For example, we can transpose the full DataFrame to swap rows and
columns:

In [37]:
# Transpose the full dataframe
sales.T

                      Bananas     Onions   Tomatoes  Maize Flour
actual_revenue      4000000.0  3000000.0  3500000.0    6500000.0
product_unit_price       34.0       97.0      101.0         65.0
volume               117647.1   30927.84   34653.47     100000.0


When it comes to indexing of DataFrame objects, however, it is clear that the
dictionary-style indexing of columns precludes our ability to simply treat it as a
NumPy array. In particular, passing a single index to an array accesses a row:

In [38]:
sales.values[0]

array([4.00000000e+06, 3.40000000e+01, 1.17647059e+05])

In [39]:
# and passing a single “index” to a DataFrame accesses a column:
sales['actual_revenue']

Bananas        4000000.0
Onions         3000000.0
Tomatoes       3500000.0
Maize Flour    6500000.0
Name: actual_revenue, dtype: float64

Thus for array-style indexing, we need another convention. Here Pandas again uses
the loc and iloc indexers. Using the iloc indexer, we can
index the underlying array as if it were a simple NumPy array (using the implicit
Python-style index), but the DataFrame index and column labels are maintained in
the result:

In [42]:
sales.iloc[:3, :2]

          actual_revenue  product_unit_price
Bananas        4000000.0                34.0
Onions         3000000.0                97.0
Tomatoes       3500000.0               101.0


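Similarly, using the loc indexer we can index the underlying data in an array-like style, but using the explicit index and column names:

In [44]:
sales.loc[:'Tomatoes', :'product_unit_price']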

          actual_revenue  product_unit_price
Bananas        4000000.0                34.0
Onions         3000000.0                97.0
Tomatoes       3500000.0               101.0
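
One difference between the two indexers is easy to miss: positional slices exclude the stop position, while label slices include both endpoints. The following sketch, on a hypothetical one-column frame, makes this visible:

```python
import pandas as pd

frame = pd.DataFrame({'revenue': [4.0, 3.0, 3.5, 6.5]},
                     index=['Bananas', 'Onions', 'Tomatoes', 'Maize Flour'])

# Positional slice: rows 0 and 1, the stop position is excluded
by_position = frame.iloc[0:2]

# Label slice: both endpoint labels are included
by_label = frame.loc['Bananas':'Tomatoes']

print(list(by_position.index))
print(list(by_label.index))
```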


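The ix indexer, which allowed a hybrid of these two approaches, was deprecated in Pandas 0.20 and removed in Pandas 1.0. The same selection can be written by chaining iloc and loc:

In [45]:
sales.iloc[:3].loc[:, :'product_unit_price']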

          actual_revenue  product_unit_price
Bananas        4000000.0                34.0
Onions         3000000.0                97.0
Tomatoes       3500000.0               101.0


Any of the familiar NumPy-style data access patterns can be used within these indexers.
For example, in the loc indexer we can combine masking and fancy indexing as
in the following:

In [47]:
sales.loc[sales.volume >= 100000, ['actual_revenue', 'product_unit_price']]

             actual_revenue  product_unit_price
Bananas           4000000.0                34.0
Maize Flour       6500000.0                65.0
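
It can help to look at the two selectors separately. In the sketch below, which uses a hypothetical three-row frame and an extra price condition that is not part of the original example, the row selector is a boolean Series and the column selector is a list of labels:

```python
import pandas as pd

frame = pd.DataFrame(
    {'revenue': [4000000.0, 3000000.0, 6500000.0],
     'price': [34.0, 97.0, 65.0]},
    index=['Bananas', 'Onions', 'Maize Flour'],
)
frame['volume'] = frame['revenue'] / frame['price']

# Row selector: a boolean Series (masking); combine conditions with &
mask = (frame['volume'] >= 100000) & (frame['price'] < 50)
print(mask)

# Column selector: a list of labels (fancy indexing)
selected = frame.loc[mask, ['revenue', 'volume']]
print(selected)
```

Each condition needs its own parentheses because `&` binds more tightly than the comparison operators.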


Any of these indexing conventions may also be used to set or modify values; this is
done in the standard way that you might be accustomed to from working with
NumPy:

In [50]:
# changing the unit_price of Bananas
sales.iloc[0,1] = 35.66
sales

             actual_revenue  product_unit_price         volume
Bananas           4000000.0               35.66  117647.058824
Onions            3000000.0                97.0   30927.835052
Tomatoes          3500000.0               101.0   34653.465347
Maize Flour       6500000.0                65.0       100000.0
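
Note that in the output above the volume for Bananas still reflects the old price, because derived columns are not recomputed automatically. The sketch below, on a hypothetical two-row frame, performs the same update by label with loc and then recomputes the derived column explicitly:

```python
import pandas as pd

frame = pd.DataFrame({'revenue': [4000000.0, 3000000.0], 'price': [34.0, 97.0]},
                     index=['Bananas', 'Onions'])
frame['volume'] = frame['revenue'] / frame['price']

# Label-based equivalent of frame.iloc[0, 1] = 35.66
frame.loc['Bananas', 'price'] = 35.66

# Recompute the derived column so it matches the new price
frame['volume'] = frame['revenue'] / frame['price']
print(frame)
```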


## Additional Indexing Conventions

There are a couple of extra indexing conventions that might seem at odds with the preceding
discussion, but nevertheless can be very useful in practice. First, while indexing
refers to columns, slicing refers to rows:

In [51]:
sales['Tomatoes' : 'Maize Flour']

             actual_revenue  product_unit_price        volume
Tomatoes          3500000.0               101.0  34653.465347
Maize Flour       6500000.0                65.0      100000.0


In [54]:
# Such slices can also refer to rows by number rather than by index:
sales[2:4]

             actual_revenue  product_unit_price        volume
Tomatoes          3500000.0               101.0  34653.465347
Maize Flour       6500000.0                65.0      100000.0


In [55]:
# Similarly, direct masking operations are also interpreted row-wise rather than column-wise:
sales[sales.volume >= 100000]

             actual_revenue  product_unit_price         volume
Bananas           4000000.0               35.66  117647.058824
Maize Flour       6500000.0                65.0       100000.0
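
Each of these shorthand forms has an explicit equivalent written with loc or iloc, which some readers may find easier to follow. The sketch below checks the three equivalences on a hypothetical one-column frame:

```python
import pandas as pd

frame = pd.DataFrame({'volume': [117647.0, 30928.0, 34653.0, 100000.0]},
                     index=['Bananas', 'Onions', 'Tomatoes', 'Maize Flour'])

# Label slice of rows, shorthand versus loc
same_label_slice = frame['Tomatoes':'Maize Flour'].equals(
    frame.loc['Tomatoes':'Maize Flour'])

# Positional slice of rows, shorthand versus iloc
same_position_slice = frame[2:4].equals(frame.iloc[2:4])

# Row mask, shorthand versus loc
same_mask = frame[frame.volume >= 100000].equals(
    frame.loc[frame.volume >= 100000])

print(same_label_slice, same_position_slice, same_mask)
```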
