# DataFrame

A DataFrame is a two-dimensional, labeled data structure whose columns can hold different data types (heterogeneous data).

In a relational database, a row (also called a tuple) represents a single, implicitly structured data item in a table.
Each row holds a set of related data, and every row in the table has the same structure.
https://en.wikipedia.org/wiki/Row_(database)
A column can also be called an attribute.
https://en.wikipedia.org/wiki/Column_(database)


Key Points
    • Heterogeneous data 
    • Size Mutable 
    • Data Mutable 
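
The three key points can be illustrated with a minimal sketch. The two-row frame below is a made-up example for illustration only, not data from a real source.

```python
import pandas as pd

# Hypothetical two-row frame used only for illustration.
df = pd.DataFrame({"Name": ["Kumar", "Balaji"], "Age": [32, 28]})

# Heterogeneous data: each column keeps its own dtype.
print(df.dtypes)

# Size mutable: a column can be added after creation.
df["Rating"] = [3.45, 4.6]

# Data mutable: an individual value can be overwritten.
df.loc[0, "Rating"] = 3.5

# Size mutable: a column can also be dropped.
df = df.drop(columns=["Age"])
print(df)
```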


| Name   | Age | Gender | Rating |
|--------|-----|--------|--------|
| Kumar  | 32  | Male   | 3.45   |
| Balaji | 28  | Male   | 4.6    |
| Siva   | 45  | Male   | 3.9    |
| Sudha  | 38  | Female | 2.78   |



The table holds data about an organization's sales team, including each member's overall performance rating. The data is arranged in rows and columns: each column represents an attribute, and each row represents a tuple.

The data types of the four columns:

| Column | Type    |
|--------|---------|
| Name   | String  |
| Age    | Integer |
| Gender | String  |
| Rating | Float   |
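
As a minimal sketch, the sales-team table above can be built from a dict of lists and its column types inspected with `dtypes`. The variable name `sales` is an assumption made for this example. Depending on the pandas version, string columns are usually reported as `object` rather than as a dedicated string type.

```python
import pandas as pd

# The sales-team table from above, one list per column.
sales = pd.DataFrame({
    "Name": ["Kumar", "Balaji", "Siva", "Sudha"],
    "Age": [32, 28, 45, 38],
    "Gender": ["Male", "Male", "Male", "Female"],
    "Rating": [3.45, 4.6, 3.9, 2.78],
})

print(sales)

# One dtype per column: integer for Age, float for Rating,
# and a string-like dtype for Name and Gender.
print(sales.dtypes)
```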



<img src='Dataframes-axis.png' />


The primary data structure in pandas is called a dataframe. Dataframes are the pandas equivalent of a NumPy 2D ndarray, with a few key differences:

    • Axis values can have string labels, not just numeric ones. 
    • Dataframes can contain columns of different data types, including integer, float, and string. 

When you select just one column of a dataframe, you get a different pandas type: a series object. Series is the pandas type for one-dimensional objects. Any 1D pandas object you see will be a series, and any 2D pandas object will be a dataframe.

In fact, you can think of a dataframe as a collection of series objects, which is similar to how pandas stores the data behind the scenes.
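
The relationship between the two types can be sketched as follows. The small frame is a made-up example; the point is only that a single column label returns a Series, a list of labels returns a DataFrame, and a frame can be rebuilt from its column Series.

```python
import pandas as pd

# Hypothetical frame used only for illustration.
df = pd.DataFrame({"Name": ["Tom", "Jack"], "Age": [28, 34]})

ages = df["Age"]       # a single label selects a 1D Series
subset = df[["Age"]]   # a list of labels selects a 2D DataFrame

print(type(ages).__name__)
print(type(subset).__name__)

# A DataFrame can be reassembled from its column Series.
rebuilt = pd.DataFrame({name: df[name] for name in df.columns})
print(rebuilt.equals(df))
```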



<img src='df_series-compare.svg' />







In [6]:
import pandas as pd
import numpy as np

### pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Docstring:     
Two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). Arithmetic operations
align on both row and column labels. Can be thought of as a dict-like
container for Series objects. The primary pandas data structure.

Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects

    .. versionchanged :: 0.23.0
       If data is a dict, argument order is maintained for Python 3.6
       and later.

index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if
    no indexing information part of input data and no index provided
    
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    RangeIndex (0, 1, 2, ..., n) if no column labels are provided
    
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer
    
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input
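
The docstring's statement that arithmetic operations align on both row and column labels can be illustrated with a small sketch. The frames `left` and `right` are made up for this example; cells whose label pair does not exist in both frames come out as NaN.

```python
import pandas as pd

# Two hypothetical frames that overlap only at row 'b', column 'y'.
left = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["a", "b"])
right = pd.DataFrame({"y": [10, 20], "z": [30, 40]}, index=["b", "c"])

# Addition matches cells by label, not by position.
total = left + right
print(total)

# Only ('b', 'y') exists in both inputs: 4 + 10.
print(total.loc["b", "y"])
```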

Create an Empty DataFrame

The most basic DataFrame you can create is an empty one.

In [7]:
#Import the pandas library under the alias pd
import pandas as pd
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


Create DataFrame

A pandas DataFrame can be created from various inputs:
    • Lists
    • dict
    • Series
    • NumPy ndarrays
    • Another DataFrame
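
NumPy ndarrays and an existing DataFrame are both listed above as accepted inputs. A minimal sketch of each follows; the array values and the variable names `df_from_array` and `df_copy` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# From a 2D ndarray: each row of the array becomes a row of the frame.
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, columns=["a", "b"])

# From another DataFrame: copy=True requests an independent copy,
# so changing the copy leaves the original untouched.
df_copy = pd.DataFrame(df_from_array, copy=True)
df_copy.loc[0, "a"] = 100

print(df_from_array)
print(df_copy)
```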


In [10]:
#Create a DataFrame from Lists

#The DataFrame can be created using a single list or a list of lists.
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print (df)

   0
0  1
1  2
2  3
3  4
4  5


In [12]:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print (df)

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13


In [14]:
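import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
#The dtype parameter applies to every column, and the Name strings cannot be cast to float
#(recent pandas versions raise an error), so only the Age column is converted here.
df = df.astype({'Age': float})
print(df)
#Note: astype changes the type of the Age column to floating point.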

     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0


Create a DataFrame from Dict of ndarrays / Lists

All the ndarrays (or lists) must be of the same length. If an index is passed, its length must equal the length of the arrays.

If no index is passed, then by default, index will be range(n), where n is the array length.
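
The length rules above can be checked with a short sketch. The helper `try_build` is hypothetical and exists only for this illustration; it relies on pandas raising `ValueError` when the lengths disagree.

```python
import pandas as pd

def try_build(data, index=None):
    """Hypothetical helper: return the DataFrame, or None if pandas rejects the input."""
    try:
        return pd.DataFrame(data, index=index)
    except ValueError as exc:
        print("rejected:", exc)
        return None

good = {"Name": ["Tom", "Jack", "Steve"], "Age": [28, 34, 29]}
uneven = {"Name": ["Tom", "Jack", "Steve"], "Age": [28, 34]}

# Columns of unequal length are rejected.
print(try_build(uneven))

# An index whose length differs from the column length is rejected.
print(try_build(good, index=["rank1", "rank2"]))

# Matching lengths work.
print(try_build(good, index=["rank1", "rank2", "rank3"]))
```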

In [16]:
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print (df)
#Note: the values 0, 1, 2, 3 are the default index, generated as range(n).

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42


In [17]:
# Let us now create an indexed DataFrame using arrays.

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print (df)

        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42


Create a DataFrame from List of Dicts

A list of dictionaries can be passed as input data to create a DataFrame. By default, the dictionary keys are used as the column names.

In [19]:
#The following example shows how to create a DataFrame by passing a list of dictionaries.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print (df)
#Note: NaN (Not a Number) fills the positions where a dictionary has no value for a column.

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [20]:
#The following example shows how to create a DataFrame by passing a list of dictionaries and the row indices.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print (df)

        a   b     c
first   1   2   NaN
second  5  10  20.0


In [22]:
#The following example shows how to create a DataFrame with a list of dictionaries, row indices, and column indices.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

#With two column labels that match the dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

#With two column labels, one of which (b1) is not a dictionary key
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)

#Note: df2 is created with a column label (b1) that is not a dictionary key, so that column is filled with NaN. df1 uses column labels that match the dictionary keys, so no NaN appears.

        a   b
first   1   2
second  5  10
        a  b1
first   1 NaN
second  5 NaN


Create a DataFrame from Dict of Series

A dictionary of Series can be passed to form a DataFrame. The resulting index is the union of all the Series indexes passed.

In [23]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print (df)

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


Note: the series one has no label 'd', so in the result the value of column one at label d is NaN.
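
A side effect worth seeing in code is that a column containing NaN is stored as floating point, which is why column one prints as 1.0, 2.0, 3.0 above. The sketch below repeats the same input and uses `isna` and `fillna`, two standard pandas methods, to locate and replace the missing value. The fill value 0 is an arbitrary choice for illustration.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Column 'one' becomes float because NaN is a float value;
# column 'two' has no gaps and stays integer.
print(df.dtypes)

# isna marks the missing cell at label 'd'.
print(df['one'].isna())

# fillna returns a new frame with the gap replaced (0 is arbitrary here).
filled = df.fillna(0)
print(filled)
```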