
In [None]:
# The data for this section is sample retail sales data:
import pandas as pd
from io import StringIO

data = StringIO(
    '''UPC,Units,Sales,Date
    1234,5,20.2,1-1-2014
    1234,2,8.,1-2-2014
    1234,3,13.,1-3-2014
    789,1,2.,1-1-2014
    789,2,3.8,1-2-2014
    789,,,1-3-2014
    789,1,1.8,1-5-2014'''
)

sales = pd.read_csv(data)
sales

Unnamed: 0,UPC,Units,Sales,Date
0,1234,5.0,20.2,1-1-2014
1,1234,2.0,8.0,1-2-2014
2,1234,3.0,13.0,1-3-2014
3,789,1.0,2.0,1-1-2014
4,789,2.0,3.8,1-2-2014
5,789,,,1-3-2014
6,789,1.0,1.8,1-5-2014


In [None]:
# Data Frame Attributes
"""
Let's dig in a little more. We can examine the axes of a data frame by
looking at the .axes attribute:

"""
sales.axes

[RangeIndex(start=0, stop=7, step=1),
 Index(['UPC', 'Units', 'Sales', 'Date'], dtype='object')]

In [None]:
# The .axes attribute is a list that contains the index and the columns:
sales.index

RangeIndex(start=0, stop=7, step=1)

In [None]:
sales.columns

Index(['UPC', 'Units', 'Sales', 'Date'], dtype='object')

In [None]:
# The number of rows and columns is also available via the .shape attribute:
sales.shape

(7, 4)

In [None]:
# The .info method prints a summary of the index, the column dtypes, and
# the non-null counts:
sales.info()

In [None]:
# Iteration
"""
Data frames include a variety of methods to iterate over the values. By
default, iteration occurs over the column names:

"""
for column in sales:
  print(column)

UPC
Units
Sales
Date


In [None]:
'''
The .items method (named .iteritems in older pandas releases) returns
pairs of column names and the individual column (as a Series):

'''
for col, ser in sales.items():
  print(col, ser)

UPC 0    1234
1    1234
2    1234
3     789
4     789
5     789
6     789
Name: UPC, dtype: int64
Units 0    5.0
1    2.0
2    3.0
3    1.0
4    2.0
5    NaN
6    1.0
Name: Units, dtype: float64
Sales 0    20.2
1     8.0
2    13.0
3     2.0
4     3.8
5     NaN
6     1.8
Name: Sales, dtype: float64
Date 0    1-1-2014
1    1-2-2014
2    1-3-2014
3    1-1-2014
4    1-2-2014
5    1-3-2014
6    1-5-2014
Name: Date, dtype: object
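
A minimal sketch of putting column iteration to use: counting the missing values in each column. It uses `.items`, which yields the same (name, Series) pairs as `.iteritems` and is the spelling available in current pandas releases. The `missing` dict is a name chosen here for illustration.

```python
import pandas as pd
from io import StringIO

csv_text = """UPC,Units,Sales,Date
1234,5,20.2,1-1-2014
1234,2,8.,1-2-2014
1234,3,13.,1-3-2014
789,1,2.,1-1-2014
789,2,3.8,1-2-2014
789,,,1-3-2014
789,1,1.8,1-5-2014"""
sales = pd.read_csv(StringIO(csv_text))

# Each iteration yields the column name and the column as a Series
missing = {}
for col, ser in sales.items():
    missing[col] = int(ser.isna().sum())

print(missing)
```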



In [None]:
'''
The .iterrows method returns a tuple for every row. The tuple has two
items. The first is the index value. The second is the row converted into a
Series object. This might be a little tricky in practice because a row's
values might not be homogeneous, whereas that is usually the case in a
column of data. Notice that the dtype for the row series is object because
the row has strings and numeric values in it:


'''
for row in sales.iterrows():
  print(row)
  break # limit data

(0, UPC          1234
Units         5.0
Sales        20.2
Date     1-1-2014
Name: 0, dtype: object)


In [None]:
'''
The .itertuples method returns a namedtuple containing the index and
row values:

'''
for row in sales.itertuples():
  print(row)

Pandas(Index=0, UPC=1234, Units=5.0, Sales=20.2, Date='1-1-2014')
Pandas(Index=1, UPC=1234, Units=2.0, Sales=8.0, Date='1-2-2014')
Pandas(Index=2, UPC=1234, Units=3.0, Sales=13.0, Date='1-3-2014')
Pandas(Index=3, UPC=789, Units=1.0, Sales=2.0, Date='1-1-2014')
Pandas(Index=4, UPC=789, Units=2.0, Sales=3.8, Date='1-2-2014')
Pandas(Index=5, UPC=789, Units=nan, Sales=nan, Date='1-3-2014')
Pandas(Index=6, UPC=789, Units=1.0, Sales=1.8, Date='1-5-2014')
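
As a rough illustration of why the namedtuple rows are convenient, the sketch below computes a per-row unit price through attribute access. The `unit_price` dict and the decision to skip rows with missing units are assumptions made for this example, not part of the original data set.

```python
import pandas as pd
from io import StringIO

csv_text = """UPC,Units,Sales,Date
1234,5,20.2,1-1-2014
1234,2,8.,1-2-2014
1234,3,13.,1-3-2014
789,1,2.,1-1-2014
789,2,3.8,1-2-2014
789,,,1-3-2014
789,1,1.8,1-5-2014"""
sales = pd.read_csv(StringIO(csv_text))

unit_price = {}
for row in sales.itertuples():
    # Column values are attributes; the index label is row.Index
    if pd.notna(row.Units) and row.Units > 0:
        unit_price[row.Index] = row.Sales / row.Units

print(unit_price)
```

Row 5 has no units recorded, so it is left out of the result.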


In [None]:
'''
If you aren't familiar with namedtuples in Python, check them out
in the collections module. They give you all the benefits of a
tuple: immutability, low memory requirements, and index access. In
addition, a namedtuple allows you to access values by attribute:


'''
import collections

Sales = collections.namedtuple('Sales',
                               'upc,units,sales')
s = Sales(1234, 5., 20.2)
s[0] # index access
s.upc # attribute access


1234
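
To make the immutability point concrete, here is a small sketch using the same `Sales` namedtuple. The `_replace` and `_asdict` methods belong to the standard namedtuple API; the variable names `immutable` and `s2` are just for illustration.

```python
import collections

Sales = collections.namedtuple('Sales', 'upc,units,sales')
s = Sales(1234, 5., 20.2)

# Assigning to a field fails because namedtuples are immutable
try:
    s.units = 6.
    immutable = False
except AttributeError:
    immutable = True

# _replace builds a new tuple with the changed field instead
s2 = s._replace(units=6.)

print(immutable, s2, s._asdict())
```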

## Matrix Operations


In [None]:
'''
The data frame can be treated as a matrix. For example, it supports
transposition:


'''

sales.transpose()

Unnamed: 0,0,1,2,3,4,5,6
UPC,1234,1234,1234,789,789,789,789
Units,5.0,2.0,3.0,1.0,2.0,,1.0
Sales,20.2,8.0,13.0,2.0,3.8,,1.8
Date,1-1-2014,1-2-2014,1-3-2014,1-1-2014,1-2-2014,1-3-2014,1-5-2014
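
A short sketch of a further matrix-style operation, assuming only the numeric columns are of interest: `.T` is shorthand for `.transpose()`, and `.dot` performs matrix multiplication. Filling missing values with 0 is a choice made here so the arithmetic stays defined; it is not something the original data prescribes.

```python
import pandas as pd
from io import StringIO

csv_text = """UPC,Units,Sales,Date
1234,5,20.2,1-1-2014
1234,2,8.,1-2-2014
1234,3,13.,1-3-2014
789,1,2.,1-1-2014
789,2,3.8,1-2-2014
789,,,1-3-2014
789,1,1.8,1-5-2014"""
sales = pd.read_csv(StringIO(csv_text))

# .T is an alias for .transpose()
same = sales.T.equals(sales.transpose())

# Multiply the 2x7 transposed matrix by the 7x2 matrix -> 2x2 result
numeric = sales[['Units', 'Sales']].fillna(0)
gram = numeric.T.dot(numeric)

print(same)
print(gram)
```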


## Serialization

In [None]:
'''
Data frames can serialize to many forms. The most important functionality
is probably converting to and from a CSV file, as this format is the lingua
franca of data. We already saw that the pd.read_csv function will create a
DataFrame. Writing to CSV is just as easy with the .to_csv method:

'''
fout = StringIO()
sales.to_csv(fout, index_label='index')
print(fout.getvalue())

index,UPC,Units,Sales,Date
0,1234,5.0,20.2,1-1-2014
1,1234,2.0,8.0,1-2-2014
2,1234,3.0,13.0,1-3-2014
3,789,1.0,2.0,1-1-2014
4,789,2.0,3.8,1-2-2014
5,789,,,1-3-2014
6,789,1.0,1.8,1-5-2014
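
The round trip can be sketched as follows, assuming the CSV was written with `index_label='index'` as above. Passing `index_col='index'` to `pd.read_csv` tells it to use that column as the index rather than as data. The name `restored` is illustrative.

```python
import pandas as pd
from io import StringIO

csv_text = """UPC,Units,Sales,Date
1234,5,20.2,1-1-2014
1234,2,8.,1-2-2014
1234,3,13.,1-3-2014
789,1,2.,1-1-2014
789,2,3.8,1-2-2014
789,,,1-3-2014
789,1,1.8,1-5-2014"""
sales = pd.read_csv(StringIO(csv_text))

# Write to an in-memory buffer, labelling the index column
fout = StringIO()
sales.to_csv(fout, index_label='index')

# Read it back, restoring the labelled column as the index
restored = pd.read_csv(StringIO(fout.getvalue()), index_col='index')

print(restored.shape)
print(list(restored.columns))
```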



In [None]:
# By default, the .to_dict method maps each column name to a dict of
# index label to value:
sales.to_dict()

In [None]:
"""
An optional parameter orient can create a mapping of column name to
a list of values:


"""
sales.to_dict(orient='list')

{'UPC': [1234, 1234, 1234, 789, 789, 789, 789],
 'Units': [5.0, 2.0, 3.0, 1.0, 2.0, nan, 1.0],
 'Sales': [20.2, 8.0, 13.0, 2.0, 3.8, nan, 1.8],
 'Date': ['1-1-2014',
  '1-2-2014',
  '1-3-2014',
  '1-1-2014',
  '1-2-2014',
  '1-3-2014',
  '1-5-2014']}
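
Another commonly used value is `orient='records'`, which produces one dict per row. The sketch below shows it alongside a reconstruction with the `pd.DataFrame` constructor; the names `records` and `rebuilt` are illustrative.

```python
import pandas as pd
from io import StringIO

csv_text = """UPC,Units,Sales,Date
1234,5,20.2,1-1-2014
1234,2,8.,1-2-2014
1234,3,13.,1-3-2014
789,1,2.,1-1-2014
789,2,3.8,1-2-2014
789,,,1-3-2014
789,1,1.8,1-5-2014"""
sales = pd.read_csv(StringIO(csv_text))

# One dict per row, keyed by column name
records = sales.to_dict(orient='records')
print(records[0])

# A list of row dicts can be fed straight back to the constructor
rebuilt = pd.DataFrame(records)
print(rebuilt.shape)
```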

In [None]:
# Data frames can also be created from the serialized dict if needed:
pd.DataFrame.from_dict(sales.to_dict())

Unnamed: 0,UPC,Units,Sales,Date
0,1234,5.0,20.2,1-1-2014
1,1234,2.0,8.0,1-2-2014
2,1234,3.0,13.0,1-3-2014
3,789,1.0,2.0,1-1-2014
4,789,2.0,3.8,1-2-2014
5,789,,,1-3-2014
6,789,1.0,1.8,1-5-2014


In [None]:
# In addition, data frames can read and write Excel files. Use the
# .to_excel method to dump the data out. (Writing .xlsx files requires an
# Excel engine such as openpyxl to be installed.)
with pd.ExcelWriter('/tmp/output.xlsx') as writer:
  sales.to_excel(writer, sheet_name='sheet1')


In [None]:
# We can also read Excel data:
pd.read_excel('/content/Historicalinvesttemp.xlsx')

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,,,,
1,,,,
2,,,,
3,,,,
4,,Annual Returns on Investments in,,
...,...,...,...,...
85,2007,0.0549,0.0988,0.0466
86,2008,-0.37,0.2587,0.016
87,2009,0.2646,-0.149,0.001
88,,stocks,tbills,bonds


In [None]:
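# Index Operations
# The .reindex method conforms the data frame to a new index. Here we keep
# only the rows labeled 0 and 4:
sales.reindex([0,4])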

Unnamed: 0,UPC,Units,Sales,Date
0,1234,5.0,20.2,1-1-2014
4,789,2.0,3.8,1-2-2014
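
One behavior worth knowing, sketched below: `.reindex` does not raise when a requested label is absent. It adds a row of missing values instead. The label 10 is chosen here only because it is not in the index.

```python
import pandas as pd
from io import StringIO

csv_text = """UPC,Units,Sales,Date
1234,5,20.2,1-1-2014
1234,2,8.,1-2-2014
1234,3,13.,1-3-2014
789,1,2.,1-1-2014
789,2,3.8,1-2-2014
789,,,1-3-2014
789,1,1.8,1-5-2014"""
sales = pd.read_csv(StringIO(csv_text))

# Label 10 does not exist, so its row is filled with missing values
subset = sales.reindex([0, 4, 10])
print(subset)

# The integer UPC column is upcast to float to hold the missing value
print(subset['UPC'].dtype)
```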


In [None]:
# This method also supports column selection:
sales.reindex(columns=['Date', 'Sales'])

Unnamed: 0,Date,Sales
0,1-1-2014,20.2
1,1-2-2014,8.0
2,1-3-2014,13.0
3,1-1-2014,2.0
4,1-2-2014,3.8
5,1-3-2014,
6,1-5-2014,1.8


In [None]:
# Getting and Setting Values
'''
There are two methods to pull out a single "cell" in the data frame. One,
.iat, uses the position of the index and column (0-based). The other, .at,
uses the index and column labels. Here is .iat:
'''
sales.iat[4,2]

3.8
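
The label-based counterpart to `.iat` is `.at`. A minimal sketch comparing the two, and using `.at` to set a value, follows. Working on a copy (`patched`) is a choice made here so the original frame stays untouched.

```python
import pandas as pd
from io import StringIO

csv_text = """UPC,Units,Sales,Date
1234,5,20.2,1-1-2014
1234,2,8.,1-2-2014
1234,3,13.,1-3-2014
789,1,2.,1-1-2014
789,2,3.8,1-2-2014
789,,,1-3-2014
789,1,1.8,1-5-2014"""
sales = pd.read_csv(StringIO(csv_text))

by_position = sales.iat[4, 2]       # row position 4, column position 2
by_label = sales.at[4, 'Sales']     # index label 4, column label 'Sales'
print(by_position, by_label)

# Both accessors also support assignment
patched = sales.copy()
patched.at[5, 'Units'] = 0.0
print(patched.at[5, 'Units'])
```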

In [None]:
# The .replace method returns a new data frame with matching values swapped:
sales.replace(789, 790)

Unnamed: 0,UPC,Units,Sales,Date
0,1234,5.0,20.2,1-1-2014
1,1234,2.0,8.0,1-2-2014
2,1234,3.0,13.0,1-3-2014
3,790,1.0,2.0,1-1-2014
4,790,2.0,3.8,1-2-2014
5,790,,,1-3-2014
6,790,1.0,1.8,1-5-2014
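
Note that `.replace` returns a new data frame and, when called this way, looks for the value in every column. A sketch of restricting the replacement to one column with a nested dict follows; the names `replaced` and `scoped` are illustrative.

```python
import pandas as pd
from io import StringIO

csv_text = """UPC,Units,Sales,Date
1234,5,20.2,1-1-2014
1234,2,8.,1-2-2014
1234,3,13.,1-3-2014
789,1,2.,1-1-2014
789,2,3.8,1-2-2014
789,,,1-3-2014
789,1,1.8,1-5-2014"""
sales = pd.read_csv(StringIO(csv_text))

# Searches all columns; the original frame is left unchanged
replaced = sales.replace(789, 790)

# Nested dict form: {column: {old_value: new_value}}
scoped = sales.replace({'UPC': {789: 790}})

print(sales['UPC'].tolist())
print(scoped['UPC'].tolist())
```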


In [None]:
# To insert a column at a specified location, use the .insert method:
sales.insert(1, 'Category', 'Food')

In [None]:
sales

Unnamed: 0,UPC,Category,Units,Sales,Date
0,1234,Food,5.0,20.2,1-1-2014
1,1234,Food,2.0,8.0,1-2-2014
2,1234,Food,3.0,13.0,1-3-2014
3,789,Food,1.0,2.0,1-1-2014
4,789,Food,2.0,3.8,1-2-2014
5,789,Food,,,1-3-2014
6,789,Food,1.0,1.8,1-5-2014
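
Two details of `.insert` are easy to miss: it modifies the data frame in place (returning `None`), and by default it refuses to add a column name that already exists. A small sketch, with `duplicate_rejected` as an illustrative flag:

```python
import pandas as pd
from io import StringIO

csv_text = """UPC,Units,Sales,Date
1234,5,20.2,1-1-2014
1234,2,8.,1-2-2014
1234,3,13.,1-3-2014
789,1,2.,1-1-2014
789,2,3.8,1-2-2014
789,,,1-3-2014
789,1,1.8,1-5-2014"""
sales = pd.read_csv(StringIO(csv_text))

# Mutates sales; the scalar 'Food' is broadcast to every row
result = sales.insert(1, 'Category', 'Food')

# Inserting the same column name again raises ValueError
try:
    sales.insert(1, 'Category', 'Food')
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(result, list(sales.columns), duplicate_rejected)
```

Re-running the insert cell in a notebook hits this error, which is a common source of confusion.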


In [None]:
# Deleting Columns
'''
There are at least four ways to remove a column:

The .pop method
The .drop method with axis=1
The .reindex method
Indexing with a list of new columns

The .pop method takes the name of a column and removes it from the
data frame. It operates in-place. Rather than returning a data frame, it
returns the removed column. Below, the column subcat will be added and
then subsequently removed:

'''

sales['subcat'] = 'Dairy'
sales


Unnamed: 0,UPC,Category,Units,Sales,Date,subcat
0,1234,Food,5.0,20.2,1-1-2014,Dairy
1,1234,Food,2.0,8.0,1-2-2014,Dairy
2,1234,Food,3.0,13.0,1-3-2014,Dairy
3,789,Food,1.0,2.0,1-1-2014,Dairy
4,789,Food,2.0,3.8,1-2-2014,Dairy
5,789,Food,,,1-3-2014,Dairy
6,789,Food,1.0,1.8,1-5-2014,Dairy


In [None]:
sales.pop('subcat')

0    Dairy
1    Dairy
2    Dairy
3    Dairy
4    Dairy
5    Dairy
6    Dairy
Name: subcat, dtype: object

In [None]:
sales

Unnamed: 0,UPC,Category,Units,Sales,Date
0,1234,Food,5.0,20.2,1-1-2014
1,1234,Food,2.0,8.0,1-2-2014
2,1234,Food,3.0,13.0,1-3-2014
3,789,Food,1.0,2.0,1-1-2014
4,789,Food,2.0,3.8,1-2-2014
5,789,Food,,,1-3-2014
6,789,Food,1.0,1.8,1-5-2014


In [None]:
'''
To drop a column with the .drop method, simply pass it in (or a list of
column names) along with setting the axis parameter to 1:
'''
sales.drop(['Category', 'Units'], axis=1)

Unnamed: 0,UPC,Sales,Date
0,1234,20.2,1-1-2014
1,1234,8.0,1-2-2014
2,1234,13.0,1-3-2014
3,789,2.0,1-1-2014
4,789,3.8,1-2-2014
5,789,,1-3-2014
6,789,1.8,1-5-2014
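
Unlike `.pop`, `.drop` returns a new data frame and leaves the original alone. The sketch below also shows the `columns=` keyword, which is equivalent to passing the names with `axis=1`. The names `by_axis` and `by_keyword` are illustrative.

```python
import pandas as pd
from io import StringIO

csv_text = """UPC,Units,Sales,Date
1234,5,20.2,1-1-2014
1234,2,8.,1-2-2014
1234,3,13.,1-3-2014
789,1,2.,1-1-2014
789,2,3.8,1-2-2014
789,,,1-3-2014
789,1,1.8,1-5-2014"""
sales = pd.read_csv(StringIO(csv_text))
sales.insert(1, 'Category', 'Food')

# Two equivalent spellings for dropping columns
by_axis = sales.drop(['Category', 'Units'], axis=1)
by_keyword = sales.drop(columns=['Category', 'Units'])

print(by_axis.equals(by_keyword))
print(list(by_axis.columns))
print(list(sales.columns))  # the original still has all five columns
```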


In [None]:
'''
To use the final two methods of removing columns, simply create a list
of desired columns. Pass that list into the .reindex method or the indexing
operation:
'''
cols = ['Sales', 'Date']
sales.reindex(columns=cols)

Unnamed: 0,Sales,Date
0,20.2,1-1-2014
1,8.0,1-2-2014
2,13.0,1-3-2014
3,2.0,1-1-2014
4,3.8,1-2-2014
5,,1-3-2014
6,1.8,1-5-2014
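
The indexing operation, the fourth option, can be sketched as below. For columns that exist, it gives the same result as `.reindex`. The two differ when a name is missing: `.reindex` adds a column of missing values, while indexing raises a `KeyError`. The column name `'Missing'` is made up for this example.

```python
import pandas as pd
from io import StringIO

csv_text = """UPC,Units,Sales,Date
1234,5,20.2,1-1-2014
1234,2,8.,1-2-2014
1234,3,13.,1-3-2014
789,1,2.,1-1-2014
789,2,3.8,1-2-2014
789,,,1-3-2014
789,1,1.8,1-5-2014"""
sales = pd.read_csv(StringIO(csv_text))

cols = ['Sales', 'Date']
by_reindex = sales.reindex(columns=cols)
by_indexing = sales[cols]
print(by_reindex.equals(by_indexing))

# An unknown column name: reindex fills it in, indexing raises
filled = sales.reindex(columns=['Sales', 'Missing'])
try:
    sales[['Sales', 'Missing']]
    raised = False
except KeyError:
    raised = True

print(filled['Missing'].isna().all(), raised)
```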


## Slicing
The pandas library provides powerful methods for slicing a data frame.
The .head and .tail methods allow for pulling data off the front and end
of a data frame. They come in handy when using an interpreter in
combination with pandas. By default, they display only the top five or
bottom five rows:


In [None]:
sales.head()

Unnamed: 0,UPC,Category,Units,Sales,Date
0,1234,Food,5.0,20.2,1-1-2014
1,1234,Food,2.0,8.0,1-2-2014
2,1234,Food,3.0,13.0,1-3-2014
3,789,Food,1.0,2.0,1-1-2014
4,789,Food,2.0,3.8,1-2-2014


In [None]:
sales.tail()

Unnamed: 0,UPC,Category,Units,Sales,Date
2,1234,Food,3.0,13.0,1-3-2014
3,789,Food,1.0,2.0,1-1-2014
4,789,Food,2.0,3.8,1-2-2014
5,789,Food,,,1-3-2014
6,789,Food,1.0,1.8,1-5-2014


In [None]:
# Simply pass in an integer to override the number of rows to show:
sales.tail(2)

Unnamed: 0,UPC,Category,Units,Sales,Date
5,789,Food,,,1-3-2014
6,789,Food,1.0,1.8,1-5-2014


In [None]:
'''
Data frames also support slicing based on index position and label. Let's
use a string-based index so it will be clearer what the slicing options do:
'''
sales['new_index'] = list('ancdefg')
df = sales.set_index('new_index')
del sales['new_index']

'''
To slice by position, use the .iloc attribute. Here we take rows in
positions two up to but not including four:
'''
df.iloc[2:4]

Unnamed: 0_level_0,UPC,Category,Units,Sales,Date
new_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
c,1234,Food,3.0,13.0,1-3-2014
d,789,Food,1.0,2.0,1-1-2014


In [None]:
'''
We can also provide the column positions that we want to keep. The
column positions need to follow a comma in the index operation. Here we
keep rows from two up to but not including row four. We also take
columns from zero up to but not including one (just the column in the zero
index position):
'''
df.iloc[2:4, 0:1]

Unnamed: 0_level_0,UPC
new_index,Unnamed: 1_level_1
c,1234
d,789


In [None]:
'''
There is also support for slicing out data by labels. Using the .loc
attribute, we can take index values a through d:
'''
df.loc['a':'d']

Unnamed: 0_level_0,UPC,Category,Units,Sales,Date
new_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
a,1234,Food,5.0,20.2,1-1-2014
n,1234,Food,2.0,8.0,1-2-2014
c,1234,Food,3.0,13.0,1-3-2014
d,789,Food,1.0,2.0,1-1-2014
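
The key difference between the two slicing styles is the end point: `.iloc` follows Python's half-open convention, while `.loc` includes the final label. A minimal sketch using the same string index as above (the `Category` column is omitted to keep it short, and the names `by_position`, `by_label`, and `block` are illustrative):

```python
import pandas as pd
from io import StringIO

csv_text = """UPC,Units,Sales,Date
1234,5,20.2,1-1-2014
1234,2,8.,1-2-2014
1234,3,13.,1-3-2014
789,1,2.,1-1-2014
789,2,3.8,1-2-2014
789,,,1-3-2014
789,1,1.8,1-5-2014"""
sales = pd.read_csv(StringIO(csv_text))
sales['new_index'] = list('ancdefg')
df = sales.set_index('new_index')

# Positions 0, 1, 2 (the stop position 3 is excluded)
by_position = df.iloc[0:3]

# Labels 'a' through 'c' (the stop label 'c' is included)
by_label = df.loc['a':'c']
print(by_position.equals(by_label))

# Label slices work on columns too, again inclusive of the end
block = df.loc['a':'c', 'UPC':'Sales']
print(list(block.columns))
```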
