# Data wrangling & manipulation using Numpy and Pandas in Python -Part 2 (Pandas)

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.

**pandas is well suited for many different kinds of data:**

* Tabular data with heterogeneously-typed columns.
* Ordered and unordered time series data.
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
* Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series **(1-dimensional)** and DataFrame **(2-dimensional)**, handle the vast majority of typical use cases. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment.

**Here are just a few of the things that pandas does well:**
   * Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
   * Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects.
   * Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
   * Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects.
   * Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
   * Intuitive merging and joining data sets
   * Flexible reshaping and pivoting of data sets
   * Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format.

Note that pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code.


**Series is a one-dimensional labeled array capable of holding data of any type.**

**A pandas Series can be created using the following constructor −**

_pandas.Series( data, index, dtype, copy)_

* data: data takes various forms like ndarray, list, constants
* index: Index values must be hashable and the same length as data; they do not have to be unique. Default is np.arange(n) if no index is passed.
* dtype: dtype is for data type. If None, data type will be inferred
* copy: Copy data. Default False
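
As a quick illustration of these parameters, here is a minimal sketch that passes data, index and dtype together. The labels and values are made up for illustration only.

```python
import pandas as pd

# data: a plain list; index: one label per value; dtype: force floating point
marks = pd.Series([90, 85, 77], index=["x", "y", "z"], dtype="float64")

print(marks)
print(marks.index.tolist())  # the labels we passed
print(marks.dtype)           # float64, because dtype was given explicitly
```

Without the dtype argument, pandas would infer an integer type from the list.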


### 1) Series

In [7]:
import pandas as pd # These 2 packages are required to use pandas
import numpy as np # Note that pandas is built on the top of numpy

### 1.1) Create an empty Series

In [8]:
a=pd.Series() # Note: in pandas, the "S" in Series must be a capital letter
a


Series([], dtype: float64)

### 1.2) Create a Series from ndarray

In [9]:
a=np.array(["a","b","c","d"])
b=pd.Series(a) 
b # We did not pass any index, so by default, it assigned the indexes ranging from 0 to len(data)-1, i.e., 0 to 3.

0    a
1    b
2    c
3    d
dtype: object

In [10]:
# How to give index?
a=np.array(["a","b","c","d"])
b=pd.Series(a,index=[100,101,102,103]) # here we are giving our own index.
b

100    a
101    b
102    c
103    d
dtype: object

### 1.3) Create a Series from dict

A dict can be passed as input. If no index is specified, the dictionary keys are used to construct the index (in insertion order in current pandas versions; older versions sorted them). If an index is passed, the values in data corresponding to the labels in the index will be pulled out.

In [11]:
a= {1:"a",2:"b",3:"c"} # We have created a dictionary
b=pd.Series(a)
b

1    a
2    b
3    c
dtype: object

**If needed, we can change the index**

In [12]:
a= {1:"a",2:"b",3:"c"} # We have created a dictionary
b=pd.Series(a,index=[1,3,2]) # We can give the index explicitly
b

1    a
3    c
2    b
dtype: object
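
The "pulled out" behaviour also covers labels that are not keys of the dict. The sketch below reuses the same dictionary; the index label 4 is a made-up label that has no matching key.

```python
import pandas as pd

data = {1: "a", 2: "b", 3: "c"}

# Label 4 is not a key in the dict, so its value is missing (NaN);
# key 1 is not in the index, so it is left out of the result.
s = pd.Series(data, index=[3, 2, 4])
print(s)
```

So the index decides which labels appear and in what order, and the dict only supplies the values.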

### 1.4) Create a Series from Scalar

In [13]:
#If data is a scalar value, an index must be provided. The value will be repeated to match the length of index
a=pd.Series(5,index=[10,11,12])
a

10    5
11    5
12    5
dtype: int64

### 1.5) Accessing Data from Series with Position
Data in the series can be accessed similar to that in an ndarray.

In [14]:
a=pd.Series([11,12,22,32,26])
print(a)

0    11
1    12
2    22
3    32
4    26
dtype: int64


In [15]:
print(a[0]) #retrieve the first element

11


In [16]:
print(a[:2]) # retrieve the first 2 elements

0    11
1    12
dtype: int64


In [17]:
print(a[2:4]) # retrieve the 3rd and 4th elements

2    22
3    32
dtype: int64


In [18]:
print(a[-3:-1]) # retrieve the 3rd and 4th elements. Note: negative positions count from the end

2    22
3    32
dtype: int64


In [19]:
print(a[-3:])  # retrieve the last 3 elements

2    22
3    32
4    26
dtype: int64


### 1.6) Retrieve Data Using Label (Index)
A Series is like a fixed-size dict in that you can get and set values by index label.

In [20]:
a=pd.Series([22,33,44,54,34],index=['aa','bb','cc','dd','ee'])
a

print(a['aa']) # Retrieve the element at the index 'aa'


22


In [21]:
print(a[['aa','bb']]) # Retrieve the elements at index 'aa' and 'bb'
# print(a['ff']) # If a label is not contained, an exception is raised.

aa    22
bb    33
dtype: int64
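
Setting a value by label and handling a missing label can be sketched as follows. The replacement value 100 and the default -1 are arbitrary choices for illustration.

```python
import pandas as pd

a = pd.Series([22, 33, 44, 54, 34], index=["aa", "bb", "cc", "dd", "ee"])

a["aa"] = 100          # set a value by label
print(a["aa"])

# A missing label raises KeyError when accessed with [] ...
try:
    a["ff"]
except KeyError:
    print("'ff' is not in the index")

# ... while get() returns a default value instead
missing = a.get("ff", -1)
print(missing)
```

get() is convenient when a missing label is an expected case rather than an error.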


### 2) DataFrame

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

**Features of DataFrame**
* Potentially columns are of different types
* Size – Mutable (Can be changed)
* Labeled axes (rows and columns)
* Can Perform Arithmetic operations on rows and columns

**A pandas DataFrame can be created using the following constructor −**

_pandas.DataFrame( data, index, columns, dtype, copy)_

* data: data takes various forms like ndarray, Series, map, lists, dict, constants and also another DataFrame.
* index: Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
* columns: Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
* dtype: Data type to force on the columns. If None, it is inferred.
* copy: Whether to copy the input data. Default False.

#### A pandas DataFrame can be created using various inputs like −

* Lists
* dict
* Series
* Numpy ndarrays
* Another DataFrame

In [22]:
import pandas as pd # These 2 packages are required to use pandas
import numpy as np # Note that pandas is built on the top of numpy

### 2.1) Create an empty DataFrame

In [23]:
# Create an empty dataframe
df=pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


### 2.2) Create a dataframe from lists

In [24]:
df=pd.DataFrame([["CSE501","A. RAMU",21],["CSE502","B. RAJESH",22],["CSE503","B.SUBBU",23]],columns=["Roll No","Name","stu_Age"])
df

Unnamed: 0,Roll No,Name,stu_Age
0,CSE501,A. RAMU,21
1,CSE502,B. RAJESH,22
2,CSE503,B.SUBBU,23


In [25]:
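# We can specify the datatype too. Older pandas versions silently leave columns that cannot be cast (here the strings) as object; newer versions may raise an error instead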
df=pd.DataFrame([["CSE501","A. RAMU",21],
                 ["CSE502","B. RAJESH",22],
                 ["CSE503","B.SUBBU"]],
                columns=["Roll No","Name","Age"],dtype=float)
df

Unnamed: 0,Roll No,Name,Age
0,CSE501,A. RAMU,21.0
1,CSE502,B. RAJESH,22.0
2,CSE503,B.SUBBU,


### 2.3) Create a DataFrame from Dictionaries 

In [26]:
df=pd.DataFrame({"Roll No":["CSE501","CSE502","CSE503"],"Name":["A.RAMU","B.RAJESH","B.SUBBU"]})
df

Unnamed: 0,Roll No,Name
0,CSE501,A.RAMU
1,CSE502,B.RAJESH
2,CSE503,B.SUBBU


#### Observe the values 0,1,2. They are the default index assigned to the rows, equivalent to range(n).

In [27]:
df=pd.DataFrame({"Roll No":["CSE501","CSE502","CSE503"],"Name":["A.RAMU","B.RAJESH","B.SUBBU"]},index=["1","2","3"])
df

Unnamed: 0,Roll No,Name
1,CSE501,A.RAMU
2,CSE502,B.RAJESH
3,CSE503,B.SUBBU


### 2.4) Create a DataFrame from Dict of Series
Dictionary of Series can be passed to form a DataFrame. The resultant index is the union of all the series indexes passed.

In [28]:
df=pd.DataFrame({"roll No": pd.Series(["CSE501","CSE502","CSE503"]),"Name":pd.Series(["A.RAMU","B.RAJESH","B.SUBBU"])})
df

Unnamed: 0,roll No,Name
0,CSE501,A.RAMU
1,CSE502,B.RAJESH
2,CSE503,B.SUBBU
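
The union behaviour is easier to see when the Series do not share the same index. The following sketch uses made-up, partly overlapping indexes for illustration.

```python
import pandas as pd

rolls = pd.Series(["CSE501", "CSE502", "CSE503"], index=[0, 1, 2])
names = pd.Series(["B.RAJESH", "B.SUBBU", "R.SURESH"], index=[1, 2, 3])

df = pd.DataFrame({"roll No": rolls, "Name": names})
print(df)  # the index is 0..3; cells with no matching label are NaN
```

Row 0 has no Name and row 3 has no roll number, because each Series only covers part of the combined index.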


### 2.5) Selecting a column from a dataframe

In [29]:
# Create a dataframe
lokesh=pd.DataFrame({"roll no":["cse501","cse502","cse503"],"Name":["A.RAMU","B.RAJESH","B.SUBBU"]})
lokesh

Unnamed: 0,roll no,Name
0,cse501,A.RAMU
1,cse502,B.RAJESH
2,cse503,B.SUBBU


In [30]:
lokesh["roll no"]

0    cse501
1    cse502
2    cse503
Name: roll no, dtype: object

### 2.6) Adding a column to a dataframe

In [31]:
# Create a dataframe
df=pd.DataFrame({"roll no":["cse501","cse502","cse503"],"Name":["A.RAMU","B.RAJESH","B.SUBBU"]})
df

Unnamed: 0,roll no,Name
0,cse501,A.RAMU
1,cse502,B.RAJESH
2,cse503,B.SUBBU


In [32]:
df["Marks"]=pd.Series([100,99,95])
df

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100
1,cse502,B.RAJESH,99
2,cse503,B.SUBBU,95


### 2.7) Deleting columns from a dataframe

In [33]:
df=pd.DataFrame({"roll no":["cse501","cse502","cse503"],"Name":["A.RAMU","B.RAJESH","B.SUBBU"],"Marks":[100,99,95]})
df

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100
1,cse502,B.RAJESH,99
2,cse503,B.SUBBU,95


In [34]:
del df["Name"]
df

Unnamed: 0,roll no,Marks
0,cse501,100
1,cse502,99
2,cse503,95


### 2.8) Row selection by integer position (iloc)

In [35]:
# Create a dataframe
df=pd.DataFrame({"roll no":["cse501","cse502","cse503"],"Name":["A.RAMU","B.RAJESH","B.SUBBU"],"Marks":[100,99,95]})
df

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100
1,cse502,B.RAJESH,99
2,cse503,B.SUBBU,95


In [36]:
df.iloc[1] # 2nd row is selected

roll no      cse502
Name       B.RAJESH
Marks            99
Name: 1, dtype: object

In [37]:
# We can even select multiple rows.
df.iloc[0:2] #Select 1st and 2nd row

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100
1,cse502,B.RAJESH,99


### 2.9) Addition of rows to a dataframe

In [38]:
# Create a dataframe
df=pd.DataFrame({"roll no":["cse501","cse502","cse503"],
                 "Name":["A.RAMU","B.RAMYA","B.SUBBU"],
                 "Marks":[100,99,95]},index=[1,2,3])
df

Unnamed: 0,roll no,Name,Marks
1,cse501,A.RAMU,100
2,cse502,B.RAMYA,99
3,cse503,B.SUBBU,95


In [39]:
df1=pd.DataFrame({"roll no":["cse504"],"Name":["R.SURESH"],"Marks":[89]},index=[4])
df1

Unnamed: 0,roll no,Name,Marks
4,cse504,R.SURESH,89


In [40]:
df=pd.concat([df,df1]) # DataFrame.append() was removed in pandas 2.0; pd.concat() gives the same result
df

Unnamed: 0,roll no,Name,Marks
1,cse501,A.RAMU,100
2,cse502,B.RAMYA,99
3,cse503,B.SUBBU,95
4,cse504,R.SURESH,89


### 2.10) Deletion of rows from a dataframe

In [41]:
#  Create a dataframe
df=pd.DataFrame({"roll no":["cse501","cse502","cse503"],"Name":["A.RAMU","B.RAMYA","B.SUBBU"],"Marks":[100,99,95]})
df

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100
1,cse502,B.RAMYA,99
2,cse503,B.SUBBU,95


In [42]:
df.drop(0) # Returns a new DataFrame without the row labeled 0; df itself is not modified

Unnamed: 0,roll no,Name,Marks
1,cse502,B.RAMYA,99
2,cse503,B.SUBBU,95


### 2.11) T (Transpose)
Returns the transpose of the DataFrame. The rows and columns will interchange

In [43]:
#  Create a dataframe
df=pd.DataFrame({"roll no":["cse501","cse502","cse503"],"Name":["A.RAMU","B.RAMYA","B.SUBBU"],"Marks":[100,99,95]})
df

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100
1,cse502,B.RAMYA,99
2,cse503,B.SUBBU,95


In [44]:
df.T

Unnamed: 0,0,1,2
roll no,cse501,cse502,cse503
Name,A.RAMU,B.RAMYA,B.SUBBU
Marks,100,99,95


### 2.12) axes
Returns the list of row axis labels and column axis labels.

In [45]:
df # Consider a dataframe

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100
1,cse502,B.RAMYA,99
2,cse503,B.SUBBU,95


In [46]:
df.axes

[RangeIndex(start=0, stop=3, step=1),
 Index(['roll no', 'Name', 'Marks'], dtype='object')]

### 2.13) dtypes
Returns the data type of each column.

In [47]:
df

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100
1,cse502,B.RAMYA,99
2,cse503,B.SUBBU,95


In [48]:
type(df) # The datatype 

pandas.core.frame.DataFrame

In [49]:
df.dtypes # Data types of columns

roll no    object
Name       object
Marks       int64
dtype: object

### 2.13.1) astype()

The astype() function is used to change the data types of columns.

In [50]:
# Let's change the datatype of "Marks" to float.
df[["Marks"]]=df[["Marks"]].astype("float")
df

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100.0
1,cse502,B.RAMYA,99.0
2,cse503,B.SUBBU,95.0


In [51]:
df.dtypes # 

roll no     object
Name        object
Marks      float64
dtype: object

### 2.14) ndim
Returns the number of dimensions of the object. By definition, DataFrame is a 2D object.

In [52]:
import pandas as pd
import numpy as np
df # Consider a dataframe

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100.0
1,cse502,B.RAMYA,99.0
2,cse503,B.SUBBU,95.0


In [53]:
df.ndim

2

### 2.15) shape
Returns a tuple representing the dimensionality of the DataFrame. Tuple (a,b), where a represents the number of rows and b represents the number of columns.

In [54]:
df # Consider a dataframe

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100.0
1,cse502,B.RAMYA,99.0
2,cse503,B.SUBBU,95.0


In [55]:
df.shape # 3rows and 3 columns

(3, 3)

### 2.16) size
Returns the number of elements in the DataFrame.

In [56]:
df # Consider a dataframe

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100.0
1,cse502,B.RAMYA,99.0
2,cse503,B.SUBBU,95.0


In [57]:
df.size

9

### 2.17) values
Returns the actual data in the DataFrame as a NumPy ndarray.

In [58]:
df # Consider a dataframe

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100.0
1,cse502,B.RAMYA,99.0
2,cse503,B.SUBBU,95.0


In [60]:
df.values

array([['cse501', 'A.RAMU', 100.0],
       ['cse502', 'B.RAMYA', 99.0],
       ['cse503', 'B.SUBBU', 95.0]], dtype=object)

### 2.18) Head & Tail
To view a small sample of a DataFrame object, use the head() and tail() methods. head() returns the first n rows and tail() returns the last n rows (observe the index values). The default number of rows to display is five, but you may pass a custom number.

In [63]:
df

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100.0
1,cse502,B.RAMYA,99.0
2,cse503,B.SUBBU,95.0


In [64]:
df.head(2)

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100.0
1,cse502,B.RAMYA,99.0


In [93]:
df.tail(2)

Unnamed: 0,roll no,Name,Marks
1,cse502,B.RAMYA,99.0
2,cse503,B.SUBBU,95.0


### 3) Descriptive statistics in pandas
A large number of methods collectively compute descriptive statistics and other related operations on DataFrame. Most of these are aggregations like sum() and mean().

In [58]:
#  Create a dataframe
import pandas as pd
import numpy as np

df=pd.DataFrame({"roll no":["cse501","cse502","cse503"],"Name":["A.RAMU","B.RAMYA","B.SUBBU"],"Marks":[100,99,95]})
df

Unnamed: 0,roll no,Name,Marks
0,cse501,A.RAMU,100
1,cse502,B.RAMYA,99
2,cse503,B.SUBBU,95


### Let's use some descriptive statistical functions on above dataframe

In [95]:
df.sum() # Calculate the sum column wise

roll no      cse501cse502cse503
Name       A.RAMUB.RAMYAB.SUBBU
Marks                       294
dtype: object

In [96]:
df["Marks"].sum() # Calculate sum of a particular column

294

In [97]:
df.count() # Count the number of entries in each column

roll no    3
Name       3
Marks      3
dtype: int64

In [98]:
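df.mean(numeric_only=True) # Calculate the mean of the numeric columns. numeric_only=True skips the string columns, which would raise an error in pandas 2.0 and later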

Marks    98.0
dtype: float64

In [99]:
df.median(numeric_only=True) # Calculate the median of the numeric columns

Marks    99.0
dtype: float64

In [100]:
df.std(numeric_only=True) # Calculate the standard deviation of the numeric columns

Marks    2.645751
dtype: float64

In [101]:
df["Marks"].min() # Calculate the minimum element in a column

95

In [102]:
df["Marks"].max() # Calculate the maximum element in a column

100

In [103]:
df["Marks"].cumsum() # Calculate the cumulative sum element in a column

0    100
1    199
2    294
Name: Marks, dtype: int64

In [104]:
df["Marks"].cumprod() # Calculate the cumulative product element in a column

0       100
1      9900
2    940500
Name: Marks, dtype: int64

**Note − Since a DataFrame is a heterogeneous data structure, generic operations don’t work with all functions.**

* Functions like sum() and cumsum() work with both numeric and string data without raising an error. In practice, string aggregations are rarely useful, but these functions do not throw an exception.
* Functions like abs() and cumprod() throw an exception when the DataFrame contains string data, because such operations cannot be performed on strings.
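
One way to avoid such exceptions is to keep only the numeric columns before calling the function. This is a minimal sketch using select_dtypes() on the same student data; it is one possible approach, not the only one.

```python
import pandas as pd

df = pd.DataFrame({"roll no": ["cse501", "cse502", "cse503"],
                   "Name": ["A.RAMU", "B.RAMYA", "B.SUBBU"],
                   "Marks": [100, 99, 95]})

# Keep only the numeric columns, then apply the numeric-only function
numeric = df.select_dtypes(include="number")
print(numeric.columns.tolist())
print(numeric.cumprod())
```

Here only the Marks column survives the selection, so cumprod() runs without touching the string columns.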

In [105]:
df.describe() 

# describe() computes summary statistics for the DataFrame columns. Note that only numeric columns are considered by default

Unnamed: 0,Marks
count,3.0
mean,98.0
std,2.645751
min,95.0
25%,97.0
50%,99.0
75%,99.5
max,100.0


In [106]:
df.describe(include="all")

Unnamed: 0,roll no,Name,Marks
count,3,3,3.0
unique,3,3,
top,cse501,B.SUBBU,
freq,1,1,
mean,,,98.0
std,,,2.645751
min,,,95.0
25%,,,97.0
50%,,,99.0
75%,,,99.5
max,,,100.0


### 4) Python Pandas - Function Application

To apply our own or another library’s functions to Pandas objects, we use **apply()** function. 

In [65]:
# Create a user defined function
def math_apparao(x):
    return(x**3)

# Create a dataframe
df= pd.DataFrame({"age":[19,20,21,21,23,25],"Height":[165,163,170,176,175,150],"weight":[70,65,66,59,65,70]})
df

Unnamed: 0,age,Height,weight
0,19,165,70
1,20,163,65
2,21,170,66
3,21,176,59
4,23,175,65
5,25,150,70


In [66]:
# Here we have applied the user defined function "math_apparao" to each column
df.apply(math_apparao) 

Unnamed: 0,age,Height,weight
0,6859,4492125,343000
1,8000,4330747,274625
2,9261,4913000,287496
3,9261,5451776,205379
4,12167,5359375,274625
5,15625,3375000,343000


In [67]:
# axis=0 applies the function to each column and axis=1 applies it to each row. Default is axis=0
df.apply(np.sum,axis=0) # Here we are applying the "numpy.sum" function on the dataframe

age       129
Height    999
weight    395
dtype: int64

In [110]:
df.apply(np.sum,axis=1)

0    254
1    248
2    257
3    256
4    263
5    245
dtype: int64
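
apply() also accepts small inline functions. The sketch below, on the same data, computes a per-column range with the default axis and a per-row ratio with axis=1. The two lambda functions are illustrative choices, not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({"age": [19, 20, 21, 21, 23, 25],
                   "Height": [165, 163, 170, 176, 175, 150],
                   "weight": [70, 65, 66, 59, 65, 70]})

# axis=0 (default): the function receives each column as a Series
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
ratio = df.apply(lambda row: row["Height"] / row["weight"], axis=1)

print(col_range)
print(ratio.round(2))
```

The first result has one value per column and the second has one value per row.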

### 5) Reindexing
Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.

Multiple operations can be accomplished through reindexing like −
* Reorder the existing data to match a new set of labels.
* Insert missing value (NA) markers in label locations where no data for the label existed.

In [111]:
# Create a dataframe.

df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3']) # Using np.random.randn(10,3), we create 10*3 array of random numbers
df1

Unnamed: 0,col1,col2,col3
0,-1.832213,1.090075,-0.134222
1,0.331146,-0.005331,0.663067
2,-1.414472,-1.890105,-0.002579
3,0.445025,-0.4413,-0.031455
4,-0.171267,0.938069,1.454594
5,0.19165,-1.262104,-0.912774
6,0.932771,-2.253757,-0.911735
7,0.662105,0.00402,-0.312201
8,1.716346,-0.569105,-0.914118
9,0.148519,0.599218,0.54034


In [63]:
# Create another dataframe
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3']) # Here we create a array of 7*3 random numbers
df2

Unnamed: 0,col1,col2,col3
0,-0.769066,-0.497626,-0.648953
1,0.428485,-0.304639,0.946721
2,-2.057784,-0.979807,-0.544728
3,0.850304,-2.536105,-0.778439
4,0.570218,0.480372,-0.25594
5,1.504524,0.429776,-0.6959
6,-1.84132,0.988378,-0.117471


In [113]:
df2 = df1.reindex_like(df2) # Now we are reindexing df1. It should look like df2
print (df2) # First 7 rows of df1 are printed since df2 contains only 7 rows

       col1      col2      col3
0 -1.832213  1.090075 -0.134222
1  0.331146 -0.005331  0.663067
2 -1.414472 -1.890105 -0.002579
3  0.445025 -0.441300 -0.031455
4 -0.171267  0.938069  1.454594
5  0.191650 -1.262104 -0.912774
6  0.932771 -2.253757 -0.911735
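
Both operations listed above (reordering and inserting NA markers) can be seen with reindex() directly. This sketch uses a small made-up frame; the row label 5 and the column label "col3" deliberately do not exist in it.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [4.0, 5.0, 6.0]},
                  index=[0, 1, 2])

# Reorder the rows and columns, and ask for labels that do not exist
out = df.reindex(index=[2, 0, 5], columns=["col2", "col1", "col3"])
print(out)
```

Existing labels keep their data in the new order, while the new row 5 and the new column col3 are filled with NaN.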


### 6) Renaming
The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

In [114]:
# Create a dataframe
df= pd.DataFrame({"age":[19,20,21,21,23,25],"Height":[165,163,170,176,175,150],"weight":[70,65,66,59,65,70]})
df

Unnamed: 0,age,Height,weight
0,19,165,70
1,20,163,65
2,21,170,66
3,21,176,59
4,23,175,65
5,25,150,70


In [115]:
# We can change the name of columns
df=df.rename(columns={"age":"Student Age","Height":"Student Height","weight":"Student Weight"})
df

# We can change the row indexes in a similar way

Unnamed: 0,Student Age,Student Height,Student Weight
0,19,165,70
1,20,163,65
2,21,170,66
3,21,176,59
4,23,175,65
5,25,150,70
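
Renaming row labels works with the index argument of rename(). The following sketch shows both a dict mapping and a function; the new label names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"age": [19, 20, 21]})

# A dict maps specific labels; labels not in the dict are left unchanged
by_dict = df.rename(index={0: "first", 2: "third"})

# A function is applied to every label
by_func = df.rename(index=lambda i: f"row_{i}")

print(by_dict.index.tolist())
print(by_func.index.tolist())
```

rename() returns a new DataFrame in both cases, so df keeps its original labels.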


### 7) Iteration

The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. Other data structures, like DataFrame follow the dict-like convention of iterating over the keys of the objects.

In short, basic iteration (for i in object) produces −
* Series − values
* DataFrame − column labels

### 7.1) Iterating a DataFrame
Iterating a DataFrame gives column names.

In [116]:
#Consider a dataframe
df= pd.DataFrame({"age":[19,20,21,21,23,25],"Height":[165,163,170,176,175,150],"weight":[70,65,66,59,65,70]})
df

Unnamed: 0,age,Height,weight
0,19,165,70
1,20,163,65
2,21,170,66
3,21,176,59
4,23,175,65
5,25,150,70


In [117]:
for i in df:
    print(i)

age
Height
weight


### 7.2) To iterate over the columns or rows of a DataFrame, we can use the following functions −
* **items()** − to iterate over the (key,value) pairs, one per column (this method was called iteritems() before pandas 2.0)
* **iterrows()** − iterate over the rows as (index,series) pairs
* **itertuples()** − iterate over the rows as namedtuples

### 7.2.1 ) items()
Iterates over each column as a (key, value) pair, with the column label as key and the column values as a Series object.

In [118]:
#Consider a dataframe
df= pd.DataFrame({"age":[19,20,21,21,23,25],"Height":[165,163,170,176,175,150],"weight":[70,65,66,59,65,70]})
df

Unnamed: 0,age,Height,weight
0,19,165,70
1,20,163,65
2,21,170,66
3,21,176,59
4,23,175,65
5,25,150,70


In [119]:
for i,j in df[["age"]].items(): # items() replaces iteritems(), which was removed in pandas 2.0
    print(j*2)



0    38
1    40
2    42
3    42
4    46
5    50
Name: age, dtype: int64


**Observe that each column is iterated separately as a (label, Series) pair.**

### 7.2.2 ) iterrows()
iterrows() returns the iterator yielding each index value along with a series containing the data in each row.

In [120]:
#Consider a dataframe
df= pd.DataFrame({"age":[19,20,21,21,23,25],"Height":[165,163,170,176,175,150],"weight":[70,65,66,59,65,70]})
df

Unnamed: 0,age,Height,weight
0,19,165,70
1,20,163,65
2,21,170,66
3,21,176,59
4,23,175,65
5,25,150,70


In [121]:
for i,j in df[["age"]].iterrows():
    print("\n") # For proper spacing
    #print(i)
    print(j*2)



age    38
Name: 0, dtype: int64


age    40
Name: 1, dtype: int64


age    42
Name: 2, dtype: int64


age    42
Name: 3, dtype: int64


age    46
Name: 4, dtype: int64


age    50
Name: 5, dtype: int64


### 7.2.3) itertuples()
itertuples() method will return an iterator yielding a named tuple for each row in the DataFrame. The first element of the tuple will be the row’s corresponding index value, while the remaining values are the row values.

In [122]:
#Consider a dataframe
df= pd.DataFrame({"age":[19,20,21,21,23,25],"Height":[165,163,170,176,175,150],"weight":[70,65,66,59,65,70]})
df

Unnamed: 0,age,Height,weight
0,19,165,70
1,20,163,65
2,21,170,66
3,21,176,59
4,23,175,65
5,25,150,70


In [123]:
for i in df[["age"]].itertuples():
    print(i)

Pandas(Index=0, age=19)
Pandas(Index=1, age=20)
Pandas(Index=2, age=21)
Pandas(Index=3, age=21)
Pandas(Index=4, age=23)
Pandas(Index=5, age=25)


**Note − Do not try to modify any object while iterating. Iterating is meant for reading; depending on the data types, the iterator returns a copy of the data rather than a view, so changes made to it will not be reflected in the original object.**
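
A small sketch of this pitfall, assuming a frame with mixed column types (the case where iterrows() yields copies): writing to the row inside the loop leaves the DataFrame untouched, while assigning to the DataFrame itself works.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["A.RAMU", "B.RAMYA", "B.SUBBU"], "age": [19, 20, 21]})

for idx, row in df.iterrows():
    row["age"] = 0                  # modifies only the temporary row Series

after_loop = df["age"].tolist()
print(after_loop)                   # the original ages are still there

# To really change the data, assign to the DataFrame itself
df["age"] = df["age"] + 1
print(df["age"].tolist())
```

Column-wise assignment like this is also usually faster than looping over rows.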

### 8) Sorting

### 8.1) Sort by value-Only one column

sort_values() is the method for sorting by values. It accepts a 'by' argument naming the column (or columns) of the DataFrame to sort by.

In [124]:
#Consider a dataframe
df= pd.DataFrame({"age":[19,20,21,21,23,25],"Height":[165,163,170,176,175,150],"weight":[70,65,66,59,65,70]})
df

Unnamed: 0,age,Height,weight
0,19,165,70
1,20,163,65
2,21,170,66
3,21,176,59
4,23,175,65
5,25,150,70


In [125]:
df.sort_values(by="age",ascending=False) # default is ascending=True

Unnamed: 0,age,Height,weight
5,25,150,70
4,23,175,65
2,21,170,66
3,21,176,59
1,20,163,65
0,19,165,70


In [126]:
df

Unnamed: 0,age,Height,weight
0,19,165,70
1,20,163,65
2,21,170,66
3,21,176,59
4,23,175,65
5,25,150,70


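**Observe that the rows are sorted by age in descending order, and the Height, weight and row index values move along with their age value. sort_values() returns a new DataFrame, so df itself is unchanged, as the output of df above shows.**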
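
The 'by' argument can also take a list of columns. The sketch below, on the same data, sorts by age in ascending order and breaks ties by Height in descending order; the choice of tie-breaker is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"age": [19, 20, 21, 21, 23, 25],
                   "Height": [165, 163, 170, 176, 175, 150],
                   "weight": [70, 65, 66, 59, 65, 70]})

# One ascending flag per column in 'by'
by_two = df.sort_values(by=["age", "Height"], ascending=[True, False])
print(by_two)
```

The two rows with age 21 swap places, because the taller one (row 3) now comes first.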

### 8.2) Sort by index

Using the sort_index() method, by passing the axis arguments and the order of sorting, DataFrame can be sorted. By default, sorting is done on row labels in ascending order.

In [127]:
# Consider a dataframe
df= pd.DataFrame({"age":[19,20,21,21,23,25],"Height":[165,163,170,176,175,150],"weight":[70,65,66,59,65,70]},index=[0,2,1,5,3,4])
df

Unnamed: 0,age,Height,weight
0,19,165,70
2,20,163,65
1,21,170,66
5,21,176,59
3,23,175,65
4,25,150,70


In [128]:
df.sort_index()

Unnamed: 0,age,Height,weight
0,19,165,70
1,21,170,66
2,20,163,65
3,23,175,65
4,25,150,70
5,21,176,59
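
The axis and order arguments mentioned above can be sketched like this, on a shortened version of the same frame:

```python
import pandas as pd

df = pd.DataFrame({"age": [19, 20, 21],
                   "Height": [165, 163, 170],
                   "weight": [70, 65, 66]}, index=[0, 2, 1])

rows_desc = df.sort_index(ascending=False)   # row labels, descending
cols_sorted = df.sort_index(axis=1)          # column labels, ascending

print(rows_desc.index.tolist())
print(cols_sorted.columns.tolist())
```

Note that column labels are compared as strings, so the capitalized "Height" sorts before the lowercase names.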


### 9) Python Pandas - Indexing and Selecting Data

**Pandas supports two main types of multi-axis indexing, listed below −**
* **.loc[]** : Label based
* **.iloc[]** : Integer position based

### .loc

In [1]:
import pandas as pd
import numpy as np
df= pd.DataFrame({"age":[19,20,21,21,23,25],"Height":[165,163,170,176,175,150],"weight":[70,65,66,59,65,70]})
df

Unnamed: 0,age,Height,weight
0,19,165,70
1,20,163,65
2,21,170,66
3,21,176,59
4,23,175,65
5,25,150,70


In [2]:
df.loc[2] # returns the row with label 2 (here the 3rd row)

age        21
Height    170
weight     66
Name: 2, dtype: int64

In [3]:
df.loc[2:5] # returns the rows labeled 2 to 5. With .loc both endpoints are included

Unnamed: 0,age,Height,weight
2,21,170,66
3,21,176,59
4,23,175,65
5,25,150,70


In [4]:
df.loc[:,"age"] # returns age column 

0    19
1    20
2    21
3    21
4    23
5    25
Name: age, dtype: int64

In [5]:
df.loc[[2,5],["age","Height"]] # returns age and Height from the rows labeled 2 and 5

Unnamed: 0,age,Height
2,21,170
5,25,150


### .iloc()
Pandas provides various methods for purely integer-based indexing. As in Python and NumPy, positions are 0-based.

In [134]:
df.iloc[2:5] # returns the rows at positions 2, 3 and 4 (the end position 5 is excluded)

Unnamed: 0,age,Height,weight
2,21,170,66
3,21,176,59
4,23,175,65


In [135]:
df.iloc[:3] # returns first 3 rows

Unnamed: 0,age,Height,weight
0,19,165,70
1,20,163,65
2,21,170,66


In [136]:
df.iloc[1:4,0:2] # returns the rows at positions 1, 2 and 3 from the first two columns (age and Height)

Unnamed: 0,age,Height
1,20,163
2,21,170
3,21,176


In [137]:
df.iloc[:,2] # returns 3rd column

0    70
1    65
2    66
3    59
4    65
5    70
Name: weight, dtype: int64

In [138]:
df.iloc[2,:] # returns the row at position 2 (the 3rd row)

age        21
Height    170
weight     66
Name: 2, dtype: int64

In [139]:
df.iloc[2,2] # returns the element at row position 2 and column position 2 (3rd row, 3rd column)

66

In [140]:
# df.iloc[2,"age"] # returns error because iloc is integer based indexing
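
When a row position has to be combined with a column label, one side can be translated first. This is a minimal sketch of two possible ways to do it, using the same data:

```python
import pandas as pd

df = pd.DataFrame({"age": [19, 20, 21, 21, 23, 25],
                   "Height": [165, 163, 170, 176, 175, 150],
                   "weight": [70, 65, 66, 59, 65, 70]})

# Option 1: translate the column label to its integer position, then use iloc
pos = df.columns.get_loc("age")
v1 = df.iloc[2, pos]

# Option 2: translate the row position to its label, then use loc
v2 = df.loc[df.index[2], "age"]

print(v1, v2)
```

Both options read the same cell, so they return the same value.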

### 10) Handling missing values (Dealing with NA)

### 10.1 Find NA's

In [8]:
# Create a dataframe
import pandas as pd
import numpy as np
df= pd.DataFrame({"age":[19,20,21,21,23,25,23],"Height":[165,163,170,176,175,150,np.nan],"weight":[70,65,66,59,65,70,71]})
df

Unnamed: 0,age,Height,weight
0,19,165.0,70
1,20,163.0,65
2,21,170.0,66
3,21,176.0,59
4,23,175.0,65
5,25,150.0,70
6,23,,71


In [8]:
df.isna().any() # shows, for each column, whether it contains at least one NA

age       False
Height     True
weight    False
dtype: bool
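
Beyond a yes/no answer per column, it is often useful to count the NAs and to see the affected rows. A small sketch on the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [19, 20, 21, 21, 23, 25, 23],
                   "Height": [165, 163, 170, 176, 175, 150, np.nan],
                   "weight": [70, 65, 66, 59, 65, 70, 71]})

na_per_column = df.isna().sum()             # number of NAs in each column
rows_with_na = df[df.isna().any(axis=1)]    # rows that contain at least one NA

print(na_per_column)
print(rows_with_na)
```

Here only the Height column has a missing value, in the row labeled 6.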

### Summing is done by considering NA's as zero

In [143]:
df["Height"].sum() # When summing data, NA(NaN) will be treated as Zero

999.0

### 10.2 Fill NA's with zero

In [10]:
df.fillna(0) # Fill NA (missing values) with zero

Unnamed: 0,age,Height,weight
0,19,165.0,70
1,20,163.0,65
2,21,170.0,66
3,21,176.0,59
4,23,175.0,65
5,25,150.0,70
6,23,0.0,71


### 10.3 Fill NA's with previous value

In [11]:
# Create a dataframe
df= pd.DataFrame({"age":[19,20,21,21,23,25,23],"Height":[165,163,170,176,175,150,np.nan],"weight":[70,65,66,59,65,70,71]})
df


Unnamed: 0,age,Height,weight
0,19,165.0,70
1,20,163.0,65
2,21,170.0,66
3,21,176.0,59
4,23,175.0,65
5,25,150.0,70
6,23,,71


In [146]:
df.ffill() # Fill NA with the previous value (replaces fillna(method='pad'), which is deprecated in recent pandas versions)

Unnamed: 0,age,Height,weight
0,19,165.0,70
1,20,163.0,65
2,21,170.0,66
3,21,176.0,59
4,23,175.0,65
5,25,150.0,70
6,23,150.0,71


### 10.4 Fill NA's with next values

In [147]:
# Create a dataframe
df= pd.DataFrame({"age":[19,20,21,21,23,25,23],"Height":[165,163,170,np.nan,175,150,np.nan],"weight":[np.nan,65,66,59,65,70,71]})
df

Unnamed: 0,age,Height,weight
0,19,165.0,
1,20,163.0,65.0
2,21,170.0,66.0
3,21,,59.0
4,23,175.0,65.0
5,25,150.0,70.0
6,23,,71.0


In [148]:
df["Height"].bfill() # Fill NA with the next value (replaces fillna(method='bfill'), which is deprecated in recent pandas versions)


0    165.0
1    163.0
2    170.0
3    175.0
4    175.0
5    150.0
6      NaN
Name: Height, dtype: float64

### 10.5 Fill NA's with a constant/text

In [15]:
# Create a dataframe
df= pd.DataFrame({"age":[19,20,21,21,23,25,23],"Height":[165,163,170,np.nan,175,150,np.nan],"weight":[70,65,66,59,65,70,71]})


In [16]:
df.fillna("Not available",inplace=True) # Note: this will modify any other views on this object


In [17]:
df # NaN is permanently changed

Unnamed: 0,age,Height,weight
0,19,165,70
1,20,163,65
2,21,170,66
3,21,Not available,59
4,23,175,65
5,25,150,70
6,23,Not available,71


### 10.6 Fill NA's with a constant value using replace function

In [18]:
# Create a dataframe
df= pd.DataFrame({"age":[19,20,21,21,23,25,23],"Height":[165,163,170,np.nan,175,150,np.nan],"weight":[np.nan,65,66,59,65,70,71]})
df

Unnamed: 0,age,Height,weight
0,19,165.0,
1,20,163.0,65.0
2,21,170.0,66.0
3,21,,59.0
4,23,175.0,65.0
5,25,150.0,70.0
6,23,,71.0


In [153]:
# Replace "NA" using replace function with a constant
df["Height"]=df["Height"].replace(to_replace=np.nan,value=165)  
df

Unnamed: 0,age,Height,weight
0,19,165.0,
1,20,163.0,65.0
2,21,170.0,66.0
3,21,165.0,59.0
4,23,175.0,65.0
5,25,150.0,70.0
6,23,165.0,71.0


### 10.7 Use interpolate function to fill the missing values

In [19]:
# Create a dataframe
df= pd.DataFrame({"age":[19,20,np.nan,22,23,25,23],
                  "Height":[165,163,170,np.nan,175,150,155],
                  "weight":[70,65,66,np.nan,65,70,71]})
df

Unnamed: 0,age,Height,weight
0,19.0,165.0,70.0
1,20.0,163.0,65.0
2,,170.0,66.0
3,22.0,,
4,23.0,175.0,65.0
5,25.0,150.0,70.0
6,23.0,155.0,71.0


In [155]:
df.interpolate(method="linear") #‘linear’: ignore the index and treat the values as equally spaced. default

Unnamed: 0,age,Height,weight
0,19.0,165.0,70.0
1,20.0,163.0,65.0
2,21.0,170.0,66.0
3,22.0,172.5,65.5
4,23.0,175.0,65.0
5,25.0,150.0,70.0
6,23.0,155.0,71.0


### 10.8) Remove rows with NA's

In [20]:
# Create a dataframe
df= pd.DataFrame({"age":[19,20,np.nan,22,23,25,23],
                  "Height":[165,163,170,np.nan,175,150,155],
                  "weight":[70,65,66,np.nan,65,70,71]})
df

Unnamed: 0,age,Height,weight
0,19.0,165.0,70.0
1,20.0,163.0,65.0
2,,170.0,66.0
3,22.0,,
4,23.0,175.0,65.0
5,25.0,150.0,70.0
6,23.0,155.0,71.0


In [21]:
df.dropna() # Drop rows with NA

Unnamed: 0,age,Height,weight
0,19.0,165.0,70.0
1,20.0,163.0,65.0
4,23.0,175.0,65.0
5,25.0,150.0,70.0
6,23.0,155.0,71.0


### 10.9) Remove rows in which all values are NA

In [22]:
# Create a dataframe
df= pd.DataFrame({"age":[19,20,np.nan,22,23,25,23],
                  "Height":[165,163,np.nan,np.nan,175,150,155],
                  "weight":[70,65,np.nan,np.nan,65,70,71]})
df

Unnamed: 0,age,Height,weight
0,19.0,165.0,70.0
1,20.0,163.0,65.0
2,,,
3,22.0,,
4,23.0,175.0,65.0
5,25.0,150.0,70.0
6,23.0,155.0,71.0


In [159]:
df.dropna(how="all") # Drop the rows containing all NA's 

Unnamed: 0,age,Height,weight
0,19.0,165.0,70.0
1,20.0,163.0,65.0
3,22.0,,
4,23.0,175.0,65.0
5,25.0,150.0,70.0
6,23.0,155.0,71.0


### 10.10) Remove the columns containing NA's

In [23]:
# Create a dataframe
df= pd.DataFrame({"age":[19,20,24,22,23,25,23],"Height":[165,163,np.nan,np.nan,175,150,155],"weight":[70,65,75,67,65,70,71]})
df

Unnamed: 0,age,Height,weight
0,19,165.0,70
1,20,163.0,65
2,24,,75
3,22,,67
4,23,175.0,65
5,25,150.0,70
6,23,155.0,71


In [161]:
df.dropna(axis=1) # drop columns with NA's

Unnamed: 0,age,weight
0,19,70
1,20,65
2,24,75
3,22,67
4,23,65
5,25,70
6,23,71


### 10.11) Replace the NA's with the mean value of the column

In [162]:
# Consider a dataframe
df= pd.DataFrame({"age":[19,20,24,22,23,25,23],"Height":[165,163,np.nan,np.nan,175,150,155],"weight":[70,65,75,67,65,70,71]})
df

Unnamed: 0,age,Height,weight
0,19,165.0,70
1,20,163.0,65
2,24,,75
3,22,,67
4,23,175.0,65
5,25,150.0,70
6,23,155.0,71


In [163]:
# Calculate the mean height (NaN values are skipped by default)
Mean_height=df["Height"].mean()
Mean_height

161.6

In [164]:
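df["Height"]=df["Height"].fillna(Mean_height) # assign the result back; an inplace replace on a selected column may not update df in recent pandas versions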
df

Unnamed: 0,age,Height,weight
0,19,165.0,70
1,20,163.0,65
2,24,161.6,75
3,22,161.6,67
4,23,175.0,65
5,25,150.0,70
6,23,155.0,71
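
The same idea extends to every numeric column at once. The sketch below rebuilds the frame under the hypothetical name `df_demo` and passes the Series of column means to `fillna`, which fills each column with the value stored under its own label.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"age": [19, 20, 24, 22, 23, 25, 23],
                        "Height": [165, 163, np.nan, np.nan, 175, 150, 155],
                        "weight": [70, 65, 75, 67, 65, 70, 71]})

# Column means, computed while skipping NaN (the pandas default)
column_means = df_demo.mean(numeric_only=True)

# fillna with a Series fills each column with the mean of that column
filled = df_demo.fillna(column_means)
print(filled)
```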


### 10.12) Replace missing values with most common value
This technique works well for categorical attributes.

In [10]:
# Create a dataframe
df= pd.DataFrame({"age":[19,20,24,22,23,25,23],"Height":[165,163,np.nan,np.nan,175,150,155],"weight":[70,65,75,67,65,70,71],"Gender":["M",np.nan,"F","F","M","F","F"]})
df

Unnamed: 0,age,Height,weight,Gender
0,19,165.0,70,M
1,20,163.0,65,
2,24,,75,F
3,22,,67,F
4,23,175.0,65,M
5,25,150.0,70,F
6,23,155.0,71,F


In [12]:
# Now, let's replace the missing value (NaN) in Gender with the most frequent value, i.e. we need to calculate the mode.
Gender_max=df["Gender"].value_counts().idxmax()
Gender_max

'F'

In [168]:
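df["Gender"]=df["Gender"].fillna(Gender_max) # assign the result back; an inplace replace on a selected column may not update df in recent pandas versions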
df

Unnamed: 0,age,Height,weight,Gender
0,19,165.0,70,M
1,20,163.0,65,F
2,24,,75,F
3,22,,67,F
4,23,175.0,65,M
5,25,150.0,70,F
6,23,155.0,71,F
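
An alternative way to find the most common value is `Series.mode()`. This is a minimal sketch on a hypothetical series `gender` with the same values as the Gender column; `mode()` returns a Series because ties are possible, so `[0]` picks the first entry.

```python
import numpy as np
import pandas as pd

# Hypothetical series with the same values as the Gender column
gender = pd.Series(["M", np.nan, "F", "F", "M", "F", "F"])

# mode() ignores NaN and returns all of the most frequent values
most_common = gender.mode()[0]

# Fill the missing entry with the most frequent value
filled = gender.fillna(most_common)
print(filled.tolist())
```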


### 10.13) Remove rows that contain NA's in a particular column
Some columns are very important. Sometimes a whole row is useless if it has NaN in a particular column, and it should be removed.

In [169]:
# Create a dataframe
df= pd.DataFrame({"age":[19,20,24,22,23,25,23],"Height":[165,163,np.nan,np.nan,175,150,155],"weight":[70,65,75,67,65,70,71],"Gender":["M",np.nan,"F","F","M","F","F"]})
df

Unnamed: 0,age,Height,weight,Gender
0,19,165.0,70,M
1,20,163.0,65,
2,24,,75,F
3,22,,67,F
4,23,175.0,65,M
5,25,150.0,70,F
6,23,155.0,71,F


Say we can't take the risk of assuming that the person in the 2nd row is female. In such cases it is sometimes better to delete the row.

In [170]:
df.dropna(subset=["Gender"],axis=0,inplace=True)
df

Unnamed: 0,age,Height,weight,Gender
0,19,165.0,70,M
2,24,,75,F
3,22,,67,F
4,23,175.0,65,M
5,25,150.0,70,F
6,23,155.0,71,F


Notice that the index now has a gap (label 1 is missing). It is better to reset the index.

In [171]:
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,age,Height,weight,Gender
0,19,165.0,70,M
1,24,,75,F
2,22,,67,F
3,23,175.0,65,M
4,25,150.0,70,F
5,23,155.0,71,F
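
The two steps above can also be chained without `inplace=True`, which leaves the original frame untouched. A short sketch on a small hypothetical frame `df_demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Gender value
df_demo = pd.DataFrame({"age": [19, 20, 24],
                        "Gender": ["M", np.nan, "F"]})

# Drop rows where Gender is missing, then renumber the index from 0
cleaned = df_demo.dropna(subset=["Gender"]).reset_index(drop=True)
print(cleaned)
```

Because each method returns a new DataFrame, `df_demo` still holds all three rows afterwards.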


### 11) groupby()

The pandas DataFrame.groupby() function splits the data into groups based on some criteria. pandas objects can be split on any of their axes. In abstract terms, grouping defines a mapping from labels to group names.

In [14]:
# Create a dataframe

import numpy as np
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'Kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}

df = pd.DataFrame(ipl_data)
df

Unnamed: 0,Team,Rank,Year,Points
0,Riders,1,2014,876
1,Riders,2,2015,789
2,Devils,2,2014,863
3,Devils,3,2015,673
4,Kings,3,2014,741
5,Kings,4,2015,812
6,Kings,1,2016,756
7,Kings,1,2017,788
8,Riders,2,2016,694
9,Royals,4,2014,701
10,Royals,1,2015,804
11,Riders,2,2017,690


### 11.1) Group by a single column (attribute)

In [15]:
grouped=df.groupby(by='Team').groups
grouped

{'Devils': Int64Index([2, 3], dtype='int64'),
 'Kings': Int64Index([4, 5, 6, 7], dtype='int64'),
 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
 'Royals': Int64Index([9, 10], dtype='int64')}
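
The `.groups` dictionary lists the row labels in each group. If only the number of rows per group is needed, `size()` is a compact alternative. The sketch below rebuilds the Team column in a hypothetical frame `ipl_demo`.

```python
import pandas as pd

# Hypothetical frame holding the same Team column as the example above
ipl_demo = pd.DataFrame({"Team": ["Riders", "Riders", "Devils", "Devils", "Kings",
                                  "Kings", "Kings", "Kings", "Riders", "Royals",
                                  "Royals", "Riders"]})

# Number of rows in each group, indexed by team name
team_sizes = ipl_demo.groupby("Team").size()
print(team_sizes)
```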

### 11.2) Group by multiple columns (attributes)

In [6]:
df.groupby(by=['Team','Rank']).groups

{('Devils', 2): Int64Index([2], dtype='int64'),
 ('Devils', 3): Int64Index([3], dtype='int64'),
 ('Kings', 1): Int64Index([6, 7], dtype='int64'),
 ('Kings', 3): Int64Index([4], dtype='int64'),
 ('Kings', 4): Int64Index([5], dtype='int64'),
 ('Riders', 1): Int64Index([0], dtype='int64'),
 ('Riders', 2): Int64Index([1, 8, 11], dtype='int64'),
 ('Royals', 1): Int64Index([10], dtype='int64'),
 ('Royals', 4): Int64Index([9], dtype='int64')}

### 11.3) Iterating through Groups
With the groupby object in hand, we can iterate over it like any other Python iterable. Each iteration yields a tuple containing the group name and the sub-DataFrame for that group.

In [18]:
grouped=df.groupby('Team')
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000019AF80EBA00>

In [19]:
for i in grouped:
    print(i)

('Devils',      Team  Rank  Year  Points
2  Devils     2  2014     863
3  Devils     3  2015     673)
('Kings',     Team  Rank  Year  Points
4  Kings     3  2014     741
5  Kings     4  2015     812
6  Kings     1  2016     756
7  Kings     1  2017     788)
('Riders',       Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
8   Riders     2  2016     694
11  Riders     2  2017     690)
('Royals',       Team  Rank  Year  Points
9   Royals     4  2014     701
10  Royals     1  2015     804)
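
Since each item is a (name, DataFrame) tuple, it is often more readable to unpack it directly in the loop. A minimal sketch on a reduced hypothetical frame `ipl_demo`:

```python
import pandas as pd

# Hypothetical frame with a subset of the rows used above
ipl_demo = pd.DataFrame({"Team": ["Riders", "Riders", "Devils", "Devils"],
                         "Points": [876, 789, 863, 673]})

# Unpack each item into the group name and the rows belonging to that group
points_per_team = {}
for team_name, team_frame in ipl_demo.groupby("Team"):
    points_per_team[team_name] = team_frame["Points"].tolist()

print(points_per_team)
```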


### 11.4) Select a Group
Using the get_group() method, we can select a single group.

In [177]:
grouped = df.groupby('Year')
grouped.get_group(2014) # We have selected year=2014 from year group.

Unnamed: 0,Team,Rank,Year,Points
0,Riders,1,2014,876
2,Devils,2,2014,863
4,Kings,3,2014,741
9,Royals,4,2014,701


### 11.5) Aggregations
An aggregation function returns a single aggregated value for each group. Once the groupby object is created, several aggregation operations can be performed on the grouped data.

In [178]:
grouped=df.groupby('Year')
grouped.mean(numeric_only=True) # Mean of every numeric column for each year; numeric_only=True skips the text column Team

Unnamed: 0_level_0,Rank,Points
Year,Unnamed: 1_level_1,Unnamed: 2_level_1
2014,2.5,795.25
2015,2.5,769.5
2016,1.5,725.0
2017,1.5,739.0


**We can also select a particular column.**

In [179]:
grouped=df.groupby('Year')
grouped['Points'].agg(np.mean) # Mean points for each year

Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64

### Applying Multiple Aggregation Functions at Once
With a grouped Series, you can also pass a list or dict of functions to aggregate with, and get a DataFrame as output:

In [180]:
grouped=df.groupby('Team')
grouped['Points'].agg([np.mean,np.median,np.std])

Unnamed: 0_level_0,mean,median,std
Team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Devils,768.0,768.0,134.350288
Kings,774.25,772.0,31.899582
Riders,762.25,741.5,88.567771
Royals,752.5,752.5,72.831998
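
When different columns need different aggregations, pandas also supports named aggregation, where each keyword names an output column and maps to a (column, function) pair. This is a sketch on a hypothetical frame `ipl_demo`; the output names `mean_points` and `best_rank` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical frame with a subset of the rows used above
ipl_demo = pd.DataFrame({"Team": ["Devils", "Devils", "Kings", "Kings"],
                         "Rank": [2, 3, 3, 1],
                         "Points": [863, 673, 741, 756]})

# Each keyword becomes an output column: name=(input column, aggregation)
summary = ipl_demo.groupby("Team").agg(
    mean_points=("Points", "mean"),
    best_rank=("Rank", "min"),
)
print(summary)
```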


### 11.6) Transformations
A transformation on a group or a column returns an object that has the same index and the same size as the object being grouped. The function passed to transform must therefore return a result that is the same size as the group chunk it receives (or a single value that can be broadcast to that size).

In [181]:
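# 'grouped' is still df.groupby('Team') from the previous cell
score=lambda x: (x-np.mean(x)) # subtract the group mean from each value
grouped.transform(score)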

Unnamed: 0,Rank,Year,Points
0,-0.75,-1.5,113.75
1,0.25,-0.5,26.75
2,-0.5,-0.5,95.0
3,0.5,0.5,-95.0
4,0.75,-1.5,-33.25
5,1.75,-0.5,37.75
6,-1.25,0.5,-18.25
7,-1.25,1.5,13.75
8,0.25,0.5,-68.25
9,1.5,-0.5,-51.5
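
Because the result of a transformation lines up row by row with the original frame, it can be stored directly as a new column. A minimal sketch on a reduced hypothetical frame `ipl_demo`; the column names `TeamMean` and `Centered` are illustrative.

```python
import pandas as pd

# Hypothetical frame with a subset of the rows used above
ipl_demo = pd.DataFrame({"Team": ["Riders", "Riders", "Devils", "Devils"],
                         "Points": [876, 789, 863, 673]})

# transform("mean") repeats each team's mean on every row of that team
ipl_demo["TeamMean"] = ipl_demo.groupby("Team")["Points"].transform("mean")

# Distance of each row from its own team's mean
ipl_demo["Centered"] = ipl_demo["Points"] - ipl_demo["TeamMean"]
print(ipl_demo)
```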


### 12) Merging/Joining
Pandas has full-featured, high-performance in-memory join operations that are idiomatically very similar to those of relational databases like SQL.

Pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects.

In [182]:
#Create 2 dataframes

left = pd.DataFrame({'id':[1,2,3,4,5],
                     'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
                     'subject_id':['sub1','sub2','sub4','sub6','sub5']})

right = pd.DataFrame({'id':[1,2,3,4,5],
                      'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
                      'subject_id':['sub2','sub4','sub3','sub6','sub5']})

print(left)
print('\n')
print(right)

   id    Name subject_id
0   1    Alex       sub1
1   2     Amy       sub2
2   3   Allen       sub4
3   4   Alice       sub6
4   5  Ayoung       sub5


   id   Name subject_id
0   1  Billy       sub2
1   2  Brian       sub4
2   3   Bran       sub3
3   4  Bryce       sub6
4   5  Betty       sub5


In [183]:
pd.merge(left,right,on='subject_id') # Merge the two dataframes on subject_id. The default is an inner join, so only subject_ids present in both frames are kept.

Unnamed: 0,id_x,Name_x,subject_id,id_y,Name_y
0,2,Amy,sub2,1,Billy
1,3,Allen,sub4,2,Brian
2,4,Alice,sub6,4,Bryce
3,5,Ayoung,sub5,5,Betty
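
The inner join above drops sub1 and sub3 because each appears in only one frame. The sketch below shows one way to keep them, using the `how`, `suffixes` and `indicator` parameters of `merge`; the suffix names are chosen here for illustration.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                     'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
                     'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})

right = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                      'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
                      'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# how="outer" keeps keys from both frames; unmatched cells become NaN.
# indicator=True adds a _merge column telling where each row came from.
outer = pd.merge(left, right, on='subject_id', how='outer',
                 suffixes=('_left', '_right'), indicator=True)

left_only = outer.loc[outer['_merge'] == 'left_only', 'subject_id'].tolist()
right_only = outer.loc[outer['_merge'] == 'right_only', 'subject_id'].tolist()
print(outer)
```

Using `how='left'` or `how='right'` instead keeps all keys from only one of the two frames.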


### 13) Binning

Binning is a technique for converting continuous values into discrete categories.

In [27]:
# First, import the libraries and load the dataset.
import pandas as pd
import numpy as np
tips=pd.read_csv("Tips.csv")
tips.head()

Unnamed: 0,SINO,TotalBill,Tips,Smoker,Day,Time,Size
0,1,16.99,1.01,No,Sun,Dinner,2.0
1,2,10.34,1.66,No,Sun,Dinner,3.0
2,3,21.01,3.5,No,Sun,Dinner,3.0
3,4,23.68,3.31,No,Sun,Dinner,2.0
4,5,24.59,3.61,No,Sun,Dinner,4.0


In [22]:
# Create 4 equally spaced bin edges, which divide the range into 3 equal-width bins

bins=np.linspace(min(tips["TotalBill"]),max(tips["TotalBill"]),4)
bins

array([ 3.07      , 18.13666667, 33.20333333, 48.27      ])

In [23]:
# Give the name you want for each part

group_names=["small_bill","decent_bill","Large_bill"]

In [28]:
# Binning can be performed with pd.cut. include_lowest: whether the first interval should include its left edge.

tips["TotalBill"]=pd.cut(tips["TotalBill"],bins,labels=group_names,include_lowest=True)

In [29]:
tips.head()

Unnamed: 0,SINO,TotalBill,Tips,Smoker,Day,Time,Size
0,1,small_bill,1.01,No,Sun,Dinner,2.0
1,2,small_bill,1.66,No,Sun,Dinner,3.0
2,3,decent_bill,3.5,No,Sun,Dinner,3.0
3,4,decent_bill,3.31,No,Sun,Dinner,2.0
4,5,decent_bill,3.61,No,Sun,Dinner,4.0
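
The cell above overwrites the numeric TotalBill column with the labels. The sketch below repeats the same steps on a small hypothetical series `bills` (standing in for the CSV column, which is not available here) and keeps the binned result in a separate variable so the original values are preserved.

```python
import numpy as np
import pandas as pd

# Hypothetical bill amounts standing in for tips["TotalBill"]
bills = pd.Series([3.07, 10.34, 16.99, 21.01, 24.59, 48.27])

# 4 equally spaced edges define 3 equal-width bins
edges = np.linspace(bills.min(), bills.max(), 4)
group_names = ["small_bill", "decent_bill", "Large_bill"]

# include_lowest=True makes the first bin include the minimum value
binned = pd.cut(bills, bins=edges, labels=group_names, include_lowest=True)

# Number of values that fall into each bin
counts = binned.value_counts()
print(binned.tolist())
print(counts)
```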
