# <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Pandas</p>

<div class="alert alert-block alert-info alert">

# <span style=" color:red">Pandas Series and DataFrames

### What can we do with Pandas
* Reading and writing data in many formats
* Grab data based on indexing, logic, conditional subsetting, and more
* Handle missing data
* Adjust and restructure data

## Table of Contents
#### **1-Series**
* Create Series
* Index
* Operations

#### **2-Data Frames**
* Import (Read) excel and csv files
* Explore Columns
* Create and Remove (Drop) Columns
* Explore Rows
* Set and Reset Index
* Grab Rows
* Remove and Insert Rows

## 1- Series
* A Series is a data structure in Pandas that holds an array of information along with a named index.
* The named index differentiates it from a simple NumPy array.
* **Formal Definition:** One-dimensional ndarray with axis labels

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Examine Series. It is very comprehensive.
# help(pd.Series) 

#### From a List

In [3]:
# Let's create a list object
myindex = ['USA', 'Canada', 'Mexico']
mydata = [1776,1867, 1821]

In [4]:
# Display mydata as Series
myser = pd.Series(data=mydata)
myser

# by default it gives indexes

0    1776
1    1867
2    1821
dtype: int64

In [5]:
# Let's look at its type
type(myser)  # it is Series

pandas.core.series.Series

In [6]:
# Let's assign "myindex" as the index, instead of the default numeric index
myser = pd.Series(data=mydata, index=myindex) # or simply "pd.Series(mydata, myindex)"
myser

USA       1776
Canada    1867
Mexico    1821
dtype: int64

In [7]:
# We can still reach data by position, even though the positions are not displayed.
# Use .iloc for this: a plain myser[0] on a labeled index is deprecated in recent pandas versions.
myser.iloc[0]

1776

In [8]:
# Or we could use our labeled indexes
myser['USA']

1776
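Since a Series has both labels and positions, it can help to state explicitly which one is meant. The following is a minimal sketch, not part of the original notebook: it rebuilds the same data under the hypothetical name `founded` and uses `.loc` for label-based access and `.iloc` for position-based access.

```python
import pandas as pd

# Same data as above; the name `founded` is only for this illustration
founded = pd.Series(data=[1776, 1867, 1821], index=['USA', 'Canada', 'Mexico'])

by_label = founded.loc['Canada']   # label-based lookup
by_position = founded.iloc[1]      # position-based lookup (second item)

print(by_label, by_position)
```

Both lookups reach the same element, one through its label and one through its position.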

#### From a Dictionary

In [9]:
# Let's create a dictionary
ages = {'Sam':14, 'Michael':26, 'Julia': 21}

In [10]:
# Pass this dictionary into Series
pd.Series(ages)

Sam        14
Michael    26
Julia      21
dtype: int64

#### <span style=" color:red">Exercise
Employee expenses are stored in a pandas Series **called expenses**. What are Bob's expenses? Assign your answer to a variable called **bob_expense**.

In [11]:
expenses = pd.Series({'Andrew':200,'Bob':150,'Claire':450})
expenses

Andrew    200
Bob       150
Claire    450
dtype: int64

In [12]:
bob_expense = expenses['Bob'] # use the label to look up its value, like a key in a dictionary
bob_expense

150

In [13]:
# Another way to find Bob's expenses: by position (Bob is the second item)
expenses.iloc[1]

150

#### Named Index

In [14]:
# Imaginary Sales Data for 1st and 2nd Quarters for Global Company
q1 = {'Japan': 80, 'China': 450, 'India': 200, 'USA': 250}
q2 = {'Brazil': 100,'China': 500, 'India': 210,'USA': 260}

In [15]:
# Convert it into Series
sales_sq1 = pd.Series(q1)
sales_sq1

Japan     80
China    450
India    200
USA      250
dtype: int64

In [16]:
sales_sq2 = pd.Series(q2)
sales_sq2

Brazil    100
China     500
India     210
USA       260
dtype: int64

In [17]:
# We can grab items from these Series
# Sales of Japan from sales_sq1
sales_sq1['Japan'] 

80

In [18]:
# We can also grab it by position, using .iloc
sales_sq1.iloc[0]

80

In [19]:
# Sales of China from sales_sq2
sales_sq2['China'] 

500

In [20]:
# Another way to display it: by position, using .iloc
sales_sq2.iloc[1]

500

#### Operations

In [21]:
# It is more readable when we use keys instead of numeric indexes
# To see the keys

sales_sq1.keys()

Index(['Japan', 'China', 'India', 'USA'], dtype='object')

In [22]:
# A dictionary has a values() method, but on a Series "values" is an attribute, not a method.
# So "sales_sq1.values()" raises an error; write "sales_sq1.values" (no parentheses) instead.

q1.values()

dict_values([80, 450, 200, 250])
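To make the difference concrete, here is a small sketch of the usual ways to read the values out of a Series. The name `sales` is hypothetical; the data is the same as in `sales_sq1`.

```python
import pandas as pd

q1 = {'Japan': 80, 'China': 450, 'India': 200, 'USA': 250}
sales = pd.Series(q1)   # hypothetical name, same data as sales_sq1

values_array = sales.values      # attribute, no parentheses -> NumPy array
values_copy = sales.to_numpy()   # method form, also returns a NumPy array
pairs = list(sales.items())      # (label, value) pairs, similar to dict.items()

print(values_array)
print(pairs)
```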

In [23]:
# Reminder
# Multiplying a list by 2 does not multiply its values; it repeats the items, so the list becomes twice as long

[1,2] * 2

[1, 2, 1, 2]

In [24]:
# However, NumPy arrays multiply each value element by element
np.array([1,2]) * 2

array([2, 4])

In [25]:
# Series allow the same thing.
# Let's multiply the values of sales_sq1

sales_sq1 * 2

Japan    160
China    900
India    400
USA      500
dtype: int64

In [26]:
sales_sq1 / 100

Japan    0.8
China    4.5
India    2.0
USA      2.5
dtype: float64

In [27]:
# Total sales

sales_sq1 + sales_sq2

# There are two NaNs because Brazil and Japan each appear in only one of the two Series

Brazil      NaN
China     950.0
India     410.0
Japan       NaN
USA       510.0
dtype: float64

In [28]:
# One solution is to treat a country that is missing from one Series as having zero sales there
# A simple way to do this is the "add()" method with fill_value=0

sales_sq1.add(sales_sq2,fill_value=0)

# There is no NaN

Brazil    100.0
China     950.0
India     410.0
Japan      80.0
USA       510.0
dtype: float64

In [29]:
# Pay attention to the data type after these operations. 
# The original dtype was integer but after this operation it is now float, as seen above.

# Original dtype
sales_sq1.dtype

dtype('int64')
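If integer totals are wanted after the addition, the result can be converted back. This is a sketch under the assumption that no NaN is left in the result (which holds here because of `fill_value=0`); the names `s1`, `s2`, `total` and `total_int` are illustrative only.

```python
import pandas as pd

s1 = pd.Series({'Japan': 80, 'China': 450, 'India': 200, 'USA': 250})
s2 = pd.Series({'Brazil': 100, 'China': 500, 'India': 210, 'USA': 260})

total = s1.add(s2, fill_value=0)    # comes back as float64, as seen above
total_int = total.astype('int64')   # safe here because no NaN remains

print(total.dtype, total_int.dtype)
print(total_int)
```

Calling `astype('int64')` on a Series that still contains NaN would raise an error, so the conversion belongs after the missing values are handled.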

## 2- DataFrames
* A DataFrame is a table of columns and rows in Pandas that we can easily restructure and filter.
* **Formal Definition:** A group of Pandas Series objects that share the same index.
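The formal definition can be illustrated directly: the sketch below builds a DataFrame from two Series that share one index. The names `idx`, `jan`, `feb` and `frame` and the numbers are made up for the illustration.

```python
import pandas as pd

idx = ['CA', 'NY', 'AZ', 'TX']
jan = pd.Series([95, 70, 75, 40], index=idx)
feb = pd.Series([11, 63, 9, 4], index=idx)

# Each Series becomes a column; the shared index becomes the row labels
frame = pd.DataFrame({'Jan': jan, 'Feb': feb})

print(frame)
print(type(frame['Jan']))   # each column is again a Series
```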

![image.png](attachment:98422dc5-5721-436d-a265-ed0859cab810.png)

![image.png](attachment:25501a02-b06a-4336-a8bf-b9bd74b7bd53.png)

In [30]:
# To learn more about DataFrames use this code
# help(pd.DataFrame)

In [31]:
# Let's create a random array and then build a DataFrame from it
np.random.seed(101) # to display the same results
mydata = np.random.randint(0,101,(4,3)) 

# Random integers from 0 to 100 (the upper bound 101 is excluded)
# 2-dimensional array consisting of 4 rows and 3 columns

mydata

array([[95, 11, 81],
       [70, 63, 87],
       [75,  9, 77],
       [40,  4, 63]])

In [32]:
myindex = ['CA', 'NY', 'AZ', 'TX']  # rows
mycolumns = ['Jan', 'Feb', 'Mar']   # columns

In [33]:
# If we give only the data and do not specify index and columns, they are numbered "0, 1, 2, ..." by default
df=pd.DataFrame(data=mydata)
df

Unnamed: 0,0,1,2
0,95,11,81
1,70,63,87
2,75,9,77
3,40,4,63


In [34]:
# Now, we can combine them as a DataFrame

df = pd.DataFrame(data=mydata, index=myindex, columns=mycolumns)
df

Unnamed: 0,Jan,Feb,Mar
CA,95,11,81
NY,70,63,87
AZ,75,9,77
TX,40,4,63


In [35]:
# Check the information that this dataframe has

df.info()  # number of columns, rows and null/non-null values, and the data type and memory usage

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, CA to TX
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Jan     4 non-null      int32
 1   Feb     4 non-null      int32
 2   Mar     4 non-null      int32
dtypes: int32(3)
memory usage: 80.0+ bytes


### Import / Read CSV or Excel Files for DataFrames
* 1-If your .py file or .ipynb notebook is located in the exact same folder location as the .csv file you want to read, simply pass in the file name as a string, for example:

 For csv file: df = pd.read_csv('my_file.csv') 
 
 For excel file: df = pd.read_excel('book2.xlsx') 
 
* 2- Pass in the entire file path if you are located in a different directory. The file path must be 100% correct in order for this to work. For example:

 For csv file: df = pd.read_csv("C:\\Users\\myself\\files\\my_file.csv")
 
 For excel file: df = pd.read_excel("C:\\Users\\myself\\files\\my_file.xlsx")
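As a sketch of the path handling described above, the example below writes a tiny CSV into a temporary folder and reads it back through a `pathlib.Path`. The folder, the file contents and the names `tmp_dir`, `csv_path` and `loaded` are assumptions made only to keep the example self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write a tiny CSV into a temporary folder so the example runs anywhere
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / 'my_file.csv'
pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).to_csv(csv_path, index=False)

# read_csv accepts a Path object as well as a string
loaded = pd.read_csv(csv_path)
print(loaded)

# On Windows, a raw string avoids doubling the backslashes:
# pd.read_csv(r"C:\Users\myself\files\my_file.csv")
```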

#### Where is my Python code Located?

In [36]:
# In your terminal/command prompt run:
#conda install xlrd  
#conda install openpyxl

# In Jupyter Notebook
!pip install xlrd
!pip install openpyxl



#### My current directory
* You can use os.getcwd() (get current working directory) or the native OS command pwd.

In [37]:
import os
os.getcwd()

'C:\\Users\\admin\\Desktop\\Python'

In [38]:
# Or use pwd (an IPython magic; if automagic is off, a bare pwd raises NameError, so write %pwd)
pwd

NameError: name 'pwd' is not defined

##### List of the files in my current directory

In [39]:
ls

 Volume in drive C has no label.
 Volume Serial Number is 2AA1-FCAF

 Directory of C:\Users\admin\Desktop\Python

05/01/2024  01:03 PM    <DIR>          .
05/01/2024  01:03 PM    <DIR>          ..
04/22/2024  06:59 PM    <DIR>          .git
04/23/2024  03:28 PM    <DIR>          .ipynb_checkpoints
08/13/2022  02:53 PM            33,829 Class_numpy_1.ipynb
06/10/2022  09:04 AM            44,949 Class_numpy_2.ipynb
06/10/2022  09:00 AM           427,718 Class_numpy_3.ipynb
06/11/2022  11:44 AM            31,158 Class_numpy_4.ipynb
06/14/2022  03:55 AM             7,584 Class_Pandas_1.ipynb
08/09/2022  07:28 AM            51,076 Class_Pandas_2.ipynb
06/15/2022  07:40 AM           112,856 Class_Pandas_3.ipynb
06/17/2022  11:38 AM           184,030 Class_Pandas_4.ipynb
06/17/2022  02:32 PM           226,284 Class_Pandas_5.ipynb
05/28/2022  08:05 AM            29,907 Class_python_function.ipynb
05/11/2022  02:23 PM            14,245 Class_python1.ipynb
05/17/2022  02:33 PM            25,869 

#### Import or Read the data

In [65]:
# Since my data is located in the same place that this notebook is, I can simply write...
df = pd.read_csv('tips.csv')
df

# Otherwise I would use the full path to where the csv file is located.
#df = pd.read_csv('C:\\Users\\admin\\Desktop\\Python\\tips.csv')

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251
...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17


In [66]:
# To see the columns
df.columns  

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size',
       'price_per_person', 'Payer Name', 'CC Number', 'Payment ID'],
      dtype='object')

In [67]:
df.index  

# here it provides info about the automatically created index
# it starts at 0 and stops before 244 (so 0 to 243), with a step size of 1

RangeIndex(start=0, stop=244, step=1)

In [68]:
# To see just a couple of rows or a specific number of rows, use head or tail
df.head()  

# We can limit it with a number. By default it shows the first 5 rows

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251


In [69]:
# The last 10 rows
df.tail(10) 

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
234,15.53,3.0,Male,Yes,Sat,Dinner,2,7.76,Tracy Douglas,4097938155941930,Sat7220
235,10.07,1.25,Male,No,Sat,Dinner,2,5.04,Sean Gonzalez,3534021246117605,Sat4615
236,12.6,1.0,Male,Yes,Sat,Dinner,2,6.3,Matthew Myers,3543676378973965,Sat5032
237,32.83,1.17,Male,Yes,Sat,Dinner,2,16.42,Thomas Brown,4284722681265508,Sat2929
238,35.83,4.67,Female,No,Sat,Dinner,3,11.94,Kimberly Crane,676184013727,Sat9777
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.0,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.0,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17
243,18.78,3.0,Female,No,Thur,Dinner,2,9.39,Michelle Hardin,3511451626698139,Thur672


In [70]:
# Another way to check data is info()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   sex               244 non-null    object 
 3   smoker            244 non-null    object 
 4   day               244 non-null    object 
 5   time              244 non-null    object 
 6   size              244 non-null    int64  
 7   price_per_person  244 non-null    float64
 8   Payer Name        244 non-null    object 
 9   CC Number         244 non-null    int64  
 10  Payment ID        244 non-null    object 
dtypes: float64(3), int64(2), object(6)
memory usage: 21.1+ KB


In [71]:
# For numeric columns, describe() provides additional information such as count, mean, standard deviation, min, max, etc.
df.describe()

Unnamed: 0,total_bill,tip,size,price_per_person,CC Number
count,244.0,244.0,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672,7.888197,2563496000000000.0
std,8.902412,1.383638,0.9511,2.914234,2369340000000000.0
min,3.07,1.0,1.0,2.88,60406790000.0
25%,13.3475,2.0,2.0,5.8,30407310000000.0
50%,17.795,2.9,2.0,7.255,3525318000000000.0
75%,24.1275,3.5625,3.0,9.39,4553675000000000.0
max,50.81,10.0,6.0,20.27,6596454000000000.0


In [72]:
# To swap the rows and columns, use "transpose()" or "T" after describe
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
total_bill,244.0,19.78594,8.902412,3.07,13.3475,17.795,24.1275,50.81
tip,244.0,2.998279,1.383638,1.0,2.0,2.9,3.5625,10.0
size,244.0,2.569672,0.9510998,1.0,2.0,2.0,3.0,6.0
price_per_person,244.0,7.888197,2.914234,2.88,5.8,7.255,9.39,20.27
CC Number,244.0,2563496000000000.0,2369340000000000.0,60406790000.0,30407310000000.0,3525318000000000.0,4553675000000000.0,6596454000000000.0


### Columns

In [73]:
# These columns are Series. Check their types
type(df['total_bill'])

pandas.core.series.Series

In [74]:
# If we want to display a subset dataframe from the entire dataframe...
# For example take only two columns...
mycols = ['total_bill', 'tip']
df[mycols]   # this is a dataframe

Unnamed: 0,total_bill,tip
0,16.99,1.01
1,10.34,1.66
2,21.01,3.50
3,23.68,3.31
4,24.59,3.61
...,...,...
239,29.03,5.92
240,27.18,2.00
241,22.67,2.00
242,17.82,1.75


In [75]:
# We can write these column names directly inside the dataframe but we have to use double "[[]]"
# the bracket inside holds a list with column names
df[['total_bill', 'tip']]

Unnamed: 0,total_bill,tip
0,16.99,1.01
1,10.34,1.66
2,21.01,3.50
3,23.68,3.31
4,24.59,3.61
...,...,...
239,29.03,5.92
240,27.18,2.00
241,22.67,2.00
242,17.82,1.75
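The difference between single and double brackets shows up in the type of the result. A minimal sketch on a two-row frame (the name `small` and its values are illustrative):

```python
import pandas as pd

small = pd.DataFrame({'total_bill': [16.99, 10.34], 'tip': [1.01, 1.66]})

one_col_series = small['tip']     # single brackets -> Series
one_col_frame = small[['tip']]    # double brackets -> DataFrame with one column

print(type(one_col_series).__name__, type(one_col_frame).__name__)
```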


#### Create a new column

In [76]:
# Let's calculate percentage of tips to total bill. Then we will add it as a new column
# Mathematical operation
100 * df['tip'] / df['total_bill']

0       5.944673
1      16.054159
2      16.658734
3      13.978041
4      14.680765
         ...    
239    20.392697
240     7.358352
241     8.822232
242     9.820426
243    15.974441
Length: 244, dtype: float64

In [77]:
# New column
df["tip_percentage"] = 100 * df['tip'] / df['total_bill']

# see the new dataframe
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,tip_percentage
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,5.944673
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,16.054159
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,16.658734
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,13.978041
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,14.680765


In [78]:
# Careful: If you have a column with the same name it will overwrite it instead of creating a new one

# Creation of price_per_person column with rounded numbers 

#df["price_per_person"] = np.round(df["total_bill"] / df["size"], 2)

# np.round(argument, decimal)

#df.head()

# Since it is already an existing column, we could round it as follows. 
# The code above shows how the column was created.
#df["price_per_person"] = np.round(df["price_per_person"], 2)
#df.head()
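The overwrite behavior can be checked on a tiny frame. In this sketch (the name `small` and the helper variables are hypothetical), the first assignment creates the column and the second one replaces it, so the number of columns does not change.

```python
import pandas as pd

small = pd.DataFrame({'total_bill': [16.99, 10.34], 'size': [2, 3]})

small['price_per_person'] = small['total_bill'] / small['size']   # creates the column
n_cols_after_create = small.shape[1]

small['price_per_person'] = small['price_per_person'].round(2)    # overwrites it
n_cols_after_overwrite = small.shape[1]

print(n_cols_after_create, n_cols_after_overwrite)
print(small)
```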

In [79]:
# Let's round the "tip_percentage" column to 2 decimals
df["tip_percentage"] = np.round(df["tip_percentage"], 2)
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,tip_percentage
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,5.94
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,16.05
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,16.66
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,13.98
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,14.68


#### Drop columns

To remove rows or columns, use "drop()".

Use axis=1 for columns and axis=0 for rows; the default is axis=0. These numbers match the order in shape: (rows, columns).

In [80]:
# Let's drop tip_percentage column

df.drop('tip_percentage', axis=1)   # do not forget the axis, if you want to drop columns

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251
...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17


In [81]:
# Let's check data frame
df.head()   # tip_percentage column still exists

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,tip_percentage
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,5.94
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,16.05
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,16.66
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,13.98
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,14.68


As seen above, the column has not been dropped yet, because "drop()" returns a new DataFrame and leaves the original unchanged.
We have two options to remove it permanently.

1- Reassign the result: "df = df.drop('tip_percentage', axis=1)"

2- Use "inplace=True" as an argument: "df.drop('tip_percentage', axis=1, inplace=True)"
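A short sketch of the non-permanent behavior, using the `columns=` keyword as an alternative spelling of `axis=1`. The frame `small` and its values are made up for the illustration.

```python
import pandas as pd

small = pd.DataFrame({'tip': [1.01, 1.66], 'tip_percentage': [5.94, 16.05]})

# Equivalent to small.drop('tip_percentage', axis=1)
dropped = small.drop(columns=['tip_percentage'])

print(list(small.columns))     # the original is unchanged
print(list(dropped.columns))   # the returned frame lacks the column
```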

In [82]:
# Drop it permanently
df.drop('tip_percentage', axis=1, inplace=True)
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251


### Rows

In [83]:
# One way to get information about rows
df.index

RangeIndex(start=0, stop=244, step=1)

The index should be a unique id or primary key in the analysis. By default, pandas creates a unique numeric index, but we can set a different one. For example, we can use one of the columns that has unique values.
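Before choosing a column as the index, its uniqueness can be checked with `is_unique`. The sketch below uses a three-row frame under the hypothetical name `payments`.

```python
import pandas as pd

payments = pd.DataFrame({
    'Payment ID': ['Sun2959', 'Sun4608', 'Sun4458'],
    'day': ['Sun', 'Sun', 'Sun'],
})

id_is_unique = payments['Payment ID'].is_unique   # True: suitable as an index
day_is_unique = payments['day'].is_unique         # False: repeated values

print(id_is_unique, day_is_unique)
```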

#### set_index()

In [84]:
# Let's set up "Payment ID" column as our new index 
df.set_index("Payment ID")

Unnamed: 0_level_0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number
Payment ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Sun2959,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410
Sun4608,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230
Sun4458,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322
Sun5260,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994
Sun2251,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221
...,...,...,...,...,...,...,...,...,...,...
Sat2657,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842
Sat1766,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404
Sat3880,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196
Sat17,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950


After setting "Payment ID" as the new index, now we have 10 columns instead of 11. "Payment ID" is now the name of the Index.

In [85]:
# However, this change is not permanent. Let's check our dataframe
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251


In [86]:
# To make this change permanent, we need to assign it to our df
df = df.set_index("Payment ID")

In [87]:
df.head()

Unnamed: 0_level_0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number
Payment ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Sun2959,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410
Sun4608,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230
Sun4458,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322
Sun5260,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994
Sun2251,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221


#### reset_index()

In [88]:
# If we want to set this index as column again, or we want to use the default index...
df.reset_index()

Unnamed: 0,Payment ID,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number
0,Sun2959,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410
1,Sun4608,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230
2,Sun4458,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322
3,Sun5260,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994
4,Sun2251,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221
...,...,...,...,...,...,...,...,...,...,...,...
239,Sat2657,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842
240,Sat1766,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404
241,Sat3880,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196
242,Sat17,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950


We are back to the range index. Now there are 11 columns, including "Payment ID". But to make this change permanent, we need to assign it to the df.

In [89]:
df = df.reset_index()

In [90]:
df.head()

Unnamed: 0,Payment ID,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number
0,Sun2959,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410
1,Sun4608,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230
2,Sun4458,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322
3,Sun5260,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994
4,Sun2251,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221


#### Grab a Row or multiple Rows: loc[ ] and iloc[ ]
* **loc** (label-based) selects rows by index label, whereas **iloc** (integer-based) selects rows by position.

In [91]:
# We decided to use "Payment ID" as index
df = df.set_index("Payment ID")

In [92]:
df.head()

Unnamed: 0_level_0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number
Payment ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Sun2959,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410
Sun4608,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230
Sun4458,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322
Sun5260,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994
Sun2251,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221


In [93]:
# Grab the first row
df.iloc[0]

total_bill                       16.99
tip                               1.01
sex                             Female
smoker                              No
day                                Sun
time                            Dinner
size                                 2
price_per_person                  8.49
Payer Name          Christy Cunningham
CC Number             3560325168603410
Name: Sun2959, dtype: object

In [94]:
# Let's use the name of the index in Payment ID to reach the first row
df.loc["Sun2959"]

total_bill                       16.99
tip                               1.01
sex                             Female
smoker                              No
day                                Sun
time                            Dinner
size                                 2
price_per_person                  8.49
Payer Name          Christy Cunningham
CC Number             3560325168603410
Name: Sun2959, dtype: object

In [96]:
# Multiple rows, for example the first 4 rows
df.iloc[0:4]

Unnamed: 0_level_0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number
Payment ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Sun2959,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410
Sun4608,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230
Sun4458,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322
Sun5260,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994


In [97]:
# To reach multiple rows using loc [], we need a list
# Let's take two rows: the first and the fourth rows

df.loc[["Sun2959","Sun5260"]]

Unnamed: 0_level_0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number
Payment ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Sun2959,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410
Sun5260,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994


#### Remove and Insert Rows

In [98]:
# Let's remove the first row. We can reach it by using the related "Payment ID" index.
# For rows, axis=0

df.drop("Sun2959", axis=0)           # this change is not permanent

Unnamed: 0_level_0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number
Payment ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Sun4608,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230
Sun4458,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322
Sun5260,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994
Sun2251,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221
Sun9679,25.29,4.71,Male,No,Sun,Dinner,4,6.32,Erik Smith,213140353657882
...,...,...,...,...,...,...,...,...,...,...
Sat2657,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842
Sat1766,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404
Sat3880,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196
Sat17,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950


In [99]:
# To make this change permanent, assign it to the df
df = df.drop("Sun2959", axis=0)  

In [100]:
df.head()

Unnamed: 0_level_0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number
Payment ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Sun4608,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230
Sun4458,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322
Sun5260,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994
Sun2251,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221
Sun9679,25.29,4.71,Male,No,Sun,Dinner,4,6.32,Erik Smith,213140353657882


In [None]:
# Alternative way to remove the first row using iloc[]

# df = df.iloc[1:]

# The new df would start from position 1 instead of position 0, so the first row is left out.

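**Note:** "DataFrame.append" was deprecated in pandas 1.4 and removed in pandas 2.0. The "_append" method still works, i.e., df._append(df2), but the leading underscore marks it as a private method that may change without notice. "pd.concat" is the public replacement.

**Why was it changed?**

The append method in pandas looked like "list.append" in Python but behaved differently: it did not modify the DataFrame in place and returned a new object instead. That is why the public append method was removed.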
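As a sketch of a public alternative, a row can also be added with `pd.concat`. The frame `small` and the names `one_row` and `extended` are hypothetical; note that the row is selected with double brackets so that it stays a one-row DataFrame.

```python
import pandas as pd

small = pd.DataFrame(
    {'tip': [1.66, 3.50], 'day': ['Sun', 'Sun']},
    index=['Sun4608', 'Sun4458'],
)

one_row = small.iloc[[0]]                # double brackets keep it a one-row DataFrame
extended = pd.concat([small, one_row])   # returns a new DataFrame; `small` is unchanged

print(extended)
```

Like `_append`, this keeps the original index labels, so the result contains a repeated label.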

In [101]:
# Use the "_append" method to add a new row
# First, let's create a row to add

one_row = df.iloc[0] # as an example, we take the first row and add it to the df
one_row

total_bill                     10.34
tip                             1.66
sex                             Male
smoker                            No
day                              Sun
time                          Dinner
size                               3
price_per_person                3.45
Payer Name            Douglas Tucker
CC Number           4478071379779230
Name: Sun4608, dtype: object

In [104]:
df = df._append(one_row)

In [107]:
# "_append" added a row to the end of the df. 
# It is a copy of the first row, so the index is no longer unique: "Sun4608" now appears twice.
df.tail()

Unnamed: 0_level_0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number
Payment ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Sat1766,27.18,2.0,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404
Sat3880,22.67,2.0,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196
Sat17,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950
Thur672,18.78,3.0,Female,No,Thur,Dinner,2,9.39,Michelle Hardin,3511451626698139
Sun4608,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230
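Duplicated index labels like this can be detected and, if needed, removed. The following is a minimal sketch on a three-row frame with the hypothetical name `dup`; it keeps the first occurrence of each label.

```python
import pandas as pd

dup = pd.DataFrame(
    {'tip': [1.66, 3.50, 1.66]},
    index=['Sun4608', 'Sun4458', 'Sun4608'],
)

index_unique = dup.index.is_unique   # False: 'Sun4608' appears twice

# duplicated() marks repeats; ~ inverts the mask so only first occurrences stay
deduplicated = dup[~dup.index.duplicated(keep='first')]

print(index_unique)
print(deduplicated)
```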
