<a href="https://colab.research.google.com/github/supportvectors/Data-Wrangling/blob/main/DW_Rows_and_Columns.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%run common.ipynb

<H1> Dealing with Rows and Columns in Pandas DataFrame </H1>

This notebook walks through techniques for manipulating the rows and columns of a Pandas DataFrame.

## Column Operations

1. Selecting

2. Deleting

3. Renaming

4. Adding


In [2]:
# Import the pandas package
import pandas as pd

# Define a Python dictionary of employee data to build the dataset
data = {'Name': ['Jack', 'Peter', 'Gale', 'Anne'],
        'Age': [27, 24, 22, 32],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}

# Convert the dictionary into a DataFrame
data = pd.DataFrame(data)
data

Unnamed: 0,Name,Age,Height,Qualification
0,Jack,27,5.1,Msc
1,Peter,24,6.2,MA
2,Gale,22,5.1,Msc
3,Anne,32,5.2,Msc


In [3]:
data.describe(include='all')

Unnamed: 0,Name,Age,Height,Qualification
count,4,4.0,4.0,4
unique,4,,,2
top,Jack,,,Msc
freq,1,,,3
mean,,26.25,5.4,
std,,4.349329,0.535413,
min,,22.0,5.1,
25%,,23.5,5.1,
50%,,25.5,5.15,
75%,,28.25,5.45,


### Selecting Columns

#### Selecting a single column

We can access a column using Python dictionary syntax, i.e. by passing the name of the column as the key, like so:

`df['column_name']`

In [4]:
data['Name']

0     Jack
1    Peter
2     Gale
3     Anne
Name: Name, dtype: object
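
The return type depends on how the key is written. The following is a minimal sketch, assuming a small copy of the employee data; the names `df`, `as_series`, and `as_frame` are illustrative and not part of the notebook. A single key returns a Series, while a one-element list returns a DataFrame with one column.

```python
import pandas as pd

# Illustrative copy of part of the employee data used above
df = pd.DataFrame({'Name': ['Jack', 'Peter', 'Gale', 'Anne'],
                   'Age': [27, 24, 22, 32]})

as_series = df['Name']    # single key: one-dimensional Series
as_frame = df[['Name']]   # list with one key: DataFrame with one column

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```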

#### Selecting Multiple columns

We can pass a list of column names to select specific columns.

In [5]:
# Accessing several columns by name
print(data[['Name','Age']])

    Name  Age
0   Jack   27
1  Peter   24
2   Gale   22
3   Anne   32


#### Selecting columns using slicing

In [6]:
# select all rows and the second through fourth columns
data[data.columns[1:4]] # the slice includes the start index and excludes the end index

Unnamed: 0,Age,Height,Qualification
0,27,5.1,Msc
1,24,6.2,MA
2,22,5.1,Msc
3,32,5.2,Msc


#### Selecting a subset of rows and columns using loc[ ]

Access a group of rows and columns by label(s) or a boolean array.

Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html

In [7]:
# Using loc[]

# select rows labeled 1 through 3 (label slices include both endpoints) and two columns
data.loc[1:3, ['Name', 'Qualification']]

Unnamed: 0,Name,Qualification
1,Peter,MA
2,Gale,Msc
3,Anne,Msc


In [8]:
# Select a range of columns by label. Here we select the columns from 'Name' to 'Height', i.e. Name, Age, Height, and rows 0 and 1
data.loc[0:1,'Name':'Height']

Unnamed: 0,Name,Age,Height
0,Jack,27,5.1
1,Peter,24,6.2


In [9]:
data.loc[:,'Name' : 'Height'] # getting all rows

Unnamed: 0,Name,Age,Height
0,Jack,27,5.1
1,Peter,24,6.2
2,Gale,22,5.1
3,Anne,32,5.2


In [10]:
# df.loc[rows, columns]
# Filter rows by label first, then select all columns with ':'
# row labeled 0 (the first row), all columns
data.loc[0, :]

Name             Jack
Age                27
Height            5.1
Qualification     Msc
Name: 0, dtype: object

#### Selecting columns based on their data types


In [11]:
data.loc[:, (data.dtypes == 'float64').values]  # .values turns the boolean Series into a NumPy array used as a column mask

Unnamed: 0,Height
0,5.1
1,6.2
2,5.1
3,5.2
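
The same selection can also be written with `DataFrame.select_dtypes()`. The following is a minimal sketch on an illustrative copy of the data; `df`, `float_cols`, and `numeric_cols` are names chosen for this example only.

```python
import pandas as pd

# Illustrative copy of part of the employee data used above
df = pd.DataFrame({'Name': ['Jack', 'Peter', 'Gale', 'Anne'],
                   'Age': [27, 24, 22, 32],
                   'Height': [5.1, 6.2, 5.1, 5.2]})

# Keep only the float64 columns
float_cols = df.select_dtypes(include='float64')

# 'number' matches every numeric dtype (integers and floats)
numeric_cols = df.select_dtypes(include='number')

print(list(float_cols.columns))
print(list(numeric_cols.columns))
```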


#### Selecting a subset of rows and columns using iloc[ ]

Purely integer-location based indexing for selection by position.

Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html

In [12]:
# Using iloc[]
# Remember that Python slices
# exclude the ending index.
# select all rows
# select the first two columns
data.iloc[:, 0:2]

Unnamed: 0,Name,Age
0,Jack,27
1,Peter,24
2,Gale,22
3,Anne,32


In [13]:
# iloc[row slicing, column slicing]
data.iloc[0:2, 1:3]

Unnamed: 0,Age,Height
0,27,5.1
1,24,6.2
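
One difference between the two indexers is easy to miss: a `loc` slice is label-based and includes both endpoints, while an `iloc` slice is position-based and excludes the end. The following is a minimal sketch on an illustrative DataFrame with the default integer index; the variable names are assumptions made for this example.

```python
import pandas as pd

# Illustrative copy of part of the employee data used above
df = pd.DataFrame({'Name': ['Jack', 'Peter', 'Gale', 'Anne'],
                   'Age': [27, 24, 22, 32]})

by_label = df.loc[1:3, 'Name']    # labels 1, 2 and 3: three rows
by_position = df.iloc[1:3, 0]     # positions 1 and 2: two rows

print(len(by_label), len(by_position))
print(list(by_position))
```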


### Deleting Columns

The `drop()` method deletes columns from a Pandas DataFrame. Columns are deleted by passing their names together with `axis=1`.

In [14]:
import pandas as pd

data = {'Name': ['Jack', 'Peter', 'Gale', 'Anne'],
        'Age':[27, 24, 22, 32],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}

data = pd.DataFrame(data)
data

Unnamed: 0,Name,Age,Height,Qualification
0,Jack,27,5.1,Msc
1,Peter,24,6.2,MA
2,Gale,22,5.1,Msc
3,Anne,32,5.2,Msc


In [15]:
# drop the listed columns in place
data.drop(["Height"], axis = 1, inplace = True)
  
# display
data.head()

Unnamed: 0,Name,Age,Qualification
0,Jack,27,Msc
1,Peter,24,MA
2,Gale,22,Msc
3,Anne,32,Msc
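
With `inplace=True` the original DataFrame is modified. The following is a minimal sketch of the non-destructive alternative, assuming an illustrative DataFrame `df`: without `inplace`, `drop()` returns a new DataFrame, and the `columns=` keyword can stand in for `axis=1`.

```python
import pandas as pd

# Illustrative copy of part of the employee data used above
df = pd.DataFrame({'Name': ['Jack', 'Peter'],
                   'Age': [27, 24],
                   'Height': [5.1, 6.2]})

# drop() returns a new DataFrame; df itself is left untouched
trimmed = df.drop(columns=['Height'])

print(list(trimmed.columns))
print(list(df.columns))
```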


### Renaming Columns

In [16]:
data.head()

Unnamed: 0,Name,Age,Qualification
0,Jack,27,Msc
1,Peter,24,MA
2,Gale,22,Msc
3,Anne,32,Msc


To rename columns, pass a dictionary to the `columns` argument whose keys are the old column names and whose values are the new ones.

In [17]:
data.rename(columns = {'Qualification':'Education'}, inplace = True)
   
# After renaming the column
print("\nAfter renaming the column:\n", data.columns)


After renaming the column:
 Index(['Name', 'Age', 'Education'], dtype='object')


In [18]:
data

Unnamed: 0,Name,Age,Education
0,Jack,27,Msc
1,Peter,24,MA
2,Gale,22,Msc
3,Anne,32,Msc
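
The dictionary can rename several columns at once, and `rename()` also accepts a function that is applied to every column name. The following is a minimal sketch on an illustrative one-row DataFrame; the new names `Employee` and `Education` are chosen only for this example.

```python
import pandas as pd

# Illustrative one-row DataFrame with the original column names
df = pd.DataFrame({'Name': ['Jack'], 'Age': [27], 'Qualification': ['Msc']})

# Rename two columns in one call; columns not in the dictionary are kept as-is
renamed = df.rename(columns={'Name': 'Employee', 'Qualification': 'Education'})

# Apply a function to every column name
lowered = df.rename(columns=str.lower)

print(list(renamed.columns))
print(list(lowered.columns))
```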


### Adding columns


#### Adding a column by declaring a new list

In [19]:
# To add a column to a Pandas DataFrame, we can declare a new list and assign it as a column of the existing DataFrame.
 
# Declare a list that is to be converted into a column
address = ['Delaware', 'Boston', 'California', 'Atlanta']
  
# Using 'Address' as the column name
# and equating it to the list
data['Address'] = address
  
# Observe the result
print(data)

    Name  Age Education     Address
0   Jack   27       Msc    Delaware
1  Peter   24        MA      Boston
2   Gale   22       Msc  California
3   Anne   32       Msc     Atlanta
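
A new column does not have to come from a list; it can also be computed from existing columns. The following is a minimal sketch, assuming an illustrative DataFrame `df`; the column names `Age_in_5_years` and `Over_25` are hypothetical. Direct assignment modifies `df`, while `assign()` returns a new DataFrame.

```python
import pandas as pd

# Illustrative copy of part of the employee data used above
df = pd.DataFrame({'Name': ['Jack', 'Peter'], 'Age': [27, 24]})

# Derive a column from an existing one (modifies df)
df['Age_in_5_years'] = df['Age'] + 5

# assign() returns a new DataFrame and leaves df unchanged
with_flag = df.assign(Over_25=df['Age'] > 25)

print(df)
print(with_flag)
```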


## Row operations

Now we will learn how to do the following:
1. Selecting rows
2. Adding a row
3. Deleting a row

### Selecting rows

#### Retrieving row by loc method

In [20]:

# setting name as index column
data.set_index("Name", inplace = True)
 
# display
data.head()


Name,Age,Education,Address
Jack,27,Msc,Delaware
Peter,24,MA,Boston
Gale,22,Msc,California
Anne,32,Msc,Atlanta


In [21]:

first = data.loc["Jack"]      # row indexed by 'Jack' is assigned to first
second = data.loc["Peter"]    
  
  
print(first, "\n\n\n", second)

Age                27
Education         Msc
Address      Delaware
Name: Jack, dtype: object 


 Age              24
Education        MA
Address      Boston
Name: Peter, dtype: object


Note that each result is a Series, not a DataFrame.

#### Getting rows based on a condition - Boolean subsetting

In [22]:
data['Age'] >= 28 # returns a boolean Series indexed by Name

Name
Jack     False
Peter    False
Gale     False
Anne      True
Name: Age, dtype: bool

In [23]:
# Retrieving rows based on a condition is called Boolean subsetting. We use the .loc indexer.

select_age = data.loc[data['Age'] >= 26]
print(select_age)

      Age Education   Address
Name                         
Jack   27       Msc  Delaware
Anne   32       Msc   Atlanta
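
Several conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. The following is a minimal sketch on an illustrative copy of the data; the variable names are assumptions made for this example.

```python
import pandas as pd

# Illustrative copy of the employee data, indexed by Name
df = pd.DataFrame({'Name': ['Jack', 'Peter', 'Gale', 'Anne'],
                   'Age': [27, 24, 22, 32],
                   'Education': ['Msc', 'MA', 'Msc', 'Msc']}).set_index('Name')

# Both conditions must hold
msc_over_25 = df.loc[(df['Age'] > 25) & (df['Education'] == 'Msc')]

# Either condition may hold; isin() tests membership in a list of values
young_or_ma = df.loc[(df['Age'] < 23) | df['Education'].isin(['MA'])]

print(msc_over_25)
print(young_or_ma)
```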


Let us reset the index to the default integer index (0, 1, 2, ...). `drop=False` is important here: it keeps the old index as a column. `drop=True` discards the old index and assigns only the new one.

In [24]:
data.reset_index(drop=False, inplace=True) 
data

Unnamed: 0,Name,Age,Education,Address
0,Jack,27,Msc,Delaware
1,Peter,24,MA,Boston
2,Gale,22,Msc,California
3,Anne,32,Msc,Atlanta
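
The effect of the `drop` argument can be compared side by side. The following is a minimal sketch, assuming an illustrative DataFrame indexed by `Name`; `kept` and `dropped` are names chosen for this example.

```python
import pandas as pd

# Illustrative DataFrame whose index is the Name column
df = pd.DataFrame({'Name': ['Jack', 'Peter'], 'Age': [27, 24]}).set_index('Name')

kept = df.reset_index(drop=False)     # old index comes back as a 'Name' column
dropped = df.reset_index(drop=True)   # old index is discarded

print(list(kept.columns))
print(list(dropped.columns))
```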


### Adding a row

#### using concat()


The `concat()` function takes as its main argument `objs`, a sequence of DataFrame or Series objects.
Another key argument is `axis`.

With `axis=0` (the default), rows are stacked. If both DataFrames contain rows with the same index label, both rows are retained and the index is neither changed nor reset.

With `axis=1`, the objects are concatenated column-wise, aligned on the index.

Where one object has no data for a given index label or column, the gap is filled with NaN.
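
These behaviours can be seen on two tiny DataFrames. The following is a minimal sketch; `top`, `bottom`, and `extra` are illustrative names and values, not part of the notebook's data.

```python
import pandas as pd

top = pd.DataFrame({'Name': ['Jack'], 'Age': [27]})
bottom = pd.DataFrame({'Name': ['Peter'], 'Age': [24]})

# axis=0 stacks rows; the duplicate index label 0 is kept as-is
stacked = pd.concat([top, bottom], axis=0)

# ignore_index=True renumbers the result 0, 1, ...
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)

# axis=1 places columns side by side, aligned on the index;
# 'top' has no row labeled 1, so Name and Age are NaN in that row
extra = pd.DataFrame({'Height': [5.1, 6.2]})
side_by_side = pd.concat([top, extra], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```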

In [25]:
# build a one-row DataFrame for the new record
new_row = pd.DataFrame({'Name': ['Bill'], 'Age': [54], 'Education': ['MBA'], 'Address': ['New york']})

# simply concat both dataframes
data_2 = pd.concat([new_row, data],axis = 0)
data_2

Unnamed: 0,Name,Age,Education,Address
0,Bill,54,MBA,New york
0,Jack,27,Msc,Delaware
1,Peter,24,MA,Boston
2,Gale,22,Msc,California
3,Anne,32,Msc,Atlanta


In [26]:
data_2.reset_index(drop = True, inplace=True)
data_2

Unnamed: 0,Name,Age,Education,Address
0,Bill,54,MBA,New york
1,Jack,27,Msc,Delaware
2,Peter,24,MA,Boston
3,Gale,22,Msc,California
4,Anne,32,Msc,Atlanta


If the new row is missing a column, that value is filled with NaN in the concatenated DataFrame.

In [27]:
new_row_2 = pd.DataFrame({'Name':['John'], 'Age': [54], 'Education' :['MBA']})
new_row_2

data_3 = pd.concat([new_row_2, data_2],axis = 0).reset_index(drop=True)
data_3

Unnamed: 0,Name,Age,Education,Address
0,John,54,MBA,
1,Bill,54,MBA,New york
2,Jack,27,Msc,Delaware
3,Peter,24,MA,Boston
4,Gale,22,Msc,California
5,Anne,32,Msc,Atlanta


#### using append()
`append()` is a special case of `concat()`: it concatenates the second DataFrame's records at the end of the first.

Note that `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell in this section runs only on older versions. On current versions, use `pd.concat()` instead.

`append()` has no `axis` argument.

Its syntax also differs from `concat()`: `append()` treats the calling DataFrame as the main object and adds to it the rows of the DataFrames passed as arguments.

If any of the DataFrames contains columns that do not exist in the calling DataFrame, they are added as new columns.

Columns present in both DataFrames are aligned by name and keep their names.

In [28]:
dt = {'Name': 'Vikram', 'Age': 56, 'Education':'MBA','Address':'Austin'}

data = data.append(dt, ignore_index = True)

data

Unnamed: 0,Name,Age,Education,Address
0,Jack,27,Msc,Delaware
1,Peter,24,MA,Boston
2,Gale,22,Msc,California
3,Anne,32,Msc,Atlanta
4,Vikram,56,MBA,Austin


Performance: which is faster, pandas `concat()` or `append()`?

The exact difference depends on the data, but the pattern is consistent.

`append()` is built on top of `concat()`, so a single call costs about the same either way. The difference appears when rows are added repeatedly.

`append()` creates a new DataFrame on every call instead of modifying the existing one, copying the existing data each time. Because of this repeated copying, a series of `append()` calls is slower than a single `concat()` over all the pieces.

`append()` is therefore fine when the DataFrame is small or only a few append operations are needed. If many append operations are needed, it is better to collect the pieces and call `concat()` once.
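
The "collect first, concatenate once" pattern can be sketched as follows. This is a minimal example with hypothetical data; `base`, `incoming`, and `combined` are illustrative names.

```python
import pandas as pd

base = pd.DataFrame({'Name': ['Jack'], 'Age': [27]})

# Rows arriving one at a time are gathered in a plain Python list...
incoming = [
    {'Name': 'Bill', 'Age': 54},
    {'Name': 'John', 'Age': 54},
    {'Name': 'Vikram', 'Age': 56},
]

# ...and joined to the existing DataFrame in a single concat() call
combined = pd.concat([base, pd.DataFrame(incoming)], ignore_index=True)

print(combined)
```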

### Deleting a row

Use the `drop()` method. Rows are deleted by passing their index labels.

In [29]:
data.set_index("Name", inplace = True) # set the index
# drop the row labeled 'Peter' in place
data.drop(["Peter"], inplace = True)
  
# display
data

Name,Age,Education,Address
Jack,27,Msc,Delaware
Gale,22,Msc,California
Anne,32,Msc,Atlanta
Vikram,56,MBA,Austin
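
A few variations on row deletion, as a minimal sketch on an illustrative DataFrame (the label `'Nobody'` is a made-up label that does not exist in the index): dropping without `inplace`, tolerating missing labels, and removing rows by condition by keeping the complement.

```python
import pandas as pd

# Illustrative copy of part of the employee data, indexed by Name
df = pd.DataFrame({'Name': ['Jack', 'Peter', 'Gale'],
                   'Age': [27, 24, 22]}).set_index('Name')

# Returns a new DataFrame; df is left untouched
without_peter = df.drop(index=['Peter'])

# A missing label raises KeyError unless errors='ignore' is passed
unchanged = df.drop(index=['Nobody'], errors='ignore')

# "Delete rows where Age <= 23" is written as "keep rows where Age > 23"
older_than_23 = df[df['Age'] > 23]

print(without_peter)
print(older_than_23)
```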
