![Venn Diagram](merging1.png)

This is a Venn diagram. The circle on the left is the population of students at a university, and the circle on the right is the population of staff. The overlapping region in the middle contains the students who are also staff; perhaps these students take part in a research activity or help with some graded assignment.

In [1]:
#We will now see how to use the merge function to combine dataframes and get the information we need

import pandas as pd
import numpy as np

#We will now create two dataframes: one for staff and one for students

staff_df = pd.DataFrame([{"Name":'Amit','Role':'Marketing HOD'},
                         {'Name':'Priya','Role':'Controller of Examination'},
                        {'Name':'Mohit','Role':'Researcher'}])

staff_df = staff_df.set_index('Name')


student_df = pd.DataFrame([{'Name':'Manoj','School':'Business'},
                          {'Name':'Priya','School':'Law'},
                          {'Name':'Mohit','School':'Engineering'}])


student_df = student_df.set_index('Name')

print(staff_df)
print(student_df)

                            Role
Name                            
Amit               Marketing HOD
Priya  Controller of Examination
Mohit                 Researcher
            School
Name              
Manoj     Business
Priya          Law
Mohit  Engineering


In [2]:
#In these two dataframes, Mohit and Priya appear as both students and staff.
#To get the union of the two dataframes (an outer join), we call merge with how='outer' and
#use the left and right indices as the joining columns

#left_index=True and right_index=True tell merge to use the index of each dataframe as the join key
pd.merge(staff_df,student_df,left_index=True,right_index=True, how='outer')

                            Role       School
Name
Amit               Marketing HOD          NaN
Manoj                        NaN     Business
Mohit                 Researcher  Engineering
Priya  Controller of Examination          Law
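
When checking an outer join, it can help to see which side each row came from. The sketch below rebuilds the two small dataframes from above and passes `indicator=True`, an existing `pd.merge` option that adds a `_merge` column. The variable name `outer` is only for illustration.

```python
import pandas as pd

# Same toy data as above, rebuilt so the example is self-contained
staff_df = pd.DataFrame({"Name": ["Amit", "Priya", "Mohit"],
                         "Role": ["Marketing HOD", "Controller of Examination", "Researcher"]}).set_index("Name")
student_df = pd.DataFrame({"Name": ["Manoj", "Priya", "Mohit"],
                           "School": ["Business", "Law", "Engineering"]}).set_index("Name")

# indicator=True adds a '_merge' column holding 'left_only', 'right_only' or 'both'
outer = pd.merge(staff_df, student_df, left_index=True, right_index=True,
                 how="outer", indicator=True)
print(outer)
```

Here Amit should be marked `left_only`, Manoj `right_only`, and Priya and Mohit `both`.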


In [3]:
#Now if we want the intersection of these two dfs, we use an inner join
pd.merge(staff_df,student_df,left_index=True,right_index=True, how='inner')

#By default, merge uses an inner join

                            Role       School
Name
Priya  Controller of Examination          Law
Mohit                 Researcher  Engineering


In [4]:
#Now we want a list of all staff, regardless of whether they are students. If a staff member is also a student, we 
#want their student details as well. To do this, we use a left join. Note that the order of the DataFrames passed to 
#merge matters: the first DataFrame is the left DataFrame and the second is the right.

pd.merge(staff_df,student_df,how='left',left_index=True,right_index=True)

                            Role       School
Name
Amit               Marketing HOD          NaN
Priya  Controller of Examination          Law
Mohit                 Researcher  Engineering


In [5]:
#We want a list of all of the students and their roles if they are also staff. To do this, we would do a right join.

pd.merge(staff_df,student_df,how='right',left_index=True,right_index=True)

                            Role       School
Name
Manoj                        NaN     Business
Priya  Controller of Examination          Law
Mohit                 Researcher  Engineering


In [6]:
#Now we will use the 'on' parameter of merge instead of left_index and right_index

staff_df = staff_df.reset_index()
student_df = student_df.reset_index()

pd.merge(staff_df,student_df,how='inner',on='Name')

    Name                       Role       School
0  Priya  Controller of Examination          Law
1  Mohit                 Researcher  Engineering
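
The `on` parameter assumes the key column has the same name in both dataframes. As a hedged sketch of the other case, suppose the student table called its key `Student Name` (a hypothetical column name, not part of the data above); `pd.merge` then accepts `left_on` and `right_on`.

```python
import pandas as pd

staff_df = pd.DataFrame({"Name": ["Amit", "Priya", "Mohit"],
                         "Role": ["Marketing HOD", "Controller of Examination", "Researcher"]})
# Hypothetical variant of the student data: the key column has a different name
student_df = pd.DataFrame({"Student Name": ["Manoj", "Priya", "Mohit"],
                           "School": ["Business", "Law", "Engineering"]})

# left_on / right_on name the key column on each side of the join
both = pd.merge(staff_df, student_df, how="inner",
                left_on="Name", right_on="Student Name")
print(both)
```

Both key columns are kept in the result, so one of them can be dropped afterwards if it is redundant.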


In [7]:
#We will now see merge in some more depth

staff_df = pd.DataFrame([{"Name":'Amit','Role':'Marketing HOD','Location':'Office 22'},
                         {'Name':'Priya','Role':'Controller of Examination','Location':'Office 26'},
                        {'Name':'Mohit','Role':'Researcher','Location':'Office 35'}])




student_df = pd.DataFrame([{'Name':'Manoj','School':'Business','Location':'Indra Nagar'},
                          {'Name':'Priya','School':'Law','Location':'RT Nagar'},
                          {'Name':'Mohit','School':'Engineering','Location':'Ejipura'}])

In [8]:
#In the staff DataFrame, Location is the office where we can find the staff member, but in the student DataFrame, 
#Location is actually the home address. The merge function preserves both columns, and appends either 
#_x or _y to the column name to show which DataFrame each column came from. The _x suffix is 
#always the left DataFrame's information and the _y suffix is always the right DataFrame's information.

#Here we get the details of the staff members and, for those who are also students, we can compare their two locations

pd.merge(staff_df,student_df,how='left',on='Name')

    Name                       Role  Location_x       School  Location_y
0   Amit              Marketing HOD   Office 22          NaN         NaN
1  Priya  Controller of Examination   Office 26          Law    RT Nagar
2  Mohit                 Researcher   Office 35  Engineering     Ejipura


In [9]:
#We can change the suffixes of the Location columns if we want

pd.merge(staff_df,student_df,suffixes=('_office','_home'), how='left',on='Name')

    Name                       Role  Location_office       School  Location_home
0   Amit              Marketing HOD        Office 22          NaN            NaN
1  Priya  Controller of Examination        Office 26          Law       RT Nagar
2  Mohit                 Researcher        Office 35  Engineering        Ejipura


In [10]:
#We will now talk about multi-indexing and multiple columns. 

#It is quite possible that more than one column is needed to match rows between the two dfs;
#in that case we pass a list of those column names to the "on" parameter. 

staff_df = pd.DataFrame([{"first_name":'Amit','last_name':'Kapoor','Role':'Marketing HOD'},
                         {'first_name':'Priya','last_name':'Gupta','Role':'Controller of Examination'},
                        {'first_name':'Mohit','last_name':'Aggarwal','Role':'Researcher'}])



student_df = pd.DataFrame([{'first_name':'Manoj','last_name':'Kapoor','School':'Business'},
                          {'first_name':'Priya','last_name':'Gupta','School':'Law'},
                          {'first_name':'Mohit','last_name':'Aggarwal','School':'Engineering'}])

print(staff_df)
print(student_df)


  first_name last_name                       Role
0       Amit    Kapoor              Marketing HOD
1      Priya     Gupta  Controller of Examination
2      Mohit  Aggarwal                 Researcher
  first_name last_name       School
0      Manoj    Kapoor     Business
1      Priya     Gupta          Law
2      Mohit  Aggarwal  Engineering


In [11]:
#We will now perform inner join on the dataframes 

pd.merge(staff_df,student_df,on=['first_name','last_name'], how='inner')

  first_name last_name                       Role       School
0      Priya     Gupta  Controller of Examination          Law
1      Mohit  Aggarwal                 Researcher  Engineering
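
To see why both columns are needed as the key, the sketch below merges the same toy data on `last_name` alone and compares it with the two-column merge. The names `by_last` and `by_both` are illustrative only.

```python
import pandas as pd

staff_df = pd.DataFrame({"first_name": ["Amit", "Priya", "Mohit"],
                         "last_name": ["Kapoor", "Gupta", "Aggarwal"],
                         "Role": ["Marketing HOD", "Controller of Examination", "Researcher"]})
student_df = pd.DataFrame({"first_name": ["Manoj", "Priya", "Mohit"],
                           "last_name": ["Kapoor", "Gupta", "Aggarwal"],
                           "School": ["Business", "Law", "Engineering"]})

# Joining on last_name only wrongly pairs Amit Kapoor with Manoj Kapoor
by_last = pd.merge(staff_df, student_df, on="last_name", how="inner")

# Joining on both columns requires first and last name to match
by_both = pd.merge(staff_df, student_df, on=["first_name", "last_name"], how="inner")

print(len(by_last), len(by_both))
```

In the single-column merge, the non-key `first_name` column also overlaps, so it comes back as `first_name_x` and `first_name_y`.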


In [14]:
#If we think of merging as joining "horizontally", meaning we join on similar values in a column found in two dataframes, 
#then concatenating is joining "vertically", meaning we stack dataframes on top of one another

#We will see some examples now 

#Let's take a look at the US Department of Education College Scorecard data. It has each US university's data on student 
#completion, student debt, after-graduation income, and more. The data is stored in separate CSVs, with each CSV containing 
#a year's records. Let's say we want the records from 2011-2013. We first create three dataframes, each containing one year's 
#records. Because the CSV files we're working with are messy, we tell read_csv to skip bad lines with error_bad_lines=False. 
#pandas reports each skipped line, as shown in the output below. Starting the cell with the cell magic %%capture 
#would suppress that output. 



#Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on those versions use on_bad_lines='warn' (or 'skip') instead
df1 = pd.read_csv('datasets/college_scorecard/MERGED2011_12_PP.csv',error_bad_lines=False)
df2 = pd.read_csv('datasets/college_scorecard/MERGED2012_13_PP.csv',error_bad_lines=False)
df3 = pd.read_csv('datasets/college_scorecard/MERGED2013_14_PP.csv',error_bad_lines=False)

b'Skipping line 309: expected 1977 fields, saw 2485\nSkipping line 514: expected 1977 fields, saw 2552\n'
b'Skipping line 615: expected 1977 fields, saw 2425\nSkipping line 922: expected 1977 fields, saw 2270\nSkipping line 1026: expected 1977 fields, saw 2199\n'
b'Skipping line 1129: expected 1977 fields, saw 2212\nSkipping line 1439: expected 1977 fields, saw 2476\nSkipping line 1543: expected 1977 fields, saw 2103\n'
b'Skipping line 1852: expected 1977 fields, saw 3023\n'
b'Skipping line 2161: expected 1977 fields, saw 2372\nSkipping line 2473: expected 1977 fields, saw 2559\n'
b'Skipping line 2576: expected 1977 fields, saw 2328\nSkipping line 2880: expected 1977 fields, saw 3153\nSkipping line 2981: expected 1977 fields, saw 2103\n'
b'Skipping line 3388: expected 1977 fields, saw 2074\nSkipping line 3489: expected 1977 fields, saw 3116\n'
b'Skipping line 3694: expected 1977 fields, saw 2135\nSkipping line 4006: expected 1977 fields, saw 2585\n'
b'Skipping line 4212: expected 1977 
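
As a small, hedged illustration of skipping malformed rows, the sketch below reads a tiny in-memory CSV rather than the Scorecard files. It assumes pandas 1.3 or later, where `on_bad_lines` is the supported parameter; the data and the name `df_small` are made up for the example.

```python
import io
import pandas as pd

# A made-up CSV whose third line has one field too many
raw = "a,b\n1,2\n3,4,5\n6,7\n"

# on_bad_lines='skip' drops lines with too many fields instead of raising an error
df_small = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(df_small)
```

Only the two well-formed data rows should remain in `df_small`.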

In [16]:
df1.head(2)

Unnamed: 0,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
0,100654.0,100200.0,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
1,100663.0,105200.0,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,


In [20]:
print(df1.shape)
print(df2.shape)
print(df3.shape)

(15235, 1977)
(7793, 1977)
(7804, 1977)


In [19]:
print(len(df1))
print(len(df2))
print(len(df3))

15235
7793
7804


In [21]:
#We will now pass these dataframes to pd.concat and concatenate them into one dataframe
df_list = [df1,df2,df3]

merged_df =pd.concat(df_list)
len(merged_df)

30832

In [22]:
#we will verify whether all the dfs merged successfully or not
len(df1)+len(df2)+len(df3)

30832

In [23]:
#But there is a problem: now we don't know which record belongs to which year. pd.concat has a solution to this problem:
#we pass the keys parameter, which labels the rows from each dataframe with an extra index level

merged_df = pd.concat(df_list,keys=['2011','2012','2013'])
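
The effect of `keys` is easier to see on small data. In the sketch below, `y2011` and `y2012` are made-up stand-ins for the yearly dataframes, and `score` is a hypothetical column.

```python
import pandas as pd

# Made-up stand-ins for two yearly files
y2011 = pd.DataFrame({"INSTNM": ["A", "B"], "score": [1, 2]})
y2012 = pd.DataFrame({"INSTNM": ["A", "C"], "score": [3, 4]})

# keys adds an outer index level identifying the source dataframe
combined = pd.concat([y2011, y2012], keys=["2011", "2012"])
print(combined)

# Rows from a single year can then be selected by that outer level
print(combined.loc["2012"])
```

The result has a two-level index: the key on the outside and the original row labels on the inside.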

In [24]:
#If you're concatenating two dataframes that do not have identical columns and choose join='outer' (the default), some cells will 
#be NaN. If you choose join='inner', only the columns shared by all the dataframes are kept and the others are dropped. You can think of this as 
#analogous to the outer and inner joins of the merge function.
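
A minimal sketch of the two `join` options of `pd.concat` when stacking rows, using two made-up dataframes `a` and `b` that share only the column `x`:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [5], "z": [6]})

# join='outer' (the default) keeps every column and fills the gaps with NaN
outer = pd.concat([a, b], join="outer")

# join='inner' keeps only the columns present in every dataframe
inner = pd.concat([a, b], join="inner")

print(outer)
print(inner)
```

In both cases all three rows are kept; for row-wise concatenation the `join` option decides which columns survive.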

## Grouping

The idea behind groupby is that it takes a data frame, splits it into chunks based on some key values, applies a computation to each chunk, and then combines the results back together into another data frame. In pandas this is referred to as the split-apply-combine pattern.
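
To make the three steps concrete, here is a rough sketch that performs split, apply, and combine by hand on a made-up dataframe and compares the result with the built-in one-liner. The names `df_toy`, `pieces`, `sums`, `manual`, and `built_in` are illustrative only.

```python
import pandas as pd

df_toy = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: one sub-dataframe per key value
pieces = {k: g for k, g in df_toy.groupby("key")}

# Apply: compute something on each piece
sums = {k: g["value"].sum() for k, g in pieces.items()}

# Combine: put the per-group results back into one object
manual = pd.Series(sums, name="value")

# The same thing in a single groupby expression
built_in = df_toy.groupby("key")["value"].sum()
print(manual)
print(built_in)
```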

In [38]:
import pandas as pd
import numpy as np

In [39]:
#we will read some US census data 

df = pd.read_csv('datasets/census.csv')
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


In [40]:
#We will keep only those rows which have a summary level (SUMLEV) of 50, which are the county-level rows

df = df[df['SUMLEV']==50]
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [41]:
#Let us see the average 2010 census population of the counties in each state
for group,frame in df.groupby('STNAME'):
    average= np.average(frame['CENSUS2010POP'])
    print("Population of state "+ group + 
         " is " + str(average))

Population of state Alabama is 71339.34328358209
Population of state Alaska is 24490.724137931036
Population of state Arizona is 426134.4666666667
Population of state Arkansas is 38878.90666666667
Population of state California is 642309.5862068966
Population of state Colorado is 78581.1875
Population of state Connecticut is 446762.125
Population of state Delaware is 299311.3333333333
Population of state District of Columbia is 601723.0
Population of state Florida is 280616.5671641791
Population of state Georgia is 60928.63522012578
Population of state Hawaii is 272060.2
Population of state Idaho is 35626.86363636364
Population of state Illinois is 125790.50980392157
Population of state Indiana is 70476.10869565218
Population of state Iowa is 30771.262626262625
Population of state Kansas is 27172.55238095238
Population of state Kentucky is 36161.39166666667
Population of state Louisiana is 70833.9375
Population of state Maine is 83022.5625
Population of state Maryland is 240564.66666666666
Population of st
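
The loop above is useful for seeing what each group contains, but the same per-state average can usually be computed without an explicit loop. A minimal sketch on made-up numbers (the `counties` dataframe below is not the census data):

```python
import pandas as pd

# Made-up county populations, reusing the census column names
counties = pd.DataFrame({"STNAME": ["Alabama", "Alabama", "Alaska"],
                         "CENSUS2010POP": [100, 300, 50]})

# Select the column on the GroupBy object, then aggregate
avg = counties.groupby("STNAME")["CENSUS2010POP"].mean()
print(avg)
```

The result is a Series indexed by state name, which is often more convenient for further work than printed strings.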

In [42]:
#We'll create a new function called set_batch. If the first letter of the state name comes before capital M, we'll 
#return 0. If it comes before capital Q, we'll return 1. Otherwise, we'll return 2. Then we'll pass this function to 
#groupby on the DataFrame. 
df = df.set_index('STNAME')
def set_batch(item):
    if item[0]<'M':
        return 0
    if item[0] <'Q':
        return 1
    else:
        return 2

In [43]:
for group,frame in df.groupby(set_batch):
    print("There are " + str(len(frame)) +  ' records in group ' + str(group))

There are 1177 records in group 0
There are 1134 records in group 1
There are 831 records in group 2


In [44]:
#Notice that I did not pass a column name to groupby(). Instead, I set the index of the dataframe to STNAME, and when a function
#is passed to groupby, pandas calls it on each index value and uses the returned value as the group key
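
A small, self-contained version of the same idea, using four made-up rows so the group assignment can be checked by eye. The `pop` column and the `states` dataframe are hypothetical.

```python
import pandas as pd

states = pd.DataFrame({"STNAME": ["Alabama", "Maine", "Texas", "Ohio"],
                       "pop": [1, 2, 3, 4]}).set_index("STNAME")

def set_batch(item):
    # item is one index value (a state name); compare its first letter
    if item[0] < "M":
        return 0
    if item[0] < "Q":
        return 1
    return 2

# groupby calls set_batch on every index label to decide the group
sizes = states.groupby(set_batch).size()
print(sizes)
```

Alabama falls in group 0, Maine and Ohio in group 1, and Texas in group 2.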

In [45]:
#Let us read another dataset 
df = pd.read_csv('datasets/listings.csv')
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,9.0,f,,,t,moderate,f,f,1,1.3
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,f,,,f,moderate,t,f,1,0.47
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,f,,,f,moderate,f,f,1,1.0
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,f,,,f,flexible,f,f,1,2.25


In [46]:
#In the above df we want to group by two columns: cancellation_policy and review_scores_value
#There are 2 ways to do that
#First, we will set these two columns as a multilevel index, so we won't need to pass column names to groupby

df.set_index(['cancellation_policy','review_scores_value'],inplace=True)

for group,level in df.groupby(level=(0,1)):
    print(group)

('flexible', 2.0)
('flexible', 4.0)
('flexible', 5.0)
('flexible', 6.0)
('flexible', 7.0)
('flexible', 8.0)
('flexible', 9.0)
('flexible', 10.0)
('moderate', 2.0)
('moderate', 4.0)
('moderate', 6.0)
('moderate', 7.0)
('moderate', 8.0)
('moderate', 9.0)
('moderate', 10.0)
('strict', 2.0)
('strict', 3.0)
('strict', 4.0)
('strict', 5.0)
('strict', 6.0)
('strict', 7.0)
('strict', 8.0)
('strict', 9.0)
('strict', 10.0)
('super_strict_30', 6.0)
('super_strict_30', 7.0)
('super_strict_30', 8.0)
('super_strict_30', 9.0)
('super_strict_30', 10.0)


In [47]:


def grouping_func(item):
    if item[1]==10.0:
        return (item[0],"10.0")
    else:
        return (item[0],"Not 10.00")
    
for group,frame in df.groupby(by=grouping_func):
    print(group)

('flexible', '10.0')
('flexible', 'Not 10.00')
('moderate', '10.0')
('moderate', 'Not 10.00')
('strict', '10.0')
('strict', 'Not 10.00')
('super_strict_30', '10.0')
('super_strict_30', 'Not 10.00')


In [48]:
#There is another method to do the same thing 
df = pd.read_csv('datasets/listings.csv')

In [49]:
df['review_scores_value']= np.where(df['review_scores_value']==10.00,'10.00','Not 10.00')
for group, label in df.groupby(by=['cancellation_policy','review_scores_value']):
    print(group)

('flexible', '10.00')
('flexible', 'Not 10.00')
('moderate', '10.00')
('moderate', 'Not 10.00')
('strict', '10.00')
('strict', 'Not 10.00')
('super_strict_30', '10.00')
('super_strict_30', 'Not 10.00')


In [50]:
#The pandas developers describe three broad categories of data processing that happen during the apply step: aggregation 
#of group data, transformation of group data, and filtration of group data

### Aggregation

In [51]:
#One kind of apply step in split-apply-combine is the aggregation of data
#Let us use our last dataset

#Let us reload it

df = pd.read_csv('datasets/listings.csv')
df.groupby('cancellation_policy').agg({'review_scores_value':np.nanmean})

#Instead of using agg, we can also use the code below to get the same result
#df.groupby('cancellation_policy',as_index=False)['review_scores_value'].mean()

                     review_scores_value
cancellation_policy
flexible                        9.237421
moderate                        9.307398
strict                          9.081441
super_strict_30                 8.537313


In [52]:
#We can aggregate multiple columns and pass multiple functions for each column we want to aggregate 
df.groupby('cancellation_policy').agg({'review_scores_value':(np.nanmean,np.nanstd,np.nanmedian),
                                      'reviews_per_month':(np.nanmean,np.nanstd,np.nanmedian)})

                     review_scores_value                       reviews_per_month
                                 nanmean    nanstd  nanmedian            nanmean    nanstd  nanmedian
cancellation_policy
flexible                        9.237421  1.096271       10.0            1.82921  2.005405        1.0
moderate                        9.307398  0.859859        9.0           2.391922  2.432009      1.615
strict                          9.081441  1.040531        9.0           1.873467    1.9538       1.21
super_strict_30                 8.537313  0.840785        9.0           0.340143  0.752392       0.14
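
The dictionary form above produces two-level column names. If flat, self-chosen column names are preferred, pandas also supports named aggregation. The sketch below uses a made-up four-row `listings` dataframe, and the output names `mean_score` and `n_scores` are arbitrary.

```python
import pandas as pd

listings = pd.DataFrame({
    "cancellation_policy": ["flexible", "flexible", "strict", "strict"],
    "review_scores_value": [10.0, None, 8.0, 9.0],
})

# Named aggregation: output_name=(input_column, function)
summary = listings.groupby("cancellation_policy").agg(
    mean_score=("review_scores_value", "mean"),    # NaN values are skipped
    n_scores=("review_scores_value", "count"),     # counts non-missing values
)
print(summary)
```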


### Transformation

In [59]:
#Transformation is different from aggregation. agg() returns a single value per group, whereas transform() returns an object 
#that is the same size as the input. Essentially, it broadcasts the result of the function you supply back over the rows of each group, returning a new df.

#Selecting with double brackets keeps the result a DataFrame, so rename(columns=...) applies
transformed_df = df.groupby('cancellation_policy',as_index=False)[['review_scores_value']].transform(np.nanmean)
transformed_df = transformed_df.rename(columns={'review_scores_value':'mean_review_scores_value'})
transformed_df.head()

   mean_review_scores_value
0                  9.307398
1                  9.307398
2                  9.307398
3                  9.307398
4                  9.237421
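
The difference in shape between aggregation and transformation is easiest to see side by side. A minimal sketch on a made-up three-row dataframe (`toy`, `policy`, and `score` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({"policy": ["a", "a", "b"], "score": [8.0, 10.0, 6.0]})

# Aggregation: one value per group
per_group = toy.groupby("policy")["score"].mean()

# Transformation: one value per original row, aligned with toy's index
per_row = toy.groupby("policy")["score"].transform("mean")

print(per_group)
print(per_row)
```

Because `per_row` shares the index of `toy`, it can be assigned straight back as a new column.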


In [61]:
#We will now concatenate this to the main df; alternatively, we can perform a merge
df = pd.concat([df,transformed_df],axis=1)

#df.merge(transformed_df,left_index=True,right_index=True)

In [62]:
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_review_scores_value
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,f,,,f,moderate,f,f,1,,9.307398
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,f,,,t,moderate,f,f,1,1.3,9.307398
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,f,,,f,moderate,t,f,1,0.47,9.307398
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,f,,,f,moderate,f,f,1,1.0,9.307398
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,f,,,f,flexible,f,f,1,2.25,9.237421


In [63]:
#Now we will calculate the absolute difference between the actual and mean value of review_scores_value
df['difference'] = abs(df['review_scores_value']-df['mean_review_scores_value'])
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_review_scores_value,difference
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,,f,moderate,f,f,1,,9.307398,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,,,t,moderate,f,f,1,1.3,9.307398,0.307398
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,,,f,moderate,t,f,1,0.47,9.307398,0.692602
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,,,f,moderate,f,f,1,1.0,9.307398,0.692602
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,,,f,flexible,f,f,1,2.25,9.237421,0.762579


### Filter

In [66]:
#GroupBy objects have built-in support for filtering groups as well. Often you'll want to group by some feature, 
#then make some transformations to the groups, then drop certain groups as part of your cleaning routine. 
#The filter function takes a function which it applies to each group data frame and which returns either True or False, 
#depending on whether that group should be included in the results.

df.groupby('cancellation_policy').filter(lambda x: np.mean(x['review_scores_value'])>9.2)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_review_scores_value,difference
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,,f,moderate,f,f,1,,9.307398,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,,,t,moderate,f,f,1,1.30,9.307398,0.307398
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,,,f,moderate,t,f,1,0.47,9.307398,0.692602
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,,,f,moderate,f,f,1,1.00,9.307398,0.692602
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,,,f,flexible,f,f,1,2.25,9.237421,0.762579
5,12386020,https://www.airbnb.com/rooms/12386020,20160906204935,2016-09-07,Private Bedroom + Great Coffee,Super comfy bedroom plus your own bathroom in ...,Our sunny condo is located on the second and t...,Super comfy bedroom plus your own bathroom in ...,none,We love our corner of Roslindale! For quiet wa...,...,,,f,flexible,f,f,1,1.70,9.237421,0.762579
7,2843445,https://www.airbnb.com/rooms/2843445,20160906204935,2016-09-07,"""Tranquility"" on ""Top of the Hill""","We can accommodate guests who are gluten-free,...",We provide a bedroom and full shared bath. Ra...,"We can accommodate guests who are gluten-free,...",none,Our neighborhood is residential with friendly ...,...,,,f,moderate,t,t,2,2.38,9.307398,0.692602
8,753446,https://www.airbnb.com/rooms/753446,20160906204935,2016-09-07,6 miles away from downtown Boston!,Nice and cozy apartment about 6 miles away to ...,Nice and cozy apartment about 6 miles away to ...,Nice and cozy apartment about 6 miles away to ...,none,Roslindale is a primarily residential neighbor...,...,,,f,moderate,f,f,1,5.36,9.307398,0.692602
10,12023024,https://www.airbnb.com/rooms/12023024,20160906204935,2016-09-07,Cozy room in a well located house,The room is in a single family house located i...,,The room is in a single family house located i...,none,,...,,,f,flexible,f,f,1,0.36,9.237421,0.762579
11,1668313,https://www.airbnb.com/rooms/1668313,20160906204935,2016-09-07,Room in Rozzie-Twin Bed-Full Bath,Quiet second floor bedroom sleeps one in comfo...,,Quiet second floor bedroom sleeps one in comfo...,none,Our neighborhood is quiet and relaxed. There i...,...,,,f,flexible,f,f,2,0.48,9.237421,0.237421
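
On the full listings data it is hard to see which groups were dropped, so here is a rough sketch of the same filter on a made-up four-row dataframe (`toy`, `policy`, and `score` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({"policy": ["a", "a", "b", "b"],
                    "score": [9.0, 10.0, 8.0, 9.0]})

# Keep only the groups whose mean score is above 9.2;
# group 'a' has mean 9.5 and group 'b' has mean 8.5
kept = toy.groupby("policy").filter(lambda g: g["score"].mean() > 9.2)
print(kept)
```

Note that filter keeps or drops whole groups: every row of a passing group is returned, including rows that are individually below the threshold.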


### Apply

In [67]:
#The most common operation I invoke on GroupBy objects is the apply function. It lets you apply an arbitrary 
#function to each group and stitches the results of each call back together into a single data frame where the index 
#is preserved.

#Let's get a clean copy of the dataframe
df = pd.read_csv('datasets/listings.csv')
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,9.0,f,,,t,moderate,f,f,1,1.3
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,f,,,f,moderate,t,f,1,0.47
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,f,,,f,moderate,f,f,1,1.0
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,f,,,f,flexible,f,f,1,2.25


In [68]:
#We will just use two columns
df = df[['cancellation_policy','review_scores_value']]
df.head()

Unnamed: 0,cancellation_policy,review_scores_value
0,moderate,
1,moderate,9.0
2,moderate,10.0
3,moderate,10.0
4,flexible,10.0


In [70]:
#Now we will repeat what we did in the last step, but this time with the apply function
def cal_review_scores(group):
    # group is a DataFrame holding only the rows of one group, in this case one cancellation policy, so we can
    # treat it as the complete DataFrame that we're operating on
    average = np.nanmean(group['review_scores_value'])
    # Despite its name, this column stores each row's absolute deviation from the group's mean score
    group['mean_average_review_score'] = abs(average-group['review_scores_value'])
    return group

df.groupby('cancellation_policy').apply(cal_review_scores).head()
    

Unnamed: 0,cancellation_policy,review_scores_value,mean_average_review_score
0,moderate,,
1,moderate,9.0,0.307398
2,moderate,10.0,0.692602
3,moderate,10.0,0.692602
4,flexible,10.0,0.762579
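
The same per-group calculation can also be sketched with `transform`, which returns a result aligned with the original rows. This is only a minimal sketch on a hypothetical toy DataFrame; `toy` and the column name `abs_dev_from_group_mean` are illustrative and not part of the listings dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the listings dataset
toy = pd.DataFrame({
    "cancellation_policy": ["moderate", "moderate", "flexible", "flexible"],
    "review_scores_value": [9.0, 10.0, 8.0, None],
})

# transform("mean") broadcasts each group's mean (NaN values are skipped) back to its rows
group_mean = toy.groupby("cancellation_policy")["review_scores_value"].transform("mean")

# Absolute deviation of each row from its own group's mean
toy["abs_dev_from_group_mean"] = (toy["review_scores_value"] - group_mean).abs()
print(toy)
```

Here `apply` is the more general tool, while `transform` is a convenient shortcut when the result should have one value per original row.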


## Pivot Table

A pivot table is a way of summarizing data in a DataFrame for a particular purpose. It makes heavy use of the aggregation functions we've been talking about. A pivot table is itself a DataFrame, where the rows represent one variable that you're interested in, the columns another, and the cells some aggregate value. A pivot table also tends to include marginal values, which are the aggregates (for example, the sums) for each row and column. This lets you see the relationship between two variables at a glance.
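
To make the idea concrete before loading the real data, here is a minimal sketch on a hypothetical four-row DataFrame (the countries `A` and `B` and the scores are made up for illustration). It shows rows, columns, aggregated cells, and the marginal `All` row and column.

```python
import pandas as pd

# Hypothetical toy data: two countries, two tiers
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "Label": ["1st Tier", "2nd Tier", "1st Tier", "1st Tier"],
    "score": [90.0, 60.0, 80.0, 70.0],
})

# Rows are countries, columns are tiers, cells are the summed scores;
# margins=True adds the "All" row and column
table = toy.pivot_table(values="score", index="country", columns="Label",
                        aggfunc="sum", margins=True)
print(table)
```

Country `B` has no 2nd Tier entry, so that cell is NaN, while its `All` value is still the sum of its two 1st Tier scores.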

In [2]:
#Importing dataset
df = pd.read_csv('datasets/cwurData.csv')
df.head()

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year
0,1,Harvard University,USA,1,7,9,1,1,1,1,,5,100.0,2012
1,2,Massachusetts Institute of Technology,USA,2,9,17,3,12,4,4,,1,91.67,2012
2,3,Stanford University,USA,3,17,11,5,4,2,2,,15,89.5,2012
3,4,University of Cambridge,United Kingdom,1,10,24,4,16,16,11,,50,86.17,2012
4,5,California Institute of Technology,USA,4,2,29,7,37,22,22,,18,85.21,2012


In [3]:
# Let's say we wanted to create a new column called Label, where institutions with world rankings 1-100 are
#categorized as first tier, those with world rankings 101 to 200 are second tier, and those ranked 201 to 300 are
#third tier. Everything ranked below 300 we'll just bucket as other top universities.

def university_label(rank):
    # rank is a single world_rank value; return the tier it falls into
    if rank <= 100:
        return '1st Tier'
    elif rank <= 200:
        return '2nd Tier'
    elif rank <= 300:
        return '3rd Tier'
    else:
        return 'Other'

df['Label'] = df['world_rank'].apply(university_label)
df.head()

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year,Label
0,1,Harvard University,USA,1,7,9,1,1,1,1,,5,100.0,2012,1st Tier
1,2,Massachusetts Institute of Technology,USA,2,9,17,3,12,4,4,,1,91.67,2012,1st Tier
2,3,Stanford University,USA,3,17,11,5,4,2,2,,15,89.5,2012,1st Tier
3,4,University of Cambridge,United Kingdom,1,10,24,4,16,16,11,,50,86.17,2012,1st Tier
4,5,California Institute of Technology,USA,4,2,29,7,37,22,22,,18,85.21,2012,1st Tier
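
As an alternative to a hand-written labelling function, the same bucketing can be sketched with `pd.cut`. This is a minimal sketch on a hypothetical Series of ranks (`ranks` is made up here so that the bin edges are visible); it is not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical ranks chosen to sit on the bin edges
ranks = pd.Series([1, 100, 101, 200, 201, 300, 301], name="world_rank")

# Bins are right-inclusive by default: (0, 100], (100, 200], (200, 300], (300, inf]
labels = pd.cut(
    ranks,
    bins=[0, 100, 200, 300, np.inf],
    labels=["1st Tier", "2nd Tier", "3rd Tier", "Other"],
)
print(labels.tolist())
```

Note that `pd.cut` returns a categorical column, which keeps the tiers in their declared order when sorting or grouping.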


In [4]:
#Let us create our first pivot table

df.pivot_table(values='score',index='country',columns='Label',aggfunc=[np.sum])

Unnamed: 0_level_0,sum,sum,sum,sum
Label,1st Tier,2nd Tier,3rd Tier,Other
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Argentina,,,,312.71
Australia,383.54,393.94,94.57,1785.83
Austria,,,141.2,942.15
Belgium,103.75,245.42,140.24,450.81
Brazil,,99.13,,1512.99
Bulgaria,,,,88.67
Canada,697.24,541.4,515.09,1656.14
Chile,,,,358.14
China,214.37,239.34,375.41,6684.64
Colombia,,,,177.73


In [5]:
#Here we can see a hierarchical DataFrame: the index holds the countries and the columns are hierarchical.
#The first column level is sum, the aggregation function that was applied, and the second level holds the labels (tiers)

In [6]:
#We can pass multiple aggregation functions to the pivot table. Let us try that too

df.pivot_table(values='score',index='country',columns='Label',aggfunc=[np.sum,np.max,np.min])

Unnamed: 0_level_0,sum,sum,sum,sum,amax,amax,amax,amax,amin,amin,amin,amin
Label,1st Tier,2nd Tier,3rd Tier,Other,1st Tier,2nd Tier,3rd Tier,Other,1st Tier,2nd Tier,3rd Tier,Other
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
Argentina,,,,312.71,,,,45.66,,,,44.1
Australia,383.54,393.94,94.57,1785.83,51.61,50.4,47.47,45.97,44.13,47.97,47.1,44.09
Austria,,,141.2,942.15,,,47.78,46.29,,,46.39,44.19
Belgium,103.75,245.42,140.24,450.81,52.03,49.73,47.14,46.21,51.72,48.08,46.21,44.31
Brazil,,99.13,,1512.99,,49.82,,46.08,,49.31,,44.03
Bulgaria,,,,88.67,,,,44.48,,,,44.19
Canada,697.24,541.4,515.09,1656.14,60.87,51.23,47.69,45.74,44.5,47.6,46.17,44.03
Chile,,,,358.14,,,,45.33,,,,44.07
China,214.37,239.34,375.41,6684.64,55.3,48.14,47.76,45.92,52.21,47.56,46.25,44.02
Colombia,,,,177.73,,,,44.85,,,,44.06


In [7]:
#Now we will use another parameter, margins, and see the changes

df.pivot_table(values='score',index='country',columns='Label',aggfunc=[np.sum,np.max,np.min], margins=True).head(10)

Unnamed: 0_level_0,sum,sum,sum,sum,sum,amax,amax,amax,amax,amax,amin,amin,amin,amin,amin
Label,1st Tier,2nd Tier,3rd Tier,Other,All,1st Tier,2nd Tier,3rd Tier,Other,All,1st Tier,2nd Tier,3rd Tier,Other,All
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2
Argentina,,,,312.71,312.71,,,,45.66,45.66,,,,44.1,44.1
Australia,383.54,393.94,94.57,1785.83,2657.88,51.61,50.4,47.47,45.97,51.61,44.13,47.97,47.1,44.09,44.09
Austria,,,141.2,942.15,1083.35,,,47.78,46.29,47.78,,,46.39,44.19,44.19
Belgium,103.75,245.42,140.24,450.81,940.22,52.03,49.73,47.14,46.21,52.03,51.72,48.08,46.21,44.31,44.31
Brazil,,99.13,,1512.99,1612.12,,49.82,,46.08,49.82,,49.31,,44.03,44.03
Bulgaria,,,,88.67,88.67,,,,44.48,44.48,,,,44.19,44.19
Canada,697.24,541.4,515.09,1656.14,3409.87,60.87,51.23,47.69,45.74,60.87,44.5,47.6,46.17,44.03,44.03
Chile,,,,358.14,358.14,,,,45.33,45.33,,,,44.07,44.07
China,214.37,239.34,375.41,6684.64,7513.76,55.3,48.14,47.76,45.92,55.3,52.21,47.56,46.25,44.02,44.02
Colombia,,,,177.73,177.73,,,,44.85,44.85,,,,44.06,44.06


In [8]:
#A pivot table is just a DataFrame with multi-level (hierarchical) columns
#Let's save the previous pivot table into a variable and have a look at its index and columns

pivot_dataframe = df.pivot_table(values='score',index='country',columns='Label',aggfunc=[np.sum,np.max,np.min])

print(pivot_dataframe.index)

print(pivot_dataframe.columns)

Index(['Argentina', 'Australia', 'Austria', 'Belgium', 'Brazil', 'Bulgaria',
       'Canada', 'Chile', 'China', 'Colombia', 'Croatia', 'Cyprus',
       'Czech Republic', 'Denmark', 'Egypt', 'Estonia', 'Finland', 'France',
       'Germany', 'Greece', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Iran',
       'Ireland', 'Israel', 'Italy', 'Japan', 'Lebanon', 'Lithuania',
       'Malaysia', 'Mexico', 'Netherlands', 'New Zealand', 'Norway', 'Poland',
       'Portugal', 'Puerto Rico', 'Romania', 'Russia', 'Saudi Arabia',
       'Serbia', 'Singapore', 'Slovak Republic', 'Slovenia', 'South Africa',
       'South Korea', 'Spain', 'Sweden', 'Switzerland', 'Taiwan', 'Thailand',
       'Turkey', 'USA', 'Uganda', 'United Arab Emirates', 'United Kingdom',
       'Uruguay'],
      dtype='object', name='country')
MultiIndex([( 'sum', '1st Tier'),
            ( 'sum', '2nd Tier'),
            ( 'sum', '3rd Tier'),
            ( 'sum',    'Other'),
            ('amax', '1st Tier'),
            ('amax',

In [9]:
#Let us now see how we can access sum  for 2nd Tier 

pivot_dataframe['sum']['2nd Tier'].head()

country
Argentina       NaN
Australia    393.94
Austria         NaN
Belgium      245.42
Brazil        99.13
Name: 2nd Tier, dtype: float64
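
Chained indexing as above works, but a column in a MultiIndex can also be selected in one step with a tuple. The sketch below uses a hypothetical toy pivot table (`toy_pivot`, built from made-up scores) rather than the rankings data.

```python
import pandas as pd

# Hypothetical toy data: two countries, two tiers
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "Label": ["1st Tier", "2nd Tier", "1st Tier", "2nd Tier"],
    "score": [90.0, 60.0, 80.0, 70.0],
})
toy_pivot = toy.pivot_table(values="score", index="country", columns="Label",
                            aggfunc=["sum", "max"])

# Two ways to reach the same column of a two-level column index
chained = toy_pivot["sum"]["2nd Tier"]
by_tuple = toy_pivot[("sum", "2nd Tier")]

# .loc accepts the same tuple to pick a single cell
cell = toy_pivot.loc["A", ("sum", "2nd Tier")]
print(by_tuple)
print(cell)
```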

In [10]:
#What if we want to find the country with the highest total score in the 2nd Tier?

#We will use the idxmax() function, which returns the index label of the maximum value
pivot_dataframe['sum']['2nd Tier'].idxmax()

'USA'

In [12]:
#If you wanted to achieve a different shape of your pivot table, you can do so with the stack and unstack functions. 
#Stacking is pivoting the lowest column index to become the innermost row index, and unstacking is just the inverse of 
#stacking, pivoting the innermost row index to become the lowermost column index. 

pivot_dataframe.stack()

Unnamed: 0_level_0,Unnamed: 1_level_0,sum,amax,amin
country,Label,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Argentina,Other,312.71,45.66,44.10
Australia,1st Tier,383.54,51.61,44.13
Australia,2nd Tier,393.94,50.40,47.97
Australia,3rd Tier,94.57,47.47,47.10
Australia,Other,1785.83,45.97,44.09
Austria,3rd Tier,141.20,47.78,46.39
Austria,Other,942.15,46.29,44.19
Belgium,1st Tier,103.75,52.03,51.72
Belgium,2nd Tier,245.42,49.73,48.08
Belgium,3rd Tier,140.24,47.14,46.21
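
The round trip between the wide and the stacked shape can be sketched on a small hand-built frame. The DataFrame `wide` below is hypothetical and contains no missing values, so it sidesteps how different pandas versions treat NaN rows when stacking.

```python
import pandas as pd

# Hypothetical frame with two column levels: aggregation, then tier
columns = pd.MultiIndex.from_product([["sum", "max"], ["1st Tier", "2nd Tier"]])
wide = pd.DataFrame(
    [[90.0, 60.0, 90.0, 60.0],
     [150.0, 70.0, 80.0, 70.0]],
    index=pd.Index(["A", "B"], name="country"),
    columns=columns,
)

# stack() moves the innermost column level (the tier) into the row index
stacked = wide.stack()

# unstack() moves the innermost row level back into the columns
restored = stacked.unstack()

print(stacked)
print(restored)
```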


## TimeStamp

In [2]:
#pandas has four main time-related classes: Timestamp, DatetimeIndex, Period and PeriodIndex

#Let's start with Timestamp

pd.Timestamp("1/10/2020") #By default the time will be 00:00:00

Timestamp('2020-01-10 00:00:00')

In [3]:
#We can pass year, month, day, hour etc. separately

pd.Timestamp(2020,5,12,13,5)

Timestamp('2020-05-12 13:05:00')

In [4]:
#Timestamp also has some useful methods, such as isoweekday(), which returns the weekday of the timestamp.
#Note that 1 represents Monday and 7 represents Sunday.

pd.Timestamp(2021,4,9).isoweekday()

5

In [6]:
#We can extract year, month, day, hour etc from timestamp

pd.Timestamp(2021,4,10,9,53).minute

53
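
The difference between attributes (no parentheses) and methods (called with parentheses) on a Timestamp can be sketched as follows. The date is the one used above with an arbitrary time added.

```python
import pandas as pd

ts = pd.Timestamp(2021, 4, 9, 9, 53)

# Components are plain attributes
print(ts.year, ts.month, ts.day, ts.hour, ts.minute)

# Weekday helpers are methods
print(ts.isoweekday())  # Monday=1 ... Sunday=7
print(ts.weekday())     # Monday=0 ... Sunday=6
print(ts.day_name())    # weekday name as a string
```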

### Period

In [7]:
#suppose we weren't interested in a specific point in time and instead we wanted a span of time. This is where the Period 
#class comes into play. Period represents a single time span, such as a specific day or month

pd.Period("2012-02")

Period('2012-02', 'M')

In [8]:
#A Period object represents the full time span that you specify, so arithmetic on periods is easy and intuitive.
#For instance, if we want to find the period 5 months after January 2012, we simply add 5 to it.

pd.Period("2012-01")+5

Period('2012-06', 'M')

In [10]:
#Suppose we want to find out 15 days before 2nd March 2016

pd.Period("2016-03-02")-15

Period('2016-02-16', 'D')
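
Because a Period is a span rather than a point, it has a start and an end, and integer arithmetic moves it by whole units of its frequency. A minimal sketch, with the frequency passed explicitly instead of being inferred from the string:

```python
import pandas as pd

month = pd.Period("2012-01", freq="M")

# A monthly period covers the whole month
print(month.start_time)
print(month.end_time)

# Adding an integer shifts by that many months
later = month + 5
print(later)

# With a daily frequency the same arithmetic counts days
day = pd.Period("2016-03-02", freq="D")
earlier = day - 15
print(earlier)
```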

### DateTimeIndex and PeriodIndex

In [13]:
#We will create a Series that has a DatetimeIndex as its index

t1 = pd.Series(list('abc'), [pd.Timestamp("2016-01-02"),pd.Timestamp("2016-01-03"),pd.Timestamp("2016-01-04")])
t1

2016-01-02    a
2016-01-03    b
2016-01-04    c
dtype: object

In [14]:
#Similarly we can do it for period index 
t1 = pd.Series(list('abc'), [pd.Period("2016-01-02"),pd.Period("2016-01-03"),pd.Period("2016-01-04")])
t1

2016-01-02    a
2016-01-03    b
2016-01-04    c
Freq: D, dtype: object
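
Instead of listing each Timestamp or Period by hand, the two index types can also be generated with `pd.date_range` and `pd.period_range`. The sketch below rebuilds the two Series above that way and prints the resulting index types; the variable names are illustrative.

```python
import pandas as pd

# Three consecutive days as points in time and as daily spans
dt_index = pd.date_range("2016-01-02", periods=3, freq="D")
p_index = pd.period_range("2016-01-02", periods=3, freq="D")

by_timestamp = pd.Series(list("abc"), index=dt_index)
by_period = pd.Series(list("abc"), index=p_index)

print(type(by_timestamp.index).__name__)
print(type(by_period.index).__name__)
```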

### Converting to DateTime

In [19]:
# Let's look at how to convert to datetime. Suppose we have a list of date strings in different formats and we use it as the index of a new DataFrame.

date = ['2015-2-4','August 2012, 5','7/9/2020','8/9/2020']

df1 = pd.DataFrame(np.random.randint(10,100, (4,2)), index=date, columns=list('ab'))
df1

Unnamed: 0,a,b
2015-2-4,49,77
"August 2012, 5",22,58
7/9/2020,85,48
8/9/2020,57,44


In [20]:
# pandas has a nice built-in function, to_datetime, that tries to figure out the format for us.

pd.to_datetime(df1.index)

DatetimeIndex(['2015-02-04', '2012-08-05', '2020-07-09', '2020-08-09'], dtype='datetime64[ns]', freq=None)
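
Note that in the output above `7/9/2020` was read month-first, as 9 July. When the strings are actually day-first, the guess can be overridden. The following is a minimal sketch using the `dayfirst` and `format` parameters of `to_datetime`. How a list of mixed formats is handled in a single call can depend on the pandas version, so the sketch parses unambiguous inputs only.

```python
import pandas as pd

# Default guess: month first
month_first = pd.to_datetime("7/9/2020")

# Tell pandas the day comes first
day_first = pd.to_datetime("7/9/2020", dayfirst=True)

# Or state the format explicitly for a whole list
explicit = pd.to_datetime(["07/09/2020", "08/09/2020"], format="%d/%m/%Y")

print(month_first)
print(day_first)
print(explicit)
```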

### TimeDelta

In [5]:
#It is the difference between two points in time. Let us say we want to see the time difference between September 8 and November 16.

pd.Timestamp("2021-11-16")-pd.Timestamp("2021-09-08")

Timedelta('69 days 00:00:00')

In [10]:
#We can also get the date and time 7 days and 3 hours after 8:00 AM on September 8

pd.Timestamp("2021-09-08 8:00AM") + pd.Timedelta("7D 3H")

Timestamp('2021-09-15 11:00:00')
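
A Timedelta can also be built from keyword arguments instead of a string, and its parts can be read back. A minimal sketch reusing the 7 days and 3 hours from above:

```python
import pandas as pd

# Same duration as "7D 3H", spelled out with keywords
delta = pd.Timedelta(days=7, hours=3)

print(delta.days)             # whole days in the duration
print(delta.total_seconds())  # full duration in seconds

# Adding it to a timestamp gives the same result as the string form above
result = pd.Timestamp("2021-09-08 08:00") + delta
print(result)
```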

### Offset

In [13]:
#An offset is similar to a timedelta, but it follows specific calendar duration rules. Offsets allow flexibility
#in terms of types of time intervals. Besides hour, day, week, month, etc., there are also things like business day,
#end of month, semi-month begin, etc. These are non-traditional time intervals, but ones that we use in
#business all the time.

#We will create a timestamp and see what day it is

pd.Timestamp("2021-11-16").weekday()

1

In [15]:
#We can add a one-week offset to the timestamp

pd.Timestamp("2021-11-16") + pd.offsets.Week()

Timestamp('2021-11-23 00:00:00')
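
The calendar-aware behaviour that separates an offset from a plain Timedelta shows up most clearly with month ends and business days. A minimal sketch using `MonthEnd` and `BDay` from `pd.offsets`, starting from the same Tuesday as above:

```python
import pandas as pd

tuesday = pd.Timestamp("2021-11-16")

# Roll forward to the last day of the month
month_end = tuesday + pd.offsets.MonthEnd()
print(month_end)

# Five business days ahead skips the weekend
five_bdays = tuesday + pd.offsets.BDay(5)
print(five_bdays)

# From a Friday, one business day lands on Monday, while one calendar day lands on Saturday
friday = pd.Timestamp("2021-11-19")
next_bday = friday + pd.offsets.BDay()
next_day = friday + pd.Timedelta(days=1)
print(next_bday, next_day)
```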