## Match searches with bookings

- For every search in the searches file, find out whether the search ended up in a booking or not (using the info in the bookings file). For instance, search and booking origin and destination should match. 

- For the bookings file, origin and destination are the columns dep_port and arr_port, respectively. 

- Generate a CSV file with the search data, and an additional field, containing 1 if the search ended up in a booking, and 0 otherwise.


STEPS TO FOLLOW

1) Understand the data

2) Quick Notes

3) Action plan

4) Plan Execution with sample
    
5) Plan Execution with all data


# 1) Let's check what we have

In [1]:
import pandas as pd
pd.options.mode.chained_assignment = None  

In [2]:
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
bookings_check = pd.read_csv('bookings.csv', sep='^', on_bad_lines='skip', nrows=1)
pd.set_option("display.max_columns", None)
bookings_check

Unnamed: 0,act_date,source,pos_ctry,pos_iata,pos_oid,rloc,cre_date,duration,distance,dep_port,dep_city,dep_ctry,arr_port,arr_city,arr_ctry,lst_port,lst_city,lst_ctry,brd_port,brd_city,brd_ctry,off_port,off_city,off_ctry,mkt_port,mkt_city,mkt_ctry,intl,route,carrier,bkg_class,cab_class,brd_time,off_time,pax,year,month,oid
0,2013-03-05 00:00:00,1A,DE,a68dd7ae953c8acfb187a1af2dcbe123,1a11ae49fcbf545fd2afc1a24d88d2b7,ea65900e72d71f4626378e2ebd298267,2013-02-22 00:00:00,1708,0,ZRH,ZRH,CH,LHR,LON,GB,ZRH,ZRH,CH,LHR,LON,GB,ZRH,ZRH,CH,LHRZRH,LONZRH,CHGB,1,LHRZRH,VI,T,Y,2013-03-07 08:50:00,2013-03-07 11:33:37,-1,2013,3,


In [3]:
searches_check = pd.read_csv('searches.csv', sep='^', on_bad_lines='skip', nrows=1)
searches_check

Unnamed: 0,Date,Time,TxnCode,OfficeID,Country,Origin,Destination,RoundTrip,NbSegments,Seg1Departure,Seg1Arrival,Seg1Date,Seg1Carrier,Seg1BookingCode,Seg2Departure,Seg2Arrival,Seg2Date,Seg2Carrier,Seg2BookingCode,Seg3Departure,Seg3Arrival,Seg3Date,Seg3Carrier,Seg3BookingCode,Seg4Departure,Seg4Arrival,Seg4Date,Seg4Carrier,Seg4BookingCode,Seg5Departure,Seg5Arrival,Seg5Date,Seg5Carrier,Seg5BookingCode,Seg6Departure,Seg6Arrival,Seg6Date,Seg6Carrier,Seg6BookingCode,From,IsPublishedForNeg,IsFromInternet,IsFromVista,TerminalID,InternetOffice
0,2013-01-01,20:25:57,MPT,624d8c3ac0b3a7ca03e3c167e0f48327,DE,TXL,AUH,1,2,TXL,AUH,2013-01-26,D2,,AUH,TXL,2013-02-02,D2,,,,,,,,,,,,,,,,,,,,,,1ASIWS,0,0,0,d41d8cd98f00b204e9800998ecf8427e,FRA


# 2) Quick Notes:

Before settling on an action plan, I want to check:

    1) Q: Difference between act_date and cre_date (bookings). 
       A: Both are always before brd_time. cre_date appears to be when the account was created or when the system first detected the user, and act_date appears to be when the flight was booked in the system. I assume this because cre_date is always earlier than act_date, and every act_date is from 2013 while some cre_date values are from 2011 or 2012.
       
    2) Q: Do both datasets cover all the months of 2013? 
       A: Both datasets have the 12 months of 2013
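
The two checks above could be reproduced along these lines. This is a minimal sketch on made-up rows, not the code that produced the answers: `date_checks` is a hypothetical helper, and it assumes the padded column names of the real bookings file (for example `'act_date           '`) have already been stripped.

```python
import pandas as pd

def date_checks(bookings):
    """Summarise date ordering and month coverage for a bookings-like frame."""
    cre = pd.to_datetime(bookings["cre_date"], errors="coerce")
    act = pd.to_datetime(bookings["act_date"], errors="coerce")
    brd = pd.to_datetime(bookings["brd_time"], errors="coerce")
    return {
        # True when no row has cre_date later than act_date
        "cre_not_after_act": bool((cre <= act).all()),
        # True when no row has act_date later than brd_time
        "act_not_after_brd": bool((act <= brd).all()),
        # Distinct calendar months present in act_date
        "months_in_act_date": sorted(act.dt.month.dropna().astype(int).unique().tolist()),
    }

# Toy rows (values invented for illustration only)
toy_bookings = pd.DataFrame({
    "cre_date": ["2012-11-30 00:00:00", "2013-02-22 00:00:00"],
    "act_date": ["2013-01-15 00:00:00", "2013-03-05 00:00:00"],
    "brd_time": ["2013-02-01 10:00:00", "2013-03-07 08:50:00"],
})

checks = date_checks(toy_bookings)
print(checks)
```

On the full file, a month list running from 1 to 12 would confirm the second answer.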
       
       
    

# 3) Action Plan:

Information that appears in both datasets:
- Search day and booking day: bookings['act_date'], searches['Date']
- Origin and destination: bookings[['dep_port', 'arr_port']], searches[['Origin', 'Destination']]
- Boarding time: bookings['brd_time'], searches['Seg1Date']
- Pax: bookings['pax'] has to be > 0, as cancellations are not taken into account. People don't cancel through searches.

Once the data has no duplicates, the searches_Id and bookings_Id variables will be created to give each search and each booking a "key". After the DataFrames are merged, this makes it possible to see how many booking IDs and search IDs are repeated and to find a way to deal with them.



#### 1. Discard all unneeded information:

- Drop duplicates
- Create searches_Id and bookings_Id variables
       
#### 2. Create both df with the 4 columns

#### 3. Change data to correct type and deal with NaN
- Change dates to datetime
- Count the NaN values and evaluate whether it makes sense to do something about them
- Delete spaces for string in iata codes of bookings
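
The whitespace clean-up in step 3 can be sketched as follows. The notebook itself removes the padding with `str.split(" ", n=1)`; `str.strip()` is shown here as an alternative, and the padding widths in the toy frame are invented for illustration.

```python
import pandas as pd

# Toy frame imitating padded headers and padded IATA codes (widths are made up)
toy = pd.DataFrame({
    "act_date           ": ["2013-03-05 00:00:00"],
    "dep_port": ["ZRH     "],
    "arr_port": ["LHR     "],
})

# Strip the padding from the column names
toy.columns = toy.columns.str.strip()

# Strip the padding from the airport codes
for col in ["dep_port", "arr_port"]:
    toy[col] = toy[col].str.strip()

print(toy.columns.tolist())
print(repr(toy.loc[0, "dep_port"]))
```

Stripping the headers once would also avoid having to type the padded names, such as `'act_date           '`, in later cells.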


#### 4. Merge data

- Drop duplicates again if necessary
- Check if size is the same and data make sense
- Add the 1 and 0 to the raw search df
     

# 4) Plan Execution with Sample:

In [4]:
#Sample of 2M rows
bookings_raw = pd.read_csv('bookings.csv', sep='^', on_bad_lines='skip', nrows=2000000)
searches_raw = pd.read_csv('searches.csv', sep='^', on_bad_lines='skip', nrows=2000000)



In [5]:
#Drop duplicates
bookings_nodup = bookings_raw.drop_duplicates()
searches_nodup = searches_raw.drop_duplicates()

In [6]:
# Resetting the index so it runs from 0 to nrows - 1
bookings_nodup = bookings_nodup.reset_index(drop=True)
searches_nodup = searches_nodup.reset_index(drop=True)

Adding the ID variables:

In [7]:
bookings_nodup['bookings_Id'] = bookings_nodup.index
searches_nodup['searches_Id'] = searches_nodup.index

In [8]:
#Selecting useful columns
bookings_cols_nodup = bookings_nodup[['act_date           ','dep_port', 'arr_port','brd_time           ','pax', "bookings_Id"]]
searches_cols_nodup = searches_nodup[['Date','Origin','Destination','Seg1Date', "searches_Id" ]]

#bookings with pax > 0 (cancellations have negative pax)
bookings_cols_nodup = bookings_cols_nodup[bookings_cols_nodup['pax'] > 0]

In [9]:
#Drop NaN
bookings_cols_nodup_nonan = bookings_cols_nodup.dropna()
searches_cols_nodup_nonan = searches_cols_nodup.dropna()

In [10]:
print('bookings_raw shape = ', bookings_raw.shape,'\nbookings without duplicates = ', bookings_nodup.shape, '\nbookings selected columns and no duplicates =', bookings_cols_nodup.shape,'\nbookings selected columns, no duplicates and no Nan=', bookings_cols_nodup_nonan.shape)
print("")
print('searches_raw shape = ', searches_raw.shape,'\nsearches without duplicates = ', searches_nodup.shape, '\nsearches selected columns and no duplicates =', searches_cols_nodup.shape,'\nsearches selected columns, no duplicates and no Nan=', searches_cols_nodup_nonan.shape)  

bookings_raw shape =  (2000000, 38) 
bookings without duplicates =  (1000000, 39) 
bookings selected columns and no duplicates = (680866, 6) 
bookings selected columns, no duplicates and no Nan= (680866, 6)

searches_raw shape =  (2000000, 45) 
searches without duplicates =  (388369, 46) 
searches selected columns and no duplicates = (388369, 5) 
searches selected columns, no duplicates and no Nan= (387276, 5)
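
The shapes above show that `dropna()` removed 1,093 search rows (388369 - 387276) and no booking rows. To see which columns are responsible before dropping, a per-column NaN count helps. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy searches with a few missing values (invented for illustration)
toy_searches = pd.DataFrame({
    "Date": ["2013-01-01", "2013-01-01", None],
    "Origin": ["TXL", "ATH", "ICT"],
    "Destination": ["AUH", None, "SFO"],
    "Seg1Date": ["2013-01-26", "2013-01-04", "2013-08-02"],
})

# Number of missing values in each column
nan_per_column = toy_searches.isna().sum()

# Number of rows that dropna() would remove
rows_lost = len(toy_searches) - len(toy_searches.dropna())

print(nan_per_column)
print("rows lost:", rows_lost)
```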


In [11]:
# Changing dates to datetime to have same format in both df and delete the hour
import datetime as dt
#First for the creation date of the booking and search
bookings_cols_nodup_nonan['created_date'] = pd.to_datetime(bookings_cols_nodup_nonan['act_date           '], errors='coerce', format='%Y-%m-%d').dt.date
searches_cols_nodup_nonan['created_date'] = pd.to_datetime(searches_cols_nodup_nonan['Date'], errors='coerce', format='%Y-%m-%d').dt.date

#Second for the boarding time
bookings_cols_nodup_nonan['board_date'] = pd.to_datetime(bookings_cols_nodup_nonan['brd_time           '], errors='coerce', format='%Y-%m-%d').dt.date
searches_cols_nodup_nonan['board_date'] = pd.to_datetime(searches_cols_nodup_nonan['Seg1Date'], errors='coerce', format='%Y-%m-%d').dt.date


In [12]:
#Drop the old date columns as we don't need them anymore (pax is dropped in the next cell)
bookings_datetime = bookings_cols_nodup_nonan.drop(['act_date           ','brd_time           '], axis =1)
searches_datetime = searches_cols_nodup_nonan.drop(['Date', 'Seg1Date'], axis = 1)

In [13]:
#Copying dep_port/arr_port into Origin/Destination (the searches column names), without trailing spaces, to make the merge easier
bookings_datetime['Origin'] = bookings_datetime['dep_port'].str.split(" ", n = 1, expand = True)[0]
bookings_datetime['Destination'] = bookings_datetime['arr_port'].str.split(" ", n = 1, expand = True)[0]

#Dropping unneeded columns: pax (already filtered), dep_port and arr_port
bookings_datetime.drop(['pax','dep_port', 'arr_port'], axis = 1, inplace = True)



In [14]:
#Removing any trailing spaces from the searches codes as well, to make sure both sides match
searches_datetime['Origin'] = searches_datetime['Origin'].str.split(" ", n = 1, expand = True)[0]
searches_datetime['Destination'] = searches_datetime['Destination'].str.split(" ", n = 1, expand = True)[0]


In [16]:
#Merging both df
merged_searches = searches_datetime.merge(bookings_datetime, how = 'left')
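
Because no `on=` argument is given, `merge` joins on every column name the two frames share, which here means `Origin`, `Destination`, `created_date` and `board_date`. The sketch below, on two toy frames with the same column layout, shows that spelling the keys out gives the same columns and the same match; the rows are illustrative only.

```python
import pandas as pd

searches = pd.DataFrame({
    "Origin": ["TXL", "RUH"],
    "Destination": ["AUH", "JED"],
    "searches_Id": [0, 432],
    "created_date": ["2013-01-01", "2013-01-01"],
    "board_date": ["2013-01-26", "2013-01-14"],
})
bookings = pd.DataFrame({
    "bookings_Id": [544420],
    "created_date": ["2013-01-01"],
    "board_date": ["2013-01-14"],
    "Origin": ["RUH"],
    "Destination": ["JED"],
})

keys = ["Origin", "Destination", "created_date", "board_date"]

# Keys inferred from the shared column names
implicit = searches.merge(bookings, how="left")
# Keys written out explicitly
explicit = searches.merge(bookings, how="left", on=keys)

print(list(implicit.columns) == list(explicit.columns))
print(explicit[["searches_Id", "bookings_Id"]])
```

Writing the keys out makes the matching rule visible and protects against an unrelated column with the same name silently becoming a join key.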

In [17]:
merged_searches

Unnamed: 0,Origin,Destination,searches_Id,created_date,board_date,bookings_Id
0,TXL,AUH,0,2013-01-01,2013-01-26,
1,ATH,MIL,1,2013-01-01,2013-01-04,
2,ICT,SFO,2,2013-01-01,2013-08-02,
3,RNB,ARN,3,2013-01-01,2013-01-02,
4,OSL,MAD,4,2013-01-01,2013-03-22,
...,...,...,...,...,...,...
387321,HLZ,BNE,388364,2013-01-08,2013-10-26,
387322,PEK,SEL,388365,2013-01-08,2013-03-22,
387323,YUL,NCL,388366,2013-01-08,2013-01-27,
387324,MIA,NCE,388367,2013-01-08,2013-01-23,


In [18]:
print('shape searches: ', searches_datetime.shape, '\nshape merged:   ', merged_searches.shape)

shape searches:  (387276, 5) 
shape merged:    (387326, 6)


In [19]:
merged_searches.shape[0]- searches_datetime.shape[0]

50

In [20]:
merged_searches['bookings_Id'].count()

588

Checking if searches are duplicated based on the ID:

In [21]:
merged_searches['searches_Id'].value_counts()

71021     3
10943     3
289292    2
72541     2
382728    2
         ..
85379     1
95620     1
97669     1
91526     1
0         1
Name: searches_Id, Length: 387276, dtype: int64
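
The repeated search IDs come from the left merge itself: when one search matches several bookings, its row is emitted once per match. A minimal sketch of that fan-out on made-up rows (only two key columns are used to keep it short):

```python
import pandas as pd

searches = pd.DataFrame({
    "searches_Id": [0, 1],
    "Origin": ["JED", "TXL"],
    "Destination": ["RUH", "AUH"],
})
# Two bookings on the same route (IDs invented for illustration)
bookings = pd.DataFrame({
    "bookings_Id": [10, 11],
    "Origin": ["JED", "JED"],
    "Destination": ["RUH", "RUH"],
})

merged = searches.merge(bookings, how="left", on=["Origin", "Destination"])

# Rows added by the merge, and how often each search ID now appears
extra_rows = len(merged) - len(searches)
counts = merged["searches_Id"].value_counts()

print(extra_rows)
print(counts)
```

Search 0 appears twice because both bookings match it, which is the same effect that adds 50 rows in the real merge.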

As searches are duplicated, a new df is created dropping the duplicates

In [22]:
merged_searches_nosearchiddup = merged_searches.drop_duplicates(subset=['searches_Id'])

In [24]:
print('shape searches:               ', searches_datetime.shape, '\nshape merged no search dup:   ', merged_searches_nosearchiddup.shape)

shape searches:                (387276, 5) 
shape merged no search dup:    (387276, 6)


Even though the duplicated searches were deleted, there are still duplicated bookings:

In [42]:
merged_searches_nosearchiddup['bookings_Id'].value_counts()

576143.0    4
722996.0    3
525086.0    2
746918.0    2
958881.0    2
           ..
866675.0    1
867241.0    1
108440.0    1
434262.0    1
98871.0     1
Name: bookings_Id, Length: 474, dtype: int64

## Problem

It is not possible to know whether a booking corresponds to a specific search, because several identical searches can match the same booking. After the merge, the same booking ID is attached to every matching search ID, so each of those searches would get a value of "1". This ends up with duplicated bookings. 

## Solution

At the same time, a booking cannot be duplicated (one booking can come from only one search), so I decided to delete the duplicated bookings based on the ID. By doing this, some booking IDs are probably going to be assigned to a search that didn't end up in a booking but is identical to another one that did. Adding other variables to both files, such as the exact time, would let us solve this problem in a better way.
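
One way to enforce "one booking, one search" inside the merge, rather than by dropping IDs afterwards, is to number identical searches and identical bookings within each key and join on that number as well. This is a different technique from the one used in this notebook, sketched on made-up rows; `pair_one_to_one` and the `rank` column are hypothetical names, and which of several identical searches receives the booking is still arbitrary (first in file order).

```python
import pandas as pd

KEYS = ["Origin", "Destination", "created_date", "board_date"]

def pair_one_to_one(searches, bookings, keys=KEYS):
    """Left-join searches to bookings so each booking is used at most once."""
    s = searches.copy()
    b = bookings.copy()
    # Position of each row among the rows sharing the same key: 0, 1, 2, ...
    s["rank"] = s.groupby(keys).cumcount()
    b["rank"] = b.groupby(keys).cumcount()
    # The n-th identical search can only match the n-th identical booking
    merged = s.merge(b, how="left", on=keys + ["rank"])
    merged["booked"] = merged["bookings_Id"].notna().astype(int)
    return merged.drop(columns="rank")

# Two identical searches, one unrelated search, and a single booking
searches = pd.DataFrame({
    "searches_Id": [0, 1, 2],
    "Origin": ["JED", "JED", "TXL"],
    "Destination": ["RUH", "RUH", "AUH"],
    "created_date": ["2013-01-01"] * 3,
    "board_date": ["2013-01-04", "2013-01-04", "2013-01-26"],
})
bookings = pd.DataFrame({
    "bookings_Id": [576143],
    "Origin": ["JED"],
    "Destination": ["RUH"],
    "created_date": ["2013-01-01"],
    "board_date": ["2013-01-04"],
})

paired = pair_one_to_one(searches, bookings)
print(paired[["searches_Id", "bookings_Id", "booked"]])
```

With this pairing the merged frame keeps one row per search, so neither search IDs nor booking IDs need to be deduplicated afterwards.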

Creating a new df with no booking duplicates to know exactly how many searches were booked (it doesn't detect exactly which ones, but the number should be correct)

In [27]:
merged_searches_nosearchiddup_nobookiddup = merged_searches_nosearchiddup.drop_duplicates(subset=['bookings_Id'])

In [28]:
merged_searches_nosearchiddup_nobookiddup

Unnamed: 0,Origin,Destination,searches_Id,created_date,board_date,bookings_Id
0,TXL,AUH,0,2013-01-01,2013-01-26,
430,RUH,JED,432,2013-01-01,2013-01-14,544420.0
593,DME,BKK,596,2013-01-01,2013-01-29,372991.0
737,JED,RUH,741,2013-01-01,2013-01-04,576143.0
919,DEL,BOM,923,2013-01-01,2013-01-02,233523.0
...,...,...,...,...,...,...
347759,CAI,DXB,348719,2013-12-15,2013-12-17,722996.0
350077,DRS,FCO,351040,2013-12-18,2014-10-16,912562.0
350341,SFO,TPE,351304,2013-12-18,2014-02-08,495411.0
350702,DEN,YYZ,351666,2013-12-18,2014-01-06,985652.0


Number of searches that were booked:

In [29]:
merged_searches_nosearchiddup_nobookiddup['bookings_Id'].count()

474

Creating final df with 1 and 0:

In [30]:
merged_searches_nosearchiddup_nobookiddup_nonan = merged_searches_nosearchiddup_nobookiddup.dropna()

In [31]:
merged_searches_nosearchiddup_nobookiddup_nonan['booked 1= yes, 0 = no'] = 1

In [32]:
for_final_merge = merged_searches_nosearchiddup_nobookiddup_nonan[['searches_Id', 'booked 1= yes, 0 = no']]

In [33]:
answer = searches_nodup.merge(for_final_merge, how='left')

In [34]:
#Changing NaN to 0 (assigning the result back instead of calling fillna in place on a column selection)
answer['booked 1= yes, 0 = no'] = answer['booked 1= yes, 0 = no'].fillna(0).astype(int)

Checking if results make sense

In [35]:
#Number of searches booked
print(sum(answer['booked 1= yes, 0 = no']), merged_searches_nosearchiddup_nobookiddup['bookings_Id'].count())

474 474


In [36]:
# Shape of df
print(searches_nodup.shape, answer.shape)

(388369, 46) (388369, 47)


In [37]:
#Checking if there are duplicates: rows whose searches_Id appears more than once
print(answer[answer['searches_Id'].duplicated(keep=False)])

Empty DataFrame
Columns: [Date, Time, TxnCode, OfficeID, Country, Origin, Destination, RoundTrip, NbSegments, Seg1Departure, Seg1Arrival, Seg1Date, Seg1Carrier, Seg1BookingCode, Seg2Departure, Seg2Arrival, Seg2Date, Seg2Carrier, Seg2BookingCode, Seg3Departure, Seg3Arrival, Seg3Date, Seg3Carrier, Seg3BookingCode, Seg4Departure, Seg4Arrival, Seg4Date, Seg4Carrier, Seg4BookingCode, Seg5Departure, Seg5Arrival, Seg5Date, Seg5Carrier, Seg5BookingCode, Seg6Departure, Seg6Arrival, Seg6Date, Seg6Carrier, Seg6BookingCode, From, IsPublishedForNeg, IsFromInternet, IsFromVista, TerminalID, InternetOffice, searches_Id, booked 1= yes, 0 = no]
Index: []


  


In [38]:
answer

Unnamed: 0,Date,Time,TxnCode,OfficeID,Country,Origin,Destination,RoundTrip,NbSegments,Seg1Departure,Seg1Arrival,Seg1Date,Seg1Carrier,Seg1BookingCode,Seg2Departure,Seg2Arrival,Seg2Date,Seg2Carrier,Seg2BookingCode,Seg3Departure,Seg3Arrival,Seg3Date,Seg3Carrier,Seg3BookingCode,Seg4Departure,Seg4Arrival,Seg4Date,Seg4Carrier,Seg4BookingCode,Seg5Departure,Seg5Arrival,Seg5Date,Seg5Carrier,Seg5BookingCode,Seg6Departure,Seg6Arrival,Seg6Date,Seg6Carrier,Seg6BookingCode,From,IsPublishedForNeg,IsFromInternet,IsFromVista,TerminalID,InternetOffice,searches_Id,"booked 1= yes, 0 = no"
0,2013-01-01,20:25:57,MPT,624d8c3ac0b3a7ca03e3c167e0f48327,DE,TXL,AUH,1.0,2.0,TXL,AUH,2013-01-26,D2,,AUH,TXL,2013-02-02,D2,,,,,,,,,,,,,,,,,,,,,,1ASIWS,0,0,0,d41d8cd98f00b204e9800998ecf8427e,FRA,0,0
1,2013-01-01,10:15:33,MPT,b0af35b31588dc4ab06d5cf2986e8e02,MD,ATH,MIL,0.0,1.0,ATH,MIL,2013-01-04,,,,,,,,,,,,,,,,,,,,,,,,,,,,1ASIWS,0,0,0,d41d8cd98f00b204e9800998ecf8427e,KIV,1,0
2,2013-01-01,18:04:49,MPT,3561a60621de06ab1badc8ca55699ef3,US,ICT,SFO,1.0,2.0,ICT,SFO,2013-08-02,,,SFO,ICT,2013-08-09,,,,,,,,,,,,,,,,,,,,,,,1ASIWS,0,0,0,d41d8cd98f00b204e9800998ecf8427e,NYC,2,0
3,2013-01-01,17:42:40,FXP,1864e5e8013d9414150e91d26b6a558b,SE,RNB,ARN,0.0,1.0,RNB,ARN,2013-01-02,DU,W,,,,,,,,,,,,,,,,,,,,,,,,,,1ASI,0,0,0,d41d8cd98f00b204e9800998ecf8427e,STO,3,0
4,2013-01-01,17:48:29,MPT,1ec336348f44207d2e0027dc3a68c118,NO,OSL,MAD,1.0,2.0,OSL,MAD,2013-03-22,,,MAD,OSL,2013-03-31,,,,,,,,,,,,,,,,,,,,,,,1ASIWS,0,0,0,d41d8cd98f00b204e9800998ecf8427e,OSL,4,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
388364,2013-01-08,01:54:16,MPT,4337f2104b2ff5bf8ec437c68c5ed20e,NZ,HLZ,BNE,0.0,1.0,HLZ,BNE,2013-10-26,,,,,,,,,,,,,,,,,,,,,,,,,,,,1ASIWS,0,0,0,d41d8cd98f00b204e9800998ecf8427e,CHC,388364,0
388365,2013-01-08,03:38:11,MPT,4b3d2a0a608ccaefbc68d1b9531d7245,HK,PEK,SEL,1.0,2.0,PEK,SEL,2013-03-22,,,SEL,PEK,2013-03-25,,,,,,,,,,,,,,,,,,,,,,,1ASI,0,0,0,d41d8cd98f00b204e9800998ecf8427e,HKG,388365,0
388366,2013-01-08,21:50:30,MPT,440642a9bdaeb6287f826cefd73255e8,US,YUL,NCL,0.0,1.0,YUL,NCL,2013-01-27,DX,,,,,,,,,,,,,,,,,,,,,,,,,,,1ASIWS,0,0,0,d41d8cd98f00b204e9800998ecf8427e,HPN,388366,0
388367,2013-01-08,23:40:20,MPT,e175ff926640d0f543bb15f8b4a88ed0,US,MIA,NCE,0.0,1.0,MIA,NCE,2013-01-23,LK,,,,,,,,,,,,,,,,,,,,,,,,,,,1ASI,0,0,0,d41d8cd98f00b204e9800998ecf8427e,LGA,388367,0


# 5) Plan Execution with All Data:

In [1]:
def all_data_result(bookings_data, searches_data):
    # bookings_data and searches_data are the paths to the two '^'-separated files

    bookings_raw = pd.read_csv(bookings_data, sep='^', on_bad_lines='skip')
    searches_raw = pd.read_csv(searches_data, sep='^', on_bad_lines='skip')

    #Drop duplicates
    bookings_nodup = bookings_raw.drop_duplicates()
    searches_nodup = searches_raw.drop_duplicates()

    # Resetting the index so it runs from 0 to nrows - 1
    bookings_nodup = bookings_nodup.reset_index(drop=True)
    searches_nodup = searches_nodup.reset_index(drop=True)
    
    #Adding variables:
    bookings_nodup['bookings_Id'] = bookings_nodup.index
    searches_nodup['searches_Id'] = searches_nodup.index

    #Selecting useful columns
    bookings_cols_nodup = bookings_nodup[['act_date           ','dep_port', 'arr_port','brd_time           ','pax', "bookings_Id"]]
    searches_cols_nodup = searches_nodup[['Date','Origin','Destination','Seg1Date', "searches_Id" ]]

    #bookings with pax > 0 (cancellations have negative pax)
    bookings_cols_nodup = bookings_cols_nodup[bookings_cols_nodup['pax'] > 0]

    #Drop NaN
    bookings_cols_nodup_nonan = bookings_cols_nodup.dropna()
    searches_cols_nodup_nonan = searches_cols_nodup.dropna()

    print('bookings_raw shape = ', bookings_raw.shape,'\nbookings without duplicates = ', bookings_nodup.shape, '\nbookings selected columns and no duplicates =', bookings_cols_nodup.shape,'\nbookings selected columns, no duplicates and no Nan=', bookings_cols_nodup_nonan.shape)
    print("")
    print('searches_raw shape = ', searches_raw.shape,'\nsearches without duplicates = ', searches_nodup.shape, '\nsearches selected columns and no duplicates =', searches_cols_nodup.shape,'\nsearches selected columns, no duplicates and no Nan=', searches_cols_nodup_nonan.shape)  

    # Changing dates to datetime to have same format in both df and delete the hour
    import datetime as dt
    #First for the creation date of the booking and search
    bookings_cols_nodup_nonan['created_date'] = pd.to_datetime(bookings_cols_nodup_nonan['act_date           '], errors='coerce', format='%Y-%m-%d').dt.date
    searches_cols_nodup_nonan['created_date'] = pd.to_datetime(searches_cols_nodup_nonan['Date'], errors='coerce', format='%Y-%m-%d').dt.date

    #Second for the boarding time
    bookings_cols_nodup_nonan['board_date'] = pd.to_datetime(bookings_cols_nodup_nonan['brd_time           '], errors='coerce', format='%Y-%m-%d').dt.date
    searches_cols_nodup_nonan['board_date'] = pd.to_datetime(searches_cols_nodup_nonan['Seg1Date'], errors='coerce', format='%Y-%m-%d').dt.date


    #Drop the old date columns as we don't need them anymore (pax is dropped further down)
    bookings_datetime = bookings_cols_nodup_nonan.drop(['act_date           ','brd_time           '], axis =1)
    searches_datetime = searches_cols_nodup_nonan.drop(['Date', 'Seg1Date'], axis = 1)

    #Copying dep_port/arr_port into Origin/Destination (the searches column names), without trailing spaces, to make the merge easier
    bookings_datetime['Origin'] = bookings_datetime['dep_port'].str.split(" ", n = 1, expand = True)[0]
    bookings_datetime['Destination'] = bookings_datetime['arr_port'].str.split(" ", n = 1, expand = True)[0]

    #Dropping unneeded columns: pax (already filtered), dep_port and arr_port
    bookings_datetime.drop(['pax','dep_port', 'arr_port'], axis = 1, inplace = True)



    #Removing any trailing spaces from the searches codes as well, to make sure both sides match
    searches_datetime['Origin'] = searches_datetime['Origin'].str.split(" ", n = 1, expand = True)[0]
    searches_datetime['Destination'] = searches_datetime['Destination'].str.split(" ", n = 1, expand = True)[0]


    #Left join, so every search is kept whether or not it matches a booking

    #Merging both df
    merged_searches = searches_datetime.merge(bookings_datetime, how = 'left')


    merged_searches_nosearchiddup = merged_searches.drop_duplicates(subset=['searches_Id'])


    merged_searches_nosearchiddup_nobookiddup = merged_searches_nosearchiddup.drop_duplicates(subset=['bookings_Id'])

    #Creating final df with 1 and 0:

    merged_searches_nosearchiddup_nobookiddup_nonan = merged_searches_nosearchiddup_nobookiddup.dropna()

    merged_searches_nosearchiddup_nobookiddup_nonan['booked 1= yes, 0 = no'] = 1

    for_final_merge = merged_searches_nosearchiddup_nobookiddup_nonan[['searches_Id', 'booked 1= yes, 0 = no']]

    answer = searches_nodup.merge(for_final_merge, how='left')

    #Changing NaN to 0 (assigning the result back instead of calling fillna in place on a column selection)
    answer['booked 1= yes, 0 = no'] = answer['booked 1= yes, 0 = no'].fillna(0).astype(int)

    #Number of searches booked
    print(sum(answer['booked 1= yes, 0 = no']), merged_searches_nosearchiddup_nobookiddup['bookings_Id'].count())

    # Shape of df
    print(searches_nodup.shape, answer.shape)

    #Checking if there are duplicates: rows whose searches_Id appears more than once
    print(answer[answer['searches_Id'].duplicated(keep=False)])

    return answer

In [None]:
import pandas as pd
import datetime as dt
all_data_result('bookings.csv', 'searches.csv')
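
The task asks for a CSV file, while the cells above stop at returning the `answer` DataFrame. Writing it out could look like the sketch below, shown on a small stand-in frame and an in-memory buffer so that it runs without the real data. The `^` separator is an assumption chosen to match the input files.

```python
import io
import pandas as pd

# Stand-in for the DataFrame returned by all_data_result (rows are illustrative)
toy_answer = pd.DataFrame({
    "Origin": ["TXL", "RUH"],
    "Destination": ["AUH", "JED"],
    "searches_Id": [0, 432],
    "booked 1= yes, 0 = no": [0, 1],
})

# Write to an in-memory buffer; a file path would work the same way
buffer = io.StringIO()
toy_answer.to_csv(buffer, sep="^", index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

With the real data, the equivalent call would be `all_data_result('bookings.csv', 'searches.csv').to_csv('searches_with_booking_flag.csv', sep='^', index=False)`, where the output file name is a hypothetical choice.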