## Match searches with bookings

- For every search in the searches file, find out whether the search ended up in a booking or not (using the info in the bookings file). For instance, search and booking origin and destination should match. 

- For the bookings file, origin and destination are the columns dep_port and arr_port, respectively. 

- Generate a CSV file with the search data, and an additional field, containing 1 if the search ended up in a booking, and 0 otherwise.



Suggestion: follow this plan of action:

* Get familiar with the data
* Select columns of interest
* Decide what to do with NaNs

* Make processing plan
* Develop code that works with a sample

* Adjust the code to work with big data
* Test big data approach on a sample

* Run program with big data

You can skip the first step this time, since you already did it for the other exercises.

## 2) Prepare the data for processing

### Booking

#### We didn't check for duplicates so far... What if the file has duplicated lines?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

In [15]:
bookings_path = 'bookings.csv.bz2'
searches_path = 'searches.csv.bz2'


In [31]:
bookings = pd.read_csv(bookings_path, sep = '^', nrows = 10000)
searches = pd.read_csv(searches_path, sep = '^', nrows = 10000)

In [32]:
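# Take the dtypes from the 10,000-row frames read in the previous cell
# (bookings_sample / searches_sample are only created further down).
bookings_types = bookings.dtypes.to_dict()
searches_types = searches.dtypes.to_dict()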

In [33]:
bookings.columns

Index(['act_date           ', 'source', 'pos_ctry', 'pos_iata', 'pos_oid  ',
       'rloc          ', 'cre_date           ', 'duration', 'distance',
       'dep_port', 'dep_city', 'dep_ctry', 'arr_port', 'arr_city', 'arr_ctry',
       'lst_port', 'lst_city', 'lst_ctry', 'brd_port', 'brd_city', 'brd_ctry',
       'off_port', 'off_city', 'off_ctry', 'mkt_port', 'mkt_city', 'mkt_ctry',
       'intl', 'route          ', 'carrier', 'bkg_class', 'cab_class',
       'brd_time           ', 'off_time           ', 'pax', 'year', 'month',
       'oid      '],
      dtype='object')

In [34]:
cols_to_use_bookings = ['cre_date           ','dep_port','arr_port']


In [35]:
bookings_sample = pd.read_csv(bookings_path, sep = '^', nrows = 100000, dtype=bookings_types,usecols=cols_to_use_bookings)
searches_sample = pd.read_csv(searches_path, sep = '^', nrows = 100000, dtype=searches_types)

In [27]:
# Deduplicate the bookings; with that we could combine the two files.
deduped_bookings_sample = bookings_sample.drop_duplicates()

In [36]:
bookings = pd.read_csv(bookings_path, sep = '^', dtype=bookings_types, usecols=cols_to_use_bookings)



AttributeError: can't set attribute

In [40]:
deduped_bookings = bookings.drop_duplicates()
deduped_bookings.shape

(334877, 3)

In [38]:
bookings.shape

(10000010, 3)

http://stackoverflow.com/questions/24251219/pandas-read-csv-low-memory-and-dtype-options
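
Reading ten million rows in one go works here because only three columns are kept, but the same deduplication can also be done chunk by chunk. The sketch below is only an illustration: the inline data is made up, `dedupe_in_chunks` is a hypothetical helper name, and an in-memory buffer stands in for `bookings.csv.bz2`.

```python
import io
import pandas as pd

# Made-up stand-in for bookings.csv.bz2: '^'-separated, padded header,
# and several exact duplicate rows.
raw = (
    "cre_date           ^dep_port^arr_port\n"
    "2013-03-05 00:00:00^ZRH     ^LHR     \n"
    "2013-03-05 00:00:00^ZRH     ^LHR     \n"
    "2013-03-26 00:00:00^SAL     ^CLT     \n"
    "2013-03-26 00:00:00^SAL     ^CLT     \n"
    "2013-03-05 00:00:00^ZRH     ^LHR     \n"
)

def dedupe_in_chunks(source, chunksize):
    """Read `source` in chunks and keep only the distinct rows (hypothetical helper)."""
    kept = []
    for chunk in pd.read_csv(source, sep="^", chunksize=chunksize, dtype=str):
        # Deduplicating each chunk keeps the intermediate list small.
        kept.append(chunk.drop_duplicates())
    # Duplicates can straddle chunk boundaries, so deduplicate once more at the end.
    combined = pd.concat(kept, ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)

deduped = dedupe_in_chunks(io.StringIO(raw), chunksize=2)
print(deduped.shape)
```

With the real file, `source` would be `bookings_path` and `usecols=cols_to_use_bookings` could be passed to `pd.read_csv` as in the cells above.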

#### We have seen that some columns are padded with white space...

In [41]:
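# Note: `columms` is a typo for `columns`, so this line does not rename anything.
# The padded names survive, which is why later cells still use 'cre_date           '.
deduped_bookings.columms = deduped_bookings.columns.str.strip()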
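
A minimal sketch of the intended clean-up, on a toy frame with made-up values: the column names are stripped by assigning to `.columns`, and the padded string values are stripped column by column.

```python
import pandas as pd

# Toy frame with the same kind of padding as the bookings header (values are made up).
df = pd.DataFrame({
    "cre_date           ": ["2013-03-05 00:00:00"],
    "dep_port": ["ZRH     "],
    "arr_port": ["LHR     "],
})

# Assigning to `.columns` is what actually renames the columns.
df.columns = df.columns.str.strip()

# Strip the padded content of the airport columns as well.
for col in ["dep_port", "arr_port"]:
    df[col] = df[col].str.strip()

print(list(df.columns))
```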

#### Could we do this with the command line?

### Search

### NaN and prepare dates

## 3) Make processing plan

Target: generate a CSV file with the search data, and an additional field, containing 1 if the search ended up in a booking, and 0 otherwise.

1) remove duplicates

2) parse dates from string to datetime  (X) NOT NEEDED

3) remove whitespace

    a) from column names
    
    b) from content

4) remove NaN

5) define the model
    If there is at least one booking for a given O&D made on the same day as the search, ALL searches of that day for that O&D are treated as possibly leading to that booking and are set to 1.
    This is regardless of the boarding time of the flight. So if I searched for flights on each of the first 4 days of December for a given O&D, all of those searches would be set to 1, not just the one corresponding to the booked boarding time.

        match
        Search : [search_date, O&D] 
        Booking: [Activity_date, O&D]

6) execute the model

    a) Group the bookings by [Activity_date, O&D] so that we don't have duplicates (and we get the number of bookings for the day), or just drop the duplicates
    
    b) left join searches with bookings, adding a "booked" column
    
    c) test if the merge was done right
    
    d) fill NaN of the "booked" column with 0
    
    e) cap all values of the "booked" column greater than 1 at 1

A more complex option: get the number of segments from the searches, match on search_date, then split out the O&D of every segment and match the date of the first flight of each segment (seg1Date, seg2Date) against the boarding time, O&D and act_date of the booking.
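
The simple model (steps 6a to 6e) can be sketched on toy data as follows. The rows are made up, and the booking date used is `cre_date`, as in the cells of this notebook, rather than the activity date named in the plan; treat it as an illustration of the steps, not as the final program.

```python
import pandas as pd

# Made-up miniature versions of the two files; column names follow the notebook.
searches = pd.DataFrame({
    "Date": ["2013-01-01", "2013-01-01", "2013-01-02"],
    "Origin": ["TXL", "ATH", "TXL"],
    "Destination": ["AUH", "MIL", "AUH"],
})
bookings = pd.DataFrame({
    "cre_date": ["2013-01-01 00:00:00", "2013-01-01 00:00:00", "2013-01-03 00:00:00"],
    "dep_port": ["TXL", "TXL", "ATH"],
    "arr_port": ["AUH", "AUH", "MIL"],
})

# 6a) one row per (day, O&D), carrying the number of bookings for that day
bookings["OnlyDate"] = bookings["cre_date"].str[:10]
per_day = (
    bookings.groupby(["OnlyDate", "dep_port", "arr_port"])
    .size()
    .reset_index(name="booked")
)

# 6b) the left join keeps every search, matched or not
joined = searches.merge(
    per_day,
    how="left",
    left_on=["Date", "Origin", "Destination"],
    right_on=["OnlyDate", "dep_port", "arr_port"],
)

# 6c) per_day has unique keys, so the join must not add or lose rows
assert len(joined) == len(searches)

# 6d) unmatched searches get 0; 6e) counts above 1 are capped at 1
joined["booked"] = joined["booked"].fillna(0).clip(upper=1).astype(int)

result = joined.drop(columns=["OnlyDate", "dep_port", "arr_port"])
print(result)
```

Writing the output would then be a call such as `result.to_csv(...)` with whatever file name the exercise requires.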

What do we have?

In [43]:
list(deduped_bookings)

['cre_date           ', 'dep_port', 'arr_port']

In [44]:
# We have to cross bookings_sample with searches_sample
deduped_bookings['dep_port'] = deduped_bookings['dep_port'].str.strip()
deduped_bookings['cre_date'] = deduped_bookings['cre_date           '].str.strip()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [46]:
# Check that every value in the column has the same length:
lens = searches_sample['Origin'].apply(len)
lens.unique()

array([3])

In [48]:
deduped_bookings['dep_port'] = deduped_bookings['dep_port'].str.strip()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [55]:
deduped_bookings['arr_port'] = deduped_bookings['arr_port'].str.strip()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
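
The repeated warning appears because pandas cannot tell whether `deduped_bookings` is an independent frame or a slice of `bookings`. One common way to avoid it, sketched here on made-up rows, is to take an explicit copy right after deduplicating.

```python
import pandas as pd

# Made-up rows with padded airport codes.
bookings = pd.DataFrame({
    "dep_port": ["ZRH     ", "ZRH     ", "SAL     "],
    "arr_port": ["LHR     ", "LHR     ", "CLT     "],
})

# .copy() makes the deduplicated frame independent of `bookings`,
# so assigning to its columns is unambiguous.
deduped_bookings = bookings.drop_duplicates().copy()
deduped_bookings["dep_port"] = deduped_bookings["dep_port"].str.strip()
deduped_bookings["arr_port"] = deduped_bookings["arr_port"].str.strip()

print(deduped_bookings)
```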


In [56]:
joined = searches_sample.merge(deduped_bookings,
                      left_on=['Origin','Destination','Date'],
                      right_on=['dep_port','arr_port','cre_date'],
                      how ='left')

In [57]:
joined

Unnamed: 0,Date,Time,TxnCode,OfficeID,Country,Origin,Destination,RoundTrip,NbSegments,Seg1Departure,...,From,IsPublishedForNeg,IsFromInternet,IsFromVista,TerminalID,InternetOffice,cre_date,dep_port,arr_port,cre_date.1
0,2013-01-01,20:25:57,MPT,624d8c3ac0b3a7ca03e3c167e0f48327,DE,TXL,AUH,1,2,TXL,...,1ASIWS,0,0,0,d41d8cd98f00b204e9800998ecf8427e,FRA,,,,
1,2013-01-01,10:15:33,MPT,b0af35b31588dc4ab06d5cf2986e8e02,MD,ATH,MIL,0,1,ATH,...,1ASIWS,0,0,0,d41d8cd98f00b204e9800998ecf8427e,KIV,,,,
2,2013-01-01,18:04:49,MPT,3561a60621de06ab1badc8ca55699ef3,US,ICT,SFO,1,2,ICT,...,1ASIWS,0,0,0,d41d8cd98f00b204e9800998ecf8427e,NYC,,,,
3,2013-01-01,17:42:40,FXP,1864e5e8013d9414150e91d26b6a558b,SE,RNB,ARN,0,1,RNB,...,1ASI,0,0,0,d41d8cd98f00b204e9800998ecf8427e,STO,,,,
4,2013-01-01,17:48:29,MPT,1ec336348f44207d2e0027dc3a68c118,NO,OSL,MAD,1,2,OSL,...,1ASIWS,0,0,0,d41d8cd98f00b204e9800998ecf8427e,OSL,,,,
5,2013-01-01,22:00:28,MPT,3561a60621de06ab1badc8ca55699ef3,US,IAH,BLR,1,2,IAH,...,1ASIWS,0,0,0,d41d8cd98f00b204e9800998ecf8427e,NYC,,,,
6,2013-01-01,10:47:14,MPT,d327ca6e35cc6732d4709828327ac7c1,DK,CPH,PAR,1,2,CPH,...,1ASI,0,0,0,d41d8cd98f00b204e9800998ecf8427e,CPH,,,,
7,2013-01-01,23:39:49,MPT,38a3abb0a28e3f00fa79a11f552a5052,FR,PAR,DUB,1,2,PAR,...,1ASIWS,0,0,0,d41d8cd98f00b204e9800998ecf8427e,PAR,,,,
8,2013-01-01,17:08:46,MPT,c8daef4f8bf73a61aa2c928705f7b82d,ES,DUS,ACE,1,2,DUS,...,1ASIWS,0,0,0,d41d8cd98f00b204e9800998ecf8427e,MAD,,,,
9,2013-01-01,19:57:57,MPT,28d7a8c95e4db88589d3d35b66920e78,DE,FRA,BGW,1,2,FRA,...,1ASI,0,0,0,d41d8cd98f00b204e9800998ecf8427e,BNJ,,,,
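
All booking columns are empty in the rows shown. A likely cause, assuming `cre_date` carries a time component (which the `.str[:10]` slice in the next cell suggests), is that a string such as `2013-01-01 00:00:00` never equals the search `Date` `2013-01-01`. The sketch below reproduces this on one made-up row and uses the `indicator` option of `merge` to count matches.

```python
import pandas as pd

# One made-up search and one made-up booking for the same day and O&D.
searches = pd.DataFrame({"Date": ["2013-01-01"], "Origin": ["TXL"], "Destination": ["AUH"]})
bookings = pd.DataFrame({
    "cre_date": ["2013-01-01 00:00:00"],
    "dep_port": ["TXL"],
    "arr_port": ["AUH"],
})

# Joining on the full timestamp string finds nothing.
raw_join = searches.merge(
    bookings, how="left", indicator=True,
    left_on=["Origin", "Destination", "Date"],
    right_on=["dep_port", "arr_port", "cre_date"],
)

# Joining on the first 10 characters (the date part) finds the booking.
bookings["OnlyDate"] = bookings["cre_date"].str[:10]
day_join = searches.merge(
    bookings, how="left", indicator=True,
    left_on=["Origin", "Destination", "Date"],
    right_on=["dep_port", "arr_port", "OnlyDate"],
)

# The _merge column says whether each search row found a partner.
print(raw_join["_merge"].value_counts())
print(day_join["_merge"].value_counts())
```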


In [63]:
deduped_bookings['OnlyDate'] = deduped_bookings['cre_date'].str[:10]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [64]:
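# A search counts as booked when the left join found a matching booking row.
joined['booked'] = joined['dep_port'].notnull().astype(int)
# Drop the booking-side helper columns; errors='ignore' skips any that are absent (e.g. 'OnlyDate').
searches_matched = joined.drop(['OnlyDate','dep_port','arr_port'], axis=1, errors='ignore')
# Another option, if 'booked' comes from the join itself: searches_matched['booked'] = searches_matched['booked'].fillna(0)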

#### If we don't want to drop the duplicates

### Now we have to set to 1 every row with number of bookings > 1 in the clean file

### Can we do it with concat?
No, since it only aligns rows on the index, not on the values of the key columns.
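
A small illustration on made-up rows: `pd.concat(..., axis=1)` pairs rows that share an index label, whereas `merge` pairs rows whose key columns are equal.

```python
import pandas as pd

# Made-up data: the only booking is ATH-MIL, which is the SECOND search.
searches = pd.DataFrame({"Origin": ["TXL", "ATH"], "Destination": ["AUH", "MIL"]})
bookings = pd.DataFrame({"dep_port": ["ATH"], "arr_port": ["MIL"], "booked": [1]})

# concat lines rows up by index label, so search 0 (TXL-AUH) is wrongly
# placed next to booking 0 (ATH-MIL).
side_by_side = pd.concat([searches, bookings], axis=1)

# merge compares the key columns, so only the ATH-MIL search gets the booking.
merged = searches.merge(
    bookings, how="left",
    left_on=["Origin", "Destination"],
    right_on=["dep_port", "arr_port"],
)

print(side_by_side)
print(merged)
```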

### Check if all airports are letters
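
One possible check, sketched on made-up values and assuming that a valid airport code means exactly three uppercase ASCII letters: flag every value that does not match that pattern.

```python
import pandas as pd

# Made-up codes, including a digit, a lowercase code and a missing value.
ports = pd.Series(["TXL", "AUH", "MIL", "A1B", "zrh", None])

# True only for exactly three uppercase letters; missing values count as invalid.
is_three_letters = ports.str.match(r"^[A-Z]{3}$", na=False)

# The values that would need a closer look.
suspicious = ports[~is_three_letters]
print(suspicious)
```

On the real data the same test could be applied to `Origin`, `Destination`, `dep_port` and `arr_port` after stripping the padding.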