[View in Colaboratory](https://colab.research.google.com/github/ibarrond/Amadeus/blob/master/AmadeusChallenge.ipynb)

# Amadeus challenge for Data Scientists

---



## 1. Set Up the Environment and Load the Data
This environment runs on a Linux virtual machine and offers Python 3 with GPU acceleration and 12 GB of RAM. That is enough RAM to handle the two files provided for this challenge in memory:
- `bookings.csv` - 4GB
- `searches.csv` - 3.4GB

### 1.1 Obtaining the Data
To import the data, we upload the files to a Google Drive folder, mount that folder in the VM (based on [this tutorial](https://colab.research.google.com/notebooks/io.ipynb)), and finally access the files:

In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


We check that the files are there:

In [19]:
!ls -la 'gdrive/My Drive/Colab Notebooks/Amadeus'

total 7754993
-rw------- 1 root root       8350 Oct  8 12:59 AmadeusChallenge.ipynb
-rw------- 1 root root 4244874509 Oct  8 11:57 bookings.csv
-rw------- 1 root root 3696229366 Oct  8 12:36 searches.csv
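
As a quick sanity check on the sizes quoted earlier, the byte counts in the listing above can be converted to binary gigabytes (GiB). This is only an illustrative sketch; the `sizes` dictionary simply copies the two numbers from the `ls` output, and `GIB` and `sizes_gib` are names introduced here.

```python
# Byte counts copied from the `ls -la` listing above
sizes = {
    'bookings.csv': 4244874509,
    'searches.csv': 3696229366,
}

GIB = 2 ** 30  # number of bytes in one binary gigabyte

# Convert each size from bytes to GiB
sizes_gib = {name: n_bytes / GIB for name, n_bytes in sizes.items()}

for name, gib in sizes_gib.items():
    print(f"{name}: {gib:.2f} GiB")
```

Together the two files take roughly 7.4 GiB on disk, which is below the 12 GB of RAM, although a parsed DataFrame can need more memory than the raw text.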


In [0]:
# We store the path to each of the data files
bookings_file = 'gdrive/My Drive/Colab Notebooks/Amadeus/bookings.csv'
searches_file = 'gdrive/My Drive/Colab Notebooks/Amadeus/searches.csv'

#### FIRST EXERCISE: count the number of lines in Python for each file
After a quick search online, this [source](https://stackoverflow.com/questions/16108526/count-how-many-lines-are-in-a-csv-python) suggested using the `open()` function to count the number of lines:

In [0]:
# Stream each file once and count its lines without loading it into memory
with open(bookings_file) as f: bookings_num_lines = sum(1 for line in f)
with open(searches_file) as f: searches_num_lines = sum(1 for line in f)

In [7]:
bookings_num_lines

10000011

In [8]:
searches_num_lines

20390199

__*RESULT*__: we counted **10,000,011** lines in the `bookings.csv` file, and **20,390,199** lines in the `searches.csv` file.
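
For files of this size, a possible alternative is to read fixed-size binary chunks and count newline bytes, which skips text decoding. The sketch below is not the method used above; `count_lines` and `chunk_size` are hypothetical names, and the self-check runs on a throwaway file rather than on the challenge data.

```python
import os
import tempfile

def count_lines(path, chunk_size=1024 * 1024):
    """Count newline bytes by reading the file in fixed-size binary chunks."""
    count = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty read means end of file
                break
            count += chunk.count(b'\n')
    return count

# Small self-check on a temporary file with made-up content
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("a^b\n1^2\n3^4\n")
    tmp_path = tmp.name

n_lines = count_lines(tmp_path)
os.remove(tmp_path)
print(n_lines)
```

Note that this counts newline characters, so a last line without a trailing newline would not be counted, unlike with `sum(1 for line in f)`.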

### 1.2 Importing the Data
Now that the files are in place, we import them into our IPython session using **Pandas**, which we will use for most of our data management.

In [0]:
import pandas as pd
import numpy as np

Before importing the whole csv file, we'll first have a look at the data structure by examining the first few lines of the file:

In [25]:
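# Read only the first rows: enough to inspect the structure without parsing the whole file
bookings = pd.read_csv(bookings_file, nrows=5)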
bookings.head()

Unnamed: 0,act_date ^source^pos_ctry^pos_iata^pos_oid ^rloc ^cre_date ^duration^distance^dep_port^dep_city^dep_ctry^arr_port^arr_city^arr_ctry^lst_port^lst_city^lst_ctry^brd_port^brd_city^brd_ctry^off_port^off_city^off_ctry^mkt_port^mkt_city^mkt_ctry^intl^route ^carrier^bkg_class^cab_class^brd_time ^off_time ^pax^year^month^oid
0,2013-03-05 00:00:00^1A ^DE ^a68dd7ae95...
1,2013-03-26 00:00:00^1A ^US ^e612b9eeee...
2,2013-03-26 00:00:00^1A ^US ^e612b9eeee...
3,2013-03-26 00:00:00^1A ^AU ^0f984b3bb6...
4,2013-03-26 00:00:00^1A ^AU ^0f984b3bb6...
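
Everything landing in a single wide column suggests that the default comma separator is wrong. One way to check this without pandas, sketched here, is to count candidate characters in the raw header line. `guess_delimiter` and its candidate list are hypothetical helpers, and the `header` string is a shortened stand-in for the real first line.

```python
def guess_delimiter(first_line, candidates=('^', ',', ';', '\t', '|')):
    """Return the candidate character that occurs most often in the line."""
    return max(candidates, key=first_line.count)

# Shortened header in the style of the one printed above
header = "act_date           ^source^pos_ctry^pos_iata^pos_oid  ^rloc"
delimiter = guess_delimiter(header)
print(repr(delimiter))

# On the real file, the first line could be read with:
#   with open(bookings_file) as f:
#       first_line = f.readline()
```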


Clearly the symbol `^` is being used as the separator. With this in mind, we reload the first rows with the correct separator:

In [9]:
bookings = pd.read_csv(bookings_file, sep='^', nrows=5)
bookings.head()

Unnamed: 0,act_date,source,pos_ctry,pos_iata,pos_oid,rloc,cre_date,duration,distance,dep_port,...,route,carrier,bkg_class,cab_class,brd_time,off_time,pax,year,month,oid
0,2013-03-05 00:00:00,1A,DE,a68dd7ae953c8acfb187a1af2dcbe123,1a11ae49fcbf545fd2afc1a24d88d2b7,ea65900e72d71f4626378e2ebd298267,2013-02-22 00:00:00,1708,0,ZRH,...,LHRZRH,VI,T,Y,2013-03-07 08:50:00,2013-03-07 11:33:37,-1,2013,3,
1,2013-03-26 00:00:00,1A,US,e612b9eeeee6f17f42d9b0d3b79e75ca,7437560d8f276d6d05eeb806d9e7edee,737295a86982c941f1c2da9a46a14043,2013-03-26 00:00:00,135270,0,SAL,...,SALATLCLT,NV,L,Y,2013-04-12 13:04:00,2013-04-12 22:05:40,1,2013,3,
2,2013-03-26 00:00:00,1A,US,e612b9eeeee6f17f42d9b0d3b79e75ca,7437560d8f276d6d05eeb806d9e7edee,737295a86982c941f1c2da9a46a14043,2013-03-26 00:00:00,135270,0,SAL,...,CLTATLSAL,NV,U,Y,2013-07-15 07:00:00,2013-07-15 11:34:51,1,2013,3,
3,2013-03-26 00:00:00,1A,AU,0f984b3bb6bd06661c95529bbd6193bc,36472c6dbaf7afec9136ac40364e2794,5ecf00fdcbcec761c43dc7285253d0c1,2013-03-26 00:00:00,30885,0,AKL,...,AKLHKGSVO,XK,G,Y,2013-04-24 23:59:00,2013-04-25 16:06:31,1,2013,3,SYDA82546
4,2013-03-26 00:00:00,1A,AU,0f984b3bb6bd06661c95529bbd6193bc,36472c6dbaf7afec9136ac40364e2794,5ecf00fdcbcec761c43dc7285253d0c1,2013-03-26 00:00:00,30885,0,AKL,...,SVOHKGAKL,XK,G,Y,2013-05-14 20:15:00,2013-05-16 10:44:50,1,2013,3,SYDA82546
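
The raw header in the first preview shows padding around some names (for example `act_date ` and `rloc `), and the values are padded as well. A minimal sketch of stripping that whitespace, on a made-up sample rather than the real file, could look like this:

```python
import io
import pandas as pd

# Made-up two-row sample that mimics the padded header and values
sample = io.StringIO(
    "act_date           ^source^pos_ctry^rloc          \n"
    "2013-03-05 00:00:00^1A    ^DE      ^ea65900e72d71f\n"
    "2013-03-26 00:00:00^1A    ^US      ^737295a86982c9\n"
)

df = pd.read_csv(sample, sep='^')

# Remove padding from the column names
df.columns = df.columns.str.strip()

# Remove padding from the values of every text column
for col in df.columns:
    if pd.api.types.is_string_dtype(df[col]):
        df[col] = df[col].str.strip()

print(df.columns.tolist())
print(df['pos_ctry'].tolist())
```

Stripping the names matters when a `dtype` dictionary is keyed by column name, since a padded name would not match a key such as `'act_date'`.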


We still need to format the data properly. To do so, we inspect one row and assign a type to each field:

In [16]:
bookings.iloc[0]

act_date                            2013-03-05 00:00:00
source                                           1A    
pos_ctry                                       DE      
pos_iata               a68dd7ae953c8acfb187a1af2dcbe123
pos_oid                1a11ae49fcbf545fd2afc1a24d88d2b7
rloc                   ea65900e72d71f4626378e2ebd298267
cre_date                            2013-02-22 00:00:00
duration                                           1708
distance                                              0
dep_port                                       ZRH     
dep_city                                       ZRH     
dep_ctry                                       CH      
arr_port                                       LHR     
arr_city                                       LON     
arr_ctry                                       GB      
lst_port                                       ZRH     
lst_city                                       ZRH     
lst_ctry                                       C

In [0]:
dtype_dict = {
    'act_date':'datetime64',
    'source': 'category',
    'pos_ctry': 'category',
    'pos_iata': 'str',
    'pos_oid': 'str',
    'rloc': 'str',
    'cre_date': 'datetime64',
    'duration': 'int64',
    'distance': 'float64',
    'dep_port': 'category',
    'dep_city': 'category',
    'dep_ctry': 'category',
    'arr_port': 'category',
    'arr_city': 'category',
    'arr_ctry': 'category',
    'lst_port': 'category',
    'lst_city': 'category',
    'lst_ctry': 'category',
    'brd_port': 'category',
    'brd_city': 'category',
    'brd_ctry': 'category',
    'off_port': 'category',
    'off_city': 'category',
    'off_ctry': 'category',
    'mkt_port': 'str',
    'mkt_city': 'str',
    'mkt_ctry': 'str',
    'intl': 'int64',
    'route': 'str',
    'carrier': 'category',
    'bkg_class': 'category',
    'cab_class': 'category',
    'brd_time': 'datetime64',
    'off_time': 'datetime64',
    'pax': 'int64',
    'year': 'int64',
    'month': 'int64',
    'oid': 'str'}
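
The `category` type is chosen for columns with few distinct values, such as airport or country codes, because pandas then stores each distinct value once plus a small integer code per row. The sketch below illustrates the effect on a made-up column; the exact numbers depend on the pandas version.

```python
import pandas as pd

# Made-up column: 30,000 rows but only three distinct airport codes
ports = pd.Series(['ZRH', 'LHR', 'AKL'] * 10_000)

# Memory used by the default text representation vs. the categorical one
as_text = ports.memory_usage(deep=True)
as_category = ports.astype('category').memory_usage(deep=True)

print(f"text:     {as_text:,} bytes")
print(f"category: {as_category:,} bytes")
```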

We then attempt to load the full `bookings.csv` file with these types:

In [24]:
bookings = pd.read_csv(bookings_file, sep='^', dtype=dtype_dict)

ValueError: ignored
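
The traceback is not shown, so the exact cause of this `ValueError` cannot be confirmed from the output. Two kinds of entries in `dtype_dict` are common sources of trouble with `read_csv`, though: datetime columns are normally requested through `parse_dates` rather than `dtype`, and a plain `int64` column cannot hold missing values. A minimal sketch under those assumptions, on made-up rows and assuming a pandas version that provides the nullable `Int64` type:

```python
import io
import pandas as pd

# Made-up sample in the same '^'-separated layout; the second row has no `pax` value
sample = io.StringIO(
    "act_date^pos_ctry^pax\n"
    "2013-03-05 00:00:00^DE^-1\n"
    "2013-03-26 00:00:00^US^\n"
)

df = pd.read_csv(
    sample,
    sep='^',
    # 'Int64' (capital I) is the nullable integer type, so missing values are allowed
    dtype={'pos_ctry': 'category', 'pax': 'Int64'},
    # Date columns are listed here instead of being given a datetime dtype
    parse_dates=['act_date'],
)

print(df.dtypes)
```

The same pattern would apply to the other date columns (`cre_date`, `brd_time`, `off_time`) and integer columns, provided their values are actually well formed in the full file.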