[View in Colaboratory](https://colab.research.google.com/github/ibarrond/Amadeus/blob/master/AmadeusChallenge.ipynb)

# Amadeus challenge for Data Scientists

---



## 1. Set Up the Environment and Load the Data
This environment, running on a Linux virtual machine, offers Python 3 with GPU acceleration and 12 GB of RAM. That amount of RAM makes it possible to handle the two files provided for this challenge in memory:
- `bookings.csv` - 4 GB
- `searches.csv` - 3.4 GB

### 1.1 Obtaining the Data
To import the data, we upload the files to a Google Drive folder, mount that folder in the VM (based on [this tutorial](https://colab.research.google.com/notebooks/io.ipynb)), and finally access the files:

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


We check that the files are there:

In [19]:
!ls -la 'gdrive/My Drive/Colab Notebooks/Amadeus'

total 7754993
-rw------- 1 root root       8350 Oct  8 12:59 AmadeusChallenge.ipynb
-rw------- 1 root root 4244874509 Oct  8 11:57 bookings.csv
-rw------- 1 root root 3696229366 Oct  8 12:36 searches.csv


In [0]:
# We store the path to each of the data files
bookings_file = 'gdrive/My Drive/Colab Notebooks/Amadeus/bookings.csv'
searches_file = 'gdrive/My Drive/Colab Notebooks/Amadeus/searches.csv'

#### FIRST EXERCISE: count the number of lines in Python for each file
After a quick search online, this [source](https://stackoverflow.com/questions/16108526/count-how-many-lines-are-in-a-csv-python) suggested using the `open()` function and iterating over the file to count its lines:

In [0]:
with open(bookings_file) as f:
    bookings_num_lines = sum(1 for _ in f)
with open(searches_file) as f:
    searches_num_lines = sum(1 for _ in f)

In [7]:
bookings_num_lines

10000011

In [8]:
searches_num_lines

20390199

__*RESULT*__: we counted **10,000,011** lines in the `bookings.csv` file, and **20,390,199** lines in the `searches.csv` file.
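
As a possible alternative to iterating line by line in text mode, the count can also be obtained by reading the file in binary chunks and counting newline bytes, which avoids decoding the text. The sketch below is only an illustration: `count_lines` and the temporary file are hypothetical helpers, not part of the original notebook. Either way, the total includes the header line of the CSV.

```python
import os
import tempfile


def count_lines(path, chunk_size=1024 * 1024):
    """Count lines by reading the file in fixed-size binary chunks."""
    count = 0
    last_byte = b"\n"  # an empty file therefore counts as 0 lines
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line
    if last_byte != b"\n":
        count += 1
    return count


# Self-check on a small temporary file (made-up data, not the challenge files)
with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".csv") as tmp:
    tmp.write(b"a^b\n1^2\n3^4")  # header + 2 rows, no trailing newline
    tmp_path = tmp.name

n_lines = count_lines(tmp_path)
os.remove(tmp_path)
print(n_lines)
```

In the notebook this would be called as `count_lines(bookings_file)` and `count_lines(searches_file)`.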

### 1.2 Importing the Bookings Data
Now that the files are in place, we import them into our IPython session using **Pandas**, which we will use for most of our data management.

In [0]:
import pandas as pd
import numpy as np

Before importing the whole csv file, we'll first have a look at the data structure by examining the first few lines of the file:

In [7]:
bookings = pd.read_csv(bookings_file, nrows=5)
bookings.head()

Unnamed: 0,act_date ^source^pos_ctry^pos_iata^pos_oid ^rloc ^cre_date ^duration^distance^dep_port^dep_city^dep_ctry^arr_port^arr_city^arr_ctry^lst_port^lst_city^lst_ctry^brd_port^brd_city^brd_ctry^off_port^off_city^off_ctry^mkt_port^mkt_city^mkt_ctry^intl^route ^carrier^bkg_class^cab_class^brd_time ^off_time ^pax^year^month^oid
0,2013-03-05 00:00:00^1A ^DE ^a68dd7ae95...
1,2013-03-26 00:00:00^1A ^US ^e612b9eeee...
2,2013-03-26 00:00:00^1A ^US ^e612b9eeee...
3,2013-03-26 00:00:00^1A ^AU ^0f984b3bb6...
4,2013-03-26 00:00:00^1A ^AU ^0f984b3bb6...


Clearly, the symbol `^` is being used as the separator. With this in mind, we load the data correctly:

In [62]:
bookings = pd.read_csv(bookings_file, sep='^', nrows=5)
bookings.head()

Unnamed: 0,act_date,source,pos_ctry,pos_iata,pos_oid,rloc,cre_date,duration,distance,dep_port,...,route,carrier,bkg_class,cab_class,brd_time,off_time,pax,year,month,oid
0,2013-03-05 00:00:00,1A,DE,a68dd7ae953c8acfb187a1af2dcbe123,1a11ae49fcbf545fd2afc1a24d88d2b7,ea65900e72d71f4626378e2ebd298267,2013-02-22 00:00:00,1708,0,ZRH,...,LHRZRH,VI,T,Y,2013-03-07 08:50:00,2013-03-07 11:33:37,-1,2013,3,
1,2013-03-26 00:00:00,1A,US,e612b9eeeee6f17f42d9b0d3b79e75ca,7437560d8f276d6d05eeb806d9e7edee,737295a86982c941f1c2da9a46a14043,2013-03-26 00:00:00,135270,0,SAL,...,SALATLCLT,NV,L,Y,2013-04-12 13:04:00,2013-04-12 22:05:40,1,2013,3,
2,2013-03-26 00:00:00,1A,US,e612b9eeeee6f17f42d9b0d3b79e75ca,7437560d8f276d6d05eeb806d9e7edee,737295a86982c941f1c2da9a46a14043,2013-03-26 00:00:00,135270,0,SAL,...,CLTATLSAL,NV,U,Y,2013-07-15 07:00:00,2013-07-15 11:34:51,1,2013,3,
3,2013-03-26 00:00:00,1A,AU,0f984b3bb6bd06661c95529bbd6193bc,36472c6dbaf7afec9136ac40364e2794,5ecf00fdcbcec761c43dc7285253d0c1,2013-03-26 00:00:00,30885,0,AKL,...,AKLHKGSVO,XK,G,Y,2013-04-24 23:59:00,2013-04-25 16:06:31,1,2013,3,SYDA82546
4,2013-03-26 00:00:00,1A,AU,0f984b3bb6bd06661c95529bbd6193bc,36472c6dbaf7afec9136ac40364e2794,5ecf00fdcbcec761c43dc7285253d0c1,2013-03-26 00:00:00,30885,0,AKL,...,SVOHKGAKL,XK,G,Y,2013-05-14 20:15:00,2013-05-16 10:44:50,1,2013,3,SYDA82546
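
The separator was identified above by eye. As a rough sketch of how the same guess could be made programmatically, one can count how often each candidate character appears in the header line and pick the most frequent one. The function `guess_separator`, its candidate list, and the shortened header string are assumptions introduced for illustration.

```python
def guess_separator(header_line, candidates=(",", "^", ";", "\t", "|")):
    """Return the candidate character that occurs most often in the header line."""
    counts = {sep: header_line.count(sep) for sep in candidates}
    return max(counts, key=counts.get)


# Shortened version of the header shown above (illustrative only)
sample_header = "act_date ^source^pos_ctry^pos_iata^pos_oid ^rloc ^cre_date "
sep = guess_separator(sample_header)
print(repr(sep))
```

The result could then be passed to `pd.read_csv(bookings_file, sep=sep, nrows=5)`. This heuristic is deliberately simple and would be fooled by a header whose field names themselves contain many candidate characters.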


We still need to format the data properly. To do so, we inspect one row and assign types to each of the fields:

In [16]:
bookings.iloc[0]

act_date                            2013-03-05 00:00:00
source                                           1A    
pos_ctry                                       DE      
pos_iata               a68dd7ae953c8acfb187a1af2dcbe123
pos_oid                1a11ae49fcbf545fd2afc1a24d88d2b7
rloc                   ea65900e72d71f4626378e2ebd298267
cre_date                            2013-02-22 00:00:00
duration                                           1708
distance                                              0
dep_port                                       ZRH     
dep_city                                       ZRH     
dep_ctry                                       CH      
arr_port                                       LHR     
arr_city                                       LON     
arr_ctry                                       GB      
lst_port                                       ZRH     
lst_city                                       ZRH     
lst_ctry                                       C
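
The row above suggests that several column names and text values are padded with trailing spaces. A minimal sketch of how the fields might be cleaned and typed is shown below, assuming a small made-up sample with the same `^`-separated, padded layout. The rows and the choice of columns are hypothetical and only mimic the real file.

```python
import io

import pandas as pd

# Tiny inline sample mimicking the padded layout (made-up rows, not the real file)
raw = io.StringIO(
    "act_date           ^source^arr_port^pax\n"
    "2013-03-05 00:00:00^1A    ^LHR     ^-1\n"
    "2013-03-26 00:00:00^1A    ^CLT     ^1\n"
)

sample = pd.read_csv(raw, sep="^")

# Remove the padding from the column names
sample.columns = sample.columns.str.strip()

# Remove the padding from the text columns
for col in ["source", "arr_port"]:
    sample[col] = sample[col].str.strip()

# Convert the date column to a proper datetime type
sample["act_date"] = pd.to_datetime(sample["act_date"])

print(sample.dtypes)
```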

We can now load the `bookings.csv` data correctly. For the second exercise we only need the columns `pax` and `arr_port`, so we import only these:

In [0]:
bookings = pd.read_csv(bookings_file, usecols=['arr_port', 'pax'], sep='^')

In [36]:
bookings.head()

Unnamed: 0,arr_port,pax
0,LHR,-1.0
1,CLT,1.0
2,CLT,1.0
3,SVO,1.0
4,SVO,1.0


## 2. SECOND EXERCISE: Top 10 arrival airports in the world in 2013

Before adding up the number of passengers per airport, we should first check for missing data:

In [51]:
bookings.isnull().sum()

arr_port    0
pax         1
dtype: int64

Looks like there is one empty value in the pax column. We will find it and remove it from our dataframe:

In [55]:
bookings[bookings.isnull().any(axis=1)]

Unnamed: 0,arr_port,pax
5000007,SG,


In [0]:
bookings = bookings.drop(index=5000007)
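
Dropping the row by its index works here because we already know where the missing value is. A more general sketch, which does not depend on a hard-coded index, uses `dropna` restricted to the `pax` column. The small dataframe `toy` below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the two-column dataframe, with one missing pax value
toy = pd.DataFrame({"arr_port": ["LHR", "CLT", "SG", "SVO"],
                    "pax": [-1.0, 1.0, np.nan, 1.0]})

# Drop every row whose pax is missing, without hard-coding its index
toy_clean = toy.dropna(subset=["pax"])

print(len(toy), len(toy_clean))
```

Applied to the notebook's dataframe, this would read `bookings = bookings.dropna(subset=['pax'])`.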

Now we can run our query on the dataframe. It consists of three steps:
1. Aggregating by `arr_port`: we group the rows by arrival airport using `groupby(['arr_port'])`.
2. Accumulating the number of passengers: we sum `pax` using `sum()`.
3. Obtaining the top 10: we sort via `.sort_values(by=['pax'], ascending=False)` and keep the first 10 rows with `head(10)`.

In [0]:
top10_arr_port = bookings.groupby(['arr_port']).sum().sort_values(by=['pax'], ascending=False).head(10)

__*RESULT*__: The top 10 airports, including the number of passengers (`pax` column), are:

In [74]:
top10_arr_port

arr_port,pax
LHR,88809.0
MCO,70930.0
LAX,70530.0
LAS,69630.0
JFK,66270.0
CDG,64490.0
BKK,59460.0
MIA,58150.0
SFO,58000.0
DXB,55590.0


__*EXTRA*__: we can also obtain the name of the airport using geobases: