[View in Colaboratory](https://colab.research.google.com/github/ibarrond/Amadeus/blob/master/AmadeusChallenge.ipynb)

# Amadeus challenge for Data Scientists

---



## 1. Set Up the Environment and Load the Data
This environment, a Linux virtual machine, offers Python 3 with GPU acceleration and 12 GB of RAM. That amount of RAM makes it possible to handle in memory the two files provided for this challenge:
- `bookings.csv` - 4GB
- `searches.csv` - 3.4GB

## 1.1 Obtaining the Data
To import the data, we upload the files to a Google Drive folder, mount that folder in the VM (based on [this tutorial](https://colab.research.google.com/notebooks/io.ipynb)), and finally access the files:

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

We check that the files are there:

In [19]:
!ls -la 'gdrive/My Drive/Colab Notebooks/Amadeus'

total 7754993
-rw------- 1 root root       8350 Oct  8 12:59 AmadeusChallenge.ipynb
-rw------- 1 root root 4244874509 Oct  8 11:57 bookings.csv
-rw------- 1 root root 3696229366 Oct  8 12:36 searches.csv


In [0]:
# We store the path to each of the data files
bookings_file = 'gdrive/My Drive/Colab Notebooks/Amadeus/bookings.csv'
searches_file = 'gdrive/My Drive/Colab Notebooks/Amadeus/searches.csv'

### FIRST EXERCISE: count the number of lines in Python for each file
After a quick search online, this [source](https://stackoverflow.com/questions/16108526/count-how-many-lines-are-in-a-csv-python) suggested using the `open()` function and iterating over the file to count its lines:

In [0]:
with open(bookings_file) as f: bookings_num_lines = sum(1 for line in f)
with open(searches_file) as f: searches_num_lines = sum(1 for line in f)

In [23]:
bookings_num_lines

10000011

In [24]:
searches_num_lines

20390199

__*RESULT*__: we counted **10,000,011** lines in the `bookings.csv` file and **20,390,199** lines in the `searches.csv` file.
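
For files of this size, an alternative is to count newline bytes in fixed-size binary chunks, which avoids decoding every line as text. The sketch below is not the method used above; it assumes `\n`-terminated lines, and `count_lines`, `chunk_size` and the temporary demo file are hypothetical names introduced for illustration.

```python
import os
import tempfile


def count_lines(path, chunk_size=1024 * 1024):
    """Count lines by scanning fixed-size binary chunks for newline bytes."""
    count = 0
    last_byte = b"\n"  # an empty file has no unterminated last line
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line
    if last_byte != b"\n":
        count += 1
    return count


# Demonstration on a small temporary file (synthetic content)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a^b\n1^2\n3^4")  # three lines, the last one unterminated
    demo_path = tmp.name

demo_count = count_lines(demo_path)
os.remove(demo_path)
```

On the real data this would be called as `count_lines(bookings_file)`. Note that both approaches count the header row as a line.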

## 1.2 Importing the Data
Now that the files are in place, we import them into our IPython session with **Pandas**, which we will use for most of our data management.

In [0]:
import pandas as pd

Before importing the whole csv file, we'll first have a look at the data structure by examining the first few lines of the file:

In [25]:
bookings = pd.read_csv(bookings_file, nrows=5)
bookings.head()

Unnamed: 0,act_date ^source^pos_ctry^pos_iata^pos_oid ^rloc ^cre_date ^duration^distance^dep_port^dep_city^dep_ctry^arr_port^arr_city^arr_ctry^lst_port^lst_city^lst_ctry^brd_port^brd_city^brd_ctry^off_port^off_city^off_ctry^mkt_port^mkt_city^mkt_ctry^intl^route ^carrier^bkg_class^cab_class^brd_time ^off_time ^pax^year^month^oid
0,2013-03-05 00:00:00^1A ^DE ^a68dd7ae95...
1,2013-03-26 00:00:00^1A ^US ^e612b9eeee...
2,2013-03-26 00:00:00^1A ^US ^e612b9eeee...
3,2013-03-26 00:00:00^1A ^AU ^0f984b3bb6...
4,2013-03-26 00:00:00^1A ^AU ^0f984b3bb6...


Clearly, the `^` symbol is being used as the separator. With this in mind, we load the data correctly:

In [0]:
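# Note: error_bad_lines / warn_bad_lines were deprecated in pandas 1.3 and removed in 2.0;
# on recent versions the equivalent option is on_bad_lines='warn'.
bookings = pd.read_csv(bookings_file, sep='^', error_bad_lines=False, warn_bad_lines=True)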
bookings.head()

  interactivity=interactivity, compiler=compiler, result=result)
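
The header printed earlier shows that several column names are padded with spaces (`act_date `, `pos_oid `), and so are some values (`1A `, `DE `). A minimal sketch of trimming that padding after loading, run here on a tiny synthetic sample: the rows and the `pax` values are made up for illustration, and `sample` and `df` are hypothetical names.

```python
import io

import pandas as pd

# Synthetic three-line sample imitating the padded layout of bookings.csv
sample = io.StringIO(
    "act_date           ^source^pos_ctry^pax\n"
    "2013-03-05 00:00:00^1A    ^DE      ^-1\n"
    "2013-03-26 00:00:00^1A    ^US      ^2\n"
)

df = pd.read_csv(sample, sep="^")

# Remove the padding from the column names
df.columns = df.columns.str.strip()

# Remove the padding from the text columns we know about
for col in ["source", "pos_ctry"]:
    df[col] = df[col].str.strip()
```

After this step the columns can be addressed as `df["act_date"]` rather than by their padded names.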


In [6]:
bookings = pd.read_csv(bookings_file, error_bad_lines=False, warn_bad_lines=True)

b'Skipping line 5000009: expected 1 fields, saw 2\nSkipping line 5000010: expected 1 fields, saw 3\n'
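
The message above is consistent with the separator being left at its default: with `,` as delimiter the header is read as a single field, so any line that happens to contain commas has more fields than expected and is skipped. A minimal sketch of how such lines could be located for inspection, using the standard `csv` module on synthetic data (`find_long_lines` and the sample text are hypothetical, not part of the challenge files):

```python
import csv
import io


def find_long_lines(handle, delimiter=","):
    """Return the header field count and (line number, field count) pairs
    for rows that have more fields than the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    expected = len(next(reader))
    too_long = []
    for row in reader:
        if len(row) > expected:
            too_long.append((reader.line_num, len(row)))
    return expected, too_long


# Synthetic sample: '^' is the real separator, but two rows contain commas
text = "a^b^c\n1^2^3\n4,5^6^7\n8,9,10^11^12\n"

expected_comma, long_comma = find_long_lines(io.StringIO(text), delimiter=",")
expected_caret, long_caret = find_long_lines(io.StringIO(text), delimiter="^")
```

With `,` the sketch reports one expected field and flags the two rows containing commas; with `^` every row has the three expected fields and nothing is flagged.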
