# Technical Interview

This document presents notes and code for three exercises. The code is written in Python 2; the data is provided as separate CSV files (too big for GitHub).

The data were extracted from two compressed files. The files were corrupt, so I had to use `bzip2recover` to uncompress them.

## Exercise 1

### Task: count the number of lines in each file with Python

We use a CSV reader to iterate over each file, with a small counter for the number of non-empty lines.
Reading a whole file at once with `f.read()` is not an option: it would load the entire file into memory, and the files are too large for that.
We found 20390199 lines for **searches.csv** and 10000011 lines for **bookings.csv**.

In [8]:
import csv

def count_rows(path):
    # stream the file row by row so it never has to fit in memory
    with open(path, "r") as f:
        reader = csv.reader(f, delimiter="^")
        # csv.reader yields one list per row; a blank line yields an empty list
        return sum(1 for row in reader if row)

print "nb_searches = %d lines" % count_rows("./data/searches.csv")
print "nb_bookings = %d lines" % count_rows("./data/bookings.csv")

nb_searches = 20390199 lines
nb_bookings = 10000011 lines
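
As a cross-check on the counts above, the files can also be scanned without a CSV parser. The following is a minimal sketch, not the method used in the cell above: the function name `count_lines` and the 1 MiB chunk size are illustrative assumptions. It counts physical lines, so its result may differ from the CSV row count if a file contains blank lines or quoted fields that span several lines.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count the lines of a file by reading fixed-size binary chunks."""
    count = 0
    last = b"\n"  # treat an empty file as ending on a line boundary
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
            last = chunk[-1:]
    # a final line without a trailing newline still counts as a line
    if last != b"\n":
        count += 1
    return count
```

Calling `count_lines("./data/bookings.csv")` keeps at most one chunk in memory at a time, which is what makes this approach usable on files of this size.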


## Exercise 2

### Task: top 10 arrival airports in the world in 2013 (using the bookings file)

We use a smaller version of **bookings.csv** for testing. That way, we can open it with *LibreOffice* to better understand the data, and tests run faster. To make this file, we use the `head` command and redirect its output to **bookings_head.csv**.
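
The same kind of sample can also be produced from Python. This is a minimal sketch equivalent to `head`: the function name `write_head` is hypothetical, and the default of 10 lines only mirrors the default of `head`, since the number of lines actually kept is not stated here.

```python
from itertools import islice

def write_head(src_path, dst_path, n_lines=10):
    """Copy the first n_lines lines of src_path to dst_path; return the number written."""
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # islice stops after n_lines, so the rest of the file is never read
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written
```

For example, `write_head("./data/bookings.csv", "./data/bookings_head.csv", 1000)` would keep the header and the first 999 records.
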
First, we keep the bookings whose arrival time (off time + duration) falls in 2013. Then, we sum the pax column per arrival airport and keep the ten airports with the largest totals. Finally, we use GeoBase to look up the name of each airport.
We find 

In [None]:
import pandas as pd
import numpy as np
from GeoBases import GeoBase


# read the CSV file (fields are separated by "^")
data = pd.read_csv("./data/bookings.csv", delimiter="^")

# convert the off_time column from string to integer nanoseconds since the epoch
# (the column name is used with its trailing spaces)
off_time = pd.to_datetime(data['off_time           '])
off_time = off_time.astype(np.int64)

# arrival time = off_time + duration
# note: the sum is read back as nanoseconds, so duration is added as nanoseconds here
arrival = pd.to_datetime(off_time + data['duration'])

# keep the rows whose arrival time falls in 2013, using a single mask
# so that the boolean index stays aligned with the DataFrame
in_2013 = (arrival >= pd.to_datetime("2013-01-01")) & (arrival < pd.to_datetime("2014-01-01"))
data = data[in_2013]

# sum the passengers per arrival airport code and keep the ten largest totals
best_arr_port = data.groupby('arr_port')['pax'].sum().nlargest(10)

# use GeoBase to get the name of each airport
geo_a = GeoBase(data='airports', verbose=False)
result = []
for code, pax in best_arr_port.iteritems():
    name = geo_a.get(code.strip(), "name")
    result.append((code, name, pax))

print pd.DataFrame(result, columns=['arr_port_code', 'arr_port_name', 'pax'])
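
If loading the whole file with `pd.read_csv` is too heavy, the aggregation can be streamed instead. The sketch below is an illustration written for Python 3, not the method used above. The function name `top_arrival_airports` is hypothetical, the column names `arr_port`, `pax` and `off_time` are taken from the cell above, and two simplifications are assumed: timestamps start with a four-digit year, and rows are filtered on the year of `off_time` rather than on the arrival time (off time + duration).

```python
import csv
import io
from collections import Counter

def top_arrival_airports(rows, year="2013", n=10):
    """Sum pax per arrival airport over an iterable of dict rows; return the n largest."""
    totals = Counter()
    for row in rows:
        # header names and values may be space-padded, so strip both;
        # DictReader uses None for surplus or missing fields, so skip those
        clean = {k.strip(): v.strip() for k, v in row.items()
                 if k is not None and v is not None}
        # assumption: off_time starts with the year, e.g. "2013-03-05 10:00:00"
        if not clean.get("off_time", "").startswith(year):
            continue
        try:
            totals[clean["arr_port"]] += int(clean["pax"])
        except (KeyError, ValueError):
            continue  # skip rows with missing or non-numeric fields
    return totals.most_common(n)

# tiny made-up sample, only to show the call; it is not real booking data
sample = io.StringIO(
    "arr_port^pax^off_time           \n"
    "LHR     ^2^2013-03-05 10:00:00\n"
    "CDG     ^1^2013-07-01 08:30:00\n"
    "LHR     ^3^2013-03-06 10:00:00\n"
    "JFK     ^4^2012-12-31 23:00:00\n"
)
print(top_arrival_airports(csv.DictReader(sample, delimiter="^")))
# -> [('LHR', 5), ('CDG', 1)]
```

On the real file, the same function could be fed `csv.DictReader(f, delimiter="^")` on an open file handle, so that only the per-airport totals are held in memory.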