# Challenge

## Problem

A travel agency has a database with the delays, per month and day, of several airlines in the United States. In this challenge the agency asks us for help: they want to know which companies are most convenient to fly with. The precise question is:

* What has been the average delay of each company for each month of the year?

**Do not use Pandas** to implement the Python code that answers this question.

## Input data

The file `flightDelays.csv` contains information about the delay of U.S. flights. The format is: 

    month, day of the month, company ID, company acronym, delay
    
Note that some airlines may have no delay information for some days, or even for whole months.

Also note that there may be negative delay values, which should not be counted in the computation.
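
As a small illustration of the rule above, the following sketch averages the delays of three made-up rows and skips the negative one. The sample rows are hypothetical, and the header names (`month`, `day`, `compID`, `compName`, `delay`) are assumed from the keys used later in this notebook. Like the solutions below, it keeps only delays strictly greater than zero.

```python
import csv
import io

# Hypothetical sample data; the header names are assumed from the keys
# used by the solutions in this notebook.
sample = io.StringIO(
    "month,day,compID,compName,delay\n"
    "1,1,19393,WN,12.0\n"
    "1,2,19393,WN,-5.0\n"
    "1,3,19393,WN,8.0\n"
)

rows = list(csv.DictReader(sample))

# Keep only strictly positive delays; the negative row is ignored.
delays = [float(r['delay']) for r in rows if float(r['delay']) > 0]
average = sum(delays) / len(delays)
print(average)
```

Only the rows with delays 12.0 and 8.0 contribute to the average here.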

## Solution

In the file `solution.txt` you can find the solution and the expected output format. The file format is as follows:

    company ID-month, average delay

Note that the output is sorted by company ID in ascending order and, for each company, by month number in ascending order.
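
When the keys are strings of the form `compID-month`, a plain string sort orders them lexicographically, so month 10 comes before month 2. The sketch below shows one way to obtain the numeric ordering described above; `numeric_key` is a hypothetical helper introduced for illustration, and the labels are made up.

```python
def numeric_key(label):
    """Split a 'compID-month' label into a tuple of integers for sorting."""
    comp_id, month = label.split('-')
    return int(comp_id), int(month)

# Hypothetical labels for a single company
labels = ['19393-10', '19393-2', '19393-1']

lexicographic = sorted(labels)               # string comparison
numeric = sorted(labels, key=numeric_key)    # company ID, then month number

print(lexicographic)
print(numeric)
```

Passing `key=numeric_key` to `sort` or `sorted` leaves the labels themselves unchanged and only affects their order.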

You can also use the script `compare_valid_solution.py <input_file>` to check if the solution is valid. 

In [1]:
!ls

03-Cython.pdf		 InterfacingCwithCython.ipynb  prun0
byhand			 lprof0			       prunRapido
Challenge2-Cython.ipynb  lprof1			       __pycache__
ChallengeSolution.ipynb  manual			       simulation.py
Cython.ipynb		 memscript.py		       Solution.txt
flightDelays.csv	 mpg.csv		       wrapfib
interactive		 mprof0			       wraprapido


In [2]:
import csv

with open('flightDelays.csv') as csvfile:
    fd=list(csv.DictReader(csvfile))
print(len(fd))

6592128


In [3]:
def retoA(fdelays):
    flight_dict = {}
    for d in fdelays:
        cur_key = str(d['compID']) + '-' + str(d['month'])
        if ( cur_key in flight_dict and float(d['delay']) > 0):
            flight_dict[cur_key][0] += float(d['delay'])
            flight_dict[cur_key][1] += 1
        elif (float(d['delay']) > 0):
            flight_dict[cur_key] = [float(d['delay']),1]

    y=[(x[0],x[1][0]/x[1][1]) for x in flight_dict.items()]
    y.sort(key=lambda x: x[0])
    return y

In [4]:
def retoB(flights):
    result_dict = {}
    for row in flights:
        if float(row['delay']) > 0.0:
            curr_str = row['compID'] + '-' + row['month']
            current_dict = result_dict.get(curr_str, [0,0])
            current_dict[0] += 1
            current_dict[1] += float(row['delay'])
            result_dict[curr_str] = current_dict

    y = [(a, b[1]/b[0]) for a,b in result_dict.items()]
    y.sort(key=lambda x: x[0])
    return y

In [5]:
from operator import add
mia=[1,2]
mib=[5,6]
print(mia+mib)
list(map(add,mia,mib))

[1, 2, 5, 6]


[6, 8]

In [6]:
from operator import add
def retoC(mpg):    
    histo={}
    for d in mpg:
        delay = float(d['delay'])
        key = d['compID']+'-'+d['month']
        if(key in histo and delay>0):
            #histo[key]=list(map(add,histo[key],[delay,1]))
            histo[key][0]+=delay
            histo[key][1]+=1
        elif (delay>0):
            histo[key]=[delay,1]

    result=[(x[0],x[1][0]/x[1][1]) for x in histo.items()]

    result.sort(key=lambda x: x[0])
    return result

In [None]:
r=retoA(fd)
for i,j in r:
    print("{0}, {1:.4f}".format(i,j))

In [None]:
r=retoB(fd)
for i,j in r:
    print("{0}, {1:.4f}".format(i,j))

In [None]:
r=retoC(fd)
for i,j in r:
    print("{0}, {1:.4f}".format(i,j))

In [7]:
%timeit retoA(fd)

5.44 s ± 658 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
%timeit retoB(fd)

2.67 s ± 8.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [9]:
%timeit retoC(fd)

3.1 s ± 558 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


# Other alternatives

In [16]:
# Finding the average

def m_average(rows):
    dic_values = {}
    averages = []

    for row in rows:
        delay = float(row['delay'])
        if delay > 0.0:
            key = row['compID'] + '-' + row['month']
            # create the list only when there is a positive delay to store,
            # so no key is left with an empty list (which would divide by zero)
            dic_values.setdefault(key, []).append(delay)

    # sort by key and compute the average of each group
    for name, v in sorted(dic_values.items()):
        averages.append((name, sum(v) / len(v)))

    return averages

In [17]:
    
%timeit m_average(fd)
#m_average(mpg)

3.26 s ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
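
Another possible variant, sketched here under the same assumptions about the column names, uses `collections.defaultdict` with `(compID, month)` integer tuples as keys. It converts the delay once per row and sorts numerically rather than by string. The name `average_delays` is hypothetical, and this version has not been timed against the ones above.

```python
from collections import defaultdict

def average_delays(rows):
    """Average of the positive delays per (compID, month), sorted numerically."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of delays, count]
    for row in rows:
        delay = float(row['delay'])  # convert the string once per row
        if delay > 0:
            acc = totals[(int(row['compID']), int(row['month']))]
            acc[0] += delay
            acc[1] += 1
    # Tuple keys sort by company ID first and month second.
    return [(f"{comp}-{month}", total / count)
            for (comp, month), (total, count) in sorted(totals.items())]

# Hypothetical rows in the shape produced by csv.DictReader
demo_rows = [
    {'compID': '2', 'month': '10', 'delay': '10'},
    {'compID': '2', 'month': '2', 'delay': '4'},
    {'compID': '2', 'month': '2', 'delay': '6'},
    {'compID': '2', 'month': '2', 'delay': '-3'},
]
demo_result = average_delays(demo_rows)
print(demo_result)
```

The result has the same `(label, average)` shape as the `reto*` functions, so it can be printed with the same formatting loop.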


### With Pandas

In [18]:
import pandas as pd

# read the CSV file into a DataFrame
df = pd.read_csv("flightDelays.csv")


Dropping the compName and day columns

In [19]:
df = df.drop(['compName', 'day'], axis=1)

- Dropping 'no delay' entries
- grouping by compID and month
- resetting index

In [20]:
%timeit df[df['delay']>0].groupby(['compID','month']).mean().reset_index()

152 ms ± 555 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [21]:
df = df[df['delay']>0].groupby(['compID','month']).mean().reset_index()

Pretty printing (to match the format expected by `compare_valid_solution.py`)

In [22]:
df['compID']= df['compID'].map(str) + "-" + df["month"].map(str)

df = df.drop('month', axis=1)
df['delay'] = df['delay'].round(4).map(str)
df['delay'] = " " + df['delay']

In [23]:
#Exporting

df.to_csv('Solution.txt', header=False, index=False, sep=",")

### Implement Cython Version

In [26]:
%timeit retoD(fd)

1.86 s ± 5.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Implement C++ Version Wrapped with Cython