# ASSIGNMENT. NASA logs dataset

### ETL

Start Date: 29/11/2019

End Date: 29/11/2019

### José María Álvarez Silva

The goal of this exercise is to use the access_logs RDD to perform a series of calculations:

 * Each one was previously done with `pyspark`
 * Here each one is repeated with `pyspark sql`

## 1. First, load the data into a Spark context:

In [1]:
from pyspark import SparkContext
from datetime import datetime
import pandas as pd

In [2]:
sc = SparkContext()

In [3]:
import urllib.request
f = urllib.request.urlretrieve("https://www.dropbox.com/s/73wr8xb5s6fdj7g/apache.access.log.PROJECT?dl=1", "apache.access.log.PROJECT")
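Re-running the download cell fetches the whole log again. A minimal sketch of a guard that skips the download when the file is already on disk follows; `fetch_if_missing` and its `fetch` parameter are hypothetical names introduced here for illustration, not part of the original notebook.

```python
import os
import urllib.request

def fetch_if_missing(url, path, fetch=urllib.request.urlretrieve):
    """Download url to path unless path already exists.

    Returns True when a download was performed, False when it was skipped.
    The fetch argument is a hypothetical hook so the function can be tried
    without network access.
    """
    if os.path.exists(path):
        return False
    fetch(url, path)
    return True

# Possible use in the notebook (not executed here):
# fetch_if_missing(
#     "https://www.dropbox.com/s/73wr8xb5s6fdj7g/apache.access.log.PROJECT?dl=1",
#     "apache.access.log.PROJECT",
# )
```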

In [6]:
data_file = "./apache.access.log.PROJECT"
raw_data = sc.textFile(data_file)


At this point the data has been loaded into the RDD raw_data. As a first check that the load worked correctly, the records are counted:

In [7]:
raw_data.count()

1043177

Next, the first elements of the RDD are displayed:

In [8]:
raw_data.take(5)

['in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839',
 'uplherc.upl.com - - [01/Aug/1995:00:00:07 -0400] "GET / HTTP/1.0" 304 0',
 'uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0',
 'uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0" 304 0',
 'uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 304 0']

Once the load is confirmed to be correct, the data is parsed:

The parsing uses the regular expression seen in class today:

In [9]:
import re
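def parse_log1(line):
    # Return 1 if the line matches the log pattern, 0 otherwise.
    # Raw string so that \S, \d, etc. reach the regex engine unchanged.
    match = re.search(r'^(\S+) (\S+) (\S+) \[(\S+) [-](\d{4})\] "(\S+)\s*(\S+)\s*(\S+)\s*([\w\.\s*]+)?\s*"*(\d{3}) (\S+)', line)
    if match is None:
        return 0
    else:
        return 1
n_logs = raw_data.count()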

In [10]:
def parse_log2(line):
    # Try the main pattern first, then a fallback pattern for irregular request strings.
    match = re.search(r'^(\S+) (\S+) (\S+) \[(\S+) [-](\d{4})\] "(\S+)\s*(\S+)\s*(\S+)\s*([/\w\.\s*]+)?\s*"* (\d{3}) (\S+)',line)
    if match is None:
        match = re.search(r'^(\S+) (\S+) (\S+) \[(\S+) [-](\d{4})\] "(\S+)\s*([/\w\.]+)>*([\w/\s\.]+)\s*(\S+)\s*(\d{3})\s*(\S+)',line)
    if match is None:
        return (line, 0)
    else:
        return (line, 1)


In [11]:
def map_log(line):
    # Same two patterns as parse_log2; only called on lines that parse_log2 accepted.
    match = re.search(r'^(\S+) (\S+) (\S+) \[(\S+) [-](\d{4})\] "(\S+)\s*(\S+)\s*(\S+)\s*([/\w\.\s*]+)?\s*"* (\d{3}) (\S+)',line)
    if match is None:
        match = re.search(r'^(\S+) (\S+) (\S+) \[(\S+) [-](\d{4})\] "(\S+)\s*([/\w\.]+)>*([\w/\s\.]+)\s*(\S+)\s*(\d{3})\s*(\S+)',line)
    return match.groups()
# Keep only the lines that match one of the patterns, then split them into fields.
parsed_rdd = raw_data.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])
parsed_def = parsed_rdd.map(lambda line: map_log(line))

In [12]:
parsed_rdd.take(5)

['in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839',
 'uplherc.upl.com - - [01/Aug/1995:00:00:07 -0400] "GET / HTTP/1.0" 304 0',
 'uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0',
 'uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0" 304 0',
 'uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 304 0']

In [13]:
parsed_def.take(2)

[('in24.inetnebr.com',
  '-',
  '-',
  '01/Aug/1995:00:00:01',
  '0400',
  'GET',
  '/shuttle/missions/sts-68/news/sts-68-mcc-05.txt',
  'HTTP/1.0"',
  None,
  '200',
  '1839'),
 ('uplherc.upl.com',
  '-',
  '-',
  '01/Aug/1995:00:00:07',
  '0400',
  'GET',
  '/',
  'HTTP/1.0"',
  None,
  '304',
  '0')]
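To see what a parser of this kind extracts without starting Spark, a pattern can be tried on a single line. The sketch below uses a simplified, hypothetical pattern with named groups (`LOG_PATTERN` and `parse_line` are names introduced here); it only covers well-formed lines and is not the notebook's pattern. Unlike the output above, where the protocol field keeps a trailing quote (`'HTTP/1.0"'`), this pattern leaves the closing quote outside the group.

```python
import re

# Hypothetical, simplified pattern for well-formed lines only; the notebook's
# own patterns are more permissive.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]\s]+) (?P<tz>[+-]\d{4})\] '
    r'"(?P<method>\S+) (?P<endpoint>\S+)\s*(?P<protocol>\S*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)$'
)

def parse_line(line):
    """Return a dict of named fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# First line of the dataset, as shown by raw_data.take(5).
sample = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
          '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')
fields = parse_line(sample)
print(fields["host"], fields["endpoint"], fields["status"], fields["size"])
```

Named groups make later code read as `fields["status"]` instead of `line[-2]`, at the cost of a longer pattern.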

## Question 1

## Minimum, maximum and mean request size

The statistics (including minimum, maximum and mean) of the request sizes are shown below:

In [14]:
def convert_long(x):
    x = re.sub('[^0-9]',"",x) 
    if x =="":
        return 0
    else:
        return int(x)
parsed_def.map(lambda line: convert_long(line[-1])).stats()

(count: 1043177, mean: 17531.55570243611, stdev: 68561.9662005, max: 3421948.0, min: 0.0)
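Because `convert_long` strips every non-digit character and maps an empty result to 0, a size field such as `-` (which Apache writes when no body is sent) is counted as a zero-byte request and pulls the mean down. The sketch below illustrates the effect on a made-up list of sizes; the sample values are assumptions, not rows from the dataset.

```python
import re
import statistics

def convert_long(x):
    """Same cleaning rule as the notebook: strip non-digits, empty string -> 0."""
    digits = re.sub(r'[^0-9]', '', x)
    return int(digits) if digits else 0

# Made-up sample; '-' marks a request with no reported size.
raw_sizes = ['1839', '0', '-', '7280']
sizes = [convert_long(s) for s in raw_sizes]

# Mean when '-' is treated as 0 (what the notebook does).
mean_with_zeros = statistics.mean(sizes)

# Mean when entries without a reported size are left out instead.
known = [convert_long(s) for s in raw_sizes if s != '-']
mean_known_only = statistics.mean(known)

print(sizes, mean_with_zeros, mean_known_only)
```

Which of the two means is appropriate depends on the question being asked; the notebook's figures use the first convention.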

In [15]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
from pyspark.sql import Row

In [16]:
row_data = parsed_def.map(lambda p: Row(
    host = p[0], 
#     fecha = (datetime.strptime(p[3][:11], "%d/%b/%Y")),
    endpoint = p[6],
    size = convert_long(p[-1]),
    repuesta = p[-2]
    )
)

In [17]:
lognasa_df = sqlContext.createDataFrame(row_data)
lognasa_df.registerTempTable("lognasa")

In [18]:
statsNASA = sqlContext.sql("""
    SELECT MIN(size), MAX(size), MEAN(size) FROM lognasa
""")
statsNASA.show()

+---------+---------+------------------+
|min(size)|max(size)|         avg(size)|
+---------+---------+------------------+
|        0|  3421948|17531.555702435926|
+---------+---------+------------------+



## Question 2

## Number of requests per response code

The next goal is to compute the number of requests for each response code:

In [19]:
n_codes = parsed_def.map(lambda line: (line[-2], 1)).distinct().count()
codes_count = (parsed_def.map(lambda line: (line[-2], 1))
          .reduceByKey(lambda a, b: a + b)
          .takeOrdered(n_codes, lambda x: -x[1]))
codes_count

[('200', 940847),
 ('304', 79824),
 ('302', 16244),
 ('404', 6185),
 ('403', 58),
 ('501', 17),
 ('500', 2)]
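The map / reduceByKey / takeOrdered chain above is a grouped count followed by a descending sort. As a plain-Python sketch of the same logic on a made-up list of codes, `collections.Counter` does both steps; the sample data is an assumption for illustration.

```python
from collections import Counter

# Made-up status codes standing in for line[-2] of each parsed record.
codes = ['200', '304', '200', '404', '200', '304']

# Counter plays the role of map(code -> (code, 1)) followed by reduceByKey(add).
code_counts = Counter(codes)

# most_common() with no argument returns every code, highest count first,
# so no separate pass is needed to learn how many distinct codes exist.
ranked = code_counts.most_common()
print(ranked)
```

On the Spark side, `parsed_def.map(lambda line: line[-2]).countByValue()` should give the same counts as a dictionary on the driver, which is reasonable here because the number of distinct codes is tiny.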

__Solution:__

In [20]:
respuestasCode = sqlContext.sql("""
    SELECT repuesta, COUNT(repuesta) FROM lognasa GROUP BY repuesta
""")
respuestasCode.show()

+--------+---------------+
|repuesta|count(repuesta)|
+--------+---------------+
|     200|         940847|
|     302|          16244|
|     501|             17|
|     404|           6185|
|     403|             58|
|     500|              2|
|     304|          79824|
+--------+---------------+



## Question 3

## Show 20 hosts that appear more than 10 times

To see 20 hosts that appear more than ten times in the log (the 20 with the most requests are shown):

In [21]:
result = parsed_def.map(lambda line: (line[0],1)).reduceByKey(lambda a, b: a + b).filter(lambda x : x[1] > 10).takeOrdered(20, lambda x: -x[1])
result

[('edams.ksc.nasa.gov', 4034),
 ('piweba5y.prodigy.com', 3237),
 ('piweba4y.prodigy.com', 3043),
 ('piweba3y.prodigy.com', 2830),
 ('www-d1.proxy.aol.com', 2715),
 ('www-b3.proxy.aol.com', 2518),
 ('news.ti.com', 2507),
 ('www-b2.proxy.aol.com', 2481),
 ('163.206.89.4', 2478),
 ('www-c2.proxy.aol.com', 2438),
 ('www-c3.proxy.aol.com', 2400),
 ('www-d2.proxy.aol.com', 2371),
 ('www-d4.proxy.aol.com', 2356),
 ('www-b5.proxy.aol.com', 2354),
 ('www-b4.proxy.aol.com', 2297),
 ('www-d3.proxy.aol.com', 2284),
 ('www-a2.proxy.aol.com', 2238),
 ('www-c4.proxy.aol.com', 2207),
 ('www-c5.proxy.aol.com', 2198),
 ('www-c6.proxy.aol.com', 2181)]

In [22]:
host20 = sqlContext.sql("""
   SELECT host, num FROM (SELECT host, COUNT(host) as num FROM lognasa GROUP BY host) WHERE num > 10 ORDER BY num DESC LIMIT 20
""")
host20.show()

+--------------------+----+
|                host| num|
+--------------------+----+
|  edams.ksc.nasa.gov|4034|
|piweba5y.prodigy.com|3237|
|piweba4y.prodigy.com|3043|
|piweba3y.prodigy.com|2830|
|www-d1.proxy.aol.com|2715|
|www-b3.proxy.aol.com|2518|
|         news.ti.com|2507|
|www-b2.proxy.aol.com|2481|
|        163.206.89.4|2478|
|www-c2.proxy.aol.com|2438|
|www-c3.proxy.aol.com|2400|
|www-d2.proxy.aol.com|2371|
|www-d4.proxy.aol.com|2356|
|www-b5.proxy.aol.com|2354|
|www-b4.proxy.aol.com|2297|
|www-d3.proxy.aol.com|2284|
|www-a2.proxy.aol.com|2238|
|www-c4.proxy.aol.com|2207|
|www-c5.proxy.aol.com|2198|
|www-c6.proxy.aol.com|2181|
+--------------------+----+
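The nested SELECT above filters on an aggregate, which standard SQL can also express with a `HAVING` clause. The sketch below checks that form on a tiny in-memory SQLite table; SQLite and the sample rows are assumptions used only so the query can be run without Spark, and the threshold is lowered so the toy table returns something.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lognasa (host TEXT)")
# Made-up rows: host 'a' appears 3 times, 'b' twice, 'c' once.
conn.executemany("INSERT INTO lognasa VALUES (?)",
                 [("a",), ("b",), ("a",), ("c",), ("b",), ("a",)])

# Threshold lowered from 10 to 1 for the toy data.
rows = conn.execute("""
    SELECT host, COUNT(host) AS num
    FROM lognasa
    GROUP BY host
    HAVING COUNT(host) > 1
    ORDER BY num DESC
    LIMIT 20
""").fetchall()
print(rows)
conn.close()
```

Spark SQL also accepts `HAVING`, so the same shape should work in `sqlContext.sql` with the threshold set back to 10.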



## Question 4

## Show the 10 most visited endpoints

The 10 most visited endpoints are shown below:

In [23]:
result = parsed_def.map(lambda line: (line[6],1)).reduceByKey(lambda a, b: a + b).takeOrdered(10, lambda x: -x[1])
result

[('/images/NASA-logosmall.gif', 59737),
 ('/images/KSC-logosmall.gif', 50452),
 ('/images/MOSAIC-logosmall.gif', 43890),
 ('/images/USA-logosmall.gif', 43664),
 ('/images/WORLD-logosmall.gif', 43277),
 ('/images/ksclogo-medium.gif', 41336),
 ('/ksc.html', 28582),
 ('/history/apollo/images/apollo-logo1.gif', 26778),
 ('/images/launch-logo.gif', 24755),
 ('/', 20292)]

In [24]:
endpoints10 = sqlContext.sql("""
   SELECT endpoint, COUNT(endpoint) as num FROM lognasa GROUP BY endpoint ORDER BY num DESC LIMIT 10
""")
endpoints10.show()

+--------------------+-----+
|            endpoint|  num|
+--------------------+-----+
|/images/NASA-logo...|59737|
|/images/KSC-logos...|50452|
|/images/MOSAIC-lo...|43890|
|/images/USA-logos...|43664|
|/images/WORLD-log...|43277|
|/images/ksclogo-m...|41336|
|           /ksc.html|28582|
|/history/apollo/i...|26778|
|/images/launch-lo...|24755|
|                   /|20292|
+--------------------+-----+



## Question 5

## Show the 10 most visited endpoints that do not have response code 200

To get the 10 most visited endpoints that did not return a result (that is, whose response code is anything other than 200):

In [25]:
result = (parsed_def.filter(lambda line: line[9] != '200')
          .map(lambda line: (line[6], 1))
          .reduceByKey(lambda a, b: a+b)
          .takeOrdered(10, lambda x: -x[1]))
result

[('/images/NASA-logosmall.gif', 8761),
 ('/images/KSC-logosmall.gif', 7236),
 ('/images/MOSAIC-logosmall.gif', 5197),
 ('/images/USA-logosmall.gif', 5157),
 ('/images/WORLD-logosmall.gif', 5020),
 ('/images/ksclogo-medium.gif', 4728),
 ('/history/apollo/images/apollo-logo1.gif', 2907),
 ('/images/launch-logo.gif', 2811),
 ('/', 2199),
 ('/images/ksclogosmall.gif', 1622)]

In [26]:
endpoints10sin200 = sqlContext.sql("""
   SELECT endpoint, COUNT(endpoint) as num FROM (SELECT endpoint FROM lognasa WHERE repuesta != "200") GROUP BY endpoint ORDER BY num DESC LIMIT 10
""")
endpoints10sin200.show()

+--------------------+----+
|            endpoint| num|
+--------------------+----+
|/images/NASA-logo...|8761|
|/images/KSC-logos...|7236|
|/images/MOSAIC-lo...|5197|
|/images/USA-logos...|5157|
|/images/WORLD-log...|5020|
|/images/ksclogo-m...|4728|
|/history/apollo/i...|2907|
|/images/launch-lo...|2811|
|                   /|2199|
|/images/ksclogosm...|1622|
+--------------------+----+



## Question 6

## Number of distinct hosts

The number of distinct hosts is computed below:

In [27]:
parsed_def.map(lambda line: line[0]).distinct().count()

54507

In [28]:
distinctHosts = sqlContext.sql("""
   SELECT COUNT(DISTINCT host) FROM lognasa 
""")
distinctHosts.show()

+--------------------+
|count(DISTINCT host)|
+--------------------+
|               54507|
+--------------------+



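## Question 7

## Number of unique hosts per day

Next, the number of unique hosts per day is computed. The RDD cell reduces `(day, 1)` pairs, so its totals are requests per day; the SQL query uses `COUNT(DISTINCT host)` and yields the unique-host figures: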

In [29]:
def day_month(line):
    date_time = line[3]
    return datetime.strptime(date_time[:11], "%d/%b/%Y")  # Parse the date so it can be handled as a datetime, as seen in class.
result = parsed_def.map(lambda line:  (day_month(line), 1)).reduceByKey(lambda a, b: a + b).distinct().collect()
result



[(datetime.datetime(1995, 8, 22, 0, 0), 57758),
 (datetime.datetime(1995, 8, 14, 0, 0), 59873),
 (datetime.datetime(1995, 8, 7, 0, 0), 57355),
 (datetime.datetime(1995, 8, 21, 0, 0), 55539),
 (datetime.datetime(1995, 8, 19, 0, 0), 32092),
 (datetime.datetime(1995, 8, 18, 0, 0), 56244),
 (datetime.datetime(1995, 8, 13, 0, 0), 36480),
 (datetime.datetime(1995, 8, 1, 0, 0), 33996),
 (datetime.datetime(1995, 8, 15, 0, 0), 58845),
 (datetime.datetime(1995, 8, 17, 0, 0), 58980),
 (datetime.datetime(1995, 8, 11, 0, 0), 61242),
 (datetime.datetime(1995, 8, 16, 0, 0), 56651),
 (datetime.datetime(1995, 8, 10, 0, 0), 61245),
 (datetime.datetime(1995, 8, 8, 0, 0), 60142),
 (datetime.datetime(1995, 8, 3, 0, 0), 41387),
 (datetime.datetime(1995, 8, 20, 0, 0), 32963),
 (datetime.datetime(1995, 8, 5, 0, 0), 31888),
 (datetime.datetime(1995, 8, 4, 0, 0), 59554),
 (datetime.datetime(1995, 8, 9, 0, 0), 60457),
 (datetime.datetime(1995, 8, 12, 0, 0), 38070),
 (datetime.datetime(1995, 8, 6, 0, 0), 32416)]
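As a small illustration of the difference between counting requests per day and counting unique hosts per day, the sketch below runs both on a handful of made-up `(day, host)` pairs. The records and host names are assumptions for illustration only.

```python
from collections import defaultdict

# Made-up (day, host) pairs standing in for (line[3][:11], line[0]).
records = [
    ("01/Aug/1995", "a.example"),
    ("01/Aug/1995", "a.example"),
    ("01/Aug/1995", "b.example"),
    ("02/Aug/1995", "a.example"),
]

requests_per_day = defaultdict(int)
hosts_per_day = defaultdict(set)
for day, host in records:
    requests_per_day[day] += 1      # what reducing (day, 1) pairs yields
    hosts_per_day[day].add(host)    # a set keeps each host once per day

unique_hosts_per_day = {day: len(hosts) for day, hosts in hosts_per_day.items()}
print(dict(requests_per_day))
print(unique_hosts_per_day)
```

In RDD terms, one way to get the second result would be to map each record to `(day, host)`, call `.distinct()`, and only then count per day.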

In [30]:
row_data = parsed_def.map(lambda p: Row(
    host = p[0], 
    fecha = p[3][:11],
    endpoint = p[6],
    size = convert_long(p[-1]),
    repuesta = p[-2]
    )
)
row_data

PythonRDD[99] at RDD at PythonRDD.scala:48

In [31]:
lognasa_df = sqlContext.createDataFrame(row_data)
lognasa_df.registerTempTable("lognasa")

In [32]:
distinctHosts = sqlContext.sql("""
   SELECT fecha, COUNT(DISTINCT host) FROM lognasa GROUP BY fecha
""")
distinctHosts.show()

+-----------+--------------------+
|      fecha|count(DISTINCT host)|
+-----------+--------------------+
|21/Aug/1995|                4134|
|06/Aug/1995|                2537|
|07/Aug/1995|                4106|
|11/Aug/1995|                4346|
|03/Aug/1995|                3222|
|18/Aug/1995|                4168|
|17/Aug/1995|                4385|
|14/Aug/1995|                4454|
|20/Aug/1995|                2560|
|13/Aug/1995|                2650|
|15/Aug/1995|                4214|
|22/Aug/1995|                4456|
|08/Aug/1995|                4406|
|19/Aug/1995|                2550|
|04/Aug/1995|                4190|
|12/Aug/1995|                2864|
|05/Aug/1995|                2502|
|01/Aug/1995|                2582|
|16/Aug/1995|                4340|
|09/Aug/1995|                4317|
+-----------+--------------------+
only showing top 20 rows



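## Question 9

## Mean daily requests per host

Next, the mean number of daily requests per host is computed. The RDD version computes, for each day, the total requests divided by the number of distinct hosts; the final SQL query instead divides each host's total requests by the number of distinct days on which it appears.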

In [33]:
unique_result = (parsed_def.map(lambda line:  (day_month(line),line[0]))
          .groupByKey().mapValues(set)
          .map(lambda x: (x[0], len(x[1]))))

length_result = (parsed_def.map(lambda line:  (day_month(line),line[0]))
          .groupByKey().mapValues(len))

joined = length_result.join(unique_result).map(lambda a: (a[0], (a[1][0])/(a[1][1]))).collect()
day = [x[0] for x in joined]
count = [x[1] for x in joined]
day_count_dct = {'Día':day, 'Media':count}
day_count_df = pd.DataFrame(day_count_dct )


pandas is not used here for any computation, only for a clearer display.

__Solution:__

In [34]:
day_count_df

| | Día | Media |
|---|---|---|
| 0 | 1995-08-22 | 12.961849 |
| 1 | 1995-08-14 | 13.442524 |
| 2 | 1995-08-16 | 13.053226 |
| 3 | 1995-08-11 | 14.091578 |
| 4 | 1995-08-12 | 13.292598 |
| 5 | 1995-08-13 | 13.766038 |
| 6 | 1995-08-15 | 13.964167 |
| 7 | 1995-08-20 | 12.876172 |
| 8 | 1995-08-05 | 12.745004 |
| 9 | 1995-08-06 | 12.777296 |


In [35]:
sizeFecha = sqlContext.sql("""
   SELECT fecha, MEAN(size) FROM lognasa GROUP BY fecha
""")
sizeFecha.show()

+-----------+------------------+
|      fecha|         avg(size)|
+-----------+------------------+
|21/Aug/1995|16572.278633032645|
|06/Aug/1995|19561.356120434353|
|07/Aug/1995|16700.041234417225|
|11/Aug/1995|18005.917197348226|
|03/Aug/1995|17709.754391475584|
|18/Aug/1995|16097.694082924401|
|17/Aug/1995|17733.498965751103|
|14/Aug/1995|18097.275466403888|
|20/Aug/1995|18506.597518429753|
|13/Aug/1995|19101.447450657895|
|15/Aug/1995|17918.424946894385|
|22/Aug/1995|16328.244502926002|
|08/Aug/1995| 17677.64972897476|
|19/Aug/1995|18058.618347251653|
|04/Aug/1995| 18634.01215703395|
|12/Aug/1995| 18271.50562122406|
|05/Aug/1995|19242.211176618162|
|01/Aug/1995|15570.117631486057|
|16/Aug/1995| 17557.41164321901|
|09/Aug/1995|16224.069669351771|
+-----------+------------------+
only showing top 20 rows



In [36]:
sizeFecha = sqlContext.sql("""
   SELECT host, COUNT(DISTInCT fecha) as nfecha FROM lognasa GROUP BY host
""")
sizeFecha.show()

+--------------------+------+
|                host|nfecha|
+--------------------+------+
|tibia.mech.kuleuv...|     1|
|      205.197.248.13|     2|
|      138.253.42.199|     1|
|slip02.cs1.electr...|     2|
|ppp2_100.bekkoame...|     1|
|quadraohenia.jsc....|     3|
|ix-sea6-23.ix.net...|     1|
|      140.251.205.85|     1|
|houston-1-6.i-lin...|     1|
|      128.159.112.47|     4|
|ix-lv4-29.ix.netc...|     1|
|      163.205.166.15|    13|
|pipe2.nyc.pipelin...|     6|
|  tpafl2-48.gate.net|     2|
|       199.174.180.5|     1|
|d54.net.interacce...|     1|
|pppd053.compuserv...|     1|
|slip-1-8.afit.af.mil|     1|
|   fiji.cs.brown.edu|     1|
|bluegum06.itd.uts...|     2|
+--------------------+------+
only showing top 20 rows



In [37]:
sizeHost = sqlContext.sql("""
   SELECT host, COUNT(host) as nhost FROM lognasa GROUP BY host
""")
sizeHost.show()

+--------------------+-----+
|                host|nhost|
+--------------------+-----+
|ix-sea6-23.ix.net...|    9|
|grimnet23.idirect...|   10|
|      ird.scitex.com|   13|
|      163.205.166.15|  228|
|   chrism.tmx.com.au|    4|
| boom.marblehead.com|    3|
|        199.3.230.80|    9|
|  enigma.idirect.com|  276|
|ip26.abq-dialin.h...|    6|
|   ppp20.coara.or.jp|   39|
|      128.159.63.129|   12|
|      132.170.244.49|   12|
|   hp165.den.mmc.com|   68|
|      128.159.143.43|   52|
|   lib-golf.tamu.edu|   20|
|       163.205.80.44|  203|
|      192.195.243.61|    7|
|   gigi.jpl.nasa.gov|    5|
|     dyna-53.bart.nl|   12|
|       164.116.78.80|   29|
+--------------------+-----+
only showing top 20 rows



In [38]:
meanHostByFecha = sqlContext.sql("""
   Select host, nhost/nfecha as media_Peticiones_Diarias  FROM (SELECT host, COUNT(host) as nhost FROM lognasa GROUP BY host) JOIN (SELECT host as fhost, COUNT(DISTInCT fecha) as nfecha FROM lognasa GROUP BY host) ON fhost = host 
""")
meanHostByFecha.show()

+---------------+------------------------+
|           host|media_Peticiones_Diarias|
+---------------+------------------------+
| 128.159.112.47|                    44.0|
| 128.159.143.43|                    13.0|
| 128.159.144.57|                    12.0|
| 128.159.63.129|                    12.0|
| 128.200.148.26|      17.666666666666668|
| 128.217.61.131|                     1.0|
|   129.231.2.34|                     6.0|
|  129.52.31.150|                    12.0|
|  130.151.82.70|                    18.0|
|   130.181.8.86|                    12.0|
|130.225.253.158|                     5.0|
|131.188.160.155|                     6.0|
| 132.170.244.49|                    12.0|
|   133.28.80.84|                    23.0|
|  134.39.70.204|      15.666666666666666|
| 136.145.30.147|                     7.0|
| 137.161.202.98|                    37.0|
| 138.115.11.103|                    11.0|
| 138.253.42.199|                     6.0|
|    138.47.49.4|                    19.0|
+---------------+------------------------+
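The per-host ratio above joins two grouped subqueries on `host`. Since both aggregate over the same grouping, the ratio can also be computed in a single pass. The sketch below checks that idea on an in-memory SQLite table; SQLite and the sample rows are assumptions for illustration, and the `* 1.0` is only there because SQLite would otherwise perform integer division (the Spark output above already shows fractional results).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lognasa (host TEXT, fecha TEXT)")
# Made-up rows: host 'a' makes 3 requests over 2 days, 'b' makes 1 request on 1 day.
conn.executemany("INSERT INTO lognasa VALUES (?, ?)", [
    ("a", "01/Aug/1995"), ("a", "01/Aug/1995"), ("a", "02/Aug/1995"),
    ("b", "01/Aug/1995"),
])

# Requests per host divided by the number of distinct days the host was active.
rows = conn.execute("""
    SELECT host,
           COUNT(*) * 1.0 / COUNT(DISTINCT fecha) AS media_peticiones_diarias
    FROM lognasa
    GROUP BY host
    ORDER BY host
""").fetchall()
print(rows)
conn.close()
```

Avoiding the join removes one shuffle-heavy step, though whether that matters in practice depends on the size of the data.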

## Question 10

## Show a list of 40 endpoints that generate response code 404

A list of 40 endpoints that generate response code 404 is shown below:

In [39]:
result = (parsed_def.filter(lambda line: line[9] == '404')
          .map(lambda line: (line[6], 1))
          .reduceByKey(lambda a, b: a+b).distinct()
          .takeOrdered(40, lambda x: -x[1]))
result

[('/pub/winvn/readme.txt', 633),
 ('/pub/winvn/release.txt', 494),
 ('/shuttle/missions/STS-69/mission-STS-69.html', 431),
 ('/images/nasa-logo.gif', 319),
 ('/elv/DELTA/uncons.htm', 178),
 ('/shuttle/missions/sts-68/ksc-upclose.gif', 156),
 ('/history/apollo/sa-1/sa-1-patch-small.gif', 146),
 ('/images/crawlerway-logo.gif', 120),
 ('/://spacelink.msfc.nasa.gov', 117),
 ('/history/apollo/pad-abort-test-1/pad-abort-test-1-patch-small.gif', 100),
 ('/history/apollo/a-001/a-001-patch-small.gif', 97),
 ('/images/Nasa-logo.gif', 85),
 ('/shuttle/resources/orbiters/atlantis.gif', 64),
 ('/history/apollo/images/little-joe.jpg', 62),
 ('/images/lf-logo.gif', 59),
 ('/shuttle/resources/orbiters/discovery.gif', 56),
 ('/shuttle/resources/orbiters/challenger.gif', 54),
 ('/robots.txt', 53),
 ('/elv/new01.gif>', 43),
 ('/history/apollo/pad-abort-test-2/pad-abort-test-2-patch-small.gif', 38),
 ('/pub/', 36),
 ('/pub', 36),
 ('/history/apollo/sa-2/sa-2-patch-small.gif', 35),
 ('/history/apollo/sa-5/

In [40]:
codigo404 = sqlContext.sql("""
   SELECT DISTINCT(endpoint) FROM lognasa WHERE repuesta = "404"
""")
codigo404.show()

+--------------------+
|            endpoint|
+--------------------+
|/shuttle/missions...|
|/history/apollo/a...|
|/history/apollo/a...|
|        /CSMT_PageNS|
|/pub/wiinvn/win3/...|
|  /public.win3/winvn|
|/shuttle/sts-1/st...|
|/history/apollo/a...|
|/shuttle/technolo...|
|/shuttle/missions...|
|/shuttle/countdow...|
|     /pub/winvn/docs|
|     /IMAGES/RSS.GIF|
|/history/apollo/-...|
|/pub/winvn/readme...|
|          /ksc.shtml|
|/img/sportstalk3.gif|
|          /home.html|
|/shuttle/missions...|
|/shuttle/technolo...|
+--------------------+
only showing top 20 rows



## Question 11

## Show the 25 endpoints that generate the most 404 codes

The following command shows the top 25 endpoints that generate the most 404 codes, provided that many exist:

In [41]:
result = (parsed_def.filter(lambda line: line[9] == '404')
          .map(lambda line: (line[6], 1))
          .reduceByKey(lambda a, b: a+b).distinct()
          .takeOrdered(25, lambda x: -x[1]))
result

[('/pub/winvn/readme.txt', 633),
 ('/pub/winvn/release.txt', 494),
 ('/shuttle/missions/STS-69/mission-STS-69.html', 431),
 ('/images/nasa-logo.gif', 319),
 ('/elv/DELTA/uncons.htm', 178),
 ('/shuttle/missions/sts-68/ksc-upclose.gif', 156),
 ('/history/apollo/sa-1/sa-1-patch-small.gif', 146),
 ('/images/crawlerway-logo.gif', 120),
 ('/://spacelink.msfc.nasa.gov', 117),
 ('/history/apollo/pad-abort-test-1/pad-abort-test-1-patch-small.gif', 100),
 ('/history/apollo/a-001/a-001-patch-small.gif', 97),
 ('/images/Nasa-logo.gif', 85),
 ('/shuttle/resources/orbiters/atlantis.gif', 64),
 ('/history/apollo/images/little-joe.jpg', 62),
 ('/images/lf-logo.gif', 59),
 ('/shuttle/resources/orbiters/discovery.gif', 56),
 ('/shuttle/resources/orbiters/challenger.gif', 54),
 ('/robots.txt', 53),
 ('/elv/new01.gif>', 43),
 ('/history/apollo/pad-abort-test-2/pad-abort-test-2-patch-small.gif', 38),
 ('/pub/', 36),
 ('/pub', 36),
 ('/history/apollo/sa-2/sa-2-patch-small.gif', 35),
 ('/history/apollo/sa-5/

In [42]:
topCodigo404 = sqlContext.sql("""
   SELECT endpoint, COUNT(endpoint) as top FROM lognasa WHERE repuesta = "404" GROUP BY endpoint ORDER BY top DESC LIMIT 25
""")
topCodigo404.show()

+--------------------+---+
|            endpoint|top|
+--------------------+---+
|/pub/winvn/readme...|633|
|/pub/winvn/releas...|494|
|/shuttle/missions...|431|
|/images/nasa-logo...|319|
|/elv/DELTA/uncons...|178|
|/shuttle/missions...|156|
|/history/apollo/s...|146|
|/images/crawlerwa...|120|
|/://spacelink.msf...|117|
|/history/apollo/p...|100|
|/history/apollo/a...| 97|
|/images/Nasa-logo...| 85|
|/shuttle/resource...| 64|
|/history/apollo/i...| 62|
| /images/lf-logo.gif| 59|
|/shuttle/resource...| 56|
|/shuttle/resource...| 54|
|         /robots.txt| 53|
|     /elv/new01.gif>| 43|
|/history/apollo/p...| 38|
+--------------------+---+
only showing top 20 rows



## Question 12

## Top 5 days with the most 404s

To get the top 5 days that generate the most 404 error codes:

In [43]:
result = (parsed_def.filter(lambda line: line[9] == '404')
          .map(lambda line:  (day_month(line), 1))
          .reduceByKey(lambda a, b: a+b).collect())
day = [x[0] for x in result]
count = [x[1] for x in result]
day_count_dct = {'day':day, 'count':count}
day_count_df = pd.DataFrame(day_count_dct )


Again, the conversion to a DataFrame is used only for a clearer display:

In [44]:
day_count_df.sort_values('count', ascending = False)[:10]


| | count | day |
|---|---|---|
| 20 | 532 | 1995-08-07 |
| 10 | 381 | 1995-08-08 |
| 6 | 372 | 1995-08-06 |
| 17 | 346 | 1995-08-04 |
| 1 | 326 | 1995-08-15 |
| 14 | 314 | 1995-08-10 |
| 2 | 312 | 1995-08-20 |
| 19 | 305 | 1995-08-21 |
| 5 | 303 | 1995-08-03 |
| 0 | 288 | 1995-08-22 |


In [45]:
top5 = sqlContext.sql("""
   SELECT fecha, COUNT(fecha) as top FROM lognasa WHERE repuesta = "404" GROUP BY fecha ORDER BY top DESC LIMIT 5
""")
top5.show()

+-----------+---+
|      fecha|top|
+-----------+---+
|07/Aug/1995|532|
|08/Aug/1995|381|
|06/Aug/1995|372|
|04/Aug/1995|346|
|15/Aug/1995|326|
+-----------+---+
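In the SQL table `fecha` is stored as text such as `07/Aug/1995`, so sorting on it is alphabetical rather than chronological. A minimal sketch of converting the five reported rows to real dates and listing them in calendar order follows; `to_date` is a helper name introduced here, and the `%b` directive assumes an English locale for month abbreviations.

```python
from datetime import datetime

# The five (fecha, count) rows reported by the query above.
top5_rows = [("07/Aug/1995", 532), ("08/Aug/1995", 381), ("06/Aug/1995", 372),
             ("04/Aug/1995", 346), ("15/Aug/1995", 326)]

def to_date(fecha):
    """Parse 'dd/Mon/yyyy' text into a date (assumes English month names)."""
    return datetime.strptime(fecha, "%d/%b/%Y").date()

# Sort by the parsed date instead of by the raw string.
chronological = sorted(top5_rows, key=lambda row: to_date(row[0]))
iso_days = [to_date(fecha).isoformat() for fecha, _ in chronological]
print(iso_days)
```

ISO-formatted strings (`1995-08-04`) sort correctly as plain text, which makes them a convenient storage format when a date column type is not used.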



In [50]:
sc.stop()