# [FiveThirtyEight: Every Guest Jon Stewart Ever Had On ‘The Daily Show’](https://fivethirtyeight.com/features/every-guest-jon-stewart-ever-had-on-the-daily-show/)



### Daily Show Guests

Original story: [Every Guest Jon Stewart Ever Had On ‘The Daily Show’](http://fivethirtyeight.com/datalab/every-guest-jon-stewart-ever-had-on-the-daily-show/)

Header | Definition
---|---------
`YEAR` | The year the episode aired
`GoogleKnowlege_Occupation` | Their occupation or office, according to Google's Knowledge Graph or, if they're not in there, how Stewart introduced them on the program.
`Show` | Air date of episode. Not unique, as some shows had more than one guest
`Group` | A larger group designation for the occupation. For instance, U.S. senators, U.S. presidents, and former presidents are all under "politicians"
`Raw_Guest_List` | The person or list of people who appeared on the show, according to Wikipedia. The GoogleKnowlege_Occupation only refers to one of them in a given row. 

Source: Google Knowledge Graph, The Daily Show clip library, Wikipedia.


CSV: https://github.com/israelzuniga/dlatam-bigdata-workshop/blob/master/notebooks/data/daily_show_guests.csv

### Download the dataset:

In [None]:
!wget https://raw.githubusercontent.com/israelzuniga/dlatam-bigdata-workshop/master/notebooks/data/daily_show_guests.csv

#### Step 1.

Create a Spark context, set the application name to "dailyshow", and set the logging level to "ALL".

In [None]:
import pyspark
sc = pyspark.SparkContext(appName="dailyshow")
sc.setLogLevel("ALL")

![Spark Context](http://www.dataquest.io/blog/images/misc/cluster-overview.png)





Reference: https://spark.apache.org/docs/latest/cluster-overview.html#components

#### Step 2.

Create an RDD from the CSV file and inspect the first five lines of the dataset.

In [None]:
raw_data = sc.textFile("daily_show_guests.csv")

In [None]:
raw_data.take(5)

#### Step 3.

From the previous RDD, create a new RDD in which each line is split into a list of strings at every comma (',').

In [None]:
daily_show = raw_data.map(lambda line: line.split(',')) # Pipeline
daily_show.take(5)
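
Splitting on every comma is a simplification: a CSV field that is quoted and itself contains a comma would be broken into two columns. The following is a minimal sketch of a more defensive parser built on Python's standard `csv` module. The helper name `parse_line` and the sample line are hypothetical and are not part of the original notebook.

```python
import csv

def parse_line(line):
    """Parse one CSV line into a list of fields, honoring quoted commas."""
    # csv.reader accepts any iterable of strings, so we feed it a single line.
    return next(csv.reader([line]))

# Hypothetical sample line whose last field is quoted and contains a comma.
sample = '1999,actor,1/11/99,Acting,"Doe, Jane"'

naive = sample.split(',')    # the quoted field is split in two
parsed = parse_line(sample)  # the quoted field stays in one piece
print(len(naive), len(parsed))
```

Assuming the function is importable on the workers, it could be passed to `map` in place of the lambda, for example `raw_data.map(parse_line)`.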

Spark vs. Hadoop pipelines:

![](https://www.codeproject.com/KB/miscctrl/1023037/SparkVsHadoop.jpg)

#### 4. Count the number of guests for each year of the show

We want a count of the number of guests for each year the show has been on the air. If daily_show were a Python "list of lists", we would use the following code to get the result:

```python
tally = dict()
for line in daily_show:
    year = line[0]
    if year in tally:
        tally[year] = tally[year] + 1
    else:
        tally[year] = 1
```


**How would we do it in PySpark?**

In [None]:
tally = daily_show.map(lambda x: (x[0], 1)).reduceByKey(lambda x,y: x+y)

tally
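
To make the `map` plus `reduceByKey` pattern more concrete, here is a single-machine imitation in plain Python. It is only a sketch of the semantics: the helper `reduce_by_key` and the sample rows are hypothetical, and unlike this version, Spark does not need to sort the keys or guarantee any output order.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

def reduce_by_key(pairs, func):
    """Local imitation of reduceByKey for an iterable of (key, value) pairs."""
    # groupby only merges adjacent items, so sort by key first.
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, reduce(func, (value for _, value in group)))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]

# Hypothetical rows standing in for daily_show; only the first field matters here.
rows = [["1999", "actor"], ["1999", "comedian"], ["2000", "actor"]]

pairs = [(row[0], 1) for row in rows]  # plays the role of map(lambda x: (x[0], 1))
local_tally = reduce_by_key(pairs, lambda x, y: x + y)
print(local_tally)
```

Each year is paired with a 1, and the pairs that share a key are folded together with the addition function, which is the same idea the PySpark line expresses.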

In [None]:
# tally is an RDD, so we cannot use traditional Python built-ins such as len() on it

tally.take(tally.count())
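
Once `take` returns, the result is an ordinary Python list of `(year, count)` tuples, so the usual built-ins apply again. A minimal sketch, using made-up counts that stand in for the real output (the values in `collected` are hypothetical, not taken from the dataset):

```python
# Hypothetical collected output; the real counts come from the dataset.
collected = [("2001", 3), ("1999", 2), ("2000", 5)]

n_years = len(collected)     # len() works on the local list
by_year = sorted(collected)  # tuples sort by their first element, the year
as_dict = dict(collected)    # convenient lookup by year

print(n_years, by_year, as_dict["2000"])
```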

#### 5. Filter the RDD to remove the header row, the source of the ('YEAR', 1) tuple

In [None]:
def filter_year(line):
    # Keep every row except the header, whose first field is the literal 'YEAR'
    return line[0] != 'YEAR'

filtered_daily_show = daily_show.filter(filter_year)



In [None]:
#filtered_daily_show.collect()

In [None]:
filtered_daily_show.take(5)

#### 6. Filter out guests with no listed occupation, convert each occupation to lowercase, count the occupations, and take 10 values

#### 6a. Using separate variables
#### 6b. In a single pipeline; sort by count (descending) and return the first 15 elements
##### Hint: Sort with:   ```sortBy(lambda a: -a[1])```

In [None]:
paso1 = filtered_daily_show.filter(lambda line: line[1] != '')
paso2 = paso1.map(lambda line: (line[1].lower(), 1))
paso3 = paso2.reduceByKey(lambda x,y: x+y)

paso3.take(10)

In [None]:
filtered_daily_show.filter(lambda line: line[1] != '') \
                   .map(lambda line: (line[1].lower(), 1)) \
                   .reduceByKey(lambda x,y: x+y) \
                   .sortBy(lambda a: -a[1]) \
                   .take(15)
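
As a way to sanity-check the pipeline's logic without a cluster, the same steps can be reproduced locally on a handful of rows. This is only a sketch: the rows below are invented for illustration and follow the dataset's column order, and `top_occupations` is a hypothetical helper name. Ties may also be ordered differently than in Spark.

```python
from collections import Counter

# Hypothetical rows in the dataset's column order:
# YEAR, GoogleKnowlege_Occupation, Show, Group, Raw_Guest_List
rows = [
    ["1999", "Actor", "1/11/99", "Acting", "Guest A"],
    ["1999", "actor", "1/12/99", "Acting", "Guest B"],
    ["1999", "", "1/13/99", "", "Guest C"],
    ["2000", "Comedian", "1/10/00", "Comedy", "Guest D"],
]

def top_occupations(rows, n):
    """Drop empty occupations, lowercase the rest, count, and return the n most common."""
    counts = Counter(row[1].lower() for row in rows if row[1] != "")
    return counts.most_common(n)

print(top_occupations(rows, 15))
```

Here "Actor" and "actor" collapse into one key after lowercasing, and the row with an empty occupation is excluded from the count.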

In [None]:
sc.stop()