## Wordcount

The goal of this example is to count how many times each word appears across the complete works of Shakespeare.
Data source: http://www.gutenberg.org/cache/epub/100/pg100.txt (the introduction and license were removed).


In [1]:
import pyspark

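# Reuse the SparkContext if this session already defines `sc`;
# otherwise create a local one that uses all available cores.
try:
    type(sc)
except NameError:
    sc = pyspark.SparkContext('local[*]')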

In [2]:
shakespeareRDD = sc.textFile('data/shakespeare.txt',8)

In [3]:
shakespeareRDD.take(20)

['1609',
 '',
 'THE SONNETS',
 '',
 'by William Shakespeare',
 '',
 '',
 '',
 '                     1',
 '  From fairest creatures we desire increase,',
 "  That thereby beauty's rose might never die,",
 '  But as the riper should by time decease,',
 '  His tender heir might bear his memory:',
 '  But thou contracted to thine own bright eyes,',
 "  Feed'st thy light's flame with self-substantial fuel,",
 '  Making a famine where abundance lies,',
 '  Thy self thy foe, to thy sweet self too cruel:',
 "  Thou that art now the world's fresh ornament,",
 '  And only herald to the gaudy spring,',
 '  Within thine own bud buriest thy content,']

In [26]:
wordsRDD = shakespeareRDD.flatMap(lambda line: line.split())
wordsRDD.take(10)

['1609',
 'THE',
 'SONNETS',
 'by',
 'William',
 'Shakespeare',
 '1',
 'From',
 'fairest',
 'creatures']

In [30]:
wordsCountRDD = wordsRDD.map(lambda word: (word,1))
wordsCountRDD.take(10)

[('1609', 1),
 ('THE', 1),
 ('SONNETS', 1),
 ('by', 1),
 ('William', 1),
 ('Shakespeare', 1),
 ('1', 1),
 ('From', 1),
 ('fairest', 1),
 ('creatures', 1)]

In [9]:
# Sum the 1s for each word, then sort by count in descending order.
wordsCountRDD.reduceByKey(lambda a, b: a + b).sortBy(lambda x: x[1], ascending=False).take(10)

[('the', 23373),
 ('I', 19540),
 ('and', 18334),
 ('to', 15667),
 ('of', 15626),
 ('a', 12575),
 ('my', 10825),
 ('in', 9624),
 ('you', 9111),
 ('is', 7862)]
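
The pipeline above (`flatMap`, `map`, `reduceByKey`, `sortBy`) can be mirrored in plain Python to see what each step does. This is only a sketch on a small in-memory sample: `sample_lines` is a hypothetical stand-in for the contents of `data/shakespeare.txt`, and `collections.Counter` replaces the distributed reduce.

```python
from collections import Counter

# Hypothetical sample; the notebook reads its lines from data/shakespeare.txt.
sample_lines = [
    "  From fairest creatures we desire increase,",
    "  But as the riper should by time decease,",
    "  But thou contracted to thine own bright eyes,",
]

# Equivalent of flatMap(lambda line: line.split()):
# one flat list of whitespace-separated tokens.
words = [word for line in sample_lines for word in line.split()]

# Equivalent of map(word -> (word, 1)) followed by reduceByKey(a + b):
# count how many times each key occurs.
word_counts = Counter(words)

# Equivalent of sortBy(count, descending) followed by take(3).
top_words = word_counts.most_common(3)
print(top_words)
```

Because `split()` only splits on whitespace, tokens keep their case and punctuation (`'increase,'` includes the comma). That is why `'the'` and `'THE'` are separate keys in the Spark output above.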

## N-Grams

### What is an n-gram

In this notebook, an n-gram is a contiguous sequence of n characters taken from a text string.

In [10]:
def trigrams(t):
    t=t.lower()
    return [t[i:i+3] for i in range(0, len(t) - 2)]

In [11]:
trigrams("hola datos")

['hol', 'ola', 'la ', 'a d', ' da', 'dat', 'ato', 'tos']
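
The `trigrams` function above is the n = 3 case of the definition. A minimal sketch of the general case follows; the name `char_ngrams` and the `ValueError` check are illustrative choices, not part of the original notebook.

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters in text, lowercased."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    text = text.lower()
    # A string of length L has L - n + 1 n-grams; range() is empty when L < n.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("hola datos", 3))
print(char_ngrams("hola", 2))
```

With `n=3` this returns the same list as `trigrams("hola datos")`, and a string shorter than `n` yields an empty list.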

## Indiana Jones: In search of all the trigrams in the complete works of Shakespeare

In [12]:
anotherShakespeareRDD = sc.textFile('data/shakespeare.txt',8)

In [13]:
trigramsRDD = anotherShakespeareRDD.flatMap(trigrams).filter(lambda a : a != '   ')

In [14]:
trigramsRDD.take(10)

['160', '609', 'the', 'he ', 'e s', ' so', 'son', 'onn', 'nne', 'net']
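
Since `flatMap` applies `trigrams` to one line at a time, trigrams never span a line break, and lines shorter than three characters contribute nothing. The plain-Python sketch below reproduces the step on a few lines taken from the `take(20)` output shown earlier; the width of the indentation in the last line is an assumption.

```python
def trigrams(t):
    t = t.lower()
    return [t[i:i + 3] for i in range(0, len(t) - 2)]


# A few lines from the start of the file. The indentation width of the
# sonnet number is assumed here; only the fact that it is indented matters.
lines = ['1609', '', 'THE SONNETS', ' ' * 21 + '1']

# Equivalent of flatMap(trigrams): concatenate the per-line lists.
all_trigrams = [g for line in lines for g in trigrams(line)]

# Equivalent of filter(lambda a: a != '   '): drop all-space trigrams,
# which come mostly from indentation.
kept = [g for g in all_trigrams if g != '   ']

print(kept[:10])
print(len(all_trigrams), len(kept))
```

The first ten kept trigrams match the `take(10)` output above, and the indented line contributes a single surviving trigram (`'  1'`) once the all-space ones are filtered out.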

### Computing the frequency of each trigram

In [15]:
trigramsCount = trigramsRDD.map(lambda x: (x, 1)).reduceByKey(lambda x,y: x+y)
print(trigramsCount.take(5))

[('160', 16), ('iam', 475), ('am ', 2326), ('m s', 813), ('hak', 407)]


In [16]:
trigramsCountSorted = trigramsCount.sortBy(ascending=False,keyfunc=lambda x:x[1])
print(trigramsCountSorted.take(20))

[(' th', 83504), ('the', 52000), ('he ', 35101), ('and', 32677), (' an', 32633), ('nd ', 31158), (' to', 23607), ('is ', 23118), (' yo', 22873), ('you', 22242), (' he', 20994), ('to ', 19818), (' of', 19811), (' no', 19309), (' i ', 19146), ('her', 18969), ('hat', 18789), ('ll ', 18605), ('at ', 18091), (' wi', 17937)]


In [17]:
trigramsCountSorted.take(10)

[(' th', 83504),
 ('the', 52000),
 ('he ', 35101),
 ('and', 32677),
 (' an', 32633),
 ('nd ', 31158),
 (' to', 23607),
 ('is ', 23118),
 (' yo', 22873),
 ('you', 22242)]

### Computing the total frequency of the collection

In [20]:
totalFrec2 = trigramsCountSorted.map(lambda x: x[1]).reduce(lambda x,y: x+y)
totalFrec1 = trigramsRDD.count()
print(totalFrec2)
print(totalFrec1)

4829019
4829019


In [21]:
assert totalFrec2 == totalFrec1
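
The two totals agree because grouping by key only partitions the elements: adding up the per-trigram counts has to give back the number of trigrams. A small plain-Python sketch of the same check, using `collections.Counter` in place of `reduceByKey` and two lines from the start of the file:

```python
from collections import Counter


def trigrams(t):
    t = t.lower()
    return [t[i:i + 3] for i in range(0, len(t) - 2)]


lines = ["THE SONNETS", "by William Shakespeare"]
all_trigrams = [g for line in lines for g in trigrams(line)]

# Per-trigram counts, the local analogue of map + reduceByKey.
counts = Counter(all_trigrams)

# Analogue of totalFrec2: sum of the per-key counts.
total_from_counts = sum(counts.values())
# Analogue of totalFrec1: number of elements before grouping.
total_from_elements = len(all_trigrams)

print(total_from_counts, total_from_elements)
assert total_from_counts == total_from_elements
```

In Spark, `trigramsRDD.count()` gets the total without a shuffle, so it is the cheaper of the two ways to compute it.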

### Computing the probability of each trigram

In [23]:
# Probability of a trigram = its count divided by the total number of trigrams,
# rounded to 3 decimal places.
trigramsProb = trigramsCountSorted.map(lambda x: (x[0], round(float(x[1]) / totalFrec1, 3)))

In [24]:
trigramsProb.take(10)

[(' th', 0.017),
 ('the', 0.011),
 ('he ', 0.007),
 ('and', 0.007),
 (' an', 0.007),
 ('nd ', 0.006),
 (' to', 0.005),
 ('is ', 0.005),
 (' yo', 0.005),
 ('you', 0.005)]
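
Each probability is the trigram's count divided by the total number of trigrams. The sketch below repeats the arithmetic in plain Python with counts copied from the outputs above; the dictionary is just a hand-picked subset for illustration.

```python
# Total number of trigrams, as printed for totalFrec1 above.
total = 4829019

# A few counts copied from the sorted output above.
counts = {' th': 83504, 'the': 52000, 'he ': 35101}

# Relative frequency rounded to 3 decimals, as in the notebook.
probs = {gram: round(count / total, 3) for gram, count in counts.items()}
print(probs)

# Rounding to 3 decimals sends rare trigrams to 0.0:
# '160' occurs 16 times, about 3.3e-06 of the collection.
rare = round(16 / total, 3)
print(rare)
```

Rounding to three decimals is convenient for display, but it sends rare trigrams such as `'160'` to `0.0`. Keeping the unrounded value may be preferable if the probabilities feed a later computation.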