## Wordcount

The goal of this example is to count how many times each word appears across the complete works of Shakespeare.
Data source: http://www.gutenberg.org/cache/epub/100/pg100.txt (the introduction and license were removed).


In [1]:
import pyspark

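try:
    type(sc)  # reuse the SparkContext if the notebook already has one
except NameError:
    # otherwise start a local context that uses all available cores
    sc = pyspark.SparkContext('local[*]')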

In [2]:
shakespeareRDD = sc.textFile('data/shakespeare.txt', 8)  # 8 = minimum number of partitions

In [3]:
shakespeareRDD.take(10)

['1609',
 '',
 'THE SONNETS',
 '',
 'by William Shakespeare',
 '',
 '',
 '',
 '                     1',
 '  From fairest creatures we desire increase,']

In [5]:
wordsRDD = shakespeareRDD.flatMap(lambda line: line.split())
wordsRDD.take(10)

['1609',
 'THE',
 'SONNETS',
 'by',
 'William',
 'Shakespeare',
 '1',
 'From',
 'fairest',
 'creatures']
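The empty lines seen in the raw file are gone from the output above because `flatMap` flattens the per-line lists, and an empty line splits into an empty list. A minimal local sketch in plain Python (no Spark; the sample `lines` are taken from the first rows shown earlier) contrasts the two behaviours:

```python
from itertools import chain

lines = ["THE SONNETS", "", "by William Shakespeare"]

# What map(lambda line: line.split()) would yield: one list per line.
mapped = [line.split() for line in lines]

# What flatMap(lambda line: line.split()) would yield: one flat list of words.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)       # [['THE', 'SONNETS'], [], ['by', 'William', 'Shakespeare']]
print(flat_mapped)  # ['THE', 'SONNETS', 'by', 'William', 'Shakespeare']
```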

In [8]:
wordsCountRDD = wordsRDD.map(lambda word: (word,1))
wordsCountRDD.take(5)

[('1609', 1), ('THE', 1), ('SONNETS', 1), ('by', 1), ('William', 1)]

In [9]:
wordsCountRDD.reduceByKey(lambda a,b: a+b).take(10)

[('him-let', 2),
 ('fleshly', 1),
 ('dislike', 9),
 ('PORTIA..', 1),
 ('sea.', 18),
 ('wronging', 1),
 ('cars;', 1),
 ('afford', 17),
 ('underground', 1),
 ('north-north-east', 1)]
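To make the semantics of `reduceByKey` concrete, here is a rough local stand-in written in plain Python. The helper `reduce_by_key` is hypothetical and only mimics the result on a small list. It says nothing about how Spark distributes the work across partitions.

```python
def reduce_by_key(pairs, func):
    """Combine all values that share a key using func (local illustration only)."""
    result = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are folded in with func.
        result[key] = func(result[key], value) if key in result else value
    return result

words = ["to", "be", "or", "not", "to", "be"]
pairs = [(word, 1) for word in words]               # same shape as wordsCountRDD
counts = reduce_by_key(pairs, lambda a, b: a + b)   # same lambda as in the cell above

print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```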

In [13]:
wordsCountRDD.reduceByKey(lambda a,b: a+b).sortBy(ascending=False,keyfunc=lambda x:x[1]).take(10)

[('the', 23373),
 ('I', 19540),
 ('and', 18334),
 ('to', 15667),
 ('of', 15626),
 ('a', 12575),
 ('my', 10825),
 ('in', 9624),
 ('you', 9111),
 ('is', 7862)]
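These counts are case-sensitive and keep punctuation attached to the words, which is why tokens such as `'sea.'` and `'PORTIA..'` appear as separate keys in the earlier output. One possible cleanup, sketched here under the assumption that lowercasing and stripping leading and trailing punctuation is acceptable for the analysis, uses a hypothetical helper `normalize`:

```python
import string

def normalize(word):
    """Hypothetical helper: lowercase and strip leading/trailing punctuation."""
    return word.lower().strip(string.punctuation)

tokens = ["sea.", "Sea", "PORTIA..", "him-let", "cars;"]
cleaned = [normalize(token) for token in tokens]
cleaned = [token for token in cleaned if token]  # drop tokens that were only punctuation

print(cleaned)  # ['sea', 'sea', 'portia', 'him-let', 'cars']
```

In the Spark pipeline this could presumably be applied as `wordsRDD.map(normalize).filter(lambda w: w != '')` before building the `(word, 1)` pairs. The resulting counts would differ from the ones shown above.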

## N-Grams

### What is an n-gram

An n-gram is a contiguous sequence of n characters taken from a text string.

In [14]:
def trigrams(t):
    t=t.lower()
    return [t[i:i+3] for i in range(0, len(t) - 2)]

In [15]:
trigrams("hola datos")

['hol', 'ola', 'la ', 'a d', ' da', 'dat', 'ato', 'tos']
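The same sliding-window idea extends to any n. The function `char_ngrams` below is a hypothetical generalization of `trigrams`, not part of the original notebook:

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of text, lowercased."""
    if n < 1:
        raise ValueError("n must be at least 1")
    text = text.lower()
    # A string of length L has L - n + 1 windows of size n (none if L < n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("hola datos", 3))  # ['hol', 'ola', 'la ', 'a d', ' da', 'dat', 'ato', 'tos']
print(char_ngrams("hola", 2))        # ['ho', 'ol', 'la']
print(char_ngrams("ab", 3))          # []
```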

Extracting the trigrams from the complete works of Shakespeare

In [16]:
anotherShakespeareRDD = sc.textFile('data/shakespeare.txt',8)

In [20]:
# drop the trigram made only of spaces (it comes mostly from line indentation)
trigramsRDD = anotherShakespeareRDD.flatMap(trigrams).filter(lambda a : a != '   ')


In [21]:
trigramsRDD.take(10)

['160', '609', 'the', 'he ', 'e s', ' so', 'son', 'onn', 'nne', 'net']

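Computing the frequency of each trigram. The outputs below still list the all-space trigram `'   '`, so they appear to come from a run made before the filter on it was applied.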

In [24]:
trigramsCount = trigramsRDD.map(lambda x: (x, 1)).reduceByKey(lambda x,y: x+y)
print(trigramsCount.take(5))

[('wol', 166), ('dib', 8), ('e."', 7), ('xam', 68), ('ne:', 61)]


In [27]:
trigramsCountSorted = trigramsCount.sortBy(ascending=False,keyfunc=lambda x:x[1])
print(trigramsCountSorted.take(20))

[('   ', 274092), (' th', 83504), ('the', 52000), ('he ', 35101), ('and', 32677), (' an', 32633), ('nd ', 31158), (' to', 23607), ('is ', 23118), (' yo', 22873), ('you', 22242), (' he', 20994), ('to ', 19818), (' of', 19811), (' no', 19309), (' i ', 19146), ('her', 18969), ('hat', 18789), ('ll ', 18605), ('at ', 18091)]


In [28]:
trigramsCountSorted.take(10)

[('   ', 274092),
 (' th', 83504),
 ('the', 52000),
 ('he ', 35101),
 ('and', 32677),
 (' an', 32633),
 ('nd ', 31158),
 (' to', 23607),
 ('is ', 23118),
 (' yo', 22873)]

### Computing the total frequency of the collection

In [29]:
totalFrec = trigramsCountSorted.map(lambda x: x[1]).reduce(lambda x,y: x+y)

In [30]:
print(totalFrec)

5103111


### Computing the probability of each trigram

In [32]:
print(trigramsCountSorted.take(5))
trigramsProb = trigramsCountSorted.map(lambda x: (x[0],round(float(x[1])/totalFrec,3)))

[('   ', 274092), (' th', 83504), ('the', 52000), ('he ', 35101), ('and', 32677)]


In [33]:
trigramsProb.take(10)

[('   ', 0.054),
 (' th', 0.016),
 ('the', 0.01),
 ('he ', 0.007),
 ('and', 0.006),
 (' an', 0.006),
 ('nd ', 0.006),
 (' to', 0.005),
 ('is ', 0.005),
 (' yo', 0.004)]
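Each probability is the trigram's count divided by the total count, so the unrounded values sum to 1. The sketch below reproduces that calculation locally in plain Python on a toy string. The function `trigram_probabilities` is hypothetical. The last two lines illustrate a side effect of rounding to three decimals with the total of 5103111 reported above: any trigram with a count of 2551 or less rounds to 0.0.

```python
from collections import Counter

def trigram_probabilities(text):
    """Relative frequency of each character trigram in text (local illustration)."""
    text = text.lower()
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

probs = trigram_probabilities("banana")
print(probs)                 # {'ban': 0.25, 'ana': 0.5, 'nan': 0.25}
print(sum(probs.values()))   # 1.0

# Effect of round(..., 3) with the notebook's total of 5103111 trigrams:
print(round(2551 / 5103111, 3))  # 0.0
print(round(2552 / 5103111, 3))  # 0.001
```

If rare trigrams matter for later steps, keeping the unrounded values, or rounding only for display, may be preferable.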