This article introduces the basic operations on RDDs in Spark. RDD stands for Resilient Distributed Dataset; it is an abstract data type provided by Spark.


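## Creating RDDs

There are generally two ways to create an RDD: reading a dataset from an external source, or generating one inside the program. The two examples below demonstrate each approach.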



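### Creating an RDD inside the program
The following example shows how to create an RDD in Spark. We first initialize an RDD made up of a set of integers inside the program, then square each of those integers.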

In [1]:
from pyspark import SparkContext
# Every Spark program needs a SparkContext to run
sc = SparkContext(appName='square the numbers')
nums = sc.parallelize([1, 2, 3, 4])
squared = nums.map(lambda x: x * x)
for num in squared.collect():
    print('%i ' % num)

1 
4 
9 
16 


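`SparkContext` is the Spark context; every Spark program needs to obtain one in order to run. The `parallelize` method quickly creates an RDD from a collection inside the program, and `collect` loads the RDD into the driver program's memory as a list.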


### Creating an RDD from an external dataset



In [7]:
# Each element of the resulting RDD is one line of the text file
lines = sc.textFile('/home/hschen/Data/wordcount.txt')
for i, line in enumerate(lines.collect()):
    print('line %d:%s' % (i + 1, line))

line 1:We've all heard the scare stories about North Korea: the homemade nuclear arsenal built while their people starve and then aimed imprecisely at the rest of the world, a 
line 2:leader so deluded he makes L Ron Hubbard look like a man excessively overburdened with self-doubt and their deep-seated belief that foreign capitalists will invade at any 
line 3:moment and steal all their bauxite.
line 4:The popular portrayal of this Marxist nation is something like one of the more harrowing episodes of M*A*S*H, only with the cast of wacky characters replaced by twitchy, 
line 5:    heavily armed Stalinist meth addicts
line 6:    Cracked would like to take a moment to celebrate the good things about North Korea though, the things that the country's enemies prefer to suppress as part of their politically 
line 7:    motivated jealousy. Like how no different to you and me, there's nothing every North Korean likes more after an 18 hour shift at the phosphorus plant than a nice beer to go wi

The `sc.textFile()` method reads a text file from an external source and creates an RDD whose elements are the lines of that file.

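# Basic RDD operations

RDD operations fall into two categories: `transform` and `action` (the Spark documentation calls them transformations and actions).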

-  `transform` 

A transform maps one RDD to another RDD.

-  `action` 

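An action triggers the actual computation. Because Spark uses lazy evaluation, a `transform` only defines a series of transformations; real work happens only when the program executes an `action`, and the result is then returned to the driver program or written to the file system.

If you cannot tell which category a function belongs to, look at its return value. If it returns an RDD, it is a transform; if it returns any other data type, it is an action.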

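## Spark's lazy evaluation mechanism

As mentioned above, Spark's computation model is lazy (lazy evaluation). This means a `transform` is not executed right away; it is deferred until an `action` is called, at which point the pending transforms are executed together.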
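
The deferral described above can be illustrated with a plain-Python analogy. This is only a sketch: it uses a generator expression rather than Spark, and the `log` list and `square` helper are hypothetical names introduced to make the moment of execution visible.

```python
# Plain-Python analogy of lazy evaluation (this is not Spark itself).
log = []  # hypothetical helper: records each real invocation of square()


def square(x):
    log.append(x)
    return x * x


nums = [1, 2, 3, 4]

# "transform": creating the generator only describes the work to be done
squared = (square(x) for x in nums)
calls_before_action = len(log)  # nothing has been computed yet

# "action": consuming the generator is what triggers the computation
result = list(squared)
calls_after_action = len(log)

print(calls_before_action, calls_after_action, result)
# prints: 0 4 [1, 4, 9, 16]
```

In PySpark the roles would roughly correspond to `nums.map(...)` (describes the work) and `collect()` (runs it).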

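## Common transform operations

Commonly used transform operations include:

- `map`: applies the same function to every element of the RDD and returns an RDD made up of the results
- `filter`: filters the data by a condition, keeping the elements for which the predicate is true
- `flatMap`: similar to Python's `itertools.chain`; the function returns an iterable for each element, and the elements of all those iterables are placed into a single RDD
- `distinct`: removes duplicate elements from the RDD
- `sample`: draws a random sample from the RDD, with or without replacement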
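
To make the list above concrete, here is a minimal sketch of each transform's semantics on an ordinary Python list. It is a local analogy, not PySpark: the sample data is made up, and note one difference I am assuming matters in practice, namely that PySpark's `sample` takes a sampling fraction (`sample(withReplacement, fraction, seed)`) whereas the standard-library calls below take an exact count.

```python
import random
from itertools import chain

lines = ['a b', 'b c', 'c']  # hypothetical sample data

# map: exactly one output element per input element
lengths = list(map(len, lines))  # [3, 3, 1]

# filter: keep only the elements for which the predicate is true
with_b = list(filter(lambda s: 'b' in s, lines))  # ['a b', 'b c']

# flatMap: map each element to an iterable, then flatten the results
words = list(chain.from_iterable(s.split() for s in lines))
# ['a', 'b', 'b', 'c', 'c']

# distinct: drop duplicates (sorted here only to get a stable order)
unique_words = sorted(set(words))  # ['a', 'b', 'c']

# sample: random draws without and with replacement
rng = random.Random(0)
without_replacement = rng.sample(words, 2)
with_replacement = rng.choices(words, k=2)
```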



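## Common action operations

Commonly used action operations include:

- `reduce`: combines the elements of the RDD using a function that takes two elements and returns a value of the same type, until a single value remains
- `collect`: returns all elements of the RDD. The driver program gathers the data from every executor and holds it in its own memory, so if that memory is insufficient the operation fails with an out-of-memory error
- `take(n)`: returns the first n elements of the RDD
- `top(n)`: returns the n largest elements of the RDD, in descending order
- `count`: returns the number of elements in the RDD
- `countByValue`: returns the number of occurrences of each distinct element in the RDD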

Summing with the `reduce` operation:

In [2]:
squared.reduce(lambda x,y:x+y)

30
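
Spark reduces each partition locally and then merges the partial results, which is why the function passed to `reduce` should be commutative and associative. The sketch below imitates that behaviour in plain Python; the two-way split into `partitions` is a hypothetical layout chosen for illustration, not what Spark would necessarily pick.

```python
from functools import reduce
from operator import add, sub

squared = [1, 4, 9, 16]
partitions = [squared[:2], squared[2:]]  # hypothetical split: [1, 4] and [9, 16]


def partitioned_reduce(func, parts):
    """Reduce every partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in parts]
    return reduce(func, partials)


# Addition is commutative and associative: the partitioning does not matter.
total = partitioned_reduce(add, partitions)  # 30
sequential_total = reduce(add, squared)      # 30

# Subtraction is neither, so the two strategies disagree.
partitioned_diff = partitioned_reduce(sub, partitions)  # (1-4) - (9-16) = 4
sequential_diff = reduce(sub, squared)                  # ((1-4)-9)-16 = -28
```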

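Reading a file and removing duplicate words

Implementing wordcount with Spark. Note that `countByValue` is applied directly to `lines` here, so the cell below counts how many times each line occurs; counting individual words would first require splitting every line into words, for example with `flatMap`.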

In [6]:
# Each key is a whole line of the file, because `lines` has not been split into words
word_count = lines.countByValue()
for line, count in word_count.items():
    print('%s %d' % (line, count))

    his dried fish ration. Ever attentive to its people's needs and in the twinkling of a decade, North Korea's leadership bought, disassembled, transported and rebuilt a British  1
    that even the very blend of seasoning used is intentionally kept from them. And they call North Korea paranoid? 1
    Or how about the fried chicken restaurant that downtown Pyongyang boasts? Yes real chicken, fried and then delivered to your sleeping cube, with optional beer if you like! You  1
The popular portrayal of this Marxist nation is something like one of the more harrowing episodes of M*A*S*H, only with the cast of wacky characters replaced by twitchy,  1
    past Bill's many, many imperfections and treat him with the pity and kindness he deserves, accepting his feeble pleas to pardon the American spies rightly convicted of photographing  1
    the nation's sensitive beetroot fields. 1
moment and steal all their bauxite. 1
    And how many nations would entertain the syphilitic, bourgeois ramb