# DATASCI W261: Machine Learning at Scale
## Assignment Week 10
Miki Seltzer (miki.seltzer@berkeley.edu)<br>
W261-2, Spring 2016<br>
Submission: 

# Start PySpark

In [1]:
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)

if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0,os.path.join(spark_home,'python'))
sys.path.insert(0,os.path.join(spark_home,'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home,'python/pyspark/shell.py'))
 

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.0-cdh5.5.0
      /_/

Using Python version 2.7.11 (default, Dec  6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.


# HW 10.0
## What is Apache Spark and how is it different to Apache Hadoop? 

Spark is a general-purpose cluster computing engine that is typically much faster than the Hadoop MapReduce jobs we have been writing so far. It can run on top of a Hadoop cluster or in standalone mode. It can read data stored in HDFS as well as in other storage systems such as HBase, Hive, and Cassandra. It can also ingest local text files, CSV files, etc.

Spark's main difference from Hadoop MapReduce is that intermediate results can be kept in memory as RDDs (Resilient Distributed Datasets) instead of being written to disk between jobs, which allows iterative jobs to be completed much more efficiently.

## Fill in the blanks:
Spark API consists of interfaces to develop applications based on it in Java, __Python, Scala__ languages (list languages). 

Using Spark, resource management can be done either in a single server instance or using a framework such as Mesos or __YARN__ in a distributed manner.

## What is an RDD and show a fun example of creating one and bringing the first element back to the driver program.

An RDD is a resilient distributed dataset with the following properties:
- Distributed storage: the data is partitioned across the nodes of the cluster
- Lazy evaluation: an RDD is not materialized until an action is called
- Supports transformations (which return a new RDD, loosely similar to mappers) and actions (which return a result to the driver, loosely similar to reducers)
- An RDD can be cached/persisted so that it is not recomputed every time an action uses it
- Can be a base RDD (only values) or a pair RDD (key-value pairs)

#### Example:
```
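from pyspark import SparkContext
sc = SparkContext("local", "Example")  # skip in this notebook, where sc already exists
data = sc.textFile('some_text_file.txt').cache()
# first() is an action: it brings the first line containing 'data' back to the driver
firstWithData = data.filter(lambda x: 'data' in x).first()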
```

## What is lazy evaluation and give an intuitive example of lazy evaluation and comment on the massive computational savings to be had from lazy evaluation.

Lazy evaluation means that an RDD is not materialized until an action is called. When an RDD is loaded and transformed, Spark only records the instructions for how the data should be constructed. The computational savings come from the fact that no work is done until a result needs to be sent back to the driver, and at that point Spark computes only what the action actually requires.

#### Example:
>When working with the 2GB Wikipedia data, loading the data into an RDD should be nearly instantaneous, because we haven't actually performed any actions on it. We can chain many transformations and these will be nearly instantaneous as well, because nothing has been materialized. However, when we want to see the results of our transformations, we can call an action like `take(10)`, and instead of running the transformations on the entire data set, Spark will only process enough data to return the 10 results that we requested (provided the transformations do not require a shuffle).

>For comparison, if we wanted to do this in Hadoop, we would have to process the entire 2GB file, which is computationally expensive if we only need to see 10 resulting records from the data set.
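
The same idea can be illustrated without a cluster. The following is a plain-Python analogy, not Spark code: generators stand in for lazily evaluated transformations, and `itertools.islice` stands in for `take(10)`. The names `read_lines` and `processed`, and the synthetic one-million-line input, are made up for illustration.

```python
from itertools import islice

processed = {'lines': 0}  # counts how many input lines were actually read

def read_lines(n):
    """Hypothetical data source: lazily yields n synthetic lines."""
    for i in range(n):
        processed['lines'] += 1
        yield 'line %d of the data set' % i

# "Transformations": building the pipeline does no work yet.
lines = read_lines(1000000)
upper = (line.upper() for line in lines)
even = (line for line in upper if int(line.split()[1]) % 2 == 0)
touched_before_action = processed['lines']

# "Action": pulling 10 results drives just enough of the pipeline.
first_ten = list(islice(even, 10))
touched_after_action = processed['lines']
```

In this sketch, `touched_before_action` is 0 and `touched_after_action` is 19: only the lines needed to find ten even-numbered records are read, not the full million.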


# HW 10.1
In Spark, write the code to count how often each word appears in a text document (or set of documents). Please use this homework document as the example document to run an experiment. Report the following: provide a sorted list of tokens in decreasing order of frequency of occurrence.


In [12]:
!hdfs dfs -copyFromLocal MIDS-MLS-HW-10.txt /user/miki/week10/hw10_1/

In [49]:
import regex as re

def splitString(line):
    noPunc = re.sub(u'[^A-Za-z0-9 ]+', '', line)
    return noPunc.strip().split()

hwText = sc.textFile('/user/miki/week10/hw10_1/MIDS-MLS-HW-10.txt')
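wordCounts = (hwText.map(lambda x: x.lower())          # normalize case
                    .flatMap(splitString)              # one record per token
                    .map(lambda x: (x, 1))             # emit (word, 1) pairs
                    .reduceByKey(lambda x, y: x + y))  # sum the counts per word
# Swap to (count, word) so that sortByKey orders by count
sortedCounts = wordCounts.map(lambda x: (x[1], x[0])).sortByKey(ascending=False).collect()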

print '{:5s}  {:10s}'.format('count', 'word')
print '--------------------'
for count, word in sortedCounts:
    print '{:5d}  {:10s}'.format(count, word)

count  word      
--------------------
   46  the       
   24  and       
   20  in        
   17  of        
   13  hw        
   12  a         
   11  for       
   11  data      
   11  using     
   10  code      
    9  to        
    8  this      
    8  is        
    8  set       
    8  your      
    8  kmeans    
    7  with      
    7  on        
    6  clusters  
    6  x         
    6  regression
    5  comment   
    5  iterations
    5  from      
    5  as        
    5  report    
    5  linear    
    4  squared   
    4  words     
    4  hw103     
    4  what      
    4  sum       
    4  please    
    4  100       
    4  spark     
    4  following 
    4  model     
    4  each      
    4  provided  
    4  plot      
    4  one       
    4  example   
    3  count     
    3  list      
    3  findings  
    3  evaluation
    3  available 
    3  lazy      
    3  run       
    3  here      
    3  training  
    3  note      
    3  or        
    3  
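
Tokens such as `hw103` and `kmeans` in the output are consistent with how the tokenizer works: it deletes punctuation instead of replacing it with a space, so the pieces on either side are fused. Below is a minimal local sketch of that behavior, using the standard-library `re` module rather than the third-party `regex` package, and a made-up sample line. The function `split_on_punctuation` is a hypothetical alternative for comparison, not part of the submitted solution.

```python
import re

def split_string(line):
    """Same logic as splitString above: delete punctuation, then split."""
    return re.sub(u'[^A-Za-z0-9 ]+', '', line).strip().split()

def split_on_punctuation(line):
    """Hypothetical variant: treat every non-alphanumeric run as a separator."""
    return re.findall(u'[A-Za-z0-9]+', line)

sample = u'hw10.3: run k-means on the data set'  # made-up sample line
fused = split_string(sample)
separated = split_on_punctuation(sample)
```

Here `fused` contains `hw103` and `kmeans`, while `separated` contains `hw10`, `3`, `k`, and `means`. Which behavior is preferable depends on whether identifiers like `HW10.3` should count as one token.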

# HW 10.1.1

Modify the above word count code to count words that begin with lower case letters (a-z) and report your findings. Again sort the output words in decreasing order of frequency.

In [52]:
def startsWithLower(word):
    # Tokens from splitString are never empty, so word[0] is safe
    return word[0].islower()
    
hwText = sc.textFile('/user/miki/week10/hw10_1/MIDS-MLS-HW-10.txt')
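wordCounts = (hwText.flatMap(splitString)              # one record per token, case preserved
                    .filter(startsWithLower)           # keep tokens starting with a-z
                    .map(lambda x: (x, 1))             # emit (word, 1) pairs
                    .reduceByKey(lambda x, y: x + y))  # sum the counts per word
# Swap to (count, word) so that sortByKey orders by count
sortedCounts = wordCounts.map(lambda x: (x[1], x[0])).sortByKey(ascending=False).collect()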

print '{:5s}  {:10s}'.format('count', 'word')
print '--------------------'
for count, word in sortedCounts:
    print '{:5d}  {:10s}'.format(count, word)

count  word      
--------------------
   46  the       
   24  and       
   18  in        
   17  of        
   12  a         
   11  for       
   10  data      
   10  code      
    9  to        
    8  is        
    7  with      
    7  this      
    7  on        
    7  your      
    6  clusters  
    5  iterations
    5  from      
    5  as        
    5  regression
    4  words     
    4  following 
    4  model     
    4  using     
    4  each      
    4  x         
    4  set       
    4  linear    
    4  one       
    4  example   
    4  provided  
    3  report    
    3  list      
    3  findings  
    3  evaluation
    3  available 
    3  lazy      
    3  training  
    3  count     
    3  results   
    3  import    
    3  plots     
    3  or        
    3  plot      
    3  it        
    3  an        
    3  document  
    3  after     
    3  notebook  
    2  httpswwwdropboxcomsq85t0ytb9apggnhkmeansdatatxtdl0
    2  languages 
    2  homework  
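
Because this version does not lowercase the text first, capitalized occurrences are dropped rather than merged with their lowercase forms. That is consistent with `in` falling from 20 to 18 and `using` from 11 to 4. Tokens that start with a digit, such as `100`, are dropped as well, since `str.islower()` is false for digits. A small local check of the filter, using a made-up token list:

```python
def starts_with_lower(word):
    """True only when the first character is a lowercase letter."""
    return word[0].islower()

tokens = ['Using', 'using', 'In', 'in', '100', 'x', 'HW']  # made-up tokens
kept = [t for t in tokens if starts_with_lower(t)]
dropped = [t for t in tokens if not starts_with_lower(t)]
```

Since the tokenizer has already removed everything outside `A-Za-z0-9`, checking `islower()` on the first character is equivalent to checking for a-z here.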
   