# CS494 - Colab
## Wordcount in Spark

### Setup

Let's set up Spark in your Colab environment. Run the cell below!

In [1]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/f0/26/198fc8c0b98580f617cb03cb298c6056587b8f0447e20fa40c5b634ced77/pyspark-3.0.1.tar.gz (204.2MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.1-py2.py3-none-any.whl size=204612242 sha256=a93ae532253c6e2328cc87325f825221120e05b4840c10aad1c38753e62fef62
  Stored in directory: /root/.cache/pip/wheels/5e/bd/07/031766ca628adec8435bb40f0bd83bb676ce65ff4007f8e73f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.1

Now we authenticate a Google Drive client to download the file we will be processing in our Spark job.

**Make sure to follow the interactive instructions.**

In [2]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [3]:
id='1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa'
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('pg100.txt')
#downloaded.GetContentFile('test.txt')

If you executed the cells above, you should be able to see the file *pg100.txt* under the "Files" tab on the left panel.

### Your task

If the setup stage ran successfully, you are ready to work on the *pg100.txt* file, which contains a copy of the complete works of Shakespeare.

Write a Spark application that outputs the number of words that start with each letter. That is, for every letter, count the total number of (non-unique) words that start with that letter. In your implementation, **ignore letter case**, i.e., treat all words as lowercase. You can also ignore all words **starting** with a non-alphabetic character.
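To make the counting rule concrete, here is a minimal plain-Python sketch on a made-up sentence (the `sample` string is an assumption for illustration, not text from *pg100.txt*). It uses `collections.Counter` as a local stand-in for the distributed count and reads the task literally: a word is skipped when its first character is not an ASCII letter.

```python
from collections import Counter
import string

# Hypothetical sample text, chosen only to illustrate the rule.
sample = "To be, or not to be: that is the Question. 3 times!"

# Keep words whose first character is an ASCII letter, then count
# the lowercased first letter of each remaining word.
counts = Counter(
    w[0].lower() for w in sample.split() if w[0] in string.ascii_letters
)

for letter in sorted(counts):
    print(letter, counts[letter])
```

Here "To", "to", "that", "the" and "times!" all land under `t`, while "3" is ignored because it starts with a digit.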

In [4]:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext
import pyspark
import pandas as pd
import sys
from pyspark.conf import SparkConf  # Spark configuration class
from operator import add
import random

# create the Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()

# get the SparkContext that backs the session
sc = spark.sparkContext



In [5]:
# YOUR

# read the text file into an RDD of lines
text = sc.textFile("/content/pg100.txt")

# flatMap splits every line on single spaces and flattens the results into one RDD of words
# (blank lines and repeated spaces produce empty strings)
words = text.flatMap(lambda line: line.split(" "))


# lowercase a word and strip punctuation characters from it
def lower_clean_str(x):
  punc = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~- '
  lowercased_str = x.lower()
  for ch in punc:
    lowercased_str = lowercased_str.replace(ch,'')

  return lowercased_str

# despite its name, wordsCount holds the cleaned words, not counts
wordsCount = words.map(lower_clean_str)

wordsCount.take(20)


['the',
 'project',
 'gutenberg',
 'ebook',
 'of',
 'the',
 'complete',
 'works',
 'of',
 'william',
 'shakespeare',
 'by',
 'william',
 'shakespeare',
 '',
 'this',
 'ebook',
 'is',
 'for',
 'the']
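The empty string `''` in the output above comes from splitting on a single space. A small plain-Python sketch (the `line` value is a made-up example) shows how `str.split(" ")` differs from `str.split()` with no argument:

```python
# Hypothetical line with leading and repeated spaces.
line = "  by William  Shakespeare"

# Splitting on exactly one space keeps an empty string for every extra space.
tokens_single_space = line.split(" ")

# Splitting with no argument collapses runs of whitespace and drops empties.
tokens_whitespace = line.split()

print(tokens_single_space)
print(tokens_whitespace)

# A blank line behaves differently too: one empty token versus none.
print("".split(" "), "".split())
```

Passing `lambda line: line.split()` to `flatMap` is one way to keep these empty tokens out of the RDD; filtering them out afterwards is another.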

In [6]:
# pair every cleaned word with a count of 1
wordTotal = wordsCount.map(lambda word:(word, 1))
#wordTotal.take(3)
# uncomment to show all (word, 1) pairs (punctuation already removed)
#wordTotal.collect()

# count how many times each (word, 1) pair occurs
wordTotal.countByValue()

defaultdict(int,
            {('the', 1): 27825,
             ('project', 1): 329,
             ('gutenberg', 1): 257,
             ('ebook', 1): 16,
             ('of', 1): 18289,
             ('complete', 1): 248,
             ('works', 1): 284,
             ('william', 1): 351,
             ('shakespeare', 1): 272,
             ('by', 1): 4426,
             ('', 1): 506966,
             ('this', 1): 6894,
             ('is', 1): 9621,
             ('for', 1): 8261,
             ('use', 1): 560,
             ('anyone', 1): 7,
             ('anywhere', 1): 8,
             ('at', 1): 2521,
             ('no', 1): 3807,
             ('cost', 1): 51,
             ('and', 1): 26791,
             ('with', 1): 8046,
             ('almost', 1): 163,
             ('restrictions', 1): 2,
             ('whatsoever', 1): 17,
             ('you', 1): 13716,
             ('may', 1): 1880,
             ('copy', 1): 27,
             ('it', 1): 7703,
             ('give', 1): 1335,
             ('awa
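The keys in the output above are `(word, 1)` tuples because `countByValue()` counts each distinct element of the RDD, and the elements of `wordTotal` are pairs. The sketch below mimics this in plain Python, using `collections.Counter` as an assumed local analogue and a made-up word list:

```python
from collections import Counter

# Hypothetical stand-in for the cleaned words.
words_sample = ["the", "project", "the"]

# Counting (word, 1) pairs gives tuple keys, as in the output above.
pair_counts = Counter((w, 1) for w in words_sample)

# Counting the words themselves gives plain string keys.
word_counts = Counter(words_sample)

print(dict(pair_counts))
print(dict(word_counts))
```

By the same logic, calling `countByValue()` on `wordsCount` directly would return the counts keyed by word, without the intermediate pairing step.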

In [7]:
# total occurrences of each distinct word, sorted alphabetically by word
distinctWordsCount = wordTotal.reduceByKey(lambda x,y:(x+y)).sortByKey()
#distinctWordsCount.take(20)

# swap each pair to (count, word) so that sortByKey orders by frequency
sortWordsCount = distinctWordsCount.map(lambda x: (x[1], x[0]))
# show the 10 most frequent words (the empty string is not filtered out)
sortWordsCount.sortByKey(False).take(10)


[(506966, ''),
 (27825, 'the'),
 (26791, 'and'),
 (20681, 'i'),
 (19261, 'to'),
 (18289, 'of'),
 (14667, 'a'),
 (13716, 'you'),
 (12481, 'my'),
 (11135, 'that')]

In [8]:
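# Count the words that start with a given letter using startswith
# Note: this filters the raw `words` RDD, so the match is case-sensitive;
# filtering `wordsCount` (the lowercased words) would also count capitalized words
# For letter 'A'
countAletter = words.filter(lambda x: x.startswith("a"))
output = ("A:")
countAletter.count()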


63748

In [None]:
#For letter B
countBletter = words.filter(lambda x: x.startswith("b"))
countBletter.count()

In [9]:
#For letter C
countCletter = words.filter(lambda x: x.startswith("c"))
countCletter.count()

23496

In [10]:
#For letter D
countDletter = words.filter(lambda x: x.startswith("d"))
countDletter.count()

23531

In [12]:
#For letter E
countEletter = words.filter(lambda x: x.startswith("e"))
countEletter.count()

10431

In [11]:
#For letter F
countFletter = words.filter(lambda x: x.startswith("f"))
countFletter.count()

28819

In [13]:
#For letter G
countGletter = words.filter(lambda x: x.startswith("g"))
countGletter.count()

14703

In [16]:
#For letter H
countHletter = words.filter(lambda x: x.startswith("h"))
countHletter.count()

50511

In [15]:
#For letter I
countIletter = words.filter(lambda x: x.startswith("i"))
countIletter.count()

32292

In [14]:
#For letter J
countJletter = words.filter(lambda x: x.startswith("j"))
countJletter.count()


1593

In [17]:
#For letter K
countKletter = words.filter(lambda x: x.startswith("k"))
countKletter.count()


5789

In [18]:
#For letter L
countLletter = words.filter(lambda x: x.startswith("l"))
countLletter.count()

22353

In [20]:
#For letter M
countMletter = words.filter(lambda x: x.startswith("m"))
countMletter.count()

46233

In [19]:
#For letter N
countNletter = words.filter(lambda x: x.startswith("n"))
countNletter.count()

21813

In [21]:
#For letter O
countOletter = words.filter(lambda x: x.startswith("o"))
countOletter.count()

34201

In [22]:
#For letter P
countPletter = words.filter(lambda x: x.startswith("p"))
countPletter.count()

19344

In [24]:
#For letter Q
countQletter = words.filter(lambda x: x.startswith("q"))
countQletter.count()

1332

In [23]:
#For letter R
countRletter = words.filter(lambda x: x.startswith("r"))
countRletter.count()

10400

In [25]:
#For letter S
countSletter = words.filter(lambda x: x.startswith("s"))
countSletter.count()


52643

In [26]:
#For letter T
countTletter = words.filter(lambda x: x.startswith("t"))
countTletter.count()


101603

In [27]:
#For letter U
countUletter = words.filter(lambda x: x.startswith("u"))
countUletter.count()


7667

In [28]:
#For letter V
countVletter = words.filter(lambda x: x.startswith("v"))
countVletter.count()

4131

In [30]:
#For letter W
countWletter = words.filter(lambda x: x.startswith("w"))
countWletter.count()

44981

In [29]:
#For letter X
countXletter = words.filter(lambda x: x.startswith("x"))
countXletter.count()

0

In [31]:
#For letter Y
countYletter = words.filter(lambda x: x.startswith("y"))
countYletter.count()



21879

In [32]:
#For letter Z
countZletter = words.filter(lambda x: x.startswith("z"))
countZletter.count()

53
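The 26 per-letter cells above can, in principle, be folded into a single pass over the data. The following is a minimal sketch under stated assumptions, not the reference solution: `first_letter` and `count_by_first_letter` are hypothetical helper names, and `sc` is assumed to be the SparkContext created earlier in the notebook. Because it lowercases the first character, its counts will generally differ from the cells above, which filter the raw `words` RDD case-sensitively.

```python
import string
from operator import add


def first_letter(word):
    """Return the lowercased first letter of `word`, or None when the
    word is empty or starts with a non-alphabetic (non-ASCII-letter) character."""
    if word and word[0] in string.ascii_letters:
        return word[0].lower()
    return None


def count_by_first_letter(sc, path):
    """Count the words in the text file at `path` by their first letter.

    `sc` is assumed to be an existing SparkContext. Returns a list of
    (letter, count) pairs sorted alphabetically by letter.
    """
    # split() with no argument drops the empty tokens that split(" ") produces
    words = sc.textFile(path).flatMap(lambda line: line.split())
    # map each word to its first letter and discard words that have none
    letters = words.map(first_letter).filter(lambda ch: ch is not None)
    # classic word-count pattern, keyed by letter instead of by word
    return (
        letters.map(lambda ch: (ch, 1))
        .reduceByKey(add)
        .sortByKey()
        .collect()
    )
```

In the notebook this could be called as `count_by_first_letter(sc, "/content/pg100.txt")`. Note that a word such as `'tis` is skipped here because it starts with an apostrophe; stripping punctuation first, as `lower_clean_str` does, is a different reading of the task.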

Once you have obtained the desired results, **head over to Gradescope and submit your solution for this Colab**!