# CS494 - Colab
## Wordcount in Spark

### Setup

Let's set up Spark in your Colab environment. Run the cell below!

In [None]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

openjdk-8-jdk-headless is already the newest version (8u252-b09-1~18.04).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.
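
Spark runs on the JVM, so creating a session fails if Java is missing or if `JAVA_HOME` points at a directory that does not exist. A minimal sanity check is sketched below; `check_java` is a hypothetical helper name, not part of PySpark, and it only inspects the environment without starting Spark.

```python
import os
import shutil

def check_java(java_home=None):
    """Report whether a JVM looks reachable (hypothetical helper)."""
    java_home = java_home or os.environ.get("JAVA_HOME", "")
    return {
        "java_home": java_home,
        # True only if the directory that JAVA_HOME names actually exists
        "java_home_exists": bool(java_home) and os.path.isdir(java_home),
        # Full path of a `java` executable found on PATH, or None
        "java_on_path": shutil.which("java"),
    }

print(check_java())
```

If `java_home_exists` is `False`, the path assigned to `JAVA_HOME` probably does not match the installed JDK.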


Now we authenticate a Google Drive client to download the file we will be processing in our Spark job.

**Make sure to follow the interactive instructions.**

In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
# use file_id rather than id, which would shadow the Python built-in
file_id = '1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('pg100.txt')
#downloaded.GetContentFile('test.txt')

If you executed the cells above, you should be able to see the file *pg100.txt* under the "Files" tab on the left panel.
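
The same check can be done from code instead of the file browser. The sketch below assumes the file was saved as `pg100.txt` in the current working directory; `describe_file` is a hypothetical helper name.

```python
import os

def describe_file(path):
    """Return (exists, size_in_bytes); the size is 0 if the file is missing."""
    if not os.path.isfile(path):
        return (False, 0)
    return (True, os.path.getsize(path))

exists, size = describe_file("pg100.txt")
print(f"pg100.txt present: {exists}, size: {size} bytes")
```

A missing file or a size of 0 bytes suggests the download cell did not complete.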

### Your task

If the setup stage ran successfully, you are ready to work on the *pg100.txt* file, which contains a copy of the complete works of Shakespeare.

Write a Spark application that outputs the number of words that start with each letter. That is, for every letter, count the total number of (non-unique) words that start with that letter. In your implementation, **ignore the letter case**, i.e., treat all words as lower case. You can also ignore all words **starting** with a non-alphabetic character.
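
To make the counting rule concrete, here is a small pure-Python illustration on a made-up sentence (no Spark involved). It assumes that "letter" means the ASCII letters a to z, which is one reasonable reading of the task.

```python
import string
from collections import Counter

sample = "The tempest, 'tis 2 tall ships and An apple"  # made-up sample text

# Key every word on its lowercased first character; skip words whose
# first character is not an ASCII letter ('tis and 2 in this sample).
counts = Counter(
    w[0].lower() for w in sample.split() if w[0].lower() in string.ascii_lowercase
)

print(sorted(counts.items()))  # [('a', 3), ('s', 1), ('t', 3)]
```

"The" and "An" are counted under `t` and `a` because case is ignored, while "'tis" and "2" are dropped because they start with a non-alphabetic character.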

In [None]:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext
import pyspark
import pandas as pd
import sys
from pyspark.conf import SparkConf  # we add a new library
from operator import add
import random

# create the Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()

# get the SparkContext that belongs to the session
sc = spark.sparkContext



In [None]:
# YOUR

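# read the text file into an RDD with one element per line
text = sc.textFile("/content/pg100.txt")

# flatMap splits every line into words and flattens the result into a single RDD;
# split() with no argument splits on any run of whitespace, so no empty strings are produced
words = text.flatMap(lambda line: line.split())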


# lowercase a word and strip punctuation characters from it
def lower_clean_str(x):
  punc = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~- '
  lowercased_str = x.lower()
  for ch in punc:
    lowercased_str = lowercased_str.replace(ch, '')
  return lowercased_str

# tokens made only of punctuation become empty after cleaning, so drop them
wordsCount = words.map(lower_clean_str).filter(lambda w: w != '')

wordsCount.take(20)
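
Two tokenization details are easy to get wrong: the choice of delimiter and the way punctuation is removed. The pure-Python sketch below, on a made-up line, compares splitting on a single space with splitting on whitespace runs, and shows `str.translate` as an alternative to a character-by-character `replace` loop. The helper name `clean` is hypothetical; `string.punctuation` contains the same characters as `punc` except the space, which `split()` already removes.

```python
import string

line = "  To be,  or not to be"  # made-up sample line

on_space = line.split(" ")   # ['', '', 'To', 'be,', '', 'or', 'not', 'to', 'be']
on_whitespace = line.split() # ['To', 'be,', 'or', 'not', 'to', 'be']

# A translation table that deletes every ASCII punctuation character
strip_punct = str.maketrans("", "", string.punctuation)

def clean(word):
    """Lowercase a token and delete ASCII punctuation (hypothetical helper)."""
    return word.lower().translate(strip_punct)

tokens = [clean(w) for w in on_whitespace]
print(tokens)  # ['to', 'be', 'or', 'not', 'to', 'be']
```

Splitting on a single space keeps empty strings wherever spaces repeat, and those empty strings would otherwise be counted as a "word".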


In [None]:
# pair every cleaned word with a count of 1
wordTotal = wordsCount.map(lambda word: (word, 1))
# uncomment to inspect the pairs (punctuation has already been removed)
#wordTotal.take(3)
#wordTotal.collect()

# count how many times each (word, 1) pair occurs; the result is a dict returned to the driver
wordTotal.countByValue()
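
`countByValue()` is an action: it returns a dictionary to the driver that maps each distinct element to its number of occurrences. When the elements are `(word, 1)` pairs, the dictionary keys are the pairs themselves. The local analogue below uses `collections.Counter` on made-up tokens to show the shape of the result; it is an illustration, not Spark code.

```python
from collections import Counter

tokens = ["to", "be", "or", "not", "to", "be"]  # made-up sample data

# Counting the raw tokens: the keys are words
by_word = Counter(tokens)

# Counting (word, 1) pairs: the keys are tuples, the counts are identical
by_pair = Counter((w, 1) for w in tokens)

print(by_word["to"], by_pair[("to", 1)])  # 2 2
```

Because the whole dictionary is brought to the driver, this approach suits a vocabulary that fits in memory.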

In [None]:
# total occurrences of each distinct word, sorted alphabetically by word
distinctWordsCount = wordTotal.reduceByKey(lambda x, y: x + y).sortByKey()
#distinctWordsCount.take(20)

# swap each pair to (count, word) so that sortByKey orders by frequency
sortWordsCount = distinctWordsCount.map(lambda x: (x[1], x[0]))
# show the 10 most frequent words (False = descending order)
sortWordsCount.sortByKey(False).take(10)
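
An alternative to swapping keys and sorting the whole RDD is `takeOrdered`, which returns only the first `n` elements under a given ordering. The sketch below assumes its argument is an RDD of `(word, count)` tuples such as `distinctWordsCount`; `top_words` is a hypothetical helper name.

```python
def top_words(counts_rdd, n=10):
    """Return the n (word, count) pairs with the largest counts.

    Assumes counts_rdd is an RDD of (word, count) tuples, for example
    the output of reduceByKey.
    """
    # takeOrdered sorts ascending by key, so negate the count to get descending order
    return counts_rdd.takeOrdered(n, key=lambda kv: -kv[1])
```

Calling `top_words(distinctWordsCount)` should give the same words as the sorted version, in `(word, count)` order rather than `(count, word)`.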


In [None]:
# Count the words that start with a given letter using startswith.
# Each word is lowercased first so that the match ignores case.
#For letter 'A'
countAletter = words.filter(lambda x: x.lower().startswith("a"))
countAletter.count()


In [None]:
#For letter B
countBletter = words.filter(lambda x: x.lower().startswith("b"))
countBletter.count()

In [None]:
#For letter C
countCletter = words.filter(lambda x: x.lower().startswith("c"))
countCletter.count()

In [None]:
#For letter D
countDletter = words.filter(lambda x: x.lower().startswith("d"))
countDletter.count()

In [None]:
#For letter E
countEletter = words.filter(lambda x: x.lower().startswith("e"))
countEletter.count()

In [None]:
#For letter F
countFletter = words.filter(lambda x: x.lower().startswith("f"))
countFletter.count()

In [None]:
#For letter G
countGletter = words.filter(lambda x: x.lower().startswith("g"))
countGletter.count()

In [None]:
#For letter H
countHletter = words.filter(lambda x: x.lower().startswith("h"))
countHletter.count()

In [None]:
#For letter I
countIletter = words.filter(lambda x: x.lower().startswith("i"))
countIletter.count()

In [None]:
#For letter J
countJletter = words.filter(lambda x: x.lower().startswith("j"))
countJletter.count()


In [None]:
#For letter K
countKletter = words.filter(lambda x: x.lower().startswith("k"))
countKletter.count()


In [None]:
#For letter L
countLletter = words.filter(lambda x: x.lower().startswith("l"))
countLletter.count()

In [None]:
#For letter M
countMletter = words.filter(lambda x: x.lower().startswith("m"))
countMletter.count()

In [None]:
#For letter N
countNletter = words.filter(lambda x: x.lower().startswith("n"))
countNletter.count()

In [None]:
#For letter O
countOletter = words.filter(lambda x: x.lower().startswith("o"))
countOletter.count()

In [None]:
#For letter P
countPletter = words.filter(lambda x: x.lower().startswith("p"))
countPletter.count()

In [None]:
#For letter Q
countQletter = words.filter(lambda x: x.lower().startswith("q"))
countQletter.count()

In [None]:
#For letter R
countRletter = words.filter(lambda x: x.lower().startswith("r"))
countRletter.count()

In [None]:
#For letter S
countSletter = words.filter(lambda x: x.lower().startswith("s"))
countSletter.count()


In [None]:
#For letter T
countTletter = words.filter(lambda x: x.lower().startswith("t"))
countTletter.count()


In [None]:
#For letter U
countUletter = words.filter(lambda x: x.lower().startswith("u"))
countUletter.count()


In [None]:
#For letter V
countVletter = words.filter(lambda x: x.lower().startswith("v"))
countVletter.count()

In [None]:
#For letter W
countWletter = words.filter(lambda x: x.lower().startswith("w"))
countWletter.count()

In [None]:
#For letter X
countXletter = words.filter(lambda x: x.lower().startswith("x"))
countXletter.count()

In [None]:
#For letter Y
countYletter = words.filter(lambda x: x.lower().startswith("y"))
countYletter.count()



In [None]:
#For letter Z
countZletter = words.filter(lambda x: x.lower().startswith("z"))
countZletter.count()
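
The 26 cells above each run a separate Spark job over the whole file. The same table can be produced in a single pass by keying every word on its first letter and reducing by key. The sketch below assumes `words` is the RDD of tokens built earlier and that "letter" means the ASCII letters a to z; `first_letter` and `count_by_first_letter` are hypothetical helper names.

```python
import string
from operator import add

def first_letter(word):
    """Return the lowercased first character if it is a-z, else None."""
    if not word:
        return None
    ch = word[0].lower()
    return ch if ch in string.ascii_lowercase else None

def count_by_first_letter(words_rdd):
    """Return a sorted list of (letter, count) pairs from an RDD of strings."""
    return (
        words_rdd.map(first_letter)                 # word -> letter or None
        .filter(lambda letter: letter is not None)  # drop non-alphabetic starts
        .map(lambda letter: (letter, 1))            # letter -> (letter, 1)
        .reduceByKey(add)                           # sum the ones per letter
        .sortByKey()                                # alphabetical order
        .collect()                                  # at most 26 pairs reach the driver
    )
```

A possible usage is `for letter, n in count_by_first_letter(words): print(letter, n)`, which prints one line per letter that occurs in the text.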

Once you have obtained the desired results, **head over to Gradescope and submit your solution for this Colab**!