# CS245 - Lab 1
## Word Count in Spark

### Setup

Let's set up Spark on your Kaggle environment.  Run the cell below!

In [1]:
!pip install pyspark
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
[K     |████████████████████████████████| 281.4 MB 34 kB/s 
[?25hCollecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
[K     |████████████████████████████████| 198 kB 29.0 MB/s 
[?25hBuilding wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... [?25l[?25hdone
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=7aee1535b89f332eeac87394850574347991997932151e66e1523991ac7ca5b4
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove

### Your task

You are ready to work on the *pg100.txt* file which contains a copy of the complete works of Shakespeare. The file locates in `../input/pg100/pg100.txt`.

Write a Spark application which outputs the number of words that start with each letter. This means that for every letter, we want to count the total number of (non-unique) words that start with a specific letter.

In your implementation, **ignore the letter case**, i.e., consider all words as lower case. Also, you can ignore all words that **start** with a non-alphabetic character. You should output word counts for the **entire document**, inclusive of the title, author, and the main texts. If you encounter words broken as a result of new lines, e.g. "pro-ject" where the segment after the dash sign is on a new line, no special processing is needed and you can safely consider it as two words.

Your outputs will be graded on a range -- if your differences from the ground-truths are within an error threshold of 5, you'll be considered correct.

In [2]:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext
import pandas as pd

# create the Spark Session
spark = SparkSession.builder.getOrCreate()

# create the Spark Context
sc = spark.sparkContext

In [3]:
# YOUR CODE HERE
txt = spark.read.text("input/pg100.txt")
txt.show()

+--------------------+
|               value|
+--------------------+
|The Project Guten...|
| William Shakespeare|
|                    |
|This eBook is for...|
|almost no restric...|
|re-use it under t...|
|with this eBook o...|
|                    |
|** This is a COPY...|
|**     Please fol...|
|                    |
|Title: The Comple...|
|                    |
|Author: William S...|
|                    |
|Posting Date: Sep...|
|Release Date: Jan...|
|                    |
|   Language: English|
|                    |
+--------------------+
only showing top 20 rows



In [4]:
txt = txt.select(explode(split(txt.value, '\s')).alias('value'))
txt = txt.select(lower(txt.value).alias('value')).filter(txt.value.rlike('^[a-z]')).dropna()

In [5]:
txt = txt.withColumn('letter', substring(txt.value, 1, 1)).groupby('letter').count()
txt.sort('letter').show(26)

+------+------+
|letter| count|
+------+------+
|     a| 63748|
|     b| 34561|
|     c| 23496|
|     d| 23531|
|     e| 10431|
|     f| 28819|
|     g| 14703|
|     h| 50511|
|     i| 32292|
|     j|  1593|
|     k|  5789|
|     l| 22353|
|     m| 46233|
|     n| 21813|
|     o| 34201|
|     p| 19344|
|     q|  1332|
|     r| 10400|
|     s| 52643|
|     t|101603|
|     u|  7667|
|     v|  4131|
|     w| 44981|
|     y| 21879|
|     z|    53|
+------+------+



Once you obtained the desired results, **save a version in Kaggle and share your notebook**!