# CS 5540 - Group 1 - Apache Spark Assignment

This is the submission document for our programming assignment on Apache Spark.

It was written as a Jupyter notebook and will be exported to PDF for submission. We can provide the GitHub repo or the original notebook on request.

In [23]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, explode, lower, trim

# Create (or reuse) a Spark session with Hive catalog support
spark = (
    SparkSession.builder
    .appName("my-spark-app")
    .config("spark.sql.catalogImplementation", "hive")
    .getOrCreate()
)
# Reuse the session's underlying SparkContext for any RDD-based work
sc = spark.sparkContext

## Table of Contents

1. [Team Members](#team-members)

## Team Members

This assignment was completed by the following team members (Group 1):

- Ayushman Das
- Koti Paruchuri
- Odai Athamneh
- Scott Brunton
- Varshith Thota

## Question 1

Question 1 is as follows: 

> Given file (`/data/shakespeare-1.txt`) contains the scenes from Shakespeare’s plays. You may use this file as an input dataset to identify the following notes for a student of Classical Drama.

In [24]:
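# read the text file: one row per line, in a single "value" column
df = spark.read.text("data/shakespeare-1.txt")

# split each line on single spaces and emit one row per token
df = df.select(explode(split(col("value"), " ")).alias("word"))
# normalize tokens: strip surrounding whitespace and lowercase
df = df.select(lower(trim(col("word"))).alias("word"))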

df.show(5, truncate=False)

+-----+
|word |
+-----+
|this |
|is   |
|the  |
|100th|
|etext|
+-----+
only showing top 5 rows
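
One thing worth keeping in mind about this tokenization: splitting on a single space leaves punctuation attached to words, and `trim` only strips surrounding spaces. The plain-Python sketch below mirrors the split, trim, and lowercase steps on a made-up sample line to show the effect. The sample text and the `tokenize` helper are illustrative assumptions, not part of the notebook, and the sketch runs without a Spark session.

```python
def tokenize(line):
    """Mirror the notebook's steps: split on single spaces, trim, lowercase."""
    return [token.strip(" ").lower() for token in line.split(" ")]


# Hypothetical input line (note the double space and the punctuation)
sample = "To France,  and to England."
tokens = tokenize(sample)

print(tokens)
print("france" in tokens)  # exact-match lookup, as an equality join would do
```

Tokens such as `france,` do not equal `france`, so an exact-equality match against a country list can miss mentions that are followed by punctuation. Splitting on a non-word pattern would be one way to address this, but it would change the counts reported in this document.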



### Question 1.1

The question reads as follows:

> How many different countries are mentioned in the whole file? (Regardless of how many times a single country is mentioned, this country only contributes as a single entry). 

To address this question, we need a dataset of country names. We are using the `country-list.csv` file provided by the professor. The file contains 211 entries. 

The caveat to this approach is that the list may not cover every country name that appears in the text, for example:
- Countries that no longer exist
- Countries that are misspelled in the original Shakespearean text
- Countries where the name or spelling has changed over time

Addressing this issue is beyond the scope of this assignment and would likely require some degree of manual curation.

In [25]:
# load countries dataframe and clean the country column
countries = spark.read.csv("data/country-list.csv", header=False)
countries = countries.select(lower(trim("_c0")).alias("country"))

countries.show(5, truncate=False)
countries.count()

+--------------+
|country       |
+--------------+
|afghanistan   |
|albania       |
|algeria       |
|american samoa|
|andorra       |
+--------------+
only showing top 5 rows



211

Now that we have our list of countries, we can use a simple `.join()` to find which countries are mentioned in the Shakespearean text. We join the `word` column of the text DataFrame to the `country` column of the country list with an inner join, which keeps only the rows that have a match in both DataFrames. Because a country can be mentioned many times, we apply `.distinct()` before `.count()` so that each country is counted only once.

In [27]:
# inner-join tokens to country names (both already lowercased and trimmed), then keep one row per country
unique_countries = df.join(countries, df.word == countries.country, "inner").select("country").distinct()
unique_countries.show(5, truncate=False)

print("Number of unique countries in the text file: {}".format(unique_countries.count()))

+-------+
|country|
+-------+
|greece |
|poland |
|austria|
|guinea |
|france |
+-------+
only showing top 5 rows

Number of unique countries in the text file: 22
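
The inner join followed by `.distinct()` amounts to a set intersection between the tokens and the country names. A minimal plain-Python sketch of that idea follows, using small made-up lists; the `words` and `country_names` values are illustrative assumptions, not the real data.

```python
# Hypothetical tokens, already lowercased and trimmed as in the notebook
words = ["france", "and", "england", "france", "american", "samoa"]
# Hypothetical subset of the country list
country_names = ["france", "england", "american samoa", "poland"]

# Inner join on equality: keep one entry per token that equals a country name
country_set = set(country_names)
matches = [w for w in words if w in country_set]

# distinct(): collapse repeated mentions to one entry per country
unique_matches = sorted(set(matches))

print(len(matches))     # mentions among the tokens
print(unique_matches)   # distinct countries mentioned
```

Note that `american samoa` never matches in this sketch: because the text is tokenized into single words, a multi-word country name cannot be found by an equality join on `word`.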


### Question 1.2

The question reads as follows:

> Compute the total number of times any country is mentioned. (This is different from Question 1.1, since in this calculation, if a country is mentioned three times, then it contributes three times).

The code below counts the mentions of each country; summing the `count` column gives the total. Note that `.show()` prints only the first 20 rows of a DataFrame by default, so we call `.show(1000)` to make sure every row is listed.

In [29]:
country_mentions = df.join(countries, df.word == countries.country, "inner").select("country").groupBy("country").count().orderBy("count", ascending=False)

country_mentions.show(1000, truncate=False)

+---------+-----+
|country  |count|
+---------+-----+
|france   |149  |
|england  |128  |
|scotland |24   |
|egypt    |15   |
|wales    |15   |
|italy    |12   |
|cyprus   |10   |
|denmark  |10   |
|greece   |6    |
|oman     |4    |
|norway   |3    |
|austria  |3    |
|syria    |3    |
|spain    |2    |
|poland   |1    |
|guinea   |1    |
|iceland  |1    |
|germany  |1    |
|palestine|1    |
|turkey   |1    |
|russia   |1    |
|armenia  |1    |
+---------+-----+
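
As a worked check of the single total that Question 1.2 asks for, the plain-Python sketch below adds up the per-country counts from the table above. The `counts` dictionary is transcribed by hand for illustration; in the notebook itself, an aggregate over `country_mentions` would presumably give the same figure.

```python
# Per-country mention counts, copied from the output table
counts = {
    "france": 149, "england": 128, "scotland": 24, "egypt": 15, "wales": 15,
    "italy": 12, "cyprus": 10, "denmark": 10, "greece": 6, "oman": 4,
    "norway": 3, "austria": 3, "syria": 3, "spain": 2, "poland": 1,
    "guinea": 1, "iceland": 1, "germany": 1, "palestine": 1, "turkey": 1,
    "russia": 1, "armenia": 1,
}

# Each mention contributes once, so the total is the sum of the counts
total_mentions = sum(counts.values())

print(len(counts))       # 22 distinct countries, matching Question 1.1
print(total_mentions)    # 392 mentions in total
```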

