# CS 5540 - Group 1 - Apache Spark Assignment

This is the submission document for our programming assignment on Apache Spark.

We wrote it as a Jupyter notebook and exported it to PDF for submission. The GitHub repo or the original notebook is available on request.

In [9]:
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql.functions import col, split, explode

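# create (or reuse) the Spark session used throughout this notebook
spark = SparkSession.builder\
    .appName("my-spark-app")\
    .config("spark.sql.catalogImplementation", "hive")\
    .getOrCreate()
# get a handle on the SparkContext that backs the session
sc = SparkContext.getOrCreate()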

## Table of Contents

1. [Team Members](#team-members)

## Team Members

This assignment was completed by the following team members (Group 1):

- Ayushman Das
- Koti Paruchuri
- Odai Athamneh
- Scott Brunton
- Varshith Thota

## Question 1

Question 1 is as follows: 

> Given file (`/data/shakespeare-1.txt`) contains the scenes from Shakespeare’s plays. You may use this file as an input dataset to identify the following notes for a student of Classical Drama.

In [16]:
# load the text file into a dataframe
df = spark.read.text("data/shakespeare-1.txt")

# split the text into words
df = df.select(split(col("value"), " ").alias("words"))

# explode the array of words into separate rows
df = df.select(explode(col("words")).alias("word"))

# show the first 5 words
df.show(5, truncate=False)


+-----+
|word |
+-----+
|This |
|is   |
|the  |
|100th|
|Etext|
+-----+
only showing top 5 rows



### Question 1.1

The question reads as follows:

> How many different countries are mentioned in the whole file? (Regardless of how many times a single country is mentioned, this country only contributes as a single entry).

To address this question, we need a dataset of country names. There are many possible approaches to this problem, but we decided to use a dataset from [Kaggle](https://www.kaggle.com/datasets/fernandol/countries-of-the-world) that is sourced from the US government. This CSV file covers 227 present-day countries, with each country's name, population, and other attributes. We will use the `Country` column as our list of country names.

The caveat to this approach is that the dataset may not contain all countries, such as: 
- Countries that no longer exist
- Countries that are misspelled in the original Shakespearean text
- Countries where the name or spelling has changed over time

Addressing this issue is beyond the scope of this assignment and would likely require some degree of manual curation.

In [18]:
countries = spark.read.csv("data/countries.csv", header=True)
countries = countries.select("Country")
countries.show(5, truncate=False)

+---------------+
|Country        |
+---------------+
|Afghanistan    |
|Albania        |
|Algeria        |
|American Samoa |
|Andorra        |
+---------------+
only showing top 5 rows



Now that we have our list of countries, we can use a simple `.join()` to find the countries mentioned in the Shakespearean text. We use the `Country` column as the key and perform an inner join against the `word` column, which returns a new DataFrame containing only the rows that match in both DataFrames. Because the join yields one row per matching word occurrence, we then need the distinct `Country` values before calling `.count()` to get the number of different countries.

In [24]:
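# inner join on exact string equality between each word and each country name
df.join(countries, df.word == countries.Country, "inner").show(5, truncate=False)
# sanity check: this condition ignores `countries`, so every "Egypt" row pairs with every country
df.join(countries, df.word == "Egypt", "inner").show(5, truncate=False)
# total number of word rows in the text
df.count()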

+----+-------+
|word|Country|
+----+-------+
+----+-------+

+-----+---------------+
|word |Country        |
+-----+---------------+
|Egypt|Afghanistan    |
|Egypt|Albania        |
|Egypt|Algeria        |
|Egypt|American Samoa |
|Egypt|Andorra        |
+-----+---------------+
only showing top 5 rows



1418390