# CS 5540 - Group 1 - Apache Spark Assignment

This is the submission document for our programming assignment on Apache Spark.

We wrote the submission as a Jupyter notebook and exported it to PDF. We can provide the GitHub repo or the original Jupyter notebook on request.

In [16]:
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql.functions import col, split, explode, lower, trim

# create (or reuse) a Spark session that uses the Hive catalog implementation
spark = (
    SparkSession.builder
    .appName("my-spark-app")
    .config("spark.sql.catalogImplementation", "hive")
    .getOrCreate()
)
# get the SparkContext that is already running for this session
sc = SparkContext.getOrCreate()

## Table of Contents

1. [Team Members](#team-members)

## Team Members

This assignment was completed by the following team members (Group 1):

- Ayushman Das
- Koti Paruchuri
- Odai Athamneh
- Scott Brunton
- Varshith Thota

## Question 1

Question 1 is as follows: 

> Given file (`/data/shakespeare-1.txt`) contains the scenes from Shakespeare’s plays. You may use this file as an input dataset to identify the following notes for a student of Classical Drama.

In [17]:
# tokenize input text file and clean the word column
df = spark.read.text("data/shakespeare-1.txt")

df = df.select(explode(split(col("value"), " ")).alias("word"))
df = df.select(lower(trim(col("word"))).alias("word"))

df.show(5, truncate=False)

+-----+
|word |
+-----+
|this |
|is   |
|the  |
|100th|
|etext|
+-----+
only showing top 5 rows
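
The cell above splits each line on a single space, so punctuation stays attached to a word (`france,` rather than `france`) and consecutive spaces yield empty tokens. The following is a minimal plain-Python sketch, not the notebook's method, that contrasts this behavior with a regex-based tokenizer. The function names `naive_tokenize` and `regex_tokenize` are hypothetical, and the sample sentence is made up for illustration.

```python
import re
from typing import List


def naive_tokenize(line: str) -> List[str]:
    """Mirror the notebook's approach: split on one space, trim, lowercase."""
    return [word.strip().lower() for word in line.split(" ")]


def regex_tokenize(line: str) -> List[str]:
    """Keep only runs of letters, digits, and apostrophes as words."""
    return re.findall(r"[a-z0-9']+", line.lower())


# made-up sample line (note the comma, the period, and the double space)
sample = "France, and  Greece."

print(naive_tokenize(sample))   # punctuation and an empty token remain
print(regex_tokenize(sample))   # clean words only
```

In PySpark, a comparable cleanup could likely be done by applying `regexp_replace` to the `word` column before any matching, although doing so would change the counts reported later in this document.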



### Question 1.1

The question reads as follows:

> How many different countries are mentioned in the whole file? (Regardless of how many times a single country is mentioned, this country only contributes as a single entry). 

To address this question, we need a dataset of country names. There are many possible approaches to this problem, but we decided to use a dataset from [Kaggle](https://www.kaggle.com/datasets/fernandol/countries-of-the-world) that is sourced from the US government. This CSV file covers 227 present-day countries and includes each country's name, population, and other attributes. We will use the `Country` column as our list of country names.

The caveat to this approach is that it will miss some countries mentioned in the text, such as:
- Countries that no longer exist
- Countries that are misspelled in the original Shakespearean text
- Countries where the name or spelling has changed over time

Addressing this issue is beyond the scope of this assignment and would likely require some degree of manual curation.

In [18]:
# load countries dataframe and clean the Country column
countries = spark.read.csv("data/countries.csv", header=True)
countries = countries.select(lower(trim("Country")).alias("Country"))

countries.show(5, truncate=False)

+--------------+
|Country       |
+--------------+
|afghanistan   |
|albania       |
|algeria       |
|american samoa|
|andorra       |
+--------------+
only showing top 5 rows
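
The sample output above includes `american samoa`, a two-word name. Because the text is tokenized one word at a time, a name like this can never equal a single token, so a word-level match cannot find it. Below is a minimal plain-Python sketch of phrase-level matching, assuming the text is small enough to hold in memory. The helpers `normalize` and `countries_in_text` are hypothetical and are not part of the notebook.

```python
import re
from typing import Iterable, Set


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and join words with single spaces."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def countries_in_text(text: str, country_names: Iterable[str]) -> Set[str]:
    """Return the country names that occur in the text as whole phrases."""
    # pad with spaces so that matches must start and end on word boundaries
    padded = f" {normalize(text)} "
    found = set()
    for name in country_names:
        phrase = normalize(name)
        if phrase and f" {phrase} " in padded:
            found.add(name)
    return found


# made-up sample sentence and a short hypothetical list of names
sample_text = "He sailed to American Samoa, then France."
sample_names = ["american samoa", "france", "poland"]
print(sorted(countries_in_text(sample_text, sample_names)))
```

This scans the text once per country name, which is reasonable for a few hundred names but is only a sketch; a distributed version would need a different design.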



Now that we have our list of countries, we can use a `.join()` to find the countries mentioned in the Shakespearean text. We perform an inner join between the tokenized words and the `Country` column, matching on `word == Country`, which returns only the rows that have a match in both DataFrames. Because a country can appear many times in the text, we then select the `Country` column and apply `.distinct()` so that each country contributes a single entry. Finally, `.count()` gives the number of rows in the resulting DataFrame.

In [19]:
# inner join on word == Country (both columns were already lowercased and trimmed), then keep each matched country once
unique_countries = df.join(countries, df.word == countries.Country, "inner").select("Country").distinct()
unique_countries.show(5, truncate=False)

print("Number of unique countries in the text file: {}".format(unique_countries.count()))

+-------+
|Country|
+-------+
|greece |
|poland |
|austria|
|guinea |
|france |
+-------+
only showing top 5 rows

Number of unique countries in the text file: 20
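
The join, distinct, and count steps above can be sanity-checked on a small sample in plain Python, where the same logic reduces to a set intersection. This is only an illustrative sketch: the helper `distinct_matches` and the sample lists are hypothetical and are not taken from the dataset.

```python
from typing import Iterable, Set


def distinct_matches(words: Iterable[str], countries: Iterable[str]) -> Set[str]:
    """Return each country that appears among the words, counted once."""
    # a set intersection behaves like an inner join followed by distinct
    return set(words) & set(countries)


# made-up sample data: "france" appears twice among the words
sample_words = ["the", "king", "of", "france", "and", "france", "greece"]
sample_countries = ["france", "greece", "poland"]

matched = distinct_matches(sample_words, sample_countries)
print(sorted(matched))
print(len(matched))
```

Here `france` occurs twice in the sample words but contributes a single entry, which is the behavior Question 1.1 asks for.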
