Running Your Queries In Spark
---------------------------

You need to take the data from Foursquare and perform your analysis based on the question you chose.

In our example below, we do the following:
1. We read the files that our Foursquare client generates from the drive.
2. For each city, we get the trending venue categories and the number of people currently at those venues.
3. We add up the numbers for the same categories.

You can extend this into a web dashboard, or plots inside this notebook if you choose.

----------------------------------------------------------------------------------------------------------------------

In [1]:
# Install Java, Spark, Findspark and PySpark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

!pip install -q findspark
!pip install -q pyspark

# mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

# choose the default Java version
!sudo update-alternatives --config java

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive
There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                            Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-11-openjdk-amd64/bin/java     

#### Import the relevant modules

In [0]:
from pyspark import SparkConf,SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession

#### The code below deletes all the log files inside the foursquare_logs directory

In [0]:
import shutil
folder = "/content/gdrive/My Drive/thinkful_big_data/Colab Datasets/foursquare_logs"
for the_file in os.listdir(folder):
    file_path = os.path.join(folder, the_file)
    try:
        if os.path.isfile(file_path):
            os.unlink(file_path)
        elif os.path.isdir(file_path): shutil.rmtree(file_path)
    except Exception as e:
        print(e)

#### We now create a helper function that allows us to store the aggregate number for each category.

In [0]:
def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)
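
The helper has the shape that `DStream.updateStateByKey` expects: for each key it receives the list of new values from the current batch and the previous state, which is `None` the first time a key is seen. The sketch below imitates that calling pattern in plain Python so the behavior can be checked without Spark. `apply_batch` and the sample batches are hypothetical and exist only for illustration.

```python
def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)


def apply_batch(state, batch):
    """Hypothetical stand-in for updateStateByKey: fold one batch of
    (key, value) pairs into a dict of running totals."""
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    new_state = dict(state)
    for key, values in grouped.items():
        # state.get(key) is None for a key that has not been seen before
        new_state[key] = updateFunction(values, state.get(key))
    return new_state


# Made-up sample batches of (category, count) pairs
batch_1 = [("Pharmacy", 3), ("BBQ Joint", 2), ("Pharmacy", 1)]
batch_2 = [("BBQ Joint", 5)]

state = apply_batch({}, batch_1)
state = apply_batch(state, batch_2)
print(state)
```

One simplification to keep in mind: Spark also calls the update function for keys that received no new values in a batch (with an empty list), which this stand-in skips because the total would not change.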

In [0]:
# create spark configuration
conf = SparkConf()
conf.setAppName("FoursquareStreamApp")

# create spark context with the above configuration
sc = SparkContext.getOrCreate(conf=conf)
sc.setLogLevel("ERROR")

# create the Streaming Context from the above spark context with
# a batch interval of 10 seconds
ssc = StreamingContext(sc, 10)

# setting a checkpoint to allow RDD recovery
ssc.checkpoint("checkpoint_FoursquareApp")

# read data from drive
dataStream = ssc.textFileStream("/content/gdrive/My Drive/thinkful_big_data/Colab Datasets/foursquare_logs")
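
`textFileStream` watches the directory and generally only picks up files that appear after the stream has started, so the writer should make each file show up complete. A common pattern is to write the file somewhere else and then rename it into the watched directory. The sketch below shows that pattern; `drop_log_file` and the file names are hypothetical, and on a mounted Google Drive the rename may not be truly atomic.

```python
import os
import tempfile


def drop_log_file(watched_dir, filename, lines):
    """Hypothetical helper: write lines to a temporary file, then rename it
    into watched_dir so a reader never sees a half-written file."""
    os.makedirs(watched_dir, exist_ok=True)
    # Stage the file next to the watched directory so the rename stays on
    # the same filesystem
    staging_dir = os.path.dirname(os.path.abspath(watched_dir))
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    final_path = os.path.join(watched_dir, filename)
    os.replace(tmp_path, final_path)
    return final_path


# Demonstration in a throwaway directory with a made-up log line
with tempfile.TemporaryDirectory() as base:
    watched = os.path.join(base, "foursquare_logs")
    path = drop_log_file(watched, "batch_001.csv", ["CAVA,Mediterranean Restaurant"])
    with open(path) as handle:
        print(handle.read())
```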

#### Finally, we implement our primary workflow.

Once the workflow is defined, we start streaming with `ssc.start()`. The call to `ssc.awaitTermination()` then blocks the cell until the stream is stopped or the cell is interrupted.

In [8]:
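# Parse each line into a (venue, category) pair. Because the values are
# strings, reduceByKey concatenates the categories of duplicate venue names
# within a batch rather than summing counts.
explore_places = (
    dataStream
    .map(lambda line: line.split(","))
    .map(lambda fields: (fields[0], fields[1]))
    .reduceByKey(lambda a, b: a + b)
)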

explore_places.pprint()

# start the streaming computation
ssc.start()

# block until the stream is stopped or interrupted
ssc.awaitTermination()

-------------------------------------------
Time: 2020-03-19 19:45:00
-------------------------------------------

-------------------------------------------
Time: 2020-03-19 19:45:10
-------------------------------------------

-------------------------------------------
Time: 2020-03-19 19:45:20
-------------------------------------------

-------------------------------------------
Time: 2020-03-19 19:45:30
-------------------------------------------

-------------------------------------------
Time: 2020-03-19 19:45:40
-------------------------------------------
('Harmon Face Values', 'Pharmacy')
('Soon Beauty Lab West', 'Salon / Barbershop')
('CAVA', 'Mediterranean Restaurant')
('World Seido Karate Honbu', 'Martial Arts Dojo')
('Bite', 'Mediterranean Restaurant')
('Chick-Fil-A', 'Fast Food Restaurant')
('Redbird', 'American Restaurant')
('2nd Street Cigar Lounge', 'Smoke Shop')
('Oreno Yakiniku Japanese Bar-B-Cue', 'BBQ Joint')
('by CHLOE', 'Vegetarian / Vegan Restaurant')
...

KeyboardInterrupt: ignored
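
The output above lists (venue, category) pairs rather than totals per category. One way to get running totals, sketched here under the assumption that each line has the form `venue,category` as the printed pairs suggest, is to emit `(category, 1)` for every line and feed it through `updateStateByKey` with the helper defined earlier. `to_category_pair` and `build_category_counts` are hypothetical names, and this counts venues per category, not people.

```python
def to_category_pair(line):
    """Map 'venue,category' to (category, 1). Assumes the two-field layout
    seen in the output; venue names containing commas would need a real
    CSV parser."""
    fields = line.split(",")
    return (fields[1].strip(), 1)


def updateFunction(newValues, runningCount):
    # Same helper as defined earlier in the notebook
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)


def build_category_counts(stream):
    """Attach a running count per category to a DStream of text lines."""
    return stream.map(to_category_pair).updateStateByKey(updateFunction)


print(to_category_pair("CAVA,Mediterranean Restaurant"))
```

In the notebook this would be wired up with `build_category_counts(dataStream).pprint()` before `ssc.start()`. `updateStateByKey` needs checkpointing, which the `ssc.checkpoint(...)` call already enables.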

#### After the stream is interrupted, you may need to stop the streaming context (and, by default, the underlying Spark context) by running the following cell:

In [0]:
ssc.stop()
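
If interrupting the cell by hand is inconvenient, the stream can instead be run for a fixed time and then stopped. The sketch below is one possible wrapper, not part of the original notebook; `run_for` is a hypothetical name, while `awaitTerminationOrTimeout` and the `stopSparkContext` / `stopGraceFully` arguments are part of the PySpark `StreamingContext` API.

```python
def run_for(ssc, seconds):
    """Start the stream, wait up to `seconds`, then stop it."""
    ssc.start()
    try:
        ssc.awaitTerminationOrTimeout(seconds)
    finally:
        # Keep the SparkContext alive for later cells and let batches that
        # are already in flight finish before shutting down
        ssc.stop(stopSparkContext=False, stopGraceFully=True)
```

Calling `run_for(ssc, 60)` in place of the `ssc.start()` / `ssc.awaitTermination()` pair would process about a minute of batches. A stopped `StreamingContext` cannot be restarted, so a new one has to be created before streaming again.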