<b>This demonstrates how to run Spark locally</b><br>
When using the local master, you can read files from the local filesystem.
In this case, version7.txt is a local file, so it is not copied to other nodes.

You can verify this by running
<code>hdfs dfs -ls</code>
on other nodes.<br>

Reference: <a href="https://spark.apache.org/docs/latest/configuration.html">Spark Configuration</a>

In [3]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext

You can change the number of threads assigned to the local master with <code>local[N]</code>. The Spark configuration docs show this in Scala:<br>
<code>
val conf = new SparkConf()
             .setMaster("local[2]")
             .setAppName("CountingSheep")
val sc = new SparkContext(conf)
</code><br>
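As a rough illustration of what the master string means, the sketch below maps a local master URL to the number of worker threads it implies: `local` runs one thread, `local[K]` runs K threads, and `local[*]` uses as many threads as the machine has logical cores. The helper `local_threads` is hypothetical and not part of PySpark; it ignores the `local[K,F]` form that also sets a failure limit.

```python
import os
import re

def local_threads(master):
    """Return the worker-thread count implied by a local master URL, or None.

    Hypothetical helper for illustration only; Spark parses the master itself.
    """
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\d+|\*)\]", master)
    if match is None:
        return None  # not a local master, e.g. spark://node1:7077
    value = match.group(1)
    # "*" means one thread per logical core on this machine
    return os.cpu_count() if value == "*" else int(value)

print(local_threads("local[2]"))            # 2
print(local_threads("local"))               # 1
print(local_threads("spark://node1:7077"))  # None
```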

Without these settings, the master comes from the <code>kernel.json</code> of the PySpark Python 3 kernel, which sets:<br>
<code>
setMaster("spark://node1:7077")
</code><br>
This means the notebook runs in client mode and uses the rest of the cluster as well as HDFS. In that setup, <code>sc.textFile(...)</code> needs an <code>hdfs:///</code> path; if you pass only a filename, it is resolved against HDFS by default.
<br>
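The path rule above can be sketched in plain Python. This is only an approximation of how a path without a scheme ends up on the default filesystem, not Spark's or Hadoop's actual code. The default filesystem URI `hdfs://node1:9000` and the HDFS user `hduser` are assumptions made for the example; the real values come from the cluster's Hadoop configuration.

```python
from urllib.parse import urlparse

def resolve(path, default_fs="hdfs://node1:9000", user="hduser"):
    """Approximate how a path passed to sc.textFile() is qualified.

    Hypothetical helper: default_fs and user are assumed values.
    """
    if urlparse(path).scheme:
        return path  # explicit scheme such as file:// or hdfs:// is kept
    if not path.startswith("/"):
        # relative HDFS paths are resolved against the user's home directory
        path = f"/user/{user}/{path}"
    return default_fs.rstrip("/") + path

print(resolve("version7.txt"))
print(resolve("/data/version7.txt"))
print(resolve("file:///home/hduser/sandbox/version7.txt"))
```

Under these assumptions, only the last call reads from the local disk, which is why the notebook spells out `file:///` when it wants the local copy.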

To set the master and deploy mode explicitly, something like this should work:
<code>
conf = (SparkConf()
         .setMaster("spark://node1:7077")
         .setAppName("My app")
         .set("spark.executor.memory", "512M")
         .set("spark.cores.max", 4)
         .set("spark.submit.deployMode", "client"))
</code>

Reference: <a href="https://spark.apache.org/docs/latest/submitting-applications.html">Submitting Applications</a>

<p>Some of the commonly used options are:</p>

<ul>
  <li><code>--class</code>: The entry point for your application (e.g. <code>org.apache.spark.examples.SparkPi</code>)</li>
  <li><code>--master</code>: The <a href="https://spark.apache.org/docs/latest/submitting-applications.html#master-urls">master URL</a> for the cluster (e.g. <code>spark://23.195.26.187:7077</code>)</li>
  <li><code>--deploy-mode</code>: Whether to deploy your driver on the worker nodes (<code>cluster</code>) or locally as an external client (<code>client</code>) (default: <code>client</code>)</li>
  <li><code>--conf</code>: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap &#8220;key=value&#8221; in quotes.</li>
  <li><code>application-jar</code>: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an <code>hdfs://</code> path or a <code>file://</code> path that is present on all nodes.</li>
  <li><code>application-arguments</code>: Arguments passed to the main method of your main class, if any</li>
</ul>
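To see how the options in the list fit together on a command line, here is a small sketch that assembles a `spark-submit` invocation as an argument list. The helper `spark_submit_args` and the script name `wordcount.py` are made up for the example; the flags are the ones listed above. `shlex.join` takes care of quoting a `--conf` value that contains spaces.

```python
import shlex

def spark_submit_args(app, master, deploy_mode="client", conf=None, app_args=()):
    """Build a spark-submit argument list (hypothetical helper)."""
    args = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        # each property is passed as its own --conf key=value argument
        args += ["--conf", f"{key}={value}"]
    args.append(app)       # the application (a .py file for PySpark)
    args.extend(app_args)  # arguments handed to the application itself
    return args

cmd = spark_submit_args(
    "wordcount.py",  # hypothetical script name
    "spark://node1:7077",
    conf={
        "spark.executor.memory": "512M",
        "spark.driver.extraJavaOptions": "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
    },
    app_args=["version7.txt"],
)
print(shlex.join(cmd))  # shell-safe string; the value with a space is quoted
```

Passing the list form to `subprocess.run(cmd)` would avoid shell quoting altogether.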

Right now, with Jupyter, <a href="https://stackoverflow.com/questions/45997150/can-i-run-a-pyspark-jupyter-notebook-in-cluster-deploy-mode">we only use client mode, because cluster mode is not supported</a>. But the configuration should(?) be the same.


In [4]:
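conf = (SparkConf()
         .setMaster("local[2]")  # run locally with 2 worker threads
         .setAppName("My app")
         .set("spark.executor.memory", "512M")
         # spark.cores.max targets standalone/Mesos clusters; kept to mirror the cluster config
         .set("spark.cores.max", 4))
# sc.stop()  # uncomment if a SparkContext is already running
sc = SparkContext(conf=conf)
# Bare filename, no scheme
local_file = sc.textFile("version7.txt")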

In [3]:
local_file = sc.textFile("file:///home/hduser/sandbox/version7.txt")

In [4]:
counts = local_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

In [8]:
print(counts)
print(counts.take(5))

PythonRDD[8] at RDD at PythonRDD.scala:48
[('laboris', 1), ('wirc', 1), ('mmodo', 1), ('', 1), ('anim', 1)]
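
The `('', 1)` pair in the output above is probably a side effect of `line.split(" ")`: splitting on a single space yields an empty string for every doubled space and for every empty line. The plain-Python sketch below mirrors the flatMap / map / reduceByKey pipeline with `collections.Counter` on made-up sample lines (not the contents of version7.txt) and compares it with `split()` without an argument, which drops empty tokens.

```python
from collections import Counter
from itertools import chain

# Made-up sample input with a doubled space and an empty line
lines = ["Lorem ipsum  dolor", "", "sit amet"]

def word_count(lines, sep=" "):
    """Local stand-in for flatMap(split) -> map((word, 1)) -> reduceByKey(add)."""
    words = chain.from_iterable(line.split(sep) for line in lines)
    return Counter(words)

with_space = word_count(lines)            # same splitting as in the notebook
whitespace = word_count(lines, sep=None)  # split() on any run of whitespace

print(with_space[""])    # 2: one from the doubled space, one from the empty line
print("" in whitespace)  # False
```

In the Spark job, `lambda line: line.split()` would have the same effect.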


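When using a cluster master (client or cluster deploy mode, not local), bare paths resolve to HDFS. HDFS stores each block on several DataNodes according to its replication factor (<code>dfs.replication</code>, 3 by default), so the file is replicated across the cluster but not necessarily copied to every node.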

In [6]:
# counts.saveAsTextFile("file:///home/hduser/sandbox/version7out.txt")

In [7]:
with open("version7.txt", 'r') as fin:
    print(fin.read())

Lorem ipsum dolor sit amet, consectetur adipiscing elit, set eiusmod tempor incidunt et labore et dolore magna aliquam. Ut enim ad minim veniam, quis nostrud exerc. Irure dolor in reprehend incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse molestaie cillum. Tia non ob ea soluad incommod quae egen ium improb fugiend. Officia deserunt mollit anim id est laborum Et harumd dereud facilis est er expedit distinct. Nam liber te conscient to factor tum poen legum odioque civiuda et tam. Neque pecun modut est neque nonor et imper ned libidig met, consectetur adipiscing elit, sed ut labore et dolore magna aliquam is nostrud exercitation ullam mmodo consequet. Duis aute in voluptate velit esse cillum dolore eu fugiat nulla pariatur. At vver eos et accusam dignissum qui blandit est praesent. Trenz pruca beynocguon doas nog apoply su t

Useful links:
    - Spark cluster on OpenStack with multi-user Jupyter notebook: https://arnesund.com/2015/09/21/spark-cluster-on-openstack-with-multi-user-jupyter-notebook/
    - Differences between local, yarn-cluster and yarn-client: https://community.hortonworks.com/questions/89263/difference-between-local-vs-yarn-cluster-vs-yarn-c.html


In [5]:
local_file = sc.textFile("file:///home/hduser/sandbox/sample_lda_data.txt").map(lambda row: row.split(" "))

In [6]:
from pyspark.mllib.feature import Word2Vec
word2vec = Word2Vec()
model = word2vec.fit(local_file)

In [10]:
synonyms = model.findSynonyms('2', 5)
# findSynonyms returns (word, cosine similarity) pairs; higher means closer
for word, cosine_similarity in synonyms:
    print("{}:{}".format(word, cosine_similarity))

4:0.27293306589126587
3:0.2072264403104782
0:0.13399849832057953
1:0.0637863352894783
9:0.04702875018119812
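
The score that `findSynonyms` returns alongside each word is a cosine similarity between word vectors, so larger values mean closer words and the list above is sorted from most to least similar. As a reminder of what that number measures, here is a minimal plain-Python sketch; the vectors are toy values, not the model's learned embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```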
