
# Finding similar items: StackSample
Gregorio Luigi Saporito - DSE (2020-2021)

In [1]:
# optional remove files
# !rm kaggle.json
# !rm Questions.csv
# !rm Body.csv
# !rm -rf spark-3.1.1-bin-hadoop2.7

### Upload the Kaggle API token to session storage

In [2]:
from google.colab import files
uploaded = files.upload()

import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content'

Saving kaggle.json to kaggle.json


### Download the dataset through the Kaggle API

In [3]:
# restrict access permissions on the API token file
!chmod 600 /content/kaggle.json
!kaggle datasets download -d stackoverflow/stacksample
!unzip \*.zip && rm *.zip
# remove datasets which are not needed
!rm Answers.csv
!rm Tags.csv

Downloading stacksample.zip to /content
100% 1.11G/1.11G [00:12<00:00, 67.4MB/s]
100% 1.11G/1.11G [00:12<00:00, 93.9MB/s]
Archive:  stacksample.zip
  inflating: Answers.csv             
  inflating: Questions.csv           
  inflating: Tags.csv                


### Spark environment setup

In [4]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
!tar xf spark-3.1.1-bin-hadoop2.7.tgz
!pip install -q findspark
!rm /content/spark-3.1.1-bin-hadoop2.7.tgz

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"

import findspark
findspark.init(os.environ["SPARK_HOME"])  # use the absolute SPARK_HOME path set above
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

import pyspark
type(spark)

sc = spark.sparkContext

### Load Dataset
By default, Spark's CSV reader treats each line of a file as one record, which lets it split the input for parallel processing, so quoted fields that contain line breaks confuse the parser. Here the Body column of "Questions.csv" contains "\n" and "\r" characters, which prevent the dataset from loading correctly. To work around this, the file is first read with a third-party parser that handles embedded line breaks (pandas), the line breaks are stripped, and a new .csv file without them is written to disk. In a production scenario, an alternative would be to store the data in a database. The memory held by the pandas DataFrame is then released, and the new .csv file is loaded with Spark.
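
To make the problem concrete, here is a minimal sketch that uses only the Python standard library rather than pandas or Spark. The two-record sample and all variable names are hypothetical. It shows that splitting on line breaks yields more lines than there are records, that a quote-aware parser keeps each record together, and that flattening the line breaks gives one record per line. The sketch substitutes a space for each line break so that neighbouring words stay separate, which is a choice made for the illustration.

```python
import csv
import io
import re

# Hypothetical sample: the second Body value spans three physical lines.
raw = 'Id,Body\r\n1,"<p>single line</p>"\r\n2,"<p>first\nsecond\r\nthird</p>"\r\n'

# Naive line splitting sees more lines than there are records.
physical_lines = raw.splitlines()

# A quote-aware parser keeps each quoted field together.
rows = list(csv.reader(io.StringIO(raw, newline='')))

# Flatten embedded line breaks so that every record fits on one line.
flat_rows = [[re.sub(r'[\r\n]+', ' ', field) for field in row] for row in rows]

print(len(physical_lines))  # 5 physical lines
print(len(rows))            # 3 records (header included)
print(flat_rows[2][1])      # <p>first second third</p>
```

Spark's CSV reader also exposes a `multiLine` option that might make the intermediate file unnecessary; it was not tried in this notebook.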

In [5]:
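import pandas as pd
# pandas copes with line breaks inside quoted fields; only Body is needed
parsed = pd.read_csv("Questions.csv", encoding="ISO-8859-1", usecols=["Body"])
# replace embedded line breaks with a space so adjacent words do not fuse;
# regex=True is explicit because newer pandas versions default to literal matching
parsed['Body'] = parsed['Body'].str.replace(r'[\r\n]+', ' ', regex=True)
parsed.to_csv("Body.csv", index=False)
# release the memory held by the pandas DataFrame
del parsed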

In [30]:
df = spark.read.load("Body.csv", format="csv",
                     inferSchema="true", header="true")
df

DataFrame[Body: string]

In [7]:
df.show(10)

+--------------------+
|                Body|
+--------------------+
|"<p>I've written ...|
|"<p>Are there any...|
|<p>Has anyone got...|
|<p>This is someth...|
|"<p>I have a litt...|
|<p>I am working o...|
|<p>I've been writ...|
|"<p>I wonder how ...|
|<p>I would like t...|
|<p>I'm trying to ...|
+--------------------+
only showing top 10 rows
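
Some rows in the preview above begin with a literal `"`. One plausible explanation, not verified against this dataset, is a quoting mismatch: a standard CSV writer doubles embedded quotes (pandas `to_csv` does this by default), whereas Spark's CSV reader uses a backslash as its default escape character. The sketch below uses the standard-library `csv` module and a hypothetical Body value to show what the doubled quotes look like on disk.

```python
import csv
import io

# Hypothetical Body value containing a comma and embedded double quotes.
body = '<p>I\'ve written a "parser", and it fails</p>'

buffer = io.StringIO(newline='')
writer = csv.writer(buffer)  # default dialect: embedded quotes are doubled
writer.writerow(['Body'])
writer.writerow([body])
written = buffer.getvalue()

print(written)

# A reader that treats "" as an escaped quote recovers the original value.
rows = list(csv.reader(io.StringIO(written, newline='')))
print(rows[1][0] == body)
```

If this is indeed the cause, passing `escape='"'` to the reader, for example `spark.read.csv("Body.csv", header=True, escape='"')`, may remove the stray quotes. The outputs shown in this notebook were produced without that option.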



### Dataset Cleaning

In [8]:
# check for missing values
from pyspark.sql.functions import isnan, when, count, col
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+----+
|Body|
+----+
|   0|
+----+



In [31]:
from pyspark.sql.functions import col, regexp_replace

def clean_html(x):
    # strip HTML tags with a non-greedy match between '<' and '>'
    return regexp_replace(x, '<.*?>', '')

df = df.select(clean_html(col("Body")).alias("Body"))
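
As a rough illustration of what the `'<.*?>'` pattern does, the sketch below applies the same expression with Python's `re` module to a hypothetical snippet. Spark evaluates the pattern with Java's regex engine, but this simple expression should behave the same way in both. The example also shows why the match has to be non-greedy, and that HTML entities such as `&lt;` survive tag stripping. Decoding them with `html.unescape` is an optional extra step that the notebook does not perform.

```python
import html
import re

TAG_PATTERN = re.compile(r'<.*?>')

def strip_tags(text):
    """Remove anything that looks like an HTML tag (non-greedy match)."""
    return TAG_PATTERN.sub('', text)

# Hypothetical question body.
sample = '<p>Use <code>a &lt; b</code> in <a href="https://example.com">this</a> case</p>'

stripped = strip_tags(sample)
print(stripped)                 # Use a &lt; b in this case
print(html.unescape(stripped))  # Use a < b in this case

# A greedy pattern would swallow everything from the first '<' to the last '>'.
greedy = re.sub(r'<.*>', '', sample)
print(repr(greedy))             # ''
```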

In [32]:
df.show(10)

+--------------------+
|                Body|
+--------------------+
|"I've written a d...|
|"Are there any re...|
|Has anyone got ex...|
|This is something...|
|"I have a little ...|
|I am working on a...|
|I've been writing...|
|"I wonder how you...|
|I would like the ...|
|I'm trying to mai...|
+--------------------+
only showing top 10 rows



In [37]:
# extracting tokens from text
from pyspark.ml.feature import RegexTokenizer

# gaps=False: the pattern matches the tokens themselves rather than the separators
regexTokenizer = RegexTokenizer(gaps=False, pattern=r'\w+', inputCol='Body', outputCol='tokens')
tokenised = regexTokenizer.transform(df)
tokenised.show(3)

+--------------------+--------------------+
|                Body|              tokens|
+--------------------+--------------------+
|"I've written a d...|[i, ve, written, ...|
|"Are there any re...|[are, there, any,...|
|Has anyone got ex...|[has, anyone, got...|
+--------------------+--------------------+
only showing top 3 rows
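
With `gaps=False`, every match of the pattern becomes a token, and `RegexTokenizer` lowercases the text by default. A plain-Python approximation of that behaviour is sketched below. The function name and sample sentence are hypothetical. Note that Java's `\w` is ASCII-only by default while Python's is Unicode-aware, so the two can differ on non-ASCII text.

```python
import re

def tokenize(text, pattern=r'\w+', min_token_length=1):
    """Lowercase the text and return every match of `pattern` as a token."""
    tokens = re.findall(pattern, text.lower())
    return [t for t in tokens if len(t) >= min_token_length]

print(tokenize("I've written a database-backed C# app"))
# ['i', 've', 'written', 'a', 'database', 'backed', 'c', 'app']
```

The example also shows a limitation of `\w+` on programming questions: "C#" is reduced to "c", and hyphenated terms are split.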



In [39]:
# stopwords removal
from pyspark.ml.feature import StopWordsRemover
swr = StopWordsRemover(inputCol = 'tokens', outputCol = 'sw_removed')
Body_swr = swr.transform(tokenised)
Body_swr.show(3)

+--------------------+--------------------+--------------------+
|                Body|              tokens|          sw_removed|
+--------------------+--------------------+--------------------+
|"I've written a d...|[i, ve, written, ...|[ve, written, dat...|
|"Are there any re...|[are, there, any,...|[really, good, tu...|
|Has anyone got ex...|[has, anyone, got...|[anyone, got, exp...|
+--------------------+--------------------+--------------------+
only showing top 3 rows
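
The next cell fits a Word2Vec model and represents each document by the average of its word vectors. A minimal pure-Python sketch of that averaging step follows, using hypothetical 3-dimensional embeddings. In this sketch, tokens without a vector are skipped, which is an assumption of the illustration and not a statement about how Spark handles them.

```python
def document_vector(tokens, word_vectors, size):
    """Return the element-wise mean of the vectors of the known tokens."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No known tokens: fall back to a zero vector of the right size.
        return [0.0] * size
    return [sum(component) / len(known) for component in zip(*known)]

# Hypothetical embeddings; real ones would come from the fitted model.
toy_vectors = {
    'spark': [1.0, 0.0, 2.0],
    'csv': [3.0, 4.0, 0.0],
}

print(document_vector(['spark', 'csv'], toy_vectors, size=3))  # [2.0, 2.0, 1.0]
```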



In [44]:
from pyspark.ml.feature import Word2Vec
# each document is represented by the average of its word vectors
word2vec = Word2Vec(vectorSize=100, seed=123, inputCol='sw_removed', outputCol='result')
model = word2vec.fit(Body_swr)
result = model.transform(Body_swr)

result.show(3)
result.select('result').show(1, truncate = True)

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:43549)
Traceback (most recent call last):
  File "spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 977, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1115, in start
    self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused


Py4JNetworkError: ignored
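
A "Connection refused" error from Py4J generally means that the JVM behind the Spark session is no longer running. On a memory-constrained Colab instance, a likely but unconfirmed cause is the driver running out of memory while fitting Word2Vec on the full corpus. One thing to try is giving the driver more memory when the session is first created. The sketch below is a starting point only: the helper name and the "8g" value are assumptions and are not tuned for this dataset.

```python
# Hypothetical settings; the value is an assumption, not a tuned recommendation.
SPARK_SETTINGS = {
    "spark.driver.memory": "8g",  # heap size for the local driver JVM
}

def build_session(settings, master="local[*]"):
    """Create a SparkSession with the given settings applied."""
    # Imported lazily: requires the Spark setup from the cells above.
    from pyspark.sql import SparkSession
    builder = SparkSession.builder.master(master)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `spark = build_session(SPARK_SETTINGS)` would replace the session creation in the setup cell. The driver memory setting only takes effect if it is applied before the JVM starts, so the runtime would need to be restarted first. Fitting the model on a sample of the rows is another option if memory remains tight.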