# Rank Text Similarity
### Techniques: TF-IDF + SVD
### Technologies: PySpark

### Prerequisites
* Spark 2.4.4 is installed

### Setup environment variables

* Workload: 1 hour to configure (first time only)

In [1]:
# Set Spark environment variables (these paths are specific to one macOS install; adjust them for your machine)
import os
import sys  
os.environ['SPARK_HOME'] = '/usr/local/Cellar/apache-spark/2.4.4/libexec'
os.environ['PYSPARK_PYTHON'] = '/Applications/anaconda3/envs/nlp_text_similarity/bin/python'  
os.environ['PYSPARK_DRIVER_PYTHON'] = '/Applications/anaconda3/envs/nlp_text_similarity/bin/python'  
os.environ['JAVA_HOME'] = '/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home'
os.environ['PYTHONPATH'] = os.environ['SPARK_HOME'] + '/python/lib/'
sys.path.insert(0, os.environ['SPARK_HOME'] + '/python/lib/py4j-0.10.7-src.zip')
sys.path.insert(0, os.environ['SPARK_HOME'] + '/python/lib/pyspark.zip')
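
Since every path in the cell above is machine-specific, a quick existence check can catch a wrong location before the Spark session is created. The sketch below is only illustrative: `missing_paths` is a hypothetical helper, not part of PySpark, and it only tests that each path exists, not that the installation behind it works.

```python
import os


def missing_paths(env, names):
    """Return the variable names that are unset or point to a path that does not exist."""
    missing = []
    for name in names:
        path = env.get(name)
        if not path or not os.path.exists(path):
            missing.append(name)
    return missing


# Variables set in the cell above whose values should be real paths on disk.
required = ["SPARK_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON", "JAVA_HOME"]
problems = missing_paths(os.environ, required)
if problems:
    print("Check these variables before starting Spark:", ", ".join(problems))
else:
    print("All Spark-related paths exist.")
```

If a name is reported, fixing its path before running the `SparkSession` cell is usually quicker than decoding the Java or py4j error that would follow.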

In [2]:
# list dependencies
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
import pandas as pd

In [3]:
# dataset path, relative to the project root (run this cell once: os.chdir('..') moves up one level on every run)
os.chdir('..')
project_path = os.getcwd()
dataset_path = 'data/02_preprocessed/train.csv'  

In [4]:
# create sparkSession
sparkSession = SparkSession.builder.appName("my_app").master("local[*]").getOrCreate() 

In [5]:
# print spark session config
sparkSession

In [6]:
# Load the IMDb dataset as a pandas DataFrame (<1 sec)
dataset_df = pd.read_csv(dataset_path)
dataset_df.head()


Unnamed: 0,sentiment,review
0,0,Working with one of the best Shakespeare sourc...
1,0,"Well...tremors I, the original started off in ..."
2,0,Ouch! This one was a bit painful to sit throug...
3,0,"I've seen some crappy movies in my life, but t..."
4,0,"""Carriers"" follows the exploits of two guys an..."


In [7]:
# filter dataset (keep only the sentiment and review columns)
reviews_dataset_df = dataset_df[['sentiment','review']]
reviews_dataset_df.head()

Unnamed: 0,sentiment,review
0,0,Working with one of the best Shakespeare sourc...
1,0,"Well...tremors I, the original started off in ..."
2,0,Ouch! This one was a bit painful to sit throug...
3,0,"I've seen some crappy movies in my life, but t..."
4,0,"""Carriers"" follows the exploits of two guys an..."


In [8]:
# create spark dataframe (2 sec)
dataset_spark = sparkSession.createDataFrame(data=reviews_dataset_df, schema=['sentiment','review'])
dataset_spark.show()


+---------+--------------------+
|sentiment|              review|
+---------+--------------------+
|        0|Working with one ...|
|        0|Well...tremors I,...|
|        0|Ouch! This one wa...|
|        0|I've seen some cr...|
|        0|"Carriers" follow...|
|        0|I had been lookin...|
|        0|Effect(s) without...|
|        0|This picture star...|
|        0|I chose to see th...|
|        0|This film has to ...|
|        0|I felt brain dead...|
|        0|A young scientist...|
|        0|Inept, boring, an...|
|        0|From the first ti...|
|        0|I find it hard to...|
|        0|I actually saw Ch...|
|        0|I went to school ...|
|        0|I haven't seen th...|
|        0|I haven't seen an...|
|        0|One would think t...|
+---------+--------------------+
only showing top 20 rows



In [9]:
# check dataset schema
dataset_spark.printSchema()

root
 |-- sentiment: long (nullable = true)
 |-- review: string (nullable = true)



## Feature engineering: TF-IDF

In [10]:
# Lowercase each review and split it on whitespace
tokenizer = Tokenizer(inputCol="review", outputCol="words")
wordsData = tokenizer.transform(dataset_spark)
wordsData.printSchema()

# Hash each word into one of 20 buckets and count occurrences (term frequency)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
featurizedData.show()

# Fit IDF weights on the whole corpus, then rescale the raw term frequencies
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.select("sentiment", "features").show()

root
 |-- sentiment: long (nullable = true)
 |-- review: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---------+--------------------+--------------------+--------------------+
|sentiment|              review|               words|         rawFeatures|
+---------+--------------------+--------------------+--------------------+
|        0|Working with one ...|[working, with, o...|(20,[0,1,2,3,4,5,...|
|        0|Well...tremors I,...|[well...tremors, ...|(20,[0,1,2,3,4,5,...|
|        0|Ouch! This one wa...|[ouch!, this, one...|(20,[0,1,2,3,4,5,...|
|        0|I've seen some cr...|[i've, seen, some...|(20,[0,1,2,3,4,5,...|
|        0|"Carriers" follow...|["carriers", foll...|(20,[0,1,2,3,4,5,...|
|        0|I had been lookin...|[i, had, been, lo...|(20,[0,1,2,3,4,5,...|
|        0|Effect(s) without...|[effect(s), witho...|(20,[0,1,2,3,4,5,...|
|        0|This picture star...|[this, picture, s...|(20,[0,1,2,3,4,5,...|
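
To make the `rawFeatures` and `features` columns easier to read, here is a minimal pure-Python sketch of the same idea: hash tokens into a fixed number of buckets, count them, then weight each bucket with the smoothed IDF that the Spark documentation gives, log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents that hit the bucket. Everything named here (`bucket`, `hashed_tf`, `idf_weights`, `tfidf`, the toy sentences) is made up for illustration, and the hash is `zlib.crc32` rather than the MurmurHash3 that Spark uses, so the bucket indices will not match Spark's.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 20  # same bucket count as the HashingTF call above


def bucket(term, num_features=NUM_FEATURES):
    # Illustrative hash only: Spark's HashingTF uses MurmurHash3, so real indices differ.
    return zlib.crc32(term.encode("utf-8")) % num_features


def hashed_tf(tokens, num_features=NUM_FEATURES):
    # Sparse term-frequency vector: bucket index -> number of tokens that landed there.
    return dict(Counter(bucket(t, num_features) for t in tokens))


def idf_weights(tf_vectors, num_features=NUM_FEATURES):
    # Smoothed IDF per bucket: log((m + 1) / (df + 1)).
    m = len(tf_vectors)
    df = Counter(i for vec in tf_vectors for i in vec)
    return [math.log((m + 1) / (df.get(i, 0) + 1)) for i in range(num_features)]


def tfidf(tf_vector, idf):
    # Rescale each raw count by the IDF weight of its bucket.
    return {i: count * idf[i] for i, count in tf_vector.items()}


# Toy documents (invented for this sketch), tokenized like Tokenizer: lowercase, split on whitespace.
docs = [s.lower().split() for s in [
    "The plot was slow and the acting was flat",
    "The acting was great",
    "A slow but rewarding film",
]]
tf_vectors = [hashed_tf(d) for d in docs]
idf = idf_weights(tf_vectors)
for vec in tf_vectors:
    print(sorted(tfidf(vec, idf).items()))
```

A bucket that appears in every document gets weight log(1) = 0, so it drops out of the rescaled vector. With only 20 buckets, long reviews fill almost every index, which is what the `(20,[0,1,2,3,4,5,...` rows above show. Spark's default for `numFeatures` is 262144 (2^18), which makes collisions between different words far less likely.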

## References

* https://towardsdatascience.com/multi-class-text-classification-with-pyspark-7d78d022ed35
* https://spark.apache.org/docs/latest/ml-features#tf-idf (how to run TF-IDF on a toy dataset)