# Rank Text Similarity
### Techniques: TF-IDF + SVD
### Technologies: PySpark
### Applications: e.g., collaborative filtering

### Prerequisites
* Spark 2.4.4 is installed

### Setup environment variables

* Workload: 1 hour to configure (first time only)

In [1]:
# setup SPARK environment variables
import os
import sys  
os.environ['SPARK_HOME'] = '/usr/local/Cellar/apache-spark/2.4.4/libexec'
os.environ['PYSPARK_PYTHON'] = '/Applications/anaconda3/envs/nlp_text_similarity/bin/python'  
os.environ['PYSPARK_DRIVER_PYTHON'] = '/Applications/anaconda3/envs/nlp_text_similarity/bin/python'  
os.environ['JAVA_HOME'] = '/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home'
os.environ['PYTHONPATH'] = os.environ['SPARK_HOME'] + '/python/lib/'
sys.path.insert(0, os.environ['SPARK_HOME'] + '/python/lib/py4j-0.10.7-src.zip')
sys.path.insert(0, os.environ['SPARK_HOME'] + '/python/lib/pyspark.zip')
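
The paths above are specific to one machine, so a quick sanity check before importing pyspark can save debugging time. The following is a minimal sketch, not part of the original notebook: `find_missing_paths` is a hypothetical helper that reports which environment variables are unset or point to a location that does not exist.

```python
import os

# Hypothetical helper (not part of the notebook): return the variables that
# are unset or whose value is not an existing path on this machine.
def find_missing_paths(var_names):
    missing = {}
    for name in var_names:
        value = os.environ.get(name)
        if value is None or not os.path.exists(value):
            missing[name] = value
    return missing

# Check the variables configured in the cell above.
problems = find_missing_paths(
    ["SPARK_HOME", "JAVA_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"]
)
for name, value in problems.items():
    print(f"{name} is unset or invalid: {value!r}")
```

An empty `problems` dictionary only means the paths exist; it does not prove that the Spark, Java, and py4j versions are compatible with each other.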

In [2]:
# import dependencies
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

In [3]:
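# dataset path (run this cell once: os.chdir('..') moves up one level on every execution)
os.chdir('..')
project_path = os.getcwd()
dataset_path = os.path.join(project_path, 'data/02_preprocessed/train.csv')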

In [4]:
# create sparkSession
sparkSession = SparkSession.builder.appName("my_app").master("local[*]").getOrCreate() 

In [5]:
# print spark session config
sparkSession

In [53]:
# Load imdb dataset as pandas dataframe (<1 sec)
dataset_df = pd.read_csv(dataset_path)
print('Dataset size:', dataset_df.shape)
dataset_df.head()

Dataset size: (25000, 2)


|   | sentiment | review |
|---|-----------|--------|
| 0 | 0 | Working with one of the best Shakespeare sourc... |
| 1 | 0 | Well...tremors I, the original started off in ... |
| 2 | 0 | Ouch! This one was a bit painful to sit throug... |
| 3 | 0 | I've seen some crappy movies in my life, but t... |
| 4 | 0 | "Carriers" follows the exploits of two guys an... |


In [7]:
# filter dataset (keep reviews)
reviews_dataset_df = dataset_df[['sentiment','review']]
reviews_dataset_df.head()

|   | sentiment | review |
|---|-----------|--------|
| 0 | 0 | Working with one of the best Shakespeare sourc... |
| 1 | 0 | Well...tremors I, the original started off in ... |
| 2 | 0 | Ouch! This one was a bit painful to sit throug... |
| 3 | 0 | I've seen some crappy movies in my life, but t... |
| 4 | 0 | "Carriers" follows the exploits of two guys an... |


In [8]:
# create spark dataframe (2 sec)
dataset_spark = sparkSession.createDataFrame(data=reviews_dataset_df, schema=['sentiment','review'])
dataset_spark.show()


+---------+--------------------+
|sentiment|              review|
+---------+--------------------+
|        0|Working with one ...|
|        0|Well...tremors I,...|
|        0|Ouch! This one wa...|
|        0|I've seen some cr...|
|        0|"Carriers" follow...|
|        0|I had been lookin...|
|        0|Effect(s) without...|
|        0|This picture star...|
|        0|I chose to see th...|
|        0|This film has to ...|
|        0|I felt brain dead...|
|        0|A young scientist...|
|        0|Inept, boring, an...|
|        0|From the first ti...|
|        0|I find it hard to...|
|        0|I actually saw Ch...|
|        0|I went to school ...|
|        0|I haven't seen th...|
|        0|I haven't seen an...|
|        0|One would think t...|
+---------+--------------------+
only showing top 20 rows



In [9]:
# check dataset schema
dataset_spark.printSchema()

root
 |-- sentiment: long (nullable = true)
 |-- review: string (nullable = true)



## Feature engineering: TF-IDF

In [68]:
# tokenize
tokenizer = Tokenizer(inputCol="review", outputCol="words")
words_data = tokenizer.transform(dataset_spark)
print('Words data schema:')
words_data.printSchema()

# term frequency (terms are hashed into 262144 buckets, the HashingTF default)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
features = hashingTF.transform(words_data)
features.show(5)

# inverse document frequency
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(features)
tfidf_data = idfModel.transform(features)
print('tf-idf data schema:')
tfidf_data.printSchema()
tfidf_data.select("sentiment", "features").show(5)

Words data schema:
root
 |-- sentiment: long (nullable = true)
 |-- review: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---------+--------------------+--------------------+--------------------+
|sentiment|              review|               words|         rawFeatures|
+---------+--------------------+--------------------+--------------------+
|        0|Working with one ...|[working, with, o...|(262144,[9639,172...|
|        0|Well...tremors I,...|[well...tremors, ...|(262144,[14,1624,...|
|        0|Ouch! This one wa...|[ouch!, this, one...|(262144,[2437,912...|
|        0|I've seen some cr...|[i've, seen, some...|(262144,[14,3026,...|
|        0|"Carriers" follow...|["carriers", foll...|(262144,[14,1804,...|
+---------+--------------------+--------------------+--------------------+
only showing top 5 rows

tf-idf data schema:
root
 |-- sentiment: long (nullable = true)
 |-- review: string (nullable = true)
 |-- words: 
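
To make the three steps above concrete, here is a minimal pure-Python sketch of the same computation on a made-up three-document corpus. It assumes the smoothed formula given in the Spark documentation, idf(t) = log((m + 1) / (df(t) + 1)), with m the number of documents and df(t) the number of documents containing t. As a simplification, it keys the vectors by the term itself instead of a hashed index, and the `docs` list is purely illustrative.

```python
import math
from collections import Counter

# Illustrative corpus (not taken from the IMDB dataset).
docs = [
    "the movie was great",
    "the movie was painful to watch",
    "great acting great plot",
]

# Roughly what Tokenizer does: lowercase, then split on whitespace.
tokenized = [doc.lower().split() for doc in docs]
m = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for words in tokenized for term in set(words))

# Smoothed inverse document frequency, as documented for Spark's IDF.
idf = {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}

def tfidf(words):
    """Raw term counts scaled by the IDF weights."""
    tf = Counter(words)
    return {term: count * idf[term] for term, count in tf.items()}

vectors = [tfidf(words) for words in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, "->", {t: round(w, 3) for t, w in vec.items()})
```

Terms that occur in few documents (for example "painful") receive a larger weight than terms shared by several documents (for example "the"), which is what makes the vectors useful for comparing reviews.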

In [56]:
# count features (read the sparse vector's size instead of converting it to a dense array)
num_features = tfidf_data.select("features").first()["features"].size
print('Number of features:', num_features)

Number of features: 262144
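
The count is 262144 (2^18) regardless of the vocabulary because HashingTF maps each term to a bucket index with a hash function instead of building a dictionary. The sketch below illustrates the idea only: Spark uses MurmurHash3, whereas this example substitutes MD5 from the standard library, so the indices will not match Spark's. `term_index` and `hashed_term_counts` are hypothetical names.

```python
import hashlib
from collections import Counter

NUM_FEATURES = 2 ** 18  # 262144, the HashingTF default

def term_index(term, num_features=NUM_FEATURES):
    """Map a term to a bucket in [0, num_features). MD5 is a stand-in hash."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features

def hashed_term_counts(words, num_features=NUM_FEATURES):
    """Sparse term-frequency vector as {bucket index: count}."""
    counts = Counter(term_index(word, num_features) for word in words)
    return dict(sorted(counts.items()))

words = "ouch! this one was a bit painful this one".split()
sparse = hashed_term_counts(words)
print(len(sparse), "non-zero buckets out of", NUM_FEATURES)
```

Two different terms can land in the same bucket (a collision); a large bucket count keeps this rare, at the cost of very wide, very sparse vectors.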


## Dimensionality reduction: SVD

In [120]:
# test SVD on a toy RowMatrix
from pyspark.mllib.linalg.distributed import RowMatrix

rows = sparkSession.sparkContext.parallelize([[3, 1, 1], [-1, 3, 1]])
rm = RowMatrix(rows)
svd_model = rm.computeSVD(2, computeU=True)  # keep the top 2 singular values
print(svd_model.U.rows.collect())
print("\nSingular values:\n", svd_model.s)
print("\nRight singular vectors (V):\n", svd_model.V)

[DenseVector([-0.7071, 0.7071]), DenseVector([-0.7071, -0.7071])]

Singular values:
 [3.464101615137755,3.1622776601683795]

Right singular vectors (V):
 DenseMatrix([[-4.08248290e-01,  8.94427191e-01],
             [-8.16496581e-01, -4.47213595e-01],
             [-4.08248290e-01,  2.77555756e-17]])

In [128]:
# test SVD on a toy IndexedRowMatrix
from pyspark.mllib.linalg.distributed import IndexedRowMatrix

indexedRows = sparkSession.sparkContext.parallelize(
    [
        (0, [1, 2, 3]),
        (1, [4, 5, 6]),
        (2, [7, 8, 9]),
        (3, [10, 11, 12]),
    ])
mat = IndexedRowMatrix(indexedRows)
svd = mat.computeSVD(2, computeU=True)
# inspect the factors of `mat`
print(svd.U.rows.collect())
print("\nSingular values:\n", svd.s)
print("\nRight singular vectors (V):\n", svd.V)
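
The recorded singular values 3.4641 and 3.1623 are those of the toy matrix `[[3, 1, 1], [-1, 3, 1]]`, namely √12 and √10. As a cross-check that does not need a Spark session, the sketch below recomputes the decomposition with NumPy (an assumption: NumPy is installed, which it normally is alongside pandas). Singular vectors are only defined up to sign, so individual columns may be negated relative to Spark's output.

```python
import numpy as np

# Same toy matrix as the RowMatrix example: 2 rows, 3 columns.
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print("Singular values:", s)

# The factors reproduce the original matrix.
A_rebuilt = U @ np.diag(s) @ Vt

# Projecting the rows onto the right singular vectors gives the
# reduced representation, which equals U scaled by the singular values.
reduced = A @ Vt.T
print("Reduced rows:\n", reduced)
```

In a text-similarity setting the rows would be TF-IDF vectors, and `reduced` would be the lower-dimensional representation in which documents are compared.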


## References

* https://towardsdatascience.com/multi-class-text-classification-with-pyspark-7d78d022ed35
* https://spark.apache.org/docs/latest/ml-features#tf-idf (how to run TF-IDF on a toy dataset)