# Group 13 - Jazz Sun, Yusong Wang
# Song and Book recommendation system
## Analytics Goal
Our main objective is to build a recommendation system that provides personalized book recommendations based on a user's current song listening situation. To achieve this, we represent the text of each book as a TF-IDF feature vector and convert the user's input into a vector in the same feature space. We then compute the cosine similarity between the input vector and each book in our database and return the 10 books with the highest similarity scores.
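The pipeline described above can be illustrated without Spark. The following is a minimal pure-Python sketch, not the notebook's implementation: the toy corpus, the helper names (`tokenize`, `idf_weights`, `tfidf`, `cosine`, `recommend`), and the choice of the smoothed IDF form log((N + 1) / (df + 1)) are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus standing in for the book texts (hypothetical data)
books = {
    "Love Poems": "poems about love and the heart",
    "Garden Guide": "growing vegetables in the garden",
    "Sea Stories": "tales of ships and the sea",
}

def tokenize(text):
    # Lowercase and split on whitespace
    return text.lower().split()

def idf_weights(docs):
    # Smoothed inverse document frequency: log((N + 1) / (df + 1))
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((n + 1) / (count + 1)) for term, count in df.items()}

def tfidf(tokens, idf):
    # Term frequency times IDF; terms that never occur in the corpus are dropped
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}

def cosine(a, b):
    # Cosine similarity of two sparse vectors stored as dicts
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(query, corpus, k=10):
    # Score every document against the query and keep the k best
    docs = {title: tokenize(text) for title, text in corpus.items()}
    idf = idf_weights(list(docs.values()))
    q = tfidf(tokenize(query), idf)
    scored = [(title, cosine(q, tfidf(toks, idf))) for title, toks in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

top = recommend("I am listening to a love song", books, k=2)
```

In this toy corpus the only query word that also appears in a book text is "love", so "Love Poems" is the only title with a nonzero score.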

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws,col
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

In [0]:
spark = SparkSession \
    .builder \
    .appName("pgML") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

### Load books data from MongoDB and Create features/label

In [0]:
database = 'test_db'
collection = 'Penguin_Random_House'
user_name = 'chuang86'
password = 'msds697'
address = 'msds697-cluster.sig8e.mongodb.net'
connection_string = f"mongodb+srv://{user_name}:{password}@{address}/{database}.{collection}"
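Hard-coding credentials in a notebook makes them visible to anyone who can read it. As a hedged alternative, the sketch below reads them from environment variables and percent-encodes them before building the URI. The variable names `MONGO_USER` and `MONGO_PASSWORD`, the fallback values, and the helper `build_mongo_uri` are hypothetical and not part of the original notebook.

```python
import os
from urllib.parse import quote_plus

def build_mongo_uri(user, password, address, database, collection):
    # Percent-encode credentials so characters such as '@' or ':' do not break the URI
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
        f"@{address}/{database}.{collection}"
    )

# MONGO_USER and MONGO_PASSWORD are hypothetical names; the fallbacks are placeholders
connection_string = build_mongo_uri(
    os.environ.get("MONGO_USER", "example_user"),
    os.environ.get("MONGO_PASSWORD", "p@ss:word"),
    "msds697-cluster.sig8e.mongodb.net",
    "test_db",
    "Penguin_Random_House",
)
```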

In [0]:
pg = spark.read.format("mongo").option("uri",connection_string).load()

In [0]:
df=pg

### Combine all books features into one text column

In [0]:
from pyspark.sql.functions import concat_ws,col
df=df.select(df.title,df.aboutTheBook,df.author,df.authorBio,df.categories,df.keynote,df.praises,\
            concat_ws(',',df.title,df.aboutTheBook,df.author,df.authorBio,df.categories,df.keynote,df.praises)
              .alias("text"))
df.show()

### Tokenization on text and compute the TF-IDF features

In [0]:

# Define tokenizer
tokenizer = Tokenizer(inputCol="text", outputCol="words")

# Apply tokenizer to DataFrame
wordsData = tokenizer.transform(df)

# Apply hashing trick to words
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)

# Compute TF-IDF features
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
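The hashing step above maps every word to one of `numFeatures` buckets, so with only 20 buckets many different words necessarily share a bucket and become indistinguishable to the model. The sketch below illustrates this in plain Python under stated assumptions: `zlib.crc32` is a stand-in hash chosen for reproducibility (Spark uses a different hash function), and the synthetic vocabulary and the helpers `bucket`, `hashed_tf`, and `collisions` are hypothetical.

```python
import zlib

def bucket(term, num_features):
    # crc32 is a stand-in hash for this sketch; Spark's HashingTF hashes differently
    return zlib.crc32(term.encode("utf-8")) % num_features

def hashed_tf(tokens, num_features):
    # Term-frequency vector in which each token increments the count of its bucket
    vec = [0] * num_features
    for tok in tokens:
        vec[bucket(tok, num_features)] += 1
    return vec

def collisions(vocab, num_features):
    # Number of terms that landed in a bucket already taken by another term
    used = {bucket(t, num_features) for t in vocab}
    return len(vocab) - len(used)

# Synthetic vocabulary of 1000 distinct terms (hypothetical data)
vocab = [f"word{i}" for i in range(1000)]
few_buckets = collisions(vocab, 20)        # at least 980 by the pigeonhole principle
many_buckets = collisions(vocab, 2 ** 18)  # 2 ** 18 is the documented HashingTF default
```

A larger `numFeatures` reduces collisions at the cost of longer (but sparse) vectors, which may be worth trying if the similarity scores look too uniform.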



### Books recommendation system based on song listening situation

Convert the user's input text to a TF-IDF vector using the hashing and IDF model fitted above.

In [0]:
from pyspark.ml.linalg import Vector, VectorUDT
# Convert a text string to a TF-IDF vector using the fitted hashingTF and idfModel.
# It builds a one-row DataFrame through the Spark session, so it is called on the driver.
def text_to_vector(text):
    # Lowercase before splitting so tokens match Tokenizer, which lowercases its input
    words = text.lower().split()
    wordsData = spark.createDataFrame([(0, words)], ["id", "words"])
    featurizedData = hashingTF.transform(wordsData)
    rescaledData = idfModel.transform(featurizedData)
    return rescaledData.select("features").first()[0]

# UDF wrapper around text_to_vector (defined here but not used in the steps below)
text_to_vector_udf = udf(text_to_vector, VectorUDT())

# Define the input text (the user's song listening situation)
input_text = "I am listening to a love song"

# Convert the input text to a TF-IDF vector
input_vector = text_to_vector(input_text)




### Calculate cosine similarity between the input vector and the DataFrame vectors

In [0]:

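from pyspark.sql.types import DoubleType
# Cosine similarity between the input vector and a row's feature vector; the lambda returns a float, so declare DoubleType
cosine_similarity_udf = udf(lambda x: float(input_vector.dot(x) / (input_vector.norm(2) * x.norm(2))), DoubleType())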
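The UDF above divides by the product of the two norms, which is undefined when either vector is all zeros (for example, a row whose text is empty). A minimal plain-Python sketch of the same formula with an explicit guard follows. The function name `cosine_similarity` and the convention of returning 0.0 for a zero vector are assumptions of this sketch, not part of the notebook.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Treat an all-zero vector as having no similarity (a convention chosen for this sketch)
        return 0.0
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])                # no shared components
empty = cosine_similarity([0.0, 0.0], [1.0, 1.0])                     # zero vector is guarded
```

Because cosine similarity ignores vector length, a long book description and a short one are compared on the direction of their TF-IDF vectors only.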


### Get the top 10 book recommendations

In [0]:
from pyspark.sql.types import StringType,BooleanType,DateType,DoubleType
final_df=rescaledData.select("title", cosine_similarity_udf("features").alias("cosine_similarity"))
# final_df.show()
final_df = final_df.withColumn("cosine_similarity", final_df["cosine_similarity"].cast(DoubleType()))
final_df.orderBy(col("cosine_similarity").desc()).limit(10).show(truncate=False)





### Conclusion:
Based on our analysis, the recommendation system returns books that our model scores as similar to a user's current song listening situation. For example, for the input "I am listening to a love song", the system recommended books such as "Poems That Touch the Heart" and "Being a Green Mother", which have a high cosine similarity to the input under our TF-IDF model.

### Song recommendation system based on one book title
## Analytics Goal
The goal of this part of the project is to recommend the top 10 songs for a book title entered by the user. To achieve this, we fit a TF-IDF model on the song lyrics and use cosine similarity to compare the user's input with the lyrics of each song.

#### Load song lyrics data from MongoDB and Create features/label

In [0]:
database = 'test_db'
collection = 'BookToSong_Lyrics'
user_name = 'chuang86'
password = 'msds697'
address = 'msds697-cluster.sig8e.mongodb.net'
connection_string = f"mongodb+srv://{user_name}:{password}@{address}/{database}.{collection}"

In [0]:
song_df = spark.read.format("mongo").option("uri",connection_string).load()

In [0]:
song_df.show()

### Tokenization on text and compute the TF-IDF features

In [0]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Define tokenizer
tokenizer = Tokenizer(inputCol="Song_lyrics", outputCol="words")

# Apply tokenizer to DataFrame
wordsData = tokenizer.transform(song_df)

# Apply hashing trick to words
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)

# Compute TF-IDF features
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)




In [0]:
from pyspark.ml.linalg import Vector, VectorUDT
# Convert a text string to a TF-IDF vector using the fitted hashingTF and idfModel.
# It builds a one-row DataFrame through the Spark session, so it is called on the driver.
def text_to_vector(text):
    # Lowercase before splitting so tokens match Tokenizer, which lowercases its input
    words = text.lower().split()
    wordsData = spark.createDataFrame([(0, words)], ["id", "words"])
    featurizedData = hashingTF.transform(wordsData)
    rescaledData = idfModel.transform(featurizedData)
    return rescaledData.select("features").first()[0]

# UDF wrapper around text_to_vector (defined here but not used in the steps below)
text_to_vector_udf = udf(text_to_vector, VectorUDT())

# Define the input text
input_text = "The Truelove Bride"

# Convert the input text to a TF-IDF vector
input_vector = text_to_vector(input_text)


### Calculate cosine similarity between the input vector and the DataFrame vectors

In [0]:
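from pyspark.sql.types import DoubleType
# Cosine similarity between the input vector and a row's feature vector; the lambda returns a float, so declare DoubleType
cosine_similarity_udf = udf(lambda x: float(input_vector.dot(x) / (input_vector.norm(2) * x.norm(2))), DoubleType())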

### Get the top 10 song recommendations

In [0]:
from pyspark.sql.types import StringType,BooleanType,DateType,DoubleType
final_df=rescaledData.select("Song_name", cosine_similarity_udf("features").alias("cosine_similarity"))
# final_df.show()
final_df = final_df.withColumn("cosine_similarity", final_df["cosine_similarity"].cast(DoubleType()))
final_df.orderBy(col("cosine_similarity").desc()).limit(10).show()



### Conclusion:
Based on our analysis, the recommendation system returns songs that our model scores as similar to a book title. For example, for the book title "The Truelove Bride", the system recommended songs such as "Authenticity" and "Summer Memories", which have a high cosine similarity to the input under our TF-IDF model.