# Predicting the number of fans from an album title

Before working with the full dataset, consider debugging the pipeline on a [sample](https://spark.apache.org/docs/3.2.1/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.sample.html).

In [0]:
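from pyspark import pandas as pd

# The "distributed" index is computed in parallel; its values are increasing but not consecutive.
pd.set_option("compute.default_index_type", "distributed")
# Lower frac (for example, to 0.01) to debug the pipeline on a sample.
data = pd.read_json("dbfs:///data.json_lines").sample(frac=1.0)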

In [0]:
# We want to predict the number of fans,
# so we drop the rows where the deezerFans column is empty.
data = data[data.deezerFans.notnull()]
data[["title", "deezerFans"]].head()

| | title | deezerFans |
|---|---|---|
| 0 | Despierta | 2046 |
| 4 | Det Var En Gång En Fågel | 7 |
| 5 | Nein, Mann! | 53 |
| 7 | Some Things | 1688 |
| 8 | Far Away | 1157 |


# Transforming Data

Spark has a large library of feature engineering functions. For example, we can build a TF-IDF representation of album titles. In the following snippet we construct a data preparation pipeline with three stages:
1. split each title into words
1. count the term frequencies of each bag of words
1. rescale the counts by inverse document frequency

In [0]:
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

data_preparation = Pipeline(stages=[
    Tokenizer(inputCol="title", outputCol="words"),
    HashingTF(inputCol="words", outputCol="term_frequency"),
    IDF(inputCol="term_frequency", outputCol="embedding")
])
spark_data = data.spark.frame()
prepared_data = data_preparation.fit(spark_data).transform(spark_data)
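
To make the three stages concrete, here is a minimal pure-Python sketch of the same computation on a toy list of titles. It is an illustration, not Spark's implementation: the function names are hypothetical, terms are kept in a dictionary instead of being hashed to indices, and the tokenizer only approximates `Tokenizer` (lowercase, split on whitespace). The IDF formula `log((m + 1) / (df + 1))` is the one documented for Spark's `IDF`, where `m` is the number of documents and `df` is the number of documents containing the term.

```python
import math
from collections import Counter


def tokenize(title):
    # Stage 1: lowercase and split on whitespace (roughly what Tokenizer does).
    return title.lower().split()


def term_frequencies(words):
    # Stage 2: raw counts of each term in one document.
    return Counter(words)


def inverse_document_frequencies(tokenized_docs):
    # Stage 3 (fit): idf(t) = log((m + 1) / (df(t) + 1)).
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for words in tokenized_docs:
        doc_freq.update(set(words))  # count each term once per document
    return {
        term: math.log((n_docs + 1) / (df + 1))
        for term, df in doc_freq.items()
    }


def tf_idf(titles):
    # Stage 3 (transform): multiply each term count by the term's IDF weight.
    docs = [tokenize(title) for title in titles]
    idf = inverse_document_frequencies(docs)
    return [
        {term: tf * idf[term] for term, tf in term_frequencies(words).items()}
        for words in docs
    ]


# Toy titles chosen for illustration only.
titles = ["Far Away", "Far From Home", "Some Things"]
vectors = tf_idf(titles)
```

In this toy corpus "far" appears in two of the three titles, so it receives a smaller weight than "away", which appears in only one.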

Let's look into the details of the first five rows:

In [0]:
prepared_data[["title", "words", "term_frequency", "embedding"]].head(5)

Out[4]: [Row(title='Despierta', words=['despierta'], term_frequency=SparseVector(262144, {13861: 1.0}), embedding=SparseVector(262144, {13861: 10.8552})),
 Row(title='Det Var En Gång En Fågel', words=['det', 'var', 'en', 'gång', 'en', 'fågel'], term_frequency=SparseVector(262144, {32297: 2.0, 62868: 1.0, 84301: 1.0, 133624: 1.0, 228947: 1.0}), embedding=SparseVector(262144, {32297: 11.2447, 62868: 8.062, 84301: 11.2607, 133624: 11.2607, 228947: 9.0094})),
 Row(title='Nein, Mann!', words=['nein,', 'mann!'], term_frequency=SparseVector(262144, {85958: 1.0, 150003: 1.0}), embedding=SparseVector(262144, {85958: 11.2607, 150003: 10.8552})),
 Row(title='Some Things', words=['some', 'things'], term_frequency=SparseVector(262144, {19208: 1.0, 214676: 1.0}), embedding=SparseVector(262144, {19208: 7.0122, 214676: 6.5739})),
 Row(title='Far Away', words=['far', 'away'], term_frequency=SparseVector(262144, {9129: 1.0, 165678: 1.0}), embedding=SparseVector(262144, {9129: 6.8976, 165678: 7.3487}))]

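Note that the TF-IDF vectors are sparse: each `SparseVector` stores only its size (262144, the default number of `HashingTF` features) and the non-zero index-value pairs. Also note that `Tokenizer` only lowercases the title and splits it on whitespace, so punctuation stays attached to the words (`'nein,'`, `'mann!'`).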
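
The sketch below illustrates where the large vector size and the scattered indices come from. It imitates the hashing trick in plain Python under stated assumptions: `zlib.crc32` stands in for the hash function (Spark's `HashingTF` uses a different hash, so the indices will not match the output above), and the helper names are hypothetical.

```python
import zlib

NUM_FEATURES = 262144  # 2**18, the vector size seen in the output above


def hashed_term_frequencies(words, num_features=NUM_FEATURES):
    # Map each word to a bucket index and count occurrences per bucket.
    # Only non-zero buckets are stored, which is the sparse representation.
    counts = {}
    for word in words:
        # Assumption: crc32 is a stand-in hash, not the one Spark uses.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts


def to_dense(sparse, size):
    # Expand {index: value} into a full list with explicit zeros.
    dense = [0.0] * size
    for index, value in sparse.items():
        dense[index] = value
    return dense


words = ["det", "var", "en", "gång", "en", "fågel"]
sparse = hashed_term_frequencies(words)
dense = to_dense(sparse, NUM_FEATURES)
```

Here `sparse` holds at most five entries, while `dense` holds 262144 numbers, almost all of them zeros. This is why sparse storage is the natural choice for short texts such as album titles.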

# Do It Yourself

Try to follow [a tutorial from Spark docs](https://spark.apache.org/docs/3.2.1/ml-classification-regression.html#regression):

* calculate `word2vec` embeddings instead of TF-IDF
* build a linear regression (predict the number of fans by title)
* split data into train and validation sets and evaluate your model
* compare quality of models (TF-IDF vs word2vec, linear vs random forest vs gradient-boosted trees)
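
As a starting point for the evaluation step, here is a minimal pure-Python sketch of a train/validation split and an error metric. It makes several assumptions: RMSE is only one possible metric (the exercise does not prescribe one), the helper names are hypothetical, and the five rows are the ones shown at the top of this notebook. In Spark itself, `DataFrame.randomSplit` and `pyspark.ml.evaluation.RegressionEvaluator` play these roles.

```python
import math
import random


def train_validation_split(rows, validation_fraction=0.2, seed=42):
    # Shuffle a copy deterministically, then cut off the validation part.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def rmse(y_true, y_pred):
    # Root mean squared error between targets and predictions.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


rows = [
    ("Despierta", 2046),
    ("Det Var En Gång En Fågel", 7),
    ("Nein, Mann!", 53),
    ("Some Things", 1688),
    ("Far Away", 1157),
]
train, validation = train_validation_split(rows, validation_fraction=0.4)

# Baseline: always predict the mean number of fans seen in the training set.
mean_fans = sum(fans for _, fans in train) / len(train)
baseline_error = rmse(
    [fans for _, fans in validation],
    [mean_fans] * len(validation),
)
```

A constant-mean baseline like this gives a reference error: a model that uses the title is only useful if its validation error is lower.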