Import spark, data and local libraries

PySpark functions provide many utilities to process data in the cluster.

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pandas as pd
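import matplotlib as plt  # NOTE: alias of the top-level package, so plots are called as plt.pyplot.*
import matplotlib.pyplot  # make sure the plt.pyplot submodule is loaded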
import mpd

In [None]:
# Will allow us to embed images in the notebook
%matplotlib inline
# change default plot size
plt.rcParams['figure.figsize'] = (15,10)

Build [spark connection](https://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sparksession) to running cluster

We need a Spark context to inspect and control heap space. Create it from the Spark session.
https://stackoverflow.com/a/41267415

In [None]:
from pyspark import SparkContext, SparkConf
#conf = SparkConf().setAppName("mpd").setMaster("spark://localhost:")
#sc = spark.sparkContext

Trying to get the information about driver memory to avoid heap errors
https://datascience.stackexchange.com/a/12191

sc._conf.get('spark.driver.memory')

type(sc)

sc = spark.sparkContext

sc

We can also [get the full configuration of the spark context](https://stackoverflow.com/a/30564947).

This shows the attribute has changed from spark.driver.memory to spark.executor.memory.

sc._conf.getAll()

#conf = SparkConf().setAppName("Python Spark SQL basic example")
#conf = (sc._conf.setMaster('spark://c0015:7077')
#        .set('spark.executor.memory', '4G')
#        .set('spark.driver.memory', '45G')
#        .set('spark.driver.maxResultSize', '10G'))
#sc = SparkContext(conf=conf)


#conf = SparkConf().setAppName("Python Spark SQL basic example")
conf = (sc._conf
        .set('spark.executor.memory', '4G')
        .set('spark.driver.memory', '45G')
        .set('spark.driver.maxResultSize', '10G'))

In [None]:
spark = SparkSession.builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()
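
Memory settings generally have to be in place before the driver JVM starts, so values set on the configuration of an already running context may be recorded without taking effect. A minimal sketch of passing the same options to the builder instead, assuming no session is running yet. The `apply_options` helper and the `MEMORY_OPTIONS` name are hypothetical; the option values are the ones used in the cells above.

```python
# Hypothetical helper: apply a dict of Spark options to a session builder.
# Works with any builder exposing .config(key, value), as SparkSession.builder does.
MEMORY_OPTIONS = {
    "spark.executor.memory": "4G",
    "spark.driver.memory": "45G",
    "spark.driver.maxResultSize": "10G",
}


def apply_options(builder, options):
    """Chain builder.config(key, value) for every option and return the builder."""
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder


# Intended use in this notebook (not executed here, it needs a Spark installation):
#   builder = SparkSession.builder.appName("Python Spark SQL basic example")
#   spark = apply_options(builder, MEMORY_OPTIONS).getOrCreate()
```

Checking `sc._conf.getAll()` after the session is created shows whether the values were picked up.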

In [None]:
sc._conf.getAll()

Load the MPD data set, run a basic count as a sanity check, and print the schema for reference.

In [None]:
mpd_all=mpd.load(spark, "onebig", 1)

In [None]:
mpd_all.count()

In [None]:
mpd_all.printSchema()

# Explore power law of two variables

The power law is a linear relationship between the logarithms of variables. [Leskovec2014, 1.3.6]
A good blog on [power law characteristics and the type of power laws to expect](https://www.fs.blog/2017/11/power-laws/)

Look at the top N artists, albums, songs and observe the power law relationship.
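
To make "linear relationship between the logarithms" concrete, here is a small plain-Python sketch that fits a straight line to log10(count) against log10(rank). The data is synthetic (generated from y = 1000 * x^-2), and `loglog_slope` is a hypothetical helper name, not part of the notebook's libraries.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope and intercept of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic ranks and counts following y = 1000 * x^-2 (made-up data).
ranks = list(range(1, 101))
counts = [1000.0 * r ** -2 for r in ranks]
slope, intercept = loglog_slope(ranks, counts)
print(round(slope, 6), round(10 ** intercept, 3))
```

The fitted slope estimates the exponent and `10 ** intercept` estimates the scale factor. On real rank data the fit is only meaningful over the range where the log-log plot actually looks straight.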

## Tracks

Attempt to count tracks by URI. The following approach doesn't work: it selects all the track_uris for each playlist as a single array value, not as individual (countable) tracks.

toptracks=mpd_all.select("tracks.track_uri").groupby("track_uri").count().sort(f.col("count").desc())

toptracks.printSchema()

#toptracks.show()

This one works because each playlist is exploded into two-column rows of pid and track_uri, which can be grouped and counted.

In [None]:
pdf=mpd.playlist_flatten(mpd_all)

In [None]:
pdf.printSchema()
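
The difference between grouping whole arrays and grouping exploded rows can be illustrated without Spark. This is a plain-Python sketch on made-up playlists; it only mirrors the idea and says nothing about how `mpd.playlist_flatten` is actually implemented.

```python
from collections import Counter

# Made-up playlists: each has a pid and an array of track URIs.
playlists = [
    {"pid": 0, "track_uri": ["t:a", "t:b", "t:c"]},
    {"pid": 1, "track_uri": ["t:a", "t:c"]},
    {"pid": 2, "track_uri": ["t:a", "t:b", "t:c"]},
]

# Grouping on the whole array counts identical track lists, not tracks.
by_array = Counter(tuple(p["track_uri"]) for p in playlists)

# Exploding first gives one (pid, track_uri) row per track occurrence.
rows = [(p["pid"], uri) for p in playlists for uri in p["track_uri"]]
by_track = Counter(uri for _, uri in rows)

print(by_array.most_common())
print(by_track.most_common())
```

Only the second counter answers "how many playlists contain this track", which is the quantity ranked in the following sections.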

The track count matches the value in the stats.txt file distributed with the MPD data set.

In [None]:
pdf.count()

## Track Ranks

In [None]:
trackrank = pdf.select("track_uri").groupby("track_uri").count().sort(f.col("count").desc())

In [None]:
trackrank.printSchema()

In [None]:
#trackrank.show()

In [None]:
#trackrank.count()

Capture the track ranks as a local pandas data frame with the .toPandas() method. This speeds up plotting and makes it easy to replot several times without recomputing on the Spark cluster.
Then create an X range to match the length of the ranked tracks.

In [None]:
Y=trackrank.select("count").toPandas()

In [None]:
Y.size

The 'Y.size' matches the unique track count in the MPD, so we are seeing the rank of all the expected tracks.

In [None]:
X=pd.DataFrame({'X': range(1,Y.size+1,1)})

In [None]:
plt.pyplot.scatter(X,Y)

This plot has an extreme hook which doesn't match the power law shape expected.

Inspecting the data range more closely we see the same basic hook in the first 10K songs. 

In [None]:
plt.pyplot.scatter(X.head(10000),Y.head(10000))

But looking at the next 10K, we see the characteristic power law shape. The popularity of songs drops linearly in the log-log graph.

The global shape has a superlinear (faster than linear) drop-off in the first 10K.

The next 1000 tracks show a very linear decay.

In [None]:
plt.pyplot.scatter(X[10000:11000],Y[10000:11000])

Expanding that to the next 10K shows the emerging power law shape again.

In [None]:
plt.pyplot.scatter(X[10000:20000],Y[10000:20000])

Indeed, looking at the remaining data set from 10K onward, we see the hook shape again.

In [None]:
plt.pyplot.scatter(X[10000:X.size],Y[10000:Y.size])

This shape is most prominent in the 10K to 500K range.

In [None]:
plt.pyplot.scatter(X[10000:500000],Y[10000:500000])

But it continues until the very end, where the popularity of tracks drops to one for a large portion of the tracks.

In [None]:
plt.pyplot.scatter(X[500000:X.size],Y[500000:Y.size])

In fact, the tracks that appear just once across all playlists make up almost half of the data set.

In [None]:
Y[Y['count'] < 2].count()

Almost half the tracks appear no more than once in the entire collection of playlists. These songs are not likely to be predictive, unless our intention is to promote the least frequently heard songs, i.e. to promote discovery.

In [None]:
Y[Y['count'] < 2].count() / Y.count()

In [None]:
Y[Y['count'] >= 2].count() + Y[Y['count'] < 2].count()

In [None]:
plt.pyplot.scatter(X.head(100),Y.head(100))

In [None]:
artistrank = pdf.select("artist_uri").groupby("artist_uri").count().sort(f.col("count").desc())

In [None]:
aY=artistrank.select("count").toPandas()  # use the artist ranking, not trackrank

In [None]:
aX=pd.DataFrame({'X': range(1,aY.size+1,1)})

In [None]:
plt.pyplot.scatter(aX,aY)

In [None]:
plt.pyplot.scatter(aX.head(100),aY.head(100))

## Loglog Power Law Graphs
Convert loglog plots to dots instead of lines to avoid continuous interpretations.

Inspecting the log-log graphs should show us linear relationships if this is the standard power law, Y=mX^B. This has been confusing because we see a shape where the more X increases (less popular song), the more Y decreases. According to [the power-law blog](https://www.fs.blog/2017/11/power-laws/) this looks like Y=mX^-2, the inverse square law: the further we are from popular, the less popular we will become. That is, the less likely a track is to be included in a playlist.

From the data above, only about half the tracks are predictive (appear more than once), and of those only about the top 10K appear with any frequency.
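
The expectation that a power law shows up as a straight line on log-log axes follows from taking logarithms of both sides. A short derivation, using the same symbols as the text ($m$ the scale factor, $B$ the exponent):

$$
Y = m X^{B} \;\Longrightarrow\; \log Y = \log m + B \log X
$$

On log-log axes the slope is therefore $B$ and the intercept is $\log m$. For the inverse square case $B = -2$, each tenfold increase in the rank $X$ divides $Y$ by 100.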

## Tracks

In [None]:
plt.pyplot.loglog(X.head(1000),Y.head(1000), linestyle="None", marker=".")

In [None]:
plt.pyplot.loglog(X,Y, linestyle="None", marker=".")

In [None]:
plt.pyplot.loglog(X.head(100),Y.head(100), linestyle="None", marker=".")

In [None]:
plt.pyplot.loglog(X.head(10000),Y.head(10000), linestyle="None", marker=".")

## Artists

Artists follow the same shape in their power law plot. The less popular an artist is, the more steeply their popularity drops.

In [None]:
plt.pyplot.loglog(aX,aY, linestyle="None", marker=".")

# N-grams

Build a playlist summary of all track URIs.

In [None]:
pDF=mpd_all.select("pid", "tracks.track_uri")

## Build Ranked 2-ngrams

In [None]:
from pyspark.ml.feature import NGram

ngram = NGram(n=2, inputCol="track_uri", outputCol="ngrams")                                
ngramdf = ngram.transform(pDF)


In [None]:
ngramdf.printSchema()

In [None]:
#ngramdf.select("ngrams").show(1,False)

Represent 2-grams as an exploded list.

In [None]:
ngramdf.printSchema()

In [None]:
ngram2list = ngramdf.select("pid", f.explode("ngrams"))

In [None]:
ngram2list.printSchema()

Note the count of 2-ngrams is exactly 1 million (the number of playlists) less than the number of songs, because ngrams are built within each playlist. Each playlist yields one fewer 2-ngram than it has songs: the final song in the playlist can't serve as the start of an ngram.

This indicates that a 3-ngram count would be 2 million less, because the last two songs in each playlist would not serve as the start of an ngram.
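
The counting argument can be checked on a toy example. A minimal plain-Python sketch, assuming every playlist has at least n tracks; the playlists are made up and `ngrams` is a hypothetical helper that joins tokens with a space, the way Spark's NGram transformer formats its output.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up playlists, each with at least 3 tracks.
playlists = [
    ["t:a", "t:b", "t:c"],
    ["t:a", "t:c", "t:d", "t:e"],
    ["t:b", "t:c", "t:a", "t:d", "t:e"],
]

total_tracks = sum(len(p) for p in playlists)
for n in (2, 3):
    total_ngrams = sum(len(ngrams(p, n)) for p in playlists)
    # Each playlist loses n - 1 starting positions.
    expected = total_tracks - (n - 1) * len(playlists)
    print(n, total_ngrams, expected)
```

For both values of n the two printed totals agree, which is the same relationship as "track count minus (n - 1) times the number of playlists" on the full data set.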

In [None]:
ngram2cnt = ngram2list.count()
ngram2cnt

In [None]:
ngram2rank = ngram2list.select("col").groupby("col").count().sort(f.col("count").desc())

In [None]:
ngram2rank.printSchema()

ngram2rank.show(5)

In [None]:
ngram2rankcnt = ngram2rank.count()
ngram2rankcnt

There are almost half as many unique 2-ngrams as there are songs, so 2-ngrams are not uncommon.

We need to understand their popularity.

## Plot the ranked ngrams

Plot the 2-gram rank to get a sense of its shape.

There are too many data points to plot meaningfully. Practically, trying to return 36 million data elements into pandas (as we did above with the 2 million) won't work. It's guaranteed to return Java GC and heap errors.

We can look at a subset of the data by sampling down to a small fraction of the data in our cluster. Here we convert to an RDD and then run the sample function to get a smaller number of points (about 360k) and plot those.

Note that the rdd.sample() function returns an RDD, so it can easily be converted to a Spark dataframe and then brought into a pandas dataframe.
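
Sampling without replacement at a given fraction keeps each element independently with that probability, so the sample size is only approximately the fraction of the input. A plain-Python sketch of that idea on made-up data; `bernoulli_sample` is a hypothetical helper and is not how Spark implements it internally.

```python
import random


def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`, preserving order."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]


# Made-up ranked counts, sorted in descending order like the ranked n-grams.
ranked_counts = list(range(100000, 0, -1))
sample = bernoulli_sample(ranked_counts, 0.01, seed=1)

print(len(sample))                             # close to 1% of the input, not exactly 1%
print(sample == sorted(sample, reverse=True))  # the ranking order is kept
```

One consequence for the plots: the x position of a sampled point is its rank within the sample, which is roughly 1/100 of its rank in the full list.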

In [None]:
nYsample=ngram2rank.select("count").rdd.sample(False, .01, 1)

In [None]:
type(nYsample)

In [None]:
nY=nYsample.toDF()

In [None]:
nY.count()

In [None]:
nY=nY.toPandas()

In [None]:
nY.count()

In [None]:
nX=pd.DataFrame({'X': range(1,nY.size+1,1)})

The scatter plot shows the steep falloff and we're out of popular ngrams in the first few hundred.

In other words, most 2grams are unique.  This suggests they won't make a very good predictor of what's next.  At least not in most cases.

In [None]:
plt.pyplot.scatter(nX,nY)

Looking at the log-log plot shows a much more linear shape and intuitively much closer to the power law.

The shape is also very similar to the popularity of songs and artists. There is definitely a Top-100 feel to all these rankings. All the rest just decay in popularity at the same rate.

In [None]:
plt.pyplot.loglog(nX,nY, linestyle="None", marker=".")

The log-log plot suggests we want to know the popularity of the different ngram counts. There are a large number of ngrams that occur very infrequently.

## Review the top 10k

We want to take the Top-10k (sorted by popularity above) and plot them directly. This will let us take a good look at the first part of the curve. Given that we sampled the data to produce the above graphs, we'll get a better sense of a meaningful shape to the curve and some real popularity numbers.

In [None]:
nY1k=ngram2rank.select("count").rdd.take(10000)

In [None]:
type(nY1k)

In [None]:
nY1k=pd.DataFrame(nY1k)

In [None]:
nX1k=pd.DataFrame({'X': range(1,nY1k.size+1,1)})

In [None]:
plt.pyplot.scatter(nX1k,nY1k, linestyle="None", marker=".")

The area under this curve is:

In [None]:
nY1k.sum()

In [None]:
nY1k.min()

That's about 10% of the ngrams that have a popularity over about 200.

In [None]:
nY1k.describe()

In [None]:
plt.pyplot.loglog(nX1k,nY1k, linestyle="None", marker=".")

Here we see the popularity of the Top-10k 2grams ranges from 2500 to about 200.  

These seem like pretty small numbers for a count of 36 million.

## Review the ranked popularity

How many 2-ngrams occur fewer than 200 times?

In [None]:
ngram2rank.printSchema()

Convert the count column, which is actually a transformation method, to an actual column of data representing real values.

Err.

Now realizing I thought this was a method because of the way I was calling it. type(df.count) is a method because it is part of the data frame class. type(df.select("count")) is a data frame holding the count column, because the word count doesn't get mis-parsed there.

The errors below may simply be parsing errors.

In [None]:
nrdd=ngram2rank.select("count").rdd

In [None]:
type(nrdd)

In [None]:
ndf=nrdd.toDF()

In [None]:
ndf.printSchema()

In [None]:
type(ndf.select("count"))

There are actually [several ways of renaming a column](https://stackoverflow.com/a/34077809). selectExpr() is the easiest because it uses a SQL-like syntax, but to use alias, the string first has to be wrapped as a column with f.col().

In [None]:
cntcol=ndf.selectExpr("count as rcount")

In [None]:
type(cntcol)

In [None]:
cntcol.printSchema()

In [None]:
cntcol=ndf.select(f.col("count").alias("rcount"))

In [None]:
type(cntcol.rcount)

In [None]:
ngram2poprank = ngram2rank.selectExpr("count as rcount").groupby("rcount").count().sort(f.col("count").desc())

In [None]:
ngram2poprank.count()

There are 1202 distinct 2-ngram rankings. I've taken away the ngram association, so all ngrams that appear once, for example, are treated as once-occurring 2-ngrams and counted in that group. Likewise for twice-occurring 2-ngrams. This continues on to the many-times-occurring 2-ngrams, the most popular 2-ngrams. These groups will contain fewer ngrams.
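
This "count of counts" grouping can be sketched in plain Python with two passes of `collections.Counter`. The 2-gram occurrences below are made up; the variable names are hypothetical and only hint at which Spark dataframe each step corresponds to.

```python
from collections import Counter

# Made-up 2-gram occurrences.
occurrences = ["a b", "a b", "a b", "b c", "b c", "c d", "d e", "e f"]

# First pass: how often each 2-gram occurs (the role of ngram2rank).
per_ngram = Counter(occurrences)

# Second pass: how many 2-grams share each occurrence count
# (the role of ngram2poprank, with rcount as the key).
per_rcount = Counter(per_ngram.values())

for rcount, n_ngrams in sorted(per_rcount.items()):
    print(rcount, n_ngrams)
```

A useful sanity check is that summing `rcount * n_ngrams` over all buckets gives back the total number of 2-gram occurrences.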

In [None]:
ngram2poprank.printSchema()

In [None]:
type(ngram2poprank)

In [None]:
ngram2poprank.show()

This shows that 30 million of the 35 million 2-ngrams appear only once.

In [None]:
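# "rcount" == 1 compares a string with an int in Python; filter on the column instead
ngram2poprank.filter(f.col("rcount") == 1).show()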

In [None]:
rankY=ngram2poprank.select("count").toPandas()

In [None]:
rankY

In [None]:
rankY.describe()

In [None]:
rankX=pd.DataFrame({'X': range(1,rankY.size+1,1)})

In [None]:
plt.pyplot.scatter(rankX, rankY, linestyle="None", marker=".")
plt.pyplot.xlabel("ngram appears x-times")
plt.pyplot.ylabel("count")

In [None]:
plt.pyplot.semilogy(rankX[0:200], rankY[0:200], linestyle="None", marker=".")
plt.pyplot.xlabel("ngram appears x-times")
plt.pyplot.ylabel("count")

In [None]:
plt.pyplot.loglog(rankX, rankY, linestyle="None", marker=".")
plt.pyplot.xlabel("ngram appears x-times")
plt.pyplot.ylabel("count")

The loglog plot helps identify the popularity of different "appearance count" buckets. The vast majority of 2-ngrams (over 30 million out of 35 million) fall into the "appears only once" bucket. An additional 4 million 2-ngrams fall into the "appears two to three times" buckets. The remaining 1 million appear four or more times. The buckets seem to grow pretty smoothly at first, each appearance count increasing by one. Then there are several hundred buckets with a unique appearance count.

Here it would be good to see a histogram of appearance counts, to chunk the buckets, but expect it to show a steep drop-off like the raw scatter plot.

In [None]:
plt.pyplot.hist(rankX)
plt.pyplot.xlabel("ngram appears x-times")
plt.pyplot.ylabel("count")

But this plot does not make much sense: rankX is just the range 1..N, so its histogram is flat and says nothing about the counts.
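
One way to chunk the buckets is to sum the number of ngrams into bins that double in width, which suits data that falls off this steeply. A plain-Python sketch on made-up numbers; `power_of_two_buckets` is a hypothetical helper, and its input is a mapping from appearance count to number of 2-ngrams.

```python
from collections import Counter


def power_of_two_buckets(count_of_counts):
    """Sum n-gram totals into buckets [1,2), [2,4), [4,8), ... keyed by lower bound."""
    buckets = Counter()
    for rcount, n_ngrams in count_of_counts.items():
        lower = 1 << (rcount.bit_length() - 1)  # largest power of two <= rcount
        buckets[lower] += n_ngrams
    return dict(sorted(buckets.items()))


# Made-up (appearance count -> number of 2-grams) pairs.
count_of_counts = {1: 3000, 2: 400, 3: 150, 4: 60, 5: 30, 7: 9, 8: 5, 200: 1}
print(power_of_two_buckets(count_of_counts))
```

The resulting dictionary can be drawn as a bar chart with a logarithmic y axis, so that the sparse high-count buckets remain visible next to the dominant "appears once" bucket.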

I wonder how far apart the bucket counts are, and at what point consecutive buckets start to differ by more than 1 in popularity.
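
Once the distinct appearance counts are available locally, the point where they stop increasing by one can be found directly. A small plain-Python sketch with made-up values; `first_gap` is a hypothetical helper name.

```python
def first_gap(appearance_counts):
    """Return the first pair of consecutive distinct counts that differ by more than 1."""
    ordered = sorted(set(appearance_counts))
    for low, high in zip(ordered, ordered[1:]):
        if high - low > 1:
            return low, high
    return None


# Made-up appearance counts (the rcount values).
rcounts = [1, 2, 3, 4, 5, 6, 8, 9, 12, 40, 200]
print(first_gap(rcounts))
```

The function returns `None` when every consecutive pair differs by exactly one.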

Looking at the tail of the ranked popularity requires a reverse sort because [spark doesn't have tail function yet](https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2).

In [None]:
ngram2poprankrev = ngram2rank.selectExpr("count as rcount").groupby("rcount").count().sort(f.col("rcount").desc())
ngram2poprankrev.show()

## Explore Pipelines with CountVectorizer()

Use CountVectorizer to count the ngrams that have a min support of 2 and build a vocabulary ranked by popularity.

This is similar to the work above, but just exploring pipelines and higher-level functions.
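
The minimum support here is a document frequency: a term has to appear in at least that many different documents (playlists), regardless of how often it repeats inside one. A plain-Python sketch of that filtering and of the frequency-ordered vocabulary, on made-up documents. `build_vocabulary` is a hypothetical helper, and the alphabetical tie-breaking is an assumption added only to make the output deterministic.

```python
from collections import Counter


def build_vocabulary(documents, min_df):
    """Terms present in at least `min_df` documents, most frequent first."""
    term_count = Counter()   # total occurrences across all documents
    doc_count = Counter()    # number of documents containing the term
    for doc in documents:
        term_count.update(doc)
        doc_count.update(set(doc))
    kept = [t for t in term_count if doc_count[t] >= min_df]
    # Sort by total count descending; ties broken alphabetically for determinism.
    return sorted(kept, key=lambda t: (-term_count[t], t))


# Made-up documents: one list of 2-grams per playlist.
docs = [
    ["a b", "b c", "a b"],
    ["a b", "c d"],
    ["b c", "x y", "x y", "x y"],
]
print(build_vocabulary(docs, min_df=2))
```

Note that "x y" occurs three times but only inside one document, so it is dropped when `min_df=2`.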

from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer(inputCol="ngrams", outputCol="features", minDF=2)
model = cv.fit(ngramdf)
result = model.transform(ngramdf)      
result.select("features").show(1,truncate=False)


result.printSchema()

The vocab size given in the feature vector by CountVectorizer matches the count in the model vocabulary.

len(model.vocabulary)

We can also inspect the most popular 2gram in the vocabulary by simply printing it by its index in the array. The vocabulary is ordered by term frequency, so the most popular entry sits at index 0.

model.vocabulary[0]

help(model)

Add a flatMap to count the terms
https://mingchen0919.github.io/learning-apache-spark/tf-idf.html

from pyspark.sql.types import StringType
df_vocab = result.select('ngrams').rdd.\
           flatMap(lambda x: x[0]).\
            toDF(schema=StringType()).toDF('terms')
df_vocab.show()

df_vocab.count()

Calculate term frequencies

vocab_freq = df_vocab.rdd.countByValue()
#pdf = pd.DataFrame({
#        'term': vocab_freq.keys(),
#        'frequency': vocab_freq.values()
#    })
#tf = spark.createDataFrame(pdf).orderBy('frequency', ascending=False)  # needs the pandas frame commented out above
#tf.show()

df_vocab.printSchema()

#import org.apache.spark.sql.functions.monotonicallyIncreasingId 
newDf = df_vocab.withColumn("uniqueIdColumn", f.monotonically_increasing_id())

Calculate the frequency

vocab_freq = df_vocab.rdd.countByValue()

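pandasdf = pd.DataFrame({
        # countByValue() on a DataFrame's RDD keys the result by Row, so unwrap the term
        'term': [row.terms for row in vocab_freq.keys()],
        'frequency': list(vocab_freq.values())
    })
# build the Spark frame from the pandas frame defined above (not from pdf)
tf = spark.createDataFrame(pandasdf).orderBy('frequency', ascending=False)
tf.show()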

ngram2Y=ngram2rank.select("count").toPandas()

X=pd.DataFrame({'X': range(1,ngram2Y.size+1,1)})

# Explore Numeric Representations

This [post recommends the dataframe functions for string indexing](https://stackoverflow.com/a/43971119).

The functions [IndexToString and StringIndexer](https://spark.apache.org/docs/2.2.0/ml-features.html#indextostring) convert strings to numeric format for ML training, and back again.
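
The round trip these two functions perform can be sketched in plain Python: labels are ordered by descending frequency, each string is replaced by the position of its label, and the inverse step looks the label up again. The data is made up, `fit_labels` is a hypothetical helper, and the alphabetical tie-breaking is an assumption for determinism.

```python
from collections import Counter


def fit_labels(values):
    """Labels ordered by descending frequency; ties alphabetical (an assumption)."""
    counts = Counter(values)
    return sorted(counts, key=lambda v: (-counts[v], v))


track_uris = ["t:b", "t:a", "t:b", "t:c", "t:b", "t:a"]  # made-up data
labels = fit_labels(track_uris)                          # index -> string
to_index = {label: float(i) for i, label in enumerate(labels)}  # string -> index

indexed = [to_index[uri] for uri in track_uris]   # the forward (indexing) step
restored = [labels[int(i)] for i in indexed]      # the inverse (index to string) step
print(labels, indexed)
```

The most frequent string receives index 0.0, and the restored list equals the input, which is the property relied on when mapping model output back to track URIs.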

In [None]:
pdf.printSchema()

In [None]:
pDF=pdf.select("pid", "track_uri")

In [None]:
pDF.printSchema()

In [None]:
pDF=pDF.limit(10000)

In [None]:
pDF.show()

In [None]:
from pyspark.ml.feature import IndexToString, StringIndexer

indexer = StringIndexer(inputCol="track_uri", outputCol="trackID")
model = indexer.fit(pDF)
indexed = model.transform(pDF)

In [None]:
print("Transformed string column '%s' to indexed column '%s'"
      % (indexer.getInputCol(), indexer.getOutputCol()))
indexed.printSchema()

In [None]:
indexed.rdd.take(10)

In [None]:
print("StringIndexer will store labels in output column metadata\n")

converter = IndexToString(inputCol="trackID", outputCol="orig_track_uri")
converted = converter.transform(indexed)

print("Transformed indexed column '%s' back to original string column '%s' using "
      "labels in metadata" % (converter.getInputCol(), converter.getOutputCol()))

converted.printSchema()
#converted.select("track_uri", "trackID", "orig_track_uri").show()

In [None]:
converted.head(10)