# Introducing MLlib
_Movie recommendations using Spark's machine learning library_

## MLlib Capabilities
- Feature extraction
    - Term Frequency / Inverse Document Frequency (TF-IDF), useful for search
- Basic Statistics
    - Chi-squared test, Pearson or Spearman correlation, min, max, mean, variance
- Linear regression, logistic regression
- Support Vector Machines
- Naïve Bayes classifier
- Decision trees
- K-Means clustering
- Principal component analysis, singular value decomposition
- Recommendations using Alternating Least Squares
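
To make the TF-IDF entry above concrete, here is a minimal plain-Python sketch of the idea. It is not MLlib's implementation: the helper name `tf_idf`, the toy documents, and the smoothed IDF `log((N + 1) / (df + 1))` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights.append({
            term: count * math.log((n_docs + 1) / (df[term] + 1))
            for term, count in tf.items()
        })
    return weights

# Toy corpus, made up for this example
docs = [
    "star wars a new hope",
    "star trek the motion picture",
    "gone with the wind",
]
scores = tf_idf(docs)

# Rank documents for a one-word search query by that term's weight
query = "wars"
ranked = sorted(range(len(docs)),
                key=lambda i: scores[i].get(query, 0.0),
                reverse=True)
print(docs[ranked[0]])
```

A term that appears in few documents ("wars") gets a higher weight than one that appears in many ("star"), which is why the measure is useful for search relevance.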

## Special MLlib Data Types
- Vector (dense or sparse)
- LabeledPoint
- Rating
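
The shapes of these three types can be sketched with plain-Python stand-ins. These are illustrative only and are not the pyspark classes; the real ones are `Vectors.dense` / `Vectors.sparse` in `pyspark.mllib.linalg`, `LabeledPoint` in `pyspark.mllib.regression`, and `Rating` in `pyspark.mllib.recommendation`. The helper `to_dense` and all values are made up.

```python
from collections import namedtuple

# Stand-ins for illustration only, mimicking the field layout of the real types
Rating = namedtuple("Rating", ["user", "product", "rating"])
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

# Dense vector: every entry is stored, zeros included
dense = [1.0, 0.0, 0.0, 3.0]

# Sparse vector: the size plus only the non-zero (index, value) pairs
sparse = {"size": 4, "indices": [0, 3], "values": [1.0, 3.0]}

def to_dense(sv):
    """Expand the sparse stand-in into a plain list (hypothetical helper)."""
    out = [0.0] * sv["size"]
    for i, v in zip(sv["indices"], sv["values"]):
        out[i] = v
    return out

# A labeled point pairs a numeric label with a feature vector
point = LabeledPoint(label=1.0, features=dense)

# A rating is a (user, product, rating) triple, the input format for ALS
r = Rating(user=0, product=50, rating=5.0)

print(to_dense(sparse) == dense, point.label, r.product)
```

Sparse storage pays off when most entries are zero, which is typical for text features and for user/movie rating data.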

### For more depth
_Advanced Analytics with Spark from O'Reilly_

# Movie recommendations example

In [1]:
import sys
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

def loadMovieNames():
    # Build a dictionary mapping movie ID -> title from the pipe-delimited item file
    movieNames = {}
    with open("../data/u.ITEM", encoding='ascii', errors="ignore") as f:
        for line in f:
            fields = line.split('|')
            movieNames[int(fields[0])] = fields[1]
    return movieNames

sc = SparkContext("local[*]", "MovieRecommendationsALS")
# ALS builds up a long RDD lineage over its iterations; checkpointing truncates it
sc.setCheckpointDir('checkpoint')

In [9]:
print("\nLoading movie names...")
nameDict = loadMovieNames()

data = sc.textFile("../data/u.data")

# Each line of u.data is whitespace-separated: user ID, movie ID, rating, timestamp
ratings = data.map(lambda l: l.split()).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))).cache()

# Build the recommendation model using Alternating Least Squares
print("\nTraining recommendation model...")
rank = 10
# Lower numIterations if training is too slow or fails on lower-end systems
numIterations = 20
model = ALS.train(ratings, rank, numIterations)

userID = 0 # int(sys.argv[1])

print("\nRatings for user ID " + str(userID) + ":")
userRatings = ratings.filter(lambda l: l[0] == userID)
for rating in userRatings.collect():
    print("\t" + nameDict[int(rating[1])] + ": " + str(rating[2]))

print()

print("\nTop 10 recommendations:")
recommendations = model.recommendProducts(userID, 10)
for recommendation in recommendations:
    print("\t" + nameDict[int(recommendation[1])] +
          " score " + str(recommendation[2]))



Loading movie names...

Training recommendation model...

Ratings for user ID 0:
	Star Wars (1977): 5.0
	Empire Strikes Back, The (1980): 5.0
	Gone with the Wind (1939): 1.0


Top 10 recommendations:
	Schizopolis (1996) score 8.263297649808099
	Die xue shuang xiong (Killer, The) (1989) score 7.751917667658414
	Go Fish (1994) score 7.581591837883766
	Mighty Morphin Power Rangers: The Movie (1995) score 7.567052272721566
	Harlem (1993) score 7.5323154493001345
	Hard Rain (1998) score 6.76689679471007
	Shooting Fish (1997) score 6.612484231447633
	My Man Godfrey (1936) score 6.531903894422301
	Shall We Dance? (1937) score 6.364868078374405
	Inspector General, The (1949) score 6.339408173163396
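
To give a rough feel for what `ALS.train` does with `rank` and `numIterations`, here is a toy rank-1 version in plain Python. It is a sketch under simplifying assumptions, not MLlib's implementation: the real algorithm fits `rank`-dimensional factors and runs distributed. The function names, the `reg` value, and the tiny ratings list are all made up for illustration.

```python
def als_rank1(ratings, n_users, n_items, iterations=10, reg=0.1):
    """Fit rating ~ u[user] * v[item] by alternating closed-form updates."""
    u = [1.0] * n_users  # one factor per user
    v = [1.0] * n_items  # one factor per item
    for _ in range(iterations):
        # Hold item factors fixed and solve each user's regularized least squares
        for user in range(n_users):
            seen = [(item, r) for (usr, item, r) in ratings if usr == user]
            u[user] = (sum(r * v[item] for item, r in seen)
                       / (reg + sum(v[item] ** 2 for item, r in seen)))
        # Hold user factors fixed and solve each item's factor the same way
        for item in range(n_items):
            seen = [(usr, r) for (usr, itm, r) in ratings if itm == item]
            v[item] = (sum(r * u[usr] for usr, r in seen)
                       / (reg + sum(u[usr] ** 2 for usr, r in seen)))
    return u, v

def rmse(ratings, u, v):
    """Root mean squared error of the factor model on the given ratings."""
    return (sum((r - u[i] * v[j]) ** 2 for i, j, r in ratings) / len(ratings)) ** 0.5

# (user, item, rating) triples; each user has rated only some items
toy = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 2.0), (2, 1, 5.0), (2, 2, 3.0)]

before = rmse(toy, [1.0] * 3, [1.0] * 3)
u, v = als_rank1(toy, n_users=3, n_items=3)
after = rmse(toy, u, v)
print(round(before, 2), "->", round(after, 2))

# Predicted score for a pair that was never rated: user 0, item 2
print(u[0] * v[2])
```

Note that nothing constrains a predicted score to the 1 to 5 rating scale, which is consistent with the scores above 5 in the output shown earlier.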


## Results aren't really that great.
- Very sensitive to the parameters chosen. It takes more work to find optimal parameters for a data set than to run the recommendations themselves
    - Can use "train/test" to evaluate various permutations of parameters
    - But what is a "good recommendation" anyway?
- I'm not convinced it's even working properly internally
    - Putting your faith in a black box is dodgy
    - We'd get better results using our movie similarity results instead, to find similar movies to movies each user liked.
    - Complicated isn't always better.
- Never blindly trust results when analyzing big data
    - Small problems in algorithms become big ones
    - Very often, quality of your input data is the real issue
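
The "train/test" idea mentioned above can be sketched as a small grid search: hold out part of the ratings, train on the rest for each combination of parameters, and compare error on the held-out part. To stay self-contained, this example uses synthetic ratings and a stand-in model (a shrunk per-item mean, which is not ALS); `grid_search`, `train_item_mean`, and the parameters `shrink` and `min_count` are hypothetical names. With MLlib you would swap in a training function that calls `ALS.train` and vary `rank` and `numIterations` instead.

```python
import itertools
import math
import random

def split(ratings, test_fraction=0.2, seed=42):
    """Shuffle and split ratings into (train, test) lists."""
    shuffled = ratings[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def train_item_mean(train, shrink, min_count):
    """Stand-in model (NOT ALS): per-item mean shrunk toward the global mean."""
    global_mean = sum(r for _, _, r in train) / len(train)
    sums, counts = {}, {}
    for _, item, r in train:
        sums[item] = sums.get(item, 0.0) + r
        counts[item] = counts.get(item, 0) + 1

    def predict(user, item):
        n = counts.get(item, 0)
        # Fall back to the global mean for rarely rated or unseen items
        if n < min_count or n + shrink == 0:
            return global_mean
        return (sums.get(item, 0.0) + shrink * global_mean) / (n + shrink)

    return predict

def rmse(predict, test):
    """Root mean squared error of a predict(user, item) function on test data."""
    return math.sqrt(sum((predict(u, i) - r) ** 2 for u, i, r in test) / len(test))

def grid_search(train_fn, grid, train, test):
    """Try every combination in the grid; return (best_rmse, best_params)."""
    names = list(grid)
    scored = []
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        scored.append((rmse(train_fn(train, **params), test), params))
    return min(scored, key=lambda pair: pair[0])

# Synthetic (user, item, rating) data so the example runs on its own
rng = random.Random(0)
ratings = [(u, i, float(rng.randint(1, 5))) for u in range(30) for i in range(10)]
train, test = split(ratings)

grid = {"shrink": [0.0, 1.0, 5.0, 25.0], "min_count": [1, 5]}
best_rmse, best_params = grid_search(train_item_mean, grid, train, test)
print(best_params, round(best_rmse, 3))
```

A low held-out RMSE only says the model predicts ratings well; whether that makes a "good recommendation" remains the open question raised above.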
    
### MLlib is still really useful!
