# Similar Movies
- Import movie data
- Import user-movie-rating data
- "Wrangle" the data a bit
  - Merge the two together
  - Create a "view" of the data where movie ratings are organized by user ID
  - Isolate the ratings of a movie-of-choice across all users
  - Correlate the movie-of-choice ratings with the ratings of other movies

## Dependencies

In [13]:
import pandas as pd
import numpy as np

## Import & Preview the Data

In [14]:
r_cols = ['user_id', 'movie_id', 'rating']
# m_cols = ['movie_id', 'title']
m_cols = ["movie_id",
"title",
"release_date",
"video_release_date",
"IMDb_URL",
"unknown",
"Action",
"Adventure",
"Animation",
"Children's",
"Comedy",
"Crime",
"Documentary",
"Drama",
"Fantasy",
"Film-Noir",
"Horror",
"Musical",
"Mystery",
"Romance",
"Sci-Fi",
"Thriller",
"War",
"Western"]

movieFileName = 'ml-100k/u.item'
ratingFileName = 'ml-100k/u.data'

ratings = pd.read_csv(ratingFileName, sep='\t', names=r_cols, usecols=range(len(r_cols)), encoding="ISO-8859-1")
movies = pd.read_csv(movieFileName, sep='|', names=m_cols, usecols=range(len(m_cols)), encoding="ISO-8859-1")

moviesAndRatings = pd.merge(movies, ratings)
moviesAndRatings.head()

| | movie_id | title | release_date | video_release_date | IMDb_URL | unknown | Action | Adventure | Animation | Children's | ... | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | user_id | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 308 | 4 |
| 1 | 1 | Toy Story (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 287 | 5 |
| 2 | 1 | Toy Story (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 148 | 4 |
| 3 | 1 | Toy Story (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 280 | 4 |
| 4 | 1 | Toy Story (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 66 | 3 |


### Merging Observation
Each row represents a single user-to-movie rating row.  
Many rows per movie.  
Many rows per user_id.  
1 row per user/movie intersection.  
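
As a minimal sketch of what `pd.merge` does in the cell above, here are two tiny made-up frames (the `toy_` names, titles, and ratings are illustrative assumptions, not taken from the dataset). With no `on=` argument, pandas joins on the columns the two frames share, which in this notebook is `movie_id`.

```python
import pandas as pd

# Toy stand-ins for the movies and ratings tables (all values are made up)
toy_movies = pd.DataFrame({"movie_id": [1, 2], "title": ["Movie A", "Movie B"]})
toy_ratings = pd.DataFrame({
    "user_id": [10, 11, 10],
    "movie_id": [1, 1, 2],
    "rating": [4, 5, 3],
})

# With no `on=` argument, merge joins on the shared column(s): here, movie_id
toy_merged = pd.merge(toy_movies, toy_ratings)

# One row per user/movie rating, with the movie title attached to each
print(toy_merged)
```

Each rating row picks up the title of its movie, so a movie with several ratings appears on several rows.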

In [15]:
# Inspect the raw ratings (ratings is already a DataFrame)
print('RATINGS...')
print(ratings.head())

RATINGS...
   user_id  movie_id  rating
0        0        50       5
1        0       172       5
2        0       133       1
3      196       242       3
4      186       302       3


## Pick A Movie

In [16]:
# MOVIE_OF_CHOICE = 'Star Wars (1977)'
MOVIE_OF_CHOICE = 'Toy Story (1995)'

## Similar Movies By Ratings

### Create A "Pivoted" View Of the Data: Movie-Rating-By-User
Now the amazing pivot_table function on a DataFrame will construct a user / movie rating matrix. Note how NaN indicates missing data - movies that specific users didn't rate.

In [17]:
movieRatingsByUserId = moviesAndRatings.pivot_table(index=['user_id'],columns=['title'],values='rating')


# Preview the matrix:
# users with no rating for a movie show NaN for that user/movie
movieRatingsByUserId.head()

| user_id | 'Til There Was You (1997) | 1-900 (1994) | 101 Dalmatians (1996) | 12 Angry Men (1957) | 187 (1997) | 2 Days in the Valley (1996) | 20,000 Leagues Under the Sea (1954) | 2001: A Space Odyssey (1968) | 3 Ninjas: High Noon At Mega Mountain (1998) | 39 Steps, The (1935) | ... | Yankee Zulu (1994) | Year of the Horse (1997) | You So Crazy (1994) | Young Frankenstein (1974) | Young Guns (1988) | Young Guns II (1990) | Young Poisoner's Handbook, The (1995) | Zeus and Roxanne (1997) | unknown | Á köldum klaka (Cold Fever) (1994) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | 2.0 | 5.0 | NaN | NaN | 3.0 | 4.0 | NaN | NaN | ... | NaN | NaN | NaN | 5.0 | 3.0 | NaN | NaN | NaN | 4.0 | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
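
The reshaping can be easier to follow on a tiny example. The sketch below uses made-up ratings (the `toy` frame and its values are assumptions for illustration) and the same `pivot_table` arguments as the notebook cell.

```python
import pandas as pd

# Long format: one row per user/movie rating (made-up values)
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie B"],
    "rating": [5, 3, 4, 2],
})

# Wide format: rows become users, columns become titles, cells hold the rating
toy_matrix = toy.pivot_table(index="user_id", columns="title", values="rating")

# User 2 never rated Movie B and user 3 never rated Movie A, so those cells are NaN
print(toy_matrix)
```

Note that `pivot_table` aggregates with the mean by default, so a user who rated the same title twice would end up with the average of the two ratings in that cell.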


### Extract Movie-of-Choice Only Ratings Data

In [18]:
movieOfChoiceRatingsByUser = movieRatingsByUserId[MOVIE_OF_CHOICE]
movieOfChoiceRatingsByUser.head()

user_id
0    NaN
1    5.0
2    4.0
3    NaN
4    NaN
Name: Toy Story (1995), dtype: float64

### Similar-Scored-Movies: Correlate Movie-Of-Choice Ratings with Other Movie Ratings
Pandas' [corrwith](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corrwith.html) can be used to compute the "pairwise correlation" (_link tbd_) of the chosen movie's vector of user ratings with the ratings vector of every other movie.  

In [19]:
movieSimilarityScores = movieRatingsByUserId.corrwith(movieOfChoiceRatingsByUser)
movieSimilarityScores = movieSimilarityScores.dropna()

# Temporary Data-Frame for previewing with head()
movieSimilarityScoresDF = pd.DataFrame(movieSimilarityScores)
movieSimilarityScoresDF.head(10)

# NOTE: The printed warning is safe to ignore

  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)


| title | 0 |
|---|---|
| 'Til There Was You (1997) | 0.534522 |
| 101 Dalmatians (1996) | 0.232118 |
| 12 Angry Men (1957) | 0.334943 |
| 187 (1997) | 0.651857 |
| 2 Days in the Valley (1996) | 0.162728 |
| 20,000 Leagues Under the Sea (1954) | 0.328472 |
| 2001: A Space Odyssey (1968) | -0.06906 |
| 39 Steps, The (1935) | 0.150055 |
| 8 1/2 (1963) | -0.117259 |
| 8 Heads in a Duffel Bag (1997) | 0.5 |
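
To make the `corrwith` step concrete, here is a small sketch on a made-up user/movie matrix (the `toy_matrix` values are assumptions for illustration). By default `corrwith` computes the Pearson correlation of each column with the given Series, using only the rows where both have a value.

```python
import numpy as np
import pandas as pd

# Made-up ratings: rows are users, columns are movies, NaN means "not rated"
toy_matrix = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0, 2.0],
        "Movie B": [4.0, 5.0, 2.0, 1.0],
        "Movie C": [1.0, 2.0, 5.0, np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="user_id"),
)

# Correlate every movie's ratings column with Movie A's ratings column
toy_scores = toy_matrix.corrwith(toy_matrix["Movie A"])
print(toy_scores)
```

Movie B is rated much like Movie A and scores about 0.8, while Movie C is rated the opposite way by the three users who rated both and scores about -1.0. Movie A correlates perfectly with itself.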


### Sort Similar-Movie Correlation Scores
Let's sort the results by similarity score, and we should have the movies most similar to The Movie Of Choice!   

In [20]:
movieSimilarityScores.sort_values(ascending=False)

title
Roseanna's Grave (For Roseanna) (1997)     1.0
Substance of Fire, The (1996)              1.0
Stranger, The (1994)                       1.0
Wooden Man's Bride, The (Wu Kui) (1994)    1.0
Newton Boys, The (1998)                    1.0
                                          ... 
Slingshot, The (1993)                     -1.0
Heavy (1995)                              -1.0
Stalker (1979)                            -1.0
Feast of July (1995)                      -1.0
Love and Death on Long Island (1997)      -1.0
Length: 1370, dtype: float64

### Cleanup: Get Movie-Rating counts and average rating score
These results are probably skewed by movies that were rated by only a handful of people who also happened to rate The Movie Of Choice.  
Here: 
- count how many ratings exist for each movie 
- remove movies that were rated by only a few people
- get the average rating per movie (_extra detail for now_)
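
The following sketch illustrates why rarely rated movies produce extreme scores. The two Series are made-up ratings (an assumption for illustration): when only two users rated both movies, the Pearson correlation is computed from two points and can only come out as +1 or -1, as long as neither movie got the same rating twice.

```python
import numpy as np
import pandas as pd

# Made-up ratings from five users for the chosen movie
chosen = pd.Series([5.0, 3.0, 4.0, 2.0, 1.0], name="chosen movie")
# An obscure movie that only two of those five users rated
obscure = pd.Series([np.nan, 2.0, np.nan, np.nan, 1.0], name="obscure movie")

# Number of users who rated both movies
overlap = int((chosen.notna() & obscure.notna()).sum())

# Pearson correlation over the overlapping users only
score = chosen.corr(obscure)
print(overlap, score)
```

With an overlap of two, the score lands at (or within float rounding of) 1.0 even though it says almost nothing about how similar the movies are, which is the motivation for filtering by rating count.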

In [21]:
# String names select pandas' built-in aggregations (same result as np.size / np.mean)
movieStats = moviesAndRatings.groupby('title').agg({'rating': ['size', 'mean']})
movieStats.head()

| title | (rating, size) | (rating, mean) |
|---|---|---|
| 'Til There Was You (1997) | 9 | 2.333333 |
| 1-900 (1994) | 5 | 2.6 |
| 101 Dalmatians (1996) | 109 | 2.908257 |
| 12 Angry Men (1957) | 125 | 4.344 |
| 187 (1997) | 41 | 3.02439 |


### Cleanup: Limiting By Review Count
Let's get rid of any movies rated by fewer than 100 people and check the top-rated ones that are left.
100 might still be too low, but these results look pretty good as far as "well-rated movies that people have heard of."

In [22]:
MINIMUM_NUMBER_OF_RATINGS = 100
popularMovies = movieStats['rating']['size'] >= MINIMUM_NUMBER_OF_RATINGS
movieStats[popularMovies].sort_values([('rating', 'mean')], ascending=False)[:15]

| title | (rating, size) | (rating, mean) |
|---|---|---|
| Close Shave, A (1995) | 112 | 4.491071 |
| Schindler's List (1993) | 298 | 4.466443 |
| Wrong Trousers, The (1993) | 118 | 4.466102 |
| Casablanca (1942) | 243 | 4.45679 |
| Shawshank Redemption, The (1994) | 283 | 4.44523 |
| Rear Window (1954) | 209 | 4.38756 |
| Usual Suspects, The (1995) | 267 | 4.385768 |
| Star Wars (1977) | 584 | 4.359589 |
| 12 Angry Men (1957) | 125 | 4.344 |
| Citizen Kane (1941) | 198 | 4.292929 |


### Merge Rating-Score Data With Similarity-Score Data
Joining The Data:
- a dataset with `title, rating|size, rating|mean`
- a dataset with `title, similarity`
- those two merged
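
The join described above can be sketched on toy data first. Everything named `toy_` below is a made-up assumption for illustration. The two-level column labels produced by `agg` are flattened to `rating|size` and `rating|mean`, and the similarity scores are then joined on the shared `title` index.

```python
import pandas as pd

# Made-up long-format ratings
toy_long = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B"],
    "rating": [4, 5, 3],
})

# Per-title rating count and mean; the columns are two-level: ('rating', 'size'), ...
toy_stats = toy_long.groupby("title").agg({"rating": ["size", "mean"]})

# Flatten ('rating', 'size') -> 'rating|size' so the frame has plain column names
toy_stats.columns = [f"{top}|{sub}" for top, sub in toy_stats.columns]

# Made-up similarity scores, indexed by title like the stats frame
toy_similarity = pd.Series({"Movie A": 1.0, "Movie B": 0.25}).to_frame("similarity")

# join() aligns the two frames on their index (the movie title)
toy_joined = toy_stats.join(toy_similarity)
print(toy_joined)
```

Because `join` is a left join by default, only titles present in the stats frame survive, which is how the popularity filter carries over to the similarity scores.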

In [23]:
mappedColumnsMoviestat=movieStats[popularMovies]
mappedColumnsMoviestat.columns=[f'{i}|{j}' if j != '' else f'{i}' for i,j in mappedColumnsMoviestat.columns]
# COLUMNS: title, rating|size, rating|mean
# mappedColumnsMoviestat.head()


similarityScoreDF = pd.DataFrame(movieSimilarityScores, columns=['similarity'])
# COLUMNS: title, similarity
# similarityScoreDF.head()


mappedColumnsMoviestatDF = mappedColumnsMoviestat.join(similarityScoreDF)
mappedColumnsMoviestatDF.head()

| title | rating\|size | rating\|mean | similarity |
|---|---|---|---|
| 101 Dalmatians (1996) | 109 | 2.908257 | 0.232118 |
| 12 Angry Men (1957) | 125 | 4.344 | 0.334943 |
| 2001: A Space Odyssey (1968) | 259 | 3.969112 | -0.06906 |
| Absolute Power (1997) | 127 | 3.370079 | 0.31858 |
| Abyss, The (1989) | 151 | 3.589404 | 0.329058 |


Now sort these new results by similarity score. That's more like it!

In [24]:
mappedColumnsMoviestatDF.sort_values(['similarity'], ascending=False)[:15]

| title | rating\|size | rating\|mean | similarity |
|---|---|---|---|
| Toy Story (1995) | 452 | 3.878319 | 1.0 |
| Craft, The (1996) | 104 | 3.115385 | 0.5491 |
| Down Periscope (1996) | 101 | 2.70297 | 0.457995 |
| Miracle on 34th Street (1994) | 101 | 3.722772 | 0.456291 |
| G.I. Jane (1997) | 175 | 3.36 | 0.454756 |
| Amistad (1997) | 124 | 3.854839 | 0.449915 |
| Beauty and the Beast (1991) | 202 | 3.792079 | 0.44296 |
| Mask, The (1994) | 129 | 3.193798 | 0.432855 |
| Cinderella (1950) | 129 | 3.581395 | 0.428372 |
| That Thing You Do! (1996) | 176 | 3.465909 | 0.427936 |


Ideally we'd also filter out the movie we started from - of course The Movie-Of-Choice is 100% similar to itself. But otherwise these results aren't bad.  
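
One possible way to do that filtering is to drop the row whose index label is the chosen title. The sketch below rebuilds a three-row stand-in for the results (the `toy_results` frame is an assumption for illustration, using the first three similarity values shown above) rather than reusing the notebook's variables.

```python
import pandas as pd

MOVIE_OF_CHOICE = "Toy Story (1995)"

# Small stand-in for the joined results, indexed by title
toy_results = pd.DataFrame(
    {"similarity": [1.0, 0.5491, 0.457995]},
    index=pd.Index(
        ["Toy Story (1995)", "Craft, The (1996)", "Down Periscope (1996)"],
        name="title",
    ),
)

# Drop the chosen movie's own row; errors="ignore" keeps this safe
# if that title was already removed by the popularity filter
others = toy_results.drop(index=MOVIE_OF_CHOICE, errors="ignore")
print(others.sort_values("similarity", ascending=False))
```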

### Review
Above, similar movies are calculated from:
- the correlation of each movie's user ratings with the ratings of the movie-of-choice
- a minimum "cutoff" number of ratings per movie (100)
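
As a rough recap, the whole approach can be condensed into one function. This is only a sketch under stated assumptions: `similar_movies` is a hypothetical helper that does not appear in the notebook, it expects a long-format frame with `user_id`, `title`, and `rating` columns, and the data it runs on here is made up.

```python
import pandas as pd

def similar_movies(ratings_long, movie_of_choice, min_ratings):
    """Rank titles by how their ratings correlate with movie_of_choice.

    Hypothetical helper for illustration; ratings_long needs the
    columns user_id, title and rating.
    """
    # User-by-title matrix of ratings (NaN where a user did not rate a title)
    matrix = ratings_long.pivot_table(index="user_id", columns="title", values="rating")
    # Correlation of every title's ratings with the chosen title's ratings
    similarity = matrix.corrwith(matrix[movie_of_choice]).dropna()
    # Keep only titles with at least min_ratings ratings
    counts = ratings_long.groupby("title")["rating"].size()
    popular = counts[counts >= min_ratings].index
    result = similarity[similarity.index.isin(popular)]
    # Remove the chosen title itself and rank the rest
    return result.drop(movie_of_choice, errors="ignore").sort_values(ascending=False)

# Made-up ratings: titles A and B have three ratings each, C has only two
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 3, 1, 2],
    "title": ["A"] * 3 + ["B"] * 3 + ["C"] * 2,
    "rating": [5, 3, 1, 4, 3, 2, 1, 5],
})
toy_ranked = similar_movies(toy, "A", min_ratings=3)
print(toy_ranked)
```

In this toy run, title C is excluded by the cutoff and title A is excluded as the starting point, leaving only title B.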