# HW 01
#### Name: Joanie Gannon
#### Name: Jake Schaeffer
#### Class: CSCI 349 - Intro to Data Mining
#### Semester: Spring 2020
#### Instructor: Brian King

In [2]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

# Phase I

The ratings file is a log of movies their customers have watched. For the first phase of the project, you can keep the
problem simple. Ignore the actual numeric rating and timestamp variables, and convert the ratings file into a set of
transactions, where the universe of all possible items is the set of movies. Each row is a customer, and the items are the
movies they watched. Your objective is to output a set of the strongest, most interesting association rules you
can. Try to generate at least 10-20 rules. A strong association rule can be interpreted as a potential
recommendation. Your rules must contain actual movie names, and not movie ids!

----
### Process:

- 1) Start by reading in the CSV files for ratings and movies
- 2) In the ratings data frame, drop the rating and timestamp columns
- 3) Replace the movieId column with the corresponding movie title
- 4) Binarize the data and generate rules
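
The steps above can be sketched on a tiny, made-up dataset before running them on the full files. This is only an illustration: the toy frames stand in for `ratings.csv` and `movies.csv`, and `pd.crosstab` is used here as a compact alternative to the `get_dummies` plus `groupby` approach in the cells below.

```python
import pandas as pd

# Toy stand-ins for ratings.csv and movies.csv (values are made up).
ratings = pd.DataFrame({
    "userId":    [1, 1, 2, 2, 3],
    "movieId":   [10, 20, 10, 30, 20],
    "rating":    [4.0, 3.5, 5.0, 2.0, 4.5],
    "timestamp": [0, 1, 2, 3, 4],
})
movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title":   ["Movie A (1990)", "Movie B (1995)", "Movie C (2001)"],
    "genres":  ["Comedy", "Action|Comedy", "Horror"],
})

# Step 2: keep only the columns that define "user watched movie".
watched = ratings[["userId", "movieId"]]

# Step 3: replace each movieId with its title.
watched = watched.merge(movies[["movieId", "title"]], on="movieId")

# Step 4: one row per user, one boolean column per title.
basket = pd.crosstab(watched["userId"], watched["title"]).astype(bool)
print(basket)
```

The resulting boolean matrix has the same layout that `fpgrowth` expects: users as rows, titles as columns.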

In [54]:
df_ratings = pd.read_csv("../data/ml-latest-small/ratings.csv")
df_movies = pd.read_csv("../data/ml-latest-small/movies.csv")
df_ratings.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


In [55]:
#Drop unneeded data
df_ratings = df_ratings.drop(columns = ['rating','timestamp'], errors = "ignore")
df_movies.index = df_movies.movieId
df_movies = df_movies.drop(columns = ['genres', 'movieId'], errors = "ignore")

In [56]:
#Map movieId with title
df_titles = df_ratings.merge(right = df_movies, right_on = 'movieId', left_on = 'movieId')
df_titles = df_titles.sort_values(['userId','movieId'])
df_titles = df_titles.reset_index()
df_titles = df_titles.drop(columns = ['movieId','index'], errors = "ignore")
df_titles

Unnamed: 0,userId,title
0,1,Toy Story (1995)
1,1,Grumpier Old Men (1995)
2,1,Heat (1995)
3,1,Seven (a.k.a. Se7en) (1995)
4,1,"Usual Suspects, The (1995)"
...,...,...
100831,610,Split (2017)
100832,610,John Wick: Chapter Two (2017)
100833,610,Get Out (2017)
100834,610,Logan (2017)


In [6]:
#Let's make the title a categorical
title_cat = pd.Categorical(df_titles.title)
df_titles.title = title_cat
df_titles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 2 columns):
userId    100836 non-null int64
title     100836 non-null category
dtypes: category(1), int64(1)
memory usage: 1.3 MB


In [7]:
#Now we binarize the data
df_movies_binarized = pd.get_dummies(data = df_titles.title)
df_movies_binarized = df_movies_binarized.set_index(df_titles.userId)
df_movies_binarized = df_movies_binarized.groupby("userId").max()
df_movies_binarized

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
607,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
608,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0
609,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
#Generate supports
fp_support = fpgrowth(df_movies_binarized, min_support=0.3, use_colnames=True)
fp_support

Unnamed: 0,support,itemsets
0,0.539344,(Forrest Gump (1994))
1,0.503279,(Pulp Fiction (1994))
2,0.457377,"(Silence of the Lambs, The (1991))"
3,0.455738,"(Matrix, The (1999))"
4,0.411475,(Star Wars: Episode IV - A New Hope (1977))
5,0.390164,(Jurassic Park (1993))
6,0.388525,(Braveheart (1995))
7,0.360656,(Schindler's List (1993))
8,0.357377,(Fight Club (1999))
9,0.352459,(Toy Story (1995))


In [9]:
#Generated rules
ars = association_rules(fp_support, metric = "lift", min_threshold=1.3)
ars = ars.sort_values(by = "confidence", ascending = False)
ars

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
16,(Star Wars: Episode V - The Empire Strikes Bac...,(Star Wars: Episode IV - A New Hope (1977)),0.345902,0.411475,0.311475,0.900474,2.188403,0.169145,5.913271
12,(Jurassic Park (1993)),(Forrest Gump (1994)),0.390164,0.539344,0.32459,0.831933,1.542489,0.114157,2.740902
14,(Braveheart (1995)),(Forrest Gump (1994)),0.388525,0.539344,0.3,0.772152,1.431649,0.090451,2.021767
17,(Star Wars: Episode IV - A New Hope (1977)),(Star Wars: Episode V - The Empire Strikes Bac...,0.411475,0.345902,0.311475,0.756972,2.188403,0.169145,2.691454
0,(Pulp Fiction (1994)),(Forrest Gump (1994)),0.503279,0.539344,0.377049,0.749186,1.389068,0.105609,1.83664
4,"(Silence of the Lambs, The (1991))",(Pulp Fiction (1994)),0.457377,0.503279,0.339344,0.741935,1.474204,0.109156,1.924795
11,(Star Wars: Episode IV - A New Hope (1977)),"(Matrix, The (1999))",0.411475,0.455738,0.3,0.729084,1.599788,0.112475,2.008968
18,"(Shawshank Redemption, The (1994))",(Forrest Gump (1994)),0.519672,0.539344,0.378689,0.728707,1.351097,0.098406,1.697998
2,(Pulp Fiction (1994)),"(Shawshank Redemption, The (1994))",0.503279,0.519672,0.363934,0.723127,1.391506,0.102395,1.734831
6,"(Silence of the Lambs, The (1991))",(Forrest Gump (1994)),0.457377,0.539344,0.32623,0.713262,1.322461,0.079546,1.606537


In [10]:
#Print rules
for _, rule in ars.iterrows():
    antecedents = list(rule.antecedents)
    consequents = list(rule.consequents)
    print("{}   ->   {}".format(antecedents, consequents))

['Star Wars: Episode V - The Empire Strikes Back (1980)']   ->   ['Star Wars: Episode IV - A New Hope (1977)']
['Jurassic Park (1993)']   ->   ['Forrest Gump (1994)']
['Braveheart (1995)']   ->   ['Forrest Gump (1994)']
['Star Wars: Episode IV - A New Hope (1977)']   ->   ['Star Wars: Episode V - The Empire Strikes Back (1980)']
['Pulp Fiction (1994)']   ->   ['Forrest Gump (1994)']
['Silence of the Lambs, The (1991)']   ->   ['Pulp Fiction (1994)']
['Star Wars: Episode IV - A New Hope (1977)']   ->   ['Matrix, The (1999)']
['Shawshank Redemption, The (1994)']   ->   ['Forrest Gump (1994)']
['Pulp Fiction (1994)']   ->   ['Shawshank Redemption, The (1994)']
['Silence of the Lambs, The (1991)']   ->   ['Forrest Gump (1994)']
['Silence of the Lambs, The (1991)']   ->   ['Shawshank Redemption, The (1994)']
['Forrest Gump (1994)']   ->   ['Shawshank Redemption, The (1994)']
['Shawshank Redemption, The (1994)']   ->   ['Pulp Fiction (1994)']
['Forrest Gump (1994)']   ->   ['Pulp Fiction (19

---
### Discuss Finding Phase I

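We find that several popular movies, such as Jurassic Park, Forrest Gump, and Braveheart, imply that other popular movies were watched as well. Because the consequents are mostly the most-watched films (Forrest Gump alone has a support of 0.54), lifts in the 1.3 to 1.5 range indicate only a modest association. The most interesting finding is that Star Wars: Episode V and Episode IV are often watched together (lift of about 2.19, with 90% confidence from Episode V to Episode IV). Star Wars: Episode IV and The Matrix are also watched together frequently (lift of about 1.60).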
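
To make the Star Wars rule concrete, the metrics reported by `association_rules` can be recomputed by hand from the three supports in the rule table. The sketch below uses the standard definitions of confidence, lift, leverage, and conviction; the inputs are the rounded values printed in the table, so the results agree with it only up to rounding.

```python
# Supports copied from the rule table (Empire Strikes Back -> A New Hope).
support_both = 0.311475        # fraction of users who watched both films
support_antecedent = 0.345902  # fraction who watched Episode V
support_consequent = 0.411475  # fraction who watched Episode IV

# confidence = P(consequent | antecedent)
confidence = support_both / support_antecedent
# lift = confidence relative to the consequent's baseline popularity
lift = confidence / support_consequent
# leverage = observed co-occurrence minus what independence would predict
leverage = support_both - support_antecedent * support_consequent
# conviction = how much more often the rule would fail under independence
conviction = (1 - support_consequent) / (1 - confidence)

print(f"confidence={confidence:.3f} lift={lift:.3f} "
      f"leverage={leverage:.3f} conviction={conviction:.2f}")
```

A lift above 1 means the two films co-occur more often than their individual popularity alone would predict.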


# Phase II - Genre

The client is interested in a restricted set of rules for specific genres. For this task, demonstrate your skill by
selecting a genre of your own choosing. Select the subset of movies that match that genre, and rerun your rule
generation algorithm. For example, if the genre is "Comedy", then all ratings of movies that have Comedy in the
genre list should be selected. Run your algorithm on that subset, and generate a small set of strong rules. REPEAT
THIS FOR THREE DIFFERENT GENRES OF YOUR OWN CHOOSING.
Discuss – is this a better method than considering all movies? Or worse?


-----

### Process:

- 1) Choose a genre
- 2) Filter movies by chosen genre
- 3) Rerun phase 1
- 4) Repeat previous steps for 2 other genres
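
The steps above can be sketched as a small helper that is applied once per genre. This is a minimal sketch on made-up data, not the notebook's implementation: `genre_baskets` and the toy frames are hypothetical names introduced for illustration.

```python
import pandas as pd

def genre_baskets(ratings, movies, genre):
    """Return a boolean user-by-title matrix limited to one genre."""
    # regex=False matches the genre as a literal substring of the
    # pipe-separated genre list rather than as a regular expression.
    in_genre = movies[movies["genres"].str.contains(genre, regex=False)]
    watched = ratings[["userId", "movieId"]].merge(
        in_genre[["movieId", "title"]], on="movieId"
    )
    return pd.crosstab(watched["userId"], watched["title"]).astype(bool)

# Toy stand-ins for ratings.csv and movies.csv (values are made up).
ratings = pd.DataFrame({"userId": [1, 1, 2, 2, 3],
                        "movieId": [10, 20, 10, 30, 20]})
movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title":   ["Movie A (1990)", "Movie B (1995)", "Movie C (2001)"],
    "genres":  ["Comedy", "Action|Comedy", "Horror"],
})

# One binarized matrix per chosen genre.
baskets = {g: genre_baskets(ratings, movies, g)
           for g in ["Comedy", "Action", "Horror"]}
for genre, basket in baskets.items():
    print(genre, basket.shape)
```

Each matrix could then be passed to `fpgrowth` and `association_rules` as in Phase I. Note that users with no movie in the genre drop out of the matrix, so supports are relative to users who watched at least one movie of that genre.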

In [11]:
#Let's choose comedy, action, and horror
df_ratings = pd.read_csv("../data/ml-latest-small/ratings.csv")
df_movies = pd.read_csv("../data/ml-latest-small/movies.csv")
df_ratings.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


In [12]:
#Keep only comedies
comedies = df_movies[df_movies.genres.str.contains("Comedy")]

In [13]:
#Keep only action movies (stored separately so the comedies are not overwritten)
action = df_movies[df_movies.genres.str.contains("Action")]

### Discuss Finding Phase II

Discuss

---

# Phase III – Genre Rules
The client has a bright idea. (Being a good agile developer, you eagerly respond positively, to ensure the client
knows they are valued and part of your team. :) ) The client wants to take a more general view of genre. How?
Create a new transaction dataset, where the item universe is now all possible genres, not movies. A transaction for
each customer is then a list of genres collected over all movies they watched. The customer wants to understand
both the general frequent patterns among these data and their support levels. Again, be sure to output a good set
of strong rules. This can help the customer determine what types of movies they should invest in the most based
on current genres most watched. (NOTE: This is going to amount to a very dense dataset, compared to the movies,
and thus will require very different hyperparameters.)

____

### Process:

- 1) Start by reading in the CSV files for ratings and movies
- 2) In the ratings data frame, drop the rating and timestamp columns
- 3) Merge in each movie's genre list and split it into individual genres
- 4) Binarize the genres per user and generate rules

In [178]:
df_ratings = pd.read_csv("../data/ml-latest-small/ratings.csv")
df_movies = pd.read_csv("../data/ml-latest-small/movies.csv")
df_ratings = df_ratings.drop(columns = ['rating','timestamp'], errors = "ignore")


In [179]:
df_titles = df_ratings.merge(right = df_movies, right_on = 'movieId', left_on = 'movieId')
df_titles = df_titles.sort_values(['userId','movieId'])
df_titles = df_titles.drop(["movieId","title"], axis = 1)

In [180]:
#Drop duplicate (userId, genre list) rows to minimize size
df_titles = df_titles.drop_duplicates()
df_titles.reset_index(drop = True, inplace = True)

In [181]:
df_titles.genres = df_titles.genres.str.split("|")

Unnamed: 0,userId,genres
0,1,"[Adventure, Animation, Children, Comedy, Fantasy]"
1,1,"[Comedy, Romance]"
2,1,"[Action, Crime, Thriller]"
3,1,"[Mystery, Thriller]"
4,1,"[Crime, Mystery, Thriller]"


In [184]:
#userGenres is indexed by userId, with one row per genre that user has watched
userGenres = df_titles.explode("genres").drop_duplicates().set_index("userId")

In [209]:
genre_binarized = pd.get_dummies(data = userGenres.genres).groupby("userId").max()
#we now have a binarized list of genres watched by UserId
genre_binarized = genre_binarized.drop("IMAX",axis = 1)
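
The split, explode, and binarize steps are easier to follow on a few made-up rows. The sketch below is illustrative only: the toy data is invented, and `pd.crosstab` is used in place of `get_dummies` plus `groupby`. It also shows why this dataset is dense: single-genre support is just the column mean, and common genres reach almost every user.

```python
import pandas as pd

# Toy (userId, genres) rows in the same pipe-separated format as movies.csv.
rows = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "genres": ["Comedy|Romance", "Action|Comedy", "Comedy", "Action|Comedy"],
})

# Split each genre string into a list, then emit one row per (user, genre).
rows["genres"] = rows["genres"].str.split("|")
long = rows.explode("genres").drop_duplicates()

# One row per user, one boolean column per genre.
genre_matrix = pd.crosstab(long["userId"], long["genres"]).astype(bool)

# Support of each single genre = fraction of users who watched it.
support = genre_matrix.mean().sort_values(ascending=False)
print(support)
```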

In [210]:
#Generate supports
genre_support = fpgrowth(genre_binarized, min_support=0.8, use_colnames=True)
#Even with min_support of 0.8, the dense dataset yields over 6,400 frequent itemsets
genre_rules = association_rules(genre_support, metric = "lift", min_threshold=1.1)
genre_rules = genre_rules.sort_values(by = "confidence", ascending = False)
genre_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
21,"(Action, Animation, Mystery)","(Sci-Fi, Children, Comedy, Fantasy, Drama)",0.827869,0.881967,0.803279,0.970297,1.100151,0.073126,3.97377
36,"(Action, Animation, Comedy, Mystery)","(Sci-Fi, Children, Adventure, Fantasy, Drama)",0.827869,0.881967,0.803279,0.970297,1.100151,0.073126,3.97377
20,"(Drama, Action, Animation, Mystery)","(Sci-Fi, Children, Comedy, Fantasy)",0.827869,0.881967,0.803279,0.970297,1.100151,0.073126,3.97377
32,"(Animation, Comedy, Action, Mystery, Drama)","(Adventure, Sci-Fi, Children, Fantasy)",0.827869,0.881967,0.803279,0.970297,1.100151,0.073126,3.97377
33,"(Animation, Adventure, Action, Mystery, Drama)","(Sci-Fi, Children, Comedy, Fantasy)",0.827869,0.881967,0.803279,0.970297,1.100151,0.073126,3.97377
...,...,...,...,...,...,...,...,...,...
115,"(Adventure, Sci-Fi, Children, Fantasy)","(Romance, Animation, Action, Mystery, Thriller)",0.881967,0.824590,0.800000,0.907063,1.100017,0.072739,1.88741
137,"(Sci-Fi, Children, Comedy, Fantasy)","(Romance, Animation, Action, Mystery, Drama, T...",0.881967,0.824590,0.800000,0.907063,1.100017,0.072739,1.88741
161,"(Adventure, Sci-Fi, Children, Fantasy)","(Romance, Animation, Comedy, Action, Mystery, ...",0.881967,0.824590,0.800000,0.907063,1.100017,0.072739,1.88741
135,"(Sci-Fi, Children, Comedy, Fantasy, Drama)","(Romance, Animation, Action, Mystery, Thriller)",0.881967,0.824590,0.800000,0.907063,1.100017,0.072739,1.88741


In [212]:
genre_support

Unnamed: 0,support,itemsets
0,1.000000,(Drama)
1,0.998361,(Thriller)
2,0.998361,(Comedy)
3,0.996721,(Action)
4,0.993443,(Romance)
...,...,...
6438,0.800000,"(Sci-Fi, Children, Romance, Animation, Comedy,..."
6439,0.800000,"(Sci-Fi, Children, Romance, Animation, Comedy,..."
6440,0.800000,"(Sci-Fi, Children, Romance, Animation, Adventu..."
6441,0.800000,"(Sci-Fi, Children, Romance, Animation, Comedy,..."


# Phase IV – Incorporating Additional Variables
Consider how you can use other variables? You have access to the numeric ratings, a unique timestamp for the
rating, the year of the movie, and user-defined tags. Or, consider that, for the previous exercise, you ignored
multiplicity of genres. What else can you do with all of these data? For instance, are there patterns with movie
years? Could you create new items such as "70s", "80s", and so on for the decade of the movie and re-run your
frequent pattern search? Could you combine the decade and the genre? Imagine if you could figure out how to
generate rules that tell the client that people who like 80s movies are likely to watch "Comedy" or "Romance" with
a given confidence level. And of course, what about the ratings!!! Why would you output a rule that contains a
movie only given a rating of a 1 or a 2? You might be able filter these patterns and rules more intelligently!

For this last phase, come up with three different ideas that involve including additional variables in some way, and
implement it. In all three cases, generate a new set of association rules. Depending on what you choose to do here,
it will likely require that you filter rules out that do not meet certain criteria? Or, perhaps you could modify or
rewrite your own variant of the apriori algorithm. You could rewrite apriori just to generate relevant frequent
patterns, and still use mlxtend's association rules package, as long as the format of the data frame that is used as
input into the association rule generation are consistent.

I have no specific requirements here. I want you and your partner to think. Be creative. Put yourself in the client's
shoes. You have a lot of data. How can you leverage it to provide the best possible recommendations for their
customers?

------

### Process:

- 1) Start by reading in the CSV files for ratings and movies
- 2) In the ratings data frame, drop the rating and timestamp columns
- 3) Replace the movieId column with the corresponding movie title
- 4) Binarize the data and generate rules
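
One possible direction for this phase, following the prompt's suggestions, is to keep only well-rated movies and turn each movie's decade into an item alongside its genres. The sketch below is a hedged illustration on made-up data, not an implementation from this notebook: `MIN_RATING` is a hypothetical cutoff, and the year is assumed to appear as a trailing "(YYYY)" in the title.

```python
import pandas as pd

# Toy stand-ins for ratings.csv and movies.csv (values are made up).
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 2.0, 5.0, 3.5, 4.5],
})
movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title":   ["Movie A (1985)", "Movie B (1995)", "Movie C (2001)"],
    "genres":  ["Comedy", "Action|Comedy", "Horror"],
})

MIN_RATING = 3.5  # hypothetical cutoff for "the user liked this movie"

# Keep only ratings at or above the cutoff, then attach title and genres.
liked = ratings[ratings["rating"] >= MIN_RATING].merge(movies, on="movieId")

# Extract the trailing four-digit year; titles without one are dropped.
year = pd.to_numeric(
    liked["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
)
liked = liked.assign(year=year).dropna(subset=["year"])
liked = liked.assign(
    decade=(liked["year"] // 10 * 10).astype(int).astype(str) + "s"
)

# Build one item list per user that mixes genre items and decade items.
genre_items = (liked.assign(item=liked["genres"].str.split("|"))
                    .explode("item")[["userId", "item"]])
decade_items = liked.rename(columns={"decade": "item"})[["userId", "item"]]
items = pd.concat([genre_items, decade_items]).drop_duplicates()

basket = pd.crosstab(items["userId"], items["item"]).astype(bool)
print(basket)
```

A matrix like this could be fed to `fpgrowth` and `association_rules` as before; rules whose antecedent is a decade and whose consequent is a genre would then be selected by filtering the resulting rule table.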