Team members:

- Nahir Trógolo 
- Lucia Benitez
- Johanna Frau


In [188]:
import numpy as np
import pandas as pd

from efficient_apriori import apriori
from itertools import combinations, groupby

In [189]:
movies = pd.read_csv('movies.csv')
movies.head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [190]:
movies.sample(5)

Unnamed: 0,movieId,title,genres
934,951,His Girl Friday (1940),Comedy|Romance
6815,6927,"Human Stain, The (2003)",Drama|Romance|Thriller
18042,90432,Lentsu (1990),Comedy
15051,76171,India (Indien) (1993),Comedy|Drama
22044,106232,"Reformer and the Redhead, The (1950)",Comedy|Romance


In [191]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 3 columns):
movieId    27278 non-null int64
title      27278 non-null object
genres     27278 non-null object
dtypes: int64(1), object(2)
memory usage: 639.4+ KB


The movies dataset has 3 feature columns:

- Movie identifier (movieId)
- Movie title together with its release year (title)
- Genres the movie belongs to (genres)

In [192]:
new_movies = movies.copy()  # work on a copy so the raw DataFrame is left untouched
new_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


## Data preparation

### Creating the year column

We now create a new column that holds only each movie's release year.

In [193]:
# Take the text inside the last pair of parentheses in the title
new_movies['year'] = new_movies['title'].str.rsplit(')', n=1).str[0].str.rsplit('(', n=1).str[1]
new_movies['year']

0        1995
1        1995
2        1995
3        1995
4        1995
5        1995
6        1995
7        1995
8        1995
9        1995
10       1995
11       1995
12       1995
13       1995
14       1995
15       1995
16       1995
17       1995
18       1995
19       1995
20       1995
21       1995
22       1995
23       1995
24       1995
25       1995
26       1995
27       1995
28       1995
29       1995
         ... 
27248    1999
27249    2011
27250    2006
27251    1966
27252    1999
27253    2002
27254    1991
27255    2009
27256    2014
27257    2011
27258    2009
27259    2014
27260    2015
27261    2013
27262    2014
27263    2014
27264    2015
27265    2014
27266    2010
27267    2011
27268    2000
27269    2003
27270    2006
27271    2000
27272    2001
27273    2007
27274    2002
27275    2014
27276    2001
27277    2014
Name: year, Length: 27278, dtype: object

Let's verify that the column came out right by checking its unique values.

In [194]:
new_movies['year'].unique()

array(['1995', '1994', '1996', '1976', '1992', '1988', '1967', '1993',
       '1964', '1977', '1965', '1982', '1985', '1990', '1991', '1989',
       '1937', '1940', '1969', '1981', '1973', '1970', '1960', '1955',
       '1959', '1968', '1980', '1975', '1986', '1948', '1943', '1950',
       '1946', '1987', '1997', '1974', '1956', '1958', '1949', '1972',
       '1998', '1933', '1952', '1951', '1957', '1961', '1954', '1934',
       '1944', '1963', '1942', '1941', '1953', '1939', '1947', '1945',
       '1938', '1935', '1936', '1926', '1932', '1979', '1971', '1978',
       '1966', '1962', '1983', '1984', '1931', '1922', '1999', '1927',
       '1929', '1930', '1928', '1925', '1914', '2000', '1919', '1923',
       '1920', '1918', '1921', '2001', '1924', '2002', '2003', '1915',
       '2004', '1916', '1917', '2005', '2006', '1902', nan, '1903',
       '2007', '2008', '2009', '1912', '2010', 'Das Millionenspiel',
       '1913', '2011', '1898', '1899', 'Bicicleta, cullera, poma', '1894',
       

Because some movies have no release year in their title, the 'year' column contains a few non-numeric values. We will check whether these values are few relative to the total and, if so, exclude them from this analysis.

The values to check are:

- Das Millionenspiel
- Bicicleta, cullera, poma
- 2009–
- 2007-
- 1983)
- 1975-1979

as well as the rows with null values.
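The same inspection can be done programmatically instead of reading the `unique()` output by eye. The following is a minimal sketch on a toy Series whose values are copied from the list above; `is_valid` and `suspicious` are hypothetical names, not part of the notebook. It flags every entry that is not exactly four digits, nulls included.

```python
import pandas as pd

# Toy stand-in for new_movies['year']; values mirror the problem cases listed above.
year = pd.Series(['1995', 'Das Millionenspiel', '2009– ', '2007-',
                  '1983)', '1975-1979', None])

# True only for strings made of exactly four digits.
# na=False makes null entries count as invalid instead of propagating NaN.
is_valid = year.str.match(r'^\d{4}$', na=False)

# Entries that still need manual review or correction.
suspicious = year[~is_valid]
print(suspicious)
```

On the real column, `new_movies[~is_valid]` would list all affected rows at once, which makes it easy to judge how few they are relative to the 27278 movies.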

In [195]:
new_movies[new_movies['year']=='Bicicleta, cullera, poma']

Unnamed: 0,movieId,title,genres,year
17341,87442,"Bicycle, Spoon, Apple (Bicicleta, cullera, poma)",Documentary,"Bicicleta, cullera, poma"


In [196]:
new_movies[new_movies['year']=='Das Millionenspiel']

Unnamed: 0,movieId,title,genres,year
15646,79607,"Millions Game, The (Das Millionenspiel)",Action|Drama|Sci-Fi|Thriller,Das Millionenspiel


In [197]:
new_movies[new_movies['year']=='1975-1979']

Unnamed: 0,movieId,title,genres,year
22679,108583,Fawlty Towers (1975-1979),Comedy,1975-1979


We set the year of these rows to 'no year'.

In [198]:
new_movies.at[17341, 'year'] = 'no year'
new_movies.at[15646, 'year'] = 'no year'
new_movies.at[22679, 'year'] = 'no year'

Now let's look at the rows with '2009– ', '2007-' and '1983)'.

In [199]:
new_movies[new_movies['year']=='2009– ']

Unnamed: 0,movieId,title,genres,year
22368,107434,Diplomatic Immunity (2009– ),Comedy,2009–


In [200]:
new_movies[new_movies['year']=='2007-']

Unnamed: 0,movieId,title,genres,year
22669,108548,"Big Bang Theory, The (2007-)",Comedy,2007-


In [201]:
new_movies[new_movies['year']=='1983)']

Unnamed: 0,movieId,title,genres,year
19859,98063,Mona and the Time of Burning Love (Mona ja pal...,Drama,1983)


We correct those values.

In [202]:
new_movies.at[22368, 'year'] = '2009'

In [203]:
new_movies.at[22669, 'year'] = '2007'

In [204]:
new_movies.at[19859, 'year'] = '1983'

We check the unique values of the year column again.

In [205]:
new_movies['year'].unique()

array(['1995', '1994', '1996', '1976', '1992', '1988', '1967', '1993',
       '1964', '1977', '1965', '1982', '1985', '1990', '1991', '1989',
       '1937', '1940', '1969', '1981', '1973', '1970', '1960', '1955',
       '1959', '1968', '1980', '1975', '1986', '1948', '1943', '1950',
       '1946', '1987', '1997', '1974', '1956', '1958', '1949', '1972',
       '1998', '1933', '1952', '1951', '1957', '1961', '1954', '1934',
       '1944', '1963', '1942', '1941', '1953', '1939', '1947', '1945',
       '1938', '1935', '1936', '1926', '1932', '1979', '1971', '1978',
       '1966', '1962', '1983', '1984', '1931', '1922', '1999', '1927',
       '1929', '1930', '1928', '1925', '1914', '2000', '1919', '1923',
       '1920', '1918', '1921', '2001', '1924', '2002', '2003', '1915',
       '2004', '1916', '1917', '2005', '2006', '1902', nan, '1903',
       '2007', '2008', '2009', '1912', '2010', 'no year', '1913', '2011',
       '1898', '1899', '1894', '2012', '1909', '1910', '1901', '1893',
      

In the rows with null values we set the category "no year".

In [206]:
year_null = new_movies[new_movies['year'].isnull()]

In [207]:
index_year_null = np.array(year_null.index)
index_year_null

array([10593, 23617, 23824, 24286, 24412, 26115, 26127, 26180, 26335,
       26395, 26432, 26749, 26784, 26963, 26974, 27027, 27114])

In [208]:
# .at only addresses a single cell; .loc accepts an array of row labels
new_movies.loc[index_year_null, 'year'] = 'no year'

In [209]:
new_movies['year'].unique()

array(['1995', '1994', '1996', '1976', '1992', '1988', '1967', '1993',
       '1964', '1977', '1965', '1982', '1985', '1990', '1991', '1989',
       '1937', '1940', '1969', '1981', '1973', '1970', '1960', '1955',
       '1959', '1968', '1980', '1975', '1986', '1948', '1943', '1950',
       '1946', '1987', '1997', '1974', '1956', '1958', '1949', '1972',
       '1998', '1933', '1952', '1951', '1957', '1961', '1954', '1934',
       '1944', '1963', '1942', '1941', '1953', '1939', '1947', '1945',
       '1938', '1935', '1936', '1926', '1932', '1979', '1971', '1978',
       '1966', '1962', '1983', '1984', '1931', '1922', '1999', '1927',
       '1929', '1930', '1928', '1925', '1914', '2000', '1919', '1923',
       '1920', '1918', '1921', '2001', '1924', '2002', '2003', '1915',
       '2004', '1916', '1917', '2005', '2006', '1902', 'no year', '1903',
       '2007', '2008', '2009', '1912', '2010', '1913', '2011', '1898',
       '1899', '1894', '2012', '1909', '1910', '1901', '1893', '2013',
   

In [210]:
new_movies.sample(20)

Unnamed: 0,movieId,title,genres,year
25958,123645,A Lesson Before Dying (1999),Drama,1999
25454,120448,Geronimo (1962),Action|Western,1962
14023,70366,"Silent Night, Deadly Night Part 2 (1987)",Comedy|Horror,1987
7658,8126,Shock Corridor (1963),Drama,1963
13287,65133,Blackadder Back & Forth (1999),Comedy,1999
16039,81184,"Short Film About John Bolton, A (2003)",Fantasy|Horror|Mystery,2003
3358,3447,"Good Earth, The (1937)",Drama,1937
25590,121029,No Distance Left to Run (2010),Documentary,2010
1778,1861,Junk Mail (Budbringeren) (1997),Comedy|Thriller,1997
14899,74613,"Therese Raquin (a.k.a. Adultress, The) (1953)",Crime|Drama|Romance,1953


### Creating the title-only column

In [211]:
# Keep everything before the last opening parenthesis (the one that starts the year)
new_movies['only_title'] = new_movies['title'].str.rsplit('(', n=1).str[0]
new_movies.sample(50)

Unnamed: 0,movieId,title,genres,year,only_title
13498,66762,Paris (2008),Comedy|Drama|Romance,2008,Paris
18314,91533,Dacii (1967),Drama|War,1967,Dacii
17304,87306,Super 8 (2011),Mystery|Sci-Fi|Thriller|IMAX,2011,Super 8
17947,90035,"Report, The (Gozaresh) (1977)",Drama,1977,"Report, The (Gozaresh)"
23933,113634,Message from Akira Kurosawa: For Beautiful Mov...,Documentary,2000,Message from Akira Kurosawa: For Beautiful Mov...
11455,49280,Bobby (2006),Drama,2006,Bobby
16174,81765,Playing from the Plate (Grajacy z talerza) (1995),Drama|Fantasy|Mystery,1995,Playing from the Plate (Grajacy z talerza)
20040,98795,Jazz (2001),Documentary,2001,Jazz
25487,120627,The Disappeared (2008),Documentary,2008,The Disappeared
10506,39381,"Proposition, The (2005)",Crime|Drama|Western,2005,"Proposition, The"
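As an alternative to the two `rsplit` chains, the title and the year can be pulled out in a single pass with a regular expression. This is only a sketch on three titles taken from the outputs above; the pattern and the name `parts` are assumptions, not what the notebook uses.

```python
import pandas as pd

titles = pd.Series([
    'Toy Story (1995)',
    'Report, The (Gozaresh) (1977)',
    'Bicycle, Spoon, Apple (Bicicleta, cullera, poma)',
])

# Everything before a trailing "(YYYY)" becomes only_title; the four digits become year.
# The lazy .*? plus \s* drops the space that rsplit leaves at the end of the title.
pattern = r'^(?P<only_title>.*?)\s*\((?P<year>\d{4})\)\s*$'
parts = titles.str.extract(pattern)
print(parts)
```

Titles that do not end in a four-digit year, such as the third one, yield NaN in both columns, so cases like 'Big Bang Theory, The (2007-)' would still need the manual corrections made earlier.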


## Rating dataset

In [212]:
rating = pd.read_csv('ratings.csv')
rating.sample(5)

Unnamed: 0,userId,movieId,rating,timestamp
17718861,122523,71135,3.5,1397519517
7233345,49914,1201,2.5,1225726266
12208273,84355,260,5.0,938953745
17040655,117899,2605,1.0,946625533
46177,348,59141,4.0,1228621024


In [213]:
rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 4 columns):
userId       int64
movieId      int64
rating       float64
timestamp    int64
dtypes: float64(1), int64(3)
memory usage: 610.4 MB


The rating dataset contains 4 feature columns:

- User identifier (userId)
- Movie identifier (movieId)
- Movie rating (rating)
- Time of the rating (timestamp)

We now join the two datasets on the movie ID.

In [214]:
df = pd.merge(rating,new_movies, on='movieId')
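`pd.merge` performs an inner join by default, so any rating whose `movieId` is missing from the movies table would be dropped silently. The sketch below uses toy frames (`ratings_toy` and `movies_toy` are hypothetical) to show that behaviour, together with the `validate` argument, which raises an error if a `movieId` appears more than once on the movies side.

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    'userId':  [1, 1, 2],
    'movieId': [1, 2, 99],   # 99 has no entry in movies_toy
    'rating':  [4.0, 3.5, 5.0],
})
movies_toy = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title':   ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})

# how='inner' is the default, written out here for clarity.
# validate='many_to_one' checks that movieId is unique in the right-hand frame.
merged = pd.merge(ratings_toy, movies_toy, on='movieId',
                  how='inner', validate='many_to_one')
print(merged)
```

In this notebook the merged frame has 20000263 rows, the same count as the ratings file, so no rating was lost in the join.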

In [215]:
df.head(20)

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,year,only_title
0,1,2,3.5,1112486027,Jumanji (1995),Adventure|Children|Fantasy,1995,Jumanji
1,5,2,3.0,851527569,Jumanji (1995),Adventure|Children|Fantasy,1995,Jumanji
2,13,2,3.0,849082742,Jumanji (1995),Adventure|Children|Fantasy,1995,Jumanji
3,29,2,3.0,835562174,Jumanji (1995),Adventure|Children|Fantasy,1995,Jumanji
4,34,2,3.0,846509384,Jumanji (1995),Adventure|Children|Fantasy,1995,Jumanji
5,54,2,3.0,974918176,Jumanji (1995),Adventure|Children|Fantasy,1995,Jumanji
6,88,2,1.0,1098277938,Jumanji (1995),Adventure|Children|Fantasy,1995,Jumanji
7,91,2,3.5,1112061358,Jumanji (1995),Adventure|Children|Fantasy,1995,Jumanji
8,116,2,2.0,1132728068,Jumanji (1995),Adventure|Children|Fantasy,1995,Jumanji
9,119,2,4.0,845110667,Jumanji (1995),Adventure|Children|Fantasy,1995,Jumanji


In [216]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000263 entries, 0 to 20000262
Data columns (total 8 columns):
userId        int64
movieId       int64
rating        float64
timestamp     int64
title         object
genres        object
year          object
only_title    object
dtypes: float64(1), int64(3), object(4)
memory usage: 1.3+ GB


In [217]:
final_df = df.sample(frac= 0.01)
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200003 entries, 4942547 to 15838034
Data columns (total 8 columns):
userId        200003 non-null int64
movieId       200003 non-null int64
rating        200003 non-null float64
timestamp     200003 non-null int64
title         200003 non-null object
genres        200003 non-null object
year          200003 non-null object
only_title    200003 non-null object
dtypes: float64(1), int64(3), object(4)
memory usage: 13.7+ MB


### Movie-based association rules

- Item: a movie watched by a user
- I: the set of all movies watched by the users
- Transaction: the set of movies watched by a single user
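To make these definitions concrete, here is a small hand-rolled sketch of the two quantities that the `min_support` and `min_confidence` thresholds refer to. The transactions are invented toy data, and `support` and `confidence` are hypothetical helpers written for illustration, not functions from `efficient_apriori`.

```python
# Toy transactions: each set holds the movies one (hypothetical) user watched.
transactions = [
    {'Toy Story', 'Jumanji'},
    {'Toy Story', 'Jumanji', 'Heat'},
    {'Toy Story', 'Heat'},
    {'Sabrina'},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

pair_support = support({'Toy Story', 'Jumanji'}, transactions)
rule_confidence = confidence({'Toy Story'}, {'Jumanji'}, transactions)
print(pair_support)      # 2 of 4 transactions contain both movies
print(rule_confidence)   # 2 of the 3 Toy Story transactions also contain Jumanji
```

A rule such as {Toy Story} -> {Jumanji} is reported only when both values reach their respective thresholds.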

In [218]:
by_movies_df = final_df.sort_values(by='userId')  # remaining arguments were the defaults
by_movies_df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,year,only_title
4488988,3,2699,3.0,945175870,Arachnophobia (1990),Comedy|Horror,1990,Arachnophobia
190505,3,223,5.0,944918444,Clerks (1994),Comedy,1994,Clerks
4458305,3,2657,3.0,945175730,"Rocky Horror Picture Show, The (1975)",Comedy|Horror|Musical|Sci-Fi,1975,"Rocky Horror Picture Show, The"
3756330,3,1225,3.0,944917494,Amadeus (1984),Drama,1984,Amadeus
6572160,7,1888,3.0,1011208527,Hope Floats (1998),Comedy|Drama|Romance,1998,Hope Floats


In [219]:
userId_title = by_movies_df[['userId','only_title']]
userId_title.head()

Unnamed: 0,userId,only_title
4488988,3,Arachnophobia
190505,3,Clerks
4458305,3,"Rocky Horror Picture Show, The"
3756330,3,Amadeus
6572160,7,Hope Floats


In [220]:
userId_title.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200003 entries, 4488988 to 14982572
Data columns (total 2 columns):
userId        200003 non-null int64
only_title    200003 non-null object
dtypes: int64(1), object(1)
memory usage: 4.6+ MB


In [221]:
# Strip trailing whitespace; assign returns a new DataFrame, which avoids SettingWithCopyWarning
userId_title = userId_title.assign(only_title=userId_title['only_title'].str.rstrip())


In [222]:
userId_title[userId_title['only_title']=='Constantine']

Unnamed: 0,userId,only_title
2510923,3218,Constantine
2511034,6915,Constantine
2511119,9917,Constantine
2511263,14994,Constantine
2511373,18165,Constantine
2511454,20922,Constantine
2511458,21051,Constantine
2511621,25660,Constantine
2511644,26320,Constantine
2511721,28650,Constantine


In [223]:
grouped_df = userId_title.groupby(['userId'])
grouped_df

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f0cca3ee198>

In [224]:
lst = userId_title.groupby('userId')['only_title'].apply(pd.Series.tolist).tolist()
lst

[['Arachnophobia', 'Clerks', 'Rocky Horror Picture Show, The', 'Amadeus'],
 ['Hope Floats',
  'Independence Day (a.k.a. ID4)',
  'Riding in Cars with Boys',
  'Six Days Seven Nights'],
 ['Forrest Gump'],
 ['Bad Education (La mala educación)',
  'Toy Story 2',
  'Professional, The (Le professionnel)',
  'War of the Worlds',
  'Ratatouille',
  'Ghost in the Shell (Kôkaku kidôtai)',
  'Last King of Scotland, The'],
 ['Forrest Gump'],
 ['Mrs. Doubtfire'],
 ['Galaxy Quest'],
 ['Grosse Pointe Blank', 'House Party'],
 ['Superman II', 'Sea Inside, The (Mar adentro)'],
 ['Dangerous Beauty',
  'Godfather, The',
  'Good Will Hunting',
  'Green Mile, The'],
 ['What Women Want', 'Seven (a.k.a. Se7en)'],
 ['Emma', "Muriel's Wedding"],
 ['Shawshank Redemption, The',
  'Grumpy Old Men',
  'Pi',
  'Lethal Weapon 4',
  'Shining, The',
  'Miss Congeniality'],
 ['Austin Powers: The Spy Who Shagged Me'],
 ['Accidental Tourist, The'],
 ['Outbreak'],
 ['Tommy Boy'],
 ['Aeon Flux', 'Frozen', 'Rambo (Rambo 4)'

In [226]:
itemsets, rules = apriori(lst, min_support=0.009,  min_confidence=0.3)
rules

[]
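The empty rule list is plausibly a consequence of how the sample was drawn: `df.sample(frac=0.01)` keeps 1% of the individual ratings, so each user retains only a few of their movies and few pairs co-occur often enough to reach the support threshold. One alternative, sketched below on toy data under that assumption, is to sample users and keep all of their ratings. `sample_users` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def sample_users(df, frac, random_state=None):
    """Keep every row that belongs to a random fraction of the distinct users."""
    users = df['userId'].drop_duplicates()
    kept = users.sample(frac=frac, random_state=random_state)
    return df[df['userId'].isin(kept)]

toy = pd.DataFrame({
    'userId':     [1, 1, 1, 2, 2, 3, 4, 4],
    'only_title': ['Toy Story', 'Jumanji', 'Heat', 'Toy Story', 'Heat',
                   'Sabrina', 'Jumanji', 'Heat'],
})

# Half of the four users are kept, each with their complete viewing history.
subset = sample_users(toy, frac=0.5, random_state=0)
transactions = subset.groupby('userId')['only_title'].apply(list).tolist()
print(transactions)
```

The resulting `transactions` list has the same shape as `lst` and could be passed to `apriori` in its place.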

In [227]:
#itemsets, rules =  apriori(lst, min_support=0.005, min_confidence=0.4)
#rules

In [228]:
# Print the rules in order of increasing confidence
rules = sorted(rules, key=lambda rule: rule.confidence)
for rule in rules:
    print(rule)