# Overview

Recommendation systems are used to suggest everything from movies and music to friends and new destinations. There are three main methods for implementing recommendations that you will become familiar with throughout this lesson:

- Knowledge Based Recommendations 
- Collaborative Filtering Based Recommendations
- Content Based Recommendations

...and more advanced techniques to follow.

### Collaborative filtering
Within Collaborative Filtering, there are two main branches:

- Model Based Collaborative Filtering
- Neighborhood Based Collaborative Filtering

In order to implement Neighborhood Based Collaborative Filtering, you will learn about some common ways to measure the similarity between two users (or two items) including:

- Pearson's correlation coefficient
- Spearman's correlation coefficient
- Kendall's Tau
- Euclidean Distance
- Manhattan Distance

You will learn why sometimes one metric works better than another by looking at a specific situation where one metric provides more information than another.
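
As a rough illustration of how these metrics can disagree, the sketch below compares two hypothetical users whose tastes follow the same pattern but whose ratings differ by a constant offset. The rating vectors and helper names are assumptions made for this example; the rank-based helpers assume there are no tied ratings, and `kendall_tau` is the simple tau-a variant.

```python
import numpy as np

# Hypothetical ratings from two users on the same five movies (0-10 scale).
user_a = np.array([4, 5, 6, 7, 8])
user_b = np.array([2, 3, 4, 5, 6])  # same pattern of tastes, two points lower


def pearson(x, y):
    """Pearson's correlation coefficient between two rating vectors."""
    return float(np.corrcoef(x, y)[0, 1])


def ranks(x):
    """Rank the values from 1 to n, assuming there are no ties."""
    return np.argsort(np.argsort(x)) + 1


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks (no ties)."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return float(total / (n * (n - 1) / 2))


def euclidean(x, y):
    """Straight-line distance between the two rating vectors."""
    return float(np.sqrt(np.sum((x - y) ** 2)))


def manhattan(x, y):
    """Sum of absolute rating differences."""
    return float(np.sum(np.abs(x - y)))


similarity = {
    'pearson': pearson(user_a, user_b),
    'spearman': spearman(user_a, user_b),
    'kendall_tau': kendall_tau(user_a, user_b),
    'euclidean': euclidean(user_a, user_b),
    'manhattan': manhattan(user_a, user_b),
}
print(similarity)
```

All three correlation measures treat these users as perfectly similar, while the two distances are non-zero because one user consistently rates lower. Which behavior is preferable depends on whether that offset matters for the recommendation.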


### Example Recommendations
- **LinkedIn and Facebook:** Both LinkedIn and Facebook recommend new connections (business contacts or friends).
- **AirBnB Experiences and Destinations:** AirBnB uses recommendations to determine experiences and destinations for their users.
- **Walmart, Amazon, and Other Retailers:** As humans on the Internet, we all get pinged with constant recommendations from retailers.

### Business Cases For Recommendations
Finally, you will look at four ideas that businesses rely on to make recommendations that drive revenue:

- Relevance
- Novelty
- Serendipity
- Increased Diversity


---
# Recommendations with MovieTweetings

## Getting to Know The Data

Throughout this lesson, you will be working with the [MovieTweetings Data](https://github.com/sidooms/MovieTweetings/tree/master/recsyschallenge2014).  To get started, you can read more about this project and the dataset from the [publication here](http://crowdrec2013.noahlab.com.hk/papers/crowdrec2013_Dooms.pdf).

**Note:** As you work through the lesson, think about the pros and cons of each recommendation method and how it could be improved.

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tests as t

%matplotlib inline
%config Completer.use_jedi = False

In [211]:
import re
import datetime

In [33]:
movies = pd.read_csv('movies.dat', delimiter='::', engine='python', header=None, names=['movie_id', 'movie', 'genre'], dtype={'movie_id': object})
reviews = pd.read_csv('ratings.dat', delimiter='::', engine='python', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'], dtype={'movie_id': object, 'user_id': object, 'timestamp': object})
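
To see what the `delimiter`, `engine`, and `dtype` arguments are doing, here is a minimal sketch that parses two made-up rows from an in-memory buffer instead of `movies.dat`. The rows and the name `movies_demo` are assumptions for illustration only.

```python
import io
import pandas as pd

# Made-up rows in the same `id::title::genres` layout as movies.dat.
sample = io.StringIO(
    "0000008::Edison Kinetoscopic Record of a Sneeze (1894)::Documentary|Short\n"
    "0000010::A Hypothetical Title (1895)::Documentary|Short\n"
)

movies_demo = pd.read_csv(
    sample,
    delimiter='::',              # multi-character separators need the Python engine
    engine='python',
    header=None,                 # the file has no header row
    names=['movie_id', 'movie', 'genre'],
    dtype={'movie_id': object},  # keep ids as strings so any leading zeros survive
)
print(movies_demo)
```

Reading ids as `object` also keeps them from being treated as numbers later, for example when merging `movies` with `reviews`.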

## 1. Take a Look At The Data 

Take a look at the data and use your findings to fill in the dictionary below with the correct responses to show your understanding of the data.

In [28]:
# Number of movies (row count; movie_id.nunique() would count distinct ids)
movies.movie_id.shape[0]

35479

In [26]:
# Number of ratings (one row per rating)
reviews.shape[0]

863866

In [30]:
# Number of different genres
movies.genre.value_counts() # counts genre combinations such as 'Comedy|Drama', not individual genres

Drama                                              3602
Comedy                                             2091
Documentary                                        1443
Comedy|Drama                                       1371
Drama|Romance                                      1199
                                                   ... 
Animation|Comedy|Drama|Fantasy|Mystery|Thriller       1
Animation|Biography|Crime|Drama|Mystery               1
Animation|Short|Comedy|Drama|Sci-Fi                   1
Romance|Sport                                         1
Documentary|Biography|News|War                        1
Name: genre, Length: 2736, dtype: int64

In [183]:
# Hint: loop through each row and split the combined genre string

# Reason for try-except:
    # AttributeError: 'float' object has no attribute 'split'
    # most likely raised for rows where genre is NaN
    
genre_list = []
for genre in movies.genre:
    try:
        genre_list.extend(genre.split('|'))
    except AttributeError:
        pass
    
genre_set = set(genre_list)
print(f'Total number of unique genres is {len(genre_set)}')

Total number of unique genres is 28
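
The same count can probably be obtained without an explicit loop. The sketch below is one possible vectorized version, run on a small hypothetical Series (`genre_demo`) rather than the real `movies.genre` column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movies.genre, including a missing value.
genre_demo = pd.Series(['Comedy|Drama', 'Drama', np.nan, 'Horror|Sci-Fi'])

unique_genres = (
    genre_demo.dropna()      # skip NaN instead of catching AttributeError
    .str.split('|')          # 'Comedy|Drama' -> ['Comedy', 'Drama']
    .explode()               # one row per individual genre
    .unique()
)
print(sorted(unique_genres))
print(f'Total number of unique genres is {len(unique_genres)}')
```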


In [32]:
# Number of unique users
reviews.user_id.nunique()

67353

In [38]:
# Number of missing ratings
reviews.isnull().sum()

user_id      0
movie_id     0
rating       0
timestamp    0
dtype: int64

In [41]:
# Summary statistics for ratings
reviews.rating.describe()

count    863866.000000
mean          7.315878
std           1.853831
min           0.000000
25%           6.000000
50%           8.000000
75%           9.000000
max          10.000000
Name: rating, dtype: float64

In [229]:
# Use your findings to match each variable to the correct statement in the dictionary


dict_sol1 = {
'The number of movies in the dataset': 35479,
'The number of ratings in the dataset': 863866,
'The number of different genres': 28,
'The number of unique users in the dataset': 67353,
'The number missing ratings in the reviews dataset': 0,
'The average rating given across all ratings': 7.315878,
'The minimum rating given across all ratings': 0.000000,
'The maximum rating given across all ratings': 10.000000,
}

## 2. Data Cleaning

Next, we need to pull some additional relevant information out of the existing columns. 

For each of the datasets, there are a couple of cleaning steps we need to take care of:

#### Movies
* Pull the release year from the title and create a new column
* Create dummy columns with 1's and 0's from the year for each century of a movie (1800's, 1900's, and 2000's)
* Create dummy columns with 1's and 0's for each genre

#### Reviews
* Create a date from the timestamp

### `movies` Pull the date from the title and create a new column

In [52]:
# Let's see how movie title column looks
movies.movie.sample(5)

6303                      Yasha (1985)
4013        Meng long guo jiang (1972)
11069    303 Fear Faith Revenge (1999)
4245                     Zardoz (1974)
17252          Zone of the Dead (2009)
Name: movie, dtype: object

A random sample suggests that most movie titles follow the format `title (year)`.

In [135]:
def extract_year(title):
    ''' Extract the year from a movie title formatted as title (year).
    For example, 'Edison Kinetoscopic Record of a Sneeze (1894)' gives 1894.
    
    Argument: movie title as a string
    Return: year as an int
    Raises: AttributeError if the title has no four-digit year in parentheses
    '''
    
    # Capture the four digits inside the parentheses (`re` is imported above)
    pattern = re.compile(r'\((\d{4})\)') 
    year = pattern.search(title).group(1) # e.g. '2020' from '(2020)'
    
    return int(year)

In [136]:
# Create a 'year' column with the extracted year
movies['year'] = movies.movie.apply(extract_year)
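
A vectorized alternative is sketched below, assuming the same `title (year)` pattern. It uses `Series.str.extract`, which yields a missing value instead of raising when a title has no year. The Series `titles` is a hypothetical sample; the last entry is made up to show the no-match case.

```python
import pandas as pd

# Two titles from the sample above plus a hypothetical title with no year.
titles = pd.Series(['Yasha (1985)', 'Zardoz (1974)', 'Untitled project'])

# expand=False returns a Series holding the single capture group.
year_text = titles.str.extract(r'\((\d{4})\)', expand=False)

# Titles without a match become NaN, so the result is float rather than int.
years = pd.to_numeric(year_text)
print(years)
```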

In [141]:
# Confirm the change
movies['year'].describe()

count    35479.000000
mean      2000.158911
std         21.052628
min       1878.000000
25%       1992.000000
50%       2009.000000
75%       2014.000000
max       2021.000000
Name: year, dtype: float64

In [142]:
# Distinct years in row order (sort_index orders by row index, not by year) - looks ok
movies.year.sort_index(ascending=True).unique()

array([1894, 1895, 1896, 1902, 1903, 1908, 1909, 1910, 1911, 1912, 1913,
       1914, 1916, 1915, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924,
       1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935,
       1936, 1937, 1940, 1938, 1939, 1941, 1942, 1944, 1943, 2001, 1945,
       1946, 1947, 1948, 1949, 1952, 1950, 1951, 1954, 1953, 1955, 1956,
       1957, 1958, 1959, 1961, 1962, 1960, 1963, 1968, 1964, 1965, 1971,
       1967, 1966, 1969, 1970, 1972, 1990, 1973, 1974, 2018, 1975, 1976,
       1978, 1977, 1981, 1979, 1982, 1980, 1989, 1986, 1984, 1983, 1987,
       1985, 1988, 2002, 1992, 1991, 1993, 1994, 1995, 1996, 1999, 1997,
       1998, 2005, 2000, 2003, 2010, 2004, 2008, 1901, 2006, 2007, 2012,
       1900, 2009, 2016, 2011, 2015, 2017, 2013, 2014, 2019, 1888, 2020,
       1898, 1878, 1907, 2021])

### `movies` Create dummy columns with 1's and 0's for each century of a movie
 
Example of centuries: 1800's, 1900's, and 2000's

In [152]:
# Group year by century; right=False makes bins left-closed, so 1900 falls in '1900s'
movies['century'] = pd.cut(movies.year, bins=[1800, 1900, 2000, 2100], 
                           right=False, labels=['1800s', '1900s', '2000s'])
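
To check how the bin edges behave, the short sketch below runs the same `pd.cut` call on a few hypothetical boundary years. With `right=False` each bin includes its left edge and excludes its right edge.

```python
import pandas as pd

# Hypothetical years chosen to sit on either side of each bin edge.
boundary_years = pd.Series([1899, 1900, 1999, 2000])

# Bins are [1800, 1900), [1900, 2000), [2000, 2100).
centuries = pd.cut(boundary_years, bins=[1800, 1900, 2000, 2100],
                   right=False, labels=['1800s', '1900s', '2000s'])
print(centuries.tolist())
```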

In [160]:
# Get dummies 
dummies = pd.get_dummies(data=movies['century'])

# Concat dummies to movies dataframe
movies = pd.concat([movies, dummies], axis=1)
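
One detail worth checking on your own installation: recent pandas releases return boolean dummy columns by default, so the 1's and 0's shown in this notebook may appear as `True`/`False`. A minimal sketch, using a hypothetical three-row frame, that asks for integers explicitly:

```python
import pandas as pd

# Hypothetical frame with one movie per century.
demo = pd.DataFrame({'year': [1894, 1972, 2016]})
demo['century'] = pd.cut(demo.year, bins=[1800, 1900, 2000, 2100],
                         right=False, labels=['1800s', '1900s', '2000s'])

# dtype=int requests 1/0 columns regardless of the default dtype.
century_dummies = pd.get_dummies(demo['century'], dtype=int)
demo = pd.concat([demo, century_dummies], axis=1)
print(demo)
```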

In [167]:
# Confirm changes
movies.sample(5)

| | movie_id | movie | genre | year | century | 1800s | 1900s | 2000s |
|---|---|---|---|---|---|---|---|---|
| 16308 | 1002563 | The Young Messiah (2016) | Drama\|Fantasy | 2016 | 2000s | 0 | 0 | 1 |
| 12508 | 325761 | Luster (2002) | Comedy\|Drama | 2002 | 2000s | 0 | 0 | 1 |
| 27350 | 3489470 | Pas son genre (2014) | Drama\|Romance | 2014 | 2000s | 0 | 0 | 1 |
| 13792 | 414161 | Intermedio (2005) | Horror\|Thriller | 2005 | 2000s | 0 | 0 | 1 |
| 13886 | 419198 | La tigre e la neve (2005) | Comedy\|Drama\|Romance\|War | 2005 | 2000s | 0 | 0 | 1 |


### `movies` Create dummy columns with 1's and 0's for each genre

In [185]:
# Earlier defined 'genre_set'
print(genre_set)

{'History', 'Western', 'Drama', 'Biography', 'Talk-Show', 'War', 'Romance', 'Music', 'Comedy', 'Animation', 'Game-Show', 'Family', 'Horror', 'Musical', 'Sci-Fi', 'Action', 'Adult', 'Reality-TV', 'News', 'Thriller', 'Film-Noir', 'Short', 'Adventure', 'Fantasy', 'Crime', 'Mystery', 'Documentary', 'Sport'}


In [186]:
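# Extract genre and encode it as a binary flag
def extract_genre(val, genre):
    '''Return 1 if `genre` is one of the pipe-separated genres in `val`, else 0.'''
    try:
        # Compare whole tokens; a substring test would count 'Musical' as 'Music'
        return int(genre in val.split('|'))
        
    except AttributeError: # NaN is a float and has no .split
        return 0

In [188]:
for genre in genre_set:
    movies[genre] = movies.genre.apply(extract_genre, genre=genre)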
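
pandas also offers `Series.str.get_dummies`, which may do this encoding in one step. The sketch below applies it to a small hypothetical Series (`genre_demo`). Because it splits on the separator, closely named genres such as 'Music' and 'Musical' stay distinct, and missing values become all-zero rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movies.genre, including a missing value.
genre_demo = pd.Series(['Comedy|Drama', 'Musical', np.nan, 'Music'])

# One 1/0 column per distinct genre, with columns in alphabetical order.
genre_dummies = genre_demo.str.get_dummies(sep='|')
print(genre_dummies)
```

The result could then be joined to the original frame with `pd.concat([...], axis=1)`, as was done for the century columns.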

In [208]:
# Confirm changes
pd.options.display.max_rows = 100
pd.concat([movies['genre'], movies.iloc[:, -28:]], axis=1).sample(5)

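| | genre | History | Western | Drama | Biography | Talk-Show | War | Romance | Music | Comedy | ... | News | Thriller | Film-Noir | Short | Adventure | Fantasy | Crime | Mystery | Documentary | Sport |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12648 | Action\|Crime\|Horror\|Mystery\|Thriller | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 5479 | Horror\|Mystery\|Thriller | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5826 | Horror\|Sci-Fi | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15079 | Documentary | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 30616 | Drama | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |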


### `reviews` Create a date from the timestamp

In [219]:
datetime.datetime.fromtimestamp(int(reviews.timestamp[0])).strftime('%Y-%m-%d %H:%M:%S')

'2013-10-06 05:00:50'

In [220]:
# Helper to convert each row's Unix timestamp (seconds) to a local-time string
date_format = '%Y-%m-%d %H:%M:%S'

def date_convert(timestamp):
    return (datetime.datetime
            .fromtimestamp(int(timestamp))
            .strftime(date_format))

In [222]:
# Convert timestamp --> date (str)
reviews['date'] = reviews.timestamp.apply(date_convert)
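
Note that `datetime.datetime.fromtimestamp` converts to the local time zone of the machine running the notebook, so the date strings can differ between environments. A possible vectorized alternative that pins the result to UTC is sketched below with `pd.to_datetime`; the Series `timestamps` holds two values taken from the reviews data, and the variable names are assumptions for this example.

```python
import pandas as pd

# Two timestamps from the reviews data, stored as strings like the `timestamp` column.
timestamps = pd.Series(['1381006850', '1586466072'])

# unit='s' reads the integers as seconds since the Unix epoch; utc=True fixes the time zone.
dates_utc = pd.to_datetime(timestamps.astype('int64'), unit='s', utc=True)
date_text = dates_utc.dt.strftime('%Y-%m-%d %H:%M:%S')
print(date_text.tolist())
```

Keeping `dates_utc` as a datetime column, rather than converting it back to strings, would also make later filtering or sorting by date simpler.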

In [223]:
# Confirm changes
reviews.head()

| | user_id | movie_id | rating | timestamp | date |
|---|---|---|---|---|---|
| 0 | 1 | 114508 | 8 | 1381006850 | 2013-10-06 05:00:50 |
| 1 | 2 | 208092 | 5 | 1586466072 | 2020-04-10 05:01:12 |
| 2 | 2 | 358273 | 9 | 1579057827 | 2020-01-15 11:10:27 |
| 3 | 2 | 10039344 | 5 | 1578603053 | 2020-01-10 04:50:53 |
| 4 | 2 | 6751668 | 9 | 1578955697 | 2020-01-14 06:48:17 |


In [225]:
reviews.to_csv('./reviews_clean.csv')
movies.to_csv('./movies_clean.csv')
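
When these files are read back in later, two details are easy to miss: `to_csv` writes the row index as an unnamed first column by default, and id columns lose their string type unless `dtype` is given again. The sketch below shows one way to handle both on a hypothetical two-row frame, using an in-memory buffer in place of a file.

```python
import io
import pandas as pd

# Hypothetical cleaned frame with a string id column.
demo = pd.DataFrame({'movie_id': ['0114508', '0208092'], 'year': [1995, 2001]})

buffer = io.StringIO()
demo.to_csv(buffer)   # default index=True also writes the row index
buffer.seek(0)

# index_col=0 restores the index; dtype keeps the ids as strings on reload.
restored = pd.read_csv(buffer, index_col=0, dtype={'movie_id': object})
print(restored)
```

Without `index_col=0`, the reloaded frame would gain an extra column named `Unnamed: 0`.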