In [43]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#### Loading and Preprocessing Data

In [54]:
df = pd.read_csv('top_rated_movies.csv')
df.head(5)

Unnamed: 0.1,Unnamed: 0,adult,genre_ids,original_language,original_title,overview,popularity,release_date,title,vote_average,vote_count
0,0,False,"[18, 80]",en,The Godfather,"Spanning the years 1945 to 1955, a chronicle of the fictional Italian-American Corleone crime family. When organized crime family patriarch, Vito Corleone barely survives an attempt on his life, his youngest son, Michael steps in to take care of the would-be killers, launching a campaign of bloody revenge.",114.774,1972-03-14,The Godfather,8.7,17855
1,1,False,"[18, 80]",en,The Shawshank Redemption,"Framed in the 1940s for the double murder of his wife and her lover, upstanding banker Andy Dufresne begins a new life at the Shawshank prison, where he puts his accounting skills to work for an amoral warden. During his long stretch in prison, Dufresne comes to be admired by the other inmates -- including an older prisoner named Red -- for his integrity and unquenchable sense of hope.",90.925,1994-09-23,The Shawshank Redemption,8.7,23711
2,2,False,"[35, 14]",es,Cuando Sea Joven,"70-year-old Malena gets a second chance at life when she magically turns into her 22-year-old self. Now, posing as ""Maria"" to hide her true identity, she becomes the lead singer of her grandson's band and tries to recover her dream of singing, which she had to give up at some point.",29.101,2022-09-14,Cuando Sea Joven,8.6,214
3,3,False,"[18, 80]",en,The Godfather Part II,"In the continuing saga of the Corleone crime family, a young Vito Corleone grows up in Sicily and in 1910s New York. In the 1950s, Michael Corleone attempts to expand the family business into Las Vegas, Hollywood and Cuba.",54.944,1974-12-20,The Godfather Part II,8.6,10801
4,4,False,"[18, 36, 10752]",en,Schindler's List,The true story of how businessman Oskar Schindler saved over a thousand Jewish lives from the Nazis while they worked as slaves in his factory during World War II.,55.735,1993-12-15,Schindler's List,8.6,14026


In [55]:
df = df[['title', 'overview']]
df.head(5)

Unnamed: 0,title,overview
0,The Godfather,"Spanning the years 1945 to 1955, a chronicle of the fictional Italian-American Corleone crime family. When organized crime family patriarch, Vito Corleone barely survives an attempt on his life, his youngest son, Michael steps in to take care of the would-be killers, launching a campaign of bloody revenge."
1,The Shawshank Redemption,"Framed in the 1940s for the double murder of his wife and her lover, upstanding banker Andy Dufresne begins a new life at the Shawshank prison, where he puts his accounting skills to work for an amoral warden. During his long stretch in prison, Dufresne comes to be admired by the other inmates -- including an older prisoner named Red -- for his integrity and unquenchable sense of hope."
2,Cuando Sea Joven,"70-year-old Malena gets a second chance at life when she magically turns into her 22-year-old self. Now, posing as ""Maria"" to hide her true identity, she becomes the lead singer of her grandson's band and tries to recover her dream of singing, which she had to give up at some point."
3,The Godfather Part II,"In the continuing saga of the Corleone crime family, a young Vito Corleone grows up in Sicily and in 1910s New York. In the 1950s, Michael Corleone attempts to expand the family business into Las Vegas, Hollywood and Cuba."
4,Schindler's List,The true story of how businessman Oskar Schindler saved over a thousand Jewish lives from the Nazis while they worked as slaves in his factory during World War II.


In [56]:
# Check for null values, then drop rows missing a title or overview
df.isna().sum()
df = df.dropna(subset=['title', 'overview'])

In [62]:
# Remove rows with duplicate titles, keeping the first occurrence
df = df.drop_duplicates(subset='title', keep='first')
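
The two cleaning steps above can be checked on a small frame before running them on the full dataset. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from `top_rated_movies.csv`, and the `reset_index` call is an optional extra that the notebook itself does not use.

```python
import pandas as pd

# Hypothetical toy data (not from top_rated_movies.csv)
toy = pd.DataFrame({
    "title": ["A", "B", "B", "C", None],
    "overview": ["first plot", "second plot", "remake plot", None, "orphan plot"],
})

# Count missing values per column before dropping anything
missing_counts = toy.isna().sum()

# Drop rows missing either field, then keep the first row for each title
cleaned = (
    toy.dropna(subset=["title", "overview"])
       .drop_duplicates(subset="title", keep="first")
       .reset_index(drop=True)  # optional: renumber rows after filtering
)

print(missing_counts.to_dict())
print(cleaned)
```

Only the rows for titles `A` and `B` survive here: `C` has no overview, one row has no title, and the second `B` is a duplicate.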

In [63]:
df = df.sample(n=500, random_state=42)
df.head(5)

Unnamed: 0,title,overview
9201,The Killer Inside Me,"Deputy Sheriff Lou Ford is a pillar of the community in his small west Texas town, patient and apparently thoughtful. Some people think he is a little slow and maybe boring, but that is the worst they say about him. But then nobody knows about what Lou calls his ""sickness"": He is a brilliant, but disturbed sociopathic sadist."
5227,The Bodyguard,"A former Secret Service agent grudgingly takes an assignment to protect a pop idol who's threatened by a crazed fan. At first, the safety-obsessed bodyguard and the self-indulgent diva totally clash. But before long, all that tension sparks fireworks of another sort, and the love-averse tough guy is torn between duty and romance."
106,Demon Slayer -Kimetsu no Yaiba- The Movie: Mugen Train,"Tanjiro Kamado, joined with Inosuke Hashibira, a boy raised by boars who wears a boar's head, and Zenitsu Agatsuma, a scared boy who reveals his true power when he sleeps, boards the Infinity Train on a new mission with the Fire Hashira, Kyojuro Rengoku, to defeat a demon who has been tormenting the people and killing the demon slayers who oppose it!"
4276,Clash of the Titans,"To win the right to marry his love, the beautiful princess Andromeda, and fulfil his destiny, half-God-half-mortal Perseus must complete various tasks including taming Pegasus, capturing Medusa's head and battling the feared Kraken."
3400,The Kissing Booth 3,"It’s the summer before Elle heads to college, and she has a secret decision to make. Elle has been accepted into Harvard, where boyfriend Noah is matriculating, and also Berkeley, where her BFF Lee is headed and has to decide if she should stay or not."


In [66]:
df.to_csv('data.csv', index=False)


#### Training TF-IDF Model

In [64]:
def recommend_movies(user_input, df):
    # stop_words='english' removes common English words (such as "the", "is", "and")
    # that don't help distinguish one document from another
    # the token pattern keeps only tokens made up entirely of ASCII letters, so digits and special characters are ignored
    tfidf = TfidfVectorizer(stop_words='english', token_pattern=r'\b[a-zA-Z]+\b')

    # fit computes the IDF for each word; transform converts each movie overview into a TF-IDF vector over that vocabulary
    # tfidf_matrix is a sparse matrix with one row per movie and one column per vocabulary word
    tfidf_matrix = tfidf.fit_transform(df['overview'])
    
    # Transform the user input into the same TF-IDF space
    user_input_tfidf = tfidf.transform([user_input])

    # compute the cosine similarity between user input and movie descriptions
    cosine_sim = cosine_similarity(user_input_tfidf, tfidf_matrix)

    # turn the first (and only) row of cosine_sim into a list of (position, score) tuples
    # so each score can be mapped back to its movie
    sim_scores = list(enumerate(cosine_sim[0]))
    # sort similarities in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # get the positions of the top 5 movies
    top_movies_indices = [i[0] for i in sim_scores[:5]]

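    # final dataframe of the top 5 movies; the explicit copy avoids a possible
    # SettingWithCopyWarning when the score column is added
    top_movies = df.iloc[top_movies_indices][['title', 'overview']].copy()
    top_movies['similarity_score'] = [i[1] for i in sim_scores[:5]]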

    return top_movies
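
To see what the function does at a smaller scale, the same vectorizer settings can be applied to a few sentences. This is a sketch under assumptions: the three `docs` strings and the query are invented for illustration, and `np.argsort` is used here as an alternative to the `sorted(enumerate(...))` approach above, not as the notebook's own method.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for movie overviews
docs = [
    "A crime family fights for power in New York",
    "A banker begins a new life in prison",
    "Two robots explore a distant planet",
]

# Same settings as recommend_movies: English stop words, letters-only tokens
vectorizer = TfidfVectorizer(stop_words="english", token_pattern=r"\b[a-zA-Z]+\b")
matrix = vectorizer.fit_transform(docs)       # shape: (3 documents, vocabulary size)

# The query is projected onto the vocabulary learned from docs
query_vec = vectorizer.transform(["prison life of a banker"])
scores = cosine_similarity(query_vec, matrix)[0]   # one score per document

# Positions of the two highest-scoring documents, best first
top_k = np.argsort(scores)[::-1][:2]

print(sorted(vectorizer.vocabulary_))
print(scores)
print(docs[top_k[0]])
```

The query shares words only with the second document, so that document gets the only non-zero score. Query words that never appear in the corpus contribute nothing, which is one reason the scores on real overviews tend to be small.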

In [65]:
user_input = input('Enter a description of a movie you like: ')

top_5_movies = recommend_movies(user_input, df)

print('Top 5 recommended movies:')
pd.set_option('display.max_colwidth', None)
display(top_5_movies)

Top 5 recommended movies:


Unnamed: 0,title,overview,similarity_score
23,Cinema Paradiso,"A filmmaker recalls his childhood, when he fell in love with the movies at his village's theater and formed a deep friendship with the theater's projectionist.",0.163449
7590,Monsters vs Aliens,"When Susan Murphy is unwittingly clobbered by a meteor full of outer space gunk on her wedding day, she mysteriously grows to 49-feet-11-inches. The military jumps into action and captures Susan, secreting her away to a covert government compound. She is renamed Ginormica and placed in confinement with a ragtag group of Monsters...",0.115026
2789,Secondhand Lions,"The comedic adventures of an introverted boy left on the doorstep of a pair of reluctant, eccentric great-uncles, whose exotic remembrances stir the boy's spirit and re-ignite the men's lives.",0.114894
6474,Camp Rock,"When Mitchie gets a chance to attend Camp Rock, her life takes an unpredictable twist, and she learns just how important it is to be true to yourself.",0.107199
3889,Dragon: The Bruce Lee Story,"This film is a glimpse into the life, love and the unconquerable spirit of the legendary Bruce Lee. From a childhood of rigorous martial arts training, Lee realizes his dream of opening his own kung-fu school in America. Before long, he is discovered by a Hollywood producer and begins a meteoric rise to fame and an all too short reign as one the most charismatic action heroes in cinema history.",0.077441
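
`recommend_movies` refits the vectorizer on every call, which repeats the same work when several queries are run against the same DataFrame. One possible variation fits once and reuses the matrix. This is a sketch, not the notebook's method: `build_index` and `query_index` are hypothetical helper names, and `toy_df` is invented data used only to exercise them.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_index(df):
    """Fit the vectorizer once on the overviews (hypothetical helper)."""
    tfidf = TfidfVectorizer(stop_words="english", token_pattern=r"\b[a-zA-Z]+\b")
    tfidf_matrix = tfidf.fit_transform(df["overview"])
    return tfidf, tfidf_matrix


def query_index(user_input, df, tfidf, tfidf_matrix, top_n=5):
    """Score one query against the prebuilt matrix (hypothetical helper)."""
    scores = cosine_similarity(tfidf.transform([user_input]), tfidf_matrix)[0]
    result = df[["title", "overview"]].copy()
    result["similarity_score"] = scores
    # nlargest returns the top_n rows sorted by score, highest first
    return result.nlargest(top_n, "similarity_score")


# Hypothetical toy data standing in for the sampled movies
toy_df = pd.DataFrame({
    "title": ["Heist Night", "Space Garden", "Prison Letters"],
    "overview": [
        "Thieves plan a daring museum robbery",
        "Astronauts grow plants on a distant moon",
        "An inmate writes letters from prison to his daughter",
    ],
})

tfidf, tfidf_matrix = build_index(toy_df)
top = query_index("letters written from prison", toy_df, tfidf, tfidf_matrix, top_n=2)
print(top[["title", "similarity_score"]])
```

With the notebook's data, the equivalent calls would be `build_index(df)` once, followed by `query_index(user_input, df, tfidf, tfidf_matrix)` for each new description.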
