# AI/Machine Learning Intern Challenge: Simple Content-Based Recommendation
By Josh Houlding

## Dataset
<b>Dataset Chosen:</b> [Movies dataset details | Kaggle](https://www.kaggle.com/datasets/sachinkumar62/movies-details) <br>
<b>Description from Kaggle Page:</b> "This dataset contains information on 8,551 movies, including titles, release dates, popularity scores, and user ratings. It features vote counts and average ratings, making it useful for analyzing top-rated films and audience preferences. The dataset also includes movie overviews, providing a brief summary of each film. Ideal for recommendation systems, trend analysis, and sentiment studies in the film industry."

## Data Loading and Preprocessing

In [468]:
import pandas as pd

# Load and view data
df = pd.read_csv("movies.csv")
df.head()

| Unnamed: 0.1 | Unnamed: 0 | id | title | overview | release_date | popularity | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19404 | Dilwale Dulhania Le Jayenge | Raj is a rich, carefree, happy-go-lucky second... | 1995-10-20 | 18.433 | 8.7 | 2763 |
| 1 | 1 | 724089 | Gabriel's Inferno Part II | Professor Gabriel Emerson finally learns the t... | 2020-07-31 | 8.439 | 8.7 | 1223 |
| 2 | 2 | 278 | The Shawshank Redemption | Framed in the 1940s for the double murder of h... | 1994-09-23 | 65.57 | 8.7 | 18637 |
| 3 | 3 | 238 | The Godfather | Spanning the years 1945 to 1955, a chronicle o... | 1972-03-14 | 63.277 | 8.7 | 14052 |
| 4 | 4 | 761053 | Gabriel's Inferno Part III | The final part of the film adaption of the ero... | 2020-11-19 | 26.691 | 8.7 | 773 |


The only columns we need are the movie titles and descriptions, so we can drop everything else.

In [469]:
# Reduce dataframe down to only necessary columns
df = df[["title", "overview"]]

The README.md file on this challenge's GitHub repository also says this:

>Make sure the dataset is easy to handle (maybe 100–500 rows) so the solution remains quick to implement and run.

In [470]:
# Show dataset shape
print(f"Number of entries: {len(df)}")

Number of entries: 8551


At 8,551 rows, our dataset is far larger than the suggested 100–500. Let's take a random sample of 500 rows, with a fixed seed so the sample is reproducible, to keep the model fast later on.

In [471]:
# Take random sample of 500 entries
df = df.sample(500, random_state=42)

# Show new entry count
print(f"Number of entries: {len(df)}")

Number of entries: 500


## Handling duplicates and missing values
Duplicate rows would show up as repeated recommendations, and rows without an overview give the model nothing to compare against, so we remove both.

In [472]:
# Drop duplicates
df = df.drop_duplicates()

In [473]:
# Check missing value count by column
df.isna().sum()

title       0
overview    2
dtype: int64

Only 2 of the 500 rows are missing an overview, so we can simply drop them.

In [474]:
# Drop rows with missing values
df = df.dropna()

It's also a good idea to save a copy of the dataframe before processing, so we can access the data in its original form whenever needed.

In [475]:
# Create copy of dataframe
unprocessed_df = df.copy()

## Lowercasing the movie overviews
Setting all letters to lowercase in the movie descriptions means that words like "Space" and "space" are treated as the same token. It also matters for the next step, because the stopword list we use is all lowercase.

In [476]:
# Set descriptions to lowercase
df["overview"] = df["overview"].str.lower()

## Remove punctuation and stopwords
Punctuation includes characters like periods, apostrophes, and colons, and stopwords are common words like "the", "a", and "and" that say little about what a movie is about. Removing both shrinks the vocabulary and keeps the similarity scores focused on meaningful words.

In [477]:
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Remove punctuation from overviews
translator = str.maketrans('', '', string.punctuation)
df["overview"] = df["overview"].astype(str).apply(lambda x: x.translate(translator))

# Remove stopwords from overviews
stop_words = ENGLISH_STOP_WORDS  # Use scikit-learn's stop words
df["overview"] = df["overview"].astype(str).apply(
    lambda x: " ".join([word for word in x.split() if word not in stop_words])
)
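To make the effect of this step concrete, here is a minimal sketch that applies the same two operations to a couple of sample sentences. The `clean` helper and the tiny `TOY_STOP_WORDS` set are hypothetical names used only for illustration; the notebook itself uses scikit-learn's much larger `ENGLISH_STOP_WORDS`.

```python
import string

# Hypothetical mini stopword list, for illustration only; the notebook uses
# scikit-learn's much larger ENGLISH_STOP_WORDS.
TOY_STOP_WORDS = {"the", "a", "and", "in", "is", "of", "to"}

def clean(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, delete punctuation characters, then drop stopwords."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in no_punct.split() if word not in stop_words)

print(clean("The crew is stranded in space, and the oxygen is running out!"))
print(clean("A fast-paced heist"))
```

The first call prints `crew stranded space oxygen running out`. The second prints `fastpaced heist`: because punctuation is deleted rather than replaced with a space, hyphenated words such as "fast-paced" are fused into a single token. If that is undesirable, one option would be to map punctuation to spaces instead.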

## Stemming the text
Stemming reduces words to a root form (the "stem") by stripping suffixes. For example, "running", "runs", and "run" are all stemmed to "run". Irregular forms such as "ran" are left unchanged, because a stemmer applies suffix rules rather than looking words up in a dictionary. As with the other preprocessing steps, this helps related words match and shrinks the vocabulary.

We will use the Porter stemmer here, a widely used rule-based stemmer. It often produces stems that aren't recognizable words (for example, "naive" becomes "naiv" and "business" becomes "busi"), but that is fine for our purposes: the stems are only used for matching, and what matters is that the same stem is produced for the overviews and for the user's query.

In [478]:
from nltk.stem import PorterStemmer

# Perform stemming on overviews
stemmer = PorterStemmer()
df["overview"] = df["overview"].astype(str).apply(
    lambda x: " ".join([stemmer.stem(word) for word in x.split()])
)
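As a quick check of what the stemmer does to individual words, the sketch below stems a few examples directly. It assumes `nltk` is installed; the word lists are chosen only for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Regular inflections collapse to a shared stem
regular = {word: stemmer.stem(word) for word in ["running", "runs", "run"]}

# Irregular forms are not mapped back to the base verb
irregular = stemmer.stem("ran")

# Stems do not have to be dictionary words
odd = [stemmer.stem(word) for word in ["naive", "business"]]

print(regular)
print(irregular)
print(odd)
```

All three regular forms map to `run`, while `ran` stays `ran`, and the last line shows the non-word stems `naiv` and `busi` that also appear in the processed overviews.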

## Final text dataframe

In [479]:
# Show final text dataframe
df["overview"].head()

2389    naiv busi graduat instal presid manufactur com...
5048    year 2056 epidem organ failur devast planet me...
3133    shortli david abbott move new san francisco di...
5955    covert secur compani vanguard hope surviv acco...
625     jake blue just releas prison put old band togt...
Name: overview, dtype: object

## Vectorize the text
We need to convert the text into numeric form so the model can compare overviews, and we will do this with TF-IDF (term frequency-inverse document frequency), a standard weighting scheme for text. It converts the text data into a sparse matrix where each row represents a movie and each column represents a term in the vocabulary. A term gets a high weight in a movie's row when it appears often in that overview but in few other overviews, so the weights capture which words are distinctive for each movie. Movies whose rows contain similar weights for the same terms can then be treated as similar to each other.

In [480]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize text using TF-IDF
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df["overview"])

In [481]:
# Show info about TF-IDF matrix
tfidf_matrix

<498x4478 sparse matrix of type '<class 'numpy.float64'>'
	with 11826 stored elements in Compressed Sparse Row format>

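We have a sparse matrix with 498 rows, one for each movie in the dataframe, and 4,478 columns, one for each unique term found in the processed overviews. Only 11,826 of its roughly 2.2 million cells (about 0.5%) are nonzero, which is why a sparse format is used.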
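To show where the numbers in such a matrix come from, here is a small pure-Python sketch of the weighting on three made-up overviews. It follows what I understand to be `TfidfVectorizer`'s default settings (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows), but it splits on whitespace only, so it is a simplification and not a drop-in replacement. The function name `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency: rarer terms get larger values
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Made-up, already-preprocessed overviews
toy_docs = ["space crew fight alien", "crew rob bank", "alien invad space station"]
vocab, rows = tfidf_rows(toy_docs)

for doc, row in zip(toy_docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the second toy overview, "rob" appears in only one document while "crew" appears in two, so "rob" ends up with the larger weight. Terms that do not occur in an overview get a weight of zero, which is what makes the real matrix sparse.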

## Taking user input and providing recommendations
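The recommendation function in this section ranks movies by the cosine similarity between the query's TF-IDF vector and each movie's TF-IDF vector. As a rough illustration of that ranking step, here is a pure-Python sketch on a made-up three-term vocabulary. The helpers `cosine` and `top_n`, the titles, and the vectors are all hypothetical; the notebook itself uses scikit-learn's `cosine_similarity` on the real matrix.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_n(query_vec, movie_vecs, titles, n):
    """Return up to n (title, score) pairs, best first, skipping zero scores."""
    scored = [(title, cosine(query_vec, vec)) for title, vec in zip(titles, movie_vecs)]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical 3-term vocabulary: [space, action, romance]
titles = ["Movie A", "Movie B", "Movie C"]
movie_vecs = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
query_vec = [1.0, 1.0, 0.0]

ranked = top_n(query_vec, movie_vecs, titles, n=3)
print(ranked)
```

"Movie A" shares both query terms and ranks first, "Movie C" shares one and ranks second, and "Movie B" shares none and is dropped. Skipping zero scores is a design choice of this sketch: a ranking that keeps them can return movies that have nothing in common with the query when fewer than `n` movies match.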

In [482]:
from sklearn.metrics.pairwise import cosine_similarity

# Function to preprocess user query
def preprocess_text(text):
    text = text.lower()
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    words = text.split()
    stop_words = ENGLISH_STOP_WORDS
    words = [word for word in words if word not in stop_words]
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)

def provide_recommendations(user_query, num_recommendations):

    print(f"You provided us with the following query: \"{user_query}\"")
    print(f"This is how many recommendations you wanted: {num_recommendations}")

    # Process user query into a TF-IDF vector using the fitted vocabulary
    processed_user_query = preprocess_text(user_query)
    user_query_vector = tfidf.transform([processed_user_query])

    # Calculate similarity scores between the query and every movie (1-D array)
    similarity_scores = cosine_similarity(user_query_vector, tfidf_matrix)[0]

    # Get the indices of the top n movies, highest similarity first
    movie_indices = similarity_scores.argsort()[::-1][:num_recommendations]

    print(f"Here are {len(movie_indices)} movies we think you'll love. \n")

    # Show titles, similarity scores, and the original (unprocessed) overviews.
    # Iterating over movie_indices avoids an IndexError when more
    # recommendations are requested than there are movies.
    for rank, i in enumerate(movie_indices, start=1):
        movie = unprocessed_df.iloc[i]
        print(f"Movie {rank}: \"{movie['title']}\". Similarity score: {similarity_scores[i]}.")
        print(f"Movie description: \"{movie['overview']}\" \n")

provide_recommendations("I like action movies set in space", 5)

You provided us with the following query: "I like action movies set in space"
This is how many recommendations you wanted: 5
Here are 5 movies we think you'll love. 

Movie 1: "Gattaca". Similarity score: 0.2324916639956013.
Movie description: "In a future society in the era of indefinite eugenics, humans are set on a life course depending on their DNA. Young Vincent Freeman is born with a condition that would prevent him from space travel, yet is determined to infiltrate the GATTACA space program." 

Movie 2: "The Cheetah Girls: One World". Similarity score: 0.21330124631580724.
Movie description: "Chanel, Dorinda, and Aqua, are off to India to star in a Bollywood movie. But when there they discover that they will have to compete against each other to get the role in the movie. Will the Cheetah's break up again?" 

Movie 3: "The Transporter Refueled". Similarity score: 0.21256302560184892.
Movie description: "The fast-paced action movie is again set in the criminal underworld in Franc