# **Movie Recommendation System**

### Importing Kaggle Data Sources

This cell is designed to import your Kaggle data sources to the correct location within the notebook environment.

#### Key Points:
- **Libraries**: The cell imports several essential Python libraries such as `os`, `sys`, `urllib`, `zipfile`, `tarfile`, and `shutil`. These libraries handle file and URL operations.
- **Paths**: Defines the paths for Kaggle input and working directories.
- **Symlinks**: Creates `../input` and `../working` symbolic links that point to the Kaggle input and working directories, so relative paths such as `../input/movie.csv` resolve correctly.
- **Data Download**: The script downloads data from specified URLs, decompresses it if necessary, and places it in the appropriate directory.
- **Error Handling**: Includes error handling for issues like HTTP errors or OS errors during data download and extraction.

This setup ensures that the required data is correctly imported and available for the subsequent steps in the notebook.
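
As a rough illustration of how each `DATA_SOURCE_MAPPING` entry is interpreted, the sketch below parses a single `directory:encoded_url` pair. The `movies` directory name and the URL are made up for the example; the real mapping in this notebook has an empty directory name and a signed URL that expires.

```python
from urllib.parse import unquote, urlparse

# Hypothetical mapping entry in the same "directory:encoded_url" form.
mapping = "movies:https%3A%2F%2Fexample.com%2Fbundle%2Farchive.zip%3Ftoken%3Dabc"

# The URL is percent-encoded, so the only literal ':' separates the two parts.
directory, encoded_url = mapping.split(":")
download_url = unquote(encoded_url)
filename = urlparse(download_url).path  # path only, without the query string

print(directory)
print(download_url)
print(filename.endswith(".zip"))  # decides between ZipFile and tarfile extraction
```

Checking the extension on the parsed path rather than on the full URL keeps the query string (the signature parameters) out of the comparison.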


In [1]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = ':https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F339%2F77759%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240710%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240710T102144Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D5f2ede1df73e2f7e57dc119e03b94ed95dee6d121c29769c2483a7c255f82fcd2482622ea23bbe4e62e2c3af9c1e8ddf2bd7ed4730ae8e9ea1cbdf15bd8ffacfbc5c431b2ac30dfc14148c4fe76bbb8be5618ebe74a5b6edd046ec12692eedb757a39e2744c61a27f99f6f0f795b5d920dd4bad4139840184ad2d72e2c278e20bf6fc6459e50bd231e621482ea4bc61a58246d9f6333c69d1924a633455954dc3a0bc3fb29435527471bc24053f293de37b1f658941dda7208471e1a1bd97d24293ceb4ade5c1338e2ad268de0fa2b1bcf87f7a7f6cfaacc985660ad63ca653787ab6691a40a9eec2453efd7934bdace69d95f66aa39ae14d1fc73374a1de3fd'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              with tarfile.open(tfile.name) as tar:  # avoid shadowing the tarfile module
                tar.extractall(destination_path)
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


Downloading , 204953792 bytes compressed
Downloaded and uncompressed: 
Data source import complete.


### Importing Libraries

In this cell, we import various essential libraries that will be used throughout the notebook for data processing, visualization, and machine learning tasks.

#### Libraries Imported:
- **mpl_toolkits.mplot3d**: Provides tools for creating 3D plots.
- **sklearn.preprocessing.StandardScaler**: A class for standardizing features by removing the mean and scaling to unit variance.
- **matplotlib.pyplot**: A plotting library for creating static, animated, and interactive visualizations in Python.
- **numpy**: A library for numerical computations and handling arrays.
- **os**: Used for accessing and interacting with the operating system's file and directory structure.
- **pandas**: A powerful data manipulation and analysis library, particularly useful for working with structured data such as CSV files.

These libraries are foundational for the tasks of data analysis, visualization, and preprocessing in our movie recommendation system project.


In [2]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


### Listing Imported Data Files

This cell lists the files that were imported from the Kaggle data source directory.

#### Code Functionality:
- **os.listdir**: This function lists all files and directories in the specified path. In this case, it lists the contents of the `../input` directory, which contains the data files we will be working with.

#### Output:
- The output shows the following files:
  - **genome_tags.csv**: Contains tags and their corresponding IDs.
  - **tag.csv**: Contains movie tags provided by users.
  - **rating.csv**: Contains user ratings for movies.
  - **link.csv**: Contains links to external sites (e.g., IMDb).
  - **genome_scores.csv**: Contains tag relevance scores for movies.
  - **movie.csv**: Contains basic movie information such as movie IDs and titles.

These files will be used in subsequent steps for building the movie recommendation system.


In [3]:
print(os.listdir('../input'))

['tag.csv', 'genome_scores.csv', 'rating.csv', 'link.csv', 'movie.csv', 'genome_tags.csv']


### Loading and Previewing All Data Files

In this cell, we load and preview all the necessary CSV files from the dataset directory. Each file is loaded into a Pandas dataframe and its basic structure is displayed to ensure it is correctly imported.

#### Key Points:
1. **Genome Scores**:
    - **File**: `genome_scores.csv`
    - **Dataframe Name**: `df1`
    - **Description**: Contains tag relevance scores for movies.
    - **Output**: Number of rows and columns, and a preview of the first five rows.

2. **Genome Tags**:
    - **File**: `genome_tags.csv`
    - **Dataframe Name**: `df2`
    - **Description**: Contains tags and their corresponding IDs.
    - **Output**: Number of rows and columns, and a preview of the first five rows.

3. **Links**:
    - **File**: `link.csv`
    - **Dataframe Name**: `df3`
    - **Description**: Contains links to external sites (e.g., IMDb).
    - **Output**: Number of rows and columns, and a preview of the first five rows.

4. **Movies**:
    - **File**: `movie.csv`
    - **Dataframe Name**: `df4`
    - **Description**: Contains basic movie information such as movie IDs and titles.
    - **Output**: Number of rows and columns, and a preview of the first five rows.

5. **Ratings**:
    - **File**: `rating.csv`
    - **Dataframe Name**: `df5`
    - **Description**: Contains user ratings for movies.
    - **Output**: Number of rows and columns, and a preview of the first five rows.

6. **Tags**:
    - **File**: `tag.csv`
    - **Dataframe Name**: `df6`
    - **Description**: Contains movie tags provided by users.
    - **Output**: Number of rows and columns, and a preview of the first five rows.

#### Steps:
- For each file, the CSV is read using `pd.read_csv()`, and the number of rows and columns are printed to provide an overview of the data size.
- The `head()` function is used to display the first five rows, giving a glimpse of the data structure and content.

This process ensures that all necessary files are correctly imported and their structures are understood before proceeding with further analysis.


In [5]:
# importing genome_scores.csv with name df1
df1 = pd.read_csv('../input/genome_scores.csv', delimiter=',')
df1.dataframeName = 'genome_scores.csv'
nRow, nCol = df1.shape
print(f'There are {nRow} rows and {nCol} columns in df1')
print(df1.head())


# importing genome_tags.csv with name df2
df2 = pd.read_csv('../input/genome_tags.csv', delimiter=',')
df2.dataframeName = 'genome_tags.csv'
nRow, nCol = df2.shape
print(f'\nThere are {nRow} rows and {nCol} columns in df2')
print(df2.head())


# importing link.csv with name df3
df3 = pd.read_csv('../input/link.csv', delimiter=',')
df3.dataframeName = 'link.csv'
nRow, nCol = df3.shape
print(f'\nThere are {nRow} rows and {nCol} columns in df3')
print(df3.head())


# importing movie.csv with name df4
df4 = pd.read_csv('../input/movie.csv', delimiter=',')
df4.dataframeName = 'movie.csv'
nRow, nCol = df4.shape
print(f'\nThere are {nRow} rows and {nCol} columns in df4')
print(df4.head())


# importing rating.csv with name df5
df5 = pd.read_csv('../input/rating.csv', delimiter=',')
df5.dataframeName = 'rating.csv'
nRow, nCol = df5.shape
print(f'\nThere are {nRow} rows and {nCol} columns in df5')
print(df5.head())

# importing tag.csv with name df6
df6 = pd.read_csv('../input/tag.csv', delimiter=',')
df6.dataframeName = 'tag.csv'
nRow, nCol = df6.shape
print(f'\nThere are {nRow} rows and {nCol} columns in df6')
print(df6.head())

There are 11709768 rows and 3 columns in df1
   movieId  tagId  relevance
0        1      1    0.02500
1        1      2    0.02500
2        1      3    0.05775
3        1      4    0.09675
4        1      5    0.14675

There are 1128 rows and 2 columns in df2
   tagId           tag
0      1           007
1      2  007 (series)
2      3  18th century
3      4         1920s
4      5         1930s

There are 27278 rows and 3 columns in df3
   movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0
3        4  114885  31357.0
4        5  113041  11862.0

There are 27278 rows and 3 columns in df4
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres

By loading these datasets and previewing their content, we ensure that all necessary data is correctly imported and ready for analysis. This step is crucial for verifying the integrity and structure of the data before proceeding with further steps in the movie recommendation system project.
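
The six near-identical load blocks above could be folded into one helper. The sketch below is only one way to do that; `load_and_preview` is a hypothetical name, and an in-memory CSV stands in for the real files so the example runs on its own.

```python
import io

import pandas as pd


def load_and_preview(source, name, n=5):
    """Read a CSV, report its shape, and print the first n rows."""
    frame = pd.read_csv(source)
    n_rows, n_cols = frame.shape
    print(f"There are {n_rows} rows and {n_cols} columns in {name}")
    print(frame.head(n))
    return frame


# In-memory stand-in for a file such as '../input/genome_scores.csv'.
sample_csv = io.StringIO("movieId,tagId,relevance\n1,1,0.025\n1,2,0.025\n")
sample = load_and_preview(sample_csv, "sample")
```

In the notebook itself the call would take a path instead, for example `load_and_preview('../input/genome_scores.csv', 'df1')`.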


### Merging Genome and Tags

In this cell, we merge the `genome_scores.csv` and `genome_tags.csv` datasets to create a unified dataframe that includes movie IDs, tag IDs, relevance scores, and tag names. This merged dataset will provide a comprehensive view of the tag relevance scores for each movie.


In [6]:
genome = pd.merge(df1, df2, on='tagId', how='left')[['movieId', 'tagId', 'relevance', 'tag']]
genome.head(3)

Unnamed: 0,movieId,tagId,relevance,tag
0,1,1,0.025,007
1,1,2,0.025,007 (series)
2,1,3,0.05775,18th century



By merging the genome scores and tags, we create a rich dataset that combines tag IDs and names with their relevance scores for each movie. This merged dataset will be useful for analyzing tag-based movie recommendations.
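
To make the effect of `how='left'` concrete, here is a small sketch on toy data (the values are invented): every row of the scores table is kept, and the tag name is looked up through `tagId`.

```python
import pandas as pd

# Toy stand-ins for genome_scores.csv and genome_tags.csv.
scores = pd.DataFrame({
    "movieId": [1, 1, 2],
    "tagId": [1, 2, 1],
    "relevance": [0.025, 0.9, 0.4],
})
tags = pd.DataFrame({"tagId": [1, 2], "tag": ["007", "animation"]})

# Left merge: one output row per scores row, in the same order.
merged = pd.merge(scores, tags, on="tagId", how="left")
print(merged)
```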


### Adding Movie Titles and Genres to Genome Data

In this cell, we further enrich the `genome` dataframe by merging it with the `movie.csv` dataset (`df4`) to add movie titles and genres. Only `movieId`, `tag`, `title`, and `genres` are kept after the merge, so the `tagId` and `relevance` columns are dropped at this point.


In [7]:
genome = pd.merge(genome, df4, on='movieId', how='left')[['movieId', 'tag', 'title', 'genres']]
genome.head(3)

Unnamed: 0,movieId,tag,title,genres
0,1,007,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,007 (series),Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,1,18th century,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy



By adding movie titles and genres to the genome data, we enhance the dataset with more meaningful information about each movie. This enriched dataframe will be instrumental in performing more insightful analyses and generating accurate movie recommendations.


### Summary of Genome Data Information

This cell displays essential information about the `genome` dataframe using the `info()` method. The `info()` method provides an overview of the dataframe's structure, including the number of entries (rows), column names, data types, and memory usage.


In [8]:
genome.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11709768 entries, 0 to 11709767
Data columns (total 4 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   movieId  int64 
 1   tag      object
 2   title    object
 3   genres   object
dtypes: int64(1), object(3)
memory usage: 357.4+ MB


### Optimizing DataFrame by Converting Data Types

In this cell, we optimize the `genome` dataframe by converting the data type of the `movieId` column from `int64` to `int32`. This conversion reduces memory usage while maintaining compatibility with typical operations and analyses.


In [None]:
# Convert data types
genome['movieId'] = genome['movieId'].astype('int32')

# Display the optimized DataFrame info
genome.info()  # info() prints its report itself and returns None


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11709768 entries, 0 to 11709767
Data columns (total 4 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   movieId  int32 
 1   tag      object
 2   title    object
 3   genres   object
dtypes: int32(1), object(3)
memory usage: 312.7+ MB
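
The saving from the `int64` to `int32` conversion can be checked on a small synthetic column. This is a sketch on made-up data, not the notebook's dataframe; it also confirms that the values fit in 32 bits before downcasting.

```python
import numpy as np
import pandas as pd

# One million synthetic IDs stored as 64-bit integers.
ids = pd.Series(np.arange(1_000_000), dtype="int64")

bytes_int64 = ids.memory_usage(index=False)
bytes_int32 = ids.astype("int32").memory_usage(index=False)

# Downcasting is only safe if the largest value fits in int32.
fits_in_int32 = ids.max() <= np.iinfo(np.int32).max

print(bytes_int64, bytes_int32, fits_in_int32)
```

At 4 bytes saved per row, the 11,709,768 rows of `genome` save roughly 44.7 MiB, which matches the drop from 357.4 MB to 312.7 MB reported by `info()`.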


### Grouping and Aggregating Tags by Movie

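In this cell, we group the `genome` dataframe by `movieId`, `title`, and `genres`, aggregating the `tag` column into a list for each group, so that each movie occupies a single row holding all of its tags. Note that the relevance scores were dropped in the earlier merge, and the row counts (11,709,768 score rows for 1,128 tags) suggest that every tag is scored for every covered movie, so each movie's list contains the full set of tag names, as the identical list prefixes in the output indicate.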


In [9]:
# Group by 'movieId', 'title', and 'genres' and aggregate 'tag' into a list
genome = genome.groupby(['movieId', 'title', 'genres'])['tag'].agg(list).reset_index()

genome.head(3)

Unnamed: 0,movieId,title,genres,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[007, 007 (series), 18th century, 1920s, 1930s..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,"[007, 007 (series), 18th century, 1920s, 1930s..."
2,3,Grumpier Old Men (1995),Comedy|Romance,"[007, 007 (series), 18th century, 1920s, 1930s..."
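
The grouping step can be seen in miniature below. The rows are invented, but the `groupby(...).agg(list).reset_index()` call is the same one used in the cell above.

```python
import pandas as pd

# Toy long-format frame: one row per (movie, tag) pair.
long_df = pd.DataFrame({
    "movieId": [1, 1, 2],
    "title": ["Toy Story (1995)", "Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation", "Adventure|Animation", "Adventure|Children"],
    "tag": ["007", "animation", "007"],
})

# Collapse to one row per movie, collecting that movie's tags into a list.
grouped = (
    long_df.groupby(["movieId", "title", "genres"])["tag"]
    .agg(list)
    .reset_index()
)
print(grouped)
```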


### Assigning DataFrame and Verification

In this cell, the `genome` dataframe is assigned to `df` for further processing or analysis. The verification step displays the first few rows of `df` to ensure that the assignment was successful and to provide a preview of the dataframe's structure and content.



In [10]:
df = genome

# Display the first few rows to verify
df.head(3)

Unnamed: 0,movieId,title,genres,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[007, 007 (series), 18th century, 1920s, 1930s..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,"[007, 007 (series), 18th century, 1920s, 1930s..."
2,3,Grumpier Old Men (1995),Comedy|Romance,"[007, 007 (series), 18th century, 1920s, 1930s..."


### Freeing Up Memory by Deleting DataFrames and Garbage Collection

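In this cell, the names `df1`, `df2`, `df3`, `df4`, `df6`, and `genome` are deleted, and the Python garbage collector (`gc.collect()`) is invoked to reclaim memory from objects that are no longer referenced. `df5` is kept because its `del` line is commented out. Deleting `genome` removes only that name: the grouped data remains available through `df`, which refers to the same object.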



In [11]:
import gc

# Drop the DataFrames
del df1
del df2
del df3
del df4
# del df5
del df6
del genome

# Collect garbage to free up memory
gc.collect()


62
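
A short sketch of why `del genome` does not discard the grouped data: plain assignment binds a second name to the same object, and `del` removes a name rather than the object. The `_demo` names are hypothetical and used only here.

```python
import gc

import pandas as pd

genome_demo = pd.DataFrame({"movieId": [1, 2, 3]})
df_demo = genome_demo              # second name for the same object, not a copy
same_object = df_demo is genome_demo

del genome_demo                    # removes only the name 'genome_demo'
gc.collect()

print(same_object)
print(df_demo.shape)               # the data is still reachable through df_demo
```

Memory is released only for objects with no remaining references, which is why deleting `df1` to `df4` and `df6` frees memory while deleting `genome` here does not.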

### Displaying the 'genres' Column from DataFrame

In this cell, the code snippet `df[['genres']]` is used to select and display the `genres` column from the `df` dataframe. This operation retrieves a subset of the dataframe containing only the `genres` column, which describes the genres associated with each movie.


In [12]:
df[['genres']]

Unnamed: 0,genres
0,Adventure|Animation|Children|Comedy|Fantasy
1,Adventure|Children|Fantasy
2,Comedy|Romance
3,Comedy|Drama|Romance
4,Comedy
...,...
10376,Action|Thriller
10377,Horror|Romance|Sci-Fi
10378,Comedy
10379,Drama


### Converting 'genres' Column to List of Strings

In this cell, the code snippet converts the `genres` column in the `df` dataframe from a delimited string format (separated by '|') into a list of strings for each movie. This transformation facilitates easier handling and analysis of movie genres, enabling operations such as genre-based filtering or aggregation.


In [14]:
# Convert 'genres' to a list of strings
df['genres'] = df['genres'].apply(lambda x: x.split('|'))

# Display the resulting DataFrame
print(df[['genres']].head())


                                              genres
0  [Adventure, Animation, Children, Comedy, Fantasy]
1                     [Adventure, Children, Fantasy]
2                                  [Comedy, Romance]
3                           [Comedy, Drama, Romance]
4                                           [Comedy]
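
As an alternative to `apply` with a lambda, pandas offers a vectorized string accessor. The sketch below uses invented values and is not the notebook's code; it shows `Series.str.split`, which also passes missing values through instead of raising, whereas `x.split('|')` would fail on a non-string entry.

```python
import pandas as pd

# Two genre strings plus a missing value, to show how gaps are handled.
genres = pd.Series(["Adventure|Children|Fantasy", "Comedy", None])

# A single-character separator is treated as a literal, not a regex.
as_lists = genres.str.split("|")
print(as_lists.tolist())
```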


### Creating a Duplicate Column for 'genres'

In this cell, a duplicate column named `genres1` is created in the `df` dataframe, holding the same content as the existing `genres` column.

The duplicate exists so that each genre appears twice once the columns are combined into a single text field later on, which gives the `genres` terms more weight in the model than they would otherwise have.

In [15]:
# Create a duplicate of the 'genres' column
df['genres1'] = df['genres']

df.head()

Unnamed: 0,movieId,title,genres,tag,genres1
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]","[007, 007 (series), 18th century, 1920s, 1930s...","[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]","[007, 007 (series), 18th century, 1920s, 1930s...","[Adventure, Children, Fantasy]"
2,3,Grumpier Old Men (1995),"[Comedy, Romance]","[007, 007 (series), 18th century, 1920s, 1930s...","[Comedy, Romance]"
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]","[007, 007 (series), 18th century, 1920s, 1930s...","[Comedy, Drama, Romance]"
4,5,Father of the Bride Part II (1995),[Comedy],"[007, 007 (series), 18th century, 1920s, 1930s...",[Comedy]
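
The effect of repeating a term can be checked with a tiny TF-IDF experiment. This is a sketch on invented two-document corpora using scikit-learn's default `TfidfVectorizer` settings; `weight_of` is a hypothetical helper written for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def weight_of(term, docs):
    """TF-IDF weight of `term` in the first document of `docs`."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(docs)
    column = vectorizer.vocabulary_[term]
    return matrix[0, column]


# Same corpora, except that the genre word is repeated in the second one.
single = weight_of("comedy", ["comedy heist", "drama heist"])
doubled = weight_of("comedy", ["comedy comedy heist", "drama drama heist"])

print(round(float(single), 3), round(float(doubled), 3))
```

Because each row is normalized to unit length, the repeated term does not simply double its weight; it takes a larger share of the row relative to the other terms.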


### Removing Year from Movie Titles

In this cell, a function `remove_year_from_title` is defined to remove the year enclosed in parentheses from movie titles in the `df` dataframe. The function uses regular expressions (`re`) to identify and replace patterns matching the format `(YYYY)` where YYYY represents a four-digit year.

The year is removed, and the result stored in a new `title_without_year` column, so that the model can also pick up on movies with similar titles.

In [16]:
import re

# Function to remove the year from the title
def remove_year_from_title(title):
    return re.sub(r'\(\d{4}\)', '', title).strip()

# Apply the function to remove the year from the title
df['title_without_year'] = df['title'].apply(remove_year_from_title)

df.head(3)

Unnamed: 0,movieId,title,genres,tag,genres1,title_without_year
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]","[007, 007 (series), 18th century, 1920s, 1930s...","[Adventure, Animation, Children, Comedy, Fantasy]",Toy Story
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]","[007, 007 (series), 18th century, 1920s, 1930s...","[Adventure, Children, Fantasy]",Jumanji
2,3,Grumpier Old Men (1995),"[Comedy, Romance]","[007, 007 (series), 18th century, 1920s, 1930s...","[Comedy, Romance]",Grumpier Old Men


### Combining Columns into a Single List and Dropping Separate Columns

In this cell, the code merges the `title_without_year`, `genres`, `genres1`, and `tag` columns of the `df` dataframe into a single list-valued column called `genres_and_tags`. After merging, the separate columns (`genres`, `genres1`, `tag`, `title_without_year`) are dropped to streamline the dataframe structure.


In [17]:
# Merge 'title_without_year', 'genres', 'genres1', and 'tag' columns into a single list
df['genres_and_tags'] = df.apply(lambda row: [row['title_without_year']] + row['genres'] + row['genres1'] + row['tag'], axis=1)

# Drop the separate 'genres', 'genres1', 'tag', and 'title_without_year' columns if needed
df = df.drop(columns=['genres', 'genres1', 'tag', 'title_without_year'])

df.head()

Unnamed: 0,movieId,title,genres_and_tags
0,1,Toy Story (1995),"[Toy Story, Adventure, Animation, Children, Co..."
1,2,Jumanji (1995),"[Jumanji, Adventure, Children, Fantasy, Advent..."
2,3,Grumpier Old Men (1995),"[Grumpier Old Men, Comedy, Romance, Comedy, Ro..."
3,4,Waiting to Exhale (1995),"[Waiting to Exhale, Comedy, Drama, Romance, Co..."
4,5,Father of the Bride Part II (1995),"[Father of the Bride Part II, Comedy, Comedy, ..."


### Checking What the `genres_and_tags` Column Looks Like

In [18]:
# Access and print the last element (position 10380) of the 'genres_and_tags' column
print(df['genres_and_tags'].iloc[10380])

['Parallels', 'Sci-Fi', 'Sci-Fi', '007', '007 (series)', '18th century', '1920s', '1930s', '1950s', '1960s', '1970s', '1980s', '19th century', '3d', '70mm', '80s', '9/11', 'aardman', 'aardman studios', 'abortion', 'absurd', 'action', 'action packed', 'adaptation', 'adapted from:book', 'adapted from:comic', 'adapted from:game', 'addiction', 'adolescence', 'adoption', 'adultery', 'adventure', 'affectionate', 'afi 100', 'afi 100 (laughs)', 'afi 100 (movie quotes)', 'africa', 'afterlife', 'aging', 'aids', 'airplane', 'airport', 'alaska', 'alcatraz', 'alcoholism', 'alien', 'alien invasion', 'aliens', 'allegory', 'almodovar', 'alone in the world', 'alter ego', 'alternate endings', 'alternate history', 'alternate reality', 'alternate universe', 'amazing cinematography', 'amazing photography', 'american civil war', 'amnesia', 'amy smart', 'android(s)/cyborg(s)', 'androids', 'animal movie', 'animals', 'animated', 'animation', 'anime', 'antarctica', 'anti-hero', 'anti-semitism', 'anti-war', 'apo

### Convert Genres and Tags to Comma-Separated String

The following code snippet converts the list of title, genres, and tags in the `genres_and_tags` column into a single comma-separated string per movie. The text preprocessing and vectorization steps that follow operate on strings rather than lists, so this conversion is needed first.



In [19]:
# Convert the list to a comma-separated string
df['genres_and_tags'] = df['genres_and_tags'].apply(lambda x: ','.join(x))

# Access and print the first element of the 'genres_and_tags' column
print(df['genres_and_tags'].iloc[0])

df.head(3)

Toy Story,Adventure,Animation,Children,Comedy,Fantasy,Adventure,Animation,Children,Comedy,Fantasy,007,007 (series),18th century,1920s,1930s,1950s,1960s,1970s,1980s,19th century,3d,70mm,80s,9/11,aardman,aardman studios,abortion,absurd,action,action packed,adaptation,adapted from:book,adapted from:comic,adapted from:game,addiction,adolescence,adoption,adultery,adventure,affectionate,afi 100,afi 100 (laughs),afi 100 (movie quotes),africa,afterlife,aging,aids,airplane,airport,alaska,alcatraz,alcoholism,alien,alien invasion,aliens,allegory,almodovar,alone in the world,alter ego,alternate endings,alternate history,alternate reality,alternate universe,amazing cinematography,amazing photography,american civil war,amnesia,amy smart,android(s)/cyborg(s),androids,animal movie,animals,animated,animation,anime,antarctica,anti-hero,anti-semitism,anti-war,apocalypse,archaeology,argentina,arms dealer,arnold,art,art house,artificial intelligence,artist,artistic,artsy,assassin,assassination,assassins,as

Unnamed: 0,movieId,title,genres_and_tags
0,1,Toy Story (1995),"Toy Story,Adventure,Animation,Children,Comedy,..."
1,2,Jumanji (1995),"Jumanji,Adventure,Children,Fantasy,Adventure,C..."
2,3,Grumpier Old Men (1995),"Grumpier Old Men,Comedy,Romance,Comedy,Romance..."


### NLP Preprocessing for Genres and Tags

The following code snippet applies NLTK (Natural Language Toolkit) preprocessing to the `genres_and_tags` column: tokenization, lowercasing, punctuation removal, stopword removal, and lemmatization. This normalizes the text so that movie titles, genres, and tags are in a consistent format for vectorization.


In [20]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

# Download NLTK resources (you only need to do this once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize NLTK's WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

# Define a function for NLP preprocessing
def preprocess_text(text):
    if isinstance(text, str):  # Check if text is a string
        # Tokenization
        tokens = word_tokenize(text)

        # Lowercasing
        tokens = [token.lower() for token in tokens]

        # Removing punctuation
        tokens = [token for token in tokens if token not in string.punctuation]

        # Removing stopwords
        tokens = [token for token in tokens if token not in stop_words]

        # Lemmatization
        tokens = [lemmatizer.lemmatize(token) for token in tokens]

        # Join the tokens back into a single string
        processed_text = ' '.join(tokens)

        return processed_text
    else:
        return ''  # Return empty string for non-string values

# Apply the preprocessing function to the 'genres_and_tags' column
df['genres_and_tags'] = df['genres_and_tags'].apply(preprocess_text)

# Display the first few rows to verify
df.head()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Unnamed: 0,movieId,title,genres_and_tags
0,1,Toy Story (1995),toy story adventure animation child comedy fan...
1,2,Jumanji (1995),jumanji adventure child fantasy adventure chil...
2,3,Grumpier Old Men (1995),grumpier old men comedy romance comedy romance...
3,4,Waiting to Exhale (1995),waiting exhale comedy drama romance comedy dra...
4,5,Father of the Bride Part II (1995),"father bride part ii comedy comedy,007,007 ser..."
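
The last preview row still contains a comma-joined piece (`comedy,007,007`), which suggests the tokenizer does not always split on bare commas. The sketch below is a simplified, dependency-free stand-in for the NLTK pipeline, not a replacement for it: it splits explicitly on commas and whitespace, uses a tiny hand-written stopword set, and skips lemmatization. `simple_preprocess` and `STOP_WORDS` are hypothetical names.

```python
import re
import string

# Tiny illustrative stopword set; NLTK's English list is much longer.
STOP_WORDS = {"the", "of", "to", "a", "an", "and", "in"}


def simple_preprocess(text):
    """Lowercase, split on commas/whitespace, drop punctuation and stopwords."""
    if not isinstance(text, str):
        return ""
    tokens = re.split(r"[,\s]+", text.lower())
    # Trim punctuation from both ends of each token, e.g. "(series)" -> "series".
    tokens = [token.strip(string.punctuation) for token in tokens]
    tokens = [token for token in tokens if token and token not in STOP_WORDS]
    return " ".join(tokens)


result = simple_preprocess("Father of the Bride Part II,Comedy,Comedy,007,007 (series)")
print(result)
```

Another option, under the same assumption about the tokenizer, would be to join the list with `', '` or a space instead of a bare comma before calling `word_tokenize`.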


### TF-IDF Vectorization for Genres and Tags

The following code snippet demonstrates the application of TF-IDF (Term Frequency-Inverse Document Frequency) vectorization using Scikit-learn's `TfidfVectorizer` to the preprocessed 'genres_and_tags' text data. TF-IDF is used to quantify the importance of each word in the text data relative to the entire dataset.


In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the combined text data
text_vectors = tfidf_vectorizer.fit_transform(df['genres_and_tags'])

# Convert to dense array for easier manipulation
text_vectors_array = text_vectors.toarray()

# Convert to DataFrame if needed
text_vectors_df = pd.DataFrame(text_vectors_array, columns=tfidf_vectorizer.get_feature_names_out())


### Calculating Movie Similarity Using Cosine Similarity

The following code snippet calculates movie similarity based on cosine similarity using the `cosine_similarity` function from Scikit-learn's `sklearn.metrics.pairwise` module. This approach helps identify movies that are most similar to a given movie based on their TF-IDF vector representations of genres and tags.


In [22]:
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: Define Similarity Metric

# Step 2: Calculate Similarity Matrix
similarity_matrix = cosine_similarity(text_vectors)

# Step 3: Find Similar Movies
def get_similar_movies(movie_title, similarity_matrix, movies_df):
    movie_index = movies_df[movies_df['title'] == movie_title].index[0]
    similar_movies_indices = np.argsort(similarity_matrix[movie_index])[::-1][1:6]  # Exclude the input movie
    similar_movies = movies_df.iloc[similar_movies_indices]['title'].tolist()
    return similar_movies
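
The ranking logic can be tried on a toy corpus. The sketch below uses invented titles and texts; `top_similar` is a hypothetical variant of the function above that excludes the query movie by index rather than by dropping the first sorted position, which avoids returning the query itself when another movie ties with it at similarity 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["A", "B", "C"]
docs = ["adventure child fantasy", "adventure child comedy", "drama romance"]

vectors = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(vectors)  # square matrix; entry [i, j] compares doc i and doc j


def top_similar(index, sim, titles, top_n=2):
    """Titles of the top_n most similar items, excluding the item itself."""
    order = np.argsort(sim[index])[::-1]                 # most similar first
    order = [i for i in order if i != index][:top_n]
    return [titles[i] for i in order]


print(top_similar(0, sim, titles, top_n=1))
```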

### Implementing User Interface for Movie Recommendations

The example below stands in for a user interface: it calls `get_similar_movies` with a hard-coded input title and prints the five most similar movies found with the cosine-similarity approach.


In [23]:
# Implement User Interface (example)
input_movie = "Batman (1989)"
similar_movies = get_similar_movies(input_movie, similarity_matrix, df)
print(f"Movies similar to '{input_movie}': {similar_movies}")

Movies similar to 'Batman (1989)': ['F/X (1986)', 'Brother (2000)', 'Gangster No. 1 (2000)', 'Assassins (1995)', 'Hitman (2007)']


In [24]:
input_movie = "Jumanji (1995)"
similar_movies = get_similar_movies(input_movie, similarity_matrix, df)
print(f"Movies similar to '{input_movie}': {similar_movies}")

Movies similar to 'Jumanji (1995)': ['Escape to Witch Mountain (1975)', 'Witches, The (1990)', 'Peter Pan (2003)', 'Jungle Book (1942)', 'Epic (2013)']


The function above returns similar movies only when the input title matches a title in the dataset exactly; if the input contains a typo, the lookup fails with an error instead of returning recommendations.

To handle typos, the next section uses the `fuzzywuzzy` library.

### Finding Similar Movies with Fuzzy Matching
In this section, fuzzy matching is used to handle potential typos in movie titles. The `fuzzywuzzy` library finds the dataset title closest to the user's input before the similarity lookup, which makes the recommendation system more robust.

In [25]:
!pip install fuzzywuzzy[speedup]

import pandas as pd
import re
from fuzzywuzzy import process



# Function to get the closest match using fuzzy matching
def get_closest_match(input_title, movie_titles):
    closest_match = process.extractOne(input_title, movie_titles)
    return closest_match[0] if closest_match else None

# Define a function to find similar movies with fuzzy matching
def get_similar_movies(input_title, similarity_matrix, movies_df, top_n=5):
    movie_titles = movies_df['title'].tolist()
    closest_title = get_closest_match(input_title, movie_titles)

    if closest_title:
        movie_index = movies_df[movies_df['title'] == closest_title].index[0]
        similarity_scores = list(enumerate(similarity_matrix[movie_index]))
        similar_movies_indices = sorted(similarity_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]
        similar_movies = [movies_df.iloc[i[0]]['title'] for i in similar_movies_indices]
        return similar_movies
    else:
        return []

# Example usage
input_movie = "Toy Stroory (1995"  # Intentional typo
similar_movies = get_similar_movies(input_movie, similarity_matrix, df)
print(f"Movies similar to '{input_movie}': {similar_movies}")


Collecting fuzzywuzzy[speedup]
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Collecting python-levenshtein>=0.12 (from fuzzywuzzy[speedup])
  Downloading python_Levenshtein-0.25.1-py3-none-any.whl (9.4 kB)
Collecting Levenshtein==0.25.1 (from python-levenshtein>=0.12->fuzzywuzzy[speedup])
  Downloading Levenshtein-0.25.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (177 kB)
Collecting rapidfuzz<4.0.0,>=3.8.0 (from Levenshtein==0.25.1->python-levenshtein>=0.12->fuzzywuzzy[speedup])
  Downloading rapidfuzz-3.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.4 MB)
Installing collected packages: fuzzywuzzy, rapidfuzz, Levenshtein, python-levenshtein
Successfully installed Levenshtein-0.25.1 fuzzyw
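
The title-correction step can also be sketched with the standard library alone. The example below swaps in `difflib.get_close_matches` for `fuzzywuzzy`, so the scores and tie-breaking differ from the notebook's method; the short title list and `closest_title` are assumptions made for illustration.

```python
from difflib import get_close_matches

titles = ["Toy Story (1995)", "Jumanji (1995)", "Batman (1989)"]


def closest_title(query, titles, cutoff=0.6):
    """Best-matching title, or None if nothing is similar enough."""
    matches = get_close_matches(query, titles, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest_title("Toy Stroory (1995", titles))  # input with a typo
print(closest_title("zzzz", titles))               # nothing close enough
```

The cutoff matters: without one, a nonsense input would still be mapped to some title. `process.extractOne` accepts a `score_cutoff` argument that can serve the same purpose.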

In [26]:
import joblib

# Assuming you have trained your tfidf_vectorizer and calculated similarity_matrix
# Save tfidf_vectorizer
joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer.pkl')

# Save similarity_matrix as a NumPy .npy file
# (joblib alternative: joblib.dump(similarity_matrix, 'similarity_matrix.pkl'))
np.save('similarity_matrix.npy', similarity_matrix)


# Assuming 'df' is your dataframe
df.to_pickle('dataframe.pkl')
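
A quick round trip confirms that what is saved can be read back with the matching loader: `np.save` pairs with `np.load`, and `DataFrame.to_pickle` pairs with `pd.read_pickle`. The sketch below uses small invented objects and a temporary directory rather than the notebook's files.

```python
import os
import tempfile

import numpy as np
import pandas as pd

sim = np.array([[1.0, 0.2], [0.2, 1.0]])
frame = pd.DataFrame({"movieId": [1, 2], "title": ["A (2000)", "B (2001)"]})

with tempfile.TemporaryDirectory() as tmp:
    sim_path = os.path.join(tmp, "similarity_matrix.npy")
    df_path = os.path.join(tmp, "dataframe.pkl")

    # Save with the same calls as the cell above.
    np.save(sim_path, sim)
    frame.to_pickle(df_path)

    # Load with the loaders that match those calls.
    sim_loaded = np.load(sim_path)
    frame_loaded = pd.read_pickle(df_path)

print(np.array_equal(sim, sim_loaded))
print(frame.equals(frame_loaded))
```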


In [None]:
# import joblib
# import numpy as np
# import pandas as pd

# # Load saved objects
# tfidf_vectorizer = joblib.load('tfidf_vectorizer.pkl')
# similarity_matrix = np.load('similarity_matrix.npy')  # saved with np.save above
# df = pd.read_pickle('dataframe.pkl')  # Only if you saved your DataFrame

