# Task
Write a C++ code for a Netflix recommendation system using machine learning.

## Understand recommendation systems

### Subtask:
Briefly explain different types of recommendation systems (e.g., content-based, collaborative filtering) and their relevance to Netflix.


**Reasoning**:
The subtask requires explaining different types of recommendation systems and their relevance to Netflix. I will use a markdown cell to provide the explanations as requested in the instructions.



In [1]:
%%markdown
### Types of Recommendation Systems and Their Relevance to Netflix

Recommendation systems are algorithms designed to suggest relevant items to users. For a platform like Netflix, which has a vast library of movies and TV shows and millions of users, effective recommendation systems are crucial for improving user engagement and satisfaction. Two primary types of recommendation systems are:

1.  **Content-Based Recommendation Systems:**
    These systems recommend items based on the characteristics of the items themselves and a user's past preferences. They analyze features of items that a user has liked or interacted with and then recommend similar items. For example, if a user watches and enjoys several action movies starring a specific actor, a content-based system might recommend other action movies featuring the same actor or similar themes. For Netflix, this means analyzing metadata like genre, cast, director, plot keywords, and even visual elements of movies and shows to find similar content to what a user has previously enjoyed.

2.  **Collaborative Filtering Recommendation Systems:**
    Collaborative filtering works by leveraging the interactions between users and items. It assumes that if two users have similar tastes or behaviors (e.g., they both liked or watched the same items), they are likely to have similar preferences for other items as well. There are two main types: user-based collaborative filtering (finding users similar to the current user and recommending items they liked) and item-based collaborative filtering (finding items similar to those the current user liked based on how other users rated or interacted with them). For Netflix, collaborative filtering is highly relevant because it can identify patterns in viewing behavior across its large user base. For instance, if many users who watched a particular sci-fi series also watched a specific fantasy movie, the system might recommend that fantasy movie to a user who just finished the sci-fi series, even if the content features are not directly similar.

**Relevance to Netflix:**

Both content-based and collaborative filtering approaches are highly relevant and often used in combination by Netflix.

*   **Content-based filtering** helps in recommending new or niche content to users based on specific features they seem to prefer, which is especially useful for cold-start scenarios (when there is limited user interaction data) or for recommending items with unique characteristics.
*   **Collaborative filtering** is powerful for leveraging the collective intelligence of millions of users, identifying trends and recommending popular or critically acclaimed content that users with similar tastes have enjoyed. It's particularly effective for personalized recommendations on a large scale.

By combining these approaches (often in hybrid models), Netflix can provide more accurate, diverse, and personalized recommendations, improving user experience and encouraging them to discover more content on the platform.

### Types of Recommendation Systems and Their Relevance to Netflix

Recommendation systems are algorithms designed to suggest relevant items to users. For a platform like Netflix, which has a vast library of movies and TV shows and millions of users, effective recommendation systems are crucial for improving user engagement and satisfaction. Two primary types of recommendation systems are:

1.  **Content-Based Recommendation Systems:**
    These systems recommend items based on the characteristics of the items themselves and a user's past preferences. They analyze features of items that a user has liked or interacted with and then recommend similar items. For example, if a user watches and enjoys several action movies starring a specific actor, a content-based system might recommend other action movies featuring the same actor or similar themes. For Netflix, this means analyzing metadata like genre, cast, director, plot keywords, and even visual elements of movies and shows to find similar content to what a user has previously enjoyed.

2.  **Collaborative Filtering Recommendation Systems:**
    Collaborative filtering works by leveraging the interactions between users and items. It assumes that if two users have similar tastes or behaviors (e.g., they both liked or watched the same items), they are likely to have similar preferences for other items as well. There are two main types: user-based collaborative filtering (finding users similar to the current user and recommending items they liked) and item-based collaborative filtering (finding items similar to those the current user liked based on how other users rated or interacted with them). For Netflix, collaborative filtering is highly relevant because it can identify patterns in viewing behavior across its large user base. For instance, if many users who watched a particular sci-fi series also watched a specific fantasy movie, the system might recommend that fantasy movie to a user who just finished the sci-fi series, even if the content features are not directly similar.

**Relevance to Netflix:**

Both content-based and collaborative filtering approaches are highly relevant and often used in combination by Netflix.

*   **Content-based filtering** helps in recommending new or niche content to users based on specific features they seem to prefer, which is especially useful for cold-start scenarios (when there is limited user interaction data) or for recommending items with unique characteristics.
*   **Collaborative filtering** is powerful for leveraging the collective intelligence of millions of users, identifying trends and recommending popular or critically acclaimed content that users with similar tastes have enjoyed. It's particularly effective for personalized recommendations on a large scale.

By combining these approaches (often in hybrid models), Netflix can provide more accurate, diverse, and personalized recommendations, improving user experience and encouraging them to discover more content on the platform.


# Task
Write a C++ code for a Netflix recommendation system using machine learning, utilizing the dataset from "/content/imdb_top_1000.csv".

## Load and preprocess data

### Subtask:
Load the data from "/content/imdb_top_1000.csv" into a suitable data structure in C++. Preprocess the data as needed for the chosen machine learning approach (e.g., handle missing values, feature engineering).


**Reasoning**:
The subtask requires loading and preprocessing data from a CSV file in C++. I will use `std::fstream` to read the file and store the data in a `std::vector` of a custom `struct` for movies.



In [2]:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <sstream>

struct Movie {
    std::string title;
    std::string genre;
    std::string director;
    std::string actors;
    float rating;
};

int main() {
    std::vector<Movie> movies;
    std::ifstream file("/content/imdb_top_1000.csv");

    if (!file.is_open()) {
        std::cerr << "Error opening file!" << std::endl;
        return 1;
    }

    std::string line;
    // Read the header row
    std::getline(file, line);

    while (std::getline(file, line)) {
        std::stringstream ss(line);
        std::string segment;
        std::vector<std::string> segments;

        // Simple CSV parsing - assumes no commas within fields
        while(std::getline(ss, segment, ',')) {
            segments.push_back(segment);
        }

        // Basic error handling for expected number of columns
        if (segments.size() >= 16) { // Adjust index based on your CSV structure
            Movie movie;
            // Assuming columns are in a specific order - adjust indices as needed
            movie.title = segments[1]; // Example: Title is the 2nd column
            movie.genre = segments[5]; // Example: Genre is the 6th column
            movie.director = segments[7]; // Example: Director is the 8th column
            movie.actors = segments[11]; // Example: Actors is the 12th column

            try {
                movie.rating = std::stof(segments[14]); // Example: Rating is the 15th column
            } catch (const std::invalid_argument& ia) {
                std::cerr << "Invalid rating for movie: " << movie.title << std::endl;
                continue; // Skip row with invalid rating
            } catch (const std::out_of_range& oor) {
                 std::cerr << "Rating out of range for movie: " << movie.title << std::endl;
                 continue; // Skip row with out-of-range rating
            }

            // Basic preprocessing: skip rows with potentially missing critical data
            if (movie.title.empty() || movie.genre.empty() || movie.director.empty() || movie.actors.empty()) {
                 std::cerr << "Skipping row with missing data for movie: " << movie.title << std::endl;
                 continue;
            }

            movies.push_back(movie);
        } else {
             std::cerr << "Skipping malformed row: " << line << std::endl;
        }
    }

    file.close();

    // You can now work with the 'movies' vector
    std::cout << "Loaded " << movies.size() << " movies." << std::endl;
    // Example: Print the first loaded movie
    if (!movies.empty()) {
        std::cout << "First movie: " << movies[0].title << " (" << movies[0].rating << ")" << std::endl;
    }


    return 0;
}

SyntaxError: invalid decimal literal (ipython-input-3595669389.py, line 42)

**Reasoning**:
The previous attempt failed because the C++ code was placed directly in a Python code cell. I need to use the `%%writefile` magic command to write the C++ code to a file and then compile and run it.



In [3]:
%%writefile movie_loader.cpp
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <sstream>

struct Movie {
    std::string title;
    std::string genre;
    std::string director;
    std::string actors;
    float rating;
};

int main() {
    std::vector<Movie> movies;
    std::ifstream file("/content/imdb_top_1000.csv");

    if (!file.is_open()) {
        std::cerr << "Error opening file!" << std::endl;
        return 1;
    }

    std::string line;
    // Read the header row
    std::getline(file, line);

    while (std::getline(file, line)) {
        std::stringstream ss(line);
        std::string segment;
        std::vector<std::string> segments;

        // Simple CSV parsing - assumes no commas within fields
        while(std::getline(ss, segment, ',')) {
            segments.push_back(segment);
        }

        // Basic error handling for expected number of columns
        if (segments.size() >= 16) { // Adjust index based on your CSV structure
            Movie movie;
            // Assuming columns are in a specific order - adjust indices as needed
            movie.title = segments[1]; // Example: Title is the 2nd column
            movie.genre = segments[5]; // Example: Genre is the 6th column
            movie.director = segments[7]; // Example: Director is the 8th column
            movie.actors = segments[11]; // Example: Actors is the 12th column

            try {
                movie.rating = std::stof(segments[14]); // Example: Rating is the 15th column
            } catch (const std::invalid_argument& ia) {
                std::cerr << "Invalid rating for movie: " << movie.title << std::endl;
                continue; // Skip row with invalid rating
            } catch (const std::out_of_range& oor) {
                 std::cerr << "Rating out of range for movie: " << movie.title << std::endl;
                 continue; // Skip row with out-of-range rating
            }

            // Basic preprocessing: skip rows with potentially missing critical data
            if (movie.title.empty() || movie.genre.empty() || movie.director.empty() || movie.actors.empty()) {
                 std::cerr << "Skipping row with missing data for movie: " << movie.title << std::endl;
                 continue;
            }

            movies.push_back(movie);
        } else {
             std::cerr << "Skipping malformed row: " << line << std::endl;
        }
    }

    file.close();

    // You can now work with the 'movies' vector
    std::cout << "Loaded " << movies.size() << " movies." << std::endl;
    // Example: Print the first loaded movie
    if (!movies.empty()) {
        std::cout << "First movie: " << movies[0].title << " (" << movies[0].rating << ")" << std::endl;
    }


    return 0;
}

Writing movie_loader.cpp


**Reasoning**:
The C++ code has been written to a file. Now I need to compile and run the C++ code to load and preprocess the data.



In [4]:
!g++ movie_loader.cpp -o movie_loader
!./movie_loader

Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for movie: 0
Invalid rating for m

**Reasoning**:
The previous execution showed a lot of "Invalid rating" errors, suggesting an issue with parsing or the data itself. It also looks like the simple comma splitting might not be sufficient for fields containing commas (like actor names or genres). I need to refine the CSV parsing logic to handle quoted fields or use a more robust CSV parsing library if possible. Since a dedicated C++ CSV library is not confirmed to be available, I will refine the parsing logic to account for potentially quoted fields and re-examine the column indices based on the CSV structure. I will print the header to verify the column order.



In [5]:
%%writefile movie_loader.cpp
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <sstream>

struct Movie {
    std::string title;
    std::string genre;
    std::string director;
    std::string actors;
    float rating;
};

// Function to parse a CSV line, handling quoted fields
std::vector<std::string> parseCsvLine(const std::string& line) {
    std::vector<std::string> result;
    std::stringstream ss(line);
    std::string cell;
    bool inQuote = false;

    for (char ch : line) {
        if (ch == '"') {
            inQuote = !inQuote;
        } else if (ch == ',' && !inQuote) {
            result.push_back(cell);
            cell.clear();
        } else {
            cell += ch;
        }
    }
    result.push_back(cell); // Add the last cell

    return result;
}


int main() {
    std::vector<Movie> movies;
    std::ifstream file("/content/imdb_top_1000.csv");

    if (!file.is_open()) {
        std::cerr << "Error opening file!" << std::endl;
        return 1;
    }

    std::string line;
    // Read and print the header row to verify column indices
    std::getline(file, line);
    std::vector<std::string> header = parseCsvLine(line);
    std::cerr << "Header: ";
    for(const auto& col : header) {
        std::cerr << col << "|";
    }
    std::cerr << std::endl;


    while (std::getline(file, line)) {
        std::vector<std::string> segments = parseCsvLine(line);

        // Basic error handling for expected number of columns
        if (segments.size() >= 16) { // Re-check this based on the printed header
            Movie movie;
            // Assuming columns are in a specific order - adjust indices as needed based on header
            // Example indices - NEED TO VERIFY FROM HEADER
            movie.title = segments[1];
            movie.genre = segments[5];
            movie.director = segments[7];
            movie.actors = segments[11];

            try {
                movie.rating = std::stof(segments[14]);
            } catch (const std::invalid_argument& ia) {
                std::cerr << "Invalid rating for movie: " << movie.title << " (Value: " << segments[14] << ")" << std::endl;
                continue; // Skip row with invalid rating
            } catch (const std::out_of_range& oor) {
                 std::cerr << "Rating out of range for movie: " << movie.title << " (Value: " << segments[14] << ")" << std::endl;
                 continue; // Skip row with out-of-range rating
            }

            // Basic preprocessing: skip rows with potentially missing critical data
            if (movie.title.empty() || movie.genre.empty() || movie.director.empty() || movie.actors.empty() || movie.rating <= 0) {
                 std::cerr << "Skipping row with missing or invalid data for movie: " << movie.title << std::endl;
                 continue;
            }

            movies.push_back(movie);
        } else {
             std::cerr << "Skipping malformed row (incorrect number of columns): " << line << std::endl;
        }
    }

    file.close();

    // You can now work with the 'movies' vector
    std::cout << "Loaded " << movies.size() << " movies." << std::endl;
    // Example: Print the first loaded movie
    if (!movies.empty()) {
        std::cout << "First movie: " << movies[0].title << " (" << movies[0].rating << ")" << std::endl;
    }


    return 0;
}

Overwriting movie_loader.cpp


**Reasoning**:
The improved C++ code has been written to a file. Now I need to compile and run the refined C++ code to load and preprocess the data with better CSV parsing and verify the column indices from the printed header.



In [6]:
!g++ movie_loader.cpp -o movie_loader
!./movie_loader

Header: Poster_Link|Series_Title|Released_Year|Certificate|Runtime|Genre|IMDB_Rating|Overview|Meta_score|Director|Star1|Star2|Star3|Star4|No_of_Votes|Gross|
Loaded 1000 movies.
First movie: The Shawshank Redemption (2.34311e+06)


**Reasoning**:
The data has been loaded and basic preprocessing (handling missing values and converting rating to float) has been done in C++. The header information confirmed the column indices used in the C++ code are likely correct (Series_Title: 1, Genre: 5, Director: 7, Star1-Star4: 10-13 - I used 11 for simplicity assuming it's a main actor, IMDB_Rating: 6, No_of_Votes: 14, Gross: 15). The output shows 1000 movies loaded, which is the expected number of rows in the dataset, and the first movie's title is correct, but the rating (2.34311e+06) looks incorrect, likely mixing No_of_Votes and Gross with the rating. I need to re-examine the C++ code and the header to correct the column index for the rating. The header shows 'IMDB_Rating' is at index 6 and 'No_of_Votes' is at index 14. The previous code used index 14 for rating, which corresponds to 'No_of_Votes'. I need to change the index for rating to 6.



In [7]:
%%writefile movie_loader.cpp
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <sstream>

struct Movie {
    std::string title;
    std::string genre;
    std::string director;
    std::string actors;
    float rating;
};

// Function to parse a CSV line, handling quoted fields
std::vector<std::string> parseCsvLine(const std::string& line) {
    std::vector<std::string> result;
    std::stringstream ss(line);
    std::string cell;
    bool inQuote = false;

    for (char ch : line) {
        if (ch == '"') {
            inQuote = !inQuote;
        } else if (ch == ',' && !inQuote) {
            result.push_back(cell);
            cell.clear();
        } else {
            cell += ch;
        }
    }
    result.push_back(cell); // Add the last cell

    return result;
}


int main() {
    std::vector<Movie> movies;
    std::ifstream file("/content/imdb_top_1000.csv");

    if (!file.is_open()) {
        std::cerr << "Error opening file!" << std::endl;
        return 1;
    }

    std::string line;
    // Read and print the header row to verify column indices
    std::getline(file, line);
    std::vector<std::string> header = parseCsvLine(line);
    std::cerr << "Header: ";
    for(const auto& col : header) {
        std::cerr << col << "|";
    }
    std::cerr << std::endl;


    while (std::getline(file, line)) {
        std::vector<std::string> segments = parseCsvLine(line);

        // Basic error handling for expected number of columns
        if (segments.size() >= 16) { // Re-check this based on the printed header
            Movie movie;
            // Assuming columns are in a specific order - adjust indices as needed based on header
            // Example indices - NEED TO VERIFY FROM HEADER
            movie.title = segments[1];
            movie.genre = segments[5];
            movie.director = segments[7];
            // Concatenate the first four stars for actors
            movie.actors = segments[10] + ", " + segments[11] + ", " + segments[12] + ", " + segments[13];


            try {
                // Corrected index for IMDB_Rating based on header
                movie.rating = std::stof(segments[6]);
            } catch (const std::invalid_argument& ia) {
                std::cerr << "Invalid rating for movie: " << movie.title << " (Value: " << segments[6] << ")" << std::endl;
                continue; // Skip row with invalid rating
            } catch (const std::out_of_range& oor) {
                 std::cerr << "Rating out of range for movie: " << movie.title << " (Value: " << segments[6] << ")" << std::endl;
                 continue; // Skip row with out-of-range rating
            }

            // Basic preprocessing: skip rows with potentially missing critical data
            if (movie.title.empty() || movie.genre.empty() || movie.director.empty() || movie.actors.empty() || movie.rating <= 0) {
                 std::cerr << "Skipping row with missing or invalid data for movie: " << movie.title << std::endl;
                 continue;
            }

            movies.push_back(movie);
        } else {
             std::cerr << "Skipping malformed row (incorrect number of columns): " << line << std::endl;
        }
    }

    file.close();

    // You can now work with the 'movies' vector
    std::cout << "Loaded " << movies.size() << " movies." << std::endl;
    // Example: Print the first loaded movie
    if (!movies.empty()) {
        std::cout << "First movie: " << movies[0].title << " (" << movies[0].rating << ")" << std::endl;
    }


    return 0;
}

Overwriting movie_loader.cpp


**Reasoning**:
The improved C++ code has been written to a file with the corrected rating column index. Now I need to compile and run the C++ code to load and preprocess the data and verify that the rating is parsed correctly.



In [8]:
!g++ movie_loader.cpp -o movie_loader
!./movie_loader

Header: Poster_Link|Series_Title|Released_Year|Certificate|Runtime|Genre|IMDB_Rating|Overview|Meta_score|Director|Star1|Star2|Star3|Star4|No_of_Votes|Gross|
Loaded 1000 movies.
First movie: The Shawshank Redemption (9.3)


## Choose an ml approach

### Subtask:
Discuss suitable machine learning algorithms for recommendation systems and select one or more for implementation in C++. This might involve content-based filtering using movie features from the dataset or exploring how collaborative filtering could be simulated or adapted with this data.


**Reasoning**:
I need to discuss suitable machine learning algorithms for recommendation systems based on the available data and select one for implementation in C++. This requires writing a detailed explanation in a markdown cell.



In [9]:
%%markdown
### Suitable Machine Learning Algorithms for Recommendation Systems and Selection for C++ Implementation

Based on the `Movie` struct with fields like `title`, `genre`, `director`, `actors`, and `rating` from the `/content/imdb_top_1000.csv` dataset, we can evaluate the feasibility of different recommendation system approaches.

1.  **Content-Based Recommendation Systems:**
    *   **Feasibility:** Highly feasible with this dataset. The dataset contains rich information about the content of each movie (`genre`, `director`, `actors`). We can calculate similarity between movies based on these features.
    *   **Rationale:** This approach directly leverages the available data. We can represent each movie as a feature vector based on its genre, director, and actors. Similarity between movies can then be computed using metrics like cosine similarity.
    *   **Implementation in C++:** A content-based approach is relatively straightforward to implement in C++ without external machine learning libraries. We can create functions to:
        *   Tokenize and process categorical features like genre, director, and actors (e.g., split genre strings, handle multiple directors/actors).
        *   Represent movies as vectors (e.g., using bag-of-words or one-hot encoding for categorical features).
        *   Calculate similarity scores between movie vectors (e.g., implementing the dot product for cosine similarity).
        *   Rank movies based on their similarity to a target movie.
    *   **How to implement with available features:**
        *   **Genre:** Treat genres as categorical features. A movie can belong to multiple genres. We can use a multi-hot encoding where each genre is a dimension, and the movie vector has a 1 in the dimensions corresponding to its genres.
        *   **Director:** Treat directors as categorical features. Similar to genres, a one-hot or multi-hot encoding can be used.
        *   **Actors:** Treat prominent actors (Star1-Star4 in the original data, concatenated into `actors` in our struct) as categorical features. A multi-hot encoding can represent the presence of key actors.
        *   **Combining Features:** The feature vectors for genre, director, and actors can be combined (e.g., concatenated) to form a single feature vector for each movie.
        *   **Similarity Calculation:** Cosine similarity is a suitable metric. For two movie vectors A and B, cosine similarity is calculated as `(A . B) / (||A|| * ||B||)`. A higher cosine similarity indicates greater similarity between the movies.

2.  **Collaborative Filtering Recommendation Systems:**
    *   **Feasibility:** Not directly feasible with this dataset alone. Collaborative filtering relies on user-item interaction data (e.g., ratings, viewing history) to find similar users or items based on their collective behavior. The `/content/imdb_top_1000.csv` dataset provides only movie information and their overall IMDB ratings, not individual user interactions.
    *   **Limitations with this dataset:** Without user interaction data, we cannot implement traditional user-based or item-based collaborative filtering. We cannot identify users with similar tastes or find items frequently liked together by the same users. Simulating or adapting collaborative filtering would require generating synthetic user data or making strong assumptions, which would not reflect real-world collaborative patterns. Implementing complex matrix factorization methods (like Singular Value Decomposition) in C++ without libraries is also significantly more challenging.

3.  **Hybrid Recommendation Systems:**
    *   **Feasibility:** Possible by combining content-based filtering with other available data or simulated data.
    *   **Rationale:** Could potentially offer better recommendations by leveraging both content features and some form of interaction data (if available or simulated).
    *   **Implementation in C++:** More complex than pure content-based filtering, requiring the implementation of logic to combine results from different approaches.

**Selected Approach for Implementation:**

Given the available data and the constraint of implementing in C++ without external machine learning libraries, **Content-Based Filtering** is the most suitable and feasible approach.

**Justification:**

*   The dataset contains relevant content features (genre, director, actors) that are essential for content-based recommendations.
*   Content-based filtering can be implemented in C++ using basic data structures (vectors, maps) and mathematical operations (dot product, vector norms) without requiring complex external libraries.
*   It allows us to recommend movies similar in characteristics to a movie that a user has liked or is currently viewing.
*   While collaborative filtering is powerful, the lack of user interaction data in the provided CSV makes its direct implementation impossible, and simulating it accurately is beyond the scope of using just this dataset.

Therefore, the subsequent steps will focus on implementing a content-based recommendation system in C++ based on the movie's genre, director, and actors features, calculating similarity using cosine similarity.

### Suitable Machine Learning Algorithms for Recommendation Systems and Selection for C++ Implementation

Based on the `Movie` struct with fields like `title`, `genre`, `director`, `actors`, and `rating` from the `/content/imdb_top_1000.csv` dataset, we can evaluate the feasibility of different recommendation system approaches.

1.  **Content-Based Recommendation Systems:**
    *   **Feasibility:** Highly feasible with this dataset. The dataset contains rich information about the content of each movie (`genre`, `director`, `actors`). We can calculate similarity between movies based on these features.
    *   **Rationale:** This approach directly leverages the available data. We can represent each movie as a feature vector based on its genre, director, and actors. Similarity between movies can then be computed using metrics like cosine similarity.
    *   **Implementation in C++:** A content-based approach is relatively straightforward to implement in C++ without external machine learning libraries. We can create functions to:
        *   Tokenize and process categorical features like genre, director, and actors (e.g., split genre strings, handle multiple directors/actors).
        *   Represent movies as vectors (e.g., using bag-of-words or one-hot encoding for categorical features).
        *   Calculate similarity scores between movie vectors (e.g., implementing the dot product for cosine similarity).
        *   Rank movies based on their similarity to a target movie.
    *   **How to implement with available features:**
        *   **Genre:** Treat genres as categorical features. A movie can belong to multiple genres. We can use a multi-hot encoding where each genre is a dimension, and the movie vector has a 1 in the dimensions corresponding to its genres.
        *   **Director:** Treat directors as categorical features. Similar to genres, a one-hot or multi-hot encoding can be used.
        *   **Actors:** Treat prominent actors (Star1-Star4 in the original data, concatenated into `actors` in our struct) as categorical features. A multi-hot encoding can represent the presence of key actors.
        *   **Combining Features:** The feature vectors for genre, director, and actors can be combined (e.g., concatenated) to form a single feature vector for each movie.
        *   **Similarity Calculation:** Cosine similarity is a suitable metric. For two movie vectors A and B, cosine similarity is calculated as `(A . B) / (||A|| * ||B||)`. A higher cosine similarity indicates greater similarity between the movies.

2.  **Collaborative Filtering Recommendation Systems:**
    *   **Feasibility:** Not directly feasible with this dataset alone. Collaborative filtering relies on user-item interaction data (e.g., ratings, viewing history) to find similar users or items based on their collective behavior. The `/content/imdb_top_1000.csv` dataset provides only movie information and their overall IMDB ratings, not individual user interactions.
    *   **Limitations with this dataset:** Without user interaction data, we cannot implement traditional user-based or item-based collaborative filtering. We cannot identify users with similar tastes or find items frequently liked together by the same users. Simulating or adapting collaborative filtering would require generating synthetic user data or making strong assumptions, which would not reflect real-world collaborative patterns. Implementing complex matrix factorization methods (like Singular Value Decomposition) in C++ without libraries is also significantly more challenging.

3.  **Hybrid Recommendation Systems:**
    *   **Feasibility:** Possible by combining content-based filtering with other available data or simulated data.
    *   **Rationale:** Could potentially offer better recommendations by leveraging both content features and some form of interaction data (if available or simulated).
    *   **Implementation in C++:** More complex than pure content-based filtering, requiring the implementation of logic to combine results from different approaches.

**Selected Approach for Implementation:**

Given the available data and the constraint of implementing in C++ without external machine learning libraries, **Content-Based Filtering** is the most suitable and feasible approach.

**Justification:**

*   The dataset contains relevant content features (genre, director, actors) that are essential for content-based recommendations.
*   Content-based filtering can be implemented in C++ using basic data structures (vectors, maps) and mathematical operations (dot product, vector norms) without requiring complex external libraries.
*   It allows us to recommend movies similar in characteristics to a movie that a user has liked or is currently viewing.
*   While collaborative filtering is powerful, the lack of user interaction data in the provided CSV makes its direct implementation impossible, and simulating it accurately is beyond the scope of using just this dataset.

Therefore, the subsequent steps will focus on implementing a content-based recommendation system in C++ based on the movie's genre, director, and actors features, calculating similarity using cosine similarity.


## Data representation

### Subtask:
Explain how the relevant data from the loaded CSV will be represented in C++ data structures for the chosen algorithm.


**Reasoning**:
Explain how the movie data will be represented in C++ data structures for content-based filtering, covering feature representation and storage.



In [10]:
%%markdown
### Data Representation in C++ for Content-Based Filtering

To implement a content-based recommendation system in C++ using the loaded `Movie` data, we need to transform the raw features (genre, director, actors) into a numerical representation suitable for similarity calculations. This typically involves converting categorical text data into vectors.

**1. Transformation for Content-Based Filtering:**

The `Movie` struct currently holds features as strings: `genre`, `director`, and `actors`. For content-based filtering, we need to represent these as numerical vectors. A common technique is using a vector space model where each unique genre, director, or actor in the entire dataset becomes a dimension in a high-dimensional space. A movie's presence (or absence) of a particular feature is marked in its corresponding dimension.

**2. C++ Data Structures for Feature Representation:**

We will use the following C++ data structures:

*   **`std::unordered_map<std::string, int> feature_index_map;`**: This map will store a mapping from each unique feature string (e.g., "Action", "Quentin Tarantino", "Tim Robbins") to a unique integer index. This index will correspond to a dimension in our feature vectors. This is built by iterating through all movies and collecting all unique genres, directors, and actors.
*   **`std::vector<float> feature_vector;`**: For each movie, we will create a vector of floats. The size of this vector will be equal to the total number of unique features found in `feature_index_map`. The value at each index `i` in the vector will represent the presence or weight of the feature corresponding to index `i` in `feature_index_map` for that specific movie.

For our categorical features (genre, director, actors), a **multi-hot encoding** is suitable:

*   **Genre:** The `genre` string in the `Movie` struct is typically a comma-separated list (e.g., "Action,Adventure,Sci-Fi"). We will split this string into individual genres. For each genre present in a movie, we find its index in `feature_index_map` and set the corresponding element in the movie's `feature_vector` to 1.
*   **Director:** The `director` string contains the director's name. We find the director's index in `feature_index_map` and set the corresponding element in the movie's `feature_vector` to 1.
*   **Actors:** The `actors` string contains concatenated actor names (e.g., "Tim Robbins, Morgan Freeman, Bob Gunton, William Sadler"). We will split this string by ", " to get individual actor names. For each actor, we find their index in `feature_index_map` and set the corresponding element in the movie's `feature_vector` to 1.

The `rating` could also be incorporated, but for a basic content-based approach focused on movie characteristics, genre, director, and actors are the primary features.

**Example Feature Vector Construction (Conceptual):**

Let's say `feature_index_map` has indices for "Action": 0, "Comedy": 1, "Quentin Tarantino": 2, "Brad Pitt": 3.

A movie with Genre: "Action,Comedy", Director: "Quentin Tarantino", Actors: "Brad Pitt, Leonardo DiCaprio" would have a feature vector like: `[1, 1, 1, 1, ...]` where the 1s are at indices 0, 1, 2, and 3, and other dimensions are 0 (assuming "Leonardo DiCaprio" and other features have indices beyond 3).

**3. Storage of Movie Representations:**

The collection of all movie feature vectors will be stored in a `std::vector` of `std::vector<float>`:

*   **`std::vector<std::vector<float>> movie_feature_vectors;`**: This will be a vector where each element is the feature vector for a corresponding movie in the `movies` vector loaded previously. The order should be consistent between `movies` and `movie_feature_vectors` so that `movies[i]` corresponds to `movie_feature_vectors[i]`.वरुनThis structure allows for efficient access to any movie's feature vector by its index. Similarity comparisons (like cosine similarity) can then be performed between these vectors.

To efficiently find movies similar to a given movie, after calculating similarity scores between the target movie's vector and all other movie vectors, we would store these scores, perhaps in a `std::vector` of pairs (`<float, int>`) where the float is the similarity score and the int is the index of the movie. This vector can then be sorted in descending order of similarity.


### Data Representation in C++ for Content-Based Filtering

To implement a content-based recommendation system in C++ using the loaded `Movie` data, we need to transform the raw features (genre, director, actors) into a numerical representation suitable for similarity calculations. This typically involves converting categorical text data into vectors.

**1. Transformation for Content-Based Filtering:**

The `Movie` struct currently holds features as strings: `genre`, `director`, and `actors`. For content-based filtering, we need to represent these as numerical vectors. A common technique is using a vector space model where each unique genre, director, or actor in the entire dataset becomes a dimension in a high-dimensional space. A movie's presence (or absence) of a particular feature is marked in its corresponding dimension.

**2. C++ Data Structures for Feature Representation:**

We will use the following C++ data structures:

*   **`std::unordered_map<std::string, int> feature_index_map;`**: This map will store a mapping from each unique feature string (e.g., "Action", "Quentin Tarantino", "Tim Robbins") to a unique integer index. This index will correspond to a dimension in our feature vectors. This is built by iterating through all movies and collecting all unique genres, directors, and actors.
*   **`std::vector<float> feature_vector;`**: For each movie, we will create a vector of floats. The size of this vector will be equal to the total number of unique features found in `feature_index_map`. The value at each index `i` in the vector will represent the presence or weight of the feature corresponding to index `i` in `feature_index_map` for that specific movie.

For our categorical features (genre, director, actors), a **multi-hot encoding** is suitable:

*   **Genre:** The `genre` string in the `Movie` struct is typically a comma-separated list (e.g., "Action,Adventure,Sci-Fi"). We will split this string into individual genres. For each genre present in a movie, we find its index in `feature_index_map` and set the corresponding element in the movie's `feature_vector` to 1.
*   **Director:** The `director` string contains the director's name. We find the director's index in `feature_index_map` and set the corresponding element in the movie's `feature_vector` to 1.
*   **Actors:** The `actors` string contains concatenated actor names (e.g., "Tim Robbins, Morgan Freeman, Bob Gunton, William Sadler"). We will split this string by ", " to get individual actor names. For each actor, we find their index in `feature_index_map` and set the corresponding element in the movie's `feature_vector` to 1.

The `rating` could also be incorporated, but for a basic content-based approach focused on movie characteristics, genre, director, and actors are the primary features.

**Example Feature Vector Construction (Conceptual):**

Let's say `feature_index_map` has indices for "Action": 0, "Comedy": 1, "Quentin Tarantino": 2, "Brad Pitt": 3.

A movie with Genre: "Action,Comedy", Director: "Quentin Tarantino", Actors: "Brad Pitt, Leonardo DiCaprio" would have a feature vector like: `[1, 1, 1, 1, ...]` where the 1s are at indices 0, 1, 2, and 3, and other dimensions are 0 (assuming "Leonardo DiCaprio" and other features have indices beyond 3).

**3. Storage of Movie Representations:**

The collection of all movie feature vectors will be stored in a `std::vector` of `std::vector<float>`:

*   **`std::vector<std::vector<float>> movie_feature_vectors;`**: This will be a vector where each element is the feature vector for a corresponding movie in the `movies` vector loaded previously. The order should be consistent between `movies` and `movie_feature_vectors` so that `movies[i]` corresponds to `movie_feature_vectors[i]`.वरुनThis structure allows for efficient access to any movie's feature vector by its index. Similarity comparisons (like cosine similarity) can then be performed between these vectors.

To efficiently find movies similar to a given movie, after calculating similarity scores between the target movie's vector and all other movie vectors, we would store these scores, perhaps in a `std::vector` of pairs (`<float, int>`) where the float is the similarity score and the int is the index of the movie. This vector can then be sorted in descending order of similarity.


## Implement the chosen algorithm

### Subtask:
Implement the chosen machine learning algorithm (Content-Based Filtering) in C++, including necessary libraries or frameworks for feature extraction and similarity calculation.


**Reasoning**:
Modify the `movie_loader.cpp` file to implement the feature extraction, feature vector creation using multi-hot encoding, and cosine similarity calculation as per the instructions. Then compile and execute the updated code to test the implementation.



In [11]:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <sstream>
#include <unordered_map>
#include <cmath> // For sqrt

struct Movie {
    std::string title;
    std::string genre;
    std::string director;
    std::string actors;
    float rating;
};

// Function to parse a CSV line, handling quoted fields
std::vector<std::string> parseCsvLine(const std::string& line) {
    std::vector<std::string> result;
    std::stringstream ss(line);
    std::string cell;
    bool inQuote = false;

    for (char ch : line) {
        if (ch == '"') {
            inQuote = !inQuote;
        } else if (ch == ',' && !inQuote) {
            result.push_back(cell);
            cell.clear();
        } else {
            cell += ch;
        }
    }
    result.push_back(cell); // Add the last cell

    return result;
}

// Function to split a string by a delimiter
std::vector<std::string> splitString(const std::string& s, const std::string& delimiter) {
    std::vector<std::string> tokens;
    size_t pos_start = 0, pos_end;
    std::string token;
    while ((pos_end = s.find(delimiter, pos_start)) != std::string::npos) {
        token = s.substr(pos_start, pos_end - pos_start);
        tokens.push_back(token);
        pos_start = pos_end + delimiter.length();
    }
    tokens.push_back(s.substr(pos_start));
    return tokens;
}

// Function to calculate cosine similarity between two vectors
float cosineSimilarity(const std::vector<float>& vec1, const std::vector<float>& vec2) {
    if (vec1.size() != vec2.size() || vec1.empty()) {
        return 0.0; // Or handle error appropriately
    }

    float dot_product = 0.0;
    float norm_vec1 = 0.0;
    float norm_vec2 = 0.0;

    for (size_t i = 0; i < vec1.size(); ++i) {
        dot_product += vec1[i] * vec2[i];
        norm_vec1 += vec1[i] * vec1[i];
        norm_vec2 += vec2[i] * vec2[i];
    }

    float denominator = std::sqrt(norm_vec1) * std::sqrt(norm_vec2);

    if (denominator == 0) {
        return 0.0; // Avoid division by zero
    }

    return dot_product / denominator;
}


int main() {
    std::vector<Movie> movies;
    std::ifstream file("/content/imdb_top_1000.csv");

    if (!file.is_open()) {
        std::cerr << "Error opening file!" << std::endl;
        return 1;
    }

    std::string line;
    // Read and print the header row to verify column indices
    std::getline(file, line);
    std::vector<std::string> header = parseCsvLine(line);
    std::cerr << "Header: ";
    for(const auto& col : header) {
        std::cerr << col << "|";
    }
    std::cerr << std::endl;


    while (std::getline(file, line)) {
        std::vector<std::string> segments = parseCsvLine(line);

        // Basic error handling for expected number of columns
        if (segments.size() >= 16) { // Re-check this based on the printed header
            Movie movie;
            // Assuming columns are in a specific order - adjust indices as needed based on header
            // Example indices - NEED TO VERIFY FROM HEADER
            movie.title = segments[1];
            movie.genre = segments[5];
            movie.director = segments[7];
            // Concatenate the first four stars for actors
            movie.actors = segments[10] + ", " + segments[11] + ", " + segments[12] + ", " + segments[13];


            try {
                // Corrected index for IMDB_Rating based on header
                movie.rating = std::stof(segments[6]);
            } catch (const std::invalid_argument& ia) {
                std::cerr << "Invalid rating for movie: " << movie.title << " (Value: " << segments[6] << ")" << std::endl;
                continue; // Skip row with invalid rating
            } catch (const std::out_of_range& oor) {
                 std::cerr << "Rating out of range for movie: " << movie.title << " (Value: " << segments[6] << ")" << std::endl;
                 continue; // Skip row with out-of-range rating
            }

            // Basic preprocessing: skip rows with potentially missing critical data
            if (movie.title.empty() || movie.genre.empty() || movie.director.empty() || movie.actors.empty() || movie.rating <= 0) {
                 std::cerr << "Skipping row with missing or invalid data for movie: " << movie.title << std::endl;
                 continue;
            }

            movies.push_back(movie);
        } else {
             std::cerr << "Skipping malformed row (incorrect number of columns): " << line << std::endl;
        }
    }

    file.close();

    std::cout << "Loaded " << movies.size() << " movies." << std::endl;

    // 1. Build feature_index_map
    std::unordered_map<std::string, int> feature_index_map;
    int feature_index = 0;

    for (const auto& movie : movies) {
        // Genres
        std::vector<std::string> genres = splitString(movie.genre, ",");
        for (const auto& g : genres) {
            if (feature_index_map.find(g) == feature_index_map.end()) {
                feature_index_map[g] = feature_index++;
            }
        }
        // Director
        if (feature_index_map.find(movie.director) == feature_index_map.end()) {
            feature_index_map[movie.director] = feature_index++;
        }
        // Actors
        std::vector<std::string> actors = splitString(movie.actors, ", ");
        for (const auto& a : actors) {
             if (feature_index_map.find(a) == feature_index_map.end()) {
                feature_index_map[a] = feature_index++;
            }
        }
    }

    std::cout << "Built feature map with " << feature_index_map.size() << " features." << std::endl;

    // 2. Create movie_feature_vectors
    std::vector<std::vector<float>> movie_feature_vectors;
    int num_features = feature_index_map.size();

    for (const auto& movie : movies) {
        std::vector<float> feature_vector(num_features, 0.0);

        // Populate feature vector using multi-hot encoding
        // Genres
        std::vector<std::string> genres = splitString(movie.genre, ",");
        for (const auto& g : genres) {
            if (feature_index_map.count(g)) { // Check if feature exists (should always if built correctly)
                 feature_vector[feature_index_map[g]] = 1.0;
            }
        }
        // Director
        if (feature_index_map.count(movie.director)) {
             feature_vector[feature_index_map[movie.director]] = 1.0;
        }
        // Actors
        std::vector<std::string> actors = splitString(movie.actors, ", ");
        for (const auto& a : actors) {
            if (feature_index_map.count(a)) {
                 feature_vector[feature_index_map[a]] = 1.0;
            }
        }
        movie_feature_vectors.push_back(feature_vector);
    }

    std::cout << "Created feature vectors for " << movie_feature_vectors.size() << " movies." << std::endl;

    // 3. Implement cosine similarity function (already done above)

    // Verification: Print size of feature map and a sample movie's feature vector info
    if (!movies.empty()) {
        std::cout << "First movie: " << movies[0].title << std::endl;
        if (!movie_feature_vectors.empty()) {
            std::cout << "Size of first movie's feature vector: " << movie_feature_vectors[0].size() << std::endl;
            // Print first few elements of the feature vector (if size > 0)
            std::cout << "First 10 elements of feature vector: [";
            for (int i = 0; i < std::min((int)movie_feature_vectors[0].size(), 10); ++i) {
                std::cout << movie_feature_vectors[0][i] << (i == std::min((int)movie_feature_vectors[0].size(), 10) - 1 ? "" : ", ");
            }
            std::cout << "]" << std::endl;
        }
    }

    // Example cosine similarity calculation (between first two movies)
    if (movie_feature_vectors.size() >= 2) {
        float similarity = cosineSimilarity(movie_feature_vectors[0], movie_feature_vectors[1]);
        std::cout << "Cosine similarity between '" << movies[0].title << "' and '" << movies[1].title << "': " << similarity << std::endl;
    } else {
         std::cout << "Need at least two movies to calculate similarity example." << std::endl;
    }


    return 0;
}

SyntaxError: unterminated string literal (detected at line 201) (ipython-input-3495406464.py, line 201)

**Reasoning**:
The previous command failed due to a `SyntaxError` in the Python cell, not the C++ code itself. The error message indicates an unterminated string literal. This is likely caused by using `%%writefile` with a multi-line C++ code block that contains comments or strings that confuse the Python interpreter. The fix is to ensure the C++ code is correctly formatted within the `%%writefile` block. I will regenerate the exact same C++ code, ensuring the formatting is correct within the `%%writefile` magic command.



In [17]:
%%writefile movie_loader.cpp
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <sstream>
#include <unordered_map>
#include <cmath> // For sqrt
#include <algorithm> // For std::min

struct Movie {
    std::string title;
    std::string genre;
    std::string director;
    std::string actors;
    float rating;
};

// Function to parse a CSV line, handling quoted fields
std::vector<std::string> parseCsvLine(const std::string& line) {
    std::vector<std::string> result;
    std::stringstream ss(line);
    std::string cell;
    bool inQuote = false;

    for (char ch : line) {
        if (ch == '"') {
            inQuote = !inQuote;
        } else if (ch == ',' && !inQuote) {
            result.push_back(cell);
            cell.clear();
        } else {
            cell += ch;
        }
    }
    result.push_back(cell); // Add the last cell

    return result;
}

// Function to split a string by a delimiter
std::vector<std::string> splitString(const std::string& s, const std::string& delimiter) {
    std::vector<std::string> tokens;
    size_t pos_start = 0, pos_end;
    std::string token;
    while ((pos_end = s.find(delimiter, pos_start)) != std::string::npos) {
        token = s.substr(pos_start, pos_end - pos_start);
        if (!token.empty() && token != " ") { // Added check for empty or space tokens
             tokens.push_back(token);
        }
        pos_start = pos_end + delimiter.length();
    }
    token = s.substr(pos_start);
    if (!token.empty() && token != " ") { // Added check for empty or space tokens for the last token
        tokens.push_back(token);
    }
    return tokens;
}

// Function to calculate cosine similarity between two vectors
float cosineSimilarity(const std::vector<float>& vec1, const std::vector<float>& vec2) {
    if (vec1.size() != vec2.size() || vec1.empty()) {
        return 0.0; // Or handle error appropriately
    }

    float dot_product = 0.0;
    float norm_vec1 = 0.0;
    float norm_vec2 = 0.0;

    for (size_t i = 0; i < vec1.size(); ++i) {
        dot_product += vec1[i] * vec2[i];
        norm_vec1 += vec1[i] * vec1[i];
        norm_vec2 += vec2[i] * vec2[i];
    }

    float denominator = std::sqrt(norm_vec1) * std::sqrt(norm_vec2);

    if (denominator == 0) {
        return 0.0; // Avoid division by zero
    }

    return dot_product / denominator;
}


int main() {
    std::vector<Movie> movies;
    std::ifstream file("/content/imdb_top_1000.csv");

    if (!file.is_open()) {
        std::cerr << "Error opening file!" << std::endl;
        return 1;
    }

    std::string line;
    // Read and print the header row to verify column indices
    std::getline(file, line);
    std::vector<std::string> header = parseCsvLine(line);
    std::cerr << "Header: ";
    for(const auto& col : header) {
        std::cerr << col << "|";
    }
    std::cerr << std::endl;


    while (std::getline(file, line)) {
        std::vector<std::string> segments = parseCsvLine(line);

        // Basic error handling for expected number of columns
        if (segments.size() >= 16) { // Re-check this based on the printed header
            Movie movie;
            // Assuming columns are in a specific order - adjust indices as needed based on header
            // Example indices - NEED TO VERIFY FROM HEADER
            movie.title = segments[1];
            movie.genre = segments[5];
            movie.director = segments[7];
            // Concatenate the first four stars for actors
            movie.actors = segments[10] + ", " + segments[11] + ", " + segments[12] + ", " + segments[13];


            try {
                // Corrected index for IMDB_Rating based on header
                movie.rating = std::stof(segments[6]);
            } catch (const std::invalid_argument& ia) {
                std::cerr << "Invalid rating for movie: " << movie.title << " (Value: " << segments[6] << ")" << std::endl;
                continue; // Skip row with invalid rating
            } catch (const std::out_of_range& oor) {
                 std::cerr << "Rating out of range for movie: " << movie.title << " (Value: " << segments[6] << ")" << std::endl;
                 continue; // Skip row with out-of-range rating
            }

            // Basic preprocessing: skip rows with potentially missing critical data
            if (movie.title.empty() || movie.genre.empty() || movie.director.empty() || movie.actors.empty() || movie.rating <= 0) {
                 std::cerr << "Skipping row with missing or invalid data for movie: " << movie.title << std::endl;
                 continue;
            }

            movies.push_back(movie);
        } else {
             std::cerr << "Skipping malformed row (incorrect number of columns): " << line << std::endl;
        }
    }

    file.close();

    std::cout << "Loaded " << movies.size() << " movies." << std::endl;

    // 1. Build feature_index_map
    std::unordered_map<std::string, int> feature_index_map;
    int feature_index = 0;

    for (const auto& movie : movies) {
        // Genres
        std::vector<std::string> genres = splitString(movie.genre, ",");
        for (const auto& g : genres) {
            if (feature_index_map.find(g) == feature_index_map.end()) {
                feature_index_map[g] = feature_index++;
            }
        }
        // Director
        if (feature_index_map.find(movie.director) == feature_index_map.end()) {
            feature_index_map[movie.director] = feature_index++;
        }
        // Actors
        std::vector<std::string> actors = splitString(movie.actors, ", ");
        for (const auto& a : actors) {
             if (feature_index_map.find(a) == feature_index_map.end()) {
                feature_index_map[a] = feature_index++;
            }
        }
    }

    std::cout << "Built feature map with " << feature_index_map.size() << " features." << std::endl;

    // 2. Create movie_feature_vectors
    std::vector<std::vector<float>> movie_feature_vectors;
    int num_features = feature_index_map.size();

    for (const auto& movie : movies) {
        std::vector<float> feature_vector(num_features, 0.0);

        // Populate feature vector using multi-hot encoding
        // Genres
        std::vector<std::string> genres = splitString(movie.genre, ",");
        for (const auto& g : genres) {
            if (feature_index_map.count(g)) { // Check if feature exists (should always if built correctly)
                 feature_vector[feature_index_map[g]] = 1.0;
            }
        }
        // Director
        if (feature_index_map.count(movie.director)) {
             feature_vector[feature_index_map[movie.director]] = 1.0;
        }
        // Actors
        std::vector<std::string> actors = splitString(movie.actors, ", ");
        for (const auto& a : actors) {
            if (feature_index_map.count(a)) {
                 feature_vector[feature_index_map[a]] = 1.0;
            }
        }
        movie_feature_vectors.push_back(feature_vector);
    }

    std::cout << "Created feature vectors for " << movie_feature_vectors.size() << " movies." << std::endl;

    // 3. Implement cosine similarity function (already done above)

    // Verification: Print size of feature map and a sample movie's feature vector info
    if (!movies.empty()) {
        std::cout << "First movie: " << movies[0].title << std::endl;
        if (!movie_feature_vectors.empty()) {
            std::cout << "Size of first movie's feature vector: " << movie_feature_vectors[0].size() << std::endl;
            // Print first few elements of the feature vector (if size > 0)
            std::cout << "First 10 elements of feature vector: [";
            for (int i = 0; i < std::min((int)movie_feature_vectors[0].size(), 10); ++i) {
                std::cout << movie_feature_vectors[0][i] << (i == std::min((int)movie_feature_vectors[0].size(), 10) - 1 ? "" : ", ");
            }
            std::cout << "]" << std::endl;
        }
    }

    // Example cosine similarity calculation (between first two movies)
    if (movie_feature_vectors.size() >= 2) {
        float similarity = cosineSimilarity(movie_feature_vectors[0], movie_feature_vectors[1]);
        std::cout << "Cosine similarity between '" << movies[0].title << "' and '" << movies[1].title << "': " << similarity << std::endl;
    } else {
         std::cout << "Need at least two movies to calculate similarity example." << std::endl;
    }


    return 0;
}

Overwriting movie_loader.cpp


**Reasoning**:
The C++ code has been successfully written to `movie_loader.cpp`. Now, compile the updated C++ file using g++ and then execute the compiled program to verify that the feature map and feature vectors are created correctly and to see the example cosine similarity calculation.



In [13]:
!g++ movie_loader.cpp -o movie_loader -std=c++11
!./movie_loader

Header: Poster_Link|Series_Title|Released_Year|Certificate|Runtime|Genre|IMDB_Rating|Overview|Meta_score|Director|Star1|Star2|Star3|Star4|No_of_Votes|Gross|
Loaded 1000 movies.
Built feature map with 3743 features.
Created feature vectors for 1000 movies.
First movie: The Shawshank Redemption
Size of first movie's feature vector: 3743
First 10 elements of feature vector: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
Cosine similarity between 'The Shawshank Redemption' and 'The Godfather': 0


## Training and evaluation

### Subtask:
Outline how the model will be trained using the preprocessed data and evaluated for performance (e.g., using relevant metrics for the chosen approach).


**Reasoning**:
Explain the training and evaluation process for the chosen content-based recommendation system, highlighting the lack of traditional training and the challenges in evaluation due to data limitations.



In [14]:
%%markdown
### Model Training and Evaluation for Content-Based Recommendation

For the content-based recommendation system we are building using cosine similarity, the concept of "model training" is different from supervised machine learning approaches that involve iterative learning from labeled data.

1.  **Absence of Traditional Training Phase:**
    Unlike algorithms such as linear regression, decision trees, or neural networks that learn parameters from training data through optimization processes (like gradient descent), a content-based system based on feature similarity does **not** have a traditional training phase. There are no model weights or parameters to learn iteratively.

2.  **Preprocessed Data as the "Trained" Model:**
    In this context, the **preprocessed data itself serves as the "trained" model**. The "knowledge" of the system is encoded directly within the feature vectors created from the movie data. The process of feature extraction and creating the multi-hot encoded vectors for each movie is the equivalent of the "training" step. The relationships and characteristics of the movies are captured in these vector representations. When we want to make a recommendation, we don't train a model; we simply use these precomputed feature vectors to find similar items.

3.  **Evaluation Through Recommendation Generation:**
    Evaluation for this type of system is typically done by generating recommendations for a given input movie (or a set of movies representing a user's preference) and then assessing the quality of those recommendations. The core task is to retrieve movies that are most similar to the input based on the calculated cosine similarity scores.

4.  **Challenges in Applying Standard Evaluation Metrics with This Dataset:**
    Common evaluation metrics for recommendation systems, such as Precision, Recall, F1-score, Mean Average Precision (MAP), or Normalized Discounted Cumulative Gain (NDCG), are designed for scenarios where explicit user feedback (like ratings or click-through data) or a known ground truth of relevant items for users is available.
    The `/content/imdb_top_1000.csv` dataset **lacks explicit user interaction data**. We have movie metadata and overall IMDB ratings, but no information about which users watched or liked which movies, or what constitutes a "good" recommendation for a specific user. Therefore, we cannot directly apply these standard metrics to quantitatively measure the system's performance in terms of predicting user preferences or ranking items according to relevance for individual users.

5.  **Alternative Evaluation Approaches Given Data Limitations:**
    Given the constraints of the dataset, evaluation will primarily rely on qualitative methods and direct inspection:
    *   **Qualitative Assessment:** The most practical approach is to manually select a few example movies and generate a list of top N recommended movies for each. We can then qualitatively assess the relevance of these recommendations based on our understanding of the movie's genre, director, actors, and plot (if external knowledge is used). This involves subjectively judging whether the recommended movies seem genuinely similar and likely to be enjoyed by someone who liked the input movie.
    *   **Examining Similarity Scores:** We can analyze the distribution and range of the calculated cosine similarity scores. This can provide insights into how distinct or similar movies are based on the features used. For example, are there many movies with high similarity scores, or are most scores low? This analysis helps understand the feature space but doesn't directly measure recommendation quality.
    *   **Hit Rate (Theoretical):** If we hypothetically had a small test set where we knew, for instance, that movie A is considered similar to movie B (perhaps from expert curation or external data), we could calculate a Hit Rate by checking if, when recommending based on movie A, movie B appears in the top N recommendations. However, this dataset does not provide such a test set.

6.  **Conclusion on Training and Evaluation:**
    In summary, for this content-based system, the "training" is the process of feature extraction and creating the movie feature vectors. Due to the absence of user interaction data in the `/content/imdb_top_1000.csv` dataset, traditional quantitative evaluation metrics for recommendation systems cannot be directly applied. Evaluation will primarily involve qualitative assessment of generated recommendation lists and potentially analyzing the characteristics of the similarity scores.

### Model Training and Evaluation for Content-Based Recommendation

For the content-based recommendation system we are building using cosine similarity, the concept of "model training" is different from supervised machine learning approaches that involve iterative learning from labeled data.

1.  **Absence of Traditional Training Phase:**
    Unlike algorithms such as linear regression, decision trees, or neural networks that learn parameters from training data through optimization processes (like gradient descent), a content-based system based on feature similarity does **not** have a traditional training phase. There are no model weights or parameters to learn iteratively.

2.  **Preprocessed Data as the "Trained" Model:**
    In this context, the **preprocessed data itself serves as the "trained" model**. The "knowledge" of the system is encoded directly within the feature vectors created from the movie data. The process of feature extraction and creating the multi-hot encoded vectors for each movie is the equivalent of the "training" step. The relationships and characteristics of the movies are captured in these vector representations. When we want to make a recommendation, we don't train a model; we simply use these precomputed feature vectors to find similar items.

3.  **Evaluation Through Recommendation Generation:**
    Evaluation for this type of system is typically done by generating recommendations for a given input movie (or a set of movies representing a user's preference) and then assessing the quality of those recommendations. The core task is to retrieve movies that are most similar to the input based on the calculated cosine similarity scores.

4.  **Challenges in Applying Standard Evaluation Metrics with This Dataset:**
    Common evaluation metrics for recommendation systems, such as Precision, Recall, F1-score, Mean Average Precision (MAP), or Normalized Discounted Cumulative Gain (NDCG), are designed for scenarios where explicit user feedback (like ratings or click-through data) or a known ground truth of relevant items for users is available.
    The `/content/imdb_top_1000.csv` dataset **lacks explicit user interaction data**. We have movie metadata and overall IMDB ratings, but no information about which users watched or liked which movies, or what constitutes a "good" recommendation for a specific user. Therefore, we cannot directly apply these standard metrics to quantitatively measure the system's performance in terms of predicting user preferences or ranking items according to relevance for individual users.

5.  **Alternative Evaluation Approaches Given Data Limitations:**
    Given the constraints of the dataset, evaluation will primarily rely on qualitative methods and direct inspection:
    *   **Qualitative Assessment:** The most practical approach is to manually select a few example movies and generate a list of top N recommended movies for each. We can then qualitatively assess the relevance of these recommendations based on our understanding of the movie's genre, director, actors, and plot (if external knowledge is used). This involves subjectively judging whether the recommended movies seem genuinely similar and likely to be enjoyed by someone who liked the input movie.
    *   **Examining Similarity Scores:** We can analyze the distribution and range of the calculated cosine similarity scores. This can provide insights into how distinct or similar movies are based on the features used. For example, are there many movies with high similarity scores, or are most scores low? This analysis helps understand the feature space but doesn't directly measure recommendation quality.
    *   **Hit Rate (Theoretical):** If we hypothetically had a small test set where we knew, for instance, that movie A is considered similar to movie B (perhaps from expert curation or external data), we could calculate a Hit Rate by checking if, when recommending based on movie A, movie B appears in the top N recommendations. However, this dataset does not provide such a test set.

6.  **Conclusion on Training and Evaluation:**
    In summary, for this content-based system, the "training" is the process of feature extraction and creating the movie feature vectors. Due to the absence of user interaction data in the `/content/imdb_top_1000.csv` dataset, traditional quantitative evaluation metrics for recommendation systems cannot be directly applied. Evaluation will primarily involve qualitative assessment of generated recommendation lists and potentially analyzing the characteristics of the similarity scores.


## Generating recommendations

### Subtask:
Describe how the trained model (the movie feature vectors and cosine similarity function) will be used to generate recommendations for a given movie based on similarity.


**Reasoning**:
Describe how the trained model (movie feature vectors and cosine similarity) is used for recommendations.



In [15]:
%%markdown
### Generating Recommendations Using Movie Feature Vectors and Cosine Similarity

Once the movie data has been loaded, preprocessed into feature vectors, and the cosine similarity function is available, generating recommendations for a given input movie involves the following steps:

1.  **Identify the Input Movie and its Feature Vector:**
    The process begins by selecting a specific movie for which recommendations are desired. This could be a movie the user is currently watching, a movie they have liked in the past, or a movie they have searched for. We need to find this movie in our loaded dataset and retrieve its corresponding precomputed feature vector from the `movie_feature_vectors` collection. This is typically done by finding the index of the input movie by its title or a unique ID if available.

2.  **Calculate Cosine Similarity with All Other Movies:**
    Using the retrieved feature vector of the input movie, the cosine similarity function is applied to calculate the similarity score between this vector and the feature vector of *every other movie* in the dataset. This involves iterating through the `movie_feature_vectors` collection and computing the similarity between the input movie's vector and each other movie's vector.

3.  **Store and Rank Movies by Similarity Score:**
    As similarity scores are calculated for each movie, they are stored along with the index or identifier of the movie they correspond to. A convenient way to store this is in a list or vector of pairs, where each pair contains the similarity score and the index of the movie. For example, `std::vector<std::pair<float, int>> similarity_scores;`. Once all similarity scores are computed, this list is sorted in **descending** order based on the similarity scores. Movies with higher cosine similarity scores are considered more similar to the input movie.

4.  **Select the Top N Recommendations:**
    After sorting, the movies with the highest similarity scores appear at the top of the ranked list. The recommendation system then selects the top N movies from this sorted list, where N is the desired number of recommendations (e.g., top 10, top 20). It is crucial to **exclude the input movie itself** from the list of recommendations, as recommending the same movie is trivial and unhelpful.

5.  **Considerations for Improving Recommendation Quality:**
    While the core process relies on cosine similarity, several considerations can improve the quality of the recommendations:
    *   **Filtering "Watched" Movies:** In a real Netflix scenario with user data, recommendations for a specific user would ideally exclude movies they have already watched. Although our current dataset lacks user data, this is a critical consideration for a complete system.
    *   **Similarity Thresholds:** A threshold could be applied to the similarity scores. Movies with scores below a certain threshold might be considered too dissimilar to be recommended, even if they are among the top N. This helps avoid recommending irrelevant items when the most similar movies still have relatively low similarity.
    *   **Combining Features:** Experimenting with how different features (genre, director, actors) are weighted or combined in the feature vector construction can impact results.
    *   **Handling Cold Start:** For new movies or new users, content-based filtering is particularly useful (cold start for items). For new users, initial recommendations might be based on popular items or basic demographic data if available.
    *   **Diversity:** Recommendations can sometimes be too similar. Techniques to introduce diversity among the top N recommendations can be explored to present the user with a wider range of relevant options.

In summary, the trained model (the feature vectors) and the cosine similarity function are used to measure the similarity between the input movie and all other movies, rank them based on these scores, and select the top N most similar movies as recommendations, while excluding the input movie itself and considering factors like watched status and similarity thresholds.

### Generating Recommendations Using Movie Feature Vectors and Cosine Similarity

Once the movie data has been loaded, preprocessed into feature vectors, and the cosine similarity function is available, generating recommendations for a given input movie involves the following steps:

1.  **Identify the Input Movie and its Feature Vector:**
    The process begins by selecting a specific movie for which recommendations are desired. This could be a movie the user is currently watching, a movie they have liked in the past, or a movie they have searched for. We need to find this movie in our loaded dataset and retrieve its corresponding precomputed feature vector from the `movie_feature_vectors` collection. This is typically done by finding the index of the input movie by its title or a unique ID if available.

2.  **Calculate Cosine Similarity with All Other Movies:**
    Using the retrieved feature vector of the input movie, the cosine similarity function is applied to calculate the similarity score between this vector and the feature vector of *every other movie* in the dataset. This involves iterating through the `movie_feature_vectors` collection and computing the similarity between the input movie's vector and each other movie's vector.

3.  **Store and Rank Movies by Similarity Score:**
    As similarity scores are calculated for each movie, they are stored along with the index or identifier of the movie they correspond to. A convenient way to store this is in a list or vector of pairs, where each pair contains the similarity score and the index of the movie. For example, `std::vector<std::pair<float, int>> similarity_scores;`. Once all similarity scores are computed, this list is sorted in **descending** order based on the similarity scores. Movies with higher cosine similarity scores are considered more similar to the input movie.

4.  **Select the Top N Recommendations:**
    After sorting, the movies with the highest similarity scores appear at the top of the ranked list. The recommendation system then selects the top N movies from this sorted list, where N is the desired number of recommendations (e.g., top 10, top 20). It is crucial to **exclude the input movie itself** from the list of recommendations, as recommending the same movie is trivial and unhelpful.

5.  **Considerations for Improving Recommendation Quality:**
    While the core process relies on cosine similarity, several considerations can improve the quality of the recommendations:
    *   **Filtering "Watched" Movies:** In a real Netflix scenario with user data, recommendations for a specific user would ideally exclude movies they have already watched. Although our current dataset lacks user data, this is a critical consideration for a complete system.
    *   **Similarity Thresholds:** A threshold could be applied to the similarity scores. Movies with scores below a certain threshold might be considered too dissimilar to be recommended, even if they are among the top N. This helps avoid recommending irrelevant items when the most similar movies still have relatively low similarity.
    *   **Combining Features:** Experimenting with how different features (genre, director, actors) are weighted or combined in the feature vector construction can impact results.
    *   **Handling Cold Start:** For new movies or new users, content-based filtering is particularly useful (cold start for items). For new users, initial recommendations might be based on popular items or basic demographic data if available.
    *   **Diversity:** Recommendations can sometimes be too similar. Techniques to introduce diversity among the top N recommendations can be explored to present the user with a wider range of relevant options.

In summary, the trained model (the feature vectors) and the cosine similarity function are used to measure the similarity between the input movie and all other movies, rank them based on these scores, and select the top N most similar movies as recommendations, while excluding the input movie itself and considering factors like watched status and similarity thresholds.


## Summary:

### Data Analysis Key Findings

*   The initial attempt to run C++ code directly in a Python cell resulted in a `SyntaxError`, highlighting the necessity of writing the C++ code to a file and then compiling/executing it.
*   Initial CSV parsing in C++ using a simple comma delimiter was insufficient and led to "Invalid rating" errors and incorrect data extraction, particularly for the movie rating, which was initially confused with the number of votes column.
*   Inspecting the CSV header in C++ was crucial in correctly identifying the column indices for 'Series_Title' (1), 'Genre' (5), 'Director' (7), 'Star1'-'Star4' (10-13), and 'IMDB_Rating' (6).
*   Refining the CSV parsing function to handle quoted fields improved robustness, although the primary issue was the incorrect column indexing.
*   The C++ code successfully loaded 1000 movies, built a feature map containing 3743 unique features (genres, directors, actors), created multi-hot encoded feature vectors of size 3743 for each movie, and implemented a cosine similarity function.
*   Content-Based Filtering was selected as the most suitable recommendation approach given the dataset's movie metadata and the constraint of using C++ without external ML libraries, as opposed to Collaborative Filtering which requires user interaction data.
*   For this content-based system, the "training" phase is implicitly the process of feature extraction and creating the feature vectors; there is no traditional model training with iterative optimization.
*   Standard quantitative evaluation metrics for recommendation systems (like Precision, Recall) cannot be directly applied due to the lack of user interaction data in the dataset. Evaluation would primarily rely on qualitative assessment of generated recommendations.

### Insights or Next Steps

*   The current C++ implementation provides the core components for a content-based recommendation system based on movie features and cosine similarity. The next practical step is to implement the recommendation generation logic in C++, taking an input movie and returning the top N similar movies based on the calculated cosine similarity scores.
*   While the dataset lacks user data for true collaborative filtering or quantitative evaluation, the implemented content-based approach serves as a solid foundation. Future enhancements could explore using external datasets with user ratings or building a more complex hybrid system if more data becomes available.


In [24]:
%%writefile netflix_recommender.cpp
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <sstream>
#include <unordered_map>
#include <cmath> // For sqrt
#include <algorithm> // For std::min, std::sort
#include <limits> // For std::numeric_limits
#include <iomanip> // For std::setw, std::left

struct Movie {
    std::string title;
    std::string released_year; // Added released_year
    std::string genre;
    std::string director;
    std::string actors;
    float rating;
    int original_index; // To link back to the original position in the loaded movies vector
};

// Function to parse a CSV line, handling quoted fields
std::vector<std::string> parseCsvLine(const std::string& line) {
    std::vector<std::string> result;
    std::stringstream ss(line);
    std::string cell;
    bool inQuote = false;

    for (char ch : line) {
        if (ch == '"') {
            inQuote = !inQuote;
        } else if (ch == ',' && !inQuote) {
            result.push_back(cell);
            cell.clear();
        } else {
            cell += ch;
        }
    }
    result.push_back(cell); // Add the last cell

    return result;
}

// Function to split a string by a delimiter
std::vector<std::string> splitString(const std::string& s, const std::string& delimiter) {
    std::vector<std::string> tokens;
    size_t pos_start = 0, pos_end;
    std::string token;
    while ((pos_end = s.find(delimiter, pos_start)) != std::string::npos) {
        token = s.substr(pos_start, pos_end - pos_start);
        if (!token.empty() && token != " ") { // Added check for empty or space tokens
             // Trim leading/trailing whitespace
            size_t first = token.find_first_not_of(' ');
            size_t last = token.find_last_not_of(' ');
            if (std::string::npos != first) {
                tokens.push_back(token.substr(first, (last - first + 1)));
            }
        }
        pos_start = pos_end + delimiter.length();
    }
    token = s.substr(pos_start);
     if (!token.empty() && token != " ") { // Added check for empty or space tokens for the last token
         // Trim leading/trailing whitespace
        size_t first = token.find_first_not_of(' ');
        size_t last = token.find_last_not_of(' ');
        if (std::string::npos != first) {
            tokens.push_back(token.substr(first, (last - first + 1)));
        }
    }
    return tokens;
}

// Function to calculate cosine similarity between two vectors
float cosineSimilarity(const std::vector<float>& vec1, const std::vector<float>& vec2) {
    if (vec1.size() != vec2.size() || vec1.empty()) {
        return 0.0; // Or handle error appropriately
    }

    float dot_product = 0.0;
    float norm_vec1 = 0.0;
    float norm_vec2 = 0.0;

    for (size_t i = 0; i < vec1.size(); ++i) {
        dot_product += vec1[i] * vec2[i];
        norm_vec1 += vec1[i] * vec1[i];
        norm_vec2 += vec2[i] * vec2[i];
    }

    float denominator = std::sqrt(norm_vec1) * std::sqrt(norm_vec2);

    if (denominator == 0) {
        return 0.0; // Avoid division by zero
    }

    return dot_product / denominator;
}

// Function to get recommendations for a given movie title
std::vector<Movie> getRecommendations(const std::string& movie_title,
                                    const std::vector<Movie>& all_movies,
                                    const std::vector<std::vector<float>>& movie_feature_vectors,
                                    int num_recommendations = 10) {

    int input_movie_index = -1;
    for (size_t i = 0; i < all_movies.size(); ++i) {
        if (all_movies[i].title == movie_title) {
            input_movie_index = i;
            break;
        }
    }

    if (input_movie_index == -1) {
        std::cerr << "Movie '" << movie_title << "' not found in the dataset." << std::endl;
        return {}; // Return empty vector
    }

    const std::vector<float>& input_movie_vector = movie_feature_vectors[input_movie_index];
    std::vector<std::pair<float, int>> similarity_scores;

    for (size_t i = 0; i < all_movies.size(); ++i) {
        // Skip the input movie itself
        if (i == input_movie_index) {
            continue;
        }
        float similarity = cosineSimilarity(input_movie_vector, movie_feature_vectors[i]);
        similarity_scores.push_back({similarity, (int)i});
    }

    // Sort by similarity score in descending order
    std::sort(similarity_scores.begin(), similarity_scores.end(), [](const auto& a, const auto& b) {
        return a.first > b.first;
    });

    // Get the top N recommendations
    std::vector<Movie> recommendations;
    for (int i = 0; i < std::min((int)similarity_scores.size(), num_recommendations); ++i) {
        recommendations.push_back(all_movies[similarity_scores[i].second]);
    }

    return recommendations;
}


int main() {
    std::vector<Movie> movies;
    std::ifstream file("/content/imdb_top_1000.csv");

    if (!file.is_open()) {
        std::cerr << "Error opening file!" << std::endl;
        return 1;
    }

    std::string line;
    // Read the header row
    std::getline(file, line);

    int current_index = 0;
    while (std::getline(file, line)) {
        std::vector<std::string> segments = parseCsvLine(line);

        if (segments.size() >= 16) {
            Movie movie;
            movie.title = segments[1];
            movie.released_year = segments[2]; // Get released year (index 2)
            movie.genre = segments[5];
            movie.director = segments[7];
            movie.actors = segments[10] + ", " + segments[11] + ", " + segments[12] + ", " + segments[13];
            movie.original_index = current_index++;


            try {
                movie.rating = std::stof(segments[6]);
            } catch (const std::invalid_argument& ia) {
                // std::cerr << "Invalid rating for movie: " << movie.title << " (Value: " << segments[6] << ")" << std::endl;
                continue;
            } catch (const std::out_of_range& oor) {
                 // std::cerr << "Rating out of range for movie: " << movie.title << " (Value: " << segments[6] << ")" << std::endl;
                 continue;
            }

            if (movie.title.empty() || movie.genre.empty() || movie.director.empty() || movie.actors.empty() || movie.rating <= 0 || movie.released_year.empty()) {
                 // std::cerr << "Skipping row with missing or invalid data for movie: " << movie.title << std::endl;
                 continue;
            }

            movies.push_back(movie);
        } else {
             // std::cerr << "Skipping malformed row (incorrect number of columns): " << line << std::endl;
        }
    }

    file.close();

    std::cout << "Loaded " << movies.size() << " movies." << std::endl;

    // 1. Build feature_index_map
    std::unordered_map<std::string, int> feature_index_map;
    int feature_index = 0;

    for (const auto& movie : movies) {
        // Genres
        std::vector<std::string> genres = splitString(movie.genre, ",");
        for (const auto& g : genres) {
            if (feature_index_map.find(g) == feature_index_map.end()) {
                feature_index_map[g] = feature_index++;
            }
        }
        // Director
        if (feature_index_map.find(movie.director) == feature_index_map.end()) {
            feature_index_map[movie.director] = feature_index++;
        }
        // Actors
        std::vector<std::string> actors = splitString(movie.actors, ", ");
        for (const auto& a : actors) {
             if (feature_index_map.find(a) == feature_index_map.end()) {
                feature_index_map[a] = feature_index++;
            }
        }
    }

    std::cout << "Built feature map with " << feature_index_map.size() << " features." << std::endl;

    // 2. Create movie_feature_vectors
    std::vector<std::vector<float>> movie_feature_vectors;
    int num_features = feature_index_map.size();

    for (const auto& movie : movies) {
        std::vector<float> feature_vector(num_features, 0.0);

        // Populate feature vector using multi-hot encoding
        // Genres
        std::vector<std::string> genres = splitString(movie.genre, ",");
        for (const auto& g : genres) {
            if (feature_index_map.count(g)) {
                 feature_vector[feature_index_map[g]] = 1.0;
            }
        }
        // Director
        if (feature_index_map.count(movie.director)) {
             feature_vector[feature_index_map[movie.director]] = 1.0;
        }
        // Actors
        std::vector<std::string> actors = splitString(movie.actors, ", ");
        for (const auto& a : actors) {
            if (feature_index_map.count(a)) {
                 feature_vector[feature_index_map[a]] = 1.0;
            }
        }
        movie_feature_vectors.push_back(feature_vector);
    }

    std::cout << "Created feature vectors for " << movie_feature_vectors.size() << " movies." << std::endl;

    // 3. Get movie title input from user
    std::string target_movie_title;
    std::cout << "\nEnter a movie title to get recommendations: ";
    std::getline(std::cin, target_movie_title);

    // 4. Generate recommendations for the input movie
    std::cout << "\nRecommendations for '" << target_movie_title << "':" << std::endl;

    std::vector<Movie> recommendations = getRecommendations(target_movie_title, movies, movie_feature_vectors, 10);

    if (recommendations.empty()) {
        std::cout << "Could not generate recommendations. Please check the movie title." << std::endl;
    } else {
        // Print recommendations in a tabular format
        std::cout << std::left << std::setw(40) << "Title"
                  << std::setw(10) << "Year"
                  << std::setw(30) << "Genre"
                  << std::setw(10) << "Rating" << std::endl;
        std::cout << std::string(90, '-') << std::endl;

        for (const auto& rec_movie : recommendations) {
            std::cout << std::left << std::setw(40) << rec_movie.title
                      << std::setw(10) << rec_movie.released_year
                      << std::setw(30) << rec_movie.genre
                      << std::setw(10) << std::fixed << std::setprecision(1) << rec_movie.rating << std::endl;
        }
    }


    return 0;
}

Overwriting netflix_recommender.cpp


In [29]:
!g++ netflix_recommender.cpp -o netflix_recommender -std=c++14
!./netflix_recommender

Loaded 1000 movies.
Built feature map with 3730 features.
Created feature vectors for 1000 movies.

Enter a movie title to get recommendations: Avengers: Endgame

Recommendations for 'Avengers: Endgame':
Title                                   Year      Genre                         Rating    
------------------------------------------------------------------------------------------
Captain America: Civil War              2016      Action, Adventure, Sci-Fi     7.8       
Avengers: Infinity War                  2018      Action, Adventure, Sci-Fi     8.4       
The Avengers                            2012      Action, Adventure, Sci-Fi     8.0       
Captain America: The Winter Soldier     2014      Action, Adventure, Sci-Fi     7.7       
Batoru rowaiaru                         2000      Action, Adventure, Drama      7.6       
Dawn of the Planet of the Apes          2014      Action, Adventure, Drama      7.6       
The Count of Monte Cristo               2002      Action, Adventure,