```
One such problem is predicting the tags for a movie. Each unique tag that users apply to movies can be treated as a separate label. Because users can create their own tags and there may be millions of users, the number of unique tags (and hence the size of the output space) could be on the order of millions.

Predicting tags for a movie dataset such as MovieLens has several business use cases:
- Improved Recommendations: Predicting the tags a user might apply to a movie lets a system make more personalized and accurate recommendations, which can increase user engagement and satisfaction.
- Better Understanding of User Preferences: Predicted tags indicate which aspects of a movie a user cares about, which helps in understanding user preferences and behavior.
- Targeted Marketing: If a business can predict the tags a user might apply to a movie, it can target marketing more effectively. For example, if a user is predicted to tag a movie with “romantic”, the business could recommend other romantic movies or offer related promotions.
- Content Curation and Management: Predicted tags can support content curation. For example, a movie predicted to be tagged “violent” or “mature” could be flagged for review or given a certain rating.
- Enhanced Search Functionality: Predicted tags can improve search on a movie recommendation platform by letting users search for movies by tag.

The value of these predictions depends on the quality of the prediction model and the richness of the input data: the more accurate the predictions and the more comprehensive the input data, the more useful they are for the business.

```
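
To make the "one label per unique tag" framing concrete, here is a minimal, dependency-free sketch of how free-form tags turn into a multi-label indicator matrix. The movie names and tags are made-up sample data, and `build_label_matrix` is a hypothetical helper, not part of the notebook.

```python
# Illustrative sample data (not from MovieLens): movie -> tags applied by users
movie_tags = {
    "Movie A": ["pixar", "funny", "animation"],
    "Movie B": ["time travel", "funny"],
    "Movie C": ["animation"],
}

def build_label_matrix(tags_by_movie):
    """Return (sorted label vocabulary, one 0/1 row per movie)."""
    vocabulary = sorted({tag for tags in tags_by_movie.values() for tag in tags})
    index = {tag: i for i, tag in enumerate(vocabulary)}
    rows = []
    for tags in tags_by_movie.values():
        row = [0] * len(vocabulary)
        for tag in tags:
            row[index[tag]] = 1
        rows.append(row)
    return vocabulary, rows

vocabulary, label_matrix = build_label_matrix(movie_tags)
print(vocabulary)
for title, row in zip(movie_tags, label_matrix):
    print(title, row)
```

Each row is as wide as the tag vocabulary, which is why the output space grows with the number of unique tags and why a sparse representation is normally used at scale.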

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
import numpy as np

In [None]:
%%time

tags = pd.read_csv('./ml-25m/tags.csv')

movies = pd.read_csv('./ml-25m/movies.csv')

ratings = pd.read_csv('./ml-25m/ratings.csv') # Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

links = pd.read_csv('./ml-25m/links.csv')

In [None]:
tags['tag'].nunique()

In [None]:
links

In [None]:
data = pd.merge(pd.merge(movies, tags, on='movieId'), ratings, on=['userId', 'movieId'])

In [None]:
unique_users = data['userId'].unique()

train_users, test_users = train_test_split(unique_users, test_size=0.2, random_state=42)

data_train = data[data['userId'].isin(train_users)]
data_test = data[data['userId'].isin(test_users)]
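
Splitting on user IDs rather than on rows keeps every interaction of a given user on one side of the split. The sketch below shows the same idea with the standard library only, using `random.Random` as a stand-in for `train_test_split` and made-up rows; `split_by_user` is a hypothetical helper.

```python
import random

# Made-up (userId, movieId, tag) rows for illustration only
rows = [
    (1, 10, "funny"), (1, 11, "dark"),
    (2, 10, "pixar"), (3, 12, "slow"),
    (4, 11, "classic"), (5, 13, "funny"), (5, 10, "animation"),
]

def split_by_user(rows, test_size=0.2, seed=42):
    """Assign whole users (not individual rows) to train or test."""
    users = sorted({user_id for user_id, _, _ in rows})
    random.Random(seed).shuffle(users)
    n_test = max(1, round(len(users) * test_size))
    test_users = set(users[:n_test])
    train = [r for r in rows if r[0] not in test_users]
    test = [r for r in rows if r[0] in test_users]
    return train, test

train_rows, test_rows = split_by_user(rows)
train_user_ids = {r[0] for r in train_rows}
test_user_ids = {r[0] for r in test_rows}
# The intersection is empty: no user appears on both sides
print(len(train_rows), len(test_rows), train_user_ids & test_user_ids)
```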

In [None]:
# train

X_train = data_train[['userId', 'movieId', 'title', 'genres']] # movie id?

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(np.array(data_train['tag']).reshape(-1,1))
Y_train = enc.transform(np.array(data_train['tag']).reshape(-1,1))

In [None]:
X_train.shape, Y_train.shape

In [None]:
!pip install aiohttp


In [None]:
links

In [None]:
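import os
import requests

# Assumption: the TMDB API key is provided via the TMDB_API_KEY environment variable
api_key = os.environ.get("TMDB_API_KEY", "")

movie_id = "550"
url = f"https://api.themoviedb.org/3/movie/{movie_id}?api_key={api_key}"

response = requests.get(url)
response.raise_for_status()  # fail early on HTTP errors (e.g. an invalid key)
movie_details = response.json()

print(movie_details)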


In [None]:
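import aiohttp
import asyncio
import os
import pandas as pd
import time

# Read the TMDB API key from the environment instead of hardcoding it in the notebook
# (assumption: it is stored in the TMDB_API_KEY environment variable)
api_key = os.environ.get('TMDB_API_KEY', '')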
async def get_movie_details(tmdb_id, session):
    url = f"https://api.themoviedb.org/3/movie/{int(tmdb_id)}?api_key={api_key}&language=en-US"
    try:
        async with session.get(url) as response:
            data = await response.json()
            # Extracting nested and list data
            genres = ', '.join([genre['name'] for genre in data.get('genres', [])])
            production_companies = ', '.join([company['name'] for company in data.get('production_companies', [])])
            production_countries = ', '.join([country['name'] for country in data.get('production_countries', [])])
            spoken_languages = ', '.join([language['english_name'] for language in data.get('spoken_languages', [])])
            
            # Constructing the dictionary to return
            movie_details = {
                'tmdb_id': tmdb_id,
                'title': data.get('title', ''),
                'original_title': data.get('original_title', ''),
                'genres': genres,
                'release_date': data.get('release_date', ''),
                'rating': data.get('vote_average', ''),
                'votes': data.get('vote_count', ''),
                'original_language': data.get('original_language', ''),
                'overview': data.get('overview', ''),
                'popularity': data.get('popularity', ''),
                'production_companies': production_companies,
                'production_countries': production_countries,
                'spoken_languages': spoken_languages,
                'budget': data.get('budget', ''),
                'revenue': data.get('revenue', ''),
                'runtime': data.get('runtime', ''),
                'status': data.get('status', ''),
                'tagline': data.get('tagline', '')
            }
            return tmdb_id, movie_details
    except Exception as e:
        print(f"Failed to fetch details for TMDB ID {tmdb_id}: {e}")
        return tmdb_id, {}

async def fetch_details_concurrently(tmdb_ids):
    async with aiohttp.ClientSession() as session:
        movie_details_dict = {}
        for i in range(0, len(tmdb_ids), 40):  # Process in batches of 40 to respect rate limit
            batch = tmdb_ids[i:i+40]
            tasks = [get_movie_details(tmdb_id, session) for tmdb_id in batch]
            results = await asyncio.gather(*tasks)
            for tmdb_id, details in results:
                movie_details_dict[tmdb_id] = details
            await asyncio.sleep(10)  # Pause without blocking the event loop, to respect the API's rate limit
        return movie_details_dict

async def main(links):
    tmdb_ids = links['tmdbId'].dropna().unique()  # Ensure unique IDs and drop NaN
    movie_details_dict = await fetch_details_concurrently(tmdb_ids)
    
    # Update the `links` DataFrame with the fetched movie details
    for tmdb_id, details in movie_details_dict.items():
        for key, value in details.items():
            links.loc[links['tmdbId'] == tmdb_id, key] = value

    # Optionally, save the updated DataFrame
    links.to_csv('updated_links_with_movie_details.csv', index=False)


In [None]:
await main(links)
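
The batching pattern above can be tried without network access. The sketch below replaces the TMDB call with a hypothetical `fake_fetch` coroutine and pauses between batches with `asyncio.sleep`, which yields control to the event loop while waiting. The batch size and pause length are illustrative values, not TMDB's documented limits.

```python
import asyncio

async def fake_fetch(item_id):
    """Hypothetical stand-in for an HTTP request; returns (id, payload)."""
    await asyncio.sleep(0)  # simulate awaiting I/O
    return item_id, {"title": f"movie-{item_id}"}

async def fetch_in_batches(item_ids, batch_size=3, pause_seconds=0.01):
    """Fetch items batch by batch, pausing between batches."""
    results = {}
    for start in range(0, len(item_ids), batch_size):
        batch = item_ids[start:start + batch_size]
        # Run all requests of the current batch concurrently
        fetched = await asyncio.gather(*(fake_fetch(i) for i in batch))
        for item_id, payload in fetched:
            results[item_id] = payload
        # Non-blocking pause; skipped after the final batch
        if start + batch_size < len(item_ids):
            await asyncio.sleep(pause_seconds)
    return results

# In a script; inside a notebook cell use `await fetch_in_batches(...)` instead
details = asyncio.run(fetch_in_batches(list(range(7))))
print(len(details))
```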

In [None]:
movies_meta = pd.read_csv('updated_links_with_movie_details.csv')

In [None]:
movies_meta.drop_duplicates(subset=['tmdbId'], inplace=True)

In [None]:
links.shape

In [None]:
movies_meta.drop(columns=['imdbId', 'tmdb_id'], inplace=True)

In [None]:
movies_meta

In [None]:
movies_meta.dropna(subset=['tmdbId'], inplace=True)

In [None]:
movies_meta

In [None]:
data = pd.merge(movies_meta, tags, on='movieId', how='left')

In [None]:
data

In [None]:
columns = ['movieId', 'tmdbId', 'title', 'genres', 'overview', 'production_companies', 'production_countries', 'spoken_languages', 'tagline', 'tag']

In [None]:
df = data[columns]
df

In [None]:
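# Reassign instead of using inplace=True: `df` is a slice of `data`,
# so an in-place drop can trigger pandas' SettingWithCopyWarning
df = df.dropna(subset=['tag'])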

In [None]:
# Group by 'movieId' and join each movie's tags into one lowercase, comma-separated string
df_grouped = df.groupby('movieId').agg({
    'title': 'first',  # Assuming title is the same for all entries with the same movieId
    'genres': 'first',  # Assuming genres are the same for all entries with the same movieId
    'overview': 'first',
    'production_companies': 'first',
    'production_countries': 'first',
    'spoken_languages': 'first',
    'tagline': 'first',
    'tmdbId': 'first',
    'tag': lambda x: ', '.join(x.astype(str).str.lower())
}).reset_index()

df_grouped
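
Joining every tag per movie keeps repeats when several users apply the same tag (for example `pixar, pixar`). If duplicates are not wanted in the label input, one option is to deduplicate while preserving first-seen order. The helper name `join_unique_tags` below is hypothetical.

```python
def join_unique_tags(tags):
    """Lowercase, strip, and deduplicate tags, keeping first-seen order."""
    normalized = (str(tag).strip().lower() for tag in tags)
    # dict preserves insertion order, so it acts as an ordered set
    unique = dict.fromkeys(tag for tag in normalized if tag)
    return ", ".join(unique)

print(join_unique_tags(["Owned", "imdb top 250", "Pixar", "pixar", " time travel "]))
# owned, imdb top 250, pixar, time travel
```

Because it accepts any iterable of tags, it could be passed as the aggregation for the `tag` column in place of the lambda.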

In [None]:
df_grouped = df_grouped.fillna('')  # assign the result; fillna is not in-place by default

In [None]:
df_grouped.to_csv('movies_meta-part1.csv', index=False)

In [None]:
# find empty ones and try to search it again
columns_to_check = ['title', 'genres', 'overview', 'production_companies', 
                    'production_countries', 'spoken_languages', 'tagline']
df_grouped.replace('', pd.NA, inplace=True)
empty_rows = df_grouped[df_grouped[columns_to_check].isna().all(axis=1)]
empty_rows

In [None]:
await main(empty_rows)

In [None]:
columns_to_check = ['title', 'genres', 'overview', 'production_companies', 
                    'production_countries', 'spoken_languages', 'tagline']

df_grouped.replace('', pd.NA, inplace=True)

# Drop rows where all the columns in columns_to_check are NaN
df_grouped.dropna(subset=columns_to_check, how='all', inplace=True)
df_grouped

https://towardsdatascience.com/automated-movie-tagging-a-multiclass-classification-problem-721eb7fb70c2

In [None]:
m = pd.read_csv('updated_links_with_movie_details.csv')
m

In [None]:
m.dropna(subset=columns_to_check, how='all', inplace=True)

In [None]:
m.shape

In [None]:
df1 = m[columns]

In [None]:
df1

In [None]:
final_data = pd.concat([df_grouped, df1])

In [None]:
final_data

final_data.fillna('', inplace=True)

In [14]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    words = text.split()
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)

# Apply stemming to a DataFrame column
df['processed_overview'] = df['processed_overview'].apply(stem_text)
df['processed_overview']

0    dystopian futur group warrior must unit reclai...
1    archeologist team adventur find mythic treasur...
2                                                     
3    team astronaut embark peril mission distant ga...
4    detect investig strang occurr town plagu myste...
Name: processed_overview, dtype: object

In [None]:
final_data.drop(columns=['tmdbId'], inplace=True)

In [None]:
final_data

In [None]:
final_data = pd.read_csv('movies_training_data.csv')

In [None]:
final_data

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)
tags_encoded = mlb.fit_transform(final_data['tag'].apply(lambda x: x.split(', ')))

Y = tags_encoded
Y
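
One detail worth checking before binarizing: `''.split(', ')` returns `['']`, so a row with an empty tag string would contribute an empty-string label. Below is a small sketch of a more defensive split, with `split_tags` as a hypothetical helper.

```python
def split_tags(tag_string):
    """Split a comma-separated tag string into a clean list of tags."""
    if not isinstance(tag_string, str):
        return []  # e.g. NaN read back from a CSV
    return [tag.strip() for tag in tag_string.split(",") if tag.strip()]

print("".split(", "))                              # ['']
print(split_tags(""))                              # []
print(split_tags("funny, best friend, fishing"))   # ['funny', 'best friend', 'fishing']
print(split_tags(float("nan")))                    # []
```

A helper like this could be used as `final_data['tag'].apply(split_tags)` in place of the lambda.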

In [None]:
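from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
import re
import string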

class MultiLabelBinarizerTransformer(TransformerMixin, BaseEstimator):
    def __init__(self):
        self.mlb = MultiLabelBinarizer()
    
    def fit(self, X, y=None):
        # Assume that X is a DataFrame with a single column of interest
        self.mlb.fit(X.iloc[:, 0].str.lower().str.split(', ').apply(lambda x: x if isinstance(x, list) else []))
        return self
    
    def transform(self, X, y=None):
        # Transform the input DataFrame
        X_transformed = self.mlb.transform(X.iloc[:, 0].str.lower().str.split(', ').apply(lambda x: x if isinstance(x, list) else []))
        return X_transformed
    
    def get_feature_names_out(self, input_features=None):
        # Generate feature names with a prefix using the class name of the transformer
        class_name = self.__class__.__name__.lower()
        return [f"{class_name}_{class_}" for class_ in self.mlb.classes_]
    

from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('wordnet')

class TextCombinerTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns_to_combine):
        self.columns_to_combine = columns_to_combine
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # Combine specified columns into a single text column
        combined_text = X[self.columns_to_combine].fillna('').apply(lambda x: ' '.join(x), axis=1)
        return combined_text

# Custom transformer to clean text: lowercasing, punctuation removal, optional stemming/lemmatization
class TextCleanerTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, remove_punctuation=True, lowercase=True, use_stemming=False, use_lemmatization=True):
        self.remove_punctuation = remove_punctuation
        self.lowercase = lowercase
        self.use_stemming = use_stemming
        self.use_lemmatization = use_lemmatization
        self.stemmer = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        if self.lowercase:
            X = X.apply(lambda x: x.lower() if isinstance(x, str) else x)
        
        if self.remove_punctuation:
            punctuation_pattern = re.compile('[%s]' % re.escape(string.punctuation))
            X = X.apply(lambda x: re.sub(punctuation_pattern, '', x) if isinstance(x, str) else x)
        
        if self.use_stemming or self.use_lemmatization:
            X = X.apply(self._apply_stemming_lemmatization)
        
        return X
    
    def _apply_stemming_lemmatization(self, text):
        tokens = word_tokenize(text)
        if self.use_stemming:
            tokens = [self.stemmer.stem(token) for token in tokens]
        if self.use_lemmatization:
            tokens = [self.lemmatizer.lemmatize(token) for token in tokens]
        return ' '.join(tokens)

text_pipeline = Pipeline(steps=[
    ('combine_text', TextCombinerTransformer(columns_to_combine=['overview', 'tagline'])),
    ('clean_text', TextCleanerTransformer(remove_punctuation=True, lowercase=True)),
    ('tfidf', TfidfVectorizer())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('genres_mlb', MultiLabelBinarizerTransformer(), ['genres']),
        ('production_countries_mlb', MultiLabelBinarizerTransformer(), ['production_countries']),
        ('spoken_languages_mlb', MultiLabelBinarizerTransformer(), ['spoken_languages']),
        ('text_processing', text_pipeline, ['overview', 'tagline', 'production_companies'])
    ],
    remainder='drop'  # maybe include title?
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
])
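
To see what the cleaning step does to a single string, here is a stdlib-only sketch of the lowercasing and punctuation removal performed by `TextCleanerTransformer` (stemming and lemmatization are left out because they need NLTK data). `clean_text` is a hypothetical helper.

```python
import re
import string

# Same character class as in TextCleanerTransformer: every ASCII punctuation mark
PUNCTUATION_PATTERN = re.compile("[%s]" % re.escape(string.punctuation))

def clean_text(text):
    """Lowercase the text and strip ASCII punctuation."""
    return PUNCTUATION_PATTERN.sub("", text.lower())

print(clean_text("Roll the dice and unleash the excitement!"))
# roll the dice and unleash the excitement
print(clean_text("Young. Beautiful. Deadly."))
# young beautiful deadly
```

Note that apostrophes are removed as well, so a word such as "he's" becomes "hes" before tokenization.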

In [None]:
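# Fit on the full dataset so that X has one row per row of Y (needed for the split below);
# fitting on only the first 10 rows would make the shapes of X and Y inconsistent
features = final_data.drop('tag', axis=1)
X = pipeline.fit_transform(features)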

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, 
    Y, 
    test_size=0.2, 
    random_state=42

)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
final_data

In [None]:
links[['movieId', 'tmdbId']]

In [None]:
with_tmdbid = pd.merge(final_data, links[['movieId', 'tmdbId']], on='movieId', how='left')

In [None]:
with_tmdbid

In [None]:
async def get_movie_credits(tmdb_id, session):
    credits_url = f"https://api.themoviedb.org/3/movie/{int(tmdb_id)}/credits?api_key={api_key}"
    try:
        async with session.get(credits_url) as response:
            credits_data = await response.json()
            # Extracting actors and directors
            cast = ', '.join([person['name'] for person in credits_data.get('cast', [])][:5])  # top 5 actors
            directors = ', '.join([person['name'] for person in credits_data.get('crew', []) if person['job'] == 'Director'])
            
            # Constructing the dictionary to return
            credits_details = {
                'cast': cast,
                'directors': directors
            }
            return tmdb_id, credits_details
    except Exception as e:
        print(f"Failed to fetch credits for TMDB ID {tmdb_id}: {e}")
        return tmdb_id, {}
    
async def fetch_details_concurrently(tmdb_ids, details_function):
    async with aiohttp.ClientSession() as session:
        movie_details_dict = {}
        for i in range(0, len(tmdb_ids), 40):  # Process in batches of 40 to respect rate limit
            batch = tmdb_ids[i:i+40]
            tasks = [details_function(tmdb_id, session) for tmdb_id in batch]
            results = await asyncio.gather(*tasks)
            for tmdb_id, details in results:
                movie_details_dict[tmdb_id] = details
            await asyncio.sleep(10)  # Pause without blocking the event loop, to respect the API's rate limit
        return movie_details_dict


async def main(links):
    tmdb_ids = links['tmdbId'].dropna().unique()  # Ensure unique IDs and drop NaN
    #movie_details_dict = await fetch_details_concurrently(tmdb_ids)
    movie_credits_dict = await fetch_details_concurrently(tmdb_ids, get_movie_credits)  # Use the credits function
    
    for tmdb_id, credits in movie_credits_dict.items():
        for key, value in credits.items():
            links.loc[links['tmdbId'] == tmdb_id, key] = value

    # Optionally, save the updated DataFrame
    links.to_csv('updated_links_with_movie_details_and_credits.csv', index=False)


In [None]:
await main(with_tmdbid)
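
The credits parsing can be exercised offline on a small dictionary shaped like the fields the function reads (`cast`, `crew`, `name`, `job`). The payload below is made up for illustration and `summarize_credits` is a hypothetical helper; it uses `.get("job")` so that a crew entry without a `job` key does not raise.

```python
def summarize_credits(credits_data, max_cast=5):
    """Return comma-joined top cast names and director names."""
    cast = [person["name"] for person in credits_data.get("cast", [])][:max_cast]
    directors = [
        person["name"]
        for person in credits_data.get("crew", [])
        if person.get("job") == "Director"
    ]
    return {"cast": ", ".join(cast), "directors": ", ".join(directors)}

# Made-up payload mirroring only the keys used above
sample = {
    "cast": [{"name": f"Actor {i}"} for i in range(1, 8)],
    "crew": [
        {"name": "Dana Director", "job": "Director"},
        {"name": "Ed Editor", "job": "Editor"},
        {"name": "No Job Listed"},
    ],
}
summary = summarize_credits(sample)
print(summary)
```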

In [2]:
movies = pd.read_csv('updated_links_with_movie_details_and_credits.csv')

In [3]:
movies

Unnamed: 0,movieId,title,genres,overview,production_companies,production_countries,spoken_languages,tagline,tag,tmdbId,tmdb_id,cast,directors
0,1,Toy Story,"Animation, Adventure, Family, Comedy","Led by Woody, Andy's toys live happily in his ...",Pixar,United States of America,English,Hang on for the comedy that goes to infinity a...,"owned, imdb top 250, pixar, pixar, time travel...",862.0,862.0,"Tom Hanks, Tim Allen, Don Rickles, Jim Varney,...",John Lasseter
1,2,Jumanji,"Adventure, Fantasy, Family",When siblings Judy and Peter discover an encha...,"TriStar Pictures, Interscope Communications, T...",United States of America,"English, French",Roll the dice and unleash the excitement!,"robin williams, time travel, fantasy, based on...",8844.0,8844.0,"Robin Williams, Kirsten Dunst, Bradley Pierce,...",Joe Johnston
2,3,Grumpier Old Men,"Romance, Comedy",A family wedding reignites the ancient feud be...,"Lancaster Gate, Warner Bros. Pictures",United States of America,English,Still Yelling. Still Fighting. Still Ready for...,"funny, best friend, duringcreditsstinger, fish...",15602.0,15602.0,"Walter Matthau, Jack Lemmon, Ann-Margret, Soph...",Howard Deutch
3,4,Waiting to Exhale,"Comedy, Drama, Romance","Cheated on, mistreated and stepped on, the wom...",20th Century Fox,United States of America,English,Friends are the people who let you be yourself...,"based on novel or book, chick flick, divorce, ...",31357.0,31357.0,"Whitney Houston, Angela Bassett, Loretta Devin...",Forest Whitaker
4,5,Father of the Bride Part II,"Comedy, Family",Just when George Banks has recovered from his ...,"Touchstone Pictures, Sandollar Productions",United States of America,English,Just when his world is back to normal... he's ...,"aging, baby, confidence, contraception, daught...",11862.0,11862.0,"Steve Martin, Diane Keaton, Martin Short, Kimb...",Charles Shyer
...,...,...,...,...,...,...,...,...,...,...,...,...,...
17482,95581,The Flying Fleet,"Adventure, Drama, Romance","Six friends, all hoping to become aviators, ar...",Metro-Goldwyn-Mayer,United States of America,,,"aircraft carrier, aviator, expulsion, midshipm...",99934.0,99934.0,"Ramon Novarro, Ralph Graves, Anita Page, Alfre...",George W. Hill
17483,95583,Savages,"Crime, Drama, Thriller",Pot growers Ben and Chon face off against the ...,"Relativity Media, Ixtlan, Onda Entertainment, ...",United States of America,English,Young. Beautiful. Deadly.,"american abroad, dea, dea agent, enforcer, exp...",82525.0,82525.0,"Taylor Kitsch, Blake Lively, Aaron Taylor-John...",Oliver Stone
17484,95591,Rat King,Thriller,18 year old Juri is seriously addicted to onli...,"Making Movies, Allfilm","Finland, Estonia",Finnish,Would you put your life on the line?,visually stunning,104997.0,104997.0,"Max Ovaska, Outi Mäenpää, Janne Virtanen, Juli...",Petri Kotwica
17485,95595,Bela Kiss: Prologue,"Horror, Mystery, Thriller","A true story, Bela Kiss was one of the the mos...",Mirror Maze,Germany,"English, German",Jede Legende fordert neues Blut,"muddled, suspense",155288.0,155288.0,"Kristina Klebe, Rudolf Martin, Fabian Stumm, B...",Lucien Förstner


In [4]:
movies['directors'].isna().sum()

9

In [5]:
movies = movies.drop(columns=['tmdb_id', 'tmdbId'])

In [6]:
movies

Unnamed: 0,movieId,title,genres,overview,production_companies,production_countries,spoken_languages,tagline,tag,cast,directors
0,1,Toy Story,"Animation, Adventure, Family, Comedy","Led by Woody, Andy's toys live happily in his ...",Pixar,United States of America,English,Hang on for the comedy that goes to infinity a...,"owned, imdb top 250, pixar, pixar, time travel...","Tom Hanks, Tim Allen, Don Rickles, Jim Varney,...",John Lasseter
1,2,Jumanji,"Adventure, Fantasy, Family",When siblings Judy and Peter discover an encha...,"TriStar Pictures, Interscope Communications, T...",United States of America,"English, French",Roll the dice and unleash the excitement!,"robin williams, time travel, fantasy, based on...","Robin Williams, Kirsten Dunst, Bradley Pierce,...",Joe Johnston
2,3,Grumpier Old Men,"Romance, Comedy",A family wedding reignites the ancient feud be...,"Lancaster Gate, Warner Bros. Pictures",United States of America,English,Still Yelling. Still Fighting. Still Ready for...,"funny, best friend, duringcreditsstinger, fish...","Walter Matthau, Jack Lemmon, Ann-Margret, Soph...",Howard Deutch
3,4,Waiting to Exhale,"Comedy, Drama, Romance","Cheated on, mistreated and stepped on, the wom...",20th Century Fox,United States of America,English,Friends are the people who let you be yourself...,"based on novel or book, chick flick, divorce, ...","Whitney Houston, Angela Bassett, Loretta Devin...",Forest Whitaker
4,5,Father of the Bride Part II,"Comedy, Family",Just when George Banks has recovered from his ...,"Touchstone Pictures, Sandollar Productions",United States of America,English,Just when his world is back to normal... he's ...,"aging, baby, confidence, contraception, daught...","Steve Martin, Diane Keaton, Martin Short, Kimb...",Charles Shyer
...,...,...,...,...,...,...,...,...,...,...,...
17482,95581,The Flying Fleet,"Adventure, Drama, Romance","Six friends, all hoping to become aviators, ar...",Metro-Goldwyn-Mayer,United States of America,,,"aircraft carrier, aviator, expulsion, midshipm...","Ramon Novarro, Ralph Graves, Anita Page, Alfre...",George W. Hill
17483,95583,Savages,"Crime, Drama, Thriller",Pot growers Ben and Chon face off against the ...,"Relativity Media, Ixtlan, Onda Entertainment, ...",United States of America,English,Young. Beautiful. Deadly.,"american abroad, dea, dea agent, enforcer, exp...","Taylor Kitsch, Blake Lively, Aaron Taylor-John...",Oliver Stone
17484,95591,Rat King,Thriller,18 year old Juri is seriously addicted to onli...,"Making Movies, Allfilm","Finland, Estonia",Finnish,Would you put your life on the line?,visually stunning,"Max Ovaska, Outi Mäenpää, Janne Virtanen, Juli...",Petri Kotwica
17485,95595,Bela Kiss: Prologue,"Horror, Mystery, Thriller","A true story, Bela Kiss was one of the the mos...",Mirror Maze,Germany,"English, German",Jede Legende fordert neues Blut,"muddled, suspense","Kristina Klebe, Rudolf Martin, Fabian Stumm, B...",Lucien Förstner


In [7]:
movies.fillna('', inplace=True)

In [8]:
movies.to_csv('movies_training_data.csv', index=False)