<a href="https://colab.research.google.com/github/headboyprince/Emotional-Movie-Recommendation-Engine/blob/main/movie_recommender_systems_with_beautiful_soup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this section, we scrape the movie data we need for this project from the IMDb website and store it in a pandas DataFrame.

We use IMDb because it lets us search movies by genre, which is exactly what we want, and because it lists up to three genre tags for each movie. That makes it easy to build the emotion-based movie recommendation: we can map each genre to a suitable emotion. More genres == more emotions for us :)

While we could definitely scrape more movies, we purposely stopped at just 100 for this project so the code doesn't take too long to run.

Scraping data from websites lets you gather fresh, up-to-date information. It's like preparing a dish: you always want your ingredients to be fresh, and the same applies when you're working with data. Outdated data can lead to outdated predictions, which would hurt the accuracy of the recommender system.

In other words, data scraping is like shopping for the ingredients for a recipe. You want the freshest ingredients to make the best dish, and in the same way, when building a recommender system, you want the most recent data to make the best predictions.

# SCRAPING THE DATA THAT WE NEED

In [None]:
#@title Importing Required Libraries for Web Scraping

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

HEADERS ={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

This code imports four modules: requests, BeautifulSoup, pandas, and re.

requests is a Python module that allows you to send HTTP requests using Python. It simplifies the process of working with HTTP requests in Python by providing a higher-level interface than the built-in urllib module.

BeautifulSoup is a Python library that is used to parse HTML and XML documents. It allows you to extract data from these documents in a more human-readable and easier-to-manipulate format.

pandas is a powerful data analysis library for Python that provides easy-to-use data structures and data analysis tools for handling and manipulating numerical tables and time series data.

re is the Python standard library module for working with regular expressions. Regular expressions are a powerful way to search and manipulate text, and the re module provides functions for working with them in Python.

The HEADERS variable is a dictionary that contains a set of HTTP headers that will be used in the requests made by the script. HTTP headers allow the client and the server to pass additional information with an HTTP request or response. In this case, the headers are used to specify the user agent (i.e., the browser being used) and to specify various other options related to the request.
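The fetch function later in this notebook calls requests.get without passing HEADERS. As a minimal sketch of how the headers could be attached, the block below defines a hypothetical helper, fetch_html, and inspects the request that would be sent without touching the network. The shortened HEADERS dictionary and the Accept-Language entry are assumptions added for illustration; Accept-Language is a standard HTTP header, but whether it changes what IMDb returns is not verified here.

```python
import requests

# Shortened copy of the notebook's headers; "Accept-Language" is an assumption
# added for this sketch and is not part of the original HEADERS dictionary.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch_html(url, headers=HEADERS, timeout=10):
    """Hypothetical helper: fetch a page with custom headers and return its HTML."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    return response.text


# Build the request without sending it, to see which headers would go out.
prepared = requests.Request(
    "GET", "https://www.imdb.com/search/title/", headers=HEADERS
).prepare()
print(prepared.headers["User-Agent"])
```

Passing a timeout keeps a stalled connection from hanging the notebook, and raise_for_status replaces a manual status-code check.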

In [None]:
#@title Generating URLs for IMDB Movie Genres

genres = [
    
    "Adventure",
    "Animation",
    "Biography",
    "Comedy",
    "Crime",
    "Drama",
    "Family",
    "Fantasy",
    "Film-Noir",
    "History",
    "Horror",
    "Music",
    "Musical",
    "Mystery",
    "Romance",
    "Sci-Fi",
    "Sport",
    "Thriller",
    "War",
    "Western"
]

url_dict = {}

for genre in genres:
    url = "https://www.imdb.com/search/title/?genres={}&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=5aab685f-35eb-40f3-95f7-c53f09d542c3&pf_rd_r=N97GEQS6R7J9EV7V770D&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_16"
    formated_url = url.format(genre)
    url_dict[genre] = formated_url
    
print(url_dict)

{'Adventure': 'https://www.imdb.com/search/title/?genres=Adventure&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=5aab685f-35eb-40f3-95f7-c53f09d542c3&pf_rd_r=N97GEQS6R7J9EV7V770D&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_16', 'Animation': 'https://www.imdb.com/search/title/?genres=Animation&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=5aab685f-35eb-40f3-95f7-c53f09d542c3&pf_rd_r=N97GEQS6R7J9EV7V770D&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_16', 'Biography': 'https://www.imdb.com/search/title/?genres=Biography&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=5aab685f-35eb-40f3-95f7-c53f09d542c3&pf_rd_r=N97GEQS6R7J9EV7V770D&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_16', 'Comedy': 'https://www.imdb.com/search/title/?genres=Comedy&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=5

In [None]:
import requests
from bs4 import BeautifulSoup

def get_topics_page():
    # First results page (100 titles) of IMDb's top_1000 group, sorted by user rating
    topic_url = 'https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&start=000&ref_=adv_next'
    response=requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topic_url}')
    # Parse using BeautifulSoup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc
  

In [None]:
doc=get_topics_page()

In [None]:
def get_movie_titles(doc):
    
    selection_class="lister-item-header"
    movie_title_tags=doc.find_all('h3',{'class':selection_class})
    movie_titles=[]

    for tag in movie_title_tags:
        title = tag.find('a').text
        movie_titles.append(title)
        
        
    return movie_titles
titles = get_movie_titles(doc)
titles[:5]

['刺激1995', '教父', '黑暗騎士', '辛德勒的名單', '魔戒三部曲：王者再臨']

In [None]:
def get_movie_year(doc):
    year_selector = "lister-item-year text-muted unbold"           
    movie_year_tags=doc.find_all('span',{'class':year_selector})
    movie_year_tagss=[]
    for tag in movie_year_tags:
        movie_year_tagss.append(tag.get_text().strip()[1:5])
    return movie_year_tagss
years = get_movie_year(doc)
years[:5]

['1994', '1972', '2008', '1993', '2003']

In [None]:
def get_movie_url(doc):
    url_selector = "lister-item-header"
    movie_url_tags = doc.find_all('h3', {'class': url_selector})
    movie_urls = []
    base_url = 'https://www.imdb.com/'
    for tag in movie_url_tags:
        # Each href already starts with '/', so the joined URL contains a double slash
        movie_urls.append(base_url + tag.find('a')['href'])
    return movie_urls

urls = get_movie_url(doc)
urls[:5]

['https://www.imdb.com//title/tt0111161/',
 'https://www.imdb.com//title/tt0068646/',
 'https://www.imdb.com//title/tt0468569/',
 'https://www.imdb.com//title/tt0108052/',
 'https://www.imdb.com//title/tt0167260/']
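
The URLs above contain a double slash because the base URL ends with "/" and each scraped href starts with "/". A minimal sketch of one way to avoid that, using urljoin from the standard library; absolute_url is a hypothetical helper name and is not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com/"


def absolute_url(href, base=BASE_URL):
    """Resolve a site-relative href against the base URL."""
    return urljoin(base, href)


print(absolute_url("/title/tt0111161/"))
# -> https://www.imdb.com/title/tt0111161/
```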

In [None]:
def get_movie_duration(doc):
    
    selection_class="runtime"
    movie_duration_tags=doc.find_all('span',{'class':selection_class})
    movie_duration=[]

    for tag in movie_duration_tags:
        duration = tag.text[:-4]
        movie_duration.append(duration)
        
        
    return movie_duration
durations = get_movie_duration(doc)
durations[:5]

['142', '175', '152', '195', '201']

In [None]:
def get_movie_genre(doc):
    genre_selector="genre"            
    movie_genre_tags=doc.find_all('span',{'class':genre_selector})
    movie_genre_tagss=[]
    for tag in movie_genre_tags:
        movie_genre_tagss.append(tag.get_text().strip())
    return movie_genre_tagss
genres = get_movie_genre(doc)
genres[:5]

['Drama',
 'Crime, Drama',
 'Action, Crime, Drama',
 'Biography, Drama, History',
 'Action, Adventure, Drama']

In [None]:
def all_pages():
    # Dictionary that collects the data of all movies
    movies_dict={
        'title':[],
        'genre':[],
        'duration':[],
        'year':[],
        'url':[]
    }
    # The results span several pages of 100 movies each, so build each page's URL in a loop
    for start in range(1,2000,100):

        try:
            url = 'https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&start='+str(start)+'&ref_=adv_next'
            response = requests.get(url)
        except requests.RequestException:
            # Stop paging if the request itself fails
            break

        if response.status_code != 200:
            break

        # Parse using BeautifulSoup
        doc = BeautifulSoup(response.text, 'html.parser')
        titles = get_movie_titles(doc)
        urls = get_movie_url(doc)
        durations = get_movie_duration(doc)
        years = get_movie_year(doc)
        genres = get_movie_genre(doc)

        # Add every movie on this page to the dictionary.
        # get_movie_certification and get_movie_rating are not defined in this
        # notebook, so the certification and rating fields are left out.
        for j in range(len(titles)):
            movies_dict['title'].append(titles[j])
            movies_dict['genre'].append(genres[j])
            movies_dict['duration'].append(durations[j])
            movies_dict['year'].append(years[j])
            movies_dict['url'].append(urls[j])

    return pd.DataFrame(movies_dict)
  

In [None]:
movies_dict={
    'title':titles,
    'genre':genres,
    'duration':durations,
    'year':years,
    'url':urls
}
df = pd.DataFrame(movies_dict)


The earlier cell titled "Generating URLs for IMDB Movie Genres" defines a list of movie genres and creates a dictionary of URLs, one for each genre.

The genres list is a list of strings representing different movie genres.

The url_dict dictionary is initially empty. The code then iterates over the genres list using a for loop. For each genre, it takes a URL string containing the placeholder {} and replaces the placeholder with the actual genre using the format method. The resulting URL is stored in the formated_url variable.

The code then adds an entry to url_dict with the genre as the key and the formatted URL as the value, so after the loop url_dict contains one entry for each genre in the genres list.

For example, if the genres list contains only "Adventure" and "Comedy", url_dict will contain two entries, keyed 'Adventure' and 'Comedy', whose URLs carry genres=Adventure and genres=Comedy in the query string.
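
As a minimal sketch of the same idea, the block below builds one search URL per genre with urlencode from the standard library instead of str.format. build_genre_urls and SEARCH_URL are hypothetical names, and only a few of the notebook's query parameters are kept (the tracking parameters are dropped), so the URLs are shorter than the ones printed earlier.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.imdb.com/search/title/"


def build_genre_urls(genres):
    """Return a dict mapping each genre to its search URL."""
    urls = {}
    for genre in genres:
        params = {
            "genres": genre,
            "sort": "user_rating,desc",
            "title_type": "feature",
            "num_votes": "25000,",
        }
        # safe="," keeps the commas readable instead of encoding them as %2C
        urls[genre] = SEARCH_URL + "?" + urlencode(params, safe=",")
    return urls


url_dict = build_genre_urls(["Adventure", "Comedy"])
for genre, url in url_dict.items():
    print(genre, "->", url)
```

Building the query string from a dictionary keeps each parameter visible and lets urlencode handle any characters that need escaping.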

In [None]:
df = pd.DataFrame(movies_dict)
df.head()

Unnamed: 0,title,genre,duration,year,url
0,刺激1995,Drama,142,1994,https://www.imdb.com//title/tt0111161/
1,教父,"Crime, Drama",175,1972,https://www.imdb.com//title/tt0068646/
2,黑暗騎士,"Action, Crime, Drama",152,2008,https://www.imdb.com//title/tt0468569/
3,辛德勒的名單,"Biography, Drama, History",195,1993,https://www.imdb.com//title/tt0108052/
4,魔戒三部曲：王者再臨,"Action, Adventure, Drama",201,2003,https://www.imdb.com//title/tt0167260/


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



# Get the first few rows of the dataframe
print(df.head())

# Get summary statistics of the numerical columns
print(df.describe())

# Get the count of unique values in the categorical columns
print(df['genre'].value_counts())

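# The scraped DataFrame has no 'rating' column (the original run raised a
# KeyError at this point), so the rating-based steps only run when it exists.
if 'rating' in df.columns:
    # Visualize the distribution of the 'rating' column
    # (histplot is used because distplot is deprecated in recent seaborn releases)
    sns.histplot(df['rating'], kde=True)
    plt.show()

    # Group the data by the 'year' column and compute the mean rating for each group
    yearly_ratings = df.groupby('year')['rating'].mean()
    print(yearly_ratings)

    # Plot the yearly ratings
    yearly_ratings.plot()
    plt.show()

# Compute the pairwise correlations between the numerical columns
corr = df.corr(numeric_only=True)
print(corr)

# Visualize the correlations using a heatmap (skipped when there are no numeric columns)
if not corr.empty:
    sns.heatmap(corr)
    plt.show()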


        title                      genre duration  year  \
0      刺激1995                      Drama      142  1994   
1          教父               Crime, Drama      175  1972   
2        黑暗騎士       Action, Crime, Drama      152  2008   
3      辛德勒的名單  Biography, Drama, History      195  1993   
4  魔戒三部曲：王者再臨   Action, Adventure, Drama      201  2003   

                                      url  
0  https://www.imdb.com//title/tt0111161/  
1  https://www.imdb.com//title/tt0068646/  
2  https://www.imdb.com//title/tt0468569/  
3  https://www.imdb.com//title/tt0108052/  
4  https://www.imdb.com//title/tt0167260/  
         title  genre duration  year                                     url
count      100    100      100   100                                     100
unique     100     52       65    52                                     100
top     刺激1995  Drama      164  1994  https://www.imdb.com//title/tt0111161/
freq         1      7        4     5                                     

KeyError: ignored

In [None]:
# Find rows with missing values
print(df[df.isnull().any(axis=1)])


Empty DataFrame
Columns: [title, genre, duration, year, url]
Index: []


In [None]:
import sqlite3
import pandas as pd

# Connect to the database
conn = sqlite3.connect('scraped_movies_data.db')

# Save the dataframe to the database
df.to_sql('movies', conn, if_exists='replace', index=False)

# Close the connection
conn.close()


In [None]:
# Connect to the database
import sqlite3
import pandas as pd
conn = sqlite3.connect('scraped_movies_data.db')

# Query the database
cursor = conn.cursor()
cursor.execute('SELECT * FROM movies')

# Fetch and print the results
results = cursor.fetchall()
print(results)

# Close the connection
conn.close()
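
As an alternative to reading rows through a cursor, the saved table can also be loaded straight back into a DataFrame. The block below is a minimal sketch of that round trip using an in-memory SQLite database and a two-row toy table, so it does not touch scraped_movies_data.db; the toy data is an assumption for illustration.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the scraped DataFrame
movies = pd.DataFrame({"title": ["Movie A", "Movie B"], "year": ["1994", "1972"]})

conn = sqlite3.connect(":memory:")  # temporary database that lives only in memory
try:
    # Write the DataFrame to a table, then read it back with a SQL query
    movies.to_sql("movies", conn, if_exists="replace", index=False)
    loaded = pd.read_sql_query("SELECT title, year FROM movies", conn)
finally:
    conn.close()  # always release the connection, even if a step fails

print(loaded)
```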


The earlier cell that calls df.head() creates a Pandas DataFrame called df from the movies_dict dictionary and displays its first five rows using the head method.

The resulting DataFrame contains the data for the 100 movies scraped from the first results page on IMDb: the title, genre, duration, year of release, and URL of each movie. Rating and certification are not scraped in this notebook. The rows of the DataFrame represent individual movies, and the columns represent the different types of data.

The head method returns the first five rows of the DataFrame by default. You can pass a different number as an argument to get a different number of rows. For example, df.head(10) would return the first ten rows of the DataFrame.

# CLEANING THE DATA

Now here is where the dirty work begins. We need to clean the data so we can pass it into a machine learning model; if the data contains odd values, we won't be able to build the model we want. Cleaning means getting rid of unwanted values and breaking columns up into the kind of data we can use to train our model. Think of it as preparing a chicken you just killed for cooking: just as you would clean and prepare the chicken before cooking it, you clean and prepare your data before using it in a machine learning model or for analysis.

In this notebook, the data is scraped from IMDb and then modified so that it can be passed into the model, which is like removing the feathers, innards, and any other unwanted parts of the chicken. The cleaning process includes removing spaces from the genre column, converting the genre column into a matrix format, converting the duration column to numeric values, and replacing any non-numeric values in the year column with 0. This is like cutting the chicken to your desired size and shape. The cleaned data is then saved to a SQLite database for future use, like storing the cleaned chicken in the refrigerator before cooking.

Overall, data cleaning is an important step in preparing data for analysis and machine learning, just as preparing the chicken is important in making a delicious chicken pepper soup. Without proper cleaning and preparation, the final result (the chicken pepper soup) may not turn out as desired.

---






In [None]:
#duplicate the scraped movies dataframe, so we can modify it as we wish while the original dataframe stays intact.
df1 = df.copy()
df1

Unnamed: 0,title,genre,duration,year,url
0,刺激1995,Drama,142,1994,https://www.imdb.com//title/tt0111161/
1,教父,"Crime, Drama",175,1972,https://www.imdb.com//title/tt0068646/
2,黑暗騎士,"Action, Crime, Drama",152,2008,https://www.imdb.com//title/tt0468569/
3,辛德勒的名單,"Biography, Drama, History",195,1993,https://www.imdb.com//title/tt0108052/
4,魔戒三部曲：王者再臨,"Action, Adventure, Drama",201,2003,https://www.imdb.com//title/tt0167260/
...,...,...,...,...,...
95,捍衛戰士：獨行俠,"Action, Drama",130,2022,https://www.imdb.com//title/tt1745960/
96,心靈捕手,"Drama, Romance",126,1997,https://www.imdb.com//title/tt0119217/
97,烈火悍將,"Action, Crime, Drama",170,1995,https://www.imdb.com//title/tt0113277/
98,惡棍特工,"Adventure, Drama, War",153,2009,https://www.imdb.com//title/tt0361748/


In [None]:
#@title Preprocessing Movie Genre Data for Matrix Conversion

#remove the spaces from the genre column so that it will be easy to convert the column into a matrix dataframe.
df1['genre'] = df1['genre'].str.replace(' ', '')
movies_mat = df1['genre']
movies_mat.head()

This code removes the spaces from the genre column of the df1 DataFrame, stores the result back in df1, and also assigns the resulting Series to a new variable called movies_mat. The str.replace() method is used to replace the spaces with an empty string.

After executing this code, the movies_mat variable will contain a Series with the same data as the genre column of df1, with the spaces removed from the values.

The head method is then used to print the first five rows of movies_mat. You can pass a different number as an argument to print a different number of rows. For example, movies_mat.head(10) would print the first ten rows of the Series.

In [None]:
#@title Converting Movie Genre Data into a Matrix Format

#convert the genre column into matrix columns using items that are separated by a comma
movies_mat = df1['genre'].str.get_dummies(sep=',')
movies_mat.head()

This code converts the genre column of the df1 DataFrame into a matrix of binary values using the str.get_dummies() method. The sep argument is used to specify the separator used to split the values in the genre column. In this case, the separator is a comma.

After executing this code, the movies_mat variable will contain a new DataFrame with a separate column for each genre found in the genre column of df1. The values in these columns will be binary, with 1 indicating that the movie belongs to that genre and 0 indicating that it does not.

The head method is then used to print the first five rows of movies_mat. You can pass a different number as an argument to print a different number of rows. For example, movies_mat.head(10) would print the first ten rows of the DataFrame.
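
To make the effect of str.get_dummies concrete, here is a minimal sketch on a three-row toy DataFrame. The titles are made up for illustration; the genre strings follow the format of the scraped data.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genre": ["Drama", "Crime, Drama", "Action, Crime, Drama"],
})

# Strip the spaces first; otherwise ' Drama' and 'Drama' would become separate columns
cleaned = toy["genre"].str.replace(" ", "")

# One 0/1 column per distinct genre, split on commas
genre_matrix = cleaned.str.get_dummies(sep=",")
print(genre_matrix)
```

Each row has a 1 under every genre it lists, so "Movie B" is flagged under Crime and Drama but not Action. This is also why the spaces are removed in the previous step: without it, splitting on "," would leave a leading space on some genre names.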

In [None]:
#@title Combining Movie Data and Genre Matrix Columns

#combine the dataframe and the matrix columns
df1 = pd.concat([df1, movies_mat], axis = 1)
df1.head()

This code combines the df1 DataFrame and the movies_mat DataFrame into a single DataFrame using the concat() function from the pandas library. The axis argument is set to 1 to indicate that the DataFrames should be concatenated horizontally (column-wise), with the columns of movies_mat being added to the right of the columns of df1.

After executing this code, the df1 variable will contain a new DataFrame that has all the original columns of df1 plus one binary column per genre from movies_mat. The original genre column is kept alongside the binary genre columns.

The head method is then used to print the first five rows of df1. You can pass a different number as an argument to print a different number of rows. For example, df1.head(10) would print the first ten rows of the DataFrame.
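
A minimal sketch of what axis=1 does, using a two-row toy DataFrame (the titles are made up for illustration): the number of rows stays the same and the genre columns are appended on the right.

```python
import pandas as pd

left = pd.DataFrame({"title": ["Movie A", "Movie B"], "genre": ["Drama", "Crime,Drama"]})
right = left["genre"].str.get_dummies(sep=",")

# axis=1 places the frames side by side, aligning rows by index
combined = pd.concat([left, right], axis=1)
print(combined)
print(combined.shape)
```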

In [None]:
#@title Converting Movie Duration Column to Numeric Values
#turn the duration column from object to numeric values
df1['duration'] = pd.to_numeric(df1['duration'])
#df1['year'] = pd.to_numeric(df1['year'])
df1.info()

This code converts the duration column of the df1 DataFrame from an object data type to a numeric data type. It does this using the to_numeric() function from the pandas library.

The info() method is then used to print information about the columns in the df1 DataFrame, including the data type of each column.

Note that this code also includes a line that converts the year column to a numeric data type, but it is commented out (preceded by #) and will not be executed. If you want to convert the year column to a numeric data type, you can remove the # at the beginning of the line.

After executing this code, the duration column of df1 will contain numeric values instead of strings. If you have also uncommented the line to convert the year column, it will also be converted to a numeric data type.

In [None]:
#@title Finding Non-Numeric Values in the Year Column
import pandas as pd
import re
#select the rows whose year contains non-numeric characters.
non_numeric = re.compile(r'[^\d.]+')

df1.loc[df1['year'].str.contains(non_numeric)]


This code uses the re (regular expression) module to define a regular expression pattern that matches any character that is not a digit or a period (.). The re.compile() function is used to create a compiled regular expression pattern object that can be used to match strings.

The str.contains() method of the year column is then used to find rows in which the year column contains any characters that match the regular expression pattern. The resulting boolean Series is passed to the loc attribute to select the rows of df1 that match the pattern.

The loc attribute is used to select rows and columns from a DataFrame based on their labels. In this case, it is used to select the rows of df1 that contain non-numeric characters in the year column.

After executing this code, the resulting DataFrame will contain only the rows of df1 where the year column contains non-numeric characters. If there are no such rows, the resulting DataFrame will be empty.

In [None]:
#@title Converting Non-Numeric Year Values to 0
#convert all non-numeric values in year to 0.
import pandas as pd
def remove_nils(x):
  if (x.isnumeric()):
    return int(x)
  else:
    return 0
df1["year"] = df1["year"].apply(remove_nils)
df1.head()

This code defines a function remove_nils() that takes in a single argument x and returns x as an integer if it is numeric, or 0 if it is not numeric.

The apply() method of the year column is then used to apply the remove_nils() function to each value in the year column. The resulting Series is assigned back to the year column of df1, effectively replacing the original values with their processed counterparts.

After executing this code, the year column of df1 will contain only numeric values, with any non-numeric values replaced with 0. The resulting DataFrame is then printed using the head() method to show the first five rows.
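
Replacing unparseable years with 0 loses the real year. As a minimal sketch of an alternative, the hypothetical helper extract_year below searches the raw text for a four-digit number instead of slicing fixed positions. The "(I) (2020)" input is an assumption about how a disambiguated title's year might appear on the page.

```python
import re

# A run of exactly four digits, not embedded in a longer number
YEAR_PATTERN = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def extract_year(text):
    """Return the first four-digit year found in text, or 0 if there is none."""
    match = YEAR_PATTERN.search(text)
    return int(match.group(1)) if match else 0


for raw in ["(1994)", "(I) (2020)", "I) ("]:
    print(repr(raw), "->", extract_year(raw))
```

This would have to be applied to the raw scraped text, before the [1:5] slice in get_movie_year discards part of it.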

In [None]:
#check the row at position 59 to see whether its year has been converted to numeric. it has been converted to 0.
display(df1.iloc[59])

In [None]:
import sqlite3
import pandas as pd

# Connect to the database
conn = sqlite3.connect('cleaned_movies_data.db')

# Save the dataframe to the database
df1.to_sql('movies', conn, if_exists='replace', index=False)

# Close the connection
conn.close()


The earlier display(df1.iloc[59]) cell uses the display() function to show the row at position 59 of the df1 DataFrame (the 60th row, since positions start at 0). The iloc attribute is used to access the row by its integer position, and the resulting Series is shown in the notebook output.

If the year column of this row has been successfully converted to a numeric value, it will be displayed as such in the output. If it has been converted to 0, as specified in the previous code, it will be displayed as 0 in the output.

# ACCEPTING INPUT & GENERATING A RANDOM MOVIE

In [None]:
# Unique raw genre strings (each one is still a comma-separated combination)
genre_set = set(df1['genre'])
genre_list = list(genre_set)

print(genre_list)

In [None]:
genre_set = set()
for genre_string in df1['genre']:
    genre_list = genre_string.split(',')
    for genre in genre_list:
        genre_set.add(genre.strip())
print(genre_set)


In [None]:
genre_list = list(genre_set)
for i in range(0, len(genre_list), 10):
  print(", ".join(genre_list[i:i+10]))

In [None]:
df1.head()

In [None]:
print('genres available:')
genre_list = list(genre_set)
for i in range(0, len(genre_list), 10):
  print(", ".join(genre_list[i:i+10]))

#use the input function to accept the genre of movie the user wants to watch
select_genre = input("which genre do you want to watch?: ")

#keep only the movies tagged with the selected genre
filterm1 = df1.loc[df1[select_genre] == 1]
filterm1

In [None]:
#use the input function to accept the era of movie the user wants to watch
select_year = (input("classic or modern movies?: "))

#year categorized into modern (released in 2000 or later) and classic (released before the year 2000)
if select_year == 'classic':
  filterm1 = filterm1[filterm1['year'] < 2000]
elif select_year == 'modern':
  filterm1 = filterm1[filterm1['year'] >= 2000]
else:
  print('invalid input')


In [None]:
#use the input function to accept the duration of movie the user wants to watch
select_duration = (input("short or long movies?: "))

#duration categorized into long (movie length is 2 hours or more) and short (movie length is 2 hours or less)
if select_duration == 'long':
  filterm1 = filterm1[filterm1['duration'] >= 120]
elif select_duration == 'short':
  filterm1 = filterm1[filterm1['duration'] <= 120]
else:
  print('invalid input')



In [None]:

import random

    
#check if there is a movie available, if there is print the dataframe else, print "no movie found"
if filterm1.empty:
  print('No movie matches your input, please make another input!')
else:
  watch_movie=filterm1.sample()
  watch_movie.head()

In [None]:
watch_movie.head()

Please note that this code will only work if you have a DataFrame named "df1" that has the columns "title", "genre", "duration", "year", and "url", and the genre column has also been converted into a matrix with each genre as a separate column.

The code prompts the user to select a genre, a year range (classic or modern), and a duration range (short or long). It filters the DataFrame based on the user's selections, and then selects a random movie from the result using the .sample() method. If the filtered DataFrame is empty, it prints the message "No movie matches your input, please make another input!" instead of sampling.
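
The classic/modern step can also be written as a small reusable function that normalizes the user's answer and rejects anything else. This is a minimal sketch, not the notebook's own code: filter_by_era is a hypothetical helper, the toy DataFrame is made up, and treating the year 2000 itself as modern is an assumption.

```python
import pandas as pd


def filter_by_era(movies, choice, cutoff=2000):
    """Return the rows matching 'classic' (year < cutoff) or 'modern' (year >= cutoff)."""
    choice = choice.strip().lower()  # tolerate stray spaces and capital letters
    if choice == "classic":
        return movies[movies["year"] < cutoff]
    if choice == "modern":
        return movies[movies["year"] >= cutoff]
    raise ValueError(f"expected 'classic' or 'modern', got {choice!r}")


toy = pd.DataFrame({"title": ["Old Film", "New Film"], "year": [1972, 2008]})
print(filter_by_era(toy, "Classic")["title"].tolist())
print(filter_by_era(toy, " modern ")["title"].tolist())
```

Raising an error on unknown input, rather than only printing a message, prevents the later cells from running on an unfiltered DataFrame.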

In [None]:
# Keep the one-row Series for the next cell, and pull out plain values for printing
watch_movie_title = watch_movie['title']
watch_movie_year = watch_movie['year']
watch_movie_url = watch_movie['url']
selected_movie_title = watch_movie['title'].iloc[0]

print("We recommend watching: " + str(selected_movie_title) + " (" + str(watch_movie_year.iloc[0]) + ")" + "\n" + "You can find more information about this movie at the following url: " + str(watch_movie_url.iloc[0]))

In [None]:
# Convert the output to a string
title_string = watch_movie_title.values.tolist()[0]

# Print the resulting string
#print(title_string)

# Convert the output to a string
year_string = watch_movie_year.values.tolist()[0]

# Print the resulting string
#print(year_string)

# Extract the URL from the output
url_string = watch_movie_url.values.tolist()[0]

# Print the URL
#print(url)


print(f"You should watch {title_string}. It's a {year_string} movie. This is the url to the movie: {url_string}")

# **CONVERT THE GENRES INTO THEIR RESPECTIVE EMOTIONS...**


Here is the list of emotions we will be working with:

* Love - Romance
* Contentment - Comedy
* Excitement - Action
* Boredom - Documentary
* Curiosity - Mystery
* Disappointment - Tragedy
* Envy - Crime
* Guilt - Horror
* Hope - Fantasy
* Pride - War
* Relief - Animation
* Shame - Biography
* Skepticism - Science fiction
* Sorrow - History
* Suspense - Adventure
* Sad - Drama
* Disgust - Musical
* Anger - Family
* Anticipation - Thriller
* Fear - Sport
* Enjoyment - Thriller
* Trust - Western
* Surprise - Film-Noir



In [None]:
df2 = df1.copy()
df2

In [None]:
# Rename multiple columns at once

df2.rename(columns={'Action': 'Excitement', 'Adventure': 'Suspense', 'Animation': 'Relief', 'Biography': 'Shame',
           'Comedy': 'Enjoyment', 'Crime': 'Envy', 'Drama': 'Sad', 'Family': 'Anger', 'Fantasy': 'Hope',
           'Film-Noir': 'Surprise', 'History': 'Sorrow', 'Horror': 'Guilt', 'Music': 'Disgust', 'Mystery': 'Curiosity',
           'Romance': 'Trust', 'Sci-Fi': 'Skepticism', 'Thriller': 'Anticipation', 'War': 'Pride', 'Western': 'Rugged'}, inplace=True)

In [None]:
df2.head()
for column in df2.columns:
    print(column)


The rename cell above renames the genre columns in the df2 DataFrame using a mapping of old column names to new column names. For example, the column named "Action" is renamed to "Excitement", the column named "Adventure" is renamed to "Suspense", and so on. The inplace=True argument specifies that the changes are made in place, meaning that df2 itself is modified rather than a new DataFrame being created. Because df2 is a copy of df1, df1 keeps its original genre column names.

The 'year' column is still numeric at this point. It is only converted to the labels 'classic' and 'modern', depending on whether the value is less than 2000, later on, when the data is prepared for the recommendation model.
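
The same genre-to-emotion mapping can also be applied to a single movie's genre string without renaming any columns. The block below is a minimal sketch: GENRE_TO_EMOTION copies a few entries from the rename mapping in the notebook, and emotions_for is a hypothetical helper.

```python
# Subset of the rename mapping used in the notebook
GENRE_TO_EMOTION = {
    "Action": "Excitement",
    "Adventure": "Suspense",
    "Crime": "Envy",
    "Drama": "Sad",
    "Romance": "Trust",
}


def emotions_for(genre_string):
    """Map a comma-separated genre string to its list of emotions, skipping unknown genres."""
    genres = [genre.strip() for genre in genre_string.split(",")]
    return [GENRE_TO_EMOTION[genre] for genre in genres if genre in GENRE_TO_EMOTION]


print(emotions_for("Action, Crime, Drama"))
# -> ['Excitement', 'Envy', 'Sad']
```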

In [None]:
binary_columns = []
for col in df2.columns:
  unique_values = df2[f'{col}'].nunique()
  print(f'Column {col} has {unique_values} unique values')
  

  if unique_values == 2:
    binary_columns.append(col)

print(binary_columns)


In [None]:
print(binary_columns)

In [None]:
print(f' list of emotions available {binary_columns}')
#using the input function to accept the user preferred emotion of the movie they want to watch
select_emotion = input("which emotion of the movie you want to watch?: ")


#keep only the movies tagged with the selected emotion
filterm2 = df2.loc[df2[select_emotion] == 1]
filterm2

In [None]:
#use the input function to accept the era of movie the user wants to watch
select_year = (input("classic or modern movies?: "))

#year categorized into modern (released in 2000 or later) and classic (released before the year 2000)
if select_year == 'classic':
  filterm2 = filterm2[filterm2['year'] < 2000]
elif select_year == 'modern':
  filterm2 = filterm2[filterm2['year'] >= 2000]
else:
  print('invalid input, enter a valid input: "classic" or "modern"')


In [None]:
#use the input function to accept the duration of movie the user wants to watch
select_duration = (input("short or long movies?: "))

#duration categorized into long (movie length is 2 hours or more) and short (movie length is 2 hours or less)
if select_duration == 'long':
  filterm2 = filterm2[filterm2['duration'] >= 120]
elif select_duration == 'short':
  filterm2 = filterm2[filterm2['duration'] <= 120]
else:
  print('invalid input, enter a valid input: "short" or "long"')



In [None]:

import random

    
#check if there is a movie available, if there is print the dataframe else, print "no movie found"
if filterm2.empty:
  print('No movie matches your input, please make another input!')
else:
  watch_movie1=filterm2.sample()


In [None]:
watch_movie1.head()

In [None]:
watch_movie_title = watch_movie1['title']
watch_movie_year = watch_movie1['year']
watch_movie_url = watch_movie1['url']

In [None]:
# Extract the title, year and URL of the sampled movie as plain Python values
title_string = watch_movie_title.values.tolist()[0]
year_string = watch_movie_year.values.tolist()[0]
url_string = watch_movie_url.values.tolist()[0]

print(f"You should watch {title_string}. It's a {year_string} movie. This is the URL to the movie: {url_string}")

In [None]:
import sqlite3
import pandas as pd

# Connect to the database
conn = sqlite3.connect('emotion_movies_data.db')

# Save the dataframe to the database
df2.to_sql('movies', conn, if_exists='replace', index=False)

# Close the connection
conn.close()
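
To check that a table was written as expected, it can be read back into a DataFrame. The following is a minimal sketch that uses an in-memory SQLite database and a hypothetical toy DataFrame rather than the notebook's `emotion_movies_data.db` and `df2`.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data standing in for df2
toy = pd.DataFrame({"title": ["A", "B"], "year": [1972, 2010]})

# ":memory:" keeps the example from touching any file on disk
conn = sqlite3.connect(":memory:")
try:
    # Write the table, then read it straight back
    toy.to_sql("movies", conn, if_exists="replace", index=False)
    restored = pd.read_sql_query("SELECT * FROM movies", conn)
finally:
    conn.close()

print(restored)
```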

# PREPARE DATA FOR RECOMMENDATION ML MODEL

In [None]:
df3=df2.copy()
df3

In [None]:
df3.dtypes

In [None]:
import numpy as np

# Replace each numeric year with a label: 'classic' (before 2000) or 'modern' (2000 or later)
df3['year'] = np.where(df3['year'] < 2000, 'classic', 'modern')

This code overwrites the year column of the DataFrame df3 in place. It uses the np.where() function to apply a condition to the existing values: if a year is less than 2000, it is replaced with the string 'classic'; otherwise it is replaced with the string 'modern'. The result is a year column with only two values, 'classic' or 'modern'. The original numeric years remain available in df2, because df3 is a copy.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Plot the 'year' column as a pie chart
df3['year'].value_counts().plot.pie(figsize=(6, 6))

# Show the plot
plt.show()

In [None]:
df3.head()

In [None]:
# Replace each duration with a label: 'short' if less than 120 minutes, otherwise 'long'
df3['duration'] = np.where(df3['duration'] < 120, 'short', 'long')

The code above changes the values in the 'duration' column of df3 based on a condition. It uses the np.where() function from NumPy, which takes a condition as its first argument and two values as its second and third arguments. Where the condition is true, the second argument is used; otherwise the third argument is used. Here the condition is df3['duration'] < 120, so a duration of less than 120 minutes is replaced with 'short', and any other duration is replaced with 'long'.
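
As a small illustration of how np.where() behaves element by element, here is a sketch on a hypothetical three-row DataFrame. It writes the labels to a separate `length` column (a name chosen only for this example) so the numeric minutes stay visible next to them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the notebook applies the same call to df3
toy = pd.DataFrame({"title": ["A", "B", "C"], "duration": [95, 120, 175]})

# np.where(condition, value_if_true, value_if_false) is evaluated per row
toy["length"] = np.where(toy["duration"] < 120, "short", "long")

print(toy)
```

A movie of exactly 120 minutes does not satisfy `< 120`, so it is labelled 'long'.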

In [None]:
# Plot the 'duration' column as a pie chart
df3['duration'].value_counts().plot.pie(figsize=(6, 6))

# Show the plot
plt.show()

In [None]:
df3.head()

The pie-chart cell for the 'year' column visualizes the distribution of its values in the df3 DataFrame, with each value ("modern", "classic") drawn as a slice of the pie. The size of each slice is proportional to the number of times that value occurs in the column. The figure size is set to 6 inches by 6 inches, and the plot is displayed using the plt.show() function.

The pie-chart cell for the 'duration' column does the same for the 'duration' values of df3. The value_counts method counts the number of occurrences of each unique value in the column and returns the counts as a pandas Series. That Series is then plotted as a pie chart using the plot.pie method, with the figure size set to (6, 6) through the figsize parameter. Finally, the show function from the matplotlib.pyplot library is called to display the plot.
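
To see the numbers behind such a pie chart, the counts can be printed directly. This is a minimal sketch on a hypothetical Series; the plotting call itself is left out.

```python
import pandas as pd

# Hypothetical labels standing in for df3['year']
eras = pd.Series(["modern", "classic", "modern", "modern"], name="year")

# Absolute counts per label: these determine the slice sizes
counts = eras.value_counts()

# The same counts as fractions of the total
shares = eras.value_counts(normalize=True)

print(counts)
print(shares)
```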

In [None]:
# Drop the emotion columns that are not needed for the model
df3.drop(columns=['Excitement', 'Suspense', 'Relief', 'Shame', 'Enjoyment', 'Envy', 'Sad', 'Anger', 'Hope', 'Surprise', 'Sorrow', 'Guilt', 'Disgust', 'Curiosity', 'Trust', 'Skepticism', 'Anticipation', 'Pride',
                  'Rugged'], inplace=True)
df3

In [None]:
df3.drop(columns=['url'], inplace=True)
df3

In [None]:
for column in df3.columns:
    print(column)

In [None]:
import sqlite3
import pandas as pd

# Connect to the database
conn = sqlite3.connect('prepare_movies_data.db')

# Save the dataframe to the database
df3.to_sql('movies', conn, if_exists='replace', index=False)

# Close the connection
conn.close()

# CREATING THE COSINE SIMILARITY MODEL

In [None]:
df4 = df3.copy()
df4

In [None]:
# Split the 'genre' column on the comma and expand it into new columns
df4[['genre1', 'genre2', 'genre3']] = df4['genre'].str.split(',', expand=True)


df4

In [None]:
import pandas as pd
import numpy as np

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
features = ['genre1', 'genre2', 'genre3', 'duration', 'year']
features

In [None]:
for feature in features:
    df4[feature] = df4[feature].fillna('')

In [None]:
def combined_features(row):
    return row['genre1']+" "+row['genre2']+" "+row['genre3']+" "+row['duration']+" "+row['year']


In [None]:
df4["combined_features"] = df4.apply(combined_features, axis =1)

In [None]:
#adding index column to the dataframe
df4 = df4.reset_index()


In [None]:
for columns in df4:
  print(columns)

In [None]:
cv = CountVectorizer()
count_matrix = cv.fit_transform(df4["combined_features"])
print("Count Matrix:", count_matrix.toarray())

In [None]:
cosine_sim = cosine_similarity(count_matrix)

In [None]:
cosine_sim
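
To make the arithmetic behind the count matrix and the similarity matrix concrete, here is a plain-Python sketch. It is not scikit-learn's implementation: it assumes a simplified tokenizer (lowercase, split on whitespace), whereas CountVectorizer has its own tokenization rules, and the three feature strings are hypothetical.

```python
import math
from collections import Counter


def count_vector(text):
    """Count how often each lowercase token occurs in the text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical combined_features strings
docs = {
    "movie_a": "Crime Drama long classic",
    "movie_b": "Crime Drama Thriller long modern",
    "movie_c": "Animation Comedy short modern",
}
vectors = {name: count_vector(text) for name, text in docs.items()}

sim_ab = cosine(vectors["movie_a"], vectors["movie_b"])
sim_ac = cosine(vectors["movie_a"], vectors["movie_c"])
print(round(sim_ab, 3), round(sim_ac, 3))
```

Movies that share more tokens (genres, duration label, era label) score closer to 1, and movies with no tokens in common score 0.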

In [None]:
import sqlite3
import pandas as pd

# Connect to the database
conn = sqlite3.connect('model_movies_data.db')

# Save the dataframe to the database
df4.to_sql('movies', conn, if_exists='replace', index=False)

# Close the connection
conn.close()

# MAKING PREDICTIONS USING THE MODEL

In [None]:
movie_user_likes = "The Godfather"
def get_index_from_title(title):
    return df4[df4.title == title]['index'].values[0]
movie_index = get_index_from_title(movie_user_likes)

In [None]:
similar_movies = list(enumerate(cosine_sim[movie_index]))

In [None]:
similar_movies

In [None]:
sorted_similar_movies = sorted(similar_movies, key=lambda x:x[1], reverse=True)

In [None]:
sorted_similar_movies

In [None]:
def get_title_from_index(index):
    return df4[df4.index == index]["title"].values[0]
    
i=0
for movie in sorted_similar_movies:
    print(get_title_from_index(movie[0]))
    
    i=i+1
    if i>5:
        break
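
Because a movie has a similarity of 1.0 with itself, the sorted list typically starts with the movie the user already likes. The sketch below shows one possible way to skip it and keep only the top matches. The helper name `top_similar` and the similarity row are hypothetical.

```python
def top_similar(similarity_row, query_index, n=5):
    """Return the n most similar (index, score) pairs, excluding the query."""
    scored = [
        (index, score)
        for index, score in enumerate(similarity_row)
        if index != query_index
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical row of a similarity matrix for the movie at index 0
row = [1.0, 0.2, 0.9, 0.0, 0.6]

top = top_similar(row, query_index=0, n=3)
print(top)
```

With the notebook's data, the row would be `cosine_sim[movie_index]` and the returned indices could be passed to `get_title_from_index`.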





In [None]:
def get_title_from_index(index):
    return df4[df4.index == index]["genre"].values[0]


i=0
for movie in sorted_similar_movies:
    print(get_title_from_index(movie[0]))
    
    i=i+1
    if i>5:
        break

In [None]:
def get_title_from_index(index):
    return df2[df2.index == index]["duration"].values[0]


i=0
for movie in sorted_similar_movies:
    print(get_title_from_index(movie[0]))
    
    i=i+1
    if i>5:
        break

In [None]:
def get_title_from_index(index):
    return df2[df2.index == index]["year"].values[0]


i=0
for movie in sorted_similar_movies:
    print(get_title_from_index(movie[0]))
    
    i=i+1
    if i>5:
        break

# PUTTING THE MODEL TO TEST

In [None]:
# Collect the user's preferences, then keep the movies tagged with the selected emotion
select_emotion = input("Which emotion do you want the movie to match?: ")
select_year = input("classic or modern movies?: ")
select_duration = input("short or long movies?: ")

filterm3 = df2.loc[df2[select_emotion] == 1]
filterm3

In [None]:
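# In df4 the year is labelled 'classic' (released before 2000) or 'modern' (released in 2000 or later)
# df4 is looked up by filterm3's index, assuming df2 and df4 share the same row index
if select_year == 'classic':
  filterm3 = filterm3[df4.loc[filterm3.index, 'year'] == 'classic']
elif select_year == 'modern':
  filterm3 = filterm3[df4.loc[filterm3.index, 'year'] == 'modern']
else:
  print('Invalid input: enter "classic" or "modern"')
filterm3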

In [None]:
# In df4 the duration is labelled 'long' (2 hours or more) or 'short' (less than 2 hours)
# df4 is looked up by filterm3's index, assuming df2 and df4 share the same row index
if select_duration == 'long':
  filterm3 = filterm3[df4.loc[filterm3.index, 'duration'] == 'long']
elif select_duration == 'short':
  filterm3 = filterm3[df4.loc[filterm3.index, 'duration'] == 'short']
else:
  print('Invalid input: enter "short" or "long"')

filterm3

if filterm3.empty:
  print('No movie matches your input, please make another input!')
else:
  watch_movie2=filterm3.sample()
  watch_movie2.head()

In [None]:
watch_movie_title = watch_movie2['title']

# Convert the output to a string
title_string = watch_movie_title.values.tolist()[0]

# Print the resulting string
print(title_string)

In [None]:
df4

In [None]:
movie_user_likes = title_string
def get_index_from_title(title):
    return df4[df4.title == title]['index'].values[0]
movie_index = get_index_from_title(movie_user_likes)


In [None]:
similar_movies = list(enumerate(cosine_sim[movie_index]))

In [None]:
sorted_similar_movies = sorted(similar_movies, key=lambda x:x[1], reverse=True)

In [None]:
def get_title_from_index(index):
    return df4[df4.index == index]["title"].values[0]
titles = []
i=0
for movie in sorted_similar_movies:
    print(get_title_from_index(movie[0]))
    title = get_title_from_index(movie[0])
    titles.append(title)                           
    
    i=i+1
    if i>5:
        break


In [None]:
print(titles)

In [None]:
def get_title_from_index(index):
    return df4[df4.index == index]["genre"].values[0]

genres = []
i=0
for movie in sorted_similar_movies:
    print(get_title_from_index(movie[0]))
    genre = get_title_from_index(movie[0])
    genres.append(genre) 
    
    i=i+1
    if i>5:
        break

In [None]:
print(genres)

In [None]:
def get_title_from_index(index):
    duration = df2[df2.index == index]["duration"].values[0]

    #convert the duration time to hrs and mins
    hours = duration // 60
    minutes = duration % 60
    return f"{hours} hours {minutes} minutes"
runtimes = []

i=0
for movie in sorted_similar_movies:
    print(get_title_from_index(movie[0]))
    runtime = (get_title_from_index(movie[0]))
    runtimes.append(runtime) 
    i=i+1
    if i>5:
        break

In [None]:
print(runtimes)

In [None]:
def get_title_from_index(index):
    return df4[df4.index == index]["duration"].values[0]

lengths=[]
i=0
for movie in sorted_similar_movies:
    print(get_title_from_index(movie[0]))
    length=get_title_from_index(movie[0])
    lengths.append(length)
    
    i=i+1
    if i>5:
        break

In [None]:
print(lengths)

In [None]:
def get_title_from_index(index):
    return df2[df2.index == index]["year"].values[0]
years = []

i=0
for movie in sorted_similar_movies:
    print(get_title_from_index(movie[0]))
    year=get_title_from_index(movie[0])
    years.append(year)
    
    i=i+1
    if i>5:
        break

In [None]:
print(years)

In [None]:
def get_title_from_index(index):
    return df4[df4.index == index]["year"].values[0]
eras = []

i=0
for movie in sorted_similar_movies:
    print(get_title_from_index(movie[0]))
    era=get_title_from_index(movie[0])
    eras.append(era)
    
    i=i+1
    if i>5:
        break

In [None]:
print(eras)

In [None]:
def get_title_from_index(index):
    return df2[df2.index == index]["url"].values[0]
urls = []

i=0
for movie in sorted_similar_movies:
    print(get_title_from_index(movie[0]))
    url=get_title_from_index(movie[0])
    urls.append(url)
    
    i=i+1
    if i>5:
        break

In [None]:
import pandas as pd

# Create a dictionary with the lists of data
movie_data = {'titles': titles, 'genres': genres, 'runtimes': runtimes, 'lengths': lengths, 'years': years, 'eras': eras, 'url': urls}

# Create the dataframe
df5 = pd.DataFrame(movie_data)

# Print the dataframe
print(df5)

In [None]:
df5

# DISPLAY THE RESULTS

In [None]:
!pip install streamlit

In [None]:
%%writefile app.py
import streamlit as st
st.title('My App')
st.write('Hello, World!')

In [None]:
!streamlit run app.py

In [None]:
import streamlit as st

st.title('My App')

# Display the recommendations dataframe in the Streamlit app
st.dataframe(df5)