# Project Summary
Our project is to predict the genre of a movie based on its synopsis, the plot keywords IMDb lists for it, and the top 16 colors in its poster. The correlations we found among these features are in Data Exploration in this folder.

### Relevance of Project
This is an important field to research and develop solutions in. Even in 2019, major companies, streaming platforms, and websites have people manually select the genres for each movie or take input from customers about which genre a movie belongs to. A lot of time and money would be saved if the genre-to-movie assignment process were automated. This is the problem we explore in this project.

# Webscraping Code Summary
This Jupyter notebook contains all the code we used to scrape our data. We scraped 11 IMDb lists, one per genre, each containing the top 100 movies for that genre. The resulting dataframe has 1100 observations covering the following genres: Horror, Western, War, Scifi, Action, Adventure, Crime, Romance, Comedy, Thriller, and Drama. For each movie we collected the plot keywords, movie synopsis, movie name, IMDb ID, poster image URL, movie storyline, genre variations, and genre. After collecting all the IMDb IDs, we looped through them to retrieve the plot keywords and storyline from each movie's page.
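The ID step described above depends on the shape of IMDb title links. As a minimal sketch, assuming links of the form `/title/<id>/...` (which is what the `split("/")[2]` call in the scraping loop relies on), the extraction can be isolated and checked without any network access. The helper names `extract_imdb_id` and `title_url`, and the sample query string, are hypothetical and not part of the original notebook.

```python
import re

# Assumed link shape: "/title/tt0195714/..." (an ID made of "tt" plus digits).
IMDB_ID_PATTERN = re.compile(r"/title/(tt\d+)")


def extract_imdb_id(href):
    """Return the IMDb ID found in a title link, or None if there is none."""
    if not href:
        return None
    match = IMDB_ID_PATTERN.search(href)
    return match.group(1) if match else None


def title_url(imdb_id, base="https://www.imdb.com/title/"):
    """Build the movie page URL the same way the notebook does (base + id)."""
    return base + imdb_id


# The query string below is made up for illustration.
print(extract_imdb_id("/title/tt0195714/?ref_=example"))
print(extract_imdb_id("/list/ls066355376/"))  # not a title link
print(title_url("tt0195714"))
```

Returning `None` for links that do not match makes it possible to skip non-title headers explicitly instead of relying on an exception.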

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import urllib
from PIL import Image
from bs4 import BeautifulSoup
import requests

In [54]:
horror_link = "https://www.imdb.com/list/ls066355376/"
romance_link = "https://www.imdb.com/list/ls009668031/"
comedy_link = "https://www.imdb.com/list/ls009668747/"
thriller_link = "https://www.imdb.com/list/ls051523000/"
drama_link = "https://www.imdb.com/list/ls072723351/"
crime_link = "https://www.imdb.com/list/ls009668704/"
scifi_link = "https://www.imdb.com/list/ls056870923/"
war_link = "https://www.imdb.com/list/ls009678583/"
western_link = "https://www.imdb.com/list/ls002124326/"
action_link = "https://www.imdb.com/list/ls009668579/"
adventure_link = "https://www.imdb.com/list/ls009609925/"

In [68]:
import time
gen_url = "https://www.imdb.com/title/"
urls=[horror_link, western_link, war_link, scifi_link, action_link, 
    adventure_link, crime_link, romance_link, comedy_link,
    thriller_link,drama_link]

genres=["Horror","Western","War","Scifi","Action","Adventure",
       "Crime","Romance","Comedy","Thriller","Drama"]
movies_df = pd.DataFrame()
for url,genre in zip(urls,genres):
    print(genre)
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content, "html.parser")
    images_soup = soup.find_all("img",{"class":"loadlate"})
    # The Horror, Thriller, and Drama list pages need one more leading
    # paragraph skipped than the other lists.
    if genre in ("Horror", "Thriller", "Drama"):
        texts = soup.find_all("p",{"class":""})[3:-1]
    else:
        texts = soup.find_all("p",{"class":""})[2:-1]
    headers = soup.find_all("h3")
    imdb_id = []
    images = []
    synopsis = []
    movie_names = []
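    for image in images_soup:
        # Keep only images with a single class; their alt text is the movie name.
        if len(image.get("class")) < 2:
            movie_names.append(image.get("alt"))
            images.append(image.get("loadlate"))
    for text in texts:
        # Skip paragraphs that start with a digit; the rest are kept as synopses.
        if not text.text[0].isdigit():
            txt = text.text.replace("\n","")
            synopsis.append(txt)
    for h in headers:
        refs = h.find_all("a")
        try:
            # The IMDb id is the third "/"-separated piece of the link.
            imdb_id.append(refs[0].get("href").split("/")[2])
        except (IndexError, AttributeError):
            # Stop at the first header that does not carry a usable link.
            break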
    genre_df = pd.DataFrame()
    genre_df["Movie Name"] = movie_names
    genre_df["Poster Image Link"] = images
    genre_df["Synopsis"] = synopsis
    genre_df["Genre"] = genre
    genre_df["IMDb_id"] = imdb_id
    plot_keywords = []
    storyline = []
    for movie_id in genre_df["IMDb_id"]:
        resp = requests.get(gen_url+movie_id)
        soup = BeautifulSoup(resp.content, "html.parser")
        keywords = soup.find_all("span",{"class":"itemprop"})
        words=""
        for keyword in keywords:
            words+=keyword.text+" "
        storyline_text = soup.find_all("div",{"class":
                "inline canwrap"})[0].find_all("span")[0].text
        storyline.append(storyline_text)
        plot_keywords.append(words)
        # Pause between requests so we do not flood the server.
        time.sleep(0.5)
    genre_df["Plot Keywords"] = plot_keywords
    genre_df["Storyline"] = storyline
    movies_df = pd.concat([movies_df,genre_df],ignore_index=True)

Horror
Western
War
Scifi
Action
Adventure
Crime
Romance
Comedy
Thriller
Drama


In [105]:
movies_df.to_csv("movies_df.csv",index=False)
movies_df.head()

Unnamed: 0,Movie Name,Poster Image Link,Synopsis,Genre,IMDb_id,Plot Keywords,Storyline
0,Final Destination,https://m.media-amazon.com/images/M/MV5BZTI0NG...,After a teenager has a terrifying vision o...,Horror,tt0195714,teen horror death premonition scantily clad fe...,Alex is boarding a plane to France on a sc...
1,The Evil Dead,https://m.media-amazon.com/images/M/MV5BODc2Mm...,Five friends travel to a cabin in the wood...,Horror,tt0083907,cult film evil dead necronomicon supernatural ...,Five college students take time off to spe...
2,Resident Evil,https://m.media-amazon.com/images/M/MV5BN2Y2MT...,"A special military unit fights a powerful,...",Horror,tt0120804,doberman cut into pieces zombie commando sex i...,A virus has escaped in a secret facility c...
3,I Know What You Did Last Summer,https://m.media-amazon.com/images/M/MV5BZDI4OD...,Four young friends bound by a tragic accid...,Horror,tt0119345,taking a shower teenage girl overalls bikini f...,"After an accident on a winding road, four ..."
4,Army of Darkness,https://m.media-amazon.com/images/M/MV5BODcyYz...,A man is accidentally transported to 1300 ...,Horror,tt0106308,necronomicon evil dead skeleton soldier 1990s ...,"Ash is transported with his car to 1,300 A..."


Because each movie can have multiple genres, we then looped through the IMDb IDs again to collect all the genres associated with each movie.

In [None]:
genres_list = []
for movie_id in movies_df["IMDb_id"]:
    resp = requests.get(gen_url+movie_id)
    soup = BeautifulSoup(resp.content, "html.parser")
    genre_links = soup.find_all("div",{"class":"subtext"})[0].find_all("a")
    genres_str = ""
    # Every link except the last one is read as a genre name.
    for i in range(0,len(genre_links)-1):
        genres_str+=genre_links[i].text + " "
    genres_list.append(genres_str)
    time.sleep(0.5)
genres_list

In [108]:
movies_df["Genre Variations"] = genres_list
movies_df.to_csv("movies_df.csv",index=False)

# Shuffling
We had to shuffle our dataset because we scraped the movies genre by genre: the first 100 rows were Horror, the next 100 were Western, and so on. This became a problem later for cross validation, because cross validation in sklearn, by default, does not split the rows at random; it takes each fold as a consecutive section of the dataset. With the rows ordered by genre, a test fold could therefore be dominated by genres that were missing from the training rows. After shuffling our dataset, the accuracy and F1 scores of our model improved drastically.
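The effect of row order on consecutive folds can be illustrated with a small standalone sketch. It is a simplified stand-in for unshuffled k-fold splitting written in plain Python, not sklearn's implementation, and the choice of 5 folds and the helper names `consecutive_folds` and `genres_missing_from_training` are assumptions made for illustration. The labels mimic our layout of 11 genres with 100 rows each.

```python
import random

GENRES = ["Horror", "Western", "War", "Scifi", "Action", "Adventure",
          "Crime", "Romance", "Comedy", "Thriller", "Drama"]


def consecutive_folds(n_rows, n_folds):
    """Yield (train_idx, test_idx) pairs, cutting rows into consecutive blocks."""
    fold_size = n_rows // n_folds
    for k in range(n_folds):
        start = k * fold_size
        # The last fold absorbs any leftover rows.
        stop = n_rows if k == n_folds - 1 else start + fold_size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_rows))
        yield train, test


def genres_missing_from_training(labels, n_folds=5):
    """For each fold, list the test genres that never appear in training."""
    missing = []
    for train, test in consecutive_folds(len(labels), n_folds):
        train_genres = {labels[i] for i in train}
        test_genres = {labels[i] for i in test}
        missing.append(sorted(test_genres - train_genres))
    return missing


# Rows in scraping order: 100 Horror, then 100 Western, and so on.
ordered = [genre for genre in GENRES for _ in range(100)]

# The same rows in a fixed random order (seeded for repeatability).
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)

print(genres_missing_from_training(ordered))
print(genres_missing_from_training(shuffled))
```

With the ordered labels, every fold contains at least one genre the model would never see during training; with the shuffled labels, every fold's missing list is empty.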

In [196]:
shuffle_movie_df = movies_df.sample(frac=1).reset_index(drop=True)
shuffle_movie_df.head()

Unnamed: 0,Movie Name,Poster Image Link,Synopsis,Genre,IMDb_id,Plot Keywords,Storyline,Genre Variations
0,Big Fish,https://m.media-amazon.com/images/M/MV5BMTYyMz...,A frustrated son tries to determine the fa...,Adventure,tt0319061,fish father son relationship death of father f...,United Press International journalist Will...,Adventure Drama Fantasy
1,Scott Pilgrim vs. the World,https://m.media-amazon.com/images/M/MV5BMTkwNT...,Scott Pilgrim must defeat his new girlfrie...,Action,tt0446029,toronto canada sexy woman cleavage panties sca...,Scott Pilgrim plays in a band which aspire...,Action Comedy Fantasy
2,The Paleface,https://m.media-amazon.com/images/M/MV5BOGZhMG...,Calamity Jane is despatched to find out wh...,Western,tt0040679,comedy of errors misunderstanding punched in t...,Someone is selling guns to the Indians and...,Comedy Family Western
3,How the West Was Won,https://m.media-amazon.com/images/M/MV5BNTk2ND...,A family saga covering several decades of ...,Western,tt0056085,ulysses s. grant character abraham lincoln cha...,Setting off on a journey to the west in th...,Western
4,The Butterfly Effect,https://m.media-amazon.com/images/M/MV5BODNiZm...,Evan Treborn suffers blackouts during sign...,Sci-Fi,tt0289879,love butterfly effect child pornography time t...,Evan Treborn grows up in a small town with...,Drama Sci-Fi Thriller


In [205]:
shuffle_movie_df.to_csv("shuffled_movie_df.csv",index=False)

# Variable Descriptions
- Movie Name: The name of the movie (String)
- Poster Image Link: The URL of the movie poster image (String)
- Synopsis: The movie synopsis (String)
- Genre: The main genre the movie is associated with, i.e. the genre list it was scraped from (String)
- IMDb_id: The IMDb ID of the movie (String)
- Plot Keywords: The plot keywords IMDb lists for the movie, separated by spaces (String)
- Storyline: The general storyline of the movie (String)
- Genre Variations: All genres the movie is associated with, separated by spaces (String)
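
Because Genre Variations packs several genres into one space-separated string (for example `Adventure Drama Fantasy`), it usually has to be split before it can be used as a label set. The following is a minimal sketch of one way to do that, using sample values taken from the shuffled output above; the helpers `split_genres` and `multi_hot` are hypothetical and are not part of our notebooks.

```python
def split_genres(genre_variations):
    """Split a space-separated genre string into a list of genre names."""
    # split() with no argument also drops the trailing space that the
    # scraping loop leaves at the end of each string.
    return genre_variations.split()


def multi_hot(rows):
    """Return (vocabulary, matrix); matrix[i][j] is 1 if row i has genre j."""
    vocabulary = sorted({g for row in rows for g in split_genres(row)})
    column = {genre: j for j, genre in enumerate(vocabulary)}
    matrix = []
    for row in rows:
        vector = [0] * len(vocabulary)
        for genre in split_genres(row):
            vector[column[genre]] = 1
        matrix.append(vector)
    return vocabulary, matrix


sample = [
    "Adventure Drama Fantasy ",
    "Action Comedy Fantasy ",
    "Comedy Family Western ",
    "Western ",
    "Drama Sci-Fi Thriller ",
]

vocabulary, matrix = multi_hot(sample)
print(vocabulary)
print(matrix[0])  # row for "Adventure Drama Fantasy"
```

Splitting on whitespace works here because hyphenated names such as `Sci-Fi` contain no spaces; a genre name with an internal space would need a different separator.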

### Note:
The poster images themselves are scraped in ML_2, which loops over the Poster Image Link column.