### Melanie Gradeler
### Delanie Dahm 
### BAIS:3025 Wrangling Scraping Project Code
### Michael Colbert
## <center><ins>Scraping Top 250 Movies and Details</ins></center>
### <p style="text-align: right">_November 12th, 2024_</p>
#### This file is the first to run when starting this project from the beginning
**Purpose:** This project scrapes movie data from IMDb's Top 250 movies list, including titles, ratings, and runtimes, plus additional information from each movie's individual page. The scraped data will be cleaned and analyzed in subsequent steps, using two other Jupyter notebooks. By the end of this notebook, the following raw data files will have been written:

1. `top_250_raw.csv`
2. `top_250_details_raw.csv`

### Initial Scraping Steps
This section outlines the steps to scrape data from IMDb's Top 250 movies page. The focus is on extracting key details such as titles, release years, user ratings, runtimes, and URLs for further exploration.

#### Importing Libraries
The code below imports the libraries required for web scraping and data handling.

**Purpose:** This block sets up the environment for scraping by loading the necessary tools and libraries.



In [1]:
import time #For adding delays during scraping to avoid overwhelming the server.
import pandas as pd #For organizing and analyzing the scraped data.
from selenium import webdriver #For automating browser actions and navigating the IMDb website.
from selenium.webdriver.common.by import By # Locator strategies (class name, XPath, etc.) for finding elements in the HTML.
from selenium.webdriver.chrome.service import Service # Wraps the driver executable, so no manually downloaded driver file is needed.
from webdriver_manager.chrome import ChromeDriverManager # Manages the Chrome driver used to emulate a Chrome web browser.

### Setting Up WebDriver and Accessing IMDb
**Purpose:** This block prepares the browser for scraping and specifies the target URL.


In [2]:
# Initialize Selenium WebDriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# List to store scraped data
Top_250_movies_scraped_raw = []
# Direct the browser to the IMDb Top 250 Movies page and maximize the window for better interaction.
url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250&sort=user_rating%2Cdesc"
driver.get(url)
driver.maximize_window()

### Scraping Movie Details from IMDb
This section focuses on scraping the following details from the IMDb Top 250 Movies page:
1. **Rank and Title**: Extracted from the main page.
2. **Year of Release**: Retrieved alongside the title.
3. **Rating**: User ratings for each movie.
4. **Runtime**: Duration of each movie.
5. **URL**: Link to the movie's detailed page for additional information.

The extracted URLs are essential for scraping more detailed information in subsequent steps, which involve iterating through individual movie pages.

This block also converts the raw scraped data into a structured pandas DataFrame with named columns for better readability and saves it as a CSV file (`top_250_raw.csv`).
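
The list above mentions both a rank and a title, while the scraping code stores a single title string per movie. The following is a minimal sketch of how the two could be separated afterwards, assuming the scraped text has the form `1. The Shawshank Redemption` (this format is an assumption, not something confirmed by the notebook output). The helper name `split_rank_title` is hypothetical and is not part of the original project.

```python
import re

# Assumption: the scraped title text looks like "1. The Shawshank Redemption".
# The dot after the digits is required, so titles that merely start with a
# number (e.g. "12 Angry Men") are left untouched.
_RANK_TITLE = re.compile(r"^\s*(\d+)\.\s+(.*\S)\s*$")


def split_rank_title(raw_title):
    """Return (rank, title); rank is None when no leading 'N. ' prefix exists."""
    match = _RANK_TITLE.match(raw_title)
    if match is None:
        return None, raw_title.strip()
    return int(match.group(1)), match.group(2)


print(split_rank_title("1. The Shawshank Redemption"))  # (1, 'The Shawshank Redemption')
print(split_rank_title("12 Angry Men"))                 # (None, '12 Angry Men')
```

Returning `None` for the rank, rather than guessing, keeps rows with an unexpected format easy to spot during cleaning.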



In [3]:
# Empty list to hold one record per movie
top_250_scraped = []
# Each summary item is the container element holding one movie's title, metadata, rating, and link
movies_scraped = driver.find_elements(By.CLASS_NAME, 'ipc-metadata-list-summary-item')
# Loop through each movie in the list to find its year, runtime, rating, and URL
for movie in movies_scraped:
    
    # Scrape the movie title
    movie_title = movie.find_element(By.CLASS_NAME, 'ipc-title__text').text.strip()
    
    # Scrape the year; fall back to "N/A" if it is missing
    try:
        year_element = movie.find_element(By.XPATH, ".//span[contains(@class, 'cli-title-metadata-item')][1]")
        movie_year = year_element.text.strip()
    except Exception:
        movie_year = "N/A"  # Handle cases where the year is not found

    # Scrape the runtime; fall back to "N/A" if it is missing
    try:
        metadata_elements = movie.find_elements(By.XPATH, ".//span[contains(@class, 'cli-title-metadata-item')]")
        if len(metadata_elements) > 1:  # Assuming runtime is the second span
            movie_runtime = metadata_elements[1].text.strip()
        else:
            movie_runtime = "N/A"
    except Exception:
        movie_runtime = "N/A"  # Handle cases where the runtime is not found

    # Scrape the 2024 rating; fall back to "N/A" if it is missing
    try:
        rating_element = movie.find_element(By.XPATH, ".//span[contains(@class, 'ipc-rating-star--rating')]")
        movie_rating = rating_element.text.strip()
    except Exception:
        movie_rating = "N/A"  # Handle cases where the rating is not found

    # Scrape the URL; fall back to "N/A" if it is missing
    try:
        link_element = movie.find_element(By.XPATH, ".//a[@class='ipc-title-link-wrapper']")
        movie_url = link_element.get_attribute('href')
    except Exception:
        movie_url = "N/A"  # Handle cases where the link is not found

    #print(f"Title: {movie_title}, Year: {movie_year}, Runtime: {movie_runtime}, Rating: {movie_rating}, URL: {movie_url}")

    # Append this movie's fields as one record; the dictionary keys become the column names
    top_250_scraped.append({
            "title": movie_title,
            "year": movie_year,
            "run_time": movie_runtime,
            "IMBD_rating_2024": movie_rating,
            "url": movie_url
        })

#display(top_250_scraped)

# Convert the list of records to a pandas DataFrame
print("Building the dataframe")
top_250_df = pd.DataFrame(top_250_scraped)

# Write the data to a CSV file
print("Saving the CSV")
top_250_df.to_csv("top_250_raw.csv", header=True, index=False, sep=",", encoding='utf-8')

Building the dataframe
Saving the CSV


### Scraping Additional Details from Each Individual Movie Page
This block:
1. Iterates through the URLs stored in the `url` column of the `top_250_df` DataFrame.
2. Visits each movie's page to scrape additional details such as:
   - **Popularity Score, Metascore, and Number of Oscars**

**Purpose:** Complements the initial data with more detailed information from individual movie pages.
This block also converts the raw scraped data into a structured pandas DataFrame with named columns for better readability and saves it as a CSV file (`top_250_details_raw.csv`).
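
The details file is keyed by the same `url` values collected from the chart page, which look like `https://www.imdb.com/title/tt0111161/?ref_=chttp_t_1`. The query string at the end appears to be a referral tag rather than part of the movie's identity, so a more stable key may be useful when the two raw files are combined later. The sketch below shows one way to derive such a key using only the standard library. The helper names `imdb_title_id` and `strip_query` are hypothetical and not part of the original notebook, and the `/title/tt<digits>/` pattern is assumed from the URLs in this project.

```python
import re
from urllib.parse import urlsplit


def imdb_title_id(url):
    """Return the 'tt...' identifier from an IMDb title URL, or None if absent."""
    path = urlsplit(url).path
    match = re.search(r"/title/(tt\d+)", path)
    return match.group(1) if match else None


def strip_query(url):
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"


example = "https://www.imdb.com/title/tt0111161/?ref_=chttp_t_1"
print(imdb_title_id(example))  # tt0111161
print(strip_query(example))    # https://www.imdb.com/title/tt0111161/
print(imdb_title_id("N/A"))    # None
```

Because missing links are stored as the string `"N/A"` rather than as a null value, `imdb_title_id` returning `None` for them gives the cleaning step a simple way to filter those rows out.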


In [4]:
# List to store results
top_250_details = []

# Loop through the URLs in the DataFrame
for index, url in top_250_df['url'].dropna().items():
    try:
        # Visit the URL
        print(f"Visiting URL: {url}")
        driver.get(url)# Navigate to the URL
        time.sleep(5)  # Allow time for the page to load (adjust as needed)

        # Scrape the popularity score
        try:
            popularity_element = driver.find_element(By.XPATH, "//div[@data-testid='hero-rating-bar__popularity__score']")
            popularity_score = popularity_element.text.strip()
        except Exception:
            popularity_score = "N/A"  # Handle cases where the score is not found

        print(f"Popularity Score: {popularity_score}")

        # Scrape the meta score
        try:
            meta_element = driver.find_element(By.XPATH, "//span[contains(@class, 'metacritic-score-box')]")
            meta_score = meta_element.text.strip()
        except Exception:
            meta_score = "N/A"  # Handle cases where the score is not found

        print(f"Meta Score: {meta_score}")

        # Scrape the awards (filter later in cleaning file to include only oscars)
        try:
            award_element = driver.find_element(By.XPATH, "//a[contains(@class, 'ipc-metadata-list-item__label') and contains(@href, '/awards')]")
            award_text = award_element.text.strip()
        except Exception:
            award_text = "N/A"  # Handle cases where the award text is not found

        print(f"Award: {award_text}")

        # Append results to the list
        top_250_details.append({'url': url, 'popularity_score': popularity_score, 'metascore': meta_score, 'oscars': award_text})

    except Exception as e:
        print(f"Error visiting {url}: {e}")

# Close the browser
#driver.quit()

# Convert the list of records to a pandas DataFrame
print("building the dataframe")
top_250_details_df = pd.DataFrame(top_250_details)

# Persist the data in a CSV file
print("saving the CSV")
top_250_details_df.to_csv("top_250_details_raw.csv", header=True, index=False, sep=",", encoding='utf-8')

Visiting URL: https://www.imdb.com/title/tt0111161/?ref_=chttp_t_1
Popularity Score: 62
Meta Score: 82
Award: Nominated for 7 Oscars
Visiting URL: https://www.imdb.com/title/tt0068646/?ref_=chttp_t_2
Popularity Score: 57
Meta Score: 100
Award: Won 3 Oscars
Visiting URL: https://www.imdb.com/title/tt0468569/?ref_=chttp_t_3
Popularity Score: 101
Meta Score: 84
Award: Won 2 Oscars
Visiting URL: https://www.imdb.com/title/tt0071562/?ref_=chttp_t_4
Popularity Score: 185
Meta Score: 90
Award: Won 6 Oscars
Visiting URL: https://www.imdb.com/title/tt0108052/?ref_=chttp_t_5
Popularity Score: 211
Meta Score: 95
Award: Won 7 Oscars
Visiting URL: https://www.imdb.com/title/tt0167260/?ref_=chttp_t_6
Popularity Score: 233
Meta Score: 94
Award: Won 11 Oscars
Visiting URL: https://www.imdb.com/title/tt0050083/?ref_=chttp_t_7
Popularity Score: 254
Meta Score: 97
Award: Nominated for 3 Oscars
Visiting URL: https://www.imdb.com/title/tt0120737/?ref_=chttp_t_8
Popularity Score: 105
Meta Score: 92
Award: W
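
The award strings in the log above, such as `Won 3 Oscars` and `Nominated for 7 Oscars`, are stored as raw text in the `oscars` column and are meant to be filtered in the cleaning notebook. As a rough sketch of what that filtering might look like, the hypothetical helper `parse_oscars` below (not part of the original project) turns such a string into a status and a count. It assumes the label always follows one of the two patterns seen in the log, and it additionally assumes a singular form like `Won 1 Oscar` may occur. Anything else, including `"N/A"`, is treated as unknown.

```python
import re

# Patterns taken from the scraping log: "Won 3 Oscars", "Nominated for 7 Oscars".
# Assumption: a singular "Oscar" can also appear, so the trailing "s" is optional.
_OSCARS = re.compile(r"^(Won|Nominated for)\s+(\d+)\s+Oscars?$")


def parse_oscars(award_text):
    """Return ('won' | 'nominated', count), or None if the text is not an Oscar label."""
    match = _OSCARS.match(award_text.strip())
    if match is None:
        return None
    status = "won" if match.group(1) == "Won" else "nominated"
    return status, int(match.group(2))


print(parse_oscars("Won 3 Oscars"))            # ('won', 3)
print(parse_oscars("Nominated for 7 Oscars"))  # ('nominated', 7)
print(parse_oscars("N/A"))                     # None
```

Returning `None` instead of a zero count keeps "no Oscar label was scraped" distinct from "zero Oscars", since the raw data cannot tell those two cases apart.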