
### Scraping Box Office Data Using BeautifulSoup and Requests

#### Introduction
This notebook extracts box office data from Box Office Mojo's weekend chart. It fetches the page with the `requests` library and parses the HTML with `BeautifulSoup` to locate the box office table. It then iterates over the table rows, capturing details such as rank, release, and gross earnings for each listed movie, and writes the structured records to a CSV file named `box_office_data.csv`. The later cells load that file with pandas and plot the top entries with matplotlib.


In [None]:
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
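url = "https://www.boxofficemojo.com/weekend/chart/"
response = requests.get(url, timeout=30)  # Fail instead of hanging if the site does not respond
response.raise_for_status()  # Stop early on HTTP errors such as 403 or 404
soup = BeautifulSoup(response.content, 'html.parser')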

# Locate the chart table by its CSS class; adjust the class name if the page markup changes
table = soup.find('table', {'class': 'mojo-body-table'})

data = []

if table:
    rows = table.find_all('tr')[1:]  # Assuming the first row is the header

    for row in rows:
        cells = row.find_all('td')
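        if len(cells) >= 9:  # Skip rows that have fewer cells than the fields read below
            # Fields are labeled by position, so confirm the order against the page's header row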
            entry = {
                'rank': cells[0].text.strip(),
                'release': cells[1].text.strip(),
                'gross': cells[2].text.strip(),
                'lw': cells[3].text.strip(),
                'theaters': cells[4].text.strip(),
                'change': cells[5].text.strip(),
                'average': cells[6].text.strip(),
                'total_gross': cells[7].text.strip(),
                'weeks': cells[8].text.strip(),
                # 'distributor': cells[9].text.strip() if len(cells) > 9 else ''
            }
            data.append(entry)

    if data:
        csv_file = "box_office_data.csv"
        with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=data[0].keys())
            writer.writeheader()
            writer.writerows(data)
        print(f"Data saved to {csv_file}")
    else:
        print("No data extracted from the table.")
else:
    print("Table not found in the page.")

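The cell above labels each value by its position in the row, so a change in the site's column order would mislabel fields without raising an error. The following is a minimal sketch of an alternative that keys each record on the header text instead. The helper name `rows_to_records` and the sample rows are hypothetical, not taken from the real chart; with BeautifulSoup, the two input lists could be built from the table's `th` and `td` cells using `get_text(strip=True)`.

```python
def rows_to_records(header, rows):
    """Pair each row's cell texts with the header labels by position."""
    # Normalize labels such as 'Total Gross' into keys such as 'total_gross'
    keys = [h.strip().lower().replace(' ', '_') for h in header]
    records = []
    for cells in rows:
        if len(cells) < len(keys):
            continue  # Skip short rows such as spacers
        # zip stops at the shorter input, so extra trailing cells are ignored
        records.append({k: c.strip() for k, c in zip(keys, cells)})
    return records


# Made-up sample input for illustration, not real chart data
header = ['Rank', 'Release', 'Total Gross']
rows = [
    ['1', ' Example Movie A ', '$1,234,567'],
    ['2', 'Example Movie B', '$890,123'],
    [''],
]
records = rows_to_records(header, rows)
print(records[0])
```

Because the keys come from the header row, a reordered table yields correctly labeled records as long as the header and body cells stay aligned.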

In [None]:
df = pd.read_csv('box_office_data.csv')  # Same relative path the scraping cell wrote to
df.head()
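
Because the scraper stores every cell as text, values such as dollar amounts come back from the CSV as strings rather than numbers. The sketch below illustrates that round trip with the standard `csv` module, using an in-memory buffer in place of a file. The sample row is made up for illustration.

```python
import csv
import io

rows = [{'rank': '1', 'gross': '$1,234,567'}]  # Made-up sample row

# Write the rows the same way the scraping cell does, but into memory
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

# Read the data back; every field is returned as a string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(loaded[0]['gross'], type(loaded[0]['gross']).__name__)
```

The commas inside the dollar amount survive because the writer quotes the field, but the value still needs cleaning before it can be used as a number.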

In [None]:
# Vertical bar chart of the first five rows
data = df.head(5)
plt.bar(data['gross'], data['weeks'])

In [None]:
# Same five rows as a horizontal bar chart
data = df.head(5)
plt.barh(data['gross'], data['weeks'])

In [None]:
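# Remove $ and commas, then convert to float
data = df.head(5).copy()  # Work on a copy so the assignment does not trigger SettingWithCopyWarning
data['weeks'] = data['weeks'].astype(str).str.replace(r'[\$,]', '', regex=True).astype(float)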
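
The conversion above fails if any cell holds a placeholder instead of a number. A minimal sketch of a more forgiving standalone helper follows, assuming that unparseable values should become `None`. The name `parse_money` is hypothetical and not part of the notebook.

```python
import re


def parse_money(text):
    """Convert a string such as '$1,234,567' to a float, or None if float() cannot parse it."""
    cleaned = re.sub(r'[$,]', '', str(text)).strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_money('$1,234,567'))
print(parse_money('-'))
```

With pandas, such a helper could be applied to a column with `Series.map`, for example `data['weeks'].map(parse_money)`.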

In [None]:
# Redraw the horizontal bar chart now that 'weeks' is numeric
plt.barh(data['gross'], data['weeks'])

In [None]:
# Sort by the numeric 'weeks' column; pass ascending=False to reverse the order
sorted_data = data.sort_values(by='weeks', ascending=True)

# Plot the sorted rows as a horizontal bar chart
plt.barh(sorted_data['gross'], sorted_data['weeks'])
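
Sort order matters for a horizontal bar chart because `plt.barh` places the first item at the bottom of the axis, so an ascending sort puts the largest bar at the top. The sketch below shows the same idea in plain Python. The helper name `top_n_for_barh`, the keys, and the sample records are all hypothetical.

```python
def top_n_for_barh(records, label_key, value_key, n=5):
    """Return (labels, values) for the n largest values, ordered smallest first."""
    top = sorted(records, key=lambda r: r[value_key], reverse=True)[:n]
    top.reverse()  # barh draws the first item at the bottom, so the largest goes last
    labels = [r[label_key] for r in top]
    values = [r[value_key] for r in top]
    return labels, values


# Made-up sample records for illustration
records = [
    {'release': 'Example A', 'total': 300.0},
    {'release': 'Example B', 'total': 100.0},
    {'release': 'Example C', 'total': 500.0},
    {'release': 'Example D', 'total': 200.0},
]
labels, values = top_n_for_barh(records, 'release', 'total', n=3)
print(labels, values)
```

The two lists can be passed directly to `plt.barh(labels, values)`.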

