## Web Scraping with BeautifulSoup, Requests, and Pandas

In [48]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

### Web Scraping: List of Film Production Companies from Wikipedia


In [49]:
url = 'https://en.wikipedia.org/wiki/List_of_film_production_companies'
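page = requests.get(url)
# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page.text, 'html.parser')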
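
A request can fail or return an error page, so it may be worth checking the response before parsing it. The sketch below shows one possible way to do that. The helper name `fetch_soup`, the 10-second timeout, and the `User-Agent` value are assumptions made for illustration and are not part of the original notebook. The parsing step is demonstrated on a small inline HTML string so the example runs without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper, not from the notebook)."""
    # Illustrative User-Agent value; replace it with one that identifies your script
    headers = {"User-Agent": "example-scraper/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Offline demonstration of the parsing step on a tiny HTML snippet
sample_html = "<html><body><table><tr><th>Company</th></tr></table></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")
first_table = sample_soup.find("table")
print(first_table.th.text)
```

With a helper like this, `soup = fetch_soup(url)` would stand in for the two lines that request and parse the page.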

### Scraping and Extracting Table Headers from HTML using BeautifulSoup

In [50]:
table = soup.find_all('table')[0]

# scraping the table headers
titles = table.find_all('th')

# making a list of the table headers by removing the <th> tags and '\n'
titles_table = [title.text.strip() for title in titles]
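
Note that `table.find_all('th')` returns every `<th>` in the table, not only those in the header row. If a table also uses `<th>` for row headers, the list of titles picks up extra entries. The sketch below uses a made-up two-row table (not the Wikipedia table) to show the difference and one possible way to restrict the search to the first row.

```python
from bs4 import BeautifulSoup

# Made-up table: the second row uses a <th> as a row header
sample_html = """
<table>
  <tr><th>Company</th><th>Country</th></tr>
  <tr><th>Acme Films</th><td>Argentina</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

# Every <th> in the table, including the row header in the second row
all_th = [th.text.strip() for th in sample_table.find_all("th")]

# Only the <th> cells of the first row
header_row = sample_table.find("tr")
header_only = [th.text.strip() for th in header_row.find_all("th")]

print(all_th)       # ['Company', 'Country', 'Acme Films']
print(header_only)  # ['Company', 'Country']
```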

### Creating a DataFrame with Specified Column Titles in Pandas

In [51]:
df = pd.DataFrame(columns=titles_table)


### Python Code to Extract and Normalize Table Data from HTML Using BeautifulSoup

The code iterates over the table rows after the header row, strips whitespace from each cell, and appends the result to the DataFrame as a new row. If a row has fewer cells than the DataFrame has columns, the code prints the row's index, the expected column count, and the actual cell count, then pads the row with `None` so the assignment does not fail.


In [52]:
rows = table.find_all('tr')

for row in rows[1:]:
    row_data = row.find_all('td')
    final_row_data = [data.text.strip() for data in row_data]

    # index the new row will take in the DataFrame
    length = len(df)

    # report rows that have fewer cells than the table has columns
    if len(final_row_data) < len(df.columns):
        print(length, len(df.columns), len(final_row_data))

    # pad short rows with None so the row matches the column count
    while len(final_row_data) < len(df.columns):
        final_row_data.append(None)  # or any default value

    df.loc[length] = final_row_data

254 5 4
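
Appending with `df.loc[...]` one row at a time works, but it gets slower as the DataFrame grows, and it still fails if a row has more cells than there are columns. As a hedged alternative, the rows can be collected in a list, normalized to a fixed width, and passed to `pd.DataFrame` in one step. The helper `normalize_row` and the sample rows below are hypothetical and serve only to illustrate the idea.

```python
import pandas as pd


def normalize_row(cells, width, fill=None):
    """Pad or truncate a list of cell values to exactly `width` items (hypothetical helper)."""
    padded = list(cells) + [fill] * (width - len(cells))
    return padded[:width]


columns = ["Company", "Country", "Headquarters", "Est.", "Notes"]

# Made-up rows: one complete, one short, one too long
raw_rows = [
    ["Studio A", "Argentina", "Buenos Aires", "1990", ""],
    ["Studio B", "Argentina", "Buenos Aires", "1933"],
    ["Studio C", "Argentina", "Buenos Aires", "1995", "", "extra"],
]

# Normalize every row, then build the DataFrame in a single call
records = [normalize_row(row, len(columns)) for row in raw_rows]
sample_df = pd.DataFrame(records, columns=columns)
print(sample_df)
```

Truncating long rows silently discards data, so logging such rows, as the loop above does for short ones, may be preferable in practice.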


### DataFrame

The resulting DataFrame lists film production companies. Its five columns are:

**Company**: The name of the film production company. <br>
**Country**: The country where the company is based. <br>
**Headquarters**: The specific city or region where the company is headquartered. <br>
**Est.**: The year the company was founded. <br>
**Notes**: Any additional relevant information about the company (e.g., a focus on specific genres like Christian films).


In [53]:
df.head()

| | Company | Country | Headquarters | Est. | Notes |
|---|---|---|---|---|---|
| 0 | Aleph Producciones | Argentina | Buenos Aires | 1990 | |
| 1 | Argentina Sono Film | Argentina | Buenos Aires | 1933 | |
| 2 | BD Cine | Argentina | Buenos Aires | 1995 | |
| 3 | Guacamole Films | Argentina | Buenos Aires | 2002 | |
| 4 | Patagonik Film Group | Argentina | Buenos Aires | 1996 | |
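
Every scraped cell is a string, so the `Est.` column holds text rather than numbers. If numeric years are needed, a conversion along the following lines may help. The sample values are made up for illustration, and `errors="coerce"` is one possible policy: it turns unparseable entries into missing values instead of raising an error.

```python
import pandas as pd

# Made-up sample with one clean year, one annotated year, and one missing value
sample_df = pd.DataFrame(
    {
        "Company": ["Studio A", "Studio B", "Studio C"],
        "Est.": ["1990", "c. 1933", None],
    }
)

# errors="coerce" converts values that cannot be parsed into NaN
est_numeric = pd.to_numeric(sample_df["Est."], errors="coerce")

# The nullable integer dtype keeps missing years as <NA> instead of forcing floats
sample_df["Est."] = est_numeric.astype("Int64")
print(sample_df)
```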


### Saving the data to a CSV file

In [54]:
df.to_csv('Productions.csv', index=False)
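
The `index=False` argument matters when the file is read back. Without it, pandas writes the row index as an unnamed first column, which `pd.read_csv` then labels `Unnamed: 0`. The sketch below illustrates this with a made-up two-row DataFrame and an in-memory buffer in place of a file on disk.

```python
import io

import pandas as pd

# Made-up sample data
sample_df = pd.DataFrame(
    {"Company": ["Studio A", "Studio B"], "Notes": ["Christian films", None]}
)

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(sample_df.to_csv(index=False)))

# default index=True: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(sample_df.to_csv()))

print(list(without_index.columns))  # ['Company', 'Notes']
print(list(with_index.columns))     # ['Unnamed: 0', 'Company', 'Notes']
```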