# **Web Scraping**
Web scraping is the process of automatically extracting data from websites. It allows you to collect and organize data from the web for analysis or other purposes.

---

## **Applications of Web Scraping**
1. **Data Collection for Research**: Gathering data for academic or market research.
2. **E-commerce**: Extracting product prices, reviews, or inventory details.
3. **Competitor Analysis**: Tracking competitors' offerings and updates.
4. **News Aggregation**: Collecting headlines or articles from news websites.
5. **Social Media Insights**: Analyzing trends or user-generated content.

---

## **Basic Steps in Web Scraping**
1. **Identify the Website**:
   - Choose a website with the data you want to scrape.
2. **Analyze the Website Structure**:
   - Use browser tools (Inspect Element) to examine the HTML structure of the webpage.
3. **Send an HTTP Request**:
   - Use tools like Python's `requests` library to fetch the webpage's content.
4. **Parse the HTML**:
   - Use libraries like `BeautifulSoup` to extract specific data.
5. **Store the Data**:
   - Save the extracted data in a file (e.g., CSV, JSON) or a database.
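
Steps 3 to 5 above can be sketched end to end as follows. This is a minimal illustration, not a complete scraper: the function names (`fetch_html`, `parse_headings`, `save_rows`) and the sample HTML are assumptions made for the example, and the parsing step runs on an inline string so the code works without network access.

```python
import csv
import io

import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Step 3: send an HTTP request and return the page's HTML."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def parse_headings(html: str) -> list[str]:
    """Step 4: parse the HTML and pull out the text of every <h3> tag."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all("h3")]


def save_rows(rows: list[str], fileobj) -> None:
    """Step 5: store the extracted data as a one-column CSV."""
    writer = csv.writer(fileobj)
    writer.writerow(["heading"])
    writer.writerows([row] for row in rows)


# Hypothetical page content, used here so the example runs offline.
sample_html = "<html><body><h3>First item</h3><h3> Second item </h3></body></html>"
headings = parse_headings(sample_html)

buffer = io.StringIO()  # stands in for an open file
save_rows(headings, buffer)
print(buffer.getvalue())
```

For a real page, `fetch_html(url)` would supply the string passed to `parse_headings`, and `buffer` would be replaced by a file opened with `open(..., "w", newline="")`.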

---

## **Common Tools for Web Scraping**
1. **Programming Languages**: Python is a popular choice for web scraping because of its mature scraping libraries.
2. **Libraries**:
   - `requests`: For sending HTTP requests to websites.
   - `BeautifulSoup`: For parsing and extracting data from HTML.
   - `pandas`: For storing and analyzing scraped data.
   - `Selenium`: For handling JavaScript-heavy websites.

---

## **HTTP Status Codes**
Every HTTP response carries a status code that tells the scraper whether the request succeeded. The codes most often seen while scraping are:

| **Status Code** | **Category**              | **Description**                                                                 |
|------------------|---------------------------|---------------------------------------------------------------------------------|
| **100**          | Informational Response   | Continue: The client can proceed with the request body.                        |
| **101**          | Informational Response   | Switching Protocols: The server is switching protocols as requested.           |
| **200**          | Success                  | OK: The request was successful, and the desired content was returned.          |
| **201**          | Success                  | Created: The request was successful, and a new resource was created.           |
| **204**          | Success                  | No Content: The server successfully processed the request but returned no content. |
| **301**          | Redirection              | Moved Permanently: The resource has been moved to a new URL permanently.        |
| **302**          | Redirection              | Found: The resource is temporarily located at a different URL.                 |
| **304**          | Redirection              | Not Modified: The resource has not been modified since the last request.       |
| **400**          | Client Error             | Bad Request: The server could not understand the request due to invalid syntax.|
| **401**          | Client Error             | Unauthorized: Authentication is required to access the resource.               |
| **403**          | Client Error             | Forbidden: The server understands the request but refuses to authorize it.     |
| **404**          | Client Error             | Not Found: The requested resource was not found on the server.                 |
| **429**          | Client Error             | Too Many Requests: The client sent too many requests in a given time (rate-limiting). |
| **500**          | Server Error             | Internal Server Error: The server encountered an unexpected error.             |
| **502**          | Server Error             | Bad Gateway: The server received an invalid response from an upstream server.  |
| **503**          | Server Error             | Service Unavailable: The server is temporarily unavailable due to maintenance or overload. |
| **504**          | Server Error             | Gateway Timeout: The server did not receive a timely response from an upstream server. |
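
In a scraper, the table above mostly serves as a decision rule: which responses to parse, which to retry later, and which to give up on. The sketch below shows one way to encode that. The helper names and the set of retryable codes are assumptions chosen for illustration, not part of `requests` or any other library.

```python
# Illustrative helpers; the names and the retry policy are assumptions.
RETRYABLE = {429, 500, 502, 503, 504}  # errors that may clear up after a pause

CATEGORIES = {
    1: "Informational Response",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_category(code: int) -> str:
    """Map an HTTP status code to the category used in the table above."""
    return CATEGORIES.get(code // 100, "Unknown")


def should_retry(code: int) -> bool:
    """Return True when waiting and sending the request again might help."""
    return code in RETRYABLE


for code in (200, 403, 429, 503):
    print(code, status_category(code), should_retry(code))
```

A 403 is deliberately not retried here: repeating the same request unchanged is unlikely to be accepted, whereas a 429 or 503 often succeeds after a delay.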


## **Movie Scraping**
This project scrapes movie details (title, release year, runtime, certification, rating, vote count, Metascore and description) from IMDb's Top 250 search page.

In [4]:
# Importing the libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [5]:
url = 'https://www.imdb.com/search/title/?groups=top_250'
url

'https://www.imdb.com/search/title/?groups=top_250'

In [6]:
resp = requests.get(url)
resp.status_code

403

In [7]:
# Mimic a request from a browser
# The session is created without a `with` block because the next cell reuses it
se = requests.Session()
se.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.6 Safari/605.1.15",
    # "br" (Brotli) responses can only be decoded if a Brotli package is installed
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en"
})

In [8]:
resp = se.get(url)
resp.status_code

200

In [19]:
# Parsing the HTML from the response using BeautifulSoup
body = resp.content
soup = BeautifulSoup(body, 'html.parser')

### **Scrape the movie titles**

Title tag: `<h3 class="ipc-title__text">`

In [35]:
movie_title = soup.find_all("h3", class_="ipc-title__text")
movie_title

[<h3 class="ipc-title__text">1. Gladiator</h3>,
 <h3 class="ipc-title__text">2. Interstellar</h3>,
 <h3 class="ipc-title__text">3. Dune: Part Two</h3>,
 <h3 class="ipc-title__text">4. The Wild Robot</h3>,
 <h3 class="ipc-title__text">5. Oppenheimer</h3>,
 <h3 class="ipc-title__text">6. The Godfather</h3>,
 <h3 class="ipc-title__text">7. The Shawshank Redemption</h3>,
 <h3 class="ipc-title__text">8. The Lord of the Rings: The Fellowship of the Ring</h3>,
 <h3 class="ipc-title__text">9. Die Hard</h3>,
 <h3 class="ipc-title__text">10. Come and See</h3>,
 <h3 class="ipc-title__text">11. Inglourious Basterds</h3>,
 <h3 class="ipc-title__text">12. The Sound of Music</h3>,
 <h3 class="ipc-title__text">13. Inception</h3>,
 <h3 class="ipc-title__text">14. It's a Wonderful Life</h3>,
 <h3 class="ipc-title__text">15. Mad Max: Fury Road</h3>,
 <h3 class="ipc-title__text">16. The Dark Knight</h3>,
 <h3 class="ipc-title__text">17. Pulp Fiction</h3>,
 <h3 class="ipc-title__text">18. Se7en</h3>,
 <h3 

In [37]:
# Getting the movie titles without tags
movies = []
for movie in movie_title:
    text = movie.get_text().strip()
    movies.append(text)
movies = movies[:25]
movies

['1. Gladiator',
 '2. Interstellar',
 '3. Dune: Part Two',
 '4. The Wild Robot',
 '5. Oppenheimer',
 '6. The Godfather',
 '7. The Shawshank Redemption',
 '8. The Lord of the Rings: The Fellowship of the Ring',
 '9. Die Hard',
 '10. Come and See',
 '11. Inglourious Basterds',
 '12. The Sound of Music',
 '13. Inception',
 "14. It's a Wonderful Life",
 '15. Mad Max: Fury Road',
 '16. The Dark Knight',
 '17. Pulp Fiction',
 '18. Se7en',
 '19. Harry Potter and the Deathly Hallows: Part 2',
 '20. The Wizard of Oz',
 '21. Top Gun: Maverick',
 '22. Parasite',
 '23. The Lord of the Rings: The Return of the King',
 '24. The Lion King',
 '25. The Wolf of Wall Street']

In [43]:
# Getting the movie title without the number
titles = []
ranks = []

for entry in movies:
    rank, title = entry.split('. ', 1)
    ranks.append(rank)
    titles.append(title)
titles

['Gladiator',
 'Interstellar',
 'Dune: Part Two',
 'The Wild Robot',
 'Oppenheimer',
 'The Godfather',
 'The Shawshank Redemption',
 'The Lord of the Rings: The Fellowship of the Ring',
 'Die Hard',
 'Come and See',
 'Inglourious Basterds',
 'The Sound of Music',
 'Inception',
 "It's a Wonderful Life",
 'Mad Max: Fury Road',
 'The Dark Knight',
 'Pulp Fiction',
 'Se7en',
 'Harry Potter and the Deathly Hallows: Part 2',
 'The Wizard of Oz',
 'Top Gun: Maverick',
 'Parasite',
 'The Lord of the Rings: The Return of the King',
 'The Lion King',
 'The Wolf of Wall Street']
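
Splitting on `'. '` works for the 25 entries above, but it raises an error on any heading that is not numbered, which is presumably why the list was cut to 25 first. As a hedged alternative, a regular expression can skip such headings and return the rank as an integer. The function name `split_rank_title` and the non-movie heading in the sample are made up for this example.

```python
import re

# Matches entries such as "8. The Lord of the Rings: ..." (rank, dot, space, title).
ENTRY_PATTERN = re.compile(r"^(\d+)\.\s+(.+)$")


def split_rank_title(entry: str):
    """Return (rank, title) for a numbered heading, or None if it is not numbered."""
    match = ENTRY_PATTERN.match(entry.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


# The last item is a made-up heading that does not belong to a movie.
sample = ["1. Gladiator", "18. Se7en", "Recently viewed"]
parsed = [pair for pair in map(split_rank_title, sample) if pair is not None]
print(parsed)
```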

### **Scrape the year the movies were released, runtime and certification**

Year, runtime and certification tag: `<span class="sc-300a8231-7 eaXxft dli-title-metadata-item">`

In [52]:
year_runtime_cert = soup.find_all('span', class_="sc-300a8231-7 eaXxft dli-title-metadata-item")
year_runtime_cert

[<span class="sc-300a8231-7 eaXxft dli-title-metadata-item">2000</span>,
 <span class="sc-300a8231-7 eaXxft dli-title-metadata-item">2h 35m</span>,
 <span class="sc-300a8231-7 eaXxft dli-title-metadata-item">R</span>,
 <span class="sc-300a8231-7 eaXxft dli-title-metadata-item">2014</span>,
 <span class="sc-300a8231-7 eaXxft dli-title-metadata-item">2h 49m</span>,
 <span class="sc-300a8231-7 eaXxft dli-title-metadata-item">PG-13</span>,
 <span class="sc-300a8231-7 eaXxft dli-title-metadata-item">2024</span>,
 <span class="sc-300a8231-7 eaXxft dli-title-metadata-item">2h 46m</span>,
 <span class="sc-300a8231-7 eaXxft dli-title-metadata-item">PG-13</span>,
 <span class="sc-300a8231-7 eaXxft dli-title-metadata-item">2024</span>,
 <span class="sc-300a8231-7 eaXxft dli-title-metadata-item">1h 42m</span>,
 <span class="sc-300a8231-7 eaXxft dli-title-metadata-item">PG</span>,
 <span class="sc-300a8231-7 eaXxft dli-title-metadata-item">2023</span>,
 <span class="sc-300a8231-7 eaXxft dli-title-m

In [58]:
# Separating the year, runtime and certification by looping through the spans
released_years = []
runtime = []
certifications = []

for span in year_runtime_cert:
    text = span.get_text()

    # Identification of the year
    if text.isdigit() and len(text)==4:
        released_years.append(text)
    # identification of the movie runtime
    elif 'h' in text or 'm' in text:
        runtime.append(text)
    else:
        certifications.append(text)
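
The loop above recognises runtimes with a substring test, so any certification label containing a lowercase `h` or `m` would be filed as a runtime. A stricter sketch using regular expressions is shown below. The function name `classify_metadata` and the patterns are assumptions based on the values seen in this output, not an IMDb specification.

```python
import re

YEAR_PATTERN = re.compile(r"^\d{4}$")  # e.g. "2000"
RUNTIME_PATTERN = re.compile(r"^(?:\d+h(?:\s+\d+m)?|\d+m)$")  # e.g. "2h 35m", "3h", "45m"


def classify_metadata(text: str) -> str:
    """Label a metadata string as 'year', 'runtime' or 'certification'."""
    text = text.strip()
    if YEAR_PATTERN.match(text):
        return "year"
    if RUNTIME_PATTERN.match(text):
        return "runtime"
    # Anything else is treated as a certification, as in the loop above.
    return "certification"


for value in ("2000", "2h 35m", "3h", "PG-13", "Not Rated"):
    print(value, "->", classify_metadata(value))
```

This still assumes every movie provides all three values. If one is missing, the three lists will no longer line up with the titles.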

In [60]:
released_years

['2000',
 '2014',
 '2024',
 '2024',
 '2023',
 '1972',
 '1994',
 '2001',
 '1988',
 '1985',
 '2009',
 '1965',
 '2010',
 '1946',
 '2015',
 '2008',
 '1994',
 '1995',
 '2011',
 '1939',
 '2022',
 '2019',
 '2003',
 '1994',
 '2013']

In [62]:
runtime

['2h 35m',
 '2h 49m',
 '2h 46m',
 '1h 42m',
 '3h',
 '2h 55m',
 '2h 22m',
 '2h 58m',
 '2h 12m',
 '2h 22m',
 '2h 33m',
 '2h 52m',
 '2h 28m',
 '2h 10m',
 '2h',
 '2h 32m',
 '2h 34m',
 '2h 7m',
 '2h 10m',
 '1h 42m',
 '2h 10m',
 '2h 12m',
 '3h 21m',
 '1h 28m',
 '3h']
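
Runtimes stored as text such as `'2h 35m'` cannot be sorted or averaged. One possible follow-up step, sketched here with the hypothetical helper `runtime_to_minutes`, converts them to total minutes.

```python
import re

# Optional hours part followed by an optional minutes part, e.g. "2h 35m", "3h", "45m".
RUNTIME_PATTERN = re.compile(r"^(?:(\d+)h)?\s*(?:(\d+)m)?$")


def runtime_to_minutes(text: str) -> int:
    """Convert a runtime string to total minutes."""
    match = RUNTIME_PATTERN.match(text.strip())
    # Reject strings that match neither the hours part nor the minutes part.
    if match is None or not any(match.groups()):
        raise ValueError(f"Unrecognised runtime: {text!r}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


for value in ("2h 35m", "3h", "2h 7m"):
    print(value, "->", runtime_to_minutes(value))
```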

In [64]:
certifications

['R',
 'PG-13',
 'PG-13',
 'PG',
 'R',
 'R',
 'R',
 'PG-13',
 'R',
 'Not Rated',
 'R',
 'G',
 'PG-13',
 'PG',
 'R',
 'PG-13',
 'R',
 'R',
 'PG-13',
 'G',
 'PG-13',
 'R',
 'PG-13',
 'G',
 'R']

### **Scrape the movie rating**

Rating tag: `<span class="ipc-rating-star--rating">`

# Getting the rating span for the movie
rating_span = soup.find_all('span', class_="ipc-rating-star--rating")
rating_span

In [71]:
# Getting the rating text without the tags
ratings = []

for rate in rating_span:
    score = rate.get_text()
    ratings.append(score)
ratings

['8.5',
 '8.7',
 '8.5',
 '8.2',
 '8.3',
 '9.2',
 '9.3',
 '8.9',
 '8.2',
 '8.3',
 '8.4',
 '8.1',
 '8.8',
 '8.6',
 '8.1',
 '9.0',
 '8.9',
 '8.6',
 '8.1',
 '8.1',
 '8.2',
 '8.5',
 '9.0',
 '8.5',
 '8.2']

### **Scrape the rating vote counts**

Vote count tag: `<span class="ipc-rating-star--voteCount">`

In [77]:
# Getting the vote count
vote_span = soup.find_all('span', class_="ipc-rating-star--voteCount")
vote_span

[<span class="ipc-rating-star--voteCount"> (<!-- -->1.7M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->2.2M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->579K<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->113K<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->838K<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->2.1M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->3M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->2.1M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->979K<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->108K<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->1.6M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->271K<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->2.6M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount

In [81]:
# Getting the vote count without the tag
vote_counts = []

for vote in vote_span:
    for content in vote.contents:
        if 'M' in content or 'K' in content:
            vote_counts.append(content)
vote_counts

['1.7M',
 '2.2M',
 '579K',
 '113K',
 '838K',
 '2.1M',
 '3M',
 '2.1M',
 '979K',
 '108K',
 '1.6M',
 '271K',
 '2.6M',
 '521K',
 '1.1M',
 '3M',
 '2.3M',
 '1.9M',
 '981K',
 '444K',
 '751K',
 '1M',
 '2M',
 '1.2M',
 '1.7M']
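
The vote counts are abbreviated strings, so they need converting before any numeric analysis. A minimal sketch, assuming only the `K` and `M` suffixes seen in the output above (the helper name `parse_vote_count` is illustrative):

```python
SUFFIX_MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_vote_count(text: str) -> int:
    """Convert abbreviated counts such as '1.7M' or '579K' to integers."""
    text = text.strip()
    suffix = text[-1:].upper()
    if suffix in SUFFIX_MULTIPLIERS:
        # round() guards against floating-point noise in the multiplication
        return round(float(text[:-1]) * SUFFIX_MULTIPLIERS[suffix])
    return int(text)  # plain numbers without a suffix


for value in ("1.7M", "579K", "3M"):
    print(value, "->", parse_vote_count(value))
```

The page already rounds these figures, so the converted numbers are approximations of the true vote counts.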

### **Scrape the metascore**

Metascore tag: `<span class="sc-b0901df4-0 bXIOoL metacritic-score-box">`

In [85]:
# Getting the metascore
metascore_span = soup.find_all('span', class_="sc-b0901df4-0 bXIOoL metacritic-score-box")
metascore_span

[<span class="sc-b0901df4-0 bXIOoL metacritic-score-box" style="background-color:#54A72A">67</span>,
 <span class="sc-b0901df4-0 bXIOoL metacritic-score-box" style="background-color:#54A72A">74</span>,
 <span class="sc-b0901df4-0 bXIOoL metacritic-score-box" style="background-color:#54A72A">79</span>,
 <span class="sc-b0901df4-0 bXIOoL metacritic-score-box" style="background-color:#54A72A">85</span>,
 <span class="sc-b0901df4-0 bXIOoL metacritic-score-box" style="background-color:#54A72A">90</span>,
 <span class="sc-b0901df4-0 bXIOoL metacritic-score-box" style="background-color:#54A72A">100</span>,
 <span class="sc-b0901df4-0 bXIOoL metacritic-score-box" style="background-color:#54A72A">82</span>,
 <span class="sc-b0901df4-0 bXIOoL metacritic-score-box" style="background-color:#54A72A">92</span>,
 <span class="sc-b0901df4-0 bXIOoL metacritic-score-box" style="background-color:#54A72A">72</span>,
 <span class="sc-b0901df4-0 bXIOoL metacritic-score-box" style="background-color:#54A72A">

In [87]:
# Getting the metascore without tags
metascores = []

for score in metascore_span:
    mark = score.get_text()
    metascores.append(mark)
metascores

['67',
 '74',
 '79',
 '85',
 '90',
 '100',
 '82',
 '92',
 '72',
 '69',
 '63',
 '74',
 '89',
 '90',
 '84',
 '95',
 '65',
 '85',
 '92',
 '78',
 '97',
 '94',
 '88',
 '75']
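
The list above holds 24 scores for 25 movies, because a movie without a Metascore contributes no `span` at all. One way to keep the columns aligned is to loop over each movie's container and look up the fields inside it, recording `None` when a field is absent. The sketch below runs on a simplified stand-in for the page. The `movie-card` container class is hypothetical, so the real container would have to be identified with Inspect Element.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the real page; the "movie-card" class is hypothetical.
sample_html = """
<ul>
  <li class="movie-card">
    <h3 class="ipc-title__text">9. Die Hard</h3>
    <span class="sc-b0901df4-0 bXIOoL metacritic-score-box">72</span>
  </li>
  <li class="movie-card">
    <h3 class="ipc-title__text">10. Come and See</h3>
  </li>
  <li class="movie-card">
    <h3 class="ipc-title__text">11. Inglourious Basterds</h3>
    <span class="sc-b0901df4-0 bXIOoL metacritic-score-box">69</span>
  </li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for card in soup.find_all("li", class_="movie-card"):
    title_tag = card.find("h3", class_="ipc-title__text")
    # Matching on the single class "metacritic-score-box" also finds tags
    # that carry additional classes.
    score_tag = card.find("span", class_="metacritic-score-box")
    rows.append({
        "title": title_tag.get_text(strip=True),
        # None marks a missing score, so every movie still yields one row.
        "metascore": int(score_tag.get_text()) if score_tag is not None else None,
    })

print(rows)
```

Because each row is built from one container, a missing field cannot shift the values of the movies that follow it.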

### **Scrape the movie description**

Description tag: `<div class="ipc-html-content-inner-div">`

In [93]:
# Getting the movie description
movie_description = soup.find_all('div', class_="ipc-html-content-inner-div")
movie_description

[<div class="ipc-html-content-inner-div" role="presentation">A former Roman General sets out to exact vengeance against the corrupt emperor who murdered his family and sent him into slavery.</div>,
 <div class="ipc-html-content-inner-div" role="presentation">When Earth becomes uninhabitable in the future, a farmer and ex-NASA pilot, Joseph Cooper, is tasked to pilot a spacecraft, along with a team of researchers, to find a new planet for humans.</div>,
 <div class="ipc-html-content-inner-div" role="presentation">Paul Atreides unites with the Fremen while on a warpath of revenge against the conspirators who destroyed his family. Facing a choice between the love of his life and the fate of the universe, he endeavors to prevent a terrible future.</div>,
 <div class="ipc-html-content-inner-div" role="presentation">After a shipwreck, an intelligent robot called Roz is stranded on an uninhabited island. To survive the harsh environment, Roz bonds with the island's animals and cares for an or

In [97]:
# Getting the movie description without the tag
descriptions = []

for descrip in movie_description:
    desc_text = descrip.get_text()
    descriptions.append(desc_text)
descriptions

['A former Roman General sets out to exact vengeance against the corrupt emperor who murdered his family and sent him into slavery.',
 'When Earth becomes uninhabitable in the future, a farmer and ex-NASA pilot, Joseph Cooper, is tasked to pilot a spacecraft, along with a team of researchers, to find a new planet for humans.',
 'Paul Atreides unites with the Fremen while on a warpath of revenge against the conspirators who destroyed his family. Facing a choice between the love of his life and the fate of the universe, he endeavors to prevent a terrible future.',
 "After a shipwreck, an intelligent robot called Roz is stranded on an uninhabited island. To survive the harsh environment, Roz bonds with the island's animals and cares for an orphaned baby goose.",
 'A dramatization of the life story of J. Robert Oppenheimer, the physicist who had a large hand in the development of the atomic bombs that brought an end to World War II.',
 'The aging patriarch of an organized crime dynasty tra

### **Store the scraped data into a DataFrame**

In [118]:
# Filling missing data in metascore
# Only 24 Metascores were scraped for 25 movies, so the list is one entry short
metascores = [
    '67', '74', '79', '85', '90', '100', '82', '92', '72', '69', '63', '74',
    '89', '90', '84', '95', '65', '85', '92', '78', '97', '94', '88', '75'
]

# 'Come and See' (index 9) has no Metascore; '0' is inserted as a placeholder
metascores.insert(9, '0')
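
The fill step matters because pandas rejects a column whose length does not match the rows already in the DataFrame. A small check run before assembling the table makes such a mismatch visible early. The helper `check_equal_lengths` and the shortened sample lists are illustrative assumptions.

```python
def check_equal_lengths(columns: dict) -> None:
    """Raise ValueError if the lists in `columns` do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")


# Shortened, made-up sample: one score is missing, as in the scraped data.
sample_columns = {
    "Titles": ["Die Hard", "Come and See", "Inglourious Basterds"],
    "Metascore": ["72", "69"],
}

try:
    check_equal_lengths(sample_columns)
except ValueError as error:
    print(error)
```

With the real data, the dictionary would map each column name to its scraped list (`titles`, `ranks`, `metascores` and so on).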

In [120]:
movie_data = pd.DataFrame()
movie_data['Titles'] = titles
movie_data['Ranks'] = ranks
movie_data['Release_Year']= released_years
movie_data['Runtime'] = runtime
movie_data['Certifications'] = certifications
movie_data['Ratings'] = ratings
movie_data['Votes'] = vote_counts
movie_data['Metascore'] = metascores
movie_data['Movie_Descriptions'] = descriptions

movie_data

|   | Titles | Ranks | Release_Year | Runtime | Certifications | Ratings | Votes | Metascore | Movie_Descriptions |
|---|--------|-------|--------------|---------|----------------|---------|-------|-----------|--------------------|
| 0 | Gladiator | 1 | 2000 | 2h 35m | R | 8.5 | 1.7M | 67 | A former Roman General sets out to exact venge... |
| 1 | Interstellar | 2 | 2014 | 2h 49m | PG-13 | 8.7 | 2.2M | 74 | When Earth becomes uninhabitable in the future... |
| 2 | Dune: Part Two | 3 | 2024 | 2h 46m | PG-13 | 8.5 | 579K | 79 | Paul Atreides unites with the Fremen while on ... |
| 3 | The Wild Robot | 4 | 2024 | 1h 42m | PG | 8.2 | 113K | 85 | After a shipwreck, an intelligent robot called... |
| 4 | Oppenheimer | 5 | 2023 | 3h | R | 8.3 | 838K | 90 | A dramatization of the life story of J. Robert... |
| 5 | The Godfather | 6 | 1972 | 2h 55m | R | 9.2 | 2.1M | 100 | The aging patriarch of an organized crime dyna... |
| 6 | The Shawshank Redemption | 7 | 1994 | 2h 22m | R | 9.3 | 3M | 82 | A banker convicted of uxoricide forms a friend... |
| 7 | The Lord of the Rings: The Fellowship of the Ring | 8 | 2001 | 2h 58m | PG-13 | 8.9 | 2.1M | 92 | A meek Hobbit from the Shire and eight compani... |
| 8 | Die Hard | 9 | 1988 | 2h 12m | R | 8.2 | 979K | 72 | A New York City police officer tries to save h... |
| 9 | Come and See | 10 | 1985 | 2h 22m | Not Rated | 8.3 | 108K | 0 | After finding an old rifle, a young boy joins ... |


In [122]:
# Exporting the dataset
movie_data.to_csv('top_imdb_movies.csv', index=False)