# Web Scraping Xbox Games

## Import BeautifulSoup

BeautifulSoup is used to extract and work with HTML and XML data.

URL of the website: https://en.wikipedia.org/wiki/List_of_Xbox_games


In [6]:
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/List_of_Xbox_games"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-appearance-disabled vector-feature-appearance-pinned-clientpref-0 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-not-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of Xbox games - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinn

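The request above assumes the download always succeeds. A minimal sketch of a more defensive fetch is given here. The function name `fetch_soup`, the timeout value and the User-Agent string are assumptions made for illustration and are not part of the original notebook. Wikimedia asks automated clients to send a descriptive User-Agent, so one is set explicitly.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_Xbox_games"

# Hypothetical identifier for this script; replace it with your own details.
HEADERS = {"User-Agent": "xbox-games-scraper/0.1 (contact: you@example.com)"}


def fetch_soup(url, timeout=10.0):
    """Download a page and parse it, failing loudly on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return BeautifulSoup(response.content, "html.parser")


# Usage (performs a network request, so it is left commented out here):
# soup = fetch_soup(URL)
# print(soup.title.get_text(strip=True))
```
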
Extracting the table that contains the data to be scraped.

In [7]:
games_list = soup.find_all("table", class_ = "wikitable sortable sticky-header-multi")

print(games_list)

[<table class="wikitable sortable sticky-header-multi" id="softwarelist" style="width:100%; font-size:90%;">
<tbody><tr>
<th rowspan="2" style="width:25%;">Title
</th>
<th rowspan="2" style="width:14%;">Developer(s)
</th>
<th rowspan="2" style="width:14%;">Publisher(s)
</th>
<th colspan="3" style="width:14%;">Release date
</th></tr>
<tr>
<th><abbr title="PAL regions, which include Europe and Australasia">PAL</abbr>
</th>
<th><abbr title="Japan">JP</abbr>
</th>
<th><abbr title="North America">NA</abbr>
</th></tr>
<tr>
<td><i><a href="/wiki/187_Ride_or_Die" title="187 Ride or Die">187 Ride or Die</a></i>
</td>
<td><a class="mw-redirect" href="/wiki/Ubisoft_Paris" title="Ubisoft Paris">Ubisoft Paris</a>
</td>
<td><a href="/wiki/Ubisoft" title="Ubisoft">Ubisoft</a>
</td>
<td><span data-sort-value="000000002005-08-26-0000" style="white-space:nowrap">Aug 26, 2005</span></td>
<td class="table-na" data-sort-value="" style="background: #ececec; color: #2C2C2C; vertical-align: middle; text-align

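Passing a multi-class string to class_ matches only that exact class attribute value, so the lookup can break if the classes are reordered or extended. Since the output shows the table carries id="softwarelist", one possible alternative is to look it up by id. The sketch below runs on a trimmed-down stand-in for the page, not on the live site, and the error message is an assumption.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the Wikipedia markup shown in the output.
html = """
<table class="wikitable sortable sticky-header-multi" id="softwarelist">
  <tr><th>Title</th></tr>
  <tr><td>187 Ride or Die</td></tr>
</table>
<table class="wikitable"><tr><td>Some other table</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None when nothing matches.
games_table = soup.find("table", id="softwarelist")
if games_table is None:
    raise ValueError("Table 'softwarelist' not found; the layout may have changed.")

print(games_table["id"])                # softwarelist
print(len(games_table.find_all("tr")))  # 2
```
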
Finding all the column titles (th elements) of the table.

In [8]:
world_title = games_list[0].find_all("th")

world_title

[<th rowspan="2" style="width:25%;">Title
 </th>,
 <th rowspan="2" style="width:14%;">Developer(s)
 </th>,
 <th rowspan="2" style="width:14%;">Publisher(s)
 </th>,
 <th colspan="3" style="width:14%;">Release date
 </th>,
 <th><abbr title="PAL regions, which include Europe and Australasia">PAL</abbr>
 </th>,
 <th><abbr title="Japan">JP</abbr>
 </th>,
 <th><abbr title="North America">NA</abbr>
 </th>]

Using .text to extract the text from each title.

Using strip() to remove the leading and trailing whitespace.

In [10]:
world_table_titles = [title.text.strip() for title in world_title]

print(world_table_titles)

['Title', 'Developer(s)', 'Publisher(s)', 'Release date', 'PAL', 'JP', 'NA']

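The list above has seven titles, but each data row has only six cells, because "Release date" is a group header spanning PAL, JP and NA. One way to collect only the titles that map to data cells is to skip any th with a colspan greater than 1. This is a sketch on simplified header markup. It gives the right column order here only because the spanning header is the last column.

```python
from bs4 import BeautifulSoup

# Header rows copied in simplified form from the output above.
html = """
<table>
  <tr>
    <th rowspan="2">Title</th>
    <th rowspan="2">Developer(s)</th>
    <th rowspan="2">Publisher(s)</th>
    <th colspan="3">Release date</th>
  </tr>
  <tr>
    <th><abbr title="PAL regions">PAL</abbr></th>
    <th><abbr title="Japan">JP</abbr></th>
    <th><abbr title="North America">NA</abbr></th>
  </tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

# Skip headers that span several columns; only leaf headers map to data cells.
leaf_headers = [
    th.get_text(strip=True)
    for th in table.find_all("th")
    if int(th.get("colspan", 1)) == 1
]
print(leaf_headers)
# ['Title', 'Developer(s)', 'Publisher(s)', 'PAL', 'JP', 'NA']
```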

## Import Pandas

Use pandas to work with dataframes.

Using pandas to create an empty dataframe df whose columns are the table titles.

In [11]:
import pandas as pd

df = pd.DataFrame(columns = world_table_titles)

df

,Title,Developer(s),Publisher(s),Release date,PAL,JP,NA


Getting all the rows (tr elements) of the table.

In [12]:
column_data = games_list[0].find_all("tr")

print(column_data)

[<tr>
<th rowspan="2" style="width:25%;">Title
</th>
<th rowspan="2" style="width:14%;">Developer(s)
</th>
<th rowspan="2" style="width:14%;">Publisher(s)
</th>
<th colspan="3" style="width:14%;">Release date
</th></tr>, <tr>
<th><abbr title="PAL regions, which include Europe and Australasia">PAL</abbr>
</th>
<th><abbr title="Japan">JP</abbr>
</th>
<th><abbr title="North America">NA</abbr>
</th></tr>, <tr>
<td><i><a href="/wiki/187_Ride_or_Die" title="187 Ride or Die">187 Ride or Die</a></i>
</td>
<td><a class="mw-redirect" href="/wiki/Ubisoft_Paris" title="Ubisoft Paris">Ubisoft Paris</a>
</td>
<td><a href="/wiki/Ubisoft" title="Ubisoft">Ubisoft</a>
</td>
<td><span data-sort-value="000000002005-08-26-0000" style="white-space:nowrap">Aug 26, 2005</span></td>
<td class="table-na" data-sort-value="" style="background: #ececec; color: #2C2C2C; vertical-align: middle; text-align: center;">Unreleased</td>
<td><span data-sort-value="000000002005-08-24-0000" style="white-space:nowrap">Aug 24,

Storing the cell text of each row in a list with the help of the find_all method. The first two rows are skipped because they hold the column titles.

In [13]:
for row in column_data[2:]:
    row_data = row.find_all("td")
    individual_row_data = [data.text.strip() for data in row_data]
    print(individual_row_data)

['187 Ride or Die', 'Ubisoft Paris', 'Ubisoft', 'Aug 26, 2005', 'Unreleased', 'Aug 24, 2005']
['2002 FIFA World Cup', 'EA CanadaCreationsIntelligent Games', 'EA SportsWWElectronic Arts SquareJP', 'Apr 26, 2002', 'May 2, 2002', 'Apr 22, 2002']
['2006 FIFA World Cup', 'EA Canada', 'EA Sports', 'Apr 28, 2006', 'Unreleased', 'Apr 24, 2006']
['25 to Life', 'Avalanche SoftwareRitual Entertainment', 'Eidos Interactive', 'Unreleased', 'Unreleased', 'Jan 17, 2006']
['4x4 Evo 2', 'Terminal Reality', 'Gathering of Developers', 'Jun 5, 2002', 'Unreleased', 'Nov 15, 2001']
['50 Cent: Bulletproof', 'Genuine Games', 'Vivendi Universal Games', 'Nov 25, 2005', 'Unreleased', 'Nov 17, 2005']
['Advent Rising', 'GlyphX Games', 'Majesco Entertainment', 'Feb 17, 2006', 'Unreleased', 'May 31, 2005']
['Æon Flux', 'Terminal Reality', 'Majesco Entertainment', 'Mar 30, 2006AUSMar 31, 2006EU', 'Unreleased', 'Nov 17, 2005']
['AFL Live 2003', 'IR Gurus', 'Acclaim Sports', 'Sep 5, 2002AUS', 'Unreleased', 'Unreleased'

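Some cells come out fused, for example 'EA CanadaCreationsIntelligent Games', because .text joins the text of every child element with nothing in between. A possible fix is get_text() with a separator. The cell markup below is a hypothetical stand-in, since the real page may structure multi-developer cells differently.

```python
from bs4 import BeautifulSoup

# Hypothetical cell markup: several links separated by <br/> tags.
html = (
    "<td><a>EA Canada</a><br/>"
    "<a>Creations</a><br/>"
    "<a>Intelligent Games</a></td>"
)
cell = BeautifulSoup(html, "html.parser").find("td")

# .text concatenates the pieces with no separator at all.
fused = cell.text.strip()
print(fused)      # EA CanadaCreationsIntelligent Games

# get_text() can insert a separator between pieces and strip each one.
separated = cell.get_text(separator="; ", strip=True)
print(separated)  # EA Canada; Creations; Intelligent Games
```
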
Dropping the Release date column, as it is only a group header for PAL, JP and NA and has no data cells of its own.

In [14]:
# Build an empty dataframe from the titles, then drop the group header column.
df = pd.DataFrame(columns = world_table_titles)
df = df.drop(columns = "Release date")

df

,Title,Developer(s),Publisher(s),PAL,JP,NA


Iterate over the rows and store the data in the dataframe.

For each row, len(df) gives the next unused index label, and the row is assigned at that label with the .loc method.

In [15]:
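for row in column_data[2:]:
    # Collect the cleaned text of every cell in this row.
    row_data = row.find_all("td")
    individual_row_data = [data.text.strip() for data in row_data]

    # len(df) is the next unused index label, so this appends the row.
    length = len(df)
    df.loc[length] = individual_row_data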
    
df

,Title,Developer(s),Publisher(s),PAL,JP,NA
0,187 Ride or Die,Ubisoft Paris,Ubisoft,"Aug 26, 2005",Unreleased,"Aug 24, 2005"
1,2002 FIFA World Cup,EA CanadaCreationsIntelligent Games,EA SportsWWElectronic Arts SquareJP,"Apr 26, 2002","May 2, 2002","Apr 22, 2002"
2,2006 FIFA World Cup,EA Canada,EA Sports,"Apr 28, 2006",Unreleased,"Apr 24, 2006"
3,25 to Life,Avalanche SoftwareRitual Entertainment,Eidos Interactive,Unreleased,Unreleased,"Jan 17, 2006"
4,4x4 Evo 2,Terminal Reality,Gathering of Developers,"Jun 5, 2002",Unreleased,"Nov 15, 2001"
...,...,...,...,...,...,...
983,Yourself!Fitness,Respondesign,Respondesign,Unreleased,Unreleased,"Oct 8, 2004"
984,Yu-Gi-Oh! The Dawn of Destiny,Konami Computer Entertainment Japan,Konami,"Nov 19, 2004EUDec 3, 2004AUS",Unreleased,"Mar 23, 2004"
985,Zapper: One Wicked Cricket,Blitz Games,Infogrames,"Mar 14, 2003",Unreleased,"Nov 6, 2002"
986,Zathura,High Voltage Software,2K Games,"Jan 27, 2006",Unreleased,"Nov 3, 2005"

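Appending with .loc one row at a time works, but it gets slow as the dataframe grows. A common alternative is to collect all the rows in a plain list and build the dataframe once. The sketch below uses three rows copied from the output in place of the parsed tr elements. The length check is an added safeguard and an assumption, since the original loop does not filter rows.

```python
import pandas as pd

columns = ["Title", "Developer(s)", "Publisher(s)", "PAL", "JP", "NA"]

# Three rows copied from the scraped output, standing in for the parsed rows.
scraped_rows = [
    ["187 Ride or Die", "Ubisoft Paris", "Ubisoft",
     "Aug 26, 2005", "Unreleased", "Aug 24, 2005"],
    ["2006 FIFA World Cup", "EA Canada", "EA Sports",
     "Apr 28, 2006", "Unreleased", "Apr 24, 2006"],
    ["4x4 Evo 2", "Terminal Reality", "Gathering of Developers",
     "Jun 5, 2002", "Unreleased", "Nov 15, 2001"],
]

# Keep only rows whose cell count matches the header; others would misalign.
rows = [row for row in scraped_rows if len(row) == len(columns)]

# Build the dataframe in a single step instead of growing it row by row.
df = pd.DataFrame(rows, columns=columns)
print(df.shape)  # (3, 6)
```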

Output the final dataframe.

In [16]:
df

,Title,Developer(s),Publisher(s),PAL,JP,NA
0,187 Ride or Die,Ubisoft Paris,Ubisoft,"Aug 26, 2005",Unreleased,"Aug 24, 2005"
1,2002 FIFA World Cup,EA CanadaCreationsIntelligent Games,EA SportsWWElectronic Arts SquareJP,"Apr 26, 2002","May 2, 2002","Apr 22, 2002"
2,2006 FIFA World Cup,EA Canada,EA Sports,"Apr 28, 2006",Unreleased,"Apr 24, 2006"
3,25 to Life,Avalanche SoftwareRitual Entertainment,Eidos Interactive,Unreleased,Unreleased,"Jan 17, 2006"
4,4x4 Evo 2,Terminal Reality,Gathering of Developers,"Jun 5, 2002",Unreleased,"Nov 15, 2001"
...,...,...,...,...,...,...
983,Yourself!Fitness,Respondesign,Respondesign,Unreleased,Unreleased,"Oct 8, 2004"
984,Yu-Gi-Oh! The Dawn of Destiny,Konami Computer Entertainment Japan,Konami,"Nov 19, 2004EUDec 3, 2004AUS",Unreleased,"Mar 23, 2004"
985,Zapper: One Wicked Cricket,Blitz Games,Infogrames,"Mar 14, 2003",Unreleased,"Nov 6, 2002"
986,Zathura,High Voltage Software,2K Games,"Jan 27, 2006",Unreleased,"Nov 3, 2005"

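The date columns are still plain strings, and some cells hold several fused dates (for example 'Mar 30, 2006AUSMar 31, 2006EU') or the word 'Unreleased'. A minimal sketch of one way to pull out the first date in a cell is given here. The helper name `first_release_date` is hypothetical, and keeping only the earliest listed date is an assumption about what is wanted.

```python
import re
from datetime import datetime

# Matches dates shaped like "Aug 26, 2005".
DATE_PATTERN = re.compile(r"[A-Z][a-z]{2} \d{1,2}, \d{4}")


def first_release_date(cell_text):
    """Return the first date found in a cell, or None if there is none."""
    match = DATE_PATTERN.search(cell_text)
    if match is None:
        return None  # e.g. "Unreleased"
    # %b expects English month abbreviations under the default C locale.
    return datetime.strptime(match.group(), "%b %d, %Y").date()


print(first_release_date("Aug 26, 2005"))                   # 2005-08-26
print(first_release_date("Mar 30, 2006AUSMar 31, 2006EU"))  # 2006-03-30
print(first_release_date("Unreleased"))                     # None
```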

Export the dataframe to a CSV file. Passing index = False leaves the row index out of the file.

In [18]:
df.to_csv(r'/home/harshvrdhn/Documents/Code/CODE/Jupyter/XBox-Games-Web-Scraping/xbox_games.csv', index = False)
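
The absolute path above ties the notebook to one machine. A possible variation builds the path with pathlib and reads the file back to check the export. In this sketch a temporary directory stands in for the project folder and a one-row dataframe stands in for the scraped data; both are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame(
    [["187 Ride or Die", "Ubisoft Paris", "Ubisoft",
      "Aug 26, 2005", "Unreleased", "Aug 24, 2005"]],
    columns=["Title", "Developer(s)", "Publisher(s)", "PAL", "JP", "NA"],
)

# A temporary directory stands in for the project folder in this sketch.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "xbox_games.csv"
    df.to_csv(csv_path, index=False)

    # Read the file back to confirm the rows and columns survived the export.
    restored = pd.read_csv(csv_path)

print(restored.shape)  # (1, 6)
```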