---
## Recreating the Web Scraping:
I'm going to try scraping a website on my own, without following the tutorial.

### 1. Importing the necessary libraries:
- Beautiful Soup 4
- Requests
- lxml
- Pandas

In [1]:
from bs4 import BeautifulSoup
import requests
#import lxml -- unable to install due to Network issues
import pandas as pd

Next, I'll assign the site link to a variable, use requests to send a request to the site in order to access its contents, and store the response in a variable as well:

In [2]:
site = "http://www.scrapethissite.com/pages/simple/"

page = requests.get(site)
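
A request can fail or hang, so it may help to check the response before parsing it. Below is a minimal sketch, assuming a hypothetical helper named `fetch_html` and a 10-second timeout (neither is part of the original notebook); `raise_for_status()` and the `timeout` argument are standard Requests features.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising if the request failed.

    `fetch_html` and the default timeout are illustrative assumptions.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

Calling `fetch_html(site)` would then give the same text as `page.text`, but stop with an exception if the site could not be reached properly.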

Next, I'll parse the response with BeautifulSoup and display it to check that the site's content was retrieved.

In [3]:
content = BeautifulSoup(page.text, "html.parser")  # built-in parser, since lxml is not installed

content

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping</title>
<link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="A single page that lists information about all the countries in the world. Good for those just get started with web scraping." name="description"/>
<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
<link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
<meta content="noindex" name="robots"/>
<link h

Next, I'll locate the parts of the HTML that we'll be scraping and converting into a pandas DataFrame.

In [4]:
content.find_all("div", class_ = "col-md-4 country")

[<div class="col-md-4 country">
 <h3 class="country-name">
 <i class="flag-icon flag-icon-ad"></i>
                             Andorra
                         </h3>
 <div class="country-info">
 <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
 <strong>Population:</strong> <span class="country-population">84000</span><br/>
 <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
 </div>
 </div>,
 <div class="col-md-4 country">
 <h3 class="country-name">
 <i class="flag-icon flag-icon-ae"></i>
                             United Arab Emirates
                         </h3>
 <div class="country-info">
 <strong>Capital:</strong> <span class="country-capital">Abu Dhabi</span><br/>
 <strong>Population:</strong> <span class="country-population">4975593</span><br/>
 <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">82880.0</span><br/>
 </div>
 </div>,
 <div class="col-md-4 country">
 <h3 class="country-name">
 

#### What next?

Since the data in this HTML is not in a table but in a list of repeated blocks, I'll create an empty list, loop over the label tags with a for loop, and append each label to the list in order to build the column headers for my DataFrame.

In [5]:
head = []

for item in content.find_all("strong")[:3]:
    items = item.text[: -1] # Remove the colon(:) sign
    head.append(items)

head

['Capital', 'Population', 'Area (km2)']

##### But there's a problem!
Each "country-info" div only contains 3 labels (Capital, Population and Area); the fourth and most important one, the country name, sits in a separate "h3" tag and has no label of its own.

So I'll have to insert the fourth header into the list manually, since the page doesn't provide it and it's an essential part.

In [6]:
head.insert(0, "Country")

head

['Country', 'Capital', 'Population', 'Area (km2)']
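
The two header steps above can also be done in one go. Here is a minimal sketch, assuming a hypothetical helper `build_headers` and a small `SAMPLE_HTML` string that mimics one country block from the page; it reads the labels from the first "country-info" block only, so it does not depend on how many other "strong" tags the page has.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mirrors one country block from the page.
SAMPLE_HTML = """
<div class="col-md-4 country">
  <h3 class="country-name"><i class="flag-icon flag-icon-ad"></i> Andorra </h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
    <strong>Population:</strong> <span class="country-population">84000</span><br/>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
  </div>
</div>
"""


def build_headers(soup):
    """Return 'Country' followed by the labels of the first country-info block."""
    first_info = soup.find("div", class_="country-info")
    labels = [
        tag.get_text(strip=True).rstrip(":")  # drop the trailing colon
        for tag in first_info.find_all("strong")
    ]
    return ["Country"] + labels


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(build_headers(soup))  # ['Country', 'Capital', 'Population', 'Area (km2)']
```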

After creating the column headers from the scraped HTML, I next convert them into an empty pandas DataFrame.

Then I save it in the "data" variable.

In [7]:
data = pd.DataFrame(columns = head)

data

Unnamed: 0,Country,Capital,Population,Area (km2)


The next step is to scrape the actual data from the website and add it to the DataFrame as well.

In [8]:
country_info = content.find_all("div", class_ = "country-info")

country_info

[<div class="country-info">
 <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
 <strong>Population:</strong> <span class="country-population">84000</span><br/>
 <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
 </div>,
 <div class="country-info">
 <strong>Capital:</strong> <span class="country-capital">Abu Dhabi</span><br/>
 <strong>Population:</strong> <span class="country-population">4975593</span><br/>
 <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">82880.0</span><br/>
 </div>,
 <div class="country-info">
 <strong>Capital:</strong> <span class="country-capital">Kabul</span><br/>
 <strong>Population:</strong> <span class="country-population">29121286</span><br/>
 <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">647500.0</span><br/>
 </div>,
 <div class="country-info">
 <strong>Capital:</strong> <span class="country-capital">St. John's</span><br/>
 <strong>Population:</strong> 

##### Another problem!
The "Country" column comes from a different tag with a different class, so extracting it together with the "country-info" class is going to be complicated.

So instead, I'll create a separate list for the country names, which I can later insert into the rows of country info I'm going to build.

In [9]:
country = content.find_all("h3", class_ = "country-name")

country_name = [name.text.strip() for name in country]
    
print(country_name)

['Andorra', 'United Arab Emirates', 'Afghanistan', 'Antigua and Barbuda', 'Anguilla', 'Albania', 'Armenia', 'Angola', 'Antarctica', 'Argentina', 'American Samoa', 'Austria', 'Australia', 'Aruba', 'Åland', 'Azerbaijan', 'Bosnia and Herzegovina', 'Barbados', 'Bangladesh', 'Belgium', 'Burkina Faso', 'Bulgaria', 'Bahrain', 'Burundi', 'Benin', 'Saint Barthélemy', 'Bermuda', 'Brunei', 'Bolivia', 'Bonaire', 'Brazil', 'Bahamas', 'Bhutan', 'Bouvet Island', 'Botswana', 'Belarus', 'Belize', 'Canada', 'Cocos [Keeling] Islands', 'Democratic Republic of the Congo', 'Central African Republic', 'Republic of the Congo', 'Switzerland', 'Ivory Coast', 'Cook Islands', 'Chile', 'Cameroon', 'China', 'Colombia', 'Costa Rica', 'Cuba', 'Cape Verde', 'Curacao', 'Christmas Island', 'Cyprus', 'Czech Republic', 'Germany', 'Djibouti', 'Denmark', 'Dominica', 'Dominican Republic', 'Algeria', 'Ecuador', 'Estonia', 'Egypt', 'Western Sahara', 'Eritrea', 'Spain', 'Ethiopia', 'Finland', 'Fiji', 'Falkland Islands', 'Micron
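
An alternative to keeping two separate lists is to walk over each country block once and read the name and the values from the same block. This is only a sketch under assumptions: `extract_rows` is a hypothetical helper and `SAMPLE_HTML` is a shortened stand-in for the real page. It relies on `class_="country"` matching any div whose class list includes "country".

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two country blocks shaped like the ones on the page.
SAMPLE_HTML = """
<div class="col-md-4 country">
  <h3 class="country-name"><i class="flag-icon flag-icon-ad"></i> Andorra </h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
    <strong>Population:</strong> <span class="country-population">84000</span><br/>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
  </div>
</div>
<div class="col-md-4 country">
  <h3 class="country-name"><i class="flag-icon flag-icon-ae"></i> United Arab Emirates </h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Abu Dhabi</span><br/>
    <strong>Population:</strong> <span class="country-population">4975593</span><br/>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">82880.0</span><br/>
  </div>
</div>
"""


def extract_rows(soup):
    """Return one [name, capital, population, area] list per country block."""
    rows = []
    for block in soup.find_all("div", class_="country"):
        name = block.find("h3", class_="country-name").get_text(strip=True)
        values = [span.get_text(strip=True) for span in block.find_all("span")]
        rows.append([name, *values])
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = extract_rows(soup)
print(rows)
```

Because the name and the values come from the same block, they cannot get out of step with each other.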

Now, for the actual country information, I'll collect the text of every "span" tag in each country-info block into a list, so that each country gets its own list of values. I can later add the country names to these lists to make each row of data.

In [10]:
countries_data = []

# try:
#     for row in country_info:
#         each_row = row.find_all("span")
#         for data in each_row:
#             each_data = [data.text.strip()]
#             print(each_data)
# except:
for row in country_info:
    each_row = row.find_all("span")  # the three value tags of this country

    # Keep only the stripped text of each span
    each_data = [item.text.strip() for item in each_row]
    countries_data.append(each_data)

countries_data

[['Andorra la Vella', '84000', '468.0'],
 ['Abu Dhabi', '4975593', '82880.0'],
 ['Kabul', '29121286', '647500.0'],
 ["St. John's", '86754', '443.0'],
 ['The Valley', '13254', '102.0'],
 ['Tirana', '2986952', '28748.0'],
 ['Yerevan', '2968000', '29800.0'],
 ['Luanda', '13068161', '1246700.0'],
 ['None', '0', '1.4E7'],
 ['Buenos Aires', '41343201', '2766890.0'],
 ['Pago Pago', '57881', '199.0'],
 ['Vienna', '8205000', '83858.0'],
 ['Canberra', '21515754', '7686850.0'],
 ['Oranjestad', '71566', '193.0'],
 ['Mariehamn', '26711', '1580.0'],
 ['Baku', '8303512', '86600.0'],
 ['Sarajevo', '4590000', '51129.0'],
 ['Bridgetown', '285653', '431.0'],
 ['Dhaka', '156118464', '144000.0'],
 ['Brussels', '10403000', '30510.0'],
 ['Ouagadougou', '16241811', '274200.0'],
 ['Sofia', '7148785', '110910.0'],
 ['Manama', '738004', '665.0'],
 ['Bujumbura', '9863117', '27830.0'],
 ['Porto-Novo', '9056010', '112620.0'],
 ['Gustavia', '8450', '21.0'],
 ['Hamilton', '65365', '53.0'],
 ['Bandar Seri Begawan', '3

### For the interesting part:
Now I have to merge both lists: one containing the country names, and the other containing each country's info.

I'll use the "zip()" function to pair both lists, and then, using a "for loop", insert each country name at index 0 of its matching list in countries_data.

In [11]:
for name, info in zip(country_name, countries_data):
    info.insert(0, name)

countries_data

[['Andorra', 'Andorra la Vella', '84000', '468.0'],
 ['United Arab Emirates', 'Abu Dhabi', '4975593', '82880.0'],
 ['Afghanistan', 'Kabul', '29121286', '647500.0'],
 ['Antigua and Barbuda', "St. John's", '86754', '443.0'],
 ['Anguilla', 'The Valley', '13254', '102.0'],
 ['Albania', 'Tirana', '2986952', '28748.0'],
 ['Armenia', 'Yerevan', '2968000', '29800.0'],
 ['Angola', 'Luanda', '13068161', '1246700.0'],
 ['Antarctica', 'None', '0', '1.4E7'],
 ['Argentina', 'Buenos Aires', '41343201', '2766890.0'],
 ['American Samoa', 'Pago Pago', '57881', '199.0'],
 ['Austria', 'Vienna', '8205000', '83858.0'],
 ['Australia', 'Canberra', '21515754', '7686850.0'],
 ['Aruba', 'Oranjestad', '71566', '193.0'],
 ['Åland', 'Mariehamn', '26711', '1580.0'],
 ['Azerbaijan', 'Baku', '8303512', '86600.0'],
 ['Bosnia and Herzegovina', 'Sarajevo', '4590000', '51129.0'],
 ['Barbados', 'Bridgetown', '285653', '431.0'],
 ['Bangladesh', 'Dhaka', '156118464', '144000.0'],
 ['Belgium', 'Brussels', '10403000', '30510.0
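
Two things are worth keeping in mind with this merge: "zip()" silently stops at the shorter list, and "insert()" changes the lists in place, so running the cell twice would add each name twice. Below is a minimal sketch of a variant that checks the lengths first and builds new rows instead; `merge_rows` is a hypothetical helper name, not something from the notebook.

```python
def merge_rows(names, infos):
    """Return new rows with each name placed in front of its matching info list."""
    if len(names) != len(infos):
        raise ValueError(
            f"{len(names)} names but {len(infos)} info rows; the lists do not line up"
        )
    # Build fresh lists so the inputs stay unchanged and the step is safe to re-run.
    return [[name, *info] for name, info in zip(names, infos)]


names = ["Andorra", "United Arab Emirates"]
infos = [["Andorra la Vella", "84000", "468.0"], ["Abu Dhabi", "4975593", "82880.0"]]
merged = merge_rows(names, infos)
print(merged)
```

On Python 3.10 or newer, `zip(names, infos, strict=True)` performs a similar length check on its own.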

After merging both lists into rows of data, I can now insert them into the "data" DataFrame. For each row, "len(data)" gives the current number of rows, which is also the next free index label, and the ".loc" indexer assigns the row at that label, so each row is added at the end of the DataFrame.

In [12]:
for all_data in countries_data:
    count = len(data)
    data.loc[count] = all_data
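
Adding rows one at a time works, but since all the rows already exist as a list of lists, the DataFrame can also be built in a single call. A minimal sketch, using two sample rows in place of the full scraped list:

```python
import pandas as pd

head = ["Country", "Capital", "Population", "Area (km2)"]

# Two sample rows standing in for the full countries_data list.
countries_data = [
    ["Andorra", "Andorra la Vella", "84000", "468.0"],
    ["United Arab Emirates", "Abu Dhabi", "4975593", "82880.0"],
]

# Build the whole DataFrame at once instead of growing it row by row.
data = pd.DataFrame(countries_data, columns=head)
print(data)
```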

Confirming the DataFrame...

In [13]:
data

Unnamed: 0,Country,Capital,Population,Area (km2)
0,Andorra,Andorra la Vella,84000,468.0
1,United Arab Emirates,Abu Dhabi,4975593,82880.0
2,Afghanistan,Kabul,29121286,647500.0
3,Antigua and Barbuda,St. John's,86754,443.0
4,Anguilla,The Valley,13254,102.0
...,...,...,...,...
245,Yemen,Sanaa,23495361,527970.0
246,Mayotte,Mamoudzou,159042,374.0
247,South Africa,Pretoria,49000000,1219912.0
248,Zambia,Lusaka,13460305,752614.0
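
Everything scraped so far is text, including the numbers (note the quotes around values such as '84000' and '1.4E7' in the lists above). If the data is going to be analyzed, the numeric columns could be converted with `pd.to_numeric`. The sketch below uses two sample rows rather than the full DataFrame:

```python
import pandas as pd

# Two sample rows; every value is a string, as scraped.
data = pd.DataFrame(
    [
        ["Andorra", "Andorra la Vella", "84000", "468.0"],
        ["Antarctica", "None", "0", "1.4E7"],
    ],
    columns=["Country", "Capital", "Population", "Area (km2)"],
)

# Convert the text columns that hold numbers into real numeric columns.
data["Population"] = pd.to_numeric(data["Population"])
data["Area (km2)"] = pd.to_numeric(data["Area (km2)"])

print(data.dtypes)
```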


Finally, I convert the DataFrame to a CSV file so it can be analyzed or explored later.

In [18]:
data.to_csv(r"C:/Users/Ben/Videos/New folder/Data Analysis/World_info.csv", index = False)
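
The path above only exists on one machine. As a rough sketch of a more portable variant, the file can be written with `pathlib` and read back to confirm it was saved correctly. The temporary folder here is just a stand-in for a real project folder, and the explicit UTF-8 encoding is an assumption made so that names like "Åland" survive the round trip.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two sample rows standing in for the full DataFrame.
data = pd.DataFrame(
    [
        ["Andorra", "Andorra la Vella", "84000", "468.0"],
        ["Åland", "Mariehamn", "26711", "1580.0"],
    ],
    columns=["Country", "Capital", "Population", "Area (km2)"],
)

out_dir = Path(tempfile.mkdtemp())  # stand-in for a real output folder
csv_path = out_dir / "World_info.csv"

data.to_csv(csv_path, index=False, encoding="utf-8")

# Read the file back to confirm that the rows were written.
loaded = pd.read_csv(csv_path, encoding="utf-8")
print(loaded)
```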

---
## Thank You!