## Scraping
Target page: https://www.scrapethissite.com/pages/simple/

#### **Steps:**

1. Import the necessary packages:
   - `import requests` 
   - `from bs4 import BeautifulSoup` 
2. Use `requests.get()` to download the page to be scraped.
   
   - `page = requests.get("page_url")`, where `page_url` is the URL of the page to be scraped.
   - Check `page.status_code` to confirm the request succeeded: `200` means success, while an error code such as `404` means the page does not exist.
3. Use `BeautifulSoup` to create a soup object: `page_soup = BeautifulSoup(markup=page.content, features="html.parser")` or simply `page_soup = BeautifulSoup(page.content, "html.parser")`.
   
4. Alternatively, we can use `page_soup = BeautifulSoup(page.text, "html.parser")`. `page.content` returns the response body as raw `bytes`, whereas `page.text` returns it decoded to a `str`.
   
   Either way, `page_soup` now holds the parsed HTML of the page we are dealing with.
   
5. The next step is to inspect the page's HTML and extract the relevant details. This can be done with the `find`, `findAll` and `findChild` methods together with `get` and `getText`. (`findAll`, `findChild` and `getText` are older aliases of `find_all`, `find` and `get_text`.)
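
The extraction methods listed in the steps above can be tried without any network access. The following is a minimal sketch on a small hand-written HTML string; the snippet, its class names and the variable names are made up for illustration and are not taken from the target page.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML snippet (hypothetical, for illustration only).
html = """
<div class="card">
  <h3 class="title"> First </h3>
  <a href="/first">link one</a>
</div>
<div class="card">
  <h3 class="title"> Second </h3>
  <a href="/second">link two</a>
</div>
"""

# BeautifulSoup accepts a str (like page.text) as well as bytes (like page.content).
soup_from_str = BeautifulSoup(html, "html.parser")
soup_from_bytes = BeautifulSoup(html.encode("utf-8"), "html.parser")

first_card = soup_from_str.find("div", class_="card")     # first match only
all_cards = soup_from_str.find_all("div", class_="card")  # list of every match

# get_text() returns the text inside a tag; get() returns an attribute value.
titles = [card.find("h3").get_text(strip=True) for card in all_cards]
links = [card.find("a").get("href") for card in all_cards]

print(titles)
print(links)
```

Here `find` returns a single tag (or `None` when nothing matches), while `find_all` always returns a list, so the result of `find` is worth checking before calling methods on it.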


Step 1: Import the packages needed

In [2]:
import requests
from bs4 import BeautifulSoup

Step 2: Set up the soup object and prepare for extraction.

In [3]:
page = requests.get("https://www.scrapethissite.com/pages/simple/")
page_soup = BeautifulSoup(page.content,"html.parser")
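
The cell above does not check whether the request succeeded. One possible way to fold that check in is sketched below; `fetch_soup` is a hypothetical helper name and the 10-second timeout is an arbitrary assumption, neither is part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download `url` and return it parsed as a BeautifulSoup object.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses (for example 404)
    # instead of silently parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

With this helper, the cell could be written as `page_soup = fetch_soup("https://www.scrapethissite.com/pages/simple/")`.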

Step 3: Upon inspecting the HTML code of the page, we can see that the information we seek is within `div` tags having `class="col-md-4 country"`. So we will first isolate all those tags.

In [4]:
countries_data = page_soup.find_all("div", class_="col-md-4 country")
countries_data[0]

<div class="col-md-4 country">
<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
                            Andorra
                        </h3>
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
<strong>Population:</strong> <span class="country-population">84000</span><br/>
<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
</div>
</div>
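
Passing the two-class string `"col-md-4 country"` to `class_` matches the `class` attribute as an exact string, so it depends on the classes appearing in that order. The sketch below illustrates the difference on a hand-written snippet (the second `div`, with its classes reversed, is a made-up case and does not come from the target page), alongside two order-independent alternatives.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the same two classes, written in a different order.
html = """
<div class="col-md-4 country"><h3 class="country-name">Andorra</h3></div>
<div class="country col-md-4"><h3 class="country-name">Angola</h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: only the first div qualifies.
exact = soup.find_all("div", class_="col-md-4 country")

# Matching a single class, or using a CSS selector, ignores class order.
by_single_class = soup.find_all("div", class_="country")
by_selector = soup.select("div.col-md-4.country")

print(len(exact), len(by_single_class), len(by_selector))  # 1 2 2
```

On the target page the exact-string form works, as the output above shows; the alternatives only matter if the markup were to list the classes differently.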

Step 4: Let us extract the country names from the `h3` tags with `class="country-name"`. Each of these sits inside one of the `div` tags from the previous step, so searching the whole page finds the same tags.

In [5]:
name_data = page_soup.find_all("h3", class_ = "country-name")
country_names = []

for name in name_data:
    country_names.append(name.get_text(strip=True))

len(country_names), country_names[:10]

(250,
 ['Andorra',
  'United Arab Emirates',
  'Afghanistan',
  'Antigua and Barbuda',
  'Anguilla',
  'Albania',
  'Armenia',
  'Angola',
  'Antarctica',
  'Argentina'])

Step 5: Let us now extract all the details of each country (name, capital, population and area). We will store them as a dictionary of dictionaries: each country name is a key whose value is another dictionary holding that country's demographic details.

In [6]:
country_details = {}
for country in countries_data:
    name = country.findChild("h3").get_text(strip=True)

    # Pull the raw text out first so the f-strings below need no nested quotes
    # (nested double quotes inside an f-string are a syntax error before Python 3.12).
    capital = country.findChild("span", class_="country-capital").get_text(strip=True)
    population = country.findChild("span", class_="country-population").get_text(strip=True)
    area = country.findChild("span", class_="country-area").get_text(strip=True)

    # Numbers are stored as strings with thousands separators, e.g. '84,000'.
    country_details[name] = {
        "capital": capital,
        "population": f"{int(population):,d}",
        "area (sq.km)": f"{float(area):,.1f}",
    }


In [7]:
country_details["India"], country_details["China"],country_details["Andorra"]

({'capital': 'New Delhi',
  'population': '1,173,108,018',
  'area (sq.km)': '3,287,590.0'},
 {'capital': 'Beijing',
  'population': '1,330,044,000',
  'area (sq.km)': '9,596,960.0'},
 {'capital': 'Andorra la Vella',
  'population': '84,000',
  'area (sq.km)': '468.0'})
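
Because the population and area are stored as formatted strings, they need converting back to numbers before any comparison or arithmetic. A minimal sketch, using only the three entries shown in the output above; `parse_population` is a hypothetical helper name.

```python
# Sample values copied from the output above (stored as formatted strings).
country_details = {
    "India": {"capital": "New Delhi", "population": "1,173,108,018", "area (sq.km)": "3,287,590.0"},
    "China": {"capital": "Beijing", "population": "1,330,044,000", "area (sq.km)": "9,596,960.0"},
    "Andorra": {"capital": "Andorra la Vella", "population": "84,000", "area (sq.km)": "468.0"},
}


def parse_population(text):
    """Convert a string such as '1,173,108,018' back to an int (hypothetical helper)."""
    return int(text.replace(",", ""))


# Sorting the raw strings would compare them character by character,
# so convert to int inside the sort key.
by_population = sorted(
    country_details,
    key=lambda name: parse_population(country_details[name]["population"]),
    reverse=True,
)
print(by_population)  # ['China', 'India', 'Andorra']
```

An alternative design would be to keep the `int` and `float` values in the dictionary and apply the `:,d` formatting only when printing.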

Step 6: Let's write it all to a text file for posterity. (Note to self: learn how to write to CSV format!)

In [10]:
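with open("countries.txt", "w", encoding="utf-8") as file:
    for name, details in country_details.items():
        # One line per country: the name followed by its details dictionary.
        file.write(f"{name}: {details}\n")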
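
As for the note about CSV, one possible approach is the standard library's `csv.DictWriter`. The sketch below uses the three sample entries from the earlier output; the file name `countries.csv` and the `country` column name are assumptions made for illustration.

```python
import csv

# Sample values copied from the earlier output; the full dictionary works the same way.
country_details = {
    "India": {"capital": "New Delhi", "population": "1,173,108,018", "area (sq.km)": "3,287,590.0"},
    "China": {"capital": "Beijing", "population": "1,330,044,000", "area (sq.km)": "9,596,960.0"},
    "Andorra": {"capital": "Andorra la Vella", "population": "84,000", "area (sq.km)": "468.0"},
}

# "country" is an assumed column name for the dictionary keys.
fieldnames = ["country", "capital", "population", "area (sq.km)"]

# newline="" is what the csv module documentation recommends when opening files.
with open("countries.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for name, details in country_details.items():
        writer.writerow({"country": name, **details})

# Read the file back to confirm the round trip.
with open("countries.csv", newline="", encoding="utf-8") as csv_file:
    rows = list(csv.DictReader(csv_file))

print(rows[0])
```

Values that contain commas, such as `1,173,108,018`, are quoted automatically by the writer, so they survive the round trip unchanged.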