# Lesson 7: Advanced Web Scraping and Data Gathering
## Topic 3: Reading data from an API
This notebook shows how to use a free API (no authorization or API key needed) to download basic information about countries around the world and load it into a DataFrame.

### Import libraries

In [1]:
import urllib.request, urllib.parse
from urllib.error import HTTPError,URLError
import pandas as pd

### Exercise 20: Define the base URL

In [2]:
serviceurl = 'https://restcountries.eu/rest/v2/name/'

### Exercise 21: Define a function to pull the country data from the API

In [3]:
def get_country_data(country):
    """
    Fetch data about a country from the "https://restcountries.eu" API.
    Returns the raw JSON response as a string, or None if the request fails.
    """
    country_name = str(country)
    url = serviceurl + country_name

    try:
        uh = urllib.request.urlopen(url)
    except HTTPError:
        # The server answered with an error status (e.g. an unknown country name)
        print("Sorry! Could not retrive anything on {}".format(country_name))
        return None
    except URLError as e:
        # The server could not be reached at all
        print('Failed to reach a server.')
        print('Reason: ', e.reason)
        return None
    else:
        data = uh.read().decode()
        print("Retrieved data on {}. Total {} characters read.".format(country_name, len(data)))
        return data
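
Country names that contain spaces or non-ASCII characters (for example `United States`) are not valid in a URL path as written. A minimal sketch, assuming the same `serviceurl` as above, of percent-encoding the name with `urllib.parse.quote` before building the request URL. The helper `build_country_url` is hypothetical and not part of the original notebook:

```python
import urllib.parse

serviceurl = 'https://restcountries.eu/rest/v2/name/'

def build_country_url(country, base=serviceurl):
    """Hypothetical helper: build the request URL with the name percent-encoded."""
    return base + urllib.parse.quote(str(country))

print(build_country_url('Switzerland'))
print(build_country_url('United States'))
```

Plain ASCII names such as `Switzerland` pass through unchanged, so the encoding step only affects names that need it.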

### Exercise 22: Test the function by passing a correct and an incorrect argument

In [4]:
country_name = 'Switzerland'

In [5]:
data=get_country_data(country_name)

Retrieved data on Switzerland. Total 1090 characters read.


In [6]:
country_name1 = 'Switzerland1'

In [7]:
data1=get_country_data(country_name1)

Sorry! Could not retrive anything on Switzerland1


In [9]:
data

'[{"name":"Switzerland","topLevelDomain":[".ch"],"alpha2Code":"CH","alpha3Code":"CHE","callingCodes":["41"],"capital":"Bern","altSpellings":["CH","Swiss Confederation","Schweiz","Suisse","Svizzera","Svizra"],"region":"Europe","subregion":"Western Europe","population":8341600,"latlng":[47.0,8.0],"demonym":"Swiss","area":41284.0,"gini":33.7,"timezones":["UTC+01:00"],"borders":["AUT","FRA","ITA","LIE","DEU"],"nativeName":"Schweiz","numericCode":"756","currencies":[{"code":"CHF","name":"Swiss franc","symbol":"Fr"}],"languages":[{"iso639_1":"de","iso639_2":"deu","name":"German","nativeName":"Deutsch"},{"iso639_1":"fr","iso639_2":"fra","name":"French","nativeName":"français"},{"iso639_1":"it","iso639_2":"ita","name":"Italian","nativeName":"Italiano"}],"translations":{"de":"Schweiz","es":"Suiza","fr":"Suisse","ja":"スイス","it":"Svizzera","br":"Suíça","pt":"Suíça","nl":"Zwitserland","hr":"Švicarska","fa":"سوئیس"},"flag":"https://restcountries.eu/data/che.svg","regionalBlocs":[{"acronym":"EFTA","

### Exercise 23: Use the built-in `json` library to read and examine the data properly

In [11]:
import json

In [13]:
# Load from string 'data'
x=json.loads(data)

In [15]:
# Load the only element
y=x[0]

In [16]:
type(y)

dict

In [17]:
y.keys()

dict_keys(['name', 'topLevelDomain', 'alpha2Code', 'alpha3Code', 'callingCodes', 'capital', 'altSpellings', 'region', 'subregion', 'population', 'latlng', 'demonym', 'area', 'gini', 'timezones', 'borders', 'nativeName', 'numericCode', 'currencies', 'languages', 'translations', 'flag', 'regionalBlocs', 'cioc'])
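
The response text starts with `[`, so `json.loads` returns a list, and the country record is its first element. A small offline sketch of that list-then-dict structure, using a hand-written sample string trimmed from the response shown above (not a full API response):

```python
import json

# Trimmed sample: a JSON array holding one JSON object
sample = '[{"name": "Switzerland", "capital": "Bern", "population": 8341600}]'

records = json.loads(sample)   # JSON array  -> Python list
record = records[0]            # JSON object -> Python dict

print(type(records).__name__, type(record).__name__)
print(sorted(record.keys()))
```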

### Exercise 24: Can you print all the data elements one by one?

In [18]:
for k,v in y.items():
    print("{}: {}".format(k,v))

name: Switzerland
topLevelDomain: ['.ch']
alpha2Code: CH
alpha3Code: CHE
callingCodes: ['41']
capital: Bern
altSpellings: ['CH', 'Swiss Confederation', 'Schweiz', 'Suisse', 'Svizzera', 'Svizra']
region: Europe
subregion: Western Europe
population: 8341600
latlng: [47.0, 8.0]
demonym: Swiss
area: 41284.0
gini: 33.7
timezones: ['UTC+01:00']
borders: ['AUT', 'FRA', 'ITA', 'LIE', 'DEU']
nativeName: Schweiz
numericCode: 756
currencies: [{'code': 'CHF', 'name': 'Swiss franc', 'symbol': 'Fr'}]
languages: [{'iso639_1': 'de', 'iso639_2': 'deu', 'name': 'German', 'nativeName': 'Deutsch'}, {'iso639_1': 'fr', 'iso639_2': 'fra', 'name': 'French', 'nativeName': 'français'}, {'iso639_1': 'it', 'iso639_2': 'ita', 'name': 'Italian', 'nativeName': 'Italiano'}]
translations: {'de': 'Schweiz', 'es': 'Suiza', 'fr': 'Suisse', 'ja': 'スイス', 'it': 'Svizzera', 'br': 'Suíça', 'pt': 'Suíça', 'nl': 'Zwitserland', 'hr': 'Švicarska', 'fa': 'سوئیس'}
flag: https://restcountries.eu/data/che.svg
regionalBlocs: [{'acrony

### Exercise 25: The dictionary values are not of the same type - print all the languages spoken

In [22]:
for lang in y['languages']:
    print(lang['nativeName'])

Deutsch
français
Italiano
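
Because `languages` is a list of dictionaries, one field can be collected from every entry with a list comprehension. A short sketch using the language entries copied from the output above:

```python
# Language entries as they appear in the Switzerland record
languages = [
    {'iso639_1': 'de', 'iso639_2': 'deu', 'name': 'German', 'nativeName': 'Deutsch'},
    {'iso639_1': 'fr', 'iso639_2': 'fra', 'name': 'French', 'nativeName': 'français'},
    {'iso639_1': 'it', 'iso639_2': 'ita', 'name': 'Italian', 'nativeName': 'Italiano'},
]

names = [lang['name'] for lang in languages]
native_names = [lang['nativeName'] for lang in languages]

print(','.join(names))   # one comma-separated string
print(native_names)      # or keep the values as a list
```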


### Exercise 26: Write a function which can take a list of countries and return a DataFrame containing key info
* Capital
* Region
* Sub-region
* Population
* Latitude/longitude
* Area
* Gini index
* Timezones
* Currencies
* Languages

In [2]:
import pandas as pd
import json
def build_country_database(list_country):
    """
    Takes a list of country names.
    Returns a DataFrame with key information about those countries.
    Countries whose data could not be retrieved are skipped.
    """
    # Define a dictionary with one empty list per output column
    country_dict={'Country':[],'Capital':[],'Region':[],'Sub-region':[],'Population':[],
                  'Lattitude':[],'Longitude':[],'Area':[],'Gini':[],'Timezones':[],
                  'Currencies':[],'Languages':[]}

    for c in list_country:
        data = get_country_data(c)
        if data is not None:
            x = json.loads(data)
            y = x[0]
            country_dict['Country'].append(y['name'])
            country_dict['Capital'].append(y['capital'])
            country_dict['Region'].append(y['region'])
            country_dict['Sub-region'].append(y['subregion'])
            country_dict['Population'].append(y['population'])
            country_dict['Lattitude'].append(y['latlng'][0])
            country_dict['Longitude'].append(y['latlng'][1])
            country_dict['Area'].append(y['area'])
            country_dict['Gini'].append(y['gini'])
            # Timezones arrive as a list of strings; join them into one string
            # (a single-element list simply yields that element)
            country_dict['Timezones'].append(','.join(y['timezones']))
            # Currencies and languages arrive as lists of dictionaries; join their names
            country_dict['Currencies'].append(','.join(i['name'] for i in y['currencies']))
            country_dict['Languages'].append(','.join(i['name'] for i in y['languages']))

    # Return as a Pandas DataFrame
    return pd.DataFrame(country_dict)
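
The function above indexes every field directly, so a record that lacks a key would raise `KeyError`. Whether the API ever omits a field is an assumption here, not something the outputs in this notebook show. A sketch of a hypothetical helper, `country_row`, that flattens one record and uses `dict.get` so a missing field becomes `None` or an empty string instead of an error:

```python
def country_row(record):
    """Hypothetical helper: flatten one country record into a flat row dict."""
    latlng = record.get('latlng') or [None, None]
    return {
        'Country': record.get('name'),
        'Capital': record.get('capital'),
        'Latitude': latlng[0],
        'Longitude': latlng[1],
        'Gini': record.get('gini'),  # None when the key is absent
        'Timezones': ','.join(record.get('timezones', [])),
        'Currencies': ','.join(c['name'] for c in record.get('currencies', [])),
        'Languages': ','.join(l['name'] for l in record.get('languages', [])),
    }

# Sample record based on the Switzerland output, with 'gini' left out on purpose
sample = {
    'name': 'Switzerland',
    'capital': 'Bern',
    'latlng': [47.0, 8.0],
    'timezones': ['UTC+01:00'],
    'currencies': [{'code': 'CHF', 'name': 'Swiss franc', 'symbol': 'Fr'}],
    'languages': [{'name': 'German'}, {'name': 'French'}, {'name': 'Italian'}],
}

row = country_row(sample)
print(row)
```

A list of such rows can be passed directly to `pd.DataFrame(rows)`.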

### Exercise 27: Test the function by building a small database of countries' info. Include an incorrect name too.

In [24]:
df1=build_country_database(['Nigeria','Switzerland','France',
                            'Turmeric','Russia','Kenya','Singapore'])

Sorry! Could not retrive anything on Nigeria
Retrieved data on Switzerland. Total 1090 characters read.
Retrieved data on France. Total 1047 characters read.
Sorry! Could not retrive anything on Turmeric
Retrieved data on Russia. Total 1120 characters read.
Retrieved data on Kenya. Total 1052 characters read.
Retrieved data on Singapore. Total 1223 characters read.


In [25]:
df1

| | Country | Capital | Region | Sub-region | Population | Lattitude | Longitude | Area | Gini | Timezones | Currencies | Languages |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | Bern | Europe | Western Europe | 8341600 | 47.0 | 8.0 | 41284.0 | 33.7 | UTC+01:00 | Swiss franc | German,French,Italian |
| 1 | France | Paris | Europe | Western Europe | 66710000 | 46.0 | 2.0 | 640679.0 | 32.7 | UTC-10:00,UTC-09:30,UTC-09:00,UTC-08:00,UTC-04... | Euro | French |
| 2 | Russian Federation | Moscow | Europe | Eastern Europe | 146599183 | 60.0 | 100.0 | 17124442.0 | 40.1 | UTC+03:00,UTC+04:00,UTC+06:00,UTC+07:00,UTC+08... | Russian ruble | Russian |
| 3 | Kenya | Nairobi | Africa | Eastern Africa | 47251000 | 1.0 | 38.0 | 580367.0 | 47.7 | UTC+03:00 | Kenyan shilling | English,Swahili |
| 4 | Singapore | Singapore | Asia | South-Eastern Asia | 5535000 | 1.366667 | 103.8 | 710.0 | 48.1 | UTC+08:00 | Brunei dollar,Singapore dollar | English,Malay,Tamil,Chinese |


In [14]:
# Column digging: the file is parsed one line at a time, each line being
# one JSON object; collect every record and every key that appears
keys = []
data_tmp = []
with open("C:\\Users\\andre\\Downloads\\All_Beauty.json", "r") as file:
    for line in file:
        tmp = json.loads(line.strip())
        data_tmp.append(tmp)
        keys.extend(tmp.keys())
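
The loop above treats the file as one JSON object per line. A self-contained sketch of the same pattern, with an in-memory `io.StringIO` standing in for the file and two hand-written sample records (not taken verbatim from the dataset). Unlike the cell above, it keeps only distinct keys, in first-seen order:

```python
import io
import json

# Stand-in for the file: one JSON object per line
fake_file = io.StringIO(
    '{"overall": 1.0, "reviewText": "great", "summary": "One Star"}\n'
    '{"overall": 5.0, "vote": "5", "summary": "Good Read"}\n'
)

records = []
keys = []                      # distinct keys, in first-seen order
for line in fake_file:
    line = line.strip()
    if not line:               # skip blank lines
        continue
    record = json.loads(line)
    records.append(record)
    for k in record:
        if k not in keys:
            keys.append(k)

print(len(records), keys)
```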

In [4]:
data_tmp

[{'overall': 1.0,
  'verified': True,
  'reviewTime': '02 19, 2015',
  'reviewerID': 'A1V6B6TNIC10QE',
  'asin': '0143026860',
  'reviewerName': 'theodore j bigham',
  'reviewText': 'great',
  'summary': 'One Star',
  'unixReviewTime': 1424304000},
 {'overall': 4.0,
  'verified': True,
  'reviewTime': '12 18, 2014',
  'reviewerID': 'A2F5GHSXFQ0W6J',
  'asin': '0143026860',
  'reviewerName': 'Mary K. Byke',
  'reviewText': "My  husband wanted to reading about the Negro Baseball and this a great addition to his library\n Our library doesn't haveinformation so this book is his start. Tthank you",
  'summary': "... to reading about the Negro Baseball and this a great addition to his library Our library doesn't haveinformation so ...",
  'unixReviewTime': 1418860800},
 {'overall': 4.0,
  'verified': True,
  'reviewTime': '08 10, 2014',
  'reviewerID': 'A1572GUYS7DGSR',
  'asin': '0143026860',
  'reviewerName': 'David G',
  'reviewText': 'This book was very informative, covering all aspects 

In [9]:
def try_key_or_make_empty(dict_name, key_name):
    """Return dict_name[key_name], or an empty string if the key is missing."""
    return dict_name.get(key_name, '')


['reviewTime',
 'reviewerName',
 'unixReviewTime',
 'overall',
 'reviewText',
 'vote',
 'image',
 'style',
 'summary',
 'reviewerID',
 'verified',
 'asin']

In [19]:
# Print the first review written by reviewerID A1V6B6TNIC10QE.
# The records have no "rank" key, so indexing y["rank"] would raise KeyError.
for y in data_tmp:
    if y["reviewerID"] == "A1V6B6TNIC10QE":
        print(y)
        break

{'overall': 1.0, 'verified': True, 'reviewTime': '02 19, 2015', 'reviewerID': 'A1V6B6TNIC10QE', 'asin': '0143026860', 'reviewerName': 'theodore j bigham', 'reviewText': 'great', 'summary': 'One Star', 'unixReviewTime': 1424304000}


In [12]:
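# Build a column-oriented dictionary: one list per key, with an empty
# string wherever a record lacks that key.
# Duplicates in `keys` collapse here; first-seen order is kept.
pre_json_data = {k: [] for k in keys}
columns = list(pre_json_data)
for y in data_tmp:
    for c in columns:
        pre_json_data[c].append(try_key_or_make_empty(y, c))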
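
The same column-oriented structure can be sketched compactly with a dictionary comprehension. The two records below are hand-written samples, and `table` is an illustrative name:

```python
records = [
    {'overall': 1.0, 'summary': 'One Star'},
    {'overall': 5.0, 'summary': 'Good Read', 'vote': '5'},
]
columns = ['overall', 'summary', 'vote']

# One list per column; dict.get supplies '' where a record lacks the key
table = {c: [r.get(c, '') for r in records] for c in columns}
print(table)
```

Every column ends up the same length, which is what `pd.DataFrame` requires when it is given a dictionary of lists.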

In [13]:
pre_json_data

{'overall': [1.0,
  4.0,
  4.0,
  5.0,
  5.0,
  ...

In [130]:
df = pd.DataFrame(pre_json_data)
df

| | overall | verified | reviewTime | reviewerID | asin | reviewerName | reviewText | summary | unixReviewTime | vote | style | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | True | 02 19, 2015 | A1V6B6TNIC10QE | 0143026860 | theodore j bigham | great | One Star | 1424304000 | | | |
| 1 | 4.0 | True | 12 18, 2014 | A2F5GHSXFQ0W6J | 0143026860 | Mary K. Byke | My husband wanted to reading about the Negro ... | ... to reading about the Negro Baseball and th... | 1418860800 | | | |
| 2 | 4.0 | True | 08 10, 2014 | A1572GUYS7DGSR | 0143026860 | David G | This book was very informative, covering all a... | Worth the Read | 1407628800 | | | |
| 3 | 5.0 | True | 03 11, 2013 | A1PSGLFK1NSVO | 0143026860 | TamB | I am already a baseball fan and knew a bit abo... | Good Read | 1362960000 | | | |
| 4 | 5.0 | True | 12 25, 2011 | A6IKXKZMTKGSC | 0143026860 | shoecanary | This was a good story of the Black leagues. I ... | More than facts, a good story read! | 1324771200 | 5 | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 371340 | 1.0 | True | 07 20, 2017 | A202DCI7TV1022 | B01HJEGTYK | Sam | It was awful. It was super frizzy and I tried ... | It was super frizzy and I tried to comb it and... | 1500508800 | | | |
| 371341 | 5.0 | True | 03 16, 2017 | A3FSOR5IJOFIBE | B01HJEGTYK | TYW | I was skeptical about buying this. Worried it... | Awesome | 1489622400 | 34 | | |
| 371342 | 5.0 | True | 03 1, 2017 | A1B5DK6CTP2P24 | B01HJEGTYK | Norma Jennings | Makes me look good fast. | Five Stars | 1488326400 | 46 | | |
| 371343 | 2.0 | True | 02 21, 2017 | A23OUYS5IRMJS9 | B01HJEGTYK | Lee | Way lighter than photo\nNot mix blend of color... | Ok but color way off and volume as well | 1487635200 | | | |


In [121]:
# Looking up a key that does not exist raises KeyError; fall back to a default
try:
    trying = pre_json_data['overalls']
except KeyError:
    trying = 'bingo'

print(trying)

bingo


In [47]:
import csv
data = open("C:\\Users\\andre\\Downloads\\All_Beauty.json")
#open file with raw text reader
#for each line remove leading"{" and trailing "}"

data

<_io.TextIOWrapper name='C:\\Users\\andre\\Downloads\\All_Beauty.json' mode='r' encoding='cp1252'>

In [54]:
foo

'{"overall": 1.0, "vote": "3", "verified": true, "reviewTime": "11 19, 2016", "reviewerID": "AMACNEW14ADMX", "asin": "014789302X", "reviewerName": "rabiyaa123", "reviewText": "it burns your eyes when u put it on  and very light so u have to keep going back n forth a lot to get the dark eyeliner color.\nalso it smudges lot. waste of money.", "summary": "i do not recommend.", "unixReviewTime": 1479513600}\n'

In [51]:
foo = """{"overall": 1.0, "vote": "3", "verified": true, "reviewTime": "11 19, 2016", "reviewerID": "AMACNEW14ADMX", "asin": "014789302X", "reviewerName": "rabiyaa123", "reviewText": "it burns your eyes when u put it on  and very light so u have to keep going back n forth a lot to get the dark eyeliner color.\nalso it smudges lot. waste of money.", "summary": "i do not recommend.", "unixReviewTime": 1479513600}
"""

In [56]:
type(json.loads(foo.replace("\n","")))

dict

In [44]:
df2 = pd.read_json("C:\\Users\\andre\\Downloads\\All_Beauty.json")

ValueError: Trailing data

In [40]:
json_data=open("C:\\Users\\andre\\Downloads\\All_Beauty.json").read()
json_obj = json.loads(json_data)

JSONDecodeError: Extra data: line 2 column 1 (char 231)
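
Both errors are consistent with the file holding one JSON object per line rather than a single JSON document: the parser finishes the first object and then finds more data on line 2. A minimal sketch that reproduces the failure and the line-by-line fix on a small in-memory sample (hand-written records, not the real file):

```python
import json

text = (
    '{"overall": 1.0, "summary": "One Star"}\n'
    '{"overall": 4.0, "summary": "Worth the Read"}\n'
)

# Parsing the whole text at once fails: a second object follows the first
try:
    json.loads(text)
    whole_file_ok = True
except json.JSONDecodeError as err:
    whole_file_ok = False
    print('JSONDecodeError:', err.msg)

# Parsing one line at a time succeeds
records = [json.loads(line) for line in text.splitlines() if line.strip()]
print(len(records))
```

With pandas, `pd.read_json(path, lines=True)` reads this one-object-per-line layout directly.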