# **02 Intermediate importing data in Python**

## 1 Importing flat files from the *web*

* Important packages: `urllib`, `requests`

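* Using `urllib` to download a file and save it locally:
```python
from urllib.request import urlretrieve

# Download the file at `url` and save it to the given local path
url = "https.../data.csv"
urlretrieve(url, 'data_location')
```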
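As a minimal sketch of what `urlretrieve` does, the example below copies a small CSV to a local path and reads it back. To stay runnable offline it serves the file through a `file://` URL; the file names and CSV contents are made up for illustration, and a real `https://` URL would be passed the same way.

```python
import csv
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical sample data standing in for a remote CSV file.
workdir = Path(tempfile.mkdtemp())
source = workdir / "remote.csv"
source.write_text("name,score\nana,1\nbob,2\n", encoding="utf-8")

# urlretrieve(url, filename) saves the resource to `filename`
# and returns a (local_path, headers) tuple.
url = source.as_uri()  # file:// URL used here only so the sketch runs offline
local_path, headers = urlretrieve(url, str(workdir / "data.csv"))

# Read the downloaded copy with the standard csv module.
with open(local_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(rows)
```

Note that `csv.DictReader` returns every value as a string, so numeric columns still need converting.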
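* Getting HTML data with `urlopen`:
```python
from urllib.request import urlopen, Request

url = "https://www.page.org"
request = Request(url)       # package the GET request
response = urlopen(request)  # send it and receive the response
html = response.read()       # raw body as bytes
response.close()             # always release the connection
```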
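The explicit `response.close()` above can be replaced by a `with` block, which closes the response even if reading fails. The following is a sketch only: `fetch_text` is a hypothetical helper name, and a temporary local file behind a `file://` URL stands in for a real web page so the example runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.request import Request, urlopen


def fetch_text(url, encoding="utf-8"):
    """Return the body at `url` decoded to str (hypothetical helper)."""
    request = Request(url)
    # The with-block closes the response even if read() raises.
    with urlopen(request) as response:
        return response.read().decode(encoding)


# Local stand-in page so the sketch runs offline.
page = Path(tempfile.mkdtemp()) / "page.html"
page.write_text("<html><title>Demo</title></html>", encoding="utf-8")

html = fetch_text(page.as_uri())
print(html)
```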


In [2]:
from urllib.request import Request, urlopen

In [7]:
url = "https://www.wikipedia.org/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
request = Request(url, headers=headers)
response = urlopen(request)
html = response.read().decode("utf-8")
response.close()

print(html[:500]) 

<!DOCTYPE html>
<html lang="en" class="no-js">
<head>
<meta charset="utf-8">
<title>Wikipedia</title>
<meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.">
<script>
document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );
</script>
<meta name="viewport" content="initial-scale=1,user-scalable=yes">
<link rel="apple-touch-


* Using the `requests` package

In [22]:
import requests

url = "https://www.wikipedia.org/"
r = requests.get(url, headers=headers)
text = r.text
text[:500]

'<!DOCTYPE html>\n<html lang="en" class="no-js">\n<head>\n<meta charset="utf-8">\n<title>Wikipedia</title>\n<meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.">\n<script>\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\\s)no-js(\\s|$)/, "$1js-enabled$2" );\n</script>\n<meta name="viewport" content="initial-scale=1,user-scalable=yes">\n<link rel="apple-touch-'

* Using `BeautifulSoup`

In [23]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'html.parser')
print(soup.title.string)

Wikipedia


### Scraping with BeautifulSoup

* BeautifulSoup lets you parse HTML and extract data from it

In [26]:
url = "https://www.crummy.com/software/BeautifulSoup/"
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly to avoid a warning

In [32]:
print(soup.prettify()[:200])

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/transitional.dtd">
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>



In [39]:
soup.title, type(soup.title)

(<title>Beautiful Soup: We called him Tortoise because he taught us.</title>,
 bs4.element.Tag)

In [42]:
soup.text

'\n\n\nBeautiful Soup: We called him Tortoise because he taught us.\n\n\n\n\n\n\n\n\n\n[ Download | Documentation | Hall of Fame | For enterprise | Source | Changelog | Discussion group  | Zine ]\n\nBeautiful Soup\n\nYou didn\'t write that awful page. You\'re just trying to get some\ndata out of it. Beautiful Soup is here to help. Since 2004, it\'s been\nsaving programmers hours or days of work on quick-turnaround\nscreen scraping projects.\nBeautiful Soup is a Python library designed for quick turnaround\nprojects like screen-scraping. Three features make it powerful:\n\n\nBeautiful Soup provides a few simple methods and Pythonic idioms\nfor navigating, searching, and modifying a parse tree: a toolkit for\ndissecting a document and extracting what you need. It doesn\'t take\nmuch code to write an application\n\nBeautiful Soup automatically converts incoming documents to\nUnicode and outgoing documents to UTF-8. You don\'t have to think\nabout encodings, unless the document doesn\'t sp

In [35]:
for link in soup.find_all('a'): 
    print(link.get('href'))

#Download
bs4/doc/
#HallOfFame
enterprise.html
https://code.launchpad.net/beautifulsoup
https://git.launchpad.net/beautifulsoup/tree/CHANGELOG
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
zine/
bs4/download/
http://lxml.de/
http://code.google.com/p/html5lib/
bs4/doc/
https://tidelift.com/subscription/pkg/pypi-beautifulsoup4?utm_source=pypi-beautifulsoup4&utm_medium=referral&utm_campaign=enterprise
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
https://bugs.launchpad.net/beautifulsoup/
https://tidelift.com/security
https://tidelift.com/subscription/pkg/pypi-beautifulsoup4?utm_source=pypi-beautifulsoup4&utm_medium=referral&utm_campaign=website
zine/
None
bs4/download/
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
download/3.x/BeautifulSoup-3.2.2.tar.gz
https://tidelift.com/subscription/pkg/pypi-beautifulsoup?utm_source=pypi-beautifulsoup&utm_medium=referral&utm_campaign=website
None
http://www.nytimes.com/2007/10/25/arts/design/
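Several of the `href` values above are relative (`bs4/doc/`, `#Download`), and some are `None` because the `<a>` tag has no `href` attribute. Below is a small sketch of one way to clean such a list with `urljoin` from the standard library. The `hrefs` list is a hand-copied subset of the output above, and `absolute_links` is a hypothetical helper name.

```python
from urllib.parse import urljoin

base_url = "https://www.crummy.com/software/BeautifulSoup/"

# Hand-copied subset of the href values printed above.
hrefs = ["#Download", "bs4/doc/", "http://lxml.de/", None, "zine/"]


def absolute_links(base, hrefs):
    """Drop missing hrefs and resolve the rest against `base`."""
    return [urljoin(base, h) for h in hrefs if h is not None]


links = absolute_links(base_url, hrefs)
for link in links:
    print(link)
```

Links that are already absolute, such as `http://lxml.de/`, pass through `urljoin` unchanged.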

## 2 Interacting with APIs

### JSON

In [None]:
import json
with open('j.json') as json_file: 
    json_data = json.load(json_file)

for k, v in json_data.items(): 
    print(k, ': ', v)
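`json.load` reads from a file object, while `json.loads` parses a string that is already in memory. The sketch below runs both on the same made-up document; `io.StringIO` stands in for the opened file so the example does not depend on `j.json` existing.

```python
import io
import json

# Made-up JSON document used only for illustration.
raw = '{"title": "Example", "year": 2010, "tags": ["a", "b"]}'

from_string = json.loads(raw)            # parse a str
from_file = json.load(io.StringIO(raw))  # parse a file-like object

# JSON objects become dicts, arrays become lists, numbers become int/float.
for k, v in from_file.items():
    print(k, ': ', v)
```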

### API

* Application Programming Interface
* A collection of code: protocols and routines
* Allows two software programs to communicate with each other

In [2]:
import requests

In [14]:
url = "http://www.omdbapi.com/?apikey=72bc447a&t=the+social+network"
r = requests.get(url)
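The part of the URL after `?` is a query string of `key=value` pairs. As a sketch, it can be built from a dict with `urlencode` instead of being typed by hand; `YOUR_API_KEY` is a placeholder, not a real key.

```python
from urllib.parse import urlencode

base_url = "http://www.omdbapi.com/"
params = {"apikey": "YOUR_API_KEY", "t": "the social network"}

# urlencode escapes each value and joins the pairs with "&";
# spaces become "+", matching the hand-written URL above.
url = base_url + "?" + urlencode(params)
print(url)
```

`requests.get(base_url, params=params)` accepts the same dict and does the encoding itself.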

* Response as JSON

In [15]:
json_data = r.json()
json_data

{'Title': 'The Social Network',
 'Year': '2010',
 'Rated': 'PG-13',
 'Released': '01 Oct 2010',
 'Runtime': '120 min',
 'Genre': 'Biography, Drama',
 'Director': 'David Fincher',
 'Writer': 'Aaron Sorkin, Ben Mezrich',
 'Actors': 'Jesse Eisenberg, Andrew Garfield, Justin Timberlake',
 'Plot': 'As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea and by the co-founder who was later squeezed out of the business.',
 'Language': 'English, French',
 'Country': 'United States',
 'Awards': 'Won 3 Oscars. 174 wins & 188 nominations total',
 'Poster': 'https://m.media-amazon.com/images/M/MV5BMjlkNTE5ZTUtNGEwNy00MGVhLThmZjMtZjU1NDE5Zjk1NDZkXkEyXkFqcGc@._V1_SX300.jpg',
 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '7.8/10'},
  {'Source': 'Rotten Tomatoes', 'Value': '96%'},
  {'Source': 'Metacritic', 'Value': '95/100'}],
 'Metascore': '95',
 'imdbRating': '7.8',
 'imdbVotes':

* Response as text

In [12]:
text = r.text
text

'{"Title":"The Social Network","Year":"2010","Rated":"PG-13","Released":"01 Oct 2010","Runtime":"120 min","Genre":"Biography, Drama","Director":"David Fincher","Writer":"Aaron Sorkin, Ben Mezrich","Actors":"Jesse Eisenberg, Andrew Garfield, Justin Timberlake","Plot":"As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea and by the co-founder who was later squeezed out of the business.","Language":"English, French","Country":"United States","Awards":"Won 3 Oscars. 174 wins & 188 nominations total","Poster":"https://m.media-amazon.com/images/M/MV5BMjlkNTE5ZTUtNGEwNy00MGVhLThmZjMtZjU1NDE5Zjk1NDZkXkEyXkFqcGc@._V1_SX300.jpg","Ratings":[{"Source":"Internet Movie Database","Value":"7.8/10"},{"Source":"Rotten Tomatoes","Value":"96%"},{"Source":"Metacritic","Value":"95/100"}],"Metascore":"95","imdbRating":"7.8","imdbVotes":"799,416","imdbID":"tt1285016","Type":"movie","DVD":"N/A","BoxOff

In [17]:
type(text)

str

In [16]:
type(json_data)

dict
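The two types differ because `r.text` is the raw response body, while `r.json()` is roughly that body passed through a JSON parser. A minimal sketch with `json.loads`, using a shortened copy of the response text shown above:

```python
import json

# Shortened copy of the response text shown above.
text = (
    '{"Title":"The Social Network","Year":"2010",'
    '"Ratings":[{"Source":"Internet Movie Database","Value":"7.8/10"},'
    '{"Source":"Rotten Tomatoes","Value":"96%"}]}'
)

json_data = json.loads(text)  # str -> dict

# Once parsed, nested values can be reached with normal indexing.
ratings = {item["Source"]: item["Value"] for item in json_data["Ratings"]}
print(type(text).__name__, type(json_data).__name__)
print(json_data["Title"], ratings["Rotten Tomatoes"])
```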