# Importing data - web and API

These notes cover importing data (i) from the web and (ii) from Application Programming Interfaces (APIs), a special and essential case of web import. One example is the Twitter streaming API, which lets us stream real-time tweets.

# 1. Importing data from the Internet

Outline
- import data from web - files or HTML
- load datasets into pandas dataframes
- make HTTP requests (GET requests)
- Scrape web data like HTML
- Parse HTML into useful data (BeautifulSoup)
- Use the urllib and requests packages

## 1.1 The urllib package - automate file downloads
- provides interface for fetching data across the web
- urlopen() - accepts URLs instead of file names

In [None]:
# Automate file download in Python
from urllib.request import urlretrieve
# Adjacent string literals are joined, so the URL contains no stray spaces
url = ('http://archive.ics.uci.edu/ml/machine-learning-databases/'
       'wine-quality/winequality-white.csv')

urlretrieve(url, 'winequality-white.csv')


## 1.2 Importing flat files from the web: your turn!
You are about to import your first file from the web! The flat file you will import is 'winequality-red.csv' from the University of California, Irvine's Machine Learning repository (http://archive.ics.uci.edu/ml/index.php). The flat file contains tabular data on physicochemical properties of red wine, such as pH, alcohol content and citric acid content, along with a wine quality rating.

The URL of the file is

'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

After you import it, you'll check your working directory to confirm that it is there and then you'll load it into a pandas DataFrame.

In [None]:
# Import package
from urllib.request import urlretrieve

# Import pandas
import pandas as pd

# Assign url of file: url
url = ('https://s3.amazonaws.com/assets.datacamp.com/production/'
       'course_1606/datasets/winequality-red.csv')

# Save file locally
urlretrieve(url, 'winequality-red.csv')

# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())
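
The exercise also asks you to confirm that the downloaded file landed in the working directory. A minimal sketch of one way to check this with the standard library only; the helper name `is_in_working_dir` is hypothetical and not part of the original exercise:

```python
from pathlib import Path

def is_in_working_dir(filename):
    """Return True if `filename` is a regular file in the current working directory."""
    return (Path.cwd() / filename).is_file()

# Prints True only if the file has already been saved to this directory
print(is_in_working_dir('winequality-red.csv'))
```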


## 1.3 Opening and reading flat files from the web
You have just imported a file from the web, saved it locally and loaded it into a DataFrame. If you just wanted to load a file from the web into a DataFrame without first saving it locally, you can do that easily using pandas. In particular, you can use the function pd.read_csv() with the URL as the first argument and the separator sep as the second argument.

The URL of the file, once again, is

'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

In [None]:
# Import packages
import matplotlib.pyplot as plt
import pandas as pd

# Assign url of file: url
url = ('https://s3.amazonaws.com/assets.datacamp.com/production/'
       'course_1606/datasets/winequality-red.csv')

# Read file into a DataFrame: df
df = pd.read_csv(url, sep=';')

# Print the head of the DataFrame
print(df.head())

# Plot first column of df
# .ix was removed from pandas; .iloc selects by integer position
df.iloc[:, 0:1].hist()
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
plt.show()


<img src="images/importweb1.png" width="300" />

## 1.4 Importing non-flat files from the web
Congrats! You've just loaded a flat file from the web into a DataFrame without first saving it locally using the pandas function pd.read_csv(). This function is super cool because it has close relatives that allow you to load all types of files, not only flat ones. In this interactive exercise, you'll use pd.read_excel() to import an Excel spreadsheet.

The URL of the spreadsheet is

'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

Your job is to use pd.read_excel() to read in all of its sheets, print the sheet names and then print the head of the first sheet using its name, not its index.

Note that when all sheets are read at once, the output of pd.read_excel() is a Python dictionary with sheet names as keys and the corresponding DataFrames as values.

In [None]:
# Import package
import pandas as pd

# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read in all sheets of Excel file: xl
# (the keyword is sheet_name; the older spelling sheetname was removed from pandas)
xl = pd.read_excel(url, sheet_name=None)

# Print the sheetnames to the shell
print(xl.keys())

# Print the head of the first sheet (using its name, NOT its index)
print(xl['1700'].head())


## 1.5 HTTP requests to import files from the web

URL
- Uniform/Universal Resource Locator
- References to web resources
- Focus: web addresses
- Ingredients:
    - protocol identifier - http
    - resource name - datacamp.com
- These specify web addresses uniquely
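
The two ingredients listed above can be pulled out of a URL with the standard library. A small sketch, using the teach-page URL that appears later in these notes as the example input:

```python
from urllib.parse import urlparse

# Example URL borrowed from the exercise in section 1.5.c
url = 'http://www.datacamp.com/teach/documentation'

parts = urlparse(url)
print(parts.scheme)   # protocol identifier → http
print(parts.netloc)   # resource name (host) → www.datacamp.com
print(parts.path)     # path on that host → /teach/documentation
```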

HTTP
- HyperText Transfer Protocol
- foundation for data communication for the web
- HTTPS - more secure form of HTTP
- Going to a website = sending HTTP request
    - GET request
- urlretrieve() performs a GET request
- HTML - HyperText Markup Language
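
To see that a plain request built with urllib is a GET request, you can inspect the request object before anything is sent. A minimal sketch that does not touch the network:

```python
from urllib.request import Request

url = 'https://www.wikipedia.org/'

# Packaging the request does not send it yet
request = Request(url)

# With no data attached, urllib treats the request as a GET
method = request.get_method()
print(method)             # → GET
print(request.full_url)   # → https://www.wikipedia.org/
```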


### 1.5.a GET requests using urllib

In [None]:
from urllib.request import urlopen, Request
url = 'https://www.wikipedia.org/'
request = Request(url)
response = urlopen(request)
html = response.read()
response.close()
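
Note that `response.read()` returns raw bytes rather than a string. A minimal sketch of a variant that closes the response automatically and decodes the bytes; the helper name `fetch_html` and the UTF-8 default are assumptions made for illustration:

```python
from urllib.request import urlopen, Request

def fetch_html(url, encoding='utf-8'):
    """Send a GET request to `url` and return the response body as a str."""
    request = Request(url)
    # The with-block closes the response even if read() raises
    with urlopen(request) as response:
        raw = response.read()   # bytes
    return raw.decode(encoding)

# Example call (needs network access):
# html = fetch_html('https://www.wikipedia.org/')
```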

### 1.5.b GET requests using requests package - higher-level and simpler than urllib

In [None]:
import requests
url = 'https://www.wikipedia.org/'
# send request and catch the response
r = requests.get(url)
# returns HTML as a string
text = r.text
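
By default `requests.get()` returns a response object even when the server answers with an error status, so it can be worth checking the status before using the text. A hedged sketch; the helper name `get_text` and the 10-second timeout are illustrative choices, not part of the original notes:

```python
import requests

def get_text(url, timeout=10):
    """GET `url` and return the body as a string, raising on HTTP error codes."""
    r = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses
    r.raise_for_status()
    return r.text

# Example call (needs network access):
# text = get_text('https://www.wikipedia.org/')
```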

### 1.5.c Performing HTTP requests in Python using urllib
Now that you know the basics behind HTTP GET requests, it's time to perform some of your own. In this interactive exercise, you will ping our very own DataCamp servers to perform a GET request to extract information from our teach page, "http://www.datacamp.com/teach/documentation".

In the next exercise, you'll extract the HTML itself. Right now, however, you are going to package and send the request and then catch the response.
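
The three steps described above (package the request, send it, catch the response) can be sketched as follows. The helper name `describe_response` is hypothetical; it returns the type name of the response object so you can see what urllib hands back:

```python
from urllib.request import urlopen, Request

def describe_response(url):
    """Package a GET request, send it, and return the response's type name."""
    request = Request(url)         # package the request
    response = urlopen(request)    # send it and catch the response
    try:
        return type(response).__name__
    finally:
        response.close()           # always close the response

# Example call (needs network access):
# print(describe_response('http://www.datacamp.com/teach/documentation'))
```

For an HTTP or HTTPS URL, the object returned by `urlopen()` is an `http.client.HTTPResponse`.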


# 2. Interacting with APIs to import data from the web
- extract data from APIs
- OMDB and Library of Congress APIs


# 3. Diving deep into the Twitter API
- stream real-time Twitter data, analyze and visualize
