# Importing Data From the Web
---

You can import data from the web in several ways, including scraping pages or simply downloading files.

## Importing Using `urllib`

**Functions**

| Functions | Description | Syntax |
|---|---|---|
| `urlretrieve` | Downloads a file from a URL. | `urllib.request.urlretrieve(url, filename)` |
| `urlopen` | Opens a URL for reading. | `urllib.request.urlopen(url)` |

**Arguments**

| Arguments | Description | Syntax |
|---|---|---|
| `url` | The URL of the file to download or open. | `urllib.request.urlretrieve(url, ...)` or `urllib.request.urlopen(url)` |
| `filename` | The filename to save the downloaded file as. | `urllib.request.urlretrieve(..., filename)` |
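
As a rough illustration of `urlretrieve`, here is a minimal sketch that runs without network access. It assumes a local `file://` URL stands in for a real `http(s)://` address; the file names `source.csv` and `downloaded.csv` and the sample rows are made up for the example.

```python
import csv
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

# Create a small CSV locally so the example does not need the network.
# (Hypothetical data; a real URL would point at a file on a web server.)
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.csv"
source.write_text("name,score\nada,90\nlinus,85\n", encoding="utf-8")

# A file:// URL stands in for a real http(s):// URL here.
url = source.as_uri()

# urlretrieve fetches the resource and saves it under the given filename.
# It returns the local path and the response headers.
local_path, headers = urlretrieve(url, str(workdir / "downloaded.csv"))

# Read the saved copy back to confirm the download worked.
with open(local_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(local_path)
print(rows)
```

With a real web address, only the `url` line would change.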

If you only want to load a CSV into a DataFrame without saving it locally, pass the URL as the first argument to `pd.read_csv()`.

## Using HTTP Request (GET) to Get Data From the Web

Suppose we want the HTML of a web page. To get it, we send an HTTP GET request to the server, which answers with a response containing the page. The following functions from two packages can be used to send the request:

**urllib Functions**

| Functions | Description | Syntax |
|---|---|---|
| `Request` | Creates a request object. | `urllib.request.Request(url)` |
| `urlopen` | Opens a URL and returns a response object. | `urllib.request.urlopen(request)` |
| `read` | Reads the contents of a response object. | `response.read()` |

**Do not forget to close the response** (call `response.close()`, or open it in a `with` block so it is closed automatically).

**urllib Arguments**

| Arguments | Description | Syntax |
|---|---|---|
| `url` | The URL to send the request to. | `urllib.request.Request(url)` |
| `request` | The request object. | `urllib.request.urlopen(request)` |

**EXAMPLE:**
```python
from urllib.request import urlopen, Request
url = 'some url'
request = Request(url)
response = urlopen(request)
html = response.read()
response.close()
```
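
The same flow can also be written with a `with` block, which closes the response even if reading fails. This is a minimal sketch, assuming a local `file://` URL stands in for a real web address so it runs offline; `page.html` and its contents are made up for the example.

```python
import tempfile
from pathlib import Path
from urllib.request import Request, urlopen

# Hypothetical local page so the example does not need the network.
page = Path(tempfile.mkdtemp()) / "page.html"
page.write_text("<html><body><p>Hello</p></body></html>", encoding="utf-8")
url = page.as_uri()

request = Request(url)

# The with block closes the response automatically.
with urlopen(request) as response:
    raw = response.read()  # read() returns bytes, not str
    # Use the charset from the headers if present, else assume UTF-8.
    charset = response.headers.get_content_charset() or "utf-8"

html = raw.decode(charset)
print(type(raw), type(html))
print(html)
```

Note that `read()` returns bytes, so the content has to be decoded before it can be treated as text.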



**requests Functions**

| Functions | Description | Syntax |
|---|---|---|
| `requests.get` | Sends a GET request and returns a response object. | `requests.get(url)` |
| `text` | Returns the content of the response, in unicode. | `response.text` |

**requests Arguments**

| Arguments | Description | Syntax |
|---|---|---|
| `url` | The URL to send the request to. | `requests.get(url)` |

**EXAMPLE:**
```python
import requests
url = 'some url'
r = requests.get(url)
text = r.text
```

## Scraping The Web with BeautifulSoup

A typical BeautifulSoup workflow is:
- fetch the page's HTML with `requests`
- parse that HTML with `BeautifulSoup`

Example:
```python
import requests
from bs4 import BeautifulSoup

url = 'someurl.com'
r = requests.get(url)

html_doc = r.text

soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly

html_pretty = soup.prettify()
```

**Functions**

| Functions | Description | Syntax |
|---|---|---|
| `BeautifulSoup` | Creates a BeautifulSoup object from HTML or XML. | `BeautifulSoup(html_content, parser)` |
| `prettify` | Formats the HTML into a more readable structure. | `soup.prettify()` |
| `title` | Extracts the title tag from the HTML. | `soup.title` |
| `get_text` | Extracts the text content from the HTML. | `soup.get_text()` |
| `find_all` | Finds all elements matching a specified tag or criteria. | `soup.find_all(tag)` |

**Arguments**

| Arguments | Description | Syntax | Example Values |
|---|---|---|---|
| `html_content` | The HTML or XML string to parse. | `BeautifulSoup(html_content, ...)` | A string containing HTML or XML code. |
| `parser` | The parser to use (e.g., 'html.parser', 'lxml', 'xml'). | `BeautifulSoup(..., parser)` | `'html.parser'`, `'lxml'`, `'xml'`, `'html5lib'` |
| `tag` | The HTML tag to search for. | `soup.find_all(tag)` | `'a'`, `'p'`, `'div'`, `'h1'`, etc. |
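
The functions in the tables above can be tried on an inline HTML string, without fetching anything. This is a minimal sketch; the HTML document and its URLs are made up for the example, and it assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML document used only for illustration.
html_doc = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Links</h1>
    <p>Two example links:</p>
    <a href="https://example.com/a">First</a>
    <a href="https://example.com/b">Second</a>
  </body>
</html>
"""

# 'html.parser' ships with Python, so no extra parser is needed.
soup = BeautifulSoup(html_doc, "html.parser")

# soup.title is the <title> tag; .string gives the text inside it.
page_title = soup.title.string

# find_all returns every matching tag; get() reads an attribute.
a_tags = soup.find_all("a")
links = [a.get("href") for a in a_tags]
link_texts = [a.get_text() for a in a_tags]

# get_text() on the whole soup returns all text with the tags stripped.
text = soup.get_text()

print(page_title)
print(links)
print(link_texts)
```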

## Working with JSON
The `json` package is typically used with JSON files. Because the data lives in a file, open it the usual way (ideally in a `with` block) and let `json.load()` parse its contents into Python objects.

```python
import json
with open('fileName.json', 'r') as json_file:
    json_data = json.load(json_file)

# use the data
print(type(json_data))
```
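
To see what `json.load()` gives back, a file can be written and read again in one go. This is a minimal sketch; the file name `example.json` and the data in it are made up for the example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical data and file name, used only for illustration.
data = {"name": "example", "tags": ["a", "b"], "count": 3}
path = Path(tempfile.mkdtemp()) / "example.json"

# json.dump writes a Python object to an open file as JSON text.
with open(path, "w", encoding="utf-8") as json_file:
    json.dump(data, json_file)

# json.load reads JSON text from an open file into Python objects.
with open(path, "r", encoding="utf-8") as json_file:
    json_data = json.load(json_file)

# JSON objects become dicts, arrays become lists, numbers become int/float.
for key, value in json_data.items():
    print(key, type(value).__name__)
```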

## Working with API

What is it:
- Code that allows programs to communicate with each other
- A set of protocols and routines, usually implemented in code
- If you are expecting `JSON`, for example, the data arrives as a string; use `json.loads(string_json)` to turn it into a dictionary you can work with.
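
The string-to-dictionary step from the last point can be sketched as follows. The payload here is invented and only loosely shaped like a movie record; a real API response will have its own keys.

```python
import json

# Hypothetical JSON string, standing in for the body of an API response.
string_json = (
    '{"Title": "Example Movie", "Year": "2000", '
    '"Ratings": [{"Source": "Example", "Value": "7/10"}]}'
)

# json.loads ("load string") parses a JSON string into Python objects.
json_data = json.loads(string_json)

print(type(json_data))
print(json_data["Title"])

# Nested JSON arrays and objects become lists and dicts.
first_rating = json_data["Ratings"][0]["Value"]
print(first_rating)

# json.dumps goes the other way: Python objects back to a JSON string.
round_trip = json.dumps(json_data)
```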

For example, to connect to the API of the OMDB website, which offers `JSON` data for movies, you could do this:

```python
import json
import requests

# The query string (?t=hackers) follows the format given in the website's API documentation
url = 'http://www.omdbapi.com/?t=hackers'
r = requests.get(url)
json_data = r.json()  # decodes the JSON response body into a dict

for key, value in json_data.items():
    # values are not always strings, so let print() do the conversion
    print("key:", key, "value:", value)
```
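
The part of the URL after `?` is a query string of `key=value` pairs. As a minimal sketch of how such a URL is put together, the standard library can build it from a dictionary and take it apart again. Only the `t` parameter from the example above is used; any other parameters the API expects would come from its documentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "http://www.omdbapi.com/"

# Query parameters as a dict; 't' is the parameter used in the example above.
params = {"t": "hackers"}

# urlencode turns the dict into 'key=value' pairs joined by '&'.
url = base_url + "?" + urlencode(params)
print(url)

# Going the other way: split the URL and parse the query string.
# parse_qs maps each key to a list, since a key may appear more than once.
query = parse_qs(urlsplit(url).query)
print(query)
```

Building the query from a dictionary also takes care of escaping characters such as spaces in the values.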

## Streaming Data from Twitter API

Twitter provides APIs for accessing tweets. When working with its Stream API, the usual flow is:
- import `tweepy` and `json`
- store your authentication credentials
- create a stream object, which provides the methods for working with the Stream API
- gather data

```python
# Step 1
import tweepy, json

# Step 2
access_token = '...'
access_token_secret = '...'
consumer_key = '...'
consumer_secret = '...'

# Step 3
stream = tweepy.Stream(consumer_key, consumer_secret,
                       access_token, access_token_secret)

# Step 4 (filters Twitter streams to capture data by keywords)
stream.filter(track = ['apples', 'oranges'])
```

This is an example workflow:
```python
# Import package
import json

# Assume the JSON strings from the stream were saved to a text file,
# one tweet per line
tweets_data_path = 'tweets.txt'

# Initialize empty list to store tweets: tweets_data
tweets_data = [] # we want to store the dictionaries/JSON in here

# Open the file; the with block closes it automatically
with open(tweets_data_path, "r") as tweets_file:
    # Read in tweets and store in list: tweets_data
    for line in tweets_file:
        tweet = json.loads(line) # transforms the str JSON to a dict
        tweets_data.append(tweet)

# Print the keys of the first tweet dict
print(tweets_data[0].keys())

# Import package
import pandas as pd

# Build DataFrame of tweet texts and languages
# (a list of dicts can be passed directly; columns= selects which keys to keep)
df = pd.DataFrame(tweets_data, columns=['text', 'lang'])

# Print head of DataFrame
print(df.head())

```
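
Since `tweets.txt` is not included here, the line-by-line parsing step can be tried on made-up data. This is a minimal sketch using only the standard library: the sample tweets are invented, an in-memory `io.StringIO` stands in for the file, and the language count at the end is just one simple way to inspect the result.

```python
import io
import json
from collections import Counter

# Hypothetical stand-in for tweets.txt: one JSON object per line.
sample_file = io.StringIO(
    '{"text": "I like apples", "lang": "en"}\n'
    '{"text": "Me gustan las naranjas", "lang": "es"}\n'
    '\n'
    '{"text": "Oranges are great", "lang": "en"}\n'
)

tweets_data = []
for line in sample_file:
    line = line.strip()
    if not line:  # skip blank lines, which json.loads would reject
        continue
    tweets_data.append(json.loads(line))  # str JSON -> dict

# Count how many tweets there are per language.
lang_counts = Counter(tweet["lang"] for tweet in tweets_data)

print(len(tweets_data))
print(lang_counts)
```

Skipping blank lines is a defensive choice: a stray empty line in the file would otherwise make `json.loads` raise an error.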