~~~
filename = 'huck_finn.txt'

file = open(filename, mode='r') # 'r' is to read

text = file.read()

file.close()

print(text)
~~~

#### Context Manager

~~~
with open('huck_finn.txt','r') as file:
	print(file.read())
~~~
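
The point of the context manager is that the file is closed as soon as the `with` block ends, with no explicit `close()` call. A minimal sketch that makes this visible, using a throwaway file (the file name and contents are invented for illustration, not the course's `huck_finn.txt`):

```python
import os
import tempfile

# Create a throwaway text file (hypothetical content) so the example is self-contained
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example.txt')
with open(path, 'w') as file:
    file.write('first line\nsecond line\n')

# Read it back; the file is closed automatically when the block exits
with open(path, 'r') as file:
    text = file.read()
    closed_inside = file.closed   # still open inside the block

closed_after = file.closed        # closed once the block has exited

print(text)
print(closed_inside, closed_after)
```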

**Flat files**: basic text files containing records. That is, table data.

~~~
import pandas as pd

dic = {'Name': ['Ana', 'Bob', 'Carol'], 'Age': [13, 28, 55]}
data = pd.DataFrame(dic)

print(data.to_numpy())

# Output:
# [['Ana' 13]
#  ['Bob' 28]
#  ['Carol' 55]]
~~~
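
Because a flat file is plain text with one record per line, it can also be parsed without pandas. A minimal sketch using the standard-library `csv` module on an in-memory string; the data mirrors the table above, and the comma-separated layout with a header row is an assumption:

```python
import csv
import io

# Hypothetical flat file: a header row followed by one record per line
flat_file = io.StringIO('Name,Age\nAna,13\nBob,28\nCarol,55\n')

reader = csv.DictReader(flat_file)   # maps each row to the header fields
records = [row for row in reader]

# csv returns every field as a string, so numeric columns need converting
ages = [int(row['Age']) for row in records]

print(records[0]['Name'])
print(ages)
```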


**Pickled files**:

~~~
import pickle

with open('pickled_fruit.pkl','rb') as file:
	data = pickle.load(file)

print(data)
~~~
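
A pickle file has to be written before it can be loaded. A minimal sketch of the full round trip, assuming a small dictionary as the pickled object (its contents are invented for illustration) and a temporary directory for the file:

```python
import os
import pickle
import tempfile

# Hypothetical object to serialize
fruit = {'peaches': 13, 'apples': 4, 'oranges': 11}

path = os.path.join(tempfile.mkdtemp(), 'pickled_fruit.pkl')

# 'wb' = write binary; pickle files hold bytes, not text
with open(path, 'wb') as file:
    pickle.dump(fruit, file)

# 'rb' = read binary
with open(path, 'rb') as file:
    data = pickle.load(file)

print(data)
print(type(data))
```

Only unpickle files from sources you trust: loading a pickle can execute arbitrary code.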

**Excel files**:

~~~
import pandas as pd

file = 'urbanpop.xlsx'

data = pd.ExcelFile(file)

print(data.sheet_names)

df1 = data.parse('1960-1966')
df2 = data.parse(0)
~~~

Listing the working directory:

~~~
import os
wd = os.getcwd()
os.listdir(wd)
~~~

#### SAS and Stata files

~~~
import pandas as pd

# SAS files

from sas7bdat import SAS7BDAT

with SAS7BDAT('urbanpop.sas7bdat') as file:
	df_sas = file.to_data_frame()

# Stata files

data = pd.read_stata('urbanpop.dta')
~~~

#### HDF5 files

Standard for storing large quantities of numerical data.

Can scale to exabytes.

~~~
import h5py

filename = 'H-H1_LOSC_4_V1-815411200-4096.hdf5'

data = h5py.File(filename,'r')

print(type(data))

for key in data.keys():
	print(key) # meta, quality, strain

~~~

#### MATLAB files

~~~
scipy.io.loadmat() # read .mat files

scipy.io.savemat() # write .mat files
~~~

~~~
import scipy.io

filename = 'workspace.mat'

mat = scipy.io.loadmat(filename)

print(type(mat)) # dict!

# keys: MATLAB var names
# values: objects assigned to vars
~~~

### Intro to relational DBs

Using SQLAlchemy:

~~~
from sqlalchemy import create_engine, inspect

engine = create_engine('sqlite:///Northwind.sqlite')

# engine.table_names() was removed in SQLAlchemy 2.0; use the inspector instead
table_names = inspect(engine).get_table_names()
~~~

**Workflow of SQL querying**:

- Import packages and functions
- Create the database engine
- Connect to the engine
- Query the database
- Save query results to a DataFrame
- Close the connection

Connecting...

~~~
from sqlalchemy import create_engine, text

import pandas as pd

engine = create_engine('sqlite:///Northwind.sqlite')

con = engine.connect()

rs = con.execute(text('SELECT * FROM Orders')) # text() is required in SQLAlchemy 2.0

df = pd.DataFrame(rs.fetchall())

df.columns = rs.keys()

con.close()
~~~
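
The workflow steps can be tried without the Northwind file. A minimal sketch that swaps SQLAlchemy for the standard-library `sqlite3` module and an in-memory database; the `Orders` rows are invented for illustration:

```python
import sqlite3

# Create the database and connect to it (':memory:' = temporary, held in RAM)
con = sqlite3.connect(':memory:')

# Set up a tiny, hypothetical Orders table to query
con.execute('CREATE TABLE Orders (OrderID INTEGER, ShipName TEXT)')
con.executemany('INSERT INTO Orders VALUES (?, ?)',
                [(1, 'Alpha Foods'), (2, 'Beta Market'), (3, 'Gamma Deli')])

# Query the database
cursor = con.execute('SELECT * FROM Orders')

# Save the results: rows as tuples, column names from the cursor
rows = cursor.fetchall()
columns = [col[0] for col in cursor.description]

# Close the connection
con.close()

print(columns)
print(rows)
```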

Using context manager:

~~~
from sqlalchemy import create_engine, text

import pandas as pd

engine = create_engine('sqlite:///Northwind.sqlite')

with engine.connect() as con:
    rs = con.execute(text('SELECT OrderID, OrderDate, ShipName FROM Orders'))

    df = pd.DataFrame(rs.fetchmany(size=5)) # first 5 rows only

    df.columns = rs.keys()

~~~

**The Pandas Way**:

~~~
from sqlalchemy import create_engine

import pandas as pd

engine = create_engine('sqlite:///Northwind.sqlite')

df = pd.read_sql_query('SELECT * FROM Orders',engine)
~~~

**INNER JOIN in Python**:
~~~
df = pd.read_sql_query('SELECT OrderID, CompanyName FROM Orders INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID', engine)
~~~
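
An `INNER JOIN` keeps only the rows whose key appears in both tables. A minimal sketch of that behaviour, again assuming the standard-library `sqlite3` module and two small invented tables rather than the real Northwind data:

```python
import sqlite3

con = sqlite3.connect(':memory:')

# Two small, hypothetical tables sharing a CustomerID column
con.execute('CREATE TABLE Customers (CustomerID TEXT, CompanyName TEXT)')
con.execute('CREATE TABLE Orders (OrderID INTEGER, CustomerID TEXT)')
con.executemany('INSERT INTO Customers VALUES (?, ?)',
                [('A', 'Alpha Foods'), ('B', 'Beta Market'), ('C', 'Gamma Deli')])
con.executemany('INSERT INTO Orders VALUES (?, ?)',
                [(1, 'A'), (2, 'B'), (3, 'A'), (4, 'Z')])

query = '''
SELECT OrderID, CompanyName
FROM Orders
INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID
ORDER BY OrderID
'''
joined = con.execute(query).fetchall()
con.close()

print(joined)
```

Order 4 is dropped because customer `Z` has no row in `Customers`, and customer `C` does not appear because it has no orders.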

#### The urllib package

- provides interface for fetching data across the web
- urlopen() - accepts URLs instead of file names

~~~
from urllib.request import urlretrieve

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'

urlretrieve(url,'winequality-white.csv') # saves locally
~~~

or

~~~
df = pd.read_csv(url,sep=';')
~~~

#### URL

- Uniform/Universal Resource Locator
- References to web resources
- Focus: web addresses
- Ingredients:
	- Protocol identifier - http:
	- Resource name - datacamp.com
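
The two ingredients can be pulled out of a URL programmatically. A minimal sketch using `urllib.parse.urlsplit`; the variable names are chosen for illustration:

```python
from urllib.parse import urlsplit

url = 'http://datacamp.com/'
parts = urlsplit(url)

protocol = parts.scheme        # protocol identifier
resource_name = parts.netloc   # resource name (host)

print(protocol)
print(resource_name)
```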

#### HTTP

- HyperText Transfer Protocol
- Foundation of data communication for the web
- HTTPS - more secure form of HTTP
- Going to a website = sending HTTP request
	- GET request
- urlretrieve() performs a GET request
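
That GET is the default can be checked without touching the network. A minimal sketch, assuming only `urllib.request.Request`: building the object does not send anything, and `get_method()` reports which HTTP method would be used.

```python
from urllib.request import Request

url = 'https://www.wikipedia.org/'

# Building a Request object does not contact the server
request = Request(url)
method = request.get_method()            # GET when no data is attached

# Attaching a body switches the default method to POST
post_request = Request(url, data=b'key=value')
post_method = post_request.get_method()

print(method, post_method)
```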

**GET requests using urllib**

~~~
from urllib.request import urlopen, Request

url = 'https://www.wikipedia.org/'

request = Request(url)

response = urlopen(request)

html = response.read() # returns the HTML as bytes; call .decode() to get a string

response.close()
~~~

**GET requests using requests**

~~~
import requests

url = 'https://www.wikipedia.org/'

r = requests.get(url)

text = r.text # HTML as a string
~~~

#### Beautiful Soup

- Parse and extract structured data from HTML
- Make tag soup beautiful and extract information

~~~
from bs4 import BeautifulSoup

import requests

url = 'https://www.crummy.com/software/BeautifulSoup/'

r = requests.get(url)

html_doc = r.text

soup = BeautifulSoup(html_doc, 'html.parser') # name the parser explicitly to avoid a warning
print(soup.prettify())
~~~

~~~

# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the title of Guido's webpage
print(soup.title)

# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
~~~
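
To see what `find_all('a')` and `link.get('href')` are doing, the same link extraction can be sketched with the standard-library `html.parser` instead of Beautiful Soup. This is a simplified stand-in, not how Beautiful Soup is implemented; the `LinkCollector` class and the HTML snippet are invented for illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs
            self.links.append(dict(attrs).get('href'))


# Hypothetical HTML snippet standing in for a downloaded page
html_doc = '''
<html><head><title>Example page</title></head>
<body>
<a href="https://www.python.org/">Python</a>
<a href="pics.html">Pictures</a>
<a name="anchor-only">No link</a>
</body></html>
'''

parser = LinkCollector()
parser.feed(html_doc)

for link in parser.links:
    print(link)
```

As with `link.get('href')`, an `a` tag without an `href` attribute yields `None`.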

#### APIs

- Application Programming Interface
- Protocols and routines
	- Building and interacting with software applications

#### JSONs

- JavaScript Object Notation
- Real-time server-to-browser communication
- Human readable

~~~
import json

with open('snakes.json','r') as json_file:
	json_data = json.load(json_file) # dict

~~~
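
The same parsing works on a string, which is what an API response usually provides. A minimal sketch with `json.loads` and `json.dumps`; the JSON text is invented for illustration:

```python
import json

# Hypothetical JSON text, as it might be stored in a file or returned by an API
json_text = '{"name": "python", "venomous": false, "length_m": 4.5, "prey": ["rats", "birds"]}'

# loads() parses a string; load() reads from a file object
json_data = json.loads(json_text)

for key, value in json_data.items():
    print(key + ':', value)

# JSON types map to Python types: object -> dict, array -> list, false -> False
round_trip = json.dumps(json_data)   # back to a JSON string
```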

**Connecting to an API in Python**

~~~
import requests

url = 'http://www.omdbapi.com/?t=hackers'

r = requests.get(url)

json_data = r.json()

for key, value in json_data.items():
	print(key + ' : ', value)
~~~

**What was that URL?**

- http - making an HTTP request
- www.omdbapi.com - querying the OMDB API
- ?t=hackers
	- Query string
	- Return data for a movie with title (t) 'Hackers'
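
Query strings do not have to be typed by hand. A minimal sketch that builds the URL above from a dictionary and takes it apart again with `urllib.parse`; the variable names are chosen for illustration:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = 'http://www.omdbapi.com/'
params = {'t': 'hackers'}

# Build the URL from its parts
url = base_url + '?' + urlencode(params)

# Take an existing URL apart again
query = urlsplit(url).query
parsed = parse_qs(query)          # values come back as lists

print(url)
print(parsed)
```

`requests.get` accepts the same dictionary through its `params` argument and builds the query string itself.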

#### Access the Twitter API

**Using Tweepy: Authentication handler**

~~~
import tweepy, json

access_token = '...'
access_token_secret = '...'
consumer_key = '...'
consumer_secret = '...'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
~~~

**Tweepy: define stream listener class**

~~~
class MyStreamListener(tweepy.StreamListener): # Tweepy 3.x API; StreamListener was merged into Stream in Tweepy 4
	def __init__(self, api=None):
		super(MyStreamListener, self).__init__()
		self.num_tweets = 0
		self.file = open('tweets.txt','w')

	def on_status(self, status):
		tweet = status._json
		# Write one JSON object per line
		self.file.write(json.dumps(tweet) + '\n')
		self.num_tweets += 1
		if self.num_tweets < 100:
			return True
		# Stop after 100 tweets; close the file before ending the stream
		self.file.close()
		return False
~~~

**Using Tweepy: stream tweets**

~~~
l = MyStreamListener()
stream = tweepy.Stream(auth,l)

stream.filter(track=['apples','oranges'])
~~~
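
The listener writes one JSON object per line to `tweets.txt`, so the file can be read back line by line. A minimal sketch, using an in-memory stand-in for the file with two invented tweets (real tweets carry many more fields):

```python
import io
import json

# Stand-in for tweets.txt: one JSON object per line (invented sample tweets)
sample_file = io.StringIO(
    '{"text": "I like apples", "lang": "en"}\n'
    '{"text": "Oranges are better", "lang": "en"}\n'
)

tweets_data = []
for line in sample_file:
    line = line.strip()
    if line:                       # skip blank lines
        tweets_data.append(json.loads(line))

texts = [tweet['text'] for tweet in tweets_data]
print(texts)
```

Passing `tweets_data` to `pd.DataFrame` gives a frame with a `text` column, which is the shape the counting example below expects.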

**Example**:

~~~
import re

def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)

    if match:
        return True
    return False


# Initialize list to store tweet counts
[clinton, trump, sanders, cruz] = [0, 0, 0, 0]

# Iterate through df, counting the number of tweets in which
# each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])


# Import packages
import matplotlib.pyplot as plt
import seaborn as sns

# Set seaborn style
sns.set(color_codes=True)

# Create a list of labels: cd
cd = ['clinton', 'trump', 'sanders', 'cruz']

# Plot bar chart (recent seaborn versions require x and y as keyword arguments)
ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()

~~~