<a href="https://colab.research.google.com/github/kilos11/Data-Science-from-Scratch_-First-Principles-with-Python-by-Joel-Graus/blob/main/Chapter_9_Getting_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# stdin and stdout




In [None]:
#here is a script that reads in lines of text and
#spits back out the ones that match a regular expression:
# egrep.py
import sys, re

# sys.argv is the list of command-line arguments
# sys.argv[0] is the name of the program itself
# sys.argv[1] will be the regex specified at the command line
regex = sys.argv[1]

# for every line passed into the script
for line in sys.stdin:
    # if it matches the regex, write it to stdout
    if re.search(regex, line):
        sys.stdout.write(line)

#And here’s one that counts the lines it receives and then writes out the count:
# line_count.py
import sys

# sys.stdin is the standard input stream
count = 0
for line in sys.stdin:
    count += 1

# print goes to sys.stdout
print (count)
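
On the command line, scripts like these are chained with the pipe character, for example `cat SomeFile.txt | python egrep.py "[0-9]" | python line_count.py` to count the lines that contain digits (the file name is a placeholder). The sketch below mimics that pipeline inside a single process so it can be tried in a notebook. The names `grep_lines` and `count_lines` and the sample text are illustrative assumptions, not part of the original scripts.

```python
import io
import re

def grep_lines(regex, lines):
    """Yield only the lines that match regex (the job of egrep.py)."""
    for line in lines:
        if re.search(regex, line):
            yield line

def count_lines(lines):
    """Count the lines received (the job of line_count.py)."""
    return sum(1 for _ in lines)

# io.StringIO stands in for sys.stdin; the sample text is made up
fake_stdin = io.StringIO("alpha\nroom 101\nbeta\n42 is the answer\n")

# chaining the two functions plays the role of the shell pipe
digit_line_count = count_lines(grep_lines(r"[0-9]", fake_stdin))
print(digit_line_count)
```

Because both functions accept any iterable of lines, the same code would work unchanged with `sys.stdin` in a real script.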


# Reading Files
**The Basics of Text Files**


In [None]:
import re
from collections import Counter

# 'r' means read-only
file_for_reading = open('reading_file.txt', 'r')

# 'w' is write—will destroy the file if it already exists!
file_for_writing = open('writing_file.txt', 'w')

# 'a' is append—for adding to the end of the file
file_for_appending = open('appending_file.txt', 'a')

# don't forget to close your files when you're done
file_for_reading.close()
file_for_writing.close()
file_for_appending.close()

#Because it is easy to forget to close your files, you should always use them in a with
#block, at the end of which they will be closed automatically:

with open('reading_file.txt', 'r') as f:
    data = function_that_gets_data_from(f)

# at this point f has already been closed, so don't try to use it
process(data)

#If you need to read a whole text file, you can just iterate over the lines of the file using
#for:
starts_with_hash = 0

with open('input.txt', 'r') as f:
    for line in f:  # iterate over the open file object, one line at a time
        if re.match(r"^#", line):  # use a regex to see if it starts with '#'
            starts_with_hash += 1

def get_domain(email_address):
    """split on '@' and return the last piece"""
    return email_address.split('@')[-1]

with open('email_addresses.txt', 'r') as f:
    domain_counts = Counter(get_domain(line.strip())
                             for line in f if '@' in line)
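
The domain-counting snippet above assumes that `email_addresses.txt` already exists. The following sketch, offered only as an illustration, first writes a small temporary file and then reads it back inside `with` blocks. The sample addresses and the temporary path are made up for the example.

```python
import os
import tempfile
from collections import Counter

def get_domain(email_address):
    """Split on '@' and return the last piece."""
    return email_address.split('@')[-1]

# hypothetical sample data; one line deliberately has no '@'
sample = "joel@example.com\nalice@example.org\nnot an address\nbob@example.com\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "email_addresses.txt")
    with open(path, "w") as f:
        f.write(sample)
    with open(path, "r") as f:
        domain_counts = Counter(get_domain(line.strip())
                                for line in f
                                if "@" in line)

# both with blocks have ended, so the file is closed
print(f.closed)
print(domain_counts.most_common())
```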






# **Delimited Files**

In [None]:
# If we had a tab-delimited file of stock prices
# (the columns below are separated by tab characters in the real file):
#
# 6/20/2014    AAPL    90.91
# 6/20/2014    MSFT    41.68
# 6/20/2014    FB      64.5
# 6/19/2014    AAPL    91.86
# 6/19/2014    MSFT    41.51
# 6/19/2014    FB      64.34

import csv

# in Python 3 the csv module expects a text-mode file opened with newline=''
with open('tab_delimited_stock_prices.txt', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        date = row[0]
        symbol = row[1]
        closing_price = float(row[2])
        print(date, symbol, closing_price)


# If your file has headers:
#
# date:symbol:closing_price
# 6/20/2014:AAPL:90.91
# 6/20/2014:MSFT:41.68
# 6/20/2014:FB:64.5
#
# you can either skip the header row (with an initial call to next(reader)) or get each row
# as a dict (with the headers as keys) by using csv.DictReader:
with open('colon_delimited_stock_prices.txt', 'r', newline='') as f:
    reader = csv.DictReader(f, delimiter=':')
    for row in reader:
        date = row["date"]
        symbol = row["symbol"]
        closing_price = float(row["closing_price"])
        process(date, symbol, closing_price)
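
The two readers above need files on disk. As a rough, self-contained sketch, the same calls can be tried against in-memory text with `io.StringIO`. The variable names and the shortened sample rows are assumptions made for the example.

```python
import csv
import io

# tab-delimited sample; "\t" is the tab character between fields
tab_data = "6/20/2014\tAAPL\t90.91\n6/20/2014\tMSFT\t41.68\n"
rows = []
for date, symbol, price in csv.reader(io.StringIO(tab_data), delimiter="\t"):
    rows.append((date, symbol, float(price)))  # csv yields strings, so convert

# colon-delimited sample with a header row
colon_data = "date:symbol:closing_price\n6/20/2014:AAPL:90.91\n6/20/2014:FB:64.5\n"
dict_rows = list(csv.DictReader(io.StringIO(colon_data), delimiter=":"))

# each row is a dict keyed by the header names
closing = {row["symbol"]: float(row["closing_price"]) for row in dict_rows}

print(rows)
print(closing)
```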



# **You can similarly write out delimited data using csv.writer:**

In [None]:
import csv

today_prices = { 'AAPL' : 90.91, 'MSFT' : 41.68, 'FB' : 64.5 }

# in Python 3, open in text mode with newline='' and let csv.writer handle line endings
with open('comma_delimited_stock_prices.txt', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    for stock, price in today_prices.items():
        writer.writerow([stock, price])  # csv.writer converts non-string values itself




csv.writer will do the right thing if your fields themselves have commas in them. Your
own hand-rolled writer probably won’t. For example, if you attempt:


In [None]:
results = [["test1", "success", "Monday"],
["test2", "success, kind of", "Tuesday"],
["test3", "failure, kind of", "Wednesday"],
["test4", "failure, utter", "Thursday"]]

# don't do this!
with open('bad_csv.txt', 'w') as f:
    for row in results:
        f.write(",".join(map(str, row))) # might have too many commas in it!
        f.write("\n") # row might have newlines as well!







In [None]:
with open('good_csv.txt', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(results)
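
To see why `csv.writer` is the safer choice, the sketch below writes two of the rows to an in-memory buffer and reads them back, then compares that with the hand-rolled join. The buffer and the variable names are assumptions for illustration.

```python
import csv
import io

results = [["test1", "success", "Monday"],
           ["test2", "success, kind of", "Tuesday"]]

# csv.writer quotes any field that contains the delimiter
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(results)
csv_text = buffer.getvalue()
print(csv_text)

# reading the text back recovers the original three fields per row
round_trip = list(csv.reader(io.StringIO(csv_text, newline="")))

# the hand-rolled join loses the field boundary inside "success, kind of"
naive_line = ",".join(results[1])
naive_fields = next(csv.reader([naive_line]))

print(len(round_trip[1]), len(naive_fields))
```

The second row comes back with three fields through `csv.writer` but four through the naive join.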

# **Scraping the Web**
### **HTML and the Parsing Thereof**


In [None]:
!pip install beautifulsoup4
!pip install requests
!pip install html5lib





In [None]:
from bs4 import BeautifulSoup
import requests
html = requests.get("https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-small").text
soup = BeautifulSoup(html, 'html5lib')
#print(soup)


#We’ll typically work with Tag objects, which correspond to the tags representing the
#structure of an HTML page.
#For example, to find the first <p> tag (and its contents) you can use:
first_paragraph = soup.find('p') # or just soup.p
print(first_paragraph)

#You can get the text contents of a Tag using its text property:
first_paragraph_text = soup.p.text
first_paragraph_words = soup.p.text.split()

#And you can extract a tag’s attributes by treating it like a dict:
first_paragraph_id = soup.p['id'] # raises KeyError if no 'id'
first_paragraph_id2 = soup.p.get('id') # returns None if no 'id'

#You can get multiple tags at once:
all_paragraphs = soup.find_all('p') # or just soup('p')
paragraphs_with_ids = [p for p in soup('p') if p.get('id')]

#Frequently you’ll want to find tags with a specific class:
important_paragraphs = soup('p', {'class' : 'important'})
important_paragraphs2 = soup('p', 'important')
important_paragraphs3 = [p for p in soup('p')
if 'important' in p.get('class', [])]

#And you can combine these to implement more elaborate logic. For example, if you want
#to find every <span> element that is contained inside a <div> element, you could do this:
# warning, will return the same span multiple times
# if it sits inside multiple divs
# be more clever if that's the case
spans_inside_divs = [span
for div in soup('div') # for each <div> on the page
for span in div('span')] # find each <span> inside it
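
The calls above run against a live page, whose contents can change at any time. The following sketch repeats a few of them on a tiny inline HTML string so the results are predictable. The HTML is made up, and it uses Python's built-in `'html.parser'` instead of `'html5lib'` only so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# hypothetical page, written only for this example
html = """
<html><body>
  <p id="intro">Hello <b>world</b></p>
  <p class="important">Read this</p>
  <div><span>a</span><div><span>b</span></div></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_id = soup.p.get("id")            # attribute lookup that tolerates a missing 'id'
first_words = soup.p.text.split()      # text of the first <p>, split into words
important = [p.text for p in soup("p", "important")]

# the nested <span> sits inside two <div>s, so it is returned twice
spans = [span.text for div in soup("div") for span in div("span")]

# one way to avoid the duplicates: remember which Tag objects were already seen
seen = set()
unique_spans = []
for div in soup("div"):
    for span in div("span"):
        if id(span) not in seen:
            seen.add(id(span))
            unique_spans.append(span.text)

print(first_id, first_words, important, spans, unique_spans)
```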



## **Example: O’Reilly Books About Data**


In [None]:
import re  # used below to clean up the author and ISBN strings

# you don't have to split the url like this unless it needs to fit in a book
url = "http://shop.oreilly.com/category/browse-subjects/" + \
"data.do?sortby=publicationDate&page=1"
soup = BeautifulSoup(requests.get(url).text, 'html5lib')

#A good first step is to find all of the td thumbtext tag elements:
tds = soup('td', 'thumbtext')
print (len(tds))

#Next we’d like to filter out the videos.
def is_video(td):
    """it's a video if it has exactly one pricelabel, and if
       the stripped text inside that pricelabel starts with 'Video'"""
    pricelabels = td('span', 'pricelabel')
    return (len(pricelabels) == 1 and
            pricelabels[0].text.strip().startswith("Video"))


print (len([td for td in tds if not is_video(td)]))

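#Now we’re ready to start pulling data out of the td elements. It looks like the book title is
#the text inside the <a> tag inside the <div class="thumbheader">.
#(In the lines that follow, td stands for a single element of tds. Run at the top level of
#this cell before td is assigned, they raise a NameError.)
title = td.find("div", "thumbheader").a.text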

#The author(s) are in the text of the AuthorName <div>. They are prefaced by a By (which
#we want to get rid of) and separated by commas (which we want to split out, after which
#we’ll need to get rid of spaces):
author_name = td.find('div', 'AuthorName').text
authors = [x.strip() for x in re.sub("^By ", "", author_name).split(",")]

#The ISBN seems to be contained in the link that’s in the thumbheader <div>:
isbn_link = td.find("div", "thumbheader").a.get("href")
# re.match captures the part of the regex in parentheses
# (a raw string keeps the backslash in \. intact for the regex engine)
isbn = re.match(r"/product/(.*)\.do", isbn_link).group(1)

#And the date is just the contents of the <span class="directorydate">:
date = td.find("span", "directorydate").text.strip()

#Let’s put this all together into a function:
def book_info(td):
    """given a BeautifulSoup <td> Tag representing a book,
    extract the book's details and return a dict"""
    title = td.find("div", "thumbheader").a.text
    by_author = td.find('div', 'AuthorName').text
    authors = [x.strip() for x in re.sub("^By ", "", by_author).split(",")]
    isbn_link = td.find("div", "thumbheader").a.get("href")
    isbn = re.match(r"/product/(.*)\.do", isbn_link).groups()[0]
    date = td.find("span", "directorydate").text.strip()

    return {
        "title" : title,
        "authors" : authors,
        "isbn" : isbn,
        "date" : date }

#And now we’re ready to scrape:
from bs4 import BeautifulSoup
import requests
from time import sleep
base_url = "http://shop.oreilly.com/category/browse-subjects/" + \
"data.do?sortby=publicationDate&page="

books = []
NUM_PAGES = 31
for page_num in range(1,NUM_PAGES+1):
    print("souping page", page_num, ",", len(books), " found so far")
    url = base_url + str(page_num)
    soup = BeautifulSoup(requests.get(url).text, 'html5lib')
    for td in soup('td', 'thumbtext'):
        if not is_video(td):
            books.append(book_info(td))
    sleep(30)


0
0


NameError: name 'td' is not defined
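
The string handling inside `book_info` can be checked without fetching any page. The sketch below applies the same two regular expressions to plain strings. The author line and the link are invented placeholders shaped like the values the text describes, not data taken from the site.

```python
import re

# hypothetical inputs, made up for illustration
author_name = "By Joel Grus, Jane Doe"
isbn_link = "/product/0000000000000.do"

# drop the leading "By ", split on commas, and strip the leftover spaces
authors = [x.strip() for x in re.sub(r"^By ", "", author_name).split(",")]

# the parentheses capture whatever sits between "/product/" and ".do"
match = re.match(r"/product/(.*)\.do", isbn_link)
isbn = match.group(1) if match else None  # guard against links that don't match

print(authors)
print(isbn)
```

Checking `match` before calling `.group(1)` avoids an `AttributeError` when a link has an unexpected shape.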

In [None]:
# you don't have to split the url like this unless it needs to fit in a book
url = "http://shop.oreilly.com/category/browse-subjects/" + \
"data.do?sortby=publicationDate&page=1"
soup = BeautifulSoup(requests.get(url).text, 'html5lib')

#A good first step is to find all of the td thumbtext tag elements:
tds = soup('td', 'thumbtext')
print (len(tds))

#Next we’d like to filter out the videos.
def is_video(td):
    """it's a video if it has exactly one pricelabel, and if
       the stripped text inside that pricelabel starts with 'Video'"""
    pricelabels = td('span', 'pricelabel')
    return (len(pricelabels) == 1 and
            pricelabels[0].text.strip().startswith("Video"))


print (len([td for td in tds if not is_video(td)]))

#Now we’re ready to start pulling data out of the td elements. It looks like the book title is
#the text inside the <a> tag inside the <div class="thumbheader">:
for td in tds:
    if not is_video(td):
        title = td.find("div", "thumbheader").a.text

        #The author(s) are in the text of the AuthorName <div>. They are prefaced by a By (which
        #we want to get rid of) and separated by commas (which we want to split out, after which
        #we’ll need to get rid of spaces):
        author_name = td.find('div', 'AuthorName').text
        authors = [x.strip() for x in author_name.split('\n') if x.strip()]

0
0



# **Using APIs**

In [3]:
#JSON (and XML)
import json

serialized = """{ "title" : "Data Science Book",
                "author" : "Joel Grus",
                "publicationYear" : 2014,
                "topics" : [ "data", "science", "data science"] }"""

# parse the JSON to create a Python dict
deserialized = json.loads(serialized)
if "data science" in deserialized["topics"]:
    print (deserialized)





{'title': 'Data Science Book', 'author': 'Joel Grus', 'publicationYear': 2014, 'topics': ['data', 'science', 'data science']}
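
`json.loads` turns a JSON string into Python objects, and `json.dumps` goes the other way. A minimal round-trip sketch, reusing the same book record (the variable names are chosen for the example):

```python
import json

book = {"title": "Data Science Book",
        "author": "Joel Grus",
        "publicationYear": 2014,
        "topics": ["data", "science", "data science"]}

# serialize the dict to a JSON string
serialized = json.dumps(book)

# parse it back; the result equals the original dict
round_trip = json.loads(serialized)

print(type(serialized).__name__)
print(round_trip == book)
```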


In [7]:
!pip install python-dateutil




## **Using an Unauthenticated API**

In [9]:
import requests, json
from collections import Counter

endpoint = "https://api.github.com/users/kilo11/repos"
repos = json.loads(requests.get(endpoint).text)
print(repos)


#GitHub returns dates as strings. The python-dateutil library installed above can parse them,
#and you’ll probably only ever need its dateutil.parser.parse function:
from dateutil.parser import parse

dates = [parse(repo["created_at"]) for repo in repos]
month_counts = Counter(date.month for date in dates)
weekday_counts = Counter(date.weekday() for date in dates)
print(dates)
print(month_counts)
print(weekday_counts)

[{'id': 64323887, 'node_id': 'MDEwOlJlcG9zaXRvcnk2NDMyMzg4Nw==', 'name': 'effective-octo-rotary-phone', 'full_name': 'kilo11/effective-octo-rotary-phone', 'private': False, 'owner': {'login': 'kilo11', 'id': 20686427, 'node_id': 'MDQ6VXNlcjIwNjg2NDI3', 'avatar_url': 'https://avatars.githubusercontent.com/u/20686427?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/kilo11', 'html_url': 'https://github.com/kilo11', 'followers_url': 'https://api.github.com/users/kilo11/followers', 'following_url': 'https://api.github.com/users/kilo11/following{/other_user}', 'gists_url': 'https://api.github.com/users/kilo11/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/kilo11/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/kilo11/subscriptions', 'organizations_url': 'https://api.github.com/users/kilo11/orgs', 'repos_url': 'https://api.github.com/users/kilo11/repos', 'events_url': 'https://api.github.com/users/kilo11/events{/privacy}', 'received_even
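
The counting step can be tried offline as well. This sketch uses a few invented records that carry only a `created_at` field in the timestamp format GitHub returns. To stay within the standard library, it swaps `dateutil.parser.parse` for `datetime.strptime` with an explicit format string; the repository names and dates are hypothetical.

```python
from collections import Counter
from datetime import datetime

# hypothetical records shaped like GitHub's repo objects (only the fields used here)
repos = [
    {"name": "repo-a", "created_at": "2016-07-27T15:42:06Z"},
    {"name": "repo-b", "created_at": "2016-07-30T09:00:00Z"},
    {"name": "repo-c", "created_at": "2017-01-02T12:30:00Z"},
]

# stdlib stand-in for dateutil.parser.parse; the format string is fixed by hand
dates = [datetime.strptime(repo["created_at"], "%Y-%m-%dT%H:%M:%SZ")
         for repo in repos]

month_counts = Counter(date.month for date in dates)
weekday_counts = Counter(date.weekday() for date in dates)  # Monday is 0

print(month_counts)
print(weekday_counts)
```

`dateutil` remains the more convenient choice when the timestamp format is not known in advance.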

# **Finding APIs**

If you need data from a specific site, look for a developers or API section of the site for
details, and try searching the Web for “python __ api” to find a library. There is a Rotten
Tomatoes API for Python. There are multiple Python wrappers for the Klout API, for the
Yelp API, for the IMDB API, and so on.

If you’re looking for lists of APIs that have Python wrappers, two directories are at Python
API and Python for Beginners.

If you want a directory of web APIs more broadly (without Python wrappers necessarily),
a good resource is Programmable Web, which has a huge directory of categorized APIs.

And if after all that you can’t find what you need, there’s always scraping, the last refuge
of the data scientist.

# Example: Using the Twitter APIs


In [10]:
!pip install twython


Collecting twython
  Downloading twython-3.9.1-py3-none-any.whl (33 kB)
Installing collected packages: twython
Successfully installed twython-3.9.1


In [None]:
#Getting Credentials
