<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Make-a-project-directory" data-toc-modified-id="1.-Make-a-project-directory-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>1. Make a project directory</a></span></li><li><span><a href="#2.-Download-some-data" data-toc-modified-id="2.-Download-some-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>2. Download some data</a></span><ul class="toc-item"><li><span><a href="#a.-Download-a-file" data-toc-modified-id="a.-Download-a-file-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>a. Download a file</a></span></li></ul></li><li><span><a href="#3.-Get-data-in-and-out-of-Python" data-toc-modified-id="3.-Get-data-in-and-out-of-Python-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>3. Get data in and out of Python</a></span><ul class="toc-item"><li><span><a href="#a.-Read-in-to-Pandas" data-toc-modified-id="a.-Read-in-to-Pandas-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>a. Read in to Pandas</a></span></li><li><span><a href="#b.-Write-to-csv" data-toc-modified-id="b.-Write-to-csv-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>b. Write to csv</a></span></li></ul></li><li><span><a href="#4.-Query-an-API" data-toc-modified-id="4.-Query-an-API-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>4. Query an API</a></span><ul class="toc-item"><li><span><a href="#Query-via-URL" data-toc-modified-id="Query-via-URL-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Query via URL</a></span></li><li><span><a href="#Query-via-Python-package" data-toc-modified-id="Query-via-Python-package-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Query via Python package</a></span></li></ul></li><li><span><a href="#5.-Web-Scraping" data-toc-modified-id="5.-Web-Scraping-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>5. Web Scraping</a></span></li></ul></div>

In [None]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
%matplotlib inline

# Project workflow

An important component of this course is the project, which will integrate all of the topics we've learned over the past few weeks. Once we've built some basic Python skills, we can get started -- but there is still a lot on the practical side to learn.

The goal of this notebook is to provide a brief overview of resources that might help you on your project. Each project will have specific needs, and you might need to read different tutorials to achieve your goals. Here, I try to provide pointers to some useful resources.


## 1. Make a project directory

See notebooks:

- [01-Introduction_to_Python/L2-Folders.ipynb](https://github.com/khof312/Summer2018_ProfHoffmannPham/blob/master/01-Introduction_to_Python/L2-Folders.ipynb)
- [12-UNIX_Basics/A-Basic_Unix_Shell_Commands.ipynb](https://github.com/khof312/Summer2018_ProfHoffmannPham/blob/master/12-UNIX_Basics/A-Basic_Unix_Shell_Commands.ipynb)

Let's start by making a project directory. You can do this manually, or by using command-line tools. To run a shell command from within a Jupyter notebook, prefix the command with a `!`. We'll see three simple examples:
    
    pwd                    <- Print working directory
    mkdir [DIRECTORYNAME]  <- Make a new directory in the working directory
    ls    [DIRECTORYNAME]  <- List the contents of a directory; lists current working directory if blank

In [None]:
# Let's print our working directory using "pwd"
!pwd

In [None]:
# Let's make a new directory called "project" with the "mkdir" command
# (the -p option avoids an error if the directory already exists)
!mkdir -p project

In [None]:
# Let's list what's in our current working directory
# ...we should see the "project" folder
!ls

In [None]:
# Using the "ls" command, let's list what is in the folder
# ...nothing, for now
!ls project
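
The same three steps can also be done in pure Python, which is handy if your notebook needs to run somewhere the shell commands are not available. Below is a minimal sketch using the standard-library `pathlib` module; the helper name `make_project_dir` is made up for this illustration.

```python
from pathlib import Path


def make_project_dir(name="project", base=None):
    """Create the folder `name` inside `base` (default: the working directory).

    Hypothetical helper for illustration; exist_ok=True means re-running
    the cell does not raise an error if the folder already exists.
    """
    base = Path.cwd() if base is None else Path(base)
    path = base / name
    path.mkdir(exist_ok=True)
    return path


print(Path.cwd())                                      # like "pwd"
project_dir = make_project_dir()                       # like "mkdir project"
print(sorted(p.name for p in Path.cwd().iterdir()))    # like "ls"
print(sorted(p.name for p in project_dir.iterdir()))   # like "ls project"
```

`Path.iterdir()` yields the entries in a folder in arbitrary order, so the sketch sorts the names before printing them.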

## 2. Download some data
See notebooks:

- [12-UNIX_Basics/B-Fetching_Data_Using_CURL.ipynb](https://github.com/khof312/Summer2018_ProfHoffmannPham/blob/master/12-UNIX_Basics/B-Fetching_Data_Using_CURL.ipynb)

### a. Download a file

It's time to get some data! Let's download a CSV that I've posted on my website. We will use two more commands:

    curl [URL] -o [DIRECTORYNAME/FILENAME]   -> Save a URL's contents to a chosen directory and file
    head -5       [DIRECTORYNAME/FILENAME]   -> Print the first 5 lines of a file
    

In [None]:
# Use the "curl" command to download a URL
# Specify the -o option to name the output file
!curl "http://people.stern.nyu.edu/khoffman/intro_programming_datasci/assets/csv/trains.csv" -o project/trains.csv

In [None]:
# Using the "ls" command, let's list what is in the folder
# ...we got the new file!
!ls project

In [None]:
# Using the "head" command, let's inspect what's in the file
!head -5 project/trains.csv
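
If you prefer to stay in Python, the two commands above can be approximated with the standard library. This is only a sketch: `download` and `head` are hypothetical helper names, and `urllib.request.urlretrieve` is one of several ways to fetch a file (the `requests` library is another).

```python
from itertools import islice
from urllib.request import urlretrieve


def download(url, dest):
    """Save the contents of `url` to the local path `dest` (like `curl URL -o DEST`)."""
    path, _headers = urlretrieve(url, dest)
    return path


def head(path, n=5):
    """Return the first `n` lines of a text file (like `head -n`)."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Example usage (needs network access and an existing "project" folder):
# download("http://people.stern.nyu.edu/khoffman/intro_programming_datasci/assets/csv/trains.csv",
#          "project/trains.csv")
# for line in head("project/trains.csv"):
#     print(line)
```

Because `head` stops reading after `n` lines, it stays fast even on very large files.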

## 3. Get data in and out of Python
See notebooks:
- [03-Pandas](https://github.com/khof312/Summer2018_ProfHoffmannPham/tree/master/03-Pandas)


### a. Read in to Pandas

The `pandas` library is a very convenient tool for working with data. We will explore it in depth later. For now, let's just see an example of what it can do.

In [None]:
# Read our csv file in
trains = pd.read_csv("project/trains.csv", index_col=['route_id'])
trains

In [None]:
# Select a row
trains.loc['A']

In [None]:
# Select a column
trains[['route_long_name']]

In [None]:
# Add a new column
trains['age'] = 2018 - trains['line_introduced']

In [None]:
# Plot 
trains.plot(kind = 'scatter', 
            x ='line_introduced', 
            y ='age')

### b. Write to csv

In [None]:
trains.to_csv("project/trains_with_age_column.csv")
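
By default, `to_csv` writes the index as the first column of the file, so reading the file back with the same `index_col` restores the table. Here is a small round-trip sketch; the two rows are made-up toy values (not the contents of `trains.csv`), and the file is written to a temporary folder so nothing in `project` is touched.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data, made up for illustration; not the real trains.csv contents
toy = pd.DataFrame(
    {"route_id": ["X", "Y"], "line_introduced": [1950, 2000]}
).set_index("route_id")
toy["age"] = 2018 - toy["line_introduced"]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy_with_age.csv"
    toy.to_csv(out_path)                                       # index is written as the first column
    restored = pd.read_csv(out_path, index_col=["route_id"])   # ...and restored here

print(restored)
```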

## 4. Query an API

See notebooks:
- [05-APIs](https://github.com/khof312/Summer2018_ProfHoffmannPham/tree/master/04-WebAPIs)

### Query via URL

Instead of downloading a file, we might want to query an API directly.

Let's query OpenWeatherMap to get data about the weather ([documentation](http://openweathermap.org/current#geo)). Below is a URL that you can copy and paste into your browser to get the weather for New York. Notice that it contains parameters as part of the URL, including an `appid`, a key that identifies the calling application so that the service can limit the number of calls each application makes.

    http://api.openweathermap.org/data/2.5/weather?&appid=ffb7b9808e07c9135bdcc7d1e867253d&q=New%20York,NY,USA&mode=json 
    
Try the URL in your browser, then change the query parameter `q` from `New%20York,NY,USA` to a different location. (Note: `%20` is the URL encoding of the space character, which cannot appear literally in a URL.)
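
To see how the encoding works, here is a short sketch that builds a query string with the standard-library `urllib.parse` module. The value `YOUR_APPID` is a placeholder, and passing `safe=","` is a choice made here so the output matches the style of the URL above (by default, commas would also be percent-encoded).

```python
from urllib.parse import quote, urlencode

base_url = "http://api.openweathermap.org/data/2.5/weather"

# "YOUR_APPID" is a placeholder, not a working key
params = {"q": "New York,NY,USA", "mode": "json", "appid": "YOUR_APPID"}

# quote() turns the space into %20; safe="," leaves the commas readable
query = urlencode(params, quote_via=quote, safe=",")
full_url = f"{base_url}?{query}"

print(quote("New York"))   # New%20York
print(full_url)
```

Libraries such as `requests` do this encoding for you when you pass the parameters as a dictionary.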

In Python, we often query such URLs using the `requests` library.


In [None]:
import requests

openweathermap_url = "http://api.openweathermap.org/data/2.5/weather"
parameters = {
    'q'     : 'New York, NY, USA',
    'units' : 'imperial',
    'mode'  : 'json',
    'appid' : 'ffb7b9808e07c9135bdcc7d1e867253d'
}
resp = requests.get(openweathermap_url, params=parameters)
data = resp.json()
data

Here is a simple example that queries the API for each of the five boroughs.

In [None]:
for borough in ['Brooklyn', 'Queens', 'Bronx', 'Manhattan', 'Staten Island']:
    # Put borough into the query
    parameters = {'q'     : borough + ', NY, USA',
            'units' : 'imperial',
            'mode'  : 'json',
            'appid' : 'ffb7b9808e07c9135bdcc7d1e867253d'
            }
    resp = requests.get(openweathermap_url, params = parameters)
    data = resp.json()
    
    # Print the result
    print(f"The temperature in {borough} is {data['main']['temp']}")
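
The loop above assumes that every response contains `data['main']['temp']`. If a request fails (for example, because the key is rejected or the place is not found), the returned JSON may not have a `main` entry, and the lookup raises a `KeyError`. One defensive option is a small helper like the sketch below; `extract_temperature` is a hypothetical name, and both payloads are made up for illustration rather than copied from real API output.

```python
def extract_temperature(data):
    """Return data['main']['temp'] if present, otherwise None."""
    main = data.get("main") if isinstance(data, dict) else None
    if not isinstance(main, dict):
        return None
    return main.get("temp")


# Hypothetical payloads, made up for illustration
ok_payload = {"main": {"temp": 70.5}}
error_payload = {"cod": 401, "message": "Invalid API key"}

print(extract_temperature(ok_payload))      # 70.5
print(extract_temperature(error_payload))   # None
```

In the loop, you could then skip (or log) any borough for which the helper returns `None` instead of letting the whole cell fail.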

Let's do the same thing, but store it in a file:

In [None]:
# Let's open the file and write a row of headers
with open("project/weather.csv", "w") as f:
    f.write("borough,temperature\n")

In [None]:
# Now, let's loop through the boroughs and append one row per borough to the file
with open("project/weather.csv", "a") as f:
    
    for borough in ['Brooklyn', 'Queens', 'Bronx', 'Manhattan', 'Staten Island']:
        parameters = {'q'     : borough + ', NY, USA',
                'units' : 'imperial',
                'mode'  : 'json',
                'appid' : 'ffb7b9808e07c9135bdcc7d1e867253d'
                }
        resp = requests.get(openweathermap_url, params = parameters)
        data = resp.json()

        f.write(f"{borough},{data['main']['temp']}\n")

In [None]:
# Did it work?
!ls project

In [None]:
!head -5 project/weather.csv
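
Building each row by hand works for simple values like these. As an alternative, Python's built-in `csv` module writes and reads rows for you and takes care of quoting when a value itself contains a comma. The sketch below uses made-up temperatures (not real API output) and a temporary folder so it does not overwrite `project/weather.csv`.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical readings, made up for illustration (not real API output)
rows = [("Brooklyn", 71.2), ("Staten Island", 69.8)]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "weather.csv"

    # Write the header and the data rows
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["borough", "temperature"])
        writer.writerows(rows)

    # Read the file back; each row becomes a dictionary keyed by the header
    with open(path, newline="", encoding="utf-8") as f:
        records = list(csv.DictReader(f))

print(records)
```

Note that `csv.DictReader` returns every value as a string, so the temperatures would need `float(...)` before doing arithmetic on them.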

### Query via Python package

Here is a simple example using the Twitter API and the `tweepy` package.
`tweepy` makes it easy to access tweets and their contents in Python.

First, we need to authenticate ourselves:

In [None]:
import tweepy

In [None]:
# YOUR APPLICATION CREDENTIALS HERE
consumer_key = 'YOUR KEY'
consumer_secret = 'YOUR SECRET'

access_token = 'YOUR TOKEN'
access_secret = 'YOUR ACCESS SECRET'

In [None]:
# Then you need to supply your credentials to tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

# If the authentication was successful, you should see the name of the account print out
# (api.me() is the tweepy 3.x call; tweepy 4 removed it in favor of api.verify_credentials())
print("My name is", api.me().name)

In [None]:
# Let's get a tweet
api.search(q='from:NYCTSubway', count=1)

In [None]:
# What a mess! Let's print the text, created date, and the author
api.search(q='from:NYCTSubway', count=1)[0].text

In [None]:
api.search(q='from:NYCTSubway', count=1)[0].author.name

In [None]:
api.search(q='from:NYCTSubway', count=1)[0].created_at

The specific syntax for API calls, and accessing results, will vary by API and by library. When starting out, it's often best to search for online examples of how others have used the library, and then modify or simplify their code to achieve your task.

For now, let's just complete our example by pulling up to 50 recent tweets from the NYCT Subway account and storing them in a CSV.

In [None]:
# Let's open the file and write a row of headers
with open("project/mta_tweets.csv", "w") as f:
    f.write("author,created_at,text\n")

In [None]:
# Now, let's loop through the tweets and write them to file
import csv

with open("project/mta_tweets.csv", "a", newline="") as f:
    writer = csv.writer(f)  # csv.writer quotes any commas or newlines inside the tweet text
    for t in api.search(q='from:NYCTSubway', count=50):
        if not t.text.startswith("@"):  # Let's make a small filter to skip replies to other users
            writer.writerow([t.author.name, str(t.created_at), t.text])

In [None]:
# Did it work?
!ls project
!head -5 project/mta_tweets.csv

## 5. Web Scraping

Finally, we can scrape a web page. This is conceptually the most difficult approach, and it is almost guaranteed to be the most time-consuming.

[**_BeautifulSoup_**](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) is an incredible open-source Python library for pulling information out of a web page. You can use it to extract tables, lists, and paragraphs, and you can apply filters to select only the elements you need.

For more information on Web Scraping, see:
- [06-Web Scraping](https://github.com/khof312/Summer2018_ProfHoffmannPham/tree/master/06-Web_Scraping)
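
To give a flavor of what this looks like, here is a minimal sketch that extracts the rows of a table. The HTML is a tiny made-up snippet standing in for a downloaded page (in practice you would fetch the page first, for example with `requests.get(url).text`), and the table `id` and its contents are invented for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet standing in for a downloaded page
html = """
<table id="lines">
  <tr><th>Route</th><th>Name</th></tr>
  <tr><td>X</td><td>Example Local</td></tr>
  <tr><td>Y</td><td>Example Express</td></tr>
</table>
"""

# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(html, "html.parser")

# Find the table by its id, then collect the text of every cell, row by row
table = soup.find("table", id="lines")
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```

Real pages are rarely this tidy, which is why most scraping time goes into inspecting the page source to find the right tags and attributes to filter on.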
