# Command Line Interface (CLI) & Data collection

## GUI? CLI?

- **Graphical User Interface (GUI)**:  
    interaction via graphical objects  
    e.g., Microsoft Windows and Apple OS X

- **Command Line Interface (CLI)**:  
    interaction via commands typed into shell  
    e.g., bash, zsh, tcsh, etc.

- Shell is often accessed by terminal [Terminal in Jupyter]    

- A GUI is simple for everyday use, but repetitive tasks are hard to automate with it

- A CLI is more cumbersome for everyday use, but it is scriptable

## Basic Shell Usage

- Common shell commands for interactions outside programming environment
    - Downloading files from a URL
    - Inspect, search, and replace text in files
    - Chaining commands together for sequential processing

- Shell and IPython (Jupyter notebook)
    - Reading shell command output into python variable
    - Passing python string back to shell command
    - IRS zip code data example: parsing website, extracting URL, and downloading all files

- Accessing NBA data
    - Understanding GET URL structure
    - JSON data format
    - Reading JSON data into python
    - Creating Pandas data frame

## Shell commands

### Commonly used commands for text files

- `cat`: prints content of a file
- `head`: prints first few lines of a file
- `sed`: (stream editor) edits text
- `paste`: pastes text files side-by-side
- `cut`: processes columns in delimited text file
- `find`: searches file system
- `grep`: searches text given regular expression pattern
- Many more!

### Anatomy of shell commands

Here is a simple shell command:

In [None]:
! cat --help

1. `cat`: program name

2. `[OPTION]`: controls program behavior

3. `[FILE]`: file to read from; if omitted, `cat` reads from standard input
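To make the three parts concrete, here is a minimal Python sketch that splits a command string into tokens with `shlex.split`, much as a shell would before starting the program. The file name `notes.txt` is a placeholder, and sorting tokens by a leading dash is a simplification that only works for options that take no value.

```python
import shlex

# notes.txt is a placeholder file name used only for illustration
command = "cat -n notes.txt"
tokens = shlex.split(command)

program = tokens[0]                                        # the program name
options = [t for t in tokens[1:] if t.startswith("-")]     # flags controlling behavior
files = [t for t in tokens[1:] if not t.startswith("-")]   # files to read from

print(program, options, files)
```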

### References to learn shell command line

- [Software Carpentry Lessons](https://software-carpentry.org/lessons/)

- [Unix Power Tools](https://ucsb-primo.hosted.exlibrisgroup.com/primo-explore/fulldisplay?docid=01UCSB_ALMA51295276690003776&context=L&vid=UCSB&search_scope=default_scope&tab=default_tab&lang=en_US)

- [Explain Shell](https://explainshell.com/)

### Example: Downloading Files

- URLs of files are directly visible (e.g., Github)

- `wget` is simple and effective download tool

- Example: https://github.com/fivethirtyeight/data

- "Raw" button is the URL for actual file

- Take the candy ratings data: https://github.com/fivethirtyeight/data/tree/master/candy-power-ranking

- `wget` can be used to download files to course jupyterhub

In [None]:
%%bash
wget https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv
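The same download can be sketched in plain Python, which is handy where `wget` is not installed. The helper name `download` is made up for this example; it wraps `urllib.request.urlretrieve` from the standard library. The call itself is left commented out because it needs network access.

```python
from urllib.request import urlretrieve

def download(url, filename):
    """Save the resource at `url` to `filename` and return the local path."""
    path, _headers = urlretrieve(url, filename)
    return path

# Example call (needs network access):
# download(
#     "https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv",
#     "candy-data.csv",
# )
```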

### Example: Viewing file contents 

In [None]:
%%bash
head candy-data.csv

In [None]:
! head candy-data.csv ## also works

In [None]:
! head -n 1 candy-data.csv  ## first line is the header

In [None]:
! wc -l candy-data.csv      ## counts lines in text file

In [None]:
! cut -d',' -f1,3 candy-data.csv    ## prints columns of delimited text

In [None]:
! grep 'Tootsie' candy-data.csv      ## finds lines with pattern (regular expression)
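For comparison, here is a rough Python counterpart of each command above. It runs on a small made-up table held in a string, not on the real candy data, so the column names and values are purely illustrative.

```python
import csv
import re

# Made-up stand-in for a delimited text file (not the candy data)
text = (
    "name,color,price\n"
    "apple,red,3\n"
    "banana,yellow,1\n"
    "cherry,red,5\n"
)
lines = text.splitlines()

header = lines[0]                                   # like: head -n 1
n_lines = len(lines)                                # like: wc -l
rows = list(csv.reader(lines))
cols_1_3 = [(r[0], r[2]) for r in rows]             # like: cut -d',' -f1,3
matches = [l for l in lines if re.search("red", l)] # like: grep 'red'

print(header, n_lines, cols_1_3, matches, sep="\n")
```

The shell versions stream through the file and never hold it all in memory, which is one reason they stay useful for very large files.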

### Chaining commands together

- Commands can be chained together using "pipes"

- Many shell commands send their output to "stdout" (standard output; by default, printed to the screen)

- A pipe (`|`) feeds the "stdout" of one command into the next command as "stdin" (standard input).

- Hence, we can make commands such as the following

In [None]:
! head -n1 candy-data.csv

In [None]:
! head -n1 candy-data.csv | sed 's/,/\n/g'

In [None]:
! head -n1 candy-data.csv | sed 's/,/\n/g' | sed 's/chocolate/CHOCOLATE/g'
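The pipeline idea, where each stage consumes the previous stage's output, can be mimicked in Python by nesting function calls. This is only an analogy: the functions `one_per_line` and `shout_chocolate` are made-up names, and the header string is a shortened, illustrative stand-in for the first line of the file.

```python
# Shortened, illustrative header line (not the full header of candy-data.csv)
header = "name,chocolate,fruity"

def one_per_line(text):
    """Like: sed 's/,/\\n/g' (put each field on its own line)."""
    return text.replace(",", "\n")

def shout_chocolate(text):
    """Like: sed 's/chocolate/CHOCOLATE/g'."""
    return text.replace("chocolate", "CHOCOLATE")

# Output of the first stage becomes input of the second, as with a pipe
result = shout_chocolate(one_per_line(header))
print(result)
```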

### Example: Text file download, search, and manipulation

Commands like `grep`, `sed` and `awk` enable on-the-fly text processing.

In [None]:
%%bash

wget -q -O - https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi \
#     | grep 'zipcode.zip' \
#     | sed 's/<a data/\n<a data/g' \
#     | grep -Po '(?<=href=")[^"]*(?=")'

## Shell and Jupyter

- Shell and Jupyter can be used together, and this becomes even more interesting.

- Grab a webpage,

- Extract all links,

- Filter file links that end with `zipcode.zip`, and

- Download all such files

In [None]:
files = !wget -q -O - https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi | grep 'zipcode.zip' | sed 's/<a data/\n<a data/g' | grep -Po '(?<=href=")[^"]*(?=")'
files
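The link-extraction step can also be sketched with Python's standard `html.parser` in place of `grep` and `sed`. This is an alternative technique, not what the cell above does. The HTML fragment and the `example.com` addresses are hypothetical, and `LinkCollector` is a name made up for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page fragment; the real page is much larger
page = (
    '<p><a data-x="1" href="https://example.com/files/2015zipcode.zip">2015</a>'
    '<a href="https://example.com/about.html">About</a>'
    '<a data-x="2" href="https://example.com/files/2016zipcode.zip">2016</a></p>'
)

collector = LinkCollector()
collector.feed(page)

# Keep only links that end with zipcode.zip
zip_links = [u for u in collector.links if u.endswith("zipcode.zip")]
print(zip_links)
```

A parser is slower to write than a one-line pipeline, but it does not depend on how the page happens to break its lines.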

### Python variables into shell

In [None]:
for f in files[:3]:
    ! wget {f}        ## pass python variables into shell!
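The `!` and `{f}` syntax is an IPython feature and is not available in a plain Python script. A rough equivalent, assuming Python 3.7 or later, uses `subprocess.run`: the command is a list of arguments (so Python variables drop straight in), and the captured output is split into lines, similar to what `files = !...` returns. The helper name `run_lines` is made up, and the child process here is just another Python interpreter so that the sketch runs anywhere.

```python
import subprocess
import sys

def run_lines(args):
    """Run a command given as a list of arguments; return stdout as a list of lines."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

word = "hello"  # a Python variable passed to the command as an argument

# The child program prints its first argument, then the same text in upper case
child_code = "import sys; print(sys.argv[1]); print(sys.argv[1].upper())"
lines = run_lines([sys.executable, "-c", child_code, word])
print(lines)
```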

## Deciphering the NBA stats API

![](https://cdn.nba.net/nba-drupal-prod/styles/landscape_1045w/s3/2017-07/NBA%20Secondary%20Logo.jpg)

- The NBA provides a nice website: [http://stats.nba.com](http://stats.nba.com)

- For example, in order to navigate to the shooting records for Stephen Curry, you navigate their menus to get to here:

> [http://stats.nba.com/player/201939/shooting/?Season=2017-18&SeasonType=Regular%20Season](http://stats.nba.com/player/201939/shooting/?Season=2017-18&SeasonType=Regular%20Season)

Here, our choices show up as parameters:
- Season: 2017-18
- SeasonType: Regular Season ([%20 is character code for space](https://en.wikipedia.org/wiki/Percent-encoding#Character_data))
- Player: 201939 (less obvious)
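These pieces can be pulled out of the URL with Python's standard `urllib.parse`. The sketch below assumes the player ID is always the second path segment, which holds for this URL but is not a documented rule.

```python
from urllib.parse import urlparse, parse_qs

url = "http://stats.nba.com/player/201939/shooting/?Season=2017-18&SeasonType=Regular%20Season"

parts = urlparse(url)
query = parse_qs(parts.query)   # decodes %20 back into a space

# Assumption: the path looks like /player/<id>/shooting/
player_id = parts.path.strip("/").split("/")[1]

print(player_id)
print(query)
```

Note that `parse_qs` returns a list for each key, because a parameter may appear more than once in a query string.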

### GET method

- This URL uses [GET method](https://www.w3schools.com/tags/ref_httpmethods.asp)

- GET method passes parameters in the URL

- Long URLs are usually passing a series of variables and values to target page

- Sometimes cryptic: [https://www.google.com/maps/place/M+Special+Brewing+Company/@34.4302877,-119.8723167,15z/data=!4m5!3m4!1s0x80e940babfb897db:0x261e47c5399139d!8m2!3d34.4327838!4d-119.8685351](https://www.google.com/maps/place/M+Special+Brewing+Company/@34.4302877,-119.8723167,15z/data=!4m5!3m4!1s0x80e940babfb897db:0x261e47c5399139d!8m2!3d34.4327838!4d-119.8685351)

- Tools such as [online URL parser](https://www.freeformatter.com/url-parser-query-string-splitter.html) can decipher common format

- Try passing in the URL.
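Going the other way, a query string can be assembled from a dictionary with `urllib.parse.urlencode`. A small sketch using the two parameters from the NBA URL above: by default spaces become `+`, and passing `quote_via=quote` produces the `%20` form instead.

```python
from urllib.parse import urlencode, quote

params = {"Season": "2017-18", "SeasonType": "Regular Season"}

plus_style = urlencode(params)                      # spaces encoded as +
percent_style = urlencode(params, quote_via=quote)  # spaces encoded as %20

print(plus_style)
print(percent_style)
```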

Knowledge of how web sites work is useful for data science since there is so much interaction through the web.

### Example: Collect all player information

- NBA doesn't officially publish their API (application programming interface); however,

- Community has reverse engineered it: e.g., https://github.com/swar/nba_api

- Scraping using `wget` is easy

In [None]:
useragent = "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9\""
playerurl = "\"http://stats.nba.com/stats/commonallplayers?LeagueID=00&Season=2017-18&IsOnlyCurrentSeason=1\""

# json_str = !wget -q -O - --user-agent={useragent} {playerurl}                           # if NBA doesn't cooperate 
json_str = !cat commonallplayers\?LeagueID\=00\&Season\=2017-18\&IsOnlyCurrentSeason\=1   # saved from earlier

- `playerurl`: url to download data from

- `useragent`: suitable string to imitate a browser. Websites can return browser-dependent content 

- NBA blocks programmatic scraping of its website by plain `wget`; however,

- Specifying user agent string makes `wget` pretend that we are using a Mozilla-type browser on OS X
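The same idea in plain Python is to set a `User-Agent` header on the request. The sketch below only builds the request object with `urllib.request.Request`; sending it is left commented out because it needs network access and the site may still refuse. The helper name `make_request` is made up for this example.

```python
from urllib.request import Request

def make_request(url, user_agent):
    """Build a request that identifies itself with the given user agent string."""
    return Request(url, headers={"User-Agent": user_agent})

# No shell involved here, so the extra escaped quotes are not needed
useragent = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 "
             "(KHTML, like Gecko) Version/9.0.2 Safari/601.3.9")
playerurl = "http://stats.nba.com/stats/commonallplayers?LeagueID=00&Season=2017-18&IsOnlyCurrentSeason=1"

req = make_request(playerurl, useragent)

# To actually send it (needs network access):
# from urllib.request import urlopen
# json_text = urlopen(req).read().decode("utf-8")
```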

### JavaScript Object Notation (JSON) format

- One of the widely used standards in data formats

- Usually plain text file with python dictionary-like formatting:  
    `{"key":"value"}`

- Can be nested:  
    `{"key0":{"key1":"value1", "key2":"value2"}}`
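A quick way to see the dictionary-like structure is to parse the nested example with Python's `json` module and index into the result. This is a small self-contained sketch that does not use the NBA data.

```python
import json

text = '{"key0":{"key1":"value1", "key2":"value2"}}'

obj = json.loads(text)        # JSON text -> Python dict
inner = obj["key0"]["key1"]   # nested lookup, one key per level
back = json.dumps(obj)        # Python dict -> JSON text

print(type(obj).__name__, inner)
print(back)
```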

In [None]:
json_str[0]

- In fact, Jupyter notebooks are in json format.

In [None]:
! head 03-Command-Line-and-Data-collection.ipynb

### Parsing JSON

- Raw JSON is in a string

- Needs to be parsed to Python dictionary: i.e., keys and values.

- Parse `json_str` string with the `json` module

In [None]:
import json
data = json.loads(json_str[0])
data

In [None]:
data.keys() ## we specified 'resource' and 'parameters' 

In [None]:
data['resultSets'][0].keys() ## 'resultSets' contains returned results

In [None]:
data['resultSets'][0]

### Importing data into Pandas

In [None]:
import pandas as pd

h = data['resultSets'][0]['headers']
d = data['resultSets'][0]['rowSet']
players = pd.DataFrame(d, columns=h)
players.head()
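To see how `headers` and `rowSet` line up without needing the saved file or pandas, here is a small sketch on a made-up payload with the same shape. The column names and values are hypothetical. Each row is paired with the headers, which is conceptually what `pd.DataFrame(d, columns=h)` does when it labels the columns.

```python
# Hypothetical payload mimicking the shape of the response; values are made up
data = {
    "resultSets": [
        {
            "headers": ["ID", "NAME", "ACTIVE"],
            "rowSet": [[1, "Player A", 1], [2, "Player B", 0]],
        }
    ]
}

h = data["resultSets"][0]["headers"]
d = data["resultSets"][0]["rowSet"]

# Pair every row with the column names
records = [dict(zip(h, row)) for row in d]
print(records)
```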

- What other data can we download using these types of URLs? See the [community documentation](https://github.com/seemethere/nba_py/wiki/stats.nba.com-Endpoint-Documentation).

### Analyzing Shot Data

- Let's analyze [shot chart data](https://github.com/seemethere/nba_py/wiki/stats.nba.com-Endpoint-Documentation#shotchartdetail)

- Test with a browser: the site kindly tells us [which parameters are required if none are passed](http://stats.nba.com/stats/shotchartdetail)

- First, download [team data](https://github.com/seemethere/nba_py/wiki/stats.nba.com-Endpoint-Documentation#commonteamyears)

In [None]:
from urllib.parse import urlencode      ## urlencode builds parameter string for us
from urllib.request import urlretrieve

params = {'LeagueID':'00'}
teamurl = 'http://stats.nba.com/stats/commonTeamYears?' + urlencode(params)
!wget -q -O - --user-agent={useragent} {teamurl}

### Scraping Function

Now that we know what a general request looks like, we can create a function to make our requests simpler.

The function will do the following:
1. Set User Agent
1. Set base URL with appropriate end point
1. Set parameters required for query
1. Read JSON string into python variable
1. Parse JSON string into python object
1. Convert the objects into a pandas data frame

In [None]:
def get_nba_data(endpt, params, return_url=False):

    ## endpt: https://github.com/seemethere/nba_py/wiki/stats.nba.com-Endpoint-Documentation
    ## params: dictionary of parameters: i.e., {'LeagueID':'00'}
    from pandas import DataFrame
    from urllib.parse import urlencode
    import json
    
    useragent = "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9\""

    dataurl = "\"" + "http://stats.nba.com/stats/" + endpt + "?" + urlencode(params) + "\""
    
    # for debugging: just return the url
    if return_url:
        return(dataurl)
    
    jsonstr = !wget -q -O - --user-agent={useragent} {dataurl} ## Note: ! doesn't work in plain Python
    
    data = json.loads(jsonstr[0])
    
    h = data['resultSets'][0]['headers']
    d = data['resultSets'][0]['rowSet']
    
    return(DataFrame(d, columns=h))
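Because `!wget` only works inside IPython, a plain-Python variant may be useful in scripts. The sketch below swaps `wget` for `urllib.request`. It is not the function above: the name `get_nba_json` and the `timeout` argument are additions made for this example, it returns the parsed dictionary instead of a data frame, and whether the site answers at all is not guaranteed.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

USERAGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 "
             "(KHTML, like Gecko) Version/9.0.2 Safari/601.3.9")

def get_nba_json(endpt, params, return_url=False, timeout=30):
    """Build the request URL for an endpoint and return the parsed JSON response."""
    # No shell is involved, so the URL needs no surrounding quotes
    dataurl = "http://stats.nba.com/stats/" + endpt + "?" + urlencode(params)

    # For debugging: just return the URL
    if return_url:
        return dataurl

    req = Request(dataurl, headers={"User-Agent": USERAGENT})
    with urlopen(req, timeout=timeout) as response:   # needs network access
        return json.loads(response.read().decode("utf-8"))

print(get_nba_json("commonTeamYears", {"LeagueID": "00"}, return_url=True))
```

The headers and rows can then be taken from `['resultSets'][0]` of the returned dictionary, exactly as in the function above.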

### Testing the Scraping Function: Team data

To see what URL string is returned, set `return_url=True`.

In [None]:
params = {'LeagueID':'00'}
get_nba_data('commonTeamYears', params, return_url=True)

The function can also return a pandas data frame

In [None]:
params = {'LeagueID':'00'}
# teamdata = get_nba_data('commonTeamYears', params) # if NBA doesn't cooperate 
teamdata = pd.read_pickle('commonTeamYears.pkl')     # saved from earlier
teamdata.head()

### Testing the Scraping Function: Player data

- Endpoint is here: https://github.com/seemethere/nba_py/wiki/stats.nba.com-Endpoint-Documentation#commonallplayers

In [None]:
params = {'LeagueID':'00', 'Season': '2017-18', 'IsOnlyCurrentSeason': '0'}
# plyrdata = get_nba_data('commonallplayers', params) # if NBA doesn't cooperate
plyrdata = pd.read_pickle('commonallplayers.pkl')     # saved from earlier
plyrdata.head()

### Testing the Scraping Function: Shotchart data

In [None]:
params = {'PlayerID':'201935',
          'PlayerPosition':'',
          'Season':'2017-18',
          'ContextMeasure':'FGA',
          'DateFrom':'',
          'DateTo':'',
          'GameID':'',
          'GameSegment':'',
          'LastNGames':'0',
          'LeagueID':'00',
          'Location':'',
          'Month':'0',
          'OpponentTeamID':'0',
          'Outcome':'',
          'Period':'0',
          'Position':'',
          'RookieYear':'',
          'SeasonSegment':'',
          'SeasonType':'Regular Season',
          'TeamID':'0',
          'VsConference':'',
          'VsDivision':''}

# shotdata = get_nba_data('shotchartdetail', params) # if NBA doesn't cooperate
shotdata = pd.read_pickle('shotchartdetail.pkl')     # saved from earlier
shotdata.head()
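Since most of these parameters stay the same from one query to the next, one option is to keep them as defaults and fill in only what changes. The helper `shot_params` and the constant `SHOT_DEFAULTS` are made-up names for this sketch; the default values are copied from the dictionary above.

```python
# Defaults copied from the shot chart query above
SHOT_DEFAULTS = {
    'PlayerPosition': '', 'ContextMeasure': 'FGA', 'DateFrom': '', 'DateTo': '',
    'GameID': '', 'GameSegment': '', 'LastNGames': '0', 'LeagueID': '00',
    'Location': '', 'Month': '0', 'OpponentTeamID': '0', 'Outcome': '',
    'Period': '0', 'Position': '', 'RookieYear': '', 'SeasonSegment': '',
    'SeasonType': 'Regular Season', 'TeamID': '0', 'VsConference': '',
    'VsDivision': '',
}

def shot_params(player_id, season, **overrides):
    """Return a full parameter dictionary for one player and season."""
    params = dict(SHOT_DEFAULTS)          # copy so the defaults stay untouched
    params.update(PlayerID=str(player_id), Season=season)
    params.update(overrides)              # any keyword argument replaces a default
    return params

params = shot_params(201935, '2017-18')
print(len(params), params['PlayerID'], params['Season'])
```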

Finally, we can get the shot chart detail.

![](images/nba-dance.gif)