# PSTAT 134/234 - Command Line Interface (CLI) & Data collection <a class="tocSkip">

## GUI? CLI?

- **Graphical User Interface (GUI)**:  
    interaction via graphical objects  
    e.g., Microsoft Windows and Apple OS X

- **Command Line Interface (CLI)**:  
    interaction via commands typed into shell  
    e.g., bash, zsh, tcsh, etc.

- The shell is usually accessed through a terminal (e.g., the Terminal in Jupyter)

- A GUI is easy to use for everyday tasks, but repetitive tasks are hard to automate with it

- A CLI is more cumbersome for everyday use, but it is scriptable

## Shell commands

### References to learn shell command line

- **Required reading: [Software Carpentry Shell Novice Lesson](http://swcarpentry.github.io/shell-novice/)**

In [None]:
! wget https://swcarpentry.github.io/shell-novice/data/shell-lesson-data.zip
! unzip shell-lesson-data.zip
## In the terminal
# cd shell-lesson-data/exercise-data/alkanes
# ls *.pdb        ## * matches any number of characters
# ls ?ethane.pdb  ## ? matches exactly one character
# mkdir file_test
# cp octane.pdb file_test/octane_copy.pdb  ## copy first, while octane.pdb is still in this directory
# mv octane.pdb file_test/octane.pdb
# nano test.txt   ## Ctrl+O saves, Ctrl+X exits
# rm test.txt
# rm -i test.txt  ## alternative: -i asks for confirmation before deleting
# rm -r -i file_test
# echo "Hello!"
# echo "This is a test" > test.txt           ## > overwrites the file
# echo "This is a second line" >> test.txt   ## >> appends to the file

- [Explain Shell](https://explainshell.com/)

### Commonly used commands for text files

- `cat`: prints content of a file
- `head`: prints first few lines of a file
- `sed`: (stream editor) edits text as it streams through
- `paste`: merges lines of files side by side
- `cut`: extracts columns from a delimited text file
- `find`: searches the file system for files
- `grep`: searches text for lines matching a regular expression pattern
- `sort`: sorts a file line by line
- `uniq`: removes adjacent duplicate lines (so sort the text first)
- etc.

### Anatomy of shell commands

Here is a simple shell command:

In [None]:
! cat --help ## most shell commands have built-in help

In [None]:
! head --help

1. `cat`: program name

2. `[OPTION]`: optional flags that control program behavior

3. `[FILE]`: file(s) to read from; with no file given, `cat` reads standard input
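
The three parts can also be seen by splitting a command line into tokens. The following is a minimal Python sketch, assuming a hypothetical command `cat -n candy-data.csv`; classifying tokens by a leading `-` is a simplification that ignores options taking their own values.

```python
import shlex

# Hypothetical command line, used only for illustration.
command = "cat -n candy-data.csv"

# shlex.split tokenizes the string the way a POSIX shell would.
tokens = shlex.split(command)

program = tokens[0]                                           # program name
options = [t for t in tokens[1:] if t.startswith("-")]        # [OPTION]
files = [t for t in tokens[1:] if not t.startswith("-")]      # [FILE]

print(program, options, files)
```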

### Example: Downloading Files

- Files hosted on sites such as GitHub have directly visible URLs

- `wget` is a simple and effective download tool

- Example: https://github.com/fivethirtyeight/data

- The "Raw" button on a file's page links to the URL of the actual file

- Take the candy ratings data: https://github.com/fivethirtyeight/data/tree/master/candy-power-ranking

- `wget` can download files directly to the course JupyterHub

In [None]:
%%bash
wget https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv

### Example: Viewing file contents 

In [4]:
%%bash
head candy-data.csv

competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
100 Grand,1,0,1,0,0,1,0,1,0,.73199999,.86000001,66.971725
3 Musketeers,1,0,0,0,1,0,0,1,0,.60399997,.51099998,67.602936
One dime,0,0,0,0,0,0,0,0,0,.011,.116,32.261086
One quarter,0,0,0,0,0,0,0,0,0,.011,.51099998,46.116505
Air Heads,0,1,0,0,0,0,0,0,0,.90600002,.51099998,52.341465
Almond Joy,1,0,0,1,0,0,0,1,0,.465,.76700002,50.347546
Baby Ruth,1,0,1,1,1,0,0,1,0,.60399997,.76700002,56.914547
Boston Baked Beans,0,0,0,1,0,0,0,0,1,.31299999,.51099998,23.417824
Candy Corn,0,0,0,0,0,0,0,0,1,.90600002,.32499999,38.010963


In [5]:
! head candy-data.csv ## also works

competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
100 Grand,1,0,1,0,0,1,0,1,0,.73199999,.86000001,66.971725
3 Musketeers,1,0,0,0,1,0,0,1,0,.60399997,.51099998,67.602936
One dime,0,0,0,0,0,0,0,0,0,.011,.116,32.261086
One quarter,0,0,0,0,0,0,0,0,0,.011,.51099998,46.116505
Air Heads,0,1,0,0,0,0,0,0,0,.90600002,.51099998,52.341465
Almond Joy,1,0,0,1,0,0,0,1,0,.465,.76700002,50.347546
Baby Ruth,1,0,1,1,1,0,0,1,0,.60399997,.76700002,56.914547
Boston Baked Beans,0,0,0,1,0,0,0,0,1,.31299999,.51099998,23.417824
Candy Corn,0,0,0,0,0,0,0,0,1,.90600002,.32499999,38.010963


In [6]:
! head -n 8 candy-data.csv  ## first line is the header

competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
100 Grand,1,0,1,0,0,1,0,1,0,.73199999,.86000001,66.971725
3 Musketeers,1,0,0,0,1,0,0,1,0,.60399997,.51099998,67.602936
One dime,0,0,0,0,0,0,0,0,0,.011,.116,32.261086
One quarter,0,0,0,0,0,0,0,0,0,.011,.51099998,46.116505
Air Heads,0,1,0,0,0,0,0,0,0,.90600002,.51099998,52.341465
Almond Joy,1,0,0,1,0,0,0,1,0,.465,.76700002,50.347546
Baby Ruth,1,0,1,1,1,0,0,1,0,.60399997,.76700002,56.914547


In [7]:
! tail -n5 candy-data.csv 

Twizzlers,0,1,0,0,0,0,0,0,0,.22,.116,45.466282
Warheads,0,1,0,0,0,0,1,0,0,.093000002,.116,39.011898
Welch's Fruit Snacks,0,1,0,0,0,0,0,0,1,.31299999,.31299999,44.375519
Werther's Original Caramel,0,0,1,0,0,0,1,0,0,.186,.26699999,41.904308
Whoppers,1,0,0,0,0,1,0,0,1,.87199998,.84799999,49.524113


In [8]:
! wc -w candy-data.csv  ##word count

188 candy-data.csv


In [9]:
! wc -c candy-data.csv      ## counts bytes in the file; use wc -l to count lines

5193 candy-data.csv
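
To make the difference between the `wc` flags concrete, here is a rough Python sketch of what `-l`, `-w`, and `-c` count. The three-line `text` is a made-up excerpt, not the full file.

```python
# Made-up three-line excerpt standing in for a text file.
text = "competitorname,chocolate\n100 Grand,1\nOne dime,0\n"

n_lines = text.count("\n")            # wc -l: number of newline characters
n_words = len(text.split())           # wc -w: whitespace-separated words
n_bytes = len(text.encode("utf-8"))   # wc -c: number of bytes

print(n_lines, n_words, n_bytes)
```

Note that "100 Grand,1" counts as two words, because `wc -w` splits on whitespace, not on commas.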


In [10]:
! cut -d',' -f 1-4 candy-data.csv    ## prints columns 1-4 of comma-delimited text

competitorname,chocolate,fruity,caramel
100 Grand,1,0,1
3 Musketeers,1,0,0
One dime,0,0,0
One quarter,0,0,0
Air Heads,0,1,0
Almond Joy,1,0,0
Baby Ruth,1,0,1
Boston Baked Beans,0,0,0
Candy Corn,0,0,0
Caramel Apple Pops,0,1,1
Charleston Chew,1,0,0
Chewey Lemonhead Fruit Mix,0,1,0
Chiclets,0,1,0
Dots,0,1,0
Dum Dums,0,1,0
Fruit Chews,0,1,0
Fun Dip,0,1,0
Gobstopper,0,1,0
Haribo Gold Bears,0,1,0
Haribo Happy Cola,0,0,0
Haribo Sour Bears,0,1,0
Haribo Twin Snakes,0,1,0
Hershey's Kisses,1,0,0
Hershey's Krackel,1,0,0
Hershey's Milk Chocolate,1,0,0
Hershey's Special Dark,1,0,0
Jawbusters,0,1,0
Junior Mints,1,0,0
Kit Kat,1,0,0
Laffy Taffy,0,1,0
Lemonhead,0,1,0
Lifesavers big ring gummies,0,1,0
Peanut butter M&M's,1,0,0
M&M's,1,0,0
Mike & Ike,0,1,0
Milk Duds,1,0,1
Milky Way,1,0,1
Milky Way Midnight,1,0,1
Milky Way Simply Caramel,1,0,1
Mounds,1,0,0
Mr Good Bar,1,0,0
Nerds,0,1,0
Nestle Butterfinger,1,0,0
Nestle Crunch,1,0,0
Nik L Nip,0,1,0
Now & Later,0,1,0
Payday,0,0,0
Peanut M&Ms,1,0,0
Pixie Sticks

In [11]:
! grep 'Sugar' candy-data.csv      ## finds lines with pattern (regular expression)

Sugar Babies,0,0,1,0,0,0,0,0,1,.96499997,.76700002,33.43755
Sugar Daddy,0,0,1,0,0,0,0,0,0,.41800001,.32499999,32.230995
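
For comparison, the same column selection and line filtering can be sketched in Python. The `sample` string is a shortened excerpt of the candy data (first five columns only), assumed here for illustration.

```python
import csv
import io
import re

# Shortened excerpt of candy-data.csv (first five columns only).
sample = """competitorname,chocolate,fruity,caramel,peanutyalmondy
100 Grand,1,0,1,0
Sugar Babies,0,0,1,0
Sugar Daddy,0,0,1,0
"""

# Roughly `cut -d',' -f 1-4`: keep the first four fields of every row.
rows = list(csv.reader(io.StringIO(sample)))
first_four = [row[:4] for row in rows]

# Roughly `grep 'Sugar'`: keep whole lines that match the pattern.
sugar_lines = [line for line in sample.splitlines() if re.search("Sugar", line)]

print(first_four)
print(sugar_lines)
```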


### Chaining commands together

- Commands can be chained together using "pipes"

- Many shell commands send their output to what is called "stdout" (standard output, which by default prints to the screen)

- A pipe (`|`) feeds the "stdout" of one command into the "stdin" (standard input) of the next command.

- Hence, we can make commands such as the following

In [12]:
! head -n5 candy-data.csv | tail -n1

One quarter,0,0,0,0,0,0,0,0,0,.011,.51099998,46.116505


In [13]:
! head -n1 candy-data.csv

competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent


In [14]:
! head -n1 candy-data.csv | sed 's/,/\n/g' # stream editor: \n is the special character for a new line; g: global substitution

competitorname
chocolate
fruity
caramel
peanutyalmondy
nougat
crispedricewafer
hard
bar
pluribus
sugarpercent
pricepercent
winpercent


In [15]:
! head -n1 candy-data.csv | sed 's/,/\n/g' | sed 's/chocolate/CHOCOLATE/g'

competitorname
CHOCOLATE
fruity
caramel
peanutyalmondy
nougat
crispedricewafer
hard
bar
pluribus
sugarpercent
pricepercent
winpercent
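
The idea behind a pipe, each stage consuming the previous stage's output, can be imitated in plain Python. This is only an analogy under simplifying assumptions: `head` and `tail` below are hypothetical helper functions, and `lines` holds just the candy names from the first rows of the file.

```python
from collections import deque
from itertools import islice

def head(lines, n):
    """Pass along only the first n lines, like `head -n`."""
    return islice(lines, n)

def tail(lines, n):
    """Keep only the last n lines, like `tail -n`."""
    return deque(lines, maxlen=n)

# First column of the first six lines of candy-data.csv.
lines = ["competitorname", "100 Grand", "3 Musketeers",
         "One dime", "One quarter", "Air Heads"]

# Like `head -n5 | tail -n1`: the output of head is the input of tail.
fifth_line = list(tail(head(lines, 5), 1))

# Like `sed 's/,/\n/g'`: replace every comma with a newline.
header = "competitorname,chocolate,fruity"
columns = header.replace(",", "\n").splitlines()

print(fifth_line)
print(columns)
```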


### Example: Text file download, search, and manipulation

Commands like `grep`, `sed`, and `awk` can be used for text processing.

In [16]:
%%bash
# wget: -q is quiet mode; -O - writes the page to standard output
# sed (stream editor): puts each "<a data" link on its own line
# grep -Po: -P uses Perl-compatible regular expressions; -o prints only the matching part
wget -q -O - https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi \
   | grep 'zipcode.zip' \
   | sed 's/<a data/\n<a data/g' \
   | grep -Po '(?<=href=")[^"]*(?=")'

https://www.irs.gov/pub/irs-soi/2010zipcode.zip
https://www.irs.gov/pub/irs-soi/2009zipcode.zip
https://www.irs.gov/pub/irs-soi/2008zipcode.zip
https://www.irs.gov/pub/irs-soi/2007zipcode.zip
https://www.irs.gov/pub/irs-soi/2006zipcode.zip
https://www.irs.gov/pub/irs-soi/2005zipcode.zip
https://www.irs.gov/pub/irs-soi/2004zipcode.zip
https://www.irs.gov/pub/irs-soi/2002zipcode.zip
https://www.irs.gov/pub/irs-soi/2001zipcode.zip
https://www.irs.gov/pub/irs-soi/1998zipcode.zip
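
The lookbehind/lookahead pattern passed to `grep -Po` also works with Python's `re` module. The sketch below runs it on a small made-up HTML fragment; the real IRS page markup may differ.

```python
import re

# Made-up HTML fragment for illustration; not copied from the IRS page.
html = (
    '<li><a data-entity-type="file" '
    'href="https://www.irs.gov/pub/irs-soi/2010zipcode.zip">2010</a></li>'
    '<li><a data-entity-type="file" '
    'href="https://www.irs.gov/pub/irs-soi/2009zipcode.zip">2009</a></li>'
    '<li><a href="https://www.irs.gov/about">About</a></li>'
)

# (?<=href=") : match must be preceded by href="   (lookbehind)
# [^"]*       : any run of characters that are not a double quote
# (?=")       : match must be followed by "         (lookahead)
pattern = re.compile(r'(?<=href=")[^"]*(?=")')

# Keep only links to zip files, as `grep 'zipcode.zip'` does in the pipeline.
links = [url for url in pattern.findall(html) if url.endswith("zipcode.zip")]

print(links)
```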


## Shell and Jupyter Notebook

Shell and Jupyter can interact with each other by passing values back and forth: e.g.

1. In the shell, grab a webpage, extract all links, and keep the file links that end with `zipcode.zip`.  
    Then, pass the file links to a Python variable: `files`

2. In Python, loop through the file links.  
    In each iteration, pass one file link, `f`, back to the shell.

3. In shell, download the file at the link using `wget`

In [17]:
files = !wget -q -O - https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi | grep 'zipcode.zip' | sed 's/<a data/\n<a data/g' | grep -Po '(?<=href=")[^"]*(?=")'
files  ## the file links from bash are now in a Python variable! What if we removed -O -?

['https://www.irs.gov/pub/irs-soi/2010zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2009zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2008zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2007zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2006zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2005zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2004zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2002zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2001zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/1998zipcode.zip']

In [18]:
for f in files[:3]:
    ! wget -nc {f}        ## pass Python variables into the shell! -nc (no-clobber) skips existing files. What if we removed -nc?

File ‘2010zipcode.zip’ already there; not retrieving.

File ‘2009zipcode.zip’ already there; not retrieving.

File ‘2008zipcode.zip’ already there; not retrieving.
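
The skip-if-present behavior of `wget -nc` can be sketched in Python alone. `download_if_missing` and `fake_fetch` are hypothetical helpers written for this illustration; the stand-in fetcher lets the logic run offline, and passing nothing for `fetch` would use `urllib.request.urlretrieve` for a real download.

```python
import os
import tempfile
import urllib.request
from urllib.parse import urlsplit

def download_if_missing(url, dest_dir=".", fetch=urllib.request.urlretrieve):
    """Download url into dest_dir unless a file with that name already exists.

    Returns (path, downloaded). `fetch` is any callable taking (url, path).
    """
    filename = os.path.basename(urlsplit(url).path)
    path = os.path.join(dest_dir, filename)
    if os.path.exists(path):
        return path, False      # like "already there; not retrieving"
    fetch(url, path)
    return path, True

def fake_fetch(url, path):
    """Stand-in for a real download: writes a placeholder file."""
    with open(path, "w") as fh:
        fh.write("placeholder for " + url)

files = ["https://www.irs.gov/pub/irs-soi/2010zipcode.zip"]
with tempfile.TemporaryDirectory() as tmp:
    first = [download_if_missing(f, tmp, fake_fetch)[1] for f in files]
    second = [download_if_missing(f, tmp, fake_fetch)[1] for f in files]

print(first, second)
```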



### User Agent string

* Sometimes a page looks different in a browser than the code downloaded by a tool such as `wget`, because servers can respond differently depending on who is asking

* The [user agent string](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) identifies the application, operating system, vendor, and/or version of the requesting user agent

* For example, here is a [list for Chrome browsers](https://www.useragentstring.com/pages/Chrome/)
 
   - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36



* `wget --user-agent="User Agent Here" "[URL]"`

   - `wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36" "https://example.com"`
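
The same header can be set from Python's standard library. This sketch only builds the request object and inspects it; calling `urllib.request.urlopen(request)` would actually send it.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
)

# Attach the User-Agent header to the request instead of urllib's default.
request = urllib.request.Request(
    "https://example.com",
    headers={"User-Agent": USER_AGENT},
)

# urllib stores header names capitalized as "User-agent".
sent_header = request.get_header("User-agent")
print(sent_header)
```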


## GET method and APIs

### GET method

- Requests for URLs often use the [GET method](https://www.w3schools.com/tags/ref_httpmethods.asp)

- The GET method passes parameters in the URL itself: e.g.  
    https://www.google.com/search?q=hello+there  
    https://www.google.com/search?q=hello+there&tbm=isch

- "[Urls Explained](https://www.freeformatter.com/url-parser-query-string-splitter.html#urls-explained)" dissects standard components of URLs ([online URL parser](https://www.freeformatter.com/url-parser-query-string-splitter.html)) 
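
As a small illustration, Python's `urllib.parse` can split one of the search URLs above into its components and decode the query string.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.google.com/search?q=hello+there&tbm=isch"

parts = urlsplit(url)
print(parts.scheme)   # protocol
print(parts.netloc)   # host
print(parts.path)     # path on the host
print(parts.query)    # everything after the "?"

# parse_qs decodes the query string; "+" becomes a space.
params = parse_qs(parts.query)
print(params)
```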

### Application Programming Interface (API)

> a set of functions and procedures allowing the creation of applications that access the features or data of an operating system, application, or other service.

- GET URLs are a commonly used form of API

- In an API context, an **endpoint** is a [URL path](https://www.freeformatter.com/url-parser-query-string-splitter.html#urls-explained) that processes your request

- The **query string** specifies the request: e.g., the search term and type

### Example: Google Maps

- Web services often have documentation for developers:  
    [Google Maps Developer Documentation](https://developers.google.com/maps/documentation/urls/guide#forming-the-url)

- Demo: [Display a map](https://developers.google.com/maps/documentation/urls/guide#map-action):  
    e.g. https://www.google.com/maps/@?api=1&map_action=map&basemap=terrain&layer=bicycling
    
  - **Base URL**: `https://www.google.com/maps/`
    - This is the base URL for accessing Google Maps.

  - **Parameters**:
    - `@`: The `@` symbol is part of the endpoint path for the map-display action rather than a parameter. The query parameters after the `?` define the map settings.
    - `api=1`: This required parameter identifies the version of Maps URLs that the URL is written for. The only valid value is `1`.
    - `map_action=map`: This parameter sets the action to be performed on the map, which is to display a map.
    - `basemap=terrain`: This parameter sets the base map type to "terrain". It specifies that the map should display topographic features such as mountains, hills, and land elevation.
    - `layer=bicycling`: This parameter adds a layer to the map indicating bicycling information. It displays bike lanes, trails, and other cycling-related details on the map.

- Demo: [Searching Google Maps](https://developers.google.com/maps/documentation/urls/guide#forming-the-search-url):  
    e.g. https://www.google.com/maps/search/?api=1&query=home+depot
    
  - **Base URL**: `https://www.google.com/maps/search/`
    - This is the base URL for performing searches on Google Maps.

  - **Parameters**:
    - `api=1`: This required parameter identifies the version of Maps URLs that the URL is written for. The only valid value is `1`.
    - `query=home+depot`: This parameter specifies the search query. In this case, the query is "home depot". The space between "home" and "depot" is represented by the plus sign (+) as it is URL encoded.
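
A search URL like the one above can be assembled from a base URL and a dictionary of parameters. This minimal sketch uses `urllib.parse.urlencode`, which by default encodes spaces as `+`.

```python
from urllib.parse import urlencode

base_url = "https://www.google.com/maps/search/"
params = {"api": 1, "query": "home depot"}

# urlencode joins key=value pairs with "&" and encodes the space as "+".
search_url = base_url + "?" + urlencode(params)
print(search_url)
```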


### Example: Film Locations in San Francisco

> _... listing of filming locations of movies shot in San Francisco starting from 1924 ..._

- [Dataset Metadata](https://data.sfgov.org/Culture-and-Recreation/Film-Locations-in-San-Francisco/yitu-d5am)

- [API documentation](https://dev.socrata.com/foundry/data.sfgov.org/yitu-d5am)

- [Simple Filters](https://dev.socrata.com/docs/filtering.html): selection criteria

- [Paging through Data](https://dev.socrata.com/docs/paging.html#2.1): paging through returned data

In [19]:
!wget -qO - "https://data.sfgov.org/resource/yitu-d5am.json?release_year=2013&title=Red%20Widow"

[{"title":"Red Widow","release_year":"2013","locations":"Vallejo Street Garage","production_company":"Beyond Pix","distributor":"American Broadcasting Company (ABC)","director":"Alon Aranya","writer":"Melissa Rosenberg","actor_1":"Radha Mitchell","actor_2":"Sterling Beaumon","actor_3":"Clifton Collins Jr."}
,{"title":"Red Widow","release_year":"2013","locations":"Montgomery & Market Streets","production_company":"Beyond Pix","distributor":"American Broadcasting Company (ABC)","director":"Alon Aranya","writer":"Melissa Rosenberg","actor_1":"Radha Mitchell","actor_2":"Sterling Beaumon","actor_3":"Clifton Collins Jr."}
,{"title":"Red Widow","release_year":"2013","locations":"Broadway & Taylor","production_company":"Beyond Pix","distributor":"American Broadcasting Company (ABC)","director":"Alon Aranya","writer":"Melissa Rosenberg","actor_1":"Radha Mitchell","actor_2":"Sterling Beaumon","actor_3":"Clifton Collins Jr."}
,{"title":"Red Widow","release_year":"2013","locations":"Mason & Sacram
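
Filters and paging can be combined in one query string. The sketch below only builds URLs and does not contact the server. `page_urls` is a hypothetical helper, and the page size of 2 is an arbitrary choice; `$limit` and `$offset` are the paging parameters described in the Socrata paging documentation linked above.

```python
from urllib.parse import urlencode, quote

ENDPOINT = "https://data.sfgov.org/resource/yitu-d5am.json"

def page_urls(filters, page_size, n_pages):
    """Build one URL per page using the $limit and $offset parameters."""
    urls = []
    for page in range(n_pages):
        params = {**filters, "$limit": page_size, "$offset": page * page_size}
        # quote_via=quote encodes spaces as %20; safe="$" keeps "$" readable.
        urls.append(ENDPOINT + "?" + urlencode(params, safe="$", quote_via=quote))
    return urls

urls = page_urls({"release_year": "2013", "title": "Red Widow"},
                 page_size=2, n_pages=3)
for u in urls:
    print(u)
```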

## JavaScript Object Notation (JSON) format

- One of the most widely used data formats

- Usually a plain text file with Python dictionary-like formatting:  
    `{"key":"value"}`

- Can be nested:  
    `{"key0":{"key1":"value1", "key2":"value2"}}`

- In fact, Jupyter notebooks are in json format.
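
As a quick illustration of the nested form, here is a minimal sketch using Python's built-in `json` module on the example string from the list above.

```python
import json

raw = '{"key0": {"key1": "value1", "key2": "value2"}}'

# json.loads turns JSON text into Python objects (here, nested dictionaries).
parsed = json.loads(raw)
inner = parsed["key0"]["key1"]

# json.dumps goes the other way: Python objects back to JSON text.
round_trip = json.dumps(parsed)

print(inner)
print(round_trip)
```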

In [20]:
! head 08-Command-Line-and-Data-collection.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": true
   },
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#GUI?-CLI?\" data-toc-modified-id=\"GUI?-CLI?-1\">GUI? CLI?</a></span></li><li><span><a href=\"#Shell-commands\" data-toc-modified-id=\"Shell-commands-2\">Shell commands</a></span><ul class=\"toc-item\"><li><span><a href=\"#References-to-learn-shell-command-line\" data-toc-modified-id=\"References-to-learn-shell-command-line-2.1\">References to learn shell command line</a></span></li><li><span><a href=\"#Commonly-used-commands-for-text-files\" data-toc-modified-id=\"Commonly-used-commands-for-text-files-2.2\">Commonly used commands for text files</a></span></li><li><span><a href=\"#Anatomy-of-shell-commands\" data-toc-modified-id=\"Anatomy-of-shell-commands-2.3\">Anatomy of shell commands</a></span></li><li><span><a href=\"#Example:-Downloading-Files\" dat

### Example: Parsing Film Locations in San Francisco

- The raw JSON arrives as a string

- It needs to be parsed into Python objects: here, a list of dictionaries with keys and values.

- Parse the returned JSON-formatted page with the `json` module

In [21]:
import json
json_str = !wget -qO - "https://data.sfgov.org/resource/yitu-d5am.json?release_year=2013&title=Red%20Widow"
json_str = ''.join(json_str) # "!" returns a list of output lines; join them into one string before parsing
data = json.loads(json_str)
data[0] ## show the first record

{'title': 'Red Widow',
 'release_year': '2013',
 'locations': 'Vallejo Street Garage',
 'production_company': 'Beyond Pix',
 'distributor': 'American Broadcasting Company (ABC)',
 'director': 'Alon Aranya',
 'writer': 'Melissa Rosenberg',
 'actor_1': 'Radha Mitchell',
 'actor_2': 'Sterling Beaumon',
 'actor_3': 'Clifton Collins Jr.'}

* `data` is now a Python list of dictionaries, one per record

* A list of dictionaries can be converted to a pandas DataFrame

In [22]:
import pandas as pd

pd.DataFrame.from_dict(data).head()

| | title | release_year | locations | production_company | distributor | director | writer | actor_1 | actor_2 | actor_3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Red Widow | 2013 | Vallejo Street Garage | Beyond Pix | American Broadcasting Company (ABC) | Alon Aranya | Melissa Rosenberg | Radha Mitchell | Sterling Beaumon | Clifton Collins Jr. |
| 1 | Red Widow | 2013 | Montgomery & Market Streets | Beyond Pix | American Broadcasting Company (ABC) | Alon Aranya | Melissa Rosenberg | Radha Mitchell | Sterling Beaumon | Clifton Collins Jr. |
| 2 | Red Widow | 2013 | Broadway & Taylor | Beyond Pix | American Broadcasting Company (ABC) | Alon Aranya | Melissa Rosenberg | Radha Mitchell | Sterling Beaumon | Clifton Collins Jr. |
| 3 | Red Widow | 2013 | Mason & Sacramento St | Beyond Pix | American Broadcasting Company (ABC) | Alon Aranya | Melissa Rosenberg | Radha Mitchell | Sterling Beaumon | Clifton Collins Jr. |
| 4 | Red Widow | 2013 | Golden Gate Ave & Jones | Beyond Pix | American Broadcasting Company (ABC) | Alon Aranya | Melissa Rosenberg | Radha Mitchell | Sterling Beaumon | Clifton Collins Jr. |


## Data Manipulation

### Working with CSV files

Install [csvkit](https://csvkit.readthedocs.io/en/latest/):

In [23]:
# !pip install csvkit

### Examples

* [Data input](https://csvkit.readthedocs.io/en/latest/scripts/in2csv.html#examples)
* [Database conversion](https://csvkit.readthedocs.io/en/latest/scripts/sql2csv.html#examples)
* [Tutorials](https://csvkit.readthedocs.io/en/latest/tutorial.html)