## Data Processing in Shell




## Course Description

We live in a busy world with tight deadlines. As a result, we fall back on what is familiar and easy, favoring GUI interfaces like Anaconda and RStudio. However, taking the time to learn data analysis on the command line is a great long-term investment because it makes us stronger and more productive data people.

In this course, we will take a practical approach to learn simple, powerful, and data-specific command-line skills. Using publicly available Spotify datasets, we will learn how to download, process, clean, and transform data, all via the command line. We will also learn advanced techniques such as command-line based SQL database operations. Finally, we will combine the powers of command line and Python to build a data pipeline for automating a predictive model.

##  Downloading Data on the Command Line

In this chapter, we learn how to download data files from web servers via the command line. In the process, we also learn about documentation manuals, option flags, and multi-file processing.

    Downloading data using curl    50 xp
    Using curl documentation    50 xp
    Downloading single file using curl    100 xp
    Downloading multiple files using curl    100 xp
    Downloading data using Wget    50 xp
    Installing Wget    50 xp
    Downloading single file using wget    100 xp
    Advanced downloading using Wget    50 xp
    Setting constraints for multiple file downloads    50 xp
    Creating wait time using Wget    100 xp
    Data downloading with Wget and curl    100 xp 
    

##  Data Cleaning and Munging on the Command Line

We continue our data journey from data downloading to data processing. In this chapter, we utilize the command line library csvkit to convert, preview, filter and manipulate files to prepare our data for further analyses.

    Getting started with csvkit    50 xp
    Installation and documentation for csvkit    100 xp
    Converting and previewing data with csvkit    100 xp
    File conversion and summary statistics with csvkit    100 xp
    Filtering data using csvkit    50 xp
    Printing column headers with csvkit    100 xp
    Filtering data by column with csvkit    100 xp
    Filtering data by row with csvkit    100 xp
    Stacking data and chaining commands with csvkit    50 xp
    Stacking files with csvkit    100 xp
    Chaining commands using operators    100 xp
    Data processing with csvkit    100 xp 
    

##  Database Operations on the Command Line

In this chapter, we dig deeper into all that the csvkit library has to offer. In particular, we focus on database operations we can do on the command line, including table creation, data pulls, and various ETL transformations.

    Pulling data from database    50 xp
    Using sql2csv documentation    50 xp
    Understand sql2csv connectors    50 xp
    Practice pulling data from database    100 xp
    Manipulating data using SQL syntax    50 xp
    Applying SQL to a local CSV file    100 xp
    Cleaner scripting via shell variables    100 xp
    Joining local CSV files using SQL    100 xp
    Pushing data back to database    50 xp
    Practice pushing data back to database    100 xp
    Database and SQL with csvkit    100 xp


##  Data Pipeline on the Command Line

In the last chapter, we bridge the connection between command line and other data science languages and learn how they can work together. Using Python as a case study, we learn to execute Python on the command line, to install dependencies using the package manager pip, and to build an entire model pipeline using the command line.

    Python on the command line    50 xp
    Finding Python version on the command line    50 xp
    Executing Python script on the command line    100 xp
    Python package installation with pip    50 xp
    Understanding pip's capabilities    50 xp
    Installing Python dependencies    100 xp
    Running a Python model   100 xp
    Data job automation with cron    50 xp
    Understanding cron scheduling syntax    50 xp
    Scheduling a job with crontab    100 xp
    Model production on the command line    100 xp
    Course recap    50 xp 
    

## Downloading data using curl





**Welcome to Intermediate Shell.  My name is Susan Sun, and I do data work.  I'm looking forward to learning with you in this course.  In data, many of us bypass the command line in favor of GUI interfaces like Anaconda and RStudio because that is what we are familiar with.  However, taking the time to learn data science on the command line is a great long-term investment that will ultimately make us better and more productive data people.  


In this course, we take a practical approach and learn command-line tools useful for everyday data processing and analyses.  First, let's learn how to download data files using curl.  "curl", short for Client for URLs, is a Unix command-line tool for transferring data to and from a server.  It is often used to download data from HTTP sites and FTP servers.  To check whether "curl" is properly installed, type the following in the command line: "man curl".  If "curl" has not been installed, you will see an error such as "curl command not found".  In that case, search online for installation instructions for your system.  If "curl" is installed, your console will show the usual man help pages.  You can keep pressing Enter to scroll through the curl manual.  To exit and return to your console, press q.  

The basic syntax for curl has the following structure: "curl [optional flags] [URL]".  The URL is required for the command to run successfully.  curl supports a large number of protocols (including HTTP, HTTPS, FTP, and SFTP).  For the full list, see "curl --help" or "man curl".  Let's download a single file stored at a hypothetical URL using curl.  To save the file with its original name, "datafilename.txt", use the optional flag "-O" (dash uppercase O).  This reads "curl -O URL".  To save the file under a different name, replace -O (dash uppercase O) with -o (dash lowercase o) followed by the new file name.  Now it reads "curl -o newname URL".  
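
As a minimal sketch of the difference between -O and -o, the following uses a local file:// URL in place of a real web server so that it can be tried offline. The temporary directories, the sample data, and the name newname.txt are hypothetical; datafilename.txt is the placeholder name used in the lesson.

```shell
# Stand-in "server": a local directory reached through a file:// URL (hypothetical data).
src_dir=$(mktemp -d)
dest_dir=$(mktemp -d)
printf 'track,danceability\nsong_a,0.71\n' > "$src_dir/datafilename.txt"

if command -v curl >/dev/null 2>&1; then
    # -O keeps the remote file name: writes datafilename.txt into dest_dir
    (cd "$dest_dir" && curl -s -O "file://$src_dir/datafilename.txt")
    # -o saves under a name we choose: writes newname.txt into dest_dir
    (cd "$dest_dir" && curl -s -o newname.txt "file://$src_dir/datafilename.txt")
    ls "$dest_dir"
else
    echo "curl is not installed"
fi
```

Both files hold the same content; only the name under which curl saves them differs.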

Oftentimes, a server will host multiple data files with similar filenames, for example with different ending values.  Instead of downloading each file individually with curl, we can use wildcards (remember what we learned in the Introduction to Shell course?) to download all the files at once.  To download every file hosted on this server that starts with "datafilename" and ends in ".txt", the course uses: "curl -O https://websitename.com/datafilename*.txt".  Be aware that curl itself expands only [ ] ranges and { } lists, not *, so the bracket syntax described next is the dependable form.  

Another option is to increment using a globbing parser.  The following will download every file sequentially, starting with "datafilename001.txt" and ending with "datafilename100.txt".  Note the end of the command, which reads: open square bracket zero zero one dash one hundred close square bracket dot txt.  That is the globbing at work.  


# *******************************************************************************************************************
# curl -O https://websitename.com/datafilename[001-100].txt
#                                             *********


We can increment through the files and download every Nth file.  For example, to download every 10th file, we can modify the globbing parser to read: open square bracket zero zero one dash one hundred colon ten close square bracket dot txt.  


# *******************************************************************************************************************
# curl -O https://websitename.com/datafilename[001-100:10].txt
#                                             ************
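
To see which files a stepped range selects before downloading anything, the sequence can be previewed with seq. This is only a sketch: seq is a separate utility that imitates the numbering rather than curl's own parser, and websitename.com is the hypothetical host from the commands above.

```shell
# Preview the URLs that [001-100:10] would request:
# start at 1, step by 10, stop at 100, zero-padded to three digits.
base="https://websitename.com/datafilename"
urls=$(seq -f "${base}%03g.txt" 1 10 100)

echo "$urls"
echo "$urls" | wc -l   # number of files in the stepped range
```

The list starts at datafilename001.txt and moves on in steps of ten (011, 021, and so on), so the stepped range covers ten files rather than all one hundred.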


Sometimes the internet connection can time out.  To make sure that our download progress is not lost, curl has these two flags: 
"-L" follows the redirect if the server responds with a 3xx (redirect) status code.  
"-C" resumes a previous file transfer if it times out before completion.  
Putting everything together, note that all option flags come before the URL, but the order of the flags does not matter.  
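
One way the combined command might look is sketched below. The function name fetch_all and the DRY_RUN switch are hypothetical helpers added for illustration: with DRY_RUN set, the command is printed instead of executed, which is convenient because websitename.com is only a placeholder. Note that curl's -C expects an offset argument, and "-C -" asks curl to work out the resume point by itself.

```shell
# fetch_all URL: follow redirects (-L), resume partial transfers (-C -),
# and keep the remote file names (-O). Set DRY_RUN=1 to print instead of run.
fetch_all() {
    ${DRY_RUN:+echo} curl -L -C - -O "$1"
}

# The URL is quoted so the shell leaves the [ ] range for curl to expand.
DRY_RUN=1 fetch_all "https://websitename.com/datafilename[001-100].txt"
```

Dropping the DRY_RUN=1 prefix and pointing the function at a real URL would run the actual download.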



In this lesson, we learned how to download files using curl.  Let's put our new knowledge into practice.  Happy curl.  




In [None]:
jhu@debian:~$ curl -O https://assets.datacamp.com/production/repositories/4180/datasets/513986f5ea7ed9a8565bba20d088d21c10e099dc/Spotify_MusicAttributes.csv > ~/Downloads/Spotify_MusicAttributes.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1717  100  1717    0     0   1382      0  0:00:01  0:00:01 --:--:--  1382
jhu@debian:~$ 


## Using curl documentation

As you work with command line tools you will often need to consult the documentation to remind yourself of the syntax or of some of the available functionality. In this exercise, you'll consult curl's documentation to answer this question:

Based on the information in the curl manual, which of the following is NOT a supported file protocol?
Instructions
50 XP
Possible Answers

    LDAP
    FTPS
    HTTPS
#    OFTP
    

In [None]:
jhu@debian:~$ curl --help
Usage: curl [options...] <url>
 -d, --data <data>   HTTP POST data
 -f, --fail          Fail silently (no output at all) on HTTP errors
 -h, --help <category> Get help for commands
 -i, --include       Include protocol response headers in the output
 -o, --output <file> Write to file instead of stdout
 -O, --remote-name   Write output to a file named as the remote file
 -s, --silent        Silent mode
 -T, --upload-file <file> Transfer local FILE to destination
 -u, --user <user:password> Server user and password
 -A, --user-agent <name> Send User-Agent <name> to server
 -v, --verbose       Make the operation more talkative
 -V, --version       Show version number and quit

This is not the full help, this menu is stripped into categories.
Use "--help category" to get an overview of all categories.
For all options use the manual or "--help all".
jhu@debian:~$ man curl
DESCRIPTION
       curl  is  a tool to transfer data from or to a server, using one of the
       supported protocols (DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS,  IMAP,
       IMAPS,  LDAP,  LDAPS,  MQTT, POP3, POP3S, RTMP, RTMPS, RTSP, SCP, SFTP,
       SMB, SMBS, SMTP, SMTPS, TELNET and TFTP). The command  is  designed  to
       work without user interaction.

       curl offers a busload of useful tricks like proxy support, user authen‐
       tication, FTP upload, HTTP post, SSL connections, cookies, file  trans‐
       fer  resume,  Metalink,  and more. As you will see below, the number of
       features will make your head spin!

       curl is powered by  libcurl  for  all  transfer-related  features.  See
       libcurl(3) for details.

PROTOCOLS
       curl supports numerous protocols, or put in URL  terms:  schemes.  Your
       particular build may not support them all.

       DICT   Lets you lookup words using online dictionaries.

       FILE   Read  or  write  local  files.  curl  does not support accessing
              file:// URL remotely, but when running on Microsft Windows using
              the native UNC approach will work.

       FTP(S) curl  supports  the  File Transfer Protocol with a lot of tweaks
              and levers. With or without using TLS.

       GOPHER Retrieve files.

       HTTP(S)
              curl supports HTTP with numerous options and variations. It  can
              speak HTTP version 0.9, 1.0, 1.1, 2 and 3 depending on build op‐
              tions and the correct command line options.

       IMAP(S)
              Using the mail reading protocol, curl can "download" emails  for
              you. With or without using TLS.

       LDAP(S)
              curl can do directory lookups for you, with or without TLS.

       MQTT   curl supports MQTT version 3. Downloading over MQTT equals "sub‐
              scribe" to a topic while uploading/posting equals "publish" on a
              topic.  MQTT  support  is experimental and TLS based MQTT is not
              supported (yet).

       POP3(S)
              Downloading from a pop3 server means getting  a  mail.  With  or
              without using TLS.

       RTMP(S)
              The  Realtime  Messaging  Protocol  is  primarily used to server
              streaming media and curl can download it.

       RTSP   curl supports RTSP 1.0 downloads.

       SCP    curl supports SSH version 2 scp transfers.

       SFTP   curl supports SFTP (draft 5) done over SSH version 2.

       SMB(S) curl supports SMB version 1 for upload and download.

       SMTP(S)
              Uploading contents to an SMTP server  means  sending  an  email.
              With or without TLS.

       TELNET Telling curl to fetch a telnet URL starts an interactive session
              where it sends what it reads  on  stdin  and  outputs  what  the
              server sends it.

       TFTP   curl can do TFTP downloads and uploads.



## Downloading single file using curl

Let's get some hands-on practice with the more commonly used curl options and flags. 
# The URL for the hosted file is a shortened URL using tinyurl. Because of that, we need to fill out a flag option that allows for redirected URLs.
Instructions 1/2
50 XP

    Question 1
#    Fill in the option flag that allows downloading from a redirected URL.
    
    
    Question 2
    In the same step as the download, add in the necessary syntax to rename the downloaded file as Spotify201812.zip.
    

In [None]:
# Use curl to download the file from the redirected URL
curl -L -o Spotify201812.zip https://assets.datacamp.com/production/repositories/4180/datasets/eb1d6a36fa3039e4e00064797e1a1600d267b135/201812SpotifyData.zip




In [None]:
jhu@debian:~$ cd ~/Downloads/
jhu@debian:~/Downloads$ ls
Spotify_MusicAttributes.csv
Training_Machine_Learning_Surrogate_Models_From_a_.pdf
jhu@debian:~/Downloads$ curl -L -o Spotify201812.zip https://assets.datacamp.com/production/repositories/4180/datasets/eb1d6a36fa3039e4e00064797e1a1600d267b135/201812SpotifyData.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1944k  100 1944k    0     0   863k      0  0:00:02  0:00:02 --:--:--  863k
jhu@debian:~/Downloads$ ls
new_file  Spotify201812.zip  Spotify_MusicAttributes.csv  Training_Machine_Learning_Surrogate_Models_From_a_.pdf
jhu@debian:~/Downloads$ 


## Downloading multiple files using curl

We have 100 data files stored in long sequentially named URLs. Scroll right to see the complete URLs.

https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile001.txt
https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile002.txt
......
https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile100.txt

To minimize having to type the long URLs over and over again, we'd like to download all of these files using a single curl command.
Instructions
100 XP

    Download all 100 data files using a single curl command.
    Print all downloaded files to directory.


In [None]:
# Download all 100 data files
curl -O https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile[001-100].txt
#                                                                                                        #########

# Print all downloaded files to directory
ls datafile*.txt

## Downloading data using Wget





**Welcome back.  In this lesson, we introduce another command-line tool for downloading data, called Wget.  We will walk through how to install and set up Wget, along with some basic usage.  Wget derives its name from World Wide Web and Get.  It is a GNU project native to the Linux system, but it is compatible across all operating systems.  Like curl, it is a command-line tool that helps you download files via HTTP and FTP.  


# Compared to "curl", Wget is more multi-purpose.  It can download a single file, an entire folder, or even a webpage.  
Most importantly, it makes it possible to download multiple files recursively.  Aside from using man, another way to check whether Wget has been installed correctly is to run "which wget" (just as for Bash and Dash).  This returns the location where Wget is installed, for example the local user bin directory.  If Wget has not been installed, there will simply be no output.  For the official documentation and source code of Wget, search online.  Unless you are comfortable compiling from the source code, here are some easier alternatives.  

For Linux users, it is likely that Wget is already installed.  If not, run "sudo apt-get install wget".  For Mac users, use Homebrew by running "brew install wget".  For Windows users, this will not be a command-line install.  Rather, download Wget as part of the gnuwin32 package.  Once the installation is complete, use the man command to print the Wget manual.  
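
A small check like the following can confirm which of the two download tools are available before going further. It is a sketch only: the helper name check_tool is hypothetical, and it relies on the shell built-in "command -v", which serves the same purpose here as "which".

```shell
# check_tool NAME: print where NAME is installed, or say that it is missing.
check_tool() {
    if location=$(command -v "$1"); then
        echo "$1 found at $location"
    else
        echo "$1 is not installed"
    fi
}

check_tool curl
check_tool wget
```

If a tool is reported as missing, the installation routes described above apply.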

The basic syntax for Wget has a similar structure to curl: "wget [optional flags] [URL]".  The URL is also required for the Wget command to run successfully (naturally, since we are making a URL request).  Wget supports a large number of protocols for data stored on servers.  For the full list of available options, refer to "wget --help" or "man wget", or search online.  


# Here are some option flags unique to Wget: 
"-b" allows your download to run in the background. 
"-q" turns off the wget output, which saves some disk space. 
"-c" is useful for finishing a previously broken download, whether it was started by Wget or by another program. 

Finally, you can link all the option flags together, as in the command shown below.  Running this command on this hypothetical file location will generate the output: "Continuing in background, pid 12345."  The pid is a unique process ID assigned to this particular data download job for your reference, in case you need to cancel the process.
# *******************************************************************************************************************

# wget -bqc https://websitename.com/datafilename.txt
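
If the background job later needs to be cancelled, the process ID can be pulled out of Wget's message. The sketch below works on the sample message quoted above rather than on a live download; the variable names are hypothetical, and the kill command is left commented out because 12345 is only an example ID.

```shell
# Sample message, as printed by "wget -b" (taken from the lesson text).
message="Continuing in background, pid 12345."

# Keep only the digits that follow "pid ".
pid=$(printf '%s\n' "$message" | sed -n 's/.*pid \([0-9][0-9]*\)\..*/\1/p')
echo "wget is running as process $pid"

# To cancel the download:
# kill "$pid"
```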




In this lesson, we learned another way to download files in the command line using the tool Wget.  Up next, we will put our new knowledge into practice and learn more advanced Wget use cases.  Happy wget.  



## Installing Wget

# Unlike curl, there are several ways to download and install wget depending on which operating system your machine is running. Which of the following is NOT a way to install wget?
Answer the question
50XP
Possible Answers

    On some Linux systems, Wget is already pre-installed
    On Linux, install using apt-get
    On Windows, install via gnuwin32
#    On MacOS, install using pip       (wget is not a Python package but a command-line program; on a Mac, use brew)
    On MacOS, install using homebrew

## Downloading single file using wget

Let's get some hands-on practice with the option flags that make wget such a popular file downloading tool.
Instructions
100 XP

#    Fill in the option flag for resuming a partial download.
#    Fill in the option flag for letting the download occur in the background.
    Preview the download log file


In [None]:
# Fill in the two option flags 
wget -c -b https://assets.datacamp.com/production/repositories/4180/datasets/eb1d6a36fa3039e4e00064797e1a1600d267b135/201812SpotifyData.zip

# Verify that the Spotify file has been downloaded
ls 

# Preview the log file 
cat wget-log

In [None]:
jhu@debian:~/Downloads$ wget -c -b https://assets.datacamp.com/production/repositories/4180/datasets/eb1d6a36fa3039e4e00064797e1a1600d267b135/201812SpotifyData.zip
Continuing in background, pid 29917.
Output will be written to ‘wget-log’.
jhu@debian:~/Downloads$ ls


## Advanced downloading using Wget






**


