# Command line and data frames


Spring 2017 - Prof. Foster Provost

Teaching Assistant: Maria L Zamora Maass


***

## Command Line (Terminal)

We can run command-line (shell) commands from inside IPython notebooks. To run a shell command such as the ones below, prefix the line with an exclamation point (`!`).


#### Interaction with files and folders

We can navigate the folder structure of whatever machine we are working on. For this you will typically use commands such as `ls` (list) and `cd` (change directory). You can make a directory with `mkdir`, or move (`mv`) and copy (`cp`) files. To delete a file you can `rm` (remove) it. To print the contents of a file you can `cat` (concatenate) it to the screen.

Many commands have options you can set when running them. For example, to get a detailed listing with one file per line, you can pass the `-l` (long format) flag, e.g. `ls -l`. During the normal course of using the command line, you will learn the most useful flags. If you want to see all possible options you can always read the `man` (manual) page for a command, e.g. `man ls`. When you are done reading the `man` page, you can exit by hitting `q` to quit.
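For comparison, the same kind of file handling can also be scripted in Python. The following is a minimal sketch using the standard `pathlib` and `shutil` modules. It works inside a throwaway temporary directory, and the file it creates is only a placeholder standing in for `images/terminal.png`, so nothing in your course folder is touched.

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a temporary directory so no real files are affected.
workdir = Path(tempfile.mkdtemp())

# Placeholder file standing in for images/terminal.png (not a real picture).
(workdir / "images").mkdir()                                   # like: mkdir images
source = workdir / "images" / "terminal.png"
source.write_bytes(b"not a real picture")

(workdir / "test").mkdir()                                     # like: mkdir test
shutil.copy(source, workdir / "test" / "some_picture.png")     # like: cp

listing = sorted(p.name for p in (workdir / "test").iterdir()) # like: ls test/
print(listing)

shutil.rmtree(workdir / "test")                                # like: rm -rf test/
remaining = sorted(p.name for p in workdir.iterdir())          # like: ls
print(remaining)

shutil.rmtree(workdir)                                         # remove the temporary directory
```

As with `rm -rf`, `shutil.rmtree` deletes a folder and everything in it without asking, so it deserves the same care.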


In [1]:
!ls

Command line and data frames 2017.ipynb
Ipython notebooks and files 2017.ipynb
Programming Structures and Python Tour 2017.ipynb
data
images


In [2]:
!mkdir test

In [3]:
!ls

Command line and data frames 2017.ipynb
Ipython notebooks and files 2017.ipynb
Programming Structures and Python Tour 2017.ipynb
data
images
test


In [4]:
!ls images/

new_notebook.png   notebook.png       terminal.png
new_terminal.png   script.png         terminal_2017.png
new_text.png       selectlanguage.png text.png


In [5]:
!cp images/terminal.png test/some_picture.png

In [6]:
!ls test/

some_picture.png


In [7]:
# WARNING: THIS WILL DELETE THE TEST FOLDER JUST CREATED
!rm -rf test/

In [8]:
!ls

Command line and data frames 2017.ipynb
Ipython notebooks and files 2017.ipynb
Programming Structures and Python Tour 2017.ipynb
data
images


#### Data manipulation and exploration
Virtually anything you want to do with a data file can be done at the command line. There are dozens of commands that can be used together to get almost any result! Let's take a look at the file `data/users.csv`.

Before we do anything, let's take a look at the first few lines of the file to get an idea of what's in it.

In [9]:
!head data/users.csv

user,variable1,variable2
parallelconcerned,145.391881,-6.081689
driftmvc,145.7887,-5.207083
snowdonevasive,144.295861,-5.826789
cobolglaucous,146.726242,-6.569828
stylishmugs,147.22005,-9.443383
hypergalaxyfibula,143.669186,-3.583828
pipetsrockers,-45.425978,61.160517
bracesworkable,-51.678064,64.190922
spiritedjump,-50.689325,67.016969


Maybe we want to see a few more lines of the file. Passing a number as a flag tells `head` how many lines to print:

In [10]:
!head -15 data/users.csv

user,variable1,variable2
parallelconcerned,145.391881,-6.081689
driftmvc,145.7887,-5.207083
snowdonevasive,144.295861,-5.826789
cobolglaucous,146.726242,-6.569828
stylishmugs,147.22005,-9.443383
hypergalaxyfibula,143.669186,-3.583828
pipetsrockers,-45.425978,61.160517
bracesworkable,-51.678064,64.190922
spiritedjump,-50.689325,67.016969
barnevidence,-68.703161,76.531203
emeraldclippers,-18.072703,65.659994
maintainwiggly,-14.401389,65.283333
submittedwavelength,-15.227222,64.295556
clucklinnet,-17.425978,65.952328


How about the last few lines of the file?

In [11]:
!tail data/users.csv

troubledseptum,135.521667,-29.716667
troubledseptum,-118.598889,34.256944
organicmajor,-5.435,36.136
cobolglaucous,-123.5,48.85
troubledseptum,-124.016667,49.616667
snaildossier,-124.983333,50.066667
unbalancedprotoplanet,-127.028611,50.575556
badgefields,-126.833333,50.883333
backedammeter,-123.00596,48.618397
clucklinnet,-117.1995,32.7552


We can count how many lines are in the file by using `wc` (a word counting tool) with the `-l` flag to count lines:

In [12]:
!wc -l data/users.csv

    8104 data/users.csv
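If you prefer to stay in Python, the same three views of a file (first lines, last lines, line count) can be sketched with the standard library. This is only an illustration: it writes a handful of rows copied from `data/users.csv` to a temporary file so it runs without the course data, and the line counts used (3 and 2) are arbitrary.

```python
import os
import tempfile
from collections import deque
from itertools import islice

# A few rows copied from data/users.csv, written to a temporary file
# so the example does not depend on the course data folder.
sample = """user,variable1,variable2
parallelconcerned,145.391881,-6.081689
driftmvc,145.7887,-5.207083
snowdonevasive,144.295861,-5.826789
cobolglaucous,146.726242,-6.569828
stylishmugs,147.22005,-9.443383
"""
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

with open(path) as f:
    first_lines = [line.rstrip("\n") for line in islice(f, 3)]       # like: head -3

with open(path) as f:
    last_lines = [line.rstrip("\n") for line in deque(f, maxlen=2)]  # like: tail -2

with open(path) as f:
    line_count = sum(1 for _ in f)                                   # like: wc -l

print(first_lines)
print(last_lines)
print(line_count)

os.remove(path)  # clean up the temporary file
```

All three read the file line by line, so they also work on files that are too large to load at once.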


It looks like there are three columns in this file, so let's take a look at the first one alone. Here, we can `cut` out the field (`-f`) we want as long as we give the proper delimiter (`-d`, which defaults to tab).

In [None]:
!cut -f1 -d',' data/users.csv

That's a lot of output. Let's combine `cut` with `head` by piping the output of one command into the other:

In [13]:
!cut -f1 -d',' data/users.csv | head

user
parallelconcerned
driftmvc
snowdonevasive
cobolglaucous
stylishmugs
hypergalaxyfibula
pipetsrockers
bracesworkable
spiritedjump


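We can use pipes (`|`) to string together many commands and create very powerful one-liners. For example, let's get the number of unique users in the first column. We will get all values from the first column, sort them, find all unique values, and then count the number of lines. The sort comes first because `uniq` only collapses duplicates that are next to each other. Note that the header line `user` is counted as one of the values: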

In [14]:
!cut -f1 -d',' data/users.csv | sort | uniq | wc -l

     201
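The same pipeline can be sketched in Python with the standard `csv` module and a `set`. The rows below are a small sample taken from the file, held in memory so the example runs on its own; with the real file you would open `data/users.csv` instead.

```python
import csv
import io

# A few rows from data/users.csv, kept in memory for illustration.
sample = """user,variable1,variable2
troubledseptum,135.521667,-29.716667
troubledseptum,-118.598889,34.256944
organicmajor,-5.435,36.136
cobolglaucous,-123.5,48.85
troubledseptum,-124.016667,49.616667
"""

reader = csv.reader(io.StringIO(sample))
first_column = [row[0] for row in reader]   # like: cut -f1 -d','
unique_values = sorted(set(first_column))   # like: sort | uniq
print(len(unique_values))                   # like: wc -l (the header "user" is included)
```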


Or, we can get a list of the top 10 most frequently occurring users. If we give `uniq` the `-c` flag, it will return the number of times each value occurs. Since these counts are the first entry in each new line, we can tell `sort` to expect numbers (`-n`) and to give us the results in reverse (`-r`) order. Note that when you want to use two or more single-letter flags, you can just place them one after another, as in `-nr`.

In [15]:
!cut -f1 -d',' data/users.csv | sort | uniq -c | sort -nr | head

  59 compareas
  56 upbeatodd
  56 burntrifle
  56 binomialapathetic
  54 frequencywould
  54 ellipticalfabricator
  53 globeshameful
  52 badgefields
  52 ashamedmuscles
  51 alloweruptions
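In Python, counting how often each value occurs is what `collections.Counter` is for. A minimal sketch, using a made-up handful of user names rather than the real file:

```python
from collections import Counter

# First-column values; an invented sample, not the contents of users.csv.
users = ["troubledseptum", "organicmajor", "troubledseptum",
         "cobolglaucous", "troubledseptum", "cobolglaucous"]

counts = Counter(users)            # like: sort | uniq -c
top_two = counts.most_common(2)    # like: sort -nr | head -2
print(top_two)
```

`most_common(n)` returns `(value, count)` pairs ordered from most to least frequent, so no separate sorting step is needed.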


After some exploration we decide we want to keep only part of our data and bring it into a new file. Let's find all the records that have a negative value in both the second and third columns and put these results in a file called `data/negative_users.csv`. Searching through files can be done using _[regular expressions](http://www.robelle.com/smugbook/regexpr.html#expression)_ with a tool called `grep` (Global Regular Expression Print). The pattern used here matches lines in which a comma is directly followed by a minus sign twice, which in this three-column file means both numeric values are negative. You can direct output into a file using a `>`.

In [16]:
!grep '.*,-.*,-.*' data/users.csv > data/negative_users.csv

In [17]:
!ls data

ds_survey.csv      negative_users.csv users.csv
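Regular expressions are also available in Python through the standard `re` module. The sketch below applies the same pattern as the `grep` command to a few lines held in a list. The last row is invented for illustration so that at least one line has two negative values.

```python
import re

lines = [
    "user,variable1,variable2",
    "parallelconcerned,145.391881,-6.081689",
    "pipetsrockers,-45.425978,61.160517",
    "hypothetical_user,-12.5,-3.25",   # invented row: both values negative
]

pattern = re.compile(r".*,-.*,-.*")    # same pattern as the grep command

# Keep only the lines where the pattern is found, as grep does.
negative_rows = [line for line in lines if pattern.search(line)]
print(negative_rows)
```

Writing `negative_rows` to a new file with `open(..., "w")` would then play the role of the `>` redirect.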


## Packages and built-in functions

Python has a ton of packages that make doing complicated stuff very easy. We won't discuss how to install packages, or give a detailed list of what packages exist, but we will give a brief description of how they are used. An easy way to think of why packages are useful is by thinking: "**Python packages give us access to MANY functions!**".

These are pre-defined functions that will make our life easier!! They work just like Python's built-in functions (e.g. the function `str()` that we used to convert numbers into strings).

In this class we will use four packages very frequently: `pandas`, `sklearn`, `matplotlib`, and `numpy`:

- **`pandas`** is a data manipulation package. It lets you store data in data frames. More on this next class.
- **`sklearn`** is a machine learning and data science package. It lets you do fairly complicated machine learning tasks, such as running regressions and building classification models, with only a few lines of code!
- **`matplotlib`** lets you make nice-looking plots.
- **`numpy`** (pronounced num-pie) is used for doing "math stuff" such as mathematical operations (e.g., square roots, exponents, logs), and it gives you matrix operation abilities.

If it's confusing as to why this is useful, don't worry. As we use them throughout the semester, their usefulness will become apparent.

To make the contents of a package useful, you need to import it:

In [18]:
import pandas
import sklearn
import matplotlib
import numpy

Sometimes you will want to use short names for packages. This has just become the norm now, so we will often be doing it so that we fit in with all the professional programmers.

In [19]:
import pandas as pd
import numpy as np

We can now use some package-specific things. For example, numpy has a function called `sqrt()` which will give us the square root of a number. Since it is part of numpy, we need to tell Python that's where it is by using a dot.

In the following cell you can also see how to write **comments** in your code (professional programmers write comments to help somebody else understand their code). You should always write commands and procedures so that they are understandable and straightforward.

In [20]:

some_list = [0,0,1,2,3,3,4.5,7.6]
some_dictionary = {'student1': '(929)-000-0000', 'student2': '(917)-000-0000', 'student3': '(470)-000-0000'}
some_set = set( [1,2,4,4,5,5] )


# In this part of the code I am using numpy (np) functions

print ("Square root: " + str ( np.sqrt(25) ))
print ("Maximum element of our previous list: " + str( np.max(some_list) ))

# In this part of the code I am using python functions

print ("Number of elements in our previous list: " + str( len(some_list) ))
print ("Sum of elements in our previous list: " + str( sum(some_list) ))
print ("Range of 5 numbers (remember we start with 0): " + str( list(range(5)) ))



Square root: 5.0
Maximum element of our previous list: 7.6
Number of elements in our previous list: 8
Sum of elements in our previous list: 21.1
Range of 5 numbers (remember we start with 0): [0, 1, 2, 3, 4]


What about **pandas**? The core concept of this package is the **DataFrame**.

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It is generally the most commonly used pandas object. Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. [More details here](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe)
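As a small sketch of the optional `index` and `columns` arguments, the example below builds a data frame from three short records. The row labels `'a'`, `'b'`, `'c'` are made up for illustration.

```python
import pandas as pd

# Hypothetical records; the row labels 'a', 'b', 'c' are invented for this example.
records = [['studentA', 22], ['studentB', 27], ['studentC', 30]]

df = pd.DataFrame(records, index=['a', 'b', 'c'], columns=['Name', 'Age'])

print(df.shape)             # (number of rows, number of columns)
print(df.loc['b', 'Age'])   # look up one cell by row label and column label
print(df.dtypes)            # each column keeps its own type
```

If `index` is left out, pandas labels the rows 0, 1, 2, and so on.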

This is how it looks:


In [25]:
list1 = ['studentA',22,'(929)-000-000']
list2 = ['studentB',27,'(646)-000-000']
list3 = ['studentC',30,'(917)-000-000']
list4 = ['studentD',31,'(646)-001-001']
list5 = ['studentE',31,'(929)-001-001']
list6 = ['studentF',30,'(917)-001-001']
list7 = ['studentG',30,'(470)-001-001']

data_pandas = pd.DataFrame([list1,list2,list3,list4,list5,list6,list7],columns=['Name','Age','Mobile'])
data_pandas 


| | Name | Age | Mobile |
|---|---|---|---|
| 0 | studentA | 22 | (929)-000-000 |
| 1 | studentB | 27 | (646)-000-000 |
| 2 | studentC | 30 | (917)-000-000 |
| 3 | studentD | 31 | (646)-001-001 |
| 4 | studentE | 31 | (929)-001-001 |
| 5 | studentF | 30 | (917)-001-001 |
| 6 | studentG | 30 | (470)-001-001 |


Count the number of rows for each value of 'Age':

In [29]:
data_pandas.groupby('Age').count()

| Age | Name | Mobile |
|---|---|---|
| 22 | 1 | 1 |
| 27 | 1 | 1 |
| 30 | 3 | 3 |
| 31 | 2 | 2 |
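One detail worth knowing: `count()` reports, for each column, how many non-missing values each group has, while `size()` reports the number of rows per group. The two agree here because no values are missing. A minimal sketch with a small made-up table in which one phone number is deliberately missing:

```python
import pandas as pd

# Invented data; one Mobile value is left missing on purpose.
people = pd.DataFrame({
    'Name': ['studentA', 'studentB', 'studentC', 'studentD'],
    'Age': [22, 30, 30, 31],
    'Mobile': ['(929)-000-000', None, '(917)-000-000', '(646)-001-001'],
})

rows_per_age = people.groupby('Age').size()      # number of rows in each group
filled_per_age = people.groupby('Age').count()   # non-missing values per column

print(rows_per_age)
print(filled_per_age)
```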


You can select a subset of columns by passing a list of their names:

In [41]:
data_pandas[ ['Name','Age'] ]

| | Name | Age |
|---|---|---|
| 0 | studentA | 22 |
| 1 | studentB | 27 |
| 2 | studentC | 30 |
| 3 | studentD | 31 |
| 4 | studentE | 31 |
| 5 | studentF | 30 |
| 6 | studentG | 30 |


We can also add new columns (each one must have the same number of rows as the data frame!):

In [42]:

data_pandas['business_major'] = ['yes','no','yes','yes','yes','no','yes']
data_pandas['years_experience'] = [1,4,2,6,0,3,0]

data_pandas


| | Name | Age | Mobile | business_major | years_experience |
|---|---|---|---|---|---|
| 0 | studentA | 22 | (929)-000-000 | yes | 1 |
| 1 | studentB | 27 | (646)-000-000 | no | 4 |
| 2 | studentC | 30 | (917)-000-000 | yes | 2 |
| 3 | studentD | 31 | (646)-001-001 | yes | 6 |
| 4 | studentE | 31 | (929)-001-001 | yes | 0 |
| 5 | studentF | 30 | (917)-001-001 | no | 3 |
| 6 | studentG | 30 | (470)-001-001 | yes | 0 |


What if we group by age again? This time let's use `sum` to aggregate the values in each group instead of just counting them. For text columns, summing concatenates the strings in each group:

In [37]:
data_pandas.groupby('Age').sum()

| Age | Name | Mobile | business_major |
|---|---|---|---|
| 22 | studentA | (929)-000-000 | yes |
| 27 | studentB | (646)-000-000 | no |
| 30 | studentCstudentFstudentG | (917)-000-000(917)-001-001(470)-001-001 | yesnoyes |
| 31 | studentDstudentE | (646)-001-001(929)-001-001 | yesyes |


****

What is the average age? (Combine packages: numpy and pandas)


In [43]:
np.mean( data_pandas['Age'] )

28.714285714285715
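A pandas column also has its own `mean()` method, which gives the same result as passing the column to `np.mean`. A short sketch, with the seven ages from the table typed in directly so the example runs on its own:

```python
import numpy as np
import pandas as pd

# The seven ages from the data frame, entered by hand for a self-contained example.
ages = pd.Series([22, 27, 30, 31, 31, 30, 30], name='Age')

print(np.mean(ages))   # NumPy function applied to a pandas column
print(ages.mean())     # the column's own method
```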


What about operations with columns? Let's take the difference between age and years of experience!

_(Notice that the columns are selected here with dot notation, e.g. `data_pandas.Age`!)_


In [47]:

data_pandas.Age - data_pandas.years_experience


0    21
1    23
2    28
3    25
4    31
5    27
6    30
dtype: int64
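Dot notation and square brackets select the same column, so the subtraction can be written either way. A minimal sketch on a three-row table with made-up values:

```python
import pandas as pd

# Small invented table with the two columns used in the subtraction.
df = pd.DataFrame({'Age': [22, 27, 30], 'years_experience': [1, 4, 2]})

by_attribute = df.Age - df.years_experience        # dot notation
by_bracket = df['Age'] - df['years_experience']    # bracket notation

print(by_attribute.equals(by_bracket))
print(by_bracket.tolist())
```

Bracket notation is the safer habit: it also works when a column name contains spaces or clashes with an existing DataFrame method such as `count`.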


***

We'll see more functions during the semester, and you can always look them up (remember, Google is your best friend)!!

#### Auto complete for packages

One of the most useful things about IPython notebook is its tab completion. 

Try this: click just after `sqrt(` in the cell below and press `Shift + Tab` 4 times, slowly

In [None]:
np.sqrt(

I find this amazingly useful. I think of this as "the more confused I am, the more times I should press Shift+Tab". Nothing bad will happen if you tab complete 12 times.

Okay, let's try tab completion for function names! Just hit `Tab` when typing below to get suggestions.

In [None]:
np.sq

This is super useful when you forget the names of everything!
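Outside the notebook, roughly the same information is available from plain Python. The sketch below uses `dir()` to list NumPy names that start with `sq` and reads the first line of a function's docstring. It is only an approximation of what the notebook shows, and the exact names and text depend on your NumPy version.

```python
import numpy as np

# Names in NumPy that start with "sq", similar to typing np.sq and pressing Tab.
matches = [name for name in dir(np) if name.startswith('sq')]
print(matches)

# First line of the docstring, similar to what the first Shift+Tab shows.
first_doc_line = np.sqrt.__doc__.strip().splitlines()[0]
print(first_doc_line)
```

Calling `help(np.sqrt)` prints the full documentation for the function.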

## Hands-on

To master your newfound knowledge of Python, you should try these hands-on examples.

Your homeworks will be in a similar format to this section.

**1\. Create one list of 5 fruits and another one with 5 colors**

**2\. Go through each fruit (first list) and print the name of the fruit with one color from the second list**

(don't worry, it doesn't have to be the color of the fruit!)

Example of what you should print:  _apple is purple_

**3\. Add two new fruits to your list with a _BUILT-IN_ function**

( Look for the function with the **TAB** hint! )

**4\. Use the list of fruits and sort the names (put them in alphabetical order)**

( Hint: Numpy has a great function for that!)

**5\. Create a new empty list called "count_letters". Go through your list of fruits, and for each one, add an entry to that new list (count_letters) with the number of letters in that fruit's name.**

Example of what you should print: _apple is 5 letters_


**6\. Make a function called `one_more_change` that takes a list (`input_list`) and returns a new list (`output_list`) in which each element of the original is increased by 1 and then divided by 2.**

In [None]:
def one_more_change(input_list):
    output_list = []   # What should we do?
    
    return output_list

**7\. Use the previous function to change the value of a list of 10 random numbers.**