# IPython notebooks and data files

Spring 2017 - Prof. Foster Provost

Teaching Assistant: Maria L Zamora Maass

***

## Python

Python is a programming language that has grown steadily in popularity in recent years. There are many reasons for this, but it mostly comes down to two things: Python is easy to learn and use, and it has a very active community that develops amazing extensions to the language!

In just the past few years, Python has become one of the most frequently used languages in data science because it can be applied almost immediately to a large number of data science problems. When companies in different industries and of various sizes are asked which language they would like incoming data scientists to know, they almost all agree that Python is the best choice. If you are going to learn one language (something everyone should do!), Python would be a great choice.

Many tools and packages have been built on top of Python, including IPython, pandas, NumPy, Matplotlib, and others that we will use during this course. For more information, please visit https://www.python.org/doc/


## Jupyter IPython notebooks

One useful tool for working with Python is Jupyter, which provides IPython notebooks.

"The IPython Notebook is now known as the Jupyter Notebook. It is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media. It is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more."

- Language: The Notebook has support for over 40 programming languages, including those popular in Data Science such as Python, R, Julia and Scala.

- Sharing: Notebooks can be shared with others using email, Dropbox, GitHub and the Jupyter Notebook Viewer.

- Widgets (apps): Code can produce rich output such as images, videos, LaTeX, and JavaScript. Interactive widgets can be used to manipulate and visualize data in realtime.


For more details on the Jupyter Notebook, please see the Jupyter website http://jupyter.org/


Steps to open a new notebook (you can open as many as you want!):


![NewNotebook](images/new_notebook.png)



This is how a new IPython notebook looks:




![NewNotebook](images/notebook.png)





## Text files and scripts in Jupyter

In Jupyter, we can also create new text files or scripts.

A script is a text file containing a sequence of commands. The commands are written in a particular programming language (e.g. Python) and can be executed as a program without user interaction. We can usually tell the language of a script from its file extension. For example, a file called "script_example.py" contains Python commands, and a file called "script_example.R" contains R commands.

Steps to open a new text file:

![NewText](images/new_text.png)

***

This is how it looks:

![Text](images/text.png)

***

Now we can change the language and write some examples of Python commands.

We should also rename the file so that it ends in `.py` (for example, `script_example.py`), so that we can run it as a Python script later.

![Language](images/selectlanguage.png)
![Language](images/script.png)


Why do we need scripts? They are a good way to store commands that will be used frequently. Instead of rewriting the same commands in many notebooks, we can put them in a script and call them from there. You will see how I do this in later labs!
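As a minimal sketch of this idea, the file below shows what a small reusable script might contain. The function name `describe` and its contents are made up for illustration; they are not part of the class material.

```python
# Contents of a hypothetical script file, e.g. script_example.py

def describe(values):
    """Return the count, minimum, maximum and mean of a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


# This part runs only when the file is executed as a program
# (for example with `python script_example.py`), not when it is imported.
if __name__ == "__main__":
    print(describe([1.0, 2.0, 3.0, 6.0]))
```

If this file were saved as `script_example.py` in the same folder as a notebook, the notebook could reuse the function with `from script_example import describe` instead of redefining it.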


## Command line in Jupyter

The command line is a text-based way to interact with a computer. From the command line (also known as the terminal or shell) you can perform almost any operation that you would normally use a mouse and graphical user interface (GUI) for. In some cases, such as dealing with raw data files, the command line offers a quick way to start exploring. We can also use it to run scripts (like the one we just saw in the previous step). You have a terminal available in the Amazon system, and there is also one on your own computer for your local system.

For our class, we will only use the terminal to update our class material. This means that each time you want to get the new files that I have put on the web (https://github.com/mariazm/Spring2017_ProfFosterProvost.git) you will need to open the terminal and run the command:   ~/sync_notebooks.sh

(For more details, see the installation assignment. Remember that running this command will replace all files in your "Class_files" folder.)


Steps to open the terminal in Jupyter:

![NewTerminal](images/new_terminal.png)

This is how it looks when you open the terminal in Jupyter and write some commands:

![Terminal](images/terminal_2017.png)



## Command line tasks in a Jupyter IPython Notebook


Since we are not using the terminal, we can use the IPython notebook to communicate with the command line instead.

You can run shell commands (such as the ones below) in an IPython notebook by prefixing the line with an exclamation point (`!`).


#### Interaction with files and folders

We can navigate the folder structure of whatever machine we are working on. For this you will typically use commands such as `ls` (list) and `cd` (change directory). You can make a directory with `mkdir`, or move (`mv`) and copy (`cp`) files. To delete a file you can `rm` (remove) it. To print the contents of a file you can `cat` (concatenate) it to the screen.

Many commands have options you can set when running them. For example, to get a detailed listing with one file per line you can pass the `-l` (long format) flag, e.g. `ls -l`. During the normal course of using the command line, you will learn the most useful flags. If you want to see all possible options you can always read the `man` (manual) page for a command, e.g. `man ls`. When you are done reading the `man` page, you can exit by hitting `q` to quit.


In [1]:
!ls

data	Ipython notebooks and files 2017.ipynb
images	Programming Structures and Python Tour 2017.ipynb


In [2]:
!mkdir test

In [3]:
!ls images/

new_notebook.png  new_text.png	script.png	    terminal_2017.png  text.png
new_terminal.png  notebook.png	selectlanguage.png  terminal.png


In [4]:
!cp images/terminal.png test/some_picture.png

In [5]:
!ls test/

some_picture.png


In [6]:
# WARNING: THIS WILL DELETE THE TEST FOLDER JUST CREATED
!rm -rf test/

In [7]:
!ls

data	Ipython notebooks and files 2017.ipynb
images	Programming Structures and Python Tour 2017.ipynb
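
The same folder operations can also be done from plain Python, which is handy inside a script where the `!` prefix is not available. The sketch below mirrors the cells above using the standard `pathlib` and `shutil` modules. It works in a throwaway temporary folder, and the "image" is only a placeholder file, so nothing in the class folder is touched.

```python
import shutil
import tempfile
from pathlib import Path

# Use a throwaway folder so nothing in the class folder is changed.
base = Path(tempfile.mkdtemp())
(base / "images").mkdir()
# Stand-in file for illustration; it is not a real image.
(base / "images" / "terminal.png").write_bytes(b"placeholder")

test_dir = base / "test"
test_dir.mkdir()  # like: mkdir test

# like: cp images/terminal.png test/some_picture.png
shutil.copy(base / "images" / "terminal.png", test_dir / "some_picture.png")

# like: ls test/
copied = sorted(p.name for p in test_dir.iterdir())
print(copied)

# like: rm -rf test/  (deletes the folder and everything in it)
shutil.rmtree(test_dir)

# like: ls
remaining = sorted(p.name for p in base.iterdir())
print(remaining)

# Clean up the throwaway folder itself.
shutil.rmtree(base)
```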


#### Data manipulation and exploration
Virtually anything you want to do with a data file can be done at the command line. There are dozens of commands that can be used together to get almost any result! Let's take a look at the file `data/users.csv`.

Before we do anything, let's take a look at the first few lines of the file to get an idea of what's in it.

In [8]:
!head data/users.csv

user,variable1,variable2
parallelconcerned,145.391881,-6.081689
driftmvc,145.7887,-5.207083
snowdonevasive,144.295861,-5.826789
cobolglaucous,146.726242,-6.569828
stylishmugs,147.22005,-9.443383
hypergalaxyfibula,143.669186,-3.583828
pipetsrockers,-45.425978,61.160517
bracesworkable,-51.678064,64.190922
spiritedjump,-50.689325,67.016969


Maybe we want to see a few more lines of the file,

In [9]:
!head -15 data/users.csv

user,variable1,variable2
parallelconcerned,145.391881,-6.081689
driftmvc,145.7887,-5.207083
snowdonevasive,144.295861,-5.826789
cobolglaucous,146.726242,-6.569828
stylishmugs,147.22005,-9.443383
hypergalaxyfibula,143.669186,-3.583828
pipetsrockers,-45.425978,61.160517
bracesworkable,-51.678064,64.190922
spiritedjump,-50.689325,67.016969
barnevidence,-68.703161,76.531203
emeraldclippers,-18.072703,65.659994
maintainwiggly,-14.401389,65.283333
submittedwavelength,-15.227222,64.295556
clucklinnet,-17.425978,65.952328


How about the last few lines of the file?

In [10]:
!tail data/users.csv

troubledseptum,135.521667,-29.716667
troubledseptum,-118.598889,34.256944
organicmajor,-5.435,36.136
cobolglaucous,-123.5,48.85
troubledseptum,-124.016667,49.616667
snaildossier,-124.983333,50.066667
unbalancedprotoplanet,-127.028611,50.575556
badgefields,-126.833333,50.883333
backedammeter,-123.00596,48.618397
clucklinnet,-117.1995,32.7552
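
For comparison, here is a rough Python sketch of `head` and `tail`. The helper names `head` and `tail` are my own, and the sample is just a few rows copied from the `head` output above, standing in for the real file.

```python
import io
from collections import deque
from itertools import islice

# Header plus four rows copied from the `head` output above,
# standing in for data/users.csv.
sample = io.StringIO(
    "user,variable1,variable2\n"
    "parallelconcerned,145.391881,-6.081689\n"
    "driftmvc,145.7887,-5.207083\n"
    "snowdonevasive,144.295861,-5.826789\n"
    "cobolglaucous,146.726242,-6.569828\n"
)

def head(lines, n=10):
    """Return the first n lines, similar to `head -n`."""
    return list(islice(lines, n))

def tail(lines, n=10):
    """Return the last n lines, similar to `tail -n`.

    The deque keeps at most n lines in memory at any time.
    """
    return list(deque(lines, maxlen=n))

first_two = head(sample, 2)
sample.seek(0)  # rewind before reading the data again
last_two = tail(sample, 2)
print(first_two)
print(last_two)
```

With the real file you would pass an open file instead of `sample`, e.g. `with open("data/users.csv") as f: head(f)`.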


We can count how many lines are in the file by using `wc` (a word counting tool) with the `-l` flag to count lines,

In [11]:
!wc -l data/users.csv

8104 data/users.csv
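
Note that `wc -l` counts the header row too, so the 8104 lines above are presumably one header line plus 8103 records. A minimal Python sketch of the same count, using a small sample copied from the output above rather than the real file (the name `count_lines` is my own):

```python
import io

# Header plus three rows copied from the `head` output above,
# standing in for data/users.csv.
sample = io.StringIO(
    "user,variable1,variable2\n"
    "parallelconcerned,145.391881,-6.081689\n"
    "driftmvc,145.7887,-5.207083\n"
    "snowdonevasive,144.295861,-5.826789\n"
)

def count_lines(lines):
    """Count lines one at a time, without loading the whole file into memory."""
    return sum(1 for _ in lines)

total = count_lines(sample)
print(total)      # counts the header row as well
print(total - 1)  # number of data records
```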


It looks like there are three columns in this file, so let's take a look at the first one alone. Here, we can `cut` the field (`-f`) we want as long as we give the proper delimiter (`-d` defaults to tab).

In [12]:
!cut -f1 -d',' data/users.csv 

user
parallelconcerned
driftmvc
snowdonevasive
cobolglaucous
stylishmugs
hypergalaxyfibula
pipetsrockers
bracesworkable
spiritedjump
barnevidence
emeraldclippers
maintainwiggly
submittedwavelength
clucklinnet
bluetailgodwottery
microwavejar
croutonwrack
submittedwavelength
moderatohorn
heaterinert
micaassistant
gaudyfea
turnoverlovesick
amuckpoints
allegatorwafers
expecteffective
mincegaiters
peacefulceaseless
decanterbalance
synonympatisserie
starbucksbluetail
pipeathlete
radicandoceanic
somethingalbedo
craytugofwar
pipetsrockers
unbalancedprotoplanet
emeraldclippers
ischemicfrosted
binomialapathetic
stairsgobsmacked
ledgeindeed
badgefields
synonympatisserie
worldlyventuri
globeshameful
alloweruptions
burritoscarriage
grabbig
dronessomersault
latticelaboratory
ellipticalfabricator
amuckpoints
guavaconfide
fundingticket
croutonwrack
elatedunicorn
freelysociable
loindecorate
micaassistant
dweebspices
latticelaboratory
babyam
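
In Python, one way to pull out a single column is the standard `csv` module. This is only a sketch: the helper name `first_column` is my own, and the sample is a few rows copied from the output above. Unlike `cut`, the `csv` module also copes with quoted fields that contain commas.

```python
import csv
import io

# Header plus three rows copied from the `head` output above,
# standing in for data/users.csv.
sample = io.StringIO(
    "user,variable1,variable2\n"
    "parallelconcerned,145.391881,-6.081689\n"
    "driftmvc,145.7887,-5.207083\n"
    "snowdonevasive,144.295861,-5.826789\n"
)

def first_column(lines, delimiter=","):
    """Return the first field of every non-empty row, similar to `cut -f1 -d','`."""
    return [row[0] for row in csv.reader(lines, delimiter=delimiter) if row]

users = first_column(sample)
print(users)
```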

That's a lot of output. Let's combine the `cut` command with the `head` command by _piping_ the output of one command into another one,

In [13]:
!cut -f1 -d',' data/users.csv | head

user
parallelconcerned
driftmvc
snowdonevasive
cobolglaucous
stylishmugs
hypergalaxyfibula
pipetsrockers
bracesworkable
spiritedjump
cut: write error: Broken pipe


We can use pipes (`|`) to string together many commands to create very powerful one-liners. For example, let's get the number of unique users in the first column. We will get all values from the first column, sort them, find all unique values, and then count the number of lines,

In [14]:
!cut -f1 -d',' data/users.csv | sort | uniq | wc -l

201
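
The shell pipeline treats the header value `user` as just another entry, so the 201 above presumably includes it. A Python sketch of the same idea can skip the header by using `csv.DictReader` and collect the names in a set, which keeps only unique values. The sample is a header row plus six rows copied from the `tail` output above, not the real file.

```python
import csv
import io

# Header row plus six rows copied from the `tail` output above,
# standing in for data/users.csv.
sample = io.StringIO(
    "user,variable1,variable2\n"
    "troubledseptum,135.521667,-29.716667\n"
    "troubledseptum,-118.598889,34.256944\n"
    "organicmajor,-5.435,36.136\n"
    "cobolglaucous,-123.5,48.85\n"
    "troubledseptum,-124.016667,49.616667\n"
    "snaildossier,-124.983333,50.066667\n"
)

# DictReader uses the first row as column names, so the header is not counted.
unique_users = {row["user"] for row in csv.DictReader(sample)}
print(len(unique_users))
print(sorted(unique_users))
```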


Or, we can get a list of the top 10 most frequently occurring users. If we give `uniq` the `-c` flag, it will return the number of times each value occurs. Since these counts are the first entry in each new line, we can tell `sort` to expect numbers (`-n`) and to give us the results in reverse (`-r`) order. Note that when you want to use two or more single-letter flags, you can just place them one after another.

In [15]:
!cut -f1 -d',' data/users.csv | sort | uniq -c | sort -nr | head

     59 compareas
     56 upbeatodd
     56 burntrifle
     56 binomialapathetic
     54 frequencywould
     54 ellipticalfabricator
     53 globeshameful
     52 badgefields
     52 ashamedmuscles
     51 alloweruptions
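
A comparable count in Python could use `collections.Counter`, whose `most_common` method plays the role of `sort | uniq -c | sort -nr | head`. Again this is a sketch on a tiny sample (a header row plus rows copied from the `tail` output above), not on the real file.

```python
import csv
import io
from collections import Counter

# Header row plus six rows copied from the `tail` output above,
# standing in for data/users.csv.
sample = io.StringIO(
    "user,variable1,variable2\n"
    "troubledseptum,135.521667,-29.716667\n"
    "troubledseptum,-118.598889,34.256944\n"
    "organicmajor,-5.435,36.136\n"
    "cobolglaucous,-123.5,48.85\n"
    "troubledseptum,-124.016667,49.616667\n"
    "snaildossier,-124.983333,50.066667\n"
)

# Count how many times each user name appears in the first column.
counts = Counter(row["user"] for row in csv.DictReader(sample))

# most_common(n) returns the n most frequent values with their counts.
top = counts.most_common(1)
print(top)
```

On the full file, `counts.most_common(10)` would give the ten most frequent users.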


After some exploration we decide we want to keep only part of our data and bring it into a new file. Let's find all the records that have a negative value in both the second and third columns and put these results in a file called `data/negative_users.csv`. Searching through files can be done using _[regular expressions](http://www.robelle.com/smugbook/regexpr.html#expression)_ with a tool called `grep` (short for "global regular expression print"). The pattern below matches any line in which a comma is immediately followed by a minus sign twice; since our file has three columns, those are the lines where the second and third values are both negative. You can direct output into a file using a `>`.

In [16]:
!grep '.*,-.*,-.*' data/users.csv > data/negative_users.csv

We can check the data folder to see if our new file is in there,

In [17]:
!ls data

ds_survey.csv  negative_users.csv  users.csv
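
The same filter can be sketched in Python by comparing the values as numbers instead of matching text. The helper name `negative_rows` is my own, the output goes to a temporary folder rather than `data/`, and because none of the rows shown earlier has two negative values, the last sample row is invented purely for illustration.

```python
import csv
import io
import tempfile
from pathlib import Path

sample = io.StringIO(
    "user,variable1,variable2\n"
    "parallelconcerned,145.391881,-6.081689\n"  # copied from the output above
    "pipetsrockers,-45.425978,61.160517\n"      # copied from the output above
    "exampleuser,-1.5,-2.5\n"                   # invented row, for illustration only
)

def negative_rows(lines):
    """Yield rows whose second and third fields are both negative numbers."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    for row in reader:
        if float(row[1]) < 0 and float(row[2]) < 0:
            yield row

# Write the matching rows to a new file, like `> data/negative_users.csv`.
out_path = Path(tempfile.mkdtemp()) / "negative_users.csv"
with open(out_path, "w", newline="") as f:
    csv.writer(f).writerows(negative_rows(sample))

kept = out_path.read_text().splitlines()
print(kept)
```

Like the `grep` command, this writes no header row to the new file.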
