# Lecture 4: Argument parsing, Version control
## October 1, 2019
## Tristan Glatard


# Objectives for today
* Customize scripts with arguments
* Get introduced to version control
* Improve our file reading skills


# Back to the Airbnb dataset (week 3 homework)

Parsing the first 10 lines only requires `split`:

In [57]:
file_name = 'airbnb/first_10.csv'
with open(file_name) as f:  # the file is closed automatically after reading
    lines = f.readlines()
for line in lines[1:]:
    columns = line.split(',')
    price = columns[9]
    print(price)

149
225
150
89
80
200
60
79
79
150


The assignment asked us to print only the lines where the price is in the range [100, 150]:

In [58]:
file_name = 'airbnb/first_10.csv'
with open(file_name) as f:  # the file is closed automatically after reading
    lines = f.readlines()
for line in lines[1:]:
    columns = line.split(',')
    price = int(columns[9])
    if price >= 100 and price <= 150:
        print(line)

2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365

3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365

5238,Cute & Cozy Lower East Side 1 bdrm,7549,Ben,Manhattan,Chinatown,40.71344,-73.99037,Entire home/apt,150,1,160,2019-06-09,1.33,4,188



# Script arguments

On many occasions, we would like to run the same script with slight variations in parameters, for instance:
* A different file name
* A different price range

Modifying the script for each variation is inconvenient. Instead, we will use script arguments. Arguments are:
* Passed on the command line, for instance `python myscript.py a b c` passes 3 arguments to `myscript.py`, with values `"a"`, `"b"` and `"c"`.
* Accessible by the script through a list called `sys.argv`, whose first element (`sys.argv[0]`) is the script name; the arguments start at `sys.argv[1]`.
* Always of type string: you should convert them to numbers if necessary.
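
As a minimal sketch of the points above, the hypothetical helper `read_arguments` below extracts a file name and a price range from a list shaped like `sys.argv`. The function name, the order of the arguments and the usage message are assumptions made for illustration.

```python
def read_arguments(argv):
    """Return (file_name, min_price, max_price) from a list shaped like sys.argv."""
    # argv[0] is the script name, so the 3 arguments are at positions 1 to 3
    if len(argv) != 4:
        raise ValueError("usage: python myscript.py FILE MIN_PRICE MAX_PRICE")
    file_name = argv[1]
    # arguments are always strings: convert the prices to integers
    min_price = int(argv[2])
    max_price = int(argv[3])
    return file_name, min_price, max_price

# Example with a hand-written list instead of the real sys.argv
print(read_arguments(['myscript.py', 'airbnb/first_10.csv', '100', '150']))
```

In a real script, you would `import sys` and call `read_arguments(sys.argv)` instead of passing a hand-written list.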

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Antu_task-complete.svg/1024px-Antu_task-complete.svg.png" width=50 align=left/>

* Create a script with the following code and run it with arguments:
```
import sys
print(sys.argv)
```
* What does it do?

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Antu_task-complete.svg/1024px-Antu_task-complete.svg.png" width=50 align=left/>
    
* Modify your Airbnb homework script to pass the file name and price range as arguments

# Git and GitHub

As you may have noticed, scripts evolve rapidly, and it is sometimes important to keep track of their evolution. To keep things manageable, good software engineers use a *version control system*. [Git](https://git-scm.com) is currently the most popular version control system. Git allows developers to:
* Keep track of the history of a script
* Work collaboratively on the same scripts
* Share their code on an online platform: [GitHub](https://github.com).

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Antu_task-complete.svg/1024px-Antu_task-complete.svg.png" width=50 align=left/>

* Create an account on [GitHub](https://github.com) if you don't have one already
* Look at projects hosted on GitHub, such as [scikit-learn](https://github.com/scikit-learn/scikit-learn), [Apache Spark](https://github.com/apache/spark) or [streamlit](https://github.com/streamlit/streamlit). What type of information can you get from GitHub?

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Antu_task-complete.svg/1024px-Antu_task-complete.svg.png" width=50 align=left/>

* Create a new repository in your GitHub account
* Download and install the [GitHub Desktop app](https://desktop.github.com/)
* Add your "Airbnb" homework script to the repository (you can use the solution shown above if you prefer)
* Note: if you don't want this code to be public, you can make the repository private under "Settings".

# File reading, continued

Our current solution for parsing the Airbnb dataset doesn't work on the entire file:

In [64]:
file_name = 'airbnb/complete.csv'
with open(file_name) as f:  # the file is closed automatically after reading
    lines = f.readlines()
for line in lines[1:]:
    columns = line.split(',')
    price = int(columns[9])
    if price >= 100 and price <= 150:
        print(line)

2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365

3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365

5238,Cute & Cozy Lower East Side 1 bdrm,7549,Ben,Manhattan,Chinatown,40.71344,-73.99037,Entire home/apt,150,1,160,2019-06-09,1.33,4,188

5295,Beautiful 1br on Upper West Side,7702,Lena,Manhattan,Upper West Side,40.80316,-73.96545,Entire home/apt,135,5,53,2019-06-22,0.43,1,6



ValueError: invalid literal for int() with base 10: '40.66829'

Let's use a `try`/`except` clause (called "try-catch" in some other languages) to print the line where the parsing error occurs.

In [68]:
import sys
file_name = 'airbnb/complete.csv'
with open(file_name) as f:  # the file is closed automatically after reading
    lines = f.readlines()
for line in lines[1:]:
    columns = line.split(',')
    try:
        price = int(columns[9])
    except (ValueError, IndexError):  # catch parsing errors only, not every exception
        print("Parsing ERROR: ", line)
        sys.exit(1)  # this stops the program
    if price >= 100 and price <= 150:
        print(line)

2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365

3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365

5238,Cute & Cozy Lower East Side 1 bdrm,7549,Ben,Manhattan,Chinatown,40.71344,-73.99037,Entire home/apt,150,1,160,2019-06-09,1.33,4,188

5295,Beautiful 1br on Upper West Side,7702,Lena,Manhattan,Upper West Side,40.80316,-73.96545,Entire home/apt,135,5,53,2019-06-22,0.43,1,6

Parsing ERROR:  5803,"Lovely Room 1, Garden, Best Area, Legal rental",9744,Laurie,Brooklyn,South Slope,40.66829,-73.98779,Private room,89,4,167,2019-06-24,1.34,3,314



SystemExit: 1
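
The `try`/`except` pattern used above can also be isolated in a small function. The sketch below is only an illustration: the helper name `parse_price` is introduced here, and it returns `None` instead of stopping the program, which is a different design choice from the cell above.

```python
def parse_price(text):
    """Convert text to an integer price, or return None if it is not a whole number."""
    try:
        return int(text)
    except ValueError:
        # int() raises ValueError for strings such as '40.66829' or ''
        return None

print(parse_price('149'))       # a valid price
print(parse_price('40.66829'))  # the value that triggered the error above
```

Returning `None` lets the caller decide whether to skip the line, report it, or stop.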

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Antu_task-complete.svg/1024px-Antu_task-complete.svg.png" width=50 align=left/>

Check the data file: 
* why is line 14 not parsed correctly by our program?
* how can we fix this problem and parse the file correctly?
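
One way to investigate is to split the offending line by hand and look at what lands in each column. The sketch below uses the line reported in the error message above; the variable names are chosen for illustration.

```python
line = '5803,"Lovely Room 1, Garden, Best Area, Legal rental",9744,Laurie,Brooklyn,South Slope,40.66829,-73.98779,Private room,89,4,167,2019-06-24,1.34,3,314'
columns = line.split(',')

# The lines parsed successfully so far have 16 columns: compare with this one
print(len(columns))

# Print each column with its index to see where the shift happens
for index, column in enumerate(columns):
    print(index, column)
```

The commas inside the quoted name create extra columns, so index 9 no longer holds the price.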

Here is a possible solution to address the "quoting" issue:

In [67]:
for line in lines[1:]:
    columns = line.split(',')
    clean_columns = []  # this will contain the correctly-parsed columns
    in_quote = False  # True while we are inside a quoted field
    for x in columns:  # we go through the "dirty" columns of data
        if not in_quote:
            clean_columns.append(x)  # we aren't in a quote: x starts a new column
            if '"' in x:  # x contains a quote: the next columns belong to the same field
                in_quote = True
        else:
            # we are in a quote: we mustn't create a new column. Instead, we append x
            # to the last element of clean_columns, restoring the comma removed by split
            clean_columns[-1] += ',' + x
            if '"' in x:  # this is our closing quote: the next column is out of the quote
                in_quote = False
    try:
        price = int(clean_columns[9])
    except (ValueError, IndexError):  # catch parsing errors only
        print("Parsing ERROR: ",line)
        sys.exit(1)
    if price >= 100 and price <= 150:
        print(line)

2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365

3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365

5238,Cute & Cozy Lower East Side 1 bdrm,7549,Ben,Manhattan,Chinatown,40.71344,-73.99037,Entire home/apt,150,1,160,2019-06-09,1.33,4,188

5295,Beautiful 1br on Upper West Side,7702,Lena,Manhattan,Upper West Side,40.80316,-73.96545,Entire home/apt,135,5,53,2019-06-22,0.43,1,6

6090,West Village Nest - Superhost,11975,Alina,Manhattan,West Village,40.7353,-74.00525,Entire home/apt,120,90,27,2018-10-31,0.22,1,0

6848,Only 2 stops to Manhattan studio,15991,Allen & Irina,Brooklyn,Williamsburg,40.70837,-73.95352,Entire home/apt,140,2,148,2019-06-29,1.20,1,46

7322,Chelsea Perfect,18946,Doti,Manhattan,Chelsea,40.74192,-73.99501,Private room,140,1,260,2019-07-01,2.12,1,12

8024,CBG CtyBGd HelpsHaiti rm#1:1-4,22486,Lisel,Brooklyn,Park Slope,40.680

SystemExit: 1

We still have an issue!

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Antu_task-complete.svg/1024px-Antu_task-complete.svg.png" width=50 align=left/>

Check the data file: 
* why is line 461 not parsed correctly by our program?
* how can we fix these 2 problems and parse the file correctly?
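
Both problems come down to counting quotes rather than only checking for their presence. The sketch below uses hypothetical column values (they are not taken from the dataset) and a helper name, `opens_or_closes_quote`, introduced for illustration. It assumes that a quote inside a quoted field is escaped by doubling it (`""`).

```python
def opens_or_closes_quote(column):
    """Return True if the column changes the 'inside a quote' state."""
    # Doubled quotes ("") are escaped quotes: they never open or close a field
    unescaped = column.replace('""', '')
    # An odd number of remaining quotes means a field was opened or closed
    return unescaped.count('"') % 2 != 0

# Hypothetical pieces obtained after splitting a line on commas
print(opens_or_closes_quote('"Lovely Room 1'))      # opening quote only
print(opens_or_closes_quote('"Cozy room"'))         # opened and closed in the same piece
print(opens_or_closes_quote(' the ""best"" view'))  # escaped quotes only
```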

Here is a possible solution to address the "escaped quote" and the "multiple quote" issues:

In [78]:
for line in lines[1:]:
    columns = line.split(',')
    clean_columns = []  # this will contain the correctly-parsed columns
    in_quote = False  # True while we are inside a quoted field
    for x in columns:  # we go through the "dirty" columns of data
        clean_x = x.replace('""', '')  # we remove the "escaped" quotes from x
        n_quotes = clean_x.count('"')
        if not in_quote:
            clean_columns.append(clean_x)  # we aren't in a quote: clean_x starts a new column
            if n_quotes % 2 != 0:  # odd number of quotes: a quote is opened but not closed
                in_quote = True
        else:
            # we are in a quote: we mustn't create a new column. Instead, we append clean_x
            # to the last element of clean_columns, restoring the comma removed by split
            clean_columns[-1] += ',' + clean_x
            if n_quotes % 2 != 0:  # odd number of quotes: this closes the quote
                in_quote = False
    try:
        price = int(clean_columns[9])
    except (ValueError, IndexError):  # catch parsing errors only
        print("Parsing ERROR: ",line)
        sys.exit(1)
    if price >= 100 and price <= 150:
        print(line)
    

2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365

3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365

5238,Cute & Cozy Lower East Side 1 bdrm,7549,Ben,Manhattan,Chinatown,40.71344,-73.99037,Entire home/apt,150,1,160,2019-06-09,1.33,4,188

5295,Beautiful 1br on Upper West Side,7702,Lena,Manhattan,Upper West Side,40.80316,-73.96545,Entire home/apt,135,5,53,2019-06-22,0.43,1,6

6090,West Village Nest - Superhost,11975,Alina,Manhattan,West Village,40.7353,-74.00525,Entire home/apt,120,90,27,2018-10-31,0.22,1,0

6848,Only 2 stops to Manhattan studio,15991,Allen & Irina,Brooklyn,Williamsburg,40.70837,-73.95352,Entire home/apt,140,2,148,2019-06-29,1.20,1,46

7322,Chelsea Perfect,18946,Doti,Manhattan,Chelsea,40.74192,-73.99501,Private room,140,1,260,2019-07-01,2.12,1,12

8024,CBG CtyBGd HelpsHaiti rm#1:1-4,22486,Lisel,Brooklyn,Park Slope,40.680

SystemExit: 1

We still have an issue!

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Antu_task-complete.svg/1024px-Antu_task-complete.svg.png" width=50 align=left/>

Check the data file: 
* why is line 689 not parsed correctly by our program?
* how can we fix this problem and parse the file correctly?

(This is left as homework)
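
For comparison, Python's standard library includes a `csv` module whose reader handles quoted fields and doubled quotes. The sketch below is not the expected homework solution. It only shows the module on two lines held in memory: the first is copied from the output above, the second is hypothetical. The price is assumed to be in column 9, as in the rest of this lecture.

```python
import csv
import io

# Two lines in the Airbnb format; the second one is hypothetical
data = io.StringIO(
    '2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,'
    '40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365\n'
    '1,"Room with ""view"", near park",2,Ann,Brooklyn,Kensington,'
    '40.6,-73.9,Private room,120,1,0,,,1,365\n'
)

prices = []
for columns in csv.reader(data):
    # csv.reader returns each row as a list of strings, with quoting already handled
    prices.append(int(columns[9]))

print(prices)
```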