# Lab-P6: Real-world Datasets (Airbnb)

In [None]:
## Please make sure "airbnb.csv" is in your "lab6" folder.
import csv

## Segment 2: Loading Data from CSVs

### Task 2.1: Process the CSV file
[Chapter 14](https://automatetheboringstuff.com/chapter14/) of Automate the Boring Stuff introduces CSV files and provides a code snippet we can reuse. We will use the same code snippet for p6.
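
As a minimal sketch of what `csv.reader` produces, the example below parses a small in-memory CSV rather than a real file. The sample text and the names `sample_text`, `sample_file`, and `rows` are made up for illustration; `io.StringIO` simply stands in for an open file.

```python
import csv
import io

# Hypothetical two-column CSV text, standing in for a file on disk
sample_text = "city,visits\nMadison,3\nChicago,5\n"

# io.StringIO behaves like an open text file, so csv.reader can read from it
with io.StringIO(sample_text) as sample_file:
    rows = list(csv.reader(sample_file))

# Each line of the CSV becomes one inner list
print(rows)
```

Run it and look closely at how the numbers are displayed; that detail matters later in the lab.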

In [None]:
def process_csv(filename):
    example_file = open(filename, encoding="utf-8")
    example_reader = csv.reader(example_file)
    example_data = list(example_reader)
    example_file.close()
    return example_data

In [None]:
# Use process_csv to pull out the header and data rows
csv_rows = process_csv("airbnb.csv")

# Use indexing to extract the first inner list
csv_header = csv_rows[???] # A list of the column headers

# Use slicing to slice all the inner lists, except the first one
csv_data = csv_rows[???] # The entire CSV data set besides the headers

### About the dataset
The `airbnb.csv` file has data about nearly 50,000 listings on Airbnb from New York City, NY from the year 2019. Each row in the file contains data about a single listing. The columns contain the following data about each listing (along with the correct data type to represent it):

1. `room_id` - The ID of the room listing (str)
2. `name` - The name of the room listing (str)
3. `host_id` - The ID of the host for the room listing (str)
4. `host_name` - The name of the host for the room listing (str)
5. `neighborhood_group` - The group of neighborhoods the room is in (str)
6. `neighborhood` - The neighborhood the room is in (str)
7. `latitude` - The latitude where the room is located (float)
8. `longitude` - The longitude where the room is located (float)
9. `room_type` - The type of room (str)
10. `price` - The price per night for the room in US dollars (int)
11. `minimum_nights` - The minimum amount of nights the room can be booked for (int)
12. `number_of_reviews` - The total number of reviews the room has received (int)
13. `last_review` - The date of the most recent review in the form yyyy-mm-dd (str)
14. `reviews_per_month` - How many reviews per month the room receives (float)
15. `calculated_host_listings_count` - How many listings the host of the room has (int)
16. `availability_365` - How many days per year the listing is available for (int)

<b>Note:</b> Keep in mind while writing your project that some entries may be missing data for specific columns. Sadly, data in real life is often messy, and in p6 we will have to deal with missing data.

### Task 2.2: Access the contents of the dataset
Index the data to extract the correct answers for the questions listed below. Some have been done for you. To understand the results better, locate the values in the airbnb.csv file.

In [None]:
# Question: What are the names of the columns in the dataset?
csv_header # we did this one for you

In [None]:
# Question: How many rows of data (excluding header) are present in the dataset?
len(csv_data) # we did this one for you

Open `airbnb.csv` to verify your answers.

In [None]:
# Question: What values are present in the first row in the dataset (use indexing)?


In [None]:
# Question: Display the first five rows of data in the dataset (using slicing).


To get the data in a cell of the CSV, we need two things:
  1. row index
  2. column index
    
Indexing syntax will be `csv_data_variable[row_idx][col_idx]`
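
The double-indexing pattern can be sketched on a small nested list. The `grid` variable below is a made-up stand-in, not the lab dataset.

```python
# Hypothetical table: each inner list is one row
grid = [
    ["a", "b", "c"],
    ["d", "e", "f"],
]

second_row = grid[1]   # indexing once gives a whole row
value = grid[1][2]     # indexing twice gives a single cell
print(second_row, value)
```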

In [None]:
# Question: What is the host name of the first Airbnb listing?
# TODO: determine the row index and column index
required_val = csv_data[???][???]

print("Expected: John Actual:", required_val)

#### `index` method:

- available on lists (strings and tuples have an `index` method too)
- enables us to lookup the index of a value inside a list
- enables us to avoid hardcoding the index of the columns

Syntax: `list_variable.index(column_name)`
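
Here is a small sketch of `index` on a made-up header row. The names `header`, `row`, and `color_idx` are illustrative only.

```python
# Hypothetical header row and data row
header = ["fruit", "color", "weight"]
row = ["apple", "red", "150"]

color_idx = header.index("color")   # position of "color" within header
print(color_idx, row[color_idx])

# Looking up a value that is not in the list raises ValueError
try:
    header.index("price")
except ValueError:
    print("'price' is not a column in this header")
```

Because the lookup is by name, the code keeps working even if the column order of the file changes.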

In [None]:
# Question: What is the index of the column "neighborhood_group" ?
group_index = csv_header.index(???)
print("Expected: 4. Actual:", group_index)

In [None]:
# Question: What is the first value in the "neighborhood_group" column?

# Use the variable you declared in the previous cell
# to get the first value in the column "neighborhood_group"
group_value = csv_data[0][???]
print("Expected: Brooklyn. Actual:", group_value)

### Task 2.3: Build a helper function for quick data access
You'll use the following function as the basis for accessing data in p6, but first you need to fill in some missing pieces:

In [None]:
def cell(row_idx, col_name):
    """
    Returns the data value (cell) corresponding to the row index and 
    the column name of a CSV file.
    """
    col_idx = csv_header.index(???)
    val = csv_data[???][col_idx]
    if val == "": # missing value
        return None
    return val

Is your implementation correct? Test it with the following:

In [None]:
print("Expected: Kensington\t Actual:", cell(0, "neighborhood"))

In [None]:
print("Expected: Skylit Midtown Castle\t Actual:", cell(1, "name"))

For the cell below, complete the function call and test your result.

In [None]:
# Use the cell function to get the price of the 3rd room in the dataset
required_price = cell(???, ???)
print("Expected: 150\t Actual:", required_price)

**Important**: Raise your hand and confirm your implementation with a TA. We'll use the `cell` function for all remaining tasks in this lab and throughout the project.

## Segment 3: Sorting Data
There are two major ways to sort lists in Python: (1) with the `sorted` function and (2) with the `.sort` method. For each method, let's examine (a) how it modifies existing structures, and (b) what new values it returns, if any.

The default sort order is ascending. To sort in descending order, pass the keyword argument `reverse = True`. The same keyword argument works for both the `sort` method and the `sorted` function.
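
As a quick sketch of the `reverse` keyword argument, here it is applied both ways to a made-up list of numbers (`scores` is not part of the lab data):

```python
scores = [30, 10, 20]   # hypothetical values

# sorted() with reverse=True builds a new list in descending order
descending = sorted(scores, reverse=True)
print(descending)

# .sort() accepts the same keyword argument
scores.sort(reverse=True)
print(scores)
```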

### Task 3.1: Sort lists using `.sort()`
Use the `cell` function to fetch the neighborhood names of the first three rows in the dataset and sort them using `.sort()`

In [None]:
# Fetch the neighborhood names for the first three rows in the dataset
neighborhood1 = cell(0, "neighborhood")
neighborhood2 = cell(???, ???)
neighborhood3 = cell(???, ???)

#Initialize a list with the three neighborhood names as elements
neighborhoods = ???

print("Expected list before sorting: ['Kensington', 'Midtown', 'Harlem']")
print("Actual list before sorting:", neighborhoods)

#Sort the neighborhoods
result = neighborhoods.sort()

print("List after sorting:", neighborhoods)
print("Returned value:", result)

#Descending order sort
neighborhoods.sort(reverse = True)
print("Reverse order sorting:", neighborhoods)

As you can see, the original list has been updated, while no value has been returned by the `.sort` method. What does this tell you about what `.sort` does to existing structures? What does it return?

### Task 3.2: Sort lists using `sorted()`

Now, use the `sorted` function to complete the same task as above. That is, fetch the names of the neighborhoods in the first three rows of the dataset. This time, use the `.append()` method and a `for` loop to add entries to the list.

In [None]:
neighborhoods = list() # this creates an empty list
for row_idx in range(???): # Iterate over the indices of the first 3 rows in the dataset
    neighborhoods.append(cell(???, ???))

print("Original list before sorting:", neighborhoods)

#Sort the neighborhoods and assign the sorted list to a new variable.
sorted_neighborhoods = sorted(???)

print("Original list after sorting:",neighborhoods)
print("Returned list:", sorted_neighborhoods)

#Descending order sort
reverse_sorted_neighborhoods = sorted(neighborhoods, reverse = True)
print("Reverse order sorting:", reverse_sorted_neighborhoods)

Discuss the difference between sorted() function and sort() method with your lab partner or TA / PM.

### Task 3.3: Sort characters in a string

We'll use the `sorted` function to sort the characters in a string.

In [None]:
s1 = "study"
sorted_s1 = sorted(s1)
print(sorted_s1)

s2 = "dusty"
# sort s2 and print
sorted_s2 = ???
print(sorted_s2)

# compare s1 and s2 using == operator

# compare sorted_s1 and sorted_s2 using == operator

# change s1 initialization to "cheap" and re-run this cell

# now change s2 initialization to "peach" and re-run this cell

Lists have a `sort` method because lists are mutable. 

Why isn't there a `sort` method for strings? Discuss with your lab partner or TA / PM.

### Task 3.4: Sort to find the median

Now, let's try using sorting to solve a common problem - that of finding the median of a given distribution of values. Recall that the median is the middle number in a sorted (ascending or descending) list of numbers.   
  
In a sorted list, if the list has an odd number of elements, the median is the middle number:  
e.g. [10, 20, 30, 40, 50] --> median is 30

If a sorted list has an even number of elements, the median is the average of the two middle numbers:  
e.g. [10, 20, 30, 40] --> median is 25
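
The standard library's `statistics.median` follows the same definition, so it can serve as a cross-check for the two examples above and, later, for your own function. This is only a sketch for verification; the lists below are the examples from the text.

```python
import statistics

odd_values = [10, 20, 30, 40, 50]
even_values = [10, 20, 30, 40]

print(statistics.median(odd_values))    # the middle element
print(statistics.median(even_values))   # average of the two middle elements
```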

In [None]:
def median(some_list):
    """
    Returns the median of the list passed as an argument
    """
    
    # STEP 1: sort the list
    
    # STEP 2: determine the length of the list
    
    # STEP 3: determine whether length of the list is odd
    if ???:
        # return item in the middle using indexing
    else:
        first_middle = ??? # use appropriate indexing
        second_middle = ??? # use appropriate indexing
        # average the two middle values
        median_value = (first_middle + second_middle) / 2
        return median_value

Now, let's find the median price of all rooms in the neighborhood *Harlem*.

In [None]:
# Initialize an empty list to keep track of prices in the Harlem neighborhood
harlem_prices = ??? 

# Iterate over all row indices. Recall that range built-in function 
# enables you to iterate over indices. You need to use len as well.

# Use cell function to extract neighborhood column value

# Check if neighborhood is Harlem

# If so, use cell function to extract price column value.
# Use append method to append the current row's price into harlem_prices list 


# Let's invoke the median function.
harlem_median_price = median(harlem_prices)

**Troubleshooting your function:**

Beware of type errors.

We expect the price to be an `int` value, but what type does the `cell` function actually return? Use the `type` function to find out. Think about how to solve this and test your result below.
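
To see why the type matters for sorting, compare how Python orders numbers stored as strings with how it orders actual `int` values. The values in `text_prices` are made up for illustration.

```python
text_prices = ["100", "89", "9"]   # hypothetical prices stored as strings

# Strings are compared character by character, so "100" sorts before "89"
print(sorted(text_prices))

# Converting each value to int gives numeric ordering
int_prices = []
for price in text_prices:
    int_prices.append(int(price))
print(sorted(int_prices))
```

If your median looks wrong, a mismatch like this is one likely cause.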

In [None]:
# Test your code below
print("Expected result: 89.0 Actual result:", harlem_median_price)

## Segment 4: Sets
In class, we learned about the Python `list` sequence. Another simpler structure you'll sometimes find useful is the `set`. A set is NOT a sequence because it does not keep all the values in any particular order. 

### Task 4.1: Create a set
You can create sets the same way as lists, just replacing the square brackets with curly braces. In the cell below, create a set with the same elements as the example list provided.

In [None]:
example_list = ["Kensington", "Harlem", "Midtown"]
print(example_list)
example_set = {???, ???, ???}
print(example_set)

### Task 4.2: Check if an element is present in a list or set

The `in` operator is used to check if an element is present in a list or set. Try it below:

In [None]:
"Harlem" in example_list

Now, check if the neighborhood "Midtown" is present in the set `example_set`

In [None]:
??? in ???

### Task 4.3: Check the ordering of elements in a list or set

Sets have no inherent ordering, so they don't support indexing. Try the code in the cells below.

In [None]:
example_list[0]  # Works

Now, try to index `example_set` (use index 1) and observe what happens. You should see a type error.

In [None]:
???[???]   # Crashes

The lack of order also matters for comparisons. Try evaluating this boolean expression:

In [None]:
["Harlem", "Midtown", "Kensington"] == ["Kensington", "Harlem", "Midtown"]

And now try this:

In [None]:
{"Harlem", "Midtown", "Kensington"} == {"Kensington", "Harlem", "Midtown"}

### Task 4.4 Convert between lists and sets
You can switch back and forth between lists and sets with ease. Let's try it. Create a list of all neighborhoods in the neighborhood_group *Brooklyn*.

In [None]:
# initialize the list
brooklyn_neighborhoods_list = list() 

# Iterate over the dataset to populate brooklyn_neighborhoods_list

Now, convert the list to a set. Compare the number of elements in the list and the set.

In [None]:
brooklyn_neighborhoods_set = set(???)
print(len(???))
print(len(???))

As you can see, the number of elements is vastly different! This is because a set is a collection of **unique** elements.

Be careful! When going from a set to a list, Python has to choose how to order the previously unordered values. If you run the same code, there's no guarantee Python will always choose the same way to order the set values in the new list.
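
When a predictable order matters, one option is to sort the set's values, since `sorted` accepts a set and always returns a list. A minimal sketch with made-up values:

```python
colors = ["red", "blue", "red", "green", "blue"]   # hypothetical values

unique_colors = set(colors)               # duplicates collapse; order is not defined
ordered_colors = sorted(unique_colors)    # sorted() returns a list in a fixed order
print(ordered_colors)
```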

### Task 4.5 Remove Duplicates
Let's use the uniqueness property of sets above to remove duplicates from a list by converting from a list to a set and back to a list again. Explore the resulting effect:

In [None]:
# Try playing with different values here
# Backslash enables us to split a long line of code into two lines
list_1 = ["Brooklyn", "Brooklyn", "Manhattan", "Midtown", \
          "Kensington", "Kensington", "Manhattan"] 
list_2 = ??? # Convert list 1 to a set and back to a list
print(list_2)

Now, remove the duplicates in the *brooklyn_neighborhoods_list* obtained previously. Return the unique values as a list.

In [None]:
unique_brooklyn_neighborhoods = ???

In [None]:
print('Expected elements 47 Actual elements: ', len(unique_brooklyn_neighborhoods))

In [None]:
print('Expected type <class \'list\'> Actual type: ', type(unique_brooklyn_neighborhoods))

## Segment 5: Building a better helper function

Finally, let's try to improve our helper function `cell`. As you have seen, we often have to manually convert the type returned by the function to suit our requirements. Instead, let's ensure the function returns the required type on its own. 

We will define a new function `cell_v2` to test our new implementation. Once the function is tested and works correctly, you can replace the original function with the new version.

First, define `cell_v2` by completing and running the code below.

In [None]:
def cell_v2(row_idx, col_name):
    col_idx = csv_header.index(???)
    val = csv_data[???][col_idx]
    if val == "":
        return None
    return val

### Task 5.1 Return the correct data type for price

- Create a new conditional in the function definition for `cell_v2` that will help you perform appropriate type casting before you return the data value from the cell function

In [None]:
print("Expected: <class 'int'>. Actual:", type(cell_v2(4,'price')))

In [None]:
print("Expected: 80 . Actual:", cell_v2(4,'price'))

### Task 5.2 Return the correct data type for minimum_nights

- Update the condition in the new conditional (in `cell_v2`) to handle minimum_nights column data conversion to `int`
- Recall that a condition of the form `3 + 4 == 6 or 7` does not compare `3 + 4` against both numbers. Python reads it as `(3 + 4 == 6) or 7`, which is always truthy. The correct way to write the condition is `3 + 4 == 6 or 3 + 4 == 7`.
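
The pitfall with `or` can be sketched with plain numbers. The variable names `total` and `other` are made up for this illustration.

```python
total = 3 + 4

# Parsed as (total == 6) or 7; 7 is a non-zero int, so the result is truthy
print(total == 6 or 7)

# Each comparison spelled out in full
print(total == 6 or total == 7)

# The difference shows up when neither comparison should match
other = 10
print(bool(other == 6 or 7))       # still truthy, which is not what we want
print(other == 6 or other == 7)
```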

In [None]:
print("Expected: <class 'int'>. Actual:", type(cell_v2(4,'minimum_nights')))

In [None]:
print("Expected: 10. Actual:", cell_v2(4,'minimum_nights'))

### Task 5.3 Return the correct data types for all other `int` columns

- Refer to the column list in the "About the dataset" section above for the type associated with each column.
- The if condition will become very long if you keep using `or` to separate each `int` column comparison operation.
- It is easier to make a list of all the column names whose values require `int` conversion and use `in` operator to check if `col_name in ???`
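
The idea of replacing a chain of `or` comparisons with a membership test can be sketched as follows. The column names here are hypothetical and unrelated to `airbnb.csv`.

```python
# Hypothetical column names, unrelated to airbnb.csv
numeric_columns = ["age", "height", "weight"]

col_name = "height"

# Chained comparisons grow with every new column
long_check = col_name == "age" or col_name == "height" or col_name == "weight"

# A membership test expresses the same idea in one short condition
short_check = col_name in numeric_columns
print(long_check, short_check)
```

Adding another column then only requires adding one more element to the list.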

### Task 5.4 Return the correct data types for `float` columns

You could choose to start the project and update the function definition as you go along.

Update the definition of `cell` to match `cell_v2`. The function you use in the project must be called `cell`. Remember to always update the original definition - do not have two differing functions of the same name.

**Note**: Using this advanced `cell` function is recommended but **optional** for the project. You may choose to use the basic version and convert the return types manually when needed.

## Great work! You are now ready to start P6.