<a id="overview"></a>

# Evaluating Airbnbs in San Francisco


## Overview

You have just been hired by a real estate company in San Francisco, California. They are trying to enter the short-term rental market by listing several of their properties on Airbnb.com and they have hired you to help inform their direction and marketing. You have a few questions about how Airbnb listings perform, based on factors such as: number of bedrooms, bathrooms, and amenities.

When choosing Airbnb listings, what are the factors that go into a typical consumer decision-making process? Can we decompose this process by looking at the data? We can begin to estimate consumer interests by asking questions of the data, such as: How many bathrooms does the rental property have? Is it a yurt, a cottage, or a mansion? How much is the nightly fee? What amenities are provided?

Your task is to import raw data, clean the data, and provide insights to your real estate client, based on the available sample data and structured questions below. They are looking to you to provide programmatic insights based on your new Python skills!

**Expected Time to complete: 4-8 hours**


## Objectives
This assignment will provide you with a chance to:

1. Read/write CSV files using Python's built-in `csv` module.
1. Clean and transform raw data from a CSV file into `lists` and `dicts`.


## Problem

Your goal is to filter the data and perform some basic analysis, looking to glean market insights and answer questions such as: 

- What is the most frequently offered amenity in San Francisco?
- What is the average cost of listings that match certain criteria?

## Structure

This notebook walks through Pythonic data analysis in different stages: 

- **Required:** This section covers classroom topics from Unit 1 and is _required_. 
- **Advanced:** This section covers upcoming topics from future Units and is _optional_.

Throughout the notebook, you will see clearly labeled sections setting up questions for you to solve. _You must provide answers to all of the questions in the **Required** section._ Note that some questions have been further divided up into "Part 1", "Part 2", (etc) in order to break down the steps of sequential logic used in Python programming. Please attempt answers for all parts.

For those of you who wish to work ahead or want to come back later for more practice, the **Advanced** section offers additional prompts that will extend your analysis. This section is optional; you do not need to complete these for submission. Depending on the discretion of your section instructor, these questions may be worth bonus points.

Finally, the **Challenge** section provides an additional set of real-world prompts and examples that integrate new programming concepts and Python libraries not covered in this class. Challenge questions are intended to help you explore and continue your learning outside of this course! _You do not need to complete Challenge questions._


## Instructions

1. Open the assignment notebook. 
1. Save a copy of your notebook and retitle it: "yourname_assignment.ipynb"
1. Attempt answers for all **Required questions**. Some questions can be solved in many different ways!
1. Include at least one comment per question explaining your logic or approach. To include a comment in your Python code, use the `#` sign.
1. Make sure to include all work within your Jupyter notebook.
1. Submit answers for the **Required questions** to your instructional team by the due date.
1. Have fun!

## Data 

[Our data](./data/sanfran_airbnb.csv) is a truncated subset of data taken from [Inside Airbnb](http://insideairbnb.com). You'll see twelve columns:

- `id` - A unique identifier of the Airbnb
- `listing_url` - The URL to the Airbnb
- `name` - The name of the listing
- `host_id` - A unique identifier for the host
- `host_name` - The name of the host
- `host_is_superhost` - A boolean stating whether or not the host is a superhost
- `neighbourhood_cleansed` - Identifies the neighborhood of the city the listing is in
- `accomodates` - How many people the listing can house
- `bedrooms` - The reported number of bedrooms
- `bathrooms` - The reported number of bathrooms
- `amenities` - A list of the amenities that the listing offers
- `price` - The nightly fee of the listing (before cleaning fees)

In [1]:
# Import the csv module
import csv

---

# REQUIRED / GRADED
> **Required:** This section covers classroom topics from Unit 1 and is _required_. 

In this section of the notebook, you'll begin your analysis by importing and inspecting the data with Python. Make sure to complete questions 1-5. 

Ready, set, go!


---

## Question 1

- **Part 1**: First, you'll need to load the `sanfran_airbnb` CSV from your local files. Alternatively, you can also [click here to access the data online](https://gist.github.com/jeff-boykin/2879dcf8936e42f2d0ef5c7c39b4da70).  

> Our data is a truncated subset of data taken from [Inside Airbnb](http://insideairbnb.com/asheville/). The original set contains extra columns which have been removed for this assignment.

> Hint: The delimiter for this file is a *tab* character, which can be passed into the `csv.reader` as `csv.reader(csvfile, delimiter='\t')`


- **Part 2**: Next, create a list called `column_names` that holds the column names from the csv. 

> Hint: There should be 12 columns, total. For example: `column_names == ['id', 'listing_url', ....]`


- **Part 3**: Now create a list called `listings` that holds each listing as its own list. There should be 6,346 total. For example, `listings[0]` will be:

```
['958', 'https://www.airbnb.com/rooms/958', 'Bright, Modern Garden Unit - 1BR/1BTH', '1169', 'Holly', 't', 'Western Addition', '3', '1', '1', '["Heating", "Hot water", "Stove", "Iron", "Dryer", "Coffee maker", "Carbon monoxide alarm", "Pack \\u2019n Play/travel crib", "Private entrance", "Microwave", "Hangers", "Essentials", "Laptop-friendly workspace", "First aid kit", "Smoke alarm", "Refrigerator", "Wifi", "Cooking basics", "Shampoo", "TV", "Dishes and silverware", "Room-darkening shades", "Garden or backyard", "Hair dryer", "Kitchen", "Washer", "Keypad", "Cable TV", "Oven", "Free street parking"]', '$132.00 ']

```
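The delimiter hint above may not match every copy of the file, so it can help to check before parsing. The following is a minimal sketch, assuming the standard library's `csv.Sniffer` is acceptable for this purpose; `sample_text` is made-up data standing in for the first few lines of the real file.

```python
import csv
import io

# Made-up sample standing in for the first few lines of the real file.
sample_text = "id,name,price\n1,Cozy room,$100.00\n2,Sunny loft,$250.00\n"

# Ask the Sniffer to choose between a comma and a tab.
dialect = csv.Sniffer().sniff(sample_text, delimiters=",\t")

# Parse the sample with whichever delimiter was detected.
reader = csv.reader(io.StringIO(sample_text), delimiter=dialect.delimiter)
header = next(reader)
rows = list(reader)

print(repr(dialect.delimiter))
print(header)
print(len(rows))
```

With the real file, the same idea would apply to a short string read from the opened file before creating the `csv.reader`.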

#### Need Help?

- [Click here for an explanation of Python data types](https://www.geeksforgeeks.org/python-set-3-strings-lists-tuples-iterations/).
- [For examples of Python lists, click here](https://www.w3schools.com/python/python_lists.asp).
- [For help reading and writing CSV files in Python, click here](https://www.w3schools.com/python/python_file_write.asp).


> Quick note on the amenity: `translation missing: en.hosting_amenity_XX`. Airbnb has to translate each amenity into as many languages as it can in order to provide their services across multiple geographic regions. In order to do this, each amenity is assigned an English translation and served up to us when we view the site in English. When we see things like `translation missing: en.hosting_amenity_49`, that implies that there is some amenity for which there is no suitable translation or available option.

In [2]:
# Enter your solution for Q1, Part 1

# Load the necessary modules
import os
import csv

# Load the sanfran_airbnb data
csvpath = os.path.join("data","sanfran_airbnb.csv")
with open(csvpath, newline = "") as datafile:
    # Checking the sanfran_airbnb.csv file showed delimiter = ","
    csvreader = csv.reader(datafile, delimiter = ",")
    csv_header = next(csvreader)

In [3]:
# Enter your solution for Q1, Part 2

# Create list called column_names to contain the headers in the sanfran_airbnb.csv
column_names = ["id", "listing_url","name","host_id","host_name",
                "host_is_superhost", "neighbourhood_cleansed","accomodates",
                "bathrooms","bedrooms","amenities","price"]


In [4]:
# Enter your solution for Q1, Part 3

# Create list called listings that contains each listing as its own list
# First create an empty list called listings
listings = []

# Next populate the listings list by first opening the sanfran_airbnb file
csvpath = os.path.join("data","sanfran_airbnb.csv")
with open(csvpath, newline = "") as datafile:
    # sanfran_airbnb uses a comma as its delimiter = ','
    csvreader = csv.reader(datafile, delimiter = ",")
    # skip the header
    csv_header = next(csvreader)
    
    # populate the listings list with each row after the header
    for row in csvreader:
        listings.append(row)

# Check the number of items in listings to make sure it matches 6,346 rows.
len(listings)


6346

In [5]:
# Print out the column names
print(column_names)

['id', 'listing_url', 'name', 'host_id', 'host_name', 'host_is_superhost', 'neighbourhood_cleansed', 'accomodates', 'bathrooms', 'bedrooms', 'amenities', 'price']


---

## Question 2

Next, answer the following questions using the `listings` variable:

- **Part 1**. Print the first listing
- **Part 2**. Print the 100th listing
- **Part 3**. Print the price of the 100th listing *without* printing the rest of the listing information!

> Hint: [Here are some examples on how to print in Python](https://www.w3schools.com/python/ref_func_print.asp) 
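Hard-coded positions are easy to miscount in a row with many columns. One possible approach, sketched here with shortened made-up data, is to look up a column's position by name with `list.index`. The names `sample_columns` and `sample_listing` are hypothetical stand-ins for `column_names` and a single element of `listings`.

```python
# Hypothetical, shortened stand-ins for column_names and one listing.
sample_columns = ["id", "name", "price"]
sample_listing = ["958", "Bright, Modern Garden Unit", "$132.00 "]

# Find the position of the price column by name instead of hard-coding it.
price_index = sample_columns.index("price")

print(price_index)
print(sample_listing[price_index])
```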

In [6]:
# Enter your solution for Q2, Part 1

# Print the first listing
print(listings[0])

['958', 'https://www.airbnb.com/rooms/958', 'Bright, Modern Garden Unit - 1BR/1BTH', '1169', 'Holly', 't', 'Western Addition', '3', '1', '1', '"Heating","Hot water","Stove","Iron","Dryer","Coffee maker","Carbon monoxide alarm","Pack \\u2019n Play/travel crib","Private entrance","Microwave","Hangers","Essentials","Laptop-friendly workspace","First aid kit","Smoke alarm","Refrigerator","Wifi","Cooking basics","Shampoo","TV","Dishes and silverware","Room-darkening shades","Garden or backyard","Hair dryer","Kitchen","Washer","Keypad","Cable TV","Oven","Free street parking"', '$132.00 ']


In [7]:
# Enter your solution for Q2, Part 2

# Print the 100th listing
print(listings[99])

['137672', 'https://www.airbnb.com/rooms/137672', 'Charming Private Room in Cozy Apt', '673098', 'Elizabeth', 'f', 'Inner Sunset', '2', '1.5', '1', '"Host greets you","Long term stays allowed","Heating","Kitchen","Breakfast","Luggage dropoff allowed","Wifi","Washer","Iron","Cable TV","Dryer","TV","Hangers","Laptop-friendly workspace"', '$150.00 ']


In [8]:
# Enter your solution for Q2, Part 3

# Print only the price for the 100th listing
print(listings[99][11])

$150.00 


---

### Tutorial

Before we get to Question 3, let's first look at a few ways we can manipulate string data in Python.

In [9]:
# Here are some examples using a `.replace` function with string data (this will come in handy for the next question)!

# Example: `str.replace(item_to_replace, item_to_replace_with)`
# This will return: `str`

print("$40,123.00".replace('$', ''))  # removes the dollar sign
print("$40,123.00".replace(',', ''))  # removes the comma
print("$40,123.00".replace('$', '').replace(',', ''))  # removes the dollar sign and the comma


40,123.00
$40123.00
40123.00


In [10]:
# And here are some examples of the `.split` functionality with strings. Take a look and then proceed to Question 3 when you're ready!

# Example: `str.split(delimiter)`
# Returns: `list`

print("a,b,c,d".split(','))  # split by comma
print("a;b;c;d".split(';'))  # split by semi-colon
print("a; b; c; d".split('; '))  # split by semi-colon and a space


['a', 'b', 'c', 'd']
['a', 'b', 'c', 'd']
['a', 'b', 'c', 'd']


---

## Question 3

Create a list called `parsed_listings` that contains the original listings as its elements, but with the following changes:

- First, change the `amenities` item to be a list of strings (this one is a bit tricky).
> Hint: you may have to remove the `"`, `}`, and `{` characters and then split the string by the comma.

- Second, change the `price` item to be a float.
> Try using `.replace` to remove a few bad characters before converting to a float.

- Third, change the `bedrooms` item to be a float.
- Fourth, change the `bathrooms` item to be a float.

> Note that the elements of `parsed_listings` should still be lists themselves (in other words, they should hold the listings' same characteristics). [Click here to learn more about working with different Python data types](https://www.w3schools.com/python/python_datatypes.asp).

- Fifth and finally, try using a `for` loop to accomplish this. When you're done, the first element (`parsed_listings[0]`) should look like:

```
['958',
 'https://www.airbnb.com/rooms/958',
 'Bright, Modern Garden Unit - 1BR/1BTH',
 '1169',
 'Holly',
 't',
 'Western Addition',
 3.0,
 1.0,
 1.0,
 ['[Heating',
  ' Hot water',
  ' Stove',
  ' Iron',
  ' Dryer',
  ' Coffee maker',
  ' Carbon monoxide alarm',
  ' Pack \\u2019n Play/travel crib',
  ' Private entrance',
  ' Microwave',
  ' Hangers',
  ' Essentials',
  ' Laptop-friendly workspace',
  ' First aid kit',
  ' Smoke alarm',
  ' Refrigerator',
  ' Wifi',
  ' Cooking basics',
  ' Shampoo',
  ' TV',
  ' Dishes and silverware',
  ' Room-darkening shades',
  ' Garden or backyard',
  ' Hair dryer',
  ' Kitchen',
  ' Washer',
  ' Keypad',
  ' Cable TV',
  ' Oven',
  ' Free street parking]'],
 132.0]
```

> Note: A more advanced method would be to use a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html) to accomplish this.
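As a rough illustration of the list-comprehension approach mentioned in the note, the sketch below parses a couple of made-up rows. The shortened row layout (`[id, bedrooms, amenities, price]`), the `raw_rows` data, and the `parse_row` helper are all assumptions for illustration and do not match the real column positions.

```python
# Made-up raw rows in a shortened layout: [id, bedrooms, amenities, price].
raw_rows = [
    ["958", "1", '"Heating","Wifi"', "$132.00 "],
    ["5858", "2", '"Kitchen","TV","Washer"', "$1,235.00 "],
]


def parse_row(row):
    """Return a new list with the numeric and list fields converted."""
    return [
        row[0],
        float(row[1]),
        # Drop the double quotes, then split the string on commas.
        row[2].replace('"', "").split(","),
        # Drop the dollar sign and thousands separator before converting.
        float(row[3].replace("$", "").replace(",", "")),
    ]


# The list comprehension applies parse_row to every raw row.
parsed_rows = [parse_row(row) for row in raw_rows]

print(parsed_rows[1])
```

Keeping the per-row logic in a small helper keeps the comprehension itself short and readable.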

In [22]:
# Enter your solution for Q3

# Create a list called parsed_listings to contain modified copies of the original listings
parsed_listings = []

# Build each parsed listing in a single pass over listings
for row in listings:
    # Copy the row so the original listings are left unchanged
    parsed_row = list(row)

    # Index 8 represents bathrooms - convert to a float
    parsed_row[8] = float(row[8])

    # Index 9 represents bedrooms - convert to a float
    parsed_row[9] = float(row[9])

    # Index 10 represents amenities - convert from a single string to a list of strings
    parsed_row[10] = row[10].replace('"', '').split(',')

    # Index 11 represents price - drop the '$' and ',' and convert to a float
    parsed_row[11] = float(row[11].replace('$', '').replace(',', ''))

    # Add the converted listing to parsed_listings as a list
    parsed_listings.append(parsed_row)

In [23]:
parsed_listings[0]  # Great!

['958',
 'https://www.airbnb.com/rooms/958',
 'Bright, Modern Garden Unit - 1BR/1BTH',
 '1169',
 'Holly',
 't',
 'Western Addition',
 '3',
 1.0,
 1.0,
 ['Heating',
  'Hot water',
  'Stove',
  'Iron',
  'Dryer',
  'Coffee maker',
  'Carbon monoxide alarm',
  'Pack \\u2019n Play/travel crib',
  'Private entrance',
  'Microwave',
  'Hangers',
  'Essentials',
  'Laptop-friendly workspace',
  'First aid kit',
  'Smoke alarm',
  'Refrigerator',
  'Wifi',
  'Cooking basics',
  'Shampoo',
  'TV',
  'Dishes and silverware',
  'Room-darkening shades',
  'Garden or backyard',
  'Hair dryer',
  'Kitchen',
  'Washer',
  'Keypad',
  'Cable TV',
  'Oven',
  'Free street parking'],
 132.0]

---

## Question 4

Next, let's dig into price differences between listings with different criteria.

- **Part 1**. Begin by creating two lists called `small_homes_one` and `small_homes_two` where the elements fit the following criteria:
    - `small_homes_one` should only have listings with fewer than two bathrooms
    - `small_homes_two` should only have listings with more than two bathrooms but fewer than three
    
- **Part 2**. What is the average price for each set of listings? 

- **Part 3**. Finish by printing the number of elements in each list.

- **Part 4**. Then create a new list called `small_homes` that only contains listings that have either: 
    - Exactly 1 bathroom
OR
    - Fewer than 2 bathrooms AND exactly 1 bedroom

- **Part 5**. Wrap up by printing the number of elements in the list `small_homes`.
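The filtering and averaging logic described above can be sketched on a few made-up rows. The shortened layout (`[id, bathrooms, bedrooms, price]`) and the names `sample_parsed`, `under_two_baths`, and `small` are assumptions for illustration, not the real column positions.

```python
# Made-up parsed rows in a shortened layout: [id, bathrooms, bedrooms, price].
sample_parsed = [
    ["a", 1.0, 1.0, 100.0],
    ["b", 1.5, 1.0, 150.0],
    ["c", 2.5, 3.0, 400.0],
    ["d", 1.5, 2.0, 200.0],
]

# Keep listings with fewer than two bathrooms.
under_two_baths = [row for row in sample_parsed if row[1] < 2]

# Average price = sum of the prices divided by the number of listings.
average_price = sum(row[3] for row in under_two_baths) / len(under_two_baths)

# Exactly 1 bathroom, OR fewer than 2 bathrooms AND exactly 1 bedroom.
small = [
    row for row in sample_parsed
    if row[1] == 1 or (row[1] < 2 and row[2] == 1)
]

print(average_price)
print(len(small))
```

The parentheses around the `and` clause are optional in Python, since `and` binds tighter than `or`, but they make the intended grouping explicit.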

In [24]:
# Now you try!
# Enter your solution for Q4, Parts 1, 2, and 3
# Part 1:
#   Create list called small_homes_one
small_homes_one = []

# Loop through parsed_listings to populate small_homes_one
for row in parsed_listings:
    # only take listings with < 2 bathrooms
    if row[8] < 2.0:
        small_homes_one.append(row)
        
#   Create list called small_homes_two
small_homes_two = []

# Loop through parsed listings to populate small_homes_two
for row in parsed_listings:
    # only take listings with more than 2 bathrooms but less than three
    if row[8] > 2 and row[8] < 3:
        small_homes_two.append(row)

# Part 2: What is the average price for each set of listings above.
# Create empty list named small_homes_one_prices
small_homes_one_prices = []

# Iterate through small_homes_one prices to populate small_homes_one_prices
for row in small_homes_one:
    small_homes_one_prices.append(row[11])

#  Calculate and store the average price for small_homes_one into avgprice_small_homes_one 
avgprice_small_homes_one = round(sum(small_homes_one_prices)/len(small_homes_one_prices))

#   Print the average price for small_homes_one
print(f'small_homes_one average price = {avgprice_small_homes_one}')

# Create empty list named small_homes_two_prices
small_homes_two_prices = []

# Iterate through small_homes_two prices to populate small_homes_two_prices
for row in small_homes_two:
    small_homes_two_prices.append(row[11])
    
#  Calculate and store the average price for small_homes_two into avgprice_small_homes_two 
avgprice_small_homes_two = round(sum(small_homes_two_prices)/len(small_homes_two_prices))

#   Print the average price for small_homes_two
print(f'small_homes_two average price = {avgprice_small_homes_two}')

# Part 3: Print the number of elements in each list

print(f'small_homes_one has {len(small_homes_one)} listings.')
print(f'small_homes_two has {len(small_homes_two)} listings.')

small_homes_one average price = 240
small_homes_two average price = 378
small_homes_one has 4773 listings.
small_homes_two has 196 listings.


In [25]:
# Now you try!
# Enter your solution for Q4, Parts 4 and 5
# Part 4: Create a new list called small_homes 
small_homes = []

for row in parsed_listings:
    if row[8] == 1.0 or (row[8] < 2 and row[9] == 1.0):
        small_homes.append(row)

# Part 5: Print the number of elements in small_homes.
print(f'small_homes has {len(small_homes)} listings.')

small_homes has 4563 listings.


---

## Question 5


- **Part 1**. Now let's create a *dictionary* called `amenities_count`. 

> Hint: A dictionary uses key/value pairs. For more info on Python dictionaries, [check out this link](https://www.w3schools.com/python/python_dictionaries.asp).

For your new `amenities_count` dictionary, make the *keys* of the dictionary equal the amenities listed and the *values* indicate the number of times that amenity appears across every listing.

Examples:
    - amenities_count['Day bed'] == 7
    - amenities_count['Coffee maker'] == 1230
    

- **Part 2**. Now *iterate* over your new `amenities_count` dictionary to surface the amenity that appears the *most often* across all listings!
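For comparison, the counting step can also be done with `collections.Counter`, a dictionary subclass from the standard library. This is an alternative technique, not the plain-dictionary approach the question asks for, and `sample_amenities` is made-up data.

```python
from collections import Counter

# Made-up amenity lists, one per listing.
sample_amenities = [
    ["Wifi", "Kitchen"],
    ["Wifi", "TV"],
    ["Wifi", "Kitchen", "Washer"],
]

# Counter.update adds one to the tally for every item in the list.
counts = Counter()
for amenity_list in sample_amenities:
    counts.update(amenity_list)

# most_common(1) returns a list holding the single (key, count) pair
# with the highest count.
top_amenity, top_count = counts.most_common(1)[0]

print(top_amenity, top_count)
```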



In [15]:
amenities_list = []
for row in parsed_listings:
    amenities_list.append(row[10])

len(amenities_list)


In [25]:
# Now you try!
# Enter your solution for Q5, Part 1
# Create a dictionary called amenities_count where the keys equal the name of the amenities and
# the values indicate the number of times the amenities appear.

# Create an empty dictionary called amenities_count
amenities_count = {}

# Create an empty list to store the amenities taken from parsed_listings row[10]
all_amenities = []

# Populate all_amenities list with each amenity listed in parsed_listings row[10].
for row in parsed_listings:
    for item in row[10]:
        all_amenities.append(item)
        
# Loop through the items in the all_amenities list and tally each one in the amenities_count dictionary,
# where the item is the key and the running count is the value
for item in all_amenities:
    amenities_count[item] = amenities_count.get(item, 0) + 1


In [26]:
# Now you try!
# Enter your solution for Q5, Part 2

# Find the highest count once, then print every amenity that reaches it
maxvalue = max(amenities_count.values())
for key, value in amenities_count.items():
    if value == maxvalue:
        print(key, value)

Wifi 6279


---

# ADVANCED 

> **Advanced:** This section covers more complex topics from the previous unit as well as conquering some brand new concepts. These questions are _optional_. 

## Question 1

This dataset has a bunch of properties in it that are ABSURDLY priced ($10000 per night seems a bit high) and are probably priced in this way to deter rentals whilst still keeping the property up. This makes them severe outliers in the dataset and could throw off any analysis we want to make in the future. Let's try to clear this up.

- **Part 1.** Create a loop that goes through the original list of properties and places them into a new list from least to most expensive. Then take some time to look through a few of the higher priced properties. This will reveal some strange values. 
> Note: There are many ways to accomplish this task but we recommend using a new library method called [itemgetter](https://docs.python.org/3/library/operator.html#operator.itemgetter) which was made specifically for this purpose and the [sorted](https://www.w3schools.com/python/ref_func_sorted.asp) function.

- **Part 2.** Calculate the median price of the sorted dataset. This will be used in order to determine the quartiles of our dataset.

- **Part 3.** Calculate the lower quartile (the data point below which 25% of the observations sit).

- **Part 4.** Calculate the upper quartile (the data point above which 25% of the observations sit).

- **Part 5.** Find the interquartile range by subtracting the value of the lower quartile from the value of the upper quartile.

- **Part 6.** Find the "inner fences" of the data set. To find the inner fences of the data set, first multiply the interquartile range by 1.5. Then add the result to the upper quartile and subtract it from the lower quartile. The two values you receive are the boundaries for the dataset's inner fences.
> Note: A point that falls outside of this numeric boundary is classified as a *minor outlier*

- **Part 7.** Find the "outer fences" of the data set. This is done in the same way as uncovering the inner fences, except that the interquartile range is multiplied by 3 instead of 1.5. The result is then added to the upper quartile and subtracted from the lower quartile to find the upper and lower boundaries of the outer fence.
> Note: A point that falls outside of this numeric boundary is classified as a *major outlier*

- **Part 8.** Now it is time to finally clean the dataset! Remove any values from the listings whose prices are outside of the outer fences.

- **Part 9.** Finally, let's add a new value to each listing that tells the viewer whether or not the listing is a minor outlier.
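The steps above can be sketched end to end on a handful of made-up listings. This sketch assumes the "median of each half" convention for quartiles, which is only one of several in common use and can give slightly different values from other methods. The shortened layout (`[name, price]`), the `sample_listings` data, and the `median` helper are hypothetical.

```python
from operator import itemgetter

# Made-up listings in a shortened layout: [name, price].
sample_listings = [
    ["h", 10000.0], ["a", 50.0], ["d", 120.0], ["b", 80.0],
    ["f", 200.0], ["c", 100.0], ["g", 250.0], ["e", 150.0],
]


def median(values):
    """Median of an already sorted list of numbers."""
    mid = len(values) // 2
    if len(values) % 2:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2


# Part 1: sort from least to most expensive.
by_price = sorted(sample_listings, key=itemgetter(1))
prices = [row[1] for row in by_price]
n = len(prices)

# Parts 2-5: median, quartiles (median of each half), and interquartile range.
median_price = median(prices)
lower_quartile = median(prices[: n // 2])
upper_quartile = median(prices[(n + 1) // 2:])
iqr = upper_quartile - lower_quartile

# Parts 6-7: inner fences use 1.5 * IQR, outer fences use 3 * IQR.
inner_fences = (lower_quartile - 1.5 * iqr, upper_quartile + 1.5 * iqr)
outer_fences = (lower_quartile - 3 * iqr, upper_quartile + 3 * iqr)

# Part 8: keep only listings priced inside the outer fences.
cleaned = [
    row for row in by_price
    if outer_fences[0] <= row[1] <= outer_fences[1]
]

# Part 9: append True when a listing falls outside the inner fences.
flagged = [
    row + [not (inner_fences[0] <= row[1] <= inner_fences[1])]
    for row in cleaned
]

print(median_price, lower_quartile, upper_quartile, iqr)
print(inner_fences, outer_fences)
print(len(cleaned))
```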

In [4]:
# Now you try!
# Enter your solution for Q1, Part 1

# Load the operator module which contains itemgetter function
import operator

# Sort parsed_listings from least to most expensive and place it into a list called sorted_listings 
Sorted_listings = sorted(parsed_listings, key=operator.itemgetter(11))

# Spot check listing to see if first listing has low price and last listing has high price
# Check first price
print(Sorted_listings[0][11])
# Check last price
print(Sorted_listings[-1][11])

In [5]:
# Now you try!
# Enter your solution for Q1, Part 2

# Calculate median price of sorted list.
print(f'Sorted_listings has {len(Sorted_listings)} listings')
print('To calculate the median price, take the two midpoint prices in Sorted_listings located at midpoint indices...')
print(f'{round((len(Sorted_listings)/2)-1)} and {round(len(Sorted_listings)/2)}')

# # Take the two prices at the midpoint of Sorted_listings and divide it by 2
middle_number1 = Sorted_listings[3172][11]
middle_number2 = Sorted_listings[3173][11]
median_price = (middle_number1 + middle_number2)/2

print(f'The middle price at indices 3172 = ${round(middle_number1,2)}')
print(f'The middle price at indices 3173 = ${round(middle_number2,2)}')
print(f'Therefore the median price = ${round(median_price,2)}')

In [6]:
# Now you try!
# Enter your solution for Q1, Part 3, 4, and 5

In [7]:
# Now you try!
# Enter your solution for Q1, Part 6 and 7

In [8]:
# Now you try!
# Enter your solution for Q1, Part 8

In [9]:
# Now you try!
# Enter your solution for Q1, Part 9

## Question 2

You are working with a client on developing a new value proposition for their Airbnb properties. This will help your client, a real estate investor, determine which type of properties they should purchase to have the best success on Airbnb.

**Part 1.** Create at least three rental market segments based on price and the number of people the property can accommodate.

> Note: Market segmentation refers to clustering a group of people by one or more characteristics. In marketing, it allows us to develop a specific, targeted strategy for different people based on their needs. In our case, you should segment the rentals based on price and the number of people the property can accommodate. For example, one segment could contain lower-priced and smaller properties. This segment could be geared toward customers who are price sensitive and looking for a deal. Another segment could contain large, higher-priced properties tailored to customers looking for a place to stay for a family / friend celebration or vacation. 

**Part 2.** Which room and property type appear the most in each segment?

**Part 3.** How many properties have fewer than 5 reviews in each segment? Remove all rentals with fewer than 5 reviews from each segment.

**Part 4.** Which segment should your client consider and why?
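One way to start on Part 1 is a small function that assigns a segment label from price and capacity. The sketch below is only an illustration: the thresholds, the segment names, the `segment` helper, and the `sample` data in a shortened layout (`[id, accommodates, price]`) are all hypothetical choices, not values derived from the dataset.

```python
# Made-up listings in a shortened layout: [id, accommodates, price].
sample = [
    ["a", 2, 90.0],
    ["b", 4, 180.0],
    ["c", 8, 650.0],
    ["d", 2, 400.0],
]


def segment(accommodates, price):
    """Assign a hypothetical segment label; thresholds are illustrative only."""
    if accommodates <= 2 and price < 150:
        return "budget"
    if accommodates >= 6 and price >= 300:
        return "group"
    return "mid-range"


# Group listing ids by segment label in a dictionary of lists.
segments = {}
for listing_id, accommodates, price in sample:
    segments.setdefault(segment(accommodates, price), []).append(listing_id)

print(segments)
```

In practice the thresholds would be chosen after looking at the distribution of prices and capacities in the cleaned data.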

In [3]:
# Now you try!
# Enter your solution for Q2, Part 1

In [2]:
# Now you try!
# Enter your solution for Q2, Part 2

In [4]:
# Now you try!
# Enter your solution for Q2, Part 3

In [None]:
# Now you try!
# Enter your solution for Q2, Part 4