# Analyzing Airbnb Rentals with Regression

Airbnb was founded in 2008 to allow people to rent apartments, houses, and spaces directly to one another. The company provides an alternative to traditional hotels and rentals. Recently there's been a lot of public discussion about the effects Airbnb rentals have on neighborhoods: as Airbnb gets more popular, more and more homes and apartments that were previously used as private residences are now becoming full-time rentals. The data activists and journalists at [Inside Airbnb](http://insideairbnb.com/) are using public information about rentals to better understand these phenomena.

This week, we'll use a sample of [Inside Airbnb's data](http://insideairbnb.com/get-the-data) to investigate the prices of Airbnb rentals in New Orleans, LA during the year 2021. New Orleans is a city with a lot of tourism that relies on hotels and rentals to house its many visitors. In this first workshop, we'll use linear regression to explore some variables that may affect the price of rentals.

*n.b. This assignment is loosely adapted from the Airbnb Lab assignment at Denison University.*

## Getting Started

**Begin by importing libraries as normal. Remember that you'll need the usual libraries as well as the `sklearn` classes and functions that we discussed in class.**

Now we can read in our data. This is the largest data set we've worked with so far, and the file you downloaded, `nola_listings_2021.csv.gz`, is a gzip-compressed file, which shrinks the data to a more manageable size. A cool feature of pandas is that you don't need to decompress this file to read it with the `read_csv()` function. **You can read the file as normal below:**
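As a minimal sketch of why no decompression step is needed, the example below writes a tiny gzip-compressed CSV and reads it straight back. The file name `toy_listings.csv.gz` and its two rows are made up for illustration; they are not part of the Airbnb data.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny gzip-compressed CSV so the example runs without the real download.
# The file name and both rows are hypothetical.
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy_listings.csv.gz")
with gzip.open(toy_path, "wt", encoding="utf-8") as f:
    f.write('id,price\n1,$85.00\n2,"$1,250.00"\n')

# read_csv infers the compression from the .gz extension by default
toy = pd.read_csv(toy_path)
print(toy)
```

The same one-line `read_csv()` call should work on the real file, as long as the path points to wherever you saved it.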

## Data Wrangling

There are a lot of columns in this data, more than can be viewed in the table view above. **Let's use the `.info()` method on our new dataframe to see all the column names and their data types:**

Right away this is very helpful! Now we have a list of all column names and we can see their type. You can compare this to [Inside Airbnb's full documentation](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=1322284596) so that you understand what each column is describing.

Eagle-eyed readers might have spotted an issue with the column we care most about: the `price` of the rentals. But let's pretend we didn't see that for now. **Since we want to use `price` as our dependent variable, let's calculate the mean of price now:**

Oh dear! That's a massive error message. (You can click the blue bar at the left to collapse it.) Something clearly went wrong. **Let's look at the price column to see if we can spot the issue:**

Okay, so this column is a string rather than a number. That's no good, since we know we need a numerical variable as the dependent or target variable in a regression.

What's worse, if you look above you'll see why we can't just convert this in the same way we've done in previous workshops. **Why can't we convert this directly to a number? And what do you think we need to do *before* we convert it? Write some thoughts below before reading the next section:**

[Your answer here.]

So, we need to get rid of the problem characters first. Right now, our price column has the data type string; it's a bit of text. And Pandas has lots of helpful methods for [dealing with strings](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#string-methods). The basic syntax is to put `.str.nameOfStringFunction()` after a column name. So we can use these `.str` methods to manipulate strings in different ways.
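Here is a small sketch of the `.str` syntax on a made-up Series of price-like strings (the values are invented, not taken from the data set). Each method runs on every string in the column at once.

```python
import pandas as pd

# Hypothetical price strings, just to try out the .str accessor
prices = pd.Series(["$85.00", "$1,250.00", "$40.00"])

lengths = prices.str.len()                         # number of characters in each string
has_dollar = prices.str.startswith("$")            # True where the string begins with "$"
has_comma = prices.str.contains(",", regex=False)  # True where a comma appears

print(lengths.tolist())
print(has_dollar.tolist())
print(has_comma.tolist())
```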

We need to get rid of certain characters, and the right tool for this is the `.str.replace()` method. Replace will remove any character we like and insert a new character in its place, or it can simply remove a character and replace it with nothing.

The replace method also accepts [*regular expressions*](https://en.wikipedia.org/wiki/Regular_expression), a special kind of search pattern that lets us select characters in advanced ways. For instance, in this case it will let us select multiple characters at once, so that we don't need to write three separate `.str.replace()` methods for the three characters we want to remove. For now we're not doing a deep dive into regular expressions, but they'll come up again later in the semester.
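To get a feel for how a regular expression selects several characters at once, here is a toy sketch using Python's built-in `re` module rather than pandas. The strings are arbitrary examples chosen only to show the behavior.

```python
import re

# Square brackets match any ONE of the characters listed inside them
no_a_or_n = re.sub(r"[an]", "", "banana bread")
print(no_a_or_n)  # → b bred

# Outside brackets, a bare dot is a wildcard that matches any character...
any_char = re.sub(r".", "", "3.14")
print(repr(any_char))  # → ''

# ...while a backslash makes it match only a literal dot
literal_dot = re.sub(r"\.", "", "3.14")
print(literal_dot)  # → 314
```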

The regular expression (or regex) we'll use in our replace function is `[\$\.\,]`. The brackets mean that we want to search for any one of these three possibilities, and the backslashes mean to search for these literal characters (so that they don't accidentally perform a special regex function instead). We can ask the `.str.replace()` method to search for these characters and put nothing in their place. Copy this code into the cell below to see what it does:

```python
# The r prefix makes this a raw string, so Python passes the backslashes to the regex untouched
airbnb.price.str.replace(r"[\$\.\,]", "", regex=True)
```

Voilà! No more troublesome characters. But this column is still a string, and because we removed the decimal point, every value now has two extra digits: a price like `$85.00` would read as `8500`. Let's get the value in dollars by converting this to a `float` and then dividing by 100. We can combine that with the above replacement and save the whole thing into the price column. Copy this code into the cell below:

```python
airbnb.price = airbnb.price.str.replace(r"[\$\.\,]", "", regex=True).astype(float) / 100
airbnb.price
```

Much better! Notice we used a different function, `.astype()`, to change the data type this time. Pandas usually provides multiple ways to accomplish the same task.
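As an illustration of "multiple ways," the sketch below compares the approach above with one possible alternative on made-up prices: strip only the dollar sign and comma, keep the decimal point, and convert with `pd.to_numeric()`. This is not the method the workshop asks for, just a comparison. Note that dividing by 100 assumes every price is written with exactly two decimal places.

```python
import pandas as pd

raw = pd.Series(["$85.00", "$1,250.00", "$40.00"])  # hypothetical prices

# Approach from the text: strip "$", "." and ",", then divide by 100
via_divide = raw.str.replace(r"[\$\.\,]", "", regex=True).astype(float) / 100

# Alternative sketch: strip only "$" and ",", leaving the decimal point in place
via_numeric = pd.to_numeric(raw.str.replace(r"[\$,]", "", regex=True))

print(via_divide.tolist())
print(via_numeric.tolist())
```

On these toy values both routes give the same numbers.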

Now let's calculate the mean rental price again:

This kind of thing is *very common* in data science. Even thoughtfully prepared data can come to us in a form we wouldn't expect, and thankfully Pandas gives us the tools to change things up.

## Exploratory Data Analysis

Now we're ready to run some analyses on our data. As we prepare to fit a statistical model like regression, we always want to get a sense of our data first. Let's investigate `price` a bit.

**First, make a histogram of the Airbnb price. You'll have to change the number of bins or the binwidth to make this readable. Interpret the histogram fully, and note whether price is normally distributed.**

[Your interpretation here.]

**Now use `.describe()` to view some summary statistics for price.**
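If it helps to see what `.describe()` returns before running it on the real column, here is a sketch on invented numbers with one very large value. These are not Airbnb prices. When the mean sits well above the median (the `50%` row), that can be a hint of right skew, which you can check against your histogram.

```python
import pandas as pd

# Made-up values with one large outlier
toy_price = pd.Series([50, 60, 75, 80, 90, 100, 120, 150, 900])

summary = toy_price.describe()
print(summary)

# describe() returns a Series, so individual statistics can be pulled out by label
print("mean:", summary["mean"])
print("median:", summary["50%"])
```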

Now we know a little bit more about the `price` variable, which will be our *dependent, or target, variable*. The *independent, or predictor, variable* will be your choice. Scroll back up and take a look at the list of variables. Refer to the [documentation](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=1322284596) to make sure you understand what the column name means.

Choose one variable that you think will make a good predictor for `price`. **Below, explain the variable you chose and why you think it's a good option.**

[Your answer here.]

**Now make a regression plot showing the relationship between your dependent and independent variables. Interpret the plot fully, and note whether it seems like there's a correlation between them.**

[Your interpretation here.]

**Calculate the correlation coefficient between your two variables. Is this a weak or strong correlation? Is it a positive or negative correlation?**
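One way to do this, sketched on a toy dataframe, is the pandas `.corr()` method, which computes Pearson's correlation coefficient by default. The column name `predictor` and all values below are hypothetical; swap in your own dataframe and column.

```python
import pandas as pd

# Invented values: price tends to rise as the predictor rises
toy = pd.DataFrame({
    "predictor": [1, 1, 2, 2, 3, 4],
    "price": [80, 95, 120, 150, 200, 310],
})

# Pearson correlation between two columns; the result is between -1 and 1
r = toy["predictor"].corr(toy["price"])
print(round(r, 3))
```

A value near 1 or -1 suggests a strong linear relationship, and a value near 0 suggests a weak one.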

[Your answer here.]

**Do you think this variable is a good choice as the independent (predictor) variable in your regression? Why or why not?**

[Your answer here.]

If your variable is a good choice, you can start on the next section. **If your variable didn't work out, you may need to explore the data a little more and find a variable better suited as a predictor.** This happens all the time, and it's totally fine to try a few options. In fact, it's very unlikely that you found a good independent variable on your first try! Use your intuition and refer to the documentation to find another variable to try. Once you've got a good one, move on to the next section.

## Linear Regression

Now we're ready to model our data! We've done due diligence wrangling and exploring our data, and we have a much better sense of things.

**Now you can run the code for your model. Split the data into training and test sets, create an instance of the `LinearRegression()` class, fit the model, and calculate the slope and intercept. You can do all of those in one code cell below:**
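As a reminder of the general shape of these steps, here is a sketch on synthetic data with a known slope (30) and intercept (50), so you can see whether the fitted coefficients land close to the truth. It assumes `train_test_split` is the splitting function covered in class. The `test_size` and `random_state` values are arbitrary choices, and nothing here uses the Airbnb data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y = 50 + 30x plus random noise
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
y = 50 + 30 * x + rng.normal(0, 20, size=200)

# sklearn expects the predictors as a 2-D array (rows x columns)
X = x.reshape(-1, 1)

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit on the training rows only
model = LinearRegression()
model.fit(X_train, y_train)

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
```

With a pandas dataframe, selecting the predictor with double brackets (for example `df[["some_column"]]`, where `some_column` is a placeholder) gives the 2-D shape directly.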

**Now interpret the above coefficients *completely, accurately, and in terms of the data*. What do these coefficients tell us about the relationship between your independent variable and our dependent variable, the price of the rentals?**

[Your answer here.]

Finally, let's assess our model's performance. **Calculate both Root Mean Squared Error and the Coefficient of Determination ($R^{2}$) below. Remember that you'll need to predict some fitted values first.**
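To make the two metrics concrete, here is a sketch on four made-up actual and predicted values that are small enough to check by hand. The arrays are placeholders for your own test-set targets and predictions. RMSE is computed as the square root of `mean_squared_error`, which is one of several ways to get it.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented values: every prediction is off by exactly 10
y_test = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

# RMSE: square root of the average squared error, in the same units as y
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# R^2: 1 - (sum of squared residuals) / (total sum of squares around the mean)
r2 = r2_score(y_test, y_pred)

print("RMSE:", rmse)
print("R^2:", r2)
```

Here the squared errors are all 100, so the RMSE is 10, and $R^{2}$ is $1 - 400/12500 = 0.968$.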

**Interpret both of these metrics, especially $R^{2}$. Did your model account for much of the variance in the target variable, `price`? Are your residuals too big or too small? Remember that there aren't clear answers to these questions—they're based on your interpretation of the data's context. So apply what you've learned about this data set and give it a shot:**

[Your answer here.]

## Conclusion

**Write a brief conclusion summarizing what you found out. What variables seem to affect the price of an Airbnb rental in New Orleans? How or how much do these things affect the price? What might you want to try next?**

[Your conclusion here.]

## Bonus Challenge

*This next part is completely optional and won't affect your grade on this workshop at all. But you might find it interesting!*

Another way we can validate or assess a model is to find out if our residuals are normally distributed and have a mean close to 0. If they do, then we can be reasonably confident in our model. Below, calculate the in-sample residuals and make a histogram showing the residual distribution. Are the residuals normally distributed? Do they seem to have a mean near 0?
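Here is a sketch of the residual calculation on synthetic data (not the Airbnb data), using `np.histogram` to get bin counts without drawing anything. One thing worth knowing: for an ordinary least squares fit that includes an intercept, the in-sample residuals average to essentially zero by construction, so the shape of the distribution is usually the more informative part of this check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with normally distributed noise
rng = np.random.default_rng(1)
X = rng.uniform(1, 10, size=(150, 1))
y = 50 + 30 * X[:, 0] + rng.normal(0, 20, size=150)

model = LinearRegression().fit(X, y)

# In-sample residuals: actual minus fitted, on the same rows used for fitting
residuals = y - model.predict(X)
print("mean residual:", residuals.mean())

# Bin counts for a histogram; a plotting library would draw these as bars
counts, bin_edges = np.histogram(residuals, bins=15)
print(counts)
```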

[Your answer here.]

Another way to look at the distribution of residuals is to use a Quantile-Quantile Plot. This special type of scatter plot compares the residuals to normally-distributed values. As an extra challenge, see if you can make a Q-Q plot below. *This will definitely involve some Googling and/or searching the Seaborn site. You'll also probably have to import some new libraries. There are many possible approaches to this, so don't stress about doing it "the right way."*
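If you get stuck, one possible route (among many) is `scipy.stats.probplot`, sketched below on simulated residuals. The `fake_residuals` array is a stand-in drawn from a normal distribution, so you would replace it with your own residuals.

```python
import numpy as np
from scipy import stats

# Stand-in residuals drawn from a normal distribution (hypothetical data)
rng = np.random.default_rng(2)
fake_residuals = rng.normal(0, 20, size=200)

# Without a plot argument, probplot returns only the numbers behind a Q-Q plot:
# theoretical normal quantiles, the sorted data, and a least-squares line fit
(theoretical_q, ordered_vals), (slope, intercept, r) = stats.probplot(
    fake_residuals, dist="norm"
)

# r close to 1 means the points fall close to a straight line
print("correlation of Q-Q points:", r)
```

To draw the figure, you can pass `plot=plt` after `import matplotlib.pyplot as plt`. Points that hug the straight line suggest roughly normal residuals.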