# Airbnb Price Prediction

In this workshop, we'll apply data mining and machine learning to predict prices of Airbnb listings. The workshop will give you a hands-on experience of using real data to answer questions on urban dynamics. 

Our dataset is a listing of over 74,000 Airbnb rentals in six major US cities, with over 20 variables describing each property, such as host details and past reviews.

**The goal of the workshop is to predict the Airbnb price of a property given its characteristics.**

## Data preparation

Up to this point we have shown you examples of nice, clean, structured datasets on cities. This is all very nice, but not very realistic. In the real world, datasets will arrive at your door in bad shape - full of errors, null values and unreliability. You need to know how to approach these datasets, how to understand their limitations, and how to make repairs where necessary.

This part of the workshop will explore how we can use tools in Python to import, clean, analyse and preprocess new datasets for further analysis. Specifically we'll be working with the Pandas Data Analysis Library (http://pandas.pydata.org), which you may have used elsewhere. During this part of the tutorial, we'll work through an example data cleaning process, **finishing with data ready for data mining techniques**. You'll learn how to approach new datasets, and further expand your skills in using Pandas.

So without further ado, the first thing we need to do is set up our working environment. Run the scripts below to install the mapping dependencies and import the libraries we need, including Pandas.

In [0]:
# !apt-get install libgeos-3.5.0
!apt-get install libgeos-dev
!pip install https://github.com/matplotlib/basemap/archive/master.zip

In [0]:
# import libraries, and set pd as the pandas alias
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

pd.set_option('display.max_rows', 100) # specifies number of rows to show
pd.options.display.float_format = '{:40,.4f}'.format # specifies default number format to 4 decimal places
plt.style.use('ggplot') # specifies that graphs should use ggplot styling
%matplotlib inline 

### Data import

Pandas has a range of functions for importing data into Python. The import functions are relatively straightforward to use, and require very little in the way of actual coding. There are also functions available for a range of different data formats.

In this initial section of the workshop, we will explore how to import CSV data using Pandas. Other data formats (e.g. Excel, JSON, HTML) can be imported with similar ease. The documentation for these tools can be found here: http://pandas.pydata.org/pandas-docs/stable/io.html

An important thing to know about Pandas is that it will always load your data into its own DataFrame format. This format basically acts as a multicolumn table, and so will eventually make loading our cleaned data into MySQL all the easier. However, before we get to that stage, we need to load the data in, check it for problems and fix it up.

During this part of the workshop we'll be working with Airbnb price data obtained from *Kaggle* (https://www.kaggle.com/stevezhenghp/deloitte-airbnb-price-prediction). Understanding the landscape of Airbnb prices in cities is clearly a very interesting question for urban data scientists. However, as with any real data, the data might be incomplete or prone to errors, and so requires careful handling and analysis.

In [0]:
# location of raw data file
url = 'https://raw.githubusercontent.com/kirakowalska/airbnb-price-prediction/master/airbnb_properties.csv'

# this command loads your csv data and sets up the 'airbnb' dataframe, using the Pandas (pd) libraries
airbnb = pd.read_csv(url)

We've now imported the data as a dataframe called `'airbnb'`.

### Initial data checks

As with all new datasets, it is important that you consult the data before proceeding with any analysis. In this section, we will run some basic scripts to answer **the following questions**:

1. What is the nature of each attribute (e.g. continuous, discrete, etc.)?
2. How many rows of data should you expect following the import?
3. What issues remain around incompleteness in the data?

You can start the investigation by looking at the data itself. To do this, you simply call the dataframe by name. Look at the data using the command below. What does it tell you about what we have?

In [0]:
airbnb

This gives you a sample of the data, but it may be that you want to look at only the first 10 rows. You can do this by adding the `.head(n)` function to the end of the dataframe name, replacing `n` with the number of rows you wish to see. **Try this out below.**
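
As a minimal sketch of how `.head(n)` behaves, the snippet below uses a small made-up dataframe (`toy` and its values are hypothetical stand-ins, not the real Airbnb data). In the notebook you would call the same function on `airbnb`.

```python
import pandas as pd

# Hypothetical stand-in for the `airbnb` dataframe, so the snippet runs on its own
toy = pd.DataFrame({
    'log_price': [4.2, 5.0, 4.6, 5.3, 3.9, 4.8],
    'room_type': ['Private room', 'Entire home/apt', 'Private room',
                  'Entire home/apt', 'Shared room', 'Entire home/apt'],
})

# .head(n) returns the first n rows; with the real data: airbnb.head(10)
first_rows = toy.head(3)
print(first_rows)
```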

Now investigate the number of Airbnb properties by adding the `.shape` attribute to the end of the dataframe name. It returns a `(rows, columns)` pair.

In [0]:
airbnb.shape

Look also at the type of data in each column by running the `.info()` function on the dataframe.

In [0]:
airbnb.info()

As you can see, we have numeric features (e.g. `log_price`, `review_scores_rating`) as well as non-numeric features (`room_type`, `first_review`) describing each Airbnb property. This gives us a sense of what the data looks like, but doesn't provide an idea of how complete it is across the whole dataset. The `.count()` function provides counts of *non-null* values in each column. **Try running this below.**

In [0]:
airbnb.count()

It would appear that we have null values to deal with in multiple columns, e.g. `first_review` or `host_has_profile_pic`. We'll come on to that later.

Before we move on though, it would be worth exploring variation in the data we have imported. From the sample loaded earlier, it would appear we have a number of categorical columns, and it would be useful to know how many rows correspond to each category.

To do this, we use the `.value_counts()` function. In calling a function against a column, Pandas allows us to reference the column name directly within the function. This structure requires the dataframe (e.g. `airbnb`), the column name (e.g. `property_type`), and the function name (e.g. `value_counts()`).

In [0]:
airbnb.property_type.value_counts()

**What do these results tell you about the dataset?**

Next run the same `.value_counts()` function for the `room_type` column.
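
A minimal sketch of the same call on another column, using a made-up `toy` dataframe as a stand-in for `airbnb`. The `normalize=True` variant, which returns shares rather than raw counts, is an optional extra not used elsewhere in this workshop.

```python
import pandas as pd

# Hypothetical stand-in for the `airbnb` dataframe
toy = pd.DataFrame({
    'room_type': ['Private room', 'Entire home/apt', 'Private room',
                  'Entire home/apt', 'Shared room', 'Entire home/apt'],
})

# Number of rows in each category, largest first
room_counts = toy.room_type.value_counts()

# Same breakdown expressed as a share of all rows
room_shares = toy.room_type.value_counts(normalize=True)

print(room_counts)
print(room_shares)
```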

It would also be useful to get some summary statistics on the numerical features in the dataset. To do this, we can use the `.describe()` function. This provides some basic summary stats relating to variations in the column data. You run it in the same way as you did the `.value_counts()` function, by just calling it against a column name. **Try running this function for each column below.**

As you will see, different types of results are extracted for each column. `.describe()` combines a number of statistical measures, including `.max()`, `.min()` and `.mean()`. The full range can be found here: http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics
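
To illustrate how the output differs by column type, here is a small sketch on made-up data (`toy` is a hypothetical stand-in for `airbnb`). Numeric columns report measures such as the mean and quartiles, while text columns report the number of unique values and the most frequent one.

```python
import pandas as pd

# Hypothetical stand-in for the `airbnb` dataframe
toy = pd.DataFrame({
    'log_price': [4.0, 5.0, 6.0, 5.0],
    'room_type': ['Private room', 'Entire home/apt',
                  'Entire home/apt', 'Entire home/apt'],
})

# Numeric column: count, mean, std, min, quartiles, max
price_summary = toy.log_price.describe()

# Text column: count, unique, top (most frequent value), freq
room_summary = toy.room_type.describe()

print(price_summary)
print(room_summary)
```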

**What do these results tell you about Airbnb properties?**

We might also want to test the co-occurrence of different values across different data columns. To do this we use the `.groupby()` function, which takes one or more column names. Calling `.size()` on the result returns the number of rows in each group.

In [0]:
airbnb.groupby(['city', 'neighbourhood']).size()

Again, try this for the `'property_type'` and `'room_type'` columns. **What do these queries tell you about the data?**
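
As a sketch of what this query looks like, the snippet below groups a made-up `toy` dataframe (a hypothetical stand-in for `airbnb`) by both columns. The `.unstack()` step is an optional extra that turns the grouped counts into a cross-tabulation, which can be easier to read.

```python
import pandas as pd

# Hypothetical stand-in for the `airbnb` dataframe
toy = pd.DataFrame({
    'property_type': ['Apartment', 'Apartment', 'House', 'House', 'Apartment'],
    'room_type': ['Private room', 'Entire home/apt', 'Entire home/apt',
                  'Entire home/apt', 'Private room'],
})

# Number of rows for each (property_type, room_type) combination
pair_counts = toy.groupby(['property_type', 'room_type']).size()

# Optional: pivot into a table, filling combinations that never occur with 0
pair_table = pair_counts.unstack(fill_value=0)

print(pair_counts)
print(pair_table)
```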

You should now be building a picture of what the data looks like and the associations between columns. 

The last way to explore the data is by plotting it. Again, this is very simple using Pandas. We just use the `.plot.scatter()` function on the dataframe to create a scatter plot, and specify the `x` and `y` axes. In the case of this data, we might be interested in a range of relationships, for example in the relationship between the number of reviews of a property (`number_of_reviews`) and its average rating (`review_scores_rating`).
**Run the code below to explore the relationship. What can you say about it?**

In [0]:
airbnb.plot.scatter(x='number_of_reviews',y='review_scores_rating')

We can also plot data using its spatial (lat, lon) coordinates. Here the `Basemap` library comes in handy. The library comes with a range of tools for creating maps in Python. We show you an example usage of the library below.

Suppose we want to create a map showing the number of Airbnb listings in each city. Firstly, we use the `groupby()` function to get the average location and the number of listings in each city. We then map the outcome using the `Basemap` functionality.

In [0]:
locations = airbnb[['city', 'longitude', 'latitude']].groupby(['city'], as_index=False).mean()

In [0]:
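# count listings per city; naming the column here avoids an extra 'index' column in newer Pandas versions
counts = airbnb.groupby('city').size().reset_index(name='count')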

In [0]:
df = pd.merge(locations, counts, on=['city'])

In [0]:
# Rename columns
df.columns = ['city','longitude','latitude','count']

In [0]:
df

In [0]:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

In [0]:
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution='l', 
            lat_0=40, lon_0=-100,
            width=7E6, height=5e6)
m.etopo(scale=0.5, alpha=0.5)
m.scatter(df['longitude'].values,df['latitude'].values, latlon=True,s=df['count'].values/10,alpha=0.4,edgecolors='r',linewidths=2,c='r')

### Data exploration

Now that we have a basic understanding of the data, data formats and missing values, we need to start some more sophisticated analysis to get a sense of attributes that will help us predict Airbnb prices. 

Before building a fancy predictive model, it is always a good idea to explore the data in search of attributes that are correlated with property price.

Let's start with simple plots of pairs of attributes. For example, we could check how property prices change depending on location or property type.

In [0]:
airbnb[['property_type', 'log_price']].groupby(['property_type'], as_index=False).mean().sort_values(by='log_price',ascending=False)

Hmm interesting, renting a tent is on average more expensive than a hostel!

In [0]:
airbnb[['city', 'log_price']].groupby(['city'], as_index=False).mean().sort_values(by='log_price',ascending=False)

Now run the same code to group the average price by the `accommodates` property. 

And the `room_type` property. **What can you say about the property price for each room type?**
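
One way to avoid repeating the long expression for each attribute is to wrap it in a small helper. This is only a sketch: `mean_price_by` is a hypothetical function name and `toy` is made-up data standing in for `airbnb`.

```python
import pandas as pd

# Hypothetical stand-in for the `airbnb` dataframe
toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Entire home/apt', 'Private room',
                  'Private room', 'Shared room'],
    'accommodates': [4, 2, 2, 1, 1],
    'log_price': [5.5, 5.0, 4.5, 4.0, 3.5],
})

def mean_price_by(df, column):
    """Average log_price for each value of `column`, highest first."""
    return (df[[column, 'log_price']]
            .groupby(column, as_index=False)
            .mean()
            .sort_values(by='log_price', ascending=False))

by_room = mean_price_by(toy, 'room_type')
by_capacity = mean_price_by(toy, 'accommodates')

print(by_room)
print(by_capacity)
```

With the real data you would call `mean_price_by(airbnb, 'room_type')` and `mean_price_by(airbnb, 'accommodates')`.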

Now, let's correlate the prices with numerical attributes.

In [0]:
g=sns.regplot(x=airbnb['review_scores_rating'],y=airbnb['log_price'],fit_reg=True)

### Data augmentation

Here, we'll see how we can augment our dataset with new features that might be useful in predicting property prices. One such feature is distance to the city centre.

In [0]:
list_city_centres = [['SF', 37.78, -122.42], ['Boston', 42.36, -71.06], ['Chicago', 41.87, -87.63], ['DC', 38.91, -77.03], ['LA', 34.05, -118.22], ['NYC', 40.72, -74.02] ] 

In [0]:
list_city_centres

In [0]:
city_centres = pd.DataFrame(list_city_centres, columns = ['city', 'lat_centre', 'lon_centre'])

In [0]:
city_centres

In [0]:
airbnb_augmented = pd.merge(airbnb, city_centres, on=['city'])

In [0]:
airbnb_augmented

In [0]:
airbnb_augmented['distance_to_centre']=np.sqrt((airbnb_augmented['lat_centre']-airbnb_augmented['latitude'])**2+(airbnb_augmented['lon_centre']-airbnb_augmented['longitude'])**2)

In [0]:
airbnb_augmented

In [0]:
distance_vs_price=sns.regplot(x=airbnb_augmented['distance_to_centre'],y=airbnb_augmented['log_price'],fit_reg=True)

In [0]:
print (np.corrcoef(airbnb_augmented['distance_to_centre'], airbnb_augmented['log_price']))
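
Note that the distance above is measured in degrees of latitude and longitude, which is only a rough proxy for distance on the ground (a degree of longitude covers less ground the further you are from the equator). If you want distances in kilometres, one option is the haversine formula. The sketch below is a suggested alternative rather than part of the original workshop; `haversine_km` is a hypothetical helper name, and the Earth radius used is an approximation.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees.

    Works on single numbers, numpy arrays or Pandas columns.
    """
    earth_radius_km = 6371.0  # mean Earth radius (approximation)
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))

# One degree of latitude, and one degree of longitude at 60 degrees north
one_degree_lat = haversine_km(0.0, 0.0, 1.0, 0.0)
one_degree_lon_at_60 = haversine_km(60.0, 0.0, 60.0, 1.0)
print(one_degree_lat, one_degree_lon_at_60)
```

In the notebook this could be applied column-wise, e.g. `haversine_km(airbnb_augmented['latitude'], airbnb_augmented['longitude'], airbnb_augmented['lat_centre'], airbnb_augmented['lon_centre'])`.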

### Data preprocessing

You will see that, given the number of attributes involved, it is hard to get an overview of the major trends underlying the data just by looking at pairs of features. Nevertheless, this is something we need to get a grip of before we proceed to more sophisticated analyses - the trends we generate through clustering and regression may look interesting, but they could be misleading without us having a good underlying knowledge of the data.

In order to have a complete look at the data and then be able to run clustering or regression analysis later on, we need to represent all attributes as numerical variables. As you remember, some Airbnb attributes are numerical (e.g. `number_of_reviews` or `beds`), but other attributes are categorical (e.g. `bed_type` or `amenities`).

The most common way to convert categorical variables to numerical variables is using so-called **one-hot encoding**, which converts each *categorical* variable into multiple *numerical* variables. Each new numerical variable represents one category and can take the value of either 1 (category present) or 0. In Pandas, one-hot encoding can be easily done by running the `get_dummies()` function that requires your dataframe and a list of categorical columns as inputs.

**Check out the code below - run it and see what you get. How many variables do we have after one-hot encoding?**

In [0]:
airbnb_augmented.info()

In [0]:
categorical=['property_type','room_type','bed_type','cancellation_policy']
airbnb_model=pd.get_dummies(airbnb_augmented, columns=categorical)

In [0]:
airbnb_model.head(5)

In [0]:
airbnb_model.info()

You can see above that one-hot encoding converted selected categorical attributes into multiple numerical attributes, each corresponding to one category. As a result, we increased the number of attributes to 73.

In [0]:
numerics = ['uint8', 'int64', 'float64', 'bool']

In [0]:
airbnb_model=airbnb_model.select_dtypes(include=numerics).fillna(0)

In [0]:
airbnb_model.info()

Once we are left with numerical attributes only, we can identify dependency between `log_price` and all attributes by extracting a correlation matrix. This method calculates the Pearson R statistic for each pair of attributes. 

**Run the very simple code below to extract correlations for this data. Do these results align with what you found above? Which pair of attributes has the strongest correlation?**

**NOTE**: You should be familiar with the concept of correlation of datasets, but if you're not then have a quick look at this page http://learntech.uwe.ac.uk/da/Default.aspx?pageid=1442

In [0]:
airbnb_model.corr()

Hopefully you've now identified attributes which have high and low correlation with Airbnb price - this will be important going forward as we construct clusters and regression models to predict property prices.
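
The full correlation matrix is large, so it can help to pull out just the `log_price` column and rank it. The sketch below does this on synthetic data (the `toy` dataframe and its coefficients are made up for illustration); with the real data you would start from `airbnb_model.corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `airbnb_model`: price depends on `accommodates` only
rng = np.random.default_rng(0)
n = 200
accommodates = rng.integers(1, 9, size=n)
toy = pd.DataFrame({
    'accommodates': accommodates,
    'log_price': 4.0 + 0.2 * accommodates + rng.normal(0, 0.3, size=n),
    'noise': rng.normal(size=n),
})

# Correlation of every attribute with log_price (dropping log_price itself)
price_corr = toy.corr()['log_price'].drop('log_price')

# Rank by absolute strength, so strong negative correlations are not buried
ranked = price_corr.reindex(price_corr.abs().sort_values(ascending=False).index)
print(ranked)
```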

## Clustering

The first analysis we will undertake is some exploratory clustering of the data. Clustering draws together subsets of our data, revealing structure within our data points. It also helps tell us which attributes are more discriminatory in differentiating between entries, and identify those which are not useful for taking forward to later analyses.

In this tutorial, we will work through the use of one clustering approach, and then you'll be expected to try out two more. The `scikit-learn` package makes production of multiple models very simple, so once you've run one model, the rest should be relatively straightforward.

We'll start off with developing a KMeans clustering model. You'll recall from the lecture that KMeans is a distance-based clustering method, which requires the prior specification of the desired number of clusters. 

The documentation for KMeans can be found [here](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), and it describes the input and expected outputs of the function.

First things first, **we always import the package we need**.

In [0]:
from sklearn.cluster import KMeans

Next we need to specify the data on which we are going to cluster. This work is exploratory, so we don't know what we're going to get back, or whether those clusters are meaningful. However, the process allows us to identify structure in data and association between attributes. 

So, we need to decide which data to use. Clustering can take any number of attributes, but for this first stage we'll just use two as they will allow us to easily visualise the clusters. One of the variables will be `log_price` as that's what we are mostly interested in. Choose another attribute that you think could vary strongly with price.

**Create a subset of the data containing just the `log_price` and your chosen attribute.**

In [0]:
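# .copy() gives us an independent dataframe, so adding a column later won't trigger a SettingWithCopyWarning
price_review=airbnb_model[['log_price','review_scores_rating']].copy()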

There is an issue with this data in that the `log_price` and `review_scores_rating` values are numerically very far apart. The higher variance in attributes with higher values means clusters are more likely to be influenced by these attributes than those with smaller value and variance.

So before moving on to the clustering stage, we need to standardise our new dataset. This rescales each attribute to have zero mean and unit variance, so that both attributes contribute comparably to the distance calculation. We do this using the [`preprocessing.scale()`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html) function, as shown below. **Add in your dataset and run the function to create your scaled values. When complete, check out the resulting data.**

In [0]:
price_review

In [0]:
import sklearn.preprocessing as preprocessing
scaled = preprocessing.scale(price_review) # add your dataframe name here

In [0]:
scaled

You'll see that the function has created a numpy array, rather than a Pandas dataframe. This isn't a problem, as our clustering method will take either, but is something to bear in mind. Now on to the clustering!
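
If you want to check what `preprocessing.scale()` has done, a quick sketch on a tiny made-up array (`toy` is hypothetical) shows that each column ends up with a mean of 0 and a standard deviation of 1, which also means some scaled values are negative.

```python
import numpy as np
import sklearn.preprocessing as preprocessing

# Two columns on very different numeric ranges, like log_price and a rating
toy = np.array([[4.0, 80.0],
                [5.0, 90.0],
                [6.0, 100.0]])

scaled_toy = preprocessing.scale(toy)

col_means = scaled_toy.mean(axis=0)  # approximately 0 for every column
col_stds = scaled_toy.std(axis=0)    # approximately 1 for every column

print(scaled_toy)
print(col_means, col_stds)
```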

**IMPORTANT:** The `scikit-learn` package operates in a very similar fashion, regardless of the approach you are using. It uses the `.fit()` function to execute the model and return results. Each model will take different arguments, but the same essential approach is used every time.

So, let's run the KMeans clustering method. According to the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), the model can take a number of parameters. From our knowledge of the approach, we know that the number of clusters `n_clusters` is fundamental in creating the clusters. I've suggested a number of clusters below, but we'll try amending those later.

**Make sure you understand the code below (you'll need to know it for later), then run it to create the KMeans clusters.**

In [0]:
kmeans = KMeans(n_clusters=2) # create KMeans cluster object

In [0]:
kmeans.fit(scaled) # run the .fit() function on the scaled dataset

Now that we have fitted our KMeans object, we can extract the groups it has identified. We do this using the `.labels_` attribute. **Run the code below to create an array of group labels for our data.**

In [0]:
kmeans_labels = kmeans.labels_

The next thing to do is measure the quality of the clusters. We can do this through a range of measures, described in detail [here](http://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation). Most of these, however, require a ground truth describing how the clusters should look (i.e. labels that would help us supervise the creation of the clusters). In this case we do not have this.

A measure that does not need a ground truth is the Silhouette Score, which calculates how close points are on average to points they're clustered with, relative to points they are not clustered with.

The `scikit-learn` algorithm ([docs](http://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient)) for Silhouette Score simply takes the data and the generated labels. A score closer to one indicates strong clustering, negative scores indicate poor clustering.

**Run the code below and extract the Silhouette Score for the clusters created above.**

In [0]:
from sklearn import metrics

In [0]:
metrics.silhouette_score(scaled, kmeans_labels)

Now that we have a sense of the quality of our clusters, we can visually explore the clustering. **So let's add the labels data as a new column to the `price_review` dataframe you created earlier.** The data order will have been maintained so the labels will align correctly with the data.

In [0]:
price_review['labels']=kmeans_labels

Now visualise the clusters and their variation with `log_price` and your chosen attribute (`review_scores_rating` in the example below). **Use a `.plot()` function to create a scatter plot of the two variables, with points coloured by group value. If you manage it, try getting creative with the colormap.** More information on how to do that can be found [here](http://pandas.pydata.org/pandas-docs/stable/visualization.html).

In [0]:
price_review

In [0]:
price_review.plot.scatter(x='log_price', y='review_scores_rating', c='labels')

In [0]:
price_review.plot.scatter(x='log_price', y='review_scores_rating', c='labels', colormap='viridis')

So what do you think of the results? Do the clusters look useful or realistic? **To help you judge, it is worth playing with the KMeans model: try different parameter values and observe how the groups change.**

Once you have tested different parameter settings, next add in additional attributes. This will create clusters in multiple dimensions. You won't be able to visualise the results easily in a graph, but you can test the clustering again using the Silhouette Score.

**Add in attributes that you think will help create better property clusters, and then test the performance gain or loss using the Silhouette Score.**
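
A sketch of how such a comparison might be organised is shown below. It uses synthetic data from `make_blobs` in place of the Airbnb attributes (an assumption made so the snippet runs on its own), and loops over several values of `n_clusters` to compare Silhouette Scores. The `n_init` and `random_state` settings are there only to make the runs repeatable.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import sklearn.preprocessing as preprocessing

# Synthetic three-attribute data standing in for a subset of `airbnb_model`
features, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=0)
scaled_features = preprocessing.scale(features)

scores = {}
for k in range(2, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(scaled_features)  # fit and return labels in one step
    scores[k] = metrics.silhouette_score(scaled_features, labels)

best_k = max(scores, key=scores.get)  # k with the highest Silhouette Score
print(scores, best_k)
```

On the full Airbnb dataset the Silhouette Score can be slow because it compares every pair of points. `silhouette_score` accepts a `sample_size` argument to compute it on a random subset instead.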

### Other Clustering Methods

Now that you've mastered the KMeans model, it's time to experiment with some others. The process is almost exactly the same, but the models need careful parameterisation and therefore a knowledge of how these approaches differ.

**Referring back to the lecture notes and the `scikit-learn` online [literature](http://scikit-learn.org/stable/modules/clustering.html#clustering), develop clusters using...**

1. DBSCAN
2. Affinity Propagation

**In using each approach, you should seek to understand...**

* The parameters you need to get each approach to work, and their impact on the resulting clusters.
* Differences in the resulting clusters generated through each approach.
* The time each approach takes to run, and its indication of computational complexity. For this stage, you should explore how to use the `time.time()` function to measure how long the `.fit()` function takes to complete.
* How the addition of further attributes affects the speed of cluster generation.
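
The timing comparison described above can be sketched as follows. The data are synthetic (`make_blobs`), and the `eps`, `min_samples` and `random_state` values are illustrative guesses rather than recommended settings, so expect to tune them for the Airbnb data.

```python
import time

from sklearn.cluster import DBSCAN, AffinityPropagation
from sklearn.datasets import make_blobs
import sklearn.preprocessing as preprocessing

# Small synthetic dataset standing in for a scaled subset of `airbnb_model`
points, _ = make_blobs(n_samples=200, centers=3, random_state=0)
points = preprocessing.scale(points)

models = {
    'DBSCAN': DBSCAN(eps=0.3, min_samples=5),
    'AffinityPropagation': AffinityPropagation(random_state=0),
}

timings = {}
label_sets = {}
for name, model in models.items():
    start = time.time()
    model.fit(points)
    timings[name] = time.time() - start   # seconds taken by .fit()
    label_sets[name] = model.labels_

for name in models:
    # DBSCAN labels noise points as -1, so exclude that from the cluster count
    n_found = len(set(label_sets[name]) - {-1})
    print(name, 'clusters:', n_found, 'seconds:', round(timings[name], 4))
```

Affinity Propagation works with a similarity value for every pair of points, so it is likely to be impractical on all 74,000 listings. Trying it on a random sample first (e.g. with the dataframe's `.sample()` function) is a safer starting point.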

If you have time and are feeling brave, try running agglomerative clustering too. If you want to visualise the resulting dendrogram, then the code found [here](https://github.com/MatKallada/mathewapedia-learn/commit/70cf4a676caa2d2dad2e3f6e4478d64bcb0506f7) will help you do that.

## Regression

Now that you have a bit more of a grasp of the data, it's time to explore some of the relationships between property price and other attributes in the dataset with the ultimate goal of predicting property prices. As you'll know, we can do this through regression.

### Linear regression

Fortunately, because `scikit-learn` is so bloody marvellous, we don't actually need to learn much new syntax to get a model off the ground. As long as we have a decent understanding of the approach we are using, we can generate the models pretty quickly. We'll start with the simplest approach - Linear Regression - and advance from there.

**Before we start, check out the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) on the `scikit-learn` LR implementation.** 

With that in mind, let's **import the package we need**.

In [0]:
from sklearn.linear_model import LinearRegression

Now we'll set up our dataset. In this case we will try and predict the property price value shown in `log_price`. We'll use one explanatory variable to predict these prices (we'll try more in a minute). **From your earlier analysis pick one attribute which you think will be interesting to investigate, then create a subset of the data including those two attributes below.**

In [0]:
airbnb_prediction=airbnb_model[['log_price','accommodates']]

One thing to know about linear regression is that it treats response (`y`) and explanatory (`X`) data differently. The `scikit` function requires that we pass these data separately.

To organise our data into the right format we need to run a `numpy` `.reshape()` command. In doing so we need to specify for both the response and explanatory data the number of points we're including (the length of the dataframe), and how long each entry is (1 attribute in this case). **We do this using the following commands - insert your variable names into the independent variable range.** We'll specify these as `y` (our response variable) and `X` (our explanatory variable).

In [0]:
y = airbnb_prediction['log_price'].values.reshape(len(airbnb_prediction['log_price']), 1)

In [0]:
X = airbnb_prediction['accommodates'].values.reshape(len(airbnb_prediction['accommodates']), 1) # add attribute names

Now the data is in order, we can run our Linear Regression function. Consulting the documentation, and knowing about Linear Regression, we can guess that we can get by without setting any of the optional parameters. **So let's first create the LinearRegression object, just like we did with KMeans.**

In [0]:
lr = LinearRegression()

And just like earlier, we'll run the model on the data using the `.fit()` function. This time we pass it our X and y arrays. **Do this below (and make sure you get them the right way around!).**

In [0]:
lr.fit(X,y)

Now that we've created our fit, it's time to look at the structure of the model and how well it fits the data.

There are a number of ways to do this. In the first instance, as with any linear regression model, we want to get a grip on the coefficients and intercept of the model. Helpfully, both are stored as attributes on the fitted LinearRegression object. **Consult the documentation and see if you can find out how to extract these.**

In [0]:
lr.intercept_ # intercept of the model we have already fitted (no need to refit)

In [0]:
lr.coef_ # coefficient for each explanatory variable

We are also naturally interested to see how well the model performed - in this case how closely our explanatory variable predicts property prices. The typical approach to this is to calculate the R2 value, which `scikit` again provides through the `.score()` function.

The code below will generate the R2 score for us. **Run the code and see how well the model performed.** 

**NOTE:** In this case we have used the original data to test the fit. A more robust approach would be to split the data into separate subsets for training and testing the model. In that case, we would pass the test dataset to the `.score()` function to test the fit.

In [0]:
lr.score(X,y)
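
The more robust approach mentioned in the note can be sketched with `train_test_split` from `scikit-learn`. The data below are synthetic (`X_toy` and `y_toy` are made up so the snippet runs on its own), and the 25% test share is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X (accommodates) and y (log_price)
rng = np.random.default_rng(0)
X_toy = rng.integers(1, 9, size=(500, 1)).astype(float)
y_toy = 4.0 + 0.2 * X_toy[:, 0] + rng.normal(0, 0.3, size=500)

# Hold back 25% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # fit on data the model has seen
test_r2 = model.score(X_test, y_test)     # fit on unseen data
print(train_r2, test_r2)
```

A test score well below the training score is a sign that the model is overfitting.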

Finally, we would like to see the fit visualised, which we can do using a scatter plot. However, because we want to plot points and the line of best fit, we need to configure things slightly differently, pulling on some functionality provided in `matplotlib`. 

This process works by assigning plots to the `matplotlib` `plt` object (we defined this when we imported the library earlier). Below we first set the size of the figure area (using the `figsize` parameter), then we add a scatter graph of the original data, and then we plot a line between our independent variable (`X`) and the predictions of `y` made by the linear regression model. We also add axis labels and a legend (which uses the `label` parameter set in each plot). The final `plt.show()` command prompts `matplotlib` to show the combined plots.

**Check through the code below and execute it to reveal the chart. Enter a label name for your independent variables. Try adjusting the parameters to change the look of the chart, referring to the [documentation](http://matplotlib.org/api/pyplot_summary.html) where needed.**

In [0]:
plt.figure(figsize=(12, 8))
plt.scatter(X, y, color='black', alpha=0.2, label='Real')
plt.plot(X, lr.predict(X), color='blue', label='LR')
plt.xlabel('accommodates')
plt.ylabel('log_price')
plt.legend()
plt.show()

Well done on creating your first regression model using `scikit-learn`, pretty straightforward eh? 

**A couple of important points to consider - How does that fit look to you? Are you concerned about any skew in the points in the plot? What could you do to resolve this issue? In which way could you transform the data? Think about this and make any necessary adjustments to the data above.**

If you have time, try adding in more attributes to test the fit with the model. It won't be possible to visualise the fit beyond three dimensions, but the R2 value will provide an indication of model fit. You will need to adjust the `.reshape()` function to accept the number of attributes you intend to use (it is currently configured to take only 1).
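
As a sketch of the multi-attribute case: selecting several columns with a list already gives a two-dimensional array, so no `.reshape()` is needed for `X`. The data and the choice of `bedrooms` as a second attribute are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for `airbnb_model` with two explanatory attributes
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    'accommodates': rng.integers(1, 9, size=n),
    'bedrooms': rng.integers(0, 5, size=n),
})
toy['log_price'] = (4.0 + 0.15 * toy['accommodates'] + 0.1 * toy['bedrooms']
                    + rng.normal(0, 0.3, size=n))

feature_names = ['accommodates', 'bedrooms']
X_multi = toy[feature_names].values  # shape (n, 2): one column per attribute
y_multi = toy['log_price'].values

lr_multi = LinearRegression().fit(X_multi, y_multi)
r2_multi = lr_multi.score(X_multi, y_multi)

# For comparison: the same model using the first attribute only
X_single = X_multi[:, :1]
r2_single = LinearRegression().fit(X_single, y_multi).score(X_single, y_multi)

print(dict(zip(feature_names, lr_multi.coef_)), r2_single, r2_multi)
```

When scored on the training data, R2 cannot go down as attributes are added, so a higher value here does not by itself mean the extra attribute is useful.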

### Support Vector Regression 

From what we have learnt so far it's pretty easy to extend these approaches to other modelling techniques. Let's quickly have a play around with Support Vector Regression.

SVR uses a highly adaptable kernel to model the shape of the relationship between dependent and independent variables. A kernel is a smooth surface which can effectively bend around the data, and therefore adapt better to non-linearities in the relationship between variables.

**Let's start by importing the package as below.**

In [0]:
from sklearn.svm import SVR

Now, as we've seen before, we first need to set up the model. For the purposes of this example, we'll choose an RBF `kernel` (others are available, and can be tested), set an error penalty term (`C`) which penalises poor model performance, and a `gamma` value to control the extent to which the line can adapt to the points.

In [0]:
svr = SVR(kernel='rbf', C=100, gamma=100)

**Now fit the model to the data we used above.**

In [0]:
svr.fit(X, y.ravel()) # SVR expects a 1-D target, so we flatten the (n, 1) array

**Check the R2 score.**

In [0]:
svr.score(X,y)

As we have moved between dimensions, the shape of the fit created by SVR can look chaotic when drawn as a line. So this time we'll just draw the predicted distribution of points generated by the SVR, and compare it to the original data.

**Create a visualisation with two scatter plots overlaid, one of the real data (in black), one with the prediction (in red). Then try adding the predicted line from the linear regression model too (in blue) and see how they compare. You should end up with something like that below.**

In [0]:
plt.figure(figsize=(12, 8))
plt.scatter(X, y, color='black', alpha=0.2, label='Real')
plt.plot(X, lr.predict(X), color='blue', label='LR')
plt.scatter(X, svr.predict(X), color='red', label='SVR')
plt.xlabel('accommodates')
plt.ylabel('log_price')
plt.legend()
plt.show()

Now that you've seen how SVR can work, and observed its improvement on Linear Regression, **try running it with different parameters**, and get a sense of how they impact the quality of the prediction. Also check **how different settings affect the model run time**, as this will have a big impact when you move to bigger datasets.

You can **try different kernels too**, check out [this page](http://scikit-learn.org/stable/modules/svm.html) to learn more about the different options.
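
One way to organise this experimentation is a small loop over parameter values that records both the R2 score and the run time. The sketch below uses synthetic data and an arbitrary handful of `C` and `gamma` values, all chosen for illustration only.

```python
import time

import numpy as np
from sklearn.svm import SVR

# Synthetic, mildly non-linear stand-in for X (accommodates) and y (log_price)
rng = np.random.default_rng(0)
X_toy = rng.uniform(1, 8, size=(200, 1))
y_toy = 4.0 + np.log(X_toy[:, 0]) + rng.normal(0, 0.2, size=200)

results = []
for C in (1, 100):
    for gamma in (0.1, 10):
        model = SVR(kernel='rbf', C=C, gamma=gamma)
        start = time.time()
        model.fit(X_toy, y_toy)
        elapsed = time.time() - start
        # R2 here is measured on the training data, as in the rest of this section
        results.append((C, gamma, model.score(X_toy, y_toy), elapsed))

for C, gamma, r2, elapsed in results:
    print('C=%s gamma=%s R2=%.3f seconds=%.4f' % (C, gamma, r2, elapsed))
```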

**Finally, let's see how well we can predict Airbnb prices using all available attributes.** We will no longer be able to visualise the result, but we should see that our prediction accuracy measured using the R2 score goes up.

In [0]:
X_all = airbnb_model.drop('log_price',axis=1).values

In [0]:
svr.fit(X_all, y.ravel()) # flatten y to the 1-D shape SVR expects

In [0]:
svr.score(X_all,y)

### Exercises

Well done for completing this tutorial on urban data mining. You will now have a stronger understanding of how to use data mining tools to gain insight into your datasets. The principles you have learnt here are applicable across many data mining methods, as is much of the `scikit` syntax.

If you have time and/or interest in exploring these tools further, then you might want to try one or more of these activities:

1. We committed a methodological sin in reusing our calibration dataset for validation. But there are ways around this problem, either by using another dataset, or by splitting the dataset into parts for *cross-validation*. Look at the documentation on cross-validation here http://scikit-learn.org/stable/modules/cross_validation.html and then implement a simple technique to avoid this problem.
2. There are lots of nice datasets for mining held on the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.html). Take a look, and if any of them take your fancy try creating a regression model. Which data features are important? Which are not? My favourite is the Wine dataset (https://archive.ics.uci.edu/ml/datasets/Wine), for obvious reasons.
3. There is the Ames housing dataset available here - https://www.kaggle.com/c/house-prices-advanced-regression-techniques. It contains many more attributes and also relates to price prediction, which would make for a more sophisticated, interesting, but more complicated analysis process.
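
As a starting point for the first exercise, here is a minimal sketch of k-fold cross-validation using `cross_val_score`. The data are synthetic and the choice of 5 folds is arbitrary, so treat it as one possible approach rather than the intended solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X (explanatory variable) and y (log_price)
rng = np.random.default_rng(0)
X_toy = rng.integers(1, 9, size=(500, 1)).astype(float)
y_toy = 4.0 + 0.2 * X_toy[:, 0] + rng.normal(0, 0.3, size=500)

# Split the data into 5 folds; each fold is used once for testing
# while the model is fitted on the other 4
cv_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5, scoring='r2')

print(cv_scores)
print('mean R2: %.3f' % cv_scores.mean())
```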