# Predicting Yelp Ratings for Restaurants

Yelp is a site on which customers can write reviews and rate local businesses. These ratings and reviews matter a great deal to the businesses involved. As [Netwaiter](https://www.netwaiter.net/articles/how-to-manage-your-restaurants-yelp-reviews/#:~:text=Your%20Yelp%20reviews%20and%20ranking%20matter.%20If%20your,Yelp%20reviews%2C%20use%20these%20techniques%20to%20manage%20them.) puts it in an article on the topic: "Your Yelp reviews and ranking matter. If your restaurant does not sustain a four-star ranking or higher, you may see a dramatic loss in business."

In this project, we're going to pull restaurant data from Yelp's API and use it to build a model that predicts a restaurant's Yelp rating. You can read more about the Yelp API [here](https://www.yelp.com/developers/documentation/v3/business_search).

In [758]:
import os
import requests
import pandas as pd

# Read the API key from an environment variable instead of hard-coding it.
# YELP_API_KEY is an assumed variable name; set it to your own key.
API_KEY = os.environ["YELP_API_KEY"]

params = {"term": "food", "location": "New York City", "radius": 40000, "limit": 50, "offset": 20}
headers = {"Authorization": "Bearer %s" % API_KEY}
response = requests.get("https://api.yelp.com/v3/businesses/search", params=params, headers=headers)
status_code = response.status_code
print(status_code)

200


The Yelp API returns at most 50 results per request. However, we can use the `offset` parameter to change which 50 results we get and, by paging this way, retrieve up to 1000 results per search. For our analysis we'll want more than 1000 results, so I'm going to create a function that takes a location and grabs 1000 results associated with the "food" search term.

Using this function, I'll grab a variety of restaurants from different American cities.
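
The paging arithmetic described above can be sketched without touching the network. This is only an illustration: `page_params` is a hypothetical helper, not part of the Yelp API, and the page size and result cap are the limits stated above.

```python
def page_params(city, page_size=50, max_results=1000):
    """Yield one query-parameter dict per page of search results."""
    for offset in range(0, max_results, page_size):
        yield {"term": "food", "location": city, "radius": 40000,
               "limit": page_size, "offset": offset}

pages = list(page_params("Boston"))
print(len(pages))                                 # number of requests needed
print(pages[0]["offset"], pages[-1]["offset"])    # first and last offsets
```

Each dictionary could be passed as `params` to a single request, so one city costs 20 requests.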

In [759]:
import os

# The Yelp API returns at most 50 results per request, so we page through
# the results with the offset parameter: 20 requests for a total of 1000 results.
df = pd.DataFrame()
def yelp_api(city, df):
    # YELP_API_KEY is an assumed environment variable name holding your API key.
    headers = {"Authorization": "Bearer %s" % os.environ["YELP_API_KEY"]}
    for offset in range(0, 1000, 50):
        params = {"term": "food", "location": city, "radius": 40000, "limit": 50, "offset": offset}
        response = requests.get("https://api.yelp.com/v3/businesses/search", params=params, headers=headers)
        data = response.json()
        temp_df = pd.DataFrame(data["businesses"])
        df = pd.concat([df, temp_df])
    return df

# Next, we'll use this function to get data for a variety of different cities.

df = yelp_api("New York City", df)

df = yelp_api("Philadelphia", df)

df = yelp_api("Pittsburgh", df)

df = yelp_api("Baltimore", df)

df = yelp_api("Cleveland", df)

df = yelp_api("Miami", df)

df = yelp_api("Chicago", df)

df = yelp_api("Columbus", df)

df = yelp_api("Boston", df)

df = yelp_api("Nashville", df)

df = yelp_api("Seattle", df)

Now that we've grabbed this data, let's print some results and get a sense of what the data looks like.

In [760]:
df.head()

Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,transactions,price,location,phone,display_phone,distance
0,qEe7nIhPfEL6L0-VYhWRrA,raku-new-york-6,Raku,https://s3-media1.fl.yelpcdn.com/bphoto/EQsQlL...,False,https://www.yelp.com/biz/raku-new-york-6?adjus...,615,"[{'alias': 'japanese', 'title': 'Japanese'}, {...",4.5,"{'latitude': 40.72728, 'longitude': -74.00252}","[delivery, pickup]",$$,"{'address1': '48 Macdougal St', 'address2': No...",12129894797,(212) 989-4797,2522.267046
1,7YYp_2WYWdRSt_NohV7Psw,casa-birria-nyc-new-york,Casa Birria NYC,https://s3-media3.fl.yelpcdn.com/bphoto/koyMTX...,False,https://www.yelp.com/biz/casa-birria-nyc-new-y...,62,"[{'alias': 'foodtrucks', 'title': 'Food Trucks...",5.0,"{'latitude': 40.778824, 'longitude': -73.953986}",[delivery],$,"{'address1': '86th & 3rd Ave', 'address2': '',...",16468682090,(646) 868-2090,8837.339832
2,FM3oBtsMdVZnb_ttxB9xmg,laojie-hotpot-brooklyn-6,LaoJie hotpot,https://s3-media1.fl.yelpcdn.com/bphoto/dhFEap...,False,https://www.yelp.com/biz/laojie-hotpot-brookly...,77,"[{'alias': 'hotpot', 'title': 'Hot Pot'}, {'al...",4.5,"{'latitude': 40.60012, 'longitude': -73.99135}",[],$$,"{'address1': '2314 86th St', 'address2': '', '...",17188879887,(718) 887-9887,11716.253609
3,qVlDLz8Ri-ThAFJgYHuQ9A,omusubi-gonbei-new-york-3,Omusubi Gonbei,https://s3-media1.fl.yelpcdn.com/bphoto/2MxgyQ...,False,https://www.yelp.com/biz/omusubi-gonbei-new-yo...,73,"[{'alias': 'japanese', 'title': 'Japanese'}, {...",4.5,"{'latitude': 40.750701, 'longitude': -73.976961}",[delivery],$,"{'address1': '370 Lexington Ave', 'address2': ...",19174727168,(917) 472-7168,5244.026713
4,MGd6HFEq1ALD58XWNviSXw,time-out-market-new-york-brooklyn,Time Out Market New York,https://s3-media1.fl.yelpcdn.com/bphoto/4oRxGW...,False,https://www.yelp.com/biz/time-out-market-new-y...,327,"[{'alias': 'food_court', 'title': 'Food Court'}]",3.5,"{'latitude': 40.70342863348067, 'longitude': -...",[],$$,"{'address1': '55 Water St', 'address2': '', 'a...",19178104855,(917) 810-4855,288.119008


## Cleaning the Data

The columns for this dataset are as follows:

* `id` - a unique Yelp ID associated with each business.
* `alias` - a unique, human-readable name given to each business.
* `name` - the name of the restaurant.
* `image_url` - a link to a photo for the business.
* `is_closed` - a boolean value that indicates whether the business has permanently closed.
* `url` - a link to the Yelp page for the business.
* `review_count` - the number of customer reviews for this business.
* `categories` - a list of dictionaries describing the business (e.g. what kind of food it sells).
* `rating` - the average rating for the business.
* `coordinates` - a dictionary containing latitude and longitude.
* `transactions` - the transaction types the business offers: pickup, delivery or reservations.
* `price` - the price level, indicated using dollar signs; more dollar signs mean the business is more expensive.
* `location` - the address of the business.
* `phone` - the phone number for the business.
* `display_phone` - the phone number written in a more readable format.
* `distance` - the distance, in meters, from the location searched for in the API.

Not all of these columns will contain useful information for this analysis, so we'll get rid of some of these columns now.

In [761]:
yelp_df = df.drop(["id", "alias", "image_url", "is_closed", "url", "phone", "display_phone"], axis=1)
yelp_df.head()

Unnamed: 0,name,review_count,categories,rating,coordinates,transactions,price,location,distance
0,Raku,615,"[{'alias': 'japanese', 'title': 'Japanese'}, {...",4.5,"{'latitude': 40.72728, 'longitude': -74.00252}","[delivery, pickup]",$$,"{'address1': '48 Macdougal St', 'address2': No...",2522.267046
1,Casa Birria NYC,62,"[{'alias': 'foodtrucks', 'title': 'Food Trucks...",5.0,"{'latitude': 40.778824, 'longitude': -73.953986}",[delivery],$,"{'address1': '86th & 3rd Ave', 'address2': '',...",8837.339832
2,LaoJie hotpot,77,"[{'alias': 'hotpot', 'title': 'Hot Pot'}, {'al...",4.5,"{'latitude': 40.60012, 'longitude': -73.99135}",[],$$,"{'address1': '2314 86th St', 'address2': '', '...",11716.253609
3,Omusubi Gonbei,73,"[{'alias': 'japanese', 'title': 'Japanese'}, {...",4.5,"{'latitude': 40.750701, 'longitude': -73.976961}",[delivery],$,"{'address1': '370 Lexington Ave', 'address2': ...",5244.026713
4,Time Out Market New York,327,"[{'alias': 'food_court', 'title': 'Food Court'}]",3.5,"{'latitude': 40.70342863348067, 'longitude': -...",[],$$,"{'address1': '55 Water St', 'address2': '', 'a...",288.119008


One of the columns that could be very useful for our purposes is the `categories` column, which tells us what kind of food each restaurant is. Let's take a look at some of the values contained in this column:

In [762]:
yelp_df["categories"].head()

0    [{'alias': 'japanese', 'title': 'Japanese'}, {...
1    [{'alias': 'foodtrucks', 'title': 'Food Trucks...
2    [{'alias': 'hotpot', 'title': 'Hot Pot'}, {'al...
3    [{'alias': 'japanese', 'title': 'Japanese'}, {...
4     [{'alias': 'food_court', 'title': 'Food Court'}]
Name: categories, dtype: object

Each entry in this column is a list of dictionaries with two keys, `alias` and `title`. The alias contains largely the same information as the title, so for our purposes we'll only use the `title` key.

Each row could contain anywhere between 1 and 3 dictionaries, because certain restaurants may fit into more than one category.

In order to use this column, we're going to need to clean it up a bit first.

In [763]:
# We'll create three separate columns for the categories.
df_expanded = df["categories"].apply(pd.Series)
df_expanded.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | {'alias': 'japanese', 'title': 'Japanese'} | {'alias': 'noodles', 'title': 'Noodles'} | |
| 1 | {'alias': 'foodtrucks', 'title': 'Food Trucks'} | {'alias': 'tacos', 'title': 'Tacos'} | |
| 2 | {'alias': 'hotpot', 'title': 'Hot Pot'} | {'alias': 'bbq', 'title': 'Barbeque'} | |
| 3 | {'alias': 'japanese', 'title': 'Japanese'} | {'alias': 'foodstands', 'title': 'Food Stands'} | |
| 4 | {'alias': 'food_court', 'title': 'Food Court'} | | |


Next, we'll split each of these columns into an `alias` column and a `title` column, then drop the `alias` column so we don't keep redundant information (most of the `alias` values are the same as the `title` values).

In [764]:
for i in range(0,3):
    df_expanded = pd.concat([df_expanded[i].apply(pd.Series), df_expanded], axis=1)
    df_expanded = df_expanded.drop(i, axis=1)
    df_expanded = df_expanded.drop("alias", axis=1)
    df_expanded = df_expanded.rename(columns = {"title":"category_" + str(i+1)})

# Drop the remaining 0 column and reorder the columns in numerical order.
df_expanded = df_expanded.drop(0,axis=1)
df_expanded = df_expanded.reindex(columns=['category_1', 'category_2', 'category_3'])
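
The same result can be reached more directly by pulling the `title` out of each dictionary. The sketch below runs on a small toy frame and assumes each `categories` entry is a list of up to three dictionaries, as shown above; `extract_titles` is a hypothetical helper name, not something from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({"categories": [
    [{"alias": "japanese", "title": "Japanese"}, {"alias": "noodles", "title": "Noodles"}],
    [{"alias": "food_court", "title": "Food Court"}],
]})

def extract_titles(categories, width=3):
    """Return the category titles, padded with None to a fixed width."""
    titles = [c["title"] for c in categories][:width]
    return titles + [None] * (width - len(titles))

titles = pd.DataFrame(
    toy["categories"].apply(extract_titles).tolist(),
    index=toy.index,
    columns=["category_1", "category_2", "category_3"],
)
print(titles)
```

This avoids the intermediate `alias` columns entirely, at the cost of assuming every dictionary has a `title` key.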

Now that we've cleaned up our category columns, we'll add them back to the dataframe and drop the original `categories` column.

In [765]:
yelp_with_categories=pd.concat([yelp_df, df_expanded], axis=1)
yelp_with_categories = yelp_with_categories.drop("categories", axis=1)
yelp_with_categories.head()

Unnamed: 0,name,review_count,rating,coordinates,transactions,price,location,distance,category_1,category_2,category_3
0,Raku,615,4.5,"{'latitude': 40.72728, 'longitude': -74.00252}","[delivery, pickup]",$$,"{'address1': '48 Macdougal St', 'address2': No...",2522.267046,Japanese,Noodles,
1,Casa Birria NYC,62,5.0,"{'latitude': 40.778824, 'longitude': -73.953986}",[delivery],$,"{'address1': '86th & 3rd Ave', 'address2': '',...",8837.339832,Food Trucks,Tacos,
2,LaoJie hotpot,77,4.5,"{'latitude': 40.60012, 'longitude': -73.99135}",[],$$,"{'address1': '2314 86th St', 'address2': '', '...",11716.253609,Hot Pot,Barbeque,
3,Omusubi Gonbei,73,4.5,"{'latitude': 40.750701, 'longitude': -73.976961}",[delivery],$,"{'address1': '370 Lexington Ave', 'address2': ...",5244.026713,Japanese,Food Stands,
4,Time Out Market New York,327,3.5,"{'latitude': 40.70342863348067, 'longitude': -...",[],$$,"{'address1': '55 Water St', 'address2': '', 'a...",288.119008,Food Court,,


Next, we'll turn our attention to the columns that tell us the location of the business. There are three columns that give us this information. 

The `location` column tells us the address of the business. This column will be too difficult for us to use since the algorithm won't be able to identify the significance of different streets and their relation to one another. Let's drop this column first.

In [766]:
yelp_with_categories = yelp_with_categories.drop("location", axis=1)

The `coordinates` column tells us the latitude and longitude of the restaurant. The other column related to location is the `distance` column, which tells us how far the restaurant is from the location searched for (in our case, the city). We'll keep both of these columns for now, until we determine whether or not these values correlate with our target column.

In [767]:
# Next, we'll make latitude and longitude columns out of the coordinates column.
coordinates = yelp_with_categories["coordinates"].apply(pd.Series)
yelp_with_categories = pd.concat([yelp_with_categories, coordinates], axis=1)
yelp_with_categories = yelp_with_categories.drop("coordinates", axis=1)

The next column that we'll deal with is the `transactions` column. This column contains a list that has up to three values: `delivery`, `pickup` and `restaurant_reservation`. The row for each restaurant will contain only those transactions that the restaurant engages in.

We'll split this into three columns, one for each transaction and, in each column, a restaurant will receive a 1 if it engages in this kind of transaction and a 0 if it doesn't.

In [768]:
temp_df = yelp_with_categories["transactions"].apply(pd.Series)
temp_df = pd.get_dummies(temp_df)
for i in ["delivery", "pickup", "restaurant_reservation"]:
    temp_df[i] = temp_df["0_" + i] + temp_df["1_" + i] + temp_df["2_" + i]
    temp_df = temp_df.drop(["0_" + i, "1_"+ i, "2_" + i], axis=1)

yelp_with_categories = pd.concat([yelp_with_categories, temp_df], axis =1)
yelp_with_categories = yelp_with_categories.drop("transactions", axis=1)
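
As an alternative sketch, the indicator columns can be built with a simple membership test, which does not depend on every `0_`/`1_`/`2_` prefixed dummy column existing. The data here is made up, and the code assumes each entry is a list of strings.

```python
import pandas as pd

toy = pd.DataFrame({"transactions": [["delivery", "pickup"], ["delivery"], []]})

for kind in ["delivery", "pickup", "restaurant_reservation"]:
    # 1 if the restaurant offers this transaction type, otherwise 0.
    toy[kind] = toy["transactions"].apply(lambda offered, k=kind: int(k in offered))

print(toy.drop("transactions", axis=1))
```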

Next, we'll look at the `price` column which uses a system of dollar signs to indicate price.

In [769]:
yelp_with_categories["price"].value_counts()

$$      5287
$       2132
$$$      302
$$$$      61
Name: price, dtype: int64

We'll just need to change this column so that it contains only numerical values.

In [770]:
def price_change(dollar):
    if dollar == "$":
        return 1
    elif dollar == "$$":
        return 2
    elif dollar == "$$$":
        return 3
    elif dollar == "$$$$":
        return 4


yelp_with_categories["price"] = yelp_with_categories["price"].apply(price_change)
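
Because the price level is just the number of dollar signs, the same conversion can also be sketched with the string length. This is an illustration on made-up values, not a replacement for the function above; missing prices stay missing.

```python
import pandas as pd

prices = pd.Series(["$", "$$", None, "$$$$"])

# Count the dollar signs; missing values remain null.
price_levels = prices.str.len()
print(price_levels.tolist())
```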

Before we finish, we'll check the dataset for null values and either replace them or drop the columns that contain null values.

In [771]:
yelp_with_categories.isnull().sum()

name                         0
review_count                 0
rating                       0
price                     3218
distance                     0
category_1                   0
category_2                3037
category_3                6184
latitude                     0
longitude                    0
delivery                     0
pickup                       0
restaurant_reservation       0
dtype: int64

Three columns contain null values here: `price`, `category_2` and `category_3`. The two extra category columns can simply be dropped: `category_1` already records a category for every restaurant, so we lose little by removing them.

In [772]:
yelp_with_categories = yelp_with_categories.drop(["category_2", "category_3"], axis=1)

The `price` column has quite a lot of null values, but this information will likely be important for our analysis so we don't want to lose it. Let's look again at some of the values in this column. 

In [773]:
yelp_with_categories["price"].value_counts(normalize=True)

2.0    0.679388
1.0    0.273966
3.0    0.038808
4.0    0.007839
Name: price, dtype: float64

We'll aim to fill the null values in the `price` column such that these proportions stay roughly the same. On Yelp, users are responsible for reporting price values, so a null value means that no user has reported the price level of this location. We have quite a lot of data here, so it's fair to assume that these proportions will hold for the rest of the restaurants in our dataset.

We can use the `ffill` method, which fills each null value with the last non-null value that precedes it.

In [774]:
yelp_with_categories["price"] = yelp_with_categories["price"].ffill()
yelp_with_categories["price"].value_counts(normalize=True)

2.0    0.673818
1.0    0.278818
3.0    0.039091
4.0    0.008273
Name: price, dtype: float64

In [775]:
yelp_with_categories.isnull().sum()

name                      0
review_count              0
rating                    0
price                     0
distance                  0
category_1                0
latitude                  0
longitude                 0
delivery                  0
pickup                    0
restaurant_reservation    0
dtype: int64

Now that our dataset is clean, we can start to select some features for our algorithm.

## Designing and Selecting Features

Before we look at the correlation of different columns in the dataset, let's look at our columns and make sure all of them are numeric columns.

In [813]:
# Work on a copy so later edits don't also modify yelp_with_categories.
clean_yelp = yelp_with_categories.copy()
clean_yelp.head()

| | name | review_count | rating | price | distance | category_1 | latitude | longitude | delivery | pickup | restaurant_reservation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Raku | 615 | 4.5 | 2.0 | 2522.267046 | Sushi & Japanese | 40.72728 | -74.00252 | 1 | 1 | 0 |
| 1 | Casa Birria NYC | 62 | 5.0 | 1.0 | 8837.339832 | Street Vendors | 40.778824 | -73.953986 | 1 | 0 | 0 |
| 2 | LaoJie hotpot | 77 | 4.5 | 2.0 | 11716.253609 | Korean | 40.60012 | -73.99135 | 0 | 0 | 0 |
| 3 | Omusubi Gonbei | 73 | 4.5 | 1.0 | 5244.026713 | Sushi & Japanese | 40.750701 | -73.976961 | 1 | 0 | 0 |
| 4 | Time Out Market New York | 327 | 3.5 | 2.0 | 288.119008 | Food Court | 40.703429 | -73.992146 | 0 | 0 | 0 |


Before we proceed, the `category_1` column will need to be adjusted into a numeric column. Let's look at the values in this column and see if there's anything that needs changing before we create dummy variables.

In [814]:
clean_yelp["category_1"].value_counts().head(10)

American              867
Bars                  784
Mexican               655
Pizza                 536
Cafes                 470
Sandwiches            402
Sushi & Japanese      385
Fast Food             331
Seafood               303
Breakfast & Brunch    301
Name: category_1, dtype: int64

There are quite a few categories here, so in order to make our algorithm effective, let's try to group together some redundant or similar categories.

In [815]:
# Map redundant or similar categories onto a single label.
category_groups = {
    "American (New)": "American",
    "American (Traditional)": "American",
    "Delis": "Sandwiches",
    "Tacos": "Mexican",
    "Empanadas": "Mexican",
    "Coffee & Tea": "Cafes",
    "Coffee Roasteries": "Cafes",
    "Burgers": "Fast Food",
    "Sushi Bars": "Sushi & Japanese",
    "Japanese": "Sushi & Japanese",
    "Food Trucks": "Street Vendors",
    "Food Stands": "Street Vendors",
    "Hot Pot": "Korean",
}

# Every bar-like category is folded into "Bars".
bar_categories = [
    "Cocktail Bars", "Sports Bars", "Wine Bars", "Breweries", "Dive Bars",
    "Tapas Bars", "Distilleries", "Brewpubs", "Pubs", "Gastropubs",
    "Beer Bar", "Irish Pub", "Whiskey Bars", "Izakaya",
]
category_groups.update({name: "Bars" for name in bar_categories})

clean_yelp["category_1"] = clean_yelp["category_1"].replace(category_groups)

Now that we've cleaned up this column, we can create our dummy variables.

In [816]:
category_df = pd.get_dummies(clean_yelp["category_1"])
clean_yelp = pd.concat([clean_yelp, category_df], axis=1)
clean_yelp = clean_yelp.drop("category_1", axis=1)

Next, let's look at the correlation coefficients for the `rating` column to determine which columns to drop.

In [817]:
yelp_corr = clean_yelp.drop("name", axis=1).corr()

In [818]:
import numpy as np
np.absolute(yelp_corr["rating"]).sort_values(ascending=False).head(10)

rating            1.000000
Fast Food         0.118915
delivery          0.106384
Pizza             0.088621
distance          0.081267
pickup            0.077526
Cafes             0.071899
Street Vendors    0.066234
American          0.064514
Bakeries          0.052890
Name: rating, dtype: float64

Let's drop the columns whose correlation with `rating` is particularly low (an absolute value below 0.05) in order to simplify our analysis.

In [819]:
drop_cols = yelp_corr[np.absolute(yelp_corr["rating"]) < 0.05].index

In [820]:
clean_yelp = clean_yelp.drop(drop_cols, axis=1)
clean_yelp = clean_yelp.reset_index(drop=True)
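
The filtering step can be illustrated on a toy frame. The column names `signal` and `noise` and all of the values are made up; only the 0.05 threshold is taken from the cell above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "rating": [4.0, 4.5, 3.0, 5.0, 3.5, 4.0],
    "signal": [1, 1, 0, 1, 0, 1],   # tends to be 1 when the rating is high
    "noise":  [1, 0, 1, 1, 0, 0],   # constructed to be uncorrelated with rating
})

corr = toy.corr()
# Columns whose absolute correlation with the target falls below the threshold.
weak = corr[np.absolute(corr["rating"]) < 0.05].index
reduced = toy.drop(weak, axis=1)
print(list(reduced.columns))
```

Note that the target itself always survives this filter, since its correlation with itself is 1.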

## Making Predictions

First, we will need to select an error metric as well as a model. In order to do this, let's first look at our target column.

In [821]:
clean_yelp["rating"].value_counts()

4.0    4167
4.5    3893
3.5    1151
5.0    1136
3.0     387
2.5     153
2.0      59
1.5      32
1.0      22
Name: rating, dtype: int64

Our target column will be the `rating` column and I've printed the value counts for this column above. While the column is a numeric datatype, it's important to note that the values are not continuous. The `rating` column is an ordinal variable. For our purposes, however, we will treat this column as if it contains continuous values. In the end, as needed, we will round values in order to get back values that look like the original ratings.

To simplify things, we'll multiply all the values in the `rating` column by 2. This way, when we round off our values at the end, we can just round to the nearest whole number, instead of rounding to the nearest 0.5.

In [822]:
clean_yelp["rating"] = clean_yelp["rating"]*2
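
The doubling trick can be checked on a few made-up predictions. The clipping step is an addition for illustration, assuming valid ratings run from 1 to 5 stars; the notebook itself only rounds.

```python
import numpy as np

raw_predictions = np.array([4.3, 3.1, 5.4, 0.2])   # hypothetical model outputs, in stars
doubled = raw_predictions * 2

# Round to whole numbers on the doubled scale, clip to the valid range 2..10,
# then halve to get back to half-star ratings.
half_star = np.clip(np.round(doubled), 2, 10) / 2
print(half_star)
```

Rounding on the doubled scale and halving gives the same result as rounding to the nearest 0.5 directly.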

In [823]:
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Before we split the dataset into train and test sets, recall that our data was collected one city at a time, so rows from the same city sit next to each other. This could cause problems when we split the dataset, so we'll shuffle the rows.

In [824]:
clean_yelp = clean_yelp.sample(frac=1).reset_index(drop=True)

To simplify things, we'll also group our columns together and label them as our feature columns.

In [825]:
features = clean_yelp.columns.drop(["name", "rating"])

Next, to make our process easier, we'll create a function that splits the dataset into train and test sets, uses linear regression to make predictions and returns the root mean squared error for the predictions. The inputs to the function will be the data, the size of the test set, the feature columns and the target column.

In [832]:
def lr_train_test(data, size, features, target):
    x_train, x_test, y_train, y_test = train_test_split(data[features], data[target], test_size=size, random_state=1)
    
    lr = LinearRegression()
    lr.fit(x_train, y_train)
    
    # We round the predictions so that they fit the categories of the rating column.
    predictions = lr.predict(x_test)
    predictions = np.round(predictions)
    
    return (mean_squared_error(y_test, predictions))**(1/2)

lr_train_test(clean_yelp, 0.2, features, "rating")

1.1591532960037527
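
Because the ratings were doubled, this error is on the doubled scale; dividing by 2 expresses it in stars, so 1.159 corresponds to roughly 0.58 stars. The sketch below computes the metric by hand on made-up numbers to make the calculation explicit.

```python
import numpy as np

y_true = np.array([8, 9, 6])   # doubled ratings (made up)
y_pred = np.array([8, 8, 8])   # rounded predictions (made up)

# Root mean squared error: square the errors, average them, take the square root.
rmse_doubled = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_stars = rmse_doubled / 2  # back on the original 1-5 star scale
print(rmse_doubled, rmse_stars)
```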

Next, we'll create a new function that uses k-fold cross-validation.

In [831]:
from sklearn.model_selection import KFold
def kfolds(X, y, n):
    rmses = []
    kf = KFold(n_splits=n)
    # split() yields positional indices, so we select rows with .iloc.
    for train_index, test_index in kf.split(X, y):
        x_train, x_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        
        lr = LinearRegression()
        lr.fit(x_train, y_train)
        # Round the predictions so that they fit the categories of the rating column.
        prediction = lr.predict(x_test)
        prediction = np.round(prediction)
        
        rmses.append((mean_squared_error(y_test, prediction))**(1/2))
    
    # Average the error across all folds.
    return sum(rmses)/len(rmses)

kfolds(clean_yelp[features], clean_yelp['rating'], 3)

1.1485864361195726
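
To see what the folds look like, here is a small sketch on six toy rows. It only shows how `KFold` partitions the row positions and trains no model.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)   # six toy rows, two features each
kf = KFold(n_splits=3)

test_rows = []
for train_index, test_index in kf.split(X):
    # Each fold holds out a different third of the rows for testing.
    test_rows.extend(test_index.tolist())
    print("train:", train_index, "test:", test_index)
```

Every row lands in the test set exactly once, which is why averaging the per-fold errors gives a steadier estimate than a single train/test split.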