                                                                   Helga Sigríður Thordersen Magnúsdóttir s202027 
                                                                                 Hlynur Árni Sigurjónsson s192302
                                                                             Katrín Erla Bergsveinsdóttir s202026
                                                                                Kristín Björk Lilliendahl s192296
 
 ![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

<img src="https://susfans.eu/sites/default/files/clients/DTU.png"  align="right" width="300"/>

# Advanced Business Analytics - Project

## Tripadvisor - Copenhagen restaurants reviews

<br>
The idea is to scrape Tripadvisor reviews of restaurants in Copenhagen.

#### The information gained would be

* Review text, rating and time
* The restaurants' info
* Basic reviewer info.

#### Our thoughts on what could be done with the data

* Create a network of reviewers and restaurants
* Sentiment analysis on the reviews
* Word embedding on the reviews
* Words associated with bad and good reviews
* Recommendation system based on ratings
* Time lapse of reviews based on location
* Spatial prediction and trends
* Fraud detection (detecting inauthentic reviews), which would likely be subjective
* Are some neighborhoods better rated than others?

## Contents
* [1 Scraper info](#scraper)
* [2 The datasets and data preparation](#datasets)
    * [2.1 restaurantInfo.csv](#restaurantInfo)
    * [2.2 reviews.csv](#reviews)
    * [2.3 reviewerInfo.csv](#reviewer)
    * [2.4 Shapefiles](#shapefiles)
* [3 Descriptive stats](#descStats)
    * [3.1 Restaurants](#restaurants)
    * [3.2 Reviews](#reviews)
* [4 Business questions](#businessquestions)

<hr>

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Imported data 🐼 
The first step as always is to install and import the necessary packages.

In [1]:
# !pip install pandas_profiling
# !pip install folium
# !pip install -U selenium

# Imports
import pandas as pd
import numpy as np
import seaborn as sns
import ast
import matplotlib.pyplot as plt
%matplotlib inline 
from pandas_profiling import ProfileReport
import requests
import urllib.parse
import folium
import geopandas as gpd
import json
from datetime import datetime

#  Scraper packages
from bs4 import BeautifulSoup
import csv 
from selenium import webdriver
import time
import sys
import os
import argparse
import string
import pickle

 ![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

<a id='scraper'></a>
# 1. Scraper info

A scraper tool was created to gather the reviews and the restaurants' info into csv files. The skeleton of the tool was taken from the GitHub repository [LaskasP](https://github.com/LaskasP/TripAdvisor-Python-Scraper-Restaurants-2021). It later turned out that the code had many errors and crashed after a few calls, so the scraper was fixed, improved and extended to gather richer information, including information about the reviewers. Since the amount of data is large and the tool was expected to encounter some errors along the way, the restaurant urls are stored in a csv so that scraping can be resumed. We chose to look at restaurants in the Copenhagen area, where over 1900 restaurants are available. Each restaurant has on average 700 reviews, so the scraping time quickly adds up.

The scraper is based on the beautifulsoup package and selenium. Selenium is used to open pages and click on elements in order to retrieve the next pages or additional information.

#### Selenium actions
* Click the next button, since Tripadvisor only displays 20 restaurants or reviews on each page
* Click boxes that pop up with additional information about reviewers
* Click the "more" button when a review exceeds a certain length

#### Procedure

* Find a Tripadvisor page for a selected area and select only restaurants
* Run the scrapeRestaurantsUrlsAll function, which retrieves all the urls in the selected area
* Run through all the urls and scrape the reviews with the get_reviews function
* If all reviews of a restaurant are retrieved successfully, remove that restaurant's url from the stored urls
* Separately run the scrapeRestaurantInfo function to get the information about the restaurants

As can be seen in the three code snippets below, there are a lot of "try: except:" clauses in the code. This is due to many small deviations between Tripadvisor pages. Data can be missing for some restaurants, so the scraper tries to retrieve each field and leaves it empty if the retrieval is not successful.
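
The repeated "try: except:" pattern could also be wrapped in one small helper. The following is only a sketch and not part of the scraper itself: the name `safe_extract` and the stand-in class `FakeTag` are hypothetical, and the helper catches only the errors that a missing tag typically raises (for example reading `.text` from `None`).

```python
def safe_extract(getter, default=None):
    """Run getter() and return its result, or default if the data is missing."""
    try:
        return getter()
    except (AttributeError, IndexError, TypeError, ValueError):
        return default


class FakeTag:
    """Hypothetical stand-in for a parsed html tag, used only for this demo."""
    def __init__(self, text):
        self.text = text


found = FakeTag(" 4.5 ")
missing = None  # what soup.find(...) returns when a tag does not exist

avg_rating = safe_extract(lambda: found.text.strip())
nr_reviews = safe_extract(lambda: missing.text.strip().split()[0], default=0)
print(avg_rating, nr_reviews)
```

In the scraper a call could look like `safe_extract(lambda: soup.find('span', class_='r2Cf69qf').text.strip())`. Catching specific exceptions instead of using a bare `except:` keeps unrelated problems, such as a keyboard interrupt, visible.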

#### Scrape all restaurants urls

The link to the next page is followed until the last page has been reached. Every restaurant url that is not "sponsored" in the listing is saved. The sponsored restaurants appear many times, and often they are the same restaurants. If they were not skipped, the data would contain a lot of duplicates and, what is worse, the run time of the scraper would be extended by a big margin.

In [None]:
# Get urls for all "next" pages in a selected area
def scrapeRestaurantsUrlsAll(url, limit=100):
    store_name = []
    urls = []
    limit_set = 1
    nextPage = True
    while nextPage and limit_set <= limit:
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        results = soup.find('div', class_='_1kXteagE')
        stores = results.find_all('div', class_='wQjYiB7z') 
        for store in stores:
            if store.find('a', class_ = '_15_ydu6b').text[0].isdigit(): # skip the sponsored ones (their title has no leading digit) since they also appear later in the listing
                
                print(store.find('a', class_ = '_15_ydu6b').text)
                unModifiedUrl = str(store.find('a', href=True)['href'])
                urls.append('https://www.tripadvisor.com'+unModifiedUrl)
        limit_set += 1
        #Go to next page if exists
        try:
            print('tried next in finding all')
            unModifiedUrl = str(soup.find('a', class_ = 'nav next rndBtn ui_button primary taLnk',href=True)['href'])
            # print(unModifiedUrl, 'later unmod')
            url = 'https://www.tripadvisor.com' + unModifiedUrl
            # print('new url is ', url)
        except:
            print('no next in finding all')
            nextPage = False

    with open(pathAllRestaurants, 'wb') as f:
        pickle.dump(urls, f)

    print(f'Total restaurant count: {len(urls)}')
    return urls
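
The sponsored filter in the function above relies on the listing title starting with a digit. A minimal sketch of that check in isolation is shown here. The helper name `is_ranked_listing` and the example titles (including the "1. Name" format) are assumptions made for illustration.

```python
def is_ranked_listing(title):
    """Return True if the listing title starts with a digit (a rank number)."""
    title = title.strip()
    return bool(title) and title[0].isdigit()


# Made-up titles; the exact format on the page is an assumption
titles = ["1. Maple Casual Dining", "Sponsored Place", "", "23. Keyser Social"]
ranked = [t for t in titles if is_ranked_listing(t)]
print(ranked)
```

Unlike indexing with `text[0]` directly, this version does not fail on an empty title.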

#### Scrape the restaurants info

The restaurants' info is scraped next. The most applicable data was retrieved and stored in a separate csv file. Here the beautifulsoup package was sufficient to retrieve the data needed. Again there are many try: except: clauses in the code, since information is missing for many of the restaurants.

In [None]:
def scrapeRestaurantInfo(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    storeName = soup.find('h1', class_='_3a1XQ88S').text
    try:
        avgRating = soup.find('span', class_='r2Cf69qf').text.strip()
        nrReviews = soup.find('a', class_='_10Iv7dOs').text.strip().split()[0]
    except:
        avgRating = None
        nrReviews = 0
    storeAddress = soup.find('div', class_= '_2vbD36Hr _36TL14Jn').find('span', class_='_2saB_OSe').text.strip()
#     urlAddress = str(soup.find('div', class_ = '_2vbD36Hr _36TL14Jn').find('span').find('a', href=True)['href'])
    
    try:
        cousineType = [word.text for  word in soup.find('span', class_='_13OzAOXO _34GKdBMV').find_all('a')]
        cousine = True
    except:
        cousineType = []
        cousine = False
    nrPos = soup.find('a', class_='_15QfMZ2L').find('b').find('span').text.strip()
    
    # Other rankings 
    all_ranks = []
    try:
        all_ranks = [word.text for word in soup.find('div', class_ = '_3acGlZjD').find_all('div', class_ = '_3-W4EexF')]
    except:
        all_ranks = []
        
    # Other ratings
    all_ratings = []
    try:
        rating = soup.find_all('div', class_='jT_QMHn2')
        rating_type = [x.find('span', class_ = '_2vS3p6SS').text for x in rating]
        true_rating = [x.find('span', class_ = '_377onWB-') for x in rating]
        true_rating = [int(str(x.findChildren('span')).split('_')[3][:2])/10 for x in true_rating]
        all_ratings = list(zip(rating_type,true_rating))
    except:
        all_ratings = []
        
    with open(restaurantInfo, mode='a', encoding="utf-8") as trip:
        data_writer = csv.writer(trip, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
        if len(cousineType) > 1:
            data_writer.writerow([storeName, storeAddress, avgRating, nrReviews, cousineType[0], cousineType[1:], nrPos, all_ranks, all_ratings])
        elif len(cousineType) == 1:
            data_writer.writerow([storeName, storeAddress, avgRating, nrReviews, cousineType[0], [], nrPos, all_ranks, all_ratings])
        else:
            data_writer.writerow([storeName, storeAddress, avgRating, nrReviews, [], [], nrPos, all_ranks, all_ratings])
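
The expression that converts the rating markup to a number in the function above is compact and hard to read. Judging from how that expression splits the string, the rating seems to be encoded in a class name ending in an underscore and two digits, such as `bubble_45` for 4.5. That markup format is an assumption inferred from the code, not something verified against the live page. Under that assumption a regular expression gives a more explicit version (the helper name `parse_bubble_rating` is hypothetical):

```python
import re


def parse_bubble_rating(markup):
    """Extract a rating such as 4.5 from markup containing e.g. 'bubble_45'."""
    match = re.search(r"bubble_(\d{2})", markup)
    if match is None:
        return None
    return int(match.group(1)) / 10


example = '<span class="ui_bubble_rating bubble_45"></span>'  # assumed format
print(parse_bubble_rating(example))
print(parse_bubble_rating("<span></span>"))
```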

#### Get the info of the reviewer

Here selenium came to the rescue, as it was necessary to click buttons on the reviewer's own page. The info stored here is gathered mainly in the hope of capturing the connections between reviewers and restaurants. Tripadvisor has a community of reviewers who can follow each other as on social platforms. That information is gathered along with the total number of reviews and "upvotes" the reviewer gives. The hope is to shed light on the influence of specific reviewers and the value it could add to restaurants. Detecting bad or fraudulent reviews is hopefully also possible with the data at hand. The most frequently available data is the location and the join date of the reviewer. This information is quite important since a network can be created based on those attributes.

In [None]:
# Get all the reviewer info: location, join date, review count, upvotes, followers and following.
def reviewerInfo(url):
    username = url
    full_url = f"https://www.tripadvisor.com/Profile/{url}"
    driver.get(full_url)
    time.sleep(1)

    # Get Intro info, location and join date
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    try:
        location = soup.find('span', class_ = "_2VknwlEe _3J15flPT default").text
    except:
        location = None

    try:
        joined = soup.find('span', class_ = "_1CdMKu4t").text
    except:
        joined = None


    all_links = soup.find_all('div', class_ = '_1aVEDY08')
    # link = driver.find_elements_by_xpath("//div[@class='nkw-3XeH']/div[1]/span[2]/a")

    # # Get the contributions info
    nrContributions = int(str(all_links[0].text).split()[1])
    if nrContributions > 0:
        link = driver.find_elements_by_xpath("//div[@class='nkw-3XeH']/div[1]/span[2]/a")
        driver.execute_script("arguments[0].click();", link[0])
        time.sleep(1)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        nrReviews = int(str(soup.find('span', class_ = 'ui_icon pencil-paper _1LSVmZLi').parent.text).split()[0])
        try:
            nrUpvotes = int(str(soup.find('span', class_ ='ui_icon thumbs-up _1LSVmZLi _3zmXi7gU').parent.text).split()[0])
        except:
            nrUpvotes = 0
        close = driver.find_elements_by_xpath("//div[@class='_2EFRp_bb _9Wi4Mpeb']")
        driver.execute_script("arguments[0].click();", close[0])
    else:
        nrReviews = 0
        nrUpvotes = 0

    # Get Followers
    nrFollowers = int(str(all_links[1].text).split()[1])
    if nrFollowers > 0:
        link = driver.find_elements_by_xpath("//div[@class='nkw-3XeH']/div[2]/span[2]/a")
        driver.execute_script("arguments[0].click();", link[0])
        time.sleep(1)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        followers = [word.text for word in soup.find('div', class_='_1caczhWN').find_all('span', class_='gf69u3Nd')]
        close = driver.find_elements_by_xpath("//div[@class='_2EFRp_bb _9Wi4Mpeb']")
        driver.execute_script("arguments[0].click();", close[0])
    else:
        followers = []

    # Get all following
    nrFollowing = int(str(all_links[2].text).split()[1])
    if nrFollowing > 0:
        link = driver.find_elements_by_xpath("//div[@class='nkw-3XeH']/div[3]/span[2]/a")
        driver.execute_script("arguments[0].click();", link[0])
        time.sleep(1)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        following = [word.text for word in soup.find('div', class_='_1caczhWN').find_all('span', class_='gf69u3Nd')]
        close = driver.find_elements_by_xpath("//div[@class='_2EFRp_bb _9Wi4Mpeb']")
        driver.execute_script("arguments[0].click();", close[0])
    else:
        following = []

    with open(pathtoReviewers, mode='a', encoding="utf-8") as reviewer_data:
        data_writer = csv.writer(reviewer_data, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
        data_writer.writerow([username, location, joined, nrContributions,nrReviews, nrUpvotes, nrFollowers, followers, nrFollowing,following])

#### Initialise files and get the data

The data is gathered by first getting all urls and then looping through them and scraping the information. First, initialise the files with the relevant column names.

In [None]:
pathToReviews = "reviews.csv"
restaurantInfo = "restaurantInfo.csv"
pathAllRestaurants = "AllRestaurants.txt"

with open(restaurantInfo, mode='a', encoding="utf-8") as trip:
    data_writer = csv.writer(trip, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
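    # The header lists all nine fields that scrapeRestaurantInfo writes per row
    data_writer.writerow(['storeName', 'storeAddress', 'avgRating', 'nrReviews', 'priceCategory','CousineType', 'Rank', 'all_ranks', 'all_ratings'])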
#webDriver init

with open(pathToReviews, mode='a', encoding="utf-8") as trip_data:
    data_writer = csv.writer(trip_data, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
    data_writer.writerow(['storeName', 'reviewerUsername', 'ratingDate', 'reviewHeader','reviewText', 'rating'])
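
Because the files above are opened in append mode, running the initialisation cell more than once writes the header row again in the middle of the data. A possible guard, sketched here with a hypothetical helper `init_csv`, writes the header only when the file does not exist yet or is empty. The demo uses a temporary file so no real data file is touched.

```python
import csv
import os
import tempfile


def init_csv(path, header):
    """Write the header row only if the file is missing or empty."""
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return False
    with open(path, mode="w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerow(header)
    return True


# Demo in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "reviews_demo.csv")
header = ["storeName", "reviewerUsername", "ratingDate",
          "reviewHeader", "reviewText", "rating"]
first = init_csv(demo_path, header)   # creates the file
second = init_csv(demo_path, header)  # leaves the existing file untouched

with open(demo_path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
print(first, second, len(rows))
```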

#### Getting the reviews
Scrape all of the reviews and keep track of which urls are finished and which are left.

In [None]:
with open("urls_left.txt", "rb") as f:   # Unpickling
    urls_left = pickle.load(f)

# initialize the selenium driver    
driver_path = f'../scraper/{os.getcwd()}/chromedriver'
driver = webdriver.Chrome(driver_path)
# On the very first run there is no saved progress yet, so start from the full list instead:
# urls_left = urls.copy()

# Do this in steps of 100 restaurants since the code takes multiple hours to run
finished = []
bad_url = []
next_100 = urls_left[0:100]
for url in next_100:
    try:
        get_reviews(url)
        finished.append(url)
    except:
        bad_url.append(url)

Remove the urls that were scraped successfully. Take a look at those that failed and try to fix what was missing or caused the bad retrieval.

In [None]:
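# Keep only the urls that have not been scraped successfully
finished_set = set(finished)
urls_left = [url for url in urls_left if url not in finished_set]

# Store the remaining urls so the next batch can continue from here
with open("urls_left.txt", "wb") as f:
    pickle.dump(urls_left, f)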
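
The batch procedure above can be summarised in one function. This is a sketch under assumptions: `process_batch` and `fake_scraper` are hypothetical names, and `fake_scraper` stands in for `get_reviews` so the example runs without a browser.

```python
def process_batch(urls_left, scrape, batch_size=100):
    """Scrape the next batch and return (remaining_urls, failed_urls)."""
    finished, failed = set(), []
    for url in urls_left[:batch_size]:
        try:
            scrape(url)
            finished.add(url)
        except Exception:
            failed.append(url)
    # Failed urls stay in the remaining list so they can be retried later
    remaining = [url for url in urls_left if url not in finished]
    return remaining, failed


def fake_scraper(url):
    """Stand-in for get_reviews: fails for urls containing 'bad'."""
    if "bad" in url:
        raise ValueError("could not parse page")


urls_demo = ["a", "bad-b", "c", "d"]
remaining, failed = process_batch(urls_demo, fake_scraper, batch_size=3)
print(remaining, failed)
```

Filtering the list instead of deleting by index avoids the shifting positions that occur when items are removed one at a time.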

#### Getting restaurant info

Here we run a function that fetches the web page with a request. The html parser from beautifulsoup is then used, and the needed tags are located and stored.

In [None]:
with open("AllRestaurants.txt", "rb") as f:   # Unpickling
    urls = pickle.load(f)

bad_url = []
for url in urls:
    try:
        scrapeRestaurantInfo(url)
    except:
        bad_url.append(url)

Here bad_url contained restaurants with no reviews, which were thus deemed redundant for our dataset.

#### Getting reviewer info
Scrape the reviewers' info after the reviewers file has been created from the dataframe.

In [None]:
driver_path = f'{os.getcwd()}/chromedriver'
driver = webdriver.Chrome(driver_path)

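with open("reviewers.txt", "r") as f:
    # Strip the trailing newline from each username and skip empty lines
    reviewers = [line.strip() for line in f if line.strip()]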

for reviewer in reviewers:
    reviewerInfo(reviewer)

#### Merging csv files

In order to gather the data in a shorter time, multiple environments were run in parallel to gather the data simultaneously. The resulting csv files are then merged into one.

In [None]:
import csv

# Combine the partial review files; only the first file keeps its header row
input_paths = ["Data/reviews1.csv", "Data/reviews2.csv", "Data/reviews22.csv"]

with open("Data/reviews.csv", "w", newline="", encoding="utf-8") as f_out:
    writer = csv.writer(f_out)
    for i, path in enumerate(input_paths):
        with open(path, newline="", encoding="utf-8") as f_in:
            reader = csv.reader(f_in)
            if i > 0:
                next(reader, None)  # skip the header line of the later files
            writer.writerows(reader)

 ![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

<a id="datasets"></a>
# 2. The datasets and data preparation

As mentioned in the scraper section, the tool is used to iteratively go through every "next" page of a specific Tripadvisor restaurant and gather the wanted information from its reviews. The restaurants' info is scraped separately, and lastly the reviewers' information is gathered after all the reviews have been collected.


The data is saved into three files:
1. **restaurantInfo.csv**: Contains information about each restaurant.
2. **reviews.csv**: Contains the reviews for each restaurant.
3. **reviewerInfo.csv**: Contains information about each reviewer.

The raw data is gathered from the csv files and cleaned and prepared for further analysis.

Additionally some datasets were downloaded from the internet. These include shapefiles of Denmark with municipality division.

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

<a id="restaurantInfo"></a>
## 2.1 restaurantInfo.csv

**restaurantInfo.csv** contains information about each restaurant, namely:
* Restaurant name
* Address
* Average rating
* Number of reviews
* Price category
* List of cuisine types the restaurant offers
* Rank

Let's examine how the data looks by loading the **restaurantInfo.csv** into a pandas dataframe.

Some restaurants offer several types of food, so the **CousineType** column contains a list for each restaurant. These lists are stored as strings in the csv file, which is why a converter is needed to read them in correctly.

In [3]:
restaurants = pd.read_csv('Data/restaurantInfo.csv')
restaurants.head()

Unnamed: 0,storeName,storeAddress,avgRating,nrReviews,priceCategory,CousineType,Rank,all_ranks,all_ratings
0,Krogs Fish Restaurant,"Gammel Strand 38, Copenhagen 1202 Denmark",4.5,329,$$$$,"['Seafood', 'European', 'Scandinavian']",#113,"['#8 of 82 Seafood in Copenhagen', '#113 of 1,...","[('Food', 4.5), ('Service', 4.5), ('Value', 4.0)]"
1,Maple Casual Dining,"Vesterbrogade 24, Copenhagen 1620 Denmark",5.0,237,$$ - $$$,"['International', 'European', 'Vegetarian Frie...",#1,"['#1 of 95 International in Copenhagen', '#1 o...","[('Food', 5.0), ('Service', 5.0), ('Value', 4.5)]"
2,Keyser Social,"Frederiksborggade 20d, Copenhagen 1360 Denmark",5.0,125,$$$$,"['Asian', 'Thai', 'Vegetarian Friendly']",#2,"['#1 of 226 Asian in Copenhagen', '#2 of 1,971...","[('Food', 5.0), ('Service', 5.0), ('Value', 5.0)]"
3,Restaurant Krebsegaarden,"Studiestraede 17, Copenhagen 1455 Denmark",5.0,1403,$$$$,"['European', 'Scandinavian', 'Danish']",#3,"['#2 of 840 European in Copenhagen', '#3 of 1,...","[('Food', 5.0), ('Service', 5.0), ('Value', 4...."
4,The Olive Kitchen & Bar,"Noerregade 22, Copenhagen 1165 Denmark",5.0,2413,$$ - $$$,"['International', 'European', 'Gluten Free Opt...",#4,"['#2 of 95 International in Copenhagen', '#4 o...","[('Food', 5.0), ('Service', 5.0), ('Value', 4.5)]"
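
The list-valued columns shown above are stored as plain strings in the csv file. One way to turn them back into Python lists is `ast.literal_eval`. The helper below is a sketch with a hypothetical name, `parse_list_cell`, and it returns an empty list for cells that cannot be parsed. It could be passed to `pd.read_csv` through its `converters` argument.

```python
import ast


def parse_list_cell(cell):
    """Convert a string such as "['Seafood', 'European']" into a list."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return list(value) if isinstance(value, (list, tuple)) else []


print(parse_list_cell("['Seafood', 'European', 'Scandinavian']"))
print(parse_list_cell("[('Food', 4.5), ('Service', 4.5)]"))
print(parse_list_cell("CousineType"))  # e.g. a stray header row
print(parse_list_cell(""))             # empty cell

# Possible use when loading the data (assumes pandas is imported as pd):
# restaurants = pd.read_csv('Data/restaurantInfo.csv',
#                           converters={'CousineType': parse_list_cell})
```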


The **priceCategory** column appears to be displayed in a weird way or not to show all the data. However, when a single row is printed, the correct format of the column can be seen.

In [4]:
restaurants.iloc[0]

storeName                                    Krogs Fish Restaurant
storeAddress             Gammel Strand 38, Copenhagen 1202 Denmark
avgRating                                                      4.5
nrReviews                                                      329
priceCategory                                                 $$$$
CousineType                ['Seafood', 'European', 'Scandinavian']
Rank                                                          #113
all_ranks        ['#8 of 82 Seafood in Copenhagen', '#113 of 1,...
all_ratings      [('Food', 4.5), ('Service', 4.5), ('Value', 4.0)]
Name: 0, dtype: object

### Data Cleaning 🧹 
A clean-up needs to be performed before the dataset is used for further analysis. Since '$' marks the start and end of math text in Matplotlib, strings such as '$$' are not displayed correctly, so the **priceCategory** column has to be mapped to something else. A mapping to numbers was created, since they are a convenient way to represent the ordered price levels.

Additionally the **Rank** column is modified in such a way that the '#' symbol is removed, so the column can be converted from a string to an integer.

In [6]:
# Since the pandas profiler cannot display strings with '$$', it is necessary to map the price categories differently
restaurants.priceCategory = restaurants.priceCategory.map({'$': 1, '$ - $$': 1.5, '$$': 2, '$$ - $$$': 2.5, '$$$': 3, '$$$ - $$$$': 3.5, '$$$$': 4, '$$$$ - $$$$$': 4.5, '$$$$$': 5})

# Drop stray header rows left in the data; they made the conversion below fail with
# "ValueError: invalid literal for int() with base 10: 'Rank'"
restaurants = restaurants[restaurants.Rank != 'Rank'].reset_index(drop=True)

# Remove the '#' in front of the Rank column and convert it to an integer
restaurants.Rank = restaurants.Rank.apply(lambda x: int(x.replace('#','')))

Finally, it would be beneficial to get the latitude and longitude coordinates of each restaurant, so that it can easily be plotted on a map and used for further analysis. This can be achieved by sending a query string to the OpenStreetMap search service (Nominatim). Since some of the addresses are in a weird format, the name of the restaurant is used first in the search query (with 'denmark' added at the end to clarify the search). If the location can be found from the restaurant name, that result is used. Otherwise the address is used to look up the latitude and longitude.

In [None]:
# Iterate through all the rows of the dataset and gather the lat and lon info into vectors
# First we use the lat/lon info found from the restaurant name since that could be considered more accurate 
# since users have to label the restaurant on a map. Otherwise the info from the address is used.
# Finally we collect the display name, to be able to extract the municipality information

# Create vectors to store information
lats = []
lons = []
displayNames = []

for idx, row in restaurants.iterrows():
    
    lat = None
    lon = None
    displayName = None
    address = row.storeAddress
    name = row.storeName
    url1 = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(name + " denmark") +'?format=json'
    url2 = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(address) +'?format=json'

    # Make the first request based on the name
    response1 = requests.get(url1).json()
    
    if response1:
        lat = response1[0]["lat"]
        lon = response1[0]["lon"]
        displayName = response1[0]["display_name"]
    else:
        # make the second request based on the address
        response2 = requests.get(url2).json()
        if response2:
            lat = response2[0]["lat"]
            lon = response2[0]["lon"]
            displayName = response2[0]["display_name"]
    
    # Append the info we gathered into the vectors
    lats.append(lat)
    lons.append(lon)
    displayNames.append(displayName)
    
# Add the vectors to the dataset
restaurants['lat'] = lats
restaurants['lon'] = lons
restaurants['location'] = displayNames
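
The name-first, address-second lookup above can be isolated in a small function. The sketch below uses the query form of the Nominatim search endpoint (`/search?q=...&format=json`) and only the standard library. The function names, the `fetch` parameter and the User-Agent string are assumptions made for illustration. The public Nominatim service asks clients to identify themselves and to send at most one request per second, which is why a pause is included. The demo at the end uses a fake fetch function with made-up coordinates, so no network call is made.

```python
import json
import time
import urllib.parse
import urllib.request

SEARCH_URL = "https://nominatim.openstreetmap.org/search"


def fetch_json(query):
    """Query Nominatim and return the decoded json list of matches."""
    url = SEARCH_URL + "?" + urllib.parse.urlencode({"q": query, "format": "json"})
    # Hypothetical User-Agent; replace it with your own application name
    request = urllib.request.Request(url, headers={"User-Agent": "restaurant-notebook-demo"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def geocode(name, address, fetch=fetch_json, pause=1.0):
    """Return (lat, lon, display_name), trying the name first, then the address."""
    for query in (name + " denmark", address):
        results = fetch(query)
        time.sleep(pause)  # stay within the one request per second guideline
        if results:
            best = results[0]
            return float(best["lat"]), float(best["lon"]), best["display_name"]
    return None, None, None


def fake_fetch(query):
    """Offline stand-in for fetch_json that only knows one address."""
    if query.startswith("Gammel Strand"):
        return [{"lat": "55.0", "lon": "12.0", "display_name": "demo place"}]
    return []


print(geocode("Unknown Restaurant", "Gammel Strand 38, Copenhagen",
              fetch=fake_fetch, pause=0))
```

Converting the coordinates to `float` right away is a design choice: the service returns them as strings, and numeric values are easier to plot later.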

After this data preparation the information about a single restaurant appears like so:

In [None]:
restaurants.iloc[0]

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

<a id="reviews"></a>
## 2.2 reviews.csv

**reviews.csv** contains information about each review for every restaurant. Namely:
* Restaurant name
* Reviewer's username
* Date of rating
* Header of review
* Review text
* User's rating

Let's examine how the data looks:

In [None]:
reviews = pd.read_csv('reviews.csv')
reviews.head()

The above dataset appears ready for use, however the **ratingDate** is presented as a string. 

In [None]:
print(reviews.iloc[0].ratingDate)
print(type(reviews.iloc[0].ratingDate))

For further analysis, it will be beneficial to convert this column into a timestamp object.

In [None]:
reviews['ratingDate'] = reviews['ratingDate'].apply(lambda x: datetime.strptime(x, '%B %d, %Y'))
print(reviews.iloc[0].ratingDate)
print(type(reviews.iloc[0].ratingDate))
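
The conversion above raises an error as soon as one date string does not match the expected format. A more forgiving variant, sketched here with the hypothetical helper `parse_rating_date`, returns `None` for such values so they can be inspected afterwards. The example strings are made up.

```python
from datetime import datetime


def parse_rating_date(text, fmt="%B %d, %Y"):
    """Parse a date such as 'March 5, 2021'; return None if it does not match."""
    try:
        return datetime.strptime(text.strip(), fmt)
    except (ValueError, AttributeError):
        return None


print(parse_rating_date("March 5, 2021"))
print(parse_rating_date("yesterday"))
print(parse_rating_date(None))
```

In pandas, `pd.to_datetime(reviews['ratingDate'], format='%B %d, %Y', errors='coerce')` gives a comparable result for the whole column, with unparseable values becoming `NaT`.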

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

<a id="shapefiles"></a>
## 2.3 Shapefiles
Shapefiles for the whole of Denmark were downloaded from [here](https://www.diva-gis.org/datadown).
The files include information about the municipalities and their division within Denmark.
These can be used later on when performing analysis that require information about different areas of Denmark.

The first step is to load and examine the dataset

In [None]:
# Load the shapefile with the municipality division
shp = 'DNK_adm/DNK_adm2.shp'
gdf = gpd.read_file(shp)

gdf.head()

Any analysis performed will be focused on the capital region, so the shapefile is filtered on *Hovedstaden* and some of the included islands are skipped so the plot will be as clear as possible. Each municipality will be plotted in a separate color based on its numerical ID in the shapefile.

In [None]:
# Filter on the areas we want displayed
hovedstaden = gdf[(gdf.ID_1==1) & (gdf.NAME_2 != 'Bornholm') & (gdf.NAME_2 != 'Christiansø') & (gdf.NAME_2 != 'Halsnæs')]

# Plot the shapefiles
fig, ax = plt.subplots(1, figsize=(14, 8));
hovedstaden.plot(column='ID_2', cmap='tab20b', linewidth=0.8, ax=ax, edgecolor='black', legend=True);
ax.axis('off');
ax.set_title('Denmark', fontsize=16);

 ![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

<a id="descStats"></a>
# 3. Descriptive stats
...

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

<a id="restaurants"></a>
## 3.1. Restaurants
...

In [None]:
# Generate a profile for the dataset and display it
restaurant_profile = ProfileReport(restaurants, title="Restaurants Info dataset", html={'style': {'full_width': True}});
restaurant_profile.to_notebook_iframe();

<font color='red'>**ADD DISCUSSION ON FINAL FINDINGS**</font>

From the profiler it can be seen that it does not handle a column containing a list. Therefore we do an additional check for the **CousineType** column.

In [None]:
CousineTypeFlat = [y for x in restaurants.CousineType for y in x]

# https://stackoverflow.com/questions/49017002/bar-plot-based-on-list-of-string-values
keys, counts = np.unique(CousineTypeFlat, return_counts=True)

counts, keys = zip(*sorted(zip(counts, keys), reverse=True))

plt.figure(figsize=(20,10))
plt.bar(keys, counts)
plt.xticks(rotation=45, ha='right')
plt.show()
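
The flattening and counting step above can also be written with `collections.Counter`. The sketch assumes that each entry is already a Python list (not the string read from the csv file), and the sample data is made up.

```python
from collections import Counter

# Made-up sample standing in for restaurants.CousineType
cuisine_lists = [
    ["Seafood", "European", "Scandinavian"],
    ["European", "Danish"],
    [],
    ["Asian", "European"],
]

# Count every cuisine across all restaurants, most frequent first
cuisine_counts = Counter(c for cuisines in cuisine_lists for c in cuisines)
keys, counts = zip(*cuisine_counts.most_common())
print(keys[0], counts[0])
```

The resulting `keys` and `counts` can be passed to `plt.bar` in the same way as before.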

In [None]:
# https://georgetsilva.github.io/posts/mapping-points-with-folium/
# Only restaurants with known coordinates can be placed on the map
located = restaurants.dropna(subset=['lat', 'lon'])

# Use a name other than `map` to avoid shadowing the Python built-in
restaurant_map = folium.Map(location=[55.7, 12.6], zoom_start=12)
for _, row in located.iterrows():
    folium.Marker([float(row.lat), float(row.lon)], popup=row.storeName).add_to(restaurant_map)
restaurant_map

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

<a id="reviews"></a>
## 3.2 Reviews


In [None]:
# Generate a profile for the dataset and display it
reviews_profile = ProfileReport(reviews, title="Reviews dataset", html={'style': {'full_width': True}});
reviews_profile.to_notebook_iframe();

 ![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

<a id='businessquestions'></a>
# 4. Business Questions


The business questions for the project are the following.

<font color='red'>**ADD any missing questions, regards Hlynur**</font>

1. Can we predict up-and-coming neighborhoods based on restaurant ratings or newly opened restaurants?
2. Does the advertised cuisine type of the restaurant represent what the reviewers really like?
3. Show the trend of restaurants and how ratings and restaurant availability have changed over time.
4. Provide insight into what makes a good and a bad review. What are the keywords, and can we filter them out to produce a report of improvement points for the restaurant?
5. Network analysis based on the connections between reviewers or restaurants.
6. Where should you place your restaurant in the city based on the surrounding restaurant types and cuisines? Is there space for a new pizza place or are there simply too many? Are there only low-rated pizza places in a neighbourhood, where your ambition will thrive?
7. Recommendation system based on similar reviewers' interests or similar restaurant characteristics.
8. Create a machine learning model to predict a review rating based on the review text.

<font color='red'>**ADD more detailed description of each business question**</font>

Hlynur/Kristín - 1, 2, 3, 6