## Overview

This report describes the steps taken to clean and combine the datasets of the second capstone project. The aim of the project is to predict the violations of restaurants in Allegheny County.

The datasets come from two sources:

- Allegheny County inspection dataset, which includes the types of violations found during restaurant inspections;
- Yelp datasets, which include information and text reviews of the restaurants.

In this project, we want to collect for each restaurant the information and reviews that preceded its inspection. The first source includes information about each inspection and the second source has the reviews. 

In this first part of the project, we examine and clean the inspection dataset from the Allegheny County source and the business dataset from the Yelp source (which only includes restaurant information, without the reviews). We then find the restaurants mentioned in both datasets, in order to combine the information provided by the two sources.

This report is divided as follows:
- we first load and describe each dataset, which we refer to as the violation and Yelp business datasets;
- we then examine the violation dataset by filling some missing entries, processing the names of facilities and counting the types of violation for each inspection;
- we also examine the Yelp business dataset and process the names of the restaurants;
- we finally find the common restaurants mentioned in both datasets by comparing their names and addresses.

## Description of the Violation and Yelp Business datasets

Here we load and describe the two datasets that we will be working with in this first part of the project.

### Violation Dataset

Allegheny County keeps a record of each inspection and of the violations found during it. The dataset is available from the website of the county. Let us examine the dataset and check its entries.

We first load the data.

In [1]:
import pandas as pd
import numpy as np

violations = pd.read_csv("violations.csv")

The data consists of the following columns:

In [2]:
violations.columns

Index(['encounter', 'id', 'placard_st', 'facility_name', 'bus_st_date',
       'description', 'description_new', 'num', 'street', 'city', 'state',
       'zip', 'inspect_dt', 'start_time', 'end_time', 'municipal', 'rating',
       'low', 'medium', 'high', 'url'],
      dtype='object')

Each row of this dataset describes one violation found during an inspection of a restaurant in Allegheny County. The "encounter" column contains the identification number of each inspection, and the "inspect_dt", "start_time" and "end_time" columns give the date, start time and end time of the inspection. The "description_new" column describes the violation found, and the "low", "medium" and "high" columns indicate the type of the violation. The inspected restaurant is described by the following columns: facility name ("facility_name"), business starting date ("bus_st_date"), restaurant category ("description") and address ("num", "street", "city", "state", "zip", "municipal"). Each restaurant is also identified by an identification number (ID) in the "id" column. Note that multiple rows can belong to the same inspection of a given restaurant, since each row corresponds to one violation and several violations can be found at the same restaurant during an inspection.

### Yelp Business Dataset

Two of the files provided in the Yelp datasets are the business and review files (JSON files). The business file includes information about the restaurants and associates each restaurant with an identification number. The review file contains the Yelp text reviews of each restaurant. Here we focus on the business dataset, and map the restaurants of the violation dataset to the restaurants in the Yelp dataset.

We load the business JSON file and keep only the restaurants located in Pennsylvania.

In [3]:
import json

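# the file holds one JSON record per line; the context manager
# closes the file once all records are parsed
with open('YELP/business.json', encoding="utf8") as f:
    business = [json.loads(line) for line in f]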
bus_df = pd.DataFrame(business)
bus_df = bus_df[bus_df.state == "PA"]
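
Since the business file is read one JSON record per line, pandas can also parse it directly. The following is a minimal sketch of that alternative, not the loading code used in this report; the two records and the names `sample`, `sample_df` and `sample_pa` are made up for illustration.

```python
import io
import pandas as pd

# Two made-up records standing in for the business file
# (one JSON object per line)
sample = io.StringIO(
    '{"business_id": "a1", "name": "Cafe One", "state": "PA"}\n'
    '{"business_id": "b2", "name": "Diner Two", "state": "NV"}\n'
)

# lines=True tells pandas to treat every line as a separate JSON record
sample_df = pd.read_json(sample, lines=True)

# keep only the Pennsylvania businesses, as in the cell above
sample_pa = sample_df[sample_df.state == "PA"]
```

With the real file, the `io.StringIO` object would be replaced by the path of the business file.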

The Yelp business dataset consists of the following columns:

In [4]:
bus_df.columns

Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'latitude', 'longitude', 'stars', 'review_count', 'is_open',
       'attributes', 'categories', 'hours'],
      dtype='object')

This dataset contains the name, address, attributes and categories of each restaurant, as well as the number of stars and reviews it received. Moreover, each restaurant in the Yelp dataset is associated with a business identification number (business ID). Our goal, once the violation and business datasets are processed, is to map the ID of each restaurant in the violation dataset to the business ID of the same restaurant in the Yelp dataset.

## Processing of the Violation Dataset

We now focus on cleaning the violation dataset. More specifically, we fill the missing addresses, group the data by inspection and count the number of each type of violations found in each inspection, and finally fix some of the inconsistencies in the names and IDs of the facilities inspected. 

### Filling the Missing Entries

The data contains some missing entries in the columns related to restaurants' information, namely: facility name, business starting date, description and addresses. The county website has a dataset for the facilities inspected that we can use here to fill the missing entries. More precisely, we extract from the dataset of facilities the missing information using the ID of the restaurants with missing entries.

We load the dataset of facilities.

In [5]:
facilities = pd.read_csv("facilities.csv")

We fill the missing entries of the corresponding columns using the facilities dataset.

In [6]:
cols_to_fill = ["facility_name", "bus_st_date", 'description',
                'num', 'street', 'city', 'state', 'zip', 'municipal']

for col_name in cols_to_fill:
    
    # get the IDs of facilities with empty entries in the column col_name
    ids_none = violations[violations[col_name].isnull()].id
    
    # get the info from the facilities dataset
    entries_none = facilities[(facilities.id.isin(list(ids_none)))][[
        "id", col_name]]
    entries = entries_none.set_index('id').T.to_dict(orient='list')
    
    # fill the missing entries
    for ind in ids_none.index:
        try:
            violations.at[ind, col_name] = entries[violations.at[ind, 'id']][0]
        except KeyError:
            pass
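
The same lookup-and-fill idea can be sketched without an explicit loop over rows. The example below is only an illustration on made-up data, assuming that the facilities table has one relevant record per ID; `viol_toy`, `fac_toy` and `lookup` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for the violations and facilities tables (made-up values)
viol_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "facility_name": ["Cafe One", None, None],
})
fac_toy = pd.DataFrame({
    "id": [1, 2],
    "facility_name": ["Cafe One", "Diner Two"],
})

# one value per facility ID; duplicated IDs keep their first record
lookup = fac_toy.drop_duplicates("id").set_index("id")["facility_name"]

# fill only the missing entries; IDs absent from the lookup stay missing
viol_toy["facility_name"] = viol_toy["facility_name"].fillna(
    viol_toy["id"].map(lookup))
```

Here the facility with ID 3 has no record in the toy facilities table, so its name stays missing, which mirrors the `KeyError` branch of the loop above.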

After using the facilities dataset, the "facility_name", "description" and "street" columns are completely filled.

In [7]:
print(sum(violations.description.isnull()))
print(sum(violations.facility_name.isnull()))
print(sum(violations.street.isnull()))

0
0
0


However, the "num" column, which holds the street number, still has some missing entries.

In [8]:
print(sum(violations.num.isnull()))

1443


By checking the street names and zip codes of the restaurants with an empty street number ('num'), we notice that most of them are located in Pittsburgh airport or in a mall: the airport or the name of the mall is given as the address in the 'street' column instead of the real address, and the 'num' column is left empty. We next fill some of these missing street numbers, and leave the remaining entries to be filled after we map the restaurants of both datasets.

We first look at the zip codes of the restaurants with empty street number.

In [9]:
violations[violations.num.isnull()].zip.value_counts()

15231.0    973
15213.0    100
15232.0     74
15236.0     36
15238.0     36
15146.0     35
15026.0     29
15056.0     27
15116.0     25
15102.0     19
15084.0     18
15223.0     14
15065.0     12
15045.0      9
15222.0      8
15104.0      7
15241.0      6
15240.0      4
15215.0      4
15147.0      3
15205.0      2
15108.0      1
15219.0      1
Name: zip, dtype: int64

We notice that most of the entries correspond to the zip code 15231. Let us check the street names for this zip code.

In [10]:
where_to_check = np.logical_and(
    violations.num.isnull(), violations.zip == 15231)
violations[where_to_check].street.value_counts()

Pgh Intl Airport / AC-17        55
Pgh Intl Airport / AC-2B        53
Pgh Intl Airport / AC-2D        29
Pgh Intl Airport / SE-1         28
Pgh Int'l Airport / AC-20A/B    28
                                ..
Pgh Intl Airport / NE-13         1
Pgh Int'l Airport / AC-14        1
Pgh Intl Airport / NE-9A         1
Pgh Intl Airport /  AC-1A        1
Pgh Intl Airport / AC-32A        1
Name: street, Length: 77, dtype: int64

We see that these restaurants are in Pittsburgh airport. To make the mapping between this dataset and the Yelp dataset easier, we change the address of the restaurants located in Pittsburgh airport to 1000 Airport Blvd.

In [11]:
# .loc (not .at) is the indexer that accepts a boolean mask
violations.loc[where_to_check, 'num'] = 1000
violations.loc[where_to_check, 'street'] = "Airport Blvd"

For the remaining restaurants with an empty "num", we manually fill some of the missing entries, focusing on the zip codes with the most missing entries and on the restaurants located in malls.

In [12]:
where_to_check = np.logical_and(
    violations.num.isnull(), violations.zip == 15213)
violations[where_to_check].facility_name.value_counts()

India on Wheels (YXN-3894) MFF4    63
Chakh Le India (YBV-7676) MFF4     37
Name: facility_name, dtype: int64

In [13]:
violations.loc[violations.facility_name ==
               "India on Wheels (YXN-3894) MFF4", "num"] = 4422
violations.loc[violations.facility_name ==
               "India on Wheels (YXN-3894) MFF4", "street"] = "Bigelow"
violations.loc[violations.facility_name ==
               "Chakh Le India (YBV-7676) MFF4", "num"] = 4341
violations.loc[violations.facility_name ==
               "Chakh Le India (YBV-7676) MFF4", "street"] = "Bigelow"

In [14]:
where_to_check = np.logical_and(violations.num.isnull(), violations.zip ==15236)
violations[where_to_check].facility_name.value_counts()

Little Caesars Pizza #145    32
GNC #5306                     4
Name: facility_name, dtype: int64

In [15]:
violations.loc[violations.facility_name == "Little Caesars Pizza #145", "num"] = 5301
violations.loc[violations.facility_name == "Little Caesars Pizza #145", "street"] = "Grove Rd"
violations.loc[violations.facility_name == "GNC #5306", "num"] = 5301
violations.loc[violations.facility_name == "GNC #5306", "street"] = "Grove Rd"

In [16]:
where_to_check = np.logical_and(violations.num.isnull(), violations.zip ==15146)
violations[where_to_check].facility_name.value_counts()

Gloria Jean's Coffees            24
Foxwood Park Forest Swim Club     7
Auntie Anne's Pretzels #PA287     4
Name: facility_name, dtype: int64

In [17]:
violations.loc[violations.facility_name == "Gloria Jean's Coffees", "num"] = 200
violations.loc[violations.facility_name ==
               "Gloria Jean's Coffees", "street"] = "Mall Blvd"
violations.loc[violations.facility_name ==
               "Auntie Anne's Pretzels #PA287", "num"] = 145
violations.loc[violations.facility_name ==
               "Auntie Anne's Pretzels #PA287", "street"] = "Mall Blvd"

In [18]:
where_to_check = np.logical_and(
    violations.num.isnull(), violations.zip == 15026)
violations[where_to_check].facility_name.value_counts()

Persin's Tavern    29
Name: facility_name, dtype: int64

In [19]:
violations.loc[violations.facility_name == "Persin's Tavern", "num"] = 1286

In [20]:
where_to_check = np.logical_and(
    violations.num.isnull(), violations.zip == 15056)
violations[where_to_check].facility_name.value_counts()

Giant Eagle Cafe #37                 17
GNC Distribution Center Warehouse     3
GNC #3710                             2
Henle Park Concession Stand           2
GetGo #3137                           2
AFC Sushi @ Giant Eagle #37           1
Name: facility_name, dtype: int64

In [21]:
names = ["Giant Eagle Cafe #37", "GetGo #3137",
         "GNC #3710", "AFC Sushi @ Giant Eagle #37"]
for name in names:
    violations.loc[violations.facility_name == name,
                   "street"] = "Quaker Village Shopping Ctr"
    violations.loc[violations.facility_name == name, "num"] = 1

We filled most of the missing "num" entries. The remaining ones are left empty for now and will be filled when we combine the violation dataset with the Yelp business dataset. Some entries are also missing in the "bus_st_date" column (business starting date); we likewise leave them empty until we combine the datasets, so that we only deal with the common restaurants.

In [22]:
sum(violations.bus_st_date.isnull())

113

In [23]:
# mark the remaining missing entries with the placeholder "none"
violations.loc[violations.bus_st_date.isnull(), 'bus_st_date'] = "none"
violations.loc[violations.num.isnull(), 'num'] = "none"

We finally remove the rows whose facility name is given as "test".

In [24]:
where_to_drop = violations[(violations.facility_name == "test")].index
violations = violations.drop(index=where_to_drop)

### Counting each Type of Violation per Inspection

Each row of the dataset represents one type of violation detected during an inspection. Since multiple types of violation could be detected during an inspection, multiple rows in the dataset can correspond to the same inspection. We now transform this dataset into a new one, where each row represents one inspection and includes the number of each type of violation detected.

We first divide the columns into two sets: one that contains the information of the restaurants inspected and another set that contains the type of each violation found. 

In [25]:
cols1 = ['encounter', 'id', 'placard_st', 'facility_name', 'bus_st_date',
         'description', 'num', 'street', 'city', 'state',
         'zip', 'inspect_dt', 'municipal']

cols2 = ['encounter', 'low', 'medium', 'high']

viol1 = violations[cols1]
viol2 = violations[cols2]

viol1 = viol1.drop_duplicates()

The "low", "medium" and "high" columns contain the flags "T" (true) and "F" (false). We map these values to 1 and 0 respectively, and treat missing values as 0, before summing them for each inspection.

In [26]:
viol2 = viol2.fillna(0)
viol2 = viol2.replace({'F': 0, 'T': 1})
viol2 = viol2.groupby(['encounter'])[['low', 'medium', 'high']].sum()
viol2 = viol2.reset_index()
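
To make the effect of this aggregation concrete, here is a small sketch on made-up rows. It uses a boolean comparison instead of `replace`, which is an alternative way of obtaining the same 0/1 flags; the data and the names `toy`, `flags` and `counts` are illustrative only.

```python
import pandas as pd

# Made-up rows: inspection 10 has two violations, inspection 11 has one
toy = pd.DataFrame({
    "encounter": [10, 10, 11],
    "low": ["T", "F", "F"],
    "medium": ["F", "T", None],
    "high": ["F", "F", "T"],
})

flags = ["low", "medium", "high"]
# "T" becomes 1; "F" and missing values become 0
toy[flags] = (toy[flags] == "T").astype(int)

# one row per inspection with the number of violations of each type
counts = toy.groupby("encounter")[flags].sum().reset_index()
```

In this toy example, inspection 10 ends up with one low and one medium violation, and inspection 11 with one high violation.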

We finally merge the two sets to obtain a new dataset, where each row corresponds to one inspection and the number of each type of violations found during the inspection.

In [27]:
violation = viol1.merge(viol2, on='encounter')

In [28]:
violation.shape

(55416, 16)

In the sequel, we only focus on this new violation dataset where each row represents one inspection.

### Processing the Names of facilities

While exploring the violation data, we noticed that some restaurants have multiple ID numbers. This can be seen by comparing the number of unique facility names with the number of unique IDs.

In [29]:
print(len(violation.facility_name.unique()))
print(len(violation.id.unique()))

10299
11325


The number of unique IDs is higher than the number of unique facility names. One reason for this discrepancy is that restaurants with more than one branch can all have the same name (for example, chain restaurants). Another reason is that the name of the same restaurant can be spelled differently by the inspectors. To address this issue, we next make the names of the facilities more consistent and then adjust the IDs of the restaurants that have multiple IDs.

We first modify the names of the restaurants by converting them to lowercase, removing apostrophes and hyphens, replacing "&" with "and", and replacing double spaces with a single space. These specific modifications were chosen after inspecting the facility names. This processing step will also be helpful when merging this dataset with the Yelp business dataset.

In [30]:
violation['facility_name'] = violation['facility_name'].str.lower()
violation['facility_name'] = violation['facility_name'].str.replace('&', 'and')
violation['facility_name'] = violation['facility_name'].str.replace('\'', '')
violation['facility_name'] = violation['facility_name'].str.replace('-', '')
violation["facility_name"] = violation.facility_name.str.replace("  ", " ")
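
The same clean-up steps can be written as a plain function on a single string, which makes them easy to check on individual names. This is a sketch for illustration; `normalize_name` is a hypothetical helper and the input strings are made up.

```python
def normalize_name(name):
    """Apply the same clean-up steps as the cell above to one string."""
    name = name.lower()
    name = name.replace("&", "and")
    name = name.replace("'", "")
    name = name.replace("-", "")
    # each non-overlapping pair of spaces is collapsed once, so runs of
    # three or more spaces are shortened but not fully collapsed
    name = name.replace("  ", " ")
    return name


example_1 = normalize_name("Eat'n Park  #3")
example_2 = normalize_name("Ben & Jerry's")
```

The two examples evaluate to `"eatn park #3"` and `"ben and jerrys"`.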

After modifying the names of the restaurants, the next step is to check the restaurants that have same name and address but different IDs. To make sure we have consistent addresses, we modify the street names by removing any double spaces and replacing the types of streets with their abbreviations.

In [31]:
violation["street"] = violation.street.str.replace("  "," ")
violation["street"] = violation.street.str.replace("Road","Rd")
violation["street"] = violation.street.str.replace("Avenue","Ave")
violation["street"] = violation.street.str.replace("Street","St")
violation["street"] = violation.street.str.replace("Boulevard","Blvd")
violation["street"] = violation.street.str.replace("William","Wm")

We now extract the street name from the "street" column, dropping the direction symbol ("N", "E", "W", "S") if present. This is done because the same restaurant might be given two addresses whose only difference is the presence or absence of the direction symbol.

In [32]:
# this function extracts the street name
def extract_street(street_name):
    # split the strings
    streets = street_name.split(' ')
    if (len(streets) > 1):
        # if the street name starts with
        # a direction's character
        # extract the next string
        if (len(streets[0]) == 1):
            return streets[1]
    return streets[0]

# make a column "st" for the street name
violation['st'] = violation["street"].apply(extract_street)
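
To illustrate what the function returns, the sketch below repeats it in compact form and applies it to a few made-up street strings. Note that only the first word of the street name is kept, which is enough for the comparisons that follow.

```python
def extract_street(street_name):
    parts = street_name.split(' ')
    # skip a leading one-letter token such as "N" or "E"
    if len(parts) > 1 and len(parts[0]) == 1:
        return parts[1]
    return parts[0]


# made-up street strings for illustration
samples = ["N Craig St", "Beaver St", "Airport Blvd", "Bigelow"]
street_keys = [extract_street(s) for s in samples]
```

Here `street_keys` is `["Craig", "Beaver", "Airport", "Bigelow"]`.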

We now check the facilities that have the same name and address but with different IDs. After we find these facilities, we adjust their IDs and business starting dates, so that they only have one ID and one starting date.

In [33]:
# group the violation by facility name, street number and street name
# compute for each group the number of unique IDs
fnames = violation.groupby(['facility_name', 'num', 'st'])['id'].nunique()
# extract restaurants with more than one ID
fnames = fnames[fnames > 1]
fnames = fnames.reset_index().drop(columns='id')

# find the minimum ID for each group
fnames_id = violation.groupby(['facility_name', 'num', 'st'])['id'].min()
# find the minimum business starting date for each group
fnames_date = violation.groupby(['facility_name', 'num', 'st'])[
    'bus_st_date'].min()

# assign the minimum ID and business starting dates for 
# restaurants with more than one ID
fnames_tot = (fnames.merge(fnames_id.reset_index())
              ).merge(fnames_date.reset_index())
fnames_tot = fnames_tot.rename(
    columns={'id': 'id_2', 'bus_st_date': 'bus_st_date_2'})
violation = violation.merge(fnames_tot, how='left')
# rows that belong to a group with several IDs have a non-null id_2
has_new_id = violation['id_2'].notnull()
violation.loc[has_new_id, 'id'] = violation.loc[has_new_id, 'id_2']
violation.loc[has_new_id, 'bus_st_date'] = violation.loc[has_new_id, 'bus_st_date_2']

In [34]:
violation = violation.drop(columns=['bus_st_date_2', 'id_2'])
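
The idea of assigning the smallest ID and the earliest starting date to every row of a group can also be sketched with `groupby(...).transform`. This is an alternative shown on made-up data, not the code used above; unlike the cell above, it recomputes the values for every group, including groups that already have a single ID. The date format in the toy data is an assumption.

```python
import pandas as pd

# Made-up inspections: the first two rows are the same restaurant
# entered under two different IDs
toy = pd.DataFrame({
    "facility_name": ["cafe one", "cafe one", "diner two"],
    "num": ["424", "424", "10"],
    "st": ["Beaver", "Beaver", "Main"],
    "id": [501, 207, 900],
    "bus_st_date": ["2015-03-01", "2009-06-15", "2012-01-01"],
})

keys = ["facility_name", "num", "st"]
# transform returns one value per original row, so no merge is needed
toy["id"] = toy.groupby(keys)["id"].transform("min")
toy["bus_st_date"] = toy.groupby(keys)["bus_st_date"].transform("min")
```

After this step, both "cafe one" rows carry the ID 207 and the starting date "2009-06-15".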

To find restaurants entered with different IDs, we looked at the names, street numbers and street names. However, we also noticed that different street numbers are sometimes entered for the same restaurant (for instance, 428 Beaver St instead of 424 Beaver St). This is another reason why some restaurants have multiple IDs. We next find the restaurants with the same name, street name and zip code but different IDs, and then check their street numbers. If their street numbers are close, we adjust the IDs and the business starting dates of the restaurant.

We first define a function that extracts the first portion of the street number, because some street numbers are followed by a letter.

In [35]:
# this function extracts the street number
# in case the number is followed by a letter

def extract_num(number):
    nums = str(number).split(' ')
    return nums[0]

We group the restaurants by name, street name and zip code. We then extract the groups with several IDs, check their street numbers and keep only those whose street numbers are close.

In [36]:
# extract the street number
violation["num_st"] = violation.num.apply(extract_num)

# group the violation by facility name, street name and zip code
# compute for each group the number of unique IDs
fnames = violation.groupby(['facility_name', 'st', 'zip'])['id'].nunique()
# extract restaurants with more than one ID
fnames = fnames[fnames > 1]
fnames = fnames.reset_index().drop(columns='id')

# find the minimum ID for each group
fnames_id = violation.groupby(['facility_name', 'st', 'zip'])['id'].min()
# find the minimum business starting date for each group
fnames_date = violation.groupby(['facility_name', 'st', 'zip'])[
    'bus_st_date'].min()
# find the maximum street number
fnames_num_max = violation.groupby(['facility_name', 'st', 'zip'])[
    'num_st'].max()
fnames_num_max = fnames_num_max.reset_index().rename(
    columns={'num_st': 'num_max'})
# find the minimum street number
fnames_num_min = violation.groupby(['facility_name', 'st', 'zip'])[
    'num_st'].min()
fnames_num_min = fnames_num_min.reset_index().rename(
    columns={'num_st': 'num_min'})

# assign the minimum ID and business starting dates for 
# restaurants with more than one ID and that have close 
# street numbers
fnames_tot = (fnames.merge(fnames_id.reset_index())
              ).merge(fnames_date.reset_index())
fnames_tot = (fnames_tot.merge(fnames_num_min)).merge(fnames_num_max)
fnames_tot = fnames_tot[fnames_tot.num_max.astype(
    int)-fnames_tot.num_min.astype(int) < 1500]
fnames_tot = fnames_tot.rename(
    columns={'id': 'id_2', 'bus_st_date': 'bus_st_date_2'})

violation = violation.merge(fnames_tot, how='left')

# rows that belong to a selected group have a non-null id_2
has_new_id = violation['id_2'].notnull()
violation.loc[has_new_id, 'id'] = violation.loc[has_new_id, 'id_2']
violation.loc[has_new_id, 'bus_st_date'] = violation.loc[has_new_id, 'bus_st_date_2']

violation = violation.drop(
    columns=['bus_st_date_2', 'id_2', 'num_st', 'num_max', 'num_min'])
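
Since `num_st` holds strings, the `min` and `max` taken per group compare them as text, so that, for example, "99" is ordered after "4100". A possible variant, sketched here on made-up data under the assumption that the numeric order is the one of interest, converts the street numbers to numbers before measuring how far apart they are. The names `toy` and `spread` are hypothetical.

```python
import pandas as pd

# Made-up addresses: two close numbers for "cafe one",
# two distant numbers for "diner two"
toy = pd.DataFrame({
    "facility_name": ["cafe one", "cafe one", "diner two", "diner two"],
    "st": ["Beaver", "Beaver", "Main", "Main"],
    "zip": [15143, 15143, 15222, 15222],
    "num": ["424", "428 A", "99", "4100"],
})

# first token of the street number, converted to a number;
# entries that are not numeric (e.g. "none") become NaN
toy["num_st"] = pd.to_numeric(
    toy["num"].astype(str).str.split(" ").str[0], errors="coerce")

keys = ["facility_name", "st", "zip"]
spread = toy.groupby(keys)["num_st"].agg(["min", "max"]).reset_index()
# same threshold as in the cell above
spread["close"] = (spread["max"] - spread["min"]) < 1500
```

In this toy example only "cafe one" is flagged as having close street numbers.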

We adjusted the information of the restaurants by making their names and addresses more consistent. We then identified the restaurants that have several IDs and adjusted their IDs, focusing primarily on the names of the facilities and their addresses. However, it is also possible that different names were entered for the same facility. We will address this issue when we extract more information about the facilities using the Yelp dataset.

## Processing of the Yelp Business Dataset

We now process the name and address of each restaurant in the Yelp business dataset, as we did earlier with the violation dataset, in order to make the mapping of the restaurants between the two datasets easier.

In [37]:
bus_df['name'] = bus_df['name'].str.lower()
bus_df['name'] = bus_df['name'].str.replace('&', 'and')
bus_df['name'] = bus_df['name'].str.replace('\'', '')
bus_df['name'] = bus_df['name'].str.replace('-', '')
bus_df['name'] = bus_df['name'].str.replace("  ", " ")

In [38]:
bus_df['address'] = bus_df['address'].str.replace("  ", " ")
bus_df['address'] = bus_df['address'].str.replace("Road", "Rd")
bus_df['address'] = bus_df['address'].str.replace("Avenue", "Ave")
bus_df['address'] = bus_df['address'].str.replace("Street", "St")
bus_df['address'] = bus_df['address'].str.replace("Boulevard", "Blvd")
bus_df['address'] = bus_df['address'].str.replace("William", "Wm")

From the address column, we extract the street name.

In [39]:
# this function extracts the street name
# from the address
def extract_street_y(address):
    streets = address.split(' ')
    if(len(streets)==1):
        return streets[0]
    streets = streets[1:]
    if (len(streets)>1):
        if (len(streets[0])==1):
            return streets[1]
    return streets[0]

bus_df['sty'] = bus_df["address"].apply(extract_street_y)

## Mapping the IDs of the Restaurants

We now focus on mapping the common restaurants in both datasets. To map the restaurants of the two datasets, we rely on their names and addresses. We first map the restaurants that have the same name in both datasets. We then map the restaurants where one name is a substring of the other, because the names of some restaurants are inconsistent between the two datasets.

We first extract the names and addresses of the restaurants of both datasets.

In [40]:
# extract the names and addresses from yelp and violation datasets
name_add_yelp = bus_df[['business_id', 'name',
                        'address', 'city', 'sty', 'postal_code']]
name_add_viol = violation[['id', 'facility_name',
                           'num', 'street', 'city', 'st', 'zip']]

We then merge the information of the restaurants based on the names of the facilities. This is an outer merge, from which we next extract the restaurants with identical names in both datasets and those with different names.

In [41]:
# merge the two sets of info using the names of the restaurants
viol_yelp_merged = name_add_viol.merge(
    name_add_yelp, left_on="facility_name", right_on="name", how='outer')
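
As an illustration of how an outer merge separates matched and unmatched names, the sketch below uses the `indicator` option of `merge` on two made-up tables. The report itself relies on null checks rather than on the indicator column; the names `viol_toy`, `yelp_toy` and `merged` are hypothetical.

```python
import pandas as pd

# Made-up tables: only "cafe one" appears in both
viol_toy = pd.DataFrame({"id": [1, 2],
                         "facility_name": ["cafe one", "diner two"]})
yelp_toy = pd.DataFrame({"business_id": ["x", "y"],
                         "name": ["cafe one", "taco three"]})

# indicator=True adds a "_merge" column telling where each row came from
merged = viol_toy.merge(yelp_toy, left_on="facility_name", right_on="name",
                        how="outer", indicator=True)

both = merged[merged["_merge"] == "both"]             # same name in both
only_viol = merged[merged["_merge"] == "left_only"]   # violation dataset only
only_yelp = merged[merged["_merge"] == "right_only"]  # Yelp dataset only
```

Rows tagged "left_only" have a null `name`, and rows tagged "right_only" have a null `facility_name`, which is the property used in the next sections.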

### Common Restaurants with Same Name in Both Datasets
We now find the common restaurants that have exactly the same name in both datasets.

In [42]:
# focus on restaurants with same name
# i.e., extract the non-null entries in the merged dataframe (outer merge)
viol_yelp_same = viol_yelp_merged[np.logical_and(
    viol_yelp_merged.name.notnull(), viol_yelp_merged.facility_name.notnull())]

In [43]:
viol_yelp_same = viol_yelp_same.sort_values("facility_name")
viol_yelp_same = viol_yelp_same.drop_duplicates()

Since restaurants with the same name can have many branches, we only keep the pairs with the same street name and zip code.

In [44]:
# extract the restaurants (with same name) that are on the same street
viol_yelp_same = viol_yelp_same[viol_yelp_same["st"] == viol_yelp_same["sty"]]

In [45]:
# extract the restaurants (with same name) that have the same zipcode
viol_yelp_same = viol_yelp_same[viol_yelp_same["zip"].astype(
    int) == viol_yelp_same["postal_code"].astype(int)]
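
The two address filters can be pictured on a small made-up table. The sketch assumes, as in the cells above, that the zip codes of the violation dataset are numeric while the Yelp postal codes are stored as text, so both are converted to integers before comparison.

```python
import pandas as pd

# Made-up candidate pairs: only the first one agrees on street and zip code
toy = pd.DataFrame({
    "st": ["Beaver", "Main", "Forbes"],
    "sty": ["Beaver", "Main", "Fifth"],
    "zip": [15143.0, 15222.0, 15213.0],
    "postal_code": ["15143", "15219", "15213"],
})

same_street = toy["st"] == toy["sty"]
same_zip = toy["zip"].astype(int) == toy["postal_code"].astype(int)

# keep the pairs that pass both checks
kept = toy[same_street & same_zip]
```

The second pair is rejected because of the zip code and the third because of the street name.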

We save the list of restaurants that have the same name and address in both datasets.

In [46]:
viol_yelp_same.to_csv("viol_yelp_same.csv")

### Common Restaurants with Slightly Different Naming

We now focus on finding the common restaurants that appear with different names in the two datasets. To do this, we pair the restaurants whose names are substrings of each other. We then check their addresses and keep the pairs with the same address.

We first extract the restaurants whose names have no exact match in the other dataset.

In [47]:
# focus on restaurants with different names
# i.e., extract the rows with null entries in the merged dataframe
viol = viol_yelp_merged[viol_yelp_merged.name.isnull()][[
    'id', 'facility_name', 'num', 'street', 'city_x', 'st', 'zip'
]].drop_duplicates()

yelp = viol_yelp_merged[viol_yelp_merged.facility_name.isnull()][[
    'business_id', 'name', 'address', 'city_y', 'sty', 'postal_code'
]].drop_duplicates()

We then pair the restaurants where one name is a substring of the other. For each restaurant in the violation dataset, we look for the restaurants in the Yelp business dataset whose name starts with the same character and is either a substring or a superstring of the name in the violation dataset.

In [48]:
viol_yelp_diff = []

# ascii codes of characters and symbols
# the codes of upper case (65 to 90) are excluded
codes = list(range(32, 65))+list(range(91, 127))

for code in codes:
    
    # get the character
    c = chr(code)
    
    # get the facility names that starts with the character c
    # from both datasets
    viol_c = viol[viol.facility_name.apply(
        lambda s: s.startswith(c))].sort_values('facility_name')
    yelp_c = yelp[yelp.name.apply(
        lambda s: s.startswith(c))].sort_values('name')
    
    # combine the restaurant information from both datasets
    # when one name is a substring of the other name
    for indexv, rowv in viol_c.iterrows():
        for indexy, rowy in yelp_c.iterrows():
            if ((rowy['name'] in rowv['facility_name']) 
                or (rowv['facility_name'] in rowy['name'])):
                rest = list(rowv)+list(rowy)
                viol_yelp_diff.append(rest)
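
The matching rule used in this loop can be isolated in a small function, which makes its behaviour easy to inspect. The sketch below is illustrative: `names_overlap` is a hypothetical helper, the Yelp-side names are made up, and the first-character test simply mirrors the per-character grouping of the loop.

```python
def names_overlap(name_a, name_b):
    """True when the names share a first character and one contains the other."""
    if not name_a or not name_b:
        return False
    same_start = name_a[0] == name_b[0]
    return same_start and (name_a in name_b or name_b in name_a)


viol_names = ["starbucks coffee #760", "pan", "giant eagle #24"]
yelp_names = ["starbucks", "panera bread", "eagle"]  # made-up examples

pairs = [(v, y) for v in viol_names for y in yelp_names
         if names_overlap(v, y)]
```

In this example "pan" is paired with "panera bread" although the two are presumably unrelated, while "eagle" is not paired with "giant eagle #24" because the first characters differ. Short names therefore tend to produce spurious pairs, which is one reason why the address checks and the manual review that follow are needed.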

We transform the obtained results into a pandas DataFrame.

In [49]:
viol_yelp_diff = pd.DataFrame(
    viol_yelp_diff, columns=list(viol.columns)+list(yelp.columns))

After finding the restaurants whose names differ slightly between the two datasets, we only keep the pairs with the same street name and zip code.

In [50]:
# keep the restaurants with same street name and zip codes
viol_yelp_diff = viol_yelp_diff[viol_yelp_diff["st"] == viol_yelp_diff["sty"]]

In [51]:
viol_yelp_diff = viol_yelp_diff[viol_yelp_diff["zip"].astype(
    int) == viol_yelp_diff["postal_code"].astype(int)]

We finally save the obtained mapping.

In [52]:
viol_yelp_diff.to_csv("viol_yelp_diff.csv")

## Additional Processing 

After finding the common restaurants in both datasets based on their names and addresses, we manually checked the mappings found. We noticed some further inconsistencies that were not resolved by the previous processing steps. We address these inconsistencies here based on our observations: we first remove the incorrect mappings, and we then check whether any business ID is mapped to more than one ID or any ID is mapped to more than one business ID.

### Deleting some Wrong Mappings

We now delete the rows that contain wrong mappings from the two mapping dataframes, viol_yelp_same and viol_yelp_diff. For instance, a cafe located inside a store in the violation dataset may be mapped to the whole store in the Yelp business dataset, or chain restaurants on the same street may be mapped to each other although their street numbers differ. Note that the rows to drop are specified explicitly, based on our manual checks rather than on an automated procedure.

In [53]:
# drop rows with wrong mapping
where_to_drop_1 = (viol_yelp_same.name == "mcdonalds") & (
    viol_yelp_same.num == "6361") & (viol_yelp_same.business_id == "G9LyIc5LgBNM_zF8BJMhNg")
where_to_drop_2 = (viol_yelp_same.name == "crazy mocha") & (
    viol_yelp_same.num == "801") & (viol_yelp_same.business_id == "87rslhpXfVcJXN1_DiUQ9A")

viol_yelp_same = viol_yelp_same[~where_to_drop_1]
viol_yelp_same = viol_yelp_same[~where_to_drop_2]

  


In [54]:
# facility name, street number and business id of rows to drop
# those are based on checking the mappings found earlier

rows_to_drop = [["agh suburban campus gift shop", "100", "zPaoppBtXodfjEnsT7GFZA"],
                ["agh suburban campus kitchen", "100", "zPaoppBtXodfjEnsT7GFZA"],
                ["aldi #69", "8000", "ZOAbx2hTdu8KyUMDicZNDw"],
                ["au bon pain #103 @ us steel tower concourse",
                    "600", "mNMVqHsgJq6YErFPfaUlig"],
                ["au bon pain #103 @ us steel tower concourse",
                    "600", "9toyHY_tXx4eenp8Oxmmsg"],
                ["au bon pain #224 @ gulf tower", "707", "9toyHY_tXx4eenp8Oxmmsg"],
                ["au bon pain @ one oxford center plaza level",
                    "301", "mNMVqHsgJq6YErFPfaUlig"],
                ["casa rastalas palmas", "2056", "iXnHrmTw-r6LNzC2MjjrWQ"],
                ["chipotle mexican grill #2410", "4611", "VFv7NcPW9ajUTLleJ8wOQA"],
                ["chipotle mexican grill #863", "3619", "o7aBPJhTR-R4TccTSfu-Ng"],
                ["eatn park #3", "7370", "6fQXFkkw2BUARDa0EzsGnA"],
                ["eatn park restaurants #80", "516", "cTYEiHz8AEOpiwzDY9ngWQ"],
                ["getgo #3047", "6513", "x_9malt5q6yHZwYdX6k6Hg"],
                ["getgo #3057", "6513", "x_9malt5q6yHZwYdX6k6Hg"],
                ["getgo fuel kiosk #3257 robinson",
                    "6513", "x_9malt5q6yHZwYdX6k6Hg"],
                ["giant eagle #1691", "400", "bNkCDwXWscwnVV3xixQoAg"],
                ["giant eagle #24", "3239", "PJcOOjebn86geLXG2qPB_Q"],
                ["giant eagle #63", "4250", "k8NHw2fUjisdp38PLCl0JA"],
                ["giant eagle #641", "1025", "6FqYbhI4bClySEVbdHnOaQ"],
                ['giant eagle #67', '8080', 'Ea-6xpe581a_mmeD3W3mog'],
                ['ikea restaurant', '2001', 'ohkd4oHpIrvue5nWBPMSMg'],
                ['pan', '3519', '6MQJMPVi5HobGjx73DaE7Q'],
                ['panera bread #4329', '117', 'fM9Nmx3Rv4zFU5fLMILpLA'],
                ['starbucks coffee #19816 mcknight',
                    '7707', 'y90JVPFQ_TWGQaBhHtSm3w'],
                ['starbucks coffee #21587', '301', 'UQtV1plcxfLTlvKeLPv63Q'],
                ['starbucks coffee #22032', '3007', 'NeM7anGnTOTn7sEJavS3sw'],
                ['starbucks coffee #760', '5211', 'Vgh-CjwAl4tFleP39OdzaA'],
                ['starbucks coffee #7625', '5310', 'SY7ZyPxidnToTEFbJBleOg'],
                ['starbucks coffee #7749', '4765', 'Hi9Mq6SkJRU9E0YzPAl7PQ'],
                ['starbucks coffee #776', '4885', 'A-OvzL1cssAoFQ-TMRUpBQ'],
                ['starbucks coffee #7875', '1597', 'qnN6BumIj-OPU38Bg6wPOw'],
                ['starbucks coffee / baggage claim',
                    1000, 'TrZVtAQivYyGDOQH46_aFA'],
                ['wendys old fashioned hamburgers #1622367',
                    '891', 'E4U8RCe42CpT3rtof7hEwQ'],
                ['wendys old fashioned hamburgers #1632356',
                    '2691', 'GaP5KfdlDuNasKgTNR8Saw'],
                ['wendys old fashioned hamburgers #528',
                    '891', 'E4U8RCe42CpT3rtof7hEwQ'],
                ['wendys old fashioned hamburgers #529',
                    '2691', 'GaP5KfdlDuNasKgTNR8Saw'],
                ['starbucks coffee #75819', '427', 'A89Re1NzGXBNuAk6CZ1rwQ'],
                ['rite aid phamacies #10931', '2501', 'DDRXDnU_BBbGg9U2RYWiAQ'],
                ['eatn park #13', '7671', 'MjOFnVxoKeOikgBGTZ-UcA'],
                ['rite aid pharmacies #10906', '5235', 'yl97wnawcJ14L_o5xaQ7bA']]

for row in rows_to_drop:
    where_to_drop = (viol_yelp_diff.facility_name == row[0]) & (
        viol_yelp_diff.num == row[1]) & (viol_yelp_diff.business_id == row[2])
    viol_yelp_diff.drop(
        index=viol_yelp_diff[where_to_drop].index, inplace=True)
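
As a side note, the same kind of row removal can be written without a loop, as an anti-join. The following is only a minimal sketch on toy data: the dataframe `mappings` and its values are hypothetical stand-ins for viol_yelp_diff, and it assumes that the street number is stored with the same type (here, strings) in both the dataframe and the list of rows to drop.

```python
import pandas as pd

# Toy stand-in for viol_yelp_diff (hypothetical values, not taken from the real data)
mappings = pd.DataFrame({
    "facility_name": ["aldi #69", "aldi #69", "ikea restaurant"],
    "num": ["8000", "8000", "2001"],
    "business_id": ["AAA", "BBB", "CCC"],
})

# Manually verified wrong mappings, in the layout [facility_name, num, business_id]
rows_to_drop = [["aldi #69", "8000", "AAA"]]
key_cols = ["facility_name", "num", "business_id"]
drop_df = pd.DataFrame(rows_to_drop, columns=key_cols)

# Left merge with an indicator column, then keep the rows found only on the left side
merged = mappings.merge(drop_df, how="left", on=key_cols, indicator=True)
cleaned = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(cleaned)
```

The loop used above has the advantage of dropping rows in place and keeping the original index; the merge-based version returns a new dataframe with a fresh index.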

In [55]:
# the names of restaurants whose rows are to drop
names = ["carlow university", "chartiers country club", "chatham university",
         "heinz field", "gandy dancer saloon", "grand concourse", 
         "hyatt place pittsburgh airport", "hyatt place pittsburgh airport",
         "pittsburgh airport marriott", "pittsburgh field club",
         "riverhounds", "rivers casino", "target", "upmc mercy", 
         "upmc", "upmc presbyterian"]

for name in names:
    where_to_drop = (viol_yelp_diff.name == name)
    viol_yelp_diff.drop(
        index=viol_yelp_diff[where_to_drop].index, inplace=True)

# the facility names of restaurants whose rows are to drop
fa_names = ["microtel inn and suites", "giant eagle #24 cafe", "giant eagle #67 cafe",
            "giant eagle cafe", "giant eagle wexford cafe #45",
            "starbucks temporary baggage claim kiosk", 'giant eagle cafe #18',
            'giant eagle cafe #52', 'giant eagle cafe #61', 'giant eagle cafe #068',
            'giant eagle cafe #619', 'giant eagle cafe #43', 'giant eagle cafe #60',
            'giant eagle cafe #646', 'giant eagle cafe #0004', 'giant eagle cafe #6379',
            'revel and roost']

for facility_name in fa_names:
    where_to_drop = (viol_yelp_diff.facility_name == facility_name)
    viol_yelp_diff.drop(
        index=viol_yelp_diff[where_to_drop].index, inplace=True)

In [56]:
where_to_drop = (viol_yelp_diff.facility_name == "starbucks coffee") & (
    viol_yelp_diff.num == 1000) & (viol_yelp_diff.street == "Airport Blvd")
viol_yelp_diff.drop(index=viol_yelp_diff[where_to_drop].index, inplace=True)

### Adjusting more IDs of Restaurants of the Violation Dataset

We previously mentioned that, in the violation dataset, the same restaurant can appear under differently spelled names. The previous processing did not find all of these restaurants. The mapping between the violation and Yelp business datasets can help in finding more of these instances: we look for business IDs that are mapped to more than one ID, i.e., a restaurant from the Yelp dataset mapped to more than one restaurant from the violation dataset.

We first find the restaurants with one business ID mapped to more than one ID.

In [57]:
bus_ids = viol_yelp_diff[['id', 'business_id']]

# group by business_id and count for each group the number of unique IDs
bus_ids = bus_ids.groupby('business_id').id.nunique()
bus_ids = bus_ids.reset_index()
# focus on restaurants with business_id mapped to more than one ID
bus_ids = bus_ids[bus_ids.id > 1]

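We then adjust the IDs and business starting dates of those restaurants in the violation dataset and in the mapping dataframe. All records of such a restaurant receive the smallest ID and the earliest business starting date of the group.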

In [58]:
for bus_id in bus_ids.business_id:
    fc_ids = viol_yelp_diff[viol_yelp_diff.business_id == bus_id].id
    where_id = violation.id.isin(fc_ids)
    # adjust the ID and business starting date for restaurants with more than one ID
    # (.loc is used because .at only accepts a single row label, not a boolean mask)
    violation.loc[where_id, 'bus_st_date'] = violation[where_id].bus_st_date.min()
    violation.loc[where_id, 'id'] = fc_ids.min()
    viol_yelp_diff.loc[viol_yelp_diff.business_id == bus_id, 'id'] = fc_ids.min()
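
To illustrate what this adjustment does, here is a small self-contained sketch on toy data. The dataframes `mapping` and `records` and all their values are hypothetical; they only mimic the roles of viol_yelp_diff and the violation dataset.

```python
import pandas as pd

# Hypothetical mapping: the Yelp business "B1" is matched to two violation IDs
mapping = pd.DataFrame({
    "id": [10, 11, 12],
    "business_id": ["B1", "B1", "B2"],
})
# Hypothetical violation records with business starting dates
records = pd.DataFrame({
    "id": [10, 11, 12],
    "bus_st_date": pd.to_datetime(["2015-03-01", "2012-06-15", "2018-01-01"]),
})

# Business IDs that are mapped to more than one violation ID
counts = mapping.groupby("business_id")["id"].nunique()
duplicated = counts[counts > 1].index

for bus_id in duplicated:
    ids = mapping.loc[mapping.business_id == bus_id, "id"]
    mask = records.id.isin(ids)
    # keep the earliest starting date and the smallest ID for the whole group
    records.loc[mask, "bus_st_date"] = records.loc[mask, "bus_st_date"].min()
    records.loc[mask, "id"] = ids.min()
    mapping.loc[mapping.business_id == bus_id, "id"] = ids.min()

print(records)
```

In this toy case, the records with IDs 10 and 11 end up sharing the ID 10 and the starting date 2012-06-15, while the record with ID 12 is left unchanged.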

### Adjusting the Business IDs of Some Restaurants of Yelp dataset

We similarly check whether the same restaurant appears in the Yelp dataset under different business IDs.

To find those restaurants, we check whether there are IDs mapped to more than one business ID, i.e., a restaurant from the violation dataset mapped to more than one restaurant from the Yelp dataset.

We first find the restaurants with one ID mapped to more than one business ID, using the mapping dataframe "viol_yelp_diff".

In [59]:
fc_ids = viol_yelp_diff[['id', 'business_id']]

# group by id and count for each group the number of unique business IDs
fc_ids = fc_ids.groupby('id').business_id.nunique()
fc_ids = fc_ids.reset_index()

# focus on restaurants with id mapped to more than one business ID
fc_ids = fc_ids[fc_ids.business_id > 1]

We then adjust the business ID of those restaurants in the mapping dataframe. In the Yelp business dataset, we also replace their review counts with the total number of reviews, replace their stars with the average number of stars weighted by the review counts, and merge their attributes and categories.

In [60]:
import json

bus_ids_saved = []

for fc_id in fc_ids.id:
    # get the business IDs mapped to this violation ID
    bus_ids = viol_yelp_diff[viol_yelp_diff.id == fc_id].business_id
    where_id = bus_df.business_id.isin(bus_ids)

    # compute the total number of reviews and the review-weighted average of stars
    total_reviews = bus_df[where_id].review_count.sum()
    total_stars = (bus_df[where_id].stars *
                   bus_df[where_id].review_count).sum() / total_reviews

    # merge the attributes; for a repeated key, the later entry overwrites the earlier one
    attributes = bus_df[where_id].attributes
    att_merged = dict()
    for att in attributes:
        if att is not None:
            att_merged.update(att)
    # join the category strings with a separator so that they do not run together
    categories = bus_df[where_id].categories.str.cat(sep=", ")

    bus_ids_saved.append(list(bus_ids))

    # update the business dataset and the mapping dataframe
    # (.loc is used because .at only accepts a single row label, not a boolean mask)
    bus_df.loc[where_id, 'stars'] = total_stars
    bus_df.loc[where_id, 'review_count'] = total_reviews
    bus_df.loc[where_id, 'attributes'] = json.dumps(att_merged)
    bus_df.loc[where_id, 'categories'] = categories
    viol_yelp_diff.loc[viol_yelp_diff.id == fc_id, "business_id"] = bus_ids.min()
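
To make the merged values concrete, the small sketch below works through one case by hand. The dataframe `dupes` and its values are hypothetical: it represents two Yelp entries that turned out to be the same restaurant. If entry $i$ has $s_i$ stars and $n_i$ reviews, the merged number of stars is $\sum_i s_i n_i / \sum_i n_i$.

```python
import pandas as pd

# Hypothetical duplicate Yelp entries for one restaurant
dupes = pd.DataFrame({
    "stars": [4.0, 3.0],
    "review_count": [30, 10],
    "attributes": [{"WiFi": "free"}, {"Parking": "lot", "WiFi": "no"}],
})

# total number of reviews and review-weighted average of stars
total_reviews = dupes.review_count.sum()
weighted_stars = (dupes.stars * dupes.review_count).sum() / total_reviews

# merge the attribute dictionaries; later entries overwrite earlier keys
merged_attributes = {}
for att in dupes.attributes:
    if att is not None:
        merged_attributes.update(att)

print(total_reviews, weighted_stars, merged_attributes)
```

Here the merged entry has 40 reviews and (4.0 × 30 + 3.0 × 10) / 40 = 3.75 stars. Note that the two entries disagree on the "WiFi" attribute, and the merged dictionary keeps the value of the later entry.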

We repeat the same two steps with the mapping dataframe "viol_yelp_same": we find the restaurants with one ID mapped to more than one business ID, and we then merge their entries in the Yelp business dataset.

In [61]:
fc_ids = viol_yelp_same[['id', 'business_id']]
fc_ids = fc_ids.groupby('id').business_id.nunique()
fc_ids = fc_ids.reset_index()
fc_ids = fc_ids[fc_ids.business_id > 1]

In [62]:
# same processing as above, applied to viol_yelp_same
for fc_id in fc_ids.id:
    bus_ids = viol_yelp_same[viol_yelp_same.id == fc_id].business_id
    where_id = bus_df.business_id.isin(bus_ids)
    total_reviews = bus_df[where_id].review_count.sum()
    total_stars = (bus_df[where_id].stars *
                   bus_df[where_id].review_count).sum() / total_reviews
    attributes = bus_df[where_id].attributes
    att_merged = dict()
    for att in attributes:
        if att is not None:
            att_merged.update(att)
    # join the category strings with a separator so that they do not run together
    categories = bus_df[where_id].categories.str.cat(sep=", ")
    bus_ids_saved.append(list(bus_ids))
    # .loc is used because .at only accepts a single row label, not a boolean mask
    bus_df.loc[where_id, 'stars'] = total_stars
    bus_df.loc[where_id, 'review_count'] = total_reviews
    bus_df.loc[where_id, 'attributes'] = json.dumps(att_merged)
    bus_df.loc[where_id, 'categories'] = categories
    viol_yelp_same.loc[viol_yelp_same.id == fc_id, "business_id"] = bus_ids.min()

In [63]:
viol_yelp_diff.to_csv("viol_yelp_diff_2.csv")
viol_yelp_same.to_csv("viol_yelp_same_2.csv")

### Finding the Final Mapping between the Datasets

We finally extract the mapping between the violation IDs and business IDs in a table.

In [76]:
maps = pd.concat([viol_yelp_same, viol_yelp_diff])

In [77]:
maps = maps[['id', 'business_id']]
maps = maps.drop_duplicates()

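We now merge the information provided by the Yelp business dataset with the violation dataset. We first attach the business ID to each inspection record using the mapping table, and we then attach the Yelp business information using the business ID.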

In [78]:
violation_busid = violation.merge(maps, on="id")

In [81]:
violation_bus = violation_busid.merge(bus_df, on="business_id")
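
One possible sanity check for these two merges, which is not part of the original processing, is to ask pandas to verify that each merge is many-to-one, so that no inspection record gets duplicated. The sketch below uses toy dataframes (`inspections`, `id_map` and `businesses` are hypothetical names and values) and the `validate` parameter of `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical inspections, ID mapping and Yelp business table
inspections = pd.DataFrame({"id": [1, 1, 2, 3], "violation_count": [2, 0, 5, 1]})
id_map = pd.DataFrame({"id": [1, 2], "business_id": ["B1", "B2"]})
businesses = pd.DataFrame({"business_id": ["B1", "B2"], "stars": [4.5, 3.0]})

# validate="many_to_one" raises pandas.errors.MergeError if a key appears
# more than once on the right side, which would duplicate inspection rows
with_bus_id = inspections.merge(id_map, on="id", how="inner",
                                validate="many_to_one")
combined = with_bus_id.merge(businesses, on="business_id", how="inner",
                             validate="many_to_one")
print(combined)
```

With the default inner join, the inspection with ID 3 is dropped because it has no match in the mapping table; only restaurants found in both datasets are kept.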

In [88]:
maps.to_csv("maps.csv")