## Yelp Dataset Pruning

### Purpose

This script (`yelp_preprocessing.py`) prunes the Yelp dataset obtained online so that it can be used with scikit-learn machine learning models. Pruning here means filtering the rows and dropping the columns that are not needed, so that the remaining data fits what scikit-learn expects.

### Steps

1. **Data Input:**
   - Obtain the Yelp dataset, which is available online in JSON format. Note that the code in this notebook reads the data from `./yelp_dataset.csv`.

2. **Run Script:**
   - Execute `python yelp_preprocessing.py` to run the pruning script.

3. **Output:**
   - The script generates a pruned version of the Yelp dataset ready for scikit-learn usage.
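
The input is described as JSON, while the code later in this notebook reads `./yelp_dataset.csv`. The following is a minimal sketch of one way to bridge the two, assuming the business file is newline-delimited JSON (one object per line). The function name `json_lines_to_csv` and the input filename in the usage note are hypothetical and are not taken from `yelp_preprocessing.py`.

```python
import csv
import json

# Columns that the notebook later reads from the CSV file.
COLUMNS = ["business_id", "name", "address", "city", "state",
           "latitude", "longitude", "stars", "review_count", "categories"]


def json_lines_to_csv(src_path, dst_path, columns=COLUMNS):
    """Copy the selected fields of each JSON line into a CSV file.

    Returns the number of records written. Missing fields become empty cells.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=columns)
        writer.writeheader()
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Keep only the wanted fields; absent ones are written as empty cells.
            writer.writerow({col: record.get(col, "") for col in columns})
            written += 1
    return written
```

A call such as `json_lines_to_csv("yelp_academic_dataset_business.json", "./yelp_dataset.csv")` would produce the file that the notebook loads; the input filename is an assumption and should be replaced with the actual download.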

### Notes

- Verify that the pruned features meet the input requirements of the scikit-learn model you plan to use (typically numeric columns without missing values).
- Keep a backup of the original dataset obtained online.
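
The first note can be checked mechanically. This is a minimal sketch, assuming "requirements" means numeric dtype and no missing values, which is what most scikit-learn estimators expect. The helper name `find_unsuitable_columns` and the sample frame are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def find_unsuitable_columns(df):
    """Return {column: reason} for columns that are non-numeric or contain NaN."""
    problems = {}
    for col in df.columns:
        if not is_numeric_dtype(df[col]):
            problems[col] = "non-numeric"
        elif df[col].isna().any():
            problems[col] = "missing values"
    return problems


# Small illustrative frame with the same kind of columns as the pruned data.
sample = pd.DataFrame({
    "stars": [4.0, 2.0],
    "review_count": [80, 6],
    "categories": ["Restaurants, Food", "Burgers, Fast Food"],
})
print(find_unsuitable_columns(sample))  # {'categories': 'non-numeric'}
```

An empty result suggests the frame can be passed to an estimator as-is; any reported column needs encoding, imputation, or removal first.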


#### Dependencies

In [1]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install numpy


In [3]:
pip install altair


In [33]:
pip install geopy

Collecting geopy
  Downloading geopy-2.4.1-py3-none-any.whl.metadata (6.8 kB)
Collecting geographiclib<3,>=1.52 (from geopy)
  Downloading geographiclib-2.0-py3-none-any.whl (40 kB)
Downloading geopy-2.4.1-py3-none-any.whl (125 kB)
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-2.0 geopy-2.4.1


In [4]:
#Import Relevant Packages
import pandas as pd
import numpy as np
import altair as alt

In [28]:
#Import CSV File, loading only the columns used below
df = pd.read_csv("./yelp_dataset.csv", #Path to the CSV file
                 usecols=['business_id',
                          'name',
                          'address',
                          'city',
                          'state',
                          'latitude',
                          'longitude',
                          'stars',
                          'review_count',
                          'categories'])

In [29]:
#Shape
print(df.shape)
print(df.head())

(150346, 10)
              business_id                      name  \
0  Pns2l4eNsfO8kk83dixA6A  Abby Rappoport, LAC, CMQ   
1  mpf3x-BjTdTEA3yCZrAYPw             The UPS Store   
2  tUFrWirKiKi_TAnsVWINQQ                    Target   
3  MTSW4McQd7CbVtyjqoe9mw        St Honore Pastries   
4  mWMc6_wTdE0EUBKIGXDVfA  Perkiomen Valley Brewery   

                           address           city state   latitude  \
0           1616 Chapala St, Ste 2  Santa Barbara    CA  34.426679   
1  87 Grasso Plaza Shopping Center         Affton    MO  38.551126   
2             5255 E Broadway Blvd         Tucson    AZ  32.223236   
3                      935 Race St   Philadelphia    PA  39.955505   
4                    101 Walnut St     Green Lane    PA  40.338183   

    longitude  stars  review_count  \
0 -119.711197    5.0             7   
1  -90.335695    3.0            15   
2 -110.880452    3.5            22   
3  -75.155564    4.0            80   
4  -75.471659    4.5            13   

      

In [30]:
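#Filter: keep rows whose categories mention restaurants (case-insensitive; rows with missing categories are dropped)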
df1 = df[df['categories'].str.contains('restaurants', case=False, na=False)]
print(df1.shape)
print(df1.head())

(52268, 10)
               business_id                   name              address  \
3   MTSW4McQd7CbVtyjqoe9mw     St Honore Pastries          935 Race St   
5   CF33F8-E6oudUQ46HnavjQ         Sonic Drive-In        615 S Main St   
8   k0hlBqXX-Bt0vf1op7Jr1w  Tsevi's Pub And Grill    8025 Mackenzie Rd   
9   bBDDEgkFA1Otx9Lfe7BZUQ         Sonic Drive-In  2312 Dickerson Pike   
11  eEOYSgkmpB90uNA7lDOMRA  Vietnamese Food Truck                  NaN   

            city state   latitude  longitude  stars  review_count  \
3   Philadelphia    PA  39.955505 -75.155564    4.0            80   
5   Ashland City    TN  36.269593 -87.058943    2.0             6   
8         Affton    MO  38.565165 -90.321087    3.0            19   
9      Nashville    TN  36.208102 -86.768170    1.5            10   
11     Tampa Bay    FL  27.955269 -82.456320    4.0            10   

                                           categories  
3   Restaurants, Food, Bubble Tea, Coffee & Tea, B...  
5   Burgers, Fas
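
The filter above relies on two arguments of `Series.str.contains`: `case=False` makes the match case-insensitive, and `na=False` treats missing `categories` values as non-matches instead of propagating `NaN` into the mask. A small sketch on made-up rows (the sample data is illustrative and not from the Yelp dataset):

```python
import pandas as pd

# Four hypothetical businesses, one of them with no categories at all.
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "categories": ["Restaurants, Food", "Shopping, Fashion", None, "Pubs, RESTAURANTS"],
})

# Same call as in the notebook: substring match, ignoring case, NaN counts as False.
mask = sample["categories"].str.contains("restaurants", case=False, na=False)
kept = sample[mask]
print(kept["name"].tolist())  # ['A', 'D']
```

Without `na=False`, the mask would contain a missing value for row `C` and could not be used directly for boolean indexing.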

In [32]:
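#Keep only the coordinate, rating, and category columns (drops business_id, name, and the address fields)
df2 = df1[['latitude', 'longitude', 'stars', 'review_count', 'categories']].copy() #copy() so columns can be added to df2 later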
print(df2.shape)
print(df2.head())

(52268, 5)
     latitude  longitude  stars  review_count  \
3   39.955505 -75.155564    4.0            80   
5   36.269593 -87.058943    2.0             6   
8   38.565165 -90.321087    3.0            19   
9   36.208102 -86.768170    1.5            10   
11  27.955269 -82.456320    4.0            10   

                                           categories  
3   Restaurants, Food, Bubble Tea, Coffee & Tea, B...  
5   Burgers, Fast Food, Sandwiches, Food, Ice Crea...  
8   Pubs, Restaurants, Italian, Bars, American (Tr...  
9   Ice Cream & Frozen Yogurt, Fast Food, Burgers,...  
11         Vietnamese, Food, Restaurants, Food Trucks  
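
At this point `categories` is still free text, which scikit-learn estimators generally cannot consume directly. One possible way to turn it into numeric indicator columns is pandas' `Series.str.get_dummies`. This is a sketch on made-up rows, assuming the labels are separated by `", "` as in the output above; it is not something the script itself is shown to do.

```python
import pandas as pd

# Illustrative rows shaped like df2 (values are made up).
sample = pd.DataFrame({
    "stars": [4.0, 2.0],
    "categories": ["Restaurants, Food", "Burgers, Restaurants"],
})

# One 0/1 column per distinct label found in the comma-separated strings.
indicators = sample["categories"].str.get_dummies(sep=", ")

# Replace the text column with its indicator columns.
features = pd.concat([sample.drop(columns="categories"), indicators], axis=1)
print(features)
```

With the full dataset this can produce a large number of columns, so it may be worth keeping only the most frequent labels.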


In [36]:
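#Obtain ZipCode
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

#Create the geocoder once instead of once per row
#Nominatim requires a user agent that identifies your application; replace with your own
geolocator = Nominatim(user_agent="mihirtakalkar@gmail.com")
#Nominatim's usage policy allows at most one request per second
reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1)

#Accepts row with latitude, and longitude
def get_zipcode(row):
    coordinates = f"{row['latitude']}, {row['longitude']}"
    location = reverse(coordinates, language='en')  # 'en' specifies English language for results
    if location is None:
        return None
    #Some locations have no postcode; return None in that case
    return location.raw.get('address', {}).get('postcode') or None

#One network request per row, so this is slow on the full 52,268-row frame
df2['zipcode'] = df2.apply(get_zipcode, axis=1)

print(df2.head())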

KeyboardInterrupt: 
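
The interruption is consistent with the cost of this step: it issues one network request per row, and `df2` has 52,268 rows. At Nominatim's limit of one request per second, that is roughly 52,268 / 3,600 ≈ 14.5 hours. One way to reduce the number of requests is to cache results by rounded coordinates, so that nearby businesses share a lookup. The sketch below assumes that approach; `make_cached_lookup` and `fake_lookup` are hypothetical names, and `fake_lookup` is a stand-in so the example runs without network access.

```python
from functools import lru_cache


def make_cached_lookup(lookup, precision=3):
    """Wrap `lookup(lat, lon)` so coordinates that round to the same value share one call."""
    @lru_cache(maxsize=None)
    def cached(lat, lon):
        return lookup(lat, lon)

    def rounded_lookup(lat, lon):
        # Round first, so nearby points map to the same cache key.
        return cached(round(lat, precision), round(lon, precision))

    return rounded_lookup


# Stand-in for a real reverse-geocoding call (hypothetical; no network access).
calls = []

def fake_lookup(lat, lon):
    calls.append((lat, lon))
    return "00000"


get_zip = make_cached_lookup(fake_lookup)
get_zip(39.9551, -75.1556)
get_zip(39.9553, -75.1558)  # rounds to the same key, so no second call
print(len(calls))  # 1
```

Rounding to three decimals groups points within roughly 100 m of latitude, so two businesses on opposite sides of a postcode boundary could receive the same value. A higher `precision` trades fewer cache hits for fewer such errors.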

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

https://scikit-learn.org/stable/modules/metrics.html#cosine-similarity
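
The two links above point to scikit-learn's cosine similarity, which for two vectors a and b is a·b / (‖a‖ ‖b‖). As a reminder of what that measures, here is a plain NumPy sketch of the same quantity on made-up feature rows of the form (stars, review_count). The function name `cosine_similarity_matrix` is hypothetical, and in practice the scikit-learn function from the linked page would be used instead.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero; all-zero rows get similarity 0
    unit = X / norms         # scale every row to unit length
    return unit @ unit.T     # dot products of unit vectors are cosines


# Made-up rows: (stars, review_count). The third row is twice the first.
features = np.array([[4.0, 80.0], [2.0, 6.0], [8.0, 160.0]])
sim = cosine_similarity_matrix(features)
print(np.round(sim, 3))
```

Rows that point in the same direction, such as the first and third here, have similarity 1 regardless of their magnitude, which is why features on very different scales (such as `stars` and `review_count`) are often rescaled before comparison.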