# Data Visualizations Workbook

In our previous notebook, we collected data from the Yelp API for our two regions of interest, Napa Valley and San Diego.

In this notebook, we will explore, analyze, and interpret the results. We will look at the general statistics for each region, use plots to visualize the data, and then review the results at the end.

By the end of the notebook, we will have a recommendation for which region to start our winery in.

# Exploring, Analyzing, and Interpreting Results

Now that we have pulled the business data from the Yelp API, we will explore the data to gain insights that guide our decision.

Our guiding questions will be:
- **What does each specific region look like, statistically?**
    - How do the specific metrics relate to each other?
        - Are there any strong relationships between them?
    - Are there any clusters of businesses in each area?
        - Clusters indicate strong local competition as well as ideal real estate.

- **What are the price ranges for each area?**
    - Price gives an indication of the clientele for local businesses
        - Lower pricing may be more broadly appealing
        - Higher pricing targets more discerning, luxury clientele

- **Ratings**
    - Indicate how satisfied or unsatisfied clients may be
        - Unsatisfied customers indicate opportunities to steal their business from the other wineries
    - How do the ratings compare/contrast between regions?
        - Are there more satisfied customers in one region vs. the other?

- **Number of reviews**
    - May indicate the popularity of businesses
        - More popular businesses may have more reviews
    - A larger number of reviews creates a larger sample size for analyses, such as the average number of reviews per business.


# Importing Packages

First, we are going to import our packages for use in the notebook. 

We are going to import packages to:
- access our saved data
- explore the data and generate statistics 
- create visualizations

In [1]:
# Accessing stored data
import csv
import json

# Data exploration and statistics
import pandas as pd
import numpy as np

# Creating Visualizations
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Loading Data

We will use pandas to load the CSV files generated previously in the Data Acquisition notebook.
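Both regions are loaded with the same three steps, so they could be wrapped in a small helper. The following is a minimal sketch rather than part of the original workflow: `load_region` is a hypothetical name, and a two-row in-memory CSV with an abbreviated set of columns stands in for the real file so the example runs on its own.

```python
import io

import pandas as pd


def load_region(source, city):
    """Read one region's CSV and tag every row with its region name.

    `source` can be a file path or any file-like object accepted by pd.read_csv.
    """
    df = pd.read_csv(source)
    df = df.reset_index(drop=True)
    df["City"] = city
    return df


# Stand-in for a real file such as data/wineries_San_Diego_price_converted.csv
sample_csv = io.StringIO(
    "name,rating,review_count,price,price_converted\n"
    "Bernardo Winery,4.5,626,$$,2\n"
    "Négociant Winery,4.5,103,$$,2\n"
)

df_sample = load_region(sample_csv, "San Diego")
print(df_sample[["name", "City"]])
```

With the real files, the call would take the path instead of the in-memory buffer, once per region.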

## San Diego Wineries

In [2]:
# Read in data from the San Diego .csv
df_sd_details = pd.read_csv("data/wineries_San_Diego_price_converted.csv")
df_sd_details.reset_index(drop=True, inplace=True)
df_sd_details['City'] = 'San Diego'

# View results
df_sd_details.head()

Unnamed: 0,name,location,Business ID,alias,title,rating,review_count,price,latitude,longitude,City,price_converted
0,Bernardo Winery,"13330 Paseo Del Verano Norte San Diego, CA 92128",DknnpiG1p4OoM1maFshzXA,winetastingroom,Wine Tasting Room,4.5,626,$$,33.0328,-117.04646,San Diego,2
1,Callaway Vineyard & Winery,"517 4th Ave Ste 101 San Diego, CA 92101",Cn2_bpTngghYW1ej4zreZg,winetastingroom,Wine Tasting Room,5.0,100,$$,32.710751,-117.160918,San Diego,2
2,San Pasqual Winery - Seaport Village,"805 W Harbor Dr San Diego, CA 92101",gMW1RvyLu90RSQAY9UrIHw,winetastingroom,Wine Tasting Room,4.5,138,$$,32.708732,-117.168195,San Diego,2
3,Négociant Winery,"2419 El Cajon Blvd San Diego, CA 92104",Cc1sQWRWgGyMCjzX2mmMQQ,winetastingroom,Wine Tasting Room,4.5,103,$$,32.75488,-117.13828,San Diego,2
4,Domaine Artefact Vineyard & Winery,"15404 Highland Valley Rd Escondido, CA 92025",WqVbxY77Ag96X90LultCUw,wineries,Wineries,5.0,96,$$,33.06817,-117.0016,San Diego,2


## Napa Valley Wineries

In [9]:
# Read in data from the Napa Valley .csv
df_nv_details = pd.read_csv("data/wineries_Napa Valley_price_converted.csv")
df_nv_details.reset_index(drop=True, inplace=True)
df_nv_details['City'] = 'Napa Valley'

# View results
df_nv_details.head()

Unnamed: 0,name,location,Business ID,alias,title,rating,review_count,price,latitude,longitude,City,price_converted
0,Hendry Vineyard and Winery,"3104 Redwood Rd Napa, CA 94558",mO8n3zTLoFhlmcfQr7X_TQ,wineries,Wineries,5.0,658,$$,38.32168,-122.34481,Napa Valley,2
1,Domaine Carneros,"1240 Duhig Rd Napa, CA 94559",8eGTOeEQpUpYb89ISug3ag,wineries,Wineries,4.0,2239,$$,38.255534,-122.351391,Napa Valley,2
2,Paraduxx Winery,"7257 Silverado Trl Napa, CA 94558",cBFZALrZbLV5XBsiPcgknQ,wineries,Wineries,4.5,373,$$,38.43548,-122.35143,Napa Valley,2
3,Jarvis Winery,"2970 Monticello Rd Napa, CA 94558",NPkAqW68Og5eBofEpPiRXQ,wineries,Wineries,4.5,209,$$$,38.35701,-122.21362,Napa Valley,3
4,Cuvaison Estate Wines,"1221 Duhig Rd Napa, CA 94559",rjiMUH4UecBVD3wkqhgxXw,wineries,Wineries,4.0,327,$$,38.251176,-122.347084,Napa Valley,2


## Combined Data for Both Regions

Now we'll concatenate the dataframes for our regions to compare the data.

In [13]:
combined = pd.concat([df_sd_details, df_nv_details], ignore_index=False)
combined

Unnamed: 0,name,location,Business ID,alias,title,rating,review_count,price,latitude,longitude,City,price_converted
0,Bernardo Winery,"13330 Paseo Del Verano Norte San Diego, CA 92128",DknnpiG1p4OoM1maFshzXA,winetastingroom,Wine Tasting Room,4.5,626,$$,33.032800,-117.046460,San Diego,2
1,Callaway Vineyard & Winery,"517 4th Ave Ste 101 San Diego, CA 92101",Cn2_bpTngghYW1ej4zreZg,winetastingroom,Wine Tasting Room,5.0,100,$$,32.710751,-117.160918,San Diego,2
2,San Pasqual Winery - Seaport Village,"805 W Harbor Dr San Diego, CA 92101",gMW1RvyLu90RSQAY9UrIHw,winetastingroom,Wine Tasting Room,4.5,138,$$,32.708732,-117.168195,San Diego,2
3,Négociant Winery,"2419 El Cajon Blvd San Diego, CA 92104",Cc1sQWRWgGyMCjzX2mmMQQ,winetastingroom,Wine Tasting Room,4.5,103,$$,32.754880,-117.138280,San Diego,2
4,Domaine Artefact Vineyard & Winery,"15404 Highland Valley Rd Escondido, CA 92025",WqVbxY77Ag96X90LultCUw,wineries,Wineries,5.0,96,$$,33.068170,-117.001600,San Diego,2
...,...,...,...,...,...,...,...,...,...,...,...,...
398,Andretti Winery,"1625 Trancas St Ste 3017 Napa, CA 94558",NKCMqIlRopcSMA15JpeyJg,wineries,Wineries,3.5,311,$$,38.321516,-122.304108,Napa Valley,2
399,Lionstone International,"21481 8th St E Sonoma, CA 95476",pW9QPUkm2_tTXLCzyQ6qvg,wineries,Wineries,1.0,1,$$,38.262062,-122.442036,Napa Valley,2
400,Napa Vinyards,"Napa, CA 94558",UwgQWRkTzlFnw3-QYCaBlQ,wineries,Wineries,1.0,1,$$,38.383260,-122.313060,Napa Valley,2
401,Cook Vinyard Management,"19626 Eighth St E Sonoma, CA 95476",LxMkyxBokxu6iRIsuMF5Tw,wineries,Wineries,1.0,1,$$,38.286261,-122.434893,Napa Valley,2


Everything looks good! Let's go ahead and get a quick overview of the dataframe.

In [18]:
combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 485 entries, 0 to 402
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             485 non-null    object 
 1   location         485 non-null    object 
 2   Business ID      485 non-null    object 
 3   alias            485 non-null    object 
 4   title            485 non-null    object 
 5   rating           485 non-null    float64
 6   review_count     485 non-null    int64  
 7   price            485 non-null    object 
 8   latitude         485 non-null    float64
 9   longitude        485 non-null    float64
 10  City             485 non-null    object 
 11  price_converted  485 non-null    int64  
dtypes: float64(3), int64(2), object(7)
memory usage: 49.3+ KB
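The index line in the summary above (485 entries, 0 to 402) reflects that the concatenation kept each frame's own row labels, so labels repeat across regions. The sketch below illustrates the difference on two tiny stand-in frames; the frame names `sd` and `nv` are illustrative and only a few winery names from the outputs above are used.

```python
import pandas as pd

sd = pd.DataFrame({"name": ["Bernardo Winery", "Négociant Winery"]})
nv = pd.DataFrame({"name": ["Hendry Vineyard and Winery", "Domaine Carneros", "Jarvis Winery"]})

# Keep each frame's original labels (the behaviour used in this notebook)
kept = pd.concat([sd, nv], ignore_index=False)

# Discard the original labels and number the rows 0..n-1
fresh = pd.concat([sd, nv], ignore_index=True)

print(kept.index.tolist())   # [0, 1, 0, 1, 2]
print(fresh.index.tolist())  # [0, 1, 2, 3, 4]
print(kept.index.is_unique, fresh.index.is_unique)  # False True
```

With repeated labels, a label-based lookup such as `kept.loc[0]` returns one row per region, which is worth keeping in mind before selecting rows by label.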


In [19]:
combined.shape

(485, 12)

In [15]:
combined.describe()

Unnamed: 0,rating,review_count,latitude,longitude,price_converted
count,485.0,485.0,485.0,485.0,485.0
mean,4.551546,81.723711,37.415949,-121.434665,2.179381
std,0.679863,183.315341,2.029642,1.998159,0.492493
min,1.0,1.0,32.512608,-122.49469,1.0
25%,4.5,4.0,38.247393,-122.371093,2.0
50%,4.5,15.0,38.29737,-122.295921,2.0
75%,5.0,79.0,38.36385,-122.25338,2.0
max,5.0,2239.0,38.465419,-116.750704,4.0
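One way to approach the guiding question of how the metrics relate to each other is a correlation matrix. The sketch below is illustrative only: it uses the ten rows shown in the `.head()` outputs above rather than the full dataset, and it assumes a rank-based (Spearman) correlation is a reasonable choice because review counts are heavily skewed.

```python
import pandas as pd

# Ten rows copied from the .head() outputs above (five per region)
sample = pd.DataFrame({
    "rating": [4.5, 5.0, 4.5, 4.5, 5.0, 5.0, 4.0, 4.5, 4.5, 4.0],
    "review_count": [626, 100, 138, 103, 96, 658, 2239, 373, 209, 327],
    "price_converted": [2, 2, 2, 2, 2, 2, 2, 2, 3, 2],
})

# Spearman correlation compares ranks, so a single very large value
# (such as 2239 reviews) does not dominate the result
corr = sample.corr(method="spearman")
print(corr.round(2))
```

On the full data, the same call could be applied to `combined[['rating', 'review_count', 'price_converted']]`.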


In [20]:
combined.groupby('City')['price_converted'].agg(['count', 'mean', 'min', 'median', 'max'])

City,count,mean,min,median,max
Napa Valley,403,2.220844,1,2,4
San Diego,82,1.97561,1,2,3
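The same per-region summary can be extended to several columns at once with named aggregation. This is a sketch under assumptions: the output column names (`wineries`, `mean_rating`, `median_reviews`) are made up for illustration, and the data is limited to the ten rows from the `.head()` outputs above, so the numbers do not represent the full regions.

```python
import pandas as pd

sample = pd.DataFrame({
    "City": ["San Diego"] * 5 + ["Napa Valley"] * 5,
    "rating": [4.5, 5.0, 4.5, 4.5, 5.0, 5.0, 4.0, 4.5, 4.5, 4.0],
    "review_count": [626, 100, 138, 103, 96, 658, 2239, 373, 209, 327],
})

# Each keyword names an output column: (source column, aggregation)
summary = sample.groupby("City").agg(
    wineries=("rating", "size"),
    mean_rating=("rating", "mean"),
    median_reviews=("review_count", "median"),
)
print(summary)
```

Selecting the columns inside `agg` this way avoids aggregating text columns such as `name` or `location`.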


# Exploring the Data for Each City

## EDA: San Diego Business Details

- .describe()
- box plot, scatter plot, hist
- what details stand out?
- What insight do we gain from these details?

In [None]:
# First, let's see an overview of the data with the .head() method.
df_sd_details.head()

- What does this data show? Any insight?

In [None]:
# Now let's see the centrality of the data via the .describe() method.
df_sd_details.describe()

- What does this data show? Any insight?

### San Diego - Visualizing the Data

- .describe()
- box plot, scatter plot, hist
- what details stand out?
- What insight do we gain from these details?

In [None]:
# Visualizing the centrality of the data.

df_sd_details['review_count'].plot(kind='box');

Our box plot of review counts for San Diego shows that almost 75% of businesses have fewer than 100 reviews, with some **close to 200**. Four businesses have significantly more reviews, which raises the question of _why_ those businesses are so popular.

Let's get a closer look at the data without those outliers.

In [None]:
# Boxplot using the same data as above with the outliers removed.

df_sd_details['review_count'].plot(kind='box', showfliers=False);

This new box plot gives us a clearer idea of the spread of the data. Half of the businesses fall between about 10 and 90 reviews, with a median close to 30 reviews.
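The box, whiskers, and outlier points in these plots come from the quartiles of the data. The sketch below reproduces that arithmetic by hand, assuming matplotlib's default rule that points more than 1.5 times the interquartile range beyond the box are drawn as outliers. It uses only the five San Diego review counts shown in the `.head()` output, so the values differ from those of the full dataset.

```python
import numpy as np

# Review counts from the five San Diego rows shown earlier
review_counts = np.array([626, 100, 138, 103, 96])

q1, median, q3 = np.percentile(review_counts, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are plotted individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = review_counts[(review_counts < lower_fence) | (review_counts > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers.tolist()}")
```

Passing `showfliers=False` hides the outlier points in the plot but does not remove them from the data or change the quartiles.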

Where does the other half of the data fall? Let's use a histogram to visualize the data differently.

In [None]:
# Counting the occurrences of each review total

sns.histplot(df_sd_details['review_count']);

This histogram shows that a large number of businesses have below 50 reviews. This matches our understanding from above.
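A histogram reading such as "below 50 reviews" can also be checked numerically by binning the counts. The following sketch assumes hand-picked bin edges and labels, and uses the ten review counts from the `.head()` outputs above. Those rows are the first search results for each region, so this sample skews toward heavily reviewed businesses and is not representative of the full data.

```python
import pandas as pd

review_counts = pd.Series([626, 100, 138, 103, 96, 658, 2239, 373, 209, 327])

# Bin edges and labels are illustrative choices; intervals are closed on the right
bins = [0, 50, 100, 200, 500, float("inf")]
labels = ["1-50", "51-100", "101-200", "201-500", "501+"]

binned = pd.cut(review_counts, bins=bins, labels=labels)
counts = binned.value_counts().sort_index()  # sort_index keeps the bin order
print(counts)
```

Dividing `counts` by `counts.sum()` gives the share of businesses in each bin, which is easier to compare between regions of different sizes.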

In [None]:
fig, ax = plt.subplots()
df_sd_details['price_converted'].plot.hist(bins=5, ax=ax)
ax.set_xlabel('Price ($)');

## EDA: Napa Valley Business Details

In [None]:
df_nv_details.head()

- .describe()
- box plot, scatter plot, hist
- what details stand out?
- What insight do we gain from these details?

In [None]:
# Now let's see the centrality of the data via the .describe() method.
df_nv_details.describe()

- What does this data show? Any insight?

### Napa Valley - Visualizing the Data

- .describe()
- box plot, scatter plot, hist
- what details stand out?
- What insight do we gain from these details?

In [None]:
# Visualizing the centrality of the data.

df_nv_details['review_count'].plot(kind='box');

This box plot shows the spread of review counts for Napa Valley businesses, along with any outliers that have far more reviews than the rest.

Let's get a closer look at the data without those outliers.

In [None]:
# Boxplot using the same data as above with the outliers removed.

df_nv_details['review_count'].plot(kind='box', showfliers=False);

This new box plot gives us a clearer idea of the spread of the Napa Valley data once the outliers are hidden.

Where does the other half of the data fall? Let's use a histogram to visualize the data differently.

In [None]:
# Counting the occurrences of each review total

sns.histplot(df_nv_details['review_count']);

This histogram shows how review counts are distributed across Napa Valley businesses.

In [None]:
fig, ax = plt.subplots()
df_nv_details['price_converted'].plot.hist(bins=5, ax=ax)
ax.set_xlabel('Price ($)');

# Exploring the Combined Data

Now that we have looked at each region individually, we are going to compare the regions' details to help us determine the best choice for our winery.

## Inspecting the Combined Data

In [None]:
combined.head()

In [None]:
combined.describe()

In [None]:
combined.groupby('City')['price_converted'].agg(['count', 'mean', 'min', 'median', 'max'])

## Combined Data: Visualizing the Data

In [None]:
sns.countplot(data=combined, hue='City', x='price_converted' );

In [None]:
sns.countplot(data=combined, hue='City', x='rating' );

In [None]:
# Making a grouped bar chart for each city normalizing the ratings
norm_rating = combined.groupby('City')['rating'].value_counts(normalize=True).to_frame()
norm_rating.unstack(0).plot(kind='bar')
legend = plt.legend()
legend.get_texts()[0].set_text('Napa Valley')
legend.get_texts()[1].set_text('San Diego')
plt.xticks(rotation = 0)
plt.xlabel('Rating')
plt.ylabel('Proportion of Businesses')
plt.title('Ratings per Region');
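Normalizing within each region matters because the regions differ greatly in size (403 Napa Valley businesses versus 82 in San Diego). As an alternative sketch of the same idea, `pd.crosstab` with `normalize='columns'` builds the table of shares directly. The data here is limited to the ten ratings from the `.head()` outputs above, so the shares are illustrative only.

```python
import pandas as pd

sample = pd.DataFrame({
    "City": ["San Diego"] * 5 + ["Napa Valley"] * 5,
    "rating": [4.5, 5.0, 4.5, 4.5, 5.0, 5.0, 4.0, 4.5, 4.5, 4.0],
})

# Rows are ratings, columns are regions; each column sums to 1
shares = pd.crosstab(sample["rating"], sample["City"], normalize="columns")
print(shares)
```

Calling `shares.plot(kind='bar')` on this table would draw a grouped bar chart whose legend entries already carry the region names, so the legend text would not need to be overwritten by position.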

In [None]:
# Exploring how to normalize values, then change the structure of a dataframe
# normalized_rating = combined.groupby('City')['rating'].value_counts(normalize=True).to_frame().unstack(0, fill_value=0)
# normalized_rating

In [None]:
# Making a grouped bar chart for each city normalizing the pricing
norm_pricing = combined.groupby('City')['price_converted'].value_counts(normalize=True).to_frame().unstack(0, fill_value=0)
norm_pricing.plot(kind='bar')
legend = plt.legend()
legend.get_texts()[0].set_text('Napa Valley')
legend.get_texts()[1].set_text('San Diego')
plt.xlabel('Price ($)')
plt.xticks(rotation = 0)
plt.ylabel('Normalized Count')
plt.title('Menu Pricing per Region');

In [None]:
# # Comparing Review Counts between regions
# norm_r_c = combined.groupby('City')['review_count'].value_counts(normalize=True).to_frame().unstack(0, fill_value=0)
# norm_r_c.plot(kind='bar')
# legend = plt.legend()
# legend.get_texts()[0].set_text('Napa Valley')
# legend.get_texts()[1].set_text('San Diego')
# plt.xlabel('Number of Reviews')
# plt.xticks(rotation = 0)
# plt.ylabel('Normalized Count')
# plt.title('Review Counts per Region');

In [None]:
combined_sd = combined[combined.loc[:,'City'] == 'San Diego']
combined_sd.hist(column = 'review_count');

In [None]:
combined_nv = combined[combined.loc[:,'City'] == 'Napa Valley']
combined_nv.hist(column = 'review_count');

In [None]:
# Determining # businesses for NV vs. SD

sns.countplot(x=combined['City']);

In [None]:
# rc_norm = combined.groupby('City')['review_count'].value_counts(normalize=True)
# rc_norm = rc_norm.to_frame().unstack(0, fill_value=0).reset_index(drop=True)
# rc_norm

In [None]:
combined_nv = combined[combined.loc[:,'City'] == 'Napa Valley']
combined_nv['review_count'].hist(grid=False);

# Finished Visualizations

Now that we have an idea of what each dataset looks like, we will use visualizations to compare each of the data sets to answer our questions, guiding our decision.

In [None]:
# Comparing 'price' values between regions
norm_pricing = combined.groupby('City')['price_converted'].value_counts(normalize=True).to_frame().unstack(0, fill_value=0)
norm_pricing.plot(kind='bar')
legend = plt.legend()
legend.get_texts()[0].set_text('Napa Valley')
legend.get_texts()[1].set_text('San Diego')
plt.xlabel('Price ($)')
plt.xticks(rotation = 0)
plt.ylabel('Normalized Count')
plt.title('Business Pricing per Region');

# plt.savefig('Business_Pricing_per_Region.png')

In [None]:
# Making a grouped bar chart for each city normalizing the ratings
norm_rating = combined.groupby('City')['rating'].value_counts(normalize=True).to_frame().unstack(0).plot(kind='bar')
legend = plt.legend()
legend.get_texts()[0].set_text('Napa Valley')
legend.get_texts()[1].set_text('San Diego')
plt.xticks(rotation = 0)
plt.xlabel('Rating')
plt.ylabel('Proportion of Businesses')
plt.title('Ratings per Region');

# plt.savefig('Ratings_per_Region.png')

In [None]:
for style in plt.style.available:
    with plt.style.context(style):
        
        # Making a grouped bar chart for each city normalizing the ratings
        norm_rating = combined.groupby('City')['rating'].value_counts(normalize=True).to_frame().unstack(0).plot(kind='bar')
        legend = plt.legend()
        legend.get_texts()[0].set_text('Napa Valley')
        legend.get_texts()[1].set_text('San Diego')
        plt.xticks(rotation = 0)
        plt.xlabel('Rating')
        plt.ylabel('Proportion of Businesses')
        plt.title(style)
        plt.show();

        # plt.savefig('Ratings_per_Region.png')

In [None]:
# Combining SNS styles - merge settings of first with second
# Note: matplotlib 3.6+ renamed these styles to 'seaborn-v0_8-talk' and 'seaborn-v0_8-darkgrid'
with plt.style.context(['seaborn-talk', 'seaborn-darkgrid']):

    # Making a grouped bar chart for each city normalizing the ratings
    norm_rating = combined.groupby('City')['rating'].value_counts(normalize=True).to_frame().unstack(0).plot(kind='bar')
    legend = plt.legend()
    legend.get_texts()[0].set_text('Napa Valley')
    legend.get_texts()[1].set_text('San Diego')
    plt.xticks(rotation = 0)
    plt.xlabel('Rating')
    plt.ylabel('Proportion of Businesses')
    # Fixed title; the earlier `style` loop variable is not meaningful in this cell
    plt.title('Ratings per Region')
    plt.show();

    # plt.savefig('Ratings_per_Region.png')

In [None]:
# Show count of businesses per city

sns.countplot(x=combined['City'])
plt.xlabel('Region')
plt.ylabel('Total')
plt.title('Number of Wineries per Region');

# plt.savefig('Number_of_Wineries_per_Region.png')

In [None]:
df_nv = combined[combined['City'] == 'Napa Valley']
df_nv

In [None]:
# Creating geospatial view of Napa Valley Wineries

# Load the Mapbox token from a local JSON file kept outside the repository
with open(r'C:\Users\bmcca\.secret\mapbox_api.json') as f:
    token = json.load(f)['token']

px.set_mapbox_access_token(token)

fig = px.scatter_mapbox(df_nv, lat= "latitude", lon= "longitude", 
                        color= "price_converted", range_color= (0, 4),
                        labels= {"price_converted": "Price ($) ",
                                 "latitude":"Latitude ",
                                 "longitude":"Longitude ",
                                 'review_count':'Number of Reviews '},
                        size= 'review_count', hover_name = df_nv["name"],
                        color_continuous_scale=px.colors.sequential.Greys,
                        size_max=15, zoom=9.8, title='Napa Valley Wineries',
                        mapbox_style='light', width=900, height=900)
fig.show()
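The map cells in this notebook each read the Mapbox token from the same JSON file. A small helper could do this once; the sketch below is a hedged illustration, with `load_mapbox_token` as a hypothetical name and a temporary file holding a placeholder value standing in for the real token file. It assumes, as in the cells here, that the file is a JSON object with a `token` key.

```python
import json
import tempfile
from pathlib import Path


def load_mapbox_token(path):
    """Return the value stored under the "token" key of a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["token"]


# Demonstration with a throwaway file and a placeholder value (not a real token)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "mapbox_api.json"
    demo_path.write_text(json.dumps({"token": "pk.placeholder"}), encoding="utf-8")
    token = load_mapbox_token(demo_path)

print(token)
```

In the notebook, the returned value would be passed to `px.set_mapbox_access_token(token)` once, before the first map is drawn.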

In [None]:
# Creating geospatial view of Napa Valley Wineries

# Load the Mapbox token from a local JSON file kept outside the repository
with open(r'C:\Users\bmcca\.secret\mapbox_api.json') as f:
    token = json.load(f)['token']

px.set_mapbox_access_token(token)

fig = px.scatter_mapbox(df_nv, lat= "latitude", lon= "longitude", 
                         range_color= (0, 3),
                        labels= {"price_converted": "Price ($) ",
                                 "latitude":"Latitude ",
                                 "longitude":"Longitude ",
                                 'review_count':'Number of Reviews '},
                        size= 'review_count', hover_name = df_nv["name"],
                        size_max=15, zoom=10.25, title='Napa Valley Wineries',
                        mapbox_style='light', width=900, height=900)
fig.show()

In [None]:
df_sd = combined[combined['City'] == 'San Diego']
df_sd

In [None]:
# Creating geospatial view of San Diego Wineries

# Load the Mapbox token from a local JSON file kept outside the repository
with open(r'C:\Users\bmcca\.secret\mapbox_api.json') as f:
    token = json.load(f)['token']

px.set_mapbox_access_token(token)

fig = px.scatter_mapbox(df_sd, lat= "latitude", lon= "longitude",
                        color= "price_converted", range_color= (0, 3),
                        labels= {"price_converted": "Price ($) ", 
                                "latitude":"Latitude ","longitude":"Longitude ",
                                'review_count':'Number of Reviews '},
                        size= 'review_count', hover_name = df_sd["name"],
                        color_continuous_scale=px.colors.sequential.Greys,
                        size_max=15, zoom=9.75, title='San Diego Wineries',
                        mapbox_style='light', width=900, height=900)
fig.show()

In [None]:
# # Creating geospatial view of San Diego Wineries

# with open(r'C:\Users\bmcca\.secret\mapbox_api.json') as f:
#     token = json.load(f)

# open(r'C:\Users\bmcca\.secret\mapbox_api.json').read()
    
# token = token['token']

# px.set_mapbox_access_token(token)

# fig = px.scatter_mapbox(df_sd, lat= "latitude", lon= "longitude",
#                         color= "review_count", range_color= (0,150),
#                         labels= {"review_count": "Number of Reviews ", 
#                                 "latitude":"Latitude ","longitude":"Longitude ",
#                                 'review_count':'Number of Reviews '},
#                         size= 'review_count', hover_name = df_sd["name"],
#                         color_continuous_scale=px.colors.sequential.Greys,
#                         size_max=15, zoom=9.25, title='San Diego Wineries',
#                         mapbox_style='light', width=800, height=800)
# fig.show()

In [None]:
# Creating geospatial view of San Diego Wineries

# Load the Mapbox token from a local JSON file kept outside the repository
with open(r'C:\Users\bmcca\.secret\mapbox_api.json') as f:
    token = json.load(f)['token']

px.set_mapbox_access_token(token)

fig = px.scatter_mapbox(df_sd, lat= "latitude", lon= "longitude",
                        range_color= (0,150),
                        labels= {"review_count": "Number of Reviews ",
                                "latitude":"Latitude ","longitude":"Longitude "},
                        size= 'review_count', hover_name = df_sd["name"],
                        size_max=15, zoom=9.25, title='San Diego Wineries',
                        mapbox_style='light', width=800, height=800)
fig.show();