<head><h1 align="center">
Film Production Industry
</h1></head>

<head><h3 align="center">Analysis with Insights Provided by the Yelp API</h3></head>

For over a century, film, television, and media production was ubiquitous in, and almost exclusive to, two major US metropolises: Los Angeles and New York. Over the past several decades, a number of developments have cracked this exclusivity. The advent of the internet has increased access to entertainment while inspiring a new generation of creators. Other advances in technology have made film-industry-standard production quality achievable with devices that are commonplace among today's consumers. Further, state governments have begun competing to attract business with tax credits designed to lure productions within their borders, including to places where no production industry existed before. <br><br>
With the help of business insights provided by the Yelp API, we will take a glimpse at this industry, comparing companies that offer production services in a few locations. Our goal is to find what business opportunities appear to be present within this industry and, ultimately, to develop a business based on what we find.

In [3]:
import json
import sys
import numpy as np
import pandas as pd
import csv
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

#### For visitors:

- Please import the packages above to run the code throughout the page below.
- All of the functions we developed to pull lots of data from the Yelp API are housed in the ```/code``` folder in the main repository. 
- If you would like to use our functions, you will need your own Yelp API developer client ID and API key, which you can obtain [here!](https://www.yelp.com/developers/documentation/v3/get_started)

## Our Method

We used our ```yelp_call(_)```, ```parse_data(_)```, and ```call_1000(_)``` functions to fetch all of our business data:
- They *call* the Yelp API using user input for ```term```, ```location```, and ```categories```.
- From there, they *parse the data* into a Python list of tuples, with each tuple containing an individual business's name, address, rating, review count, price, and some other information that can be retrieved by changing the parameters in ```url_params```.
- Finally, the third function spares us from manually offsetting each call, as the Yelp API returns at most 50 businesses per call.
  - After gathering the data, this last function also writes it to a CSV file in the ```/database``` folder. The file is automatically named using the ```term``` and ```location``` values that the user supplied at the beginning.
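
The offset loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in our ```/code``` folder: the names ```collect_all```, ```parse_page```, and ```fetch_page``` are hypothetical, the overall cap of 1000 is an assumption based on the name ```call_1000```, and the tuple holds only a few of the fields we actually keep.

```python
from typing import Callable, Dict, List, Tuple

PAGE_SIZE = 50      # Yelp returns at most 50 businesses per call
MAX_RESULTS = 1000  # assumed overall cap, suggested by the name call_1000


def parse_page(businesses: List[Dict]) -> List[Tuple]:
    """Flatten raw business dicts into (name, address, rating, review_count, price)."""
    rows = []
    for b in businesses:
        rows.append((
            b.get("name"),
            b.get("location", {}).get("display_address"),
            b.get("rating"),
            b.get("review_count"),
            b.get("price"),  # absent for many businesses
        ))
    return rows


def collect_all(fetch_page: Callable[[int, int], List[Dict]],
                page_size: int = PAGE_SIZE,
                max_results: int = MAX_RESULTS) -> List[Tuple]:
    """Call fetch_page(offset, limit) until results run out or the cap is hit."""
    rows: List[Tuple] = []
    offset = 0
    while offset < max_results:
        limit = min(page_size, max_results - offset)
        page = fetch_page(offset, limit)
        if not page:
            break  # nothing left to fetch
        rows.extend(parse_page(page))
        if len(page) < limit:
            break  # a partially filled page is the last one
        offset += limit
    return rows
```

In real use, ```fetch_page``` would wrap an authenticated request to the Yelp business search endpoint, passing ```term```, ```location```, ```categories```, ```limit```, and ```offset```. Keeping the network call behind a function argument also makes the loop easy to try out with fake data.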

We have other functions that we used to call the Yelp API to gather example reviews for each business collected by the above method, which we will come back to later, but for now...

## *On to our Analysis!*

Let's read our film production data into this document.

In [4]:
df_fp_ny = pd.read_csv('database/Film Production_NYC_database.csv') 
df_fp_la = pd.read_csv('database/Film Production_Los Angeles_database.csv')
df_fp_ga = pd.read_csv('database/Film Production_Atlanta_database.csv')

In [16]:
df_fp_la.tail(3)

| | Name | Address | City | Rating | Review Count | Coordinates | Price | Id | Categories |
|---|---|---|---|---|---|---|---|---|---|
| 997 | FunLovinCamera | ['Glendale, CA 91206'] | Glendale | 5.0 | 119 | {'latitude': 34.1660842895508, 'longitude': -1... | NaN | HT837BNGI49eAhc1gcGiVg | [{'alias': 'eventphotography', 'title': 'Event... |
| 998 | Printefex | ['456 Foothill Blvd', 'Ste B', 'La Canada Flin... | La Canada Flintridge | 4.5 | 100 | {'latitude': 34.19874, 'longitude': -118.18839} | NaN | t_eid75T-RVOz2GBTqXKiw | [{'alias': 'copyshops', 'title': 'Printing Ser... |
| 999 | Rebecca Blue Media | ['Rancho Cucamonga, CA 91730'] | Rancho Cucamonga | 4.5 | 16 | {'latitude': 34.10246, 'longitude': -117.58408} | NaN | OxuwVhB0RM9wxVkMVgHZLg | [{'alias': 'sessionphotography', 'title': 'Ses... |


Here is a sample of one of our dataframes. As you can see, we have each business's name, address, and city, plus its coordinates in case we want them. We also have Yelp-specific data: the rating, review count, and price bracket listed on the site, the business ID that identifies each company on Yelp, and its categories.
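
Should we want the coordinates later, note that the ```Coordinates``` column comes back from the CSV as text rather than as a dictionary. A minimal sketch of how it could be split into numeric columns is below; the helper name ```split_coordinates``` and the new column names ```Latitude``` and ```Longitude``` are our own hypothetical choices, and the sketch assumes every cell holds a dictionary-style string like the ones in the sample.

```python
import ast

import pandas as pd


def split_coordinates(df: pd.DataFrame, column: str = "Coordinates") -> pd.DataFrame:
    """Return a copy of df with numeric Latitude and Longitude columns added."""
    # Each cell looks like "{'latitude': 34.19874, 'longitude': -118.18839}",
    # so literal_eval turns it back into a dict without executing any code.
    parsed = df[column].apply(ast.literal_eval)
    out = df.copy()
    out["Latitude"] = pd.to_numeric(parsed.apply(lambda d: d.get("latitude")))
    out["Longitude"] = pd.to_numeric(parsed.apply(lambda d: d.get("longitude")))
    return out
```

For example, ```split_coordinates(df_fp_la)``` would leave ```df_fp_la``` untouched and return a new dataframe with the two extra columns.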

### Initial Takeaway

For statistical analysis, we have three measures to compare across businesses:
- Price — a measure of how expensive the business's products or services are compared to their competition, in the eyes of the Yelp reviewer/customer.
- Review Count — a measure of how popular the business is.
- Rating — a measure of how *beloved* the business is.

#### Introductory Analysis of Descriptive Stats:

In [38]:
df_fp_ny.describe()

| | Rating | Review Count | Price |
|---|---|---|---|
| count | 297.0 | 297.0 | 11.0 |
| mean | 4.301347 | 20.026936 | 2.363636 |
| std | 1.146793 | 72.959582 | 0.924416 |
| min | 1.0 | 1.0 | 1.0 |
| 25% | 4.0 | 1.0 | 2.0 |
| 50% | 5.0 | 4.0 | 2.0 |
| 75% | 5.0 | 13.0 | 2.5 |
| max | 5.0 | 890.0 | 4.0 |


In [34]:
df_fp_la.describe()

| | Rating | Review Count | Price |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 39.0 |
| mean | 4.713 | 19.56 | 1.948718 |
| std | 0.682713 | 48.070565 | 0.793019 |
| min | 0.0 | 0.0 | 1.0 |
| 25% | 5.0 | 2.0 | 1.5 |
| 50% | 5.0 | 5.0 | 2.0 |
| 75% | 5.0 | 17.25 | 2.0 |
| max | 5.0 | 817.0 | 4.0 |


In [35]:
df_fp_ga.describe()

| | Rating | Review Count | Price |
|---|---|---|---|
| count | 54.0 | 54.0 | 3.0 |
| mean | 4.287037 | 8.166667 | 2.333333 |
| std | 1.230899 | 28.899141 | 0.57735 |
| min | 1.0 | 1.0 | 2.0 |
| 25% | 4.0 | 1.0 | 2.0 |
| 50% | 5.0 | 1.5 | 2.0 |
| 75% | 5.0 | 3.0 | 2.5 |
| max | 5.0 | 198.0 | 3.0 |


Above, we can see a few interesting things to note:
- **There are ~300 businesses in New York, just over 50 in Atlanta, and *at least* 1000 in Los Angeles**.

  Without jumping to conclusions, both a saturated market and a barren one may have advantages and disadvantages for a startup business.
   - A saturated market is very difficult to compete in, but we may be able to benefit by offering services that are *parallel* to businesses that are thriving there. E.g., we offer equipment rentals near a company that offers production services, or vice-versa. 
   - Additionally, were we to survive in the market to the point where we could scale, there are a multitude of other businesses close by that offer a variety of products or services that we may absorb to become a conglomerate. With there being such quantity and variety, we are more likely to find a company to buy that will fill in our needs precisely.
   - A market with sparse industry is only easy to compete in if there is a demand for our services. If there is demand, we could quickly grow to become the pinnacle of industry in that locale. If there is none, we starve.

- **The mean rating for film production businesses is highest in Los Angeles** (about 4.71).
  - Popularity aside, the key takeaway here is that, on average, the film production businesses are more beloved in LA than in either NYC or Atlanta.
    - It will be more difficult to get customers in this market, as customers here are already happy with what they have.
- By contrast, the average rating for film production businesses in New York is only marginally higher than in Atlanta (about 4.30 versus 4.29).
  - NYC offers a larger market for film than Atlanta, while its businesses are only negligibly more beloved.

- **The Review Count for businesses in New York has by far the highest standard deviation** (about 73, versus 48 in Los Angeles and 29 in Atlanta).
  - Were we to plot businesses in NYC on a map and look at their review counts, we would expect to see a large disparity between the most and least frequented production businesses across the city.
  - Were we to open up shop in NYC, we *must* carefully consider where we open for business! We could position ourselves next to a large production business that offers *different* services to ~~leech~~ develop a symbiotic relationship in which their clients become our customers.
  - Conversely, businesses that are less frequented could be in a commercial dead zone (and we should stay away from that precise area), *or* could simply not be good at attracting customers (and we could easily outcompete them).
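
To make these comparisons easier to read, the three summaries can also be placed side by side. The sketch below is one way to do that; the helper name ```compare_cities``` and the output column names are hypothetical, and it assumes each dataframe has the ```Rating``` and ```Review Count``` columns shown earlier.

```python
import pandas as pd


def compare_cities(frames: dict) -> pd.DataFrame:
    """Summarize Rating and Review Count per market in a single table."""
    # Tag every row with its market, then stack all markets into one frame.
    stacked = pd.concat(
        [df.assign(Market=name) for name, df in frames.items()],
        ignore_index=True,
    )
    return stacked.groupby("Market").agg(
        businesses=("Rating", "size"),
        mean_rating=("Rating", "mean"),
        review_median=("Review Count", "median"),
        review_std=("Review Count", "std"),
    )
```

Calling ```compare_cities({"NYC": df_fp_ny, "Los Angeles": df_fp_la, "Atlanta": df_fp_ga})``` would return one row per market. Including the median next to the standard deviation is a deliberate choice: review counts are heavily skewed, so the median gives a better sense of the typical business.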

### Price

When evaluating our dataframes, it appears we have a lot of ```NaN``` values in the price column. Yelp lists price as a string of ```$``` symbols (which, of course, is not a number), but this has been corrected for in our ```parse_data(_)``` function, so every ```NaN``` is a genuinely empty cell. Let's see how much data we have to work with if we drop the ```NaN``` values.

In [31]:
print('With NaNs: ',(len(df_fp_ny), len(df_fp_la), len(df_fp_ga)), 'Without NaNs: ',(len(df_fp_ny.dropna()), len(df_fp_la.dropna()), len(df_fp_ga.dropna())))

With NaNs:  (297, 1000, 54) Without NaNs:  (11, 39, 3)


Yikes. That really limits our insight into how these places compare as far as costs go! But let's see what we can take away from what we have.
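
As a starting point, here is a rough sketch of the two price-related steps discussed above: turning a ```$``` string into a number, and measuring how much of each market actually has a price. Both helper names are hypothetical, and ```price_to_level``` is only our guess at the kind of conversion ```parse_data(_)``` performs, not a copy of it.

```python
import numpy as np
import pandas as pd


def price_to_level(price) -> float:
    """Map a price string such as "$$" to its length (2.0); anything else is NaN."""
    if isinstance(price, str) and price and set(price) == {"$"}:
        return float(len(price))
    return np.nan


def price_coverage(frames: dict) -> pd.DataFrame:
    """Count the rows with a usable Price per market and the share they represent."""
    rows = []
    for market, df in frames.items():
        priced = int(df["Price"].notna().sum())
        share = priced / len(df) if len(df) else np.nan
        rows.append({"Market": market, "rows": len(df),
                     "priced": priced, "share": share})
    return pd.DataFrame(rows).set_index("Market")
```

With the counts printed above, the share of businesses that list a price works out to roughly 3.7% in NYC (11 of 297), 3.9% in Los Angeles (39 of 1000), and 5.6% in Atlanta (3 of 54).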