**Instructions**

Gaming industry.

**Scheme of the company:**

- 20 Designers
- 5 UI/UX Engineers
- 10 Frontend Developers
- 15 Data Engineers
- 5 Backend Developers
- 20 Account Managers
- 1 Maintenance guy that loves basketball
- 10 Executives
- 1 CEO/President.


**Requirements:**  

- Designers like to go to design talks and share knowledge. There must be some nearby companies that also do design.
- 30% of the company staff have at least 1 child.
- Developers like to be near successful tech startups that have raised at least 1 Million dollars.
- Executives like Starbucks A LOT. Ensure there's a Starbucks not too far away.
- Account managers need to travel a lot.
- Everyone in the company is between 25 and 40, give them some place to go party.
- The CEO is vegan.
- If you want to make the maintenance guy happy, there must be a basketball stadium within about 10 km.
- The office dog, "Dobby", needs a hairdresser every month. Ensure there's one not too far away.

**Libraries**

In [43]:
from pymongo import MongoClient
import pandas as pd
import time
import re
import io
import folium
from folium import Choropleth, Circle, Marker, Icon, Map
from folium.plugins import HeatMap, MarkerCluster
#from geopy.geocoders import Nominatim

# Extracting companies from the database

**Establishing connection to the database**

In [44]:
client = MongoClient("localhost:27017")
db = client["ironhack"]
c = db.get_collection("crunchbase")
c

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'ironhack'), 'crunchbase')

In [45]:
# Getting all the companies:
result = list(c.find())

**Exploring the *tags* and *categories***

In [46]:
# Looking at the distinct tags in the database:

tags = []
for i in range(len(result)):
    tags.append(result[i]['tag_list'])
tags = set(tags)
# print(tags)
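
The set above holds whole `tag_list` values, not individual tags. In the `head()` outputs of this notebook each `tag_list` looks like a single comma-separated string, so splitting it may give a clearer picture of which tags are common. A minimal sketch under that assumption; `count_tags` and the sample documents are hypothetical stand-ins for `result`:

```python
from collections import Counter


def count_tags(companies):
    """Count individual tags across companies, skipping missing tag lists."""
    counts = Counter()
    for company in companies:
        raw = company.get("tag_list") or ""  # tag_list may be None or absent
        counts.update(
            tag.strip().lower() for tag in raw.split(",") if tag.strip()
        )
    return counts


# Hypothetical sample standing in for `result`
sample = [
    {"name": "A", "tag_list": "video, games, p2p"},
    {"name": "B", "tag_list": "Games, design"},
    {"name": "C", "tag_list": None},
]
tag_counts = count_tags(sample)
print(tag_counts.most_common(3))
```

Calling `count_tags(result)` on the real data would then show which gaming-related tags are frequent enough to be worth filtering on.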

In [47]:
# Looking at the distinct categories in the database:

categories = []
for i in range(len(result)):
    categories.append(result[i]['category_code'])
categories = set(categories)
# print(categories)

**Things to consider when filtering:**

- I want my company to be in a city with a **gaming industry hub**. Thus, I will select for those with **tags** related to video-games.
- Video game companies collaborate with and outsource work to other software and design companies, so I might want to have them near my company as well. Some **categories** are interesting: 'games_video', 'software' and 'design'. 
- I want the hub to be composed of **successful** companies, so the total amount of money they raised should be >1M ($/€).
- I could filter the database by these tags and categories and then group the companies by country/city to identify a potential hub.

In [48]:
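# Necessary filters: companies whose 'offices' field is not empty and whose total money raised
# is expressed in millions (M) or billions (B). Note that any amount with an M/B suffix passes.
office_filter = {'offices' : {"$ne" : []}}
# Raw string so that \d reaches the regex engine instead of being read as a Python escape
money_filter = {'total_money_raised' : {"$regex" : r".*\d+[MB]"}}
necessary = {"$and" : [office_filter, money_filter]}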

# I need a city with a gaming hub, but that also has design and tech companies nearby for my employees:
my_regex = ".*gaming.*|.*game.*|.*design.*|.*tech.*"
tag_filter = {'tag_list' : {"$regex": my_regex}}
overview_filter = {'overview' : {"$regex": my_regex}}
category_filter = {'category_code': {"$regex": my_regex}}

projection = {"_id":0, "name":1, 'total_money_raised':1, 'offices':1, 'tag_list':1}

gaming_companies = list(c.find({"$or": [
                                    {"$and": [necessary, tag_filter, category_filter]},
                                    {"$and": [necessary, overview_filter, category_filter]}
                                        ]}, projection))

len(gaming_companies)

430
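
The money filter keeps any amount written with an `M` or `B` suffix, so a value such as `$0.5M` would also pass. If a strict threshold of more than 1M is wanted, the string can be parsed into a number first. This is only a sketch: `parse_money` is a hypothetical helper, the `k`/`M`/`B` suffixes are an assumption about how `total_money_raised` is formatted, and the currency symbol is ignored.

```python
import re

_MULTIPLIERS = {"k": 1e3, "M": 1e6, "B": 1e9}
_MONEY_RE = re.compile(r"(\d+(?:\.\d+)?)([kMB])")


def parse_money(raw):
    """Convert strings such as '$13.2M' to a float; None if unparseable."""
    match = _MONEY_RE.search(raw or "")
    if match is None:
        return None
    value, suffix = match.groups()
    return float(value) * _MULTIPLIERS[suffix]


# Hypothetical sample values; only the first two exceed 1M
samples = ["$45M", "$13.2M", "$0.5M", "$750k", "$0"]
over_1m = [s for s in samples if (parse_money(s) or 0) > 1_000_000]
print(over_1m)
```

The same function could be applied to the query results in Python after a looser MongoDB filter, which keeps the regex simple.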

In [49]:
gaming_df = pd.DataFrame(gaming_companies)
gaming_df.head()

| | name | tag_list | total_money_raised | offices |
|---|---|---|---|---|
| 0 | Joost | iptv, babelgum, television, video, thevenicepr... | $45M | [{'description': '', 'address1': '100 5th Ave ... |
| 1 | Babelgum | iptv, web2ireland | $13.2M | [{'description': '', 'address1': '', 'address2... |
| 2 | Veoh | veoh, video, veohtv, socialvideo, videosharing... | $69.8M | [{'description': '', 'address1': '10180 Telesi... |
| 3 | YouTube | channels, movies, rentals, share, usergenerate... | $11.5M | [{'description': 'Corporate Headquarters', 'ad... |
| 4 | Pando Networks | p2p, video, streaming, download, cdn | $11M | [{'description': None, 'address1': '520 Broadw... |


**Checking enrichment / filtering**

In [50]:
# When I print the tags again I see an enrichment of gaming-related tags.

tags = []
for i in range(len(gaming_companies)):
    tags.append(gaming_companies[i]['tag_list'])
tags = set(tags)
# print(tags)

In [51]:
# I selected companies with millions of $ or € raised.

money = []
for i in range(len(gaming_companies)):
    money.append(gaming_companies[i]['total_money_raised'])
money = set(money)
# print(money)

**Extracting the city, state and country for each company**

In [52]:
# I want one row per office, since I want to count
# how many gaming industry offices there are per city/country.

gaming_df = gaming_df.explode('offices').reset_index(drop=True)

In [53]:
# Appending to empty lists to add as new columns:

city = []
state_code = []
country_code = []
latitude = []
longitude = []

for _, row in gaming_df.iterrows():
    # After explode, a company with no offices yields NaN instead of a dict.
    # Falling back to {} and using .get() returns None for anything missing,
    # so all five lists always stay the same length.
    office = row['offices'] if isinstance(row['offices'], dict) else {}
    city.append(office.get('city'))
    state_code.append(office.get('state_code'))
    country_code.append(office.get('country_code'))
    latitude.append(office.get('latitude'))
    longitude.append(office.get('longitude'))
    
gaming_df['city'] = city
gaming_df['state_code'] = state_code
gaming_df['country_code'] = country_code
gaming_df['latitude'] = latitude
gaming_df['longitude'] = longitude

gaming_df.head()

| | name | tag_list | total_money_raised | offices | city | state_code | country_code | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Joost | iptv, babelgum, television, video, thevenicepr... | $45M | {'description': '', 'address1': '100 5th Ave F... | New York | NY | USA | 40.746497 | -74.009447 |
| 1 | Babelgum | iptv, web2ireland | $13.2M | {'description': '', 'address1': '', 'address2'... | London | | GBR | 53.344104 | -6.267494 |
| 2 | Veoh | veoh, video, veohtv, socialvideo, videosharing... | $69.8M | {'description': '', 'address1': '10180 Telesis... | San Diego | CA | USA | 32.902266 | -117.20834 |
| 3 | YouTube | channels, movies, rentals, share, usergenerate... | $11.5M | {'description': 'Corporate Headquarters', 'add... | San Bruno | CA | USA | 37.627971 | -122.426804 |
| 4 | Pando Networks | p2p, video, streaming, download, cdn | $11M | {'description': None, 'address1': '520 Broadwa... | New York | NY | USA | 40.722655 | -73.99873 |
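
The explode-then-loop steps above could also be done directly on the list of documents returned by MongoDB, producing one flat record per office. A minimal sketch, assuming each office is a dict with the five keys used above; `flatten_offices`, `OFFICE_FIELDS` and the sample documents are hypothetical:

```python
OFFICE_FIELDS = ("city", "state_code", "country_code", "latitude", "longitude")


def flatten_offices(companies):
    """Return one flat record per office, keeping the company name."""
    rows = []
    for company in companies:
        # A missing or empty 'offices' list simply contributes no rows
        for office in company.get("offices") or []:
            row = {"name": company.get("name")}
            row.update({field: office.get(field) for field in OFFICE_FIELDS})
            rows.append(row)
    return rows


# Hypothetical documents shaped like the projected query results
sample = [
    {"name": "A", "offices": [
        {"city": "New York", "state_code": "NY", "country_code": "USA",
         "latitude": 40.7, "longitude": -74.0},
        {"city": "London", "country_code": "GBR"},  # missing keys become None
    ]},
    {"name": "B", "offices": []},
]
rows = flatten_offices(sample)
print(len(rows))
```

Passing `rows` to `pd.DataFrame` would give a frame with the name and location columns only, which also makes the later column drop unnecessary.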


**Cleaning and exploring**

In [54]:
# Dropping the columns I will not use: I am interested in the gaming hub location.

gaming_df = gaming_df.drop(columns = ['tag_list', 'offices', 'total_money_raised'])
gaming_df.head()

| | name | city | state_code | country_code | latitude | longitude |
|---|---|---|---|---|---|---|
| 0 | Joost | New York | NY | USA | 40.746497 | -74.009447 |
| 1 | Babelgum | London | | GBR | 53.344104 | -6.267494 |
| 2 | Veoh | San Diego | CA | USA | 32.902266 | -117.20834 |
| 3 | YouTube | San Bruno | CA | USA | 37.627971 | -122.426804 |
| 4 | Pando Networks | New York | NY | USA | 40.722655 | -73.99873 |


In [55]:
# Top 3 countries with the most offices
gaming_df.groupby('country_code').size().sort_values(ascending=False).head(3)

country_code
USA    341
GBR     23
DEU     18
dtype: int64

In [56]:
# Top 3 states with the most offices
gaming_df.groupby('state_code').size().sort_values(ascending=False).head(3)

state_code
CA    161
MA     28
NY     25
dtype: int64

In [57]:
# Top 6 entries by city with the most offices (one entry is an empty city name)
gaming_df.groupby('city').size().sort_values(ascending=False).head(6)

city
San Francisco          41
                       21
New York               21
Cambridge              11
South San Francisco     9
London                  8
dtype: int64

In [58]:
# cities_with_space = gaming_df[gaming_df['city'] == ""]
# cities_with_space

# Only 8 of the 21 offices with an empty city name have latitude/longitude data.
# Of those 8, 3 are in CA (USA), 2 in other US states, 1 in SWE, 1 in NLD and 1 in DEU,
# so they would not affect the ranking.
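
To keep the blank city names out of the ranking altogether, the count can skip them explicitly. A small sketch on plain dicts; `top_cities` and the sample rows are hypothetical and only illustrate the idea:

```python
from collections import Counter


def top_cities(rows, n=5):
    """Rank cities by number of offices, ignoring missing or blank names."""
    counts = Counter(
        row["city"].strip()
        for row in rows
        if isinstance(row.get("city"), str) and row["city"].strip()
    )
    return counts.most_common(n)


# Hypothetical rows, one per office
sample_rows = [
    {"city": "San Francisco"},
    {"city": "San Francisco"},
    {"city": ""},
    {"city": None},
    {"city": "New York"},
]
ranking = top_cities(sample_rows)
print(ranking)
```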

**Next steps:**
1. San Francisco seems like the best gaming/tech/design industry hub.  
2. New York, Cambridge or London could also be good options.
    - Cambridge is much smaller than London: good for density of companies, bad for my employees' interests
3. Now I will have to see if these cities have all the requirements.

# Building maps for each city
This is to see whether these companies cluster in one area of each city.

In [59]:
# subset of the dataframe: San Francisco
sanfrancisco = gaming_df[(gaming_df['city'] == "San Francisco") | (gaming_df['city'] == "South San Francisco")]
sanfrancisco = sanfrancisco[sanfrancisco['latitude'].notna()]
sanfrancisco.shape

(38, 6)

In [60]:
# subset of the dataframe: New York
newyork = gaming_df[(gaming_df['city'] == "New York")]
newyork = newyork[newyork['latitude'].notna()]
newyork.shape

(15, 6)

In [61]:
# subset of the dataframe: Cambridge
cambridge = gaming_df[(gaming_df['city'] == "Cambridge")]
cambridge = cambridge[cambridge['latitude'].notna()]
cambridge.shape

(5, 6)

In [62]:
# subset of the dataframe: London
london = gaming_df[(gaming_df['city'] == "London")]
london = london[london['latitude'].notna()]
london.shape

(5, 6)
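
Besides looking at a map, clustering can be checked numerically, for example as the mean distance from each office to the centroid of all offices in a city. The sketch below uses the haversine formula. `haversine_km` and `spread_km` are hypothetical helpers, and averaging latitudes and longitudes is an approximation that is only reasonable for points within a single city.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def spread_km(points):
    """Mean distance (km) from each point to the centroid of all points."""
    lat_c = sum(lat for lat, _ in points) / len(points)
    lon_c = sum(lon for _, lon in points) / len(points)
    total = sum(haversine_km(lat, lon, lat_c, lon_c) for lat, lon in points)
    return total / len(points)


# The two New York offices shown in the head() output above
new_york = [(40.746497, -74.009447), (40.722655, -73.99873)]
ny_spread = spread_km(new_york)
print(round(ny_spread, 2))
```

The same function could be fed `list(zip(df['latitude'], df['longitude']))` for each city subset; a smaller value suggests a tighter cluster. `haversine_km` could also serve for distance requirements such as the basketball stadium one.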