# Notebook for: Database filtering and extraction

Find the best place to locate a gaming company.

**Scheme of the company:**

- 20 Designers
- 5 UI/UX Engineers
- 10 Frontend Developers
- 15 Data Engineers
- 5 Backend Developers
- 20 Account Managers
- 1 Maintenance guy that loves basketball
- 10 Executives
- 1 CEO/President.


**Requirements:**  

- Designers like to go to design talks and share knowledge. There must be some nearby companies that also do design.
- 30% of the company staff have at least 1 child.
- Developers like to be near successful tech startups that have raised at least 1 Million dollars.
- Executives like Starbucks A LOT. Ensure there's a Starbucks not too far away.
- Account managers need to travel a lot.
- Everyone in the company is between 25 and 40, give them some place to go party.
- The CEO is vegan.
- If you want to make the maintenance guy happy, there must be a basketball stadium within about 10 km.
- The office dog, "Dobby", needs a hairdresser every month. Ensure there's one not too far away.

# Libraries

In [1]:
from pymongo import MongoClient
import pandas as pd
import time
import re
import io
import folium
from folium import Choropleth, Circle, Marker, Icon, Map

# Extracting companies from the database

**Establishing connection to the database**

In [2]:
client = MongoClient("localhost:27017")
db = client["ironhack"]
c = db.get_collection("crunchbase")
c

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'ironhack'), 'crunchbase')

In [3]:
# Getting all the companies:
result = list(c.find())

**Exploring the *tags* and *categories***

In [4]:
# Looking at the distinct tags in the database:

tags = []
for i in range(len(result)):
    tags.append(result[i]['tag_list'])
tags = set(tags)
# print(tags)

In [5]:
# Looking at the distinct categories in the database:

categories = []
for i in range(len(result)):
    categories.append(result[i]['category_code'])
categories = set(categories)
# print(categories)
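
Because `tag_list` is stored as one comma-separated string per company, `set(tags)` yields distinct tag *strings* rather than distinct tags. The following is a minimal sketch, not part of the original notebook, of how individual tags could be counted instead; `count_tags` and the sample documents are hypothetical.

```python
from collections import Counter


def count_tags(documents):
    """Count individual tags across documents whose 'tag_list' is a comma-separated string."""
    counts = Counter()
    for doc in documents:
        raw = doc.get("tag_list") or ""  # 'tag_list' may be missing or None
        for tag in raw.split(","):
            tag = tag.strip().lower()  # normalise spacing and case
            if tag:
                counts[tag] += 1
    return counts


# Hypothetical sample documents shaped like the crunchbase records
sample = [
    {"tag_list": "games, social, Mobile"},
    {"tag_list": "mobile, games"},
    {"tag_list": None},
]
tag_counts = count_tags(sample)
print(tag_counts.most_common(5))
```

In the notebook this could be called as `count_tags(result)` to see which gaming-related tags are most frequent.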

**Things to consider when filtering:**

- I want my company to be in a city with a **gaming industry hub**. Thus, I will select for those with **tags** related to video-games.
- Videogame companies collaborate with and outsource to other software and design companies, so I might want to have them near my company as well. Some **categories** are interesting: 'games_video', 'software' and 'design'.
- I want the hub to be composed of **successful** companies, so the total amount of money they raised should be >1M ($/€).
- I could filter the database by these tags and categories and then group the companies by country/city to identify a potential hub.
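
The ">1M" criterion can be checked numerically rather than by pattern alone. A value such as `"$0.5M"` would also match a pattern like `\d+[MB]`. The sketch below assumes that `total_money_raised` strings consist of a currency symbol, a number and an optional `k`, `M` or `B` suffix. `parse_money` is a hypothetical helper, not something defined in the notebook.

```python
import re

# Assumed suffixes: thousands (k), millions (M), billions (B)
_MULTIPLIERS = {"k": 1e3, "M": 1e6, "B": 1e9}
_MONEY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([kMB])?$")


def parse_money(text):
    """Convert a string such as '$45M' or '€1.5B' to a float; return None if unparseable."""
    if not text:
        return None
    match = _MONEY_RE.search(text.strip())
    if match is None:
        return None
    number, suffix = match.groups()
    return float(number) * _MULTIPLIERS.get(suffix, 1.0)


def raised_over(text, threshold=1e6):
    """True when the parsed amount is strictly greater than the threshold."""
    amount = parse_money(text)
    return amount is not None and amount > threshold


print(raised_over("$45M"), raised_over("$0.5M"), raised_over("$500k"))
```

Currency conversion is deliberately ignored here, in line with the ">1M ($/€)" wording above.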

**Only gaming companies: 285**

In [35]:
# Necessary filters: companies whose 'offices' field is not empty and whose total raised is expressed in millions (M) or billions (B) of $/€
office_filter = {'offices' : {"$ne" : []}}
money_filter = {'total_money_raised' : {"$regex" : r".*\d+[MB]"}}
necessary = {"$and" : [office_filter, money_filter]}

# If I only consider gaming companies:
tag_filter = {'tag_list' : {"$regex": ".*gaming.*|.*game.*"}}
overview_filter = {'overview' : {"$regex": ".*gaming.*|.*game.*"}}

projection = {"_id":0, "name":1, 'total_money_raised':1, 'offices':1, 'tag_list':1}

companies = list(c.find({"$or": [
                                    {"$and": [necessary, tag_filter]},
                                    {"$and": [necessary, overview_filter]}
                                        ]}, projection))

len(companies)

285

**Gaming / design / tech companies: 430**

Designers and developers account for 41% of the company, so it makes sense to satisfy their requirements by identifying a gaming-design-tech hub.

In [7]:
# Necessary filters: companies whose 'offices' field is not empty and whose total raised is expressed in millions (M) or billions (B) of $/€
office_filter = {'offices' : {"$ne" : []}}
money_filter = {'total_money_raised' : {"$regex" : r".*\d+[MB]"}}
necessary = {"$and" : [office_filter, money_filter]}

# I need a city with a gaming hub, but that also has design and tech companies nearby for my employees:
my_regex = ".*gaming.*|.*game.*|.*design.*|.*tech.*"
tag_filter = {'tag_list' : {"$regex": my_regex}}
overview_filter = {'overview' : {"$regex": my_regex}}
category_filter = {'category_code': {"$regex": my_regex}}

projection = {"_id":0, "name":1, 'total_money_raised':1, 'offices':1, 'tag_list':1}

companies = list(c.find({"$or": [
                                    {"$and": [necessary, tag_filter, category_filter]},
                                    {"$and": [necessary, overview_filter, category_filter]}
                                        ]}, projection))

len(companies)

430
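
Since `(A and B) or (A and C)` is equivalent to `A and (B or C)`, the gaming-only filter can be written more compactly. The sketch below is one possible way to build it; `build_hub_query` is a hypothetical helper. It relies on MongoDB combining top-level keys with an implicit AND, and on `$regex` being unanchored, so the leading and trailing `.*` are not needed. The optional `ignore_case` flag uses the `$options: "i"` modifier and would change which companies match, so it is off by default.

```python
import re


def build_hub_query(keywords, ignore_case=False):
    """Build a MongoDB filter: has offices AND raised M/B amounts AND (tag OR overview matches)."""
    pattern = "|".join(re.escape(word) for word in keywords)
    text_match = {"$regex": pattern}
    if ignore_case:
        text_match["$options"] = "i"  # case-insensitive matching
    return {
        "offices": {"$ne": []},
        "total_money_raised": {"$regex": r"\d+[MB]"},
        "$or": [{"tag_list": text_match}, {"overview": text_match}],
    }


query = build_hub_query(["gaming", "game"])
print(query)
# Usage in the notebook (assumes `c` and `projection` as defined above):
# companies = list(c.find(query, projection))
```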

In [36]:
companies_df = pd.DataFrame(companies)
companies_df.head()

| | name | tag_list | total_money_raised | offices |
|---|---|---|---|---|
| 0 | Digg | community, social, news, bookmark, digg, techn... | $45M | [{'description': None, 'address1': '135 Missis... |
| 1 | Gizmoz | avatar, widget, 3d, facial, taylor | $18.1M | [{'description': None, 'address1': None, 'addr... |
| 2 | RockYou | widgets, photos, avatars, countdown, horoscope... | $136M | [{'description': '', 'address1': '585 Broadway... |
| 3 | spigit | | $55.6M | [{'description': '', 'address1': '311 Ray Stre... |
| 4 | hi5 | hi5, socialnetworking, musicnetwork, youyu | $52M | [{'description': '', 'address1': '55 Second St... |


**Checking enrichment / filtering**

In [37]:
# When I print the tags again I see an enrichment of gaming-related tags.

tags = []
for i in range(len(companies)):
    tags.append(companies[i]['tag_list'])
tags = set(tags)
# print(tags)

In [38]:
# I selected companies with millions of $ or € raised.

money = []
for i in range(len(companies)):
    money.append(companies[i]['total_money_raised'])
money = set(money)
# print(money)

**Extracting the city, state and country for each company**

In [39]:
# I want every office of a company added as its own row,
# since I want to count how many gaming industry offices there are per city/country.

companies_df = companies_df.explode('offices')
companies_df = companies_df.reset_index(drop=True)

In [40]:
# Appending to empty lists to add as new columns:

city = []
state_code = []
country_code = []
latitude = []
longitude = []

for index, row in companies_df.iterrows():
    try:
        city.append(row['offices']['city'])
        state_code.append(row['offices']['state_code'])
        country_code.append(row['offices']['country_code'])
        latitude.append(row['offices']['latitude'])
        longitude.append(row['offices']['longitude'])
    except TypeError:  # 'offices' is not a dict (e.g. NaN), so it cannot be indexed by key
        city.append(None)
        state_code.append(None)
        country_code.append(None)
        latitude.append(None)
        longitude.append(None)
    
companies_df['city'] = city
companies_df['state_code'] = state_code
companies_df['country_code'] = country_code
companies_df['latitude'] = latitude
companies_df['longitude'] = longitude

companies_df.head()

| | name | tag_list | total_money_raised | offices | city | state_code | country_code | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Digg | community, social, news, bookmark, digg, techn... | $45M | {'description': None, 'address1': '135 Mississ... | San Francisco | CA | USA | 37.764726 | -122.394523 |
| 1 | Gizmoz | avatar, widget, 3d, facial, taylor | $18.1M | {'description': None, 'address1': None, 'addre... | Menlo Park | CA | USA | 37.48413 | -122.169472 |
| 2 | RockYou | widgets, photos, avatars, countdown, horoscope... | $136M | {'description': '', 'address1': '585 Broadway'... | Redwood City | CA | USA | 37.484619 | -122.206893 |
| 3 | spigit | | $55.6M | {'description': '', 'address1': '311 Ray Stree... | Pleasanton | CA | USA | 37.663728 | -121.873181 |
| 4 | spigit | | $55.6M | {'description': 'One Freedom Drive', 'address1... | Reston | VA | USA | 38.959008 | -77.359275 |
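
The five parallel lists could also be replaced by a single helper that returns all location fields for one office. This is a minimal sketch under the assumption that each exploded `offices` value is either a dict or a non-dict placeholder such as NaN. `office_fields` and `OFFICE_FIELDS` are hypothetical names, and the second and third sample rows are made up for illustration.

```python
OFFICE_FIELDS = ("city", "state_code", "country_code", "latitude", "longitude")


def office_fields(office):
    """Return the location fields of one office; anything missing becomes None."""
    if not isinstance(office, dict):  # e.g. NaN left behind for a company without offices
        office = {}
    return {field: office.get(field) for field in OFFICE_FIELDS}


rows = [
    {"city": "San Francisco", "state_code": "CA", "country_code": "USA",
     "latitude": 37.764726, "longitude": -122.394523},
    {"city": "London", "country_code": "GBR"},  # hypothetical office with missing keys
    float("nan"),                               # hypothetical non-dict value
]
extracted = [office_fields(office) for office in rows]
print(extracted[1])
```

In the notebook, the result could be attached with something like `companies_df.join(pd.DataFrame([office_fields(o) for o in companies_df['offices']], index=companies_df.index))`, which keeps all five columns aligned even when a key is missing.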


**Cleaning and exploring**

In [41]:
# Dropping the columns I will not use: I am interested in the gaming hub location.

companies_df = companies_df.drop(columns = ['tag_list', 'offices', 'total_money_raised'])
companies_df.head()

| | name | city | state_code | country_code | latitude | longitude |
|---|---|---|---|---|---|---|
| 0 | Digg | San Francisco | CA | USA | 37.764726 | -122.394523 |
| 1 | Gizmoz | Menlo Park | CA | USA | 37.48413 | -122.169472 |
| 2 | RockYou | Redwood City | CA | USA | 37.484619 | -122.206893 |
| 3 | spigit | Pleasanton | CA | USA | 37.663728 | -121.873181 |
| 4 | spigit | Reston | VA | USA | 38.959008 | -77.359275 |


In [42]:
# Top 3 countries with the most companies
companies_df.groupby('country_code').size().sort_values(ascending=False).head(3)

country_code
USA    221
GBR     26
ISR     15
dtype: int64

In [43]:
# Top 3 states with the most companies
companies_df.groupby('state_code').size().sort_values(ascending=False).head(3)

state_code
CA    134
NY     24
WA     14
dtype: int64

In [44]:
# Top 6 cities with the most companies (the blank entry groups offices with an empty city name)
companies_df.groupby('city').size().sort_values(ascending=False).head(6)

city
San Francisco    53
New York         22
London           13
                 11
Los Angeles      10
Beijing           8
dtype: int64

In [45]:
companies_df.groupby(['city','country_code']).size().sort_values(ascending=False).head(10)

city           country_code
San Francisco  USA             53
New York       USA             21
London         GBR             13
Los Angeles    USA             10
Beijing        CHN              8
Seattle        USA              8
Mountain View  USA              6
Austin         USA              6
Tokyo          JPN              6
Stockholm      SWE              5
dtype: int64

In [46]:
# cities_with_space = gaming_df[gaming_df['city'] == ""]
# cities_with_space

# Only 8 of the 21 rows with an empty city name have latitude/longitude data.
# Of those 8, 3 are in CA (USA), 2 in other US states, 1 in SWE, 1 in NLD and 1 in DEU,
# so they would not affect the ranking.

**Comparison of hubs: gaming vs gaming-design-tech**

If I consider the **gaming hub only**, then the top cities are:
1. San Francisco: 53
2. New York: 21
3. London (GBR): 13
4. Los Angeles: 10
5. Beijing: 8

On the other hand, if I consider a **gaming-design-tech hub**, then the top cities are:
1. San Francisco: 41
2. New York: 21
3. South San Francisco: 9
4. Cambridge (USA): 8
5. London (GBR): 7

**Next steps**
- **San Francisco** and **New York** seem like the best cities **in general**.
- **London (GBR)** and **Cambridge (USA)** could also be good options. Between the two I would choose **London**, since there are more gaming companies.
- I compared the maps filtering for gaming companies only **(gaming hub)** and filtering for **gaming-design-tech**, and the companies tend to cluster in the same places irrespective of the kind of hub. Therefore, the choice of filter doesn't make much of a difference.
- Now I will have to see whether these cities meet all the requirements.

# Building maps for each city
The maps show whether these companies cluster in one region of each city, which is where I would locate my company.

**San Francisco**

In [47]:
# Subset of the dataframe: San Francisco
sanfrancisco = companies_df[(companies_df['city'] == "San Francisco") | (companies_df['city'] == "South San Francisco")]
sanfrancisco = sanfrancisco[sanfrancisco['latitude'].notna()]
sanfrancisco.shape

(44, 6)

In [48]:
# Building map
sanfran_lat, sanfran_lon = 37.709209, -122.414242
map_sanfran = Map(location = [sanfran_lat, sanfran_lon], zoom_start = 10)

# Adding markers
for index, row in sanfrancisco.iterrows():
    company = {"location": [row["latitude"], row["longitude"]], "tooltip": row["name"]}
    new_marker = Marker(**company, radius = 2)
    new_marker.add_to(map_sanfran)

In [69]:
# map_sanfran # Company in the center: Globant (37.781929,-122.404176) / Tagged (37.775300,-122.418600)
# sanfrancisco

- Most companies cluster near the **north-east coast** of San Francisco.
- Some companies categorized as belonging to San Francisco are actually in Palo Alto.
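
The "company in the center" can also be picked programmatically instead of by eye. The sketch below is not the method used in the notebook. It takes the arithmetic mean of the coordinates as the centre, a reasonable approximation at city scale, and uses the haversine formula to find the office closest to it. `haversine_km` and `most_central` are hypothetical helpers; the three sample coordinates are the ones quoted in this notebook.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def most_central(offices):
    """offices: list of (name, lat, lon). Return the mean point and the office closest to it."""
    lat_c = sum(lat for _, lat, _ in offices) / len(offices)
    lon_c = sum(lon for _, _, lon in offices) / len(offices)
    closest = min(offices, key=lambda o: haversine_km(lat_c, lon_c, o[1], o[2]))
    return (lat_c, lon_c), closest[0]


offices = [
    ("Digg", 37.764726, -122.394523),
    ("Globant", 37.781929, -122.404176),
    ("Tagged", 37.775300, -122.418600),
]
center, central_company = most_central(offices)
print(center, central_company)
```

The same `haversine_km` function could later serve for distance requirements, for example checking `haversine_km(office_lat, office_lon, venue_lat, venue_lon) <= 10` for the basketball stadium.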

**New York**

In [51]:
# Subset of the dataframe: New York
newyork = companies_df[(companies_df['city'] == "New York")]
newyork = newyork[newyork['latitude'].notna()]
newyork.shape

(13, 6)

In [52]:
# Building map
newyork_lat, newyork_lon = 40.730610, -73.935242
map_newyork = Map(location = [newyork_lat, newyork_lon], zoom_start = 10)

# Adding markers
for index, row in newyork.iterrows():
    company = {"location": [row["latitude"], row["longitude"]], "tooltip": row["name"]}
    new_marker = Marker(**company, radius = 2)
    new_marker.add_to(map_newyork)

In [62]:
# map_newyork # Company right in the center: Cellufun (40.739930,-73.993049)
# newyork

**London (GBR)**

In [54]:
# Subset of the dataframe: London
london = companies_df[(companies_df['city'] == "London")]
london = london[london['latitude'].notna()]
print(london.shape)
print(london.groupby('country_code').size())

(8, 6)
country_code
GBR    8
dtype: int64


In [55]:
# Building map
london_lat, london_lon = 51.509865, -0.118092
map_london = Map(location = [london_lat, london_lon], zoom_start = 12)

# Adding markers
for index, row in london.iterrows():
    company = {"location": [row["latitude"], row["longitude"]], "tooltip": row["name"]}
    new_marker = Marker(**company, radius = 2)
    new_marker.add_to(map_london)

In [60]:
# map_london # Company in the middle: spigit (51.517038, 0.139476)
# london

**Cambridge (USA)**

In [57]:
# Subset of the dataframe: Cambridge
cambridge = companies_df[(companies_df['city'] == "Cambridge")]
cambridge = cambridge[cambridge['latitude'].notna()]
print(cambridge.shape)
print(cambridge.groupby('country_code').size())

(2, 6)
country_code
USA    2
dtype: int64


In [58]:
# Building map
cambridge_lat, cambridge_lon = 42.373611, -71.110558
map_cambridge = Map(location = [cambridge_lat, cambridge_lon], zoom_start = 12)

# Adding markers
for index, row in cambridge.iterrows():
    company = {"location": [row["latitude"], row["longitude"]], "tooltip": row["name"]}
    new_marker = Marker(**company, radius = 2)
    new_marker.add_to(map_cambridge)

In [33]:
# map_cambridge # Company right in the center: Aileron Therapeutics

**Conclusions:**
- San Francisco (USA), New York (USA) and London (GBR) are the best cities based on their gaming / gaming-design-tech hubs, which cover the requirements of the company's designers and developers.
- Next, I will compare the three cities on the rest of the requirements.