**Instructions**

Gaming industry.

**Scheme of the company:**

- 20 Designers
- 5 UI/UX Engineers
- 10 Frontend Developers
- 15 Data Engineers
- 5 Backend Developers
- 20 Account Managers
- 1 Maintenance guy that loves basketball
- 10 Executives
- 1 CEO/President.


**Requirements:**  

- Designers like to go to design talks and share knowledge. There must be some nearby companies that also do design.
- 30% of the company staff have at least 1 child.
- Developers like to be near successful tech startups that have raised at least 1 Million dollars.
- Executives like Starbucks A LOT. Ensure there's a Starbucks not too far away.
- Account managers need to travel a lot.
- Everyone in the company is between 25 and 40, so give them some place to go party.
- The CEO is vegan.
- If you want to make the maintenance guy happy, there must be a basketball stadium within about 10 km.
- The office dog, "Dobby", needs a hairdresser every month. Ensure there's one not too far away.
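Because each requirement concerns a group of a different size, it can help to tally how many people each one affects before comparing locations. The sketch below uses the headcounts from the scheme above. The requirement labels and the grouping of roles into "developers" (frontend, backend, data engineers and UI/UX) are my own assumptions, not something the brief states.

```python
# Headcounts copied from the company scheme above.
staff = {
    "Designers": 20,
    "UI/UX Engineers": 5,
    "Frontend Developers": 10,
    "Data Engineers": 15,
    "Backend Developers": 5,
    "Account Managers": 20,
    "Maintenance": 1,
    "Executives": 10,
    "CEO/President": 1,
}
total = sum(staff.values())

# Assumption: which roles count as "developers" is an illustrative grouping.
developer_roles = [
    "Frontend Developers",
    "Backend Developers",
    "Data Engineers",
    "UI/UX Engineers",
]

# Assumption: the labels below are one reading of each requirement.
affected = {
    "design companies nearby": staff["Designers"],
    "tech startups nearby": sum(staff[role] for role in developer_roles),
    "Starbucks nearby": staff["Executives"],
    "travel connections": staff["Account Managers"],
    "schools or daycare": round(0.30 * total),  # 30% of staff have a child
    "nightlife": total,
    "vegan restaurant": staff["CEO/President"],
    "basketball stadium": staff["Maintenance"],
}

# Print requirements from most to fewest people affected.
for requirement, people in sorted(affected.items(), key=lambda kv: -kv[1]):
    print(f"{requirement:<26} {people:>3} people ({people / total:.0%})")
```

A tally like this could later serve as weights when scoring candidate cities.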

**Libraries**

In [1]:
from pymongo import MongoClient
import pandas as pd
import time

**Extracting**

In [2]:
# Establishing connection to the database:

client = MongoClient("localhost:27017")
db = client["ironhack"]
c = db.get_collection("crunchbase")
c

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'ironhack'), 'crunchbase')

In [4]:
# Getting all the companies:
result = list(c.find())

In [107]:
# Looking at the distinct categories in the database:

categories = {company['category_code'] for company in result}
# print(categories)

In [33]:
# Looking at the distinct tags in the database:

tags = {company['tag_list'] for company in result}
# print(tags)
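The cells above load every document into memory and deduplicate in Python. As an alternative sketch, pymongo's `Collection.distinct` asks the server for the unique values of a field directly. `distinct_values` is a hypothetical helper name, and it expects a collection object such as the `c` defined earlier.

```python
def distinct_values(collection, field):
    """Return the unique non-null values of `field`, sorted for easier reading."""
    values = collection.distinct(field)  # deduplication happens on the server
    return sorted((v for v in values if v is not None), key=str)

# Usage with the notebook's collection (requires the running database):
# categories = distinct_values(c, "category_code")
# tags = distinct_values(c, "tag_list")
```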

**Things to consider when filtering:**

- The category 'games_video' is the most relevant one.
- The categories 'design' and 'software' might also be interesting, since video game companies might want to outsource to or collaborate with other companies.
- I would also match tags related to video games using a regex, since I would like my company to be in a **gaming industry hub.**
- I could filter the companies in the database by all of these and group them by country/city to identify a potential hub.
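To get a feel for what a tag regex will and will not catch, the pattern can be tried on a few tag strings before running it against MongoDB. This is a small sketch with Python's `re` module. The `strict` pattern is an illustrative alternative of my own, and the last sample string is made up.

```python
import re

# Broad pattern: any of these substrings anywhere in the tag string.
broad = re.compile(r"design|gaming|game|tech|develop")
# Assumption: a narrower, game-only alternative for comparison.
strict = re.compile(r"gam(?:e|ing)")

samples = [
    "games, videos, downloads, free, adware",
    "techcrunch50, tc50, file-storage",
    "file-storage, backup",  # made-up example with no relevant tag
]

for tag_string in samples:
    is_broad = bool(broad.search(tag_string))
    is_strict = bool(strict.search(tag_string))
    print(f"{tag_string!r}: broad={is_broad}, strict={is_strict}")
```

Note that the `tech` alternative also matches tags such as `techcrunch50`, so a broad pattern lets general startups through alongside game companies.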

In [108]:
# I retrieve companies whose 'offices' field is not empty,
# whose money raised is expressed in millions (M) or billions (B), in any currency,
# and whose tags relate to games, design, tech or development.

office_filter = {'offices': {"$ne": []}}
# Raw string avoids the invalid '\d' escape; $regex is unanchored, so no leading '.*' is needed.
money_filter = {'total_money_raised': {"$regex": r"\d+[MB]"}}
tag_filter = {'tag_list': {"$regex": "design|gaming|game|tech|develop"}}
category_1 = {'category_code': 'games_video'}
category_2 = {'category_code': 'software'}
category_3 = {'category_code': 'design'}

projection = {"_id": 0, "name": 1, 'total_money_raised': 1, 'offices': 1, 'tag_list': 1}

gaming_companies = list(c.find(
    {"$and": [office_filter, money_filter, tag_filter,
              {"$or": [category_1, category_2, category_3]}]},
    projection))
len(gaming_companies)
len(gaming_companies)

111
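The same conditions can also be written as a single filter document. This is a sketch of an equivalent, more compact form rather than what the cell above ran. `compact_filter` is a hypothetical name.

```python
# Keys listed side by side in a MongoDB filter are combined with AND,
# and "$in" replaces the "$or" over the three category values.
# "$regex" matches anywhere in the string, so the surrounding ".*" can be dropped.
compact_filter = {
    "offices": {"$ne": []},
    "total_money_raised": {"$regex": r"\d+[MB]"},
    "tag_list": {"$regex": "design|gaming|game|tech|develop"},
    "category_code": {"$in": ["games_video", "software", "design"]},
}

# Usage against the notebook's collection (requires the running database):
# gaming_companies = list(c.find(compact_filter, projection))
```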

In [147]:
gaming_df = pd.DataFrame(gaming_companies)
gaming_df

| | name | tag_list | total_money_raised | offices |
|---|---|---|---|---|
| 0 | CastTV | videosearch, techcrunch40 | $3.1M | [{'description': None, 'address1': '374 Branna... |
| 1 | PodTech | videonetwork, onlinevideo, technologyvideo | $7.5M | [{'description': None, 'address1': '1801 Page ... |
| 2 | Metacafe | online-video, video-entertainment, online-ente... | $50M | [{'description': '', 'address1': '128 King Str... |
| 3 | Dropbox | techcrunch50, tc50, file-storage | $257M | [{'description': 'Headquarters', 'address1': '... |
| 4 | Zango | games, videos, downloads, free, adware | $40M | [{'description': None, 'address1': '3600 136th... |
| ... | ... | ... | ... | ... |
| 106 | Electric Cloud | application-lifecycle-management, software-pro... | $25.6M | [{'description': 'Corporate Headquarters', 'ad... |
| 107 | GameGround | achievements, gaming, online-gaming, share, ga... | $11.4M | [{'description': 'HQ', 'address1': '', 'addres... |
| 108 | Bigpoint | online-games, browser-games, browsergames, bro... | €420M | [{'description': 'Bigpoint Headquarters', 'add... |
| 109 | Exent | games, videogames, games-on-demand, video-games | $3M | [{'description': 'Sales & Marketing', 'address... |


In [148]:
# Printing the tags again shows a higher share of game-related tags.

tags = {company['tag_list'] for company in gaming_companies}
# print(tags)

In [149]:
# Did it work?
# Yes: I selected companies with millions of $ or € raised.
money = []
for i in range(len(gaming_companies)):
    money.append(gaming_companies[i]['total_money_raised'])
money = set(money)
print(money)

{'$12M', '$3.1M', '$40.9M', '$25M', '$15M', '$4.8M', '$200M', '$25.6M', '€1.75M', '$1.9M', '€5M', '$28M', '$40.8M', '€1.5M', 'C$3M', '$69M', '¥2.21B', '$100M', '$7M', '$12.1M', '$13M', '$76.6M', '$49.6M', '€3.66M', '$7.97M', '$9.4M', '$40M', '$37.7M', '€2.75M', '$1.2M', '$5M', '€420M', '$36.3M', '$2M', '$12.5M', '$84M', '$1M', '$4.36M', '$11.4M', '$10M', '$3M', '$125M', 'C$2.5M', '$15.5M', '$50M', '$46.3M', '$32.4M', '€4M', '€21.3M', '$13.9M', '$1.5M', '$34.8M', '$17.5M', '$3.97M', '$17M', '$10.8M', '$7.5M', '$44M', '$50.5M', '$16.6M', '$4M', '$1.13M', '$14.7M', '€3M', '$8.25M', '$3.2M', '$59.5M', '$44.8M', '$21M', '$42.7M', '$35M', '$82.8M', '$5.92M', '$2.7M', '¥464M', '$14.5M', '$1.1M', '$3.5M', '$11.3M', '$257M', '$860M', '$10.3M', '$17.1M', '$16.3M'}
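The amounts above are strings in several currencies, so they cannot be compared or summed directly. A minimal sketch of how they could be turned into numbers follows. It assumes every value looks like `<currency><number><k|M|B>`. `parse_money`, `AMOUNT` and `MULTIPLIER` are hypothetical names, and no currency conversion is attempted.

```python
import re

# Assumption: amounts look like "$3.1M", "C$2.5M" or "¥2.21B".
AMOUNT = re.compile(r"^(?P<currency>\D*)(?P<number>\d+(?:\.\d+)?)(?P<unit>[kKMB])$")
MULTIPLIER = {"k": 1e3, "K": 1e3, "M": 1e6, "B": 1e9}

def parse_money(text):
    """Return (currency_symbol, amount), or None if the string does not match."""
    match = AMOUNT.match(text.strip())
    if match is None:
        return None
    amount = float(match["number"]) * MULTIPLIER[match["unit"]]
    return match["currency"], amount

examples = ["$3.1M", "€420M", "C$2.5M", "¥2.21B", "$1M"]
for raw in examples:
    print(raw, "->", parse_money(raw))
```

Since the regex filter treats all currencies alike, an amount such as `¥464M` passes even though it is worth far less than the same figure in dollars or euros.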


In [150]:
# I want one row per office instead of one row per company,
# since I want to count how many gaming industry offices there are per city/country.

gaming_df = gaming_df.explode('offices')
gaming_df = gaming_df.reset_index(drop=True)

In [151]:
# Checking:
gaming_df.head()

| | name | tag_list | total_money_raised | offices |
|---|---|---|---|---|
| 0 | CastTV | videosearch, techcrunch40 | $3.1M | {'description': None, 'address1': '374 Brannan... |
| 1 | PodTech | videonetwork, onlinevideo, technologyvideo | $7.5M | {'description': None, 'address1': '1801 Page M... |
| 2 | Metacafe | online-video, video-entertainment, online-ente... | $50M | {'description': '', 'address1': '128 King Stre... |
| 3 | Dropbox | techcrunch50, tc50, file-storage | $257M | {'description': 'Headquarters', 'address1': ''... |
| 4 | Dropbox | techcrunch50, tc50, file-storage | $257M | {'description': 'EMEA HQ', 'address1': 'Fitzwi... |
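Dropbox now appears twice, once per office. The toy example below shows the same `explode` behaviour in isolation. The company names and cities are made up, and `ignore_index=True` (available in pandas 1.1 and later) renumbers the rows in the same call.

```python
import pandas as pd

# Made-up data: company "A" has two offices, company "B" has one.
toy = pd.DataFrame({
    "name": ["A", "B"],
    "offices": [
        [{"city": "Dublin"}, {"city": "London"}],
        [{"city": "Berlin"}],
    ],
})

# One row per list element; the other columns are repeated for each element.
exploded = toy.explode("offices", ignore_index=True)
print(exploded)
```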


In [152]:
# Appending to empty lists to add as new columns:

city = []
state_code = []
country_code = []
latitude = []
longitude = []

for index, row in gaming_df.iterrows():
    try:
        city.append(row['offices']['city'])
        state_code.append(row['offices']['state_code'])
        country_code.append(row['offices']['country_code'])
        latitude.append(row['offices']['latitude'])
        longitude.append(row['offices']['longitude'])
    except (KeyError, TypeError):  # missing key, or 'offices' is not a dict (e.g. NaN)
        city.append(None)
        state_code.append(None)
        country_code.append(None)
        latitude.append(None)
        longitude.append(None)
    
gaming_df['city'] = city
gaming_df['state_code'] = state_code
gaming_df['country_code'] = country_code
gaming_df['latitude'] = latitude
gaming_df['longitude'] = longitude

gaming_df

| | name | tag_list | total_money_raised | offices | city | state_code | country_code | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CastTV | videosearch, techcrunch40 | $3.1M | {'description': None, 'address1': '374 Brannan... | San Francisco | CA | USA | 37.780716 | -122.393913 |
| 1 | PodTech | videonetwork, onlinevideo, technologyvideo | $7.5M | {'description': None, 'address1': '1801 Page M... | Palo Alto | CA | USA | 37.408256 | -122.154176 |
| 2 | Metacafe | online-video, video-entertainment, online-ente... | $50M | {'description': '', 'address1': '128 King Stre... | San Francisco | CA | USA | 37.437328 | -122.159928 |
| 3 | Dropbox | techcrunch50, tc50, file-storage | $257M | {'description': 'Headquarters', 'address1': ''... | San Francisco | CA | USA | 37.790943 | -122.408499 |
| 4 | Dropbox | techcrunch50, tc50, file-storage | $257M | {'description': 'EMEA HQ', 'address1': 'Fitzwi... | Dublin | | IRL | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 193 | Exent | games, videogames, games-on-demand, video-games | $3M | {'description': 'Sales & Marketing', 'address1... | New York | NY | USA | 40.752380 | -74.005568 |
| 194 | Exent | games, videogames, games-on-demand, video-games | $3M | {'description': 'Headquarters', 'address1': '2... | Petach-Tikva | | ISR | | |
| 195 | Exent | games, videogames, games-on-demand, video-games | $3M | {'description': 'Premium Services', 'address1'... | San Francisco | CA | USA | 37.787646 | -122.402759 |
| 196 | Clavis Technology | online-store-audit, consumer-products, ecommer... | €3.66M | {'description': '', 'address1': '7th Floor,', ... | Dublin | | IRL | | |
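The same columns can be built without a row loop. The sketch below is one possible alternative, shown on a toy frame with illustrative values. It uses `dict.get`, so a missing key or a cell that is not a dict yields a missing value instead of raising.

```python
import pandas as pd

# Toy stand-in for the exploded frame; the values are illustrative only.
toy = pd.DataFrame({
    "name": ["A", "A", "B"],
    "offices": [
        {"city": "San Francisco", "country_code": "USA",
         "latitude": 37.79, "longitude": -122.41},
        {"city": "Dublin", "country_code": "IRL",
         "latitude": None, "longitude": None},
        None,  # a row without office data
    ],
})

fields = ["city", "country_code", "latitude", "longitude"]
for field in fields:
    # dict.get returns None for a missing key; non-dict cells also become None.
    toy[field] = toy["offices"].apply(
        lambda office, f=field: office.get(f) if isinstance(office, dict) else None
    )

print(toy.drop(columns="offices"))
```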


In [153]:
# Dropping the columns I will not use: I am interested in the gaming hub location.
gaming_df = gaming_df.drop(columns = ['tag_list', 'offices'])
gaming_df.head()

| | name | total_money_raised | city | state_code | country_code | latitude | longitude |
|---|---|---|---|---|---|---|---|
| 0 | CastTV | $3.1M | San Francisco | CA | USA | 37.780716 | -122.393913 |
| 1 | PodTech | $7.5M | Palo Alto | CA | USA | 37.408256 | -122.154176 |
| 2 | Metacafe | $50M | San Francisco | CA | USA | 37.437328 | -122.159928 |
| 3 | Dropbox | $257M | San Francisco | CA | USA | 37.790943 | -122.408499 |
| 4 | Dropbox | $257M | Dublin | | IRL | | |


In [154]:
gaming_df.groupby('country_code').size().sort_values(ascending=False).head(5)

country_code
USA    113
GBR     11
ARG     11
DEU      7
IRL      5
dtype: int64

In [156]:
gaming_df.groupby('state_code').size().sort_values(ascending=False).head(5)

state_code
CA    54
MA    12
NY     9
TX     6
WA     5
dtype: int64

In [158]:
gaming_df.groupby('city').size().sort_values(ascending=False).head(10)

city
San Francisco    27
New York          8
London            5
                  4
Buenos Aires      4
Austin            4
Moscow            4
Berlin            4
Boston            4
Palo Alto         3
dtype: int64
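The fourth entry in the city ranking has a blank label, which suggests some offices store an empty string as their city. A minimal sketch of how such values could be excluded before counting, on made-up data standing in for `gaming_df['city']`:

```python
import pandas as pd

# Made-up values; "" and "  " mimic offices with no usable city name.
cities = pd.Series(["San Francisco", "", "London", "San Francisco", "  ", None])

# Strip whitespace, then mask empty strings so they become missing values
# and are left out of the counts instead of showing up as an unnamed city.
stripped = cities.str.strip()
cleaned = stripped.mask(stripped == "")

counts = cleaned.value_counts()
print(counts)
```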

San Francisco has by far the most matching offices (27), so it is the strongest candidate for a gaming industry hub.  
New York (8) and London (5) could also be good options.  
Next, I need to check whether these cities meet all the requirements.
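Checking the requirements around San Francisco needs a reference coordinate. One hedged way to pick it is the median of the office coordinates already retrieved. The sketch below uses only the four San Francisco rows visible in the outputs above, not all 27. The median is used because the Metacafe row carries coordinates much closer to the Palo Alto row than to the other San Francisco rows.

```python
from statistics import median

# (name, latitude, longitude) for the San Francisco rows shown in the outputs above.
sf_offices = [
    ("CastTV", 37.780716, -122.393913),
    ("Metacafe", 37.437328, -122.159928),
    ("Dropbox", 37.790943, -122.408499),
    ("Exent", 37.787646, -122.402759),
]

# The median is less sensitive than the mean to a single misplaced coordinate.
center_lat = median(lat for _, lat, _ in sf_offices)
center_lon = median(lon for _, _, lon in sf_offices)

print(f"reference point: ({center_lat:.6f}, {center_lon:.6f})")
```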