# Geospatial BI & Data Viz Project - Part 2

- VISUALIZATION PROJECT Geospatial Business Intelligence (BI)
    * Make a geospatial analysis of the `companies` dataset
    * Things you know:
        - You have a software company with 50 employees
        - The company creates video games
        - Roles in your company: 20 developers, 20 Designers/Creatives/UX/UI and 10 executives/managers
    * Do an analysis about placing the new company offices in the best environment based on the following criteria:
        - There should be software engineers working around
        - The surroundings must have a good ratio of big companies vs startups
        - Ensure you have in your surroundings companies that cover the interests of your team
        - Avoid old companies, prefer recently created ones

In [1]:
# import pymongo to connect Python with MongoDB
from pymongo import MongoClient
# to work with stats
import pandas as pd
# to work with dataframes
import numpy as np
# to work with json
from pandas.io.json import json_normalize

### 1. Prepare the data: extract from the companies database the relevant information for the challenge.

* company identification
    - id (just in case we would need to get more information later)
    - name
* only currently active company
    - 'deadpooled_year' --> not none
    
* for geolocation analysis & geospatial visualization
    - offices.latitude, offices.longitude --> not null
    - offices.country_code, offices.city --> not null

* to determine if the company is old (negative) o recent (positive)
    - founded_year --> not null

* to define if it's a small (startup) or a big company:
    - number_of_employees --> company size (make sure there are not ficticious and have >= 1 employee)
    - investments.funding_round.round_code: 'Angel','seed' --> to associate to startup category
    https://support.crunchbase.com/hc/en-us/articles/115010458467-Glossary-of-Funding-Types
    - investments.funding_round.funded_year
    - ipo.pub_year,ipo.valuation_amount --> to associate with big company   

* to match with our team interests: technology & videogames
    - category_code: 'software', 'web', 'games_video' --> to filter as 'best match' for our team --> not null
    - description: 'software','technology', 'Platform','Social network' --> for a qualitative analysis
    - tag_list: 'network', 'online-communities','projects', etc --> for a qualitative analysis

In [11]:
# connecting on default host and port
client = MongoClient ('localhost', 27017)

# loading the database
db = client['companies']

# getting the collection
companies = db['companies']

# defining query to get the relevant information
query = db.companies.find({'$and':[{'founded_year':{'$ne': 'null'}},{'category_code':{'$ne': 'null'}},
{'deadpooled_year':None},{'number_of_employees':{'$gte':1}},{'offices.latitude':{'$ne': 'null'}},{'offices.longitude': {'$ne': 'null'}},
{'offices.country_code':{'$ne': 'null'}},{'offices.city':{'$ne': 'null'}}]},{'_id':1, 'name':1,'founded_year':1,'deadpooled_year':1, 
'number_of_employees':1,'offices.latitude':1,'offices.longitude':1,'offices.country_code':1,'offices.city':1,
'investments.funding_round.round_code':1,'investments.funding_round.funded_year':1,'ipo.pub_year':1,'ipo.valuation_amount':1,
'category_code':1,'description':1,'tag_list':1})

# we load our query to a dataframe to work with                           
def cursor_to_df(query):
    return pd.DataFrame(list(query))

In [12]:
# checking we get all the requested info
data.columns

Index(['_id', 'category_code', 'deadpooled_year', 'description',
       'founded_year', 'investments', 'ipo', 'name', 'number_of_employees',
       'offices', 'tag_list'],
      dtype='object')

In [13]:
# checking what kind of variables we have
data.dtypes

_id                     object
category_code           object
deadpooled_year         object
description             object
founded_year           float64
investments             object
ipo                     object
name                    object
number_of_employees      int64
offices                 object
tag_list                object
dtype: object

In [14]:
data.head()

Unnamed: 0,_id,category_code,deadpooled_year,description,founded_year,investments,ipo,name,number_of_employees,offices,tag_list
0,52cdef7c4bab8bd675297d8d,news,,user driven social content website,2004.0,[],,Digg,60,"[{'city': 'San Francisco', 'country_code': 'US...","community, social, news, bookmark, digg, techn..."
1,52cdef7c4bab8bd675297d91,web,,Geneology social network site,2006.0,[],,Geni,18,"[{'city': 'West Hollywood', 'country_code': 'U...","geni, geneology, social, family, genealogy"
2,52cdef7c4bab8bd675297d97,news,,Read Unlimited Books,2007.0,[],,Scribd,50,"[{'city': 'San Francisco', 'country_code': 'US...","book-subscription, digital-library, netflix-fo..."
3,52cdef7c4bab8bd675297d94,social,,Real time communication platform,2006.0,[],"{'valuation_amount': 18100000000, 'pub_year': ...",Twitter,1300,"[{'city': 'San Francisco', 'country_code': 'US...","text, messaging, social, community, twitter, t..."
4,52cdef7c4bab8bd675297d8e,social,,Social network,2004.0,"[{'funding_round': {'round_code': 'seed', 'fun...","{'valuation_amount': 104000000000, 'pub_year':...",Facebook,5299,"[{'city': 'Menlo Park', 'country_code': 'USA',...","facebook, college, students, profiles, network..."


In [16]:
# checking all companies are currently active
data.deadpooled_year.value_counts()

Series([], Name: deadpooled_year, dtype: int64)

### Working with founded_year variable:

In [None]:
def convert_to_int(data,int_var):
    for iv in int_var:
        data[iv].apply(lambda x: str(x).astype('int64', copy=False))
        #data[iv].astype('int64', copy=False)
    return data

In [19]:
# first we convert from float to int type:
data['founded_year'].astype('int64')

ValueError: Cannot convert non-finite values (NA or inf) to integer

In [18]:
# before defining bins, we will have a look first to statistics
data.founded_year.describe()

count    7830.000000
mean     2003.745977
std        10.581145
min      1800.000000
25%      2003.000000
50%      2006.000000
75%      2008.000000
max      2013.000000
Name: founded_year, dtype: float64

In [None]:
def founded_year_bins(var):
    bins_labels = ['Old (1800-2000)','Mid Old (2000-2003)','Mid Recent (2004-2007)','Recent (2008-2013)']
    cutoffs = [1800,2000,2006,2013] 
    df['tempo_clas'] = pd.cut(df[var],cutoffs, labels=bins_labels)
    return df[['tempo','tempo_clas']].head()

In [None]:
data.founded_year.value_counts()

In [None]:
# Defining 'Startups': Less than 10 employees

In [None]:
# checking the data quantity for our analyse
data.shape

In [15]:
data.category_code.value_counts()

web                 1861
software            1497
advertising          594
other                565
games_video          539
mobile               465
consulting           415
ecommerce            412
enterprise           389
network_hosting      338
public_relations     299
search               247
hardware             135
security              63
cleantech             55
analytics             49
biotech               43
social                36
finance               26
semiconductor         25
education             25
news                  25
music                 24
travel                18
messaging             16
photo_video           15
health                 9
legal                  8
real_estate            7
sports                 7
transportation         6
fashion                6
hospitality            4
automotive             3
medical                3
design                 2
nonprofit              2
manufacturing          1
nanotech               1
Name: category_code, dtyp