# CrunchBase

Here we produce indicators of the level of venture and seed funding in the UK, using proprietary CrunchBase data licensed by Nesta.

This involves:

* Downloading the data from the Nesta DAPS system
* Merging organisations and funders to create organisation-funding matches
* Geocoding with NUTS2 and LEP geographies
* Creating indicators
  * These will be based on a function that subsets by year and distinguishes between seed funding and venture capital
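As a rough illustration of the indicator function described in the last step, here is a minimal sketch. The column names (`nuts2_code`, `investment_type`, `year`, `raised_amount_usd`) and the investment-type labels are assumptions made for the example; they are not taken from the CrunchBase tables and should be checked against the real data.

```python
import pandas as pd

# Hypothetical investment-type labels; the values in the real table may differ
SEED_TYPES = {"seed", "angel"}
VENTURE_TYPES = {"series_a", "series_b", "venture"}


def funding_indicator(rounds, years, geo_col="nuts2_code",
                      type_col="investment_type", year_col="year",
                      amount_col="raised_amount_usd"):
    """Total seed and venture funding per area for the selected years."""
    # Subset by year
    subset = rounds[rounds[year_col].isin(list(years))].copy()

    # Label each round as seed, venture or other
    subset["funding_class"] = "other"
    subset.loc[subset[type_col].isin(SEED_TYPES), "funding_class"] = "seed"
    subset.loc[subset[type_col].isin(VENTURE_TYPES), "funding_class"] = "venture"
    subset = subset[subset["funding_class"] != "other"]

    # One row per area, one column per funding class
    out = subset.pivot_table(index=geo_col, columns="funding_class",
                             values=amount_col, aggfunc="sum", fill_value=0)
    return out.reindex(columns=["seed", "venture"], fill_value=0)


# Toy data, for illustration only
toy_rounds = pd.DataFrame({
    "nuts2_code": ["UKI3", "UKI3", "UKD3", "UKI3"],
    "investment_type": ["seed", "series_a", "seed", "seed"],
    "year": [2016, 2017, 2017, 2010],
    "raised_amount_usd": [100, 500, 50, 999],
})

indicator = funding_indicator(toy_rounds, years=range(2016, 2019))
print(indicator)
```

Keeping the type labels in module-level sets makes it easy to change what counts as seed or venture funding without touching the function.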

## Preamble

In [None]:
%run ../notebook_preamble.ipy

In [None]:
import re
import random
from zipfile import ZipFile
from io import BytesIO
import csv
from data_getters.labs.core import download_file
from ast import literal_eval
from data_getters.core import get_engine


In [None]:
# Create the raw and processed CrunchBase directories if they are missing
import os

for data_dir in ['../../data/raw/crunchbase', '../../data/processed/crunchbase']:
    os.makedirs(data_dir, exist_ok=True)

In [None]:
# %load ../utilities.py
# Some utilities

import random

def make_data_dict(table,name,path,sample=5):
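    '''
    Writes a data dictionary template for a table to a csv file.
    
    Args:
        table (df): the df we want to create the data dictionary for
        name (str): name of the df, used in the output file name
        path (str): directory where we want to save the file
        sample (int): number of values sampled per column to estimate its type
    
    '''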
    
    types = [estimate_type(table[x],sample=sample) for x in table.columns]
        
    data_dict = pd.DataFrame()
    data_dict['variable'] = table.columns
        
    data_dict['type'] = types
    
    data_dict['description'] = ['' for x in data_dict['variable']]
        
    out = os.path.join(path,f'{today_str}_{name}.csv')
    
    #print(data_dict.columns)
    
    data_dict.to_csv(out)
    

def estimate_type(variable,sample):
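    '''
    Estimates the type of a column from a random sample of its values.

    Args:
        variable (iterable): the values in the column
        sample (int): the number of values to test
    
    Returns:
        The most frequent type among the sampled values
    
    '''
    
    values = list(variable)
    
    # Cap the sample size so that short columns do not raise a ValueError
    selection = random.sample(values, min(sample, len(values)))
    
    types = pd.Series([type(x) for x in selection]).value_counts().sort_values(ascending=False)
    
    return(types.index[0])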

In [None]:
def get_daps_data(table,connection,chunksize=1000):
    '''
    Utility function to get data from DAPS with less faff
    
    Args:
        table (str): the SQL table in DAPS that we are extracting
        connection: the database connection we are using
        chunksize (int): the number of rows to download per chunk
    
    Returns:
        A dataframe with the data we have collected
    '''
    #Get chunks
    chunks = pd.read_sql_table(table, connection, chunksize=chunksize)
    
    #Create df
    df = pd.concat(chunks)
    
    #Return data
    return(df)

## 1. Load Data

### Setup

In [None]:
# Download CrunchBase data using DAPS

my_config = '../../mysqldb_team.config'

#Create connection with SQL
con = get_engine(my_config)

### Organisations

This is the list of organisations we want to work with.

In [None]:
#Read data
cb_orgs = get_daps_data('crunchbase_organizations',con)

In [None]:
cb_orgs.head()

Every organisation has an id and a location id.

### Funding rounds

Funding rounds for organisations

In [None]:
cb_funding_rounds = get_daps_data('crunchbase_funding_rounds',con)

In [None]:
cb_funding_rounds.head()

Each funding round has the company name and location id, the investment type and the year. This means that we don't need the organisation data to measure funding.

### Reverse geocoded place ids

We have reverse geocoded place ids to their NUTS and LEP codes in notebook `0_rev_geocoder`.

We load that information here and use it to generate indicators of activity by NUTS and LEPS area in the UK.
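Before building indicators it can help to check how many location ids actually find a geography. The sketch below shows one way to do that with a left merge. All column names (`location_id`, `id`, `nuts2_code`, `lep_code`) and the toy values are assumptions for illustration, not the real schema.

```python
import pandas as pd

# Toy stand-ins for the funding rounds and the reverse geocoded places
toy_rounds = pd.DataFrame({
    "company_name": ["A", "B", "C"],
    "location_id": ["p1", "p2", "p9"],
})
toy_places = pd.DataFrame({
    "id": ["p1", "p2"],
    "nuts2_code": ["UKI3", "UKD3"],
    "lep_code": ["LEP_1", "LEP_2"],
})

# A left merge keeps every funding round; indicator=True adds a `_merge`
# column recording whether a matching place was found
merged = toy_rounds.merge(toy_places, left_on="location_id", right_on="id",
                          how="left", indicator=True)

# Location ids without any geography attached
unmatched = merged.loc[merged["_merge"] == "left_only", "location_id"].unique()
print(f"Unmatched location ids: {list(unmatched)}")

# Drop the helper columns once the check is done
rounds_geo = merged.drop(columns=["id", "_merge"])
```

Rounds with unmatched ids stay in `rounds_geo` with missing area codes, so they can be inspected rather than silently dropped.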

In [None]:
places = pd.read_csv('../../data/processed/crunchbase/2020_01_28_rev_geocoded_places')

In [None]:
places

## 2. Process data

### a. Number of technology companies indicator

This is the number of active companies in each NUTS or LEP area.
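A minimal sketch of that count follows. It assumes the organisations table has already been joined to its area codes and that it carries a `status` column where `operating` marks an active company. Both the column names and that definition of "active" are assumptions for the example.

```python
import pandas as pd


def count_companies(orgs_geo, geo_col, id_col="id", status_col="status",
                    active_values=("operating",)):
    """Number of distinct active organisations per area."""
    # Keep only the organisations considered active (assumed definition)
    active = orgs_geo[orgs_geo[status_col].isin(active_values)]

    # Count unique organisation ids in each area
    return active.groupby(geo_col)[id_col].nunique().rename("n_companies")


# Toy data, for illustration only
toy_orgs_geo = pd.DataFrame({
    "id": ["o1", "o2", "o3", "o4"],
    "nuts2_code": ["UKI3", "UKI3", "UKD3", "UKD3"],
    "status": ["operating", "closed", "operating", "operating"],
})

company_counts = count_companies(toy_orgs_geo, geo_col="nuts2_code")
print(company_counts)
```

Passing `geo_col` as an argument lets the same function serve both the NUTS and the LEP versions of the indicator.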

In [None]:
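# Attach NUTS / LEP codes via the location id (key column names are assumed, check them against both tables)
cb_orgs_geo = pd.merge(cb_orgs, places, left_on='location_id', right_on='id', how='left')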

### Geographies

Locations for organisations

In [None]:
places_df = get_daps_data('geographic_data',con)

In [None]:
places_df.head()

Every city has an id that can be matched with the CB data and a lat, lon pair that can be used for geocoding.
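To make the geocoding step concrete, the sketch below assigns a lon, lat point to an area with a plain ray-casting point-in-polygon test. This is an illustrative stand-in, not the method used in `0_rev_geocoder`; the function names and the toy square "areas" are hypothetical, and real NUTS or LEP boundaries would come from shapefiles.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a ray going east crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_area(lon, lat, areas):
    """Return the code of the first area containing the point, else None."""
    for code, polygon in areas.items():
        if point_in_polygon(lon, lat, polygon):
            return code
    return None


# Toy areas: two adjacent squares given as (lon, lat) vertices
toy_areas = {
    "AREA_A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "AREA_B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}

print(assign_area(1, 1, toy_areas))
print(assign_area(3, 1, toy_areas))
print(assign_area(5, 5, toy_areas))
```

For real boundaries a spatial library with indexed joins would normally be a better fit than this loop, which checks every area in turn.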