- Meet the data science cookiecutter requirements, in brief:
  - Install: `git-crypt`
  - Have a Nesta AWS account configured with `awscli`
- Run `make install` to configure the development environment:
  - Set up the conda environment
  - Configure pre-commit
  - Configure metaflow to use AWS
- Run `git clone https://github.com/martingerlach/hSBM_Topicmodel.git` inside `createch` to clone the `top-SBM` repo.
- Add `MYSQL_CONFIG=path/to/sql/config` to `.env`
Run `conda config --add channels conda-forge` to add the conda-forge channel.
Run `python -m spacy download en_core_web_sm` to install the spaCy language model.
Run `make fetch-daps1` to fetch GtR and Crunchbase data from `nesta/nestauk` (DAPS1), including:
- Crunchbase:
  - `crunchbase_organizations`: Crunchbase organisations in the UK
  - `crunchbase_funding_rounds`: Crunchbase funding rounds in the UK
  - `crunchbase_organizations_categories`: lookup between Crunchbase organisations in the UK and their categories
  - `crunchbase_category_groups`: lookup between Crunchbase categories and higher-level categories
- GtR:
  - `gtr_projects`
  - `gtr_funders` (which we use to get project start dates)
  - `gtr_topics`
  - `gtr_link_table` for merging various GtR tables
Key tables for analysis can be read using getter functions in `createch/getters/{source}`.
We still need to create fetchers and queries for GtR organisation data and locations.
Run `python createch/pipeline/model_tokenise.py` to tokenise `{source}` descriptions and train a word2vec model. The respective JSON files and models are saved in `outputs/{output_type}/{source}`.
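As a rough illustration of the tokenising step, the sketch below turns raw descriptions into lists of tokens keyed by id and serialises them as JSON. It is not the code in `model_tokenise.py`: the regex tokeniser, the tiny stopword list, and the example descriptions are all assumptions made for illustration, and the word2vec training itself is left out.

```python
import json
import re

# Tiny illustrative stopword list (an assumption, not the project's list)
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def tokenise(text):
    """Lowercase the text, keep alphanumeric tokens and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


# Hypothetical descriptions keyed by project or organisation id
descriptions = {
    "p1": "Immersive audio for virtual reality games",
    "p2": "The design of wearable sensors",
}

tokenised = {doc_id: tokenise(text) for doc_id, text in descriptions.items()}
print(json.dumps(tokenised, indent=2))
```

A list of token lists like `tokenised.values()` is the kind of input that word2vec implementations typically train on.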
Run `python createch/pipeline/semantic_identification.py` to expand technology vocabularies and tag relevant descriptions. The expanded vocabularies and id-to-area lookups are saved in `outputs/data/{source}`.
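One common way to expand a seed vocabulary is to add terms whose embeddings sit close to a seed term, then tag any description that contains an expanded term. The sketch below shows that idea on toy vectors. It assumes cosine similarity with a fixed threshold, which may differ from what `semantic_identification.py` does; the function names, vectors, and threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_vocabulary(seeds, vectors, threshold=0.97):
    """Return the seeds plus every term whose vector is close to a seed."""
    expanded = set(seeds)
    known_seeds = [s for s in seeds if s in vectors]
    for term, vec in vectors.items():
        if any(cosine(vec, vectors[s]) >= threshold for s in known_seeds):
            expanded.add(term)
    return expanded


def tag_descriptions(tokenised, vocabulary):
    """Return ids of descriptions containing at least one vocabulary term."""
    return [doc_id for doc_id, tokens in tokenised.items() if vocabulary & set(tokens)]


# Hypothetical toy embeddings standing in for a trained word2vec model
vectors = {
    "vr": [0.9, 0.1, 0.0],
    "virtual_reality": [0.85, 0.15, 0.05],
    "headset": [0.7, 0.3, 0.1],
    "accounting": [0.0, 0.1, 0.9],
}
vocab = expand_vocabulary({"vr"}, vectors)

tokenised = {
    "a": ["vr", "game"],
    "b": ["accounting", "software"],
    "c": ["virtual_reality"],
}
tagged = tag_descriptions(tokenised, vocab)
print(sorted(vocab), tagged)
```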
Run `python createch/pipeline/make_research_topic_partition.py` to produce a research topic co-occurrence network that we can use to produce a dataset labelled with research disciplines.
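To make the idea of a topic co-occurrence network concrete, here is a minimal sketch that counts how often two topics are assigned to the same project. It assumes edge weights are simple co-occurrence counts. The project data is made up, and the partitioning of the network into communities is not shown.

```python
from collections import Counter
from itertools import combinations


def topic_cooccurrence(project_topics):
    """Count how often each pair of topics appears on the same project."""
    edges = Counter()
    for topics in project_topics.values():
        # Sort and de-duplicate so each unordered pair maps to a single key
        for pair in combinations(sorted(set(topics)), 2):
            edges[pair] += 1
    return edges


# Hypothetical toy data: project id -> research topics
project_topics = {
    "p1": ["Computer Graphics", "Design", "HCI"],
    "p2": ["Design", "HCI"],
    "p3": ["Computer Graphics", "HCI"],
}

edges = topic_cooccurrence(project_topics)
for (topic_a, topic_b), weight in sorted(edges.items()):
    print(f"{topic_a} -- {topic_b}: {weight}")
```

The resulting weighted edge list can be passed to a graph library for community detection.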
Run `python createch/pipeline/discipline_classifier.py` to create a labelled dataset using the community partitions above and train a model that predicts disciplines based on project descriptions.
Run `python createch/pipeline/industry_classifier.py` to predict creative industry sectors for all Crunchbase companies identified as potentially creative, based on a training set of Crunchbase companies matched to Companies House.
Fuzzy-matching of GtR and Crunchbase to Companies House.
Getters for lookups between matched datasets are located in `createch/getters/jacchammer.py`.
- Run `make jacchammer`
  - Runs in test-mode by default. To run the full process add `test_mode=false` as an argument to the make command (warning: long-running process)
  - To run on AWS batch add `batch=true` as an argument to the make command
- Update `flows.jacchammer.{gtr,crunchbase}.run_id` in `base.yaml` with run ids for each flow so that the getters fetch the updated run
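The dotted key `flows.jacchammer.{gtr,crunchbase}.run_id` implies a nested structure in `base.yaml`. The sketch below shows how such a key could be resolved once the file has been parsed into a dictionary. The `get_nested` helper and the placeholder run ids are hypothetical, and the real getters may read the config differently.

```python
def get_nested(config, dotted_key):
    """Walk a nested dict using a dot-separated key such as 'a.b.c'."""
    value = config
    for part in dotted_key.split("."):
        value = value[part]
    return value


# Assumed shape of base.yaml after parsing; the run ids are placeholders
config = {
    "flows": {
        "jacchammer": {
            "gtr": {"run_id": "<gtr-run-id>"},
            "crunchbase": {"run_id": "<crunchbase-run-id>"},
        }
    }
}

for source in ("gtr", "crunchbase"):
    print(source, get_nested(config, f"flows.jacchammer.{source}.run_id"))
```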
Alternatively, run individual `make` commands in `createch/pipelines/jacchammer`.
Technical and working style guidelines
Project based on Nesta's data science project template (Read the docs here).