# 250_prep_dataset2

## Purpose
This preparation notebook deals with the second dataset output by 100_load_dataset. As previously mentioned, this dataset holds information about the individuals who belong to a company or organisation. We need to filter this large initial dataset so that it contains only the information needed for our RQ1 and RQ2.
This includes:
- Removing any person instances with no funding value, as we need this value to answer our RQs.
- Ensuring all the people in our resulting dataset are founders of a company.
- Transforming the "degree_type" field to get each founder's highest degree.

## Datasets
* _Input_: 100_dataset2.pkl
* _Output_: 250_prep_dataset2.pkl

In [1]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
pd.set_option('display.max_columns', None)
module_path = os.path.abspath(os.path.join('../../data/..'))
if module_path not in sys.path:
    sys.path.append(module_path)
%matplotlib inline

### Preparing Second Dataset

We read in the pickle file from our first notebook. This dataframe holds all relevant information about each individual involved in a company. Not everyone is needed for this RQ analysis, so we will filter the dataframe to include only founders of startups.

In [2]:
merged_df = pd.read_pickle('../../data/processed/100_dataset2.pkl')
merged_df.head(5)

Unnamed: 0,first_name,last_name,gender,company_name,funding_rounds,funding_total_usd,primary_role,country_code_y,state_code_y,city_y,title,job_type,subject,degree_type,person_uuid,degree_uuid,institution_uuid,org_uuid
0,Steve,Jobs,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Board of Directors,board_member,unknown,Dropout,2b3a6b34-ad65-8e1e-29ff-d267f42530e0,be0638c8-653e-5f5e-2845-2c99dd3b6abe,76cd719f-af9e-7984-a6a6-ef970b52515d,7063d087-96b8-2cc1-ee88-c221288acc2a
1,Steve,Jobs,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,CEO,executive,unknown,Dropout,2b3a6b34-ad65-8e1e-29ff-d267f42530e0,be0638c8-653e-5f5e-2845-2c99dd3b6abe,76cd719f-af9e-7984-a6a6-ef970b52515d,7063d087-96b8-2cc1-ee88-c221288acc2a
2,Steve,Jobs,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Co-Founder,executive,unknown,Dropout,2b3a6b34-ad65-8e1e-29ff-d267f42530e0,be0638c8-653e-5f5e-2845-2c99dd3b6abe,76cd719f-af9e-7984-a6a6-ef970b52515d,7063d087-96b8-2cc1-ee88-c221288acc2a
3,Tim,Cook,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Senior Vice President of Worldwide Operations,employee,unknown,MBA,e53fad55-f0a6-584b-4396-804499b36712,d04f22dc-da1f-d76f-884f-375133e3f5a6,208fca08-b131-9527-1033-4c433760531a,7063d087-96b8-2cc1-ee88-c221288acc2a
4,Tim,Cook,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Member of the Board of Directors,board_member,unknown,MBA,e53fad55-f0a6-584b-4396-804499b36712,d04f22dc-da1f-d76f-884f-375133e3f5a6,208fca08-b131-9527-1033-4c433760531a,7063d087-96b8-2cc1-ee88-c221288acc2a


There are many duplicate person instances in this dataframe, and we need to decide how to handle them.

**First step of preparation is:**
- Removing any rows with no "funding_total_usd" value as we need this funding value for our analysis.

In [3]:
clean_df = merged_df[~merged_df["funding_total_usd"].isnull()]
clean_df.shape

(322440, 18)
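
As a small illustration of the filter above, the sketch below applies the same boolean mask to a toy dataframe and compares it with `dropna` restricted to one column. The dataframe `toy_df` and its values are invented for this example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for merged_df; the names and values are invented for illustration.
toy_df = pd.DataFrame({
    "first_name": ["Ada", "Ben", "Cy"],
    "funding_total_usd": [1_000_000.0, np.nan, 250_000.0],
})

# Boolean-mask form, as used in the notebook.
by_mask = toy_df[~toy_df["funding_total_usd"].isnull()]

# Equivalent form: drop rows whose funding value is missing.
by_dropna = toy_df.dropna(subset=["funding_total_usd"])

print(by_mask.equals(by_dropna))
print(by_dropna.shape)
```

Both forms keep only the rows that have a funding value, so either could be used here.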

**Next step of preparation is...**

Ensuring all the individuals in this dataframe are founders. A regular expression is used to do this because the first letter of "founder" in the "title" field is sometimes lower-case and sometimes upper-case.

In [4]:
founder_df = clean_df[clean_df.title.str.contains(r'[fF]ounder[.]?', na=False)]
founder_df.shape

(54113, 18)
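
To make the behaviour of the pattern concrete, here is a minimal sketch on a handful of hypothetical job titles (invented for this example). It also shows a case-insensitive variant, which is an alternative and not what the notebook uses; it would additionally match fully upper-case titles. The optional `[.]?` suffix in the notebook's pattern does not change which rows match.

```python
import pandas as pd

# Hypothetical job titles, including a missing value.
titles = pd.Series(["Co-Founder", "Co-founder", "FOUNDER & CEO", "CEO", None])

# Pattern from the notebook: matches "founder" or "Founder" only.
notebook_mask = titles.str.contains(r"[fF]ounder[.]?", na=False)

# Case-insensitive variant: also matches "FOUNDER".
ci_mask = titles.str.contains("founder", case=False, na=False)

print(notebook_mask.tolist())
print(ci_mask.tolist())
```

Passing `na=False` treats a missing title as a non-match, so those rows are dropped by the filter.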

In [5]:
founder_df.head(5)

Unnamed: 0,first_name,last_name,gender,company_name,funding_rounds,funding_total_usd,primary_role,country_code_y,state_code_y,city_y,title,job_type,subject,degree_type,person_uuid,degree_uuid,institution_uuid,org_uuid
2,Steve,Jobs,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Co-Founder,executive,unknown,Dropout,2b3a6b34-ad65-8e1e-29ff-d267f42530e0,be0638c8-653e-5f5e-2845-2c99dd3b6abe,76cd719f-af9e-7984-a6a6-ef970b52515d,7063d087-96b8-2cc1-ee88-c221288acc2a
14,Steve,Wozniak,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Co-founder,executive,EE & CS,BS,f3abe539-8db3-57e4-0f4d-de54a78eaf68,fe3eb345-b465-84ad-45d7-448f8f7a44e5,10f9a25b-9675-2281-486e-a52955c706df,7063d087-96b8-2cc1-ee88-c221288acc2a
145,Kevin,Harvey,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Founder,executive,Engineering,B.S.,e7f5c146-66c5-fba4-64cb-8ffd422899d8,0dee09e8-13b6-50ee-3e17-1343036b2eed,c3144da5-8618-2e95-3a13-60417220da5e,7063d087-96b8-2cc1-ee88-c221288acc2a
455,Armas,Markkula,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Founder,executive,Electrical Engineering,B.S.,56e8a800-5c37-7599-5eb3-b815aa6acd30,29b2a7bc-4628-0e5d-53d1-d0af77d3de33,867f0af5-a1d0-143d-bbed-5cc252ca40d6,7063d087-96b8-2cc1-ee88-c221288acc2a
460,Armas,Markkula,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Founder,executive,Electrical Engineering,M.S.,56e8a800-5c37-7599-5eb3-b815aa6acd30,9da52706-0933-81f3-5be3-5ae30747612e,867f0af5-a1d0-143d-bbed-5cc252ca40d6,7063d087-96b8-2cc1-ee88-c221288acc2a


### Task: Getting Highest Degree of a founder
We have decided that we are only interested in the highest degree achieved by each founder in a startup.
This will be our first task to complete. 

Currently the dataframe we are reading in from our first prep notebook contains the following:
- A duplicate person instance if they have founded multiple companies.
- A duplicate person instance if they have changed positions in the same company.
- A duplicate person instance if they have attained another degree or changed institution while at the same company.

After becoming aware of this, we decided not to reduce the person instance duplicates but to use them as a basis for our analysis. We will add information by transforming the dataframe, e.g. finding the founder's highest degree. We will create separate dataframes based on this resulting dataset using the groupby function.

In [6]:
founder_df['degree_type'].unique().size

2437

The founders' "degree_type" field has 2437 unique values, which suggests that the degrees were entered by users as free text.

We decided that the next appropriate step would be to normalize these values in the field by:

- Removing all punctuation 
- Making every letter in the field lowercase.

In [7]:
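founder_df = founder_df.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
# regex=True is required in pandas 2.0+, where str.replace matches literally by default
founder_df['degree_type'] = founder_df['degree_type'].str.replace(r"[\"\’\'\´,.-]", '', regex=True)
founder_df['degree_type'] = founder_df['degree_type'].str.lower()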
founder_df['degree_type'].unique().size

1956
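
The effect of this normalization can be seen on a few hypothetical raw strings (invented for this example). This is a sketch of the same two steps, punctuation removal followed by lower-casing, and not a reproduction of the real data.

```python
import pandas as pd

# Hypothetical raw degree strings in the spirit of the notebook's data.
raw = pd.Series(["B.S.", "BS", "b.s", "Ph.D.", "Master's", None])

# Strip the same punctuation set as the notebook, then lower-case.
# regex=True is passed explicitly because pandas 2.0+ matches literally by default.
normalized = (
    raw.str.replace(r"[\"’'´,.-]", "", regex=True)
       .str.lower()
)

print(normalized.tolist())
print(normalized.nunique())
```

Three different spellings of the same bachelor's degree collapse into one value, which is why the number of unique values drops after this step.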

This has significantly reduced the number of unique degree types, but it is still large.

In [8]:
founder_df.degree_type.value_counts()

bs                                                              8253
ba                                                              6140
unknown                                                         6086
mba                                                             5922
ms                                                              3816
phd                                                             3470
bsc                                                             1491
msc                                                             1209
bachelor                                                         948
graduate                                                         941
ma                                                               742
master                                                           699
be                                                               696
masters                                                          689
bba                               

Some entries separate multiple degree types with an '&', so we deal with this case separately. Ideally, they should be separated by a comma.

In [9]:
founder_df[founder_df["degree_type"].str.contains("&",na=False)].shape

(121, 18)

In [10]:
founder_df['degree_type'] = founder_df['degree_type'].str.replace("&", ',')
founder_df[founder_df["degree_type"].str.contains("&",na=False)].shape

(0, 18)

In [11]:
founder_df.degree_type.value_counts()

bs                                             8253
ba                                             6140
unknown                                        6086
mba                                            5922
ms                                             3816
phd                                            3470
bsc                                            1491
msc                                            1209
bachelor                                        948
graduate                                        941
ma                                              742
master                                          699
be                                              696
masters                                         689
bba                                             663
btech                                           618
bachelors                                       605
jd                                              580
bachelors degree                                533
bachelor of 

In [12]:
founder_df.head(10)

Unnamed: 0,first_name,last_name,gender,company_name,funding_rounds,funding_total_usd,primary_role,country_code_y,state_code_y,city_y,title,job_type,subject,degree_type,person_uuid,degree_uuid,institution_uuid,org_uuid
2,Steve,Jobs,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Co-Founder,executive,unknown,dropout,2b3a6b34-ad65-8e1e-29ff-d267f42530e0,be0638c8-653e-5f5e-2845-2c99dd3b6abe,76cd719f-af9e-7984-a6a6-ef970b52515d,7063d087-96b8-2cc1-ee88-c221288acc2a
14,Steve,Wozniak,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Co-founder,executive,EE & CS,bs,f3abe539-8db3-57e4-0f4d-de54a78eaf68,fe3eb345-b465-84ad-45d7-448f8f7a44e5,10f9a25b-9675-2281-486e-a52955c706df,7063d087-96b8-2cc1-ee88-c221288acc2a
145,Kevin,Harvey,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Founder,executive,Engineering,bs,e7f5c146-66c5-fba4-64cb-8ffd422899d8,0dee09e8-13b6-50ee-3e17-1343036b2eed,c3144da5-8618-2e95-3a13-60417220da5e,7063d087-96b8-2cc1-ee88-c221288acc2a
455,Armas,Markkula,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Founder,executive,Electrical Engineering,bs,56e8a800-5c37-7599-5eb3-b815aa6acd30,29b2a7bc-4628-0e5d-53d1-d0af77d3de33,867f0af5-a1d0-143d-bbed-5cc252ca40d6,7063d087-96b8-2cc1-ee88-c221288acc2a
460,Armas,Markkula,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Founder,executive,Electrical Engineering,ms,56e8a800-5c37-7599-5eb3-b815aa6acd30,9da52706-0933-81f3-5be3-5ae30747612e,867f0af5-a1d0-143d-bbed-5cc252ca40d6,7063d087-96b8-2cc1-ee88-c221288acc2a
969,Kristee,Rosendahl,female,Apple,4,6150250000.0,company,USA,CA,Cupertino,"Designer, Art Director, Human Interface Co-fou...",employee,Design,ba,035e9cc5-d2a4-9298-7488-c348527a5d1a,0c2be7d4-47cd-8758-e668-34593c6c605f,20135206-96eb-8be0-9ac4-670b257e532c,7063d087-96b8-2cc1-ee88-c221288acc2a
2102,Nolan,Bushnell,male,Atari,2,22260000.0,company,USA,NY,New York,Founder / CEO,executive,unknown,mba,29309412-8440-3525-a2dd-2d4338f07dc3,285515b7-6121-e8d5-57e9-b56cf4c49e62,20135206-96eb-8be0-9ac4-670b257e532c,17fc007e-1f5d-3ff8-8f03-bcff5c9528a5
2103,Nolan,Bushnell,male,Atari,2,22260000.0,company,USA,NY,New York,Founder / CEO,executive,"Engineering, Business",be,29309412-8440-3525-a2dd-2d4338f07dc3,ab6c5b94-8d09-eeb3-c7de-4d6ef61125aa,5fed9dd9-f09b-632e-da77-036a077ef5cb,17fc007e-1f5d-3ff8-8f03-bcff5c9528a5
5323,Mark,Zuckerberg,male,Facebook,11,2335700000.0,company,USA,CA,Menlo Park,Founder & CEO,executive,Computer Science,dropped out,a01b8d46-d311-3333-7c34-aa3ae9c03f22,e75e1434-2ace-9255-2da8-3943f5bbae7c,d8b57c0e-9f0f-4dcb-d207-a12a90c64a2d,df662812-7f97-0b43-9d3e-12f64f504fbb
5338,Eduardo,Saverin,male,Facebook,11,2335700000.0,company,USA,CA,Menlo Park,Co-Founder,executive,Economics,ba,fb5b458c-0aab-a977-71b9-ecf78d3ec756,16d17b89-f6e3-c887-1a82-bb16ba2196d3,d8b57c0e-9f0f-4dcb-d207-a12a90c64a2d,df662812-7f97-0b43-9d3e-12f64f504fbb


**Next step of preparation...**

As seen above, we have many duplicate person instances, for the reasons mentioned previously. The next step in this dataset preparation is to transform the founders' "degree_type" field to get the highest degree. This involves collecting each founder's degree types into a single field, instead of keeping separate person instances with different "degree_type" values.

For example, the two rows
- John Doe BS
- John Doe MS

will become the single row "John Doe BS, MS".

The result is then merged back onto the "founder_df" dataframe using the "person_uuid".

In [13]:
# less_duplicates concatenates the degree types of a founder within one group
def less_duplicates(df_group):
    # if the group holds more than one distinct degree type
    if df_group['degree_type'].unique().size != 1:
        # join every row's degree type with ', ' (repeated values are kept)
        d_typ = df_group['degree_type'].str.cat(sep=', ')
    else:
        d_typ = df_group['degree_type'].iloc[0] # the single degree type of the person
    
    df_return = pd.DataFrame(
        {'Degree_Type': [d_typ], # degree type set based on condition above
         'person_uuid': df_group['person_uuid'].iloc[0], # all rows in the group share this person_uuid
         'org_uuid': df_group['org_uuid'].iloc[0]}) # all rows in the group share this org_uuid
    return df_return # one-row dataframe returned for each group

# We are grouping each individual by their unique person id and org id as this will deal with founders of multiple
# startups appropriately. (reasons at start of section).
no_dup = founder_df.groupby(['person_uuid', 'org_uuid'], as_index=False).apply(less_duplicates)
no_dup.shape

(36594, 3)
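
For comparison, the same grouping idea can be sketched with `groupby(...).agg`. This is a variation and not the notebook's method: the helper `join_unique` is a hypothetical name, the toy rows are invented, and unlike `less_duplicates` it drops repeated degree values before joining.

```python
import pandas as pd

# Toy founder rows: one person with two degrees at one company,
# and one person whose single degree appears twice (e.g. two job titles).
toy = pd.DataFrame({
    "person_uuid": ["p1", "p1", "p2", "p2"],
    "org_uuid": ["o1", "o1", "o2", "o2"],
    "degree_type": ["bs", "ms", "mba", "mba"],
})

def join_unique(degrees):
    """Join the distinct, non-null degree strings in first-seen order."""
    return ", ".join(degrees.dropna().unique())

combined = (
    toy.groupby(["person_uuid", "org_uuid"], as_index=False)["degree_type"]
       .agg(join_unique)
       .rename(columns={"degree_type": "Degree_Type"})
)

print(combined)
```

Because the later indicator step only records whether a token is present, repeated values in the joined string would not change its result; removing them mainly makes the intermediate value counts easier to read.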

In [14]:
no_dup.head(10)

Unnamed: 0,Unnamed: 1,Degree_Type,org_uuid,person_uuid
0,0,"phd, bsc",d2d0cb83-b874-c5d7-c7f7-fb77613cc95b,00026df9-9254-269d-40b1-549e9529550d
1,0,aa,14658850-0cc9-15f8-62f3-a8c532ea6c61,000497ac-d3f9-7969-6c8b-b4050c8efc04
2,0,masters,d2de0c01-397d-b4f1-8575-9b5e74e6b6b8,000575b8-eac0-66b1-2a16-03c08c2b9f66
3,0,mba,2aec3826-0f75-1f21-326a-5dbca9d5ff15,0005da7e-2311-9002-7756-ed2f2734e057
4,0,mba,bd4c4326-ef34-d5d9-b689-0c0b0a6ba03c,0005da7e-2311-9002-7756-ed2f2734e057
5,0,bachelors,6663f9d3-e6ab-348f-66c7-cafc00ce01a8,00065f25-101a-bfe2-d79c-a172af342c70
6,0,bachelors,a1102c6c-1bc4-b6aa-c5f2-6c34bd4b2370,00065f25-101a-bfe2-d79c-a172af342c70
7,0,bs,7b224a36-b7b4-d02f-bf76-b4ac9ba085ca,000792fb-3022-cac3-eea5-a93a49150727
8,0,unknown,077d60d6-1885-0518-4162-9e827c9269b2,00081d61-57bb-fc90-143e-bad90544b0c5
9,0,ba,000ad7a8-b868-f301-5f00-2a3361288fc9,00082be1-4c28-c41f-6147-92d0e12629c8


The resulting dataframe "no_dup" is the one we will use to get the highest degree of a founder. Before going any further, it is worth looking at the "Degree_Type" field to ensure it is in the format we want.

In [15]:
no_dup["Degree_Type"].value_counts()

bs                                          4105
unknown                                     3622
ba                                          3123
mba                                         1626
phd                                         1281
ms                                           983
bsc                                          845
graduate                                     710
bachelor                                     583
msc                                          529
bs, ms                                       449
mba, bs                                      448
mba, ba                                      448
ba, mba                                      448
bba                                          424
bachelors degree                             410
ms, bs                                       384
bs, mba                                      373
bachelors                                    362
bachelor of science                          311
btech               

The value counts of the Degree_Type column show combined entries such as "bs, ms". Ideally, the field should be in the format "ba,phd", with no whitespace between the values.

**Next Step...**

Initially I thought removing all whitespace would be perfect. However, on further investigation, a value such as "doctor of technology" would turn into "doctoroftechnology" if this was done.

Thus, to get the desired format:
- All whitespace characters are replaced with a comma.
- Any two consecutive commas are then replaced with a single comma.
- In addition, slashes are replaced with a comma and brackets are removed from the field.

In [16]:
# regex=False makes every pattern below a literal string, so "(" and ")" need no escaping
no_dup['Degree_Type'] = no_dup['Degree_Type'].str.replace(" ", ',', regex=False) # whitespaces replaced with comma
no_dup['Degree_Type'] = no_dup['Degree_Type'].str.replace(",,", ',', regex=False) # two consecutive commas reduced to a single comma
# slashes become commas; brackets are removed
no_dup['Degree_Type'] = no_dup['Degree_Type'].str.replace("/", ',', regex=False)
no_dup['Degree_Type'] = no_dup['Degree_Type'].str.replace(")", '', regex=False)
no_dup['Degree_Type'] = no_dup['Degree_Type'].str.replace('(', '', regex=False)
no_dup["Degree_Type"].value_counts()

bs                                           4106
unknown                                      3622
ba                                           3125
mba                                          1639
phd                                          1287
ms                                            984
bsc                                           850
graduate                                      710
bachelor                                      583
msc                                           532
bs,ms                                         456
mba,bs                                        448
mba,ba                                        448
ba,mba                                        448
bba                                           425
bachelors,degree                              410
ms,bs                                         384
bs,mba                                        374
bachelors                                     362
bachelor,of,science                           311
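
The chain of literal replacements above reduces pairs of commas only once. As an alternative sketch (not the notebook's method), a single regular expression can collapse any run of whitespace, commas, or slashes into one comma. The sample strings are invented for this example.

```python
import pandas as pd

# Hypothetical concatenated degree strings.
degrees = pd.Series(["bs, ms", "bachelor of science", "ba/ma (hons)", "phd"])

# Drop brackets, collapse any run of whitespace, commas or slashes
# into one comma, and trim stray commas at either end.
cleaned = (
    degrees.str.replace(r"[()]", "", regex=True)
           .str.replace(r"[\s,/]+", ",", regex=True)
           .str.strip(",")
)

print(cleaned.tolist())
```

Applied to the real data, this variant could give slightly different value counts from those shown above, because longer runs of separators would also be reduced to a single comma.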


After viewing these value counts, we can see that the "Degree_Type" field is now in an appropriate format to transform using the get_dummies() function.

The get_dummies() function creates a matrix-like layout with one row per person instance and one 0/1 column per distinct value found in Degree_Type.

In [17]:
dummy_df = no_dup["Degree_Type"].str.get_dummies(',')
dummy_df.shape

(36594, 1268)
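
A minimal sketch of what `str.get_dummies(',')` produces, using a few hypothetical comma-separated strings (invented for this example):

```python
import pandas as pd

# Hypothetical comma-separated degree strings.
degree_type = pd.Series(["phd,bsc", "aa", "bs,ms", "bachelor,of,science"])

# One indicator column per distinct token; columns come out sorted alphabetically.
dummies = degree_type.str.get_dummies(",")

print(dummies.columns.tolist())
print(dummies.loc[0, ["bsc", "phd"]].tolist())
```

Every token becomes its own column, including filler words such as "of". This is why the real dataframe ends up with 1268 columns, many of which are not degree names.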

In [18]:
dummy_df.head(5)

[Output condensed: the first rows of dummy_df, a 0/1 indicator matrix with 1268 columns. The column names are the individual tokens found in "Degree_Type", ranging from common degree abbreviations (aa, ba, bs, bsc, mba, ms, phd) to noise tokens (2014, and, of, university). Each row is almost entirely zeros, with a 1 only under the tokens present for that person instance, e.g. "bsc" and "phd" for row 0 and "aa" for row 1.]
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


The dummy dataframe created is shown above. There are currently 1200+ different degree types that have been transformed into columns.

**Next Step...**

We decided to drop every column whose sum is less than 10, that is, every degree type that appears fewer than 10 times. We concluded that this would have little effect on our research questions and would leave more time for analysis. This narrows the columns that may need collapsing to under 200.

In [19]:
# drop degree-type columns that occur fewer than 10 times
dummy_df.drop([col for col, val in dummy_df.sum().items() if val < 10], axis=1, inplace=True)
dummy_df.shape

(36594, 163)
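The same filtering can also be written as a boolean mask over the columns. The block below is a minimal sketch on a toy stand-in for `dummy_df`; the toy values and the names `toy_dummies`, `min_count` and `filtered` are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for dummy_df: one 0/1 column per degree type (hypothetical values)
toy_dummies = pd.DataFrame({
    "bs":  [1, 1, 1, 0],
    "mba": [0, 1, 1, 1],
    "d":   [0, 0, 0, 1],   # rare, ambiguous label
})

min_count = 2  # the notebook uses 10; lowered here so the toy data shows the effect

# Column sums count how many rows carry each degree type
col_counts = toy_dummies.sum()

# Keep only the columns that meet the threshold
filtered = toy_dummies.loc[:, col_counts >= min_count]
kept_columns = list(filtered.columns)
print(kept_columns)
```

Unlike `drop(..., inplace=True)`, this version returns a new dataframe and leaves the original untouched, which can make it easier to compare shapes before and after.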

**Next Step...**

These remaining degree types will be manually assigned to broader categories, which we call Overall Degree Types. The categories we decided on were:
- Bachelors
- Masters
- Juris
- Diploma
- Doctorate
- Executive
- Honors

If it is still unclear what a name refers to after checking where it appears in the dataframe, we do not assign it to any of the Overall Degree Types.

For example, **'d'** appears in the column list, but it is very difficult to conclude what it refers to. We ignore cases like this and focus on the clear instances.

In [20]:
# Write the remaining column names to a text file, which we used to work out the collapsing below
list_dummy = list(dummy_df)
with open('../../data/txt_prep/degree_types.txt', 'w') as thefile:  # the file is closed automatically
    for item in list_dummy:
        thefile.write("%s\n" % item)

The following code block shows the degree types we assigned to each Overall Degree Type. As previously mentioned, this was a very manual process, which is not ideal for reproducibility; however, restricting the mapping to clearly identifiable degree types improves accuracy.

In [21]:
# Overall Degree Types
Bachelors = ['sb', 'aa', 'ab', 'b', 'ba', 'bachelor', 'bachelors', 'bachleor', 'barch', 'bas', 'basc', 'bba', 'bbm', 'bbs', 'bcom', 'bcomm', 'bcs', 'be', 'bec', 'bechelors', 'bed', 'beng', 'bfa', 'bm', 'bmath', 'bs', 'bsba', 'bsc', 'bse', 'bsee', 'bsme', 'btech', 'undergraduate', 'llb']
Masters = ['sm', 'm', 'ma', 'masc', 'master', 'masters', 'mba', 'mbbs', 'mca', 'md', 'me', 'med', 'meng', 'mfa', 'mpa', 'mph', 'mphil', 'mpp', 'mps', 'ms', 'msc', 'mse', 'msee', 'mtech', 'edm', 'llm']
Juris = ['jd', 'juris']
Diploma = ['dipl', 'diplom', 'diploma', 'pgdm']
Doctorate = ['doctor', 'doctoral', 'doctorate', 'dphil', 'dr', 'dsc', 'postdoc', 'postdoctoral']
Executive = ['executive', 'emba']
Honors = ['hba', 'honors', 'honours', 'hons', 'hon', 'honorary']
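Because the lists are maintained by hand, a quick consistency check can help catch a label that was accidentally placed in two groups. The following is a sketch using shortened, illustrative lists; the dictionary `groups` and the names `label_to_group` and `duplicates` are hypothetical and do not appear in the notebook.

```python
# Shortened, illustrative versions of the manual lists (not the full lists)
groups = {
    "Bachelors": ["ba", "bsc", "llb"],
    "Masters": ["ma", "mba", "llm"],
    "Juris": ["jd", "juris"],
}

label_to_group = {}   # reverse lookup: raw label -> Overall Degree Type
duplicates = []       # labels that appear in more than one group

for group_name, labels in groups.items():
    for label in labels:
        if label in label_to_group:
            duplicates.append(label)
        label_to_group[label] = group_name

print(duplicates)
print(label_to_group["mba"])
```

An empty `duplicates` list means every label maps to exactly one Overall Degree Type, so the later collapsing step cannot count a column twice.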

**Next Step...** 

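Our next step was to group and collapse these similar columns. Each Overall Degree Type column takes the row-wise maximum of its member columns, so it is 1 if the person holds any degree in that group and 0 otherwise. We can then drop the member columns without losing the information we need.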

In [22]:
# assign() creates a new column for each Overall Degree Type, holding the row-wise maximum (0 or 1)
# of its member columns. The member columns are then dropped using the lists above.
dummy_df = dummy_df.assign(Bachelors=dummy_df[Bachelors].max(axis=1)).drop(Bachelors, axis=1)
dummy_df = dummy_df.assign(Masters=dummy_df[Masters].max(axis=1)).drop(Masters, axis=1)
dummy_df = dummy_df.assign(Juris=dummy_df[Juris].max(axis=1)).drop(Juris, axis=1)
dummy_df = dummy_df.assign(Diploma=dummy_df[Diploma].max(axis=1)).drop(Diploma, axis=1)
dummy_df = dummy_df.assign(Doctorate=dummy_df[Doctorate].max(axis=1)).drop(Doctorate, axis=1)
dummy_df = dummy_df.assign(Executive=dummy_df[Executive].max(axis=1)).drop(Executive, axis=1)
dummy_df = dummy_df.assign(Honors=dummy_df[Honors].max(axis=1)).drop(Honors, axis=1)
dummy_df.shape

(36594, 88)
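To make the collapsing step concrete, here is a small sketch on toy data. The frame `toy`, the dictionary `groups` and the loop are illustrative assumptions; the notebook itself uses one `assign(...).drop(...)` line per group.

```python
import pandas as pd

# Toy 0/1 dummy columns (hypothetical values)
toy = pd.DataFrame({
    "ba":  [1, 0, 0],
    "bsc": [1, 0, 1],
    "mba": [0, 1, 0],
    "d":   [0, 0, 1],   # unclear label, left uncollapsed
})

groups = {"Bachelors": ["ba", "bsc"], "Masters": ["mba"]}

collapsed = toy.copy()
for group_name, members in groups.items():
    # Row-wise maximum: 1 if the person holds any degree in the group
    collapsed[group_name] = collapsed[members].max(axis=1)
    collapsed = collapsed.drop(columns=members)

# For comparison: a row-wise sum would count degrees instead of flagging presence
bachelors_sum = toy[["ba", "bsc"]].sum(axis=1)

print(collapsed)
```

The first row holds both a `ba` and a `bsc`: the maximum keeps `Bachelors` at 1, whereas the sum would give 2. Keeping the values at 0 or 1 matters later, because the ranking step relies on ties between columns.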

**Next Step...**

We select only the columns we want from the dummy dataframe. In essence, we are leaving out all the unclear instances that cannot be accurately assigned to an Overall Degree Type.

In [23]:
degrees_df = dummy_df[['Bachelors', 'Masters', 'Juris', 'Diploma', 'Doctorate', 'Executive', 'Honors', 'certificate', 'phd']]
degrees_df.shape

(36594, 9)

### Merging the dummy dataframe and no_dup dataframe

In [24]:
# resetting and dropping 'level_1' index that was created from apply function when grouping the dataframe initially.
# The 'level_0' field will be used for merging
no_dup = no_dup.reset_index()
no_dup = no_dup.drop(labels=['level_1'], axis=1)
no_dup.head(5)

Unnamed: 0,level_0,Degree_Type,org_uuid,person_uuid
0,0,"phd,bsc",d2d0cb83-b874-c5d7-c7f7-fb77613cc95b,00026df9-9254-269d-40b1-549e9529550d
1,1,aa,14658850-0cc9-15f8-62f3-a8c532ea6c61,000497ac-d3f9-7969-6c8b-b4050c8efc04
2,2,masters,d2de0c01-397d-b4f1-8575-9b5e74e6b6b8,000575b8-eac0-66b1-2a16-03c08c2b9f66
3,3,mba,2aec3826-0f75-1f21-326a-5dbca9d5ff15,0005da7e-2311-9002-7756-ed2f2734e057
4,4,mba,bd4c4326-ef34-d5d9-b689-0c0b0a6ba03c,0005da7e-2311-9002-7756-ed2f2734e057


In [25]:
# Resetting the index and dropping the 'level_1' index that was created by the apply function when grouping the dataframe initially.
# The 'level_0' field will be used for merging, as it is present in both dataframes.
degrees_df = degrees_df.reset_index()
degrees_df = degrees_df.drop(labels=['level_1'], axis=1)
degrees_df.head(5)

Unnamed: 0,level_0,Bachelors,Masters,Juris,Diploma,Doctorate,Executive,Honors,certificate,phd
0,0,1,0,0,0,0,0,0,0,1
1,1,1,0,0,0,0,0,0,0,0
2,2,0,1,0,0,0,0,0,0,0
3,3,0,1,0,0,0,0,0,0,0
4,4,0,1,0,0,0,0,0,0,0


We now have the two dataframes we want to merge. We will call the resulting dataframe 'result_df'. It will be used to find the Highest Degree of a founder.

In [26]:
# Merging dataframes on the 'level_0' field
result_df = no_dup.merge(degrees_df, on='level_0', how='right')
result_df = result_df.drop(labels=['level_0'], axis=1)
result_df.head(5)

Unnamed: 0,Degree_Type,org_uuid,person_uuid,Bachelors,Masters,Juris,Diploma,Doctorate,Executive,Honors,certificate,phd
0,"phd,bsc",d2d0cb83-b874-c5d7-c7f7-fb77613cc95b,00026df9-9254-269d-40b1-549e9529550d,1,0,0,0,0,0,0,0,1
1,aa,14658850-0cc9-15f8-62f3-a8c532ea6c61,000497ac-d3f9-7969-6c8b-b4050c8efc04,1,0,0,0,0,0,0,0,0
2,masters,d2de0c01-397d-b4f1-8575-9b5e74e6b6b8,000575b8-eac0-66b1-2a16-03c08c2b9f66,0,1,0,0,0,0,0,0,0
3,mba,2aec3826-0f75-1f21-326a-5dbca9d5ff15,0005da7e-2311-9002-7756-ed2f2734e057,0,1,0,0,0,0,0,0,0
4,mba,bd4c4326-ef34-d5d9-b689-0c0b0a6ba03c,0005da7e-2311-9002-7756-ed2f2734e057,0,1,0,0,0,0,0,0,0


In [27]:
result_df.shape

(36594, 12)

**Next Step...**

We want to remove any person who does not have a degree classified under one of these Overall Degree Types. We do this by transposing the degree columns and calling any(), which keeps only the rows with at least one non-zero value.

We have also created a list called 'degree_cols', which holds the names of all the Overall Degree Types ranked from highest degree to lowest. This will be used later in the preparation to determine which degrees are higher than others. For example, we classify a PhD as a higher degree than a Masters.

In [28]:
result_df.rename(columns={'phd': 'PhD'}, inplace=True) # renaming of the phd column to PhD
# Overall Degree Types (column names) ranked from highest to lowest type
degree_cols = ['PhD','Doctorate','Executive','Masters','Juris','Honors','Bachelors','Diploma','certificate']
# only instances where a value for degree_cols exists; copy() so a new column can be added safely later
hdegree_df = result_df[(result_df[degree_cols].T != 0).any()].copy()
hdegree_df.shape

(30756, 12)
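The row filter can be read as "keep a row if any degree column is non-zero". The sketch below, on hypothetical toy data, shows that the transpose-based mask gives the same result as calling `any(axis=1)` directly; the names `toy`, `cols`, `mask_transpose` and `mask_axis` are assumptions for illustration.

```python
import pandas as pd

cols = ["PhD", "Masters", "Bachelors"]

# Toy degree indicators; row "b" has no classified degree
toy = pd.DataFrame(
    {"PhD": [1, 0, 0], "Masters": [0, 0, 1], "Bachelors": [1, 0, 1]},
    index=["a", "b", "c"],
)

# Approach in the style of the notebook: transpose, then any() down each column
mask_transpose = (toy[cols].T != 0).any()

# Equivalent formulation without the transpose
mask_axis = (toy[cols] != 0).any(axis=1)

# .copy() gives an independent dataframe that can be modified safely
kept = toy[mask_axis].copy()
print(kept)
```

Both masks are boolean Series indexed by row label, so either can be used to select rows; the `axis=1` form avoids building a transposed copy of the data.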

In [29]:
hdegree_df.head(5)

Unnamed: 0,Degree_Type,org_uuid,person_uuid,Bachelors,Masters,Juris,Diploma,Doctorate,Executive,Honors,certificate,PhD
0,"phd,bsc",d2d0cb83-b874-c5d7-c7f7-fb77613cc95b,00026df9-9254-269d-40b1-549e9529550d,1,0,0,0,0,0,0,0,1
1,aa,14658850-0cc9-15f8-62f3-a8c532ea6c61,000497ac-d3f9-7969-6c8b-b4050c8efc04,1,0,0,0,0,0,0,0,0
2,masters,d2de0c01-397d-b4f1-8575-9b5e74e6b6b8,000575b8-eac0-66b1-2a16-03c08c2b9f66,0,1,0,0,0,0,0,0,0
3,mba,2aec3826-0f75-1f21-326a-5dbca9d5ff15,0005da7e-2311-9002-7756-ed2f2734e057,0,1,0,0,0,0,0,0,0
4,mba,bd4c4326-ef34-d5d9-b689-0c0b0a6ba03c,0005da7e-2311-9002-7756-ed2f2734e057,0,1,0,0,0,0,0,0,0


**Next Step...**

As mentioned previously, our aim is to find the Highest Degree of a founder. The dataframe is in an appropriate format now to do this. We will be using the **idxmax** function to achieve this. 

With the axis set to 1, idxmax works across each row and returns the name of the column with the highest value. If several columns share the highest value, it returns the first of them in column order, which here is the order of the ranked list we defined previously ('degree_cols').

Thus, if a founder has a value for Bachelors and PhD equal to 1, then the column name of PhD will be selected. We will save this name in a new column called "Highest_Degree".

In [30]:
# Finds the name of the column with the highest value, searching in ranked degree_cols order (explained in the markdown above). axis=1 works row by row.
hdegree_df['Highest_Degree'] = hdegree_df[degree_cols].idxmax(axis=1)
hdegree_df.head(5)

Unnamed: 0,Degree_Type,org_uuid,person_uuid,Bachelors,Masters,Juris,Diploma,Doctorate,Executive,Honors,certificate,PhD,Highest_Degree
0,"phd,bsc",d2d0cb83-b874-c5d7-c7f7-fb77613cc95b,00026df9-9254-269d-40b1-549e9529550d,1,0,0,0,0,0,0,0,1,PhD
1,aa,14658850-0cc9-15f8-62f3-a8c532ea6c61,000497ac-d3f9-7969-6c8b-b4050c8efc04,1,0,0,0,0,0,0,0,0,Bachelors
2,masters,d2de0c01-397d-b4f1-8575-9b5e74e6b6b8,000575b8-eac0-66b1-2a16-03c08c2b9f66,0,1,0,0,0,0,0,0,0,Masters
3,mba,2aec3826-0f75-1f21-326a-5dbca9d5ff15,0005da7e-2311-9002-7756-ed2f2734e057,0,1,0,0,0,0,0,0,0,Masters
4,mba,bd4c4326-ef34-d5d9-b689-0c0b0a6ba03c,0005da7e-2311-9002-7756-ed2f2734e057,0,1,0,0,0,0,0,0,0,Masters
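The tie-breaking behaviour of `idxmax` is what turns the ranked column list into a "highest degree" rule. The following sketch on hypothetical toy data illustrates this; `ranked_cols`, `toy` and the other names are assumptions chosen for the example.

```python
import pandas as pd

ranked_cols = ["PhD", "Masters", "Bachelors"]  # highest to lowest

toy = pd.DataFrame({
    "Bachelors": [1, 1, 0],
    "Masters":   [0, 1, 1],
    "PhD":       [1, 0, 0],
})

# Columns are selected in ranked order, so ties resolve to the highest degree
highest = toy[ranked_cols].idxmax(axis=1)

# Reversing the column order changes which tied column comes first
lowest_first = toy[ranked_cols[::-1]].idxmax(axis=1)

# A row with no degree at all is also a tie (all zeros), so idxmax would
# return the first column; this is why such rows are removed beforehand
all_zero = pd.DataFrame({"PhD": [0], "Masters": [0], "Bachelors": [0]})
zero_result = all_zero[ranked_cols].idxmax(axis=1)

print(highest.tolist())
```

The result depends on the order of the columns passed to `idxmax`, not on the order in which they are stored in the dataframe, so the ranking lives entirely in the list.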


**Next Step...**

We now have all the essential information we set out to obtain. We only need the following columns from the dataframe above: 
- person_uuid (unique person id)
- org_uuid (unique organisation id)
- Highest_Degree

The other columns are unnecessary for our preparation.

In [31]:
# only relevant columns selected. Other columns dropped.
hdegree_df = hdegree_df[['person_uuid','org_uuid','Highest_Degree']]
hdegree_df.head(5)

Unnamed: 0,person_uuid,org_uuid,Highest_Degree
0,00026df9-9254-269d-40b1-549e9529550d,d2d0cb83-b874-c5d7-c7f7-fb77613cc95b,PhD
1,000497ac-d3f9-7969-6c8b-b4050c8efc04,14658850-0cc9-15f8-62f3-a8c532ea6c61,Bachelors
2,000575b8-eac0-66b1-2a16-03c08c2b9f66,d2de0c01-397d-b4f1-8575-9b5e74e6b6b8,Masters
3,0005da7e-2311-9002-7756-ed2f2734e057,2aec3826-0f75-1f21-326a-5dbca9d5ff15,Masters
4,0005da7e-2311-9002-7756-ed2f2734e057,bd4c4326-ef34-d5d9-b689-0c0b0a6ba03c,Masters


### Merging the highest degree dataframe ('hdegree_df') and original founders dataframe ('founder_df')

We can now merge the Highest Degree information obtained from our transformations with the original founders dataframe. We do this using the unique person id and unique organisation id we have for each founder. This is one of the final steps in preparing this dataset.

In [32]:
# dataset2_df is the dataframe resulting from merging founder_df and hdegree_df
# The dataframes are merged on the unique person and organisation ids
dataset2_df = pd.merge(founder_df, hdegree_df, how='left', on=['person_uuid','org_uuid'])
dataset2_df.head(5)

Unnamed: 0,first_name,last_name,gender,company_name,funding_rounds,funding_total_usd,primary_role,country_code_y,state_code_y,city_y,title,job_type,subject,degree_type,person_uuid,degree_uuid,institution_uuid,org_uuid,Highest_Degree
0,Steve,Jobs,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Co-Founder,executive,unknown,dropout,2b3a6b34-ad65-8e1e-29ff-d267f42530e0,be0638c8-653e-5f5e-2845-2c99dd3b6abe,76cd719f-af9e-7984-a6a6-ef970b52515d,7063d087-96b8-2cc1-ee88-c221288acc2a,
1,Steve,Wozniak,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Co-founder,executive,EE & CS,bs,f3abe539-8db3-57e4-0f4d-de54a78eaf68,fe3eb345-b465-84ad-45d7-448f8f7a44e5,10f9a25b-9675-2281-486e-a52955c706df,7063d087-96b8-2cc1-ee88-c221288acc2a,Bachelors
2,Kevin,Harvey,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Founder,executive,Engineering,bs,e7f5c146-66c5-fba4-64cb-8ffd422899d8,0dee09e8-13b6-50ee-3e17-1343036b2eed,c3144da5-8618-2e95-3a13-60417220da5e,7063d087-96b8-2cc1-ee88-c221288acc2a,Bachelors
3,Armas,Markkula,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Founder,executive,Electrical Engineering,bs,56e8a800-5c37-7599-5eb3-b815aa6acd30,29b2a7bc-4628-0e5d-53d1-d0af77d3de33,867f0af5-a1d0-143d-bbed-5cc252ca40d6,7063d087-96b8-2cc1-ee88-c221288acc2a,Masters
4,Armas,Markkula,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,Founder,executive,Electrical Engineering,ms,56e8a800-5c37-7599-5eb3-b815aa6acd30,9da52706-0933-81f3-5be3-5ae30747612e,867f0af5-a1d0-143d-bbed-5cc252ca40d6,7063d087-96b8-2cc1-ee88-c221288acc2a,Masters


In [33]:
dataset2_df.shape

(54113, 19)
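Since this is a left merge, founders without a classified degree keep their rows but receive a missing Highest_Degree. A minimal sketch of how this could be checked is shown below, using hypothetical toy frames; `founders`, `degrees` and `unmatched` are illustrative names, and the `validate` option assumes that each person and organisation pair appears at most once on the degree side.

```python
import pandas as pd

# Toy stand-ins for founder_df and hdegree_df (hypothetical values)
founders = pd.DataFrame({
    "person_uuid": ["p1", "p2", "p3"],
    "org_uuid":    ["o1", "o1", "o2"],
    "title":       ["Founder", "Co-Founder", "Founder"],
})
degrees = pd.DataFrame({
    "person_uuid":    ["p1", "p3"],
    "org_uuid":       ["o1", "o2"],
    "Highest_Degree": ["PhD", "Masters"],
})

merged = founders.merge(
    degrees,
    how="left",
    on=["person_uuid", "org_uuid"],
    indicator=True,           # adds a '_merge' column describing each row's origin
    validate="many_to_one",   # raises if the degree side has duplicate keys
)

# Rows found only in the founders frame have no Highest_Degree
unmatched = int((merged["_merge"] == "left_only").sum())
print(unmatched)
```

Counting the `left_only` rows gives the number of founder rows that will carry a missing Highest_Degree into later analysis.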

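This merged dataset2_df dataframe concludes our preparation. The main aim was to transform the degree type field to get the Highest Degree of each founder, which we achieved. Because the merge is a left merge, founders without a classified degree (for example, a dropout) keep their row but have a missing Highest_Degree.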

### Saving resulting dataset for 250_prep_dataset2 in pickle
We chose to use pickle files for saving and loading our dataframes because they load and save faster than csv files and similar formats.

In [34]:
dataset2_df.to_pickle("../../data/processed/250_dataset2.pkl")
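Besides speed, a pickle round trip also preserves column dtypes, which a csv round trip does not always do. The sketch below illustrates this with a temporary directory and a hypothetical toy dataframe; the file names and the categorical column are assumptions for the example only.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe with a categorical column (hypothetical values)
toy = pd.DataFrame({
    "person_uuid": ["p1", "p2"],
    "funding_total_usd": [1.5e6, 2.0e5],
    "Highest_Degree": pd.Categorical(["PhD", "Masters"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "toy.pkl")
    csv_path = os.path.join(tmp_dir, "toy.csv")

    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# The pickle round trip returns an identical dataframe, dtypes included
same_after_pickle = from_pickle.equals(toy)

# The csv round trip loses the categorical dtype
csv_kept_categorical = isinstance(from_csv["Highest_Degree"].dtype, pd.CategoricalDtype)

print(same_after_pickle, csv_kept_categorical)
```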