# 300_validation_d1

## Purpose
In this notebook we continue preparing and cleaning our dataset, focusing on the location aspect of the first research question:
* RQ1: Is there a correlation between a company's industry and location and the amount of funding it receives?

## Datasets
* _Input_: 200_dataset1.pkl
* _Output_: 300_dataset1.pkl



In [1]:
import os
import re
import sys
import hashlib
import pandas as pd
import numpy as np
%matplotlib inline
pd.set_option('display.max_columns', None)
module_path = os.path.abspath(os.path.join('../../data/..'))
if module_path not in sys.path:
    sys.path.append(module_path)

## Dataset 1 Validation

As mentioned at the start of this document, we are focusing on the first research question, for which dataset1 is used.
* Printing the first few rows of the dataframe before running any validation checks shows us what we are dealing with and gives the validation a sense of direction.

In [2]:
ds1_df = pd.read_pickle('../../data/processed/200_dataset1.pkl')
ds1_df.head(5)

Unnamed: 0,company_name,roles,country_code,state_code,region,city,status,category_list,category_group_list,funding_rounds,funding_total_usd,last_funding_on,founded_on,employee_count,org_uuid,primary_role,type,Administrative Services,Advertising,Agriculture and Farming,Biotechnology,Clothing and Apparel,Commerce and Shopping,Community and Lifestyle,Consumer Goods,Content and Publishing,Design,Education,Energy,Events,Food and Beverage,Government and Military,Hardware,Health Care,Manufacturing,Media and Entertainment,Music and Audio,Natural Resources,Navigation and Mapping,Platforms,Privacy and Security,Professional Services,Real Estate,Sales and Marketing,Science and Engineering,Sports,Sustainability,Transportation,Travel and Tourism,Video,Technology,Finance,Communication
0,Intel,"company,investor",USA,CA,SF Bay Area,Santa Clara,ipo,"Hardware,Manufacturing,Product Design,Semicond...","Design,Hardware,Manufacturing,Science and Engi...",1,2510000.0,1968-07-31,1968-07-18,10000+,1e4f199c-363b-451b-a164-f94571075ee5,company,organization,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1,Intercomp,company,USA,OH,Cleveland,Medina,operating,"Hardware,Software","Hardware,Software",1,549000.0,1970-12-31,1968-01-01,101-250,6681b1b0-0cea-6a4a-820d-60b15793fa66,company,organization,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,Microsoft,"company,investor",USA,WA,Seattle,Redmond,ipo,"Cloud Computing,Collaboration,Consumer Electro...","Consumer Electronics,Hardware,Internet Service...",1,1000000.0,1981-09-01,1975-04-04,10000+,fd80725f-53fc-7009-9878-aeecf1e9ffbb,company,organization,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
3,Compaq,"company,investor",USA,CA,SF Bay Area,Palo Alto,acquired,"Hardware,Information Technology,Software","Hardware,Information Technology,Software",1,1500000.0,1982-02-14,1982-02-14,11-50,10a3b2fd-b142-046b-7d8f-3b1aa4877aca,company,organization,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,Toyota Motor Corporation,"company,investor",JPN,,,,ipo,"Automotive,Mobile,Transportation","Mobile,Transportation",1,42000000.0,1982-04-14,1937-08-28,10000+,12b90373-ab49-a56a-4b4e-c7b3e9236faf,company,organization,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1


Printing a list of the column names, as shown below, helps with further validation because we can copy and paste column names into later cells. It also gives us a different perspective on the dataframe we are dealing with.

In [3]:
list(ds1_df)

['company_name',
 'roles',
 'country_code',
 'state_code',
 'region',
 'city',
 'status',
 'category_list',
 'category_group_list',
 'funding_rounds',
 'funding_total_usd',
 'last_funding_on',
 'founded_on',
 'employee_count',
 'org_uuid',
 'primary_role',
 'type',
 'Administrative Services',
 'Advertising',
 'Agriculture and Farming',
 'Biotechnology',
 'Clothing and Apparel',
 'Commerce and Shopping',
 'Community and Lifestyle',
 'Consumer Goods',
 'Content and Publishing',
 'Design',
 'Education',
 'Energy',
 'Events',
 'Food and Beverage',
 'Government and Military',
 'Hardware',
 'Health Care',
 'Manufacturing',
 'Media and Entertainment',
 'Music and Audio',
 'Natural Resources',
 'Navigation and Mapping',
 'Platforms',
 'Privacy and Security',
 'Professional Services',
 'Real Estate',
 'Sales and Marketing',
 'Science and Engineering',
 'Sports',
 'Sustainability',
 'Transportation',
 'Travel and Tourism',
 'Video',
 'Technology',
 'Finance',
 'Communication']
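
The column listing can also be used as a quick schema check before validation starts. The sketch below is illustrative only: `toy_df` and `required_cols` are hypothetical stand-ins, with the required columns chosen as the location columns this notebook works on.

```python
import pandas as pd

# Toy stand-in for ds1_df (hypothetical); the real dataframe has 53 columns.
toy_df = pd.DataFrame(columns=["company_name", "country_code", "region", "city"])

# Location columns this notebook relies on.
required_cols = ["country_code", "state_code", "region", "city"]

# Collect any required column that is absent from the dataframe.
missing_cols = [col for col in required_cols if col not in toy_df.columns]
print(missing_cols)
```

An empty list would mean every required column is present.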

It is important to consider null values throughout the preparation of our datasets. As we saw in the previous notebooks, we removed rows with null values in columns such as the funding values. In some cases null values should be removed because they hinder the analysis or carry no useful information. In other cases we can handle the null values and keep the rows in the dataset.
* Below we are checking the number of null values in each column.

In [4]:
ds1_df.isnull().sum()

company_name                   0
roles                         60
country_code                1475
state_code                 31579
region                     15499
city                        2975
status                         0
category_list                  0
category_group_list            0
funding_rounds                 0
funding_total_usd              0
last_funding_on                2
founded_on                  3470
employee_count                 0
org_uuid                       0
primary_role                   0
type                           0
Administrative Services        0
Advertising                    0
Agriculture and Farming        0
Biotechnology                  0
Clothing and Apparel           0
Commerce and Shopping          0
Community and Lifestyle        0
Consumer Goods                 0
Content and Publishing         0
Design                         0
Education                      0
Energy                         0
Events                         0
Food and B
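
Raw null counts are easier to judge as a share of all rows. The following is a minimal sketch on a toy dataframe (`toy_df` and its values are hypothetical, not the real data); on the notebook's data the same expression would be applied to `ds1_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ds1_df (hypothetical values).
toy_df = pd.DataFrame({
    "company_name": ["A", "B", "C", "D"],
    "region": ["London", np.nan, np.nan, "Boston"],
    "city": ["London", np.nan, "Leeds", "Boston"],
})

# isnull() yields booleans, so mean() is the fraction of nulls per column.
null_share = toy_df.isnull().mean().sort_values(ascending=False)
print(null_share)
```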

As seen in the previous step, a large number of rows in our dataframe have null values in the 'region' column. It is important to have a general idea of where these null values are located. As we can see below, a large share of them belong to companies in the USA.
* Below we are checking the number of null values in the 'region' column for each country.

In [5]:
ds1_df[ds1_df['region'].isnull()]['country_code'].value_counts()

USA    6805
GBR    1065
CHN     880
IND     416
CAN     358
ISR     320
FRA     310
AUS     284
RUS     244
JPN     211
ESP     207
SGP     186
BRA     183
DEU     181
ITA     149
CHL     144
SWE     125
HKG     111
NLD     110
MEX      86
KOR      79
TUR      71
ARG      67
IRL      65
UKR      65
ISL      61
ARE      58
FIN      58
CHE      58
DNK      54
       ... 
MLI       2
GIB       2
BWA       2
SLE       2
SEN       2
ZMB       2
ZWE       2
LKA       2
COD       2
PRY       2
BIH       1
COG       1
LIE       1
HND       1
IMN       1
CMR       1
HTI       1
MDG       1
SLV       1
RWA       1
IRQ       1
STP       1
TGO       1
SYC       1
MDA       1
SRB       1
DJI       1
BAH       1
DMA       1
DOM       1
Name: country_code, Length: 131, dtype: int64

Delving deeper, we can look at the states in which these null 'region' values occur most frequently. As we can see below, two states in particular, 'CA' and 'NY', account for a large number of them.
* Below we are checking the number of null values in the 'region' column for each state.

In [6]:
ds1_df[ds1_df['region'].isnull()]['state_code'].value_counts()

CA    2669
NY     918
TX     355
MA     297
WA     225
FL     223
IL     193
ON     176
CO     144
OH     132
PA     123
GA     108
NC      89
TN      85
NJ      81
VA      80
BC      70
DC      70
MD      67
AZ      64
OR      61
UT      61
MN      58
SC      53
NV      53
QC      52
MO      50
MI      47
IN      41
WI      41
      ... 
DE      26
AL      23
NE      21
HI      21
LA      19
AB      17
OK      16
KS      15
NM      14
RI      14
NS      13
ID      12
ME      12
AR      11
IA      11
NH      10
NB       7
VI       6
VT       6
ND       5
WY       4
AK       4
MT       3
MB       3
SD       2
NL       2
MS       1
PE       1
WV       1
SK       1
Name: state_code, Length: 62, dtype: int64
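
Raw counts favour states with many companies, so it can help to look at the share of rows with a missing region as well. In the outputs of this notebook, CA has 2,669 missing regions out of 15,548 rows (about 17%) and NY has 918 out of 5,029 (about 18%). The sketch below shows one way to compute both figures per state; `toy_df` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ds1_df before nulls are replaced (hypothetical values).
toy_df = pd.DataFrame({
    "state_code": ["CA", "CA", "CA", "NY", "NY", "TX"],
    "region": ["SF Bay Area", np.nan, np.nan, "New York City", np.nan, "Austin"],
})

# For each state: how many regions are missing, and what share of its rows.
summary = (
    toy_df.assign(region_missing=toy_df["region"].isnull())
    .groupby("state_code")["region_missing"]
    .agg(missing="sum", share="mean")
    .sort_values("missing", ascending=False)
)
print(summary)
```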

Due to the nature of these null values, we decided not to remove them. The rows with null values in the 'region', 'country_code', 'state_code' and 'city' columns still have values in the funding and category columns, so we thought it best to keep these rows in our dataset, as removing them might skew our analysis. Instead we replaced the null values with "Unknown".
* Below we are replacing the null values in each of the 'region', 'country_code', 'state_code' and 'city' columns with "Unknown".

In [7]:
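# Replace missing location values with an explicit 'Unknown' label.
# Assigning back avoids inplace=True on a column selection, which may not update ds1_df in recent pandas versions.
location_cols = ['region', 'country_code', 'state_code', 'city']
ds1_df[location_cols] = ds1_df[location_cols].fillna('Unknown')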

After performing the above step, it is important to double check that the null values have been replaced successfully.
* Below we are checking that no null values remain in the four location columns.

In [8]:
ds1_df.isnull().sum()

company_name                  0
roles                        60
country_code                  0
state_code                    0
region                        0
city                          0
status                        0
category_list                 0
category_group_list           0
funding_rounds                0
funding_total_usd             0
last_funding_on               2
founded_on                 3470
employee_count                0
org_uuid                      0
primary_role                  0
type                          0
Administrative Services       0
Advertising                   0
Agriculture and Farming       0
Biotechnology                 0
Clothing and Apparel          0
Commerce and Shopping         0
Community and Lifestyle       0
Consumer Goods                0
Content and Publishing        0
Design                        0
Education                     0
Energy                        0
Events                        0
Food and Beverage             0
Governme
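
The visual check above can be backed by an explicit assertion, so that the notebook fails loudly if a location column still contains nulls. This is a sketch on a toy dataframe; `toy_df` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ds1_df (hypothetical values).
toy_df = pd.DataFrame({
    "region": ["London", np.nan],
    "country_code": ["GBR", np.nan],
    "state_code": [np.nan, "CA"],
    "city": ["London", np.nan],
    "founded_on": ["2001-01-01", np.nan],
})

location_cols = ["region", "country_code", "state_code", "city"]
toy_df[location_cols] = toy_df[location_cols].fillna("Unknown")

# Fail loudly if any location column still holds a null.
remaining_nulls = toy_df[location_cols].isnull().sum()
assert (remaining_nulls == 0).all(), f"Nulls remain: {remaining_nulls.to_dict()}"

# Columns outside location_cols are deliberately left untouched.
untouched_nulls = toy_df["founded_on"].isnull().sum()
print(remaining_nulls, untouched_nulls)
```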

After substituting the null values in the 'region' column, we can check the number of rows in each region. 'Unknown' is now the largest region in our dataset, with 15,499 rows. There is not much we can do to avoid this, as the data was not provided, and removing these rows would cost us a large portion of our data.
* Below we are checking the number of rows in each region.

In [9]:
ds1_df['region'].value_counts()

Unknown             15499
SF Bay Area          8948
New York City        3596
London               3040
Boston               2657
Los Angeles          1725
Seattle              1246
Washington, D.C.     1081
San Diego             942
Chicago               925
Denver                887
Toronto               846
Austin                835
Paris                 828
Tel Aviv              790
Atlanta               691
Dallas                664
Philadelphia          584
Newark                578
Bangalore             550
Anaheim               541
Singapore             481
Berlin                470
New Delhi             466
Raleigh               457
Beijing               452
GBR - Other           436
Minneapolis           430
Hartford              413
Salt Lake City        411
                    ...  
BGD - Other             1
Terni                   1
Penang                  1
Rimini                  1
Ayr                     1
Bekasi Kota             1
Tbilisi                 1
Kiel        
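
To put a number on how dominant "Unknown" is, the counts can be normalised into shares. With the figures in this notebook, 15,499 of 78,357 rows is roughly 19.8%. The sketch below shows the calculation on a toy series; `toy_region` and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in for ds1_df['region'] after nulls were replaced (hypothetical values).
toy_region = pd.Series([
    "Unknown", "SF Bay Area", "Unknown", "London", "SF Bay Area",
    "Unknown", "Boston", "London", "Paris", "Berlin",
])

# normalize=True returns each value's share of all rows instead of a raw count.
region_share = toy_region.value_counts(normalize=True)

# .get avoids a KeyError if no row is labelled 'Unknown'.
unknown_share = region_share.get("Unknown", 0.0)
print(f"Unknown share: {unknown_share:.1%}")
```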

Similarly, it is helpful to check the number of rows for each country after replacing the null values with "Unknown". Unlike the 'region' column, "Unknown" is not dominant here: it accounts for only 1,475 rows, so it should not have a major effect on our analysis.
* Below we are checking the number of rows for each country code.


In [10]:
ds1_df['country_code'].value_counts()

USA        44348
GBR         5788
CAN         2534
IND         2481
CHN         2144
FRA         1755
Unknown     1475
DEU         1380
ISR         1295
AUS         1062
ESP         1058
SWE          786
NLD          672
SGP          667
ITA          660
RUS          633
IRL          595
BRA          585
JPN          573
CHE          507
KOR          479
FIN          433
DNK          381
CHL          345
HKG          322
BEL          303
MEX          242
POL          233
ISL          231
ARG          224
           ...  
DOM            2
DZA            2
SYC            2
HTI            2
MLI            2
MOZ            2
SLE            2
BIH            2
HND            2
DJI            1
BRB            1
LBR            1
GRD            1
SOM            1
MAF            1
MAC            1
MNE            1
IRQ            1
UZB            1
MTQ            1
GGY            1
DMA            1
AGO            1
MRT            1
COG            1
MDG            1
GUM            1
STP           

Here we repeat the previous check for the 'state_code' column. After substituting the null values, 'Unknown' is now the largest state_code in our dataset, with 31,579 rows. There is not much we can do to avoid this, as the data was not provided. One likely explanation is that not all of our data comes from the USA, and many countries outside the USA do not have state codes. Removing these rows would cost us a large portion of our data.
* Below we are checking the number of rows for each state code.

In [11]:
ds1_df['state_code'].value_counts()

Unknown    31579
CA         15548
NY          5029
MA          3109
TX          2370
WA          1526
FL          1395
ON          1221
IL          1205
CO          1153
PA          1133
VA           851
GA           840
NC           764
NJ           745
OH           735
MD           701
MN           551
BC           550
TN           525
UT           481
CT           480
AZ           467
MI           457
OR           454
QC           410
MO           355
WI           316
DC           312
IN           308
           ...  
AB           165
NH           155
DE           148
AL           134
KS           131
RI           121
IA           110
NE           108
LA           106
ME            97
NM            96
ID            93
OK            93
AR            85
NS            84
HI            80
VT            65
MT            44
MS            31
WY            30
MB            25
NL            25
ND            25
NB            24
SD            20
WV            16
AK            15
SK            

Once again we repeat the previous check, this time for the 'city' column. Unlike the 'state_code' column, "Unknown" is not dominant here: with 2,975 rows it is only the fourth most common value, so it should not have a major effect on our analysis.
* Below we are checking the number of rows for each city.

In [12]:
ds1_df['city'].value_counts()

San Francisco          4342
New York               3942
London                 3039
Unknown                2975
Los Angeles             938
Austin                  934
Seattle                 919
Paris                   886
Boston                  883
Palo Alto               848
Cambridge               825
Chicago                 810
Beijing                 779
San Diego               751
Toronto                 681
San Jose                657
Mountain View           656
Singapore               590
Atlanta                 542
Berlin                  536
Sunnyvale               520
Bangalore               501
Santa Clara             472
Moscow                  450
Mumbai                  446
Dublin                  434
Vancouver               431
Shanghai                425
Tel Aviv                425
San Mateo               409
                       ... 
Saint-alexandre           1
Basehor                   1
North Hampton             1
Krabi                     1
Llanfairfechan      

Next we look at individual countries. This gives us a deeper view of the dominant countries in our dataset, namely USA and GBR.
* Below we are creating a dataframe containing only companies in GBR (United Kingdom).

In [13]:
ds1_df_region = ds1_df[ds1_df['country_code'] == 'GBR']

As we saw above, there are many "Unknown" values in the 'region' column across the whole dataset. Checking GBR on its own helps us see whether a single country holds the majority of these "Unknown" values. The output below shows that this is not the case for GBR, which has 1,065 of them.
* Below we are counting the number of rows for each region in GBR.

In [14]:
ds1_df_region['region'].value_counts()

London                 3035
Unknown                1065
GBR - Other             436
Edinburgh               110
Manchester               93
Bristol                  66
Newcastle                65
Glasgow                  56
Sheffield                43
Liverpool                42
Belfast                  40
Nottingham               34
Birmingham               34
Leeds                    31
Cardiff                  26
Bath                     20
Aberdeen                 20
Coventry                 19
Kent                     15
Gateshead                13
Middlesbrough            11
Durham                   11
Newbury                  10
Cheltenham               10
Watford                  10
Warrington               10
Newport                   9
Northampton               9
Livingston                9
Stockport                 8
                       ... 
Hungerford                1
Wellingborough            1
Ayr                       1
Farrington Gurney         1
Great Missenden     
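
Whether one country holds most of the "Unknown" regions can also be checked directly. In the outputs of this notebook, GBR accounts for 1,065 of the 15,499 "Unknown" regions (about 7%), against 6,805 (about 44%) for the USA. The sketch below computes each country's share of the "Unknown" rows on a toy dataframe; `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in for ds1_df after nulls were replaced (hypothetical values).
toy_df = pd.DataFrame({
    "country_code": ["USA", "USA", "USA", "GBR", "GBR", "CAN"],
    "region": ["Unknown", "Unknown", "Boston", "Unknown", "London", "Toronto"],
})

# Keep only rows whose region is 'Unknown', then take each country's share of them.
unknown_rows = toy_df[toy_df["region"] == "Unknown"]
unknown_by_country = unknown_rows["country_code"].value_counts(normalize=True)
print(unknown_by_country)
```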

Continuing with individual countries, we now look at the USA.
* Below we are creating a dataframe containing only companies in USA (United States of America).

In [15]:
ds1_df_country = ds1_df[ds1_df['country_code'] == 'USA']

As mentioned earlier in this notebook, many of the "Unknown" values for 'state_code' may come from countries outside the USA, which may not have state codes.
* Below we are counting the rows in the USA dataframe that have a 'state_code' value.

In [16]:
ds1_df_country['state_code'].value_counts().sum()

44348

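Because the null values were replaced with "Unknown" earlier, `value_counts().sum()` counts every row, including those labelled "Unknown". The 44,348 above is therefore simply the number of USA rows (it matches the USA count in the 'country_code' output), and the 78,357 below is the total number of rows in the worldwide dataset. Excluding the 31,579 "Unknown" values, 78,357 - 31,579 = 46,778 rows carry an actual state code. State codes do occur outside the USA as well: Canadian province codes such as ON, BC and QC appear in the 'state_code' counts above.
* Below we are counting the rows in the worldwide dataframe that have a 'state_code' value, including "Unknown".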

In [17]:
ds1_df['state_code'].value_counts().sum()

78357
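
Since "Unknown" is itself a value after the replacement step, `value_counts().sum()` returns the total row count. To count only actual state codes, the placeholder has to be excluded first. The sketch below shows this on a toy dataframe; `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in for ds1_df after nulls were replaced (hypothetical values).
toy_df = pd.DataFrame({
    "country_code": ["USA", "USA", "CAN", "GBR", "USA", "CAN"],
    "state_code": ["CA", "Unknown", "ON", "Unknown", "NY", "BC"],
})

# This counts every row, because 'Unknown' is an ordinary value.
all_rows = toy_df["state_code"].value_counts().sum()

# Boolean masks: rows with an actual state code, and rows from the USA.
has_state = toy_df["state_code"].ne("Unknown")
is_usa = toy_df["country_code"].eq("USA")

usa_with_state = int((has_state & is_usa).sum())
non_usa_with_state = int((has_state & ~is_usa).sum())
print(all_rows, usa_with_state, non_usa_with_state)
```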

In [18]:
ds1_df.shape

(78357, 53)

### Saving resulting dataset for 300_validation_d1 in pickle

This dataset will be used for the initial RQ1 analysis notebook.

In [19]:
ds1_df.to_pickle("../../data/processed/300_dataset1.pkl")
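
As an optional sanity check, the saved file can be read back and compared with the dataframe in memory. The sketch below does this with a toy dataframe and a temporary directory; `toy_df` and the file name are hypothetical and stand in for `ds1_df` and the real output path.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for ds1_df (hypothetical values).
toy_df = pd.DataFrame({
    "company_name": ["A", "B"],
    "region": ["London", "Unknown"],
    "funding_total_usd": [1000000.0, 549000.0],
})

# Write to a temporary location and read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dataset.pkl")
    toy_df.to_pickle(path)
    reloaded = pd.read_pickle(path)

# equals() compares shape, dtypes and values.
roundtrip_ok = reloaded.equals(toy_df)
print(roundtrip_ok)
```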