# Cleaning Merged Patents Data and Splitting It for Model Training/Testing

### Outline:

- Drop redundant columns
- Rename columns
- Add key features
- Clean University Assignment Features
- Data Dictionary
- Split Data
- Save Data

In [37]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import os

In [38]:
file_location = 'C:\\Users\\trent\\Documents\\Capstone'  # change as necessary
os.chdir(file_location)

In [39]:
patents_full = pd.read_csv('PATENTS_DATA.csv', dtype = {'GEOID':'str'}, low_memory = False)
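
Reading `GEOID` as a string matters because geographic identifiers often begin with a zero, which an integer parse would silently drop. A minimal sketch of the difference, using made-up rows rather than anything from `PATENTS_DATA.csv`:

```python
import io
import pandas as pd

# Illustrative CSV text; the values are invented for this sketch.
sample_csv = "patent_number,GEOID\nA1,01001\nA2,36061\n"

# Default parsing infers an integer column and drops the leading zero.
as_int = pd.read_csv(io.StringIO(sample_csv))

# Forcing a string dtype keeps the identifier exactly as written.
as_str = pd.read_csv(io.StringIO(sample_csv), dtype={"GEOID": "str"})

geoid_int_first = as_int["GEOID"].iloc[0]
geoid_str_first = as_str["GEOID"].iloc[0]
```

The string version is the one that can later be joined against other tables keyed on the same identifier.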

In [40]:
patents_full.columns

Index(['patent_number', 'assignee', 'grant_year', 'application_year',
       'application_number', 'GEOID', 'ipc_section', 'team_size', 'inventors',
       'men_inventors', 'women_inventors', 'already_granted',
       'assignee_uni_clean2', 'Institution', 'Control', 'level_r1', 'level_r2',
       'special_focus', 'Perc_Over25_LessNinthGrade',
       'Perc_Over25_SomeHighSchool', 'Perc_Over25_HighSchoolGrad',
       'Perc_Over25_SomeCollege', 'Perc_Over25_Assosciates',
       'Perc_Over25_Bachelors', 'Perc_Over25_Graduate', 'bea_region',
       'Agriculture_Forestry_Fishing_Hunting',
       'Mining_Quarrying_and_Oil_Gas_Extraction', 'Utilities', 'Construction',
       'Manufacturing', 'Wholesale_Trade', 'Retail_Trade',
       'Transportation_Warehousing', 'Information', 'Finance_Insurance',
       'Real_Estate_Rental_Leasing',
       'Professional_Scientific_and_Technical_Services',
       'Management_of_Companies_Enterprises',
       'Administrative_Support_Waste_Management_Remediation

In [41]:
patents_full.dtypes

patent_number          object
assignee               object
grant_year            float64
application_year        int64
application_number      int64
                       ...   
pop_gt_16               int64
pop_gt_16_lf            int64
pop_gt_16_lf_c          int64
Pop_Est                 int64
year                    int64
Length: 73, dtype: object

## Dropping Redundant Columns

In [43]:
patents_full.drop(['year','inventors'], axis = 1, inplace = True)

## Renaming Columns

In [44]:
patents_full.rename(columns = 
                      {'patent_number':'patent_num',
                      'grant_year':'grant_yr',
                      'application_year':'app_yr',
                      'application_number':'app_num',
                      'ipc_section':'ipc',
                      'level_r1':'r1',
                      'level_r2':'r2',
                      'Perc_Over25_LessNinthGrade':'Over25_Less9Grade',
                      'Perc_Over25_SomeHighSchool':'Over25_SomeHS',
                      'Perc_Over25_HighSchoolGrad':'Over25_HSGrad', 
                      'Perc_Over25_SomeCollege':'Over25_SomeCollege',
                      'Perc_Over25_Assosciates':'Over25_Assosc',
                      'Perc_Over25_Bachelors':'Over25_Bach',
                      'Perc_Over25_Graduate':'Over25_Grad',
                      'assignee_uni_clean2':'assignee_univ_map'}, inplace = True)
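
By default `rename` ignores mapping keys that are not present in the frame, so a misspelled source name (the keys must match the raw spelling exactly, e.g. `'Perc_Over25_Assosciates'`) would pass unnoticed. One possible safeguard, sketched on a toy frame with a deliberately wrong key, is the `errors="raise"` option:

```python
import pandas as pd

df = pd.DataFrame({"grant_year": [2015], "application_year": [2012]})

# Hypothetical mapping with one misspelled key, to show the failure mode.
mapping = {"grant_year": "grant_yr", "aplication_year": "app_yr"}

# Default behaviour: the misspelled key is skipped without any warning.
silent = df.rename(columns=mapping)

# errors="raise" turns a missing source column into a KeyError.
try:
    df.rename(columns=mapping, errors="raise")
    raised = False
except KeyError:
    raised = True
```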

## Creating Necessary Features

### Women Involvement in Patent

In [45]:
patents_full['women_involved'] = np.where(patents_full['women_inventors'] > 0, 1, 0)

In [46]:
patents_full['women_involved'].value_counts()

0    1303467
1     412290
Name: women_involved, dtype: int64
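
In the counts above, 412,290 of 1,715,757 patents (about 24%) have at least one woman inventor. The sketch below reproduces the same flag on invented data and shows two things worth checking: how a missing inventor count is treated, and how to get class shares directly.

```python
import numpy as np
import pandas as pd

# Toy data; the counts are invented for this sketch.
toy = pd.DataFrame({"women_inventors": [0, 2, 1, 0, np.nan]})

# Same construction as the cell above.
toy["women_involved"] = np.where(toy["women_inventors"] > 0, 1, 0)

# A missing count compares as False, so it is labelled 0 rather than NaN.
missing_count = toy["women_inventors"].isna().sum()

# Class shares instead of raw counts.
shares = toy["women_involved"].value_counts(normalize=True)
```

If `missing_count` were nonzero on the real data, those rows would be counted as having no women involved, which may or may not be the intended reading.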

## Cleaning University Assignments

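Each of the three research-university columns (`r1`, `r2`, `special_focus`) is a binary flag: 1 if the assignee is that type of research university, 0 otherwise. Missing values are filled with 0.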

In [47]:
# Assign the result back; inplace fillna on a single selected column is deprecated in recent pandas
patents_full[['r1','r2','special_focus']] = patents_full[['r1','r2','special_focus']].fillna(0)

In [48]:
patents_full[['r1','r2','special_focus']] = patents_full[['r1','r2','special_focus']].astype('int')
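
The fill and the integer cast can also be combined into one assignment. A minimal sketch on a toy frame, where NaN is assumed to mean the assignee had no matching university record:

```python
import numpy as np
import pandas as pd

flag_cols = ["r1", "r2", "special_focus"]

# Toy rows: NaN stands for an assignee with no matching university record.
toy = pd.DataFrame({
    "r1": [1.0, np.nan, np.nan],
    "r2": [np.nan, 1.0, np.nan],
    "special_focus": [np.nan, np.nan, np.nan],
})

# Fill and cast in one assignment, avoiding inplace calls on a column selection.
toy[flag_cols] = toy[flag_cols].fillna(0).astype("int")

# Sanity check: every flag is now strictly 0 or 1.
all_binary = bool(toy[flag_cols].isin([0, 1]).all().all())
```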

In [49]:
patents_full.columns

Index(['patent_num', 'assignee', 'grant_yr', 'app_yr', 'app_num', 'GEOID',
       'ipc', 'team_size', 'men_inventors', 'women_inventors',
       'already_granted', 'assignee_univ_map', 'Institution', 'Control', 'r1',
       'r2', 'special_focus', 'Over25_Less9Grade', 'Over25_SomeHS',
       'Over25_HSGrad', 'Over25_SomeCollege', 'Over25_Assosc', 'Over25_Bach',
       'Over25_Grad', 'bea_region', 'Agriculture_Forestry_Fishing_Hunting',
       'Mining_Quarrying_and_Oil_Gas_Extraction', 'Utilities', 'Construction',
       'Manufacturing', 'Wholesale_Trade', 'Retail_Trade',
       'Transportation_Warehousing', 'Information', 'Finance_Insurance',
       'Real_Estate_Rental_Leasing',
       'Professional_Scientific_and_Technical_Services',
       'Management_of_Companies_Enterprises',
       'Administrative_Support_Waste_Management_Remediation_Services',
       'Educational_Services', 'Health_Care_Social_Assistance',
       'Arts_Entertainment_and_Recreation', 'Accommodation_Food_Services',


In [50]:
patents_full.already_granted

0          1
1          1
2          1
3          1
4          1
          ..
1715752    0
1715753    0
1715754    0
1715755    0
1715756    0
Name: already_granted, Length: 1715757, dtype: int64

## Data Dictionary

- patent_num: Patent Number
- assignee: Assignee
- grant_yr: Grant Year
- app_yr: Application Year
- app_num: Application Number
- GEOID: GEOID
- ipc: International Patent Classification (IPC) section
- team_size: Number of inventors
- men_inventors: Number of men inventors for patent
- women_inventors: Number of women inventors for patent
- already_granted: 1/0 classification, 1 if patent has been granted
- assignee_univ_map: If assignee is a research university, the official university name it is mapped to
- Institution: If assignee is a research university, the name of the university
- Control: Public or private university
- r1: 1/0 classification, 1 if assignee is r1 research university: Very high research activity
- r2: 1/0 classification, 1 if assignee is r2 research university: High research activity
- special_focus: 1/0 classification, 1 if research university that only awards degrees in one area
- Over25_Less9Grade: % of GEOID over 25 years with less than 9th grade education
- Over25_SomeHS: % of GEOID over 25 years with some high school education
- Over25_HSGrad: % of GEOID over 25 years with high school diploma or equivalent
- Over25_SomeCollege: % of GEOID over 25 years with some college education
- Over25_Assosc: % of GEOID over 25 years with associate's degree
- Over25_Bach: % of GEOID over 25 years with bachelor's degree
- Over25_Grad: % of GEOID over 25 years with graduate degree
- bea_region: Bureau of Economic Analysis Region
- North American Industrial Classification (NAICS) Code Location Quotient:
  - Agriculture_Forestry_Fishing_Hunting
  - Mining_Quarrying_and_Oil_Gas_Extraction
  - Utilities
  - Construction   
  - Manufacturing  
  - Wholesale_Trade  
  - Retail_Trade
  - Transportation_Warehousing
  - Information
  - Finance_Insurance
  - Real_Estate_Rental_Leasing
  - Professional_Scientific_and_Technical_Services
  - Management_of_Companies_Enterprises
  - Administrative_Support_Waste_Management_Remediation_Services
  - Educational_Services
  - Health_Care_Social_Assistance
  - Arts_Entertainment_and_Recreation 
  - Accommodation_Food_Services
  - Other_Services_except_Public_Administration
  - Agriculture_Forestry_Fishing_Hunting_base
  - Mining_Quarrying_and_Oil_Gas_Extraction_base
  - Utilities_base
  - Construction_base
  - Manufacturing_base
  - Wholesale_Trade_base
  - Retail_Trade_base
  - Transportation_Warehousing_base
  - Information_base
  - Finance_Insurance_base
  - Real_Estate_Rental_Leasing_base
  - Professional_Scientific_and_Technical_Services_base
  - Management_of_Companies_Enterprises_base
  - Administrative_Support_Waste_Management_Remediation_Services_base
  - Educational_Services_base 
  - Health_Care_Social_Assistance_base
  - Arts_Entertainment_and_Recreation_base
  - Accommodation_Food_Services_base
  - Other_Services_except_Public_Administration_base
- qp1: ??
- ap: ??
- est: ?? 
- GDP: GDP or normalized GSP??
- pop_gt_16: ?
- pop_gt_16_lf: ?
- pop_gt_16_lf_c: ?
- Pop_Est: ?
- women_involved: 1/0 classification, 1 if woman is on the team

## Split Data

In [52]:
### Train & Validation Data: Application Years 2010-2017

patents_train_val = patents_full.query("app_yr >= 2010 & app_yr <= 2017")
patents_train_val.shape

(1591187, 72)

In [53]:
### Test Data: Application Years 2018-2019

patents_test_val = patents_full.query("app_yr >= 2018 & app_yr <= 2019")
patents_test_val.shape

(124570, 72)
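
The two shapes add up to 1,591,187 + 124,570 = 1,715,757 rows, which matches the length of the full frame shown earlier, so the two year ranges cover every row exactly once. The same check can be automated; a minimal sketch on invented years, using `Series.between` as an equivalent to the `>=` / `<=` queries:

```python
import pandas as pd

# Toy frame; the years are invented for this sketch.
toy = pd.DataFrame({"app_yr": [2010, 2013, 2017, 2018, 2019]})

# between() is inclusive on both ends by default, matching the >= / <= queries.
train_val = toy[toy["app_yr"].between(2010, 2017)]
test = toy[toy["app_yr"].between(2018, 2019)]

# Every row should land in exactly one split.
no_overlap = train_val.index.intersection(test.index).empty
covers_all = len(train_val) + len(test) == len(toy)
```

Splitting on application year rather than at random keeps later applications out of the training data.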

## Save Data

In [14]:
patents_train_val.to_csv('patents_full_train.csv')
patents_test_val.to_csv('patents_full_test.csv')
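
As written, `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again, and `GEOID` needs the string dtype re-applied on load. A minimal round-trip sketch on a toy frame, with a hypothetical file name, assuming the index is not needed downstream:

```python
import os
import tempfile
import pandas as pd

# Toy frame standing in for one of the splits; values are invented.
toy = pd.DataFrame({"patent_num": ["A1", "A2"], "GEOID": ["01001", "36061"]},
                   index=[5, 9])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_split.csv")  # hypothetical file name

    # index=False omits the row index so no extra unnamed column is written.
    toy.to_csv(path, index=False)

    # Re-apply the string dtype on load, as in the original read.
    reloaded = pd.read_csv(path, dtype={"GEOID": "str"})
```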