# Trademarks

Here we collect open trademark data from the Intellectual Property Office. 

The data is available here: https://www.gov.uk/government/publications/ipo-trade-mark-data-release

We will undertake the following activities:

* Collect all the data.
* Enrich it with information about the product codes that the trademarks refer to.
* Enrich it with information about its NUTS location (we keep this flexible as we will be using this code in multiple places).




## Preamble

In [1]:
%run ../notebook_preamble.ipy

In [2]:
import re
import random
from zipfile import ZipFile
from io import BytesIO
import csv

In [4]:
#dirs
#Create the trademark data directories if they do not exist yet

for stage in ['raw','interim','processed']:
    os.makedirs(f'../../data/{stage}/trademarks',exist_ok=True)

In [5]:
# %load ../utilities.py
# Some utilities

import random

def make_data_dict(table,name,path,sample=5):
    '''
    A function to output the form for a data dictionary
    
    Args:
        -table (df) is the df we want to create the data dictionary for
        -name (str) of the df
        -path (str) is the place where we want to save the file
        -sample (int) is the number of values we sample from each column to estimate its type
    
    '''
    
    types = [estimate_type(table[x],sample=sample) for x in table.columns]
        
    data_dict = pd.DataFrame()
    data_dict['variable'] = table.columns
        
    data_dict['type'] = types
    
    data_dict['description'] = ['' for x in data_dict['variable']]
        
    out = os.path.join(path,f'{today_str}_{name}.csv')
    
    #print(data_dict.columns)
    
    data_dict.to_csv(out)
    

def estimate_type(variable,sample):
    '''
    Estimates the type of a column. 

    Args:
        variable (iterable) with values
        sample (n) is the number of values to test
    
    '''
    
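    #Cap the sample size at the length of the column so random.sample does not fail on short columns
    values = list(variable)
    selection = random.sample(values,min(sample,len(values)))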
    
    types = pd.Series([type(x) for x in selection]).value_counts().sort_values(ascending=False)
    
    return(types.index[0])

In [29]:

def make_indicator(table,target_path,var_lookup,year_var,nuts_var='nuts_code',nuts_spec=2018,decimals=3):
    '''
    We use this function to create and save indicators using our standardised format.
    
    Args:
        table (df) is a df with relevant information
        target_path (str) is the name of the directory inside ../../data/processed where we want to save the data
        var_lookup (dict) is a lookup to rename the variable into our standardised name
        year_var (str) is the name of the year variable
        nuts_var (str) is the name of the NUTS code variable. We assume it is nuts_code
        nuts_spec (int) is the year of the NUTS specification. We assume we are working with 2018 NUTS
        decimals (int) is the number of decimals to round to. If 0, we convert the values to integers
    
    '''
    #Reset the index so that its levels become columns
    #(we assume that the index contains the nuts code and the year - this might need to be changed)
    t = table.reset_index(drop=False)
    
    #Process the interim data into an indicator
    
    #This is the variable name and code
    var_name = list(var_lookup.keys())[0]
    
    var_code = list(var_lookup.values())[0]
    
    #Focus on those (we copy to avoid a pandas SettingWithCopyWarning when we add columns below)
    t = t[[year_var,nuts_var,var_name]].copy()
    
    #Add the nuts specification
    t['nuts_year_spec'] = nuts_spec
    
    #Rename variables
    t.rename(columns={var_name:var_code,year_var:'year',nuts_var:'nuts_id'},inplace=True)

    #Round variables
    t[var_code] = [np.round(x,decimals) if decimals>0 else int(x) for x in t[var_code]]
    
    
    #Reorder variables
    t = t[['year','nuts_id','nuts_year_spec',var_code]]
    
    print(t.head())
    
    #Save in the processed folder
    t.to_csv(f'../../data/processed/{target_path}/{var_code}.csv',index=False)
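
To make the standardised format concrete, here is a minimal, pandas-free sketch of the shape `make_indicator` produces for each row. The helper `to_indicator_rows` and the record layout are hypothetical and only mirror the column order and rounding rule above; they are not part of the project utilities.

```python
# Hypothetical long-format records: (nuts_code, year, value)
records = [
    ("UKH2", 2010, 873.0),
    ("UKH2", 2011, 940.0),
]

def to_indicator_rows(records, var_code, nuts_spec=2018, decimals=0):
    """Return rows ordered as year, nuts_id, nuts_year_spec, <var_code>."""
    rows = []
    for nuts_id, year, value in records:
        # Same rule as make_indicator: round when decimals > 0, otherwise cast to int
        value = round(value, decimals) if decimals > 0 else int(value)
        rows.append({"year": year, "nuts_id": nuts_id,
                     "nuts_year_spec": nuts_spec, var_code: value})
    return rows

rows = to_indicator_rows(records, "total_trademarks", nuts_spec=2016)
print(rows[0])
```

Each indicator file therefore holds one value column, named after the standardised variable code, next to the year, the NUTS code and the NUTS specification.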

## 1. Collect data

We collect the data from the IPO's open data site. The data comes as a zip file.

#### Collect trademark open dataset

In [6]:
#Download and parse the data
trademark_link = 'https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/680986/opendatadomestic.zip'
trade_request = requests.get(trademark_link)

In [7]:
tradem = ZipFile(BytesIO(trade_request.content)).extract('OpenDataDomestic.txt',path=f'../../data/raw/trademarks/{today_str}_trademarks.txt')
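
One detail of `ZipFile.extract` is easy to miss: its `path` argument is the directory to extract into, not the name of the output file, and the call returns the path of the extracted file. The sketch below illustrates this with a tiny archive built in memory, so it needs no download. The archive contents and the temporary directory are made up for illustration.

```python
import os
import tempfile
from io import BytesIO
from zipfile import ZipFile

# Build a tiny zip archive in memory to stand in for the downloaded file
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("OpenDataDomestic.txt", "Trade Mark|Postcode\nUK00000000001|LU1\n")

# `path` is the directory to extract into, not the output file name
target_dir = tempfile.mkdtemp()
with ZipFile(BytesIO(buffer.getvalue())) as archive:
    extracted_path = archive.extract("OpenDataDomestic.txt", path=target_dir)

# The returned value is the full path of the extracted member
print(extracted_path)
```

This is why the extracted data ends up in a directory named after the `path` argument, with `OpenDataDomestic.txt` inside it.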

In [8]:
#Note that here we are skipping a small number (~20) of bad lines.
#I couldn't quite determine what the problem with them was.
#We read from the path returned by extract() above instead of hard-coding the download date.
#In pandas 1.3 and later, on_bad_lines='skip' replaces warn_bad_lines and error_bad_lines.

tradem_df = pd.read_csv(tradem,delimiter='|',
                        encoding='utf-16',warn_bad_lines=False,error_bad_lines=False)



In [10]:
# This is what it looks like
tradem_df.head()

Unnamed: 0,Trade Mark,Hyperlink,Mark Text,Name,Postcode,Region,Country,Status,Category of Mark,Mark Type,...,Class36,Class37,Class38,Class39,Class40,Class41,Class42,Class43,Class44,Class45
0,UK00000000001,"=HYPERLINK(""http://www.ipo.gov.uk/tmcase/Resul...",BASS & Co's PALE ALE,Pioneer Brewing Company Limited,LU1,East of England,United Kingdom,Registered,Standard,Figurative,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,UK00000000002,"=HYPERLINK(""http://www.ipo.gov.uk/tmcase/Resul...",BASS & Co's,Pioneer Brewing Company Limited,LU1,East of England,United Kingdom,Registered,Standard,Figurative,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,UK00000000039,"=HYPERLINK(""http://www.ipo.gov.uk/tmcase/Resul...",BBB,"A. Oppenheimer & Co., Limited",SS3,East of England,United Kingdom,Dead,Standard,Figurative,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,UK00000000041,"=HYPERLINK(""http://www.ipo.gov.uk/tmcase/Resul...",Bodega,Travelodge Hotels Limited,OX9,South East,United Kingdom,Dead,Standard,Word,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,UK00000000042,"=HYPERLINK(""http://www.ipo.gov.uk/tmcase/Resul...",BODEGA,Travelodge Hotels Limited,OX9,South East,United Kingdom,Dead,Standard,Figurative,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
#Tidy the columns
tradem_df.columns = [re.sub(' ','_',x).lower() for x in tradem_df.columns]

#Convert year strings to years. Faster with string processing than with datetime
tradem_df['year_published'] = [int(str(x).split('-')[0]) if not pd.isnull(x) else x for x in tradem_df.published]
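
As a small illustration of the string-based year extraction, the sketch below applies the same split to a few sample values. It assumes the publication dates are strings in `YYYY-MM-DD` form; the sample dates and the helper `year_from_date` are hypothetical. Unlike the cell above, which keeps missing values as NaN, this version returns `None` for them.

```python
def year_from_date(value):
    """Return the year of a 'YYYY-MM-DD' string, or None when the value is missing."""
    # NaN is the only value that is not equal to itself
    if value is None or value != value:
        return None
    return int(str(value).split("-")[0])

samples = ["2011-03-25", "1876-01-01", float("nan"), None]
years = [year_from_date(s) for s in samples]
print(years)
```

Because missing dates stay as NaN in the dataframe, pandas stores `year_published` as a float column, which is why years appear as `2010.0` in the outputs further down.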

#### Product - category lookup

We will use this lookup to identify trademarks with scientific Nice classes.

In [12]:
class_product_category_lookup = pd.read_csv('../../data/aux/12_11_2019_nice_class_to_category_lookup.csv')

## 2. Geocoding the trademarks

We are going to create a function that automatically geocodes the trademarks using a postcode-NUTS lookup. One challenge with this is that both postcodes and NUTS classifications change over time.
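
The matching idea can be sketched without pandas as follows. The trademark data only carries the first part of the postcode (the outward code, e.g. `LU1`), so the lookup has to be collapsed to one area per outward code. All postcodes and area labels below are placeholders, and the helper `outward_code` is hypothetical.

```python
# Placeholder extract of a postcode lookup: full postcode -> area code
nspl_rows = [
    ("LU1 1AA", "area_A"),
    ("LU1 9ZZ", "area_B"),
    ("OX9 2BB", "area_C"),
]

def outward_code(postcode):
    """Return the lower-cased outward part of a postcode (the part before the space)."""
    return postcode.strip().split(" ")[0].lower()

# Keep the first area seen for each outward code, as drop_duplicates would
outward_to_area = {}
for postcode, area in nspl_rows:
    outward_to_area.setdefault(outward_code(postcode), area)

# Trademarks with an outward code only; the last one has no match and is dropped
trademarks = [("UK00000000001", "LU1"), ("UK00000000041", "OX9"), ("UK99999999999", "ZZ9")]
matched = [(mark, outward_to_area[outward_code(pc)])
           for mark, pc in trademarks if outward_code(pc) in outward_to_area]
print(matched)
```

Note that an outward code can span more than one area (here `LU1` appears under both `area_A` and `area_B`), so keeping the first match is an approximation rather than an exact geocoding.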



In [13]:
#Clean postcodes
tradem_df['postcode'] = [x.strip().lower() if pd.notnull(x) and x!='Not Available' else np.nan for x in tradem_df.postcode]

In [14]:
tradem_uk = tradem_df.loc[tradem_df['country']=='United Kingdom'].dropna(axis=0,subset=['postcode'])

len(tradem_uk)

811417

These are the UK trademarks with a postcode. We can use them in subsequent analyses.

In [15]:
pc_url = 'https://www.arcgis.com/sharing/rest/content/items/19fac93960554b5e90840505bd73917f/data'

In [16]:
def geo_trademark(tradem_df,geography,nspl_file,lookup_file,geo_code,path_to_nspl):
    '''
    
    This function classifies trademarks into locations using a postcode lookup. As part of this, we merge the matched file with a
    geo-code to geo-name lookup to get the geography names.
    
    Arguments:
        tradem_df (df) is the df with the trademark information. It needs to include a postcode for matching
        geography (str) is the geography we want to match
        nspl_file (str) is the file with the nspl data
        lookup_file (str) is the name of the file with a lookup between geography names and codes
        geo_code (str) is the name of the geography code variable in the lookup
        path_to_nspl (str) is the location of the nspl data. If it is a link, we download the nspl file; otherwise we read it from that local directory
    
    '''
    
    #Read the NSPL files
    
    if 'https' in path_to_nspl:
        
        print('downloading nspl')
        
        #Download the file
        nspl_request = requests.get(path_to_nspl)
        
        nspl_zipfile = ZipFile(BytesIO(nspl_request.content))
        
        #Read the nspl
        nspl = pd.read_csv(nspl_zipfile.open(f'Data/{nspl_file}'))[['pcds',geography]]
        
        #Read the lookup
        lookup = pd.read_csv(nspl_zipfile.open(f'Documents/{lookup_file}'))
        
    else:
        print('reading nspl')
        
        nspl = pd.read_csv(path_to_nspl+f'/Data/{nspl_file}')[['pcds',geography]]
        
        lookup = pd.read_csv(path_to_nspl+f'/Documents/{lookup_file}')
        
       
    print('processing data')
    #Throw away unnecessary postcode detail in the nspl file (we are only interested in the outward code, i.e. the part before the space). Also, make them lowercase
    nspl['pcds_1st'] = nspl['pcds'].apply(lambda x: x.split(' ')[0].lower())
    
    
    #Merge
    tradem_merged = pd.merge(tradem_df,nspl.drop_duplicates('pcds_1st')[['pcds_1st',geography]],left_on='postcode',right_on='pcds_1st')
    
    
    #Merge with the lookup names
    #Remove Welsh column names from lookup
    lookup = lookup[[x for x in lookup.columns if x[-1]!='W']]
    
    tradem_w_names = pd.merge(tradem_merged,lookup,left_on=geography,right_on=geo_code)
    
    #Remove the geography variable as it has unstandardised names
    tradem_w_names.drop(axis=1,labels=geography,inplace=True)
    
    return(tradem_w_names)    

In [17]:
trademark_nuts = geo_trademark(tradem_df,geography='nuts',nspl_file='NSPL_AUG_2019_UK.csv',
                            lookup_file='LAU219_LAU119_NUTS18_MAY_2019_UK_LU.csv',
                            geo_code='LAU219CD',path_to_nspl = pc_url)

downloading nspl




processing data


In [18]:
trademark_nuts.head()

Unnamed: 0,trade_mark,hyperlink,mark_text,name,postcode,region,country,status,category_of_mark,mark_type,...,LAU219CD,LAU219NM,LAU119CD,LAU119NM,NUTS318CD,NUTS318NM,NUTS218CD,NUTS218NM,NUTS118CD,NUTS118NM
0,UK00000000001,"=HYPERLINK(""http://www.ipo.gov.uk/tmcase/Resul...",BASS & Co's PALE ALE,Pioneer Brewing Company Limited,lu1,East of England,United Kingdom,Registered,Standard,Figurative,...,E05002208,South,E06000032,Luton,UKH21,Luton,UKH2,Bedfordshire and Hertfordshire,UKH,East of England
1,UK00000000002,"=HYPERLINK(""http://www.ipo.gov.uk/tmcase/Resul...",BASS & Co's,Pioneer Brewing Company Limited,lu1,East of England,United Kingdom,Registered,Standard,Figurative,...,E05002208,South,E06000032,Luton,UKH21,Luton,UKH2,Bedfordshire and Hertfordshire,UKH,East of England
2,UK00000000914,"=HYPERLINK(""http://www.ipo.gov.uk/tmcase/Resul...",,Pioneer Brewing Company Limited,lu1,East of England,United Kingdom,Registered,Standard,Figurative,...,E05002208,South,E06000032,Luton,UKH21,Luton,UKH2,Bedfordshire and Hertfordshire,UKH,East of England
3,UK00000000915,"=HYPERLINK(""http://www.ipo.gov.uk/tmcase/Resul...",,Pioneer Brewing Company Limited,lu1,East of England,United Kingdom,Registered,Standard,Figurative,...,E05002208,South,E06000032,Luton,UKH21,Luton,UKH2,Bedfordshire and Hertfordshire,UKH,East of England
4,UK00001387389,"=HYPERLINK(""http://www.ipo.gov.uk/tmcase/Resul...",VAUXHALL MASTERFIT,Vauxhall Motors Limited,lu1,East of England,United Kingdom,Registered,Standard,Word,...,E05002208,South,E06000032,Luton,UKH21,Luton,UKH2,Bedfordshire and Hertfordshire,UKH,East of England


## 3. Processing

We will create a df with counts of registered trademarks published from 2010 onwards, and counts of trademarks in scientific and technological Nice classes.

We identify what these are using the lookup we created in our project mapping innovation in Scotland, and which we loaded above.
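
The counting logic can be sketched as follows: filter to registered trademarks published from 2010 onwards, count them by region and year, and compute the share flagged as scientific. The records below are made up, and this pandas-free version only illustrates the idea behind the grouped dataframe built in this section.

```python
from collections import Counter

# Made-up records: (NUTS2 code, year published, status, in a scientific class?)
marks = [
    ("UKM5", 2011, "Registered", True),
    ("UKM5", 2011, "Registered", False),
    ("UKM5", 2011, "Dead", True),
    ("UKM5", 2009, "Registered", True),
    ("UKH2", 2011, "Registered", False),
]

totals, scientific = Counter(), Counter()
for nuts, year, status, is_scientific in marks:
    # Keep only registered trademarks published from 2010 onwards
    if status != "Registered" or year < 2010:
        continue
    totals[(nuts, year)] += 1
    scientific[(nuts, year)] += int(is_scientific)

# Share of scientific trademarks in each region-year
shares = {key: scientific[key] / totals[key] for key in totals}
print(shares)
```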

In [19]:
#Filter to focus on recent & registered trademarks (we copy so that we can add columns to the result later)
trademark_clean = trademark_nuts.loc[(trademark_nuts['status']=='Registered')&(trademark_nuts['year_published']>=2010)].copy()

In [20]:
#What are the scientific classes?
class_product_category_lookup.loc[class_product_category_lookup['category']=='scientific']

Unnamed: 0.1,Unnamed: 0,class,Description,category
8,8,class9,"Scientific, nautical, surveying, photographic,...",scientific
9,9,class10,"Surgical, medical, dental and veterinary appar...",scientific
41,41,class42,Scientific and technological services and rese...,scientific


In [21]:
scientic_nice_classes = list(class_product_category_lookup.loc[class_product_category_lookup['category']=='scientific']['class'])

In [22]:
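#Flag scientific trademarks. We only use class 42 (scientific and technological services);
#the commented line below would instead flag trademarks in any of the scientific classes
#trademark_clean['is_scientific'] = trademark_nuts[scientic_nice_classes].sum(axis=1)>0

trademark_clean['is_scientific'] = trademark_clean['class42']>0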



In [23]:
trademark_clean['is_scientific'].sum()

34974

In [32]:
trademark_grouped = pd.concat([trademark_clean.groupby(['NUTS218NM','NUTS218CD','year_published']).size(),
                            trademark_clean.groupby(['NUTS218NM','NUTS218CD','year_published'])['is_scientific'].sum()],axis=1)

In [33]:
trademark_grouped.rename(columns={0:'trademark_n','is_scientific':'scientific_trademark_n'},inplace=True)

In [34]:
trademark_grouped['scientific_trademark_share'] = trademark_grouped['scientific_trademark_n']/trademark_grouped['trademark_n']

In [35]:
trademark_grouped.sort_values('scientific_trademark_share',ascending=False).head()

NUTS218NM,NUTS218CD,year_published,trademark_n,scientific_trademark_n,scientific_trademark_share
North Eastern Scotland,UKM5,2011.0,116,37.0,0.318966
North Eastern Scotland,UKM5,2013.0,152,47.0,0.309211
North Eastern Scotland,UKM5,2012.0,132,39.0,0.295455
North Eastern Scotland,UKM5,2010.0,109,29.0,0.266055
Highlands and Islands,UKM6,2011.0,83,17.0,0.204819


In [36]:
trademark_grouped.to_csv(f'../../data/interim/trademarks/{today_str}_nuts_trademarks.csv')

## 4. Create final indicators

Trademarks (#74)

In [30]:
trademark_grouped.columns

Index(['trademark_n', 'scientific_trademark_n', 'scientific_trademark_share'], dtype='object')

In [43]:
make_indicator(trademark_grouped,'trademarks',{'trademark_n':'total_trademarks'},
               nuts_spec=2016,nuts_var='NUTS218CD',year_var='year_published',decimals=0)

     year nuts_id  nuts_year_spec  total_trademarks
0  2010.0    UKH2            2016               873
1  2011.0    UKH2            2016               940
2  2012.0    UKH2            2016              1059
3  2013.0    UKH2            2016              1171
4  2014.0    UKH2            2016              1441


In [42]:
make_indicator(trademark_grouped,'trademarks',{'scientific_trademark_n':'total_trademarks_scientific'},
               nuts_spec=2016,nuts_var='NUTS218CD',year_var='year_published',decimals=0)

     year nuts_id  nuts_year_spec  total_trademarks_scientific
0  2010.0    UKH2            2016                           90
1  2011.0    UKH2            2016                           92
2  2012.0    UKH2            2016                          132
3  2013.0    UKH2            2016                          164
4  2014.0    UKH2            2016                          160

