# North American Industry Classification System (NAICS) Recommender System
### By Holden Miller-Schaeffer


#### North American Industry Classification System (NAICS)

A NAICS code is a classification within the North American Industry Classification System. The system was developed for use by federal statistical agencies for the collection, analysis, and publication of statistical data related to the US economy. NAICS is self-assigned: no government official assigns a NAICS code. Instead, a company selects the code that best describes its primary business activity and supplies that code whenever it is asked for one. If its business activities span more than one line of business, the company may want to use more than one NAICS code.

#### The Business Case
NAICS codes are used across a variety of industries to evaluate business operations, bank loans, and insurance risks. Since the code is self-assigned, it can be difficult for users to know exactly which code they are supposed to use. This recommender system lets users query a description of their business operations and see a list of codes relevant to their search, which removes the research otherwise needed to look up the codes by hand. NAICS codes are broken into subgroups, with the most granular classification being 6 digits. The aim of this recommender system is to return the most relevant 6-digit classifications for a query. To understand how the codes are grouped into Sector, Subsector, Industry Group, NAICS Industry, and National Industry, see: https://www.census.gov/programs-surveys/economic-census/guidance/understanding-naics.html
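
To make the grouping concrete, here is a minimal sketch of how a 6-digit code breaks down into its five levels by prefix length. The helper `naics_hierarchy` and the `NAICS_LEVELS` mapping are illustrative names, not part of this project's code.

```python
# Prefix length -> level name, following the Census Bureau's description of NAICS.
NAICS_LEVELS = {
    2: "Sector",
    3: "Subsector",
    4: "Industry Group",
    5: "NAICS Industry",
    6: "National Industry",
}


def naics_hierarchy(code):
    """Return {level name: code prefix} for a 6-digit NAICS code (hypothetical helper)."""
    code = str(code)
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a 6-digit NAICS code, got {code!r}")
    return {name: code[:digits] for digits, name in NAICS_LEVELS.items()}


hierarchy = naics_hierarchy("111110")
for level, prefix in hierarchy.items():
    print(f"{level:<18}{prefix}")
```

For `111110` (Soybean Farming), the prefixes `11`, `111`, `1111`, and `11111` identify the parent classes whose descriptions also apply to that code.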

#### Why a Recommender System?
When originally framing this problem, I concluded that the main goal was to take a description of business operations as the input query and predict the most accurate classification. I went in with the idea of building a classification model, but I quickly realized that collecting a large enough labeled dataset was not feasible within the scope of this project. I reached out to various organizations, including the NAICS Association directly, and discovered that business names, descriptions, and NAICS classifications were hidden behind a significant paywall. This left me with no choice other than to reframe the problem, so I decided to build a recommender system that could learn from user feedback and, in theory, converge toward a state where every recommendation returned is relevant to the user.

#### The Data
The two main datasets used in this system are the [2017_NAICS_Index_file](https://www.census.gov/naics/2017NAICS/2017_NAICS_Index_File.xlsx) and [2017_NAICS_Descriptions](https://www.census.gov/naics/?48967) files found on the US Census NAICS website. The 2017 NAICS codes were chosen over the recently updated 2022 NAICS codes because many relevant industries have not had much time to update their classifications, and any information found online is more likely to pertain to the 2017 codes than the 2022 codes. The 2017 NAICS Index file lists every NAICS code alongside short keyword descriptions of the types of businesses within that classification. The second file, 2017 NAICS Descriptions, contains in-depth business descriptions for each NAICS class and subclass. For example, there is a main description for all class codes that begin with 11, another description for all classes that begin with 111, and so on down to the most granular 6-digit code. The method for merging this text is shown in the Load and Process Dataset section below.
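
As a rough illustration of the merging idea, the sketch below rolls the text of every parent class down onto its 6-digit codes. It uses plain dictionaries and made-up placeholder text rather than the Census files, and `roll_up` is a hypothetical helper, not the function used in this notebook.

```python
def roll_up(descriptions):
    """Append each ancestor's text (2- to 5-digit prefixes) to every 6-digit code's text."""
    merged = {}
    for code, text in descriptions.items():
        if len(code) != 6:
            continue  # only 6-digit codes are kept in the result
        # collect parent texts from the broadest (2 digits) to the narrowest (5 digits)
        parents = [
            descriptions[code[:n]] for n in range(2, 6) if code[:n] in descriptions
        ]
        merged[code] = " ".join([text] + parents)
    return merged


# Placeholder text, not real NAICS descriptions.
toy = {
    "11": "agriculture sector text",
    "111": "crop production text",
    "1111": "oilseed and grain text",
    "111110": "soybean farming text",
    "111140": "wheat farming text",
    "112": "animal production text",
}

merged = roll_up(toy)
for code, text in merged.items():
    print(code, "->", text)
```

Only the 6-digit codes survive, each carrying the vocabulary of every class above it, so a query that matches sector-level wording can still surface the granular codes.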


##### Load and Process Dataset For Recommender System

In [1]:
# import packages
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import pandas as pd
import re

In [2]:
df = pd.read_csv(r'assets/2017_NAICS_Index_File.csv')
df.head()

Unnamed: 0,naics,description
0,111110,"Soybean farming, field and seed production"
1,111120,"Canola farming, field and seed production"
2,111120,"Flaxseed farming, field and seed production"
3,111120,"Mustard seed farming, field and seed production"
4,111120,"Oilseed farming (except soybean), field and se..."


In [3]:
des_df = pd.read_csv(r'assets/2017_NAICS_Descriptions.csv')
des_df.head()

Unnamed: 0,Code,Title,Description
0,11,"Agriculture, Forestry, Fishing and HuntingT","The Sector as a Whole\n\nThe Agriculture, Fore..."
1,111,Crop ProductionT,Industries in the Crop Production subsector gr...
2,1111,Oilseed and Grain FarmingT,This industry group comprises establishments p...
3,11111,Soybean FarmingT,See industry description for 111110.
4,111110,Soybean Farming,This industry comprises establishments primari...


In [73]:
def create_description_df(des_df):
    """
    Clean the NAICS descriptions and append each higher-level class's text to its 6-digit codes
    """
    # work on a copy of the dataframe that was passed in (no need to reload the file)
    des_df = des_df.copy()
    
    # replace newline characters with spaces
    # des_df['Title'] = des_df['Title'].str.replace(r'[T\b]', '', regex=True)
    des_df['Description'] = des_df['Description'].str.replace(r'\n', ' ', regex=True)
    
    # drop rows that only point to another description (rows with no description are dropped too)
    is_pointer = des_df['Description'].str.contains('See industry description for', regex=False, na=True)
    des_df = des_df[~is_pointer].copy()
    
    # remove common phrases across NAICS descriptions
    des_df['Description'] = des_df['Description'].str.replace('The Sector as a Whole', '', regex=False)
    des_df['Description'] = des_df['Description'].str.replace('Cross-References. Establishments primarily engaged in--', '', regex=False)
    
    # delete text after Excluded. This removes words related to descriptions specifically excluded from a class
    des_df['Description'] = des_df['Description'].str.replace(r'Excluded.*$', '', regex=True)
    
    des_df = des_df.dropna()
    
    des_df['text'] = des_df['Title'] + ' ' + des_df['Description']
    des_df = des_df[['Code', 'text']]
    
    # add the text from each higher-level class to its 6-digit codes, i.e. class "11" text gets added to "111110", "111120", etc.
    for code, text in des_df[des_df['Code'].str.len() < 6].values:
        is_child = des_df['Code'].str.startswith(code) & (des_df['Code'].str.len() == 6)
        des_df.loc[is_child, 'text'] = des_df.loc[is_child, 'text'] + ' ' + str(text)
    
    return des_df[des_df['Code'].str.len() == 6]

def load_data(df, des_df):
    """
    Load and process NAICS documents
    """
    #description dataframe
    des_df = create_description_df(des_df).rename(columns= {'Code': 'naics'})
    
    # merge the two dataframes
    df = pd.merge(df, des_df[['naics','text']], how='outer', on='naics').fillna('')
    df['description'] = df[['description', 'text']].agg(' '.join, axis=1)
    
    #remove wildcard NAICS code
    df = df[df['naics'] != '******'].copy()
    
    # remove punctuation (raw string avoids invalid escape sequence warnings)
    df['description'] = df['description'].str.replace(r'[^\w\s]', ' ', regex=True)
    df = df.groupby(['naics'])['description'].apply(' '.join).reset_index()

    return df

In [None]:
processed_df = load_data(df, des_df)

In [75]:
# view the results of the cleaned descriptions
processed_df.head()

Unnamed: 0,naics,description
0,111110,Soybean farming field and seed production Soy...
1,111120,Canola farming field and seed production Oils...
2,111130,Bean farming dry field and seed production D...
3,111140,Wheat farming field and seed production Wheat...
4,111150,Corn farming except sweet corn field and se...


## Part 2: Stem/Lemmatize Dataset

In [76]:
# The below code will take a couple minutes to run. It is recommended to skip to Pt 3 where the resulting files are loaded in, rather than waiting for this to run.

In [80]:
STOPWORDS = set(stopwords.words('english'))
MIN_WORDS = 4
MAX_WORDS = 200

PATTERN_S = re.compile(r"'s")  # matches `'s` in text
PATTERN_RN = re.compile(r"[\r\n]+")  # matches runs of `\r` and `\n`
PATTERN_PUNC = re.compile(r"[^\w\s]")  # matches any character that is not a word character or whitespace

def clean_text(text):
    """
    Lowercase the text and replace possessive `'s`, line breaks, and punctuation with spaces.
        text (str): input text
    return (str): cleaned text
    """
    text = text.lower()  # lowercase text
    # replace each matched pattern with ' '
    text = PATTERN_S.sub(' ', text)
    text = PATTERN_RN.sub(' ', text)
    text = PATTERN_PUNC.sub(' ', text)
    return text

def tokenizer(description, stop_words, normalization):
    """
    Tokenize a description, lemmatize or stem each token, then filter the tokens.
    """
    if normalization == 'lemmatize':
        # tokenize and lemmatize text
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(w) for w in word_tokenize(description)]
        
    elif normalization == 'stem':
        # tokenize and stem text
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(w) for w in word_tokenize(description)]
        
    else:
        raise ValueError("normalization must be 'lemmatize' or 'stem'")
    
    # lowercase tokens and drop stop words, non-alphabetic tokens, and tokens of 2 characters or fewer
    tokens = [w.lower() for w in tokens if (w.lower() not in stop_words) and (len(w) > 2) and (w.isalpha())]
    
    return tokens    

def process_description(df):
    # work on a copy of the argument instead of relying on the global processed_df
    df = df.copy()
    df['clean_description'] = df['description'].apply(clean_text)
    df['lemmatized'] = df['clean_description'].apply(lambda x: tokenizer(x, STOPWORDS, 'lemmatize'))
    df['stemmed'] = df['clean_description'].apply(lambda x: tokenizer(x, STOPWORDS, 'stem'))
    
    return df


processed_df = process_description(processed_df)

In [81]:
processed_df.head()

Unnamed: 0,naics,description,clean_description,lemmatized,stemmed
0,111110,Soybean farming field and seed production Soy...,soybean farming field and seed production soy...,"[soybean, farming, field, seed, production, so...","[soybean, farm, field, seed, product, soybean,..."
1,111120,Canola farming field and seed production Oils...,canola farming field and seed production oils...,"[canola, farming, field, seed, production, oil...","[canola, farm, field, seed, product, oilse, ex..."
2,111130,Bean farming dry field and seed production D...,bean farming dry field and seed production d...,"[bean, farming, dry, field, seed, production, ...","[bean, farm, dri, field, seed, product, dri, p..."
3,111140,Wheat farming field and seed production Wheat...,wheat farming field and seed production wheat...,"[wheat, farming, field, seed, production, whea...","[wheat, farm, field, seed, product, wheat, far..."
4,111150,Corn farming except sweet corn field and se...,corn farming except sweet corn field and se...,"[corn, farming, except, sweet, corn, field, se...","[corn, farm, except, sweet, corn, field, seed,..."


The end result shows the lemmatized and stemmed tokens for each description, which we will use later in the recommender system. The result is exported as a .pkl file so it can easily be loaded into the next notebook.
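
One practical reason a pickle is convenient here is that the token columns hold Python lists. The sketch below, which uses a small made-up dataframe and a temporary file rather than the project's `assets` folder, compares a pickle round trip with a CSV round trip. It is an illustration of the general pandas behavior, not a description of how the next notebook loads the data.

```python
import os
import tempfile
from io import StringIO

import pandas as pd

# Toy frame with a list-valued column, mimicking the 'lemmatized' column.
toy = pd.DataFrame({
    "naics": ["111110", "111140"],
    "lemmatized": [["soybean", "farming"], ["wheat", "farming"]],
})

# Pickle round trip: the Python lists come back as lists.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(path)
    from_pickle = pd.read_pickle(path)

# CSV round trip: the lists are written as text and read back as strings.
buffer = StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer, dtype={"naics": str})

pickle_type = type(from_pickle.loc[0, "lemmatized"])
csv_type = type(from_csv.loc[0, "lemmatized"])
print(pickle_type.__name__, csv_type.__name__)
```

With a CSV, the token lists would need to be parsed back from strings before use, whereas the pickled dataframe can be used as-is.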

In [82]:
processed_df.to_pickle(r'assets/processed_df.pkl')