## Text Classification project 

This notebook performs an “unsupervised” text classification of company descriptions based on similarity.

In [234]:
import json  ## Read JSON data.                          
import pandas as pd  ## Data manipulation and analysis.     
import numpy as np  ## Numerical operations.                
from sklearn import metrics, manifold  ## Evaluation and dimensionality reduction. 
from sklearn.feature_extraction.text import CountVectorizer  ## Text vectorization. Convert text data to numerical vectors.

import re  ## Regular expressions for pattern matching in text.
import nltk  ## Natural Language Toolkit tools for text processing and analysis.
from nltk.stem import WordNetLemmatizer  ## Word lemmatization.  
from nltk.corpus import stopwords  ## Stopwords.            
from nltk.tokenize import word_tokenize  ## Tokenization. Split text into words using NLTK.
nltk.download('wordnet')  ## Download WordNet for lemmatization.

import matplotlib.pyplot as plt  ## Plotting.              
import seaborn as sns  ## Statistical data visualization.   

import gensim  ## Topic modeling and document similarity.  
import gensim.downloader as gensim_api  ## Gensim model downloader. Download pre-trained models using Gensim API.

import transformers  ##  Implement NLP tasks with BERT using Transformers library.
import os  ##  Interface with the operating system for file handling.              

import lda  ## Latent Dirichlet Allocation for topic modeling and unsupervised classification.  

import tensorflow as tf  ## Machine learning framework.      
print(tf.__version__)  ## Display TensorFlow version.      


2.15.0


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sanie.s.rojas.lobo\AppData\Roaming\nltk_data.
[nltk_data]     ..
[nltk_data]   Package wordnet is already up-to-date!


## Initial file display

In [235]:
df = pd.read_csv("2_results_cleansed.csv")

In [236]:
df.dtypes

Company                object
Company description    object
Cleaned Description    object
dtype: object

In [237]:
text = ' '.join(df['Cleaned Description'].astype("string").fillna("null"))

In [238]:
text

'carolina engineers manufactures sells printers printing materials scanners printing service creates concept models precision functional prototypes master patterns tooling well production parts direct manufacturing uses proprietary processes fabricate physical objects using input computeraided design manufacturing software sca mining manufacturing conglomerate worker healthcare consumer goods several adhesives abrasives laminates passive fire protection protective equipment window films paint protection films dental orthodontic electronic connecting insulating materials medica manufacturer residential heaters boilers manufacturer marketer heaters also supplies treatment purification locations five manufacturing facilities well plants veldhoven netherlandsin past n manufacturing application delivery controllers software hardware chen cofounder foundry serviced identity line id series early added bandwidth appliances ex series initial offering march raising corp aviation servicesaar wood

## Unsupervised classification method: Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a probabilistic model commonly used for topic modeling and unsupervised classification of text data. In LDA, each document is viewed as a mixture of topics, and each topic is a probability distribution over words.

The algorithm aims to uncover these latent topics by iteratively assigning words to topics based on the probability of word-topic associations. For text classification, LDA can be applied to identify underlying themes or topics within a collection of documents. By assigning topics to documents and words to topics, LDA provides a structured representation of the textual content, enabling the discovery of hidden patterns and relationships. 

This method is valuable for organizing large datasets of unstructured text, facilitating content summarization, and enhancing the interpretability of document collections. In the context of unsupervised classification, LDA offers a powerful approach to discovering the inherent structure and content themes within text corpora, aiding in tasks such as categorizing companies by industry based on the content of their websites.
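
As a rough illustration of the "mixture of topics" idea, the sketch below computes the probability of each word in one document from a document-topic mixture and two topic-word distributions. The vocabulary and all numbers are made up for the example, and it uses plain Python rather than the `lda` package.

```python
# Toy illustration of LDA's generative view (all numbers are invented).
# theta: one document's mixture over 2 topics.
# phi:   each topic's probability distribution over the vocabulary.
vocab = ["software", "cloud", "hospital", "care"]
theta = [0.7, 0.3]  # P(topic | document)
phi = [
    [0.5, 0.4, 0.05, 0.05],  # topic 0: mostly technology words
    [0.05, 0.05, 0.4, 0.5],  # topic 1: mostly healthcare words
]

# P(word | document) = sum over topics of P(topic | document) * P(word | topic)
word_probs = [
    sum(theta[k] * phi[k][w] for k in range(len(theta)))
    for w in range(len(vocab))
]

for word, p in zip(vocab, word_probs):
    print(f"{word}: {p:.3f}")
```

Fitting LDA runs this relationship in reverse: given only the observed word counts, it estimates the document-topic mixtures and topic-word distributions that best explain them.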

In [239]:
# CountVectorizer converts the text to a matrix of token counts, which is the input LDA expects.
# The ngram_range parameter sets which n-grams are used as features when training the model.
# With ngram_range=(1,2), for instance, the features contain all individual words as well as all 2-grams in each document.


vec = CountVectorizer(analyzer='word', ngram_range=(1,1))  # Convert text to numerical vectors.
X = vec.fit_transform(df['Cleaned Description'].fillna("null"))  # Fit and transform text data.

model = lda.LDA(n_topics=30, random_state=1)  # Initialize the LDA model, setting how many topics we expect it to identify.
model.fit(X)  # Fit LDA model to text data.

topic_word = model.topic_word_  # Access topic-word distributions obtained with the LDA model. 
vocab = vec.get_feature_names_out()  # Get feature names.
n_top_words = 30  # Number of top words.

for i, topic_dist in enumerate(topic_word):  # Iterate through topics.
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]  # Extract top words for each topic.
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))  # Print top words for each topic.


INFO:lda:n_documents: 1479
INFO:lda:vocab_size: 7860
INFO:lda:n_words: 28511
INFO:lda:n_topics: 30
INFO:lda:n_iter: 2000
INFO:lda:<0> log likelihood: -409586
INFO:lda:<10> log likelihood: -271800
INFO:lda:<20> log likelihood: -265046
INFO:lda:<30> log likelihood: -262545
INFO:lda:<40> log likelihood: -260028
INFO:lda:<50> log likelihood: -258343
INFO:lda:<60> log likelihood: -257487
INFO:lda:<70> log likelihood: -255937
INFO:lda:<80> log likelihood: -255283
INFO:lda:<90> log likelihood: -254849
INFO:lda:<100> log likelihood: -254869
INFO:lda:<110> log likelihood: -254233
INFO:lda:<120> log likelihood: -253862
INFO:lda:<130> log likelihood: -253579
INFO:lda:<140> log likelihood: -253423
INFO:lda:<150> log likelihood: -253290
INFO:lda:<160> log likelihood: -253345
INFO:lda:<170> log likelihood: -252791
INFO:lda:<180> log likelihood: -252868
INFO:lda:<190> log likelihood: -252640
INFO:lda:<200> log likelihood: -252507
INFO:lda:<210> log likelihood: -252343
INFO:lda:<220> log likelihood: -

Topic 0: software platform cloud security customer develops applications service francisco analytics technologies web computer uses social automation using tools networking create use designed cloudbased hardware used support raised human computing manage
Topic 1: service offering users various initial pinyin website country printing march design monthly become marketplace developing site sales internet reached streaming registered available kingdom lit app others images began local college
Topic 2: care healthcare facilities employees living among centers blue providers cosmetics revenues senior managed beauty throughout majority distributes hospitals across surgical battery home nine emergency portfolio kingdom batteries lithium professional surgery
Topic 3: plc education listed divisions conglomerate three focus mountain constituent santa cruise kingdom tax royal operator school listing combined line secondary segment legally publishing cruises seven students motorcycle caribbean co
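
To make the effect of `ngram_range` in the cell above more concrete, here is a small sketch that builds word n-grams by hand. The helper `word_ngrams` is hypothetical and only splits on whitespace; `CountVectorizer` additionally lowercases the text and applies its own token pattern, so its features can differ slightly.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of `text` for n in the inclusive ngram_range."""
    tokens = text.split()
    lo, hi = ngram_range
    features = []
    for n in range(lo, hi + 1):
        # Slide a window of n tokens over the text.
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

print(word_ngrams("cloud security software"))
# ['cloud', 'security', 'software', 'cloud security', 'security software']
```

Adding 2-grams enlarges the vocabulary considerably, so on a corpus as small as this one (1479 documents) it may spread the counts too thinly to help.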


ChatGPT summarizes the topics produced by the LDA model as follows:

    Topic 0: Software Development and Security - Develops cloud-based applications, analytics, and technologies.
    Topic 1: Online Services and Sales - Offers various online services and sales platforms.
    Topic 2: Healthcare and Senior Living - Provides healthcare services, senior living facilities, and cosmetics.
    Topic 3: Education and Listing - Operates education services and conglomerate listing.
    Topic 4: Information Consulting and Research - Employs big data analytics, research, and engineering projects.
    Topic 5: Branded Manufacturing - Produces and markets branded sports equipment and tools.
    Topic 6: Mining and Resources - Engaged in gold mining, resources exploration, and production.
    Topic 7: Corporate Transactions - Involves mergers, acquisitions, and corporate transactions.
    Topic 8: Home Construction and Retail - Constructs homes, sells home improvement supplies, and rentals.
    Topic 9: Real Estate Investment - Invests in real properties, shopping centers, and hospitality.
    Topic 10: Restaurant Chains - Manages fast-food and casual restaurant chains.
    Topic 11: Entertainment and Airlines - Operates entertainment parks, airlines, and game franchises.
    Topic 12: Capital Investment - Manages equity, assets, funds, and renewable energy investments.
    Topic 13: Natural Resources Production - Produces natural resources, crude oil, and hydrocarbon products.
    Topic 14: Utilities and Electricity - Provides electricity distribution, generation, and utility services.
    Topic 15: Specialty Materials Manufacturing - Manufactures specialty materials, packaging, and consumer goods.
    Topic 17: Clinical Laboratory and Healthcare - Develops medical devices, diagnostics, and clinical laboratory services.
    Topic 18: Banking and Mortgage - Operates banking, mortgage, and financial advisory services.
    Topic 19: Diversified Conglomerate - Diversified conglomerate with various business divisions.
    Topic 20: Stock Exchanges and Components - Listed on stock exchanges, ticker symbol, and components.
    Topic 21: Retail Stores and Brands - Operates retail stores, sells branded apparel, and accessories.
    Topic 22: Real Estate Development - Engaged in real estate development, ownership, and timeshare.
    Topic 23: Corporate Rankings - Ranked among top corporations, fiscal performance, and industries.
    Topic 24: Payments and Credit Processing - Processes payments, credit cards, and electronic trading.
    Topic 25: Industrial Equipment Manufacturing - Designs and manufactures industrial equipment, vehicles, and machinery.
    Topic 26: Telecommunications and Solar - Operates telecommunications, solar energy, and semiconductor services.
    Topic 27: Automotive and Transportation - Provides automotive shipping, fleet services, and vehicle maintenance.
    Topic 28: Industrial Manufacturing - Manufactures industrial components, supplies, and automotive parts.
    Topic 29: Marketing and Risk Management - Provides marketing, risk management, and professional programs.

In [240]:
# Assign the per-document topic probabilities to the DataFrame
doc_topic = model.transform(X)  # Compute the document-topic matrix once, not once per topic
topic_columns = []

for i in range(doc_topic.shape[1]):
    topic_name = f'Topic_{i}'
    df[topic_name] = doc_topic[:, i]  # Probability of topic i for each document
    topic_columns.append(topic_name)

In [241]:
df[df["Topic_1"]>0.4].head()

Unnamed: 0,Company,Company description,Cleaned Description,Topic_0,Topic_1,Topic_2,Topic_3,Topic_4,Topic_5,Topic_6,...,Topic_20,Topic_21,Topic_22,Topic_23,Topic_24,Topic_25,Topic_26,Topic_27,Topic_28,Topic_29
79,AMTD Digital,AMTD Digital is a France headquartered–based f...,amtd headquarteredbased amtd became notable ea...,0.007841,0.428412,0.013486,0.000577,0.000393,0.000473,0.021702,...,0.271012,0.001494,0.000474,0.010577,0.006191,0.000473,0.000465,0.000715,0.000311,0.000465
272,Cheetah Mobile,Cheetah Mobile Inc (猎豹移动公司) is a Chinese mobil...,cheetah internet january monthly users,8e-05,0.791815,0.000121,0.000132,0.000111,0.000133,0.00012,...,0.000112,0.014224,0.062667,0.000114,0.000159,9.6e-05,0.008061,0.000119,8.9e-05,0.000131
481,Eventbrite,Eventbrite is an American event management and...,event ticketing website service allows users b...,0.17689,0.767815,0.000439,0.000599,0.000173,0.000246,0.000187,...,0.000174,0.030221,0.000208,0.00019,0.00025,0.000149,0.000983,0.000742,0.000137,0.000204
518,Fiverr,Fiverr is an Israeli multinational online mark...,marketplace freelance fiverrs platform connect...,0.036483,0.43792,0.000541,0.077543,0.000122,0.115757,0.039224,...,0.001247,0.000235,0.000147,0.000125,0.066767,0.000137,0.000587,0.000131,9.7e-05,0.000144
574,Getty Images,"Getty Images Holdings, Inc. is an American vis...",getty images visual supplier images editorial ...,0.000363,0.693804,0.000119,0.037513,0.00068,0.002584,0.000488,...,0.000109,0.000109,0.000131,0.000111,0.002192,9.4e-05,0.02915,0.000427,0.016324,0.151639
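
One possible way to turn these probabilities into a single label per company is to keep the most probable topic, and only when it clears a threshold. The sketch below uses plain Python with invented probabilities; the helper `dominant_topic` and the reuse of the 0.4 cut-off from the filter above are assumptions, not part of the original analysis.

```python
# Hypothetical topic probabilities for two companies (values are illustrative).
doc_topic = {
    "Company A": [0.18, 0.77, 0.05],
    "Company B": [0.45, 0.40, 0.15],
}
topic_labels = ["Topic_0", "Topic_1", "Topic_2"]

def dominant_topic(probs, labels, min_prob=0.4):
    """Return the most probable topic label, or None if it is below min_prob."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= min_prob else None

assignments = {name: dominant_topic(p, topic_labels) for name, p in doc_topic.items()}
print(assignments)
# {'Company A': 'Topic_1', 'Company B': 'Topic_0'}
```

Companies whose best topic falls below the threshold get `None`, which keeps weakly matched descriptions out of a category instead of forcing a label on them.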


In [242]:
topics_list = [
    ("Topic_0", "Software Development and Security"),
    ("Topic_1", "Online Services and Sales"),
    ("Topic_2", "Healthcare and Senior Living"),
    ("Topic_3", "Education and Listing"),
    ("Topic_4", "Information Consulting and Research"),
    ("Topic_5", "Branded Manufacturing"),
    ("Topic_6", "Mining and Resources"),
    ("Topic_7", "Corporate Transactions"),
    ("Topic_8", "Home Construction and Retail"),
    ("Topic_9", "Real Estate Investment"),
    ("Topic_10", "Restaurant Chains"),
    ("Topic_11", "Entertainment and Airlines"),
    ("Topic_12", "Capital Investment"),
    ("Topic_13", "Natural Resources Production"),
    ("Topic_14", "Utilities and Electricity"),
    ("Topic_15", "Specialty Materials Manufacturing"),
    ("Topic_16", "Consumer Goods Producers"),
    ("Topic_17", "Clinical Laboratory and Healthcare"),
    ("Topic_18", "Banking and Mortgage"),
    ("Topic_19", "Diversified Conglomerate"),
    ("Topic_20", "Stock Exchanges and Components"),
    ("Topic_21", "Retail Stores and Brands"),
    ("Topic_22", "Real Estate Development"),
    ("Topic_23", "Corporate Rankings"),
    ("Topic_24", "Payments and Credit Processing"),
    ("Topic_25", "Industrial Equipment Manufacturing"),
    ("Topic_26", "Telecommunications and Solar"),
    ("Topic_27", "Automotive and Transportation"),
    ("Topic_28", "Industrial Manufacturing"),
    ("Topic_29", "Marketing and Risk Management"),
]

In [245]:
# Rename the Topic_x columns to their descriptive titles on a copy of df
dfn = df.rename(columns=dict(topics_list))

# Save the scored DataFrame
dfn.to_csv("scored.csv")

## Generating corpus with Gensim to classify companies with an unsupervised approach

We will use embeddings to calculate similarities among companies and to classify them based on the Standard Industrial Classification (SIC) system. The SIC system, developed by the U.S. government, assigns standardized numerical codes to businesses and industries to make economic reporting and analysis uniform, grouping companies by their primary economic activity. Each SIC code is a four-digit number: the first two digits identify the major group, the third the industry group, and the fourth the specific industry. We start with only the first level of the classification. For more information, see https://en.wikipedia.org/wiki/Standard_Industrial_Classification

Embeddings represent semantic relationships between words or entities in a vector space, so they let us capture nuanced similarities in the text extracted from company websites or Wikipedia pages. This gives a more granular view of industry affiliations and can make clustering and categorization more precise.

![SIC Classification](sic_codes.png)
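
As a minimal sketch of how similarity-based classification could work, the example below averages word vectors for a description and for each category's keywords, then picks the category with the highest cosine similarity. The 3-dimensional vectors, the two categories, and the helper names are all made up for illustration; the real GloVe vectors loaded later have 300 dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Invented toy embeddings and category keywords.
toy_vectors = {
    "mining": [0.9, 0.1, 0.0],
    "gold":   [0.8, 0.2, 0.1],
    "bank":   [0.1, 0.9, 0.1],
    "loan":   [0.0, 0.8, 0.2],
}
category_keywords = {"Mining": ["mining", "gold"], "Banking": ["bank", "loan"]}

def classify(words):
    """Return (best category, scores) for a tokenized description."""
    known = [toy_vectors[w] for w in words if w in toy_vectors]
    if not known:  # no word of the description has an embedding
        return None, {}
    doc_vec = mean_vector(known)
    scores = {
        cat: cosine(doc_vec, mean_vector([toy_vectors[k] for k in kws]))
        for cat, kws in category_keywords.items()
    }
    return max(scores, key=scores.get), scores

label, scores = classify(["gold", "loan", "mining"])
print(label)  # Mining
```

Averaging word vectors ignores word order and weights every word equally, so this is only a baseline; the keyword lists built below with `get_similar_words` play the role of `category_keywords` here.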

In [None]:
nlp = gensim_api.load("glove-wiki-gigaword-300")

In [None]:
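## Expand a list of seed words with their `top` most similar words in the embedding space
def get_similar_words(lst_words, top, nlp):
    lst_out = list(lst_words)  # Copy so the caller's list is not mutated
    for tupla in nlp.most_similar(lst_words, topn=top):
        lst_out.append(tupla[0])
    return list(set(lst_out))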

In [None]:
## Create Dictionary {category:[keywords]}
## Seed words are illustrative; adjust them to match the SIC divisions.
dic_clusters = {}
dic_clusters["Farming"] = get_similar_words(['agriculture','fishing','forestry','farming'], top=30, nlp=nlp)
dic_clusters["Mining"] = get_similar_words(['gold','coal','silver','mining','extraction'], top=30, nlp=nlp)
dic_clusters["Construction"] = get_similar_words(['build','construction','state'], top=30, nlp=nlp)
dic_clusters["Manufacturing"] = get_similar_words(['manufacture','plant'], top=30, nlp=nlp)
dic_clusters["Transportation"] = get_similar_words(['transportation','shipping','airline','logistics'], top=30, nlp=nlp)
dic_clusters["Retail"] = get_similar_words(['retail','store','shop','wholesale'], top=30, nlp=nlp)
dic_clusters["Banking"] = get_similar_words(['bank','banking','finance','insurance'], top=30, nlp=nlp)
dic_clusters["Services"] = get_similar_words(['services','consulting','hospitality','healthcare'], top=30, nlp=nlp)

## print results to explore
for k,v in dic_clusters.items():
    print(k, ": ", v[0:5], "...", len(v))

In [None]:
tot_words = [word for v in dic_clusters.values() for word in v]
X = nlp[tot_words]
tot_words

In [None]:
## t-SNE (initialized with PCA) to project the embeddings to 2D
tsne = manifold.TSNE(perplexity=40, n_components=2, init='pca')
X = tsne.fit_transform(X)  # numpy array with one 2D t-SNE vector per keyword
my_dataframe = pd.DataFrame(X, columns=['x', 'y'])
my_dataframe

In [None]:
## create a dataframe to plot the 2D t-SNE vectors, labelled by cluster
dtf = pd.DataFrame()
for k,v in dic_clusters.items():
    size = len(dtf) + len(v)
    dtf_group = pd.DataFrame(X[len(dtf):size], columns=["x","y"], 
                             index=v)
    dtf_group["cluster"] = k
    dtf = pd.concat([dtf, dtf_group])
        
## plot
fig, ax = plt.subplots()
sns.scatterplot(data=dtf, x="x", y="y", hue="cluster", ax=ax)
ax.legend(title=None)  # Redraw the legend without a title
ax.set(xlabel=None, ylabel=None, xticks=[], xticklabels=[], 
       yticks=[], yticklabels=[])
