# Exploratory Data Analysis 

I'd like to build a model that predicts a company's category based on the website text. Therefore, my EDA will focus on assessing the text data available. 

- Perform detailed EDA on CompanyMerged
- Visualize key aspects of data with notes relevant to model building
- Findings and hypotheses outlined below

In [1]:
import sys
import os
import pkgutil
from inspect import getmembers, isfunction
import pandas as pd

import plotly.express as px

# Dynamically get the current working directory
current_dir = os.getcwd()

# Add the path to utils/ directory, assuming it's one level up from the current working directory
utils_path = os.path.abspath(os.path.join(current_dir, '..', 'utils'))
sys.path.append(utils_path)

# Verify that the utils path is correctly added
print(f"Utils path added: {utils_path}")

# Check that the modules in the utils directory are found
print(f"Modules in utils directory: {[name for _, name, _ in pkgutil.iter_modules([utils_path])]}")

import db_utils as db

# Import helper_functions module after appending the correct path
try:
    import helper_functions as hf
    print("Successfully imported helper_functions.")
except ImportError as e:
    print(f"Failed to import helper_functions: {e}")

# Inspect and list all functions in helper_functions module
helper_funcs = getmembers(hf, isfunction)
print(f"Functions in helper_functions: {helper_funcs}")

# If no functions are found, print a warning message
if not helper_funcs:
    print("Warning: No functions found in helper_functions.py")

# Example: Call a function from helper_functions
if hasattr(hf, 'example_function_1'):
    result = hf.example_function_1()
    print(f"Result from 'example_function_1': {result}")


Utils path added: c:\Users\megan\OneDrive\Documents\GitHub\sqlite_to_analysis_app\utils
Modules in utils directory: ['db_utils', 'helper_functions', 'markdown_writer']
helper_functions.py has been loaded
Successfully imported helper_functions.
Functions in helper_functions: [('expand_contractions', <function expand_contractions at 0x000001C321EE25E0>), ('get_word_net_pos', <function get_word_net_pos at 0x000001C321EE2670>), ('join_text_columns', <function join_text_columns at 0x000001C321EE2550>), ('lemmatize_text', <function lemmatize_text at 0x000001C321EE2700>), ('lemmatize_with_pos', <function lemmatize_with_pos at 0x000001C321EE2790>), ('process_text', <function process_text at 0x000001C321EE2820>), ('remove_html', <function remove_html at 0x000001C321EE24C0>), ('word_count', <function word_count at 0x000001C31EC1B0D0>), ('word_freq', <function word_freq at 0x000001C321EE23A0>), ('word_tokenize', <function word_tokenize at 0x000001C3200E7430>)]


In [2]:
# get data path
db_path = os.path.abspath(os.path.join(current_dir, '..', 'data','combined_data.db'))
conn = db.connect_to_db(db_path)

In [3]:
# identify names of tables in the database
db.run_query(conn,"SELECT name FROM sqlite_master WHERE type='table'")

[('CompanyClassification',), ('CompanyDataset',), ('CompanyMerged',)]

In [4]:
# extract data to pandas dataframe
company_merged = pd.read_sql_query("SELECT * FROM CompanyMerged",conn)
# count the words within homepage_text
company_merged['len_homepage_text'] = company_merged['homepage_text'].apply(lambda x: hf.word_count(x) if x is not None else 0)
company_merged.head()

Unnamed: 0,Company_ID,CompanyName,Website,Industry,Size_Range,Locality,Country,Current_Employee_Estimate,Total_Employee_Estimate,Category,homepage_text,h1,h2,h3,nav_link_text,meta_keywords,meta_description,len_homepage_text
0,99,crinan hotel,crinanhotel.com,hospitality,1 - 10,"ardchonell, argyll and bute, united kingdom",united kingdom,1,3,Corporate Services,01546 830261 Crinan · by Lochgilp...,Latest News#sep#Website Privacy Statement#sep#...,How we use cookies#sep#Security#sep#Let's be S...,Accommodation#sep#Activities#sep#Experience Cr...,,"Crinan hotel, country house hotel, boutique ho...",Crinan Hotel - on waterfront overlooking Loch ...,3467
1,222,"spot on productions, llc",spotonproductionsllc.com,entertainment,1 - 10,"jackson, mississippi, united states",united states,2,3,"Media, Marketing & Sales",...,Storytelling Brought to Life.,,,,,"We're Philip Scarborough and Tom Beck, the for...",45
2,535,akhand jyoti eye hospital,akhandjyoti.in,hospital & health care,11 - 50,"saran, bihar, india",india,8,11,Healthcare,Donate ...,Eradicate Curable Blindness,"12,600,000#sep#In Low-Income States Of India",Our Girls Help#sep#Donate In Specific Programs...,"why blindness,women empowerment,our impact,abo...",Akhand Jyoti - the largest eye hospital in eas...,"Akhandjyoti, akhand jyoti eye hospital, non-pr...",909
3,642,lasercare eye center,dfweyes.com,medical practice,1 - 10,"irving, texas, united states",united states,4,11,Healthcare,...,,,,"home,why choose us,new patient information,pat...",,Call 214.574.9600 TODAY for an appointment! Th...,1633
4,675,compumachine inc,compumachine.com,machinery,1 - 10,"danvers, massachusetts, united states",united states,4,9,Industrials,MACHINES & AUTOMATION HOME MACHINE...,,MACHINES & AUTOMATION,,"home,machines,automation,mastercam,services,ab...",,Compumachine is proud to offer CNC Machine Too...,192


In [5]:
# check nulls per column of the merged table
print(f"Total Rows: {len(company_merged)}")
company_merged.isnull().sum(axis=0)

Total Rows: 73124


Company_ID                       0
CompanyName                      0
Website                          0
Industry                         0
Size_Range                       0
Locality                      1745
Country                          0
Current_Employee_Estimate        0
Total_Employee_Estimate          0
Category                         0
homepage_text                    0
h1                           26511
h2                           20055
h3                           28491
nav_link_text                25084
meta_keywords                49474
meta_description              6688
len_homepage_text                0
dtype: int64

In my sample, no company has a null homepage_text. 
- Roughly one third are missing each of h1, h2, h3, and nav_link_text (27% to 39% per field). 
- meta_keywords is missing for about two thirds of my sample, but only about 9% are missing meta_description.
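
The shares above can be computed directly instead of being read off the raw counts. A minimal sketch, using a small made-up frame in place of `company_merged` (the toy values are assumptions for illustration only):

```python
import pandas as pd

# Toy stand-in for company_merged; the values are invented for illustration.
toy = pd.DataFrame({
    "homepage_text": ["a b c", "d e", "f", "g h"],
    "h1": ["Title", None, None, "Other"],
    "meta_keywords": [None, None, None, "x,y"],
})

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy.isnull().mean().mul(100).round(1)
print(missing_share.sort_values(ascending=False))
```

Applied to the counts printed above (out of 73,124 rows), the same calculation gives roughly 36% for h1, 27% for h2, 39% for h3, 34% for nav_link_text, 68% for meta_keywords, and 9% for meta_description.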

It would make sense to join text from all available text fields to expand words available for predicting categories per company.
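
One way such a join could work is sketched below. This is not the implementation of `hf.join_text_columns`, which is not shown in this notebook; the name `join_text_fields` and the toy data are assumptions. Missing fields are treated as empty strings so they do not end up as the literal text "None" or "nan" in the joined string.

```python
import pandas as pd

def join_text_fields(df, columns, separator=" "):
    """Concatenate several text columns row by row, skipping missing values."""
    # fillna("") avoids 'nan' strings; astype(str) guards against non-string cells.
    parts = df[columns].fillna("").astype(str)
    # Keep only non-empty fields so no doubled separators appear.
    return parts.apply(lambda row: separator.join(v for v in row if v.strip()), axis=1)

# Toy stand-in for company_merged (hypothetical values).
toy = pd.DataFrame({
    "homepage_text": ["welcome to our hotel", "Loading"],
    "h1": ["Latest News", None],
    "meta_description": [None, "Mortgage Brokers"],
})
toy["Full_Text"] = join_text_fields(toy, ["homepage_text", "h1", "meta_description"])
print(toy["Full_Text"].tolist())
```

The second toy row shows the benefit: a homepage that only says "Loading" still picks up usable words from its meta description.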

In [10]:
clean_text = company_merged.loc[company_merged['len_homepage_text']>0]
fig = px.histogram(clean_text, x='len_homepage_text', title="Distribution of Homepage Words")

fig.show()

## Websites by Country

In [83]:
# top 20 countries by unique website count
top_countries = company_merged.groupby(['Country'], as_index=False)['Website'].nunique()
# sort by count (not by country name) so head(20) returns the largest countries
top_countries = top_countries.sort_values('Website', ascending=False)

fig = px.bar(top_countries.head(20)
             ,x='Country'
             ,y='Website')
fig.show()

Most websites in the sample come from the US.

## Categories Summary

In [11]:
# understand categories available
categories = company_merged['Category'].unique()
print("There are {} categories in CompanyMerged".format(len(categories)))
print(categories)

There are 13 categories in CompanyMerged
['Corporate Services' 'Media, Marketing & Sales' 'Healthcare'
 'Industrials' 'Commercial Services & Supplies' 'Consumer Discretionary'
 'Transportation & Logistics' 'Energy & Utilities' 'Financials'
 'Professional Services' 'Consumer Staples' 'Materials'
 'Information Technology']


In [12]:
# Group by Category 
grouped_df = company_merged.groupby(['Category'], as_index=False)['Website'].nunique()

# Create bar plot
fig = px.bar(
    grouped_df,
    x='Category',
    y='Website',
    color='Category',
    title='Unique Company Websites by Category',
    barmode='group'  # place bars side by side, one per category
)

# Adjust the axes to scale automatically per group
fig.update_yaxes(matches=None)  # This ensures y-axes are independent

# Show the plot
fig.show()

## Industries per Category

1. Corporate Services is mostly real estate, hospitality, and recruiting
2. Media has a large marketing skew
3. Healthcare is a blend of vet practices, providers and healthcare facilities
4. Industrials is mostly machinery

In [71]:
# top 10 industries per category by website count
top_ind = company_merged.groupby(['Category','Industry'], as_index=False)['Website'].nunique()
top_ind = top_ind.sort_values(['Category', 'Website'], ascending=[True, False])

# Select the top 10 industries per category
top_ind = top_ind.groupby('Category').head(10)

for i in categories[0:5]:
    df_cat = top_ind.loc[top_ind['Category']==i]
    fig = px.bar(df_cat
                ,x='Industry'
                ,y='Website'
                ,title=i)
    fig.show()


### ...continued

1. Financials is mostly insurance and financial services
2. Consumer Staples is mostly food related
3. Professional Services has law as its most frequent industry
4. Information Technology is mostly the information technology industry

In [72]:
for i in categories[5:]:  # start at index 5 so no category is skipped after categories[0:5]
    df_cat = top_ind.loc[top_ind['Category']==i]
    fig = px.bar(df_cat
                ,x='Industry'
                ,y='Website'
                ,title=i)
    fig.show()

## Text Overview per Category

In [13]:
# Group by Category for total word count
grouped_df = company_merged.groupby(['Category'], as_index=False)['len_homepage_text'].sum()

# Create bar plot
fig = px.bar(
    grouped_df,
    x='Category',
    y='len_homepage_text',
    color='Category',
    title='Words by Category',
    barmode='group'  # place bars side by side, one per category
)

# Adjust the axes to scale automatically per group
fig.update_yaxes(matches=None)  # This ensures y-axes are independent

# Show the plot
fig.show()

In [14]:
# visually inspect some examples of the homepage text
# print top 3 examples by word count
# Series.iteritems() was removed in pandas 2.0; Series.items() is the replacement
for i,row in company_merged.sort_values(by='len_homepage_text',ascending=False)['homepage_text'].iloc[:3].reset_index(drop=True).items():
    print(i,row)

0                    LOADING                               Browse Events    Past Events    News    Event Alerts    Vendor    About Us    How it Works    FAQs    Contact        Publish Event   Publish Press Release                           Register for Dundalk Institute of Technology Admission Program 2019       Blarose Lifestyle & Fashion Expo       Blarose Winter Edit- Lifestyle & Fashion Expo       Blarose Lifestyle and Fashion Expo- Season 3       Global Educators Fest 2018 , 3 AUG 2018 - 4 AUG 2018                    -- Select Sector --  Automobiles  Healthcare  IT & ITeS  Engineering  Services  Cement  Aviation  Startups  Food Industry  Education and Training  Science and Technology  Government  Real Estate  Pharmaceuticals  Media and Entertainment  Financial Services  Consumer Markets  Urban Market  Auto Components  Tourism and Hospitality  Agriculture  Textiles  Manufacturing  Gems and Jewellery  Food & Beverage  Consultancy  Not for Profit  Business Services  Environment  Infr

In [15]:
# print bottom 10 examples by word count
# Series.iteritems() was removed in pandas 2.0; Series.items() is the replacement
for i,row in company_merged.sort_values(by='len_homepage_text',ascending=True)['homepage_text'].iloc[:10].reset_index(drop=True).items():
    print(i,row)

0 RackCorp.com
1 ÍøÕ¾·ÃÎÊÈÏÖ¤£¬µã»÷Á´½Óºó½«Ìø×ªµ½·ÃÎÊÒ³Ãæ
2 welcome
3 Skip
4 www.gs-co.eu
5 Loading
6 welcome
7 ...
8 welcome
9 Skip


In [16]:
print("Row count with less than 50 words: {}".format(len(company_merged.loc[company_merged['len_homepage_text']<50])))
company_merged.loc[company_merged['len_homepage_text']<20].head()

Row count with less than 50 words: 1516


Unnamed: 0,Company_ID,CompanyName,Website,Industry,Size_Range,Locality,Country,Current_Employee_Estimate,Total_Employee_Estimate,Category,homepage_text,h1,h2,h3,nav_link_text,meta_keywords,meta_description,len_homepage_text
17,1809,guelph medical laser skin centre,guelphlaser.com,medical practice,1 - 10,"guelph, ontario, canada",canada,1,1,Healthcare,,,,,,"Laser Hair REmoval, CoolSculpting, Baby Belly,...",Guelph Medical Laser & Skin Centre offer Laser...,5
55,5175,new era debt solutions,neweradebtsolutions.com,financial services,1 - 10,"camarillo, california, united states",united states,1,2,Financials,,,,,,,,2
111,10819,"live edge media, llc",live-edge-media.com,photography,1 - 10,"cleona, pennsylvania, united states",united states,1,1,"Media, Marketing & Sales",,,,,,,Live Edge Media photography and videography. S...,7
196,21723,dominion lending centres clearmortgage.ca,clearmortgage.ca,financial services,1 - 10,"penticton, british columbia, canada",canada,2,2,Financials,,,,,,"mortgages, rates, broker, mortgage, lender, ba...",Mortgage Brokers,5
210,22706,7 accounts - xero accountants in london and ch...,7accounts.uk,accounting,1 - 10,"chichester, west sussex, united kingdom",united kingdom,2,2,Professional Services,,,,,,,Get online with Website Builder! Create a free...,5


While cases where homepage_text was null have been removed, there are still examples where the text will be empty or have too few words to use.

Will need to clean:
- punctuation
- unicode
- html formatting
- stopwords
- contractions
- indentations, paragraphs etc
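
Two of these items, unicode and layout whitespace, can be handled with the standard library alone. The sketch below is an illustration rather than the code in `helper_functions`; `normalize_whitespace_and_unicode` is a hypothetical name. NFKC normalization folds compatibility characters such as the non-breaking space into their plain equivalents, after which runs of whitespace collapse to a single space.

```python
import re
import unicodedata

def normalize_whitespace_and_unicode(text):
    """Fold compatibility characters and collapse indentation and newlines."""
    if text is None:
        return ""
    # NFKC maps e.g. the non-breaking space U+00A0 to a regular space.
    text = unicodedata.normalize("NFKC", text)
    # Collapse any run of spaces, tabs, or newlines into a single space.
    return re.sub(r"\s+", " ", text).strip()

sample = "   01546 830261  Crinan\xa0·\xa0by Lochgilphead\n\n   Home Hotel   "
print(normalize_whitespace_and_unicode(sample))
```

This does not repair text that was decoded with the wrong character set, such as the garbled row in the short-text output above; that would need the original bytes or the correct encoding.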

In [17]:
# Check the distribution of Total_Employee_Estimate per Category
# Calculate the 5th and 95th percentiles
lower_bound = company_merged['Total_Employee_Estimate'].quantile(0.05)
upper_bound = company_merged['Total_Employee_Estimate'].quantile(0.95)

print("upper bound: {}".format(upper_bound))
print("lower_bound: {}".format(lower_bound))

# Filter the DataFrame to remove outliers
filtered_df = company_merged[(company_merged['Total_Employee_Estimate'] >= lower_bound) & 
                  (company_merged['Total_Employee_Estimate'] <= upper_bound)]

# Create faceted charts
fig = px.histogram(
    filtered_df,
    x='Total_Employee_Estimate',
    color='Category',
    facet_col='Category',
    title='Distribution of Total Employee Estimate per Category'
)

# Show the plot
fig.show()

upper bound: 127.0
lower_bound: 1.0


In [18]:
# Check the distribution of Current_Employee_Estimate per Category
# Calculate the 5th and 95th percentiles
lower_bound = company_merged['Current_Employee_Estimate'].quantile(0.05)
upper_bound = company_merged['Current_Employee_Estimate'].quantile(0.95)

print("upper bound: {}".format(upper_bound))
print("lower_bound: {}".format(lower_bound))

# Filter the DataFrame to remove outliers
filtered_df = company_merged[(company_merged['Current_Employee_Estimate'] >= lower_bound) & 
                  (company_merged['Current_Employee_Estimate'] <= upper_bound)]

# Create faceted charts
fig = px.histogram(
    filtered_df,
    x='Current_Employee_Estimate',
    color='Category',
    facet_col='Category',
    title='Distribution of Current Employee Estimate per Category'
)

# Show the plot
fig.show()

upper bound: 57.0
lower_bound: 0.0


Consumer Discretionary, Industrials, and Materials are the least represented categories in this dataset. However, the skew across categories isn't large enough for me to rebalance the distribution. I will test classification with the data as is for the time being. 
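
A quick way to put a number on that skew is the share of rows per category and the ratio between the largest and smallest class. The sketch below uses invented labels in place of `company_merged['Category']`; the counts are assumptions for illustration, not results from this dataset.

```python
import pandas as pd

# Toy labels standing in for company_merged['Category']; counts are invented.
labels = pd.Series(
    ["Healthcare"] * 6 + ["Financials"] * 3 + ["Materials"] * 1,
    name="Category",
)

share = labels.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = share.max() / share.min()   # largest class vs smallest class
print(share.round(2))
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

If the ratio on the real data turned out to be large, class weights or stratified sampling would be worth considering before modeling.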

## Text Cleaning for Further Analysis

1. Merge text from all text fields into one extended string
2. Remove HTML tags
3. Update contractions
4. Remove punctuation
5. Remove stopwords
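
Steps 2 to 5 can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the code behind `hf.process_text`: the function name `clean_text`, the tiny contraction map, and the stopword set are all hypothetical and far shorter than real lists would be.

```python
import html
import re

# Tiny illustrative lists; real stopword and contraction lists would be much longer.
CONTRACTIONS = {"we're": "we are", "let's": "let us", "don't": "do not"}
STOPWORDS = {"we", "are", "the", "and", "a", "to", "us", "of", "for"}

def clean_text(text):
    """Return a list of lowercase tokens after the cleaning steps listed above."""
    text = html.unescape(text)                    # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)          # drop HTML tags
    text = text.lower().replace("\u2019", "'")    # unify curly apostrophes
    for short, full in CONTRACTIONS.items():      # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)         # remove punctuation and digits
    return [tok for tok in text.split() if tok not in STOPWORDS]

print(clean_text("<h1>We're Philip &amp; Tom</h1> Call 214.574.9600 for the demo!"))
```

Contractions are expanded before punctuation is removed; in the other order the apostrophe would already be gone and "we're" would be split into "we" and "re".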

In [5]:
# Join text from specified columns
columns_to_merge = ['homepage_text', 'h1', 'h2','h3','nav_link_text','meta_keywords','meta_description']
company_merged['Full_Text'] = hf.join_text_columns(company_merged, columns_to_merge, separator=' ')
# count the words within Full_Text
company_merged['len_Full_Text'] = company_merged['Full_Text'].apply(lambda x: hf.word_count(x) if x is not None else 0)
print('Rows with less than 50 words of full text: {}'.format(len(company_merged.loc[company_merged['len_Full_Text']<50])))

company_merged.head()


Rows with less than 50 words of full text: 1657


Unnamed: 0,Company_ID,CompanyName,Website,Industry,Size_Range,Locality,Country,Current_Employee_Estimate,Total_Employee_Estimate,Category,homepage_text,h1,h2,h3,nav_link_text,meta_keywords,meta_description,len_homepage_text,Full_Text,len_Full_Text
0,99,crinan hotel,crinanhotel.com,hospitality,1 - 10,"ardchonell, argyll and bute, united kingdom",united kingdom,1,3,Corporate Services,01546 830261 Crinan · by Lochgilp...,Latest News#sep#Website Privacy Statement#sep#...,How we use cookies#sep#Security#sep#Let's be S...,Accommodation#sep#Activities#sep#Experience Cr...,,"Crinan hotel, country house hotel, boutique ho...",Crinan Hotel - on waterfront overlooking Loch ...,3467,01546 830261 Crinan · by Lochgilp...,3665
1,222,"spot on productions, llc",spotonproductionsllc.com,entertainment,1 - 10,"jackson, mississippi, united states",united states,2,3,"Media, Marketing & Sales",...,Storytelling Brought to Life.,,,,,"We're Philip Scarborough and Tom Beck, the for...",45,...,75
2,535,akhand jyoti eye hospital,akhandjyoti.in,hospital & health care,11 - 50,"saran, bihar, india",india,8,11,Healthcare,Donate ...,Eradicate Curable Blindness,"12,600,000#sep#In Low-Income States Of India",Our Girls Help#sep#Donate In Specific Programs...,"why blindness,women empowerment,our impact,abo...",Akhand Jyoti - the largest eye hospital in eas...,"Akhandjyoti, akhand jyoti eye hospital, non-pr...",909,Donate ...,1015
3,642,lasercare eye center,dfweyes.com,medical practice,1 - 10,"irving, texas, united states",united states,4,11,Healthcare,...,,,,"home,why choose us,new patient information,pat...",,Call 214.574.9600 TODAY for an appointment! Th...,1633,...,1820
4,675,compumachine inc,compumachine.com,machinery,1 - 10,"danvers, massachusetts, united states",united states,4,9,Industrials,MACHINES & AUTOMATION HOME MACHINE...,,MACHINES & AUTOMATION,,"home,machines,automation,mastercam,services,ab...",,Compumachine is proud to offer CNC Machine Too...,192,MACHINES & AUTOMATION HOME MACHINE...,228


In [20]:
company_merged['Full_Text'][0]

'            01546 830261  Crinan\xa0·\xa0by Lochgilphead\xa0·\xa0PA31 8SR                 Home Hotel History The Ryan Family Awards Reviews Crinan from the air Accommodation Rooms at Crinan Classic Double Balcony Twin / Double Superior Twin / Double Rates and Reservations Yours Exclusively Dogs are welcome Facilities and Services Food & Drink Lock 16 The Westward Crinan Seafood Bar The Pub Crinan Coffee Shop Sample Menus & Wine List Weddings Romantic Breaks Our Secret Garden Crinan Fine Art Art and Music weekends Fine Art Prints For Sale Crinan Gallery Exhibitions Frances Macdonald Ross Ryan Painting Holidays Sleep with the Art Activities & Boat Trips Boat trips on the Sgarbh The Corryvreckan Whirlpool Golf near Crinan Health & Beauty Heart of Argyll Wildlife Organisation History and Heritage Knapdale Beavers at Barnluasgan Kilmartin Glen and Kilmartin Museum Tarbert on Loch Fyne Visitor Attractions Walking at Crinan Whisky Distilleries Upcoming events Special offers Gift vouchers Tra

In [6]:
# keep only rows with more than 50 words across all available website text
# .copy() makes df_clean an independent DataFrame, which avoids SettingWithCopyWarning
df_clean = company_merged.loc[company_merged['len_Full_Text']>50].copy()

# remove HTML tags
# expand contractions
# remove punctuation and numbers
# remove stopwords
df_clean['clean_text'] = df_clean['Full_Text'].apply(hf.process_text)
df_clean.head()


Unnamed: 0,Company_ID,CompanyName,Website,Industry,Size_Range,Locality,Country,Current_Employee_Estimate,Total_Employee_Estimate,Category,...,h1,h2,h3,nav_link_text,meta_keywords,meta_description,len_homepage_text,Full_Text,len_Full_Text,clean_text
0,99,crinan hotel,crinanhotel.com,hospitality,1 - 10,"ardchonell, argyll and bute, united kingdom",united kingdom,1,3,Corporate Services,...,Latest News#sep#Website Privacy Statement#sep#...,How we use cookies#sep#Security#sep#Let's be S...,Accommodation#sep#Activities#sep#Experience Cr...,,"Crinan hotel, country house hotel, boutique ho...",Crinan Hotel - on waterfront overlooking Loch ...,3467,01546 830261 Crinan · by Lochgilp...,3665,"[crinan, lochgilphead, pa, sr, hotel, history,..."
1,222,"spot on productions, llc",spotonproductionsllc.com,entertainment,1 - 10,"jackson, mississippi, united states",united states,2,3,"Media, Marketing & Sales",...,Storytelling Brought to Life.,,,,,"We're Philip Scarborough and Tom Beck, the for...",45,...,75,"[reels, work, storytelling, brought, life, phi..."
2,535,akhand jyoti eye hospital,akhandjyoti.in,hospital & health care,11 - 50,"saran, bihar, india",india,8,11,Healthcare,...,Eradicate Curable Blindness,"12,600,000#sep#In Low-Income States Of India",Our Girls Help#sep#Donate In Specific Programs...,"why blindness,women empowerment,our impact,abo...",Akhand Jyoti - the largest eye hospital in eas...,"Akhandjyoti, akhand jyoti eye hospital, non-pr...",909,Donate ...,1015,"[donate, gift, someone, sight, support, girl, ..."
3,642,lasercare eye center,dfweyes.com,medical practice,1 - 10,"irving, texas, united states",united states,4,11,Healthcare,...,,,,"home,why choose us,new patient information,pat...",,Call 214.574.9600 TODAY for an appointment! Th...,1633,...,1820,"[lasik, hotline, main, number, toll, free, irv..."
4,675,compumachine inc,compumachine.com,machinery,1 - 10,"danvers, massachusetts, united states",united states,4,9,Industrials,...,,MACHINES & AUTOMATION,,"home,machines,automation,mastercam,services,ab...",,Compumachine is proud to offer CNC Machine Too...,192,MACHINES & AUTOMATION HOME MACHINE...,228,"[machines, automation, machines, automation, m..."


In [23]:
# explicit copy so the column assignments below do not raise SettingWithCopyWarning
df_clean = df_clean.copy()
# word count and space-joined string version of the cleaned token lists
df_clean['len_clean_text'] = df_clean['clean_text'].apply(lambda x: hf.word_count(' '.join(x)) if x is not None else 0)
df_clean['clean_text_str'] = df_clean['clean_text'].apply(lambda x: ' '.join(x))



In [24]:
# pickle results out for later modeling
text_path = os.path.abspath(os.path.join(current_dir, '..', 'output','combined_data.pkl'))
df_clean.to_pickle(text_path)

# read data back in 
df_clean2 = pd.read_pickle(text_path)

In [25]:
#Top 20 most frequent words for all websites
# review stopwords as needed
cl_text_list = df_clean['clean_text'].tolist()
wf = hf.word_freq(cl_text_list, 20)
wf.head(20)

Unnamed: 0,Word,Frequency
30,policy,57439
31,office,57116
32,health,56659
33,years,56231
34,provide,55759
35,online,55728
36,property,53508
37,need,52291
38,make,52246
39,needs,51468


In [11]:
# check through top words and output results to excel file
# update punctuation list and stopwords after reviewing these words
# want to capture all punctuation and general "website" words that don't provide insight into the actual industry (e.g. online, search, etc.)
df_list = []
categories = df_clean['Category'].unique()

for i in categories:
    df_cat = df_clean.loc[df_clean['Category']==i]
    cl_text_list = df_cat['clean_text'].tolist()
    wf = hf.word_freq(cl_text_list, 50)
    # word_freq returns columns named 'Word' and 'Frequency' (see the output above)
    wf = wf.rename(columns={'Word':'word','Frequency':'count'})
    wf['Category']=i
    df_list.append(wf)

df_result = pd.concat(df_list)

In [13]:
# df_result.to_excel("C:/Users/megan/OneDrive/Documents/GitHub/sqlite_to_analysis_app/output/top_word.xlsx")

## Website EDA Part 2

In [12]:
print("Average Words in Cleaned Text:")
avg_word_count = df_clean.groupby('Category')['len_clean_text'].mean().reset_index()
avg_word_count

Average Words in Cleaned Text:


Unnamed: 0,Category,len_clean_text
0,Commercial Services & Supplies,450.948279
1,Consumer Discretionary,571.473462
2,Consumer Staples,447.104303
3,Corporate Services,539.533549
4,Energy & Utilities,447.994571
5,Financials,517.715059
6,Healthcare,526.599914
7,Industrials,456.608185
8,Information Technology,519.537943
9,Materials,452.415167


In [13]:
# Group by Category for total word count
tot_word_count = df_clean.groupby(['Category'], as_index=False)['len_clean_text'].sum()

# Create bar plot
fig = px.bar(
    tot_word_count,
    x='Category',
    y='len_clean_text',
    color='Category',
    title='Total Clean Words by Category',
    barmode='group'  # place bars side by side, one per category
)

# Adjust the axes to scale automatically per group
fig.update_yaxes(matches=None)  # This ensures y-axes are independent

# Show the plot
fig.show()

In [14]:
# distribution of clean word count by category

# Create faceted charts
fig = px.histogram(
    df_clean.loc[df_clean['len_clean_text']<5000],
    x='len_clean_text',
    color='Category',
    facet_col='Category',
    title='Distribution of Total Word Count per Category'
)

# Show the plot
fig.show()

## Top Words Per Category

In [15]:
def split_dataframe_by_category(df, split_column='Category'):
    """
    Splits a DataFrame into multiple DataFrames based on unique values in a category column.
    
    Args:
    - df (df): pandas DataFrame
    - split_column (str): the column name to split the DataFrame on (default is 'Category')
    
    Returns:
    - A dictionary where the keys are unique categories, and the values are DataFrames
    """
    unique_categories = df[split_column].unique()  # Get unique categories
    category_dfs = {category: df[df[split_column] == category].copy() for category in unique_categories}
    
    return category_dfs

In [38]:
category_dfs = split_dataframe_by_category(df_clean[['Category','clean_text']], split_column='Category')

# should try and turn this into a function but went the tedious route for ease
corp_df = category_dfs['Corporate Services'].reset_index()
media_df = category_dfs['Media, Marketing & Sales'].reset_index()
health_df = category_dfs['Healthcare'].reset_index()
indust_df = category_dfs['Industrials'].reset_index()
comm_df = category_dfs['Commercial Services & Supplies'].reset_index()
consum_df = category_dfs['Consumer Discretionary'].reset_index()
trans_df = category_dfs['Transportation & Logistics'].reset_index()
ener_df = category_dfs['Energy & Utilities'].reset_index()
fin_df = category_dfs['Financials'].reset_index()
prof_df = category_dfs['Professional Services'].reset_index()
constap_df = category_dfs['Consumer Staples'].reset_index()
mat_df = category_dfs['Materials'].reset_index()
it_df = category_dfs['Information Technology'].reset_index()


In [40]:
top_n = 20

corp_top = hf.word_freq(corp_df['clean_text'].tolist(),top_n)
media_top = hf.word_freq(media_df['clean_text'].tolist(),top_n)
health_top = hf.word_freq(health_df['clean_text'].tolist(),top_n)
indust_top = hf.word_freq(indust_df['clean_text'].tolist(),top_n)
comm_top = hf.word_freq(comm_df['clean_text'].tolist(),top_n)
consum_top = hf.word_freq(consum_df['clean_text'].tolist(),top_n)
trans_top = hf.word_freq(trans_df['clean_text'].tolist(),top_n)
ener_top = hf.word_freq(ener_df['clean_text'].tolist(),top_n)
fin_top = hf.word_freq(fin_df['clean_text'].tolist(),top_n)
prof_top =hf.word_freq(prof_df['clean_text'].tolist(),top_n)
constap_top = hf.word_freq(constap_df['clean_text'].tolist(),top_n)
mat_top = hf.word_freq(mat_df['clean_text'].tolist(),top_n)
it_top = hf.word_freq(it_df['clean_text'].tolist(),top_n)

# consum_top must be included, otherwise every label after commercial_services shifts by one category
df_wf = pd.concat([corp_top, media_top,health_top,indust_top,comm_top,consum_top,trans_top,ener_top,fin_top,prof_top,constap_top, mat_top,it_top],axis=1)
# one (word, count) column pair per frame, in the same order as the concat above
cols = ['corporate_services','count'
        ,'media_marketing','count'
        ,'healthcare','count'
        ,'industrials','count'
        ,'commercial_services','count'
        ,'consumer_discretionary','count'
        ,'transportation','count'
        ,'energy','count'
        ,'financial','count'
        ,'professional','count'
        ,'consumer_staples','count'
        ,'materials','count'
        ,'IT','count']
df_wf.columns = cols
df_wf

Unnamed: 0,corporate_services,count,media_marketing,count.1,healthcare,count.2,industrials,count.3,commercial_services,count.4,...,financial,count.5,professional,count.6,consumer_staples,count.7,materials,count.8,IT,count.9
0,property,30946,marketing,41600,care,55066,machine,12371,security,21440,...,insurance,132280,law,52978,coffee,21543,products,12793,solutions,39141
1,hotel,20730,media,22157,medical,33509,equipment,11359,cleaning,17904,...,business,36023,business,29650,food,18571,steel,12294,business,31214
2,management,16757,business,18463,health,29457,parts,8911,electrical,14985,...,financial,32410,firm,23912,products,14729,metal,8068,software,28790
3,properties,15340,design,17349,hospital,24432,products,8661,commercial,14918,...,mortgage,19804,legal,23368,shop,8738,quality,5073,support,22153
4,training,14970,video,13081,pet,21927,machines,7081,office,14205,...,life,17584,attorney,18315,ã¢â‚¬â,7332,gold,4348,management,21218
5,estate,13717,digital,12869,patient,20346,industrial,6574,furniture,13378,...,personal,16177,tax,18191,foods,6994,product,4027,development,21037
6,real,13500,social,12732,eye,17145,machinery,6383,systems,10416,...,loan,14405,clients,17427,``,6144,chemical,3823,data,17915
7,business,11379,work,12638,surgery,15595,quality,6292,work,9978,...,loans,13589,lawyer,14727,quality,6125,news,3752,web,15642
8,team,10859,website,12612,patients,14374,cnc,6021,quality,9630,...,investment,13550,personal,13031,–,5780,chemicals,3248,cloud,13846
9,search,10741,event,11442,veterinary,14247,machining,5417,air,9620,...,auto,13403,injury,12858,free,5698,stainless,3190,technology,13294
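
The thirteen hand-written variables can be collapsed into one loop, which also keeps every column label tied to its own data. A minimal sketch: `top_words` is a hypothetical stand-in for `hf.word_freq` (whose implementation is not shown here), and the toy frame replaces `df_clean[['Category', 'clean_text']]`.

```python
from collections import Counter
import pandas as pd

def top_words(token_lists, n):
    """Stand-in for hf.word_freq: count tokens across documents, keep the n most common."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return pd.DataFrame(counts.most_common(n), columns=["word", "count"])

# Toy stand-in for df_clean[['Category', 'clean_text']] (invented values).
toy = pd.DataFrame({
    "Category": ["Healthcare", "Healthcare", "Financials"],
    "clean_text": [["care", "eye", "care"], ["care", "pet"], ["loan", "insurance", "loan"]],
})

frames = {}
for category, group in toy.groupby("Category"):
    # Key each result by its category so labels cannot drift out of step with the data.
    frames[category] = top_words(group["clean_text"].tolist(), n=2)

# Passing a dict to concat produces a column MultiIndex: (category, word/count).
df_wf = pd.concat(frames, axis=1)
print(df_wf)
```

Because the category name travels with each frame, adding or removing a category needs no change to a separate list of column names.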


### Further Steps
Running through these EDA text exercises several times led me to identify more patterns for text cleaning:
 1. Added stopwords
 2. Added punctuation patterns

I would like to try stemming or lemmatizing to improve my text results (e.g. merging loan & loans).
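
To make the loan/loans example concrete, the sketch below merges simple plural forms before counting. It is a deliberately crude suffix rule, not a real stemmer or lemmatizer, and `crude_singular` is a hypothetical name. `helper_functions` already lists `lemmatize_text` and `lemmatize_with_pos`, which are likely the better route, but their behavior is not shown in this notebook.

```python
from collections import Counter

def crude_singular(token):
    """Strip a plural 's' from longer tokens; leave 'ss', 'us', and 'is' endings alone."""
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us", "is")):
        return token[:-1]
    return token

tokens = ["loan", "loans", "business", "machines", "machine", "analysis"]
counts = Counter(crude_singular(tok) for tok in tokens)
print(counts.most_common())
```

The rule has obvious gaps: "properties" would become "propertie" instead of "property", which is the kind of case a dictionary-based lemmatizer handles properly.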

## Findings

The data supports a model to predict a company's category based on its website text. 
- The companies represented within each category are skewed toward certain industries. 
- Roughly two thirds of the websites come from the US (48K out of 73K), so the dataset mostly represents a US/English-speaking audience.

For these reasons, I think the model will perform best for certain industries within each category.