# Exploratory Data Analysis 

I'd like to build a model that predicts a company's category based on the website text. Therefore, my EDA will focus on assessing the text data available. 

- Perform detailed EDA on CompanyMerged
- Visualize key aspects of data with notes relevant to model building
- Findings and hypotheses outlined below

In [7]:
import sys
import os
import pkgutil
from inspect import getmembers, isfunction
import pandas as pd

# Dynamically get the current working directory
current_dir = os.getcwd()

# Add the path to utils/ directory, assuming it's one level up from the current working directory
utils_path = os.path.abspath(os.path.join(current_dir, '..', 'utils'))
sys.path.append(utils_path)

# Verify that the utils path is correctly added
print(f"Utils path added: {utils_path}")

# Check that the modules in the utils directory are found
print(f"Modules in utils directory: {[name for _, name, _ in pkgutil.iter_modules([utils_path])]}")

import db_utils as db

# Import helper_functions module after appending the correct path
try:
    import helper_functions as hf
    print("Successfully imported helper_functions.")
except ImportError as e:
    print(f"Failed to import helper_functions: {e}")
    raise  # the cells below depend on hf, so stop here rather than hit a NameError later

# Inspect and list all functions in helper_functions module
helper_funcs = getmembers(hf, isfunction)
print(f"Functions in helper_functions: {helper_funcs}")

# If no functions are found, print a warning message
if not helper_funcs:
    print("Warning: No functions found in helper_functions.py")

# Example: Call a function from helper_functions
if hasattr(hf, 'example_function_1'):
    result = hf.example_function_1()
    print(f"Result from 'example_function_1': {result}")




Utils path added: c:\Users\megan\OneDrive\Documents\GitHub\sqlite_to_analysis_app\utils
Modules in utils directory: ['db_utils', 'helper_functions', 'markdown_writer']
Successfully imported helper_functions.
Functions in helper_functions: [('word_count', <function word_count at 0x000001FDEF2C69D0>), ('word_freq', <function word_freq at 0x000001FDEF2C6A60>), ('word_tokenize', <function word_tokenize at 0x000001FDED4CB280>)]


In [3]:
db_path = "C:/Users/megan/OneDrive/Documents/GitHub/sqlite_to_analysis_app/data/combined_data.db"
conn = db.connect_to_db(db_path)

In [5]:
# identify names of tables in the database
db.run_query(conn,"SELECT name FROM sqlite_master WHERE type='table'")

[('CompanyClassification',), ('CompanyDataset',), ('CompanyMerged',)]

In [8]:
# extract data to pandas dataframe
company_merged = pd.read_sql_query("SELECT * FROM CompanyMerged",conn)
# count the words within homepage_text
company_merged['len_homepage_text'] = company_merged['homepage_text'].apply(lambda x: hf.word_count(x) if x is not None else 0)
company_merged.head()

Unnamed: 0,Company_ID,CompanyName,Website,Industry,Size_Range,Locality,Country,Current_Employee_Estimate,Total_Employee_Estimate,Category,homepage_text,h1,h2,h3,nav_link_text,meta_keywords,meta_description,len_homepage_text
0,99,crinan hotel,crinanhotel.com,hospitality,1 - 10,"ardchonell, argyll and bute, united kingdom",united kingdom,1,3,Corporate Services,01546 830261 Crinan · by Lochgilp...,Latest News#sep#Website Privacy Statement#sep#...,How we use cookies#sep#Security#sep#Let's be S...,Accommodation#sep#Activities#sep#Experience Cr...,,"Crinan hotel, country house hotel, boutique ho...",Crinan Hotel - on waterfront overlooking Loch ...,3897
1,222,"spot on productions, llc",spotonproductionsllc.com,entertainment,1 - 10,"jackson, mississippi, united states",united states,2,3,"Media, Marketing & Sales",...,Storytelling Brought to Life.,,,,,"We're Philip Scarborough and Tom Beck, the for...",200
2,535,akhand jyoti eye hospital,akhandjyoti.in,hospital & health care,11 - 50,"saran, bihar, india",india,8,11,Healthcare,Donate ...,Eradicate Curable Blindness,"12,600,000#sep#In Low-Income States Of India",Our Girls Help#sep#Donate In Specific Programs...,"why blindness,women empowerment,our impact,abo...",Akhand Jyoti - the largest eye hospital in eas...,"Akhandjyoti, akhand jyoti eye hospital, non-pr...",1426
3,642,lasercare eye center,dfweyes.com,medical practice,1 - 10,"irving, texas, united states",united states,4,11,Healthcare,...,,,,"home,why choose us,new patient information,pat...",,Call 214.574.9600 TODAY for an appointment! Th...,2319
4,675,compumachine inc,compumachine.com,machinery,1 - 10,"danvers, massachusetts, united states",united states,4,9,Industrials,MACHINES & AUTOMATION HOME MACHINE...,,MACHINES & AUTOMATION,,"home,machines,automation,mastercam,services,ab...",,Compumachine is proud to offer CNC Machine Too...,242


In [10]:
# check nulls per column of the merged table
print(f"Total Rows: {len(company_merged)}")
company_merged.isnull().sum(axis=0)

Total Rows: 73124


Company_ID               0
CompanyName              0
Website                  0
Industry                 0
Size_Range               0
Category                 0
homepage_text            0
h1                   26511
h2                   20055
h3                   28491
nav_link_text        25084
meta_keywords        49474
meta_description      6688
len_homepage_text        0
dtype: int64

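In my sample, homepage_text is never null, so every company has a homepage text entry (though not necessarily a non-empty one).
- Roughly one third are missing h1 (36%), h3 (39%), or nav_link_text (34%); h2 is missing for about 27%.
- meta_keywords is missing for about 68% of my sample, but only about 9% are missing meta_description.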
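
The percentages above come from dividing each null count by the total row count. A minimal sketch of that calculation, run on a small made-up frame standing in for company_merged (the `toy` data is invented for illustration only):

```python
import pandas as pd

# Toy stand-in for company_merged; values are invented for illustration
toy = pd.DataFrame({
    "homepage_text": ["a b c", "d e", "f", "g h"],
    "h1": ["Title", None, None, "Other"],
    "meta_keywords": [None, None, None, "x,y"],
})

# isnull() gives a boolean frame; its column-wise mean is the fraction of nulls
missing_share = toy.isnull().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_share)
```

Applied to the real table, the same expression would be `company_merged.isnull().mean()`.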

It would make sense to join text from all available text fields to expand words available for predicting categories per company.
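
One possible way to do this join is sketched below. It assumes the column names shown in the null counts above and the `#sep#` separator visible in the h1-h3 samples; the helper name `combine_text` and the toy rows are hypothetical, not part of the existing utils.

```python
import pandas as pd

# Text columns seen in CompanyMerged; adjust if the schema differs
TEXT_COLS = ["homepage_text", "h1", "h2", "h3",
             "nav_link_text", "meta_keywords", "meta_description"]

def combine_text(df, cols=TEXT_COLS, sep_token="#sep#"):
    """Join whichever text columns are present into one string per row."""
    present = [c for c in cols if c in df.columns]
    joined = df[present].fillna("").astype(str).apply(
        lambda row: " ".join(row), axis=1
    )
    return (
        joined.str.replace(sep_token, " ", regex=False)  # drop the list separator
              .str.replace(r"\s+", " ", regex=True)      # collapse runs of whitespace
              .str.strip()
    )

# Toy rows, invented for illustration
toy = pd.DataFrame({
    "homepage_text": ["Welcome to Acme", ""],
    "h1": ["News#sep#Contact", None],
    "meta_description": [None, "Bakery in Spokane"],
})
combined = combine_text(toy)
print(combined.tolist())
```

A row with an empty homepage_text still ends up with usable words when another field, such as meta_description, is populated.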

In [13]:
import plotly.express as px

clean_text = company_merged.loc[company_merged['len_homepage_text']>0]
fig = px.histogram(clean_text, x='len_homepage_text', title="Distribution of Homepage Words")
fig.show()

In [17]:
# understand categories available
categories = company_merged['Category'].unique()
print("There are {} categories in CompanyMerged".format(len(categories)))
print(categories)


There are 13 categories in CompanyMerged
['Corporate Services' 'Media, Marketing & Sales' 'Healthcare'
 'Industrials' 'Commercial Services & Supplies' 'Consumer Discretionary'
 'Transportation & Logistics' 'Energy & Utilities' 'Financials'
 'Professional Services' 'Consumer Staples' 'Materials'
 'Information Technology']


In [52]:
# visually inspect some examples of the homepage text
# print top 3 examples by word count
for i, row in company_merged.sort_values(by='len_homepage_text', ascending=False)['homepage_text'].iloc[:3].reset_index(drop=True).items():
    print(i, row)

0                             About SJP  Properties  Capabilities  Team  Stewardship  News  Contact  SJP Project Solutions        About SJP  Properties  Capabilities  Team  Stewardship  News  Contact  SJP Project Solutions    info@sjpproperties.com  212-335-2200            Your Real Estate Partners in the New York Metro Area       Our deep market knowledge and long-term relationships are a result of our  local focus  and commitment to the New York Metro area.        We are  entrepreneurial  and responsive to the changing demands of our tenants’ and clients’ businesses.        Our fully-integrated team of seasoned and creative professionals have a shared  passion  for the entire development process.        We refine every  detail  to achieve the highest quality in our buildings and make the best use of each site.        We are  client driven  and manage the quality of experience from pre-construction through property management.        We continually explore  innovative  technologies to

In [54]:
# print bottom 10 examples by word count
for i, row in company_merged.sort_values(by='len_homepage_text', ascending=True)['homepage_text'].iloc[:10].reset_index(drop=True).items():
    print(i, row)

0                     
1               
2    
3     
4     
5   
6                        
7             
8     
9                                                                                                                 


In [58]:
company_merged.loc[company_merged['len_homepage_text']<20]

Unnamed: 0,Company_ID,CompanyName,Website,Industry,Size_Range,Category,homepage_text,h1,h2,h3,nav_link_text,meta_keywords,meta_description,len_homepage_text
17,1809,guelph medical laser skin centre,guelphlaser.com,medical practice,1 - 10,Healthcare,,,,,,"Laser Hair REmoval, CoolSculpting, Baby Belly,...",Guelph Medical Laser & Skin Centre offer Laser...,0
55,5175,new era debt solutions,neweradebtsolutions.com,financial services,1 - 10,Financials,,,,,,,,0
70,6272,ikon footwear limited,ikonfootwear.co.uk,apparel & fashion,1 - 10,Consumer Discretionary,Account Suspended This Account has b...,,,,,,,14
111,10819,"live edge media, llc",live-edge-media.com,photography,1 - 10,"Media, Marketing & Sales",,,,,,,Live Edge Media photography and videography. S...,0
133,12783,devoted photography,devotedphotos.com,photography,1 - 10,"Media, Marketing & Sales",Account Suspended This Account has b...,,,,,,,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72962,7156095,rk creative,rk-creative.com,marketing and advertising,1 - 10,"Media, Marketing & Sales",,,,,,"marketing assistance for small businesses, seo...",Your one stop shop to effective Digital Market...,0
72976,7157851,"hedge fund: colorado capital advisors, llc",coloradocapitaladvisors.com,investment management,1 - 10,Financials,Overview Contact ©2008 Colo...,,,,,"Colorado, Capital, Advisors, Colorado Capital ...","Colorado Capital Advisors, LLC is a capital ma...",7
72983,7158497,bouzies bakery,bouziesbakery.com,food production,1 - 10,Consumer Staples,,,,,,"Bouzies, bouzie, bakery, artisan bread, Spokan...","Bouzies Bakery opened on a great, cold day in ...",0
73027,7163072,prescient solutions group - primavera software...,psgincs.com,information technology and services,1 - 10,Information Technology,,,,,,,,0


Although rows where homepage_text was null have already been removed, some rows still have empty text or too few words to be useful, such as the 14-word "Account Suspended" placeholder pages above.
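
A minimal sketch of how such rows could be flagged before modeling. The 20-word cutoff mirrors the inspection above but is an assumption to tune, the placeholder list is a hypothetical starting point seeded from the rows shown, and words are counted by whitespace splitting, which may differ from hf.word_count.

```python
import pandas as pd

MIN_WORDS = 20  # assumed cutoff; tune against model performance
PLACEHOLDER_PREFIXES = ("account suspended",)  # hypothetical, extend as new cases appear

def usable_text_mask(df, text_col="homepage_text", min_words=MIN_WORDS):
    """Return a boolean Series: True where the text looks usable for modeling."""
    text = df[text_col].fillna("").astype(str).str.strip()
    n_words = text.str.split().str.len()  # whitespace-based word count
    is_placeholder = text.str.lower().apply(
        lambda s: s.startswith(PLACEHOLDER_PREFIXES)
    )
    return (n_words >= min_words) & ~is_placeholder

# Toy rows, invented for illustration
toy = pd.DataFrame({"homepage_text": [
    "",
    None,
    "Account Suspended This Account has been suspended",
    " ".join(["word"] * 25),
]})
mask = usable_text_mask(toy)
print(mask.tolist())
```

On the real data this would be used as `company_merged[usable_text_mask(company_merged)]`, ideally after the other text fields have been joined in so that fewer rows are dropped.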