# Cahul Chamber of Commerce — entity list parsing

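This notebook parses the provided plain-text member list into structured rows, one per entity (or one per entity and website when several sites are listed), and extracts:
- entity name (the first line of each record)
- website(s), taken only from explicit `Web page:` lines
- normalized domain (lowercased host without a leading `www.`)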

Output: `Input/Cahul/cahul_chamber_of_commerce_entities.csv`.
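
As a rough illustration of the target shape, the sketch below models one output row with the four columns the notebook writes (`name`, `website`, `domain`, `source`). The `EntityRow` class is a hypothetical helper introduced only for this example; the notebook itself builds plain dictionaries.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EntityRow:
    """Hypothetical row type mirroring the CSV columns (not used by the notebook)."""
    name: str
    website: str = ""            # empty when the record has no 'Web page:' line
    domain: str = ""             # host of `website`, lowercased, no leading 'www.'
    source: str = "manual-text"  # constant tag used for every row in this notebook


# One record with a website and one without, both taken from the input text.
with_site = EntityRow(
    name="SUD-INVEST COMPANY SRL/ SUSANU Dumitru",
    website="http://sudinvest.md",
    domain="sudinvest.md",
)
without_site = EntityRow(name="FPC ALMA SRL")

print(asdict(with_site))
print(asdict(without_site))
```

Entities without a website still get a row, with `website` and `domain` left empty, so the CSV keeps the full member list.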

In [1]:
# Raw input text (as provided)
raw_text = '''
Instituția Publică Școala Profesională nr.2 din Cahul
Areas of activity:Education, Transmission of electricity, Construction
CCI branch: Branch Cahul
Phone number: 076 773 824
Email address:
sp2.cahul@gmail.com

FPC ALMA SRL
Products / services:Paving stones, curbs and paving slabs, of natural stone.
Areas of activity:Manufacture of articles of concrete, cement and plaster
CCI branch: Branch Cahul
Phone number: 078 406 324
Email address:
alma_srl@mail.ru

SRL"Sattva-com"/ IȚCOVICI Mihail
Areas of activity:Food industry
CCI branch: Branch Cahul
Phone number: +373 78371848
Email address:
protois@mail.ru

BAS STEEL PROFILE S.R.L / VELCEV Andrei
Products / services:Tubular pipes, tubes and profiles, of cast iron., Seamless tubes, pipes and hollow profiles, of iron or steel.
Areas of activity:Manufacture of metal constructions
CCI branch: Branch Cahul
Phone number: +37378436481
Email address:
info@bs-profile.com

SUD-INVEST COMPANY SRL/ SUSANU Dumitru
Areas of activity:Industrial park / Free Economic Zone
CCI branch: Branch Cahul
Phone number: +373 (79) 46-63-11/ 79701204
Web page: sudinvest.md
Email address:
info@sudinvest.md
dascaloi@mail.ru

INTEGRAL-AUTO SA / COCEV Oleg
Products / services:Electrical machinery and apparatus.
Areas of activity:Trade in parts and accessories for motor vehicles
CCI branch: Branch Cahul
Phone number: +373 (299) 4-16-61/ 079473399
Email address:
oleg.cocev@daac-hermes.md

AGROUNIC-CAHUL APA
Products / services:Cereals
Areas of activity:Agriculture
CCI branch: Branch Cahul
Phone number: 029 929 713, 079 766 965
Email address:
larisa.gisca@mail.ru

KUZ WOOL SRL / TROFIN Alexandrina
Products / services:Raw furs.
Areas of activity:Agriculture
CCI branch: Branch Cahul
Phone number: +373 (60) 220-201
Email address:
kuz.wool2016@gmail.com

MINALIA FOWL SRL/ Minciună NICOLAE
Products / services:Roosters, hens, ducks, geese, turkeys, turkeys and guinea fowls, quails., Organic products
Areas of activity:Poultry farming
CCI branch: Branch Cahul
Phone number: 079903409/0299-61-6-40
Email address:
minalia216@gmail.com

PARCUL DE PRODUCTIE TARACLIA ,,ZEL”/ MASLINCOVA Galina
Areas of activity:Industrial park / Free Economic Zone
CCI branch: Branch Cahul
Phone number: +373 (0294)2-44-83/ 068314252
Email address:
zal.pp.taraclia@gmail.com

AMILUX-SUD SRL SC / Olesea Bujor
Products / services:Electric motors and generators.
Areas of activity:Maintenance and repair of motor vehicles
CCI branch: Branch Cahul
Phone number: +373 (29) 92-96-39
Email address:
jalusicagul@pochta.ru

EVAN-P SRL SC/ PODRUȘNEAC Evghenii
Products / services:Building bricks, building blocks and tiles, facade bricks and similar articles, of ceramics.
Areas of activity:Manufacture of articles of concrete, cement and plaster
CCI branch: Branch Cahul
Phone number: +373 (299) 4-29-88/ 069348289
Email address:
podrusneaca@mail.ru

COMBINATUL DE VINURI TARACLIA SA/ FRANGU Gheorghe
Products / services:Wines, cognacs
Areas of activity:Manufacture of beverages
CCI branch: Branch Cahul
Phone number: +373 (294) 2-52-53/069146964
Web page: www.taraclia-winery.md
Email address:
taraclianwinery@gmail.com
tvk74@mail.ru

RUBIN-216 SA/ BOJENCO Ana
Products / services:Constructions and parts of buildings (for example, bridges and bridge elements, sluice gates, towers, pillars, pillars, columns, trusses, roofs, doors and windows and their frames, sills and sills, shutters, railings) of cast iron, iron or steel. , Articles of cement, of concrete or of artificial stone, whether or not reinforced.
Areas of activity:Construction
CCI branch: Branch Cahul
Phone number: 0299-3-41-22
Email address:
rubin-216@yandex.ru

AQUA-PRUT SA / Eduard Ialanji
Products / services:Hydraulic turbines, hydraulic wheels and regulators therefor.
CCI branch: Branch Cahul
Phone number: +373 (29) 9-3-70-14/ 079616334
Email address:
aqua.prut@mail.ru


LIFESTAR-COM SRL/ TABUNCIC Gheorghe
Products / services:Waters, including mineral waters and aerated waters, containing added sugar or other sweetening matter or flavored or other non-alcoholic beverages., Furniture
Areas of activity:Woodworking industry, Manufacture of beverages, Construction
CCI branch: Branch Cahul
Phone number: 0299 34047/ 079412869
Email address:
lifestar_com@yahoo.com

CAHULPAN SA/ V Culidobre-dir., V. Șevcenco- vicedir.
Products / services:Bakery and confectionery products, Organic products
Areas of activity:Manufacture of bakery products and flour products
CCI branch: Branch Cahul
Phone number: +373 (29) 9-41418/41837/079-705-382/079564944
Email address:
market@cahulpan.com
sefcontcahulpan@gmail.com

IMPERIAL VIN GROUP SRL/ CASIAN Vasilii
Products / services:Wines, cognacs, Organic products
Areas of activity:Manufacture of beverages, Growing grapes
CCI branch: Branch Cahul
Phone number: +373 (273) 7-44-02/079404349-Vasilii C./ 079406696 -Vitalii
Web page: www.imperial-vin.com
Email address:
zavod@imperial-vin.md
vcasian@imperial-vin.md / office@imperial-vin.md

MALDUR-SV SRL/ MALDUR Mihail
Products / services:Padlocks, locks and latches (key, cipher or electric), of base metal, locks and lock-frames fitted with locks, of base metal, keys for these articles, of base metal.
Areas of activity:Construction, Woodworking industry
CCI branch: Branch Cahul
Phone number: 0299 25976/ 079511404

SANATORIUL "NUFĂRUL ALB" SRL/ SCUTELNIC Alexandr
Products / services:Waters, including mineral waters and aerated waters, containing added sugar or other sweetening matter or flavored or other non-alcoholic beverages.
Areas of activity:Manufacture of beverages
CCI branch: Branch Cahul
Phone number: 0299 32035, 2-34-40, 68400099,68400055
Web page: www.nufarul.md
Email address:
info@nufarul.md

CAHINDMONTAJ SA/ DELIU Tatiana - admin
Products / services:Seamless tubes, pipes and hollow profiles, of iron or steel.
Areas of activity:Manufacture of metal constructions
CCI branch: Branch Cahul
Phone number: 0299 40380, 44569, 068656985
Email address:
sacahindemontaj@mail.ru

NAHMANELLI SRL
Products / services:Crockery, other household articles or household articles and hygiene or toilet articles.
CCI branch: Branch Cahul
Phone number: 0299 23994
Web page: marinela@matrix.net.md
Email address:
nahmanelli@mail.ru

VIERUL-VIN SRL/ Dimitrie Sîrf
Products / services:Wines, cognacs
CCI branch: Branch Cahul
Phone number: 0299 79245, 0299 79368, 079600041
Email address:
vierul.vin@gmail.com

BABAIAN VICTOR ÎI
Products / services:Cereals
Areas of activity:Agriculture
CCI branch: Branch Cahul
Phone number: 0299-40726/ 0299-40679, 079628607
Email address:
victorbabaian09@gmail.com / victor.babaian@mail.ru

MODERNUS SA/ GANGAN Liubovi Anton
CCI branch: Branch Cahul
Phone number: 069127650/069107389/0299-48-48-3
Web page: www.modernus.md
Email address:
modernus@mail.ru

C.N.PERCEMBLI GŢ / PERCEMBLI Constantin
Products / services:Cereals
Areas of activity:Agriculture
CCI branch: Branch Cahul
Phone number: +373 (29) 95-70-02/069077096/079549445
Email address:
constantinpercembli@gmail.com

SILVA-SUD ÎS/ BOGHEAN Gheorghe
Products / services:Densified wood, in blocks, planks, slats or profiles.
Areas of activity:Agriculture
CCI branch: Branch Cahul
Phone number: 0299 4-14-44, 060907766,
Email address:
silvia-sud@moldsilva.gov.md

TRIFEŞTI S.A./ ANTONIU Vitalii
Products / services:Vermouth and other wine of fresh grapes, flavored with plants or aromatic substances., Wines, cognacs
Areas of activity:Manufacture of beverages
CCI branch: Branch Cahul
Phone number: 079128748, 069121111
Email address:
gaidarji.e@gmail.com

PRUT SA ÎM/ COMUR Nicolai
Products / services:Seeds, fruit and spores for sowing.
CCI branch: Branch Cahul
Phone number: 067242038
Email address:
comur.nicolai@mail.ru

GEBHARDT-CONSTRUCT SC, SRL/ SUSAN Dumitru
Products / services:Double-glazed windows and doors., Articles of cement, of concrete or of artificial stone, whether or not reinforced., Building bricks, building blocks and tiles, facade bricks and similar articles, of ceramics., Plastic floor coverings, self-adhesive or not, in rolls or in the form of floor tiles or slabs.
Areas of activity:Construction
CCI branch: Branch Cahul
Phone number: +373 (299) 3-33-44/ 079466311
Web page: www.gebhardt-construct.md
Email address:
susanu_dumitru@yahoo.com
susanu_dumitry@yahoo.com /gesusbau@mtc-ch.md


LABORATORIO TESSILE MOL SRL/ COȚOFAN Constantin
Products / services:Clothing and clothing accessories
Areas of activity:Manufacture of garments
CCI branch: Branch Cahul
Phone number: 060568568-dir./ 068003335-cadre/ 2-20-14- anticamera Nina
Web page: www.ltm.md
Email address:
costi.cotofan@tanex.ro / cadre.ltm@list.ru

VINIA TRAIAN SA/ CRISTEV Simion
Products / services:Wines, cognacs, Vermouth and other wine of fresh grapes, flavored with plants or aromatic substances.
Areas of activity:Manufacture of beverages
CCI branch: Branch Cahul
Phone number: 0299 57-4-73/ 069108429
Web page: www.viniatraian.md
Email address:
vintraian@mtc-vl.md
viniatraian@mail.ru
viniatraianinfo@gmail.com

GLIA C.A.P. / PUTREGAI Consatntin
Products / services:Cereals
Areas of activity:Cultivation of cereals (excluding rice), leguminous plants and oilseeds
CCI branch: Branch Cahul
Phone number: +373 (273) 7-42-30/ 069099112
Email address:
glia.coop@gmail.com

PARCUL DE AUTOBUZE ŞI TAXIMETRI NR.8 SA
Areas of activity:Storage and ancillary activities for Transport
CCI branch: Branch Cahul
Phone number: 0299 48506, 079405491
Email address:
serviciiparcul8@mail.ru

CEREALE-CAHUL SA/POPUSOI SERGIU ALEXANDRU
Products / services:Cereals
Areas of activity:Storage and ancillary activities for Transport
CCI branch: Branch Cahul
Phone number: +373 69262068
Email address:
p.radulov.grainterminal@rusagro.md

RAPIRA SRL/ DULGAN Cristina Evgheni
CCI branch: Branch Cahul
Phone number: 069042990
Email address:
c.mocanu@inbox.ru

CETATEA NOUĂ SRL/ CELARSCHI Vladimir
Products / services:Cereals
Areas of activity:Agriculture
CCI branch: Branch Cahul
Phone number: +373 (29) 45-42-19/ 069031882
Email address:
cetateanoua@mail.ru

TRICON SA/ MIHAILIUC Angela
Products / services:Knitted or crocheted clothing and clothing accessories for infants.
Areas of activity:Manufacture of garments
CCI branch: Branch Cahul
Phone number: 0299 20746/2-21-18/2-21-71/60301535/60301538/
Web page: www.tricon.md
Email address:
tricon.cahul@gmail.com

CHICKEN-CH SRL/ CIBOTARU Simion
Products / services:Articles of cement, of concrete or of artificial stone, whether or not reinforced.
Areas of activity:Trade of building materials and sanitary equipment
CCI branch: Branch Cahul
Phone number: +373 (29) 94-17-91/ 069181285

AGRO-ALBOTA SRL
Products / services:Cereals
Areas of activity:Agriculture
CCI branch: Branch Cahul
Phone number: +373 (29) 45-72-05
Email address:
derivolkov-albota@mail.ru
'''
print("Loaded text length:", len(raw_text))

Loaded text length: 10752
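
Records in `raw_text` are separated by blank lines, and in a few places by two blank lines in a row (for example before LIFESTAR-COM SRL). The sketch below shows, on a shortened sample, that splitting on `\n\s*\n` treats any run of blank lines as a single separator. The name `split_records` is an assumption made for this example; it follows the same idea as the `split_blocks` helper in the parser cell.

```python
import re
from typing import List


def split_records(text: str) -> List[str]:
    """Split text into records separated by one or more blank lines."""
    # '\n\s*\n' consumes a whole run of blank or whitespace-only lines at once,
    # so consecutive blank lines do not produce empty records.
    return [p.strip() for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]


# Shortened sample with a double blank line between the two records.
sample = """
FPC ALMA SRL
CCI branch: Branch Cahul


NAHMANELLI SRL
CCI branch: Branch Cahul
"""

records = split_records(sample)
names = [r.splitlines()[0] for r in records]
print(len(records), names)
```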


In [4]:
# Parser and CSV writer
import re
import os
import pandas as pd
from urllib.parse import urlparse

output_path = os.path.join(os.path.abspath('.'), 'cahul_chamber_of_commerce_entities.csv')

# Match URL-like tokens, but we'll only consider those appearing on 'Web page:' lines
WEBSITE_RE = re.compile(r"(https?://[^\s,;]+|www\.[^\s,;]+|[A-Za-z0-9.-]+\.[A-Za-z]{2,}(?:/[^\s,;]*)?)", re.IGNORECASE)
PUNCT_STRIP = ",;.)]›»”’"  # trailing punctuation to strip


def canonicalize_url(u: str) -> str:
    u = u.strip().strip(PUNCT_STRIP)
    if not u or '@' in u:
        return ''
    # normalize scheme (prefer https if 'www.' provided, else http as a conservative default)
    if not u.lower().startswith(('http://', 'https://')):
        u = ('https://' + u) if u.lower().startswith('www.') else ('http://' + u)
    return u


def extract_domain(u: str) -> str:
    try:
        host = urlparse(u).netloc.lower()
        return host[4:] if host.startswith('www.') else host
    except Exception:
        return ''


def split_blocks(text: str):
    return [p.strip() for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]


def parse_blocks(text: str):
    blocks = split_blocks(text)
    rows = []
    for b in blocks:
        lines = [ln.strip() for ln in b.splitlines() if ln.strip()]
        if not lines:
            continue
        name = lines[0]

        # STRICT RULE: only consider explicit 'Web page:' lines (case-insensitive) for websites
        websites = []
        for ln in lines:
            if ln.lower().startswith('web page'):
                # allow variants like 'Web page: ...' or 'Web page - ...'
                raw = ln.split(':', 1)[1].strip() if ':' in ln else ln.split('-', 1)[-1].strip()
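                for m in WEBSITE_RE.findall(raw):
                    # NOTE: this guard only fires for http(s):// or www. matches that contain '@'.
                    # A bare email on this line (e.g. 'marinela@matrix.net.md' for NAHMANELLI SRL)
                    # is matched from after the '@' as 'matrix.net.md' and is not filtered out.
                    if '@' in m:
                        continue
                    websites.append(m)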

        # Clean + dedupe URLs from 'Web page' lines only
        cleaned = []
        seen = set()
        for w in websites:
            url = canonicalize_url(w)
            dom = extract_domain(url)
            if not url or not dom:
                continue
            if dom in seen:
                continue
            seen.add(dom)
            cleaned.append((url, dom))

        if not cleaned:
            rows.append({'name': name, 'website': '', 'domain': '', 'source': 'manual-text'})
        else:
            for url, dom in cleaned:
                rows.append({'name': name, 'website': url, 'domain': dom, 'source': 'manual-text'})
    return rows

rows = parse_blocks(raw_text)
print(f"Parsed entities: {len(rows)} rows")

# Global dedupe by (name, domain)
seen = set()
dedup = []
for r in rows:
    key = (r['name'], r['domain'])
    if key in seen:
        continue
    seen.add(key)
    dedup.append(r)

print(f"After dedupe: {len(dedup)} rows")

pd.DataFrame(dedup).to_csv(output_path, index=False)
print(f"Wrote {output_path}")

Parsed entities: 40 rows
After dedupe: 40 rows
Wrote /home/thiesen/Documents/AI-Innoscence_Ecosystem/Input/Cahul/cahul_chamber_of_commerce_entities.csv
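
One caveat with the run above: the NAHMANELLI SRL record has an email address on its `Web page:` line. The bare-domain branch of `WEBSITE_RE` cannot match `@`, so it picks up `matrix.net.md` from the part after the `@`, and the `'@' in m` guard does not catch it. The sketch below is one possible stricter variant, not the notebook's method: it splits the value into tokens first and drops any token containing `@`. The names `URL_LIKE` and `website_tokens` are hypothetical.

```python
import re

# Same pattern as WEBSITE_RE in the parser cell.
URL_LIKE = re.compile(
    r"(https?://[^\s,;]+|www\.[^\s,;]+|[A-Za-z0-9.-]+\.[A-Za-z]{2,}(?:/[^\s,;]*)?)",
    re.IGNORECASE,
)


def website_tokens(value: str):
    """Return URL-like tokens from a 'Web page:' value, skipping email-like tokens."""
    found = []
    for token in re.split(r"[\s,;]+", value.strip()):
        # Reject the whole token if it looks like an email address.
        if not token or "@" in token:
            continue
        # Require the entire token (minus trailing punctuation) to be URL-like.
        m = URL_LIKE.fullmatch(token.rstrip(".,;)"))
        if m:
            found.append(m.group(1))
    return found


print(URL_LIKE.findall("marinela@matrix.net.md"))  # current behaviour on the email
print(website_tokens("marinela@matrix.net.md"))    # stricter variant
print(website_tokens("www.nufarul.md"))
```

If a filter like this were swapped into the parser, NAHMANELLI SRL would presumably end up with empty `website` and `domain` fields, and the domain count in the preview would drop by one.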


In [5]:
# Preview a sample
import os
if os.path.exists(output_path):
    df = pd.read_csv(output_path)
    print(df.shape)
    display(df.head(15))
    print("Unique domains:", df['domain'].nunique())
else:
    print("CSV not found.")

(40, 4)


,name,website,domain,source
0,Instituția Publică Școala Profesională nr.2 di...,,,manual-text
1,FPC ALMA SRL,,,manual-text
2,"SRL""Sattva-com""/ IȚCOVICI Mihail",,,manual-text
3,BAS STEEL PROFILE S.R.L / VELCEV Andrei,,,manual-text
4,SUD-INVEST COMPANY SRL/ SUSANU Dumitru,http://sudinvest.md,sudinvest.md,manual-text
5,INTEGRAL-AUTO SA / COCEV Oleg,,,manual-text
6,AGROUNIC-CAHUL APA,,,manual-text
7,KUZ WOOL SRL / TROFIN Alexandrina,,,manual-text
8,MINALIA FOWL SRL/ Minciună NICOLAE,,,manual-text
9,"PARCUL DE PRODUCTIE TARACLIA ,,ZEL”/ MASLINCOV...",,,manual-text


Unique domains: 10
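
The preview shows 40 rows but only 10 unique domains because most entities have no `Web page:` line, and `nunique()` skips the resulting empty cells by default. A minimal sketch of the same check with the standard library is shown below; it runs on a two-row in-memory sample copied from the preview rather than on the real CSV, so the numbers it prints describe the sample only.

```python
import csv
import io

# In-memory stand-in for the CSV; the two data rows are copied from the preview.
sample_csv = io.StringIO(
    "name,website,domain,source\n"
    "FPC ALMA SRL,,,manual-text\n"
    "SUD-INVEST COMPANY SRL/ SUSANU Dumitru,http://sudinvest.md,sudinvest.md,manual-text\n"
)

rows = list(csv.DictReader(sample_csv))

# Rows with a non-empty domain are the entities that listed a website.
with_site = [r for r in rows if r["domain"]]
domains = sorted({r["domain"] for r in with_site})

print("rows:", len(rows))
print("rows with a website:", len(with_site))
print("unique domains:", domains)
```

To run it against the real output, the `io.StringIO` sample could be replaced with `open(output_path, newline='', encoding='utf-8')`.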
