# Which companies opened up yesterday? (CEIDG - data mining)
The code downloads information on Polish companies founded and registered yesterday in the Central Registration and Information on Business (CEIDG). The files are split by voivodship (16 regions); each one is a ZIP archive containing XML files. The archives are unpacked and the XML is parsed to extract some basic information on one sample company per voivodship. ALL DATA IS REAL HERE! :)
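
Before touching the live service, the unpack-and-parse step can be tried offline. The sketch below builds a tiny ZIP in memory and reads the first entry from each XML member. The element names `InformacjaOWpisie`, `DanePodstawowe`, `Imie`, `Nazwisko` and `Firma` are the ones used later in this notebook; the root tag `Root`, the member name `sample.xml`, the helper names and all values are made up for illustration.

```python
import io
import zipfile
from xml.etree import ElementTree

# Synthetic XML: the root tag "Root" and every value are hypothetical.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<Root>
  <InformacjaOWpisie>
    <DanePodstawowe>
      <Imie>JAN</Imie>
      <Nazwisko>KOWALSKI</Nazwisko>
      <Firma>Example JAN KOWALSKI</Firma>
    </DanePodstawowe>
  </InformacjaOWpisie>
</Root>
"""


def make_sample_zip():
    """Build an in-memory ZIP holding one XML member (stands in for a download)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("sample.xml", SAMPLE_XML.encode("utf-8"))
    buffer.seek(0)
    return buffer


def first_entries(zip_file):
    """Return (member, Imie, Nazwisko, Firma) for the first entry of each XML member."""
    rows = []
    with zipfile.ZipFile(zip_file) as archive:
        for member in archive.namelist():
            with archive.open(member) as handle:
                root = ElementTree.fromstring(handle.read())
            entries = root.findall("InformacjaOWpisie")
            if not entries:
                continue  # nothing registered in this member
            basic = entries[0].find("DanePodstawowe")
            rows.append((member,
                         basic.findtext("Imie"),
                         basic.findtext("Nazwisko"),
                         basic.findtext("Firma")))
    return rows


sample_rows = first_entries(make_sample_zip())
print(sample_rows)
```

`zipfile.ZipFile` accepts a path or any file-like object, so the same function should also work on a file saved to disk.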


In [16]:
# needed modules
from urllib.request import urlopen, Request
from urllib.parse import urlencode as ENCODE
from datetime import date, timedelta
import zipfile as zip
from xml.etree import ElementTree as XML
from random import choice

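# checking the system encoding, just to be safe
import sys
# sys.stdout.encoding can be None (e.g. when redirected) and is often lowercase ('utf-8')
en = sys.stdout.encoding or 'utf-8'
if en.lower().replace('-', '') != 'utf8':
    print('*** Warning! Your system encoding is {}, while the database is in UTF-8. Output may be illegible. ***\n'.format(en))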
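
The warning above only reports a mismatch. A minimal sketch of what happens to Polish characters on a console that cannot represent them: encoding with `errors='replace'` substitutes a `?` instead of raising `UnicodeEncodeError`. The helper name `printable` is hypothetical and not part of the notebook.

```python
def printable(text, encoding):
    """Round-trip text through the given encoding, replacing unsupported characters."""
    return text.encode(encoding, errors="replace").decode(encoding)


# On an ASCII-only console the letter 'ż' cannot be represented and is replaced.
print(printable("Giżycko", "ascii"))
# With UTF-8 the text passes through unchanged.
print(printable("Giżycko", "utf-8"))
```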

In [17]:
# voivodships are the top-level administrative regions of Poland, 16 in total - remove the triple quotes below to choose one at random
"""
voivodships = choice(
    ('mazowieckie', 'pomorskie', 'warmińsko-mazurskie', 'lubuskie',
     'wielkopolskie', 'kujawsko-pomorskie', 'łódzkie', 'opolskie',
     'podlaskie', 'dolnośląskie', 'zachodniopomorskie', 'śląskie',
     'świętokrzyskie', 'lubelskie', 'małopolskie', 'podkarpackie')
)
"""
voivodships = (
     'mazowieckie', 'pomorskie', 'warmińsko-mazurskie', 'lubuskie',
     'wielkopolskie', 'kujawsko-pomorskie', 'łódzkie', 'opolskie',
     'podlaskie', 'dolnośląskie', 'zachodniopomorskie', 'śląskie',
     'świętokrzyskie', 'lubelskie', 'małopolskie', 'podkarpackie')

# daily database
fn = 'dzien_ExtendedAddress_'
# 'miesiac_ExtendedAddress_' is for monthly
# 'calosc_ExtendedAddress_' for all records

# the databases are updated daily (D-1); the most recent files are placed in a directory named after yesterday's date
yesterday = date.today() - timedelta(1)
dt = yesterday.strftime('%Y-%m-%d')

# user_id is permanent for each registered user
user_id = '80024177-3683-4b6c-bdc5-c51344485a7c'

# this part of the link is, fortunately, constant
link = 'http://datastore.ceidg.gov.pl/ceidg.datastore/Downloadhandler.ashx?'
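
How the pieces above combine into a download URL can be checked without any network access. This is a sketch under assumptions: the function name `build_url` and the placeholder `YOUR-USER-ID` are hypothetical, while the base link, the file prefix and the `file`/`u` query parameters are the ones used in this notebook.

```python
from datetime import date
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_LINK = "http://datastore.ceidg.gov.pl/ceidg.datastore/Downloadhandler.ashx?"


def build_url(day, voivodship, user_id, prefix="dzien_ExtendedAddress_"):
    """Compose the download URL for one voivodship and one day."""
    file_path = "/{}/{}{}.zip".format(day.strftime("%Y-%m-%d"), prefix, voivodship)
    # urlencode percent-encodes the slashes and the Polish letters (as UTF-8)
    return BASE_LINK + urlencode({"file": file_path, "u": user_id})


example_url = build_url(date(2018, 3, 9), "łódzkie", "YOUR-USER-ID")
print(example_url)

# Decoding the query string recovers the original values.
query = parse_qs(urlsplit(example_url).query)
print(query["file"][0])
```

Letting `urlencode` do the escaping avoids hand-quoting voivodship names such as `łódzkie` or `śląskie`.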

In [18]:

# cycling through all selected voivodships
for voi in voivodships:
    fname = fn + voi + '.zip'
    full_name = '/' + dt + '/' + fname
    url_path = link + ENCODE({'file': full_name, 'u': user_id})
    headers = {'User-Agent': "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"}

    # downloading and saving the .zip; both the response and the file are closed by the with block
    with urlopen(Request(url_path, headers=headers), None, 30) as data, open(fname, 'wb') as f:
        f.write(data.read())

    # reading from the file (just one for SL reference)
    with zip.ZipFile(fname) as z:
        for member in z.namelist():
            with z.open(member) as x:
                tree = XML.fromstring(x.read())
                res = tree.findall('InformacjaOWpisie')
                
                # extracting just the first entry
                example = res[0].find('DanePodstawowe')
                name = example.find('Imie').text
                surname = example.find('Nazwisko').text
                NIP = example.find('NIP').text
                REGON = example.find('REGON').text
                firma = example.find('Firma').text
                example = res[0].find('DaneAdresowe').find('AdresGlownegoMiejscaWykonywaniaDzialalnosci')
                city = example.find('Miejscowosc').text
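                # street and building are optional: find() returns None when the element is missing
                try:
                    street = example.find('Ulica').text
                except AttributeError:
                    street = ''
                try:
                    bldg = example.find('Budynek').text
                except AttributeError:
                    bldg = ''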
                
                # print out the content
                print((u'An example company opened in {} voivodship on {}\n{}\n'.format(voi, dt, "="*40)).encode(en, errors='replace').decode(en))
                print((u'Name: {}\nSurname: {}\nCompany: {}\nNIP: {}\nREGON: {}\nCity: {}\nStreet: {} {}\n\n\n').format(name, surname, firma, NIP, REGON, city, street, bldg).encode(en, errors='replace').decode(en))
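
The loop above prints only the first entry and handles missing fields with `try`/`except`. One possible alternative, sketched here on synthetic XML, is to walk every entry and use `findtext`, falling back to an empty string when an element is missing or empty. The root tag `Root`, the helper names `field` and `summarize`, and all values are hypothetical; the remaining element names are the ones used in this notebook.

```python
from xml.etree import ElementTree

# Synthetic data: the second entry has no street and an empty REGON.
SAMPLE = """<Root>
  <InformacjaOWpisie>
    <DanePodstawowe><Firma>Example One</Firma><REGON>000000000</REGON></DanePodstawowe>
    <DaneAdresowe><AdresGlownegoMiejscaWykonywaniaDzialalnosci>
      <Miejscowosc>Sampletown</Miejscowosc><Ulica>ul. Sample</Ulica><Budynek>1</Budynek>
    </AdresGlownegoMiejscaWykonywaniaDzialalnosci></DaneAdresowe>
  </InformacjaOWpisie>
  <InformacjaOWpisie>
    <DanePodstawowe><Firma>Example Two</Firma><REGON/></DanePodstawowe>
    <DaneAdresowe><AdresGlownegoMiejscaWykonywaniaDzialalnosci>
      <Miejscowosc>Samplevillage</Miejscowosc><Budynek>18</Budynek>
    </AdresGlownegoMiejscaWykonywaniaDzialalnosci></DaneAdresowe>
  </InformacjaOWpisie>
</Root>"""


def field(parent, tag):
    """Text of a child element, or '' when the parent or child is missing or empty."""
    if parent is None:
        return ""
    return parent.findtext(tag) or ""


def summarize(root):
    """Collect company, REGON, city, street and building for every entry."""
    rows = []
    for entry in root.findall("InformacjaOWpisie"):
        basic = entry.find("DanePodstawowe")
        address = entry.find("DaneAdresowe/AdresGlownegoMiejscaWykonywaniaDzialalnosci")
        rows.append({
            "company": field(basic, "Firma"),
            "regon": field(basic, "REGON"),
            "city": field(address, "Miejscowosc"),
            "street": field(address, "Ulica"),
            "building": field(address, "Budynek"),
        })
    return rows


rows = summarize(ElementTree.fromstring(SAMPLE))
for row in rows:
    print(row)
```

With this approach a missing REGON would print as an empty string rather than `None`.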

An example company opened in mazowieckie voivodship on 2018-03-09

Name: MARCIN
Surname: LULEK
Company: Gold head MARCIN LULEK
NIP: 5361818339
REGON: 369673820
City: Legionowo
Street: ul. Jagiellońska 20A



An example company opened in pomorskie voivodship on 2018-03-09

Name: ANDRZEJ
Surname: KOSZMIDER
Company: gdańskportal Andrzej Koszmider
NIP: 5931861787
REGON: None
City: Tczew
Street: ul. Jana Sobieskiego 18



An example company opened in warmińsko-mazurskie voivodship on 2018-03-09

Name: DANUTA
Surname: BAJKOWSKA
Company: DANUTA BAJKOWSKA HANDEL OBWOŹNY
NIP: 8451048088
REGON: 510876559
City: Giżycko
Street: ul. Plac Targowy -



An example company opened in lubuskie voivodship on 2018-03-09

Name: Grzegorz
Surname: Szatanek
Company: Grzegorz Szatanek
NIP: 6772112819
REGON: 369680670
City: Gościeszowice
Street:  18



An example company opened in wielkopolskie voivodship on 2018-03-09

Name: Anna
Surname: Jamszoł
Company: Anna Jamszoł
NIP: 8271933156
REGON: 100212489
City: Radl