# Sayari programming assignment
### Joanne Lin

This project uses Scrapy and XPath to crawl New Zealand's public registry of incorporated companies. The only module needed is Scrapy. To ensure the proper environment for these spiders, use the accompanying requirements.txt and run "pip install -r /path/to/requirements.txt" on the command line.

Note: If you run the spiders in a Jupyter Notebook, you need to restart the kernel before each run, because the Twisted reactor that Scrapy starts cannot be restarted within the same process. This is not an issue when running the spiders from the command line.
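
One way to avoid restarting the kernel is to launch each crawl in its own Python process, so that every run gets a fresh reactor. The sketch below is only an illustration under assumptions: it presumes the spider code has been saved to a standalone script, and the helper name `run_in_fresh_process` is hypothetical and not part of this project.

```python
import subprocess
import sys


def run_in_fresh_process(args):
    """Run `python <args...>` in a child process and return the CompletedProcess.

    The child process has its own interpreter state, so nothing from a
    previous run carries over into the next one.
    """
    return subprocess.run(
        [sys.executable, *args],  # same interpreter as the notebook kernel
        capture_output=True,      # collect stdout/stderr instead of printing them
        text=True,                # decode the output as str
    )
```

From a notebook cell, a call such as `run_in_fresh_process(["nz_spider.py"])` could then be repeated without a kernel restart, where `nz_spider.py` is a hypothetical file holding the spider and its `CrawlerProcess` calls. The returned object exposes `returncode`, `stdout`, and `stderr` for inspection.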

## html spider
The html spider crawls the company page and retrieves the raw HTML, storing it in a company.html file.

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess


class htmlSpider(scrapy.Spider):
    name = "html"
    
    # The 'detail?' at the end of the URL pulls up all the company information on a single page.
    start_urls = [
        'https://app.companiesoffice.govt.nz/companies/app/ui/pages/companies/6236487/detail?',
    ]

    # Download all the code and save it to the company.html file
    def parse(self, response):
        with open('company.html', 'wb') as f:
            f.write(response.body)


# Instantiate our crawler.
process = CrawlerProcess()

# Start the crawler with our spider.
process.crawl(htmlSpider)
process.start()

## NZSpider

This spider crawls the company page for information and exports it to NZCompanies.json.

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess

class NZSpider(scrapy.Spider):
    name = "NZS"
    
    # The 'detail?' at the end of the URL pulls up all the company information on a single page.
    start_urls = [
        'https://app.companiesoffice.govt.nz/companies/app/ui/pages/companies/6236487/detail?'
        ]

    def parse(self, response):
        # Save a copy of the original HTML
        with open('companypage.html', 'wb') as f:
            f.write(response.body)

        # Note: browsers insert <tbody> into tables; if the raw HTML lacks it,
        # remove 'tbody' from the paths below.

        # Loop through all the panels (pageContainers) on the page
        for panel in response.xpath('//*[@id="maincol"]/div[@class="entity"]/div[@class="pageContainer"]'):

            # Scrape data from the main panel
            if panel.xpath('div[@id="onLoadSetFocus"]'):
                # The leading './/' keeps the search inside this panel rather than the whole page
                summary = './/*[@class="readonly companySummary"]'

                # Yield a dictionary with the values
                yield {
                    'CompanyNumber': panel.xpath(summary + '/div[1]/text()').extract(),
                    'NZBusinessNumber': panel.xpath(summary + '/div[2]/text()').extract(),
                    'IncorporationDate': panel.xpath(summary + '/div[3]/text()').extract(),
                    'CompanyStatus': panel.xpath(summary + '/div[4]/text()').extract(),
                    'EntityType': panel.xpath(summary + '/div[5]/text()').extract(),
                    'ConstitutionFiled': panel.xpath(summary + '/div[6]/text()').extract(),
                    'ARFilingMonth': panel.xpath(summary + '/div[7]/table/tbody/tr/td[1]/text()').extract(),
                    'UltimateHoldingCompany': panel.xpath(summary + '/div[8]/div/text()').extract(),
                }

            elif panel.xpath('div[@id="directorsPanel"]'):
                # Loop over each director panel in the directors tab
                for director in panel.xpath('.//*[@id="director"]'):
                    # Paths are relative to the current director, not the enclosing panel
                    yield {
                        'DirectorName': director.xpath('table/tbody/tr/td[1]/div[1]/text()').extract(),
                        'Address': director.xpath('table/tbody/tr/td[1]/div[2]/text()').extract(),
                        'ApptDate': director.xpath('table/tbody/tr/td[1]/div[3]/text()').extract(),
                    }

            elif panel.xpath('div[@id="shareholdingTab"]'):
                # Loop over each shareholder panel in the shareholding tab
                for shareholder in panel.xpath('.//*[@id="allocations"]'):
                    # Paths are relative to the current allocations element
                    yield {
                        'ShareNumber': shareholder.xpath('div[1]/div[1]/text()').extract(),
                        'SharePortion': shareholder.xpath('div[1]/div[1]/span/text()[2]').extract(),
                        'Name': shareholder.xpath('div[1]/div[3]/div').extract(),
                        'Address': shareholder.xpath('div[1]/div[4]/div').extract(),
                    }

            elif panel.xpath('div[@id="addressPanel"]'):
                # Loop over addresses.
                # Assumes each address sits in a child div of its own row.
                for address in panel.xpath('.//div[@class="row"]'):
                    yield {
                        'Address': address.xpath('div').extract(),
                    }

# Tell the script how to run the crawler by passing in settings.
# Some of these settings, as indicated, ensure that we do not
#    overload the website with traffic.
process = CrawlerProcess({
    'FEED_FORMAT': 'json',         # Store data in JSON format.
    'FEED_URI': 'NZCompanies.json',       # Name of storage file.
    'LOG_ENABLED': False,          # Suppress Scrapy's log output (this also hides error tracebacks).
    'ROBOTSTXT_OBEY': True, # Scrapy is set to obey the site's robots.txt file.
    'USER_AGENT': 'SayariAnalyticsCrawler (www.sayarianalytics.com)',
    'AUTOTHROTTLE_ENABLED': True, # AutoThrottle dynamically sets the delay based on how quickly the server responds.
    'HTTPCACHE_ENABLED': True # Caches responses locally so repeated runs do not re-download the same pages.
})

# Start the spider.
process.crawl(NZSpider)
process.start()
print('Success!')
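
The `FEED_FORMAT` and `FEED_URI` settings used above are deprecated in newer Scrapy releases (2.1 and later) in favor of a single `FEEDS` setting. Which Scrapy version requirements.txt pins is not stated here, so the following is only a sketch of what the equivalent settings dictionary might look like on a recent version. The name `SETTINGS` is an illustrative choice, and the `overwrite` option assumes Scrapy 2.4 or later.

```python
# Sketch of the same crawler settings expressed with the newer FEEDS setting.
# Assumption: a Scrapy version that supports FEEDS (2.1+) and 'overwrite' (2.4+).
SETTINGS = {
    'FEEDS': {
        'NZCompanies.json': {      # Name of storage file.
            'format': 'json',      # Store data in JSON format.
            'overwrite': True,     # Replace the file on each run instead of appending to it.
        },
    },
    'ROBOTSTXT_OBEY': True,
    'USER_AGENT': 'SayariAnalyticsCrawler (www.sayarianalytics.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
}
```

The dictionary would be passed to the crawler the same way as before, that is `CrawlerProcess(SETTINGS)`. `LOG_ENABLED` is left out here on purpose so that any errors raised inside `parse` show up in the log while debugging.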

## Kiwi companies SQL database schema
Below is my rendering of what the SQL database schema would look like. There would be four tables (Company, Director, Shareholder, and Allocation) connected by unique ids.

Company
- co_id  #Unique, autogenerated company id
- company_no
- nzbn
- incorp_date
- entity_type 
- constitution_filed 
- ar_filing_mo
- ult_holding
- street
- city
- state
- country
- zip
- total_shares

Director
- director_id  #Unique, autogenerated director id
- first_name
- last_name
- middle_name
- appt_date
- shareholder_id   #Linked to shareholder table
- company_id   #Linked to company table
- street
- city
- state
- country
- zip

Shareholder
- sh_id  #Unique, autogenerated shareholder id
- company_id   #Linked to company table
- first_name
- last_name
- middle_name
- street
- city
- state
- country
- zip

Allocation
- alloc_id   #Unique, autogenerated allocation id
- shareholder_id   #Linked to shareholder table
- company_id   #Linked to company table
- shares_owned
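
As a rough illustration, the four tables could be created in SQLite as follows. This is a sketch under assumptions: the column types, the choice of SQLite, and the helper name `create_schema` are not specified in the schema above and are picked here only to make the example runnable.

```python
import sqlite3

# Column names follow the schema lists; the types are assumptions.
SCHEMA = """
CREATE TABLE company (
    co_id              INTEGER PRIMARY KEY AUTOINCREMENT,
    company_no         TEXT,
    nzbn               TEXT,
    incorp_date        TEXT,
    entity_type        TEXT,
    constitution_filed TEXT,
    ar_filing_mo       TEXT,
    ult_holding        TEXT,
    street TEXT, city TEXT, state TEXT, country TEXT, zip TEXT,
    total_shares       INTEGER
);
CREATE TABLE shareholder (
    sh_id       INTEGER PRIMARY KEY AUTOINCREMENT,
    company_id  INTEGER REFERENCES company(co_id),
    first_name TEXT, last_name TEXT, middle_name TEXT,
    street TEXT, city TEXT, state TEXT, country TEXT, zip TEXT
);
CREATE TABLE director (
    director_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    first_name TEXT, last_name TEXT, middle_name TEXT,
    appt_date      TEXT,
    shareholder_id INTEGER REFERENCES shareholder(sh_id),
    company_id     INTEGER REFERENCES company(co_id),
    street TEXT, city TEXT, state TEXT, country TEXT, zip TEXT
);
CREATE TABLE allocation (
    alloc_id       INTEGER PRIMARY KEY AUTOINCREMENT,
    shareholder_id INTEGER REFERENCES shareholder(sh_id),
    company_id     INTEGER REFERENCES company(co_id),
    shares_owned   INTEGER
);
"""


def create_schema(conn):
    """Create the four tables on an open SQLite connection."""
    conn.executescript(SCHEMA)


# Build the schema in a throwaway in-memory database and list the tables.
conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The `REFERENCES` clauses record the links noted in the lists above. SQLite only enforces them when `PRAGMA foreign_keys = ON` is set on the connection.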

## Final Notes
While the NZSpider does run successfully, the generated JSON file is empty. In the time given, I couldn't figure out why that is. I suspect there is something wrong with the XPath syntax. The htmlSpider does generate the HTML file.
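
A quick sanity check for the XPath suspicion is to confirm that every expression has matched brackets and closed quotes before it is handed to Scrapy. The helper below is a heuristic sketch with a hypothetical name, `xpath_is_balanced`. It is not an XPath parser and will not catch other kinds of mistakes, such as a misspelled class name.

```python
def xpath_is_balanced(expr):
    """Return True if quotes are closed and brackets/parentheses pair up."""
    closers = {"]": "[", ")": "("}
    stack = []      # currently open brackets, innermost last
    quote = None    # the quote character we are inside, if any
    for ch in expr:
        if quote:
            # Inside a string literal: only the matching quote ends it.
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch in "[(":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return quote is None and not stack


# Example expressions: the first is well formed, the other two are not.
for candidate in (
    'div[@id="shareholdingTab"]',
    'div[@id="onLoadSetFocus]',
    '//*[@id="maincol"]/div[@class="entity"/div',
):
    print(candidate, xpath_is_balanced(candidate))
```

Running a check like this over every string passed to `xpath()` is cheap. Temporarily leaving Scrapy's logging enabled would also surface any exception raised inside `parse`.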

I also couldn't figure out how to run the spider recursively on this website, as it is not paginated. After scraping the information for this company, how do we move on to the next one? Ideally, the spider would crawl the entire website and capture all the company data once. To update the database later, the spider would crawl the website again, update only the companies where it detects changes, and add new companies that aren't yet in the database.
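
The two ideas in this paragraph can be sketched independently of Scrapy. The code below assumes that a list of company numbers is available from some other source (how to obtain it is left open) and that other companies follow the same URL pattern as the one in `start_urls`, which is untested. It also assumes each scraped company has been flattened into one dictionary with a string `CompanyNumber`. The names `detail_url`, `fingerprint`, and `classify` are hypothetical.

```python
import hashlib
import json

# URL pattern taken from start_urls; only the company number varies.
DETAIL_URL = ("https://app.companiesoffice.govt.nz/companies/app/ui/pages/"
              "companies/{number}/detail?")


def detail_url(company_number):
    """Build the single-page detail URL for one company number."""
    return DETAIL_URL.format(number=company_number)


def fingerprint(record):
    """Stable SHA-256 digest of a record; key order does not affect it."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def classify(record, known_fingerprints):
    """Return 'new', 'changed', or 'unchanged' for a scraped record.

    known_fingerprints maps CompanyNumber -> digest stored on the last crawl.
    """
    key = record["CompanyNumber"]
    if key not in known_fingerprints:
        return "new"
    if known_fingerprints[key] == fingerprint(record):
        return "unchanged"
    return "changed"
```

With helpers like these, `start_urls` could be built as `[detail_url(n) for n in company_numbers]`, and only records classified as `new` or `changed` would need to be written to the database.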

Aside from utilizing the Twitter API, this is the first spider I built from scratch. I learned a lot! I look forward to learning more about how it can be improved.  Thanks for the opportunity!