<div style="margin: 0px auto; width: 600px;">
<img src="http://www.canadastop100.com/national/images/ct2017_english.png" style="float: left; height: 100px; margin-top: 10px;"/>
<img src="https://cdn4.iconfinder.com/data/icons/wirecons-free-vector-icons/32/add-128.png" style="float: left; margin: 25px; height: 50px;"/>
<img src="https://freepythontips.files.wordpress.com/2013/07/python_logo_notext.png" style="float: left; height: 100px;"/>
<img src="https://cdn4.iconfinder.com/data/icons/wirecons-free-vector-icons/32/add-128.png" style="float: left; margin: 25px; height: 50px;"/>
<img src="https://vuejs.org/images/logo.png" style="float: left; height: 100px; margin-top: 10px;"/>
</div>

## Objectives
* Generate data by crawling [Canada's Top 100](http://www.canadastop100.com).
* Use of Scrapy
* Use of Pandas
* Integrate VueJS in a notebook
* Create a simple table with filter functionality

## Scraping data
### Approach
To scrape the data, we will use the Scrapy library. For this tutorial, that is faster than writing our own scraper. The crawl has four steps:

1. Load the main page
2. Find all company links
3. For each company link, open the corresponding page
4. For each company page, find all ratings

### Markup for company links
```html
<div id="winners" class="page-section">
...
  <li><span><a target="_blank" href="http://content.eluta.ca/top-employer-3m-canada">3M Canada Company</a></span></li>
...
</div>
```
This corresponds to the following line in the `CompanySpider` class:
```python
for href in response.css('div#winners a::attr(href)').extract():
```
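
To sanity-check what that selector matches without starting a crawl, the sketch below extracts the same links from the markup above using only the standard library's `html.parser`. It is a stand-in for Scrapy's CSS selector, not how the spider works internally. The `WinnerLinkParser` class and the `SAMPLE_HTML` string are made up for this illustration.

```python
from html.parser import HTMLParser

# Trimmed copy of the markup shown above (hypothetical sample, one company only).
SAMPLE_HTML = """
<div id="winners" class="page-section">
  <li><span><a target="_blank" href="http://content.eluta.ca/top-employer-3m-canada">3M Canada Company</a></span></li>
</div>
"""


class WinnerLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <div id="winners">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # greater than 0 while inside the winners div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Start counting at the winners div, keep counting nested divs.
            if self.div_depth or attrs.get("id") == "winners":
                self.div_depth += 1
        elif tag == "a" and self.div_depth and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


parser = WinnerLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.links)
```

Links outside the `winners` div are ignored, which mirrors the `div#winners a` part of the selector.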

### Markup for ratings

```html
<h3 class="rating-row">
    <span class="nocolor">Physical Workplace</span>
    <span class="rating">
        <span class="score" title="Great-West Life Assurance Company, The's physical workplace is rated as exceptional. ">A+</span>
    </span>
</h3>
```
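
The `CompanySpider` code further down reads two values from each such row: the category label in `span.nocolor` and the grade in `span.score`. As a rough offline illustration, the sketch below pulls both out of the snippet above with `xml.etree.ElementTree`. This only works because the snippet happens to be well-formed XML. Real pages generally are not, which is one reason to rely on Scrapy's selectors instead. The `ROW` string and the `parse_rating_row` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Copy of the rating row shown above.
ROW = """
<h3 class="rating-row">
    <span class="nocolor">Physical Workplace</span>
    <span class="rating">
        <span class="score" title="Great-West Life Assurance Company, The's physical workplace is rated as exceptional. ">A+</span>
    </span>
</h3>
"""


def parse_rating_row(markup):
    """Return the category title and grade of one rating row."""
    row = ET.fromstring(markup.strip())
    title = row.find("./span[@class='nocolor']")
    score = row.find("./span[@class='rating']/span[@class='score']")
    return {"title": title.text, "value": score.text}


rating = parse_rating_row(ROW)
print(rating)
```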

In [1]:
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

class CompanySpider(scrapy.Spider):
    name = "companies"
    start_urls = [
        "http://www.canadastop100.com/national/"
    ]
    custom_settings = {
        'LOG_LEVEL': logging.CRITICAL,
        'FEED_FORMAT':'json',               
        'FEED_URI': 'canadastop100.json' 
    }
    
    def parse(self, response):
        for href in response.css('div#winners a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_company)
            
    def parse_company(self, response):
        name = response.css('div.side-panel-wrap div.widget h4::text').extract_first()
        for rating in response.css('h3.rating-row')[1:]:
            yield {
                'name': name,
                'title': rating.css('span.nocolor::text').extract_first(),
                'value': rating.css('span.rating span.score::text').extract_first(),
            }

In [2]:
# Remove the output of any earlier run; -f avoids an error if the file does not exist yet
rm -f canadastop100.json

In [3]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(CompanySpider)
process.start()

2020-04-19 01:07:38 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: scrapybot)
2020-04-19 01:07:38 [scrapy.utils.log] INFO: Versions: lxml 4.4.2.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 21:52:21) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.19.76-linuxkit-x86_64-with-debian-buster-sid
2020-04-19 01:07:38 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'canadastop100.json', 'LOG_LEVEL': 50, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


## Preparing data

In [4]:
import pandas as pd

In [5]:
df = pd.read_json('canadastop100.json')
df.head()

|   | name | title | value |
|---|------|-------|-------|
| 0 | 3M Canada Company | Physical Workplace | A+ |
| 1 | 3M Canada Company | Work Atmosphere & Communications | A+ |
| 2 | 3M Canada Company | Financial Benefits & Compensation | A |
| 3 | 3M Canada Company | Health & Family-Friendly Benefits | A |
| 4 | 3M Canada Company | Vacation & Personal Time-Off | B |


In [6]:
len(df['name'].unique())

53

In [7]:
df = df[df['title'].notnull()]

In [8]:
df['value'].unique()

array(['A+', 'A', 'B', 'B+', 'B-', 'A-', 'C+'], dtype=object)

In [9]:
mapping = {'A+': 10,
           'A': 9,
           'A-': 8,
           'B+': 7,
           'B': 6,
           'B-': 5,
           'C+': 4}

In [10]:
df['value'] = df['value'].map(mapping)
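
`Series.map` with a dictionary looks up every value and returns `NaN` for anything missing from the dictionary, so a grade that is absent from `mapping` (a hypothetical `C`, say) would be skipped without warning when the scores are summed later. A minimal sketch on made-up grades, with a check that surfaces unmapped values:

```python
import pandas as pd

mapping = {'A+': 10, 'A': 9, 'A-': 8, 'B+': 7, 'B': 6, 'B-': 5, 'C+': 4}

# Made-up sample; 'C' is deliberately not in the mapping.
grades = pd.Series(['A+', 'B-', 'C'])
scores = grades.map(mapping)

# Grades that produced NaN, i.e. that the mapping does not cover.
unmapped = grades[scores.isna()].unique().tolist()
print(scores.tolist())
print(unmapped)
```

If `unmapped` is not empty, the dictionary needs another entry before the totals can be trusted.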

In [11]:
df = df.pivot(index='name', columns='title', values='value')

In [12]:
df['Total Score'] = df.sum(axis=1)
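
To make the reshaping concrete, here is a small sketch on made-up data (the companies `Acme` and `Globex` are hypothetical). `pivot` turns the long table, with one row per company and category, into a wide one with one row per company, and `sum(axis=1)` then adds across the category columns. Note that `pivot` raises a `ValueError` when the same company and category pair occurs more than once.

```python
import pandas as pd

# Made-up long-format data: one row per (company, category) pair.
long_df = pd.DataFrame({
    'name':  ['Acme', 'Acme', 'Globex', 'Globex'],
    'title': ['Physical Workplace', 'Community Involvement'] * 2,
    'value': [10, 7, 9, 9],
})

# One row per company, one column per category.
wide_df = long_df.pivot(index='name', columns='title', values='value')

# Row-wise sum over all category columns.
wide_df['Total Score'] = wide_df.sum(axis=1)
print(wide_df)
```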

In [13]:
df.head()

| name | Community Involvement | Employee Engagement & Performance | Financial Benefits & Compensation | Health & Family-Friendly Benefits | Physical Workplace | Training & Skills Development | Vacation & Personal Time-Off | Work Atmosphere & Communications | Total Score |
|------|---|---|---|---|---|---|---|---|---|
| 3M Canada Company | 10 | 7 | 9 | 9 | 10 | 10 | 6 | 10 | 71 |
| Aboriginal Peoples Television Network Inc. / APTN | 7 | 6 | 7 | 9 | 7 | 9 | 9 | 9 | 63 |
| Accenture Inc. | 9 | 9 | 7 | 9 | 9 | 9 | 6 | 7 | 65 |
| Adobe Systems Canada Inc. | 10 | 7 | 7 | 9 | 10 | 9 | 9 | 10 | 71 |
| Agriculture Financial Services Corporation / AFSC | 9 | 7 | 9 | 6 | 7 | 9 | 9 | 9 | 65 |


In [14]:
from IPython.display import HTML, Javascript, display

In [15]:
Javascript("""
           window.companyData={};
           """.format(df.reset_index().to_json(orient='records')))


In [16]:
# Write to JSON file on disk
# df.reset_index().to_json('canadastop100.json', orient='records')
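
The hand-off to the browser in cell 15 works through `str.format`: the `{}` after `window.companyData=` is the placeholder that receives the JSON text, which JavaScript then evaluates as an array of objects. Below is a sketch of that mechanism using only the standard library and made-up records. The `window.meta` assignment is hypothetical and only there to show that literal braces in the JavaScript template must be doubled (`{{` and `}}`) to survive `format`.

```python
import json

# Made-up records in the shape produced by to_json(orient='records').
records = [
    {"name": "Acme", "Total Score": 17},
    {"name": "Globex", "Total Score": 18},
]

# Single braces mark the placeholder; doubled braces become literal braces.
# window.meta is a hypothetical extra variable, not part of the notebook.
template = "window.companyData={}; window.meta={{loaded: true}};"
script = template.format(json.dumps(records))
print(script)
```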

## Visualizing data

VueJS can be loaded from https://cdn.jsdelivr.net/npm/vue@2.x/dist/vue.js. This notebook adapts the [grid-component](https://vuejs.org/v2/examples/grid-component.html) example from the official documentation to render the crawled data as a sortable, filterable table. The paths passed to `require.config` omit the `.js` extension because RequireJS appends it.

In [17]:
%%javascript
require.config({
    paths: {
        vue: "https://cdn.jsdelivr.net/npm/vue@2.x/dist/vue",
        vuetify: "https://cdn.jsdelivr.net/npm/vuetify@2.x/dist/vuetify"
    }
});


In [18]:
%%html
<script type="text/x-template" id="data-template">
  <table class="canada">
    <thead>
      <tr>
        <th v-for="key in columns"
          @click="sortBy(key)"
          :class="{ active: sortKey == key }">
          {{ key | capitalize }}
          <span class="arrow" :class="sortOrders[key] > 0 ? 'asc' : 'dsc'">
          </span>
        </th>
      </tr>
    </thead>
    <tbody>
      <tr v-for="entry in filteredData">
        <td v-for="key in columns">
          {{entry[key]}}
        </td>
      </tr>
    </tbody>
  </table>
</script>

In [23]:
%%html
<div id="vue-app">
  <v-app name="main-app">
    <v-content name="main-content">
      <v-container>
        <v-icon v-text="'$vuetify.icons.support'"></v-icon>
        <v-icon v-text="'$support'"></v-icon>
        <form id="search">
          Search <input name="query" v-model="searchQuery">
        </form>
        <data-grid
          :data="gridData"
          :columns="gridColumns"
          :filter-key="searchQuery">
        </data-grid>
      </v-container>
    </v-content>
  </v-app>
</div>

In [24]:
%%javascript
require(['vue', 'vuetify'], function(Vue, Vuetify) {
    console.log(Vue.version);
    var companyData = window.companyData;
    console.log(JSON.stringify(companyData));
    // Verify Vuetify is loaded
    console.log(Vuetify.version);
    Vue.component('data-grid', {
      template: '#data-template',
      props: {
        data: Array,
        columns: Array,
        filterKey: String
      },
      data: function () {
        var sortOrders = {}
        this.columns.forEach(function (key) {
          sortOrders[key] = 1
        })
        return {
          sortKey: '',
          sortOrders: sortOrders
        }
      },
      computed: {
        filteredData: function () {
          var sortKey = this.sortKey
          var filterKey = this.filterKey && this.filterKey.toLowerCase()
          var order = this.sortOrders[sortKey] || 1
          var data = this.data
          if (filterKey) {
            data = data.filter(function (row) {
              return Object.keys(row).some(function (key) {
                return String(row[key]).toLowerCase().indexOf(filterKey) > -1
              })
            })
          }
          if (sortKey) {
            data = data.slice().sort(function (a, b) {
              a = a[sortKey]
              b = b[sortKey]
              return (a === b ? 0 : a > b ? 1 : -1) * order
            })
          }
          return data
        }
      },
      filters: {
        capitalize: function (str) {
          return str.charAt(0).toUpperCase() + str.slice(1)
        }
      },
      methods: {
        sortBy: function (key) {
          this.sortKey = key
          this.sortOrders[key] = this.sortOrders[key] * -1
        }
      }
    })

    var vueApp = new Vue({
      el: '#vue-app',
      vuetify: new Vuetify({
          icons: {
            iconfont: 'mdiSvg', // 'mdi' || 'mdiSvg' || 'md' || 'fa' || 'fa4' || 'faSvg'
          },
        }),
      data: {
        searchQuery: '',
        gridColumns: Object.keys(companyData[0]),
        gridData: companyData
      }
    })
  
});
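
For readers more at home in Python, the `filteredData` computed property above can be restated as follows. This is a paraphrase for illustration, not code the notebook runs: keep the rows in which any value contains the search text (case-insensitive), then sort by the chosen column, with `order` set to `1` for ascending and `-1` for descending. The function name `filtered_data` and the sample rows are made up.

```python
def filtered_data(rows, filter_key="", sort_key="", order=1):
    """Python paraphrase of the grid's filter-then-sort logic."""
    needle = filter_key.lower()
    if needle:
        # Keep a row if any of its values contains the search text.
        rows = [
            row for row in rows
            if any(needle in str(value).lower() for value in row.values())
        ]
    if sort_key:
        # sorted() returns a new list, like data.slice().sort() in the JS version.
        rows = sorted(rows, key=lambda row: row[sort_key], reverse=(order < 0))
    return list(rows)


# Made-up rows in the shape of window.companyData.
rows = [
    {"name": "Globex", "Total Score": 65},
    {"name": "Acme", "Total Score": 71},
    {"name": "Initech", "Total Score": 63},
]

matches = filtered_data(rows, filter_key="AC")
ranked = filtered_data(rows, sort_key="Total Score", order=-1)
print([row["name"] for row in matches])
print([row["name"] for row in ranked])
```

Because the filter looks at every value in a row, a query such as `71` matches on the score columns as well as on company names.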


In [25]:
%%html
<style>
@import url("https://fonts.googleapis.com/css?family=Roboto:100,300,400,500,700,900");
@import url("https://cdn.jsdelivr.net/npm/@mdi/font@4.x/css/materialdesignicons.min.css");
@import url("https://cdn.jsdelivr.net/npm/vuetify@2.x/dist/vuetify.min.css");

table.canada {
  border: 2px solid rgb(102, 153, 255);
  border-radius: 3px;
  background-color: #fff;
}

table.canada th {
  background-color: rgb(102, 153, 255);
  color: rgba(255,255,255,0.66);
  cursor: pointer;
  -webkit-user-select: none;
  -moz-user-select: none;
  -ms-user-select: none;
  user-select: none;
}

table.canada td {
  background-color: #f9f9f9;
}

table.canada th, table.canada td {
  min-width: 120px;
  padding: 10px 20px;
}

table.canada th.active {
  color: #fff;
}

table.canada th.active .arrow {
  opacity: 1;
}

.arrow {
  display: inline-block;
  vertical-align: middle;
  width: 0;
  height: 0;
  margin-left: 5px;
  opacity: 0.66;
}

.arrow.asc {
  border-left: 4px solid transparent;
  border-right: 4px solid transparent;
  border-bottom: 4px solid #fff;
}

.arrow.dsc {
  border-left: 4px solid transparent;
  border-right: 4px solid transparent;
  border-top: 4px solid #fff;
}
</style>

https://jsfiddle.net/jitsejan/rxxjhgf6