<p style="font-size:16px;">Create a scraper for the test PDF shown in the screenshot below. Skip cancelled meetings and meetings that fall outside the specified year.</p>

<img src="https://s23.postimg.org/jsccsx26z/Test_Meeting_screenshot.png" style="width:700px;margin:0;">


<p style="font-size:16px;font-weight:bold;">Import necessary libraries</p>

<p style="font-size:16px">These are already imported inside meetings_scraper, but they must be imported again when a scraper is defined outside that file.</p>

In [1]:
from meetings_scraper import Meeting
from datetime import datetime
import re

In [2]:
Meeting.pdf_url = "http://localhost:8888/files/Open%20Wichita/files/Test%20Meeting%20Schedule.pdf"
pdf = Meeting.fetch_pdf()
pdf

2016 => http://localhost:8888/files/Open%20Wichita/files/Test%20Meeting%20Schedule.pdf
Meeting PDF written to file: files/Meeting_2016.pdf


<pdfquery.pdfquery.PDFQuery at 0x1047c3dd8>

In [3]:
pdf.tree.write("files/Meeting_2016_tree.xml", pretty_print=True)

<p style="font-size:16px">The PDF XML for text boxes (generated by the PDFQuery library, which is loaded inside meetings_scraper) looks like this. The elements are not listed in reading order:</p>

```xml
<LTTextBoxHorizontal bbox="[396.15, 607.426, 537.72, 620.152]" height="12.726" index="0" width="141.57" x0="396.15" x1="537.72" y0="607.426" y1="620.152">June 28 CANCELLED</LTTextBoxHorizontal>
```
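
<p style="font-size:16px">Because the elements are out of order, it can help to sort them by position before reading them. The following is a minimal sketch using only the standard library, not part of meetings_scraper. SAMPLE_XML is a hypothetical, trimmed-down stand-in for the real tree (the coordinates are taken from this notebook), and reading_order is an illustrative name.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the PDFQuery tree: only the attributes used below.
SAMPLE_XML = """
<page>
  <LTTextBoxHorizontal index="0" x0="396.15" y1="620.152">June 28 CANCELLED</LTTextBoxHorizontal>
  <LTTextBoxHorizontal index="4" x0="252.85" y1="716.448">Meeting Schedule</LTTextBoxHorizontal>
  <LTTextBoxHorizontal index="6" x0="82.021" y1="684.422">ecember 31, 2016</LTTextBoxHorizontal>
</page>
"""


def reading_order(root):
    """Return the text boxes sorted top-to-bottom, then left-to-right."""
    boxes = root.iter("LTTextBoxHorizontal")
    # PDF coordinates start at the bottom-left corner,
    # so a larger y1 means the box sits higher on the page.
    return sorted(boxes, key=lambda box: (-float(box.get("y1")), float(box.get("x0"))))


root = ET.fromstring(SAMPLE_XML.strip())
ordered_text = [box.text.strip() for box in reading_order(root)]
print(ordered_text)
```

<p style="font-size:16px">Sorting on the negated y1 puts the title first and the June line last, regardless of the index attribute.</p>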

<p style="font-size:16px">Sometimes the first letter of a line ends up in a separate element, for reasons that are unclear. Here the "D" of "December" is missing:</p>

```xml
<LTTextBoxHorizontal bbox="[82.021, 671.696, 185.58, 684.422]" height="12.726" index="6" width="103.559" x0="82.021" x1="185.58" y0="671.696" y1="684.422">ecember 31, 2016</LTTextBoxHorizontal>
```
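
<p style="font-size:16px">One possible way to repair such a fragment is to match it against month names with their first letter removed. This is only a sketch and makes no claim about how meetings_scraper handles the problem. MONTHS and repair_month are hypothetical names introduced for illustration.</p>

```python
# English month names, hard-coded so the result does not depend on the locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def repair_month(text):
    """Restore a month name that lost its first letter, if the match is unambiguous."""
    first_word, _, rest = text.partition(" ")
    if first_word in MONTHS:
        return text  # already a complete month name
    candidates = [month for month in MONTHS if month[1:] == first_word]
    if len(candidates) == 1:
        return candidates[0] + (" " + rest if rest else "")
    return text  # leave unrecognized text untouched


print(repair_month("ecember 31, 2016"))
```

<p style="font-size:16px">Lines that already start with a full month name, or that are not dates at all, pass through unchanged.</p>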

In [4]:
lines = pdf.pq('LTTextBoxHorizontal:contains("Meeting Schedule")')
lines

[<LTTextBoxHorizontal>]

In [5]:
line = lines[0]
print(line.text)

for key, value in line.items():
    print(key, "=>", value)

Meeting Schedule 
bbox => [252.85, 703.946, 362.65, 716.448]
height => 12.502
index => 4
width => 109.8
x0 => 252.85
x1 => 362.65
y0 => 703.946
y1 => 716.448


<p style="font-size:16px;font-weight:bold;">Define Custom Scraper</p>

<ol style="font-size:16px">
    <li>Inherit from the Meeting class</li>
    <li>Define the pdf_url class variable</li>
    <li>Define the data common to all meetings in __init__</li>
    <li>Override the parse_meetings class method for PDF-specific querying</li>
</ol>

In [6]:
help(Meeting.parse_meetings)

Help on method parse_meetings in module meetings_scraper:

parse_meetings(pdf, date_lines) method of builtins.type instance
    Parse the pdf text lines for meeting events
    
    Parameters
    ----------
    cls: class passed in by Python
    pdf: PDFQuery object created from meetings pdf
    date_lines: tuple of pdf line, month & day (ints)
    
    Returns
    -------
    meetings: list of class specific meeting instances



In [7]:
class ElixirMeeting(Meeting):
    pdf_url = "http://localhost:8888/files/Open%20Wichita/files/Test%20Meeting%20Schedule.pdf"
    
    # Each meeting is passed the year, month, day by the pdf parsing code
    def __init__(self, year, month, day):
        self.type = "elixir"
        self.summary = "Elixir Meeting"
        self.description = "testing adding a custom scraper"
        self.location = "216 N Mosley St, Wichita, KS 67202"
        self.date = datetime(year, month, day, hour=14, minute=15)
        self.agenda = "https://media.readthedocs.org/pdf/elixir-lang/latest/elixir-lang.pdf"
        self.email = "fake_marcus@gmail.com"
        # Additional properties can be added
        self.additional = "|> Enum.sort fn(x, y) -> elem(x, 1) > elem(y, 1) end"
    
    # parse_meetings is called from the get_meetings class method
    @classmethod
    def parse_meetings(cls, pdf, date_lines):
        meetings = []
        
        for line, month, day in date_lines:
            if "CANCELLED" in line.text:
                print("Meeting cancelled on {0}/{1}".format(month, day))
                continue
            else:
                # Check if year is present
                year_match = re.search(r"\d{4}", line.text)
                
                # If there is a year, make sure it matches cls.year
                if year_match:
                    year = int(year_match.group())
                    if year != cls.year:
                        print("Wrong year:", line.text)
                        continue
         
            # The line passed the checks above, so add a meeting of this specific class (cls)
            meeting = cls(cls.year, month, day)
            meetings.append(meeting)
        
        print("{0} meetings generated".format(len(meetings)))
        return meetings
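
<p style="font-size:16px">The filtering rules can also be checked on their own, without a PDF. The sketch below pulls the same two checks into a hypothetical helper called keep_line. The first three sample strings appear in this notebook; "February 23" is a made-up line without a year, added for illustration.</p>

```python
import re


def keep_line(text, target_year):
    """Return True when a date line should produce a meeting."""
    if "CANCELLED" in text:
        return False
    year_match = re.search(r"\d{4}", text)
    # A line without a year is assumed to belong to the target year.
    return year_match is None or int(year_match.group()) == target_year


samples = ["April 30, 2018", "June 28 CANCELLED", "ecember 31, 2016", "February 23"]
kept = [text for text in samples if keep_line(text, 2017)]
print(kept)
```

<p style="font-size:16px">Only the line without a conflicting year and without the CANCELLED marker survives.</p>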

<p style="font-size:16px">The get_meetings class method starts the whole sequence: fetching the PDF, querying it, and generating meeting instances. The parse_meetings override defined above is called during that sequence.</p>
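
<p style="font-size:16px">The pattern at work is a base class method that calls a hook which subclasses override. The sketch below illustrates that pattern with stand-in classes. It is not the actual meetings_scraper code: BaseMeeting, find_date_lines, SkipCancelled, and the simplified parse_meetings signature are all hypothetical.</p>

```python
class BaseMeeting:
    """Stand-in for Meeting; every name here is hypothetical."""

    year = None

    def __init__(self, year, month, day):
        self.date = (year, month, day)

    @classmethod
    def find_date_lines(cls):
        # Placeholder for fetching the PDF and querying it for date lines.
        return [("January 1", 1, 1), ("June 28 CANCELLED", 6, 28)]

    @classmethod
    def parse_meetings(cls, date_lines):
        # Default hook: one meeting per date line.
        return [cls(cls.year, month, day) for _, month, day in date_lines]

    @classmethod
    def get_meetings(cls, year):
        cls.year = year
        date_lines = cls.find_date_lines()
        # cls is the subclass here, so its parse_meetings override runs.
        return cls.parse_meetings(date_lines)


class SkipCancelled(BaseMeeting):
    @classmethod
    def parse_meetings(cls, date_lines):
        return [
            cls(cls.year, month, day)
            for text, month, day in date_lines
            if "CANCELLED" not in text
        ]


meetings = SkipCancelled.get_meetings(2017)
print([meeting.date for meeting in meetings])
```

<p style="font-size:16px">Calling get_meetings on the subclass is enough: the base method never needs to know which override it is running.</p>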

In [8]:
help(ElixirMeeting.get_meetings)

Help on method get_meetings in module meetings_scraper:

get_meetings(year=None) method of builtins.type instance
    Parse the pdf text lines for meeting events. Called by get_meetings function.
    
    Parameters
    ---------
    year: 4 digit year
          Defaults to None, which means use current year
    
    Returns
    -------
    meetings: list of meetings generated from online pdf



In [9]:
meetings = ElixirMeeting.get_meetings(2017)
meetings

2017 => http://localhost:8888/files/Open%20Wichita/files/Test%20Meeting%20Schedule.pdf
Meeting PDF written to file: files/ElixirMeeting_2017.pdf
Wrong year: April 30, 2018 
Meeting cancelled on 6/28
Wrong year: ecember 31, 2016 
4 meetings generated


[Sun Jan 01 02:15 PM: Elixir Meeting,
 Thu Feb 23 02:15 PM: Elixir Meeting,
 Mon Mar 13 02:15 PM: Elixir Meeting,
 Fri Aug 11 02:15 PM: Elixir Meeting]

In [10]:
for key, value in meetings[0]:
    print(key, "=>", value)

summary => Elixir Meeting
location => 216 N Mosley St, Wichita, KS 67202
email => fake_marcus@gmail.com
description => testing adding a custom scraper
additional => |> Enum.sort fn(x, y) -> elem(x, 1) > elem(y, 1) end
agenda => https://media.readthedocs.org/pdf/elixir-lang/latest/elixir-lang.pdf
type => elixir
date => 2017-01-01 14:15:00


**For the hell of it**

The to_json, to_ics, or to_csv methods could be called here instead.

In [11]:
meetings[0].to_html()