## Introduction

In this problem, you're going to write a function that parses (a simplified version of) XML files into a Python object. We don't recommend using the parser you construct for anything serious (many excellent Python libraries already exist for parsing XML, such as lxml), but XML is a fairly complex file format, and parsing it in a reasonably efficient manner requires regular expressions and recursion (or a stack).  So while it's unlikely that you will need to write your own XML parser, chances are that if/when you _do_ need to write a parser for some format that has no good Python library, the techniques you use here will be useful for that parser as well.

## XML

XML stands for the eXtensible Markup Language.  It appeared as a successor to SGML (Standard Generalized Markup Language) and HTML (HyperText Markup Language, the standard for displaying web pages), but with some additional structure that makes documents better defined; for instance, in HTML it's common for open tags to appear without a corresponding close tag, which is not allowed in pure XML.

You may already be familiar with XML, but if not, the official resource for learning about the format is https://www.w3.org/XML/, and a good resource with some concrete examples is http://www.w3schools.com/xml/.  We'll assume here that you're broadly familiar with the basic ideas behind XML, and just describe what you need to know to complete the parser for this assignment.

Here is an example XML document:
    
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- This is a comment -->
    <note date="8/31/12">
        <to>Tove</to>
        <from>Jani</from>
        <heading type="Reminder"/>
        <body>Don't forget me this weekend!</body>
        <!-- This is a multiline comment,
             which take a bit of care to parse -->
    </note>
    
There are a few elements here of importance.
1. Tags are denoted `<tag_name>content</tag_name>` where `<tag_name>` is the opening tag and `</tag_name>` is the closing tag.  All text (including whitespace) between these tags is the content.
2. Attributes follow the tag name inside an opening tag, and are written as a list of `attr_name="attribute_value"` pairs.  The attribute value can be wrapped in either double quotes or single quotes: if you use double quotes then a single quote can appear in the value, and vice versa.  Whitespace around the equals sign is optional.
3. If a tag has no content `<tag_name attr_name="attribute_value"></tag_name>` can be abbreviated as the open/close tag `<tag_name attr_name="attribute_value"/>`. In some cases, such as in HTML5, you might come across tags that have no content but aren't closed, such as `<meta ...>` and `<link ...>` tags. **However, as we are dealing with XML (a stricter context), tags are required to be closed.** If you're interested, [this document](https://www.w3schools.com/html/html_xhtml.asp) on XHTML vs HTML touches upon this idea.
4. An XML prolog is written as `<?tag_name attr_name="attribute_value"?>`.  It has no close tag.  We'll also consider documents that include an HTML declaration, such as `<!DOCTYPE html>` (this lets us parse some HTML documents that are well-formed enough to also parse as valid XML).
5. Comments are denoted by `<!-- comment_text -->`  and the comment text can span multiple lines.
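
To make point 2 concrete, here is a minimal sketch of how attribute pairs could be pulled out of an attribute string with a regular expression. The names `attr_pattern` and `parse_attributes` are made up for this illustration and are not part of the interface the assignment asks for; the pattern also assumes attribute names consist of word characters, colons, dots, and hyphens.

```python
import re

# Hypothetical pattern: an attribute name, optional whitespace around "=",
# then a value wrapped in either quote style. The backreference \2 requires
# the closing quote to be the same character as the opening quote.
attr_pattern = re.compile(r"""([\w:.-]+)\s*=\s*(["'])(.*?)\2""")

def parse_attributes(attr_string):
    """Return a dict mapping attribute names to their unquoted values."""
    return {name: value for name, _quote, value in attr_pattern.findall(attr_string)}

print(parse_attributes(""" date="8/31/12" note = 'say "hi"' """))
# {'date': '8/31/12', 'note': 'say "hi"'}
```

Because the closing quote must match the opening one, a double quote inside a single-quoted value (and vice versa) is kept as part of the value.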

## Q1: Regular expression for identifying tags

First, we'll use regular expressions to identify tags and other elements of XML files.  Specifically, you'll need to create 6 regular expressions that locate open tags, close tags, open/close tags, comments, XML prologs, and HTML declarations.  For the open, close, and open/close tags, make sure that your regular expression also matches and returns 1) the tag name, and 2) all the attributes.  Note that in our implementation, the open tag expression _also_ matches open/close tags, but you are free to do this either way (they can match or not).  Comments may be split across multiple lines, but you can assume that all other tags occur on a single line (without newlines within the tag itself).
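
As a rough illustration of the multi-line point, the sketch below shows one way a pattern can match comments that span several lines: a non-greedy body combined with the `re.DOTALL` flag, which lets `.` match newlines. The name `comment_sketch` is invented here, and this is only one possible approach, not the reference solution.

```python
import re

# Illustrative pattern only: ".*?" stops at the first "-->", and re.DOTALL
# allows "." to cross line breaks so multi-line comments are captured whole.
comment_sketch = re.compile(r"<!--(.*?)-->", re.DOTALL)

text = "<a><!-- one --><b/><!-- two\nlines --></a>"

# With a single capture group, findall returns just the captured comment text.
print(comment_sketch.findall(text))
# [' one ', ' two\nlines ']
```

Without the non-greedy `?`, the body would run from the first `<!--` to the last `-->` in the text and swallow everything in between.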

In [1]:
#### Q1: FILL IN THIS CELL
import re
# Raw strings keep regex escapes such as \s intact and avoid invalid-escape warnings.
tag_open = re.compile(r"<([a-zA-Z0-9]+)(\s[^>]*)*/?>")
tag_close = re.compile(r"</([a-zA-Z0-9]+)(\s[^>]*)*>")
tag_open_close = re.compile(r"<([a-zA-Z0-9]+)(\s[^>]*)*/>")
comment = re.compile(r"<!--([\[\]/:\.\'\-,\s\w]*)-->")
xml_prolog = re.compile(r"<\?(.*)\?>")
html_declaration = re.compile(r"<!([\s\w]*)>")

print("tag_open: ", tag_open.findall("""<note date="8/31/12"><to>Tove</to><!-- This is a comment -->"""))
print("comment: ", comment.findall("""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml> <!-- not actually valid xml--><!-- This is a comment -->
<note date="8/31/12">
    <to>Tove</to>
    <from>Jani</from>
    <heading type="Reminder"/>
    <body>Don't forget me this weekend!</body>
    <!-- This is a multiline comment,
         which take a bit of care to parse --><!-- Eric\'s CSS -->\n<!-- WARNING: Respond.js doesn\'t work if you view the page via file:// -->\n<!-- ./ navbar-collapse --> 
</note>
"""))


print("xml: ", xml_prolog.findall("""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml> <!-- not actually valid xml--><!-- This is a comment -->
<note date="8/31/12">
    <to>Tove</to>
    <from>Jani</from>
    <heading type="Reminder"/>
    <body>Don't forget me this weekend!</body>
    <!-- This is a multiline comment,
         which take a bit of care to parse -->
</note>
"""))
print("html: ", html_declaration.findall("""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml> <!-- not actually valid xml--><!-- This is a comment -->
<note date="8/31/12">
    <to>Tove</to>
    <from>Jani</from>
    <heading type="Reminder"/>
    <body>Don't forget me this weekend!</body>
    <!-- This is a multiline comment,
         which take a bit of care to parse -->
</note>
"""))

tag_open:  [('note', ' date="8/31/12"'), ('to', '')]
xml:  ['xml version="1.0" encoding="UTF-8"']
html:  ['DOCTYPE xml']


You can test your code on the following snippet, and on the HTML source of our course web page.
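
One quick way to sanity-check a set of patterns is to count how many matches each one finds and compare that against what you expect by reading the text. The sketch below assumes a hypothetical helper `count_matches` and two stand-in patterns; substitute the expressions from your own Q1 cell.

```python
import re

def count_matches(patterns, text):
    """Map each pattern name to its number of non-overlapping matches in text."""
    return {name: len(pattern.findall(text)) for name, pattern in patterns.items()}

# Stand-in patterns for illustration only, not the assignment's solution.
demo_patterns = {
    "comment": re.compile(r"<!--.*?-->", re.DOTALL),
    "xml_prolog": re.compile(r"<\?.*?\?>"),
}
demo_text = '<?xml version="1.0"?>\n<!-- a -->\n<note><!-- b\n c --></note>'

print(count_matches(demo_patterns, demo_text))
# {'comment': 2, 'xml_prolog': 1}
```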

In [2]:
import requests
test_snippet = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml> <!-- not actually valid xml-->
<!-- This is a comment -->
<note date="8/31/12">
    <to>Tove</to>
    <from>Jani</from>
    <heading type="Reminder"/>
    <body>Don't forget me this weekend!</body>
    <!-- This is a multiline comment,
         which take a bit of care to parse -->
</note>
"""

# [NOTE] Comment this out prior to submission
#course_webpage = str(requests.get("http://www.datasciencecourse.org/2016").content)
course_webpage = """b'<!DOCTYPE html>\n<html lang="en">\n\n<head>\n\n    <meta charset="utf-8"/>\n    <meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n    <meta name="viewport" content="width=device-width, initial-scale=1"/>\n    <meta name="description" content=""/>\n    <meta name="author" content=""/>\n\n    <title>15-388/688 Practical Data Science</title>\n\n    <!-- Bootstrap Core CSS -->\n    <link href="css/bootstrap.min.css" rel="stylesheet"/>\n\n    <!-- Custom CSS -->\n    <link href="css/agency.css" rel="stylesheet"/>\n\n    <!-- Eric\'s CSS -->\n    <link href="css/eric.css" rel="stylesheet"/>\n\n    <!-- Custom Fonts -->\n    <link href="font-awesome/css/font-awesome.min.css" rel="stylesheet" type="text/css"/>\n    <link href="https://fonts.googleapis.com/css?family=Montserrat:400,700" rel="stylesheet" type="text/css"/>\n    <link href=\'https://fonts.googleapis.com/css?family=Kaushan+Script\' rel=\'stylesheet\' type=\'text/css\'/>\n    <link href=\'https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic,700italic\' rel=\'stylesheet\' type=\'text/css\'/>\n    <link href=\'https://fonts.googleapis.com/css?family=Roboto+Slab:400,100,300,700\' rel=\'stylesheet\' type=\'text/css\'/>\n\n    <!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->\n    <!-- WARNING: Respond.js doesn\'t work if you view the page via file:// -->\n    <!--[if lt IE 9]>\n        <script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>\n        <script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>\n    <![endif]-->\n\n</head>\n\n<body id="page-top" class="index">\n\n    <!-- Navigation -->\n    <nav class="navbar navbar-default navbar-fixed-top">\n        <div class="container">\n            <!-- Brand and toggle get grouped for better mobile display -->\n            <div class="navbar-header page-scroll">\n                <button type="button" class="navbar-toggle" 
data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">\n                    <span class="sr-only">Toggle navigation</span>\n                    <span class="icon-bar"></span>\n                    <span class="icon-bar"></span>\n                    <span class="icon-bar"></span>\n                </button>\n                <a class="navbar-brand page-scroll" href="#page-top">15-388/688\n                 Practical Data Science</a>\n            </div>\n\n            <!-- Collect the nav links, forms, and other content for toggling -->\n            <div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">\n                <ul class="nav navbar-nav navbar-right">\n                    <li class="hidden">\n                        <a href="#page-top"></a>\n                    </li>\n                    <li>\n                        <a class="page-scroll" href="#overview">Overview</a>\n                    </li>\n                    \n                    <li>\n                        <a class="page-scroll" href="#schedule">Schedule</a>\n                    </li>\n\n                    <li>\n                        <a class="page-scroll" href="#assignments">Assignments</a>\n                    </li>\n\n                    <li>\n                        <a class="page-scroll" href="#instructors">Instructors</a>\n                    </li>\n\n                    <li>\n                        <a class="page-scroll" href="#faq">FAQ</a>\n                    </li>\n                </ul>\n            </div>\n            <!-- /.navbar-collapse -->\n        </div>\n        <!-- /.container-fluid -->\n    </nav>\n\n    <!-- Header -->\n    <header>\n        <div class="container">\n            <div class="intro-text">\n                <div class="intro-lead-in">CMU 15-388/688, Fall 2016</div>\n                <div class="intro-heading">Practical Data Science</div>\n                <a href="#overview" class="page-scroll btn btn-xl">Course\n                    
Information</a>\n            </div>\n        </div>\n    </header>\n\n    <!-- Overview Section -->\n    <section id="overview" class="bg-light-gray">\n        <div class="container">\n            <div class="row">\n                <div class="col-lg-10 col-lg-offset-1">\n                    <h2 class="section-heading text-center">Course Overview</h2>\n\n                    <p class="large text-muted">Data science is the study and practice of how we can extract insight and knowledge from large amounts of data.  It is a burgeoning field, currently attracting substantial demand from both academia and industry.</p>\n                    <p class="large text-muted">This course provides a practical introduction to the "full stack" of data science analysis, including data collection and processing, data visualization and presentation, statistical model building using machine learning, and big data techniques for scaling these methods.  Topics covered include: collecting and processing data using relational methods, time series approaches, graph and network models, free text analysis, and spatial geographic methods; analyzing the data using a variety of statistical and machine learning methods include linear and non-linear regression and classification, unsupervised learning and anomaly detection, plus advanced machine learning methods like kernel approaches, boosting, or deep learning; visualizing and presenting data, particularly focusing the case of high-dimensional data; and applying these methods to big data settings, where multiple machines and distributed computation are needed to fully leverage the data.</p>\n                    <p class="large text-muted"> As the course name suggests, this course will focus on the <em>practical</em> aspects of data science, with a focus on implementing and making use of the above techniques.  Students will complete weekly programming homework that emphasize practical understanding of the methods described in the course.  
In addition, students will develop a tutorial on an advanced topic, and will complete a group project that applies these data science techniques to a practical application chosen by the team; these two longer assignments will be done in lieu of a midterm or final.</p>\n\n                </div>\n            </div>\n\n\n            <div class="row text-center">\n                <div class="col-md-4">\n                    <span class="fa-stack fa-5x">\n                        <i class="fa fa-circle fa-stack-2x text-primary"></i>\n                        <i class="fa fa-database fa-stack-1x fa-inverse"></i>\n                    </span>\n                    <h4 class="service-heading">Data collection and processing</h4>\n                    <p class="text-muted large">Ingest data from unstructured and structured sources, and use relational models, time series algorithms, graph and network processing, natural language processing, geographic information system processes to store and manage the data.</p>\n                </div>\n                <div class="col-md-4">\n                    <span class="fa-stack fa-5x">\n                        <i class="fa fa-circle fa-stack-2x text-primary"></i>\n                        <i class="fa fa-bar-chart fa-stack-1x fa-inverse"></i>\n                    </span>\n                    <h4 class="service-heading">Statistical modeling</h4>\n                    <p class="text-muted large">Apply basic statistical techniques and analyses to understand properties of the data and to design experimental setups for testing hypotheses or collecting new data. 
</p>\n                </div>\n                <div class="col-md-4">\n                    <span class="fa-stack fa-5x">\n                        <i class="fa fa-circle fa-stack-2x text-primary"></i>\n                        <i class="fa fa-gears fa-stack-1x fa-inverse"></i>\n                    </span>\n                    <h4 class="service-heading">Advanced ML techniques</h4>\n                    <p class="text-muted large">Apply advanced machine learning algorithms such as kernel methods, boosting, deep learning, anomaly detection, factorization models, and probabilistic modeling to analyze and extract insights from data. </p>\n                </div>\n            </div>\n            <div class="row text-center">\n                <div class="col-md-4 col-md-offset-2">\n                    <span class="fa-stack fa-5x">\n                        <i class="fa fa-circle fa-stack-2x text-primary"></i>\n                        <i class="fa fa-eye fa-stack-1x fa-inverse"></i>\n                    </span>\n                    <h4 class="service-heading">Data visualization</h4>\n                    <p class="text-muted large">Visualize the data and results from analysis, particularly focusing on visualizing and understanding high-dimensional structured data and the results of statistical and machine learning analysis.</p>\n                </div>\n                <div class="col-md-4">\n                    <span class="fa-stack fa-5x">\n                        <i class="fa fa-circle fa-stack-2x text-primary"></i>\n                        <i class="fa fa-cloud fa-stack-1x fa-inverse"></i>\n                    </span>\n                    <h4 class="service-heading">Big data</h4>\n                    <p class="text-muted large">Scale the methods to big data regimes, where distributed storage and computation are needed to fully realize capabilities of data analysis techniques.</p>\n                </div>\n            </div>\n\n            <div class="col-lg-12" 
style="height:50px;"></div>\n            <div class="row">\n                <div class="col-lg-10 col-lg-offset-1 well">\n                    <h3 class="service-heading">Key information</h3>\n\n                    <p class="large">\n                    <strong>Course number:</strong> 15-388 (undergraduate) / 15-688 (masters) <br/> \n                    Both courses will have the same lectures, but on each assignment there will be additional advanced problems for the 600 level course.\n                    </p>\n\n                    <p class="large">\n                    <strong>Course location and time:</strong> Doherty Hall A302, MW 12:00-1:20</p>\n\n\n\n\n                    <p class="large">\n                    <strong>Units:</strong> 9 (15-388), 12 (15-688)<br/>\n                    </p>\n\n                    <p class="large">\n                    <strong>Prequisites:</strong> Programming experience is necessary for the course (assignments are in Python).  For CS undergraduates, either 15-112 or 15-122 is required; undergraduates from other departments who have programming background but have not taken the course require instructor approval to enroll.  Experience with linear algebra, probability, and statistic is recommended, but not strictly required (courses like 21-240/241/242 for linear algebra or 36-201 for probability/statistics are more than sufficient).  
Students concerned about whether they have a proper background should contact the course instructors to discuss.</p>\n\n                    <p class="large">\n                    <strong>Grading:</strong> 55% homeworks, 15% tutorial, 25% final project, 5% class participation.</p>\n\n\n                </div>\n            </div>\n        </div>\n    </section>\n\n    <!-- Schedule Section -->\n    <section id="schedule">\n        <div class="container">\n            <div class="row">\n                <div class="col-lg-10 col-lg-offset-1">\n                    <h2 class="section-heading text-center">Schedule</h2>\n                    <p class="large text-muted">This schedule is tentative and subject to change, and precise dates will be added closer to the course start date.  All course material, including slides, lecture videos, and assignments, will be publicly available.</p>\n                </div>\n            </div>\n            <div class="col-lg-12" style="height:50px;"></div>\n\n            <div class="row">\n                <div class="col-lg-10 col-lg-offset-1">\n                    <div class="table-responsive" style="font-size:16px">\n                        <table class="table table-hover table-striped large">\n                            <thead>\n                                <tr>\n                                    <th>Date</th>\n                                    <th>Topic</th>\n                                    <th>Lecture</th>\n                                    <th>Assignments</th>\n                                </tr>\n                            </thead>\n                            <tbody>\n                                <tr>\n                                    <td>8/29</td>\n                                    <td><a href="intro.pdf">Introduction</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=3f116819-0e26-4f79-bcee-f4cb7e745773">video</a></td>\n                    
                <td></td>\n                                </tr>\n                                <tr>\n                                    <th colspan="4">Data collection and management</th>\n                                </tr>\n\n                                <tr>\n                                    <td>8/31</td>\n                                    <td><a href="data_collection.pdf">Data collection and scraping</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=0d70cc6d-acbc-4895-a4da-8932dc323723">video</a></td>\n                                    <td>HW1 Out <a href="hw/1/hw1.pdf">(pdf)</a> <a href="hw/1/handout.tar">(notebooks)</a></td>\n                                </tr>\n                                <tr>\n                                    <td>9/7</td>\n                                    <td><a href="jupyter.pdf">Jupyter notebook lab</a>\n                                        <a href="jupyter.tar">(notebook and data files)</a>\n                                    </td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=08d57ec4-058f-476d-9962-356b3c39af37">video</a></td>\n                                    <td></td>\n                                </tr>\n\n                                <tr>\n                                    <td>9/12</td>\n                                    <td><a href="relational_data.pdf">Relational Data</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=d3f716a3-a874-415d-8b6b-5c0af4efd4c6">video</a></td>\n                                    <td></td>\n                                </tr>\n                                \n                                <tr>\n                                    <td>9/14</td>\n                                    <td><a href="visualization.pdf">Visualization and data exploration</a> <a 
href="Visualization.ipynb">(notebook)</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=c0e02818-c252-49a9-83b9-b5c52f01b12f">video</a></td>\n                                    <td>HW1 Due, HW 2 Out <a href="http://www.datasciencecourse.org/hw/2/handout.tar">(notebooks)</a></td>\n                                </tr>\n\n                                <tr>\n                                    <td>9/19</td>\n                                    <td><a href="matrices.pdf">Vector, matrices, and linear algebra</a> <a href="NumpyBasics.ipynb">(notebook)</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=70d8ac29-2865-4d71-a1f4-59127ec04c2c">video</a></td>\n                                    <td>Tutorial Out <a href="tutorial.pdf">(instructions)</a></td>\n                                </tr>\n                                <tr>\n                                    <td>9/21</td>\n                                    <td><a href="graphs.pdf">Graph and network processing</a>\n                                    <a href="NetworkXBasics.ipynb">(notebook)</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=0071471a-c880-4390-aa0d-3709f3d37776">video</a></td>\n                                    <td></td>\n                                </tr>\n\n                                <tr>\n                                    <td>9/26</td>\n                                    <td><a href="free_text.pdf">Free text and natural language processing</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=a69cc2ff-1f38-4843-b64d-d9c68cfee9c9">video</a></td>\n                                    <td></td>\n                                </tr>\n                                <tr>\n                                    
<td>9/28</td>\n                                    <td><a href="free_text.pdf">Free text, continued</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=69561aa9-7d9f-44f5-a5a7-14e4b78686ba">video</a></td>\n                                    <td>HW3 Out<a href="http://www.datasciencecourse.org/hw/3/handout.tar.gz"> (notebooks)</a></td>\n                                </tr>\n                                <tr>\n                                    <th colspan="4">Statistical modeling and machine learning</th>\n                                </tr>\n\n\n                                <tr>\n                                    <td>10/3</td>\n                                    <td><a href="linear_regression.pdf">Linear regression</a>\n                                    <a href="linear_reg_code_data.tar.gz">(notebook + data)</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=97af9d3b-529b-41f4-aa54-db2b2eed14b5">video</a></td>\n                                    <td>HW2 Due</td>\n                                </tr>\n                                <tr>\n                                    <td>10/5</td>\n                                    <td><a href="linear_classification.pdf">Linear classification</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=83e43dbe-1922-4c4a-a3af-3bb24dd89c9f">video</a></td>\n                                    <td></td>\n                                </tr>\n                                <tr>\n                                    <td>10/10</td>\n                                    <td><a href="nonlinear_modeling.pdf">Nonlinear modeling, cross-validation, regularization</a></td>\n                                    <td><a 
href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=00cc9cfb-e77a-4242-bd74-6587965b63bc">video</a></td>\n                                    <td><a href="GIS%20Tutorial.ipynb">Geospatial Analysis Tutorial</a><br/><a href="project.pdf">Final Project Out</a></td>\n                                </tr>\n                                <tr>\n                                    <td>10/12</td>\n                                    <td><a href="nonlinear_modeling.pdf">Model regularization and evaluation</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=56707206-e441-421e-897b-3b900494a94c">video</a></td>\n                                    <td>HW3 Due, HW4 Out <a href="http://www.datasciencecourse.org/hw/4/handout.tar.gz"> (notebooks)</a><a href="http://www.datasciencecourse.org/hw/4/data.tar.gz"> (data)</a></td>\n                                </tr>\n                                <tr>\n                                    <td>10/17</td>\n                                    <td><a href="probability.pdf">Basic probability and statistics: basics of probability</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=9e45f2fa-d796-42bd-9383-44f88408859a">video</a></td>\n                                    <td><a href="tutorial.pdf">Detailed tutorial instructions</a><br/>\n                                    <a href="count_length.py">Length checker script</a></td>\n                                </tr>\n                                <tr>\n                                    <td>10/19</td>\n                                    <td><a href="probability.pdf">Maximum likelihood estimation, naive Bayes</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=98349dc6-6435-4b7b-b2ea-be8a174464e6">video</a></td>\n                                    <td>Tutorial Check-in 
Due</td>\n                                </tr>\n                                <tr>\n                                    <td>10/21</td>\n                                    <td>Recitation (Numpy, Scipy.sparse, Scipy.stats)</td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=ad540a28-5ba6-45cc-bf69-9ff55880174f">video</a></td>\n                                    <td><a href="rec/rec4.ipynb">(notebook)</a></td>\n                                </tr>\n                                <tr>\n                                    <td>10/24</td>\n                                    <td><a href="hypothesis_testing.pdf">Hypothesis testing and experimental design</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=2caa5e96-8180-4743-81eb-463caa311da3">video</a></td>\n                                    <td></td>\n                                </tr>\n\n                                <tr>\n                                    <th colspan="4">Advanced modeling techniques</th>\n                                </tr>\n                                <tr>\n                                    <td>10/26</td>\n                                    <td><a href="decision_trees_boosting.pdf">Decision trees and boosting</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=1ff026e7-0f0e-4e11-ba7f-6001d5b5d132">video</a></td>\n                                    <td>HW4 Due</td>\n                                </tr>\n                                <tr>\n                                    <td>10/31</td>\n                                    <td><a href="unsupervised.pdf">Unsupervised learning: clustering and dimensionality reduction</a></td>\n                                    <td><a 
href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=89defcbc-953a-4c30-81ac-c09ed9736a4f">video</a></td>\n                                    <td></td>\n                                </tr>\n                                <tr>\n                                    <td>11/2</td>\n                                    <td><a href="anomaly_detection.pdf">Anomaly detection and mixture of Gaussians</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=af33f634-e6a3-4e2f-8fde-8e4f3e05b9f1">video</a></td>\n                                    <td>Tutorial Due, HW5 Out<a href="http://www.datasciencecourse.org/hw/5/handout.tar.gz"> (notebooks)</a></td>\n                                </tr>\n                                <tr>\n                                    <td>10/21</td>\n                                    <td>Recitation, HW5</td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=f8ce6ae6-5152-4a4c-b87d-ae16ee62fdae">video</a></td>\n                                    <td><a href="rec/rec5.ipynb">(notebook)</a></td>\n                                </tr>\n\n                                <tr>\n                                    <td>11/7</td>\n                                    <td><a href="recommender_systems.pdf">Recommender systems</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=b2438f26-f300-4990-889c-96c7285276a1">video</a></td>\n                                    <td></td>\n                                </tr>\n                                <tr>\n                                    <td>11/9</td>\n                                    <td><a href="deep_learning.pdf">Deep learning</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=19cca3dd-8b63-4e30-9aab-fbbd9099f2b1">video</a></td>\n   
                                 <td>Student Tutorial Evaluation Due</td>\n                                </tr>\n                                <tr>\n                                    <td>11/11</td>\n                                    <td></td>\n                                    <td></td>\n                                    <td>Midterm Report</td>\n                                </tr>\n                                <tr>\n                                    <td>11/14</td>\n                                    <td><a href="InfoViz.pdf">Guest lecture (Jen Mankoff, HCI): information visualization </a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=c5dcb54c-f8c5-4792-9db5-a5c274eae1f7">video</a></td>\n                                    <td></td>\n                                </tr>\n\n                                <tr>\n                                    <th colspan="4">Additional topics</th>\n                                </tr>\n                                <tr>\n                                    <td>11/16</td>\n                                    <td><a href="probabilistic_modeling.pdf">Probabilistic modeling</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=288e893c-62bb-4e79-aec3-2f1c5e92ec92"> video</a></td>\n                                    <td>HW5 Due, HW6 Out <a href="hw/6/handout.tar.gz">(notebooks)</a> <a href="hw/6/">(data)</a></td>\n                                </tr>\n                                <tr>\n                                    <td>11/21</td>\n                                    <td><a href="mapreduce.pdf">Big data and MapReduce methods</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=fdaef430-f88c-4ac0-89a1-9fe30c300519">video</a></td>\n                                    <td></td>\n                               
 </tr>\n                                <tr>\n                                    <td>11/22</td>\n                                    <td>HW 6 Recitation</td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=56b8ad3e-48fe-40a6-acb4-3d0512783728">video</a></td>\n                                    <td><a href="rec/rec6.ipynb">(notebook)</a></td>\n                                </tr>\n\n\n                                <tr>\n                                    <td>11/28</td>\n                                    <td><a href="debugging.pdf">Debugging data science</a> <a href="debugging_data_science.pdf">(working notes)</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=924f01c5-8a76-4248-8971-03d7e21b3219">video</a></td>\n                                    <td></td>\n                                </tr>\n\n                                <tr>\n                                    <td>11/30</td>\n                                    <td>A data science walkthrough <a href="Data Science Walkthrough.ipynb">(notebook)</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=7e0fad6a-520e-4bb7-9dc3-44498d3be26a">video</a></td>\n                                    <td></td>\n                                </tr>\n                                \n\n                                <tr>\n                                    <td>12/5</td>\n                                    <td><a href="data_science_positions.pdf">Data scientist positions</a></td>\n                                    <td><a href="https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=30711db7-9dde-4135-ad17-6652c3432ab2">video</a></td>\n                                    <td></td>\n                                </tr>\n                                <tr>\n                                    <td>12/6</td>\n          
                          <td></td>\n                                    <td></td>\n                                    <td>HW6 Due</td>\n                                </tr>\n                                <tr>\n                                    <td>12/7</td>\n                                    <td><a href="future_of_data_science.pdf">Future of data science</a></td>\n                                    <td></td>\n                                    <td></td>\n                                </tr>\n                                <tr>\n                                    <td>12/9</td>\n                                    <td></td>\n                                    <td></td>\n                                    <td>Final project report due</td>\n                                </tr>\n                            </tbody>\n                        </table>\n                    </div>\n                </div>\n            </div>\n        </div>\n    </section>\n\n\n    \n\n    <!-- Workbooks Section -->\n    <section id="assignments" class="bg-light-gray">\n        <div class="container">\n            <div class="row">\n                <div class="col-lg-10 col-lg-offset-1">\n                    <h2 class="section-heading text-center">Assignments</h2>\n                    <p class="large text-muted"> The course will consist of three main types of assignments.  First, there will be biweekly homes that work through the  material presented in class, and require students to implement or evaluate relevant algorithms.  These workbooks will be distributed as Jupyter notebooks, and will be submitted for the course via Autolab.  In addition to these assignments, students will themselves develop a new tutorial workbook to teach an advanced topic, and will also work through the content developed by at least two other students.  
Finally, students will complete a final class project (in groups), which will be a chance to apply these data science techniques to a problem of the group\'s choosing.\n                    </p>\n\n                    <p class="large text-muted"> There is no midterm or final in the course.  All assignments will be posted to this page as they are available.\n                    </p>\n                </div>\n            </div>\n        </div>\n    </section>\n\n    <!-- Instructors Grid Section -->\n    <section id="instructors">\n        <div class="container">\n            <div class="row">\n                <div class="col-lg-12 text-center">\n                    <h2 class="section-heading">Instructors</h2>\n                </div>\n            </div>\n            <div class="row row-eq-height">\n                <div class="col-sm-6">\n                    <div class="team-member">\n                        <img src="img/team/zkolter.png" class="img-responsive img-circle" alt=""/>\n                        <h4><a href="http://www.zicokolter.com/">Zico Kolter</a></h4>\n                        <p class="text-muted">Assistant Professor</p>\n                        <p class="text-muted">Office lunches: See Piazza</p>\n                    </div>\n                </div>\n                <div class="col-sm-6">\n                    <div class="team-member">\n                        <img src="img/team/ericwong.png" class="img-responsive img-circle" alt=""/>\n                        <h4><a href="http://cs.cmu.edu/~ericwong/">Eric\n                            Wong</a></h4>\n                        <p class="text-muted">Teaching Assistant</p>\n                        <p class="text-muted">Office hours: Tuesday and Thursday 3-4pm Gates 8th Floor Kitchen area</p>\n                    </div>\n                </div>\n            </div>\n            <div class="row row-eq-height">\n                <div class="col-sm-6">\n                    <div class="team-member">\n                    
    <img src="img/team/dhivya.png" class="img-responsive img-circle" alt=""/>\n                        <h4><a href="http://www.cs.cmu.edu/~deswaran/">Dhivya Eswaran</a></h4>\n                        <p class="text-muted">Teaching Assistant</p>\n                        <p class="text-muted">Office hours: Monday 3:30pm-4:30pm, Fridays 9am-10am GHC 6008</p>\n                    </div>\n                </div>\n            </div>\n        </div>\n    </section>\n    \n    <!-- FAQ Section -->\n    <section id="faq" class="bg-light-gray">\n        <div class="container">\n            <div class="row">\n                <div class="col-lg-10 col-lg-offset-1">\n                    <h2 class="section-heading text-center">FAQ</h2>\n                    <p class="large text-muted"><b>Q: What is the situation will all the different course numbers and sections (15-388, 15-688 A/B)?</b>  <br/>\n                    A: The demand for this course has been very right: as of the start of classes there are more than 250 people registered or on the waitlist for the course.  We\'re thrilled about the level of interest, but unfortunately the only available classroom during this time fits at most 136 students. <br/>\n                    To accomodate as many as people as possible, we created a section DNM (Does Not Meet) section of the 600 level course, which is the B section of the 15-688 course.  This version is identical to the A section, except that <emph>students are expected to watch the lectures online</emph> (all lectures will be available online within a few hours after the end of class).  
We expect that attendance will shake out significantly during the first few weeks of the course, and our strong suspicious is that after the first month, there will be space in the lecture hall for anyone (from any section), to attend lectures, but we ask that until we make this clear, students in the B section not regularly attend lecture.\n                    </p>\n\n                    <p class="large text-muted"><b>Q: I really want to take the in-person 15-688 Section A.  Will I be able to get off the waitlist?</b>  <br/>\n                    See above.  We want to accomodate absolutely as many people as possible, but are ultimately limited by the classroom size (and following university policy, the undergraduate lectures have priority for in-class attendence).  However, we really want to emphasize that the courses are <emph>exactly</emph> the same except for the in-class lecture and DNM section (same credit, same homeworks, same office hours, same access to TAs/professor, same tutorial and final project assigments, etc), and there is a good chance that Section B students will end up even being able to attend lectures by the middle of the semester.  So please consider enrolling in Section B if you are in this position. \n                    </p>\n\n\n                    <p class="large text-muted"><b>Q: I\'m on the waitlist for the 15-388 version.  Will I be able to get off the waitlist?</b>  <br/>\n                    Yes, almost definitely.  
There is a relatively small waitlist for the 300 version currently, and we absolutely expect all students who stick around a few weeks to get off the waitlist.\n                    </p>\n\n                    <p class="large text-muted"><b>Q: How does the 5% class participation grade work for students in the DNM section?</b>  <br/>\n                    Given the size of this course, for all students (including 15-388/15-688A students) the class participation grade will be based upon participating in the course forums on <a href="https://piazza.com/class/is8ghyly8um4ew">Piazza</a>, not upon speaking up in class.\n                    </p>\n\n\n                    <p class="large text-muted"> \n                    <b>Q: Is there a pointer to materials from a previous year?</b>\n                    <br/>\n                    A: No, Fall 2016 is the first time this course is being offered, so we don\'t have past materials to look over.  Feel free to contact the instructors if you have questions about the content of the course which aren\'t answered here.\n                    </p>\n\n                    <p class="large text-muted"> \n                    <b>Q: Does this course count toward the MSCS AI requirement?</b>\n                    <br/>\n                    A: (Corrected version).  We still need to confirm whether this will be the case or not.  
We\'ll update the website and soon as we know, and please email us if this situation affects you.\n                    </p>\n                    \n\n                    <p class="large text-muted"> \n                    <b>Q: Will the course be offered during the spring semester?</b>\n                    <br/>\n                    A: No, the soonest the course will be offered again is during Fall 2017.\n                    </p>\n                    \n\n                    <p class="large text-muted"> \n                    <b>Q: Will this course focus mainly on applying techniques from existing libraries to practical data science problems, or writing the underlying algorithms from scratch?</b>\n                    <br/>\n                    A: Both, to a certain extent.  There will be plenty of focus on applying algorithms (often best used through existing libraries) to practical problems, but these libraries can be used more effectively when you understand the underlying algorithms well enough to implement them yourself.  So, at least for the more straightforward algorithms that we cover, you will be implementing these yourselves.  The 688 level course assignment will do a bit more of this underlying implementation than the 388 level course. \n                    </p>\n\n                    <p class="large text-muted"> \n                    <b>Q: I have a question that wasn\'t asked here. </b>\n                    <br/>\n                    A: Come to office hours, or ask on the course <a href="https://piazza.com/class/is8ghyly8um4ew">Piazza</a>. 
\n                    </p>\n                </div>\n            </div>\n        </div>\n    </section>\n\n\n\n    <footer class="bg-light-gray">\n        <div class="container">\n            <div class="row">\n                <div class="col-lg-12">\n                    <span class="copyright">Copyright &copy; Carnegie\n                        Mellon University 2016</span>\n                </div>\n            </div>\n        </div>\n    </footer>\n\n\n    <!-- jQuery -->\n    <script src="js/jquery.js"></script>\n\n    <!-- Bootstrap Core JavaScript -->\n    <script src="js/bootstrap.min.js"></script>\n\n    <!-- Plugin JavaScript -->\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery-easing/1.3/jquery.easing.min.js"></script>\n    <script src="js/classie.js"></script>\n    <script src="js/cbpAnimatedHeader.js"></script>\n\n    <!-- Contact Form JavaScript -->\n    <script src="js/jqBootstrapValidation.js"></script>\n    <script src="js/contact_me.js"></script>\n\n    <!-- Custom Theme JavaScript -->\n    <script src="js/agency.js"></script>\n\n</body>\n\n</html>\n'
"""
print(course_webpage)

b'<!DOCTYPE html>
<html lang="en">

<head>

    <meta charset="utf-8"/>
    <meta http-equiv="X-UA-Compatible" content="IE=edge"/>
    <meta name="viewport" content="width=device-width, initial-scale=1"/>
    <meta name="description" content=""/>
    <meta name="author" content=""/>

    <title>15-388/688 Practical Data Science</title>

    <!-- Bootstrap Core CSS -->
    <link href="css/bootstrap.min.css" rel="stylesheet"/>

    <!-- Custom CSS -->
    <link href="css/agency.css" rel="stylesheet"/>

    <!-- Eric's CSS -->
    <link href="css/eric.css" rel="stylesheet"/>

    <!-- Custom Fonts -->
    <link href="font-awesome/css/font-awesome.min.css" rel="stylesheet" type="text/css"/>
    <link href="https://fonts.googleapis.com/css?family=Montserrat:400,700" rel="stylesheet" type="text/css"/>
    <link href='https://fonts.googleapis.com/css?family=Kaushan+Script' rel='stylesheet' type='text/css'/>
    <link href='https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic,

When you run the cell below (make a copy of it first so you don't accidentally overwrite it), it should produce the following output.  

Example output:
```python
tag_open:  [('note', ' date="8/31/12"'), ('to', ''), ('from', ''), ('heading', ' type="Reminder"/'), ('body', '')]
tag_close:  [('to', ''), ('from', ''), ('body', ''), ('note', '')]
tag_open_close:  [('heading', ' type="Reminder"')]
comment:  [' not actually valid xml', ' This is a comment ', ' This is a multiline comment,\n         which take a bit of care to parse ']
xml_prolog:  ['xml version="1.0" encoding="UTF-8"']
html_declaration:  ['DOCTYPE xml']
```

In [14]:
print("tag_open: ", tag_open.findall(test_snippet))
print("tag_close: ", tag_close.findall(test_snippet))
print("tag_open_close: ", tag_open_close.findall(test_snippet))
print("comment: ", comment.findall(test_snippet))
print("xml_prolog: ", xml_prolog.findall(test_snippet))
print("html_declaration: ", html_declaration.findall(test_snippet))

for m in re.finditer(tag_open,test_snippet):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.groups()))
for m in re.finditer(tag_close,test_snippet):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

tag_open:  [('note', ' date="8/31/12"'), ('to', ''), ('from', ''), ('heading', ' type="Reminder"/'), ('body', '')]
tag_close:  [('to', ''), ('from', ''), ('body', ''), ('note', '')]
tag_open_close:  [('heading', ' type="Reminder"')]
comment:  [' not actually valid xml', ' This is a comment ', ' This is a multiline comment,\n         which take a bit of care to parse ']
xml_prolog:  ['xml version="1.0" encoding="UTF-8"']
html_declaration:  ['DOCTYPE xml']
112-133: ('note', ' date="8/31/12"')
138-142: ('to', None)
156-162: ('from', None)
178-204: ('heading', ' type="Reminder"/')
209-215: ('body', None)
146-151: </to>
166-173: </from>
244-251: </body>
337-344: </note>


Similarly, the cell below should produce the following counts. 

Example output:
```python
tag_open:  469
tag_close:  439
tag_open_close:  30
comment:  23
xml_prolog:  0
html_declaration:  2
```

In [4]:
# [NOTE] Comment this out prior to submission
print("tag_open: ", len(tag_open.findall(course_webpage)))
print("tag_close: ", len(tag_close.findall(course_webpage)))
print("tag_open_close: ", len(tag_open_close.findall(course_webpage)))
print("comment: ", len(comment.findall(course_webpage)))
print("xml_prolog: ", len(xml_prolog.findall(course_webpage)))
print("html_declaration: ", len(html_declaration.findall(course_webpage)))


tag_open:  469
tag_close:  439
tag_open_close:  30
comment:  22
xml_prolog:  0
html_declaration:  1


(Note that although there is only one HTML declaration in the course webpage, there is a field _within_ a comment that looks suspiciously like a declaration, so your expression may pick up both of these.)

## Q2: XML Parser

Using the regular expressions above, you'll now write an XML parser.  (Technically you don't _have_ to use them: you could try to write a complete XML parser using a single extended regular expression if you really want to, but we would highly advise against this.)  Specifically, you should fill in the `__init__` function for the class prototype below.

In [13]:
#### Q2: fill in this cell
class XMLNode:
    
    def __init__(self, tag, attributes, content):
        self.tag = tag
        self.attributes = attributes
        self.children =  []
        self.content = content
        for m in re.finditer(tag_open,content):
            self.children.append(XMLNode(m.group(1), m.group(2), content[m.end():]))

root = XMLNode("", {}, test_snippet)
print("root.tag: ", root.tag)
print("root.attributes: ", root.attributes)
print("root.content: ", repr(root.content))
print("root.children: ", root.children)

root.tag:  
root.attributes:  {}
root.content:  '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE xml> <!-- not actually valid xml-->\n<!-- This is a comment -->\n<note date="8/31/12">\n    <to>Tove</to>\n    <from>Jani</from>\n    <heading type="Reminder"/>\n    <body>Don\'t forget me this weekend!</body>\n    <!-- This is a multiline comment,\n         which take a bit of care to parse -->\n</note>\n'
root.children:  [<__main__.XMLNode object at 0x10af2a5f8>, <__main__.XMLNode object at 0x10af27588>, <__main__.XMLNode object at 0x10af27940>, <__main__.XMLNode object at 0x10af27b00>, <__main__.XMLNode object at 0x10af27b70>]


There is a lot that this function needs to do, which is best explained by an example.  We'll eventually parse the `test_snippet` above using the command:
    
    root = XMLNode("", {}, test_snippet)

This will create a root node with an empty tag, an empty dictionary for the attributes, and children created by parsing `test_snippet`; the `content` attribute here will contain the entire test snippet.

The `children` attribute is a list that contains a single XMLNode instance, corresponding to the `note` tag.  You don't create this instance yourself: each node must recursively create all its children, so the call above recursively makes the following call to create its child:

    XMLNode("note", {"date":"8/31/12"}, test_snippet[133:])
    # (test_snippet[133:] just happens to be the position that immediately follows the <note> open tag)
    
This child node will have the given tag and attributes, but its `content` will be only the string that occurs _within_ the note open and close tags.  It will similarly have four children, one corresponding to each of the subtags.  The following code illustrates how you can use XMLNode to parse `test_snippet`, and what the structure is after you perform the parsing.

Example output:
```python
root.tag:  
root.attributes:  {}
root.content:  '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE xml> <!-- not actually valid xml-->\n<!-- This is a comment -->\n<note date="8/31/12">\n    <to>Tove</to>\n    <from>Jani</from>\n    <heading type="Reminder"/>\n    <body>Don\'t forget me this weekend!</body>\n    <!-- This is a multiline comment,\n         which take a bit of care to parse -->\n</note>\n'
root.children:  [<__main__.XMLNode instance at 0x10425c0e0>]

note.tag:  note
note.attributes:  {'date': '8/31/12'}
note.content:  '\n    <to>Tove</to>\n    <from>Jani</from>\n    <heading type="Reminder"/>\n    <body>Don\'t forget me this weekend!</body>\n    <!-- This is a multiline comment,\n         which take a bit of care to parse -->\n'
note.children:  [<__main__.XMLNode instance at 0x10425c098>, <__main__.XMLNode instance at 0x10425c128>, <__main__.XMLNode instance at 0x10425c368>, <__main__.XMLNode instance at 0x10425c488>]

to.tag:  to
to.attributes:  {}
to.content:  'Tove'
to.children:  []

heading.tag:  heading
heading.attributes:  {'type': 'Reminder'}
heading.content:  ''
heading.children:  []
```

In [6]:
root = XMLNode("", {}, test_snippet)

print("root.tag: ", root.tag)
print("root.attributes: ", root.attributes)
print("root.content: ", repr(root.content))
print("root.children: ", root.children)
print("")
print("note.tag: ", root.children[0].tag)
print("note.attributes: ", root.children[0].attributes)
print("note.content: ", repr(root.children[0].content))
print("note.children: ", root.children[0].children)
print("")
print("to.tag: ", root.children[0].children[0].tag)
print("to.attributes: ", root.children[0].children[0].attributes)
print("to.content: ", repr(root.children[0].children[0].content))
print("to.children: ", root.children[0].children[0].children)
print("")
print("heading.tag: ", root.children[0].children[2].tag)
print("heading.attributes: ", root.children[0].children[2].attributes)
print("heading.content: ", repr(root.children[0].children[2].content))
print("heading.children: ", root.children[0].children[2].children)

root.tag:  
root.attributes:  {}
root.content:  '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE xml> <!-- not actually valid xml-->\n<!-- This is a comment -->\n<note date="8/31/12">\n    <to>Tove</to>\n    <from>Jani</from>\n    <heading type="Reminder"/>\n    <body>Don\'t forget me this weekend!</body>\n    <!-- This is a multiline comment,\n         which take a bit of care to parse -->\n</note>\n'
root.children:  []



IndexError: list index out of range

Additionally, if you pass an XML object that is poorly formatted, in that an open tag is closed by a mismatched close tag, the function should raise an exception. Note: comment out this cell prior to submission.

Example output: 
```python
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-68-f654067d17d4> in <module>()
----> 1 root = XMLNode("", {}, "<note><to>You</from></note>")

<ipython-input-67-8c4c856beda8> in __init__(self, tag, attributes, content)
     37                 else:
     38                     # if it's an open tag, parse it recursively
---> 39                     self.children.append(XMLNode(m.group(1), attributes, content[m.end():]))
     40                     pos = m.end() + self.children[-1].endpos
     41                 continue

<ipython-input-67-8c4c856beda8> in __init__(self, tag, attributes, content)
     37                 else:
     38                     # if it's an open tag, parse it recursively
---> 39                     self.children.append(XMLNode(m.group(1), attributes, content[m.end():]))
     40                     pos = m.end() + self.children[-1].endpos
     41                 continue

<ipython-input-67-8c4c856beda8> in __init__(self, tag, attributes, content)
     45             if m:
     46                 if m.group(1) != tag:
---> 47                     raise Exception("Error: <{0}> tag closed with {1}".format(tag, m.group()))
     48                 else:
     49                     self.content = self.content[:m.start()]

Exception: Error: <to> tag closed with </from>
```

In [None]:
# [NOTE] Comment this out prior to submission
# root = XMLNode("", {}, "<note><to>You</from></note>")

Finally, your code should also be able to parse the course webpage (which we made sure was valid XML).

Example output (the total count may vary slightly based on changes we make to the website): 
```python
>>> print(total_count(root))
345
```

In [None]:
def total_count(n):
    """ Gets the total number of nodes in an XMLNode tree. """
    return len(n.children) + sum(total_count(c) for c in n.children)
# [NOTE] Comment this out prior to submission
# root = XMLNode("", {}, course_webpage)
# print(total_count(root))


Let's discuss in a bit more detail how the XML parsing will work algorithmically.  You begin the initializer by copying the provided parameters to the class attributes.  Note that if you want you could make a full string copy here, but we don't bother.  Now we begin parsing the file, which we do by repeating the following logic until termination:
1. Look for the next XML tag (or comment, etc.) in the file.  This is best done by finding the next `'<'` character.  If you can't find any, return.
2. If it's an XML prolog, HTML declaration, or comment, ignore this portion and continue parsing after it (i.e., throw away whatever information these portions contain).
3. If it's an open tag, read its tag and attributes (you'll likely want to use a regular expression to parse the attributes as well, but we leave this up to you).  If it's just an open tag, then recursively create an XMLNode object initialized with this tag and attributes, and the content that occurred after the open tag.  If it's an open/close tag, create an XMLNode the same as before but with empty content.  In both cases, append the new node to `children` and continue parsing from the point where the child finished.
4. If it's a close tag, make sure that it matches the tag originally provided to the current XMLNode constructor (otherwise, one tag is being closed with a different tag), and raise an Exception if not.  If the tags do match, then truncate the content to contain only the text before the matched close tag, and return.
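
The four steps above can be sketched roughly as follows.  This is only an illustration under stated assumptions, not the reference solution: the class name `SketchNode`, the `end` attribute, and the four patterns (`SKIP`, `OPEN`, `CLOSE`, `ATTR`) are hypothetical stand-ins, and your own Q1 expressions will likely differ.

```python
import re

# Hypothetical patterns for illustration; the Q1 expressions may look different.
SKIP = re.compile(r"<\?.*?\?>|<!--.*?-->|<![^>]*>", re.DOTALL)  # prolog, comment, declaration
OPEN = re.compile(r"<([A-Za-z_][\w:.-]*)((?:\s+[^<>]*?)?)\s*(/?)>", re.DOTALL)
CLOSE = re.compile(r"</([A-Za-z_][\w:.-]*)\s*>")
ATTR = re.compile(r"""([\w:.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


class SketchNode:
    def __init__(self, tag, attributes, content):
        self.tag = tag
        self.attributes = attributes
        self.children = []
        self.content = content
        self.end = len(content)  # offset in `content` just past this node's close tag
        pos = 0
        while True:
            # Step 1: jump to the next '<'; stop when there is none.
            pos = content.find("<", pos)
            if pos < 0:
                return
            # Step 2: skip prologs, comments, and declarations.
            m = SKIP.match(content, pos)
            if m:
                pos = m.end()
                continue
            # Step 4: a close tag must match the tag this node was opened with.
            m = CLOSE.match(content, pos)
            if m:
                if m.group(1) != tag:
                    raise Exception("Error: <{0}> tag closed with {1}".format(tag, m.group()))
                self.content = content[:m.start()]
                self.end = m.end()
                return
            # Step 3: an open tag (or open/close tag) creates a child node.
            m = OPEN.match(content, pos)
            if not m:
                raise Exception("Error: unrecognized markup at position {0}".format(pos))
            attrs = {name: dq or sq for name, dq, sq in ATTR.findall(m.group(2))}
            if m.group(3):  # open/close tag such as <heading .../>: empty content
                child = SketchNode(m.group(1), attrs, "")
                pos = m.end()
            else:           # plain open tag: the child parses up to its own close tag
                child = SketchNode(m.group(1), attrs, content[m.end():])
                pos = m.end() + child.end
            self.children.append(child)


root = SketchNode("", {}, '<note date="8/31/12"><to>Tove</to><heading type="Reminder"/></note>')
note = root.children[0]
print(note.attributes, [c.tag for c in note.children])
```

Slicing `content[m.end():]` for every child copies the remaining text, which keeps the sketch close to the description above at the cost of some efficiency; passing a start position instead would avoid the copies.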

Some hints that we believe will be helpful:
1. Keep track of the current position where you are parsing the file, and make sure to properly increment this so you move past any tag that you have parsed.
2. Make use of the `match = regular_expression_obj.match(string, pos)` function, which looks for a match to the regular expression starting _exactly_ at position `pos` in `string`.  If this function returns `None`, then the regular expression did not match.  In the returned `match` object, `match.end()` contains the position where the match ended.
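
As a small illustration of the second hint, the snippet below contrasts `match` at a given position with `search`.  The pattern `simple_open` and the string `text` are made up for this example and are not part of the assignment.

```python
import re

simple_open = re.compile(r"<(\w+)>")  # hypothetical pattern, for illustration only
text = "ab<to>Tove</to>"

# match() only succeeds if the pattern matches starting exactly at pos.
assert simple_open.match(text) is None      # position 0 holds 'a', not '<'
assert simple_open.match(text, 3) is None   # position 3 is in the middle of the tag

m = simple_open.match(text, 2)              # position 2 is where '<to>' begins
assert m.group(1) == "to"
assert m.end() == 6                         # parsing can resume from here

# search() scans forward instead, so it finds the same tag from position 0.
assert simple_open.search(text).start() == 2
```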

## Q3: Searching for tags

One of the nicer elements of the `BeautifulSoup` library is the ability to quickly search for tags that have certain attributes, without worrying about the specific structure of the model (i.e., how many levels deep the tag is, how many may exist in the document etc).  We're going to implement a similar function in our `XMLNode` class, specifically a function of the following form.

In [None]:
#### Q3: fill out this function and move it into the XMLNode class above. [NOTE] Do not leave this here for your final submission. 
    def find(self, tag, **kwargs):
        """
        Search for a given tag and attributes anywhere in the XML tree
        
        Args:
            tag (string): tag to match
            kwargs (dictionary): list of attribute name / attribute value pairs to match
            
        Returns:
            (list): a list of XMLNode objects that match from anywhere in the tree
        """


For those who haven't seen the `**kwargs` parameter before, this is just a way to pass a variable number of keyword arguments to a Python function.  For example, you could call `find` via

    root.find("link", rel="stylesheet")
    
and in the `find` function, `kwargs` would be a dictionary equal to `{"rel":"stylesheet"}`.
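
To see this behavior in isolation, here is a minimal sketch using a throwaway function (`show_kwargs` is a hypothetical name, not part of the assignment) that simply returns what it receives.

```python
def show_kwargs(tag, **kwargs):
    """Return the positional tag and the collected keyword arguments."""
    return tag, kwargs

# Keyword arguments are gathered into an ordinary dictionary.
assert show_kwargs("link", rel="stylesheet") == ("link", {"rel": "stylesheet"})

# With no keyword arguments, the dictionary is simply empty.
assert show_kwargs("a") == ("a", {})

# A dictionary can also be unpacked into keyword arguments, which helps for
# attribute names such as "class" that cannot be written as Python keywords.
filters = {"class": "team-member"}
assert show_kwargs("div", **filters) == ("div", {"class": "team-member"})
```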

This function should return a list of _all_ XMLNodes that match the given tag and attributes and are descendants (children, children of children, etc.) of the node you call it on.  The following contains some examples of how this function can be used.

Example output: 
```python
['#page-top', '#page-top', '#overview', '#schedule', '#assignments', '#instructors', '#faq', '#overview', 'intro.pdf', 'https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=3f116819-0e26-4f79-bcee-f4cb7e745773', 'http://www.zicokolter.com/', 'http://cs.cmu.edu/~ericwong/', 'https://piazza.com/class/is8ghyly8um4ew', 'https://piazza.com/class/is8ghyly8um4ew']
['8/29', '8/31', '9/7', '9/14', '9/19', '9/21', '9/26', '9/28', '10/3', '10/5', '10/10', '10/12', '10/17', '10/19', '10/24', '10/26', '10/31', '11/2', '11/7', '11/9', '11/14', '11/16', '11/21', '11/28', '11/30', '12/5', '12/7']
```

In [None]:
# [NOTE] Comment this out prior to submission
# 
# Get a list of all links on the page
# links = root.find("a")
# print([l.attributes["href"] for l in links])
# 
# Get a list of all lecture dates for the course
# tbody = root.find("section", id="schedule")[0].find("table")[0].find("tbody")[0]
# print([a.find("td")[0].content for a in tbody.find("tr") if len(a.find("td")) > 1])
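
One possible shape for such a search is a depth-first walk over the children, sketched below on a stripped-down stand-in class.  `TinyNode` is a hypothetical class with only the fields the search needs; it is an assumption for illustration, not the assignment's `XMLNode`, and other traversal orders would also be reasonable.

```python
class TinyNode:
    """Hypothetical stand-in for XMLNode with only tag, attributes, and children."""

    def __init__(self, tag, attributes=None, children=None):
        self.tag = tag
        self.attributes = attributes or {}
        self.children = children or []

    def find(self, tag, **kwargs):
        matches = []
        for child in self.children:
            # A child matches if its tag agrees and every requested attribute is equal.
            if child.tag == tag and all(
                child.attributes.get(name) == value for name, value in kwargs.items()
            ):
                matches.append(child)
            # Keep searching below the child whether or not it matched itself.
            matches.extend(child.find(tag, **kwargs))
        return matches


tree = TinyNode("", children=[
    TinyNode("div", {"id": "a"}, [
        TinyNode("a", {"href": "x"}),
        TinyNode("div", {"id": "b"}, [TinyNode("a", {"href": "y"})]),
    ]),
    TinyNode("a", {"href": "z"}),
])
print([n.attributes["href"] for n in tree.find("a")])
```

Visiting each child before its own descendants returns matches in document order, which is the order the link list in the example output above appears to follow.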