# Web Scraping Notebook

---

### Import Modules & Read in Data Frame

In [1]:
import json
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

In [5]:
df = pd.read_csv('/Users/kevinmacmat/Documents/flatiron/module_projects/capstone/csv/sqr_no_comments.csv')

---

### District Borough Number (DBN) List

Create a list of all DBNs to append to the end of the https://insideschools.org/school/ URL.

In [6]:
dbn_list = list(df.dbn)

Slice dbn_list to scrape and save in batches; scraping too many schools in a single run caused issues.

In [7]:
dbn_list = dbn_list[:100]
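
The manual slice above can be generalized. The following is a minimal sketch, assuming a hypothetical helper named `batches` and placeholder DBN values (neither is part of the original notebook), that splits a list into consecutive slices so each one can be scraped and saved separately:

```python
def batches(items, size):
    """Yield (start_index, slice) pairs, each slice holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


# Placeholder values for illustration only; the notebook would pass list(df.dbn)
sample_dbns = ['AAA001', 'AAA002', 'AAA003', 'AAA004', 'AAA005']

for start, batch in batches(sample_dbns, 2):
    # Each batch could be scraped and written to its own CSV file
    print(start, batch)
```

Returning the start index alongside each slice makes it easy to name the output file after the range it covers.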

---

### Scrape

Use Selenium's headless mode so that a browser window does not open for every school's page. Selenium needs the path to the downloaded chromedriver executable in order to work.

In [10]:
# Create an instance of ChromeOptions
options = webdriver.ChromeOptions()
# Run in headless mode
options.add_argument("headless")
# Instantiate the Chrome driver and pass in the file path to chromedriver
# (Selenium 4.10+ no longer accepts the path positionally; there, wrap it in a Service object and pass service=...)
driver = webdriver.Chrome('/Users/kevinmacmat/Documents/flatiron/module_projects/capstone/chromedriver', options=options)

Get comments from the past six years (the filter below keeps comments created in 2014 through 2020) and append them to output_list. The cutoff reflects the years for which the SQR is available.

In [11]:
output_list = []

In [13]:
# Walk through the steps for one sample school (DBN 01M015)
# Get website
driver.get('https://insideschools.org/school/' + '01M015')
# Switch to iframe containing script tag
driver.switch_to.frame(1)
# Grab the text
text = driver.page_source
# Switch out of iframe
driver.switch_to.default_content()

In [14]:
text



In [16]:
# Parse the page source by creating a BeautifulSoup object
soup = BeautifulSoup(text, 'lxml')
# Search the soup for the script element by its id
thread = soup.find("script", {"id": "disqus-threadData"})

In [18]:
thread

<script id="disqus-threadData" type="text/json">{"cursor":{"hasPrev":false,"prev":null,"total":21,"hasNext":false,"next":"1:0:0"},"code":0,"response":{"lastModified":1599637220,"posts":[{"editableUntil":"2020-05-18T14:13:41","dislikes":0,"thread":"207496974","numReports":0,"likes":0,"message":"\u003cp>P.S. 15 is an extraordinary small school that goes to extraordinary lengths to engage and care for their children. The principal, Irene Sanchez, leads an incredible group of attentive teachers and administrators who are dedicated to providing a safe, caring, and exciting environment for children to learn. The G&amp;T program is enrichment based (not \"accelerated\") and has re-instilled a sheer joy of learning in our child, which is a priority for us as a family. Of the many enrichments, the kids have a dedicated STEAM teacher, dedicated Science teacher, special interest classes that the children choose every Friday, a wonderful music class, and 2nd graders attend swimming once a week at 

In [19]:
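# Turn the bs4 tag into a string, strip the script tags, and parse the JSON
# (48 is the length of the opening <script id="disqus-threadData" type="text/json"> tag, 9 the length of </script>)
site_json = json.loads(str(thread)[48:-9])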
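
The fixed slice offsets above depend on the exact length of the opening and closing script tags. As an alternative that does not rely on those offsets, here is a minimal sketch using only the standard library's `html.parser`. The class `ScriptJSONExtractor`, the function `extract_script_json`, and the sample HTML are hypothetical names and data introduced for illustration; this is not the notebook's own method.

```python
import json
from html.parser import HTMLParser


class ScriptJSONExtractor(HTMLParser):
    """Collect the text inside the first <script> tag with a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self.chunks = []
        self.found = False
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and not self.found and dict(attrs).get('id') == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == 'script' and self._inside:
            self._inside = False
            self.found = True

    def handle_data(self, data):
        # Script content may arrive in several pieces, so accumulate it
        if self._inside:
            self.chunks.append(data)


def extract_script_json(page_source, script_id):
    """Return the parsed JSON held in the script tag with the given id."""
    parser = ScriptJSONExtractor(script_id)
    parser.feed(page_source)
    parser.close()
    if not parser.found:
        raise ValueError('no script tag with id ' + repr(script_id))
    return json.loads(''.join(parser.chunks))


# Made-up page source that mimics the structure shown in the notebook output
sample_source = (
    '<html><body>'
    '<script id="disqus-threadData" type="text/json">'
    '{"code": 0, "response": {"posts": [{"raw_message": "Great school"}]}}'
    '</script></body></html>'
)

site_json_sample = extract_script_json(sample_source, 'disqus-threadData')
print(site_json_sample['response']['posts'][0]['raw_message'])
```

Raising an error when the tag is missing makes it obvious which school pages have no comment thread, instead of failing later inside `json.loads`.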

In [20]:
site_json

{'cursor': {'hasPrev': False,
  'prev': None,
  'total': 21,
  'hasNext': False,
  'next': '1:0:0'},
 'code': 0,
 'response': {'lastModified': 1599637220,
  'posts': [{'editableUntil': '2020-05-18T14:13:41',
    'dislikes': 0,
    'thread': '207496974',
    'numReports': 0,
    'likes': 0,
    'message': '<p>P.S. 15 is an extraordinary small school that goes to extraordinary lengths to engage and care for their children. The principal, Irene Sanchez, leads an incredible group of attentive teachers and administrators who are dedicated to providing a safe, caring, and exciting environment for children to learn. The G&amp;T program is enrichment based (not "accelerated") and has re-instilled a sheer joy of learning in our child, which is a priority for us as a family. Of the many enrichments, the kids have a dedicated STEAM teacher, dedicated Science teacher, special interest classes that the children choose every Friday, a wonderful music class, and 2nd graders attend swimming once a wee

In [21]:

# Turn the bs4 tag into a string, remove the script tag, and access the json
site_json = json.loads(str(thread)[48:-9])
# Instantiate comments list
comments_list = []
# Years to keep
years = ('2014', '2015', '2016', '2017', '2018', '2019', '2020')
# Navigate and loop json, filtering comments by date, to append comments to comments_list
for comment in site_json['response']['posts']:
    if any(year in comment['createdAt'] for year in years):
        comments_list.append(comment['raw_message'])
# Append list of comments to output_list
output_list.append(comments_list)
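
The year filter above can also be written as a numeric range check. This is a minimal sketch, assuming `createdAt` begins with a four-digit year in the same timestamp format as the `editableUntil` field shown in the output; the helper name `comments_in_years` and the sample posts are made up for illustration.

```python
def comments_in_years(posts, first_year, last_year):
    """Return raw_message for posts created between first_year and last_year, inclusive."""
    kept = []
    for post in posts:
        # Assumption: createdAt starts with the year, e.g. '2020-05-11T14:13:41'
        year = int(post['createdAt'][:4])
        if first_year <= year <= last_year:
            kept.append(post['raw_message'])
    return kept


# Made-up posts that mimic the fields used in the notebook
sample_posts = [
    {'createdAt': '2013-09-01T10:00:00', 'raw_message': 'too old'},
    {'createdAt': '2014-01-15T08:30:00', 'raw_message': 'first kept'},
    {'createdAt': '2020-05-11T14:13:41', 'raw_message': 'last kept'},
]

print(comments_in_years(sample_posts, 2014, 2020))
```

Passing the bounds as arguments means the cutoff can change without editing the loop body.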

In [22]:
comments_list

['P.S. 15 is an extraordinary small school that goes to extraordinary lengths to engage and care for their children. The principal, Irene Sanchez, leads an incredible group of attentive teachers and administrators who are dedicated to providing a safe, caring, and exciting environment for children to learn. The G&T program is enrichment based (not "accelerated") and has re-instilled a sheer joy of learning in our child, which is a priority for us as a family. Of the many enrichments, the kids have a dedicated STEAM teacher, dedicated Science teacher, special interest classes that the children choose every Friday, a wonderful music class, and 2nd graders attend swimming once a week at Asphalt Green. There are also many field trips and hands on, project-based learning opportunities. The children have an amazing playground attached to the school and are allowed to play outside before and after school and during recess. They also have a great gym and have gym class three times a week. The 

In [8]:
# for dbn in dbn_list:
#     # Get website 
#     driver.get('https://insideschools.org/school/' + dbn)
#     # Switch to iframe containing script tag
#     driver.switch_to.frame(1)
#     # Grab the text
#     text = driver.page_source
#     # Switch out of iframe
#     driver.switch_to.default_content()
#     # Parse the page source by creating a BeautifulSoup object
#     soup = BeautifulSoup(text, 'lxml')
#     # Access the soup and find the script element's id
#     thread = soup.find("script", {"id": "disqus-threadData"})
#     # Turn the bs4 tag into a string, remove the script tag, and access the json
#     site_json=json.loads(str(thread)[48:-9])
#     # Instantiate comments list
#     comments_list = []
#     # Navigate and loop json, filtering comments by date, to append comments to comments_list
#     for comment in site_json['response']['posts']:
#         if '2014' in comment['createdAt']:
#             comments_list.append(comment['raw_message'])
#         elif '2015' in comment['createdAt']:
#             comments_list.append(comment['raw_message'])
#         elif '2016' in comment['createdAt']:
#             comments_list.append(comment['raw_message'])
#         elif '2017' in comment['createdAt']:
#             comments_list.append(comment['raw_message'])
#         elif '2018' in comment['createdAt']:
#             comments_list.append(comment['raw_message'])
#         elif '2019' in comment['createdAt']:
#             comments_list.append(comment['raw_message'])
#         elif '2020' in comment['createdAt']:
#             comments_list.append(comment['raw_message'])
#         else:
#             continue
#     # Append list of comments to output_list
#     output_list.append(comments_list)

---

### Make Data Frame

In [10]:
# Create a data frame from the list of DBNs
batch_df = pd.DataFrame(dbn_list, columns=['dbn'])

In [11]:
# Add the comments from the most recently scraped batch to the comments column
batch_df['comments'] = output_list

---

### Convert data frame to CSV and export

In [13]:
batch_df.to_csv('batch_1000-end_comments.csv', index=False)