# Great Schools Web scraping using Python

## Introduction
GreatSchools is the leading nonprofit providing high-quality information that supports parents pursuing a great education for their child, schools striving for excellence, and communities working to diminish inequities in education. The website provides ratings and comparison tools based on student growth, college readiness, equity, and test scores for public schools in the U.S. Of all the school rating sites, GreatSchools is the only one that factors student progress — or growth — into its methodology. This is significant as research has shown growth data to provide a more accurate representation of student progress and school quality. As of July 2017, the GreatSchools database contains information for more than 138,000 public, private, and charter schools in the United States.

Our paper covers some basic information of school in New York, NY scraped from the website and the code to scrape it, including utilizing packages Selenium to get the data needed.


### I. Web scraping <a class="anchor" id="sub_section_1_1_1"></a>
Import the necessary libraries

In [148]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
import numpy as np

Let's make a request and get the code of the whole page.

In [40]:
driver = webdriver.Chrome()
driver.get("https://www.greatschools.org/new-york/new-york/schools/?distance=5&gradeLevels%5B%5D=e&gradeLevels%5B%5D=m&gradeLevels%5B%5D=h&lat=40.6906&locationType=city&lon=-73.9488&page=1&view=table")
driver.implicitly_wait(0.5)

In [5]:
driver.title

'New York Elementary Schools, 1-25 - New York, NY | GreatSchools'

Let's create the list of all the school data we scrape from website:

In [121]:
greatschool = []

Create variables of data needed from the website including basic information of schools, ratings and reviews.

In [122]:
def gettable(page):
    url = f'https://www.greatschools.org/new-york/new-york/schools/?distance=5&gradeLevels%5B%5D=e&gradeLevels%5B%5D=m&gradeLevels%5B%5D=h&lat=40.6906&locationType=city&lon=-73.9488&page={page}&view=table'
    driver = webdriver.Chrome()
    driver.get(url)
    table = driver.find_element(By.TAG_NAME, 'tbody')
    lists = table.find_elements(By.TAG_NAME, 'tr')
    for i in lists:
        info = {
            'School': i.find_element(By.TAG_NAME,'a').text,
            'Details': i.find_element(By.TAG_NAME,'a').get_attribute('href'),
            'Address': i.find_element(By.CLASS_NAME,"address").text,
            'Summary Rating': i.find_element(By.CSS_SELECTOR, '[class="rating-container"]').text.split('\n')[0],
            'Rating Class': i.find_element(By.CLASS_NAME,"scale").text,
            'Type': i.find_elements(By.TAG_NAME,'td')[1].text,
            'Grades': i.find_elements(By.TAG_NAME,'td')[2].text,
            'Total students enrolled': i.find_elements(By.TAG_NAME,'td')[3].text,
            'Students per teacher': i.find_elements(By.TAG_NAME,'td')[4].text,
            'Number of Reviews': i.find_elements(By.TAG_NAME,'td')[5].text.split('\n')[0],
            'Reviews': i.find_elements(By.TAG_NAME,'td')[5].text.split('\n')[-1],
        }
    driver.find_element(By.CSS_SELECTOR, '[aria-label="Academic"]').click()
    table1 = driver.find_element(By.TAG_NAME, 'tbody')
    lists1 = table1.find_elements(By.TAG_NAME, 'tr')
    for i in lists1:
        info.update({
            'Test Scores Rating': i.find_elements(By.TAG_NAME,'td')[1].text.split('\n')[0],
            'Student Progress Rating': i.find_elements(By.TAG_NAME,'td')[2].text.split('\n')[0],
            'College Readiness Rating': i.find_elements(By.TAG_NAME,'td')[3].text.split('\n')[0],
            'Equity Overview Rating': i.find_elements(By.TAG_NAME,'td')[4].text.split('\n')[0],
        })
        greatschool.append(info)
    return

We will get the first 20 pages of Schools list near New York, NY sorted by GreatSchools Rating with the school Level from elementary to high school. As each page contains data of 25 schools, we would scrape data from top 500 schools from website.

In [123]:
for x in range(1,21):
    gettable(x)

In [139]:
len(greatschool)

500

Now get some look at the preview of data we have scraped.

In [140]:
df = pd.DataFrame(greatschool)
df

Unnamed: 0,School,Details,Adress,Summary Rating,Rating Class,Type,Grades,Total students enrolled,Students per teacher,Number of Reviews,Reviews,Test Scores Rating,Student Progress Rating,College Readiness Rating,Equity Overview Rating
0,Bedford Stuyvesant New Beginnings Charter School,https://www.greatschools.org/new-york/brooklyn...,"82 Lewis Avenue, Brooklyn, NY, 11206",9,Above average,Public charter,K-8,702,10:1,3 Reviews,2.7,8,8,,10
1,Bedford Stuyvesant New Beginnings Charter School,https://www.greatschools.org/new-york/brooklyn...,"82 Lewis Avenue, Brooklyn, NY, 11206",9,Above average,Public charter,K-8,702,10:1,3 Reviews,2.7,8,8,,10
2,Bedford Stuyvesant New Beginnings Charter School,https://www.greatschools.org/new-york/brooklyn...,"82 Lewis Avenue, Brooklyn, NY, 11206",9,Above average,Public charter,K-8,702,10:1,3 Reviews,2.7,8,8,,10
3,Bedford Stuyvesant New Beginnings Charter School,https://www.greatschools.org/new-york/brooklyn...,"82 Lewis Avenue, Brooklyn, NY, 11206",9,Above average,Public charter,K-8,702,10:1,3 Reviews,2.7,8,8,,10
4,Bedford Stuyvesant New Beginnings Charter School,https://www.greatschools.org/new-york/brooklyn...,"82 Lewis Avenue, Brooklyn, NY, 11206",9,Above average,Public charter,K-8,702,10:1,3 Reviews,2.7,8,8,,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,Ps 5 Dr Ronald Mcnair,https://www.greatschools.org/new-york/brooklyn...,"820 Hancock Street, Brooklyn, NY, 11233",3,Below average,Public district,"PK, K-5 & Ungraded",213,11:1,5 Reviews,3.8,5,1,,4
496,Ps 5 Dr Ronald Mcnair,https://www.greatschools.org/new-york/brooklyn...,"820 Hancock Street, Brooklyn, NY, 11233",3,Below average,Public district,"PK, K-5 & Ungraded",213,11:1,5 Reviews,3.8,5,1,,4
497,Ps 5 Dr Ronald Mcnair,https://www.greatschools.org/new-york/brooklyn...,"820 Hancock Street, Brooklyn, NY, 11233",3,Below average,Public district,"PK, K-5 & Ungraded",213,11:1,5 Reviews,3.8,5,1,,4
498,Ps 5 Dr Ronald Mcnair,https://www.greatschools.org/new-york/brooklyn...,"820 Hancock Street, Brooklyn, NY, 11233",3,Below average,Public district,"PK, K-5 & Ungraded",213,11:1,5 Reviews,3.8,5,1,,4


### II. Data Cleaning 
The dataframe is maybe not in the format we want. To clean it up, we should check columns format, table shape and null values.

In [141]:
df.shape

(500, 15)

In [150]:
df.replace('N/A', np.nan, inplace=True)
df.isnull().sum()

School                        0
Details                       0
Adress                        0
Summary Rating                0
Rating Class                  0
Type                          0
Grades                        0
Total students enrolled       0
Students per teacher          0
Number of Reviews             0
Reviews                       0
Test Scores Rating            0
Student Progress Rating     125
College Readiness Rating    325
Equity Overview Rating        0
dtype: int64

As we see, there are some null values in Student Progress Rating and College Readiness Rating.

In [157]:
df.dtypes

School                      object
Details                     object
Adress                      object
Summary Rating              object
Rating Class                object
Type                        object
Grades                      object
Total students enrolled     object
Students per teacher        object
Number of Reviews           object
Reviews                     object
Test Scores Rating          object
Student Progress Rating     object
College Readiness Rating    object
Equity Overview Rating      object
dtype: object

We should convert some columns into numeric format for data analysis easier.

In [161]:
df[['Summary Rating','Total students enrolled','Reviews','Test Scores Rating','Student Progress Rating','College Readiness Rating','Equity Overview Rating']]=df[['Summary Rating','Total students enrolled','Reviews','Test Scores Rating','Student Progress Rating','College Readiness Rating','Equity Overview Rating']].apply(pd.to_numeric)

In [163]:
df.dtypes

School                       object
Details                      object
Adress                       object
Summary Rating                int64
Rating Class                 object
Type                         object
Grades                       object
Total students enrolled       int64
Students per teacher         object
Number of Reviews            object
Reviews                     float64
Test Scores Rating            int64
Student Progress Rating     float64
College Readiness Rating    float64
Equity Overview Rating        int64
dtype: object

### III. Export and Data summary <a class="anchor" id="sub_section_1_1_1"></a>

In [164]:
df.to_csv('Top 500 Schools in New York, NY | GreatSchools.csv')

This is dataset: 
- Dataset Structure: 500 observations (rows), 15 features (variables)
- Missing Data: 450 missing data total in all columns
- Data Type: two datatypes in this dataset: objects and integers


| Column | Description | 
| :---: | :--- |
| School | Name of school in the list. |
| Details | Link to the detailed school information and reviews. |
| Address | Address of the school. |
| Summary Rating | an overall snapshot of school quality. Ratings follow a 1-10 scale, where top-rated schools are "10s". |
| Rating Class | Ratings between 1-4 signal “below average”; 5-6 indicate “average”; ratings of 7-10 are “above average” |
| Type | School Type including: Public district, Public Charter, Private. |
| Grades | Grade or school level the school providing. |
| Total students enrolled | Total number of all students enrolled in a campuses of a school as the last update. |
| Students per teacher | The number of students who attend a school divided by the number of teachers in the institution. |
| Number of Reviews | Number of reviews from current students or parent/guardians. |
| Reviews | Overall school rate from current students or parent/guardians. |
| Test Scores Rating | Reflect annual state test results for this school compared with scores statewide. |
| Student Progress Rating | Comparing the academic progress over time for this school to all schools in the state, using student growth data provided by the state Department of Education. |
| College Readiness Rating | This shows how well students at this school are prepared for college compared to students at other schools in this state, based on key measures, like graduation rates, college entrance tests and AP coursework when available. |
| Equity Overview Rating | This looks at how well this school is serving the needs of its disadvantaged students relative to all its students, compared to other schools in the state, based on test scores provided from the state Department of Education. |