# Data Collection 
For this project I used BeautifulSoup, PyPDF2, and pdftotext to both scrape data from the web as well as convert PDF data into txt files. My primary focus in collection were the science standards written for NGS, aligned NGS standards and unaligned standards. In addition I collected the state wide grade 8 science examination (2002-2019) and a broad range of questions used in my own 6th grade classroom to preform classification and question alignment.

In [39]:
#custom functions 
from myfunctions import *  

#webscraping
from bs4 import BeautifulSoup 
import requests   
from time import sleep 
import textract

#data analysis
import pandas as pd  
import numpy as np 
from random import randint  

#saving files
import pickle 
import PyPDF2

# Standards

### Next Generation Science Standards K-12  
The [NGS](https://www.nextgenscience.org/) Standards are redesigned, national standards for K-12, in addition a [book](https://www.nap.edu/read/13165/chapter/1) detailing the reasoning, methodology and practices was released 

#### Standards & Basic Description Via Website

In [17]:
#scrape the standard name and description of each science standard 
pages = np.arange(0, 36, 1)  
standard_list = [] 

for i in pages:   
    page = requests.get(f"https://www.nextgenscience.org/search-standards?keys=&page={i}") 
    soup = BeautifulSoup(page.text, 'html.parser')
    standards = soup.find_all('div', class_="col-sm-9") 
    sleep(randint(2,10))  
    for standard in standards:  
        standard_list.append(standard.text.split('\n\n\n\n\n')[:2])   
        
#standards by grade & core concept(dci)
ngs = pd.DataFrame(standard_list, columns=['dci', 'standard'])  
ngs['dci'] = ngs['dci'].map(lambda x: x.replace('\n\n\n', '').strip('\t'))   

#split the standard column to have the tag and name in two seperate columns 
ngs[['tag','dci']] = ngs["dci"].str.split(" ", 1, expand=True)
ngs["dci"] = ngs["dci"].str.strip("-")
ngs['tag'] = ngs['tag'].str.strip("\n\r\n") 
ngs['dci'] = ngs['dci'].str.strip('Grade:\xa0    \n\n') 
#remove the DCI arrangemnts (double listed) 
ngs_standards = ngs.loc[: 206]  
ngs_standards.head() 
pickle.dump(ngs_standards, open( "ngsstandards.p", "wb" ) )

Unnamed: 0,dci,standard,tag
0,Motion and Stability: Forces and Interactions,Plan and conduct an investigation to compare t...,K-PS2-1
1,Motion and Stability: Forces and Interactions,Analyze data to determine if a design solution...,K-PS2-2
2,From Molecules to Organisms: Structures and Pr...,Use observations to describe patterns of what ...,K-LS1-1
3,Earth's Systems,Use and share observations of local weather co...,K-ESS2-1
4,Earth's Systems,Construct an argument supported by evidence fo...,K-ESS2-2


#### Load the Expanded NGS Standards PDF into a TXT File

In [18]:
#load the expanded ngs standards pdf into a txt file 
pdf_to_text(filepath='/Users/kristen/Downloads/NGS.pdf', filename='ngs')

### Aligned NGS State Standards  
As of January 2021 the [following states](https://victoryprd.com/blog/update-on-next-generation-science-standards-ngss/) have aligned thier local standards to the NGS Standards.

In [42]:
#alabama 
pdf_to_text(filepath='/Users/kristen/Downloads/Alabama.pdf', filename='alabama')  

#alaska 
pdf_to_text(filepath='/Users/kristen/Downloads/alaska.pdf', filename='alaska')  

#arizona 
pdf_to_text(filepath='/Users/kristen/Downloads/Arizona.pdf', filename='arizona')  

#colorado
pdf_to_text(filepath='/Users/kristen/Downloads/colorado.pdf', filename='colorado')  

#flordia  
pdf_to_text(filepath='/Users/kristen/Downloads/flordia.pdf', filename='flordia') 

#georgia  
FILE_PATH_G = ['/Users/kristen/Downloads/georgiak.pdf', '/Users/kristen/Downloads/georgia1.pdf',
            '/Users/kristen/Downloads/georgia2.pdf', '/Users/kristen/Downloads/georgia3.pdf', 
            '/Users/kristen/Downloads/georgia4.pdf', '/Users/kristen/Downloads/georgia5.pdf', 
            '/Users/kristen/Downloads/georgia6.pdf', '/Users/kristen/Downloads/georgia7.pdf', 
            '/Users/kristen/Downloads/georgia8.pdf', '/Users/kristen/Downloads/georgiaa.pdf', 
            '/Users/kristen/Downloads/georgiab.pdf', '/Users/kristen/Downloads/georgiabo.pdf', 
            '/Users/kristen/Downloads/georgiac.pdf', '/Users/kristen/Downloads/georgiaes.pdf', 
            '/Users/kristen/Downloads/georgiaec.pdf', '/Users/kristen/Downloads/georgiaen.pdf', 
            '/Users/kristen/Downloads/georgiaevs.pdf', '/Users/kristen/Downloads/georgiaep.pdf', 
            '/Users/kristen/Downloads/georgiafs.pdf', '/Users/kristen/Downloads/georgiag.pdf', 
            '/Users/kristen/Downloads/georgiahap.pdf', '/Users/kristen/Downloads/georgiame.pdf', 
            '/Users/kristen/Downloads/georgiami.pdf', '/Users/kristen/Downloads/georgiao.pdf', 
            '/Users/kristen/Downloads/georgiaps.pdf', '/Users/kristen/Downloads/georgiap.pdf', 
            '/Users/kristen/Downloads/georgiaz.pdf']  

for file in FILE_PATH_G: 
        with open(file, mode='rb') as f:
            reader = PyPDF2.PdfFileReader(f) 
            number_of_pages = reader.getNumPages()  
            for page in range(number_of_pages):   
                page = reader.getPage(page) 
                file = open('georgia.txt', 'a')
                sys.stdout = file
                print(page.extractText()) 
                file.close()  
                
#idaho 
pdf_to_text(filepath='/Users/kristen/Downloads/colorado.pdf', filename='idaho')          

#indiana  
pages = np.arange(0, 34, 1)  
in_standard_list = [] 

for i in pages:   
    page = requests.get("https://www.doe.in.gov/science/framework") 
    soup = BeautifulSoup(page.text, 'html.parser')
    standards = soup.find_all('div', class_="view-content") 
    sleep(randint(2,10))  
    for standard in standards:  
        in_standard_list.append(standard.text)    
pickle.dump(in_standards, open( "instandards.p", "wb" ) ) 

#lousiana 
FILE_PATH = ['/Users/kristen/Downloads/louisiana1.pdf', '/Users/kristen/Downloads/louisiana2.pdf',
            '/Users/kristen/Downloads/louisiana3.pdf', 
            '/Users/kristen/Downloads/louisiana4.pdf', '/Users/kristen/Downloads/louisiana5.pdf', 
            '/Users/kristen/Downloads/louisiana6.pdf', '/Users/kristen/Downloads/louisiana7.pdf', 
            '/Users/kristen/Downloads/louisiana8.pdf', '/Users/kristen/Downloads/louisiana9.pdf', 
            '/Users/kristen/Downloads/louisiana10.pdf', '/Users/kristen/Downloads/louisiana11.pdf', 
            '/Users/kristen/Downloads/louisiana12.pdf', '/Users/kristen/Downloads/louisiana13.pdf', 
            '/Users/kristen/Downloads/louisiana14.pdf', '/Users/kristen/Downloads/louisiana15.pdf', 
            '/Users/kristen/Downloads/louisiana16.pdf', '/Users/kristen/Downloads/louisiana17.pdf', 
            '/Users/kristen/Downloads/louisiana18.pdf', '/Users/kristen/Downloads/louisiana19.pdf']


for file in FILE_PATH: 
        with open(file, mode='rb') as f:
            reader = PyPDF2.PdfFileReader(f) 
            number_of_pages = reader.getNumPages()  
            for page in range(number_of_pages):   
                page = reader.getPage(page) 
                file = open('louisiana', 'a')
                sys.stdout = file
                print(page.extractText()) 
                file.close()   
                
#massachusetts
pdf_to_text(filepath='/Users/kristen/Downloads/mass.pdf', filename='mass')  

#minnesota   
pdf_to_text(filepath='/Users/kristen/Downloads/mass.pdf', filename='minnesota')  

#mississippi 
pdf_to_text(filepath='/Users/kristen/Downloads/mass.pdf', filename='mississippi')  

#missouri 
FILE_PATH = ['/Users/kristen/Downloads/mo1.pdf', '/Users/kristen/Downloads/mo2.pdf']  

for file in FILE_PATH: 
        with open(file, mode='rb') as f:
            reader = PyPDF2.PdfFileReader(f) 
            number_of_pages = reader.getNumPages()  
            for page in range(number_of_pages):   
                page = reader.getPage(page) 
                file = open('missouri', 'a')
                sys.stdout = file
                print(page.extractText()) 
                file.close() 
                
#montana 
pdf_to_text(filepath='/Users/kristen/Downloads/mass.pdf', filename='montana')  

#nebraska 
pdf_to_text(filepath='/Users/kristen/Downloads/nebraska.pdf', filename='nebraksa') 

#north dakota
pdf_to_text(filepath='/Users/kristen/Downloads/northdakota.pdf', filename='northdakota') 

#oklahoma 
pdf_to_text(filepath='/Users/kristen/Downloads/oklahoma.pdf', filename='oklahoma') 

#south carolina 
pdf_to_text(filepath='/Users/kristen/Downloads/southcarolina.pdf', filename='southcarolina')  
            
#south dakota 
pdf_to_text(filepath='/Users/kristen/Downloads/southdakota.pdf', filename='southdakota')  
            
#tennessee 
pdf_to_text(filepath='/Users/kristen/Downloads/tenessee.pdf', filename='tennessee') 

#utah 
pdf_to_text(filepath='/Users/kristen/Downloads/utah.pdf', filename='utah')  
            
#west virgina 
pdf_to_text(filepath='/Users/kristen/Downloads/westvirgina.pdf', filename='westvirgina')             

#wisconsin 
pdf_to_text(filepath='/Users/kristen/Downloads/wisconsin.pdf', filename='wisconsin')  
            
#wyoming 
pdf_to_text(filepath='/Users/kristen/Downloads/wyoming.pdf', filename='wyoming') 

NameError: name 'sys' is not defined

### Non Aligned State Standards  
As of January 2021 the [following states](https://victoryprd.com/blog/update-on-next-generation-science-standards-ngss/) have not aligned thier local standards to the NGS Standards.

In [13]:
#maine 
maine = textract.process("/Users/kristen/Downloads/maine.doc") 
me_doc_raw = open_and_flatten('maine')

BadZipFile: File is not a zip file

In [None]:
#michigan 
pdf_to_text(pdf_to_text(filepath='/Users/kristen/Downloads/michigan.pdf', filename='michigan') 
mi_doc_raw = open_and_flatten('michigan')

In [None]:
#north carolina  
FILE_PATH = ['/Users/kristen/Downloads/nck.pdf', '/Users/kristen/Downloads/nc1.pdf',
            '/Users/kristen/Downloads/nc2.pdf', '/Users/kristen/Downloads/nc3.pdf', 
            '/Users/kristen/Downloads/nc4.pdf', '/Users/kristen/Downloads/nc5.pdf', 
            '/Users/kristen/Downloads/nc6.pdf', '/Users/kristen/Downloads/nc7.pdf', 
            '/Users/kristen/Downloads/nca8.pdf', '/Users/kristen/Downloads/ncp.pdf']  

for file in FILE_PATH: 
        with open(file, mode='rb') as f:
            reader = PyPDF2.PdfFileReader(f) 
            number_of_pages = reader.getNumPages()  
            for page in range(number_of_pages):   
                page = reader.getPage(page) 
                file = open('northcarolina', 'a')
                sys.stdout = file
                print(page.extractText()) 
                file.close()  
                
nc_doc_raw = open_and_flatten('northcarolina')

In [None]:
#ohio 
pdf_to_text(pdf_to_text(filepath='/Users/kristen/Downloads/ohio.pdf', filename='ohio') 
oh_doc_raw = open_and_flatten('ohio')

In [None]:
#pennsylvania 
pdf_to_text(pdf_to_text(filepath='/Users/kristen/Downloads/pennsylvania.pdf', filename='pennsylvania') 
pa_doc_raw = open_and_flatten('pennsylvania')

In [None]:
#texas 
FILE_PATH = ['/Users/kristen/Downloads/texasa.pdf', '/Users/kristen/Downloads/texasb.pdf',
            '/Users/kristen/Downloads/texasc.pdf', '/Users/kristen/Downloads/texasd.pdf']  

for file in FILE_PATH: 
        with open(file, mode='rb') as f:
            reader = PyPDF2.PdfFileReader(f) 
            number_of_pages = reader.getNumPages()  
            for page in range(number_of_pages):   
                page = reader.getPage(page) 
                file = open('texas', 'a')
                sys.stdout = file
                print(page.extractText()) 
                file.close()  
                
tx_doc_raw = open_and_flatten('texas')

In [None]:
#virginia 
pdf_to_text(pdf_to_text(filepath='/Users/kristen/Downloads/virginia.pdf', filename='virginia') 
va_doc_raw = open_and_flatten('virginia')

## Working Libraries 
* ngs_standards -- a list of the standards by level and dci 
* ngs_doc_raw -- ngs standards pdf unprocessed/ 
* classroom_questions -- real classroom question set 

* co_csv -- colorado state standards in csv format

Aligned State Standards Libraries- 
* al_doc_raw 
* ak_doc_raw 
* az_doc_raw 
* co_doc_raw  
* fl_doc_raw 
* ga_doc_raw 
* id_doc_raw  
* la_doc_raw
* ma_doc_raw 
* mn_doc_raw 
* ms_doc_raw 
* mo_doc_raw 
* mt_doc_raw 
* ne_doc_raw 
* nd_doc_raw 
* ok_doc_raw 
* sc_doc_raw 
* sd_doc_raw 
* tn_doc_raw 
* ut_doc_raw 
* wv_doc_raw 
* wi_doc_raw 
* wy_doc_raw 

Not Aligned State Standards Libraries- 
* me_doc_raw
* mi_doc_raw 
* nc_doc_raw 
* oh_doc_raw 
* pa_doc_raw 
* tx_doc_raw 
* va_doc_raw

# Question Sets 

### NY State 8th Grade State Test  
A NGS aligned state's standardized test

In [None]:
FILE_PATH = ['/Users/kristen/Downloads/2019.pdf', '/Users/kristen/Downloads/2018.pdf', 
             '/Users/kristen/Downloads/2017.pdf', '/Users/kristen/Downloads/2016.pdf', 
            '/Users/kristen/Downloads/2015.pdf', '/Users/kristen/Downloads/2014.pdf', 
            '/Users/kristen/Downloads/2013.pdf', '/Users/kristen/Downloads/2012.pdf', 
            '/Users/kristen/Downloads/2011.pdf', '/Users/kristen/Downloads/2010.pdf', 
            '/Users/kristen/Downloads/2009.pdf', '/Users/kristen/Downloads/2008.pdf', 
            '/Users/kristen/Downloads/2007.pdf', '/Users/kristen/Downloads/2006.pdf', 
            '/Users/kristen/Downloads/2005.pdf', '/Users/kristen/Downloads/2004.pdf', 
            '/Users/kristen/Downloads/2003.pdf', '/Users/kristen/Downloads/2002.pdf', 
            '/Users/kristen/Downloads/2001.pdf']  

for file in FILE_PATH: 
        with open(file, mode='rb') as f:
            reader = PyPDF2.PdfFileReader(f) 
            number_of_pages = reader.getNumPages()  
            for page in range(number_of_pages):   
                page = reader.getPage(page) 
                file = open('nytest.txt', 'a')
                sys.stdout = file
                print(page.extractText()) 
                file.close()

In [None]:
ny_test_raw = open_and_flatten('new_york_state') 
ny_test_raw

### Classroom Question Repository 
This is a collection of 'real world' questions used in classwork, labs, tests in quizzes in a 6th Grade Science classroom.

In [None]:
#load in question data 
classroom_questions_csv = pd.read_csv('Capstone Data - Questions Set (2).csv')
classroom_questions = pd.DataFrame(classroom_questions_csv) 
classroom_questions.head()