# <center> Correlating NGS and State Based Science Standards </center>   

<p align="center">
  <img width="250" height="100" src="NGSS.png">
</p>
 
 
### <center> Capstone Project - The Flatiron School - By Kristen Davis </center>

#### Summary:  
In April 2013, a collection of rigorous, internationally benchmarked standards for K-12 science education was released: the [Next Generation Science Standards (NGSS)](https://www.nextgenscience.org/). These standards were crafted to prepare students to make better decisions about scientific and technical issues and to apply science to their daily lives.  

By blending core science knowledge with scientific practices, the standards engage students in a more relevant context that deepens their understanding and helps them build what they need to move forward with their education. However, adoption was voluntary at the time, and many states chose not to change their current (Common Core) standards. 

Currently, 18 states have fully adopted the NGSS for their K-12 science curriculum, 26 are 'aligned' to the NGSS, and 8 have independently developed standards. 'Aligning with' and 'aligned to' are terms that are often used but rarely quantified. By identifying word frequencies and text patterns in the NGSS and comparing them to state standards that claim to be aligned to them, this project aims not only to provide insight into the similarities and differences of science education across America but also to develop a tool that could be used more broadly to measure alignment. 
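
One simple way to put a number on 'aligned' is the cosine similarity between the word-count vectors of two documents. The sketch below is illustrative only: the function name `cosine_similarity` and the toy token lists are assumptions, and this is not necessarily the measure used later in this project.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the word-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # dot product over the shared vocabulary only (other terms contribute zero)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)

# toy token lists (made up for illustration, not real standards text)
ngss_tokens = ["evidence", "models", "energy", "evidence", "systems"]
state_tokens = ["evidence", "energy", "memorize", "systems"]
score = cosine_similarity(ngss_tokens, state_tokens)
print(round(score, 3))
```

A score near 1 means the two documents use words in similar proportions; a score near 0 means they share little vocabulary.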

# Libraries & Data Packages

In [88]:
#custom functions 
import myfunctions as mf

In [253]:
#data analysis 
import pandas as pd
import numpy as np   
import pickle

#data visualization 
import matplotlib.pyplot as plt   
import plotly.express as px 
import plotly.graph_objects as go 
import plotly.figure_factory as ff 
from urllib.request import urlopen
import json 


#natural language processing  
import re
import string
import nltk
from nltk import FreqDist
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

#clustering and classification
from sklearn.cluster import MiniBatchKMeans, KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, calinski_harabasz_score, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

#fix the random seed for reproducibility
np.random.seed(0)

# General Text Preprocessing
Initial text cleaning applies global processing to each standards set: flattening, and removing punctuation, numbers and basic stop words. More extensive, text-specific processing is done later in the notebook when working with individual texts.
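
The cleaning steps described above can be sketched in plain Python. This is a hypothetical stand-in, not the notebook's actual `general_processing` helper: the name `basic_clean`, the tiny stop word set, and the sample sentence are all assumptions made for illustration.

```python
import re
import string

# tiny illustrative stop word list (assumption); a real run would use a fuller list
BASIC_STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "that"}

def basic_clean(raw_text, extra_stopwords=()):
    """Flatten, lowercase, strip punctuation and digits, tokenize, drop stop words."""
    text = raw_text.replace("\n", " ").lower()  # flatten line breaks
    # replace punctuation with spaces so codes like 'k-ps2-1' split into parts
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    text = re.sub(r"\d+", " ", text)  # remove numbers
    stop = BASIC_STOPWORDS | set(extra_stopwords)
    return [tok for tok in text.split() if tok not in stop]

sample = "K-PS2-1: Plan and conduct an investigation\nto compare the effects of 2 pushes."
tokens = basic_clean(sample)
print(tokens)
```

Note that standard codes such as `K-PS2-1` break into fragments like `ps` under this kind of cleaning, which is one reason document-specific stop words are needed later.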

#### NGSS Standards

In [117]:
#ngss standards, descriptions and dci
standard_df = pickle.load(open("ngsstandards.p", "rb" )) 
standard_df.head()

| | dci | standard | tag |
|---|---|---|---|
| 0 | Motion and Stability: Forces and Interactions | Plan and conduct an investigation to compare t... | K-PS2-1 |
| 1 | Motion and Stability: Forces and Interactions | Analyze data to determine if a design solution... | K-PS2-2 |
| 2 | From Molecules to Organisms: Structures and Pr... | Use observations to describe patterns of what ... | K-LS1-1 |
| 3 | Earth's Systems | Use and share observations of local weather co... | K-ESS2-1 |
| 4 | Earth's Systems | Construct an argument supported by evidence fo... | K-ESS2-2 |


In [121]:
#ngs standards pdf file
ngs= general_processing('ngs_pdf')

#### Aligned NGSS State Standards 

In [97]:
#indiana 
with open("instandards.p", "rb") as f:  #read mode; "wb" truncates the file and raises UnsupportedOperation: read
    indiana = pickle.load(f)

In [105]:
ngss_states = ['alabama', 'alaska', 'arizona', 'colorado', 'flordia', 'georgia', 'idaho', 'louisiana', 'mass', 
             'minnesota', 'mississippi', 'missouri', 'montana', 'nebraksa', 'northdakota', 'oklahoma', 
              'southcarolina', 'southdakota', 'tennessee', 'utah', 'westvirginia', 'wisconsin', 'wyoming'] 

#process each state and keep the tokens, keyed by state name
aligned_states = {state: general_processing(state) for state in ngss_states}

#indiana 
#indiana = pickle.load(open("instandards.p", "rb" ))  
#indiana = general_processing('indiana')

In [110]:
#indiana
pickle.load(open("in_standard_list.p", "rb"))  

EOFError: Ran out of input

#### Non-NGSS-Aligned State Standards 

In [243]:
non_ngss_states = ['maine', 'michigan', 'northcarolina', 'ohio', 'pennsylvania', 'texas', 'virginia'] 

#process each state and keep the tokens, keyed by state name
non_aligned_states = {state: general_processing(state) for state in non_ngss_states}

All state standards PDFs are now loaded, with general processing and tokenization applied. 

# Understanding the NGSS Standards 
This section analyzes the words used within the NGSS standards document. The document contains not only the language of the standards themselves but also the language the developers chose to describe their process and the importance and framework of the standards. 

### 50 Most Frequent Words Used in the NGSS Standards

In [249]:
#look at the top 50 words on general process ngss  
ngss_generalclean_freqdist = FreqDist(ngs)
ngss_generalclean_freqdist.most_common(50)

[('hs', 1497),
 ('ess', 1412),
 ('ps', 1326),
 ('ms', 1295),
 ('ls', 1185),
 ('evidence', 467),
 ('core', 438),
 ('ideas', 415),
 ('energy', 389),
 ('include', 305),
 ('using', 304),
 ('information', 302),
 ('engineering', 299),
 ('use', 296),
 ('data', 292),
 ('students', 289),
 ('ets', 286),
 ('systems', 284),
 ('natural', 268),
 ('assessment', 266),
 ('disciplinary', 262),
 ('models', 259),
 ('solutions', 254),
 ('performance', 244),
 ('expectations', 240),
 ('earth', 239),
 ('concepts', 232),
 ('scientific', 216),
 ('design', 215),
 ('connections', 206),
 ('understanding', 201),
 ('explanations', 200),
 ('system', 197),
 ('practices', 192),
 ('matter', 179),
 ('statement', 179),
 ('model', 179),
 ('patterns', 176),
 ('clarification', 176),
 ('examples', 173),
 ('experiences', 170),
 ('builds', 168),
 ('grade', 165),
 ('different', 163),
 ('progresses', 163),
 ('framework', 159),
 ('describe', 158),
 ('based', 157),
 ('world', 156),
 ('organisms', 155)]

Some of the most frequent tokens are grade-level or standard-topic codes (such as 'hs', 'ms' and 'ps') rather than meaningful words, so I will remove those. 

In [250]:
#remove additional stop words
ngss_stopwords_list = ['hs', 'ms', 'ls', 'ess', 'ps']
ngss_processed = [word for word in ngs if word not in ngss_stopwords_list] 

#re examine frequency list
ngss_freqdist = FreqDist(ngss_processed)
ngss_freqdist.most_common(50)

[('evidence', 467),
 ('core', 438),
 ('ideas', 415),
 ('energy', 389),
 ('include', 305),
 ('using', 304),
 ('information', 302),
 ('engineering', 299),
 ('use', 296),
 ('data', 292),
 ('students', 289),
 ('ets', 286),
 ('systems', 284),
 ('natural', 268),
 ('assessment', 266),
 ('disciplinary', 262),
 ('models', 259),
 ('solutions', 254),
 ('performance', 244),
 ('expectations', 240),
 ('earth', 239),
 ('concepts', 232),
 ('scientific', 216),
 ('design', 215),
 ('connections', 206),
 ('understanding', 201),
 ('explanations', 200),
 ('system', 197),
 ('practices', 192),
 ('matter', 179),
 ('statement', 179),
 ('model', 179),
 ('patterns', 176),
 ('clarification', 176),
 ('examples', 173),
 ('experiences', 170),
 ('builds', 168),
 ('grade', 165),
 ('different', 163),
 ('progresses', 163),
 ('framework', 159),
 ('describe', 158),
 ('based', 157),
 ('world', 156),
 ('organisms', 155),
 ('support', 155),
 ('change', 154),
 ('problems', 141),
 ('education', 135),
 ('could', 134)]

It is very telling that the most common word in the NGSS is 'evidence': this framework is about students engaging hands-on, and evidence is a crucial piece of that.

### Identifying NGSS Word Associations
I want to identify the words that most commonly appear together in the NGSS document. These associations will later be used for classification, and they also give the individual words more meaning.

In [255]:
#create bigrams and examine the top 50 associations
bigram_measures = nltk.collocations.BigramAssocMeasures() 
ngss_finder = BigramCollocationFinder.from_words(ngss_processed) 
ngss_scored = ngss_finder.score_ngrams(bigram_measures.raw_freq) 
ngss_scored[:50]

[(('core', 'ideas'), 0.0066518847006651885),
 (('disciplinary', 'core'), 0.005982076866223208),
 (('performance', 'expectations'), 0.005520140428677014),
 (('clarification', 'statement'), 0.0040650406504065045),
 (('framework', 'education'), 0.003118070953436807),
 (('ets', 'ets'), 0.0030256836659275682),
 (('assessment', 'boundary'), 0.00291019955654102),
 (('could', 'include'), 0.0028871027346637104),
 (('boundary', 'assessment'), 0.002771618625277162),
 (('experiences', 'progresses'), 0.002633037694013304),
 (('achieve', 'inc'), 0.0023558758314855877),
 (('inc', 'rights'), 0.0023558758314855877),
 (('rights', 'reserved'), 0.0023558758314855877),
 (('september', 'achieve'), 0.0023558758314855877),
 (('demonstrate', 'understanding'), 0.0023096821877309683),
 (('crosscutting', 'concepts'), 0.00217110125646711),
 (('cause', 'effect'), 0.0020556171470805617),
 (('constructing', 'explanations'), 0.0020094235033259423),
 (('core', 'idea'), 0.0020094235033259423),
 (('statement', 'emphasis'

In [256]:
#score bigrams by pointwise mutual information (PMI) and examine the top 10 associations
ngss_pmi_finder = BigramCollocationFinder.from_words(ngss_processed)
ngss_pmi_finder.apply_freq_filter(5) 
ngss_pmi_scored = ngss_pmi_finder.score_ngrams(bigram_measures.pmi) 
ngss_pmi_scored[:10]

[(('big', 'bang'), 12.816983623255382),
 (('corroborating', 'challenging'), 12.816983623255382),
 (('peer', 'review'), 12.816983623255382),
 (('advanced', 'searches'), 12.594591201918934),
 (('avenues', 'exploration'), 12.594591201918934),
 (('narrow', 'broaden'), 12.594591201918934),
 (('recording', 'sharing'), 12.594591201918934),
 (('broaden', 'inquiry'), 12.401946123976536),
 (('generating', 'additional'), 12.372198780582487),
 (('behind', 'currently'), 12.331556796085142)]
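
For reference, PMI compares how often two words occur together with how often they would co-occur if they were independent: PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The sketch below is a plain-Python illustration of that standard definition, with probabilities estimated as counts over the total token count. The function name `bigram_pmi` and the toy token list are assumptions, and NLTK's implementation may differ in details.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=1):
    """Score adjacent word pairs by PMI = log2(P(x, y) / (P(x) * P(y)))."""
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), c_xy in pair_counts.items():
        if c_xy < min_count:
            continue  # very rare pairs get inflated PMI, so filter them out
        # counts divided by n estimate the probabilities; the n terms simplify to this ratio
        scores[(w1, w2)] = math.log2(c_xy * n / (word_counts[w1] * word_counts[w2]))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# toy tokens, made up for illustration
toy = ["big", "bang", "theory", "core", "ideas", "core", "ideas", "big", "bang", "core"]
ranked = bigram_pmi(toy, min_count=2)
for pair, score in ranked:
    print(pair, round(score, 3))
```

Pairs whose words almost never appear apart score highest, which is why phrases like 'big bang' outrank far more frequent pairs like 'core ideas', and why a minimum frequency filter is applied first.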

# Question Alignment  
In addition to looking at how standards align, I also wanted to explore assessment questions and which class (NGSS vs. non-NGSS) each question falls into.

In [None]:
#ny state test 
ny_test_raw = open_and_flatten('new_york_state') 
ny_test_raw

In [None]:
#classroom question set  
classroom_questions = pd.read_csv('Capstone Data - Questions Set (2).csv')  #read_csv already returns a DataFrame
classroom_questions.head()

# Appendix A: The Age of Science Standards by State (EDA) 
According to most state DOE websites, standards are updated on a 5-year cycle, although some states update their standards much more often than others. The dates below are the most current dates I could find associated with the standards I am working with (as of January 2021). I am interested in the distribution of these dates and in how it breaks down between NGSS-adopted, NGSS-aligned, and independent states.

In [145]:
#unpickle state dictionary
state_df = pd.read_pickle("./state_df.pkl") 
#transpose values
state_df = state_df.T 

In [146]:
#reset index
state_df.reset_index(inplace=True)

In [147]:
#set column name
state_df = state_df.rename(columns={'index': 'state'}) 
state_df.head()

| | state | update_year | standards |
|---|---|---|---|
| 0 | alabama | 2015 | aligned |
| 1 | alaskas | 2017 | aligned |
| 2 | arizona | 2018 | aligned |
| 3 | arkansas | 2016 | adopted |
| 4 | california | 2013 | adopted |


In [148]:
#visualize the age of science standards 
grouped_df = state_df.groupby('update_year').count() 
grouped_df

| update_year | state | standards |
|---|---|---|
| 2002 | 1 | 1 |
| 2003 | 1 | 1 |
| 2004 | 1 | 1 |
| 2008 | 1 | 1 |
| 2013 | 6 | 6 |
| 2014 | 3 | 3 |
| 2015 | 8 | 8 |
| 2016 | 11 | 11 |
| 2017 | 5 | 5 |
| 2018 | 4 | 4 |


In [195]:
#graph the state counts by year and alignment
years = [2002, 2003, 2004, 2008, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020] 

fig = go.Figure()
fig.add_trace(go.Bar(x=years,
                y=[0, 0, 0, 0, 6, 2, 5, 4, 0, 0, 0, 1],
                name='NGSS Adopted States',
                marker_color='rgb(55, 83, 109)'
                ))
fig.add_trace(go.Bar(x=years,
                y=[0, 0, 0, 0, 0, 1, 2, 7, 4, 3, 3, 3],
                name='NGSS Aligned States',
                marker_color='rgb(26, 118, 255)'
                )) 
fig.add_trace(go.Bar(x=years,
                y=[1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0],
                name='Independent Standards States',
                marker_color='rgb(135, 206, 235)'
                ))

fig.update_layout(
    title='Date of Science Standard Development by Standard Group',
    xaxis_tickfont_size=14,
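    yaxis=dict(
        #nested title dict; the older titlefont_size shortcut is not accepted by newer Plotly releases
        title=dict(text='Total States', font=dict(size=16)),
        tickfont_size=14,
    ),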
    legend=dict(
        x=0,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    ),
    barmode='group',
    bargap=0.15, # gap between bars of adjacent location coordinates.
    bargroupgap=0.1 # gap between bars of the same location coordinate.
) 

fig.add_annotation(x=2002, y=1,
            text="Pennsylvania",
            showarrow=True,
            arrowhead=1) 

fig.add_annotation(x=2004, y=1,
            text="North Carolina",
            showarrow=True,
            arrowhead=1, 
            ax=-5,
            ay=-50) 

fig.add_annotation(x=2008, y=1,
            text="Florida",
            showarrow=True,
            arrowhead=1)
fig.show()

The NGSS were released in 2013, so the cluster of adopting states in that year makes sense: those states have used the NGSS since the beginning. There is a large spike in aligned states three years after the initial release. This might reflect the perceived efficacy of the standards, or simply that the states' regular update cycles fell a few years into the NGSS, when adopting them felt less 'risky'. The outlier states with particularly old standards are all independent states. 
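
The bar heights in the chart code are typed in by hand. As a minimal sketch of how they could instead be tallied from the state table, the snippet below counts states per year for one standards group. The helper name `counts_by_year` is an assumption, and the rows are a few entries copied from the `state_df.head()` output above, not the full dataset.

```python
from collections import Counter

# a few rows shaped like state_df: (state, update_year, standards)
rows = [
    ("alabama", 2015, "aligned"),
    ("arizona", 2018, "aligned"),
    ("arkansas", 2016, "adopted"),
    ("california", 2013, "adopted"),
]

def counts_by_year(rows, group, years):
    """Number of states in `group` for each year in `years` (0 when none)."""
    tally = Counter(year for _, year, standards in rows if standards == group)
    return [tally[year] for year in years]

years = [2013, 2014, 2015, 2016, 2017, 2018]
adopted_counts = counts_by_year(rows, "adopted", years)
aligned_counts = counts_by_year(rows, "aligned", years)
print(adopted_counts)
print(aligned_counts)
```

Lists built this way could be passed as the `y` values of each `go.Bar` trace, which keeps the chart in sync with the underlying table.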

# Appendix B: Geographical Distribution of NGSS Standards 
I am interested in the geographical distribution of the NGSS standards. 

In [239]:
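#note: the county-level GeoJSON is loaded here but is not used by the state-level map below
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)

#locationmode="USA-states" matches two-letter state abbreviations (e.g. 'AL'), so the 'state' column
#must hold abbreviations rather than lowercase names; the default mode matches ISO-3 country codes
fig = px.choropleth(state_df, locations = 'state', locationmode="USA-states", color='standards', scope="usa")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()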
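
Plotly's state-level choropleth matches locations by two-letter state abbreviation, while the state table stores lowercase names. A minimal sketch of the lookup step follows; the `STATE_CODES` dictionary is deliberately partial and the helper name `to_state_code` is an assumption.

```python
# partial lookup for illustration; a full version would cover every state
STATE_CODES = {
    "alabama": "AL",
    "arizona": "AZ",
    "arkansas": "AR",
    "california": "CA",
    "northcarolina": "NC",
}

def to_state_code(name):
    """Return the two-letter code for a state name (case and spaces ignored), or None."""
    key = name.lower().replace(" ", "")
    return STATE_CODES.get(key)

codes = [to_state_code(s) for s in ["alabama", "North Carolina", "alaskas"]]
print(codes)
```

Names that do not match (such as the misspelled 'alaskas') come back as `None`, which makes data-entry problems easy to spot. The codes could then be stored in a new column, for example `state_df['code'] = state_df['state'].map(to_state_code)`, and passed to `px.choropleth` via `locations='code'` with `locationmode='USA-states'`.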

# Appendix C: Clustering Classroom Questions by Content Area 
I am going to use KMeans clustering to group unlabeled questions into subject categories. The science standards are separated into 4 domains: Life, Physical, Earth & Space, and Engineering. These questions represent the entire scope and sequence of questions asked in a 6th grade science classroom. 

### Data Preprocessing

In [None]:
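#feature extraction 
#assumption: df is the classroom question set loaded earlier, with a 'question' column
df = classroom_questions
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(df.question.values) 

#create a dense array (newer scikit-learn versions reject the np.matrix returned by todense())
features = X.toarray()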

### Build a Baseline Model

In [None]:
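#create an instance cluster  
model = KMeans(n_clusters=5, random_state=42)
model.fit(features) 

#rank terms by their weight in each cluster center
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vec.get_feature_names_out()  #get_feature_names() was removed in scikit-learn 1.2

#print the 10 highest-weighted terms per cluster
for i in range(5):
    print("Cluster {}:".format(i))
    for ind in order_centroids[i, :10]:
        print('%s' % terms[ind])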

In [None]:
# reduce the features to 3 principal components (only two are plotted at a time)
pca = PCA(n_components=3, random_state=42) 
reduced_features = pca.fit_transform(features)

# project the cluster centers into the same reduced space
reduced_cluster_centers = pca.transform(model.cluster_centers_)

In [None]:
#visualize clustering 
plt.scatter(reduced_features[:,0], reduced_features[:,1], c=model.predict(features))
plt.scatter(reduced_cluster_centers[:, 0], reduced_cluster_centers[:,1], marker='x', s=150, c='b')

In [None]:
#assess cluster quality with the silhouette score 
silhouette_score(features, labels=model.predict(features))

This is not a very strong model: it has a low silhouette score and two of the clusters overlap heavily. I want to visually inspect the questions that it has assigned to each cluster. 
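
The silhouette score is a measure of cluster separation rather than accuracy. For each point it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), giving (b - a) / max(a, b). The plain-Python sketch below illustrates the idea on four made-up 2D points; the function name `silhouette` is an assumption, and it is meant as an explanation, not a replacement for scikit-learn's `silhouette_score`.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct clusters."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # mean distance from p to the members of each cluster (excluding p itself)
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(math.dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            scores.append(0.0)  # convention for a cluster containing a single point
            continue
        a = mean_dist[own]                                    # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
tight = silhouette(points, [0, 0, 1, 1])  # labels follow the two visible groups
mixed = silhouette(points, [0, 1, 0, 1])  # labels cut across the groups
print(round(tight, 3), round(mixed, 3))
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or misassigned clusters, which is the situation described above.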

### Generate a labeled dataset
Merge the labels generated with KMeans to better understand the types of questions the algorithm has grouped together.

In [None]:
#collect all the labels and values for the KMeans 
cluster_map = pd.DataFrame()
cluster_map['data_index'] = df.index.values
cluster_map['cluster'] = model.labels_ 

#sanity check 
cluster_map.head()

In [None]:
#merge the two dataframes on the index
result = df.join(cluster_map.set_index('data_index'))

#sanity check
result.loc[result['cluster'] == 2] 
result.loc[93, "question"]

Visualize each of the clusters and the questions inside of it.

In [None]:
#cluster 1
result.loc[result['cluster'] == 1] 

This is clustering toward physical science.

In [None]:
#cluster 2
result.loc[result['cluster'] == 2] 

This cluster is trending toward earth / space.

In [None]:
#cluster 3
result.loc[result['cluster'] == 3] 

This cluster is not as clean as the first two - looks like a mix of cluster one & cluster two.

### Predict Question Type Given Question 
Use supervised learning techniques in combination with the labeled data to predict question type given a question.

In [None]:
#define the target
X = result.drop('cluster', axis=1)
y = result['cluster']

In [None]:
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
#vectorize the X data 
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train.question) 
X_test_counts = count_vect.transform(X_test.question)

In [None]:
#scale down impact of frequently occurring words
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts) 
#only transform the test set so it uses the idf weights learned from the training set
X_test_tfidf = tfidf_transformer.transform(X_test_counts) 

#look at the shape of each set 
print("X_train Shape:", X_train_tfidf.shape) 
print("X_test Shape:", X_test_tfidf.shape)
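
The reason the weighting should be learned from the training set alone can be shown with a small plain-Python sketch: inverse document frequencies are fitted once on the training documents and then reused, unchanged, on the test documents. The function names and toy documents are assumptions; the smoothed idf formula used here is ln((1 + n) / (1 + df)) + 1, and no length normalization is applied.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn smoothed inverse document frequencies from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def transform(doc, idf):
    """Weight term counts by the learned idf; terms unseen in training are dropped."""
    counts = Counter(doc)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

# toy documents, made up for illustration
train = [["energy", "transfer"], ["energy", "waves"], ["cells", "energy"]]
test = [["waves", "plates"]]

idf = fit_idf(train)  # fitted once, on training data only
test_weights = [transform(d, idf) for d in test]
print(test_weights)
```

Refitting on the test set would give the same word different weights in training and testing, so the classifier would be scored on features it was never trained on.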

In [None]:
#instantiate classifiers
mnb = MultinomialNB()
rf = RandomForestClassifier(n_estimators=100) 

#fit Multinomial classifier 
mnb.fit(X_train_tfidf, y_train)  

# NB Predictions
mnb_train_preds = mnb.predict(X_train_tfidf) 
mnb_test_preds = mnb.predict(X_test_tfidf) 

#fit Random Forest classifier 
rf.fit(X_train_tfidf, y_train)  

# RF Predictions 
rf_train_preds = rf.predict(X_train_tfidf) 
rf_test_preds = rf.predict(X_test_tfidf)

In [None]:
nb_train_score = accuracy_score(y_train, mnb_train_preds)
nb_test_score = accuracy_score(y_test, mnb_test_preds)
rf_train_score = accuracy_score(y_train, rf_train_preds)
rf_test_score = accuracy_score(y_test, rf_test_preds)

print("Multinomial Naive Bayes")
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(nb_train_score, nb_test_score))
print("")
print('-'*70)
print("")
print('Random Forest')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(rf_train_score, rf_test_score))

While this model has a strong accuracy score as is, I think it is overfitting. The labels themselves come from KMeans, so accuracy here measures how well the classifiers reproduce the cluster assignments rather than true subject labels. Classes 1 and 2 (Physical Science and Earth/Space Science) form clearly separate clusters, but the other two groups I was hoping to identify (Engineering and Life Science) are not well represented. Hyperparameters could be tuned with a grid search, but more than that, I believe a significant amount of additional data would need to be gathered to make this a high-quality model. 

In [None]:
#test k-fold precision and recall
#plot feature importance
#grid search
#confusion matrix - which groups is it confusing together
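
The confusion matrix listed in the notes above can be tallied in a few lines. This is a minimal sketch with made-up labels; the helper names `confusion_counts` and `most_confused` are assumptions, and scikit-learn's `confusion_matrix` would be the more usual choice in practice.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

def most_confused(y_true, y_pred):
    """Return the off-diagonal (true, predicted) pair with the highest count, or None."""
    errors = {pair: n for pair, n in confusion_counts(y_true, y_pred).items()
              if pair[0] != pair[1]}
    return max(errors, key=errors.get) if errors else None

# made-up labels for illustration only
y_true = [1, 1, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 1, 1, 1, 3]
counts = confusion_counts(y_true, y_pred)
worst = most_confused(y_true, y_pred)
print(counts)
print(worst)
```

The largest off-diagonal count points to the pair of groups the classifier mixes up most often.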