# NLP on Academic Knowledge and Skills (AKS)

Gwinnett’s curriculum for grades K–12 is called the Academic Knowledge and Skills (AKS). It is aligned to the state-adopted Georgia Standards of Excellence in Language Arts (K–12) and Mathematics (K–12), and to the literacy standards for Science, Social Studies, and Technical Education for middle and high school students. The Georgia Performance Standards (GPS) are in place for other content areas. Gwinnett’s AKS is a rigorous curriculum that prepares students for college and 21st-century careers in a globally competitive future.

The AKS for each grade level spells out the essential concepts students are expected to know and skills they should acquire in that grade or subject. The AKS offers a solid base on which teachers build rich learning experiences. Teachers use curriculum guides, technology, and instructional resources to teach the AKS and to make sure every student is learning to his or her potential.

[1. GCPS Academic Knowledge and Skills (AKS)](https://publish.gwinnett.k12.ga.us/gcps/myhome/public/parents/content/general-info/aks)

## Reading in the data set

In [1]:
import pprint
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.metrics import pairwise
from scipy import sparse
import spacy


pp = pprint.PrettyPrinter(indent=4)
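# 'en' was a spaCy 2.x shortcut link (removed in spaCy 3); load the model by its package name instead
nlp = spacy.load('en_core_web_sm')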

In [2]:
# `sheetname` is the deprecated spelling; current pandas expects `sheet_name`
math_aks = pd.read_excel('17-18 AKS New.xlsx', sheet_name='MA')
print(math_aks.shape)
math_aks.head()

(1415, 18)


| | Director | Low Grade | High Grade | Subject | Course | Strand | Strand Text | Sequence | Reference Code | Reference Key | Type1 | AKS number | IOA number | AKS text | IOA text | AKS or IOA | Correlation | GDOEKey |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mathematics | K | K | Mathematics | Mathematics - K | A | Counting and Cardinality | 1 | KMA | KMA_A2012-1 | AKS | 1 | | count to 100 by ones and tens | | count to 100 by ones and tens | GSE | MCCK.CC.1 |
| 1 | Mathematics | K | K | Mathematics | Mathematics - K | A | Counting and Cardinality | 2 | KMA | KMA_A2012-2 | AKS | 2 | | count forward by ones, beginning from a given ... | | count forward by ones, beginning from a given ... | GSE | MCCK.CC.2 |
| 2 | Mathematics | K | K | Mathematics | Mathematics - K | A | Counting and Cardinality | 3 | KMA | KMA_A2012-3 | AKS | 3 | | write numerals from 0 to 20 and represent a nu... | | write numerals from 0 to 20 and represent a nu... | GSE | MCCK.CC.3 |
| 3 | Mathematics | K | K | Mathematics | Mathematics - K | A | Counting and Cardinality | 4 | KMA | KMA_A2012-4 | AKS | 4 | | demonstrate the relationship between numbers a... | | demonstrate the relationship between numbers a... | GSE | MCCK.CC.4 |
| 4 | Mathematics | K | K | Mathematics | Mathematics - K | A | Counting and Cardinality | 5 | KMA | KMA_A2012-5 | AKS | 5 | | count objects by stating number names in the s... | | count objects by stating number names in the s... | GSE | MCCK.CC.4_a |


In [3]:
corpus = math_aks['AKS text']
corpus.head()

0                        count to 100 by ones and tens
1    count forward by ones, beginning from a given ...
2    write numerals from 0 to 20 and represent a nu...
3    demonstrate the relationship between numbers a...
4    count objects by stating number names in the s...
Name: AKS text, dtype: object

## Topic modeling

In [4]:
n_samples = 1415
n_features = 700
n_components = 20
n_top_words = 10


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += "; ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
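
The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` walks the sorted indices backwards, so it yields the `n_top_words` highest-weighted features in descending order. A minimal sketch on a made-up weight vector (the vocabulary and weights are hypothetical, not taken from the AKS data):

```python
import numpy as np

# Hypothetical topic weights over a toy vocabulary (not from the AKS data).
feature_names = ["add", "subtract", "graph", "solve", "equations"]
topic = np.array([0.1, 0.7, 0.0, 0.9, 0.4])
n_top_words = 3

# argsort() orders indices by ascending weight; the negative-step slice
# takes the last n_top_words of them in reverse, i.e. largest weight first.
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print("; ".join(top_words))
```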

In [5]:
# TF-IDF features (unigrams and bigrams) for the NMF models
tfidf_vectorizer = TfidfVectorizer(analyzer='word',
                                   ngram_range=(1, 2),
                                   max_df=0.95,
                                   min_df=2,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(corpus)

# Raw term counts with the same settings for LDA
tf_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                max_df=0.95,
                                min_df=2,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(corpus)

In [6]:
# Note: NMF's `alpha` and the vectorizers' get_feature_names() are deprecated in scikit-learn 1.0
# and removed in 1.2 (replaced by alpha_W/alpha_H and get_feature_names_out()).
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

print("\nTopics in NMF model (Frobenius norm):")
nmf_fn = NMF(n_components=n_components, random_state=1,
             alpha=.1, l1_ratio=.5)
nmf_fn_topics = nmf_fn.fit_transform(tfidf)  # document-topic matrix, like lda_topics below
print_top_words(nmf_fn, tfidf_feature_names, n_top_words)

print("\nTopics in NMF model (generalized Kullback-Leibler divergence):")
nmf_kl = NMF(n_components=n_components, random_state=1,
             beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
             l1_ratio=.5)
nmf_kl_topics = nmf_kl.fit_transform(tfidf)
print_top_words(nmf_kl, tfidf_feature_names, n_top_words)

print("\nTopics in LDA model:")
lda = LatentDirichletAllocation(n_components=n_components, max_iter=10,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda_topics = lda.fit_transform(tf)
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)


Topics in NMF model (Frobenius norm):
Topic #0: solve problems; problems; solve; apply; appropriate; appropriate strategies; strategies solve; adapt; adapt variety; variety appropriate
Topic #1: functions; inverse; inverse functions; exponential; exponential functions; trigonometric; apply; trigonometric functions; linear; linear functions
Topic #2: add; subtract; add subtract; operations; properties operations; strategies; properties; multiply; numbers; addition
Topic #3: ideas; mathematical ideas; mathematical; connections mathematical; connections; recognize use; use connections; use; recognize; coherent
Topic #4: showing; functions showing; behavior; end behavior; end; functions; graph exponential; intercepts end; showing intercepts; graph
Topic #5: equations; solve; variable; quadratic; linear; equations variable; solve quadratic; quadratic equations; linear equations; solutions
Topic #6: expression; reveal; equivalent; different; reveal explain; expression reveal; function; defi
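
The top-word lists describe each topic, but not which standards fall under it. Since `lda.fit_transform(tf)` returns a document-topic matrix (one row per AKS statement, one column per topic), taking the `argmax` of each row gives that statement's dominant topic. A minimal sketch on a made-up matrix; `doc_topics` is a stand-in for `lda_topics`, and its values are hypothetical:

```python
import numpy as np

# Toy document-topic matrix: 4 documents x 3 topics (hypothetical values).
doc_topics = np.array([[0.70, 0.20, 0.10],
                       [0.10, 0.10, 0.80],
                       [0.30, 0.60, 0.10],
                       [0.05, 0.05, 0.90]])

# Index of the highest-weighted topic for each document.
dominant_topic = doc_topics.argmax(axis=1)

# Number of documents whose dominant topic is 0, 1, 2, ...
topic_sizes = np.bincount(dominant_topic, minlength=doc_topics.shape[1])
print(dominant_topic, topic_sizes)
```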

## Document similarity with spaCy

In [7]:
spacy_corpus = corpus.apply(nlp)
print(spacy_corpus.head())

0                (count, to, 100, by, ones, and, tens)
1    (count, forward, by, ones, ,, beginning, from,...
2    (write, numerals, from, 0, to, 20, and, repres...
3    (demonstrate, the, relationship, between, numb...
4    (count, objects, by, stating, number, names, i...
Name: AKS text, dtype: object


In [8]:
spacy_similarity = np.zeros((len(spacy_corpus), len(spacy_corpus)))
for (i, doc) in enumerate(spacy_corpus):
    for (j, other_doc) in enumerate(spacy_corpus):
        spacy_similarity[i, j] = doc.similarity(other_doc)
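
The nested loop makes about two million `similarity` calls for 1,415 phrases (1415² = 2,002,225). spaCy's default `Doc.similarity` is the cosine similarity of the two documents' vectors, so the same matrix can be sketched as a single matrix product over stacked vectors. The snippet below assumes each row of `vectors` stands in for a `doc.vector`; the random data and the helper name `cosine_similarity_matrix` are illustrative and not part of the notebook.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Leave all-zero rows (e.g. documents with no word vectors) at zero
    # instead of dividing by zero.
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit = vectors / safe_norms
    return unit @ unit.T

# Stand-in data: 5 "documents" with 8-dimensional vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8))
similarity = cosine_similarity_matrix(vectors)
print(similarity.shape)
```

With the notebook's data, `vectors` could be built as `np.vstack([doc.vector for doc in spacy_corpus])`.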

In [9]:
def _rank_phrases(pairwise_similarity, phrases, phrase_num, descending):
    """Return (phrase, up to 10 (index, phrase, score) tuples) ranked by similarity."""
    phrase = phrases[phrase_num]
    # Subtracting the identity pushes the self-similarity on the diagonal to ~0. Floating-point
    # error can leave it slightly positive, in which case the phrase itself passes the filter below.
    pairwise_phrase = (pairwise_similarity - np.eye(len(phrases)))[phrase_num].tolist()
    phrase_scores = [pair for pair in zip(range(len(phrases)), phrases, pairwise_phrase) if pair[2] > 0]
    phrase_scores = sorted(phrase_scores, key=lambda t: t[2], reverse=descending)[:10]
    return (phrase, phrase_scores)

def get_similar_phrases(pairwise_similarity, phrases, phrase_num):
    return _rank_phrases(pairwise_similarity, phrases, phrase_num, descending=True)

def get_least_similar_phrases(pairwise_similarity, phrases, phrase_num):
    return _rank_phrases(pairwise_similarity, phrases, phrase_num, descending=False)
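
Subtracting `np.eye` zeroes the diagonal only up to floating-point error, so a phrase can still pass the `> 0` filter as its own match, and the same filter drops any pair whose similarity is not positive. One possible alternative is to exclude the query index explicitly. The helper name `rank_phrases` and the toy matrix are assumptions made for illustration:

```python
import numpy as np

def rank_phrases(pairwise_similarity, phrases, phrase_num, k=10, most_similar=True):
    """Return (phrase, [(index, phrase, score), ...]) excluding the query itself."""
    scores = np.asarray(pairwise_similarity, dtype=float)[phrase_num]
    order = np.argsort(scores)          # ascending similarity
    if most_similar:
        order = order[::-1]             # descending similarity
    # Drop the query's own index, then keep the first k candidates.
    order = [int(i) for i in order if i != phrase_num][:k]
    return phrases[phrase_num], [(i, phrases[i], float(scores[i])) for i in order]

# Toy data (hypothetical phrases and similarities).
phrases = ["add fractions", "subtract fractions", "graph functions"]
similarity = np.array([[1.0, 0.9, 0.2],
                       [0.9, 1.0, 0.3],
                       [0.2, 0.3, 1.0]])
phrase, ranked = rank_phrases(similarity, phrases, 0, k=2)
print(phrase, ranked)
```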

In [10]:
(phrase, phrase_scores) = get_similar_phrases(spacy_similarity, corpus, 100)
print("phrase: %s\nsimilar phrases:" % phrase)
pp.pprint(phrase_scores)

phrase: compare two fractions with the same numerator or the same denominator by reasoning about their size; recognize that comparisons are valid only when the two fractions refer to the same whole and record the results of comparisons with the symbols >, =, or 

similar phrases:
[   (   140,
        'compare two decimals to hundredths by reasoning about their size. '
        'Recognize that comparisons are valid only when the two decimals refer '
        'to the same whole. Record the results of comparisons with the symbols '
        '>, =, or \n',
        0.9938489477098706),
    (   128,
        'compare two fractions with different numerators and different '
        'denominators (e.g., by using virtual fraction models, by creating '
        'common denominators or numerators, or by comparing to a benchmark '
        'fraction such as 1/2); recognize that comparisons are valid only when '
        'the two fractions refer to the same whole; record the results of '
        'compariso

In [11]:
(phrase, phrase_scores) = get_least_similar_phrases(spacy_similarity, corpus, 100)
print("phrase: %s\nleast similar phrases:" % phrase)
pp.pprint(phrase_scores)

phrase: compare two fractions with the same numerator or the same denominator by reasoning about their size; recognize that comparisons are valid only when the two fractions refer to the same whole and record the results of comparisons with the symbols >, =, or 

least similar phrases:
[   (   100,
        'compare two fractions with the same numerator or the same denominator '
        'by reasoning about their size; recognize that comparisons are valid '
        'only when the two fractions refer to the same whole and record the '
        'results of comparisons with the symbols >, =, or \n',
        3.078107524423501e-08),
    (737, 'calculate acceleration vectors', 0.38898655014366695),
    (685, 'evaluate improper integrals', 0.4491147117799509),
    (554, 'compose functions', 0.4664251179671885),
    (716, 'calculate dot products', 0.5218899415902706),
    (910, 'solve optimization problems', 0.535102340901564),
    (689, 'sketch curves in polar coordinates', 0.5402664260352563),
    (1

## Clustering documents based on topic

In [64]:
from sklearn.cluster import KMeans, SpectralClustering

In [65]:
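def cluster_phrases(clusterer, topic_model, corpus):
    """Cluster the document-topic matrix `topic_model` and tag each phrase.

    Requires an estimator with `predict` and `transform` (e.g. KMeans).
    Returns the fitted model and a list of [cluster, phrase, score] entries,
    where score is the smallest value in the phrase's `transform` row
    (for KMeans, the distance to the nearest cluster centre).
    """
    cluster_model = clusterer.fit(topic_model)
    cluster_topics = cluster_model.predict(topic_model)
    score_clusters = cluster_model.transform(topic_model).min(axis=1)
    clustered_corpus = [list(phrase) for phrase in zip(cluster_topics, corpus, score_clusters)]
    return (cluster_model, clustered_corpus)

def get_cluster_phrases(clustered_corpus, cluster_num):
    """Return up to 10 phrases in `cluster_num` with the lowest scores."""
    phrases = [phrase for phrase in clustered_corpus if phrase[0] == cluster_num]
    phrases_sorted = sorted(phrases, key=lambda t: t[2])[:10]
    return phrases_sorted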

In [66]:
kmeans, kmeans_corpus = cluster_phrases(KMeans(n_clusters=50, random_state=0), lda_topics, corpus)
get_cluster_phrases(kmeans_corpus, 1)

[[1, 'calculate cross products', 0.18786611565908864],
 [1,
  'solve word problems involving addition and subtraction of fractions referring to the same whole and having like denominators by using visual fraction models and equations to represent the problem',
  0.20403922683109657],
 [1,
  'distinguish among continuous, integer, and binary contexts',
  0.21139554138869376],
 [1,
  'solve word problems involving multiplication of a fraction by a whole number (e.g., by using visual fraction models and equations to represent the problem. For example, if each person at a party will eat 3/8 of a pound of roast beef and there will be 5 people at the party, how many pounds of roast beef will be needed? Between what two whole numbers does your answer lie?)\n',
  0.21487467576674626],
 [1,
  'add, subtract, multiply and invert matrices choosing appropriate methods including technology',
  0.22220867197355348],
 [1,
  'apply the concepts of area, volume, scale factors, and scale drawings to pla

In [68]:
# need a spectral "scoring function"
spectral, spectral_corpus = cluster_phrases(SpectralClustering(n_clusters=50, random_state=0), lda_topics, corpus)
get_cluster_phrases(spectral_corpus, 1)

AttributeError: 'SpectralClustering' object has no attribute 'predict'
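
The `AttributeError` arises because `cluster_phrases` calls `predict` and `transform`, which `KMeans` provides but `SpectralClustering` does not. scikit-learn clusterers do share `fit_predict`, so one possible workaround, sketched below, is to label with `fit_predict` and compute the score only when `transform` exists. The function name `cluster_phrases_generic` and the stub class are hypothetical; the stub only mimics the `fit_predict` interface so the sketch runs without the notebook's data.

```python
import numpy as np

def cluster_phrases_generic(clusterer, features, corpus):
    """Cluster `features`; score by distance to nearest centre when available."""
    labels = clusterer.fit_predict(features)
    if hasattr(clusterer, "transform"):
        # e.g. KMeans: distance from each sample to its nearest cluster centre
        scores = np.asarray(clusterer.transform(features)).min(axis=1)
    else:
        # e.g. SpectralClustering, DBSCAN: no per-sample distance is exposed
        scores = np.full(len(labels), np.nan)
    clustered = [[int(l), p, float(s)] for l, p, s in zip(labels, corpus, scores)]
    return clusterer, clustered

class ThresholdClusterer:
    """Hypothetical stand-in exposing only fit_predict, like SpectralClustering."""
    def fit_predict(self, X):
        return (np.asarray(X)[:, 0] > 0.5).astype(int)

# Toy features and phrases (hypothetical values).
features = np.array([[0.1, 0.9], [0.8, 0.2], [0.7, 0.3]])
corpus = ["count to 100", "compose functions", "graph functions"]
model, clustered = cluster_phrases_generic(ThresholdClusterer(), features, corpus)
print(clustered)
```

In the notebook this might be called as `cluster_phrases_generic(SpectralClustering(n_clusters=50, random_state=0), lda_topics, corpus)`. Sorting by a NaN score is not meaningful, so `get_cluster_phrases` would return such a cluster's phrases in no particular order; a real scoring function for spectral clustering is still an open item.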

In [None]:
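# Unexecuted draft: the names suggest a Gaussian mixture was intended, but the estimator is still SpectralClustering.
gmm, gmm_corpus = cluster_phrases(SpectralClustering(n_clusters=50, random_state=0), lda_topics, corpus)
get_cluster_phrases(gmm_corpus, 1)  # pass the clustered corpus, not the fitted model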

In [None]:
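# Unexecuted draft: DBSCAN takes no random_state and, like SpectralClustering, lacks predict/transform.
from sklearn.cluster import DBSCAN
dbscan, dbscan_corpus = cluster_phrases(DBSCAN(eps=0.1), lda_topics, corpus)
get_cluster_phrases(dbscan_corpus, 1)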