**Common Core State Standards (CCSS) alignment** is the process of mapping educational content, curricula, or assessments to the Common Core State Standards. These standards define what students should know and be able to do at each grade level in English Language Arts (ELA) and Mathematics, with the goal of keeping education consistent across U.S. states.

Why is CCSS Alignment Important?

**Ensures Consistency**: Aligning lessons to CCSS ensures students across different states follow a similar learning path.

**Guides Curriculum Development**: Helps educators design instructional materials that meet educational benchmarks.

**Improves Assessment Accuracy**: Ensures standardized tests measure the skills outlined in CCSS.

**Supports Personalized Learning**: Helps educators tailor lessons to meet students’ specific needs while staying within CCSS guidelines.


***We will analyze a dataset of more than 1,500 rows and use NLP and ML to build a model that takes any text and a number N as input and returns the N CCSS IDs closest to that text, ranked by closeness.***

In [17]:
# Import dependencies
import pandas as pd
import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

In [7]:
# Load the Excel file
file_path = "/content/ccss.xlsx"
df = pd.read_excel(file_path)

# Display the first few rows
df.head()

Unnamed: 0,id,content_type,category_id,category_name,grade_id,grade_name,item,description
0,CCSS.ELA-LITERACY.L.K.1,ELA-LITERACY,L,Language,K,Kindergarten,1,Demonstrate command of the conventions of stan...
1,CCSS.ELA-LITERACY.L.K.1.a,ELA-LITERACY,L,Language,K,Kindergarten,1a,Print many upper- and lowercase letters.
2,CCSS.ELA-LITERACY.L.K.1.b,ELA-LITERACY,L,Language,K,Kindergarten,1b,Use frequently occurring nouns and verbs.
3,CCSS.ELA-LITERACY.L.K.1.c,ELA-LITERACY,L,Language,K,Kindergarten,1c,Form regular plural nouns orally by adding /s/...
4,CCSS.ELA-LITERACY.L.K.1.d,ELA-LITERACY,L,Language,K,Kindergarten,1d,Understand and use question words (interrogati...


The ccss.xlsx file contains the following key columns:

- id: the CCSS identifier (e.g., "CCSS.ELA-LITERACY.L.K.1")
- description: the text explaining the standard
- Other metadata such as category_name (e.g., "Language") and grade_name (e.g., "Kindergarten")

In [8]:
# Download necessary NLTK resources
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [9]:
# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

In [15]:
# Function to preprocess text without NLTK dependencies
def basic_preprocess(text):
    text = str(text).lower()  # Convert to lowercase
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    tokens = text.split()  # Tokenization
    stop_words = {"the", "and", "is", "in", "to", "of", "for", "on", "with", "a", "an", "this", "that"}  # Basic stopwords
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return " ".join(tokens)

In [18]:
# Apply preprocessing to the 'description' column
if "description" in df.columns:
    df["cleaned_description"] = df["description"].apply(basic_preprocess)

In [19]:
# Show the first few rows with cleaned text
df[["description", "cleaned_description"]].head()

Unnamed: 0,description,cleaned_description
0,Demonstrate command of the conventions of stan...,demonstrate command conventions standard engli...
1,Print many upper- and lowercase letters.,print many upper lowercase letters
2,Use frequently occurring nouns and verbs.,use frequently occurring nouns verbs
3,Form regular plural nouns orally by adding /s/...,form regular plural nouns orally by adding s o...
4,Understand and use question words (interrogati...,understand use question words interrogatives e...
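One side effect of deleting punctuation outright is visible in the cleaned text: hyphenated words are fused, so "two-way" becomes "twoway". The sketch below is a hypothetical variant of `basic_preprocess` (the name `preprocess_keep_word_breaks` is not part of the notebook) that replaces punctuation with a space instead. It is only an illustration of the trade-off, not a change to the pipeline used in the rest of this notebook.

```python
import re

def preprocess_keep_word_breaks(text):
    """Hypothetical variant of basic_preprocess: punctuation becomes a space."""
    text = str(text).lower()
    # Replace punctuation with a space so "two-way" splits into "two" and "way"
    text = re.sub(r"[^\w\s]", " ", text)
    # Same basic stopword list as basic_preprocess
    stop_words = {"the", "and", "is", "in", "to", "of", "for", "on", "with", "a", "an", "this", "that"}
    tokens = [word for word in text.split() if word not in stop_words]
    return " ".join(tokens)

print(preprocess_keep_word_breaks("Construct and interpret two-way frequency tables."))
# construct interpret two way frequency tables
```

Whichever variant is chosen, the same function should be applied to both the stored descriptions and any query text, so that both are tokenized the same way.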


Let's now convert the text into a numerical representation that scikit-learn can process. Text-to-numeric conversion (TF-IDF vectorization):

- Convert text to TF-IDF vectors using TfidfVectorizer.
- Store the numerical representation for each CCSS description.
- Prepare for similarity search (using cosine similarity).

TF-IDF (Term Frequency-Inverse Document Frequency) converts the cleaned text into numerical vectors. The result is a TF-IDF matrix in which each row represents a CCSS description and each column represents a unique word.
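To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three toy descriptions. It assumes the scheme scikit-learn documents as the `TfidfVectorizer` default (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization). The function name `tfidf_vectors` and the toy data are illustrative only, and the whitespace tokenization is a simplification: `TfidfVectorizer` uses its own token pattern, which by default drops single-character tokens.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed to mirror smooth_idf=True)
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Scale each row to unit length; an empty document yields an empty vector
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return vectors

toy_docs = ["spell correctly", "use nouns verbs", "use punctuation correctly"]
for vec in tfidf_vectors(toy_docs):
    print({term: round(w, 3) for term, w in vec.items()})
```

In the first toy document, "spell" occurs in only one description while "correctly" occurs in two, so "spell" receives the larger weight. This is the property that lets rare, distinctive words dominate the similarity search.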

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [21]:
# Collect the cleaned descriptions as a list of strings
descriptions = df["cleaned_description"].tolist()

In [22]:
# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

In [23]:
# Fit and transform the descriptions
tfidf_matrix = tfidf_vectorizer.fit_transform(descriptions)

Steps to Build a Similarity Search Model:

- Transform input text into a TF-IDF vector using the same TfidfVectorizer model.
- Compute cosine similarity between the input vector and all CCSS descriptions.
- Retrieve the top N most similar CCSS IDs based on similarity scores.
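The steps above can be sketched without any libraries to show what the similarity search computes. This is an illustration under stated assumptions, not the notebook's implementation: vectors are stored as `{term: weight}` dicts, and the names `cosine`, `top_n_ids`, and the `STD.*` identifiers are hypothetical toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def top_n_ids(query_vec, doc_vecs, ids, n):
    """Score every document against the query and keep the n best (score, id) pairs."""
    scored = [(cosine(query_vec, vec), doc_id) for vec, doc_id in zip(doc_vecs, ids)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]

# Hypothetical toy vectors standing in for rows of the TF-IDF matrix
doc_vecs = [
    {"spell": 1.0, "correctly": 1.0},
    {"use": 1.0, "nouns": 1.0, "verbs": 1.0},
    {"use": 1.0, "punctuation": 1.0},
]
ids = ["STD.A", "STD.B", "STD.C"]
query = {"use": 1.0, "punctuation": 1.0}

for score, doc_id in top_n_ids(query, doc_vecs, ids, n=2):
    print(doc_id, round(score, 3))
```

Here `STD.C` ranks first because it shares every term with the query, and `STD.B` second because it shares only "use".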

In [25]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [31]:
# Use cosine similarity to measure how close an input text is to each CCSS description.
# The function cleans the input the same way as the descriptions, converts it into a
# TF-IDF vector, and computes its similarity with all CCSS descriptions.
# Finally, it returns the top N closest CCSS IDs based on relevance.

def find_similar_ccss(input_text, top_n=5):
    # Clean the input with the same preprocessing used for the descriptions,
    # then transform it into a TF-IDF vector
    input_vector = tfidf_vectorizer.transform([basic_preprocess(input_text)])

    # Compute cosine similarity
    similarity_scores = cosine_similarity(input_vector, tfidf_matrix).flatten()

    # Get top N most similar CCSS IDs
    top_indices = np.argsort(similarity_scores)[::-1][:top_n]

    # Return the top matching CCSS IDs and their descriptions
    return df.iloc[top_indices][["id", "description", "cleaned_description"]]

In [36]:
# Compare a sample input to all existing descriptions using cosine similarity
# and print the top N closest CCSS IDs based on relevance

input_text = "I love math"
top_matches = find_similar_ccss(input_text, top_n=5)
print(top_matches)

                                id  \
1521  CCSS.MATH.CONTENT.HSS.CP.A.4   
1553  CCSS.MATH.CONTENT.HSS.MD.B.7   
509      CCSS.ELA-LITERACY.L.6.1.e   
511      CCSS.ELA-LITERACY.L.6.2.a   
512      CCSS.ELA-LITERACY.L.6.2.b   

                                            description  \
1521  Construct and interpret two-way frequency tabl...   
1553  (+) Analyze decisions and strategies using pro...   
509   Recognize variations from standard English in ...   
511   Use punctuation (commas, parentheses, dashes) ...   
512                                    Spell correctly.   

                                    cleaned_description  
1521  construct interpret twoway frequency tables da...  
1553  analyze decisions strategies using probability...  
509   recognize variations from standard english the...  
511   use punctuation commas parentheses dashes set ...  
512                                     spell correctly  
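The printed result lists IDs but not their similarity scores, so it is hard to tell how strong each match is. With a short query, only a few vocabulary terms may overlap with any description. A possible refinement, sketched below under stated assumptions, is to return each score alongside its ID and to drop candidates whose score is not above a threshold. The helper `rank_with_scores` and the toy values are hypothetical; in this notebook one could pass the `similarity_scores` array and `df["id"]` instead.

```python
def rank_with_scores(scores, ids, top_n=5, min_score=0.0):
    """Return (id, score) pairs sorted by descending score, keeping only scores above min_score."""
    candidates = [(doc_id, float(s)) for doc_id, s in zip(ids, scores) if s > min_score]
    # list.sort is stable, so equal scores keep their original order
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_n]

# Hypothetical toy scores: most descriptions share no terms with the query
toy_scores = [0.0, 0.42, 0.0, 0.17, 0.0]
toy_ids = ["STD.A", "STD.B", "STD.C", "STD.D", "STD.E"]
print(rank_with_scores(toy_scores, toy_ids, top_n=3))
# [('STD.B', 0.42), ('STD.D', 0.17)]
```

Returning fewer than N rows when nothing else overlaps with the query may be preferable to padding the list with unrelated standards.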


This model can be used to automate CCSS alignment for lesson plans, assessments, or educational platforms. It could be extended with deep learning models such as Sentence Transformers (BERT-based sentence embeddings) for better semantic understanding.