 # **Unlocking Student Summaries: AI-Driven Assessment of Writing Skills**

## Problem Statement:
This project aims to give educators and learners an AI tool that evaluates the quality of student-written summaries, scoring how well they represent the main ideas and how clear and fluent the language is. The goal is to make the assessment of summary writing more efficient and effective.

## Introduction:

In today's rapidly evolving educational landscape, the importance of fostering strong writing skills among students cannot be overstated. Among the many facets of writing, summary writing holds a distinct place for its potential to enhance reading comprehension, critical thinking, and overall writing abilities. However, the labor-intensive nature of evaluating student-written summaries has long been a challenge for educators. With the advent of modern technology and Natural Language Processing (NLP), a groundbreaking solution emerges.

The "Unlocking Student Summaries" project seeks to harness the power of AI to address this challenge effectively. We are presented with a unique opportunity to develop a model that can accurately assess the quality of student summaries, offering immediate feedback and freeing up educators to focus on teaching rather than time-consuming grading. This innovation is made possible by leveraging NLP techniques, machine learning models, and advanced text processing tools.

By utilizing a dataset of real student summaries, this project aspires to bring a transformative change in the field of education. CommonLit, a nonprofit education technology organization, spearheads this mission, dedicated to ensuring that all students, particularly those in Title I schools, graduate with the necessary skills for success in higher education and beyond. This collaboration with educational organizations embodies the essence of using technology to bridge educational gaps and provide students with more opportunities to develop their summarization, reading comprehension, critical thinking, and writing skills.

This project's significance extends beyond just the classroom. It contributes to a more inclusive and equitable education system by providing tools to assess and improve writing skills, which are critical for academic and professional success. Through innovative language models and AI-driven evaluation, we aim to change the way summaries are assessed, bringing efficiency and objectivity into the educational landscape. The journey to "Unlocking Student Summaries" promises to revolutionize the assessment of student writing, fostering the growth and development of learners across the globe.

# Loading the Necessary Libraries

In [None]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from textblob import TextBlob
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import spacy
import gensim
import re

# Initialize NLTK's stopwords and stemmer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Load spaCy's English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load('en_core_web_sm')


[nltk_data] Error loading punkt: <urlopen error [Errno -3] Temporary
[nltk_data]     failure in name resolution>
[nltk_data] Error loading stopwords: <urlopen error [Errno -3]
[nltk_data]     Temporary failure in name resolution>


# Data Collection and Cleaning

### Loading the data sets into DataFrame using Pandas

In [None]:
df_prompts_test=pd.read_csv("/kaggle/input/commonlit-evaluate-student-summaries/prompts_test.csv")
df_prompts_train=pd.read_csv("/kaggle/input/commonlit-evaluate-student-summaries/prompts_train.csv")
df_summaries_test=pd.read_csv("/kaggle/input/commonlit-evaluate-student-summaries/summaries_test.csv")
df_summaries_train=pd.read_csv("/kaggle/input/commonlit-evaluate-student-summaries/summaries_train.csv")

In [None]:
df_prompts_test.head()

### Merging the two training datasets, summaries_train and prompts_train as df.

In [None]:
df = pd.merge(df_prompts_train, df_summaries_train, on="prompt_id")

### Merging the two test datasets, summaries_test and prompts_test as df_test.

In [None]:
df_test = pd.merge(df_prompts_test, df_summaries_test, on="prompt_id")

In [None]:
df_test.head()

In [None]:
# Set "student_id" as the index
df.set_index("student_id", inplace=True)

In [None]:
df.head()

# Data Exploration and Feature Engineering
In this section we extract linguistic features from the prompts and summaries (the number of verbs, nouns, adjectives, sentences, punctuation marks, and stop words) and collect them in a numeric DataFrame for further analysis.

In [None]:
# Function to count verbs in a text
def count_verbs(text):
    tokens = word_tokenize(text)
    tags = pos_tag(tokens)
    verb_count = len([word for word, pos in tags if pos.startswith('VB')])
    return verb_count

In [None]:
# Apply the function to the 'prompt_question' column
df["verb_count_pq"] = df["prompt_question"].apply(count_verbs)

In [None]:
# Apply the function to the 'prompt_text' column
df["verb_count_pt"] = df["prompt_text"].apply(count_verbs)

In [None]:
# Apply the function to the 'text' column
df["verb_count_txt"] = df["text"].apply(count_verbs)

Finding the number of nouns in each of the texts

In [None]:
# Function to count nouns in a text
def count_nouns(text):
    tokens = word_tokenize(text)
    tags = pos_tag(tokens)
    noun_count = len([word for word, pos in tags if pos.startswith('NN')])
    return noun_count

df['noun_count_pq'] contains the number of nouns in each row of the 'prompt_question' column


In [None]:
# Apply the function to the 'prompt_question' column
df["noun_count_pq"] = df["prompt_question"].apply(count_nouns)

df['noun_count_pt'] contains the number of nouns in each row of the 'prompt_text' column

In [None]:
# Apply the function to the 'prompt_text' column
df["noun_count_pt"] = df["prompt_text"].apply(count_nouns)

df['noun_count_txt'] contains the number of nouns in each row of the 'text' column

In [None]:
# Apply the function to the 'text' column
df["noun_count_txt"] = df["text"].apply(count_nouns)

In [None]:
# Function to count adjectives in a text
def count_adjectives(text):
    tokens = word_tokenize(text)
    tags = pos_tag(tokens)
    adjective_count = len([word for word, pos in tags if pos.startswith('JJ')])
    return adjective_count

df["adj_count_pq"] contains the number of adjectives in each row of the 'prompt_question' column

In [None]:
# Apply the function to the 'prompt_question' column
df["adj_count_pq"] = df["prompt_question"].apply(count_adjectives)

df["adj_count_pt"] contains the number of adjectives in each row of the 'prompt_text' column


In [None]:
# Apply the function to the 'prompt_text' column
df["adj_count_pt"] = df["prompt_text"].apply(count_adjectives)

df["adj_count_txt"] contains the number of adjectives in each row of the 'text' column

In [None]:
# Apply the function to the 'text' column
df["adj_count_txt"] = df["text"].apply(count_adjectives)

In [None]:
# Function to count sentences
def count_sentences(text):
    sentences = nltk.sent_tokenize(text)
    return len(sentences)

In [None]:
# Apply the function to the 'prompt_question' column
df["sent_count_pq"] = df["prompt_question"].apply(count_sentences)

In [None]:
# Apply the function to the 'prompt_text' column
df["sent_count_pt"] = df["prompt_text"].apply(count_sentences)

In [None]:
# Apply the function to the 'text' column
df["sent_count_txt"] = df["text"].apply(count_sentences)

In [None]:
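import string
# Function to count punctuation tokens in a text
def count_punctuation(text):
    tokens = word_tokenize(text)
    # Count tokens made up entirely of punctuation characters. Checking the token
    # itself is more reliable than comparing its POS tag with string.punctuation,
    # which misses multi-character tags such as the ones NLTK assigns to quotes.
    punctuation_count = len([tok for tok in tokens if all(ch in string.punctuation for ch in tok)])
    return punctuation_count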

In [None]:
# Apply the function to the 'prompt_question' column
df["punct_count_pq"] = df["prompt_question"].apply(count_punctuation)

In [None]:
# Apply the function to the 'prompt_text' column
df["punct_count_pt"] = df["prompt_text"].apply(count_punctuation)

In [None]:
# Apply the function to the 'text' column
df["punct_count_txt"] = df["text"].apply(count_punctuation)

In [None]:
# Function to count stop words in a text
def count_stop_words(text):
    # Define a list of stop words (you can customize this list)
    stop_words = set(['the', 'and', 'in', 'to', 'of', 'a', 'for', 'on', 'with', 'at'])
    tokens = word_tokenize(text)
    tags = pos_tag(tokens)
    stop_words_count = len([word for word, pos in tags if word.lower() in stop_words])
    return stop_words_count

In [None]:
# Apply the function to the 'prompt_question' column
df["stw_count_pq"] = df["prompt_question"].apply(count_stop_words)

In [None]:
# Apply the function to the 'prompt_text' column
df["stw_count_pt"] = df["prompt_text"].apply(count_stop_words)

In [None]:
# Apply the function to the 'text' column
df["stw_count_txt"] = df["text"].apply(count_stop_words)
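The counting functions above each tokenize and tag the same text again, which gets expensive on the long `prompt_text` passages. As a rough sketch of an alternative, the tags can be computed once and all three part-of-speech counts derived in a single pass. The helper name `count_pos_groups` and the hand-written `(word, tag)` pairs are illustrative assumptions; in the notebook the pairs would come from `pos_tag(word_tokenize(text))`.

```python
from collections import Counter

# Penn Treebank tag prefixes, matching the startswith() checks used above.
TAG_PREFIXES = {"verb": "VB", "noun": "NN", "adj": "JJ"}

def count_pos_groups(tagged_tokens):
    """Count verbs, nouns and adjectives from one list of (word, tag) pairs."""
    counts = Counter({name: 0 for name in TAG_PREFIXES})
    for _word, tag in tagged_tokens:
        for name, prefix in TAG_PREFIXES.items():
            if tag.startswith(prefix):
                counts[name] += 1
    return dict(counts)

# Hypothetical, hand-written tags standing in for pos_tag(word_tokenize(text)).
example_tags = [
    ("The", "DT"), ("young", "JJ"), ("students", "NNS"),
    ("wrote", "VBD"), ("clear", "JJ"), ("summaries", "NNS"), (".", "."),
]
pos_counts = count_pos_groups(example_tags)
print(pos_counts)
```

The per-feature columns could then be filled from the returned dictionary, so each text is tagged once rather than once per feature.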

In [None]:
df_num = df[[
    "content", "wording", "verb_count_pq", "verb_count_pt", "verb_count_txt", 
     "noun_count_pq", "noun_count_pt", "noun_count_txt", 
     "adj_count_pq", "adj_count_pt", "adj_count_txt",
     "sent_count_pq","sent_count_pt", "sent_count_txt",
    "punct_count_pq", "punct_count_pt", "punct_count_txt",
    "stw_count_pq", "stw_count_pt", "stw_count_txt"
]]

In [None]:
df_num.describe()

In [None]:
# Create a 3x5 grid of subplots with a total figure size of (20, 12)
fig, ax = plt.subplots(3, 5, figsize=(20, 12))

# Plot 1: Distribution plot for 'content'
sns.histplot(data=df, x="content", kde=True, ax=ax[0][0])
ax[0][0].set_title("Distribution of Content")

# Plot 2: Distribution plot for 'wording'
sns.histplot(data=df, x="wording", kde=True, ax=ax[0][1])
ax[0][1].set_title("Distribution of Wording")

# Plot 3: Distribution plot for 'verb_count_pq'
sns.histplot(data=df, x="verb_count_pq", kde=True, ax=ax[0][2])
ax[0][2].set_title("Distribution of Verb Count in Prompt Question")

# Plot 4: Distribution plot for 'verb_count_pt'
sns.histplot(data=df, x="verb_count_pt", kde=True, ax=ax[0][3])
ax[0][3].set_title("Distribution of Verb Count in Prompt Text")

# Plot 5: Distribution plot for 'verb_count_txt'
sns.histplot(data=df, x="verb_count_txt", kde=True, ax=ax[0][4])
ax[0][4].set_title("Distribution of Verb Count in Text")

# Plot 6: Distribution plot for 'noun_count_pq'
sns.histplot(data=df, x="noun_count_pq", kde=True, ax=ax[1][0])
ax[1][0].set_title("Distribution of Noun Count in Prompt Question")

# Plot 7: Distribution plot for 'noun_count_pt'
sns.histplot(data=df, x="noun_count_pt", kde=True, ax=ax[1][1])
ax[1][1].set_title("Distribution of Noun Count in Prompt Text")

# Plot 8: Distribution plot for 'noun_count_txt'
sns.histplot(data=df, x="noun_count_txt", kde=True, ax=ax[1][2])
ax[1][2].set_title("Distribution of Noun Count in Text")

# Plot 9: Distribution plot for 'adj_count_pq'
sns.histplot(data=df, x="adj_count_pq", kde=True, ax=ax[1][3])
ax[1][3].set_title("Distribution of Adjective Count in Prompt Question")

# Plot 10: Distribution plot for 'adj_count_pt'
sns.histplot(data=df, x="adj_count_pt", kde=True, ax=ax[1][4])
ax[1][4].set_title("Distribution of Adjective Count in Prompt Text")

# Plot 11: Distribution plot for 'adj_count_txt'
sns.histplot(data=df, x="adj_count_txt", kde=True, ax=ax[2][0])
ax[2][0].set_title("Distribution of Adjective Count in Text")

# Plot 12: Distribution plot for 'sent_count_pq'
sns.histplot(data=df, x="sent_count_pq", kde=True, ax=ax[2][1])
ax[2][1].set_title("Distribution of Sentence Count in Prompt Question")
             
# Plot 13: Distribution plot for 'sent_count_pt'
sns.histplot(data=df, x="sent_count_pt", kde=True, ax=ax[2][2])
ax[2][2].set_title("Distribution of Sentence Count in Prompt Text")
             
# Plot 14: Distribution plot for 'sent_count_txt'
sns.histplot(data=df, x="sent_count_txt", kde=True, ax=ax[2][3])
ax[2][3].set_title("Distribution of Sentence Count in Text")

# Remove the empty subplot
fig.delaxes(ax[2][4])

# Adjust layout
plt.tight_layout()

# Show the plots
plt.show()

In [None]:
cor_word = df_num.drop(columns=["content", "wording"]).corrwith(df_num["wording"])
cor_word

In [None]:
cor_content = df_num.drop(columns=["content", "wording"]).corrwith(df_num["content"])
cor_content
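By default, `corrwith` reports the Pearson correlation between each feature column and the target. A minimal sketch of that calculation in plain Python may help when reading the values; the toy numbers below are made up for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only: a feature that rises in step with the score.
toy_noun_counts = [5, 10, 15, 20]
toy_scores = [-1.0, 0.0, 1.0, 2.0]
r = pearson(toy_noun_counts, toy_scores)
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship. Pearson correlation does not capture non-linear dependence.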

Scatter plots of the most highly correlated features against content

In [None]:
# Create subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

# Plot verb_count_txt against content using Seaborn
sns.scatterplot(data=df, x="verb_count_txt", y="content", ax=axs[0, 0])
axs[0, 0].set_title("Scatterplot of verb_count_txt against content")
axs[0, 0].set_xlabel("verb_count_txt")
axs[0, 0].set_ylabel("content")

# Plot noun_count_txt against content using Seaborn
sns.scatterplot(data=df, x="noun_count_txt", y="content", ax=axs[0, 1])
axs[0, 1].set_title("Scatterplot of noun_count_txt against content")
axs[0, 1].set_xlabel("noun_count_txt")
axs[0, 1].set_ylabel("content")

# Plot adj_count_txt against content using Seaborn
sns.scatterplot(data=df, x="adj_count_txt", y="content", ax=axs[1, 0])
axs[1, 0].set_title("Scatterplot of adj_count_txt against content")
axs[1, 0].set_xlabel("adj_count_txt")
axs[1, 0].set_ylabel("content")

# Plot sent_count_txt against content using Seaborn
sns.scatterplot(data=df, x="sent_count_txt", y="content", ax=axs[1, 1])
axs[1, 1].set_title("Scatterplot of sent_count_txt against content")
axs[1, 1].set_xlabel("sent_count_txt")
axs[1, 1].set_ylabel("content")

# Adjust layout
plt.tight_layout()

# Show the plot
plt.show()

Scatter plots of the most highly correlated features against wording

In [None]:
# Create subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

# Plot verb_count_txt against wordings using Seaborn
sns.scatterplot(data=df, x="verb_count_txt", y="wording", ax=axs[0, 0])
axs[0, 0].set_title("Scatterplot of verb_count_txt against wording")
axs[0, 0].set_xlabel("verb_count_txt")
axs[0, 0].set_ylabel("wording")

# Plot noun_count_txt against wordings using Seaborn
sns.scatterplot(data=df, x="noun_count_txt", y="wording", ax=axs[0, 1])
axs[0, 1].set_title("Scatterplot of noun_count_txt against wording")
axs[0, 1].set_xlabel("noun_count_txt")
axs[0, 1].set_ylabel("wording")

# Plot adj_count_txt against wordings using Seaborn
sns.scatterplot(data=df, x="adj_count_txt", y="wording", ax=axs[1, 0])
axs[1, 0].set_title("Scatterplot of adj_count_txt against wording")
axs[1, 0].set_xlabel("adj_count_txt")
axs[1, 0].set_ylabel("wording")

# Plot sent_count_txt against wordings using Seaborn
sns.scatterplot(data=df, x="sent_count_txt", y="wording", ax=axs[1, 1])
axs[1, 1].set_title("Scatterplot of sent_count_txt against wording")
axs[1, 1].set_xlabel("sent_count_txt")
axs[1, 1].set_ylabel("wording")

# Adjust layout
plt.tight_layout()

# Show the plot
plt.show()

# Feature Selection and Model Building
In this step, we will:

* Select the features and split the data into training and test sets, holding out 20% of the dataset for testing
* Build the baseline models
* Experiment with a variety of regression models, including linear models (Linear, Lasso, and Ridge Regression), Support Vector Regression, Decision Trees, Random Forests, and Gradient Boosting.
* Evaluate model performance using regression metrics: Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).
* Identify the best-performing model for the task of summary evaluation.

## Importing the relevant Libraries for the Model Building

In [None]:
import warnings
warnings.filterwarnings("ignore")

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor

## Feature Selection

In [None]:
features= ["verb_count_pq", "verb_count_pt", "verb_count_txt", 
             "noun_count_pq", "noun_count_pt", "noun_count_txt", 
             "adj_count_pq", "adj_count_pt", "adj_count_txt",
           "sent_count_pq","sent_count_pt", "sent_count_txt",
           "punct_count_pq", "punct_count_pt", "punct_count_txt",
            "stw_count_pq", "stw_count_pt", "stw_count_txt"]
X=df[features]

In [None]:
target_1="wording"
y_word=df[target_1]
target_2="content"
y_content=df[target_2]

## Splitting the data into Train and test Data set

In [None]:
# Split the features and both targets into training (80%) and test (20%) sets
X_train, X_test, y_train_content, y_test_content, y_train_word, y_test_word = train_test_split(X, y_content, y_word, test_size=0.2, random_state=42)

## Building the baseline models

In [None]:
import numpy as np

In [None]:
y_word_mean=y_train_word.mean()
y_word_pred_baseline=[y_word_mean]*len(y_train_word)
mae_baseline_word=mean_absolute_error(y_train_word,y_word_pred_baseline)
rmse_baseline_word=np.sqrt(mean_squared_error(y_train_word,y_word_pred_baseline))
print("Baseline MAE Wording:",round(mae_baseline_word, 2))
print("Baseline RMSE Wording:",round(rmse_baseline_word, 2))

The baseline **Mean Absolute Error (MAE)** for wording score prediction is 0.82. This is the average absolute difference between the actual wording scores and a constant prediction equal to the mean wording score of the training set.

The baseline **Root Mean Square Error (RMSE)** for wording score prediction is 1.03. This is the square root of the average squared difference between the actual wording scores and the same mean prediction.

These baseline error metrics serve as a reference point to gauge the performance of subsequent models, with the goal of achieving lower MAE and RMSE values as we enhance our summary evaluation system.
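To make the baseline concrete, here is a small worked sketch of the same calculation in plain Python: predict the mean for every sample, then compute MAE and RMSE against the actual values. The function name `mean_baseline_errors` and the toy scores are assumptions for illustration and do not come from the dataset.

```python
import math

def mean_baseline_errors(y_true):
    """Return (mean, MAE, RMSE) when every sample is predicted as the mean."""
    mean_value = sum(y_true) / len(y_true)
    abs_errors = [abs(y - mean_value) for y in y_true]
    sq_errors = [(y - mean_value) ** 2 for y in y_true]
    mae = sum(abs_errors) / len(y_true)
    rmse = math.sqrt(sum(sq_errors) / len(y_true))
    return mean_value, mae, rmse

# Illustrative scores only.
toy_scores = [-1.0, 0.0, 1.0, 2.0]
baseline_mean, baseline_mae, baseline_rmse = mean_baseline_errors(toy_scores)
print(baseline_mean, baseline_mae, round(baseline_rmse, 3))
```

Because the prediction is a constant, the RMSE of this baseline equals the population standard deviation of the scores it is evaluated on.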

In [None]:
y_content_mean=y_train_content.mean()
y_content_pred_baseline=[y_content_mean]*len(y_train_content)
mae_baseline_content=mean_absolute_error(y_train_content,y_content_pred_baseline)
rmse_baseline_content=np.sqrt(mean_squared_error(y_train_content,y_content_pred_baseline))
print("Baseline MAE Content:",round(mae_baseline_content, 2))
print("Baseline RMSE Content:",round(rmse_baseline_content, 2))

**Baseline MAE (Mean Absolute Error) for Content Score Prediction: 0.82**
The baseline MAE of 0.82 is the average absolute difference between the actual content scores and the baseline prediction, which is simply the mean content score of the training set. On average, this constant prediction is off by 0.82 score units, so any useful model should achieve a lower MAE.

**Baseline RMSE (Root Mean Square Error) for Content Score Prediction: 1.04**
The baseline RMSE of 1.04 is the square root of the average squared difference between the actual content scores and the same mean prediction. Because errors are squared before averaging, RMSE ignores the direction of an error and penalizes large errors more heavily than MAE does. As with MAE, lower values indicate better performance.

These baseline error metrics serve as a reference point for evaluating the performance of future models aimed at content quality assessment. The goal is to achieve lower MAE and RMSE values through model enhancements and refinements.
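The difference between the two metrics is easiest to see on a toy case. The sketch below, using made-up error values, compares two sets of prediction errors that share the same MAE but differ in RMSE because one of them concentrates the error in a single large miss.

```python
import math

def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Illustrative errors only: same total absolute error, different spread.
even_errors = [1.0, -1.0, 1.0, -1.0]   # four moderate misses
spiky_errors = [0.0, 0.0, 0.0, 4.0]    # one large miss

print(mae(even_errors), rmse(even_errors))
print(mae(spiky_errors), rmse(spiky_errors))
```

Both sets have an MAE of 1.0, but the set with the single large miss has the higher RMSE, which is why the two metrics are reported side by side.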

# Training and Evaluating the Models Using All the Variables
Here we will:
* Experiment with a variety of regression models, including Linear, Lasso, and Ridge Regression, Support Vector Regression, Decision Trees, Random Forests, and Gradient Boosting.
* Evaluate model performance using Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).
* Identify the best-performing model for the task of summary evaluation.

In [None]:
# Define a list of models
models = [
    LinearRegression(),
    Lasso(alpha=0.1),
    Ridge(alpha=0.1),
    SVR(kernel="linear"),
    DecisionTreeRegressor(),
    RandomForestRegressor(),
    GradientBoostingRegressor(),
]

# Iterate over models
for model in models:
    # Create an instance of the model
    reg = model
    
    # Fit the model to the training data
    reg.fit(X_train,y_train_content )
    
    # Make predictions on the test data
    y_pred_content = reg.predict(X_test)
    
    # Calculate evaluation metrics
    mae_content = mean_absolute_error(y_test_content, y_pred_content)
    rmse_content = np.sqrt(mean_squared_error(y_test_content, y_pred_content))
    
    # Print model name and evaluation metrics
    print(f"Model: {model.__class__.__name__}")
    print(f"MAE Content: {mae_content:.2f}")
    print(f"RMSE Content: {rmse_content:.2f}")

The evaluation compares the performance of several machine learning models for predicting content scores based on various variables. Here's a summary based on the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) metrics:

The models tested are:
- Linear Regression
- Lasso
- Ridge
- Support Vector Regressor (SVR)
- Decision Tree Regressor
- Random Forest Regressor
- Gradient Boosting Regressor

The results indicate that the Gradient Boosting Regressor performs best among the models evaluated, with the lowest MAE (0.38) and RMSE (0.51), roughly half the baseline errors of 0.82 and 1.04. The Random Forest Regressor follows closely with an MAE of 0.39 and an RMSE of 0.52. Both outperform the other models at predicting content scores from the given variables.

Both are tree ensembles, which can capture non-linear relationships between the count features and the score that the linear models cannot. On these metrics, the Gradient Boosting Regressor is the recommended choice for this prediction task.

The same models with wording as the target

In [None]:
# Define a list of models
models = [
    LinearRegression(),
    Lasso(alpha=0.1),
    Ridge(alpha=0.1),
    SVR(kernel="linear"),
    DecisionTreeRegressor(),
    RandomForestRegressor(),
    GradientBoostingRegressor(),
]

# Iterate over models
for model in models:
    # Create an instance of the model
    reg_word = model
    
    # Fit the model to the training data
    reg_word.fit(X_train,y_train_word)
    
    # Make predictions on the test data
    y_pred_word = reg_word.predict(X_test)
    
    # Calculate evaluation metrics
    mae_word = mean_absolute_error(y_test_word, y_pred_word)
    rmse_word = np.sqrt(mean_squared_error(y_test_word, y_pred_word))
    
    # Print model name and evaluation metrics
    print(f"Model: {model.__class__.__name__}")
    print(f"MAE Word: {mae_word:.2f}")
    print(f"RMSE Word: {rmse_word:.2f}")

The same models were evaluated on their ability to predict the summary wording score. The GradientBoostingRegressor gave the most accurate predictions, with the lowest Mean Absolute Error (MAE) of 0.54 and Root Mean Squared Error (RMSE) of 0.72. The RandomForestRegressor followed closely with an MAE of 0.55 and an RMSE of 0.74. Both outperform LinearRegression, Lasso, Ridge, SVR, and DecisionTreeRegressor on these metrics.

# Using the Highly Correlated Variables

### Feature Selection

In [None]:
features_sel= ["verb_count_txt", "noun_count_txt", "adj_count_txt","sent_count_txt", "punct_count_txt","stw_count_txt"]
X_sel=df[features_sel]

### Splitting the Data Set

In [None]:
# Split the selected features and both targets into training (80%) and test (20%) sets
X_sel_train, X_sel_test, y_train_content, y_test_content, y_train_word, y_test_word = train_test_split(X_sel, y_content, y_word, test_size=0.2, random_state=42)

In [None]:
# Define a list of models
models = [
    LinearRegression(),
    Lasso(alpha=0.1),
    Ridge(alpha=0.1),
    SVR(kernel="linear"),
    DecisionTreeRegressor(),
    RandomForestRegressor(),
    GradientBoostingRegressor(),
]

# Iterate over models
for model in models:
    # Create an instance of the model
    reg = model
    
    # Fit the model to the training data
    reg.fit(X_sel_train,y_train_content )
    
    # Make predictions on the test data
    y_pred_content = reg.predict(X_sel_test)
    
    # Calculate evaluation metrics
    mae_content_sel = mean_absolute_error(y_test_content, y_pred_content)
    rmse_content_sel = np.sqrt(mean_squared_error(y_test_content, y_pred_content))
    
    # Print model name and evaluation metrics
    print(f"Model: {model.__class__.__name__}")
    print(f"MAE_sel Content: {mae_content_sel:.2f}")
    print(f"RMSE_sel Content: {rmse_content_sel:.2f}")

Based on the performance metrics for predicting content scores using various regression models, the GradientBoostingRegressor stands out as the best-performing model. It achieved the lowest Mean Absolute Error (MAE) of 0.41 and Root Mean Squared Error (RMSE) of 0.54 among all models tested. This indicates its superior predictive capability in estimating content scores based on highly correlated variables compared to other models like LinearRegression, Lasso, Ridge, SVR, DecisionTreeRegressor, and RandomForestRegressor.

The same models with wording as the target

In [None]:
# Define a list of models
models = [
    LinearRegression(),
    Lasso(alpha=0.1),
    Ridge(alpha=0.1),
    SVR(kernel="linear"),
    DecisionTreeRegressor(),
    RandomForestRegressor(),
    GradientBoostingRegressor(),
]

# Iterate over models
for model in models:
    # Create an instance of the model
    reg_word = model
    
    # Fit the model to the training data
    reg_word.fit(X_sel_train,y_train_word)
    
    # Make predictions on the test data
    y_pred_word = reg_word.predict(X_sel_test)
    
    # Calculate evaluation metrics
    mae_word_sel = mean_absolute_error(y_test_word, y_pred_word)
    rmse_word_sel = np.sqrt(mean_squared_error(y_test_word, y_pred_word))
    
    # Print model name and evaluation metrics
    print(f"Model: {model.__class__.__name__}")
    print(f"MAE_sel_word: {mae_word_sel:.2f}")
    print(f"RMSE_sel_word: {rmse_word_sel:.2f}")

Based on the performance metrics for predicting summary wording scores using different regression models, the GradientBoostingRegressor stands out as the best-performing model. It achieved the lowest Mean Absolute Error (MAE) of 0.58 and Root Mean Square Error (RMSE) of 0.78, indicating its superior predictive accuracy compared to other models tested. This suggests that the GradientBoostingRegressor is the most effective in capturing the relationships between highly correlated variables and accurately predicting summary wording scores.

### Best Performing Model
Based on these performance metrics, the GradientBoostingRegressor is the most promising model, as it has the lowest MAE and RMSE for both targets. Note that the models built on the highly correlated variables did not outperform the ones built on all the variables: the errors are slightly higher for content (MAE 0.41 vs 0.38, RMSE 0.54 vs 0.51) and for wording (MAE 0.58 vs 0.54, RMSE 0.78 vs 0.72). The reduced feature set therefore trades a small amount of accuracy for a simpler model that depends only on the student's text. We proceed with hyperparameter tuning of the Gradient Boosting Regressor using the highly correlated variables.

### Hyperparameter Tuning
Tuning hyperparameters is an important step in getting the best performance out of a model such as the GradientBoostingRegressor. Techniques such as Grid Search or Random Search explore different hyperparameter combinations and keep the one with the best cross-validated score. Here we tune the GradientBoostingRegressor using **Grid Search**.
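Conceptually, grid search tries every combination in the grid and keeps the one with the lowest validation error. The following is a minimal plain-Python sketch of that loop. The scoring function `toy_validation_error` is a hypothetical stand-in for the cross-validated error that `GridSearchCV` computes, chosen only so the example runs without a dataset.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the lowest-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_validation_error(n_estimators, learning_rate):
    # Hypothetical error surface with its minimum at (200, 0.1); a real search
    # would fit a model and return its cross-validated error instead.
    return abs(n_estimators - 200) / 100 + abs(learning_rate - 0.1) * 10

toy_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.1, 0.2],
}
best_params, best_score = grid_search(toy_grid, toy_validation_error)
print(best_params, best_score)
```

With three values per hyperparameter the search evaluates nine combinations; with 5-fold cross-validation, as used in this notebook, each combination requires five model fits.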

In [None]:
# For Word Prediction:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters you want to tune
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of boosting stages to be used
    'learning_rate': [0.01, 0.1, 0.2],  # Step size shrinking to prevent overfitting
}

# Create the GradientBoostingRegressor
reg_word = GradientBoostingRegressor()

# Use GridSearchCV to find the best combination of hyperparameters
grid_search = GridSearchCV(reg_word, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_sel_train, y_train_word)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Create a GradientBoostingRegressor with the best hyperparameters
best_reg_word = GradientBoostingRegressor(**best_params)

# Fit the model to the training data
best_reg_word.fit(X_sel_train, y_train_word)

# Make predictions on the test data
y_pred_word = best_reg_word.predict(X_sel_test)

# Calculate evaluation metrics
mae_word_sel = mean_absolute_error(y_test_word, y_pred_word)
rmse_word_sel = np.sqrt(mean_squared_error(y_test_word, y_pred_word))

# Print evaluation metrics
print("MAE_sel_word:", mae_word_sel)
print("RMSE_sel_word:", rmse_word_sel)

Grid search selected a learning rate of 0.1 and 100 estimators as the best hyperparameters for the GradientBoostingRegressor. These are also scikit-learn's default values for this model.

With these hyperparameters, the model achieved a Mean Absolute Error (MAE) of 0.583 and a Root Mean Square Error (RMSE) of 0.777 when predicting summary wording scores. This matches the untuned result above (MAE 0.58, RMSE 0.78), so tuning over this grid confirmed the default configuration rather than improving on it.

In [None]:
# For Content
# Define the hyperparameters you want to tune
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of boosting stages to be used
    'learning_rate': [0.01, 0.1, 0.2],  # Step size shrinking to prevent overfitting
}

# Create the GradientBoostingRegressor
reg_content = GradientBoostingRegressor()

# Use GridSearchCV to find the best combination of hyperparameters
grid_search = GridSearchCV(reg_content, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_sel_train, y_train_content)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Create a GradientBoostingRegressor with the best hyperparameters
best_reg_content = GradientBoostingRegressor(**best_params)

# Fit the model to the training data
best_reg_content.fit(X_sel_train, y_train_content)

# Make predictions on the test data
y_pred_content = best_reg_content.predict(X_sel_test)

# Calculate evaluation metrics
mae_content_sel = mean_absolute_error(y_test_content, y_pred_content)
rmse_content_sel = np.sqrt(mean_squared_error(y_test_content, y_pred_content))

# Print evaluation metrics
print("MAE_sel_content:", mae_content_sel)
print("RMSE_sel_content:", rmse_content_sel)


The best hyperparameters found for the GradientBoostingRegressor were {'learning_rate': 0.1, 'n_estimators': 100}, which are scikit-learn's defaults. With these parameters, the model achieved a Mean Absolute Error (MAE) of 0.408 and a Root Mean Square Error (RMSE) of 0.537 in predicting content scores. This matches the untuned result on the same features (MAE 0.41, RMSE 0.54), so the grid search did not yield an improvement, although both errors remain well below the baseline values of 0.82 and 1.04.

# WORD EMBEDDING METHOD
In this step we will:
* Perform text preprocessing including tokenization, stop word removal, and stemming using the Porter Stemmer.
* Utilize the TF-IDF vectorization technique to convert text data into numerical features.
* Normalize the vectorized features to ensure consistency and facilitate modeling.
* Select our features
* Train the selected model on the preprocessed and vectorized data.
* Fine-tune hyperparameters to optimize the model's performance.
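
The steps above can be chained in a single scikit-learn `Pipeline`. The sketch below is illustrative only: the toy texts and scores are invented, the step names `tfidf` and `gbr` are arbitrary, and preprocessing, feature selection, and tuning are left out. It is not the notebook's actual implementation.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only (not from the CommonLit dataset)
toy_texts = [
    "the author explains the main idea clearly",
    "short summary",
    "the passage describes how pyramids were built by workers",
    "it is about stuff",
    "the text argues that social structure shaped ancient egypt",
    "egypt",
]
toy_scores = [1.2, -0.8, 1.5, -1.1, 1.8, -1.4]

# TfidfVectorizer L2-normalizes each row by default (norm="l2"),
# so vectorization and normalization happen in one step here.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=42)),
])

pipeline.fit(toy_texts, toy_scores)
toy_pred = pipeline.predict(["the author describes the main idea"])
print(toy_pred)
```

Keeping the vectorizer inside the pipeline means that it is fitted only on the data passed to `fit`, which matters once cross-validation is involved.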

In [None]:
df_emb = pd.merge(df_prompts_train, df_summaries_train, on="prompt_id")
df_emb_test = pd.merge(df_prompts_test, df_summaries_test, on="prompt_id")

In [None]:
df_emb.head()

### Text preprocessing

In [None]:
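def preprocess_text(text):
    """Remove stop words and stem the remaining tokens of a text."""
    # Assumes `stemmer` (a Porter stemmer) and `stop_words` were defined earlier in the notebook
    # Tokenize on whitespace; punctuation stays attached to the tokens
    words = text.split()
    # Remove stopwords and apply stemming
    words = [stemmer.stem(word) for word in words if word.lower() not in stop_words]
    return ' '.join(words)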
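
To make the effect of the function concrete, here is a small self-contained run. It assumes NLTK's `PorterStemmer` is the stemmer in use and substitutes a tiny hand-written stop-word set for whatever list the notebook defines, so the exact output on real data may differ.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Tiny hand-written stop-word set, for illustration only; the notebook's
# own `stop_words` is presumably a much larger list.
stop_words = {"the", "are", "a", "of"}

def preprocess_text(text):
    """Drop stop words, then stem each remaining whitespace-separated token."""
    words = text.split()
    return " ".join(stemmer.stem(w) for w in words if w.lower() not in stop_words)

example = preprocess_text("The students are running")
print(example)
```

Because the text is split on whitespace only, a token such as `idea.` keeps its trailing period and is treated as a different word from `idea`.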

### Apply the function to the train and test data sets

In [None]:

# Apply the same preprocessing to every text column of both data frames
text_cols = ["prompt_question", "prompt_title", "prompt_text", "text"]
for frame in (df_emb, df_emb_test):
    for col in text_cols:
        frame[col] = frame[col].apply(preprocess_text)

### TF-IDF Vectorization

In [None]:
# Candidate vocabulary sizes; the full sweep over these values runs in the model-building cell further down
max_values = [500, 1000, 1500, 2000, 2500, 5000, 7500, 10000, 12500, 15000, 17500, 20000]

# Start with the largest candidate (the value a bare loop over max_values would leave behind)
tfidf_vectorizer = TfidfVectorizer(max_features=max_values[-1])

### Applying The Vectorization on the processed Data Frame

In [None]:
from scipy.sparse import hstack

# Placeholder for the combined TF-IDF matrix (filled in the loop below)
X_tfidf = None

# Define the text columns for TF-IDF vectorization
text_columns = ["text"]

# Iterate through each text column and perform TF-IDF vectorization
for col in text_columns:
    tfidf_matrix = tfidf_vectorizer.fit_transform(df_emb[col])
    if X_tfidf is None:
        X_tfidf = tfidf_matrix
    else:
        # Concatenate the sparse matrices horizontally
        X_tfidf = hstack([X_tfidf, tfidf_matrix])

# X_tfidf now contains the combined TF-IDF features for all text columns
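
The cell above fits the vectorizer on the training frame only. To score new summaries (for example those in `df_emb_test`) with the same feature columns, the fitted vectorizer would typically be reused through `transform` rather than fitted again. A minimal sketch with invented toy texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented for illustration only
train_texts = ["pyramid built worker", "egypt social structur", "worker built tomb"]
new_texts = ["worker built pyramid quickli"]

vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF weights
X_new_tfidf = vectorizer.transform(new_texts)          # reuses them; unseen words are ignored

print(X_train_tfidf.shape, X_new_tfidf.shape)
```

Both matrices share the same columns, so a model trained on the first can predict on the second.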

### Selecting the Predicting Features

In [None]:
X = X_tfidf
y_content = df_emb["content"].values
y_wording = df_emb["wording"].values

### Train-Test Split

In [None]:
X_train, X_test, y_content_train, y_content_test, y_wording_train, y_wording_test = train_test_split(
    X, y_content, y_wording, test_size=0.2, random_state=42)

### Build Machine Learning Models

In [None]:
# List of maximum values to iterate through
max_values = [500, 1000, 1500, 2000, 2500, 5000, 7500, 10000, 12500, 15000, 17500, 20000]

for max_features in max_values:
    # Initialize the tfidf
    tfidf_vectorizer = TfidfVectorizer(max_features=max_features)
    
    # Applying the Vectorization on the processed data frame
    X_tfidf = None

    text_columns = ["text"]

    for col in text_columns:
        tfidf_matrix = tfidf_vectorizer.fit_transform(df_emb[col])
        if X_tfidf is None:
            X_tfidf = tfidf_matrix
        else:
            X_tfidf = hstack([X_tfidf, tfidf_matrix])

    X = X_tfidf
    y_content = df_emb["content"].values
    y_wording = df_emb["wording"].values

    # Train-Test Split
    X_train, X_test, y_content_train, y_content_test, y_wording_train, y_wording_test = train_test_split(
        X, y_content, y_wording, test_size=0.2, random_state=42)

    # Build the Machine Learning Model
    content_model = GradientBoostingRegressor()
    content_model.fit(X_train, y_content_train)

    wording_model = GradientBoostingRegressor()
    wording_model.fit(X_train, y_wording_train)

    # Evaluate Models
    content_predictions = content_model.predict(X_test)
    content_rmse = np.sqrt(mean_squared_error(y_content_test, content_predictions))
    content_mae = mean_absolute_error(y_content_test, content_predictions)

    wording_predictions = wording_model.predict(X_test)
    # Use np.sqrt for RMSE; the `squared=False` argument is deprecated in recent scikit-learn releases
    wording_rmse = np.sqrt(mean_squared_error(y_wording_test, wording_predictions))
    wording_mae = mean_absolute_error(y_wording_test, wording_predictions)

    print(f"Max Features: {max_features}")
    print(f"Content RMSE: {content_rmse}")
    print(f"Wording RMSE: {wording_rmse}")
    print(f"Content MAE: {content_mae}")
    print(f"Wording MAE: {wording_mae}")

The Gradient Boosting Regressor was evaluated on the vectorized, preprocessed text with each TF-IDF `max_features` setting, for both the content and wording scores of the summaries. The results are as follows:

By RMSE (Root Mean Square Error), content scores are predicted best with a maximum feature count of 5000, at an RMSE of approximately 0.6017. Wording scores also reach their lowest RMSE with 5000 maximum features, at around 0.7644.

By MAE (Mean Absolute Error), 5000 maximum features is again the best setting for both targets: an MAE of roughly 0.4712 for content scores and approximately 0.6117 for wording scores.

These results suggest that a maximum feature count of 5000 gives the best trade-off between accuracy and feature dimensionality when using the Gradient Boosting Regressor, with default settings, to predict both content and wording scores from the vectorized preprocessed text.
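
Rather than reading the best setting off the printed output, the sweep can collect its scores and pick the minimum programmatically. The sketch below is one way to do that; the helper name `sweep_max_features` is hypothetical, and the data is synthetic, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def sweep_max_features(texts, y, max_values, random_state=42):
    """Return {max_features: test RMSE} for a default GradientBoostingRegressor."""
    rmse_by_size = {}
    for max_features in max_values:
        X = TfidfVectorizer(max_features=max_features).fit_transform(texts)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=random_state)
        model = GradientBoostingRegressor(random_state=random_state).fit(X_tr, y_tr)
        rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
        rmse_by_size[max_features] = float(rmse)
    return rmse_by_size

# Synthetic toy data, for illustration only
rng = np.random.default_rng(0)
vocab = ["idea", "author", "detail", "pyramid", "worker", "summary", "text", "egypt"]
toy_texts = [" ".join(rng.choice(vocab, size=6)) for _ in range(60)]
toy_y = np.array([t.count("idea") + 0.1 * rng.normal() for t in toy_texts])

results = sweep_max_features(toy_texts, toy_y, max_values=[2, 4, 8])
best_size = min(results, key=results.get)  # setting with the lowest RMSE
print(results, best_size)
```

The same dictionary could hold MAE as well, or one entry per target, if both content and wording are swept together.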

In [None]:
# Initialize the tfidf with the restricted max_features
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

X_tfidf = None
text_columns = ["text"]

for col in text_columns:
    tfidf_matrix = tfidf_vectorizer.fit_transform(df_emb[col])
    if X_tfidf is None:
        X_tfidf = tfidf_matrix
    else:
        X_tfidf = hstack([X_tfidf, tfidf_matrix])

X = X_tfidf
y_content = df_emb["content"].values
y_wording = df_emb["wording"].values

# Train-Test Split
X_train, X_test, y_content_train, y_content_test, y_wording_train, y_wording_test = train_test_split(
    X, y_content, y_wording, test_size=0.2, random_state=42)

# Define the parameter grid to search through
param_grid = {
    'n_estimators': [100, 200, 300],  # Example values, you can modify these
    'learning_rate': [0.05, 0.1, 0.2],  # Example values, you can modify these
    # Add more parameters to be tuned
}

# Initialize the Gradient Boosting Regressor
gbr = GradientBoostingRegressor()

# Perform Grid Search with Cross Validation
grid_search = GridSearchCV(gbr, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_content_train)  # Fit the grid search on the content data


# Evaluate the best model on test data
content_predictions = grid_search.predict(X_test)
content_rmse = np.sqrt(mean_squared_error(y_content_test, content_predictions))
content_mae= mean_absolute_error(y_content_test, content_predictions)

# Print the best results
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Content RMSE: {content_rmse}")
print(f"Best Content MAE: {content_mae}")


The grid search tuned the Gradient Boosting Regressor for content predictions and selected a learning rate of 0.1 with 300 estimators (trees) in the ensemble. With TF-IDF vectorization limited to 5000 features, the tuned model reached an RMSE of 0.5517 on the test split, down from approximately 0.6017 with default settings. The MAE of 0.4290 means that predicted content scores differ from the actual scores by about 0.43 on average; the RMSE is higher because it weights large errors more heavily.

For comparison, the feature-based content model reported earlier reached an MAE of 0.408 and an RMSE of 0.537, so the TF-IDF model comes close but does not improve on it for content (the two models may not share the same test split). As always, model performance should be judged in the context of the specific problem domain and its requirements.
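
Because the search is scored with `neg_mean_squared_error`, `best_score_` is a negative mean squared error averaged over the folds, not an RMSE. The sketch below shows how it can be converted, and wraps the search in a small helper so that the same code could serve both targets. The helper name `tune_gbr` is hypothetical, and the data comes from `make_regression` as a synthetic stand-in for the TF-IDF matrix.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

def tune_gbr(X_train, y_train, param_grid, cv=5):
    """Grid-search a GradientBoostingRegressor and return the fitted search object."""
    search = GridSearchCV(
        GradientBoostingRegressor(random_state=42),
        param_grid,
        cv=cv,
        scoring="neg_mean_squared_error",
    )
    search.fit(X_train, y_train)
    return search

# Synthetic stand-in for the feature matrix and one score column
X_toy, y_toy = make_regression(n_samples=80, n_features=10, noise=0.5, random_state=0)

small_grid = {"n_estimators": [20, 40], "learning_rate": [0.1, 0.2]}
search = tune_gbr(X_toy, y_toy, small_grid, cv=3)

# best_score_ is the mean negative MSE across folds: flip the sign, then take the root
cv_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, cv_rmse)
```

This cross-validated RMSE is computed on the training folds only, so it complements the test-set RMSE printed by the notebook rather than replacing it.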

In [None]:
# Initialize the tfidf with the restricted max_features
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

X_tfidf = None
text_columns = ["text"]

for col in text_columns:
    tfidf_matrix = tfidf_vectorizer.fit_transform(df_emb[col])
    if X_tfidf is None:
        X_tfidf = tfidf_matrix
    else:
        X_tfidf = hstack([X_tfidf, tfidf_matrix])

X = X_tfidf
y_content = df_emb["content"].values
y_wording = df_emb["wording"].values

# Train-Test Split
X_train, X_test, y_content_train, y_content_test, y_wording_train, y_wording_test = train_test_split(
    X, y_content, y_wording, test_size=0.2, random_state=42)

# Define the parameter grid to search through
param_grid = {
    'n_estimators': [100, 200, 300],  # Example values, you can modify these
    'learning_rate': [0.05, 0.1, 0.2],  # Example values, you can modify these
    # Add more parameters to be tuned
}

# Initialize the Gradient Boosting Regressor
gbr = GradientBoostingRegressor()

# Perform Grid Search with Cross Validation
grid_search = GridSearchCV(gbr, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_wording_train)  # Fit the grid search on the wording data


# Evaluate the best model on test data
wording_predictions = grid_search.predict(X_test)
wording_rmse = np.sqrt(mean_squared_error(y_wording_test, wording_predictions))
wording_mae= mean_absolute_error(y_wording_test, wording_predictions)

# Print the best results
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Wording RMSE: {wording_rmse}")
print(f"Best Wording MAE: {wording_mae}")

Based on the hyperparameter tuning performed for the Gradient Boosting Regressor with TF-IDF vectorization using a maximum of 5000 features, the best parameters identified for predicting the "wording" variable were a learning rate of 0.2 and 300 estimators (trees) in the ensemble model.

The best model achieved the following results on the test split:

* Best Wording Root Mean Squared Error (RMSE): 0.704
* Best Wording Mean Absolute Error (MAE): 0.561

For both metrics, lower is better. The MAE is the average absolute difference between predicted and actual scores, while the RMSE squares the errors before averaging and therefore penalizes large misses more heavily.
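
A small worked example makes the difference between the two metrics visible. The scores below are made up for illustration: three small errors and one large one.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration: three small errors and one large one
y_true = np.array([0.0, 0.0, 0.0, 0.0])
y_pred = np.array([0.1, -0.1, 0.1, 1.5])

mae = mean_absolute_error(y_true, y_pred)           # mean(|error|)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt(mean(error^2))

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```

Here the MAE is 0.45 while the RMSE is about 0.755: the single large miss dominates the squared errors, which is why an RMSE noticeably above the MAE hints at a few badly predicted summaries.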

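In summary, tuning the Gradient Boosting model on TF-IDF features (5000 features, a learning rate of 0.2, and 300 estimators) reduced the wording RMSE from about 0.7644 with default settings to 0.704, and the MAE from about 0.6117 to 0.561. Both values are also below the MAE of 0.583 and RMSE of 0.777 reported earlier for the feature-based wording model, although that model may have been evaluated on a different split.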