#Part A: Build a classification model using text data


##For Part A, you will be solving a text classification task. The training data is stored in the Homework 4 Data folder. The data consists of headlines that have been labeled for whether they are clickbait.
1.
Import the data. The headlines will become your vectorized X matrix, and the labels indicate a binary classification (clickbait or not).

In [17]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import f1_score

# Load the data
data = pd.read_csv('text_training_data.csv')
text_data = data['headline'].values
labels = data['label'].values

# Convert labels to numeric format
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

# Print data information
print("type of text_data: {}".format(type(text_data)))
print("length of text_data: {}".format(len(text_data)))
print("class balance: {}".format(np.bincount(y)))
print("\ntext_data[1]:\n{}".format(text_data[1]))

type of text_data: <class 'numpy.ndarray'>
length of text_data: 24979
class balance: [12201 12778]

text_data[1]:
CIT Posts Eighth Loss in a Row


2. Convert the headline data into an X feature matrix using a simple bag of words approach.

In [18]:
text_train, text_test, y_train, y_test = train_test_split(
    text_data, y, stratify=y, random_state=0)

# Create basic bag of words vectorizer
vect = CountVectorizer()

# Transform text into bag of words feature matrix
X_train = vect.fit_transform(text_train)
X_test = vect.transform(text_test)

# Print information about the feature matrix
print("\nFeatures shape: {}".format(X_train.shape))
print("Number of features: {}".format(len(vect.get_feature_names_out())))



Features shape: (18734, 17883)
Number of features: 17883


3. Run logistic regression to predict clickbait headlines. Remember to train_test_split your data and use GridSearchCV to find the best value of C. You should evaluate your data with F1 scoring.

In [19]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring='f1'
)

# Fit the model
grid.fit(X_train, y_train)

# Print results of first model
print("\nModel 1 (Basic Bag of Words) Results:")
print("Best C:", grid.best_params_['C'])
print("Best cross-validation F1 score: {:.3f}".format(grid.best_score_))
print("Test set F1 score: {:.3f}".format(
    f1_score(y_test, grid.predict(X_test))))


Model 1 (Basic Bag of Words) Results:
Best C: 10
Best cross-validation F1 score: 0.969
Test set F1 score: 0.976


4. Run 2 more logistic regression models by changing the vectorization approach (e.g. using n-grams, stop_words, and other techniques we discussed). In both cases, keep your logistic regression step the same. Only change how you're generating the X matrix from the text data.

In [24]:
# Model 2: Using n-grams
print("\nModel 2 (N-grams):")
vect_ngram = CountVectorizer(ngram_range=(1, 2))
X_train_ngram = vect_ngram.fit_transform(text_train)
X_test_ngram = vect_ngram.transform(text_test)

grid.fit(X_train_ngram, y_train)
print("Best C:", grid.best_params_['C'])
print("Best cross-validation F1 score: {:.3f}".format(grid.best_score_))
print("Test set F1 score: {:.3f}".format(
    f1_score(y_test, grid.predict(X_test_ngram))))

# Model 3: Using stop words and min_df
print("\nModel 3 (Stop Words + min_df):")
vect_stop = CountVectorizer(stop_words='english', min_df=5)
X_train_stop = vect_stop.fit_transform(text_train)
X_test_stop = vect_stop.transform(text_test)

grid.fit(X_train_stop, y_train)
print("Best C:", grid.best_params_['C'])
print("Best cross-validation F1 score: {:.3f}".format(grid.best_score_))
print("Test set F1 score: {:.3f}".format(
    f1_score(y_test, grid.predict(X_test_stop))))


Model 2 (N-grams):
Best C: 100
Best cross-validation F1 score: 0.969
Test set F1 score: 0.976

Model 3 (Stop Words + min_df):
Best C: 1
Best cross-validation F1 score: 0.946
Test set F1 score: 0.948


5. Which of your 3 models performed best? What are the most significant coefficients in each, and how do they compare?

In [26]:
def print_top_features(model, feature_names, n=10):
    """Print most significant coefficients for each class"""
    coef = model.best_estimator_.coef_[0]
    top_positive = np.argsort(coef)[-n:]
    top_negative = np.argsort(coef)[:n]

    print("\nTop clickbait indicators:")
    for idx in reversed(top_positive):
        print(f"{feature_names[idx]}: {coef[idx]:.3f}")

    print("\nTop non-clickbait indicators:")
    for idx in reversed(top_negative):
        print(f"{feature_names[idx]}: {coef[idx]:.3f}")

# Print top features for each model
print("\nModel 1 Important Features:")
print_top_features(grid, vect.get_feature_names_out())

print("\nModel 2 Important Features:")
print_top_features(grid, vect_ngram.get_feature_names_out())

print("\nModel 3 Important Features:")
print_top_features(grid, vect_stop.get_feature_names_out())


Model 1 Important Features:

Top clickbait indicators:
armenian: 2.696
bondholder: 2.348
dawn: 2.235
decisive: 2.146
bonkers: 2.140
blazing: 2.031
celebrity: 1.981
abduction: 1.947
abductor: 1.924
camping: 1.880

Top non-clickbait indicators:
128: -2.323
cigarette: -2.333
126: -2.382
covergirl: -2.394
billingham: -2.402
123: -2.714
alexander: -2.881
222: -3.011
beauty: -3.053
bonus: -3.155

Model 2 Important Features:

Top clickbait indicators:
18 celebs: 2.696
22 super: 2.348
according to: 2.235
aced no: 2.146
22 years: 2.140
21 quotes: 2.031
34 more: 1.981
12 creative: 1.947
12 days: 1.924
29 husbands: 1.880

Top non-clickbait indicators:
000 protest: -2.323
50 dead: -2.333
000 phobos: -2.382
about lizzie: -2.394
21 buttons: -2.402
000 pennies: -2.714
15 pictures: -2.881
10 reasons: -3.011
2009 held: -3.053
225: -3.155

Model 3 Important Features:

Top clickbait indicators:
dies: 2.696
kills: 2.348
wins: 2.235
zealand: 2.146
knicks: 2.140
iraq: 2.031
police: 1.981
australia: 1.947
a

Based on the coefficient analysis of all three models, Model 2 (using N-grams) performed best at detecting clickbait headlines. While Model 1's basic bag-of-words approach identified emotional single words (like "bonkers" and "blazing" for clickbait, and numbers like "128" and "222" for non-clickbait), and Model 3's stop-words-removed approach focused on action verbs (like "dies" and "kills" for clickbait, and "remember" and "know" for non-clickbait), Model 2's n-gram approach was most effective because it captured important contextual patterns. It identified typical clickbait structures like number-word combinations ("18 celebs", "22 super") as strong clickbait indicators, while phrases like "000 protest" and "50 dead" signaled non-clickbait content. The coefficients in Model 2 showed the highest ability to distinguish between clickbait and legitimate number usage, making it the most reliable for identifying the nuanced ways clickbait headlines are constructed.

#Part B: Build a Predictive Neural Network Using Keras

In Part B, you will run a multilayer perceptron on the iris dataset to predict flower type.

1. Load the data. Data can be imported directly using pd.read_csv() and the link http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.utils import to_categorical

iris_data = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')

print("Dataset structure:")
print(iris_data.head())
print("\nColumn names:", iris_data.columns)

# Prepare features (X) and target (y)
# Note: We select only the numeric columns for features
X = iris_data[['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']].values
# Convert species to numeric categories
y = pd.Categorical(iris_data['Species']).codes

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert target to categorical (one-hot encoding)
y_train_cat = to_categorical(y_train)
y_test_cat = to_categorical(y_test)

# Print shapes to verify data structure
print("\nData shapes:")
print(f"X_train shape: {X_train_scaled.shape}")
print(f"y_train shape: {y_train_cat.shape}")
print(f"X_test shape: {X_test_scaled.shape}")
print(f"y_test shape: {y_test_cat.shape}")

Dataset structure:
   rownames  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0         1           5.1          3.5           1.4          0.2  setosa
1         2           4.9          3.0           1.4          0.2  setosa
2         3           4.7          3.2           1.3          0.2  setosa
3         4           4.6          3.1           1.5          0.2  setosa
4         5           5.0          3.6           1.4          0.2  setosa

Column names: Index(['rownames', 'Sepal.Length', 'Sepal.Width', 'Petal.Length',
       'Petal.Width', 'Species'],
      dtype='object')

Data shapes:
X_train shape: (120, 4)
y_train shape: (120, 3)
X_test shape: (30, 4)
y_test shape: (30, 3)


2. Using the Sequential interface in Keras, build a model with 2 hidden layers with 16 neurons in each. Compile and fit the model. Assess its performance using accuracy on data that has been train_test_split.

In [3]:
def create_model_1():
    model = Sequential([
        Dense(16, input_shape=(4,), activation='relu'),  # First hidden layer
        Dense(16, activation='relu'),                    # Second hidden layer
        Dense(3, activation='softmax')                   # Output layer (3 classes)
    ])

    # Compile model
    model.compile(optimizer='adam',
                 loss='categorical_crossentropy',
                 metrics=['accuracy'])
    return model

3. Run 2 additional models using different numbers of hidden layers and/or hidden neurons.

In [4]:
def create_model_2():
    # Smaller model with 1 hidden layer
    model = Sequential([
        Dense(8, input_shape=(4,), activation='relu'),   # Single hidden layer
        Dense(3, activation='softmax')                   # Output layer
    ])

    model.compile(optimizer='adam',
                 loss='categorical_crossentropy',
                 metrics=['accuracy'])
    return model

def create_model_3():
    # Larger model with 3 hidden layers
    model = Sequential([
        Dense(32, input_shape=(4,), activation='relu'),  # First hidden layer
        Dense(16, activation='relu'),                    # Second hidden layer
        Dense(8, activation='relu'),                     # Third hidden layer
        Dense(3, activation='softmax')                   # Output layer
    ])

    model.compile(optimizer='adam',
                 loss='categorical_crossentropy',
                 metrics=['accuracy'])
    return model


4. How does the performance compare between your 3 models?

In [5]:
def train_and_evaluate_model(model, model_name):
    # Train the model
    history = model.fit(X_train_scaled, y_train_cat,
                       epochs=50,
                       batch_size=32,
                       validation_split=0.2,
                       verbose=1)  # Changed to 1 to see training progress

    # Evaluate the model
    score = model.evaluate(X_test_scaled, y_test_cat, verbose=0)
    print(f"\n{model_name} Test Accuracy: {score[1]:.3f}")
    return score[1]

# Create and train all models
print("\nTraining Model 1...")
model1 = create_model_1()
accuracy1 = train_and_evaluate_model(model1, "Model 1 (2 layers, 16 neurons each)")

print("\nTraining Model 2...")
model2 = create_model_2()
accuracy2 = train_and_evaluate_model(model2, "Model 2 (1 layer, 8 neurons)")

print("\nTraining Model 3...")
model3 = create_model_3()
accuracy3 = train_and_evaluate_model(model3, "Model 3 (3 layers, 32-16-8 neurons)")

# Print comparison summary
print("\nModel Comparison Summary:")
print("-" * 50)
print(f"Model 1 (2 layers, 16 neurons each): {accuracy1:.3f}")
print(f"Model 2 (1 layer, 8 neurons): {accuracy2:.3f}")
print(f"Model 3 (3 layers, 32-16-8 neurons): {accuracy3:.3f}")


Training Model 1...


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Epoch 1/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 117ms/step - accuracy: 0.1927 - loss: 1.1345 - val_accuracy: 0.3333 - val_loss: 1.0827
Epoch 2/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 24ms/step - accuracy: 0.2643 - loss: 1.1011 - val_accuracy: 0.4167 - val_loss: 1.0625
Epoch 3/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 22ms/step - accuracy: 0.3672 - loss: 1.0716 - val_accuracy: 0.4583 - val_loss: 1.0440
Epoch 4/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 23ms/step - accuracy: 0.3385 - loss: 1.0701 - val_accuracy: 0.4583 - val_loss: 1.0270
Epoch 5/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 22ms/step - accuracy: 0.3997 - loss: 1.0434 - val_accuracy: 0.5000 - val_loss: 1.0101
Epoch 6/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 39ms/step - accuracy: 0.4531 - loss: 1.0171 - val_accuracy: 0.5417 - val_loss: 0.9937
Epoch 7/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━

Model 1 utilized a moderate architecture with two hidden layers containing 16 neurons each, followed by a final output layer with 3 neurons (one for each Iris class). This model achieved a strong accuracy of 93.3% on the test data, demonstrating effective learning of the Iris classification patterns. Model 2 took a simpler approach with just one hidden layer of 8 neurons, and surprisingly matched Model 1's performance with the same 93.3% accuracy, suggesting that the Iris dataset's patterns are simple enough to be captured by a less complex architecture. Model 3 attempted a more sophisticated design with three hidden layers (32, 16, and 8 neurons respectively), but significantly underperformed with only 63.3% accuracy, likely due to overfitting and the model being unnecessarily complex for this straightforward classification task. These results emphasize a crucial lesson in neural network design: simpler architectures (like Model 2) can often perform just as well as or better than more complex ones, especially for relatively straightforward datasets like Iris classification.