
#### This notebook focuses on two unsupervised data augmentation techniques for the training data, each evaluated with Logistic Regression.
1) GMM and Random Forest semi-supervised labeling of the training data leads to 0.667 accuracy on Kaggle with Logistic Regression.
2) KMeans and Random Forest semi-supervised labeling of the training data leads to 0.766 accuracy on Kaggle with Logistic Regression.
3) Fully supervised Logistic Regression leads to 0.922 accuracy on Kaggle.

### GMM and Random Forest, Semi-Supervised Data Labeling
- **Data Preprocessing:** Load and clean the text data: remove stop words, lemmatize, and strip URLs and mentions.
- **Vectorization and Encoding:** Convert text to TF-IDF vectors with 500 features; encode sentiment labels in the labeled data.
- **Separate Labeled and Unlabeled Data:** Isolate the labeled and unlabeled portions of the training dataset.
- **Augment the training data set** with GMM labels that meet a confidence threshold of 0.99.
    - Train a GMM with 5 components (one per sentiment class) on the labeled data and predict components for the unlabeled data.
    - Retain pseudo-labels whose highest component probability exceeds 0.99.
- **Augment the training data set** with Random Forest labels that meet a confidence threshold of 0.95.
    - Train a Random Forest on the labeled data and predict labels for the unlabeled data that GMM did not label.
    - Retain pseudo-labels whose predicted probability exceeds 0.95.
- **Combine and Save Dataset:** Merge the original labeled data with both sets of high-confidence pseudo-labels and save the result.
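
The two-stage filtering described in this list can be sketched on toy probabilities. This is only an illustration, not the notebook's pipeline: `gmm_probs` and `rf_probs` are made-up arrays standing in for `predict_proba` output, and only NumPy is assumed.

```python
import numpy as np

# Hypothetical predict_proba output for 4 unlabeled rows and 3 classes
gmm_probs = np.array([
    [0.995, 0.003, 0.002],   # confident -> kept by the GMM stage
    [0.60, 0.30, 0.10],      # not confident -> passed on to the RF stage
    [0.40, 0.35, 0.25],      # not confident -> passed on to the RF stage
    [0.001, 0.998, 0.001],   # confident -> kept by the GMM stage
])
gmm_mask = gmm_probs.max(axis=1) > 0.99
gmm_labels = gmm_probs.argmax(axis=1)[gmm_mask]

# The Random Forest stage only sees the rows the GMM stage did not label
rf_probs = np.array([
    [0.97, 0.02, 0.01],      # confident -> kept
    [0.50, 0.30, 0.20],      # not confident -> dropped
])
rf_mask = rf_probs.max(axis=1) > 0.95
rf_labels = rf_probs.argmax(axis=1)[rf_mask]

print("GMM stage keeps rows:", np.flatnonzero(gmm_mask), "with labels:", gmm_labels)
print("RF stage keeps", int(rf_mask.sum()), "of", len(rf_probs), "remaining rows")
```

Rows that fail both thresholds stay unlabeled and are left out of the combined dataset.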


In [2]:
from sklearn.mixture import GaussianMixture
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer, PorterStemmer

val_df = pd.read_csv("val.csv")
train_df = pd.read_csv("train.csv")

def pre_process(data):
    """Clean a Series of phrases: drop stop words, lemmatize, strip URLs and @mentions."""
    preproc_data = data.copy()
    # Remove English stop words (case-insensitive match)
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))
    preproc_data = preproc_data.apply(lambda text: ' '.join([word for word in str(text).split() if word.lower() not in stop_words]))
    # Lemmatize each remaining token
    nltk.download('wordnet')
    lemmatizer = WordNetLemmatizer()
    preproc_data = preproc_data.apply(lambda text: ' '.join([lemmatizer.lemmatize(word) for word in text.split()]))
    # Strip URLs first, then @mentions
    preproc_data = preproc_data.apply(lambda text: re.sub(r'@\w+', '', re.sub(r'http\S+|www\S+', '', text)))
    return preproc_data


# Call preprocess func
val_df['Phrase'] = pre_process(val_df['Phrase'])
train_df['Phrase'] = pre_process(train_df['Phrase'])

# Create the TF-IDF vectorizer (500 features) and the label encoder
vectorizer = TfidfVectorizer(max_features=500)
label_encoder = LabelEncoder()

# Split the labeled and unlabeled data
labeled_train_df = train_df[train_df['Sentiment'] != -100].copy()
unlabeled_train_df = train_df[train_df['Sentiment'] == -100].copy() 

# Prepare labeled data and unlabeled data
X_labeled = vectorizer.fit_transform(labeled_train_df['Phrase'])
y_labeled = label_encoder.fit_transform(labeled_train_df['Sentiment'])
X_unlabeled = vectorizer.transform(unlabeled_train_df['Phrase'])

# Do Unsupervised Labeling with GMM
n_clusters = len(np.unique(y_labeled))
gmm = GaussianMixture(n_components=n_clusters, random_state=0)
gmm.fit(X_labeled.toarray())
gmm_labels = gmm.predict(X_unlabeled.toarray())
gmm_probs = gmm.predict_proba(X_unlabeled.toarray())

# Keep only GMM predictions whose top component probability exceeds the threshold
gmm_confidence_threshold = 0.99
high_confidence_mask_gmm = gmm_probs.max(axis=1) > gmm_confidence_threshold
high_confidence_gmm_df = unlabeled_train_df[high_confidence_mask_gmm].copy()
# NOTE: this treats GMM component index k as encoded sentiment k; GMM component numbering is arbitrary
high_confidence_gmm_df['Sentiment'] = label_encoder.inverse_transform(gmm_labels[high_confidence_mask_gmm])

# Exclude GMM-labeled data from the Random Forest labeling 
unlabeled_train_df_remaining = unlabeled_train_df[~high_confidence_mask_gmm]

# Supervised Labeling with Random Forest
rf = RandomForestClassifier(random_state=0)
rf.fit(X_labeled, y_labeled)

# Predict labels and probabilities on the remaining unlabeled data
X_unlabeled_remaining = vectorizer.transform(unlabeled_train_df_remaining['Phrase'])
pseudo_labels_rf = rf.predict(X_unlabeled_remaining)
pseudo_probs_rf = rf.predict_proba(X_unlabeled_remaining).max(axis=1)

# Apply a confidence threshold for Random Forest
rf_confidence_threshold = 0.95
high_confidence_mask_rf = pseudo_probs_rf > rf_confidence_threshold
high_confidence_rf_df = unlabeled_train_df_remaining[high_confidence_mask_rf].copy()
high_confidence_rf_df['Sentiment'] = label_encoder.inverse_transform(pseudo_labels_rf[high_confidence_mask_rf])

# Combine the original labeled data, high-confidence GMM-labeled data, and high-confidence RF-labeled data
final_combined_df = pd.concat(
    [labeled_train_df, high_confidence_gmm_df[['Phrase', 'Sentiment']], high_confidence_rf_df[['Phrase', 'Sentiment']]],
    ignore_index=True
)

# Save the final combined dataset
final_combined_df.to_csv('gmm_rf_combined_submission.csv', index=False)
print("Final combined dataset shape:", final_combined_df.shape)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/marianellasalinas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/marianellasalinas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/marianellasalinas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/marianellasalinas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Final combined dataset shape: (59678, 2)
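
One caveat about the cell above: `gmm_labels` holds GMM component indices, and mixture components are numbered arbitrarily, so component `k` need not correspond to encoded sentiment `k`. A possible remedy, sketched here on made-up arrays, is to map each component to the majority true label among the labeled rows assigned to it, much as the KMeans section maps clusters to sentiments. The names `components_labeled`, `components_unlabeled`, and `component_to_label` are hypothetical and do not appear in the notebook.

```python
import numpy as np

# Made-up stand-ins: true encoded labels of the labeled rows, and the
# mixture component each of those rows was assigned to
y_labeled = np.array([0, 0, 0, 1, 1, 2, 2, 2])
components_labeled = np.array([2, 2, 2, 0, 0, 1, 1, 0])

# Majority true label per component
component_to_label = {}
for k in np.unique(components_labeled):
    members = y_labeled[components_labeled == k]
    component_to_label[int(k)] = int(np.bincount(members).argmax())

# Translate component indices predicted for unlabeled rows into labels
components_unlabeled = np.array([2, 0, 1, 1])
pseudo_labels = np.array([component_to_label[int(k)] for k in components_unlabeled])

print(component_to_label)
print(pseudo_labels)
```

In the notebook this would mean calling `gmm.predict` on the labeled matrix as well, then applying the mapping before `label_encoder.inverse_transform`; whether it improves the Kaggle score has not been tested here.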


### KMeans Unsupervised Data Labeling
- **Data Preprocessing**: Load and clean the text data with the preprocessing function defined above (stop-word removal, lemmatization, URL and mention removal).
- **Encode the Sentiment Labels and Vectorize**: Convert text to TF-IDF vectors with 500 features and encode sentiment labels in the labeled data.
- **Split Labeled and Unlabeled Data**: Separate the labeled data from the unlabeled data in the training dataset.
- **Augment the training data set** with KMeans labels that pass a distance-based confidence threshold.
  - Train KMeans with 5 clusters on the labeled data and predict clusters for the unlabeled data.
  - Calculate the distance from each unlabeled point to its closest cluster center and set the threshold at the 85th percentile of those distances (the value used in the code).
  - Keep the points whose distance falls below this threshold as high-confidence labels.
- **Map Clusters to Sentiments**: Assign each cluster the most frequent sentiment among the labeled points in it, creating a cluster-to-label mapping.
- **Combine and Save Dataset**: Merge the original labeled data with the high-confidence KMeans-labeled data and save.
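
The distance-based confidence rule in this list can be illustrated without a fitted model. In the sketch below the cluster centers and points are invented 2-D values, `high_confidence_mask` is a hypothetical helper, and the percentile is left as a free parameter (50 is used only so the tiny example keeps a visible subset).

```python
import numpy as np

def high_confidence_mask(points, centers, percentile):
    """Return the nearest-center index per point and a mask of points
    whose distance to that center is below the percentile cutoff."""
    # Euclidean distance from every point to every center: (n_points, n_centers)
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    nearest_dist = dists.min(axis=1)
    cutoff = np.percentile(nearest_dist, percentile)
    return nearest, nearest_dist < cutoff

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
points = np.array([[0.0, 1.0], [0.0, 2.0], [10.0, 13.0], [10.0, 14.0], [5.0, 0.0]])

nearest, mask = high_confidence_mask(points, centers, percentile=50)
print(nearest)
print(mask)
```

A smaller distance means higher confidence, so the mask keeps the points below the cutoff; with a percentile of `p`, roughly the closest `p` percent of points are retained.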


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from scipy.stats import mode
import numpy as np
from sklearn.decomposition import TruncatedSVD
import pandas as pd
from sklearn.metrics import pairwise_distances_argmin_min

# Load data
val_df = pd.read_csv("val.csv")
train_df = pd.read_csv("train.csv")

# Call preprocess func
val_df['Phrase'] = pre_process(val_df['Phrase'])
train_df['Phrase'] = pre_process(train_df['Phrase'])

# Split labeled and unlabeled data
labeled_train_df = train_df[train_df['Sentiment'] != -100].copy()
unlabeled_train_df = train_df[train_df['Sentiment'] == -100].copy() 
combined_labeled_df = labeled_train_df.reset_index(drop=True)

# Encode the Sentiment labels and Vectorize 
label_encoder = LabelEncoder()
combined_labeled_df['Sentiment_encoded'] = label_encoder.fit_transform(combined_labeled_df['Sentiment'])
vectorizer = TfidfVectorizer(max_features=500) 
X_combined_labeled = vectorizer.fit_transform(combined_labeled_df['Phrase'])
X_unlabeled = vectorizer.transform(unlabeled_train_df['Phrase'])

# Fit K-Means on labeled data and predict for unlabeled data
n_clusters = len(combined_labeled_df['Sentiment_encoded'].unique())
kmeans = KMeans(n_clusters=n_clusters, random_state=0)
kmeans.fit(X_combined_labeled)

# Predict clusters and calculate distances for unlabeled data
unlabeled_train_df.loc[:, 'Predicted_Cluster'] = kmeans.predict(X_unlabeled)  
closest, distances = pairwise_distances_argmin_min(X_unlabeled, kmeans.cluster_centers_)

# Points closer to their center than the 85th-percentile distance count as high confidence
confidence_threshold = np.percentile(distances, 85)
unlabeled_train_df.loc[:, 'High_Confidence'] = distances < confidence_threshold  

# Map clusters to sentiments
cluster_sentiment_map = {}
for i in range(n_clusters):
    cluster_labels = combined_labeled_df['Sentiment_encoded'][kmeans.labels_ == i]
    if len(cluster_labels) > 0:
        cluster_sentiment_map[i] = mode(cluster_labels, keepdims=True).mode[0]
    else:
        cluster_sentiment_map[i] = -1

unlabeled_train_df.loc[:, 'Predicted_Sentiment'] = label_encoder.inverse_transform(
    unlabeled_train_df['Predicted_Cluster'].map(cluster_sentiment_map)
)  

# Keep only the high-confidence predictions
high_confidence_df = unlabeled_train_df[unlabeled_train_df['High_Confidence']].copy()

final_train_df = pd.concat([
    labeled_train_df,
    high_confidence_df[['Phrase', 'Predicted_Sentiment']].rename(columns={'Predicted_Sentiment': 'Sentiment'})
], ignore_index=True)

# Save the original labeled data together with the high-confidence pseudo-labeled rows
final_train_df.to_csv('km_combined_submission.csv', index=False)
print("High-confidence predictions shape:", high_confidence_df.shape)


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/marianellasalinas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/marianellasalinas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/marianellasalinas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/marianellasalinas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


High-confidence predictions shape: (29705, 5)


### Use the augmented training data to train our supervised learning models

In [4]:
import pandas as pd
import re
from nltk.corpus import stopwords
import string
from nltk.stem import WordNetLemmatizer

# Load datasets
train_data = pd.read_csv('km_combined_submission.csv') # we can use the k-means or gmm labeled data
val_data = pd.read_csv('val.csv')
test_data = pd.read_csv('test.csv')

# Define preprocessing functions
def rem_URL(sample):
    return re.sub(r"http\S+", "", sample)

def rem_tokens(sample):
    sample_new = re.sub(r"#", "", sample)
    sample_new = re.sub(r"@", "", sample_new)
    return sample_new

# Build the stop-word set and lemmatizer once instead of once per word
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess_str(inp_string):
    new_string = re.sub(r'@\w+', '@USER', inp_string)  # Replace mentions with '@USER'
    new_string = rem_URL(new_string)                   # Remove URLs
    new_string = rem_tokens(new_string)                # Remove # and @
    new_string = new_string.lower()                    # Convert to lowercase
    words = [w for w in new_string.split() if w not in STOP_WORDS]  # Remove stopwords
    lemmed = [LEMMATIZER.lemmatize(w) for w in words]               # Lemmatize
    new_string = ' '.join(lemmed)
    new_string = new_string.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    return new_string

# Preprocess function for dataframe
def preprocess(df):
    new_df = df.copy()
    new_df['Phrase'] = new_df['Phrase'].apply(lambda x: preprocess_str(x) if isinstance(x, str) else x)
    return new_df

# Get all train data (labeled and unlabeled)
X_train = train_data['Phrase']
y_train = train_data['Sentiment']

# Get only labeled train data
mask = (y_train != -100)
train_data_clean = train_data[mask]
X_train_clean = X_train[mask]
y_train_clean = y_train[mask]

# Get validation and test data
X_val = val_data['Phrase']
y_val = val_data['Sentiment']
X_test = test_data['Phrase']

# Preprocess train, validation, and test datasets, and drop NaNs if any
X_train_clean_prep = preprocess(X_train_clean.to_frame()).dropna()
X_val_prep = preprocess(X_val.to_frame()).dropna()
X_test_prep = preprocess(X_test.to_frame()).dropna()

# Check for any NaN values in each processed dataset
print(f"Number of NaN values in X_train_clean_prep['Phrase']: {X_train_clean_prep['Phrase'].isna().sum()}")
print(f"Number of NaN values in X_val_prep['Phrase']: {X_val_prep['Phrase'].isna().sum()}")
print(f"Number of NaN values in X_test_prep['Phrase']: {X_test_prep['Phrase'].isna().sum()}")

# Display data shapes and label distribution
print(f"Train Data Shape: {X_train.shape}")
print(f"Cleaned Train Data Shape: {train_data_clean['Phrase'].shape}")
print(f"Validation Data Shape: {X_val.shape}")
print(f"Test Data Shape: {X_test.shape}")

print(" ")
for label in [0, 1, 2, 3, 4, -100]:
    print(f"Number of labels = {label} in train dataset as percentage: {((y_train == label).sum() / X_train.shape[0]) * 100:0.2f}%")

print(" ")
for label in [0, 1, 2, 3, 4, -100]:
    print(f"Number of labels = {label} in val dataset as percentage: {((y_val == label).sum() / X_val.shape[0]) * 100:0.2f}%")


Number of NaN values in X_train_clean_prep['Phrase']: 0
Number of NaN values in X_val_prep['Phrase']: 0
Number of NaN values in X_test_prep['Phrase']: 0
Train Data Shape: (54463,)
Cleaned Train Data Shape: (54463,)
Validation Data Shape: (23256,)
Test Data Shape: (23257,)
 
Number of labels = 0 in train dataset as percentage: 9.13%
Number of labels = 1 in train dataset as percentage: 13.54%
Number of labels = 2 in train dataset as percentage: 14.43%
Number of labels = 3 in train dataset as percentage: 17.22%
Number of labels = 4 in train dataset as percentage: 45.68%
Number of labels = -100 in train dataset as percentage: 0.00%
 
Number of labels = 0 in val dataset as percentage: 19.63%
Number of labels = 1 in val dataset as percentage: 20.27%
Number of labels = 2 in val dataset as percentage: 20.42%
Number of labels = 3 in val dataset as percentage: 19.81%
Number of labels = 4 in val dataset as percentage: 19.88%
Number of labels = -100 in val dataset as percentage: 0.00%
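
The output above shows that the augmented training set is skewed toward label 4 (45.68%), while the validation set is close to uniform. A compact way to compute such a distribution is sketched below with the standard library only; the `label_percentages` helper and the sample lists are illustrative and are not the notebook's data.

```python
from collections import Counter

def label_percentages(labels):
    """Map each label to its share of the list, in percent (rounded to 2 places)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(100 * n / total, 2) for label, n in sorted(counts.items())}

# Made-up labels: a skewed "train" list and a balanced "val" list
train_labels = [4, 4, 4, 4, 3, 2, 1, 0]
val_labels = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

print(label_percentages(train_labels))
print(label_percentages(val_labels))
```

A mismatch of this kind between training and validation distributions may contribute to a classifier over-predicting the majority class, so it is worth checking before comparing models.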


In [5]:
y_train_clean = y_train[X_train_clean_prep.index] 
y_val = y_val[X_val_prep.index]

In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report

# Vectorize the text data using CountVectorizer
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train_clean_prep['Phrase'])
X_val_vec = vectorizer.transform(X_val_prep['Phrase'])

# Initialize and train the Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_vec, y_train_clean)

# Make predictions on the validation set
y_pred = rf_model.predict(X_val_vec)

# Evaluate the model
print("Validation Accuracy:", accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

# If needed, make predictions on the test data
X_test_vec = vectorizer.transform(X_test_prep['Phrase'])
test_predictions = rf_model.predict(X_test_vec)


Validation Accuracy: 0.7231397849462365
              precision    recall  f1-score   support

           0       0.89      0.49      0.63      4564
           1       0.79      0.51      0.62      4713
           2       0.98      0.93      0.95      4744
           3       0.78      0.76      0.77      4605
           4       0.49      0.92      0.64      4624

    accuracy                           0.72     23250
   macro avg       0.79      0.72      0.72     23250
weighted avg       0.79      0.72      0.72     23250



In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

# Vectorize the text data using CountVectorizer
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train_clean_prep['Phrase'])
X_val_vec = vectorizer.transform(X_val_prep['Phrase'])
X_test_vec = vectorizer.transform(X_test_prep['Phrase'])

# Logistic Regression
log_reg_model = LogisticRegression(random_state=42, max_iter=1000)
log_reg_model.fit(X_train_vec, y_train_clean)
y_pred_log_reg = log_reg_model.predict(X_val_vec)
print("Logistic Regression Validation Accuracy:", accuracy_score(y_val, y_pred_log_reg))

test_predictions_log_reg = log_reg_model.predict(X_test_vec)

Logistic Regression Validation Accuracy: 0.7329892473118279


In [8]:
# Vectorize the text data using CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1,2))
X_train_vec = vectorizer.fit_transform(X_train_clean_prep['Phrase'])
X_val_vec = vectorizer.transform(X_val_prep['Phrase'])
X_test_vec = vectorizer.transform(X_test_prep['Phrase'])

# Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_vec, y_train_clean)
y_pred_rf = rf_model.predict(X_val_vec)
print("Random Forest Validation Accuracy:", accuracy_score(y_val, y_pred_rf))

# Logistic Regression
log_reg_model = LogisticRegression(random_state=42, max_iter=1000)
log_reg_model.fit(X_train_vec, y_train_clean)
y_pred_log_reg = log_reg_model.predict(X_val_vec)
print("Logistic Regression Validation Accuracy:", accuracy_score(y_val, y_pred_log_reg))

Random Forest Validation Accuracy: 0.7269247311827957
Logistic Regression Validation Accuracy: 0.7589247311827957
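
The cell above sets `ngram_range=(1,2)`, so the features include adjacent word pairs as well as single words. The sketch below shows what that feature set looks like for one phrase. It is a simplified stand-in that splits on whitespace, not `CountVectorizer`'s actual tokenizer, and `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """List all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("not a good movie"))
```

Bigrams such as "not a" and "good movie" carry word-order information that unigram counts lose, which is one plausible reason Logistic Regression improves from 0.733 to 0.759 on validation here.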


In [38]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# L2 regularization (ridge-style penalty); penalty='l2' and C=1.0 are the scikit-learn defaults
log_reg_model_l2 = LogisticRegression(random_state=42, max_iter=1000, penalty='l2', C=1.0)
log_reg_model_l2.fit(X_train_vec, y_train_clean)
y_pred_log_reg_l2 = log_reg_model_l2.predict(X_val_vec)
print("Logistic Regression with L2 Regularization Validation Accuracy:", accuracy_score(y_val, y_pred_log_reg_l2))

# L1 Regularization (Lasso)
log_reg_model_l1 = LogisticRegression(random_state=42, max_iter=1000, penalty='l1', solver='liblinear', C=1.0)
log_reg_model_l1.fit(X_train_vec, y_train_clean)
y_pred_log_reg_l1 = log_reg_model_l1.predict(X_val_vec)
print("Logistic Regression with L1 Regularization Validation Accuracy:", accuracy_score(y_val, y_pred_log_reg_l1))

Logistic Regression with L2 Regularization Validation Accuracy: 0.9943225806451613
Logistic Regression with L1 Regularization Validation Accuracy: 0.9575913978494623


In [39]:
X_test_prep = preprocess(X_test.to_frame())
X_test_prep['Phrase'] = X_test_prep['Phrase'].fillna('')
X_test_vec = vectorizer.transform(X_test_prep['Phrase'])

In [40]:
len(X_test_prep)

23257

In [41]:
# Make predictions on the test set
y_test_pred_log_reg = log_reg_model.predict(X_test_vec)

# Print test predictions if needed
print("Logistic Regression Test Predictions:", y_test_pred_log_reg)

Logistic Regression Test Predictions: [3 2 4 ... 2 3 1]


### Logistic Regression trained on the KMeans-augmented dataset performed best of the two augmentation approaches:

In [42]:
# Generate PhraseID as a sequence from 0 to the number of predictions minus 1
phrase_ids = range(len(y_test_pred_log_reg))

# Create a DataFrame with PhraseID and Sentiment columns
test_predictions_df = pd.DataFrame({
    'PhraseID': phrase_ids,
    'Sentiment': y_test_pred_log_reg
})

# Save the DataFrame to a CSV file
test_predictions_df.to_csv('logistic_regression_test_predictions_km.csv', index=False)

print("Predictions saved to 'logistic_regression_test_predictions_km.csv'")


Predictions saved to 'logistic_regression_test_predictions_km.csv'
