## Iteration 1 description 

- Text vectorization technique: CountVectorizer
- ML algorithm: Random Forest
- Classifier: RandomForestClassifier
- Normalization: PorterStemmer
- Data balancing applied: None
- Dataframe size: 39784 rows × 26666 columns
- Test accuracy: 0.40

## Import libraries 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

## Import CSV

In [None]:
df = pd.read_csv('tweet_emotions.csv')
df

## Preprocessing 

In [None]:
df['sentiment'].unique()

In [None]:
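# Set the style of the plot ('seaborn-darkgrid' was renamed in matplotlib 3.6)
style = 'seaborn-v0_8-darkgrid' if 'seaborn-v0_8-darkgrid' in plt.style.available else 'seaborn-darkgrid'
plt.style.use(style)

# Count tweets per sentiment label and draw one bar per label
# (a histogram with bins=10 would merge some of the categorical labels)
sentiment_counts = df['sentiment'].value_counts()
plt.bar(sentiment_counts.index, sentiment_counts.values, color='skyblue', edgecolor='black', alpha=0.7)

# Add labels and title
plt.xlabel('Sentiment', fontsize=14)
plt.ylabel('Number of tweets', fontsize=14)
plt.title('Distribution of Sentiment Labels', fontsize=16)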

# Add grid and adjust tick parameters
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.xticks(rotation=45, fontsize=12)  # Rotate x-axis labels by 45 degrees
plt.yticks(fontsize=12)

# Show the plot
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()

In [None]:
df['sentiment'].value_counts()

Author note: This data is imbalanced. 
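
To put a number on the imbalance, one option is to look at each class's share of the data and at the ratio between the largest and smallest class. The sketch below uses a short hypothetical list of labels in place of `df['sentiment']`, so the printed values are illustrative only.

```python
from collections import Counter

# Hypothetical labels standing in for df['sentiment']; not the real data.
labels = ["neutral"] * 6 + ["worry"] * 5 + ["happiness"] * 3 + ["anger"] * 1

counts = Counter(labels)
total = sum(counts.values())

# Share of each class, largest first.
shares = {label: n / total for label, n in counts.most_common()}

# Ratio between the most frequent and the least frequent class.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label:<10} {share:.2%}")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

On the real DataFrame, `df['sentiment'].value_counts(normalize=True)` gives the same per-class shares.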

Author note: Guidance to reduce the number of emotion groups 

Author note: In order to guide the workload of a Customer Success Manager, the following grouping would make sense: 

Positive Sentiments:

- happiness
- love
- relief
- enthusiasm

Neutral Sentiments:

- neutral
- surprise
- fun

Negative Sentiments:

- worry
- sadness
- hate
- empty
- boredom
- anger

In [None]:
# Define a function to map sentiments to sub-groups
def map_sentiment_to_subgroup(sentiment):
    if sentiment in ['empty', 'sadness', 'worry', 'hate', 'boredom', 'anger']:
        return 'Negative'
    elif sentiment in ['neutral', 'surprise', 'fun']:
        return 'Neutral'
    else:
        return 'Positive'

# Apply the function to create a new column for sub-groups
df['sentiment_subgroup'] = df['sentiment'].apply(map_sentiment_to_subgroup)

df

In [None]:
df['sentiment_subgroup'].value_counts()
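
The mapping function above returns 'Positive' for any label that is not in the Negative or Neutral lists, so a misspelled or unexpected label would be counted as positive without warning. A possible alternative, sketched below, is an explicit lookup table built from the grouping listed earlier. The names `SENTIMENT_GROUPS`, `LABEL_TO_GROUP` and `map_sentiment_strict` are hypothetical and not part of the notebook.

```python
# Explicit lookup table built from the grouping listed in the author note.
SENTIMENT_GROUPS = {
    "Positive": ["happiness", "love", "relief", "enthusiasm"],
    "Neutral": ["neutral", "surprise", "fun"],
    "Negative": ["worry", "sadness", "hate", "empty", "boredom", "anger"],
}

# Invert the table so each label points to its sub-group.
LABEL_TO_GROUP = {
    label: group
    for group, labels in SENTIMENT_GROUPS.items()
    for label in labels
}

def map_sentiment_strict(sentiment):
    """Return the sub-group for a label, or None if the label is not listed."""
    return LABEL_TO_GROUP.get(sentiment)

print(map_sentiment_strict("love"))
print(map_sentiment_strict("confused"))  # hypothetical label not in the table
```

With pandas, `df['sentiment'].map(LABEL_TO_GROUP)` would apply the same table and leave NaN for any label that is not listed, which makes unexpected labels easy to spot.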

In [None]:
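import matplotlib.pyplot as plt

# Set the style of the plot ('seaborn-darkgrid' was renamed in matplotlib 3.6)
style = 'seaborn-v0_8-darkgrid' if 'seaborn-v0_8-darkgrid' in plt.style.available else 'seaborn-darkgrid'
plt.style.use(style)

# Count tweets per sentiment sub-group and draw one bar per sub-group
subgroup_counts = df['sentiment_subgroup'].value_counts()
plt.bar(subgroup_counts.index, subgroup_counts.values, color='skyblue', edgecolor='black', alpha=0.7)

# Add labels and title
plt.xlabel('Sentiment sub-group', fontsize=14)
plt.ylabel('Number of tweets', fontsize=14)
plt.title('Distribution of Sentiment Sub-groups', fontsize=16)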

# Add grid and adjust tick parameters
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.xticks(rotation=45, fontsize=12)  # Rotate x-axis labels by 45 degrees
plt.yticks(fontsize=12)

# Show the plot
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()

In [None]:
# Check for null values 

print(df.isnull().sum())

In [None]:
# Check for duplicates 

duplicate_rows = df[df.duplicated()]
duplicate_rows.value_counts()

### Remove names from content column before tokenizing 

Author note: Any word beginning with @ is a user mention (a Twitter handle), so we can assume that these words will not be useful predictors of sentiment. As such, I have chosen to delete all @ words prior to tokenizing.

In [None]:
# Remove words starting with '@' using a lambda function
df['content'] = df['content'].apply(lambda x: ' '.join([word for word in x.split() if not word.startswith('@')]))

In [None]:
df['content']
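
The split-based filter above only drops words that start with '@', so a mention glued to other characters (for example `cc:@bob`) survives. A regex-based variant is sketched below as one possible alternative. `strip_mentions` and the sample tweet are hypothetical. The trade-off is that the pattern also cuts into e-mail addresses, which may or may not matter for this data.

```python
import re

# Matches '@' followed by one or more word characters.
MENTION_RE = re.compile(r"@\w+")

def strip_mentions(text):
    """Remove @mentions and collapse the whitespace left behind."""
    without_mentions = MENTION_RE.sub("", text)
    return " ".join(without_mentions.split())

sample = "@alice thanks! cc:@bob see you soon"  # hypothetical tweet
print(strip_mentions(sample))
```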

## Tokenizing

In [None]:
import nltk # Natural Language Toolkit
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # Ensure that you have the 'punkt' tokenizer models downloaded

# Tokenize the text in the 'content' column
df['tokens'] = df['content'].apply(word_tokenize)

# Display the DataFrame with tokens
print(df.head())

Author note: This code will tokenize each text instance in the 'content' column and store the tokens in a new column called 'tokens' in the DataFrame

In [None]:
df

In [None]:
df['tokens']

## Preprocessing tokenized text data

- Lowercasing: Convert all words to lowercase to ensure consistency and prevent the model from treating words with different cases as different entities.

- Removing punctuation: Remove punctuation marks such as commas, periods, and quotation marks. Punctuation generally does not carry semantic meaning and can introduce noise into the features.

- Removing stop words: Stop words are common words such as "the," "is," and "and" that occur frequently but typically do not contribute much to the meaning of the text. Removing them can reduce the dimensionality of the data and improve the efficiency of the downstream model.

- Handling numerical values: Depending on the specific use case, you may choose to remove or replace numerical values with placeholders. In some cases, numerical values may not be relevant to the semantics of the text and can be treated as noise.

- Removing blank spaces: Blank tokens are noise and should be deleted.

- Handling special characters: Special characters, symbols, and emojis may need to be handled appropriately based on the specific requirements of the application. You might choose to remove them, replace them with special tokens, or even treat them as separate entities

- Removing blank rows

- Removing words with just one letter: These act as hidden stop words.

- Token normalization: This involves techniques such as stemming or lemmatization to reduce words to their base or root forms. For example, stemming reduces "running" and "runs" to "run"; an irregular form such as "ran" is only mapped to "run" by lemmatization. This helps in capturing semantic similarities between related words.

- Handling out-of-vocabulary words: It's essential to handle words that are not present in the vocabulary of the Word2Vec model. This could involve techniques like using subword embeddings (e.g., FastText) or replacing unknown words with a special token.
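
As a rough illustration of how several of the steps above fit together, the sketch below applies lowercasing, punctuation removal, blank removal, stop-word removal, number removal and single-character removal to one list of tokens. It is a simplified stand-in, not the notebook's actual cells: `clean_tokens` is a hypothetical helper, `STOP_WORDS` is a tiny made-up list in place of NLTK's English stop words, and stemming is left out.

```python
import re

# Tiny hypothetical stop-word list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "is", "and", "a", "to", "i"}

def clean_tokens(tokens):
    """Apply the cleaning steps from the list above to one list of tokens."""
    cleaned = []
    for token in tokens:
        token = token.lower()                  # lowercasing
        token = re.sub(r"[^\w\s]", "", token)  # removing punctuation
        token = token.strip()                  # trimming surrounding blanks
        if not token:                          # nothing left: blank token
            continue
        if token in STOP_WORDS:                # removing stop words
            continue
        if token.isdigit():                    # removing purely numerical tokens
            continue
        if len(token) == 1:                    # removing single-character tokens
            continue
        cleaned.append(token)
    return cleaned

tokens = ["The", "movie", "is", "GREAT", "!", "10", "/", "10", "u", "agree", "?"]
print(clean_tokens(tokens))
```

Doing every step in one pass also avoids a side effect of running the steps separately: a token emptied by a later step cannot slip past an earlier blank-token filter.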



### Lowercasing

Convert all words to lowercase to ensure consistency and prevent the model from treating words with different cases as different entities.

In [None]:
# Function to lowercase each word in a list of tokens
def lowercase_tokens(tokens_list):
    return [word.lower() for word in tokens_list]

# Apply lowercase conversion to each list of tokens in the 'tokens' column
df.loc[:, 'tokens'] = df['tokens'].apply(lowercase_tokens)

In [None]:
df['tokens']

### Removing punctuation

Remove punctuation marks such as commas, periods, and quotation marks. Punctuation generally does not carry semantic meaning and can introduce noise into the features. The code first checks that each value in the column is a list before applying the regex substitution to its tokens. If the value is not a list (e.g., NaN or a float), it returns an empty list. This ensures that re.sub only ever receives string inputs, avoiding a TypeError.

In [None]:
import re

# Apply punctuation removal using a lambda function
df['tokens'] = df['tokens'].apply(lambda tokens_list: [re.sub(r'[^\w\s]', '', word) for word in tokens_list] if isinstance(tokens_list, list) else [])

In [None]:
df['tokens']

### Removing stop-words 

Stop words are common words such as "the," "is," and "and" that occur frequently but typically do not contribute much to the meaning of the text. Removing them reduces the dimensionality of the data, which here means fewer columns in the bag-of-words matrix.

In [None]:
from nltk.corpus import stopwords

# Download NLTK stopwords if not already downloaded
nltk.download('stopwords')

# Get the English stopwords from NLTK
stop_words = set(stopwords.words('english'))

# Remove stopwords from the 'tokens' column
df['tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])

In [None]:
df['tokens']

### Removing blank spaces

Author note: Based on the print out of the tokens column above, I can see some blank spaces (' '). I will remove these to reduce noise in my model 

In [None]:
# Remove blank spaces from the list of tokens
df['tokens'] = df['tokens'].apply(lambda tokens_list: [token for token in tokens_list if token.strip() != ''])

# Display the updated DataFrame
print(df['tokens'])
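
A likely source of these blank entries (an assumption based on the punctuation-removal code, not on the printed output) is that tokens made up only of punctuation become empty strings once the punctuation is stripped. The small example below reproduces that effect on a hypothetical list of tokens and then applies the same blank filter.

```python
import re

tokens = ["great", "!", "...", "n't", "day"]  # hypothetical tokenizer output

# Same pattern as the punctuation-removal step: delete every character that
# is neither a word character nor whitespace.
stripped = [re.sub(r"[^\w\s]", "", token) for token in tokens]
print(stripped)

# Filtering on token.strip() removes the empty leftovers.
non_blank = [token for token in stripped if token.strip() != ""]
print(non_blank)
```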

### Handling numerical values

This code will remove all tokens containing only numerical values from the 'tokens' column.

In [None]:
df['tokens'] = df['tokens'].apply(lambda x: [token for token in x if not token.isdigit()])

In [None]:
df['tokens']
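
Note that `str.isdigit()` is only true when every character in the token is a digit, so tokens that mix letters and digits are kept. The hypothetical tokens below show which kinds of token this filter removes.

```python
tokens = ["call", "911", "2morrow", "4", "room101"]  # hypothetical tokens

# Keep every token that is not made up entirely of digits.
kept = [token for token in tokens if not token.isdigit()]
print(kept)
```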

### Handling special characters

This code removes every non-word character (anything other than letters, digits, and underscores) from each token in the 'tokens' column of the DataFrame.

In [None]:
df['tokens'] = df['tokens'].apply(lambda x: [re.sub(r'\W', '', token) for token in x])

In [None]:
df['tokens']

### Removing blank rows   

In the print-out above, I see cases (such as ID 39995) where the row is blank '[]'. These may add noise to the model, so I will remove them.

In [None]:
# Count the number of rows before removing blank rows
num_rows_before = len(df)

# Remove rows with blank lists in the 'tokens' column
# (.copy() avoids a SettingWithCopyWarning when 'tokens' is reassigned later)
df = df[df['tokens'].apply(lambda tokens_list: tokens_list != [])].copy()

# Count the number of rows after removing blank rows
num_rows_after = len(df)

# Calculate the number of rows deleted
num_rows_deleted = num_rows_before - num_rows_after

# Print the number of rows deleted
print("Number of rows deleted:", num_rows_deleted)

### Removing words with just one letter

Single-letter tokens act as hidden stop words: they occur often but carry little meaning on their own.

In [None]:
import string

# Define the target words to count: every single lowercase letter
target_words = list(string.ascii_lowercase)

# Initialize counts for each target word
word_counts = {word: 0 for word in target_words}

# Iterate over the tokens and count occurrences of target words
for tokens_list in df['tokens']:
    for word in tokens_list:
        if word in word_counts:
            word_counts[word] += 1

# Print the word counts
for word, count in word_counts.items():
    print(f"Occurrences of '{word}': {count}")

In [None]:
# Delete occurrences of target_words from df['tokens']
df['tokens'] = df['tokens'].apply(lambda tokens_list: [word for word in tokens_list if word not in target_words and word.strip() not in target_words])

In [None]:
df['tokens']

### Token normalization

This involves techniques such as stemming or lemmatization to reduce words to their base or root forms. For example, stemming reduces "running" and "runs" to "run"; an irregular form such as "ran" is only mapped to "run" by lemmatization, not by the PorterStemmer used here. This helps in capturing semantic similarities between related words.

In [None]:
from nltk.stem import PorterStemmer

# Initialize PorterStemmer
stemmer = PorterStemmer()

# Apply stemming to each token, then join the stems into one space-separated string per row
df['tokens'] = df['tokens'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x]))

In [None]:
df

### Handling out-of-vocabulary words

Author note: It seemed I couldn't do this until I had trained a Word2Vec model 

## Reset index 

In [None]:
df.reset_index(drop=True, inplace=True)

## Check preprocessed df

In [None]:
df

## Bag of words 

Explanation: CountVectorizer is a text vectorization technique provided by scikit-learn (sklearn). It converts a collection of text documents into a matrix of token counts.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer to the text data and transform it into BoW representation
bow_matrix = vectorizer.fit_transform(df['tokens'])

# Convert BoW matrix to DataFrame for easier inspection
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Display the BoW DataFrame
bow_df
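
To make the idea of a matrix of token counts concrete, here is a plain-Python sketch of a bag-of-words on three hypothetical documents. It shows the concept only and is not scikit-learn's implementation; for example, CountVectorizer by default also lowercases the text and ignores tokens shorter than two characters.

```python
from collections import Counter

# Hypothetical pre-processed documents (space-joined stems, like df['tokens']).
docs = ["love sunni day", "hate rain day", "love love rain"]

# Vocabulary: the sorted unique tokens, one column per token.
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per document, one count per vocabulary entry.
bow_rows = []
for doc in docs:
    counts = Counter(doc.split())
    bow_rows.append([counts[token] for token in vocabulary])

print(vocabulary)
for row in bow_rows:
    print(row)
```

Each row has as many entries as there are distinct tokens in the whole corpus, which is why the real matrix ends up with tens of thousands of columns, most of them zero.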

## Concatenate bag of words matrix with original dataframe to enable modeling 

In [None]:
# Concatenate bow_df with the original DataFrame df
combined_df = pd.concat([df, bow_df], axis=1)
combined_df

## Balancing the data 

Author note: no data balancing applied in this model 

## Train test split 

In [None]:
from sklearn.model_selection import train_test_split

X = combined_df.drop(columns=['sentiment', 'content', 'sentiment_subgroup', 'tokens'])
y = combined_df['sentiment_subgroup']

# Perform train-test split with 80% training data and 20% testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the train and test sets
print("Train set shapes:", X_train.shape, y_train.shape)
print("Test set shapes:", X_test.shape, y_test.shape)

Author note: Encountered major issues running code as of this point 
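
The note above does not say what went wrong, so the following is only a guess: converting the bag-of-words matrix to a dense DataFrame may simply need too much memory. The back-of-the-envelope estimate below takes the shape from the iteration description as the approximate size of the dense matrix and assumes 8 bytes per cell (64-bit integer counts, CountVectorizer's default dtype).

```python
# Shape reported in the iteration description; treated here as the approximate
# size of the dense bag-of-words matrix (an assumption).
n_rows, n_cols = 39784, 26666

# Assumption: 8 bytes per cell (64-bit integer counts).
bytes_per_cell = 8

dense_bytes = n_rows * n_cols * bytes_per_cell
print(f"dense matrix: {dense_bytes / 1e9:.2f} GB")
```

If memory is indeed the problem, one thing to try is skipping `.toarray()` altogether: the sparse matrix returned by `fit_transform` stores only the non-zero counts, and both `train_test_split` and `RandomForestClassifier` accept sparse input.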

## Train model using the Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc_ops = {
    "max_depth": 6,
    "min_samples_leaf": 20,
    "n_estimators": 100,
    "bootstrap": True,
    "oob_score": True,
    "random_state": 42
}

clf = RandomForestClassifier(**rfc_ops)

clf.fit(X_train, y_train)
print("train prediction accuracy score: %.2f" % (clf.score(X_train, y_train)))
print("test prediction accuracy score: %.2f" % (clf.score(X_test, y_test)))

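- Train prediction accuracy score: 0.40
- Test prediction accuracy score: 0.40

The identical accuracy scores of 0.40 on the training and testing sets suggest that the model is underfitting, which means it is too simple to capture the underlying structure of the data. The tight limits on the trees (max_depth of 6 and min_samples_leaf of 20) may contribute to this. An accuracy figure is also easier to judge against a baseline, such as always predicting the most frequent class.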
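
One way to put the 0.40 in context is to compare it with a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent label. The sketch below computes that baseline on a hypothetical label list. The class sizes are made up, so the real `y_test` values would need to be substituted before drawing any conclusion.

```python
from collections import Counter

# Hypothetical label distribution; substitute the real y_test values.
y_true = ["Negative"] * 45 + ["Neutral"] * 30 + ["Positive"] * 25

# A baseline that always predicts the most frequent class.
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)

print(f"always predicting {majority_label!r}: accuracy {baseline_accuracy:.2f}")
```

scikit-learn's `DummyClassifier(strategy="most_frequent")` provides the same baseline through the usual fit and score interface.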