[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nils-holmberg/socs-qmd/blob/main/jnb/lab3_nlp3.ipynb)

# text classification (naive bayes)

In [None]:
!gdown https://drive.google.com/uc?id=1EMzJxxoBaN_NbvF7xhoc09K82vQ6H_LX

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
fp = "content.xlsx"
df = pd.read_excel(fp, header=0, sheet_name='reviews')
df.head()

In [None]:
import nltk
from nltk.corpus import stopwords

# Download the NLTK stopwords if not already downloaded
nltk.download('stopwords')


## define features and outcomes


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance with NLTK stopwords
nltk_stopwords = stopwords.words('english')
vectorizer = CountVectorizer(stop_words=nltk_stopwords)

# Select the first three rows of the "text" column from the df dataframe
documents = df['text'].iloc[:3]

# Fit and transform the documents using CountVectorizer
dtm = vectorizer.fit_transform(documents)

# Convert the DTM to a Pandas DataFrame
dtm_df = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out())

# Use the actual row indices as document IDs
dtm_df.index = documents.index

# Display the resulting Document-Term Matrix (DTM)
dtm_df
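
To make the counting step concrete, here is a minimal pure-Python sketch of how a document-term matrix could be built by hand. The helper `build_dtm` and the two example sentences are hypothetical and not part of the lab data; the tokenization (lowercasing, tokens of two or more word characters) is meant to mirror `CountVectorizer`'s defaults, but this is an illustration, not the library's implementation.

```python
import re
from collections import Counter

def build_dtm(documents, stop_words=()):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in documents[i]."""
    stop = set(stop_words)
    # Lowercase and keep tokens of two or more word characters,
    # then drop any token listed as a stop word.
    tokenized = [
        [t for t in re.findall(r"\b\w\w+\b", doc.lower()) if t not in stop]
        for doc in documents
    ]
    # The vocabulary is the sorted set of all remaining tokens.
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Made-up example documents (assumption, not taken from content.xlsx)
docs = ["The phone is great, great battery", "The battery died"]
vocab, dtm_rows = build_dtm(docs, stop_words=["the", "is"])
print(vocab)
print(dtm_rows)
```

Each row corresponds to one document and each column to one vocabulary term, which is the same layout as `dtm_df` above.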


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF Vectorizer instance with NLTK stopwords
nltk_stopwords = stopwords.words('english')
tfidf_vectorizer = TfidfVectorizer(stop_words=nltk_stopwords)

# Select the first three rows of the "text" column from the df dataframe
documents = df['text'].iloc[:3]

# Fit and transform the documents using TF-IDF Vectorizer
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Convert the TF-IDF matrix to a Pandas DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Use the actual row indices as document IDs
tfidf_df.index = documents.index

# Display the resulting TF-IDF matrix
tfidf_df
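
The weights in `tfidf_df` can be reproduced by hand for a single document. The sketch below assumes `TfidfVectorizer`'s default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The helper `tfidf_row` and the corpus statistics are made up for illustration.

```python
import math

def tfidf_row(term_counts, doc_freqs, n_docs):
    """Weight one document's raw term counts by smoothed IDF, then L2-normalize."""
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freqs[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# Made-up corpus statistics: 3 documents, "phone" occurs in all of them,
# "awful" occurs in only one.
doc_freqs = {"phone": 3, "awful": 1}
row = tfidf_row({"phone": 1, "awful": 1}, doc_freqs, n_docs=3)
print(row)
```

Both terms occur once in the document, but the rarer term "awful" receives the larger weight, which is the point of the IDF factor.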


## split training and test data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Step 1: Assuming your Excel columns are named 'text' and 'sentiment'
X = df['text']  # Text data
y = df['sentiment']  # Binary sentiment labels

# Step 2: Perform TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # You can adjust the number of features as needed
X_tfidf = tfidf_vectorizer.fit_transform(X)

# Step 3: Split the dataset into training (750 rows) and testing (250 rows)
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.25, random_state=42)

# Now, X_train and y_train contain the training data and labels (750 rows),
# and X_test and y_test contain the testing data and labels (250 rows).
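
Note that the cell above fits the TF-IDF vectorizer on all rows before splitting, so the vocabulary and IDF weights are influenced by the test rows. One common way to avoid this is to wrap the vectorizer and classifier in a scikit-learn pipeline and fit it on the training texts only. The following is a minimal sketch under that assumption; the toy texts and labels are invented and stand in for `df['text']` and `df['sentiment']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data (assumption): 1 = positive, 0 = negative
train_texts = [
    "great phone love it",
    "excellent battery great screen",
    "terrible phone hate it",
    "awful battery bad screen",
]
train_labels = [1, 1, 0, 0]

# The vectorizer is fitted inside the pipeline, on the training texts only
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(train_texts, train_labels)

# Unseen texts are transformed with the training vocabulary, then classified
preds = pipe.predict(["great screen", "awful terrible"])
print(preds)
```

With this design the raw texts are split first (for example with `train_test_split(X, y, ...)`), and the pipeline object can be saved and reused as a single unit.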


## train naive bayes classifier

## validate model performance

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Step 4: Train a Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Step 5: Make predictions on the test data
y_pred = clf.predict(X_test)

# Step 6: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")

# Step 7: Create and plot a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(4, 2))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.show()
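
To read the heatmap correctly it helps to know how the matrix is laid out: rows are true labels and columns are predicted labels, so for labels 0 and 1 the cells are `[[TN, FP], [FN, TP]]`. The hypothetical helper below rebuilds that layout by hand for a few made-up labels; it is a sketch for illustration, not a replacement for `confusion_matrix`.

```python
def confusion_counts(y_true, y_pred):
    """Count (true, predicted) pairs for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1  # row = true label, column = predicted label
    return matrix

# Made-up labels (assumption, not taken from the lab data)
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

matrix = confusion_counts(y_true_demo, y_pred_demo)
(tn, fp), (fn, tp) = matrix
print(matrix)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```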


## save classification model

In a binary classification task, each instance is assigned to one of two classes, usually called the positive class and the negative class. Four measures are commonly used to assess such a model: accuracy, precision, recall, and F1-score.

1. **Accuracy:**
   - Accuracy is a straightforward measure that calculates the ratio of correctly predicted instances (both true positives and true negatives) to the total number of instances in your dataset. It provides an overall assessment of how well your model is performing.
   - Formula: `(TP + TN) / (TP + TN + FP + FN)`
   - TP: True Positives (correctly predicted positive instances)
   - TN: True Negatives (correctly predicted negative instances)
   - FP: False Positives (negative instances incorrectly predicted as positive)
   - FN: False Negatives (positive instances incorrectly predicted as negative)

   High accuracy is desirable, but it can be misleading, especially in imbalanced datasets, where one class significantly outnumbers the other.

2. **Precision:**
   - Precision measures the accuracy of positive predictions made by the model. It calculates the ratio of true positives to the total number of instances predicted as positive (true positives plus false positives).
   - Formula: `TP / (TP + FP)`
   - Precision focuses on minimizing false positives. It's valuable when the cost of false positives is high, such as in medical diagnoses.

3. **Recall (Sensitivity or True Positive Rate):**
   - Recall measures the model's ability to capture all positive instances in the dataset. It calculates the ratio of true positives to the total number of actual positive instances (true positives plus false negatives).
   - Formula: `TP / (TP + FN)`
   - Recall is crucial when you want to avoid missing positive cases. For example, in disease detection, it's essential to have high recall to minimize false negatives.

4. **F1-Score:**
   - The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall, which can be helpful when you want to strike a balance between minimizing false positives and false negatives.
   - Formula: `2 * (Precision * Recall) / (Precision + Recall)`
   - The F1-score ranges from 0 to 1, where a higher score indicates a better balance between precision and recall.

In summary:
- **Accuracy** provides an overall view of your model's performance but may not be suitable for imbalanced datasets.
- **Precision** is useful when minimizing false positives is a priority.
- **Recall** is valuable when minimizing false negatives is critical.
- **F1-Score** balances precision and recall, making it a suitable measure when you want to find a compromise between the two.

The choice of which metric(s) to prioritize depends on the specific goals and requirements of your binary classification task. It's often a good practice to consider a combination of these metrics and evaluate your model's performance comprehensively.
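
The four formulas above can be checked with a few lines of plain Python. The function `binary_metrics` and the counts passed to it are made up for illustration; returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# A reasonably balanced, made-up test set
balanced = binary_metrics(tp=40, tn=45, fp=5, fn=10)
print(balanced)

# An imbalanced, made-up test set (95 negatives, 5 positives) scored by a
# model that always predicts "negative"
always_negative = binary_metrics(tp=0, tn=95, fp=0, fn=5)
print(always_negative)
```

The second case illustrates the caveat about imbalanced data: accuracy is 0.95 even though the model finds none of the positive instances, so recall and F1 are both 0.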

## make inference on new samples

In [None]:
import joblib

# Save the trained model and TF-IDF vectorizer to files
model_filename = 'sentiment_model.pkl'
vectorizer_filename = 'tfidf_vectorizer.pkl'

joblib.dump(clf, model_filename)
joblib.dump(tfidf_vectorizer, vectorizer_filename)


In [None]:
# To load the model and vectorizer for inference:
loaded_model = joblib.load(model_filename)
loaded_vectorizer = joblib.load(vectorizer_filename)

# Example: Using the loaded model and vectorizer to predict sentiment for a new sample string
new_sample = "This phone is not bad"
# Vectorize the new sample using the loaded vectorizer
new_sample_tfidf = loaded_vectorizer.transform([new_sample])
# Use the loaded model for prediction
prediction = loaded_model.predict(new_sample_tfidf)

# Convert the prediction to a human-readable label
# (predict() returns an array, so take its first element)
sentiment_label = "Positive" if prediction[0] == 1 else "Negative"

print(f"Predicted sentiment: {sentiment_label}")


In [None]:
# Reverse transform the TF-IDF vectors to human-readable text
reversed_texts = loaded_vectorizer.inverse_transform(X_test)

# Create a DataFrame for the testing dataset with original and predicted sentiments
test_df = pd.DataFrame({'Original Text': [' '.join(text) for text in reversed_texts],
                        'Original Sentiment': y_test,
                        'Predicted Sentiment': y_pred})

# Access the original unaltered text from the 'df' dataframe using the index labels of the test rows
original_texts = df.loc[y_test.index, 'text'].tolist()

# Add the original texts as the first column in 'test_df'
test_df.insert(0, 'Original Unaltered Text', original_texts)

# Display the updated DataFrame
test_df


# sentiment analysis (transformers)

In [None]:
!gdown https://drive.google.com/uc?id=1EMzJxxoBaN_NbvF7xhoc09K82vQ6H_LX

In [None]:
!pip install -q transformers


In [None]:
from transformers import pipeline
from sklearn.metrics import confusion_matrix

# Load the sentiment analysis pipeline
nlp = pipeline("sentiment-analysis")

# Use the pipeline to predict sentiment on the original unaltered texts
# (truncation=True shortens reviews that exceed the model's maximum input length)
test_df['HuggingFace Prediction'] = test_df['Original Unaltered Text'].apply(lambda text: nlp(text, truncation=True)[0])

# Extract the predicted sentiment labels from the pipeline results and convert to 0 (negative) or 1 (positive)
test_df['HuggingFace Confidence'] = test_df['HuggingFace Prediction'].apply(lambda prediction: prediction['score'])
test_df['HuggingFace Prediction'] = test_df['HuggingFace Prediction'].apply(lambda prediction: 0 if prediction['label'] == 'NEGATIVE' else 1)

# Create a confusion matrix
conf_matrix = confusion_matrix(test_df['Original Sentiment'], test_df['HuggingFace Prediction'])

# Plot the confusion matrix
plt.figure(figsize=(4, 2))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Sentiment")
plt.ylabel("Original Sentiment")
plt.title("Confusion Matrix (Hugging Face Transformers)")
plt.show()
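
The label conversion above maps every label other than `NEGATIVE` to 1, which would silently mislabel any additional class a model might return. A slightly stricter mapping is sketched below. The names `LABEL_TO_INT` and `to_binary_label` are hypothetical, and the sample dictionaries are hand-written in the pipeline's `{'label': ..., 'score': ...}` output format rather than real model output.

```python
LABEL_TO_INT = {"NEGATIVE": 0, "POSITIVE": 1}

def to_binary_label(result):
    """Map one pipeline result dict to 0 or 1, failing loudly on unknown labels."""
    label = result["label"].upper()
    if label not in LABEL_TO_INT:
        raise ValueError(f"Unexpected label: {result['label']!r}")
    return LABEL_TO_INT[label]

# Hand-written sample results (assumption, not real predictions)
sample_outputs = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.91},
]
binary_labels = [to_binary_label(r) for r in sample_outputs]
print(binary_labels)
```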


In [None]:
test_df

In [None]:
nlp("This phone is not good")

In [None]:
# Calculate the confusion matrix for Multinomial Naive Bayes predictions
conf_matrix_nb = confusion_matrix(test_df['Original Sentiment'], test_df['Predicted Sentiment'])

# Calculate the confusion matrix for Hugging Face Transformers predictions
conf_matrix_hf = confusion_matrix(test_df['Original Sentiment'], test_df['HuggingFace Prediction'])

# Create subplots to display confusion matrices side by side
fig, axes = plt.subplots(1, 2, figsize=(8, 2))

# Plot the Multinomial Naive Bayes confusion matrix
sns.heatmap(conf_matrix_nb, annot=True, fmt="d", cmap="Blues", cbar=False, ax=axes[0])
axes[0].set_title("Multinomial Naive Bayes")
axes[0].set_xlabel("Predicted Sentiment")
axes[0].set_ylabel("Original Sentiment")

# Plot the Hugging Face Transformers confusion matrix
sns.heatmap(conf_matrix_hf, annot=True, fmt="d", cmap="Blues", cbar=False, ax=axes[1])
axes[1].set_title("Hugging Face Transformers")
axes[1].set_xlabel("Predicted Sentiment")
axes[1].set_ylabel("Original Sentiment")

# Save the figure to a PNG file before displaying it
# (calling savefig after plt.show() can write an empty image)
fig.savefig('confusion_matrices.png')

# Display the figure
plt.show()
plt.close(fig)

# Save the 'test_df' DataFrame to a tab-separated (TSV) file
test_df.to_csv('test_df.tsv', sep="\t", index=False)


## swedish language texts

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("KBLab/megatron-bert-large-swedish-cased-165k")
model = AutoModelForSequenceClassification.from_pretrained("KBLab/robust-swedish-sentiment-multiclass")

# Example text
swedish_text = "Jag älskar denna produkt!"  # "I love this product!"
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
result = classifier(swedish_text)
print(result)

In [None]:
swedish_text = "denna produkt är mycket bra"  # "this product is very good"
result = classifier(swedish_text)
print(result)

# spacy language models

In [None]:
# medium size language model with word vectors
!python -m spacy download en_core_web_md


In [None]:
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_md")

# Define a word
word = "cat"

# Process the word; nlp() returns a Doc object that exposes the word vector
token = nlp(word)

# Check if the token has a vector
if token.has_vector:
    # Print the word vector as a NumPy array
    print(f"Vector for '{word}':")
    print(token.vector)
else:
    print(f"No vector available for '{word}'.")


In [None]:
print(len(token.vector))

In [None]:
# Define two random words
word1 = "cat"
word2 = "tiger"

# Process the words to get their word vectors
token1 = nlp(word1)
token2 = nlp(word2)

# Check if the tokens are valid words (have word vectors)
if token1.has_vector and token2.has_vector:
    # Compute the cosine similarity between the word vectors
    similarity = token1.similarity(token2)
    print(f"Similarity between '{word1}' and '{word2}': {similarity}")
else:
    print("One or both of the words do not have word vectors available in the spaCy model.")
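
The similarity score above is a cosine similarity between two vectors. The sketch below computes it by hand for short made-up vectors, which stand in for the much longer vectors returned by the spaCy model. The helper `cosine_similarity` is hypothetical, and returning 0.0 for a zero-length vector is a convention chosen for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: no direction, no similarity
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "word vectors" (assumption, not real spaCy vectors)
cat = [0.9, 0.1, 0.3]
tiger = [0.8, 0.2, 0.4]
car = [0.1, 0.9, 0.0]

sim_cat_tiger = cosine_similarity(cat, tiger)
sim_cat_car = cosine_similarity(cat, car)
print(f"cat vs tiger: {sim_cat_tiger:.3f}")
print(f"cat vs car:   {sim_cat_car:.3f}")
```

Vectors pointing in similar directions score close to 1, and orthogonal vectors score 0, regardless of vector length.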


# spacy text inference

In [None]:
fp = "content.xlsx"
df = pd.read_excel(fp, header=None, names=['id', 'image', 'text'])
df.head()

In [None]:
# Load the small spaCy model for sentence segmentation
# (if it is missing, install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Function to split text into sentences and create a new DataFrame
def split_sentences(df):
    sentence_data = {'id': [], 'sentence_number': [], 'sentence_text': []}

    for _, row in df.iterrows():
        doc = nlp(row['text'])
        for i, sentence in enumerate(doc.sents, start=1):
            sentence_data['id'].append(row['id'])
            sentence_data['sentence_number'].append(i)
            sentence_data['sentence_text'].append(sentence.text)

    return pd.DataFrame(sentence_data)

# Create a new DataFrame with sentences
sentences_df = split_sentences(df)

# Print the new DataFrame
print(sentences_df)


In [None]:
# Function to process sentences with spaCy and store results in a list of dictionaries
def process_sentences_with_spacy(df):
    processed_sentences = []

    for _, row in df.iterrows():
        text_id = row['id']
        sentence_text = row['sentence_text']

        doc = nlp(sentence_text)

        # Process each token in the sentence
        processed_tokens = []
        for token in doc:
            processed_token = {
                'text': token.text,
                'lemma': token.lemma_,
                'entity': token.ent_type_,
                'pos': token.pos_
            }
            processed_tokens.append(processed_token)

        processed_sentence = {
            'id': text_id,
            'sentence_text': sentence_text,
            'tokens': processed_tokens
        }
        processed_sentences.append(processed_sentence)

    return processed_sentences

# Apply spaCy processing and create a new column 'spacy_nlp'
sentences_df['spacy_nlp'] = process_sentences_with_spacy(sentences_df)

# Print the updated DataFrame
print(sentences_df)


In [None]:
# Function to analyze sentences with spaCy and store results in a list of dictionaries
def analyze_sentences_with_spacy(df):
    analyzed_sentences = []

    for _, row in df.iterrows():
        doc = nlp(row['sentence_text'])

        # Process each token in the sentence
        analyzed_tokens = []
        for token in doc:
            analyzed_token = {
                'text': token.text,
                'lemma': token.lemma_,
                'entity': token.ent_type_,
                'pos': token.pos_
            }
            analyzed_tokens.append(analyzed_token)

        analyzed_sentence = {
            'id': row['id'],
            'sentence_number': row['sentence_number'],
            'sentence_text': row['sentence_text'],
            'spacy_nlp': analyzed_tokens
        }
        analyzed_sentences.append(analyzed_sentence)

    return analyzed_sentences

# Analyze sentences with spaCy and create a new column 'spacy_nlp'
sentences_df['spacy_nlp'] = analyze_sentences_with_spacy(sentences_df)

# Print the updated DataFrame
sentences_df

In [None]:
# Function to analyze sentences by token, lemma, entity, and pos
def analyze_sentence_with_spacy(text):
    doc = nlp(text)
    tokens = []

    for token in doc:
        token_info = {
            'token': token.text,
            'lemma': token.lemma_,
            'entity': token.ent_type_,
            'pos': token.pos_
        }
        tokens.append(token_info)

    return tokens

# Create a new DataFrame 'tokens_df'
tokens_data = []

for _, row in sentences_df.iterrows():
    id_val = row['id']
    sentence_number = row['sentence_number']
    sentence_text = row['sentence_text']

    tokens_info = analyze_sentence_with_spacy(sentence_text)

    for token_info in tokens_info:
        tokens_data.append({
            'id': id_val,
            'sentence_number': sentence_number,
            'token': token_info['token'],
            'lemma': token_info['lemma'],
            'entity': token_info['entity'],
            'pos': token_info['pos']
        })

tokens_df = pd.DataFrame(tokens_data)

# Print the resulting 'tokens_df'
print(tokens_df)

In [None]:
# Keep only tokens that belong to a named entity
# (spaCy's token.ent_type_ is an empty string for non-entity tokens)
filtered_tokens_df = tokens_df[tokens_df['entity'].notna() & (tokens_df['entity'] != '')]

# Sort the values in the 'entity' column
#filtered_tokens_df['entity'] = filtered_tokens_df['entity'].astype(str)
#filtered_tokens_df = filtered_tokens_df.sort_values(by='entity')
filtered_tokens_df = filtered_tokens_df.loc[filtered_tokens_df['entity'].astype(str).sort_values().index]

# Create a frequency table
entity_frequency = filtered_tokens_df['entity'].value_counts().reset_index()
entity_frequency.columns = ['entity', 'frequency']

# Create a frequency diagram of unique values in the 'entity' column
plt.figure(figsize=(8, 6))
sns.set(style="darkgrid")
entity_plot = sns.countplot(x="entity", data=filtered_tokens_df, palette="Set3")
entity_plot.set_title("Entity Frequency Diagram")
entity_plot.set_xlabel("Entity")
entity_plot.set_ylabel("Frequency")

# Rotate x-axis labels for better readability (optional)
plt.setp(entity_plot.get_xticklabels(), rotation=45, horizontalalignment='right')

# Show the plot
plt.tight_layout()
plt.show()
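
The frequency table behind this plot can also be built without pandas. The sketch below counts entity labels with `collections.Counter`, skipping tokens whose entity label is an empty string (which is what spaCy assigns to tokens outside any named entity). The token records are made up and only mimic the shape of the rows in `tokens_df`.

```python
from collections import Counter

# Made-up token records (assumption) shaped like the rows of tokens_df
tokens = [
    {"token": "Apple", "entity": "ORG"},
    {"token": "opened", "entity": ""},
    {"token": "Paris", "entity": "GPE"},
    {"token": "Google", "entity": "ORG"},
    {"token": "yesterday", "entity": "DATE"},
]

# Count entity labels, ignoring tokens without one
entity_counts = Counter(t["entity"] for t in tokens if t["entity"])
for entity, frequency in entity_counts.most_common():
    print(entity, frequency)
```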



    CARDINAL - Numerals that do not fall under another type
    DATE - Absolute or relative dates or periods
    EVENT - Named hurricanes, battles, wars, sports events, etc.
    FAC - Buildings, airports, highways, bridges, etc.
    GPE - Countries, cities, states
    LANGUAGE - Any named language
    LAW - Named documents made into laws.
    LOC - Non-GPE locations, mountain ranges, bodies of water
    MONEY - Monetary values, including unit
    NORP - Nationalities or religious or political groups
    ORDINAL - "first", "second", etc.
    ORG - Companies, agencies, institutions, etc.
    PERCENT - Percentage, including "%"
    PERSON - People, including fictional
    PRODUCT - Objects, vehicles, foods, etc. (not services)
    QUANTITY - Measurements, as of weight or distance
    TIME - Times smaller than a day
    WORK_OF_ART - Titles of books, songs, etc.

