### Analyzing sentiments from text data

Library Installation:
- Install the required libraries with pip from a terminal window: `pip install scikit-learn gensim nltk`. The code below imports all three.
- Assumption: You already have pandas and numpy installed. If not, install them with pip as well.

Sentiment Analysis Comparison:
The exercise involves analyzing sentiments using two different approaches: Word2Vec and TF-IDF.

Each piece of code will be marked to indicate which approach it corresponds to.

In summary, you’ll install libraries, perform sentiment analysis, and compare the results between Word2Vec and TF-IDF.

Word2Vec:
- Purpose: Word2Vec converts words into dense, fixed-size vectors (embeddings) in an n-dimensional space.
- Method: It learns these embeddings by analyzing the context in which words appear within a large text corpus.
- Benefits: 
    - Captures semantic relationships between words (e.g., ‘house’ and ‘home’ are similar).
    - Enables understanding of word meanings.
    - Useful for downstream tasks like sentiment analysis and recommendation systems.
- Example: The word ‘car’ might be represented as [-0.016, -0.0003, 0.0899, …].
- Use Case: Great for deep learning models.
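
To make "similar words get similar vectors" concrete, here is a minimal sketch that compares toy embeddings with cosine similarity. The 4-dimensional vectors and the names `toy_embeddings` and `cosine_similarity` are invented for illustration; they are not output from a trained Word2Vec model.

```python
import math

# Hypothetical 4-dimensional embeddings, invented for illustration only.
# Real Word2Vec vectors are learned from a corpus and are usually much longer.
toy_embeddings = {
    "house": [0.90, 0.10, 0.30, 0.00],
    "home":  [0.85, 0.15, 0.35, 0.05],
    "car":   [0.10, 0.90, 0.00, 0.40],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_house_home = cosine_similarity(toy_embeddings["house"], toy_embeddings["home"])
sim_house_car = cosine_similarity(toy_embeddings["house"], toy_embeddings["car"])
print(f"house vs home: {sim_house_home:.3f}")
print(f"house vs car:  {sim_house_car:.3f}")
```

With these toy values, ‘house’ and ‘home’ point in almost the same direction, while ‘house’ and ‘car’ do not. A trained model exposes the same idea through its learned vectors.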

TF-IDF (Term Frequency-Inverse Document Frequency):
- Purpose: TF-IDF converts text documents into sparse vectors.
- Method: It considers word frequencies in each document relative to their occurrence across all documents.
- Benefits:
    - Reflects word importance within a specific document.
    - Commonly used for text classification and information retrieval.
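- Example: If a document contains ‘house,’ the ‘house’ column holds a non-zero weight for that document (a TF-IDF score, not simply ‘1’). Documents without the word hold ‘0’ in that column.
- Use Case: Works well with simple machine learning algorithms such as logistic regression.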
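
The weighting can be sketched by hand on a toy corpus. This is an illustrative sketch, not the code used later in the exercise: the corpus and the helper `tfidf_vector` are invented here, tokenization is plain whitespace splitting, and the formula is the smoothed IDF with L2 normalisation that scikit-learn documents as its default.

```python
import math
from collections import Counter

# Toy corpus invented for illustration.
docs = [
    "the house is big",
    "the car is fast",
    "the house has a garden",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(tokenized)

# Document frequency: the number of documents each word appears in.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_vector(tokens):
    """TF-IDF weights for one document, with smoothed IDF and L2 normalisation."""
    counts = Counter(tokens)
    weights = []
    for word in vocabulary:
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
        weights.append(counts[word] * idf)
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

vectors = [tfidf_vector(tokens) for tokens in tokenized]
house_col = vocabulary.index("house")
the_col = vocabulary.index("the")
for doc, vec in zip(docs, vectors):
    print(f"{doc!r}: house={vec[house_col]:.3f}, the={vec[the_col]:.3f}")
```

In the first document, ‘the’ gets a lower weight than ‘house’ because it occurs in every document. The second document has a zero in the ‘house’ column, which is why the vectors are sparse.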

In summary, both methods offer sophisticated ways to represent language numerically. Word2Vec captures context and semantics, while TF-IDF focuses on word importance within documents.

In [None]:
# Libraries used by both approaches
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
# Libraries used only for Word2Vec
import nltk
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Library used only for TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# Load dataset (used by both approaches; expects 'text' and 'sentiment' columns)
df = pd.read_csv('sentiments.csv')

# Download nltk data (if not already downloaded) for Word2Vec
# Note: newer NLTK releases may also require nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

# Function to preprocess text for Word2Vec (tokenize and drop stop words)
def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.lower() not in stop_words]
    return tokens

# Tokenize and preprocess text for Word2Vec
df['tokens'] = df['text'].apply(preprocess_text)

# Train Word2Vec model (Parameters 01)
word2vec_model = Word2Vec(df['tokens'], vector_size=200, window=5, min_count=1, sg=1)

# Function to average word vectors for Word2Vec
def average_word_vectors(tokens, model, vector_size):
    if len(tokens) < 1:
        return np.zeros(vector_size)
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if len(vectors) < 1:
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)

# Vectorize text using averaged Word2Vec
df['word_vectors'] = df['tokens'].apply(lambda x: average_word_vectors(x, word2vec_model, 200))
X_word2vec = pd.DataFrame(df['word_vectors'].tolist())
y = df['sentiment']

# TF-IDF vectorizer declaration (keeps the 200 most frequent terms)
tfidf_vectorizer = TfidfVectorizer(max_features=200)
# Vectorize text using TF-IDF
X_tfidf = tfidf_vectorizer.fit_transform(df['text']).toarray()

# Feature Scaling declaration: one scaler per feature set, so that each
# keeps its own statistics and can be reused correctly at prediction time
scaler_w2v = StandardScaler()
scaler_tfidf = StandardScaler()

# Feature Scaling on Word2Vec
X_word2vec_scaled = scaler_w2v.fit_transform(X_word2vec)
# Feature Scaling on TF-IDF
X_tfidf_scaled = scaler_tfidf.fit_transform(X_tfidf)

# Splitting data into training and testing sets on Word2Vec (Parameters 02)
X_train_w2v, X_test_w2v, y_train, y_test = train_test_split(X_word2vec_scaled, y, test_size=0.2, random_state=42)
# Splitting data into training and testing sets on TF-IDF (Parameters 03)
# Same test_size and random_state, so the rows line up with y_train and y_test
X_train_tfidf, X_test_tfidf, _, _ = train_test_split(X_tfidf_scaled, y, test_size=0.2, random_state=42)

# Train Logistic Regression model on Word2Vec (Parameters 04)
model_w2v = LogisticRegression(max_iter=1000, random_state=42)
model_w2v.fit(X_train_w2v, y_train)

# Train Logistic Regression model on TF-IDF (Parameters 05)
model_tfidf = LogisticRegression(max_iter=1000, random_state=42)
model_tfidf.fit(X_train_tfidf, y_train)

# Predicting on test data Word2Vec
y_pred_w2v = model_w2v.predict(X_test_w2v)
# Predicting on test data TF-IDF
y_pred_tfidf = model_tfidf.predict(X_test_tfidf)

# Evaluating the models
print("Word2Vec Model Evaluation")
print(f"Train Accuracy: {accuracy_score(y_train, model_w2v.predict(X_train_w2v))}")
print(f"Test Accuracy: {accuracy_score(y_test, y_pred_w2v)}")
print(classification_report(y_test, y_pred_w2v, zero_division=1))

print("TF-IDF Model Evaluation")
print(f"Train Accuracy: {accuracy_score(y_train, model_tfidf.predict(X_train_tfidf))}")
print(f"Test Accuracy: {accuracy_score(y_test, y_pred_tfidf)}")
print(classification_report(y_test, y_pred_tfidf, zero_division=1))

# Function to predict sentiment of new text using Word2Vec
def predict_sentiment_w2v(text):
    tokens = preprocess_text(text)
    word_vector = average_word_vectors(tokens, word2vec_model, 200) #Parameters 06
    word_vector_scaled = scaler_w2v.transform([word_vector])
    return model_w2v.predict(word_vector_scaled)[0]

# Function to predict sentiment of new text using TF-IDF
def predict_sentiment_tfidf(text):
    text_transformed = tfidf_vectorizer.transform([text]).toarray()
    text_transformed_scaled = scaler_tfidf.transform(text_transformed)
    return model_tfidf.predict(text_transformed_scaled)[0]

# Example usage
new_text_1 = "I am very satisfied with the service."
print('Sentiment for phrase: ', new_text_1)
print(f"Word2Vec Sentiment: {predict_sentiment_w2v(new_text_1)}")
print(f"TF-IDF Sentiment: {predict_sentiment_tfidf(new_text_1)}")

new_text_2 = "Terrible, would not recommend."
print('Sentiment for phrase: ', new_text_2)
print(f"Word2Vec Sentiment: {predict_sentiment_w2v(new_text_2)}")
print(f"TF-IDF Sentiment: {predict_sentiment_tfidf(new_text_2)}")

Now let’s break down the parameters (feel free to play with them)

Parameters 01 (Word2Vec(df['tokens'], vector_size=200, window=5, min_count=1, sg=1))
- df['tokens']: the 'tokens' column of the DataFrame (df), which holds one list of preprocessed tokens per text. This column is the training corpus for Word2Vec.
- vector_size=200: This parameter specifies the dimensionality of the word vectors (also known as word embeddings). In this case, each word will be represented as a 200-dimensional vector.
- window=5: The window parameter determines the maximum distance between the current word and the context words within a sentence. A smaller window focuses on nearby context words, while a larger window considers a broader context.
- min_count=1: This parameter sets the minimum frequency count for a word to be included in the vocabulary. Words that occur less frequently than the specified count are ignored.
- sg=1: The sg parameter stands for “skip-gram.” When sg=1, the skip-gram model is used; when sg=0, the continuous bag-of-words (CBOW) model is used. Skip-gram aims to predict context words given a target word, while CBOW predicts the target word based on context.
In summary, the Word2Vec class trains word embeddings on the provided text data, capturing semantic relationships between words. The resulting vectors can be used for various natural language processing tasks.
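
To see what the window parameter controls, the sketch below lists the (target, context) pairs a skip-gram model would train on for one sentence. The helper `skipgram_pairs` and the sample sentence are invented for illustration. The real gensim implementation adds details not shown here, such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["service", "was", "quick", "and", "friendly"]
for window in (1, 2):
    pairs = skipgram_pairs(sentence, window)
    print(f"window={window}: {len(pairs)} pairs, e.g. {pairs[:3]}")
```

A larger window produces more pairs per sentence and links words that sit further apart, which is the "broader context" described above.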

Parameters 02 & 03 (train_test_split(X_word2vec_scaled, y, test_size=0.2, random_state=42))
- X_word2vec_scaled (X_tfidf_scaled in Parameters 03): This is the feature matrix (often denoted as X) containing the input data (independent variables). It represents the features you’ll use to train your machine learning model.
- y: This is the target vector (often denoted as y) containing the output labels (dependent variable). It represents the values you’re trying to predict or classify.
- test_size=0.2: This parameter specifies the proportion of the dataset that should be allocated to the test set. In this case, 20% of the data will be used for testing, while the remaining 80% will be used for training.
- random_state=42: This parameter sets the random seed for shuffling the data before splitting. It ensures that the same split is obtained each time you run the code with the same random seed (useful for reproducibility).
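In summary, train_test_split divides your data into training and test sets, allowing you to evaluate your model’s performance on data it was not trained on. Note that in the code above the Word2Vec model, the TF-IDF vectorizer, and the feature scaling are all fitted on the full dataset before the split, so the test scores may be slightly optimistic.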
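
The effect of random_state can be sketched with a small seeded shuffle. This is a simplified stand-in written for illustration, not scikit-learn’s implementation; the function `simple_train_test_split` and the sample data are assumptions.

```python
import math
import random

def simple_train_test_split(items, test_size, seed):
    """Shuffle indices with a seeded RNG, then take the first share as the test set."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(items) * test_size)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

reviews = [f"review {i}" for i in range(10)]
train_a, test_a = simple_train_test_split(reviews, test_size=0.2, seed=42)
train_b, test_b = simple_train_test_split(reviews, test_size=0.2, seed=42)
print("train:", len(train_a), "test:", len(test_a))
print("same split with same seed:", test_a == test_b)
```

The same reasoning explains why the exercise can split the Word2Vec and TF-IDF matrices in two separate calls: with the same number of rows, test_size, and random_state, both calls select the same rows, so one pair of y_train and y_test fits both.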

Parameters 04 & 05 (LogisticRegression(max_iter=1000, random_state=42))
- max_iter=1000: This parameter specifies the maximum number of iterations the solver may run while fitting the model’s coefficients. It is an upper bound: the solver stops earlier once it converges, and scikit-learn emits a convergence warning if the limit is reached first.
- random_state=42: This parameter sets the random seed for the solvers that shuffle the data ('sag', 'saga', and 'liblinear'). With the default 'lbfgs' solver it has no effect, but setting it keeps results reproducible if you switch solvers.
In summary, LogisticRegression is a popular classification algorithm for binary and multiclass tasks. It learns a weight for each feature and predicts the class with the highest estimated probability.
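
The role of max_iter is easier to see in a hand-written training loop. The sketch below fits a one-feature logistic regression with plain gradient descent. This is a deliberate simplification: scikit-learn’s default solver is 'lbfgs' with L2 regularization, not this loop, and the function `fit_logistic_1d`, the learning rate, and the data are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_1d(xs, ys, max_iter=1000, learning_rate=0.5, tol=1e-6):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        grad_w /= len(xs)
        grad_b /= len(xs)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        # Stop early if the gradient is already tiny; otherwise run until max_iter.
        if max(abs(grad_w), abs(grad_b)) < tol:
            break
    return w, b, iteration

# Toy data: negative x values are class 0, positive x values are class 1.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w_short, b_short, n_short = fit_logistic_1d(xs, ys, max_iter=5)
w_long, b_long, n_long = fit_logistic_1d(xs, ys, max_iter=1000)
print(f"max_iter=5:    w={w_short:.3f} after {n_short} iterations")
print(f"max_iter=1000: w={w_long:.3f} after {n_long} iterations")
print("p(y=1 | x=1.0):", round(sigmoid(w_long * 1.0 + b_long), 3))
```

A low max_iter cuts training short and leaves the weight closer to its starting value, while a higher limit lets the fit continue.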

Parameters 06 (average_word_vectors(tokens, word2vec_model, 200))
- tokens: This is the list of tokenized words that preprocess_text returns for one text.
- word2vec_model: This refers to the Word2Vec model that you’ve trained earlier. It contains word embeddings (vectors) learned from your text corpus.
- 200: This value specifies the dimensionality of the word vectors. In this case, each word vector is expected to have 200 dimensions.
The average_word_vectors function computes the average vector representation for a given list of tokens. It works as follows:
- If the tokens list is empty, or none of the tokens are in the model’s vocabulary, return a vector of zeros.
- Otherwise:
    - Look up the word vector of each token that exists in the word2vec_model vocabulary.
    - Sum up those word vectors.
    - Divide the sum by the number of vectors found (not the total number of tokens) to get the average vector.
The resulting average vector can be used as a representation for the entire set of tokens. This approach is often used when you want to represent a document or sentence as a single vector based on its constituent words.
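
The averaging step can be checked by hand on a tiny example. The sketch below uses a plain dictionary in place of a trained model; the 3-dimensional vectors and the names `toy_vectors` and `average_vectors` are invented for illustration.

```python
# Hypothetical 3-dimensional embedding table, invented for illustration.
toy_vectors = {
    "good":    [1.0, 0.0, 2.0],
    "service": [3.0, 2.0, 0.0],
}

def average_vectors(tokens, vectors, vector_size):
    """Average the vectors of known tokens; return zeros if none are known."""
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        return [0.0] * vector_size
    return [sum(column) / len(found) for column in zip(*found)]

# "zzz" is not in the table, so it is skipped rather than counted.
print(average_vectors(["good", "service", "zzz"], toy_vectors, 3))  # [2.0, 1.0, 1.0]
print(average_vectors(["zzz"], toy_vectors, 3))                     # [0.0, 0.0, 0.0]
```

Because unknown words are skipped, a new sentence made up entirely of words the model has never seen is mapped to the zero vector.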