By: BOUADIF ABDELKRIM

# Part 1: Language Modeling / Regression

In [41]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from gensim.models import Word2Vec
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Loading the CSV data
df = pd.read_csv('answers.csv')

# Preprocessing pipeline
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download the NLTK resources once, not on every call
nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))  # Build the set once for fast lookup
stemmer = PorterStemmer()

def preprocess_text(text):
    tokens = word_tokenize(text.lower())  # Lowercase and tokenize
    tokens = [word for word in tokens if word.isalnum()]  # Remove punctuation
    tokens = [word for word in tokens if word not in stop_words]  # Stop word removal
    tokens = [stemmer.stem(word) for word in tokens]  # Stemming
    return ' '.join(tokens)

df['answer'] = df['answer'].apply(preprocess_text)

# 2. Word2Vec Embedding

# Training a Word2Vec model
sentences = [text.split() for text in df['answer']]
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4) 

# Function to create sentence embeddings (mean of the word vectors)
import numpy as np

def get_sentence_embedding(text):
    words = text.split()
    embedding = [word2vec_model.wv[word] for word in words if word in word2vec_model.wv]
    if embedding:
        return np.mean(embedding, axis=0)
    # Return a zero vector if no word is in the vocabulary
    return np.zeros(word2vec_model.vector_size)

df['embedding'] = df['answer'].apply(get_sentence_embedding) 

# 3. Model Training and Evaluation

# Splitting the data (keep the targets so they can be used below)
X_train, X_test, y_train, y_test = train_test_split(
    df['embedding'].tolist(), df['score'], test_size=0.2, random_state=42
)

# Discretize the 'score' column for Naive Bayes, which needs class labels
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
y_train_discrete = discretizer.fit_transform(y_train.values.reshape(-1, 1)).astype(int).ravel()
y_test_discrete = discretizer.transform(y_test.values.reshape(-1, 1)).astype(int).ravel()

# Creating models
models = {
    "Linear Regression": LinearRegression(),
    "Naive Bayes": GaussianNB(),
    "Support Vector Regression": SVR(),
    "Decision Tree": DecisionTreeRegressor()
}

# Train and evaluate each model
for name, model in models.items():
    if name == "Naive Bayes":
        # GaussianNB is a classifier: it is trained on the binned scores
        # and predicts bin indices (0-2), not raw scores
        model.fit(X_train, y_train_discrete.ravel())
    else:
        model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    rmse = mse ** 0.5  # Avoids the `squared` argument, deprecated in newer scikit-learn versions
    r2 = r2_score(y_test, y_pred)

    print(f"----- {name} -----")
    print(f"Mean Squared Error (MSE): {mse}")
    print(f"Root Mean Squared Error (RMSE): {rmse}")
    print(f"R-squared (R2): {r2}")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

----- Linear Regression -----
Mean Squared Error (MSE): 1.1533231664821708
Root Mean Squared Error (RMSE): 1.0739288460983674
R-squared (R2): 0.09829382981622514
----- Naive Bayes -----
Mean Squared Error (MSE): 6.840235173824131
Root Mean Squared Error (RMSE): 2.615384326217493
R-squared (R2): -4.347921936362711




----- Support Vector Regression -----
Mean Squared Error (MSE): 1.6828817343765925
Root Mean Squared Error (RMSE): 1.297259316550316
R-squared (R2): -0.3157325610699975
----- Decision Tree -----
Mean Squared Error (MSE): 1.6543493965496197
Root Mean Squared Error (RMSE): 1.2862151439590577
R-squared (R2): -0.29342503633100936




### Model Selection and Interpretation

**Linear Regression:**
- Mean Squared Error (MSE): 1.153
- Root Mean Squared Error (RMSE): 1.074
- R-squared (R2): 0.098

**Naive Bayes:**
- Mean Squared Error (MSE): 6.840
- Root Mean Squared Error (RMSE): 2.615
- R-squared (R2): -4.348

**Support Vector Regression:**
- Mean Squared Error (MSE): 1.683
- Root Mean Squared Error (RMSE): 1.297
- R-squared (R2): -0.316

**Decision Tree:**
- Mean Squared Error (MSE): 1.654
- Root Mean Squared Error (RMSE): 1.286
- R-squared (R2): -0.293

Based on these metrics:

1. **Linear Regression** has the lowest MSE (1.153) and RMSE (1.074) of the four models.
2. It is also the only model with a positive R-squared (0.098), meaning it explains about 10% of the variance in the 'score'.
3. Naive Bayes has by far the highest MSE and RMSE and a strongly negative R-squared. GaussianNB is a classifier trained on the score discretized into three bins, so its predictions are bin indices (0 to 2) rather than scores. Comparing them with the raw 'score' values inflates the error, so its figures are not directly comparable with those of the other models.
4. Support Vector Regression and Decision Tree have higher MSE and RMSE than Linear Regression, and their negative R-squared values mean they do worse than a constant prediction equal to the mean test score.

Therefore, **Linear Regression** is the best of the four models for this regression task, although its fit is weak in absolute terms.
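To make the Naive Bayes row comparable with the others, the predicted bin indices could be mapped back to the score scale before computing the error. The sketch below is only an illustration under assumptions: the scores are made up, and the bin indices stand in for classifier predictions. It uses the `bin_edges_` attribute of `KBinsDiscretizer` to compute the midpoint of each bin.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import mean_squared_error

# Hypothetical scores on a 0-5 scale (not the lab's data)
y = np.array([0.0, 1.0, 2.5, 3.0, 4.0, 5.0])

disc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
bins = disc.fit_transform(y.reshape(-1, 1)).astype(int).ravel()

# Midpoint of each bin, expressed on the original score scale
edges = disc.bin_edges_[0]
centers = (edges[:-1] + edges[1:]) / 2

# Treat the bin indices as if they were classifier predictions
mse_raw_bins = mean_squared_error(y, bins)          # indices vs. scores: mixed scales
mse_centers = mean_squared_error(y, centers[bins])  # both on the score scale

print(f"MSE using bin indices: {mse_raw_bins:.3f}")
print(f"MSE using bin centers: {mse_centers:.3f}")
```

Even with perfect bin assignments, the first figure is large simply because indices and scores live on different scales.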

Interpretation and Insights:
- The positive R-squared of the Linear Regression model suggests that the 'answer' text has some predictive power for the 'score'.
- However, an R-squared of 0.098 means that about 90% of the variance is left unexplained, so the relationship captured here is weak.
- Either the 'score' depends on factors that the 'answer' text alone does not capture, or the averaged Word2Vec embeddings lose information that matters for scoring. Further analysis is needed to tell these apart and to improve the predictive accuracy of the model.
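The R-squared values are easier to read against a reference point: a model that always predicts the mean scores exactly 0, and anything worse than that goes negative. A minimal sketch with made-up numbers (the helper `r2` and the data are illustrative, not part of the lab code):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean predictor
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical scores

r2_mean = r2(y_true, np.full(5, y_true.mean()))  # baseline: always predict the mean
r2_bad = r2(y_true, y_true[::-1])                # predictions in reversed order

print(r2_mean, r2_bad)
```

Read this way, an R-squared of 0.098 means the squared error is roughly 10% lower than that of the mean predictor.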

# Part 2: Language Modeling / Classification

* Without embeddings

In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer

# Downloading NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Loading dataset
df = pd.read_csv('twitter_training.csv', header=None)  

# Defining target and content columns
target_column = 2  
content_column = 3 

# Handling missing values
df.fillna('', inplace=True)

# Preprocessing NLP Pipeline
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text.lower())
    # Removing stopwords and single character words
    tokens = [word for word in tokens if word not in stop_words and len(word) > 1]
    # Lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Convert tokens back to string
    return ' '.join(tokens)

# Applying preprocessing to content column
df['preprocessed_text'] = df[content_column].apply(preprocess_text)

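# Encoding Data Vectors
# Word2Vec expects tokenized sentences (lists of words), not raw strings
tokenized_texts = [text.split() for text in df['preprocessed_text']]
# CBOW Model
cbow_model = Word2Vec(tokenized_texts, vector_size=100, window=5, min_count=1, sg=0)
# Skip-gram Model
skipgram_model = Word2Vec(tokenized_texts, vector_size=100, window=5, min_count=1, sg=1)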
# Bag of Words (BoW)
bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(df['preprocessed_text'])
# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_tfidf  = tfidf_vectorizer.fit_transform(df['preprocessed_text'])

y = df[target_column]  # defining Target variable

# 3. Model Training and Evaluation (with Hyperparameter Tuning)

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Creating models with hyperparameter grids for GridSearchCV
models = {
    "Random Forest": (RandomForestClassifier(), {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10]}),
    "SVM": (SVC(), {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]}),
}

# Training and evaluating each model with GridSearchCV
for name, (model, param_grid) in models.items():
    print(f"----- {name} (TF-IDF) -----")
    grid_search = GridSearchCV(model, param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)
    
    # Evaluating the model
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    print(f"Accuracy: {accuracy}")
    print("Classification Report:")
    print(report)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


----- Random Forest (TF-IDF) -----
Accuracy: 0.944954128440367
Classification Report:
              precision    recall  f1-score   support

  Irrelevant       0.95      0.86      0.90        22
    Negative       1.00      0.95      0.97        55
     Neutral       1.00      1.00      1.00         9
    Positive       0.82      1.00      0.90        23

    accuracy                           0.94       109
   macro avg       0.94      0.95      0.94       109
weighted avg       0.95      0.94      0.95       109

----- SVM (TF-IDF) -----
Accuracy: 0.9541284403669725
Classification Report:
              precision    recall  f1-score   support

  Irrelevant       0.95      0.86      0.90        22
    Negative       1.00      0.96      0.98        55
     Neutral       1.00      1.00      1.00         9
    Positive       0.85      1.00      0.92        23

    accuracy                           0.95       109
   macro avg       0.95      0.96      0.95       109
weighted avg       0.9
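One caveat about the cell above: the TF-IDF vectorizer is fitted on the whole dataset before the split, so vocabulary and IDF weights from the test tweets reach the training step. A common way to avoid this is to wrap the vectorizer and the classifier in a `Pipeline`, so that each cross-validation fold refits the vectorizer on its own training part. The sketch below uses a tiny made-up corpus, not the tweet dataset, and the step names `tfidf` and `svm` are arbitrary.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Tiny hypothetical corpus (not the tweet dataset)
texts = ["love game", "great game fun", "love fun", "great love",
         "hate game", "bad game bug", "hate bug", "bad hate"]
labels = ["Positive"] * 4 + ["Negative"] * 4

# The vectorizer is part of the model, so it is refitted inside each CV fold
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svm", SVC())])
param_grid = {"svm__kernel": ["linear", "rbf"], "svm__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)  # raw strings go in; no pre-computed matrix

pred = search.predict(["love great fun", "bad bug hate"])
print(search.best_params_, pred)
```

With this layout, `train_test_split` would be applied to the preprocessed strings rather than to `X_tfidf`.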

* With embeddings

In [14]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Loading dataset
df = pd.read_csv('twitter_training.csv', header=None)  # Specify header=None as there are no column names

# Defining target and content columns
target_column = 2  
content_column = 3  

# Handling missing values
df.fillna('', inplace=True)

# Preprocessing NLP Pipeline
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text.lower())
    # Removing stopwords and single character words
    tokens = [word for word in tokens if word not in stop_words and len(word) > 1]
    # Lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Convert tokens back to string
    return ' '.join(tokens)

# Applying preprocessing to content column
df['preprocessed_text'] = df[content_column].apply(preprocess_text)

# Generating Word2Vec embeddings
sentences = [text.split() for text in df['preprocessed_text']]
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1) 

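# Preparing data for training: average the word vectors of each tweet
def sentence_vector(tokens):
    vectors = [word2vec_model.wv[word] for word in tokens if word in word2vec_model.wv]
    # Fall back to a zero vector when no token is in the vocabulary
    return np.mean(vectors, axis=0) if vectors else np.zeros(word2vec_model.vector_size)

X = np.array([sentence_vector(sentence) for sentence in sentences])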
y = df[target_column]

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate SVM
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_pred)
print("SVM Accuracy:", svm_accuracy)

# Train and evaluate Naive Bayes
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_pred = nb_model.predict(X_test)
nb_accuracy = accuracy_score(y_test, nb_pred)
print("Naive Bayes Accuracy:", nb_accuracy)

# Train and evaluate Logistic Regression
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_pred)
print("Logistic Regression Accuracy:", lr_accuracy)

# Train and evaluate AdaBoost
ada_model = AdaBoostClassifier()
ada_model.fit(X_train, y_train)
ada_pred = ada_model.predict(X_test)
ada_accuracy = accuracy_score(y_test, ada_pred)
print("AdaBoost Accuracy:", ada_accuracy)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


SVM Accuracy: 0.5229357798165137
Naive Bayes Accuracy: 0.3577981651376147
Logistic Regression Accuracy: 0.5229357798165137




AdaBoost Accuracy: 0.5412844036697247


Accuracy scores for the classification models:

- SVM Accuracy: 0.523
- Naive Bayes Accuracy: 0.358
- Logistic Regression Accuracy: 0.523
- AdaBoost Accuracy: 0.541

The AdaBoost model achieved the highest accuracy among the four models, with an accuracy of 0.541. 
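These accuracies are easier to judge against a majority-class baseline. The TF-IDF classification report lists 55 "Negative" tweets among 109 test tweets; assuming the embedding run uses the same split (same `test_size` and `random_state`), always predicting "Negative" would already be right about half the time. The sketch below rebuilds the label counts from that report and uses `DummyClassifier`; fitting on the test labels is done purely for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Test-set class counts taken from the TF-IDF classification report
y_test = np.array(["Irrelevant"] * 22 + ["Negative"] * 55
                  + ["Neutral"] * 9 + ["Positive"] * 23)
X_dummy = np.zeros((len(y_test), 1))  # features are ignored by the dummy model

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_test)  # illustration only; normally fitted on training labels
baseline_acc = accuracy_score(y_test, baseline.predict(X_dummy))

print(f"Majority-class baseline accuracy: {baseline_acc:.3f}")
```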

**Argument for Choosing AdaBoost as the Best Model:**

1. **Higher Accuracy:** AdaBoost achieved the highest accuracy of the four models (0.541). The margin over SVM and Logistic Regression (0.523) is small, however: it corresponds to two tweets if the test set holds the same 109 tweets as in the TF-IDF run.

2. **Robustness:** AdaBoost is an ensemble learning method that combines multiple weak classifiers into a strong classifier. It often generalizes well, although it can be sensitive to noisy labels and outliers because misclassified samples receive increasing weight.

3. **Versatility:** AdaBoost can be used with various base classifiers, making it versatile and suitable for different types of data and classification tasks.

4. **Interpretability:** Although AdaBoost is an ensemble method, it still exposes feature importances. Here the features are Word2Vec embedding dimensions, so these importances are hard to relate to specific words.

**Interpretation of Results:**

AdaBoost's accuracy of 0.541 is the best of the embedding-based models, but it is modest in absolute terms and far below the TF-IDF models from the previous cell (Random Forest 0.945, SVM 0.954). This suggests that the averaged Word2Vec embeddings, trained only on this dataset, capture less of the signal needed for this task than the TF-IDF features do. AdaBoost is therefore the preferred choice among the four models tested here, but its predictions on new, unseen data should be treated with caution.
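Because the models are compared on a single split, a small gap in accuracy may not be stable. One way to check is k-fold cross-validation, which reports a mean and a spread per model. The sketch below is an assumption-laden illustration: it runs on synthetic data from `make_classification` as a stand-in for the sentence embeddings, and compares only two of the four models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 4-class data standing in for the embeddings (20 features to keep it fast)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in candidates.items():
    # One accuracy value per fold
    scores[name] = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores[name].mean():.3f} +/- {scores[name].std():.3f}")
```

If the fold-to-fold spread is larger than the gap between two models, the ranking from a single split should not be given much weight.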

# Synthesis

During this lab, I tackled two different datasets. First up was the answers data, where I had to predict scores from the content of the answers. I cleaned up the text, turned it into numbers with averaged Word2Vec embeddings, and then trained SVR, Naive Bayes, Linear Regression, and Decision Tree models. Judged by Mean Squared Error (MSE) and R-squared (R2), Linear Regression came out best, though with an R2 of only 0.098.

Then, I switched gears to the tweets data. Here, I had to classify tweets as Positive, Negative, Neutral, or Irrelevant. As in the first part, I preprocessed the text and encoded it as numbers, this time with both TF-IDF and Word2Vec embeddings. With TF-IDF I tuned Random Forest and SVM using grid search; with embeddings I trained SVM, Naive Bayes, Logistic Regression, and AdaBoost. Measured by accuracy, the TF-IDF models reached about 0.95, while the best embedding-based model, AdaBoost, reached 0.541.

Overall, it was fascinating to see how similar techniques could be applied to different types of data. It really highlighted the versatility of natural language processing and machine learning algorithms in understanding and analyzing text data.