<a href="https://colab.research.google.com/github/kienvillegas/NLP_Projects/blob/main/Who's_Tweeting%3F_Trump_or_Trudeau%3F.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**1. Data Preprocessing**

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**1.1. Data Loading**

In [None]:
import pandas as pd

file_path = '/content/drive/MyDrive/Dataset/tweets.csv'
df = pd.read_csv(file_path)

**1.2. Data Inspecting**

In [None]:
print(df.head())
print(df.isnull().sum())

   id           author                                             status
0   1  Donald J. Trump  I will be making a major statement from the @W...
1   2  Donald J. Trump  Just arrived at #ASEAN50 in the Philippines fo...
2   3  Donald J. Trump  After my tour of Asia, all Countries dealing w...
3   4  Donald J. Trump  Great to see @RandPaul looking well and back o...
4   5  Donald J. Trump  Excited to be heading home to see the House pa...
id        0
author    0
status    0
dtype: int64


**2. Text Cleaning**
1. Remove URLs
2. Remove mentions (@name)
3. Remove hashtags (#hashtags)
4. Remove special characters (,./?!&*-)
5. Convert to lowercase
6. Convert the target labels (author names) to numerical values
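
Steps 1 to 5 above can be sketched as a single helper function. This is a minimal illustration, not the notebook's own code: the name `clean_tweet`, the sample tweet, and the final whitespace-collapsing step are assumptions added here.

```python
import re

def clean_tweet(text: str) -> str:
    """Apply cleaning steps 1-5 to a single tweet (hypothetical helper)."""
    text = re.sub(r"http\S+|www\S+", "", text)  # 1. remove URLs
    text = re.sub(r"@\w+", "", text)            # 2. remove mentions
    text = re.sub(r"#\w+", "", text)            # 3. remove hashtags
    text = re.sub(r"[^\w\s]", "", text)         # 4. remove special characters
    text = text.lower()                         # 5. convert to lowercase
    # Extra step (assumption, not in the list above): collapse leftover whitespace
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample tweet for illustration
sample = "Great to see @RandPaul at #ASEAN50! https://t.co/abc123"
cleaned = clean_tweet(sample)
print(cleaned)
```

Wrapping the steps in one function lets the same cleaning be applied with a single `df['status'].apply(clean_tweet)` call and reused later on new tweets at prediction time.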


In [None]:
import re

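# 1. Remove URLs (http\S+ already covers https links)
df['cleaned_tweets'] = df['status'].apply(lambda x: re.sub(r"http\S+|www\S+", "", x))
# 2. Remove mentions (@name)
df['cleaned_tweets'] = df['cleaned_tweets'].apply(lambda x: re.sub(r"@\w+", '', x))
# 3. Remove hashtags (#hashtag)
df['cleaned_tweets'] = df['cleaned_tweets'].apply(lambda x: re.sub(r'#\w+', '', x))
# 4. Remove special characters (anything that is not a word character or whitespace)
df['cleaned_tweets'] = df['cleaned_tweets'].apply(lambda x: re.sub(r"[^\w\s]", '', x))
# 5. Convert to lowercase
df['cleaned_tweets'] = df['cleaned_tweets'].apply(lambda x: x.lower())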

print(df['author'])
df['author'] = df.author.map({'Donald J. Trump': 0, 'Justin Trudeau': 1})
print(df['author'])

print(df[['status', 'cleaned_tweets']])

0      Donald J. Trump
1      Donald J. Trump
2      Donald J. Trump
3      Donald J. Trump
4      Donald J. Trump
            ...       
395     Justin Trudeau
396     Justin Trudeau
397     Justin Trudeau
398     Justin Trudeau
399     Justin Trudeau
Name: author, Length: 400, dtype: object
0      0
1      0
2      0
3      0
4      0
      ..
395    1
396    1
397    1
398    1
399    1
Name: author, Length: 400, dtype: int64
                                                status  \
0    I will be making a major statement from the @W...   
1    Just arrived at #ASEAN50 in the Philippines fo...   
2    After my tour of Asia, all Countries dealing w...   
3    Great to see @RandPaul looking well and back o...   
4    Excited to be heading home to see the House pa...   
..                                                 ...   
395  RT @googlecanada: Watch tmw: @JustinTrudeau di...   
396  Today in Ottawa, I met with the Modern Treaty ...   
397  Voici le sommaire de ma rencontre avec l

**3. Feature Extraction**
1. TF-IDF (Term Frequency-Inverse Document Frequency): rather than just counting word occurrences, it weights each word by how important it is to a document relative to how often it appears across the whole dataset. This is the representation used in this notebook.
2. BoW (Bag of Words): counts how many times each word appears in the text, ignoring the order of the words.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfVectorizer = TfidfVectorizer(stop_words='english')
X = tfidfVectorizer.fit_transform(df['cleaned_tweets'])

# (400, 2014) means 400 tweets (rows) and 2014 unique vocabulary terms (columns)
print(X.shape)

# Display the feature names (vocabulary)
print(tfidfVectorizer.get_feature_names_out())

# Convert the sparse TF-IDF matrix to a dense array (for illustration only)
dense_matrix = X.toarray()
print(dense_matrix)

# Wrap the dense array in a DataFrame for easier inspection
tfidf_df = pd.DataFrame(dense_matrix, columns=tfidfVectorizer.get_feature_names_out())
print(tfidf_df)

(400, 2014)
['000' '10' '100' ... 'ありがとうございます'
 'トランプ大統領による初の歴史的な日本訪問は間違いなく日米同盟の揺るぎない絆を世界に示すことができました'
 '本当にありがとうドナルドそしてアジア歴訪の大成功をお祈りしています']
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
     000   10  100  1000  100e  100th        11  1111  117  117000  ...  élue  \
0    0.0  0.0  0.0   0.0   0.0    0.0  0.000000   0.0  0.0     0.0  ...   0.0   
1    0.0  0.0  0.0   0.0   0.0    0.0  0.000000   0.0  0.0     0.0  ...   0.0   
2    0.0  0.0  0.0   0.0   0.0    0.0  0.000000   0.0  0.0     0.0  ...   0.0   
3    0.0  0.0  0.0   0.0   0.0    0.0  0.000000   0.0  0.0     0.0  ...   0.0   
4    0.0  0.0  0.0   0.0   0.0    0.0  0.000000   0.0  0.0     0.0  ...   0.0   
..   ...  ...  ...   ...   ...    ...       ...   ...  ...     ...  ...   ...   
395  0.0  0.0  0.0   0.0   0.0    0.0  0.000000   0.0  0.0     0.0  ...   0.0   
396  0.0  0.0  0.0   0.0   0.0    0.0  0.000000   0.0  0

**4. Train Test Split**
1. Split the dataset into training and testing sets.
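
As a small illustration of what the split does, the sketch below applies `train_test_split` to a made-up array of 10 samples. The `stratify` argument is an optional addition, not used in this notebook, that keeps the class ratio the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, 5 samples per class
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 5 + [1] * 5)

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable;
# stratify=y_demo (optional) preserves the 50/50 class balance in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(sorted(y_te.tolist()))
```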

In [None]:
from sklearn.model_selection import train_test_split
y = df['author']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**5. Model Selection and Training**
1. Choose a classification algorithm (Naive Bayes, Logistic Regression, or Support Vector Machine). Logistic Regression and Gaussian Naive Bayes are trained below.
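
One possible way to compare the three candidate algorithms is to wrap each in a scikit-learn pipeline together with the vectorizer, so the vocabulary is learned only from the text passed to `fit`. The sketch below is an assumption-laden illustration: the six toy sentences and their labels are invented, and `MultinomialNB` and `LinearSVC` are swapped in as the Naive Bayes and SVM variants.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus: label 0 and label 1 stand in for the two authors
texts = [
    "tax cuts and jobs for america",
    "great rally tonight in america",
    "america first trade deal",
    "canada welcomes new citizens",
    "meeting premiers across canada today",
    "proud of canada and its diversity",
]
labels = [0, 0, 0, 1, 1, 1]

candidates = {
    "logistic_regression": LogisticRegression(),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

# Fit one vectorizer + classifier pipeline per candidate
fitted = {}
for name, clf in candidates.items():
    fitted[name] = make_pipeline(TfidfVectorizer(), clf).fit(texts, labels)

# Each pipeline takes raw text directly at prediction time
preds = {name: pipe.predict(["jobs in america"])[0] for name, pipe in fitted.items()}
print(preds)
```

With a real dataset, each pipeline would be fitted on the training split only and scored on the held-out split.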

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
from sklearn.naive_bayes import GaussianNB

X_train_dense = X_train.toarray()
nb_model = GaussianNB()
nb_model.fit(X_train_dense, y_train)


**Error Encountered:**
1. TypeError: Sparse data was passed for X, but dense data is required. Use '.toarray()' to convert to a dense numpy array. [Solved] GaussianNB does not accept sparse input, so convert 'X_train' (and later 'X_test') to dense arrays using the .toarray() method.
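
The error and its fix can be reproduced on a small sparse matrix. In the sketch below the feature values are made up to stand in for TF-IDF weights; it also shows `MultinomialNB` as a possible alternative that accepts sparse input directly, which is not what this notebook uses.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Made-up non-negative features standing in for a sparse TF-IDF matrix
X_sparse = csr_matrix(np.array([
    [1.0, 0.0, 0.5],
    [0.8, 0.0, 0.0],
    [0.0, 1.0, 0.2],
    [0.0, 0.9, 0.0],
]))
y_demo = np.array([0, 0, 1, 1])

# GaussianNB rejects sparse input with a TypeError
try:
    GaussianNB().fit(X_sparse, y_demo)
    sparse_rejected = False
except TypeError:
    sparse_rejected = True

# Fix used in this notebook: convert to a dense array first
gnb = GaussianNB().fit(X_sparse.toarray(), y_demo)
gnb_pred = gnb.predict(X_sparse.toarray())

# Alternative (assumption): MultinomialNB works on the sparse matrix as-is
mnb = MultinomialNB().fit(X_sparse, y_demo)
mnb_pred = mnb.predict(X_sparse)

print(sparse_rejected, gnb_pred, mnb_pred)
```

Densifying is fine for 400 tweets, but for a much larger vocabulary or corpus the dense array can become very large, which is where a sparse-friendly model helps.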



**6. Model Evaluation**

In [None]:
from sklearn.metrics import accuracy_score

X_test_dense = X_test.toarray()

y_pred = model.predict(X_test)
nb_y_pred = nb_model.predict(X_test_dense)


print(f"Logistic Regression Accuracy Score: {accuracy_score(y_test, y_pred)}")
print(f"Naive Bayes Accuracy Score: {accuracy_score(y_test, nb_y_pred)}")

Logistic Regression Accuracy Score: 0.9
Naive Bayes Accuracy Score: 0.9
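
Accuracy alone does not show which author is being misclassified. A confusion matrix gives that breakdown; the sketch below uses made-up label arrays (not this notebook's predictions) to show how it is read.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels (0 = Trump, 1 = Trudeau) for illustration
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(acc)
```

Here one tweet of each author is misclassified: the off-diagonal entries of the matrix count the errors per class. In the notebook, the same calls would take `y_test` and `y_pred` (or `nb_y_pred`).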


In [None]:
import joblib

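# Save the trained classifier and the fitted vectorizer separately;
# both are needed to classify new tweets later.
joblib.dump(model, 'tweet_classifier.pkl')
joblib.dump(tfidfVectorizer, 'tfidf_vectorizer.pkl')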

['tfidf_vectorizer.pkl']