## Task - Twitter Topic Classification
### Subtopics: Natural Language Processing, Supervised Machine Learning, Classification, Pipeline

#### The goal is to build a machine learning model that classifies each tweet into one of the following topics:
0. Arts & Culture
1. Business & Entrepreneurs
2. Pop Culture
3. Daily Life
4. Sports & Gaming
5. Science & Technology

##### To fulfill this goal, we'll follow the CRISP-DM methodology, which stands for Cross-Industry Standard Process for Data Mining. Here's how we'll proceed:

1. Business Understanding: Understand the task and objectives, which involve classifying tweets into predefined categories.

2. Data Understanding: Examine the dataset provided, understand its structure, and explore some basic statistics.

3. Data Preparation: Preprocess the text data, handle missing values, tokenize the text, remove stopwords, and perform any other necessary transformations. Additionally, split the dataset into training and testing sets.

4. Modeling: Select suitable classification models (e.g., Naive Bayes, Logistic Regression) and train them using the training data.

5. Evaluation: Evaluate the trained models using appropriate evaluation metrics such as accuracy, precision, recall, and F1-score. Compare the performance of different models and pipeline choices.

6. Deployment: Deploy the best-performing model for practical use, if applicable.

### 1. Business Understanding

The goal is to produce a pipeline capable of solving a multi-class classification task in the
form of a research project, in which candidate approaches are compared and assessed against the task.
The work follows the CRISP-DM methodology (see Figure 2) covered in the module,
with the process evidenced in the report. Multiple approaches are compared, and final
evaluations and recommendations are made.

### 2. Data Understanding
We have a JSON file containing 6443 entries which represent
tweets from the social media platform Twitter, covering 6 topics. These tweets were gathered
between 2019 and 2021 and were human-labelled using Amazon’s Mechanical Turk.
Each tweet is labelled with one of the following topics:

0. Arts & Culture
1. Business & Entrepreneurs
2. Pop Culture
3. Daily Life
4. Sports & Gaming
5. Science & Technology

### 3. Data Preparation
This includes steps such as preprocessing the text data, handling missing values, tokenizing the text, removing stopwords, and performing any other necessary transformations. The dataset is then split into training and testing sets before modeling.
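
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual preprocessing: the `clean_tweet` helper, the regular expressions (including the pattern for the `{@Name@}` mention markup visible in the data) and the tiny stopword set are all assumptions made for the example. A fuller stopword list, for instance from NLTK or scikit-learn, would normally replace the hand-written one.

```python
import re

# Hypothetical patterns, chosen for illustration only
URL_RE = re.compile(r"https?://\S+")      # bare URLs
MENTION_RE = re.compile(r"\{@(.*?)@\}")   # {@Name@} mention markup seen in the data
TOKEN_RE = re.compile(r"[a-z0-9']+")      # simple word tokens

# Deliberately tiny, illustrative stopword set (an assumption, not a standard list)
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "on", "is", "i", "my", "this", "not"}

def clean_tweet(text):
    """Strip URLs, unwrap mentions, lowercase, tokenize and drop stopwords."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(r"\1", text)        # keep the name, drop the braces
    tokens = TOKEN_RE.findall(text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]

# Made-up example tweet in the style of the dataset
sample = "The {@Clinton LumberKings@} beat the rivals in the final! https://example.com/game"
cleaned = clean_tweet(sample)
print(cleaned)
```

Keeping the cleaning in a single function makes it easy to apply with `df['text'].apply(clean_tweet)` and to reuse unchanged on new tweets at prediction time.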


In [15]:
# import all needed libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils import resample
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.utils import to_categorical

ModuleNotFoundError: No module named 'tensorflow'

In [None]:
%pip install tensorflow

Collecting tensorflow
  Downloading tensorflow-2.16.1-cp310-cp310-macosx_10_15_x86_64.whl (259.5 MB)

In [8]:

# Read the dataset
df = pd.read_json("CETM47-23_24-AS2-Data.json")

# Display the first few rows of the dataframe
print(df.head(3))

# Display basic information about the dataframe
print(df.info())

# Display summary statistics of the numerical columns
print(df.describe())


                                                text       date  label  \
0  The {@Clinton LumberKings@} beat the {@Cedar R... 2019-09-08      4   
1  I would rather hear Eli Gold announce this Aub... 2019-09-08      4   
2  Someone take my phone away, I’m trying to not ... 2019-09-08      4   

                    id       label_name  
0  1170516324419866624  sports_&_gaming  
1  1170516440690176006  sports_&_gaming  
2  1170516543387709440  sports_&_gaming  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6443 entries, 0 to 6442
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   text        6443 non-null   object        
 1   date        6443 non-null   datetime64[ns]
 2   label       6443 non-null   int64         
 3   id          6443 non-null   int64         
 4   label_name  6443 non-null   object        
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 251.8+ KB
None

In [10]:
# Check for class imbalance in the dataset
class_counts = df['label'].value_counts()
print("Class Counts:")
print(class_counts)


Class Counts:
2    2512
4    2291
3     883
5     326
1     287
0     144
Name: label, dtype: int64
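
To put a number on the imbalance, the counts printed above can be turned into class shares and inverse-frequency weights. The sketch below hard-codes those counts and uses the formula `n_samples / (n_classes * count)`, which is the heuristic behind scikit-learn's `class_weight='balanced'` option. The variable names are illustrative, and whether to weight classes at all is a modelling choice the notebook has not made.

```python
# Class counts copied from the value_counts() output above
counts = {2: 2512, 4: 2291, 3: 883, 5: 326, 1: 287, 0: 144}

total = sum(counts.values())
n_classes = len(counts)

# Share of the dataset taken by each class
shares = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights: rarer classes get larger weights
weights = {label: total / (n_classes * n) for label, n in counts.items()}

# Ratio between the largest and the smallest class
imbalance_ratio = max(counts.values()) / min(counts.values())

for label in sorted(counts):
    print(f"class {label}: share={shares[label]:.3f}, weight={weights[label]:.2f}")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

With a majority class more than ten times larger than the smallest one, plain accuracy can look good while the rare classes are mostly misclassified, so macro-averaged precision, recall and F1 are worth reporting alongside it.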


In [11]:
# # Upsample minority classes if necessary
# def upsample_minority(df):
#     majority_class = df['label'].value_counts().idxmax()
#     minority_classes = df['label'].value_counts().drop(index=majority_class)
#     minority_upsampled = [resample(df[df['label'] == cls], replace=True, n_samples=class_counts.max(), random_state=42) for cls in minority_classes.index]
#     return pd.concat([df[df['label'] == majority_class]] + minority_upsampled)

# df_balanced = upsample_minority(df)
# Upsampling is currently disabled, so the original (imbalanced) data is used as-is
df_balanced = df
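
If upsampling is switched back on, one point to watch is that resampling the whole dataframe before splitting can place copies of the same tweet in both the training and the test set. The sketch below shows one way to avoid that by splitting first and upsampling only the training portion. It runs on a small made-up dataframe, and the helper's signature is an assumption for this example rather than the notebook's original function.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy stand-in for the tweet dataframe (hypothetical data, not the real file)
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(40)],
    "label": [0] * 30 + [1] * 10,
})

# Split first, so the test set only ever contains original, unduplicated rows
train_df, test_df = train_test_split(
    toy, test_size=0.25, stratify=toy["label"], random_state=42
)

def upsample_minority(frame, label_col="label", random_state=42):
    """Resample every class (with replacement) up to the size of the largest class."""
    target = frame[label_col].value_counts().max()
    parts = []
    for _, group in frame.groupby(label_col):
        if len(group) < target:
            group = resample(group, replace=True, n_samples=target,
                             random_state=random_state)
        parts.append(group)
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state)

train_balanced = upsample_minority(train_df)
print(train_balanced["label"].value_counts())
print(test_df["label"].value_counts())
```

The test split keeps the original class distribution, which is what the evaluation metrics should reflect.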

In [12]:
# Tokenize the text data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_balanced['text'])
X = tokenizer.texts_to_sequences(df_balanced['text'])
X = pad_sequences(X)

# Convert labels to categorical
y = to_categorical(df_balanced['label'])

# Split the data into training and testing sets, stratified by label so the rare classes appear in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=df_balanced['label'], random_state=42)

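# Logistic Regression
# Note: the features here are padded token-ID sequences, whose integer values carry no
# numeric meaning. A bag-of-words representation (e.g. the CountVectorizer imported above)
# is the usual input for this model and for the Naive Bayes model below.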
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train.argmax(axis=1))
y_pred_lr = lr_model.predict(X_test)
accuracy_lr = accuracy_score(y_test.argmax(axis=1), y_pred_lr)
print("\nLogistic Regression Accuracy:", accuracy_lr)
print("Logistic Regression Classification Report:")
print(classification_report(y_test.argmax(axis=1), y_pred_lr))

# Naive Bayes
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train.argmax(axis=1))
y_pred_nb = nb_model.predict(X_test)
accuracy_nb = accuracy_score(y_test.argmax(axis=1), y_pred_nb)
print("\nNaive Bayes Accuracy:", accuracy_nb)
print("Naive Bayes Classification Report:")
print(classification_report(y_test.argmax(axis=1), y_pred_nb))

# Deep Learning Model
embedding_dim = 100
max_length = X.shape[1]

model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=embedding_dim, input_length=max_length))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(len(class_counts), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
batch_size = 64
epochs = 5
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_test, y_test), verbose=2)

# Evaluate the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("\nDeep Learning Model Accuracy:", scores[1])

# Plot model accuracy
import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

NameError: name 'Tokenizer' is not defined
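
For the two classical models, a common alternative to padded token-ID sequences is a bag-of-words representation wrapped in a scikit-learn pipeline. The sketch below is illustrative only: the toy texts and their labels are invented, and the pipeline layout is an assumption, not the notebook's final design. It also does not need Keras, so it runs even while the TensorFlow install above is incomplete.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data using two of the label ids (4 = Sports & Gaming, 5 = Science & Technology)
texts = [
    "the team won the match last night",
    "great goal in the final game",
    "our coach praised the players",
    "the season opener was a close game",
    "new phone software update released",
    "researchers publish a study on batteries",
    "the startup built a faster chip",
    "a telescope captured a distant galaxy",
]
labels = [4, 4, 4, 4, 5, 5, 5, 5]

# Each pipeline turns raw text into word counts, then fits a classifier on those counts
models = {
    "naive_bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    "logistic_regression": make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000)),
}

new_tweets = ["the players won the game", "software update for the new chip"]
predictions = {}
for name, model in models.items():
    model.fit(texts, labels)
    predictions[name] = model.predict(new_tweets)
    print(name, predictions[name])
```

Because the vectorizer sits inside the pipeline, it is fitted on the training texts only, and the same fitted object can be passed directly to `classification_report` or to cross-validation without leaking test vocabulary.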