## Task - Twitter Topic Classification
### Subtopics: Natural Language Processing, Supervised Machine Learning, Classification, Pipeline

#### The goal is to build a machine learning model that classifies tweets into one of six topics:
0. Arts & Culture
1. Business & Entrepreneurs
2. Pop Culture
3. Daily Life
4. Sports & Gaming
5. Science & Technology

##### To fulfill this goal, we'll follow the CRISP-DM methodology, which stands for Cross-Industry Standard Process for Data Mining. Here's how we'll proceed:

1. Business Understanding: Understand the task and objectives, which involve classifying tweets into predefined categories.

2. Data Understanding: Examine the dataset provided, understand its structure, and explore some basic statistics.

3. Data Preparation: Preprocess the text data, handle missing values, tokenize the text, remove stopwords, and perform any other necessary transformations. Additionally, split the dataset into training and testing sets.

4. Modeling: Select suitable classification models (e.g., Naive Bayes, Logistic Regression, etc.) and train them using the training data.

5. Evaluation: Evaluate the trained models using appropriate evaluation metrics such as accuracy, precision, recall, and F1-score. Compare the performance of different models and pipeline choices.

6. Deployment: Deploy the best-performing model for practical use, if applicable.
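
As a rough illustration of steps 4 and 5 above, the sketch below trains two candidate classifiers inside identical pipelines and compares them on macro F1. The toy corpus, the `candidates` dictionary, and the choice of `TfidfVectorizer` are assumptions made for this example only; they are not the notebook's dataset or its final configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the tweets (not the real dataset).
texts = [
    "the team won the match tonight",
    "great goal in the final game",
    "our striker scored twice this season",
    "the coach praised the players after the game",
    "new phone chip doubles battery life",
    "researchers release a faster algorithm",
    "the software update fixes a security bug",
    "scientists test a new space telescope",
]
labels = [4, 4, 4, 4, 5, 5, 5, 5]  # 4 = Sports & Gaming, 5 = Science & Technology

# Stratify so that both classes appear in the training and the test split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=42
)

# Candidate models to compare; each one gets the same vectorizer.
candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, clf in candidates.items():
    pipe = Pipeline([("vect", TfidfVectorizer()), ("clf", clf)])
    pipe.fit(X_train, y_train)
    # Macro F1 weights every class equally, whatever its size.
    scores[name] = f1_score(y_test, pipe.predict(X_test), average="macro")

for name, score in scores.items():
    print(f"{name}: macro F1 = {score:.2f}")
```

On a corpus this small the scores carry no meaning; the point is only the shape of the comparison loop, which could be pointed at the real training and testing splits instead.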

### 1.  Business Understanding

The goal is to produce a pipeline capable of solving a multi-class classification task in the
form of a research project, in which competing approaches are compared and assessed against the task.
The work follows the CRISP-DM methodology (see Figure 2) covered in the module,
with the process evidenced in the report. Multiple approaches are compared, and final
evaluations and recommendations are made.

### 2. Data Understanding
We have a JSON file containing 6443 entries, each representing a tweet
from the social media platform Twitter, covering 6 topics. These tweets were gathered
between 2019 and 2021 and were human-labelled using Amazon’s Mechanical Turk.
Each tweet belongs to one of the following categories:

0. Arts & Culture
1. Business & Entrepreneurs
2. Pop Culture
3. Daily Life
4. Sports & Gaming
5. Science & Technology
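
Before modelling, it can help to check how evenly the six categories are represented. The sketch below counts tweets per category and converts the counts to shares. It runs on a hypothetical six-row frame (`df_demo`) that only mimics the `label` and `label_name` columns; the same two lines could be applied to the real dataframe once it is loaded.

```python
import pandas as pd

# Hypothetical miniature frame with the same column names as the dataset.
df_demo = pd.DataFrame({
    "label": [4, 4, 2, 2, 2, 5],
    "label_name": [
        "sports_&_gaming", "sports_&_gaming",
        "pop_culture", "pop_culture", "pop_culture",
        "science_&_technology",
    ],
})

# Number of tweets per category, largest first.
counts = df_demo["label_name"].value_counts()

# Share of the whole dataset that each category represents.
shares = (counts / len(df_demo)).round(2)

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

A strongly skewed distribution is a signal to report per-class and macro-averaged metrics alongside accuracy, since accuracy alone can hide poor performance on small classes.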

### 3. Data Preparation
This includes steps such as preprocessing the text data, handling missing values, tokenizing the text, removing stopwords, and performing any other necessary transformations. The dataset is also split into training and testing sets before modeling.
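
A minimal sketch of what such a cleaning step might look like is shown below. The function name `clean_tweet`, the regular expressions, and the use of scikit-learn's built-in English stop word list are all assumptions made for illustration; the `{@...@}` pattern is based on the mention markup visible in the sample tweets.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def clean_tweet(text: str) -> str:
    """Normalise one tweet: unwrap mentions, drop URLs, lowercase, remove stop words."""
    # Unwrap the {@Name@} mention markup, keeping the name itself.
    text = re.sub(r"\{@(.*?)@\}", r"\1", text)
    # Replace URLs with a space.
    text = re.sub(r"https?://\S+", " ", text)
    # Lowercase, then tokenize on runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Remove common English stop words.
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)


sample = "The {@Clinton LumberKings@} beat the visitors! https://example.com"
cleaned = clean_tweet(sample)
print(cleaned)
```

On the real data this could be applied with `df["text"].apply(clean_tweet)`. Whether stop word removal actually helps is an empirical question, so it is worth comparing results with and without it.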


In [5]:
ls

CETM47 Assignment 2.ipynb   CETM47-23_24-AS2.pdf
CETM47-23_24-AS2-Data.json  CETM47-23_24-AS2_CRG.pdf


In [8]:
import pandas as pd

# Read the dataset
df = pd.read_json("CETM47-23_24-AS2-Data.json")

# Display the first few rows of the dataframe
print(df.head(3))

# Display basic information about the dataframe
# (df.info() prints its summary itself and returns None, so print() also emits "None")
print(df.info())

# Display summary statistics of the numerical columns
print(df.describe())


                                                text       date  label  \
0  The {@Clinton LumberKings@} beat the {@Cedar R... 2019-09-08      4   
1  I would rather hear Eli Gold announce this Aub... 2019-09-08      4   
2  Someone take my phone away, I’m trying to not ... 2019-09-08      4   

                    id       label_name  
0  1170516324419866624  sports_&_gaming  
1  1170516440690176006  sports_&_gaming  
2  1170516543387709440  sports_&_gaming  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6443 entries, 0 to 6442
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   text        6443 non-null   object        
 1   date        6443 non-null   datetime64[ns]
 2   label       6443 non-null   int64         
 3   id          6443 non-null   int64         
 4   label_name  6443 non-null   object        
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 251.8+ KB
None

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

# Define a pipeline: bag-of-words vectorizer followed by a Naive Bayes classifier
text_clf = Pipeline([
    ('vect', CountVectorizer()),  # Convert text to vectors
    ('clf', MultinomialNB())      # Naive Bayes classifier
])

# Train the model
text_clf.fit(X_train, y_train)

# Predict on the testing set
y_pred = text_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
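# Order the class names by label index (0-5) so each report row gets the right name;
# df['label_name'].unique() returns names in order of first appearance, which mislabels the rows
target_names = df.drop_duplicates('label').sort_values('label')['label_name'].tolist()
print(classification_report(y_test, y_pred, target_names=target_names))


Accuracy: 0.7571761055081458
Classification Report:
                          precision    recall  f1-score   support

          arts_&_culture       0.00      0.00      0.00        25
business_&_entrepreneurs       0.83      0.08      0.15        60
             pop_culture       0.77      0.88      0.82       497
              daily_life       0.69      0.42      0.52       179
         sports_&_gaming       0.76      0.97      0.85       468
    science_&_technology       0.75      0.10      0.18        60

                accuracy                           0.76      1289
               macro avg       0.63      0.41      0.42      1289
            weighted avg       0.74      0.76      0.71      1289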



  _warn_prf(average, modifier, msg_start, len(result))
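
To make the gap between the `macro avg` and `weighted avg` rows in the report concrete, the sketch below recomputes both from the per-class rows. The numbers are copied from the report in row order; because they are already rounded to two decimals, the recomputed averages only match the report approximately. The helper names `macro` and `weighted` are introduced here for illustration.

```python
# Per-class rows copied from the classification report, in row order:
# (precision, recall, f1-score, support)
rows = [
    (0.00, 0.00, 0.00, 25),
    (0.83, 0.08, 0.15, 60),
    (0.77, 0.88, 0.82, 497),
    (0.69, 0.42, 0.52, 179),
    (0.76, 0.97, 0.85, 468),
    (0.75, 0.10, 0.18, 60),
]
total_support = sum(row[3] for row in rows)


def macro(col: int) -> float:
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[col] for row in rows) / len(rows)


def weighted(col: int) -> float:
    """Support-weighted mean: large classes dominate the result."""
    return sum(row[col] * row[3] for row in rows) / total_support


for col, metric in enumerate(["precision", "recall", "f1-score"]):
    print(f"{metric:>9}: macro = {macro(col):.2f}, weighted = {weighted(col):.2f}")
```

The two largest classes hold 965 of the 1289 test tweets, so they pull the weighted averages up, while the macro averages expose the weak recall on the four smaller classes.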
