<a href="https://colab.research.google.com/github/misaki-zz/SML_Teaching/blob/main/Zin_Zin_Aung_News_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**

In this project, I am going to classify news articles to find out whether an article is positive or not. This matters because news can affect how people think and the decisions they make every day. Weather forecasting is one example: people decide whether or not to prepare for a disaster depending on the news they see.

# **Size of the Dataset**

The dataset comprises a total of 4,846 news articles, with each article having two associated attributes:

**Class**: This attribute represents the sentiment or category of each news article, with 4,846 non-null entries.

**News**: This attribute contains the text content of the news articles, with 4,846 non-null entries.

For model training and evaluation, the dataset was split into two subsets:

**The training set** consists of **60% of the data**, containing 2,907 news articles.
**The testing set** consists of **40% of the data**, containing 1,939 news articles.

# **Steps in Data Preprocessing and Model Evaluation**

1. **Getting the Data**: Load the dataset in Colab using the pandas library.

2. **Turning Words into Numbers**: Next, I turn the words in the articles into numeric features using TF-IDF.

3. **Changing the Class from Text to Number**: Positive articles become 1, negative ones become -1, and neutral ones become 0, because machine learning models work with numbers.

4. **Choosing Classifiers**: I compare KNN (K-Nearest Neighbours), NB (Naive Bayes), Logistic Regression, SVCLinear, SVCPoly, and SVCRbf to find out which classifier gives the best recall. Recall is the priority for this data because the errors that matter most are false negatives, that is, positive articles the model misses.

5. **Fixing Imbalanced Data**: I set class weights so that the model focuses more on positive news and the classes are better balanced. I also use SMOTE to balance the data, since 59% of the articles are neutral.

6. **Why We Use SMOTE**: SMOTE creates synthetic examples of the under-represented classes, so the model can still learn well when one sentiment dominates the data. This helps my model make better predictions for all kinds of news in the dataset.
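
To make steps 5 and 6 more concrete, here is a minimal sketch of the core idea behind SMOTE in plain Python: a synthetic sample is placed at a random point on the line segment between a minority-class sample and a nearby minority-class neighbour. The helper names `nearest_neighbour` and `smote_like_sample` and the toy 2-D points are hypothetical, and the sketch uses only the single nearest neighbour as a simplification. The notebook itself relies on `imblearn.over_sampling.SMOTE`, which handles the neighbour search and the sparse TF-IDF matrix.

```python
import random

def nearest_neighbour(point, others):
    """Return the element of `others` closest to `point` (squared Euclidean distance)."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))

def smote_like_sample(point, neighbour, rng):
    """Return a synthetic point on the segment between two minority-class samples."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]

rng = random.Random(42)

# Toy minority-class samples (hypothetical, not taken from the news dataset).
minority = [[1.0, 1.0], [1.5, 2.0], [5.0, 5.0]]

base = minority[0]
neighbour = nearest_neighbour(base, [m for m in minority if m is not base])
synthetic = smote_like_sample(base, neighbour, rng)

print("neighbour:", neighbour)
print("synthetic:", synthetic)
```

Because the synthetic point is an interpolation rather than a copy, the minority class gains new, plausible feature vectors instead of exact duplicates.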

# **Model Evaluation**

Finally, I trained different models on the data to determine whether an article is positive, negative, or neutral. I chose the linear SVM to identify positive news because it gave the best result. The results are shown in the table below:

Method     | Training Precision | Training Recall | Testing Precision | Testing Recall
-----------|--------------------|-----------------|-------------------|---------------
KNN        | 0.34               | 0.95            | 0.31              | 0.92
NB         | 0.90               | 0.93            | 0.43              | 0.46
Logistic   | 0.94               | 0.71            | 0.76              | 0.49
**SVCL**   | **0.91**           | **1.00**        | **0.66**          | **0.70**
SVCPoly    | 0.99               | 1.00            | 0.79              | 0.19
SVCRbf     | 0.98               | 0.99            | 0.75              | 0.54

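Because the linear SVM (SVCL) shows the best result, I choose it as my best model: detecting positive news should prioritize recall over precision. On the test set it has the highest recall (0.70) of any model except KNN, whose test precision is only 0.31.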
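
One way to put a number on "recall matters more than precision" is the F-beta score with beta = 2, which weights recall more heavily. The sketch below applies it to the test-set scores from the table. Using F2 as the selection criterion is an assumption made for illustration, not the method used in this notebook, and `f_beta` is a hypothetical helper. The SVCL row matches the class 1 scores in the classification reports, so the table values are presumably positive-class scores.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Test-set (precision, recall) pairs copied from the table above.
test_scores = {
    "KNN": (0.31, 0.92),
    "NB": (0.43, 0.46),
    "Logistic": (0.76, 0.49),
    "SVCL": (0.66, 0.70),
    "SVCPoly": (0.79, 0.19),
    "SVCRbf": (0.75, 0.54),
}

f2 = {name: f_beta(p, r) for name, (p, r) in test_scores.items()}
best = max(f2, key=f2.get)

# Print the models from highest to lowest F2.
for name, score in sorted(f2.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} F2 = {score:.3f}")
print("best by F2:", best)
```

Under this criterion SVCL ranks first and KNN second: KNN's high recall is offset by its low precision.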


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import scipy.sparse as sp
import numpy as np

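# Load the dataset (the CSV has no header row: column 0 is the label, column 1 is the text)
df = pd.read_csv('/content/news_data.csv',  encoding='latin-1', header=None)
df.columns = ['class', 'news']

# Map the text labels to integers
label_mapping = {'positive': 1, 'negative': -1, 'neutral': 0}
df['class'] = df['class'].map(label_mapping)

# Convert each article into a TF-IDF feature vector
# (note: the vectorizer is fit on all articles, before the train/test split)
vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(df['news'])
y = df['class']

print(X.shape)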

(4846, 10070)
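
To show what sits behind a matrix of this shape, here is a small hand-rolled TF-IDF sketch in plain Python. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 row normalization, which to my understanding are the `TfidfVectorizer` defaults. The toy corpus, the whitespace tokenizer, and the helper `tfidf_vector` are hypothetical simplifications.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the news dataset).
docs = [
    "profit rose sharply",
    "profit fell",
    "company plans new plant",
]

# Simplified tokenization: lowercase and split on whitespace.
tokenized = [d.lower().split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector of one document over `vocab`."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

X_toy = [tfidf_vector(doc) for doc in tokenized]

# One row per document, one column per vocabulary term.
print((len(X_toy), len(vocab)))
```

The layout is the same as in the notebook: one row per article and one column per distinct term. A term that appears in many documents, such as "profit" here, gets a lower IDF than a term that appears in only one.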


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4846 entries, 0 to 4845
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   class   4846 non-null   int64 
 1   news    4846 non-null   object
dtypes: int64(1), object(1)
memory usage: 75.8+ KB


In [3]:
df.head()

   class                                               news
0      0  According to Gran , the company has no plans t...
1      0  Technopolis plans to develop in stages an area...
2     -1  The international electronic industry company ...
3      1  With the new production plant the company woul...
4      1  According to the company 's updated strategy f...


In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                    test_size = 0.40,
                                    random_state=42)
X_test.shape

(1939, 10070)
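
The test set has 1,939 rows rather than a round 40% because 0.40 × 4,846 = 1,938.4 is not a whole number. The sketch below reproduces the split sizes under the assumption that `train_test_split` rounds a fractional test size up and gives the remaining rows to the training set, which is consistent with the shapes seen in this notebook.

```python
import math

n_samples = 4846   # total number of articles
test_size = 0.40   # fraction passed to train_test_split

# Assumed rounding rule: the test count is rounded up, the rest is training data.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```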

In [5]:
class_distribution = pd.Series(y).value_counts()
class_ratios = class_distribution / len(y)
print(class_ratios)

 0    0.594098
 1    0.281263
-1    0.124639
Name: class, dtype: float64


In [6]:
class_distribution = pd.Series(y).value_counts()
summary_stats = class_distribution.describe()
print(summary_stats)

count       3.000000
mean     1615.333333
std      1158.300623
min       604.000000
25%       983.500000
50%      1363.000000
75%      2121.000000
max      2879.000000
Name: class, dtype: float64
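
For comparison with the class weights chosen later in the notebook, the sketch below computes what the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, would give for these counts. The per-class counts are read off the two outputs above (minimum, median, and maximum of three values, matched to classes through the ratios). Using the full dataset rather than the training split is a simplification for illustration.

```python
# Class counts read from the outputs above: 0 = neutral, 1 = positive, -1 = negative.
counts = {0: 2879, 1: 1363, -1: 604}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
balanced = {label: n_samples / (n_classes * c) for label, c in counts.items()}

for label, weight in balanced.items():
    print(f"class {label:>2}: count={counts[label]:4d}  balanced weight={weight:.2f}")
```

The heuristic would give the largest weight to the negative class, because it is the rarest. The manual weights used in this notebook, `{-1: 1.25, 0: 0.75, 1: 2.25}`, instead give the largest weight to the positive class, in line with the goal of catching positive news.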


In [13]:
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score

#--------------------------------------------------
## ------------ SVM Classifier ------------------##
#--------------------------------------------------

from sklearn.svm import SVC

## Linear Kernel  ---------------
# The class weights put extra emphasis on the positive class (1)
steps = [#('scaler', StandardScaler()),
         ('svc', SVC(kernel = 'linear',
                     class_weight={-1: 1.25, 0: 0.75, 1: 2.25}))]

svcL_pipeline = Pipeline(steps)

# Apply SMOTE to the training data only, to balance the class distribution
smote = SMOTE(sampling_strategy= 'auto', random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Train the SVM model once, on the resampled data
svcL_pipeline.fit(X_train_resampled, y_train_resampled)

In [14]:
# Evaluate on the original (non-resampled) training set
ypred_train = svcL_pipeline.predict(X_train)
mat_clf = confusion_matrix(y_train, ypred_train)      # rows: true class, columns: predicted class
report_clf = classification_report(y_train, ypred_train)

print(mat_clf)
print(report_clf)

[[ 373    0    0]
 [  24 1631   75]
 [   3    0  801]]
              precision    recall  f1-score   support

          -1       0.93      1.00      0.97       373
           0       1.00      0.94      0.97      1730
           1       0.91      1.00      0.95       804

    accuracy                           0.96      2907
   macro avg       0.95      0.98      0.96      2907
weighted avg       0.97      0.96      0.97      2907



In [15]:
# Evaluate on the held-out test set
ypred_test = svcL_pipeline.predict(X_test)
mat_clf = confusion_matrix(y_test, ypred_test)        # rows: true class, columns: predicted class
report_clf = classification_report(y_test, ypred_test)

print(mat_clf)
print(report_clf)

[[139  56  36]
 [ 48 931 170]
 [ 29 137 393]]
              precision    recall  f1-score   support

          -1       0.64      0.60      0.62       231
           0       0.83      0.81      0.82      1149
           1       0.66      0.70      0.68       559

    accuracy                           0.75      1939
   macro avg       0.71      0.71      0.71      1939
weighted avg       0.76      0.75      0.76      1939
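
The per-class scores in this report can be recomputed directly from the confusion matrix, which helps in reading both outputs. The sketch below is plain Python with the test-set matrix copied from above; rows are true classes and columns are predicted classes, in the order -1, 0, 1. The helper `precision_recall` is hypothetical.

```python
labels = [-1, 0, 1]

# Test-set confusion matrix copied from the output above.
cm = [
    [139, 56, 36],
    [48, 931, 170],
    [29, 137, 393],
]

def precision_recall(matrix, i):
    """Return (precision, recall) for the class at index i."""
    true_positives = matrix[i][i]
    predicted_as_class = sum(row[i] for row in matrix)  # column sum
    actually_class = sum(matrix[i])                     # row sum
    return true_positives / predicted_as_class, true_positives / actually_class

metrics = {label: precision_recall(cm, i) for i, label in enumerate(labels)}
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

# False negatives for the positive class: positive articles predicted as -1 or 0.
positive_false_negatives = sum(cm[2]) - cm[2][2]

for label, (p, r) in metrics.items():
    print(f"class {label:>2}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
print("missed positive articles:", positive_false_negatives)
```

For the positive class this gives precision 393/599 and recall 393/559, which round to the 0.66 and 0.70 shown in the report. The 166 missed positive articles are the false negatives that this project aims to reduce.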

