## import needed libraries

In [1]:
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
import os

## read data and concatenate multiple csv files

In [2]:

# Assuming all CSV files are in the same directory
data_directory = "C:/Users/LENOVO/Downloads/archive/stories/"
csv_files = [file for file in os.listdir(data_directory) if file.endswith('.csv')]

# Initialize an empty list to store DataFrames from each CSV
dfs = []

for file in csv_files:
    file_path = os.path.join(data_directory, file)
    df = pd.read_csv(file_path)
    dfs.append(df)

# Concatenate all DataFrames into a single DataFrame
story_data = pd.concat(dfs, ignore_index=True)


In [3]:
story_data.head()

Unnamed: 0.1,Unnamed: 0,id,title,date,author,story,topic
0,0,f06aa998054e11eba66e646e69d991ea,"""بيت الشعر"" يسائل وزير الثقافة عن كوابيس سوداء",الجمعة 02 أكتوبر 2020 - 23:19,هسبريس من الرباط,"وجه ""بيت الشعر في المغرب"" إلى وزير الثقافة وال...",art-et-culture
1,1,f1cf1b9c054e11ebb718646e69d991ea,"مهرجان ""سينما المؤلّف"" يستحضر روح ثريا جبران",الجمعة 02 أكتوبر 2020 - 07:26,هسبريس من الرباط,في ظلّ استمرار حالة الطوارئ الصحية المرتبطة بج...,art-et-culture
2,2,f2d282a4054e11eb800f646e69d991ea,"فيلم ""بدون عنف"" لهشام العسري ..""كعب الحذاء ووا...",الجمعة 02 أكتوبر 2020 - 04:00,عفيفة الحسينات*,تشير مشاهدة فيلم قصير ضمن الثلاثية الأخيرة للم...,art-et-culture
3,3,f3f46cac054e11eba403646e69d991ea,"""تنين ووهان"" .. مريم أيت أحمد توقِّع أولى ""روا...",الجمعة 02 أكتوبر 2020 - 02:00,حاورَها: وائل بورشاشن,"مِن قَلب أيّام ""الحَجْر""، رأتِ النّورَ الفصول ...",art-et-culture
4,4,f50f0476054e11eba31b646e69d991ea,"مسكر يتخلّى عن دعم ""الوزارة"" بسبب ""الجمهور""",الخميس 01 أكتوبر 2020 - 19:40,هسبريس من الرباط,أعلن الفنان المغربيّ سعيد مسكر تخليه عن مبلغ ا...,art-et-culture


In [8]:
story_data.columns

Index(['Unnamed: 0', 'id', 'title', 'date', 'author', 'story', 'topic'], dtype='object')

## feature engineering 

In [23]:
# Combine 'story' and 'title' columns to create a new feature
story_data['title_story'] = story_data['title'] + '@@' + story_data['story']
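
If any `title` or `story` value is missing, plain string concatenation leaves a missing value in `title_story` for that row, which would later fail at the vectorization step. The following is a minimal sketch of a NaN-safe variant, assuming empty text is an acceptable substitute; `toy` is a hypothetical stand-in frame, not the original data.

```python
import pandas as pd

# Hypothetical toy frame standing in for story_data; the second row has no story.
toy = pd.DataFrame({"title": ["Title A", "Title B"], "story": ["Body A", None]})

# Plain concatenation propagates the missing value to the combined column.
naive = toy["title"] + "@@" + toy["story"]

# Filling missing text with an empty string keeps every row usable.
safe = toy["title"].fillna("") + "@@" + toy["story"].fillna("")
```

Applied to the real data, the same `fillna("")` calls would go on `story_data['title']` and `story_data['story']`.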

## split the data

In [25]:
# Split the data into features and target
X = story_data['title_story']
y = story_data['topic']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
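
`train_test_split` shuffles at random, so the per-class support in the test set can drift from the overall class proportions. A stratified split, sketched below with the `stratify` parameter, keeps those proportions the same in both subsets. The `texts` and `labels` values are hypothetical toy data.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 documents for each of two topics.
texts = [f"doc {i}" for i in range(20)]
labels = ["sport"] * 10 + ["economie"] * 10

# stratify=labels preserves the 50/50 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
test_counts = Counter(y_te)
```

For the notebook's data this would mean adding `stratify=y` to the existing call.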

## build the model

In [26]:

def train_support_vector_classifier(X_train, y_train, kernel_type='linear', min_df=1, max_df=1.0, max_features=None):

    # Create a TF-IDF vectorizer
    vectorizer = TfidfVectorizer(min_df=min_df, max_df=max_df, max_features=max_features)

    # Fit the vectorizer on the training data only, then transform it
    X_train_vec = vectorizer.fit_transform(X_train)

    # Train a support vector classifier
    classifier = SVC(C=1, kernel=kernel_type, random_state=42)

    classifier.fit(X_train_vec, y_train)

    # Return only what this function creates; the test split is not touched here
    return classifier, vectorizer

def evaluate_classifier(classifier, vectorizer, X_test, y_test):
    # Vectorize the test data with the vocabulary learned from the training data
    X_test_vec = vectorizer.transform(X_test)

    # Make predictions on the test data
    y_pred = classifier.predict(X_test_vec)

    # Calculate and print the performance metrics
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)

    return accuracy



In [34]:
# Adjust these parameters as needed
kernel_type = 'linear'
min_df = 1
max_df = 1.0
max_features = None

# Train the classifier
classifier, vectorizer = train_support_vector_classifier(X_train, y_train, kernel_type=kernel_type, min_df=min_df, max_df=max_df, max_features=max_features)

# Evaluate the classifier
evaluate_classifier(classifier, vectorizer, X_test, y_test)

Classification Report:
                    precision    recall  f1-score   support

    art-et-culture       0.90      0.90      0.90       206
          economie       0.83      0.86      0.84       202
      faits-divers       0.95      0.96      0.95       184
marocains-du-monde       0.89      0.90      0.89       214
            medias       0.98      0.91      0.94       197
           orbites       0.72      0.75      0.74       204
         politique       0.85      0.84      0.84       210
           regions       0.79      0.83      0.81       178
           societe       0.76      0.74      0.75       198
             sport       0.99      0.96      0.98       194
         tamazight       0.97      0.95      0.96       213

          accuracy                           0.87      2200
         macro avg       0.87      0.87      0.87      2200
      weighted avg       0.88      0.87      0.87      2200

Accuracy: 0.8731818181818182


0.8731818181818182

Precision: Precision is the ratio of true positive predictions to the total number of positive predictions the classifier makes for a given class. For example, for "art-et-culture", precision measures how many of the stories predicted as "art-et-culture" actually belong to that class. Higher precision means fewer false positives among the predictions for that class.

Recall: Recall, also known as sensitivity or true positive rate, is the ratio of true positive predictions to the total number of actual instances of a class in the dataset. For "art-et-culture", recall measures how many of the actual "art-et-culture" stories the classifier correctly identifies. Higher recall indicates a lower false negative rate for that class.

F1-score: The F1-score is the harmonic mean of precision and recall and provides a balance between the two metrics. It is useful when there is an uneven class distribution, as it considers both false positives and false negatives. It is a single metric that summarizes the model's performance for a specific class, with a higher value indicating better performance.

Support: Support is the number of instances (stories) belonging to each class in the test set. It provides context on the distribution of classes and indicates the relative size of each class.

Accuracy: Accuracy is the number of correct predictions divided by the total number of instances in the test set. Here it is about 0.873, which corresponds to 1,921 of the 2,200 test stories. It measures the overall correctness of the model's predictions and is a common metric for multi-class classification tasks.
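
The definitions above can be checked by hand on a small example. The sketch below computes the metrics for one class directly from true and predicted labels; `per_class_metrics`, `accuracy`, and the toy labels are illustrative names and data, not part of the notebook.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn  # support = actual instances of the class


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels for six stories.
y_true_toy = ["sport", "sport", "sport", "economie", "economie", "politique"]
y_pred_toy = ["sport", "sport", "economie", "economie", "politique", "politique"]

sport_metrics = per_class_metrics(y_true_toy, y_pred_toy, "sport")
overall_accuracy = accuracy(y_true_toy, y_pred_toy)
```

For "sport" there are 2 true positives, 0 false positives, and 1 false negative, so precision is 1.0, recall is 2/3, and the F1-score is 0.8; four of the six predictions are correct overall.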

# -------------------------------------------------------------------------------------------------------------

# Some enhancements that may achieve better results


1. Text Preprocessing:

   a. Stemming: Stemming is the process of reducing words to a root form by stripping suffixes and prefixes. For example, "running" and "runs" both reduce to "run." Stemming can help reduce the dimensionality of the feature space and speed up model training, but the resulting stems are not always real words.

   b. Lemmatization: Lemmatization is similar to stemming but reduces words to their base or dictionary form (lemma), so the result is always a valid word. For example, a stemmer such as Porter's reduces "studies" to "studi," whereas a lemmatizer returns "study." Lemmatization may be more suitable when preserving the semantic meaning of words is important.

   c. Removing Stopwords: Stopwords are very common words (e.g., "the," "is," "and") that carry little semantic value. Removing them can reduce noise in the data and make the model more efficient by focusing on more informative words. Because the stories here are in Arabic, the stopword list would need to be an Arabic one.

   By experimenting with different combinations of stemming, lemmatization, and stopword removal, you can assess how each preprocessing technique affects the model's accuracy, precision, recall, and other performance metrics.
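
As a rough illustration of how such preprocessing could be wired in, the sketch below tokenizes text, drops stopwords, and applies a toy stemming rule. Both `ARABIC_STOPWORDS` and `light_stem` are hypothetical and deliberately tiny: a real experiment would use a full Arabic stopword list and an established Arabic stemmer or lemmatizer.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
ARABIC_STOPWORDS = {"في", "من", "على", "إلى", "عن"}


def light_stem(token):
    """Toy rule, not a real stemmer: strip the definite article from longer tokens."""
    if token.startswith("ال") and len(token) > 4:
        return token[2:]
    return token


def preprocess_tokens(text):
    """Split text into word tokens, drop stopwords, and stem what remains."""
    tokens = re.findall(r"\w+", text)
    return [light_stem(t) for t in tokens if t not in ARABIC_STOPWORDS]


sample_tokens = preprocess_tokens("الفنان في المغرب")
```

To compare against the baseline, the function could be passed to the vectorizer as `TfidfVectorizer(tokenizer=preprocess_tokens, token_pattern=None)` and the classifier retrained and re-evaluated as before.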

2. Domain-Specific Embeddings:
While pre-trained word embeddings like Word2Vec, GloVe, and FastText capture general semantic information across various domains, they may not fully capture the specific context and semantics of your domain. In a specialized domain such as legal, medical, or technical text, or in a particular language variety such as the Arabic news stories used here, the vocabulary and usage may differ significantly from the general-purpose text those embeddings were trained on.

To address this, consider using domain-specific word embeddings. These embeddings are trained on a large corpus of text data specific to your domain, capturing context and semantic information that is relevant to your domain. By leveraging domain-specific embeddings, the model may better understand and represent the nuances and domain-specific terms, potentially leading to improved performance.
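
One simple way to use word embeddings as classifier features is to average the vectors of the words in each document. The sketch below assumes the embeddings are available as a plain word-to-vector dictionary; `toy_embeddings` and its 3-dimensional vectors are hypothetical placeholders for vectors trained on in-domain text.

```python
import numpy as np

# Hypothetical 3-dimensional vectors; real embeddings would be trained on in-domain text.
toy_embeddings = {
    "match": np.array([1.0, 0.0, 0.0]),
    "goal": np.array([0.0, 1.0, 0.0]),
    "budget": np.array([0.0, 0.0, 1.0]),
}


def document_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# "unknown" has no vector, so only "match" and "goal" contribute to the average.
doc_vec = document_vector(["match", "goal", "unknown"], toy_embeddings, dim=3)
```

Stacking one such vector per story would give a dense feature matrix that could replace or complement the TF-IDF features when training the classifier.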
