## Automated Data Extraction, Translation, and Classification Pipeline

### Overview

This Python application automates the process of extracting news titles from the Lalit Mauritius website, translating them from Mauritian Creole to English, performing text analysis, and classifying them into predefined categories using machine learning techniques.

### Purpose

The application serves multiple purposes:
- **Data Extraction**: Uses `requests` to fetch the HTML of `https://www.lalitmauritius.org/` and `BeautifulSoup` to parse it, extracting the news titles held in `<span>` tags with class `title`.

- **Translation**: Uses the `pipeline` API from Hugging Face's `transformers` library with the pre-trained model `prajdabre/morisien_english` to translate Mauritian Creole titles into English.

- **Text Analysis**: Uses NLTK (`nltk`) for tokenization and stemming, and `collections.Counter` to flag frequent words as candidate stop words.

- **Data Structuring**: Builds a pandas DataFrame (`pd.DataFrame`) with three columns: the original titles (`'Creole'`), their English translations (`'English'`), and a hard-coded category label for each title (`'Labels'`).

- **Classification**: Trains a Multinomial Naive Bayes classifier on TF-IDF features of the English titles with `sklearn` and reports per-class precision, recall, and F1.

### Components Used

The application relies on the following libraries:
- `requests` and `BeautifulSoup`: web scraping and HTML parsing.
- `transformers`: access to the pre-trained translation model.
- `nltk`: tokenization and stemming.
- `collections.Counter`: word-frequency counts for identifying candidate stop words.
- `pandas`: data manipulation and DataFrame management.
- `sklearn`: the text classification pipeline (`TfidfVectorizer` and `MultinomialNB`), data splitting, and evaluation.

### Detailed Workflow

The workflow can be summarized as follows:
1. **Web Scraping and HTML Parsing**: Fetches HTML content from `'https://www.lalitmauritius.org/'` using `requests` and parses it with `BeautifulSoup` to extract news titles.
   
2. **Translation**: Translates Mauritian Creole news titles into English using a translation pipeline from `transformers`.

3. **Text Processing and Analysis**:
   - Uses NLTK (`nltk`) to tokenize the titles and stem the words with the French Snowball stemmer, an approximation because NLTK's Snowball stemmer has no Mauritian Creole mode.
   - Counts word frequencies across all news titles and identifies potential stop words based on frequency thresholds using `Counter`.

4. **Data Handling**: Constructs a pandas DataFrame (`df`) to store original Mauritian Creole news titles, their English translations, and associated classification labels.

5. **Machine Learning**: Splits the data into training and testing sets (80/20) using `train_test_split` from `sklearn`. Builds a text classification pipeline (`make_pipeline` with `TfidfVectorizer` and `MultinomialNB`) that converts the English titles to TF-IDF features and trains a Multinomial Naive Bayes classifier on them. Evaluates model performance using `classification_report`.
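
As a rough offline illustration of step 1, the sketch below pulls the text of `<span class="title">` elements out of an HTML string. It swaps BeautifulSoup for the standard library's `html.parser` so that it runs without third-party packages or network access. The `TitleSpanParser` class, the `extract_titles` helper, and the sample markup are hypothetical and are not taken from the site or from the notebook.

```python
from html.parser import HTMLParser


class TitleSpanParser(HTMLParser):
    """Collect the text of every <span class="title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []   # finished titles, in document order
        self._depth = 0    # > 0 while inside a matching <span>
        self._parts = []   # text fragments of the title being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or "title" in classes:
            # Count nested spans so the correct </span> ends the title.
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join the fragments and collapse runs of whitespace.
                self.titles.append(" ".join("".join(self._parts).split()))
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def extract_titles(html):
    """Return the text of all <span class="title"> elements in `html`."""
    parser = TitleSpanParser()
    parser.feed(html)
    return parser.titles


# Hypothetical markup for illustration; not copied from the live site.
sample_html = """
<ul>
  <li><span class="title"> Komanter LALIT </span><span class="date">01/01</span></li>
  <li><span class="title">Press Release: <b>SOMALP</b> statement</span></li>
</ul>
"""
print(extract_titles(sample_html))
```

Testing the extraction logic against a fixed string like this keeps it independent of the site's availability; the notebook itself uses `soup.find_all('span', class_='title')` for the same purpose.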


### Conclusion

This notebook chains web scraping, machine translation, basic linguistic analysis, and supervised classification into a single pipeline for multilingual news titles. With only 30 labeled titles, the classifier reaches 0.33 accuracy on the six-title test set, so the pipeline is best read as a proof of concept; a larger labeled dataset would likely be needed before the classification step becomes useful.


In [5]:
import requests
from bs4 import BeautifulSoup
from transformers import pipeline
import pandas as pd

# URL to fetch
url = 'https://www.lalitmauritius.org/'

# Fetch the HTML content; fail early on network or HTTP errors
response = requests.get(url, timeout=30)
response.raise_for_status()
html_content = response.content

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')
news_list = soup.find_all('span', class_='title')

# Extract the news titles
news_titles = [news.get_text() for news in news_list]

# Load the translation pipeline
translator = pipeline("text2text-generation", model="prajdabre/morisien_english")

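# Translate the news titles one by one.
# Note: the outputs recorded below stop mid-sentence, most likely because of the
# default generation length limit; passing e.g. max_new_tokens=64 to translator()
# should allow longer translations. Titles that are already in English are also
# sent through the Creole-to-English model, which may explain some odd outputs.
translated_titles = [translator(title)[0]['generated_text'] for title in news_titles]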

# Print the original and translated titles
for original, translated in zip(news_titles, translated_titles):
    print(f"Original: {original}")
    print(f"Translated: {translated}\n")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Original: Propozisyon LALIT dan kad Bidze 2024-25 e dan kad kanpayn elektoral: Kontrol lor Itilizasyon Later ek Lamer
Translated: Lalit’s position in the budget 2024-25 and in the electoral

Original: Genocide Blog 48 – Israel loses war, continues genocide and USA gaslights the world
Translated: And the point about Geocide Blog 48 – Israel loses

Original: Labour, MMM, ND, Linion Moris, PMSD and Lalit all sign up to Common Statement for Mauritius to support SA case against Israel
Translated: The Labour, the MMM, the ND, the Mauritian Union, the PMSD and

Original: Press Release: SOMALP calls on Government to stop the distribution of Nestle products in schools
Translated: Release: SOMALP demand au Gouvernement to stop the circulation of N

Original: MLF Initiative for a Common Statement by Opposition Parties to ask Mauritian State to Support South Africa in the Genocide Case Against Israel at ICJ
Translated: L’Initiative du Mouvement pour l’ libéralisation (MLF

Original: Komanter LALIT

In [38]:
import nltk
from nltk.stem import SnowballStemmer
nltk.download('punkt')

# Use the scraped titles as the corpus (a mix of Mauritian Creole and English)
creole_texts = news_titles

# NLTK's Snowball stemmer has no Mauritian Creole mode; French is used here as an
# approximation, so some stems are over-truncated (e.g. 'lalit' -> 'lal')
stemmer = SnowballStemmer(language='french')

# Stemming example
creole_words = nltk.word_tokenize(creole_texts[0].lower())  # Tokenize and lowercase
creole_stems = [stemmer.stem(word) for word in creole_words]

print("Original Mauritian Creole words:")
print(creole_words)
print("\nStems:")
print(creole_stems)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Original Mauritian Creole words:
['propozisyon', 'lalit', 'dan', 'kad', 'bidze', '2024-25', 'e', 'dan', 'kad', 'kanpayn', 'elektoral', ':', 'kontrol', 'lor', 'itilizasyon', 'later', 'ek', 'lamer']

Stems:
['propozisyon', 'lal', 'dan', 'kad', 'bidz', '2024-25', 'e', 'dan', 'kad', 'kanpayn', 'elektoral', ':', 'kontrol', 'lor', 'itilizasyon', 'lat', 'ek', 'lam']


In [26]:
from collections import Counter

# Count word frequencies using a plain whitespace split
# (case-sensitive; punctuation stays attached to words)
word_counts = Counter(' '.join(news_titles).split())

# Treat words that appear more than 10 times as candidate stop words
stop_words = [word for word, count in word_counts.items() if count > 10]  # Adjust threshold as needed

print(f"Custom Stop Words: {stop_words}")

Custom Stop Words: ['Blog', 'to']
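
Because the count above is case-sensitive and keeps punctuation attached, variants such as `LALIT` and `Lalit` are tallied separately. The sketch below shows one possible normalization (lowercasing plus a `\w+` regular expression) applied before counting. The `frequent_words` helper and the lowered threshold are assumptions for illustration, not part of the notebook; the three sample titles are taken from the scraped output above.

```python
import re
from collections import Counter


def frequent_words(titles, min_count):
    """Return words occurring at least `min_count` times, case-insensitively."""
    counts = Counter()
    for title in titles:
        # \w+ keeps runs of letters and digits and drops punctuation such as ':' or ','
        counts.update(re.findall(r"\w+", title.lower()))
    return sorted(word for word, n in counts.items() if n >= min_count)


# Three titles from the run above; the threshold is lowered because the sample is tiny.
sample_titles = [
    "Propozisyon LALIT dan kad Bidze 2024-25 e dan kad kanpayn elektoral: "
    "Kontrol lor Itilizasyon Later ek Lamer",
    "Labour, MMM, ND, Linion Moris, PMSD and Lalit all sign up to Common "
    "Statement for Mauritius to support SA case against Israel",
    "Komanter LALIT",
]
print(frequent_words(sample_titles, min_count=2))
```

On this sample the result includes `lalit`, a content word, alongside function words such as `dan` and `to`, which suggests that frequency alone is only a rough stop-word heuristic on a small, single-source corpus.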


In [23]:
import pandas as pd

df = pd.DataFrame({
    'Creole': news_titles,
    'English': translated_titles
})
df.head()

|   | Creole | English |
|---|--------|---------|
| 0 | Propozisyon LALIT dan kad Bidze 2024-25 e dan ... | Lalit’s position in the budget 2024-25 and in ... |
| 1 | Genocide Blog 48 – Israel loses war, continues... | And the point about Geocide Blog 48 – Israel l... |
| 2 | Labour, MMM, ND, Linion Moris, PMSD and Lalit ... | The Labour, the MMM, the ND, the Mauritian Uni... |
| 3 | Press Release: SOMALP calls on Government to s... | Release: SOMALP demand au Gouvernement to stop... |
| 4 | MLF Initiative for a Common Statement by Oppos... | L’Initiative du Mouvement pour l’ libéralisati... |


In [24]:
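# Hand-assigned sample labels, one per scraped title (30 in this run).
# The list length must equal len(news_titles); otherwise pd.DataFrame raises a ValueError.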
labels = [
    "Political", "Media", "Political", "Press", "Political",
    "Political", "Media", "International", "Media", "Education",
    "Education", "Media", "Media", "Press", "Political", "Education",
    "Media", "Political", "Education", "Political", "Media", "Political",
    "Political", "Press", "Political", "Legal", "Media", "Political",
    "Political", "Legal"
]

df = pd.DataFrame({
    'Creole': news_titles,
    'English': translated_titles,
    'Labels': labels
})

df.head()

|   | Creole | English | Labels |
|---|--------|---------|--------|
| 0 | Propozisyon LALIT dan kad Bidze 2024-25 e dan ... | Lalit’s position in the budget 2024-25 and in ... | Political |
| 1 | Genocide Blog 48 – Israel loses war, continues... | And the point about Geocide Blog 48 – Israel l... | Media |
| 2 | Labour, MMM, ND, Linion Moris, PMSD and Lalit ... | The Labour, the MMM, the ND, the Mauritian Uni... | Political |
| 3 | Press Release: SOMALP calls on Government to s... | Release: SOMALP demand au Gouvernement to stop... | Press |
| 4 | MLF Initiative for a Common Statement by Oppos... | L’Initiative du Mouvement pour l’ libéralisati... | Political |
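
Before training, it can help to check how the 30 labels are distributed across categories. The sketch below tallies the label list from the cell above using only the standard library; the variable name `distribution` is introduced here for illustration.

```python
from collections import Counter

# The 30 labels from the cell above, repeated so the block runs on its own.
labels = [
    "Political", "Media", "Political", "Press", "Political",
    "Political", "Media", "International", "Media", "Education",
    "Education", "Media", "Media", "Press", "Political", "Education",
    "Media", "Political", "Education", "Political", "Media", "Political",
    "Political", "Press", "Political", "Legal", "Media", "Political",
    "Political", "Legal",
]

distribution = Counter(labels)

# Print classes from most to least frequent, with their share of the dataset.
for label, n in distribution.most_common():
    print(f"{label:<14}{n:>3}  {n / len(labels):.0%}")
```

`Political` accounts for 12 of the 30 titles, while `International` appears only once. A class with a single member also means that a stratified split (`train_test_split(..., stratify=y)`) would be expected to fail on this data unless rare classes are merged or more titles are labeled.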


In [25]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

# Features are the English translations; targets are the hand-assigned labels
X = df['English']
y = df['Labels']

# Hold out 20% of the titles (6 of 30) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build a text classification pipeline: TF-IDF features followed by Multinomial Naive Bayes
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Train the model
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)

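# Evaluate the model; zero_division=0 reports 0.00 (without a warning)
# for classes that are never predicted
print(classification_report(y_test, predictions, zero_division=0))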

              precision    recall  f1-score   support

   Education       0.00      0.00      0.00         2
       Media       0.00      0.00      0.00         1
   Political       0.33      1.00      0.50         2
       Press       0.00      0.00      0.00         1

    accuracy                           0.33         6
   macro avg       0.08      0.25      0.12         6
weighted avg       0.11      0.33      0.17         6
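
The report above is what one would get if the classifier predicted `Political` for every test title: recall of 1.00 with precision of 0.33 on a class with support 2 implies six `Political` predictions. The sketch below recomputes the per-class figures for such a constant prediction with plain Python. The `constant_prediction_report` helper is hypothetical, and the test-set composition is read off the support column rather than taken from the actual split.

```python
def constant_prediction_report(y_true, predicted_label):
    """Per-class (precision, recall, f1, support) when every sample gets `predicted_label`."""
    report = {}
    for label in sorted(set(y_true)):
        support = y_true.count(label)
        if label == predicted_label:
            precision = support / len(y_true)  # every prediction carries this label
            recall = 1.0                       # so every true instance is found
            f1 = 2 * precision * recall / (precision + recall)
        else:
            # Never predicted: precision is 0/0, which scikit-learn reports as 0.00.
            precision = recall = f1 = 0.0
        report[label] = (precision, recall, f1, support)
    return report


# Test-set composition inferred from the support column of the report above.
y_test_labels = ["Education"] * 2 + ["Media"] + ["Political"] * 2 + ["Press"]

report = constant_prediction_report(y_test_labels, "Political")
for label, (precision, recall, f1, support) in report.items():
    print(f"{label:<10} {precision:.2f} {recall:.2f} {f1:.2f} {support}")

accuracy = y_test_labels.count("Political") / len(y_test_labels)
print(f"accuracy   {accuracy:.2f}")
```

In other words, on this split the model does no better than a majority-class baseline, which is a plausible outcome for Naive Bayes trained on 24 short, imbalanced examples.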



