<a href="https://colab.research.google.com/github/leaBroe/Deep_Learning_in_Python/blob/main/deep_learning_python_winter_school_24.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Project Overview

This is the project for the course "Deep Learning in Python" at the Machine Learning Winter School 2024 of the University of Fribourg.

Authors:  
Lea Brönnimann lea.broennimann@unifr.ch (19-107-010)  
Mohamed Mansour Faye mohamedmansour.faye@unifr.ch (19-505-197)  
Laura Dekker laura.dekker@unifr.ch (22-112-346)  

The code can also be found in the GitHub repo: [https://github.com/leaBroe/Deep_Learning_in_Python.git](https://github.com/leaBroe/Deep_Learning_in_Python.git)  

**Abstract:**

We built an app that performs sentiment analysis on written text, that is, it analyzes a text to assess whether its tone is positive, negative or neutral. To be inclusive, the app also handles text in languages other than English by running it through a translation model first. The final product is a Gradio interface in which you upload an image containing digital text and receive the extracted text, its English translation, and the sentiment and emotion inferred from it. In addition, named entities are extracted from the text.

An app like this can help convey emotion in the digital world: almost everyone has had trouble reading between the lines of a text to deduce the emotion behind it. Other applications include AI customer service, where such a model could help an AI give more appropriate responses to customers; social media monitoring, where it could make the job of moderators easier; and the analysis of customer feedback. Entity extraction can be useful, for example, for quickly analyzing annual reports from companies. All in all, there is ample reason to develop software that analyzes emotions and extracts relevant entities from written text.

### Model Card

- **Model Details**: This app combines several components to perform sentiment analysis on text in images. We first use the Tesseract OCR engine through the `pytesseract` package to extract text from images. We then perform sentiment analysis using `distilbert-base-uncased-finetuned-sst-2-english`, which is available on Hugging Face, classify emotions using `bhadresh-savani/distilbert-base-uncased-emotion`, and extract named entities using the spaCy model `en_core_web_sm`. Because these models only work with English text, we also use the `m2m100_418M` model from Meta (also available on Hugging Face) to translate the input text when it is not in English.
- **Data**: The DistilBERT SST-2 model is based on DistilBERT, a smaller distilled version of the BERT transformer from Google, which was trained on large amounts of English text (from Wikipedia amongst others) in a self-supervised fashion. It was then fine-tuned on the Stanford Sentiment Treebank to enhance performance on sentiment analysis tasks.
- **Performance**: The fine-tuned DistilBERT model achieves a good accuracy of 91.3 on the SST-2 task of the GLUE benchmark but, as we will discuss later, might struggle with some specific subtasks/topics. The overall pipeline's performance also depends on the language of the input text. The Many-to-Many translation model used here doesn't perform as well when translating Wolof as when translating German, for example.
- **Ethical Considerations**: The model's developers note that it can produce biased predictions that affect underrepresented populations. As the SST-2 dataset was sourced from movie reviews on Rotten Tomatoes, many of the statements on which the fine-tuning is based contain judgements on the way (American) movies in particular portray one topic or another. Our focus on analyzing sentiment around environmental issues makes it quite likely that geographical information is relevant, which may skew results significantly.
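
The SST-2 model named above returns only two labels, `POSITIVE` and `NEGATIVE`, each with a confidence score. One possible way to also report a neutral tone is to treat low-confidence predictions as neutral. The following is a minimal sketch under that assumption: the function name `label_with_neutral` and the 0.75 threshold are illustrative and not part of the app, and the input dictionaries only mimic the `{'label': ..., 'score': ...}` format returned by the `transformers` sentiment pipeline.

```python
def label_with_neutral(prediction, threshold=0.75):
    """Map a binary sentiment prediction to POSITIVE, NEGATIVE or NEUTRAL.

    `prediction` is a dict with a 'label' and a 'score' key. The threshold
    is an arbitrary illustrative value and would need tuning on real data.
    """
    if prediction["score"] < threshold:
        return "NEUTRAL"
    return prediction["label"]


# Hand-written example predictions (not real model output)
examples = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.61},
]
labels = [label_with_neutral(p) for p in examples]
print(labels)
```

A threshold on the score is only a heuristic; a model fine-tuned with an explicit neutral class would likely be more reliable.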

### Outlook

Given an extra month, we might consider:

- **Expanding Data Sources**: Including more diverse sources of images and text, such as news articles or blogs, to enrich the analysis.
- **Model Fine-Tuning**: Fine-tuning the sentiment analysis model on a dataset related to the specific use case we are interested in.
- **Feature Expansion**: Adding functionality to track sentiment trends over time, e.g. enabling longitudinal studies on public sentiment toward environmental issues.
- **More Advanced Analyses**: To better analyze the overall sentiment of customer feedback or trends in the perception of environmental issues, it would help to process a larger number of texts in parallel and, for example, calculate the percentage of positive and negative feedback.
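
The percentage and trend ideas in the list above could be sketched as follows. This is a minimal illustration, not part of the app: the helper names `sentiment_shares` and `shares_by_period` are hypothetical, and the example predictions are hand-written in the `{'label': ..., 'score': ...}` format that the sentiment pipeline returns for each input text.

```python
from collections import Counter, defaultdict


def sentiment_shares(predictions):
    """Return the percentage of each label in a list of predictions."""
    counts = Counter(p["label"] for p in predictions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


def shares_by_period(dated_predictions):
    """Group (period, prediction) pairs by period and compute shares per period."""
    grouped = defaultdict(list)
    for period, prediction in dated_predictions:
        grouped[period].append(prediction)
    return {period: sentiment_shares(preds) for period, preds in sorted(grouped.items())}


# Hand-written example predictions (not real model output)
feedback = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.91},
    {"label": "POSITIVE", "score": 0.87},
    {"label": "POSITIVE", "score": 0.76},
]
print(sentiment_shares(feedback))

# Period keys such as "2024-01" are an assumed convention for this sketch
dated = [("2024-01", feedback[0]), ("2024-01", feedback[1]), ("2024-02", feedback[2])]
print(shares_by_period(dated))
```

In practice the predictions would come from calling the sentiment pipeline on a list of texts, which returns one dictionary per text.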

In [1]:
!sudo apt install tesseract-ocr

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 35 not upgraded.
Need to get 4,816 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]
Fetched 4,816 kB in 1s (5,563 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debc

In [2]:
!pip install pytesseract #!!!

Collecting pytesseract
  Downloading pytesseract-0.3.10-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.10


In [3]:
!pip install langdetect #!!!

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993225 sha256=2bcd2a5faedd9d46c3f679ad293551d2be8ffc8c981ad1e2dfd1c12843b12dc8
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5

In [4]:
!pip install gradio #!!!

Collecting gradio
  Downloading gradio-4.21.0-py3-none-any.whl (17.0 MB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl (15 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.110.0-py3-none-any.whl (92 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.3.2.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... done
Collecting gradio-client==0.12.0 (from gradio)
  Downloading gradio_client-0.12.0-py3-none-any.whl (310 kB)
Collecting httpx>=0.24.1 (from gradio)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)

In [5]:
import gradio as gr
import pytesseract
import spacy
from langdetect import detect
from PIL import Image
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer, pipeline

# Classification pipelines: emotion labels and binary sentiment (POSITIVE/NEGATIVE)
emotion_pipeline = pipeline("text-classification", model="bhadresh-savani/distilbert-base-uncased-emotion")
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# Load the spaCy English model used for named entity recognition
nlp = spacy.load("en_core_web_sm")

# Load the translation model and tokenizer
model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)


# Function to extract text from an image file with Tesseract OCR
def extract_text_from_image(image_path):
    image = Image.open(image_path)
    extracted_text = pytesseract.image_to_string(image)
    return extracted_text.strip()


# Function to translate text to English; always returns (text, detected language)
def translate_text_to_english(text):
    # Detect the language of the input text
    detected_lang = detect(text)
    print(f"Detected language: {detected_lang}")  # For debugging

    # English input needs no translation. Return a tuple here as well,
    # because the caller unpacks two values.
    if detected_lang == 'en':
        return text, detected_lang

    # Specify the source language for the tokenizer; m2m100 uses language codes
    tokenizer.src_lang = detected_lang

    # Encode the text for the model
    encoded = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

    # Generate translation tokens and decode them to text
    # Note: forced_bos_token_id forces the model to translate to English
    generated_tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("en"))
    translated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

    return translated_text, detected_lang


# Function to get sentiment from text
def get_sentiment(text):
    results = sentiment_pipeline(text)
    return results

# Function to get emotion from text
def get_emotion(text):
    results = emotion_pipeline(text)
    return results

# Function to extract named entities using spaCy
def extract_entities(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

# Main processing function to integrate OCR, translation, sentiment and emotion analysis, and NER
def process_image(image):
    extracted_text = extract_text_from_image(image)
    # langdetect raises an exception on empty input, so stop early if OCR found no text
    if not extracted_text:
        return "", "", "", "", "", ""
    translated_text, detected_lang = translate_text_to_english(extracted_text)
    sentiment_result = get_sentiment(translated_text)
    emotion_result = get_emotion(translated_text)
    entities = extract_entities(translated_text)  # Use spaCy to extract entities
    entities_str = ', '.join([f"{text} ({label})" for text, label in entities])  # Format entities for display
    return extracted_text, detected_lang, translated_text, sentiment_result, emotion_result, entities_str

# Define Gradio interface
iface = gr.Interface(fn=process_image,
                     inputs=gr.Image(label="Upload Image", type="filepath"),
                     outputs=[gr.Textbox(label="Extracted Text"),
                              gr.Textbox(label="Detected Language"),
                              gr.Textbox(label="Translated Text"),
                              gr.Textbox(label="Sentiment Analysis Result"),
                              gr.Textbox(label="Emotion Analysis Result"),
                              gr.Textbox(label="Extracted Entities")],
                     title="Image to Sentiment and Emotion Analysis",
                     description="Upload an image containing text, and the app will translate the text to English, then perform sentiment and emotion analysis and extract named entities.")

# Launch the app
# debug=True
iface.launch(share=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://bc7c5c3d954336a50f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
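
The translation step in the code cell above assigns the code returned by `langdetect` directly to `tokenizer.src_lang`. `langdetect` can return region-tagged codes such as `zh-cn` and `zh-tw`, and the detected code is not guaranteed to be one that the translation tokenizer accepts. A minimal sketch of a normalization step follows. The helper `normalize_lang_code` is hypothetical and `SUPPORTED_CODES` is only an illustrative subset; in the app, the set of accepted codes would have to come from the translation tokenizer in use.

```python
# Illustrative subset only (assumption); not the full list of model languages
SUPPORTED_CODES = {"de", "en", "fr", "wo", "zh"}


def normalize_lang_code(code, supported=SUPPORTED_CODES, fallback=None):
    """Strip a region suffix (e.g. 'zh-cn' -> 'zh') and check for support.

    Returns the base code if it is in `supported`, otherwise `fallback`.
    """
    base = code.lower().split("-")[0]
    return base if base in supported else fallback


print(normalize_lang_code("zh-cn"))
print(normalize_lang_code("de"))
print(normalize_lang_code("xx"))
```

Returning a fallback of `None` lets the caller decide what to do with unsupported input, for example skipping translation and showing a message in the interface.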


