# ***Q1***

Key steps involved in starting and developing an automatic speech recognition (ASR) product:

### 1. Conceptualization and Research
- **Identify Use Case:** Define target audience and specific problems the ASR product will solve.
- **Market Research:** Analyze competitors and market needs.
- **Feasibility Study:** Assess technical feasibility and potential challenges.

### 2. Data Collection and Preparation
- **Data Collection:** Gather diverse audio recordings.
- **Data Annotation:** Transcribe audio data for training.
- **Data Preprocessing:** Clean and normalize the data.

### 3. Model Development
- **Choose/Develop ASR Framework:** Decide on using existing frameworks or building a custom solution.
- **Model Training:** Train the ASR model with collected data.
- **Evaluation and Testing:** Evaluate performance using metrics like Word Error Rate (WER).
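
For reference, WER is conventionally defined from the word-level edit distance between the model output and the reference transcript, where $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted words and $N$ is the number of words in the reference:

$$
\mathrm{WER} = \frac{S + D + I}{N}
$$

Because insertions are counted, WER can exceed 1 when the output is much longer than the reference.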

### 4. Development of Supporting Components
- **Front-End Development:** Design user-friendly interface.
- **Back-End Development:** Build server-side infrastructure.
- **Integration:** Integrate ASR model with system components.

### 5. Deployment
- **Scalability and Optimization:** Ensure system can handle varying loads.
- **Cloud/On-Premise Deployment:** Choose deployment strategy.
- **Continuous Monitoring:** Track performance and usage.

### 6. Continuous Improvement
- **User Feedback:** Collect and act on feedback.
- **Model Updates:** Regularly update the model with new data.
- **Feature Enhancements:** Develop new features based on user needs.

### 7. Compliance and Security
- **Data Privacy:** Ensure compliance with regulations.
- **Security Measures:** Protect user data and prevent unauthorized access.

### Tools and Technologies
- **ASR Frameworks:** Kaldi, DeepSpeech, Wav2Vec
- **Programming Languages:** Python, C++
- **Deep Learning Frameworks:** TensorFlow, PyTorch
- **Cloud Platforms:** AWS, Google Cloud, Azure
- **Data Annotation Tools:** Labelbox, Doccano



# ***Q2***

Two established methods and one novel approach for selecting the top 16 outputs across multiple models:

### Established Methods

1. **Majority Voting (Ensemble Method)**
   - **Steps:**
     1. Collect outputs from all models.
     2. Rank outputs by frequency.
     3. Select the top 16 most frequent outputs.

2. **Weighted Averaging (Weighted Ensemble)**
   - **Steps:**
     1. Assign each model a weight based on its measured performance.
     2. Score each output by the weight of the model that produced it.
     3. Sum the scores of identical outputs across models.
     4. Select the 16 outputs with the highest total scores.
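
The two established methods can be sketched as follows. This is a minimal illustration, assuming each model returns a list of candidate outputs; the names `majority_vote`, `weighted_vote`, `model_outputs`, and `weights`, as well as the sample data, are hypothetical and not part of the original answer. Ties are broken alphabetically so the result is deterministic.

```python
from collections import Counter, defaultdict


def majority_vote(model_outputs, k=16):
    """Return the k outputs proposed by the largest number of models."""
    counts = Counter()
    for outputs in model_outputs.values():
        counts.update(set(outputs))  # each model votes at most once per output
    # Sort by descending vote count, then alphabetically to break ties
    return sorted(counts, key=lambda item: (-counts[item], item))[:k]


def weighted_vote(model_outputs, weights, k=16):
    """Return the k outputs with the highest summed model weights."""
    scores = defaultdict(float)
    for model, outputs in model_outputs.items():
        for item in set(outputs):
            scores[item] += weights[model]  # a vote counts as the model's weight
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


# Hypothetical example data: three models, three candidates each
model_outputs = {
    "model_a": ["cat", "dog", "bird"],
    "model_b": ["cat", "dog", "fish"],
    "model_c": ["cat", "fish", "horse"],
}
weights = {"model_a": 0.6, "model_b": 0.1, "model_c": 0.3}

print(majority_vote(model_outputs, k=3))           # ['cat', 'dog', 'fish']
print(weighted_vote(model_outputs, weights, k=3))  # ['cat', 'dog', 'bird']
```

In this toy data the two methods disagree on the third slot: "fish" has more votes, but "bird" comes from the most heavily weighted model.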

### Novel Method

3. **Hybrid Consensus Method**
   - **Steps:**
     1. Perform initial majority voting to find the top 20 outputs.
     2. Apply weighted averaging to refine the top 20 outputs.
     3. Select the top 16 outputs based on the refined weighted scores.
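
A minimal sketch of the hybrid method, under the same assumption that each model returns a list of candidate outputs. The function name `hybrid_consensus`, its parameters, and the sample data are hypothetical, chosen only for illustration.

```python
from collections import Counter


def hybrid_consensus(model_outputs, weights, shortlist_size=20, k=16):
    """Shortlist by vote count, then re-rank the shortlist by summed weights."""
    # Stage 1: majority voting to build the shortlist
    counts = Counter()
    for outputs in model_outputs.values():
        counts.update(set(outputs))
    shortlist = sorted(counts, key=lambda item: (-counts[item], item))[:shortlist_size]

    # Stage 2: weighted scoring, restricted to shortlisted outputs
    scores = {item: 0.0 for item in shortlist}
    for model, outputs in model_outputs.items():
        for item in set(outputs):
            if item in scores:
                scores[item] += weights[model]

    # Stage 3: keep the k best by weighted score (alphabetical tie-break)
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


# Hypothetical example data
model_outputs = {
    "model_a": ["cat", "dog", "bird"],
    "model_b": ["cat", "dog", "fish"],
    "model_c": ["cat", "fish", "horse"],
}
weights = {"model_a": 0.6, "model_b": 0.1, "model_c": 0.3}

print(hybrid_consensus(model_outputs, weights, shortlist_size=3, k=3))
# ['cat', 'dog', 'fish']
```

With a shortlist of 3, "bird" never reaches the weighting stage even though its model has the largest weight, which is the intended effect of voting first: an output must have broad support before weights can promote it.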

# ***Q3***

1. **Normalizer:** Converts text to a standard format (e.g., lowercasing, removing punctuation).
2. **Formalizer:** Converts informal or slang text to formal language.
3. **Lemmatizer:** Reduces words to their base or dictionary form (considering context).
4. **Stemmer:** Reduces words to their root form by removing affixes (without context).
5. **Chunker:** Groups words into phrases (e.g., noun phrases, verb phrases).
6. **Tagger:** Assigns a label to each token in a sequence (e.g., grammatical categories or entity types).
7. **POS Tagger:** Specifically assigns part-of-speech tags to words.
8. **Parser:** Analyzes the grammatical structure of sentences, creating parse trees.
9. **Word Embedder:** Converts words into numerical vectors capturing semantic meanings.
10. **Embedder:** Converts larger text units (e.g., sentences, documents) into numerical vectors.
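
To make the difference between the first few tools concrete, here is a toy, English-only sketch in plain Python. The suffix list and lookup table are hypothetical and deliberately tiny; a real stemmer or lemmatizer uses far richer rules, and a real lemmatizer also uses context, which this lookup does not.

```python
import string


def normalize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


SUFFIXES = ("ing", "ed", "s")  # hypothetical, minimal suffix list


def stem(word):
    """Strip the first matching suffix, with no dictionary or context."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


LEMMA_TABLE = {"went": "go", "studies": "study"}  # hypothetical lookup table


def lemmatize(word):
    """Map a word to its dictionary form via lookup (context ignored here)."""
    return LEMMA_TABLE.get(word, word)


tokens = normalize("She went home; he studies!").split()
stems = [stem(t) for t in tokens]
lemmas = [lemmatize(t) for t in tokens]

print(tokens)  # ['she', 'went', 'home', 'he', 'studies']
print(stems)   # ['she', 'went', 'home', 'he', 'studie']
print(lemmas)  # ['she', 'go', 'home', 'he', 'study']
```

The stemmer produces the non-word "studie" and leaves "went" untouched, while the lemmatizer returns valid dictionary forms for both.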


# ***Q4***

In [2]:
! pip install git+https://github.com/roshan-research/hazm.git

Collecting git+https://github.com/roshan-research/hazm.git
  Cloning https://github.com/roshan-research/hazm.git to /tmp/pip-req-build-3c5c31l2
  Running command git clone --filter=blob:none --quiet https://github.com/roshan-research/hazm.git /tmp/pip-req-build-3c5c31l2
  Resolved https://github.com/roshan-research/hazm.git to commit e6705fe99d6d9248d941c1c1693d2ea46bb29cc5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [3]:
import hazm
from difflib import SequenceMatcher

# Initialize Hazm normalizer
normalizer = hazm.Normalizer()

# Function to normalize text using Hazm
def hazm_normalize(text):
    return normalizer.normalize(text)

# Approximate Character Error Rate (CER) as 1 - SequenceMatcher similarity ratio.
# Note: this is a similarity-based proxy, not the edit-distance CER (S + D + I) / N.
def cer(prediction, reference):
    s = SequenceMatcher(None, prediction, reference)
    return 1 - s.ratio()

# Approximate Word Error Rate (WER) the same way, over whitespace-separated words
def wer(prediction, reference):
    prediction_words = prediction.split()
    reference_words = reference.split()
    sm = SequenceMatcher(None, prediction_words, reference_words)
    return 1 - sm.ratio()

# Read HARF output and true transcription from files
with open("/content/HarfOutp.txt", 'r', encoding='utf-8') as file:
    harf_output = file.read()

with open("/content/true_output.txt", 'r', encoding='utf-8') as file:
    true_transcription = file.read()

# Normalize the transcriptions using Hazm
normalized_harf_output = hazm_normalize(harf_output)
normalized_true_transcription = hazm_normalize(true_transcription)

# Calculate CER and WER
cer_value = cer(normalized_harf_output, normalized_true_transcription)
wer_value = wer(normalized_harf_output, normalized_true_transcription)

cer_value, wer_value


(0.44252688748347446, 0.27552552552552556)
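
The values above come from `1 - SequenceMatcher.ratio()`, where the ratio is `2 * M / T` (matched elements over total elements in both sequences). That is a similarity proxy rather than the conventional edit-distance error rate, and `difflib`'s default `autojunk` heuristic may skew it on long sequences. As a point of comparison, here is a minimal pure-Python sketch of Levenshtein-based WER and CER. The function names are hypothetical, and the numbers it yields on the files above would differ from the reported ones.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, prediction):
    """Edit distance normalized by the reference length."""
    if len(reference) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, prediction) / len(reference)


def wer_edit(reference, prediction):
    """Word Error Rate over whitespace-separated tokens."""
    return error_rate(reference.split(), prediction.split())


def cer_edit(reference, prediction):
    """Character Error Rate over raw characters."""
    return error_rate(reference, prediction)


print(wer_edit("the cat sat", "the cat sat down"))  # 1 insertion / 3 words
print(cer_edit("abc", "abd"))                       # 1 substitution / 3 chars
```

Note the argument order: the reference comes first because the distance is normalized by the reference length. The dynamic program is O(len(ref) × len(hyp)), so character-level scoring of long transcripts can be slow in pure Python.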

# ***Q5***

In [4]:
from hazm import *
# Initialize Hazm tools
normalizer = hazm.Normalizer()
tokenizer = hazm.word_tokenize
tagger = hazm.POSTagger(model="/content/pos_tagger.model")

In [5]:
# Tokenize and POS tag the transcriptions
harf_tokens = tokenizer(normalized_harf_output)
true_tokens = tokenizer(normalized_true_transcription)

harf_tags = tagger.tag(harf_tokens)
true_tags = tagger.tag(true_tokens)

# Count the number of verbs and adverbs in each transcription
def count_verbs_adverbs(tags):
    verb_count = 0
    adverb_count = 0
    for word, tag in tags:
        if tag.startswith('V'):  # Verb tags usually start with 'V'
            verb_count += 1
        elif tag == 'ADV':  # Adverb tag
            adverb_count += 1
    return verb_count, adverb_count

harf_verb_count, harf_adverb_count = count_verbs_adverbs(harf_tags)
true_verb_count, true_adverb_count = count_verbs_adverbs(true_tags)

print(f"HARF Output - Verbs: {harf_verb_count}, Adverbs: {harf_adverb_count}")
print(f"True Transcription - Verbs: {true_verb_count}, Adverbs: {true_adverb_count}")

HARF Output - Verbs: 294, Adverbs: 78
True Transcription - Verbs: 290, Adverbs: 74


# ***Q6***

In [6]:
lemmatizer = Lemmatizer()

# Function to find the most frequent verb stem
def most_frequent_verb_stem(text):
    # Normalize the text
    normalized_text = normalizer.normalize(text)
    # Tokenize the text
    tokens = tokenizer(normalized_text)
    # POS tag the tokens
    tags = tagger.tag(tokens)

    # Dictionary to count verb stems
    verb_stem_count = {}

    # Lemmatize and count verb stems
    for word, tag in tags:
        if tag.startswith('V'):  # Verb tags usually start with 'V'
            lemma = lemmatizer.lemmatize(word)
            # hazm verb lemmas have the form 'past_stem#present_stem';
            # keep the part after '#' (the whole lemma if there is no '#')
            stem = lemma.split('#')[-1]
            if stem in verb_stem_count:
                verb_stem_count[stem] += 1
            else:
                verb_stem_count[stem] = 1

    # Find the most frequent verb stem
    most_frequent_stem = max(verb_stem_count, key=verb_stem_count.get)
    return most_frequent_stem, verb_stem_count[most_frequent_stem]

# Find the most frequent verb stem in each file
harf_most_frequent_stem, harf_count = most_frequent_verb_stem(harf_output)
true_most_frequent_stem, true_count = most_frequent_verb_stem(true_transcription)

print(f"HARF Output - Most Frequent Verb Stem: '{harf_most_frequent_stem}', Count: {harf_count}")
print(f"True Transcription - Most Frequent Verb Stem: '{true_most_frequent_stem}', Count: {true_count}")


HARF Output - Most Frequent Verb Stem: 'است', Count: 46
True Transcription - Most Frequent Verb Stem: 'کن', Count: 36
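
The counting step can also be written with `collections.Counter`, which additionally makes the empty-input case explicit (the `max` call above raises `ValueError` if no verbs are found). This is a small sketch on hypothetical, already-lemmatized verbs, assuming lemmas of the form `past_stem#present_stem`; the function name and sample list are illustrative only.

```python
from collections import Counter


def most_frequent_present_stem(lemmas):
    """Return (stem, count) for the most common part after '#', or None."""
    stems = Counter(lemma.split("#")[-1] for lemma in lemmas)
    if not stems:
        return None  # no verbs were found
    return stems.most_common(1)[0]


# Hypothetical lemmatized verbs; 'است' has no '#', so it is kept whole
sample_lemmas = ["کرد#کن", "رفت#رو", "کرد#کن", "است"]

print(most_frequent_present_stem(sample_lemmas))
```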


# ***Q7***

In [7]:
!pip install mega.py



In [9]:
from mega import Mega
mega = Mega()
m = mega.login()

In [None]:
m.download_url('https://mega.nz/file/GqZUlbpS#XRYP5FHbPK2LnLZ8IExrhrw3ZQ-jclNSVCz59uEhrxY')

In [None]:
! unzip fasttext_model.zip

In [11]:
# Initialize the word embedding model (assuming the unzipped fastText model file is at the path below)
word_embedding = WordEmbedding(model_type='fasttext', model_path="/content/fasttext_skipgram_300.bin")

# Function to find the unrelated word in each group of words
def find_unrelated_words(groups):
    unrelated_words = []
    for group in groups:
        unrelated_word = word_embedding.doesnt_match(group)
        unrelated_words.append(unrelated_word)
    return unrelated_words

# Example list of lists containing 5 Farsi words each
word_groups = [
    ['سلام', 'درود', 'خداحافظ', 'بدرود', 'درخت'],
    ['ساعت', 'پلنگ', 'شیر', 'ببر', 'گربه'],
    ['کتاب', 'دفتر', 'خودکار', 'مداد', 'سیب'],
    ['آسمان', 'دختر', 'خورشید', 'ماه', 'ستاره'],
    ['دوچرخه', 'ماشین', 'موتور', 'قطار', 'درخت']
]

# Find and print the unrelated word in each group
unrelated_words = find_unrelated_words(word_groups)
for group, unrelated_word in zip(word_groups, unrelated_words):
    print(f"In the group {group}, the unrelated word is: '{unrelated_word}'")

In the group ['سلام', 'درود', 'خداحافظ', 'بدرود', 'درخت'], the unrelated word is: 'درخت'
In the group ['ساعت', 'پلنگ', 'شیر', 'ببر', 'گربه'], the unrelated word is: 'ساعت'
In the group ['کتاب', 'دفتر', 'خودکار', 'مداد', 'سیب'], the unrelated word is: 'سیب'
In the group ['آسمان', 'دختر', 'خورشید', 'ماه', 'ستاره'], the unrelated word is: 'دختر'
In the group ['دوچرخه', 'ماشین', 'موتور', 'قطار', 'درخت'], the unrelated word is: 'درخت'
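
One common way to pick the odd word out, and to my understanding roughly what a gensim-style `doesnt_match` does, is to compare each word vector with the mean of the group and return the least similar one. The sketch below uses hypothetical 2-D toy vectors instead of the real 300-dimensional fastText model, and the names `unit` and `odd_one_out` are illustrative.

```python
import math


def unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(units[words[0]])
    mean = [sum(units[w][i] for w in words) / len(words) for i in range(dim)]
    # The mean's length is the same for every word, so the dot product
    # ranks words exactly as cosine similarity would.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical toy embeddings: three animals point one way, 'clock' another
toy_vectors = {
    "lion": [0.9, 0.1],
    "tiger": [0.8, 0.2],
    "cat": [0.85, 0.15],
    "clock": [0.1, 0.9],
}

print(odd_one_out(["clock", "lion", "tiger", "cat"], toy_vectors))  # clock
```

Because the outlier itself contributes to the mean, this approach is most reliable when one word clearly differs from an otherwise coherent group, as in the five-word lists above.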
