# Question 1
Poetry Generation using N-grams



1 Introduction:
In this assignment, you will use n-gram language models to generate poetry. For the purpose of this assignment, a poem consists of three stanzas, each containing four verses, where each verse consists of 7 to 10 words. For example, the following manually written poem has three such stanzas.

دل سے نکال یاس کہ زندہ ہوں میں ابھی،

ہوتا ہے کیوں اداس کہ زندہ ہوں میں ابھی،

مایوسیوں کی قید سے خود کو نکال کر،

آ جاؤ میرے پاس کہ زندہ ہوں میں ابھی،



آ کر کبھی تو دید سے سیراب کر مجھے،

مرتی نہیں ہے پیاس کہ زندہ ہوں میں ابھی،

مہر و وفا خلوص و محبت گداز دل،

سب کچھ ہے میرے پاس کہ زندہ ہوں میں ابھی،




لوٹیں گے تیرے آتے ہی پھر دن بہار کے،

رہتی ہے دل میں آس کہ زندہ ہوں میں،

نایاب شاخ چشم میں کھلتے ہیں اب بھی خواب،

سچ ہے ترا قیاس کہ زندہ ہوں میں ابھی

The task is to print three such stanzas with an empty line between them. The generation model can be trained on the provided Poetry Corpus, which contains poems by Faiz, Ghalib and Iqbal. You may also scrape other Urdu poetry from the internet. You will train unigram and bigram models on this corpus and use them to generate poetry.

2 Assignment Task:

The task is to generate a poem using different models. We will generate a poem verse by verse until all stanzas have been generated. The poetry generation problem can be solved using the following algorithm:
1. Load the Poetry Corpus
2. Tokenize the corpus in order to split it into a list of words
3. Generate n-gram models
4. For each of the stanzas
   - For each verse
     - Generate a random number in the range [7, 10]
     - Select the first word
     - Select subsequent words until the end of the verse
     - [bonus] If not the first verse, try to rhyme the last word with the last word of the previous verse
     - Print the verse
   - Print an empty line after the stanza
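
The bonus rhyming step in the list above could be approached in many ways. The following is only a minimal sketch, assuming that two words "rhyme" when they share their last two characters. The helper names `rhymes` and `pick_last_word` and the `suffix_len` parameter are hypothetical, and a real solution would more likely use a hand-built rhyming dictionary.

```python
import random

def rhymes(word_a, word_b, suffix_len=2):
    """Crude rhyme test: both words end in the same `suffix_len` characters."""
    if len(word_a) < suffix_len or len(word_b) < suffix_len:
        return False
    return word_a[-suffix_len:] == word_b[-suffix_len:]

def pick_last_word(candidates, previous_last_word, rng=random):
    """Prefer a candidate that rhymes with the previous verse; otherwise pick any."""
    rhyming = [w for w in candidates if rhymes(w, previous_last_word)]
    return rng.choice(rhyming or candidates)

# Candidate words that the n-gram model proposed for the last position (toy data)
candidates = ["پیاس", "خواب", "دل", "اداس"]
print(pick_last_word(candidates, "یاس"))  # one of the two candidates ending in the same letters
```

A suffix comparison on written characters ignores pronunciation, so it is only a rough stand-in for a proper rhyme check.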
2.1 Implementation Challenges:

One of the challenges in this assignment is selecting subsequent words once the first word of the verse has been chosen. To predict the next word, we want the most probable next word among all possible next words. In other words, we need to find the set of words that occur most frequently after the already selected word and choose the next word from that set. We can use a Conditional Frequency Distribution (CFD) for this: given a condition, a CFD tells us the likelihood of each possible outcome. [bonus] Rhyming the generated verses is also a challenge. You can build your own dictionary for rhyming. Urdu is written from right to left, so build your n-gram models accordingly.
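
To make the CFD idea concrete, here is a small sketch that builds one from a token list using only the standard library, as a stand-in for `nltk.ConditionalFreqDist`. The function names `build_cfd` and `next_word` are illustrative, and the toy corpus is just the first two verses of the example poem.

```python
import random
from collections import Counter, defaultdict

def build_cfd(tokens):
    """Map each word (the condition) to a Counter of the words that follow it."""
    cfd = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        cfd[prev][nxt] += 1
    return cfd

def next_word(cfd, prev, rng=random):
    """Sample a follower of `prev` in proportion to its count; None if `prev` has no followers."""
    followers = cfd.get(prev)
    if not followers:
        return None
    words, counts = zip(*followers.items())
    return rng.choices(words, weights=counts)[0]

tokens = ("دل سے نکال یاس کہ زندہ ہوں میں ابھی "
          "ہوتا ہے کیوں اداس کہ زندہ ہوں میں ابھی").split()
cfd = build_cfd(tokens)
print(cfd["کہ"].most_common(1))  # [('زندہ', 2)]
print(next_word(cfd, "کہ"))      # زندہ (the only follower in this toy corpus)
```

Sampling in proportion to the counts, rather than always taking the single most frequent follower, keeps the generated verses from repeating the same phrase every time.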

2.2 Standard n-gram Models
We can develop our models using the Conditional Frequency Distribution method. First develop a unigram model, then a bigram model, and then a trigram model. Select the first word of each verse randomly from the starting words in the vocabulary, and then use the bigram model to generate each next word until the verse is complete. Generate the remaining three verses of the stanza similarly.
Follow the same steps for the trigram model and compare the results of the two n-gram models.
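
For the trigram variant, one possible approach (a sketch, not the required solution) is to condition on the previous two words and back off to the bigram model when that context was never seen. The names `build_ngram_model` and `generate_verse` are hypothetical, and for brevity the first word is drawn from all tokens rather than only from verse-initial words.

```python
import random
from collections import Counter, defaultdict

def build_ngram_model(tokens, n):
    """Map each (n-1)-word context tuple to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def generate_verse(tokens, bigrams, trigrams, length, rng=random):
    """Generate `length` words: trigram context first, then bigram, then a random token."""
    verse = [rng.choice(tokens)]
    while len(verse) < length:
        followers = trigrams.get(tuple(verse[-2:])) or bigrams.get(tuple(verse[-1:]))
        if followers:
            words, counts = zip(*followers.items())
            verse.append(rng.choices(words, weights=counts)[0])
        else:
            verse.append(rng.choice(tokens))
    return verse

tokens = ("دل سے نکال یاس کہ زندہ ہوں میں ابھی "
          "ہوتا ہے کیوں اداس کہ زندہ ہوں میں ابھی").split()
bigrams = build_ngram_model(tokens, 2)
trigrams = build_ngram_model(tokens, 3)
print(' '.join(generate_verse(tokens, bigrams, trigrams, length=7)))
```

On a small corpus the trigram model tends to reproduce long runs of the training text, which is one of the differences worth noting when comparing it with the bigram model.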

In [79]:
import random
import nltk

# Load the poetry corpora.
# Replace 'iqbal.txt' and 'ghalib.txt' with the paths to your corpus files.
corpus_files = ['iqbal.txt', 'ghalib.txt']
corpora = []
for path in corpus_files:
    with open(path, 'r', encoding='utf-8') as f:
        corpora.append(f.read())

# Tokenize each corpus into a list of words
words = [nltk.word_tokenize(corpus) for corpus in corpora]

# Unigram model: word frequencies for each corpus
unigram_models = [nltk.FreqDist(word_list) for word_list in words]
# Bigram model: for each word, the frequency distribution of the words that follow it
bigram_models = [nltk.ConditionalFreqDist(nltk.bigrams(word_list)) for word_list in words]

# Generate poetry: 3 stanzas of 4 verses, each verse 7 to 10 words long
for stanza in range(3):
    for verse in range(4):
        verse_length = random.randint(7, 10)
        # Pick one corpus for this verse and start from a random word in it
        corpus_idx = random.randrange(len(corpora))
        verse_words = [random.choice(words[corpus_idx])]

        while len(verse_words) < verse_length:
            followers = bigram_models[corpus_idx][verse_words[-1]]
            if followers:
                # Sample the next word in proportion to its bigram count
                candidates = list(followers.keys())
                weights = list(followers.values())
                word = random.choices(candidates, weights=weights)[0]
            else:
                # No bigram continuation: switch to a random corpus and pick a random word
                corpus_idx = random.randrange(len(corpora))
                word = random.choice(words[corpus_idx])
            verse_words.append(word)

        # Join the words in reverse order of generation and print the verse
        print(' '.join(verse_words[::-1]))

    if stanza < 2:
        # Print an empty line between stanzas
        print()


افکار کے نہنگوں بھی اور سہی تو کا حجابی بے
‘ میں اس ہے تاثیر کشِ زبونی نالہ ‘ آبادی
عثمانی ترکان کم زر اب ہیں بھی
پیچ ہوں جاتا کو اسی کا شاہيں ،

کر پیدا جنوں اے چل آیا مقام کا نگاہ
طالب کا ایجاد ستم بیٹھا اداس کوئی ہر
پرواز ہے گرچہ گزر سے منزلوں جسے ہے کام ہے
وصل ذوق بے پرکاری و کیخسرو و

ہے کرے حسین رہِ گردِ پایۂ ہے
گئیں ہو کہتے سچ آگے مرے کا مژگاں کاوشہائے
ہے اٹھتا قلم غبار بے وعدۂ ہو گدا جو بھی
میں جناب ہماری آبرو کی خواروں بادہ


# Question 2
Classify the language of a text, out of the list given below, using only stop words. Remove punctuation and convert the text to lowercase.
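
The preprocessing the task asks for (lowercasing and punctuation removal) can be sketched with the standard library alone. The function name `preprocess` and the sample sentence are illustrative, and `string.punctuation` covers ASCII punctuation only.

```python
import string

def preprocess(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table).split()

print(preprocess("An article is, per se, a Member of a class!"))
# ['an', 'article', 'is', 'per', 'se', 'a', 'member', 'of', 'a', 'class']
```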

In [1]:
!pip install nltk




In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Maryam\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [13]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Maryam\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Maryam\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


nltk.download('stopwords')
nltk.download('punkt')


In [17]:


import string

text = "An article is qualunque member van un class of dedicated words naquele estão used with noun phrases per mark the identifiability of the referents of the noun phrases"
# Lowercase the text and remove punctuation, as the task requires
text = text.lower().translate(str.maketrans('', '', string.punctuation))
print(text)

words = word_tokenize(text)

# All languages for which NLTK provides a stop word list
available_languages = stopwords.fileids()

language_scores = {}

# Score each language by how many distinct words of the text appear in its stop word list
for language in available_languages:
    stop_words_list = set(stopwords.words(language))
    common_stopwords = set(words) & stop_words_list
    language_scores[language] = len(common_stopwords)

# Display each language's score
language_scores

an article is qualunque member van un class of dedicated words naquele estão used with noun phrases per mark the identifiability of the referents of the noun phrases


{'arabic': 0,
 'azerbaijani': 1,
 'basque': 0,
 'bengali': 0,
 'catalan': 3,
 'chinese': 0,
 'danish': 0,
 'dutch': 3,
 'english': 5,
 'finnish': 0,
 'french': 1,
 'german': 1,
 'greek': 0,
 'hebrew': 0,
 'hinglish': 8,
 'hungarian': 1,
 'indonesian': 1,
 'italian': 2,
 'kazakh': 0,
 'nepali': 0,
 'norwegian': 0,
 'portuguese': 1,
 'romanian': 1,
 'russian': 0,
 'slovene': 0,
 'spanish': 1,
 'swedish': 0,
 'tajik': 0,
 'turkish': 0}
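
To turn these scores into a classification, the highest-scoring language can be selected. The sketch below copies a few of the scores from the output above into a literal dictionary so that it runs on its own.

```python
# A subset of the scores shown in the output above
language_scores = {'catalan': 3, 'dutch': 3, 'english': 5, 'hinglish': 8, 'italian': 2}

# Rank languages by score, highest first
ranked = sorted(language_scores.items(), key=lambda item: item[1], reverse=True)
best_language, best_score = ranked[0]
print(best_language, best_score)  # hinglish 8
```

For this mixed-language sentence the top score goes to 'hinglish' rather than 'english', presumably because that stop word list overlaps heavily with English function words, so a raw overlap count should be read with some caution.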


# Question 3
Rule-Based Roman Urdu Text Normalization

Roman Urdu lacks a standard lexicon, and many spelling variations usually exist for a given word; e.g., the word zindagi (life) is also written as zindagee, zindagy, zaindagee and zndagi. In this question, you have to normalize Roman Urdu words using the rules given in the attached PDF. Your code should work for a complete sentence or for multiple sentences.

For example, zaroori, zaruri and zarori all map to 'zrory', so zrory becomes the normalized form for all the representations mentioned above.
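
As a worked illustration of how the three spellings can collapse to the same form, the sketch below applies just four rules in sequence: a final 'i' becomes 'y', 'oo' becomes 'o', 'u' becomes 'o', and 'ar' becomes 'r' unless the word starts with it. These four are an assumed subset chosen for this example; the authoritative rule set is the one in the attached PDF, and the function name `normalize_sketch` is hypothetical.

```python
def normalize_sketch(word):
    """Apply a small, assumed subset of normalization rules to one word."""
    if word.endswith('i'):           # final 'i' -> 'y'
        word = word[:-1] + 'y'
    word = word.replace('oo', 'o')   # collapse doubled 'o'
    word = word.replace('u', 'o')    # 'u' -> 'o'
    if not word.startswith('ar'):    # 'ar' -> 'r', except at the start of a word
        word = word.replace('ar', 'r')
    return word

for variant in ['zaroori', 'zaruri', 'zarori']:
    print(variant, '->', normalize_sketch(variant))
# zaroori -> zrory
# zaruri -> zrory
# zarori -> zrory
```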

In [60]:
import nltk

nltk.download('punkt')

def normalize_roman_urdu_sentence(sentence):
    # Tokenize the sentence into words
    words = nltk.word_tokenize(sentence)

    # Replacements applied only at the start of a word
    start_rules = [
        ('ai', 'ae'),
        ('es', 'is'),
        # Add more start rules as needed
    ]
    # Replacements applied only at the end of a word
    end_rules = [
        ('ain', 'ein'),
        ('ay', 'e'),
        ('ey', 'e'),
        ('ie', 'y'),
        ('ai', 'ae'),
        ('ri', 'ry'),
        # Add more end rules as needed
    ]
    # Doubled letters collapsed anywhere in a word
    multiple_rules = [
        ('aa', 'a'),
        ('ee', 'i'),
        ('ss', 's'),
        ('jj', 'j'),
        ('oo', 'o'),
        ('dd', 'd'),
    ]
    # Replacements applied anywhere in a word
    rand_rules = [
        ('ai', 'ae'),
        ('ih', 'eh'),
        ('iy', 'i'),
        ('u', 'o'),
    ]
    # Replacements skipped when the word ends with the pattern
    except_end = [
        ('ry', 'ri'),
        ('sy', 'si'),
        ('ty', 'ti'),
    ]
    # Replacements skipped when the word starts with the pattern
    except_start = [
        ('ar', 'r'),
    ]

    # Apply the rules to each word in the sentence
    normalized_words = []
    for word in words:
        # A final 'i' becomes 'y'
        if word.endswith('i'):
            word = word[:-1] + 'y'
        for pattern, replacement in start_rules:
            if word.startswith(pattern):
                word = replacement + word[len(pattern):]
        for pattern, replacement in end_rules:
            if word.endswith(pattern):
                word = word[:-len(pattern)] + replacement
        for pattern, replacement in multiple_rules:
            while pattern in word:
                word = word.replace(pattern, replacement)
        for pattern, replacement in rand_rules:
            word = word.replace(pattern, replacement)
        for pattern, replacement in except_end:
            if not word.endswith(pattern):
                word = word.replace(pattern, replacement)
        for pattern, replacement in except_start:
            # Check the start of the word (the original checked the end here)
            if not word.startswith(pattern):
                word = word.replace(pattern, replacement)
        if "ihh" in word:  # needs work
            word = word.replace("ihh", "ey")
            while "hh" in word:
                word = word.replace("hh", "")
                word = word.replace("ey", "eh")
        if "iyy" in word:  # needs work
            word = word.replace("iy", "I")
            while "yy" in word:  # was `string`, an undefined name
                word = word.replace("y", "")
        normalized_words.append(word)

    return normalized_words  # Return the list of normalized words

# Test the function with a sentence
sentence_to_normalize = input("Enter the word: ")
normalized_words = normalize_roman_urdu_sentence(sentence_to_normalize)
normalized_sentence = ' '.join(normalized_words)

print("Normalized Sentence:", normalized_sentence)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Maryam\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Enter the word: maryam
Normalized Sentence: maryam


In [16]:

def segment_text(text, dictionary):
    n = len(text)
    segmented_text = []
    i = 0

    while i < n:
        found = None

        for j in range(i + 18, i, -1):  # Consider a maximum word length of 18 characters
            if text[i:j] in dictionary:
                found = text[i:j]
                break

        if found:
            segmented_text.append(found)
            i += len(found)
        else:
            segmented_text.append(text[i])
            i += 1

    return ' '.join(segmented_text)

# Load the Urdu dictionary from a text file into a set
with open('words.txt', 'r', encoding='utf-8') as file:
    urdu_dictionary = {line.strip() for line in file}

# Load the Urdu text from a text file
with open('word_test.txt', 'r', encoding='utf-8') as file:
    urdu_text = file.read()

# Segment the Urdu text by greedy longest-match lookup in the dictionary
segmented_urdu_text = segment_text(urdu_text, urdu_dictionary)

# Print the segmented text
print(segmented_urdu_text)


تجربہ کارہ ندو ستانی آفس پنر روی چندر نا یش ون نے آئند ہا یش یا ء کپ 2 0 2 3 ء کی غیری قی نی قسمت پراپن ی رائے کااظہار کیاہے جوپ اکستا نمی ں ہونے جارہاہے اپنے یوٹیوب چینل پربا تکر تے ہوئے روی چندر نا یش ون نےکہا کہا گرپ ڑ وسیم لک بھارت ایشیا کپ 2 0 2 3 ء میں شرکت کرنا چاہتاہے توم
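
Some of the odd splits in the output above may come from the greedy nature of longest-match segmentation: once the longest dictionary word at a position is taken, the choice is never revisited. The sketch below reproduces the same strategy on a toy English dictionary to show the effect; `max_match` and `toy_dictionary` are illustrative names, and gaps in the dictionary are another likely cause.

```python
def max_match(text, dictionary, max_len=18):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit a single character and move on
            tokens.append(text[i])
            i += 1
    return tokens

toy_dictionary = {"the", "cat", "cats", "sat", "at"}
print(max_match("thecatsat", toy_dictionary))  # ['the', 'cats', 'at']
```

Here the greedy choice of "cats" prevents the reading "the cat sat". Scoring candidate segmentations with word frequencies (for example, a unigram model) is one common way to reduce such errors.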
