#### Question

##### Task1:

- Load any 1000-page PDF, create a DataFrame of the first 100 pages, extract human names, and mask the names (the first and last letters should not be masked).

Expected output:
<table>
  <colgroup>
    <col style="width: 5%">
    <col style="width: 40%">
    <col style="width: 40%">
    <col style="width: 15%">
  </colgroup>
  <thead>
    <tr>
      <th>Page Number</th>
      <th>Page Content</th>
      <th>Masked Content</th>
      <th>Extracted Names</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td style="white-space: normal;">we are group of 4 people; Narendra was one who is from another city…..........</td>
      <td style="white-space: normal;">we are group of 4 people, N******a was one who is from another city…..........</td>
      <td>Narendra</td>
    </tr>
    <tr>
      <td>2</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
  </tbody>
</table>

		

____________________


Let's start by importing the necessary libraries. I am using the nltk package to process the text. From my research, 'maxent_ne_chunker' is NLTK's named-entity chunker, which labels PERSON entities (among other entity types) and so can be used to extract human names.

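If you want to run the code, install nltk and PyMuPDF, then download the NLTK resources with nltk.download('maxent_ne_chunker') and nltk.download('words'). The tokenizer, POS tagger, and stop-word list used below also need nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'), and nltk.download('stopwords'). Resource names can differ in newer NLTK releases.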

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import fitz
import nltk
from nltk import sent_tokenize, ne_chunk, word_tokenize, pos_tag

In [80]:
pdf = fitz.open('War and Peace.pdf')

page_text_all = []

for page_no in range(102):
    page = pdf[page_no]
    page_text = page.get_text()
    page_text_all.append(page_text)

The extracted text contains many '\n' characters, so we'll replace them with spaces.

In [81]:
import re

cleaned_text_all = []

for pages in page_text_all:
    cleaned_text = re.sub(r'\n', ' ', pages)
    cleaned_text_all.append(cleaned_text)

Each page carries a header of the form "War and Peace    25 of 2882 ". It is not needed, so we can remove it.

In [82]:
pattern = r'War and Peace\s+\d+\s+of\s+\d+'

cleaned_doc_all = []

for pages in cleaned_text_all:
    cleaned_doc = re.sub(pattern, ' ', pages)
    cleaned_doc_all.append(cleaned_doc)
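
The header removal above can be checked on a single string before running it over every page. The sketch below uses the same pattern on a made-up sample page (the sample text and the extra whitespace-collapsing step are assumptions added for illustration, not part of the original pipeline).

```python
import re

# Same header pattern as in the cell above.
HEADER_PATTERN = r'War and Peace\s+\d+\s+of\s+\d+'

# Hypothetical sample page, with newlines already replaced by spaces.
sample_page = "War and Peace    25 of 2882  Prince Vasili looked at Anna Pavlovna."

without_header = re.sub(HEADER_PATTERN, ' ', sample_page)

# Assumed extra step: collapse the runs of spaces left behind by the substitutions.
tidy = re.sub(r'\s+', ' ', without_header).strip()

print(tidy)
```

Collapsing whitespace is optional here, because `word_tokenize` later ignores repeated spaces anyway.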

Removing the 'Chapter ...' headings

In [83]:
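# Note: this pattern only matches when a colon follows the numeral. The headings
# in this PDF appear as 'Chapter I' with no colon (see the DataFrame preview below),
# so they are not removed here and 'Chapter' shows up later among the extracted names.
pattern = r'Chapter (I|II|III|IV|V|VI|VII|VIII|IX|X|XI|XII|XIII|XIV):'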

Cleaned_doc_all_2 = []

for pages in cleaned_doc_all:
    cleaned_doc_2 = re.sub(pattern, ' ', pages)
    Cleaned_doc_all_2.append(cleaned_doc_2)
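
The chapter pattern above expects a colon after the Roman numeral. A possible variant that does not need the colon is sketched below. Both `numeral_pattern` and the sample sentence are assumptions made for illustration, and the variant has not been run against the full PDF.

```python
import re

# Pattern from the cell above: it requires a colon right after the numeral.
colon_pattern = r'Chapter (I|II|III|IV|V|VI|VII|VIII|IX|X|XI|XII|XIII|XIV):'

# Assumed alternative: any run of I, V, X after 'Chapter', with no colon required.
# The trailing \b stops it from matching the start of a word such as 'Ivan'.
numeral_pattern = r'Chapter\s+[IVX]+\b'

# Hypothetical sample text with a heading and no colon.
sample = "Chapter XIV After receiving her visitors, the countess was tired."

unchanged = re.sub(colon_pattern, ' ', sample)
stripped = re.sub(numeral_pattern, ' ', sample).strip()

print(unchanged == sample)  # the colon pattern finds nothing to replace
print(stripped)
```

Switching to a pattern like this would change the later outputs in this notebook, since 'Chapter' would no longer reach the name extractor.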

The first 2 pages are not needed because they contain no relevant content.

In [84]:
final_text = Cleaned_doc_all_2[2:]

Now let's remove stop words and punctuation marks

In [85]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

filtered_paragraphs = []

for paragraph in final_text:

    words = word_tokenize(paragraph)

    filtered_words = [word for word in words if word.lower() not in stop_words]

    filtered_paragraph = ' '.join(filtered_words)

    filtered_paragraphs.append(filtered_paragraph)


In [86]:
import string

processed_pages = []

for page in filtered_paragraphs:

    # Iterating over a string yields characters, so this drops each punctuation character
    page_without_punctuation = ''.join([char for char in page if char not in string.punctuation])
    
    processed_pages.append(page_without_punctuation)

Removing "‘" and "’"

In [87]:
processed_text = list(map(lambda x:x.replace("‘", "").replace("’", ""),processed_pages))
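
The punctuation step and the curly-quote step can also be done in one pass with `str.translate`. This is a minimal sketch, assuming that the double curly quotes should be removed as well. `strip_punctuation` is a hypothetical helper name, not something used elsewhere in this notebook.

```python
import string

# ASCII punctuation plus the curly quotes that string.punctuation does not cover.
# Including the double curly quotes is an assumption.
chars_to_remove = string.punctuation + "‘’“”"
removal_table = str.maketrans('', '', chars_to_remove)

def strip_punctuation(text):
    """Delete every character listed in chars_to_remove from text."""
    return text.translate(removal_table)

print(strip_punctuation("‘Well, Prince!’ said Anna."))
```

A translation table is built once and reused for every page, which avoids the character-by-character list comprehension.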

The text is now ready for name extraction.

All of the steps above could be written more compactly, but I went step by step to make each stage clear. A compact version is given at the end of this notebook.

Now let's extract the names

In [88]:
data = { 
    "Page_Number": [],
    "Page_Content": [],
    "Processed_Masked_Content": [],
}

df = pd.DataFrame(data)

In [89]:
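df["Page_Number"] = list(range(1, 101))
# Note: processed_text was built from pages [2:], so page_text_all[:100] is offset
# from it by two pages; page_text_all[2:102] would pair each row with its own text.
df["Page_Content"] = page_text_all[:100]
df["Processed_Masked_Content"] = processed_text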

In [90]:
df

Unnamed: 0,Page_Number,Page_Content,Processed_Masked_Content
0,1,\nWar and Peace \nLeo Tolstoy \n \n \n \n \n ...,Chapter Well Prince Genoa Lucca family esta...
1,2,War and Peace \n \n2 of 2882 \nBOOK ONE: 1805 \n,nothing better Count Prince prospect spen...
2,3,War and Peace \n \n3 of 2882 \nChapter I \n‘We...,fete English ambassador Today Wednesday mu...
3,4,War and Peace \n \n4 of 2882 \n‘If you have no...,spoiled child continual consciousness charmin...
4,5,War and Peace \n \n5 of 2882 \n‘And the fete a...,desires good mankind promised Nothing littl...
...,...,...,...
95,96,War and Peace \n \n96 of 2882 \nShe was experi...,funny said bending blushing still waited ...
96,97,"War and Peace \n \n97 of 2882 \n‘Oh, how nice,...",Chapter XIV receiving visitors countess tired...
97,98,War and Peace \n \n98 of 2882 \n‘How funny you...,told sooner Mamma would gone replied rose...
98,99,War and Peace \n \n99 of 2882 \nChapter XIV \n...,age secrets Natasha Boris two nonsense ...


In [91]:
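def extract_names(text):
    """Return a comma-separated string of PERSON entities found in text, or None."""
    names = set()  # a set drops names repeated on the same page
    words = word_tokenize(text)
    tagged = pos_tag(words)
    named_entities = ne_chunk(tagged)
    for subtree in named_entities:
        # Named entities are nltk.Tree nodes; ordinary tokens are (word, tag) tuples
        if isinstance(subtree, nltk.Tree) and subtree.label() == 'PERSON':
            name = ' '.join(word for word, tag in subtree.leaves())
            names.add(name)
    return ', '.join(names) if names else None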

df['Extracted_names'] = df["Processed_Masked_Content"].apply(extract_names)


Let's check the extracted names for errors; I'll remove any wrong entries manually.

In [94]:
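# Print every page's names back to back. No separator is written between pages,
# so the last name of one page runs into the first name of the next
# (for example 'AntichristCount Prince' in the output below).
for name in df.Extracted_names:
    if name:
        print(name, end='')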

Anna Pavlovna, Anna Pavlovna Scherer, Prince Vasili Kuragin, Chapter, St Petersburg, AntichristCount Prince, Annette Scherer Heavens, Anna PavlovnaVasili, Well, Buonaparte, Anna Pavlovna SchererAlexander, Anna Pavlovna, Malta, Emperor, None EnglishPrussia, Mortemart, Emperor, Haugwitz, Abbe MorioAnna Pavlovna, Dowager Empress Marya Fedorovna, Vasili, Dowager Empress, Baron Funke, Majesty, Empress, Empress Anna Pavlovna, ViennaLavaterMajesty, Anna Pavlovna, AnatolePresently, Arrange, Anna Pavlovna, Annette, Lise Meinen, Emperor, Kutuzov, BolkonskiLise, Anna Pavlovna, AttendezAnna Pavlovna, Helene, Mortemart, Prince Vasili, Chapter, Abbe MorioMajesty, Anna PavlovnaAnna Pavlovna, SoyezAnna Pavlovna, Count Bezukhov, Pierre, Prince Vasili, Prince Vasili Anna Pavlovna OneYes, Anna Pavlovna, Pierre, Abbe Morio, FirstPierre, Anna Pavlovna, PetersburgMorioChapter, Anna Pavlovna, Mortemart Anna PavlovnaAnna Pavlovna, Buonaparte, Mortemart, Vicomte, Louis XV, Duc EnghienAnna Pavlovna Helene, Anna
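
Entries such as 'AntichristCount Prince' in the output above come from two pages being printed with nothing in between. One way to review the names without that fusion is to split each row and collect the parts in a set. The sketch below assumes the rows are comma-separated strings or None, as returned by `extract_names`. `collect_unique_names` and the sample rows are hypothetical.

```python
def collect_unique_names(name_strings):
    """Split each comma-separated string and return the sorted unique names."""
    unique = set()
    for names in name_strings:
        if not names:  # skip None and empty strings
            continue
        unique.update(part.strip() for part in names.split(','))
    unique.discard('')  # drop empty parts caused by stray commas
    return sorted(unique)

# Hypothetical rows standing in for df['Extracted_names'].
rows = ["Anna Pavlovna, Antichrist", "Count Prince, Anna Pavlovna", None]
print(collect_unique_names(rows))
```

With the real column, the call would be `collect_unique_names(df['Extracted_names'])`, which removes the need to paste the printed output back into a string.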

In [97]:
text = 'Anna Pavlovna, Anna Pavlovna Scherer, Prince Vasili Kuragin, Chapter, St Petersburg, AntichristCount Prince, Annette Scherer Heavens, Anna PavlovnaVasili, Well, Buonaparte, Anna Pavlovna SchererAlexander, Anna Pavlovna, Malta, Emperor, None EnglishPrussia, Mortemart, Emperor, Haugwitz, Abbe MorioAnna Pavlovna, Dowager Empress Marya Fedorovna, Vasili, Dowager Empress, Baron Funke, Majesty, Empress, Empress Anna Pavlovna, ViennaLavaterMajesty, Anna Pavlovna, AnatolePresently, Arrange, Anna Pavlovna, Annette, Lise Meinen, Emperor, Kutuzov, BolkonskiLise, Anna Pavlovna, AttendezAnna Pavlovna, Helene, Mortemart, Prince Vasili, Chapter, Abbe MorioMajesty, Anna PavlovnaAnna Pavlovna, SoyezAnna Pavlovna, Count Bezukhov, Pierre, Prince Vasili, Prince Vasili Anna Pavlovna OneYes, Anna Pavlovna, Pierre, Abbe Morio, FirstPierre, Anna Pavlovna, PetersburgMorioChapter, Anna Pavlovna, Mortemart Anna PavlovnaAnna Pavlovna, Buonaparte, Mortemart, Vicomte, Louis XV, Duc EnghienAnna Pavlovna Helene, Anna PavlovnaHelene Wait, Prince Hippolyte Fetch, Anna Pavlovna, MadameHippolyteAnna Pavlovna, Bonaparte, Napoleon, Prince Hippolyte, Mademoiselle George, Duc EnghienPierre, Anna PavlovnaPierre, Anna PavlovnaAnna Pavlovna, Chapter IV, Andrew Bolkonski, Prince, Lise, Frenchman, Kutuzov, BolkonskiPrince Andrew, George Buonaparte Prince, Pierre, Andrew, AndrePrince Vasili Frenchman, Anna Pavlovna, Prince Andrew, Pierre, Prince Vasili, Vicomte, Anna Pavlovna EducateBoris Prince, Prince Vasili, Guards, Emperor, Petersburg Tell, Rumyantsev Prince Golitsyn, Petersburg, Believe PrincessListen Prince, Anna Pavlovna, Golitsyn, Papa, Vasili, Princess HelenePrince Vasili, Guards, Anna Mikhaylovna, Michael Ilarionovich Kutuzov, Boris, WaitGoodby, Papa, Vasili, Anna Mikhaylovna, KutuzovAnna Pavlovna, Dieu, Guai, Andrew, Milan, Genoa Lucca, Monsieur Buonaparte Monsieur Buonaparte, Chapter VLouis XVII Queen Madame Elizabeth Nothing, Conde, Baton, BourbonAnna Pavlovna, Pierre, Bonaparte, Vicomte, 
Imperial, Emperor Alexander, Prince Andrew MonsieurAnna Pavlovna, Prince Andrew, Pierre, Andrew, Bonaparte, Monsieur Pierre, Napoleon, Duc EnghienAnna Pavlovna, Capital, Dieu Mon Dieu, Pierre, Anna Pavlovna Pierre, Bourbons, Prince Hippolyte EnglishMonsieur Pierre, Yes, Bourbons, Anna Pavlovna Rousseau ContratLiberty, Buonaparte, Anna Pavlovna, Pierre, Andrew, Saviour, NapoleonPierre, Jacobin, Prince Andrew, Prince Hippolyte PierreExcuse Vicomte, Hippolyte, Arcola, Pierre, Andrew, Jaffa, Prince Hippolyte, Andrew NapoleonHippolyte, Girl, Anna Pavlovna, Pierre, Oh, Prince HippolyteMonsieur Pierre, Pierre, Anna Pavlovna, Chapter VIAnna Pavlovna, Hippolyte, Prince Andrew, Write, Anatole, AnnetteEither, Prince HippolyteHippolyte, Prince Andrew Russian, Pierre, Allow, Prince HippolyteHippolyte, Prince Andrew, Pierre, Andrew, Mlle Scherer, Caesar CommentariesPrince Andrew, Pierre, Write, Horse Guards, Andrew, Freemason, PetersburgPierre Prince, Pierre, Andrew, England AustriaAnna Pavlovna, Pierre, Annette, Andrew, Excuse, Chapter, Evidently PierreHippolyte, Pierre, Annette, Andrew, Uncle, Andre, Apraksins, EmperorPierre, Prince AndrewPrince Andrew, Pierre, Andrew, Monsieur Pierre, Lise, Lise PrincePrince Andrew, Pierre, Calm Princess, Andrew, Goodby Prince, LiseMon, Dieu, LiseHalfway, Neither, Pierre, Andrew, Marry, Prince Andrew Prince, ChapterPierre, Bolkonski, Anna Pavlovna, Prince AndrewPierre, Prince Andrew Anna Pavlovna, Bonaparte Bonaparte, BonapartePierre, Andrew, Prince AndrewPierre, Andrew, Prince AndrewPrince Andrew, Andrew Women, Kuragins, Pierre, Anatole, Prince Vasili KuraginPierre, Chapter IX, Andrew, Prince Anatole, Kuragin, Anatole KuraginPierre, Horse Guards, Anatole, Cards, KuraginSemenov, Pierre, Bruin, Anatole Pierre, Anatole, Petya Good, Pierre First, Dolokhov, Stevens, Jacob, WaitPierre, Anatole, Stevens English, Anatole Pierre Dolokhov, EnglishmanAnatole, Pierre Pierre, Hercules, Dolokhov, Petersburg, Anatole Dolokhov, Kuragin DolokhovPierre, 
Dolokhov, Dolokhov Englishman, Anatole Firstrate, Englishman, Listen, Englishman AnatoleShut, Oh Oh Oh, English Wait, Anatole, Dolokhov, Englishman, Kuragin Listen, WaitPierre, Anatole, Eh Eh DolokhovPierre, Dolokhov, Anatole, EnglishmanPierre, Dolokhov, Fine, EnglishmanListen, Pierre Come, Bruin, Anatole WaitSoon Anna Pavlovna, Anna Pavlovna, Chapter X Prince Vasili, Rostovs, Bory, Guards, Emperor, Semenov Guards, Anna Mikhaylovna, Petersburg, Nataly Ever, Radzivilov St Natalia, BorisWell Dmitri, Dmitri VasilevichMarya Lvovna Karagina, Ask, Razumovski, Dear CountessAnna Pavlovna, Count Bezukhov, Bezukhov, Dolokhov, Anna Mikhaylovna Prince VasiliCount Yet, Moyka Canal, Dolokhov, Petersburg, Cyril Vladimirovich Bezukhov, Anatole Kuragin, Marya Ivanovna DolokhovaPierre, Prince Vasili, Count Cyril, Emperor, Anna MikhaylovnaBesides Cyril Vladimirovich, Yes, Vasili, Bory, Pierre Prince Vasili Forty, Count Cyril VladimirovichChapter, Ah, GuardsIlya, NatashaNicholas, Natasha, Anna Mikhaylovna, Sonya, Natasha Mimi, BorisNicholas, Mimi, Apraksina, Natasha, Dark, BorisMamma, Natasha, BorisChapter, Nicholas SonyaNicholas, Natasha Boris, BorisNicholas, Pavlograd Hussars, Buonaparte, Emperor Well, Papa, Rostov, Bonaparte Julie Karagina, Sonya, ArkharovsNicholas, Yes, Julie, Anna Mikhaylovna, SonyaNicholas, TakesVera, Yes, Boris, SalomoniVera, WellChapter, Sonya, Natasha, BorisNicholas, Sooonya Look, Natasha, Sonya, Sonya AhBoris, Boris Boris, Natasha Sonya NicholasNatasha, Boris Forever, TillChapter, Vera, Petersburg Anna Mikhaylovna, Anna MikhaylovnaNicholas, Sonya Natasha, Boris Natasha, Mamma, Though, Vera, SonyaBerg, Natasha Boris, Boris Natalya Ilynichna, Mamma, Natasha, Madame, Vera, BorisBerg, Genlis, Nicholas, Genlis Madame, Madame, Vera, Vera Nicholas'
names_list = list(set(text.split(", ")))
names_list

['ChapterPierre',
 'Genlis',
 'Count Bezukhov',
 'Chapter IV',
 'Calm Princess',
 'Prince Vasili',
 'Uncle',
 'Saviour',
 'Petya Good',
 'Dolokhov',
 'Anna Pavlovna Scherer',
 'Imperial',
 'Stevens',
 'Madame',
 'St Petersburg',
 'Michael Ilarionovich Kutuzov',
 'Milan',
 'Prince Vasili KuraginPierre',
 'WaitPierre',
 'AntichristCount Prince',
 'Annette Scherer Heavens',
 'Kuragin DolokhovPierre',
 'Prince Andrew Anna Pavlovna',
 'Bezukhov',
 'Prince Andrew MonsieurAnna Pavlovna',
 'Kuragins',
 'Annette',
 'Cards',
 'Freemason',
 'Dowager Empress',
 'Oh',
 'Chapter',
 'Princess HelenePrince Vasili',
 'AnnetteEither',
 'EmperorPierre',
 'Caesar CommentariesPrince Andrew',
 'Buonaparte',
 'EnglishmanAnatole',
 'Bonaparte',
 'Natasha Sonya NicholasNatasha',
 'Excuse',
 'Vasili',
 'Hercules',
 'Bonaparte Julie Karagina',
 'Haugwitz',
 'Genlis Madame',
 'Anna Mikhaylovna',
 'Andrew NapoleonHippolyte',
 'Mortemart Anna PavlovnaAnna Pavlovna',
 'Lise PrincePrince Andrew',
 'Prince Andrew',
 '

Only a few of the extracted entries are not names, so I picked out these waste words manually.

In [96]:
waste_word = ['Chapter IV','Milan','St Petersburg','Antichrist','Cards','Chapter','Oh','Excuse','Englishman','Majesty','Allow','Dark','Well','Chapter IX','Oh Oh Oh','EnglishmanListen','WellChapter','English Wait','Chapter VLouis XVII Queen Madame Elizabeth Nothing','Fine','Write','England AustriaAnna Pavlovna','Arrange','Goodby Prince','Chapter VIAnna Pavlovna','Ask','Ah','Guards','Listen','Yes', 'TillChapter','Till']

Now I'll remove these words from the names column.

In [98]:
def remove_waste_words(name):
    if name is None:
        return None  # Handle None values
    name_parts = name.split(", ")
    clean_name = [part for part in name_parts if part not in waste_word]
    return ", ".join(clean_name)


# Apply the function to the 'Extracted_names' column
df['Extracted_names'] = df['Extracted_names'].apply(remove_waste_words)


In [100]:
for index, row in df.iterrows():
    content = row['Processed_Masked_Content']
    names = row['Extracted_names']
    
    if names is not None:
        names = names.split(', ')
    else:
        names = []
    
    tokens = nltk.word_tokenize(content)
    
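    for i, token in enumerate(tokens):
        # Only single tokens are compared, so multi-word entries such as
        # 'Anna Pavlovna' never match and stay unmasked.
        if len(token) > 2 and token in names:
            tokens[i] = token[0] + '*' * (len(token) - 2) + token[-1]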
    
    masked_content = ' '.join(tokens)
    df.at[index, 'Processed_Masked_Content'] = masked_content
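
The loop above compares one token at a time, so it masks single-word names only. A possible extension that also covers multi-word names is sketched below, using a regular expression per name. `mask_word` and `mask_names` are hypothetical helpers, and processing the longest names first is a design assumption so that 'Anna Pavlovna' is handled before 'Anna'.

```python
import re

def mask_word(word):
    """Keep the first and last letter; words of 1-2 letters stay unchanged."""
    if len(word) <= 2:
        return word
    return word[0] + '*' * (len(word) - 2) + word[-1]

def mask_names(text, names):
    """Mask every word of every name in text, longest names first."""
    for name in sorted(set(names), key=len, reverse=True):
        masked = ' '.join(mask_word(part) for part in name.split())
        # A function replacement avoids any escape handling in the masked string.
        text = re.sub(r'\b' + re.escape(name) + r'\b', lambda _m: masked, text)
    return text

print(mask_names("Anna Pavlovna greeted Pierre", ["Anna Pavlovna", "Pierre"]))
```

Each word of a name keeps its own first and last letter, which matches the masking rule in the task statement.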

In [104]:
df.head(5)

Unnamed: 0,Page_Number,Page_Content,Processed_Masked_Content,Extracted_names
0,1,\nWar and Peace \nLeo Tolstoy \n \n \n \n \n ...,Chapter Well Prince Genoa Lucca family estates...,"Anna Pavlovna, Anna Pavlovna Scherer, Prince V..."
1,2,War and Peace \n \n2 of 2882 \nBOOK ONE: 1805 \n,nothing better Count Prince prospect spending ...,"Count Prince, Annette Scherer Heavens, Anna Pa..."
2,3,War and Peace \n \n3 of 2882 \nChapter I \n‘We...,fete English ambassador Today Wednesday must p...,"Vasili, Buonaparte, Anna Pavlovna Scherer"
3,4,War and Peace \n \n4 of 2882 \n‘If you have no...,spoiled child continual consciousness charming...,"Alexander, Anna Pavlovna, Malta, Emperor, None..."
4,5,War and Peace \n \n5 of 2882 \n‘And the fete a...,desires good mankind promised Nothing little p...,"Prussia, Mortemart, Emperor, Haugwitz, Abbe Morio"


In [101]:
df['Processed_Masked_Content'].iloc[70]

'around looked D******v standing window sill pale radiant face empty threw bottle Englishman caught neatly D******v jumped smelt strongly rum Well done Fine fellow bet Devil take came different sides Englishman took purse began counting money D******v stood frowning speak P****e jumped upon window sill Gentlemen wishes bet thing suddenly cried Even without bet Tell bring bottle Bring bottle Let let said D******v smiling next gone mad one would let go giddy even staircase exclaimed several voices drink Let bottle rum shouted P****e banging table determined drunken gesture preparing climb window seized arms strong everyone touched sent flying'

In [103]:
df['Extracted_names'].iloc[70]

'Pierre, Dolokhov'

### Project Steps

1. Loaded a PDF of more than 1000 pages (War and Peace, 2882 pages) and extracted the text of its opening pages, keeping 100 pages after the first two were dropped.
2. Cleaned the text data by removing line breaks, page headers, and chapter headings.
3. Removed stop words and punctuation marks to prepare the text for further processing.
4. Created a DataFrame to store the page number, original content, and processed content.
5. Extracted human names from the processed text using NLTK's named entity recognition and stored them in the DataFrame.
6. Manually listed unwanted words and removed them from the extracted names.
7. Masked the names in the processed text by replacing the characters between the first and last character with asterisks.
8. Checked the results with an example.


The pre-processing above was done step by step. Below is a compact form (it expands contractions with the contractions package and does not remove stop words):

In [None]:
import re
import string
import contractions  # third-party package: pip install contractions

cleaned_text_all = []

pattern = r'War and Peace\s+\d+\s+of\s+\d+'

for page in page_text_all:
    # Remove line breaks and the specified pattern
    cleaned_text = re.sub(r'\n|' + pattern, ' ', page)
    cleaned_text_all.append(cleaned_text)

# Add the pattern to remove chapter headings
pattern = r'Chapter (I|II|III|IV|V|VI|VII|VIII|IX|X|XI|XII|XIII|XIV):'

cleaned_doc_all = []

for pages in cleaned_text_all:
    cleaned_doc = re.sub(pattern, ' ', pages)
    cleaned_doc_all.append(cleaned_doc)

# Further processing
final_text = cleaned_doc_all[2:]

def process_text(text):
    # Remove punctuation characters, then expand contractions
    text_without_punctuation = ''.join([char for char in text if char not in string.punctuation])
    processed_text = contractions.fix(text_without_punctuation)
    return processed_text

processed_pages = list(map(process_text, final_text))

# Remove special characters '‘' and '’'
processed_text = list(map(lambda x: x.replace("‘", "").replace("’", ""), processed_pages))
