## Small English model

In [1]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [10]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [None]:
import requests
from bs4 import BeautifulSoup
import re

url = 'https://en.wikipedia.org/wiki/Albert_Einstein'

# GET request to retrieve the page content
response = requests.get(url)

# BeautifulSoup to parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

paragraphs = soup.find_all('p')
text = ''
for paragraph in paragraphs:
    text += paragraph.get_text()
text_without_brackets = re.sub(r'\[[^\]]*\]', '', text)

print(text_without_brackets)



Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne; German:  ⓘ; 14 March 1879 – 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, Einstein also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century. His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been called "the world's most famous equation". He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect", a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science. In a 1999 poll of 130 leading physicists worldwide by the British j
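The `re.sub` call in the scraping cell deletes every bracketed run, which is what removes citation markers such as `[1]` from the page text. A minimal sketch of that step in isolation, assuming a made-up sample string and a hypothetical helper name `strip_brackets` (neither is part of the notebook):

```python
import re

# Same pattern as in the scraping cell: "[" + any run of non-"]" characters + "]".
BRACKETS = re.compile(r'\[[^\]]*\]')

def strip_brackets(text):
    """Remove bracketed markers such as [1] or [note 2] from text."""
    return BRACKETS.sub('', text)

# Hypothetical sample, not taken from the scraped page.
sample = "He received the Nobel Prize.[1][note 2] He moved in 1933.[3]"
cleaned = strip_brackets(sample)
print(cleaned)
```

Because the pattern matches any bracketed run, it would also delete bracketed text that is not a citation, so it is worth checking the output on a few paragraphs before running NER on it.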

In [13]:
doc = nlp(text_without_brackets)
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Albert Einstein  |  PERSON  |  People, including fictional
German  |  NORP  |  Nationalities or religious or political groups
14 March 1879  |  DATE  |  Absolute or relative dates or periods
18 April 1955  |  DATE  |  Absolute or relative dates or periods
German  |  NORP  |  Nationalities or religious or political groups
Einstein  |  PERSON  |  People, including fictional
the first decades of the twentieth century  |  DATE  |  Absolute or relative dates or periods
1921  |  DATE  |  Absolute or relative dates or periods
Nobel Prize in Physics  |  WORK_OF_ART  |  Titles of books, songs, etc.
1999  |  DATE  |  Absolute or relative dates or periods
130  |  CARDINAL  |  Numerals that do not fall under another type
British  |  NORP  |  Nationalities or religious or political groups
Physics World  |  ORG  |  Companies, agencies, institutions, etc.
Einstein  |  PERSON  |  People, including fictional
Einstein  |  PERSON  |  People, including fictional
1905  |  DATE  |  Absolute or relative date

In [25]:
doc = nlp(text_without_brackets)

# Collect each distinct PERSON entity, in order of first appearance
all_person = []

for ent in doc.ents:
    if ent.label_ == 'PERSON':
        # Join the entity's tokens with spaces; this leaves a trailing space on each name
        txt = ''
        for token in ent:
            txt += token.text + ' '
        if txt not in all_person:
            all_person.append(txt)

# Print the distinct names and how many there are
print("Person Names: ", all_person)
print("Count: ", len(all_person))


Person Names:  ['Albert Einstein ', 'Einstein ', 'mirabilis ', 'Adolf Hitler ', 'Franklin D. Roosevelt ', 'Hermann Einstein ', 'Pauline Koch ', 'Jakob ', 'Elektrotechnische Fabrik ', 'Albert ', 'Hermann ', 'Palazzo Cornazzani ', 'Luitpold ', 'Euclidean ', 'Max Talmud ', 'Kant ', 'Jost Winteler ', 'Winteler ', 'Marie ', 'Maja ', 'Paul ', 'Matura ', 'Marie Winteler ', 'Mileva Marić ', 'Lieserl ', 'Marić ', 'Hans Albert ', 'Eduard ', 'Elsa Löwenthal ', 'Betty Neumann ', 'Hans Mühsam ', 'Estella Katzenellenbogen ', 'Toni Mendel ', 'Ethel Michanowski ', 'Margarita Konenkova ', 'Sergei Konenkov ', "Marcel Grossmann 's ", 'Ernst Mach ', 'David Hume ', 'Annalen der ', 'Marcel Grossman ', 'Alfred Kleiner ', 'Isaac Newton ', 'Zürich ', 'Marcel Grossmann ', 'Max Planck ', 'Walther Nernst ', 'Nernst ', 'S. N. Bose ', 'Arthur Eddington ', 'John Francis Hylan ', 'Princeton ', 'Viscount Haldane ', 'Alexis de Tocqueville ', 'Yoshihito ', 'Herbert Samuel ', 'Alfonso XIII ', 'Santiago Ramón ', 'Oskar Ha
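The list above keeps the trailing space added by the token loop, and a detached possessive can make the same person appear twice (`"Marcel Grossmann 's "` and `'Marcel Grossmann '`). A minimal post-processing sketch, assuming a hypothetical helper `normalise_names` that is not part of the notebook; the sample entries are copied from the output above:

```python
def normalise_names(names):
    """Strip whitespace and a trailing possessive, then drop duplicates in order."""
    cleaned = []
    for name in names:
        name = name.strip()
        # Drop a detached possessive token, as in "Marcel Grossmann 's".
        if name.endswith("'s"):
            name = name[:-2].rstrip()
        cleaned.append(name)
    # dict.fromkeys keeps the first occurrence of each name and preserves order.
    return list(dict.fromkeys(cleaned))

sample = ['Albert Einstein ', 'Einstein ', "Marcel Grossmann 's ", 'Marcel Grossmann ']
names = normalise_names(sample)
print(names)
print("Count: ", len(names))
```

This only tidies the strings. It does not remove false positives such as `'Zürich '` or `'Princeton '`, which the model itself labelled as PERSON.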

## Large English model

In [None]:
nlp = spacy.load("en_core_web_lg")

In [None]:
import requests
from bs4 import BeautifulSoup
import re

url = 'https://en.wikipedia.org/wiki/Albert_Einstein'

# GET request to retrieve the page content
response = requests.get(url)

# BeautifulSoup to parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Concatenate the text of every <p> element
paragraphs = soup.find_all('p')
text = ''
for paragraph in paragraphs:
    text += paragraph.get_text()

# Remove bracketed citation markers such as [1]
text_without_brackets = re.sub(r'\[[^\]]*\]', '', text)

print(text_without_brackets)



Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne; German:  ⓘ; 14 March 1879 – 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, Einstein also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century. His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been called "the world's most famous equation". He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect", a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science. In a 1999 poll of 130 leading physicists worldwide by the British j

In [13]:
doc = nlp(text_without_brackets)
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Albert Einstein  |  PERSON  |  People, including fictional
German  |  NORP  |  Nationalities or religious or political groups
14 March 1879  |  DATE  |  Absolute or relative dates or periods
18 April 1955  |  DATE  |  Absolute or relative dates or periods
German  |  NORP  |  Nationalities or religious or political groups
Einstein  |  PERSON  |  People, including fictional
the first decades of the twentieth century  |  DATE  |  Absolute or relative dates or periods
1921  |  DATE  |  Absolute or relative dates or periods
Nobel Prize in Physics  |  WORK_OF_ART  |  Titles of books, songs, etc.
1999  |  DATE  |  Absolute or relative dates or periods
130  |  CARDINAL  |  Numerals that do not fall under another type
British  |  NORP  |  Nationalities or religious or political groups
Physics World  |  ORG  |  Companies, agencies, institutions, etc.
Einstein  |  PERSON  |  People, including fictional
Einstein  |  PERSON  |  People, including fictional
1905  |  DATE  |  Absolute or relative date

In [25]:
doc = nlp(text_without_brackets)

# Collect each distinct PERSON entity, in order of first appearance
all_person = []

for ent in doc.ents:
    if ent.label_ == 'PERSON':
        # Join the entity's tokens with spaces; this leaves a trailing space on each name
        txt = ''
        for token in ent:
            txt += token.text + ' '
        if txt not in all_person:
            all_person.append(txt)

# Print the distinct names and how many there are
print("Person Names: ", all_person)
print("Count: ", len(all_person))


Person Names:  ['Albert Einstein ', 'Einstein ', 'mirabilis ', 'Adolf Hitler ', 'Franklin D. Roosevelt ', 'Hermann Einstein ', 'Pauline Koch ', 'Jakob ', 'Elektrotechnische Fabrik ', 'Albert ', 'Hermann ', 'Palazzo Cornazzani ', 'Luitpold ', 'Euclidean ', 'Max Talmud ', 'Kant ', 'Jost Winteler ', 'Winteler ', 'Marie ', 'Maja ', 'Paul ', 'Matura ', 'Marie Winteler ', 'Mileva Marić ', 'Lieserl ', 'Marić ', 'Hans Albert ', 'Eduard ', 'Elsa Löwenthal ', 'Betty Neumann ', 'Hans Mühsam ', 'Estella Katzenellenbogen ', 'Toni Mendel ', 'Ethel Michanowski ', 'Margarita Konenkova ', 'Sergei Konenkov ', "Marcel Grossmann 's ", 'Ernst Mach ', 'David Hume ', 'Annalen der ', 'Marcel Grossman ', 'Alfred Kleiner ', 'Isaac Newton ', 'Zürich ', 'Marcel Grossmann ', 'Max Planck ', 'Walther Nernst ', 'Nernst ', 'S. N. Bose ', 'Arthur Eddington ', 'John Francis Hylan ', 'Princeton ', 'Viscount Haldane ', 'Alexis de Tocqueville ', 'Yoshihito ', 'Herbert Samuel ', 'Alfonso XIII ', 'Santiago Ramón ', 'Oskar Ha
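Both English sections store their result in the same `all_person` variable, so the two models' lists cannot be compared once the second cell has run. One way to compare them, sketched under the assumption that each list is saved under its own name first (`small_names`, `large_names` and `compare_name_lists` are hypothetical names, and the sample values are invented for illustration, not real model output):

```python
def compare_name_lists(first, second):
    """Return names found only in first, only in second, and in both (each sorted)."""
    a, b = set(first), set(second)
    return sorted(a - b), sorted(b - a), sorted(a & b)

# Hypothetical lists for illustration only; these are not real model output.
small_names = ['Albert Einstein', 'Max Planck', 'Zürich']
large_names = ['Albert Einstein', 'Max Planck', 'Niels Bohr']

only_small, only_large, shared = compare_name_lists(small_names, large_names)
print("Only in first list: ", only_small)
print("Only in second list:", only_large)
print("In both lists:      ", shared)
```

Set differences ignore order and repetition, which suits this notebook since `all_person` already holds each name once.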

## Small French model

In [1]:
import spacy
nlp = spacy.load("fr_core_news_sm")

In [None]:
import requests
from bs4 import BeautifulSoup
import re

url = 'https://fr.wikipedia.org/wiki/Albert_Einstein'

# GET request to retrieve the page content
response = requests.get(url)

# BeautifulSoup to parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Concatenate the text of every <p> element
paragraphs = soup.find_all('p')
text = ''
for paragraph in paragraphs:
    text += paragraph.get_text()

# Remove bracketed citation markers such as [1]
text_without_brackets = re.sub(r'\[[^\]]*\]', '', text)

print(text_without_brackets)



« Einstein » redirige ici. Pour les autres significations, voir Einstein (homonymie).

modifier Albert Einstein (prononcé en allemand  Écouter) né le 14 mars 1879 à Ulm (Wurtemberg, Empire allemand) et mort le 18 avril 1955 à Princeton (New Jersey, États-Unis), est un physicien théoricien. Il fut successivement allemand, apatride (entre 1896 et 1901), suisse (1901) et de double nationalité helvético-américaine (1940). Il épousa Mileva Marić, puis sa cousine Elsa Einstein.
Il publie sa théorie de la relativité restreinte en 1905 et sa théorie de la gravitation, dite relativité générale, en 1915. Il contribue largement au développement de la mécanique quantique et de la cosmologie, et reçoit le prix Nobel de physique de 1921 pour son explication de l’effet photoélectrique. Son travail est notamment connu du grand public pour l’équation E = mc2, qui établit une équivalence entre la masse et l’énergie d’un système.
Il est aujourd'hui considéré comme l'un des plus grands scientifiques de l

In [3]:
doc = nlp(text_without_brackets)
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Einstein  |  PER  |  Named person or family.
Einstein  |  PER  |  Named person or family.
Albert Einstein  |  PER  |  Named person or family.
Ulm  |  LOC  |  Non-GPE locations, mountain ranges, bodies of water
Wurtemberg  |  LOC  |  Non-GPE locations, mountain ranges, bodies of water
Empire allemand  |  LOC  |  Non-GPE locations, mountain ranges, bodies of water
Princeton  |  LOC  |  Non-GPE locations, mountain ranges, bodies of water
New Jersey  |  LOC  |  Non-GPE locations, mountain ranges, bodies of water
États-Unis  |  LOC  |  Non-GPE locations, mountain ranges, bodies of water
suisse  |  LOC  |  Non-GPE locations, mountain ranges, bodies of water
Mileva Marić  |  PER  |  Named person or family.
Elsa Einstein  |  PER  |  Named person or family.
prix Nobel de physique  |  MISC  |  Miscellaneous entities, e.g. events, nationalities, products or works of art
Time  |  ORG  |  Companies, agencies, institutions, etc.
Hermann Einstein  |  PER  |  Named person or family.
Buchau  |  LOC  | 
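The French output uses a different label scheme from the English one: people are tagged `PER` rather than `PERSON`, and places appear as `LOC`. A small sketch of choosing the person label from whatever labels are available, so the same extraction loop could serve both languages (`person_label` is a hypothetical helper, not part of the notebook or of spaCy):

```python
def person_label(labels):
    """Return the label used for people in the given label set, or None."""
    for candidate in ("PERSON", "PER"):
        if candidate in labels:
            return candidate
    return None

# Labels as they appear in the printed outputs (subsets, not the models' full label lists).
english_labels = ("PERSON", "NORP", "DATE", "WORK_OF_ART", "CARDINAL", "ORG")
french_labels = ("PER", "LOC", "MISC", "ORG")

print(person_label(english_labels))
print(person_label(french_labels))
```

In the notebook, the label set could plausibly be read from the loaded pipeline with `nlp.get_pipe("ner").labels` instead of being typed by hand; that call is left out of the sketch so it runs without a model installed.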

In [4]:
doc = nlp(text_without_brackets)

# Collect each distinct PER entity, in order of first appearance
all_person = []

for ent in doc.ents:
    if ent.label_ == 'PER':
        # Join the entity's tokens with spaces; this leaves a trailing space on each name
        txt = ''
        for token in ent:
            txt += token.text + ' '
        if txt not in all_person:
            all_person.append(txt)

# Print the distinct names and how many there are
print("Person Names: ", all_person)
print("Count: ", len(all_person))


Person Names:  ['Einstein ', 'Albert Einstein ', 'Mileva Marić ', 'Elsa Einstein ', 'Hermann Einstein ', 'Pauline Koch ', 'Albert ', 'Abraham ', 'Jakob ', 'Max Talmey ', 'Kant ', 'Helen Einstein ', 'Pauline Kock ', 'Maria ', 'Maja ', 'Luitpold Gymnasium ', 'Ugo Foscolo ', 'Contardo Ferrini ', 'Ada Negri ', 'Marcel Grossmann ', 'Kirchhoff ', 'Hertz ', 'Helmholtz ', 'Maxwell ', 'Michele Besso ', 'Ernst Mach ', 'Lieserl ', 'Mileva ', 'Hans - Albert ', 'Eduard ', 'Conrad Habicht ', 'Maurice Solovine ', 'Nernst ', 'Marie Curie ', 'Max Planck ', 'Paul Langevin ', 'Elsa ', 'Arthur Eddington ', 'Hitler ', 'Abraham Flexner ', 'Eugene Wigner ', 'Leó Szilárd ', 'Roosevelt ', 'Chaim Weizmann ', 'Niels Bohr ', 'Charlie Chaplin ', 'Édouard Herriot ', 'Ricci ', 'Riemann - Christoffel ', 'Hilbert ', 'Klein ', 'Planck ', 'Arthur Compton ', 'Erwin Schrödinger ', 'Werner Heisenberg ', 'Gott würfelt ', 'Boris Podolsky ', 'Nathan Rosen ', 'John Earman ', 'Clark Glymour ', 'Eddington ', 'Harry Collins ', 'T
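The count treats `'Einstein '`, `'Albert '` and `'Albert Einstein '` as three different names, so it overstates the number of distinct people. A rough heuristic sketch that separates multi-word names from single-word forms already contained in one of them (`split_short_forms` is a hypothetical helper; the sample entries are copied from the output above):

```python
def split_short_forms(names):
    """Split names into multi-word names, single words found in one of them, and the rest."""
    names = [n.strip() for n in names]
    full = [n for n in names if ' ' in n]
    # Every individual word that occurs inside a multi-word name.
    words_in_full = {word for n in full for word in n.split()}
    short = [n for n in names if ' ' not in n and n in words_in_full]
    other = [n for n in names if ' ' not in n and n not in words_in_full]
    return full, short, other

sample = ['Einstein ', 'Albert Einstein ', 'Mileva Marić ', 'Albert ', 'Kant ', 'Mileva ']
full, short, other = split_short_forms(sample)
print("Full names: ", full)
print("Short forms:", short)
print("Other:      ", other)
```

This assumes a short form refers to the same person as the longer name, which does not always hold here: the page mentions several people called Einstein, so the grouping is a reading aid rather than a true entity resolution step.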

## Large French model

In [1]:
import spacy
nlp = spacy.load("fr_core_news_lg")

In [2]:
import requests
from bs4 import BeautifulSoup
import re

url = 'https://fr.wikipedia.org/wiki/Albert_Einstein'

# GET request to retrieve the page content
response = requests.get(url)

# BeautifulSoup to parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Concatenate the text of every <p> element
paragraphs = soup.find_all('p')
text = ''
for paragraph in paragraphs:
    text += paragraph.get_text()

# Remove bracketed citation markers such as [1]
text_without_brackets = re.sub(r'\[[^\]]*\]', '', text)

print(text_without_brackets)



« Einstein » redirige ici. Pour les autres significations, voir Einstein (homonymie).

modifier Albert Einstein (prononcé en allemand  Écouter) né le 14 mars 1879 à Ulm (Wurtemberg, Empire allemand) et mort le 18 avril 1955 à Princeton (New Jersey, États-Unis), est un physicien théoricien. Il fut successivement allemand, apatride (entre 1896 et 1901), suisse (1901) et de double nationalité helvético-américaine (1940). Il épousa Mileva Marić, puis sa cousine Elsa Einstein.
Il publie sa théorie de la relativité restreinte en 1905 et sa théorie de la gravitation, dite relativité générale, en 1915. Il contribue largement au développement de la mécanique quantique et de la cosmologie, et reçoit le prix Nobel de physique de 1921 pour son explication de l’effet photoélectrique. Son travail est notamment connu du grand public pour l’équation E = mc2, qui établit une équivalence entre la masse et l’énergie d’un système.
Il est aujourd'hui considéré comme l'un des plus grands scientifiques de l

In [3]:
doc = nlp(text_without_brackets)
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Einstein  |  PER  |  Named person or family.
Einstein  |  PER  |  Named person or family.
Albert Einstein  |  PER  |  Named person or family.
Ulm  |  LOC  |  Non-GPE locations, mountain ranges, bodies of water
Wurtemberg  |  LOC  |  Non-GPE locations, mountain ranges, bodies of water
Empire allemand  |  LOC  |  Non-GPE locations, mountain ranges, bodies of water
Princeton  |  LOC  |  Non-GPE locations, mountain ranges, bodies of water
New Jersey  |  LOC  |  Non-GPE locations, mountain ranges, bodies of water
États-Unis  |  LOC  |  Non-GPE locations, mountain ranges, bodies of water
suisse  |  LOC  |  Non-GPE locations, mountain ranges, bodies of water
Mileva Marić  |  PER  |  Named person or family.
Elsa Einstein  |  PER  |  Named person or family.
prix Nobel de physique  |  MISC  |  Miscellaneous entities, e.g. events, nationalities, products or works of art
Time  |  ORG  |  Companies, agencies, institutions, etc.
Hermann Einstein  |  PER  |  Named person or family.
Buchau  |  LOC  | 

In [4]:
doc = nlp(text_without_brackets)

# Collect each distinct PER entity, in order of first appearance
all_person = []

for ent in doc.ents:
    if ent.label_ == 'PER':
        # Join the entity's tokens with spaces; this leaves a trailing space on each name
        txt = ''
        for token in ent:
            txt += token.text + ' '
        if txt not in all_person:
            all_person.append(txt)

# Print the distinct names and how many there are
print("Person Names: ", all_person)
print("Count: ", len(all_person))


Person Names:  ['Einstein ', 'Albert Einstein ', 'Mileva Marić ', 'Elsa Einstein ', 'Hermann Einstein ', 'Pauline Koch ', 'Albert ', 'Abraham ', 'Jakob ', 'Max Talmey ', 'Kant ', 'Helen Einstein ', 'Pauline Kock ', 'Maria ', 'Maja ', 'Luitpold Gymnasium ', 'Ugo Foscolo ', 'Contardo Ferrini ', 'Ada Negri ', 'Marcel Grossmann ', 'Kirchhoff ', 'Hertz ', 'Helmholtz ', 'Maxwell ', 'Michele Besso ', 'Ernst Mach ', 'Lieserl ', 'Mileva ', 'Hans - Albert ', 'Eduard ', 'Conrad Habicht ', 'Maurice Solovine ', 'Nernst ', 'Marie Curie ', 'Max Planck ', 'Paul Langevin ', 'Elsa ', 'Arthur Eddington ', 'Hitler ', 'Abraham Flexner ', 'Eugene Wigner ', 'Leó Szilárd ', 'Roosevelt ', 'Chaim Weizmann ', 'Niels Bohr ', 'Charlie Chaplin ', 'Édouard Herriot ', 'Ricci ', 'Riemann - Christoffel ', 'Hilbert ', 'Klein ', 'Planck ', 'Arthur Compton ', 'Erwin Schrödinger ', 'Werner Heisenberg ', 'Gott würfelt ', 'Boris Podolsky ', 'Nathan Rosen ', 'John Earman ', 'Clark Glymour ', 'Eddington ', 'Harry Collins ', 'T