##Option 4 – Stop-List Builder (Explanation)

This project implements Option 4 of the assignment: creating a stop-list from multiple English Wikipedia pages.
The goal is to find the most frequent words (common across many documents) and build a list of words to ignore in future text processing tasks.

Installing and importing every library that is needed.

In [1]:
# --- Option 4: Stop-List Builder ---
!pip install nltk wikipedia unidecode -q

import nltk, re, string, collections, os, math
import wikipedia
from nltk.corpus import stopwords
from unidecode import unidecode


In [2]:
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab')
nltk.download('stopwords', quiet=True)


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

###Step 1 – Choosing 10 Wikipedia Pages


We selected 10 English Wikipedia articles related to computer science and AI.
Each page is downloaded and cleaned separately; word counts are later computed over all pages combined.

In [3]:
PAGES = [
    "Artificial intelligence",
    "Supervised learning",
    "Cluster analysis",
    "Natural language processing",
    "Deep learning",
    "Science and technology studies",
    "Bibliometrics",
    "Speech recognition",
    "Artificial neural network",
    "Reinforcement learning"
]


###Step 2 – Downloading and Cleaning the Text

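The cell below uses the wikipedia library to fetch the full text of each article.

It converts all text to lowercase and transliterates accented characters to plain ASCII with unidecode.

It then replaces punctuation, digits, and any remaining non-letter characters with spaces using a regular expression (re.sub).

Only the letters a-z and whitespace remain, which keeps tokenization simple later.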

In [4]:

# 2) Download the texts from Wikipedia

#content = wikipedia.page(page, auto_suggest=True, redirect=True).content
wikipedia.set_lang("en")

def clean_text(text):
    text = unidecode(text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return text

docs = {}
for page in PAGES:
    try:
        content = wikipedia.page(page).content
        docs[page] = clean_text(content)
        print("Downloaded:", page)
    except Exception as e:
        print("Error:", page, e)


Downloaded: Artificial intelligence
Downloaded: Supervised learning
Downloaded: Cluster analysis
Downloaded: Natural language processing
Downloaded: Deep learning
Downloaded: Science and technology studies
Downloaded: Bibliometrics
Downloaded: Speech recognition
Downloaded: Artificial neural network
Downloaded: Reinforcement learning
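
To see what the cleaning step does to a single sentence, here is a minimal sketch that needs no network access. It is not the notebook's `clean_text`: the function name `clean_text_stdlib` and the sample sentence are made up for illustration, and the standard-library `unicodedata` module stands in for `unidecode` (an assumption, as the two do not behave identically on every character). The regular expression is the same one used above.

```python
import re
import unicodedata

def clean_text_stdlib(text):
    # Stand-in for unidecode: decompose accented letters, then drop the accents
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Same regex as the notebook: everything except a-z and whitespace becomes a space
    return re.sub(r"[^a-z\s]", " ", text)

# Hypothetical sample with an accent, a possessive, "e.g." and LaTeX-style markup
sample = "Café's model, e.g. {\\displaystyle x_{1}}, scored 95%."
tokens = clean_text_stdlib(sample).split()
print(tokens)
# ['cafe', 's', 'model', 'e', 'g', 'displaystyle', 'x', 'scored']
```

Replacing punctuation with spaces splits possessives and abbreviations into single-letter tokens and leaves markup words such as `displaystyle` behind as ordinary tokens.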


###Step 3 – Splitting Each Document into 10 Parts

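Each article is divided into 10 chunks with the same number of words to simulate smaller sub-documents. Because the chunk size is computed with integer division, up to 9 trailing words per article are left out.
In total, we get 100 text parts (10 articles × 10 parts each).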

In [5]:

# 3) Split each text into 10 chunks
def split_chunks(text, n=10):
    words = nltk.word_tokenize(text)
    size = max(1, len(words)//n)
    return [" ".join(words[i*size:(i+1)*size]) for i in range(n)]

chunks = {}
for name, text in docs.items():
    cks = split_chunks(text)
    for i, c in enumerate(cks):
        chunks[f"{name}_part{i+1}"] = c

print(f"\nTotal chunks created: {len(chunks)}")



Total chunks created: 100
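
The integer division in `split_chunks` leaves up to `n - 1` trailing words unassigned. If every word should land in some chunk, one possible variant spreads the remainder over the first chunks. This is a sketch only, not the notebook's method: `split_even` is a hypothetical helper, and it works on a plain list of words instead of calling `nltk.word_tokenize`.

```python
def split_even(words, n=10):
    # q words per chunk, and the first r chunks get one extra word
    q, r = divmod(len(words), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + q + (1 if i < r else 0)
        chunks.append(words[start:end])
        start = end
    return chunks

# Toy input: 25 placeholder words split into 10 parts
words = [f"w{i}" for i in range(25)]
parts = split_even(words, n=10)
print([len(p) for p in parts])
# [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

Chunk sizes differ by at most one word, and concatenating the chunks gives back the original word list.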




###Step 4 – Base Stopwords from NLTK


We load a predefined list of common English stop-words (like the, is, and, of, to).
These are words that usually do not add meaning to text analysis and can be filtered out later.

In [6]:

# 4) Load the stop-words defined in the nltk.corpus library
stop_words = set(stopwords.words("english"))
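
As an illustration of how such a set is typically used for filtering, here is a small sketch. The set `base_sample` is a hand-picked stand-in for the NLTK list, and the sentence is invented; both are assumptions made so the example runs without downloading anything.

```python
# Hand-picked stand-in for the NLTK stop-word set (assumption, not the full list)
base_sample = {"the", "is", "and", "of", "to"}

tokens = "the goal of clustering is to group similar data".split()

# Keep only tokens that are not in the stop-word set
kept = [t for t in tokens if t not in base_sample]
print(kept)
# ['goal', 'clustering', 'group', 'similar', 'data']
```

Using a `set` keeps each membership test fast even when the token list is long.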


###Step 5 – Counting Word Frequencies

The cell below tokenizes every chunk into individual words.

It keeps only alphabetic tokens.

It then uses collections.Counter to count how many times each word appears in all texts combined.

In [7]:

# 5) Count how many times each token appears in the texts
all_tokens = []
for part_text in chunks.values():
    tokens = nltk.word_tokenize(part_text)
    tokens = [t for t in tokens if t.isalpha()]
    all_tokens.extend(tokens)

freq = collections.Counter(all_tokens)
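
The counter above measures total occurrences. Since the stated goal is to find words that are common across many documents, a complementary measure is the number of chunks in which a word appears at least once. The sketch below shows the difference on toy data; `toy_chunks`, `term_freq` and `doc_freq` are illustrative names that are not part of the notebook, and `str.split` replaces `nltk.word_tokenize` to keep the example self-contained.

```python
from collections import Counter

# Toy stand-in for the `chunks` dictionary built in Step 3
toy_chunks = {
    "a_part1": "the model learns the data",
    "a_part2": "the network of neurons",
    "b_part1": "clustering of the data data data",
}

term_freq = Counter()  # total occurrences over all chunks
doc_freq = Counter()   # number of chunks containing the word

for text in toy_chunks.values():
    tokens = text.split()
    term_freq.update(tokens)
    doc_freq.update(set(tokens))  # each chunk counts a word at most once

print(term_freq["data"], doc_freq["data"])  # 4 2
print(term_freq["the"], doc_freq["the"])    # 4 3
```

Here "data" and "the" have the same total count, but "the" appears in every chunk while "data" is concentrated in two of them.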


###Step 6 – Building the Custom Stop-List

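After counting frequencies, we take the 50 most common words and create a custom stop-list.
Most of them are generic function words (the, of, and), but a pure frequency cut-off also picks up topic words of this corpus (learning, data, neural) and cleaning artifacts: displaystyle most likely comes from LaTeX markup in the article text, and s, e, and g are left over when punctuation is replaced by spaces (as in possessives and "e.g.").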

In [8]:

# 6) Take the 50 most common words as the stop-list and print them
TOP_N = 50
custom_stoplist = [w for w, f in freq.most_common(TOP_N)]

print("\n=== Custom Stop-List (Top 50 words) ===")
for i, w in enumerate(custom_stoplist, 1):
    print(f"{i:2d}. {w}")



=== Custom Stop-List (Top 50 words) ===
 1. the
 2. of
 3. and
 4. to
 5. a
 6. in
 7. is
 8. that
 9. for
10. as
11. s
12. by
13. learning
14. are
15. with
16. on
17. be
18. or
19. can
20. data
21. it
22. an
23. from
24. this
25. ai
26. such
27. neural
28. displaystyle
29. was
30. used
31. deep
32. have
33. speech
34. networks
35. not
36. which
37. science
38. based
39. g
40. recognition
41. has
42. more
43. research
44. network
45. e
46. been
47. clustering
48. methods
49. other
50. models
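
One way to inspect a list like this is to separate generic words from corpus-specific ones by comparing it with a reference stop-word set. The sketch below uses a few words from the output above and a small hand-picked reference set; both `custom_sample` and `base_sample` are illustrative assumptions, not the full lists.

```python
# A few entries taken from the printed stop-list above
custom_sample = ["the", "of", "learning", "data", "displaystyle", "is", "neural"]

# Hand-picked stand-in for a standard English stop-word set (assumption)
base_sample = {"the", "of", "is", "and", "to"}

generic = [w for w in custom_sample if w in base_sample]
corpus_specific = [w for w in custom_sample if w not in base_sample]

print(generic)          # ['the', 'of', 'is']
print(corpus_specific)  # ['learning', 'data', 'displaystyle', 'neural']
```

Words in the second group are worth a manual check before the list is reused on texts from a different domain.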


###Step 7 – Saving the Stop-List to a File

The final stop-list is saved to the text file output/stop_list.txt, one word per line.

In [9]:

# 7) Save the stop-list to a file
os.makedirs("output", exist_ok=True)
with open("output/stop_list.txt", "w") as f:
    for w in custom_stoplist:
        f.write(w + "\n")

print("\nStop-list saved to: output/stop_list.txt")


Stop-list saved to: output/stop_list.txt
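
For a later processing task, the saved file can be read back into a set. The following is a minimal sketch of that round trip; it writes a three-word sample to a temporary directory instead of `output/`, so the path and the sample words are assumptions made to keep the example self-contained.

```python
import os
import tempfile

sample_words = ["the", "of", "and"]  # illustrative sample, not the full list

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stop_list.txt")

    # Write one word per line, as in Step 7
    with open(path, "w", encoding="utf-8") as f:
        for w in sample_words:
            f.write(w + "\n")

    # Read the file back, skipping empty lines
    with open(path, encoding="utf-8") as f:
        loaded = {line.strip() for line in f if line.strip()}

print(sorted(loaded))  # ['and', 'of', 'the']
```

Passing `encoding="utf-8"` explicitly keeps the file format the same across operating systems.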
