
This notebook takes the following approach:
*   Text classification into two classes: geographic and non-geographic.

*   Use of Wikipedia API to extract text content.

*   Preprocessing with NLTK (stop word removal, lemmatization using WordNet).

*   Classification using Naive Bayes and Logistic Regression.

*   Bag-of-Words (BoW) as the feature extraction technique.

# Data Collection

Wikipedia articles are fetched for both classes (geographic and non-geographic) and saved to text files.

In [1]:
!pip install wikipedia-api




In [2]:
import wikipediaapi
import os

In [3]:
# Set up Wikipedia API with a valid user-agent
wiki_wiki = wikipediaapi.Wikipedia(
    language='en',
    user_agent='YourAppName/1.0 (your_email@example.com)'  # Replace with your info
)

In [4]:
# Make sure the directory exists
os.makedirs('data', exist_ok=True)

def get_wikipedia_text(title):
    page = wiki_wiki.page(title)
    if page.exists():
        return page.text
    return ""

def save_articles(titles, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        for title in titles:
            text = get_wikipedia_text(title)
            if text:
                f.write(f"{title}\n{text}\n---END---\n")
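
The layout written by `save_articles` (a title line, the article body, then a `---END---` line) can be checked with a small round trip. The following is a minimal sketch that works on an in-memory string with made-up article text instead of real Wikipedia content; `format_articles` and `parse_articles` are hypothetical helper names, not part of the notebook.

```python
END_MARKER = "---END---\n"

def format_articles(articles):
    """Serialize (title, text) pairs in the same layout save_articles writes."""
    return "".join(f"{title}\n{text}\n{END_MARKER}" for title, text in articles)

def parse_articles(blob):
    """Split the serialized blob back into (title, text) pairs."""
    pairs = []
    for chunk in blob.split(END_MARKER):
        if not chunk.strip():
            continue  # skip the empty piece left after the final marker
        title, _, text = chunk.partition("\n")
        pairs.append((title, text.rstrip("\n")))
    return pairs

# Made-up sample data, for illustration only
sample = [
    ("Italy", "Italy is a country.\nIt is in Europe."),
    ("Photosynthesis", "Plants convert light."),
]
blob = format_articles(sample)
restored = parse_articles(blob)
print(restored == sample)  # True
```

One limitation of this layout: an article whose body happened to contain a line reading `---END---` would be split in two, so the marker only works as long as it never appears in the text itself.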

In [5]:
# Example usage
if __name__ == "__main__":
    geo_titles = ['Italy', 'Mount Everest', 'Amazon River', 'New York City']
    non_geo_titles = ['Python (programming language)', 'Photosynthesis', 'Quantum mechanics', 'World War II']

    save_articles(geo_titles, 'data/geo_articles.txt')
    save_articles(non_geo_titles, 'data/non_geo_articles.txt')


# Preprocessing

Each article is lowercased, tokenized, filtered for stop words, and lemmatized.

In [6]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Downloads every NLTK resource; only the tokenizer, stop words, and WordNet are used below
nltk.download('all')


In [7]:
# Force download again to fix possible corruption or incomplete download
nltk.download('punkt', force=True)
nltk.download('stopwords', force=True)
nltk.download('wordnet', force=True)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [8]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [9]:
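def preprocess(text):
    """Lowercase, tokenize, drop stop words and short tokens, then lemmatize."""
    # Replace runs of non-word characters (punctuation, symbols) with a space
    text = re.sub(r'\W+', ' ', text.lower())
    tokens = nltk.word_tokenize(text)
    # Keep tokens longer than two characters that are not stop words
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words and len(w) > 2]
    return ' '.join(tokens)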
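
To see what the cleaning steps do to a sentence, the sketch below traces them using only the standard library. It is a simplified stand-in, not the notebook's function: `TOY_STOP_WORDS` is a hypothetical mini stop list in place of NLTK's English list, `str.split` replaces `nltk.word_tokenize`, and lemmatization is skipped entirely.

```python
import re

# Hypothetical mini stop list; the notebook uses NLTK's full English list instead
TOY_STOP_WORDS = {"the", "is", "in", "and", "of", "a"}

def preprocess_sketch(text):
    """Lowercase, strip non-word characters, drop stop words and short tokens."""
    cleaned = re.sub(r"\W+", " ", text.lower())  # same regex as preprocess
    tokens = cleaned.split()                     # stand-in for nltk.word_tokenize
    kept = [w for w in tokens if w not in TOY_STOP_WORDS and len(w) > 2]
    return " ".join(kept)

print(preprocess_sketch("The Amazon River is in South America, and it is LONG!"))
# amazon river south america long
```

Note that "it" is not in the toy stop list but is still dropped, because the length filter removes every token of two characters or fewer.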


# Feature Extraction

Each preprocessed article is converted to a Bag-of-Words count vector using ***CountVectorizer***.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
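
To make the Bag-of-Words representation concrete, the sketch below builds a document-term count matrix by hand with the standard library. It illustrates the idea only and is not how `CountVectorizer` is implemented; `bag_of_words` and the two toy documents are hypothetical.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (sorted vocabulary, count matrix) for whitespace-tokenized docs."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        # One row per document, one column per vocabulary word
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

docs = ["river flows river valley", "python code runs"]  # toy, already preprocessed
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['code', 'flows', 'python', 'river', 'runs', 'valley']
print(matrix)  # [[0, 1, 0, 2, 0, 1], [1, 0, 1, 0, 1, 0]]
```

Word order is discarded: each row records only how often each vocabulary word occurs in that document, which is the input both classifiers receive.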

# Model Training

Two classifiers are trained on the Bag-of-Words features: Naive Bayes and Logistic Regression.



In [11]:
def load_data():
    """Load both article files and return (texts, labels); 1 = geographic, 0 = non-geographic."""
    def read_file(path):
        with open(path, 'r', encoding='utf-8') as f:
            articles = f.read().split('---END---\n')
        # Drop the title line of each article and preprocess the remaining body
        return [preprocess(article.split('\n', 1)[-1]) for article in articles if article.strip()]

    geo = read_file('data/geo_articles.txt')
    non_geo = read_file('data/non_geo_articles.txt')
    return geo + non_geo, [1]*len(geo) + [0]*len(non_geo)

In [12]:
def classify():
    texts, labels = load_data()
    X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=42)

    vectorizer = CountVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Naive Bayes
    nb_model = MultinomialNB()
    nb_model.fit(X_train_vec, y_train)
    nb_preds = nb_model.predict(X_test_vec)

    print("Naive Bayes Results:\n")
    print(confusion_matrix(y_test, nb_preds))
    print(classification_report(y_test, nb_preds))

    # Logistic Regression
    log_model = LogisticRegression(max_iter=200)
    log_model.fit(X_train_vec, y_train)
    log_preds = log_model.predict(X_test_vec)

    print("Logistic Regression Results:\n")
    print(confusion_matrix(y_test, log_preds))
    print(classification_report(y_test, log_preds))
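
With only eight articles, a plain random split can leave the two classes unevenly represented in the test set. One possible remedy is the `stratify` parameter of `train_test_split`, sketched below under stated assumptions: scikit-learn is installed, the texts are placeholders rather than the real articles, and `test_size=0.25` is chosen here (instead of the notebook's 0.3) so that the test set holds exactly one article per class.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder documents standing in for the 8 preprocessed articles
texts = [f"doc {i}" for i in range(8)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# stratify=labels keeps the class proportions equal in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)
print(Counter(y_train), Counter(y_test))
```

This is a sketch of an alternative split, not the one used to produce the notebook's reported results.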

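# Evaluation

Each model is evaluated with accuracy, a confusion matrix, and the F1 score. The test set holds only three articles, so the scores below are indicative at best.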



In [13]:
if __name__ == "__main__":
    classify()

Naive Bayes Results:

[[1 0]
 [1 1]]
              precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           1       1.00      0.50      0.67         2

    accuracy                           0.67         3
   macro avg       0.75      0.75      0.67         3
weighted avg       0.83      0.67      0.67         3

Logistic Regression Results:

[[1 0]
 [2 0]]
              precision    recall  f1-score   support

           0       0.33      1.00      0.50         1
           1       0.00      0.00      0.00         2

    accuracy                           0.33         3
   macro avg       0.17      0.50      0.25         3
weighted avg       0.11      0.33      0.17         3
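
The per-class figures in both reports can be reproduced directly from the confusion matrices. The sketch below does this by hand, using the scikit-learn convention that rows are true classes and columns are predicted classes; `per_class_metrics` is a hypothetical helper, and reporting undefined ratios as 0.0 is a choice made here to match the values in the report.

```python
def per_class_metrics(cm):
    """Compute precision, recall and F1 per class from a square confusion matrix.

    cm[i][j] counts samples with true class i predicted as class j.
    Ratios with a zero denominator are reported as 0.0.
    """
    n = len(cm)
    results = {}
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results[k] = (round(precision, 2), round(recall, 2), round(f1, 2))
    return results

nb_cm = [[1, 0], [1, 1]]  # Naive Bayes matrix from the output above
lr_cm = [[1, 0], [2, 0]]  # Logistic Regression matrix from the output above
print(per_class_metrics(nb_cm))  # {0: (0.5, 1.0, 0.67), 1: (1.0, 0.5, 0.67)}
print(per_class_metrics(lr_cm))  # {0: (0.33, 1.0, 0.5), 1: (0.0, 0.0, 0.0)}
```

In the Logistic Regression run no test article was predicted as class 1, so precision for that class has a zero denominator. scikit-learn reports 0.0 in this situation and emits a warning.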



