<a href="https://colab.research.google.com/github/hegame1998/NLP-Assignment/blob/main/NLP_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook takes the following approach:
*   Text classification into two classes: geographic and non-geographic.

*   Use of Wikipedia API to extract text content.

*   Preprocessing with NLTK (stop word removal, lemmatization using WordNet).

*   Classification using Naive Bayes and Logistic Regression.

*   Bag-of-Words (BoW) as the feature extraction technique.

#Data Collection

Wikipedia articles for both classes (geographic & non-geographic).

In [None]:
!pip install wikipedia-api


In [None]:
import wikipediaapi
import os

In [None]:
# Set up Wikipedia API with a valid user-agent
wiki_wiki = wikipediaapi.Wikipedia(
    language='en',
    user_agent='YourAppName/1.0 (your_email@example.com)'  # Replace with your info
)

In [None]:
# Make sure the directory exists
os.makedirs('data', exist_ok=True)

def get_wikipedia_text(title):
    page = wiki_wiki.page(title)
    if page.exists():
        return page.text
    return ""

def save_articles(titles, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        for title in titles:
            text = get_wikipedia_text(title)
            if text:
                f.write(f"{title}\n{text}\n---END---\n")
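
The file layout written by `save_articles` is one record per article: the title on its own line, the article text, then a `---END---` line. A minimal sketch of that round trip on in-memory strings might look like the following. The helper names (`format_record`, `parse_records`) and the sample text are hypothetical and are not part of the notebook.

```python
# Hypothetical helpers that mirror the on-disk layout used by save_articles.
DELIMITER = "---END---\n"

def format_record(title, text):
    """Return one record: title line, article text, delimiter line."""
    return f"{title}\n{text}\n{DELIMITER}"

def parse_records(blob):
    """Split a blob of records back into (title, text) pairs."""
    records = []
    for chunk in blob.split(DELIMITER):
        if not chunk.strip():
            continue  # skip the empty piece left after the final delimiter
        # The first line is the title; everything after it is the article body.
        title, _, text = chunk.partition("\n")
        records.append((title, text.rstrip("\n")))
    return records

# Made-up sample content, not real Wikipedia text.
blob = (
    format_record("Italy", "Italy is a country.\nIt is in Europe.")
    + format_record("Photosynthesis", "Plants convert light into energy.")
)

for title, text in parse_records(blob):
    print(title, "->", len(text.split()), "words")
```

One limitation of this layout: an article whose text happens to contain a `---END---` line would be split into two records.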

In [None]:
# Example usage
if __name__ == "__main__":
    geo_titles = ['Italy', 'Mount Everest', 'Amazon River', 'New York City']
    non_geo_titles = ['Python (programming language)', 'Photosynthesis', 'Quantum mechanics', 'World War II']

    save_articles(geo_titles, 'data/geo_articles.txt')
    save_articles(non_geo_titles, 'data/non_geo_articles.txt')




#Preprocessing

Tokenization, lowercasing, stopword removal, lemmatization.

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that newer NLTK releases look for instead of 'punkt'
nltk.download('stopwords')
nltk.download('wordnet')

In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
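    # Lowercase, then collapse every run of non-word characters into a single space
    text = re.sub(r'\W+', ' ', text.lower())
    # Split into tokens (needs the NLTK tokenizer data downloaded above)
    tokens = nltk.word_tokenize(text)
    # Drop stop words and tokens of one or two characters, then lemmatize (noun POS by default)
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words and len(w) > 2]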
    return ' '.join(tokens)
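
To see what the regex and filtering steps in `preprocess` do without downloading any NLTK data, here is a simplified sketch. It assumes a tiny hand-picked stop-word set (`TOY_STOP_WORDS`, hypothetical), uses whitespace splitting in place of `nltk.word_tokenize`, and skips lemmatization entirely, so its output will differ from the real function on most inputs.

```python
import re

# Assumption: a tiny illustrative stop-word set, far smaller than NLTK's English list.
TOY_STOP_WORDS = {"the", "is", "in", "and", "of", "a", "it"}

def toy_preprocess(text):
    """Lowercase, strip punctuation, drop stop words and short tokens (no lemmatization)."""
    # Same normalization as preprocess(): runs of non-word characters become one space.
    text = re.sub(r'\W+', ' ', text.lower())
    # Whitespace splitting stands in for nltk.word_tokenize here.
    tokens = text.split()
    # Keep tokens longer than two characters that are not stop words.
    tokens = [w for w in tokens if w not in TOY_STOP_WORDS and len(w) > 2]
    return ' '.join(tokens)

print(toy_preprocess("The Amazon River is in South America, and it's HUGE!"))
```

Note how the apostrophe in "it's" becomes a space, leaving a stray "s" that the length filter then removes.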


#Feature Extraction

Bag of Words using ***CountVectorizer***.
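
As a rough illustration of what a bag-of-words representation contains, the sketch below builds a count matrix by hand with `collections.Counter`. It is not how `CountVectorizer` is implemented, and the two toy documents and the helper name `bag_of_words` are assumptions. The point is only that each document becomes a row of word counts over a shared vocabulary, and word order is discarded.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [doc.split() for doc in docs]
    # Shared vocabulary: every distinct token across all documents, in sorted order.
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for words that do not occur in this document.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Toy, already-preprocessed documents (made up for illustration).
docs = ["river flows river valley", "quantum theory particle"]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
for row in rows:
    print(row)
```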

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
# preprocess() is defined in the Preprocessing cell above, so no separate import is needed here

In [None]:
def load_data():
    def read_file(path):
        with open(path, 'r', encoding='utf-8') as f:
            articles = f.read().split('---END---\n')
            return [preprocess(article.split('\n', 1)[-1]) for article in articles if article.strip()]

    geo = read_file('data/geo_articles.txt')
    non_geo = read_file('data/non_geo_articles.txt')
    return geo + non_geo, [1]*len(geo) + [0]*len(non_geo)

In [None]:
def classify():
    texts, labels = load_data()
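    # Stratify so both classes appear in the train and test sets, which matters with so few articles
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.3, random_state=42, stratify=labels
    )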

    vectorizer = CountVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Naive Bayes
    nb_model = MultinomialNB()
    nb_model.fit(X_train_vec, y_train)
    nb_preds = nb_model.predict(X_test_vec)

    print("Naive Bayes Results:\n")
    print(confusion_matrix(y_test, nb_preds))
    print(classification_report(y_test, nb_preds))

    # Logistic Regression
    log_model = LogisticRegression(max_iter=200)
    log_model.fit(X_train_vec, y_train)
    log_preds = log_model.predict(X_test_vec)

    print("Logistic Regression Results:\n")
    print(confusion_matrix(y_test, log_preds))
    print(classification_report(y_test, log_preds))

In [None]:
if __name__ == "__main__":
    classify()

#Model Training

Naive Bayes and Logistic Regression classifiers.
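
For intuition about the first classifier, here is a minimal multinomial Naive Bayes sketch in plain Python with add-one (Laplace) smoothing. It is a simplified stand-in, not scikit-learn's implementation; the function names (`train_nb`, `predict_nb`) and the four toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log word likelihoods."""
    vocab = {tok for doc in docs for tok in doc.split()}
    class_docs = Counter(labels)
    # word_counts[c][w] = how often word w occurs in documents of class c
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Additive smoothing keeps unseen (class, word) pairs from getting zero probability.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab

def predict_nb(model, doc):
    """Return the class with the highest log posterior score for doc."""
    log_prior, log_likelihood, vocab = model
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in vocab
        )
    return max(scores, key=scores.get)

# Toy training data (made up): 1 = geographic, 0 = non-geographic.
docs = [
    "river mountain valley",
    "city river coast",
    "quantum particle energy",
    "language code syntax",
]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)
print(predict_nb(model, "mountain river"))
print(predict_nb(model, "particle code"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for full-length articles.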



In [None]:
# classify() is already defined in the cells above,
# so no import from a separate module is needed in the notebook.
if __name__ == "__main__":
    classify()

#Evaluation

Accuracy, Confusion Matrix, and F1 Score.
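
These metrics all derive from the four cells of a binary confusion matrix. The sketch below computes them by hand for a made-up set of labels and predictions (not results from this notebook), using the same layout that `confusion_matrix` prints: rows are true classes, columns are predicted classes, in label order 0 then 1. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Confusion matrix, accuracy, and precision/recall/F1 for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    matrix = [[tn, fp], [fn, tp]]
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return matrix, accuracy, precision, recall, f1

# Made-up labels and predictions for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

matrix, accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(matrix)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```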



In [None]:
Naive Bayes Results:

[[4 0]
 [0 4]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         4
           1       1.00      1.00      1.00         4

    accuracy                           1.00         8
   macro avg       1.00      1.00      1.00         8
weighted avg       1.00      1.00      1.00         8