# Classification and Model Selection

## Classifying newsgroup posts


In the Topic classification and embedding Learning Unit, you looked at four topics from the 20 Newsgroups dataset, which is a collection of English posts collected from Usenet newsgroups. These topics were `talk.politics.misc`, `sci.electronics`, `comp.sys.mac.hardware`, and `rec.autos`. 

For this assignment, you will attempt to classify documents from six of the other topics: `comp.os.ms-windows.misc`, `comp.windows.x`, `rec.motorcycles`, `rec.sport.baseball`, `rec.sport.hockey`, and `sci.space`. The texts provided are completely unprocessed, so you will need to pre-process the data before using a supervised machine learning algorithm to predict the topic of the unlabelled test set.


## Get the data

The data provided consists of three files in the `data/` directory:

* `X_train.csv`: the training set as a CSV file. It contains a single column, `text`, holding the content of each article. You can load it with `pd.read_csv("data/X_train.csv")`.

* `y_train.csv`: the targets for the training set, i.e. the class label of each training article.

* `X_test.csv`: the test data which you should predict class labels for.

You should use only the data provided in these files for your model.

## Get Started

You should start by loading and examining the data to decide what pre-processing is necessary. 
Implement any model of your choice, and save your predictions to a variable `y_pred`.
`y_pred` should be a list of topic names, e.g. `['sci.space', 'rec.motorcycles', ...]`.
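
Before submitting, it can help to check that the predictions really have this shape. The snippet below is only a sketch in plain Python: `EXPECTED_TOPICS`, `check_predictions` and `example_pred` are illustrative names that are not part of the assignment, and the stand-in predictions are made up.

```python
# Hypothetical helper: the topic names are the six listed in the assignment text.
EXPECTED_TOPICS = {
    "comp.os.ms-windows.misc",
    "comp.windows.x",
    "rec.motorcycles",
    "rec.sport.baseball",
    "rec.sport.hockey",
    "sci.space",
}


def check_predictions(y_pred, n_expected):
    """Return a list of problems found in y_pred (an empty list means it looks fine)."""
    problems = []
    if not isinstance(y_pred, list):
        problems.append(f"y_pred should be a list, got {type(y_pred).__name__}")
        return problems
    if len(y_pred) != n_expected:
        problems.append(f"expected {n_expected} predictions, got {len(y_pred)}")
    unknown = sorted({str(label) for label in y_pred} - EXPECTED_TOPICS)
    if unknown:
        problems.append(f"unknown topic names: {unknown}")
    return problems


# Stand-in predictions for illustration only (deliberately not named y_pred)
example_pred = ["sci.space", "rec.motorcycles"]
print(check_predictions(example_pred, n_expected=2))
```

In the notebook you would call it as `check_predictions(y_pred, n_expected=len(X_test))`.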

## Baseline model

As a baseline, we use TF-IDF as a preprocessing step followed by a Multinomial Naive Bayes model. Note that the `TfidfVectorizer` is wrapped in a `ColumnTransformer` so that it is fitted on the `text` column specifically.

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_model():
    preprocessor = ColumnTransformer([("processing", TfidfVectorizer(), "text")])
    return Pipeline([("preprocessor", preprocessor), ("model", MultinomialNB())])
    
m = build_model()
```
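
To see how the baseline is used end to end, here is a small sketch that fits the same pipeline on a toy DataFrame. The four training sentences, their labels and the names `build_baseline`, `toy_X`, `toy_y` and `toy_pred` are invented for illustration; in the assignment you would fit on `X_train` and `y_train` instead.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_baseline():
    # Same structure as build_model() above: TF-IDF on the "text" column, then Naive Bayes
    preprocessor = ColumnTransformer([("processing", TfidfVectorizer(), "text")])
    return Pipeline([("preprocessor", preprocessor), ("model", MultinomialNB())])


# Made-up toy data standing in for X_train / y_train
toy_X = pd.DataFrame({"text": [
    "the shuttle reached orbit after launch",
    "nasa plans a new orbit for the satellite",
    "the goalie stopped the puck in overtime",
    "the team scored on the power play with the puck",
]})
toy_y = pd.Series(["sci.space", "sci.space", "rec.sport.hockey", "rec.sport.hockey"])

baseline = build_baseline()
baseline.fit(toy_X, toy_y)

# predict() expects a DataFrame with the same "text" column
toy_pred = baseline.predict(pd.DataFrame({"text": ["the puck hit the goalie"]})).tolist()
print(toy_pred)
```

Because the column is selected with the string `"text"` rather than the list `["text"]`, the `ColumnTransformer` hands the vectorizer a one-dimensional sequence of documents, which is what `TfidfVectorizer` expects.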

## Custom pipeline components

You may wish to do some processing for which there is not an existing sklearn Transformer object. To do this, you can make a custom one; the basic syntax is as follows:

```python
from sklearn.base import TransformerMixin


class MyTransformer(TransformerMixin):
    
    def fit(self, X, y=None):
        return self 
    
    def transform(self, X, y=None):
        # Insert your custom transformation below for each item in X
        transformed = [do_something(post) for post in X]
        
        if y is not None:
            return transformed, y
        
        return transformed
```
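
As a concrete instance of this template, the sketch below defines a transformer that lower-cases each post and collapses runs of whitespace, then chains it with a `TfidfVectorizer`. The class name `WhitespaceNormalizer` and the two sample posts are made up for illustration; inheriting from `BaseEstimator` as well is optional here and simply adds `get_params`/`set_params` support.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class WhitespaceNormalizer(BaseEstimator, TransformerMixin):
    """Illustrative transformer: lower-cases each post and collapses runs of whitespace."""

    def fit(self, X, y=None):
        # Nothing to learn from the data
        return self

    def transform(self, X):
        # X is expected to be an iterable of strings (e.g. a pandas Series)
        return [re.sub(r"\s+", " ", post).strip().lower() for post in X]


posts = ["Hello\n\n   World", "  Second\tPOST "]
cleaned = WhitespaceNormalizer().fit_transform(posts)
print(cleaned)

# The custom step slots into a Pipeline like any built-in transformer
text_pipeline = Pipeline([
    ("clean", WhitespaceNormalizer()),
    ("tfidf", TfidfVectorizer()),
])
features = text_pipeline.fit_transform(posts)
print(features.shape)
```

Note that iterating over a DataFrame yields its column names rather than its rows, so a transformer written this way should be given the `text` column (a Series) or should select that column itself.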

## Hints

* The text is completely unprocessed, and so includes some raw Usenet information that may not be useful for classification. Have a look through the text to identify what kind of pre-processing you need to do. Often, appropriately pre-processing data will have a greater impact on performance than using more complex models.
* Make sure posts are appropriately de-duplicated. You may not be able to do this entirely within a sklearn `Pipeline`, but you should still use pipelines as far as possible.
* There are constraints on how much memory and time you have both to process all posts, and to train and run the model. With this in mind, make sure you choose a sensible number of features.
* If you choose to lemmatise the text, these constraints may also cause you difficulties. You can speed up the spaCy lemmatiser by using the [small spaCy model](https://spacy.io/models/en#en_core_web_sm), making sure that unnecessary pipeline components are [excluded](https://spacy.io/usage/processing-pipelines#disabling), and using the simpler [lookup-based lemmatizer](https://spacy.io/api/lemmatizer#config-and-implementation) instead of the default rule-based one.
* All posts should be in English. [This Wikipedia article](https://en.wikipedia.org/wiki/Mojibake) may be useful if you find ones which are not. This is challenging, but will increase your score accordingly!
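
The first hint, about raw Usenet information, can be sketched with two regular expressions from the standard library. This is a minimal illustration rather than a complete cleaner: the header names in `HEADER_RE` are an assumption about what the posts contain, and `strip_usenet_noise` and `raw_post` are made-up names and data.

```python
import re

# Assumed header names; inspect the raw posts to see which ones actually occur.
HEADER_RE = re.compile(
    r"^(From|Subject|Date|Newsgroups|Message-ID|Organization|Lines):.*\n?",
    flags=re.MULTILINE | re.IGNORECASE,
)
# Lines that start with ">" (optionally indented) are quoted from an earlier post.
QUOTE_RE = re.compile(r"^[ \t]*>.*\n?", flags=re.MULTILINE)


def strip_usenet_noise(text):
    """Remove header-like lines and quoted lines, then trim surrounding whitespace."""
    text = HEADER_RE.sub("", text)
    text = QUOTE_RE.sub("", text)
    return text.strip()


raw_post = (
    "From: someone@example.com\n"
    "Subject: Re: engine trouble\n"
    "\n"
    "> Did you check the carburettor?\n"
    "Yes, it was clogged.\n"
)
print(strip_usenet_noise(raw_post))  # prints "Yes, it was clogged."
```

Because the header pattern is applied to every line, a body line that happens to start with one of these names is removed as well; whether that trade-off is acceptable depends on the data.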

Good luck!

**💡 You can use any algorithm of your choice.**

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

In [1]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

import re

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string


from sklearn.feature_extraction.text import strip_accents_unicode
from bs4 import BeautifulSoup
from scipy import sparse

import numpy as np

import spacy

from sklearn.linear_model import LogisticRegression

[nltk_data] Downloading package punkt to /Users/hlz/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/hlz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
X_train = pd.read_csv('data/X_train.csv')
y_train = pd.read_csv('data/y_train.csv')
X_test = pd.read_csv('data/X_test.csv')

In [3]:
y_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4262 entries, 0 to 4261
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   4262 non-null   object
dtypes: object(1)
memory usage: 33.4+ KB


In [4]:
X_train

Unnamed: 0,text
0,<html>\n\n <!-- Meta note: this file is par...
1,敓牡档摥眠瑩潨瑵氠捵⁫潦⁲⁡䅆⁑敨敲‮䤠渠敥⁤⁡敬瑦㠠‵獁数据摡੥業牲牯愠摮䠠湯慤眠湡獴␠㔷...
2,"MLB Standings and Scores for Friday, April 16t..."
3,"\n > \n > Hi, everybody:\n > I guess my su..."
4,\nDale Hawerchuk and Troy Murray were both cap...
...,...
4257,"Well, here it is, NHL in the year 2000.\nI got..."
4258,"\nFrankly, no. Offense and defense are equall..."
4259,ऊ桔瑡䤠搠摩渠瑯搠㭯栠睯癥牥‬桴⁥慳灭敬戠汯⁴⁉潴歯琠⁯桴⁥瑳牯⁥楦ੴ慲桴牥眠汥⁬湩琠敨映汯...
4260,"\n\nMake that ten, not eight. The Mets and Ast..."


In [5]:
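# Drop repeated posts; compare on the text itself so that an index-like
# column (e.g. "Unnamed: 0") cannot make identical posts look distinct
X_train = X_train.drop_duplicates(subset='text')
# Keep only the labels of the surviving rows
y_train = y_train.loc[X_train.index]
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)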
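
The same de-duplication idea can be tried on a toy example to confirm that features and labels stay aligned. This is a sketch under the assumption that `X` and `y` share the same index; `deduplicate`, `toy_X` and `toy_y` are illustrative names and the rows are made up.

```python
import pandas as pd


def deduplicate(X, y, column="text"):
    """Drop rows of X whose text repeats an earlier row, together with the matching rows of y."""
    keep = ~X[column].duplicated(keep="first")
    return X.loc[keep].reset_index(drop=True), y.loc[keep].reset_index(drop=True)


# Made-up stand-ins for X_train / y_train
toy_X = pd.DataFrame({"text": ["post a", "post b", "post a", "post c"]})
toy_y = pd.DataFrame(
    {"label": ["sci.space", "rec.motorcycles", "sci.space", "comp.windows.x"]}
)

dedup_X, dedup_y = deduplicate(toy_X, toy_y)
print(dedup_X["text"].tolist())
print(dedup_y["label"].tolist())
```

Only exact repeats are caught this way; posts that differ in whitespace or quoting would need to be normalised first.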

In [6]:
# remove mojibake entries

# def filter_non_ascii(text):
#     return text.isascii()

# indices = X_train['text'].apply(filter_non_ascii)

# X_train = X_train[indices]
# y_train = y_train[indices]

# X_train.reset_index(inplace=True, drop=True)
# y_train.reset_index(inplace=True, drop=True)

In [7]:
# # below didn't work

# def try_decode(text):
#     try:
#         # Attempt to encode it back to bytes under the incorrect encoding
#         bytes_text = text.encode('Latin-1')
#         # Then decode it correctly
#         return bytes_text.decode('utf-8')
#     except UnicodeEncodeError:
#         return text  # Return the original text if encoding fails
#     except UnicodeDecodeError:
#         return text  # Return the original text if decoding fails
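
A Latin-1/UTF-8 round trip targets a different kind of mojibake from the one visible in the sample rows above. Those rows look consistent with plain ASCII bytes that were decoded as UTF-16 (little-endian), where each pair of ASCII characters becomes one CJK-looking character. The sketch below reverses that under this assumption; `repair_utf16_mojibake` is an illustrative name, and the garbled string is produced from clean text rather than copied from the dataset.

```python
def repair_utf16_mojibake(text):
    """Undo ASCII text that was wrongly decoded as UTF-16-LE; return other text unchanged."""
    if text.isascii():
        return text  # plain ASCII needs no repair
    try:
        # Re-create the original bytes, then read them as the ASCII they were
        repaired = text.encode("utf-16-le").decode("ascii")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of mojibake
    # A NUL character means some characters were genuine ASCII, so leave the text alone
    if "\x00" in repaired:
        return text
    return repaired


clean = "Searched without"                           # even number of characters
garbled = clean.encode("ascii").decode("utf-16-le")  # simulate the corruption
print(garbled)
print(repair_utf16_mojibake(garbled))
```

Applied to the data, this would be something like `X_train['text'].apply(repair_utf16_mojibake)`. Posts with an odd number of bytes may have lost or gained a trailing character during the corruption, so it is worth inspecting a few repaired posts by eye.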

In [8]:
class MyTransformer(TransformerMixin):

    def __init__(self):
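        # Load the small spaCy model (only needed for the optional lemmatization step)
        self.nlp = spacy.load("en_core_web_sm")
        # Intended to keep only a lookup lemmatizer active. Caution: the stock
        # en_core_web_sm pipeline has no component with this name, so this call
        # likely disables every component; the lemmatizer would need to be
        # re-added in lookup mode for _lemmatize_text to return lemmas.
        self.nlp.select_pipes(enable="lemmatizer_lookup")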
    
    def fit(self, X, y=None):
        return self 
    
    def transform(self, X, y=None):
        # Insert your custom transformation below for each item in X
        transformed = X.copy()
        
        # strip accents and convert to lower case
        transformed['text'] = transformed['text'].apply(lambda x: strip_accents_unicode(x).lower())

        # remove html
        transformed['text'] = transformed['text'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())

        # remove usenet info
        transformed['text'] = transformed['text'].apply(self._remove_usenet_info)

        # Tokenize, remove punctuation and stop words (stemming is currently disabled)
        transformed['text'] = transformed['text'].apply(self._preprocess_text)

        # Lemmatization with SpaCy
        # transformed['text'] = transformed['text'].apply(self._lemmatize_text)

        if y is not None:
            return transformed, y

        return transformed['text'].values

    def _remove_usenet_info(self, text):
        # More conservative approach: Remove only specific headers known to appear in Usenet posts
        # Adjust the regex to match only the headers you're sure about
        headers_to_remove = ['From:', 'Subject:', 'Date:', 'Newsgroups:', 'Message-ID:']
        for header in headers_to_remove:
            text = re.sub(r'^' + re.escape(header) + r'\s.*\n?', '', text, flags=re.MULTILINE)
        
        # Remove quoted text: Ensure we're only removing lines that start with ">"
        # This was already pretty conservative, but ensure it's only targeting these lines
        text = re.sub(r'^\s*>.*\n?', '', text, flags=re.MULTILINE | re.IGNORECASE)
        
        return text

    def _preprocess_text(self, text):
        tokens = word_tokenize(text)
        tokens = [token for token in tokens if token not in string.punctuation]
        stop_words = set(stopwords.words('english'))
        tokens = [token for token in tokens if token not in stop_words]
        # print(f'Tokens are {tokens}')

        # option 0: without stemming and lemmatizing
        preprocessed_text = ' '.join(tokens)
        
        # option 1: stemming
        # stemmer = PorterStemmer()
        # stemmed_tokens = [stemmer.stem(token) for token in tokens]
        # preprocessed_text = ' '.join(stemmed_tokens)
        
        # option 2: lemmatizing
        # lemmatized_tokens = [self._lemmatize_text(token) for token in tokens]
        # print(f'\nLemmatized tokens are {lemmatized_tokens}')
        # cleaned_tokens = [token for token in lemmatized_tokens if token.strip() and token not in {"` `", "''", ""}]
        # print(f'\nCleaned lemmatized tokens are {cleaned_tokens}')
        # preprocessed_text = ' '.join(cleaned_tokens)
        
        return preprocessed_text

    def _lemmatize_text(self, text):
        doc = self.nlp(text)
        # Generate lemmatized tokens, excluding punctuation and whitespace
        lemmas = [token.lemma_ for token in doc if not (token.is_punct | token.is_space | token.is_stop)]
        # Re-join lemmatized tokens into a single string
        lemmatized_text = ' '.join(lemmas).strip()
        return lemmatized_text



def build_model():
    """This function builds a new model and returns it.

    You should use a sklearn Pipeline object where appropriate.
    The model should be implemented as a sklearn Pipeline object.

    Your pipeline needs to have two or more steps:
    - preprocessor(s): Transformer object(s) that can transform a dataset
    - model: a predictive model object that can be trained and generate predictions
    
    You may deviate from this format, as well as the return value specified below, 
    if necessary for de-duplication.

    :return: a new instance of your model
    """

    pipeline = Pipeline([
        ('preprocessor', MyTransformer()),
        ('vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
        ('model', LogisticRegression(solver='liblinear'))
    ])
    
    return pipeline
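
Adding bigrams with `ngram_range=(1, 2)` enlarges the vocabulary considerably, which matters given the memory and time constraints mentioned in the hints. The sketch below shows how `max_features` and `min_df` cap the vocabulary of a `TfidfVectorizer`; the four documents and the specific limits are made up for illustration and are not tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration
docs = [
    "the rocket reached orbit",
    "the rocket engine failed",
    "the goalie saved the shot",
    "the goalie missed the puck",
]

# Unigrams and bigrams with no limit on the vocabulary
full = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

# Same n-gram range, but keep at most 5 terms and ignore terms seen in fewer than 2 documents
capped = TfidfVectorizer(ngram_range=(1, 2), max_features=5, min_df=2).fit(docs)

print(len(full.vocabulary_), len(capped.vocabulary_))
```

In `build_model`, the same arguments could be passed to the `'vectorizer'` step; sensible values depend on the size of the training set and are best chosen by validation.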

In [9]:
pipeline = build_model()
# Pass the labels as a 1-D Series; a one-column DataFrame triggers a DataConversionWarning
pipeline.fit(X_train, y_train['label'])


In [10]:
# sample_texts = X_train['text'].head(1)
# sample_texts = sample_texts.apply(lambda x: strip_accents_unicode(x).lower())
# sample_texts = sample_texts.apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())
# sample_texts = sample_texts.apply(MyTransformer()._remove_usenet_info)
# preprocessed_texts = sample_texts.apply(MyTransformer()._preprocess_text)

# print("Original texts:\n", sample_texts)
# print("\nPreprocessed texts:\n", preprocessed_texts)

In [11]:
y_pred = pipeline.predict(X_test).tolist()



In [12]:
y_pred

['comp.windows.x',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'rec.motorcycles',
 'rec.motorcycles',
 'rec.sport.baseball',
 'sci.space',
 'comp.os.ms-windows.misc',
 'comp.windows.x',
 'comp.windows.x',
 'comp.os.ms-windows.misc',
 'comp.windows.x',
 'rec.sport.baseball',
 'comp.os.ms-windows.misc',
 'sci.space',
 'rec.sport.hockey',
 'rec.motorcycles',
 'comp.os.ms-windows.misc',
 'comp.windows.x',
 'rec.motorcycles',
 'rec.motorcycles',
 'rec.motorcycles',
 'comp.os.ms-windows.misc',
 'rec.sport.baseball',
 'sci.space',
 'comp.os.ms-windows.misc',
 'rec.sport.baseball',
 'comp.windows.x',
 'rec.sport.baseball',
 'comp.os.ms-windows.misc',
 'comp.os.ms-windows.misc',
 'rec.sport.hockey',
 'rec.motorcycles',
 'comp.os.ms-windows.misc',
 'comp.os.ms-windows.misc',
 'sci.space',
 'rec.sport.hockey',
 'comp.windows.x',
 'rec.motorcycles',
 'rec.sport.hockey',
 'rec.motorcycles',
 'comp.os.ms-windows.misc',
 'sci.space',
 'rec.sport.hockey',
 'comp.windows.x',
 'rec.motorcycles',
 'rec.