# M161 first question notebook: best model LogisticRegressionCV, 15000 instances and max_features=10000 vectorization

## Data preprocessing
### Data cleaning I
 1. check types 
 2. check for null values
 3. check duplicates
 4. keep the first 15000 instances to reduce computation load


In [1]:
import pandas as pd
file_path = 'bigdata2025classification/train.csv'

def load_and_process_data(file_path):
    # Load data from a CSV file
    dataTrain = pd.read_csv(file_path)

    print("Data loaded successfully.")
    print("First 5 rows of the dataset:")
    print(dataTrain.head())

    print("\nData summary:")
    # info() prints its summary itself and returns None, so it is not wrapped in print()
    dataTrain.info()

    # Check for missing values in the dataframe
    print("\nMissing values in each column:")
    print(dataTrain.isnull().sum())
    
    return dataTrain

dataTrain = load_and_process_data(file_path)

# check column data types
def check_column_types(dataTrain):
    print("\nColumn data types:")
    print(dataTrain.dtypes)

check_column_types(dataTrain)


print(f"dataTrain shape: {dataTrain.shape}")


Data loaded successfully.
First 5 rows of the dataset:
       Id                                              Title  \
0  227464  Netflix is coming to cable boxes, and Amazon i...   
1  244074  Pharrell, Iranian President React to Tehran 'H...   
2   60707                    Wildlife service seeks comments   
3   27883  Facebook teams up with Storyful to launch 'FB ...   
4  169596           Caesars plans US$880 mln New York casino   

                                             Content          Label  
0   if you subscribe to one of three rinky-dink (...  Entertainment  
1   pharrell, iranian president react to tehran '...  Entertainment  
2   the u.s. fish and wildlife service has reopen...     Technology  
3   the very nature of social media means it is o...     Technology  
4   caesars plans us$880 mln new york casino jul ...       Business  

Data summary:
<class 'pandas.DataFrame'>
RangeIndex: 111795 entries, 0 to 111794
Data columns (total 4 columns):
 #   Column   Non-Null Cou

## Keeping 15000 instances to reduce computation load

In [2]:
# Keep only the first 15000 instances for faster experimentation
dataTrain = dataTrain.iloc[:15000].reset_index(drop=True)
print(f"Subset shape: {dataTrain.shape}")

Subset shape: (15000, 4)


## Duplicate removal based on the Title and Content columns together

In [3]:
# Remove duplicates based on 'Title' and 'Content' columns, keeping the first occurrence
dataTrain = dataTrain.drop_duplicates(subset=['Title', 'Content'], keep='first')
print("\nDuplicates based on Title and Content removed. Data shape:", dataTrain.shape)


# Reset the index after removing duplicates
dataTrain = dataTrain.reset_index(drop=True)
print("\nIndex reset. Data shape:", dataTrain.shape)
dataTrain.info()


Duplicates based on Title and Content removed. Data shape: (14988, 4)

Index reset. Data shape: (14988, 4)
<class 'pandas.DataFrame'>
RangeIndex: 14988 entries, 0 to 14987
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Id       14988 non-null  int64
 1   Title    14988 non-null  str  
 2   Content  14988 non-null  str  
 3   Label    14988 non-null  str  
dtypes: int64(1), str(3)
memory usage: 468.5 KB



### Data statistics


In [4]:
import numpy as np
dataTrain['text_length'] = dataTrain['Content'].apply(len)
dataTrain['word_count'] = dataTrain['Content'].apply(lambda x: len(str(x).split()))
dataTrain['sentence_count'] = dataTrain['Content'].apply(lambda x: len(str(x).split('.')))
dataTrain['avg_word_length'] = dataTrain['Content'].apply(lambda x: np.mean([len(word) for word in str(x).split()]))

print("\n--- Content Statistics ---")
print(dataTrain[['text_length', 'word_count', 'sentence_count', 'avg_word_length']].describe())


--- Content Statistics ---
        text_length    word_count  sentence_count  avg_word_length
count  14988.000000  14988.000000    14988.000000     14988.000000
mean    2561.735722    422.912797       24.663531         5.071503
std     2196.271772    364.721403       25.236777         0.825487
min       17.000000      3.000000        1.000000         3.500000
25%     1302.750000    216.000000       12.000000         4.835203
50%     2025.000000    336.000000       19.000000         5.054566
75%     3142.000000    515.000000       30.000000         5.268253
max    78614.000000  12562.000000     1343.000000        95.820513



### Remove words not in English dictionary

- **A different dictionary would probably give better results, but this one works.**
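
To see what this filter does before running it on the full dataset, here is a minimal sketch on one headline from the data. The `toy_vocabulary` set and the `keep_dictionary_words` helper are hypothetical stand-ins for the NLTK `words` corpus and the notebook's own function; which words the real corpus contains is not shown here.

```python
import re

# Hypothetical stand-in for set(nltk.corpus.words.words()); the real corpus is far larger.
toy_vocabulary = {"is", "coming", "to", "cable", "and"}

def keep_dictionary_words(text, vocabulary):
    """Keep only tokens whose lowercase form appears in the vocabulary."""
    tokens = re.findall(r"\b\w+\b", str(text))
    return " ".join(t for t in tokens if t.lower() in vocabulary)

sample = "Netflix is coming to cable boxes, and Amazon"
filtered = keep_dictionary_words(sample, toy_vocabulary)
print(filtered)  # is coming to cable and
```

Anything missing from the vocabulary is dropped silently, including brand names and inflected forms, so the choice of dictionary directly controls how much of each article survives.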


In [5]:
import re
import nltk
from nltk.corpus import words

# Download the words corpus if not already present
nltk.download('words')
english_words = set(words.words())

def remove_non_english_words(text):
    # Split text into words
    word_list = re.findall(r'\b\w+\b', str(text))
    cleaned_words = []
    for word in word_list:
        # Keep a word only if its lowercase form is in the dictionary
        if word.lower() in english_words:
            cleaned_words.append(word)
    return ' '.join(cleaned_words)

# Apply to both columns
dataTrain['Title'] = dataTrain['Title'].apply(remove_non_english_words)
dataTrain['Content'] = dataTrain['Content'].apply(remove_non_english_words)


[nltk_data] Downloading package words to
[nltk_data]     C:\Users\odys_\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


## Text clean up 
1. Expand contractions
2. Convert to lowercase
3. Remove special characters (keep only letters and spaces)
4. Remove extra spaces
5. Remove stopwords, lemmatize, and stem
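
Steps 2 to 4 can be tried in isolation with the standard library alone. The sketch below applies the same two regular expressions to a made-up sample string (an assumption for illustration, not a row from the dataset); contraction expansion, stopword removal, lemmatization and stemming are left out because they need the external packages.

```python
import re

def basic_normalize(text):
    """Lowercase, drop everything except a-z and whitespace, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)      # digits and punctuation are deleted, not replaced by a space
    text = re.sub(r"\s+", " ", text).strip()  # runs of whitespace become one space
    return text

# Hypothetical sample string
sample = "Caesars plans US$880 mln   New York casino, a rinky-dink deal!"
normalized = basic_normalize(sample)
print(normalized)  # caesars plans us mln new york casino a rinkydink deal
```

Because removed characters are deleted rather than replaced by a space, a hyphenated word such as "rinky-dink" fuses into a single token when this step is applied to raw text.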

In [6]:
import re
import nltk
import contractions
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Download required NLTK data if not already present
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def clean_text(text):
    # Expand contractions
    text = contractions.fix(text)
    # Convert to lowercase
    text = text.lower()
    # Remove special characters (keep only letters and spaces)
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    # Tokenize
    words = text.split()
    # Remove stopwords, lemmatize, and stem
    words = [stemmer.stem(lemmatizer.lemmatize(word)) for word in words if word not in stop_words]
    text = ' '.join(words)
    return text

for col in ['Title', 'Content']:
    dataTrain[col] = dataTrain[col].astype(str).apply(clean_text)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\odys_\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\odys_\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\odys_\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


### Printing the first 5 rows to see what happened to the text

In [7]:
print(dataTrain.head())

       Id                       Title  \
0  227464  come cabl groceri overlord   
1  244074          presid react happi   
2   60707              wildlif servic   
3   27883                      launch   
4  169596           u new york casino   

                                             Content          Label  \
0  subscrib one three dink compar speak cabl abl ...  Entertainment   
1  presid react happi singer presid took twitter ...  Entertainment   
2  fish wildlif servic comment period addit day p...     Technology   
3  natur social medium often sourc real time brea...     Technology   
4  u new york casino latest news top deck world e...       Business   

   text_length  word_count  sentence_count  avg_word_length  
0         1576         264              15         4.965909  
1         1200         192              10         5.250000  
2         2773         416              34         5.665865  
3         1564         254              13         5.157480  
4         2250  

## Starting feature extraction (converting text to numbers so ML algorithms can run)





### TF-IDF Vectorization

We will now use TF-IDF vectorization instead of Bag of Words to represent the text data for classification. TF-IDF often improves performance by reducing the impact of common words and highlighting more informative terms.
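
As a rough illustration of why common words lose weight, the sketch below computes TF-IDF by hand on three made-up documents. It assumes the smoothed formula that scikit-learn documents for its default settings, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The `tfidf` helper and the toy documents are hypothetical, and tokenization is a plain whitespace split rather than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Hypothetical toy corpus; "new" appears in every document.
docs = [
    "casino plan new york casino",
    "new wildlife service comment",
    "new social media launch",
]

def tfidf(documents):
    """Return (idf per term, one L2-normalized weight dict per document)."""
    tokenized = [d.split() for d in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows

idf, rows = tfidf(docs)
print(sorted(rows[0].items(), key=lambda kv: -kv[1]))
```

In the first document, "casino" ends up with the largest weight because it is frequent there and absent elsewhere, while "new" gets the smallest because it occurs in every document.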

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Combine Title and Content if not already done
if 'Combined' not in dataTrain.columns:
    dataTrain['Combined'] = dataTrain['Title'].fillna('') + ' ' + dataTrain['Content'].fillna('')

# Initialize TF-IDF Vectorizer
# You can tune max_features, ngram_range, etc. for further improvement
vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=10000)
dataTrain_tfidf = vectorizer.fit_transform(dataTrain['Combined'])

print('TF-IDF matrix shape:', dataTrain_tfidf.shape)
print('Feature names (first 20):', vectorizer.get_feature_names_out()[:20])

TF-IDF matrix shape: (14988, 10000)
Feature names (first 20): ['aa' 'abandon' 'abdomin' 'abid' 'abil' 'abl' 'abl access' 'abl find'
 'abl get' 'abl make' 'abl see' 'abl take' 'abl use' 'abnorm' 'aboard'
 'abort' 'abroad' 'abrupt' 'abruptli' 'absenc']


### Logistic Regression with Built-in Cross-Validation (LogisticRegressionCV)

We will now use `LogisticRegressionCV` from scikit-learn, which performs cross-validated logistic regression and automatically tunes the regularization parameter.

In [9]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report

# Assume the target column is named 'Label' (change if needed)
if 'Label' not in dataTrain.columns:
    print("ERROR: 'Label' column not found in dataTrain. Please check your dataset.")
else:
    X = dataTrain_tfidf
    y = dataTrain['Label']
    clf_cv = LogisticRegressionCV(cv=5, max_iter=1000, random_state=42, class_weight='balanced', scoring='accuracy', n_jobs=6)
    clf_cv.fit(X, y)
    # Predictions are made on the same rows used for fitting, so the report
    # below measures training performance, not held-out performance.
    y_pred_cv = clf_cv.predict(X)
    print("Best C values per class:", clf_cv.C_)
    print('\n ***********************')

    print("\nClassification Report (LogisticRegressionCV, training data):\n", classification_report(y, y_pred_cv, zero_division=0))

    print ('\n classification accuracy a=', clf_cv.score(X, y))




Best C values per class: [21.5443469 21.5443469 21.5443469 21.5443469]

 ***********************

Classification Report (LogisticRegressionCV, training data):
                precision    recall  f1-score   support

     Business       0.99      1.00      1.00      3372
Entertainment       1.00      1.00      1.00      5949
       Health       1.00      1.00      1.00      1610
   Technology       1.00      1.00      1.00      4057

     accuracy                           1.00     14988
    macro avg       1.00      1.00      1.00     14988
 weighted avg       1.00      1.00      1.00     14988


 classification accuracy a= 0.998065118761676
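
The accuracy above is measured on the same rows the model was fitted on, so it is likely to be optimistic. A minimal sketch of a held-out check follows, under some assumptions: the toy `texts` and `labels` are made up and stand in for `dataTrain['Combined']` and `dataTrain['Label']`, and plain `LogisticRegression` is swapped in for `LogisticRegressionCV` only because the toy set is too small for inner cross-validation. Putting the vectorizer inside the pipeline means TF-IDF is fitted on the training split only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 8 documents per class
business = [
    "casino plan invest market", "bank profit share market",
    "stock price fall invest", "company sale profit rise",
    "market trade bank loan", "invest fund share price",
    "profit loss company stock", "loan rate bank market",
]
health = [
    "doctor patient virus vaccine", "hospital nurse patient care",
    "virus outbreak health risk", "vaccine trial doctor study",
    "patient diet health study", "disease risk hospital care",
    "nurse care virus disease", "health diet doctor risk",
]
texts = business + health
labels = ["Business"] * len(business) + ["Health"] * len(health)

# Hold out 25% of the rows, keeping the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# The vectorizer is fitted inside the pipeline, on the training rows only
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.3f}, held-out accuracy: {test_acc:.3f}")
```

On the real data, the held-out figure is the one to compare between models; a large gap between the two numbers would point to overfitting.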


## Writing test predictions to file testSet_categories.csv

In [10]:

# # Load test data
# test_file_path = 'bigdata2025classification/test_without_labels.csv'
# test_data = pd.read_csv(test_file_path)

# # Apply the same text preprocessing to the test data

# # Remove non-English words
# test_data['Title'] = test_data['Title'].apply(remove_non_english_words)
# test_data['Content'] = test_data['Content'].apply(remove_non_english_words)

# # Clean text (expand contractions, lowercase, remove special chars, stopwords, lemmatize, stem)
# for col in ['Title', 'Content']:
#     test_data[col] = test_data[col].astype(str).apply(clean_text)

# # Combine Title and Content for test data (same as training)
# test_data['Combined'] = test_data['Title'].fillna('') + ' ' + test_data['Content'].fillna('')

# # Apply the same TF-IDF vectorizer to test data
# test_tfidf = vectorizer.transform(test_data['Combined'])

# # Predict labels using the trained classifier
# test_pred = clf_cv.predict(test_tfidf)

# # Prepare output DataFrame
# output_df = pd.DataFrame({
#     'Id': test_data['Id'],
#     'Predicted': test_pred
# })

# # Write predictions to CSV
# output_df.to_csv('testSet_categories.csv', index=False)
# print("Predictions written to testSet_categories.csv")