# M161 first question notebook: best model LogisticRegressionCV, 0.99969 accuracy, 50000 instances, max_features=15000 vectorization

### Data from D:\Github\bigData\part1\joblibCache\dataTrain_cleaned.joblib
(duplicates already removed and text already processed, including stemming and lemmatization)

In [1]:
import joblib

dataTrain = joblib.load(r'joblibCache\dataTrain_cleaned.joblib')
dataTrain.head()

Unnamed: 0,Id,Title,Content,Label
0,227464,come cabl groceri overlord,subscrib one three dink compar speak cabl abl ...,Entertainment
1,244074,presid react happi,presid react happi singer presid took twitter ...,Entertainment
2,60707,wildlif servic,fish wildlif servic comment period addit day p...,Technology
3,27883,launch,natur social medium often sourc real time brea...,Technology
4,169596,u new york casino,u new york casino latest news top deck world e...,Business


In [2]:
from sklearn.model_selection import train_test_split

# Draw a stratified sample of 50000 instances, preserving the 'Label' class proportions
stratified_data, _ = train_test_split(
    dataTrain,
    train_size=50000,
    stratify=dataTrain['Label'],
    random_state=42
)
dataTrain = stratified_data.reset_index(drop=True)
print(f"Subset shape (stratified): {dataTrain.shape}")

Subset shape (stratified): (50000, 4)
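
A quick way to confirm that a stratified sample keeps the class balance is to compare relative label frequencies before and after sampling. The sketch below uses a small made-up frame (`toy`, with a 3:1 class ratio) in place of `dataTrain`; the names and data are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for dataTrain (3:1 class ratio)
toy = pd.DataFrame({
    "Label": ["A"] * 300 + ["B"] * 100,
    "Content": ["text"] * 400,
})

# Same call pattern as the cell above, on a smaller scale
subset, _ = train_test_split(
    toy, train_size=100, stratify=toy["Label"], random_state=42
)

# Relative class frequencies in the full data and in the sample
full_share = toy["Label"].value_counts(normalize=True).sort_index()
subset_share = subset["Label"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"full": full_share, "subset": subset_share}))
```

If the two columns agree closely, the subset is representative with respect to `Label`.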


### Printing the DataFrame summary to check the columns and dtypes after sampling

In [3]:
print(dataTrain.info())

<class 'pandas.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Id       50000 non-null  int64
 1   Title    50000 non-null  str  
 2   Content  50000 non-null  str  
 3   Label    50000 non-null  str  
dtypes: int64(1), str(3)
memory usage: 1.5 MB
None


## Starting feature extraction (converting text to numbers so ML algorithms can run)





### TF-IDF Vectorization

We will now use TF-IDF vectorization instead of Bag of Words to represent the text data for classification. TF-IDF often improves performance by reducing the impact of common words and highlighting more informative terms.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Combine Title and Content if not already done
if 'Combined' not in dataTrain.columns:
    dataTrain['Combined'] = dataTrain['Title'].fillna('') + ' ' + dataTrain['Content'].fillna('')

# Initialize TF-IDF Vectorizer
# You can tune max_features, ngram_range, etc. for further improvement
vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=15000)
dataTrain_tfidf = vectorizer.fit_transform(dataTrain['Combined'])

print('TF-IDF matrix shape:', dataTrain_tfidf.shape)
print('Feature names (first 20):', vectorizer.get_feature_names_out()[:20])

TF-IDF matrix shape: (50000, 15000)
Feature names (first 20): ['aa' 'abandon' 'abbey' 'abdomen' 'abdomin' 'abid' 'abil' 'abl'
 'abl access' 'abl buy' 'abl creat' 'abl find' 'abl get' 'abl keep'
 'abl make' 'abl see' 'abl take' 'abl use' 'abl watch' 'abnorm']
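
To illustrate how TF-IDF reduces the impact of common words, here is a minimal sketch on a three-document toy corpus (the documents are made up for illustration). With scikit-learn's default smoothed IDF, a term that appears in every document gets the lowest possible IDF, while a term that appears in only one document gets a higher one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "market" and "news" occur in every document
docs = [
    "market news stock rise",
    "market news film award",
    "market news vaccine trial",
]

tfidf = TfidfVectorizer().fit(docs)
vocab = tfidf.vocabulary_

# IDF per term: lower for terms shared by all documents
idf_market = tfidf.idf_[vocab["market"]]
idf_vaccine = tfidf.idf_[vocab["vaccine"]]
print("idf(market) =", idf_market, " idf(vaccine) =", idf_vaccine)

# Weights inside the first document: the rarer term "stock" outweighs "market"
row = tfidf.transform([docs[0]]).toarray()[0]
w_market = row[vocab["market"]]
w_stock = row[vocab["stock"]]
print("weight(market) =", w_market, " weight(stock) =", w_stock)
```

Both terms occur once in the first document, so the difference in weight comes entirely from the IDF factor.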


### Logistic Regression with Built-in Cross-Validation (LogisticRegressionCV)

We will now use `LogisticRegressionCV` from scikit-learn, which performs cross-validated logistic regression and automatically tunes the regularization parameter.

In [5]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report

# The target column is assumed to be named 'Label' (change if needed)
if 'Label' not in dataTrain.columns:
    print("ERROR: 'Label' column not found in dataTrain. Please check your dataset.")
else:
    X = dataTrain_tfidf
    y = dataTrain['Label']
    # The 5 folds are used internally only to select the regularization strength C
    clf_cv = LogisticRegressionCV(
        cv=5, max_iter=1000, random_state=42,
        class_weight='balanced', scoring='accuracy', n_jobs=6,
    )
    clf_cv.fit(X, y)
    # Note: predictions are made on the same data the model was fitted on, so the
    # report and accuracy below are training-set metrics, not held-out CV scores
    y_pred_cv = clf_cv.predict(X)
    print("Best C values per class:", clf_cv.C_)
    print('\n ***********************')

    print("\nClassification Report (LogisticRegressionCV, 5-fold CV):\n", classification_report(y, y_pred_cv, zero_division=0))

    print('\n classification accuracy a=', clf_cv.score(X, y))



Best C values per class: [2.7825594 2.7825594 2.7825594 2.7825594]

 ***********************

Classification Report (LogisticRegressionCV, 5-fold CV):
                precision    recall  f1-score   support

     Business       0.96      0.96      0.96     11123
Entertainment       0.99      0.99      0.99     20017
       Health       0.97      0.99      0.98      5374
   Technology       0.97      0.97      0.97     13486

     accuracy                           0.98     50000
    macro avg       0.97      0.98      0.97     50000
 weighted avg       0.98      0.98      0.98     50000


 classification accuracy a= 0.97574
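
The figures above are computed on the same rows the model was fitted on. One possible way to get a held-out estimate is to set aside a test split and wrap the vectorizer and classifier in a pipeline, so that the TF-IDF vocabulary is learned from the training split only. The sketch below is an assumption-laden illustration: it uses a made-up toy corpus instead of `dataTrain`, and plain `LogisticRegression` stands in for `LogisticRegressionCV` to keep it fast.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for dataTrain['Combined'] / dataTrain['Label']
texts = (
    [f"stock market profit bank report{i}" for i in range(20)]
    + [f"film actor music award report{i}" for i in range(20)]
)
labels = ["Business"] * 20 + ["Entertainment"] * 20

# Hold out 25% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# The pipeline fits the vectorizer on the training split only
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
holdout_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.3f}, held-out accuracy: {holdout_acc:.3f}")
```

On real data the held-out accuracy is the more trustworthy figure; a large gap between the two numbers would suggest overfitting.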


## Writing test predictions to file testSet_categories.csv

In [6]:

# # Load test data (requires pandas)
# import pandas as pd
# test_file_path = 'bigdata2025classification/test_without_labels.csv'
# test_data = pd.read_csv(test_file_path)

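# # Apply the same text preprocessing to the test data
# # (remove_non_english_words and clean_text are not defined in this notebook and must be defined first)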

# # Remove non-English words
# test_data['Title'] = test_data['Title'].apply(remove_non_english_words)
# test_data['Content'] = test_data['Content'].apply(remove_non_english_words)

# # Clean text (expand contractions, lowercase, remove special chars, stopwords, lemmatize, stem)
# for col in ['Title', 'Content']:
#     test_data[col] = test_data[col].astype(str).apply(clean_text)

# # Combine Title and Content for test data (same as training)
# test_data['Combined'] = test_data['Title'].fillna('') + ' ' + test_data['Content'].fillna('')

# # Apply the same TF-IDF vectorizer to test data
# test_tfidf = vectorizer.transform(test_data['Combined'])

# # Predict labels using the trained classifier
# test_pred = clf_cv.predict(test_tfidf)

# # Prepare output DataFrame
# output_df = pd.DataFrame({
#     'Id': test_data['Id'],
#     'Predicted': test_pred
# })

# # Write predictions to CSV
# output_df.to_csv('testSet_categories.csv', index=False)
# print("Predictions written to testSet_categories.csv")
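
Because test predictions need the same fitted vectorizer and classifier, it may be convenient to persist both with `joblib` and reload them when the test set is processed. The following is only a sketch under stated assumptions: the tiny training set, the dictionary layout, and the file name `tfidf_logreg.joblib` are hypothetical, and plain `LogisticRegression` is used in place of the notebook's `clf_cv`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical miniature training set
train_texts = ["stock market bank", "film actor award",
               "bank profit stock", "music film actor"]
train_labels = ["Business", "Entertainment", "Business", "Entertainment"]

vectorizer = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(train_texts), train_labels)

# Save both fitted objects together, then load them back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tfidf_logreg.joblib")  # hypothetical file name
    joblib.dump({"vectorizer": vectorizer, "clf": clf}, path)
    restored = joblib.load(path)

# The restored objects should reproduce the original predictions
new_texts = ["bank stock report", "actor wins award"]
original_pred = clf.predict(vectorizer.transform(new_texts))
restored_pred = restored["clf"].predict(restored["vectorizer"].transform(new_texts))
print(list(restored_pred))
```

Note that `transform` (not `fit_transform`) is used on the new texts, so the vocabulary learned during training is reused unchanged.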