# Machine Learning Models for Text Classification

In this notebook, we apply several machine learning models to a text classification task. We load the preprocessed text features, train each model, and evaluate it with scikit-learn's classification report and accuracy score.

## 1. Loading Data

We begin by loading the preprocessed data (features and target variable) from saved pickle files.

```python
import pandas as pd
import pickle

# Load feature matrix and target variable
# (the context manager closes the file handle once loading finishes)
with open('Xtrain_matrix.pkl', 'rb') as f:
    X = pickle.load(f)
y = pd.read_pickle('ytrain.pkl')
```

## 2. Data Splitting

The data is split into training and test sets with scikit-learn's `train_test_split`: 80% of the data is used for training and 20% for testing. No `random_state` is set, so the split, and therefore every score reported below, varies from run to run.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
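
For reproducible scores and class proportions that match in both sets, the split can be seeded and stratified. The following is a minimal sketch on hypothetical toy data (`X_demo`, `y_demo`, and the seed `42` are illustrative assumptions, not values from this notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 imbalanced classes (60/30/10).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 30 + [2] * 10)

# stratify keeps the class proportions the same in both splits;
# random_state makes the split repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

train_counts = np.bincount(y_tr)  # samples per class in the training set
test_counts = np.bincount(y_te)   # samples per class in the test set
print(train_counts, test_counts)
```

With the real data, the same two arguments would be passed as `stratify=y` and a fixed `random_state` of your choice.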

---

## 3. Logistic Regression

Logistic Regression is applied to the training data. We use the multinomial option for multi-class classification and balance the class weights to handle imbalanced classes. The model is then evaluated on the test data.

```python
from sklearn import linear_model
from sklearn.metrics import classification_report

clf_lr = linear_model.LogisticRegression(multi_class='multinomial', class_weight='balanced', max_iter=1000)
clf_lr.fit(X_train, y_train.values.ravel())

# Predict and evaluate
y_pred = clf_lr.predict(X_test)
print(classification_report(y_test, y_pred))

# Evaluate accuracy
print(f'Logistic Regression Accuracy: {clf_lr.score(X_test, y_test)}')
```
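
To make the effect of `class_weight='balanced'` concrete: each class receives the weight `n_samples / (n_classes * class_count)`, so rare classes count more in the loss. The sketch below checks this formula against scikit-learn's `compute_class_weight` on hypothetical toy labels (`y_demo` is an assumption for illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 60 / 30 / 10 samples per class.
y_demo = np.array([0] * 60 + [1] * 30 + [2] * 10)
classes = np.unique(y_demo)

# Weights as scikit-learn computes them for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# The same weights from the formula n_samples / (n_classes * class_count).
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

The rarest class gets the largest weight, which is why this option helps when some categories have few examples.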

---

## 4. Support Vector Machine (SVM)

A Support Vector Machine (SVM) is trained with a polynomial kernel (`kernel='poly'`, `gamma=0.01`). It is evaluated with the same classification report and accuracy score.

```python
from sklearn import svm
from sklearn.metrics import classification_report

clf_svm = svm.SVC(gamma=0.01, kernel='poly')
clf_svm.fit(X_train, y_train.values.ravel())

# Predict and evaluate
y_pred = clf_svm.predict(X_test)
print(classification_report(y_test, y_pred))

# Evaluate accuracy
print(f'SVM Accuracy: {clf_svm.score(X_test, y_test)}')
```
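
For reference, the polynomial kernel computes `(gamma * <x, z> + coef0) ** degree`. `SVC` defaults to `degree=3` and `coef0=0.0`, which the model above leaves unchanged. The sketch below evaluates the kernel by hand and with `sklearn.metrics.pairwise.polynomial_kernel` on two hypothetical vectors (`x` and `z` are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Two hypothetical feature vectors.
x = np.array([[1.0, 2.0]])
z = np.array([[3.0, 4.0]])

# gamma as in the SVC above; degree and coef0 are SVC's defaults.
gamma, degree, coef0 = 0.01, 3, 0.0

# Kernel value computed by scikit-learn.
k_sklearn = polynomial_kernel(x, z, degree=degree, gamma=gamma, coef0=coef0)[0, 0]

# Kernel value computed directly from the formula.
dot = float(np.dot(x[0], z[0]))  # 1*3 + 2*4 = 11
k_manual = (gamma * dot + coef0) ** degree

print(k_sklearn, k_manual)
```

Note that `polynomial_kernel` itself defaults to `coef0=1`, unlike `SVC`, so the value is passed explicitly here.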

---

## 5. Random Forest Classifier

Three configurations of the Random Forest Classifier are tested: the defaults, then two that vary the number of features considered at each split (`max_features`) and the minimum number of samples required to split a node (`min_samples_split`). Each configuration is evaluated with the classification report and accuracy score.

```python
from sklearn import ensemble
from sklearn.metrics import classification_report

# First configuration
clf_rf = ensemble.RandomForestClassifier(n_jobs=-1, random_state=321)
clf_rf.fit(X_train, y_train.values.ravel())
y_pred = clf_rf.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Random Forest Accuracy: {clf_rf.score(X_test, y_test)}')

# Second configuration
clf_rf = ensemble.RandomForestClassifier(n_jobs=-1, max_features='sqrt', min_samples_split=4)
clf_rf.fit(X_train, y_train.values.ravel())
y_pred = clf_rf.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Random Forest Accuracy: {clf_rf.score(X_test, y_test)}')

# Third configuration
clf_rf = ensemble.RandomForestClassifier(n_jobs=-1, max_features='log2', min_samples_split=27)
clf_rf.fit(X_train, y_train.values.ravel())
y_pred = clf_rf.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Random Forest Accuracy: {clf_rf.score(X_test, y_test)}')
```
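
To give a sense of what `max_features` changes, the sketch below computes how many features each tree considers per split under `'sqrt'` and `'log2'`. The helper `features_per_split` and the feature count of 10,000 are hypothetical, chosen only for illustration; the rounding mirrors how scikit-learn resolves these options (truncate, with a minimum of 1):

```python
import math


def features_per_split(n_features: int, max_features: str) -> int:
    """Hypothetical helper: number of candidate features per split."""
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    raise ValueError(f"unsupported max_features: {max_features!r}")


# Assumed vocabulary size for a text feature matrix (illustrative only).
n_features = 10_000
sqrt_count = features_per_split(n_features, "sqrt")
log2_count = features_per_split(n_features, "log2")
print(sqrt_count, log2_count)
```

On wide text matrices, `'log2'` therefore examines far fewer features per split than `'sqrt'`. In recent scikit-learn versions `'sqrt'` is also the classifier's default, so the first two configurations above differ mainly in `min_samples_split`.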

---

## 6. K-Nearest Neighbors (KNN) Classifier

We train a K-Nearest Neighbors (KNN) classifier, first with a fixed number of neighbors, then use `GridSearchCV` to search for the best number of neighbors and retrain with it. Performance is evaluated after each fit.

```python
import numpy as np  # needed for np.arange in the parameter grid
from sklearn import neighbors
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Initial KNN model
knn = neighbors.KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train.values.ravel())
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'KNN Accuracy: {knn.score(X_test, y_test)}')

# Grid search for optimal neighbors (GridSearchCV uses 5-fold cross-validation by default)
param_grid = {'n_neighbors': np.arange(2, 28)}
grid_knn = GridSearchCV(estimator=knn, param_grid=param_grid)
grid_knn.fit(X_train, y_train.values.ravel())
print(f'Best parameters from GridSearchCV: {grid_knn.best_params_}')

# Train KNN with the selected number of neighbors
# (27 is hard-coded here; check it against grid_knn.best_params_ after each run)
knn = neighbors.KNeighborsClassifier(n_neighbors=27)
knn.fit(X_train, y_train.values.ravel())
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'KNN Accuracy: {knn.score(X_test, y_test)}')
```
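
Rather than copying the best value by hand, the fitted search object can supply it directly. The following sketch runs on hypothetical synthetic data from `make_classification` (the data, the reduced grid of 2 to 10 neighbors, and `cv=5` are illustrative assumptions) and shows `best_params_` and `best_estimator_`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data standing in for the text features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Cross-validated search over the number of neighbors.
param_grid = {"n_neighbors": np.arange(2, 11)}
grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)
grid.fit(X_demo, y_demo)

# The winning value, and a model already refit on all of X_demo with it
# (refit=True is the GridSearchCV default).
best_k = grid.best_params_["n_neighbors"]
knn_best = grid.best_estimator_
print(best_k, grid.best_score_)
```

Using `best_estimator_` keeps the retrained model in sync with the search result when the data or the split changes.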

---

## 7. Decision Tree Classifier

We test two configurations of the Decision Tree Classifier, one with the `entropy` criterion and another with the `gini` criterion, to evaluate their performance on the text data.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Decision Tree with entropy criterion
dt_clf = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=123)
dt_clf.fit(X_train, y_train.values.ravel())
y_pred = dt_clf.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Decision Tree Accuracy (entropy): {dt_clf.score(X_test, y_test)}')

# Decision Tree with gini criterion
dt_clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=321)
dt_clf_gini.fit(X_train, y_train.values.ravel())
y_pred = dt_clf_gini.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Decision Tree Accuracy (gini): {dt_clf_gini.score(X_test, y_test)}')
```
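
The two criteria measure node impurity differently: Gini impurity is `1 - sum(p_k ** 2)` and entropy is `-sum(p_k * log2(p_k))`, where `p_k` is the share of class `k` in the node. A minimal sketch, using hypothetical class counts and helper names (`gini`, `entropy`) chosen for illustration:

```python
import numpy as np


def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)


def entropy(counts):
    """Shannon entropy (in bits) of a node given its per-class sample counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # skip empty classes, treating 0 * log(0) as 0
    return float(abs(np.sum(p * np.log2(p))))  # abs() avoids printing -0.0


# Hypothetical nodes: evenly mixed, mostly one class, and pure.
for counts in ([5, 5], [9, 1], [10, 0]):
    print(counts, round(gini(counts), 3), round(entropy(counts), 3))
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why the two trees often choose similar splits.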

---

## 8. Voting Classifier

We combine three classifiers (KNN, Random Forest, and Logistic Regression) in a hard-voting `VotingClassifier`, which predicts the class chosen by the majority of its members. The aim is to improve on the individual models through ensemble learning.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier

# Define individual classifiers
clf1 = KNeighborsClassifier(n_neighbors=27)
clf2 = RandomForestClassifier(n_jobs=-1, max_features='sqrt', min_samples_split=4)
clf3 = LogisticRegression(multi_class='multinomial', class_weight='balanced', max_iter=1000)

# Create Voting Classifier
vc = VotingClassifier(estimators=[('knn', clf1), ('rf', clf2), ('lr', clf3)], voting='hard')
vc.fit(X_train, y_train.values.ravel())

# Predict and evaluate
y_pred = vc.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Voting Classifier Accuracy: {vc.score(X_test, y_test)}')
```
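
Hard voting can be sketched in a few lines of NumPy. The predictions below are hypothetical, and `hard_vote` is an illustrative helper, not part of scikit-learn; it mimics the unweighted majority vote, with ties resolved in favor of the lowest class label:

```python
import numpy as np

# Hypothetical predictions from three classifiers for five samples.
preds = np.array([
    [0, 1, 2, 1, 0],  # knn
    [0, 1, 1, 1, 2],  # rf
    [1, 1, 2, 0, 2],  # lr
])


def hard_vote(predictions):
    """Majority vote per sample; ties go to the lowest class label."""
    n_classes = predictions.max() + 1
    # For each sample (column), count the votes per class and take the argmax.
    return np.array([
        np.bincount(column, minlength=n_classes).argmax()
        for column in predictions.T
    ])


voted = hard_vote(preds)
print(voted)  # [0 1 2 1 2]
```

With `voting='soft'`, the ensemble would instead average predicted class probabilities, which requires every member to implement `predict_proba`.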

---

## 9. XGBoost Classifier

Finally, we apply the XGBoost classifier to evaluate its performance on the text classification task.

```python
# Notebook magic: installs xgboost into the active kernel's environment
%pip install xgboost
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

xgb_clf = XGBClassifier(use_label_encoder=False)
xgb_clf.fit(X_train, y_train.values.ravel())

# Predict and evaluate
y_pred = xgb_clf.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'XGBoost Accuracy: {xgb_clf.score(X_test, y_test)}')
```
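
Recent XGBoost releases expect class labels to be consecutive integers starting at 0, so `fit` may fail if `y` holds other values such as category codes or strings. If that applies here, one option is scikit-learn's `LabelEncoder`. The sketch below uses hypothetical labels (`y_raw`) and shows only the encoding round trip, not the model fit:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw labels that are not in the range 0..n_classes-1.
y_raw = np.array([10, 40, 10, 2705, 40])

# Map each distinct label to an integer in sorted order: 10->0, 40->1, 2705->2.
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_raw)

# A model would be fit on y_encoded; its integer predictions are then
# mapped back to the original labels before reporting.
y_restored = encoder.inverse_transform(y_encoded)

print(y_encoded, y_restored)
```

In this notebook, that would mean fitting the encoder on `y_train`, training on the encoded labels, and applying `inverse_transform` to the predictions before calling `classification_report`.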

---

# Summary of Model Performance

Each model was trained on the same train/test split and evaluated with the classification report and accuracy score, so the results can be compared directly to identify the best-performing model on this text data. Future work could tune hyperparameters more systematically or try additional models to improve performance further.