# INTRODUCTION

This is a Tamil news classification ML project. I trained models to classify news headlines into five categories: education, crime, business, technology, and sports. I collected the data by scraping the hindutamil.in website. Let's get started!

# DATA SCRAPING

Scraping education news post links from the hindutamil.in website.

In [1]:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.hindutamil.in/news/education/'
page_numbers = range(2, 102)
urls = [f"{URL}{i}" for i in page_numbers]
class_ = 'card-h-img height150px'
education = []

def scrape_links(url, class_):
    try:
        r = requests.get(url)
        r.raise_for_status()
        soup = BeautifulSoup(r.content, 'html.parser')
        divs = soup.find_all('div', class_)
        links = [a['href'] for div in divs for a in div.find_all('a', href=True)]
        return links
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return []

for url in urls:
    education.extend(scrape_links(url, class_))
print('Done')

Done


Scraping crime news post links from the hindutamil.in website.

In [2]:
URL = 'https://www.hindutamil.in/news/crime/'
page_numbers = range(2, 102)
urls = [f"{URL}{i}" for i in page_numbers]
class_ = 'card-h-img height150px'
crime = []

def scrape_links(url, class_):
    try:
        r = requests.get(url)
        r.raise_for_status()
        soup = BeautifulSoup(r.content, 'html.parser')
        divs = soup.find_all('div', class_)
        links = [a['href'] for div in divs for a in div.find_all('a', href=True)]
        return links
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return []

for url in urls:
    crime.extend(scrape_links(url, class_))
print('Done')

Done


Scraping business news post links from the hindutamil.in website.

In [3]:
URL = 'https://www.hindutamil.in/news/business/'
page_numbers = range(2, 102)
urls = [f"{URL}{i}" for i in page_numbers]
class_ = 'card-h-img height150px'
business = []

def scrape_links(url, class_):
    try:
        r = requests.get(url)
        r.raise_for_status()
        soup = BeautifulSoup(r.content, 'html.parser')
        divs = soup.find_all('div', class_)
        links = [a['href'] for div in divs for a in div.find_all('a', href=True)]
        return links
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return []

for url in urls:
    business.extend(scrape_links(url, class_))
print('Done')

Done


Scraping technology news post links from the hindutamil.in website.

In [4]:
URL = 'https://www.hindutamil.in/news/technology/'
page_numbers = range(2, 102)
urls = [f"{URL}{i}" for i in page_numbers]
class_ = 'card-h-img height150px'
technology = []

def scrape_links(url, class_):
    try:
        r = requests.get(url)
        r.raise_for_status()
        soup = BeautifulSoup(r.content, 'html.parser')
        divs = soup.find_all('div', class_)
        links = [a['href'] for div in divs for a in div.find_all('a', href=True)]
        return links
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return []

for url in urls:
    technology.extend(scrape_links(url, class_))
print('Done')

Done


Scraping sports news post links from the hindutamil.in website.

In [5]:
URL = 'https://www.hindutamil.in/news/sports/'
page_numbers = range(2, 102)
urls = [f"{URL}{i}" for i in page_numbers]
class_ = 'card-h-img height150px'
sports = []

def scrape_links(url, class_):
    try:
        r = requests.get(url)
        r.raise_for_status()
        soup = BeautifulSoup(r.content, 'html.parser')
        divs = soup.find_all('div', class_)
        links = [a['href'] for div in divs for a in div.find_all('a', href=True)]
        return links
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return []

for url in urls:
    sports.extend(scrape_links(url, class_))
print('Done')

Done
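
The five scraping cells above repeat the same logic with a different category name. A minimal sketch of how they could be folded into one helper follows. `build_page_urls` and `extract_links` are hypothetical names that do not appear in the notebook, and the sample HTML only imitates the card structure the cells above assume. The network call is left out so that the parsing step can be checked offline.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://www.hindutamil.in/news/'
CATEGORIES = ['education', 'crime', 'business', 'technology', 'sports']
CARD_CLASS = 'card-h-img height150px'

def build_page_urls(category, first_page=2, last_page=101):
    """Return the listing-page URLs for one category (hypothetical helper)."""
    return [f"{BASE_URL}{category}/{i}" for i in range(first_page, last_page + 1)]

def extract_links(html, class_=CARD_CLASS):
    """Collect every href found inside divs whose class attribute equals class_."""
    soup = BeautifulSoup(html, 'html.parser')
    divs = soup.find_all('div', class_=class_)
    return [a['href'] for div in divs for a in div.find_all('a', href=True)]

# Offline check on a hand-written snippet that mimics one listing card
sample_html = (
    '<div class="card-h-img height150px"><a href="https://example.com/post-1">t</a></div>'
    '<div class="other"><a href="https://example.com/ignored">t</a></div>'
)
sample_links = extract_links(sample_html)

# One list of listing-page URLs per category
page_urls = {category: build_page_urls(category) for category in CATEGORIES}
```

In the notebook, the HTML of each page would come from something like `requests.get(url, timeout=10).content`; passing a timeout keeps one stalled request from hanging the whole loop.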


Now, I combine all the category links into a single list with corresponding category types, and then I save this data as a CSV file.

In [6]:
import csv
def data(education, crime, technology, business, sports, output_csv):
    # Pair every link with the name of the category it was scraped from
    rows = []
    for links, category in zip([education, crime, technology, business, sports], ['education', 'crime', 'technology', 'business', 'sports']):
        for link in links:
            rows.append([link, category])
    # Write a header row followed by one (link, category) row per article
    with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:
        csv_writer = csv.writer(csvfile)
        csv_writer.writerow(['Links', 'Category'])
        csv_writer.writerows(rows)
data(education, crime, technology, business, sports, "data.csv")

I import the previously saved news links data CSV file.

In [7]:
import pandas as pd
df = pd.read_csv('data.csv')
df

Unnamed: 0,Links,Category
0,https://www.hindutamil.in/news/education/11724...,education
1,https://www.hindutamil.in/news/education/11724...,education
2,https://www.hindutamil.in/news/education/11722...,education
3,https://www.hindutamil.in/news/education/11718...,education
4,https://www.hindutamil.in/news/education/11716...,education
...,...,...
12405,https://www.hindutamil.in/news/sports/1032513-...,sports
12406,https://www.hindutamil.in/news/sports/1032514-...,sports
12407,https://www.hindutamil.in/news/sports/1032515-...,sports
12408,https://www.hindutamil.in/news/sports/1032521-...,sports


Check the information about the data, including missing values and format details.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12410 entries, 0 to 12409
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Links     8000 non-null   object
 1   Category  8000 non-null   object
dtypes: object(2)
memory usage: 194.0+ KB


Here, df.info() reports 12410 entries but only 8000 non-null values in each column, so some rows are empty. This may have happened due to network issues while scraping the links, so I simply remove the rows with null values from the data frame.

In [9]:
df.dropna(inplace=True)
df

Unnamed: 0,Links,Category
0,https://www.hindutamil.in/news/education/11724...,education
1,https://www.hindutamil.in/news/education/11724...,education
2,https://www.hindutamil.in/news/education/11722...,education
3,https://www.hindutamil.in/news/education/11718...,education
4,https://www.hindutamil.in/news/education/11716...,education
...,...,...
12405,https://www.hindutamil.in/news/sports/1032513-...,sports
12406,https://www.hindutamil.in/news/sports/1032514-...,sports
12407,https://www.hindutamil.in/news/sports/1032515-...,sports
12408,https://www.hindutamil.in/news/sports/1032521-...,sports
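
As a small illustration of the null check and removal described above, the sketch below counts the missing values per column and drops the incomplete rows. The toy frame and its values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'Links': ['https://example.com/a', None, 'https://example.com/c', None],
    'Category': ['education', None, 'sports', None],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Drop rows with any missing value and renumber the index from 0
cleaned = toy.dropna().reset_index(drop=True)
```

Resetting the index is optional. Without it, the surviving rows keep their old labels, which is why the output above still ends at index 12408.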


I check how many duplicate entries are present in the data frame. Duplicated entries can bias the model and distort the evaluation, so they should be removed before fitting.

In [10]:
df.duplicated().sum()

0
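
To make the duplicate check concrete, here is a short sketch on a toy frame (the values are invented). It also shows the `subset` argument, which would catch the same link filed under two categories, a case that the full-row check does not flag.

```python
import pandas as pd

toy = pd.DataFrame({
    'Links': ['a', 'a', 'b', 'a'],
    'Category': ['sports', 'sports', 'crime', 'crime'],
})

# duplicated() marks every repeat of an earlier, fully identical row
n_full_duplicates = int(toy.duplicated().sum())

# Restricting the check to one column also counts the same link under another label
n_link_duplicates = int(toy.duplicated(subset=['Links']).sum())

deduplicated = toy.drop_duplicates().reset_index(drop=True)
```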

Now, I scrape the Tamil news titles for all the links.

In [11]:
links = df['Links'].tolist()
def scrape_title(link):
    try:
        r = requests.get(link)
        r.raise_for_status()
        soup = BeautifulSoup(r.content, 'html.parser')
        title = soup.title.text.strip()
        return title
    except requests.exceptions.RequestException as e:
        print(f'Error scraping title from {link} : {e}')
        return "Error"
title = []
for link in links:
    try:
        title.append(scrape_title(link))
    except Exception as e:
        # Keep one entry per link so the titles stay aligned with the rows of df
        print(f"Error for {link}: {e}")
        title.append("Error")
print("done")

done
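
One thing to watch in this step is that every link yields exactly one entry, so that the titles stay aligned with the rows of df. A page without a `<title>` tag makes `soup.title` return None, and calling `.text` on it raises an AttributeError rather than a requests exception. A minimal sketch of a parser that always returns one value per page follows; `extract_title` is a hypothetical helper and the HTML strings are made up.

```python
from bs4 import BeautifulSoup

def extract_title(html):
    """Return the stripped <title> text, or None when the page has no usable title."""
    soup = BeautifulSoup(html, 'html.parser')
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()

pages = [
    '<html><head><title> Sample headline | Site </title></head></html>',
    '<html><head></head><body>no title here</body></html>',
]

# One result per page, so the list can be attached to a data frame column directly
titles = [extract_title(html) for html in pages]
```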


I save the Tamil news titles along with their corresponding category in a data frame as a CSV file.

In [12]:
df['Title'] = title  # attach the scraped titles before dropping the links
del df['Links']
df.to_csv('Scraped_data.csv', index = False)

Now, I import the previously saved Scraped_data.csv file.

# DATA PREPROCESSING

In [13]:
import pandas as pd
df = pd.read_csv('Scraped_data.csv')
df

Unnamed: 0,Category,Title
0,education,கோவை வேளாண் பல்கலை. மாணவர்கள் நடத்தும் இலவச பா...
1,education,"வேளாண் படிப்புகளும், வேலை வாய்ப்புகளும் - ஓர் ..."
2,education,"எஸ்எஸ்எல்சி, பிளஸ் 1, பிளஸ் 2 பொது தேர்வு: தனி..."
3,education,ஸ்மார்ட் கிளாஸ் உடன் ஹைடெக்காக மாறும் திருநாளூ...
4,education,மாணவர்கள் பங்கேற்கலாம்; டிசம்பர் 27-ம் தேதி மா...
...,...,...
7995,sports,ஹாம்பர்க் ஓபன் டென்னிஸ் - ஸ்வியாடெக் விலகல் | ...
7996,sports,தெற்காசிய கால்பந்து கூட்டமைப்பு சாம்பியன்ஷிப்:...
7997,sports,ஹாட்ரிக் வெற்றியை பதிவு செய்தது சீகம் மதுரை பே...
7998,sports,ஆசிய கபடி சாம்பியன்ஷிப்: 8-வது முறையாக பட்டம் ...


As before, I check the information about the data, including missing values and format details.

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  8000 non-null   object
 1   Title     8000 non-null   object
dtypes: object(2)
memory usage: 125.1+ KB


Again I check how many duplicate data entries are present in this data frame.

In [15]:
df.duplicated().sum()

8

Here, 8 entries are duplicated, so I simply remove them.

In [16]:
df.drop_duplicates(inplace=True)
df = df.reset_index(drop=True)

Now, I check the data distribution (balanced/imbalanced) for classes in the category column.

In [17]:
df['Category'].value_counts()

Category
education     1600
crime         1600
sports        1600
technology    1596
business      1596
Name: count, dtype: int64

I remove English (Latin) letters from the Title column of our data frame and store the result in a new title column.

In [18]:
import re
df1 = df.copy()
def remove_english(text):
    return re.sub(r'[a-zA-Z]', '', text)
df1['title'] = df1['Title'].apply(remove_english)
df1

Unnamed: 0,Category,Title,title
0,education,கோவை வேளாண் பல்கலை. மாணவர்கள் நடத்தும் இலவச பா...,கோவை வேளாண் பல்கலை. மாணவர்கள் நடத்தும் இலவச பா...
1,education,"வேளாண் படிப்புகளும், வேலை வாய்ப்புகளும் - ஓர் ...","வேளாண் படிப்புகளும், வேலை வாய்ப்புகளும் - ஓர் ..."
2,education,"எஸ்எஸ்எல்சி, பிளஸ் 1, பிளஸ் 2 பொது தேர்வு: தனி...","எஸ்எஸ்எல்சி, பிளஸ் 1, பிளஸ் 2 பொது தேர்வு: தனி..."
3,education,ஸ்மார்ட் கிளாஸ் உடன் ஹைடெக்காக மாறும் திருநாளூ...,ஸ்மார்ட் கிளாஸ் உடன் ஹைடெக்காக மாறும் திருநாளூ...
4,education,மாணவர்கள் பங்கேற்கலாம்; டிசம்பர் 27-ம் தேதி மா...,மாணவர்கள் பங்கேற்கலாம்; டிசம்பர் 27-ம் தேதி மா...
...,...,...,...
7987,sports,ஹாம்பர்க் ஓபன் டென்னிஸ் - ஸ்வியாடெக் விலகல் | ...,ஹாம்பர்க் ஓபன் டென்னிஸ் - ஸ்வியாடெக் விலகல் | ...
7988,sports,தெற்காசிய கால்பந்து கூட்டமைப்பு சாம்பியன்ஷிப்:...,தெற்காசிய கால்பந்து கூட்டமைப்பு சாம்பியன்ஷிப்:...
7989,sports,ஹாட்ரிக் வெற்றியை பதிவு செய்தது சீகம் மதுரை பே...,ஹாட்ரிக் வெற்றியை பதிவு செய்தது சீகம் மதுரை பே...
7990,sports,ஆசிய கபடி சாம்பியன்ஷிப்: 8-வது முறையாக பட்டம் ...,ஆசிய கபடி சாம்பியன்ஷிப்: 8-வது முறையாக பட்டம் ...
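
The effect of this regular expression is easier to see on a single string. The sketch below applies it to a made-up headline; `collapse_spaces` is an optional follow-up that is not part of the notebook and is included only to show how the leftover gaps could be tidied.

```python
import re

def remove_english(text):
    # Delete every ASCII letter; digits, punctuation and Tamil characters are kept
    return re.sub(r'[a-zA-Z]', '', text)

def collapse_spaces(text):
    # Optional follow-up (an assumption, not in the notebook): squeeze repeated spaces
    return ' '.join(text.split())

sample = 'ஹாம்பர்க் ஓபன் டென்னிஸ் | Hamburg Open 2023'
without_english = remove_english(sample)
tidied = collapse_spaces(without_english)
```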


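I convert the Tamil text in the Title column to Thanglish (Tamil written in Latin script) in our data frame. Note that the conversion reads the original Title column and overwrites the title column, so the English text removed in the previous step is carried over again (the vocabulary extracted later still contains words such as zoom and zuckerberg).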

In [19]:
import Py_Thanglish as py
def convert_text(text):
    return py.tamil_to_thanglish(text)
df1['title'] = df1['Title'].apply(convert_text)
del df1['Title']
df1

Unnamed: 0,Category,title
0,education,koavai vaelaan palkalai. maanavarkal nataththu...
1,education,"vaelaan patippukalum, vaelai vaayppukalum - oa..."
2,education,"yechyechyelsi, pilach 1, pilach 2 pothu thaerv..."
3,education,chmaart kilaach utan hautekkaaka maarum thirun...
4,education,maanavarkal pangkaerkalaam; tisampar 27-m thae...
...,...,...
7987,sports,haampark oapan tennich - chviyaatek vilakal | ...
7988,sports,therkaasiya kaalpanthu kuuttamaippu saampiyanc...
7989,sports,haatrik verriyai pathivu seythathu seekam math...
7990,sports,aasiya kapati saampiyanchip: 8-vathu muraiyaak...


Now, I remove the punctuation and numbers present in the title column of our data frame.

In [20]:
import string
def remove_punctuation(text):
    punctuations = string.punctuation
    text_to_punc = ''.join([char for char in text if char not in punctuations])
    return text_to_punc
df1['title'] = df1['title'].apply(remove_punctuation)
df1['title'] = df1['title'].str.replace(r'\d+', '', regex=True)
df1

Unnamed: 0,Category,title
0,education,koavai vaelaan palkalai maanavarkal nataththum...
1,education,vaelaan patippukalum vaelai vaayppukalum oar ...
2,education,yechyechyelsi pilach pilach pothu thaervu th...
3,education,chmaart kilaach utan hautekkaaka maarum thirun...
4,education,maanavarkal pangkaerkalaam tisampar m thaethi ...
...,...,...
7987,sports,haampark oapan tennich chviyaatek vilakal Ha...
7988,sports,therkaasiya kaalpanthu kuuttamaippu saampiyanc...
7989,sports,haatrik verriyai pathivu seythathu seekam math...
7990,sports,aasiya kapati saampiyanchip vathu muraiyaaka p...
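
The two cleaning steps above can be combined into one function. The sketch below is an equivalent formulation using `str.translate`, applied to a made-up Thanglish string; `clean_title` and `PUNCT_TABLE` are illustrative names, not taken from the notebook.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_title(text):
    text = text.translate(PUNCT_TABLE)   # drop punctuation
    text = re.sub(r'\d+', '', text)      # drop runs of digits
    return text

sample = 'pilach 1, pilach 2 pothu thaervu: tisampar 27-m'
cleaned = clean_title(sample)
```

`string.punctuation` covers ASCII punctuation only, so typographic quotes or other non-ASCII symbols would pass through unchanged. Removing digits leaves double spaces behind, which is harmless here because the vectorizer used later splits on whitespace.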


As the final preprocessing step, I encode the categorical data in the Category column using the Label Encoding method.

In [21]:
from sklearn.preprocessing import LabelEncoder
target_encode = LabelEncoder()
df1['target_variable'] = target_encode.fit_transform(df1['Category'])
del df1['Category']
df1

Unnamed: 0,title,target_variable
0,koavai vaelaan palkalai maanavarkal nataththum...,2
1,vaelaan patippukalum vaelai vaayppukalum oar ...,2
2,yechyechyelsi pilach pilach pothu thaervu th...,2
3,chmaart kilaach utan hautekkaaka maarum thirun...,2
4,maanavarkal pangkaerkalaam tisampar m thaethi ...,2
...,...,...
7987,haampark oapan tennich chviyaatek vilakal Ha...,3
7988,therkaasiya kaalpanthu kuuttamaippu saampiyanc...,3
7989,haatrik verriyai pathivu seythathu seekam math...,3
7990,aasiya kapati saampiyanchip vathu muraiyaaka p...,3


In [22]:
classes = dict(zip(target_encode.classes_, target_encode.transform(target_encode.classes_)))
df_class = pd.DataFrame(list(classes.items()), columns=['Class', 'Label'])
df_class

Unnamed: 0,Class,Label
0,business,0
1,crime,1
2,education,2
3,sports,3
4,technology,4
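
The mapping in the table above follows from the fact that `LabelEncoder` sorts the class names alphabetically. A short sketch on a handful of category names shows the encoding and how predicted integers can be mapped back to names with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

categories = ['education', 'crime', 'business', 'technology', 'sports', 'crime']

encoder = LabelEncoder()
labels = encoder.fit_transform(categories)

# classes_ is sorted alphabetically, so the position of a class is its integer label
label_of = {name: int(i) for i, name in enumerate(encoder.classes_)}

# inverse_transform maps predicted integers back to category names
decoded = encoder.inverse_transform([3, 0]).tolist()
```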


# BAG OF WORDS

Now, I apply the Bag of Words NLP method to extract features from the title column.

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
Vectorizer = CountVectorizer()
Vec = Vectorizer.fit_transform(df1['title'])
bow_df = pd.DataFrame(Vec.toarray(), columns = Vectorizer.get_feature_names_out())
bow_df

Unnamed: 0,aa,aaakp,aaakpar,aaakparil,aaakparkal,aaakpkaanichthaanukku,aaakplainil,aaakppaayil,aac,aachach,...,zinc,zindabad,zipmat,zoho,zomato,zone,zoology,zoom,zuckerberg,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7987,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7988,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7989,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7990,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


I split the data into two parts: a test set and a training set.

In [24]:
from sklearn.model_selection import train_test_split
bow_df['target_variable'] = df1['target_variable']
X01 = bow_df.drop(['target_variable'], axis=1)
y01 = bow_df['target_variable']
X_train01, X_test01, y_train01, y_test01 = train_test_split(X01,y01, test_size=0.2, random_state=42)

Now, I import the accuracy metric and the model classes from scikit-learn.

In [25]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.svm import SVC

Now, I fit the following models and calculate their train and test prediction accuracy:
        1. Logistic Regression,
        2. Decision Tree,
        3. Random Forest,
        4. Naive Bayes (Gaussian),
        5. Naive Bayes (Multinomial),
        6. Naive Bayes (Bernoulli),
        7. SVM (linear),
        8. SVM (RBF) &
        9. SVM (polynomial)

In [26]:
# Logistic Regression
lr_model01 = LogisticRegression()
lr_model01.fit(X_train01, y_train01)
lr_train01_accuracy = accuracy_score(y_train01, lr_model01.predict(X_train01))
lr_test01_accuracy = accuracy_score(y_test01, lr_model01.predict(X_test01))

print("Logistic Regression Results:")
print("Train01 Accuracy:", lr_train01_accuracy)
print("Test01 Accuracy:", lr_test01_accuracy)

# Decision Tree
dt_model01 = DecisionTreeClassifier()
dt_model01.fit(X_train01, y_train01)
dt_train01_accuracy = accuracy_score(y_train01, dt_model01.predict(X_train01))
dt_test01_accuracy = accuracy_score(y_test01, dt_model01.predict(X_test01))

print("\nDecision Tree Results:")
print("Train01 Accuracy:", dt_train01_accuracy)
print("Test01 Accuracy:", dt_test01_accuracy)

# Random Forest
rf_model01 = RandomForestClassifier()
rf_model01.fit(X_train01, y_train01)
rf_train01_accuracy = accuracy_score(y_train01, rf_model01.predict(X_train01))
rf_test01_accuracy = accuracy_score(y_test01, rf_model01.predict(X_test01))

print("\nRandom Forest Results:")
print("Train01 Accuracy:", rf_train01_accuracy)
print("Test01 Accuracy:", rf_test01_accuracy)

# Naive Bayes (Gaussian)
nb_model01 = GaussianNB()
nb_model01.fit(X_train01, y_train01)
nb_train01_accuracy = accuracy_score(y_train01, nb_model01.predict(X_train01))
nb_test01_accuracy = accuracy_score(y_test01, nb_model01.predict(X_test01))

print("\nNaive Bayes (Gaussian) Results:")
print("Train01 Accuracy:", nb_train01_accuracy)
print("Test01 Accuracy:", nb_test01_accuracy)

# Naive Bayes (Multinomial)
nb_multinomial_model01 = MultinomialNB()
nb_multinomial_model01.fit(X_train01, y_train01)
nb_multinomial_train01_accuracy = accuracy_score(y_train01, nb_multinomial_model01.predict(X_train01))
nb_multinomial_test01_accuracy = accuracy_score(y_test01, nb_multinomial_model01.predict(X_test01))

print("\nNaive Bayes (Multinomial) Results:")
print("Train01 Accuracy:", nb_multinomial_train01_accuracy)
print("Test01 Accuracy:", nb_multinomial_test01_accuracy)

# Naive Bayes (Bernoulli)
nb_bernoulli_model01 = BernoulliNB()
nb_bernoulli_model01.fit(X_train01, y_train01)
nb_bernoulli_train01_accuracy = accuracy_score(y_train01, nb_bernoulli_model01.predict(X_train01))
nb_bernoulli_test01_accuracy = accuracy_score(y_test01, nb_bernoulli_model01.predict(X_test01))

print("\nNaive Bayes (Bernoulli) Results:")
print("Train01 Accuracy:", nb_bernoulli_train01_accuracy)
print("Test01 Accuracy:", nb_bernoulli_test01_accuracy)

# SVM Linear
svm_linear_model01 = SVC(kernel='linear')
svm_linear_model01.fit(X_train01, y_train01)
svm_linear_train01_accuracy = accuracy_score(y_train01, svm_linear_model01.predict(X_train01))
svm_linear_test01_accuracy = accuracy_score(y_test01, svm_linear_model01.predict(X_test01))

print("\nSVM Linear Results:")
print("Train01 Accuracy:", svm_linear_train01_accuracy)
print("Test01 Accuracy:", svm_linear_test01_accuracy)

# SVM RBF
svm_rbf_model01 = SVC(kernel='rbf')
svm_rbf_model01.fit(X_train01, y_train01)
svm_rbf_train01_accuracy = accuracy_score(y_train01, svm_rbf_model01.predict(X_train01))
svm_rbf_test01_accuracy = accuracy_score(y_test01, svm_rbf_model01.predict(X_test01))

print("\nSVM RBF Results:")
print("Train01 Accuracy:", svm_rbf_train01_accuracy)
print("Test01 Accuracy:", svm_rbf_test01_accuracy)

# SVM Poly
svm_poly_model01 = SVC(kernel='poly')
svm_poly_model01.fit(X_train01, y_train01)
svm_poly_train01_accuracy = accuracy_score(y_train01, svm_poly_model01.predict(X_train01))
svm_poly_test01_accuracy = accuracy_score(y_test01, svm_poly_model01.predict(X_test01))

print("\nSVM Poly Results:")
print("Train01 Accuracy:", svm_poly_train01_accuracy)
print("Test01 Accuracy:", svm_poly_test01_accuracy)



# Choose the best model based on test accuracy
models01 = {
    'Logistic Regression': lr_test01_accuracy,
    'Decision Tree': dt_test01_accuracy,
    'Random Forest': rf_test01_accuracy,
    'Naive Bayes (Gaussian)': nb_test01_accuracy,
    'Naive Bayes (Multinomial)': nb_multinomial_test01_accuracy,
    'Naive Bayes (Bernoulli)': nb_bernoulli_test01_accuracy,
    'SVM Linear': svm_linear_test01_accuracy,
    'SVM RBF': svm_rbf_test01_accuracy,
    'SVM Poly': svm_poly_test01_accuracy
}

best_model01 = max(models01, key=models01.get)
print("\nBest Model01:", best_model01)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression Results:
Train01 Accuracy: 1.0
Test01 Accuracy: 0.9380863039399625

Decision Tree Results:
Train01 Accuracy: 1.0
Test01 Accuracy: 0.8373983739837398

Random Forest Results:
Train01 Accuracy: 1.0
Test01 Accuracy: 0.9030644152595372

Naive Bayes (Gaussian) Results:
Train01 Accuracy: 0.9993743156577507
Test01 Accuracy: 0.9205753595997499

Naive Bayes (Multinomial) Results:
Train01 Accuracy: 0.9851399968715783
Test01 Accuracy: 0.9405878674171357

Naive Bayes (Bernoulli) Results:
Train01 Accuracy: 0.9820115751603317
Test01 Accuracy: 0.941213258286429

SVM Linear Results:
Train01 Accuracy: 1.0
Test01 Accuracy: 0.933083176985616

SVM RBF Results:
Train01 Accuracy: 0.9968715782887533
Test01 Accuracy: 0.9199499687304565

SVM Poly Results:
Train01 Accuracy: 0.9973408415454403
Test01 Accuracy: 0.8555347091932458

Best Model01: Naive Bayes (Bernoulli)
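
The fit-and-score pattern in the cell above is identical for every model, so it can also be written as a loop over a dictionary. The sketch below is one possible arrangement, not the notebook's code: it uses synthetic count data in place of the Bag of Words matrix and only three of the nine models to keep it short. Setting `max_iter=1000` on Logistic Regression is one of the remedies suggested by the convergence warning printed above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Synthetic count data standing in for the Bag of Words matrix (illustration only)
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200)
X = rng.poisson(lam=1.0, size=(200, 20))
X[:, 0] += 3 * y            # make the first feature informative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Naive Bayes (Multinomial)': MultinomialNB(),
    'Naive Bayes (Bernoulli)': BernoulliNB(),
}

# Fit each model once and record its train and test accuracy
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        'train': accuracy_score(y_train, model.predict(X_train)),
        'test': accuracy_score(y_test, model.predict(X_test)),
    }

best_model = max(results, key=lambda name: results[name]['test'])
```

Adding another classifier then only requires one more dictionary entry, and the same loop can be reused for each feature set.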


## Bag of Words - Chi_Square Test

I select the features using the chi-square test, keeping those whose p-value is below the 0.05 significance level.

In [27]:
from scipy.stats import chi2_contingency
X02 = bow_df.drop('target_variable', axis=1)
y02 = bow_df['target_variable']
feature_variables01 = []
for feature in X02.columns:
    contingency_table = pd.crosstab(X02[feature], y02)
    chi2, p, _, _ = chi2_contingency(contingency_table)
    significance_level = 0.05
    
    if p < significance_level:
        feature_variables01.append(feature)
# Take an explicit copy so that adding a column does not raise SettingWithCopyWarning
chi_2_df01 = bow_df[feature_variables01].copy()
chi_2_df01['target_variable'] = bow_df['target_variable']
chi_2_df01


Unnamed: 0,aachach,aachi,aachikku,aachiyai,aachiyutan,aachthiraeliya,aachthiraeliyaa,aachthiraeliyaavai,aachthiraeliyaavin,aachthiraeliyaavukku,...,yuutiyuup,yuutiyuupar,yuvaraaj,yuvraj,zealand,zebronics,zimbabwe,zoom,zuckerberg,target_variable
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7987,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
7988,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
7989,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
7990,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
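
To illustrate what the test decides for a single feature, the sketch below builds the same kind of contingency table on toy data with two classes. The column names and values are invented: one word occurs only in one class, the other is spread evenly across both.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: 20 titles, two classes (values are made up for illustration)
target = pd.Series([0] * 10 + [1] * 10, name='target_variable')
toy_bow = pd.DataFrame({
    'informative': [1] * 10 + [0] * 10,   # occurs only in class 0
    'noise': [0, 1] * 10,                 # equally common in both classes
})

significance_level = 0.05
p_values = {}
for feature in toy_bow.columns:
    # Rows: observed word counts, columns: classes
    table = pd.crosstab(toy_bow[feature], target)
    _, p, _, _ = chi2_contingency(table)
    p_values[feature] = p

selected = [f for f, p in p_values.items() if p < significance_level]
```

A common refinement, not applied in the notebook, is to run the selection on the training split only, so that the test rows play no part in choosing the features.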


As before, I fit all of the above models and calculate their train and test prediction accuracy.

In [28]:
X02 = chi_2_df01.drop(['target_variable'], axis=1)
y02 = chi_2_df01['target_variable']
X_train02, X_test02, y_train02, y_test02 = train_test_split(X02,y02, test_size=0.2, random_state=42)

# Logistic Regression
lr_model02 = LogisticRegression()
lr_model02.fit(X_train02, y_train02)
lr_train02_accuracy = accuracy_score(y_train02, lr_model02.predict(X_train02))
lr_test02_accuracy = accuracy_score(y_test02, lr_model02.predict(X_test02))

print("Logistic Regression Results:")
print("Train02 Accuracy:", lr_train02_accuracy)
print("Test02 Accuracy:", lr_test02_accuracy)

# Decision Tree
dt_model02 = DecisionTreeClassifier()
dt_model02.fit(X_train02, y_train02)
dt_train02_accuracy = accuracy_score(y_train02, dt_model02.predict(X_train02))
dt_test02_accuracy = accuracy_score(y_test02, dt_model02.predict(X_test02))

print("\nDecision Tree Results:")
print("Train02 Accuracy:", dt_train02_accuracy)
print("Test02 Accuracy:", dt_test02_accuracy)

# Random Forest
rf_model02 = RandomForestClassifier()
rf_model02.fit(X_train02, y_train02)
rf_train02_accuracy = accuracy_score(y_train02, rf_model02.predict(X_train02))
rf_test02_accuracy = accuracy_score(y_test02, rf_model02.predict(X_test02))

print("\nRandom Forest Results:")
print("Train02 Accuracy:", rf_train02_accuracy)
print("Test02 Accuracy:", rf_test02_accuracy)

# Naive Bayes (Gaussian)
nb_model02 = GaussianNB()
nb_model02.fit(X_train02, y_train02)
nb_train02_accuracy = accuracy_score(y_train02, nb_model02.predict(X_train02))
nb_test02_accuracy = accuracy_score(y_test02, nb_model02.predict(X_test02))

print("\nNaive Bayes (Gaussian) Results:")
print("Train02 Accuracy:", nb_train02_accuracy)
print("Test02 Accuracy:", nb_test02_accuracy)

# Naive Bayes (Multinomial)
nb_multinomial_model02 = MultinomialNB()
nb_multinomial_model02.fit(X_train02, y_train02)
nb_multinomial_train02_accuracy = accuracy_score(y_train02, nb_multinomial_model02.predict(X_train02))
nb_multinomial_test02_accuracy = accuracy_score(y_test02, nb_multinomial_model02.predict(X_test02))

print("\nNaive Bayes (Multinomial) Results:")
print("Train02 Accuracy:", nb_multinomial_train02_accuracy)
print("Test02 Accuracy:", nb_multinomial_test02_accuracy)

# Naive Bayes (Bernoulli)
nb_bernoulli_model02 = BernoulliNB()
nb_bernoulli_model02.fit(X_train02, y_train02)
nb_bernoulli_train02_accuracy = accuracy_score(y_train02, nb_bernoulli_model02.predict(X_train02))
nb_bernoulli_test02_accuracy = accuracy_score(y_test02, nb_bernoulli_model02.predict(X_test02))

print("\nNaive Bayes (Bernoulli) Results:")
print("Train02 Accuracy:", nb_bernoulli_train02_accuracy)
print("Test02 Accuracy:", nb_bernoulli_test02_accuracy)

# SVM Linear
svm_linear_model02 = SVC(kernel='linear')
svm_linear_model02.fit(X_train02, y_train02)
svm_linear_train02_accuracy = accuracy_score(y_train02, svm_linear_model02.predict(X_train02))
svm_linear_test02_accuracy = accuracy_score(y_test02, svm_linear_model02.predict(X_test02))

print("\nSVM Linear Results:")
print("Train02 Accuracy:", svm_linear_train02_accuracy)
print("Test02 Accuracy:", svm_linear_test02_accuracy)

# SVM RBF
svm_rbf_model02 = SVC(kernel='rbf')
svm_rbf_model02.fit(X_train02, y_train02)
svm_rbf_train02_accuracy = accuracy_score(y_train02, svm_rbf_model02.predict(X_train02))
svm_rbf_test02_accuracy = accuracy_score(y_test02, svm_rbf_model02.predict(X_test02))

print("\nSVM RBF Results:")
print("Train02 Accuracy:", svm_rbf_train02_accuracy)
print("Test02 Accuracy:", svm_rbf_test02_accuracy)

# SVM Poly
svm_poly_model02 = SVC(kernel='poly')
svm_poly_model02.fit(X_train02, y_train02)
svm_poly_train02_accuracy = accuracy_score(y_train02, svm_poly_model02.predict(X_train02))
svm_poly_test02_accuracy = accuracy_score(y_test02, svm_poly_model02.predict(X_test02))

print("\nSVM Poly Results:")
print("Train02 Accuracy:", svm_poly_train02_accuracy)
print("Test02 Accuracy:", svm_poly_test02_accuracy)

# Choose the best model based on test accuracy
models02 = {
    'Logistic Regression': lr_test02_accuracy,
    'Decision Tree': dt_test02_accuracy,
    'Random Forest': rf_test02_accuracy,
    'Naive Bayes (Gaussian)': nb_test02_accuracy,
    'Naive Bayes (Multinomial)': nb_multinomial_test02_accuracy,
    'Naive Bayes (Bernoulli)': nb_bernoulli_test02_accuracy,
    'SVM Linear': svm_linear_test02_accuracy,
    'SVM RBF': svm_rbf_test02_accuracy,
    'SVM Poly': svm_poly_test02_accuracy
}

best_model02 = max(models02, key=models02.get)
print("\nBest Model02:", best_model02)

Logistic Regression Results:
Train02 Accuracy: 0.99530736743313
Test02 Accuracy: 0.9380863039399625

Decision Tree Results:
Train02 Accuracy: 0.9998435789144376
Test02 Accuracy: 0.8424015009380863

Random Forest Results:
Train02 Accuracy: 0.9998435789144376
Test02 Accuracy: 0.900562851782364

Naive Bayes (Gaussian) Results:
Train02 Accuracy: 0.9840450492726419
Test02 Accuracy: 0.9418386491557224

Naive Bayes (Multinomial) Results:
Train02 Accuracy: 0.9646488346629125
Test02 Accuracy: 0.9449656035021888

Naive Bayes (Bernoulli) Results:
Train02 Accuracy: 0.9612075707805412
Test02 Accuracy: 0.9462163852407754

SVM Linear Results:
Train02 Accuracy: 0.9987486313155013
Test02 Accuracy: 0.9149468417761101

SVM RBF Results:
Train02 Accuracy: 0.9857656812138276
Test02 Accuracy: 0.9243277048155097

SVM Poly Results:
Train02 Accuracy: 0.9607383075238543
Test02 Accuracy: 0.797373358348968

Best Model02: Naive Bayes (Bernoulli)


## Bag of Words - Distinct Class Keywords

I employ my own method. For each Bag of Words feature, I sum its counts within each class of the target variable and store the per-class sums in a data frame. If a feature has a non-zero sum in more than one class, I exclude it from model fitting. Only the features that occur in exactly one class, that is, those with a zero sum in the other four classes, are used for model fitting.

In [30]:
sums_by_class = bow_df.groupby('target_variable').sum()
zero_sum = (sums_by_class == 0).sum()

features001 = []

for column, count in zero_sum.items():
    if count == 4:
        features001.append(column)
# Take an explicit copy so that adding a column does not raise SettingWithCopyWarning
zero_df = bow_df[features001].copy()
zero_df['target_variable'] = bow_df['target_variable']
zero_df


Unnamed: 0,aa,aaakpar,aaakparil,aaakparkal,aaakpkaanichthaanukku,aaakplainil,aaakppaayil,aac,aachach,aachi,...,zimbabwe,zinc,zindabad,zipmat,zoho,zoology,zoom,zuckerberg,zurich,target_variable
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7987,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
7988,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
7989,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
7990,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
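
The selection rule is easier to follow on a tiny example. The sketch below applies the same groupby logic to a toy frame with three classes and three made-up words. It generalises the hard-coded 4 to `n_classes - 1`, which equals 4 for the five categories in this project.

```python
import pandas as pd

toy_bow = pd.DataFrame({
    'goal': [1, 2, 0, 0, 0],
    'exam': [0, 0, 1, 1, 0],
    'news': [1, 0, 1, 0, 1],
    'target_variable': [3, 3, 2, 2, 1],
})

# Total count of each word within each class
sums_by_class = toy_bow.groupby('target_variable').sum()
n_classes = sums_by_class.shape[0]

# Number of classes in which each word never occurs
zero_counts = (sums_by_class == 0).sum()

# Keep the words that are absent from all classes but one
distinct_words = [word for word, count in zero_counts.items() if count == n_classes - 1]
```

Here "goal" appears only in class 3 and "exam" only in class 2, so both are kept, while "news" appears in every class and is dropped.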


As before, I fit all of the above models and calculate their train and test prediction accuracy.

In [31]:
X03 = zero_df.drop(['target_variable'], axis=1)
y03 = zero_df['target_variable']
X_train03, X_test03, y_train03, y_test03 = train_test_split(X03,y03, test_size=0.2, random_state=42)

# Logistic Regression
lr_model03 = LogisticRegression()
lr_model03.fit(X_train03, y_train03)
lr_train03_accuracy = accuracy_score(y_train03, lr_model03.predict(X_train03))
lr_test03_accuracy = accuracy_score(y_test03, lr_model03.predict(X_test03))

print("Logistic Regression Results:")
print("Train03 Accuracy:", lr_train03_accuracy)
print("Test03 Accuracy:", lr_test03_accuracy)

# Decision Tree
dt_model03 = DecisionTreeClassifier()
dt_model03.fit(X_train03, y_train03)
dt_train03_accuracy = accuracy_score(y_train03, dt_model03.predict(X_train03))
dt_test03_accuracy = accuracy_score(y_test03, dt_model03.predict(X_test03))

print("\nDecision Tree Results:")
print("Train03 Accuracy:", dt_train03_accuracy)
print("Test03 Accuracy:", dt_test03_accuracy)

# Random Forest
rf_model03 = RandomForestClassifier()
rf_model03.fit(X_train03, y_train03)
rf_train03_accuracy = accuracy_score(y_train03, rf_model03.predict(X_train03))
rf_test03_accuracy = accuracy_score(y_test03, rf_model03.predict(X_test03))

print("\nRandom Forest Results:")
print("Train03 Accuracy:", rf_train03_accuracy)
print("Test03 Accuracy:", rf_test03_accuracy)

# Naive Bayes (Gaussian)
nb_model03 = GaussianNB()
nb_model03.fit(X_train03, y_train03)
nb_train03_accuracy = accuracy_score(y_train03, nb_model03.predict(X_train03))
nb_test03_accuracy = accuracy_score(y_test03, nb_model03.predict(X_test03))

print("\nNaive Bayes (Gaussian) Results:")
print("Train03 Accuracy:", nb_train03_accuracy)
print("Test03 Accuracy:", nb_test03_accuracy)

# Naive Bayes (Multinomial)
nb_multinomial_model03 = MultinomialNB()
nb_multinomial_model03.fit(X_train03, y_train03)
nb_multinomial_train03_accuracy = accuracy_score(y_train03, nb_multinomial_model03.predict(X_train03))
nb_multinomial_test03_accuracy = accuracy_score(y_test03, nb_multinomial_model03.predict(X_test03))

print("\nNaive Bayes (Multinomial) Results:")
print("Train03 Accuracy:", nb_multinomial_train03_accuracy)
print("Test03 Accuracy:", nb_multinomial_test03_accuracy)

# Naive Bayes (Bernoulli)
nb_bernoulli_model03 = BernoulliNB()
nb_bernoulli_model03.fit(X_train03, y_train03)
nb_bernoulli_train03_accuracy = accuracy_score(y_train03, nb_bernoulli_model03.predict(X_train03))
nb_bernoulli_test03_accuracy = accuracy_score(y_test03, nb_bernoulli_model03.predict(X_test03))

print("\nNaive Bayes (Bernoulli) Results:")
print("Train03 Accuracy:", nb_bernoulli_train03_accuracy)
print("Test03 Accuracy:", nb_bernoulli_test03_accuracy)

# SVM Linear
svm_linear_model03 = SVC(kernel='linear')
svm_linear_model03.fit(X_train03, y_train03)
svm_linear_train03_accuracy = accuracy_score(y_train03, svm_linear_model03.predict(X_train03))
svm_linear_test03_accuracy = accuracy_score(y_test03, svm_linear_model03.predict(X_test03))

print("\nSVM Linear Results:")
print("Train03 Accuracy:", svm_linear_train03_accuracy)
print("Test03 Accuracy:", svm_linear_test03_accuracy)

# SVM RBF
svm_rbf_model03 = SVC(kernel='rbf')
svm_rbf_model03.fit(X_train03, y_train03)
svm_rbf_train03_accuracy = accuracy_score(y_train03, svm_rbf_model03.predict(X_train03))
svm_rbf_test03_accuracy = accuracy_score(y_test03, svm_rbf_model03.predict(X_test03))

print("\nSVM RBF Results:")
print("Train03 Accuracy:", svm_rbf_train03_accuracy)
print("Test03 Accuracy:", svm_rbf_test03_accuracy)

# SVM Poly
svm_poly_model03 = SVC(kernel='poly')
svm_poly_model03.fit(X_train03, y_train03)
svm_poly_train03_accuracy = accuracy_score(y_train03, svm_poly_model03.predict(X_train03))
svm_poly_test03_accuracy = accuracy_score(y_test03, svm_poly_model03.predict(X_test03))

print("\nSVM Poly Results:")
print("Train03 Accuracy:", svm_poly_train03_accuracy)
print("Test03 Accuracy:", svm_poly_test03_accuracy)


# Choose the best model based on test accuracy
models03 = {
    'Logistic Regression': lr_test03_accuracy,
    'Decision Tree': dt_test03_accuracy,
    'Random Forest': rf_test03_accuracy,
    'Naive Bayes (Gaussian)': nb_test03_accuracy,
    'Naive Bayes (Multinomial)': nb_multinomial_test03_accuracy,
    'Naive Bayes (Bernoulli)': nb_bernoulli_test03_accuracy,
    'SVM Linear': svm_linear_test03_accuracy,
    'SVM RBF': svm_rbf_test03_accuracy,
    'SVM Poly': svm_poly_test03_accuracy
}

best_model03 = max(models03, key=models03.get)
print("\nBest Model03:", best_model03)


Logistic Regression Results:
Train03 Accuracy: 0.981698732989207
Test03 Accuracy: 0.8824265165728581

Decision Tree Results:
Train03 Accuracy: 0.9846707336148913
Test03 Accuracy: 0.7804878048780488

Random Forest Results:
Train03 Accuracy: 0.9846707336148913
Test03 Accuracy: 0.8380237648530331

Naive Bayes (Gaussian) Results:
Train03 Accuracy: 0.9820115751603317
Test03 Accuracy: 0.8980612883051907

Naive Bayes (Multinomial) Results:
Train03 Accuracy: 0.9846707336148913
Test03 Accuracy: 0.9099437148217636

Naive Bayes (Bernoulli) Results:
Train03 Accuracy: 0.9532300954168622
Test03 Accuracy: 0.8524077548467792

SVM Linear Results:
Train03 Accuracy: 0.981698732989207
Test03 Accuracy: 0.8599124452782989

SVM RBF Results:
Train03 Accuracy: 0.9693414672297825
Test03 Accuracy: 0.8267667292057536

SVM Poly Results:
Train03 Accuracy: 0.8196464883466291
Test03 Accuracy: 0.5265791119449656

Best Model03: Naive Bayes (Multinomial)
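
The cell above repeats the same fit-and-score steps for each classifier. A minimal sketch of how that pattern could be condensed into a loop is shown below. The helper name `evaluate_models` and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(X_train, X_test, y_train, y_test):
    """Fit each classifier and return {name: (train_accuracy, test_accuracy)}."""
    models = {
        'Logistic Regression': LogisticRegression(),
        'Decision Tree': DecisionTreeClassifier(),
        'Random Forest': RandomForestClassifier(),
        'Naive Bayes (Gaussian)': GaussianNB(),
        'Naive Bayes (Multinomial)': MultinomialNB(),
        'Naive Bayes (Bernoulli)': BernoulliNB(),
        'SVM Linear': SVC(kernel='linear'),
        'SVM RBF': SVC(kernel='rbf'),
        'SVM Poly': SVC(kernel='poly'),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, model.predict(X_train))
        test_acc = accuracy_score(y_test, model.predict(X_test))
        results[name] = (train_acc, test_acc)
    return results


# Hypothetical toy count data standing in for the bag-of-words features.
rng = np.random.default_rng(42)
X_toy = rng.integers(0, 3, size=(60, 8))
y_toy = np.arange(60) % 2          # two balanced classes
X_toy[:, 0] = y_toy * 2            # make the first feature informative

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

results = evaluate_models(X_tr, X_te, y_tr, y_te)
for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")

# Same selection rule as in the notebook: highest test accuracy wins.
best_model = max(results, key=lambda name: results[name][1])
print("Best model:", best_model)
```

Calling `evaluate_models(X_train03, X_test03, y_train03, y_test03)` would then stand in for the long cell, under the assumption that the same default hyperparameters are wanted for every feature set.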


## Bag of Words - Greater than 5

I employ another method: I compute the total count of each feature in the zero_df data frame, summed over all rows. Features whose total is 5 or less are excluded from model fitting; only features with a total greater than 5 are kept.

In [32]:
sums_by_class1 = zero_df.groupby('target_variable').sum()
column_sum01 = sums_by_class1.sum()
selected_columns01 = []

for column_name01, sum_value01 in column_sum01.items():
    if sum_value01 > 5:
        selected_columns01.append(column_name01)

# Take an explicit copy so that adding the target column does not raise pandas' SettingWithCopyWarning.
ge_5_df = zero_df[selected_columns01].copy()
ge_5_df['target_variable'] = zero_df['target_variable']
ge_5_df


Unnamed: 0,aachach,aachi,aachikku,aachiyai,aachthiraeliyaavukku,aanthira,aantraaytu,aapkaanichthaan,aapkan,aaruthraa,...,yempipiyech,yenaiyae,yezhuththum,yield,yujisi,yupiyechsi,yuvaraaj,yuvraj,zealand,target_variable
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7987,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
7988,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
7989,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
7990,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
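
The column filter described in this section can also be written without an explicit loop. The sketch below uses a tiny hypothetical frame (`toy_df`, with made-up counts) in place of zero_df, and keeps the same rule: a feature survives only if its total count is greater than the threshold.

```python
import pandas as pd

# Hypothetical miniature stand-in for zero_df: word counts plus the label.
toy_df = pd.DataFrame({
    'hockey': [3, 2, 0, 2],
    'tennis': [0, 1, 0, 0],
    'sensex': [4, 0, 2, 0],
    'target_variable': [3, 3, 1, 1],
})

threshold = 5

# Total count of every feature over all rows (the label column is excluded).
feature_totals = toy_df.drop(columns='target_variable').sum()

# Boolean indexing keeps only the features whose total exceeds the threshold.
kept_columns = feature_totals[feature_totals > threshold].index.tolist()

# .copy() yields an independent frame, so adding a column raises no warning.
filtered_df = toy_df[kept_columns].copy()
filtered_df['target_variable'] = toy_df['target_variable']

print(kept_columns)
print(filtered_df.shape)
```

Here `hockey` (total 7) and `sensex` (total 6) are kept, while `tennis` (total 1) is dropped.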


As before, I fit all of the above models and calculate their train and test accuracy.

In [33]:
X04 = ge_5_df.drop(['target_variable'], axis=1)
y04 = ge_5_df['target_variable']
X_train04, X_test04, y_train04, y_test04 = train_test_split(X04,y04, test_size=0.2, random_state=42)

# Logistic Regression
lr_model04 = LogisticRegression()
lr_model04.fit(X_train04, y_train04)
lr_train04_accuracy = accuracy_score(y_train04, lr_model04.predict(X_train04))
lr_test04_accuracy = accuracy_score(y_test04, lr_model04.predict(X_test04))

print("Logistic Regression Results:")
print("Train04 Accuracy:", lr_train04_accuracy)
print("Test04 Accuracy:", lr_test04_accuracy)

# Decision Tree
dt_model04 = DecisionTreeClassifier()
dt_model04.fit(X_train04, y_train04)
dt_train04_accuracy = accuracy_score(y_train04, dt_model04.predict(X_train04))
dt_test04_accuracy = accuracy_score(y_test04, dt_model04.predict(X_test04))

print("\nDecision Tree Results:")
print("Train04 Accuracy:", dt_train04_accuracy)
print("Test04 Accuracy:", dt_test04_accuracy)

# Random Forest
rf_model04 = RandomForestClassifier()
rf_model04.fit(X_train04, y_train04)
rf_train04_accuracy = accuracy_score(y_train04, rf_model04.predict(X_train04))
rf_test04_accuracy = accuracy_score(y_test04, rf_model04.predict(X_test04))

print("\nRandom Forest Results:")
print("Train04 Accuracy:", rf_train04_accuracy)
print("Test04 Accuracy:", rf_test04_accuracy)

# Naive Bayes (Gaussian)
nb_model04 = GaussianNB()
nb_model04.fit(X_train04, y_train04)
nb_train04_accuracy = accuracy_score(y_train04, nb_model04.predict(X_train04))
nb_test04_accuracy = accuracy_score(y_test04, nb_model04.predict(X_test04))

print("\nNaive Bayes (Gaussian) Results:")
print("Train04 Accuracy:", nb_train04_accuracy)
print("Test04 Accuracy:", nb_test04_accuracy)

# Naive Bayes (Multinomial)
nb_multinomial_model04 = MultinomialNB()
nb_multinomial_model04.fit(X_train04, y_train04)
nb_multinomial_train04_accuracy = accuracy_score(y_train04, nb_multinomial_model04.predict(X_train04))
nb_multinomial_test04_accuracy = accuracy_score(y_test04, nb_multinomial_model04.predict(X_test04))

print("\nNaive Bayes (Multinomial) Results:")
print("Train04 Accuracy:", nb_multinomial_train04_accuracy)
print("Test04 Accuracy:", nb_multinomial_test04_accuracy)

# Naive Bayes (Bernoulli)
nb_bernoulli_model04 = BernoulliNB()
nb_bernoulli_model04.fit(X_train04, y_train04)
nb_bernoulli_train04_accuracy = accuracy_score(y_train04, nb_bernoulli_model04.predict(X_train04))
nb_bernoulli_test04_accuracy = accuracy_score(y_test04, nb_bernoulli_model04.predict(X_test04))

print("\nNaive Bayes (Bernoulli) Results:")
print("Train04 Accuracy:", nb_bernoulli_train04_accuracy)
print("Test04 Accuracy:", nb_bernoulli_test04_accuracy)

# SVM Linear
svm_linear_model04 = SVC(kernel='linear')
svm_linear_model04.fit(X_train04, y_train04)
svm_linear_train04_accuracy = accuracy_score(y_train04, svm_linear_model04.predict(X_train04))
svm_linear_test04_accuracy = accuracy_score(y_test04, svm_linear_model04.predict(X_test04))

print("\nSVM Linear Results:")
print("Train04 Accuracy:", svm_linear_train04_accuracy)
print("Test04 Accuracy:", svm_linear_test04_accuracy)

# SVM RBF
svm_rbf_model04 = SVC(kernel='rbf')
svm_rbf_model04.fit(X_train04, y_train04)
svm_rbf_train04_accuracy = accuracy_score(y_train04, svm_rbf_model04.predict(X_train04))
svm_rbf_test04_accuracy = accuracy_score(y_test04, svm_rbf_model04.predict(X_test04))

print("\nSVM RBF Results:")
print("Train04 Accuracy:", svm_rbf_train04_accuracy)
print("Test04 Accuracy:", svm_rbf_test04_accuracy)

# SVM Poly
svm_poly_model04 = SVC(kernel='poly')
svm_poly_model04.fit(X_train04, y_train04)
svm_poly_train04_accuracy = accuracy_score(y_train04, svm_poly_model04.predict(X_train04))
svm_poly_test04_accuracy = accuracy_score(y_test04, svm_poly_model04.predict(X_test04))

print("\nSVM Poly Results:")
print("Train04 Accuracy:", svm_poly_train04_accuracy)
print("Test04 Accuracy:", svm_poly_test04_accuracy)

# Choose the best model based on test accuracy
models04 = {
    'Logistic Regression': lr_test04_accuracy,
    'Decision Tree': dt_test04_accuracy,
    'Random Forest': rf_test04_accuracy,
    'Naive Bayes (Gaussian)': nb_test04_accuracy,
    'Naive Bayes (Multinomial)': nb_multinomial_test04_accuracy,
    'Naive Bayes (Bernoulli)': nb_bernoulli_test04_accuracy,
    'SVM Linear': svm_linear_test04_accuracy,
    'SVM RBF': svm_rbf_test04_accuracy,
    'SVM Poly': svm_poly_test04_accuracy
}

best_model04 = max(models04, key=models04.get)
print("\nBest Model04:", best_model04)

Logistic Regression Results:
Train04 Accuracy: 0.7993117472235257
Test04 Accuracy: 0.7967479674796748

Decision Tree Results:
Train04 Accuracy: 0.7994681683090881
Test04 Accuracy: 0.774859287054409

Random Forest Results:
Train04 Accuracy: 0.7994681683090881
Test04 Accuracy: 0.7898686679174484

Naive Bayes (Gaussian) Results:
Train04 Accuracy: 0.7994681683090881
Test04 Accuracy: 0.8036272670419012

Naive Bayes (Multinomial) Results:
Train04 Accuracy: 0.7944626935710933
Test04 Accuracy: 0.7892432770481551

Naive Bayes (Bernoulli) Results:
Train04 Accuracy: 0.7841389019239794
Test04 Accuracy: 0.7792370231394622

SVM Linear Results:
Train04 Accuracy: 0.7994681683090881
Test04 Accuracy: 0.776735459662289

SVM RBF Results:
Train04 Accuracy: 0.7994681683090881
Test04 Accuracy: 0.7686053783614759

SVM Poly Results:
Train04 Accuracy: 0.7348662599718442
Test04 Accuracy: 0.6960600375234521

Best Model04: Naive Bayes (Gaussian)


## Bag of Words - Greater than 10

This is the same as the previous approach; the only difference is that the threshold is raised from 5 to 10.

In [34]:
column_sum02 = sums_by_class1.sum()
selected_columns02 = []

for column_name02, sum_value02 in column_sum02.items():
    if sum_value02 > 10:
        selected_columns02.append(column_name02)

# Take an explicit copy so that adding the target column does not raise pandas' SettingWithCopyWarning.
ge_10_df = zero_df[selected_columns02].copy()
ge_10_df['target_variable'] = zero_df['target_variable']
ge_10_df


Unnamed: 0,aachach,aachi,aachiyai,aantraaytu,aapkaanichthaan,aapkan,aaruthraa,aasiriyarkalukku,aathithiraavitar,aattaththil,...,yaathav,yadav,yaeaisitii,yaerra,yechai,yempipiyech,yield,yujisi,zealand,target_variable
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7987,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
7988,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
7989,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
7990,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3


As before, I fit all of the above models and calculate their train and test accuracy.

In [35]:
X05 = ge_10_df.drop(['target_variable'], axis=1)
y05 = ge_10_df['target_variable']
X_train05, X_test05, y_train05, y_test05 = train_test_split(X05,y05, test_size=0.2, random_state=42)

# Logistic Regression
lr_model05 = LogisticRegression()
lr_model05.fit(X_train05, y_train05)
lr_train05_accuracy = accuracy_score(y_train05, lr_model05.predict(X_train05))
lr_test05_accuracy = accuracy_score(y_test05, lr_model05.predict(X_test05))

print("Logistic Regression Results:")
print("Train05 Accuracy:", lr_train05_accuracy)
print("Test05 Accuracy:", lr_test05_accuracy)

# Decision Tree
dt_model05 = DecisionTreeClassifier()
dt_model05.fit(X_train05, y_train05)
dt_train05_accuracy = accuracy_score(y_train05, dt_model05.predict(X_train05))
dt_test05_accuracy = accuracy_score(y_test05, dt_model05.predict(X_test05))

print("\nDecision Tree Results:")
print("Train05 Accuracy:", dt_train05_accuracy)
print("Test05 Accuracy:", dt_test05_accuracy)

# Random Forest
rf_model05 = RandomForestClassifier()
rf_model05.fit(X_train05, y_train05)
rf_train05_accuracy = accuracy_score(y_train05, rf_model05.predict(X_train05))
rf_test05_accuracy = accuracy_score(y_test05, rf_model05.predict(X_test05))

print("\nRandom Forest Results:")
print("Train05 Accuracy:", rf_train05_accuracy)
print("Test05 Accuracy:", rf_test05_accuracy)

# Naive Bayes (Gaussian)
nb_model05 = GaussianNB()
nb_model05.fit(X_train05, y_train05)
nb_train05_accuracy = accuracy_score(y_train05, nb_model05.predict(X_train05))
nb_test05_accuracy = accuracy_score(y_test05, nb_model05.predict(X_test05))

print("\nNaive Bayes (Gaussian) Results:")
print("Train05 Accuracy:", nb_train05_accuracy)
print("Test05 Accuracy:", nb_test05_accuracy)

# Naive Bayes (Multinomial)
nb_multinomial_model05 = MultinomialNB()
nb_multinomial_model05.fit(X_train05, y_train05)
nb_multinomial_train05_accuracy = accuracy_score(y_train05, nb_multinomial_model05.predict(X_train05))
nb_multinomial_test05_accuracy = accuracy_score(y_test05, nb_multinomial_model05.predict(X_test05))

print("\nNaive Bayes (Multinomial) Results:")
print("Train05 Accuracy:", nb_multinomial_train05_accuracy)
print("Test05 Accuracy:", nb_multinomial_test05_accuracy)

# Naive Bayes (Bernoulli)
nb_bernoulli_model05 = BernoulliNB()
nb_bernoulli_model05.fit(X_train05, y_train05)
nb_bernoulli_train05_accuracy = accuracy_score(y_train05, nb_bernoulli_model05.predict(X_train05))
nb_bernoulli_test05_accuracy = accuracy_score(y_test05, nb_bernoulli_model05.predict(X_test05))

print("\nNaive Bayes (Bernoulli) Results:")
print("Train05 Accuracy:", nb_bernoulli_train05_accuracy)
print("Test05 Accuracy:", nb_bernoulli_test05_accuracy)

# SVM Linear
svm_linear_model05 = SVC(kernel='linear')
svm_linear_model05.fit(X_train05, y_train05)
svm_linear_train05_accuracy = accuracy_score(y_train05, svm_linear_model05.predict(X_train05))
svm_linear_test05_accuracy = accuracy_score(y_test05, svm_linear_model05.predict(X_test05))

print("\nSVM Linear Results:")
print("Train05 Accuracy:", svm_linear_train05_accuracy)
print("Test05 Accuracy:", svm_linear_test05_accuracy)

# SVM RBF
svm_rbf_model05 = SVC(kernel='rbf')
svm_rbf_model05.fit(X_train05, y_train05)
svm_rbf_train05_accuracy = accuracy_score(y_train05, svm_rbf_model05.predict(X_train05))
svm_rbf_test05_accuracy = accuracy_score(y_test05, svm_rbf_model05.predict(X_test05))

print("\nSVM RBF Results:")
print("Train05 Accuracy:", svm_rbf_train05_accuracy)
print("Test05 Accuracy:", svm_rbf_test05_accuracy)

# SVM Poly
svm_poly_model05 = SVC(kernel='poly')
svm_poly_model05.fit(X_train05, y_train05)
svm_poly_train05_accuracy = accuracy_score(y_train05, svm_poly_model05.predict(X_train05))
svm_poly_test05_accuracy = accuracy_score(y_test05, svm_poly_model05.predict(X_test05))

print("\nSVM Poly Results:")
print("Train05 Accuracy:", svm_poly_train05_accuracy)
print("Test05 Accuracy:", svm_poly_test05_accuracy)

# Choose the best model based on test accuracy
models05 = {
    'Logistic Regression': lr_test05_accuracy,
    'Decision Tree': dt_test05_accuracy,
    'Random Forest': rf_test05_accuracy,
    'Naive Bayes (Gaussian)': nb_test05_accuracy,
    'Naive Bayes (Multinomial)': nb_multinomial_test05_accuracy,
    'Naive Bayes (Bernoulli)': nb_bernoulli_test05_accuracy,
    'SVM Linear': svm_linear_test05_accuracy,
    'SVM RBF': svm_rbf_test05_accuracy,
    'SVM Poly': svm_poly_test05_accuracy
}

best_model05 = max(models05, key=models05.get)
print("\nBest Model05:", best_model05)

Logistic Regression Results:
Train05 Accuracy: 0.7190677303300484
Test05 Accuracy: 0.724202626641651

Decision Tree Results:
Train05 Accuracy: 0.7190677303300484
Test05 Accuracy: 0.7141963727329581

Random Forest Results:
Train05 Accuracy: 0.7190677303300484
Test05 Accuracy: 0.7217010631644778

Naive Bayes (Gaussian) Results:
Train05 Accuracy: 0.7190677303300484
Test05 Accuracy: 0.7267041901188243

Naive Bayes (Multinomial) Results:
Train05 Accuracy: 0.7056155169716878
Test05 Accuracy: 0.7066916823014384

Naive Bayes (Bernoulli) Results:
Train05 Accuracy: 0.7056155169716878
Test05 Accuracy: 0.7066916823014384

SVM Linear Results:
Train05 Accuracy: 0.7190677303300484
Test05 Accuracy: 0.7185741088180112

SVM RBF Results:
Train05 Accuracy: 0.7189113092444862
Test05 Accuracy: 0.7110694183864915

SVM Poly Results:
Train05 Accuracy: 0.6844986704207727
Test05 Accuracy: 0.6704190118824265

Best Model05: Naive Bayes (Gaussian)


## Bag of Words - Greater than 50

This is the same as the previous approach; the only difference is that the threshold is raised from 10 to 50.

In [36]:
column_sum03 = sums_by_class1.sum()
selected_columns03 = []

for column_name03, sum_value03 in column_sum03.items():
    if sum_value03 > 50:
        selected_columns03.append(column_name03)

# Take an explicit copy so that adding the target column does not raise pandas' SettingWithCopyWarning.
ge_50_df = zero_df[selected_columns03].copy()
ge_50_df['target_variable'] = zero_df['target_variable']
ge_50_df


Unnamed: 0,ani,chmaartpoan,colleges,haakki,hiked,hockey,ilamai,ind,kalanthaayvu,killed,...,sensex,smartphone,tennich,tennis,tharkolai,trophy,veezhththiyathu,vinnappikka,zealand,target_variable
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7987,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,3
7988,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
7989,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
7990,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3


As before, I fit all of the above models and calculate their train and test accuracy.

In [37]:
X06 = ge_50_df.drop(['target_variable'], axis=1)
y06 = ge_50_df['target_variable']
X_train06, X_test06, y_train06, y_test06 = train_test_split(X06,y06, test_size=0.2, random_state=42)

# Logistic Regression
lr_model06 = LogisticRegression()
lr_model06.fit(X_train06, y_train06)
lr_train06_accuracy = accuracy_score(y_train06, lr_model06.predict(X_train06))
lr_test06_accuracy = accuracy_score(y_test06, lr_model06.predict(X_test06))

print("Logistic Regression Results:")
print("Train06 Accuracy:", lr_train06_accuracy)
print("Test06 Accuracy:", lr_test06_accuracy)

# Decision Tree
dt_model06 = DecisionTreeClassifier()
dt_model06.fit(X_train06, y_train06)
dt_train06_accuracy = accuracy_score(y_train06, dt_model06.predict(X_train06))
dt_test06_accuracy = accuracy_score(y_test06, dt_model06.predict(X_test06))

print("\nDecision Tree Results:")
print("Train06 Accuracy:", dt_train06_accuracy)
print("Test06 Accuracy:", dt_test06_accuracy)

# Random Forest
rf_model06 = RandomForestClassifier()
rf_model06.fit(X_train06, y_train06)
rf_train06_accuracy = accuracy_score(y_train06, rf_model06.predict(X_train06))
rf_test06_accuracy = accuracy_score(y_test06, rf_model06.predict(X_test06))

print("\nRandom Forest Results:")
print("Train06 Accuracy:", rf_train06_accuracy)
print("Test06 Accuracy:", rf_test06_accuracy)

# Naive Bayes (Gaussian)
nb_model06 = GaussianNB()
nb_model06.fit(X_train06, y_train06)
nb_train06_accuracy = accuracy_score(y_train06, nb_model06.predict(X_train06))
nb_test06_accuracy = accuracy_score(y_test06, nb_model06.predict(X_test06))

print("\nNaive Bayes (Gaussian) Results:")
print("Train06 Accuracy:", nb_train06_accuracy)
print("Test06 Accuracy:", nb_test06_accuracy)

# Naive Bayes (Multinomial)
nb_multinomial_model06 = MultinomialNB()
nb_multinomial_model06.fit(X_train06, y_train06)
nb_multinomial_train06_accuracy = accuracy_score(y_train06, nb_multinomial_model06.predict(X_train06))
nb_multinomial_test06_accuracy = accuracy_score(y_test06, nb_multinomial_model06.predict(X_test06))

print("\nNaive Bayes (Multinomial) Results:")
print("Train06 Accuracy:", nb_multinomial_train06_accuracy)
print("Test06 Accuracy:", nb_multinomial_test06_accuracy)

# Naive Bayes (Bernoulli)
nb_bernoulli_model06 = BernoulliNB()
nb_bernoulli_model06.fit(X_train06, y_train06)
nb_bernoulli_train06_accuracy = accuracy_score(y_train06, nb_bernoulli_model06.predict(X_train06))
nb_bernoulli_test06_accuracy = accuracy_score(y_test06, nb_bernoulli_model06.predict(X_test06))

print("\nNaive Bayes (Bernoulli) Results:")
print("Train06 Accuracy:", nb_bernoulli_train06_accuracy)
print("Test06 Accuracy:", nb_bernoulli_test06_accuracy)

# SVM Linear
svm_linear_model06 = SVC(kernel='linear')
svm_linear_model06.fit(X_train06, y_train06)
svm_linear_train06_accuracy = accuracy_score(y_train06, svm_linear_model06.predict(X_train06))
svm_linear_test06_accuracy = accuracy_score(y_test06, svm_linear_model06.predict(X_test06))

print("\nSVM Linear Results:")
print("Train06 Accuracy:", svm_linear_train06_accuracy)
print("Test06 Accuracy:", svm_linear_test06_accuracy)

# SVM RBF
svm_rbf_model06 = SVC(kernel='rbf')
svm_rbf_model06.fit(X_train06, y_train06)
svm_rbf_train06_accuracy = accuracy_score(y_train06, svm_rbf_model06.predict(X_train06))
svm_rbf_test06_accuracy = accuracy_score(y_test06, svm_rbf_model06.predict(X_test06))

print("\nSVM RBF Results:")
print("Train06 Accuracy:", svm_rbf_train06_accuracy)
print("Test06 Accuracy:", svm_rbf_test06_accuracy)

# SVM Poly
svm_poly_model06 = SVC(kernel='poly')
svm_poly_model06.fit(X_train06, y_train06)
svm_poly_train06_accuracy = accuracy_score(y_train06, svm_poly_model06.predict(X_train06))
svm_poly_test06_accuracy = accuracy_score(y_test06, svm_poly_model06.predict(X_test06))

print("\nSVM Poly Results:")
print("Train06 Accuracy:", svm_poly_train06_accuracy)
print("Test06 Accuracy:", svm_poly_test06_accuracy)

# Choose the best model based on test accuracy
models06 = {
    'Logistic Regression': lr_test06_accuracy,
    'Decision Tree': dt_test06_accuracy,
    'Random Forest': rf_test06_accuracy,
    'Naive Bayes (Gaussian)': nb_test06_accuracy,
    'Naive Bayes (Multinomial)': nb_multinomial_test06_accuracy,
    'Naive Bayes (Bernoulli)': nb_bernoulli_test06_accuracy,
    'SVM Linear': svm_linear_test06_accuracy,
    'SVM RBF': svm_rbf_test06_accuracy,
    'SVM Poly': svm_poly_test06_accuracy
}

best_model06 = max(models06, key=models06.get)
print("\nBest Model06:", best_model06)


Logistic Regression Results:
Train06 Accuracy: 0.46331925543563274
Test06 Accuracy: 0.4709193245778612

Decision Tree Results:
Train06 Accuracy: 0.46331925543563274
Test06 Accuracy: 0.4709193245778612

Random Forest Results:
Train06 Accuracy: 0.46331925543563274
Test06 Accuracy: 0.4709193245778612

Naive Bayes (Gaussian) Results:
Train06 Accuracy: 0.46331925543563274
Test06 Accuracy: 0.4709193245778612

Naive Bayes (Multinomial) Results:
Train06 Accuracy: 0.43907398717347096
Test06 Accuracy: 0.425891181988743

Naive Bayes (Bernoulli) Results:
Train06 Accuracy: 0.46331925543563274
Test06 Accuracy: 0.4709193245778612

SVM Linear Results:
Train06 Accuracy: 0.4631628343500704
Test06 Accuracy: 0.4709193245778612

SVM RBF Results:
Train06 Accuracy: 0.46331925543563274
Test06 Accuracy: 0.46904315196998125

SVM Poly Results:
Train06 Accuracy: 0.46331925543563274
Test06 Accuracy: 0.4709193245778612

Best Model06: Logistic Regression
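
Nearly every model reports the same accuracy at this threshold. One possible explanation, which the notebook does not verify, is that most titles contain none of the remaining words, leaving all-zero rows that no classifier can tell apart. The sketch below shows one way this could be checked: it compares the share of all-zero rows with the share of the most frequent class. The function name `sparsity_report` and the toy frame are assumptions for illustration.

```python
import pandas as pd


def sparsity_report(df, target='target_variable'):
    """Return (share of rows with no non-zero feature, share of the largest class)."""
    features = df.drop(columns=target)
    all_zero_share = (features == 0).all(axis=1).mean()
    majority_share = df[target].value_counts(normalize=True).iloc[0]
    return all_zero_share, majority_share


# Hypothetical stand-in for ge_50_df.
toy_df = pd.DataFrame({
    'tennis': [1, 0, 0, 0, 0],
    'sensex': [0, 0, 1, 0, 0],
    'target_variable': [3, 3, 1, 2, 2],
})

all_zero_share, majority_share = sparsity_report(toy_df)
print(f"All-zero rows: {all_zero_share:.2f}")
print(f"Majority-class share: {majority_share:.2f}")
```

If `sparsity_report(ge_50_df)` returned a large all-zero share and a majority-class share close to the reported accuracies, that would support the explanation above; otherwise another cause would need to be investigated.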


# N-gram

Now, I apply the N-gram NLP method to extract features from the data frame, using unigrams and bigrams (ngram_range = (1, 2)).

In [38]:
ngram_range = (1, 2)
vectorizer01 = CountVectorizer(ngram_range=ngram_range)
X = vectorizer01.fit_transform(df1['title'])
features_ngram = pd.DataFrame(X.toarray(), columns=vectorizer01.get_feature_names_out())
ngram_df = pd.concat([df1, features_ngram], axis=1)
ngram_df['target_variable'] = df1['target_variable']
ngram_df

Unnamed: 0,title,target_variable,aa,aa valai,aaakp,aaakp chpinnar,aaakp engineering,aaakp hevan,aaakp kirikket,aaakp makaaraachtiraa,...,zoom,zoom app,zoom hindutamilin,zoom with,zuckerberg,zuckerberg loses,zuckerberg over,zuckerberg personal,zurich,zurich diamond
0,koavai vaelaan palkalai maanavarkal nataththum...,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,vaelaan patippukalum vaelai vaayppukalum oar ...,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,yechyechyelsi pilach pilach pothu thaervu th...,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,chmaart kilaach utan hautekkaaka maarum thirun...,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,maanavarkal pangkaerkalaam tisampar m thaethi ...,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7987,haampark oapan tennich chviyaatek vilakal Ha...,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7988,therkaasiya kaalpanthu kuuttamaippu saampiyanc...,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7989,haatrik verriyai pathivu seythathu seekam math...,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7990,aasiya kapati saampiyanchip vathu muraiyaaka p...,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


As before, I fit all of the above models and calculate their train and test accuracy.

In [39]:
# Drop the raw 'title' text as well: the models need purely numeric features.
X07 = ngram_df.drop(['title', 'target_variable'], axis=1)
y07 = ngram_df['target_variable']
X_train07, X_test07, y_train07, y_test07 = train_test_split(X07,y07, test_size=0.2, random_state=42)

# Logistic Regression
lr_model07 = LogisticRegression()
lr_model07.fit(X_train07, y_train07)
lr_train07_accuracy = accuracy_score(y_train07, lr_model07.predict(X_train07))
lr_test07_accuracy = accuracy_score(y_test07, lr_model07.predict(X_test07))

print("Logistic Regression Results:")
print("Train07 Accuracy:", lr_train07_accuracy)
print("Test07 Accuracy:", lr_test07_accuracy)

# Decision Tree
dt_model07 = DecisionTreeClassifier()
dt_model07.fit(X_train07, y_train07)
dt_train07_accuracy = accuracy_score(y_train07, dt_model07.predict(X_train07))
dt_test07_accuracy = accuracy_score(y_test07, dt_model07.predict(X_test07))

print("\nDecision Tree Results:")
print("Train07 Accuracy:", dt_train07_accuracy)
print("Test07 Accuracy:", dt_test07_accuracy)

# Random Forest
rf_model07 = RandomForestClassifier()
rf_model07.fit(X_train07, y_train07)
rf_train07_accuracy = accuracy_score(y_train07, rf_model07.predict(X_train07))
rf_test07_accuracy = accuracy_score(y_test07, rf_model07.predict(X_test07))

print("\nRandom Forest Results:")
print("Train07 Accuracy:", rf_train07_accuracy)
print("Test07 Accuracy:", rf_test07_accuracy)

# Naive Bayes (Gaussian)
nb_model07 = GaussianNB()
nb_model07.fit(X_train07, y_train07)
nb_train07_accuracy = accuracy_score(y_train07, nb_model07.predict(X_train07))
nb_test07_accuracy = accuracy_score(y_test07, nb_model07.predict(X_test07))

print("\nNaive Bayes (Gaussian) Results:")
print("Train07 Accuracy:", nb_train07_accuracy)
print("Test07 Accuracy:", nb_test07_accuracy)

# Naive Bayes (Multinomial)
nb_multinomial_model07 = MultinomialNB()
nb_multinomial_model07.fit(X_train07, y_train07)
nb_multinomial_train07_accuracy = accuracy_score(y_train07, nb_multinomial_model07.predict(X_train07))
nb_multinomial_test07_accuracy = accuracy_score(y_test07, nb_multinomial_model07.predict(X_test07))

print("\nNaive Bayes (Multinomial) Results:")
print("Train07 Accuracy:", nb_multinomial_train07_accuracy)
print("Test07 Accuracy:", nb_multinomial_test07_accuracy)

# Naive Bayes (Bernoulli)
nb_bernoulli_model07 = BernoulliNB()
nb_bernoulli_model07.fit(X_train07, y_train07)
nb_bernoulli_train07_accuracy = accuracy_score(y_train07, nb_bernoulli_model07.predict(X_train07))
nb_bernoulli_test07_accuracy = accuracy_score(y_test07, nb_bernoulli_model07.predict(X_test07))

print("\nNaive Bayes (Bernoulli) Results:")
print("Train07 Accuracy:", nb_bernoulli_train07_accuracy)
print("Test07 Accuracy:", nb_bernoulli_test07_accuracy)

# SVM Linear
svm_linear_model07 = SVC(kernel='linear')
svm_linear_model07.fit(X_train07, y_train07)
svm_linear_train07_accuracy = accuracy_score(y_train07, svm_linear_model07.predict(X_train07))
svm_linear_test07_accuracy = accuracy_score(y_test07, svm_linear_model07.predict(X_test07))

print("\nSVM Linear Results:")
print("Train07 Accuracy:", svm_linear_train07_accuracy)
print("Test07 Accuracy:", svm_linear_test07_accuracy)

# SVM RBF
svm_rbf_model07 = SVC(kernel='rbf')
svm_rbf_model07.fit(X_train07, y_train07)
svm_rbf_train07_accuracy = accuracy_score(y_train07, svm_rbf_model07.predict(X_train07))
svm_rbf_test07_accuracy = accuracy_score(y_test07, svm_rbf_model07.predict(X_test07))

print("\nSVM RBF Results:")
print("Train07 Accuracy:", svm_rbf_train07_accuracy)
print("Test07 Accuracy:", svm_rbf_test07_accuracy)

# SVM Poly
svm_poly_model07 = SVC(kernel='poly')
svm_poly_model07.fit(X_train07, y_train07)
svm_poly_train07_accuracy = accuracy_score(y_train07, svm_poly_model07.predict(X_train07))
svm_poly_test07_accuracy = accuracy_score(y_test07, svm_poly_model07.predict(X_test07))

print("\nSVM Poly Results:")
print("Train07 Accuracy:", svm_poly_train07_accuracy)
print("Test07 Accuracy:", svm_poly_test07_accuracy)


# Choose the best model based on test accuracy
models07 = {
    'Logistic Regression': lr_test07_accuracy,
    'Decision Tree': dt_test07_accuracy,
    'Random Forest': rf_test07_accuracy,
    'Naive Bayes (Gaussian)': nb_test07_accuracy,
    'Naive Bayes (Multinomial)': nb_multinomial_test07_accuracy,
    'Naive Bayes (Bernoulli)': nb_bernoulli_test07_accuracy,
    'SVM Linear': svm_linear_test07_accuracy,
    'SVM RBF': svm_rbf_test07_accuracy,
    'SVM Poly': svm_poly_test07_accuracy
}

best_model07 = max(models07, key=models07.get)
print("\nBest Model07:", best_model07)

MemoryError: Unable to allocate 6.55 GiB for an array with shape (109954, 7992) and data type int64
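
The MemoryError arises because the n-gram counts are held as a dense array with one column per unigram and bigram. Several scikit-learn estimators, including `MultinomialNB`, `LogisticRegression` and `SVC`, accept SciPy sparse matrices directly, so one way around the error might be to skip `.toarray()` and the data-frame step altogether. `GaussianNB` needs dense input, so it would have to be left out or given a reduced feature set. The sketch below uses hypothetical toy titles and labels, not the scraped data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical titles and labels, for illustration only.
toy_titles = [
    "hockey trophy final", "tennis open final",
    "hockey team wins", "tennis star wins trophy",
    "sensex rises today", "sensex falls sharply",
    "market rises today", "market falls sharply",
]
toy_labels = [3, 3, 3, 3, 1, 1, 1, 1]

# fit_transform returns a scipy.sparse matrix; it is never converted to dense.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_sparse = vectorizer.fit_transform(toy_titles)

# train_test_split can slice sparse matrices row-wise.
X_train, X_test, y_train, y_test = train_test_split(
    X_sparse, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

model = MultinomialNB()
model.fit(X_train, y_train)
train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
```

A sparse matrix stores only the non-zero counts, which is why it can be far smaller than the dense array when each title contains only a handful of n-grams.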

# TF-IDF

Now, I apply the TF-IDF NLP method to extract features from the data frame.

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer02 = TfidfVectorizer()
X_0 = vectorizer02.fit_transform(df1['title'])
features_tfidf = pd.DataFrame(X_0.toarray(), columns=vectorizer02.get_feature_names_out())
tfidf_df = pd.concat([df1, features_tfidf], axis=1)
tfidf_df['target_variable'] = df1['target_variable']
tfidf_df

Unnamed: 0,title,target_variable,aa,aaakp,aaakpar,aaakparil,aaakparkal,aaakpkaanichthaanukku,aaakplainil,aaakppaayil,...,zinc,zindabad,zipmat,zoho,zomato,zone,zoology,zoom,zuckerberg,zurich
0,koavai vaelaan palkalai maanavarkal nataththum...,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,vaelaan patippukalum vaelai vaayppukalum oar ...,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,yechyechyelsi pilach pilach pothu thaervu th...,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,chmaart kilaach utan hautekkaaka maarum thirun...,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,maanavarkal pangkaerkalaam tisampar m thaethi ...,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7987,haampark oapan tennich chviyaatek vilakal Ha...,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7988,therkaasiya kaalpanthu kuuttamaippu saampiyanc...,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7989,haatrik verriyai pathivu seythathu seekam math...,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7990,aasiya kapati saampiyanchip vathu muraiyaaka p...,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [41]:
del tfidf_df['title']
tfidf_df

Unnamed: 0,target_variable,aa,aaakp,aaakpar,aaakparil,aaakparkal,aaakpkaanichthaanukku,aaakplainil,aaakppaayil,aac,...,zinc,zindabad,zipmat,zoho,zomato,zone,zoology,zoom,zuckerberg,zurich
0,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7987,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7988,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7989,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7990,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
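
Almost every entry in the table above is 0.0, so the dense DataFrame uses far more memory than the data needs. As a hedged alternative, the sparse matrix returned by `TfidfVectorizer` can be passed to most scikit-learn estimators directly. The toy titles and labels below are invented for illustration and only stand in for `df1['title']` and `df1['target_variable']`.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for df1['title'] / df1['target_variable']
titles = [
    "thangkam vilai uyarvu",
    "thangkam vilai kuraivu",
    "maanavarkal pothu thaervu",
    "maanavarkal thaervu mutivu",
]
labels = [0, 0, 2, 2]

vectorizer = TfidfVectorizer()
X_sparse = vectorizer.fit_transform(titles)  # stays a SciPy sparse matrix

clf = LogisticRegression()
clf.fit(X_sparse, labels)  # no .toarray() needed

# New text must go through the already-fitted vectorizer
pred = clf.predict(vectorizer.transform(["thangkam vilai"]))
print(sparse.issparse(X_sparse), X_sparse.shape, pred)
```

Note that `GaussianNB` requires dense input, so that one model would still need `.toarray()`.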


As before, I fit all of the models above and compute their train and test accuracy.

In [42]:
X09 = tfidf_df.drop(['target_variable'], axis=1)
y09 = tfidf_df['target_variable']
X_train09, X_test09, y_train09, y_test09 = train_test_split(X09,y09, test_size=0.2, random_state=42)

# Logistic Regression
lr_model09 = LogisticRegression()
lr_model09.fit(X_train09, y_train09)
lr_train09_accuracy = accuracy_score(y_train09, lr_model09.predict(X_train09))
lr_test09_accuracy = accuracy_score(y_test09, lr_model09.predict(X_test09))

print("Logistic Regression Results:")
print("Train09 Accuracy:", lr_train09_accuracy)
print("Test09 Accuracy:", lr_test09_accuracy)

# Decision Tree
dt_model09 = DecisionTreeClassifier()
dt_model09.fit(X_train09, y_train09)
dt_train09_accuracy = accuracy_score(y_train09, dt_model09.predict(X_train09))
dt_test09_accuracy = accuracy_score(y_test09, dt_model09.predict(X_test09))

print("\nDecision Tree Results:")
print("Train09 Accuracy:", dt_train09_accuracy)
print("Test09 Accuracy:", dt_test09_accuracy)

# Random Forest
rf_model09 = RandomForestClassifier()
rf_model09.fit(X_train09, y_train09)
rf_train09_accuracy = accuracy_score(y_train09, rf_model09.predict(X_train09))
rf_test09_accuracy = accuracy_score(y_test09, rf_model09.predict(X_test09))

print("\nRandom Forest Results:")
print("Train09 Accuracy:", rf_train09_accuracy)
print("Test09 Accuracy:", rf_test09_accuracy)

# Naive Bayes (Gaussian)
nb_model09 = GaussianNB()
nb_model09.fit(X_train09, y_train09)
nb_train09_accuracy = accuracy_score(y_train09, nb_model09.predict(X_train09))
nb_test09_accuracy = accuracy_score(y_test09, nb_model09.predict(X_test09))

print("\nNaive Bayes (Gaussian) Results:")
print("Train09 Accuracy:", nb_train09_accuracy)
print("Test09 Accuracy:", nb_test09_accuracy)

# Naive Bayes (Multinomial)
nb_multinomial_model09 = MultinomialNB()
nb_multinomial_model09.fit(X_train09, y_train09)
nb_multinomial_train09_accuracy = accuracy_score(y_train09, nb_multinomial_model09.predict(X_train09))
nb_multinomial_test09_accuracy = accuracy_score(y_test09, nb_multinomial_model09.predict(X_test09))

print("\nNaive Bayes (Multinomial) Results:")
print("Train09 Accuracy:", nb_multinomial_train09_accuracy)
print("Test09 Accuracy:", nb_multinomial_test09_accuracy)

# Naive Bayes (Bernoulli)
nb_bernoulli_model09 = BernoulliNB()
nb_bernoulli_model09.fit(X_train09, y_train09)
nb_bernoulli_train09_accuracy = accuracy_score(y_train09, nb_bernoulli_model09.predict(X_train09))
nb_bernoulli_test09_accuracy = accuracy_score(y_test09, nb_bernoulli_model09.predict(X_test09))

print("\nNaive Bayes (Bernoulli) Results:")
print("Train09 Accuracy:", nb_bernoulli_train09_accuracy)
print("Test09 Accuracy:", nb_bernoulli_test09_accuracy)

# SVM Linear
svm_linear_model09 = SVC(kernel='linear')
svm_linear_model09.fit(X_train09, y_train09)
svm_linear_train09_accuracy = accuracy_score(y_train09, svm_linear_model09.predict(X_train09))
svm_linear_test09_accuracy = accuracy_score(y_test09, svm_linear_model09.predict(X_test09))

print("\nSVM Linear Results:")
print("Train09 Accuracy:", svm_linear_train09_accuracy)
print("Test09 Accuracy:", svm_linear_test09_accuracy)

# SVM RBF
svm_rbf_model09 = SVC(kernel='rbf')
svm_rbf_model09.fit(X_train09, y_train09)
svm_rbf_train09_accuracy = accuracy_score(y_train09, svm_rbf_model09.predict(X_train09))
svm_rbf_test09_accuracy = accuracy_score(y_test09, svm_rbf_model09.predict(X_test09))

print("\nSVM RBF Results:")
print("Train09 Accuracy:", svm_rbf_train09_accuracy)
print("Test09 Accuracy:", svm_rbf_test09_accuracy)

# SVM Poly
svm_poly_model09 = SVC(kernel='poly')
svm_poly_model09.fit(X_train09, y_train09)
svm_poly_train09_accuracy = accuracy_score(y_train09, svm_poly_model09.predict(X_train09))
svm_poly_test09_accuracy = accuracy_score(y_test09, svm_poly_model09.predict(X_test09))

print("\nSVM Poly Results:")
print("Train09 Accuracy:", svm_poly_train09_accuracy)
print("Test09 Accuracy:", svm_poly_test09_accuracy)

# Choose the best model based on test accuracy
models09 = {
    'Logistic Regression': lr_test09_accuracy,
    'Decision Tree': dt_test09_accuracy,
    'Random Forest': rf_test09_accuracy,
    'Naive Bayes (Gaussian)': nb_test09_accuracy,
    'Naive Bayes (Multinomial)': nb_multinomial_test09_accuracy,
    'Naive Bayes (Bernoulli)': nb_bernoulli_test09_accuracy,
    'SVM Linear': svm_linear_test09_accuracy,
    'SVM RBF': svm_rbf_test09_accuracy,
    'SVM Poly': svm_poly_test09_accuracy
}

best_model09 = max(models09, key=models09.get)
print("\nBest Model09:", best_model09)


Logistic Regression Results:
Train09 Accuracy: 0.9893633661817614
Test09 Accuracy: 0.9405878674171357

Decision Tree Results:
Train09 Accuracy: 1.0
Test09 Accuracy: 0.8230143839899937

Random Forest Results:
Train09 Accuracy: 1.0
Test09 Accuracy: 0.8986866791744841

Naive Bayes (Gaussian) Results:
Train09 Accuracy: 0.9996871578288753
Test09 Accuracy: 0.8955597248280175

Naive Bayes (Multinomial) Results:
Train09 Accuracy: 0.9791959956202096
Test09 Accuracy: 0.9349593495934959

Naive Bayes (Bernoulli) Results:
Train09 Accuracy: 0.9821679962458939
Test09 Accuracy: 0.941213258286429

SVM Linear Results:
Train09 Accuracy: 0.9964023150320663
Test09 Accuracy: 0.9480925578486554

SVM RBF Results:
Train09 Accuracy: 0.9993743156577507
Test09 Accuracy: 0.9393370856785491

SVM Poly Results:
Train09 Accuracy: 1.0
Test09 Accuracy: 0.8123827392120075

Best Model09: SVM Linear
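
The nine model blocks above repeat the same fit-and-score pattern, so they can be collapsed into a loop over a dictionary of estimators. The sketch below is one possible refactor, not the code that produced the results above. `evaluate_models` is a hypothetical helper name, and the synthetic data from `make_classification` only stands in for the real TF-IDF features. Absolute values are taken because `MultinomialNB` rejects negative feature values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(X_train, X_test, y_train, y_test):
    """Fit each model and return {name: (train_accuracy, test_accuracy)}."""
    models = {
        'Logistic Regression': LogisticRegression(),
        'Decision Tree': DecisionTreeClassifier(),
        'Random Forest': RandomForestClassifier(),
        'Naive Bayes (Gaussian)': GaussianNB(),
        'Naive Bayes (Multinomial)': MultinomialNB(),
        'Naive Bayes (Bernoulli)': BernoulliNB(),
        'SVM Linear': SVC(kernel='linear'),
        'SVM RBF': SVC(kernel='rbf'),
        'SVM Poly': SVC(kernel='poly'),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, model.predict(X_train))
        test_acc = accuracy_score(y_test, model.predict(X_test))
        results[name] = (train_acc, test_acc)
    return results


# Synthetic stand-in data (an assumption, not the notebook's features)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=42)
X = np.abs(X)  # MultinomialNB needs non-negative inputs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

results = evaluate_models(X_train, X_test, y_train, y_test)
for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")

# Pick the model with the highest test accuracy, as in the cells above
best_model = max(results, key=lambda name: results[name][1])
print("\nBest Model:", best_model)
```

With a helper like this, each feature set only needs its own `train_test_split` call followed by one call to the helper.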


## TF-IDF - Chi_square Test

I select features using the chi-square test, keeping only those whose p-value is below 0.05.

In [43]:
from scipy.stats import chi2_contingency
X003 = tfidf_df.drop('target_variable', axis=1)
y003 = tfidf_df['target_variable']
significance_level = 0.05
feature_variables03 = []
for feature in X003.columns:
    # Cross-tabulate the feature's values against the class labels
    contingency_table = pd.crosstab(X003[feature], y003)
    chi2, p, _, _ = chi2_contingency(contingency_table)
    # Keep the feature only if the test is significant
    if p < significance_level:
        feature_variables03.append(feature)
# .copy() gives an independent DataFrame, so the assignment below
# no longer triggers pandas' SettingWithCopyWarning
chi_2_df03 = tfidf_df[feature_variables03].copy()
chi_2_df03['target_variable'] = tfidf_df['target_variable']
chi_2_df03


Unnamed: 0,arrested,down,ends,gold,hiked,hindutamilin,joolai,joon,kaithu,kuraivu,...,sovereign,thalam,thangkam,up,uyarvu,veezhssi,vilai,with,yaep,target_variable
0,0.0,0.0,0.0,0.0,0.0,0.038851,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
1,0.0,0.0,0.0,0.0,0.0,0.037988,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
2,0.0,0.0,0.0,0.0,0.0,0.041544,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
3,0.0,0.0,0.0,0.0,0.0,0.032281,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
4,0.0,0.0,0.0,0.0,0.0,0.038752,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7987,0.0,0.0,0.0,0.0,0.0,0.041824,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
7988,0.0,0.0,0.0,0.0,0.0,0.033946,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
7989,0.0,0.0,0.0,0.0,0.0,0.036349,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
7990,0.0,0.0,0.0,0.0,0.0,0.039250,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3


As before, I fit all of the models above and compute their train and test accuracy.

In [44]:
X010 = chi_2_df03.drop(['target_variable'], axis=1)
y010 = chi_2_df03['target_variable']
X_train010, X_test010, y_train010, y_test010 = train_test_split(X010,y010, test_size=0.2, random_state=42)

# Logistic Regression
lr_model010 = LogisticRegression()
lr_model010.fit(X_train010, y_train010)
lr_train010_accuracy = accuracy_score(y_train010, lr_model010.predict(X_train010))
lr_test010_accuracy = accuracy_score(y_test010, lr_model010.predict(X_test010))

print("Logistic Regression Results:")
print("Train010 Accuracy:", lr_train010_accuracy)
print("Test010 Accuracy:", lr_test010_accuracy)

# Decision Tree
dt_model010 = DecisionTreeClassifier()
dt_model010.fit(X_train010, y_train010)
dt_train010_accuracy = accuracy_score(y_train010, dt_model010.predict(X_train010))
dt_test010_accuracy = accuracy_score(y_test010, dt_model010.predict(X_test010))

print("\nDecision Tree Results:")
print("Train010 Accuracy:", dt_train010_accuracy)
print("Test010 Accuracy:", dt_test010_accuracy)

# Random Forest
rf_model010 = RandomForestClassifier()
rf_model010.fit(X_train010, y_train010)
rf_train010_accuracy = accuracy_score(y_train010, rf_model010.predict(X_train010))
rf_test010_accuracy = accuracy_score(y_test010, rf_model010.predict(X_test010))

print("\nRandom Forest Results:")
print("Train010 Accuracy:", rf_train010_accuracy)
print("Test010 Accuracy:", rf_test010_accuracy)

# Naive Bayes (Gaussian)
nb_model010 = GaussianNB()
nb_model010.fit(X_train010, y_train010)
nb_train010_accuracy = accuracy_score(y_train010, nb_model010.predict(X_train010))
nb_test010_accuracy = accuracy_score(y_test010, nb_model010.predict(X_test010))

print("\nNaive Bayes (Gaussian) Results:")
print("Train010 Accuracy:", nb_train010_accuracy)
print("Test010 Accuracy:", nb_test010_accuracy)

# Naive Bayes (Multinomial)
nb_multinomial_model010 = MultinomialNB()
nb_multinomial_model010.fit(X_train010, y_train010)
nb_multinomial_train010_accuracy = accuracy_score(y_train010, nb_multinomial_model010.predict(X_train010))
nb_multinomial_test010_accuracy = accuracy_score(y_test010, nb_multinomial_model010.predict(X_test010))

print("\nNaive Bayes (Multinomial) Results:")
print("Train010 Accuracy:", nb_multinomial_train010_accuracy)
print("Test010 Accuracy:", nb_multinomial_test010_accuracy)

# Naive Bayes (Bernoulli)
nb_bernoulli_model010 = BernoulliNB()
nb_bernoulli_model010.fit(X_train010, y_train010)
nb_bernoulli_train010_accuracy = accuracy_score(y_train010, nb_bernoulli_model010.predict(X_train010))
nb_bernoulli_test010_accuracy = accuracy_score(y_test010, nb_bernoulli_model010.predict(X_test010))

print("\nNaive Bayes (Bernoulli) Results:")
print("Train010 Accuracy:", nb_bernoulli_train010_accuracy)
print("Test010 Accuracy:", nb_bernoulli_test010_accuracy)

# SVM Linear
svm_linear_model010 = SVC(kernel='linear')
svm_linear_model010.fit(X_train010, y_train010)
svm_linear_train010_accuracy = accuracy_score(y_train010, svm_linear_model010.predict(X_train010))
svm_linear_test010_accuracy = accuracy_score(y_test010, svm_linear_model010.predict(X_test010))

print("\nSVM Linear Results:")
print("Train010 Accuracy:", svm_linear_train010_accuracy)
print("Test010 Accuracy:", svm_linear_test010_accuracy)

# SVM RBF
svm_rbf_model010 = SVC(kernel='rbf')
svm_rbf_model010.fit(X_train010, y_train010)
svm_rbf_train010_accuracy = accuracy_score(y_train010, svm_rbf_model010.predict(X_train010))
svm_rbf_test010_accuracy = accuracy_score(y_test010, svm_rbf_model010.predict(X_test010))

print("\nSVM RBF Results:")
print("Train010 Accuracy:", svm_rbf_train010_accuracy)
print("Test010 Accuracy:", svm_rbf_test010_accuracy)

# SVM Poly
svm_poly_model010 = SVC(kernel='poly')
svm_poly_model010.fit(X_train010, y_train010)
svm_poly_train010_accuracy = accuracy_score(y_train010, svm_poly_model010.predict(X_train010))
svm_poly_test010_accuracy = accuracy_score(y_test010, svm_poly_model010.predict(X_test010))

print("\nSVM Poly Results:")
print("Train010 Accuracy:", svm_poly_train010_accuracy)
print("Test010 Accuracy:", svm_poly_test010_accuracy)


# Choose the best model based on test accuracy
models010 = {
    'Logistic Regression': lr_test010_accuracy,
    'Decision Tree': dt_test010_accuracy,
    'Random Forest': rf_test010_accuracy,
    'Naive Bayes (Gaussian)': nb_test010_accuracy,
    'Naive Bayes (Multinomial)': nb_multinomial_test010_accuracy,
    'Naive Bayes (Bernoulli)': nb_bernoulli_test010_accuracy,
    'SVM Linear': svm_linear_test010_accuracy,
    'SVM RBF': svm_rbf_test010_accuracy,
    'SVM Poly': svm_poly_test010_accuracy
}

best_model010 = max(models010, key=models010.get)
print("\nBest Model010:", best_model010)

Logistic Regression Results:
Train010 Accuracy: 0.4664476771468794
Test010 Accuracy: 0.490931832395247

Decision Tree Results:
Train010 Accuracy: 0.9881119974972626
Test010 Accuracy: 0.44652908067542213

Random Forest Results:
Train010 Accuracy: 0.9884248396683873
Test010 Accuracy: 0.44715447154471544

Naive Bayes (Gaussian) Results:
Train010 Accuracy: 0.35194744251525106
Test010 Accuracy: 0.3614759224515322

Naive Bayes (Multinomial) Results:
Train010 Accuracy: 0.44251525105584233
Test010 Accuracy: 0.4534083802376485

Naive Bayes (Bernoulli) Results:
Train010 Accuracy: 0.444235882997028
Test010 Accuracy: 0.46904315196998125

SVM Linear Results:
Train010 Accuracy: 0.4406381980290943
Test010 Accuracy: 0.44840525328330205

SVM RBF Results:
Train010 Accuracy: 0.4717659940559987
Test010 Accuracy: 0.4727954971857411

SVM Poly Results:
Train010 Accuracy: 0.4406381980290943
Test010 Accuracy: 0.4434021263289556

Best Model010: Logistic Regression


# Class-Keyword

I've implemented a custom method. First, for each class I collect the words that occur in the 'title' column of the 'df1' DataFrame, which gives five word lists (one per class). Then I use set difference to remove from each list every word that also appears in any other class, so each list keeps only the words exclusive to its class. Next, for every title I count how many of its words belong to each class's keyword list and store that count in a new column for the class. After dropping the 'title' column, the result has six columns: the target and the five keyword counts.

In [47]:
education_df = df1[df1['target_variable'] == 2]
unique_words_education = education_df['title'].str.split().explode().unique().tolist()

crime_df = df1[df1['target_variable'] == 1]
unique_words_crime = crime_df['title'].str.split().explode().unique().tolist()

technology_df = df1[df1['target_variable'] == 4]
unique_words_technology = technology_df['title'].str.split().explode().unique().tolist()

business_df = df1[df1['target_variable'] == 0]
unique_words_business = business_df['title'].str.split().explode().unique().tolist()

sports_df = df1[df1['target_variable'] == 3]
unique_words_sports = sports_df['title'].str.split().explode().unique().tolist()

def get_unique_difference(list1, list2, list3, list4, list5):
    # Words in list1 that appear in none of the other four lists.
    # A set is returned so the membership tests below are fast.
    return set(list1).difference(list2, list3, list4, list5)
education_keyword = get_unique_difference(unique_words_education, unique_words_crime, unique_words_technology, unique_words_business, unique_words_sports)
crime_keyword = get_unique_difference(unique_words_crime, unique_words_education, unique_words_technology, unique_words_business, unique_words_sports)
technology_keyword = get_unique_difference(unique_words_technology, unique_words_education, unique_words_crime, unique_words_business, unique_words_sports)
business_keyword = get_unique_difference(unique_words_business, unique_words_education, unique_words_crime, unique_words_technology, unique_words_sports)
sports_keyword = get_unique_difference(unique_words_sports, unique_words_education, unique_words_crime, unique_words_technology, unique_words_business)

new_df = df1.copy()

def count_elements(sentence, keyword):
    return sum(1 if word in keyword else 0 for word in sentence.split())

new_df['education_keyword'] = new_df['title'].apply(lambda x: count_elements(x, education_keyword))
new_df['crime_keyword'] = new_df['title'].apply(lambda x: count_elements(x, crime_keyword))
new_df['technology_keyword'] = new_df['title'].apply(lambda x: count_elements(x, technology_keyword))
new_df['business_keyword'] = new_df['title'].apply(lambda x: count_elements(x, business_keyword))
new_df['sports_keyword'] = new_df['title'].apply(lambda x: count_elements(x, sports_keyword))
del new_df['title']

new_df

Unnamed: 0,target_variable,education_keyword,crime_keyword,technology_keyword,business_keyword,sports_keyword
0,2,1,0,0,0,0
1,2,2,0,0,0,0
2,2,3,0,0,0,0
3,2,6,0,0,0,0
4,2,2,0,0,0,0
...,...,...,...,...,...,...
7987,3,0,0,0,0,6
7988,3,0,0,0,0,10
7989,3,0,0,0,0,7
7990,3,0,0,0,0,5


As before, I fit all of the models above and compute their train and test accuracy.

In [48]:
X011 = new_df.drop(['target_variable'], axis=1)
y011 = new_df['target_variable']
X_train011, X_test011, y_train011, y_test011 = train_test_split(X011,y011, test_size=0.2, random_state=42)

# Logistic Regression
lr_model011 = LogisticRegression()
lr_model011.fit(X_train011, y_train011)
lr_train011_accuracy = accuracy_score(y_train011, lr_model011.predict(X_train011))
lr_test011_accuracy = accuracy_score(y_test011, lr_model011.predict(X_test011))

print("Logistic Regression Results:")
print("Train011 Accuracy:", lr_train011_accuracy)
print("Test011 Accuracy:", lr_test011_accuracy)

# Decision Tree
dt_model011 = DecisionTreeClassifier()
dt_model011.fit(X_train011, y_train011)
dt_train011_accuracy = accuracy_score(y_train011, dt_model011.predict(X_train011))
dt_test011_accuracy = accuracy_score(y_test011, dt_model011.predict(X_test011))

print("\nDecision Tree Results:")
print("Train011 Accuracy:", dt_train011_accuracy)
print("Test011 Accuracy:", dt_test011_accuracy)

# Random Forest
rf_model011 = RandomForestClassifier()
rf_model011.fit(X_train011, y_train011)
rf_train011_accuracy = accuracy_score(y_train011, rf_model011.predict(X_train011))
rf_test011_accuracy = accuracy_score(y_test011, rf_model011.predict(X_test011))

print("\nRandom Forest Results:")
print("Train011 Accuracy:", rf_train011_accuracy)
print("Test011 Accuracy:", rf_test011_accuracy)

# Naive Bayes (Gaussian)
nb_model011 = GaussianNB()
nb_model011.fit(X_train011, y_train011)
nb_train011_accuracy = accuracy_score(y_train011, nb_model011.predict(X_train011))
nb_test011_accuracy = accuracy_score(y_test011, nb_model011.predict(X_test011))

print("\nNaive Bayes (Gaussian) Results:")
print("Train011 Accuracy:", nb_train011_accuracy)
print("Test011 Accuracy:", nb_test011_accuracy)

# Naive Bayes (Multinomial)
nb_multinomial_model011 = MultinomialNB()
nb_multinomial_model011.fit(X_train011, y_train011)
nb_multinomial_train011_accuracy = accuracy_score(y_train011, nb_multinomial_model011.predict(X_train011))
nb_multinomial_test011_accuracy = accuracy_score(y_test011, nb_multinomial_model011.predict(X_test011))

print("\nNaive Bayes (Multinomial) Results:")
print("Train011 Accuracy:", nb_multinomial_train011_accuracy)
print("Test011 Accuracy:", nb_multinomial_test011_accuracy)

# Naive Bayes (Bernoulli)
nb_bernoulli_model011 = BernoulliNB()
nb_bernoulli_model011.fit(X_train011, y_train011)
nb_bernoulli_train011_accuracy = accuracy_score(y_train011, nb_bernoulli_model011.predict(X_train011))
nb_bernoulli_test011_accuracy = accuracy_score(y_test011, nb_bernoulli_model011.predict(X_test011))

print("\nNaive Bayes (Bernoulli) Results:")
print("Train011 Accuracy:", nb_bernoulli_train011_accuracy)
print("Test011 Accuracy:", nb_bernoulli_test011_accuracy)

# SVM Linear
svm_linear_model011 = SVC(kernel='linear')
svm_linear_model011.fit(X_train011, y_train011)
svm_linear_train011_accuracy = accuracy_score(y_train011, svm_linear_model011.predict(X_train011))
svm_linear_test011_accuracy = accuracy_score(y_test011, svm_linear_model011.predict(X_test011))

print("\nSVM Linear Results:")
print("Train011 Accuracy:", svm_linear_train011_accuracy)
print("Test011 Accuracy:", svm_linear_test011_accuracy)

# SVM RBF
svm_rbf_model011 = SVC(kernel='rbf')
svm_rbf_model011.fit(X_train011, y_train011)
svm_rbf_train011_accuracy = accuracy_score(y_train011, svm_rbf_model011.predict(X_train011))
svm_rbf_test011_accuracy = accuracy_score(y_test011, svm_rbf_model011.predict(X_test011))

print("\nSVM RBF Results:")
print("Train011 Accuracy:", svm_rbf_train011_accuracy)
print("Test011 Accuracy:", svm_rbf_test011_accuracy)

# SVM Poly
svm_poly_model011 = SVC(kernel='poly')
svm_poly_model011.fit(X_train011, y_train011)
svm_poly_train011_accuracy = accuracy_score(y_train011, svm_poly_model011.predict(X_train011))
svm_poly_test011_accuracy = accuracy_score(y_test011, svm_poly_model011.predict(X_test011))

print("\nSVM Poly Results:")
print("Train011 Accuracy:", svm_poly_train011_accuracy)
print("Test011 Accuracy:", svm_poly_test011_accuracy)

# Choose the best model based on test accuracy
models011 = {
    'Logistic Regression': lr_test011_accuracy,
    'Decision Tree': dt_test011_accuracy,
    'Random Forest': rf_test011_accuracy,
    'Naive Bayes (Gaussian)': nb_test011_accuracy,
    'Naive Bayes (Multinomial)': nb_multinomial_test011_accuracy,
    'Naive Bayes (Bernoulli)': nb_bernoulli_test011_accuracy,
    'SVM Linear': svm_linear_test011_accuracy,
    'SVM RBF': svm_rbf_test011_accuracy,
    'SVM Poly': svm_poly_test011_accuracy
}

best_model011 = max(models011, key=models011.get)
print("\nBest Model011:", best_model011)

Logistic Regression Results:
Train011 Accuracy: 0.9923353668074456
Test011 Accuracy: 0.9899937460913071

Decision Tree Results:
Train011 Accuracy: 0.9923353668074456
Test011 Accuracy: 0.9899937460913071

Random Forest Results:
Train011 Accuracy: 0.9923353668074456
Test011 Accuracy: 0.9899937460913071

Naive Bayes (Gaussian) Results:
Train011 Accuracy: 0.9884248396683873
Test011 Accuracy: 0.9856160100062539

Naive Bayes (Multinomial) Results:
Train011 Accuracy: 0.9923353668074456
Test011 Accuracy: 0.9899937460913071

Naive Bayes (Bernoulli) Results:
Train011 Accuracy: 0.9923353668074456
Test011 Accuracy: 0.9899937460913071

SVM Linear Results:
Train011 Accuracy: 0.9923353668074456
Test011 Accuracy: 0.9899937460913071

SVM RBF Results:
Train011 Accuracy: 0.9923353668074456
Test011 Accuracy: 0.9899937460913071

SVM Poly Results:
Train011 Accuracy: 0.8994212419834193
Test011 Accuracy: 0.8961851156973109

Best Model011: Logistic Regression
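
One caveat about this setup: the keyword lists are built from every row of `df1` before the train/test split, so words from titles that later land in the test set also shape the features, which can make the reported test accuracy optimistic. A stricter evaluation would derive the keywords from the training split only. The sketch below shows that variant under stated assumptions: `build_keywords` and `keyword_counts` are hypothetical helper names, and the tiny DataFrames are invented for illustration.

```python
import pandas as pd


def build_keywords(train_df):
    """Map each class to the words found only in that class's training titles."""
    words_by_class = {
        label: set(group['title'].str.split().explode())
        for label, group in train_df.groupby('target_variable')
    }
    keywords = {}
    for label, words in words_by_class.items():
        # Union of the words used by every other class
        others = set().union(
            *(w for other, w in words_by_class.items() if other != label))
        keywords[label] = words - others
    return keywords


def keyword_counts(df, keywords):
    """Count, per title, how many words fall in each class's keyword set."""
    counts = pd.DataFrame(index=df.index)
    for label, words in keywords.items():
        counts[f'class_{label}_keyword'] = df['title'].apply(
            lambda title: sum(word in words for word in title.split()))
    return counts


# Hypothetical toy split standing in for the real train/test rows
train_df = pd.DataFrame({
    'title': ["thangkam vilai uyarvu",
              "maanavarkal pothu thaervu",
              "thangkam vilai kuraivu pothu"],
    'target_variable': [0, 2, 0],
})
test_df = pd.DataFrame({
    'title': ["thangkam thaervu pudhiya"],
    'target_variable': [0],
})

keywords = build_keywords(train_df)             # training titles only
X_train_kw = keyword_counts(train_df, keywords)
X_test_kw = keyword_counts(test_df, keywords)   # unseen words count as 0
print(X_train_kw)
print(X_test_kw)
```

Here "pothu" is dropped from both keyword sets because it occurs in both classes, and the unseen test word "pudhiya" contributes to no count.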


# K-Fold Cross Validation

I use 5-fold cross-validation to assess how well the logistic regression model performs on the class-keyword features.

In [49]:
import numpy as np
from sklearn.model_selection import KFold
new_df1 = new_df.copy()
X = new_df1.drop('target_variable', axis=1).values
y = new_df1['target_variable'].values

k_folds = 5

model = LogisticRegression()
kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)
fold_accuracies = []
train_accuracies = []

for fold_idx, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model.fit(X_train, y_train)
    
    y_pred_test = model.predict(X_test)
    accuracy_test = accuracy_score(y_test, y_pred_test)
    fold_accuracies.append(accuracy_test)
    print(f'Fold {fold_idx} Test Accuracy: {accuracy_test}')

    y_pred_train = model.predict(X_train)
    accuracy_train = accuracy_score(y_train, y_pred_train)
    train_accuracies.append(accuracy_train)
    print(f'Fold {fold_idx} Training Accuracy: {accuracy_train}')

average_test_accuracy = np.mean(fold_accuracies)
average_train_accuracy = np.mean(train_accuracies)
print(f'\nAverage Test Accuracy: {average_test_accuracy}')
print(f'Average Training Accuracy: {average_train_accuracy}')


Fold 1 Test Accuracy: 0.9899937460913071
Fold 1 Training Accuracy: 0.9923353668074456
Fold 2 Test Accuracy: 0.9912445278298937
Fold 2 Training Accuracy: 0.992022524636321
Fold 3 Test Accuracy: 0.9887359198998749
Fold 3 Training Accuracy: 0.9926493587738505
Fold 4 Test Accuracy: 0.9912390488110138
Fold 4 Training Accuracy: 0.9920237722865186
Fold 5 Test Accuracy: 0.9981226533166458
Fold 5 Training Accuracy: 0.990303409446356

Average Test Accuracy: 0.991867179189747
Average Training Accuracy: 0.9918668863900983
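
The manual loop above can also be written with scikit-learn's `cross_validate`, which returns per-fold train and test scores in one call. The sketch below swaps in `StratifiedKFold` (an assumption here; the loop above uses plain `KFold`) so that each fold keeps roughly the same class proportions, and it uses synthetic data in place of `new_df`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the class-keyword features (not the real data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    LogisticRegression(), X_demo, y_demo,
    cv=cv, scoring='accuracy', return_train_score=True)

# One entry per fold in each score array
for fold_idx, (train_acc, test_acc) in enumerate(
        zip(scores['train_score'], scores['test_score']), 1):
    print(f"Fold {fold_idx}: train={train_acc:.4f}, test={test_acc:.4f}")

print("Average Test Accuracy:", scores['test_score'].mean())
print("Average Training Accuracy:", scores['train_score'].mean())
```

`cross_validate` clones the estimator for every fold, so no state carries over between folds.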


# THANK YOU!!!