The deadline for this homework is **11.10.2023 08:59** (right before the practice session). After completing the exercises, you should:

1. Download this file to your computer (`File` $\to$ `Download .ipynb`)

2. Name the file *HWx_NameSurname* (for example `HW1_NshanPotikyan.ipynb`)

3. Send the file to the email address `nshan.potikyan@gmail.com` with the subject **ML1**

**Note**

* If you fail to follow any of the above conditions, your homework will not be graded.

* You do not need to send any dataset files or helper scripts that I provide with your homework (since I already have them).

* You need to write the code for the exercises yourself; you can use ``built-in functions``, ``numpy``, ``pandas``, ``sklearn``
and ``matplotlib``. Using other libraries or packages will result in a point deduction.

**Problem.** During the practice session, we tried to classify news articles by their titles using the Naive Bayes algorithm with different data processing methods, but the results were not very good.

* In this homework, use the same dataset, but this time train the classifier on the article paragraph itself.

* Split the training dataset into train/val parts, so that you can evaluate which data processing approach results in better performance.

* Make use of sklearn [pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) to construct the different data processing pipelines.

* Evaluate the model performance in terms of the accuracy score.

* Use the best data processing method to train a final model on the train+val dataset and report the accuracy score on the test dataset.

Run the command below to download the train/test splits of the news dataset.

In [163]:
!wget https://raw.githubusercontent.com/NshanPotikyan/Dasa1Doom/master/files/news_data.zip
!unzip news_data.zip

--2023-10-10 19:46:01--  https://raw.githubusercontent.com/NshanPotikyan/Dasa1Doom/master/files/news_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 148565 (145K) [application/zip]
Saving to: ‘news_data.zip.1’


2023-10-10 19:46:02 (4.36 MB/s) - ‘news_data.zip.1’ saved [148565/148565]

Archive:  news_data.zip
replace train_news.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [174]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

In [175]:
train_data = pd.read_csv('train_news.csv')
test_data = pd.read_csv('test_news.csv')

In [176]:
train_data.head()

| Unnamed: 0 | article_title | article_paragraph | type |
|---|---|---|---|
| 0 | Armenia’s economic activity rate increased by ... | According to the main macro-economic indicato... | economy |
| 1 | ‘Protection of state interest is my only inter... | Deputy Prime Minister of Armenia Tigran Aviny... | economy |
| 2 | Armenia to start producing gold bullions | The Central Bank of Armenia will soon open a ... | economy |
| 3 | Armenia’s economic activity index grows by 7.5... | Armenia’s economic activity index increased b... | economy |
| 4 | Kazakh companies interested in Armenian market... | The relations of Armenia and Kazakhstan remai... | economy |


In [177]:
train_data, val_data = train_test_split(train_data, test_size=0.2, random_state=123)
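
The split above is purely random. As a minimal sketch of one possible variation, the example below passes `stratify` so that the train and validation parts keep the same class proportions. The toy frame and the `sport` label are hypothetical stand-ins (only `economy` is visible in the preview above), and stratification is an assumption here, not something the assignment requires.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for train_news.csv.
# The "sport" label is invented for illustration only.
toy = pd.DataFrame({
    "article_paragraph": [f"paragraph {i}" for i in range(20)],
    "type": ["economy"] * 10 + ["sport"] * 10,
})

# stratify keeps the class proportions equal in both parts.
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, random_state=123, stratify=toy["type"]
)

print(toy_train["type"].value_counts())
print(toy_val["type"].value_counts())
```

With balanced classes the effect is small, but on a skewed dataset a random split can leave a rare class almost absent from the validation part.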

In [178]:
# TfidfVectorizer() turns the corpus into numeric features. I usually tokenize
# text with the spaCy library or with transformers such as BERT, but since we
# have not covered those libraries and methods yet, I did some research and
# chose this method instead; it shows promising results.
#
# TF (term frequency): how often a word occurs in a single document.
# IDF (inverse document frequency): how rare a word is across the corpus;
#   words that appear in many documents get a lower weight.
# TF-IDF score: combines TF and IDF to measure how important a word is to
#   a document relative to the rest of the corpus.
#
# In other words, TfidfVectorizer() replaces each text with a vector of TF-IDF
# scores, which we can feed to classifiers such as KNN, Naive Bayes, or
# decision trees.
pipelines = [
    make_pipeline(
        TfidfVectorizer(),
        MultinomialNB()
    ),
    make_pipeline(
        TfidfVectorizer(),
        KNeighborsClassifier(n_neighbors=5)
    ),
    make_pipeline(
        TfidfVectorizer(),
        DecisionTreeClassifier(random_state=42)
    )
]

In [179]:
best_pipeline = None
best_accuracy = 0.0

In [180]:
for pipeline in pipelines:
    # Fit each candidate on the training split only.
    pipeline.fit(train_data['article_paragraph'], train_data['type'])

    # Score it on the held-out validation split.
    val_predictions = pipeline.predict(val_data['article_paragraph'])
    accuracy = accuracy_score(val_data['type'], val_predictions)

    print(f"Validation Accuracy: {accuracy}")

    # Keep the pipeline with the highest validation accuracy so far.
    if accuracy > best_accuracy:
        best_pipeline = pipeline
        best_accuracy = accuracy


Validation Accuracy: 0.9821428571428571
Validation Accuracy: 0.9464285714285714
Validation Accuracy: 0.9642857142857143
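
The three accuracies above are consistent with a validation set of 56 articles (55, 53 and 54 correct), so the gap between the pipelines comes down to one or two predictions. One way to get a steadier comparison, sketched here as an assumption rather than as part of the assignment, is k-fold cross-validation with `cross_val_score`. The texts and the `sport` label below are hypothetical toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the article paragraphs and types.
texts = [
    "central bank raises interest rates",
    "inflation slows as exports grow",
    "government approves new budget plan",
    "stock market closes higher today",
    "team wins the championship final",
    "striker scores twice in derby match",
    "coach names squad for the tournament",
    "tennis player reaches the semifinal",
]
labels = ["economy"] * 4 + ["sport"] * 4

candidate = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Each fold refits the whole pipeline, so the TF-IDF vocabulary is learned
# only from that fold's training part.
scores = cross_val_score(candidate, texts, labels, cv=4, scoring="accuracy")
print(scores.mean(), scores.std())
```

Because the vectorizer sits inside the pipeline, no information from a held-out fold leaks into the features.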


In [181]:
full_train = pd.concat([train_data, val_data])
best_pipeline.fit(full_train['article_paragraph'], full_train['type'])

In [182]:
test_predictions = best_pipeline.predict(test_data['article_paragraph'])
test_accuracy = accuracy_score(test_data['type'], test_predictions)

In [183]:
print(f"Test Accuracy: {test_accuracy}")

Test Accuracy: 0.989247311827957
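
A single accuracy number does not show which classes get confused with each other. As a rough sketch, a confusion matrix can break the score down per class. The label lists below are hypothetical and are not the notebook's actual predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = ["economy", "economy", "sport", "sport", "sport"]
y_pred = ["economy", "sport", "sport", "sport", "sport"]
class_names = ["economy", "sport"]

# Rows are true classes, columns are predicted classes, in class_names order.
cm = confusion_matrix(y_true, y_pred, labels=class_names)
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```

In the notebook the same two calls could be applied to `test_data['type']` and `test_predictions`.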
