# Intelligent Systems 2023: Practical Assignment 10

## Machine Learning Introduction

Your name: 

Your VUnetID: 

If you do not provide your name and VUnetID we will not accept your submission. 

### Preliminaries

By the end of this exercise you should be able to work with some basic machine learning concepts, and to implement and evaluate simple classifiers for *spam classification* using the popular machine learning library scikit-learn (https://scikit-learn.org/stable/).
Scikit-learn offers many helpful methods for building simple machine learning models and for doing data science.

In this assignment you will:
1. Use pandas to read a dataset from a comma-separated values (.csv) file.
2. Create tf-idf feature vectors with scikit-learn.
3. Train and evaluate basic classification models.
4. Improve classification models for textual data.




### Practicalities

Follow this notebook step by step. For this course it is necessary that you manipulate the Python programs we provide. You can do the exercises in any programming editor you like. Still, please fill in your answers in this notebook as usual.

Please use your studentID+Assignment10.ipynb as the name of the Notebook, and fill in the missing cells.   

Note: unlike the courses dedicated to programming, we will not evaluate the style of your programs. We will, however, test your programs on other data that we provide, and your programs should give the correct output on that test data as well.

As mentioned, the assignment is graded as pass/fail. To pass, you need either fully working code or an explanation of what you tried and what didn't work for the tasks that you were unable to complete (you can use multi-line comments or a text cell).


### Install some packages

First we need to install some additional packages that we will use throughout this assignment.
This might take a while.


In [None]:
!pip install pandas
!pip install scikit-learn

## Training classification models with scikit-learn

Along with this notebook, you have downloaded a small .csv file containing a public spam/ham SMS dataset that is often used for text classification purposes.
We will load this dataset with the pandas library (https://pandas.pydata.org/), which is often used for data analysis.


In [1]:
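#load data
import pandas as pd
df = pd.read_csv('spam.csv', encoding="ISO-8859-1")
#drop every column that contains at least one missing value
df.dropna(how="any", inplace=True, axis=1)
#give the two remaining columns descriptive names
df.columns = ['label', 'message']
df.head()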

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


As you can see, the resulting pandas dataframe contains an index column, a label, and the message.
Let's first have a look at the class distribution.

## Task 1

For this first task, we ask you to do a basic data science task. Try to get an idea about the dataset by checking how balanced/unbalanced the dataset is. To do this, you need to compute the proportion of the *ham* and the *spam* class.

Find a Pandas function to compute the frequency of the labels to get an idea of the label distribution. 
Then write a short description of your results.
What percentage of the messages are labelled as spam?

*Hint: Have a look at the pandas documentation (https://pandas.pydata.org/docs/). There are many ways to get your answer!*
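
As a small illustration of what "frequency of the labels" means, here is a sketch on a made-up four-row dataframe (the `toy` data is hypothetical and is not the assignment dataset). It uses `Series.value_counts`, which is only one of several pandas functions that would work.

```python
import pandas as pd

# Hypothetical toy labels, invented for illustration only.
toy = pd.DataFrame({"label": ["ham", "ham", "spam", "ham"]})

# Absolute number of rows per label.
counts = toy["label"].value_counts()

# Relative share of each label (the shares sum to 1).
shares = toy["label"].value_counts(normalize=True)

print(counts)
print(shares)
```

Multiplying the relative shares by 100 gives percentages.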

In [None]:
#Write your Code for task 1 here.


In [None]:
MyReport1 = """
Write your answer here.
"""

The following code snippet creates textual features, as discussed in last week's lecture. We will create tf-idf vectors and append them to our pandas dataframe.
Then we will perform a simple train/test split of our dataset, using the scikit-learn splitting functions.

Have a look at the different parts that we created. What do the dataframes X_train, y_train, X_test, y_test contain?
Try to understand what is happening here by also having a look at the scikit-learn documentation (https://scikit-learn.org/stable/).

In [None]:
#imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

#compute the tf-idf vectors for the messages and create a new dataframe for them
v = TfidfVectorizer()
tf_idf = v.fit_transform(df['message'])
df_tfidf = pd.DataFrame(tf_idf.toarray(), columns=v.get_feature_names_out())

#combine the original dataframe with the dataframe for the tf-idf vectors
dataframes = [df, df_tfidf]
df_new = pd.concat(dataframes, axis=1)

#split the dataset into training and test set
#(test_size=0.9 puts 90% of the rows in the test set and 10% in the training set)
train, test = train_test_split(df_new, test_size=0.9)

#separate feature matrices X from label vector y
#(columns 0 and 1 are 'label' and 'message', so the tf-idf features start at column 2)
X_train = train.iloc[:, 2:]
X_test = test.iloc[:, 2:]
y_train = train['label']
y_test = test['label']


In [None]:
#let's have a look at the different dataframes here
X_train

## Naive Bayes Classification


In the lecture, we introduced the naive Bayes classification algorithm and computed various examples by hand. Here, we will use scikit-learn to train your own first classification model for spam classification.
However, all examples from the lecture used categorical features, while our tf-idf vectors here are real-valued features.
Thus, the model used here will be slightly different from what we have seen in the lecture.
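
To make the difference concrete, the sketch below fits a naive Bayes variant on real-valued tf-idf features for four made-up messages. It assumes `MultinomialNB`, which is one possible choice among the naive Bayes classes in scikit-learn (`GaussianNB` is another), and all toy data and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, invented for illustration only.
toy_messages = ["free prize now", "win a free prize", "see you at lunch", "lunch at noon"]
toy_labels = ["spam", "spam", "ham", "ham"]

# Real-valued tf-idf features instead of categorical ones.
vectorizer = TfidfVectorizer()
X_toy = vectorizer.fit_transform(toy_messages)

# Assumption: MultinomialNB is used here as one naive Bayes variant that accepts such features.
model = MultinomialNB()
model.fit(X_toy, toy_labels)

# A new message must be transformed with the same fitted vectorizer.
toy_prediction = model.predict(vectorizer.transform(["free prize"]))
print(toy_prediction)
```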



### Task 2

Use the training and test sets created in the previous cell to train a Naive Bayes classifier using scikit-learn.
Please have a look at the documentation on how to fit a classification model using X_train and y_train as input.
Afterwards, compute the accuracy of your classifier.


In [None]:
#Write your classification code here.
#Have a look at the documentation of scikit-learn. It contains many examples on how to use Naive Bayes.

You should see that the accuracy of your Naive Bayes classifier is over 85%.
This seems to be a very good score for a very simple classification model and simple tf-idf features.

### Task 3

Have a look at different evaluation metrics for your classifier and discuss the suitability of accuracy for the spam classification task.
Look at the definition of accuracy and propose another metric that is better suited to our problem.

*Hint: Have a look at this documentation and try out different evaluation metrics: https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics*
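
As a sketch of how metrics can disagree, consider ten made-up labels and a hypothetical classifier that always predicts *ham*. The numbers below are invented for illustration and say nothing about your own model.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical ground truth: nine ham messages and one spam message.
y_true = ["ham"] * 9 + ["spam"]
# Hypothetical classifier that never predicts spam.
y_pred = ["ham"] * 10

# Share of all predictions that are correct.
acc = accuracy_score(y_true, y_pred)

# Metrics computed for the spam class only; zero_division=0 avoids a warning
# when no message was predicted as spam.
prec = precision_score(y_true, y_pred, pos_label="spam", zero_division=0)
rec = recall_score(y_true, y_pred, pos_label="spam", zero_division=0)
f1 = f1_score(y_true, y_pred, pos_label="spam", zero_division=0)

print(acc, prec, rec, f1)
```

Here the accuracy is 0.9 even though not a single spam message is detected, while the recall for the spam class is 0.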

In [None]:
#try out other evaluation metrics here


In [None]:
MyReport2 = """
Write your answer here
"""

### Task 4

Come up with any improvements for the classification model here.
You can come up with a new method and/or different features to improve the classification.
Can you beat the baseline Naive Bayes model?

If you try out a different classification model, the training of the model might take a couple of seconds.
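
One way to experiment with both different features and a different model is to chain them in a scikit-learn pipeline. The sketch below is only an illustration on made-up data: the choice of word bigrams (`ngram_range=(1, 2)`) and of `LogisticRegression` are assumptions, not a recommended solution, and whether they help on the real dataset is for you to test.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, invented for illustration only.
toy_messages = ["free prize now", "win a free prize", "see you at lunch", "lunch at noon"]
toy_labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes raw text and then fits the classifier on the resulting features.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # single words and pairs of adjacent words
    LogisticRegression(),
)
pipeline.fit(toy_messages, toy_labels)

# The fitted pipeline accepts raw text directly.
toy_prediction = pipeline.predict(["free prize"])
print(toy_prediction)
```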

Write at least 10 sentences describing your improvements and why they help the model.

In [None]:
# Put your code here


In [None]:
MyReport3 = """
Answer here.
"""

## Final Task: Collect all the results

Uncomment and run this cell (and all the cells above) to generate the text file that you have to hand in together with the notebook on Canvas!

### Please hand in only the text file which is generated by this method!

In [None]:
def exportToText(*args):
    with open(args[0], "w") as f:
        for argument in args:
            f.write("{}\n".format(argument))

exportToText("assignment10.txt", MyReport1, MyReport2, MyReport3)