# Data Mining Techniques
## Assignment 1

### Group 98: Moos Middelkoop, Willem Huijzer, Max Feucht

In [1]:
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib

### Task 1A: Exploratory Data Analysis

Start with exploring the raw data that is available:
- Notice all sorts of properties of the dataset: how many records are there, how many
attributes, what kinds of attributes are there, ranges of values, distribution of values,
relationships between attributes, missing values, and so on. A table is often a suitable
way of showing such properties of a dataset. Notice if something is interesting (to you,
or in general), make sure you write it down if you find something worth mentioning. <br><br>
- Make various plots of the data. Is there something interesting worth reporting? Report
the figures, discuss what is in them. What meaning do those bars, lines, dots, etc.
convey? Please select essential and interesting plots for discussion, as you have limited
space for reporting your findings.

In [9]:
# Read the data
data = pd.read_csv('data/dataset_mood_smartphone.csv', index_col=0)
display(data.head(10))

# Print the number of rows and columns
print(f"\nNumber of lines: {data.shape[0]}\t number of columns: {data.shape[1]}")

# Print the number of unique IDs
print(f"\nNumber of unique users: {data['id'].nunique()}")

# Print the unique IDs
print(f"\nUnique IDs: {data['id'].unique()}")

# Check whether any entries fall in a year other than 2014
print(f"\nUnique years: {data['time'].str[:4].unique()}")

# Print unique months
print(f"\nUnique months: {data['time'].str[5:7].unique()}")

# Print the number of unique month-day combinations
print(f"\nNumber of unique month day combinations: {data['time'].str[5:10].nunique()}")

# Print the unique variables, and the number of them
print(f"\nUnique variables: {data['variable'].unique()}")
print(f"\nNumber of unique variables: {data['variable'].nunique()}")

Unnamed: 0,id,time,variable,value
1,AS14.01,2014-02-26 13:00:00.000,mood,6.0
2,AS14.01,2014-02-26 15:00:00.000,mood,6.0
3,AS14.01,2014-02-26 18:00:00.000,mood,6.0
4,AS14.01,2014-02-26 21:00:00.000,mood,7.0
5,AS14.01,2014-02-27 09:00:00.000,mood,6.0
6,AS14.01,2014-02-27 12:00:00.000,mood,6.0
7,AS14.01,2014-02-27 15:00:00.000,mood,7.0
8,AS14.01,2014-03-21 09:00:00.000,mood,6.0
9,AS14.01,2014-03-21 11:00:00.000,mood,6.0
10,AS14.01,2014-03-21 15:00:00.000,mood,7.0



Number of lines: 376912	 number of columns: 4

Number of unique users: 27

Unique IDs: ['AS14.01' 'AS14.02' 'AS14.03' 'AS14.05' 'AS14.06' 'AS14.07' 'AS14.08'
 'AS14.09' 'AS14.12' 'AS14.13' 'AS14.14' 'AS14.15' 'AS14.16' 'AS14.17'
 'AS14.19' 'AS14.20' 'AS14.23' 'AS14.24' 'AS14.25' 'AS14.26' 'AS14.27'
 'AS14.28' 'AS14.29' 'AS14.30' 'AS14.31' 'AS14.32' 'AS14.33']

Unique years: ['2014']

Unique months: ['02' '03' '04' '05' '06']

Number of unique month day combinations: 113

Unique variables: ['mood' 'circumplex.arousal' 'circumplex.valence' 'activity' 'screen'
 'call' 'sms' 'appCat.builtin' 'appCat.communication'
 'appCat.entertainment' 'appCat.finance' 'appCat.game' 'appCat.office'
 'appCat.other' 'appCat.social' 'appCat.travel' 'appCat.unknown'
 'appCat.utilities' 'appCat.weather']
Number of unique variables: 19
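
A per-variable summary table is one way to capture the properties listed in the task (record counts, value ranges, missing values). The sketch below assumes the long format shown above (`id`, `time`, `variable`, `value`). The small `sample` frame and the helper `summarize_variables` are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the long-format data (id, time, variable, value).
sample = pd.DataFrame({
    "id": ["AS14.01"] * 4 + ["AS14.02"] * 2,
    "time": ["2014-02-26 13:00:00.000"] * 6,
    "variable": ["mood", "mood", "screen", "screen", "mood", "screen"],
    "value": [6.0, 7.0, 12.5, np.nan, 8.0, 30.0],
})

def summarize_variables(df):
    """Return one row per variable with counts, missing values, and value range."""
    grouped = df.groupby("variable")["value"]
    # "size" counts all rows, "count" only the non-missing values.
    summary = grouped.agg(["size", "count", "min", "max", "mean"])
    summary = summary.rename(columns={"size": "n_records", "count": "n_non_missing"})
    summary["n_missing"] = summary["n_records"] - summary["n_non_missing"]
    summary["n_users"] = df.groupby("variable")["id"].nunique()
    return summary

summary = summarize_variables(sample)
print(summary)
```

Applied to the real data, the same call would be `summarize_variables(data)`.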


### Task 1B: Data Cleaning

As the insights from Task 1A will have shown, the dataset you analyze contains quite a bit of
noise. Values are sometimes missing, and some extreme or incorrect values are likely
outliers that you may want to remove from the dataset. We will clean the dataset in two steps:
- Apply an approach to remove extreme and incorrect values from your dataset. Describe
what your approach is, why you consider that to be a good approach, and describe what
the result of applying the approach is. <br><br>
- Impute the missing values using two different approaches. Describe the approaches
and study the impact of applying them to your data. Argue which one of the two approaches
would be most suitable and select that one to form your cleaned dataset. Also
base your choice on scientific literature.

Advanced: The advanced dataset contains a number of time series; select approaches to impute
missing values that are logical for such time series. Also consider what to do with prolonged
periods of missing data in a time series.
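
The two cleaning steps could look roughly as follows. This is only a sketch: the IQR rule with `k=1.5`, the two imputation methods (forward fill and time-based linear interpolation), and the 2-day gap limit are assumptions chosen for illustration, and the `series` values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series for one user and one variable; NaN marks missing days.
days = pd.date_range("2014-03-01", periods=8, freq="D")
series = pd.Series([6.0, 7.0, np.nan, np.nan, 8.0, 60.0, 7.0, np.nan], index=days)

def mask_iqr_outliers(s, k=1.5):
    """Set values outside [Q1 - k*IQR, Q3 + k*IQR] to NaN (k=1.5 is an assumed default)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.where(s.between(q1 - k * iqr, q3 + k * iqr))

cleaned = mask_iqr_outliers(series)

# Approach 1: forward fill, carrying the last observation forward
# for at most 2 consecutive missing days.
ffilled = cleaned.ffill(limit=2)

# Approach 2: time-based linear interpolation between known neighbours,
# filling at most 2 consecutive missing days and never extrapolating.
interpolated = cleaned.interpolate(method="time", limit=2, limit_area="inside")

print(pd.DataFrame({"raw": series, "ffill": ffilled, "interp": interpolated}))
```

The `limit` argument is one simple way to handle prolonged gaps: values beyond the limit stay missing and can be dropped or treated separately instead of being filled with increasingly unreliable estimates.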

In [None]:
## Code Here


### Task 1C: Feature Engineering 

While we now have a clean dataset, we can still take one step before we move to classification
that can in the end help to improve performance, namely feature engineering. As discussed
during the lectures, feature engineering is a creative process and can involve for example the
transformation of values (e.g. take the log of values given a certain distribution of values) or
combining multiple features (e.g. two features that are more valuable combined than the two
separate values). Think of a creative feature engineering approach for your dataset, describe
it, and apply it. Report on why you think this is a useful enrichment of your dataset. <br>

Advanced: Essentially there are two approaches you can consider to create a predictive model
using this dataset (which we will do in the next part of this assignment): (1) use a machine
learning approach that can deal with temporal data (e.g. ARIMA, recurrent neural networks),
or (2) aggregate the history to create attributes that can be used in a
more common machine learning approach (e.g. SVM, decision tree). For instance, you could use
the average mood during the last five days as a predictor. Ample literature is present in the
area of temporal data mining that describes how such a transformation can be made. For
the feature engineering, you are going to focus on such a transformation in this part of the
assignment. This is illustrated in Figure 1.

In the end, we end up with a dataset with a number of training instances per patient (as
you have a number of time points for which you can train), i.e. an instance that concerns
the mood at t=1, t=2, etc. The time point from which you can start predicting depends on
how much history you consider relevant (if you use a window of 5 days of
history to create attributes, you cannot create training instances before the 6th day). To arrive
at this dataset, you need to:
1. Define attributes that aggregate the history, draw inspiration from the field of temporal
data mining.
2. Define the target by averaging the mood over the entire day.
3. Create an instance-based dataset as described in Figure 1.
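
The three steps above can be sketched with a rolling window. The example assumes a table with one row per day for a single user in which `mood` is already the daily average. The column names, the chosen aggregates, and the helper `build_instances` are illustrative assumptions, and the values are invented.

```python
import pandas as pd

# Hypothetical daily table for one user: one row per day, mood averaged per day.
daily = pd.DataFrame(
    {
        "mood": [6.0, 7.0, 6.5, 7.5, 8.0, 7.0, 6.0, 7.0],
        "screen": [100.0, 80.0, 120.0, 90.0, 60.0, 110.0, 130.0, 95.0],
    },
    index=pd.date_range("2014-03-01", periods=8, freq="D"),
)

def build_instances(df, window=5):
    """Aggregate the previous `window` days into features; the target is the current day's mood."""
    # shift(1) excludes the current day, so features only use the past.
    past = df.shift(1)
    features = pd.DataFrame({
        "mood_mean": past["mood"].rolling(window).mean(),
        "mood_std": past["mood"].rolling(window).std(),
        "screen_mean": past["screen"].rolling(window).mean(),
    })
    features["target_mood"] = df["mood"]
    # Days without a complete history window are dropped.
    return features.dropna()

instances = build_instances(daily, window=5)
print(instances)
```

With a 5-day window the first instance appears on the 6th day, as described above. For multiple users, the function would be applied per `id` so that windows never span two patients.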

In [None]:
## Code Here

### Task 2A: Application of Classification Algorithms

Identify the target (i.e. the class you want to predict) for your dataset. In case you use the
dataset we collected, you are free to choose whatever you like. Split your data into a training
and a test set and apply two classification algorithms, at least one of which should have been
discussed during the lectures. Optimize the hyperparameters of the approaches. Measure
and discuss the performance using a performance metric and argue why that is a suitable
metric. Describe all steps in your process clearly and fully to make sure it is reproducible.<br>
Advanced: For the advanced assignment you go through the same steps (and shape it into
a classification problem for predicting the mood of the next day); however, you are required
to use two different types of classification algorithms, namely one that uses the dataset you
formed in Task 1C (e.g. using a random forest) and an algorithm that is inherently temporal
(e.g. ARIMA, recurrent neural networks). Also consider a good evaluation setup given the
nature of the dataset.
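
One possible evaluation setup for temporally ordered instances is sketched below: the most recent part of the data is held out as the test set, and hyperparameters are tuned with forward-chaining folds. The synthetic `X` and `y`, the 80/20 split, the parameter grid, and the choice of accuracy as the metric are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Hypothetical instance table: rows are in chronological order,
# X holds aggregated history features and y a binarised mood label.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

# Hold out the most recent 20% as the test set (no shuffling).
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Forward-chaining folds: each validation fold lies after its training data.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=TimeSeriesSplit(n_splits=4),
    scoring="accuracy",
)
search.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, round(test_accuracy, 3))
```

Shuffled cross-validation would let the model see future days while predicting past ones, which is why the folds here respect time order.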

In [None]:
## Code here

### Task 2B: Winning Classification Algorithms

Machine learning techniques that are used in Data Mining projects develop quickly these
days. One nice way to track these developments is to see which algorithms win competitions
on websites such as Kaggle. Your task is to describe the approach of the winner of one of those
competitions that focus on a classification task. The following sites might serve as starting
points: <br>
- http://www.kaggle.com/ - DM competitions
- https://www.kdd.org/kdd-cup - KDD Cup <br>

You should be able to find other relevant competitions by searching the Web.
The main goal is that you can demonstrate that you understand a technique that beats other
techniques under certain conditions (specified by the task and data at hand). Here’s what
we’d like you to include in the report for this task: <br>

- A description of the competition: what competition, when was it held, what data they
were using, what task(s) they were solving, what evaluation measure(s) they used.
- Who was the winner, what technique did they use?
- What was the main idea of the winning approach? (Typically this would come from a
paper written by the winners.)
- What makes the winning approach stand out, or how is it different from standard, or
non-winning methods? <br>

Particular rules and points to consider:
- A suggestion: 1 page should be more than enough for this task.
- Needless to say, but for the record, please do not copy and paste from papers. Always
cite (properly) the source of the paper you are using.

##### Answer here

### Task 3: Association Rules

During the lecture we saw the APRIORI algorithm, which targets finding associations in
datasets, for example predicting that an item is likely to be bought given the other items that
are already in the shopping basket. As mentioned during the lecture, many innovations have been
made to improve APRIORI and other methods. One category of improvements involves grouping
products into higher-level product categories (e.g. a Pizza Margherita and a Pizza Quattro
Formaggio are both pizzas). Find an approach that aims to do this and describe it. Discuss
the pros and cons of such an approach.
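
To make the idea of product categories concrete, the sketch below extends every transaction with the category of each item before counting support, in the spirit of what is often called generalized or multi-level association rule mining. The taxonomy, the transactions, and both helper functions are invented for illustration, and only pair support is computed rather than a full APRIORI pass.

```python
from collections import Counter
from itertools import combinations

# Hypothetical taxonomy mapping each item to its parent category.
taxonomy = {
    "pizza margherita": "pizza",
    "pizza quattro formaggio": "pizza",
    "cola": "soft drink",
    "lemonade": "soft drink",
}

transactions = [
    {"pizza margherita", "cola"},
    {"pizza quattro formaggio", "cola"},
    {"pizza margherita", "lemonade"},
    {"pizza quattro formaggio"},
]

def extend_with_ancestors(transaction, taxonomy):
    """Add the categories of every item so support can be counted at both levels."""
    extended = set(transaction)
    for item in transaction:
        parent = taxonomy.get(item)
        while parent is not None:
            extended.add(parent)
            parent = taxonomy.get(parent)
    return extended

def pair_support(transactions, taxonomy):
    """Return the relative support of item pairs, skipping item/direct-parent pairs."""
    counts = Counter()
    for t in transactions:
        items = sorted(extend_with_ancestors(t, taxonomy))
        for a, b in combinations(items, 2):
            if taxonomy.get(a) == b or taxonomy.get(b) == a:
                continue  # an item always co-occurs with its own category
            counts[(a, b)] += 1
    return {pair: n / len(transactions) for pair, n in counts.items()}

support = pair_support(transactions, taxonomy)
print(support[("pizza", "soft drink")])        # category-level pair
print(support[("cola", "pizza margherita")])   # item-level pair
```

In this toy data the category-level pair has much higher support than any single item-level pair, which illustrates the main benefit. The cost is a larger item universe and many redundant rules that need pruning.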



##### Answer here

### Task 4: Numerical Prediction
Similar to Task 2A, apply a machine learning algorithm to your dataset, but now focus on predicting
a numerical target. Describe the same details as you did for the classification problem.
Highlight the differences you see between the two types of prediction tasks.
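
A minimal regression sketch is given below, assuming instances in chronological order with the next day's average mood as a numeric target. The synthetic data, the ordinary least squares model fitted with NumPy, and the mean-prediction baseline are illustrative choices, not a prescribed method.

```python
import numpy as np

# Hypothetical instances in chronological order: features are aggregated history,
# the target is the next day's average mood on its original numeric scale.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 7.0 + 0.8 * X[:, 0] - 0.3 * X[:, 1] + 0.1 * rng.normal(size=100)

split = int(0.8 * len(X))  # chronological split, no shuffling
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

def add_intercept(features):
    """Prepend a column of ones so the model can learn an offset."""
    return np.column_stack([np.ones(len(features)), features])

# Ordinary least squares fit on the training part.
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)
predictions = add_intercept(X_test) @ coef

# Baseline that always predicts the training mean.
baseline = np.full_like(y_test, y_train.mean())
mse_model = np.mean((y_test - predictions) ** 2)
mse_baseline = np.mean((y_test - baseline) ** 2)
print(mse_model, mse_baseline)
```

Unlike the classification setup, the target keeps its ordering and scale, so errors are measured as distances rather than as right or wrong labels.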

##### Answer here

### Task 5A: Characteristics of Evaluation Metrics
Consider the following two error measures: mean squared error (MSE) and mean absolute
error (MAE).
- Write down their corresponding formulae.
- Discuss: Why would someone use one and not the other?
- Describe an example situation (dataset, problem, algorithm perhaps) where using MSE
or MAE would give identical results. Justify your answer (some maths may come in handy,
but a clear explanation is also sufficient).
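
As a starting point, with the standard definitions $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ and $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$, the two measures coincide whenever every residual is -1, 0 or 1, because then $e^2 = |e|$ for each instance. The sketch below checks this for one invented case, binary targets with hard 0/1 predictions. Other situations may qualify as well.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Binary targets with hard 0/1 predictions: every residual is -1, 0 or 1,
# so e**2 == |e| for each instance and the two averages coincide.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(mse(y_true, y_pred), mae(y_true, y_pred))
```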

##### Answer here

### Task 5B: Impact of Evaluation Metrics

Apply the MSE and MAE as evaluation metrics to the numerical prediction problem you have
worked on under Task 4. Describe how the model behaves under the different characteristics
of the two metrics and describe the implications.
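
The different characteristics can be illustrated with two invented sets of predictions that have the same MAE but very different MSE. The numbers below are made up and only serve to show how a single large error dominates the squared measure.

```python
import numpy as np

y_true = np.array([7.0, 6.0, 8.0, 7.0, 6.0])
pred_even = np.array([6.0, 7.0, 7.0, 8.0, 7.0])    # every prediction is off by 1
pred_spiky = np.array([7.0, 6.0, 8.0, 7.0, 11.0])  # four exact, one off by 5

def mse(y, p):
    """Mean squared error."""
    return float(np.mean((y - p) ** 2))

def mae(y, p):
    """Mean absolute error."""
    return float(np.mean(np.abs(y - p)))

for name, pred in [("even", pred_even), ("spiky", pred_spiky)]:
    print(name, "MAE:", mae(y_true, pred), "MSE:", mse(y_true, pred))
```

Under MAE the two models look equally good, while MSE clearly prefers the model without the single large miss. Which behaviour is desirable depends on how costly rare large errors are for the application.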

##### Answer here