In [None]:
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import piplite
await piplite.install('seaborn')
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

# Fruit classification challenge!

**Data**:
<br>
We provide a training dataset comprising 10’000 photoplethysmography (PPG) signals sampled
at 10 Hz (30-second duration each). Each of these 10’000 PPG recordings has a label
(fruit name) that can be used to train an ML model.

**Goal**:<br>
The idea is to:
1. Design an ML-based solution to classify these PPG signals into the different fruits and
train it using the training dataset.
2. Generate the outputs for the 10’000 recordings of the test dataset and upload the results to the shared drive by 12:30.


Recall common steps to tackle a Machine Learning challenge:
1. Exploratory Data Analysis
    -  Check the type of data we have
    -  Check data and label distribution
    -  Visualizations (scatterplots, histograms, boxplots, ...)
2. Feature selection:
    - Assess which features to use from the data
    - Iterative process, usually trial and error
3. Model selection
    - Try different models and evaluate them via train/validation split or k-fold cross validation
    - Select best candidate or few best candidates
4. Model evaluation
    - Assess the performance of the best model on a test dataset/submit the results on the test dataset
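
As an illustration of the k-fold cross-validation mentioned in step 3, here is a minimal sketch. The data is synthetic (two well-separated classes), and the names `X_toy`, `y_toy` and `scores` are made up for this example; they are not part of the challenge data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for real features: two well-separated classes
rng = np.random.default_rng(0)
X_toy = np.concatenate([rng.normal(0.0, 1.0, size=(50, 1)),
                        rng.normal(5.0, 1.0, size=(50, 1))])
y_toy = np.array([0] * 50 + [1] * 50)

# 5-fold cross-validation: each fold is used once as the validation set
scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy, cv=5)
print("Accuracy per fold:", scores)
print("Mean accuracy: %.2f" % scores.mean())
```

Compared with a single train/validation split, the per-fold scores give a rough idea of how much the accuracy varies with the choice of validation set.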
    

Let's load the data. Make sure you are handling the headers of the datasets correctly!

In [None]:
data_raw = ...
fruits_raw = ...

# Exploratory data analysis (EDA)

First, let's convert the data to a 2D numpy array, and the fruits (labels) to a 1D numpy array.
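
A minimal sketch of this kind of conversion, assuming the raw data was loaded into pandas DataFrames. The toy frames below are hypothetical stand-ins, not the challenge files.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the loaded CSV files
toy_data_raw = pd.DataFrame([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
toy_fruits_raw = pd.DataFrame(["apple", "banana"])

# One row per signal, one column per time sample
toy_data = toy_data_raw.to_numpy()
# A single-column frame becomes an (n, 1) array; ravel() flattens it to 1D
toy_fruits = toy_fruits_raw.to_numpy().ravel()

print(toy_data.shape, toy_fruits.shape)
```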

In [None]:
data = ...
fruits = ...

To better visualize the distribution of our labels, let's make a bar plot with the total count of each fruit (label). **Hint**: you can use the [countplot](https://seaborn.pydata.org/generated/seaborn.countplot.html) function from seaborn.
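
As a numerical cross-check of the bar plot, the same counts can be computed directly with NumPy. This is only a sketch on a made-up label array (`toy_fruits`), not the plot the exercise asks for.

```python
import numpy as np

# Made-up labels for illustration
toy_fruits = np.array(["apple", "banana", "apple", "cherry", "apple", "banana"])

# return_counts=True gives the number of occurrences of each unique label
names, counts = np.unique(toy_fruits, return_counts=True)
for name, count in zip(names, counts):
    print(name, count)
```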

In [None]:
# your code here

Now let's see what the signals look like. Plot the first one!

In [None]:
# your code here

We can now visualize a few examples of each fruit.

In [None]:
# get all fruit names
fruit_names = np.unique(fruits)

# Plot some examples of time series for each of the fruits
max_n = 4
for idx in range(len(fruit_names)):
    fig, ax = plt.subplots(1, max_n, figsize=(5*max_n, 5))
    
    # select one fruit type
    fruit = fruit_names[idx] 
    
    # Select, at random, 4 examples of that fruit
    fruit_indices = np.where(np.array(fruits) == fruit)[0]
    random.shuffle(fruit_indices)
    fruit_indices = fruit_indices[0:max_n]
    for n_fruit in range(len(fruit_indices)):
        n = fruit_indices[n_fruit] 
        ax[n_fruit].plot(data[n])
        ax[n_fruit].set_title(fruit)
        
    plt.show()

Finally, for each fruit, let's print some statistics: the mean, std, min and max of each signal, averaged over all signals of that fruit.
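
One possible way to compute such per-fruit averages is with a boolean mask, sketched here on a tiny made-up array. The `toy_` names are illustrative only.

```python
import numpy as np

# Three made-up signals of three samples each
toy_data = np.array([[1.0, 2.0, 3.0],
                     [3.0, 4.0, 5.0],
                     [10.0, 10.0, 10.0]])
toy_fruits = np.array(["apple", "apple", "banana"])

# Boolean mask selecting only the rows labelled "apple"
apple_signals = toy_data[toy_fruits == "apple"]

# Compute each statistic per signal (axis=1), then average over the signals
toy_avg_mean = apple_signals.mean(axis=1).mean()
toy_avg_std = apple_signals.std(axis=1).mean()
toy_avg_min = apple_signals.min(axis=1).mean()
toy_avg_max = apple_signals.max(axis=1).mean()

print('Average mean %f, std %f, min %f max %f'
      % (toy_avg_mean, toy_avg_std, toy_avg_min, toy_avg_max))
```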

In [None]:
for fruit in fruit_names:
    avg_mean = ...
    avg_std = ...
    avg_min = ...
    avg_max = ...
    
    print(fruit)
    print('Average mean %f, std %f, min %f max %f' % (avg_mean, avg_std, avg_min, avg_max))

# Train-validation split 

First let's create the X (predictors) and y (targets or labels) for our problem. We will start with a minimal example, where X will be the mean value of each of the signals, and y just the name of the fruit. **Note**: X should be 2D.
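
To illustrate the 2D requirement: taking the mean along `axis=1` returns a 1D array, which scikit-learn estimators would reject as a feature matrix. A sketch on made-up data (`toy_data`, `toy_X`) of how a single feature can be reshaped into a column:

```python
import numpy as np

toy_data = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])

# mean(axis=1) has shape (n_signals,); reshape(-1, 1) turns it into
# a column of shape (n_signals, 1), i.e. one feature per signal
toy_X = toy_data.mean(axis=1).reshape(-1, 1)
print(toy_X.shape)  # (2, 1)
```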

In [None]:
X = ...
y = ...

To be able to feed this data into a Machine Learning model later, we need y to be numerical (a different integer for each category or fruit). Use the [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) to do the transformation!
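
A short sketch of how `LabelEncoder` behaves, using made-up labels. Note that it assigns integers following the sorted order of the class names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

toy_fruits = np.array(["banana", "apple", "cherry", "apple"])

encoder = LabelEncoder()
toy_y = encoder.fit_transform(toy_fruits)

print(encoder.classes_)  # ['apple' 'banana' 'cherry']
print(toy_y)             # [1 0 2 0]
```

Keep the fitted encoder around: it is needed later to map integer predictions back to fruit names.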

In [None]:
# your code here

Let's do the train/validation split with a validation share of 20%
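
A minimal sketch of a 20% validation split on made-up data. The `random_state` and `stratify` arguments are optional choices made for this example (reproducibility and balanced classes in both subsets); the exercise does not prescribe them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up samples with one feature each and two balanced classes
toy_X = np.arange(20, dtype=float).reshape(-1, 1)
toy_y = np.array([0, 1] * 10)

toy_X_train, toy_X_val, toy_y_train, toy_y_val = train_test_split(
    toy_X, toy_y,
    test_size=0.2,    # 20% of the samples go to the validation set
    random_state=0,   # fixed seed so the split is reproducible
    stratify=toy_y,   # keep the class proportions in both subsets
)
print(toy_X_train.shape, toy_X_val.shape)  # (16, 1) (4, 1)
```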

In [None]:
X_train, X_val, y_train, y_val = ...

# Model training

Initialize and fit a LogisticRegression model with a maximum of 1000 iterations.

In [None]:
# your code here

Make the predictions on the validation set, and report the accuracy (share of correctly classified fruits)
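
The fit, predict and score sequence can be sketched as follows on synthetic data. All `toy_` names are made up for this example, and the accuracy is computed in two equivalent ways: with `accuracy_score` and as the plain share of matching labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic training data: class 0 centred at 0, class 1 centred at 5
rng = np.random.default_rng(0)
toy_X_train = np.concatenate([rng.normal(0.0, 1.0, size=(40, 1)),
                              rng.normal(5.0, 1.0, size=(40, 1))])
toy_y_train = np.array([0] * 40 + [1] * 40)

# A few hand-picked validation points, far from the class boundary
toy_X_val = np.array([[-1.0], [0.5], [4.5], [6.0]])
toy_y_val = np.array([0, 0, 1, 1])

model = LogisticRegression(max_iter=1000)
model.fit(toy_X_train, toy_y_train)

toy_predictions = model.predict(toy_X_val)
toy_accuracy = accuracy_score(toy_y_val, toy_predictions)
# Equivalent manual computation: share of correctly classified samples
toy_accuracy_manual = np.mean(toy_predictions == toy_y_val)

print("Validation accuracy is %.2f" % toy_accuracy)
```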

In [None]:
predictions = ...
accuracy = ...
print("Validation accuracy is %.2f" % accuracy)

# Submit your results on test set

We first load the test data in the proper format

In [None]:
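test_data_raw = pd.read_csv('quiz_test_data.csv', header=None)
# Use the test frame (not the training one) and keep it 2D,
# with one row per signal, so that mean(axis=1) works below
test_data = test_data_raw.values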

We generate the X_test matrix, in this case with the only predictor we used, the mean of the signals. Note that for more complex models, where more predictors are used, this should be updated accordingly.

In [None]:
X_test = np.expand_dims(test_data.mean(axis=1),1)

Generate the predictions. Note the predictions will be integers. Transform them back to the corresponding fruit names for the final submission. **Hint**: check the [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) documentation.
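
A sketch of the reverse mapping with `inverse_transform`, on a made-up encoder and made-up integer predictions:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up encoder; in the notebook, reuse the one fitted on the training labels
toy_encoder = LabelEncoder().fit(["apple", "banana", "cherry"])

toy_int_predictions = np.array([2, 0, 1, 0])
toy_name_predictions = toy_encoder.inverse_transform(toy_int_predictions)
print(toy_name_predictions)  # ['cherry' 'apple' 'banana' 'apple']
```

The encoder must be the same object that encoded the training labels; fitting a new one on different data could produce a different integer-to-name mapping.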

In [None]:
test_predictions = ...
test_fruit_predictions = ...

Now you are ready to generate the output .txt file and share it. You can upload the final solutions at the end of the day in the shared drive. The format should be 'results_name_surname.txt'.

In [None]:
file_path = 'results_enzo_dubois.txt'
np.savetxt(file_path, test_fruit_predictions,fmt='%s')

# Next steps: now it is your turn!

Some ideas that you can try to improve the results:
- Use more statistics from the signals as features (std, min, max, skewness, quantiles, ...)
- Try to exploit temporal information from the signals (is there a periodicity?). It might be useful to use peak detection or Fourier techniques to incorporate temporal information in our features.
- Use other models that we have already seen (random forests, boosting algorithms, MLP, other deep learning models, ...?)
- Model ensembling (merge predictions from different models to generate an "ensembled" prediction)
- Check out external libraries designed for this kind of use case, such as [tsai](https://timeseriesai.github.io/tsai/) (more advanced)
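
As a starting point for the first two ideas, here is a sketch of a feature extractor combining simple statistics with the dominant frequency of each signal, estimated via the FFT. The helper `extract_features` is hypothetical (not part of the notebook), the 10 Hz sampling rate is taken from the challenge description, and the test signals are synthetic sine waves.

```python
import numpy as np

FS = 10.0  # sampling frequency in Hz, as stated in the challenge description


def extract_features(signals, fs=FS):
    """Return one row per signal: mean, std, min, max, dominant frequency (Hz)."""
    signals = np.asarray(signals, dtype=float)
    # Remove the mean so the zero-frequency (DC) bin does not dominate
    centered = signals - signals.mean(axis=1, keepdims=True)
    spectrum = np.abs(np.fft.rfft(centered, axis=1))
    freqs = np.fft.rfftfreq(signals.shape[1], d=1.0 / fs)
    # Frequency of the largest spectral peak of each signal
    dominant = freqs[spectrum.argmax(axis=1)]
    return np.column_stack([
        signals.mean(axis=1),
        signals.std(axis=1),
        signals.min(axis=1),
        signals.max(axis=1),
        dominant,
    ])


# Two synthetic 30-second signals with known frequencies (1.2 Hz and 2.0 Hz)
t = np.arange(300) / FS
toy_signals = np.stack([np.sin(2 * np.pi * 1.2 * t),
                        2.0 + np.sin(2 * np.pi * 2.0 * t)])
toy_features = extract_features(toy_signals)
print(toy_features.shape)   # (2, 5)
print(toy_features[:, -1])  # dominant frequencies in Hz
```

The resulting matrix can be used as X in place of the single mean feature. Whether the dominant frequency actually helps separate the fruits is something to check on the validation set.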

**Note:** If you don't have enough computing power for some solutions, you can try the CPUs/GPUs of [Google Colab](https://colab.research.google.com/?hl=es).

    
