<a href="https://colab.research.google.com/github/pierredevillers/DMML2022_Coop/blob/main/Project_Coop_2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Mining and Machine Learning - Project

## Detecting Difficulty Level of French Texts

### Step by step guidelines

The following step-by-step guidelines will help you get started with your project for the Data Mining and Machine Learning class.
To test what you learned in the class, we will hold a competition. You will create a classifier that predicts the difficulty level of a French text (A1, ..., C2). The team with the highest rank will get some goodies in the last class (souvenirs from tech companies: Amazon, LinkedIn, etc.).

**2 people per team**

Choose a team here:
https://moodle.unil.ch/mod/choicegroup/view.php?id=1305831


#### 1. 📂 Create a public GitHub repository for your team using this naming convention `DMML2022_[your_team_name]` with the following structure:
- data (folder) 
- code (folder) 
- documentation (folder)
- a readme file (.md): *mention team name, participants, brief description of the project, approach, summary of results table and link to the explanatory video (see below).*

All team members should contribute to the GitHub repository.

#### 2. 🇰 Join the competition on Kaggle using the invitation link we sent on Slack.

Under the Team tab, save your team name (`UNIL_your_team_name`) and make sure your team members join in as well. You can merge your user account with your teammates in order to create a team.

#### 3. 📓 Read the data into your Colab notebook. There should be one code notebook per team, but all team members can participate and contribute code.

You can either use the Kaggle API with your Kaggle credentials (as explained below; this is **entirely optional**), or download the data from Kaggle and upload it to your team's GitHub repository under the data subfolder.

#### 4. 💎 Train your models and upload the code under your team's GitHub repo. Set the `random_state=0`.
- baseline
- logistic regression with TFidf vectoriser (simple, no data cleaning)
- KNN & hyperparameter optimisation (simple, no data cleaning)
- Decision Tree classifier & hyperparameter optimisation (simple, no data cleaning)
- Random Forests classifier (simple, no data cleaning)
- another technique or combination of techniques of your choice

BE CREATIVE! You can use whatever method you want, in order to climb the leaderboard. The only rule is that it must be your own work. Given that, you can use all the online resources you want. 

#### 5. 🎥 Create a YouTube video (10-15 minutes) of your solution and embed it in your notebook. Explain the algorithms used and the evaluation of your solutions. *Select* projects will also be presented live by the group during the last class.


### Submission details (one per team)

1. Download a ZIP file of your team's repository and submit it on Moodle here. IMPORTANT: in the submission comment, insert a link to the repository on GitHub.
https://moodle.unil.ch/mod/assign/view.php?id=1305833



### Grading (one per team)
- 20% Kaggle Rank
- 50% code quality (using classes, splitting into proper files, documentation, etc)
- 15% GitHub quality (include link to video, table with progress over time, organization of code, images, etc)
- 15% video quality (good sound, good slides, interesting presentation).

## Some further details for points 3 and 4 above.

### 3. Read data into your notebook with the Kaggle API (optional but useful). 

You can also download the data from Kaggle and put it in the data folder of your team's repo.

In [None]:
# reading in the data via the Kaggle API

# mount your Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
# install Kaggle
! pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Log into your Kaggle account, go to Account > API > Create new API token. You will obtain a kaggle.json file, which you save on your Google Drive directly in My Drive.

In [None]:
!mkdir -p train


In [None]:
# read in your Kaggle credentials from Google Drive (adjust the path to where you saved kaggle.json)
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/Colab_Notebooks/kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json


In [None]:
# download the dataset from the competition page
! kaggle competitions download -c detecting-french-texts-difficulty-level-2022



In [None]:
# from zipfile import ZipFile
# import zipfile

# pd.read_csv(zip_file.open("training_data.csv"))


In [None]:
! unzip detecting-french-texts-difficulty-level-2022.zip -d train



In [None]:
# Import required packages

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 
import seaborn as sns
sns.set_style("whitegrid")

# import some additional packages
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler

In [None]:
# read in your training data

import pandas as pd
import numpy as np

df = pd.read_csv('/content/train/training_data.csv')


In [None]:
df.head()

In [None]:
df.difficulty.value_counts()

In [None]:
df.isnull().sum()

Have a look at the data on which to make predictions.

In [None]:
df_pred = pd.read_csv('/content/train/unlabelled_test_data.csv')
df_pred.head()

And this is the format for your submissions.

In [None]:
df_example_submission = pd.read_csv('/content/train/sample_submission.csv')
df_example_submission.head()

### 4. Train your models

Set your X and y variables. 
Set the `random_state=0`
Split the data into a train and test set using the following parameters `train_test_split(X, y, test_size=0.2, random_state=0)`.

#### 4.1. Baseline
What is the baseline for this classification problem?
> Base Rate = (number of observations in the most frequent class) / (total observations)

In [None]:
df.difficulty.value_counts()

In [None]:
base_rate = np.max(df.difficulty.value_counts()/df.difficulty.shape[0]) 
# With 6 balanced classes, the base rate should be around 1/6 ≈ 0.1667
print(f"Base rate:\n{base_rate:.4f}")
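As a cross-check, the same baseline can be obtained with scikit-learn's `DummyClassifier`, which always predicts the most frequent class. This is a minimal sketch on made-up labels (the `labels` array and the placeholder feature column are assumptions, not the competition data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels for illustration only (not the competition data)
labels = np.array(["A1", "A1", "A1", "A2", "B1", "B2"])
# DummyClassifier ignores the features, so a placeholder column is enough
X_dummy = np.zeros((len(labels), 1))

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_dummy, labels)

# Accuracy of always predicting the majority class equals the base rate
dummy_accuracy = dummy.score(X_dummy, labels)
class_counts = np.unique(labels, return_counts=True)[1]
base_rate_toy = class_counts.max() / len(labels)
print(f"Dummy accuracy: {dummy_accuracy:.4f}, base rate: {base_rate_toy:.4f}")
```

Any model worth keeping should beat this number on the test set.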

#### Encode column difficulty 

In [None]:
# import some additional packages
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder

# Use an ordinal encoder for the difficulty level
oe = OrdinalEncoder()
# set the order of your categories
oe.set_params(categories=[['A1', 'A2', 'B1', 'B2', 'C1', 'C2']])

# fit-transform a dataframe of the categorical difficulty variable
oe_difficulty = oe.fit_transform(df[['difficulty']])

# flatten the (n, 1) array so it can be stored as a single integer column
df['oe_difficulty'] = oe_difficulty.ravel().astype('int')
df.oe_difficulty.value_counts()

In [None]:
oe_difficulty = pd.DataFrame(oe_difficulty, columns=['oe_difficulty']).astype('int')
oe_difficulty.value_counts()

In [None]:
df.head()

#### Prepare the Data

In [None]:
# import spacy

# # Load french language model
# sp = spacy.load('fr_core_news_sm')

# # Create tokenizer function
# def spacy_tokenizer(sentence):
#     # Create token object, which is used to create documents with linguistic annotations.
#     mytokens = sp(sentence)

#     # Lemmatize each token and convert each token into lowercase
#     mytokens = [ word.lemma_.lower().strip() for word in mytokens ]
#     ## alternative way
#     # mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

#     # Remove stop words and punctuation
#     mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

#     # Return preprocessed list of tokens
#     return mytokens

# # Example
# review = df["sentence"].sample()
# review.values

In [None]:
# tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer) # we use the above defined tokenizer

##### Give X and Y a value


In [None]:
#Give X and Y value
y = df.oe_difficulty
X = df.sentence
X.head()

##### 4.1.2 Train/test splitting: split the data into 80% training and 20% test set. Remember to set the random seed to 0.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
np.random.seed(0)

#### 4.2. Logistic Regression (without data cleaning)

Train a simple logistic regression model using a Tfidf vectoriser.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_text = vectorizer.fit_transform(X_train)
X_test_text = vectorizer.transform(X_test)
# vectorizer.get_feature_names_out()

print(X_train_text.shape)
print(X_test_text.shape)

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver='lbfgs', random_state=0)
log_reg.fit(X_train_text, y_train)
y_pred_logreg = log_reg.predict(X_test_text)

Calculate accuracy, precision, recall and F1 score on the test set.




##### Accuracy scores on Test set

In [None]:
test_acc_logreg = accuracy_score(y_test, y_pred_logreg)
print(f"TEST ACCURACY SCORE:\n{test_acc_logreg:.4f}")

In [None]:
y_pred_train_logreg = log_reg.predict(X_train_text)
train_acc_logreg = accuracy_score(y_train, y_pred_train_logreg)
print(f"TRAIN ACCURACY SCORE:\n{train_acc_logreg:.4f}")

##### Precision on test set

In [None]:
logreg_precision = precision_score(y_test, y_pred_logreg, average='micro')
print(f"Precision:\n{logreg_precision:.4f}")

##### Recall on test set

In [None]:
logreg_recall = recall_score(y_test, y_pred_logreg, average='micro')
print(f"Recall:\n{logreg_recall:.4f}")

##### F1 score on test set

In [None]:
logreg_f1_score = f1_score(y_test, y_pred_logreg, average='micro')
print(f"F1 score:\n{logreg_f1_score:.4f}")
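A caveat on `average='micro'`: in a single-label multiclass problem, micro-averaged precision, recall and F1 all equal the accuracy, so they add no new information. Macro averaging (the unweighted mean of the per-class scores) is a common alternative. The sketch below illustrates this on made-up class codes; the `*_toy` names are hypothetical.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up true and predicted class codes, for illustration only
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

acc = accuracy_score(y_true_toy, y_pred_toy)

# Micro averaging pools all decisions, which reduces to accuracy here
micro_precision = precision_score(y_true_toy, y_pred_toy, average="micro")
micro_recall = recall_score(y_true_toy, y_pred_toy, average="micro")
micro_f1 = f1_score(y_true_toy, y_pred_toy, average="micro")

# Macro averaging computes the metric per class, then takes the plain mean
macro_precision = precision_score(y_true_toy, y_pred_toy, average="macro")
macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro")

print(f"accuracy={acc:.4f} micro P/R/F1={micro_precision:.4f}/{micro_recall:.4f}/{micro_f1:.4f}")
print(f"macro precision={macro_precision:.4f} macro F1={macro_f1:.4f}")
```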

Have a look at the confusion matrix and identify a few examples of sentences that are not well classified.

In [None]:
sns.heatmap(pd.DataFrame(confusion_matrix(y_test, y_pred_logreg)), annot=True, 
            cmap='Oranges', fmt='.4g');
plt.xlabel('Predicted')
plt.ylabel('True')
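To pull out the badly classified sentences, one option is a boolean mask that compares true and predicted labels. The sketch below uses a made-up DataFrame; in the notebook the same idea would presumably apply to `X_test`, `y_test` and `y_pred_logreg`.

```python
import pandas as pd

# Made-up test sentences with true and predicted class codes (illustration only)
errors_df = pd.DataFrame({
    "sentence": [
        "Je suis ici.",
        "Il pleut.",
        "Nonobstant ces réserves, il acquiesça.",
        "Nous mangeons.",
    ],
    "true": [0, 0, 5, 1],
    "predicted": [0, 2, 2, 1],
})

# Keep only the rows where the prediction differs from the true label
misclassified = errors_df[errors_df["true"] != errors_df["predicted"]].copy()

# Distance between predicted and true level: larger values are worse mistakes
misclassified["error_size"] = (misclassified["predicted"] - misclassified["true"]).abs()
misclassified = misclassified.sort_values("error_size", ascending=False)
print(misclassified)
```

Sorting by the size of the error works here because the difficulty levels are ordinal, so predicting level 2 for a level 5 sentence is a bigger miss than an adjacent-level confusion.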

Generate your first predictions on the `unlabelled_test_data.csv`. Make sure your predictions match the format of `sample_submission.csv`.

In [None]:
df_pred

In [None]:
# vectorise the unlabelled sentences with the fitted vectoriser, then predict their (encoded) levels
y_new_pred = log_reg.predict(vectorizer.transform(df_pred.sentence))
y_new_pred
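The model predicts integer codes (0 to 5), while the competition labels run from A1 to C2, so the codes presumably need to be mapped back before submitting. This is a minimal sketch, assuming the submission file has an `id` and a `difficulty` column (an assumption; check `sample_submission.csv`). The `predicted_codes` array is made up and stands in for the output of `log_reg.predict(...)`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same category order as used when encoding the training labels
levels = ["A1", "A2", "B1", "B2", "C1", "C2"]
encoder = OrdinalEncoder(categories=[levels])
encoder.fit(np.array(levels, dtype=object).reshape(-1, 1))

# Made-up integer predictions standing in for the model output
predicted_codes = np.array([0, 3, 5])

# inverse_transform expects a 2D array and returns the original labels
predicted_levels = encoder.inverse_transform(predicted_codes.reshape(-1, 1)).ravel()

# Column names are assumed, not taken from the competition files
submission = pd.DataFrame({
    "id": range(len(predicted_levels)),
    "difficulty": predicted_levels,
})

# Without a path, to_csv returns the CSV text; pass a file path to write to disk
csv_text = submission.to_csv(index=False)
print(csv_text)
```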

#### 4.3. KNN (without data cleaning)

Train a KNN classification model using a Tfidf vectoriser. Show the accuracy, precision, recall and F1 score on the test set.

In [None]:
# your code here
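One possible starting point is a pipeline that chains the TF-IDF vectoriser and a `KNeighborsClassifier`. The sketch runs on a handful of made-up sentences with only two classes, so the scores mean nothing; in the notebook you would fit on `X_train`/`y_train` and evaluate on `X_test`/`y_test`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Made-up sentences: 0 = easy, 1 = hard (illustration only)
train_sentences = [
    "Je mange une pomme.",
    "Il a un chat noir.",
    "Nous allons au parc.",
    "Elle aime le chocolat.",
    "Nonobstant ses réticences, le comité entérina la proposition.",
    "L'essor fulgurant des technologies bouleverse nos paradigmes.",
    "Quoique controversée, cette hypothèse demeure plausible.",
    "Les répercussions économiques s'avèrent considérables.",
]
train_levels = [0, 0, 0, 0, 1, 1, 1, 1]
test_sentences = ["Tu as un chien.", "Cette conjecture demeure néanmoins invérifiable."]
test_levels = [0, 1]

# The pipeline fits the vectoriser on the training text only, then the classifier
knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
knn.fit(train_sentences, train_levels)

y_pred_knn = knn.predict(test_sentences)
knn_accuracy = accuracy_score(test_levels, y_pred_knn)
knn_f1 = f1_score(test_levels, y_pred_knn, average="macro", zero_division=0)
print(f"accuracy={knn_accuracy:.4f} macro F1={knn_f1:.4f}")
```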

Try to improve it by tuning the hyperparameters (`n_neighbors`, `p`, `weights`).

In [None]:
# your code here
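These hyperparameters can be tuned with a cross-validated grid search over the whole pipeline. The sketch below is one way to do it on made-up sentences; the grid values and `cv=3` are arbitrary choices for the toy data, and the `kneighborsclassifier__` prefix is the step name that `make_pipeline` generates.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Made-up sentences: 0 = easy, 1 = hard (illustration only)
sentences = [
    "Je mange une pomme.",
    "Il a un chat noir.",
    "Nous allons au parc.",
    "Elle aime le chocolat.",
    "Tu as un chien.",
    "Le soleil brille.",
    "Nonobstant ses réticences, le comité entérina la proposition.",
    "L'essor fulgurant des technologies bouleverse nos paradigmes.",
    "Quoique controversée, cette hypothèse demeure plausible.",
    "Les répercussions économiques s'avèrent considérables.",
    "Cette conjecture demeure néanmoins invérifiable.",
    "Il conviendrait d'appréhender ces enjeux avec circonspection.",
]
levels = [0] * 6 + [1] * 6

pipeline = make_pipeline(TfidfVectorizer(), KNeighborsClassifier())

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "kneighborsclassifier__n_neighbors": [1, 3, 5],
    "kneighborsclassifier__p": [1, 2],
    "kneighborsclassifier__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(sentences, levels)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
```

Searching over the pipeline rather than over a pre-vectorised matrix means the vectoriser is refitted inside each fold, which avoids leaking vocabulary statistics from the validation fold.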

#### 4.4. Decision Tree Classifier (without data cleaning)

Train a Decision Tree classifier, using a Tfidf vectoriser. Show the accuracy, precision, recall and F1 score on the test set.

In [None]:
# your code here

Try to improve it by tuning the hyper parameters (`max_depth`, the depth of the decision tree).

In [None]:
# your code here
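To see how `max_depth` affects overfitting, a simple approach is to loop over a few depths and compare train and test accuracy. This is a sketch on made-up sentences; the depth values are arbitrary, and in the notebook the notebook's own train/test split would replace the toy lists.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up sentences: 0 = easy, 1 = hard (illustration only)
train_sentences = [
    "Je mange une pomme.",
    "Il a un chat noir.",
    "Nous allons au parc.",
    "Elle aime le chocolat.",
    "Nonobstant ses réticences, le comité entérina la proposition.",
    "L'essor fulgurant des technologies bouleverse nos paradigmes.",
    "Quoique controversée, cette hypothèse demeure plausible.",
    "Les répercussions économiques s'avèrent considérables.",
]
train_levels = [0, 0, 0, 0, 1, 1, 1, 1]
test_sentences = ["Tu as un chien.", "Cette conjecture demeure néanmoins invérifiable."]
test_levels = [0, 1]

# Map each depth to (train accuracy, test accuracy); None means no depth limit
depth_scores = {}
for depth in [1, 2, 5, None]:
    tree = make_pipeline(
        TfidfVectorizer(),
        DecisionTreeClassifier(max_depth=depth, random_state=0),
    )
    tree.fit(train_sentences, train_levels)
    depth_scores[depth] = (
        tree.score(train_sentences, train_levels),
        tree.score(test_sentences, test_levels),
    )

for depth, (train_acc, test_acc) in depth_scores.items():
    print(f"max_depth={depth}: train={train_acc:.4f} test={test_acc:.4f}")
```

A train accuracy that keeps rising while the test accuracy stalls or drops is the usual sign that the tree is too deep.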

#### 4.5. Random Forest Classifier (without data cleaning)

Try a Random Forest Classifier, using a Tfidf vectoriser. Show the accuracy, precision, recall and F1 score on the test set.

In [None]:
# your code here

#### 4.6. Any other technique, including data cleaning if necessary

Try to improve accuracy by training a better model using the techniques seen in class, or combinations of them.

As usual, show the accuracy, precision, recall and F1 score on the test set.

In [None]:
# your code here

#### 4.7. Show a summary of your results
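One way to build the summary is to compute the same metrics for each model and collect them in a DataFrame. In this sketch the helper `summarise`, the model names and the predictions are all hypothetical; in the notebook each model's test-set predictions would be passed in instead. Macro averaging is used here as one possible choice.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def summarise(model_name, y_true, y_pred):
    """Return one row of the summary table (macro-averaged metrics)."""
    return {
        "model": model_name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }


# Made-up labels and predictions, for illustration only
y_true_toy = [0, 1, 2, 0, 1, 2]
toy_predictions = {
    "model_a": [0, 1, 2, 0, 1, 1],
    "model_b": [0, 0, 2, 0, 1, 2],
}

rows = [summarise(name, y_true_toy, preds) for name, preds in toy_predictions.items()]
summary = pd.DataFrame(rows).set_index("model")
print(summary.round(4))
```

In a notebook, `summary.to_markdown()` (which requires the `tabulate` package) can produce a table to paste into the readme.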