# Follow these instructions:

Once you are finished, complete the following steps.

1.  Restart your kernel by clicking 'Kernel' > 'Restart & Run All'.

2.  Fix any errors which result from this.

3.  Repeat steps 1. and 2. until your notebook runs without errors.

4.  Submit your completed notebook to OWL by the deadline.

## Assignment Week 8: Text Mining using Dimensionality Reduction Methods [_/100 Marks]

This dataset comes from the #TidyTuesday repository and represents 2122 TV shows. In this assignment, we will apply dimensionality reduction methods to improve our understanding of text data and to predict whether a TV show has one season or more than one season.

In [None]:
import numpy as np
import pandas as pd
# !pip install umap-learn
import umap
from scipy.sparse import csr_matrix
from sklearn.decomposition import PCA, TruncatedSVD
import sklearn.feature_extraction.text as sktext
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score, auc
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from itertools import product

import seaborn as sns
import matplotlib.pyplot as plt
seed = 0

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yasin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Task 1: Decomposition of the texts [ /66 marks]

### Question 1.1

The dataset comes with a few text variables and a categorical variable that represents whether a TV show has one season or more than one season. Import the data and create a new variable called 'full_description' by combining the three columns title, listed_in, and description. Keep the two columns 'duration' and 'full_description', and remove the rest. Binary-encode the target variable 'duration': assign 1 to 'More than one season' and 0 to 'One season'. In the 'full_description' column, replace the word "Sci-Fi" with "Sci_Fi", since we want to treat "Sci-Fi" as a single word. Select the full_description column and display its first 10 rows.

Use sklearn's `TfidfVectorizer` to eliminate accents, special characters, and stopwords (see below for the stopwords that need to be removed). In addition, eliminate words that appear in fewer than 5% of documents and those that appear in more than 95%. You can also set `sublinear_tf` to `True`.

After that, split the data into train and test sets with `test_size = 0.2` and `random_state = seed`. Calculate the [Tf-Idf transform](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) for both train and test. Note that you need to fit and transform the inputs for the train set, but you only need to transform the inputs for the test set. Don't forget to convert the sparse matrices to dense ones after you apply the `Tf-Idf` transform.
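
As an illustration of the fit-on-train, transform-on-test pattern described above, here is a minimal sketch on a toy corpus. The documents, labels, and the short stop word list are made up for the example (the assignment asks for the English stop words plus 'shows' and 'tv'), and the variable names are assumptions rather than required names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the 'full_description' column.
docs = [
    "Crime drama about a detective in the city",
    "Sci_Fi series set on a distant planet",
    "Comedy about a family in the city",
    "Documentary on a remote planet expedition",
    "Detective comedy with a family twist",
    "Drama about a planet-wide crime wave",
]
labels = [1, 0, 1, 0, 1, 0]  # made-up binary targets

docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=0
)

vectorizer = TfidfVectorizer(
    strip_accents="unicode",  # remove accents
    stop_words=["the", "in", "on", "about", "with"],  # illustrative list only
    min_df=0.05,  # drop terms found in fewer than 5% of documents
    max_df=0.95,  # drop terms found in more than 95% of documents
    sublinear_tf=True,
)

# Fit the vocabulary and IDF weights on the training texts only,
# then reuse them unchanged for the test texts.
tfidf_train = vectorizer.fit_transform(docs_train).toarray()
tfidf_test = vectorizer.transform(docs_test).toarray()

print(tfidf_train.shape, tfidf_test.shape)
```

Because the default token pattern treats the underscore as a word character, "Sci_Fi" survives as the single token `sci_fi`, whereas a hyphenated "Sci-Fi" would be split in two.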

In [None]:
# Load the data [ /1 mark]

# Create a new variable called 'full_description' by combining the three columns title, listed_in, and description [ /2 marks]
# use an empty space to concat these three columns

# Keep two columns,'duration' and 'full_description', and remove the rest

# Do binary encoding for the target variable 'duration'. Assign 1 to 'More than one season' and 0 to 'One season' [ /1 mark]

# In the "full_description" column replace the word "Sci-Fi" with the word "Sci_Fi" [ /1 mark]

# Select the following column and display its first 10 rows: full_description [ /1 mark]


In [None]:
# Defining the TfIDFTransformer [ /4 marks]
# Define a list of stop words: it must contain the 'english' stop words, 'shows', and 'tv'


# Train/test split [ /2 marks]

# Calculate the Tf-Idf transform [ /2 marks]


From here on, you will use the variables `TfIDF_train` and `TfIDF_test` as the input for the different tasks, and the `y_train` and `y_test` labels for each dataset (if required). Print the number of indices in the output using the [`TfIDFTransformer.get_feature_names()` method](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html). Note that this method was removed in scikit-learn 1.2 in favour of `get_feature_names_out()`.

In [None]:
# Print the number of indices [ /2 marks]


### Question 1.2
Now that we have the TfIDF matrix, we can start working on the data. We hope to explore some commonly occurring concepts in the 'full_description' column. We can do this using PCA. A PCA transform of the TF-IDF matrix will give us a basis of the text data, each component representing a *concept* or set of words that are correlated. Correlation in text can be interpreted as a relation to a similar topic. Calculate a PCA transform of the training data using the **maximum** number of concepts possible. Make a plot of the explained variance that shows the cumulative explained variance per number of concepts.
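
The cumulative explained variance mentioned above can be sketched as follows. The matrix here is random synthetic data, not the TF-IDF matrix, and the 80% threshold is used only as an example of reading a value off the cumulative curve.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data whose columns have decreasing spread (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) * np.linspace(3.0, 0.5, 10)

# The number of components can be at most min(n_samples, n_features).
pca = PCA(n_components=min(X.shape))
pca.fit(X)

# Running total of the per-component explained variance ratios.
cum_var = np.cumsum(pca.explained_variance_ratio_)

# First position where the running total reaches 80%, converted from a
# zero-based index to a component count.
n_80 = int(np.argmax(cum_var >= 0.80)) + 1
print(n_80, cum_var[n_80 - 1])
```

Plotting `cum_var` against `range(1, len(cum_var) + 1)` with `plt.plot` gives the curve the question asks for.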

In [None]:
# Apply PCA on training data and get the explained variance [ / 4 marks]

# Plotting explained variance with number of concepts [ / 4 marks]


**Question:** Exactly how many concepts do we need to explain at least 80% of the variance in the data?


In [None]:
# To get the exact index where the variance is above 80% [ / 4 marks]


**Your Answer:** 

### Question 1.3

Let's examine the first three concepts by looking at how much variance they explain and showing the 10 words that are the most important in each of these three concepts (as revealed by the absolute value of the PCA weight in each concept).
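
One way to rank words by the absolute value of their PCA weight is sketched below. The helper `top_words`, the six-word vocabulary, and the random matrix are all hypothetical and only illustrate the ranking step.

```python
import numpy as np
from sklearn.decomposition import PCA


def top_words(component, feature_names, k=10):
    """Return the k feature names with the largest absolute weight."""
    order = np.argsort(np.abs(component))[::-1][:k]
    return [str(feature_names[i]) for i in order]


# Made-up vocabulary and data, standing in for the TF-IDF features.
feature_names = np.array(["crime", "drama", "comedy", "family", "planet", "space"])
rng = np.random.default_rng(0)
X = rng.random((50, 6))

pca = PCA(n_components=3).fit(X)

# Each row of components_ holds one concept's weights over the vocabulary.
for i, (component, ratio) in enumerate(
    zip(pca.components_, pca.explained_variance_ratio_), start=1
):
    print(f"Concept {i}: {ratio:.1%} of variance,", top_words(component, feature_names, k=3))
```

Sorting on the absolute value matters because a large negative weight is as informative about a concept as a large positive one.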


In [None]:
# Explained variance [ / 2 marks]


In [None]:
# Get 10 most important words for each component [ / 4 marks]


In [None]:
# Words for concept 1 [ / 2 marks]


In [None]:
# Words for concept 2 [ / 2 marks]


In [None]:
# Words for concept 3 [ / 2 marks]


### Question 1.4

 Apply the PCA transformation to the test dataset. Use only the first two components and make a scatter plot of the TV shows. Identify the 'More than one season' TV shows, and 'One season' TV shows by colouring points with different colours. Make sure to add a legend to your plot!

In [None]:
# Apply PCA to the test dataset [ / 2 marks]


# Plot the two different sets of points with different colours and labels [ /4 marks]



**Question:** What can we say about where 'More than one season' TV shows and 'One season' TV shows lie in our plot? Could we use these concepts to discriminate these cases? If yes, why? If no, why not? Discuss your findings. [ /2 marks]
 
**Your answer:** 

### Question 1.5

Repeat the process above, only now using a UMAP projection with two components. Test all combinations of ```n_neighbors=[2, 10, 25, 35, 45]``` and ```min_dist=[0.1, 0.25, 0.5, 1]``` over the train data and choose the projection that you think is best, and apply it over the test data. Use 1000 epochs, a cosine metric and random initialization. If you have more than 8GB of RAM (as in Colab), you may want to set ```low_memory=False``` to speed up computations.

*Hint: [This link](https://stackoverflow.com/questions/16384109/iterate-over-all-combinations-of-values-in-multiple-lists-in-python) may be helpful.*
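
The parameter grid above can be enumerated with `itertools.product`, as in this minimal sketch. The UMAP call is left as a comment so the snippet runs without `umap-learn` installed. The keyword arguments shown there are one possible mapping of the settings in the question, and `TfIDF_train` and `seed` are assumed to exist in the notebook.

```python
from itertools import product

n_neighbors_grid = [2, 10, 25, 35, 45]
min_dist_grid = [0.1, 0.25, 0.5, 1]

# Every (n_neighbors, min_dist) pair: 5 x 4 = 20 combinations.
combos = list(product(n_neighbors_grid, min_dist_grid))
print(len(combos))

for n_neighbors, min_dist in combos:
    # Inside the loop, a projection could be fitted along these lines:
    # reducer = umap.UMAP(n_components=2, n_neighbors=n_neighbors,
    #                     min_dist=min_dist, n_epochs=1000, metric="cosine",
    #                     init="random", low_memory=False, random_state=seed)
    # embedding = reducer.fit_transform(TfIDF_train)
    pass
```

A grid of 5 by 4 subplots, one per combination, is a convenient way to compare the projections side by side.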



In [None]:
# Set parameters

# Create UMAP and plots [ / 8 marks]


    # Create plot



**Question:** Which parameters would you choose? [ / 2 marks]

**Your Answer:** 

In [None]:
# Choose the parameters that you think are best and apply to test set [ / 4 marks]


# Create plot [ /2 marks]



**Question:** How does the plot compare to the PCA one? [ /2 marks]

**Your answer:** 

## Task 2: Benchmarking predictive capabilities of the compressed data [ / 34 marks]

For this task, we will benchmark the predictive capabilities of the compressed data against the original one. 



### Question 2.1 
Train a regularized logistic regression over the original TfIDF train set using l2 regularization. Calculate the AUC score and plot the ROC curve for the original test set.
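
The train, predict, and score sequence described above might look like the following sketch on synthetic data. The dataset comes from `make_classification`, and the settings `Cs=10`, `cv=5`, and `scoring="roc_auc"` are illustrative assumptions, not values prescribed by the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustration only).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# LogisticRegressionCV uses an L2 penalty by default and picks the
# regularization strength C by cross-validation.
model = LogisticRegressionCV(Cs=10, cv=5, scoring="roc_auc", max_iter=1000)
model.fit(X_train, y_train)

# Probability of the positive class for each test sample.
proba = model.predict_proba(X_test)[:, 1]

# Points of the ROC curve and the area under it.
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc_score = round(roc_auc_score(y_test, proba), 3)
print(auc_score)
```

The curve itself can then be drawn with `plt.plot(fpr, tpr)`, with the diagonal from (0, 0) to (1, 1) as the reference for a random classifier.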

In [None]:
# Train and test using model LogisticRegressionCV [ /3 marks]

# Define the model


# Fit on the training dataset


# Apply to the test dataset


# Plot ROC curve and compute AUC score [ /2 marks]
# Calculate the ROC curve points


# Save the AUC in a variable to display it. Round it first

# Create and show the plot


### Question 2.2 
Train a regularized logistic regression over an SVD-reduced dataset (with 13 components) using l2 regularization. Calculate the AUC score and plot the ROC curve for the SVD-transformed test set.
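
A minimal sketch of the SVD reduction step, assuming random matrices in place of the TF-IDF train and test sets. As with the vectorizer, the decomposition is fitted on the training data only and then applied to the test data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Random non-negative matrices standing in for the TF-IDF data.
rng = np.random.default_rng(0)
X_train = rng.random((100, 40))
X_test = rng.random((25, 40))

svd = TruncatedSVD(n_components=13, random_state=0)

# Learn the 13 directions from the training set, then project both sets.
svd_train = svd.fit_transform(X_train)
svd_test = svd.transform(X_test)

print(svd_train.shape, svd_test.shape)
```

Unlike `PCA`, `TruncatedSVD` does not centre the data, which is why it also accepts sparse input directly.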

In [None]:
# Apply SVD first [ / 4 marks]

#Train and test using model LogisticRegressionCV [ /3 marks]


# Plot ROC curve and compute AUC score [ /2 marks]
# Calculate the ROC curve points

# Save the AUC in a variable to display it. Round it first

# Create and show the plot



### Question 2.3 
Train a regularized logistic regression over the UMAP-reduced dataset (with 13 components, using the same parameters as in Question 1.5) using l2 regularization. Calculate the AUC score and plot the ROC curve for the UMAP-transformed test set.

In [None]:
# Apply UMAP first [ / 3 marks]


#Train and test using model LogisticRegressionCV [ /4 marks]


# Plot ROC curve and compute AUC score [ /2 marks]
# Calculate the ROC curve points


# Save the AUC in a variable to display it. Round it first

# Create and show the plot


### Question 2.4
Train an XGBoost model over the SVD-reduced dataset prepared in Question 2.2. Calculate the AUC score and plot the ROC curve for the SVD-transformed test set. In your model, set ``num_boost_round=10`` and ``early_stopping_rounds=2``. You need to do cross-validation (CV) using the training dataset, and then get the best iteration based on the cross-validation results. Finally, train the model on the full training dataset with the best number of iterations.

In [None]:
# Define XGBoost parameters
params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "max_depth": 3,
}

# Define cross-validation object

# Perform cross-validation with XGBoost [ /3 marks]


# Get best iteration based on cross-validation results [ /1 mark]

# Train final model on full dataset with best number of iterations [ /2 marks]


# Compute predicted probabilities on the test set [ /1 mark]

# Plot ROC curve and compute AUC score [ /2 marks]
# Calculate the ROC curve points


# Save the AUC in a variable to display it. Round it first

# Create and show the plot


### Question 2.5
Compare the performance of the four models. Which one is the best? [ / 2 marks]

**Your Answer:** 